superlab 0.1.22 → 0.1.24
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +10 -1
- package/README.zh-CN.md +10 -1
- package/lib/auto_runner.cjs +30 -0
- package/lib/auto_state.cjs +30 -0
- package/lib/context.cjs +282 -1
- package/lib/eval_protocol.cjs +75 -0
- package/lib/i18n.cjs +90 -11
- package/package-assets/claude/commands/lab.md +10 -2
- package/package-assets/codex/prompts/lab.md +10 -2
- package/package-assets/shared/lab/.managed/scripts/validate_collaborator_report.py +53 -0
- package/package-assets/shared/lab/.managed/templates/final-report.md +24 -8
- package/package-assets/shared/lab/.managed/templates/review-checklist.md +4 -0
- package/package-assets/shared/lab/context/auto-mode.md +7 -1
- package/package-assets/shared/lab/context/auto-outcome.md +15 -0
- package/package-assets/shared/lab/context/eval-protocol.md +21 -0
- package/package-assets/shared/skills/lab/SKILL.md +1 -0
- package/package-assets/shared/skills/lab/stages/auto.md +19 -1
- package/package-assets/shared/skills/lab/stages/iterate.md +4 -0
- package/package-assets/shared/skills/lab/stages/report.md +7 -0
- package/package-assets/shared/skills/lab/stages/review.md +4 -0
- package/package-assets/shared/skills/lab/stages/run.md +4 -0
- package/package.json +1 -1
package/README.md
CHANGED
@@ -176,6 +176,7 @@ superlab auto stop
 - `L1` is a safe run envelope for `run`, `review`, and `report`
 - `L2` is the default bounded iteration envelope for `run`, `iterate`, `review`, and `report`
 - `L3` is an aggressive campaign envelope that may also include `write`
+- If you are unsure, choose `L2`
 
 - `run` and `iterate` must change persistent outputs under `results_root`
 - `review` must update canonical review context
@@ -189,10 +190,18 @@ It does not replace manual `idea`, `data`, `framing`, or `spec` decisions.
 
 Good `/lab:auto` input is explicit. Treat `Autonomy level L1/L2/L3` as execution privilege, and treat `paper layer`, `phase`, or `table` as experiment targets. If the workflow language is Chinese, summaries, checklist items, task labels, and progress updates should also stay in Chinese unless a literal identifier must remain unchanged.
 
+Level Guide for `/lab:auto`:
+
+- `L1`: use this when you want safe validation, one bounded real run, or a simple report refresh
+- `L2`: use this for normal bounded experiment iteration inside a frozen core
+- `L3`: use this only when you want a broader campaign with a larger search space and optional writing
+- If the request omits the level or mixes it with a paper layer, phase, or table target, stop and ask for an explicit level before starting
+- If you are unsure, choose `L2`
+
 Example:
 
 ```text
-/lab:auto Autonomy level L2. Objective: advance paper layer 3
+/lab:auto Autonomy level L2. Objective: advance paper layer 3 through one bounded protocol improvement. Terminal goal: task-completion. Scope: bounded protocol, tests, one minimal implementation, and one small run. Allowed modifications: configuration, evaluation script, and data-loading logic only.
 ```
 
 ## Version
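The README's "stop and ask for an explicit level" rule can be sketched as a small validation helper. This is an illustration only: `parseAutonomyLevel` and its return shape are assumptions, not part of the superlab CLI or API.

```javascript
// Illustrative sketch: extract an explicit autonomy level (L1/L2/L3) from a
// /lab:auto request, and refuse to guess when no level is stated so the
// caller can ask the user instead of silently defaulting.
function parseAutonomyLevel(request) {
  const match = request.match(/\bAutonomy level (L[123])\b/i);
  if (!match) {
    return { ok: false, reason: "no explicit level; ask the user to choose L1/L2/L3" };
  }
  return { ok: true, level: match[1].toUpperCase() };
}
```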
package/README.zh-CN.md
CHANGED
@@ -174,6 +174,7 @@ superlab auto stop
 - `L1` 是安全运行级别,只允许 `run`、`review`、`report`
 - `L2` 是默认推荐级别,允许 `run`、`iterate`、`review`、`report`
 - `L3` 是激进 campaign 级别,才允许额外编排 `write`
+- 如果不确定,默认推荐 `L2`
 
 - `run` 和 `iterate` 必须更新 `results_root` 下的持久输出
 - `review` 必须更新规范的审查上下文
@@ -187,10 +188,18 @@ superlab auto stop
 
 好的 `/lab:auto` 输入应该显式写清。把 `Autonomy level L1/L2/L3` 当成执行权限级别,把 `paper layer`、`phase`、`table` 当成实验目标,不要混用。如果 workflow language 是中文,摘要、清单条目、任务标签和进度更新也应保持中文,除非某个字面标识符必须保持原样。
 
+`/lab:auto` 层级指南:
+
+- `L1`:适合安全验证、一轮有边界真实运行,或简单的 report 刷新
+- `L2`:适合冻结核心边界内的常规实验迭代,也是默认推荐级别
+- `L3`:只在你明确想做更大范围 campaign、允许更广探索和可选写作时使用
+- 如果用户输入没写级别,或者把级别和 `paper layer`、`phase`、`table` 混用了,就应先停下来,给出更详细的层级说明,再要求用户明确选 `L1/L2/L3`
+- 如果不确定,默认推荐 `L2`
+
 示例:
 
 ```text
-/lab:auto 自治级别 L2。目标:推进 paper layer 3
+/lab:auto 自治级别 L2。目标:推进 paper layer 3 的一项有边界协议改进。终止条件:完成 bounded protocol、测试、一项最小实现和一轮小规模结果。允许修改:配置、评估脚本、数据加载逻辑。
 ```
 
 ## 版本查询
package/lib/auto_runner.cjs
CHANGED
@@ -278,6 +278,21 @@ async function startAutoMode({ targetDir, now = new Date() }) {
     comparisonSourcePapers: evalProtocol.comparisonSourcePapers,
     comparisonImplementationSource: evalProtocol.comparisonImplementationSource,
     deviationFromOriginalImplementation: evalProtocol.deviationFromOriginalImplementation,
+    evaluationSettingSemantics: evalProtocol.evaluationSettingSemantics,
+    visibilityAndLeakageRisks: evalProtocol.visibilityAndLeakageRisks,
+    anchorAndLabelPolicy: evalProtocol.anchorAndLabelPolicy,
+    scaleAndComparabilityPolicy: evalProtocol.scaleAndComparabilityPolicy,
+    metricValidityChecks: evalProtocol.metricValidityChecks,
+    comparisonValidityChecks: evalProtocol.comparisonValidityChecks,
+    statisticalValidityChecks: evalProtocol.statisticalValidityChecks,
+    claimBoundary: evalProtocol.claimBoundary,
+    integritySelfCheck: evalProtocol.integritySelfCheck,
+    anomalySignals: evalProtocol.anomalySignals,
+    implementationRealityChecks: evalProtocol.implementationRealityChecks,
+    alternativeExplanationsConsidered: evalProtocol.alternativeExplanationsConsidered,
+    crossCheckMethod: evalProtocol.crossCheckMethod,
+    bestSupportedInterpretation: evalProtocol.bestSupportedInterpretation,
+    escalationThreshold: evalProtocol.escalationThreshold,
   };
 
   const writeRunningStatus = (overrides = {}) => {
@@ -768,6 +783,21 @@ function stopAutoMode({ targetDir, now = new Date() }) {
     comparisonSourcePapers: evalProtocol.comparisonSourcePapers,
     comparisonImplementationSource: evalProtocol.comparisonImplementationSource,
     deviationFromOriginalImplementation: evalProtocol.deviationFromOriginalImplementation,
+    evaluationSettingSemantics: evalProtocol.evaluationSettingSemantics,
+    visibilityAndLeakageRisks: evalProtocol.visibilityAndLeakageRisks,
+    anchorAndLabelPolicy: evalProtocol.anchorAndLabelPolicy,
+    scaleAndComparabilityPolicy: evalProtocol.scaleAndComparabilityPolicy,
+    metricValidityChecks: evalProtocol.metricValidityChecks,
+    comparisonValidityChecks: evalProtocol.comparisonValidityChecks,
+    statisticalValidityChecks: evalProtocol.statisticalValidityChecks,
+    claimBoundary: evalProtocol.claimBoundary,
+    integritySelfCheck: evalProtocol.integritySelfCheck,
+    anomalySignals: evalProtocol.anomalySignals,
+    implementationRealityChecks: evalProtocol.implementationRealityChecks,
+    alternativeExplanationsConsidered: evalProtocol.alternativeExplanationsConsidered,
+    crossCheckMethod: evalProtocol.crossCheckMethod,
+    bestSupportedInterpretation: evalProtocol.bestSupportedInterpretation,
+    escalationThreshold: evalProtocol.escalationThreshold,
   };
   const status = {
     ...existing,
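Both hunks in auto_runner.cjs copy the same fifteen `evalProtocol` fields into a status payload by hand. One conventional way to keep such duplicated field lists in sync is to pick the keys from a single shared array; the sketch below is an illustration of that pattern under assumed names (`EVAL_PROTOCOL_STATUS_KEYS`, `pickEvalProtocolFields`), not superlab's actual implementation.

```javascript
// Sketch: declare the shared keys once, then project them out of the
// evalProtocol object wherever a status payload is built.
const EVAL_PROTOCOL_STATUS_KEYS = [
  "evaluationSettingSemantics",
  "visibilityAndLeakageRisks",
  "claimBoundary",
  "escalationThreshold",
  // ...remaining shared keys elided for brevity
];

function pickEvalProtocolFields(evalProtocol) {
  // Object.fromEntries rebuilds an object from [key, value] pairs,
  // so unknown extra properties on evalProtocol are dropped.
  return Object.fromEntries(
    EVAL_PROTOCOL_STATUS_KEYS.map((key) => [key, evalProtocol[key]])
  );
}
```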
package/lib/auto_state.cjs
CHANGED
@@ -154,6 +154,21 @@ function renderAutoOutcome(outcome, { lang = "en" } = {}) {
 - 对比方法来源论文: ${outcome.comparisonSourcePapers || ""}
 - 对比方法实现来源: ${outcome.comparisonImplementationSource || ""}
 - 与原始实现的偏差: ${outcome.deviationFromOriginalImplementation || ""}
+- 评测设定语义: ${outcome.evaluationSettingSemantics || ""}
+- 可见性与泄漏风险: ${outcome.visibilityAndLeakageRisks || ""}
+- 锚点与标签策略: ${outcome.anchorAndLabelPolicy || ""}
+- 尺度与可比性策略: ${outcome.scaleAndComparabilityPolicy || ""}
+- 指标有效性检查: ${outcome.metricValidityChecks || ""}
+- 对比有效性检查: ${outcome.comparisonValidityChecks || ""}
+- 统计有效性检查: ${outcome.statisticalValidityChecks || ""}
+- 结论边界: ${outcome.claimBoundary || ""}
+- 完整性自检: ${outcome.integritySelfCheck || ""}
+- 异常信号: ${outcome.anomalySignals || ""}
+- 实现层现实检查: ${outcome.implementationRealityChecks || ""}
+- 已考虑的替代解释: ${outcome.alternativeExplanationsConsidered || ""}
+- 交叉验证方法: ${outcome.crossCheckMethod || ""}
+- 当前最站得住的解释: ${outcome.bestSupportedInterpretation || ""}
+- 升级阈值: ${outcome.escalationThreshold || ""}
 - 终止目标类型: ${outcome.terminalGoalType || ""}
 - 终止目标目标值: ${outcome.terminalGoalTarget || ""}
 - 必要终止工件: ${outcome.requiredTerminalArtifact || ""}
@@ -191,6 +206,21 @@ function renderAutoOutcome(outcome, { lang = "en" } = {}) {
 - Comparison source papers: ${outcome.comparisonSourcePapers || ""}
 - Comparison implementation source: ${outcome.comparisonImplementationSource || ""}
 - Deviation from original implementation: ${outcome.deviationFromOriginalImplementation || ""}
+- Evaluation setting semantics: ${outcome.evaluationSettingSemantics || ""}
+- Visibility and leakage risks: ${outcome.visibilityAndLeakageRisks || ""}
+- Anchor and label policy: ${outcome.anchorAndLabelPolicy || ""}
+- Scale and comparability policy: ${outcome.scaleAndComparabilityPolicy || ""}
+- Metric validity checks: ${outcome.metricValidityChecks || ""}
+- Comparison validity checks: ${outcome.comparisonValidityChecks || ""}
+- Statistical validity checks: ${outcome.statisticalValidityChecks || ""}
+- Claim boundary: ${outcome.claimBoundary || ""}
+- Integrity self-check: ${outcome.integritySelfCheck || ""}
+- Anomaly signals: ${outcome.anomalySignals || ""}
+- Implementation reality checks: ${outcome.implementationRealityChecks || ""}
+- Alternative explanations considered: ${outcome.alternativeExplanationsConsidered || ""}
+- Cross-check method: ${outcome.crossCheckMethod || ""}
+- Best-supported interpretation: ${outcome.bestSupportedInterpretation || ""}
+- Escalation threshold: ${outcome.escalationThreshold || ""}
 - Terminal goal type: ${outcome.terminalGoalType || ""}
 - Terminal goal target: ${outcome.terminalGoalTarget || ""}
 - Required terminal artifact: ${outcome.requiredTerminalArtifact || ""}
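Every line that renderAutoOutcome adds follows the same pattern: a fixed label plus a field that falls back to `|| ""`, so missing protocol entries render as blank bullets rather than `undefined`. A minimal standalone sketch of that pattern, with an assumed label map and function name (`LABELS`, `renderOutcomeLines`) rather than superlab's real internals:

```javascript
// Sketch of the label-plus-fallback rendering pattern used in auto_state.cjs.
const LABELS = {
  anomalySignals: { en: "Anomaly signals", zh: "异常信号" },
  claimBoundary: { en: "Claim boundary", zh: "结论边界" },
};

function renderOutcomeLines(outcome, lang = "en") {
  // Missing fields become empty strings, matching the `${x || ""}` idiom.
  return Object.entries(LABELS).map(
    ([key, label]) => `- ${label[lang]}: ${outcome[key] || ""}`
  );
}
```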
package/lib/context.cjs
CHANGED
@@ -27,6 +27,24 @@ const EVAL_COLLABORATOR_FIELDS = [
     labels: ["Method and baseline implementation source", "方法与基线实现来源"],
   },
   { name: "Metric source papers", labels: ["Metric source papers", "指标来源论文"] },
+  { name: "Evaluation setting semantics", labels: ["Evaluation setting semantics", "评测设定语义"] },
+  { name: "Visibility and leakage risks", labels: ["Visibility and leakage risks", "可见性与泄漏风险"] },
+  { name: "Anchor and label policy", labels: ["Anchor and label policy", "锚点与标签策略"] },
+  { name: "Scale and comparability policy", labels: ["Scale and comparability policy", "尺度与可比性策略"] },
+  { name: "Metric validity checks", labels: ["Metric validity checks", "指标有效性检查"] },
+  { name: "Comparison validity checks", labels: ["Comparison validity checks", "对比方法有效性检查"] },
+  { name: "Statistical validity checks", labels: ["Statistical validity checks", "统计有效性检查"] },
+  { name: "Claim boundary", labels: ["Claim boundary", "结论边界"] },
+  { name: "Integrity self-check", labels: ["Integrity self-check", "完整性自检"] },
+  { name: "Anomaly signals", labels: ["Anomaly signals", "异常信号"] },
+  { name: "Implementation reality checks", labels: ["Implementation reality checks", "实现层现实检查"] },
+  {
+    name: "Alternative explanations considered",
+    labels: ["Alternative explanations considered", "已考虑的替代解释"],
+  },
+  { name: "Cross-check method", labels: ["Cross-check method", "交叉验证方法"] },
+  { name: "Best-supported interpretation", labels: ["Best-supported interpretation", "当前最站得住的解释"] },
+  { name: "Escalation threshold", labels: ["Escalation threshold", "升级阈值"] },
   { name: "Required output artifacts", labels: ["Required output artifacts", "必要输出工件"] },
 ];
 const REPORT_REQUIRED_SECTIONS = [
@@ -44,7 +62,34 @@ const REPORT_REQUIRED_SECTIONS = [
     patterns: [/^##\s+Method and Baseline Sources\s*$/m, /^##\s+方法与基线来源\s*$/m],
   },
   { name: "Metric Sources", patterns: [/^##\s+Metric Sources\s*$/m, /^##\s+指标来源\s*$/m] },
+  {
+    name: "Sanity and Alternative Explanations",
+    patterns: [/^##\s+Sanity and Alternative Explanations\s*$/m, /^##\s+异常与替代解释\s*$/m],
+  },
+];
+const REPORT_SOURCE_SECTION_NAMES = new Set([
+  "Background Sources",
+  "Method and Baseline Sources",
+  "Metric Sources",
+]);
+const REPORT_SOURCE_PATH_PATTERNS = [
+  /\/Users\//,
+  /\/home\//,
+  /\/tmp\//,
+  /\/private\/tmp\//,
+  /\.lab\//,
+  /outputs\//,
+  /docs\/research\//,
+];
+const REPORT_SOURCE_CITATION_MARKERS = ["Citation:", "引用:"];
+const REPORT_SOURCE_ROLE_MARKERS = [
+  "What it established:",
+  "What it does:",
+  "What it measures:",
+  "做了什么:",
+  "衡量什么:",
 ];
+const REPORT_SOURCE_LIMITATION_MARKERS = ["Limitation", "局限"];
 const MAIN_TABLES_REQUIRED_SECTIONS = [
   { name: "Reader Summary", patterns: [/^##\s+Reader Summary\s*$/m, /^##\s+给用户看的总结\s*$/m] },
   { name: "Selected Metrics", patterns: [/^##\s+Selected Metrics\s*$/m, /^##\s+选定指标\s*$/m] },
@@ -82,6 +127,13 @@ const REPORT_FIELDS = {
   metricSourcePapers: ["Metric source papers", "指标来源论文"],
   metricImplementationSource: ["Metric implementation source", "指标实现来源"],
   metricDeviation: ["Deviation from original implementation", "与原始实现的偏差"],
+  claimBoundary: ["Claim boundary", "结论边界"],
+  anomalySignals: ["Anomaly signals observed", "观察到的异常信号"],
+  implementationChecks: ["Implementation checks performed", "做过的实现层检查"],
+  alternativeExplanations: ["Alternative explanations ruled out", "已排除的更简单解释"],
+  crossChecks: ["Cross-checks that strengthen the current interpretation", "支撑当前解释的交叉验证"],
+  bestSupportedInterpretation: ["Best-supported interpretation", "当前最站得住的解释"],
+  escalationThreshold: ["Escalation threshold if future anomalies appear", "未来异常出现时的升级阈值"],
   datasets: ["Datasets", "数据集"],
   baselines: ["Baselines", "基线"],
   metrics: ["Metrics", "指标"],
@@ -229,6 +281,48 @@ function missingRequiredSections(text, sections) {
     .map((section) => section.name);
 }
 
+function extractSectionBody(text, section) {
+  if (!text) {
+    return "";
+  }
+  for (const pattern of section.patterns) {
+    const match = pattern.exec(text);
+    if (!match) {
+      continue;
+    }
+    const start = match.index + match[0].length;
+    const nextHeading = text.slice(start).search(/^##\s+/m);
+    const end = nextHeading >= 0 ? start + nextHeading : text.length;
+    return text.slice(start, end).trim();
+  }
+  return "";
+}
+
+function sourceSectionIssues(reportText) {
+  const issues = [];
+  for (const section of REPORT_REQUIRED_SECTIONS) {
+    if (!REPORT_SOURCE_SECTION_NAMES.has(section.name)) {
+      continue;
+    }
+    const body = extractSectionBody(reportText, section);
+    if (!body) {
+      continue;
+    }
+    if (REPORT_SOURCE_PATH_PATTERNS.some((pattern) => pattern.test(body))) {
+      issues.push(`report.md section '${section.name}' must not rely on local file paths or internal provenance`);
+    }
+    if (!REPORT_SOURCE_CITATION_MARKERS.some((marker) => body.includes(marker))) {
+      issues.push(`report.md section '${section.name}' must include at least one citation anchor`);
+    }
+    const hasRole = REPORT_SOURCE_ROLE_MARKERS.some((marker) => body.includes(marker));
+    const hasLimitation = REPORT_SOURCE_LIMITATION_MARKERS.some((marker) => body.includes(marker));
+    if (!hasRole || !hasLimitation) {
+      issues.push(`report.md section '${section.name}' must explain what the anchor does and one limitation`);
+    }
+  }
+  return issues;
+}
+
 function collaboratorReportIssues(targetDir) {
   if (!hasCollaboratorFacingDeliverables(targetDir)) {
     return [];
@@ -236,10 +330,12 @@ function collaboratorReportIssues(targetDir) {
   const { reportPath, mainTablesPath } = getCollaboratorDeliverablePaths(targetDir);
   const issues = [];
   if (fs.existsSync(reportPath)) {
-    const
+    const reportText = readFileIfExists(reportPath);
+    const missing = missingRequiredSections(reportText, REPORT_REQUIRED_SECTIONS);
     if (missing.length > 0) {
       issues.push(`report.md is missing required collaborator-facing sections: ${missing.join(", ")}`);
     }
+    issues.push(...sourceSectionIssues(reportText));
   }
   if (fs.existsSync(mainTablesPath)) {
     const missing = missingRequiredSections(readFileIfExists(mainTablesPath), MAIN_TABLES_REQUIRED_SECTIONS);
@@ -397,6 +493,27 @@ function buildEvalProtocolText(lang, fields, rungs) {
 - 对比方法实现来源: ${fields.comparisonImplementationSource || "待补充"}
 - 与原始实现的偏差: ${fields.deviationFromOriginalImplementation || "待补充"}
 
+## 学术有效性检查
+
+- 评测设定语义: ${fields.evaluationSettingSemantics || "待补充"}
+- 可见性与泄漏风险: ${fields.visibilityAndLeakageRisks || "待补充"}
+- 锚点与标签策略: ${fields.anchorAndLabelPolicy || "待补充"}
+- 尺度与可比性策略: ${fields.scaleAndComparabilityPolicy || "待补充"}
+- 指标有效性检查: ${fields.metricValidityChecks || "待补充"}
+- 对比方法有效性检查: ${fields.comparisonValidityChecks || "待补充"}
+- 统计有效性检查: ${fields.statisticalValidityChecks || "待补充"}
+- 结论边界: ${fields.claimBoundary || "待补充"}
+- 完整性自检: ${fields.integritySelfCheck || "待补充"}
+
+## 异常与替代解释检查
+
+- 异常信号: ${fields.anomalySignals || "待补充"}
+- 实现层现实检查: ${fields.implementationRealityChecks || "待补充"}
+- 已考虑的替代解释: ${fields.alternativeExplanationsConsidered || "待补充"}
+- 交叉验证方法: ${fields.crossCheckMethod || "待补充"}
+- 当前最站得住的解释: ${fields.bestSupportedInterpretation || "待补充"}
+- 升级阈值: ${fields.escalationThreshold || "待补充"}
+
 ## Gate Ladder
 
 - 实验阶梯: ${fields.experimentLadder || "待补充"}
@@ -439,6 +556,27 @@ Use this file to define the paper-facing evaluation target, table plan, gates, a
 - Comparison implementation source: ${fields.comparisonImplementationSource || "TBD"}
 - Deviation from original implementation: ${fields.deviationFromOriginalImplementation || "TBD"}
 
+## Academic Validity Checks
+
+- Evaluation setting semantics: ${fields.evaluationSettingSemantics || "TBD"}
+- Visibility and leakage risks: ${fields.visibilityAndLeakageRisks || "TBD"}
+- Anchor and label policy: ${fields.anchorAndLabelPolicy || "TBD"}
+- Scale and comparability policy: ${fields.scaleAndComparabilityPolicy || "TBD"}
+- Metric validity checks: ${fields.metricValidityChecks || "TBD"}
+- Comparison validity checks: ${fields.comparisonValidityChecks || "TBD"}
+- Statistical validity checks: ${fields.statisticalValidityChecks || "TBD"}
+- Claim boundary: ${fields.claimBoundary || "TBD"}
+- Integrity self-check: ${fields.integritySelfCheck || "TBD"}
+
+## Sanity and Alternative-Explanation Checks
+
+- Anomaly signals: ${fields.anomalySignals || "TBD"}
+- Implementation reality checks: ${fields.implementationRealityChecks || "TBD"}
+- Alternative explanations considered: ${fields.alternativeExplanationsConsidered || "TBD"}
+- Cross-check method: ${fields.crossCheckMethod || "TBD"}
+- Best-supported interpretation: ${fields.bestSupportedInterpretation || "TBD"}
+- Escalation threshold: ${fields.escalationThreshold || "TBD"}
+
 ## Gate Ladder
 
 - Experiment ladder: ${fields.experimentLadder || "TBD"}
@@ -638,6 +776,74 @@ function hydrateEvalProtocol(targetDir) {
       protocol.metricImplementationSource,
       extractReportValue(reportText, "metricImplementationSource")
     ),
+    evaluationSettingSemantics: mergePreferred(
+      protocol.evaluationSettingSemantics,
+      extractReportValue(reportText, "setting"),
+      extractValue(missionText, ["Approved direction", "已批准方向"])
+    ),
+    visibilityAndLeakageRisks: mergePreferred(
+      protocol.visibilityAndLeakageRisks,
+      extractValue(dataDecisions, ["Remaining preprocessing or leakage risks", "剩余预处理或 leakage 风险"]),
+      "stay within the approved visible inputs and avoid unavailable information or leakage-prone artifacts"
+    ),
+    anchorAndLabelPolicy: mergePreferred(
+      protocol.anchorAndLabelPolicy,
+      "use only anchors and labels that are legitimate and visible inside the approved evaluation setting"
+    ),
+    scaleAndComparabilityPolicy: mergePreferred(
+      protocol.scaleAndComparabilityPolicy,
+      "compare only natively aligned scales or explicitly justified mappings recorded in the protocol"
+    ),
+    metricValidityChecks: mergePreferred(
+      protocol.metricValidityChecks,
+      "promote only metrics that directly support the active claim; keep health metrics as support checks"
+    ),
+    comparisonValidityChecks: mergePreferred(
+      protocol.comparisonValidityChecks,
+      "keep canonical, historical, recent, and closest-prior comparisons fair, source-backed, and justified"
+    ),
+    statisticalValidityChecks: mergePreferred(
+      protocol.statisticalValidityChecks,
+      "report minimum sample sizes, seed variance, and uncertainty before promotion"
+    ),
+    claimBoundary: mergePreferred(
+      protocol.claimBoundary,
+      extractValue(missionText, ["Approved direction", "已批准方向"])
+    ),
+    integritySelfCheck: mergePreferred(
+      protocol.integritySelfCheck,
+      "do not use unavailable information, do not promote health metrics into main claims, and do not treat workflow status as scientific evidence"
+    ),
+    anomalySignals: mergePreferred(
+      protocol.anomalySignals,
+      extractReportValue(reportText, "anomalySignals"),
+      "treat all-null outputs, suspiciously identical runs, no-op deltas, or impl/result mismatches as diagnostic triggers instead of findings"
+    ),
+    implementationRealityChecks: mergePreferred(
+      protocol.implementationRealityChecks,
+      extractReportValue(reportText, "implementationChecks"),
+      "inspect the concrete code path, parser or metric wiring, input chain, split logic, and output artifacts before interpreting an anomaly"
+    ),
+    alternativeExplanationsConsidered: mergePreferred(
+      protocol.alternativeExplanationsConsidered,
+      extractReportValue(reportText, "alternativeExplanations"),
+      "rule out simpler explanations such as parser bugs, stale checkpoints, split mistakes, leakage, or unfair comparisons before promoting a claim"
+    ),
+    crossCheckMethod: mergePreferred(
+      protocol.crossCheckMethod,
+      extractReportValue(reportText, "crossChecks"),
+      "use at least one independent cross-check such as direct artifact inspection, alternate parsing, tiny-slice rerun, or code-path verification"
+    ),
+    bestSupportedInterpretation: mergePreferred(
+      protocol.bestSupportedInterpretation,
+      extractReportValue(reportText, "bestSupportedInterpretation"),
+      "prefer the narrowest interpretation that still survives the anomaly checks and alternative explanations"
+    ),
+    escalationThreshold: mergePreferred(
+      protocol.escalationThreshold,
+      extractReportValue(reportText, "escalationThreshold"),
+      "switch to diagnostic mode or reviewer escalation when anomalies remain unresolved or risk increases across two rounds"
+    ),
     comparisonSourcePapers: mergePreferred(
       protocol.comparisonSourcePapers,
       extractReportValue(reportText, "baselineSourcePapers")
@@ -767,6 +973,21 @@ function renderSummary(lang, data) {
 - Comparison source papers: ${data.evalComparisonSourcePapers || "待补充"}
 - Comparison implementation source: ${data.evalComparisonImplementationSource || "待补充"}
 - Deviation from original implementation: ${data.evalDeviationFromOriginalImplementation || "待补充"}
+- Evaluation setting semantics: ${data.evalEvaluationSettingSemantics || "待补充"}
+- Visibility and leakage risks: ${data.evalVisibilityAndLeakageRisks || "待补充"}
+- Anchor and label policy: ${data.evalAnchorAndLabelPolicy || "待补充"}
+- Scale and comparability policy: ${data.evalScaleAndComparabilityPolicy || "待补充"}
+- Metric validity checks: ${data.evalMetricValidityChecks || "待补充"}
+- Comparison validity checks: ${data.evalComparisonValidityChecks || "待补充"}
+- Statistical validity checks: ${data.evalStatisticalValidityChecks || "待补充"}
+- Claim boundary: ${data.evalClaimBoundary || "待补充"}
+- Integrity self-check: ${data.evalIntegritySelfCheck || "待补充"}
+- Anomaly signals: ${data.evalAnomalySignals || "待补充"}
+- Implementation reality checks: ${data.evalImplementationRealityChecks || "待补充"}
+- Alternative explanations considered: ${data.evalAlternativeExplanationsConsidered || "待补充"}
+- Cross-check method: ${data.evalCrossCheckMethod || "待补充"}
+- Best-supported interpretation: ${data.evalBestSupportedInterpretation || "待补充"}
+- Escalation threshold: ${data.evalEscalationThreshold || "待补充"}
 - Experiment ladder: ${data.evalExperimentLadder || "待补充"}
 - Benchmark ladder: ${data.evalBenchmarkLadder || "待补充"}
 - Promotion gate: ${data.evalPromotionGate || "待补充"}
@@ -830,6 +1051,21 @@ function renderSummary(lang, data) {
 - Comparison source papers: ${data.evalComparisonSourcePapers || "TBD"}
 - Comparison implementation source: ${data.evalComparisonImplementationSource || "TBD"}
 - Deviation from original implementation: ${data.evalDeviationFromOriginalImplementation || "TBD"}
+- Evaluation setting semantics: ${data.evalEvaluationSettingSemantics || "TBD"}
+- Visibility and leakage risks: ${data.evalVisibilityAndLeakageRisks || "TBD"}
+- Anchor and label policy: ${data.evalAnchorAndLabelPolicy || "TBD"}
+- Scale and comparability policy: ${data.evalScaleAndComparabilityPolicy || "TBD"}
+- Metric validity checks: ${data.evalMetricValidityChecks || "TBD"}
+- Comparison validity checks: ${data.evalComparisonValidityChecks || "TBD"}
+- Statistical validity checks: ${data.evalStatisticalValidityChecks || "TBD"}
+- Claim boundary: ${data.evalClaimBoundary || "TBD"}
+- Integrity self-check: ${data.evalIntegritySelfCheck || "TBD"}
+- Anomaly signals: ${data.evalAnomalySignals || "TBD"}
+- Implementation reality checks: ${data.evalImplementationRealityChecks || "TBD"}
+- Alternative explanations considered: ${data.evalAlternativeExplanationsConsidered || "TBD"}
+- Cross-check method: ${data.evalCrossCheckMethod || "TBD"}
+- Best-supported interpretation: ${data.evalBestSupportedInterpretation || "TBD"}
+- Escalation threshold: ${data.evalEscalationThreshold || "TBD"}
 - Experiment ladder: ${data.evalExperimentLadder || "TBD"}
 - Benchmark ladder: ${data.evalBenchmarkLadder || "TBD"}
 - Promotion gate: ${data.evalPromotionGate || "TBD"}
@@ -948,6 +1184,21 @@ ${data.problem || "待补充"}
 - Comparison source papers: ${data.evalComparisonSourcePapers || "待补充"}
 - Comparison implementation source: ${data.evalComparisonImplementationSource || "待补充"}
 - Deviation from original implementation: ${data.evalDeviationFromOriginalImplementation || "待补充"}
+- Evaluation setting semantics: ${data.evalEvaluationSettingSemantics || "待补充"}
+- Visibility and leakage risks: ${data.evalVisibilityAndLeakageRisks || "待补充"}
+- Anchor and label policy: ${data.evalAnchorAndLabelPolicy || "待补充"}
+- Scale and comparability policy: ${data.evalScaleAndComparabilityPolicy || "待补充"}
+- Metric validity checks: ${data.evalMetricValidityChecks || "待补充"}
+- Comparison validity checks: ${data.evalComparisonValidityChecks || "待补充"}
+- Statistical validity checks: ${data.evalStatisticalValidityChecks || "待补充"}
+- Claim boundary: ${data.evalClaimBoundary || "待补充"}
+- Integrity self-check: ${data.evalIntegritySelfCheck || "待补充"}
+- Anomaly signals: ${data.evalAnomalySignals || "待补充"}
+- Implementation reality checks: ${data.evalImplementationRealityChecks || "待补充"}
+- Alternative explanations considered: ${data.evalAlternativeExplanationsConsidered || "待补充"}
+- Cross-check method: ${data.evalCrossCheckMethod || "待补充"}
+- Best-supported interpretation: ${data.evalBestSupportedInterpretation || "待补充"}
+- Escalation threshold: ${data.evalEscalationThreshold || "待补充"}
 - Experiment ladder: ${data.evalExperimentLadder || "待补充"}
 - Benchmark ladder: ${data.evalBenchmarkLadder || "待补充"}
 - Promotion gate: ${data.evalPromotionGate || "待补充"}
@@ -1022,6 +1273,21 @@ ${data.problem || "TBD"}
 - Comparison source papers: ${data.evalComparisonSourcePapers || "TBD"}
 - Comparison implementation source: ${data.evalComparisonImplementationSource || "TBD"}
 - Deviation from original implementation: ${data.evalDeviationFromOriginalImplementation || "TBD"}
+- Evaluation setting semantics: ${data.evalEvaluationSettingSemantics || "TBD"}
+- Visibility and leakage risks: ${data.evalVisibilityAndLeakageRisks || "TBD"}
+- Anchor and label policy: ${data.evalAnchorAndLabelPolicy || "TBD"}
+- Scale and comparability policy: ${data.evalScaleAndComparabilityPolicy || "TBD"}
+- Metric validity checks: ${data.evalMetricValidityChecks || "TBD"}
+- Comparison validity checks: ${data.evalComparisonValidityChecks || "TBD"}
+- Statistical validity checks: ${data.evalStatisticalValidityChecks || "TBD"}
+- Claim boundary: ${data.evalClaimBoundary || "TBD"}
+- Integrity self-check: ${data.evalIntegritySelfCheck || "TBD"}
+- Anomaly signals: ${data.evalAnomalySignals || "TBD"}
+- Implementation reality checks: ${data.evalImplementationRealityChecks || "TBD"}
+- Alternative explanations considered: ${data.evalAlternativeExplanationsConsidered || "TBD"}
+- Cross-check method: ${data.evalCrossCheckMethod || "TBD"}
+- Best-supported interpretation: ${data.evalBestSupportedInterpretation || "TBD"}
+- Escalation threshold: ${data.evalEscalationThreshold || "TBD"}
 - Experiment ladder: ${data.evalExperimentLadder || "TBD"}
 - Benchmark ladder: ${data.evalBenchmarkLadder || "TBD"}
 - Promotion gate: ${data.evalPromotionGate || "TBD"}
@@ -1286,6 +1552,21 @@ function buildContextSnapshot(targetDir) {
   evalComparisonSourcePapers: evalProtocol.comparisonSourcePapers,
|
|
1287
1553
|
evalComparisonImplementationSource: evalProtocol.comparisonImplementationSource,
|
|
1288
1554
|
evalDeviationFromOriginalImplementation: evalProtocol.deviationFromOriginalImplementation,
|
|
1555
|
+
evalEvaluationSettingSemantics: evalProtocol.evaluationSettingSemantics,
|
|
1556
|
+
evalVisibilityAndLeakageRisks: evalProtocol.visibilityAndLeakageRisks,
|
|
1557
|
+
evalAnchorAndLabelPolicy: evalProtocol.anchorAndLabelPolicy,
|
|
1558
|
+
evalScaleAndComparabilityPolicy: evalProtocol.scaleAndComparabilityPolicy,
|
|
1559
|
+
evalMetricValidityChecks: evalProtocol.metricValidityChecks,
|
|
1560
|
+
evalComparisonValidityChecks: evalProtocol.comparisonValidityChecks,
|
|
1561
|
+
evalStatisticalValidityChecks: evalProtocol.statisticalValidityChecks,
|
|
1562
|
+
evalClaimBoundary: evalProtocol.claimBoundary,
|
|
1563
|
+
evalIntegritySelfCheck: evalProtocol.integritySelfCheck,
|
|
1564
|
+
evalAnomalySignals: evalProtocol.anomalySignals,
|
|
1565
|
+
evalImplementationRealityChecks: evalProtocol.implementationRealityChecks,
|
|
1566
|
+
evalAlternativeExplanationsConsidered: evalProtocol.alternativeExplanationsConsidered,
|
|
1567
|
+
evalCrossCheckMethod: evalProtocol.crossCheckMethod,
|
|
1568
|
+
evalBestSupportedInterpretation: evalProtocol.bestSupportedInterpretation,
|
|
1569
|
+
evalEscalationThreshold: evalProtocol.escalationThreshold,
|
|
1289
1570
|
evalExperimentLadder: evalProtocol.experimentLadder,
|
|
1290
1571
|
evalBenchmarkLadder: evalProtocol.benchmarkLadder,
|
|
1291
1572
|
evalPromotionGate: evalProtocol.promotionGate,
|
package/lib/eval_protocol.cjs
CHANGED
@@ -79,6 +79,81 @@ const EVAL_PROTOCOL_FIELDS = [
     key: "deviationFromOriginalImplementation",
     labels: ["Deviation from original implementation", "与原始实现的偏差"],
   },
+  {
+    name: "Evaluation setting semantics",
+    key: "evaluationSettingSemantics",
+    labels: ["Evaluation setting semantics", "评测设定语义"],
+  },
+  {
+    name: "Visibility and leakage risks",
+    key: "visibilityAndLeakageRisks",
+    labels: ["Visibility and leakage risks", "可见性与泄漏风险"],
+  },
+  {
+    name: "Anchor and label policy",
+    key: "anchorAndLabelPolicy",
+    labels: ["Anchor and label policy", "锚点与标签策略"],
+  },
+  {
+    name: "Scale and comparability policy",
+    key: "scaleAndComparabilityPolicy",
+    labels: ["Scale and comparability policy", "尺度与可比性策略"],
+  },
+  {
+    name: "Metric validity checks",
+    key: "metricValidityChecks",
+    labels: ["Metric validity checks", "指标有效性检查"],
+  },
+  {
+    name: "Comparison validity checks",
+    key: "comparisonValidityChecks",
+    labels: ["Comparison validity checks", "对比方法有效性检查"],
+  },
+  {
+    name: "Statistical validity checks",
+    key: "statisticalValidityChecks",
+    labels: ["Statistical validity checks", "统计有效性检查"],
+  },
+  {
+    name: "Claim boundary",
+    key: "claimBoundary",
+    labels: ["Claim boundary", "结论边界"],
+  },
+  {
+    name: "Integrity self-check",
+    key: "integritySelfCheck",
+    labels: ["Integrity self-check", "完整性自检"],
+  },
+  {
+    name: "Anomaly signals",
+    key: "anomalySignals",
+    labels: ["Anomaly signals", "异常信号"],
+  },
+  {
+    name: "Implementation reality checks",
+    key: "implementationRealityChecks",
+    labels: ["Implementation reality checks", "实现层现实检查"],
+  },
+  {
+    name: "Alternative explanations considered",
+    key: "alternativeExplanationsConsidered",
+    labels: ["Alternative explanations considered", "已考虑的替代解释"],
+  },
+  {
+    name: "Cross-check method",
+    key: "crossCheckMethod",
+    labels: ["Cross-check method", "交叉验证方法"],
+  },
+  {
+    name: "Best-supported interpretation",
+    key: "bestSupportedInterpretation",
+    labels: ["Best-supported interpretation", "当前最站得住的解释"],
+  },
+  {
+    name: "Escalation threshold",
+    key: "escalationThreshold",
+    labels: ["Escalation threshold", "升级阈值"],
+  },
   {
     name: "Benchmark ladder",
     key: "benchmarkLadder",
package/lib/i18n.cjs
CHANGED
@@ -332,9 +332,12 @@ const ZH_SKILL_FILES = {
 - 必须用白话解释选定的主指标和次级指标:每个指标在衡量什么、越高还是越低更好、它是主结果指标还是健康度/支持性指标。
 - 如果出现 coverage、completeness、confidence 或类似健康度指标,必须明确说明这类指标回答的是“实验是否跑稳、证据是否完整”,而不是主要科学效应本身。
 - 要把最关键的背景来源、方法/基线来源和指标来源直接写进报告,不要把它们藏在 \`.lab/context/*\` 里。
+- 把 \`report.md\` 当作给外部评审或合作者看的研究 memo;来源章节必须给出人类可读的 anchor references,不能拿本地路径或内部 provenance 充数。
 - 如果 \`.lab/context/terminology-lock.md\` 里已经冻结了方法名和 contribution bullets,就必须把它们带进报告。
 - 方法概述必须用协作者能读懂的话说明:我们的方法大致怎么做、相对 closest prior work 或 strongest baseline 改了什么、这些 prior 方法各自做了什么,以及它们为什么在当前 claim 下仍然不够。
 - 只保留少量最关键的 prior work/baseline 锚点;每个锚点都要用一句话交代它做了什么和它的局限。
+- 在“背景来源”“方法与基线来源”“指标来源”里,每个锚点都必须包含:引用、它做了什么或衡量什么、以及至少一个局限或 caveat。
+- 内部 provenance 只能放到 \`工件状态\` 或 \`.lab/context/evidence-index.md\`,不能塞进来源章节。
 - 在起草报告前,先检查 \`.lab/context/mission.md\` 和 \`.lab/context/eval-protocol.md\` 是否仍是模板空壳。
 - 如果 canonical context 还是空壳,要先根据 frozen result artifacts、data-decisions、evidence-index 和已批准上下文回填“最小可信版本”,再写报告。
 - 如果回填后仍缺少协作者可读所需的关键字段,就必须把输出降级成 \`artifact-anchored interim report\`,不能冒充最终协作者报告。
@@ -699,6 +702,10 @@ const ZH_SKILL_FILES = {
 
 ## Checklist
 
+- 学术有效性检查是否已经填写,并且和实际实验设置保持一致?
+- 完整性自检是否排除了不可见输入、不合理指标使用和把工作流状态当成科学证据的做法?
+- 异常信号是否先被当成 diagnostic trigger,而不是被直接合理化成结果?
+- 在升格当前解释前,是否已经记录更简单的替代解释和至少一种交叉验证?
 - 是否把 claims 和 evidence 分开写清楚?
 - baseline 是否公平且足够强?
 - 数据集、切分和指标是否合理?
@@ -763,20 +770,36 @@ const ZH_SKILL_FILES = {
 
 ## 背景来源
 
--
--
+- 参考 1:
+  - 引用:
+  - 做了什么:
+  - 为什么和当前问题相关:
+  - 对当前项目的局限:
 
 ## 方法与基线来源
 
--
--
--
+- 参考 1:
+  - 引用:
+  - 做了什么:
+  - 为什么是这里的关键对照:
+  - 相对我们目标的局限:
 
 ## 指标来源
 
--
--
--
+- 参考 1:
+  - 引用:
+  - 衡量什么:
+  - 为什么适合这里:
+  - 局限或注意事项:
+
+## 异常与替代解释
+
+- 观察到的异常信号:
+- 做过的实现层检查:
+- 已排除的更简单解释:
+- 支撑当前解释的交叉验证:
+- 当前最站得住的解释:
+- 未来异常出现时的升级阈值:
 
 ## 怎么看主表
|
|
|
1107
1130
|
- Objective:
|
|
1108
1131
|
- Autonomy level: L2
|
|
1109
1132
|
- Autonomy level 只表示执行权限级别,不表示论文 layer 或 table 编号。
|
|
1133
|
+
- 层级指南:
|
|
1134
|
+
- \`L1\` = 一轮安全验证或简单 report 刷新
|
|
1135
|
+
- \`L2\` = 默认推荐级别,适合冻结核心边界内的实验迭代
|
|
1136
|
+
- \`L3\` = 激进 campaign,适合更大范围探索和可选写作
|
|
1137
|
+
- 如果不确定,默认推荐 \`L2\`。
|
|
1110
1138
|
- 如果你想表达论文层、实验 phase 或主表,请明确写成 \`paper layer\`、\`phase\` 或 \`table\`。
|
|
1139
|
+
- 如果你的请求提到了论文层、实验 phase 或主表,但没写自治级别,就不要启动循环,先补明确的 \`Autonomy level\`。
|
|
1111
1140
|
- Approval status: draft
|
|
1112
1141
|
- Allowed stages: run, iterate, review, report
|
|
1113
1142
|
- Success criteria:
|
|
@@ -1115,7 +1144,7 @@ const ZH_SKILL_FILES = {
|
|
|
1115
1144
|
- Terminal goal target:
|
|
1116
1145
|
- Required terminal artifact:
|
|
1117
1146
|
- 如果 workflow language 是中文,摘要、清单条目、任务标签和进度更新都应使用中文。
|
|
1118
|
-
- 示例 Objective: 推进 paper layer 3
|
|
1147
|
+
- 示例 Objective: 推进 paper layer 3,完成一轮 bounded protocol、测试、最小实现和一轮小规模结果。
|
|
1119
1148
|
|
|
1120
1149
|
## 循环预算
|
|
1121
1150
|
|
|
@@ -1178,6 +1207,21 @@ const ZH_SKILL_FILES = {
 - 对比方法来源论文:
 - 对比方法实现来源:
 - 与原始实现的偏差:
+- 评测设定语义:
+- 可见性与泄漏风险:
+- 锚点与标签策略:
+- 尺度与可比性策略:
+- 指标有效性检查:
+- 对比有效性检查:
+- 统计有效性检查:
+- 结论边界:
+- 完整性自检:
+- 异常信号:
+- 实现层现实检查:
+- 已考虑的替代解释:
+- 交叉验证方法:
+- 当前最站得住的解释:
+- 升级阈值:
 - 终止目标类型:
 - 终止目标目标值:
 - 必要终止工件:
@@ -1572,7 +1616,7 @@ ZH_CONTENT[path.join(".lab", ".managed", "templates", "framing.md")] = `# 论文
 ZH_CONTENT[path.join(".codex", "prompts", "lab.md")] = codexPrompt(
   "查看 /lab 研究工作流总览并选择合适阶段",
   "workflow question 或 stage choice",
-  "# `/lab` for Codex\n\n`/lab` 是严格的研究工作流命令族。每次都使用同一套仓库工件和阶段边界。\n\n## 子命令\n\n- `/lab:idea`\n 调研 idea,定义问题与 failure case,归类 contribution 与 breakthrough level,对比现有方法,收束三个一眼就有意义的点,并在实现前保留 approval gate。\n\n- `/lab:data`\n 把已批准的 idea 转成数据集与 benchmark 方案,记录数据集年份、使用过该数据集的论文、下载来源、许可或访问限制,以及 classic-public、recent-strong-public、claim-specific 三类 benchmark 的纳入理由,和 canonical baselines、strong historical baselines、recent strong public methods、closest prior work 四类对比方法的纳入理由。\n\n- `/lab:auto`\n 在不改变 mission、framing 和核心 claims 的前提下,读取 eval-protocol 与 auto-mode 契约并自动编排 `run`、`iterate`、`review`、`report`,必要时扩展数据集、benchmark 和 comparison methods,并在满足升格策略时自动升级 primary package。启动前必须选定 autonomy level、声明 terminal goal,并显式批准契约。\n\n- `/lab:framing`\n 通过审计当前领域与相邻领域的术语,锁定 paper-facing 的方法名、模块名、论文题目和 contribution bullets,并在 section 起草前保留 approval gate。\n\n- `/lab:spec`\n 把已批准的 idea 转成 `.lab/changes/<change-id>/` 下的一个 lab change 目录,并在其中写出 `proposal`、`design`、`spec`、`tasks`。\n\n- `/lab:run`\n 执行最小有意义验证运行,登记 run,并生成第一版标准化评估摘要。\n\n- `/lab:iterate`\n 在冻结 mission、阈值、verification commands 与 `completion_promise` 的前提下执行有边界的实验迭代。\n\n- `/lab:review`\n 以 reviewer mode 审查文档或结果,先给短摘要,再输出 findings、fatal flaws、fix priority 和 residual risks。\n\n- `/lab:report`\n 从 runs 和 iterations 工件生成最终研究报告。\n\n- `/lab:write`\n 使用已安装 `lab` skill 下 vendored 的 paper-writing references,把稳定 report 工件转成论文 section。\n\n## 调度规则\n\n- 始终使用 `skills/lab/SKILL.md` 作为工作流合同。\n- 用户显式调用 `/lab:<stage>` 时,要立刻执行该 stage,而不是只推荐别的 `/lab` stage。\n- 先给简洁摘要,再决定是否写工件,最后回报输出路径和下一步。\n- 如果歧义会影响结论,一次只问一个问题;如果有多条可行路径,先给 2-3 个方案再收敛。\n- `/lab:spec` 前应已有经批准的数据集与 benchmark 方案。\n- `/lab:run`、`/lab:iterate`、`/lab:auto`、`/lab:report` 都应遵循 `.lab/context/eval-protocol.md`。\n- `.lab/context/eval-protocol.md` 不只定义主指标和主表,也应定义指标释义、实验阶梯,以及指标和对比实现的来源。\n- `/lab:auto` 只编排已批准边界内的执行阶段,不替代手动的 idea/data/framing/spec 决策。\n- `/lab:write` 前必须已有经批准的 `/lab:framing` 工件。\n\n## 如何输入 `/lab:auto`\n\n- 把 `Autonomy level L1/L2/L3` 视为执行权限级别,不要和论文里的 layer、phase、table 编号混用。\n- 把 `paper layer`、`phase`、`table` 视为实验目标。例如 `paper layer 3` 或 `Phase 1
+  "# `/lab` for Codex\n\n`/lab` 是严格的研究工作流命令族。每次都使用同一套仓库工件和阶段边界。\n\n## 子命令\n\n- `/lab:idea`\n 调研 idea,定义问题与 failure case,归类 contribution 与 breakthrough level,对比现有方法,收束三个一眼就有意义的点,并在实现前保留 approval gate。\n\n- `/lab:data`\n 把已批准的 idea 转成数据集与 benchmark 方案,记录数据集年份、使用过该数据集的论文、下载来源、许可或访问限制,以及 classic-public、recent-strong-public、claim-specific 三类 benchmark 的纳入理由,和 canonical baselines、strong historical baselines、recent strong public methods、closest prior work 四类对比方法的纳入理由。\n\n- `/lab:auto`\n 在不改变 mission、framing 和核心 claims 的前提下,读取 eval-protocol 与 auto-mode 契约并自动编排 `run`、`iterate`、`review`、`report`,必要时扩展数据集、benchmark 和 comparison methods,并在满足升格策略时自动升级 primary package。启动前必须选定 autonomy level、声明 terminal goal,并显式批准契约。\n\n- `/lab:framing`\n 通过审计当前领域与相邻领域的术语,锁定 paper-facing 的方法名、模块名、论文题目和 contribution bullets,并在 section 起草前保留 approval gate。\n\n- `/lab:spec`\n 把已批准的 idea 转成 `.lab/changes/<change-id>/` 下的一个 lab change 目录,并在其中写出 `proposal`、`design`、`spec`、`tasks`。\n\n- `/lab:run`\n 执行最小有意义验证运行,登记 run,并生成第一版标准化评估摘要。\n\n- `/lab:iterate`\n 在冻结 mission、阈值、verification commands 与 `completion_promise` 的前提下执行有边界的实验迭代。\n\n- `/lab:review`\n 以 reviewer mode 审查文档或结果,先给短摘要,再输出 findings、fatal flaws、fix priority 和 residual risks。\n\n- `/lab:report`\n 从 runs 和 iterations 工件生成最终研究报告。\n\n- `/lab:write`\n 使用已安装 `lab` skill 下 vendored 的 paper-writing references,把稳定 report 工件转成论文 section。\n\n## 调度规则\n\n- 始终使用 `skills/lab/SKILL.md` 作为工作流合同。\n- 用户显式调用 `/lab:<stage>` 时,要立刻执行该 stage,而不是只推荐别的 `/lab` stage。\n- 先给简洁摘要,再决定是否写工件,最后回报输出路径和下一步。\n- 如果歧义会影响结论,一次只问一个问题;如果有多条可行路径,先给 2-3 个方案再收敛。\n- `/lab:spec` 前应已有经批准的数据集与 benchmark 方案。\n- `/lab:run`、`/lab:iterate`、`/lab:auto`、`/lab:report` 都应遵循 `.lab/context/eval-protocol.md`。\n- `.lab/context/eval-protocol.md` 不只定义主指标和主表,也应定义指标释义、实验阶梯,以及指标和对比实现的来源。\n- `/lab:auto` 只编排已批准边界内的执行阶段,不替代手动的 idea/data/framing/spec 决策。\n- `/lab:write` 前必须已有经批准的 `/lab:framing` 工件。\n\n## 如何输入 `/lab:auto`\n\n## `/lab:auto` 层级指南\n\n- `L1`:适合安全验证、一轮 bounded 真实运行,或简单 report 刷新。\n- `L2`:默认推荐级别,适合冻结核心边界内的常规实验迭代。\n- `L3`:激进 campaign 级别,只在你明确想做更大范围探索和可选写作时使用。\n- 如果不确定,默认推荐 `L2`。\n- 如果用户输入没写级别,或者把级别和 `paper layer`、`phase`、`table` 混用了,就应先停下来,要求用户明确选 `L1/L2/L3`。\n\n- 把 `Autonomy level L1/L2/L3` 视为执行权限级别,不要和论文里的 layer、phase、table 编号混用。\n- 把 `paper layer`、`phase`、`table` 视为实验目标。例如 `paper layer 3` 或 `Phase 1` 不是 `Autonomy level L3`。\n- 一条好的 `/lab:auto` 输入应至少说清:objective、自治级别、terminal goal、scope、allowed modifications。\n- 如果 workflow language 是中文,摘要、清单条目、任务标签和进度更新都应使用中文,除非文件路径、代码标识符或字面指标名必须保持原样。\n- 示例:`/lab:auto 自治级别 L2。目标:推进 paper layer 3。终止条件:完成 bounded protocol、测试、最小实现和一轮小规模结果。允许修改:配置、数据接入、评估脚本。`\n"
 );
 
 ZH_CONTENT[path.join(".codex", "prompts", "lab-data.md")] = codexPrompt(
@@ -1591,7 +1635,7 @@ ZH_CONTENT[path.join(".claude", "commands", "lab.md")] = claudeCommand(
   "lab",
   "查看 /lab 研究工作流总览并选择合适阶段",
   "[stage] [target]",
-  "# `/lab` for Claude\n\n`/lab` 是 Claude Code 里的 lab 工作流分发入口。调用方式有两种:\n\n- `/lab <stage> ...`\n- `/lab-idea`、`/lab-data`、`/lab-auto`、`/lab-framing`、`/lab-spec`、`/lab-run`、`/lab-iterate`、`/lab-review`、`/lab-report`、`/lab-write`\n\n## 阶段别名\n\n- `/lab idea ...` 或 `/lab-idea`\n- `/lab data ...` 或 `/lab-data`\n- `/lab auto ...` 或 `/lab-auto`\n- `/lab framing ...` 或 `/lab-framing`\n- `/lab spec ...` 或 `/lab-spec`\n- `/lab run ...` 或 `/lab-run`\n- `/lab iterate ...` 或 `/lab-iterate`\n- `/lab review ...` 或 `/lab-review`\n- `/lab report ...` 或 `/lab-report`\n- `/lab write ...` 或 `/lab-write`\n\n## 调度规则\n\n- 始终使用 `skills/lab/SKILL.md` 作为工作流合同。\n- 用户显式调用 `/lab <stage> ...` 或 `/lab-<stage>` 时,要立刻执行该 stage,而不是只推荐别的阶段。\n- 先给简洁摘要,再决定是否写工件,最后回报输出路径和下一步。\n- 如果歧义会影响结论,一次只问一个问题;如果有多条可行路径,先给 2-3 个方案再收敛。\n- `spec` 前应已有经批准的数据集与 benchmark 方案。\n- `run`、`iterate`、`auto`、`report` 都应遵循 `.lab/context/eval-protocol.md`。\n- `auto` 只编排已批准边界内的执行阶段,不替代手动的 idea/data/framing/spec 决策。\n- `write` 前必须已有经批准的 `framing` 工件。\n\n## 如何输入 `/lab auto`\n\n- 把 `Autonomy level L1/L2/L3` 视为执行权限级别,不要和论文里的 layer、phase、table 编号混用。\n- 把 `paper layer`、`phase`、`table` 视为实验目标。例如 `paper layer 3` 或 `Phase 1
+  "# `/lab` for Claude\n\n`/lab` 是 Claude Code 里的 lab 工作流分发入口。调用方式有两种:\n\n- `/lab <stage> ...`\n- `/lab-idea`、`/lab-data`、`/lab-auto`、`/lab-framing`、`/lab-spec`、`/lab-run`、`/lab-iterate`、`/lab-review`、`/lab-report`、`/lab-write`\n\n## 阶段别名\n\n- `/lab idea ...` 或 `/lab-idea`\n- `/lab data ...` 或 `/lab-data`\n- `/lab auto ...` 或 `/lab-auto`\n- `/lab framing ...` 或 `/lab-framing`\n- `/lab spec ...` 或 `/lab-spec`\n- `/lab run ...` 或 `/lab-run`\n- `/lab iterate ...` 或 `/lab-iterate`\n- `/lab review ...` 或 `/lab-review`\n- `/lab report ...` 或 `/lab-report`\n- `/lab write ...` 或 `/lab-write`\n\n## 调度规则\n\n- 始终使用 `skills/lab/SKILL.md` 作为工作流合同。\n- 用户显式调用 `/lab <stage> ...` 或 `/lab-<stage>` 时,要立刻执行该 stage,而不是只推荐别的阶段。\n- 先给简洁摘要,再决定是否写工件,最后回报输出路径和下一步。\n- 如果歧义会影响结论,一次只问一个问题;如果有多条可行路径,先给 2-3 个方案再收敛。\n- `spec` 前应已有经批准的数据集与 benchmark 方案。\n- `run`、`iterate`、`auto`、`report` 都应遵循 `.lab/context/eval-protocol.md`。\n- `auto` 只编排已批准边界内的执行阶段,不替代手动的 idea/data/framing/spec 决策。\n- `write` 前必须已有经批准的 `framing` 工件。\n\n## 如何输入 `/lab auto`\n\n## `/lab auto` 层级指南\n\n- `L1`:适合安全验证、一轮 bounded 真实运行,或简单 report 刷新。\n- `L2`:默认推荐级别,适合冻结核心边界内的常规实验迭代。\n- `L3`:激进 campaign 级别,只在你明确想做更大范围探索和可选写作时使用。\n- 如果不确定,默认推荐 `L2`。\n- 如果用户输入没写级别,或者把级别和 `paper layer`、`phase`、`table` 混用了,就应先停下来,要求用户明确选 `L1/L2/L3`。\n\n- 把 `Autonomy level L1/L2/L3` 视为执行权限级别,不要和论文里的 layer、phase、table 编号混用。\n- 把 `paper layer`、`phase`、`table` 视为实验目标。例如 `paper layer 3` 或 `Phase 1` 不是 `Autonomy level L3`。\n- 一条好的 `/lab auto` 输入应至少说清:objective、自治级别、terminal goal、scope、allowed modifications。\n- 如果 workflow language 是中文,摘要、清单条目、任务标签和进度更新都应使用中文,除非文件路径、代码标识符或字面指标名必须保持原样。\n- 示例:`/lab auto 自治级别 L2。目标:推进 paper layer 3。终止条件:完成 bounded protocol、测试、最小实现和一轮小规模结果。允许修改:配置、数据接入、评估脚本。`\n"
 );
 
 ZH_CONTENT[path.join(".claude", "commands", "lab-data.md")] = claudeCommand(
@@ -2034,6 +2078,27 @@ ZH_CONTENT[path.join(".lab", "context", "eval-protocol.md")] = `# 评估协议
 - 对比方法实现来源:
 - 与原始实现的偏差:
 
+## 学术有效性检查
+
+- 评测设定语义:
+- 可见性与泄漏风险:
+- 锚点与标签策略:
+- 尺度与可比性策略:
+- 指标有效性检查:
+- 对比有效性检查:
+- 统计有效性检查:
+- 结论边界:
+- 完整性自检:
+
+## 异常与替代解释检查
+
+- 异常信号:
+- 实现层现实检查:
+- 已考虑的替代解释:
+- 交叉验证方法:
+- 当前最站得住的解释:
+- 升级阈值:
+
 ## Gate Ladder
 
 - 实验阶梯:
@@ -2132,11 +2197,25 @@ ZH_CONTENT[path.join(".codex", "skills", "lab", "stages", "auto.md")] = `# \`/la
 
 ## 交互约束
 
+## 层级指南
+
+- \`L1\` = safe validation
+- \`L2\` = 默认推荐的 bounded iteration
+- \`L3\` = aggressive campaign
+- 如果不确定,默认推荐 \`L2\`。
+
 - 开始前先简洁说明:objective、frozen core 和下一自动阶段。
 - 如果契约本身不完整,一次只追问一个问题。
 - 如果存在多个可信的下一动作,先给 2-3 个 bounded 方案和推荐项,再启动长任务。
 - 只有当下一步会离开已批准的 exploration envelope、超出选定 autonomy level,或实质改变 frozen core 时,才保留人工 approval gate。
+- 每次进入 \`/lab:auto\` 都要先给出这份层级指南。
 - 先做输入归一化:把 \`Autonomy level L1/L2/L3\` 视为执行权限级别,把 \`Layer 3\`、\`Phase 1\`、\`Table 2\` 视为论文范围目标。
+- 如果用户没有写自治级别,或者把自治级别和论文层、phase、table 混用了,就必须先给一版更详细的层级说明,至少解释:
+  - \`L1/L2/L3\` 的典型适用场景
+  - 每个级别允许改什么
+  - 每个级别通常在什么 stop boundary 停下
+  - 如果不确定,默认推荐 \`L2\`
+- 给完这版详细说明后,再追问一个明确的 \`L1/L2/L3\` 选择;在用户明确选级别前不要启动循环。
 - 如果用户同时提了论文层、实验 phase 和自治级别,先用一句话重述:objective、自治级别、terminal goal、scope、allowed modifications。
 - 如果 workflow language 是中文,摘要、清单条目、任务标签和进度更新都应使用中文,除非文件路径、代码标识符或字面指标名必须保持原样。
 - 当循环进入 \`report\` 时,要主动给出用户可读的白话总结,解释主指标、次级指标和主表作用;不要等用户额外发一句“解释这些指标”。
@@ -62,8 +62,16 @@ Use the same repository artifacts and stage boundaries every time.
 
 ## How to Ask for `/lab auto`
 
+## Level Guide for `/lab auto`
+
+- `L1` is the safe validation level. Use it for a smoke run, one bounded real run, or a simple review/report refresh when you do not want automatic iteration.
+- `L2` is the default recommended level. Use it for bounded experiment iteration inside a frozen core when you want auto to keep running until a gate, stop condition, or terminal goal is hit.
+- `L3` is the aggressive campaign level. Use it only when you explicitly want broad exploration, larger search space changes, and optional manuscript-writing work.
+- If you are unsure, choose `L2`.
+- If the request omits the level or mixes it with a paper layer, phase, or table target, `/lab auto` should stop and ask for an explicit autonomy level before arming the loop.
+
 - Treat `Autonomy level L1/L2/L3` as the execution privilege level, not as a paper layer, phase, or table number.
-- Treat `paper layer`, `phase`, and `table` as experiment targets. For example, `paper layer 3` or `Phase 1
+- Treat `paper layer`, `phase`, and `table` as experiment targets. For example, `paper layer 3` or `Phase 1` should not be interpreted as `Autonomy level L3`.
 - A good `/lab auto` request should name:
   - the objective
   - the autonomy level
@@ -72,4 +80,4 @@ Use the same repository artifacts and stage boundaries every time.
   - the allowed modifications
 - If the repository workflow language is Chinese, summaries, checklist items, task labels, and progress updates should be written in Chinese unless a code identifier or file path must stay literal.
 - Good example:
-  - `/lab auto Autonomy level L2. Objective: advance paper layer 3
+  - `/lab auto Autonomy level L2. Objective: advance paper layer 3 through one bounded protocol improvement. Terminal goal: task-completion. Scope: bounded protocol, tests, one minimal implementation, and one small run. Allowed modifications: configuration, evaluation script, and data-loading logic only.`
@@ -56,8 +56,16 @@ argument-hint: workflow question or stage choice
 
 ## How to Ask for `/lab:auto`
 
+## Level Guide for `/lab:auto`
+
+- `L1` is the safe validation level. Use it for a smoke run, one bounded real run, or a simple review/report refresh when you do not want automatic iteration.
+- `L2` is the default recommended level. Use it for bounded experiment iteration inside a frozen core when you want auto to keep running until a gate, stop condition, or terminal goal is hit.
+- `L3` is the aggressive campaign level. Use it only when you explicitly want broad exploration, larger search space changes, and optional manuscript-writing work.
+- If you are unsure, choose `L2`.
+- If the request omits the level or mixes it with a paper layer, phase, or table target, `/lab:auto` should stop and ask for an explicit autonomy level before arming the loop.
+
 - Treat `Autonomy level L1/L2/L3` as the execution privilege level, not as a paper layer, phase, or table number.
-- Treat `paper layer`, `phase`, and `table` as experiment targets. For example, `paper layer 3` or `Phase 1
+- Treat `paper layer`, `phase`, and `table` as experiment targets. For example, `paper layer 3` or `Phase 1` should not be interpreted as `Autonomy level L3`.
 - A good `/lab:auto` request should name:
   - the objective
   - the autonomy level
@@ -66,4 +74,4 @@ argument-hint: workflow question or stage choice
   - the allowed modifications
 - If the repository workflow language is Chinese, summaries, checklist items, task labels, and progress updates should be written in Chinese unless a code identifier or file path must stay literal.
 - Good example:
-  - `/lab:auto Autonomy level L2. Objective: advance paper layer 3
+  - `/lab:auto Autonomy level L2. Objective: advance paper layer 3 through one bounded protocol improvement. Terminal goal: task-completion. Scope: bounded protocol, tests, one minimal implementation, and one small run. Allowed modifications: configuration, evaluation script, and data-loading logic only.`
@@ -20,6 +20,10 @@ REPORT_REQUIRED_SECTIONS = {
         r"^##\s+方法与基线来源\s*$",
     ],
     "Metric Sources": [r"^##\s+Metric Sources\s*$", r"^##\s+指标来源\s*$"],
+    "Sanity and Alternative Explanations": [
+        r"^##\s+Sanity and Alternative Explanations\s*$",
+        r"^##\s+异常与替代解释\s*$",
+    ],
 }
 
 MAIN_TABLES_REQUIRED_SECTIONS = {
@@ -30,6 +34,24 @@ MAIN_TABLES_REQUIRED_SECTIONS = {
     "How to Read These Tables": [r"^##\s+How to Read These Tables\s*$", r"^##\s+怎么读这些表\s*$"],
 }
 
+SOURCE_SECTION_NAMES = (
+    "Background Sources",
+    "Method and Baseline Sources",
+    "Metric Sources",
+)
+SOURCE_SECTION_PATH_MARKERS = (
+    "/Users/",
+    "/home/",
+    "/tmp/",
+    "/private/tmp/",
+    ".lab/",
+    "outputs/",
+    "docs/research/",
+)
+SOURCE_SECTION_CITATION_MARKERS = ("Citation:", "引用:")
+SOURCE_SECTION_ROLE_MARKERS = ("What it established:", "What it does:", "What it measures:", "做了什么:", "衡量什么:")
+SOURCE_SECTION_LIMITATION_MARKERS = ("Limitation", "局限")
+
 
 def parse_args():
     parser = argparse.ArgumentParser(
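The marker tuples above drive the per-section source checks this release adds to `validate_collaborator_report.py`: a source section fails if it leans on local paths, lacks a citation anchor, or omits a role/limitation pair. A self-contained sketch of that check logic, assuming the section body has already been sliced out of `report.md` (`check_body` is a simplified stand-in for the real `validate_source_sections`):

```python
# Marker tuples copied from validate_collaborator_report.py in this diff;
# check_body is a simplified stand-in for validate_source_sections.
SOURCE_SECTION_PATH_MARKERS = (
    "/Users/", "/home/", "/tmp/", "/private/tmp/",
    ".lab/", "outputs/", "docs/research/",
)
SOURCE_SECTION_CITATION_MARKERS = ("Citation:", "引用:")
SOURCE_SECTION_ROLE_MARKERS = (
    "What it established:", "What it does:", "What it measures:",
    "做了什么:", "衡量什么:",
)
SOURCE_SECTION_LIMITATION_MARKERS = ("Limitation", "局限")

def check_body(body: str) -> list[str]:
    issues = []
    # Internal provenance (local paths, .lab/ artifacts) is not allowed here.
    if any(m in body for m in SOURCE_SECTION_PATH_MARKERS):
        issues.append("local path or internal provenance")
    # Every source section needs at least one human-readable citation anchor.
    if not any(m in body for m in SOURCE_SECTION_CITATION_MARKERS):
        issues.append("missing citation anchor")
    # Each anchor must say what it does/measures and name one limitation.
    has_role = any(m in body for m in SOURCE_SECTION_ROLE_MARKERS)
    has_limitation = any(m in body for m in SOURCE_SECTION_LIMITATION_MARKERS)
    if not has_role or not has_limitation:
        issues.append("missing role or limitation")
    return issues

good = "- Citation: Smith et al. 2021\n- What it does: ranks anchors\n- Limitation: small corpus"
bad = "- See .lab/context/evidence-index.md"
print(check_body(good))  # []
```

The `bad` body trips all three checks, which is exactly the failure mode the report-stage guidance above warns about: pointing a source section at internal provenance instead of a citation.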
@@ -48,6 +70,35 @@ def missing_sections(text: str, required_sections: dict[str, list[str]]) -> list
     return missing
 
 
+def extract_section_body(text: str, patterns: list[str]) -> str:
+    for pattern in patterns:
+        match = re.search(pattern, text, flags=re.MULTILINE)
+        if not match:
+            continue
+        start = match.end()
+        next_heading = re.search(r"^##\s+", text[start:], flags=re.MULTILINE)
+        end = start + next_heading.start() if next_heading else len(text)
+        return text[start:end].strip()
+    return ""
+
+
+def validate_source_sections(text: str, label: str) -> list[str]:
+    issues = []
+    for section_name in SOURCE_SECTION_NAMES:
+        body = extract_section_body(text, REPORT_REQUIRED_SECTIONS[section_name])
+        if not body:
+            continue
+        if any(marker in body for marker in SOURCE_SECTION_PATH_MARKERS):
+            issues.append(f"{label} section '{section_name}' must not rely on local file paths or internal provenance")
+        if not any(marker in body for marker in SOURCE_SECTION_CITATION_MARKERS):
+            issues.append(f"{label} section '{section_name}' must include at least one citation anchor")
+        has_role = any(marker in body for marker in SOURCE_SECTION_ROLE_MARKERS)
+        has_limitation = any(marker in body for marker in SOURCE_SECTION_LIMITATION_MARKERS)
+        if not has_role or not has_limitation:
+            issues.append(f"{label} section '{section_name}' must explain what the anchor does and one limitation")
+    return issues
+
+
 def validate(path_str: str, required_sections: dict[str, list[str]], label: str) -> list[str]:
     path = Path(path_str)
     if not path.exists():
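The new `extract_section_body` helper finds the first heading that matches one of the bilingual patterns and slices until the next `## ` heading. It can be exercised standalone; a small usage sketch with a toy report string (the report content here is hypothetical):

```python
import re

# Copy of the extract_section_body helper added in this diff, plus a toy
# report string to show the heading-to-heading slicing behavior.
def extract_section_body(text: str, patterns: list[str]) -> str:
    for pattern in patterns:
        match = re.search(pattern, text, flags=re.MULTILINE)
        if not match:
            continue
        start = match.end()
        # Body runs from the matched heading to the next "## " heading,
        # or to the end of the document if none follows.
        next_heading = re.search(r"^##\s+", text[start:], flags=re.MULTILINE)
        end = start + next_heading.start() if next_heading else len(text)
        return text[start:end].strip()
    return ""

report = "## Metric Sources\n- Citation: Lin 2004\n\n## Experiment Setup\n- seed: 7"
body = extract_section_body(report, [r"^##\s+Metric Sources\s*$"])
print(body)  # - Citation: Lin 2004
```

Because the patterns come from `REPORT_REQUIRED_SECTIONS`, the same call works whether the report uses the English heading or its Chinese counterpart (e.g. `## 指标来源`).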
@@ -56,6 +107,8 @@ def validate(path_str: str, required_sections: dict[str, list[str]], label: str)
     missing = missing_sections(text, required_sections)
     if missing:
         return [f"{label} is missing required sections: {', '.join(missing)}"]
+    if label == "report.md":
+        return validate_source_sections(text, label)
     return []
@@ -54,20 +54,36 @@
 
 ## Background Sources
 
--
--
+- Anchor reference 1:
+  - Citation:
+  - What it established:
+  - Why it matters here:
+  - Limitation for the current project:
 
 ## Method and Baseline Sources
 
--
--
--
+- Anchor reference 1:
+  - Citation:
+  - What it does:
+  - Why it is the right anchor here:
+  - Limitation relative to our goal:
 
 ## Metric Sources
 
--
--
--
+- Anchor reference 1:
+  - Citation:
+  - What it measures:
+  - Why it is appropriate here:
+  - Limitation or caveat:
+
+## Sanity and Alternative Explanations
+
+- Anomaly signals observed:
+- Implementation checks performed:
+- Alternative explanations ruled out:
+- Cross-checks that strengthen the current interpretation:
+- Best-supported interpretation:
+- Escalation threshold if future anomalies appear:
 
 ## Experiment Setup
@@ -23,6 +23,10 @@
 
 ## Checklist
 
+- Are the academic validity checks filled and still consistent with the actual setup?
+- Does the integrity self-check rule out unavailable inputs, invalid metric use, and workflow-status-as-evidence?
+- Have anomaly signals been treated as diagnostic triggers instead of being rationalized into findings?
+- Are simpler alternative explanations and at least one cross-check recorded before promoting the current interpretation?
 - Are the dataset and split choices stated clearly?
 - Is the baseline fair, current, and reproducible?
 - Are the primary and secondary metrics justified?
@@ -9,7 +9,13 @@ If `eval-protocol.md` declares structured rung entries, auto mode follows those
 - Objective:
 - Autonomy level: L2
 - Autonomy level controls execution privilege, not paper layer or table number.
+- Level guide:
+- `L1` = safe validation over one bounded run/review/report cycle
+- `L2` = default recommended bounded iteration inside a frozen core
+- `L3` = aggressive campaign with broader exploration and optional writing
+- If you are unsure, choose `L2`.
 - If you mean a paper layer, phase, or table, spell it explicitly as `paper layer`, `phase`, or `table`.
+- If your request mentions a paper layer, phase, or table but omits the autonomy level, do not arm the loop until the level is explicit.
 - Approval status: draft
 - Allowed stages: run, iterate, review, report
 - Success criteria:
@@ -17,7 +23,7 @@ If `eval-protocol.md` declares structured rung entries, auto mode follows those
 - Terminal goal target:
 - Required terminal artifact:
 - If the workflow language is Chinese, keep summaries, checklist items, task labels, and progress updates in Chinese.
-- Example objective: advance paper layer 3
+- Example objective: advance paper layer 3 through one bounded protocol, tests, minimal implementation, and one small run.
 
 ## Loop Budget
 
@@ -13,6 +13,21 @@
 - Comparison source papers:
 - Comparison implementation source:
 - Deviation from original implementation:
+- Evaluation setting semantics:
+- Visibility and leakage risks:
+- Anchor and label policy:
+- Scale and comparability policy:
+- Metric validity checks:
+- Comparison validity checks:
+- Statistical validity checks:
+- Claim boundary:
+- Integrity self-check:
+- Anomaly signals:
+- Implementation reality checks:
+- Alternative explanations considered:
+- Cross-check method:
+- Best-supported interpretation:
+- Escalation threshold:
 - Terminal goal type:
 - Terminal goal target:
 - Required terminal artifact:
@@ -29,6 +29,27 @@ Use this file to define the paper-facing evaluation objective, table plan, gates
 
 Record enough source detail here that later `run`, `iterate`, `auto`, and `report` stages do not have to guess what a metric means, which baseline implementation is canonical, or where a comparison method came from.
 
+## Academic Validity Checks
+
+- Evaluation setting semantics:
+- Visibility and leakage risks:
+- Anchor and label policy:
+- Scale and comparability policy:
+- Metric validity checks:
+- Comparison validity checks:
+- Statistical validity checks:
+- Claim boundary:
+- Integrity self-check:
+
+## Sanity and Alternative-Explanation Checks
+
+- Anomaly signals:
+- Implementation reality checks:
+- Alternative explanations considered:
+- Cross-check method:
+- Best-supported interpretation:
+- Escalation threshold:
+
 ## Gate Ladder
 
 - Experiment ladder:
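For orientation, a minimally hydrated version of the two new check sections might read as follows; every value below is invented for illustration and is not part of the package templates:

```markdown
## Academic Validity Checks

- Evaluation setting semantics: zero-shot evaluation; no test-time tuning
- Visibility and leakage risks: test split never inspected during development
- Claim boundary: claims limited to the two benchmarks actually run

## Sanity and Alternative-Explanation Checks

- Anomaly signals: two reruns produced byte-identical metrics
- Implementation reality checks: confirmed the seed was actually varied between runs
- Best-supported interpretation: the pipeline is deterministic given a fixed seed
```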
@@ -87,6 +87,7 @@ Use this skill when the user invokes `/lab:*` or asks for the structured researc
 - Treat `.lab/context/auto-mode.md` as the control contract and `.lab/context/auto-status.md` as the live state file.
 - Require `Autonomy level` and `Approval status` in `.lab/context/auto-mode.md` before execution.
 - Treat `L1` as safe-run validation, `L2` as bounded iteration, and `L3` as aggressive campaign mode.
+- Surface the level guide every time `/lab:auto` starts, and make the detailed guide mandatory when the user omits the level or mixes it with a paper layer, phase, or table target.
 - Reuse `/lab:run`, `/lab:iterate`, `/lab:review`, `/lab:report`, and optional `/lab:write` instead of inventing a second workflow.
 - Do not automatically change the research mission, paper-facing framing, or core claims.
 - You may add exploratory datasets, benchmarks, and comparison methods inside the approved exploration envelope.
@@ -40,6 +40,8 @@
 - Treat `/lab:auto` as an orchestration layer, not a replacement for existing `/lab:*` stages.
 - Treat `.lab/context/eval-protocol.md` as the source of truth for paper-facing metrics, metric glossary, table plan, gates, and structured experiment ladders.
 - Treat the evaluation protocol as source-backed, not imagination-backed: metric definitions, baseline behavior, comparison implementations, and deviations must come from recorded sources before they are used in gates or promotions.
+- Treat `Academic Validity Checks` and `Integrity self-check` as mandatory automation gates. Auto mode should not proceed, promote, or declare success while those fields are missing, stale, or contradicted by the current rung.
+- Treat `Sanity and Alternative-Explanation Checks` as the anomaly gate for automation. When a rung yields all-null outputs, suspiciously identical runs, no-op deltas, or impl/result mismatches, pause promotion logic until implementation reality checks, alternative explanations, and at least one cross-check are recorded.
 - Treat paper-template selection as an explicit write-time gate, not as a silent fallback, when the loop is about to create `.tex` deliverables for the first time.
 - The contract must declare `Autonomy level` and `Approval status`, and execution starts only when approval is explicitly set to `approved`.
 - The contract must also declare a concrete terminal goal:
@@ -74,6 +76,8 @@
 - `write` must produce LaTeX output under `<deliverables_root>/paper/`
 - Treat promotion as incomplete unless it writes back to `data-decisions.md`, `decisions.md`, `state.md`, and `session-brief.md`.
 - Do not stop or promote on the basis of a metric or comparison claim whose source-backed definition is missing from the approved evaluation protocol.
+- Before each rung and before each success, stop, or promotion decision, re-check the generic academic-risk questions: setting semantics, visibility/leakage, anchor or label policy, scale comparability, metric validity, comparison validity, statistical validity, claim boundary, and integrity self-check.
+- Before each success, stop, or promotion decision, also re-check the anomaly policy: whether anomaly signals fired, whether simpler explanations were ruled out, whether a cross-check was performed, and whether the current interpretation is still the narrowest supported one.
 
 ## Minimum Procedure
 
@@ -90,7 +94,15 @@
 
 ## Interaction Contract
 
+## Level Guide
+
+- `L1` = safe validation
+- `L2` = default recommended bounded iteration
+- `L3` = aggressive campaign
+- If you are unsure, choose `L2`.
+
 - Start with a concise summary of the objective, the frozen core, and the next automatic stage.
+- Always surface the level guide before execution.
 - If the contract is incomplete, ask one clarifying question at a time.
 - If multiple next actions are credible, present 2-3 bounded options with trade-offs before arming a long run.
 - Only ask for approval when the next step would leave the approved exploration envelope, exceed the chosen autonomy level, or materially change the frozen core.
@@ -100,8 +112,14 @@
 - Normalize ambiguous user requests before arming the loop.
 - Treat `Autonomy level L1/L2/L3` as execution privilege only.
 - Treat `Layer`, `Phase`, and `Table` references as paper-structure or experiment-scope targets, not as autonomy levels.
+- If the user does not name an autonomy level, or mixes it with a paper layer, phase, or table target, stop and deliver a detailed level guide before execution. That detailed guide should explain:
+- the typical use case for `L1`, `L2`, and `L3`
+- what kinds of modifications each level allows
+- what kind of stop boundary each level is meant for
+- that `L2` is the default recommendation when the user is unsure
+- Ask for one explicit level choice before arming the loop after that detailed guide.
 - Example:
-- `Layer 3
+- `Layer 3` means a paper layer or experiment target.
 - `Autonomy level L3` means the aggressive campaign permission envelope.
 - If the user mixes framework work and experiment work in one request, restate a normalized contract with:
 - objective
@@ -64,6 +64,10 @@ If the loop stops without success, record:
 - Do not change metric definitions, baseline semantics, or comparison implementations unless the approved evaluation protocol records both their sources and any deviations.
 - When you change ladders, sample sizes, or promotion gates, keep the resulting logic anchored to the source-backed evaluation protocol instead of ad-hoc chat reasoning.
 - Keep `.lab/context/eval-protocol.md` synchronized with the active benchmark scope, ladder gates, source-backed metric definitions, and any accepted implementation deviations instead of leaving it as a stale template.
+- Re-run the `Academic Validity Checks` and `Integrity self-check` whenever you change inputs, anchors, labels, metrics, comparisons, or promotion logic.
+- Re-run the `Sanity and Alternative-Explanation Checks` whenever a round produces anomaly signals, suspiciously unchanged results, impl/result mismatches, or other outcomes that could still have simpler explanations.
+- If a round reveals leakage risk, invalid scale comparisons, unsupported metric semantics, or overstated claim boundaries, treat that as a fatal methodological flaw instead of a normal failed iteration.
+- If anomaly signals remain unresolved after implementation reality checks and at least one cross-check, switch to diagnostic mode instead of continuing as if the interpretation were settled.
 
 ## Interaction Contract
 
@@ -50,11 +50,18 @@
 - Explain the selected primary and secondary metrics in plain language for the user: what each metric measures, whether higher or lower is better, and whether it is a main result metric or only a health/support metric.
 - If coverage, completeness, confidence, or similar health metrics appear, explicitly say that they describe experimental reliability rather than the main scientific effect.
 - Pull the core background references, method or baseline references, and metric references out of the approved evaluation protocol instead of hiding them in `.lab/context/*`.
+- Treat `report.md` as an external-review-ready memo. Source sections must not rely on local file paths or internal provenance notes; they must give a few human-readable anchor references instead.
 - Pull the approved method name and contribution bullets out of `.lab/context/terminology-lock.md` when that framing context exists; do not silently drop them from the collaborator-facing report.
 - Explain the method overview in collaborator language: what the method roughly does, what changed relative to the closest prior work or strongest baseline, what those prior methods do, and why they remain insufficient for the approved claim.
 - When citing prior work or baselines in the method overview, include only the few anchor references a collaborator needs, and summarize their role and limitation in one short line each.
 - Report only the few references a collaborator needs to orient themselves quickly; do not turn `report.md` into a full bibliography dump.
+- In `Background Sources`, `Method and Baseline Sources`, and `Metric Sources`, every anchor must include a citation line, one short line about what it established or measures, and one limitation or caveat.
+- Internal provenance belongs in `Artifact Status` or `.lab/context/evidence-index.md`, not in the external-review-ready source sections.
 - If the report depends on a deviation from an original metric or implementation, state that deviation explicitly instead of smoothing it over.
+- Carry the approved `Claim boundary` into the collaborator-facing report instead of implying broader validity than the protocol allows.
+- If the `Academic Validity Checks` or `Integrity self-check` sections are incomplete, contradictory, or obviously violated by the evidence, degrade the report instead of presenting it as collaborator-ready.
+- Carry the protocol's anomaly handling into the report: summarize any anomaly signals, what implementation checks were performed, which simpler explanations were ruled out, what cross-check was used, and why the current interpretation is still the best-supported one.
+- If anomaly signals remain unresolved, or the report cannot explain why the current interpretation beat the simpler alternatives, degrade the report instead of presenting it as collaborator-ready.
 - Before drafting the report, inspect `.lab/context/mission.md` and `.lab/context/eval-protocol.md` for skeletal template fields.
 - If either canonical context file is still skeletal, hydrate the smallest trustworthy version from frozen result artifacts, dataset decisions, evidence-index, and prior approved context, and write that back before finalizing the report.
 - If collaborator-critical fields still remain missing after hydration, downgrade the output to an `artifact-anchored interim report` instead of presenting it as a final collaborator-ready report.
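The package ships a `validate_collaborator_report.py` helper, whose contents are not shown in this diff. As a rough, hypothetical sketch only (the field names and report format below are invented, not taken from the package), a check that required template fields are present and non-blank in a markdown report could look like:

```python
import re

# Hypothetical field list; the real template and script may differ.
REQUIRED_FIELDS = [
    "Anomaly signals",
    "Alternative explanations considered",
    "Cross-check method",
    "Best-supported interpretation",
]

def find_empty_fields(report_text: str) -> list[str]:
    """Return the required fields that are missing or left blank in the report."""
    empty = []
    for field in REQUIRED_FIELDS:
        # Match a bullet like "- Cross-check method: <value>" and capture the value.
        match = re.search(rf"^- {re.escape(field)}:(.*)$", report_text, re.MULTILINE)
        if match is None or not match.group(1).strip():
            empty.append(field)
    return empty

report = """- Anomaly signals: none observed
- Alternative explanations considered:
- Best-supported interpretation: effect holds at small scale
"""
# A blank value and a missing bullet are both flagged.
print(find_empty_fields(report))  # ['Alternative explanations considered', 'Cross-check method']
```

Under this sketch, a non-empty result would trigger the "degrade the report" path rather than presenting it as collaborator-ready.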
@@ -24,6 +24,10 @@
 
 ## Reviewer Priorities
 
+- academic validity checks are missing, shallow, or contradicted by the actual setup
+- integrity self-check is missing or obviously ignored
+- anomaly signals are being rationalized away without concrete code, input-chain, or artifact checks
+- simpler alternative explanations were not examined before elevating the preferred interpretation
 - unfair or weak baselines
 - missing canonical baselines, strong historical baselines, recent strong public methods, or closest prior work without a justified omission
 - unrepresentative benchmark mix or missing classic-public versus recent-strong-public coverage
@@ -28,7 +28,11 @@
 - Fail fast on data, environment, or metric wiring problems.
 - Tie the run to the approved evaluation protocol, not just an ad-hoc chat goal.
 - Do not invent metric definitions, baseline behavior, or comparison implementations from memory; anchor them to the approved evaluation protocol and its recorded sources.
+- Treat `Academic Validity Checks` and `Integrity self-check` as preflight gates, not optional notes. Do not bless a run as protocol-valid until those fields are filled and still match the current experiment.
+- Treat `Sanity and Alternative-Explanation Checks` as a second preflight gate. If anomaly signals have fired and the implementation reality checks, alternative explanations, cross-check method, best-supported interpretation, or escalation threshold are still blank, do not bless the run as valid evidence.
 - If `.lab/context/eval-protocol.md` is still skeletal, write the smallest trustworthy version of the current evaluation objective, metric set, ladder, and source-backed implementation notes before treating the run as the new protocol anchor.
+- Refuse to treat a run as scientifically valid if the protocol has not answered the generic academic-risk questions: setting semantics, visibility/leakage, anchor or label policy, scale comparability, metric validity, comparison validity, statistical validity, and claim boundary.
+- Treat all-null outputs, suspiciously identical reruns, no-op deltas, and impl/result mismatches as diagnostic triggers first; check code paths and rule out simpler explanations before interpreting them as findings.
 - Record the exact launch command and output location.
 - Write durable run outputs, logs, and checkpoints under `results_root`.
 - Write figures or plots under `figures_root`.