superlab 0.1.12 → 0.1.14

package/README.md CHANGED
@@ -147,11 +147,13 @@ superlab doctor
  - `deliverables_root` placement
  - `paper_template_root` when configured
  - LaTeX-first paper output layout under `<deliverables_root>/paper/`
+ - `.lab/context/eval-protocol.md` completeness once evaluation planning has started
+ - source-backed evaluation protocol fields for metrics and comparison implementations once evaluation planning has started
  - boundary violations such as durable run outputs stored under `.lab/changes/*/runs`
 
  ## Auto Mode
 
- First fill `.lab/context/auto-mode.md` with the bounded contract, the per-stage commands, the stage output contracts, and the policy check commands for the campaign, then arm it for the current project:
+ First fill `.lab/context/eval-protocol.md` with the paper-facing evaluation objective, table plan, metric glossary, source references for metrics and comparison implementations, gates, and experiment ladder. Then fill `.lab/context/auto-mode.md` with the bounded contract, choose an `Autonomy level`, declare a concrete terminal goal (`rounds`, `metric-threshold`, or `task-completion`), leave `Approval status` as `draft` until you have reviewed the plan, then switch it to `approved` before you arm the current project:
 
  ```bash
  superlab auto start
@@ -169,16 +171,30 @@ Stop the current auto-mode run:
  superlab auto stop
  ```
 
- `/lab:auto` is an orchestration mode layered on top of approved execution stages. It reuses `run`, `iterate`, `review`, `report`, and optional `write` inside the limits defined by `.lab/context/auto-mode.md` and `.lab/context/auto-status.md`. `superlab auto start` runs the configured stage commands in the foreground, polls for completion, enforces `success/stop/promotion` check commands, guards the configured frozen core, and validates stage-specific contracts:
+ `/lab:auto` is an orchestration mode layered on top of approved execution stages. It reuses `run`, `iterate`, `review`, `report`, and optional `write` inside the limits defined by `.lab/context/eval-protocol.md`, `.lab/context/auto-mode.md`, and `.lab/context/auto-status.md`. `superlab auto start` only starts when the contract is explicitly approved, then stays in the foreground, polls long-running commands until they finish, follows the structured rung transitions declared in `.lab/context/eval-protocol.md`, guards the configured frozen core, writes `.lab/context/auto-outcome.md`, and validates stage-specific contracts. Metrics, baseline behavior, and comparison implementations are expected to be source-backed through the evaluation protocol before they are used in gates or promotions:
+
+ - `L1` is a safe run envelope for `run`, `review`, and `report`
+ - `L2` is the default bounded iteration envelope for `run`, `iterate`, `review`, and `report`
+ - `L3` is an aggressive campaign envelope that may also include `write`
 
  - `run` and `iterate` must change persistent outputs under `results_root`
  - `review` must update canonical review context
  - `report` must write `<deliverables_root>/report.md`
  - `write` must produce LaTeX output under `<deliverables_root>/paper/`
  - a successful promotion must write back into `.lab/context/data-decisions.md`, `.lab/context/decisions.md`, `.lab/context/state.md`, and `.lab/context/session-brief.md`
+ - every run must end with `.lab/context/auto-outcome.md`, including why it stopped, whether the terminal goal was reached, and which artifact is the final outcome
+ - when the evaluation protocol declares structured ladder rungs, each rung should declare `Stage`, `Goal`, `Command`, `Watch`, `Gate`, `On pass`, `On fail`, and `On stop`; auto records the current rung, watch target, and next rung in `.lab/context/auto-status.md`
 
  It does not replace manual `idea`, `data`, `framing`, or `spec` decisions.
 
+ Good `/lab:auto` input is explicit. Treat `Autonomy level L1/L2/L3` as execution privilege, and treat `paper layer`, `phase`, or `table` as experiment targets. If the workflow language is Chinese, summaries, checklist items, task labels, and progress updates should also stay in Chinese unless a literal identifier must remain unchanged.
+
+ Example:
+
+ ```text
+ /lab:auto Autonomy level L2. Objective: advance paper layer 3 organizer enforcement. Terminal goal: task-completion. Scope: bounded protocol, tests, minimal implementation, and one small run. Allowed modifications: evaluator prompt registry, ingestion, and parser only.
+ ```
+
 
  ## Version
 
  Show the CLI version and the current project asset version:
@@ -238,6 +254,7 @@ Stages should follow that file rather than guess language locally.
  - `/lab:data` turns the approved idea into a dataset and benchmark package with years, paper usage, source audit, download plans, explicit benchmark-role rationale for classic-public, recent-strong-public, and claim-specific benchmarks, and explicit comparison rationale for canonical baselines, strong historical baselines, recent strong public methods, and closest prior work.
  - `/lab:framing` locks paper-facing method names, module names, titles, and contribution wording before drafting.
  - `/lab:auto` orchestrates approved `run`, `iterate`, `review`, `report`, and optional `write` stages inside a bounded contract and can promote exploratory additions to the primary package when the promotion policy is satisfied.
+ - `/lab:run`, `/lab:iterate`, `/lab:auto`, and `/lab:report` should use source-backed metric and comparison definitions recorded in `.lab/context/eval-protocol.md` instead of inventing them from memory.
  - `/lab:spec` converts the approved idea into one `.lab/changes/<change-id>/` directory.
  - `/lab:run` executes a small-scale validation run and establishes the evaluation pipeline.
  - Durable run outputs should go under the configured `results_root`, not `.lab/changes/`.
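The ladder-rung contract described in the README changes above (each rung declaring `Stage`, `Goal`, `Command`, `Watch`, `Gate`, `On pass`, `On fail`, and `On stop`) can be sketched as a small completeness check. This is an editorial sketch, not code from the package: the plain-object rung shape and the `missingRungFields` helper are assumptions; only the field names come from the diff.

```javascript
// Hypothetical sketch: report which contract fields a parsed ladder rung
// is missing. Field names mirror the README's rung contract; the object
// shape and function name are illustrative assumptions.
const RUNG_FIELDS = [
  "Stage", "Goal", "Command", "Watch",
  "Gate", "On pass", "On fail", "On stop",
];

function missingRungFields(rung) {
  // A field counts as missing when it is absent or blank.
  return RUNG_FIELDS.filter(
    (field) => !(field in rung) || String(rung[field]).trim() === ""
  );
}

// A rung that forgot its failure handling:
const rung = {
  Stage: "run",
  Goal: "smoke-test the pipeline",
  Command: "superlab auto start",
  Watch: "results_root",
  Gate: "exit code 0",
  "On pass": "advance to next rung",
};
console.log(missingRungFields(rung)); // → [ 'On fail', 'On stop' ]
```

A check like this would let `superlab doctor` flag incomplete rungs before auto mode starts walking the ladder.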
package/README.zh-CN.md CHANGED
@@ -145,11 +145,13 @@ superlab doctor
  - `deliverables_root` 是否放在合理位置
  - `paper_template_root` 在配置时是否合法
  - `<deliverables_root>/paper/` 下是否仍然满足 LaTeX-first 输出约束
+ - `.lab/context/eval-protocol.md` 在评估规划已启动时是否完整
+ - 评估规划已启动后,指标与对比实现是否补齐了来源约束字段
  - 是否把长期 run 输出错误地堆在 `.lab/changes/*/runs`
 
  ## 自动模式
 
- 先填写 `.lab/context/auto-mode.md`,明确本次自治执行的边界契约、各阶段命令、阶段产物约束,以及 success/stop/promotion 的检查命令,再启动当前项目的自动模式:
+ 先填写 `.lab/context/eval-protocol.md`,把论文导向的评估目标、主表计划、指标释义、指标与对比实现的来源、gate 和实验阶梯写清楚。然后再填写 `.lab/context/auto-mode.md`,明确本次自治执行的边界契约、各阶段命令、阶段产物约束,以及 success/stop/promotion 的检查命令;同时选择 `Autonomy level`,声明一个明确的终止目标(`rounds`、`metric-threshold` 或 `task-completion`),保持 `Approval status: draft` 直到你审过这份契约,再改成 `approved` 后启动当前项目的自动模式:
 
  ```bash
  superlab auto start
@@ -167,16 +169,30 @@ superlab auto status
  superlab auto stop
  ```
 
- `/lab:auto` 是叠加在现有执行阶段之上的编排模式。它会在 `.lab/context/auto-mode.md` 和 `.lab/context/auto-status.md` 的约束下,复用 `run`、`iterate`、`review`、`report`,以及可选的 `write`。`superlab auto start` 会在前台执行这些已配置阶段命令、轮询完成情况,并真正执行 success/stop/promotion 检查命令,同时保护已声明的 frozen core,并校验各阶段的产物约束:
+ `/lab:auto` 是叠加在现有执行阶段之上的编排模式。它会在 `.lab/context/eval-protocol.md`、`.lab/context/auto-mode.md` 和 `.lab/context/auto-status.md` 的约束下,复用 `run`、`iterate`、`review`、`report`,以及可选的 `write`。`superlab auto start` 只会在契约被显式批准后启动,然后保持前台长驻、轮询长任务直到完成、按 `.lab/context/eval-protocol.md` 里声明的结构化 rung 转移继续推进,同时保护已声明的 frozen core、写出 `.lab/context/auto-outcome.md`,并校验各阶段的产物约束。指标、baseline 行为和对比方法实现必须先在评估协议里挂到来源上,才能拿来做 gate 或 promotion:
+
+ - `L1` 是安全运行级别,只允许 `run`、`review`、`report`
+ - `L2` 是默认推荐级别,允许 `run`、`iterate`、`review`、`report`
+ - `L3` 是激进 campaign 级别,才允许额外编排 `write`
 
  - `run` 和 `iterate` 必须更新 `results_root` 下的持久输出
  - `review` 必须更新规范的审查上下文
  - `report` 必须写出 `<deliverables_root>/report.md`
  - `write` 必须写出 `<deliverables_root>/paper/` 下的 LaTeX 论文产物
  - promotion 成功后必须写回 `.lab/context/data-decisions.md`、`.lab/context/decisions.md`、`.lab/context/state.md` 和 `.lab/context/session-brief.md`
+ - 每次运行都必须写出 `.lab/context/auto-outcome.md`,记录为什么停止、是否达到终止目标,以及哪一个工件是最终结果
+ - 如果评估协议里声明了结构化 rung,每个 rung 都应声明 `Stage`、`Goal`、`Command`、`Watch`、`Gate`、`On pass`、`On fail`、`On stop`;auto 会把当前 rung、监视目标和下一 rung 写进 `.lab/context/auto-status.md`
 
  它不会替代手动的 `idea`、`data`、`framing`、`spec` 决策。
 
+ 好的 `/lab:auto` 输入应该显式写清。把 `Autonomy level L1/L2/L3` 当成执行权限级别,把 `paper layer`、`phase`、`table` 当成实验目标,不要混用。如果 workflow language 是中文,摘要、清单条目、任务标签和进度更新也应保持中文,除非某个字面标识符必须保持原样。
+
+ 示例:
+
+ ```text
+ /lab:auto 自治级别 L2。目标:推进 paper layer 3 的 organizer enforcement。终止条件:完成 bounded protocol、测试、最小实现和一轮小规模结果。允许修改:evaluator prompt registry、ingestion、parser。
+ ```
+
 
  ## 版本查询
 
  查看当前 CLI 版本和当前目录项目的资产版本:
@@ -236,6 +252,7 @@ superlab init --lang en
  - `/lab:data` 把已批准的 idea 收敛成数据集与 benchmark 方案,要求记录年份、使用论文、来源审计、下载计划,并明确 classic-public、recent-strong-public、claim-specific 三类 benchmark 的纳入理由,以及 canonical baselines、strong historical baselines、recent strong public methods、closest prior work 四类对比方法的纳入理由。
  - `/lab:framing` 在正式写作前收紧方法名、模块名、论文题目和 contribution wording。
  - `/lab:auto` 在已批准边界内编排 `run`、`iterate`、`review`、`report` 和可选 `write`,并在升格策略满足时允许把 exploratory additions 自动升级为 primary package。
+ - `/lab:run`、`/lab:iterate`、`/lab:auto`、`/lab:report` 都应优先使用 `.lab/context/eval-protocol.md` 里带来源的指标定义和对比实现说明,而不是凭记忆现想。
  - `/lab:spec` 把批准后的方案转换成一个统一的 `.lab/changes/<change-id>/` 目录。
  - `/lab:run` 执行最小可运行实验,并建立首版评估链路。
  - 持久 run 输出应写到 `results_root`,不要写进 `.lab/changes/`。
package/bin/superlab.cjs CHANGED
@@ -15,6 +15,7 @@ const {
   pruneContext,
   refreshContext,
  } = require("../lib/context.cjs");
+ const { validateEvalProtocol } = require("../lib/eval_protocol.cjs");
  const {
   getAutoStatus,
   startAutoMode,
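The new `lib/eval_protocol.cjs` module itself is not shown in this diff, so the shape of `validateEvalProtocol` is unknown. As a hedged illustration of the kind of completeness check the doctor wiring implies, a section-presence validator might look like this; the section headings, function name, and return shape below are all assumptions, not the package's API.

```javascript
// Hypothetical sketch of an eval-protocol completeness check. The real
// validateEvalProtocol in lib/eval_protocol.cjs may differ; section names
// here are guesses based on the README's list of protocol contents.
const REQUIRED_SECTIONS = [
  "## Evaluation objective",
  "## Table plan",
  "## Metric glossary",
  "## Experiment ladder",
];

function evalProtocolIssues(markdown) {
  // One issue string per missing section; empty array means complete.
  return REQUIRED_SECTIONS
    .filter((heading) => !markdown.includes(heading))
    .map((heading) => `eval-protocol.md missing section: ${heading}`);
}

const draft = "## Evaluation objective\nTBD\n\n## Metric glossary\nTBD\n";
console.log(evalProtocolIssues(draft).length); // → 2
```

Returning an array of issue strings matches how the doctor output concatenates `evalProtocolIssues` with the other validators' results.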
@@ -566,6 +567,9 @@ function printAutoStatus(options) {
   console.log(`objective: ${mode.objective || "TBD"}`);
   console.log(`allowed stages: ${mode.allowedStages.join(", ") || "TBD"}`);
   console.log(`current stage: ${status.currentStage || "TBD"}`);
+  console.log(`current rung: ${status.currentRung || "TBD"}`);
+  console.log(`watch target: ${status.watchTarget || "TBD"}`);
+  console.log(`next rung: ${status.nextRung || "TBD"}`);
   console.log(`decision: ${status.decision || "TBD"}`);
   console.log(`issues: ${issues.length > 0 ? issues.join(" | ") : "none"}`);
  }
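The rung fields printed above presumably come from parsing `.lab/context/auto-status.md`. The file's actual format is not shown in this diff, so as a hedged sketch only: if the status file used simple `key: value` lines, a minimal reader could look like the following (the format, function name, and key names are assumptions).

```javascript
// Hypothetical sketch: read "field: value" lines from an auto-status style
// file into an object like the one printAutoStatus consumes. The real
// auto-status.md format may be richer than this.
function parseStatusFields(text) {
  const status = {};
  for (const line of text.split("\n")) {
    const match = line.match(/^([A-Za-z ]+):\s*(.*)$/);
    if (match) status[match[1].trim()] = match[2].trim();
  }
  return status;
}

const status = parseStatusFields(
  "current rung: R2\nwatch target: results_root/r2\nnext rung: R3"
);
console.log(status["current rung"]); // → R2
```

Defaulting each printed field with `|| "TBD"` then keeps the status output readable even when a run has not yet reached the ladder.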
@@ -739,6 +743,7 @@ function printDoctor(options) {
   ".lab/context/evidence-index.md",
   ".lab/context/open-questions.md",
   ".lab/context/data-decisions.md",
+  ".lab/context/eval-protocol.md",
   ".lab/context/auto-mode.md",
   ".lab/context/auto-status.md",
   ".lab/context/terminology-lock.md",
@@ -753,6 +758,7 @@ function printDoctor(options) {
   const deliverableIssues = validateDeliverables(options.targetDir, config);
   const templateIssues = validatePaperTemplateRoot(options.targetDir, config);
   const dataDecisionIssues = validateDataDecisions(options.targetDir);
+  const evalProtocolIssues = validateEvalProtocol(options.targetDir);
   const rootIssues = validateProjectRoots(options.targetDir, config);
   const autoStatus = getAutoStatus({ targetDir: options.targetDir });
   const autoIssues = autoStatus.issues;
@@ -770,6 +776,7 @@ function printDoctor(options) {
   deliverableIssues.length > 0 ||
   templateIssues.length > 0 ||
   dataDecisionIssues.length > 0 ||
+  evalProtocolIssues.length > 0 ||
   rootIssues.length > 0 ||
   autoIssues.length > 0
  ) {
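As a design note on the hunk above: each new validator grows this `||` chain by one line. The same decision can be expressed over a list of issue arrays, which is equivalent and scales without touching the conditional. The helper below is an illustration, not package code; only the variable names mirror the diff.

```javascript
// Sketch: the expanding "any issues?" chain in printDoctor is equivalent
// to asking whether any validator returned a non-empty issue list.
function hasAnyIssues(issueLists) {
  return issueLists.some((issues) => issues.length > 0);
}

const deliverableIssues = [];
const evalProtocolIssues = ["eval-protocol.md missing section: ## Table plan"];
console.log(hasAnyIssues([deliverableIssues, evalProtocolIssues])); // → true
```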
@@ -780,7 +787,13 @@ function printDoctor(options) {
   console.log(`language: ${projectInfo.lang}`);
   console.log(`missing: ${missing.length > 0 ? missing.join(", ") : "none"}`);
   console.log(`config: ${configIssues.length > 0 ? configIssues.join(" | ") : "none"}`);
-  const outputIssues = deliverableIssues.concat(templateIssues, dataDecisionIssues, rootIssues, autoIssues);
+  const outputIssues = deliverableIssues.concat(
+   templateIssues,
+   dataDecisionIssues,
+   evalProtocolIssues,
+   rootIssues,
+   autoIssues
+  );
   console.log(`outputs: ${outputIssues.length > 0 ? outputIssues.join(" | ") : "none"}`);
   return;
  }
@@ -907,6 +920,35 @@ async function main() {
   console.log(`auto mode ${verb} in ${options.targetDir}`);
   console.log(`objective: ${result.mode.objective}`);
   console.log(`stages executed: ${result.executedStages.join(", ")}`);
+  console.log(`goal type: ${result.outcome.goalType}`);
+  console.log(`goal target: ${result.outcome.goalTarget}`);
+  if (result.outcome.experimentLadder) {
+   console.log(`experiment ladder: ${result.outcome.experimentLadder}`);
+  }
+  if (result.outcome.metricGlossary) {
+   console.log(`metric glossary: ${result.outcome.metricGlossary}`);
+  }
+  if (result.outcome.metricSourcePapers) {
+   console.log(`metric sources: ${result.outcome.metricSourcePapers}`);
+  }
+  if (result.outcome.metricImplementationSource) {
+   console.log(`metric implementation source: ${result.outcome.metricImplementationSource}`);
+  }
+  if (result.outcome.comparisonSourcePapers) {
+   console.log(`comparison source papers: ${result.outcome.comparisonSourcePapers}`);
+  }
+  if (result.outcome.comparisonImplementationSource) {
+   console.log(`comparison implementation source: ${result.outcome.comparisonImplementationSource}`);
+  }
+  if (result.outcome.deviationFromOriginalImplementation) {
+   console.log(`deviation from original implementation: ${result.outcome.deviationFromOriginalImplementation}`);
+  }
+  console.log(`goal reached: ${result.outcome.goalReached ? "yes" : "no"}`);
+  console.log(`stop reason: ${result.outcome.stopReason}`);
+  console.log(`promotion applied: ${result.outcome.promotionApplied ? "yes" : "no"}`);
+  console.log(`final artifact: ${result.outcome.finalArtifact}`);
+  console.log(`final rung: ${result.outcome.finalRung || "TBD"}`);
+  console.log(`outcome: .lab/context/auto-outcome.md`);
   console.log(`status: ${result.status.status}`);
   return;
  }
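The hunk above repeats the same guard seven times: print a labeled line only when the outcome carries the field. That pattern can be factored into a small helper; the sketch below is an editorial illustration of the idea, not the package's code, and the `formatOutcomeLines` name is an assumption.

```javascript
// Sketch: emit "label: value" lines only for outcome fields that are set,
// mirroring the conditional console.log blocks in the diff above.
function formatOutcomeLines(outcome, fields) {
  return fields
    .filter(([key]) => outcome[key])
    .map(([key, label]) => `${label}: ${outcome[key]}`);
}

const lines = formatOutcomeLines(
  { metricGlossary: "glossary.md", comparisonSourcePapers: "refs.bib" },
  [
    ["metricGlossary", "metric glossary"],
    ["metricSourcePapers", "metric sources"],
    ["comparisonSourcePapers", "comparison source papers"],
  ]
);
console.log(lines);
// → [ 'metric glossary: glossary.md', 'comparison source papers: refs.bib' ]
```

Inline `if` blocks keep the diff explicit, while a table-driven helper like this would make adding future outcome fields a one-line change.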