superlab 0.1.12 → 0.1.13
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +11 -2
- package/README.zh-CN.md +11 -2
- package/bin/superlab.cjs +43 -1
- package/lib/auto.cjs +14 -972
- package/lib/auto_common.cjs +129 -0
- package/lib/auto_contracts.cjs +387 -0
- package/lib/auto_runner.cjs +830 -0
- package/lib/auto_state.cjs +227 -0
- package/lib/context.cjs +94 -0
- package/lib/eval_protocol.cjs +236 -0
- package/lib/i18n.cjs +125 -11
- package/lib/install.cjs +26 -6
- package/package-assets/claude/commands/lab/auto.md +1 -1
- package/package-assets/claude/commands/lab.md +2 -1
- package/package-assets/codex/prompts/lab-auto.md +1 -1
- package/package-assets/codex/prompts/lab.md +2 -1
- package/package-assets/shared/lab/context/auto-mode.md +7 -0
- package/package-assets/shared/lab/context/auto-outcome.md +28 -0
- package/package-assets/shared/lab/context/auto-status.md +3 -0
- package/package-assets/shared/lab/context/eval-protocol.md +46 -0
- package/package-assets/shared/skills/lab/SKILL.md +12 -1
- package/package-assets/shared/skills/lab/stages/auto.md +31 -7
- package/package-assets/shared/skills/lab/stages/iterate.md +4 -0
- package/package-assets/shared/skills/lab/stages/report.md +4 -0
- package/package-assets/shared/skills/lab/stages/run.md +4 -1
- package/package.json +1 -1
@@ -16,9 +16,11 @@
 - `.lab/context/decisions.md`
 - `.lab/context/data-decisions.md`
 - `.lab/context/evidence-index.md`
+- `.lab/context/eval-protocol.md`
 - `.lab/context/terminology-lock.md`
 - `.lab/context/auto-mode.md`
 - `.lab/context/auto-status.md`
+- `.lab/context/auto-outcome.md`
 
 ## Context Write Set
 
@@ -29,16 +31,34 @@
 - `.lab/context/summary.md`
 - `.lab/context/session-brief.md`
 - `.lab/context/auto-status.md`
+- `.lab/context/auto-outcome.md`
 
 ## Boundary Rules
 
 - Treat `/lab:auto` as an orchestration layer, not a replacement for existing `/lab:*` stages.
+- Treat `.lab/context/eval-protocol.md` as the source of truth for paper-facing metrics, metric glossary, table plan, gates, and structured experiment ladders.
+- Treat the evaluation protocol as source-backed, not imagination-backed: metric definitions, baseline behavior, comparison implementations, and deviations must come from recorded sources before they are used in gates or promotions.
+- The contract must declare `Autonomy level` and `Approval status`, and execution starts only when approval is explicitly set to `approved`.
+- The contract must also declare a concrete terminal goal:
+  - `rounds`
+  - `metric-threshold`
+  - `task-completion`
+- The contract must provide both `Terminal goal target` and `Required terminal artifact`.
+- Recommended level meanings:
+  - `L1`: safe run validation over `run`, `review`, and `report`
+  - `L2`: bounded iteration over `run`, `iterate`, `review`, and `report`
+  - `L3`: aggressive campaign that may also include `write`
 - Default allowed stages are `run`, `iterate`, `review`, and `report`. Only include `write` when framing is already approved and manuscript drafting is within scope.
 - Do not automatically change the research mission, paper-facing framing, or core claims.
 - You may add exploratory datasets, benchmarks, and comparison methods inside the exploration envelope.
 - You may promote exploratory additions to the primary package only when the contract's promotion policy is satisfied and the promotion is written back into `data-decisions.md`, `decisions.md`, `state.md`, and `session-brief.md`.
 - Poll long-running commands until they finish, hit a timeout, or hit a stop condition.
 - Keep a poll-based waiting loop instead of sleeping blindly.
+- Always write a canonical `.lab/context/auto-outcome.md` when the run completes, stops, or fails.
+- When the evaluation protocol declares structured ladder rungs, execute them as a foreground rung state machine:
+  - each rung must declare `Stage`, `Goal`, `Command`, `Watch`, `Gate`, `On pass`, `On fail`, and `On stop`
+  - keep the session alive while the current rung is running
+  - write the current rung, watch target, and next rung to `.lab/context/auto-status.md`
 - Reuse the existing `/lab:run`, `/lab:iterate`, `/lab:review`, `/lab:report`, and optional `/lab:write` contracts instead of inventing a parallel workflow.
 - Enforce stage contracts, not just exit codes:
   - `run` and `iterate` must change persistent outputs under `results_root`
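The rung state machine described in these boundary rules (each rung declaring a gate plus `On pass`, `On fail`, and `On stop` transitions) can be sketched roughly as follows. This is an editorial illustration, not code from the package: the field names `id`, `onPass`, `onFail`, `onStop` and the functions `nextRung`/`runLadder` are assumed mappings of the documented rung fields.

```javascript
// Illustrative sketch of a foreground rung state machine, assuming each rung
// object mirrors the declared fields (Gate, On pass, On fail, On stop).
function nextRung(rung, gateResult) {
  // gateResult is "pass" | "fail" | "stop", produced by evaluating the rung's gate
  if (gateResult === "pass") return rung.onPass; // e.g. the next rung id
  if (gateResult === "fail") return rung.onFail; // e.g. retry or escalate
  return rung.onStop;                            // e.g. write auto-outcome and halt
}

function runLadder(rungs, evaluateGate) {
  const visited = [];
  let id = rungs[0].id;
  while (id && rungs.some((r) => r.id === id)) {
    const rung = rungs.find((r) => r.id === id);
    visited.push(rung.id); // would also be written to .lab/context/auto-status.md
    id = nextRung(rung, evaluateGate(rung));
  }
  return visited; // rung ids executed, in order
}
```

A passing gate walks the ladder forward; a failing or stopping gate hands control to the declared fallback instead of continuing blindly.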
@@ -46,20 +66,24 @@
   - `report` must produce `<deliverables_root>/report.md`
   - `write` must produce LaTeX output under `<deliverables_root>/paper/`
 - Treat promotion as incomplete unless it writes back to `data-decisions.md`, `decisions.md`, `state.md`, and `session-brief.md`.
+- Do not stop or promote on the basis of a metric or comparison claim whose source-backed definition is missing from the approved evaluation protocol.
 
 ## Minimum Procedure
 
 1. Validate the auto-mode contract
-2.
-3.
-4.
-5.
-6.
-7.
+2. Confirm the approved autonomy level matches the requested stage envelope
+3. Set or refresh auto-status
+4. Choose the next allowed `/lab` stage or structured ladder rung
+5. Launch the bounded action
+6. Poll for process completion, checkpoint movement, or summary generation while keeping the session alive
+7. Evaluate the declared rung gate and transition to the next rung when structured ladder mode is active
+8. Evaluate the declared terminal goal semantics at the correct boundary
+9. Evaluate stop, success, and promotion checks at the correct boundary
+10. Write auto-outcome and decide continue, promote, stop, or escalate
 
 ## Interaction Contract
 
 - Start with a concise summary of the objective, the frozen core, and the next automatic stage.
 - If the contract is incomplete, ask one clarifying question at a time.
 - If multiple next actions are credible, present 2-3 bounded options with trade-offs before arming a long run.
-- Only ask for approval when the next step would leave the approved exploration envelope or materially change the frozen core.
+- Only ask for approval when the next step would leave the approved exploration envelope, exceed the chosen autonomy level, or materially change the frozen core.
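Step 8 above, evaluating the declared terminal goal at a boundary, can be sketched as a small dispatch over the three documented goal kinds (`rounds`, `metric-threshold`, `task-completion`). This is an illustrative assumption about shape, not the package's schema: the property names `kind`, `target`, `round`, `metric`, `taskDone`, and `artifactExists` are invented for the example, and the required-artifact check stands in for the contract's `Required terminal artifact`.

```javascript
// Illustrative sketch: a terminal goal is only met when its declared semantics
// hold AND the required terminal artifact exists. All names are hypothetical.
function terminalGoalMet(goal, state) {
  if (!state.artifactExists) return false; // required terminal artifact missing
  switch (goal.kind) {
    case "rounds":
      return state.round >= goal.target;
    case "metric-threshold":
      return state.metric !== undefined && state.metric >= goal.target;
    case "task-completion":
      return state.taskDone === true;
    default:
      throw new Error(`unknown terminal goal kind: ${goal.kind}`);
  }
}
```

Checking the artifact first keeps a "goal met" decision from being reachable without the deliverable the contract demands.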
@@ -8,6 +8,7 @@ Declare and keep fixed:
 - baseline
 - primary metric
 - success threshold
+- evaluation ladder and benchmark expansion gates
 - verification commands
 - completion_promise
 - maximum iteration count
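The "declare and keep fixed" list in the hunk above implies an iterate loop whose stopping logic reads only frozen contract fields. A minimal sketch, assuming hypothetical field names and values ("top1_acc", 0.76, etc. are placeholders, not anything from the package):

```javascript
// Illustrative sketch: freeze the iterate contract up front, then decide each
// iteration purely from the frozen threshold and iteration cap.
const contract = Object.freeze({
  primaryMetric: "top1_acc",  // hypothetical metric name
  successThreshold: 0.76,     // hypothetical frozen threshold
  maxIterations: 8,           // hypothetical frozen cap
});

function iterateDecision(contract, iteration, bestMetric) {
  if (bestMetric >= contract.successThreshold) return "success";
  if (iteration >= contract.maxIterations) return "stop";
  return "continue";
}
```

`Object.freeze` is only a gesture at the rule; the substantive point is that the threshold and cap are declared once and never edited mid-loop.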
@@ -19,6 +20,7 @@ Declare and keep fixed:
 - `.lab/context/decisions.md`
 - `.lab/context/evidence-index.md`
 - `.lab/context/data-decisions.md`
+- `.lab/context/eval-protocol.md`
 - `.lab/config/workflow.json`
 
 ## Context Write Set
@@ -58,6 +60,8 @@ If the loop stops without success, record:
 - Keep durable run outputs, logs, and checkpoints under `results_root`.
 - Keep figures or plots under `figures_root`.
 - Do not accumulate long-lived results under `.lab/changes/<change-id>/runs`.
+- Do not change metric definitions, baseline semantics, or comparison implementations unless the approved evaluation protocol records both their sources and any deviations.
+- When you change ladders, sample sizes, or promotion gates, keep the resulting logic anchored to the source-backed evaluation protocol instead of ad-hoc chat reasoning.
 
 ## Interaction Contract
 
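The source-backing rule added in the hunk above (no gating or promotion on a metric without a recorded source) amounts to a lookup-and-refuse check. A sketch under assumed names: the protocol shape, `metrics` map, and `source` field are invented for illustration.

```javascript
// Illustrative sketch: before a metric is used in a gate or promotion, require
// that the approved evaluation protocol records a source for its definition.
function assertSourceBacked(protocol, metricName) {
  const entry = protocol.metrics && protocol.metrics[metricName];
  if (!entry || !entry.source) {
    throw new Error(
      `metric "${metricName}" has no source-backed definition in eval-protocol`
    );
  }
  return entry; // safe to use in gates and promotions
}
```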
@@ -17,6 +17,7 @@
 - `.lab/context/decisions.md`
 - `.lab/context/evidence-index.md`
 - `.lab/context/data-decisions.md`
+- `.lab/context/eval-protocol.md`
 - `.lab/context/terminology-lock.md`
 
 ## Context Write Set
@@ -28,6 +29,9 @@
 
 - Do not hide failed iterations.
 - Tie every major claim to recorded summaries or iteration artifacts.
+- Structure tables, gates, and main claims against the approved evaluation protocol.
+- Do not restate metric definitions, baseline behavior, or comparison implementations from memory; use the approved evaluation protocol and its recorded sources.
+- If the report depends on a deviation from an original metric or implementation, state that deviation explicitly instead of smoothing it over.
 - Prefer conservative interpretation over marketing language.
 - Leave a clear handoff path into `/lab:write` with evidence links that section drafts can cite.
 
@@ -12,6 +12,7 @@
 - `.lab/context/mission.md`
 - `.lab/context/state.md`
 - `.lab/context/data-decisions.md`
+- `.lab/context/eval-protocol.md`
 - `.lab/config/workflow.json`
 
 ## Context Write Set
@@ -23,6 +24,8 @@
 
 - Prefer the smallest experiment that exercises the full pipeline.
 - Fail fast on data, environment, or metric wiring problems.
+- Tie the run to the approved evaluation protocol, not just an ad-hoc chat goal.
+- Do not invent metric definitions, baseline behavior, or comparison implementations from memory; anchor them to the approved evaluation protocol and its recorded sources.
 - Record the exact launch command and output location.
 - Write durable run outputs, logs, and checkpoints under `results_root`.
 - Write figures or plots under `figures_root`.
@@ -34,7 +37,7 @@
 2. Register the run
 3. Execute the smallest useful experiment
 4. Normalize raw metrics
-5. Validate the normalized summary
+5. Validate the normalized summary against the active evaluation protocol
 
 ## Interaction Contract
 