npm - cclaw-cli - Versions diffs - 0.22.0 → 0.23.1 - Mend

cclaw-cli 0.22.0 → 0.23.1

This diff shows the content of publicly available package versions released to one of the supported registries. It is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (33)

package/dist/cli.d.ts +2 -0
package/dist/cli.js +42 -11
package/dist/constants.d.ts +4 -4
package/dist/constants.js +4 -4
package/dist/content/eval-scaffold.d.ts +4 -4
package/dist/content/eval-scaffold.js +13 -14
package/dist/content/examples.js +11 -11
package/dist/content/hooks.js +1 -1
package/dist/content/skills.d.ts +3 -3
package/dist/content/skills.js +19 -19
package/dist/content/stage-schema.js +2 -2
package/dist/content/stages/plan.js +18 -18
package/dist/content/stages/schema-types.d.ts +2 -2
package/dist/content/stages/tdd.js +1 -1
package/dist/content/subagents.js +1 -1
package/dist/content/templates.js +8 -8
package/dist/content/utility-skills.js +19 -19
package/dist/doctor.js +2 -2
package/dist/eval/baseline.d.ts +14 -0
package/dist/eval/baseline.js +209 -0
package/dist/eval/corpus.d.ts +13 -2
package/dist/eval/corpus.js +97 -13
package/dist/eval/llm-client.d.ts +10 -10
package/dist/eval/llm-client.js +5 -5
package/dist/eval/report.js +17 -4
package/dist/eval/runner.d.ts +8 -16
package/dist/eval/runner.js +124 -42
package/dist/eval/types.d.ts +94 -14
package/dist/eval/verifiers/structural.d.ts +14 -0
package/dist/eval/verifiers/structural.js +171 -0
package/dist/install.js +3 -3
package/dist/policy.js +1 -1
package/package.json +1 -1

package/dist/content/templates.js CHANGED

@@ -309,23 +309,23 @@ inputs_hash: sha256:pending
 ## Dependency Graph
 -
-## Dependency Waves
+## Dependency Batches
-### Wave 1 (foundation)
+### Batch 1 (foundation)
 - Task IDs:
 - Verification gate:
-### Wave 2 (dependent)
+### Batch 2 (dependent)
 - Task IDs:
 - Depends on:
 - Verification gate:
-### Wave 3 (integration)
+### Batch 3 (integration)
 - Task IDs:
 - Depends on:
 - Verification gate:
-Execution rule: complete and verify each wave before starting the next wave.
+Execution rule: complete and verify each batch before starting the next batch.
 ## Task List
@@ -333,7 +333,7 @@ Execution rule: complete and verify each wave before starting the next wave.
 - Every task fits the **2-5 minute budget**. If \`[~Nm]\` is >5, split the task.
 - **No placeholders.** Forbidden tokens anywhere in this table: \`TODO\`, \`TBD\`, \`FIXME\`, \`<fill-in>\`, \`<your-*-here>\`, \`xxx\`, bare ellipsis. Every file path, test, and verification command must be copy-pasteable as written.
 - **No silent scope reduction.** Forbidden phrasing when locked decisions exist: \`v1\`, \`for now\`, \`later\`, \`temporary\`, \`placeholder\`, \`mock for now\`, \`hardcoded for now\`, \`will improve later\`.
-- If an estimate is genuinely uncertain (new library, unfamiliar subsystem), add a **spike task in wave 0** to de-risk — do NOT hide the uncertainty inside a large estimate.
+- If an estimate is genuinely uncertain (new library, unfamiliar subsystem), add a **spike task in batch 0** to de-risk — do NOT hide the uncertainty inside a large estimate.
 | Task ID | Description | Acceptance criterion | Verification command | Effort (S/M/L) | Minutes |
 |---|---|---|---|---|---|
@@ -350,12 +350,12 @@ Execution rule: complete and verify each wave before starting the next wave.
 | D-01 | 02-scope.md > Locked Decisions | T-1 | covered |
 ## Risk Assessment
-| Task/Wave | Risk | Likelihood | Impact | Mitigation |
+| Task/Batch | Risk | Likelihood | Impact | Mitigation |
 |---|---|---|---|---|
 |  |  |  |  |  |
 ## Boundary Map
-| Task/Wave | Produces (exports) | Consumes (imports from) |
+| Task/Batch | Produces (exports) | Consumes (imports from) |
 |---|---|---|
 |  |  |  |

package/dist/content/utility-skills.js CHANGED

@@ -482,7 +482,7 @@ description: "Execute approved plans with disciplined batching, explicit checkpo
 ## Quick Start
 > 1. Confirm the plan and stage gates are approved before execution.
-> 2. Execute in batches (waves), not as one giant untracked stream.
+> 2. Execute in batches, not as one giant untracked stream.
 > 3. Stop at checkpoint boundaries for verification and user visibility.
 ## HARD-GATE
@@ -492,47 +492,47 @@ Do not start implementation execution without an approved plan artifact and expl
 ## Execution Protocol
 1. **Load plan source of truth** from \`.cclaw/artifacts/05-plan.md\` (canonical run copy when available).
-2. **Group tasks into waves** by dependency order and risk.
-3. **Run one wave at a time** with evidence after each task (tests, build, lint, or review evidence as applicable).
-4. **Checkpoint each wave** by updating stage artifact evidence and unresolved blockers.
+2. **Group tasks into batches** by dependency order and risk.
+3. **Run one batch at a time** with evidence after each task (tests, build, lint, or review evidence as applicable).
+4. **Checkpoint each batch** by updating stage artifact evidence and unresolved blockers.
 5. **Stop immediately** on any hard blocker, failing gate, or unresolved critical finding.
-## Wave Checklist
+## Batch Checklist
-- Wave scope is explicit (task IDs + expected outputs).
+- Batch scope is explicit (task IDs + expected outputs).
 - Verification command for each task is predetermined.
 - Machine-only checks are delegated to subagents when supported.
 - User approvals are requested only at required gate boundaries.
-## Fresh Context Protocol (between waves)
+## Fresh Context Protocol (between batches)
-After a wave completes — especially after long agent turns — context drift is
-the #1 cause of degraded execution quality. Before starting the **next wave**,
+After a batch completes — especially after long agent turns — context drift is
+the #1 cause of degraded execution quality. Before starting the **next batch**,
 prefer a **fresh agent context** over continuing in a saturated session:
-1. **Snapshot wave outcome** — append a short summary to the plan artifact
-   (\`### Wave <N> outcome\` with: tasks done, evidence files, blockers, next-wave inputs).
+1. **Snapshot batch outcome** — append a short summary to the plan artifact
+   (\`### Batch <N> outcome\` with: tasks done, evidence files, blockers, next-batch inputs).
 2. **Capture handoff facts** — the minimum information the next agent needs:
    - Stage and run id (from \`.cclaw/state/flow-state.json\`)
    - List of completed task IDs from the plan
    - Open blockers / failing gates by name
-   - File paths the next wave will touch (no full diffs)
+   - File paths the next batch will touch (no full diffs)
 3. **Decide: continue or rotate**
-   - **Rotate** (start a new agent session) when: prior wave consumed > ~50% of the context budget, the prior wave required deep investigation that the next wave does not need, or you are about to cross a stage boundary.
-   - **Continue** when: next wave is a tiny follow-up (≤ 1 task) and the prior context is directly relevant.
+   - **Rotate** (start a new agent session) when: prior batch consumed > ~50% of the context budget, the prior batch required deep investigation that the next batch does not need, or you are about to cross a stage boundary.
+   - **Continue** when: next batch is a tiny follow-up (≤ 1 task) and the prior context is directly relevant.
 4. **Resume** in the new session via \`/cc-next\` — the session-start hook will restore flow state, checkpoint, and digest automatically.
-This is the same intuition as Compound Engineering's "fresh context per iteration": every wave starts with a clean, intentionally-loaded context, not a degraded carry-over.
+This is the same intuition as Compound Engineering's "fresh context per iteration": every batch starts with a clean, intentionally-loaded context, not a degraded carry-over.
 ### Handoff template (paste into next session)
 \`\`\`markdown
-## Wave <N> handoff
+## Batch <N> handoff
 - Stage: <stage>
 - Run: <runId>
 - Completed task IDs: <list>
 - Blockers: <list or none>
-- Files next wave will touch: <list>
+- Files next batch will touch: <list>
 - Verification command(s) used: <list>
 \`\`\`
@@ -542,7 +542,7 @@ This is the same intuition as Compound Engineering's "fresh context per iteratio
 - Marking tasks done without command evidence.
 - Reordering critical dependencies for speed.
 - Continuing after a gate failure hoping later tasks fix it.
-- Carrying a saturated context across wave boundaries because "it has all the history" — saturated context is a liability, not an asset.
+- Carrying a saturated context across batch boundaries because "it has all the history" — saturated context is a liability, not an asset.
 `;
 }
 export function contextEngineeringSkill() {
@@ -1338,7 +1338,7 @@ For each lens, write either a knowledge entry **or** the explicit string
 ### 2. What slowed us down?
-- Repeated context loss between waves → \`[compound]\` accelerator.
+- Repeated context loss between batches → \`[compound]\` accelerator.
 - Re-derivation of a fact already in upstream artifacts → \`[pattern]\` "re-read X first".
 - Tooling friction (slow test loop, flaky CI) → \`[compound]\` follow-up.

package/dist/doctor.js CHANGED

@@ -283,8 +283,8 @@ export async function doctorChecks(projectRoot, options = {}) {
             const skillContent = await fs.readFile(skillPath, "utf8");
             const lineCount = skillContent.split("\n").length;
             const MIN_SKILL_LINES = 110;
-            // Soft max tightened in wave 3 from 650 → 500 after externalising the
-            // TDD wave-execution walkthrough and collapsing the duplicate "what
+            // Soft max tightened from 650 → 500 after externalising the TDD
+            // batch-execution walkthrough and collapsing the duplicate "what
             // goes wrong" lists. Stage skills beyond 500 lines drift into unread
             // bloat; long-form content belongs under `.cclaw/references/` instead.
             const MAX_SKILL_LINES = 500;

package/dist/eval/baseline.d.ts ADDED

@@ -0,0 +1,14 @@
+import type { FlowStage } from "../types.js";
+import type { BaselineDelta, BaselineSnapshot, EvalReport } from "./types.js";
+export declare const BASELINE_SCHEMA_VERSION = 1;
+export declare function loadBaseline(projectRoot: string, stage: FlowStage): Promise<BaselineSnapshot | null>;
+export declare function loadBaselinesByStage(projectRoot: string, stages: readonly FlowStage[]): Promise<Map<FlowStage, BaselineSnapshot>>;
+export declare function buildBaselineForStage(stage: FlowStage, report: EvalReport): BaselineSnapshot;
+export declare function writeBaselinesFromReport(projectRoot: string, report: EvalReport): Promise<string[]>;
+/**
+ * Compare a freshly computed report against loaded baselines. If no baseline
+ * exists for a stage covered by the report, that stage contributes zero
+ * regressions (first run of that stage). Current is the source of truth.
+ */
+export declare function compareAgainstBaselines(report: EvalReport, baselines: Map<FlowStage, BaselineSnapshot>): BaselineDelta | undefined;
+export declare function listBaselineStages(projectRoot: string): Promise<FlowStage[]>;

package/dist/eval/baseline.js ADDED

@@ -0,0 +1,209 @@
+/**
+ * Baseline I/O + regression comparison for the eval subsystem.
+ *
+ * Layout on disk (committed):
+ *
+ *   .cclaw/evals/baselines/<stage>.json
+ *
+ * Each file contains a `BaselineSnapshot` keyed by `EvalCase.id`. We compute
+ * regressions by comparing per-verifier `ok` flags across runs: any verifier
+ * that was `ok:true` in the baseline and is `ok:false` now counts as a
+ * critical failure. A case whose aggregate `passed` flipped from true to
+ * false is flagged as `case-now-failing` regardless of per-verifier churn.
+ *
+ * Writes are gated behind an explicit `--update-baseline --confirm` pair at
+ * the CLI layer so accidental resets do not slip into PRs.
+ */
+import fs from "node:fs/promises";
+import path from "node:path";
+import { EVALS_ROOT, CCLAW_VERSION } from "../constants.js";
+import { exists } from "../fs-utils.js";
+import { FLOW_STAGES } from "../types.js";
+export const BASELINE_SCHEMA_VERSION = 1;
+function baselinePath(projectRoot, stage) {
+    return path.join(projectRoot, EVALS_ROOT, "baselines", `${stage}.json`);
+}
+export async function loadBaseline(projectRoot, stage) {
+    const filePath = baselinePath(projectRoot, stage);
+    if (!(await exists(filePath)))
+        return null;
+    const raw = await fs.readFile(filePath, "utf8");
+    let parsed;
+    try {
+        parsed = JSON.parse(raw);
+    }
+    catch (err) {
+        throw new Error(`Invalid baseline at ${filePath}: ${err instanceof Error ? err.message : String(err)}`);
+    }
+    if (!isBaseline(parsed, stage)) {
+        throw new Error(`Invalid baseline at ${filePath}: shape mismatch (expected schemaVersion=${BASELINE_SCHEMA_VERSION}, stage=${stage})`);
+    }
+    return parsed;
+}
+function isBaseline(value, stage) {
+    if (!value || typeof value !== "object")
+        return false;
+    const candidate = value;
+    if (candidate.schemaVersion !== BASELINE_SCHEMA_VERSION)
+        return false;
+    if (candidate.stage !== stage)
+        return false;
+    if (typeof candidate.generatedAt !== "string")
+        return false;
+    if (typeof candidate.cclawVersion !== "string")
+        return false;
+    if (!candidate.cases || typeof candidate.cases !== "object")
+        return false;
+    return true;
+}
+export async function loadBaselinesByStage(projectRoot, stages) {
+    const out = new Map();
+    for (const stage of stages) {
+        const snapshot = await loadBaseline(projectRoot, stage);
+        if (snapshot)
+            out.set(stage, snapshot);
+    }
+    return out;
+}
+function entryFromResult(result) {
+    const verifierResults = result.verifierResults.map((v) => ({
+        id: v.id,
+        kind: v.kind,
+        ok: v.ok,
+        ...(v.score !== undefined ? { score: v.score } : {})
+    }));
+    return { passed: result.passed, verifierResults };
+}
+export function buildBaselineForStage(stage, report) {
+    const stageCases = report.cases.filter((c) => c.stage === stage);
+    const cases = {};
+    for (const c of stageCases) {
+        cases[c.caseId] = entryFromResult(c);
+    }
+    return {
+        schemaVersion: BASELINE_SCHEMA_VERSION,
+        stage,
+        generatedAt: new Date().toISOString(),
+        cclawVersion: CCLAW_VERSION,
+        cases
+    };
+}
+export async function writeBaselinesFromReport(projectRoot, report) {
+    const written = [];
+    const stages = new Set(report.cases.map((c) => c.stage));
+    for (const stage of stages) {
+        const snapshot = buildBaselineForStage(stage, report);
+        const file = baselinePath(projectRoot, stage);
+        await fs.mkdir(path.dirname(file), { recursive: true });
+        await fs.writeFile(file, `${JSON.stringify(snapshot, null, 2)}\n`, "utf8");
+        written.push(file);
+    }
+    return written.sort();
+}
+function verifierMap(entries) {
+    const out = new Map();
+    for (const entry of entries) {
+        out.set(entry.id, entry);
+    }
+    return out;
+}
+function computePassRate(cases) {
+    if (cases.length === 0)
+        return 1;
+    const passed = cases.filter((c) => c.passed).length;
+    return passed / cases.length;
+}
+function baselinePassRate(snapshot) {
+    const entries = Object.values(snapshot.cases);
+    if (entries.length === 0)
+        return 1;
+    const passed = entries.filter((e) => e.passed).length;
+    return passed / entries.length;
+}
+/**
+ * Compare a freshly computed report against loaded baselines. If no baseline
+ * exists for a stage covered by the report, that stage contributes zero
+ * regressions (first run of that stage). Current is the source of truth.
+ */
+export function compareAgainstBaselines(report, baselines) {
+    if (baselines.size === 0)
+        return undefined;
+    const regressions = [];
+    const caseResultsByStage = new Map();
+    for (const c of report.cases) {
+        const bucket = caseResultsByStage.get(c.stage) ?? [];
+        bucket.push(c);
+        caseResultsByStage.set(c.stage, bucket);
+    }
+    let baselineTotalPassRate = 0;
+    let baselineStagesCounted = 0;
+    for (const [stage, snapshot] of baselines) {
+        const current = caseResultsByStage.get(stage) ?? [];
+        baselineTotalPassRate += baselinePassRate(snapshot);
+        baselineStagesCounted += 1;
+        for (const caseResult of current) {
+            const baselineEntry = snapshot.cases[caseResult.caseId];
+            if (!baselineEntry)
+                continue;
+            if (baselineEntry.passed && !caseResult.passed) {
+                regressions.push({
+                    caseId: caseResult.caseId,
+                    stage,
+                    verifierId: "<case>",
+                    reason: "case-now-failing",
+                    previousScore: 1,
+                    currentScore: 0
+                });
+            }
+            const baselineVerifiers = verifierMap(baselineEntry.verifierResults);
+            for (const currentVerifier of caseResult.verifierResults) {
+                const prev = baselineVerifiers.get(currentVerifier.id);
+                if (!prev)
+                    continue;
+                if (prev.ok && !currentVerifier.ok) {
+                    regressions.push({
+                        caseId: caseResult.caseId,
+                        stage,
+                        verifierId: currentVerifier.id,
+                        reason: "newly-failing",
+                        previousScore: prev.score ?? 1,
+                        currentScore: currentVerifier.score ?? 0
+                    });
+                }
+                else if (prev.score !== undefined &&
+                    currentVerifier.score !== undefined &&
+                    currentVerifier.score < prev.score) {
+                    regressions.push({
+                        caseId: caseResult.caseId,
+                        stage,
+                        verifierId: currentVerifier.id,
+                        reason: "score-drop",
+                        previousScore: prev.score,
+                        currentScore: currentVerifier.score
+                    });
+                }
+            }
+        }
+    }
+    const currentPassRate = computePassRate(report.cases);
+    const baselineAveragePassRate = baselineStagesCounted === 0 ? currentPassRate : baselineTotalPassRate / baselineStagesCounted;
+    const scoreDelta = Number((currentPassRate - baselineAveragePassRate).toFixed(4));
+    const criticalFailures = regressions.filter((r) => r.reason === "newly-failing" || r.reason === "case-now-failing").length;
+    const baselineStages = [...baselines.keys()].sort().join(",");
+    return {
+        baselineId: baselineStages.length > 0 ? baselineStages : "(empty)",
+        scoreDelta,
+        criticalFailures,
+        regressions
+    };
+}
+export function listBaselineStages(projectRoot) {
+    const root = path.join(projectRoot, EVALS_ROOT, "baselines");
+    return fs
+        .readdir(root, { withFileTypes: true })
+        .then((entries) => entries
+        .filter((entry) => entry.isFile() && entry.name.endsWith(".json"))
+        .map((entry) => entry.name.replace(/\.json$/, ""))
+        .filter((name) => FLOW_STAGES.includes(name)))
+        .catch(() => []);
+}
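
The per-verifier comparison in `compareAgainstBaselines` can be illustrated with a reduced sketch. This is not the package's code: `classifyVerifier` and the sample verifier ids are hypothetical names chosen for illustration, and the entries only mirror the `{ id, ok, score? }` fields that the diff stores in a baseline snapshot.

```javascript
// Simplified sketch of the regression rule shown in the diff above.
// `classifyVerifier` is a hypothetical helper, not part of cclaw-cli.
function classifyVerifier(prev, curr) {
    // A verifier that passed in the baseline and fails now is a critical regression.
    if (prev.ok && !curr.ok) return "newly-failing";
    // Otherwise, a lower score is reported only when both runs recorded a score.
    if (prev.score !== undefined && curr.score !== undefined && curr.score < prev.score) {
        return "score-drop";
    }
    return null;
}

// Hypothetical baseline and current verifier results for one case.
const baseline = [
    { id: "sections", ok: true },
    { id: "length", ok: true, score: 0.9 },
    { id: "frontmatter", ok: false }
];
const current = [
    { id: "sections", ok: false },
    { id: "length", ok: true, score: 0.7 },
    { id: "frontmatter", ok: false },
    { id: "brand-new", ok: false }
];

const previousById = new Map(baseline.map((entry) => [entry.id, entry]));
const regressions = [];
for (const verifier of current) {
    const prev = previousById.get(verifier.id);
    if (!prev) continue; // verifiers absent from the baseline are skipped
    const reason = classifyVerifier(prev, verifier);
    if (reason) regressions.push({ verifierId: verifier.id, reason });
}
console.log(regressions);
```

In this sketch, `frontmatter` was already failing in the baseline and `brand-new` has no baseline entry, so neither is reported. That matches the diff's behaviour of skipping unknown verifiers rather than treating them as regressions.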

package/dist/eval/corpus.d.ts CHANGED

@@ -2,7 +2,18 @@ import type { FlowStage } from "../types.js";
 import type { EvalCase } from "./types.js";
 /**
  * Load all eval cases under `.cclaw/evals/corpus/**`. Optionally restrict to a
- * single stage. Returns an empty array for a fresh install (Wave 7.0 ships
- * without seed cases; corpus is authored in Wave 7.1+).
+ * single stage. Returns an empty array for a fresh install.
  */
 export declare function loadCorpus(projectRoot: string, stage?: FlowStage): Promise<EvalCase[]>;
+/**
+ * Resolve a case's `fixture` path to an absolute filesystem path. The fixture
+ * field is interpreted relative to the case's stage directory (i.e., a
+ * sibling subdirectory or file inside `.cclaw/evals/corpus/<stage>/`).
+ */
+export declare function fixturePathFor(projectRoot: string, caseEntry: EvalCase): string | undefined;
+/**
+ * Read the fixture artifact text for a case. Returns `undefined` if the case
+ * has no fixture reference. Throws a descriptive error if the path exists in
+ * the case but not on disk — structural fixtures ship alongside cases.
+ */
+export declare function readFixtureArtifact(projectRoot: string, caseEntry: EvalCase): Promise<string | undefined>;

package/dist/eval/corpus.js CHANGED

@@ -12,6 +12,76 @@ function corpusError(filePath, reason) {
 function isRecord(value) {
     return typeof value === "object" && value !== null && !Array.isArray(value);
 }
+function readStringArray(filePath, context, value) {
+    if (value === undefined)
+        return undefined;
+    if (!Array.isArray(value) || value.some((item) => typeof item !== "string")) {
+        throw corpusError(filePath, `"${context}" must be an array of strings`);
+    }
+    return value;
+}
+function readNonNegativeInteger(filePath, context, value) {
+    if (value === undefined)
+        return undefined;
+    if (typeof value !== "number" || !Number.isFinite(value) || value < 0 || !Number.isInteger(value)) {
+        throw corpusError(filePath, `"${context}" must be a non-negative integer`);
+    }
+    return value;
+}
+function parseStructural(filePath, raw) {
+    if (raw === undefined)
+        return undefined;
+    if (!isRecord(raw)) {
+        throw corpusError(filePath, `"expected.structural" must be a mapping`);
+    }
+    const requiredSections = readStringArray(filePath, "expected.structural.required_sections", raw.required_sections ?? raw.requiredSections);
+    const forbiddenPatterns = readStringArray(filePath, "expected.structural.forbidden_patterns", raw.forbidden_patterns ?? raw.forbiddenPatterns);
+    const requiredFrontmatterKeys = readStringArray(filePath, "expected.structural.required_frontmatter_keys", raw.required_frontmatter_keys ?? raw.requiredFrontmatterKeys);
+    const minLines = readNonNegativeInteger(filePath, "expected.structural.min_lines", raw.min_lines ?? raw.minLines);
+    const maxLines = readNonNegativeInteger(filePath, "expected.structural.max_lines", raw.max_lines ?? raw.maxLines);
+    const minChars = readNonNegativeInteger(filePath, "expected.structural.min_chars", raw.min_chars ?? raw.minChars);
+    const maxChars = readNonNegativeInteger(filePath, "expected.structural.max_chars", raw.max_chars ?? raw.maxChars);
+    const structural = {};
+    if (requiredSections)
+        structural.requiredSections = requiredSections;
+    if (forbiddenPatterns)
+        structural.forbiddenPatterns = forbiddenPatterns;
+    if (requiredFrontmatterKeys)
+        structural.requiredFrontmatterKeys = requiredFrontmatterKeys;
+    if (minLines !== undefined)
+        structural.minLines = minLines;
+    if (maxLines !== undefined)
+        structural.maxLines = maxLines;
+    if (minChars !== undefined)
+        structural.minChars = minChars;
+    if (maxChars !== undefined)
+        structural.maxChars = maxChars;
+    return structural;
+}
+function parseExpected(filePath, raw) {
+    if (raw === undefined)
+        return undefined;
+    if (!isRecord(raw)) {
+        throw corpusError(filePath, `"expected" must be a mapping`);
+    }
+    const shape = {};
+    const structural = parseStructural(filePath, raw.structural);
+    if (structural)
+        shape.structural = structural;
+    if (raw.rules !== undefined) {
+        if (!isRecord(raw.rules)) {
+            throw corpusError(filePath, `"expected.rules" must be a mapping`);
+        }
+        shape.rules = raw.rules;
+    }
+    if (raw.judge !== undefined) {
+        if (!isRecord(raw.judge)) {
+            throw corpusError(filePath, `"expected.judge" must be a mapping`);
+        }
+        shape.judge = raw.judge;
+    }
+    return Object.keys(shape).length === 0 ? undefined : shape;
+}
 function validateCase(filePath, raw) {
     if (!isRecord(raw)) {
         throw corpusError(filePath, "top-level value must be a mapping");
@@ -28,17 +98,8 @@ function validateCase(filePath, raw) {
     if (typeof inputPrompt !== "string" || inputPrompt.trim().length === 0) {
         throw corpusError(filePath, `"input_prompt" must be a non-empty string`);
     }
-    const contextFilesRaw = raw.context_files ?? raw.contextFiles;
-    let contextFiles;
-    if (contextFilesRaw !== undefined) {
-        if (!Array.isArray(contextFilesRaw) || contextFilesRaw.some((f) => typeof f !== "string")) {
-            throw corpusError(filePath, `"context_files" must be an array of strings`);
-        }
-        contextFiles = contextFilesRaw;
-    }
-    const expected = raw.expected !== undefined && isRecord(raw.expected)
-        ? raw.expected
-        : undefined;
+    const contextFiles = readStringArray(filePath, "context_files", raw.context_files ?? raw.contextFiles);
+    const expected = parseExpected(filePath, raw.expected);
     const fixture = typeof raw.fixture === "string" ? raw.fixture : undefined;
     return {
         id: id.trim(),
@@ -51,8 +112,7 @@ function validateCase(filePath, raw) {
 }
 /**
  * Load all eval cases under `.cclaw/evals/corpus/**`. Optionally restrict to a
- * single stage. Returns an empty array for a fresh install (Wave 7.0 ships
- * without seed cases; corpus is authored in Wave 7.1+).
+ * single stage. Returns an empty array for a fresh install.
  */
 export async function loadCorpus(projectRoot, stage) {
     const corpusRoot = path.join(projectRoot, EVALS_ROOT, "corpus");
@@ -89,3 +149,27 @@ export async function loadCorpus(projectRoot, stage) {
     cases.sort((a, b) => a.stage.localeCompare(b.stage) || a.id.localeCompare(b.id));
     return cases;
 }
+/**
+ * Resolve a case's `fixture` path to an absolute filesystem path. The fixture
+ * field is interpreted relative to the case's stage directory (i.e., a
+ * sibling subdirectory or file inside `.cclaw/evals/corpus/<stage>/`).
+ */
+export function fixturePathFor(projectRoot, caseEntry) {
+    if (!caseEntry.fixture)
+        return undefined;
+    return path.resolve(projectRoot, EVALS_ROOT, "corpus", caseEntry.stage, caseEntry.fixture);
+}
+/**
+ * Read the fixture artifact text for a case. Returns `undefined` if the case
+ * has no fixture reference. Throws a descriptive error if the path exists in
+ * the case but not on disk — structural fixtures ship alongside cases.
+ */
+export async function readFixtureArtifact(projectRoot, caseEntry) {
+    const fixturePath = fixturePathFor(projectRoot, caseEntry);
+    if (!fixturePath)
+        return undefined;
+    if (!(await exists(fixturePath))) {
+        throw new Error(`Fixture missing for case ${caseEntry.stage}/${caseEntry.id}: ${fixturePath}`);
+    }
+    return fs.readFile(fixturePath, "utf8");
+}
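
To make the new `expected.structural` parsing concrete, here is a reduced sketch of the key-alias and integer validation it performs. It handles only two of the seven fields. `normalizeStructural` is a hypothetical name, and the error is a plain `Error` rather than the package's `corpusError`.

```javascript
// Reduced sketch of the validation pattern in parseStructural; not the package's code.
function readNonNegativeInteger(context, value) {
    if (value === undefined) return undefined;
    // Number.isInteger rejects NaN, Infinity, and fractional values.
    if (typeof value !== "number" || !Number.isInteger(value) || value < 0) {
        throw new Error(`"${context}" must be a non-negative integer`);
    }
    return value;
}

// Hypothetical helper: accepts snake_case or camelCase keys, emits camelCase.
function normalizeStructural(raw) {
    const out = {};
    const sections = raw.required_sections ?? raw.requiredSections;
    if (sections !== undefined) out.requiredSections = sections;
    const minLines = readNonNegativeInteger(
        "expected.structural.min_lines",
        raw.min_lines ?? raw.minLines
    );
    if (minLines !== undefined) out.minLines = minLines;
    return out;
}

const normalized = normalizeStructural({ required_sections: ["Task List"], minLines: 20 });
console.log(normalized);
```

Because the lookup uses `??`, the snake_case key takes precedence when both spellings are present, and absent fields are omitted from the result rather than set to `undefined`.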

package/dist/eval/llm-client.d.ts CHANGED

@@ -1,17 +1,17 @@
 /**
  * LLM client skeleton for the cclaw eval subsystem.
  *
- * Wave 7.0 declares the shape of the client without pulling in the `openai`
- * runtime dependency. The real implementation is wired in Wave 7.3 when
+ * This module declares the shape of the client without pulling in the
+ * `openai` runtime dependency. The real implementation lands when
  * single-shot (Tier A) evals and LLM judging come online. Keeping this stub
- * separate means users of Waves 7.0–7.2 (structural + rule-based verifiers)
- * never install an extra dependency or receive network egress warnings.
+ * separate means users who only run structural + rule-based verifiers never
+ * install an extra dependency or receive network egress warnings.
  */
 import type { ResolvedEvalConfig } from "./types.js";
 /**
  * Minimal chat interface the rest of the eval code will depend on. It is
  * intentionally a subset of OpenAI's Chat Completions surface so that the
- * Wave 7.3 implementation is a thin adapter around `OpenAI.chat.completions.create`.
+ * real implementation is a thin adapter around `OpenAI.chat.completions.create`.
  */
 export interface ChatMessage {
     role: "system" | "user" | "assistant" | "tool";
@@ -26,8 +26,8 @@ export interface ChatRequest {
     temperature?: number;
     timeoutMs?: number;
     /**
-     * Tool/function-calling definitions in OpenAI wire format. Populated only by
-     * Wave 7.4 (Tier B). Ignored by the Wave 7.3 single-shot path.
+     * Tool/function-calling definitions in OpenAI wire format. Populated only
+     * by Tier B. Ignored by the Tier A single-shot path.
      */
     tools?: unknown[];
     toolChoice?: "auto" | "none";
@@ -52,11 +52,11 @@ export interface EvalLlmClient {
     chat(request: ChatRequest): Promise<ChatResponse>;
 }
 export declare class EvalLlmNotWiredError extends Error {
-    constructor(wave: string);
+    constructor();
 }
 /**
- * Factory stub. Throws with a clear message so accidental Wave 7.0 usage is
- * easy to diagnose. The Wave 7.3 implementation will replace this body with
+ * Factory stub. Throws with a clear message so accidental early usage is
+ * easy to diagnose. The real implementation will replace this body with
  * `new OpenAI({ apiKey, baseURL }) ... adapter`.
  */
 export declare function createEvalClient(_config: ResolvedEvalConfig): EvalLlmClient;

package/dist/eval/llm-client.js CHANGED

@@ -1,19 +1,19 @@
 export class EvalLlmNotWiredError extends Error {
-    constructor(wave) {
-        super(`LLM client is not wired in Wave 7.0. It arrives in Wave ${wave}.\n` +
+    constructor() {
+        super(`LLM client is not wired yet.\n` +
             `Run \`cclaw eval --dry-run\` or \`cclaw eval --schema-only\` for offline evals.`);
         this.name = "EvalLlmNotWiredError";
     }
 }
 /**
- * Factory stub. Throws with a clear message so accidental Wave 7.0 usage is
- * easy to diagnose. The Wave 7.3 implementation will replace this body with
+ * Factory stub. Throws with a clear message so accidental early usage is
+ * easy to diagnose. The real implementation will replace this body with
  * `new OpenAI({ apiKey, baseURL }) ... adapter`.
  */
 export function createEvalClient(_config) {
     return {
         async chat() {
-            throw new EvalLlmNotWiredError("7.3");
+            throw new EvalLlmNotWiredError();
         }
     };
 }
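
Since the stub now throws without a wave argument, a caller can only branch on the error class. The sketch below redeclares the stub locally, with a shortened message, so that it runs on its own. `tryChat` and the `"offline-only"` return value are hypothetical and only illustrate one way a caller might react.

```javascript
// Local re-declaration of the stub for illustration; message shortened.
class EvalLlmNotWiredError extends Error {
    constructor() {
        super("LLM client is not wired yet.");
        this.name = "EvalLlmNotWiredError";
    }
}

function createEvalClient(_config) {
    return {
        async chat() {
            throw new EvalLlmNotWiredError();
        }
    };
}

// Hypothetical caller: treat the stub error as "stay offline", rethrow anything else.
async function tryChat(client) {
    try {
        await client.chat({ temperature: 0 });
        return "ok";
    } catch (err) {
        if (err instanceof EvalLlmNotWiredError) return "offline-only";
        throw err;
    }
}

tryChat(createEvalClient({})).then((outcome) => console.log(outcome));
```

Checking `instanceof` rather than parsing the message keeps the caller independent of the message wording, which changed in this release.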

package/dist/eval/report.js CHANGED

@@ -39,17 +39,30 @@ export function formatMarkdownReport(report) {
     lines.push(`| total duration (ms) | ${summary.totalDurationMs} |`);
     lines.push(``);
     if (report.baselineDelta) {
+        const delta = report.baselineDelta;
         lines.push(`## Baseline delta`);
         lines.push(``);
-        lines.push(`- baseline: ${report.baselineDelta.baselineId}`);
-        lines.push(`- score delta: ${report.baselineDelta.scoreDelta.toFixed(4)}`);
-        lines.push(`- critical failures: ${report.baselineDelta.criticalFailures}`);
+        lines.push(`- baseline: ${delta.baselineId}`);
+        lines.push(`- score delta: ${delta.scoreDelta.toFixed(4)}`);
+        lines.push(`- critical failures: ${delta.criticalFailures}`);
         lines.push(``);
+        if (delta.regressions.length > 0) {
+            lines.push(`### Regressions`);
+            lines.push(``);
+            lines.push(`| stage | case id | verifier | reason | prev | curr |`);
+            lines.push(`| --- | --- | --- | --- | --- | --- |`);
+            for (const reg of delta.regressions) {
+                const prev = reg.previousScore !== undefined ? reg.previousScore.toFixed(2) : "-";
+                const curr = reg.currentScore !== undefined ? reg.currentScore.toFixed(2) : "-";
+                lines.push(`| ${reg.stage} | ${reg.caseId} | ${reg.verifierId} | ${reg.reason} | ${prev} | ${curr} |`);
+            }
+            lines.push(``);
+        }
     }
     if (report.cases.length === 0) {
         lines.push(`## Cases`);
         lines.push(``);
-        lines.push(`No cases were executed. See \`docs/evals.md\` for the Wave rollout plan.`);
+        lines.push(`No cases were executed. See \`docs/evals.md\` for the rollout plan.`);
         lines.push(``);
         return `${lines.join("\n")}\n`;
     }
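
As a rough illustration of the new regressions table, the row formatting can be lifted into a standalone function. `formatRegressionRows` is a hypothetical name, and the sample regression (`plan-01`) is invented input. The column layout and the `toFixed(2)` / `"-"` fallback follow the diff above.

```javascript
// Standalone sketch of the table rows added to formatMarkdownReport.
// `formatRegressionRows` is a hypothetical helper, not exported by the package.
function formatRegressionRows(regressions) {
    const lines = [
        "| stage | case id | verifier | reason | prev | curr |",
        "| --- | --- | --- | --- | --- | --- |"
    ];
    for (const reg of regressions) {
        // Scores are optional; a dash marks a missing value.
        const prev = reg.previousScore !== undefined ? reg.previousScore.toFixed(2) : "-";
        const curr = reg.currentScore !== undefined ? reg.currentScore.toFixed(2) : "-";
        lines.push(`| ${reg.stage} | ${reg.caseId} | ${reg.verifierId} | ${reg.reason} | ${prev} | ${curr} |`);
    }
    return lines;
}

// Invented sample: a case that passed in the baseline and fails now.
const rows = formatRegressionRows([
    {
        stage: "plan",
        caseId: "plan-01",
        verifierId: "<case>",
        reason: "case-now-failing",
        previousScore: 1,
        currentScore: 0
    }
]);
console.log(rows.join("\n"));
```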