npm - baldart - Versions diffs - 4.47.0 → 4.49.0 - Mend

baldart 4.47.0 → 4.49.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (12) hide show

package/CHANGELOG.md +20 -0
package/VERSION +1 -1
package/framework/.claude/agents/codebase-architect.md +24 -1
package/framework/.claude/agents/plan-auditor.md +17 -79
package/framework/.claude/skills/bug/SKILL.md +3 -0
package/framework/.claude/skills/context-primer/SKILL.md +4 -0
package/framework/.claude/skills/new/SKILL.md +18 -0
package/framework/.claude/skills/new/references/final-review.md +4 -15
package/framework/.claude/skills/new/references/implement.md +1 -1
package/framework/.claude/skills/new2/SKILL.md +61 -10
package/framework/.claude/skills/prd/references/discovery-phase.md +5 -1
package/package.json +1 -1

package/CHANGELOG.md CHANGED Viewed

@@ -5,6 +5,26 @@ All notable changes to BALDART will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [4.49.0] - 2026-06-17
+**Caveman-style terse contracts close the last prose-leak the v4.47.0 turn-economy left out of scope: `codebase-architect` gets an opt-in `OUTPUT=terse` grounding mode, and `/new`'s orchestrator is forced near-silent.** A study of the [caveman](https://github.com/JuliusBrussee/caveman) skill (compress *how the agent talks*, not *what it builds* — output tokens only, reasoning untouched) plus a **prose-leak surface map of BALDART's own fleet** found the surface already covered almost everywhere — finders emit YAML/schema (`code-reviewer`, auditors, `plan-auditor`, `prd-card-writer`, `coder` completion-report), `senior-researcher` is file-resident, `context-primer` is already capped at a 30–50-line XML — with ONE genuinely-uncovered prose exploder: `codebase-architect`, whose budget cap is *input*-side only (`codebase-architect.md`), whose "Communication Style" explicitly licenses narrative prose, and whose return is outside the v4.47.0 rule by design (`new/SKILL.md` § "Context economy": *"It does NOT change any agent's return contract"*). The earlier ponytail-style "cap all finder prose" idea was **refuted** by a 3-skeptic adversarial pass (finders already capped + data-driven; new2 already isolates subagent output; driver is turn-count) — this release ships only the two data-validated survivors. A follow-up **return-consumption audit** of the whole fleet confirmed the floor is already reached everywhere else — finders/auditors return a parsed schema or override-to-`## FINDINGS`, and `qa-sentinel`/`doc-reviewer`/`senior-researcher`/`prd-card-writer` are file-resident with a thin manifest return; the only remaining lever was wiring the new `OUTPUT=terse` flag into the remaining machine-to-machine `codebase-architect` call-sites (`context-primer`, `/prd` discovery dimension-resolution + ISA spawns). The near-silent rule is scoped to `/new` only — `/prd` and `/bug` are conversational, where the prose IS the deliverable (caveman's own Auto-Clarity principle: never compress what a human reads to decide). `/new` is the opposite: not conversational — the user keeps the orchestrator attached only so it can surface a genuine decision, so its running narration serves no one. **MINOR** (additive agent capability + skill behaviour; **no new `baldart.config.yml` key** ⇒ schema-change propagation rule N/A; no removed surface).
+### Changed
+- **`framework/.claude/agents/codebase-architect.md`** — new **"Output mode: terse grounding (opt-in)"** section: when the invoking prompt carries `OUTPUT=terse`, the architect emits the structured substance only (the `## Reuse Analysis` table + `## Canonical Evidence` block unchanged, affected code as `path:line — symbol — pattern/role` rows, high-risk paths one-line each, a closing `totals:`), drops the high-level-overview/"why" narrative, keeps code identifiers verbatim, and keeps an **Auto-clarity carve-out** (`⚠` one-liner for a genuine ambiguity/risk a row can't carry). The existing "Communication Style" is retitled *(interactive / explanatory invocations only)* — without the flag the narrative path is unchanged.
+- **`framework/.claude/skills/{new/references/implement.md` (Phase 1 step 3), `bug/SKILL.md` (Phase 0 step 3), `context-primer/SKILL.md` (loader spawn template), `prd/references/discovery-phase.md` (dimension-resolution spawn + ISA spawn)}** — every machine-to-machine `codebase-architect` grounding spawn now passes `OUTPUT=terse`. The `/new` baseline persisted verbatim at step 5b stays lean while keeping everything a lean `/codexreview` needs (paths, signatures, patterns, high-risk paths).
+- **`framework/.claude/skills/new/SKILL.md`** — new **HARD RULE in § "Context economy": "orchestrator is near-silent / maximally cryptic"** — the orchestrator emits **nothing user-facing during the run** (no progress, no "now I will…", no per-phase recap, no agent-activity description, no tracker restatement, no tool-call narration) and speaks in exactly **three** cases: (1) decision gates — `AskUserQuestion` + security/irreversible-action confirmations, full clarity, never compressed; (2) genuine blockers / items needing user action, terse and exact; (3) one minimal end-of-batch result block (one line per card + actionable residue). Governs the orchestrator's *prose*, not any agent return contract.
+- **`framework/.claude/agents/plan-auditor.md`** — the FULL-mode `## OUTPUT FORMAT` 10-section block is compressed in place (~78 → ~22 lines): same 10 sections + every load-bearing semantic (verdict definitions, `Priority ≥ 16 → BLOCK`, the three distinct pre-mortem root-cause classes), with the verbose per-section sub-bullet templates collapsed to one-line specs. This is dormant weight on the common QUICK-mode spawns (which explicitly skip it), trimmed from every spawn's system prompt. Compressed **in place** rather than extracted to an on-demand reference — agents don't Read external format files (that's a skill-only pattern), so extraction would introduce a cross-context path dependency not worth ~1k tokens. (Return-consumption audit verdict: this is a per-spawn *subagent input* cost, not the orchestrator's replayed-context multiplier — a deliberately small, low-risk trim, the only one of the dormant-format set worth touching.)
+- **`framework/.claude/skills/new/references/final-review.md`** (Step F.6 item 4) — the per-card + batch **13-field summary report** is collapsed to the minimal result block (`CARD-ID: merged @<sha> | failed: … | deferred → …` one line per card + actionable residue only). All process telemetry (files changed, test/build/lint status, fix cycles, review counts, QA verdict, commit/merge hashes, cleanup, knowledge-sync) is no longer rendered — it lives in the tracker + `skill-runs.jsonl`. The Next Steps & Launch Command (4b) and Production Readiness manual steps stay (they are actionable residue).
+## [4.48.0] - 2026-06-16
+**`new2` is relaxed: a post-batch, interactive-only escape hatch hands genuinely-blocked hard cases to `/new`'s real human gate — without re-introducing rigidity, a twin, or breaking the A/B.** `new2` runs the batch autonomously, so a card the deterministic policy can't salvage becomes a tracked follow-up + is left `IN_PROGRESS` — the "edge case that wants human intelligence" the user flagged. A 3-lens **adversarial review before implementation** killed the obvious fix (a Step-5 `AskUserQuestion` that lets the skill implement/resolve the residual): (1) **correctness** — the workflow auto-merges and removes the worktree *before* Step 5, so a skill-side fix has no worktree, lands unreviewed code on trunk, and bypasses the F-029 DONE gate; (2) **value** — on the only run with a full `deferral_breakdown` (14 residuals) a post-batch fix changes the outcome of *zero* of them (the batch already merged); only a mid-batch checkpoint could salvage-before-merge, and the data (n=1) doesn't justify breaking `new2`'s background/no-poll contract; (3) **prior-art** — it would twin `/new`'s Phase 2.5b AC-Closure gate and muddy the autonomous-vs-`/new` A/B. What survived all three: the sound escalation is **not to re-implement a gate but to invoke `/new` on the already-materialised follow-up** — `/new` owns the real worktree+review+AC-Closure+F-029+merge pipeline. **MINOR** (additive skill behaviour, interactive-only, autonomous-mode-safe; **no new `baldart.config.yml` key** ⇒ schema-change propagation rule N/A; no removed surface).
+### Changed
+- **`framework/.claude/skills/new2/SKILL.md`** — new **Step 3b "Escape-hatch escalation"** in the skill's post-batch reconciliation: in INTERACTIVE mode only (skipped when `BALDART_AUTONOMOUS`/`CI`/`GITHUB_ACTIONS` is set), after follow-ups are materialised on disk (offline-safe ordering preserved — the offer is additive over an already-safe ledger) and any `degraded` resume has converged, it presents **one batched `AskUserQuestion`** offering to run `/new` on the **code-actionable** hard-case follow-ups (`deferralClass ∈ {unresolved, out-of-ownership, scope-expansion}`; `owner-gated`/`not-a-code-defect`/`policy-deferred-ac` excluded — `/new` can't perform infra steps). "Sì" invokes `/new <followup-id …>` via the Skill tool (which closes them through its own gates — the skill never marks DONE itself, never re-implements the gate). The **ZERO-ASK CONTRACT** banner is rewritten to scope it precisely to *the workflow during the batch* (the skill may interact pre-launch AND post-batch, interactive-only). New `escape_hatch` telemetry field. Honest limitation documented: post-batch, so it gives the human gate on the follow-up but does NOT salvage a card before its merge (that would need a mid-batch checkpoint — out of scope by design).
 ## [4.47.0] - 2026-06-16
 **`/new`'s orchestrator context economy is re-aimed at its real driver — turn count — and the user-visible Progress Bar + native Task spine are removed.** Telemetry of two real 8-card batches (FEAT-0028/0029 on a consumer) showed the orchestrator paying ~285M `cache_read`: 613 turns each replaying a ~490k-token accumulated context (growing toward ~800k), so total cost ≈ **turn count × accumulated context**. A 3-lens **adversarial review before implementation** refuted the obvious diagnoses: the static prefix is only ~77k (not the ~225k first assumed — context is ~86% *accumulated*, not static); narration prose is only ~7% of the fuel; and the existing § "Context economy" (bulk-content-inline) rule targets a channel that totals only ~119k cumulatively. The measurement reviewer surfaced the actual missed lever — **0 of 274 tool turns batched any calls, and ~55% of turns carried no tool call at all** — and the correctness + prior-art reviewers established that delegating bookkeeping out of the orchestrator is a previously-trodden trap (v4.15.0 reverted a Write-from-memory tracker flush; the tracker is the recovery SSOT; `card_status: DONE` needs orchestrator-side disk re-read; the weak-subagent fabrication precedent applies). What survived: (1) a new turn-economy HARD RULE (batch independent tool calls; no narration-only turns; never poll/wait), and (2) since the Progress Bar + Task spine are pure *mirrors* of the internal tracker (recovery reads the tracker, never the spine), removing them is correctness-safe and eliminates ~45 dedicated visibility turns (~8% of a batch's `cache_read`, guaranteed, not batching-dependent). **MINOR** (skill behaviour change; **no new `baldart.config.yml` key** ⇒ schema-change propagation rule N/A; no removed agent/command/skill/routine).

package/VERSION CHANGED Viewed

	@@ -1 +1 @@
1	- 4.47.0
1	+ 4.49.0

package/framework/.claude/agents/codebase-architect.md CHANGED Viewed

@@ -276,7 +276,30 @@ canonical doc already told you.
 - Documentation sync: code and docs must stay aligned
 - Backlog-driven work: features are tracked in `${paths.backlog_dir}/*.yml` files
-**Communication Style:**
+**Output mode: terse grounding (opt-in, since v4.49.0):**
+When the invoking prompt contains `OUTPUT=terse` (architecture grounding handed to
+an implementer or reviewer — not an interactive explanation for a human to read),
+emit the structured substance ONLY, no narrative. Brain big, mouth small. Your
+return value is injected verbatim into the caller's context, so every word of
+prose costs them context budget on every subsequent turn.
+- **Emit, unchanged (these ARE the contract):** the `## Reuse Analysis` table and
+  the `## Canonical Evidence` block.
+- **Affected code as rows, not paragraphs:** `path:line — symbol — pattern/role`.
+- **High-risk paths / integration points, one line each:** `path:line — risk`.
+- **Close with** a `totals:` line (files, reuse candidates, high-risk paths).
+- **Drop:** the high-level-overview paragraph, the "why" essays, restating the
+  request, any prose a row already carries.
+- **Keep verbatim:** file paths, symbols, type signatures, error strings — never
+  abbreviate a code identifier.
+- **Auto-clarity carve-out:** a genuine ambiguity, drift, or safety-relevant
+  caveat a row can't carry gets ONE plain short line flagged `⚠`. Never compress a
+  warning into ambiguity.
+Without `OUTPUT=terse`, use the narrative style below.
+**Communication Style (interactive / explanatory invocations only):**
 - Start with a high-level overview, then dive into specifics
 - Use precise technical terminology from `coding-standards.md`

package/framework/.claude/agents/plan-auditor.md CHANGED Viewed

@@ -471,85 +471,23 @@ Every HIGH and MEDIUM finding MUST be emitted in this exact shape so it can be p
 LOW findings can be one-liners with target tag (no full schema).
-## OUTPUT FORMAT (MANDATORY — USE THIS EXACT STRUCTURE)
-### 1) Executive Verdict
-- **Verdict**: {PASS | PASS WITH FIXES | BLOCK | NEEDS_REPLAN}
-- **Coverage Score**: <N>/8 audit categories passed (A–H)
-- **Memory matches**: <N> known pitfalls applied to this audit
-- **High-Risk Path Triggers**: [list or "none"] — if any: "Per-card `/codexreview` MUST run BEFORE merge"
-- **Specialist auto-spawn**: [list of agents spawned, or "none triggered"]
-- **Active Code Context drift**: [list of colliding cards, or "no collision"]
-- **Tool budget used**: <reads>/20 reads, <greps>/30 greps
-- 3–7 bullet reasons (highest impact first)
-**Verdict definitions:**
-- `PASS`: thorough, explicit, ready for implementation. (Rare.)
-- `PASS WITH FIXES`: structurally sound, gaps enumerated and fixable before coding starts.
-- `BLOCK`: gap that would cause production incident, data loss, security breach, or mid-sprint rewrite. Work must not start.
-- `NEEDS_REPLAN`: framing is wrong, not just details. Premise / approach / scope is incorrect; restart planning, do not patch.
-### 2) High-Risk Gaps (Must Fix)
-Findings list in YAML schema, grouped by category (integrity, architecture, security, reliability, testing, api-perf, traceability, verification, adr, drift, high-risk-path, simulation, injection).
-For backlog card findings, include the `target` field per TARGET TAG SYSTEM.
-### 3) Plan Simulation Findings
-Walk through each step that broke during simulation. Format:
-- **Step N**: <step description>
-- **Broken invariant**: <what assumption fails>
-- **Why**: <causal chain>
-- **Fix**: <reorder, add prerequisite step, add rollback>
-### 4) Assumptions & Hidden Dependencies
-- List each assumption found in or implied by the plan
-- For each: **Confidence level** (High / Med / Low) + **How to validate quickly**
-### 5) Blocking Questions (If any)
-- Numbered list of precise questions
-- For each: provide **recommended default if unanswered** + **trade-off explanation**
-### 6) Hardened Plan (Rewritten)
-- Provide a revised plan with:
-  - Phases (numbered, with clear entry/exit criteria)
-  - Tasks per phase (with estimated complexity: S/M/L)
-  - Acceptance criteria per phase (testable, specific)
-  - Explicit dependencies between phases and external systems
-  - Rollout + rollback strategy
-  - Instrumentation requirements (what to log, what to alert on)
-  - Test plan (what to test at each phase)
-### 7) Quantified Risk Register
-Format per risk: **Risk** | **Impact (1-5)** | **Likelihood (1-5)** | **Priority (I×L)** | **Severity** | **Mitigation** | **Owner**
-Rank by Priority descending. Highlight any Priority ≥ 16 as **BLOCK**.
-### 8) Three-Scenario Pre-Mortem (counterfactual diversity)
-Generate 3 INDEPENDENT failure scenarios, each with a DIFFERENT root cause class:
-**Scenario A — Data Corruption** ("It's 30 days later and silent data loss occurred because…")
-- Causal chain
-- Top 3 failure modes feeding this scenario
-- How the hardened plan prevents each
-**Scenario B — Security Breach** ("It's 30 days later and a customer escalated unauthorized access because…")
-- Causal chain
-- Top 3 failure modes
-- Mitigation in hardened plan
-**Scenario C — Scale Collapse** ("It's 30 days later and prod is throttling because…")
-- Causal chain
-- Top 3 failure modes
-- Mitigation in hardened plan
-If a scenario class doesn't apply to this plan (e.g. no data writes → no Scenario A), state explicitly and skip — but argue why it doesn't apply.
-### 9) Hallucinated Findings Dropped (CoVe)
-Findings disproven by Chain-of-Verification. Format:
-- **Finding title** — Verification: <command> → <result> → dropped because <reason>
-### 10) Suppressed Findings (Challenge Pass)
-Already documented in the suppressed-findings collapsible block above.
+## OUTPUT FORMAT — FULL mode only (MANDATORY — USE THESE 10 SECTIONS)
+> QUICK mode does NOT use this section — it returns `VERDICT: …` + numbered fixes (see top). The
+> 10-section structure below is for a standalone/holistic FULL audit; emit every section in order.
+> Terse spec — one line per field, no prose padding; substance and exact values verbatim.
+1. **Executive Verdict** — `Verdict` {PASS | PASS WITH FIXES | BLOCK | NEEDS_REPLAN}; Coverage <N>/8 (A–H); Memory matches <N>; High-Risk Path Triggers [list|none] (if any: "Per-card `/codexreview` MUST run BEFORE merge"); Specialist auto-spawn [list|none]; Active-Code-Context drift [colliding cards|none]; Tool budget <reads>/20, <greps>/30; 3–7 reasons, highest impact first.
+   - **Verdicts:** `PASS` ready (rare) · `PASS WITH FIXES` sound, gaps fixable pre-coding · `BLOCK` gap → prod incident / data loss / security breach / mid-sprint rewrite, do not start · `NEEDS_REPLAN` premise/approach/scope wrong, restart not patch.
+2. **High-Risk Gaps** — findings in the FINDINGS YAML SCHEMA above, grouped by category; include `target` for card findings.
+3. **Plan Simulation Findings** — per broken step: `Step N` · broken invariant · why (causal chain) · fix (reorder / add prerequisite / add rollback).
+4. **Assumptions & Hidden Dependencies** — each assumption + confidence (High/Med/Low) + how to validate quickly.
+5. **Blocking Questions** — numbered; each + recommended default if unanswered + trade-off.
+6. **Hardened Plan** — phases (entry/exit criteria) · tasks S/M/L · per-phase ACs (testable) · explicit deps (phases + external) · rollout+rollback · instrumentation (log/alert) · test plan.
+7. **Quantified Risk Register** — `Risk | Impact 1-5 | Likelihood 1-5 | Priority I×L | Severity | Mitigation | Owner`, ranked by Priority desc; Priority ≥ 16 → **BLOCK**.
+8. **Three-Scenario Pre-Mortem** — 3 INDEPENDENT "30 days later…" scenarios, each a DIFFERENT root-cause class (A Data Corruption / B Security Breach / C Scale Collapse): causal chain + top-3 failure modes + how the hardened plan prevents each. A class that can't apply (e.g. no writes → no A) is stated and skipped with the reason.
+9. **Hallucinated Findings Dropped (CoVe)** — `Finding — Verification: <command> → <result> → dropped because <reason>`.
+10. **Suppressed Findings (Challenge Pass)** — per the suppressed-findings block above.
 ## TONE & STYLE

package/framework/.claude/skills/bug/SKILL.md CHANGED Viewed

@@ -60,6 +60,9 @@ See [framework/docs/MCP-INTEGRATION.md](../../../docs/MCP-INTEGRATION.md) for th
    - **Auth** → the project's auth middleware/guard + auth error codes (per `stack.auth_provider`)
    - **Build/Deploy** → the deploy platform's logs + build output (per `stack.deployment`, e.g. Vercel)
 3. Launch `codebase-architect` agent to map the affected code paths — do NOT start debugging blind.
+   Pass `OUTPUT=terse` (since v4.49.0): you want the structured map (`path:line — symbol` rows,
+   high-risk paths, `totals:`), not a narrative essay — its return is injected verbatim into this
+   investigation context, so keep it lean.
    When `features.has_lsp_layer: true` and the symptom names a concrete symbol
    (function/type/handler), `codebase-architect` will use LSP find-references /
    go-to-definition to locate the exact callsites instead of grepping. See

package/framework/.claude/skills/context-primer/SKILL.md CHANGED Viewed

@@ -61,6 +61,10 @@ precedence caveats: `framework/agents/effort-protocol.md`.
 Load context about: {keywords}
 Task type: {task_type}
+OUTPUT=terse — you are a silent loader's machine consumer, not a human Q&A. Return the
+structured contract (reuse table + canonical evidence + `path:line — symbol` rows + `totals:`)
+with no narrative prose; the loader condenses this into the XML context block.
 TOKEN BUDGET: stay under 15K tokens total. Prefer the canonical router + targeted reads over broad file scans.
 SEARCH STRATEGY — canonical-router-first, verify-only-if-needed:

package/framework/.claude/skills/new/SKILL.md CHANGED Viewed

@@ -207,6 +207,24 @@ baselines. Keep that bulk on disk and pass **paths**, not bodies.
 >    completion — end your turn after spawning; never `sleep N; echo "waiting…"` (already stated for
 >    team mode in `references/team-mode.md` — it applies to every barrier in both modes).
 >
+> **HARD RULE — orchestrator is near-silent / maximally cryptic (since v4.49.0).** `/new` is NOT
+> conversational: the user does not read its running narration and keeps the orchestrator attached ONLY so it
+> can surface a genuine decision when something unpredictable needs a human. So emit **nothing user-facing
+> during the run** — no progress, no "now I will…", no per-phase recap, no description of what an agent is
+> doing, no tracker restatement, no tool-call narration (each also costs a full replayed-context turn per
+> § turn-count above; the tracker is the only state surface). The orchestrator speaks to the user in exactly
+> THREE cases, and nowhere else:
+> 1. **Decision gates — full clarity, NEVER compress:** `AskUserQuestion` prompts and any security /
+>    irreversible-action confirmation. This is the sole reason the human is in the loop — it must be readable.
+> 2. **Genuine blockers / items needing user action** — surfaced terse and exact (`CARD-ID`, `path:line`,
+>    error string verbatim), no prose around them.
+> 3. **One minimal end-of-batch result block** — one line per card (`CARD-ID: merged @<sha> | failed:
+>    <one clause> | deferred → <followup-id>`) plus only the actionable residue (unresolved/flagged items,
+>    the launch command, any production-readiness manual step). NOT a per-phase report — process telemetry
+>    lives in the tracker + `skill-runs.jsonl`, never rendered.
+> This governs the orchestrator's *prose*, not any agent's return contract (that is § "Context economy" above).
+>
 > These rules reduce the *number* of replays; the bulk-content rule above reduces the *size* of each.
 > Both compound — apply them together.

package/framework/.claude/skills/new/references/final-review.md CHANGED Viewed

@@ -295,21 +295,10 @@ that is a **gate violation**: log it as
    If the project ships a `knowledge-sync` corpus agent (declared in `.baldart/overlays/new.md`), invoke it after doc updates so the external corpus stays aligned. If no such agent exists, skip this step with a one-line notice (do NOT dispatch a non-existent agent).
 3. **Proceed to Phase 6** (post-batch merge & cleanup).
-4. Present a **single summary report** to the user per card (and a batch summary at the end):
-   - **Files changed** (short list per card)
-   - **Test results** (new tests + existing tests count, pass rate at each iteration)
-   - **Build/lint status** (pass + retry count if any)
-   - **Fix cycles** (total number of self-healing retries across phases)
-   - **Final review** (findings: N raised / M verified / K false positives | fixes applied: N | highest severity)
-   - **UX testing** (PASS/FAIL/SKIP | test file path if written)
-   - **QA result** (profile: skip/light/balanced/deep | verdict: PASS/FAIL/SKIP | confidence % | failing gates: which of lint/tsc/test/build/markdownlint, or "none" | findings file path). E2E is reported separately (Phase 2.6 e2e-review), not by qa-sentinel.
-   - **Issues needing user attention** (anything unresolved, partially wired, or flagged)
-   - **Commit hashes** (from tracker)
-   - **Merge commit hash** (from Phase 6)
-   - **Card status reconciliation** (Phase 6b: N cards verified DONE, K force-updated)
-   - **Worktree cleanup status** (success/failed)
-   - **Knowledge-corpus sync status** (from Step F.6 item 2 "Knowledge Base Sync" — COMPLETED/FAILED/PARTIAL, or SKIPPED if no corpus-sync agent is configured)
-   - Overall implementation status
+4. Emit the **minimal end-of-batch result block** (per § "orchestrator is near-silent" in the core SKILL.md — this is NOT a per-phase report). ONE line per card and nothing more of the process telemetry:
+   - `<CARD-ID>: merged @<sha> | failed: <one clause> | deferred → <followup-id>`
+   Then, **only if non-empty**, the **actionable residue**: issues needing user attention (anything unresolved, partially wired, or flagged — terse, with `path:line` / `CARD-ID` exact). All process telemetry — files changed, test/build/lint status, fix cycles, review-finding counts, UX/QA verdicts, commit + merge hashes, card-status reconciliation, worktree cleanup, knowledge-corpus sync — is **NOT rendered**: it lives in the tracker and `$METRICS/skill-runs.jsonl`. (The Next Steps & Launch Command in 4b below and any Production Readiness manual steps ARE part of the actionable residue — keep them.)
 4b. **Next Steps & Launch Command** (MANDATORY section — always present):

package/framework/.claude/skills/new/references/implement.md CHANGED Viewed

@@ -10,7 +10,7 @@
    - If `DONE` (or the user chose "proceed anyway") → continue.
 2b. **Claim** — only now set the card status to `IN_PROGRESS` (in the backlog YAML / tracker) and assign yourself. This is a state write, not a visibility emission — do not also spend a TaskUpdate / Progress-Bar turn (§ "State surface — the tracker only").
 2c. **Trivial-card classification (BLOCKING gate for steps 3–4)** — evaluate `IS_TRIVIAL(card)` per § "Trivial-card fast-lane". Note: condition 3 (non-source diff) cannot be fully evaluated until the coder has produced the diff, so at this point compute the **provisional** trivial flag from conditions 1+2 only (`review_profile == skip` AND no Step-A trigger sourced from the card YAML text + `files_likely_touched` extensions — if EVERY path in `files_likely_touched` is non-source, condition 3 is provisionally satisfied). If provisionally trivial → **SKIP steps 3, 3a, and 4** (architecture grounding); log `trivial: architecture grounding skipped (review_profile=skip + non-source files_likely_touched + 0 triggers)` and jump to Phase 2. Re-confirm `IS_TRIVIAL` on the ACTUAL committed diff at the review gates (Phase 2.55/3.5/3.7); if the coder unexpectedly touched a source file, the guard flips the card back onto the normal review path there. If NOT provisionally trivial → run steps 3, 3a, 4 as normal.
-3. **(skip when provisionally trivial — see 2c)** Invoke the **codebase-architect** agent (MUST per AGENTS.md) to understand the relevant codebase area, existing patterns, and architecture before any implementation. When `features.has_lsp_layer: true`, the architect uses LSP find-references for identifier-shaped lookups — this needs NO handoff from the orchestrator: the architect reads `features.has_lsp_layer` from `baldart.config.yml` directly (the flag is ambient) per `agents/code-search-protocol.md`. Likewise, when `features.has_code_graph: true`, the architect uses the Graphify code graph for structural/relational lookups (ambient flag) per `agents/code-graph-protocol.md`. The orchestrator does NOT propagate either flag. (Earlier doc versions numbered this step 4; the step that read project-status BEFORE the architect was removed because it persisted pre-analysis context — see step 3a.)
+3. **(skip when provisionally trivial — see 2c)** Invoke the **codebase-architect** agent (MUST per AGENTS.md) to understand the relevant codebase area, existing patterns, and architecture before any implementation. When `features.has_lsp_layer: true`, the architect uses LSP find-references for identifier-shaped lookups — this needs NO handoff from the orchestrator: the architect reads `features.has_lsp_layer` from `baldart.config.yml` directly (the flag is ambient) per `agents/code-search-protocol.md`. Likewise, when `features.has_code_graph: true`, the architect uses the Graphify code graph for structural/relational lookups (ambient flag) per `agents/code-graph-protocol.md`. The orchestrator does NOT propagate either flag. (Earlier doc versions numbered this step 4; the step that read project-status BEFORE the architect was removed because it persisted pre-analysis context — see step 3a.) **Pass `OUTPUT=terse` in the architect prompt (since v4.49.0)** — this is grounding for the coder/reviewer, not an explanation, so the architect returns its structured contract (reuse table + canonical evidence + `path:line — symbol` rows + `totals:`) without narrative prose. The verbatim baseline persisted at step 5b keeps all the substance a lean `/codexreview` needs (paths, signatures, patterns, high-risk paths) while staying small in the orchestrator's context and on disk.
 3a. Update `${paths.references_dir}/project-status.md` Active Code Context (skip when the file does not exist in the project) — do this AFTER the codebase-architect run (step 3) so the "Active Code Context" reflects the architect's findings (which files are actually in scope), not just the card YAML's `files_likely_touched`. Writing it before the architect run would persist pre-analysis claims that downstream agents (e.g. a parallel card) would then read as truth.
 4. **Plan-auditor grounding check** — but first, a **provenance skip (since v4.7.0 — mirror of Step 3d)**: the `/prd` Step 6.6 holistic audit already ran a per-card grounding pass when the card was authored. Re-grounding here is only worth a plan-auditor spawn if something changed since. Skip the plan-auditor when **BOTH** hold:
    - **P1 — provenance present**: the card has `metadata.holistic_audit` with non-empty `audited_at`, `audited_commit`, `audited_set`.

package/framework/.claude/skills/new2/SKILL.md CHANGED Viewed

@@ -5,9 +5,11 @@ description: >
   EXPERIMENTAL workflow-hosted variant of /new (A/B testing). Implements one or
   more backlog cards end-to-end by delegating the WHOLE batch to a background
   dynamic workflow — so subagent output never enters the main orchestrator
-  context. Fully autonomous (zero AskUserQuestion): every /new gate is replaced by
-  a deterministic policy + a self-healing resolution pass. Claude-only (needs the
-  Workflow tool). Usage: /new2 CARD-IDS (same arg grammar as /new). Triggers on:
+  context. The batch runs autonomously (zero AskUserQuestion during the run): every
+  /new gate is replaced by a deterministic policy + a self-healing resolution pass;
+  in interactive mode an optional post-batch escape hatch can hand the hard-case
+  follow-ups to /new for the real human gate. Claude-only (needs the Workflow tool).
+  Usage: /new2 CARD-IDS (same arg grammar as /new). Triggers on:
   /new2, "implementa le card con workflow", "new2".
 ---
@@ -17,12 +19,18 @@ description: >
 > default, and the recovery-safe path. Do NOT route to `new2` unless the user
 > explicitly asks for it.
-> **ZERO-ASK CONTRACT.** A dynamic workflow cannot prompt the user mid-run. `new2`
-> therefore runs the entire batch autonomously: every `/new` `AskUserQuestion`
-> gate is replaced by a deterministic policy (auto-resolve seamless defaults, or
-> fail → self-healing `new2-resolve`, or — last resort — auto-materialise a
-> tracked follow-up card). Destructive/outward ops (`reset --hard`, force-push,
-> stash drop) are NEVER auto-run; they degrade to "leave intact + report".
+> **ZERO-ASK CONTRACT — scoped to the *batch*, not the skill.** A dynamic workflow
+> cannot prompt the user mid-run, so the **workflow runs the entire batch autonomously**:
+> every `/new` `AskUserQuestion` gate *inside the batch* is replaced by a deterministic
+> policy (auto-resolve seamless defaults, or fail → self-healing `new2-resolve`, or — last
+> resort — auto-materialise a tracked follow-up card). The **skill main loop** (which CAN
+> prompt) may interact at exactly two boundaries that are NOT mid-batch: **pre-launch**
+> (Step 2 card-ID question, Step 3.5 migration gate) and **post-batch** (Step 3b
+> escape-hatch escalation + Step 5 reconciliation) — both are interactive-only and skipped
+> in autonomous mode (`BALDART_AUTONOMOUS`/`CI`/`GITHUB_ACTIONS`). The zero-ask invariant is
+> about the **workflow during the batch**, which stays untouched. Destructive/outward ops
+> (`reset --hard`, force-push, stash drop) are NEVER auto-run; they degrade to "leave intact
+> + report".
 ## Project Context
@@ -226,10 +234,49 @@ returns when the batch is done. It returns:
    per-card **skip-completed** guard makes the resume idempotent — already-committed
    cards are skipped, only the incomplete/blocked ones run. Repeat until `degraded`
    is false (or the same cards stall twice → surface to the user).
+3b. **Escape-hatch escalation for the hard cases (INTERACTIVE mode only — the `new2`
+   "relaxation").** `new2` is autonomous *during the batch* — but a genuinely-blocked
+   card (the workflow rolled it back / left it `IN_PROGRESS`, DoD not met) is exactly the
+   "edge case that wants human intelligence" the deterministic policy cannot supply. The
+   sound way to give that intelligence is NOT to re-implement a gate here (that would twin
+   `/new`'s Phase 2.5b and bypass review/F-029) — it is to hand the card's **already-tracked
+   follow-up** to `/new`, which owns the real per-card pipeline (worktree + review + the
+   interactive AC-Closure gate + F-029 + gated merge). Ordering is load-bearing: this runs
+   **after** step 1 materialised every follow-up on disk and step 3's resume converged, so
+   the offer is purely additive over an already-safe ledger — declining (or a closed
+   terminal) never drops a residual.
+   - **Skip this step entirely in AUTONOMOUS mode** (env `BALDART_AUTONOMOUS` / `CI` /
+     `GITHUB_ACTIONS` set, or no TTY) — leave the cards `IN_PROGRESS` + their follow-ups,
+     exactly as before. The escape hatch is interactive-only.
+   - **Eligible set** = the follow-ups whose residual `deferralClass` is **code-actionable**:
+     `unresolved`, `out-of-ownership`, `scope-expansion`. EXCLUDE `owner-gated` /
+     `not-a-code-defect` / `policy-deferred-ac` (external infra steps — `/new` cannot perform
+     a DB deploy / secret / DNS action, so escalating them is noise; they stay tracked
+     follow-ups). If the eligible set is empty → skip silently.
+   - In interactive mode, present **ONE batched `AskUserQuestion`** (never one-per-residual —
+     that would re-introduce the ~25-question profile `new2` exists to remove): *"N card sono
+     rimaste IN_PROGRESS / con residui code-actionable (DoD non soddisfatta) — i follow-up
+     sono già tracciati su disco. Vuoi che lanci `/new` su quei follow-up adesso, per chiuderli
+     col gate umano completo?"* Options: **[Sì — lancia `/new` sui follow-up]** / **[No —
+     lasciali tracciati]**.
+     - **Sì** → invoke `/new <followup-id …>` via the **Skill tool**, passing the materialised
+       follow-up card IDs. `/new` runs its full pipeline on the current trunk; do NOT
+       re-implement any of it here and do NOT mark anything DONE yourself — `/new` closes each
+       follow-up through its own gates. (This is post-batch follow-up work at the skill layer —
+       the same class as the Step 3.5 / Step 5 skill interactions; the autonomous workflow has
+       already returned, so the zero-ask-**during-batch** invariant is untouched.)
+     - **No** → leave as-is (prior behaviour).
+   - **Honest limitation (do not over-sell):** this is post-batch — it gives the human the real
+     gate on the *follow-up*, but it does NOT salvage a card *before* its merge (the workflow
+     already merged the committed cards). Pre-merge salvage would require a mid-batch checkpoint
+     (out of scope by design — the workflow is autonomous).
+   - Record `escape_hatch: { eligible: N, offered: <bool>, ran_new: <bool>, followups: [...] }`
+     in telemetry (step 5 below) so the A/B stays honest about when the hatch was used.
 4. **Present.** Print `report` verbatim. Surface `residuals` prominently
    ("questi residui sono tracciati come follow-up: …") — the post-run review that
    replaced the ~25 mid-run questions. If `degraded`, say so plainly (the run was
-   incomplete and resumed).
+   incomplete and resumed). If the escape hatch ran `/new` (step 3b), fold its outcome
+   into the presentation (which follow-ups were closed by `/new`).
 5. **Record truthful telemetry — reconciled against disk (F-040).** Before appending `telemetry`
    to `${metricsDir}/skill-runs.jsonl`, fill the fields the workflow could not compute and
    **reconcile the report against the real disk state** (agent `reason` strings can over-claim — a
@@ -264,6 +311,10 @@ returns when the batch is done. It returns:
    already satisfied (work the skill used to suppress by hand; a persistently high value signals
    deferrals resolving too late — order the dependent card earlier), and `owner_gated_deduped` > 0
    means N defers were collapsed to one external action.
+   Also record `escape_hatch: { eligible, offered, ran_new, followups }` (Step 3b) — it keeps the
+   A/B honest about when the post-batch human escalation was used and whether the user chose to run
+   `/new` on the hard-case follow-ups (vs leaving them tracked). In autonomous mode it is
+   `{ eligible:N, offered:false, ran_new:false }`.
    Do NOT re-summarise the cards — the workflow already did.
 6. **Process hygiene — reap orphaned Codex MCP servers (NON-BLOCKING).** The batch's per-card Codex
    finder calls drive `codex app-server`, whose broker spawns the `~/.codex/config.toml` MCP servers

package/framework/.claude/skills/prd/references/discovery-phase.md CHANGED Viewed

@@ -663,9 +663,11 @@ section of the state file. Mark dimension as covered.
    If more specific detail is needed beyond what context-primer provided,
    invoke `codebase-architect` agent (foreground) with a targeted question:
    ```
+   OUTPUT=terse — machine consumer (discovery dimension resolution), not human Q&A.
    For feature `<slug>`: regarding the dimension `<dimension>`, does the codebase
    already answer this? Specifically check: existing collections, routes, components,
-   permissions, and patterns. Report what you find with file references.
+   permissions, and patterns. Report what you find with file references — structured
+   rows (`path:line — symbol — note`) + `totals:`, no narrative prose.
    ```
    Avoid launching codebase-architect when the context-primer output already
    provides sufficient information for the dimension.
@@ -743,6 +745,8 @@ Instead, run the ISA scan and present findings for validation.
 1. Invoke `codebase-architect` agent with this specific prompt:
    ```
+   OUTPUT=terse — machine consumer (ISA table builder), not human Q&A: per-touchpoint
+   rows only, no narrative overview.
    For feature `<slug>`: perform an Integration Surface Analysis (ISA).
    Find ALL existing code that should reference, link to, or interact with this
    new feature. Scan these categories systematically:

package/package.json CHANGED Viewed

@@ -1,6 +1,6 @@
 {
   "name": "baldart",
-  "version": "4.47.0",
+  "version": "4.49.0",
   "description": "Claude Agent Framework - Reusable framework for coordinating AI agents and humans in software projects",
   "bin": {
     "baldart": "./bin/baldart.js"