baldart 4.46.0 → 4.48.0

package/CHANGELOG.md CHANGED
@@ -5,6 +5,23 @@ All notable changes to BALDART will be documented in this file.
5
5
  The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
6
6
  and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
7
7
 
8
+ ## [4.48.0] - 2026-06-16
9
+
10
+ **`new2` is relaxed: a post-batch, interactive-only escape hatch hands genuinely-blocked hard cases to `/new`'s real human gate — without re-introducing rigidity, a twin, or breaking the A/B.** `new2` runs the batch autonomously, so a card the deterministic policy can't salvage becomes a tracked follow-up + is left `IN_PROGRESS` — the "edge case that wants human intelligence" the user flagged. A 3-lens **adversarial review before implementation** killed the obvious fix (a Step-5 `AskUserQuestion` that lets the skill implement/resolve the residual): (1) **correctness** — the workflow auto-merges and removes the worktree *before* Step 5, so a skill-side fix has no worktree, lands unreviewed code on trunk, and bypasses the F-029 DONE gate; (2) **value** — on the only run with a full `deferral_breakdown` (14 residuals) a post-batch fix changes the outcome of *zero* of them (the batch already merged); only a mid-batch checkpoint could salvage-before-merge, and the data (n=1) doesn't justify breaking `new2`'s background/no-poll contract; (3) **prior-art** — it would twin `/new`'s Phase 2.5b AC-Closure gate and muddy the autonomous-vs-`/new` A/B. What survived all three: the sound escalation is **not to re-implement a gate but to invoke `/new` on the already-materialised follow-up** — `/new` owns the real worktree+review+AC-Closure+F-029+merge pipeline. **MINOR** (additive skill behaviour, interactive-only, autonomous-mode-safe; **no new `baldart.config.yml` key** ⇒ schema-change propagation rule N/A; no removed surface).
11
+
12
+ ### Changed
13
+
14
+ - **`framework/.claude/skills/new2/SKILL.md`** — new **Step 3b "Escape-hatch escalation"** in the skill's post-batch reconciliation: in INTERACTIVE mode only (skipped when `BALDART_AUTONOMOUS`/`CI`/`GITHUB_ACTIONS` is set), after follow-ups are materialised on disk (offline-safe ordering preserved — the offer is additive over an already-safe ledger) and any `degraded` resume has converged, it presents **one batched `AskUserQuestion`** offering to run `/new` on the **code-actionable** hard-case follow-ups (`deferralClass ∈ {unresolved, out-of-ownership, scope-expansion}`; `owner-gated`/`not-a-code-defect`/`policy-deferred-ac` excluded — `/new` can't perform infra steps). "Sì" invokes `/new <followup-id …>` via the Skill tool (which closes them through its own gates — the skill never marks DONE itself, never re-implements the gate). The **ZERO-ASK CONTRACT** banner is rewritten to scope it precisely to *the workflow during the batch* (the skill may interact pre-launch AND post-batch, interactive-only). New `escape_hatch` telemetry field. Honest limitation documented: post-batch, so it gives the human gate on the follow-up but does NOT salvage a card before its merge (that would need a mid-batch checkpoint — out of scope by design).
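The Step 3b selection rule can be pictured with a small model. This is an illustrative Python sketch, not the skill's implementation (the skill itself is a Markdown prompt). The `FollowUp` record, the function names, the `FU-*` ids, and the choice to treat any non-empty variable as "set" are assumptions; only the environment variable names and the `deferralClass` values come from the entry above.

```python
from dataclasses import dataclass

# deferralClass values the changelog lists as code-actionable.
CODE_ACTIONABLE = {"unresolved", "out-of-ownership", "scope-expansion"}
# Variables whose presence switches the escape hatch off (autonomous / CI runs).
NON_INTERACTIVE_VARS = ("BALDART_AUTONOMOUS", "CI", "GITHUB_ACTIONS")


@dataclass(frozen=True)
class FollowUp:  # hypothetical record shape
    followup_id: str
    deferral_class: str


def is_interactive(env):
    """Assumption: a variable counts as 'set' when present and non-empty."""
    return not any(env.get(name) for name in NON_INTERACTIVE_VARS)


def escape_hatch_candidates(followups, env):
    """Return the follow-up ids a single batched question would offer to /new."""
    if not is_interactive(env):
        return []
    return [f.followup_id for f in followups if f.deferral_class in CODE_ACTIONABLE]


followups = [
    FollowUp("FU-1", "unresolved"),
    FollowUp("FU-2", "owner-gated"),      # excluded: not code-actionable
    FollowUp("FU-3", "scope-expansion"),
]
print(escape_hatch_candidates(followups, {}))              # ['FU-1', 'FU-3']
print(escape_hatch_candidates(followups, {"CI": "true"}))  # []
```

An empty candidate list means no question is asked at all, which matches the "interactive-only, additive" framing: the follow-ups are already on disk either way.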
15
+
16
+ ## [4.47.0] - 2026-06-16
17
+
18
+ **`/new`'s orchestrator context economy is re-aimed at its real driver — turn count — and the user-visible Progress Bar + native Task spine are removed.** Telemetry of two real 8-card batches (FEAT-0028/0029 on a consumer) showed the orchestrator paying ~285M `cache_read`: 613 turns each replaying a ~490k-token accumulated context (growing toward ~800k), so total cost ≈ **turn count × accumulated context**. A 3-lens **adversarial review before implementation** refuted the obvious diagnoses: the static prefix is only ~77k (not the ~225k first assumed — context is ~86% *accumulated*, not static); narration prose is only ~7% of the fuel; and the existing § "Context economy" (bulk-content-inline) rule targets a channel that totals only ~119k cumulatively. The measurement reviewer surfaced the actual missed lever — **0 of 274 tool turns batched any calls, and ~55% of turns carried no tool call at all** — and the correctness + prior-art reviewers established that delegating bookkeeping out of the orchestrator is a previously-trodden trap (v4.15.0 reverted a Write-from-memory tracker flush; the tracker is the recovery SSOT; `card_status: DONE` needs orchestrator-side disk re-read; the weak-subagent fabrication precedent applies). What survived: (1) a new turn-economy HARD RULE (batch independent tool calls; no narration-only turns; never poll/wait), and (2) since the Progress Bar + Task spine are pure *mirrors* of the internal tracker (recovery reads the tracker, never the spine), removing them is correctness-safe and eliminates ~45 dedicated visibility turns (~8% of a batch's `cache_read`, guaranteed, not batching-dependent). **MINOR** (skill behaviour change; **no new `baldart.config.yml` key** ⇒ schema-change propagation rule N/A; no removed agent/command/skill/routine).
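The cost model in this entry (total cost ≈ turn count × accumulated context) can be checked with back-of-envelope arithmetic. The sketch below uses only the figures quoted above and treats the ~490k median as a flat per-turn cost, which is a simplification: early turns replay less and late turns more. The function and constant names are illustrative.

```python
def replay_cost(turns, context_tokens):
    """Tokens read from cache when every turn replays the accumulated context."""
    return turns * context_tokens


MEDIAN_CONTEXT = 490_000          # ~490k tokens per replay (quoted above)
TOTAL_CACHE_READ = 285_000_000    # ~285M measured cache_read (quoted above)
VISIBILITY_TURNS = 45             # dedicated Progress Bar / Task spine turns

estimate = replay_cost(613, MEDIAN_CONTEXT)
visibility = replay_cost(VISIBILITY_TURNS, MEDIAN_CONTEXT)
share = visibility / TOTAL_CACHE_READ

print(f"613 turns x 490k = {estimate / 1e6:.0f}M tokens")  # 300M
print(f"visibility share = {share:.1%}")                    # 7.7%
```

The flat estimate lands in the same range as the measured ~285M, and 45 turns at the median context come to roughly the ~8% share the entry attributes to visibility turns.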
19
+
20
+ ### Changed
21
+
22
+ - **`framework/.claude/skills/new/SKILL.md`** — replaced the `## Progress Visibility (MANDATORY)` section (native Task spine + transition Progress Bar + Phase→ledger mapping) with `## State surface — the tracker only`: the internal `/tmp/batch-tracker-*.md` is now the SINGLE state surface (it was already the recovery SSOT; the spine/bar were never read by recovery). Added a second HARD RULE to § "Context economy" — **"turn count is the multiplier"** — instructing the orchestrator to (1) batch independent tool calls into one message, (2) never emit a narration-only turn, (3) never poll/wait on background work. The § "Context Tracking" mirror bullet and the § "Routing" core-invariant list are reconciled.
23
+ - **`framework/.claude/skills/new/references/{setup,implement,commit,team-mode,final-review,merge-cleanup}.md`** — every `**→ Visibility**: TaskUpdate … / emit a Progress Bar` step is removed; the underlying *state* writes (card `IN_PROGRESS` claim at implement.md 2b, `card_status: DONE (verified)` at commit.md 29 / team D.6) are **preserved** (they are correctness, not visibility) and now explicitly note the tracker is the only surface. Pre-flight no longer creates a Task spine (setup.md step 2b).
24
+
8
25
  ## [4.46.0] - 2026-06-15
9
26
 
10
27
  **Worktree env-file copy is unified onto one stack-agnostic config key, `stack.env_files` — closing the v4.42.0-deferred divergence (`/nw` copied `.env.local`+`.env`; new2's pre-flight copied `env/.env.local/.env.example/supabase/.temp`) WITHOUT the superset the adversarial pass had refuted.** A worktree is a fresh checkout, so the gitignored env artifacts a build needs must be copied from main — but the SET was hard-coded and divergent across paths. This release makes the set a single SSOT list (`stack.env_files`, default `['.env.local', '.env']`) read identically by `worktree-manager` (`/nw`), `/new`, and `new2`. A 3-skeptic **adversarial review before implementation** shaped the design: it (1) confirmed `framework/agents/runbook.md:36` (`cp .env.example .env`) is a documentation-template onboarding idiom — a DIFFERENT context — and **excluded** it; (2) refuted folding `supabase/.temp` into the copy set (it carries the remote project-ref → every copying worktree auto-links to the shared remote, the exact footgun `stack.schema_deploy_from_trunk_only` exists to prevent, and worse unattended in `/new` than in manual `/nw`; it is not even a build input); and (3) found the bash bug class fixed below. Crucially, the layer is **stack-agnostic**: the generic skill/installer/template never name Supabase or any stack — copying a tool's local-state directory is a per-project opt-in the user adds to their own `stack.env_files` / overlay, never a framework default. **MINOR** (additive: a new `baldart.config.yml` key, propagated end-to-end per the schema-change rule; no removed surface).
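To make the single-list idea concrete, here is a minimal Python sketch of copying a configured set of env artifacts from the main checkout into a fresh worktree. It is not the `worktree-manager` implementation: the function name, the skip-if-missing behaviour, and the recursive handling of directories are assumptions. Only the default list `['.env.local', '.env']` comes from the entry above.

```python
import shutil
import tempfile
from pathlib import Path

DEFAULT_ENV_FILES = [".env.local", ".env"]  # default stated in the changelog


def copy_env_files(main_root, worktree_root, env_files=None):
    """Copy each configured env artifact that exists in main into the worktree.

    Returns the relative paths actually copied. Missing entries are skipped,
    which is an assumption of this sketch rather than documented behaviour.
    """
    main_root, worktree_root = Path(main_root), Path(worktree_root)
    copied = []
    for rel in (DEFAULT_ENV_FILES if env_files is None else env_files):
        src, dst = main_root / rel, worktree_root / rel
        if src.is_dir():
            # Directories only appear if a project opted in via its own list.
            shutil.copytree(src, dst, dirs_exist_ok=True)
        elif src.is_file():
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
        else:
            continue
        copied.append(rel)
    return copied


with tempfile.TemporaryDirectory() as tmp:
    main, worktree = Path(tmp, "main"), Path(tmp, "worktree")
    main.mkdir()
    worktree.mkdir()
    (main / ".env.local").write_text("API_URL=http://localhost\n")
    copied = copy_env_files(main, worktree)  # no .env in main, so it is skipped
    print(copied)  # ['.env.local']
```

Because every caller reads the same list, adding a tool's local-state directory stays a per-project decision: it has to be written into that project's own list and is never implied by the default.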
package/VERSION CHANGED
@@ -1 +1 @@
1
- 4.46.0
1
+ 4.48.0
@@ -129,80 +129,25 @@ Trunk branch: [resolved git.trunk_branch — Phase 0 step 0 populates]
129
129
  - **card_status: DONE (verified)** — confirms the backlog YAML was updated and re-read to verify
130
130
  - **When blocked**: log the blocker in `## Issues & Flags`.
131
131
  - **On context recovery**: if you ever feel lost or after context compaction, IMMEDIATELY read your tracker file (`/tmp/batch-tracker-<FIRST-CARD-ID>.md`) to restore your state.
132
- - **Mirror every transition to the user (MANDATORY)**: this tracker file is internal — the user does NOT see it. Every transition you log here MUST also be reflected in the **two user-visible surfaces** defined in § "Progress Visibility": the native **Task spine** (TaskUpdate) and, at transition boundaries, the **Progress Bar**. The tracker is the recovery SSOT; the Task spine + Progress Bar are its live mirror.
132
+ - **The tracker is internal and is the ONLY state surface** (since v4.47.0; see § "State surface — the tracker only"): the user does NOT see it, and there is no separate user-visible mirror to keep in sync. Do NOT spend a dedicated turn restating a transition to the user (no TaskUpdate, no Progress-Bar block); surface progress only as a short prose line folded into a turn you are already taking for real work (§ "Context economy" turn-count rule).
133
133
 
134
134
  ---
135
135
 
136
- ## Progress Visibility (MANDATORY)
136
+ ## State surface — the tracker only
137
137
 
138
- The `/tmp/batch-tracker-*.md` file is your **internal** recovery SSOT; the user never sees it. Because `/new` runs largely autonomously (YOLO mode, long stretches of agent spawns), the user otherwise has no idea which phase is running, which wave of the team mode is in flight, or which gate was just resolved vs skipped. This section defines the **two user-visible surfaces** that fix that. They are the live mirror of the tracker; keeping them in sync is non-negotiable.
138
+ `/new` keeps a **single** state surface: the internal `/tmp/batch-tracker-<FIRST-CARD-ID>.md`
139
+ recovery SSOT (§ "Context Tracking"). There is **no separate user-visible mirror**. The Progress
140
+ Bar markdown block and the per-transition native Task spine (TaskCreate/TaskUpdate) were **removed
141
+ for context economy (v4.47.0)**: every standalone TaskUpdate / Progress-Bar emission was its own
142
+ orchestrator turn paying a full accumulated-context replay (~490k tokens median, measured ~8% of a
143
+ batch's cache_read) for zero correctness value — the tracker already holds the authoritative state,
144
+ and recovery (§ "Context recovery protocol") reads the tracker, never the Task spine.
139
145
 
140
- > **HARD RULE — visibility.** (1) Maintain the native **Task spine** via TaskCreate/TaskUpdate. (2) Emit the **Progress Bar** at every transition boundary (defined below). Both mirror the tracker. Failing to update either after a transition is a protocol violation.
141
-
142
- ### A. Task spine: native TaskCreate/TaskUpdate (the always-visible panel)
143
-
144
- The native task list is what Claude Code renders as a **persistent, always-visible todo panel** (this is exactly how `/prd` gets its beloved todo list — see `framework/.claude/skills/prd/SKILL.md` HARD RULE 4). Keep it **coarse**: one task per card plus a few batch-level framing tasks. Do NOT create a task per phase (that buries the overview).
145
-
146
- **When to create (once):** right after Pre-flight has resolved the execution mode (sequential vs team) and, in team mode, the wave layout (`execution_strategy.groups[].level`) — because only then do you know the wave label for each card. Create, in this order:
147
-
148
- 1. `Pre-flight` — framing task (already in progress by the time you create the spine; mark it `in_progress` immediately and `completed` as soon as the worktree is ready).
149
- 2. One task **per card**, in queue order. Subject:
150
- - **team mode** (wave-labelled): `Wave <level> · <CARD-ID> — <title>` (e.g. `Wave 1 · FEAT-0502 — Add merchant filter`).
151
- - **sequential mode**: `<CARD-ID> — <title>` (no wave prefix).
152
- 3. `Final review` — framing task.
153
- 4. `Merge & cleanup` — framing task.
154
-
155
- **State transitions (TaskUpdate):**
156
-
157
- | Task | → `in_progress` | → `completed` |
158
- |------|-----------------|---------------|
159
- | `Pre-flight` | at spine creation | worktree ready (end of Pre-flight) |
160
- | card task | **Phase 1 step 2b** (claim — when you set the card `status: IN_PROGRESS`) | **Phase 4 step 28** (after commit + `card_status: DONE (verified)`) |
161
- | `Final review` | start of Final Review (F.1) | Final Review complete (fixes applied, build green) |
162
- | `Merge & cleanup` | start of Phase 6 | end of Phase 6c |
163
-
164
- **Live sub-state in the subject (optional but encouraged):** while a card task is `in_progress`, you MAY append the current phase to its subject so the panel shows live movement between Progress Bar emissions — e.g. `Wave 1 · FEAT-0502 — Add merchant filter · Phase 3.7 Codex 🔄`. Strip the suffix when the card completes.
165
-
166
- In team mode the cards of a wave run in parallel: set ALL of a wave's card tasks `in_progress` when that wave's coders are spawned (Step B), and complete each as its per-card pipeline reaches Phase 4 (team Step D.5/D.6).
167
-
168
- ### B. Progress Bar — markdown, at transition boundaries
169
-
170
- Append this block to your message **at the heavy transition boundaries only** (NOT on every message, and NOT on every intra-card phase change — that accumulates markdown across long autonomous runs; per § "Context economy"). A **heavy transition boundary** is any of: a **card change**, a **wave change**, **any gate decision** (resolved or skipped), and every `AskUserQuestion` / STOP. **Intra-card phase movement** (e.g. Phase 2 → 2.5 → 2.55 within the same card, with no gate decision) is shown via the **Task spine live sub-state** (TaskUpdate, § A above) — NOT by re-emitting this full markdown block each phase.
171
-
172
- ```
173
- ---
174
- 📋 **Progresso /new: <batch-id>** — modalità <sequential|team> · Wave <X>/<Y>
175
-
176
- | Card | Wave | Fase corrente | Stato |
177
- |------|------|---------------|-------|
178
- | FEAT-0501 | 0 | Phase 4 Commit | ✅ |
179
- | FEAT-0502 | 1 | Phase 3.7 Codex | 🔄 |
180
- | FEAT-0503 | 1 | — | ⬜ |
181
-
182
- Gate ledger (card corrente): 2.5b AC-Closure ✅ · 2.55 Simplify ✅ · 2.6 E2E ⏭️ (backend-only) · 3 Doc ⏭️ (light, no-doc-diff→Final) · 3.5 QA ⏭️ (balanced→Final) · 3.7 Codex 🔄
183
- Prossimo passo: <cosa succede dopo>
184
- ```
185
-
186
- - Legend: `⬜ da fare · 🔄 in corso · ✅ risolto · ⏭️ skippato`.
187
- - The `Wave` column and the `Wave <X>/<Y>` header are present in **team mode** only; in sequential mode drop the column and write `modalità sequential` with no wave counter.
188
- - The **Gate ledger** line shows the current card's per-card gates (see § C for which). Each entry is `<gate> <state>`; a skipped entry MUST carry its reason **verbatim from the enumerated Gate-table / fast-lane skip reasons** — e.g. `IS_TRIVIAL`, `review_profile=light`, `backend-only diff`, `features.has_e2e_review:false`, `balanced→Final`, `holistic_audit provenance`. **Never invent a reason.** A ledger entry reading `⏭️ (time budget)` / `(to save tokens)` / any model-invented constraint is itself a protocol violation (see the top-of-file "NO PHASE SKIP FOR PERCEIVED TIME" clause) — the ledger exists to make skips auditable, not to license them.
189
-
190
- ### C. Phase → ledger mapping
191
-
192
- These are the per-card gates tracked in the Gate ledger line (the rest of the pipeline — Phase 1 context, Phase 2 implement, Phase 4 commit — is shown in the table's "Fase corrente" column, not the ledger):
193
-
194
- | Ledger entry | Phase | Skipped when (enumerated reason) |
195
- |--------------|-------|----------------------------------|
196
- | `2.5b AC-Closure` | Phase 2.5b | never (BLOCKING, unconditional) |
197
- | `2.55 Simplify` | Phase 2.55 | `IS_TRIVIAL` |
198
- | `2.6 E2E` | Phase 2.6 | `features.has_e2e_review:false`, backend-only diff, or card type ∈ {backend/api/db/infra/docs/chore/config} |
199
- | `3 Doc` | Phase 3 | `review_profile=light` AND no doc files in diff → deferred to Final F.3 |
200
- | `3.5 QA` | Phase 3.5 | `skip`/`light` (tests already ran Phase 2), or `balanced`→Final (unless Step-A escalation) |
201
- | `3.7 Codex` | Phase 3.7 | `IS_TRIVIAL` only (otherwise unconditional; `light`/`full` is depth, not skip) |
202
-
203
- In **team mode** the same gates map to the per-card team sub-steps (D.3a AC-Closure, D.3b Simplify, D.3c E2E, D.4 QA, D.4a/D.2 doc, D.4b Codex) — track them identically. The pre-flight cross-card Codex check (Step 3d) and plan-auditor grounding (Phase 1 step 4) are batch/card-setup gates: surface their resolve/skip in the `Prossimo passo` line or a one-off ledger note, not as recurring per-card rows.
204
-
205
- ---
146
+ > **HARD RULE — no dedicated visibility turns.** Do NOT spend a turn (a TaskUpdate, a Progress-Bar
147
+ > markdown block, or a narration-only message) whose sole purpose is to restate progress. Surface
148
+ > progress to the user ONLY as a short natural-prose line **folded into a turn you are already
149
+ > taking** for real work — per § "Context economy" → turn-count rule. `STOP` / `AskUserQuestion`
150
+ > moments are surfaced as before (they carry a real decision, not a status restatement).
206
151
 
207
152
  ## Context economy (MANDATORY)
208
153
 
@@ -241,6 +186,30 @@ baselines. Keep that bulk on disk and pass **paths**, not bodies.
241
186
  > nor what the orchestrator is allowed to read at a decision point — only that bulk arrives via a
242
187
  > path the consumer opens itself, not via the orchestrator's own context.
243
188
 
189
+ > **HARD RULE — turn count is the multiplier (this is the dominant cost).** Measurement of real
190
+ > batches shows the bulk-content rule above is necessary but NOT where most tokens go: the
191
+ > orchestrator's accumulated context (~490k tokens median, growing toward ~800k on long batches) is
192
+ > **replayed in full on EVERY turn** via `cache_read`. So total cost ≈ **turn count × accumulated
193
+ > context** — and the cheapest turn (a lone `cd`, a single TaskUpdate, a one-line narration) pays the
194
+ > SAME ~490k replay as the most expensive one. A measured 8-card batch ran **613 orchestrator turns,
195
+ > of which 0 batched any tool calls and ~55% carried no tool call at all.** Cut turns:
196
+ > 1. **Batch independent tool calls into ONE message.** Whenever you need ≥2 tool calls with no data
197
+ > dependency between them — several `Read`s, a `cd` plus the command that follows it, `Edit`s to
198
+ > different files, a status `Edit` to the tracker alongside the next real action — emit them in a
199
+ > **single assistant message**, not one-per-turn. A run that issues 270 tool calls one-per-turn
200
+ > pays ~270 full-context replays; batched ~3-for-1 it pays ~90. Never split independent mechanical
201
+ > calls across turns "for clarity."
202
+ > 2. **No narration-only turns.** Never emit a turn whose only content is a progress recap, a
203
+ > "now I will…" preamble, or a status restatement. If a status note matters, fold it into the same
204
+ > message as the next tool call. (Per § "State surface" the tracker is the only state surface — and
205
+ > you write it via batched `Edit`s, never narrate it.)
206
+ > 3. **Never poll or wait.** Background subagents and background `Bash` re-invoke you automatically on
207
+ > completion — end your turn after spawning; never `sleep N; echo "waiting…"` (already stated for
208
+ > team mode in `references/team-mode.md` — it applies to every barrier in both modes).
209
+ >
210
+ > These rules reduce the *number* of replays; the bulk-content rule above reduces the *size* of each.
211
+ > Both compound — apply them together.
212
+
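The batching arithmetic in rule 1 can be reproduced in a few lines. This is a sketch under the rule's own simplification (each turn replays one full context at the ~490k median); the function and constant names are illustrative.

```python
import math


def replays(tool_calls, calls_per_turn):
    """Number of turns (full-context replays) needed to issue the tool calls."""
    return math.ceil(tool_calls / calls_per_turn)


CONTEXT = 490_000  # ~490k tokens replayed per turn (median quoted above)

one_per_turn = replays(270, 1)
batched = replays(270, 3)
saved_tokens = (one_per_turn - batched) * CONTEXT

print(one_per_turn, batched, f"{saved_tokens / 1e6:.1f}M tokens saved")
# 270 90 88.2M tokens saved
```

The saving scales with the context size at the time of each turn, so batching late in a long batch is worth more than the flat median suggests.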
244
213
  ---
245
214
 
246
215
  ## Toolchain gates
@@ -268,7 +237,7 @@ Reference modules cite this section as `§ "Toolchain gates"`.
268
237
  system, re-read on EVERY turn. To avoid paying 60k+ tokens of phase instructions on
269
238
  every turn, the step-by-step detail of each phase lives in a **`references/<x>.md` module**
270
239
  loaded on demand. This file (the core) holds only the cross-phase invariants
271
- (Context Tracking, Progress Visibility, § "Context economy", § "Toolchain gates",
240
+ (Context Tracking, State surface, § "Context economy", § "Toolchain gates",
272
241
  Agent Routing, QA Profile, Trivial fast-lane, Risk-signal detector, Fix Application
273
242
  Log) + this navigation map.
274
243
 
@@ -35,7 +35,7 @@
35
35
  c. **Update `${paths.references_dir}/ssot-registry.md`** — add/update the entry for this card's feature area. The pre-commit doc-freshness hook BLOCKS commits that touch `${paths.backlog_dir}/` without a corresponding ssot-registry update. Always include ssot-registry.md in the same commit as the backlog YAML.
36
36
  d. **Verify the write**: re-read the YAML file and confirm `status: DONE` is present. If not, retry the edit.
37
37
  e. Stage BOTH the updated YAML AND ssot-registry.md, then commit (or as an immediate follow-up commit if the Phase 4 implementation commit already happened). When this produces a SECOND commit for the card, record BOTH hashes in the tracker (`commit: <impl-hash> + <done-hash>`) so traceability/bisect is unambiguous.
38
- 29. **Update tracker**: move card to `## Completed Cards` with commit hash(es), summary, flags, **and `card_status: DONE (verified)`**. **→ Visibility**: TaskUpdate this card's spine task `completed` (strip any live phase suffix) and emit a Progress Bar (card change) per § "Progress Visibility".
38
+ 29. **Update tracker**: move card to `## Completed Cards` with commit hash(es), summary, flags, **and `card_status: DONE (verified)`**. The tracker is the only state surface (§ "State surface — the tracker only"); do not also emit a TaskUpdate or Progress-Bar turn for the transition. If you surface the card completion to the user at all, do it as a short prose line folded into the commit turn, never a dedicated turn.
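The "verify the write" step (set `status: DONE`, re-read from disk, retry if the re-read disagrees) is a write-then-verify loop. The Python sketch below models it on a plain-text file. It is an assumption-laden illustration, not what the orchestrator executes: the regex-based edit, the `max_attempts` bound, the append-if-absent fallback, and the `FEAT-0000` card id are all hypothetical.

```python
import re
import tempfile
from pathlib import Path


def mark_done_verified(card_path, max_attempts=3):
    """Set `status: DONE`, re-read the file from disk, and retry on mismatch.

    Returns True once a fresh read confirms the write, False otherwise.
    """
    path = Path(card_path)
    for _ in range(max_attempts):
        text = path.read_text()
        updated, n = re.subn(r"(?m)^status:.*$", "status: DONE", text, count=1)
        if n == 0:
            # Assumption: append the field when the card has no status line.
            updated = text.rstrip("\n") + "\nstatus: DONE\n"
        path.write_text(updated)
        # Verification reads the file again instead of trusting memory.
        if re.search(r"(?m)^status: DONE\s*$", path.read_text()):
            return True
    return False


with tempfile.TemporaryDirectory() as tmp:
    card = Path(tmp, "FEAT-0000.yml")  # hypothetical card id
    card.write_text("id: FEAT-0000\nstatus: IN_PROGRESS\n")
    ok = mark_done_verified(card)
    final_text = card.read_text()

print(ok)  # True
```

Only after the re-read succeeds would the tracker entry carry `card_status: DONE (verified)`; the point of the pattern is that "verified" refers to disk state, not to the value the writer believes it wrote.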
39
39
 
40
40
  ### Sub-agent failure protocol (since v3.28.3)
41
41
 
@@ -42,7 +42,7 @@ Once ALL cards are committed in the worktree:
42
42
 
43
43
  ### Step F.1 — Resolve scope
44
44
 
45
- **→ Visibility (batch transition)**: all cards are committed. TaskUpdate `Final review` `in_progress` and emit a Progress Bar per § "Progress Visibility". Mark `Final review` `completed` when F.5 finishes (fixes applied, build green).
45
+ (All cards are committed; the batch now enters Final Review. No visibility emission: the internal tracker is the only state surface, see SKILL.md § "State surface — the tracker only". Record the Final-Review start/finish in the tracker only.)
46
46
 
47
47
  1. **Read the tracker file** to get the full picture: card IDs, files changed, commit hashes.
48
48
  2. Gather git evidence in the worktree:
@@ -8,7 +8,7 @@
8
8
  - Read that card's backlog YAML and check its `status` field.
9
9
  - If NOT `DONE` → HALT: log in `## Issues & Flags` and ask the user: "Card <CARD-ID> depends on <DEP-ID> which is `<status>`. Proceed anyway, or wait?" Do not start implementation until the user responds explicitly. **The card status is NOT written until this gate passes** — so a HALT leaves the card untouched (no stale `IN_PROGRESS` for the next run to mis-read).
10
10
  - If `DONE` (or the user chose "proceed anyway") → continue.
11
- 2b. **Claim** — only now set the card status to `IN_PROGRESS` and assign yourself. **→ Visibility**: TaskUpdate this card's spine task `in_progress`, and emit a Progress Bar (card change) per § "Progress Visibility".
11
+ 2b. **Claim** — only now set the card status to `IN_PROGRESS` (in the backlog YAML / tracker) and assign yourself. This is a state write, not a visibility emission; do not also spend a TaskUpdate / Progress-Bar turn (§ "State surface — the tracker only").
12
12
  2c. **Trivial-card classification (BLOCKING gate for steps 3–4)** — evaluate `IS_TRIVIAL(card)` per § "Trivial-card fast-lane". Note: condition 3 (non-source diff) cannot be fully evaluated until the coder has produced the diff, so at this point compute the **provisional** trivial flag from conditions 1+2 only (`review_profile == skip` AND no Step-A trigger sourced from the card YAML text + `files_likely_touched` extensions — if EVERY path in `files_likely_touched` is non-source, condition 3 is provisionally satisfied). If provisionally trivial → **SKIP steps 3, 3a, and 4** (architecture grounding); log `trivial: architecture grounding skipped (review_profile=skip + non-source files_likely_touched + 0 triggers)` and jump to Phase 2. Re-confirm `IS_TRIVIAL` on the ACTUAL committed diff at the review gates (Phase 2.55/3.5/3.7); if the coder unexpectedly touched a source file, the guard flips the card back onto the normal review path there. If NOT provisionally trivial → run steps 3, 3a, 4 as normal.
13
13
  3. **(skip when provisionally trivial — see 2c)** Invoke the **codebase-architect** agent (MUST per AGENTS.md) to understand the relevant codebase area, existing patterns, and architecture before any implementation. When `features.has_lsp_layer: true`, the architect uses LSP find-references for identifier-shaped lookups — this needs NO handoff from the orchestrator: the architect reads `features.has_lsp_layer` from `baldart.config.yml` directly (the flag is ambient) per `agents/code-search-protocol.md`. Likewise, when `features.has_code_graph: true`, the architect uses the Graphify code graph for structural/relational lookups (ambient flag) per `agents/code-graph-protocol.md`. The orchestrator does NOT propagate either flag. (Earlier doc versions numbered this step 4; the step that read project-status BEFORE the architect was removed because it persisted pre-analysis context — see step 3a.)
14
14
  3a. Update `${paths.references_dir}/project-status.md` Active Code Context (skip when the file does not exist in the project) — do this AFTER the codebase-architect run (step 3) so the "Active Code Context" reflects the architect's findings (which files are actually in scope), not just the card YAML's `files_likely_touched`. Writing it before the architect run would persist pre-analysis claims that downstream agents (e.g. a parallel card) would then read as truth.
@@ -4,7 +4,7 @@
4
4
 
5
5
  ## Phase 6 — Post-batch merge & cleanup (delegated to worktree-manager skill)
6
6
 
7
- **→ Visibility (batch transition)**: TaskUpdate `Merge & cleanup` `in_progress` and emit a Progress Bar per § "Progress Visibility". Mark it `completed` at the end of Phase 6c (merge done, worktree removed, workspace reconciled).
7
+ (The batch now enters Merge & cleanup. No visibility emission: the internal tracker is the only state surface, see SKILL.md § "State surface — the tracker only". Record merge/cleanup completion in the tracker only.)
8
8
 
9
9
  After the final review passes AND all cards are committed in the worktree, delegate the entire merge and cleanup to the **worktree-manager** skill (`/mw` in programmatic mode):
10
10
 
@@ -182,7 +182,7 @@
182
182
 
183
183
  When `mode == sequential`, the per-card pipeline below runs exactly as documented. The `execution_strategy.groups` levels are simply ignored. When `mode == team`, skip the per-card pipeline and follow the **Team Mode** section at the end of this document.
184
184
 
185
- **→ Create the Task spine now.** The execution mode and (in team mode) the wave layout are resolved, so create the native Task spine per § "Progress Visibility" A: `Pre-flight` (→ `in_progress`) + one task per card (wave-labelled in team mode) + `Final review` + `Merge & cleanup`. Emit the first Progress Bar with this batch's table. Mark `Pre-flight` `completed` at the pre-flight resume (step 6d), once both background ops have returned and the consolidated tracker flush is written.
185
+ (No user-visible Task spine / Progress Bar is created — the internal tracker is the only state surface; see SKILL.md § "State surface — the tracker only".)
186
186
 
187
187
  3d. **Codex batch cross-card grounding check** (background — launched together with the worktree-setup subagent in step 4, then a single barrier in step 5)
188
188
 
@@ -324,7 +324,6 @@
324
324
  - `## Worktree` — path / branch / slug / port, plus `Created:` = **the subagent block's `created_at`** (worktree-creation time, NOT resume time, so Phase 8's `cycle_time_mins` still spans the build window). On the **4a2 resume** path, `Created:` = the registry entry's `createdAt`; on the **4d inline fallback**, stamp `date -u +%Y-%m-%dT%H:%M:%SZ` at creation. (Never leave `Created:` empty — `cycle_time_mins` anchors on it.)
325
325
  - `## Cross-Card Conflicts (Codex)` — distilled findings (the 3d skip-decision already wrote the `SKIPPED`/`RUN — reason` line; append the distilled findings on the RUN path, nothing to add on SKIP).
326
326
  - In team mode: `## Team Mode` + `## Parallel Groups` (per team-mode.md).
327
- `## Execution Mode` was already written at step 3c (it must exist before the Task spine) — do NOT rewrite it here. **Rationale**: pre-flight is idempotent and cheap to redo (step 4a2's git pre-check guards worktree re-creation), so the data sections do not need mid-flight persistence; per-phase incremental writes resume for card execution, where mid-phase recovery actually matters. The file already exists (the skeleton was created at batch start per § Context Tracking; Phase 0 wrote `## Phase 0`) — backfill, do NOT re-create.
328
- d. **→ Visibility**: mark the `Pre-flight` Task spine entry → `completed` and emit the first wave's Progress Bar.
327
+ `## Execution Mode` was already written at step 3c — do NOT rewrite it here. **Rationale**: pre-flight is idempotent and cheap to redo (step 4a2's git pre-check guards worktree re-creation), so the data sections do not need mid-flight persistence; per-phase incremental writes resume for card execution, where mid-phase recovery actually matters. The file already exists (the skeleton was created at batch start per § Context Tracking; Phase 0 wrote `## Phase 0`) — backfill, do NOT re-create.
329
328
 
330
329
  ---
@@ -57,7 +57,7 @@ its `arch_baseline_path` at this file, so the review reuses the group baseline i
57
57
 
58
58
  #### Step B: Spawn parallel coder agents
59
59
 
60
- **→ Visibility (wave change)**: this is a new wave starting. TaskUpdate ALL of this group's card spine tasks `in_progress`, and emit a Progress Bar with the new `Wave <X>/<Y>` header per § "Progress Visibility".
60
+ (No visibility emission on wave change: the internal tracker is the only state surface; see SKILL.md § "State surface — the tracker only". Record the wave start in the tracker only.)
61
61
 
62
62
  For each card in the current group, spawn a coder agent using the Agent tool. ALL agents for the group MUST be spawned in a **SINGLE message** (multiple Agent tool calls) to run truly in parallel.
63
63
 
@@ -278,8 +278,7 @@ After ALL agents in the group complete successfully:
278
278
  a. Edit the backlog YAML (`${paths.backlog_dir}/<CARD-ID>.yml`): set `status: DONE`, add `completed_date: <today>`, add implementation notes (NEVER include `[USER-APPROVED DEFERRAL]` lines that didn't actually pass through D.3a's gate).
279
279
  b. **Verify the write**: re-read the YAML file and confirm `status: DONE` is present. If not, retry.
280
280
  c. Stage the updated YAML and include it in the card's commit (or as an immediate follow-up commit).
281
- d. Log in tracker: `card_status: DONE (verified)` for each card.
282
- e. **→ Visibility**: TaskUpdate each card's spine task → `completed` (strip any live phase suffix) as it reaches DONE here.
281
+ d. Log in tracker: `card_status: DONE (verified)` for each card — the tracker is the only state surface (§ "State surface — the tracker only"); do not emit a TaskUpdate turn per card.
283
282
  Note: Phase 6b (Status Reconciliation) will catch any card missed here, but aim for zero misses.
284
283
 
285
284
  #### Step D coverage assertion (MANDATORY end-of-group check)
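
The write-then-verify loop in sub-steps a and b of the hunk above can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: the function name `mark_card_done`, the retry limit, and the line-based editing (instead of a real YAML parser) are all assumptions. Only the `status: DONE` and `completed_date` fields come from the text.

```python
import re
from pathlib import Path


def mark_card_done(card_path: Path, today: str, max_attempts: int = 3) -> bool:
    """Set `status: DONE` and `completed_date`, then re-read the file to confirm.

    Hypothetical helper: assumes `status` is a top-level key on its own line.
    """
    for _ in range(max_attempts):
        text = card_path.read_text(encoding="utf-8")
        # Rewrite the first top-level status line, or append one if missing.
        text, replaced = re.subn(r"(?m)^status:.*$", "status: DONE", text, count=1)
        if replaced == 0:
            text = text.rstrip("\n") + "\nstatus: DONE\n"
        # Add the completion date only once.
        if not re.search(r"(?m)^completed_date:", text):
            text = text.rstrip("\n") + f"\ncompleted_date: {today}\n"
        card_path.write_text(text, encoding="utf-8")
        # Verify the write from disk rather than trusting the in-memory buffer.
        on_disk = card_path.read_text(encoding="utf-8")
        if re.search(r"(?m)^status: DONE\s*$", on_disk):
            return True
    return False
```

A caller would log `card_status: DONE (verified)` in the tracker only when this returns `True`; a `False` result is the case Phase 6b (Status Reconciliation) is meant to catch.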
@@ -302,8 +301,7 @@ A missing entry means a sub-step was skipped. An entry whose value is a **`revie
302
301
  After committing all cards in the group:
303
302
  1. Update tracker: move group to done, log all results per card.
304
303
  2. **PURGE**: forget all implementation details, review findings, architect context.
305
- 3. **→ Visibility**: emit a Progress Bar reflecting the wave boundary (this wave done; next `Wave <X+1>/<Y>` pending) per § "Progress Visibility".
306
- 4. Move to the next pending group (Step A again).
304
+ 3. Move to the next pending group (Step A again).
307
305
 
308
306
  ### Sequential fallback within team mode
309
307
 
@@ -5,9 +5,11 @@ description: >
5
5
  EXPERIMENTAL workflow-hosted variant of /new (A/B testing). Implements one or
6
6
  more backlog cards end-to-end by delegating the WHOLE batch to a background
7
7
  dynamic workflow — so subagent output never enters the main orchestrator
8
- context. Fully autonomous (zero AskUserQuestion): every /new gate is replaced by
9
- a deterministic policy + a self-healing resolution pass. Claude-only (needs the
10
- Workflow tool). Usage: /new2 CARD-IDS (same arg grammar as /new). Triggers on:
8
+ context. The batch runs autonomously (zero AskUserQuestion during the run): every
9
+ /new gate is replaced by a deterministic policy + a self-healing resolution pass;
10
+ in interactive mode an optional post-batch escape hatch can hand the hard-case
11
+ follow-ups to /new for the real human gate. Claude-only (needs the Workflow tool).
12
+ Usage: /new2 CARD-IDS (same arg grammar as /new). Triggers on:
11
13
  /new2, "implementa le card con workflow", "new2".
12
14
  ---
13
15
 
@@ -17,12 +19,18 @@ description: >
17
19
  > default, and the recovery-safe path. Do NOT route to `new2` unless the user
18
20
  > explicitly asks for it.
19
21
 
20
- > **ZERO-ASK CONTRACT.** A dynamic workflow cannot prompt the user mid-run. `new2`
21
- > therefore runs the entire batch autonomously: every `/new` `AskUserQuestion`
22
- > gate is replaced by a deterministic policy (auto-resolve seamless defaults, or
23
- > fail → self-healing `new2-resolve`, or — last resort — auto-materialise a
24
- > tracked follow-up card). Destructive/outward ops (`reset --hard`, force-push,
25
- > stash drop) are NEVER auto-run; they degrade to "leave intact + report".
22
+ > **ZERO-ASK CONTRACT scoped to the *batch*, not the skill.** A dynamic workflow
23
+ > cannot prompt the user mid-run, so the **workflow runs the entire batch autonomously**:
24
+ > every `/new` `AskUserQuestion` gate *inside the batch* is replaced by a deterministic
25
+ > policy (auto-resolve seamless defaults, or fail → self-healing `new2-resolve`, or — last
26
+ > resort — auto-materialise a tracked follow-up card). The **skill main loop** (which CAN
27
+ > prompt) may interact at exactly two boundaries that are NOT mid-batch: **pre-launch**
28
+ > (Step 2 card-ID question, Step 3.5 migration gate) and **post-batch** (Step 3b
29
+ > escape-hatch escalation + Step 5 reconciliation) — both are interactive-only and skipped
30
+ > in autonomous mode (`BALDART_AUTONOMOUS`/`CI`/`GITHUB_ACTIONS`). The zero-ask invariant is
31
+ > about the **workflow during the batch**, which stays untouched. Destructive/outward ops
32
+ > (`reset --hard`, force-push, stash drop) are NEVER auto-run; they degrade to "leave intact
33
+ > + report".
26
34
 
27
35
  ## Project Context
28
36
 
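
The interactive-versus-autonomous split described in the contract above could be detected roughly as below. This is a sketch under stated assumptions: the function name `is_autonomous` is hypothetical, a variable counts as "set" only when it is non-empty (the text does not say), and the no-TTY clause is taken from Step 3b rather than from the contract paragraph itself.

```python
import os
import sys
from typing import Mapping, Optional

# Environment variables named in the contract as autonomous-mode markers.
AUTONOMOUS_ENV_VARS = ("BALDART_AUTONOMOUS", "CI", "GITHUB_ACTIONS")


def is_autonomous(
    env: Optional[Mapping[str, str]] = None,
    has_tty: Optional[bool] = None,
) -> bool:
    """Return True when the skill should skip every interactive boundary."""
    env = os.environ if env is None else env
    if has_tty is None:
        has_tty = sys.stdin.isatty()
    # Assumption: an empty value is treated as "not set".
    if any(env.get(name) for name in AUTONOMOUS_ENV_VARS):
        return True
    # Without a terminal nobody can answer a question, so treat it as autonomous.
    return not has_tty
```

Passing `env` and `has_tty` explicitly keeps the check testable; in this sketch the pre-launch and post-batch prompts would both be guarded by the same call.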
@@ -226,10 +234,49 @@ returns when the batch is done. It returns:
226
234
  per-card **skip-completed** guard makes the resume idempotent — already-committed
227
235
  cards are skipped, only the incomplete/blocked ones run. Repeat until `degraded`
228
236
  is false (or the same cards stall twice → surface to the user).
237
+ 3b. **Escape-hatch escalation for the hard cases (INTERACTIVE mode only — the `new2`
238
+ "relaxation").** `new2` is autonomous *during the batch* — but a genuinely-blocked
239
+ card (the workflow rolled it back / left it `IN_PROGRESS`, DoD not met) is exactly the
240
+ "edge case that wants human intelligence" the deterministic policy cannot supply. The
241
+ sound way to give that intelligence is NOT to re-implement a gate here (that would twin
242
+ `/new`'s Phase 2.5b and bypass review/F-029) — it is to hand the card's **already-tracked
243
+ follow-up** to `/new`, which owns the real per-card pipeline (worktree + review + the
244
+ interactive AC-Closure gate + F-029 + gated merge). Ordering is load-bearing: this runs
245
+ **after** step 1 materialised every follow-up on disk and step 3's resume converged, so
246
+ the offer is purely additive over an already-safe ledger — declining (or a closed
247
+ terminal) never drops a residual.
248
+ - **Skip this step entirely in AUTONOMOUS mode** (env `BALDART_AUTONOMOUS` / `CI` /
249
+ `GITHUB_ACTIONS` set, or no TTY) — leave the cards `IN_PROGRESS` + their follow-ups,
250
+ exactly as before. The escape hatch is interactive-only.
251
+ - **Eligible set** = the follow-ups whose residual `deferralClass` is **code-actionable**:
252
+ `unresolved`, `out-of-ownership`, `scope-expansion`. EXCLUDE `owner-gated` /
253
+ `not-a-code-defect` / `policy-deferred-ac` (external infra steps — `/new` cannot perform
254
+ a DB deploy / secret / DNS action, so escalating them is noise; they stay tracked
255
+ follow-ups). If the eligible set is empty → skip silently.
256
+ - In interactive mode, present **ONE batched `AskUserQuestion`** (never one-per-residual —
257
+ that would re-introduce the ~25-question profile `new2` exists to remove): *"N cards are
258
+ still IN_PROGRESS / have code-actionable residuals (DoD not met); their follow-ups are
259
+ already tracked on disk. Do you want me to run `/new` on those follow-ups now, to close
260
+ them through the full human gate?"* Options: **[Yes: run `/new` on the follow-ups]** / **[No:
261
+ leave them tracked]**.
262
+ - **Yes** → invoke `/new <followup-id …>` via the **Skill tool**, passing the materialised
263
+ follow-up card IDs. `/new` runs its full pipeline on the current trunk; do NOT
264
+ re-implement any of it here and do NOT mark anything DONE yourself — `/new` closes each
265
+ follow-up through its own gates. (This is post-batch follow-up work at the skill layer —
266
+ the same class as the Step 3.5 / Step 5 skill interactions; the autonomous workflow has
267
+ already returned, so the zero-ask-**during-batch** invariant is untouched.)
268
+ - **No** → leave as-is (prior behaviour).
269
+ - **Honest limitation (do not over-sell):** this is post-batch — it gives the human the real
270
+ gate on the *follow-up*, but it does NOT salvage a card *before* its merge (the workflow
271
+ already merged the committed cards). Pre-merge salvage would require a mid-batch checkpoint
272
+ (out of scope by design — the workflow is autonomous).
273
+ - Record `escape_hatch: { eligible: N, offered: <bool>, ran_new: <bool>, followups: [...] }`
274
+ in telemetry (step 5 below) so the A/B stays honest about when the hatch was used.
229
275
  4. **Present.** Print `report` verbatim. Surface `residuals` prominently
230
276
  ("these residuals are tracked as follow-ups: …") — the post-run review that
231
277
  replaced the ~25 mid-run questions. If `degraded`, say so plainly (the run was
232
- incomplete and resumed).
278
+ incomplete and resumed). If the escape hatch ran `/new` (step 3b), fold its outcome
279
+ into the presentation (which follow-ups were closed by `/new`).
233
280
  5. **Record truthful telemetry — reconciled against disk (F-040).** Before appending `telemetry`
234
281
  to `${metricsDir}/skill-runs.jsonl`, fill the fields the workflow could not compute and
235
282
  **reconcile the report against the real disk state** (agent `reason` strings can over-claim — a
@@ -264,6 +311,10 @@ returns when the batch is done. It returns:
264
311
  already satisfied (work the skill used to suppress by hand; a persistently high value signals
265
312
  deferrals resolving too late — order the dependent card earlier), and `owner_gated_deduped` > 0
266
313
  means N defers were collapsed to one external action.
314
+ Also record `escape_hatch: { eligible, offered, ran_new, followups }` (Step 3b) — it keeps the
315
+ A/B honest about when the post-batch human escalation was used and whether the user chose to run
316
+ `/new` on the hard-case follow-ups (vs leaving them tracked). In autonomous mode it is
317
+ `{ eligible:N, offered:false, ran_new:false }`.
267
318
  Do NOT re-summarise the cards — the workflow already did.
268
319
  6. **Process hygiene — reap orphaned Codex MCP servers (NON-BLOCKING).** The batch's per-card Codex
269
320
  finder calls drive `codex app-server`, whose broker spawns the `~/.codex/config.toml` MCP servers
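
The Step 3b eligibility filter and the `escape_hatch` telemetry record from the hunks above can be illustrated with a short sketch. The follow-up record shape (`id` plus `deferralClass`) and both function names are hypothetical. The three code-actionable classes and the four telemetry keys come from the text. Listing the eligible IDs under `followups` in every mode is an assumption, since the diff does not say what that field holds when `/new` was not run.

```python
from typing import Dict, List

# Residual classes that Step 3b treats as code-actionable.
CODE_ACTIONABLE = {"unresolved", "out-of-ownership", "scope-expansion"}


def eligible_followups(followups: List[Dict[str, str]]) -> List[str]:
    """Keep only follow-ups that /new could actually act on."""
    return [f["id"] for f in followups if f.get("deferralClass") in CODE_ACTIONABLE]


def escape_hatch_record(
    followups: List[Dict[str, str]],
    autonomous: bool,
    user_said_yes: bool = False,
) -> Dict[str, object]:
    """Build the telemetry entry; the hatch is offered only interactively."""
    ids = eligible_followups(followups)
    # An empty eligible set or autonomous mode means the question is never asked.
    offered = bool(ids) and not autonomous
    ran_new = offered and user_said_yes
    return {
        "eligible": len(ids),
        "offered": offered,
        "ran_new": ran_new,
        "followups": ids,  # assumption: eligible IDs, regardless of the answer
    }
```

In this sketch `owner-gated`, `not-a-code-defect` and `policy-deferred-ac` residuals never reach the question, which matches the stated intent that external infra steps stay tracked follow-ups.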
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "baldart",
3
- "version": "4.46.0",
3
+ "version": "4.48.0",
4
4
  "description": "Claude Agent Framework - Reusable framework for coordinating AI agents and humans in software projects",
5
5
  "bin": {
6
6
  "baldart": "./bin/baldart.js"