baldart 4.47.0 → 4.48.0

package/CHANGELOG.md CHANGED
@@ -5,6 +5,14 @@ All notable changes to BALDART will be documented in this file.
  The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
  and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+ ## [4.48.0] - 2026-06-16
+
+ **`new2` is relaxed: a post-batch, interactive-only escape hatch hands genuinely-blocked hard cases to `/new`'s real human gate — without re-introducing rigidity, a twin, or breaking the A/B.** `new2` runs the batch autonomously, so a card the deterministic policy can't salvage becomes a tracked follow-up + is left `IN_PROGRESS` — the "edge case that wants human intelligence" the user flagged. A 3-lens **adversarial review before implementation** killed the obvious fix (a Step-5 `AskUserQuestion` that lets the skill implement/resolve the residual): (1) **correctness** — the workflow auto-merges and removes the worktree *before* Step 5, so a skill-side fix has no worktree, lands unreviewed code on trunk, and bypasses the F-029 DONE gate; (2) **value** — on the only run with a full `deferral_breakdown` (14 residuals) a post-batch fix changes the outcome of *zero* of them (the batch already merged); only a mid-batch checkpoint could salvage-before-merge, and the data (n=1) doesn't justify breaking `new2`'s background/no-poll contract; (3) **prior-art** — it would twin `/new`'s Phase 2.5b AC-Closure gate and muddy the autonomous-vs-`/new` A/B. What survived all three: the sound escalation is **not to re-implement a gate but to invoke `/new` on the already-materialised follow-up** — `/new` owns the real worktree+review+AC-Closure+F-029+merge pipeline. **MINOR** (additive skill behaviour, interactive-only, autonomous-mode-safe; **no new `baldart.config.yml` key** ⇒ schema-change propagation rule N/A; no removed surface).
+
+ ### Changed
+
+ - **`framework/.claude/skills/new2/SKILL.md`** — new **Step 3b "Escape-hatch escalation"** in the skill's post-batch reconciliation: in INTERACTIVE mode only (skipped when `BALDART_AUTONOMOUS`/`CI`/`GITHUB_ACTIONS` is set), after follow-ups are materialised on disk (offline-safe ordering preserved — the offer is additive over an already-safe ledger) and any `degraded` resume has converged, it presents **one batched `AskUserQuestion`** offering to run `/new` on the **code-actionable** hard-case follow-ups (`deferralClass ∈ {unresolved, out-of-ownership, scope-expansion}`; `owner-gated`/`not-a-code-defect`/`policy-deferred-ac` excluded — `/new` can't perform infra steps). "Sì" invokes `/new <followup-id …>` via the Skill tool (which closes them through its own gates — the skill never marks DONE itself, never re-implements the gate). The **ZERO-ASK CONTRACT** banner is rewritten to scope it precisely to *the workflow during the batch* (the skill may interact pre-launch AND post-batch, interactive-only). New `escape_hatch` telemetry field. Honest limitation documented: post-batch, so it gives the human gate on the follow-up but does NOT salvage a card before its merge (that would need a mid-batch checkpoint — out of scope by design).
+
  ## [4.47.0] - 2026-06-16
 
  **`/new`'s orchestrator context economy is re-aimed at its real driver — turn count — and the user-visible Progress Bar + native Task spine are removed.** Telemetry of two real 8-card batches (FEAT-0028/0029 on a consumer) showed the orchestrator paying ~285M `cache_read`: 613 turns each replaying a ~490k-token accumulated context (growing toward ~800k), so total cost ≈ **turn count × accumulated context**. A 3-lens **adversarial review before implementation** refuted the obvious diagnoses: the static prefix is only ~77k (not the ~225k first assumed — context is ~86% *accumulated*, not static); narration prose is only ~7% of the fuel; and the existing § "Context economy" (bulk-content-inline) rule targets a channel that totals only ~119k cumulatively. The measurement reviewer surfaced the actual missed lever — **0 of 274 tool turns batched any calls, and ~55% of turns carried no tool call at all** — and the correctness + prior-art reviewers established that delegating bookkeeping out of the orchestrator is a previously-trodden trap (v4.15.0 reverted a Write-from-memory tracker flush; the tracker is the recovery SSOT; `card_status: DONE` needs orchestrator-side disk re-read; the weak-subagent fabrication precedent applies). What survived: (1) a new turn-economy HARD RULE (batch independent tool calls; no narration-only turns; never poll/wait), and (2) since the Progress Bar + Task spine are pure *mirrors* of the internal tracker (recovery reads the tracker, never the spine), removing them is correctness-safe and eliminates ~45 dedicated visibility turns (~8% of a batch's `cache_read`, guaranteed, not batching-dependent). **MINOR** (skill behaviour change; **no new `baldart.config.yml` key** ⇒ schema-change propagation rule N/A; no removed agent/command/skill/routine).
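As a rough check on the 4.47.0 cost model above (total cost ≈ turn count × accumulated context), the quoted figures can be multiplied out. This is only a back-of-envelope sketch: the variable names are illustrative, and it assumes the 613 turns and the ~490k-token context describe the same measured run, which the entry does not state explicitly.

```python
# Back-of-envelope check of "total cost ≈ turn count × accumulated context".
# All figures are the approximate ones quoted in the 4.47.0 entry;
# the names below are illustrative, not taken from the package.
TURNS = 613                          # orchestrator turns
CONTEXT_TOKENS = 490_000             # ~accumulated context replayed per turn
REPORTED_CACHE_READ = 285_000_000    # ~285M cache_read reported

# Product of the two quoted figures.
estimated = TURNS * CONTEXT_TOKENS

# Average context per turn implied by the reported total.
implied_avg_context = round(REPORTED_CACHE_READ / TURNS)

print(f"estimated cache_read: {estimated:,}")
print(f"implied average context per turn: {implied_avg_context:,}")
```

The product lands near 300M tokens, the same order as the reported ~285M, which is consistent with turn count being the dominant multiplier.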
package/VERSION CHANGED
@@ -1 +1 @@
- 4.47.0
+ 4.48.0
package/framework/.claude/skills/new2/SKILL.md CHANGED
@@ -5,9 +5,11 @@ description: >
  EXPERIMENTAL workflow-hosted variant of /new (A/B testing). Implements one or
  more backlog cards end-to-end by delegating the WHOLE batch to a background
  dynamic workflow — so subagent output never enters the main orchestrator
- context. Fully autonomous (zero AskUserQuestion): every /new gate is replaced by
- a deterministic policy + a self-healing resolution pass. Claude-only (needs the
- Workflow tool). Usage: /new2 CARD-IDS (same arg grammar as /new). Triggers on:
+ context. The batch runs autonomously (zero AskUserQuestion during the run): every
+ /new gate is replaced by a deterministic policy + a self-healing resolution pass;
+ in interactive mode an optional post-batch escape hatch can hand the hard-case
+ follow-ups to /new for the real human gate. Claude-only (needs the Workflow tool).
+ Usage: /new2 CARD-IDS (same arg grammar as /new). Triggers on:
  /new2, "implementa le card con workflow", "new2".
  ---
 
@@ -17,12 +19,18 @@ description: >
  > default, and the recovery-safe path. Do NOT route to `new2` unless the user
  > explicitly asks for it.
 
- > **ZERO-ASK CONTRACT.** A dynamic workflow cannot prompt the user mid-run. `new2`
- > therefore runs the entire batch autonomously: every `/new` `AskUserQuestion`
- > gate is replaced by a deterministic policy (auto-resolve seamless defaults, or
- > fail → self-healing `new2-resolve`, or — last resort — auto-materialise a
- > tracked follow-up card). Destructive/outward ops (`reset --hard`, force-push,
- > stash drop) are NEVER auto-run; they degrade to "leave intact + report".
+ > **ZERO-ASK CONTRACT scoped to the *batch*, not the skill.** A dynamic workflow
+ > cannot prompt the user mid-run, so the **workflow runs the entire batch autonomously**:
+ > every `/new` `AskUserQuestion` gate *inside the batch* is replaced by a deterministic
+ > policy (auto-resolve seamless defaults, or fail → self-healing `new2-resolve`, or — last
+ > resort — auto-materialise a tracked follow-up card). The **skill main loop** (which CAN
+ > prompt) may interact at exactly two boundaries that are NOT mid-batch: **pre-launch**
+ > (Step 2 card-ID question, Step 3.5 migration gate) and **post-batch** (Step 3b
+ > escape-hatch escalation + Step 5 reconciliation) — both are interactive-only and skipped
+ > in autonomous mode (`BALDART_AUTONOMOUS`/`CI`/`GITHUB_ACTIONS`). The zero-ask invariant is
+ > about the **workflow during the batch**, which stays untouched. Destructive/outward ops
+ > (`reset --hard`, force-push, stash drop) are NEVER auto-run; they degrade to "leave intact
+ > + report".
 
  ## Project Context
 
@@ -226,10 +234,49 @@ returns when the batch is done. It returns:
  per-card **skip-completed** guard makes the resume idempotent — already-committed
  cards are skipped, only the incomplete/blocked ones run. Repeat until `degraded`
  is false (or the same cards stall twice → surface to the user).
+ 3b. **Escape-hatch escalation for the hard cases (INTERACTIVE mode only — the `new2`
+ "relaxation").** `new2` is autonomous *during the batch* — but a genuinely-blocked
+ card (the workflow rolled it back / left it `IN_PROGRESS`, DoD not met) is exactly the
+ "edge case that wants human intelligence" the deterministic policy cannot supply. The
+ sound way to give that intelligence is NOT to re-implement a gate here (that would twin
+ `/new`'s Phase 2.5b and bypass review/F-029) — it is to hand the card's **already-tracked
+ follow-up** to `/new`, which owns the real per-card pipeline (worktree + review + the
+ interactive AC-Closure gate + F-029 + gated merge). Ordering is load-bearing: this runs
+ **after** step 1 materialised every follow-up on disk and step 3's resume converged, so
+ the offer is purely additive over an already-safe ledger — declining (or a closed
+ terminal) never drops a residual.
+ - **Skip this step entirely in AUTONOMOUS mode** (env `BALDART_AUTONOMOUS` / `CI` /
+ `GITHUB_ACTIONS` set, or no TTY) — leave the cards `IN_PROGRESS` + their follow-ups,
+ exactly as before. The escape hatch is interactive-only.
+ - **Eligible set** = the follow-ups whose residual `deferralClass` is **code-actionable**:
+ `unresolved`, `out-of-ownership`, `scope-expansion`. EXCLUDE `owner-gated` /
+ `not-a-code-defect` / `policy-deferred-ac` (external infra steps — `/new` cannot perform
+ a DB deploy / secret / DNS action, so escalating them is noise; they stay tracked
+ follow-ups). If the eligible set is empty → skip silently.
+ - In interactive mode, present **ONE batched `AskUserQuestion`** (never one-per-residual —
+ that would re-introduce the ~25-question profile `new2` exists to remove): *"N card sono
+ rimaste IN_PROGRESS / con residui code-actionable (DoD non soddisfatta) — i follow-up
+ sono già tracciati su disco. Vuoi che lanci `/new` su quei follow-up adesso, per chiuderli
+ col gate umano completo?"* Options: **[Sì — lancia `/new` sui follow-up]** / **[No —
+ lasciali tracciati]**.
+ - **Sì** → invoke `/new <followup-id …>` via the **Skill tool**, passing the materialised
+ follow-up card IDs. `/new` runs its full pipeline on the current trunk; do NOT
+ re-implement any of it here and do NOT mark anything DONE yourself — `/new` closes each
+ follow-up through its own gates. (This is post-batch follow-up work at the skill layer —
+ the same class as the Step 3.5 / Step 5 skill interactions; the autonomous workflow has
+ already returned, so the zero-ask-**during-batch** invariant is untouched.)
+ - **No** → leave as-is (prior behaviour).
+ - **Honest limitation (do not over-sell):** this is post-batch — it gives the human the real
+ gate on the *follow-up*, but it does NOT salvage a card *before* its merge (the workflow
+ already merged the committed cards). Pre-merge salvage would require a mid-batch checkpoint
+ (out of scope by design — the workflow is autonomous).
+ - Record `escape_hatch: { eligible: N, offered: <bool>, ran_new: <bool>, followups: [...] }`
+ in telemetry (step 5 below) so the A/B stays honest about when the hatch was used.
  4. **Present.** Print `report` verbatim. Surface `residuals` prominently
  ("questi residui sono tracciati come follow-up: …") — the post-run review that
  replaced the ~25 mid-run questions. If `degraded`, say so plainly (the run was
- incomplete and resumed).
+ incomplete and resumed). If the escape hatch ran `/new` (step 3b), fold its outcome
+ into the presentation (which follow-ups were closed by `/new`).
  5. **Record truthful telemetry — reconciled against disk (F-040).** Before appending `telemetry`
  to `${metricsDir}/skill-runs.jsonl`, fill the fields the workflow could not compute and
  **reconcile the report against the real disk state** (agent `reason` strings can over-claim — a
@@ -264,6 +311,10 @@ returns when the batch is done. It returns:
  already satisfied (work the skill used to suppress by hand; a persistently high value signals
  deferrals resolving too late — order the dependent card earlier), and `owner_gated_deduped` > 0
  means N defers were collapsed to one external action.
+ Also record `escape_hatch: { eligible, offered, ran_new, followups }` (Step 3b) — it keeps the
+ A/B honest about when the post-batch human escalation was used and whether the user chose to run
+ `/new` on the hard-case follow-ups (vs leaving them tracked). In autonomous mode it is
+ `{ eligible:N, offered:false, ran_new:false }`.
  Do NOT re-summarise the cards — the workflow already did.
  6. **Process hygiene — reap orphaned Codex MCP servers (NON-BLOCKING).** The batch's per-card Codex
  finder calls drive `codex app-server`, whose broker spawns the `~/.codex/config.toml` MCP servers
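The Step 3b decision flow added in this file can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the skill's implementation: `FollowUp`, `is_autonomous`, `escape_hatch`, the `ask` callback, and the `FU-*` IDs are all hypothetical names, the rule that a variable counts as "set" when it is merely present is an assumption, and so is the choice to list in `followups` only the IDs actually handed to `/new`.

```python
from dataclasses import dataclass

# deferralClass values the diff lists as code-actionable; every other class
# (owner-gated, not-a-code-defect, policy-deferred-ac) stays a tracked follow-up.
CODE_ACTIONABLE = {"unresolved", "out-of-ownership", "scope-expansion"}
AUTONOMOUS_ENV_VARS = ("BALDART_AUTONOMOUS", "CI", "GITHUB_ACTIONS")


@dataclass
class FollowUp:
    """Hypothetical shape of a materialised follow-up (not the real schema)."""
    card_id: str
    deferral_class: str


def is_autonomous(env, has_tty):
    """Autonomous if any listed variable is present (assumption) or there is no TTY."""
    return any(name in env for name in AUTONOMOUS_ENV_VARS) or not has_tty


def escape_hatch(followups, env, has_tty, ask):
    """Return an escape_hatch telemetry record for one finished batch.

    `ask` is a hypothetical stand-in for the single batched AskUserQuestion:
    it receives the eligible IDs once and returns True for "Sì".
    """
    eligible = [f.card_id for f in followups if f.deferral_class in CODE_ACTIONABLE]
    record = {"eligible": len(eligible), "offered": False, "ran_new": False, "followups": []}
    if is_autonomous(env, has_tty) or not eligible:
        return record  # skipped silently: autonomous mode or nothing code-actionable
    record["offered"] = True
    if ask(eligible):
        # The real skill would invoke /new on these IDs here; it never marks DONE itself.
        record["ran_new"] = True
        record["followups"] = eligible
    return record


batch = [
    FollowUp("FU-1", "unresolved"),
    FollowUp("FU-2", "owner-gated"),
    FollowUp("FU-3", "scope-expansion"),
]
interactive = escape_hatch(batch, env={}, has_tty=True, ask=lambda ids: True)
autonomous = escape_hatch(batch, env={"CI": "true"}, has_tty=True, ask=lambda ids: True)
print(interactive)
print(autonomous)
```

In the autonomous call the question is never asked, so the record keeps `offered` and `ran_new` false while still counting the eligible follow-ups, matching the autonomous-mode telemetry described in Step 5.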
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "baldart",
- "version": "4.47.0",
+ "version": "4.48.0",
  "description": "Claude Agent Framework - Reusable framework for coordinating AI agents and humans in software projects",
  "bin": {
  "baldart": "./bin/baldart.js"