npm - baldart - Versions diffs - 4.47.0 → 4.48.0 - Mend

baldart 4.47.0 → 4.48.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (4) hide show

package/CHANGELOG.md +8 -0
package/VERSION +1 -1
package/framework/.claude/skills/new2/SKILL.md +61 -10
package/package.json +1 -1

package/CHANGELOG.md CHANGED Viewed

@@ -5,6 +5,14 @@ All notable changes to BALDART will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [4.48.0] - 2026-06-16
+**`new2` is relaxed: a post-batch, interactive-only escape hatch hands genuinely-blocked hard cases to `/new`'s real human gate — without re-introducing rigidity, a twin, or breaking the A/B.** `new2` runs the batch autonomously, so a card the deterministic policy can't salvage becomes a tracked follow-up + is left `IN_PROGRESS` — the "edge case that wants human intelligence" the user flagged. A 3-lens **adversarial review before implementation** killed the obvious fix (a Step-5 `AskUserQuestion` that lets the skill implement/resolve the residual): (1) **correctness** — the workflow auto-merges and removes the worktree *before* Step 5, so a skill-side fix has no worktree, lands unreviewed code on trunk, and bypasses the F-029 DONE gate; (2) **value** — on the only run with a full `deferral_breakdown` (14 residuals) a post-batch fix changes the outcome of *zero* of them (the batch already merged); only a mid-batch checkpoint could salvage-before-merge, and the data (n=1) doesn't justify breaking `new2`'s background/no-poll contract; (3) **prior-art** — it would twin `/new`'s Phase 2.5b AC-Closure gate and muddy the autonomous-vs-`/new` A/B. What survived all three: the sound escalation is **not to re-implement a gate but to invoke `/new` on the already-materialised follow-up** — `/new` owns the real worktree+review+AC-Closure+F-029+merge pipeline. **MINOR** (additive skill behaviour, interactive-only, autonomous-mode-safe; **no new `baldart.config.yml` key** ⇒ schema-change propagation rule N/A; no removed surface).
+### Changed
+- **`framework/.claude/skills/new2/SKILL.md`** — new **Step 3b "Escape-hatch escalation"** in the skill's post-batch reconciliation: in INTERACTIVE mode only (skipped when `BALDART_AUTONOMOUS`/`CI`/`GITHUB_ACTIONS` is set), after follow-ups are materialised on disk (offline-safe ordering preserved — the offer is additive over an already-safe ledger) and any `degraded` resume has converged, it presents **one batched `AskUserQuestion`** offering to run `/new` on the **code-actionable** hard-case follow-ups (`deferralClass ∈ {unresolved, out-of-ownership, scope-expansion}`; `owner-gated`/`not-a-code-defect`/`policy-deferred-ac` excluded — `/new` can't perform infra steps). "Sì" invokes `/new <followup-id …>` via the Skill tool (which closes them through its own gates — the skill never marks DONE itself, never re-implements the gate). The **ZERO-ASK CONTRACT** banner is rewritten to scope it precisely to *the workflow during the batch* (the skill may interact pre-launch AND post-batch, interactive-only). New `escape_hatch` telemetry field. Honest limitation documented: post-batch, so it gives the human gate on the follow-up but does NOT salvage a card before its merge (that would need a mid-batch checkpoint — out of scope by design).
 ## [4.47.0] - 2026-06-16
 **`/new`'s orchestrator context economy is re-aimed at its real driver — turn count — and the user-visible Progress Bar + native Task spine are removed.** Telemetry of two real 8-card batches (FEAT-0028/0029 on a consumer) showed the orchestrator paying ~285M `cache_read`: 613 turns each replaying a ~490k-token accumulated context (growing toward ~800k), so total cost ≈ **turn count × accumulated context**. A 3-lens **adversarial review before implementation** refuted the obvious diagnoses: the static prefix is only ~77k (not the ~225k first assumed — context is ~86% *accumulated*, not static); narration prose is only ~7% of the fuel; and the existing § "Context economy" (bulk-content-inline) rule targets a channel that totals only ~119k cumulatively. The measurement reviewer surfaced the actual missed lever — **0 of 274 tool turns batched any calls, and ~55% of turns carried no tool call at all** — and the correctness + prior-art reviewers established that delegating bookkeeping out of the orchestrator is a previously-trodden trap (v4.15.0 reverted a Write-from-memory tracker flush; the tracker is the recovery SSOT; `card_status: DONE` needs orchestrator-side disk re-read; the weak-subagent fabrication precedent applies). What survived: (1) a new turn-economy HARD RULE (batch independent tool calls; no narration-only turns; never poll/wait), and (2) since the Progress Bar + Task spine are pure *mirrors* of the internal tracker (recovery reads the tracker, never the spine), removing them is correctness-safe and eliminates ~45 dedicated visibility turns (~8% of a batch's `cache_read`, guaranteed, not batching-dependent). **MINOR** (skill behaviour change; **no new `baldart.config.yml` key** ⇒ schema-change propagation rule N/A; no removed agent/command/skill/routine).

package/VERSION CHANGED Viewed

	@@ -1 +1 @@
1	- 4.47.0
1	+ 4.48.0

package/framework/.claude/skills/new2/SKILL.md CHANGED Viewed

@@ -5,9 +5,11 @@ description: >
   EXPERIMENTAL workflow-hosted variant of /new (A/B testing). Implements one or
   more backlog cards end-to-end by delegating the WHOLE batch to a background
   dynamic workflow — so subagent output never enters the main orchestrator
-  context. Fully autonomous (zero AskUserQuestion): every /new gate is replaced by
-  a deterministic policy + a self-healing resolution pass. Claude-only (needs the
-  Workflow tool). Usage: /new2 CARD-IDS (same arg grammar as /new). Triggers on:
+  context. The batch runs autonomously (zero AskUserQuestion during the run): every
+  /new gate is replaced by a deterministic policy + a self-healing resolution pass;
+  in interactive mode an optional post-batch escape hatch can hand the hard-case
+  follow-ups to /new for the real human gate. Claude-only (needs the Workflow tool).
+  Usage: /new2 CARD-IDS (same arg grammar as /new). Triggers on:
   /new2, "implementa le card con workflow", "new2".
 ---
@@ -17,12 +19,18 @@ description: >
 > default, and the recovery-safe path. Do NOT route to `new2` unless the user
 > explicitly asks for it.
-> **ZERO-ASK CONTRACT.** A dynamic workflow cannot prompt the user mid-run. `new2`
-> therefore runs the entire batch autonomously: every `/new` `AskUserQuestion`
-> gate is replaced by a deterministic policy (auto-resolve seamless defaults, or
-> fail → self-healing `new2-resolve`, or — last resort — auto-materialise a
-> tracked follow-up card). Destructive/outward ops (`reset --hard`, force-push,
-> stash drop) are NEVER auto-run; they degrade to "leave intact + report".
+> **ZERO-ASK CONTRACT — scoped to the *batch*, not the skill.** A dynamic workflow
+> cannot prompt the user mid-run, so the **workflow runs the entire batch autonomously**:
+> every `/new` `AskUserQuestion` gate *inside the batch* is replaced by a deterministic
+> policy (auto-resolve seamless defaults, or fail → self-healing `new2-resolve`, or — last
+> resort — auto-materialise a tracked follow-up card). The **skill main loop** (which CAN
+> prompt) may interact at exactly two boundaries that are NOT mid-batch: **pre-launch**
+> (Step 2 card-ID question, Step 3.5 migration gate) and **post-batch** (Step 3b
+> escape-hatch escalation + Step 5 reconciliation) — both are interactive-only and skipped
+> in autonomous mode (`BALDART_AUTONOMOUS`/`CI`/`GITHUB_ACTIONS`). The zero-ask invariant is
+> about the **workflow during the batch**, which stays untouched. Destructive/outward ops
+> (`reset --hard`, force-push, stash drop) are NEVER auto-run; they degrade to "leave intact
+> + report".
 ## Project Context
@@ -226,10 +234,49 @@ returns when the batch is done. It returns:
    per-card **skip-completed** guard makes the resume idempotent — already-committed
    cards are skipped, only the incomplete/blocked ones run. Repeat until `degraded`
    is false (or the same cards stall twice → surface to the user).
+3b. **Escape-hatch escalation for the hard cases (INTERACTIVE mode only — the `new2`
+   "relaxation").** `new2` is autonomous *during the batch* — but a genuinely-blocked
+   card (the workflow rolled it back / left it `IN_PROGRESS`, DoD not met) is exactly the
+   "edge case that wants human intelligence" the deterministic policy cannot supply. The
+   sound way to give that intelligence is NOT to re-implement a gate here (that would twin
+   `/new`'s Phase 2.5b and bypass review/F-029) — it is to hand the card's **already-tracked
+   follow-up** to `/new`, which owns the real per-card pipeline (worktree + review + the
+   interactive AC-Closure gate + F-029 + gated merge). Ordering is load-bearing: this runs
+   **after** step 1 materialised every follow-up on disk and step 3's resume converged, so
+   the offer is purely additive over an already-safe ledger — declining (or a closed
+   terminal) never drops a residual.
+   - **Skip this step entirely in AUTONOMOUS mode** (env `BALDART_AUTONOMOUS` / `CI` /
+     `GITHUB_ACTIONS` set, or no TTY) — leave the cards `IN_PROGRESS` + their follow-ups,
+     exactly as before. The escape hatch is interactive-only.
+   - **Eligible set** = the follow-ups whose residual `deferralClass` is **code-actionable**:
+     `unresolved`, `out-of-ownership`, `scope-expansion`. EXCLUDE `owner-gated` /
+     `not-a-code-defect` / `policy-deferred-ac` (external infra steps — `/new` cannot perform
+     a DB deploy / secret / DNS action, so escalating them is noise; they stay tracked
+     follow-ups). If the eligible set is empty → skip silently.
+   - In interactive mode, present **ONE batched `AskUserQuestion`** (never one-per-residual —
+     that would re-introduce the ~25-question profile `new2` exists to remove): *"N card sono
+     rimaste IN_PROGRESS / con residui code-actionable (DoD non soddisfatta) — i follow-up
+     sono già tracciati su disco. Vuoi che lanci `/new` su quei follow-up adesso, per chiuderli
+     col gate umano completo?"* Options: **[Sì — lancia `/new` sui follow-up]** / **[No —
+     lasciali tracciati]**.
+     - **Sì** → invoke `/new <followup-id …>` via the **Skill tool**, passing the materialised
+       follow-up card IDs. `/new` runs its full pipeline on the current trunk; do NOT
+       re-implement any of it here and do NOT mark anything DONE yourself — `/new` closes each
+       follow-up through its own gates. (This is post-batch follow-up work at the skill layer —
+       the same class as the Step 3.5 / Step 5 skill interactions; the autonomous workflow has
+       already returned, so the zero-ask-**during-batch** invariant is untouched.)
+     - **No** → leave as-is (prior behaviour).
+   - **Honest limitation (do not over-sell):** this is post-batch — it gives the human the real
+     gate on the *follow-up*, but it does NOT salvage a card *before* its merge (the workflow
+     already merged the committed cards). Pre-merge salvage would require a mid-batch checkpoint
+     (out of scope by design — the workflow is autonomous).
+   - Record `escape_hatch: { eligible: N, offered: <bool>, ran_new: <bool>, followups: [...] }`
+     in telemetry (step 5 below) so the A/B stays honest about when the hatch was used.
 4. **Present.** Print `report` verbatim. Surface `residuals` prominently
    ("questi residui sono tracciati come follow-up: …") — the post-run review that
    replaced the ~25 mid-run questions. If `degraded`, say so plainly (the run was
-   incomplete and resumed).
+   incomplete and resumed). If the escape hatch ran `/new` (step 3b), fold its outcome
+   into the presentation (which follow-ups were closed by `/new`).
 5. **Record truthful telemetry — reconciled against disk (F-040).** Before appending `telemetry`
    to `${metricsDir}/skill-runs.jsonl`, fill the fields the workflow could not compute and
    **reconcile the report against the real disk state** (agent `reason` strings can over-claim — a
@@ -264,6 +311,10 @@ returns when the batch is done. It returns:
    already satisfied (work the skill used to suppress by hand; a persistently high value signals
    deferrals resolving too late — order the dependent card earlier), and `owner_gated_deduped` > 0
    means N defers were collapsed to one external action.
+   Also record `escape_hatch: { eligible, offered, ran_new, followups }` (Step 3b) — it keeps the
+   A/B honest about when the post-batch human escalation was used and whether the user chose to run
+   `/new` on the hard-case follow-ups (vs leaving them tracked). In autonomous mode it is
+   `{ eligible:N, offered:false, ran_new:false }`.
    Do NOT re-summarise the cards — the workflow already did.
 6. **Process hygiene — reap orphaned Codex MCP servers (NON-BLOCKING).** The batch's per-card Codex
    finder calls drive `codex app-server`, whose broker spawns the `~/.codex/config.toml` MCP servers

package/package.json CHANGED Viewed

@@ -1,6 +1,6 @@
 {
   "name": "baldart",
-  "version": "4.47.0",
+  "version": "4.48.0",
   "description": "Claude Agent Framework - Reusable framework for coordinating AI agents and humans in software projects",
   "bin": {
     "baldart": "./bin/baldart.js"