baldart 4.31.0 → 4.32.0

package/CHANGELOG.md CHANGED
@@ -5,6 +5,35 @@ All notable changes to BALDART will be documented in this file.
  The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
  and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+ ## [4.32.0] - 2026-06-11
+
+ **`new2`: `deep` is now relevance-gated too: it no longer fires all 5 reviewers on every card regardless of content.** Diagnosing a real batch (the FEAT-0023 supplier-associated-users epic, which covers permissions/RLS/migrations, so 7 of 11 cards are legitimately `review_profile: deep` per `prd-card-writer` Rule C) exposed an **asymmetry**, not a misclassification: the per-card review matrix relevance-gated `balanced` (a specialist runs only if its domain is evidenced by `scopeFiles ∪ MAY-EDIT`) but left `deep` on an **unconditional 5-way fan-out** (`FULL_FANOUT` = code-reviewer + doc-reviewer + qa-sentinel + api-perf-cost-auditor, plus security-reviewer when `securityRelevant` is set, which a `deep` card almost always trips). So a `deep` card with no doc surface still paid for a doc-reviewer, and one with no API/data surface still paid for api-perf-cost-auditor, every time. This is exactly the "every card gets a full review regardless of what's inside" symptom.
+
+ The fix unifies the review MODEL: **the profile controls review DEPTH, the surface controls review BREADTH.**
+ - `deep` and `balanced` now share the same relevance-gated reviewer SET (extracted into `relevanceGated()`): `touchesCode` → code-reviewer + qa-sentinel; `touchesDocs` → doc-reviewer (the core finder on a doc-only card); `touchesApiData` → api-perf-cost-auditor; `securityRelevant` → security-reviewer.
+ - `deep`'s extra DEPTH is unchanged and is carried downstream, NOT by the fan-out set: the full QA suite + Codex-full posture reach each spawned reviewer via `cardBrief`'s "Review profile" line, and the Phase-1 architect audit still keys on `reviewProfile === 'deep'` (`needAudit`). So the reviewers that DO run review at full depth — only the ones whose domain the card never touches are dropped.
+ - **Fail-safe preserved**: `noEvidence` (empty `scopeFiles ∪ MAY-EDIT`) still falls back to `FULL_FANOUT` for BOTH balanced and deep — absence of evidence ≠ evidence of absence. `light`/`skip` unchanged. Batch-level coverage is unchanged: the Final Review's batch-wide doc + api/perf passes remain the safety nets, and `securityRelevant` (which a `deep` card almost always trips) keeps security-reviewer on genuinely sensitive cards.
+
+ Not touched: `prd-card-writer` Rule C (it classifies correctly — `deep` for migration/RLS/permission/schema/HIGH-integration cards, `light` for pure-UI, `skip` for epic/doc — verified against the FEAT-0023 cards) and `/new`'s interactive prose path (which already scales per-card by profile + relevance — api-perf is deferred to the Final Review F.3, qa deferred at balanced, doc deferred at light-no-doc — so the unconditional-deep fan-out was unique to `new2.js`). **MINOR** (review-behavior change on the EXPERIMENTAL `new2` surface only; no config key — the schema-change propagation rule does not apply; no change to `/new`).
+
+ ### Changed
+
+ - **`framework/.claude/workflows/new2.js`** — per-card review matrix (B7): the relevance gate now applies to `deep` as well as `balanced` via a shared `relevanceGated()` helper; the `reviewProfile === 'deep'` arm of the `FULL_FANOUT` branch is removed, leaving only `noEvidence` as the conservative full-fan-out fail-safe. The `review-matrix` ledger row's `→conservative-full` annotation now fires for any `noEvidence` profile (was `balanced`-only). Comment block updated to document the depth-vs-breadth model.
+
+ ## [4.31.1] - 2026-06-11
+
+ **`new2`: fix v4.31.0's dedup coverage + drop empty residuals + honest A/B cost — verified against the real FEAT-0022 telemetry/report (not the narrative).** Reading the actual run's `skill-runs.jsonl` + workflow report (instead of the prior turn's recollection) exposed three things v4.31.0 missed:
+ - **The owner-gated dedup was too narrow** — it filtered by `deferralClass ∈ {owner-gated, not-a-code-defect}`, but the run's FIVE identical `db:push` residuals on card -01 carried kinds `ac-unmet`/`blocker`/`policy-deferred-ac`; the two `policy-deferred-ac` (an F-016 class that is ALWAYS an external/infra action) slipped through. Fix: dedup on `EXTERNAL_DEFERRAL = {owner-gated, not-a-code-defect, policy-deferred-ac}`. A second defect caught by a self-test before publishing: the action key split those 5 across `db-migration-deploy` and `migration:<file>` (the filename branch ran first), collapsing to 2 not 1 — **one `db:push` pushes all pending migrations, so the deploy intent must win over a bare filename**; reordered the key so every db-migration-deploy residual maps to one key. Verified on the real residuals: 7 db:push residuals (5 on -01, 1 on -02, 1 batch-level `F001`) → 2 (one per real card; `F001` dropped), matching the hand reconciliation.
+ - **Empty out-of-scope residuals** — the run emitted two `out-of-scope` residuals with no file/line/evidence (`":"`), which would mint contentless follow-up cards. `resolve()` now skips an out-of-scope finding whose composed evidence is empty after stripping `:`/space separators.
+ - **A/B cost telemetry was ~8× under** — the workflow reported `total_tokens=476k` (`budget.spent()`, output-only) / `agent_count=30` while the harness saw ~4.15M subagent tokens / 64 agents, because the workflow counters exclude the subagents spawned inside the nested workflows (`new2-resolve`, `new-final-review`). Since `total_tokens` was non-null, the skill's "backfill only if null" rule accepted the wrong figure — defeating new2's whole purpose (A/B on context economy). SKILL.md Step 5.5 now mandates **always** reading the real transcript `usage` via the `-stats` script as the headline cost, labelling the workflow figures as partial.
+
+ **PATCH** (corrects v4.31.0's dedup coverage + noise filter + telemetry honesty on the EXPERIMENTAL `new2` surface; no new capability, no config key). Meta: same lesson as the feature itself — the v4.31.0 dedup was written from the prior turn's recollection; reading the raw telemetry/report found the gap. Validate fixes against the data, not the memory of the data.
+
+ ### Changed
+
+ - **`framework/.claude/workflows/new2.js`** — `dedupOwnerGatedResiduals` now keys on `EXTERNAL_DEFERRAL` (adds `policy-deferred-ac`); `ownerGatedActionKey` reorders the db-migration-deploy branch ahead of the filename branch (one key per deploy action); `resolve()` skips empty `out-of-scope` residuals.
+ - **`framework/.claude/skills/new2/SKILL.md`** — Step 5.5 cost recording: always backfill real transcript `usage` (the workflow's `total_tokens`/`agent_count` are partial — output-only + exclude nested workflows); keep them labelled `*_workflow`.
+
  ## [4.31.0] - 2026-06-11
 
  **`new2`: the residual ledger self-corrects before returning — no more duplicate-of-done follow-ups, no more N defers for one external action.** A real `new2` run (FEAT-0022 epic, 3 cards) surfaced two over-report classes in the offline-safe residual ledger that the skill was absorbing **by hand** every run: (1) **4 of 8 follow-ups were false-open** — scope-expansion residuals deferred early in the batch but satisfied LATER by another card's commit / a final-review fix, which only `integrateCrossCard()` retracts; a residual closed by any *other* in-batch path stayed falsely-open (and left a best-effort, uncommitted follow-up YAML in the worktree). (2) **3 follow-ups for ONE physical action** — one migration's remote `db:push` re-raised per-card AND batch-wide by the final review → three near-identical owner-gated cards. Both were caught only by the skill's manual per-residual disk grep + consolidation (load-bearing, repeated every run). This release moves that work into the workflow, once, deterministically:
package/VERSION CHANGED
@@ -1 +1 @@
- 4.31.0
+ 4.32.0
package/framework/.claude/skills/new2/SKILL.md CHANGED
@@ -228,9 +228,16 @@ returns when the batch is done. It returns:
  (`git -C $MAIN log --oneline ${trunk} | grep <card>`); annotate any divergence and never present
  progress the disk does not show. Then fill `wall_clock_s` (now − kickoff `ts`) and
  `followups_on_disk` (count the actual follow-up files on disk in the main repo, NOT
- `residualFollowups.length` — which double-counts). `total_tokens`/`agent_count` come from the
- workflow; if `total_tokens` is null, run the `/new` Phase-8 `-stats` script to backfill real
- `usage`. Keep `degraded`/`degradation_reasons` + `cards_deferred_done_pending` in the record so
+ `residualFollowups.length` — which double-counts). **Cost (the A/B's whole point) — do NOT trust
+ the workflow's `total_tokens`/`agent_count` as the headline figure.** They are **partial**:
+ `total_tokens` is `budget.spent()` output-only AND the agent counters do not include the subagents
+ spawned inside the nested workflows (`new2-resolve`, `new-final-review`) — on the real FEAT-0022 run
+ the workflow reported `total_tokens=476k` / `agent_count=30` while the harness saw ~4.15M subagent
+ tokens / 64 agents (≈8× under). So **always** run the `/new` Phase-8 `-stats` script to read the
+ real transcript `usage` (not only when `total_tokens` is null), record it as the headline cost, and
+ keep the workflow's figures clearly labelled as `total_tokens_workflow`/`agent_count_workflow`
+ (partial, output-only, excludes nested workflows). Keep `degraded`/`degradation_reasons` +
+ `cards_deferred_done_pending` in the record so
  the A/B comparison stays honest. Also record `migration_gate: <migration.status>`
  (`none`|`applied`|`skipped`|`degraded`) — the Step-3.5 gate is a pre-launch interaction, NOT a
  mid-batch question, so it does not break the zero-ask-during-batch invariant; logging it keeps the
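The cost-recording rule in this hunk can be sketched as follows. This is a minimal illustration, not the skill's actual code: `buildCostRecord` and the shape of its two inputs are hypothetical, and only the field names `total_tokens`, `agent_count`, `total_tokens_workflow`, and `agent_count_workflow` come from the text above. The figures are the rounded FEAT-0022 numbers quoted in the hunk.

```javascript
// Hypothetical helper: the real transcript usage becomes the headline cost,
// and the workflow's partial counters are kept under *_workflow labels.
function buildCostRecord(workflow, transcript) {
  return {
    // Headline figures: read from the transcript usage (assumed input shape).
    total_tokens: transcript.totalTokens,
    agent_count: transcript.agentCount,
    // Partial figures: output-only tokens, nested-workflow subagents excluded.
    total_tokens_workflow: workflow.total_tokens,
    agent_count_workflow: workflow.agent_count,
  }
}

const record = buildCostRecord(
  { total_tokens: 476000, agent_count: 30 },      // what the workflow reported
  { totalTokens: 4150000, agentCount: 64 },       // what the harness saw (approx.)
)
console.log(record)
```

Keeping both pairs in one record means the partial figure is still available for debugging, but nothing downstream can mistake it for the headline cost.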
package/framework/.claude/workflows/new2.js CHANGED
@@ -389,9 +389,14 @@ async function resolve(kind, card, evidence, extra) {
  // domain and decide whether an out-of-ownership remedy lands inside the batch union.
  residuals.push({ card, kind, evidence, materialized: !!fc, deferralClass, domain: dom, remedyFiles: (res && res.remedyFiles) || [] })
  }
- // F-022 — route out-of-scope findings the resolve surfaced.
+ // F-022 — route out-of-scope findings the resolve surfaced. v4.31.1 — a finding with no file,
+ // no line AND no evidence is noise (the resolve emitted an empty placeholder); pushing it would
+ // mint a contentless follow-up card. Skip when the composed evidence is empty after stripping the
+ // `:`/space separators — a real finding always carries at least a file, a line, or a description.
  for (const osf of (res && res.outOfScopeFindings) || []) {
- residuals.push({ card, kind: 'out-of-scope', evidence: `${osf.file || ''}:${osf.line || ''} ${osf.evidence || ''}`, materialized: false })
+ const ev = `${osf.file || ''}:${osf.line || ''} ${osf.evidence || ''}`
+ if (!ev.replace(/[:\s]/g, '')) continue
+ residuals.push({ card, kind: 'out-of-scope', evidence: ev.trim(), materialized: false })
  }
  ledger(card, 'resolve:' + kind, status, (res && (res.followupCard || res.reason)) || '')
  return { status, deferralClass }
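The empty-residual filter added in this hunk can be exercised in isolation. The sketch below lifts the two new lines into a standalone function; `composeOutOfScopeEvidence` is a hypothetical name, and the sample findings (file path, line, descriptions) are invented for illustration.

```javascript
// Compose the evidence string the way the hunk does, returning null when
// nothing but the ':' and space separators remains.
function composeOutOfScopeEvidence(osf) {
  const ev = `${osf.file || ''}:${osf.line || ''} ${osf.evidence || ''}`
  // Strip ':' and whitespace; an empty remainder means no file, line, or text.
  if (!ev.replace(/[:\s]/g, '')) return null
  return ev.trim()
}

const findings = [
  {},                                                         // empty placeholder
  { file: 'src/a.js', line: 12, evidence: 'unused export' },  // fully populated
  { evidence: 'missing index on supplier_id' },               // description only
]
const kept = findings.map(composeOutOfScopeEvidence).filter((ev) => ev !== null)
console.log(kept)
```

Note that a description-only finding keeps its leading `:` because `trim()` removes whitespace only; the filter decides whether to keep a finding, it does not normalize the kept string.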
@@ -573,15 +578,22 @@ async function runCard(cardId, cardPath) {
  // block routes through resolve(), whose mandatory adversarial judge (new2-resolve F-015, code domain
  // → code-reviewer) cross-checks the Codex finding before a fix/followup.
  const codexAvail = !!sharedCtx.codexResolved && !!sharedCtx.codexScriptPath
- // B6 (v4.25.0) — deterministic per-card review matrix. `balanced` no longer means "every
- // specialist on every card": each one runs IFF its domain is evidenced by the card's actual
- // surface (scopeFiles ∪ MAY-EDIT), computed deterministically HERE in JS and audited via a
- // `review-matrix` ledger row. `deep` keeps the unconditional full fan-out (Rule C assigns it
- // to high-risk cards; respect the escalation). Coverage holds at batch level: the final
- // review's doc pass stays the missing-doc-update safety net (its singleCard/slim skip is now
- // gated on doc-reviewer having ACTUALLY run per-card; see Phase Final), and its api-perf
- // pass keys on hasApiDataFiles with a regex this matrix supersets, so gating OFF a per-card
- // specialist never leaves its domain unreviewed.
+ // B6 (v4.25.0) / B7 (v4.32.0) — deterministic per-card review matrix. A specialist runs IFF its
+ // domain is evidenced by the card's actual surface (scopeFiles ∪ MAY-EDIT), computed
+ // deterministically HERE in JS and audited via a `review-matrix` ledger row. Since v4.32.0 this
+ // relevance gate applies to `deep` TOO: `deep` no longer means an unconditional 5-way fan-out.
+ // The asymmetry it removed: `balanced` was surface-gated while `deep` paid for doc-reviewer on a
+ // card with no doc surface and api-perf-cost-auditor on a card with no API/data surface, every
+ // time. The REVIEW MODEL is now: the **profile controls review DEPTH** (deep: full QA suite +
+ // Codex full + the Phase-1 audit gate, all keyed on `reviewProfile` downstream and carried to
+ // each spawned reviewer via `cardBrief`'s "Review profile" line, so the depth of the reviewers
+ // that DO run is unchanged), and the **surface controls review BREADTH**. `noEvidence` (empty
+ // surface) stays the conservative full-fan-out fail-safe for BOTH balanced and deep — absence of
+ // evidence ≠ evidence of absence. Coverage holds at batch level: the final review's doc pass
+ // stays the missing-doc-update safety net (its singleCard/slim skip is gated on doc-reviewer
+ // having ACTUALLY run per-card — see Phase Final), and its api-perf pass keys on hasApiDataFiles
+ // with a regex this matrix supersets — so gating OFF a per-card specialist never leaves its
+ // domain unreviewed.
  const surface = dedupe((scopeFiles || []).concat(mayEdit || []))
  const docDirs = [paths.docs_dir, paths.references_dir, paths.wiki_dir, paths.prd_dir, paths.design_system].filter(Boolean)
  const isDocFile = (f) => /\.(md|mdx)$/i.test(String(f)) || docDirs.some((d) => String(f).includes(String(d)))
@@ -592,19 +604,23 @@ async function runCard(cardId, cardPath) {
  const touchesApiData = surface.some((f) => /api\/|data-model|\.sql$|migrations?\/|server|route|edge|middleware|cron|queue|worker|prisma|drizzle|supabase|schema/i.test(String(f)))
  const noEvidence = surface.length === 0 // fail-safe: absence of evidence ≠ evidence of absence
  const FULL_FANOUT = ['code-reviewer', 'doc-reviewer', 'qa-sentinel', 'api-perf-cost-auditor'].concat(securityRelevant ? ['security-reviewer'] : [])
+ // Relevance-gated BREADTH — shared by `balanced` AND `deep` (profile-gated DEPTH is carried
+ // downstream via cardBrief + needAudit, not by which specialists spawn).
+ const relevanceGated = () => {
+ const rs = []
+ if (touchesCode) rs.push('code-reviewer')
+ if (touchesDocs) rs.push('doc-reviewer') // doc-ONLY card: doc-reviewer IS the core finder
+ if (touchesCode) rs.push('qa-sentinel') // skip for doc-only cards — no behavior to QA
+ if (touchesApiData) rs.push('api-perf-cost-auditor')
+ if (securityRelevant) rs.push('security-reviewer')
+ return rs
+ }
  let reviewers
  if (reviewProfile === 'skip') reviewers = []
  else if (reviewProfile === 'light') reviewers = codexAvail ? ['codex'] : ['code-reviewer']
- else if (reviewProfile === 'deep' || noEvidence) reviewers = FULL_FANOUT
- else { // balanced — relevance-gated
- reviewers = []
- if (touchesCode) reviewers.push('code-reviewer')
- if (touchesDocs) reviewers.push('doc-reviewer') // doc-ONLY card: doc-reviewer IS the core finder
- if (touchesCode) reviewers.push('qa-sentinel') // skip for doc-only cards — no behavior to QA
- if (touchesApiData) reviewers.push('api-perf-cost-auditor')
- if (securityRelevant) reviewers.push('security-reviewer')
- }
- if (reviewProfile !== 'skip') g('review-matrix', 'PLANNED', `[${reviewProfile}${noEvidence && reviewProfile === 'balanced' ? '→conservative-full (no surface evidence)' : ''}] ${reviewers.join('+') || '(none)'} · docs:${touchesDocs} code:${touchesCode} api/data:${touchesApiData} sec:${securityRelevant}`)
+ else if (noEvidence) reviewers = FULL_FANOUT // balanced/deep, no surface evidence → conservative full
+ else reviewers = relevanceGated() // balanced AND deep → surface-gated breadth
+ if (reviewProfile !== 'skip') g('review-matrix', 'PLANNED', `[${reviewProfile}${noEvidence ? '→conservative-full (no surface evidence)' : ''}] ${reviewers.join('+') || '(none)'} · docs:${touchesDocs} code:${touchesCode} api/data:${touchesApiData} sec:${securityRelevant}`)
  const reviewSchema = { type: 'object', required: ['blocks', 'scopeExpansion'], additionalProperties: true,
  properties: { blocks: { type: 'array', items: { type: 'object', additionalProperties: true } }, scopeExpansion: { type: 'array', items: { type: 'object', additionalProperties: true } }, note: { type: 'string' } } }
  let reviewResults = []
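The reviewer selection after this change can be tried outside the workflow. The following is a minimal sketch, assuming the surface flags are passed in directly rather than derived from `scopeFiles ∪ MAY-EDIT` as `runCard` does; `selectReviewers` and the two sample cards are hypothetical, while the reviewer names and branch order mirror the hunk.

```javascript
// Standalone sketch of the per-card matrix: profile picks the branch,
// surface flags pick the specialists.
function selectReviewers(reviewProfile, s, codexAvail = false) {
  const fullFanout = ['code-reviewer', 'doc-reviewer', 'qa-sentinel', 'api-perf-cost-auditor']
    .concat(s.securityRelevant ? ['security-reviewer'] : [])
  if (reviewProfile === 'skip') return []
  if (reviewProfile === 'light') return codexAvail ? ['codex'] : ['code-reviewer']
  if (s.noEvidence) return fullFanout // fail-safe for balanced AND deep
  const rs = []
  if (s.touchesCode) rs.push('code-reviewer')
  if (s.touchesDocs) rs.push('doc-reviewer')
  if (s.touchesCode) rs.push('qa-sentinel')
  if (s.touchesApiData) rs.push('api-perf-cost-auditor')
  if (s.securityRelevant) rs.push('security-reviewer')
  return rs
}

// A deep migration card with no doc surface: doc-reviewer is not spawned.
const migrationCard = { touchesCode: true, touchesDocs: false, touchesApiData: true, securityRelevant: true, noEvidence: false }
const deepGated = selectReviewers('deep', migrationCard)
console.log(deepGated)

// A deep card with an empty surface still gets the conservative full fan-out.
const emptyCard = { touchesCode: false, touchesDocs: false, touchesApiData: false, securityRelevant: false, noEvidence: true }
const deepFailSafe = selectReviewers('deep', emptyCard)
console.log(deepFailSafe)
```

Passing `'balanced'` instead of `'deep'` returns the same lists for both cards, which is the point of the change: the profile no longer affects which specialists spawn, only how deeply they review.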
@@ -1128,19 +1144,31 @@ async function reconcileLedgerAgainstHead() {
  // ───────────────────────────────────────────────────────────────────────────
  function ownerGatedActionKey(r) {
  const hay = `${r.evidence || ''} ${(r.remedyFiles || []).join(' ')}`
+ // ORDER MATTERS — the migration-deploy intent wins over a bare filename match. One `db:push`
+ // pushes ALL pending migrations in a single action, so every "migration not deployed" / db:check-
+ // -sync / db:push residual must collapse onto ONE key regardless of whether it also names the .sql
+ // file (the real FEAT-0022 run split 5 identical db:push residuals across `db-migration-deploy`
+ // and `migration:<file>` because the filename branch ran first → only the deploy intent is canon).
+ if (/\bdb:push\b|\bdb:check-sync\b|remote db push|migration[^.]*(deploy|remote|push|not[_ ]?deployed)/i.test(hay)) return 'db-migration-deploy'
  const mig = hay.match(/(\d{14}_[a-z0-9_]+\.sql)/i)
  if (mig) return 'migration:' + mig[1].toLowerCase()
- if (/\bdb:push\b|\bdb:check-sync\b|remote db push|migration.*(deploy|remote|push)/i.test(hay)) return 'db-migration-deploy'
  if (/\bdeploy(ment)?\b/i.test(hay)) return 'deploy'
  if (/\bsecret\b/i.test(hay)) return 'secret'
  if (/\bDNS\b|\bdomain\b/i.test(hay)) return 'dns'
  return null // unknown action → never dedup (avoid collapsing two genuinely-distinct externals)
  }
+ // classes that are ALWAYS an external/infra action (never code a commit can close) — the only ones
+ // safe to collapse by shared action key. `policy-deferred-ac` belongs here (F-016: an AC whose
+ // remedy is out-of-ownership or an owner-gated infra step) — the real FEAT-0022 run proved a single
+ // `db:push` surfaced as owner-gated AND policy-deferred-ac on the same card, so excluding the latter
+ // left duplicates uncollapsed. ac-unmet/blocker/merge-blocker are NOT here: they may be code defects,
+ // and their EXTERNAL instances are already reclassified to `owner-gated` by the F-040 classifier.
+ const EXTERNAL_DEFERRAL = new Set(['owner-gated', 'not-a-code-defect', 'policy-deferred-ac'])
  function dedupOwnerGatedResiduals() {
  const realCard = new Set(cardIds)
  const groups = {}
  for (const r of residuals) {
- if (r.deferralClass !== 'owner-gated' && r.deferralClass !== 'not-a-code-defect') continue
+ if (!EXTERNAL_DEFERRAL.has(r.deferralClass)) continue
  const k = ownerGatedActionKey(r)
  if (!k) continue
  ;(groups[k] = groups[k] || []).push(r)
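The effect of the reordered key and the wider class filter can be seen on a few sample residuals. This sketch reproduces only the two migration branches of `ownerGatedActionKey` and the grouping loop shown above; it does not model how the workflow collapses each group afterwards, which this hunk does not show. `actionKey`, `groupExternalResiduals`, the migration filename, and the evidence strings are hypothetical.

```javascript
// Deploy intent is tested first, so a residual naming both `db:push` and a
// .sql file lands on the single deploy key.
function actionKey(r) {
  const hay = `${r.evidence || ''} ${(r.remedyFiles || []).join(' ')}`
  if (/\bdb:push\b|\bdb:check-sync\b|remote db push|migration[^.]*(deploy|remote|push|not[_ ]?deployed)/i.test(hay)) return 'db-migration-deploy'
  const mig = hay.match(/(\d{14}_[a-z0-9_]+\.sql)/i)
  if (mig) return 'migration:' + mig[1].toLowerCase()
  return null // unknown action: never grouped
}

const EXTERNAL_DEFERRAL = new Set(['owner-gated', 'not-a-code-defect', 'policy-deferred-ac'])

function groupExternalResiduals(residuals) {
  const groups = {}
  for (const r of residuals) {
    if (!EXTERNAL_DEFERRAL.has(r.deferralClass)) continue
    const k = actionKey(r)
    if (!k) continue
    ;(groups[k] = groups[k] || []).push(r)
  }
  return groups
}

const residuals = [
  { deferralClass: 'owner-gated', evidence: 'run db:push for 20260611120000_add_users.sql' },
  { deferralClass: 'policy-deferred-ac', evidence: 'AC-3 needs remote db push' },
  { deferralClass: 'policy-deferred-ac', evidence: 'see attached', remedyFiles: ['supabase/migrations/20260611120000_add_users.sql'] },
  { deferralClass: 'ac-unmet', evidence: 'db:push pending' }, // not an external class: ignored
]
const groups = groupExternalResiduals(residuals)
console.log(Object.keys(groups), groups['db-migration-deploy'].length)
```

With the pre-4.31.1 order (filename branch first), the first residual would have been keyed as `migration:<file>` and split from the second; with the pre-4.31.1 class filter, the two `policy-deferred-ac` residuals would not have been grouped at all.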
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "baldart",
- "version": "4.31.0",
+ "version": "4.32.0",
  "description": "Claude Agent Framework - Reusable framework for coordinating AI agents and humans in software projects",
  "bin": {
  "baldart": "./bin/baldart.js"