baldart 4.28.1 → 4.29.1
- package/CHANGELOG.md +18 -0
- package/VERSION +1 -1
- package/framework/.claude/skills/new/references/final-review.md +2 -2
- package/framework/.claude/skills/new2/SKILL.md +6 -2
- package/framework/.claude/workflows/new-final-review.js +45 -13
- package/framework/.claude/workflows/new2.js +9 -0
- package/package.json +1 -1
package/CHANGELOG.md
CHANGED
@@ -5,6 +5,24 @@ All notable changes to BALDART will be documented in this file.
  The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
  and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

+ ## [4.29.1] - 2026-06-11
+
+ **`new2`: `deferral_breakdown` telemetry — see WHY residuals became follow-ups, per class.** Follow-up cards in `new2` are not "useless deferred work": an in-scope fixable anomaly is fixed directly by `resolve()`'s domain fixer in-batch; a follow-up is created ONLY for a residual that is structurally undeferrable (`out-of-ownership` would edit another card's files · `owner-gated`/`not-a-code-defect`/`baseline-not-reached` aren't code a coder can apply · `unresolved` already failed fixer+judge+tier-2 · `scope-expansion` with a new AC needs a PRD decision · `outage`). To diagnose a run with *many* follow-ups, the telemetry now counts residuals per class so a skewed breakdown points at the real root cause (many `out-of-ownership` → PRD MAY-EDIT too narrow · many `unresolved` → fixes genuinely hard · many `scope-expansion` → cards under-specified upstream) — data first, before any change to the deferral logic. **PATCH** (observability only on the EXPERIMENTAL `new2` surface; diagnosis-only — nothing is auto-implemented from a residual; no behavior change, no config key).
+
+ ### Added
+
+ - **`framework/.claude/workflows/new2.js`** — `telemetry.deferral_breakdown` (per-class residual counts, derived from the `deferralClass`/`kind` already carried on each residual) + a "Ripartizione per classe" line in the human report's Residui section.
+ - **`framework/.claude/skills/new2/SKILL.md`** — the A/B record step now keeps `deferral_breakdown` and names it the data to consult BEFORE proposing any deferral-logic change.
+
+ ## [4.29.0] - 2026-06-11
+
+ **Final review (F.4): each domain specialist owns its lane — no cross-domain `code-reviewer` re-judge, no self-judge.** The verification step classified findings by spawning `code-reviewer` over EVERY low-confidence finding. That was wrong on two counts: (a) a `doc` finding from `doc-reviewer` (or an `api`/`perf` finding from `api-perf-cost-auditor`) was re-validated by `code-reviewer` — the WRONG specialist judging another domain (a `code-reviewer` judging prose); (b) when Codex is unavailable the `code-reviewer` *fallback* produces the findings and `code-reviewer` then re-judged its OWN findings — self-judging with no model diversity (the same waste removed from the resolve pass in v4.27.1–2). Now every domain specialist FP-checks its OWN findings in the finding pass (doc-reviewer, api-perf-cost-auditor, and the Codex/`code-reviewer`-fallback code engine), so surviving findings arrive already validated; the residual `confidence < 80` path is routed to the finding's DOMAIN specialist (doc→doc-reviewer, api/perf→api-perf-cost-auditor, security/migration→security-reviewer, test→qa-sentinel, else code-reviewer), and when that specialist is the originating finder the finding is surfaced as `NEEDS_MANUAL_CONFIRMATION` rather than re-judged. Applies to BOTH `/new` (inline F.4 prose, the SSOT) and `new2` (the `new-final-review` workflow). **MINOR** (behavioral refinement of the final-review classification rule; the returned `{findings, classification, summary}` contract is unchanged; no `baldart.config.yml` key, so the schema-change propagation rule does not apply).
+
+ ### Changed
+
+ - **`framework/.claude/skills/new/references/final-review.md` Step F.4 step 9** (SSOT) — replaced "Claude agent findings with confidence < 80 → cross-validate by spawning code-reviewer" with the specialist-ownership rule: specialists self-FP-check their own domain, residual unresolved findings route to the domain specialist, finder===verifier ⇒ `NEEDS_MANUAL_CONFIRMATION`.
+ - **`framework/.claude/workflows/new-final-review.js`** — `doc-reviewer`/`api-perf-cost-auditor` prompts now mandate an in-domain false-positive check; their findings (and the `code-reviewer` fallback's) are marked `preValidated` so they skip the re-judge; `verifyFinding()` uses a new `domainVerifier()` router (never a hardcoded `code-reviewer`) and short-circuits a finder===verifier case to `NEEDS_MANUAL_CONFIRMATION`. `meta.description` + Verify phase detail updated.
+
  ## [4.28.1] - 2026-06-11

  **Doc: list the `new.migration-example.md` overlay example in the overlays README.** Completes the v4.28.0 Migration Gate docs — the example overlay file shipped in v4.28.0 was not yet listed in the `framework/templates/overlays/README.md` example table. **PATCH** (doc-only).
package/VERSION
CHANGED
@@ -1 +1 @@
- 4.
+ 4.29.1
package/framework/.claude/skills/new/references/final-review.md
CHANGED
@@ -199,8 +199,8 @@ deterministic script outside this orchestrator's context window.
  the unavailable case, so the final merge gate still gets a full code review.

  9. **Merge all findings** (Codex + Claude agents) into a consolidated list.
- - Codex findings are already FP-validated (Step F.3 protocol includes it).
- -
+ - **Each domain specialist OWNS its lane end-to-end** — it runs its OWN false-positive check in the finding pass (doc-reviewer for doc, api-perf-cost-auditor for api/perf/data, and the Codex/`code-reviewer`-fallback engine for code), so its surviving findings are **already validated**. Do **NOT** cross-validate a doc or api finding by spawning `code-reviewer` (that is the WRONG specialist judging another domain), and do **NOT** re-run the same specialist over its own findings (self-judge — no model diversity). Codex findings are likewise already FP-validated (Step F.3 protocol includes it).
+ - **Residual only:** a finding the originating specialist explicitly leaves UNRESOLVED (`confidence < 80`) is routed to the finding's **domain specialist** (by `domain`: doc→doc-reviewer, api/perf→api-perf-cost-auditor, security/migration→security-reviewer, test→qa-sentinel, else code-reviewer) over the cited file:line. If that domain specialist **is** the originating finder (it already had its pass), classify `NEEDS_MANUAL_CONFIRMATION` instead of re-spawning it — never self-judge, never silently drop.
  - Classify: `VERIFIED` | `FALSE_POSITIVE` | `NEEDS_MANUAL_CONFIRMATION`.
  - `VERIFIED` findings proceed to fixes. **`NEEDS_MANUAL_CONFIRMATION` findings are NOT discarded** — list them in `## Issues & Flags` and surface them to the user via `AskUserQuestion` (treat as VERIFIED, treat as FALSE_POSITIVE, or hand off) before merge. Only `FALSE_POSITIVE` are dropped.

|
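The handling of the three classifications in step 9 can be sketched as a small partition function. This is a minimal sketch under stated assumptions, not code from the package: `partitionFindings` and the `toFix`/`toAsk`/`dropped` bucket names are hypothetical, and sending an unknown classification to the manual bucket is an assumption chosen to match the "never silently drop" rule.

```javascript
// Hypothetical helper (not part of baldart): splits classified findings the
// way step 9 describes. VERIFIED goes to the fix queue, FALSE_POSITIVE is
// dropped, NEEDS_MANUAL_CONFIRMATION is surfaced to the user.
function partitionFindings(findings) {
  const out = { toFix: [], toAsk: [], dropped: [] }
  for (const f of findings) {
    if (f.classification === 'VERIFIED') out.toFix.push(f)
    else if (f.classification === 'FALSE_POSITIVE') out.dropped.push(f)
    // Assumption: anything else, including a missing or unknown value,
    // is surfaced to the human rather than lost.
    else out.toAsk.push(f)
  }
  return out
}
```

Only the `dropped` bucket is discarded; the `toAsk` bucket corresponds to the findings listed under `## Issues & Flags` before merge.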
package/framework/.claude/skills/new2/SKILL.md
CHANGED
@@ -226,5 +226,9 @@ returns when the batch is done. It returns:
  the A/B comparison stays honest. Also record `migration_gate: <migration.status>`
  (`none`|`applied`|`skipped`|`degraded`) — the Step-3.5 gate is a pre-launch interaction, NOT a
  mid-batch question, so it does not break the zero-ask-during-batch invariant; logging it keeps the
- A/B honest about when a migration was front-loaded.
-
+ A/B honest about when a migration was front-loaded. Keep `deferral_breakdown` (per-class counts
+ of WHY residuals became follow-ups instead of in-batch fixes) in the record — a class dominating
+ it is a root-cause signal (many `out-of-ownership` → PRD MAY-EDIT too narrow · many `unresolved` →
+ fixes genuinely hard · many `scope-expansion` → cards under-specified upstream), and it is the
+ data to consult BEFORE proposing any change to the deferral logic. Do NOT re-summarise the cards —
+ the workflow already did.
|
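Reading the breakdown for a dominant class, as the SKILL text asks, could look like the following. This is a rough sketch, not something the package ships: `dominantDeferralClass` and `ROOT_CAUSE_HINTS` are hypothetical names, and the hint strings are copied from the three signals named in the text above. Classes without a named signal get a `null` hint.

```javascript
// Hypothetical lookup (not part of baldart): the three root-cause signals
// the SKILL text names, keyed by deferral class.
const ROOT_CAUSE_HINTS = {
  'out-of-ownership': 'PRD MAY-EDIT too narrow',
  unresolved: 'fixes genuinely hard',
  'scope-expansion': 'cards under-specified upstream',
}

// Hypothetical helper: returns the most frequent class in a
// deferral_breakdown object, its share of all residuals, and a hint.
// On a tie the class that appears first in the object wins.
function dominantDeferralClass(breakdown) {
  const entries = Object.entries(breakdown || {})
  if (entries.length === 0) return null
  const total = entries.reduce((sum, [, n]) => sum + n, 0)
  const [cls, count] = entries.reduce((best, e) => (e[1] > best[1] ? e : best))
  return { cls, count, share: count / total, hint: ROOT_CAUSE_HINTS[cls] || null }
}

// Illustrative input only; the counts are made up.
console.log(dominantDeferralClass({ 'out-of-ownership': 5, unresolved: 2, outage: 1 }))
```

The function only reports; in line with the diagnosis-only intent, it does not change any deferral behavior.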
package/framework/.claude/workflows/new-final-review.js
CHANGED
@@ -1,11 +1,11 @@
  export const meta = {
    name: 'new-final-review',
    description:
-     "Cross-batch final code review for /new. Fans out a multi-agent review (Codex primary + doc-reviewer + api-perf-cost-auditor + qa-sentinel gates) over the WHOLE batch diff,
+     "Cross-batch final code review for /new. Fans out a multi-agent review (Codex primary + doc-reviewer + api-perf-cost-auditor + qa-sentinel gates) over the WHOLE batch diff. Each domain specialist OWNS its lane end-to-end — it FP-checks its own findings in the finding pass, so there is no generic code-reviewer re-judge over another specialist's domain (cross-domain) nor over its own findings (self-judge); any residual unresolved finding is routed to its domain specialist. Read-only: returns classified findings + a gate table, applies NO fixes (the calling skill owns fix application + user gates). Maps to references/final-review.md steps F.2–F.4.",
    phases: [
      { title: 'Baseline', detail: 'architecture grounding for the batch scope (F.2)' },
      { title: 'Review', detail: 'parallel multi-agent review of the batch diff (F.3)' },
-     { title: 'Verify', detail: '
+     { title: 'Verify', detail: 'specialist-owned validation; residual routed to domain specialist (F.4)' },
    ],
  }

@@ -155,10 +155,14 @@ const codexPrompt =
    `Run the mandatory false-positive check on every finding and suppress the unconvincing ones (your findings are treated as already FP-validated). Set codexAvailable:true when the review ran.`

  const docPrompt =
-   `Cross-card documentation + SSOT-registry review over the batch diff, per ${protocolRef} Step F.3 (doc-reviewer row). Check doc consistency, ssot-registry completeness, and invariants across the changed files.\n\n${scopeBrief}\n\n${baselineBrief}\n\
+   `Cross-card documentation + SSOT-registry review over the batch diff, per ${protocolRef} Step F.3 (doc-reviewer row). Check doc consistency, ssot-registry completeness, and invariants across the changed files.\n\n${scopeBrief}\n\n${baselineBrief}\n\n` +
+   `You OWN the doc domain end-to-end: run the mandatory false-positive check on every finding yourself and SUPPRESS the unconvincing ones — your surviving findings are treated as already validated and are NOT re-judged by another agent (a generic code-reviewer judging prose would be cross-domain). Flag only a finding you genuinely cannot resolve as confidence < 80.\n\n` +
+   `Return findings (domain almost always "doc"). Use the finding schema fields.`

  const apiPrompt =
-   `API / data-model / performance / cost defect review over the batch diff, per ${protocolRef} Step F.3 (api-perf-cost-auditor row). Look for unbounded reads, N+1, missing pagination, contract drift, and cost regressions.\n\n${scopeBrief}\n\n${baselineBrief}\n\
+   `API / data-model / performance / cost defect review over the batch diff, per ${protocolRef} Step F.3 (api-perf-cost-auditor row). Look for unbounded reads, N+1, missing pagination, contract drift, and cost regressions.\n\n${scopeBrief}\n\n${baselineBrief}\n\n` +
+   `You OWN the api/perf/data domain end-to-end: run the mandatory false-positive check on every finding yourself and SUPPRESS the unconvincing ones — your surviving findings are treated as already validated and are NOT re-judged by another agent. Flag only a finding you genuinely cannot resolve as confidence < 80.\n\n` +
+   `Return findings with domain in {code, perf, migration, security}. Use the finding schema fields.`

  const qaPrompt =
    `Run MECHANICAL GATES ONLY over the batch scope, per ${protocolRef} Step F.3 (qa-sentinel row): lint, type-check, the full test suite, build, dependency audit, and markdownlint as applicable to this project. Do NOT read source for code findings, do NOT emit severities — return only a PASS/FAIL/SKIP gate table.\n\nWorktree: ${a.worktreePath || '(cwd)'}\nChanged files:\n${scope.join('\n')}`
@@ -207,7 +211,9 @@ for (const item of reviewResults) {
    } else if (item.kind === 'qa') {
      gateTable = (item.r && item.r.gates) || []
    } else if (item.r && Array.isArray(item.r.findings)) {
-
+     // doc-reviewer / api-perf-cost-auditor own their domain and FP-check their OWN findings in
+     // the finding pass (their prompts mandate it) → already validated, no cross-domain re-judge.
+     raw.push(...item.r.findings.map((f) => ({ ...f, source: item.kind, preValidated: true })))
    }
  }

@@ -219,27 +225,53 @@ if (!codexRan) {
      `Codex was unavailable for the batch final review. Run the FULL code review yourself over the batch diff, per ${protocolRef} Step F.3.\n\n${scopeBrief}\n\n${baselineBrief}\n\nReturn findings using the schema fields, with a self false-positive check applied.`,
      { label: 'code-reviewer (fallback)', phase: 'Review', agentType: 'code-reviewer', schema: FINDINGS_SCHEMA }
    )
-
+   // the fallback code-reviewer IS the primary code-review engine here and applied its own FP
+   // check (prompt above) → trusted, exactly like Codex. A SECOND code-reviewer pass over its own
+   // findings would be self-judging (no model diversity) — preValidated short-circuits it.
+   if (fb && Array.isArray(fb.findings)) raw.push(...fb.findings.map((f) => ({ ...f, source: 'code-reviewer', preValidated: true })))
  }

  // ───────────────────────────────────────────────────────────────────────────
- // Phase Verify (F.4) —
- //
- //
- //
- //
+ // Phase Verify (F.4) — specialist-owned validation (NOT a generic code-reviewer re-judge).
+ // Each domain specialist owns its lane end-to-end: it FP-checks its OWN findings in the
+ // finding pass, so codex / doc-reviewer / api-perf-cost-auditor / code-reviewer-fallback
+ // findings arrive preValidated → VERIFIED with no second pass. This removes two flaws of the
+ // old "spawn code-reviewer for every low-confidence finding": (a) code-reviewer judging a doc
+ // or api finding is CROSS-DOMAIN (wrong specialist); (b) code-reviewer judging the fallback
+ // code-reviewer's OWN findings is SELF-JUDGING (no model diversity). The residual branch only
+ // fires for a finding a specialist explicitly left UNRESOLVED (confidence < 80, not
+ // preValidated): it is routed to the finding's DOMAIN specialist; if that specialist IS the
+ // originating finder, it already had its pass → surface as NEEDS_MANUAL_CONFIRMATION (never a
+ // self-judge, never a silent drop).
  // ───────────────────────────────────────────────────────────────────────────
  phase('Verify')
  const classified = (await parallel(raw.map((f) => () => verifyFinding(f)))).filter(Boolean)

+ // Route a finding to the specialist that OWNS its domain (never a generic code-reviewer for
+ // doc/api/security). Kept in sync with new2-resolve.js normDomain() routing buckets.
+ function domainVerifier(domain) {
+   const d = String(domain || 'code').toLowerCase()
+   if (/doc|wiki|ssot|readme/.test(d)) return 'doc-reviewer'
+   if (/sec|auth|secret|rls|migrat|schema|ddl|\bsql\b/.test(d)) return 'security-reviewer'
+   if (/perf|cost|\bapi\b|data|latency|throughput|n\+1/.test(d)) return 'api-perf-cost-auditor'
+   if (/\btest|qa\b|spec|coverage/.test(d)) return 'qa-sentinel'
+   return 'code-reviewer'
+ }
+
  async function verifyFinding(f) {
    if (f.preValidated || (typeof f.confidence === 'number' && f.confidence >= 80)) {
      return { ...f, classification: 'VERIFIED' }
    }
+   const verifier = domainVerifier(f.domain)
+   if (verifier === f.source) {
+     // the domain specialist IS the finder and already had its single pass — a second instance of
+     // the same agent adds no diversity (self-judge). Surface to the human, do not drop.
+     return { ...f, classification: 'NEEDS_MANUAL_CONFIRMATION' }
+   }
    const v = await agent(
-     `Adversarially validate this code
+     `Adversarially validate this ${f.domain || 'code'} finding as the DOMAIN specialist over the cited file:line. Default to FALSE_POSITIVE if the evidence does not hold; use NEEDS_MANUAL_CONFIRMATION only when you genuinely cannot decide from the code.\n\n` +
      `finding_id: ${f.finding_id}\nseverity: ${f.severity}\ntitle: ${f.title}\nevidence: ${f.evidence}\ndomain: ${f.domain}\nsecurity-sensitive paths: ${highRisk.join(', ') || '(none configured)'}`,
-     { label: `verify:${f.finding_id}`, phase: 'Verify', agentType:
+     { label: `verify:${f.finding_id}`, phase: 'Verify', agentType: verifier, schema: VERDICT_SCHEMA }
    )
    return { ...f, classification: (v && v.classification) || 'NEEDS_MANUAL_CONFIRMATION' }
  }
|
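The routing added in this hunk can be exercised without spawning any agent. The sketch below copies `domainVerifier()` from the diff and adds `routeFinding`, a hypothetical synchronous stand-in for the part of `verifyFinding()` that runs before the `agent()` call. The sample `source` values are assumptions for illustration; the diff does not show which `item.kind` values the workflow actually uses.

```javascript
// Copied from the hunk above so the demo is self-contained.
function domainVerifier(domain) {
  const d = String(domain || 'code').toLowerCase()
  if (/doc|wiki|ssot|readme/.test(d)) return 'doc-reviewer'
  if (/sec|auth|secret|rls|migrat|schema|ddl|\bsql\b/.test(d)) return 'security-reviewer'
  if (/perf|cost|\bapi\b|data|latency|throughput|n\+1/.test(d)) return 'api-perf-cost-auditor'
  if (/\btest|qa\b|spec|coverage/.test(d)) return 'qa-sentinel'
  return 'code-reviewer'
}

// Hypothetical helper (not in the package): returns the classification when
// no agent call is needed, otherwise the specialist that would be spawned.
function routeFinding(f) {
  if (f.preValidated || (typeof f.confidence === 'number' && f.confidence >= 80)) {
    return { classification: 'VERIFIED' }
  }
  const verifier = domainVerifier(f.domain)
  if (verifier === f.source) return { classification: 'NEEDS_MANUAL_CONFIRMATION' }
  return { spawn: verifier }
}

// The patterns are substring tests checked in order, so the first match wins.
console.log(domainVerifier('migration')) // security-reviewer
console.log(domainVerifier('api-docs'))  // doc-reviewer ("doc" is tested before "api")
console.log(domainVerifier(undefined))   // code-reviewer

// Assumed source labels, for illustration only.
console.log(routeFinding({ domain: 'doc', source: 'codex', confidence: 60 }))
console.log(routeFinding({ domain: 'code', source: 'code-reviewer', confidence: 50 }))
```

Because the tests are ordered, a domain string that matches more than one bucket is routed to the earliest one; a caller who needs a different precedence would have to normalize `domain` before calling the router.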
package/framework/.claude/workflows/new2.js
CHANGED
@@ -987,6 +987,13 @@ function buildTelemetry() {
  // satisfied up-front instead of deferred owner-gated.
  migration_gate: (migration && migration.status) || 'none',
  residuals_total: residuals.length,
+ // Why each residual became a follow-up instead of an in-batch fix, counted per class
+ // (out-of-ownership | owner-gated | not-a-code-defect | baseline-not-reached | unresolved |
+ // outage | scope-expansion | policy-deferred-ac | out-of-scope | file-diff-violation). A
+ // skewed breakdown is a root-cause signal: many `out-of-ownership` → MAY-EDIT too narrow (PRD
+ // ownership), many `unresolved` → fixes genuinely hard, many `scope-expansion` → cards
+ // under-specified upstream. This is diagnosis-only; nothing is auto-implemented from a residual.
+ deferral_breakdown: residuals.reduce((b, x) => { const k = x.deferralClass || x.kind || 'unknown'; b[k] = (b[k] || 0) + 1; return b }, {}),
  // followups_on_disk is filled by the SKILL after it materialises pending residuals.
  followups_materialized_in_workflow: residuals.filter((x) => x.materialized).length,
  resolve_invocations: resolvedSignatures.size,
@@ -1029,6 +1036,8 @@ function buildReport(o) {
  }
  if (residuals.length) {
    L.push(``, `## ⚠️ Residui (il skill materializza le follow-up mancanti — nulla perso)`)
+   const bd = residuals.reduce((b, x) => { const k = x.deferralClass || x.kind || 'unknown'; b[k] = (b[k] || 0) + 1; return b }, {})
+   L.push(`Ripartizione per classe: ${Object.entries(bd).map(([k, n]) => `${k}=${n}`).join(' · ')} — una classe dominante è il segnale di causa (out-of-ownership → MAY-EDIT troppo strette · unresolved → fix difficili · scope-expansion → card sotto-specificate).`)
    for (const f of residuals) L.push(`- ${f.card} (${f.kind})${f.materialized ? ' ✓' : ' — DA MATERIALIZZARE'}: ${f.evidence}`)
  }
  const excluded = gateLedger.filter((x) => x.decision === 'EXCLUDED')
|
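The same counting expression appears in both `buildTelemetry()` and `buildReport()`. The sketch below pulls it into a standalone function to show the key precedence (`deferralClass`, then `kind`, then `'unknown'`) and the `class=count` formatting used in the report line. `deferralBreakdown` is a hypothetical name, and the sample residuals are invented for illustration.

```javascript
// Same reduce as in the diff: deferralClass wins, then kind, then 'unknown'.
// The function name is hypothetical; the package inlines this expression.
function deferralBreakdown(residuals) {
  return residuals.reduce((b, x) => {
    const k = x.deferralClass || x.kind || 'unknown'
    b[k] = (b[k] || 0) + 1
    return b
  }, {})
}

// Invented sample residuals; card ids and kinds are illustrative only.
const sample = [
  { card: 'C-1', deferralClass: 'out-of-ownership', kind: 'followup' },
  { card: 'C-2', deferralClass: 'out-of-ownership' },
  { card: 'C-3', kind: 'unresolved' },
  { card: 'C-4' },
]

const bd = deferralBreakdown(sample)
// Formatting taken from the report line in buildReport().
const line = Object.entries(bd).map(([k, n]) => `${k}=${n}`).join(' · ')
console.log(line) // out-of-ownership=2 · unresolved=1 · unknown=1
```

Note that a residual carrying both fields is counted under `deferralClass` only, so `kind` acts purely as a fallback.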