baldart 4.31.0 → 4.31.1
- package/CHANGELOG.md +14 -0
- package/VERSION +1 -1
- package/framework/.claude/skills/new2/SKILL.md +10 -3
- package/framework/.claude/workflows/new2.js +21 -4
- package/package.json +1 -1
package/CHANGELOG.md
CHANGED
@@ -5,6 +5,20 @@ All notable changes to BALDART will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [4.31.1] - 2026-06-11
+
+**`new2`: fix v4.31.0's dedup coverage + drop empty residuals + honest A/B cost — verified against the real FEAT-0022 telemetry/report (not the narrative).** Reading the actual run's `skill-runs.jsonl` + workflow report (instead of the prior turn's recollection) exposed three things v4.31.0 missed:
+- **The owner-gated dedup was too narrow** — it filtered by `deferralClass ∈ {owner-gated, not-a-code-defect}`, but the run's FIVE identical `db:push` residuals on card -01 carried kinds `ac-unmet`/`blocker`/`policy-deferred-ac`; the two `policy-deferred-ac` (an F-016 class that is ALWAYS an external/infra action) slipped through. Fix: dedup on `EXTERNAL_DEFERRAL = {owner-gated, not-a-code-defect, policy-deferred-ac}`. A second defect caught by a self-test before publishing: the action key split those 5 across `db-migration-deploy` and `migration:<file>` (the filename branch ran first), collapsing to 2 not 1 — **one `db:push` pushes all pending migrations, so the deploy intent must win over a bare filename**; reordered the key so every db-migration-deploy residual maps to one key. Verified on the real residuals: 7 db:push residuals (5 on -01, 1 on -02, 1 batch-level `F001`) → 2 (one per real card; `F001` dropped), matching the hand reconciliation.
+- **Empty out-of-scope residuals** — the run emitted two `out-of-scope` residuals with no file/line/evidence (`":"`), which would mint contentless follow-up cards. `resolve()` now skips an out-of-scope finding whose composed evidence is empty after stripping `:`/space separators.
+- **A/B cost telemetry was ~8× under** — the workflow reported `total_tokens=476k` (`budget.spent()`, output-only) / `agent_count=30` while the harness saw ~4.15M subagent tokens / 64 agents, because the workflow counters exclude the subagents spawned inside the nested workflows (`new2-resolve`, `new-final-review`). Since `total_tokens` was non-null, the skill's "backfill only if null" rule accepted the wrong figure — defeating new2's whole purpose (A/B on context economy). SKILL.md Step 5.5 now mandates **always** reading the real transcript `usage` via the `-stats` script as the headline cost, labelling the workflow figures as partial.
+
+**PATCH** (corrects v4.31.0's dedup coverage + noise filter + telemetry honesty on the EXPERIMENTAL `new2` surface; no new capability, no config key). Meta: same lesson as the feature itself — the v4.31.0 dedup was written from the prior turn's recollection; reading the raw telemetry/report found the gap. Validate fixes against the data, not the memory of the data.
+
+### Changed
+
+- **`framework/.claude/workflows/new2.js`** — `dedupOwnerGatedResiduals` now keys on `EXTERNAL_DEFERRAL` (adds `policy-deferred-ac`); `ownerGatedActionKey` reorders the db-migration-deploy branch ahead of the filename branch (one key per deploy action); `resolve()` skips empty `out-of-scope` residuals.
+- **`framework/.claude/skills/new2/SKILL.md`** — Step 5.5 cost recording: always backfill real transcript `usage` (the workflow's `total_tokens`/`agent_count` are partial — output-only + exclude nested workflows); keep them labelled `*_workflow`.
+
 ## [4.31.0] - 2026-06-11
 
 **`new2`: the residual ledger self-corrects before returning — no more duplicate-of-done follow-ups, no more N defers for one external action.** A real `new2` run (FEAT-0022 epic, 3 cards) surfaced two over-report classes in the offline-safe residual ledger that the skill was absorbing **by hand** every run: (1) **4 of 8 follow-ups were false-open** — scope-expansion residuals deferred early in the batch but satisfied LATER by another card's commit / a final-review fix, which only `integrateCrossCard()` retracts; a residual closed by any *other* in-batch path stayed falsely-open (and left a best-effort, uncommitted follow-up YAML in the worktree). (2) **3 follow-ups for ONE physical action** — one migration's remote `db:push` re-raised per-card AND batch-wide by the final review → three near-identical owner-gated cards. Both were caught only by the skill's manual per-residual disk grep + consolidation (load-bearing, repeated every run). This release moves that work into the workflow, once, deterministically:
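The key-ordering fix described in the 4.31.1 entry can be illustrated with a minimal sketch. Both regexes are copied from the `new2.js` hunk of this diff; the function names and the sample strings are hypothetical, and the same deploy regex is used for both orderings so that only the branch order differs.

```javascript
// Sketch only: regexes come from the new2.js hunk; everything else is illustrative.
const DEPLOY_RE = /\bdb:push\b|\bdb:check-sync\b|remote db push|migration[^.]*(deploy|remote|push|not[_ ]?deployed)/i
const MIGRATION_FILE_RE = /(\d{14}_[a-z0-9_]+\.sql)/i

// v4.31.0 ordering: the filename branch runs first.
function keyFilenameFirst(text) {
  const mig = text.match(MIGRATION_FILE_RE)
  if (mig) return 'migration:' + mig[1].toLowerCase()
  if (DEPLOY_RE.test(text)) return 'db-migration-deploy'
  return null
}

// v4.31.1 ordering: the deploy intent wins over a bare filename.
function keyDeployFirst(text) {
  if (DEPLOY_RE.test(text)) return 'db-migration-deploy'
  const mig = text.match(MIGRATION_FILE_RE)
  if (mig) return 'migration:' + mig[1].toLowerCase()
  return null
}

// Two hypothetical residual texts describing the same physical db:push.
const samples = [
  'owner must run db:push on the remote database',
  'db:push pending for 20260611120000_add_widgets.sql',
]

// Count how many distinct keys each ordering produces for the same action.
const distinct = (fn) => new Set(samples.map(fn)).size
console.log(distinct(keyFilenameFirst), distinct(keyDeployFirst)) // 2 1
```

With the filename branch first, the residual that happens to name a `.sql` file gets its own key, so one action splits into two groups. With the deploy branch first, both texts map to `db-migration-deploy`.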
package/VERSION
CHANGED
@@ -1 +1 @@
-4.31.0
+4.31.1
package/framework/.claude/skills/new2/SKILL.md
CHANGED
@@ -228,9 +228,16 @@ returns when the batch is done. It returns:
 (`git -C $MAIN log --oneline ${trunk} | grep <card>`); annotate any divergence and never present
 progress the disk does not show. Then fill `wall_clock_s` (now − kickoff `ts`) and
 `followups_on_disk` (count the actual follow-up files on disk in the main repo, NOT
-`residualFollowups.length` — which double-counts).
-workflow
-`
+`residualFollowups.length` — which double-counts). **Cost (the A/B's whole point) — do NOT trust
+the workflow's `total_tokens`/`agent_count` as the headline figure.** They are **partial**:
+`total_tokens` is `budget.spent()` output-only AND the agent counters do not include the subagents
+spawned inside the nested workflows (`new2-resolve`, `new-final-review`) — on the real FEAT-0022 run
+the workflow reported `total_tokens=476k` / `agent_count=30` while the harness saw ~4.15M subagent
+tokens / 64 agents (≈8× under). So **always** run the `/new` Phase-8 `-stats` script to read the
+real transcript `usage` (not only when `total_tokens` is null), record it as the headline cost, and
+keep the workflow's figures clearly labelled as `total_tokens_workflow`/`agent_count_workflow`
+(partial, output-only, excludes nested workflows). Keep `degraded`/`degradation_reasons` +
+`cards_deferred_done_pending` in the record so
 the A/B comparison stays honest. Also record `migration_gate: <migration.status>`
 (`none`|`applied`|`skipped`|`degraded`) — the Step-3.5 gate is a pre-launch interaction, NOT a
 mid-batch question, so it does not break the zero-ask-during-batch invariant; logging it keeps the
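The Step 5.5 cost rule can be sketched as a small record builder. This is an illustration under assumptions, not the skill's actual code: `buildCostRecord`, its parameter shapes, and the `under_report_factor` field are hypothetical. Only the `total_tokens`/`agent_count` and `*_workflow` field names, and the quoted FEAT-0022 figures, come from the text.

```javascript
// Sketch only: buildCostRecord and under_report_factor are hypothetical names.
function buildCostRecord(workflow, transcript) {
  return {
    // Headline cost: the real transcript usage (as read by the -stats script).
    total_tokens: transcript.totalTokens,
    agent_count: transcript.agentCount,
    // Workflow counters are kept, but labelled as partial.
    total_tokens_workflow: workflow.totalTokens,
    agent_count_workflow: workflow.agentCount,
    // Ratio of headline tokens to workflow tokens, rounded to one decimal.
    under_report_factor: Math.round((transcript.totalTokens / workflow.totalTokens) * 10) / 10,
  }
}

// Rounded figures quoted for the FEAT-0022 run.
const record = buildCostRecord(
  { totalTokens: 476000, agentCount: 30 },
  { totalTokens: 4150000, agentCount: 64 },
)
console.log(record.under_report_factor) // 8.7
```

Because the headline fields are always filled from the transcript, a non-null but partial workflow figure can no longer be mistaken for the total.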
package/framework/.claude/workflows/new2.js
CHANGED
@@ -389,9 +389,14 @@ async function resolve(kind, card, evidence, extra) {
 // domain and decide whether an out-of-ownership remedy lands inside the batch union.
 residuals.push({ card, kind, evidence, materialized: !!fc, deferralClass, domain: dom, remedyFiles: (res && res.remedyFiles) || [] })
 }
-// F-022 — route out-of-scope findings the resolve surfaced.
+// F-022 — route out-of-scope findings the resolve surfaced. v4.31.1 — a finding with no file,
+// no line AND no evidence is noise (the resolve emitted an empty placeholder); pushing it would
+// mint a contentless follow-up card. Skip when the composed evidence is empty after stripping the
+// `:`/space separators — a real finding always carries at least a file, a line, or a description.
 for (const osf of (res && res.outOfScopeFindings) || []) {
-
+  const ev = `${osf.file || ''}:${osf.line || ''} ${osf.evidence || ''}`
+  if (!ev.replace(/[:\s]/g, '')) continue
+  residuals.push({ card, kind: 'out-of-scope', evidence: ev.trim(), materialized: false })
 }
 ledger(card, 'resolve:' + kind, status, (res && (res.followupCard || res.reason)) || '')
 return { status, deferralClass }
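The empty-evidence guard in `resolve()` can be exercised in isolation. In this sketch the template string and the regex are copied from the hunk; `composeEvidence` and the sample findings are hypothetical.

```javascript
// Sketch only: composeEvidence is a hypothetical wrapper around the two added lines.
function composeEvidence(osf) {
  const ev = `${osf.file || ''}:${osf.line || ''} ${osf.evidence || ''}`
  // Nothing left once ':' and whitespace are stripped: treat it as an empty placeholder.
  if (!ev.replace(/[:\s]/g, '')) return null
  return ev.trim()
}

const findings = [
  {}, // empty placeholder, composes to ": "
  { file: 'src/db.js', line: 42, evidence: 'missing index' },
  { evidence: 'no file given' },
]

const kept = findings.map(composeEvidence).filter((ev) => ev !== null)
console.log(kept) // [ 'src/db.js:42 missing index', ': no file given' ]
```

A finding with only a description is still kept. Its composed evidence keeps the leading `:` separator because only surrounding whitespace is trimmed.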
@@ -1128,19 +1133,31 @@ async function reconcileLedgerAgainstHead() {
 // ───────────────────────────────────────────────────────────────────────────
 function ownerGatedActionKey(r) {
   const hay = `${r.evidence || ''} ${(r.remedyFiles || []).join(' ')}`
+  // ORDER MATTERS — the migration-deploy intent wins over a bare filename match. One `db:push`
+  // pushes ALL pending migrations in a single action, so every "migration not deployed" / db:check-
+  // -sync / db:push residual must collapse onto ONE key regardless of whether it also names the .sql
+  // file (the real FEAT-0022 run split 5 identical db:push residuals across `db-migration-deploy`
+  // and `migration:<file>` because the filename branch ran first → only the deploy intent is canon).
+  if (/\bdb:push\b|\bdb:check-sync\b|remote db push|migration[^.]*(deploy|remote|push|not[_ ]?deployed)/i.test(hay)) return 'db-migration-deploy'
   const mig = hay.match(/(\d{14}_[a-z0-9_]+\.sql)/i)
   if (mig) return 'migration:' + mig[1].toLowerCase()
-  if (/\bdb:push\b|\bdb:check-sync\b|remote db push|migration.*(deploy|remote|push)/i.test(hay)) return 'db-migration-deploy'
   if (/\bdeploy(ment)?\b/i.test(hay)) return 'deploy'
   if (/\bsecret\b/i.test(hay)) return 'secret'
   if (/\bDNS\b|\bdomain\b/i.test(hay)) return 'dns'
   return null // unknown action → never dedup (avoid collapsing two genuinely-distinct externals)
 }
+// classes that are ALWAYS an external/infra action (never code a commit can close) — the only ones
+// safe to collapse by shared action key. `policy-deferred-ac` belongs here (F-016: an AC whose
+// remedy is out-of-ownership or an owner-gated infra step) — the real FEAT-0022 run proved a single
+// `db:push` surfaced as owner-gated AND policy-deferred-ac on the same card, so excluding the latter
+// left duplicates uncollapsed. ac-unmet/blocker/merge-blocker are NOT here: they may be code defects,
+// and their EXTERNAL instances are already reclassified to `owner-gated` by the F-040 classifier.
+const EXTERNAL_DEFERRAL = new Set(['owner-gated', 'not-a-code-defect', 'policy-deferred-ac'])
 function dedupOwnerGatedResiduals() {
   const realCard = new Set(cardIds)
   const groups = {}
   for (const r of residuals) {
-    if (
+    if (!EXTERNAL_DEFERRAL.has(r.deferralClass)) continue
     const k = ownerGatedActionKey(r)
     if (!k) continue
     ;(groups[k] = groups[k] || []).push(r)
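The hunk ends before the collapse step, so the following is only a sketch of how grouping by `EXTERNAL_DEFERRAL` plus action key could reduce 7 residuals to 2. `EXTERNAL_DEFERRAL` is copied from the diff. The card ids, the simplified `actionKey`, and the collapse rule (keep one residual per real card, drop batch-level entries when a real card carries the same action) are assumptions based on the changelog's description.

```javascript
// Sketch only: everything except EXTERNAL_DEFERRAL is hypothetical.
const EXTERNAL_DEFERRAL = new Set(['owner-gated', 'not-a-code-defect', 'policy-deferred-ac'])
const cardIds = ['CARD-01', 'CARD-02'] // hypothetical real card ids

// Simplified stand-in for ownerGatedActionKey: handles only the db:push case.
const actionKey = (r) => (/\bdb:push\b/i.test(r.evidence || '') ? 'db-migration-deploy' : null)

function dedupExternal(residuals) {
  const realCard = new Set(cardIds)
  const groups = {}
  const kept = []
  for (const r of residuals) {
    const k = EXTERNAL_DEFERRAL.has(r.deferralClass) ? actionKey(r) : null
    if (!k) { kept.push(r); continue } // possible code defect or unknown action: never dedup
    ;(groups[k] = groups[k] || []).push(r)
  }
  for (const group of Object.values(groups)) {
    const onCards = group.filter((r) => realCard.has(r.card))
    // If no real card carries the action, leave the group untouched.
    if (onCards.length === 0) { kept.push(...group); continue }
    const seen = new Set()
    for (const r of onCards) {
      if (seen.has(r.card)) continue // keep only the first residual per real card
      seen.add(r.card)
      kept.push(r)
    }
  }
  return kept
}

// Same shape as the changelog's count: 5 on one card, 1 on another, 1 batch-level.
const mk = (card, deferralClass) => ({ card, deferralClass, evidence: 'remote db:push not run' })
const residuals = [
  mk('CARD-01', 'owner-gated'), mk('CARD-01', 'owner-gated'), mk('CARD-01', 'owner-gated'),
  mk('CARD-01', 'policy-deferred-ac'), mk('CARD-01', 'policy-deferred-ac'),
  mk('CARD-02', 'owner-gated'),
  mk('F001', 'owner-gated'),
]
console.log(dedupExternal(residuals).length) // 2
```

If `policy-deferred-ac` were removed from the set, the two residuals carrying that class would bypass grouping and be kept as-is, which is the gap the 4.31.1 entry describes.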