@adia-ai/a2ui-mcp 0.0.5 → 0.1.1
- package/CHANGELOG.md +166 -0
- package/package.json +2 -2
- package/scripts/eval-diff.mjs +62 -6
- package/scripts/eval-refine-synthesis.mjs +270 -0
- package/scripts/semantic-stats.mjs +113 -0
- package/scripts/smoke-issues.mjs +266 -0
- package/scripts/smoke-refine.mjs +374 -0
- package/scripts/smoke-state-cache.mjs +130 -0
- package/scripts/test-a2ui.mjs +103 -0
- package/server.js +309 -0
package/CHANGELOG.md
CHANGED
|
@@ -11,6 +11,172 @@ zettel strategies.
|
|
|
11
11
|
|
|
12
12
|
---
|
|
13
13
|
|
|
14
|
+
## [0.1.1] - 2026-05-01
|
|
15
|
+
|
|
16
|
+
Phase 2 of [`docs/specs/semantic-validator.md`](../../../docs/specs/semantic-validator.md)
|
|
17
|
+
— opt-in combined-gating in `eval-diff.mjs` + new
|
|
18
|
+
`semantic-stats.mjs` companion script. **No breaking changes.**
|
|
19
|
+
Default `eval-diff` behavior unchanged — Phase 1 shadow-mode is
|
|
20
|
+
still the default; combined gating is opt-in via flags.
|
|
21
|
+
|
|
22
|
+
### Added (`scripts/eval-diff.mjs` — Phase 2 gating flags)
|
|
23
|
+
|
|
24
|
+
- **`--gate-mode {structural|combined}`** — `structural` (default)
|
|
25
|
+
preserves Phase 1 shadow behavior: `row.pass` gates on
|
|
26
|
+
`validationScore` alone; semantic verdicts are annotation-only.
|
|
27
|
+
`combined` flips `row.pass` to gate on the combined score
|
|
28
|
+
(`round(0.6 × validationScore + 0.4 × semanticScore)`); preserves
|
|
29
|
+
the pre-flip pass as `row.passStructural`; recomputes
|
|
30
|
+
`runObj.passRate` + carries `runObj.passRateStructural` (baseline)
|
|
31
|
+
alongside; records `runObj.gateMode` + `runObj.gateThreshold`;
|
|
32
|
+
`diff.md` gains structural-baseline + avgSemantic + avgCombined
|
|
33
|
+
rows.
|
|
34
|
+
- **`--gate-threshold N`** — combined-mode threshold; default 70 to
|
|
35
|
+
match the existing structural threshold. Override per-run for
|
|
36
|
+
sweep-style tuning.
|
|
37
|
+
- **Validation gate** — combined-mode requires `--semantic`; the
|
|
38
|
+
script rejects the flag combination at startup so the operator
|
|
39
|
+
never silently ships the gating change without the scores it
|
|
40
|
+
needs.
|
|
41
|
+
|
|
42
|
+
### Added (`scripts/semantic-stats.mjs` — companion stats script)
|
|
43
|
+
|
|
44
|
+
- **New** — read-only; takes two run JSON paths
|
|
45
|
+
(`evals/mcp/runs/<stamp>/{mcp,zettel}.json`); emits markdown to
|
|
46
|
+
stdout with **verdict-distribution deltas + per-intent pass-flip
|
|
47
|
+
diagnostics** (which intents flipped pass→fail or fail→pass
|
|
48
|
+
between baseline and candidate). The tooling that satisfies the
|
|
49
|
+
"no unexplained regressions" exit criterion of Phase 2 §
|
|
50
|
+
Rollout before promoting combined gating to default.
|
|
51
|
+
|
|
52
|
+
### Procedure for promotion (deferred)
|
|
53
|
+
|
|
54
|
+
Promotion to default is deferred until two full eval-diff runs
|
|
55
|
+
(structural-only baseline + combined-gating candidate) have been
|
|
56
|
+
compared via `semantic-stats.mjs` and the regression count
|
|
57
|
+
justifies it. Procedure:
|
|
58
|
+
|
|
59
|
+
```bash
|
|
60
|
+
# 1. Capture structural-only baseline (default behavior)
|
|
61
|
+
node packages/a2ui/mcp/scripts/eval-diff.mjs --engine zettel --semantic
|
|
62
|
+
|
|
63
|
+
# 2. Run the candidate (combined gating)
|
|
64
|
+
node packages/a2ui/mcp/scripts/eval-diff.mjs --engine zettel --semantic --gate-mode combined
|
|
65
|
+
|
|
66
|
+
# 3. Compare
|
|
67
|
+
node packages/a2ui/mcp/scripts/semantic-stats.mjs \
|
|
68
|
+
evals/mcp/runs/<baseline-stamp>/zettel.json \
|
|
69
|
+
evals/mcp/runs/<candidate-stamp>/zettel.json > /tmp/semantic-stats.md
|
|
70
|
+
```
|
|
71
|
+
|
|
72
|
+
### Implementation references
|
|
73
|
+
|
|
74
|
+
- [`scripts/eval-diff.mjs`](scripts/eval-diff.mjs)
|
|
75
|
+
- [`scripts/semantic-stats.mjs`](scripts/semantic-stats.mjs)
|
|
76
|
+
|
|
77
|
+
### Commits
|
|
78
|
+
|
|
79
|
+
- `8415ff9e` — `feat(validator): semantic Phase 2 — opt-in combined-gating + drift cleanup`
|
|
80
|
+
|
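The pass-flip diagnostics described for `semantic-stats.mjs` in the 0.1.1 entry can be sketched as a comparison of two run objects. This is a minimal illustration, not the script's actual code: the helper name `passFlips` and the `intent` row key are assumptions, and only the `results` array and the boolean `pass` field are taken from the diff.

```javascript
// Hypothetical helper: NOT the actual semantic-stats.mjs implementation.
// Assumes each run object has a `results` array whose rows carry an
// `intent` string (assumed key) and a boolean `pass`.
function passFlips(baselineRun, candidateRun) {
  // Index the baseline verdicts by intent for O(1) lookup.
  const baseline = new Map(baselineRun.results.map((r) => [r.intent, r.pass]));
  const flips = { passToFail: [], failToPass: [] };
  for (const row of candidateRun.results) {
    if (!baseline.has(row.intent)) continue; // only in candidate: nothing to compare
    const before = Boolean(baseline.get(row.intent));
    const after = Boolean(row.pass);
    if (before && !after) flips.passToFail.push(row.intent);
    if (!before && after) flips.failToPass.push(row.intent);
  }
  return flips;
}

// Illustrative data only; these intents are made up.
const baselineRun = {
  results: [
    { intent: 'login form', pass: true },
    { intent: 'pricing table', pass: true },
    { intent: 'settings page', pass: false },
  ],
};
const candidateRun = {
  results: [
    { intent: 'login form', pass: true },
    { intent: 'pricing table', pass: false },
    { intent: 'settings page', pass: true },
    { intent: 'dashboard', pass: true },
  ],
};
const flips = passFlips(baselineRun, candidateRun);
console.log(flips);
```

Every entry in `passToFail` is a regression that would need an explanation before combined gating could be promoted to the default.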
|
81
|
+
## [0.1.0] - 2026-04-28
|
|
82
|
+
|
|
83
|
+
**Multi-turn gen-UI tool surface (Phase A code-complete).** Adds three new
|
|
84
|
+
MCP tools that turn the chunk-composition pipeline from single-shot into a
|
|
85
|
+
multi-turn surface, plus extends `compose_from_chunks` to mint a `state_id`
|
|
86
|
+
for refinement chains.
|
|
87
|
+
|
|
88
|
+
Spec: [`docs/specs/genui-multiturn-architecture.md`](../../../docs/specs/genui-multiturn-architecture.md) (Active v0.1.0).
|
|
89
|
+
Plan: [`docs/plans/genui-multiturn-rollout-2026-04-28.md`](../../../docs/plans/genui-multiturn-rollout-2026-04-28.md) (Phase A scoped).
|
|
90
|
+
ADR: [`0008-multiturn-genui-architecture.md`](../../../.brain/adrs/0008-multiturn-genui-architecture.md).
|
|
91
|
+
|
|
92
|
+
### Added (MCP tools)
|
|
93
|
+
|
|
94
|
+
- **`refine_composition(state_id, intent | ops, max_attempts?)`** — takes a
|
|
95
|
+
`state_id` from a prior `compose_from_chunks` (or `refine_composition`)
|
|
96
|
+
call plus either a natural-language intent OR an explicit op-list, runs
|
|
97
|
+
the chunk-refiner's two-pass synthesis (locator → modifier; validator-
|
|
98
|
+
driven retry on op-validation failure), applies the resulting chunk-plan
|
|
99
|
+
ops, re-materializes HTML, mints a child `state_id` chained back to the
|
|
100
|
+
parent, and returns A2UI `updateComponents` messages (the wire format).
|
|
101
|
+
Failed ops surface in `ops_failed` with reasons; the new state is cached
|
|
102
|
+
for further refinement.
|
|
103
|
+
- **`get_state(state_id)`** — read-only inspection of a cached composition
|
|
104
|
+
state. Returns the chunk binding plan, materialized HTML, ops history
|
|
105
|
+
(chronological list of every refinement applied to this state's lineage),
|
|
106
|
+
and `parent_state_id` (chain-back). Auto-fires `cache-miss-on-known-state`
|
|
107
|
+
(severity `nit`) when the id is absent.
|
|
108
|
+
- **`report_issue(type, severity, title, body, state_id?, trace?, …)`** —
|
|
109
|
+
first-class telemetry / dev-process feedback tool. Writes a structured
|
|
110
|
+
JSON ticket to `.brain/audit-history/issues/<issue_id>.json`. Three
|
|
111
|
+
reporter kinds: LLM self-fire (this tool with `reporter: 'llm'`),
|
|
112
|
+
consumer-fire (passed through directly), engine auto-fire (internal,
|
|
113
|
+
per `AUTO_FIRE_POLICY` in the issue-reporter module). Severity vocabulary
|
|
114
|
+
`blocker | drift | nit` matches the existing `coherence-audit`
|
|
115
|
+
discipline. Trace levels: `'full' | 'summary' | 'none'`; oversized
|
|
116
|
+
traces (> 200 KB) spill to a sidecar `.trace.json` file. Tool count
|
|
117
|
+
goes from 25 → 28.
|
|
118
|
+
|
|
119
|
+
### Changed
|
|
120
|
+
|
|
121
|
+
- **`compose_from_chunks`** now mints a `state_id` and caches the result
|
|
122
|
+
before returning. The response shape gains a `state_id` field; existing
|
|
123
|
+
fields (`html`, `plan`, `source`, `score`, `warnings`, `synthesis`)
|
|
124
|
+
are unchanged. Backward-compatible — consumers ignoring `state_id` see
|
|
125
|
+
no behavior change.
|
|
126
|
+
- **MCP server boot** instantiates a `getStateCache()` singleton and an
|
|
127
|
+
`ENGINE_VERSION_INFO` block (mcp 0.1.0, corpus 0.0.6, engine zettel,
|
|
128
|
+
llm_adapter anthropic) that's threaded through every issue-reporter
|
|
129
|
+
call so written tickets carry environment metadata.
|
|
130
|
+
|
|
131
|
+
### Auto-fire policy (engine-driven)
|
|
132
|
+
|
|
133
|
+
`refine_composition` and `get_state` auto-fire `report_issue` on these
|
|
134
|
+
failure paths via the per-tool-call `IssueAccumulator`:
|
|
135
|
+
|
|
136
|
+
| Path | Type | Severity |
|
|
137
|
+
|---|---|---|
|
|
138
|
+
| Synthesizer exhausts retries | bug | drift |
|
|
139
|
+
| Validator exhausts retries on refinement | bug | blocker |
|
|
140
|
+
| Locator pass returns empty for targeted intent | bug | drift |
|
|
141
|
+
| Retrieval 0 + synthesis fallback fails | training-gap | drift |
|
|
142
|
+
| `get_state` called with absent `state_id` | bug | nit |
|
|
143
|
+
| `refine_composition` ops_failed list non-empty | bug | drift |
|
|
144
|
+
|
|
145
|
+
Multiple auto-fires within one tool call coalesce into a single issue
|
|
146
|
+
(highest severity wins; reasons listed in body + tags).
|
|
147
|
+
|
|
148
|
+
### Smoke + eval
|
|
149
|
+
|
|
150
|
+
- `smoke:state-cache` — 34/34.
|
|
151
|
+
- `smoke:issues` — 62/62.
|
|
152
|
+
- `smoke:refine` — 51/51 (stub LLM).
|
|
153
|
+
- `test:a2ui` — 25/25 + 1 skipped (was 19/19 + 1; +6 multi-turn assertions).
|
|
154
|
+
- `mcp:smoke` — server boots clean with 28 tools registered.
|
|
155
|
+
- **`eval:refine-synthesis`** — 15/15 PASS. Ops 100%, validate 100%,
|
|
156
|
+
0 auto-fires, 67 s.
|
|
157
|
+
- **No regression:** `eval:chunk-synthesis` 10/10, `eval:diff zettel`
|
|
158
|
+
coverage 83 / score 89 / MRR 0.986.
|
|
159
|
+
|
|
160
|
+
### Dependencies
|
|
161
|
+
|
|
162
|
+
- Bumps `@adia-ai/a2ui-compose` requirement from `^0.0.1` to `^0.1.0`.
|
|
163
|
+
|
|
164
|
+
### Migration
|
|
165
|
+
|
|
166
|
+
Additive surface; no breaking changes. The existing 25 tools are
|
|
167
|
+
unchanged behaviorally; `compose_from_chunks` adds a `state_id` field to
|
|
168
|
+
its response that ignoring consumers can safely drop.
|
|
169
|
+
|
|
170
|
+
### Phase A simplification (documented)
|
|
171
|
+
|
|
172
|
+
Refinement ops internally use a chunk-plan vocabulary
|
|
173
|
+
(`rebindSlot | appendToSlot | removeFromSlot | replacePage`), wrapped
|
|
174
|
+
on output as standard `updateComponents` A2UI messages with
|
|
175
|
+
`components[].html` carrying the materialized payload. Strict
|
|
176
|
+
component-tree shape upgrade is queued for Phase B.
|
|
177
|
+
|
|
178
|
+
---
|
|
179
|
+
|
|
14
180
|
## [0.0.5] - 2026-04-28
|
|
15
181
|
|
|
16
182
|
**Retires the legacy exemplar auto-ingest.** Server boot no longer pulls
|
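The coalescing rule in the 0.1.0 auto-fire policy (several auto-fires in one tool call become a single issue, highest severity wins, reasons listed in body and tags) can be sketched as follows. This is a hedged illustration: `coalesceAutoFires`, the shape of a fire record, and the ranking `blocker > drift > nit` are assumptions inferred from the changelog's severity vocabulary, not the issue-reporter module's real API.

```javascript
// Assumed ordering of the changelog's severity vocabulary (higher = worse).
const SEVERITY_RANK = { nit: 0, drift: 1, blocker: 2 };

// Hypothetical helper: merges the auto-fires collected during one tool call
// into a single issue. Returns null when nothing fired.
function coalesceAutoFires(fires) {
  if (fires.length === 0) return null;
  let severity = fires[0].severity;
  for (const fire of fires) {
    if (SEVERITY_RANK[fire.severity] > SEVERITY_RANK[severity]) {
      severity = fire.severity;
    }
  }
  const reasons = fires.map((fire) => fire.reason);
  return {
    severity,                                       // highest severity wins
    body: reasons.map((r) => `- ${r}`).join('\n'),  // every reason listed in the body
    tags: [...new Set(reasons)],                    // de-duplicated reasons as tags
  };
}

// Illustrative reasons, paraphrasing two rows of the policy table.
const issue = coalesceAutoFires([
  { reason: 'ops-failed-non-empty', severity: 'drift' },
  { reason: 'validator-exhausted-retries', severity: 'blocker' },
]);
console.log(issue.severity);
```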
package/package.json
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
{
|
|
2
2
|
"name": "@adia-ai/a2ui-mcp",
|
|
3
|
-
"version": "0.
|
|
3
|
+
"version": "0.1.1",
|
|
4
4
|
"description": "AdiaUI A2UI MCP server. Exposes the compose engine over MCP with an engine selector for monolithic + zettel strategies.",
|
|
5
5
|
"type": "module",
|
|
6
6
|
"bin": {
|
|
@@ -26,7 +26,7 @@
|
|
|
26
26
|
},
|
|
27
27
|
"dependencies": {
|
|
28
28
|
"@modelcontextprotocol/sdk": "^1.29.0",
|
|
29
|
-
"@adia-ai/a2ui-compose": "^0.0
|
|
29
|
+
"@adia-ai/a2ui-compose": "^0.1.0",
|
|
30
30
|
"@adia-ai/a2ui-retrieval": "^0.0.1",
|
|
31
31
|
"@adia-ai/a2ui-validator": "^0.0.1",
|
|
32
32
|
"@adia-ai/a2ui-corpus": "^0.0.6",
|
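The dependency change from `^0.0.1` to `^0.1.0` matters because npm caret ranges are narrow below 1.0.0: `^0.0.1` admits only `0.0.1`, while `^0.1.0` admits `0.1.x`. The sketch below is a deliberately simplified matcher written for illustration; it handles plain `MAJOR.MINOR.PATCH` versions only (no prerelease tags), and its function names are hypothetical rather than taken from the `semver` package.

```javascript
// Simplified caret-range check for plain x.y.z versions (no prerelease tags).
function parseVersion(v) {
  return v.split('.').map(Number);
}

function compareVersions(a, b) {
  for (let i = 0; i < 3; i += 1) {
    if (a[i] !== b[i]) return a[i] < b[i] ? -1 : 1;
  }
  return 0;
}

// The caret allows changes that do not modify the left-most non-zero part.
function caretUpperBound([major, minor, patch]) {
  if (major > 0) return [major + 1, 0, 0];
  if (minor > 0) return [0, minor + 1, 0];
  return [0, 0, patch + 1];
}

function satisfiesCaret(version, range) {
  const lower = parseVersion(range.slice(1)); // drop the leading '^'
  const upper = caretUpperBound(lower);
  const v = parseVersion(version);
  return compareVersions(v, lower) >= 0 && compareVersions(v, upper) < 0;
}

console.log(satisfiesCaret('0.1.0', '^0.0.1')); // old range vs. new compose release
console.log(satisfiesCaret('0.1.0', '^0.1.0')); // new range vs. new compose release
```

Under this reading, a consumer pinned to the old range could not have resolved a `0.1.x` release of `@adia-ai/a2ui-compose`, which is why the requirement had to be bumped explicitly.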
package/scripts/eval-diff.mjs
CHANGED
|
@@ -18,6 +18,9 @@
|
|
|
18
18
|
* node packages/a2ui/mcp/scripts/eval-diff.mjs --engine zettel # fragment-graph only
|
|
19
19
|
* node packages/a2ui/mcp/scripts/eval-diff.mjs --limit 20
|
|
20
20
|
* node packages/a2ui/mcp/scripts/eval-diff.mjs --domain forms
|
|
21
|
+
* node packages/a2ui/mcp/scripts/eval-diff.mjs --semantic # Phase 1: shadow-mode semantic annotations
|
|
22
|
+
* node packages/a2ui/mcp/scripts/eval-diff.mjs --semantic --gate-mode combined # Phase 2: gate row.pass on combined score
|
|
23
|
+
* node packages/a2ui/mcp/scripts/eval-diff.mjs --semantic --gate-mode combined --gate-threshold 75
|
|
21
24
|
*/
|
|
22
25
|
import '../../../../scripts/load-env.mjs';
|
|
23
26
|
|
|
@@ -42,8 +45,24 @@ const opt = (k) => {
|
|
|
42
45
|
const engine = opt('engine') || 'all';
|
|
43
46
|
const limit = opt('limit') ? Number(opt('limit')) : undefined;
|
|
44
47
|
const domain = opt('domain');
|
|
45
|
-
// Shadow-mode semantic validator (Phase 1). Opt-in; zero effect on gating
|
|
48
|
+
// Shadow-mode semantic validator (Phase 1). Opt-in; zero effect on gating
|
|
49
|
+
// when --gate-mode=structural (default).
|
|
46
50
|
const semanticEnabled = args.includes('--semantic');
|
|
51
|
+
// Phase 2 (gating mode):
|
|
52
|
+
// structural (default) — `row.pass` gated on validationScore alone (Phase 1 behavior)
|
|
53
|
+
// combined — `row.pass` gated on (0.6 * validationScore + 0.4 * semanticScore)
|
|
54
|
+
// Combined mode requires --semantic. Threshold defaults to 70 to match the
|
|
55
|
+
// existing structural threshold; override with --gate-threshold N.
|
|
56
|
+
const gateMode = opt('gate-mode') || 'structural';
|
|
57
|
+
const gateThreshold = opt('gate-threshold') ? Number(opt('gate-threshold')) : 70;
|
|
58
|
+
if (!['structural', 'combined'].includes(gateMode)) {
|
|
59
|
+
console.error(`[eval-diff] --gate-mode must be one of: structural | combined (got: ${gateMode})`);
|
|
60
|
+
process.exit(2);
|
|
61
|
+
}
|
|
62
|
+
if (gateMode === 'combined' && !semanticEnabled) {
|
|
63
|
+
console.error(`[eval-diff] --gate-mode=combined requires --semantic (semantic scores must be computed before gating on them)`);
|
|
64
|
+
process.exit(2);
|
|
65
|
+
}
|
|
47
66
|
|
|
48
67
|
if (!['mcp', 'zettel', 'all'].includes(engine)) {
|
|
49
68
|
console.error(`[eval-diff] --engine must be one of: mcp | zettel | all (got: ${engine})`);
|
|
@@ -144,7 +163,12 @@ async function annotateSemantic(runObj, label) {
|
|
|
144
163
|
row.rubricVersion = v.rubricVersion;
|
|
145
164
|
const structural = row.validationScore ?? 0;
|
|
146
165
|
row.combinedScore = Math.round(0.6 * structural + 0.4 * v.score);
|
|
147
|
-
//
|
|
166
|
+
// Phase 2: when gateMode === 'combined', flip row.pass to gate on the
|
|
167
|
+
// combined score. Preserves the structural pass for diagnostic purposes.
|
|
168
|
+
if (gateMode === 'combined') {
|
|
169
|
+
row.passStructural = row.pass;
|
|
170
|
+
row.pass = row.combinedScore >= gateThreshold;
|
|
171
|
+
}
|
|
148
172
|
if (!v.error) {
|
|
149
173
|
semSum += v.score;
|
|
150
174
|
semN += 1;
|
|
@@ -164,7 +188,8 @@ async function annotateSemantic(runObj, label) {
|
|
|
164
188
|
}
|
|
165
189
|
runObj.semantic = {
|
|
166
190
|
enabled: true,
|
|
167
|
-
mode: 'shadow',
|
|
191
|
+
mode: gateMode === 'combined' ? 'gating' : 'shadow',
|
|
192
|
+
gateThreshold: gateMode === 'combined' ? gateThreshold : null,
|
|
168
193
|
judged: semN,
|
|
169
194
|
errors,
|
|
170
195
|
cached,
|
|
@@ -174,14 +199,31 @@ async function annotateSemantic(runObj, label) {
|
|
|
174
199
|
tokens: { input: tokensIn, output: tokensOut },
|
|
175
200
|
rubricVersion: 'v1',
|
|
176
201
|
};
|
|
177
|
-
|
|
202
|
+
// Phase 2: when gateMode === 'combined', recompute pass aggregates so
|
|
203
|
+
// runObj.passRate / runObj.pass reflect the new gate. Capture the
|
|
204
|
+
// structural-only pass count alongside for diagnostic comparison.
|
|
205
|
+
if (gateMode === 'combined') {
|
|
206
|
+
const structuralPassCount = runObj.results.filter((r) => r.passStructural).length;
|
|
207
|
+
const combinedPassCount = runObj.results.filter((r) => r.pass).length;
|
|
208
|
+
runObj.passStructural = structuralPassCount;
|
|
209
|
+
runObj.passRateStructural = Math.round((structuralPassCount / (runObj.results.length || 1)) * 100);
|
|
210
|
+
runObj.pass = combinedPassCount;
|
|
211
|
+
runObj.passRate = Math.round((combinedPassCount / (runObj.results.length || 1)) * 100);
|
|
212
|
+
runObj.gateMode = 'combined';
|
|
213
|
+
runObj.gateThreshold = gateThreshold;
|
|
214
|
+
}
|
|
215
|
+
const modeLabel = gateMode === 'combined' ? `gating(>=${gateThreshold})` : 'shadow';
|
|
216
|
+
console.error(`[semantic:${label}] mode=${modeLabel} judged=${semN} avgSem=${runObj.semantic.avgSemanticScore} avgCombined=${runObj.semantic.avgCombinedScore} cached=${cached} errors=${errors} tokens=${tokensIn}+${tokensOut}`);
|
|
178
217
|
}
|
|
179
218
|
|
|
180
219
|
if (semanticEnabled) {
|
|
181
220
|
if (!process.env.ANTHROPIC_API_KEY) {
|
|
182
221
|
console.error('[eval-diff] --semantic requested but ANTHROPIC_API_KEY missing; skipping.');
|
|
183
222
|
} else {
|
|
184
|
-
|
|
223
|
+
const modeNote = gateMode === 'combined'
|
|
224
|
+
? `gating mode (combined threshold=${gateThreshold})`
|
|
225
|
+
: 'shadow mode';
|
|
226
|
+
console.error(`[eval-diff] running semantic validator (${modeNote})…`);
|
|
185
227
|
if (mcp) await annotateSemantic(mcp, 'mcp');
|
|
186
228
|
if (zettel) await annotateSemantic(zettel, 'zettel');
|
|
187
229
|
}
|
|
@@ -209,7 +251,11 @@ md += `# Engine Eval ${mcp && zettel ? 'Diff' : 'Report'}\n\n`;
|
|
|
209
251
|
md += `- Run: \`${stamp}\`\n`;
|
|
210
252
|
md += `- Engine(s): ${engine}\n`;
|
|
211
253
|
md += `- Intents: ${(mcp || zettel).total}${domain ? ` (domain: ${domain})` : ''}${limit ? ` (limit: ${limit})` : ''}\n`;
|
|
212
|
-
md += `- Mode: instant\n
|
|
254
|
+
md += `- Mode: instant\n`;
|
|
255
|
+
if (semanticEnabled) {
|
|
256
|
+
md += `- Semantic: ${gateMode === 'combined' ? `gating (threshold=${gateThreshold})` : 'shadow'}\n`;
|
|
257
|
+
}
|
|
258
|
+
md += `\n`;
|
|
213
259
|
|
|
214
260
|
md += `## Aggregates\n\n`;
|
|
215
261
|
if (mcp && zettel) {
|
|
@@ -219,6 +265,11 @@ if (mcp && zettel) {
|
|
|
219
265
|
md += `| avgScore (emitted only) | ${mcp.avgScoreWhenEmitted} | ${zettel.avgScoreWhenEmitted} |\n`;
|
|
220
266
|
md += `| avgF1 (emitted only) | ${mcp.avgF1WhenEmitted} | ${zettel.avgF1WhenEmitted} |\n`;
|
|
221
267
|
md += `| pass rate % | ${mcp.passRate} | ${zettel.passRate} |\n`;
|
|
268
|
+
if (mcp.gateMode === 'combined' || zettel.gateMode === 'combined') {
|
|
269
|
+
md += `| pass rate % (structural-only baseline) | ${fmt(mcp.passRateStructural)} | ${fmt(zettel.passRateStructural)} |\n`;
|
|
270
|
+
md += `| avgSemanticScore | ${fmt(mcp.semantic?.avgSemanticScore)} | ${fmt(zettel.semantic?.avgSemanticScore)} |\n`;
|
|
271
|
+
md += `| avgCombinedScore | ${fmt(mcp.semantic?.avgCombinedScore)} | ${fmt(zettel.semantic?.avgCombinedScore)} |\n`;
|
|
272
|
+
}
|
|
222
273
|
md += `| retrieval MRR | ${fmt(mcp.retrievalMRR)} | ${fmt(zettel.retrievalMRR)} |\n\n`;
|
|
223
274
|
} else {
|
|
224
275
|
const e = mcp || zettel;
|
|
@@ -229,6 +280,11 @@ if (mcp && zettel) {
|
|
|
229
280
|
md += `| avgScore (emitted only) | ${e.avgScoreWhenEmitted} |\n`;
|
|
230
281
|
md += `| avgF1 (emitted only) | ${e.avgF1WhenEmitted} |\n`;
|
|
231
282
|
md += `| pass rate % | ${e.passRate} |\n`;
|
|
283
|
+
if (e.gateMode === 'combined') {
|
|
284
|
+
md += `| pass rate % (structural-only baseline) | ${fmt(e.passRateStructural)} |\n`;
|
|
285
|
+
md += `| avgSemanticScore | ${fmt(e.semantic?.avgSemanticScore)} |\n`;
|
|
286
|
+
md += `| avgCombinedScore | ${fmt(e.semantic?.avgCombinedScore)} |\n`;
|
|
287
|
+
}
|
|
232
288
|
md += `| retrieval MRR | ${fmt(e.retrievalMRR)} |\n\n`;
|
|
233
289
|
}
|
|
234
290
|
|
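The combined-gating behavior added to `eval-diff.mjs` above can be condensed into a standalone sketch. It mirrors the formula and field names shown in the diff (`combinedScore`, `passStructural`, `passRate`, `passRateStructural`), but the wrapper function `applyCombinedGate` and the `semanticScore` row field are assumptions made for illustration; the real script reads the semantic score from the validator verdict inside `annotateSemantic`.

```javascript
// Illustrative sketch only: applies the Phase 2 combined gate to result rows.
// `semanticScore` is an assumed row field standing in for the judge's score.
function applyCombinedGate(results, gateThreshold = 70) {
  const gated = results.map((row) => {
    const structural = row.validationScore ?? 0;
    const combinedScore = Math.round(0.6 * structural + 0.4 * row.semanticScore);
    return {
      ...row,
      combinedScore,
      passStructural: row.pass,             // keep the pre-flip verdict
      pass: combinedScore >= gateThreshold, // new gate
    };
  });
  const n = gated.length || 1; // avoid dividing by zero on an empty run
  const structuralPassCount = gated.filter((r) => r.passStructural).length;
  const combinedPassCount = gated.filter((r) => r.pass).length;
  return {
    results: gated,
    gateMode: 'combined',
    gateThreshold,
    passRateStructural: Math.round((structuralPassCount / n) * 100),
    passRate: Math.round((combinedPassCount / n) * 100),
  };
}

// Made-up scores: all three rows pass structurally (validationScore >= 70).
const gatedRun = applyCombinedGate([
  { validationScore: 90, semanticScore: 40, pass: true }, // combined 70
  { validationScore: 80, semanticScore: 30, pass: true }, // combined 60
  { validationScore: 75, semanticScore: 50, pass: true }, // combined 65
]);
console.log(gatedRun.passRateStructural, gatedRun.passRate);
```

With the default threshold of 70, only the first row survives the combined gate, so the gap between `passRateStructural` and `passRate` shows how much the semantic score tightens the verdict.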
package/scripts/eval-refine-synthesis.mjs
|
@@ -0,0 +1,270 @@
|
|
|
1
|
+
#!/usr/bin/env node
|
|
2
|
+
/**
|
|
3
|
+
* Real-LLM eval set for the chunk-refiner — multi-turn refinement engine.
|
|
4
|
+
*
|
|
5
|
+
* Walks 5 seed compositions × 3 refinement intents = 15 total. Seeds are
|
|
6
|
+
* deterministic chunk-binding plans (no LLM cost on the create side); the
|
|
7
|
+
* refiner exercises the full two-pass synthesis (locator → modifier),
|
|
8
|
+
* validator-driven retry, and op-application path.
|
|
9
|
+
*
|
|
10
|
+
* Pass criteria (per spec §6.2 + plan §1.7):
|
|
11
|
+
* - ≥ 80% of refinements produce ops (no all-fail outcome).
|
|
12
|
+
* - ≥ 90% of returned ops apply cleanly (validator + applyOps).
|
|
13
|
+
* - ≤ 5 auto-fired issues across the full run (plan §1.8 #3).
|
|
14
|
+
*
|
|
15
|
+
* Spec: docs/specs/genui-multiturn-architecture.md (Active v0.1.0).
|
|
16
|
+
* Plan: docs/plans/genui-multiturn-rollout-2026-04-28.md (Phase A).
|
|
17
|
+
*
|
|
18
|
+
* Usage:
|
|
19
|
+
* ANTHROPIC_API_KEY=… node packages/a2ui/mcp/scripts/eval-refine-synthesis.mjs
|
|
20
|
+
*/
|
|
21
|
+
|
|
22
|
+
import '../../../../scripts/load-env.mjs';
|
|
23
|
+
import {
|
|
24
|
+
refineFromIntent,
|
|
25
|
+
applyOps,
|
|
26
|
+
} from '../../compose/engines/zettel/chunk-refiner.js';
|
|
27
|
+
import { mintStateId } from '../../compose/engines/zettel/state-cache.js';
|
|
28
|
+
import { createIssueAccumulator } from '../../compose/engines/zettel/issue-reporter.js';
|
|
29
|
+
import { composeFromPlan } from '../../compose/engines/zettel/chunk-composer.js';
|
|
30
|
+
import { listChunksByKind, getChunk } from '../../corpus/scripts/chunk-library.js';
|
|
31
|
+
import { createAdapter } from '../../compose/llm/llm-bridge.js';
|
|
32
|
+
|
|
33
|
+
// ── Discover corpus shape ────────────────────────────────────────────
|
|
34
|
+
// Pick a page with ≥ 2 slots so refinements have room to target.
|
|
35
|
+
|
|
36
|
+
const pages = listChunksByKind('page');
|
|
37
|
+
const panels = listChunksByKind('panel');
|
|
38
|
+
const blocks = listChunksByKind('block');
|
|
39
|
+
|
|
40
|
+
const slotsOf = (c) => (c.slots || c.instances?.[0]?.slots || []).map((s) => s.name);
|
|
41
|
+
|
|
42
|
+
const samplePage =
|
|
43
|
+
pages.find((p) => slotsOf(p).length >= 2)
|
|
44
|
+
|| panels.find((p) => slotsOf(p).length >= 2)
|
|
45
|
+
|| pages[0]
|
|
46
|
+
|| panels[0];
|
|
47
|
+
|
|
48
|
+
if (!samplePage || slotsOf(samplePage).length === 0) {
|
|
49
|
+
console.error('Corpus has no page/panel chunks with declared slots — aborting eval');
|
|
50
|
+
process.exit(2);
|
|
51
|
+
}
|
|
52
|
+
|
|
53
|
+
const pageSlots = slotsOf(samplePage);
|
|
54
|
+
const slotA = pageSlots[0]; // typically the header slot
|
|
55
|
+
const slotB = pageSlots[1] || pageSlots[0]; // typically the content slot
|
|
56
|
+
|
|
57
|
+
// Pick at least 4 distinct block chunks. Filter out anything missing HTML.
|
|
58
|
+
const usableBlocks = blocks
|
|
59
|
+
.filter((b) => (b.html || b.instances?.[0]?.html))
|
|
60
|
+
.slice(0, 8)
|
|
61
|
+
.map((b) => b.name);
|
|
62
|
+
|
|
63
|
+
if (usableBlocks.length < 4) {
|
|
64
|
+
console.error(`Corpus has only ${usableBlocks.length} usable block chunks (need ≥ 4) — aborting eval`);
|
|
65
|
+
process.exit(2);
|
|
66
|
+
}
|
|
67
|
+
|
|
68
|
+
const [B0, B1, B2, B3, B4 = B0, B5 = B1] = usableBlocks;
|
|
69
|
+
|
|
70
|
+
console.log(`▶ refine-synthesis eval`);
|
|
71
|
+
console.log(` page: ${samplePage.name} (${pageSlots.join(', ')})`);
|
|
72
|
+
console.log(` blocks: ${usableBlocks.slice(0, 6).join(', ')}`);
|
|
73
|
+
console.log('');
|
|
74
|
+
|
|
75
|
+
// ── Seeds + refinements ──────────────────────────────────────────────
|
|
76
|
+
// 5 seeds × 3 refinements = 15 hold-out intents.
|
|
77
|
+
|
|
78
|
+
const SEEDS = [
|
|
79
|
+
{
|
|
80
|
+
label: 'two-block-content',
|
|
81
|
+
plan: {
|
|
82
|
+
page: samplePage.name,
|
|
83
|
+
slot_bindings: { [slotA]: [B0], [slotB]: [B1, B2] },
|
|
84
|
+
},
|
|
85
|
+
refinements: [
|
|
86
|
+
`add another block to ${slotB}`,
|
|
87
|
+
`remove one block from ${slotB}`,
|
|
88
|
+
`swap the ${slotA} for a different option`,
|
|
89
|
+
],
|
|
90
|
+
},
|
|
91
|
+
{
|
|
92
|
+
label: 'single-block-content',
|
|
93
|
+
plan: {
|
|
94
|
+
page: samplePage.name,
|
|
95
|
+
slot_bindings: { [slotA]: [B0], [slotB]: [B1] },
|
|
96
|
+
},
|
|
97
|
+
refinements: [
|
|
98
|
+
`add a second block alongside the existing one in ${slotB}`,
|
|
99
|
+
`replace the ${slotA} with a more concise header`,
|
|
100
|
+
`preserve the existing block and append another to ${slotB}`,
|
|
101
|
+
],
|
|
102
|
+
},
|
|
103
|
+
{
|
|
104
|
+
label: 'three-block-stack',
|
|
105
|
+
plan: {
|
|
106
|
+
page: samplePage.name,
|
|
107
|
+
slot_bindings: { [slotA]: [B0], [slotB]: [B1, B2, B3] },
|
|
108
|
+
},
|
|
109
|
+
refinements: [
|
|
110
|
+
`remove the middle block from ${slotB}`,
|
|
111
|
+
`drop the last block from ${slotB}`,
|
|
112
|
+
`make the layout more compact`,
|
|
113
|
+
],
|
|
114
|
+
},
|
|
115
|
+
{
|
|
116
|
+
label: 'header-only',
|
|
117
|
+
plan: {
|
|
118
|
+
page: samplePage.name,
|
|
119
|
+
slot_bindings: { [slotA]: [B0], [slotB]: [B1] },
|
|
120
|
+
},
|
|
121
|
+
refinements: [
|
|
122
|
+
`add an additional block to ${slotB}`,
|
|
123
|
+
`change the ${slotA}`,
|
|
124
|
+
`preserve everything and add a new block to ${slotB}`,
|
|
125
|
+
],
|
|
126
|
+
},
|
|
127
|
+
{
|
|
128
|
+
label: 'mixed-stack',
|
|
129
|
+
plan: {
|
|
130
|
+
page: samplePage.name,
|
|
131
|
+
slot_bindings: { [slotA]: [B0], [slotB]: [B2, B3] },
|
|
132
|
+
},
|
|
133
|
+
refinements: [
|
|
134
|
+
`swap the first block in ${slotB} for a different one`,
|
|
135
|
+
`add another block at the end of ${slotB}`,
|
|
136
|
+
`drop the last block from ${slotB}`,
|
|
137
|
+
],
|
|
138
|
+
},
|
|
139
|
+
];
|
|
140
|
+
|
|
141
|
+
// ── Run ──────────────────────────────────────────────────────────────
|
|
142
|
+
|
|
143
|
+
const llmAdapter = await createAdapter();
|
|
144
|
+
const startedAt = Date.now();
|
|
145
|
+
const results = [];
|
|
146
|
+
const autoFires = { total: 0, byReason: {} };
|
|
147
|
+
|
|
148
|
+
for (const seed of SEEDS) {
|
|
149
|
+
// Materialize the seed plan (no LLM call; pure compose-from-plan).
|
|
150
|
+
const composed = composeFromPlan(seed.plan);
|
|
151
|
+
if (!composed.html) {
|
|
152
|
+
console.log(`✗ seed [${seed.label}] failed to materialize — skipping refinements`);
|
|
153
|
+
for (const intent of seed.refinements) {
|
|
154
|
+
results.push({
|
|
155
|
+
seed: seed.label, intent, ms: 0, ops_count: 0, attempts: 0,
|
|
156
|
+
targeted: null, ops_applied: 0, ops_failed: 0,
|
|
157
|
+
auto_fires: [], error: 'seed-materialize-failed',
|
|
158
|
+
});
|
|
159
|
+
}
|
|
160
|
+
continue;
|
|
161
|
+
}
|
|
162
|
+
|
|
163
|
+
const priorState = {
|
|
164
|
+
state_id: mintStateId(seed.label, 1),
|
|
165
|
+
intent: `[seed] ${seed.label}`,
|
|
166
|
+
plan: seed.plan,
|
|
167
|
+
html: composed.html,
|
|
168
|
+
version: 1,
|
|
169
|
+
};
|
|
170
|
+
|
|
171
|
+
console.log(`── seed [${seed.label}] · ${seed.plan.slot_bindings[slotB].length} block(s) in ${slotB}`);
|
|
172
|
+
|
|
173
|
+
for (const intent of seed.refinements) {
|
|
174
|
+
const acc = createIssueAccumulator();
|
|
175
|
+
const t0 = Date.now();
|
|
176
|
+
const row = {
|
|
177
|
+
seed: seed.label, intent, ms: 0, ops_count: 0, attempts: 0,
|
|
178
|
+
targeted: null, ops_applied: 0, ops_failed: 0,
|
|
179
|
+
auto_fires: [], error: null,
|
|
180
|
+
};
|
|
181
|
+
|
|
182
|
+
try {
|
|
183
|
+
const refined = await refineFromIntent({
|
|
184
|
+
priorState,
|
|
185
|
+
intent,
|
|
186
|
+
llmAdapter,
|
|
187
|
+
maxAttempts: 2,
|
|
188
|
+
issueAccumulator: acc,
|
|
189
|
+
});
|
|
190
|
+
row.ms = Date.now() - t0;
|
|
191
|
+
row.ops_count = refined.ops.length;
|
|
192
|
+
row.attempts = refined.synthesis?.attempts ?? 0;
|
|
193
|
+
row.targeted = refined.synthesis?.targeted ?? null;
|
|
194
|
+
|
|
195
|
+
if (refined.ops.length > 0) {
|
|
196
|
+
const applied = await applyOps({ priorState, ops: refined.ops });
|
|
197
|
+
row.ops_applied = applied.ops_applied.length;
|
|
198
|
+
row.ops_failed = applied.ops_failed.length;
|
|
199
|
+
}
|
|
200
|
+
} catch (e) {
|
|
201
|
+
row.ms = Date.now() - t0;
|
|
202
|
+
row.error = e.message;
|
|
203
|
+
}
|
|
204
|
+
|
|
205
|
+
row.auto_fires = acc.reasons();
|
|
206
|
+
autoFires.total += row.auto_fires.length;
|
|
207
|
+
for (const r of row.auto_fires) {
|
|
208
|
+
autoFires.byReason[r] = (autoFires.byReason[r] || 0) + 1;
|
|
209
|
+
}
|
|
210
|
+
|
|
211
|
+
results.push(row);
|
|
212
|
+
|
|
213
|
+
const flag = row.ops_count > 0 && row.ops_failed === 0 ? '✓' : (row.ops_count > 0 ? '~' : '✗');
|
|
214
|
+
const tgtTag = row.targeted === true ? 'tgt' : row.targeted === false ? 'unt' : '???';
|
|
215
|
+
const padMs = row.ms.toString().padStart(5);
|
|
216
|
+
console.log(` ${flag} [${tgtTag}] ${padMs}ms ops=${row.ops_count} att=${row.attempts} ${intent}`);
|
|
217
|
+
if (row.error) console.log(` error: ${row.error}`);
|
|
218
|
+
if (row.ops_failed > 0) console.log(` ops_failed: ${row.ops_failed}`);
|
|
219
|
+
if (row.auto_fires.length) console.log(` auto-fires: ${row.auto_fires.join(', ')}`);
|
|
220
|
+
}
|
|
221
|
+
}
|
|
222
|
+
|
|
223
|
+
// ── Summary ──────────────────────────────────────────────────────────
|
|
224
|
+
|
|
225
|
+
const total = results.length;
|
|
226
|
+
const producedOps = results.filter((r) => r.ops_count > 0).length;
|
|
227
|
+
const totalOpsReturned = results.reduce((s, r) => s + r.ops_count, 0);
|
|
228
|
+
const totalOpsApplied = results.reduce((s, r) => s + r.ops_applied, 0);
|
|
229
|
+
const totalOpsFailed = results.reduce((s, r) => s + r.ops_failed, 0);
|
|
230
|
+
|
|
231
|
+
const opsRate = total ? producedOps / total : 0;
|
|
232
|
+
const validateRate = totalOpsReturned ? totalOpsApplied / totalOpsReturned : 0;
|
|
233
|
+
|
|
234
|
+
console.log(`\n── Summary ──`);
|
|
235
|
+
console.log(` Refinements: ${total}`);
|
|
236
|
+
console.log(` Produced ops: ${producedOps}/${total} (${(opsRate * 100).toFixed(0)}%)`);
|
|
237
|
+
console.log(` Ops returned: ${totalOpsReturned}; applied: ${totalOpsApplied}; failed: ${totalOpsFailed} (validate ${(validateRate * 100).toFixed(0)}%)`);
|
|
238
|
+
console.log(` Auto-fires: ${autoFires.total}`);
|
|
239
|
+
if (autoFires.total > 0) {
|
|
240
|
+
for (const [reason, n] of Object.entries(autoFires.byReason)) {
|
|
241
|
+
console.log(` ${reason}: ${n}`);
|
|
242
|
+
}
|
|
243
|
+
}
|
|
244
|
+
const targeted = results.filter((r) => r.targeted === true).length;
|
|
245
|
+
const untargeted = results.filter((r) => r.targeted === false).length;
|
|
246
|
+
console.log(` Targeted vs untargeted: ${targeted} / ${untargeted}`);
|
|
247
|
+
console.log(` Total time: ${((Date.now() - startedAt) / 1000).toFixed(1)}s`);
|
|
248
|
+
|
|
249
|
+
const opsThreshold = 0.8;
|
|
250
|
+
const validateThreshold = 0.9;
|
|
251
|
+
const autoFireCeiling = 5;
|
|
252
|
+
|
|
253
|
+
const opsPass = opsRate >= opsThreshold;
|
|
254
|
+
const validatePass = totalOpsReturned === 0 || validateRate >= validateThreshold;
|
|
255
|
+
const autoFirePass = autoFires.total <= autoFireCeiling;
|
|
256
|
+
|
|
257
|
+
const allPass = opsPass && validatePass && autoFirePass;
|
|
258
|
+
|
|
259
|
+
console.log('');
|
|
260
|
+
console.log(` ops rate ≥ ${opsThreshold * 100}%: ${opsPass ? '✓' : '✗'} (${(opsRate * 100).toFixed(0)}%)`);
|
|
261
|
+
console.log(` validate ≥ ${validateThreshold * 100}%: ${validatePass ? '✓' : '✗'} (${(validateRate * 100).toFixed(0)}%)`);
|
|
262
|
+
console.log(` auto-fires ≤ ${autoFireCeiling}: ${autoFirePass ? '✓' : '✗'} (${autoFires.total})`);
|
|
263
|
+
|
|
264
|
+
if (allPass) {
|
|
265
|
+
console.log(`\n✓ PASS`);
|
|
266
|
+
process.exit(0);
|
|
267
|
+
} else {
|
|
268
|
+
console.log(`\n✗ FAIL`);
|
|
269
|
+
process.exit(1);
|
|
270
|
+
}
|
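The three pass criteria at the end of `eval-refine-synthesis.mjs` can also be expressed as a pure function, which makes the thresholds easy to exercise without running the LLM. This is a sketch under assumptions: `evaluateRun` is a hypothetical name, and the rows use the same fields the script records (`ops_count`, `ops_applied`, `auto_fires`).

```javascript
// Hypothetical refactor of the script's summary logic into a pure function.
function evaluateRun(rows, { opsThreshold = 0.8, validateThreshold = 0.9, autoFireCeiling = 5 } = {}) {
  const total = rows.length;
  const producedOps = rows.filter((r) => r.ops_count > 0).length;
  const opsReturned = rows.reduce((sum, r) => sum + r.ops_count, 0);
  const opsApplied = rows.reduce((sum, r) => sum + r.ops_applied, 0);
  const autoFires = rows.reduce((sum, r) => sum + r.auto_fires.length, 0);

  const opsRate = total ? producedOps / total : 0;
  const validateRate = opsReturned ? opsApplied / opsReturned : 0;

  const opsPass = opsRate >= opsThreshold;
  // As in the script, validation passes vacuously when no ops were returned;
  // the ops-rate criterion is what catches that case.
  const validatePass = opsReturned === 0 || validateRate >= validateThreshold;
  const autoFirePass = autoFires <= autoFireCeiling;

  return { opsRate, validateRate, autoFires, opsPass, validatePass, autoFirePass, allPass: opsPass && validatePass && autoFirePass };
}

// 15 clean refinements, matching the shape of the run reported in the changelog.
const cleanRows = Array.from({ length: 15 }, () => ({ ops_count: 1, ops_applied: 1, auto_fires: [] }));
const verdict = evaluateRun(cleanRows);
console.log(verdict.allPass);
```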