@adia-ai/a2ui-mcp 0.1.1 → 0.1.3
- package/CHANGELOG.md +113 -0
- package/evals/compose-from-chunks-holdout.jsonl +20 -0
- package/package.json +2 -1
- package/scripts/eval-compose-from-chunks.mjs +264 -0
- package/scripts/eval-diff.mjs +16 -10
- package/scripts/generate.mjs +2 -2
package/CHANGELOG.md
CHANGED
@@ -9,8 +9,121 @@ zettel strategies.
 
 ## [Unreleased]
 
+_No pending changes._
+
+## [0.1.3] - 2026-05-02
+
+Additive — `compose_from_chunks` eval runner + hold-out set. No
+BREAKING changes.
+
+### Added
+
+- **`compose_from_chunks` eval runner + hold-out set**. Closes the
+  spec at
+  [`docs/specs/compose-from-chunks-eval.md`](../../../docs/specs/compose-from-chunks-eval.md):
+  - `evals/compose-from-chunks-holdout.jsonl` — 20 intents (10
+    single-turn compose + 10 multi-turn refine) across 10
+    categories (data-display, forms, layout, data, data-viz,
+    agent, overlay, chat, settings, display).
+  - `scripts/eval-compose-from-chunks.mjs` — runner with `--stub`
+    (default; retrieval-only, no API calls) and `--real-llm`
+    (wires Anthropic SDK) modes. Composite scoring (structural
+    30 + coverage 20 + retrieval 20 + render 30, with null-
+    pro-rata redistribution when render is deferred). `--json`
+    + `--report-file` outputs. Threshold 80; exits 1 if avg < 80.
+  - `npm run eval:compose-from-chunks` exposes the runner.
+  - Stub baseline: 9 of 20 intents retrievable directly (avg
+    ~50-70 each); 11 need synthesis (correctly warn without
+    `--real-llm`). Aggregate ~26 / 100 stub. Real-LLM mode is
+    the gating signal for the `chunk-zettel` engine promotion.
+
+  Render-fidelity component is DEFERRED — needs Playwright
+  headless render + console-error capture wiring; current
+  composite drops the render weight pro-rata across
+  structural/coverage/retrieval until that lands.
+
+### Changed
+
+- `package.json` `files:` array now includes `evals/` so the hold-out
+  set ships in the published tarball.
+
 ---
 
+## [0.1.2] - 2026-05-01
+
+Phase 2 of [`docs/specs/semantic-validator.md`](../../../docs/specs/semantic-validator.md)
+**promoted to default**. When `--semantic` is set, `eval-diff.mjs`
+now gates on the combined score at threshold 80 by default, instead
+of running in Phase 1 shadow-mode. **Behavior change** — see § Opt
+back into Phase 1 below.
+
+### Changed (default behavior when `--semantic` is set)
+
+- **`--gate-mode` default flipped from `structural` to `combined`
+  when `--semantic` is set.** `npm run eval:diff -- --engine zettel
+  --semantic` now produces a combined-gated `passRate` (gate on
+  `0.6 × validationScore + 0.4 × semanticScore`) instead of the
+  structural-only Phase 1 shadow output. The structural pass survives
+  as `row.passStructural` + `runObj.passRateStructural` for diagnostic
+  comparison.
+- **`--gate-threshold` default raised from 70 to 80.** Chosen via
+  the 2026-05-01 sweep — 70 produced zero pass-flips on the current
+  zettel engine output (cosmetic flip with no signal); 80 catches the
+  4 partial-verdict cases the spec was designed to surface; 85
+  over-aggressively rejects legit aligned items. See spec § Phase 2
+  status (2026-05-01) for the sweep table.
+- **Without `--semantic`, the default stays `structural`.** Semantic
+  work remains opt-in; the no-flag `npm run eval:diff` shape is
+  unchanged.
+
+### Opt back into Phase 1 shadow-mode
+
+- `--gate-mode structural` reverts the default flip — `row.pass`
+  gates on `validationScore` alone, semantic verdicts annotation-
+  only.
+- `--gate-threshold 70` reverts the threshold change.
+- Combine both for full v0.1.1 default behavior:
+  ```bash
+  node packages/a2ui/mcp/scripts/eval-diff.mjs --engine zettel \
+    --semantic --gate-mode structural --gate-threshold 70
+  ```
+
+### Why the promotion is safe
+
+Per the Phase 2 exit criterion ("no unexplained regressions in first
+two full `eval-diff` runs"), zero pass-flips were observed at
+threshold=70 (the v0.1.1 default). At threshold=80, the four
+pass→fail flips are all `partial`-verdict items that fit the spec's
+motivating failure mode (structural=86–88 + semantic=54–69) — the
+intents the structural-only validator scored highly but the LLM
+judge correctly flagged as "wrong UI for this intent":
+
+- `user profile card with avatar` — structural 88, semantic 54
+- `calendar month view` — structural 86, semantic 65
+- `social media post card` — structural 86, semantic 69
+- `user onboarding checklist` — structural 87, semantic 67
+
+Pass rate drops from 83 → 79 (4 percentage points), all justified.
+
+### Updated baseline thresholds
+
+Per [`docs/specs/semantic-validator.md`](../../../docs/specs/semantic-validator.md)
+§ Phase 2 status (2026-05-01):
+
+- `avgScore ≥ 88` (structural — unchanged)
+- `avgSemanticScore ≥ 85` (new)
+- `avgCombined ≥ 87` (new)
+- `passRate ≥ 78%` at combined-gate threshold=80 (new; was 83%
+  structural-only)
+- `verdict aligned ≥ 84%` of judged items
+
+### Implementation references
+
+- [`scripts/eval-diff.mjs`](scripts/eval-diff.mjs) — default flip
+- [`scripts/semantic-stats.mjs`](scripts/semantic-stats.mjs) —
+  unchanged from v0.1.1; remains the validation tool for future
+  threshold revalidation passes
+
 ## [0.1.1] - 2026-05-01
 
 Phase 2 of [`docs/specs/semantic-validator.md`](../../../docs/specs/semantic-validator.md)
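The combined gate described in the 0.1.2 entry can be checked by hand against the four listed pass→fail items. The sketch below is illustrative only: `combinedScore` and `passes` are hypothetical helper names, not functions taken from `eval-diff.mjs`. The 0.6/0.4 weights, the thresholds 70 and 80, and the four score pairs are the ones quoted in the changelog text.

```javascript
// Illustrative sketch; helper names are assumptions, not the package's API.
// Weights and thresholds are the ones quoted in the 0.1.2 changelog entry.
function combinedScore(validationScore, semanticScore) {
  return 0.6 * validationScore + 0.4 * semanticScore;
}

function passes(validationScore, semanticScore, threshold) {
  return combinedScore(validationScore, semanticScore) >= threshold;
}

// The four items the changelog lists as pass→fail flips at threshold 80.
const flipped = [
  { intent: 'user profile card with avatar', structural: 88, semantic: 54 },
  { intent: 'calendar month view', structural: 86, semantic: 65 },
  { intent: 'social media post card', structural: 86, semantic: 69 },
  { intent: 'user onboarding checklist', structural: 87, semantic: 67 },
];

for (const item of flipped) {
  const combined = combinedScore(item.structural, item.semantic);
  console.log(
    `${item.intent}: combined ${combined.toFixed(1)}, ` +
      `passes@70=${passes(item.structural, item.semantic, 70)}, ` +
      `passes@80=${passes(item.structural, item.semantic, 80)}`
  );
}
```

Each of the four combined scores falls between 74 and 80, so all four pass at threshold 70 and fail at threshold 80, which is consistent with the sweep result the entry describes.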
package/evals/compose-from-chunks-holdout.jsonl
@@ -0,0 +1,20 @@
+{"id":"intent-001","kind":"compose","category":"data-display","intent":"kpi grid with 4 stat cards: users, revenue, sessions, churn","expected_components":["Card","Stat","Grid"],"expected_chunk":"kpi-grid-4-card"}
+{"id":"intent-002","kind":"compose","category":"forms","intent":"sign-in form with email + password + 'forgot password' link","expected_components":["Card","Input","Button","Field"],"expected_chunk":"auth-sign-in"}
+{"id":"intent-003","kind":"compose","category":"layout","intent":"settings page with three tabs (general, integrations, billing)","expected_components":["Tabs","Tab","Card","Section"],"expected_chunk":"settings-tabs-3"}
+{"id":"intent-004","kind":"compose","category":"data","intent":"data table of users with role badge + last-active timestamp","expected_components":["Table","Badge"],"expected_chunk":"users-table"}
+{"id":"intent-005","kind":"compose","category":"data-viz","intent":"conversion funnel chart over 6 stages, with drop-off labels","expected_components":["Chart","Card","ChartLegend"],"expected_chunk":"conversion-funnel"}
+{"id":"intent-006","kind":"compose","category":"agent","intent":"agent activity feed with reasoning steps + final artifact","expected_components":["AgentTrace","AgentReasoning","AgentArtifact"],"expected_chunk":"agent-activity-feed"}
+{"id":"intent-007","kind":"compose","category":"layout","intent":"split-pane editor: code on the left, preview on the right","expected_components":["EditorShell","Pane","Code"],"expected_chunk":"editor-split"}
+{"id":"intent-008","kind":"compose","category":"overlay","intent":"command palette modal with grouped results (recent, suggestions)","expected_components":["Command","Modal"],"expected_chunk":"command-grouped"}
+{"id":"intent-009","kind":"compose","category":"forms","intent":"registration step 2 of 5 — profile setup with 4 fields","expected_components":["Card","StepProgress","Field","Input"],"expected_chunk":"reg-step-shell"}
+{"id":"intent-010","kind":"compose","category":"layout","intent":"404 error page with breadcrumb + back-to-home link","expected_components":["Card","Breadcrumb","Button"],"expected_chunk":"error-404"}
+{"id":"intent-011","kind":"refine","category":"data-display","intent":"dashboard for project metrics","refine":"add a date-range filter at the top","expected_components":["Card","Stat","Select"],"expected_chunk":"project-dashboard"}
+{"id":"intent-012","kind":"refine","category":"display","intent":"user profile card","refine":"make the email editable inline","expected_components":["Card","Avatar","Input"],"expected_chunk":"user-profile-card"}
+{"id":"intent-013","kind":"refine","category":"data","intent":"kanban board with 3 columns","refine":"add a count badge to each column header","expected_components":["Card","Badge","Header"],"expected_chunk":"kanban-3col"}
+{"id":"intent-014","kind":"refine","category":"chat","intent":"chat surface with streaming reply","refine":"add a stop button while streaming","expected_components":["ChatShell","Button","ChatInput"],"expected_chunk":"chat-streaming"}
+{"id":"intent-015","kind":"refine","category":"forms","intent":"sign-up form with email + password","refine":"add password strength meter","expected_components":["Card","Input","Progress"],"expected_chunk":"auth-sign-up"}
+{"id":"intent-016","kind":"refine","category":"settings","intent":"settings tab for notifications","refine":"split email + push into separate sections","expected_components":["Card","Section","Switch"],"expected_chunk":"settings-notifications"}
+{"id":"intent-017","kind":"refine","category":"data","intent":"table of orders","refine":"add a bulk-action toolbar above the table","expected_components":["Table","TableToolbar","Button"],"expected_chunk":"orders-table"}
+{"id":"intent-018","kind":"refine","category":"agent","intent":"agent reasoning panel","refine":"collapse intermediate steps by default, expandable","expected_components":["AgentReasoning","Accordion"],"expected_chunk":"agent-reasoning-collapsed"}
+{"id":"intent-019","kind":"refine","category":"overlay","intent":"modal confirming destructive action","refine":"require typing the resource name to confirm","expected_components":["Modal","Input","Button"],"expected_chunk":"destructive-confirm"}
+{"id":"intent-020","kind":"refine","category":"display","intent":"marketing landing hero","refine":"add a secondary 'see demo' CTA","expected_components":["Card","Heading","Button"],"expected_chunk":"marketing-hero"}
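One way to sanity-check a hold-out file of this shape is to parse it line by line and tally `kind` and `category`. The sketch below is an illustration under assumptions: `summarizeHoldout` is a hypothetical helper that is not part of the package, the list of required keys is inferred from the records above, and the two sample records are copied from the set rather than read from disk.

```javascript
// Hypothetical helper (not part of the package): parse JSONL text,
// check the keys every record in the hold-out set appears to carry,
// and tally kinds and categories.
function summarizeHoldout(jsonlText) {
  const records = jsonlText
    .split('\n')
    .map((line) => line.trim())
    .filter(Boolean)
    .map((line) => JSON.parse(line));

  const required = ['id', 'kind', 'category', 'intent', 'expected_components', 'expected_chunk'];
  const kinds = {};
  const categories = new Set();

  for (const rec of records) {
    for (const key of required) {
      if (!(key in rec)) throw new Error(`${rec.id ?? '(no id)'} is missing "${key}"`);
    }
    // Refine records carry an extra follow-up instruction.
    if (rec.kind === 'refine' && typeof rec.refine !== 'string') {
      throw new Error(`${rec.id} is a refine intent without a "refine" string`);
    }
    kinds[rec.kind] = (kinds[rec.kind] || 0) + 1;
    categories.add(rec.category);
  }

  return { total: records.length, kinds, categories: [...categories].sort() };
}

// Two records copied from the hold-out set above.
const sample = [
  '{"id":"intent-004","kind":"compose","category":"data","intent":"data table of users with role badge + last-active timestamp","expected_components":["Table","Badge"],"expected_chunk":"users-table"}',
  '{"id":"intent-012","kind":"refine","category":"display","intent":"user profile card","refine":"make the email editable inline","expected_components":["Card","Avatar","Input"],"expected_chunk":"user-profile-card"}',
].join('\n');

console.log(summarizeHoldout(sample));
```

Run over the full 20-line file, a tally like this should report 10 `compose` and 10 `refine` records spread over 10 distinct categories.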
package/package.json
CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "@adia-ai/a2ui-mcp",
-  "version": "0.1.
+  "version": "0.1.3",
   "description": "AdiaUI A2UI MCP server. Exposes the compose engine over MCP with an engine selector for monolithic + zettel strategies.",
   "type": "module",
   "bin": {
@@ -10,6 +10,7 @@
     "server.js",
     "tools/",
     "scripts/",
+    "evals/",
     "personas/",
     "README.md",
     "CHANGELOG.md"
package/scripts/eval-compose-from-chunks.mjs
@@ -0,0 +1,264 @@
+#!/usr/bin/env node
+/**
+ * eval-compose-from-chunks.mjs — Hold-out eval for the chunk-aware
+ * synthesizer. Per `docs/specs/compose-from-chunks-eval.md`.
+ *
+ * Reads `packages/a2ui/mcp/evals/compose-from-chunks-holdout.jsonl`,
+ * runs each intent through `composeFromIntent`, and emits a per-intent
+ * + aggregate report.
+ *
+ * Two modes:
+ *   --stub (default) — retrieval-only path, no LLM calls. Fast;
+ *       produces a baseline that exercises the chunk-corpus +
+ *       retrieval scoring without spending API budget. Use to
+ *       verify the runner shape + the corpus surface.
+ *   --real-llm — wires the LLM bridge for synthesis-tier composition.
+ *       Costs ~$2 per full eval at Claude Sonnet 4.6 pricing.
+ *       Requires `ANTHROPIC_API_KEY` in env.
+ *
+ * Scoring (composite, 0-100):
+ *   - Structural (30%) — composition emitted (html non-null OR plan
+ *     non-null).
+ *   - Coverage (20%) — share of the intent's
+ *     `expected_components` that appear in the emitted markup.
+ *   - Retrieval (20%) — top-k retrieved chunks include the intent's
+ *     `expected_chunk` (when set). Soft-asserted; absence flags but
+ *     doesn't fail.
+ *   - Render fidelity (30%) — DEFERRED. Real implementation needs
+ *     Playwright headless render + console-error capture; this runner
+ *     emits a placeholder pending the render-fidelity smoke wiring.
+ *     The composite re-distributes its weight pro-rata across the
+ *     three remaining components when render is null.
+ *
+ * Exit:
+ *   0 if avg ≥ 80 (passes the chunk-zettel promotion gate threshold)
+ *   1 if avg < 80
+ *
+ * Usage:
+ *   npm run eval:compose-from-chunks                   # stub mode
+ *   npm run eval:compose-from-chunks -- --real-llm     # real-LLM
+ *   npm run eval:compose-from-chunks -- --limit 5      # first 5 intents
+ *   npm run eval:compose-from-chunks -- --json         # JSON report
+ *   npm run eval:compose-from-chunks -- --report-file  # write to docs/reports/
+ */
+
+import '../../../../scripts/load-env.mjs';
+
+import fs from 'node:fs';
+import path from 'node:path';
+import { fileURLToPath } from 'node:url';
+
+import { composeFromIntent } from '../../compose/engines/zettel/chunk-synthesizer.js';
+import { searchChunksAsync } from '../../corpus/scripts/chunk-library.js';
+
+const REPO_ROOT = path.resolve(path.dirname(fileURLToPath(import.meta.url)), '../../../..');
+const HOLDOUT = path.join(REPO_ROOT, 'packages/a2ui/mcp/evals/compose-from-chunks-holdout.jsonl');
+const PASS_THRESHOLD = 80;
+
+const args = process.argv.slice(2);
+const FLAG_REAL_LLM = args.includes('--real-llm');
+const FLAG_JSON = args.includes('--json');
+const FLAG_REPORT = args.includes('--report-file');
+const limitIdx = args.indexOf('--limit');
+const LIMIT = limitIdx >= 0 ? parseInt(args[limitIdx + 1], 10) : null;
+
+// ─────────────────────────────────────────────────────────────────
+// Hold-out loader
+// ─────────────────────────────────────────────────────────────────
+
+function loadHoldOut() {
+  const raw = fs.readFileSync(HOLDOUT, 'utf8');
+  const lines = raw.split('\n').map((l) => l.trim()).filter(Boolean);
+  const intents = lines.map((l) => JSON.parse(l));
+  return LIMIT ? intents.slice(0, LIMIT) : intents;
+}
+
+// ─────────────────────────────────────────────────────────────────
+// LLM adapter
+// ─────────────────────────────────────────────────────────────────
+
+async function buildLLMAdapter() {
+  if (!FLAG_REAL_LLM) return null;
+  const apiKey = process.env.ANTHROPIC_API_KEY;
+  if (!apiKey) {
+    console.error('--real-llm requires ANTHROPIC_API_KEY in env.');
+    process.exit(2);
+  }
+  const { default: Anthropic } = await import('@anthropic-ai/sdk');
+  const client = new Anthropic({ apiKey });
+  return {
+    async complete({ system, user, model = 'claude-sonnet-4-6', maxTokens = 2048 }) {
+      const resp = await client.messages.create({
+        model,
+        max_tokens: maxTokens,
+        system,
+        messages: [{ role: 'user', content: user }],
+      });
+      const text = resp.content
+        .filter((b) => b.type === 'text')
+        .map((b) => b.text)
+        .join('');
+      return { text };
+    },
+  };
+}
+
+// ─────────────────────────────────────────────────────────────────
+// Scoring
+// ─────────────────────────────────────────────────────────────────
+
+function scoreStructural(result) {
+  if (result.html || result.plan) return 100;
+  return 0;
+}
+
+function scoreCoverage(result, expected) {
+  if (!expected || expected.length === 0) return 100;
+  const html = String(result.html || '');
+  let hits = 0;
+  for (const tag of expected) {
+    const lowered = tag.toLowerCase().replace(/^ui/, '');
+    const re = new RegExp(`<${lowered}-ui[\\s>]`, 'i');
+    if (re.test(html)) hits++;
+  }
+  return Math.round((hits / expected.length) * 100);
+}
+
+async function scoreRetrieval(intent, expectedChunk) {
+  if (!expectedChunk) return null;
+  const hits = await searchChunksAsync(intent, { limit: 5 });
+  const found = hits.some((h) => h.name === expectedChunk);
+  return found ? 100 : 0;
+}
+
+function scoreRenderFidelity(_result) {
+  // DEFERRED — Playwright headless render + console-error capture.
+  // See spec § Out-of-band for the follow-up.
+  return null;
+}
+
+function compositeScore({ structural, coverage, retrieval, render }) {
+  const components = [
+    { name: 'structural', value: structural, weight: 30 },
+    { name: 'coverage', value: coverage, weight: 20 },
+    { name: 'retrieval', value: retrieval, weight: 20 },
+    { name: 'render', value: render, weight: 30 },
+  ].filter((c) => c.value !== null);
+
+  const totalWeight = components.reduce((s, c) => s + c.weight, 0);
+  const weighted = components.reduce((s, c) => s + c.value * c.weight, 0);
+  return totalWeight > 0 ? Math.round(weighted / totalWeight) : 0;
+}
+
+// ─────────────────────────────────────────────────────────────────
+// Eval loop
+// ─────────────────────────────────────────────────────────────────
+
+async function evalIntent(intent, llmAdapter) {
+  const t0 = performance.now();
+  const result = await composeFromIntent({
+    intent: intent.intent,
+    llmAdapter,
+    maxAttempts: 2,
+  });
+  const elapsedMs = Math.round(performance.now() - t0);
+
+  const structural = scoreStructural(result);
+  const coverage = scoreCoverage(result, intent.expected_components);
+  const retrieval = await scoreRetrieval(intent.intent, intent.expected_chunk);
+  const render = scoreRenderFidelity(result);
+
+  const score = compositeScore({ structural, coverage, retrieval, render });
+  return {
+    id: intent.id,
+    kind: intent.kind,
+    intent: intent.intent,
+    source: result.source,
+    elapsedMs,
+    structural,
+    coverage,
+    retrieval,
+    render,
+    score,
+    warnings: result.warnings ?? [],
+  };
+}
+
+// ─────────────────────────────────────────────────────────────────
+// Reporter
+// ─────────────────────────────────────────────────────────────────
+
+function reportText(results, mode) {
+  const avg = Math.round(results.reduce((s, r) => s + r.score, 0) / results.length);
+  const passing = results.filter((r) => r.score >= PASS_THRESHOLD).length;
+  const out = [];
+  out.push(`# compose_from_chunks eval — ${mode === 'real-llm' ? 'real LLM' : 'stub (retrieval only)'}`);
+  out.push('');
+  out.push(`Aggregate: avg **${avg}**, passing **${passing} / ${results.length}** (threshold ${PASS_THRESHOLD}).`);
+  out.push('');
+  out.push('| ID | Kind | Source | Struct | Cov | Retr | Render | Score | ms |');
+  out.push('|---|---|---|---:|---:|---:|---:|---:|---:|');
+  for (const r of results) {
+    out.push(`| ${r.id} | ${r.kind} | ${r.source ?? '—'} | ${r.structural} | ${r.coverage} | ${r.retrieval ?? '—'} | ${r.render ?? '—'} | **${r.score}** | ${r.elapsedMs} |`);
+  }
+  if (results.some((r) => r.warnings.length > 0)) {
+    out.push('');
+    out.push('## Warnings');
+    for (const r of results) {
+      if (r.warnings.length === 0) continue;
+      out.push(`- **${r.id}**: ${r.warnings.join('; ')}`);
+    }
+  }
+  return out.join('\n');
+}
+
+function reportJSON(results, mode) {
+  const avg = Math.round(results.reduce((s, r) => s + r.score, 0) / results.length);
+  const passing = results.filter((r) => r.score >= PASS_THRESHOLD).length;
+  return JSON.stringify({
+    mode,
+    threshold: PASS_THRESHOLD,
+    avg,
+    passing,
+    total: results.length,
+    results,
+  }, null, 2);
+}
+
+// ─────────────────────────────────────────────────────────────────
+// Main
+// ─────────────────────────────────────────────────────────────────
+
+async function main() {
+  const intents = loadHoldOut();
+  const llmAdapter = await buildLLMAdapter();
+  const mode = FLAG_REAL_LLM ? 'real-llm' : 'stub';
+
+  const results = [];
+  for (const intent of intents) {
+    process.stderr.write(`▶ ${intent.id} ${intent.kind.padEnd(7)} ${intent.intent.slice(0, 50)}...\n`);
+    const result = await evalIntent(intent, llmAdapter);
+    results.push(result);
+  }
+
+  const output = FLAG_JSON ? reportJSON(results, mode) : reportText(results, mode);
+
+  if (FLAG_REPORT) {
+    const date = new Date().toISOString().slice(0, 10);
+    const dir = path.join(REPO_ROOT, 'docs/reports');
+    if (!fs.existsSync(dir)) fs.mkdirSync(dir, { recursive: true });
+    const file = path.join(dir, `eval-compose-from-chunks-${date}.md`);
+    fs.writeFileSync(file, output + '\n');
+    console.error(`\nReport written to ${path.relative(REPO_ROOT, file)}`);
+  } else {
+    console.log(output);
+  }
+
+  const avg = results.reduce((s, r) => s + r.score, 0) / results.length;
+  process.exit(avg >= PASS_THRESHOLD ? 0 : 1);
+}
+
+main().catch((err) => {
+  console.error('eval-compose-from-chunks failed:', err.message);
+  process.exit(2);
+});
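To make the scoring concrete, here is a small worked example that restates the coverage and composite rules from the runner above as standalone functions. It is a sketch for illustration, not a replacement for the script: the function bodies mirror `scoreCoverage` and `compositeScore` from the diff, `coverageScore` is a renamed variant that takes the HTML string directly, and the sample markup and score values are made up.

```javascript
// Share of expected components found as `<name-ui` tags, 0-100.
// Mirrors scoreCoverage from the diff, but takes the HTML string directly.
function coverageScore(html, expected) {
  if (!expected || expected.length === 0) return 100;
  let hits = 0;
  for (const name of expected) {
    const tag = name.toLowerCase().replace(/^ui/, '');
    if (new RegExp(`<${tag}-ui[\\s>]`, 'i').test(html)) hits++;
  }
  return Math.round((hits / expected.length) * 100);
}

// Weighted mean over the components that are not null, so a null
// component's weight is spread pro-rata over the remaining ones.
function compositeScore({ structural, coverage, retrieval, render }) {
  const parts = [
    { value: structural, weight: 30 },
    { value: coverage, weight: 20 },
    { value: retrieval, weight: 20 },
    { value: render, weight: 30 },
  ].filter((p) => p.value !== null);
  const totalWeight = parts.reduce((sum, p) => sum + p.weight, 0);
  const weighted = parts.reduce((sum, p) => sum + p.value * p.weight, 0);
  return totalWeight > 0 ? Math.round(weighted / totalWeight) : 0;
}

// Made-up markup: Card and Stat are present, Grid is not.
const html = '<card-ui><stat-ui label="Users"></stat-ui></card-ui>';
const coverage = coverageScore(html, ['Card', 'Stat', 'Grid']);

// Render is null (deferred), so the weights 30/20/20 are normalised
// over 70 instead of 100: (100*30 + 67*20 + 100*20) / 70.
const score = compositeScore({ structural: 100, coverage, retrieval: 100, render: null });
console.log({ coverage, score }); // { coverage: 67, score: 91 }
```

Under the same rules, an intent that emits markup but matches no expected component or chunk scores 3000 / 70, which rounds to 43, and one that emits nothing but still retrieves its expected chunk scores 2000 / 70, which rounds to 29. Both sit well below the threshold of 80, which is consistent with the low stub-mode aggregate reported in the changelog.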
package/scripts/eval-diff.mjs
CHANGED
@@ -18,9 +18,9 @@
  *   node packages/a2ui/mcp/scripts/eval-diff.mjs --engine zettel # fragment-graph only
  *   node packages/a2ui/mcp/scripts/eval-diff.mjs --limit 20
  *   node packages/a2ui/mcp/scripts/eval-diff.mjs --domain forms
- *   node packages/a2ui/mcp/scripts/eval-diff.mjs --semantic # Phase
- *   node packages/a2ui/mcp/scripts/eval-diff.mjs --semantic --gate-mode
- *   node packages/a2ui/mcp/scripts/eval-diff.mjs --semantic --gate-
+ *   node packages/a2ui/mcp/scripts/eval-diff.mjs --semantic # Phase 2 default: gates on combined score (>= 80)
+ *   node packages/a2ui/mcp/scripts/eval-diff.mjs --semantic --gate-mode structural # Phase 1 shadow-mode (opt-back-in)
+ *   node packages/a2ui/mcp/scripts/eval-diff.mjs --semantic --gate-threshold 75 # Override default threshold
  */
 import '../../../../scripts/load-env.mjs';
 
@@ -48,13 +48,19 @@ const domain = opt('domain');
 // Shadow-mode semantic validator (Phase 1). Opt-in; zero effect on gating
 // when --gate-mode=structural (default).
 const semanticEnabled = args.includes('--semantic');
-// Phase 2 (gating mode):
-//
-//
-//
-//
-
-
+// Phase 2 (gating mode). Default depends on --semantic:
+//   without --semantic → 'structural' (no semantic work; Phase 1 baseline).
+//   with --semantic → 'combined' since v0.1.2 (Phase 2 promotion); was
+//     'structural' / shadow-mode in v0.1.1.
+// Opt-back-in to shadow with --gate-mode structural.
+// Combined mode gates on (0.6 * validationScore + 0.4 * semanticScore) at
+// threshold 80 — chosen via the 2026-05-01 sweep (70 produced zero
+// pass-flips; 80 catches the 4 partial-verdict cases the spec was designed
+// to surface; 85 over-aggressively rejects aligned items).
+// Override threshold with --gate-threshold N.
+const gateModeDefault = args.includes('--semantic') ? 'combined' : 'structural';
+const gateMode = opt('gate-mode') || gateModeDefault;
+const gateThreshold = opt('gate-threshold') ? Number(opt('gate-threshold')) : 80;
 if (!['structural', 'combined'].includes(gateMode)) {
   console.error(`[eval-diff] --gate-mode must be one of: structural | combined (got: ${gateMode})`);
   process.exit(2);
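The default resolution described in the comment block can be sketched as a pure function. This is an illustration under assumptions: the script's real `opt` helper is not shown in this diff, so the `opt` below is a hypothetical stand-in that returns the token following `--name`, and `resolveGate` is likewise an invented name. It throws where the script calls `process.exit(2)`.

```javascript
// Hypothetical stand-in for the script's `opt` helper (not shown in the diff):
// returns the token after `--name`, or undefined when the flag is absent.
function opt(args, name) {
  const i = args.indexOf(`--${name}`);
  return i >= 0 ? args[i + 1] : undefined;
}

// Resolve gate mode and threshold the way the added lines describe:
// the mode default follows --semantic, the threshold default is 80.
function resolveGate(args) {
  const gateModeDefault = args.includes('--semantic') ? 'combined' : 'structural';
  const gateMode = opt(args, 'gate-mode') || gateModeDefault;
  const rawThreshold = opt(args, 'gate-threshold');
  const gateThreshold = rawThreshold ? Number(rawThreshold) : 80;
  if (!['structural', 'combined'].includes(gateMode)) {
    throw new Error(`--gate-mode must be one of: structural | combined (got: ${gateMode})`);
  }
  return { gateMode, gateThreshold };
}

console.log(resolveGate([]));
// { gateMode: 'structural', gateThreshold: 80 }
console.log(resolveGate(['--semantic']));
// { gateMode: 'combined', gateThreshold: 80 }
console.log(resolveGate(['--semantic', '--gate-mode', 'structural', '--gate-threshold', '70']));
// { gateMode: 'structural', gateThreshold: 70 }
```

In this reading the threshold default is 80 whether or not `--semantic` is passed. Whether structural mode actually consults the threshold is not visible in the hunk.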
package/scripts/generate.mjs
CHANGED
@@ -138,7 +138,7 @@ function componentsToHTML(comps) {
     Stat: 'stat-ui', Table: 'table-ui', Chart: 'chart-ui',
     List: 'list-ui', Pagination: 'pagination-ui',
     // Navigation
-    Tabs: 'tabs-ui', Tab: 'tab-ui', Nav: 'nav-
+    Tabs: 'tabs-ui', Tab: 'tab-ui', Nav: 'nav-ui',
     Breadcrumb: 'breadcrumb-ui', SegmentedControl: 'segmented-ui', Segment: 'segment-ui',
     // Overlay
     Modal: 'modal-ui', Drawer: 'drawer-ui', Popover: 'popover-ui',
@@ -170,7 +170,7 @@ function componentsToHTML(comps) {
       return `${indent}<${tag}${attrStr} nomargin>${c.textContent || ''}</${tag}>`;
     }
 
-    const tag = TAG_MAP[c.component] || c.component.toLowerCase() + '-
+    const tag = TAG_MAP[c.component] || c.component.toLowerCase() + '-ui';
     const skip = new Set(['id', 'component', 'children', 'textContent']);
     const attrs = Object.entries(c)
       .filter(([k]) => !skip.has(k))
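The added line in the second hunk resolves a component name to a custom-element tag, falling back to the lowercased name plus `-ui` when `TAG_MAP` has no entry. A minimal sketch of that lookup, assuming a trimmed-down `TAG_MAP` that holds only entries visible in the first hunk and a hypothetical `tagFor` wrapper that does not exist in `generate.mjs`:

```javascript
// Trimmed-down map: only entries visible in the hunk above.
// The real TAG_MAP in generate.mjs is larger.
const TAG_MAP = {
  Tabs: 'tabs-ui', Tab: 'tab-ui', Nav: 'nav-ui',
  Breadcrumb: 'breadcrumb-ui', SegmentedControl: 'segmented-ui', Segment: 'segment-ui',
  Modal: 'modal-ui', Drawer: 'drawer-ui', Popover: 'popover-ui',
};

// Hypothetical wrapper around the one-line lookup from the diff.
function tagFor(component) {
  return TAG_MAP[component] || component.toLowerCase() + '-ui';
}

console.log(tagFor('SegmentedControl')); // segmented-ui (explicit mapping)
console.log(tagFor('Accordion'));        // accordion-ui (fallback in this trimmed map)
console.log(tagFor('AgentTrace'));       // agenttrace-ui (fallback, no hyphen inserted)
```

The fallback lowercases the whole name without splitting on capitals, so a multi-word component only gets a hyphenated or shortened tag when it has an explicit `TAG_MAP` entry.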