@adia-ai/a2ui-mcp 0.0.1
- package/CHANGELOG.md +65 -0
- package/README.md +154 -0
- package/package.json +35 -0
- package/scripts/dogfood-test.mjs +107 -0
- package/scripts/eval-diff.mjs +282 -0
- package/scripts/eval-fix.mjs +446 -0
- package/scripts/generate.mjs +189 -0
- package/scripts/multi-turn-test.mjs +247 -0
- package/scripts/smoke-engine-registry.mjs +43 -0
- package/scripts/smoke-merged.mjs +50 -0
- package/scripts/smoke-register-engine.mjs +51 -0
- package/scripts/smoke-searchable-select.mjs +39 -0
- package/scripts/smoke-synthesis.mjs +59 -0
- package/scripts/smoke-zettel.mjs +37 -0
- package/scripts/test-a2ui.mjs +269 -0
- package/scripts/test-evals.mjs +238 -0
- package/scripts/visual-validate.mjs +158 -0
- package/server.js +573 -0
package/CHANGELOG.md
ADDED
@@ -0,0 +1,65 @@
+# Changelog — @adia-ai/a2ui-mcp
+
+Follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
+
+Scope: MCP transport wrapping `@adia-ai/a2ui-compose`. Exposes
+generation tools over JSON-RPC for Claude Code, Claude Desktop,
+and other MCP hosts. Engine selector supports both monolithic and
+zettel strategies.
+
+## [Unreleased]
+
+### Changed
+
+- Registry / transpilation scripts at `packages/web-components/scripts/a2ui-to-html.cjs` and `packages/web-components/scripts/mcp-pipeline.cjs` now mirror the canonical `packages/web-components/a2ui/registry.js` for all A2UI component → custom-element mappings. Previously these scripts had several stale mappings that diverged from the runtime registry.
+
+---
+
+## [0.0.1] - 2026-04-24
+
+First public release. MCP server that wraps the compose engine and
+exposes A2UI generation tools over JSON-RPC.
+
+### Included
+
+- **Server** (`server.js`) — MCP stdio server. Registers tools,
+  handles JSON-RPC, routes requests to the compose engine selector.
+- **Tools** (`tools/`) — MCP tool definitions for
+  `generate_ui`, `validate_schema`, `check_anti_patterns`,
+  `search_patterns`, `lookup_component`, `get_composition`,
+  `get_fragment`, `get_graph`, `get_traits`, `get_wiring_catalog`,
+  `zettel_stats`, `run_eval`, `submit_feedback`, plus several more.
+  See `tools/` for the full surface.
+- **Personas** (`personas/`) — per-MCP-host prompts + behavior
+  tweaks.
+- **Scripts** (`scripts/`) — `eval-diff.mjs` (cross-engine regression
+  runner), `generate.mjs` (CLI wrapper), `test-a2ui.mjs` (integration
+  suite), smoke tests, and dogfood/multi-turn harnesses.
+- **Binary** — `adiaui-mcp` CLI registered via `package.json` `bin`
+  so `npx @adia-ai/a2ui-mcp` runs the stdio server directly.
+
+### Dependencies
+
+- `@modelcontextprotocol/sdk` ^1.29 — MCP transport + tool registration.
+- `zod` ^3.24 — tool-parameter schema validation.
+- `@adia-ai/a2ui-compose` ^0.0.1 — generation engine.
+- `@adia-ai/a2ui-retrieval` ^0.0.1 — retrieval layer.
+- `@adia-ai/a2ui-validator` ^0.0.1 — validation layer.
+- `@adia-ai/a2ui-corpus` ^0.0.1 — training corpus (patterns, fragments,
+  compositions, exemplars).
+
+### Verification
+- `npm run smoke:register-engine` — 11/11 passing (engine registration + reserved-name protection + custom-engine hooks).
+- `npm run test:a2ui` — 19 passed, 0 failed, 1 skipped (thinking-mode test is `--thinking` opt-in).
+
+---
+
+## [0.1.0] — internal baseline (unreleased)
+
+Initial version at the time the monorepo was established. Contains:
+
+- MCP server (`server.js`) exposing `generate_ui`, `search_patterns`, `lookup_component`, `resolve_composition`, `validate_schema`, and related tools.
+- Engine plugin API via in-process `registerEngine()` (see `packages/a2ui/compose/engines/registry.js`).
+- Smoke tests (`scripts/smoke-*.mjs`) for engine registration, merged generation, and end-to-end A2UI tests.
+- Eval harnesses (`scripts/test-evals.mjs`, `scripts/test-a2ui.mjs`, `scripts/eval-fix.mjs`) for corpus regression measurement.
+
+Package name still uses the legacy `@adia-ai/a2ui-mcp` scope pending rename (see root CHANGELOG for context).
package/README.md
ADDED
@@ -0,0 +1,154 @@
+# @adia-ai/a2ui-mcp
+
+MCP server wrapping [`@adia-ai/a2ui-compose`](../compose). Exposes the generation
+engine, component catalog, pattern library, validator, and training
+feedback loop as stdio tools for Claude Desktop, Claude Code, and any
+other MCP-speaking host.
+
+> Runtime only. Generation logic lives in `@adia-ai/a2ui-compose`; UI atoms in
+> [`@adia-ai/web-components`](../web-components); the A2UI protocol runtime
+> (renderer, registry, streams, wiring) in
+> [`@adia-ai/a2ui-utils`](../a2ui/utils); corpus in
+> [`@adia-ai/a2ui-corpus`](../corpus).
+
+## Quick start
+
+```bash
+node packages/a2ui/mcp/server.js
+# or, if linked:
+adiaui-mcp
+```
+
+Register with Claude Code (`.claude/settings.json`):
+
+```json
+{
+  "mcpServers": {
+    "adia-ui": {
+      "command": "node",
+      "args": ["packages/a2ui/mcp/server.js"]
+    }
+  }
+}
+```
+
+API keys (at least one required for `generate_ui`):
+
+```bash
+export ANTHROPIC_API_KEY=sk-ant-…
+export OPENAI_API_KEY=sk-…
+export GEMINI_API_KEY=AIza…
+```
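
Review note (illustrative sketch, not part of the package): the README says at least one of three API keys must be set for `generate_ui`, and that keys are read when the server starts. One way such a lookup could work is shown below. The function name `pickProvider` and the precedence order are assumptions; the package's real adapter-resolution logic is not visible in this diff.

```javascript
// Hypothetical precedence order; the package's real order is not documented here.
const PROVIDER_ENV_VARS = [
  ['anthropic', 'ANTHROPIC_API_KEY'],
  ['openai', 'OPENAI_API_KEY'],
  ['gemini', 'GEMINI_API_KEY'],
];

// Returns the first provider whose key is set and non-blank, or null.
function pickProvider(env) {
  for (const [provider, varName] of PROVIDER_ENV_VARS) {
    const value = env[varName];
    if (typeof value === 'string' && value.trim() !== '') return provider;
  }
  return null;
}

// Read once at startup: later changes to process.env are not picked up.
const provider = pickProvider(process.env);
if (provider === null) {
  console.error('No LLM API key found; generate_ui cannot call a model.');
}
```

Passing the environment as an argument keeps the lookup easy to exercise without touching the real `process.env`.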
+
+## Tools
+
+The server registers 21 tools. Shape is stable; argument schemas via Zod.
+
+| Tool | What it does |
+|------------------------|-----------------------------------------------------------|
+| `generate_ui` | Intent → A2UI tree. Engine (`monolithic`/`zettel`) + mode. |
+| `validate_schema` | Run the 15-check validator on an A2UI tree; returns 0-100. |
+| `classify_intent` | Extract concepts, entities, implied components, steelman. |
+| `lookup_component` | Resolve a component name (alias-aware) to its schema. |
+| `get_component_map` | Full tag→class map including alias normalizations. |
+| `search_patterns` | Keyword-rank the monolithic pattern corpus. |
+| `assemble_context` | Build the system prompt context for a given intent. |
+| `check_anti_patterns` | Scan a tree for canonical anti-patterns (chart-legend, …). |
+| `get_traits` | List trait catalog + their host-binding rules. |
+| `convert_html` | Raw HTML → best-effort A2UI tree (import path). |
+| `get_wiring_catalog` | Declarative wiring-engine recipes. |
+| `import_pattern` | Commit a generated result into the pattern library. |
+| `submit_feedback` | Append a user-feedback event to the feedback store. |
+| `get_quality_metrics` | Aggregate pass/fail scores over a window. |
+| `get_training_gaps` | Intents that currently miss coverage. |
+| `run_eval` | Run the held-out benchmark; return pass/fail per intent. |
+| `get_fragment` | Fetch a single zettel fragment by id. |
+| `get_composition` | Fetch a named multi-fragment composition. |
+| `resolve_composition` | Expand a composition reference into its fragments. |
+| `get_graph` | Dump the zettel fragment-dependency graph. |
+| `zettel_stats` | Corpus counts (fragments, compositions, reuse ratio, …). |
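
Review note (illustrative sketch, not part of the package): an MCP host invokes any tool in the table above with a JSON-RPC 2.0 `tools/call` request, sent over stdio as one JSON message per line. The sketch below builds such a request by hand. The helper names and the `messages` argument are assumptions; the real argument names come from each tool's Zod schema in `server.js`, which this portion of the diff does not show.

```javascript
// Builds a JSON-RPC 2.0 request for the MCP `tools/call` method.
function buildToolCall(id, toolName, toolArgs) {
  return {
    jsonrpc: '2.0',
    id,
    method: 'tools/call',
    params: { name: toolName, arguments: toolArgs },
  };
}

// The stdio transport is newline-delimited, so serialize compactly
// (no pretty-printing) and terminate the message with a newline.
function frameForStdio(message) {
  return `${JSON.stringify(message)}\n`;
}

// `messages` is a hypothetical argument name; check the tool's schema.
const request = buildToolCall(1, 'validate_schema', { messages: [] });
process.stdout.write(frameForStdio(request));
```

In practice an MCP client library handles this framing; the sketch only shows what travels over the pipe.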
+
+## Layout
+
+```
+gen-ui-mcp/
+├── server.js                        MCP bootstrap — registers all 21 tools inline
+├── scripts/                         Standalone runners (smoke tests, eval diffs, visual validate)
+│   ├── generate.mjs                 CLI: `node scripts/generate.mjs "pricing page"`
+│   ├── eval-diff.mjs                diff held-out run vs baseline
+│   ├── eval-fix.mjs                 re-run failing eval intents
+│   ├── dogfood-test.mjs             20-intent dogfood + avg-score gate
+│   ├── multi-turn-test.mjs          session iteration smoke
+│   ├── smoke-engine-registry.mjs    registry + reserved-name checks
+│   ├── smoke-register-engine.mjs    plugin engine registration
+│   ├── smoke-zettel.mjs             zettel-specific smoke (corpus + composer)
+│   ├── smoke-synthesis.mjs          fragment-synthesis path
+│   ├── smoke-merged.mjs             cross-engine merge check
+│   ├── smoke-searchable-select.mjs  known-good composition regression
+│   ├── test-a2ui.mjs                A2UI message validator unit tests
+│   ├── test-evals.mjs               evals harness wrapper
+│   └── visual-validate.mjs          Playwright render + Vision-LLM critique
+├── tools/                           (reserved; tool code currently inlined in server.js)
+└── personas/                        persona presets (vocabulary + style hints for adapters)
+```
+
+## CLI usage
+
+```bash
+# One-shot generation from a prompt
+node packages/a2ui/mcp/scripts/generate.mjs "pricing page with 3 tiers"
+
+# Run the held-out eval
+node packages/a2ui/mcp/scripts/test-evals.mjs
+
+# Full verification sweep (invoked by /verification-sweep slash command)
+npm run smoke:engines && \
+npm run smoke:register-engine && \
+npm run test:a2ui && \
+npm run eval:diff -- --engine zettel
+```
+
+## How it fits together
+
+```
+┌─────────────┐     MCP stdio
+│ Claude Code │◀─────────────────▶ server.js
+└─────────────┘                       │
+                                      ├── @adia-ai/a2ui-compose (engines, validator, LLM bridge)
+                                      ├── @adia-ai/a2ui-corpus (catalog, fragments, feedback)
+                                      └── @adiahealth/web-components (component schemas via .a2ui.json)
+```
+
+On start, the server:
+
+1. Loads the component catalog (`gen-ui-training/catalog-a2ui_0_9.json`)
+2. Lazy-initializes the zettel corpus on first `generate_ui`/`zettel_*` call
+3. Resolves LLM adapter from env vars
+4. Registers tools + opens stdio transport
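
Review note (illustrative sketch, not part of the package): step 2 above describes lazy initialization, where the corpus is loaded on the first call that needs it rather than at startup. A minimal version of that pattern is sketched below. The helper `lazy` and the shape of the loaded object are assumptions; the package's actual loader is not shown in this diff.

```javascript
// Wraps a loader so it runs at most once; later calls reuse the first result.
// If the loader throws, nothing is cached and the next call tries again.
function lazy(loader) {
  let loaded = false;
  let value;
  return function get() {
    if (!loaded) {
      value = loader();
      loaded = true;
    }
    return value;
  };
}

// Hypothetical corpus loader, counted so the single load is observable.
let loadCount = 0;
const getCorpus = lazy(() => {
  loadCount += 1;
  return { fragments: [], compositions: [] };
});

getCorpus();
getCorpus();
console.log(loadCount); // prints 1
```

If the loader returns a promise, the promise itself is cached, so concurrent callers share one in-flight load.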
+
+## Gotchas
+
+- **Corpus load is noisy.** The zettel composer prints stats to stderr on
+  first invocation. Callers expecting silent MCP should swallow stderr.
+- **API keys must be set before the server starts.** Changing env vars
+  mid-session doesn't hot-reload adapters.
+- **Engine selector is internal.** Don't pass `engine: 'mcp'` — the
+  generator picks `monolithic` vs `zettel` from intent + mode. Reserved
+  names in the registry guard against accidental shadowing.
+- **Validator score < 70 is still returned.** Consumers should gate on
+  the `passed` boolean or raw `score` — the tool doesn't auto-retry.
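
Review note (illustrative sketch, not part of the package): the last gotcha says low-scoring results are still returned and that callers must gate on `passed` or `score` themselves. A consumer-side gate with a bounded retry might look like the sketch below. The field location (`result.validation`), the helper names, and the retry count are assumptions; only the threshold of 70 comes from the README.

```javascript
// Accept a result when the validator says it passed, or, if no `passed`
// flag is present, when the raw score meets the threshold.
function acceptResult(validation, minScore = 70) {
  if (!validation) return false;
  if (typeof validation.passed === 'boolean') return validation.passed;
  return typeof validation.score === 'number' && validation.score >= minScore;
}

// Calls `generate` (a caller-supplied async function) until a result is
// accepted or the attempt budget runs out; returns the last result either way.
async function generateUntilAccepted(generate, maxAttempts = 3) {
  let last = null;
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    last = await generate(attempt);
    if (acceptResult(last?.validation)) break;
  }
  return last;
}
```

Returning the last result even on failure leaves the decision about surfacing a low-scoring tree with the caller.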
+
+## Evals + regression floors
+
+The server enforces the same thresholds as the rest of the pipeline:
+
+- `test:a2ui` — 19/19 green (+1 skipped OK)
+- `smoke:register-engine` — 11/11 green
+- `eval:diff -- --engine zettel` — coverage 100%, avgScore ≥ 88
+- Dogfood — 20/20 intents at avg ≥ 95
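
Review note (illustrative sketch, not part of the package): the `eval:diff` floor above (coverage 100%, avgScore at least 88) can be checked mechanically against a run object. The sketch below assumes the run exposes `coverage` and `avgScoreWhenEmitted`, the field names that `scripts/eval-diff.mjs` prints; the helper `checkFloors` is hypothetical.

```javascript
// Regression floors taken from the README bullet for `eval:diff`.
const FLOORS = { coverage: 100, avgScoreWhenEmitted: 88 };

// Returns a list of human-readable failures; an empty list means all floors hold.
function checkFloors(run, floors = FLOORS) {
  const failures = [];
  for (const [metric, min] of Object.entries(floors)) {
    const actual = run[metric];
    if (typeof actual !== 'number' || actual < min) {
      failures.push(`${metric}: expected >= ${min}, got ${actual}`);
    }
  }
  return failures;
}

const failures = checkFloors({ coverage: 100, avgScoreWhenEmitted: 91 });
console.log(failures.length === 0 ? 'floors met' : failures.join('\n'));
```

A CI wrapper could exit non-zero whenever the returned list is non-empty.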
+
+See repo-root `AGENTS.md` for the full verification sweep.
+
+## License
+
+MIT
package/package.json
ADDED
@@ -0,0 +1,35 @@
+{
+  "name": "@adia-ai/a2ui-mcp",
+  "version": "0.0.1",
+  "description": "AdiaUI A2UI MCP server. Exposes the compose engine over MCP with an engine selector for monolithic + zettel strategies.",
+  "type": "module",
+  "bin": {
+    "adiaui-mcp": "./server.js"
+  },
+  "files": [
+    "server.js",
+    "tools/",
+    "scripts/",
+    "personas/",
+    "README.md",
+    "CHANGELOG.md"
+  ],
+  "license": "MIT",
+  "publishConfig": {
+    "access": "public",
+    "registry": "https://registry.npmjs.org"
+  },
+  "repository": {
+    "type": "git",
+    "url": "git+https://github.com/adiahealth/gen-ui-kit.git",
+    "directory": "packages/a2ui/mcp"
+  },
+  "dependencies": {
+    "@modelcontextprotocol/sdk": "^1.29.0",
+    "@adia-ai/a2ui-compose": "^0.0.1",
+    "@adia-ai/a2ui-retrieval": "^0.0.1",
+    "@adia-ai/a2ui-validator": "^0.0.1",
+    "@adia-ai/a2ui-corpus": "^0.0.1",
+    "zod": "^3.24.0"
+  }
+}
package/scripts/dogfood-test.mjs
ADDED
@@ -0,0 +1,107 @@
+/**
+ * Dogfood test: Run 20 diverse intents through A2UI instant mode.
+ */
+const { generateUI } = await import('../../compose/engine/generator.js');
+
+const intents = [
+  'user registration form with name, email, password',
+  'product listing page with filters and grid',
+  'email inbox with folders and message preview',
+  'restaurant menu with categories and prices',
+  'weather app with forecast and current conditions',
+  'music player with playlist and controls',
+  'image carousel with thumbnails',
+  'file manager with folder tree and preview',
+  'social media profile with posts and followers',
+  'kanban board with tasks and columns',
+  'calendar with events and day view',
+  'shopping cart with items and checkout',
+  'chat application with contacts and messages',
+  'admin dashboard with users table and charts',
+  'blog post editor with preview',
+  'notification center with filters',
+  'project settings with team members',
+  'API documentation with endpoints list',
+  'color theme builder with preview',
+  'onboarding wizard with progress steps',
+];
+
+const results = [];
+
+for (const intent of intents) {
+  try {
+    const r = await generateUI({ intent, mode: 'instant' });
+    const components = r.messages?.[0]?.components || [];
+    const score = r.validation?.score ?? 0;
+    const errors = r.validation?.errors || [];
+    const warnings = r.validation?.warnings || [];
+
+    // Detect pattern match from suggestions and pipeline
+    const adaptedSuggestion = (r.suggestions || []).find(s => s.includes('Adapted from'));
+    const thinkingModeSuggestion = (r.suggestions || []).find(s => s.includes('Use "thinking" mode for LLM-powered'));
+
+    let matchType = 'direct'; // assume direct pattern match
+    let patternName = null;
+
+    if (adaptedSuggestion) {
+      matchType = 'partial';
+      const m = adaptedSuggestion.match(/Adapted from "([^"]+)"/);
+      patternName = m ? m[1] : 'unknown';
+    } else if (thinkingModeSuggestion) {
+      matchType = 'fallback';
+      patternName = null;
+    } else {
+      // Direct match - infer from pipeline confidence
+      const planReport = r.pipeline?.reports?.find(s => s.stage === 'plan');
+      const genReport = r.pipeline?.reports?.find(s => s.stage === 'generate');
+      if (planReport?.confidence >= 0.95) {
+        matchType = 'direct';
+      } else if (planReport?.confidence >= 0.5) {
+        matchType = 'partial';
+      }
+    }
+
+    // Component types
+    const compTypes = [...new Set(components.map(c => c.component))];
+
+    // Quality heuristics
+    const hasLayout = compTypes.some(t => ['Card', 'Section', 'Column', 'Row', 'Grid'].includes(t));
+    const hasInteractive = compTypes.some(t => ['Button', 'Input', 'Select', 'CheckBox', 'Toggle', 'Slider', 'TextArea'].includes(t));
+    const hasContent = compTypes.some(t => ['Text', 'Image', 'Avatar', 'Icon', 'Badge'].includes(t));
+
+    results.push({
+      intent,
+      componentCount: components.length,
+      score,
+      matchType,
+      patternName,
+      compTypes,
+      hasLayout,
+      hasInteractive,
+      hasContent,
+      errors,
+      warnings,
+      suggestions: r.suggestions || [],
+      success: components.length > 0 && score >= 70,
+    });
+  } catch (err) {
+    results.push({
+      intent,
+      componentCount: 0,
+      score: 0,
+      matchType: 'error',
+      patternName: null,
+      compTypes: [],
+      hasLayout: false,
+      hasInteractive: false,
+      hasContent: false,
+      errors: [err.message],
+      warnings: [],
+      suggestions: [],
+      success: false,
+    });
+  }
+}
+
+// Print results as JSON
+console.log(JSON.stringify(results, null, 2));
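
Review note (illustrative sketch, not part of the package): the script above only prints per-intent results as JSON, while the README describes a "20-intent dogfood + avg-score gate" with a floor of 20/20 intents at an average of at least 95. A gate over the printed array could be computed as in the sketch below. The helper `summarize` is hypothetical; it relies only on the `score` and `success` fields the script emits.

```javascript
// Summarizes dogfood rows and applies an all-pass + average-score gate.
function summarize(results, minAvg = 95) {
  const total = results.length;
  const passed = results.filter((r) => r.success).length;
  const avgScore = total
    ? results.reduce((sum, r) => sum + r.score, 0) / total
    : 0;
  return {
    total,
    passed,
    avgScore,
    ok: total > 0 && passed === total && avgScore >= minAvg,
  };
}

const summary = summarize([
  { intent: 'kanban board', score: 100, success: true },
  { intent: 'shopping cart', score: 90, success: true },
]);
console.log(summary.ok); // prints true (2/2 passed, average 95)
```

Piping the script's output through such a check would let a wrapper set the process exit code for CI.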
package/scripts/eval-diff.mjs
ADDED
@@ -0,0 +1,282 @@
+#!/usr/bin/env node
+/**
+ * Run one or both engines through the V2 harness and emit side-by-side artifacts.
+ *
+ * Replaces the older `scripts/zettel-eval-diff.mjs`. Lives under a2ui-mcp because
+ * cross-engine evaluation is an MCP-consumer concern (the MCP surface is what
+ * exposes multiple engines; this script is the CLI form of that comparison).
+ *
+ * Output: evals/mcp/runs/<ISO-timestamp>/
+ *   mcp.json — V2 results for monolithic-pattern engine (if engine ∈ {mcp, all})
+ *   zettel.json — V2 results for fragment-graph engine (if engine ∈ {zettel, all})
+ *   diff.md — per-intent table + aggregate summary (always)
+ *
+ * Usage:
+ *   node packages/a2ui/mcp/scripts/eval-diff.mjs # both engines
+ *   node packages/a2ui/mcp/scripts/eval-diff.mjs --engine all # both engines (explicit)
+ *   node packages/a2ui/mcp/scripts/eval-diff.mjs --engine mcp # monolithic only
+ *   node packages/a2ui/mcp/scripts/eval-diff.mjs --engine zettel # fragment-graph only
+ *   node packages/a2ui/mcp/scripts/eval-diff.mjs --limit 20
+ *   node packages/a2ui/mcp/scripts/eval-diff.mjs --domain forms
+ */
+import '../../../../scripts/load-env.mjs';
+
+import { mkdir, writeFile } from 'node:fs/promises';
+import { join, dirname } from 'node:path';
+import { fileURLToPath } from 'node:url';
+
+import { generateUI } from '../../compose/engine/generator.js';
+import { generateZettel } from '../../compose/engines/zettel/generator-adapter.js';
+import { runHarnessV2 } from '../../compose/evals/harness.mjs';
+import { validateSemantics } from '../../validator/semantic/index.js';
+
+const __dirname = dirname(fileURLToPath(import.meta.url));
+const repoRoot = join(__dirname, '..', '..', '..', '..');
+
+// ── Args ──
+const args = process.argv.slice(2);
+const opt = (k) => {
+  const i = args.indexOf(`--${k}`);
+  return i >= 0 ? args[i + 1] : undefined;
+};
+const engine = opt('engine') || 'all';
+const limit = opt('limit') ? Number(opt('limit')) : undefined;
+const domain = opt('domain');
+// Shadow-mode semantic validator (Phase 1). Opt-in; zero effect on gating.
+const semanticEnabled = args.includes('--semantic');
+
+if (!['mcp', 'zettel', 'all'].includes(engine)) {
+  console.error(`[eval-diff] --engine must be one of: mcp | zettel | all (got: ${engine})`);
+  process.exit(2);
+}
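
Review note (illustrative sketch, not part of the package): the `opt` helper above returns whatever token follows a flag, so `--limit abc` becomes `NaN` and a flag given without a value silently consumes the next flag. A stricter variant is sketched below. The helper names `readFlag` and `readPositiveInt` are hypothetical and take the argument array explicitly so they can be exercised in isolation.

```javascript
// Returns the value following `--name`, or undefined when the flag is absent.
// Throws when the flag is present but has no usable value.
function readFlag(argv, name) {
  const i = argv.indexOf(`--${name}`);
  if (i < 0) return undefined;
  const value = argv[i + 1];
  if (value === undefined || value.startsWith('--')) {
    throw new Error(`--${name} requires a value`);
  }
  return value;
}

// Like readFlag, but also insists on a positive integer (e.g. for --limit).
function readPositiveInt(argv, name) {
  const raw = readFlag(argv, name);
  if (raw === undefined) return undefined;
  const n = Number(raw);
  if (!Number.isInteger(n) || n <= 0) {
    throw new Error(`--${name} must be a positive integer (got: ${raw})`);
  }
  return n;
}

console.log(readPositiveInt(['--engine', 'zettel', '--limit', '20'], 'limit')); // prints 20
```

Failing fast on a bad `--limit` avoids starting an expensive harness run with an unintended intent count.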
+
+const runMcp = engine === 'mcp' || engine === 'all';
+const runZettel = engine === 'zettel' || engine === 'all';
+
+// ── MCP adapter: use the top-level patternName exposed by generateInstant ──
+// Shadow-mode capture: when --semantic is set, remember the emitted messages
+// per intent so we can judge them after the structural harness finishes.
+const capturedMessages = new Map(); // key: `${label}:${intent}` → messages
+
+async function generateMcp({ intent, mode }) {
+  const result = await generateUI({ intent, mode: mode || 'instant' });
+  const patternName = result.patternName ?? null;
+  const strategy = result.strategy || (patternName ? 'pattern-match' : 'fallback');
+  if (semanticEnabled && Array.isArray(result.messages) && result.messages.length > 0) {
+    capturedMessages.set(`mcp:${intent}`, result.messages);
+  }
+  return {
+    ...result,
+    strategy,
+    retrieval: {
+      hit: !!patternName,
+      rank: patternName ? 1 : null, // mcp picks top-1 only; no candidate list surfaced
+      candidate: patternName,
+    },
+  };
+}
+
+async function generateZettelCapture({ intent, mode }) {
+  const result = await generateZettel({ intent, mode });
+  if (semanticEnabled && Array.isArray(result.messages) && result.messages.length > 0) {
+    capturedMessages.set(`zettel:${intent}`, result.messages);
+  }
+  return result;
+}
+
+// ── Run ──
+let mcp = null;
+let zettel = null;
+
+if (runMcp) {
+  console.error(`[eval-diff] running mcp (monolithic) harness…`);
+  mcp = await runHarnessV2({
+    generate: generateMcp,
+    domain,
+    limit,
+    mode: 'instant',
+    label: 'mcp',
+  });
+  console.error(` coverage=${mcp.coverage}% emitted=${mcp.emitted}/${mcp.total} avgScore=${mcp.avgScoreWhenEmitted}`);
+}
+
+if (runZettel) {
+  console.error(`[eval-diff] running zettel (fragment-graph) harness…`);
+  zettel = await runHarnessV2({
+    generate: generateZettelCapture,
+    domain,
+    limit,
+    mode: 'instant',
+    label: 'zettel',
+  });
+  console.error(` coverage=${zettel.coverage}% emitted=${zettel.emitted}/${zettel.total} avgScore=${zettel.avgScoreWhenEmitted}`);
+}
+
+// ── Shadow-mode semantic validation (Phase 1) ──
+// Opt-in via --semantic. Annotates per-intent rows + aggregates with
+// semanticScore/verdict/combinedScore. DOES NOT affect row.pass, passRate,
+// or avgScoreWhenEmitted — those remain structural-only.
+async function annotateSemantic(runObj, label) {
+  if (!runObj || !semanticEnabled) return;
+  let semSum = 0;
+  let semN = 0;
+  let combSum = 0;
+  let combN = 0;
+  let tokensIn = 0;
+  let tokensOut = 0;
+  let cached = 0;
+  let errors = 0;
+  const verdictBreakdown = {};
+  for (const row of runObj.results) {
+    if (!row.messagesEmitted) continue;
+    const msgs = capturedMessages.get(`${label}:${row.intent}`);
+    if (!msgs) continue;
+    try {
+      const v = await validateSemantics(
+        { intent: row.intent, messages: msgs },
+        { cache: true, timeoutMs: 15000 },
+      );
+      row.semanticScore = v.score;
+      row.semanticVerdict = v.verdict;
+      row.semanticRationale = v.rationale;
+      row.semanticAxes = v.axes;
+      row.semanticCost = v.cost;
+      row.rubricVersion = v.rubricVersion;
+      const structural = row.validationScore ?? 0;
+      row.combinedScore = Math.round(0.6 * structural + 0.4 * v.score);
+      // NOTE: row.pass intentionally NOT updated — shadow mode only.
+      if (!v.error) {
+        semSum += v.score;
+        semN += 1;
+        combSum += row.combinedScore;
+        combN += 1;
+        verdictBreakdown[v.verdict] = (verdictBreakdown[v.verdict] || 0) + 1;
+        tokensIn += v.cost?.inputTokens || 0;
+        tokensOut += v.cost?.outputTokens || 0;
+        if (v.cost?.cached) cached += 1;
+      } else {
+        errors += 1;
+      }
+    } catch (e) {
+      errors += 1;
+      row.semanticError = e.message;
+    }
+  }
+  runObj.semantic = {
+    enabled: true,
+    mode: 'shadow',
+    judged: semN,
+    errors,
+    cached,
+    avgSemanticScore: semN ? Math.round(semSum / semN) : null,
+    avgCombinedScore: combN ? Math.round(combSum / combN) : null,
+    verdictBreakdown,
+    tokens: { input: tokensIn, output: tokensOut },
+    rubricVersion: 'v1',
+  };
+  console.error(`[semantic:${label}] judged=${semN} avgSem=${runObj.semantic.avgSemanticScore} avgCombined=${runObj.semantic.avgCombinedScore} cached=${cached} errors=${errors} tokens=${tokensIn}+${tokensOut}`);
+}
+
+if (semanticEnabled) {
+  if (!process.env.ANTHROPIC_API_KEY) {
+    console.error('[eval-diff] --semantic requested but ANTHROPIC_API_KEY missing; skipping.');
+  } else {
+    console.error(`[eval-diff] running semantic validator (shadow mode)…`);
+    if (mcp) await annotateSemantic(mcp, 'mcp');
+    if (zettel) await annotateSemantic(zettel, 'zettel');
+  }
+}
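
Review note (illustrative sketch, not part of the package): `annotateSemantic` blends the structural validator score and the semantic judge score as `Math.round(0.6 * structural + 0.4 * semantic)`. The sketch below isolates that formula and adds a guard for a missing semantic score, which the original computes before checking `v.error`. The function name and the `null` return for unusable inputs are assumptions.

```javascript
// Weighted blend of structural and semantic scores, rounded to an integer.
// Returns null when either input is not a finite number, so that a failed
// judge call cannot produce NaN.
function combinedScore(structural, semantic, structuralWeight = 0.6) {
  if (!Number.isFinite(structural) || !Number.isFinite(semantic)) return null;
  return Math.round(
    structuralWeight * structural + (1 - structuralWeight) * semantic,
  );
}

console.log(combinedScore(90, 70)); // prints 82 (54 + 28)
console.log(combinedScore(90, undefined)); // prints null
```

With a 0.6 weight, a 20-point gap between the two scores moves the blended value by 8 points away from the structural score.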
+
+// ── Write artifacts ──
+const stamp = new Date().toISOString().replace(/[:.]/g, '-');
+const outDir = join(repoRoot, 'evals', 'mcp', 'runs', stamp);
+await mkdir(outDir, { recursive: true });
+if (mcp) await writeFile(join(outDir, 'mcp.json'), JSON.stringify(mcp, null, 2));
+if (zettel) await writeFile(join(outDir, 'zettel.json'), JSON.stringify(zettel, null, 2));
+
+// ── Build diff.md ──
+function fmt(v) { return v == null ? '—' : String(v); }
+function winner(a, b) {
+  const sa = a?.messagesEmitted ? (a.validationScore ?? 0) : -1; // a missing row counts as a miss
+  const sb = b?.messagesEmitted ? (b.validationScore ?? 0) : -1; // b is undefined if the other run lacks this id
+  if (sa === -1 && sb === -1) return 'both-miss';
+  if (sa === sb) return 'tie';
+  return sa > sb ? 'mcp' : 'zettel';
+}
+
+let md = '';
+md += `# Engine Eval ${mcp && zettel ? 'Diff' : 'Report'}\n\n`;
+md += `- Run: \`${stamp}\`\n`;
+md += `- Engine(s): ${engine}\n`;
+md += `- Intents: ${(mcp || zettel).total}${domain ? ` (domain: ${domain})` : ''}${limit ? ` (limit: ${limit})` : ''}\n`;
+md += `- Mode: instant\n\n`;
+
+md += `## Aggregates\n\n`;
+if (mcp && zettel) {
+  md += `| metric | mcp | zettel |\n|---|---:|---:|\n`;
+  md += `| coverage % | ${mcp.coverage} | ${zettel.coverage} |\n`;
+  md += `| emitted | ${mcp.emitted}/${mcp.total} | ${zettel.emitted}/${zettel.total} |\n`;
+  md += `| avgScore (emitted only) | ${mcp.avgScoreWhenEmitted} | ${zettel.avgScoreWhenEmitted} |\n`;
+  md += `| avgF1 (emitted only) | ${mcp.avgF1WhenEmitted} | ${zettel.avgF1WhenEmitted} |\n`;
+  md += `| pass rate % | ${mcp.passRate} | ${zettel.passRate} |\n`;
+  md += `| retrieval MRR | ${fmt(mcp.retrievalMRR)} | ${fmt(zettel.retrievalMRR)} |\n\n`;
+} else {
+  const e = mcp || zettel;
+  const label = mcp ? 'mcp' : 'zettel';
+  md += `| metric | ${label} |\n|---|---:|\n`;
+  md += `| coverage % | ${e.coverage} |\n`;
+  md += `| emitted | ${e.emitted}/${e.total} |\n`;
+  md += `| avgScore (emitted only) | ${e.avgScoreWhenEmitted} |\n`;
+  md += `| avgF1 (emitted only) | ${e.avgF1WhenEmitted} |\n`;
+  md += `| pass rate % | ${e.passRate} |\n`;
+  md += `| retrieval MRR | ${fmt(e.retrievalMRR)} |\n\n`;
+}
+
+if (mcp && zettel) {
+  const zByIntent = new Map(zettel.results.map((r) => [r.id, r]));
+  const rows = mcp.results.map((m) => {
+    const z = zByIntent.get(m.id);
+    return { m, z, winner: winner(m, z) };
+  });
+  const counts = rows.reduce((a, r) => { a[r.winner] = (a[r.winner] || 0) + 1; return a; }, {});
+
+  md += `## Winner breakdown\n\n`;
+  md += `- mcp wins: ${counts.mcp || 0}\n`;
+  md += `- zettel wins: ${counts.zettel || 0}\n`;
+  md += `- ties: ${counts.tie || 0}\n`;
+  md += `- both missed: ${counts['both-miss'] || 0}\n\n`;
+
+  md += `## Strategy breakdown\n\n`;
+  md += `**mcp**: ` + Object.entries(mcp.strategyBreakdown).map(([k, v]) => `${k}=${v}`).join(', ') + `\n\n`;
+  md += `**zettel**: ` + Object.entries(zettel.strategyBreakdown).map(([k, v]) => `${k}=${v}`).join(', ') + `\n\n`;
+
+  md += `## Per-intent\n\n`;
+  md += `| id | domain | intent | mcp score | mcp F1 | zettel score | zettel F1 | winner | zettel strategy |\n`;
+  md += `|---|---|---|---:|---:|---:|---:|---|---|\n`;
+  for (const { m, z, winner: w } of rows) {
+    const intent = (m.intent || '').slice(0, 48).replace(/\|/g, '\\|');
+    md += `| ${m.id} | ${fmt(m.domain)} | ${intent} | ${fmt(m.validationScore)} | ${fmt(m.componentF1)} | ${fmt(z?.validationScore)} | ${fmt(z?.componentF1)} | ${w} | ${fmt(z?.strategy)} |\n`;
+  }
+
+  await writeFile(join(outDir, 'diff.md'), md);
+  console.error(`\n[eval-diff] wrote ${outDir}`);
+  console.error(` mcp wins: ${counts.mcp || 0}`);
+  console.error(` zettel wins: ${counts.zettel || 0}`);
+  console.error(` ties: ${counts.tie || 0}`);
+  console.error(` both missed: ${counts['both-miss'] || 0}`);
+} else {
+  const e = mcp || zettel;
+  const label = mcp ? 'mcp' : 'zettel';
+  md += `## Strategy breakdown\n\n`;
+  md += `**${label}**: ` + Object.entries(e.strategyBreakdown).map(([k, v]) => `${k}=${v}`).join(', ') + `\n\n`;
+  md += `## Per-intent\n\n`;
+  md += `| id | domain | intent | score | F1 | strategy |\n|---|---|---|---:|---:|---|\n`;
+  for (const r of e.results) {
+    const intent = (r.intent || '').slice(0, 48).replace(/\|/g, '\\|');
+    md += `| ${r.id} | ${fmt(r.domain)} | ${intent} | ${fmt(r.validationScore)} | ${fmt(r.componentF1)} | ${fmt(r.strategy)} |\n`;
+  }
+  await writeFile(join(outDir, 'diff.md'), md);
+  console.error(`\n[eval-diff] wrote ${outDir}`);
+}
+
+console.log(outDir);
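
Review note (illustrative sketch, not part of the package): the aggregates table prints a `retrieval MRR` value that comes from the harness, whose implementation is not in this diff. The sketch below shows the standard definition of mean reciprocal rank over rows carrying a `retrieval.rank`, with misses counted as zero. Whether `runHarnessV2` averages over all rows or only over hits is an assumption here.

```javascript
// Mean reciprocal rank: average of 1/rank per row, counting a miss as 0.
// Returns null for an empty run so callers can print a placeholder.
function meanReciprocalRank(rows) {
  if (rows.length === 0) return null;
  const sum = rows.reduce((acc, row) => {
    const rank = row.retrieval?.rank;
    return acc + (Number.isInteger(rank) && rank > 0 ? 1 / rank : 0);
  }, 0);
  return sum / rows.length;
}

const mrr = meanReciprocalRank([
  { retrieval: { rank: 1 } },
  { retrieval: { rank: 2 } },
  { retrieval: { rank: null } },
]);
console.log(mrr); // prints 0.5, i.e. (1 + 0.5 + 0) / 3
```

Because `generateMcp` only ever reports rank 1 or `null`, under this definition the monolithic engine's MRR reduces to its retrieval hit rate.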