@adia-ai/a2ui-mcp 0.0.1

package/CHANGELOG.md ADDED
@@ -0,0 +1,65 @@
1
+ # Changelog — @adia-ai/a2ui-mcp
2
+
3
+ Follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
4
+
5
+ Scope: MCP transport wrapping `@adia-ai/a2ui-compose`. Exposes
6
+ generation tools over JSON-RPC for Claude Code, Claude Desktop,
7
+ and other MCP hosts. Engine selector supports both monolithic and
8
+ zettel strategies.
9
+
10
+ ## [Unreleased]
11
+
12
+ ### Changed
13
+ - Registry / transpilation scripts at `packages/web-components/scripts/a2ui-to-html.cjs` and `packages/web-components/scripts/mcp-pipeline.cjs` now mirror the canonical `packages/web-components/a2ui/registry.js` for all A2UI component → custom-element mappings. Previously these scripts had several stale mappings that diverged from the runtime registry.
14
+
15
+ ---
16
+
17
+ ## [0.0.1] - 2026-04-24
18
+
19
+ First public release. MCP server that wraps the compose engine and
20
+ exposes A2UI generation tools over JSON-RPC.
21
+
22
+ ### Included
23
+
24
+ - **Server** (`server.js`) — MCP stdio server. Registers tools,
25
+ handles JSON-RPC, routes requests to the compose engine selector.
26
+ - **Tools** (`tools/`) — MCP tool definitions for
27
+ `generate_ui`, `validate_schema`, `check_anti_patterns`,
28
+ `search_patterns`, `lookup_component`, `get_composition`,
29
+ `get_fragment`, `get_graph`, `get_traits`, `get_wiring_catalog`,
30
+ `zettel_stats`, `run_eval`, `submit_feedback`, plus eight more (21 in total).
31
+ See `tools/` for the full surface.
32
+ - **Personas** (`personas/`) — per-MCP-host prompts + behavior
33
+ tweaks.
34
+ - **Scripts** (`scripts/`) — `eval-diff.mjs` (cross-engine regression
35
+ runner), `generate.mjs` (CLI wrapper), `test-a2ui.mjs` (integration
36
+ suite), smoke tests, and dogfood/multi-turn harnesses.
37
+ - **Binary** — `adiaui-mcp` CLI registered via `package.json` `bin`
38
+ so `npx @adia-ai/a2ui-mcp` runs the stdio server directly.
39
+
40
+ ### Dependencies
41
+
42
+ - `@modelcontextprotocol/sdk` ^1.29 — MCP transport + tool registration.
43
+ - `zod` ^3.24 — tool-parameter schema validation.
44
+ - `@adia-ai/a2ui-compose` ^0.0.1 — generation engine.
45
+ - `@adia-ai/a2ui-retrieval` ^0.0.1 — retrieval layer.
46
+ - `@adia-ai/a2ui-validator` ^0.0.1 — validation layer.
47
+ - `@adia-ai/a2ui-corpus` ^0.0.1 — training corpus (patterns, fragments,
48
+ compositions, exemplars).
49
+
50
+ ### Verification
51
+ - `npm run smoke:register-engine` — 11/11 passing (engine registration + reserved-name protection + custom-engine hooks).
52
+ - `npm run test:a2ui` — 19 passed, 0 failed, 1 skipped (thinking-mode test is `--thinking` opt-in).
53
+
54
+ ---
55
+
56
+ ## [0.1.0] — internal baseline (unreleased)
57
+
58
+ Initial version at the time the monorepo was established. Contains:
59
+
60
+ - MCP server (`server.js`) exposing `generate_ui`, `search_patterns`, `lookup_component`, `resolve_composition`, `validate_schema`, and related tools.
61
+ - Engine plugin API via in-process `registerEngine()` (see `packages/a2ui/compose/engines/registry.js`).
62
+ - Smoke tests (`scripts/smoke-*.mjs`) for engine registration, merged generation, and end-to-end A2UI tests.
63
+ - Eval harnesses (`scripts/test-evals.mjs`, `scripts/test-a2ui.mjs`, `scripts/eval-fix.mjs`) for corpus regression measurement.
64
+
65
+ Package name still uses the legacy `@adia-ai/a2ui-mcp` scope pending rename (see root CHANGELOG for context).
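
The changelog mentions an in-process `registerEngine()` plugin API and "reserved-name protection". The following is a minimal sketch of how such a guard could work, not the package's actual registry. The factory `createEngineRegistry`, the reserved set (`monolithic`, `zettel`), and the error messages are all assumptions made for illustration.

```javascript
// Hypothetical sketch: names, reserved set, and errors are assumptions,
// not the API of packages/a2ui/compose/engines/registry.js.
const RESERVED_ENGINE_NAMES = new Set(['monolithic', 'zettel']);

function createEngineRegistry() {
  const engines = new Map();

  return {
    // Register a custom engine; `generate` is expected to be an
    // async ({ intent, mode }) => result function.
    registerEngine(name, generate) {
      if (typeof name !== 'string' || name.length === 0) {
        throw new TypeError('engine name must be a non-empty string');
      }
      if (typeof generate !== 'function') {
        throw new TypeError('engine must be a function');
      }
      // Built-in names cannot be shadowed by plugins.
      if (RESERVED_ENGINE_NAMES.has(name)) {
        throw new Error(`engine name "${name}" is reserved`);
      }
      if (engines.has(name)) {
        throw new Error(`engine "${name}" is already registered`);
      }
      engines.set(name, generate);
    },
    // Returns the registered function, or null when the name is unknown.
    getEngine(name) {
      return engines.get(name) ?? null;
    },
    listEngines() {
      return [...engines.keys()];
    },
  };
}
```

Rejecting duplicates as well as reserved names keeps registration order from silently changing which engine a name resolves to.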
package/README.md ADDED
@@ -0,0 +1,154 @@
1
+ # @adia-ai/a2ui-mcp
2
+
3
+ MCP server wrapping [`@adia-ai/a2ui-compose`](../compose). Exposes the generation
4
+ engine, component catalog, pattern library, validator, and training
5
+ feedback loop as stdio tools for Claude Desktop, Claude Code, and any
6
+ other MCP-speaking host.
7
+
8
+ > Runtime only. Generation logic lives in `@adia-ai/a2ui-compose`; UI atoms in
9
+ > [`@adia-ai/web-components`](../web-components); the A2UI protocol runtime
10
+ > (renderer, registry, streams, wiring) in
11
+ > [`@adia-ai/a2ui-utils`](../a2ui/utils); corpus in
12
+ > [`@adia-ai/a2ui-corpus`](../corpus).
13
+
14
+ ## Quick start
15
+
16
+ ```bash
17
+ node packages/a2ui/mcp/server.js
18
+ # or, if linked:
19
+ adiaui-mcp
20
+ ```
21
+
22
+ Register with Claude Code (`.claude/settings.json`):
23
+
24
+ ```json
25
+ {
26
+ "mcpServers": {
27
+ "adia-ui": {
28
+ "command": "node",
29
+ "args": ["packages/a2ui/mcp/server.js"]
30
+ }
31
+ }
32
+ }
33
+ ```
34
+
35
+ API keys (at least one required for `generate_ui`):
36
+
37
+ ```bash
38
+ export ANTHROPIC_API_KEY=sk-ant-…
39
+ export OPENAI_API_KEY=sk-…
40
+ export GEMINI_API_KEY=AIza…
41
+ ```
42
+
43
+ ## Tools
44
+
45
+ The server registers 21 tools. Tool names and result shapes are stable; argument schemas are defined with Zod.
46
+
47
+ | Tool | What it does |
48
+ |------------------------|-----------------------------------------------------------|
49
+ | `generate_ui` | Intent → A2UI tree. Engine (`monolithic`/`zettel`) + mode. |
50
+ | `validate_schema` | Run the 15-check validator on an A2UI tree; returns a 0-100 score. |
51
+ | `classify_intent` | Extract concepts, entities, implied components, steelman. |
52
+ | `lookup_component` | Resolve a component name (alias-aware) to its schema. |
53
+ | `get_component_map` | Full tag→class map including alias normalizations. |
54
+ | `search_patterns` | Keyword-rank the monolithic pattern corpus. |
55
+ | `assemble_context` | Build the system prompt context for a given intent. |
56
+ | `check_anti_patterns` | Scan a tree for canonical anti-patterns (chart-legend, …). |
57
+ | `get_traits` | List trait catalog + their host-binding rules. |
58
+ | `convert_html` | Raw HTML → best-effort A2UI tree (import path). |
59
+ | `get_wiring_catalog` | Declarative wiring-engine recipes. |
60
+ | `import_pattern` | Commit a generated result into the pattern library. |
61
+ | `submit_feedback` | Append a user-feedback event to the feedback store. |
62
+ | `get_quality_metrics` | Aggregate pass/fail scores over a window. |
63
+ | `get_training_gaps` | Intents that currently miss coverage. |
64
+ | `run_eval` | Run the held-out benchmark; return pass/fail per intent. |
65
+ | `get_fragment` | Fetch a single zettel fragment by id. |
66
+ | `get_composition` | Fetch a named multi-fragment composition. |
67
+ | `resolve_composition` | Expand a composition reference into its fragments. |
68
+ | `get_graph` | Dump the zettel fragment-dependency graph. |
69
+ | `zettel_stats` | Corpus counts (fragments, compositions, reuse ratio, …). |
70
+
71
+ ## Layout
72
+
73
+ ```
74
+ gen-ui-mcp/
75
+ ├── server.js MCP bootstrap — registers all 21 tools inline
76
+ ├── scripts/ Standalone runners (smoke tests, eval diffs, visual validate)
77
+ │ ├── generate.mjs CLI: `node scripts/generate.mjs "pricing page"`
78
+ │ ├── eval-diff.mjs diff held-out run vs baseline
79
+ │ ├── eval-fix.mjs re-run failing eval intents
80
+ │ ├── dogfood-test.mjs 20-intent dogfood + avg-score gate
81
+ │ ├── multi-turn-test.mjs session iteration smoke
82
+ │ ├── smoke-engine-registry.mjs registry + reserved-name checks
83
+ │ ├── smoke-register-engine.mjs plugin engine registration
84
+ │ ├── smoke-zettel.mjs zettel-specific smoke (corpus + composer)
85
+ │ ├── smoke-synthesis.mjs fragment-synthesis path
86
+ │ ├── smoke-merged.mjs cross-engine merge check
87
+ │ ├── smoke-searchable-select.mjs known-good composition regression
88
+ │ ├── test-a2ui.mjs A2UI message validator unit tests
89
+ │ ├── test-evals.mjs evals harness wrapper
90
+ │ └── visual-validate.mjs Playwright render + Vision-LLM critique
91
+ ├── tools/ (reserved; tool code currently inlined in server.js)
92
+ └── personas/ persona presets (vocabulary + style hints for adapters)
93
+ ```
94
+
95
+ ## CLI usage
96
+
97
+ ```bash
98
+ # One-shot generation from a prompt
99
+ node packages/a2ui/mcp/scripts/generate.mjs "pricing page with 3 tiers"
100
+
101
+ # Run the held-out eval
102
+ node packages/a2ui/mcp/scripts/test-evals.mjs
103
+
104
+ # Full verification sweep (invoked by /verification-sweep slash command)
105
+ npm run smoke:engines && \
106
+ npm run smoke:register-engine && \
107
+ npm run test:a2ui && \
108
+ npm run eval:diff -- --engine zettel
109
+ ```
110
+
111
+ ## How it fits together
112
+
113
+ ```
114
+ ┌─────────────┐ MCP stdio
115
+ │ Claude Code │◀─────────────────▶ server.js
116
+ └─────────────┘ │
117
+ ├── @adia-ai/a2ui-compose (engines, validator, LLM bridge)
118
+ ├── @adia-ai/a2ui-corpus (catalog, fragments, feedback)
119
+ └── @adia-ai/web-components (component schemas via .a2ui.json)
120
+ ```
121
+
122
+ On start, the server:
123
+
124
+ 1. Loads the component catalog (`gen-ui-training/catalog-a2ui_0_9.json`)
125
+ 2. Lazy-initializes the zettel corpus on first `generate_ui`/`zettel_*` call
126
+ 3. Resolves LLM adapter from env vars
127
+ 4. Registers tools + opens stdio transport
128
+
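
Step 3 of the list above resolves the LLM adapter from environment variables. A minimal sketch of that lookup follows, assuming a fixed precedence (Anthropic, then OpenAI, then Gemini). The helper `resolveProvider` and its return shape are hypothetical; the README does not state the server's actual precedence.

```javascript
// Hypothetical sketch: provider order and return shape are assumptions.
const PROVIDER_ENV_KEYS = [
  ['anthropic', 'ANTHROPIC_API_KEY'],
  ['openai', 'OPENAI_API_KEY'],
  ['gemini', 'GEMINI_API_KEY'],
];

// Returns the first provider whose key is set to a non-blank string,
// or null when no key is configured.
function resolveProvider(env) {
  for (const [provider, keyName] of PROVIDER_ENV_KEYS) {
    const value = env[keyName];
    if (typeof value === 'string' && value.trim() !== '') {
      return { provider, keyName };
    }
  }
  return null;
}
```

Calling `resolveProvider(process.env)` once at startup would match the note under Gotchas that changing env vars mid-session has no effect.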
129
+ ## Gotchas
130
+
131
+ - **Corpus load is noisy.** The zettel composer prints stats to stderr on
132
+ first invocation. Callers expecting silent MCP should swallow stderr.
133
+ - **API keys must be set before the server starts.** Changing env vars
134
+ mid-session doesn't hot-reload adapters.
135
+ - **Engine selector is internal.** Don't pass `engine: 'mcp'` — the
136
+ generator picks `monolithic` vs `zettel` from intent + mode. Reserved
137
+ names in the registry guard against accidental shadowing.
138
+ - **Validator score < 70 is still returned.** Consumers should gate on
139
+ the `passed` boolean or raw `score` — the tool doesn't auto-retry.
140
+
141
+ ## Evals + regression floors
142
+
143
+ The server enforces the same thresholds as the rest of the pipeline:
144
+
145
+ - `test:a2ui` — 19/19 green (+1 skipped OK)
146
+ - `smoke:register-engine` — 11/11 green
147
+ - `eval:diff -- --engine zettel` — coverage 100%, avgScore ≥ 88
148
+ - Dogfood — 20/20 intents at avg ≥ 95
149
+
150
+ See repo-root `AGENTS.md` for the full verification sweep.
151
+
152
+ ## License
153
+
154
+ MIT
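
The Gotchas section notes that results scoring below 70 are still returned and that the tool does not retry on its own. One way a consumer could gate on the result is sketched below, with a bounded retry. The helper `generateWithGate` is hypothetical, and the result shape `{ validation: { score, passed } }` is an assumption based on the fields the README and scripts mention.

```javascript
// Hypothetical consumer-side helper. `generate` is any async function that
// resolves to { validation: { score, passed } }; that shape is an assumption.
async function generateWithGate(generate, { minScore = 70, maxAttempts = 3 } = {}) {
  if (!Number.isInteger(maxAttempts) || maxAttempts < 1) {
    throw new RangeError('maxAttempts must be a positive integer');
  }
  let best = null;
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    const result = await generate();
    const score = result?.validation?.score ?? 0;
    // An explicit `passed: false` fails the gate even when the score is high.
    const passed = result?.validation?.passed !== false && score >= minScore;
    if (passed) return { ok: true, result, score, attempts: attempt };
    // Remember the highest-scoring failure so the caller can inspect it.
    if (best === null || score > best.score) best = { result, score };
  }
  return { ok: false, result: best.result, score: best.score, attempts: maxAttempts };
}
```

Returning the best failed attempt rather than throwing lets the caller decide whether a low-scoring tree is still worth showing.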
package/package.json ADDED
@@ -0,0 +1,35 @@
1
+ {
2
+ "name": "@adia-ai/a2ui-mcp",
3
+ "version": "0.0.1",
4
+ "description": "AdiaUI A2UI MCP server. Exposes the compose engine over MCP with an engine selector for monolithic + zettel strategies.",
5
+ "type": "module",
6
+ "bin": {
7
+ "adiaui-mcp": "./server.js"
8
+ },
9
+ "files": [
10
+ "server.js",
11
+ "tools/",
12
+ "scripts/",
13
+ "personas/",
14
+ "README.md",
15
+ "CHANGELOG.md"
16
+ ],
17
+ "license": "MIT",
18
+ "publishConfig": {
19
+ "access": "public",
20
+ "registry": "https://registry.npmjs.org"
21
+ },
22
+ "repository": {
23
+ "type": "git",
24
+ "url": "git+https://github.com/adiahealth/gen-ui-kit.git",
25
+ "directory": "packages/a2ui/mcp"
26
+ },
27
+ "dependencies": {
28
+ "@modelcontextprotocol/sdk": "^1.29.0",
29
+ "@adia-ai/a2ui-compose": "^0.0.1",
30
+ "@adia-ai/a2ui-retrieval": "^0.0.1",
31
+ "@adia-ai/a2ui-validator": "^0.0.1",
32
+ "@adia-ai/a2ui-corpus": "^0.0.1",
33
+ "zod": "^3.24.0"
34
+ }
35
+ }
@@ -0,0 +1,107 @@
1
+ /**
2
+ * Dogfood test: Run 20 diverse intents through A2UI instant mode.
3
+ */
4
+ const { generateUI } = await import('../../compose/engine/generator.js');
5
+
6
+ const intents = [
7
+ 'user registration form with name, email, password',
8
+ 'product listing page with filters and grid',
9
+ 'email inbox with folders and message preview',
10
+ 'restaurant menu with categories and prices',
11
+ 'weather app with forecast and current conditions',
12
+ 'music player with playlist and controls',
13
+ 'image carousel with thumbnails',
14
+ 'file manager with folder tree and preview',
15
+ 'social media profile with posts and followers',
16
+ 'kanban board with tasks and columns',
17
+ 'calendar with events and day view',
18
+ 'shopping cart with items and checkout',
19
+ 'chat application with contacts and messages',
20
+ 'admin dashboard with users table and charts',
21
+ 'blog post editor with preview',
22
+ 'notification center with filters',
23
+ 'project settings with team members',
24
+ 'API documentation with endpoints list',
25
+ 'color theme builder with preview',
26
+ 'onboarding wizard with progress steps',
27
+ ];
28
+
29
+ const results = [];
30
+
31
+ for (const intent of intents) {
32
+ try {
33
+ const r = await generateUI({ intent, mode: 'instant' });
34
+ const components = r.messages?.[0]?.components || [];
35
+ const score = r.validation?.score ?? 0;
36
+ const errors = r.validation?.errors || [];
37
+ const warnings = r.validation?.warnings || [];
38
+
39
+ // Detect pattern match from suggestions and pipeline
40
+ const adaptedSuggestion = (r.suggestions || []).find(s => s.includes('Adapted from'));
41
+ const thinkingModeSuggestion = (r.suggestions || []).find(s => s.includes('Use "thinking" mode for LLM-powered'));
42
+
43
+ let matchType = 'direct'; // assume direct pattern match
44
+ let patternName = null;
45
+
46
+ if (adaptedSuggestion) {
47
+ matchType = 'partial';
48
+ const m = adaptedSuggestion.match(/Adapted from "([^"]+)"/);
49
+ patternName = m ? m[1] : 'unknown';
50
+ } else if (thinkingModeSuggestion) {
51
+ matchType = 'fallback';
52
+ patternName = null;
53
+ } else {
54
+ // Direct match - infer from pipeline confidence
55
+ const planReport = r.pipeline?.reports?.find(s => s.stage === 'plan');
56
+ const genReport = r.pipeline?.reports?.find(s => s.stage === 'generate');
57
+ if (planReport?.confidence >= 0.95) {
58
+ matchType = 'direct';
59
+ } else if (planReport?.confidence >= 0.5) {
60
+ matchType = 'partial';
61
+ }
62
+ }
63
+
64
+ // Component types
65
+ const compTypes = [...new Set(components.map(c => c.component))];
66
+
67
+ // Quality heuristics
68
+ const hasLayout = compTypes.some(t => ['Card', 'Section', 'Column', 'Row', 'Grid'].includes(t));
69
+ const hasInteractive = compTypes.some(t => ['Button', 'Input', 'Select', 'CheckBox', 'Toggle', 'Slider', 'TextArea'].includes(t));
70
+ const hasContent = compTypes.some(t => ['Text', 'Image', 'Avatar', 'Icon', 'Badge'].includes(t));
71
+
72
+ results.push({
73
+ intent,
74
+ componentCount: components.length,
75
+ score,
76
+ matchType,
77
+ patternName,
78
+ compTypes,
79
+ hasLayout,
80
+ hasInteractive,
81
+ hasContent,
82
+ errors,
83
+ warnings,
84
+ suggestions: r.suggestions || [],
85
+ success: components.length > 0 && score >= 70,
86
+ });
87
+ } catch (err) {
88
+ results.push({
89
+ intent,
90
+ componentCount: 0,
91
+ score: 0,
92
+ matchType: 'error',
93
+ patternName: null,
94
+ compTypes: [],
95
+ hasLayout: false,
96
+ hasInteractive: false,
97
+ hasContent: false,
98
+ errors: [err.message],
99
+ warnings: [],
100
+ suggestions: [],
101
+ success: false,
102
+ });
103
+ }
104
+ }
105
+
106
+ // Print results as JSON
107
+ console.log(JSON.stringify(results, null, 2));
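
The script above prints raw per-intent results only. The README's regression floor for dogfood ("20/20 intents at avg ≥ 95") could be checked over that array with a small helper such as the sketch below. `summarizeDogfood` and its option name are hypothetical and are not part of the package; the row fields (`success`, `score`, `matchType`) are the ones the script pushes.

```javascript
// Hypothetical summary helper over the `results` rows built above.
function summarizeDogfood(results, { minAvgScore = 95 } = {}) {
  const total = results.length;
  const succeeded = results.filter((r) => r.success).length;
  const scoreSum = results.reduce((sum, r) => sum + (r.score ?? 0), 0);
  const avgScore = total > 0 ? scoreSum / total : 0;

  // Count rows per match type (direct, partial, fallback, error).
  const byMatchType = {};
  for (const r of results) {
    byMatchType[r.matchType] = (byMatchType[r.matchType] || 0) + 1;
  }

  return {
    total,
    succeeded,
    avgScore,
    byMatchType,
    // Gate: every intent succeeded and the average clears the floor.
    gatePassed: total > 0 && succeeded === total && avgScore >= minAvgScore,
  };
}
```

A caller could set `process.exitCode = 1` when `gatePassed` is false so the script can serve as a CI check.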
@@ -0,0 +1,282 @@
1
+ #!/usr/bin/env node
2
+ /**
3
+ * Run one or both engines through the V2 harness and emit side-by-side artifacts.
4
+ *
5
+ * Replaces the older `scripts/zettel-eval-diff.mjs`. Lives under a2ui-mcp because
6
+ * cross-engine evaluation is an MCP-consumer concern (the MCP surface is what
7
+ * exposes multiple engines; this script is the CLI form of that comparison).
8
+ *
9
+ * Output: evals/mcp/runs/<ISO-timestamp>/
10
+ * mcp.json — V2 results for monolithic-pattern engine (if engine ∈ {mcp, all})
11
+ * zettel.json — V2 results for fragment-graph engine (if engine ∈ {zettel, all})
12
+ * diff.md — per-intent table + aggregate summary (always)
13
+ *
14
+ * Usage:
15
+ * node packages/a2ui/mcp/scripts/eval-diff.mjs # both engines
16
+ * node packages/a2ui/mcp/scripts/eval-diff.mjs --engine all # both engines (explicit)
17
+ * node packages/a2ui/mcp/scripts/eval-diff.mjs --engine mcp # monolithic only
18
+ * node packages/a2ui/mcp/scripts/eval-diff.mjs --engine zettel # fragment-graph only
19
+ * node packages/a2ui/mcp/scripts/eval-diff.mjs --limit 20
20
+ * node packages/a2ui/mcp/scripts/eval-diff.mjs --domain forms
21
+ */
22
+ import '../../../../scripts/load-env.mjs';
23
+
24
+ import { mkdir, writeFile } from 'node:fs/promises';
25
+ import { join, dirname } from 'node:path';
26
+ import { fileURLToPath } from 'node:url';
27
+
28
+ import { generateUI } from '../../compose/engine/generator.js';
29
+ import { generateZettel } from '../../compose/engines/zettel/generator-adapter.js';
30
+ import { runHarnessV2 } from '../../compose/evals/harness.mjs';
31
+ import { validateSemantics } from '../../validator/semantic/index.js';
32
+
33
+ const __dirname = dirname(fileURLToPath(import.meta.url));
34
+ const repoRoot = join(__dirname, '..', '..', '..', '..');
35
+
36
+ // ── Args ──
37
+ const args = process.argv.slice(2);
38
+ const opt = (k) => {
39
+ const i = args.indexOf(`--${k}`);
40
+ return i >= 0 ? args[i + 1] : undefined;
41
+ };
42
+ const engine = opt('engine') || 'all';
43
+ const limit = opt('limit') ? Number(opt('limit')) : undefined;
44
+ const domain = opt('domain');
45
+ // Shadow-mode semantic validator (Phase 1). Opt-in; zero effect on gating.
46
+ const semanticEnabled = args.includes('--semantic');
47
+
48
+ if (!['mcp', 'zettel', 'all'].includes(engine)) {
49
+ console.error(`[eval-diff] --engine must be one of: mcp | zettel | all (got: ${engine})`);
50
+ process.exit(2);
51
+ }
52
+
53
+ const runMcp = engine === 'mcp' || engine === 'all';
54
+ const runZettel = engine === 'zettel' || engine === 'all';
55
+
56
+ // ── MCP adapter: use the top-level patternName exposed by generateInstant ──
57
+ // Shadow-mode capture: when --semantic is set, remember the emitted messages
58
+ // per intent so we can judge them after the structural harness finishes.
59
+ const capturedMessages = new Map(); // key: `${label}:${intent}` → messages
60
+
61
+ async function generateMcp({ intent, mode }) {
62
+ const result = await generateUI({ intent, mode: mode || 'instant' });
63
+ const patternName = result.patternName ?? null;
64
+ const strategy = result.strategy || (patternName ? 'pattern-match' : 'fallback');
65
+ if (semanticEnabled && Array.isArray(result.messages) && result.messages.length > 0) {
66
+ capturedMessages.set(`mcp:${intent}`, result.messages);
67
+ }
68
+ return {
69
+ ...result,
70
+ strategy,
71
+ retrieval: {
72
+ hit: !!patternName,
73
+ rank: patternName ? 1 : null, // mcp picks top-1 only; no candidate list surfaced
74
+ candidate: patternName,
75
+ },
76
+ };
77
+ }
78
+
79
+ async function generateZettelCapture({ intent, mode }) {
80
+ const result = await generateZettel({ intent, mode });
81
+ if (semanticEnabled && Array.isArray(result.messages) && result.messages.length > 0) {
82
+ capturedMessages.set(`zettel:${intent}`, result.messages);
83
+ }
84
+ return result;
85
+ }
86
+
87
+ // ── Run ──
88
+ let mcp = null;
89
+ let zettel = null;
90
+
91
+ if (runMcp) {
92
+ console.error(`[eval-diff] running mcp (monolithic) harness…`);
93
+ mcp = await runHarnessV2({
94
+ generate: generateMcp,
95
+ domain,
96
+ limit,
97
+ mode: 'instant',
98
+ label: 'mcp',
99
+ });
100
+ console.error(` coverage=${mcp.coverage}% emitted=${mcp.emitted}/${mcp.total} avgScore=${mcp.avgScoreWhenEmitted}`);
101
+ }
102
+
103
+ if (runZettel) {
104
+ console.error(`[eval-diff] running zettel (fragment-graph) harness…`);
105
+ zettel = await runHarnessV2({
106
+ generate: generateZettelCapture,
107
+ domain,
108
+ limit,
109
+ mode: 'instant',
110
+ label: 'zettel',
111
+ });
112
+ console.error(` coverage=${zettel.coverage}% emitted=${zettel.emitted}/${zettel.total} avgScore=${zettel.avgScoreWhenEmitted}`);
113
+ }
114
+
115
+ // ── Shadow-mode semantic validation (Phase 1) ──
116
+ // Opt-in via --semantic. Annotates per-intent rows + aggregates with
117
+ // semanticScore/verdict/combinedScore. DOES NOT affect row.pass, passRate,
118
+ // or avgScoreWhenEmitted — those remain structural-only.
119
+ async function annotateSemantic(runObj, label) {
120
+ if (!runObj || !semanticEnabled) return;
121
+ let semSum = 0;
122
+ let semN = 0;
123
+ let combSum = 0;
124
+ let combN = 0;
125
+ let tokensIn = 0;
126
+ let tokensOut = 0;
127
+ let cached = 0;
128
+ let errors = 0;
129
+ const verdictBreakdown = {};
130
+ for (const row of runObj.results) {
131
+ if (!row.messagesEmitted) continue;
132
+ const msgs = capturedMessages.get(`${label}:${row.intent}`);
133
+ if (!msgs) continue;
134
+ try {
135
+ const v = await validateSemantics(
136
+ { intent: row.intent, messages: msgs },
137
+ { cache: true, timeoutMs: 15000 },
138
+ );
139
+ row.semanticScore = v.score;
140
+ row.semanticVerdict = v.verdict;
141
+ row.semanticRationale = v.rationale;
142
+ row.semanticAxes = v.axes;
143
+ row.semanticCost = v.cost;
144
+ row.rubricVersion = v.rubricVersion;
145
+ const structural = row.validationScore ?? 0;
146
+ row.combinedScore = Math.round(0.6 * structural + 0.4 * v.score);
147
+ // NOTE: row.pass intentionally NOT updated — shadow mode only.
148
+ if (!v.error) {
149
+ semSum += v.score;
150
+ semN += 1;
151
+ combSum += row.combinedScore;
152
+ combN += 1;
153
+ verdictBreakdown[v.verdict] = (verdictBreakdown[v.verdict] || 0) + 1;
154
+ tokensIn += v.cost?.inputTokens || 0;
155
+ tokensOut += v.cost?.outputTokens || 0;
156
+ if (v.cost?.cached) cached += 1;
157
+ } else {
158
+ errors += 1;
159
+ }
160
+ } catch (e) {
161
+ errors += 1;
162
+ row.semanticError = e.message;
163
+ }
164
+ }
165
+ runObj.semantic = {
166
+ enabled: true,
167
+ mode: 'shadow',
168
+ judged: semN,
169
+ errors,
170
+ cached,
171
+ avgSemanticScore: semN ? Math.round(semSum / semN) : null,
172
+ avgCombinedScore: combN ? Math.round(combSum / combN) : null,
173
+ verdictBreakdown,
174
+ tokens: { input: tokensIn, output: tokensOut },
175
+ rubricVersion: 'v1',
176
+ };
177
+ console.error(`[semantic:${label}] judged=${semN} avgSem=${runObj.semantic.avgSemanticScore} avgCombined=${runObj.semantic.avgCombinedScore} cached=${cached} errors=${errors} tokens=${tokensIn}+${tokensOut}`);
178
+ }
179
+
180
+ if (semanticEnabled) {
181
+ if (!process.env.ANTHROPIC_API_KEY) {
182
+ console.error('[eval-diff] --semantic requested but ANTHROPIC_API_KEY missing; skipping.');
183
+ } else {
184
+ console.error(`[eval-diff] running semantic validator (shadow mode)…`);
185
+ if (mcp) await annotateSemantic(mcp, 'mcp');
186
+ if (zettel) await annotateSemantic(zettel, 'zettel');
187
+ }
188
+ }
189
+
190
+ // ── Write artifacts ──
191
+ const stamp = new Date().toISOString().replace(/[:.]/g, '-');
192
+ const outDir = join(repoRoot, 'evals', 'mcp', 'runs', stamp);
193
+ await mkdir(outDir, { recursive: true });
194
+ if (mcp) await writeFile(join(outDir, 'mcp.json'), JSON.stringify(mcp, null, 2));
195
+ if (zettel) await writeFile(join(outDir, 'zettel.json'), JSON.stringify(zettel, null, 2));
196
+
197
+ // ── Build diff.md ──
198
+ function fmt(v) { return v == null ? '—' : String(v); }
199
+ function winner(a, b) {
200
+ const sa = a.messagesEmitted ? (a.validationScore ?? 0) : -1;
201
+ const sb = b.messagesEmitted ? (b.validationScore ?? 0) : -1;
202
+ if (sa === -1 && sb === -1) return 'both-miss';
203
+ if (sa === sb) return 'tie';
204
+ return sa > sb ? 'mcp' : 'zettel';
205
+ }
206
+
207
+ let md = '';
208
+ md += `# Engine Eval ${mcp && zettel ? 'Diff' : 'Report'}\n\n`;
209
+ md += `- Run: \`${stamp}\`\n`;
210
+ md += `- Engine(s): ${engine}\n`;
211
+ md += `- Intents: ${(mcp || zettel).total}${domain ? ` (domain: ${domain})` : ''}${limit ? ` (limit: ${limit})` : ''}\n`;
212
+ md += `- Mode: instant\n\n`;
213
+
214
+ md += `## Aggregates\n\n`;
215
+ if (mcp && zettel) {
216
+ md += `| metric | mcp | zettel |\n|---|---:|---:|\n`;
217
+ md += `| coverage % | ${mcp.coverage} | ${zettel.coverage} |\n`;
218
+ md += `| emitted | ${mcp.emitted}/${mcp.total} | ${zettel.emitted}/${zettel.total} |\n`;
219
+ md += `| avgScore (emitted only) | ${mcp.avgScoreWhenEmitted} | ${zettel.avgScoreWhenEmitted} |\n`;
220
+ md += `| avgF1 (emitted only) | ${mcp.avgF1WhenEmitted} | ${zettel.avgF1WhenEmitted} |\n`;
221
+ md += `| pass rate % | ${mcp.passRate} | ${zettel.passRate} |\n`;
222
+ md += `| retrieval MRR | ${fmt(mcp.retrievalMRR)} | ${fmt(zettel.retrievalMRR)} |\n\n`;
223
+ } else {
224
+ const e = mcp || zettel;
225
+ const label = mcp ? 'mcp' : 'zettel';
226
+ md += `| metric | ${label} |\n|---|---:|\n`;
227
+ md += `| coverage % | ${e.coverage} |\n`;
228
+ md += `| emitted | ${e.emitted}/${e.total} |\n`;
229
+ md += `| avgScore (emitted only) | ${e.avgScoreWhenEmitted} |\n`;
230
+ md += `| avgF1 (emitted only) | ${e.avgF1WhenEmitted} |\n`;
231
+ md += `| pass rate % | ${e.passRate} |\n`;
232
+ md += `| retrieval MRR | ${fmt(e.retrievalMRR)} |\n\n`;
233
+ }
234
+
235
+ if (mcp && zettel) {
236
+ const zByIntent = new Map(zettel.results.map((r) => [r.id, r]));
237
+ const rows = mcp.results.map((m) => {
238
+ const z = zByIntent.get(m.id);
239
+ return { m, z, winner: winner(m, z) };
240
+ });
241
+ const counts = rows.reduce((a, r) => { a[r.winner] = (a[r.winner] || 0) + 1; return a; }, {});
242
+
243
+ md += `## Winner breakdown\n\n`;
244
+ md += `- mcp wins: ${counts.mcp || 0}\n`;
245
+ md += `- zettel wins: ${counts.zettel || 0}\n`;
246
+ md += `- ties: ${counts.tie || 0}\n`;
247
+ md += `- both missed: ${counts['both-miss'] || 0}\n\n`;
248
+
249
+ md += `## Strategy breakdown\n\n`;
250
+ md += `**mcp**: ` + Object.entries(mcp.strategyBreakdown).map(([k, v]) => `${k}=${v}`).join(', ') + `\n\n`;
251
+ md += `**zettel**: ` + Object.entries(zettel.strategyBreakdown).map(([k, v]) => `${k}=${v}`).join(', ') + `\n\n`;
252
+
253
+ md += `## Per-intent\n\n`;
254
+ md += `| id | domain | intent | mcp score | mcp F1 | zettel score | zettel F1 | winner | zettel strategy |\n`;
255
+ md += `|---|---|---|---:|---:|---:|---:|---|---|\n`;
256
+ for (const { m, z, winner: w } of rows) {
257
+ const intent = (m.intent || '').slice(0, 48).replace(/\|/g, '\\|');
258
+ md += `| ${m.id} | ${fmt(m.domain)} | ${intent} | ${fmt(m.validationScore)} | ${fmt(m.componentF1)} | ${fmt(z.validationScore)} | ${fmt(z.componentF1)} | ${w} | ${fmt(z.strategy)} |\n`;
259
+ }
260
+
261
+ await writeFile(join(outDir, 'diff.md'), md);
262
+ console.error(`\n[eval-diff] wrote ${outDir}`);
263
+ console.error(` mcp wins: ${counts.mcp || 0}`);
264
+ console.error(` zettel wins: ${counts.zettel || 0}`);
265
+ console.error(` ties: ${counts.tie || 0}`);
266
+ console.error(` both missed: ${counts['both-miss'] || 0}`);
267
+ } else {
268
+ const e = mcp || zettel;
269
+ const label = mcp ? 'mcp' : 'zettel';
270
+ md += `## Strategy breakdown\n\n`;
271
+ md += `**${label}**: ` + Object.entries(e.strategyBreakdown).map(([k, v]) => `${k}=${v}`).join(', ') + `\n\n`;
272
+ md += `## Per-intent\n\n`;
273
+ md += `| id | domain | intent | score | F1 | strategy |\n|---|---|---|---:|---:|---|\n`;
274
+ for (const r of e.results) {
275
+ const intent = (r.intent || '').slice(0, 48).replace(/\|/g, '\\|');
276
+ md += `| ${r.id} | ${fmt(r.domain)} | ${intent} | ${fmt(r.validationScore)} | ${fmt(r.componentF1)} | ${fmt(r.strategy)} |\n`;
277
+ }
278
+ await writeFile(join(outDir, 'diff.md'), md);
279
+ console.error(`\n[eval-diff] wrote ${outDir}`);
280
+ }
281
+
282
+ console.log(outDir);
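
As a worked example of the shadow-mode weighting used in `annotateSemantic` above (60% structural, 40% semantic), the sketch below annotates a single row. The helper name `annotateShadow` is hypothetical; the formula and the rule that `pass` stays untouched come from the script.

```javascript
// Hypothetical helper mirroring the weighting in annotateSemantic.
function annotateShadow(row, semanticScore) {
  const structural = row.validationScore ?? 0;
  return {
    ...row, // `pass` is carried over unchanged: shadow mode never alters gating
    semanticScore,
    combinedScore: Math.round(0.6 * structural + 0.4 * semanticScore),
  };
}

// A structurally strong row with a weak semantic score:
// 0.6 * 90 + 0.4 * 20 = 54 + 8 = 62, and the row still passes.
const annotated = annotateShadow({ id: 'x1', validationScore: 90, pass: true }, 20);
console.log(annotated.combinedScore, annotated.pass);
```

The example shows why the combined score is reported separately: a row can keep `pass: true` on structural grounds while its combined score falls well below the structural one.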