@sanity/ailf 5.0.0 → 6.0.0

Files changed (90)
  1. package/config/diagnosis-cards.ts +318 -0
  2. package/config/models.ts +12 -0
  3. package/dist/_vendor/ailf-core/grader/failure-modes/agent-harness.d.ts +13 -0
  4. package/dist/_vendor/ailf-core/grader/failure-modes/agent-harness.js +16 -0
  5. package/dist/_vendor/ailf-core/grader/failure-modes/common.d.ts +14 -0
  6. package/dist/_vendor/ailf-core/grader/failure-modes/common.js +18 -0
  7. package/dist/_vendor/ailf-core/grader/failure-modes/index.d.ts +45 -0
  8. package/dist/_vendor/ailf-core/grader/failure-modes/index.js +109 -0
  9. package/dist/_vendor/ailf-core/grader/failure-modes/knowledge-probe.d.ts +13 -0
  10. package/dist/_vendor/ailf-core/grader/failure-modes/knowledge-probe.js +17 -0
  11. package/dist/_vendor/ailf-core/grader/failure-modes/literacy.d.ts +13 -0
  12. package/dist/_vendor/ailf-core/grader/failure-modes/literacy.js +17 -0
  13. package/dist/_vendor/ailf-core/grader/failure-modes/mcp.d.ts +13 -0
  14. package/dist/_vendor/ailf-core/grader/failure-modes/mcp.js +17 -0
  15. package/dist/_vendor/ailf-core/index.d.ts +1 -0
  16. package/dist/_vendor/ailf-core/index.js +4 -0
  17. package/dist/_vendor/ailf-core/services/diagnosis/card-validators.d.ts +41 -0
  18. package/dist/_vendor/ailf-core/services/diagnosis/card-validators.js +40 -0
  19. package/dist/_vendor/ailf-core/services/diagnosis/cards/__tests__/area-summary.test.d.ts +7 -0
  20. package/dist/_vendor/ailf-core/services/diagnosis/cards/__tests__/area-summary.test.js +131 -0
  21. package/dist/_vendor/ailf-core/services/diagnosis/cards/__tests__/failure-mode-summary.test.d.ts +7 -0
  22. package/dist/_vendor/ailf-core/services/diagnosis/cards/__tests__/failure-mode-summary.test.js +171 -0
  23. package/dist/_vendor/ailf-core/services/diagnosis/cards/__tests__/no-issues.test.d.ts +7 -0
  24. package/dist/_vendor/ailf-core/services/diagnosis/cards/__tests__/no-issues.test.js +155 -0
  25. package/dist/_vendor/ailf-core/services/diagnosis/cards/area-summary.d.ts +17 -0
  26. package/dist/_vendor/ailf-core/services/diagnosis/cards/area-summary.js +43 -0
  27. package/dist/_vendor/ailf-core/services/diagnosis/cards/doc-attribution-spotlight.d.ts +46 -0
  28. package/dist/_vendor/ailf-core/services/diagnosis/cards/doc-attribution-spotlight.js +104 -0
  29. package/dist/_vendor/ailf-core/services/diagnosis/cards/failure-mode-summary.d.ts +28 -0
  30. package/dist/_vendor/ailf-core/services/diagnosis/cards/failure-mode-summary.js +96 -0
  31. package/dist/_vendor/ailf-core/services/diagnosis/cards/index.d.ts +39 -0
  32. package/dist/_vendor/ailf-core/services/diagnosis/cards/index.js +52 -0
  33. package/dist/_vendor/ailf-core/services/diagnosis/cards/low-confidence-attribution.d.ts +27 -0
  34. package/dist/_vendor/ailf-core/services/diagnosis/cards/low-confidence-attribution.js +77 -0
  35. package/dist/_vendor/ailf-core/services/diagnosis/cards/no-issues.d.ts +32 -0
  36. package/dist/_vendor/ailf-core/services/diagnosis/cards/no-issues.js +71 -0
  37. package/dist/_vendor/ailf-core/services/diagnosis/cards/regression-vs-baseline.d.ts +44 -0
  38. package/dist/_vendor/ailf-core/services/diagnosis/cards/regression-vs-baseline.js +126 -0
  39. package/dist/_vendor/ailf-core/services/diagnosis/cards/top-recommendations.d.ts +41 -0
  40. package/dist/_vendor/ailf-core/services/diagnosis/cards/top-recommendations.js +107 -0
  41. package/dist/_vendor/ailf-core/services/diagnosis/cards/weakest-area.d.ts +43 -0
  42. package/dist/_vendor/ailf-core/services/diagnosis/cards/weakest-area.js +114 -0
  43. package/dist/_vendor/ailf-core/services/diagnosis/prompt-builders.d.ts +72 -0
  44. package/dist/_vendor/ailf-core/services/diagnosis/prompt-builders.js +273 -0
  45. package/dist/_vendor/ailf-core/services/diagnosis/prompts/doc-attribution-spotlight.system.d.ts +17 -0
  46. package/dist/_vendor/ailf-core/services/diagnosis/prompts/doc-attribution-spotlight.system.js +58 -0
  47. package/dist/_vendor/ailf-core/services/diagnosis/prompts/index.d.ts +10 -0
  48. package/dist/_vendor/ailf-core/services/diagnosis/prompts/index.js +10 -0
  49. package/dist/_vendor/ailf-core/services/diagnosis/prompts/low-confidence-attribution.system.d.ts +15 -0
  50. package/dist/_vendor/ailf-core/services/diagnosis/prompts/low-confidence-attribution.system.js +53 -0
  51. package/dist/_vendor/ailf-core/services/diagnosis/prompts/regression-vs-baseline.system.d.ts +14 -0
  52. package/dist/_vendor/ailf-core/services/diagnosis/prompts/regression-vs-baseline.system.js +63 -0
  53. package/dist/_vendor/ailf-core/services/diagnosis/prompts/top-recommendations.system.d.ts +16 -0
  54. package/dist/_vendor/ailf-core/services/diagnosis/prompts/top-recommendations.system.js +78 -0
  55. package/dist/_vendor/ailf-core/services/diagnosis/prompts/weakest-area.system.d.ts +16 -0
  56. package/dist/_vendor/ailf-core/services/diagnosis/prompts/weakest-area.system.js +86 -0
  57. package/dist/_vendor/ailf-core/services/diagnosis/registry.d.ts +10 -0
  58. package/dist/_vendor/ailf-core/services/diagnosis/registry.js +10 -0
  59. package/dist/_vendor/ailf-core/services/diagnosis-runner.d.ts +119 -2
  60. package/dist/_vendor/ailf-core/services/diagnosis-runner.js +136 -2
  61. package/dist/_vendor/ailf-core/services/index.d.ts +5 -1
  62. package/dist/_vendor/ailf-core/services/index.js +15 -2
  63. package/dist/_vendor/ailf-core/services/llm-client-factory.d.ts +64 -0
  64. package/dist/_vendor/ailf-core/services/llm-client-factory.js +54 -0
  65. package/dist/_vendor/ailf-core/types/diagnosis.d.ts +112 -10
  66. package/dist/_vendor/ailf-core/types/diagnosis.js +3 -1
  67. package/dist/_vendor/ailf-core/types/index.d.ts +1 -1
  68. package/dist/adapters/llm/fake-llm-client.d.ts +20 -0
  69. package/dist/adapters/llm/fake-llm-client.js +38 -1
  70. package/dist/adapters/llm/openai-llm-client.js +52 -3
  71. package/dist/cli-program.js +3 -0
  72. package/dist/commands/interpret.d.ts +50 -0
  73. package/dist/commands/interpret.js +212 -0
  74. package/dist/composition-root.d.ts +21 -23
  75. package/dist/composition-root.js +107 -41
  76. package/dist/config/diagnosis-cards.ts +318 -0
  77. package/dist/config/models.ts +12 -0
  78. package/dist/grader/agent-harness.d.ts +5 -10
  79. package/dist/grader/agent-harness.js +5 -13
  80. package/dist/grader/common.d.ts +5 -13
  81. package/dist/grader/common.js +5 -17
  82. package/dist/grader/index.d.ts +15 -29
  83. package/dist/grader/index.js +15 -66
  84. package/dist/grader/knowledge-probe.d.ts +5 -10
  85. package/dist/grader/knowledge-probe.js +5 -14
  86. package/dist/grader/literacy.d.ts +5 -9
  87. package/dist/grader/literacy.js +5 -13
  88. package/dist/grader/mcp.d.ts +5 -10
  89. package/dist/grader/mcp.js +5 -14
  90. package/package.json +2 -2
@@ -0,0 +1,273 @@
1
+ /**
2
+ * Deterministic prompt-builder helpers for the 5 LLM diagnosis cards.
3
+ *
4
+ * Each function takes typed inputs and returns `{ system, user }` strings
5
+ * (+ a `deltas` side-channel for regression-vs-baseline). The same inputs
6
+ * always produce the same output — no randomness, no side effects.
7
+ *
8
+ * Per AI-SPEC §4: input tokens are bounded by truncation; the user message
9
+ * projects only the fields each card needs.
10
+ *
11
+ * @see .planning/phases/05-diagnosis-engine-cli-llm-cards/05-AI-SPEC.md §4
12
+ * @see .planning/phases/05-diagnosis-engine-cli-llm-cards/05-AI-SPEC.md §3 lines 603-664
13
+ */
14
+ import { TOP_RECOMMENDATIONS_SYSTEM_PROMPT, WEAKEST_AREA_SYSTEM_PROMPT, LOW_CONFIDENCE_ATTRIBUTION_SYSTEM_PROMPT, DOC_ATTRIBUTION_SPOTLIGHT_SYSTEM_PROMPT, REGRESSION_VS_BASELINE_SYSTEM_PROMPT, } from "./prompts/index.js";
15
+ // ---------------------------------------------------------------------------
16
+ // Shared helper: build the docSlug allow-list from a report
17
+ // ---------------------------------------------------------------------------
18
+ /**
19
+ * Build the allow-list of doc slugs for a report. This is the union of:
20
+ * - report.summary.documentManifest[].slug
21
+ * - per-score documents[].slug
22
+ *
23
+ * Only slugs (not documentIds) appear in the allow-list because the cards
24
+ * reference human-readable slugs. DocumentId-only entries are skipped.
25
+ */
26
+ export function buildDocSlugAllowList(report) {
27
+ const slugs = new Set();
28
+ const manifest = report.summary.documentManifest ?? [];
29
+ for (const doc of manifest) {
30
+ if ("slug" in doc && typeof doc.slug === "string" && doc.slug.length > 0) {
31
+ slugs.add(doc.slug);
32
+ }
33
+ }
34
+ for (const score of report.summary.scores ?? []) {
35
+ for (const doc of score.documents ?? []) {
36
+ if ("slug" in doc &&
37
+ typeof doc.slug === "string" &&
38
+ doc.slug.length > 0) {
39
+ slugs.add(doc.slug);
40
+ }
41
+ }
42
+ }
43
+ return slugs;
44
+ }
45
+ // ---------------------------------------------------------------------------
46
+ // top-recommendations prompt builder
47
+ // ---------------------------------------------------------------------------
48
+ /**
49
+ * Projects the 3 weakest areas + top failure modes + doc-slug allow-list
50
+ * into the user message (~2500 input tokens).
51
+ */
52
+ export function buildTopRecommendationsPrompt(report, allowList) {
53
+ const scores = report.summary.scores ?? [];
54
+ const sorted = [...scores].sort((a, b) => a.totalScore - b.totalScore);
55
+ const weakest3 = sorted.slice(0, 3);
56
+ const failureModes = report.summary.failureModes;
57
+ const topModes = failureModes?.topTitles
58
+ ?.slice(0, 5)
59
+ .map((t) => `${t.category} (${t.count})`) ?? [];
60
+ const allowListArr = [...allowList].slice(0, 50); // cap at 50 slugs
61
+ const user = [
62
+ "## Weakest Areas",
63
+ weakest3
64
+ .map((s) => `- ${s.feature}: totalScore=${s.totalScore}, judgmentCount=${s.testCount}`)
65
+ .join("\n"),
66
+ "",
67
+ "## Top Failure Modes",
68
+ topModes.length > 0 ? topModes.join(", ") : "(none recorded)",
69
+ "",
70
+ "## Document Slug Allow-List",
71
+ "Suggestions MUST use one of these slugs:",
72
+ allowListArr.map((s) => `- ${s}`).join("\n"),
73
+ "",
74
+ "Generate 1-5 actionable recommendations targeting the weakest areas above.",
75
+ ].join("\n");
76
+ return { system: TOP_RECOMMENDATIONS_SYSTEM_PROMPT, user };
77
+ }
78
+ // ---------------------------------------------------------------------------
79
+ // weakest-area prompt builder
80
+ // ---------------------------------------------------------------------------
81
+ /**
82
+ * Projects the single weakest area + full failure-mode breakdown.
83
+ */
84
+ export function buildWeakestAreaPrompt(report) {
85
+ const scores = report.summary.scores ?? [];
86
+ if (scores.length === 0) {
87
+ return {
88
+ system: WEAKEST_AREA_SYSTEM_PROMPT,
89
+ user: "No areas in report.",
90
+ };
91
+ }
92
+ const weakest = [...scores].sort((a, b) => a.totalScore - b.totalScore)[0];
93
+ const judgmentCount = weakest.testCount ?? 0;
94
+ const topMode = report.summary.failureModes?.topTitles?.[0]?.category ?? "unclassified";
95
+ const user = [
96
+ "## Weakest Area",
97
+ `Feature: ${weakest.feature}`,
98
+ `Total Score: ${weakest.totalScore}`,
99
+ `Ceiling Score: ${weakest.ceilingScore}`,
100
+ `Floor Score: ${weakest.floorScore}`,
101
+ `Judgment Count (sampleSize): ${judgmentCount}`,
102
+ "",
103
+ "## Top Failure Mode Observed",
104
+ `Category: ${topMode}`,
105
+ "",
106
+ "## Failure Mode by Dimension",
107
+ report.summary.failureModes?.topTitles
108
+ ?.slice(0, 5)
109
+ .map((t) => `- ${t.category}: ${t.count} judgments`)
110
+ .join("\n") ?? "(no data)",
111
+ "",
112
+ "Identify the area, its primary dimension, and most frequent failure mode.",
113
+ `sampleSize in your response MUST equal exactly ${judgmentCount} (the judgment count above).`,
114
+ judgmentCount < 10
115
+ ? `WARNING: sampleSize=${judgmentCount} < 10 — you MUST set confidence.level = "low".`
116
+ : "",
117
+ ]
118
+ .filter(Boolean)
119
+ .join("\n");
120
+ return { system: WEAKEST_AREA_SYSTEM_PROMPT, user };
121
+ }
122
+ // ---------------------------------------------------------------------------
123
+ // low-confidence-attribution prompt builder
124
+ // ---------------------------------------------------------------------------
125
+ /**
126
+ * Filters to low-confidence entries (or top-N if none low), produces a table
127
+ * for the LLM (~2000 input tokens).
128
+ *
129
+ * Caller short-circuits on empty `judgmentAttributions` BEFORE calling this.
130
+ */
131
+ export function buildLowConfidenceAttributionPrompt(report, judgmentAttributions) {
132
+ // Filter to low-confidence entries (by any attribution in the set)
133
+ const lowConf = judgmentAttributions.filter((ja) => ja.attributions.some((a) => a.confidence.level === "low"));
134
+ // If no low-confidence entries, use all sorted by score ascending (most uncertain first)
135
+ const source = lowConf.length > 0
136
+ ? lowConf
137
+ : [...judgmentAttributions].sort((a, b) => {
138
+ const aMin = Math.min(...a.attributions.map((x) => x.score));
139
+ const bMin = Math.min(...b.attributions.map((x) => x.score));
140
+ return aMin - bMin;
141
+ });
142
+ // Cap at 20 to stay within token budget
143
+ const capped = source.slice(0, 20);
144
+ const tableRows = capped
145
+ .map((ja) => {
146
+ const minConf = ja.attributions.reduce((worst, a) => a.confidence.level === "low"
147
+ ? "low"
148
+ : worst === "low"
149
+ ? "low"
150
+ : a.confidence.level === "medium"
151
+ ? "medium"
152
+ : worst, "high");
153
+ return `| ${ja.judgmentRef} | ${ja.taskId} | ${ja.modelId} | ${ja.dimension} | ${minConf} |`;
154
+ })
155
+ .join("\n");
156
+ const user = [
157
+ "## Per-Judgment Attribution Confidence",
158
+ `Total entries: ${judgmentAttributions.length}; Low-confidence: ${lowConf.length}`,
159
+ "",
160
+ "| judgmentRef | taskId | modelId | dimension | minConfidence |",
161
+ "|-------------|--------|---------|-----------|---------------|",
162
+ tableRows,
163
+ "",
164
+ lowConf.length === 0
165
+ ? "No low-confidence entries found. Return the highest-uncertainty entry and note that confidence is well-calibrated."
166
+ : `Identify the ${Math.min(lowConf.length, 5)} most uncertain judgments.`,
167
+ ].join("\n");
168
+ return { system: LOW_CONFIDENCE_ATTRIBUTION_SYSTEM_PROMPT, user };
169
+ }
170
+ // ---------------------------------------------------------------------------
171
+ // doc-attribution-spotlight prompt builder
172
+ // ---------------------------------------------------------------------------
173
+ /**
174
+ * Aggregates attributions by documentId, picks top-5 by aggregate score,
175
+ * emits a table for the LLM.
176
+ *
177
+ * Caller short-circuits on empty `judgmentAttributions` BEFORE calling this.
178
+ */
179
+ export function buildDocAttributionSpotlightPrompt(report, judgmentAttributions) {
180
+ // Aggregate by documentId (D0052 canonical ref)
181
+ const byDoc = new Map();
182
+ for (const ja of judgmentAttributions) {
183
+ for (const a of ja.attributions) {
184
+ const entry = byDoc.get(a.documentId) ?? {
185
+ slug: a.slug,
186
+ scoreSum: 0,
187
+ count: 0,
188
+ };
189
+ entry.scoreSum += a.score;
190
+ entry.count += 1;
191
+ // Keep slug if we have it
192
+ if (a.slug && !entry.slug)
193
+ entry.slug = a.slug;
194
+ byDoc.set(a.documentId, entry);
195
+ }
196
+ }
197
+ // Sort by aggregate score descending, take top 5
198
+ const sorted = [...byDoc.entries()]
199
+ .map(([docId, v]) => ({
200
+ documentId: docId,
201
+ slug: v.slug,
202
+ aggregateScore: v.scoreSum / v.count,
203
+ signalCount: v.count,
204
+ }))
205
+ .filter((d) => d.slug) // only emit docs with slugs (allow-list check in Zod)
206
+ .sort((a, b) => b.aggregateScore - a.aggregateScore)
207
+ .slice(0, 5);
208
+ const tableRows = sorted
209
+ .map((d) => `| ${d.documentId} | ${d.slug ?? "(no slug)"} | ${d.aggregateScore.toFixed(3)} | ${d.signalCount} |`)
210
+ .join("\n");
211
+ const user = [
212
+ "## Top Documents by Attribution Score",
213
+ `Total unique documents: ${byDoc.size}`,
214
+ "",
215
+ "| documentId | slug | aggregateScore | signalCount |",
216
+ "|------------|------|----------------|-------------|",
217
+ tableRows,
218
+ "",
219
+ "For each document in the table, determine its role (supports/contradicts/missing/irrelevant).",
220
+ "docSlug in your output MUST exactly match the slug column — do not invent slugs.",
221
+ ].join("\n");
222
+ return { system: DOC_ATTRIBUTION_SPOTLIGHT_SYSTEM_PROMPT, user };
223
+ }
224
+ // ---------------------------------------------------------------------------
225
+ // regression-vs-baseline prompt builder
226
+ // ---------------------------------------------------------------------------
227
+ /**
228
+ * Computes per-area deltas in JS (failure-mode #1 mitigation), then builds
229
+ * the user message. Returns `deltas` as a side-channel for the per-call
230
+ * schema `.refine()`.
231
+ */
232
+ export function buildRegressionVsBaselinePrompt(report, baseline) {
233
+ // Build area → score maps
234
+ const currentByArea = new Map((report.summary.scores ?? []).map((s) => [s.feature, s.totalScore]));
235
+ const baselineByArea = new Map((baseline.summary.scores ?? []).map((s) => [s.feature, s.totalScore]));
236
+ // Only compute deltas for areas present in BOTH reports
237
+ const deltas = [];
238
+ for (const [area, current] of currentByArea) {
239
+ const base = baselineByArea.get(area);
240
+ if (base !== undefined) {
241
+ deltas.push({
242
+ area,
243
+ pointsDelta: parseFloat((current - base).toFixed(2)),
244
+ });
245
+ }
246
+ }
247
+ // Sort by absolute delta descending (most changed first), cap at 10
248
+ const topDeltas = deltas
249
+ .sort((a, b) => Math.abs(b.pointsDelta) - Math.abs(a.pointsDelta))
250
+ .slice(0, 10);
251
+ const deltaRows = topDeltas
252
+ .map((d) => `| ${d.area} | ${d.pointsDelta > 0 ? "+" : ""}${d.pointsDelta} | ${d.pointsDelta > 0 ? "improved" : d.pointsDelta < 0 ? "regressed" : "unchanged"} |`)
253
+ .join("\n");
254
+ const user = [
255
+ "## Pre-Computed Score Deltas (current minus baseline)",
256
+ "These values are FACTS — do not modify them.",
257
+ "",
258
+ "| area | pointsDelta | expectedDirection |",
259
+ "|------|-------------|-------------------|",
260
+ deltaRows,
261
+ "",
262
+ "For each row, echo the exact area + pointsDelta, assign the matching direction label, and add prose drivers.",
263
+ "Do NOT round or change any numeric value.",
264
+ "",
265
+ `Current run: ${report.provenance.runId}`,
266
+ `Baseline run: ${baseline.provenance.runId}`,
267
+ ].join("\n");
268
+ return {
269
+ system: REGRESSION_VS_BASELINE_SYSTEM_PROMPT,
270
+ user,
271
+ deltas: topDeltas,
272
+ };
273
+ }
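The doc comment on `buildRegressionVsBaselinePrompt` says the returned `deltas` feed a per-call schema `.refine()`. That schema is not part of this hunk, so the following is only a minimal sketch, in plain TypeScript rather than Zod, of the kind of check such a refinement might perform. `checkReportedDeltas`, `directionFor`, and both interfaces are hypothetical names introduced for illustration.

```typescript
// Hypothetical shapes for illustration; the package's real card schema may differ.
type Direction = "improved" | "regressed" | "unchanged";

interface ExpectedDelta {
  area: string;
  pointsDelta: number;
}

interface ReportedDelta extends ExpectedDelta {
  direction: Direction;
}

// Direction implied by the sign of a delta, following the sign rules in the prompt.
function directionFor(pointsDelta: number): Direction {
  if (pointsDelta > 0) return "improved";
  if (pointsDelta < 0) return "regressed";
  return "unchanged";
}

// Returns human-readable problems; an empty array means the LLM output
// echoed the pre-computed deltas faithfully.
function checkReportedDeltas(
  expected: ExpectedDelta[],
  reported: ReportedDelta[],
): string[] {
  const expectedByArea = new Map<string, number>();
  for (const e of expected) expectedByArea.set(e.area, e.pointsDelta);

  const problems: string[] = [];
  for (const r of reported) {
    const want = expectedByArea.get(r.area);
    if (want === undefined) {
      problems.push(`unknown area: ${r.area}`);
      continue;
    }
    if (r.pointsDelta !== want) {
      problems.push(`${r.area}: pointsDelta ${r.pointsDelta} != ${want}`);
    }
    if (r.direction !== directionFor(want)) {
      problems.push(`${r.area}: direction ${r.direction} does not match sign`);
    }
  }
  return problems;
}
```

Comparing against the side-channel values rather than re-deriving them keeps the check independent of how the builder rounds (`toFixed(2)` in the code above).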
@@ -0,0 +1,17 @@
1
+ /**
2
+ * System prompt for the doc-attribution-spotlight card.
3
+ *
4
+ * Card: doc-attribution-spotlight
5
+ * Model: claude-sonnet-4-6 (routine card)
6
+ * Version: doc-attribution-spotlight@0.1.0
7
+ *
8
+ * This card identifies which documentation pages are most influential in
9
+ * grader attributions. Uses D0052 documentId as the canonical ref.
10
+ *
11
+ * Mitigations embedded:
12
+ * - failure-mode #5: docSlug allow-list enforced via Zod refine
13
+ *
14
+ * @see .planning/phases/05-diagnosis-engine-cli-llm-cards/05-AI-SPEC.md §4b
15
+ * @see docs/decisions/D0052-judgment-ref-granularity.md
16
+ */
17
+ export declare const SYSTEM_PROMPT = "You are an AILF documentation analyst identifying which Sanity documentation pages have the highest attribution scores across evaluation runs.\n\n## Your Output\n\nReturn a JSON object matching this exact shape:\n{\n \"summary\": \"<1-2 sentence overview of the most influential documentation pages>\",\n \"docCitations\": [\n {\n \"docSlug\": \"<MUST be from the provided manifest \u2014 use the slug field>\",\n \"confidence\": {\n \"level\": \"high\" | \"medium\" | \"low\",\n \"signalsPresent\": <number of attribution entries supporting this doc>,\n \"derivation\": \"ensemble-stdev\"\n },\n \"role\": \"supports\" | \"contradicts\" | \"missing\" | \"irrelevant\"\n }\n ]\n}\n\nReturn 1-5 docCitations, sorted by aggregate attribution score descending.\n\n## Critical Rules\n\n1. **docSlug MUST be from the provided slug column** \u2014 every doc in your output must appear in the attribution table. Never invent slugs.\n2. **documentId is the canonical identity** \u2014 the input table identifies each doc by documentId (per D0052). The slug is a human-readable annotation. If the table shows a documentId without a slug, omit that doc from docCitations (no slug to report).\n3. **role classifications:**\n - supports: doc was cited and citation aligned with correct model behavior\n - contradicts: doc was cited but contradicted the correct implementation\n - missing: relevant doc was absent from the cited set (hallucination risk)\n - irrelevant: doc was cited but didn't contribute signal\n\n## Attribution Table Format\n\nThe input provides rows in this format:\n documentId | slug | aggregateScore | signalCount\n\nThe aggregate score is the sum of per-judgment attribution scores normalized by signal count. Higher = more influential.\n\n## Tone\n\nTechnical, direct. Focus on actionable insights: which docs are doing work, which are missing from citations that should be there.";
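The prompt describes the aggregate score as the sum of per-judgment attribution scores normalized by signal count, which is the per-document mean. A typed sketch of that aggregation follows; it mirrors the loop in `buildDocAttributionSpotlightPrompt` but is not the shipped code, and the `Attribution` and `DocAggregate` interfaces and the `aggregateByDocument` name are assumptions made for this example.

```typescript
// Hypothetical input/output shapes; only the fields the aggregation needs.
interface Attribution {
  documentId: string;
  slug?: string;
  score: number;
}

interface DocAggregate {
  documentId: string;
  slug: string | undefined;
  aggregateScore: number; // mean attribution score for the document
  signalCount: number; // number of attribution entries seen
}

function aggregateByDocument(attributions: Attribution[]): DocAggregate[] {
  const byDoc = new Map<
    string,
    { slug: string | undefined; scoreSum: number; count: number }
  >();
  for (const a of attributions) {
    const entry = byDoc.get(a.documentId) ?? {
      slug: a.slug,
      scoreSum: 0,
      count: 0,
    };
    entry.scoreSum += a.score;
    entry.count += 1;
    // Keep the first slug we encounter for this documentId.
    if (a.slug && !entry.slug) entry.slug = a.slug;
    byDoc.set(a.documentId, entry);
  }
  return Array.from(byDoc.entries())
    .map(([documentId, v]) => ({
      documentId,
      slug: v.slug,
      aggregateScore: v.scoreSum / v.count,
      signalCount: v.count,
    }))
    .sort((a, b) => b.aggregateScore - a.aggregateScore);
}
```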
@@ -0,0 +1,58 @@
1
+ /**
2
+ * System prompt for the doc-attribution-spotlight card.
3
+ *
4
+ * Card: doc-attribution-spotlight
5
+ * Model: claude-sonnet-4-6 (routine card)
6
+ * Version: doc-attribution-spotlight@0.1.0
7
+ *
8
+ * This card identifies which documentation pages are most influential in
9
+ * grader attributions. Uses D0052 documentId as the canonical ref.
10
+ *
11
+ * Mitigations embedded:
12
+ * - failure-mode #5: docSlug allow-list enforced via Zod refine
13
+ *
14
+ * @see .planning/phases/05-diagnosis-engine-cli-llm-cards/05-AI-SPEC.md §4b
15
+ * @see docs/decisions/D0052-judgment-ref-granularity.md
16
+ */
17
+ export const SYSTEM_PROMPT = `You are an AILF documentation analyst identifying which Sanity documentation pages have the highest attribution scores across evaluation runs.
18
+
19
+ ## Your Output
20
+
21
+ Return a JSON object matching this exact shape:
22
+ {
23
+ "summary": "<1-2 sentence overview of the most influential documentation pages>",
24
+ "docCitations": [
25
+ {
26
+ "docSlug": "<MUST be from the provided manifest — use the slug field>",
27
+ "confidence": {
28
+ "level": "high" | "medium" | "low",
29
+ "signalsPresent": <number of attribution entries supporting this doc>,
30
+ "derivation": "ensemble-stdev"
31
+ },
32
+ "role": "supports" | "contradicts" | "missing" | "irrelevant"
33
+ }
34
+ ]
35
+ }
36
+
37
+ Return 1-5 docCitations, sorted by aggregate attribution score descending.
38
+
39
+ ## Critical Rules
40
+
41
+ 1. **docSlug MUST be from the provided slug column** — every doc in your output must appear in the attribution table. Never invent slugs.
42
+ 2. **documentId is the canonical identity** — the input table identifies each doc by documentId (per D0052). The slug is a human-readable annotation. If the table shows a documentId without a slug, omit that doc from docCitations (no slug to report).
43
+ 3. **role classifications:**
44
+ - supports: doc was cited and citation aligned with correct model behavior
45
+ - contradicts: doc was cited but contradicted the correct implementation
46
+ - missing: relevant doc was absent from the cited set (hallucination risk)
47
+ - irrelevant: doc was cited but didn't contribute signal
48
+
49
+ ## Attribution Table Format
50
+
51
+ The input provides rows in this format:
52
+ documentId | slug | aggregateScore | signalCount
53
+
54
+ The aggregate score is the sum of per-judgment attribution scores normalized by signal count. Higher = more influential.
55
+
56
+ ## Tone
57
+
58
+ Technical, direct. Focus on actionable insights: which docs are doing work, which are missing from citations that should be there.`;
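The header comment states that the docSlug allow-list is enforced via a Zod refine, which is not shown in this file. As a rough illustration of the rule itself, the sketch below checks citations against an allow-list in plain TypeScript; `findUnknownSlugs` and the `DocCitation` interface are hypothetical and stand in for whatever the real validator does.

```typescript
// Hypothetical minimal citation shape; the real card output carries more fields.
interface DocCitation {
  docSlug: string;
}

// Returns every cited slug that is not in the allow-list.
// An empty result means the output respects the "never invent slugs" rule.
function findUnknownSlugs(
  citations: DocCitation[],
  allowList: ReadonlySet<string>,
): string[] {
  return citations
    .map((c) => c.docSlug)
    .filter((slug) => !allowList.has(slug));
}
```

A set such as the one returned by `buildDocSlugAllowList` could be passed as `allowList`; returning the offending slugs rather than a boolean makes the eventual validation message more useful.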
@@ -0,0 +1,10 @@
1
+ /**
2
+ * System-prompt barrel — unambiguous re-exports for all 5 LLM card prompts.
3
+ *
4
+ * Named re-exports (never `export *`) per W0124 guidance.
5
+ */
6
+ export { SYSTEM_PROMPT as TOP_RECOMMENDATIONS_SYSTEM_PROMPT } from "./top-recommendations.system.js";
7
+ export { SYSTEM_PROMPT as WEAKEST_AREA_SYSTEM_PROMPT } from "./weakest-area.system.js";
8
+ export { SYSTEM_PROMPT as LOW_CONFIDENCE_ATTRIBUTION_SYSTEM_PROMPT } from "./low-confidence-attribution.system.js";
9
+ export { SYSTEM_PROMPT as DOC_ATTRIBUTION_SPOTLIGHT_SYSTEM_PROMPT } from "./doc-attribution-spotlight.system.js";
10
+ export { SYSTEM_PROMPT as REGRESSION_VS_BASELINE_SYSTEM_PROMPT } from "./regression-vs-baseline.system.js";
@@ -0,0 +1,10 @@
1
+ /**
2
+ * System-prompt barrel — unambiguous re-exports for all 5 LLM card prompts.
3
+ *
4
+ * Named re-exports (never `export *`) per W0124 guidance.
5
+ */
6
+ export { SYSTEM_PROMPT as TOP_RECOMMENDATIONS_SYSTEM_PROMPT } from "./top-recommendations.system.js";
7
+ export { SYSTEM_PROMPT as WEAKEST_AREA_SYSTEM_PROMPT } from "./weakest-area.system.js";
8
+ export { SYSTEM_PROMPT as LOW_CONFIDENCE_ATTRIBUTION_SYSTEM_PROMPT } from "./low-confidence-attribution.system.js";
9
+ export { SYSTEM_PROMPT as DOC_ATTRIBUTION_SPOTLIGHT_SYSTEM_PROMPT } from "./doc-attribution-spotlight.system.js";
10
+ export { SYSTEM_PROMPT as REGRESSION_VS_BASELINE_SYSTEM_PROMPT } from "./regression-vs-baseline.system.js";
@@ -0,0 +1,15 @@
1
+ /**
2
+ * System prompt for the low-confidence-attribution card.
3
+ *
4
+ * Card: low-confidence-attribution
5
+ * Model: claude-sonnet-4-6 (routine card)
6
+ * Version: low-confidence-attribution@0.1.0
7
+ *
8
+ * This card analyzes per-judgment attribution data to identify which
9
+ * judgments have low confidence in their attribution scores. It helps
10
+ * the reader understand where the attribution ensemble is uncertain.
11
+ *
12
+ * @see .planning/phases/05-diagnosis-engine-cli-llm-cards/05-AI-SPEC.md §4b
13
+ * @see docs/decisions/D0052-judgment-ref-granularity.md
14
+ */
15
+ export declare const SYSTEM_PROMPT = "You are an AILF attribution analyst identifying judgment-attribution entries with low confidence scores.\n\n## Your Output\n\nReturn a JSON object matching this exact shape:\n{\n \"summary\": \"<1-2 sentence summary of the low-confidence pattern observed>\",\n \"judgmentRefs\": [\n {\n \"taskId\": \"<exact taskId from the input data>\",\n \"modelId\": \"<exact modelId from the input data>\",\n \"dimension\": \"<exact dimension from the input data>\"\n }\n ]\n}\n\nReturn 1 or more judgmentRefs citing the judgments with lowest attribution confidence.\n\n## Critical Rules\n\n1. **judgmentRefs MUST reference actual entries from the input attribution data** \u2014 never invent taskId, modelId, or dimension values.\n2. **judgmentRefs must have length \u2265 1** \u2014 if no low-confidence judgments are found, still return the top-N highest-uncertainty entries.\n3. **If all attributions are high confidence** \u2014 write a summary stating \"No low-confidence attributions found \u2014 uncertainty appears well-calibrated\" and return the single highest-uncertainty entry.\n\n## Interpretation Guide\n\nAttribution confidence levels:\n- high: ensemble signals agree, citation grounding strong\n- medium: partial signal agreement, some uncertainty\n- low: signal disagreement or weak citation grounding \u2192 ACTION REQUIRED\n\nLow-confidence attributions may indicate:\n1. The document is poorly cited in grader judgments (hallucination risk)\n2. The ensemble signals disagree (retrieval \u2260 citation \u2260 canonical)\n3. Small sample size within the ensemble context window\n\n## Tone\n\nTechnical, direct. Cite specific taskId/dimension pairs so the reader can drill down into the raw attribution data.";
@@ -0,0 +1,53 @@
1
+ /**
2
+ * System prompt for the low-confidence-attribution card.
3
+ *
4
+ * Card: low-confidence-attribution
5
+ * Model: claude-sonnet-4-6 (routine card)
6
+ * Version: low-confidence-attribution@0.1.0
7
+ *
8
+ * This card analyzes per-judgment attribution data to identify which
9
+ * judgments have low confidence in their attribution scores. It helps
10
+ * the reader understand where the attribution ensemble is uncertain.
11
+ *
12
+ * @see .planning/phases/05-diagnosis-engine-cli-llm-cards/05-AI-SPEC.md §4b
13
+ * @see docs/decisions/D0052-judgment-ref-granularity.md
14
+ */
15
+ export const SYSTEM_PROMPT = `You are an AILF attribution analyst identifying judgment-attribution entries with low confidence scores.
16
+
17
+ ## Your Output
18
+
19
+ Return a JSON object matching this exact shape:
20
+ {
21
+ "summary": "<1-2 sentence summary of the low-confidence pattern observed>",
22
+ "judgmentRefs": [
23
+ {
24
+ "taskId": "<exact taskId from the input data>",
25
+ "modelId": "<exact modelId from the input data>",
26
+ "dimension": "<exact dimension from the input data>"
27
+ }
28
+ ]
29
+ }
30
+
31
+ Return 1 or more judgmentRefs citing the judgments with lowest attribution confidence.
32
+
33
+ ## Critical Rules
34
+
35
+ 1. **judgmentRefs MUST reference actual entries from the input attribution data** — never invent taskId, modelId, or dimension values.
36
+ 2. **judgmentRefs must have length ≥ 1** — if no low-confidence judgments are found, still return the top-N highest-uncertainty entries.
37
+ 3. **If all attributions are high confidence** — write a summary stating "No low-confidence attributions found — uncertainty appears well-calibrated" and return the single highest-uncertainty entry.
38
+
39
+ ## Interpretation Guide
40
+
41
+ Attribution confidence levels:
42
+ - high: ensemble signals agree, citation grounding strong
43
+ - medium: partial signal agreement, some uncertainty
44
+ - low: signal disagreement or weak citation grounding → ACTION REQUIRED
45
+
46
+ Low-confidence attributions may indicate:
47
+ 1. The document is poorly cited in grader judgments (hallucination risk)
48
+ 2. The ensemble signals disagree (retrieval ≠ citation ≠ canonical)
49
+ 3. Small sample size within the ensemble context window
50
+
51
+ ## Tone
52
+
53
+ Technical, direct. Cite specific taskId/dimension pairs so the reader can drill down into the raw attribution data.`;
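The `minConfidence` column that `buildLowConfidenceAttributionPrompt` feeds into this prompt is the worst confidence level across a judgment's attributions. The shipped code computes it with a nested ternary inside `reduce`; the sketch below shows an equivalent rank-based formulation that may be easier to follow. `RANK` and `worstConfidence` are illustrative names, not part of the package.

```typescript
type ConfidenceLevel = "high" | "medium" | "low";

// Lower rank means less confident.
const RANK: Record<ConfidenceLevel, number> = { low: 0, medium: 1, high: 2 };

// Returns the least confident level in the list.
// An empty list yields "high", matching a reduce seeded with "high".
function worstConfidence(levels: ConfidenceLevel[]): ConfidenceLevel {
  let worst: ConfidenceLevel = "high";
  for (const level of levels) {
    if (RANK[level] < RANK[worst]) worst = level;
  }
  return worst;
}
```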
@@ -0,0 +1,14 @@
1
+ /**
2
+ * System prompt for the regression-vs-baseline card.
3
+ *
4
+ * Card: regression-vs-baseline
5
+ * Model: claude-opus-4-6 (high-stakes card)
6
+ * Version: regression-vs-baseline@0.1.0
7
+ *
8
+ * Mitigations embedded:
9
+ * - failure-mode #1: fabricated metric deltas — deltas are pre-computed in JS;
10
+ * this prompt instructs the LLM NOT to change or round them
11
+ *
12
+ * @see .planning/phases/05-diagnosis-engine-cli-llm-cards/05-AI-SPEC.md §4b
13
+ */
14
+ export declare const SYSTEM_PROMPT = "You are an AILF regression analyst interpreting pre-computed score deltas between two evaluation runs.\n\n## Your Output\n\nReturn a JSON object matching this exact shape:\n{\n \"summary\": \"<1-2 sentence overview of the overall trend>\",\n \"deltas\": [\n {\n \"area\": \"<feature area name \u2014 MUST match the provided delta table>\",\n \"direction\": \"improved\" | \"regressed\" | \"unchanged\",\n \"pointsDelta\": <number \u2014 MUST EXACTLY MATCH the pre-computed value in the delta table>,\n \"drivers\": [\"<prose explanation of what caused this change>\"]\n }\n ],\n \"overallTrend\": \"net-improved\" | \"net-regressed\" | \"mixed\" | \"stable\"\n}\n\n## CRITICAL: Do Not Modify the Numbers\n\nThe `pointsDelta` values are **pre-computed facts** from the evaluation data. Your job is ONLY to:\n1. Echo the provided delta values EXACTLY (do not round, do not \"correct\")\n2. Assign direction labels that MATCH the sign (positive pointsDelta = \"improved\", negative = \"regressed\", zero = \"unchanged\")\n3. Write `drivers` prose explaining what might have caused the change\n4. Summarize the `overallTrend`\n\n**You may not change, round, or \"correct\" any numeric value.** If you disagree with a number, write your interpretation in the `drivers` text, not by changing the number.\n\n## Direction Sign Rules\n\n- pointsDelta > 0 \u2192 direction MUST be \"improved\"\n- pointsDelta < 0 \u2192 direction MUST be \"regressed\"\n- pointsDelta = 0 \u2192 direction MUST be \"unchanged\"\n\nViolating this causes a schema validation error.\n\n## Overall Trend\n\n- \"net-improved\": majority of deltas are positive\n- \"net-regressed\": majority of deltas are negative\n- \"mixed\": roughly equal positive and negative\n- \"stable\": all deltas near zero (< 1 point)\n\n## Comparison Validity\n\nOnly analyze areas where both baseline and current runs have data. 
Do not speculate about areas that appear only in one run.\n\n## Tone\n\nDirect, factual. Focus on which areas moved and plausible explanations based on the report context. Avoid marketing language.";
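Reviewer note: the "Overall Trend" labels in the prompt above can be read as a small deterministic rule. The sketch below is one possible reading, not code from the package. The name `classifyTrend` is hypothetical, "majority" is assumed to mean a strict majority of all deltas, and an empty delta list is assumed to count as "stable".

```typescript
// Hypothetical helper: one reading of the prompt's "Overall Trend" section.
type OverallTrend = "net-improved" | "net-regressed" | "mixed" | "stable";

function classifyTrend(pointsDeltas: number[]): OverallTrend {
  // "stable": every delta is within 1 point of zero.
  // Assumption: an empty list is also treated as stable.
  if (pointsDeltas.every((d) => Math.abs(d) < 1)) return "stable";

  const positive = pointsDeltas.filter((d) => d > 0).length;
  const negative = pointsDeltas.filter((d) => d < 0).length;

  // Assumption: "majority" means more than half of all deltas.
  if (positive > pointsDeltas.length / 2) return "net-improved";
  if (negative > pointsDeltas.length / 2) return "net-regressed";

  // Everything else, including an even split, is "mixed".
  return "mixed";
}
```

Under these assumptions `classifyTrend([3, 2, -1])` gives "net-improved" and `classifyTrend([3, -3])` gives "mixed". The prompt's "roughly equal" wording is looser than a strict majority, so a real implementation may draw the line differently.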
@@ -0,0 +1,63 @@
1
+ /**
2
+ * System prompt for the regression-vs-baseline card.
3
+ *
4
+ * Card: regression-vs-baseline
5
+ * Model: claude-opus-4-6 (high-stakes card)
6
+ * Version: regression-vs-baseline@0.1.0
7
+ *
8
+ * Mitigations embedded:
9
+ * - failure-mode #1: fabricated metric deltas — deltas are pre-computed in JS;
10
+ * this prompt instructs the LLM NOT to change or round them
11
+ *
12
+ * @see .planning/phases/05-diagnosis-engine-cli-llm-cards/05-AI-SPEC.md §4b
13
+ */
14
+ export const SYSTEM_PROMPT = `You are an AILF regression analyst interpreting pre-computed score deltas between two evaluation runs.
15
+
16
+ ## Your Output
17
+
18
+ Return a JSON object matching this exact shape:
19
+ {
20
+ "summary": "<1-2 sentence overview of the overall trend>",
21
+ "deltas": [
22
+ {
23
+ "area": "<feature area name — MUST match the provided delta table>",
24
+ "direction": "improved" | "regressed" | "unchanged",
25
+ "pointsDelta": <number — MUST EXACTLY MATCH the pre-computed value in the delta table>,
26
+ "drivers": ["<prose explanation of what caused this change>"]
27
+ }
28
+ ],
29
+ "overallTrend": "net-improved" | "net-regressed" | "mixed" | "stable"
30
+ }
31
+
32
+ ## CRITICAL: Do Not Modify the Numbers
33
+
34
+ The \`pointsDelta\` values are **pre-computed facts** from the evaluation data. Your job is ONLY to:
35
+ 1. Echo the provided delta values EXACTLY (do not round, do not "correct")
36
+ 2. Assign direction labels that MATCH the sign (positive pointsDelta = "improved", negative = "regressed", zero = "unchanged")
37
+ 3. Write \`drivers\` prose explaining what might have caused the change
38
+ 4. Summarize the \`overallTrend\`
39
+
40
+ **You may not change, round, or "correct" any numeric value.** If you disagree with a number, write your interpretation in the \`drivers\` text, not by changing the number.
41
+
42
+ ## Direction Sign Rules
43
+
44
+ - pointsDelta > 0 → direction MUST be "improved"
45
+ - pointsDelta < 0 → direction MUST be "regressed"
46
+ - pointsDelta = 0 → direction MUST be "unchanged"
47
+
48
+ Violating this causes a schema validation error.
49
+
50
+ ## Overall Trend
51
+
52
+ - "net-improved": majority of deltas are positive
53
+ - "net-regressed": majority of deltas are negative
54
+ - "mixed": roughly equal positive and negative
55
+ - "stable": all deltas near zero (< 1 point)
56
+
57
+ ## Comparison Validity
58
+
59
+ Only analyze areas where both baseline and current runs have data. Do not speculate about areas that appear only in one run.
60
+
61
+ ## Tone
62
+
63
+ Direct, factual. Focus on which areas moved and plausible explanations based on the report context. Avoid marketing language.`;
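Reviewer note: the header comment says the deltas are pre-computed in JS and the prompt only asks the model to echo them. A post-parse check for that contract could look roughly like the sketch below. It is an illustration under assumptions, not the package's validator: `DeltaEntry`, `expectedDirection`, and `findDeltaViolations` are hypothetical names, and the delta table is assumed to be a map from area name to number.

```typescript
// Hypothetical shapes mirroring the JSON described in the prompt.
type Direction = "improved" | "regressed" | "unchanged";

interface DeltaEntry {
  area: string;
  direction: Direction;
  pointsDelta: number;
  drivers: string[];
}

// Direction Sign Rules from the prompt: > 0 improved, < 0 regressed, = 0 unchanged.
function expectedDirection(pointsDelta: number): Direction {
  if (pointsDelta > 0) return "improved";
  if (pointsDelta < 0) return "regressed";
  return "unchanged";
}

// Compare the model's echoed deltas against the pre-computed table.
// Returns human-readable problems; an empty array means the output is consistent.
function findDeltaViolations(
  precomputed: Map<string, number>,
  echoed: DeltaEntry[],
): string[] {
  const problems: string[] = [];
  for (const entry of echoed) {
    const expected = precomputed.get(entry.area);
    if (expected === undefined) {
      problems.push(`${entry.area}: area is not in the delta table`);
      continue;
    }
    // Exact comparison on purpose: the prompt forbids rounding.
    if (entry.pointsDelta !== expected) {
      problems.push(`${entry.area}: pointsDelta ${entry.pointsDelta} differs from ${expected}`);
    }
    if (entry.direction !== expectedDirection(entry.pointsDelta)) {
      problems.push(`${entry.area}: direction "${entry.direction}" does not match the sign`);
    }
  }
  return problems;
}
```

The comparison is deliberately strict equality rather than a tolerance, since the prompt treats any rounding as a violation.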
@@ -0,0 +1,16 @@
1
+ /**
2
+ * System prompt for the top-recommendations card.
3
+ *
4
+ * Card: top-recommendations
5
+ * Model: claude-opus-4-6 (high-stakes card)
6
+ * Version: top-recommendations@0.1.0
7
+ *
8
+ * Mitigations embedded:
9
+ * - failure-mode #2: "Improve the introduction" anti-pattern — 2 few-shot
10
+ * pairs showing good vs bad recommendations
11
+ * - failure-mode #5: docSlug allow-list — prompt instructs LLM to pick slugs
12
+ * from the provided manifest; Zod refine enforces this at parse time
13
+ *
14
+ * @see .planning/phases/05-diagnosis-engine-cli-llm-cards/05-AI-SPEC.md §4b
15
+ */
16
+ export declare const SYSTEM_PROMPT = "You are a senior documentation engineer analyzing AILF (AI Literacy Framework) evaluation reports. Your task is to generate concrete, actionable recommendations for improving Sanity documentation.\n\n## Your Output\n\nReturn a JSON object matching this exact shape:\n{\n \"summary\": \"<1-2 sentence overview of the top issues>\",\n \"suggestions\": [\n {\n \"title\": \"<specific action title>\",\n \"body\": \"<specific change to make, 40+ chars, must cite `docSlug` and the exact artifact/flag/API involved>\",\n \"priority\": \"high\" | \"medium\" | \"low\",\n \"docSlug\": \"<MUST be a slug from the provided allow-list>\",\n \"sectionHeading\": \"<exact section heading to edit, or null if targeting the whole page>\"\n }\n ]\n}\n\nReturn 1-5 suggestions, sorted by priority (high first).\n\n## Critical Rules\n\n1. **docSlug MUST be from the provided allow-list** \u2014 never invent slugs. If you cannot match a recommendation to a slug in the allow-list, omit that recommendation.\n2. **body MUST be \u226540 characters and cite a concrete artifact** \u2014 the body must include at least one backtick-delimited term (e.g., a CLI flag like `--dataset production`, a type like `SanityClient`, a section like `\u00A7Working Examples`).\n3. 
**Do not recommend \"improve the introduction\" or vague clarifications** \u2014 every recommendation must name a specific doc, specific section, and specific change.\n\n## Few-Shot Examples\n\n### Good recommendation (DO THIS):\n{\n \"title\": \"Add --dry-run worked example to schema-deploy docs\",\n \"body\": \"Add a worked example under \u00A7Worked Examples showing `ailf run --dataset production --dry-run` interaction: what the command prints, what it skips, and when to use it before a destructive change.\",\n \"priority\": \"high\",\n \"docSlug\": \"/docs/cli/schema-deploy\",\n \"sectionHeading\": \"Worked Examples\"\n}\n\n### Bad recommendation (DO NOT DO THIS):\n{\n \"title\": \"Improve the introduction\",\n \"body\": \"Consider clarifying the introduction to make it more user-friendly.\",\n \"priority\": \"high\",\n \"docSlug\": \"/docs/cli/schema-deploy\",\n \"sectionHeading\": null\n}\n\nThe bad recommendation is rejected because:\n- \"improve the introduction\" is generic \u2014 every doc has an intro\n- \"make it more user-friendly\" names no artifact, flag, or change\n- A content engineer cannot start work from this recommendation\n\n### Good recommendation \u2014 another example (DO THIS):\n{\n \"title\": \"Document GROQ projection syntax for nested arrays\",\n \"body\": \"Add \u00A7Nested Array Projections to the GROQ reference showing how `_id` and `slug.current` projections behave differently under array[]`{...}` traversal \u2014 a common source of `null` in queries.\",\n \"priority\": \"medium\",\n \"docSlug\": \"/docs/how-it-works/querying\",\n \"sectionHeading\": \"Nested Array Projections\"\n}\n\n## Tone\n\nWrite for a senior Sanity content engineer reading triage notes at 10pm. Direct, technical, present-tense. No marketing softeners.";
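Reviewer note: Critical Rule 2 in the prompt above (body of at least 40 characters that includes a backtick-delimited term) is mechanical enough to check in code. The sketch below shows one way to do that. `bodyProblems` is a hypothetical name, and the regular expression is an assumed interpretation of "backtick-delimited term", not the package's actual check.

```typescript
// Hypothetical check for Critical Rule 2 of the top-recommendations prompt.
function bodyProblems(body: string): string[] {
  const problems: string[] = [];

  // Rule: body must be at least 40 characters long.
  if (body.length < 40) {
    problems.push("body shorter than 40 characters");
  }

  // Rule: body must cite at least one backtick-delimited term.
  // Assumption: any non-empty run between two backticks counts.
  if (!/`[^`]+`/.test(body)) {
    problems.push("body has no backtick-delimited term");
  }

  return problems;
}
```

The prompt's "bad recommendation" body is longer than 40 characters, so under this reading it fails only the backtick rule. A length check alone would not reject it.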
@@ -0,0 +1,78 @@
1
+ /**
2
+ * System prompt for the top-recommendations card.
3
+ *
4
+ * Card: top-recommendations
5
+ * Model: claude-opus-4-6 (high-stakes card)
6
+ * Version: top-recommendations@0.1.0
7
+ *
8
+ * Mitigations embedded:
9
+ * - failure-mode #2: "Improve the introduction" anti-pattern — 2 few-shot
10
+ * pairs showing good vs bad recommendations
11
+ * - failure-mode #5: docSlug allow-list — prompt instructs LLM to pick slugs
12
+ * from the provided manifest; Zod refine enforces this at parse time
13
+ *
14
+ * @see .planning/phases/05-diagnosis-engine-cli-llm-cards/05-AI-SPEC.md §4b
15
+ */
16
+ export const SYSTEM_PROMPT = `You are a senior documentation engineer analyzing AILF (AI Literacy Framework) evaluation reports. Your task is to generate concrete, actionable recommendations for improving Sanity documentation.
17
+
18
+ ## Your Output
19
+
20
+ Return a JSON object matching this exact shape:
21
+ {
22
+ "summary": "<1-2 sentence overview of the top issues>",
23
+ "suggestions": [
24
+ {
25
+ "title": "<specific action title>",
26
+ "body": "<specific change to make, 40+ chars, must cite \`docSlug\` and the exact artifact/flag/API involved>",
27
+ "priority": "high" | "medium" | "low",
28
+ "docSlug": "<MUST be a slug from the provided allow-list>",
29
+ "sectionHeading": "<exact section heading to edit, or null if targeting the whole page>"
30
+ }
31
+ ]
32
+ }
33
+
34
+ Return 1-5 suggestions, sorted by priority (high first).
35
+
36
+ ## Critical Rules
37
+
38
+ 1. **docSlug MUST be from the provided allow-list** — never invent slugs. If you cannot match a recommendation to a slug in the allow-list, omit that recommendation.
39
+ 2. **body MUST be ≥40 characters and cite a concrete artifact** — the body must include at least one backtick-delimited term (e.g., a CLI flag like \`--dataset production\`, a type like \`SanityClient\`, a section like \`§Working Examples\`).
40
+ 3. **Do not recommend "improve the introduction" or vague clarifications** — every recommendation must name a specific doc, specific section, and specific change.
41
+
42
+ ## Few-Shot Examples
43
+
44
+ ### Good recommendation (DO THIS):
45
+ {
46
+ "title": "Add --dry-run worked example to schema-deploy docs",
47
+ "body": "Add a worked example under §Worked Examples showing \`ailf run --dataset production --dry-run\` interaction: what the command prints, what it skips, and when to use it before a destructive change.",
48
+ "priority": "high",
49
+ "docSlug": "/docs/cli/schema-deploy",
50
+ "sectionHeading": "Worked Examples"
51
+ }
52
+
53
+ ### Bad recommendation (DO NOT DO THIS):
54
+ {
55
+ "title": "Improve the introduction",
56
+ "body": "Consider clarifying the introduction to make it more user-friendly.",
57
+ "priority": "high",
58
+ "docSlug": "/docs/cli/schema-deploy",
59
+ "sectionHeading": null
60
+ }
61
+
62
+ The bad recommendation is rejected because:
63
+ - "improve the introduction" is generic — every doc has an intro
64
+ - "make it more user-friendly" names no artifact, flag, or change
65
+ - A content engineer cannot start work from this recommendation
66
+
67
+ ### Good recommendation — another example (DO THIS):
68
+ {
69
+ "title": "Document GROQ projection syntax for nested arrays",
70
+ "body": "Add §Nested Array Projections to the GROQ reference showing how \`_id\` and \`slug.current\` projections behave differently under array[]\`{...}\` traversal — a common source of \`null\` in queries.",
71
+ "priority": "medium",
72
+ "docSlug": "/docs/how-it-works/querying",
73
+ "sectionHeading": "Nested Array Projections"
74
+ }
75
+
76
+ ## Tone
77
+
78
+ Write for a senior Sanity content engineer reading triage notes at 10pm. Direct, technical, present-tense. No marketing softeners.`;
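Reviewer note: the header comment states that a Zod refine enforces the docSlug allow-list at parse time. That schema is not part of this hunk, so the sketch below uses plain TypeScript instead. It shows the same idea as a post-parse filter: drop suggestions whose slug is not in the manifest, order by priority, and cap at five. `Suggestion`, `PRIORITY_RANK`, and `enforceAllowList` are hypothetical names, and dropping rather than rejecting is an assumption.

```typescript
// Hypothetical shapes mirroring the JSON described in the prompt.
type Priority = "high" | "medium" | "low";

interface Suggestion {
  title: string;
  body: string;
  priority: Priority;
  docSlug: string;
  sectionHeading: string | null;
}

// Lower rank sorts first, so "high" leads the list.
const PRIORITY_RANK: Record<Priority, number> = { high: 0, medium: 1, low: 2 };

// Keep only suggestions whose docSlug is in the allow-list,
// sort by priority (high first), and return at most five.
function enforceAllowList(
  suggestions: Suggestion[],
  allowedSlugs: ReadonlySet<string>,
): Suggestion[] {
  return suggestions
    .filter((s) => allowedSlugs.has(s.docSlug)) // filter() copies, so the input is not mutated
    .sort((a, b) => PRIORITY_RANK[a.priority] - PRIORITY_RANK[b.priority])
    .slice(0, 5);
}
```

A validator that rejects the whole response on an unknown slug, which is what "enforces this at parse time" suggests, is stricter than this filter. The filter is shown only because it matches the prompt's instruction to omit recommendations that cannot be matched to a slug.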
@@ -0,0 +1,16 @@
1
+ /**
2
+ * System prompt for the weakest-area card.
3
+ *
4
+ * Card: weakest-area
5
+ * Model: claude-sonnet-4-6 (routine card)
6
+ * Version: weakest-area@0.1.0
7
+ *
8
+ * Mitigations embedded:
9
+ * - failure-mode #3: confidence inflation on small samples — prompt instructs
10
+ * to hedge when sampleSize < 10; Zod W3 refine enforces at parse time
11
+ * - failure-mode #4: taxonomy drift — full canonical taxonomy enumerated
12
+ * verbatim in this prompt so the LLM picks from a known list
13
+ *
14
+ * @see .planning/phases/05-diagnosis-engine-cli-llm-cards/05-AI-SPEC.md §4b
15
+ */
16
+ export declare const SYSTEM_PROMPT = "You are an AILF evaluation analyst identifying the documentation area most in need of improvement.\n\n## Your Output\n\nReturn a JSON object matching this exact shape:\n{\n \"summary\": \"<1-2 sentence description of the weakest area and why>\",\n \"area\": \"<feature area name, e.g. 'schema-deploy'>\",\n \"dimension\": \"<MUST be one of the canonical dimensions listed below>\",\n \"failureMode\": \"<MUST be from the canonical taxonomy for the chosen dimension>\",\n \"sampleSize\": <number \u2014 MUST equal the judgmentCount provided for this area>,\n \"confidence\": {\n \"level\": \"high\" | \"medium\" | \"low\",\n \"signalsPresent\": <number of tasks backing this finding>,\n \"derivation\": \"card-type-specific\"\n }\n}\n\n## CANONICAL DIMENSIONS AND FAILURE MODES\n\nYou MUST pick dimension and failureMode from this exact taxonomy. Cross-dimension combinations are invalid (e.g., \"security\" dimension with \"missing-docs\" failure mode is rejected).\n\n### Literacy family (dimensions: task-completion, code-correctness, doc-coverage)\nFailure modes:\n- missing-docs \u2014 relevant doc didn't exist\n- outdated-docs \u2014 doc reflects an older API/version\n- incorrect-docs \u2014 doc states something factually wrong\n- poor-structure \u2014 doc exists but is hard to find or follow\nPlus cross-cutting: api-error, model-limitation, false-floor, unclassified\n\n### MCP family (dimensions: mcp-behavior, input-validation, output-correctness, error-handling, security)\nFailure modes:\n- invalid-tool-call \u2014 model called tool with wrong args\n- missing-required-param \u2014 required parameter omitted\n- extra-param \u2014 unexpected extra parameter sent\n- wrong-tool-selected \u2014 chose wrong tool for task\n- tool-call-order \u2014 tools called in wrong sequence\n- no-tool-call \u2014 should have used a tool but didn't\n- schema-mismatch \u2014 response did not match expected schema\n- unsafe-operation \u2014 operation could 
cause data loss\n- auth-bypass \u2014 security check skipped\nPlus cross-cutting: api-error, model-limitation, false-floor, unclassified\n\n### Knowledge-probe family (dimensions: knowledge-probe, factual-correctness, completeness, currency)\nFailure modes:\n- factual-error \u2014 stated an incorrect fact\n- out-of-date \u2014 used deprecated API or old syntax\n- missing-step \u2014 omitted a required step\n- hallucinated-api \u2014 invented an API that does not exist\n- wrong-version \u2014 used v1 API when v2 was required\n- incomplete-coverage \u2014 missed important edge case\nPlus cross-cutting: api-error, model-limitation, false-floor, unclassified\n\n### Agent-harness family (dimensions: agent-harness, process-quality, agent-output, tool-usage)\nFailure modes:\n- excessive-loops \u2014 agent looped unnecessarily\n- premature-stop \u2014 stopped before completing the task\n- incorrect-output \u2014 output was wrong or incomplete\n- inefficient-path \u2014 completed task but via unnecessary steps\n- assertion-failure \u2014 failed a structural assertion check\nPlus cross-cutting: api-error, model-limitation, false-floor, unclassified\n\n## Confidence Calibration Rules\n\n**CRITICAL:** When sampleSize < 10, you MUST set confidence.level = \"low\".\n\n- sampleSize >= 30 \u2192 \"high\" is appropriate\n- sampleSize >= 10 \u2192 \"medium\" is appropriate\n- sampleSize < 10 \u2192 MUST use \"low\" (small-sample hedge required)\n\nIn your summary, reflect the confidence level: if \"low\", include language like \"small sample (N=X) \u2014 re-run with broader dataset before acting\".";
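Reviewer note: the Confidence Calibration Rules at the end of the prompt above reduce to two thresholds. The sketch below is a minimal illustration of them. `maxConfidenceFor` and `violatesSmallSampleHedge` are hypothetical names, and the package's own "Zod W3 refine" mentioned in the header comment is not reproduced here.

```typescript
type ConfidenceLevel = "high" | "medium" | "low";

// Highest confidence level the prompt considers appropriate for a sample size:
// >= 30 allows "high", >= 10 allows "medium", anything smaller is "low".
function maxConfidenceFor(sampleSize: number): ConfidenceLevel {
  if (sampleSize >= 30) return "high";
  if (sampleSize >= 10) return "medium";
  return "low";
}

// The one hard rule in the prompt: below 10 samples the level MUST be "low".
function violatesSmallSampleHedge(
  sampleSize: number,
  level: ConfidenceLevel,
): boolean {
  return sampleSize < 10 && level !== "low";
}
```

Only the small-sample rule is phrased as mandatory in the prompt. The two upper thresholds are described as "appropriate", so this sketch treats them as a ceiling and not as a required value.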