@sanity/ailf 4.6.0 → 6.0.0
- package/canonical/grader-references/agent-harness-tools.yaml +42 -0
- package/canonical/grader-references/knowledge-probe-recall.yaml +36 -0
- package/canonical/grader-references/mcp-server-spec.yaml +51 -0
- package/canonical/grader-references/portable-text.yaml +48 -0
- package/config/diagnosis-cards.ts +318 -0
- package/config/models.ts +12 -0
- package/config/rubrics.ts +38 -2
- package/dist/_vendor/ailf-core/artifact-registry.d.ts +60 -2
- package/dist/_vendor/ailf-core/artifact-registry.js +288 -7
- package/dist/_vendor/ailf-core/examples/index.d.ts +125 -26
- package/dist/_vendor/ailf-core/examples/index.js +146 -47
- package/dist/_vendor/ailf-core/grader/failure-modes/agent-harness.d.ts +13 -0
- package/dist/_vendor/ailf-core/grader/failure-modes/agent-harness.js +16 -0
- package/dist/_vendor/ailf-core/grader/failure-modes/common.d.ts +14 -0
- package/dist/_vendor/ailf-core/grader/failure-modes/common.js +18 -0
- package/dist/_vendor/ailf-core/grader/failure-modes/index.d.ts +45 -0
- package/dist/_vendor/ailf-core/grader/failure-modes/index.js +109 -0
- package/dist/_vendor/ailf-core/grader/failure-modes/knowledge-probe.d.ts +13 -0
- package/dist/_vendor/ailf-core/grader/failure-modes/knowledge-probe.js +17 -0
- package/dist/_vendor/ailf-core/grader/failure-modes/literacy.d.ts +13 -0
- package/dist/_vendor/ailf-core/grader/failure-modes/literacy.js +17 -0
- package/dist/_vendor/ailf-core/grader/failure-modes/mcp.d.ts +13 -0
- package/dist/_vendor/ailf-core/grader/failure-modes/mcp.js +17 -0
- package/dist/_vendor/ailf-core/index.d.ts +1 -0
- package/dist/_vendor/ailf-core/index.js +4 -0
- package/dist/_vendor/ailf-core/ports/context.d.ts +8 -0
- package/dist/_vendor/ailf-core/ports/mode-handler.d.ts +15 -0
- package/dist/_vendor/ailf-core/schemas/branded-string.d.ts +40 -0
- package/dist/_vendor/ailf-core/schemas/branded-string.js +45 -0
- package/dist/_vendor/ailf-core/schemas/confidence-schema.d.ts +36 -0
- package/dist/_vendor/ailf-core/schemas/confidence-schema.js +32 -0
- package/dist/_vendor/ailf-core/schemas/eval-config.d.ts +1 -0
- package/dist/_vendor/ailf-core/schemas/eval-config.js +8 -4
- package/dist/_vendor/ailf-core/schemas/index.d.ts +2 -0
- package/dist/_vendor/ailf-core/schemas/index.js +9 -0
- package/dist/_vendor/ailf-core/schemas/pipeline-request.d.ts +1 -0
- package/dist/_vendor/ailf-core/schemas/pipeline-request.js +1 -0
- package/dist/_vendor/ailf-core/schemas/pipeline.d.ts +34 -8
- package/dist/_vendor/ailf-core/schemas/pipeline.js +23 -1
- package/dist/_vendor/ailf-core/services/diagnosis/card-validators.d.ts +41 -0
- package/dist/_vendor/ailf-core/services/diagnosis/card-validators.js +40 -0
- package/dist/_vendor/ailf-core/services/diagnosis/cards/__tests__/area-summary.test.d.ts +7 -0
- package/dist/_vendor/ailf-core/services/diagnosis/cards/__tests__/area-summary.test.js +131 -0
- package/dist/_vendor/ailf-core/services/diagnosis/cards/__tests__/failure-mode-summary.test.d.ts +7 -0
- package/dist/_vendor/ailf-core/services/diagnosis/cards/__tests__/failure-mode-summary.test.js +171 -0
- package/dist/_vendor/ailf-core/services/diagnosis/cards/__tests__/no-issues.test.d.ts +7 -0
- package/dist/_vendor/ailf-core/services/diagnosis/cards/__tests__/no-issues.test.js +155 -0
- package/dist/_vendor/ailf-core/services/diagnosis/cards/area-summary.d.ts +17 -0
- package/dist/_vendor/ailf-core/services/diagnosis/cards/area-summary.js +43 -0
- package/dist/_vendor/ailf-core/services/diagnosis/cards/doc-attribution-spotlight.d.ts +46 -0
- package/dist/_vendor/ailf-core/services/diagnosis/cards/doc-attribution-spotlight.js +104 -0
- package/dist/_vendor/ailf-core/services/diagnosis/cards/failure-mode-summary.d.ts +28 -0
- package/dist/_vendor/ailf-core/services/diagnosis/cards/failure-mode-summary.js +96 -0
- package/dist/_vendor/ailf-core/services/diagnosis/cards/index.d.ts +39 -0
- package/dist/_vendor/ailf-core/services/diagnosis/cards/index.js +52 -0
- package/dist/_vendor/ailf-core/services/diagnosis/cards/low-confidence-attribution.d.ts +27 -0
- package/dist/_vendor/ailf-core/services/diagnosis/cards/low-confidence-attribution.js +77 -0
- package/dist/_vendor/ailf-core/services/diagnosis/cards/no-issues.d.ts +32 -0
- package/dist/_vendor/ailf-core/services/diagnosis/cards/no-issues.js +71 -0
- package/dist/_vendor/ailf-core/services/diagnosis/cards/regression-vs-baseline.d.ts +44 -0
- package/dist/_vendor/ailf-core/services/diagnosis/cards/regression-vs-baseline.js +126 -0
- package/dist/_vendor/ailf-core/services/diagnosis/cards/top-recommendations.d.ts +41 -0
- package/dist/_vendor/ailf-core/services/diagnosis/cards/top-recommendations.js +107 -0
- package/dist/_vendor/ailf-core/services/diagnosis/cards/weakest-area.d.ts +43 -0
- package/dist/_vendor/ailf-core/services/diagnosis/cards/weakest-area.js +114 -0
- package/dist/_vendor/ailf-core/services/diagnosis/prompt-builders.d.ts +72 -0
- package/dist/_vendor/ailf-core/services/diagnosis/prompt-builders.js +273 -0
- package/dist/_vendor/ailf-core/services/diagnosis/prompts/doc-attribution-spotlight.system.d.ts +17 -0
- package/dist/_vendor/ailf-core/services/diagnosis/prompts/doc-attribution-spotlight.system.js +58 -0
- package/dist/_vendor/ailf-core/services/diagnosis/prompts/index.d.ts +10 -0
- package/dist/_vendor/ailf-core/services/diagnosis/prompts/index.js +10 -0
- package/dist/_vendor/ailf-core/services/diagnosis/prompts/low-confidence-attribution.system.d.ts +15 -0
- package/dist/_vendor/ailf-core/services/diagnosis/prompts/low-confidence-attribution.system.js +53 -0
- package/dist/_vendor/ailf-core/services/diagnosis/prompts/regression-vs-baseline.system.d.ts +14 -0
- package/dist/_vendor/ailf-core/services/diagnosis/prompts/regression-vs-baseline.system.js +63 -0
- package/dist/_vendor/ailf-core/services/diagnosis/prompts/top-recommendations.system.d.ts +16 -0
- package/dist/_vendor/ailf-core/services/diagnosis/prompts/top-recommendations.system.js +78 -0
- package/dist/_vendor/ailf-core/services/diagnosis/prompts/weakest-area.system.d.ts +16 -0
- package/dist/_vendor/ailf-core/services/diagnosis/prompts/weakest-area.system.js +86 -0
- package/dist/_vendor/ailf-core/services/diagnosis/registry.d.ts +50 -0
- package/dist/_vendor/ailf-core/services/diagnosis/registry.js +35 -0
- package/dist/_vendor/ailf-core/services/diagnosis-runner.d.ts +136 -0
- package/dist/_vendor/ailf-core/services/diagnosis-runner.js +153 -0
- package/dist/_vendor/ailf-core/services/index.d.ts +6 -0
- package/dist/_vendor/ailf-core/services/index.js +18 -0
- package/dist/_vendor/ailf-core/services/llm-client-factory.d.ts +64 -0
- package/dist/_vendor/ailf-core/services/llm-client-factory.js +54 -0
- package/dist/_vendor/ailf-core/services/report-to-markdown.js +3 -2
- package/dist/_vendor/ailf-core/types/attribution.d.ts +82 -0
- package/dist/_vendor/ailf-core/types/attribution.js +18 -0
- package/dist/_vendor/ailf-core/types/branded-ids.d.ts +26 -1
- package/dist/_vendor/ailf-core/types/branded-ids.js +80 -4
- package/dist/_vendor/ailf-core/types/confidence.d.ts +1 -1
- package/dist/_vendor/ailf-core/types/confidence.js +7 -0
- package/dist/_vendor/ailf-core/types/diagnosis.d.ts +271 -0
- package/dist/_vendor/ailf-core/types/diagnosis.js +19 -0
- package/dist/_vendor/ailf-core/types/generalized-task.d.ts +16 -1
- package/dist/_vendor/ailf-core/types/grader-judgment.d.ts +125 -0
- package/dist/_vendor/ailf-core/types/grader-judgment.js +30 -0
- package/dist/_vendor/ailf-core/types/index.d.ts +80 -29
- package/dist/_vendor/ailf-core/types/index.js +15 -1
- package/dist/_vendor/ailf-core/types/legacy-grader-judgment.d.ts +55 -0
- package/dist/_vendor/ailf-core/types/legacy-grader-judgment.js +30 -0
- package/dist/_vendor/ailf-core/types/pipeline-request.d.ts +1 -0
- package/dist/_vendor/ailf-core/types/repo-config.d.ts +8 -0
- package/dist/_vendor/ailf-shared/document-ref.d.ts +1 -1
- package/dist/adapters/api-client/build-request.d.ts +1 -0
- package/dist/adapters/api-client/build-request.js +3 -0
- package/dist/adapters/attribution/attribution-meta-writer.d.ts +35 -0
- package/dist/adapters/attribution/attribution-meta-writer.js +34 -0
- package/dist/adapters/attribution/index.d.ts +9 -0
- package/dist/adapters/attribution/index.js +8 -0
- package/dist/adapters/attribution/per-entry-attribution-writer.d.ts +56 -0
- package/dist/adapters/attribution/per-entry-attribution-writer.js +49 -0
- package/dist/adapters/config-sources/file-config-adapter.js +1 -0
- package/dist/adapters/grader-outputs/index.d.ts +10 -0
- package/dist/adapters/grader-outputs/index.js +8 -0
- package/dist/adapters/grader-outputs/legacy/index.d.ts +11 -0
- package/dist/adapters/grader-outputs/legacy/index.js +10 -0
- package/dist/adapters/grader-outputs/legacy/promptfoo-grader-output-legacy.d.ts +49 -0
- package/dist/adapters/grader-outputs/legacy/promptfoo-grader-output-legacy.js +48 -0
- package/dist/adapters/grader-outputs/promptfoo-grader-output.d.ts +102 -0
- package/dist/adapters/grader-outputs/promptfoo-grader-output.js +93 -0
- package/dist/adapters/index.d.ts +3 -0
- package/dist/adapters/index.js +4 -0
- package/dist/adapters/llm/fake-llm-client.d.ts +20 -0
- package/dist/adapters/llm/fake-llm-client.js +38 -1
- package/dist/adapters/llm/openai-llm-client.js +52 -3
- package/dist/adapters/task-sources/content-lake-task-source.d.ts +5 -1
- package/dist/adapters/task-sources/content-lake-task-source.js +28 -2
- package/dist/adapters/task-sources/repo-schemas.d.ts +79 -11
- package/dist/adapters/task-sources/repo-schemas.js +19 -2
- package/dist/cli-program.js +3 -0
- package/dist/commands/calculate-scores.js +1 -1
- package/dist/commands/explain-handler.js +1 -1
- package/dist/commands/interpret.d.ts +50 -0
- package/dist/commands/interpret.js +212 -0
- package/dist/commands/lookup-doc.d.ts +1 -1
- package/dist/commands/lookup-doc.js +3 -3
- package/dist/commands/pipeline-action.d.ts +6 -0
- package/dist/commands/pipeline-action.js +2 -0
- package/dist/commands/remote-pipeline.js +1 -0
- package/dist/composition-root.d.ts +57 -23
- package/dist/composition-root.js +155 -41
- package/dist/config/diagnosis-cards.ts +318 -0
- package/dist/config/models.ts +12 -0
- package/dist/config/rubrics.ts +38 -2
- package/dist/grader/agent-harness.d.ts +9 -0
- package/dist/grader/agent-harness.js +9 -0
- package/dist/grader/common.d.ts +9 -0
- package/dist/grader/common.js +9 -0
- package/dist/grader/index.d.ts +24 -0
- package/dist/grader/index.js +24 -0
- package/dist/grader/knowledge-probe.d.ts +9 -0
- package/dist/grader/knowledge-probe.js +9 -0
- package/dist/grader/literacy.d.ts +9 -0
- package/dist/grader/literacy.js +9 -0
- package/dist/grader/mcp.d.ts +9 -0
- package/dist/grader/mcp.js +9 -0
- package/dist/orchestration/build-app-context.js +1 -0
- package/dist/orchestration/build-step-sequence.js +5 -0
- package/dist/orchestration/steps/calculate-scores-step.js +23 -1
- package/dist/orchestration/steps/compute-attribution-step.d.ts +44 -0
- package/dist/orchestration/steps/compute-attribution-step.js +279 -0
- package/dist/orchestration/steps/gap-analysis-step.js +35 -7
- package/dist/orchestration/steps/index.d.ts +1 -0
- package/dist/orchestration/steps/index.js +1 -0
- package/dist/pipeline/attribution.d.ts +15 -0
- package/dist/pipeline/attribution.js +18 -9
- package/dist/pipeline/borderline-consensus-runner.d.ts +63 -0
- package/dist/pipeline/borderline-consensus-runner.js +124 -0
- package/dist/pipeline/borderline-detector.d.ts +24 -0
- package/dist/pipeline/borderline-detector.js +26 -0
- package/dist/pipeline/calculate-scores.d.ts +114 -3
- package/dist/pipeline/calculate-scores.js +426 -24
- package/dist/pipeline/compiler/literacy-bridge.d.ts +1 -1
- package/dist/pipeline/compiler/literacy-bridge.js +35 -17
- package/dist/pipeline/compiler/rubric-resolution.d.ts +15 -0
- package/dist/pipeline/compiler/rubric-resolution.js +9 -1
- package/dist/pipeline/compute-attribution.d.ts +80 -0
- package/dist/pipeline/compute-attribution.js +196 -0
- package/dist/pipeline/failure-modes.d.ts +52 -17
- package/dist/pipeline/failure-modes.js +178 -117
- package/dist/pipeline/map-request-to-config.js +1 -0
- package/package.json +7 -5
@@ -0,0 +1,273 @@
+/**
+ * Deterministic prompt-builder helpers for the 5 LLM diagnosis cards.
+ *
+ * Each function takes typed inputs and returns `{ system, user }` strings
+ * (+ a `deltas` side-channel for regression-vs-baseline). The same inputs
+ * always produce the same output — no randomness, no side effects.
+ *
+ * Per AI-SPEC §4: input tokens are bounded by truncation; the user message
+ * projects only the fields each card needs.
+ *
+ * @see .planning/phases/05-diagnosis-engine-cli-llm-cards/05-AI-SPEC.md §4
+ * @see .planning/phases/05-diagnosis-engine-cli-llm-cards/05-AI-SPEC.md §3 lines 603-664
+ */
+import { TOP_RECOMMENDATIONS_SYSTEM_PROMPT, WEAKEST_AREA_SYSTEM_PROMPT, LOW_CONFIDENCE_ATTRIBUTION_SYSTEM_PROMPT, DOC_ATTRIBUTION_SPOTLIGHT_SYSTEM_PROMPT, REGRESSION_VS_BASELINE_SYSTEM_PROMPT, } from "./prompts/index.js";
+// ---------------------------------------------------------------------------
+// Shared helper: build the docSlug allow-list from a report
+// ---------------------------------------------------------------------------
+/**
+ * Build the allow-list of doc slugs for a report. This is the union of:
+ * - report.summary.documentManifest[].slug
+ * - per-score documents[].slug
+ *
+ * Only slugs (not documentIds) appear in the allow-list because the cards
+ * reference human-readable slugs. DocumentId-only entries are skipped.
+ */
+export function buildDocSlugAllowList(report) {
+    const slugs = new Set();
+    const manifest = report.summary.documentManifest ?? [];
+    for (const doc of manifest) {
+        if ("slug" in doc && typeof doc.slug === "string" && doc.slug.length > 0) {
+            slugs.add(doc.slug);
+        }
+    }
+    for (const score of report.summary.scores ?? []) {
+        for (const doc of score.documents ?? []) {
+            if ("slug" in doc &&
+                typeof doc.slug === "string" &&
+                doc.slug.length > 0) {
+                slugs.add(doc.slug);
+            }
+        }
+    }
+    return slugs;
+}
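Review note: the union described in the `buildDocSlugAllowList` comment can be seen on a small input. The sketch below restates the logic in condensed form so it runs on its own; the function name `buildDocSlugAllowListSketch`, the sample report and its slugs are invented for illustration, and only the field names (`summary.documentManifest`, `summary.scores[].documents`, `slug`, `documentId`) come from the diff.

```javascript
// Condensed restatement of the allow-list logic (a sketch, not the shipped code).
function buildDocSlugAllowListSketch(report) {
    const slugs = new Set();
    const hasSlug = (doc) => typeof doc.slug === "string" && doc.slug.length > 0;
    for (const doc of report.summary.documentManifest ?? []) {
        if (hasSlug(doc)) slugs.add(doc.slug);
    }
    for (const score of report.summary.scores ?? []) {
        for (const doc of score.documents ?? []) {
            if (hasSlug(doc)) slugs.add(doc.slug);
        }
    }
    return slugs;
}

// Hypothetical report: one documentId-only entry and one slug that appears twice.
const sampleReport = {
    summary: {
        documentManifest: [{ slug: "groq-intro" }, { documentId: "doc-123" }],
        scores: [
            { feature: "groq", documents: [{ slug: "groq-intro" }, { slug: "groq-joins" }] },
        ],
    },
};

const sampleAllowList = buildDocSlugAllowListSketch(sampleReport);
console.log([...sampleAllowList]);
```

Because the result is a `Set`, a slug that appears in both the manifest and a per-score document list is stored once, and the documentId-only entry never enters the list.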
+// ---------------------------------------------------------------------------
+// top-recommendations prompt builder
+// ---------------------------------------------------------------------------
+/**
+ * Projects the 3 weakest areas + top failure modes + doc-slug allow-list
+ * into the user message (~2500 input tokens).
+ */
+export function buildTopRecommendationsPrompt(report, allowList) {
+    const scores = report.summary.scores ?? [];
+    const sorted = [...scores].sort((a, b) => a.totalScore - b.totalScore);
+    const weakest3 = sorted.slice(0, 3);
+    const failureModes = report.summary.failureModes;
+    const topModes = failureModes?.topTitles
+        ?.slice(0, 5)
+        .map((t) => `${t.category} (${t.count})`) ?? [];
+    const allowListArr = [...allowList].slice(0, 50); // cap at 50 slugs
+    const user = [
+        "## Weakest Areas",
+        weakest3
+            .map((s) => `- ${s.feature}: totalScore=${s.totalScore}, judgmentCount=${s.testCount}`)
+            .join("\n"),
+        "",
+        "## Top Failure Modes",
+        topModes.length > 0 ? topModes.join(", ") : "(none recorded)",
+        "",
+        "## Document Slug Allow-List",
+        "Suggestions MUST use one of these slugs:",
+        allowListArr.map((s) => `- ${s}`).join("\n"),
+        "",
+        "Generate 1-5 actionable recommendations targeting the weakest areas above.",
+    ].join("\n");
+    return { system: TOP_RECOMMENDATIONS_SYSTEM_PROMPT, user };
+}
+// ---------------------------------------------------------------------------
+// weakest-area prompt builder
+// ---------------------------------------------------------------------------
+/**
+ * Projects the single weakest area + full failure-mode breakdown.
+ */
+export function buildWeakestAreaPrompt(report) {
+    const scores = report.summary.scores ?? [];
+    if (scores.length === 0) {
+        return {
+            system: WEAKEST_AREA_SYSTEM_PROMPT,
+            user: "No areas in report.",
+        };
+    }
+    const weakest = [...scores].sort((a, b) => a.totalScore - b.totalScore)[0];
+    const judgmentCount = weakest.testCount ?? 0;
+    const topMode = report.summary.failureModes?.topTitles?.[0]?.category ?? "unclassified";
+    const user = [
+        "## Weakest Area",
+        `Feature: ${weakest.feature}`,
+        `Total Score: ${weakest.totalScore}`,
+        `Ceiling Score: ${weakest.ceilingScore}`,
+        `Floor Score: ${weakest.floorScore}`,
+        `Judgment Count (sampleSize): ${judgmentCount}`,
+        "",
+        "## Top Failure Mode Observed",
+        `Category: ${topMode}`,
+        "",
+        "## Failure Mode by Dimension",
+        report.summary.failureModes?.topTitles
+            ?.slice(0, 5)
+            .map((t) => `- ${t.category}: ${t.count} judgments`)
+            .join("\n") ?? "(no data)",
+        "",
+        "Identify the area, its primary dimension, and most frequent failure mode.",
+        `sampleSize in your response MUST equal exactly ${judgmentCount} (the judgment count above).`,
+        judgmentCount < 10
+            ? `WARNING: sampleSize=${judgmentCount} < 10 — you MUST set confidence.level = "low".`
+            : "",
+    ]
+        .filter(Boolean)
+        .join("\n");
+    return { system: WEAKEST_AREA_SYSTEM_PROMPT, user };
+}
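Review note: in `buildWeakestAreaPrompt`, `.filter(Boolean)` drops every falsy element, so the `""` separator entries are removed together with the empty warning slot, and the joined message has no blank lines between sections. The sketch below reproduces only that array pattern; `joinLines` and the shortened section text are made up for the example and are not part of the package.

```javascript
// Build a message the way the builder does: an array of lines, with "" used as
// a section separator and a conditional warning as the last element.
function joinLines(judgmentCount) {
    return [
        "## Weakest Area",
        `Judgment Count (sampleSize): ${judgmentCount}`,
        "",
        "## Top Failure Mode Observed",
        judgmentCount < 10 ? `WARNING: sampleSize=${judgmentCount} < 10` : "",
    ]
        .filter(Boolean) // removes every "" entry: the separator and the empty warning
        .join("\n");
}

const smallSample = joinLines(4);
const largeSample = joinLines(25);
console.log(smallSample);
console.log(largeSample);
```

If blank lines between sections were wanted, one option (an assumption, not something the diff does) would be to use `null` for the absent warning and filter with `(line) => line !== null`.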
+// ---------------------------------------------------------------------------
+// low-confidence-attribution prompt builder
+// ---------------------------------------------------------------------------
+/**
+ * Filters to low-confidence entries (or top-N if none low), produces a table
+ * for the LLM (~2000 input tokens).
+ *
+ * Caller short-circuits on empty `judgmentAttributions` BEFORE calling this.
+ */
+export function buildLowConfidenceAttributionPrompt(report, judgmentAttributions) {
+    // Filter to low-confidence entries (by any attribution in the set)
+    const lowConf = judgmentAttributions.filter((ja) => ja.attributions.some((a) => a.confidence.level === "low"));
+    // If no low-confidence entries, use all sorted by score ascending (most uncertain first)
+    const source = lowConf.length > 0
+        ? lowConf
+        : [...judgmentAttributions].sort((a, b) => {
+            const aMin = Math.min(...a.attributions.map((x) => x.score));
+            const bMin = Math.min(...b.attributions.map((x) => x.score));
+            return aMin - bMin;
+        });
+    // Cap at 20 to stay within token budget
+    const capped = source.slice(0, 20);
+    const tableRows = capped
+        .map((ja) => {
+            const minConf = ja.attributions.reduce((worst, a) => a.confidence.level === "low"
+                ? "low"
+                : worst === "low"
+                    ? "low"
+                    : a.confidence.level === "medium"
+                        ? "medium"
+                        : worst, "high");
+            return `| ${ja.judgmentRef} | ${ja.taskId} | ${ja.modelId} | ${ja.dimension} | ${minConf} |`;
+        })
+        .join("\n");
+    const user = [
+        "## Per-Judgment Attribution Confidence",
+        `Total entries: ${judgmentAttributions.length}; Low-confidence: ${lowConf.length}`,
+        "",
+        "| judgmentRef | taskId | modelId | dimension | minConfidence |",
+        "|-------------|--------|---------|-----------|---------------|",
+        tableRows,
+        "",
+        lowConf.length === 0
+            ? "No low-confidence entries found. Return the highest-uncertainty entry and note that confidence is well-calibrated."
+            : `Identify the ${Math.min(lowConf.length, 5)} most uncertain judgments.`,
+    ].join("\n");
+    return { system: LOW_CONFIDENCE_ATTRIBUTION_SYSTEM_PROMPT, user };
+}
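Review note: the nested ternary that computes `minConf` is a "worst level wins" fold over `low < medium < high`, starting from `"high"`. The sketch below restates that reducer over plain level strings and compares it with a rank-table version; `CONFIDENCE_RANK`, `worstConfidence` and `worstConfidenceTernary` are names introduced here for illustration, not identifiers from the package.

```javascript
// Rank table is an assumption of this sketch; the diff uses a nested ternary.
const CONFIDENCE_RANK = { low: 0, medium: 1, high: 2 };

// Returns the lowest confidence level in the list, or "high" for an empty list.
function worstConfidence(levels) {
    return levels.reduce(
        (worst, level) => (CONFIDENCE_RANK[level] < CONFIDENCE_RANK[worst] ? level : worst),
        "high",
    );
}

// The reducer from the diff, restated over plain level strings for comparison.
function worstConfidenceTernary(levels) {
    return levels.reduce((worst, level) => level === "low"
        ? "low"
        : worst === "low"
            ? "low"
            : level === "medium"
                ? "medium"
                : worst, "high");
}

const confidenceCases = [[], ["high"], ["high", "medium"], ["medium", "low", "high"], ["medium", "high"]];
for (const levels of confidenceCases) {
    console.log(JSON.stringify(levels), worstConfidence(levels), worstConfidenceTernary(levels));
}
```

Both versions agree on these inputs. An entry with an empty `attributions` array would be reported as `"high"`, since the fold never leaves its initial value.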
+// ---------------------------------------------------------------------------
+// doc-attribution-spotlight prompt builder
+// ---------------------------------------------------------------------------
+/**
+ * Aggregates attributions by documentId, picks top-5 by aggregate score,
+ * emits a table for the LLM.
+ *
+ * Caller short-circuits on empty `judgmentAttributions` BEFORE calling this.
+ */
+export function buildDocAttributionSpotlightPrompt(report, judgmentAttributions) {
+    // Aggregate by documentId (D0052 canonical ref)
+    const byDoc = new Map();
+    for (const ja of judgmentAttributions) {
+        for (const a of ja.attributions) {
+            const entry = byDoc.get(a.documentId) ?? {
+                slug: a.slug,
+                scoreSum: 0,
+                count: 0,
+            };
+            entry.scoreSum += a.score;
+            entry.count += 1;
+            // Keep slug if we have it
+            if (a.slug && !entry.slug)
+                entry.slug = a.slug;
+            byDoc.set(a.documentId, entry);
+        }
+    }
+    // Sort by aggregate score descending, take top 5
+    const sorted = [...byDoc.entries()]
+        .map(([docId, v]) => ({
+            documentId: docId,
+            slug: v.slug,
+            aggregateScore: v.scoreSum / v.count,
+            signalCount: v.count,
+        }))
+        .filter((d) => d.slug) // only emit docs with slugs (allow-list check in Zod)
+        .sort((a, b) => b.aggregateScore - a.aggregateScore)
+        .slice(0, 5);
+    const tableRows = sorted
+        .map((d) => `| ${d.documentId} | ${d.slug ?? "(no slug)"} | ${d.aggregateScore.toFixed(3)} | ${d.signalCount} |`)
+        .join("\n");
+    const user = [
+        "## Top Documents by Attribution Score",
+        `Total unique documents: ${byDoc.size}`,
+        "",
+        "| documentId | slug | aggregateScore | signalCount |",
+        "|------------|------|----------------|-------------|",
+        tableRows,
+        "",
+        "For each document in the table, determine its role (supports/contradicts/missing/irrelevant).",
+        "docSlug in your output MUST exactly match the slug column — do not invent slugs.",
+    ].join("\n");
+    return { system: DOC_ATTRIBUTION_SPOTLIGHT_SYSTEM_PROMPT, user };
+}
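Review note: the aggregate score in `buildDocAttributionSpotlightPrompt` is a per-document mean (`scoreSum / count`), and documents without a slug are filtered out before ranking. The sketch below isolates that aggregation; `aggregateByDocument` and the sample rows are hypothetical, with scores chosen as exact binary fractions so the arithmetic is easy to follow.

```javascript
// Hypothetical attribution rows; field names follow the diff, values are invented.
const sampleAttributions = [
    { attributions: [{ documentId: "d1", slug: "groq-intro", score: 0.75 }, { documentId: "d2", score: 0.5 }] },
    { attributions: [{ documentId: "d1", slug: "groq-intro", score: 0.25 }, { documentId: "d3", slug: "schemas", score: 0.875 }] },
];

// Group by documentId, average the scores, drop slug-less docs, keep the top `limit`.
function aggregateByDocument(entries, limit = 5) {
    const byDoc = new Map();
    for (const ja of entries) {
        for (const a of ja.attributions) {
            const entry = byDoc.get(a.documentId) ?? { slug: a.slug, scoreSum: 0, count: 0 };
            entry.scoreSum += a.score;
            entry.count += 1;
            if (a.slug && !entry.slug) entry.slug = a.slug;
            byDoc.set(a.documentId, entry);
        }
    }
    return [...byDoc.entries()]
        .map(([documentId, v]) => ({
            documentId,
            slug: v.slug,
            aggregateScore: v.scoreSum / v.count,
            signalCount: v.count,
        }))
        .filter((d) => d.slug) // documents without a slug are dropped
        .sort((a, b) => b.aggregateScore - a.aggregateScore)
        .slice(0, limit);
}

const topDocs = aggregateByDocument(sampleAttributions);
console.log(topDocs);
```

Because the score is a mean, a document cited once with a high score (`d3` here) ranks above one cited twice with mixed scores (`d1`); `signalCount` is what tells the two situations apart.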
+// ---------------------------------------------------------------------------
+// regression-vs-baseline prompt builder
+// ---------------------------------------------------------------------------
+/**
+ * Computes per-area deltas in JS (failure-mode #1 mitigation), then builds
+ * the user message. Returns `deltas` as a side-channel for the per-call
+ * schema `.refine()`.
+ */
+export function buildRegressionVsBaselinePrompt(report, baseline) {
+    // Build area → score maps
+    const currentByArea = new Map((report.summary.scores ?? []).map((s) => [s.feature, s.totalScore]));
+    const baselineByArea = new Map((baseline.summary.scores ?? []).map((s) => [s.feature, s.totalScore]));
+    // Only compute deltas for areas present in BOTH reports
+    const deltas = [];
+    for (const [area, current] of currentByArea) {
+        const base = baselineByArea.get(area);
+        if (base !== undefined) {
+            deltas.push({
+                area,
+                pointsDelta: parseFloat((current - base).toFixed(2)),
+            });
+        }
+    }
+    // Sort by absolute delta descending (most changed first), cap at 10
+    const topDeltas = deltas
+        .sort((a, b) => Math.abs(b.pointsDelta) - Math.abs(a.pointsDelta))
+        .slice(0, 10);
+    const deltaRows = topDeltas
+        .map((d) => `| ${d.area} | ${d.pointsDelta > 0 ? "+" : ""}${d.pointsDelta} | ${d.pointsDelta > 0 ? "improved" : d.pointsDelta < 0 ? "regressed" : "unchanged"} |`)
+        .join("\n");
+    const user = [
+        "## Pre-Computed Score Deltas (current minus baseline)",
+        "These values are FACTS — do not modify them.",
+        "",
+        "| area | pointsDelta | expectedDirection |",
+        "|------|-------------|-------------------|",
+        deltaRows,
+        "",
+        "For each row, echo the exact area + pointsDelta, assign the matching direction label, and add prose drivers.",
+        "Do NOT round or change any numeric value.",
+        "",
+        `Current run: ${report.provenance.runId}`,
+        `Baseline run: ${baseline.provenance.runId}`,
+    ].join("\n");
+    return {
+        system: REGRESSION_VS_BASELINE_SYSTEM_PROMPT,
+        user,
+        deltas: topDeltas,
+    };
+}
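Review note: `buildRegressionVsBaselinePrompt` computes the deltas in JavaScript and asks the model only to echo them. The sketch below isolates that arithmetic under a few assumptions: `computeDeltas`, `directionOf` and the score arrays are hypothetical, while the field names `feature` and `totalScore` and the rounding via `toFixed(2)` follow the diff.

```javascript
// Delta per area = current minus baseline, only for areas present in both runs.
function computeDeltas(currentScores, baselineScores, limit = 10) {
    const currentByArea = new Map(currentScores.map((s) => [s.feature, s.totalScore]));
    const baselineByArea = new Map(baselineScores.map((s) => [s.feature, s.totalScore]));
    const deltas = [];
    for (const [area, current] of currentByArea) {
        const base = baselineByArea.get(area);
        if (base !== undefined) {
            deltas.push({ area, pointsDelta: parseFloat((current - base).toFixed(2)) });
        }
    }
    // Largest absolute change first, capped at `limit` rows.
    return deltas
        .sort((a, b) => Math.abs(b.pointsDelta) - Math.abs(a.pointsDelta))
        .slice(0, limit);
}

// Direction label derived from the sign of the delta.
const directionOf = (delta) => (delta > 0 ? "improved" : delta < 0 ? "regressed" : "unchanged");

// Invented scores: "groq" regresses, "schemas" improves, "studio" has no baseline.
const currentRun = [
    { feature: "groq", totalScore: 72.5 },
    { feature: "schemas", totalScore: 88 },
    { feature: "studio", totalScore: 60 },
];
const baselineRun = [
    { feature: "groq", totalScore: 80 },
    { feature: "schemas", totalScore: 85 },
];

const sampleDeltas = computeDeltas(currentRun, baselineRun);
for (const d of sampleDeltas) {
    console.log(d.area, d.pointsDelta, directionOf(d.pointsDelta));
}
```

Areas that exist in only one run (`studio` above) produce no row, so a newly added or removed area does not appear in the table at all.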
package/dist/_vendor/ailf-core/services/diagnosis/prompts/doc-attribution-spotlight.system.d.ts
ADDED
@@ -0,0 +1,17 @@
+/**
+ * System prompt for the doc-attribution-spotlight card.
+ *
+ * Card: doc-attribution-spotlight
+ * Model: claude-sonnet-4-6 (routine card)
+ * Version: doc-attribution-spotlight@0.1.0
+ *
+ * This card identifies which documentation pages are most influential in
+ * grader attributions. Uses D0052 documentId as the canonical ref.
+ *
+ * Mitigations embedded:
+ * - failure-mode #5: docSlug allow-list enforced via Zod refine
+ *
+ * @see .planning/phases/05-diagnosis-engine-cli-llm-cards/05-AI-SPEC.md §4b
+ * @see docs/decisions/D0052-judgment-ref-granularity.md
+ */
+export declare const SYSTEM_PROMPT = "You are an AILF documentation analyst identifying which Sanity documentation pages have the highest attribution scores across evaluation runs.\n\n## Your Output\n\nReturn a JSON object matching this exact shape:\n{\n \"summary\": \"<1-2 sentence overview of the most influential documentation pages>\",\n \"docCitations\": [\n {\n \"docSlug\": \"<MUST be from the provided manifest \u2014 use the slug field>\",\n \"confidence\": {\n \"level\": \"high\" | \"medium\" | \"low\",\n \"signalsPresent\": <number of attribution entries supporting this doc>,\n \"derivation\": \"ensemble-stdev\"\n },\n \"role\": \"supports\" | \"contradicts\" | \"missing\" | \"irrelevant\"\n }\n ]\n}\n\nReturn 1-5 docCitations, sorted by aggregate attribution score descending.\n\n## Critical Rules\n\n1. **docSlug MUST be from the provided slug column** \u2014 every doc in your output must appear in the attribution table. Never invent slugs.\n2. **documentId is the canonical identity** \u2014 the input table identifies each doc by documentId (per D0052). The slug is a human-readable annotation. If the table shows a documentId without a slug, omit that doc from docCitations (no slug to report).\n3. **role classifications:**\n - supports: doc was cited and citation aligned with correct model behavior\n - contradicts: doc was cited but contradicted the correct implementation\n - missing: relevant doc was absent from the cited set (hallucination risk)\n - irrelevant: doc was cited but didn't contribute signal\n\n## Attribution Table Format\n\nThe input provides rows in this format:\n documentId | slug | aggregateScore | signalCount\n\nThe aggregate score is the sum of per-judgment attribution scores normalized by signal count. Higher = more influential.\n\n## Tone\n\nTechnical, direct. Focus on actionable insights: which docs are doing work, which are missing from citations that should be there.";
@@ -0,0 +1,58 @@
+/**
+ * System prompt for the doc-attribution-spotlight card.
+ *
+ * Card: doc-attribution-spotlight
+ * Model: claude-sonnet-4-6 (routine card)
+ * Version: doc-attribution-spotlight@0.1.0
+ *
+ * This card identifies which documentation pages are most influential in
+ * grader attributions. Uses D0052 documentId as the canonical ref.
+ *
+ * Mitigations embedded:
+ * - failure-mode #5: docSlug allow-list enforced via Zod refine
+ *
+ * @see .planning/phases/05-diagnosis-engine-cli-llm-cards/05-AI-SPEC.md §4b
+ * @see docs/decisions/D0052-judgment-ref-granularity.md
+ */
+export const SYSTEM_PROMPT = `You are an AILF documentation analyst identifying which Sanity documentation pages have the highest attribution scores across evaluation runs.
+
+## Your Output
+
+Return a JSON object matching this exact shape:
+{
+  "summary": "<1-2 sentence overview of the most influential documentation pages>",
+  "docCitations": [
+    {
+      "docSlug": "<MUST be from the provided manifest — use the slug field>",
+      "confidence": {
+        "level": "high" | "medium" | "low",
+        "signalsPresent": <number of attribution entries supporting this doc>,
+        "derivation": "ensemble-stdev"
+      },
+      "role": "supports" | "contradicts" | "missing" | "irrelevant"
+    }
+  ]
+}
+
+Return 1-5 docCitations, sorted by aggregate attribution score descending.
+
+## Critical Rules
+
+1. **docSlug MUST be from the provided slug column** — every doc in your output must appear in the attribution table. Never invent slugs.
+2. **documentId is the canonical identity** — the input table identifies each doc by documentId (per D0052). The slug is a human-readable annotation. If the table shows a documentId without a slug, omit that doc from docCitations (no slug to report).
+3. **role classifications:**
+   - supports: doc was cited and citation aligned with correct model behavior
+   - contradicts: doc was cited but contradicted the correct implementation
+   - missing: relevant doc was absent from the cited set (hallucination risk)
+   - irrelevant: doc was cited but didn't contribute signal
+
+## Attribution Table Format
+
+The input provides rows in this format:
+  documentId | slug | aggregateScore | signalCount
+
+The aggregate score is the sum of per-judgment attribution scores normalized by signal count. Higher = more influential.
+
+## Tone
+
+Technical, direct. Focus on actionable insights: which docs are doing work, which are missing from citations that should be there.`;
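Review note: the header comment says the docSlug allow-list is enforced via a Zod refine, but that schema is not part of this hunk. As a stand-in, the sketch below checks a model response against the output shape described in the system prompt using plain JavaScript. `checkSpotlightResponse`, the sample response and the slugs are all hypothetical; the real validation in the package may differ.

```javascript
const ALLOWED_ROLES = new Set(["supports", "contradicts", "missing", "irrelevant"]);
const ALLOWED_LEVELS = new Set(["high", "medium", "low"]);

// Returns a list of problems; an empty list means the response passed these checks.
function checkSpotlightResponse(response, allowedSlugs) {
    const problems = [];
    const citations = Array.isArray(response?.docCitations) ? response.docCitations : [];
    if (typeof response?.summary !== "string") problems.push("summary must be a string");
    if (citations.length < 1 || citations.length > 5) problems.push("expected 1-5 docCitations");
    for (const c of citations) {
        if (!allowedSlugs.has(c.docSlug)) problems.push(`unknown docSlug: ${c.docSlug}`);
        if (!ALLOWED_ROLES.has(c.role)) problems.push(`invalid role: ${c.role}`);
        if (!ALLOWED_LEVELS.has(c.confidence?.level)) problems.push(`invalid confidence level for ${c.docSlug}`);
    }
    return problems;
}

// Hypothetical allow-list and model output; the second citation uses an invented slug.
const spotlightAllowList = new Set(["groq-intro", "schemas"]);
const sampleResponse = {
    summary: "groq-intro carries most of the attribution weight.",
    docCitations: [
        { docSlug: "groq-intro", confidence: { level: "high", signalsPresent: 4, derivation: "ensemble-stdev" }, role: "supports" },
        { docSlug: "made-up-page", confidence: { level: "low", signalsPresent: 1, derivation: "ensemble-stdev" }, role: "missing" },
    ],
};

const spotlightProblems = checkSpotlightResponse(sampleResponse, spotlightAllowList);
console.log(spotlightProblems);
```

Collecting problems into a list instead of throwing on the first one makes it easier to report every violation of the prompt's rules at once.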
@@ -0,0 +1,10 @@
+/**
+ * System-prompt barrel — unambiguous re-exports for all 5 LLM card prompts.
+ *
+ * Named re-exports (never `export *`) per W0124 guidance.
+ */
+export { SYSTEM_PROMPT as TOP_RECOMMENDATIONS_SYSTEM_PROMPT } from "./top-recommendations.system.js";
+export { SYSTEM_PROMPT as WEAKEST_AREA_SYSTEM_PROMPT } from "./weakest-area.system.js";
+export { SYSTEM_PROMPT as LOW_CONFIDENCE_ATTRIBUTION_SYSTEM_PROMPT } from "./low-confidence-attribution.system.js";
+export { SYSTEM_PROMPT as DOC_ATTRIBUTION_SPOTLIGHT_SYSTEM_PROMPT } from "./doc-attribution-spotlight.system.js";
+export { SYSTEM_PROMPT as REGRESSION_VS_BASELINE_SYSTEM_PROMPT } from "./regression-vs-baseline.system.js";
@@ -0,0 +1,10 @@
|
|
|
1
|
+
/**
|
|
2
|
+
* System-prompt barrel — unambiguous re-exports for all 5 LLM card prompts.
|
|
3
|
+
*
|
|
4
|
+
* Named re-exports (never `export *`) per W0124 guidance.
|
|
5
|
+
*/
|
|
6
|
+
export { SYSTEM_PROMPT as TOP_RECOMMENDATIONS_SYSTEM_PROMPT } from "./top-recommendations.system.js";
|
|
7
|
+
export { SYSTEM_PROMPT as WEAKEST_AREA_SYSTEM_PROMPT } from "./weakest-area.system.js";
|
|
8
|
+
export { SYSTEM_PROMPT as LOW_CONFIDENCE_ATTRIBUTION_SYSTEM_PROMPT } from "./low-confidence-attribution.system.js";
|
|
9
|
+
export { SYSTEM_PROMPT as DOC_ATTRIBUTION_SPOTLIGHT_SYSTEM_PROMPT } from "./doc-attribution-spotlight.system.js";
|
|
10
|
+
export { SYSTEM_PROMPT as REGRESSION_VS_BASELINE_SYSTEM_PROMPT } from "./regression-vs-baseline.system.js";
|
package/dist/_vendor/ailf-core/services/diagnosis/prompts/low-confidence-attribution.system.d.ts
ADDED
|
@@ -0,0 +1,15 @@
|
|
|
1
|
+
/**
|
|
2
|
+
* System prompt for the low-confidence-attribution card.
|
|
3
|
+
*
|
|
4
|
+
* Card: low-confidence-attribution
|
|
5
|
+
* Model: claude-sonnet-4-6 (routine card)
|
|
6
|
+
* Version: low-confidence-attribution@0.1.0
|
|
7
|
+
*
|
|
8
|
+
* This card analyzes per-judgment attribution data to identify which
|
|
9
|
+
* judgments have low confidence in their attribution scores. It helps
|
|
10
|
+
* the reader understand where the attribution ensemble is uncertain.
|
|
11
|
+
*
|
|
12
|
+
* @see .planning/phases/05-diagnosis-engine-cli-llm-cards/05-AI-SPEC.md §4b
|
|
13
|
+
* @see docs/decisions/D0052-judgment-ref-granularity.md
|
|
14
|
+
*/
|
|
15
|
+
export declare const SYSTEM_PROMPT = "You are an AILF attribution analyst identifying judgment-attribution entries with low confidence scores.\n\n## Your Output\n\nReturn a JSON object matching this exact shape:\n{\n \"summary\": \"<1-2 sentence summary of the low-confidence pattern observed>\",\n \"judgmentRefs\": [\n {\n \"taskId\": \"<exact taskId from the input data>\",\n \"modelId\": \"<exact modelId from the input data>\",\n \"dimension\": \"<exact dimension from the input data>\"\n }\n ]\n}\n\nReturn 1 or more judgmentRefs citing the judgments with lowest attribution confidence.\n\n## Critical Rules\n\n1. **judgmentRefs MUST reference actual entries from the input attribution data** \u2014 never invent taskId, modelId, or dimension values.\n2. **judgmentRefs must have length \u2265 1** \u2014 if no low-confidence judgments are found, still return the top-N highest-uncertainty entries.\n3. **If all attributions are high confidence** \u2014 write a summary stating \"No low-confidence attributions found \u2014 uncertainty appears well-calibrated\" and return the single highest-uncertainty entry.\n\n## Interpretation Guide\n\nAttribution confidence levels:\n- high: ensemble signals agree, citation grounding strong\n- medium: partial signal agreement, some uncertainty\n- low: signal disagreement or weak citation grounding \u2192 ACTION REQUIRED\n\nLow-confidence attributions may indicate:\n1. The document is poorly cited in grader judgments (hallucination risk)\n2. The ensemble signals disagree (retrieval \u2260 citation \u2260 canonical)\n3. Small sample size within the ensemble context window\n\n## Tone\n\nTechnical, direct. Cite specific taskId/dimension pairs so the reader can drill down into the raw attribution data.";
|
package/dist/_vendor/ailf-core/services/diagnosis/prompts/low-confidence-attribution.system.js
ADDED
|
@@ -0,0 +1,53 @@
|
|
|
1
|
+
/**
|
|
2
|
+
* System prompt for the low-confidence-attribution card.
|
|
3
|
+
*
|
|
4
|
+
* Card: low-confidence-attribution
|
|
5
|
+
* Model: claude-sonnet-4-6 (routine card)
|
|
6
|
+
* Version: low-confidence-attribution@0.1.0
|
|
7
|
+
*
|
|
8
|
+
* This card analyzes per-judgment attribution data to identify which
|
|
9
|
+
* judgments have low confidence in their attribution scores. It helps
|
|
10
|
+
* the reader understand where the attribution ensemble is uncertain.
|
|
11
|
+
*
|
|
12
|
+
* @see .planning/phases/05-diagnosis-engine-cli-llm-cards/05-AI-SPEC.md §4b
|
|
13
|
+
* @see docs/decisions/D0052-judgment-ref-granularity.md
|
|
14
|
+
*/
|
|
15
|
+
export const SYSTEM_PROMPT = `You are an AILF attribution analyst identifying judgment-attribution entries with low confidence scores.
|
|
16
|
+
|
|
17
|
+
## Your Output
|
|
18
|
+
|
|
19
|
+
Return a JSON object matching this exact shape:
|
|
20
|
+
{
|
|
21
|
+
"summary": "<1-2 sentence summary of the low-confidence pattern observed>",
|
|
22
|
+
"judgmentRefs": [
|
|
23
|
+
{
|
|
24
|
+
"taskId": "<exact taskId from the input data>",
|
|
25
|
+
"modelId": "<exact modelId from the input data>",
|
|
26
|
+
"dimension": "<exact dimension from the input data>"
|
|
27
|
+
}
|
|
28
|
+
]
|
|
29
|
+
}
|
|
30
|
+
|
|
31
|
+
Return 1 or more judgmentRefs citing the judgments with lowest attribution confidence.
|
|
32
|
+
|
|
33
|
+
## Critical Rules
|
|
34
|
+
|
|
35
|
+
1. **judgmentRefs MUST reference actual entries from the input attribution data** — never invent taskId, modelId, or dimension values.
|
|
36
|
+
2. **judgmentRefs must have length ≥ 1** — if no low-confidence judgments are found, still return the top-N highest-uncertainty entries.
|
|
37
|
+
3. **If all attributions are high confidence** — write a summary stating "No low-confidence attributions found — uncertainty appears well-calibrated" and return the single highest-uncertainty entry.
|
|
38
|
+
|
|
39
|
+
## Interpretation Guide
|
|
40
|
+
|
|
41
|
+
Attribution confidence levels:
|
|
42
|
+
- high: ensemble signals agree, citation grounding strong
|
|
43
|
+
- medium: partial signal agreement, some uncertainty
|
|
44
|
+
- low: signal disagreement or weak citation grounding → ACTION REQUIRED
|
|
45
|
+
|
|
46
|
+
Low-confidence attributions may indicate:
|
|
47
|
+
1. The document is poorly cited in grader judgments (hallucination risk)
|
|
48
|
+
2. The ensemble signals disagree (retrieval ≠ citation ≠ canonical)
|
|
49
|
+
3. Small sample size within the ensemble context window
|
|
50
|
+
|
|
51
|
+
## Tone
|
|
52
|
+
|
|
53
|
+
Technical, direct. Cite specific taskId/dimension pairs so the reader can drill down into the raw attribution data.`;
|
|
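Critical Rules 1 and 2 of the prompt above (every `judgmentRefs` entry must exist in the input, and the list must be non-empty) lend themselves to a post-parse check. A minimal sketch, assuming a hypothetical `JudgmentRef` type built from the three fields the prompt names; `findInventedRefs` is illustrative and is not taken from the package.

```typescript
// Hypothetical shape of a judgment reference, using the fields named in the prompt.
interface JudgmentRef {
  taskId: string;
  modelId: string;
  dimension: string;
}

// Build a lookup key from the three identifying fields. JSON.stringify of an
// array avoids collisions that plain string concatenation could produce.
const refKey = (ref: JudgmentRef): string =>
  JSON.stringify([ref.taskId, ref.modelId, ref.dimension]);

// Illustrative check (an assumption, not package API): return the output refs
// that do not correspond to any entry in the input attribution data.
function findInventedRefs(output: JudgmentRef[], input: JudgmentRef[]): JudgmentRef[] {
  const known = new Set(input.map(refKey));
  return output.filter((ref) => !known.has(refKey(ref)));
}

// Made-up sample data for illustration only.
const inputEntries: JudgmentRef[] = [
  { taskId: "task-1", modelId: "model-a", dimension: "doc-coverage" },
  { taskId: "task-2", modelId: "model-a", dimension: "code-correctness" },
];
const modelOutput: JudgmentRef[] = [
  { taskId: "task-1", modelId: "model-a", dimension: "doc-coverage" },
  { taskId: "task-9", modelId: "model-a", dimension: "doc-coverage" }, // not in the input
];

const invented = findInventedRefs(modelOutput, inputEntries);
const hasAtLeastOneRef = modelOutput.length >= 1;
console.log(invented.map((ref) => ref.taskId), hasAtLeastOneRef);
```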
@@ -0,0 +1,14 @@
|
|
|
1
|
+
/**
|
|
2
|
+
* System prompt for the regression-vs-baseline card.
|
|
3
|
+
*
|
|
4
|
+
* Card: regression-vs-baseline
|
|
5
|
+
* Model: claude-opus-4-6 (high-stakes card)
|
|
6
|
+
* Version: regression-vs-baseline@0.1.0
|
|
7
|
+
*
|
|
8
|
+
* Mitigations embedded:
|
|
9
|
+
* - failure-mode #1: fabricated metric deltas — deltas are pre-computed in JS;
|
|
10
|
+
* this prompt instructs the LLM NOT to change or round them
|
|
11
|
+
*
|
|
12
|
+
* @see .planning/phases/05-diagnosis-engine-cli-llm-cards/05-AI-SPEC.md §4b
|
|
13
|
+
*/
|
|
14
|
+
export declare const SYSTEM_PROMPT = "You are an AILF regression analyst interpreting pre-computed score deltas between two evaluation runs.\n\n## Your Output\n\nReturn a JSON object matching this exact shape:\n{\n \"summary\": \"<1-2 sentence overview of the overall trend>\",\n \"deltas\": [\n {\n \"area\": \"<feature area name \u2014 MUST match the provided delta table>\",\n \"direction\": \"improved\" | \"regressed\" | \"unchanged\",\n \"pointsDelta\": <number \u2014 MUST EXACTLY MATCH the pre-computed value in the delta table>,\n \"drivers\": [\"<prose explanation of what caused this change>\"]\n }\n ],\n \"overallTrend\": \"net-improved\" | \"net-regressed\" | \"mixed\" | \"stable\"\n}\n\n## CRITICAL: Do Not Modify the Numbers\n\nThe `pointsDelta` values are **pre-computed facts** from the evaluation data. Your job is ONLY to:\n1. Echo the provided delta values EXACTLY (do not round, do not \"correct\")\n2. Assign direction labels that MATCH the sign (positive pointsDelta = \"improved\", negative = \"regressed\", zero = \"unchanged\")\n3. Write `drivers` prose explaining what might have caused the change\n4. Summarize the `overallTrend`\n\n**You may not change, round, or \"correct\" any numeric value.** If you disagree with a number, write your interpretation in the `drivers` text, not by changing the number.\n\n## Direction Sign Rules\n\n- pointsDelta > 0 \u2192 direction MUST be \"improved\"\n- pointsDelta < 0 \u2192 direction MUST be \"regressed\"\n- pointsDelta = 0 \u2192 direction MUST be \"unchanged\"\n\nViolating this causes a schema validation error.\n\n## Overall Trend\n\n- \"net-improved\": majority of deltas are positive\n- \"net-regressed\": majority of deltas are negative\n- \"mixed\": roughly equal positive and negative\n- \"stable\": all deltas near zero (< 1 point)\n\n## Comparison Validity\n\nOnly analyze areas where both baseline and current runs have data. Do not speculate about areas that appear only in one run.\n\n## Tone\n\nDirect, factual. Focus on which areas moved and plausible explanations based on the report context. Avoid marketing language.";
|
|
@@ -0,0 +1,63 @@
|
|
|
1
|
+
/**
|
|
2
|
+
* System prompt for the regression-vs-baseline card.
|
|
3
|
+
*
|
|
4
|
+
* Card: regression-vs-baseline
|
|
5
|
+
* Model: claude-opus-4-6 (high-stakes card)
|
|
6
|
+
* Version: regression-vs-baseline@0.1.0
|
|
7
|
+
*
|
|
8
|
+
* Mitigations embedded:
|
|
9
|
+
* - failure-mode #1: fabricated metric deltas — deltas are pre-computed in JS;
|
|
10
|
+
* this prompt instructs the LLM NOT to change or round them
|
|
11
|
+
*
|
|
12
|
+
* @see .planning/phases/05-diagnosis-engine-cli-llm-cards/05-AI-SPEC.md §4b
|
|
13
|
+
*/
|
|
14
|
+
export const SYSTEM_PROMPT = `You are an AILF regression analyst interpreting pre-computed score deltas between two evaluation runs.
|
|
15
|
+
|
|
16
|
+
## Your Output
|
|
17
|
+
|
|
18
|
+
Return a JSON object matching this exact shape:
|
|
19
|
+
{
|
|
20
|
+
"summary": "<1-2 sentence overview of the overall trend>",
|
|
21
|
+
"deltas": [
|
|
22
|
+
{
|
|
23
|
+
"area": "<feature area name — MUST match the provided delta table>",
|
|
24
|
+
"direction": "improved" | "regressed" | "unchanged",
|
|
25
|
+
"pointsDelta": <number — MUST EXACTLY MATCH the pre-computed value in the delta table>,
|
|
26
|
+
"drivers": ["<prose explanation of what caused this change>"]
|
|
27
|
+
}
|
|
28
|
+
],
|
|
29
|
+
"overallTrend": "net-improved" | "net-regressed" | "mixed" | "stable"
|
|
30
|
+
}
|
|
31
|
+
|
|
32
|
+
## CRITICAL: Do Not Modify the Numbers
|
|
33
|
+
|
|
34
|
+
The \`pointsDelta\` values are **pre-computed facts** from the evaluation data. Your job is ONLY to:
|
|
35
|
+
1. Echo the provided delta values EXACTLY (do not round, do not "correct")
|
|
36
|
+
2. Assign direction labels that MATCH the sign (positive pointsDelta = "improved", negative = "regressed", zero = "unchanged")
|
|
37
|
+
3. Write \`drivers\` prose explaining what might have caused the change
|
|
38
|
+
4. Summarize the \`overallTrend\`
|
|
39
|
+
|
|
40
|
+
**You may not change, round, or "correct" any numeric value.** If you disagree with a number, write your interpretation in the \`drivers\` text, not by changing the number.
|
|
41
|
+
|
|
42
|
+
## Direction Sign Rules
|
|
43
|
+
|
|
44
|
+
- pointsDelta > 0 → direction MUST be "improved"
|
|
45
|
+
- pointsDelta < 0 → direction MUST be "regressed"
|
|
46
|
+
- pointsDelta = 0 → direction MUST be "unchanged"
|
|
47
|
+
|
|
48
|
+
Violating this causes a schema validation error.
|
|
49
|
+
|
|
50
|
+
## Overall Trend
|
|
51
|
+
|
|
52
|
+
- "net-improved": majority of deltas are positive
|
|
53
|
+
- "net-regressed": majority of deltas are negative
|
|
54
|
+
- "mixed": roughly equal positive and negative
|
|
55
|
+
- "stable": all deltas near zero (< 1 point)
|
|
56
|
+
|
|
57
|
+
## Comparison Validity
|
|
58
|
+
|
|
59
|
+
Only analyze areas where both baseline and current runs have data. Do not speculate about areas that appear only in one run.
|
|
60
|
+
|
|
61
|
+
## Tone
|
|
62
|
+
|
|
63
|
+
Direct, factual. Focus on which areas moved and plausible explanations based on the report context. Avoid marketing language.`;
|
|
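The Direction Sign Rules in the prompt above are mechanical, so they can be checked after the model responds. The diff states that a violation causes a schema validation error but does not show the validator. What follows is a sketch under assumptions: `DeltaEntry`, `directionFor`, and `findSignViolations` are hypothetical names. The `overallTrend` rule is left out because "roughly equal" is not defined precisely in the prompt.

```typescript
type Direction = "improved" | "regressed" | "unchanged";

// Hypothetical entry type using the fields named in the prompt's output shape.
interface DeltaEntry {
  area: string;
  direction: Direction;
  pointsDelta: number;
}

// Map the sign of a pre-computed delta to the label the prompt requires.
// Note: NaN falls through both comparisons and is reported as "unchanged".
function directionFor(pointsDelta: number): Direction {
  if (pointsDelta > 0) return "improved";
  if (pointsDelta < 0) return "regressed";
  return "unchanged";
}

// Illustrative check (an assumption, not package API): return entries whose
// label contradicts the sign of their pointsDelta.
function findSignViolations(deltas: DeltaEntry[]): DeltaEntry[] {
  return deltas.filter((d) => d.direction !== directionFor(d.pointsDelta));
}

// Made-up sample data for illustration only.
const reported: DeltaEntry[] = [
  { area: "schema-deploy", direction: "improved", pointsDelta: 4.5 },
  { area: "groq", direction: "improved", pointsDelta: -2 }, // label contradicts sign
  { area: "studio", direction: "unchanged", pointsDelta: 0 },
];

const violations = findSignViolations(reported);
console.log(violations.map((d) => d.area));
```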
@@ -0,0 +1,16 @@
|
|
|
1
|
+
/**
|
|
2
|
+
* System prompt for the top-recommendations card.
|
|
3
|
+
*
|
|
4
|
+
* Card: top-recommendations
|
|
5
|
+
* Model: claude-opus-4-6 (high-stakes card)
|
|
6
|
+
* Version: top-recommendations@0.1.0
|
|
7
|
+
*
|
|
8
|
+
* Mitigations embedded:
|
|
9
|
+
* - failure-mode #2: "Improve the introduction" anti-pattern — 2 few-shot
|
|
10
|
+
* pairs showing good vs bad recommendations
|
|
11
|
+
* - failure-mode #5: docSlug allow-list — prompt instructs LLM to pick slugs
|
|
12
|
+
* from the provided manifest; Zod refine enforces this at parse time
|
|
13
|
+
*
|
|
14
|
+
* @see .planning/phases/05-diagnosis-engine-cli-llm-cards/05-AI-SPEC.md §4b
|
|
15
|
+
*/
|
|
16
|
+
export declare const SYSTEM_PROMPT = "You are a senior documentation engineer analyzing AILF (AI Literacy Framework) evaluation reports. Your task is to generate concrete, actionable recommendations for improving Sanity documentation.\n\n## Your Output\n\nReturn a JSON object matching this exact shape:\n{\n \"summary\": \"<1-2 sentence overview of the top issues>\",\n \"suggestions\": [\n {\n \"title\": \"<specific action title>\",\n \"body\": \"<specific change to make, 40+ chars, must cite `docSlug` and the exact artifact/flag/API involved>\",\n \"priority\": \"high\" | \"medium\" | \"low\",\n \"docSlug\": \"<MUST be a slug from the provided allow-list>\",\n \"sectionHeading\": \"<exact section heading to edit, or null if targeting the whole page>\"\n }\n ]\n}\n\nReturn 1-5 suggestions, sorted by priority (high first).\n\n## Critical Rules\n\n1. **docSlug MUST be from the provided allow-list** \u2014 never invent slugs. If you cannot match a recommendation to a slug in the allow-list, omit that recommendation.\n2. **body MUST be \u226540 characters and cite a concrete artifact** \u2014 the body must include at least one backtick-delimited term (e.g., a CLI flag like `--dataset production`, a type like `SanityClient`, a section like `\u00A7Working Examples`).\n3. **Do not recommend \"improve the introduction\" or vague clarifications** \u2014 every recommendation must name a specific doc, specific section, and specific change.\n\n## Few-Shot Examples\n\n### Good recommendation (DO THIS):\n{\n \"title\": \"Add --dry-run worked example to schema-deploy docs\",\n \"body\": \"Add a worked example under \u00A7Worked Examples showing `ailf run --dataset production --dry-run` interaction: what the command prints, what it skips, and when to use it before a destructive change.\",\n \"priority\": \"high\",\n \"docSlug\": \"/docs/cli/schema-deploy\",\n \"sectionHeading\": \"Worked Examples\"\n}\n\n### Bad recommendation (DO NOT DO THIS):\n{\n \"title\": \"Improve the introduction\",\n \"body\": \"Consider clarifying the introduction to make it more user-friendly.\",\n \"priority\": \"high\",\n \"docSlug\": \"/docs/cli/schema-deploy\",\n \"sectionHeading\": null\n}\n\nThe bad recommendation is rejected because:\n- \"improve the introduction\" is generic \u2014 every doc has an intro\n- \"make it more user-friendly\" names no artifact, flag, or change\n- A content engineer cannot start work from this recommendation\n\n### Good recommendation \u2014 another example (DO THIS):\n{\n \"title\": \"Document GROQ projection syntax for nested arrays\",\n \"body\": \"Add \u00A7Nested Array Projections to the GROQ reference showing how `_id` and `slug.current` projections behave differently under array[]`{...}` traversal \u2014 a common source of `null` in queries.\",\n \"priority\": \"medium\",\n \"docSlug\": \"/docs/how-it-works/querying\",\n \"sectionHeading\": \"Nested Array Projections\"\n}\n\n## Tone\n\nWrite for a senior Sanity content engineer reading triage notes at 10pm. Direct, technical, present-tense. No marketing softeners.";
|
|
@@ -0,0 +1,78 @@
|
|
|
1
|
+
/**
|
|
2
|
+
* System prompt for the top-recommendations card.
|
|
3
|
+
*
|
|
4
|
+
* Card: top-recommendations
|
|
5
|
+
* Model: claude-opus-4-6 (high-stakes card)
|
|
6
|
+
* Version: top-recommendations@0.1.0
|
|
7
|
+
*
|
|
8
|
+
* Mitigations embedded:
|
|
9
|
+
* - failure-mode #2: "Improve the introduction" anti-pattern — 2 few-shot
|
|
10
|
+
* pairs showing good vs bad recommendations
|
|
11
|
+
* - failure-mode #5: docSlug allow-list — prompt instructs LLM to pick slugs
|
|
12
|
+
* from the provided manifest; Zod refine enforces this at parse time
|
|
13
|
+
*
|
|
14
|
+
* @see .planning/phases/05-diagnosis-engine-cli-llm-cards/05-AI-SPEC.md §4b
|
|
15
|
+
*/
|
|
16
|
+
export const SYSTEM_PROMPT = `You are a senior documentation engineer analyzing AILF (AI Literacy Framework) evaluation reports. Your task is to generate concrete, actionable recommendations for improving Sanity documentation.
|
|
17
|
+
|
|
18
|
+
## Your Output
|
|
19
|
+
|
|
20
|
+
Return a JSON object matching this exact shape:
|
|
21
|
+
{
|
|
22
|
+
"summary": "<1-2 sentence overview of the top issues>",
|
|
23
|
+
"suggestions": [
|
|
24
|
+
{
|
|
25
|
+
"title": "<specific action title>",
|
|
26
|
+
"body": "<specific change to make, 40+ chars, must cite \`docSlug\` and the exact artifact/flag/API involved>",
|
|
27
|
+
"priority": "high" | "medium" | "low",
|
|
28
|
+
"docSlug": "<MUST be a slug from the provided allow-list>",
|
|
29
|
+
"sectionHeading": "<exact section heading to edit, or null if targeting the whole page>"
|
|
30
|
+
}
|
|
31
|
+
]
|
|
32
|
+
}
|
|
33
|
+
|
|
34
|
+
Return 1-5 suggestions, sorted by priority (high first).
|
|
35
|
+
|
|
36
|
+
## Critical Rules
|
|
37
|
+
|
|
38
|
+
1. **docSlug MUST be from the provided allow-list** — never invent slugs. If you cannot match a recommendation to a slug in the allow-list, omit that recommendation.
|
|
39
|
+
2. **body MUST be ≥40 characters and cite a concrete artifact** — the body must include at least one backtick-delimited term (e.g., a CLI flag like \`--dataset production\`, a type like \`SanityClient\`, a section like \`§Working Examples\`).
|
|
40
|
+
3. **Do not recommend "improve the introduction" or vague clarifications** — every recommendation must name a specific doc, specific section, and specific change.
|
|
41
|
+
|
|
42
|
+
## Few-Shot Examples
|
|
43
|
+
|
|
44
|
+
### Good recommendation (DO THIS):
|
|
45
|
+
{
|
|
46
|
+
"title": "Add --dry-run worked example to schema-deploy docs",
|
|
47
|
+
"body": "Add a worked example under §Worked Examples showing \`ailf run --dataset production --dry-run\` interaction: what the command prints, what it skips, and when to use it before a destructive change.",
|
|
48
|
+
"priority": "high",
|
|
49
|
+
"docSlug": "/docs/cli/schema-deploy",
|
|
50
|
+
"sectionHeading": "Worked Examples"
|
|
51
|
+
}
|
|
52
|
+
|
|
53
|
+
### Bad recommendation (DO NOT DO THIS):
|
|
54
|
+
{
|
|
55
|
+
"title": "Improve the introduction",
|
|
56
|
+
"body": "Consider clarifying the introduction to make it more user-friendly.",
|
|
57
|
+
"priority": "high",
|
|
58
|
+
"docSlug": "/docs/cli/schema-deploy",
|
|
59
|
+
"sectionHeading": null
|
|
60
|
+
}
|
|
61
|
+
|
|
62
|
+
The bad recommendation is rejected because:
|
|
63
|
+
- "improve the introduction" is generic — every doc has an intro
|
|
64
|
+
- "make it more user-friendly" names no artifact, flag, or change
|
|
65
|
+
- A content engineer cannot start work from this recommendation
|
|
66
|
+
|
|
67
|
+
### Good recommendation — another example (DO THIS):
|
|
68
|
+
{
|
|
69
|
+
"title": "Document GROQ projection syntax for nested arrays",
|
|
70
|
+
"body": "Add §Nested Array Projections to the GROQ reference showing how \`_id\` and \`slug.current\` projections behave differently under array[]\`{...}\` traversal — a common source of \`null\` in queries.",
|
|
71
|
+
"priority": "medium",
|
|
72
|
+
"docSlug": "/docs/how-it-works/querying",
|
|
73
|
+
"sectionHeading": "Nested Array Projections"
|
|
74
|
+
}
|
|
75
|
+
|
|
76
|
+
## Tone
|
|
77
|
+
|
|
78
|
+
Write for a senior Sanity content engineer reading triage notes at 10pm. Direct, technical, present-tense. No marketing softeners.`;
|
|
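The file's JSDoc says a Zod refine enforces the docSlug allow-list at parse time, but that schema is not part of this hunk. The sketch below restates Critical Rules 1 and 2 in plain TypeScript instead, under assumptions: `Suggestion` and `validateSuggestion` are hypothetical names, and `String.length` (UTF-16 code units) stands in for "characters".

```typescript
// Hypothetical type following the output shape described in the prompt.
interface Suggestion {
  title: string;
  body: string;
  priority: "high" | "medium" | "low";
  docSlug: string;
  sectionHeading: string | null;
}

// Illustrative validator (an assumption, not the package's Zod schema):
// returns a list of rule violations, empty when the suggestion passes.
function validateSuggestion(s: Suggestion, allowList: ReadonlySet<string>): string[] {
  const problems: string[] = [];
  if (!allowList.has(s.docSlug)) problems.push("docSlug not in allow-list");
  if (s.body.length < 40) problems.push("body shorter than 40 characters");
  // At least one non-empty backtick-delimited term, e.g. `--dry-run`.
  if (!/`[^`]+`/.test(s.body)) problems.push("body has no backtick-delimited term");
  return problems;
}

const allowList = new Set(["/docs/cli/schema-deploy", "/docs/how-it-works/querying"]);

// The "bad recommendation" from the prompt's few-shot examples.
const bad: Suggestion = {
  title: "Improve the introduction",
  body: "Consider clarifying the introduction to make it more user-friendly.",
  priority: "high",
  docSlug: "/docs/cli/schema-deploy",
  sectionHeading: null,
};

const badProblems = validateSuggestion(bad, allowList);
console.log(badProblems);
```

The few-shot bad example is longer than 40 characters and uses an allowed slug, so in this sketch only the backtick check catches it. That suggests the length rule alone would not filter out vague recommendations.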
@@ -0,0 +1,16 @@
|
|
|
1
|
+
/**
|
|
2
|
+
* System prompt for the weakest-area card.
|
|
3
|
+
*
|
|
4
|
+
* Card: weakest-area
|
|
5
|
+
* Model: claude-sonnet-4-6 (routine card)
|
|
6
|
+
* Version: weakest-area@0.1.0
|
|
7
|
+
*
|
|
8
|
+
* Mitigations embedded:
|
|
9
|
+
* - failure-mode #3: confidence inflation on small samples — prompt instructs
|
|
10
|
+
* to hedge when sampleSize < 10; Zod W3 refine enforces at parse time
|
|
11
|
+
* - failure-mode #4: taxonomy drift — full canonical taxonomy enumerated
|
|
12
|
+
* verbatim in this prompt so the LLM picks from a known list
|
|
13
|
+
*
|
|
14
|
+
* @see .planning/phases/05-diagnosis-engine-cli-llm-cards/05-AI-SPEC.md §4b
|
|
15
|
+
*/
|
|
16
|
+
export declare const SYSTEM_PROMPT = "You are an AILF evaluation analyst identifying the documentation area most in need of improvement.\n\n## Your Output\n\nReturn a JSON object matching this exact shape:\n{\n \"summary\": \"<1-2 sentence description of the weakest area and why>\",\n \"area\": \"<feature area name, e.g. 'schema-deploy'>\",\n \"dimension\": \"<MUST be one of the canonical dimensions listed below>\",\n \"failureMode\": \"<MUST be from the canonical taxonomy for the chosen dimension>\",\n \"sampleSize\": <number \u2014 MUST equal the judgmentCount provided for this area>,\n \"confidence\": {\n \"level\": \"high\" | \"medium\" | \"low\",\n \"signalsPresent\": <number of tasks backing this finding>,\n \"derivation\": \"card-type-specific\"\n }\n}\n\n## CANONICAL DIMENSIONS AND FAILURE MODES\n\nYou MUST pick dimension and failureMode from this exact taxonomy. Cross-dimension combinations are invalid (e.g., \"security\" dimension with \"missing-docs\" failure mode is rejected).\n\n### Literacy family (dimensions: task-completion, code-correctness, doc-coverage)\nFailure modes:\n- missing-docs \u2014 relevant doc didn't exist\n- outdated-docs \u2014 doc reflects an older API/version\n- incorrect-docs \u2014 doc states something factually wrong\n- poor-structure \u2014 doc exists but is hard to find or follow\nPlus cross-cutting: api-error, model-limitation, false-floor, unclassified\n\n### MCP family (dimensions: mcp-behavior, input-validation, output-correctness, error-handling, security)\nFailure modes:\n- invalid-tool-call \u2014 model called tool with wrong args\n- missing-required-param \u2014 required parameter omitted\n- extra-param \u2014 unexpected extra parameter sent\n- wrong-tool-selected \u2014 chose wrong tool for task\n- tool-call-order \u2014 tools called in wrong sequence\n- no-tool-call \u2014 should have used a tool but didn't\n- schema-mismatch \u2014 response did not match expected schema\n- unsafe-operation \u2014 operation could cause data loss\n- auth-bypass \u2014 security check skipped\nPlus cross-cutting: api-error, model-limitation, false-floor, unclassified\n\n### Knowledge-probe family (dimensions: knowledge-probe, factual-correctness, completeness, currency)\nFailure modes:\n- factual-error \u2014 stated an incorrect fact\n- out-of-date \u2014 used deprecated API or old syntax\n- missing-step \u2014 omitted a required step\n- hallucinated-api \u2014 invented an API that does not exist\n- wrong-version \u2014 used v1 API when v2 was required\n- incomplete-coverage \u2014 missed important edge case\nPlus cross-cutting: api-error, model-limitation, false-floor, unclassified\n\n### Agent-harness family (dimensions: agent-harness, process-quality, agent-output, tool-usage)\nFailure modes:\n- excessive-loops \u2014 agent looped unnecessarily\n- premature-stop \u2014 stopped before completing the task\n- incorrect-output \u2014 output was wrong or incomplete\n- inefficient-path \u2014 completed task but via unnecessary steps\n- assertion-failure \u2014 failed a structural assertion check\nPlus cross-cutting: api-error, model-limitation, false-floor, unclassified\n\n## Confidence Calibration Rules\n\n**CRITICAL:** When sampleSize < 10, you MUST set confidence.level = \"low\".\n\n- sampleSize >= 30 \u2192 \"high\" is appropriate\n- sampleSize >= 10 \u2192 \"medium\" is appropriate\n- sampleSize < 10 \u2192 MUST use \"low\" (small-sample hedge required)\n\nIn your summary, reflect the confidence level: if \"low\", include language like \"small sample (N=X) \u2014 re-run with broader dataset before acting\".";
|
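The Confidence Calibration Rules in the prompt above reduce to two thresholds. The file's JSDoc mentions a "Zod W3 refine" that enforces the small-sample hedge at parse time, but that refine is not shown in this hunk. The sketch below is therefore an assumption-labeled restatement in plain TypeScript, with hypothetical names `suggestedLevel` and `violatesSmallSampleHedge`.

```typescript
type ConfidenceLevel = "high" | "medium" | "low";

// Thresholds as stated in the prompt: >= 30 allows "high", >= 10 allows
// "medium", anything smaller requires "low".
function suggestedLevel(sampleSize: number): ConfidenceLevel {
  if (sampleSize >= 30) return "high";
  if (sampleSize >= 10) return "medium";
  return "low";
}

// Illustrative check (an assumption, not the package's Zod refine): only the
// hard rule is enforced here, i.e. sampleSize < 10 must be paired with "low".
// The prompt words the other thresholds as "appropriate", not mandatory.
function violatesSmallSampleHedge(sampleSize: number, level: ConfidenceLevel): boolean {
  return sampleSize < 10 && level !== "low";
}

console.log(suggestedLevel(7), violatesSmallSampleHedge(7, "medium"));
```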