@roodriigoooo/pi-scrutiny 0.1.0

package/README.md ADDED
@@ -0,0 +1,165 @@
+ # pi-scrutiny
+
+ multi-model deliberation and objective repo verification for the pi coding agent.
+
+ ## why this exists
+
+ openrouter fusion sends one hard prompt to several models at once and merges the answers. it works well for bounded research questions where you want diverse priors on the same problem. that was the original inspiration for this extension.
+
+ but fusion working on bounded questions is not evidence that fusing model outputs helps with long-horizon coding. multi-turn coding workflows involve editing files, running tests, reading feedback, and iterating. no amount of parallel prose from language models settles whether a change to a real repo is correct. tests, type checks, and human review do that.
+
+ so scrutiny takes the part of fusion that is grounded (independent perspectives on a shared question) and drops the part that is not (textual merging as a correctness signal). what you get is:
+
+ - send a hard question to a panel of models, each answering independently from the same packet
+ - fuse hypotheses, constraints, risks, and verification strategies, never patches
+ - let one coding agent act against the repo and tests
+ - the arbiter is objective repo tools, not an llm judge
+
+ consultation, not democracy. deliberation, not patch fusion.
+
+ ## what it does
+
+ one tool, `scrutiny_consult`, and one command, `/scrutiny`, expose six **surfaces**. each produces a distinct non-patch artifact:
+
+ ```text
+ consult      replicate mode: bounded research/synthesis. trade-off explainer runs by default.
+ hypotheses   replicate mode: ranked root causes + confirming evidence + minimal distinguishing tests. no fix yet.
+ criteria     replicate mode: acceptance spec: edge cases, backward-compat, migration, test cases.
+ repo-map     roles mode: compact context (symbols, call paths, tests, config, invariants) for an upcoming edit.
+ risks        roles mode: per-class risk review of a patch (concurrency, reactive-chain, api-compat, security, perf, migration, null, flaky). runs verify.
+ verify       no panel: runs tests/typecheck/lint as objective arbiter. no judge. blocks.
+ ```
+
+ ## panel modes
+
+ two epistemic instruments, not stylistic variants:
+
+ - **replicate** (`consult`, `hypotheses`, `criteria`): every panelist gets the same prompt. diversity comes from model priors. the signal is agreement or disagreement. sharp disagreement is a stop signal: gather more evidence, run a narrower test, or ask the human. do not smooth it into a synthesized answer.
+
+ - **roles** (`repo-map`, `risks`): each panelist gets a different lens. diversity comes from task-splitting. the signal is coverage and gaps, not conflict. a concurrency reviewer saying "avoid X" and a security reviewer not mentioning X is not a disagreement. it is coverage of different facets.
+
+ the analysis layer is honest about which mode it is in. disagreement is computed only in replicate mode. roles mode computes coverage and gaps instead.
+
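The surface-to-mode split described in the README section above can be written down as a small lookup. This is a sketch under assumptions: the type and function names (`Surface`, `SURFACE_MODE`, `panelModeFor`, `signalFor`) are illustrative and are not taken from the package source; only the surface names and the two modes come from the README.

```typescript
// Sketch only: names mirror the README wording, not the package's actual code.
type Surface = "consult" | "hypotheses" | "criteria" | "repo-map" | "risks" | "verify";
type PanelMode = "replicate" | "roles";

// verify runs no panel, so it has no panel mode.
const SURFACE_MODE: Record<Surface, PanelMode | undefined> = {
  consult: "replicate",
  hypotheses: "replicate",
  criteria: "replicate",
  "repo-map": "roles",
  risks: "roles",
  verify: undefined,
};

function panelModeFor(surface: Surface): PanelMode | undefined {
  return SURFACE_MODE[surface];
}

// Which signal is meaningful for a surface, per the README's description.
function signalFor(surface: Surface): "agreement/disagreement" | "coverage/gaps" | "objective checks" {
  const mode = panelModeFor(surface);
  if (mode === "replicate") return "agreement/disagreement";
  if (mode === "roles") return "coverage/gaps";
  return "objective checks";
}
```

Keeping the mapping in one table means the disagreement check can be gated on `panelModeFor(surface) === "replicate"` in a single place.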
+ ## how calls happen
+
+ panelists run **sequentially**, one at a time. each panelist is a `pi` subprocess (`pi --mode json -p --no-session --model <model> --no-tools <prompt>`) that receives the full task packet and returns its analysis as structured markdown. the engine collects each response, then builds a deterministic evidence map (shared vocabulary, contradictions, unique insights, risks, coverage/gaps). optionally, a trade-off explainer model compares the panel outputs. optionally, verify runs objective repo checks.
+
+ only one scrutiny run can be active at a time. if you call `/scrutiny` while a run is in progress, the second call is rejected with a clear message. this is deliberate: parallel scrutiny runs would compete for provider rate limits, make progress harder to read, and add cost without adding signal.
+
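The two rules in the section above (one panelist at a time, one run at a time) might look roughly like the following. This is a hedged sketch, not the engine's implementation: `buildPanelistArgv`, `RunGuard`, and `runPanelInOrder` are hypothetical names, and `ask` is a synchronous stand-in for spawning the `pi` subprocess. Only the argument list is taken from the command line quoted in the README.

```typescript
// Sketch only: all names here are hypothetical.

// Argument vector for one panelist, following the command line quoted in the README.
function buildPanelistArgv(model: string, prompt: string): string[] {
  return ["--mode", "json", "-p", "--no-session", "--model", model, "--no-tools", prompt];
}

// Allows at most one active run; a second start is rejected rather than queued.
class RunGuard {
  private activeRunId: string | undefined;

  tryStart(runId: string): { ok: true } | { ok: false; message: string } {
    if (this.activeRunId !== undefined) {
      return { ok: false, message: `scrutiny run ${this.activeRunId} is still in progress` };
    }
    this.activeRunId = runId;
    return { ok: true };
  }

  finish(runId: string): void {
    if (this.activeRunId === runId) this.activeRunId = undefined;
  }
}

// `ask` stands in for running `pi` with the given arguments and waiting for its output.
function runPanelInOrder(
  models: string[],
  prompt: string,
  ask: (argv: string[]) => string,
): Array<{ model: string; content: string }> {
  const responses: Array<{ model: string; content: string }> = [];
  for (const model of models) {
    // Each call returns before the next panelist starts, so calls never overlap.
    responses.push({ model, content: ask(buildPanelistArgv(model, prompt)) });
  }
  return responses;
}
```

A real engine would presumably await an asynchronous subprocess inside the loop; the ordering guarantee is the same as long as the next panelist starts only after the previous one returns.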
+ ## principles
+
+ - **arbiter is objective, not textual.** correctness is decided by tests, type checks, static analysis, runtime, diff size, architecture constraints, and sometimes human review. an llm judge is weak as the final arbiter of a repo-wide change.
+ - **do not fuse patches.** fusing N patches into one frankenstein diff that no model validated is the failure mode to avoid. fuse uncertainty, evidence, tests, plans, context, risks.
+ - **disagreement is a stop signal only in replicate mode.** same-prompt panelists disagreeing on a load-bearing point means gather more evidence. role-lens panelists not overlapping means coverage, not conflict.
+ - **sequential, not parallel.** panelists run one at a time. one run at a time. scrutiny is deliberation, not a race.
+ - **judge demoted to trade-off explainer.** it never decides correctness. it only explains trade-offs, and only runs for `consult` by default.
+ - **simplicity is protected.** few surfaces, legible activation, simple model selection.
+
+ ## configure
+
+ ```text
+ /scrutiny config edit           # global ~/.pi/agent/scrutiny.json
+ /scrutiny config edit project   # project .pi/scrutiny.json (trusted projects only)
+ /scrutiny config                # show active config + sources
+ ```
+
+ example `scrutiny.json`:
+
+ ```json
+ {
+   "panel": [
+     { "model": "openai-codex/gpt-5.4-mini", "thinking": "low" },
+     { "model": "opencode-go/kimi-k2.7-code", "thinking": "off" }
+   ],
+   "judge": "openai-codex/gpt-5.4-mini",
+   "verifyChecks": [{ "name": "typecheck", "command": "npm", "args": ["run", "check"] }],
+   "panels": {
+     "code-duo": {
+       "surface": "risks",
+       "members": [
+         { "model": "openai-codex/gpt-5.4-mini", "lens": "concurrency", "thinking": "low" },
+         { "model": "opencode-go/kimi-k2.7-code", "lens": "reactive-chain", "thinking": "off" }
+       ],
+       "verify": true,
+       "judgeMode": "off"
+     }
+   }
+ }
+ ```
+
+ `councils`/`panelists` still work as old aliases for `panels`/`members`. `PI_SCRUTINY_*` env vars still work and override config files.
+
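The alias rule in the configure section (`councils`/`panelists` accepted in place of `panels`/`members`) could be handled by a small normalization step. This is a sketch under stated assumptions: the interfaces below are inferred from the example `scrutiny.json` only, `normalizeConfig` is a hypothetical helper, and the choice that the new key wins when both are present is an assumption the README does not confirm.

```typescript
// Sketch only: shapes are inferred from the README example, not from the package's schema.
interface PanelMember { model: string; lens?: string; thinking?: string }
interface PanelPreset { surface?: string; members: PanelMember[]; verify?: boolean; judgeMode?: string }
interface ScrutinyConfig {
  panel?: PanelMember[];
  judge?: string;
  verifyChecks?: Array<{ name: string; command: string; args?: string[] }>;
  panels?: Record<string, PanelPreset>;
}

// Raw input may use the legacy keys.
type RawPreset = Omit<PanelPreset, "members"> & { members?: PanelMember[]; panelists?: PanelMember[] };
type RawConfig = Omit<ScrutinyConfig, "panels"> & {
  panels?: Record<string, RawPreset>;
  councils?: Record<string, RawPreset>;
};

function normalizeConfig(raw: RawConfig): ScrutinyConfig {
  const { councils, panels, ...rest } = raw;
  const source = panels ?? councils; // assumption: the new key wins when both are present
  if (!source) return rest;
  const normalized: Record<string, PanelPreset> = {};
  for (const name of Object.keys(source)) {
    const { panelists, members, ...presetRest } = source[name];
    // Map the legacy `panelists` key onto `members`.
    normalized[name] = { ...presetRest, members: members ?? panelists ?? [] };
  }
  return { ...rest, panels: normalized };
}
```

Normalizing once at load time keeps the rest of the code free of alias checks.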
+ ## install
+
+ Pi Scrutiny is a Pi package. Until the npm package is published, install it directly from GitHub:
+
+ ```bash
+ pi install git:github.com/roodriigoooo/pi-scrutiny
+ ```
+
+ To install it for a single project, write the package entry to project settings instead:
+
+ ```bash
+ pi install -l git:github.com/roodriigoooo/pi-scrutiny
+ ```
+
+ For local development from a checkout:
+
+ ```bash
+ git clone https://github.com/roodriigoooo/pi-scrutiny.git
+ cd pi-scrutiny
+ npm install
+ npm run check
+ pi install "$(pwd)"
+ ```
+
+ To try the extension for one session without installing it:
+
+ ```bash
+ pi -e ./extensions/scrutiny.ts
+ ```
+
+ After installation, restart pi and run `/scrutiny help`.
+
+ ## use
+
+ ```text
+ /scrutiny                                # open palette (surface-first)
+ /scrutiny models
+ /scrutiny runs                           # recent runs + artifact paths (this session)
+ /scrutiny history                        # interactive searchable artifact history
+ /scrutiny history list retry             # text history for scripts
+ /scrutiny panels                         # list saved panel presets
+ /scrutiny config                         # show active config + sources
+ /scrutiny config edit                    # edit global config in pi
+ /scrutiny verify:                        # run objective checks now
+ /scrutiny @code-duo: review this patch   # run a saved panel
+ /scrutiny risks: review this webflux retry patch
+ /scrutiny hypotheses: intermittent offset commit on kafka consumer
+ /scrutiny criteria: migrate orders service to new idempotency key
+ /scrutiny ask compare these two implementation plans
+ ```
+
+ press **ctrl+p** in the palette to cycle through saved panels.
+
+ or let the main model call `scrutiny_consult` when the extra spend is worth it.
+
+ ## flow
+
+ surfaces run **inline** and stream a compact status line while the panel works. press **esc** to cancel. deliberation takes time; that is expected.
+
+ runs persist on disk under `.pi/scrutiny/<run-id>/` (`packet.md`, `responses.json`, per-surface JSON, `verify.json`, `result.json`). `/scrutiny history` opens searchable artifact history backed by `.pi/scrutiny/index.jsonl`.
+
+ every brief ends with one machine-actionable line: `RECOMMENDED NEXT ACTION: ...`. that is what the main agent acts on. prose lives in the expanded view and history.
+
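Because every brief ends with a `RECOMMENDED NEXT ACTION: ...` line, a consumer can pull that line out without parsing the rest of the markdown. A minimal sketch, assuming the brief is available as a plain string; `extractNextAction` is a hypothetical helper, not something the package is known to export.

```typescript
// Sketch only: extractNextAction is a hypothetical helper.
const ACTION_PREFIX = "RECOMMENDED NEXT ACTION:";

// Returns the text of the last line that starts with the prefix, or undefined if none does.
function extractNextAction(brief: string): string | undefined {
  // Scan from the end because the action line is the final line of a brief.
  for (const raw of brief.split(/\r?\n/).reverse()) {
    const line = raw.trim();
    if (line.indexOf(ACTION_PREFIX) === 0) return line.slice(ACTION_PREFIX.length).trim();
  }
  return undefined;
}
```

Returning `undefined` instead of throwing lets the caller decide how to treat a brief that lacks the line, for example a failure brief.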
+ ## defaults
+
+ - 2 panelists is the intended shape for deliberation
+ - `consult`, `hypotheses`, `criteria` use replicate mode (same prompt, disagreement is signal)
+ - `repo-map`, `risks` use roles mode (assigned lenses, coverage/gaps is signal)
+ - panelists run sequentially, one at a time, `--no-tools` by default
+ - only one scrutiny run active at a time
+ - panel timeout: 180s per panelist (configurable via `PI_SCRUTINY_PANEL_TIMEOUT_MS`)
+ - no auto-spend
+ - trade-off explainer skipped except `consult` (or `judgeMode: on`)
+ - `risks` and `verify` run objective repo checks
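The per-panelist timeout default listed above could be resolved along these lines. This is a sketch with assumptions: `resolvePanelTimeoutMs` is a hypothetical name, and falling back to the default on non-numeric or non-positive values is a guess at reasonable behavior, not documented behavior. The variable name and the 180s default come from the README.

```typescript
// Sketch only: resolvePanelTimeoutMs is hypothetical.
const DEFAULT_PANEL_TIMEOUT_MS = 180_000; // 180s per panelist, per the README

function resolvePanelTimeoutMs(env: Record<string, string | undefined>): number {
  const raw = env["PI_SCRUTINY_PANEL_TIMEOUT_MS"];
  if (raw === undefined || raw.trim() === "") return DEFAULT_PANEL_TIMEOUT_MS;
  const parsed = Number(raw);
  // Assumption: invalid or non-positive values fall back to the default.
  return isFinite(parsed) && parsed > 0 ? Math.floor(parsed) : DEFAULT_PANEL_TIMEOUT_MS;
}
```

In a Node process the argument would typically be `process.env`; taking the environment as a parameter keeps the function easy to test.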
@@ -0,0 +1,335 @@
+ import type { PanelMode, ScrutinyAnalysis, PanelResponse, VerifyReport, ScrutinySurface } from "./types.js";
+ import { truncate } from "./util.js";
+
+ export function formatFailureBrief(input: {
+   surface: import("./types.js").ScrutinySurface;
+   runId: string;
+   runDir: string;
+   responses: PanelResponse[];
+   failedModels: Array<{ model: string; error: string }>;
+   reason: string;
+ }): string {
+   const lines: string[] = [];
+   lines.push(`# Scrutiny ${input.surface} failed`);
+   lines.push(`reason: ${input.reason}`);
+   lines.push(`run-id: ${input.runId}`);
+   lines.push(`artifacts: ${input.runDir}`);
+   lines.push("");
+   lines.push("Do NOT synthesize an answer from this. There is no usable panel evidence.");
+   lines.push("Tell the user the panel failed and show the reason + failed models below. Suggest fixing config (PI_SCRUTINY_PANEL, keys, network) and rerunning.");
+   lines.push("");
+   if (input.failedModels.length > 0) {
+     lines.push("## Failed panelists");
+     for (const failed of input.failedModels) lines.push(`- ${failed.model}: ${truncate(failed.error, 240)}`);
+   }
+   return lines.join("\n").trim();
+ }
+
+ export function detectMush(responses: PanelResponse[]): string | undefined {
+   if (responses.length === 0) return undefined;
+   const ok = responses.filter((response) => response.status === "ok" && response.content.trim());
+   if (ok.length === 0) return undefined; // all-failed handled separately
+   const allTiny = ok.every((response) => response.content.trim().length < 80);
+   if (allTiny) return "all ok responses are near-empty (< 80 chars)";
+   const allHeadersOnly = ok.every((response) => {
+     const body = response.content.replace(/^#+\s.*$/gm, "").replace(/[\s`*-]/g, "").trim();
+     return body.length < 40;
+   });
+   if (allHeadersOnly) return "all ok responses contain only template headings, no substance";
+   const allToolPreambles = ok.every((response) => isToolIntentPreamble(response.content));
+   if (allToolPreambles) return "all ok responses are tool-use preambles; panel likely ignored no-tools packet";
+   return undefined;
+ }
+
+ function isToolIntentPreamble(text: string): boolean {
+   const compact = text.trim().replace(/\s+/g, " ");
+   if (compact.length > 600) return false;
+   return /\b(i'?ll|i will|let me|need to|first,? i|i should)\b.{0,120}\b(inspect|read|open|check|look at|call|use|run|grep)\b.{0,120}\b(repo|files?|tools?|commands?|bash|grep|read|tests?)\b/i.test(compact)
+     || /\bcalling\b.{0,80}\b(repo reads|tools?|read|grep|bash)\b/i.test(compact)
+     || /\b(can'?t|cannot|don'?t)\b.{0,80}\b(access|inspect|read)\b.{0,80}\b(repo|files?)\b/i.test(compact);
+ }
+
+ export function buildDeterministicAnalysis(responses: PanelResponse[], panelMode: PanelMode | undefined = "replicate"): ScrutinyAnalysis {
+   const mode = panelMode ?? "replicate";
+   const ok = responses.filter((response) => response.status === "ok" && response.content.trim());
+   const risks = unique(ok.flatMap((response) => extractRiskLines(response.content)).slice(0, 8));
+   const uniqueInsights = ok.flatMap((response) =>
+     extractBullets(response.content)
+       .filter((line) => isDistinct(line, ok.filter((other) => other !== response).map((other) => other.content)))
+       .slice(0, 3)
+       .map((insight) => ({ model: response.model, insight: truncate(insight, 280) })),
+   );
+   const sharedTerms = mode === "replicate" ? sharedKeywords(ok.map((response) => response.content)) : [];
+   const contradictions = mode === "replicate" ? detectContradictions(ok) ?? [] : [];
+   const coverage = mode === "roles" ? roleCoverage(responses) : undefined;
+   const consensus = mode === "replicate"
+     ? [
+       `${ok.length}/${responses.length} panelists returned usable output.`,
+       sharedTerms.length ? `Shared technical vocabulary: ${sharedTerms.join(", ")}.` : "No strong lexical consensus detected; compare panel stances before synthesizing.",
+     ]
+     : [
+       `${ok.length}/${responses.length} role lenses returned usable output.`,
+       "Roles mode: coverage/gaps are signal; disagreement stop-signal disabled.",
+     ];
+   const blindSpots = mode === "roles"
+     ? roleGaps(responses)
+     : ["Deterministic analysis does not infer all semantic contradictions; main Pi model should compare panel stances before final answer."];
+   return {
+     consensus,
+     contradictions,
+     unique_insights: uniqueInsights.slice(0, 8),
+     risks,
+     coverage,
+     blind_spots: blindSpots,
+     confidence: ok.length >= 2 ? (contradictions.length ? "low" : "medium") : "low",
+     disagreement_signal: mode === "replicate" && contradictions.length > 0,
+   };
+ }
+
+ export function formatScrutinyBrief(input: {
+   surface: ScrutinySurface;
+   panelMode?: PanelMode;
+   analysis?: ScrutinyAnalysis;
+   responses: PanelResponse[];
+   failedModels: Array<{ model: string; error: string }>;
+   judgeRan: boolean;
+   verify?: VerifyReport;
+   llmPanelExcerptChars: number;
+   budgetLine: string;
+ }): string {
+   const ok = input.responses.filter((response) => response.status === "ok");
+   const lines: string[] = [];
+   lines.push(`# Scrutiny ${input.surface} result`);
+   lines.push(input.budgetLine);
+   if (input.panelMode) lines.push(panelModeBriefLine(input.surface, input.panelMode));
+   lines.push(input.judgeRan ? "evidence map: trade-off explainer ran" : "evidence map: deterministic only; main Pi model synthesizes");
+   if (input.verify) lines.push(verifyLine(input.verify));
+   lines.push("");
+
+   if (input.analysis) {
+     if (input.analysis.disagreement_signal) {
+       lines.push("## ⚠ disagreement signal");
+       lines.push("Panel disagrees on a load-bearing point. Treat this as a stop signal: gather more evidence, run a narrower test, or ask the human. Do not smooth this into a synthesized answer.");
+       lines.push("");
+     }
+     lines.push(`## Evidence map`);
+     pushList(lines, "Shared signals", input.analysis.consensus);
+     if (input.panelMode !== "roles") pushContradictions(lines, input.analysis.contradictions);
+     pushUnique(lines, input.analysis.unique_insights);
+     pushList(lines, "Risks", input.analysis.risks);
+     pushList(lines, "Coverage", input.analysis.coverage);
+     pushList(lines, "Blind spots", input.analysis.blind_spots);
+     if (input.analysis.confidence) lines.push(`confidence: ${input.analysis.confidence}`);
+     lines.push("");
+   }
+
+   if (ok.length > 0) {
+     lines.push(`## Panel excerpts`);
+     for (const response of ok) {
+       lines.push(`### ${response.model} (${response.role})`);
+       lines.push(truncate(response.content, input.llmPanelExcerptChars));
+       lines.push("");
+     }
+   }
+
+   if (input.failedModels.length > 0) {
+     lines.push(`## Failed panelists`);
+     for (const failed of input.failedModels) lines.push(`- ${failed.model}: ${truncate(failed.error, 240)}`);
+   }
+
+   if (input.verify) {
+     lines.push("");
+     lines.push("## Verify (objective arbiter)");
+     lines.push(formatVerifyReport(input.verify));
+   }
+
+   lines.push("", surfaceActionLine(input.surface));
+   return lines.join("\n").trim();
+ }
+
+ export function formatVerifyBrief(input: { verify: VerifyReport; budgetLine: string }): string {
+   const lines: string[] = [];
+   lines.push(`# Scrutiny verify result`);
+   lines.push(input.budgetLine);
+   lines.push("");
+   lines.push("Verify is the objective arbiter. No LLM judge involved.");
+   lines.push("");
+   lines.push(formatVerifyReport(input.verify));
+   lines.push("", "RECOMMENDED NEXT ACTION: act on pass/fail above. Fix failing checks before any panel deliberation weighs in.");
+   return lines.join("\n").trim();
+ }
+
+ export function formatVerifyReport(verify: VerifyReport): string {
+   const lines: string[] = [];
+   lines.push(`${verify.passed} passed · ${verify.failed} failed · ${verify.skipped} skipped · ${formatMs(verify.durationMs)}`);
+   if (verify.diffStat) {
+     lines.push("");
+     lines.push("### diff stat");
+     lines.push("```");
+     lines.push(verify.diffStat.trim());
+     lines.push("```");
+   }
+   for (const check of verify.checks) {
+     const icon = check.status === "pass" ? "✓" : check.status === "fail" ? "✕" : check.status === "error" ? "!" : "–";
+     lines.push(`- ${icon} ${check.name} (${check.status}, ${formatMs(check.durationMs)})`);
+     if (check.status !== "pass" && check.output?.trim()) lines.push(` \`\`\`\n ${truncate(check.output.trim(), 800).replace(/\n/g, "\n ")}\n \`\`\``);
+   }
+   return lines.join("\n");
+ }
+
+ function verifyLine(verify: VerifyReport): string {
+   return `verify: ${verify.passed} pass · ${verify.failed} fail · ${verify.skipped} skipped`;
+ }
+
+ function panelModeBriefLine(surface: ScrutinySurface, panelMode: PanelMode): string {
+   return panelMode === "replicate"
+     ? `[${surface}] replicate · agreement/disagreement is signal.`
+     : `[${surface}] roles · coverage/gaps are signal; disagreement stop-signal disabled.`;
+ }
+
+ function surfaceActionLine(surface: ScrutinySurface): string {
+   switch (surface) {
+     case "consult":
+       return "RECOMMENDED NEXT ACTION: synthesize from evidence above. Treat panel as consultation, not authority.";
+     case "hypotheses":
+       return "RECOMMENDED NEXT ACTION: run best distinguishing test(s), then act against repo. Do not merge a fix until hypothesis is confirmed by evidence.";
+     case "criteria":
+       return "RECOMMENDED NEXT ACTION: implement against fused spec above. Run verify after edit.";
+     case "repo-map":
+       return "RECOMMENDED NEXT ACTION: use map above as context for one coding agent to edit. Do not fuse edits from multiple panelists.";
+     case "risks":
+       return "RECOMMENDED NEXT ACTION: address findings by running suggested checks/tests, then editing. Do not merge risk-review prose into patch.";
+     case "verify":
+       return "RECOMMENDED NEXT ACTION: act on pass/fail above. Arbiter is checks, not any model.";
+   }
+ }
+
+ function pushList(lines: string[], title: string, items: string[] | undefined): void {
+   if (!items || items.length === 0) return;
+   lines.push(`### ${title}`);
+   for (const item of items.slice(0, 8)) lines.push(`- ${item}`);
+ }
+
+ function pushContradictions(lines: string[], items: ScrutinyAnalysis["contradictions"]): void {
+   if (!items || items.length === 0) return;
+   lines.push(`### Contradictions`);
+   for (const item of items.slice(0, 6)) {
+     lines.push(`- ${item.topic}`);
+     for (const stance of item.stances.slice(0, 4)) lines.push(` - ${stance.model}: ${stance.stance}`);
+   }
+ }
+
+ function pushUnique(lines: string[], items: ScrutinyAnalysis["unique_insights"]): void {
+   if (!items || items.length === 0) return;
+   lines.push(`### Unique insights`);
+   for (const item of items.slice(0, 8)) lines.push(`- ${item.model}: ${item.insight}`);
+ }
+
+ function roleCoverage(responses: PanelResponse[]): string[] {
+   const ok = responses.filter((response) => response.status === "ok" && response.content.trim());
+   const failed = responses.filter((response) => response.status !== "ok" || !response.content.trim());
+   return unique([
+     ok.length ? `Covered lenses: ${ok.map((response) => response.role).join(", ")}.` : undefined,
+     failed.length ? `Missing/failed lenses: ${failed.map((response) => response.role).join(", ")}.` : undefined,
+   ].filter((item): item is string => Boolean(item)));
+ }
+
+ function roleGaps(responses: PanelResponse[]): string[] {
+   const failed = responses.filter((response) => response.status !== "ok" || !response.content.trim());
+   const missing = responses.flatMap((response) => extractMissingContextLines(response.content));
+   return unique([
+     "Roles mode does not compare panelists for contradiction; inspect missing lenses and uncovered risk classes.",
+     ...failed.map((response) => `No usable coverage from ${response.role} (${response.model}).`),
+     ...missing,
+   ]).slice(0, 8);
+ }
+
+ function extractMissingContextLines(text: string): string[] {
+   return extractBullets(text)
+     .filter((line) => /\b(missing|not shown|not in (the )?packet|insufficient|unknown|cannot determine|can't determine|need(?:s)? to inspect|must inspect|would need|need more evidence|not enough evidence|gap|uncovered)\b/i.test(line))
+     .map((line) => truncate(line, 280));
+ }
+
+ function extractBullets(text: string): string[] {
+   return text
+     .split(/\r?\n/)
+     .map((line) => line.trim().replace(/^[-*•]\s+/, ""))
+     .filter((line) => line.length >= 24 && line.length <= 400)
+     .slice(0, 30);
+ }
+
+ function extractRiskLines(text: string): string[] {
+   return extractBullets(text).filter((line) => /\b(risk|trade-?off|caution|fail|failure|bug|security|regression|uncertain|unknown|race|deadlock|idempoten|ordering)\b/i.test(line));
+ }
+
+ function isDistinct(line: string, otherTexts: string[]): boolean {
+   const tokens = keywords(line);
+   if (tokens.length < 3) return false;
+   return !otherTexts.some((text) => tokens.filter((token) => text.toLowerCase().includes(token)).length >= Math.min(3, tokens.length));
+ }
+
+ function sharedKeywords(texts: string[]): string[] {
+   if (texts.length < 2) return [];
+   const counts = new Map<string, number>();
+   for (const text of texts) {
+     for (const token of new Set(keywords(text))) counts.set(token, (counts.get(token) ?? 0) + 1);
+   }
+   return [...counts.entries()]
+     .filter(([, count]) => count >= Math.min(2, texts.length))
+     .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
+     .slice(0, 10)
+     .map(([token]) => token);
+ }
+
+ function detectContradictions(responses: PanelResponse[]): ScrutinyAnalysis["contradictions"] {
+   const ok = responses.filter((response) => response.status === "ok" && response.content.trim());
+   if (ok.length < 2) return [];
+   const contradictions: NonNullable<ScrutinyAnalysis["contradictions"]> = [];
+   const negation = /\b(not|no|never|should not|don'?t|won'?t|cannot|can'?t|avoid|wrong|incorrect|disagree)\b/i;
+   for (let i = 0; i < ok.length; i++) {
+     for (let j = i + 1; j < ok.length; j++) {
+       const a = ok[i].content.toLowerCase();
+       const b = ok[j].content.toLowerCase();
+       const shared = sharedKeywords([ok[i].content, ok[j].content]).filter((term) => term.length >= 6);
+       const aNeg = negation.test(ok[i].content);
+       const bNeg = negation.test(ok[j].content);
+       if (shared.length >= 2 && aNeg !== bNeg) {
+         const topic = shared.slice(0, 3).join(" / ");
+         contradictions.push({
+           topic,
+           stances: [
+             { model: ok[i].model, stance: truncate(firstSentenceAround(ok[i].content, shared[0]), 160) },
+             { model: ok[j].model, stance: truncate(firstSentenceAround(ok[j].content, shared[0]), 160) },
+           ],
+         });
+       }
+     }
+   }
+   return contradictions.slice(0, 4);
+ }
+
+ function firstSentenceAround(text: string, term: string): string {
+   const idx = text.toLowerCase().indexOf(term);
+   if (idx < 0) return text.split(/\r?\n/).find((line) => line.trim().length > 24) ?? text.slice(0, 160);
+   const start = Math.max(0, text.lastIndexOf("\n", idx) + 1);
+   const end = text.indexOf("\n", idx);
+   return text.slice(start, end < 0 ? start + 160 : end).trim();
+ }
+
+ function keywords(text: string): string[] {
+   return text
+     .toLowerCase()
+     .match(/[a-z][a-z0-9_-]{4,}/g)?.filter((token) => !STOP.has(token)) ?? [];
+ }
+
+ function unique(items: string[]): string[] {
+   return [...new Set(items.map((item) => item.trim()).filter(Boolean))];
+ }
+
+ function formatMs(ms: number): string {
+   return ms < 1_000 ? `${ms}ms` : `${(ms / 1_000).toFixed(1)}s`;
+ }
+
+ const STOP = new Set([
+   "about", "after", "again", "answer", "because", "before", "could", "first", "model", "panel", "there", "these", "thing", "which", "would", "should", "their", "while", "where", "under", "using", "without", "recommendation", "evidence", "position", "panelist", "scrutiny",
+ ]);
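To see what the pairwise check in `detectContradictions` reacts to, here is a condensed, standalone restatement of its core condition. It is a sketch, not the file's code: `tokens`, `sharedLongTerms`, and `looksContradictory` are illustrative names, and the stop-word filter and the top-10 cap from `sharedKeywords` are omitted for brevity. The negation pattern and the thresholds (two or more shared terms of six or more characters, negation present in exactly one text) are copied from the source above.

```typescript
// Sketch only: a simplified restatement of the pairwise check, with illustrative names.
const NEGATION = /\b(not|no|never|should not|don'?t|won'?t|cannot|can'?t|avoid|wrong|incorrect|disagree)\b/i;

// Lowercase tokens of 5+ characters that start with a letter (stop-word filtering omitted).
function tokens(text: string): string[] {
  return text.toLowerCase().match(/[a-z][a-z0-9_-]{4,}/g) ?? [];
}

// Distinct terms of 6+ characters that appear in both texts, sorted alphabetically.
function sharedLongTerms(a: string, b: string): string[] {
  const inB = tokens(b);
  const seen: string[] = [];
  for (const token of tokens(a)) {
    if (token.length >= 6 && inB.indexOf(token) >= 0 && seen.indexOf(token) < 0) seen.push(token);
  }
  return seen.sort();
}

// Flags a pair when it shares 2+ long terms and exactly one text contains a negation word.
function looksContradictory(a: string, b: string): boolean {
  return sharedLongTerms(a, b).length >= 2 && NEGATION.test(a) !== NEGATION.test(b);
}
```

For example, "Use idempotency keys on the consumer; retries are safe." against "Do not rely on idempotency keys; the consumer can reorder commits." shares `consumer` and `idempotency`, and only the second text contains a negation, so the pair is flagged. Note that the negation test runs over the whole response, so a negation word anywhere in one text is enough to trip it; this matches the `blind_spots` note in the source that the deterministic analysis does not infer all semantic contradictions.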