@tangle-network/agent-eval 0.28.0 → 0.29.0
- package/CHANGELOG.md +87 -0
- package/dist/index.d.ts +139 -105
- package/dist/index.js +142 -94
- package/dist/index.js.map +1 -1
- package/dist/openapi.json +1 -1
- package/package.json +1 -1
package/CHANGELOG.md
CHANGED
@@ -1,5 +1,92 @@
 # Changelog
 
+## 0.29.0 — 2026-05-19
+
+### Analyst kinds + cross-run findings context
+
+Builds on 0.28.0's analyst registry. Ships four trace-analyst **kinds**
+that emit graded findings through native Ax structured output (no more
+flat-defaulted bullet lists) and a cross-run findings context the
+registry can inject into prompts so each kind sees what the prior run
+already surfaced.
+
+### Added
+
+- **`createTraceAnalystKind(spec, opts)`** (`src/analyst/kind-factory.ts`) —
+  turns a `TraceAnalystKindSpec` into a registry-ready
+  `Analyst<TraceAnalysisStore>`. Ax signature is
+  `'question:string -> findings:json[]'`; the Zod boundary in
+  `finding-signature.ts` rejects malformed rows instead of lifting them
+  with default severity. Supports `versionSuffix` for optimizer-fitted
+  prompts (MIPRO / GEPA / Bootstrap) and a per-row `postProcess` hook.
+- **`RawAnalystFinding`** Zod schema + **`RAW_FINDING_SCHEMA_PROMPT`**
+  string embedded into kind actor prompts so the model and the parser
+  share one source of truth.
+- **`TraceToolGroupName`** + **`buildTraceToolsForGroup`**
+  (`src/analyst/tool-groups.ts`) — five named tool subsets
+  (`all | discovery | discoveryAndRead | discoveryAndSearch | targeted`);
+  unknown group names throw.
+- **Four shipping kinds** (`src/analyst/kinds/`):
+  - `FAILURE_MODE_KIND_SPEC` — clusters dataset failures into distinct
+    modes (maxDepth 3, parallel 4, all tools).
+  - `KNOWLEDGE_GAP_KIND_SPEC` — attributes missing/stale knowledge to
+    `agent-knowledge:wiki:*`, `websearch:outdated:*`, `tool-doc:*`,
+    `system-prompt:*`, `memory:*` (maxDepth 2, discoveryAndSearch).
+  - `KNOWLEDGE_POISONING_KIND_SPEC` — dual-verify analyst for
+    confident-but-wrong actions (maxDepth 2, all tools).
+  - `IMPROVEMENT_KIND_SPEC` — converts upstream failure / gap /
+    poisoning findings into concrete locus-named edits with leverage
+    grades (maxDepth 3, all tools).
+- **`DEFAULT_TRACE_ANALYST_KINDS`** — the four specs in canonical run
+  order (failure-mode → gap → poisoning → improvement).
+- **`priorFindings` on `AnalystContext`** — registry injects findings
+  from a prior `AnalystRunResult` into every analyst's context, so an
+  improvement-kind run can see the failure-mode findings the previous
+  pass surfaced. Kinds reference prior findings via
+  `evidence_uri: "finding://<id>"`.
+
+### Deprecated
+
+- `createTraceAnalystAdapter` (`src/analyst/adapters.ts`) — the legacy
+  bullet-list lifter. Kept for one minor while consumers migrate to
+  `createTraceAnalystKind`.
+
+## 0.28.0 — 2026-05-19
+
+### Analyst registry + findings envelope
+
+A generic, model-agnostic orchestration layer over the existing
+analyzers (`analyzeTraces`, `MultiLayerVerifier`, `RunCritic`,
+`SemanticConceptJudge`, `JudgeFn`). One contract, one runner, one
+persistence path. Reusable by VB operator bench, leaderboard submission
+pipeline, and orchestrator on-completion reports with the same code.
+
+### Added
+
+- **`Analyst<TInput>`** contract + **`AnalystFinding`** envelope with
+  sha-stable `finding_id` (`src/analyst/types.ts`).
+- **`AnalystRegistry`** (`src/analyst/registry.ts`) — register/list/run
+  with input routing by `inputKind`, per-analyst isolation, equal-split
+  budget by default, per-analyst telemetry.
+- **`AnalystHooks`** — `onBeforeAnalyze | onAfterAnalyze | onError |
+  onComplete`. Generic seam for telemetry, cost ingestion, rotation,
+  error → finding conversion.
+- **`BudgetPolicy`** — `{ totalUsd, weights, allocate }`. Default
+  equal-split; weighted split or custom `allocate(args)` for precision.
+- **`ChatClient`** abstraction (`src/analyst/chat-client.ts`) over
+  `router | sandbox-sdk | cli-bridge | direct-provider | mock` so
+  analyst code is transport-agnostic; `wrapLlmClient` races the call
+  against `ChatCallOpts.signal`.
+- **`FindingsStore`** + **`diffFindings(prev, cur, { isMaterial })`**
+  (`src/analyst/findings-store.ts`) — locked JSONL persistence + cross-run
+  diff (appeared / disappeared / persisted / changed) with a pluggable
+  materiality predicate (`defaultIsMaterial` exported for layering).
+- Five **adapter** factories (`src/analyst/adapters.ts`) that lift
+  existing primitives into the contract without re-implementing them:
+  `createTraceAnalystAdapter`, `createVerifierAdapter`,
+  `createRunCriticAdapter`, `createJudgeAdapter`,
+  `createSemanticConceptJudgeAdapter`.
+
 ## 0.27.2 — 2026-05-17
 
 ### Corpus-wide inter-rater agreement primitive
package/dist/index.d.ts
CHANGED
@@ -690,6 +690,16 @@ interface AnalystContext {
      * analyst code.
      */
     chat?: ChatClient;
+    /**
+     * Findings from a prior run the operator wants the analyst to see as
+     * retrieval context. Kinds that take advantage of cross-run memory
+     * (failure-mode "I saw this cluster last run", knowledge-gap "the wiki
+     * page I asked for is still missing") render these into the actor's
+     * working set. Filtering is the operator's job: pass the slice that
+     * matches the analyst's id, or pass everything and let the kind
+     * filter. Empty / absent means no cross-run context.
+     */
+    priorFindings?: ReadonlyArray<AnalystFinding>;
     /** Free-form runtime tags (env, host, op). Findings can echo these into metadata. */
     tags?: Record<string, string>;
     /** Logger callback — analysts SHOULD prefer this over console.* for testability. */
@@ -842,6 +852,54 @@ interface SemanticConceptJudgeAdapterOpts {
 }
 declare function createSemanticConceptJudgeAdapter(opts?: SemanticConceptJudgeAdapterOpts): Analyst<SemanticConceptJudgeInput>;
 
+/**
+ * Typed Ax output for analyst findings.
+ *
+ * Replaces the legacy `findings:string[]` pattern (where every bullet
+ * became a flat-severity `AnalystFinding`) with a structured object
+ * array. Ax binds the field as `findings:json[]` so the provider emits
+ * native structured output; at the kind-factory boundary we Zod-validate
+ * each emitted finding so malformed rows fail loud instead of being
+ * silently lifted with default severity.
+ *
+ * Why not `f.object().array()` directly in the signature? The Ax
+ * signature string `question:string -> findings:json[]` already lets
+ * the provider emit JSON arrays. A Zod boundary is required either
+ * way (the provider can return any JSON), and Zod gives us a single
+ * validation surface independent of which Ax version is installed.
+ */
+
+declare const ANALYST_SEVERITIES: readonly ["critical", "high", "medium", "low", "info"];
+declare const RawAnalystFindingSchema: z.ZodObject<{
+    severity: z.ZodEnum<{
+        info: "info";
+        critical: "critical";
+        medium: "medium";
+        low: "low";
+        high: "high";
+    }>;
+    claim: z.ZodString;
+    subject: z.ZodOptional<z.ZodString>;
+    evidence_uri: z.ZodString;
+    evidence_excerpt: z.ZodOptional<z.ZodString>;
+    confidence: z.ZodNumber;
+    rationale: z.ZodOptional<z.ZodString>;
+    recommended_action: z.ZodOptional<z.ZodString>;
+}, z.core.$strict>;
+type RawAnalystFinding = z.infer<typeof RawAnalystFindingSchema>;
+/**
+ * Description embedded into the actor prompt so the LLM knows what
+ * shape to emit. Kept here so kinds share one source of truth rather
+ * than restating the schema in every prompt.
+ */
+declare const RAW_FINDING_SCHEMA_PROMPT = "Each finding MUST be a JSON object with these fields:\n - severity: one of \"critical\" | \"high\" | \"medium\" | \"low\" | \"info\"\n - claim: one-sentence statement (max 2000 chars)\n - subject?: the leaf id, agent id, span id, tool name, or noun phrase the finding is about\n - evidence_uri: \"span://<trace_id>/<span_id>\" for trace evidence, \"artifact://<relative-path>\" for files, \"metric://<name>\" for named scalars \u2014 ALWAYS cite a real id surfaced by the tools\n - evidence_excerpt?: short quote (<=2000 chars) from the cited span/artifact\n - confidence: number 0..1 \u2014 0.9+ when backed by exact quotes, 0.6-0.8 for inferred patterns, <0.5 for speculative\n - rationale?: one or two sentences explaining the reasoning\n - recommended_action?: concrete change phrased as an imperative (\"Add ...\", \"Replace ...\", \"Stop ...\") \u2014 omit when the finding is purely descriptive\n\nEmit an empty array when the question has no findings to report. Do not fabricate evidence.";
+/**
+ * Validate one row emitted by the LLM. Returns the typed finding on
+ * success; returns `null` and logs the reason on failure so the kind
+ * factory can skip-and-count rather than abort the whole analyst run.
+ */
+declare function parseRawFinding(row: unknown, log?: (msg: string, fields?: Record<string, unknown>) => void): RawAnalystFinding | null;
+
 /**
  * FindingsStore — durable persistence for AnalystFinding rows + a diff
  * helper so we can answer "what changed since the last run?" without
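
The `parseRawFinding` doc comment in this hunk describes a skip-and-count boundary: a bad row yields `null` plus a logged reason instead of aborting the run. The package does this with a strict Zod schema. The sketch below is a dependency-free approximation of the same idea, not the package's code. `parseRowSketch` and `parseAllSketch` are hypothetical names, and the 0..1 bound on `confidence` is an assumption taken from the prompt text rather than from the schema type.

```typescript
const SEVERITIES = ["critical", "high", "medium", "low", "info"] as const;
type SketchSeverity = (typeof SEVERITIES)[number];

// Hand-written mirror of the fields listed in the schema above.
interface SketchRawFinding {
  severity: SketchSeverity;
  claim: string;
  evidence_uri: string;
  confidence: number;
  subject?: string;
  evidence_excerpt?: string;
  rationale?: string;
  recommended_action?: string;
}

const REQUIRED_STRINGS = ["claim", "evidence_uri"] as const;
const OPTIONAL_STRINGS = ["subject", "evidence_excerpt", "rationale", "recommended_action"] as const;
const KNOWN_KEYS: readonly string[] = ["severity", "confidence", ...REQUIRED_STRINGS, ...OPTIONAL_STRINGS];

// Returns the typed row, or null (after reporting why) when the row is malformed.
function parseRowSketch(row: unknown, log?: (msg: string) => void): SketchRawFinding | null {
  const reject = (reason: string): null => {
    if (log) log(reason);
    return null;
  };
  if (typeof row !== "object" || row === null || Array.isArray(row)) {
    return reject("row is not a JSON object");
  }
  const rec = row as Record<string, unknown>;
  // Strict mode: any key outside the schema rejects the row.
  for (const key of Object.keys(rec)) {
    if (KNOWN_KEYS.indexOf(key) === -1) return reject(`unknown key: ${key}`);
  }
  const severity = rec["severity"];
  if (typeof severity !== "string" || (SEVERITIES as readonly string[]).indexOf(severity) === -1) {
    return reject("severity is missing or not one of the five levels");
  }
  for (const key of REQUIRED_STRINGS) {
    if (typeof rec[key] !== "string") return reject(`${key} must be a string`);
  }
  for (const key of OPTIONAL_STRINGS) {
    const value = rec[key];
    if (value !== undefined && typeof value !== "string") {
      return reject(`${key} must be a string when present`);
    }
  }
  const confidence = rec["confidence"];
  // Assumed bound (0..1), based on the prompt wording only.
  if (typeof confidence !== "number" || !(confidence >= 0 && confidence <= 1)) {
    return reject("confidence must be a number in 0..1");
  }
  return rec as unknown as SketchRawFinding;
}

// Skip-and-count: keep the valid rows and tally the rejected ones.
function parseAllSketch(rows: unknown[]): { findings: SketchRawFinding[]; skipped: number } {
  const findings: SketchRawFinding[] = [];
  let skipped = 0;
  for (const row of rows) {
    const parsed = parseRowSketch(row);
    if (parsed === null) skipped += 1;
    else findings.push(parsed);
  }
  return { findings, skipped };
}
```

Returning a count alongside the valid rows lets a caller report how many rows the model got wrong without losing the ones it got right.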
@@ -917,54 +975,6 @@ declare function defaultIsMaterial(a: AnalystFinding, b: AnalystFinding): boolea
  */
 declare function diffFindings(previous: PersistedFinding[], current: PersistedFinding[], policy?: DiffPolicy): FindingsDiff;
 
-/**
- * Typed Ax output for analyst findings.
- *
- * Replaces the legacy `findings:string[]` pattern (where every bullet
- * became a flat-severity `AnalystFinding`) with a structured object
- * array. Ax binds the field as `findings:json[]` so the provider emits
- * native structured output; at the kind-factory boundary we Zod-validate
- * each emitted finding so malformed rows fail loud instead of being
- * silently lifted with default severity.
- *
- * Why not `f.object().array()` directly in the signature? The Ax
- * signature string `question:string -> findings:json[]` already lets
- * the provider emit JSON arrays. A Zod boundary is required either
- * way (the provider can return any JSON), and Zod gives us a single
- * validation surface independent of which Ax version is installed.
- */
-
-declare const ANALYST_SEVERITIES: readonly ["critical", "high", "medium", "low", "info"];
-declare const RawAnalystFindingSchema: z.ZodObject<{
-    severity: z.ZodEnum<{
-        info: "info";
-        critical: "critical";
-        medium: "medium";
-        low: "low";
-        high: "high";
-    }>;
-    claim: z.ZodString;
-    subject: z.ZodOptional<z.ZodString>;
-    evidence_uri: z.ZodString;
-    evidence_excerpt: z.ZodOptional<z.ZodString>;
-    confidence: z.ZodNumber;
-    rationale: z.ZodOptional<z.ZodString>;
-    recommended_action: z.ZodOptional<z.ZodString>;
-}, z.core.$strict>;
-type RawAnalystFinding = z.infer<typeof RawAnalystFindingSchema>;
-/**
- * Description embedded into the actor prompt so the LLM knows what
- * shape to emit. Kept here so kinds share one source of truth rather
- * than restating the schema in every prompt.
- */
-declare const RAW_FINDING_SCHEMA_PROMPT = "Each finding MUST be a JSON object with these fields:\n - severity: one of \"critical\" | \"high\" | \"medium\" | \"low\" | \"info\"\n - claim: one-sentence statement (max 2000 chars)\n - subject?: the leaf id, agent id, span id, tool name, or noun phrase the finding is about\n - evidence_uri: \"span://<trace_id>/<span_id>\" for trace evidence, \"artifact://<relative-path>\" for files, \"metric://<name>\" for named scalars \u2014 ALWAYS cite a real id surfaced by the tools\n - evidence_excerpt?: short quote (<=2000 chars) from the cited span/artifact\n - confidence: number 0..1 \u2014 0.9+ when backed by exact quotes, 0.6-0.8 for inferred patterns, <0.5 for speculative\n - rationale?: one or two sentences explaining the reasoning\n - recommended_action?: concrete change phrased as an imperative (\"Add ...\", \"Replace ...\", \"Stop ...\") \u2014 omit when the finding is purely descriptive\n\nEmit an empty array when the question has no findings to report. Do not fabricate evidence.";
-/**
- * Validate one row emitted by the LLM. Returns the typed finding on
- * success; returns `null` and logs the reason on failure so the kind
- * factory can skip-and-count rather than abort the whole analyst run.
- */
-declare function parseRawFinding(row: unknown, log?: (msg: string, fields?: Record<string, unknown>) => void): RawAnalystFinding | null;
-
 /**
  * Analyst-kind factory — the typed, focused replacement for the
  * legacy `createTraceAnalystAdapter`.
@@ -1054,6 +1064,21 @@ interface CreateTraceAnalystKindOpts {
  * want shared across analyst runs).
  */
 declare function createTraceAnalystKind(spec: TraceAnalystKindSpec, opts: CreateTraceAnalystKindOpts): Analyst<TraceAnalysisStore>;
+/**
+ * Render a compact prior-findings block the actor reads alongside its
+ * brief. Each row is one line so the actor can scan dozens cheaply.
+ * The kind's prompt instructs the actor to (a) check whether a new
+ * cluster matches a prior `finding_id` (carry the id forward via
+ * `id_basis` to keep diffs stable) and (b) raise severity / confidence
+ * when a prior finding has reappeared without remediation.
+ *
+ * Returns the empty string when there are no prior findings — most
+ * runs are "first-of-its-kind" and the prompt stays unchanged.
+ *
+ * Exported for tests + for consumers that build their own actor
+ * prompts (e.g. specialized analysts living outside the default kinds).
+ */
+declare function renderPriorFindings(prior: AnalystContext['priorFindings']): string;
 
 /**
  * Failure-mode analyst — classifies what went wrong and why.
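
The `renderPriorFindings` doc comment in this hunk fixes two behaviours: one line per prior finding, and an empty string when there is nothing to render. The diff does not show the actual line layout, so the header and row format below are assumptions, as are the names `SketchPriorFinding` and `renderPriorFindingsSketch`. The sketch only illustrates the two documented behaviours.

```typescript
// Hypothetical, simplified stand-in for a prior finding.
interface SketchPriorFinding {
  finding_id: string;
  severity: string;
  claim: string;
}

function renderPriorFindingsSketch(prior?: ReadonlyArray<SketchPriorFinding>): string {
  // No prior findings: return "" so the surrounding prompt stays unchanged.
  if (prior === undefined || prior.length === 0) return "";
  const rows = prior.map((f) => {
    // Collapse whitespace so a multi-line claim still fits on one row.
    const claim = f.claim.replace(/\s+/g, " ").trim();
    return `- [${f.severity}] finding://${f.finding_id} ${claim}`;
  });
  // Assumed layout: a count header followed by one row per finding.
  return [`Prior findings (${prior.length}):`, ...rows].join("\n");
}
```

Each row puts the id in `finding://<id>` form, so a later finding can cite it as an `evidence_uri` without further formatting.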
@@ -1074,6 +1099,29 @@ declare function createTraceAnalystKind(spec: TraceAnalystKindSpec, opts: Create
 
 declare const FAILURE_MODE_KIND_SPEC: TraceAnalystKindSpec;
 
+/**
+ * Improvement analyst — actionable, recursive self-improvement findings.
+ *
+ * Brief: read findings from upstream analysts (failure-mode,
+ * knowledge-gap, knowledge-poisoning) AND the trace dataset itself,
+ * then propose **concrete edits** to the agent's runtime: prompt
+ * additions, RAG documents to ingest, tool descriptions to rewrite,
+ * scaffolding changes to make, memory entries to invalidate. Each
+ * finding is one proposed edit with the locus, the diff, and the
+ * expected effect.
+ *
+ * This is the recursive-self-improvement loop's last mile: the prior
+ * kinds describe *what's wrong*; this kind describes *what to change*.
+ *
+ * Recursion is deep (`maxDepth: 3`) because real improvement proposals
+ * are competitive: for each failure-mode there are usually 2-3 viable
+ * fix directions (tighten prompt vs add tool vs adjust scaffolding),
+ * and the actor should explore each with a focused subagent before
+ * picking the highest-leverage one to recommend.
+ */
+
+declare const IMPROVEMENT_KIND_SPEC: TraceAnalystKindSpec;
+
 /**
  * Knowledge-gap analyst — what did the agent NOT know that it needed?
  *
@@ -1124,29 +1172,6 @@ declare const KNOWLEDGE_GAP_KIND_SPEC: TraceAnalystKindSpec;
 
 declare const KNOWLEDGE_POISONING_KIND_SPEC: TraceAnalystKindSpec;
 
-/**
- * Improvement analyst — actionable, recursive self-improvement findings.
- *
- * Brief: read findings from upstream analysts (failure-mode,
- * knowledge-gap, knowledge-poisoning) AND the trace dataset itself,
- * then propose **concrete edits** to the agent's runtime: prompt
- * additions, RAG documents to ingest, tool descriptions to rewrite,
- * scaffolding changes to make, memory entries to invalidate. Each
- * finding is one proposed edit with the locus, the diff, and the
- * expected effect.
- *
- * This is the recursive-self-improvement loop's last mile: the prior
- * kinds describe *what's wrong*; this kind describes *what to change*.
- *
- * Recursion is deep (`maxDepth: 3`) because real improvement proposals
- * are competitive: for each failure-mode there are usually 2-3 viable
- * fix directions (tighten prompt vs add tool vs adjust scaffolding),
- * and the actor should explore each with a focused subagent before
- * picking the highest-leverage one to recommend.
- */
-
-declare const IMPROVEMENT_KIND_SPEC: TraceAnalystKindSpec;
-
 /**
  * Default analyst kinds focused on agent failure + recursive
  * self-improvement.
@@ -1165,39 +1190,6 @@ declare const IMPROVEMENT_KIND_SPEC: TraceAnalystKindSpec;
  */
 declare const DEFAULT_TRACE_ANALYST_KINDS: readonly TraceAnalystKindSpec[];
 
-/**
- * Pre-curated tool subsets for analyst kinds.
- *
- * The full trace-analyst tool set is seven functions. Most kinds only
- * need three or four. Picking from named groups instead of importing
- * the whole bundle keeps every kind's actor-context budget tight and
- * makes "what can this analyst see?" obvious at registration time.
- *
- * Each function in the group keeps its full `name`/`description` from
- * `buildTraceAnalystTools` — we filter, we don't re-implement.
- */
-
-/** Named tool sets. Kinds pass `tools: TRACE_TOOL_GROUPS.failureForensics` etc. */
-type TraceToolGroupName =
-/** All seven tools. Use for open-ended discovery kinds. */
-'all'
-/** Overview + paginated query + count. No deep reads. Cheap. */
-| 'discovery'
-/** Discovery + viewTrace + viewSpans. Deep-read but no regex search. */
-| 'discoveryAndRead'
-/** Discovery + search tools. For pattern-matching across many traces. */
-| 'discoveryAndSearch'
-/** Discovery + viewSpans + searchSpan. Targeted-span work after another kind narrows down. */
-| 'targeted';
-/**
- * Build the tool set for a named group bound to a specific trace store.
- *
- * `all` returns every tool. Other groups filter `buildTraceAnalystTools`
- * by name to the documented subset. An unrecognised group name throws —
- * silently returning all tools would defeat the cost-control point.
- */
-declare function buildTraceToolsForGroup(group: TraceToolGroupName, store: TraceAnalysisStore): AxFunction[];
-
 /**
  * AnalystRegistry — orchestrate N analysts against one run.
  *
@@ -1285,6 +1277,15 @@ interface RegistryRunOpts {
     signal?: AbortSignal;
     /** Tags echoed into AnalystContext.tags — useful for tracking environment/version in findings. */
     tags?: Record<string, string>;
+    /**
+     * Prior-run findings made available as retrieval context to every
+     * analyst via `ctx.priorFindings`. The registry forwards the slice
+     * whose `analyst_id` matches each registered analyst so a kind sees
+     * only its own history. Pass `{ '*': findings }` to broadcast to
+     * every analyst (useful for cross-kind chaining where the improvement
+     * analyst consumes upstream failure findings).
+     */
+    priorFindings?: ReadonlyArray<AnalystFinding> | Record<string, ReadonlyArray<AnalystFinding>>;
 }
 declare class AnalystRegistry {
     private readonly analysts;
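
The `priorFindings` option on `RegistryRunOpts` accepts either a flat list or a keyed record, with `'*'` as a broadcast key. The sketch below is one plausible reading of how a per-analyst slice could be resolved; it is not taken from the registry source. Filtering a flat list by `analyst_id`, and merging the `'*'` entries ahead of the analyst's own entries, are both assumptions. `SketchFinding`, `isFindingList`, `sliceForAnalyst`, and the analyst ids used in the usage note are hypothetical.

```typescript
// Hypothetical, simplified stand-in for a finding with its owning analyst.
interface SketchFinding {
  finding_id: string;
  analyst_id: string;
  claim: string;
}

type PriorFindingsOpt =
  | ReadonlyArray<SketchFinding>
  | Record<string, ReadonlyArray<SketchFinding>>;

// Explicit type guard: Array.isArray alone does not narrow ReadonlyArray cleanly.
function isFindingList(p: PriorFindingsOpt): p is ReadonlyArray<SketchFinding> {
  return Array.isArray(p);
}

function sliceForAnalyst(
  prior: PriorFindingsOpt | undefined,
  analystId: string,
): ReadonlyArray<SketchFinding> {
  if (prior === undefined) return [];
  if (isFindingList(prior)) {
    // Flat list (assumed behaviour): each analyst sees only its own history.
    return prior.filter((f) => f.analyst_id === analystId);
  }
  // Keyed record (assumed behaviour): '*' is broadcast, other keys are analyst ids.
  const broadcast = prior["*"] ?? [];
  const own = prior[analystId] ?? [];
  return [...broadcast, ...own];
}
```

Under these assumptions, `sliceForAnalyst({ "*": failureFindings }, "improvement")` returns the failure findings to an analyst that did not produce them, which is the cross-kind chaining case the doc comment mentions.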
@@ -1302,6 +1303,39 @@ declare class AnalystRegistry {
     private routeInput;
 }
 
+/**
+ * Pre-curated tool subsets for analyst kinds.
+ *
+ * The full trace-analyst tool set is seven functions. Most kinds only
+ * need three or four. Picking from named groups instead of importing
+ * the whole bundle keeps every kind's actor-context budget tight and
+ * makes "what can this analyst see?" obvious at registration time.
+ *
+ * Each function in the group keeps its full `name`/`description` from
+ * `buildTraceAnalystTools` — we filter, we don't re-implement.
+ */
+
+/** Named tool sets. Kinds pass `tools: TRACE_TOOL_GROUPS.failureForensics` etc. */
+type TraceToolGroupName =
+/** All seven tools. Use for open-ended discovery kinds. */
+'all'
+/** Overview + paginated query + count. No deep reads. Cheap. */
+| 'discovery'
+/** Discovery + viewTrace + viewSpans. Deep-read but no regex search. */
+| 'discoveryAndRead'
+/** Discovery + search tools. For pattern-matching across many traces. */
+| 'discoveryAndSearch'
+/** Discovery + viewSpans + searchSpan. Targeted-span work after another kind narrows down. */
+| 'targeted';
+/**
+ * Build the tool set for a named group bound to a specific trace store.
+ *
+ * `all` returns every tool. Other groups filter `buildTraceAnalystTools`
+ * by name to the documented subset. An unrecognised group name throws —
+ * silently returning all tools would defeat the cost-control point.
+ */
+declare function buildTraceToolsForGroup(group: TraceToolGroupName, store: TraceAnalysisStore): AxFunction[];
+
 /**
  * Automated pull request opener for the production loop.
  *
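
The `buildTraceToolsForGroup` doc comment in this hunk describes a filter-by-name approach that throws on an unrecognised group instead of falling back to every tool. A minimal sketch of that pattern follows. Only `viewTrace`, `viewSpans`, and `searchSpan` are named in the diff; the other four tool names (`getOverview`, `queryTraces`, `countTraces`, `searchTrace`) are placeholders invented here, and `SketchTool` and `toolsForGroupSketch` are hypothetical.

```typescript
// Hypothetical stand-in for a tool descriptor.
interface SketchTool {
  name: string;
  description: string;
}

type SketchGroup = "discovery" | "discoveryAndRead" | "discoveryAndSearch" | "targeted";

// Placeholder names for the three discovery tools (overview, paginated query, count).
const DISCOVERY_TOOLS = ["getOverview", "queryTraces", "countTraces"];

const GROUP_MEMBERS: Record<SketchGroup, readonly string[]> = {
  discovery: DISCOVERY_TOOLS,
  discoveryAndRead: [...DISCOVERY_TOOLS, "viewTrace", "viewSpans"],
  discoveryAndSearch: [...DISCOVERY_TOOLS, "searchTrace", "searchSpan"],
  targeted: [...DISCOVERY_TOOLS, "viewSpans", "searchSpan"],
};

function isSketchGroup(name: string): name is SketchGroup {
  return Object.prototype.hasOwnProperty.call(GROUP_MEMBERS, name);
}

function toolsForGroupSketch(group: string, allTools: readonly SketchTool[]): SketchTool[] {
  if (group === "all") return allTools.slice();
  // Fail loudly: silently returning every tool would hide a typo and undo the cost control.
  if (!isSketchGroup(group)) throw new Error(`Unknown trace tool group: ${group}`);
  const allowed = GROUP_MEMBERS[group];
  // Filter the existing descriptors rather than rebuilding them.
  return allTools.filter((tool) => allowed.indexOf(tool.name) !== -1);
}
```

Taking `group` as a plain `string` keeps the runtime check meaningful for callers that bypass the type system, for example when the group name is read from configuration.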
@@ -5945,4 +5979,4 @@ declare function aggregateTrialsByMode(trials: TrialResult[], opts: {
     mode: AggregatorMode;
 }): TrialAggregate;
 
5948
|
-
export { ANALYST_SEVERITIES, ActionableSideInfo, type ActiveLearningOptions, type AdapterRun, AgentDriver, type AgentDriverConfig, AgentEvalError, type AggregatorMode, type AlignmentOp, type Analyst, type AnalystContext, type AnalystCost, type AnalystFinding, type AnalystHooks, type AnalystInputKind, AnalystRegistry, type AnalystRegistryOptions, type AnalystRequirements, type AnalystRunInputs, type AnalystRunResult, type AnalystRunSummary, type AnalystSeverity, type AntiSlopConfig, type AntiSlopIssue, type AntiSlopReport, Artifact$1 as Artifact, type ArtifactCheck, type Artifact as ArtifactCheckArtifact, type ArtifactResult, type ArtifactValidator, type AutoPrClient, AxGepaSteeringOptimizer, type AxSteeringOptimizerConfig, BaselineReport, BehaviorAssertion, type BenchmarkReport, BenchmarkRunner, type BenchmarkRunnerConfig, type BisectOptions, type BisectResult, type BisectStep, BudgetBreachError, BudgetGuard, BudgetLedgerEntry, type BudgetPolicy, BudgetSpec, CallExpectation, type CanaryAlert, type CanaryKind, type CanaryLeak, type CanaryOptions, type CanaryReport, type CanarySeverity, type CandidateScenario, type CausalAttributionReport, type ChatCallOpts, type ChatClient, type ChatRequest, type ChatResponse, type ChatTransport, type CheckResult, type CliBridgeTransportOpts, type CodeMutationOutcome, type CodeMutationRunner, type CollectedArtifacts, type CommandRunner, type CompletionCriterion, type CompositePolicy, type ConceptComplexity, type ConceptFinding, type ConceptSpec, type ConceptWeightStrategy, type ContinuityCheck, type ContinuityCheckResult, type ContinuityReport, type ContinuitySnapshotPair, ContinuousAgreement, ContinuousAgreementOptions, type ContractMetric, type ContractReport, ControlEvalResult, ConvergenceTracker, type CorpusAgreementOptions, type CorpusAgreementPerDimension, type CorpusAgreementReport, type CorpusScoreRecord, type CostEntry, CostLedger, type CostLedgerGeneration, type CostLedgerSnapshot, type CostSummary, CostTracker, type 
CounterfactualContext, type CounterfactualMutation, type CounterfactualResult, type CounterfactualRunner, type CreateChatClientOpts, type CreateCompositeMutatorOpts, type CreateDefaultReviewerOptions, type CreateSandboxCodeMutatorOpts, type CreateSandboxPoolOpts, type CreateTraceAnalystKindOpts, type CrossTraceDiff, type CrossTraceDiffOptions, D1ExperimentStore, type D1ExperimentStoreOptions, type D1Like, type D1PreparedStatementLike, DEFAULT_AGENT_SLOS, DEFAULT_COMPLEXITY_WEIGHTS, DEFAULT_FINDERS, DEFAULT_HARNESS_OBJECTIVES, DEFAULT_MUTATORS, DEFAULT_RUN_SCORE_WEIGHTS, DEFAULT_SEVERITY_WEIGHTS, DEFAULT_TRACE_ANALYST_KINDS, Dataset, DatasetScenario, type DeployFamily, type DeployGateLayerInput, type DeployRunResult, type DeployRunner, type DiffPolicy, type DirEntry, type DirectProviderTransportOpts, type DiscoverPersonasOptions, type DiscoveredPersona, type DriverResult, type DriverState, DualAgentBench, type DualAgentBenchConfig, type DualAgentReport, type DualAgentRound, type DualAgentScenario, type DualAgentScenarioResult, ERROR_COUNT_PATTERNS, type ErrorCountPattern, type EvalResult, type EvidenceRef, type EvolutionRound, EvolvableVariant, type ExecutorConfig, type Expectation, type Experiment, type Run as ExperimentRun, type ExperimentStore, ExperimentTracker, type ExportedRewardModel, type ExtractOptions, type ExtractResult, FAILURE_MODE_KIND_SPEC, type FactorContribution, type FactorialCell, type FailureClusterConfig, FeedbackLabel, type FeedbackPattern, FeedbackTrajectory, FeedbackTrajectoryStore, type FileChange, FileSystemExperimentStore, type FileSystemExperimentStoreOptions, type FindingsDiff, FindingsStore, type FlowAction, type FlowLayerEnv, type FlowLayerFactoryInput, type FlowRunner, type FlowRunnerStepResult, type FlowSpec, type FlowStep, GateDecision, type GhCliClientOptions, type GoldenSeverity, type GoldenSpec, type HarnessAdapter, HarnessConfig, type HarnessExperimentConfig, type HarnessExperimentResult, type HarnessIntervention, type 
HarnessRunRequest, type HarnessRunResult, type HarnessScenario, type HarnessSelection, type HarnessVariant, type HarnessVariantReport, HeldOutGateConfig, HoldoutAuditor, type HostedJudgeConfig, type HostedJudgeDimension, type HostedJudgeRequest, type HostedJudgeResponse, type HostedRunCriticConfig, type HostedRunScoreRequest, type HostedRunScoreResponse, type HttpGithubClientOptions, type HypothesisManifest, type HypothesisResult, IMPROVEMENT_KIND_SPEC, INTENT_MATCH_JUDGE_VERSION, type ImageData, InMemoryExperimentStore, InMemoryWorkspaceInspector, type InferenceScorer, type InspectorContext, type IntegrationGateSurface, type IntegrationInvokeFailureInput, type IntegrationManifestGateInput, type IntentMatchInput, type IntentMatchOptions, type IntentMatchResult, type InteractionContribution, JsonlTrialCache, type JudgeAdapterOpts, type JudgeConfig, type JudgeFleetOptions, type JudgeFn, type JudgeInput, type JudgeReplayResult, type JudgeRetryOutcome, type JudgeRetryPolicy, type JudgeRubric, JudgeRunner, type JudgeScore, KNOWLEDGE_GAP_KIND_SPEC, KNOWLEDGE_POISONING_KIND_SPEC, type KeywordConceptSpec, type KeywordCoverageFinding, type KeywordCoverageOptions, type KeywordCoverageResult, type LangfuseEnvelope, type LangfuseGeneration, type LangfuseScore, Layer, LayerResult, type LineageKind, type LineageKindResolver, type LineageNode, LineageRecorder, type LiveProofArtifact, type LiveProofConfig, type LiveProofContext, type LiveProofResult, LlmCallRequest, LlmCallResult, LlmClientOptions, LlmSpan, LockedJsonlAppender, MODEL_PRICING, type MatchResult, type MatcherResult, type MeasurementPolicy, type MergeOptions, MetricsCollector, type MockTransportOpts, type MuffledFinder, type MuffledFinding, MultiLayerVerifier, MultiShotMutateAdapter, MultiShotOptimizationResult, MultiShotRunner, MultiShotScorer, MultiShotTrialResult, type MultiToolchainLayerConfig, MutateAdapter, type MutationAttempt, type MutationChannel, MutationTelemetry, type Mutator, Mutex, Objective, type 
Oracle, type OracleObservation, type OracleReport, type OracleResult, type OrthogonalityInput, type OrthogonalityResult, PairwiseSteeringOptimizer, type ParaphraseRobustnessScenarioInput, type ParaphraseRobustnessScenarioResult, ParetoResult, type PersistedFinding, type PersonaConfig, type Playbook, type PlaybookEntry, type PoolSlot, ProductClient, type ProductClientConfig, type ProductionEvolveConfig, type ProductionLoopCronConfig, type ProductionLoopDecision, type ProductionLoopRenderContext, type ProductionLoopResult, type ProductionShipConfig, type PromptHandle, PromptRegistry, TrialResult as PromptTrialResult, type ProposeAutomatedPullRequestInput, type ProposeAutomatedPullRequestResult, RAW_FINDING_SCHEMA_PROMPT, type RawAnalystFinding, RawAnalystFindingSchema, type ReferenceMatchResult, type ReferenceReplayAdapter, type ReferenceReplayAdapterFn, type ReferenceReplayAdapterLike, type ReferenceReplayAggregate, type ReferenceReplayCandidate, type ReferenceReplayCase, type ReferenceReplayCaseRun, type ReferenceReplayExecutionScenario, type ReferenceReplayItem, type ReferenceReplayMatch, type ReferenceReplayMatchStrategy, type ReferenceReplayMatcher, type ReferenceReplayPromotionDecision, type ReferenceReplayPromotionPolicy, type ReferenceReplayRun, type ReferenceReplayRunContext, type ReferenceReplayRunOptions, type ReferenceReplayRunStore, type ReferenceReplayScenario, type ReferenceReplayScenarioScore, type ReferenceReplayScore, type ReferenceReplayScoreOptions, type ReferenceReplaySplit, type ReferenceReplaySplitComparison, type ReferenceReplaySteeringRowsOptions, type RegistryRunOpts, ReleaseConfidenceScorecard, ReleaseConfidenceThresholds, type RepoRef, type ReviewerMemoryEntry, type ReviewerOutput, type ReviewerPromptInput, type ReviewerSoftFailDefaults, type ReviewerVerificationSummary, type RobustnessResult, type RouteMap, type RouterTransportOpts, type RubricDimension, Run$1 as Run, type RunCommandInput, type RunCommandResult, type RunConfig, RunCritic, 
type RunCriticAdapterOpts, type RunCriticOptions, type RunDiff, RunFilter, type RunProductionLoopOptions, RunRecord, type RunScore, type RunScoreWeights, RunSplitTag, type RunTrace, SEMANTIC_CONCEPT_JUDGE_VERSION, SandboxDriver, SandboxHarnessResult, type SandboxJudgeKind, type SandboxJudgeResult, type SandboxJudgeSpec, type SandboxPool, type SandboxSdkTransportOpts, type ScanOptions, type Scenario, type ScenarioCost, type ScenarioFile, ScenarioRegistry, type ScenarioResult, type ScoredTarget, type SelfPlayOptions, type SelfPlayProposer, type SelfPlayScorer, type SemanticConceptJudgeAdapterOpts, type SemanticConceptJudgeInput, type SemanticConceptJudgeOptions, type SemanticConceptJudgeResult, type SeriesConvergenceOptions, type SeriesConvergenceResult, Severity, type SignedManifest, type SignedManifestAlgo, type Slo, type SloCheckResult, type SloComparator, type SloReport, type SloSeverity, type SlopCategory, type SlotFactory, Span, type SteeringBundle, type SteeringDelta, type SteeringOptimizationResult, type SteeringOptimizationRow, type SteeringOptimizationSelector, type SteeringOptimizerBackend, type SteeringOptimizerConfig, type SteeringRolePrompt, type StepAttribution, type SynthesisReason, type SynthesisTarget, type TestResult, type ThresholdContract, TokenCounter, type TokenSpec, type TraceAnalystAdapterOpts, type TraceAnalystGolden, type TraceAnalystKindSpec, TraceEmitter, TraceEvent, TraceStore, type TraceToolGroupName, Trajectory, TrajectoryStep, type TrialAggregate, type TrialAttempt, TrialCache, TrialTelemetry, type Turn, type TurnMetrics, type TurnResult, UNIVERSAL_FINDERS, type ValidationContext, type ValidationIssue, type ValidationResult, VariantAggregate, type VerifierAdapterOpts, VerifyContext, VerifyOptions, type VisualDiffOptions, type VisualDiffResult, type ViteDeployRunnerInput, type WorkflowTopology, type WorkspaceAssertion, type WorkspaceAssertionResult, type WorkspaceInspector, type WorkspaceSnapshot, type WranglerDeployRunnerInput, 
adversarialJudge, aggregateRunScore, aggregateTrialsByMode, analyzeAntiSlop, analyzeSeries, attributeCounterfactuals, benjaminiHochberg, bisect, bonferroni, buildReviewerPrompt, buildTraceToolsForGroup, byteLengthRange, canaryLeakView, canonicalize, causalAttribution, checkBehavioralCanary, checkCanaries, checkSlos, clamp01, codeExecutionJudge, cohensD, coherenceJudge, collectionPreserved, commitBisect, compareReferenceReplay, compilerJudge, composeValidators, computeFindingId, confidenceInterval, containsAll, corpusInterRaterAgreement, corpusInterRaterAgreementFromJudgeScores, createAntiSlopJudge, createChatClient, createCompositeMutator, createCustomJudge, createDefaultReviewer, createDomainExpertJudge, createIntentMatchJudge, createJudgeAdapter, createRunCriticAdapter, createSandboxCodeMutator, createSandboxPool, createSemanticConceptJudge, createSemanticConceptJudgeAdapter, createTraceAnalystAdapter, createTraceAnalystKind, createVerifierAdapter, crossTraceDiff, decideReferenceReplayPromotion, decideReferenceReplayRunPromotion, defaultIsMaterial, defaultJudges, defaultReferenceReplayMatcher, deployGateLayer, diffFindings, discoverPersonas, distillPlaybook, estimateCost, estimateTokens, evaluateContract, evaluateHypothesis, evaluateOracles, executeScenario, expectAgent, exportRewardModel, extractAssetUrls, extractErrorCount, fileContains, fileExists, findAutoMatchNoExpectation, findConstructorCwdDropped, findFallbackToPass, findLiteralTruePass, findSkipCountsAsPass, flowLayer, formatBenchmarkReport, formatDriverReport, formatFindings, ghCliClient, precision as goldenPrecision, hashContent, hashJson, htmlContainsElement, httpGithubClient, inMemoryReferenceReplayStore, integrationAsi, integrationGateEvals, integrationInvokeFailedPayload, integrationManifestResolvedPayload, integrationManifestValidatedPayload, interRaterReliability, jsonHasKeys, jsonShape, jsonlReferenceReplayStore, keyPreserved, liftSeverity, linterJudge, loadScorerFromGrader, localCommandRunner, 
lowercaseMutator, makeFinding, mannWhitneyU, matchGoldens, mergeLayerResults, mergeSteeringBundle, multiToolchainLayer, normalizeScores, notBlocked, pairedTTest, paraphraseRobustness, paraphraseRobustnessScenarios, parseRawFinding, partialCredit, passOrthogonality, pixelDeltaRatio, politenessPrefixMutator, printDriverSummary, promptBisect, proposeAutomatedPullRequest, proposeSynthesisTargets, referenceReplayRunsToSteeringRows, referenceReplayScenarioToRunScore, regexMatch, regexMatches, renderMarkdownReport, renderPlaybookMarkdown, renderSteeringText, replayScorerOverCorpus, replayTraceThroughJudge, requiredSampleSize, resetLockedAppendersForTesting, rowCount, rowWhere, runAssertions, runBehavioralCanaries, runCanaries, runCounterfactual, runE2EWorkflow, runExpectations, runHarnessExperiment, runIntentMatchJudge, runJudgeFleet, runKeywordCoverageJudge, runKeywordCoverageJudgeUrl, runLiveProof, runProductionLoop, runReferenceReplay, runSelfPlay, runSemanticConceptJudge, scanForMuffledGates, scoreContinuity, scoreReferenceReplay, securityJudge, selectHarnessVariant, sentenceReorderMutator, signManifest, statusAdvanced, summarizeHarnessResults, testJudge, textInSnapshot, toLangfuseEnvelope, toPrometheusText, typoMutator, urlContains, verifyManifest, visualDiff, viteDeployRunner, weightedMean, weightedRecall, whitespaceCollapseMutator, wilcoxonSignedRank, withJudgeRetry, wranglerDeployRunner };
|
|
5982
|
+
export { ANALYST_SEVERITIES, ActionableSideInfo, type ActiveLearningOptions, type AdapterRun, AgentDriver, type AgentDriverConfig, AgentEvalError, type AggregatorMode, type AlignmentOp, type Analyst, type AnalystContext, type AnalystCost, type AnalystFinding, type AnalystHooks, type AnalystInputKind, AnalystRegistry, type AnalystRegistryOptions, type AnalystRequirements, type AnalystRunInputs, type AnalystRunResult, type AnalystRunSummary, type AnalystSeverity, type AntiSlopConfig, type AntiSlopIssue, type AntiSlopReport, Artifact$1 as Artifact, type ArtifactCheck, type Artifact as ArtifactCheckArtifact, type ArtifactResult, type ArtifactValidator, type AutoPrClient, AxGepaSteeringOptimizer, type AxSteeringOptimizerConfig, BaselineReport, BehaviorAssertion, type BenchmarkReport, BenchmarkRunner, type BenchmarkRunnerConfig, type BisectOptions, type BisectResult, type BisectStep, BudgetBreachError, BudgetGuard, BudgetLedgerEntry, type BudgetPolicy, BudgetSpec, CallExpectation, type CanaryAlert, type CanaryKind, type CanaryLeak, type CanaryOptions, type CanaryReport, type CanarySeverity, type CandidateScenario, type CausalAttributionReport, type ChatCallOpts, type ChatClient, type ChatRequest, type ChatResponse, type ChatTransport, type CheckResult, type CliBridgeTransportOpts, type CodeMutationOutcome, type CodeMutationRunner, type CollectedArtifacts, type CommandRunner, type CompletionCriterion, type CompositePolicy, type ConceptComplexity, type ConceptFinding, type ConceptSpec, type ConceptWeightStrategy, type ContinuityCheck, type ContinuityCheckResult, type ContinuityReport, type ContinuitySnapshotPair, ContinuousAgreement, ContinuousAgreementOptions, type ContractMetric, type ContractReport, ControlEvalResult, ConvergenceTracker, type CorpusAgreementOptions, type CorpusAgreementPerDimension, type CorpusAgreementReport, type CorpusScoreRecord, type CostEntry, CostLedger, type CostLedgerGeneration, type CostLedgerSnapshot, type CostSummary, CostTracker, type 
CounterfactualContext, type CounterfactualMutation, type CounterfactualResult, type CounterfactualRunner, type CreateChatClientOpts, type CreateCompositeMutatorOpts, type CreateDefaultReviewerOptions, type CreateSandboxCodeMutatorOpts, type CreateSandboxPoolOpts, type CreateTraceAnalystKindOpts, type CrossTraceDiff, type CrossTraceDiffOptions, D1ExperimentStore, type D1ExperimentStoreOptions, type D1Like, type D1PreparedStatementLike, DEFAULT_AGENT_SLOS, DEFAULT_COMPLEXITY_WEIGHTS, DEFAULT_FINDERS, DEFAULT_HARNESS_OBJECTIVES, DEFAULT_MUTATORS, DEFAULT_RUN_SCORE_WEIGHTS, DEFAULT_SEVERITY_WEIGHTS, DEFAULT_TRACE_ANALYST_KINDS, Dataset, DatasetScenario, type DeployFamily, type DeployGateLayerInput, type DeployRunResult, type DeployRunner, type DiffPolicy, type DirEntry, type DirectProviderTransportOpts, type DiscoverPersonasOptions, type DiscoveredPersona, type DriverResult, type DriverState, DualAgentBench, type DualAgentBenchConfig, type DualAgentReport, type DualAgentRound, type DualAgentScenario, type DualAgentScenarioResult, ERROR_COUNT_PATTERNS, type ErrorCountPattern, type EvalResult, type EvidenceRef, type EvolutionRound, EvolvableVariant, type ExecutorConfig, type Expectation, type Experiment, type Run as ExperimentRun, type ExperimentStore, ExperimentTracker, type ExportedRewardModel, type ExtractOptions, type ExtractResult, FAILURE_MODE_KIND_SPEC, type FactorContribution, type FactorialCell, type FailureClusterConfig, FeedbackLabel, type FeedbackPattern, FeedbackTrajectory, FeedbackTrajectoryStore, type FileChange, FileSystemExperimentStore, type FileSystemExperimentStoreOptions, type FindingsDiff, FindingsStore, type FlowAction, type FlowLayerEnv, type FlowLayerFactoryInput, type FlowRunner, type FlowRunnerStepResult, type FlowSpec, type FlowStep, GateDecision, type GhCliClientOptions, type GoldenSeverity, type GoldenSpec, type HarnessAdapter, HarnessConfig, type HarnessExperimentConfig, type HarnessExperimentResult, type HarnessIntervention, type 
HarnessRunRequest, type HarnessRunResult, type HarnessScenario, type HarnessSelection, type HarnessVariant, type HarnessVariantReport, HeldOutGateConfig, HoldoutAuditor, type HostedJudgeConfig, type HostedJudgeDimension, type HostedJudgeRequest, type HostedJudgeResponse, type HostedRunCriticConfig, type HostedRunScoreRequest, type HostedRunScoreResponse, type HttpGithubClientOptions, type HypothesisManifest, type HypothesisResult, IMPROVEMENT_KIND_SPEC, INTENT_MATCH_JUDGE_VERSION, type ImageData, InMemoryExperimentStore, InMemoryWorkspaceInspector, type InferenceScorer, type InspectorContext, type IntegrationGateSurface, type IntegrationInvokeFailureInput, type IntegrationManifestGateInput, type IntentMatchInput, type IntentMatchOptions, type IntentMatchResult, type InteractionContribution, JsonlTrialCache, type JudgeAdapterOpts, type JudgeConfig, type JudgeFleetOptions, type JudgeFn, type JudgeInput, type JudgeReplayResult, type JudgeRetryOutcome, type JudgeRetryPolicy, type JudgeRubric, JudgeRunner, type JudgeScore, KNOWLEDGE_GAP_KIND_SPEC, KNOWLEDGE_POISONING_KIND_SPEC, type KeywordConceptSpec, type KeywordCoverageFinding, type KeywordCoverageOptions, type KeywordCoverageResult, type LangfuseEnvelope, type LangfuseGeneration, type LangfuseScore, Layer, LayerResult, type LineageKind, type LineageKindResolver, type LineageNode, LineageRecorder, type LiveProofArtifact, type LiveProofConfig, type LiveProofContext, type LiveProofResult, LlmCallRequest, LlmCallResult, LlmClientOptions, LlmSpan, LockedJsonlAppender, MODEL_PRICING, type MatchResult, type MatcherResult, type MeasurementPolicy, type MergeOptions, MetricsCollector, type MockTransportOpts, type MuffledFinder, type MuffledFinding, MultiLayerVerifier, MultiShotMutateAdapter, MultiShotOptimizationResult, MultiShotRunner, MultiShotScorer, MultiShotTrialResult, type MultiToolchainLayerConfig, MutateAdapter, type MutationAttempt, type MutationChannel, MutationTelemetry, type Mutator, Mutex, Objective, type 
Oracle, type OracleObservation, type OracleReport, type OracleResult, type OrthogonalityInput, type OrthogonalityResult, PairwiseSteeringOptimizer, type ParaphraseRobustnessScenarioInput, type ParaphraseRobustnessScenarioResult, ParetoResult, type PersistedFinding, type PersonaConfig, type Playbook, type PlaybookEntry, type PoolSlot, ProductClient, type ProductClientConfig, type ProductionEvolveConfig, type ProductionLoopCronConfig, type ProductionLoopDecision, type ProductionLoopRenderContext, type ProductionLoopResult, type ProductionShipConfig, type PromptHandle, PromptRegistry, TrialResult as PromptTrialResult, type ProposeAutomatedPullRequestInput, type ProposeAutomatedPullRequestResult, RAW_FINDING_SCHEMA_PROMPT, type RawAnalystFinding, RawAnalystFindingSchema, type ReferenceMatchResult, type ReferenceReplayAdapter, type ReferenceReplayAdapterFn, type ReferenceReplayAdapterLike, type ReferenceReplayAggregate, type ReferenceReplayCandidate, type ReferenceReplayCase, type ReferenceReplayCaseRun, type ReferenceReplayExecutionScenario, type ReferenceReplayItem, type ReferenceReplayMatch, type ReferenceReplayMatchStrategy, type ReferenceReplayMatcher, type ReferenceReplayPromotionDecision, type ReferenceReplayPromotionPolicy, type ReferenceReplayRun, type ReferenceReplayRunContext, type ReferenceReplayRunOptions, type ReferenceReplayRunStore, type ReferenceReplayScenario, type ReferenceReplayScenarioScore, type ReferenceReplayScore, type ReferenceReplayScoreOptions, type ReferenceReplaySplit, type ReferenceReplaySplitComparison, type ReferenceReplaySteeringRowsOptions, type RegistryRunOpts, ReleaseConfidenceScorecard, ReleaseConfidenceThresholds, type RepoRef, type ReviewerMemoryEntry, type ReviewerOutput, type ReviewerPromptInput, type ReviewerSoftFailDefaults, type ReviewerVerificationSummary, type RobustnessResult, type RouteMap, type RouterTransportOpts, type RubricDimension, Run$1 as Run, type RunCommandInput, type RunCommandResult, type RunConfig, RunCritic, 
type RunCriticAdapterOpts, type RunCriticOptions, type RunDiff, RunFilter, type RunProductionLoopOptions, RunRecord, type RunScore, type RunScoreWeights, RunSplitTag, type RunTrace, SEMANTIC_CONCEPT_JUDGE_VERSION, SandboxDriver, SandboxHarnessResult, type SandboxJudgeKind, type SandboxJudgeResult, type SandboxJudgeSpec, type SandboxPool, type SandboxSdkTransportOpts, type ScanOptions, type Scenario, type ScenarioCost, type ScenarioFile, ScenarioRegistry, type ScenarioResult, type ScoredTarget, type SelfPlayOptions, type SelfPlayProposer, type SelfPlayScorer, type SemanticConceptJudgeAdapterOpts, type SemanticConceptJudgeInput, type SemanticConceptJudgeOptions, type SemanticConceptJudgeResult, type SeriesConvergenceOptions, type SeriesConvergenceResult, Severity, type SignedManifest, type SignedManifestAlgo, type Slo, type SloCheckResult, type SloComparator, type SloReport, type SloSeverity, type SlopCategory, type SlotFactory, Span, type SteeringBundle, type SteeringDelta, type SteeringOptimizationResult, type SteeringOptimizationRow, type SteeringOptimizationSelector, type SteeringOptimizerBackend, type SteeringOptimizerConfig, type SteeringRolePrompt, type StepAttribution, type SynthesisReason, type SynthesisTarget, type TestResult, type ThresholdContract, TokenCounter, type TokenSpec, type TraceAnalystAdapterOpts, type TraceAnalystGolden, type TraceAnalystKindSpec, TraceEmitter, TraceEvent, TraceStore, type TraceToolGroupName, Trajectory, TrajectoryStep, type TrialAggregate, type TrialAttempt, TrialCache, TrialTelemetry, type Turn, type TurnMetrics, type TurnResult, UNIVERSAL_FINDERS, type ValidationContext, type ValidationIssue, type ValidationResult, VariantAggregate, type VerifierAdapterOpts, VerifyContext, VerifyOptions, type VisualDiffOptions, type VisualDiffResult, type ViteDeployRunnerInput, type WorkflowTopology, type WorkspaceAssertion, type WorkspaceAssertionResult, type WorkspaceInspector, type WorkspaceSnapshot, type WranglerDeployRunnerInput, 
adversarialJudge, aggregateRunScore, aggregateTrialsByMode, analyzeAntiSlop, analyzeSeries, attributeCounterfactuals, benjaminiHochberg, bisect, bonferroni, buildReviewerPrompt, buildTraceToolsForGroup, byteLengthRange, canaryLeakView, canonicalize, causalAttribution, checkBehavioralCanary, checkCanaries, checkSlos, clamp01, codeExecutionJudge, cohensD, coherenceJudge, collectionPreserved, commitBisect, compareReferenceReplay, compilerJudge, composeValidators, computeFindingId, confidenceInterval, containsAll, corpusInterRaterAgreement, corpusInterRaterAgreementFromJudgeScores, createAntiSlopJudge, createChatClient, createCompositeMutator, createCustomJudge, createDefaultReviewer, createDomainExpertJudge, createIntentMatchJudge, createJudgeAdapter, createRunCriticAdapter, createSandboxCodeMutator, createSandboxPool, createSemanticConceptJudge, createSemanticConceptJudgeAdapter, createTraceAnalystAdapter, createTraceAnalystKind, createVerifierAdapter, crossTraceDiff, decideReferenceReplayPromotion, decideReferenceReplayRunPromotion, defaultIsMaterial, defaultJudges, defaultReferenceReplayMatcher, deployGateLayer, diffFindings, discoverPersonas, distillPlaybook, estimateCost, estimateTokens, evaluateContract, evaluateHypothesis, evaluateOracles, executeScenario, expectAgent, exportRewardModel, extractAssetUrls, extractErrorCount, fileContains, fileExists, findAutoMatchNoExpectation, findConstructorCwdDropped, findFallbackToPass, findLiteralTruePass, findSkipCountsAsPass, flowLayer, formatBenchmarkReport, formatDriverReport, formatFindings, ghCliClient, precision as goldenPrecision, hashContent, hashJson, htmlContainsElement, httpGithubClient, inMemoryReferenceReplayStore, integrationAsi, integrationGateEvals, integrationInvokeFailedPayload, integrationManifestResolvedPayload, integrationManifestValidatedPayload, interRaterReliability, jsonHasKeys, jsonShape, jsonlReferenceReplayStore, keyPreserved, liftSeverity, linterJudge, loadScorerFromGrader, localCommandRunner, 
lowercaseMutator, makeFinding, mannWhitneyU, matchGoldens, mergeLayerResults, mergeSteeringBundle, multiToolchainLayer, normalizeScores, notBlocked, pairedTTest, paraphraseRobustness, paraphraseRobustnessScenarios, parseRawFinding, partialCredit, passOrthogonality, pixelDeltaRatio, politenessPrefixMutator, printDriverSummary, promptBisect, proposeAutomatedPullRequest, proposeSynthesisTargets, referenceReplayRunsToSteeringRows, referenceReplayScenarioToRunScore, regexMatch, regexMatches, renderMarkdownReport, renderPlaybookMarkdown, renderPriorFindings, renderSteeringText, replayScorerOverCorpus, replayTraceThroughJudge, requiredSampleSize, resetLockedAppendersForTesting, rowCount, rowWhere, runAssertions, runBehavioralCanaries, runCanaries, runCounterfactual, runE2EWorkflow, runExpectations, runHarnessExperiment, runIntentMatchJudge, runJudgeFleet, runKeywordCoverageJudge, runKeywordCoverageJudgeUrl, runLiveProof, runProductionLoop, runReferenceReplay, runSelfPlay, runSemanticConceptJudge, scanForMuffledGates, scoreContinuity, scoreReferenceReplay, securityJudge, selectHarnessVariant, sentenceReorderMutator, signManifest, statusAdvanced, summarizeHarnessResults, testJudge, textInSnapshot, toLangfuseEnvelope, toPrometheusText, typoMutator, urlContains, verifyManifest, visualDiff, viteDeployRunner, weightedMean, weightedRecall, whitespaceCollapseMutator, wilcoxonSignedRank, withJudgeRetry, wranglerDeployRunner };
|