@forwardimpact/libeval 0.1.44 → 0.1.45
- package/README.md +212 -13
- package/package.json +1 -1
- package/src/agent-runner.js +45 -181
- package/src/benchmark/runner.js +2 -2
- package/src/commands/supervise.js +3 -1
- package/src/discuss-tools.js +72 -140
- package/src/discusser.js +18 -35
- package/src/facilitator.js +26 -43
- package/src/index.js +0 -2
- package/src/judge.js +1 -1
- package/src/message-bus.js +27 -81
- package/src/orchestration-loop.js +176 -229
- package/src/orchestration-toolkit.js +272 -303
- package/src/orchestrator-helpers.js +9 -45
- package/src/redaction.js +2 -0
- package/src/render/orchestrator-filter.js +1 -9
- package/src/supervisor.js +79 -465
package/README.md
CHANGED
@@ -7,12 +7,188 @@ reproducible evidence.
 
 <!-- END:description -->
 
-
+`libeval` provides the runtime and tool surface for multi-LLM
+coordination: an agent talks to a supervisor, a facilitator chairs a
+team meeting, or a lead drives an asynchronous discussion across a
+human channel. Every conversation produces a structured NDJSON trace
+for analysis.
+
+## Modes
+
+| Mode          | Lead          | Participants  | Terminal tool          |
+| ------------- | ------------- | ------------- | ---------------------- |
+| `run`         | (none)        | one agent     | task completion        |
+| `supervise`   | `supervisor`  | one `agent`   | `Conclude`             |
+| `facilitate`  | `facilitator` | N named       | `Conclude`             |
+| `discuss`     | `lead`        | N named       | `Adjourn` or `Recess`  |
+| `judge`       | `judge`       | (none)        | `Conclude`             |
+
+Every mode except `run` and `judge` shares one orchestration loop
+(`OrchestrationLoop`) and one tool surface (`Ask` / `Answer` /
+`Announce` / `RollCall`, plus a mode-specific terminal tool). The
+loop fires the lead's LLM, fans messages out to participants over an
+in-memory bus, wakes them when something lands, and emits the
+universal `{source, seq, event}` NDJSON envelope for every line.
+
+## The Ask / Answer protocol
+
+Coordination uses one async request/reply pattern with one piece of
+state per question — the `askId`. Every Ask returns immediately; the
+reply arrives later on the asker's inbox.
+
+### Ask
+
+```text
+Ask({ question, to? }) → { askIds: [N, …] }
+```
+
+The handler registers a pending entry per addressee, posts the
+question on the bus, and returns immediately. Each pending entry is
+keyed by a numeric `askId`. Two Asks to the same addressee each get
+their own id, so they coexist without overwriting.
+
+Broadcast: omit `to` on a multi-participant lead's Ask to fan out to
+every other participant — the result `askIds` array has one entry
+per addressee.
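The Ask bookkeeping described above can be sketched with a plain in-memory map. This is an illustration under stated assumptions, not libeval's actual handler: `AskRegistry`, `ask`, and `pendingFor` are hypothetical names, and posting the question on the bus is left out.

```javascript
// Hypothetical sketch: not libeval's implementation.
class AskRegistry {
  constructor(participants) {
    this.participants = participants; // names of everyone on the bus
    this.pending = new Map(); // askId -> { from, to, question }
    this.nextId = 1;
  }

  // Register one pending entry per addressee and return their ids.
  // Omitting `to` fans out to every participant except the asker.
  ask({ from, question, to }) {
    const addressees = to ? [to] : this.participants.filter((p) => p !== from);
    const askIds = addressees.map((addressee) => {
      const askId = this.nextId++;
      this.pending.set(askId, { from, to: addressee, question });
      return askId;
    });
    return { askIds };
  }

  // Ids of the asks that `name` still owes an answer to.
  pendingFor(name) {
    return [...this.pending]
      .filter(([, entry]) => entry.to === name)
      .map(([askId]) => askId);
  }
}

const registry = new AskRegistry(["facilitator", "alice", "bob"]);
const direct = registry.ask({ from: "facilitator", question: "Status?", to: "alice" });
const again = registry.ask({ from: "facilitator", question: "Blockers?", to: "alice" });
const broadcast = registry.ask({ from: "facilitator", question: "Ready?" });
```

Because every addressee gets its own id, the two direct Asks to `alice` coexist with the broadcast one, which is the property the README text calls out.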
+
+### Answer
+
+```text
+Answer({ message, askId? }) → routed to the asker
+```
+
+The reply lands in the asker's bus inbox as
+`[answer#N] <participant>: <text>` on a later turn. `askId` is
+optional and the handler is forgiving:
+
+- **Provided + matches an ask owed by the caller** → routes the reply
+  to that specific asker.
+- **Provided but unknown or wrong addressee** → `isError` with a
+  pointed message. The caller tried to specify; we tell them why.
+- **Omitted + exactly one ask is owed to the caller** → auto-picks
+  that ask. (Forcing an Announce when the only owed ask is obvious
+  would be pedantic.)
+- **Omitted + 0 or many asks owed** → broadcasts as Announce so the
+  message still reaches every participant.
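The four Answer rules above amount to a small decision function. The sketch below assumes pending asks live in a `Map` keyed by `askId`; `resolveAnswer` and its return shapes are hypothetical and only mirror the rules as written, not libeval's handler.

```javascript
// Hypothetical sketch of the Answer routing rules; not libeval's code.
// `pending` maps askId -> { from, to }.
function resolveAnswer(pending, caller, askId) {
  if (askId !== undefined) {
    const entry = pending.get(askId);
    if (!entry) {
      return { kind: "error", message: `unknown askId ${askId}` };
    }
    if (entry.to !== caller) {
      return { kind: "error", message: `askId ${askId} is not addressed to ${caller}` };
    }
    return { kind: "reply", askId, to: entry.from };
  }

  // askId omitted: auto-pick only when exactly one ask is owed.
  const owed = [...pending].filter(([, entry]) => entry.to === caller);
  if (owed.length === 1) {
    const [id, entry] = owed[0];
    return { kind: "reply", askId: id, to: entry.from };
  }
  return { kind: "announce" }; // zero or many owed: broadcast instead
}

const pending = new Map([
  [41, { from: "facilitator", to: "alice" }],
  [42, { from: "facilitator", to: "bob" }],
  [43, { from: "alice", to: "bob" }],
]);

const explicit = resolveAnswer(pending, "alice", 41);
const wrongAddressee = resolveAnswer(pending, "alice", 42);
const autoPicked = resolveAnswer(pending, "alice");
const ambiguous = resolveAnswer(pending, "bob");
```

Here `alice` owes exactly one ask, so omitting `askId` still routes her reply, while `bob` owes two and falls back to a broadcast.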
+
+### Announce
+
+```text
+Announce({ message }) → broadcast, no reply expected
+```
+
+Lands on every other participant's queue as `[shared] <from>: <text>`.
+
+### Inbox format
+
+Every line a participant reads on a resume is one bus message rendered
+with its tag:
+
+```text
+[ask#42] facilitator: What is your current condition?
+[answer#41] agent-1: We're at 7 out of 10.
+[shared] agent-2: FYI I'm switching to Bun 1.2.
+[system] @orchestrator: You have an unanswered ask from facilitator (askId=42)…
+```
+
+The `[ask#N]` tag is what the participant quotes back in Answer's
+`askId` field.
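For tooling that reads these tagged lines, a small parser is enough. The pattern below is inferred from the four example lines above rather than taken from a published grammar, so treat `INBOX_LINE` and `parseInboxLine` as hypothetical helpers.

```javascript
// Hypothetical parser; the grammar is inferred from the README examples.
const INBOX_LINE = /^\[(ask|answer|shared|system)(?:#(\d+))?\] ([^:]+): (.*)$/s;

function parseInboxLine(line) {
  const match = INBOX_LINE.exec(line);
  if (!match) return null;
  const [, tag, id, from, text] = match;
  // Only ask/answer lines carry an id; others report null.
  return { tag, askId: id === undefined ? null : Number(id), from, text };
}

const ask = parseInboxLine("[ask#42] facilitator: What is your current condition?");
const shared = parseInboxLine("[shared] agent-2: FYI I'm switching to Bun 1.2.");
const junk = parseInboxLine("not a bus message");
```

The numeric `askId` recovered from an `[ask#N]` line is the value a participant would pass back in `Answer`.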
+
+### Why async
+
+The lead can issue Asks, end its turn, and use the gap between turns
+for planning, reflection, or follow-up Asks while participants work
+in parallel. Nothing blocks the LLM thread waiting on a reply. The
+orchestrator wakes the lead whenever the inbox has new content.
+
+## The orchestration loop
+
+`OrchestrationLoop` runs one outer pattern for both the lead and each
+participant:
+
+1. Drain the bus queue, or wait for the first message.
+2. Run (first turn) or resume (every subsequent turn) the LLM with the
+   drained messages formatted as tagged lines.
+3. If the participant ended a turn with an unanswered Ask owed to it,
+   inject one synthetic reminder and resume once more. If still
+   unanswered, emit a `protocol_violation` event and cancel the
+   pending entry with a synthetic null answer so the asker unblocks.
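Step 3 of the loop above can be sketched as a single function. The callbacks stand in for the real loop's collaborators, the resume is shown synchronously for brevity (the real one would be asynchronous), and the event payload shape is an assumption; `enforceAnswer` is a hypothetical name.

```javascript
// Hypothetical sketch of the reminder/violation step; not libeval's code.
function enforceAnswer({ owedAskIds, resumeWithReminder, emit, cancelAsk }) {
  if (owedAskIds().length === 0) return "ok";

  // One synthetic reminder, one extra turn.
  resumeWithReminder(owedAskIds());

  const stillOwed = owedAskIds();
  if (stillOwed.length === 0) return "answered_after_reminder";

  for (const askId of stillOwed) {
    emit({ type: "protocol_violation", askId }); // payload shape is assumed
    cancelAsk(askId, null); // synthetic null answer unblocks the asker
  }
  return "violation";
}

// A participant that never answers askId 42, even after the reminder.
const events = [];
const cancelled = [];
const outcome = enforceAnswer({
  owedAskIds: () => [42],
  resumeWithReminder: () => {},
  emit: (event) => events.push(event),
  cancelAsk: (askId, answer) => cancelled.push([askId, answer]),
});
```

Capping the retry at one reminder keeps a stubborn participant from stalling the asker indefinitely.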
+
+The lead's first turn starts with the task as its initial prompt;
+participants' first runs are triggered by their first inbound message.
+
+Termination flips two flags:
+
+- `ctx.concluded` — explicit `Conclude` / `Adjourn` / `Recess`. The
+  handler also cancels any in-flight Asks with a synthetic null so
+  askers see why their question won't be answered.
+- `stopped` — broader: also true on a lead error, an agent crash, or
+  any abort path. Loops watch `stopped`; `ctx.concluded` is only used
+  for the summary's `success` / `verdict`.
+
+## Tool surface, by role
+
+| Role         | Ask | Answer | Announce | RollCall | Conclude | Other                                |
+| ------------ | --- | ------ | -------- | -------- | -------- | ------------------------------------ |
+| Facilitator  | ✓   | ✓      | ✓        | ✓        | ✓        |                                      |
+| Fac. agent   | ✓   | ✓      | ✓        | ✓        |          |                                      |
+| Supervisor   | ✓   | ✓      | ✓        | ✓        | ✓        |                                      |
+| Sup. agent   | ✓   | ✓      | ✓        | ✓        |          |                                      |
+| Discuss lead | ✓   | ✓      | ✓        | ✓        |          | `RequestForComment`, `Recess`, `Adjourn` |
+| Discuss agt  | ✓   | ✓      | ✓        | ✓        |          |                                      |
+| Judge        |     |        |          |          | ✓        |                                      |
+
+Ask's `to` accepts a participant name on multi-participant roles
+(facilitator, discuss lead, all participants); supervise's
+`supervisor` / `agent` pair don't accept `to` because there's only
+one possible target.
+
+## Minimal example: a two-participant facilitator
 
 ```js
-import {
+import { createFacilitator, createRedactor } from "@forwardimpact/libeval";
+import { query } from "@anthropic-ai/claude-agent-sdk";
+
+const facilitator = createFacilitator({
+  facilitatorCwd: process.cwd(),
+  agentConfigs: [
+    { name: "alice", role: "explorer", agentProfile: "alice" },
+    { name: "bob", role: "tester", agentProfile: "bob" },
+  ],
+  query,
+  output: process.stdout,
+  redactor: createRedactor(),
+  facilitatorProfile: "improvement-coach",
+});
+
+const result = await facilitator.run("Run a kata storyboard meeting.");
+// result.success / result.turns / NDJSON trace on process.stdout
 ```
 
+The facilitator's LLM, started with that task, has access to `Ask`,
+`Answer`, `Announce`, `RollCall`, and `Conclude`. Alice and Bob each
+get `Ask`, `Answer`, `Announce`, `RollCall`. Every tool call, every
+message routed through the bus, and every orchestrator event becomes a
+line in the trace.
+
+## Trace format
+
+Every line is one JSON object with three fields:
+
+```json
+{ "source": "facilitator", "seq": 42, "event": { … } }
+```
+
+- `source` — the participant whose LLM produced the line, or
+  `orchestrator` for loop-level events (`session_start`, `agent_start`,
+  `protocol_violation`, `lead_turn_limit`, `summary`).
+- `seq` — monotonically increasing across the whole trace; useful for
+  reconstructing the wall-clock order across concurrent participants.
+- `event` — the SDK event verbatim, or the orchestrator event payload.
+
+`fit-trace` consumes this format. See the trace analysis guide for the
+full schema.
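A trace in this envelope can be read with a few lines of plain JavaScript. The sketch below is not `fit-trace`; `parseTrace` and `countBySource` are hypothetical helpers, and the sample lines (including the `type` field inside `event`) are invented for illustration.

```javascript
// Hypothetical helpers for the {source, seq, event} envelope.
function parseTrace(ndjson) {
  return ndjson
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line))
    .sort((a, b) => a.seq - b.seq); // seq restores the global order
}

function countBySource(entries) {
  const counts = {};
  for (const { source } of entries) {
    counts[source] = (counts[source] ?? 0) + 1;
  }
  return counts;
}

// Invented sample; lines deliberately out of order.
const sample = [
  '{"source":"facilitator","seq":2,"event":{"type":"assistant"}}',
  '{"source":"orchestrator","seq":1,"event":{"type":"session_start"}}',
  '{"source":"alice","seq":3,"event":{"type":"assistant"}}',
  '{"source":"orchestrator","seq":4,"event":{"type":"summary"}}',
].join("\n");

const entries = parseTrace(sample);
const counts = countBySource(entries);
```

Sorting on `seq` before any analysis is what makes interleaved output from concurrent participants readable as one timeline.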
+
 ## Trace redaction
 
 `fit-eval run`, `fit-eval supervise`, and `fit-eval facilitate` redact
@@ -21,14 +197,37 @@ secrets in trace artifacts before they reach disk. Two layers compose:
 - **Env-var allowlist**, defaulting to `ANTHROPIC_API_KEY`, `GH_TOKEN`,
   `GITHUB_TOKEN`. The runtime values of these vars are replaced with
   `[REDACTED:env:NAME]` wherever they appear in tool inputs, tool
-  outputs, assistant text, or orchestrator summaries. Override the
-  with `LIBEVAL_REDACTION_ENV_VARS=NAME1,NAME2,…` (replaces, not
-
-
-(`
-`
-
-
-
-
-
+  outputs, assistant text, or orchestrator summaries. Override the
+  list with `LIBEVAL_REDACTION_ENV_VARS=NAME1,NAME2,…` (replaces, not
+  extends).
+- **Credential-shape patterns**, covering Anthropic API keys
+  (`sk-ant-`), GitHub PATs (`ghp_`), installation tokens (`ghs_`),
+  OAuth tokens (`gho_`), and fine-grained PATs (`github_pat_`).
+  Pattern hits become `[REDACTED:pattern:KIND]`.
+
+Redaction is on by default. To disable, set
+`LIBEVAL_REDACTION_DISABLED=1` — a stderr warning fires once per run.
+Never set this in CI on a public repository: workflow artifacts there
+are downloadable through the retention window.
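How the two layers compose can be shown with a small stand-alone function. This is a sketch under assumptions, not the code in `redaction.js`: `makeRedactor` is a hypothetical name, the regular expressions are rough guesses at the listed prefixes, and the `KIND` labels (`anthropic`, `github`, `github_pat`) are invented.

```javascript
// Hypothetical two-layer redactor; patterns and KIND labels are assumptions.
function makeRedactor(env, names = ["ANTHROPIC_API_KEY", "GH_TOKEN", "GITHUB_TOKEN"]) {
  // Layer 1: runtime values of the allowlisted env vars.
  const envPairs = names
    .map((name) => [name, env[name]])
    .filter(([, value]) => typeof value === "string" && value.length > 0);

  // Layer 2: credential-shaped strings, whatever their origin.
  const patterns = [
    ["anthropic", /sk-ant-[A-Za-z0-9_-]+/g],
    ["github_pat", /github_pat_[A-Za-z0-9_]+/g],
    ["github", /gh[pso]_[A-Za-z0-9]+/g],
  ];

  return (text) => {
    let out = text;
    for (const [name, value] of envPairs) {
      out = out.split(value).join(`[REDACTED:env:${name}]`);
    }
    for (const [kind, pattern] of patterns) {
      out = out.replace(pattern, `[REDACTED:pattern:${kind}]`);
    }
    return out;
  };
}

// Fake values only; neither string is a real credential.
const redact = makeRedactor({ GH_TOKEN: "s3cr3t-value" });
const redacted = redact("token=s3cr3t-value and ghp_abc123");
```

Running the env-var layer first means a known secret is labelled by name even when it also happens to match a credential shape.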
+
+## Module map
+
+| Module                       | Purpose                                                                 |
+| ---------------------------- | ----------------------------------------------------------------------- |
+| `agent-runner.js`            | One Claude Agent SDK session; emits NDJSON via the redactor.            |
+| `message-bus.js`             | In-memory per-participant queues + `waitForMessages` Promise wakeup.    |
+| `orchestration-toolkit.js`   | Shared Ask / Answer / Announce / Conclude / RollCall handlers + builders. |
+| `orchestration-loop.js`      | Unified lead+participant loop; reminder/violation handling.             |
+| `facilitator.js`             | `Facilitator` class + factory + system prompts.                         |
+| `supervisor.js`              | `Supervisor` class + factory + system prompts.                          |
+| `discuss-tools.js`           | Discuss-only RequestForComment / Recess / Adjourn handlers + tool servers. |
+| `discusser.js`               | `Discusser` class + factory + system prompt + resume hydration.         |
+| `judge.js`                   | One-shot post-hoc verdict via `Conclude`.                               |
+| `trace-collector.js` / `trace-query.js` / `trace-github.js` | Trace ingestion / querying / GitHub-attachment helpers. |
+| `redaction.js`               | Env-var allowlist + credential-shape pattern redaction.                 |
+
+## Documentation
+
+- [Agent Evaluations Guide](https://www.forwardimpact.team/docs/libraries/agent-evaluations/index.md) — how to run an eval and read its trace.
+- [Agent Collaboration Guide](https://www.forwardimpact.team/docs/libraries/agent-collaboration/index.md) — supervise / facilitate / discuss in depth.
+- [Trace Analysis Guide](https://www.forwardimpact.team/docs/libraries/trace-analysis/index.md) — analysing NDJSON traces with `fit-trace`.
package/package.json
CHANGED
package/src/agent-runner.js
CHANGED
@@ -1,7 +1,7 @@
 /**
- * AgentRunner — runs a single Claude Agent SDK session and emits raw
- * events to an output stream. Building block for
- * `fit-eval supervise`.
+ * AgentRunner — runs a single Claude Agent SDK session and emits raw
+ * NDJSON events to an output stream. Building block for `fit-eval run`,
+ * `fit-eval supervise`, `fit-eval facilitate`, and `fit-eval discuss`.
  *
  * Follows OO+DI: constructor injection, factory function, tests bypass factory.
  */
@@ -13,25 +13,6 @@ const DEFAULT_ALLOWED_TOOLS = ["Bash", "Read", "Glob", "Grep", "Write", "Edit"];
 // overridable — so a future caller can't accidentally reduce permissions.
 const PERMISSION_MODE = "bypassPermissions";
 
-function applyDefaults(deps) {
-  return {
-    cwd: deps.cwd,
-    query: deps.query,
-    output: deps.output,
-    model: deps.model ?? "claude-opus-4-7[1m]",
-    maxTurns: deps.maxTurns ?? 50,
-    allowedTools: deps.allowedTools ?? DEFAULT_ALLOWED_TOOLS,
-    onLine: deps.onLine ?? null,
-    onBatch: deps.onBatch ?? null,
-    batchSize: deps.batchSize ?? 3,
-    settingSources: deps.settingSources ?? [],
-    systemPrompt: deps.systemPrompt ?? null,
-    disallowedTools: deps.disallowedTools ?? [],
-    mcpServers: deps.mcpServers ?? null,
-    taskAmend: deps.taskAmend ?? null,
-  };
-}
-
 /** Run a single Claude Agent SDK session and emit raw NDJSON events to an output stream. */
 export class AgentRunner {
   /**
@@ -43,29 +24,38 @@ export class AgentRunner {
    * @param {number} [deps.maxTurns] - Maximum agentic turns; 0 means unlimited
    * @param {string[]} [deps.allowedTools] - Tools the agent may use
    * @param {function} [deps.onLine] - Callback invoked with each NDJSON line as it's produced
-   * @param {function} [deps.onBatch] - Async callback invoked with a batch of NDJSON lines at flush boundaries: every `batchSize` assistant text blocks, the terminal `result` message, and — on iterator crash/abort — once more in a final flush carrying any lines that never reached a boundary. Receives `(lines, { abort })` where calling `abort()` stops the in-flight SDK session via the AbortController. Optional; assignable at runtime so the Supervisor can swap it per turn.
-   * @param {number} [deps.batchSize] - Assistant text-block messages to accumulate before firing onBatch. Tool-only assistant messages ride along without counting. Default 3: the supervisor reviews the agent every three text turns instead of every turn. The terminal `result` always flushes regardless of count.
    * @param {string[]} [deps.settingSources] - SDK setting sources (e.g. ['project'] to load CLAUDE.md)
    * @param {string|object} [deps.systemPrompt] - SDK system prompt (string replaces default; {type:'preset', preset:'claude_code', append} appends)
    * @param {string[]} [deps.disallowedTools] - Tools to explicitly remove from the model's context
    * @param {Record<string, object>} [deps.mcpServers] - MCP server configs to pass to the SDK query
+   * @param {object} deps.redactor
    */
   constructor(deps) {
     if (!deps.cwd) throw new Error("cwd is required");
     if (!deps.query) throw new Error("query is required");
     if (!deps.output) throw new Error("output is required");
     if (!deps.redactor) throw new Error("redactor is required");
-
+    this.cwd = deps.cwd;
+    this.query = deps.query;
+    this.output = deps.output;
     this.redactor = deps.redactor;
+    this.model = deps.model ?? "claude-opus-4-7[1m]";
+    this.maxTurns = deps.maxTurns ?? 50;
+    this.allowedTools = deps.allowedTools ?? DEFAULT_ALLOWED_TOOLS;
+    this.onLine = deps.onLine ?? null;
+    this.settingSources = deps.settingSources ?? [];
+    this.systemPrompt = deps.systemPrompt ?? null;
+    this.disallowedTools = deps.disallowedTools ?? [];
+    this.mcpServers = deps.mcpServers ?? null;
+    this.taskAmend = deps.taskAmend ?? null;
     this.sessionId = null;
-    this.buffer = [];
     /** @type {AbortController|null} */
     this.currentAbortController = null;
   }
 
   /**
    * Run a new agent session with the given task.
-   * @param {string} task
+   * @param {string} task
    * @returns {Promise<{success: boolean, text: string, sessionId: string|null, error: Error|null, aborted: boolean}>}
    */
   async run(task) {
@@ -87,7 +77,7 @@ export class AgentRunner {
 
   /**
    * Resume an existing session with a follow-up prompt.
-   * @param {string} prompt
+   * @param {string} prompt
    * @returns {Promise<{success: boolean, text: string, sessionId: string|null, error: Error|null, aborted: boolean}>}
    */
   async resume(prompt) {
@@ -108,17 +98,16 @@ export class AgentRunner {
   }
 
   /**
-   * Build the options passed to every SDK query() call. Shared by run()
-   * resume() so the agent's configuration — cwd, tools, prompt,
-   * sources, turn budget — is identical across the session's
-   * resume() layers `resume: this.sessionId` on top.
+   * Build the options passed to every SDK query() call. Shared by run()
+   * and resume() so the agent's configuration — cwd, tools, prompt,
+   * setting sources, turn budget — is identical across the session's
+   * lifetime. Only resume() layers `resume: this.sessionId` on top.
    *
-   * SDK options are call-attached, not session-attached: the resumed
-   * loads the prior conversation but otherwise uses whatever
-   * call passes. Omitting tool/prompt/setting options on
-   * agent to silently lose its restrictions and
-   *
-   * @returns {object}
+   * SDK options are call-attached, not session-attached: the resumed
+   * call loads the prior conversation but otherwise uses whatever
+   * options this call passes. Omitting tool/prompt/setting options on
+   * resume causes the agent to silently lose its restrictions and
+   * persona between turns.
    */
   #callOptions(abortController) {
     return {
@@ -139,59 +128,28 @@ export class AgentRunner {
   }
 
   /**
-   *
-   *
-   *
-   *
-   * and the terminal `result` message. Tool-only assistant messages still
-   * accumulate in the pending batch and ride along in the next flush, so
-   * the supervisor always sees the tool calls that led up to each text
-   * block. Raising `batchSize` above 1 is the knob that makes the mid-turn
-   * supervisor review less chatty — with the default of 3, the supervisor
-   * sees the agent in chunks of three text turns instead of every turn.
-   *
-   * Corollary: a turn that is *entirely* tool_use with no text blocks and
-   * then hits `result` produces exactly one flush at `result` regardless
-   * of how many tools ran. That is deliberate — the supervisor only needs
-   * to weigh in when the agent surfaces something text-like to react to.
-   *
-   * INVARIANT: the `await this.onBatch(...)` call below is the ONLY
-   * suspension point in this loop. While it is pending, no further lines
-   * are pulled from the SDK generator. The Supervisor relies on this — its
-   * onBatch callback flips `currentSource` to "supervisor" for the duration
-   * of its mid-turn LLM call, and the invariant guarantees no agent line
-   * can arrive concurrently and be mis-tagged.
-   *
-   * If the supervisor calls `abort()` from inside the callback, the next
-   * iteration of the for-await loop will throw. We catch the throw, check
-   * `currentAbortController.signal.aborted` (avoiding fragility around
-   * AbortError vs DOMException shapes), and report `aborted: true` so the
-   * caller can distinguish "supervisor asked us to stop" from a real error.
+   * Iterate the SDK query iterator, mirroring every message to the
+   * output stream and the `onLine` callback. Captures `sessionId` from
+   * the SDK's `system/init` message and tracks Skill invocations into
+   * `LIBEVAL_SKILL` for downstream metrics.
    *
-   * If the iterator throws
-   *
-   *
-   * observe the partial state (e.g. note a crash or react to an external
-   * abort). A throw from that final flush becomes the returned `error`
-   * only if no earlier error was captured — the original failure wins.
-   * @param {AsyncIterable<object>} iterator
-   * @returns {Promise<{success: boolean, text: string, sessionId: string|null, error: Error|null, aborted: boolean}>}
+   * If the iterator throws and we triggered the abort ourselves
+   * (`currentAbortController.signal.aborted`), we report `aborted:
+   * true`; otherwise the error propagates as `error`.
    */
   async #consumeQuery(iterator) {
     let text = "";
     let stopReason = null;
     let error = null;
     let aborted = false;
-    const state = { pendingBatch: [], assistantTextCount: 0 };
 
     try {
       for await (const message of iterator) {
-        this.#recordLine(message
+        this.#recordLine(message);
         if (message.type === "result") {
           text = message.result ?? "";
           stopReason = message.subtype;
         }
-        await this.#maybeFlushBatch(message, state);
       }
     } catch (err) {
       if (this.currentAbortController?.signal.aborted) {
@@ -201,118 +159,28 @@ export class AgentRunner {
       }
     }
 
-
-
-
-
-
+    return {
+      success: stopReason === "success",
+      text,
+      sessionId: this.sessionId,
+      error,
+      aborted,
+    };
   }
 
-
-   * Mirror a single SDK message to the output stream, buffer, onLine
-   * callback, and (when set) the pending-batch state. Also handles
-   * session id capture and text-block counting so `#consumeQuery` can
-   * stay within the complexity budget.
-   * @param {object} message
-   * @param {{pendingBatch: string[], assistantTextCount: number}} state
-   */
-  #recordLine(message, state) {
+  #recordLine(message) {
     const redacted = this.redactor.redactValue(message);
     const line = JSON.stringify(redacted);
     this.output.write(line + "\n");
-    this.buffer.push(line);
     if (this.onLine) this.onLine(line);
-    if (this.onBatch) state.pendingBatch.push(line);
 
-    // Session-id / text-block tracking reads the ORIGINAL message —
-    // these fields are not secret carriers, and the trackers rely on
-    // shape, not string contents.
     if (message.type === "system" && message.subtype === "init") {
       this.sessionId = message.session_id;
     }
-    if (message.type === "assistant")
-      if (hasTextBlock(message)) state.assistantTextCount++;
-      trackSkillInvocation(message);
-    }
-  }
-
-  /**
-   * Terminal flush — only fires on the abnormal-end paths (iterator
-   * threw or was aborted mid-stream). Delivers any pending lines so the
-   * supervisor sees the partial state instead of losing the tail of
-   * the run. A natural-end iterator that simply ran out of messages
-   * without a `result` marker is treated as an incomplete stub (the
-   * real SDK always terminates with `result`) and its pending batch is
-   * not re-flushed. Returns an error thrown by the flush callback, or
-   * `null` if the flush succeeded or did not fire.
-   * @param {{pendingBatch: string[], assistantTextCount: number}} state
-   * @param {{error: Error|null, aborted: boolean}} outcome
-   * @returns {Promise<Error|null>}
-   */
-  async #terminalFlush(state, { error, aborted }) {
-    const loopEndedAbnormally = Boolean(error || aborted);
-    if (!loopEndedAbnormally) return null;
-    if (!this.onBatch || state.pendingBatch.length === 0) return null;
-    try {
-      const batchLines = state.pendingBatch.splice(0);
-      await this.onBatch(batchLines, {
-        abort: () => this.currentAbortController?.abort(),
-      });
-      return null;
-    } catch (flushErr) {
-      return flushErr;
-    }
-  }
-
-  /**
-   * Flush the pending batch to `onBatch` if either the batchSize threshold
-   * has been reached or the current message is the terminal `result`.
-   * Extracted so that `#consumeQuery` stays within the project's complexity
-   * budget — the flush is one cohesive unit of logic in its own right.
-   * @param {object} message
-   * @param {{pendingBatch: string[], assistantTextCount: number}} state
-   */
-  async #maybeFlushBatch(message, state) {
-    if (!this.onBatch) return;
-    const shouldFlush =
-      message.type === "result" || state.assistantTextCount >= this.batchSize;
-    if (!shouldFlush) return;
-    state.assistantTextCount = 0;
-    const batchLines = state.pendingBatch.splice(0);
-    await this.onBatch(batchLines, {
-      abort: () => this.currentAbortController?.abort(),
-    });
-  }
-
-  /**
-   * Drain buffered output lines. Used by Supervisor to tag and re-emit lines.
-   * @returns {string[]}
-   */
-  drainOutput() {
-    const lines = [...this.buffer];
-    this.buffer = [];
-    return lines;
+    if (message.type === "assistant") trackSkillInvocation(message);
   }
 }
 
-/**
- * Whether an SDK assistant message contains at least one text block.
- * Only text-block messages count toward the `batchSize` threshold — tool-only
- * assistant messages accumulate silently into the pending batch and ride along
- * in the next flush, keeping supervisor LLM cost bounded. Exported so the mock
- * runner can mirror the real flush predicate without duplicating the logic.
- * @param {object} message
- * @returns {boolean}
- */
-export function hasTextBlock(message) {
-  const content = message.message?.content ?? message.content;
-  if (!Array.isArray(content)) return false;
-  for (const block of content) {
-    if (block.type === "text" && block.text) return true;
-  }
-  return false;
-}
-
 function trackSkillInvocation(message) {
   const content = message.message?.content ?? message.content;
   if (!Array.isArray(content)) return;
@@ -327,11 +195,7 @@ function trackSkillInvocation(message) {
   }
 }
 
-/**
- * Factory function — wires real dependencies.
- * @param {object} deps - Same as AgentRunner constructor
- * @returns {AgentRunner}
- */
+/** Factory function — wires real dependencies. */
 export function createAgentRunner(deps) {
   return new AgentRunner(deps);
 }
package/src/benchmark/runner.js
CHANGED
@@ -3,7 +3,7 @@
  *
  * Phases per (task, runIndex):
  *   1. WorkdirManager.start → seed CWD + run pre-flight probe
- *   2. Supervisor
+ *   2. Supervisor session (agent + supervisor) → produce traces + submission
  *   3. Scorer.runScoring → exit-code-driven verdict via fd-3 NDJSON
  *   4. Judge.runJudge → Conclude-driven verdict mapped to pass/fail
  *   5. WorkdirManager.teardown → process-group cleanup
@@ -272,7 +272,7 @@ export class BenchmarkRunner {
   }
 
   /**
-   * Run the agent-under-test
+   * Run the agent-under-test under a Supervisor. The supervisor writes
    * a combined tagged NDJSON trace; after the session we split it into
    * agent.ndjson and supervisor.ndjson and extract cost/turns/submission.
    */
package/src/commands/supervise.js
CHANGED
@@ -53,7 +53,9 @@ export function parseSuperviseOptions(values) {
 }
 
 /**
- * Supervise command — run
+ * Supervise command — run one agent under a supervisor via the
+ * orchestration loop. The supervisor delegates work through Ask, sees
+ * each reply on its next turn, and ends with Conclude.
  *
  * Usage: fit-eval supervise [options]
  *