@forwardimpact/libeval 0.1.43 → 0.1.45

package/README.md CHANGED
@@ -7,12 +7,188 @@ reproducible evidence.
 
  <!-- END:description -->
 
- ## Getting Started
+ `libeval` provides the runtime and tool surface for multi-LLM
+ coordination: an agent talks to a supervisor, a facilitator chairs a
+ team meeting, or a lead drives an asynchronous discussion across a
+ human channel. Every conversation produces a structured NDJSON trace
+ for analysis.
+
+ ## Modes
+
+ | Mode | Lead | Participants | Terminal tool |
+ | ------------- | ------------- | ------------- | ---------------------- |
+ | `run` | (none) | one agent | task completion |
+ | `supervise` | `supervisor` | one `agent` | `Conclude` |
+ | `facilitate` | `facilitator` | N named | `Conclude` |
+ | `discuss` | `lead` | N named | `Adjourn` or `Recess` |
+ | `judge` | `judge` | (none) | `Conclude` |
+
+ Every mode except `run` and `judge` shares one orchestration loop
+ (`OrchestrationLoop`) and one tool surface (`Ask` / `Answer` /
+ `Announce` / `RollCall`, plus a mode-specific terminal tool). The
+ loop fires the lead's LLM, fans messages out to participants over an
+ in-memory bus, wakes them when something lands, and emits the
+ universal `{source, seq, event}` NDJSON envelope for every line.
+
+ ## The Ask / Answer protocol
+
+ Coordination uses one async request/reply pattern with one piece of
+ state per question — the `askId`. Every Ask returns immediately; the
+ reply arrives later on the asker's inbox.
+
+ ### Ask
+
+ ```text
+ Ask({ question, to? }) → { askIds: [N, …] }
+ ```
+
+ The handler registers a pending entry per addressee, posts the
+ question on the bus, and returns immediately. Each pending entry is
+ keyed by a numeric `askId`. Two Asks to the same addressee each get
+ their own id, so they coexist without overwriting.
+
+ Broadcast: omit `to` on a multi-participant lead's Ask to fan out to
+ every other participant — the result `askIds` array has one entry
+ per addressee.
+
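
A minimal sketch of the Ask bookkeeping this README section describes, written for illustration and not taken from the package. `createAskRegistry`, the sequential id counter, and the shape of a pending entry are all assumptions; only the `askIds` return shape and the broadcast rule come from the text above.

```js
// Illustrative only: `createAskRegistry` is a hypothetical helper, not a libeval export.
function createAskRegistry(participants) {
  let nextAskId = 1; // assumption: ids are sequential integers
  const pending = new Map(); // askId -> { from, to, question }

  function ask(from, { question, to }) {
    // Omitting `to` broadcasts to every participant except the asker.
    const addressees = to ? [to] : participants.filter((name) => name !== from);
    const askIds = addressees.map((addressee) => {
      const askId = nextAskId++;
      pending.set(askId, { from, to: addressee, question });
      return askId;
    });
    return { askIds }; // returns at once; replies arrive on a later turn
  }

  return { ask, pending };
}

const registry = createAskRegistry(["facilitator", "alice", "bob"]);
const first = registry.ask("facilitator", { question: "Status?", to: "alice" });
const second = registry.ask("facilitator", { question: "Blockers?", to: "alice" });
const broadcast = registry.ask("facilitator", { question: "Ready to conclude?" });

console.log(first.askIds, second.askIds, broadcast.askIds);
```

The two Asks to `alice` receive distinct ids, so neither pending entry overwrites the other, and the broadcast returns one id per other participant.
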
+ ### Answer
+
+ ```text
+ Answer({ message, askId? }) → routed to the asker
+ ```
+
+ The reply lands in the asker's bus inbox as
+ `[answer#N] <participant>: <text>` on a later turn. `askId` is
+ optional and the handler is forgiving:
+
+ - **Provided + matches an ask owed by the caller** → routes the reply
+   to that specific asker.
+ - **Provided but unknown or wrong addressee** → `isError` with a
+   pointed message. The caller tried to specify; we tell them why.
+ - **Omitted + exactly one ask is owed to the caller** → auto-picks
+   that ask. (Forcing an Announce when the only owed ask is obvious
+   would be pedantic.)
+ - **Omitted + 0 or many asks owed** → broadcasts as Announce so the
+   message still reaches every participant.
+
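
The four Answer cases listed above can be sketched as a single decision function. This is an illustration under stated assumptions, not the handler in `orchestration-toolkit.js`: `resolveAnswer`, the `kind` values, and the pending-entry shape are hypothetical names chosen for the example.

```js
// Illustrative only: names and return shapes are hypothetical.
function resolveAnswer(pending, caller, askId) {
  // pending: Map of askId -> { from, to }
  if (askId !== undefined) {
    const entry = pending.get(askId);
    if (!entry) {
      return { kind: "error", message: `unknown askId ${askId}` };
    }
    if (entry.to !== caller) {
      return { kind: "error", message: `askId ${askId} is not owed by ${caller}` };
    }
    return { kind: "route", askId, to: entry.from };
  }

  // askId omitted: look at everything the caller still owes.
  const owed = [...pending.entries()].filter(([, entry]) => entry.to === caller);
  if (owed.length === 1) {
    const [onlyId, entry] = owed[0];
    return { kind: "route", askId: onlyId, to: entry.from };
  }
  return { kind: "announce" }; // zero or many owed asks: broadcast instead
}

const pending = new Map([
  [41, { from: "facilitator", to: "alice" }],
  [42, { from: "facilitator", to: "bob" }],
  [43, { from: "alice", to: "bob" }],
]);

console.log(resolveAnswer(pending, "alice")); // one owed ask: auto-picked
console.log(resolveAnswer(pending, "bob")); // two owed asks: falls back to announce
console.log(resolveAnswer(pending, "bob", 43)); // explicit id: routed to alice
console.log(resolveAnswer(pending, "alice", 42)); // wrong addressee: error
```
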
+ ### Announce
+
+ ```text
+ Announce({ message }) → broadcast, no reply expected
+ ```
+
+ Lands on every other participant's queue as `[shared] <from>: <text>`.
+
+ ### Inbox format
+
+ Every line a participant reads on a resume is one bus message rendered
+ with its tag:
+
+ ```text
+ [ask#42] facilitator: What is your current condition?
+ [answer#41] agent-1: We're at 7 out of 10.
+ [shared] agent-2: FYI I'm switching to Bun 1.2.
+ [system] @orchestrator: You have an unanswered ask from facilitator (askId=42)…
+ ```
+
+ The `[ask#N]` tag is what the participant quotes back in Answer's
+ `askId` field.
+
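
For readers who want to process these inbox lines programmatically, here is a small parsing sketch. The regular expression is an assumption inferred from the four sample lines above, and `parseInboxLine` is a hypothetical name; the package may render or parse tags differently.

```js
// Illustrative parser for the tagged inbox lines shown above.
// The pattern is inferred from the README samples, not taken from libeval.
const TAGGED_LINE = /^\[(ask|answer|shared|system)(?:#(\d+))?\] ([^:]+): (.*)$/;

function parseInboxLine(line) {
  const match = TAGGED_LINE.exec(line);
  if (!match) return null; // not a tagged bus message
  const [, tag, id, from, text] = match;
  return { tag, askId: id === undefined ? null : Number(id), from, text };
}

const inbox = [
  "[ask#42] facilitator: What is your current condition?",
  "[answer#41] agent-1: We're at 7 out of 10.",
  "[shared] agent-2: FYI I'm switching to Bun 1.2.",
];

// Collect the ids a participant would quote back in Answer's `askId` field.
const owedAskIds = inbox
  .map(parseInboxLine)
  .filter((message) => message !== null && message.tag === "ask")
  .map((message) => message.askId);

console.log(owedAskIds);
```
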
+ ### Why async
+
+ The lead can issue Asks, end its turn, and use the gap between turns
+ for planning, reflection, or follow-up Asks while participants work
+ in parallel. Nothing blocks the LLM thread waiting on a reply. The
+ orchestrator wakes the lead whenever the inbox has new content.
+
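
One way to picture the wakeup described above is a per-participant queue paired with a Promise that resolves when a message lands. The sketch below is an assumption-heavy illustration: only the name `waitForMessages` appears in this README (in the module map for `message-bus.js`); `createBus`, `post`, and `drain` are hypothetical, and the real bus may be structured differently.

```js
// Illustrative only: `createBus`, `post`, and `drain` are hypothetical names.
function createBus(names) {
  const queues = new Map(names.map((name) => [name, []]));
  const waiters = new Map(); // name -> resolver of a parked waiter

  function post(to, message) {
    queues.get(to).push(message);
    const wake = waiters.get(to);
    if (wake) {
      waiters.delete(to);
      wake(); // wake the parked participant
    }
  }

  function drain(name) {
    return queues.get(name).splice(0); // remove and return everything queued
  }

  function waitForMessages(name) {
    if (queues.get(name).length > 0) return Promise.resolve();
    return new Promise((resolve) => waiters.set(name, resolve));
  }

  return { post, drain, waitForMessages };
}

async function demo() {
  const bus = createBus(["lead", "alice"]);
  // The lead parks without blocking anything else.
  const woken = bus.waitForMessages("lead").then(() => bus.drain("lead"));
  // A reply landing later resolves the parked Promise.
  bus.post("lead", "[answer#1] alice: done");
  return woken;
}

demo().then((messages) => console.log(messages));
```
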
+ ## The orchestration loop
+
+ `OrchestrationLoop` runs one outer pattern for both the lead and each
+ participant:
+
+ 1. Drain the bus queue, or wait for the first message.
+ 2. Run (first turn) or resume (every subsequent turn) the LLM with the
+    drained messages formatted as tagged lines.
+ 3. If the participant ended a turn with an unanswered Ask owed to it,
+    inject one synthetic reminder and resume once more. If still
+    unanswered, emit a `protocol_violation` event and cancel the
+    pending entry with a synthetic null answer so the asker unblocks.
+
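
Step 3 above (one reminder, then a violation) can be sketched as follows. Everything except the `protocol_violation` event name is an assumption made for illustration: `enforceOwedAsks`, the `runTurn` callback contract, the reminder wording, and the event payload fields are hypothetical.

```js
// Illustrative only: function and field names are hypothetical.
function enforceOwedAsks(participant, runTurn, emit) {
  // `runTurn(prompt)` runs or resumes the LLM and returns the askIds still owed.
  let owed = runTurn(null);
  if (owed.length === 0) return [];

  // One synthetic reminder, then exactly one more resume.
  owed = runTurn(`[system] @orchestrator: unanswered ask (askId=${owed[0]})`);
  if (owed.length === 0) return [];

  // Still unanswered: record the violation and hand back null answers
  // so each asker unblocks.
  emit({ type: "protocol_violation", participant, askIds: owed });
  return owed.map((askId) => ({ askId, answer: null }));
}

const events = [];
let turns = 0;
const stubborn = () => {
  turns += 1;
  return [42]; // never answers ask 42
};

const cancelled = enforceOwedAsks("alice", stubborn, (event) => events.push(event));
console.log(turns, events, cancelled);
```

The stubborn participant is resumed exactly once after the reminder, after which the pending ask is cancelled with a null answer rather than left to block the asker.
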
+ The lead's first turn starts with the task as its initial prompt;
+ participants' first runs are triggered by their first inbound message.
+
+ Termination flips two flags:
+
+ - `ctx.concluded` — explicit `Conclude` / `Adjourn` / `Recess`. The
+   handler also cancels any in-flight Asks with a synthetic null so
+   askers see why their question won't be answered.
+ - `stopped` — broader: also true on a lead error, an agent crash, or
+   any abort path. Loops watch `stopped`; `ctx.concluded` is only used
+   for the summary's `success` / `verdict`.
+
+ ## Tool surface, by role
+
+ | Role | Ask | Answer | Announce | RollCall | Conclude | Other |
+ | ------------ | --- | ------ | -------- | -------- | -------- | ------------------------------------ |
+ | Facilitator | ✓ | ✓ | ✓ | ✓ | ✓ | |
+ | Fac. agent | ✓ | ✓ | ✓ | ✓ | | |
+ | Supervisor | ✓ | ✓ | ✓ | ✓ | ✓ | |
+ | Sup. agent | ✓ | ✓ | ✓ | ✓ | | |
+ | Discuss lead | ✓ | ✓ | ✓ | ✓ | | `RequestForComment`, `Recess`, `Adjourn` |
+ | Discuss agt | ✓ | ✓ | ✓ | ✓ | | |
+ | Judge | | | | | ✓ | |
+
+ Ask's `to` accepts a participant name on multi-participant roles
+ (facilitator, discuss lead, all participants); supervise's
+ `supervisor` / `agent` pair don't accept `to` because there's only
+ one possible target.
+
+ ## Minimal example: a two-participant facilitator
 
  ```js
- import { createTraceCollector, createTraceQuery, createAgentRunner } from '@forwardimpact/libeval';
+ import { createFacilitator, createRedactor } from "@forwardimpact/libeval";
+ import { query } from "@anthropic-ai/claude-agent-sdk";
+
+ const facilitator = createFacilitator({
+   facilitatorCwd: process.cwd(),
+   agentConfigs: [
+     { name: "alice", role: "explorer", agentProfile: "alice" },
+     { name: "bob", role: "tester", agentProfile: "bob" },
+   ],
+   query,
+   output: process.stdout,
+   redactor: createRedactor(),
+   facilitatorProfile: "improvement-coach",
+ });
+
+ const result = await facilitator.run("Run a kata storyboard meeting.");
+ // result.success / result.turns / NDJSON trace on process.stdout
  ```
 
+ The facilitator's LLM, started with that task, has access to `Ask`,
+ `Answer`, `Announce`, `RollCall`, and `Conclude`. Alice and Bob each
+ get `Ask`, `Answer`, `Announce`, `RollCall`. Every tool call, every
+ message routed through the bus, and every orchestrator event becomes a
+ line in the trace.
+
+ ## Trace format
+
+ Every line is one JSON object with three fields:
+
+ ```json
+ { "source": "facilitator", "seq": 42, "event": { … } }
+ ```
+
+ - `source` — the participant whose LLM produced the line, or
+   `orchestrator` for loop-level events (`session_start`, `agent_start`,
+   `protocol_violation`, `lead_turn_limit`, `summary`).
+ - `seq` — monotonically increasing across the whole trace; useful for
+   reconstructing the wall-clock order across concurrent participants.
+ - `event` — the SDK event verbatim, or the orchestrator event payload.
+
+ `fit-trace` consumes this format. See the trace analysis guide for the
+ full schema.
+
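
As a rough illustration of working with the `{source, seq, event}` envelope described above, the sketch below parses NDJSON text, restores global order by `seq`, and counts lines per source. `parseTrace` and `countBySource` are hypothetical helpers, and the `type` field inside each sample `event` is an assumption; the trace analysis guide remains the reference for the real schema.

```js
// Illustrative only: helper names and the sample `event.type` field are assumptions.
function parseTrace(ndjson) {
  return ndjson
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line))
    .sort((a, b) => a.seq - b.seq); // restore order across concurrent participants
}

function countBySource(lines) {
  const counts = {};
  for (const { source } of lines) {
    counts[source] = (counts[source] ?? 0) + 1;
  }
  return counts;
}

const sample = [
  '{"source":"alice","seq":3,"event":{"type":"assistant"}}',
  '{"source":"orchestrator","seq":1,"event":{"type":"session_start"}}',
  '{"source":"facilitator","seq":2,"event":{"type":"assistant"}}',
].join("\n");

const trace = parseTrace(sample);
console.log(trace.map((line) => `${line.seq}:${line.source}`));
console.log(countBySource(trace));
```
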
  ## Trace redaction
 
  `fit-eval run`, `fit-eval supervise`, and `fit-eval facilitate` redact
@@ -21,14 +197,37 @@ secrets in trace artifacts before they reach disk. Two layers compose:
  - **Env-var allowlist**, defaulting to `ANTHROPIC_API_KEY`, `GH_TOKEN`,
    `GITHUB_TOKEN`. The runtime values of these vars are replaced with
    `[REDACTED:env:NAME]` wherever they appear in tool inputs, tool
-   outputs, assistant text, or orchestrator summaries. Override the list
-   with `LIBEVAL_REDACTION_ENV_VARS=NAME1,NAME2,…` (replaces, not extends).
- - **Credential-shape patterns**, covering Anthropic API keys (`sk-ant-`),
-   GitHub PATs (`ghp_`), installation tokens (`ghs_`), OAuth tokens
-   (`gho_`), and fine-grained PATs (`github_pat_`). Pattern hits become
-   `[REDACTED:pattern:KIND]`.
-
- Redaction is on by default. To disable, set `LIBEVAL_REDACTION_DISABLED=1`
- a stderr warning fires once per run. Never set this in CI on a public
- repository: workflow artifacts there are downloadable through the
- retention window.
+   outputs, assistant text, or orchestrator summaries. Override the
+   list with `LIBEVAL_REDACTION_ENV_VARS=NAME1,NAME2,…` (replaces, not
+   extends).
+ - **Credential-shape patterns**, covering Anthropic API keys
+   (`sk-ant-`), GitHub PATs (`ghp_`), installation tokens (`ghs_`),
+   OAuth tokens (`gho_`), and fine-grained PATs (`github_pat_`).
+   Pattern hits become `[REDACTED:pattern:KIND]`.
+
+ Redaction is on by default. To disable, set
+ `LIBEVAL_REDACTION_DISABLED=1` a stderr warning fires once per run.
+ Never set this in CI on a public repository: workflow artifacts there
+ are downloadable through the retention window.
+
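
The two redaction layers described above can be approximated with the sketch below. It is not the code in `redaction.js`: the regular expressions' character classes, the `KIND` labels, and the function names are assumptions. The default allowlist, the credential prefixes, the placeholder formats, and the replace-not-extend behaviour of `LIBEVAL_REDACTION_ENV_VARS` come from the README text.

```js
// Illustrative only: regex bodies, KIND labels, and function names are assumptions.
const DEFAULT_ENV_VARS = ["ANTHROPIC_API_KEY", "GH_TOKEN", "GITHUB_TOKEN"];

const PATTERNS = [
  { kind: "anthropic-api-key", regex: /sk-ant-[A-Za-z0-9_-]+/g },
  { kind: "github-fine-grained-pat", regex: /github_pat_[A-Za-z0-9_]+/g },
  { kind: "github-pat", regex: /ghp_[A-Za-z0-9]+/g },
  { kind: "github-installation-token", regex: /ghs_[A-Za-z0-9]+/g },
  { kind: "github-oauth-token", regex: /gho_[A-Za-z0-9]+/g },
];

function envAllowlist(env) {
  // The override replaces the default list; it does not extend it.
  const override = env.LIBEVAL_REDACTION_ENV_VARS;
  return override ? override.split(",").filter(Boolean) : DEFAULT_ENV_VARS;
}

function redact(text, env) {
  let out = text;
  // Layer 1: replace the runtime values of allowlisted env vars.
  for (const name of envAllowlist(env)) {
    const value = env[name];
    if (value) out = out.split(value).join(`[REDACTED:env:${name}]`);
  }
  // Layer 2: replace anything that still has a credential shape.
  for (const { kind, regex } of PATTERNS) {
    out = out.replace(regex, `[REDACTED:pattern:${kind}]`);
  }
  return out;
}

const env = { GH_TOKEN: "s3cret-value" };
console.log(redact("token=s3cret-value key=sk-ant-abc123 pat=ghp_XYZ789", env));
```
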
+ ## Module map
+
+ | Module | Purpose |
+ | ---------------------------- | ----------------------------------------------------------------------- |
+ | `agent-runner.js` | One Claude Agent SDK session; emits NDJSON via the redactor. |
+ | `message-bus.js` | In-memory per-participant queues + `waitForMessages` Promise wakeup. |
+ | `orchestration-toolkit.js` | Shared Ask / Answer / Announce / Conclude / RollCall handlers + builders. |
+ | `orchestration-loop.js` | Unified lead+participant loop; reminder/violation handling. |
+ | `facilitator.js` | `Facilitator` class + factory + system prompts. |
+ | `supervisor.js` | `Supervisor` class + factory + system prompts. |
+ | `discuss-tools.js` | Discuss-only RequestForComment / Recess / Adjourn handlers + tool servers. |
+ | `discusser.js` | `Discusser` class + factory + system prompt + resume hydration. |
+ | `judge.js` | One-shot post-hoc verdict via `Conclude`. |
+ | `trace-collector.js` / `trace-query.js` / `trace-github.js` | Trace ingestion / querying / GitHub-attachment helpers. |
+ | `redaction.js` | Env-var allowlist + credential-shape pattern redaction. |
+
+ ## Documentation
+
+ - [Agent Evaluations Guide](https://www.forwardimpact.team/docs/libraries/agent-evaluations/index.md) — how to run an eval and read its trace.
+ - [Agent Collaboration Guide](https://www.forwardimpact.team/docs/libraries/agent-collaboration/index.md) — supervise / facilitate / discuss in depth.
+ - [Trace Analysis Guide](https://www.forwardimpact.team/docs/libraries/trace-analysis/index.md) — analysing NDJSON traces with `fit-trace`.
@@ -46,10 +46,10 @@ export const definition = {
            description:
              "Claude model for the agent-under-test (default: claude-sonnet-4-6)",
          },
-         "supervisor-model": {
+         "lead-model": {
            type: "string",
            description:
-             "Claude model for the supervisor (default: claude-opus-4-7)",
+             "Claude model for the lead role (default: claude-opus-4-7)",
          },
          "judge-model": {
            type: "string",
package/bin/fit-eval.js CHANGED
@@ -9,6 +9,8 @@ import { runTeeCommand } from "../src/commands/tee.js";
  import { runRunCommand } from "../src/commands/run.js";
  import { runSuperviseCommand } from "../src/commands/supervise.js";
  import { runFacilitateCommand } from "../src/commands/facilitate.js";
+ import { runDiscussCommand } from "../src/commands/discuss.js";
+ import { runCallbackCommand } from "../src/commands/callback.js";
 
  // `bun build --compile` injects FIT_EVAL_VERSION via --define, eliminating
  // the readFileSync branch in the compiled binary (which would ENOENT against
@@ -18,6 +20,18 @@ const VERSION =
    JSON.parse(readFileSync(new URL("../package.json", import.meta.url), "utf8"))
      .version;
 
+ const LEAD_OPTIONS = {
+   "lead-profile": {
+     type: "string",
+     description: "Lead role profile name (supervisor / facilitator / chair)",
+   },
+   "lead-model": {
+     type: "string",
+     description:
+       "Claude model for the lead role (default: claude-opus-4-7[1m])",
+   },
+ };
+
  const definition = {
    name: "fit-eval",
    version: VERSION,
@@ -93,11 +107,7 @@ const definition = {
            description:
              "Claude model for the agent (default: claude-opus-4-7[1m])",
          },
-         "supervisor-model": {
-           type: "string",
-           description:
-             "Claude model for the supervisor (default: claude-opus-4-7[1m])",
-         },
+         ...LEAD_OPTIONS,
          "max-turns": {
            type: "string",
            description:
@@ -117,10 +127,6 @@ const definition = {
            description: "Supervisor working directory",
          },
          "agent-cwd": { type: "string", description: "Agent working directory" },
-         "supervisor-profile": {
-           type: "string",
-           description: "Supervisor (judge) profile name",
-         },
          "supervisor-allowed-tools": {
            type: "string",
            description: "Supervisor tool allowlist",
@@ -154,11 +160,7 @@ const definition = {
            type: "string",
            description: "Claude model for agents (default: claude-opus-4-7[1m])",
          },
-         "facilitator-model": {
-           type: "string",
-           description:
-             "Claude model for the facilitator (default: claude-opus-4-7[1m])",
-         },
+         ...LEAD_OPTIONS,
          "max-turns": {
            type: "string",
            description: "Max agentic turns (default: 20, 0 = unlimited)",
@@ -171,10 +173,6 @@ const definition = {
            type: "string",
            description: "Facilitator working directory",
          },
-         "facilitator-profile": {
-           type: "string",
-           description: "Facilitator profile name",
-         },
          "agent-profiles": {
            type: "string",
            description:
@@ -186,6 +184,56 @@ const definition = {
          },
        },
      },
+     {
+       name: "discuss",
+       args: "",
+       description:
+         "Run an async, suspendable discussion — Chair + N participants + bridge callback",
+       options: {
+         "task-file": {
+           type: "string",
+           description: "Path to a markdown task file",
+         },
+         "task-text": {
+           type: "string",
+           description: "Inline task text (alternative to --task-file)",
+         },
+         "task-amend": {
+           type: "string",
+           description: "Additional text appended to the task",
+         },
+         "agent-model": {
+           type: "string",
+           description: "Claude model for agents (default: claude-opus-4-7[1m])",
+         },
+         ...LEAD_OPTIONS,
+         "max-turns": {
+           type: "string",
+           description: "Max agentic turns (default: 40, 0 = unlimited)",
+         },
+         output: {
+           type: "string",
+           description: "Write the NDJSON trace to a file",
+         },
+         "agent-profiles": {
+           type: "string",
+           description: "Comma-separated participant profile names (optional)",
+         },
+         "agent-cwd": {
+           type: "string",
+           description: "Working directory shared by participants (default: .)",
+         },
+         "discussion-id": {
+           type: "string",
+           description:
+             "Stable id for the threaded conversation; carried through traces for linking",
+         },
+         "resume-context": {
+           type: "string",
+           description: "JSON-serialized prior state for a resumed run",
+         },
+       },
+     },
      {
        name: "output",
        args: "",
@@ -198,6 +246,35 @@ const definition = {
        description:
          "Stream readable text to stdout while saving raw NDJSON to a file",
      },
+     {
+       name: "callback",
+       args: "",
+       description:
+         "Extract the terminal summary from an NDJSON trace and POST it to a callback URL",
+       options: {
+         "trace-file": {
+           type: "string",
+           description: "Path to the NDJSON trace file",
+         },
+         "callback-url": {
+           type: "string",
+           description: "URL to POST the summary to",
+         },
+         "correlation-id": {
+           type: "string",
+           description: "Correlation ID to include in the payload",
+         },
+         "run-url": {
+           type: "string",
+           description: "GitHub Actions run URL (optional)",
+         },
+         "discussion-id": {
+           type: "string",
+           description:
+             "Discussion id (fallback when the trace lacks a meta event)",
+         },
+       },
+     },
    ],
    globalOptions: {
      format: { type: "string", description: "Output format (json|text)" },
@@ -207,8 +284,9 @@ const definition = {
    },
    examples: [
      "fit-eval run --task-file=task.md --output=trace.ndjson",
-     "fit-eval supervise --task-file=task.md --supervisor-profile=judge --agent-profile=coder --output=trace.ndjson",
-     'fit-eval facilitate --task-file=task.md --facilitator-profile=lead --agent-profiles="security-engineer,technical-writer" --output=trace.ndjson',
+     "fit-eval supervise --task-file=task.md --lead-profile=judge --agent-profile=coder --output=trace.ndjson",
+     'fit-eval facilitate --task-file=task.md --lead-profile=lead --agent-profiles="security-engineer,technical-writer" --output=trace.ndjson',
+     'fit-eval discuss --task-file=task.md --lead-profile=release-engineer --agent-profiles="staff-engineer,security-engineer" --discussion-id=GD_kw...',
      "fit-eval output --format=text < trace.ndjson",
    ],
    documentation: [
@@ -234,7 +312,7 @@ const definition = {
        title: "Agent Teams",
        url: "https://www.forwardimpact.team/docs/products/agent-teams/index.md",
        description:
-         "How to author the agent, supervisor, and facilitator profiles consumed by --agent-profile, --supervisor-profile, --facilitator-profile, and --agent-profiles.",
+         "How to author the profiles consumed by --agent-profile, --lead-profile, and --agent-profiles.",
      },
    ],
  };
@@ -248,6 +326,8 @@ const COMMANDS = {
    run: runRunCommand,
    supervise: runSuperviseCommand,
    facilitate: runFacilitateCommand,
+   discuss: runDiscussCommand,
+   callback: runCallbackCommand,
  };
 
  async function main() {
package/bin/fit-trace.js CHANGED
@@ -26,6 +26,7 @@ import {
    runSplitCommand,
  } from "../src/commands/trace.js";
  import { runAssertCommand } from "../src/commands/assert.js";
+ import { runByDiscussionCommand } from "../src/commands/by-discussion.js";
 
  // `bun build --compile` injects FIT_TRACE_VERSION via --define, eliminating
  // the readFileSync branch in the compiled binary (which would ENOENT against
@@ -160,6 +161,18 @@ const definition = {
        args: "<file> <index>",
        description: "Single turn by index",
      },
+     {
+       name: "by-discussion",
+       args: "<discussion-id> [trace-dir]",
+       description:
+         "List trace files whose meta header carries the given discussion_id, ordered by first-event timestamp",
+       options: {
+         "trace-dir": {
+           type: "string",
+           description: "Directory to scan (default: traces)",
+         },
+       },
+     },
      {
        name: "filter",
        args: "<file>",
@@ -307,6 +320,7 @@ const COMMANDS = {
    filter: runFilterCommand,
    split: runSplitCommand,
    assert: runAssertCommand,
+   "by-discussion": runByDiscussionCommand,
  };
 
  async function main() {
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "@forwardimpact/libeval",
-   "version": "0.1.43",
+   "version": "0.1.45",
    "description": "Agent evaluation framework — prove whether agent changes improved outcomes with reproducible evidence.",
    "keywords": [
      "eval",