@forwardimpact/libeval 0.1.44 → 0.1.46

package/README.md CHANGED
@@ -7,28 +7,193 @@ reproducible evidence.
 
  <!-- END:description -->
 
- ## Getting Started
+ `libeval` provides the runtime and tool surface for multi-LLM coordination —
+ an agent talks to a supervisor, a facilitator chairs a meeting, a lead drives
+ an asynchronous discussion — plus a CLI suite that runs evals, queries the
+ traces they produce, and edits skill files under controlled conditions.
+
+ ## CLIs
+
+ | CLI | Purpose |
+ | --------------- | ---------------------------------------------------------------------- |
+ | `fit-eval` | Run agents in `run`/`supervise`/`facilitate`/`discuss` subcommands. |
+ | `fit-trace` | Download, query, and analyze NDJSON traces produced by `fit-eval`. |
+ | `fit-benchmark` | Run task families for N runs each and aggregate pass@k. |
+ | `fit-selfedit` | Write stdin to `.claude/**` paths, gated by settings.json + branch. |
+
+ `fit-eval`'s subcommands share one orchestration loop and one async tool
+ surface, below. The `judge` role is a profile passed to `supervise`.
+
+ ## Modes
+
+ | Mode | Lead | Participants | Terminal tool |
+ | ------------ | ------------- | ------------- | ---------------------- |
+ | `run` | (none) | one agent | task completion |
+ | `supervise` | `supervisor` | one `agent` | `Conclude` |
+ | `facilitate` | `facilitator` | N named | `Conclude` |
+ | `discuss` | `lead` | N named | `Adjourn` or `Recess` |
+ | `judge` | `judge` | (none) | `Conclude` |
+
+ `run` and `judge` are one-shot. The other three share `OrchestrationLoop`
+ plus an async Ask/Answer/Announce/RollCall tool surface; the loop fans
+ messages out over an in-memory bus and emits a `{source, seq, event}`
+ NDJSON envelope for every line.
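The in-memory bus mentioned above can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the contents of `message-bus.js`: the `MessageBus` class and its `send` and `drain` methods are hypothetical names, and only `waitForMessages` is borrowed from the README's module map.

```js
// Hypothetical sketch: per-participant queues plus a Promise-based wakeup.
class MessageBus {
  constructor(names) {
    // One FIFO queue per named participant.
    this.queues = new Map(names.map((name) => [name, []]));
    // At most one parked waiter (a Promise resolver) per participant.
    this.waiters = new Map();
  }

  // Enqueue a message and wake the recipient if it is waiting.
  send(to, message) {
    const queue = this.queues.get(to);
    if (!queue) throw new Error(`unknown participant: ${to}`);
    queue.push(message);
    const wake = this.waiters.get(to);
    if (wake) {
      this.waiters.delete(to);
      wake();
    }
  }

  // Remove and return everything currently queued for a participant.
  drain(name) {
    const queue = this.queues.get(name);
    return queue.splice(0, queue.length);
  }

  // Resolve immediately if messages are queued, otherwise on the next send().
  waitForMessages(name) {
    if (this.queues.get(name).length > 0) return Promise.resolve();
    return new Promise((resolve) => this.waiters.set(name, resolve));
  }
}

const bus = new MessageBus(["facilitator", "alice"]);
bus.waitForMessages("alice").then(() => {
  console.log(bus.drain("alice"));
});
bus.send("alice", "[ask#1] facilitator: Ready?");
```

A participant that finds its queue empty parks on `waitForMessages` rather than polling, which is one way to get the "drains the bus (or waits)" behaviour the README describes.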
+
+ ## Async Ask / Answer / Announce
+
+ ```text
+ Ask({ question, to? }) → { askIds: [N, …] }
+ Answer({ message, askId? }) → routed to the asker
+ Announce({ message }) → broadcast, no reply expected
+ ```
+
+ Every Ask returns immediately and registers a pending entry keyed by an
+ `askId`. The reply arrives later on the asker's inbox as `[answer#N]
+ <participant>: <text>`. Broadcast: omit `to` on a multi-participant
+ lead. Answer's `askId` is optional — the handler is forgiving:
+
+ - **Provided + matches an ask owed by the caller** → routes to that asker.
+ - **Provided but unknown or wrong addressee** → `isError` with a pointed message.
+ - **Omitted + exactly one ask owed to the caller** → auto-picks it.
+ - **Omitted + 0 or many asks owed** → broadcasts as Announce.
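The four Answer cases above can be sketched as a single routing function. This is an illustration only, assuming a pending-ask table shaped as `askId -> { asker, addressee }`; the function name `routeAnswer`, the table shape, and the returned objects are hypothetical and not libeval's API.

```js
// Hypothetical sketch of the Answer routing rules listed in the README.
// `pending` is a Map of askId -> { asker, addressee }.
function routeAnswer(pending, caller, askId) {
  if (askId !== undefined) {
    const ask = pending.get(askId);
    // Provided but unknown, or addressed to someone else: report an error.
    if (!ask || ask.addressee !== caller) {
      return { isError: true, reason: `askId ${askId} is not owed by ${caller}` };
    }
    // Provided and owed by the caller: route to the asker.
    return { routeTo: ask.asker, askId };
  }

  // Omitted: look at everything the caller currently owes.
  const owed = [...pending].filter(([, ask]) => ask.addressee === caller);
  if (owed.length === 1) {
    const [id, ask] = owed[0];
    return { routeTo: ask.asker, askId: id };
  }
  // Zero or several owed asks: fall back to a broadcast.
  return { broadcast: true };
}

const pending = new Map([
  [41, { asker: "facilitator", addressee: "agent-1" }],
  [42, { asker: "facilitator", addressee: "agent-2" }],
]);
console.log(routeAnswer(pending, "agent-1")); // auto-picks ask 41
```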
+
+ Inbox lines on resume:
+
+ ```text
+ [ask#42] facilitator: What is your current condition?
+ [answer#41] agent-1: We're at 7 out of 10.
+ [shared] agent-2: FYI I'm switching to Bun 1.2.
+ [system] @orchestrator: You have an unanswered ask from facilitator (askId=42)…
+ ```
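For readers who want to post-process these tagged lines, a small parser might look like the sketch below. It assumes the four tags shown in the sample (`ask`, `answer`, `shared`, `system`) are the complete set and that sender names never contain a colon; both are assumptions, and `parseInboxLine` is a hypothetical helper, not part of libeval.

```js
// Hypothetical helper: split one tagged inbox line into its parts.
function parseInboxLine(line) {
  const match = line.match(
    /^\[(ask|answer|shared|system)(?:#(\d+))?\] ([^:]+): (.*)$/s,
  );
  if (!match) return null;
  const [, kind, id, from, text] = match;
  // Only ask/answer lines carry a numeric id; others get null.
  return { kind, id: id === undefined ? null : Number(id), from, text };
}

console.log(parseInboxLine("[ask#42] facilitator: What is your current condition?"));
console.log(parseInboxLine("[shared] agent-2: FYI I'm switching to Bun 1.2."));
```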
+
+ Async means the lead can issue Asks, end its turn, and plan in the gap
+ while participants work in parallel — nothing blocks the LLM thread.
+
+ ## Orchestration loop
+
+ Each participant drains the bus (or waits), runs/resumes the LLM with
+ drained messages as tagged lines, and on an unanswered owed Ask injects
+ one synthetic reminder before emitting `protocol_violation` and
+ unblocking the asker with a synthetic null answer.
+
+ Termination uses two flags. `ctx.concluded` is explicit
+ `Conclude`/`Adjourn`/`Recess` — also cancels in-flight Asks so askers
+ see why their question won't be answered. `stopped` is broader: lead
+ error, agent crash, abort path. Loops watch `stopped`; `ctx.concluded`
+ only feeds the summary's `success`/`verdict`.
+
+ ## Tool surface, by role
+
+ | Role | Ask | Answer | Announce | RollCall | Conclude | Other |
+ | ------------ | --- | ------ | -------- | -------- | -------- | ---------------------------------------- |
+ | Facilitator | ✓ | ✓ | ✓ | ✓ | ✓ | |
+ | Fac. agent | ✓ | ✓ | ✓ | ✓ | | |
+ | Supervisor | ✓ | ✓ | ✓ | ✓ | ✓ | |
+ | Sup. agent | ✓ | ✓ | ✓ | ✓ | | |
+ | Discuss lead | ✓ | ✓ | ✓ | ✓ | | `RequestForComment`, `Recess`, `Adjourn` |
+ | Discuss agt | ✓ | ✓ | ✓ | ✓ | | |
+ | Judge | | | | | ✓ | |
+
+ Ask's `to` accepts a participant name on multi-participant roles
+ (facilitator, discuss lead, all participants). The supervise pair has
+ only one possible target so `to` is rejected there.
+
+ ## Minimal example: two-participant facilitator
 
  ```js
- import { createTraceCollector, createTraceQuery, createAgentRunner } from '@forwardimpact/libeval';
+ import { createFacilitator, createRedactor } from "@forwardimpact/libeval";
+ import { query } from "@anthropic-ai/claude-agent-sdk";
+
+ const facilitator = createFacilitator({
+   facilitatorCwd: process.cwd(),
+   agentConfigs: [
+     { name: "alice", role: "explorer", agentProfile: "alice" },
+     { name: "bob", role: "tester", agentProfile: "bob" },
+   ],
+   query,
+   output: process.stdout,
+   redactor: createRedactor(),
+   facilitatorProfile: "improvement-coach",
+ });
+
+ const result = await facilitator.run("Run a kata storyboard meeting.");
+ // result.success / result.turns / NDJSON trace on process.stdout
+ ```
+
+ The facilitator gets `Ask`/`Answer`/`Announce`/`RollCall`/`Conclude`;
+ each agent gets the same minus `Conclude`. Every tool call, bus
+ message, and orchestrator event becomes one trace line.
+
+ ## Trace format and redaction
+
+ Each line is `{ "source": "<participant|orchestrator>", "seq": N, "event":
+ {…} }`. `seq` is monotonic across the whole trace; `orchestrator` emits
+ `session_start`, `agent_start`, `protocol_violation`, `lead_turn_limit`,
+ and `summary`. `event` is the SDK event verbatim or the orchestrator
+ payload. `fit-trace` consumes this format.
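A reader of this envelope format could be sketched as below. It is a rough illustration, not `fit-trace`'s implementation: `readTrace` is a hypothetical name, "monotonic" is interpreted as strictly increasing (an assumption), and the `event` payloads in the sample are made-up placeholders.

```js
// Hypothetical sketch: parse NDJSON envelopes and check the seq ordering.
function readTrace(ndjson) {
  const envelopes = ndjson
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));

  let previous = -Infinity;
  for (const { source, seq, event } of envelopes) {
    if (typeof source !== "string" || typeof seq !== "number" || event === undefined) {
      throw new Error("malformed envelope: expected { source, seq, event }");
    }
    // Assumption: seq must strictly increase across the whole trace.
    if (seq <= previous) {
      throw new Error(`seq ${seq} does not follow ${previous}`);
    }
    previous = seq;
  }
  return envelopes;
}

// Placeholder payloads; real events carry SDK or orchestrator data.
const sample = [
  '{"source":"orchestrator","seq":0,"event":{"type":"session_start"}}',
  '{"source":"alice","seq":1,"event":{"type":"assistant"}}',
].join("\n");

console.log(readTrace(sample).map((e) => `${e.seq} ${e.source}`));
```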
+
+ Redaction is on by default for `fit-eval run`/`supervise`/`facilitate`
+ and composes two layers:
+
+ - **Env-var allowlist** — `ANTHROPIC_API_KEY`, `GH_TOKEN`, `GITHUB_TOKEN`
+   by default; override with `LIBEVAL_REDACTION_ENV_VARS=NAME1,…`
+   (replaces, not extends). Runtime values become `[REDACTED:env:NAME]`
+   everywhere they appear.
+ - **Credential-shape patterns** — `sk-ant-`, `ghp_`, `ghs_`, `gho_`,
+   `github_pat_`. Hits become `[REDACTED:pattern:KIND]`.
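The two layers above compose roughly as in the following sketch. It is not the code in `redaction.js`: the function names, the exact regular expressions, and the `KIND` labels (`anthropic`, `ghp`, and so on) are assumptions, since the README only names the prefixes and the placeholder format.

```js
// Default allowlist as stated in the README.
const DEFAULT_ENV_VARS = ["ANTHROPIC_API_KEY", "GH_TOKEN", "GITHUB_TOKEN"];

// Assumed token shapes: the README gives prefixes only, not full patterns.
const SHAPES = {
  anthropic: /sk-ant-[A-Za-z0-9_-]+/g,
  ghp: /ghp_[A-Za-z0-9]+/g,
  ghs: /ghs_[A-Za-z0-9]+/g,
  gho: /gho_[A-Za-z0-9]+/g,
  github_pat: /github_pat_[A-Za-z0-9_]+/g,
};

// LIBEVAL_REDACTION_ENV_VARS replaces the default list rather than extending it.
function allowlist(env) {
  const override = env.LIBEVAL_REDACTION_ENV_VARS;
  return override ? override.split(",").map((name) => name.trim()) : DEFAULT_ENV_VARS;
}

function redact(text, env) {
  let out = text;
  // Layer 1: replace the runtime values of allowlisted variables.
  for (const name of allowlist(env)) {
    const value = env[name];
    if (value) out = out.split(value).join(`[REDACTED:env:${name}]`);
  }
  // Layer 2: replace anything that looks like a credential.
  for (const [kind, pattern] of Object.entries(SHAPES)) {
    out = out.replace(pattern, `[REDACTED:pattern:${kind}]`);
  }
  return out;
}

console.log(redact("token=ghp_abc123 key=secret", { GH_TOKEN: "secret" }));
```

Running the value-based layer first means a known secret is labelled with its variable name even when it also happens to match a credential shape.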
+
+ Set `LIBEVAL_REDACTION_DISABLED=1` to disable (one stderr warning per
+ run). Never on CI for a public repo — workflow artifacts are
+ downloadable through retention.
+
+ ## Module map
+
+ | Module | Purpose |
+ | ----------------------------------------------------------- | -------------------------------------------------------------------- |
+ | `agent-runner.js` | One Claude Agent SDK session; emits NDJSON via the redactor. |
+ | `message-bus.js` | Per-participant queues + `waitForMessages` Promise wakeup. |
+ | `orchestration-toolkit.js` | Shared Ask/Answer/Announce/Conclude/RollCall handlers + builders. |
+ | `orchestration-loop.js` | Unified lead+participant loop; reminder/violation handling. |
+ | `facilitator.js` / `supervisor.js` / `discusser.js` / `judge.js` | Per-mode class + factory + system prompt. |
+ | `discuss-tools.js` | Discuss-only `RequestForComment`/`Recess`/`Adjourn`. |
+ | `trace-collector.js` / `trace-query.js` / `trace-github.js` | Trace ingestion / querying / GitHub-attachment helpers. |
+ | `redaction.js` | Env-var allowlist + credential-shape pattern redaction. |
+
+ ## fit-selfedit
+
+ A narrow, audited bypass for sessions where `Edit`/`Write` (and bash
+ writes) are blocked against paths the project's own allowlist permits —
+ see [#1162](https://github.com/forwardimpact/monorepo/issues/1162) and
+ [#441](https://github.com/forwardimpact/monorepo/issues/441) for the
+ original episodes. Reads stdin, writes the target, exits 0 / 2
+ (safeguard violation) / 1 (I/O error).
+
+ ```sh
+ echo "<content>" | bunx fit-selfedit <path>
  ```
 
- ## Trace redaction
-
- `fit-eval run`, `fit-eval supervise`, and `fit-eval facilitate` redact
- secrets in trace artifacts before they reach disk. Two layers compose:
-
- - **Env-var allowlist**, defaulting to `ANTHROPIC_API_KEY`, `GH_TOKEN`,
-   `GITHUB_TOKEN`. The runtime values of these vars are replaced with
-   `[REDACTED:env:NAME]` wherever they appear in tool inputs, tool
-   outputs, assistant text, or orchestrator summaries. Override the list
-   with `LIBEVAL_REDACTION_ENV_VARS=NAME1,NAME2,…` (replaces, not extends).
- - **Credential-shape patterns**, covering Anthropic API keys (`sk-ant-`),
-   GitHub PATs (`ghp_`), installation tokens (`ghs_`), OAuth tokens
-   (`gho_`), and fine-grained PATs (`github_pat_`). Pattern hits become
-   `[REDACTED:pattern:KIND]`.
-
- Redaction is on by default. To disable, set `LIBEVAL_REDACTION_DISABLED=1`
- — a stderr warning fires once per run. Never set this in CI on a public
- repository: workflow artifacts there are downloadable through the
- retention window.
+ Two safeguards, checked in order:
+
+ 1. **Settings-allow.** Walk upward from the target with
+    [`Finder.findUpward`](../libutil/src/finder.js) to find the nearest
+    `.claude/settings.json`. The target relative to its grandparent
+    directory must match at least one `Edit(<glob>)` rule in
+    `permissions.allow[]` (matched with
+    [`minimatch`](https://github.com/isaacs/minimatch), `dot: true`).
+    Settings.json is the single source of truth — widen the project
+    allowlist and the CLI follows. Traversal like `.claude/../README.md`
+    is rejected as a side effect: `path.resolve` collapses `..` first,
+    then the resolved path tests against the rules.
+
+ 2. **Branch scope.** `git rev-parse --abbrev-ref HEAD` must not be
+    `HEAD` (detached) or `main`. Edits ride a feature branch through
+    whatever merge gates the project has configured.
+
+ Failure messages name the safeguard that rejected; safeguard 1 also
+ lists the `Edit()` rules that were tried.
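The rule extraction and the branch check can be sketched in isolation as follows. The regular expression and the `HEAD`/`main` comparison mirror `bin/fit-selfedit.js` in this diff, where the "grandparent directory" is the directory that contains `.claude/`. The function names `editPatterns` and `branchAllowed` are hypothetical, and glob matching is left out because it depends on the external `minimatch` package.

```js
// Pull the glob out of each "Edit(<glob>)" entry in permissions.allow[].
function editPatterns(settings) {
  const allow = settings?.permissions?.allow;
  if (!Array.isArray(allow)) return [];
  return allow
    .filter((rule) => typeof rule === "string")
    .map((rule) => rule.match(/^Edit\((.+)\)$/)?.[1])
    .filter(Boolean);
}

// `branch` is the output of `git rev-parse --abbrev-ref HEAD`.
// "HEAD" means a detached checkout; "main" is always refused.
function branchAllowed(branch) {
  return branch !== "HEAD" && branch !== "main";
}

const settings = {
  permissions: { allow: ["Edit(.claude/skills/**)", "Bash(ls:*)"] },
};
console.log(editPatterns(settings));
console.log(branchAllowed("feature/skill-update"));
```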
+
+ ## Documentation
+
+ - [Agent Evaluations Guide](https://www.forwardimpact.team/docs/libraries/agent-evaluations/index.md) — how to run an eval and read its trace.
+ - [Agent Collaboration Guide](https://www.forwardimpact.team/docs/libraries/agent-collaboration/index.md) — supervise / facilitate / discuss in depth.
+ - [Trace Analysis Guide](https://www.forwardimpact.team/docs/libraries/trace-analysis/index.md) — analysing NDJSON traces with `fit-trace`.
package/bin/fit-selfedit.js ADDED
@@ -0,0 +1,162 @@
+ #!/usr/bin/env node
+ /**
+  * fit-selfedit — write stdin to a path that .claude/settings.json
+  * permits Edit on, while on a non-main git branch. See
+  * libraries/libeval/README.md § fit-selfedit for the full rationale.
+  */
+
+ import { existsSync, readFileSync, writeFileSync } from "node:fs";
+ import fsPromises from "node:fs/promises";
+ import { parseArgs } from "node:util";
+ import { resolve, relative, dirname } from "node:path";
+ import { execFileSync } from "node:child_process";
+
+ import { Finder } from "@forwardimpact/libutil";
+ import { minimatch } from "minimatch";
+
+ const HELP = `fit-selfedit — write stdin to a settings.json-allowed path on a non-main branch.
+
+ Usage:
+   echo content | fit-selfedit <path>
+   fit-selfedit <path> < input.txt
+
+ Safeguards (checked in order):
+   1. The nearest .claude/settings.json must contain an Edit(<glob>) rule
+      in permissions.allow[] that resolves to the target path.
+   2. HEAD must not be detached and the current branch must not be 'main'.
+
+ Exit codes:
+   0 wrote the file
+   2 safeguard violation (no settings.json, no matching Edit rule, on
+     main, detached HEAD, missing parent directory, TTY stdin)
+   1 unexpected I/O error
+
+ Why this exists:
+   Some session harnesses block Edit/Write (and interactive bash writes)
+   on .claude/skills/**, even when the project allowlist permits them.
+   This CLI is a narrow, audited bypass: a subprocess write that still
+   has to clear the project allowlist and the normal merge gates.
+ `;
+
+ function fail(message) {
+   process.stderr.write(`fit-selfedit: ${message}\n`);
+   process.exit(2);
+ }
+
+ const { values, positionals } = parseArgs({
+   options: {
+     help: { type: "boolean", short: "h" },
+     version: { type: "boolean" },
+   },
+   allowPositionals: true,
+ });
+
+ if (values.help) {
+   process.stdout.write(HELP);
+   process.exit(0);
+ }
+
+ if (values.version) {
+   const pkg = JSON.parse(
+     readFileSync(new URL("../package.json", import.meta.url), "utf8"),
+   );
+   process.stdout.write(`${pkg.version}\n`);
+   process.exit(0);
+ }
66
+
67
+ const [targetArg, ...extra] = positionals;
68
+ if (!targetArg) fail("missing <path> (try --help)");
69
+ if (extra.length > 0) fail(`unexpected extra arguments: ${extra.join(" ")}`);
70
+
71
+ const absoluteTarget = resolve(process.cwd(), targetArg);
72
+
73
+ // Safeguard 1: settings.json must grant Edit() on this path.
74
+ const settingsPath = new Finder(fsPromises, { debug() {} }).findUpward(
75
+ dirname(absoluteTarget),
76
+ ".claude/settings.json",
77
+ 20,
78
+ );
79
+ if (!settingsPath) {
80
+ fail(
81
+ `no .claude/settings.json found walking upward from ${dirname(absoluteTarget)}`,
82
+ );
83
+ }
84
+
85
+ const projectRoot = dirname(dirname(settingsPath));
86
+ const relativeTarget = relative(projectRoot, absoluteTarget);
87
+
88
+ let settings;
89
+ try {
90
+ settings = JSON.parse(readFileSync(settingsPath, "utf8"));
91
+ } catch (err) {
92
+ fail(`failed to parse ${settingsPath}: ${err.message}`);
93
+ }
94
+
95
+ const allowRules = settings?.permissions?.allow;
96
+ if (!Array.isArray(allowRules)) {
97
+ fail(`${settingsPath} has no permissions.allow[] array`);
98
+ }
99
+
100
+ const editPatterns = allowRules
101
+ .filter((rule) => typeof rule === "string")
102
+ .map((rule) => rule.match(/^Edit\((.+)\)$/)?.[1])
103
+ .filter(Boolean);
104
+
105
+ if (editPatterns.length === 0) {
106
+ fail(`${settingsPath} has no Edit() rules in permissions.allow[]`);
107
+ }
108
+
109
+ const matchedPattern = editPatterns.find((pattern) =>
110
+ minimatch(relativeTarget, pattern, { dot: true }),
111
+ );
112
+ if (!matchedPattern) {
113
+ fail(
114
+ `no Edit() rule in ${relative(projectRoot, settingsPath)} matches '${relativeTarget}' ` +
115
+ `(tried: ${editPatterns.map((p) => `Edit(${p})`).join(", ")})`,
116
+ );
117
+ }
118
+
119
+ // Safeguard 2: branch must not be main and HEAD must not be detached.
120
+ let branch;
121
+ try {
122
+ branch = execFileSync("git", ["rev-parse", "--abbrev-ref", "HEAD"], {
123
+ stdio: ["ignore", "pipe", "pipe"],
124
+ encoding: "utf8",
125
+ }).trim();
126
+ } catch {
127
+ fail("failed to read current git branch (not inside a git repository?)");
128
+ }
129
+
130
+ if (branch === "HEAD") {
131
+ fail("HEAD is detached — refusing (check out a non-main branch first)");
132
+ }
133
+ if (branch === "main") {
134
+ fail("refusing to write while on branch 'main' — switch to a feature branch");
135
+ }
+
+ const parent = dirname(absoluteTarget);
+ if (!existsSync(parent)) {
+   fail(`parent directory '${relative(projectRoot, parent)}' does not exist`);
+ }
+
+ if (process.stdin.isTTY) {
+   fail(
+     "stdin is a TTY — pipe content in (e.g. `echo … | fit-selfedit <path>`)",
+   );
+ }
+
+ const chunks = [];
+ for await (const chunk of process.stdin) chunks.push(chunk);
+ const content = Buffer.concat(chunks);
+
+ try {
+   writeFileSync(absoluteTarget, content);
+ } catch (err) {
+   process.stderr.write(`fit-selfedit: write failed: ${err.message}\n`);
+   process.exit(1);
+ }
+
+ process.stderr.write(
+   `fit-selfedit: wrote ${content.length} byte${content.length === 1 ? "" : "s"} to ${relativeTarget} ` +
+     `(matched Edit(${matchedPattern}), branch ${branch})\n`,
+ );
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "@forwardimpact/libeval",
-   "version": "0.1.44",
+   "version": "0.1.46",
    "description": "Agent evaluation framework — prove whether agent changes improved outcomes with reproducible evidence.",
    "keywords": [
      "eval",
@@ -33,12 +33,14 @@
      ".": "./src/index.js",
      "./bin/fit-eval.js": "./bin/fit-eval.js",
      "./bin/fit-trace.js": "./bin/fit-trace.js",
-     "./bin/fit-benchmark.js": "./bin/fit-benchmark.js"
+     "./bin/fit-benchmark.js": "./bin/fit-benchmark.js",
+     "./bin/fit-selfedit.js": "./bin/fit-selfedit.js"
    },
    "bin": {
      "fit-eval": "./bin/fit-eval.js",
      "fit-trace": "./bin/fit-trace.js",
-     "fit-benchmark": "./bin/fit-benchmark.js"
+     "fit-benchmark": "./bin/fit-benchmark.js",
+     "fit-selfedit": "./bin/fit-selfedit.js"
    },
    "files": [
      "src/**/*.js",
@@ -53,7 +55,9 @@
      "@forwardimpact/libcli": "^0.1.0",
      "@forwardimpact/libconfig": "^0.1.0",
      "@forwardimpact/libtelemetry": "^0.1.22",
+     "@forwardimpact/libutil": "^0.1.0",
      "jmespath": "^0.16.0",
+     "minimatch": "^10.0.0",
      "zod": "^4.4.3"
    },
    "devDependencies": {