@plaited/acp-harness 0.2.5 → 0.3.1
- package/LICENSE +1 -1
- package/README.md +120 -16
- package/bin/cli.ts +105 -636
- package/bin/tests/cli.spec.ts +218 -51
- package/package.json +20 -4
- package/src/acp-client.ts +5 -4
- package/src/acp-transport.ts +14 -7
- package/src/adapter-check.ts +542 -0
- package/src/adapter-scaffold.ts +934 -0
- package/src/balance.ts +232 -0
- package/src/calibrate.ts +300 -0
- package/src/capture.ts +457 -0
- package/src/constants.ts +94 -0
- package/src/grader-loader.ts +174 -0
- package/src/harness.ts +35 -0
- package/src/schemas-cli.ts +239 -0
- package/src/schemas.ts +567 -0
- package/src/summarize.ts +245 -0
- package/src/tests/adapter-check.spec.ts +70 -0
- package/src/tests/adapter-scaffold.spec.ts +112 -0
- package/src/tests/fixtures/grader-bad-module.ts +5 -0
- package/src/tests/fixtures/grader-exec-fail.py +9 -0
- package/src/tests/fixtures/grader-exec-invalid.py +6 -0
- package/src/tests/fixtures/grader-exec.py +29 -0
- package/src/tests/fixtures/grader-module.ts +14 -0
- package/src/tests/grader-loader.spec.ts +153 -0
- package/src/trials.ts +395 -0
- package/src/validate-refs.ts +188 -0
- package/.claude/rules/accuracy.md +0 -43
- package/.claude/rules/bun-apis.md +0 -80
- package/.claude/rules/code-review.md +0 -254
- package/.claude/rules/git-workflow.md +0 -37
- package/.claude/rules/github.md +0 -154
- package/.claude/rules/testing.md +0 -172
- package/.claude/skills/acp-harness/SKILL.md +0 -310
- package/.claude/skills/acp-harness/assets/Dockerfile.acp +0 -25
- package/.claude/skills/acp-harness/assets/docker-compose.acp.yml +0 -19
- package/.claude/skills/acp-harness/references/downstream.md +0 -288
- package/.claude/skills/acp-harness/references/output-formats.md +0 -221
- package/.claude-plugin/marketplace.json +0 -15
- package/.claude-plugin/plugin.json +0 -16
- package/.github/CODEOWNERS +0 -6
- package/.github/workflows/ci.yml +0 -63
- package/.github/workflows/publish.yml +0 -146
- package/.mcp.json +0 -20
- package/CLAUDE.md +0 -92
- package/Dockerfile.test +0 -23
- package/biome.json +0 -96
- package/bun.lock +0 -513
- package/docker-compose.test.yml +0 -21
- package/scripts/bun-test-wrapper.sh +0 -46
- package/src/acp.constants.ts +0 -56
- package/src/acp.schemas.ts +0 -161
- package/src/acp.types.ts +0 -28
- package/src/tests/fixtures/.claude/settings.local.json +0 -8
- package/src/tests/fixtures/.claude/skills/greeting/SKILL.md +0 -17
- package/tsconfig.json +0 -32
@@ -1,310 +0,0 @@
----
-name: acp-harness
-description: CLI tool for capturing agent trajectories. Execute prompts against ACP-compatible agents, capture full trajectories (tools, thoughts, plans), and output structured JSONL for downstream scoring.
-compatibility: Bun >= 1.2.9
----
-
-# ACP Harness
-
-## Purpose
-
-CLI tool for capturing trajectories from ACP-compatible agents, optimized for TypeScript/JavaScript projects using Bun.
-
-**The harness captures. You score.**
-
-| Harness Provides | You Provide |
-|------------------|-------------|
-| Prompt execution against ACP agents | Scoring logic (Braintrust, custom scripts) |
-| Full trajectory capture (thoughts, tools, plans) | Pass/fail determination |
-| Structured JSONL output | LLM-as-judge prompts |
-| Reproducible execution environment | CI integration, golden file comparison |
-
-**Use this when:**
-- Capturing trajectories for downstream evaluation
-- Generating training data (SFT/DPO) with full context
-- Building regression test fixtures for agent behavior
-- Comparing agent responses across configurations
-
-## Installation
-
-```bash
-# Run without installing (recommended for CI)
-bunx @plaited/acp-harness prompts.jsonl -o results.jsonl
-
-# Or install globally for repeated use
-bun add -g @plaited/acp-harness
-acp-harness prompts.jsonl -o results.jsonl
-
-# Or add as project dependency
-bun add @plaited/acp-harness
-```
-
-**Note:** Examples below use `acp-harness` (the command available after global install). Replace with `bunx @plaited/acp-harness` if not installed globally.
-
-## Capture Workflow
-
-```mermaid
-flowchart LR
-    Prompts["prompts.jsonl"] --> Harness["acp-harness"]
-    Agent["ACP Agent"] --> Harness
-    Harness -->|"JSONL"| Output["trajectories"]
-    Output --> Scoring["Your scoring logic"]
-    Scoring --> Decision["Informed choices"]
-```
-
-The harness is a **capture layer** - it executes prompts and records trajectories. Scoring happens in your codebase.
-
-| Use Case | Harness Captures | You Build |
-|----------|------------------|-----------|
-| **Agent comparison** | Same prompts → multiple agents → trajectories | Scoring pipeline (Braintrust, custom) |
-| **Tool comparison** | Trajectory with tool/skill attribution | Diff analysis, preference data |
-| **Training data** | Structured I/O with tool calls, plans, thoughts | SFT/DPO formatting |
-| **Regression testing** | Deterministic prompt → trajectory capture | Golden file comparison, CI assertions |
-
-### Example: Comparing Built-in vs Skill
-
-```bash
-# Run same prompt with built-in tool
-acp-harness prompts.jsonl \
-  --cmd "bunx claude-code-acp" \
-  -o results-builtin.jsonl
-
-# Run same prompt with custom skill installed
-acp-harness prompts.jsonl \
-  --cmd "bunx claude-code-acp" \
-  --cwd /project/with/typescript-lsp-skill \
-  -o results-skill.jsonl
-
-# Compare trajectories - which used better tools? faster? more accurate?
-diff <(jq '.toolCalls' results-builtin.jsonl) <(jq '.toolCalls' results-skill.jsonl)
-```
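
The removed example compares two runs with `diff` and `jq`, which lines records up by position. A minimal TypeScript sketch of the same comparison keyed by prompt `id` might look like the following. It assumes each record has the `id` and `toolCalls` fields shown in the summary format; `SummaryRecord`, `parseJsonl`, and `compareToolCalls` are hypothetical names, not part of the package, and the tool names in the sample data are made up.

```typescript
// Hypothetical record shape: only the two fields this sketch needs.
type SummaryRecord = { id: string; toolCalls: string[] }

type ToolDiff = { id: string; onlyInA: string[]; onlyInB: string[] }

// Parse JSONL text into records, skipping blank lines.
const parseJsonl = (text: string): SummaryRecord[] =>
  text
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line) as SummaryRecord)

// For every id present in both runs, list the tools used by only one of them.
const compareToolCalls = (a: SummaryRecord[], b: SummaryRecord[]): ToolDiff[] => {
  const byId = new Map<string, SummaryRecord>()
  for (const record of b) byId.set(record.id, record)

  const diffs: ToolDiff[] = []
  for (const left of a) {
    const right = byId.get(left.id)
    if (!right) continue // id missing from the second run
    const onlyInA = left.toolCalls.filter((t) => right.toolCalls.indexOf(t) === -1)
    const onlyInB = right.toolCalls.filter((t) => left.toolCalls.indexOf(t) === -1)
    if (onlyInA.length > 0 || onlyInB.length > 0) {
      diffs.push({ id: left.id, onlyInA, onlyInB })
    }
  }
  return diffs
}

// Inline sample data standing in for results-builtin.jsonl and results-skill.jsonl.
const builtin = parseJsonl('{"id":"test-001","toolCalls":["Grep","Read"]}\n')
const skill = parseJsonl('{"id":"test-001","toolCalls":["Read","LSP"]}\n')
console.log(compareToolCalls(builtin, skill))
```

Keying by `id` keeps the comparison meaningful even if the two result files list prompts in a different order.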
-
-## Execution Environment
-
-**Recommendation:** Run the harness in Docker containers for consistent, isolated execution.
-
-```bash
-# Build and run with Docker Compose
-docker compose -f docker-compose.acp.yml run --rm acp-harness
-
-# Or build directly
-docker build -f Dockerfile.acp -t acp-harness .
-docker run --rm -e ANTHROPIC_API_KEY acp-harness
-```
-
-Docker provides:
-- Consistent environment across local and CI
-- Filesystem isolation without app-level sandboxing
-- Reproducible results for training data generation
-
-See [assets/](assets/) for example container configurations:
-- `Dockerfile.acp` - Base container with Bun and git
-- `docker-compose.acp.yml` - Compose file with volume mounts for results
-
-## Non-Goals
-
-This harness is optimized for TypeScript/JavaScript projects using Bun. It is **not** designed for:
-
-- **Python projects** - Use [SWE-bench](https://github.com/SWE-bench/SWE-bench), [Braintrust Python SDK](https://www.braintrust.dev/)
-- **Academic model benchmarking** - Use [EleutherAI lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)
-- **IDE integrations** - Use Copilot Evaluation Harness
-- **SaaS observability** - Use Braintrust, Langfuse platforms directly
-
-## Quick Reference
-
-| Resource | Description |
-|----------|-------------|
-| `bunx @plaited/acp-harness` | Execute prompts against agent, capture trajectories |
-| [output-formats.md](references/output-formats.md) | JSONL schemas, format options |
-| [downstream.md](references/downstream.md) | Integration patterns (Braintrust, jq, custom scorers) |
-
-## Output Pipeline
-
-```mermaid
-flowchart LR
-    Prompts["prompts.jsonl"] --> Harness["acp-harness"]
-    Agent["ACP Agent"] --> Harness
-    Harness --> Summary["summary.jsonl"]
-    Harness --> Full["results.md + results.full.jsonl"]
-    Summary --> Your["Your scoring code"]
-    Full --> Your
-```
-
-1. **Prepare** - Create `prompts.jsonl` with test cases
-2. **Execute** - Run harness against target agent
-3. **Capture** - Trajectories streamed to output files
-4. **Score** - Pipe output to your scoring logic (Braintrust, jq, LLM-as-judge)
-
-## Harness Script
-
-### Basic Usage
-
-```bash
-acp-harness <prompts.jsonl> --cmd <cmd> [options]
-```
-
-### Arguments
-
-| Flag | Description | Default |
-|------|-------------|---------|
-| `prompts.jsonl` | Input file with prompts to execute | Required |
-| `--cmd, --command` | ACP agent command (e.g., `bunx claude-code-acp`, `bun ./adapter.ts`) | `"claude-code-acp"` |
-| `-o, --output` | Output file/path | stdout |
-| `-c, --cwd` | Working directory for agent | current |
-| `-t, --timeout` | Request timeout in ms | `60000` |
-| `-f, --format` | Output format: `summary`, `judge` | `summary` |
-| `--progress` | Show progress to stderr | false |
-| `--append` | Append to output file | false |
-| `--mcp-server` | MCP server config JSON (repeatable) | none |
-
-### Examples
-
-```bash
-# Using the default claude-code-acp adapter
-acp-harness prompts.jsonl -o results.jsonl
-
-# Using bunx to run an adapter
-acp-harness prompts.jsonl --cmd "bunx claude-code-acp" -o results.jsonl
-
-# Using a local adapter script (great for custom adapters in same repo)
-acp-harness prompts.jsonl --cmd "bun ./my-adapter.ts" -o results.jsonl
-
-# Judge format - creates two files for downstream scoring
-acp-harness prompts.jsonl --format judge -o results
-# Creates: results.md (summary with step IDs) + results.full.jsonl (complete trajectory)
-
-# With MCP server (stdio transport)
-acp-harness prompts.jsonl \
-  --mcp-server '{"type":"stdio","name":"fs","command":["mcp-filesystem","/data"]}'
-
-# With MCP server (HTTP transport)
-acp-harness prompts.jsonl \
-  --mcp-server '{"type":"http","name":"api","url":"http://localhost:3000"}'
-
-# Stream with progress
-acp-harness prompts.jsonl --progress -o results.jsonl
-```
-
-## Input Format
-
-Each line in `prompts.jsonl`:
-
-```jsonl
-{"id":"test-001","input":"Create a primary button","expected":"should contain <button>","metadata":{"category":"ui"}}
-{"id":"test-002","input":"Write a function for form validation","metadata":{"category":"logic"}}
-```
-
-| Field | Required | Description |
-|-------|----------|-------------|
-| `id` | Yes | Unique identifier |
-| `input` | Yes | Prompt text for the agent |
-| `expected` | No | Expected output (for downstream scoring) |
-| `metadata` | No | Tags, category, difficulty for filtering |
-| `timeout` | No | Override default timeout for this prompt |
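
The field table above can be mirrored in a small TypeScript helper that checks test cases before they are written to `prompts.jsonl`. This is a minimal sketch under stated assumptions: the type and function names (`PromptCase`, `validatePrompts`, `toJsonl`) are hypothetical, and treating `timeout` as a number of milliseconds is inferred from the `--timeout` flag rather than stated for this field.

```typescript
// Field names follow the Input Format table; the type name itself is hypothetical.
type PromptCase = {
  id: string
  input: string
  expected?: string
  metadata?: Record<string, unknown>
  timeout?: number // assumed to be milliseconds, like the --timeout flag
}

// Return a list of problems; an empty list means the cases look well-formed.
const validatePrompts = (cases: PromptCase[]): string[] => {
  const problems: string[] = []
  const seen = new Set<string>()
  cases.forEach((c, i) => {
    if (typeof c.id !== 'string' || c.id === '') {
      problems.push(`case ${i}: missing id`)
    } else if (seen.has(c.id)) {
      problems.push(`case ${i}: duplicate id "${c.id}"`)
    } else {
      seen.add(c.id)
    }
    if (typeof c.input !== 'string' || c.input === '') {
      problems.push(`case ${i}: missing input`)
    }
  })
  return problems
}

// Serialize to JSONL: one JSON object per line, newline-terminated.
const toJsonl = (cases: PromptCase[]): string =>
  cases.map((c) => JSON.stringify(c)).join('\n') + '\n'

const promptCases: PromptCase[] = [
  { id: 'test-001', input: 'Create a primary button', metadata: { category: 'ui' } },
  { id: 'test-002', input: 'Write a function for form validation', metadata: { category: 'logic' } },
]

console.log(validatePrompts(promptCases))
console.log(toJsonl(promptCases))
```

Checking for duplicate `id` values up front matters because downstream steps such as deduplication rely on `id` being unique per prompt.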
-
-## Output Formats
-
-### Summary Format (default)
-
-Minimal JSONL for quick metrics and analysis:
-
-```jsonl
-{"id":"test-001","input":"Create a button","output":"I created...","toolCalls":["Write"],"status":"passed","duration":1234}
-```
-
-### Judge Format (two-tier)
-
-Creates two files optimized for downstream LLM-as-judge scoring:
-
-**`<output>.md`** - Markdown summary with step IDs and code previews:
-
-```markdown
-## Capture Record: test-001
-
-**Input:** Create a primary button
-
-**Trajectory:**
-1. [THOUGHT] I'll create a styled button... [->test-001-step-1]
-2. [TOOL:Write] -> completed (234ms) [->test-001-step-2]
-   File: src/button.tsx (847 chars)
-   ```tsx
-   import { css } from 'some-css-lib'
-
-   type ButtonProps = {
-     label: string
-
-   // ... 30 lines omitted ...
-
-   export const Button = ({ label }: ButtonProps) => (
-     <button className={styles.btn}>{label}</button>
-   )
-   ```
-3. [MESSAGE] I created the button... [->test-001-step-3]
-
-**Output:** I created the button with primary styling.
-**Metadata:** category=ui, agent=claude-code-acp
-**Status:** passed
-**Duration:** 1234ms
-
----
-```
-
-**`<output>.full.jsonl`** - Complete trajectory with step IDs for correlation:
-
-```jsonl
-{"id":"test-001","input":"...","output":"...","trajectory":[{"type":"thought","content":"...","timestamp":100,"stepId":"test-001-step-1"},{"type":"tool_call","name":"Write","status":"completed","input":{...},"output":{...},"duration":234,"stepId":"test-001-step-2"}],...}
-```
-
-**Usage patterns by judge context window:**
-
-| Judge Model | Strategy |
-|-------------|----------|
-| Gemini (1M+ tokens) | Feed `results.full.jsonl` directly |
-| Claude/GPT-4 (128-200k) | Use `results.full.jsonl` for most runs |
-| Smaller models | Use `results.md`, retrieve specific steps by ID as needed |
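
The strategy table above amounts to a size check: send the full trajectory when it fits in the judge's context window, otherwise fall back to the Markdown summary. A rough sketch of that decision follows. The four-characters-per-token estimate is only a rule of thumb, the `reserveTokens` default is an arbitrary illustrative value, and `estimateTokens` and `chooseJudgeInput` are hypothetical names.

```typescript
type JudgeInput = 'results.full.jsonl' | 'results.md'

// Rough token estimate: about 4 characters per token is a rule of thumb, not an exact count.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4)

// Pick the full trajectory when it fits, leaving room for the prompt and the reply.
const chooseJudgeInput = (
  fullJsonl: string,
  contextWindowTokens: number,
  reserveTokens = 8000, // illustrative headroom for instructions and output
): JudgeInput =>
  estimateTokens(fullJsonl) + reserveTokens <= contextWindowTokens
    ? 'results.full.jsonl'
    : 'results.md'

// A small trajectory fits in a 200k window; a very large one does not.
console.log(chooseJudgeInput('x'.repeat(4000), 200000))
console.log(chooseJudgeInput('x'.repeat(4000000), 200000))
```

For an exact decision, a tokenizer for the specific judge model would replace the character-based estimate.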
-
-## Downstream Integration
-
-The harness outputs standard JSONL that pipes to any tool:
-
-```bash
-# Filter with jq
-cat results.jsonl | jq 'select(.metadata.category == "ui")'
-
-# Count tool usage
-cat results.jsonl | jq -s 'map(.toolCalls | length) | add'
-
-# Feed full trajectory to Gemini (large context)
-cat results.full.jsonl | your-gemini-judge.ts
-```
-
-See [downstream.md](references/downstream.md) for integration patterns with Braintrust, Gemini, and custom scorers.
-
-## Capture Targets
-
-| Target | How to Capture |
-|--------|----------------|
-| **Agent capability** | Direct prompts, capture trajectory for analysis |
-| **Skills** | Set `--cwd` to project with skill, capture skill-specific behavior |
-| **MCP Servers** | Use `--mcp-server` flag, capture tool usage in trajectory |
-
-### Capturing Skill Behavior
-
-```bash
-bunx @plaited/acp-harness skill-prompts.jsonl \
-  --cwd /project/with/skill \
-  -o results.jsonl
-```
-
-### Capturing MCP Server Usage
-
-```bash
-bunx @plaited/acp-harness mcp-prompts.jsonl \
-  --mcp-server '{"type":"stdio","name":"fs","command":["mcp-filesystem"]}' \
-  -o results.jsonl
-```
-
-## Related
-
-- **[@agentclientprotocol/sdk](https://www.npmjs.com/package/@agentclientprotocol/sdk)** - ACP SDK for programmatic access
-- **[@zed-industries/claude-code-acp](https://www.npmjs.com/package/@zed-industries/claude-code-acp)** - Claude Code ACP adapter
@@ -1,25 +0,0 @@
-# ACP Harness Docker Configuration
-#
-# Example Dockerfile for running ACP evaluations in an isolated container.
-# Copy this to your project and customize as needed.
-#
-# Usage:
-#   docker build -f Dockerfile.acp -t acp-harness .
-#   docker run --rm -e ANTHROPIC_API_KEY acp-harness bunx @plaited/acp-harness prompts.jsonl
-
-FROM oven/bun:1.2.9
-
-# Install git (required for some agent operations)
-RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*
-
-WORKDIR /app
-
-# Copy package files first for better layer caching
-COPY package.json bun.lock* ./
-RUN bun install --frozen-lockfile
-
-# Copy source files
-COPY . .
-
-# Default command - override with your harness invocation
-CMD ["bun", "test"]
@@ -1,19 +0,0 @@
-# ACP Harness Docker Compose Configuration
-#
-# Example docker-compose for running ACP evaluations.
-# Copy this to your project and customize as needed.
-#
-# Usage:
-#   ANTHROPIC_API_KEY=sk-... docker compose -f docker-compose.acp.yml run --rm acp-harness
-
-services:
-  acp-harness:
-    build:
-      context: .
-      dockerfile: Dockerfile.acp
-    environment:
-      - ANTHROPIC_API_KEY
-    volumes:
-      # Mount output directory to persist results
-      - ./results:/app/results
-    command: ["bunx", "@plaited/acp-harness", "prompts.jsonl", "-o", "results/output.jsonl"]
@@ -1,288 +0,0 @@
-# Downstream Integration
-
-Patterns for piping harness output to analysis tools.
-
-## Loading Results
-
-Both output formats use JSONL (newline-delimited JSON):
-
-```typescript
-// TypeScript pattern (validated in tests)
-const parseResults = (jsonl: string) =>
-  jsonl.trim().split('\n').map((line) => JSON.parse(line))
-
-// Load from file
-const results = parseResults(await Bun.file('results.jsonl').text())
-```
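
The `parseResults` pattern above throws on an empty file or on a blank line in the middle of the file, because `JSON.parse('')` raises a `SyntaxError`. A more tolerant variant is sketched below for cases where result files are concatenated or partially written. `parseJsonlSafe` is a hypothetical helper, not part of the package.

```typescript
type ParseOutcome = { records: unknown[]; errors: string[] }

// Parse JSONL line by line: skip blank lines and collect errors instead of throwing.
const parseJsonlSafe = (jsonl: string): ParseOutcome => {
  const records: unknown[] = []
  const errors: string[] = []
  jsonl.split('\n').forEach((line, index) => {
    if (line.trim() === '') return // tolerate blank lines and a trailing newline
    try {
      records.push(JSON.parse(line))
    } catch (err) {
      // Report the 1-based line number so the bad record is easy to find.
      errors.push(`line ${index + 1}: ${(err as Error).message}`)
    }
  })
  return { records, errors }
}

const outcome = parseJsonlSafe('{"id":"a"}\n\nnot json\n{"id":"b"}\n')
console.log(outcome.records.length, outcome.errors)
```

Collecting errors rather than throwing lets a scoring script decide whether one malformed line should fail the whole run.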
-
-## jq Analysis
-
-Summary JSONL is designed for quick analysis with `jq`:
-
-```bash
-# Calculate average duration
-cat results.jsonl | jq -s 'map(.duration) | add / length'
-
-# Count tool usage
-cat results.jsonl | jq -s 'map(.toolCalls) | flatten | group_by(.) | map({tool: .[0], count: length})'
-
-# Filter by status
-cat results.jsonl | jq 'select(.status == "failed")'
-
-# Pass rate (bind the total before filtering, otherwise both counts are the passed count)
-cat results.jsonl | jq -s 'length as $t | map(select(.status == "passed")) | length as $p | "\($p)/\($t) passed"'
-
-# Group by category
-cat results.jsonl | jq -s 'group_by(.metadata.category) | map({category: .[0].metadata.category, count: length})'
-
-# Find slowest runs
-cat results.jsonl | jq -s 'sort_by(-.duration) | .[0:5] | map({id, duration})'
-```
-
-## TypeScript Analysis Patterns
-
-These patterns are validated by tests in `bin/tests/cli.spec.ts`:
-
-### Filter by Status
-
-```typescript
-const failed = results.filter((r) => r.status === 'failed')
-const passed = results.filter((r) => r.status === 'passed')
-const passRate = passed.length / results.length
-```
-
-### Filter by Tool Usage
-
-```typescript
-// Find runs that used Write tool
-const withWrite = results.filter((r) => r.toolCalls.includes('Write'))
-
-// Find runs that used multiple tools
-const multiTool = results.filter((r) => r.toolCalls.length > 1)
-```
-
-### Filter by Duration
-
-```typescript
-// Slow runs (> 2 seconds)
-const slow = results.filter((r) => r.duration > 2000)
-
-// Find top 5 slowest
-const slowest = [...results].sort((a, b) => b.duration - a.duration).slice(0, 5)
-```
-
-### Filter by Metadata
-
-```typescript
-// Filter by category
-const uiResults = results.filter((r) => r.metadata.category === 'ui')
-
-// Group and count by category
-const grouped = results.reduce<Record<string, number>>((acc, r) => {
-  const cat = r.metadata.category as string
-  acc[cat] = (acc[cat] ?? 0) + 1
-  return acc
-}, {})
-```
-
-### Count Tool Usage
-
-```typescript
-const allTools = results.flatMap((r) => r.toolCalls)
-const toolCounts = allTools.reduce<Record<string, number>>((acc, tool) => {
-  acc[tool] = (acc[tool] ?? 0) + 1
-  return acc
-}, {})
-```
-
-### Deduplicate by ID
-
-```typescript
-// Keep latest occurrence when merging multiple runs
-const byId = new Map<string, unknown>()
-for (const result of results) {
-  byId.set(result.id, result)
-}
-const deduped = Array.from(byId.values())
-```
-
-## Step-Level Retrieval
-
-For judge format, correlate markdown step IDs with full JSONL:
-
-```typescript
-// Load both files
-const markdown = await Bun.file('results.md').text()
-const fullResults = parseResults(await Bun.file('results.full.jsonl').text())
-
-// Build step index
-const stepIndex = new Map<string, unknown>()
-for (const result of fullResults) {
-  for (const step of result.trajectory) {
-    stepIndex.set(step.stepId, step)
-  }
-}
-
-// Retrieve full step by ID (from markdown [→stepId])
-const stepId = 'test-001-step-2'
-const fullStep = stepIndex.get(stepId) as { name: string; input: unknown }
-console.log('Tool name:', fullStep.name)
-console.log('Full input:', fullStep.input)
-```
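
The retrieval example above hard-codes one step ID. To look up every step a Markdown summary refers to, the IDs first have to be pulled out of the text. The sketch below assumes the markers look like `[->test-001-step-1]` (as in the judge-format sample) or `[→stepId]` (as in the comment above) and accepts both; `extractStepIds` is a hypothetical helper.

```typescript
// Collect every step ID referenced as [->id] or [→id], in document order.
const extractStepIds = (markdown: string): string[] => {
  const ids: string[] = []
  const pattern = /\[(?:->|→)([^\]\s]+)\]/g
  let match: RegExpExecArray | null
  while ((match = pattern.exec(markdown)) !== null) {
    if (match[1]) ids.push(match[1])
  }
  return ids
}

// Sample lines taken from the judge-format Markdown example.
const sampleMarkdown = [
  "1. [THOUGHT] I'll create a styled button... [->test-001-step-1]",
  '2. [TOOL:Write] -> completed (234ms) [->test-001-step-2]',
].join('\n')

console.log(extractStepIds(sampleMarkdown))
```

Each extracted ID can then be passed to a step index such as the `stepIndex` map built above.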
-
-## Extract Tool Calls from Trajectory
-
-```typescript
-const toolCalls = result.trajectory.filter((s) => s.type === 'tool_call')
-const toolNames = toolCalls.map((t) => t.name)
-```
-
-## Timing Information
-
-```typescript
-const result = results[0]
-const duration = result.timing.end - result.timing.start
-const timeToFirstResponse = result.timing.firstResponse // ms after start
-```
-
-## LLM-as-Judge
-
-### Large Context Models (Gemini 1M+)
-
-Feed full trajectory directly:
-
-```typescript
-import { GoogleGenerativeAI } from '@google/generative-ai'
-
-const genAI = new GoogleGenerativeAI(process.env.GOOGLE_API_KEY!)
-const model = genAI.getGenerativeModel({ model: 'gemini-2.5-pro' })
-
-const results = parseResults(await Bun.file('results.full.jsonl').text())
-
-const prompt = `
-Evaluate these agent trajectories for code quality and reasoning.
-
-${JSON.stringify(results, null, 2)}
-
-For each evaluation, score 1-3:
-- 1: Major issues (wrong tools, broken logic, incorrect output)
-- 2: Minor issues (inefficient but correct)
-- 3: Excellent (efficient trajectory, correct output)
-
-Respond as JSON array: [{"id": "...", "score": N, "reasoning": "..."}]
-`
-
-const response = await model.generateContent(prompt)
-console.log(response.response.text())
-```
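
The judge prompt above asks for a JSON array of `{id, score, reasoning}` objects, but the example only prints the raw reply. Before the scores are used, the reply needs to be parsed and checked, since a model may wrap the array in extra text. A minimal sketch follows, assuming the array is the outermost `[...]` in the reply; `JudgeScore` and `parseJudgeResponse` are hypothetical names.

```typescript
// Shape requested by the judge prompt: score is 1, 2, or 3.
type JudgeScore = { id: string; score: 1 | 2 | 3; reasoning: string }

// Extract the outermost JSON array from the reply and validate each entry.
const parseJudgeResponse = (text: string): JudgeScore[] => {
  const start = text.indexOf('[')
  const end = text.lastIndexOf(']')
  if (start === -1 || end <= start) throw new Error('no JSON array found in judge response')

  const parsed: unknown = JSON.parse(text.slice(start, end + 1))
  if (!Array.isArray(parsed)) throw new Error('judge response is not an array')

  return parsed.map((item: unknown, i: number): JudgeScore => {
    if (typeof item !== 'object' || item === null) throw new Error(`entry ${i}: not an object`)
    const { id, score, reasoning } = item as Record<string, unknown>
    if (typeof id !== 'string') throw new Error(`entry ${i}: id must be a string`)
    if (score !== 1 && score !== 2 && score !== 3) throw new Error(`entry ${i}: score must be 1, 2, or 3`)
    return { id, score: score as 1 | 2 | 3, reasoning: typeof reasoning === 'string' ? reasoning : '' }
  })
}

const reply = 'Scores: [{"id":"test-001","score":3,"reasoning":"efficient and correct"}]'
console.log(parseJudgeResponse(reply))
```

Rejecting out-of-range scores early keeps a malformed judge reply from silently skewing aggregate results.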
-
-### Medium Context Models (Claude 200k)
-
-Use full trajectory for most runs:
-
-```typescript
-import Anthropic from '@anthropic-ai/sdk'
-
-const client = new Anthropic()
-const markdown = await Bun.file('results.md').text()
-
-const response = await client.messages.create({
-  model: 'claude-sonnet-4-20250514',
-  max_tokens: 4096,
-  messages: [{
-    role: 'user',
-    content: `Evaluate these agent trajectories:\n\n${markdown}\n\nScore each 1-3 and explain.`
-  }]
-})
-
-console.log(response.content[0].text)
-```
-
-## Braintrust Integration
-
-Upload results programmatically:
-
-```typescript
-import { initLogger } from 'braintrust'
-
-const logger = initLogger({
-  projectName: 'agent-eval',
-  apiKey: process.env.BRAINTRUST_API_KEY,
-})
-
-const results = parseResults(await Bun.file('results.jsonl').text())
-
-for (const result of results) {
-  logger.log({
-    input: result.input,
-    output: result.output,
-    expected: result.expected,
-    scores: {
-      passed: result.status === 'passed' ? 1 : 0,
-      duration_ms: result.duration,
-    },
-    metadata: {
-      ...result.metadata,
-      toolCalls: result.toolCalls,
-    },
-  })
-}
-
-await logger.flush()
-```
-
-## CI Integration
-
-### GitHub Actions
-
-```yaml
-name: Agent Eval
-on:
-  schedule:
-    - cron: '0 0 * * 0'  # Weekly
-
-jobs:
-  eval:
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v4
-      - uses: oven-sh/setup-bun@v2
-
-      - name: Install ACP adapter
-        run: npm install -g @zed-industries/claude-code-acp
-
-      - name: Install dependencies
-        run: bun add @plaited/acp-harness
-
-      - name: Run harness
-        env:
-          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
-        run: |
-          bunx @plaited/acp-harness prompts.jsonl \
-            --format judge \
-            --progress \
-            -o eval-results
-
-      - name: Upload results
-        uses: actions/upload-artifact@v4
-        with:
-          name: eval-results
-          path: |
-            eval-results.md
-            eval-results.full.jsonl
-```
-
-## Output Aggregation
-
-Combine multiple runs:
-
-```bash
-# Append mode during runs
-bunx @plaited/acp-harness prompts-1.jsonl --append -o combined.jsonl
-bunx @plaited/acp-harness prompts-2.jsonl --append -o combined.jsonl
-
-# Merge separate files
-cat run1.jsonl run2.jsonl run3.jsonl > combined.jsonl
-
-# Dedupe by ID (keep latest) - use TypeScript pattern above
-```
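
The last comment in the removed aggregation example points back to the "Deduplicate by ID" pattern without showing it applied to merged files. A minimal sketch of that step, working directly on JSONL text, might look like this. `mergeJsonlKeepLatest` is a hypothetical helper, and it assumes every record has a string `id`.

```typescript
// Merge several JSONL texts, keeping only the latest record for each id.
const mergeJsonlKeepLatest = (...files: string[]): string => {
  const byId = new Map<string, string>()
  for (const text of files) {
    for (const line of text.split('\n')) {
      if (line.trim() === '') continue // skip blank lines and trailing newlines
      const { id } = JSON.parse(line) as { id: string }
      // Map.set on an existing key keeps its original position but replaces the value,
      // so output order is first-seen order with last-seen content.
      byId.set(id, line)
    }
  }
  const lines = Array.from(byId.values())
  return lines.length > 0 ? lines.join('\n') + '\n' : ''
}

const run1 = '{"id":"a","v":1}\n{"id":"b","v":1}\n'
const run2 = '{"id":"a","v":2}\n'
console.log(mergeJsonlKeepLatest(run1, run2))
```

Passing the files in chronological order makes "latest" mean the most recent run.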