@plaited/acp-harness 0.2.6 → 0.3.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (57)
  1. package/LICENSE +1 -1
  2. package/README.md +120 -16
  3. package/bin/cli.ts +105 -636
  4. package/bin/tests/cli.spec.ts +218 -51
  5. package/package.json +20 -4
  6. package/src/acp-client.ts +5 -4
  7. package/src/acp-transport.ts +14 -7
  8. package/src/adapter-check.ts +542 -0
  9. package/src/adapter-scaffold.ts +934 -0
  10. package/src/balance.ts +232 -0
  11. package/src/calibrate.ts +300 -0
  12. package/src/capture.ts +457 -0
  13. package/src/constants.ts +94 -0
  14. package/src/grader-loader.ts +174 -0
  15. package/src/harness.ts +35 -0
  16. package/src/schemas-cli.ts +239 -0
  17. package/src/schemas.ts +567 -0
  18. package/src/summarize.ts +245 -0
  19. package/src/tests/adapter-check.spec.ts +70 -0
  20. package/src/tests/adapter-scaffold.spec.ts +112 -0
  21. package/src/tests/fixtures/grader-bad-module.ts +5 -0
  22. package/src/tests/fixtures/grader-exec-fail.py +9 -0
  23. package/src/tests/fixtures/grader-exec-invalid.py +6 -0
  24. package/src/tests/fixtures/grader-exec.py +29 -0
  25. package/src/tests/fixtures/grader-module.ts +14 -0
  26. package/src/tests/grader-loader.spec.ts +153 -0
  27. package/src/trials.ts +395 -0
  28. package/src/validate-refs.ts +188 -0
  29. package/.claude/rules/accuracy.md +0 -43
  30. package/.claude/rules/bun-apis.md +0 -80
  31. package/.claude/rules/code-review.md +0 -254
  32. package/.claude/rules/git-workflow.md +0 -37
  33. package/.claude/rules/github.md +0 -154
  34. package/.claude/rules/testing.md +0 -172
  35. package/.claude/skills/acp-harness/SKILL.md +0 -310
  36. package/.claude/skills/acp-harness/assets/Dockerfile.acp +0 -25
  37. package/.claude/skills/acp-harness/assets/docker-compose.acp.yml +0 -19
  38. package/.claude/skills/acp-harness/references/downstream.md +0 -288
  39. package/.claude/skills/acp-harness/references/output-formats.md +0 -221
  40. package/.claude-plugin/marketplace.json +0 -15
  41. package/.claude-plugin/plugin.json +0 -16
  42. package/.github/CODEOWNERS +0 -6
  43. package/.github/workflows/ci.yml +0 -63
  44. package/.github/workflows/publish.yml +0 -146
  45. package/.mcp.json +0 -20
  46. package/CLAUDE.md +0 -92
  47. package/Dockerfile.test +0 -23
  48. package/biome.json +0 -96
  49. package/bun.lock +0 -513
  50. package/docker-compose.test.yml +0 -21
  51. package/scripts/bun-test-wrapper.sh +0 -46
  52. package/src/acp.constants.ts +0 -56
  53. package/src/acp.schemas.ts +0 -161
  54. package/src/acp.types.ts +0 -28
  55. package/src/tests/fixtures/.claude/settings.local.json +0 -8
  56. package/src/tests/fixtures/.claude/skills/greeting/SKILL.md +0 -17
  57. package/tsconfig.json +0 -32
--- a/package/.claude/skills/acp-harness/SKILL.md
+++ /dev/null
@@ -1,310 +0,0 @@
- ---
- name: acp-harness
- description: CLI tool for capturing agent trajectories. Execute prompts against ACP-compatible agents, capture full trajectories (tools, thoughts, plans), and output structured JSONL for downstream scoring.
- compatibility: Bun >= 1.2.9
- ---
-
- # ACP Harness
-
- ## Purpose
-
- CLI tool for capturing trajectories from ACP-compatible agents, optimized for TypeScript/JavaScript projects using Bun.
-
- **The harness captures. You score.**
-
- | Harness Provides | You Provide |
- |------------------|-------------|
- | Prompt execution against ACP agents | Scoring logic (Braintrust, custom scripts) |
- | Full trajectory capture (thoughts, tools, plans) | Pass/fail determination |
- | Structured JSONL output | LLM-as-judge prompts |
- | Reproducible execution environment | CI integration, golden file comparison |
-
- **Use this when:**
- - Capturing trajectories for downstream evaluation
- - Generating training data (SFT/DPO) with full context
- - Building regression test fixtures for agent behavior
- - Comparing agent responses across configurations
-
- ## Installation
-
- ```bash
- # Run without installing (recommended for CI)
- bunx @plaited/acp-harness prompts.jsonl -o results.jsonl
-
- # Or install globally for repeated use
- bun add -g @plaited/acp-harness
- acp-harness prompts.jsonl -o results.jsonl
-
- # Or add as project dependency
- bun add @plaited/acp-harness
- ```
-
- **Note:** Examples below use `acp-harness` (the command available after global install). Replace with `bunx @plaited/acp-harness` if not installed globally.
-
- ## Capture Workflow
-
- ```mermaid
- flowchart LR
-   Prompts["prompts.jsonl"] --> Harness["acp-harness"]
-   Agent["ACP Agent"] --> Harness
-   Harness -->|"JSONL"| Output["trajectories"]
-   Output --> Scoring["Your scoring logic"]
-   Scoring --> Decision["Informed choices"]
- ```
-
- The harness is a **capture layer** - it executes prompts and records trajectories. Scoring happens in your codebase.
-
- | Use Case | Harness Captures | You Build |
- |----------|------------------|-----------|
- | **Agent comparison** | Same prompts → multiple agents → trajectories | Scoring pipeline (Braintrust, custom) |
- | **Tool comparison** | Trajectory with tool/skill attribution | Diff analysis, preference data |
- | **Training data** | Structured I/O with tool calls, plans, thoughts | SFT/DPO formatting |
- | **Regression testing** | Deterministic prompt → trajectory capture | Golden file comparison, CI assertions |
-
- ### Example: Comparing Built-in vs Skill
-
- ```bash
- # Run same prompt with built-in tool
- acp-harness prompts.jsonl \
-   --cmd "bunx claude-code-acp" \
-   -o results-builtin.jsonl
-
- # Run same prompt with custom skill installed
- acp-harness prompts.jsonl \
-   --cmd "bunx claude-code-acp" \
-   --cwd /project/with/typescript-lsp-skill \
-   -o results-skill.jsonl
-
- # Compare trajectories - which used better tools? faster? more accurate?
- diff <(jq '.toolCalls' results-builtin.jsonl) <(jq '.toolCalls' results-skill.jsonl)
- ```
-
- ## Execution Environment
-
- **Recommendation:** Run the harness in Docker containers for consistent, isolated execution.
-
- ```bash
- # Build and run with Docker Compose
- docker compose -f docker-compose.acp.yml run --rm acp-harness
-
- # Or build directly
- docker build -f Dockerfile.acp -t acp-harness .
- docker run --rm -e ANTHROPIC_API_KEY acp-harness
- ```
-
- Docker provides:
- - Consistent environment across local and CI
- - Filesystem isolation without app-level sandboxing
- - Reproducible results for training data generation
-
- See [assets/](assets/) for example container configurations:
- - `Dockerfile.acp` - Base container with Bun and git
- - `docker-compose.acp.yml` - Compose file with volume mounts for results
-
- ## Non-Goals
-
- This harness is optimized for TypeScript/JavaScript projects using Bun. It is **not** designed for:
-
- - **Python projects** - Use [SWE-bench](https://github.com/SWE-bench/SWE-bench) or the [Braintrust Python SDK](https://www.braintrust.dev/)
- - **Academic model benchmarking** - Use [EleutherAI lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)
- - **IDE integrations** - Use the Copilot Evaluation Harness
- - **SaaS observability** - Use Braintrust, Langfuse, or similar platforms directly
-
- ## Quick Reference
-
- | Resource | Description |
- |----------|-------------|
- | `bunx @plaited/acp-harness` | Execute prompts against agent, capture trajectories |
- | [output-formats.md](references/output-formats.md) | JSONL schemas, format options |
- | [downstream.md](references/downstream.md) | Integration patterns (Braintrust, jq, custom scorers) |
-
- ## Output Pipeline
-
- ```mermaid
- flowchart LR
-   Prompts["prompts.jsonl"] --> Harness["acp-harness"]
-   Agent["ACP Agent"] --> Harness
-   Harness --> Summary["summary.jsonl"]
-   Harness --> Full["results.md + results.full.jsonl"]
-   Summary --> Your["Your scoring code"]
-   Full --> Your
- ```
-
- 1. **Prepare** - Create `prompts.jsonl` with test cases
- 2. **Execute** - Run harness against target agent
- 3. **Capture** - Trajectories streamed to output files
- 4. **Score** - Pipe output to your scoring logic (Braintrust, jq, LLM-as-judge)
-
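Step 4 is deliberately left to you. As one minimal sketch of a scorer over the summary records, assuming `expected` holds a literal snippet to look for (the `SummaryRecord` type name and the substring rule are illustrative assumptions, not part of the harness contract):

```typescript
// Illustrative step-4 scorer. SummaryRecord mirrors the summary format
// documented in this skill; the pass rule is a placeholder assumption.
type SummaryRecord = {
  id: string
  output: string
  expected?: string
  status: string
  duration: number
}

// Trivial rule: pass when the expected snippet appears verbatim in the output
const score = (records: SummaryRecord[]) =>
  records.map((r) => ({
    id: r.id,
    pass: r.expected ? r.output.includes(r.expected) : r.status === 'passed',
  }))

// Inline sample standing in for parsed results.jsonl lines
const scored = score([
  { id: 't1', output: '<button>Go</button>', expected: '<button>', status: 'passed', duration: 10 },
  { id: 't2', output: 'plain text', expected: '<form>', status: 'passed', duration: 12 },
])
// scored: [{ id: 't1', pass: true }, { id: 't2', pass: false }]
```

In practice you would replace the inline sample with the JSONL-parsing pattern shown in [downstream.md](references/downstream.md) and a scoring rule of your own.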
- ## Harness Script
-
- ### Basic Usage
-
- ```bash
- acp-harness <prompts.jsonl> [--cmd <cmd>] [options]
- ```
-
- ### Arguments
-
- | Flag | Description | Default |
- |------|-------------|---------|
- | `prompts.jsonl` | Input file with prompts to execute | Required |
- | `--cmd, --command` | ACP agent command (e.g., `bunx claude-code-acp`, `bun ./adapter.ts`) | `"claude-code-acp"` |
- | `-o, --output` | Output file/path | stdout |
- | `-c, --cwd` | Working directory for agent | current |
- | `-t, --timeout` | Request timeout in ms | `60000` |
- | `-f, --format` | Output format: `summary`, `judge` | `summary` |
- | `--progress` | Show progress on stderr | false |
- | `--append` | Append to output file | false |
- | `--mcp-server` | MCP server config JSON (repeatable) | none |
-
- ### Examples
-
- ```bash
- # Using the default claude-code-acp adapter
- acp-harness prompts.jsonl -o results.jsonl
-
- # Using bunx to run an adapter
- acp-harness prompts.jsonl --cmd "bunx claude-code-acp" -o results.jsonl
-
- # Using a local adapter script (great for custom adapters in the same repo)
- acp-harness prompts.jsonl --cmd "bun ./my-adapter.ts" -o results.jsonl
-
- # Judge format - creates two files for downstream scoring
- acp-harness prompts.jsonl --format judge -o results
- # Creates: results.md (summary with step IDs) + results.full.jsonl (complete trajectory)
-
- # With MCP server (stdio transport)
- acp-harness prompts.jsonl \
-   --mcp-server '{"type":"stdio","name":"fs","command":["mcp-filesystem","/data"]}'
-
- # With MCP server (HTTP transport)
- acp-harness prompts.jsonl \
-   --mcp-server '{"type":"http","name":"api","url":"http://localhost:3000"}'
-
- # Stream with progress
- acp-harness prompts.jsonl --progress -o results.jsonl
- ```
-
- ## Input Format
-
- Each line in `prompts.jsonl` is a JSON object:
-
- ```jsonl
- {"id":"test-001","input":"Create a primary button","expected":"should contain <button>","metadata":{"category":"ui"}}
- {"id":"test-002","input":"Write a function for form validation","metadata":{"category":"logic"}}
- ```
-
- | Field | Required | Description |
- |-------|----------|-------------|
- | `id` | Yes | Unique identifier |
- | `input` | Yes | Prompt text for the agent |
- | `expected` | No | Expected output (for downstream scoring) |
- | `metadata` | No | Tags, category, difficulty for filtering |
- | `timeout` | No | Override the default timeout for this prompt |
-
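A file can be checked against the required fields before a run. This guard is a sketch based on the field table (the `PromptRecord` type name and helper are hypothetical, not exported by the package):

```typescript
// Hypothetical guard matching the field table: id and input are required;
// expected, metadata, and timeout are optional.
type PromptRecord = {
  id: string
  input: string
  expected?: string
  metadata?: Record<string, unknown>
  timeout?: number
}

const isPromptRecord = (v: unknown): v is PromptRecord => {
  if (typeof v !== 'object' || v === null) return false
  const r = v as Record<string, unknown>
  return typeof r.id === 'string' && typeof r.input === 'string'
}

// Validate lines before handing the file to the harness
const lines = [
  '{"id":"test-001","input":"Create a primary button"}',
  '{"input":"missing its id"}',
]
const valid = lines.map((l) => JSON.parse(l)).filter(isPromptRecord)
// valid contains only the first record
```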
- ## Output Formats
-
- ### Summary Format (default)
-
- Minimal JSONL for quick metrics and analysis:
-
- ```jsonl
- {"id":"test-001","input":"Create a button","output":"I created...","toolCalls":["Write"],"status":"passed","duration":1234}
- ```
-
- ### Judge Format (two-tier)
-
- Creates two files optimized for downstream LLM-as-judge scoring:
-
- **`<output>.md`** - Markdown summary with step IDs and code previews:
-
- ````markdown
- ## Capture Record: test-001
-
- **Input:** Create a primary button
-
- **Trajectory:**
- 1. [THOUGHT] I'll create a styled button... [->test-001-step-1]
- 2. [TOOL:Write] -> completed (234ms) [->test-001-step-2]
- File: src/button.tsx (847 chars)
- ```tsx
- import { css } from 'some-css-lib'
-
- type ButtonProps = {
- label: string
-
- // ... 30 lines omitted ...
-
- export const Button = ({ label }: ButtonProps) => (
- <button className={styles.btn}>{label}</button>
- )
- ```
- 3. [MESSAGE] I created the button... [->test-001-step-3]
-
- **Output:** I created the button with primary styling.
- **Metadata:** category=ui, agent=claude-code-acp
- **Status:** passed
- **Duration:** 1234ms
-
- ---
- ````
-
- **`<output>.full.jsonl`** - Complete trajectory with step IDs for correlation:
-
- ```jsonl
- {"id":"test-001","input":"...","output":"...","trajectory":[{"type":"thought","content":"...","timestamp":100,"stepId":"test-001-step-1"},{"type":"tool_call","name":"Write","status":"completed","input":{...},"output":{...},"duration":234,"stepId":"test-001-step-2"}],...}
- ```
-
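The step shapes can be written down as types inferred from these examples. This is a sketch, not a schema published by the harness; the `message` variant and the exact field set are assumptions drawn from the samples:

```typescript
// Step shapes inferred from the judge-format examples - a sketch, not an
// official schema shipped by the harness.
type ThoughtStep = { type: 'thought'; content: string; timestamp: number; stepId: string }
type ToolCallStep = {
  type: 'tool_call'
  name: string
  status: string
  input: unknown
  output: unknown
  duration: number
  stepId: string
}
type MessageStep = { type: 'message'; content: string; stepId: string } // assumed variant
type TrajectoryStep = ThoughtStep | ToolCallStep | MessageStep

// Narrowing on `type` recovers the step-specific fields
const label = (s: TrajectoryStep) =>
  s.type === 'tool_call' ? `${s.name} (${s.duration}ms)` : s.type

const step: TrajectoryStep = {
  type: 'tool_call',
  name: 'Write',
  status: 'completed',
  input: {},
  output: {},
  duration: 234,
  stepId: 'test-001-step-2',
}
// label(step) === 'Write (234ms)'
```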
- **Usage patterns by judge context window:**
-
- | Judge Model | Strategy |
- |-------------|----------|
- | Gemini (1M+ tokens) | Feed `results.full.jsonl` directly |
- | Claude/GPT-4 (128-200k) | Use `results.full.jsonl` for most runs |
- | Smaller models | Use `results.md`, retrieve specific steps by ID as needed |
-
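The table's strategy can be mechanized before each judging run. This helper is purely illustrative: the 4-chars-per-token estimate and the 80% headroom factor are assumptions, not measurements:

```typescript
// Rough routing helper for the strategy table above. The chars-per-token
// ratio and headroom factor are assumptions, not measurements.
const estimateTokens = (text: string) => Math.ceil(text.length / 4)

const pickJudgeInput = (fullJsonl: string, judgeContextTokens: number) =>
  estimateTokens(fullJsonl) < judgeContextTokens * 0.8
    ? 'results.full.jsonl'
    : 'results.md'

const small = pickJudgeInput('x'.repeat(4_000), 128_000)    // ~1k tokens: fits
const huge = pickJudgeInput('x'.repeat(4_000_000), 128_000) // ~1M tokens: overflows
// small === 'results.full.jsonl', huge === 'results.md'
```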
- ## Downstream Integration
-
- The harness outputs standard JSONL that pipes to any tool:
-
- ```bash
- # Filter with jq
- cat results.jsonl | jq 'select(.metadata.category == "ui")'
-
- # Count tool usage
- cat results.jsonl | jq -s 'map(.toolCalls | length) | add'
-
- # Feed full trajectory to Gemini (large context)
- cat results.full.jsonl | your-gemini-judge.ts
- ```
-
- See [downstream.md](references/downstream.md) for integration patterns with Braintrust, Gemini, and custom scorers.
-
- ## Capture Targets
-
- | Target | How to Capture |
- |--------|----------------|
- | **Agent capability** | Direct prompts, capture trajectory for analysis |
- | **Skills** | Set `--cwd` to a project with the skill, capture skill-specific behavior |
- | **MCP Servers** | Use the `--mcp-server` flag, capture tool usage in trajectory |
-
- ### Capturing Skill Behavior
-
- ```bash
- bunx @plaited/acp-harness skill-prompts.jsonl \
-   --cwd /project/with/skill \
-   -o results.jsonl
- ```
-
- ### Capturing MCP Server Usage
-
- ```bash
- bunx @plaited/acp-harness mcp-prompts.jsonl \
-   --mcp-server '{"type":"stdio","name":"fs","command":["mcp-filesystem"]}' \
-   -o results.jsonl
- ```
-
- ## Related
-
- - **[@agentclientprotocol/sdk](https://www.npmjs.com/package/@agentclientprotocol/sdk)** - ACP SDK for programmatic access
- - **[@zed-industries/claude-code-acp](https://www.npmjs.com/package/@zed-industries/claude-code-acp)** - Claude Code ACP adapter
--- a/package/.claude/skills/acp-harness/assets/Dockerfile.acp
+++ /dev/null
@@ -1,25 +0,0 @@
- # ACP Harness Docker Configuration
- #
- # Example Dockerfile for running ACP evaluations in an isolated container.
- # Copy this to your project and customize as needed.
- #
- # Usage:
- #   docker build -f Dockerfile.acp -t acp-harness .
- #   docker run --rm -e ANTHROPIC_API_KEY acp-harness bunx @plaited/acp-harness prompts.jsonl
-
- FROM oven/bun:1.2.9
-
- # Install git (required for some agent operations)
- RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*
-
- WORKDIR /app
-
- # Copy package files first for better layer caching
- COPY package.json bun.lock* ./
- RUN bun install --frozen-lockfile
-
- # Copy source files
- COPY . .
-
- # Default command - override with your harness invocation
- CMD ["bun", "test"]
--- a/package/.claude/skills/acp-harness/assets/docker-compose.acp.yml
+++ /dev/null
@@ -1,19 +0,0 @@
- # ACP Harness Docker Compose Configuration
- #
- # Example docker-compose for running ACP evaluations.
- # Copy this to your project and customize as needed.
- #
- # Usage:
- #   ANTHROPIC_API_KEY=sk-... docker compose -f docker-compose.acp.yml run --rm acp-harness
-
- services:
-   acp-harness:
-     build:
-       context: .
-       dockerfile: Dockerfile.acp
-     environment:
-       - ANTHROPIC_API_KEY
-     volumes:
-       # Mount output directory to persist results
-       - ./results:/app/results
-     command: ["bunx", "@plaited/acp-harness", "prompts.jsonl", "-o", "results/output.jsonl"]
--- a/package/.claude/skills/acp-harness/references/downstream.md
+++ /dev/null
@@ -1,288 +0,0 @@
- # Downstream Integration
-
- Patterns for piping harness output to analysis tools.
-
- ## Loading Results
-
- Both output formats use JSONL (newline-delimited JSON):
-
- ```typescript
- // TypeScript pattern (validated in tests)
- const parseResults = (jsonl: string) =>
-   jsonl.trim().split('\n').map((line) => JSON.parse(line))
-
- // Load from file
- const results = parseResults(await Bun.file('results.jsonl').text())
- ```
-
- ## jq Analysis
-
- Summary JSONL is designed for quick analysis with `jq`:
-
- ```bash
- # Calculate average duration
- cat results.jsonl | jq -s 'map(.duration) | add / length'
-
- # Count tool usage
- cat results.jsonl | jq -s 'map(.toolCalls) | flatten | group_by(.) | map({tool: .[0], count: length})'
-
- # Filter by status
- cat results.jsonl | jq 'select(.status == "failed")'
-
- # Pass rate (bind the total length before filtering)
- cat results.jsonl | jq -s 'length as $t | map(select(.status == "passed")) | length as $p | "\($p)/\($t) passed"'
-
- # Group by category
- cat results.jsonl | jq -s 'group_by(.metadata.category) | map({category: .[0].metadata.category, count: length})'
-
- # Find slowest runs
- cat results.jsonl | jq -s 'sort_by(-.duration) | .[0:5] | map({id, duration})'
- ```
-
- ## TypeScript Analysis Patterns
-
- These patterns are validated by tests in `bin/tests/cli.spec.ts`:
-
- ### Filter by Status
-
- ```typescript
- const failed = results.filter((r) => r.status === 'failed')
- const passed = results.filter((r) => r.status === 'passed')
- const passRate = passed.length / results.length
- ```
-
- ### Filter by Tool Usage
-
- ```typescript
- // Find runs that used Write tool
- const withWrite = results.filter((r) => r.toolCalls.includes('Write'))
-
- // Find runs that used multiple tools
- const multiTool = results.filter((r) => r.toolCalls.length > 1)
- ```
-
- ### Filter by Duration
-
- ```typescript
- // Slow runs (> 2 seconds)
- const slow = results.filter((r) => r.duration > 2000)
-
- // Find top 5 slowest
- const slowest = [...results].sort((a, b) => b.duration - a.duration).slice(0, 5)
- ```
-
- ### Filter by Metadata
-
- ```typescript
- // Filter by category
- const uiResults = results.filter((r) => r.metadata.category === 'ui')
-
- // Group and count by category
- const grouped = results.reduce<Record<string, number>>((acc, r) => {
-   const cat = r.metadata.category as string
-   acc[cat] = (acc[cat] ?? 0) + 1
-   return acc
- }, {})
- ```
-
- ### Count Tool Usage
-
- ```typescript
- const allTools = results.flatMap((r) => r.toolCalls)
- const toolCounts = allTools.reduce<Record<string, number>>((acc, tool) => {
-   acc[tool] = (acc[tool] ?? 0) + 1
-   return acc
- }, {})
- ```
-
- ### Deduplicate by ID
-
- ```typescript
- // Keep latest occurrence when merging multiple runs
- const byId = new Map<string, unknown>()
- for (const result of results) {
-   byId.set(result.id, result)
- }
- const deduped = Array.from(byId.values())
- ```
-
- ## Step-Level Retrieval
-
- For judge format, correlate markdown step IDs with full JSONL:
-
- ```typescript
- // Load both files
- const markdown = await Bun.file('results.md').text()
- const fullResults = parseResults(await Bun.file('results.full.jsonl').text())
-
- // Build step index
- const stepIndex = new Map<string, unknown>()
- for (const result of fullResults) {
-   for (const step of result.trajectory) {
-     stepIndex.set(step.stepId, step)
-   }
- }
-
- // Retrieve full step by ID (from markdown [→stepId])
- const stepId = 'test-001-step-2'
- const fullStep = stepIndex.get(stepId) as { name: string; input: unknown }
- console.log('Tool name:', fullStep.name)
- console.log('Full input:', fullStep.input)
- ```
-
- ## Extract Tool Calls from Trajectory
-
- ```typescript
- const toolCalls = result.trajectory.filter((s) => s.type === 'tool_call')
- const toolNames = toolCalls.map((t) => t.name)
- ```
-
- ## Timing Information
-
- ```typescript
- const result = results[0]
- const duration = result.timing.end - result.timing.start
- const timeToFirstResponse = result.timing.firstResponse // ms after start
- ```
-
- ## LLM-as-Judge
-
- ### Large Context Models (Gemini 1M+)
-
- Feed the full trajectory directly:
-
- ```typescript
- import { GoogleGenerativeAI } from '@google/generative-ai'
-
- const genAI = new GoogleGenerativeAI(process.env.GOOGLE_API_KEY!)
- const model = genAI.getGenerativeModel({ model: 'gemini-2.5-pro' })
-
- const results = parseResults(await Bun.file('results.full.jsonl').text())
-
- const prompt = `
- Evaluate these agent trajectories for code quality and reasoning.
-
- ${JSON.stringify(results, null, 2)}
-
- For each evaluation, score 1-3:
- - 1: Major issues (wrong tools, broken logic, incorrect output)
- - 2: Minor issues (inefficient but correct)
- - 3: Excellent (efficient trajectory, correct output)
-
- Respond as JSON array: [{"id": "...", "score": N, "reasoning": "..."}]
- `
-
- const response = await model.generateContent(prompt)
- console.log(response.response.text())
- ```
-
- ### Medium Context Models (Claude 200k)
-
- Use the full trajectory for most runs; fall back to `results.md` when it overflows the context window:
-
- ```typescript
- import Anthropic from '@anthropic-ai/sdk'
-
- const client = new Anthropic()
- const trajectories = await Bun.file('results.full.jsonl').text()
-
- const response = await client.messages.create({
-   model: 'claude-sonnet-4-20250514',
-   max_tokens: 4096,
-   messages: [{
-     role: 'user',
-     content: `Evaluate these agent trajectories:\n\n${trajectories}\n\nScore each 1-3 and explain.`
-   }]
- })
-
- const block = response.content[0]
- if (block.type === 'text') console.log(block.text)
- ```
-
- ## Braintrust Integration
-
- Upload results programmatically (Braintrust scores must fall in 0-1, so duration goes in metadata):
-
- ```typescript
- import { initLogger } from 'braintrust'
-
- const logger = initLogger({
-   projectName: 'agent-eval',
-   apiKey: process.env.BRAINTRUST_API_KEY,
- })
-
- const results = parseResults(await Bun.file('results.jsonl').text())
-
- for (const result of results) {
-   logger.log({
-     input: result.input,
-     output: result.output,
-     expected: result.expected,
-     scores: {
-       passed: result.status === 'passed' ? 1 : 0,
-     },
-     metadata: {
-       ...result.metadata,
-       toolCalls: result.toolCalls,
-       duration_ms: result.duration,
-     },
-   })
- }
-
- await logger.flush()
- ```
-
- ## CI Integration
-
- ### GitHub Actions
-
- ```yaml
- name: Agent Eval
- on:
-   schedule:
-     - cron: '0 0 * * 0' # Weekly
-
- jobs:
-   eval:
-     runs-on: ubuntu-latest
-     steps:
-       - uses: actions/checkout@v4
-       - uses: oven-sh/setup-bun@v2
-
-       - name: Install ACP adapter
-         run: npm install -g @zed-industries/claude-code-acp
-
-       - name: Install dependencies
-         run: bun add @plaited/acp-harness
-
-       - name: Run harness
-         env:
-           ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
-         run: |
-           bunx @plaited/acp-harness prompts.jsonl \
-             --format judge \
-             --progress \
-             -o eval-results
-
-       - name: Upload results
-         uses: actions/upload-artifact@v4
-         with:
-           name: eval-results
-           path: |
-             eval-results.md
-             eval-results.full.jsonl
- ```
-
- ## Output Aggregation
-
- Combine multiple runs:
-
- ```bash
- # Append mode during runs
- bunx @plaited/acp-harness prompts-1.jsonl --append -o combined.jsonl
- bunx @plaited/acp-harness prompts-2.jsonl --append -o combined.jsonl
-
- # Merge separate files
- cat run1.jsonl run2.jsonl run3.jsonl > combined.jsonl
-
- # Dedupe by ID (keep latest) - use the TypeScript pattern above
- ```
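The dedupe step can be made explicit. This sketch generalizes the Map-based "Deduplicate by ID" pattern above so that later runs win (`mergeRuns` is a hypothetical helper name, not part of the package):

```typescript
// Merge several runs' records, keeping the latest record per id - the
// same Map pattern as "Deduplicate by ID" above, wrapped in a helper.
type WithId = { id: string }

const mergeRuns = <T extends WithId>(...runs: T[][]): T[] => {
  const byId = new Map<string, T>()
  for (const run of runs) {
    for (const record of run) byId.set(record.id, record) // later runs overwrite
  }
  return [...byId.values()]
}

const merged = mergeRuns(
  [{ id: 'a', duration: 100 }, { id: 'b', duration: 200 }],
  [{ id: 'b', duration: 150 }], // re-run of b replaces the earlier record
)
// merged has 2 records; b's duration is 150
```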