@plaited/acp-harness 0.2.6 → 0.3.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (62) hide show
  1. package/LICENSE +1 -1
  2. package/README.md +175 -34
  3. package/bin/cli.ts +105 -636
  4. package/bin/tests/cli.spec.ts +218 -51
  5. package/package.json +21 -5
  6. package/src/acp-client.ts +5 -4
  7. package/src/acp-transport.ts +14 -7
  8. package/src/adapter-check.ts +542 -0
  9. package/src/adapter-scaffold.ts +934 -0
  10. package/src/balance.ts +257 -0
  11. package/src/calibrate.ts +319 -0
  12. package/src/capture.ts +457 -0
  13. package/src/constants.ts +94 -0
  14. package/src/grader-loader.ts +174 -0
  15. package/src/harness.ts +35 -0
  16. package/src/schemas-cli.ts +239 -0
  17. package/src/schemas.ts +567 -0
  18. package/src/summarize.ts +259 -0
  19. package/src/tests/adapter-check.spec.ts +70 -0
  20. package/src/tests/adapter-scaffold.spec.ts +112 -0
  21. package/src/tests/balance-helpers.spec.ts +279 -0
  22. package/src/tests/calibrate-helpers.spec.ts +226 -0
  23. package/src/tests/capture-helpers.spec.ts +553 -0
  24. package/src/tests/fixtures/grader-bad-module.ts +5 -0
  25. package/src/tests/fixtures/grader-exec-fail.py +9 -0
  26. package/src/tests/fixtures/grader-exec-invalid.py +6 -0
  27. package/src/tests/fixtures/grader-exec.py +29 -0
  28. package/src/tests/fixtures/grader-module.ts +14 -0
  29. package/src/tests/grader-loader.spec.ts +153 -0
  30. package/src/tests/summarize-helpers.spec.ts +339 -0
  31. package/src/tests/trials-calculations.spec.ts +209 -0
  32. package/src/trials.ts +407 -0
  33. package/src/validate-refs.ts +188 -0
  34. package/.claude/rules/accuracy.md +0 -43
  35. package/.claude/rules/bun-apis.md +0 -80
  36. package/.claude/rules/code-review.md +0 -254
  37. package/.claude/rules/git-workflow.md +0 -37
  38. package/.claude/rules/github.md +0 -154
  39. package/.claude/rules/testing.md +0 -172
  40. package/.claude/skills/acp-harness/SKILL.md +0 -310
  41. package/.claude/skills/acp-harness/assets/Dockerfile.acp +0 -25
  42. package/.claude/skills/acp-harness/assets/docker-compose.acp.yml +0 -19
  43. package/.claude/skills/acp-harness/references/downstream.md +0 -288
  44. package/.claude/skills/acp-harness/references/output-formats.md +0 -221
  45. package/.claude-plugin/marketplace.json +0 -15
  46. package/.claude-plugin/plugin.json +0 -16
  47. package/.github/CODEOWNERS +0 -6
  48. package/.github/workflows/ci.yml +0 -63
  49. package/.github/workflows/publish.yml +0 -146
  50. package/.mcp.json +0 -20
  51. package/CLAUDE.md +0 -92
  52. package/Dockerfile.test +0 -23
  53. package/biome.json +0 -96
  54. package/bun.lock +0 -513
  55. package/docker-compose.test.yml +0 -21
  56. package/scripts/bun-test-wrapper.sh +0 -46
  57. package/src/acp.constants.ts +0 -56
  58. package/src/acp.schemas.ts +0 -161
  59. package/src/acp.types.ts +0 -28
  60. package/src/tests/fixtures/.claude/settings.local.json +0 -8
  61. package/src/tests/fixtures/.claude/skills/greeting/SKILL.md +0 -17
  62. package/tsconfig.json +0 -32
package/LICENSE CHANGED
@@ -1,6 +1,6 @@
1
1
  ISC License
2
2
 
3
- Copyright (c) 2025 Plaited
3
+ Copyright (c) 2026 Plaited Labs
4
4
 
5
5
  Permission to use, copy, modify, and/or distribute this software for any
6
6
  purpose with or without fee is hereby granted, provided that the above
package/README.md CHANGED
@@ -4,43 +4,120 @@
4
4
  [![CI](https://github.com/plaited/acp-harness/actions/workflows/ci.yml/badge.svg)](https://github.com/plaited/acp-harness/actions/workflows/ci.yml)
5
5
  [![License: ISC](https://img.shields.io/badge/License-ISC-blue.svg)](https://opensource.org/licenses/ISC)
6
6
 
7
- CLI tool for capturing agent trajectories from ACP-compatible agents. Execute prompts, capture full trajectories (tools, thoughts, plans), and output structured JSONL for downstream scoring.
7
+ CLI tool for capturing agent trajectories from ACP-compatible agents. Execute prompts, capture full trajectories (tools, thoughts, plans), and output structured JSONL for downstream scoring. Available as both a CLI tool and as installable skills for AI coding agents.
8
8
 
9
- ## Quick Start
9
+ ## CLI Tool
10
+
11
+ Use these tools directly via the CLI without installation:
10
12
 
11
13
  ```bash
12
14
  # Run without installing
13
- bunx @plaited/acp-harness prompts.jsonl -o results.jsonl
15
+ bunx @plaited/acp-harness capture prompts.jsonl bunx claude-code-acp -o results.jsonl
14
16
 
15
17
  # Or install globally
16
18
  bun add -g @plaited/acp-harness
17
- acp-harness prompts.jsonl -o results.jsonl
19
+ acp-harness capture prompts.jsonl bunx claude-code-acp -o results.jsonl
18
20
  ```
19
21
 
20
22
  **Prerequisite:** Install an ACP adapter and set your API key:
21
23
 
22
24
  ```bash
23
- npm install -g @zed-industries/claude-code-acp
25
+ npm install -g @anthropic-ai/claude-code-acp
24
26
  export ANTHROPIC_API_KEY=sk-...
25
27
  ```
26
28
 
27
- ## Usage
29
+ ### Commands
30
+
31
+ | Command | Description |
32
+ |---------|-------------|
33
+ | `capture <prompts> <cmd>` | Trajectory capture (full JSONL) |
34
+ | `trials <prompts> <cmd>` | Multi-run with pass@k metrics |
35
+ | `summarize <results>` | Derive compact views from results |
36
+ | `calibrate <results>` | Sample failures for review |
37
+ | `validate-refs <prompts>` | Check reference solutions |
38
+ | `balance <prompts>` | Analyze test set coverage |
39
+ | `schemas [name]` | Export JSON schemas |
40
+ | `adapter:scaffold [name]` | Scaffold new ACP adapter project |
41
+ | `adapter:check <cmd>` | Validate adapter ACP compliance |
42
+
43
+ ### Examples
44
+
45
+ ```bash
46
+ # Capture trajectories
47
+ bunx @plaited/acp-harness capture prompts.jsonl bunx claude-code-acp -o results.jsonl
48
+
49
+ # Run trials for pass@k analysis
50
+ bunx @plaited/acp-harness trials prompts.jsonl bunx claude-code-acp -k 5 --grader ./grader.ts
51
+
52
+ # Summarize results
53
+ bunx @plaited/acp-harness summarize results.jsonl -o summary.jsonl
54
+
55
+ # Export schemas
56
+ bunx @plaited/acp-harness schemas CaptureResult --json
57
+
58
+ # Scaffold a new adapter
59
+ bunx @plaited/acp-harness adapter:scaffold my-agent -o ./my-agent-acp
60
+
61
+ # Validate adapter compliance
62
+ bunx @plaited/acp-harness adapter:check bun ./my-adapter/src/main.ts
63
+ ```
64
+
65
+ ## Skills for AI Agents
66
+
67
+ **Install skills** for use with AI coding agents:
28
68
 
29
69
  ```bash
30
- acp-harness <prompts.jsonl> [options]
31
-
32
- Options:
33
- --cmd, --command ACP agent command (default: "claude-code-acp")
34
- -o, --output Output file (default: stdout)
35
- -c, --cwd Working directory for agent
36
- -t, --timeout Request timeout in ms (default: 60000)
37
- -f, --format Output format: summary, judge (default: summary)
38
- --progress Show progress to stderr
39
- --append Append to output file
40
- --mcp-server MCP server config JSON (repeatable)
41
- -h, --help Show help
70
+ curl -fsSL https://raw.githubusercontent.com/plaited/skills-installer/main/install.sh | bash -s -- --agent <agent-name> --project acp-harness
42
71
  ```
43
72
 
73
+ Replace `<agent-name>` with your agent: `claude`, `cursor`, `copilot`, `opencode`, `amp`, `goose`, `factory`
74
+
75
+ **Update skills:**
76
+
77
+ ```bash
78
+ curl -fsSL https://raw.githubusercontent.com/plaited/skills-installer/main/install.sh | bash -s -- update --agent <agent-name> --project acp-harness
79
+ ```
80
+
81
+ ### Available Skills
82
+
83
+ #### ACP Harness
84
+
85
+ CLI tool for capturing agent trajectories, optimized for TypeScript/JavaScript projects using Bun.
86
+
87
+ **Commands:**
88
+
89
+ | Command | Description |
90
+ |---------|-------------|
91
+ | `capture` | Execute prompts and capture full trajectories |
92
+ | `trials` | Multi-run trials with pass@k/pass^k metrics |
93
+ | `summarize` | Derive compact views from trajectory results |
94
+ | `calibrate` | Sample failures for grader calibration |
95
+ | `validate-refs` | Validate reference solutions against graders |
96
+ | `balance` | Analyze test set coverage distribution |
97
+ | `schemas` | Export Zod schemas as JSON Schema |
98
+
99
+ **Use cases:**
100
+ - Capturing trajectories for downstream evaluation (Braintrust, custom scorers)
101
+ - Generating training data (SFT/DPO) with full context
102
+ - Building regression test fixtures for agent behavior
103
+ - Comparing agent responses across configurations
104
+
105
+ #### ACP Adapters
106
+
107
+ Discover, create, and validate ACP adapters for agent integration.
108
+
109
+ **Commands:**
110
+
111
+ | Command | Description |
112
+ |---------|-------------|
113
+ | `adapter:scaffold` | Generate new adapter project with handlers |
114
+ | `adapter:check` | Validate ACP protocol compliance |
115
+
116
+ **Use cases:**
117
+ - Finding existing adapters for your agent
118
+ - Building custom ACP adapters from scratch
119
+ - Validating adapter implementations
120
+
44
121
  ## Input Format
45
122
 
46
123
  ```jsonl
@@ -48,31 +125,95 @@ Options:
48
125
  {"id":"test-002","input":"Fix the TypeScript error","metadata":{"category":"bugfix"}}
49
126
  ```
50
127
 
51
- ## Output
128
+ ## Output Format
52
129
 
53
- The harness captures trajectories and outputs structured JSONL. **You provide the scoring logic.**
130
+ The harness outputs full trajectory JSONL (`CaptureResult` schema):
54
131
 
55
- ```bash
56
- # Capture trajectories
57
- acp-harness prompts.jsonl -o results.jsonl
132
+ ```jsonl
133
+ {
134
+ "id": "test-001",
135
+ "input": "Create a primary button",
136
+ "output": "Here's a button component...",
137
+ "expected": "should contain <button>",
138
+ "trajectory": [...],
139
+ "metadata": {"category": "ui", "agent": "bunx claude-code-acp"},
140
+ "timing": {"start": 1234567890, "end": 1234567900},
141
+ "toolErrors": false,
142
+ "score": {"pass": true, "score": 1.0, "reasoning": "Contains expected"}
143
+ }
144
+ ```
58
145
 
59
- # Score with your tools
60
- cat results.jsonl | jq 'select(.status == "failed")'
61
- cat results.jsonl | your-scoring-script.ts
146
+ Key fields:
147
+ - `toolErrors`: Boolean indicating if any tool calls failed
148
+ - `score`: Grader result (only if `--grader` provided)
149
+ - `trajectory`: Full execution trace (thoughts, messages, tool calls, plans)
150
+
151
+ ## Graders
152
+
153
+ Graders score agent outputs. The harness supports two types:
154
+
155
+ ### TypeScript/JavaScript Graders
156
+
157
+ Export a `grade` function:
158
+
159
+ ```typescript
160
+ import type { Grader } from '@plaited/acp-harness/schemas'
161
+
162
+ export const grade: Grader = async ({ input, output, expected, trajectory }) => {
163
+ const pass = output.toLowerCase().includes(expected?.toLowerCase() ?? '')
164
+ return {
165
+ pass,
166
+ score: pass ? 1.0 : 0.0,
167
+ reasoning: pass ? 'Contains expected answer' : 'Missing expected answer'
168
+ }
169
+ }
62
170
  ```
63
171
 
64
- ## Plugin
172
+ ```bash
173
+ acp-harness capture prompts.jsonl bunx claude-code-acp --grader ./grader.ts
174
+ ```
65
175
 
66
- This package includes an **acp-harness skill** for AI coding agents with complete documentation:
176
+ ### Polyglot Graders (Python, etc.)
67
177
 
68
- - CLI usage and examples
69
- - Output format schemas
70
- - Integration patterns (Braintrust, jq, custom scorers)
178
+ Any executable script using stdin/stdout JSON protocol:
71
179
 
72
- **Install via Claude Code:**
180
+ ```python
181
+ #!/usr/bin/env python3
182
+ import json
183
+ import sys
184
+
185
+ data = json.load(sys.stdin)
186
+ output = data["output"].lower()
187
+ expected = (data.get("expected") or "").lower()
188
+
189
+ pass_result = expected in output if expected else True
190
+ print(json.dumps({
191
+ "pass": pass_result,
192
+ "score": 1.0 if pass_result else 0.0,
193
+ "reasoning": "Contains expected" if pass_result else "Missing expected"
194
+ }))
195
+ ```
73
196
 
74
197
  ```bash
75
- /plugin marketplace add plaited/marketplace
198
+ chmod +x grader.py
199
+ acp-harness capture prompts.jsonl bunx claude-code-acp --grader ./grader.py
200
+ ```
201
+
202
+ **Protocol:**
203
+ - Input (stdin): `{"input": "...", "output": "...", "expected": "...", "trajectory": [...]}`
204
+ - Output (stdout): `{"pass": true, "score": 1.0, "reasoning": "..."}`
205
+
206
+ ## Downstream Integration
207
+
208
+ ```bash
209
+ # Filter failures
210
+ cat results.jsonl | jq 'select(.score.pass == false)'
211
+
212
+ # Extract tool usage patterns
213
+ cat results.jsonl | jq '.trajectory[] | select(.type == "tool_call") | .name'
214
+
215
+ # Use with your scoring pipeline
216
+ cat results.jsonl | your-scoring-script.ts
76
217
  ```
77
218
 
78
219
  ## Development
@@ -86,7 +227,7 @@ bun test # Run unit tests
86
227
  ## Requirements
87
228
 
88
229
  - **Runtime:** Bun >= 1.2.9
89
- - **ACP Adapter:** `@zed-industries/claude-code-acp` or compatible
230
+ - **ACP Adapter:** `@anthropic-ai/claude-code-acp` or compatible
90
231
  - **API Key:** `ANTHROPIC_API_KEY` environment variable
91
232
 
92
233
  ## License