@plaited/acp-harness 0.2.6 → 0.3.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/LICENSE +1 -1
- package/README.md +175 -34
- package/bin/cli.ts +105 -636
- package/bin/tests/cli.spec.ts +218 -51
- package/package.json +21 -5
- package/src/acp-client.ts +5 -4
- package/src/acp-transport.ts +14 -7
- package/src/adapter-check.ts +542 -0
- package/src/adapter-scaffold.ts +934 -0
- package/src/balance.ts +257 -0
- package/src/calibrate.ts +319 -0
- package/src/capture.ts +457 -0
- package/src/constants.ts +94 -0
- package/src/grader-loader.ts +174 -0
- package/src/harness.ts +35 -0
- package/src/schemas-cli.ts +239 -0
- package/src/schemas.ts +567 -0
- package/src/summarize.ts +259 -0
- package/src/tests/adapter-check.spec.ts +70 -0
- package/src/tests/adapter-scaffold.spec.ts +112 -0
- package/src/tests/balance-helpers.spec.ts +279 -0
- package/src/tests/calibrate-helpers.spec.ts +226 -0
- package/src/tests/capture-helpers.spec.ts +553 -0
- package/src/tests/fixtures/grader-bad-module.ts +5 -0
- package/src/tests/fixtures/grader-exec-fail.py +9 -0
- package/src/tests/fixtures/grader-exec-invalid.py +6 -0
- package/src/tests/fixtures/grader-exec.py +29 -0
- package/src/tests/fixtures/grader-module.ts +14 -0
- package/src/tests/grader-loader.spec.ts +153 -0
- package/src/tests/summarize-helpers.spec.ts +339 -0
- package/src/tests/trials-calculations.spec.ts +209 -0
- package/src/trials.ts +407 -0
- package/src/validate-refs.ts +188 -0
- package/.claude/rules/accuracy.md +0 -43
- package/.claude/rules/bun-apis.md +0 -80
- package/.claude/rules/code-review.md +0 -254
- package/.claude/rules/git-workflow.md +0 -37
- package/.claude/rules/github.md +0 -154
- package/.claude/rules/testing.md +0 -172
- package/.claude/skills/acp-harness/SKILL.md +0 -310
- package/.claude/skills/acp-harness/assets/Dockerfile.acp +0 -25
- package/.claude/skills/acp-harness/assets/docker-compose.acp.yml +0 -19
- package/.claude/skills/acp-harness/references/downstream.md +0 -288
- package/.claude/skills/acp-harness/references/output-formats.md +0 -221
- package/.claude-plugin/marketplace.json +0 -15
- package/.claude-plugin/plugin.json +0 -16
- package/.github/CODEOWNERS +0 -6
- package/.github/workflows/ci.yml +0 -63
- package/.github/workflows/publish.yml +0 -146
- package/.mcp.json +0 -20
- package/CLAUDE.md +0 -92
- package/Dockerfile.test +0 -23
- package/biome.json +0 -96
- package/bun.lock +0 -513
- package/docker-compose.test.yml +0 -21
- package/scripts/bun-test-wrapper.sh +0 -46
- package/src/acp.constants.ts +0 -56
- package/src/acp.schemas.ts +0 -161
- package/src/acp.types.ts +0 -28
- package/src/tests/fixtures/.claude/settings.local.json +0 -8
- package/src/tests/fixtures/.claude/skills/greeting/SKILL.md +0 -17
- package/tsconfig.json +0 -32
package/LICENSE
CHANGED
package/README.md
CHANGED
|
@@ -4,43 +4,120 @@
|
|
|
4
4
|
[](https://github.com/plaited/acp-harness/actions/workflows/ci.yml)
|
|
5
5
|
[](https://opensource.org/licenses/ISC)
|
|
6
6
|
|
|
7
|
-
CLI tool for capturing agent trajectories from ACP-compatible agents. Execute prompts, capture full trajectories (tools, thoughts, plans), and output structured JSONL for downstream scoring.
|
|
7
|
+
CLI tool for capturing agent trajectories from ACP-compatible agents. Execute prompts, capture full trajectories (tools, thoughts, plans), and output structured JSONL for downstream scoring. Available as both a CLI tool and as installable skills for AI coding agents.
|
|
8
8
|
|
|
9
|
-
##
|
|
9
|
+
## CLI Tool
|
|
10
|
+
|
|
11
|
+
Use these tools directly via the CLI without installation:
|
|
10
12
|
|
|
11
13
|
```bash
|
|
12
14
|
# Run without installing
|
|
13
|
-
bunx @plaited/acp-harness prompts.jsonl -o results.jsonl
|
|
15
|
+
bunx @plaited/acp-harness capture prompts.jsonl bunx claude-code-acp -o results.jsonl
|
|
14
16
|
|
|
15
17
|
# Or install globally
|
|
16
18
|
bun add -g @plaited/acp-harness
|
|
17
|
-
acp-harness prompts.jsonl -o results.jsonl
|
|
19
|
+
acp-harness capture prompts.jsonl bunx claude-code-acp -o results.jsonl
|
|
18
20
|
```
|
|
19
21
|
|
|
20
22
|
**Prerequisite:** Install an ACP adapter and set your API key:
|
|
21
23
|
|
|
22
24
|
```bash
|
|
23
|
-
npm install -g @
|
|
25
|
+
npm install -g @anthropic-ai/claude-code-acp
|
|
24
26
|
export ANTHROPIC_API_KEY=sk-...
|
|
25
27
|
```
|
|
26
28
|
|
|
27
|
-
|
|
29
|
+
### Commands
|
|
30
|
+
|
|
31
|
+
| Command | Description |
|
|
32
|
+
|---------|-------------|
|
|
33
|
+
| `capture <prompts> <cmd>` | Trajectory capture (full JSONL) |
|
|
34
|
+
| `trials <prompts> <cmd>` | Multi-run with pass@k metrics |
|
|
35
|
+
| `summarize <results>` | Derive compact views from results |
|
|
36
|
+
| `calibrate <results>` | Sample failures for review |
|
|
37
|
+
| `validate-refs <prompts>` | Check reference solutions |
|
|
38
|
+
| `balance <prompts>` | Analyze test set coverage |
|
|
39
|
+
| `schemas [name]` | Export JSON schemas |
|
|
40
|
+
| `adapter:scaffold [name]` | Scaffold new ACP adapter project |
|
|
41
|
+
| `adapter:check <cmd>` | Validate adapter ACP compliance |
|
|
42
|
+
|
|
43
|
+
### Examples
|
|
44
|
+
|
|
45
|
+
```bash
|
|
46
|
+
# Capture trajectories
|
|
47
|
+
bunx @plaited/acp-harness capture prompts.jsonl bunx claude-code-acp -o results.jsonl
|
|
48
|
+
|
|
49
|
+
# Run trials for pass@k analysis
|
|
50
|
+
bunx @plaited/acp-harness trials prompts.jsonl bunx claude-code-acp -k 5 --grader ./grader.ts
|
|
51
|
+
|
|
52
|
+
# Summarize results
|
|
53
|
+
bunx @plaited/acp-harness summarize results.jsonl -o summary.jsonl
|
|
54
|
+
|
|
55
|
+
# Export schemas
|
|
56
|
+
bunx @plaited/acp-harness schemas CaptureResult --json
|
|
57
|
+
|
|
58
|
+
# Scaffold a new adapter
|
|
59
|
+
bunx @plaited/acp-harness adapter:scaffold my-agent -o ./my-agent-acp
|
|
60
|
+
|
|
61
|
+
# Validate adapter compliance
|
|
62
|
+
bunx @plaited/acp-harness adapter:check bun ./my-adapter/src/main.ts
|
|
63
|
+
```
|
|
64
|
+
|
|
65
|
+
## Skills for AI Agents
|
|
66
|
+
|
|
67
|
+
**Install skills** for use with AI coding agents:
|
|
28
68
|
|
|
29
69
|
```bash
|
|
30
|
-
|
|
31
|
-
|
|
32
|
-
Options:
|
|
33
|
-
--cmd, --command ACP agent command (default: "claude-code-acp")
|
|
34
|
-
-o, --output Output file (default: stdout)
|
|
35
|
-
-c, --cwd Working directory for agent
|
|
36
|
-
-t, --timeout Request timeout in ms (default: 60000)
|
|
37
|
-
-f, --format Output format: summary, judge (default: summary)
|
|
38
|
-
--progress Show progress to stderr
|
|
39
|
-
--append Append to output file
|
|
40
|
-
--mcp-server MCP server config JSON (repeatable)
|
|
41
|
-
-h, --help Show help
|
|
70
|
+
curl -fsSL https://raw.githubusercontent.com/plaited/skills-installer/main/install.sh | bash -s -- --agent <agent-name> --project acp-harness
|
|
42
71
|
```
|
|
43
72
|
|
|
73
|
+
Replace `<agent-name>` with your agent: `claude`, `cursor`, `copilot`, `opencode`, `amp`, `goose`, `factory`
|
|
74
|
+
|
|
75
|
+
**Update skills:**
|
|
76
|
+
|
|
77
|
+
```bash
|
|
78
|
+
curl -fsSL https://raw.githubusercontent.com/plaited/skills-installer/main/install.sh | bash -s -- update --agent <agent-name> --project acp-harness
|
|
79
|
+
```
|
|
80
|
+
|
|
81
|
+
### Available Skills
|
|
82
|
+
|
|
83
|
+
#### ACP Harness
|
|
84
|
+
|
|
85
|
+
CLI tool for capturing agent trajectories, optimized for TypeScript/JavaScript projects using Bun.
|
|
86
|
+
|
|
87
|
+
**Commands:**
|
|
88
|
+
|
|
89
|
+
| Command | Description |
|
|
90
|
+
|---------|-------------|
|
|
91
|
+
| `capture` | Execute prompts and capture full trajectories |
|
|
92
|
+
| `trials` | Multi-run trials with pass@k/pass^k metrics |
|
|
93
|
+
| `summarize` | Derive compact views from trajectory results |
|
|
94
|
+
| `calibrate` | Sample failures for grader calibration |
|
|
95
|
+
| `validate-refs` | Validate reference solutions against graders |
|
|
96
|
+
| `balance` | Analyze test set coverage distribution |
|
|
97
|
+
| `schemas` | Export Zod schemas as JSON Schema |
|
|
98
|
+
|
|
99
|
+
**Use cases:**
|
|
100
|
+
- Capturing trajectories for downstream evaluation (Braintrust, custom scorers)
|
|
101
|
+
- Generating training data (SFT/DPO) with full context
|
|
102
|
+
- Building regression test fixtures for agent behavior
|
|
103
|
+
- Comparing agent responses across configurations
|
|
104
|
+
|
|
105
|
+
#### ACP Adapters
|
|
106
|
+
|
|
107
|
+
Discover, create, and validate ACP adapters for agent integration.
|
|
108
|
+
|
|
109
|
+
**Commands:**
|
|
110
|
+
|
|
111
|
+
| Command | Description |
|
|
112
|
+
|---------|-------------|
|
|
113
|
+
| `adapter:scaffold` | Generate new adapter project with handlers |
|
|
114
|
+
| `adapter:check` | Validate ACP protocol compliance |
|
|
115
|
+
|
|
116
|
+
**Use cases:**
|
|
117
|
+
- Finding existing adapters for your agent
|
|
118
|
+
- Building custom ACP adapters from scratch
|
|
119
|
+
- Validating adapter implementations
|
|
120
|
+
|
|
44
121
|
## Input Format
|
|
45
122
|
|
|
46
123
|
```jsonl
|
|
@@ -48,31 +125,95 @@ Options:
|
|
|
48
125
|
{"id":"test-002","input":"Fix the TypeScript error","metadata":{"category":"bugfix"}}
|
|
49
126
|
```
|
|
50
127
|
|
|
51
|
-
## Output
|
|
128
|
+
## Output Format
|
|
52
129
|
|
|
53
|
-
The harness
|
|
130
|
+
The harness outputs full trajectory JSONL (`CaptureResult` schema):
|
|
54
131
|
|
|
55
|
-
```
|
|
56
|
-
|
|
57
|
-
|
|
132
|
+
```jsonl
|
|
133
|
+
{
|
|
134
|
+
"id": "test-001",
|
|
135
|
+
"input": "Create a primary button",
|
|
136
|
+
"output": "Here's a button component...",
|
|
137
|
+
"expected": "should contain <button>",
|
|
138
|
+
"trajectory": [...],
|
|
139
|
+
"metadata": {"category": "ui", "agent": "bunx claude-code-acp"},
|
|
140
|
+
"timing": {"start": 1234567890, "end": 1234567900},
|
|
141
|
+
"toolErrors": false,
|
|
142
|
+
"score": {"pass": true, "score": 1.0, "reasoning": "Contains expected"}
|
|
143
|
+
}
|
|
144
|
+
```
|
|
58
145
|
|
|
59
|
-
|
|
60
|
-
|
|
61
|
-
|
|
146
|
+
Key fields:
|
|
147
|
+
- `toolErrors`: Boolean indicating if any tool calls failed
|
|
148
|
+
- `score`: Grader result (only if `--grader` provided)
|
|
149
|
+
- `trajectory`: Full execution trace (thoughts, messages, tool calls, plans)
|
|
150
|
+
|
|
151
|
+
## Graders
|
|
152
|
+
|
|
153
|
+
Graders score agent outputs. The harness supports two types:
|
|
154
|
+
|
|
155
|
+
### TypeScript/JavaScript Graders
|
|
156
|
+
|
|
157
|
+
Export a `grade` function:
|
|
158
|
+
|
|
159
|
+
```typescript
|
|
160
|
+
import type { Grader } from '@plaited/acp-harness/schemas'
|
|
161
|
+
|
|
162
|
+
export const grade: Grader = async ({ input, output, expected, trajectory }) => {
|
|
163
|
+
const pass = output.toLowerCase().includes(expected?.toLowerCase() ?? '')
|
|
164
|
+
return {
|
|
165
|
+
pass,
|
|
166
|
+
score: pass ? 1.0 : 0.0,
|
|
167
|
+
reasoning: pass ? 'Contains expected answer' : 'Missing expected answer'
|
|
168
|
+
}
|
|
169
|
+
}
|
|
62
170
|
```
|
|
63
171
|
|
|
64
|
-
|
|
172
|
+
```bash
|
|
173
|
+
acp-harness capture prompts.jsonl bunx claude-code-acp --grader ./grader.ts
|
|
174
|
+
```
|
|
65
175
|
|
|
66
|
-
|
|
176
|
+
### Polyglot Graders (Python, etc.)
|
|
67
177
|
|
|
68
|
-
|
|
69
|
-
- Output format schemas
|
|
70
|
-
- Integration patterns (Braintrust, jq, custom scorers)
|
|
178
|
+
Any executable script using stdin/stdout JSON protocol:
|
|
71
179
|
|
|
72
|
-
|
|
180
|
+
```python
|
|
181
|
+
#!/usr/bin/env python3
|
|
182
|
+
import json
|
|
183
|
+
import sys
|
|
184
|
+
|
|
185
|
+
data = json.load(sys.stdin)
|
|
186
|
+
output = data["output"].lower()
|
|
187
|
+
expected = (data.get("expected") or "").lower()
|
|
188
|
+
|
|
189
|
+
pass_result = expected in output if expected else True
|
|
190
|
+
print(json.dumps({
|
|
191
|
+
"pass": pass_result,
|
|
192
|
+
"score": 1.0 if pass_result else 0.0,
|
|
193
|
+
"reasoning": "Contains expected" if pass_result else "Missing expected"
|
|
194
|
+
}))
|
|
195
|
+
```
|
|
73
196
|
|
|
74
197
|
```bash
|
|
75
|
-
|
|
198
|
+
chmod +x grader.py
|
|
199
|
+
acp-harness capture prompts.jsonl bunx claude-code-acp --grader ./grader.py
|
|
200
|
+
```
|
|
201
|
+
|
|
202
|
+
**Protocol:**
|
|
203
|
+
- Input (stdin): `{"input": "...", "output": "...", "expected": "...", "trajectory": [...]}`
|
|
204
|
+
- Output (stdout): `{"pass": true, "score": 1.0, "reasoning": "..."}`
|
|
205
|
+
|
|
206
|
+
## Downstream Integration
|
|
207
|
+
|
|
208
|
+
```bash
|
|
209
|
+
# Filter failures
|
|
210
|
+
cat results.jsonl | jq 'select(.score.pass == false)'
|
|
211
|
+
|
|
212
|
+
# Extract tool usage patterns
|
|
213
|
+
cat results.jsonl | jq '.trajectory[] | select(.type == "tool_call") | .name'
|
|
214
|
+
|
|
215
|
+
# Use with your scoring pipeline
|
|
216
|
+
cat results.jsonl | your-scoring-script.ts
|
|
76
217
|
```
|
|
77
218
|
|
|
78
219
|
## Development
|
|
@@ -86,7 +227,7 @@ bun test # Run unit tests
|
|
|
86
227
|
## Requirements
|
|
87
228
|
|
|
88
229
|
- **Runtime:** Bun >= 1.2.9
|
|
89
|
-
- **ACP Adapter:** `@
|
|
230
|
+
- **ACP Adapter:** `@anthropic-ai/claude-code-acp` or compatible
|
|
90
231
|
- **API Key:** `ANTHROPIC_API_KEY` environment variable
|
|
91
232
|
|
|
92
233
|
## License
|