@plaited/acp-harness 0.2.6 → 0.3.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/LICENSE +1 -1
- package/README.md +120 -16
- package/bin/cli.ts +105 -636
- package/bin/tests/cli.spec.ts +218 -51
- package/package.json +20 -4
- package/src/acp-client.ts +5 -4
- package/src/acp-transport.ts +14 -7
- package/src/adapter-check.ts +542 -0
- package/src/adapter-scaffold.ts +934 -0
- package/src/balance.ts +232 -0
- package/src/calibrate.ts +300 -0
- package/src/capture.ts +457 -0
- package/src/constants.ts +94 -0
- package/src/grader-loader.ts +174 -0
- package/src/harness.ts +35 -0
- package/src/schemas-cli.ts +239 -0
- package/src/schemas.ts +567 -0
- package/src/summarize.ts +245 -0
- package/src/tests/adapter-check.spec.ts +70 -0
- package/src/tests/adapter-scaffold.spec.ts +112 -0
- package/src/tests/fixtures/grader-bad-module.ts +5 -0
- package/src/tests/fixtures/grader-exec-fail.py +9 -0
- package/src/tests/fixtures/grader-exec-invalid.py +6 -0
- package/src/tests/fixtures/grader-exec.py +29 -0
- package/src/tests/fixtures/grader-module.ts +14 -0
- package/src/tests/grader-loader.spec.ts +153 -0
- package/src/trials.ts +395 -0
- package/src/validate-refs.ts +188 -0
- package/.claude/rules/accuracy.md +0 -43
- package/.claude/rules/bun-apis.md +0 -80
- package/.claude/rules/code-review.md +0 -254
- package/.claude/rules/git-workflow.md +0 -37
- package/.claude/rules/github.md +0 -154
- package/.claude/rules/testing.md +0 -172
- package/.claude/skills/acp-harness/SKILL.md +0 -310
- package/.claude/skills/acp-harness/assets/Dockerfile.acp +0 -25
- package/.claude/skills/acp-harness/assets/docker-compose.acp.yml +0 -19
- package/.claude/skills/acp-harness/references/downstream.md +0 -288
- package/.claude/skills/acp-harness/references/output-formats.md +0 -221
- package/.claude-plugin/marketplace.json +0 -15
- package/.claude-plugin/plugin.json +0 -16
- package/.github/CODEOWNERS +0 -6
- package/.github/workflows/ci.yml +0 -63
- package/.github/workflows/publish.yml +0 -146
- package/.mcp.json +0 -20
- package/CLAUDE.md +0 -92
- package/Dockerfile.test +0 -23
- package/biome.json +0 -96
- package/bun.lock +0 -513
- package/docker-compose.test.yml +0 -21
- package/scripts/bun-test-wrapper.sh +0 -46
- package/src/acp.constants.ts +0 -56
- package/src/acp.schemas.ts +0 -161
- package/src/acp.types.ts +0 -28
- package/src/tests/fixtures/.claude/settings.local.json +0 -8
- package/src/tests/fixtures/.claude/skills/greeting/SKILL.md +0 -17
- package/tsconfig.json +0 -32
package/LICENSE
CHANGED
package/README.md
CHANGED
|
@@ -10,37 +10,63 @@ CLI tool for capturing agent trajectories from ACP-compatible agents. Execute pr
|
|
|
10
10
|
|
|
11
11
|
```bash
|
|
12
12
|
# Run without installing
|
|
13
|
-
bunx @plaited/acp-harness prompts.jsonl -o results.jsonl
|
|
13
|
+
bunx @plaited/acp-harness capture prompts.jsonl bunx claude-code-acp -o results.jsonl
|
|
14
14
|
|
|
15
15
|
# Or install globally
|
|
16
16
|
bun add -g @plaited/acp-harness
|
|
17
|
-
acp-harness prompts.jsonl -o results.jsonl
|
|
17
|
+
acp-harness capture prompts.jsonl bunx claude-code-acp -o results.jsonl
|
|
18
18
|
```
|
|
19
19
|
|
|
20
20
|
**Prerequisite:** Install an ACP adapter and set your API key:
|
|
21
21
|
|
|
22
22
|
```bash
|
|
23
|
-
npm install -g @
|
|
23
|
+
npm install -g @anthropic-ai/claude-code-acp
|
|
24
24
|
export ANTHROPIC_API_KEY=sk-...
|
|
25
25
|
```
|
|
26
26
|
|
|
27
|
-
##
|
|
27
|
+
## Commands
|
|
28
|
+
|
|
29
|
+
| Command | Purpose |
|
|
30
|
+
|---------|---------|
|
|
31
|
+
| `capture` | Trajectory capture (full JSONL) |
|
|
32
|
+
| `trials` | Multi-run with pass@k metrics |
|
|
33
|
+
| `summarize` | Derive compact views from results |
|
|
34
|
+
| `calibrate` | Sample failures for review |
|
|
35
|
+
| `validate-refs` | Check reference solutions |
|
|
36
|
+
| `balance` | Analyze test set coverage |
|
|
37
|
+
| `schemas` | Export JSON schemas |
|
|
38
|
+
|
|
39
|
+
### capture
|
|
40
|
+
|
|
41
|
+
Capture full trajectories from an ACP agent:
|
|
28
42
|
|
|
29
43
|
```bash
|
|
30
|
-
acp-harness <prompts.jsonl> [options]
|
|
44
|
+
acp-harness capture <prompts.jsonl> <command> [args...] [options]
|
|
31
45
|
|
|
32
46
|
Options:
|
|
33
|
-
--cmd, --command ACP agent command (default: "claude-code-acp")
|
|
34
47
|
-o, --output Output file (default: stdout)
|
|
35
48
|
-c, --cwd Working directory for agent
|
|
36
49
|
-t, --timeout Request timeout in ms (default: 60000)
|
|
37
|
-
-
|
|
50
|
+
-g, --grader Path to grader (.ts/.js module or executable script)
|
|
38
51
|
--progress Show progress to stderr
|
|
39
52
|
--append Append to output file
|
|
40
53
|
--mcp-server MCP server config JSON (repeatable)
|
|
41
54
|
-h, --help Show help
|
|
42
55
|
```
|
|
43
56
|
|
|
57
|
+
### trials
|
|
58
|
+
|
|
59
|
+
Run multiple trials per prompt for pass@k analysis:
|
|
60
|
+
|
|
61
|
+
```bash
|
|
62
|
+
acp-harness trials <prompts.jsonl> <command> [args...] [options]
|
|
63
|
+
|
|
64
|
+
Options:
|
|
65
|
+
-k Number of trials per prompt (default: 3)
|
|
66
|
+
-g, --grader Path to grader (computes pass@k/pass^k metrics)
|
|
67
|
+
... (same as capture)
|
|
68
|
+
```
|
|
69
|
+
|
|
44
70
|
## Input Format
|
|
45
71
|
|
|
46
72
|
```jsonl
|
|
@@ -48,16 +74,94 @@ Options:
|
|
|
48
74
|
{"id":"test-002","input":"Fix the TypeScript error","metadata":{"category":"bugfix"}}
|
|
49
75
|
```
|
|
50
76
|
|
|
51
|
-
## Output
|
|
77
|
+
## Output Format
|
|
78
|
+
|
|
79
|
+
The harness outputs full trajectory JSONL (`CaptureResult` schema):
|
|
80
|
+
|
|
81
|
+
```jsonl
|
|
82
|
+
{
|
|
83
|
+
"id": "test-001",
|
|
84
|
+
"input": "Create a primary button",
|
|
85
|
+
"output": "Here's a button component...",
|
|
86
|
+
"expected": "should contain <button>",
|
|
87
|
+
"trajectory": [...],
|
|
88
|
+
"metadata": {"category": "ui", "agent": "bunx claude-code-acp"},
|
|
89
|
+
"timing": {"start": 1234567890, "end": 1234567900},
|
|
90
|
+
"toolErrors": false,
|
|
91
|
+
"score": {"pass": true, "score": 1.0, "reasoning": "Contains expected"}
|
|
92
|
+
}
|
|
93
|
+
```
|
|
94
|
+
|
|
95
|
+
Key fields:
|
|
96
|
+
- `toolErrors`: Boolean indicating if any tool calls failed
|
|
97
|
+
- `score`: Grader result (only if `--grader` provided)
|
|
98
|
+
- `trajectory`: Full execution trace (thoughts, messages, tool calls, plans)
|
|
99
|
+
|
|
100
|
+
## Graders
|
|
101
|
+
|
|
102
|
+
Graders score agent outputs. The harness supports two types:
|
|
103
|
+
|
|
104
|
+
### TypeScript/JavaScript Graders
|
|
105
|
+
|
|
106
|
+
Export a `grade` function:
|
|
107
|
+
|
|
108
|
+
```typescript
|
|
109
|
+
import type { Grader } from '@plaited/acp-harness/schemas'
|
|
110
|
+
|
|
111
|
+
export const grade: Grader = async ({ input, output, expected, trajectory }) => {
|
|
112
|
+
const pass = output.toLowerCase().includes(expected?.toLowerCase() ?? '')
|
|
113
|
+
return {
|
|
114
|
+
pass,
|
|
115
|
+
score: pass ? 1.0 : 0.0,
|
|
116
|
+
reasoning: pass ? 'Contains expected answer' : 'Missing expected answer'
|
|
117
|
+
}
|
|
118
|
+
}
|
|
119
|
+
```
|
|
120
|
+
|
|
121
|
+
```bash
|
|
122
|
+
acp-harness capture prompts.jsonl bunx claude-code-acp --grader ./grader.ts
|
|
123
|
+
```
|
|
124
|
+
|
|
125
|
+
### Polyglot Graders (Python, etc.)
|
|
126
|
+
|
|
127
|
+
Any executable script using stdin/stdout JSON protocol:
|
|
128
|
+
|
|
129
|
+
```python
|
|
130
|
+
#!/usr/bin/env python3
|
|
131
|
+
import json
|
|
132
|
+
import sys
|
|
133
|
+
|
|
134
|
+
data = json.load(sys.stdin)
|
|
135
|
+
output = data["output"].lower()
|
|
136
|
+
expected = (data.get("expected") or "").lower()
|
|
137
|
+
|
|
138
|
+
pass_result = expected in output if expected else True
|
|
139
|
+
print(json.dumps({
|
|
140
|
+
"pass": pass_result,
|
|
141
|
+
"score": 1.0 if pass_result else 0.0,
|
|
142
|
+
"reasoning": "Contains expected" if pass_result else "Missing expected"
|
|
143
|
+
}))
|
|
144
|
+
```
|
|
145
|
+
|
|
146
|
+
```bash
|
|
147
|
+
chmod +x grader.py
|
|
148
|
+
acp-harness capture prompts.jsonl bunx claude-code-acp --grader ./grader.py
|
|
149
|
+
```
|
|
52
150
|
|
|
53
|
-
|
|
151
|
+
**Protocol:**
|
|
152
|
+
- Input (stdin): `{"input": "...", "output": "...", "expected": "...", "trajectory": [...]}`
|
|
153
|
+
- Output (stdout): `{"pass": true, "score": 1.0, "reasoning": "..."}`
|
|
154
|
+
|
|
155
|
+
## Downstream Integration
|
|
54
156
|
|
|
55
157
|
```bash
|
|
56
|
-
#
|
|
57
|
-
|
|
158
|
+
# Filter failures
|
|
159
|
+
cat results.jsonl | jq 'select(.score.pass == false)'
|
|
160
|
+
|
|
161
|
+
# Extract tool usage patterns
|
|
162
|
+
cat results.jsonl | jq '.trajectory[] | select(.type == "tool_call") | .name'
|
|
58
163
|
|
|
59
|
-
#
|
|
60
|
-
cat results.jsonl | jq 'select(.status == "failed")'
|
|
164
|
+
# Use with your scoring pipeline
|
|
61
165
|
cat results.jsonl | your-scoring-script.ts
|
|
62
166
|
```
|
|
63
167
|
|
|
@@ -69,10 +173,10 @@ This package includes an **acp-harness skill** for AI coding agents with complet
|
|
|
69
173
|
- Output format schemas
|
|
70
174
|
- Integration patterns (Braintrust, jq, custom scorers)
|
|
71
175
|
|
|
72
|
-
**
|
|
176
|
+
**Other AI coding agents:**
|
|
73
177
|
|
|
74
178
|
```bash
|
|
75
|
-
|
|
179
|
+
curl -fsSL https://raw.githubusercontent.com/plaited/marketplace/main/install.sh | bash -s -- --agent <agent-name> --plugin development-skills
|
|
76
180
|
```
|
|
77
181
|
|
|
78
182
|
## Development
|
|
@@ -86,7 +190,7 @@ bun test # Run unit tests
|
|
|
86
190
|
## Requirements
|
|
87
191
|
|
|
88
192
|
- **Runtime:** Bun >= 1.2.9
|
|
89
|
-
- **ACP Adapter:** `@
|
|
193
|
+
- **ACP Adapter:** `@anthropic-ai/claude-code-acp` or compatible
|
|
90
194
|
- **API Key:** `ANTHROPIC_API_KEY` environment variable
|
|
91
195
|
|
|
92
196
|
## License
|