@plaited/acp-harness 0.2.6 → 0.3.1

Files changed (57)
  1. package/LICENSE +1 -1
  2. package/README.md +120 -16
  3. package/bin/cli.ts +105 -636
  4. package/bin/tests/cli.spec.ts +218 -51
  5. package/package.json +20 -4
  6. package/src/acp-client.ts +5 -4
  7. package/src/acp-transport.ts +14 -7
  8. package/src/adapter-check.ts +542 -0
  9. package/src/adapter-scaffold.ts +934 -0
  10. package/src/balance.ts +232 -0
  11. package/src/calibrate.ts +300 -0
  12. package/src/capture.ts +457 -0
  13. package/src/constants.ts +94 -0
  14. package/src/grader-loader.ts +174 -0
  15. package/src/harness.ts +35 -0
  16. package/src/schemas-cli.ts +239 -0
  17. package/src/schemas.ts +567 -0
  18. package/src/summarize.ts +245 -0
  19. package/src/tests/adapter-check.spec.ts +70 -0
  20. package/src/tests/adapter-scaffold.spec.ts +112 -0
  21. package/src/tests/fixtures/grader-bad-module.ts +5 -0
  22. package/src/tests/fixtures/grader-exec-fail.py +9 -0
  23. package/src/tests/fixtures/grader-exec-invalid.py +6 -0
  24. package/src/tests/fixtures/grader-exec.py +29 -0
  25. package/src/tests/fixtures/grader-module.ts +14 -0
  26. package/src/tests/grader-loader.spec.ts +153 -0
  27. package/src/trials.ts +395 -0
  28. package/src/validate-refs.ts +188 -0
  29. package/.claude/rules/accuracy.md +0 -43
  30. package/.claude/rules/bun-apis.md +0 -80
  31. package/.claude/rules/code-review.md +0 -254
  32. package/.claude/rules/git-workflow.md +0 -37
  33. package/.claude/rules/github.md +0 -154
  34. package/.claude/rules/testing.md +0 -172
  35. package/.claude/skills/acp-harness/SKILL.md +0 -310
  36. package/.claude/skills/acp-harness/assets/Dockerfile.acp +0 -25
  37. package/.claude/skills/acp-harness/assets/docker-compose.acp.yml +0 -19
  38. package/.claude/skills/acp-harness/references/downstream.md +0 -288
  39. package/.claude/skills/acp-harness/references/output-formats.md +0 -221
  40. package/.claude-plugin/marketplace.json +0 -15
  41. package/.claude-plugin/plugin.json +0 -16
  42. package/.github/CODEOWNERS +0 -6
  43. package/.github/workflows/ci.yml +0 -63
  44. package/.github/workflows/publish.yml +0 -146
  45. package/.mcp.json +0 -20
  46. package/CLAUDE.md +0 -92
  47. package/Dockerfile.test +0 -23
  48. package/biome.json +0 -96
  49. package/bun.lock +0 -513
  50. package/docker-compose.test.yml +0 -21
  51. package/scripts/bun-test-wrapper.sh +0 -46
  52. package/src/acp.constants.ts +0 -56
  53. package/src/acp.schemas.ts +0 -161
  54. package/src/acp.types.ts +0 -28
  55. package/src/tests/fixtures/.claude/settings.local.json +0 -8
  56. package/src/tests/fixtures/.claude/skills/greeting/SKILL.md +0 -17
  57. package/tsconfig.json +0 -32
package/LICENSE CHANGED
@@ -1,6 +1,6 @@
  ISC License
 
- Copyright (c) 2025 Plaited
+ Copyright (c) 2026 Plaited Labs
 
  Permission to use, copy, modify, and/or distribute this software for any
  purpose with or without fee is hereby granted, provided that the above
package/README.md CHANGED
@@ -10,37 +10,63 @@ CLI tool for capturing agent trajectories from ACP-compatible agents. Execute pr
 
  ```bash
  # Run without installing
- bunx @plaited/acp-harness prompts.jsonl -o results.jsonl
+ bunx @plaited/acp-harness capture prompts.jsonl bunx claude-code-acp -o results.jsonl
 
  # Or install globally
  bun add -g @plaited/acp-harness
- acp-harness prompts.jsonl -o results.jsonl
+ acp-harness capture prompts.jsonl bunx claude-code-acp -o results.jsonl
  ```
 
  **Prerequisite:** Install an ACP adapter and set your API key:
 
  ```bash
- npm install -g @zed-industries/claude-code-acp
+ npm install -g @anthropic-ai/claude-code-acp
  export ANTHROPIC_API_KEY=sk-...
  ```
 
- ## Usage
+ ## Commands
+
+ | Command | Purpose |
+ |---------|---------|
+ | `capture` | Trajectory capture (full JSONL) |
+ | `trials` | Multi-run with pass@k metrics |
+ | `summarize` | Derive compact views from results |
+ | `calibrate` | Sample failures for review |
+ | `validate-refs` | Check reference solutions |
+ | `balance` | Analyze test set coverage |
+ | `schemas` | Export JSON schemas |
+
+ ### capture
+
+ Capture full trajectories from an ACP agent:
 
  ```bash
- acp-harness <prompts.jsonl> [options]
+ acp-harness capture <prompts.jsonl> <command> [args...] [options]
 
  Options:
- --cmd, --command ACP agent command (default: "claude-code-acp")
  -o, --output Output file (default: stdout)
  -c, --cwd Working directory for agent
  -t, --timeout Request timeout in ms (default: 60000)
- -f, --format Output format: summary, judge (default: summary)
+ -g, --grader Path to grader (.ts/.js module or executable script)
  --progress Show progress to stderr
  --append Append to output file
  --mcp-server MCP server config JSON (repeatable)
  -h, --help Show help
  ```
 
+ ### trials
+
+ Run multiple trials per prompt for pass@k analysis:
+
+ ```bash
+ acp-harness trials <prompts.jsonl> <command> [args...] [options]
+
+ Options:
+ -k Number of trials per prompt (default: 3)
+ -g, --grader Path to grader (computes pass@k/pass^k metrics)
+ ... (same as capture)
+ ```
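The `trials` help text above mentions pass@k and pass^k without defining them. As a rough illustration only, the sketch below assumes one common convention: pass@k is 1 when at least one of the k trials passes, and pass^k is 1 only when all k trials pass. This is an assumption about the metric, not the harness's actual implementation, and the function name `pass_metrics` is hypothetical.

```python
from typing import Sequence


def pass_metrics(trial_passes: Sequence[bool]) -> dict:
    """Summarize the k trial outcomes recorded for a single prompt.

    Assumed convention (not taken from the harness source):
      pass_at_k  -> 1.0 if any trial passed
      pass_hat_k -> 1.0 if every trial passed
    """
    k = len(trial_passes)
    if k == 0:
        raise ValueError("need at least one trial")
    passed = sum(1 for outcome in trial_passes if outcome)
    return {
        "k": k,
        "pass_rate": passed / k,
        "pass_at_k": 1.0 if passed > 0 else 0.0,
        "pass_hat_k": 1.0 if passed == k else 0.0,
    }


# Three trials for one prompt: two passes and one failure.
print(pass_metrics([True, False, True]))
```

Under this convention a flaky prompt scores well on pass@k and poorly on pass^k, which is why the two numbers are usually reported together.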
+
  ## Input Format
 
  ```jsonl
@@ -48,16 +74,94 @@ Options:
  {"id":"test-002","input":"Fix the TypeScript error","metadata":{"category":"bugfix"}}
  ```
 
- ## Output
+ ## Output Format
+
+ The harness outputs full trajectory JSONL (`CaptureResult` schema):
+
+ ```jsonl
+ {
+   "id": "test-001",
+   "input": "Create a primary button",
+   "output": "Here's a button component...",
+   "expected": "should contain <button>",
+   "trajectory": [...],
+   "metadata": {"category": "ui", "agent": "bunx claude-code-acp"},
+   "timing": {"start": 1234567890, "end": 1234567900},
+   "toolErrors": false,
+   "score": {"pass": true, "score": 1.0, "reasoning": "Contains expected"}
+ }
+ ```
+
+ Key fields:
+ - `toolErrors`: Boolean indicating if any tool calls failed
+ - `score`: Grader result (only if `--grader` provided)
+ - `trajectory`: Full execution trace (thoughts, messages, tool calls, plans)
+
+ ## Graders
+
+ Graders score agent outputs. The harness supports two types:
+
+ ### TypeScript/JavaScript Graders
+
+ Export a `grade` function:
+
+ ```typescript
+ import type { Grader } from '@plaited/acp-harness/schemas'
+
+ export const grade: Grader = async ({ input, output, expected, trajectory }) => {
+   const pass = output.toLowerCase().includes(expected?.toLowerCase() ?? '')
+   return {
+     pass,
+     score: pass ? 1.0 : 0.0,
+     reasoning: pass ? 'Contains expected answer' : 'Missing expected answer'
+   }
+ }
+ ```
+
+ ```bash
+ acp-harness capture prompts.jsonl bunx claude-code-acp --grader ./grader.ts
+ ```
+
+ ### Polyglot Graders (Python, etc.)
+
+ Any executable script using stdin/stdout JSON protocol:
+
+ ```python
+ #!/usr/bin/env python3
+ import json
+ import sys
+
+ data = json.load(sys.stdin)
+ output = data["output"].lower()
+ expected = (data.get("expected") or "").lower()
+
+ pass_result = expected in output if expected else True
+ print(json.dumps({
+     "pass": pass_result,
+     "score": 1.0 if pass_result else 0.0,
+     "reasoning": "Contains expected" if pass_result else "Missing expected"
+ }))
+ ```
+
+ ```bash
+ chmod +x grader.py
+ acp-harness capture prompts.jsonl bunx claude-code-acp --grader ./grader.py
+ ```
 
- The harness captures trajectories and outputs structured JSONL. **You provide the scoring logic.**
+ **Protocol:**
+ - Input (stdin): `{"input": "...", "output": "...", "expected": "...", "trajectory": [...]}`
+ - Output (stdout): `{"pass": true, "score": 1.0, "reasoning": "..."}`
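The stdin/stdout protocol above can be exercised outside the harness before a grader is wired in. The following is a minimal sketch, assuming the grader is an executable that reads one JSON object on stdin and prints one JSON object on stdout. The helper name `run_grader` is hypothetical, and checking only `pass` and `score` is an assumption, since the README does not say which fields the harness requires.

```python
import json
import subprocess


def run_grader(grader_cmd: list[str], payload: dict) -> dict:
    """Pipe one JSON payload to a grader executable and parse its JSON reply."""
    proc = subprocess.run(
        grader_cmd,
        input=json.dumps(payload),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the grader exits non-zero
    )
    result = json.loads(proc.stdout)

    # Shape checks based on the example reply in the protocol description.
    if not isinstance(result.get("pass"), bool):
        raise ValueError("grader reply needs a boolean 'pass' field")
    score = result.get("score")
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("grader reply needs a numeric 'score' field")
    return result
```

For example, `run_grader(["./grader.py"], {"input": "Create a primary button", "output": "<button>OK</button>", "expected": "<button>", "trajectory": []})` would return the parsed reply of the Python grader shown earlier, provided that script is executable.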
+
+ ## Downstream Integration
 
  ```bash
- # Capture trajectories
- acp-harness prompts.jsonl -o results.jsonl
+ # Filter failures
+ cat results.jsonl | jq 'select(.score.pass == false)'
+
+ # Extract tool usage patterns
+ cat results.jsonl | jq '.trajectory[] | select(.type == "tool_call") | .name'
 
- # Score with your tools
- cat results.jsonl | jq 'select(.status == "failed")'
+ # Use with your scoring pipeline
  cat results.jsonl | your-scoring-script.ts
  ```
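The jq filters above have a direct scripted counterpart. As a hedged sketch, assuming each line of `results.jsonl` holds one complete JSON object with the `toolErrors` and optional `score` fields shown in the output example, a short script can tally pass and tool-error counts. The function name `summarize` and the sample records are illustrative, not taken from the package.

```python
import json
from typing import Iterable


def summarize(lines: Iterable[str]) -> dict:
    """Count graded, passing, and tool-error records in JSONL result lines."""
    total = graded = passed = tool_errors = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        total += 1
        if record.get("toolErrors"):
            tool_errors += 1
        score = record.get("score")  # absent when no grader was supplied
        if score is not None:
            graded += 1
            if score.get("pass"):
                passed += 1
    return {
        "total": total,
        "graded": graded,
        "passed": passed,
        "tool_errors": tool_errors,
    }


# Made-up records for illustration; real input would come from open("results.jsonl").
sample = [
    '{"id": "test-001", "toolErrors": false, "score": {"pass": true, "score": 1.0}}',
    '{"id": "test-002", "toolErrors": true, "score": {"pass": false, "score": 0.0}}',
    '{"id": "test-003", "toolErrors": false}',
]
print(summarize(sample))
```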
 
@@ -69,10 +173,10 @@ This package includes an **acp-harness skill** for AI coding agents with complet
  - Output format schemas
  - Integration patterns (Braintrust, jq, custom scorers)
 
- **Install via Claude Code:**
+ **Other AI coding agents:**
 
  ```bash
- /plugin marketplace add plaited/marketplace
+ curl -fsSL https://raw.githubusercontent.com/plaited/marketplace/main/install.sh | bash -s -- --agent <agent-name> --plugin development-skills
  ```
 
  ## Development
@@ -86,7 +190,7 @@ bun test # Run unit tests
  ## Requirements
 
  - **Runtime:** Bun >= 1.2.9
- - **ACP Adapter:** `@zed-industries/claude-code-acp` or compatible
+ - **ACP Adapter:** `@anthropic-ai/claude-code-acp` or compatible
  - **API Key:** `ANTHROPIC_API_KEY` environment variable
 
  ## License