npm - @plaited/acp-harness - Versions diffs - 0.2.5 → 0.3.1 - Mend

@plaited/acp-harness 0.2.5 → 0.3.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (57) hide show

package/LICENSE +1 -1
package/README.md +120 -16
package/bin/cli.ts +105 -636
package/bin/tests/cli.spec.ts +218 -51
package/package.json +20 -4
package/src/acp-client.ts +5 -4
package/src/acp-transport.ts +14 -7
package/src/adapter-check.ts +542 -0
package/src/adapter-scaffold.ts +934 -0
package/src/balance.ts +232 -0
package/src/calibrate.ts +300 -0
package/src/capture.ts +457 -0
package/src/constants.ts +94 -0
package/src/grader-loader.ts +174 -0
package/src/harness.ts +35 -0
package/src/schemas-cli.ts +239 -0
package/src/schemas.ts +567 -0
package/src/summarize.ts +245 -0
package/src/tests/adapter-check.spec.ts +70 -0
package/src/tests/adapter-scaffold.spec.ts +112 -0
package/src/tests/fixtures/grader-bad-module.ts +5 -0
package/src/tests/fixtures/grader-exec-fail.py +9 -0
package/src/tests/fixtures/grader-exec-invalid.py +6 -0
package/src/tests/fixtures/grader-exec.py +29 -0
package/src/tests/fixtures/grader-module.ts +14 -0
package/src/tests/grader-loader.spec.ts +153 -0
package/src/trials.ts +395 -0
package/src/validate-refs.ts +188 -0
package/.claude/rules/accuracy.md +0 -43
package/.claude/rules/bun-apis.md +0 -80
package/.claude/rules/code-review.md +0 -254
package/.claude/rules/git-workflow.md +0 -37
package/.claude/rules/github.md +0 -154
package/.claude/rules/testing.md +0 -172
package/.claude/skills/acp-harness/SKILL.md +0 -310
package/.claude/skills/acp-harness/assets/Dockerfile.acp +0 -25
package/.claude/skills/acp-harness/assets/docker-compose.acp.yml +0 -19
package/.claude/skills/acp-harness/references/downstream.md +0 -288
package/.claude/skills/acp-harness/references/output-formats.md +0 -221
package/.claude-plugin/marketplace.json +0 -15
package/.claude-plugin/plugin.json +0 -16
package/.github/CODEOWNERS +0 -6
package/.github/workflows/ci.yml +0 -63
package/.github/workflows/publish.yml +0 -146
package/.mcp.json +0 -20
package/CLAUDE.md +0 -92
package/Dockerfile.test +0 -23
package/biome.json +0 -96
package/bun.lock +0 -513
package/docker-compose.test.yml +0 -21
package/scripts/bun-test-wrapper.sh +0 -46
package/src/acp.constants.ts +0 -56
package/src/acp.schemas.ts +0 -161
package/src/acp.types.ts +0 -28
package/src/tests/fixtures/.claude/settings.local.json +0 -8
package/src/tests/fixtures/.claude/skills/greeting/SKILL.md +0 -17
package/tsconfig.json +0 -32

package/.claude/skills/acp-harness/references/output-formats.md DELETED Viewed

@@ -1,221 +0,0 @@
-# Output Formats
-The harness supports two output formats optimized for different use cases.
-## Format Selection
-```bash
-bunx @plaited/acp-harness prompts.jsonl --format <format> -o <output>
-```
-| Format | Files Created | Use Case |
-|--------|---------------|----------|
-| `summary` | Single JSONL | Quick metrics, dashboards, jq analysis |
-| `judge` | `.md` + `.full.jsonl` | Downstream LLM-as-judge scoring |
-## Summary Format (Default)
-Minimal JSONL for quick metrics and analysis.
-### Schema
-```typescript
-type SummaryResult = {
-  id: string                    // Prompt identifier
-  input: string                 // Original prompt text
-  output: string                // Final agent response
-  toolCalls: string[]           // List of tool names used
-  status: 'passed' | 'failed' | 'error' | 'timeout'
-  duration: number              // Total execution time (ms)
-}
-```
-### Example Output
-```jsonl
-{"id":"test-001","input":"Create a primary button","output":"I created the button in src/button.tsx","toolCalls":["Write"],"status":"passed","duration":1234}
-{"id":"test-002","input":"Fix the TypeScript error","output":"I fixed the type error...","toolCalls":["Read","Edit"],"status":"passed","duration":2567}
-```
-### Analysis with jq
-```bash
-# Calculate average duration
-cat results.jsonl | jq -s 'map(.duration) | add / length'
-# Count tool usage
-cat results.jsonl | jq -s 'map(.toolCalls) | flatten | group_by(.) | map({tool: .[0], count: length})'
-# Filter by status
-cat results.jsonl | jq 'select(.status == "failed")'
-# Pass rate
-cat results.jsonl | jq -s 'map(select(.status == "passed")) | length as $p | length as $t | "\($p)/\($t) passed"'
-```
-## Judge Format (Two-Tier)
-Creates two files for downstream LLM-as-judge scoring with step-level correlation.
-```bash
-bunx @plaited/acp-harness prompts.jsonl --format judge -o results
-# Creates: results.md + results.full.jsonl
-```
-### Markdown File (`<output>.md`)
-Human-readable summary with step IDs and code previews.
-**Structure:**
-```markdown
-## Capture Record: <id>
-**Input:** <original prompt>
-**Trajectory:**
-1. [THOUGHT] <truncated content> [->stepId]
-2. [TOOL:<name>] -> <status> (<duration>ms) [->stepId]
-   File: <path> (<size> chars)
-   ```<ext>
-   <head lines>
-   // ... N lines omitted ...
-   <tail lines>
-   ```
-3. [PLAN] <plan summary> [->stepId]
-4. [MESSAGE] <truncated content> [->stepId]
-**Output:** <truncated final output>
-**Metadata:** category=ui, agent=claude-code-acp, ...
-**Status:** passed|failed|error|timeout
-**Duration:** <ms>ms
----
-```
-**Step ID Format:** `<prompt-id>-step-<N>` (e.g., `test-001-step-2`)
-**Truncation Rules:**
-- Thought/message content: First 100 characters
-- Output: First 200 characters
-- Code preview: Head (8 lines) + tail (4 lines) for files > 12 lines
-### Full JSONL File (`<output>.full.jsonl`)
-Complete trajectory with step IDs for correlation.
-**Schema:**
-```typescript
-type FullResult = {
-  id: string
-  input: string
-  output: string
-  expected?: string
-  trajectory: IndexedStep[]     // Steps with stepId
-  metadata: Record<string, unknown>
-  timing: {
-    start: number               // Unix timestamp (ms)
-    end: number                 // Unix timestamp (ms)
-    firstResponse?: number      // Time to first response (ms)
-  }
-  status: 'passed' | 'failed' | 'error' | 'timeout'
-  errors?: string[]
-}
-type IndexedStep = TrajectoryStep & { stepId: string }
-type TrajectoryStep =
-  | { type: 'thought'; content: string; timestamp: number }
-  | { type: 'message'; content: string; timestamp: number }
-  | {
-      type: 'tool_call'
-      name: string              // Tool title from ACP SDK
-      status: string            // pending, in_progress, completed, failed
-      input?: unknown           // Raw input parameters
-      output?: unknown          // Raw output
-      duration?: number         // Execution time (ms)
-      timestamp: number
-    }
-  | { type: 'plan'; entries: PlanEntry[]; timestamp: number }
-```
-**Example:**
-```jsonl
-{"id":"test-001","input":"Create a primary button","output":"I created the button...","trajectory":[{"type":"thought","content":"I'll create a styled button template with createStyles","timestamp":100,"stepId":"test-001-step-1"},{"type":"tool_call","name":"Write","status":"completed","input":{"file_path":"src/button.tsx","content":"import { createStyles }..."},"output":"File written successfully","duration":234,"timestamp":150,"stepId":"test-001-step-2"},{"type":"message","content":"I created the button template","timestamp":500,"stepId":"test-001-step-3"}],"metadata":{"category":"ui","agent":"claude"},"timing":{"start":1704067200000,"end":1704067201234,"firstResponse":100},"status":"passed"}
-```
-## Two-Tier Scoring Workflow
-### Direct Scoring (Large Context)
-For judges with large context windows (Gemini 1M+, Claude 200k):
-```bash
-# Feed full JSONL directly
-cat results.full.jsonl | your-gemini-judge.ts
-```
-### Step-Level Retrieval (Small Context)
-For smaller models or step-specific analysis:
-```typescript
-// Load both files
-const markdown = await Bun.file('results.md').text()
-const fullLines = (await Bun.file('results.full.jsonl').text()).trim().split('\n')
-// Parse full results indexed by step ID
-const stepIndex = new Map<string, unknown>()
-for (const line of fullLines) {
-  const result = JSON.parse(line)
-  for (const step of result.trajectory) {
-    stepIndex.set(step.stepId, step)
-  }
-}
-// Judge requests full content for specific step
-const stepId = 'test-001-step-2'  // From markdown [->stepId]
-const fullStep = stepIndex.get(stepId)
-console.log(fullStep.input)  // Complete tool input
-```
-## Status Values
-| Status | Meaning |
-|--------|---------|
-| `passed` | Completed without tool errors |
-| `failed` | Completed but one or more tool calls failed |
-| `error` | Unhandled exception during execution |
-| `timeout` | Request exceeded timeout limit |
-## Input Format
-Both formats accept the same JSONL input:
-```jsonl
-{"id":"test-001","input":"Create a primary button","expected":"should contain <button>","metadata":{"category":"ui"}}
-```
-| Field | Required | Description |
-|-------|----------|-------------|
-| `id` | Yes | Unique identifier |
-| `input` | Yes | Prompt text for the agent |
-| `expected` | No | Expected output (for downstream scoring) |
-| `metadata` | No | Tags, category, difficulty for filtering |
-| `timeout` | No | Override default timeout for this prompt |
-## Streaming Behavior
-Both formats stream output line-by-line as results complete:
-```bash
-# Watch results in real-time
-bunx @plaited/acp-harness prompts.jsonl --progress -o results.jsonl &
-tail -f results.jsonl
-```
-Use `--append` to continue interrupted runs without overwriting previous results.

package/.claude-plugin/marketplace.json DELETED Viewed

@@ -1,15 +0,0 @@
-{
-  "name": "plaited-acp-harness",
-  "metadata": {
-    "description": "Claude Code plugin for Agent Client Protocol testing and evaluation"
-  },
-  "owner": {
-    "name": "Plaited Labs"
-  },
-  "plugins": [
-    {
-      "name": "acp-harness",
-      "source": "./"
-    }
-  ]
-}

package/.claude-plugin/plugin.json DELETED Viewed

@@ -1,16 +0,0 @@
-{
-  "name": "acp-harness",
-  "version": "0.2.5",
-  "description": "Agent Client Protocol client and evaluation harness for TypeScript/Bun projects. Includes: createACPClient for programmatic agent access, run-harness.ts for trajectory capture, and LLM-as-judge evaluation templates. Requires Bun runtime.",
-  "author": {
-    "name": "Plaited Labs"
-  },
-  "license": "ISC",
-  "mcpServers": {
-    "agent-client-protocol": {
-      "type": "http",
-      "url": "https://agentclientprotocol.com/mcp"
-    }
-  },
-  "skills": "./.claude/skills"
-}

package/.github/CODEOWNERS DELETED Viewed

@@ -1,6 +0,0 @@
-# Code owners for acp-harness repository
-# These users will be automatically requested for review when someone opens a pull request.
-# See https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-code-owners
-# Organization admins own all files
-* @EdwardIrby @alisonailea

package/.github/workflows/ci.yml DELETED Viewed

@@ -1,63 +0,0 @@
-name: CI
-on:
-  push:
-    branches:
-      - main
-  pull_request:
-permissions:
-  contents: read
-jobs:
-  # Detect which paths changed to conditionally run expensive jobs
-  changes:
-    runs-on: ubuntu-latest
-    permissions:
-      pull-requests: read
-    outputs:
-      acp: ${{ steps.filter.outputs.acp }}
-    steps:
-      - uses: actions/checkout@v4
-      - uses: dorny/paths-filter@v3
-        id: filter
-        with:
-          filters: |
-            acp:
-              - 'src/**'
-  test-pr:
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v4
-      - uses: oven-sh/setup-bun@v2
-        with:
-          bun-version: latest
-      - name: Install dependencies
-        run: bun install
-      - name: Run check
-        run: bun run check
-      - name: Run test
-        run: bun run test
-  # ACP integration tests run in Docker container for consistent environment
-  # Only runs when src/ files change to reduce API costs
-  test-acp-integration:
-    needs: changes
-    if: ${{ needs.changes.outputs.acp == 'true' }}
-    runs-on: ubuntu-latest
-    container:
-      image: oven/bun:1.2.9
-      options: --user root
-    steps:
-      - uses: actions/checkout@v4
-      - name: Install dependencies
-        run: bun install --frozen-lockfile
-      - name: Run ACP integration tests
-        env:
-          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
-        run: bun test ./src/tests/acp-integration.docker.ts

package/.github/workflows/publish.yml DELETED Viewed

@@ -1,146 +0,0 @@
-name: Publish
-on:
-  workflow_dispatch:
-    inputs:
-      version:
-        description: "New version tag (e.g., 1.0.0)"
-        required: true
-      next:
-        description: "Next prerelease number (optional, will create a version like x.y.z-next.N)"
-        required: false
-jobs:
-  publish:
-    runs-on: ubuntu-22.04
-    permissions:
-      contents: write
-      id-token: write
-    steps:
-      - name: Validate inputs
-        env:
-          INPUT_VERSION: ${{ github.event.inputs.version }}
-          INPUT_NEXT: ${{ github.event.inputs.next }}
-        run: |
-          # SECURITY: Validate all inputs before use to prevent injection
-          # Using env vars instead of direct interpolation to prevent shell injection
-          # Validate version format (semver)
-          if ! echo "$INPUT_VERSION" | grep -qE '^[0-9]+\.[0-9]+\.[0-9]+$'; then
-            echo "::error::Invalid version format: $INPUT_VERSION"
-            echo "::error::Expected: X.Y.Z (e.g., 1.0.0)"
-            exit 1
-          fi
-          # Validate next is a number if provided
-          if [[ -n "$INPUT_NEXT" ]] && ! echo "$INPUT_NEXT" | grep -qE '^[0-9]+$'; then
-            echo "::error::Invalid next value: $INPUT_NEXT"
-            echo "::error::Expected: number (e.g., 1, 2, 3)"
-            exit 1
-          fi
-          echo "✓ Input validation passed"
-      - name: Set version
-        id: set_version
-        env:
-          INPUT_VERSION: ${{ github.event.inputs.version }}
-          INPUT_NEXT: ${{ github.event.inputs.next }}
-        run: |
-          base_version="$INPUT_VERSION"
-          next="$INPUT_NEXT"
-          if [[ -n "$next" ]]; then
-            echo "version=${base_version}-next.${next}" >> $GITHUB_OUTPUT
-            echo "is_prerelease=true" >> $GITHUB_OUTPUT
-          else
-            echo "version=${base_version}" >> $GITHUB_OUTPUT
-            echo "is_prerelease=false" >> $GITHUB_OUTPUT
-          fi
-      - name: Validate version format
-        env:
-          VERSION: ${{ steps.set_version.outputs.version }}
-        run: |
-          version="$VERSION"
-          # Check for 'v' prefix (common mistake)
-          if [[ "$version" =~ ^v ]]; then
-            echo "::error::Version should not include 'v' prefix"
-            echo "::error::Enter '1.3.4' not 'v1.3.4'"
-            echo "::error::The workflow will automatically add 'v' for Git tags"
-            exit 1
-          fi
-          # Validate semver format (allows dots in prerelease suffix for formats like 1.0.0-next.1)
-          if [[ ! "$version" =~ ^([0-9]+)\.([0-9]+)\.([0-9]+)(-([a-zA-Z0-9._-]+))?$ ]]; then
-            echo "::error::Version must have the format 'x.y.z' or 'x.y.z-<string>', where x, y, and z are numbers."
-            exit 1
-          fi
-          echo "✅ Version format validated: $version"
-      - name: Checkout repository
-        uses: actions/checkout@v4
-        with:
-          token: ${{ secrets.GH_PAT }}
-      - name: Setup Node.js (for NPM OIDC)
-        uses: actions/setup-node@v4
-        with:
-          node-version: "lts/*"
-          registry-url: https://registry.npmjs.org
-          check-latest: true
-      - uses: oven-sh/setup-bun@v1
-      - name: Setup
-        run: |
-          bun install
-      - name: Configure Git
-        env:
-          GIT_AUTHOR_NAME: ${{ github.actor }}
-          GIT_AUTHOR_EMAIL: ${{ github.actor }}@users.noreply.github.com
-        run: |
-          git config user.name "$GIT_AUTHOR_NAME"
-          git config user.email "$GIT_AUTHOR_EMAIL"
-      - name: Version
-        env:
-          GH_TOKEN: ${{ secrets.GH_PAT }}
-          VERSION: ${{ steps.set_version.outputs.version }}
-          IS_PRERELEASE: ${{ steps.set_version.outputs.is_prerelease }}
-        run: |
-          npm version --no-git-tag-version "$VERSION"
-          # Update plugin.json version
-          jq --arg v "$VERSION" '.version = $v' .claude-plugin/plugin.json > .claude-plugin/plugin.json.tmp
-          mv .claude-plugin/plugin.json.tmp .claude-plugin/plugin.json
-          git add -A
-          git commit -m "ci: publish [skip ci]"
-          git push
-          # Create GitHub release
-          if [[ "$IS_PRERELEASE" == "true" ]]; then
-            gh release create "v$VERSION" --generate-notes --prerelease
-          else
-            gh release create "v$VERSION" --generate-notes
-          fi
-      - name: Publish to NPM (Trusted Publishing)
-        env:
-          IS_PRERELEASE: ${{ steps.set_version.outputs.is_prerelease }}
-        run: |
-          # NPM trusted publishing uses OIDC token automatically
-          # Provenance attestations are generated automatically (npm >=11.5.1)
-          # No NPM_TOKEN secret needed - authentication via GitHub OIDC
-          if [[ "$IS_PRERELEASE" == "true" ]]; then
-            echo "Publishing as next (prerelease)"
-            npm publish --access public --tag next
-          else
-            echo "Publishing as latest (stable)"
-            npm publish --access public
-          fi

package/.mcp.json DELETED Viewed

@@ -1,20 +0,0 @@
-{
-  "mcpServers": {
-    "agent-skills-spec": {
-      "type": "http",
-      "url": "https://agentskills.io/mcp"
-    },
-    "agent-client-protocol": {
-      "type": "http",
-      "url": "https://agentclientprotocol.com/mcp"
-    },
-    "model-context-protocol-docs":{
-      "type": "http",
-      "url": "https://modelcontextprotocol.io/mcp"
-    },
-    "braintrust-docs": {
-      "type": "http",
-      "url": "https://www.braintrust.dev/docs/mcp"
-    }
-  }
-}

package/CLAUDE.md DELETED Viewed

@@ -1,92 +0,0 @@
-# CLAUDE.md
-This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
-## Essential Commands
-### Development Setup
-```bash
-# Install dependencies (requires bun >= v1.2.9)
-bun install
-# Type, lint, and format check (check only, no fixes)
-bun run check
-# Lint and format fix (auto-fix issues)
-bun run check:write
-# Run unit tests
-bun test
-# Run Docker integration tests (requires ANTHROPIC_API_KEY)
-ANTHROPIC_API_KEY=sk-... bun run test:docker
-```
-## Project Organization
-This project uses `.claude/rules/` for project-specific guidance:
-- **Testing**: @.claude/rules/testing.md - Test commands and workflow
-- **Code Review**: @.claude/rules/code-review.md - Review standards
-- **Accuracy**: @.claude/rules/accuracy.md - Confidence thresholds
-- **Bun APIs**: @.claude/rules/bun-apis.md - Bun platform API preferences
-- **Git Workflow**: @.claude/rules/git-workflow.md - Commit conventions
-- **GitHub**: @.claude/rules/github.md - GitHub CLI integration
-## Quick Reference
-### Package Overview
-`@plaited/acp-harness` is a CLI tool for capturing agent trajectories from ACP-compatible agents. It executes prompts, captures full trajectories (tools, thoughts, plans), and outputs structured JSONL for downstream scoring.
-**CLI usage:**
-```bash
-bunx @plaited/acp-harness prompts.jsonl -o results.jsonl
-```
-### Code Style Essentials
-- Prefer arrow functions and `type` over `interface`
-- Use `test` instead of `it` in test files
-- Prefer Bun native APIs over Node.js equivalents
-- Object parameters for functions with 2+ parameters
-- JSON imports require `with { type: 'json' }` attribute
-For complete conventions, see `.claude/rules/code-review.md`
-### Plugin Development
-This project is a Claude Code plugin distributed via the plaited/marketplace aggregator. Structure:
-- `.claude/.claude-plugin/plugin.json` - Plugin manifest
-- `.claude/` - Plugin source (skills, rules, settings)
-When working on plugins:
-- Clear cache after changes: `rm -rf ~/.claude/plugins-cache`
-- Restart Claude Code to see updates
-- Skills are auto-invoked (won't show in `/plugins` UI)
-- Test installation locally: `claude plugins add github:plaited/marketplace`
-### Documentation
-- Public APIs require comprehensive TSDoc documentation
-- No `@example` sections - tests are living examples
-- Use `@internal` marker for non-public APIs
-- Always use `type` over `interface`
-- Use Mermaid diagrams only (not ASCII art)
-## Important Constraints
-1. **No Open Contribution**: This is open-source but not open-contribution
-2. **Bun Required**: Development requires bun >= v1.2.9
-3. **ES2024 Features**: Uses Promise.withResolvers() and other modern APIs
-## Plugin
-The bundled **acp-harness** skill (`.claude/skills/acp-harness/`) provides:
-- CLI usage and examples
-- Output format specifications
-- Downstream integration patterns
-Install via Claude Code: `/plugin marketplace add plaited/acp-harness`
-See `.claude/skills/acp-harness/SKILL.md` for complete documentation.

package/Dockerfile.test DELETED Viewed

@@ -1,23 +0,0 @@
-# Dockerfile for integration tests requiring Docker environment
-# Uses same bun version as CI for consistency
-#
-# Tests use *.docker.ts naming to:
-# 1. Avoid bun test pattern matching in normal test runs
-# 2. Signal these tests require Docker (external APIs, etc.)
-FROM oven/bun:1.2.9
-# Install git (required for some operations)
-RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*
-WORKDIR /app
-# Copy source
-COPY . .
-# Install dependencies
-RUN bun install --frozen-lockfile
-# Run all Docker integration tests
-# The wrapper script handles finding and running *.docker.ts files
-CMD ["bash", "scripts/bun-test-wrapper.sh"]