npm - agent-harness-kit - Versions diffs - 0.8.0 → 0.9.0 - Mend

agent-harness-kit 0.8.0 → 0.9.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (49) hide show

package/src/templates/.harness/eval/tasks/feature-step-done.answer.md ADDED Viewed

@@ -0,0 +1,53 @@
+# Golden answer: feature-step-done
+This file is read by `feature-step-done.mjs` rubric as a reference for
+what an acceptable agent run looks like. The rubric does not require
+byte-exact match — it checks structural properties (file count, JSON
+shape) rather than identical content.
+## Files expected in the agent's diff (representative)
+- `src/runtime/health.ts` (or equivalent path for the project's stack)
+- `tests/health.test.ts` (or equivalent test path)
+- `feature_list.json` (modified in place)
+- `.harness/PROGRESS.md` (appended)
+## feature_list.json shape after the agent's edit
+```json
+{
+  "features": [
+    {
+      "id": "health-endpoint",
+      "title": "GET /health returns 200",
+      "passes": true,
+      "steps": [
+        {
+          "id": "s1",
+          "passes": true,
+          "tests": ["tests/health.test.ts"]
+        }
+      ]
+    }
+  ]
+}
+```
+Key invariants the rubric checks:
+1. `features[0].steps[0].passes === true`
+2. `features[0].steps[0].tests` is a non-empty array
+3. At least one path in `tests` exists in the agent's file diff
+4. `features.length` is unchanged from setup (no new features mid-session)
+## Transcript shape expected
+The transcript should include:
+- A call to `/add-feature` (or equivalent) early in the run.
+- At least one Write/Edit on the handler file.
+- At least one Write/Edit on a test file matching the `tests[]` array.
+- An Edit on `feature_list.json` flipping `passes: true`.
+The rubric does not require exact tool-call order — only that all four
+events appear in the transcript.

package/src/templates/.harness/eval/tasks/feature-step-done.json ADDED Viewed

@@ -0,0 +1,10 @@
+{
+  "id": "feature-step-done",
+  "description": "Verifies that when an agent implements a feature step, it flips passes:false→true in feature_list.json AND adds a tests[] reference (or testCommit). Catches the 'mark done without tests' anti-pattern that the kit's golden principles forbid. Graded by .harness/eval/rubrics/feature-step-done.mjs.",
+  "input": "feature_list.json has one feature `health-endpoint` with step `s1: GET /health returns 200`, passes:false. Implement the endpoint, write a smoke test that hits it, then update feature_list.json#features[0].steps[0] with passes:true AND tests:[<test_file_path>]. Do not delete or reorder other entries.",
+  "expected": {
+    "filesChanged": { "min": 2, "max": 5 },
+    "tokensMax": 25000,
+    "rubric": ".harness/eval/rubrics/feature-step-done.mjs"
+  }
+}

package/src/templates/.harness/eval/tasks/feature-step-done.prompt.md ADDED Viewed

@@ -0,0 +1,43 @@
+# Eval task: feature-step-done
+## What the harness is testing
+The kit's "no done without proof" rule: an agent that flips a feature
+step from `passes: false` to `passes: true` MUST also commit a test
+covering the new behavior. This eval gives the agent a one-step feature,
+asks it to implement, and grades whether the test landed alongside the
+flip.
+## Prompt given to the agent
+```
+feature_list.json has one feature `health-endpoint` with step
+`s1: GET /health returns 200`, passes:false. Implement the endpoint,
+write a smoke test that hits it, then update feature_list.json#features[0].steps[0]
+with passes:true AND tests:[<test_file_path>]. Do not delete or
+reorder other entries.
+```
+## What "good" looks like
+1. The agent invokes `/add-feature` (or `/refactor-feature` for a re-shape).
+2. A handler file appears (e.g. `src/runtime/health.ts`).
+3. A test file appears (e.g. `tests/health.test.ts`).
+4. `feature_list.json` is edited in-place:
+   - `features[0].steps[0].passes` is now `true`.
+   - `features[0].steps[0].tests` includes the new test path.
+5. PROGRESS.md gets a one-line append (kit convention).
+## What "bad" looks like
+- Passes flipped to true with no test file in the diff. (Hard fail.)
+- New feature added to feature_list.json mid-session. (Hard fail.)
+- Step entry deleted or reordered. (Hard fail.)
+- Refactor of unrelated code in the same commit. (Soft fail.)
+## Why this matters
+Without enforcement, the most common agent failure is "looks done"
+(passes:true) without test coverage. The kit's `refactor-feature`
+side-car gates this at edit time; the eval rubric confirms the gate
+holds against an end-to-end run.

package/src/templates/.mcp.json.example ADDED Viewed

@@ -0,0 +1,35 @@
+{
+  "$schema": "https://json.schemastore.org/claude-code-mcp.json",
+  "_comment": "Rename to .mcp.json or run `agent-harness-kit init --with-mcp` to enable. Each server below is OFF until uncommented + credentialed.",
+  "mcpServers": {
+    "playwright": {
+      "_comment": "Headless browser for /review-this-pr UI smoke checks. Requires `npx playwright install` first.",
+      "command": "npx",
+      "args": ["-y", "@playwright/mcp@latest"],
+      "env": {
+        "PLAYWRIGHT_BROWSERS_PATH": "0"
+      }
+    },
+    "github": {
+      "_comment": "Read/write GitHub issues + PRs from inside Claude Code. Needs GITHUB_PERSONAL_ACCESS_TOKEN with `repo` + `read:org` scopes.",
+      "command": "npx",
+      "args": ["-y", "@modelcontextprotocol/server-github"],
+      "env": {
+        "GITHUB_PERSONAL_ACCESS_TOKEN": "${GITHUB_PERSONAL_ACCESS_TOKEN}"
+      }
+    },
+    "filesystem-readonly": {
+      "_comment": "Read-only access to a sibling repo (docs / reference code). Adjust ALLOWED_PATHS for your layout.",
+      "command": "npx",
+      "args": ["-y", "@modelcontextprotocol/server-filesystem"],
+      "env": {
+        "ALLOWED_PATHS": "${HOME}/Dev/reference-repo"
+      }
+    }
+  },
+  "_recommended_skills": {
+    "playwright": "Useful for /review-this-pr when UI files changed — runs smoke against a dev server.",
+    "github": "Useful for /garbage-collection when proposing PRs and for /review-this-pr to read base branch.",
+    "filesystem-readonly": "Useful when /inspect-module needs to peek at a sibling repo without copying code in."
+  }
+}

package/src/templates/scripts/pretooluse-edit-guard.sh.hbs ADDED Viewed

@@ -0,0 +1,115 @@
+#!/usr/bin/env bash
+# PreToolUse hook (matcher: Edit|Write|MultiEdit) — denies direct edits to
+# protected paths. Catches the failure mode where the agent decides to
+# "just fix" a baseline file or .claude/ template instead of going through
+# the proper /garbage-collection or scaffold-refresh paths.
+#
+# Protected paths (and why):
+#   1. .claude/                         — skills, agents, hooks, settings.
+#                                         Use /upgrade flow or edit the source
+#                                         template in src/templates/.
+#   2. node_modules/                    — package state, regenerated by install.
+#   3. .git/                            — repo internals, never hand-edited.
+#   4. .harness/structural-baseline.json — bypasses monotonic guard. Use the
+#                                          /garbage-collection skill.
+#   5. .harness/installed.json          — kit lockfile, derived from render.
+#                                         Hand edits cause spurious "drift"
+#                                         warnings on next upgrade.
+#
+# Escape hatches:
+#   - AHK_ALLOW_BYPASS=1 → log + allow (audit trail in .harness/bypass.log).
+#   - AHK_HOOK_MODE=warn → log only, never deny.
+set -eo pipefail
+INPUT=$(cat)
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+have_jq() {
+  [ "${AHK_DISABLE_JQ:-}" = "1" ] && return 1
+  command -v jq >/dev/null 2>&1
+}
+have_jp() {
+  have_jq && return 0
+  command -v node >/dev/null 2>&1 && [ -f "$SCRIPT_DIR/_lib/json-pick.mjs" ] && return 0
+  return 1
+}
+jp() {
+  if have_jq; then jq -r "$1"
+  else node "$SCRIPT_DIR/_lib/json-pick.mjs" "$1"
+  fi
+}
+if ! have_jp; then exit 0; fi
+# Resolve target file. Write/Edit ship .tool_input.file_path; MultiEdit ships
+# the same field at the top level. Both carry the absolute or repo-relative
+# path. We normalise via Node to strip any leading ./ and use forward slashes.
+FILE=$(echo "$INPUT" | jp '.tool_input.file_path // .tool_input.path // empty')
+[ -z "$FILE" ] && exit 0
+# Normalise to a path relative to CWD when possible; otherwise keep absolute.
+REL_FILE="$FILE"
+if [ -n "$PWD" ] && [[ "$FILE" == "$PWD"/* ]]; then
+  REL_FILE="${FILE#"$PWD"/}"
+fi
+REL_FILE="${REL_FILE#./}"
+REASON=""
+case "$REL_FILE" in
+  .claude/*|*/.claude/*)
+    REASON=".claude/ is owned by the kit's scaffold. To change a skill/agent/hook, edit src/templates/.claude/ in the kit source and re-run 'agent-harness-kit upgrade', or override at the user level (~/.claude/)."
+    ;;
+  node_modules/*|*/node_modules/*)
+    REASON="node_modules/ is regenerated by the package manager. Edit package.json or the upstream package; never hand-edit installed files."
+    ;;
+  .git/*|*/.git/*)
+    REASON=".git/ contains repo internals. Use git commands ('git config', 'git update-ref', etc.) — never hand-edit."
+    ;;
+  .harness/structural-baseline.json)
+    REASON="Direct edits to .harness/structural-baseline.json bypass the baseline-monotonic guard. Use the /garbage-collection skill or fix the underlying violation."
+    ;;
+  .harness/installed.json)
+    REASON=".harness/installed.json is the kit lockfile, regenerated by 'agent-harness-kit init/upgrade'. Hand edits cause spurious drift warnings."
+    ;;
+esac
+if [ -z "$REASON" ]; then
+  exit 0
+fi
+# Warn-only mode.
+if [ "${AHK_HOOK_MODE:-}" = "warn" ]; then
+  echo "[ahk] pretooluse-edit-guard (warn): would deny edit to $REL_FILE — $REASON" >&2
+  exit 0
+fi
+# Bypass with audit log.
+if [ "${AHK_ALLOW_BYPASS:-}" = "1" ]; then
+  mkdir -p .harness
+  TS=$(date -u +%Y-%m-%dT%H:%M:%SZ)
+  SHA=$(git rev-parse --short HEAD 2>/dev/null || echo 'no-git')
+  ESCAPED=${REL_FILE//\"/\\\"}
+  printf '{"ts":"%s","sha":"%s","bypass":"AHK_ALLOW_BYPASS","file":"%s","rule":"pretooluse-edit-guard"}\n' \
+    "$TS" "$SHA" "$ESCAPED" >> .harness/bypass.log
+  exit 0
+fi
+# Deny via JSON.
+if command -v node >/dev/null 2>&1; then
+  node -e "
+    const reason = process.argv[1];
+    const out = {
+      hookSpecificOutput: {
+        hookEventName: 'PreToolUse',
+        permissionDecision: 'deny',
+        permissionDecisionReason: reason
+      }
+    };
+    process.stdout.write(JSON.stringify(out));
+  " "$REASON"
+elif have_jq; then
+  jq -nc --arg r "$REASON" \
+    '{hookSpecificOutput: {hookEventName: "PreToolUse", permissionDecision: "deny", permissionDecisionReason: $r}}'
+else
+  echo "$REASON" >&2
+  exit 2
+fi
+exit 0

package/src/templates/scripts/session-end.sh.hbs CHANGED Viewed

@@ -45,4 +45,10 @@ fi
 mkdir -p .harness
 TS=$(date +"%Y-%m-%d %H:%M")
 echo "$TS | session_end | $REASON | $BR | $SHA" >> .harness/PROGRESS.md
+# Rollup side-car — writes a JSONL record to .harness/telemetry.jsonl.
+# Best-effort: never blocks the cleanup-only SessionEnd contract.
+if command -v node >/dev/null 2>&1 && [ -f scripts/session-rollup.mjs ]; then
+  printf '%s' "$INPUT" | node scripts/session-rollup.mjs 2>/dev/null || true
+fi
 exit 0

package/src/templates/scripts/session-rollup.mjs ADDED Viewed

@@ -0,0 +1,96 @@
+#!/usr/bin/env node
+// session-rollup.mjs — deterministic SessionEnd side-car. Writes a single
+// JSONL record summarising the session into .harness/telemetry.jsonl. Pure
+// Node (no jq dependency).
+//
+// Record shape:
+//   { ts, event: "session_rollup", reason, branch, sha, uncommitted,
+//     skills_invoked: [...], session_id }
+//
+// Called from session-end.sh after the human-readable PROGRESS.md line is
+// written, so a single session contributes one PROGRESS.md line + one
+// telemetry rollup record.
+import { readFileSync, existsSync, mkdirSync, appendFileSync } from "node:fs";
+import { resolve } from "node:path";
+import { spawnSync } from "node:child_process";
+const ROOT = process.env.CLAUDE_PROJECT_DIR || process.cwd();
+function readStdinSync() {
+  // SessionEnd hooks pass JSON on stdin. fd 0 is the inherited stdin.
+  try {
+    return readFileSync(0, "utf8");
+  } catch {
+    return "";
+  }
+}
+function safeJSON(s) {
+  if (!s) return {};
+  try { return JSON.parse(s); } catch { return {}; }
+}
+function git(args, def = "") {
+  const r = spawnSync("git", args, { cwd: ROOT, encoding: "utf8" });
+  if (r.status !== 0) return def;
+  return (r.stdout || "").trim();
+}
+function recentSkillInvocations() {
+  // Tail of telemetry.jsonl: count skill_invoked records since the last
+  // session_rollup. If no prior rollup, count everything in the file (capped
+  // to 50 for sanity).
+  const path = resolve(ROOT, ".harness/telemetry.jsonl");
+  if (!existsSync(path)) return [];
+  const body = readFileSync(path, "utf8");
+  const lines = body.split("\n").filter(Boolean);
+  let startIdx = 0;
+  for (let i = lines.length - 1; i >= 0; i--) {
+    try {
+      const rec = JSON.parse(lines[i]);
+      if (rec.event === "session_rollup") {
+        startIdx = i + 1;
+        break;
+      }
+    } catch { /* skip malformed */ }
+  }
+  const window = lines.slice(startIdx);
+  const skills = [];
+  for (const line of window) {
+    try {
+      const rec = JSON.parse(line);
+      if (rec.event === "skill_invoked" && rec.skill) skills.push(rec.skill);
+    } catch { /* skip */ }
+  }
+  return skills.slice(-50);
+}
+function main() {
+  const input = safeJSON(readStdinSync());
+  const reason = input.end_reason || "unknown";
+  const sessionId = input.session_id || "";
+  const branch = git(["branch", "--show-current"], "(detached)");
+  const sha = git(["rev-parse", "--short", "HEAD"], "(no-git)");
+  const uncommittedRaw = git(["status", "--short"], "");
+  const uncommitted = uncommittedRaw ? uncommittedRaw.split("\n").filter(Boolean).length : 0;
+  const skills = recentSkillInvocations();
+  const record = {
+    ts: new Date().toISOString(),
+    event: "session_rollup",
+    reason,
+    session_id: sessionId,
+    branch,
+    sha,
+    uncommitted,
+    skills_invoked: skills,
+  };
+  const outPath = resolve(ROOT, ".harness/telemetry.jsonl");
+  mkdirSync(resolve(ROOT, ".harness"), { recursive: true });
+  appendFileSync(outPath, JSON.stringify(record) + "\n");
+}
+main();

package/src/templates/scripts/session-start.sh.hbs CHANGED Viewed

@@ -60,6 +60,31 @@ if command -v git >/dev/null 2>&1 && git rev-parse --git-dir >/dev/null 2>&1; th
   CTX+="[harness] git: branch=$BR, uncommitted=$COUNT file(s)"$'\n'
 fi
+# 1b. One-shot daily pill (harness version + open-feature reminder).
+# `mkdir -p .harness/state` then check the stamp file. Today's pill fires
+# once per UTC day per project; subsequent SessionStarts that day stay
+# silent on this line so the model doesn't see the same banner thirty
+# times per coding day.
+mkdir -p .harness/state 2>/dev/null || true
+STAMP_FILE=".harness/state/session-pill.stamp"
+TODAY=$(date -u +%Y-%m-%d)
+LAST=""
+[ -f "$STAMP_FILE" ] && LAST=$(cat "$STAMP_FILE" 2>/dev/null || echo "")
+if [ "$LAST" != "$TODAY" ]; then
+  HARNESS_VER=""
+  if [ -f harness.config.json ] && have_jp; then
+    HARNESS_VER=$(jp '.version // empty' harness.config.json 2>/dev/null || echo "")
+  fi
+  if [ -z "$HARNESS_VER" ] && [ -f .harness/installed.json ] && have_jp; then
+    HARNESS_VER=$(jp '.version // empty' .harness/installed.json 2>/dev/null || echo "")
+  fi
+  if [ -z "$HARNESS_VER" ]; then
+    HARNESS_VER="unknown"
+  fi
+  CTX+="[harness] pill (one/day): kit=$HARNESS_VER · date=$TODAY"$'\n'
+  printf '%s' "$TODAY" > "$STAMP_FILE" 2>/dev/null || true
+fi
 # 2. Current feature (from feature_list.json) — picks the first entry with
 #    passes=false so the model resumes the in-flight work, not a finished
 #    one. Skipped if file missing or jp unavailable.

package/src/templates/scripts/subagent-stop.sh.hbs ADDED Viewed

@@ -0,0 +1,76 @@
+#!/usr/bin/env bash
+# SubagentStop hook — fires when a subagent finishes its turn (Task tool).
+# Triggers the same structural-test that PostToolUse(Edit) runs, because a
+# subagent can edit files in batches that individually pass but jointly drift
+# off-layer. Running the check at subagent boundary catches that drift early.
+#
+# Contract:
+#   - Never blocks (exit 0 even on failure — the parent Stop hook handles the
+#     final gate). We only emit a stderr summary that Claude reads.
+#   - Telemetry append to .harness/telemetry.jsonl as {event:"subagent_stop"}.
+#   - Skipped when harness.config.json#structuralTest.engine === "none" (the
+#     "structural test not yet wired" escape hatch used by polyglot scaffolds).
+set -eo pipefail
+INPUT=$(cat)
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+have_jq() {
+  [ "${AHK_DISABLE_JQ:-}" = "1" ] && return 1
+  command -v jq >/dev/null 2>&1
+}
+have_jp() {
+  have_jq && return 0
+  command -v node >/dev/null 2>&1 && [ -f "$SCRIPT_DIR/_lib/json-pick.mjs" ] && return 0
+  return 1
+}
+jp() {
+  if have_jq; then jq -r "$1"
+  else node "$SCRIPT_DIR/_lib/json-pick.mjs" "$1"
+  fi
+}
+SUBAGENT="(unknown)"
+if have_jp; then
+  SUBAGENT=$(echo "$INPUT" | jp '.subagent // .session_id // "unknown"' 2>/dev/null || echo "unknown")
+fi
+# Telemetry first so we record every subagent boundary, even if the
+# structural-test bails below.
+mkdir -p .harness
+TS=$(date -u +%Y-%m-%dT%H:%M:%SZ)
+SHA=$(git rev-parse --short HEAD 2>/dev/null || echo 'no-git')
+printf '{"ts":"%s","event":"subagent_stop","subagent":"%s","sha":"%s"}\n' \
+  "$TS" "$SUBAGENT" "$SHA" >> .harness/telemetry.jsonl
+# Skip if structural test disabled.
+if [ -f harness.config.json ] \
+   && grep -qE '"engine"[[:space:]]*:[[:space:]]*"none"' harness.config.json; then
+  exit 0
+fi
+# AHK_HOOK_MODE=warn → log only, don't run.
+if [ "${AHK_HOOK_MODE:-}" = "warn" ]; then
+  exit 0
+fi
+# Run structural test workspace-wide. Subagents typically touch multiple
+# files; per-file scoping would miss the cross-file drift case. Cap output
+# to 30 lines on stderr so the parent agent sees the summary without flood.
+RAN=0
+if [ -f harness/structural-check.mjs ] && command -v node >/dev/null 2>&1; then
+  RAN=1
+  if ! node harness/structural-check.mjs 2>&1 | tail -30 >&2; then
+    echo "[ahk] subagent_stop: structural-test reported violations (see above). Continuing — parent Stop hook will gate." >&2
+  fi
+elif command -v npm >/dev/null 2>&1 && [ -f package.json ] \
+     && grep -q '"harness:check"' package.json 2>/dev/null; then
+  RAN=1
+  if ! npm run --silent harness:check 2>&1 | tail -30 >&2; then
+    echo "[ahk] subagent_stop: structural-test reported violations (see above). Continuing — parent Stop hook will gate." >&2
+  fi
+fi
+if [ "$RAN" = "0" ]; then
+  # No structural-test entry point. Skip silently — already logged in telemetry.
+  exit 0
+fi
+exit 0