@htechcs/harness-kit 0.1.0 → 0.1.1
- package/README.en.md +8 -8
- package/README.md +8 -8
- package/bin/cli.js +43 -43
- package/docs/harness-engineering-tutorial.en.md +1 -1
- package/docs/harness-engineering-tutorial.md +1 -1
- package/package.json +1 -1
- package/skills/init-harness/SKILL.md +74 -74
- package/templates/agents/README.md +25 -24
- package/templates/agents/repo-explorer.md +16 -16
- package/templates/evals/README.md +39 -35
- package/templates/evals/cases/example-task.md +22 -22
- package/templates/evals/observability.md +43 -42
- package/templates/guardrails/README.md +59 -57
- package/templates/long-running/README.md +29 -28
- package/templates/long-running/TASK.md +19 -19
- package/templates/mcp-audit.md +16 -16
- package/templates/new-worktree.sh +16 -16
- package/templates/setup.sh +25 -25
- package/templates/spec/FEATURE.md +19 -19
- package/templates/spec/README.md +20 -20
@@ -1,48 +1,49 @@
-# Long-running —
+# Long-running — work that spans more than one session (Level 4)
 
-
-agent
+Level 4 keeps the agent working *reliably* on long-horizon work: multi-day refactors, big migrations,
+an agent running in the background for hours. Long work dies from 3 things — each file here blocks one.
 
-## 3
+## 3 files, 3 problems
 
-| File |
-
-| [`setup.sh`](../setup.sh) |
-| [`TASK.md`](./TASK.md) |
-| [`new-worktree.sh`](../new-worktree.sh) |
+| File | Which death it blocks |
+|------|-----------------------|
+| [`setup.sh`](../setup.sh) | Can't get back to a runnable state (a fresh clone/agent doesn't know what to install) |
+| [`TASK.md`](./TASK.md) | State lost across sessions / after `/clear` |
+| [`new-worktree.sh`](../new-worktree.sh) | Parallel tasks step on each other |
 
-##
+## Install
 
 ```bash
-cp setup.sh new-worktree.sh . #
+cp setup.sh new-worktree.sh . # into the repo root
 chmod +x setup.sh new-worktree.sh
-cp long-running/TASK.md . #
+cp long-running/TASK.md . # when you start a specific long task
 ```
 
-
+Then **point `CLAUDE.md`** at them so the agent knows to use them:
 
 ```md
 ## Build / Test / Run
--
+- Set up the environment: ./setup.sh (idempotent — safe to re-run)
 
 ## Long-running
--
+- An in-flight long task is tracked in TASK.md — read it before starting, update it at milestones.
 ```
 
-##
+## When to use which (this is the discipline part)
 
-
+The files are just tools — knowing *when* to reach for them is the skill:
 
-- **`setup.sh`** —
-
-
-
-
-
-
+- **`setup.sh`** — run it the moment you enter a fresh checkout, and whenever you suspect the
+environment drifted. Keep it **idempotent + fail-fast**: a background agent must be able to rebuild
+it without asking.
+- **`TASK.md`** — open it when the work *will span multiple sessions* or you're about to `/clear`.
+Update it at **milestones** (a part done, a decision locked), not every line of code. It's what the
+next agent reads to continue — write it so a "stranger" understands, not as shorthand for yourself.
+- **`new-worktree.sh`** — split off a worktree when running **2+ directions in parallel**, or when a
+long task needs isolation from the main branch. One task = one worktree. Done: `git worktree remove <dir>`.
 
-##
+## Running in the background & checking back
 
-
-
-
+Genuinely long work can run in the background while you check back (no babysitting). The principle:
+a chunk of work must be **resumable** — if it's cut off mid-way, `TASK.md` + `setup.sh` are enough to
+pick up from the latest milestone instead of starting over. That's why those two files exist.
@@ -1,33 +1,33 @@
 <!--
-TASK.md —
+TASK.md — the memory that survives a long task (Level 4).
 
-
-
+Work that spans more than one session loses its state on `/clear` or when a new session opens.
+This file is where the agent (and the next agent) reads to CONTINUE, instead of rebuilding from scratch.
 
-
--
--
--
-
+How to use:
+- Put it at the repo root or the task folder. Update it AT milestones, not every line.
+- Point CLAUDE.md at it: "An in-flight long task is tracked in TASK.md — read it before starting."
+- When the work is done, delete / archive it; don't let a stale TASK.md add noise.
+Delete this comment when you start using it for real.
 -->
 
-# Task: <
+# Task: <one line — what you're doing>
 
-##
-
+## Goal
+<A measurable definition of "done". When is this task complete?>
 
-##
-<
+## Where things stand (update at each milestone)
+<Current state in 2–4 bullets. What already works, what doesn't yet.>
 -
 
-##
-<
+## Next steps
+<The concrete next action, clear enough to start immediately without re-thinking.>
 - [ ]
 
-##
-<
+## Decisions locked (don't reopen)
+<Choices already made + a short WHY. So the next agent doesn't re-litigate from scratch.>
 -
 
-##
-<
+## Pitfalls hit
+<Things tried that failed, dead ends — so you don't walk into them again.>
 -
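A minimal sketch of how the `## Next steps` section of this template could be read back when a session resumes. It assumes open steps are kept as `- [ ]` checkboxes under that exact heading; the `next_steps` helper is hypothetical and not part of the package.

```bash
# Hypothetical helper: print the unticked checkboxes under "## Next steps".
# Runs in a subshell (parenthesised body) so its variables do not leak.
next_steps() (
  file="${1:-TASK.md}"
  [ -f "$file" ] || { echo "no $file: nothing to resume" >&2; exit 1; }
  # Every "## " heading switches the flag on or off; only open boxes
  # inside the "## Next steps" section are printed.
  awk '/^## /{insec = ($0 == "## Next steps")} insec && /^- \[ \]/' "$file"
)
```

Called with no argument it reads `TASK.md` in the current directory, which matches the "read it before starting" instruction in the template comment.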
package/templates/mcp-audit.md
CHANGED
@@ -1,26 +1,26 @@
-# MCP hygiene — checklist
+# MCP hygiene — a checklist for pruning surplus tools (Level 2)
 
-
-
-model "
-(
+Every MCP server you plug in **injects all of its tool definitions into context on EVERY turn** —
+even turns that don't use it. Five rarely-used servers can eat thousands of tokens per turn and
+make the model "fumble" when choosing a tool. This is a **permanent context tax**, not a safety
+issue (safety is Level 3).
 
-##
+## See what's plugged in
 
 ```bash
-claude mcp list #
+claude mcp list # list the configured MCP servers
 ```
 
-##
+## For EACH server, ask 3 questions
 
-- [ ] **
-- [ ] **
-- [ ] **
+- [ ] **Did you actually call its tools in the past month?** No → remove it.
+- [ ] **How many tools does it add per turn when you only use 1–2?** Lots of surplus → remove it or find a leaner build.
+- [ ] **Can an existing CLI replace it?** (e.g. `gh` instead of a GitHub MCP for simple things) → prefer the CLI, drop the MCP.
 
-##
+## Principle
 
-> **
->
+> **Few clear tools > many overlapping ones.** Keep exactly what you use weekly.
+> Add more only when you *genuinely* need a new source/capability, not "just in case".
 
-
-
+Plugging in an MCP "just in case" is the most common Level 2 trap: it quietly makes every session
+more expensive and less accurate, without anyone noticing.
@@ -1,35 +1,35 @@
 #!/usr/bin/env bash
-# new-worktree.sh —
+# new-worktree.sh — create a git worktree + its own branch for a parallel/long task.
 #
-#
-#
+# Level 4 (long-running): each heavy task should have its OWN working tree, so parallel runs
+# don't step on each other (no constant stash/switch, no overwriting each other's edits).
 #
-#
-#
-# ->
+# Usage: ./new-worktree.sh <task-name>
+# E.g.: ./new-worktree.sh refactor-auth
+# -> creates branch 'refactor-auth' + dir ../<repo>-refactor-auth (NEXT TO the repo, not nested).
 
 set -euo pipefail
 
 name="${1:-}"
-[ -n "$name" ] || { echo "
+[ -n "$name" ] || { echo "usage: $0 <task-name> (e.g. refactor-auth)"; exit 1; }
 
-#
-git rev-parse --is-inside-work-tree >/dev/null 2>&1 || { echo "
+# Must be inside a git repo.
+git rev-parse --is-inside-work-tree >/dev/null 2>&1 || { echo "not a git repo"; exit 1; }
 
 root="$(git rev-parse --show-toplevel)"
 repo="$(basename "$root")"
-dir="$root/../${repo}-${name}" #
+dir="$root/../${repo}-${name}" # place it NEXT TO the repo, not nested -> git status stays clean
 
-[ -e "$dir" ] && { echo "
+[ -e "$dir" ] && { echo "already exists: $dir"; exit 1; } # fail-fast, don't overwrite
 
-#
+# Branch from the latest base. Change 'main' if your repo uses a different name.
 base="$(git symbolic-ref --quiet --short HEAD || echo main)"
 
 if git show-ref --verify --quiet "refs/heads/${name}"; then
-  git worktree add "$dir" "$name" # branch
+  git worktree add "$dir" "$name" # branch exists -> check it out into the new worktree
 else
-  git worktree add -b "$name" "$dir" "$base" # branch
+  git worktree add -b "$name" "$dir" "$base" # new branch from the current base
 fi
 
-echo "
-echo " cd \"$dir\"
+echo "worktree: $dir (branch: $name)"
+echo " cd \"$dir\" to start. When done: git worktree remove \"$dir\""
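To illustrate the naming rule from the script header, the sibling path can be computed without touching git. The `worktree_dir` helper below is hypothetical; it only mirrors the `dir=` line of the script and is meant as a sketch, not as part of the package.

```bash
# Hypothetical helper: print the directory new-worktree.sh would create
# for a given repo root and task name, i.e. <root>/../<repo>-<task>.
# The parenthesised body runs in a subshell, so root/name stay local.
worktree_dir() (
  root="$1"
  name="$2"
  printf '%s/../%s-%s\n' "$root" "$(basename "$root")" "$name"
)

# Example: worktree_dir /home/me/app refactor-auth
# prints   /home/me/app/../app-refactor-auth
```

After running the real script, `git worktree list` shows every worktree attached to the repo, which is a quick way to confirm the new directory landed next to the checkout rather than inside it.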
package/templates/setup.sh
CHANGED
@@ -1,51 +1,51 @@
 #!/usr/bin/env bash
-# setup.sh — bootstrap
+# setup.sh — environment bootstrap: "one command makes the repo runnable".
 #
-#
-#
+# Level 4 (long-running): a background/looping agent CANNOT stop to ask "what do I install now".
+# This script must bring a clean checkout to a runnable state, non-interactively.
 #
-# 2
-# - IDEMPOTENT:
-# - FAIL-FAST:
+# 2 mandatory principles:
+# - IDEMPOTENT: safe to run many times (no duplication, no broken state).
+# - FAIL-FAST: if something is missing, report it NOW and stop; don't continue blind.
 #
-#
-#
+# THIS IS A SKELETON — every repo installs different things. Fill in the TODOs for your repo,
+# then point CLAUDE.md here ("Build / Test / Run: run ./setup.sh first").
 
-set -euo pipefail #
+set -euo pipefail # error -> stop; unset var -> stop; mid-pipe error -> stop
 
 cd "$(dirname "$0")"
 
-# --- 1.
-#
+# --- 1. Check prerequisites (fail-fast) ------------------------------------
+# Missing base tools are reported immediately; don't let a vague error surface later.
 require() {
-  command -v "$1" >/dev/null 2>&1 || { echo "
+  command -v "$1" >/dev/null 2>&1 || { echo "missing '$1' — install it then re-run"; exit 1; }
 }
-# TODO:
+# TODO: list the tools your repo needs
 require git
 # require node
 # require python3
 
-# --- 2.
-#
-# TODO:
+# --- 2. Install dependencies (idempotent) ----------------------------------
+# Most package managers are already idempotent — re-running is a no-op.
+# TODO: replace with your repo's commands
 # npm ci
 # uv sync
 # go mod download
 
-# --- 3.
-#
+# --- 3. Config / secrets (fail-fast, do NOT invent values) -----------------
+# If .env.example exists but .env doesn't -> tell the runner to fill it in; don't guess values.
 # if [ -f .env.example ] && [ ! -f .env ]; then
-# echo "
+# echo "no .env yet — copy .env.example and fill in the secrets"; exit 1
 # fi
 
-# --- 4.
-# TODO: DB/queue... —
+# --- 4. Supporting services (idempotent) -----------------------------------
+# TODO: DB/queue... — use create-only-if-missing commands
 # docker compose up -d
 
-# --- 5. Verify:
-#
-# TODO:
+# --- 5. Verify: prove the environment actually runs ------------------------
+# Don't end with a blind "done" — run one cheap check to be sure.
+# TODO: your repo's fastest smoke test
 # npm run build --silent
 # python -c "import yourpkg"
 
-echo "
+echo "environment ready"
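One way the two principles in the skeleton could be filled in, shown as a sketch under assumptions: `require` here returns a status instead of exiting so it can be reused in other scripts, and `run_once` is a hypothetical stamp-file guard for steps whose underlying command is not idempotent on its own. Neither variant is shipped in the package.

```bash
# Fail-fast: report a missing tool on stderr and return non-zero.
# (Variant of the skeleton's require(); returns instead of exiting.)
require() {
  command -v "$1" >/dev/null 2>&1 || {
    echo "missing '$1': install it, then re-run" >&2
    return 1
  }
}

# Idempotent: run the given command only if the stamp file is absent,
# and create the stamp only when the command succeeds.
# Usage: run_once <stamp-file> <command> [args...]
run_once() {
  _stamp="$1"
  shift
  if [ -e "$_stamp" ]; then
    echo "skip: $_stamp already present"
    return 0
  fi
  "$@" && touch "$_stamp"
}
```

A failed command leaves no stamp behind, so the next run retries the step instead of silently skipping it.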
@@ -1,40 +1,40 @@
 <!--
-FEATURE.md — spec
+FEATURE.md — spec one feature BEFORE writing code (spec-driven development).
 
-
-
-
+This is the "Specs" half of the "Repo-local instructions & Specs" pillar: CLAUDE.md (Level 1) covers
+durable rules across EVERY task; FEATURE.md defines ONE specific feature before you start. A clear spec
+→ the agent runs in the right direction; clear acceptance criteria → something a Level 5 eval can grade.
 
-
-
+How to use: copy it per significant feature (e.g. docs/specs/<feature>.md). Skip small jobs — don't
+ritualize it. Fill in the sections, delete this comment when you use it for real.
 -->
 
-# Feature: <
+# Feature: <short name>
 
-##
-<
+## Problem / why
+<What it solves and for whom. 1–3 sentences. If you can't state the "why", don't code yet.>
 
-##
+## Scope
 
-###
+### In scope
 -
 
-###
+### OUT of scope (important — blocks scope creep)
 -
 
-##
-<
+## Constraints
+<Technical/business musts: API must stay stable, no new dependencies, performance limits…>
 -
 
-## Acceptance criteria (
-<
+## Acceptance criteria (measurable — clear pass/fail)
+<What MUST be true for the feature to count as done. This also feeds the Level 5 golden task.>
 - [ ]
 - [ ]
 
 ## Test hooks
-<
+<Which tests / commands will check it. Prefer things that can be automated.>
 -
 
-##
-<
+## Design notes & decisions locked
+<Approach + a short why. So the next person (and the agent) don't re-litigate from scratch.>
 -
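Because the template keeps acceptance criteria as `- [ ]` checkboxes, a rough pass/fail signal can be derived by counting the boxes that are still open. This is a sketch only: `open_criteria` is a hypothetical helper, and it assumes no other section of the spec uses `- [ ]` items.

```bash
# Hypothetical helper: print how many "- [ ]" items are still unticked in a spec.
# The parenthesised body runs in a subshell, so "file" stays local.
open_criteria() (
  file="$1"
  [ -f "$file" ] || { echo "no such spec: $file" >&2; exit 2; }
  # grep -c prints 0 but exits 1 when nothing matches; that is not an error here.
  grep -c '^- \[ \]' "$file" || true
)
```

A result of `0` means every criterion has been ticked; any other number is the count of criteria still open.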
package/templates/spec/README.md
CHANGED
@@ -1,34 +1,34 @@
-# Specs — spec-driven development (
+# Specs — spec-driven development (the other half of Pillar 2)
 
-
-**instructions
-
-
+Pillar 2 is **"Repo-local instructions & Specs"**. Level 1 (`/init-harness` → `CLAUDE.md`) covers the
+**instructions** half: durable rules that hold across *every* task. `FEATURE.md` covers the **specs**
+half: defining **one specific feature BEFORE you code it**. A clear spec → the agent strays less; clear
+acceptance criteria → you already have what a Level 5 eval grades pass/fail.
 
-##
+## When to use it
 
--
--
+- A feature **big enough to get lost in** / multi-step / multi-person → write a spec first.
+- A small, one-shot job → **skip it**. Don't turn specs into ritual.
 
-## `FEATURE.md` vs `TASK.md` (
+## `FEATURE.md` vs `TASK.md` (don't confuse them — they complement each other)
 
-| | `FEATURE.md` (
+| | `FEATURE.md` (here) | `TASK.md` (Level 4) |
 |---|---|---|
-|
-|
-|
+| Purpose | **plan BEFORE coding**: what + pass criteria | **state that survives ACROSS sessions** |
+| Lifecycle | written once at the feature's start, changes little | updated continuously at each milestone |
+| Answers | "what *is* this feature, when is it done" | "where am I *right now*, what's next" |
 
-
+A big feature usually uses **both**: `FEATURE.md` fixes the target, `TASK.md` tracks the path.
 
-##
+## Install
 
 ```bash
 mkdir -p docs/specs
-cp FEATURE.md docs/specs/<
+cp FEATURE.md docs/specs/<feature-name>.md
 ```
 
-##
+## Advanced
 
-
-
-
+Big feature / whole team → use a dedicated framework: **GitHub Spec Kit**, **12-Factor Agents** (links
+in `docs/harness-engineering-tutorial.md`). `FEATURE.md` is just a minimal starting point — enough to
+have a spec, without forcing you to adopt a whole framework.