claude-dev-env 1.36.2 → 1.37.1
- package/_shared/pr-loop/scripts/config/preflight_constants.py +29 -8
- package/_shared/pr-loop/scripts/preflight.py +242 -20
- package/_shared/pr-loop/scripts/tests/test_preflight.py +362 -25
- package/_shared/pr-loop/scripts/tests/test_preflight_constants.py +9 -14
- package/hooks/blocking/code_rules_enforcer.py +269 -23
- package/hooks/blocking/test_code_rules_enforcer_unused_imports.py +157 -1
- package/hooks/config/test_unused_module_import_constants.py +48 -0
- package/hooks/config/unused_module_import_constants.py +41 -0
- package/package.json +1 -1
- package/rules/gh-paginate.md +4 -50
- package/rules/no-historical-clutter.md +36 -0
- package/skills/bg-agent/SKILL.md +69 -0
- package/skills/bugteam/CONSTRAINTS.md +10 -19
- package/skills/bugteam/PROMPTS.md +21 -14
- package/skills/bugteam/SKILL.md +122 -208
- package/skills/bugteam/SKILL_EVALS.md +75 -114
- package/skills/bugteam/reference/README.md +2 -4
- package/skills/bugteam/reference/audit-and-teammates.md +21 -48
- package/skills/bugteam/reference/audit-contract.md +7 -7
- package/skills/bugteam/reference/design-rationale.md +3 -8
- package/skills/bugteam/reference/team-setup.md +11 -19
- package/skills/bugteam/reference/teardown-publish-permissions.md +2 -14
- package/skills/bugteam/scripts/config/__init__.py +0 -0
- package/skills/bugteam/scripts/config/reflow_skill_md_constants.py +12 -0
- package/skills/bugteam/scripts/reflow_skill_md.py +51 -47
- package/skills/bugteam/sources.md +1 -25
- package/skills/bugteam/test_skill_additions.py +4 -13
- package/skills/fresh-branch/SKILL.md +71 -0
- package/skills/gotcha/SKILL.md +73 -0
- package/skills/monitor-open-prs/SKILL.md +4 -37
- package/skills/monitor-open-prs/test_skill_contract.py +0 -5
- package/skills/pr-converge/SKILL.md +60 -1298
- package/skills/pr-converge/reference/convergence-gates.md +122 -0
- package/skills/pr-converge/reference/examples.md +76 -0
- package/skills/pr-converge/reference/fix-protocol.md +56 -0
- package/skills/pr-converge/reference/ground-rules.md +13 -0
- package/skills/pr-converge/reference/multi-pr-orchestration.md +204 -0
- package/skills/pr-converge/reference/per-tick.md +204 -0
- package/skills/pr-converge/reference/state-schema.md +19 -0
- package/skills/pr-converge/reference/stop-conditions.md +26 -0
- package/skills/pr-converge/scripts/README.md +36 -9
- package/skills/pr-converge/scripts/check_pr_mergeability.py +1 -2
- package/skills/pr-converge/scripts/config/pr_converge_constants.py +74 -5
- package/skills/pr-converge/scripts/config/reflow_skill_md_constants.py +13 -0
- package/skills/pr-converge/scripts/config/test_pr_converge_constants.py +0 -24
- package/skills/pr-converge/scripts/cursor-agents-continue.ahk +22 -2
- package/skills/pr-converge/scripts/fetch_bugbot_inline_comments.py +19 -59
- package/skills/pr-converge/scripts/fetch_bugbot_reviews.py +15 -61
- package/skills/pr-converge/scripts/fetch_claude_inline_comments.py +70 -0
- package/skills/pr-converge/scripts/fetch_claude_reviews.py +61 -0
- package/skills/pr-converge/scripts/fetch_copilot_inline_comments.py +19 -61
- package/skills/pr-converge/scripts/fetch_copilot_reviews.py +14 -74
- package/skills/pr-converge/scripts/reflow_skill_md.py +71 -50
- package/skills/pr-converge/scripts/reviewer_fetch_core.py +153 -0
- package/skills/pr-converge/scripts/reviewer_specs.py +98 -0
- package/skills/pr-converge/scripts/test_cursor_agents_continue.py +65 -0
- package/skills/pr-converge/scripts/test_fetch_bugbot_inline_comments.py +107 -6
- package/skills/pr-converge/scripts/test_fetch_bugbot_reviews.py +85 -6
- package/skills/pr-converge/scripts/test_fetch_claude_inline_comments.py +485 -0
- package/skills/pr-converge/scripts/test_fetch_claude_reviews.py +368 -0
- package/skills/pr-converge/scripts/test_fetch_copilot_inline_comments.py +74 -6
- package/skills/pr-converge/scripts/test_fetch_copilot_reviews.py +94 -8
- package/skills/pr-converge/scripts/test_reflow_skill_md.py +162 -0
- package/skills/pr-converge/scripts/test_reviewer_fetch_core.py +448 -0
- package/skills/pr-converge/scripts/test_reviewer_specs.py +107 -0
- package/skills/pr-converge/scripts/test_view_pr_context.py +44 -0
- package/skills/pr-converge/scripts/view_pr_context.py +35 -4
- package/skills/pr-converge/workflows/schedule-wakeup-loop.md +24 -22
- package/skills/bugteam/reference/workflow-path-a-orchestrated-teams.md +0 -113
- package/skills/bugteam/reference/workflow-path-b-task-harness.md +0 -48
- package/skills/bugteam/test_team_lifecycle.py +0 -103
- package/skills/monitor-open-prs/test_team_lifecycle.py +0 -46
- package/skills/pr-converge/scripts/open_followup_copilot_pr.py +0 -136
- package/skills/pr-converge/scripts/test_open_followup_copilot_pr.py +0 -236
- package/skills/pr-converge/test_team_lifecycle.py +0 -56
- package/skills/pr-converge/workflows/ahk-auto-continue-loop.md +0 -108
@@ -14,23 +14,20 @@ Evals are split into two layers. Both layers run against the same trace but carr

 ## Ironclad invariants (Layer A, apply to every eval)

-Each invariant cites the normative section or companion file it derives from.
+Each invariant cites the normative section or companion file it derives from. All spawns use `Agent(..., run_in_background=true)`. Invariants apply uniformly across all eval fixtures.

 | # | Invariant | Citation |
 |---|---|---|
-| I-1 |
-| I-2 | `Bash` invoking `scripts/revoke_project_claude_permissions.py` runs exactly once per invocation on every exit path,
-| I-3 |
-| I-4 |
-| I-5 |
-| I-6 |
-| I-7 |
-| I-8 |
-| I-9 |
-| I-10 |
-| I-11 | **Path A:** `git worktree remove` each PR → teammate shutdowns → `TeamDelete` → `rmtree` `<team_temp_dir>` → Step 4.5 → revoke. **Path B:** `git worktree remove` each PR → (omit shutdown / `TeamDelete`) → `rmtree` → Step 4.5 → revoke. | `SKILL.md` § Step 4; § Step 4.5; § Step 5; `reference/workflow-path-a-orchestrated-teams.md` § Step 4; `reference/workflow-path-b-task-harness.md` § Step 4 |
-| I-12 | **Path A:** Lead never posts PR review / finding / fix replies except Step 4.5 body. **Path B:** Lead performs Step 2.5 posts per deltas; Step 4.5 unchanged. | `CONSTRAINTS.md` — **Audit/fix comment posting** |
-| I-13 | **Path A:** Only the lead invokes `TeamCreate`; every teammate `Agent(..., team_name=...)`. **Path B:** no `TeamCreate`; `Task` spawns omit `team_name`. | `CONSTRAINTS.md` — **Path A — orchestrator-only `TeamCreate`**; `reference/workflow-path-b-task-harness.md` |
+| I-1 | `Bash` invoking `scripts/grant_project_claude_permissions.py` precedes the first audit `Agent` spawn. | `SKILL.md` § Step 0 |
+| I-2 | `Bash` invoking `scripts/revoke_project_claude_permissions.py` runs exactly once per invocation on every exit path, after teardown. | `SKILL.md` § Step 5 |
+| I-3 | Orchestration uses `Agent(..., run_in_background=true)` only — no `TeamCreate`, `TeamDelete`, `SendMessage`, or `Task` tool calls. | `SKILL.md` § Step 2; § Step 4 |
+| I-4 | `Agent` calls are fresh per loop (`run_in_background=true`; new `name` each loop). | `CONSTRAINTS.md` — **Fresh subagent per loop** |
+| I-5 | Audit sibling spawns pass `model="haiku"`; validator and fix spawns pass `model="opus"`. | `SKILL.md` § AUDIT action (parallel auditors); § FIX action; `CONSTRAINTS.md` — **Opus 4.7 at xhigh effort for validator and fix subagents** |
+| I-6 | Loop count ≤ 10 audits. 11th audit never fires. | `SKILL.md` YAML `description` (10-loop cap); § Step 3 (**Pre-audit** / **FIX** increment rules) |
+| I-7 | From loop 4 onward without convergence, eleven parallel `Agent(..., run_in_background=true)` calls in one message for audit. | `SKILL.md` § AUDIT action (**Parallel auditors**) |
+| I-8 | Lead reads `.bugteam-pr<N>-loop<L>.outcomes.xml` with the `Read` tool after each audit, before the next action. | `SKILL.md` § AUDIT action |
+| I-9 | Teardown sequence: `git worktree remove` each PR → `rmtree` `<run_temp_dir>` → Step 4.5 → revoke. | `SKILL.md` § Step 4; § Step 4.5; § Step 5 |
+| I-10 | The bugfind subagent posts ONE per-loop review; the bugfix subagent posts fix replies. The lead's only PR-write action is the Step 4.5 description rewrite. | `CONSTRAINTS.md` — **Audit/fix comment posting** |

 Any eval failing one or more Layer A invariants fails the run.

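The Layer A table above reads naturally as a set of predicates over a recorded tool-call trace. The following is a minimal sketch of two of them (I-3 and I-6), not the project's harness: the `ToolCall` record shape, the function names, and the reading of the 10-loop cap as "per PR" are all assumptions made for illustration. Only the forbidden tool names, the cap of 10, and the `bugfind-pr<N>-loop<L>` naming pattern come from the diff.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """Hypothetical trace record; the real harness may store calls differently."""
    tool: str                                  # e.g. "Bash", "Agent", "Read"
    args: dict = field(default_factory=dict)   # keyword arguments of the call


# Tool names that I-3 rules out, and the audit cap from I-6.
FORBIDDEN_TOOLS = {"TeamCreate", "TeamDelete", "SendMessage", "Task"}
MAX_AUDIT_LOOPS = 10

# Audit spawns are named bugfind-pr<N>-loop<L>, optionally with a suffix.
AUDIT_NAME = re.compile(r"^bugfind-pr(\d+)-loop(\d+)")


def check_i3(trace):
    """I-3: no TeamCreate, TeamDelete, SendMessage, or Task tool calls."""
    return [
        f"I-3: forbidden tool {call.tool} at call {index}"
        for index, call in enumerate(trace)
        if call.tool in FORBIDDEN_TOOLS
    ]


def check_i6(trace):
    """I-6: at most 10 distinct audit loops (assumed here to be counted per PR)."""
    loops_by_pr = {}
    for call in trace:
        if call.tool != "Agent":
            continue
        match = AUDIT_NAME.match(str(call.args.get("name", "")))
        if match:
            loops_by_pr.setdefault(match.group(1), set()).add(int(match.group(2)))
    return [
        f"I-6: PR {pr} ran {len(loops)} audit loops (cap {MAX_AUDIT_LOOPS})"
        for pr, loops in loops_by_pr.items()
        if len(loops) > MAX_AUDIT_LOOPS
    ]


def layer_a_failures(trace):
    """Any non-empty result fails the run, mirroring the rule stated above."""
    return check_i3(trace) + check_i6(trace)


if __name__ == "__main__":
    example = [
        ToolCall("Bash", {"command": "python .../grant_project_claude_permissions.py"}),
        ToolCall("Agent", {"name": "bugfind-pr42-loop1", "run_in_background": True}),
        ToolCall("TeamCreate"),
    ]
    for failure in layer_a_failures(example):
        print(failure)
```

Returning a list of failure strings rather than a boolean keeps each violation attributable to a specific invariant, which may help when reconciling predicted and observed traces.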
@@ -46,25 +43,23 @@ The harness does not yet exist; this document defines its contract.

 ---

-## Eval 1 —
+## Eval 1 — Smoke: background subagent spawns fire correctly

-**Scenario.**
+**Scenario.** PR exists; PR is a clean target with no unusual pre-conditions.

 **Trigger.** `/bugteam`

-**Layer A invariants.**
+**Layer A invariants.** I-1, I-2, I-3, I-4, I-5, I-8, I-9, I-10.

-**Layer B predicted trace (
-1. `Bash("
-2. `
-3.
-4.
+**Layer B predicted trace (smoke).**
+1. `Bash("python .../grant_project_claude_permissions.py")` runs (Step 0).
+2. `Agent(subagent_type="code-quality-agent", name="bugfind-pr...-loop1", run_in_background=true, model="opus", ...)` spawned for AUDIT.
+3. Lead awaits background-completion notification, then `Read(".bugteam-pr42-loop1.outcomes.xml")`.
+4. `Agent(subagent_type="clean-coder", name="bugfix-pr...-loop1", run_in_background=true, model="opus", ...)` spawned for FIX (if findings).
 5. `Bash("python .../revoke_project_claude_permissions.py")` on exit.

 **Pass criteria.**
--
-- Zero `TeamCreate`, zero `TeamDelete`, zero teammate `SendMessage` shutdowns.
-- Non-zero `Task` (or `Agent` without `team_name` only if the host maps Path B that way) carrying **`code-quality-agent`** / **fix worker under the clean-coder contract** (subtype `clean-coder` where accepted, else `generalPurpose` + `clean-coder.md` Read per `workflow-path-b-task-harness.md`).
+- Non-zero `Agent(subagent_type="code-quality-agent", run_in_background=true)` and `Agent(subagent_type="clean-coder", run_in_background=true)` calls.

 ---

@@ -75,7 +70,7 @@ The harness does not yet exist; this document defines its contract.
 **Layer B predicted trace.**
 1. `Bash("gh pr view --json ...")` → non-zero exit.
 2. `Bash("git merge-base HEAD origin/main")` → empty.
-3. No grant script
+3. No grant script.

 **Pass criteria.** Assistant message matches `No PR or upstream diff. /bugteam needs a target.`. Zero downstream tool calls.

@@ -93,15 +88,15 @@ The harness does not yet exist; this document defines its contract.

 **Scenario.** `code-quality-agent` is present in the available-agents list; `clean-coder` is not.

-**Pass criteria.** Assistant message contains `Required subagent type clean-coder not installed.`. Zero grant script call, zero `
+**Pass criteria.** Assistant message contains `Required subagent type clean-coder not installed.`. Zero grant script call, zero `Agent` spawns.

 ---

-## Eval 5 — Happy path: converges in 2 loops
+## Eval 5 — Happy path: converges in 2 loops

-**Scenario.** PR #42 contains three P1 bugs all addressable by the mock fix
+**Scenario.** PR #42 contains three P1 bugs all addressable by the mock fix subagent. Loop 1 audit returns 3 findings; loop 1 fix commits cleanly; loop 2 audit returns zero findings.

-**Layer A invariants.**
+**Layer A invariants.** I-1, I-2, I-3, I-4, I-5, I-6, I-8, I-9, I-10.

 **Layer B predicted trace.**

@@ -109,41 +104,40 @@ The harness does not yet exist; this document defines its contract.
 |---|---|---|
 | 1 | `Bash("python .../scripts/grant_project_claude_permissions.py")` | `SKILL.md` § Step 0 |
 | 2 | `Bash("gh pr view --json number,baseRefName,headRefName,url")` | `SKILL.md` § Step 1 |
-| 3 | `Bash("git rev-parse HEAD")` → captures `starting_sha` | `SKILL.md` § Step 2 — **Loop state** block |
-| 4 | `
-| 5 | `Bash("
-| 6 | `
-| 7 |
-| 8 | `Read(".bugteam-
-| 9 | `
-| 10 |
-| 11 | `Read(".bugteam-
-| 12 | `Bash("git rev-parse HEAD")` → verify HEAD advanced | `SKILL.md` § FIX action (**Verify**) |
-| 13 | `Bash("git fetch origin <branch>
-| 14 | `
-| 15 | `Bash("gh pr diff 42 -R ... > <
-| 16 | `Agent(subagent_type="code-quality-agent", name="bugfind", ...)` (loop 2) | `SKILL.md` § AUDIT action |
-| 17 |
-| 18 | `
-| 19 | `
-| 20 | `Bash("python -c \"
+| 3 | `Bash("git -C \"<run_temp_dir>/pr-42/worktree\" rev-parse HEAD")` → captures `starting_sha` | `SKILL.md` § Step 2 — **Loop state** block |
+| 4 | `Bash("mkdir -p <run_temp_dir>/pr-42")` | `SKILL.md` § AUDIT action |
+| 5 | `Bash("gh pr diff 42 -R ... > <run_temp_dir>/pr-42/loop-1.patch")` | `SKILL.md` § AUDIT action |
+| 6 | `Agent(subagent_type="code-quality-agent", name="bugfind-pr42-loop1", run_in_background=true, model="opus", description=..., prompt=<audit XML loop 1>)` | `SKILL.md` § AUDIT action |
+| 7 | Lead awaits background-completion notification | `SKILL.md` § AUDIT action |
+| 8 | `Read(".bugteam-pr42-loop1.outcomes.xml")` | `SKILL.md` § AUDIT action |
+| 9 | `Agent(subagent_type="clean-coder", name="bugfix-pr42-loop1", run_in_background=true, model="opus", description=..., prompt=<fix XML loop 1>)` | `SKILL.md` § FIX action |
+| 10 | Lead awaits background-completion notification | `SKILL.md` § FIX action |
+| 11 | `Read(".bugteam-pr42-loop1.outcomes.xml")` — bugfix outcome XML | `SKILL.md` § FIX action |
+| 12 | `Bash("git -C \"<run_temp_dir>/pr-42/worktree\" rev-parse HEAD")` → verify HEAD advanced | `SKILL.md` § FIX action (**Verify**) |
+| 13 | `Bash("git -C \"<run_temp_dir>/pr-42/worktree\" fetch origin <branch>")` → fetch remote state | `SKILL.md` § FIX action (**Verify**) |
+| 14 | `Bash("git -C \"<run_temp_dir>/pr-42/worktree\" rev-parse origin/<branch>")` → confirm matches HEAD | `SKILL.md` § FIX action (**Verify**) |
+| 15 | `Bash("gh pr diff 42 -R ... > <run_temp_dir>/pr-42/loop-2.patch")` | `SKILL.md` § AUDIT action |
+| 16 | `Agent(subagent_type="code-quality-agent", name="bugfind-pr42-loop2", run_in_background=true, ...)` (loop 2) | `SKILL.md` § AUDIT action |
+| 17 | Lead awaits background-completion notification | `SKILL.md` § AUDIT action |
+| 18 | `Read(".bugteam-pr42-loop2.outcomes.xml")` — zero findings | `SKILL.md` § AUDIT action |
+| 19 | `Bash("git worktree remove \"<run_temp_dir>/pr-42/worktree\"")` | `SKILL.md` § Step 4 step 1 |
+| 20 | `Bash("python -c \"...shutil.rmtree(r'<run_temp_dir>', ...)\"")` | `SKILL.md` § Step 4 step 2 (Windows-safe teardown) |
 | 21 | `Bash("gh pr diff 42 -R ... > .bugteam-final.diff")` | `SKILL.md` § Step 4.5 step 1 |
 | 22 | `Bash("gh pr view 42 -R ... --json body --jq .body > .bugteam-original-body.md")` | `SKILL.md` § Step 4.5 step 2 |
 | 23 | `Agent(subagent_type="pr-description-writer", description=..., prompt=<brief>)` | `SKILL.md` § Step 4.5 |
 | 24 | `Write(".bugteam-final-body.md", <returned body>)` | `SKILL.md` § Step 4.5 step 4 |
 | 25 | `Bash("gh pr edit 42 -R ... --body-file .bugteam-final-body.md")` | `SKILL.md` § Step 4.5 step 4 |
-| 26 | `Bash("rm .bugteam-final.diff .bugteam-original-body.md .bugteam-final-body.md")` | `SKILL.md` § Step 4.5 step 5
+| 26 | `Bash("rm .bugteam-final.diff .bugteam-original-body.md .bugteam-final-body.md")` | `SKILL.md` § Step 4.5 step 5 |
 | 27 | `Bash("python .../scripts/revoke_project_claude_permissions.py")` | `SKILL.md` § Step 5 |

 **Pass criteria.**
 - All Layer A invariants hold.
-- Exactly 2 `Agent(name="bugfind"
-- Exactly 2 bugfind shutdown messages + 1 bugfix shutdown message.
+- Exactly 2 `Agent(name="bugfind-pr42-loop...")` calls, exactly 1 `Agent(name="bugfix-pr42-loop...")` call.
 - Final report contains `/bugteam exit: converged` and `Loops: 2`.

 **Process check after first real run.** Compare the observed trace against steps 1–27. Common expected divergences that should not fail the eval:
 - Extra `Bash("git rev-parse HEAD")` calls the lead inserts for bookkeeping.
-- Consolidated `Bash` calls (step
+- Consolidated `Bash` calls (step 25 may split into two or three calls).
 - Extra `Read` calls when the lead re-reads an outcome XML to quote specific findings.
 - Reordered but still-Layer-A-compliant cleanup sequencing.

@@ -151,40 +145,40 @@ Patch this table to match observation and annotate each correction.

 ---

-## Eval 6 — Stuck path: fix
+## Eval 6 — Stuck path: fix subagent produces no commit

-**Scenario.** Loop 1 audit finds 2 P1 bugs; the mock fix
+**Scenario.** Loop 1 audit finds 2 P1 bugs; the mock fix subagent reports both as `could_not_address` (no commit created).

-**Layer A invariants.** I-1, I-2, I-3, I-4, I-5, I-6, I-
+**Layer A invariants.** I-1, I-2, I-3, I-4, I-5, I-6, I-8, I-9, I-10. I-6 trivially holds.

-**Layer B predicted trace.** Identical to Eval 5 steps 1–
+**Layer B predicted trace.** Identical to Eval 5 steps 1–12 with this divergence:
 - Step 11 bugfix outcome XML marks every finding `status="could_not_address"`.
 - Step 12 `Bash("git rev-parse HEAD")` returns the pre-fix SHA unchanged.
-- Skill sets exit reason = `stuck`, skips loop 2, and falls through to `
+- Skill sets exit reason = `stuck`, skips loop 2, and falls through to `rmtree`.

 **Pass criteria.**
 - Loop count stops at 1.
 - Final report contains `/bugteam exit: stuck` and names the two unresolved findings.
-- Steps 19–
+- Steps 19–26 fire despite the stuck exit — I-2 and I-9 enforce this.

 ---

 ## Eval 7 — Cap reached: 10 loops, no convergence

-**Scenario.** Mock audit returns one P2 finding every loop. Mock fix
+**Scenario.** Mock audit returns one P2 finding every loop. Mock fix subagent always commits but never clears the finding.

-**Layer A invariants.** All of I-1 through I-
+**Layer A invariants.** All of I-1 through I-10.

 **Layer B predicted behavior.**
-- Loops 1–3: single `Agent(name="bugfind")` per loop.
-- Loops 4–10:
-- Each loop produces one `Agent(name="bugfix")
+- Loops 1–3: single `Agent(name="bugfind-pr<N>-loop<L>", run_in_background=true)` per loop.
+- Loops 4–10: eleven parallel `Agent(name="bugfind-pr<N>-loop<L>-[a..k]", run_in_background=true)` in a single assistant message per loop (10 haiku + 1 opus validator); lead awaits the validator notification.
+- Each loop produces one `Agent(name="bugfix-pr<N>-loop<L>", run_in_background=true)`.
 - Exactly 10 audit phases, exactly 10 fix phases.
-- Steps 19–
+- Steps 19–26 from Eval 5 fire at teardown.

 **Pass criteria.**
-- I-
-- I-
+- I-6 holds: exactly 10 audit phases.
+- I-7 holds: loops 4–10 each emit eleven audit `Agent` calls in a single assistant message.
 - Final report contains `/bugteam exit: cap reached` and the remaining bug count.

 **Process check.** The distinct `Agent(name=...)` audit-call count is a prediction. On the first real run, record the exact count and rewrite the formula here.
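The I-7 criterion in Eval 7 (eleven audit `Agent` calls in a single assistant message for each of loops 4–10) could be checked roughly as below. This is a sketch under stated assumptions: the dictionary keys `name` and `message_id` are hypothetical fields of a recorded trace, the function name is invented, and the check assumes a single-PR run. The loop threshold, the auditor count, and the name pattern are taken from the diff.

```python
import re
from collections import defaultdict

# From I-7 and the Eval 7 predicted behavior.
PARALLEL_FROM_LOOP = 4
PARALLEL_AUDITORS = 11

# Audit spawns are named bugfind-pr<N>-loop<L>, with an -a..-k suffix when parallel.
AUDIT_NAME = re.compile(r"^bugfind-pr\d+-loop(\d+)")


def check_i7(agent_calls):
    """Return I-7 failures for a list of recorded Agent calls.

    Each call is a dict with hypothetical keys:
      name        - the Agent name argument
      message_id  - which assistant message issued the call
    """
    calls_by_loop = defaultdict(list)
    for call in agent_calls:
        match = AUDIT_NAME.match(call["name"])
        if match:
            calls_by_loop[int(match.group(1))].append(call)

    failures = []
    for loop, calls in sorted(calls_by_loop.items()):
        if loop < PARALLEL_FROM_LOOP:
            continue  # loops 1-3 use a single auditor and are outside I-7
        if len(calls) != PARALLEL_AUDITORS:
            failures.append(
                f"I-7: loop {loop} spawned {len(calls)} auditors, expected {PARALLEL_AUDITORS}"
            )
        message_ids = {call["message_id"] for call in calls}
        if len(message_ids) != 1:
            failures.append(
                f"I-7: loop {loop} auditors span {len(message_ids)} assistant messages"
            )
    return failures


def eval7_shaped_calls(pr_number=42):
    """Build a trace shaped like the Eval 7 prediction, for demonstration only."""
    calls = []
    for loop in range(1, 11):
        if loop < PARALLEL_FROM_LOOP:
            calls.append({"name": f"bugfind-pr{pr_number}-loop{loop}", "message_id": f"m{loop}"})
        else:
            for suffix in "abcdefghijk":
                calls.append(
                    {"name": f"bugfind-pr{pr_number}-loop{loop}-{suffix}", "message_id": f"m{loop}"}
                )
    return calls


if __name__ == "__main__":
    print(check_i7(eval7_shaped_calls()))
```

Since the document itself treats the distinct audit-call count as a prediction, the two constants are the natural places to adjust after the first real run.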
@@ -195,12 +189,12 @@ Patch this table to match observation and annotate each correction.

 **Scenario.** Loop 1 audit returns zero findings.

-**Layer A invariants.** I-1
+**Layer A invariants.** I-1, I-2, I-3, I-4, I-5, I-6, I-8, I-9, I-10.

-**Layer B predicted trace.** Eval 5 steps 1–
+**Layer B predicted trace.** Eval 5 steps 1–8 and 19–26 only — no FIX phase because zero findings means the skill exits the loop at `last_action == "audited"` and `last_findings.total == 0`.

 **Pass criteria.**
-- Exactly 1 `Agent(
+- Exactly 1 `Agent(subagent_type="code-quality-agent", run_in_background=true)` call, 0 fix agent spawns.
 - Bugfind's outcome XML records zero findings; the per-loop review POST carries body `## /bugteam loop 1 audit: 0P0 / 0P1 / 0P2 → clean`.
 - Step 4.5 and Step 5 still fire.

@@ -212,7 +206,7 @@ Patch this table to match observation and annotate each correction.

 **Layer A invariants.** Same as Eval 5.

-**Layer B predicted
+**Layer B predicted subagent-side behavior** (observed via the recorded `gh api ... /reviews` POST payload in the bugfind subagent fixture).
 - `comments[]` length in the POST body = 2 (anchored findings only).
 - Review body contains a `### Findings without a diff anchor` section listing the third finding.
 - Bugfix outcome XML marks all 3 findings with a `reply_comment_url`; the unanchored finding's `used_fallback="true"` and `finding_comment_url` equals the parent review URL.
@@ -242,9 +236,9 @@ Patch this table to match observation and annotate each correction.
 - Bugfix teammate outcome XML marks every finding `status="hook_blocked"` with populated `<hook_output>`.
 - Bugfix teammate posts `Hook blocked the fix commit: <one-line summary>` to each finding comment.
 - Lead's `Bash("git rev-parse HEAD")` after fix detects no SHA change → exit reason `stuck`.
-- Steps 19–
+- Steps 19–26 from Eval 5 fire at teardown.

-**Pass criteria.** Layer A I-2 and I-
+**Pass criteria.** Layer A I-2 and I-9 hold. Final report contains `/bugteam exit: stuck` and surfaces the hook_output summary.

 ---

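Several evals above pass only if I-2 and I-9 hold on a non-converged exit. A rough sketch of that ordering check over the recorded `Bash` command strings follows. The function name and the substring markers used to recognize each phase are assumptions chosen for illustration (a real harness would likely match more precisely); the command shapes in the usage example are copied from Eval 5 steps 19–27, with their elisions left as they appear there.

```python
REVOKE_MARKER = "revoke_project_claude_permissions.py"


def check_teardown(bash_commands):
    """Check I-2 (revoke exactly once, after teardown) and I-9 (teardown order).

    bash_commands is the ordered list of command strings passed to Bash.
    Phase detection by substring is a simplifying assumption.
    """
    def positions(predicate):
        return [i for i, command in enumerate(bash_commands) if predicate(command)]

    revoke = positions(lambda c: REVOKE_MARKER in c)
    worktree = positions(lambda c: "git worktree remove" in c)
    rmtree = positions(lambda c: "rmtree" in c)
    pr_edit = positions(lambda c: c.startswith("gh pr edit"))  # Step 4.5, may be skipped

    failures = []
    if len(revoke) != 1:
        failures.append(f"I-2: revoke ran {len(revoke)} times, expected exactly 1")
    if not worktree or not rmtree:
        failures.append("I-9: worktree removal or rmtree missing from teardown")
    elif max(worktree) > min(rmtree):
        failures.append("I-9: rmtree ran before every worktree was removed")
    if rmtree and pr_edit and min(pr_edit) < max(rmtree):
        failures.append("I-9: Step 4.5 ran before rmtree")
    if len(revoke) == 1 and any(i > revoke[0] for i in worktree + rmtree + pr_edit):
        failures.append("I-2: revoke did not run after teardown")
    return failures


if __name__ == "__main__":
    eval5_tail = [
        'git worktree remove "<run_temp_dir>/pr-42/worktree"',
        'python -c "...shutil.rmtree(r\'<run_temp_dir>\', ...)"',
        "gh pr diff 42 -R ... > .bugteam-final.diff",
        "gh pr edit 42 -R ... --body-file .bugteam-final-body.md",
        "python .../scripts/revoke_project_claude_permissions.py",
    ]
    print(check_teardown(eval5_tail))
```

Treating the Step 4.5 marker as optional lets the same check run on fixtures where the description rewrite is skipped.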
@@ -252,13 +246,13 @@ Patch this table to match observation and annotate each correction.

 **Scenario.** The available-agents list does not include `pr-description-writer` but does include `general-purpose`.

-**Layer B predicted trace.** Eval 5 steps 1–
+**Layer B predicted trace.** Eval 5 steps 1–21 identical; step 22 becomes:

 ```
 Agent(subagent_type="general-purpose", description="Rewrite PR 42 body from cumulative diff", prompt=<same brief>)
 ```

-Steps
+Steps 23–26 follow normally.

 **Pass criteria.** Exactly 1 `Agent(subagent_type="general-purpose", ...)` call for the description rewrite. `gh pr edit` fires. Final report carries no Step 4.5 skip warning.

@@ -268,7 +262,7 @@ Steps 24–27 follow normally.

 **Scenario.** Neither `pr-description-writer` nor `general-purpose` appear in the available-agents list.

-**Layer B predicted trace.** Eval 5 steps 1–
+**Layer B predicted trace.** Eval 5 steps 1–21, then skip steps 22–24. Steps 25–26 still fire.

 **Pass criteria.**
 - Zero `Agent` calls for PR description rewriting.
@@ -280,50 +274,17 @@ Steps 24–27 follow normally.

 ## Eval 14 — Permissions revoke on error path

-**Scenario.** Bugfind
+**Scenario.** Bugfind subagent completes but writes no outcomes XML (background subagent completes notification arrives with no file at the expected path).

-**Layer B predicted trace.** Eval 5 steps 1–
--
-- Skill sets exit reason = `error: bugfind
--
+**Layer B predicted trace.** Eval 5 steps 1–7, then:
+- Lead awaits notification and calls `Read(".bugteam-pr42-loop1.outcomes.xml")` → file missing.
+- Skill sets exit reason = `error: outcomes XML missing after bugfind loop 1`.
+- Teardown (steps 19–26 from Eval 5) all fire.

 **Pass criteria.** Final report surfaces the error and the loop number. Revoke fires despite the error.

 ---

-## Eval 15 — Orchestrator-only `TeamCreate` (supplementary work path)
-
-**Scenario.** A loop 1 audit surfaces a P0/P1 finding whose root cause sits in adjacent infrastructure the lead needs to fix before the cycle can converge (e.g., a broken CI hook, a misbehaving lint config, a wrong GitHub API shape in a teammate's own dependency). The lead recognizes supplementary work is needed and decides to spawn additional teammates to handle it.
-
-**Layer A invariants.** I-1, I-3, I-4, I-5, I-6, I-7, I-11, I-12, **I-13 (primary focus)**.
-
-**Layer B predicted trace.** Eval 5 steps 1–9 identical. At step 10 (where a standard cycle spawns `bugfix`), the lead decides the finding requires adjacent infrastructure work first. Rather than call `TeamCreate` for a new team, the lead spawns a supplementary teammate into the existing team:
-
-```
-Agent(
-subagent_type="code-quality-agent",
-name="bugfind-adjacent",
-team_name="<lead_team_name>", // same team as bugfind/bugfix
-model="opus",
-description="Supplementary audit of adjacent infrastructure",
-prompt=<brief naming the specific adjacent files + observed symptom>
-)
-```
-
-The adjacent-audit teammate writes its own outcome XML, self-terminates. Lead reads the XML, decides fix strategy, spawns an adjacent-fix teammate into the same team. Cycle eventually returns to the standard `bugfix` spawn for the original finding(s). All spawns pass the same `team_name`.
-
-**Pass criteria.**
-- Layer A I-13 holds: zero `TeamCreate` calls beyond the single one at skill Step 2.
-- Every `Agent(...)` call in the session carries `team_name="<lead_team_name>"`. No teammate spawn omits `team_name`.
-- If the lead attempts a second `TeamCreate` call, the runtime returns the exact error quoted in I-13's citation; the lead treats this as a signal to spawn a teammate into the existing team instead.
-- Working behavior is unchanged from a single-set cycle: grant → TeamCreate (once) → Agent spawns (many, all same team_name) → SendMessage shutdowns as needed → TeamDelete (once) → temp cleanup → Step 4.5 → revoke.
-
-**Failure mode.** A second `TeamCreate` call in the session, or any `Agent(...)` call without `team_name` once the team exists. Either signals the orchestrator-only invariant has been violated and the clean-room/team semantics are broken.
-
-**Observation source for this eval.** This eval was added after a real /bugteam run on PR #184 where the lead discovered a broken hook mid-cycle and initially spawned a standalone subagent (no `team_name`) for the adjacent audit — a direct violation. The runtime had already prevented a second `TeamCreate` with the error quoted in I-13. The eval codifies the correct path (spawn as teammate into existing team) so future runs do not repeat the violation.
-
----
-
 ## Iteration protocol

 1. **Cycle 0 — Reconcile predictions with reality.** On the first real run, diff every Layer B predicted trace against the observed trace. Patch this file to match reality and annotate each correction with a reason.
@@ -344,5 +305,5 @@ A minimal Python harness under `packages/claude-dev-env/skills/bugteam/evals/`:
 ## Open research items flagged during this pass

 1. **GitHub REST review-POST payload shape.** Eval 9 and Eval 10 depend on the exact body shape of `POST /pulls/<number>/reviews`. The `jq -n --rawfile ... --argjson ... | gh api ... --input -` fence lives in `SKILL.md` § Step 2.5 (**Review POST**); expanded copy in `reference/github-pr-reviews.md` § **Per-loop review**. Before running Eval 9/10 for real, fetch the current GitHub REST reference to confirm the request schema (fields `commit_id`, `event`, `body`, `comments[]`) and the multi-line anchor `{path, start_line, start_side, line, side, body}` shape still apply. Record the confirmed version and URL here.
-2.
+2. **Background subagent completion signal.** Real-run observation (loop 1 of eval run 2026-04-18) confirmed: background subagents self-terminate when their task is complete — the background-completion notification arrives and the lead reads the outcomes XML. No shutdown handshake required. `SKILL.md` § AUDIT / FIX actions document this flow. Layer A **I-4** encodes “fresh subagent per loop.”
 3. **Model override redundancy.** `clean-coder` pins `model: opus` in its agent definition, while `code-quality-agent` currently uses `model: inherit`. The explicit `model="opus"` in every spawn is insurance against frontmatter drift; on the first real run, confirm the resolved model is `claude-opus-4-7` and that effort defaults to `xhigh` (Claude Code shows the active effort next to the spinner per the model-config docs). If a teammate's frontmatter ever pins a non-default `effort:` value, that frontmatter overrides the model default for that subagent (https://code.claude.com/docs/en/model-config — *"Frontmatter effort applies when that skill or subagent is active, overriding the session level but not the environment variable."*).
@@ -4,10 +4,8 @@ Expanded material that used to live inline in `SKILL.md`. Load a file when the o

 | File | Domain |
 |------|--------|
-| [`
-| [`
-| [`design-rationale.md`](design-rationale.md) | Why agent teams (clean-room), table-of-contents habit, when `/bugteam` applies, refusal reasons |
-| [`team-setup.md`](team-setup.md) | Permissions grant (`CLAUDE_SKILL_DIR`), PR scope, `TeamCreate`, team name / sanitization / temp dir / roles / loop state |
+| [`design-rationale.md`](design-rationale.md) | Why clean-room subagents, table-of-contents habit, when `/bugteam` applies, refusal reasons |
+| [`team-setup.md`](team-setup.md) | Permissions grant (`CLAUDE_SKILL_DIR`), PR scope, run name / temp dir / loop state |
 | [`github-pr-reviews.md`](github-pr-reviews.md) | Per-loop reviews, `jq` + `gh api` payloads, anchors, fallbacks, REST endpoints |
 | [`audit-and-teammates.md`](audit-and-teammates.md) | Pre-audit gate, full cycle numbering, AUDIT and FIX actions, parallel auditors |
 | [`teardown-publish-permissions.md`](teardown-publish-permissions.md) | Utility scripts note, teardown, PR description rewrite, revoke, final report |
@@ -24,11 +24,11 @@ Repeat until an exit condition fires.
|
|
|
24
24
|
2. If exit code **0** → continue to step 2.5 (AUDIT spawn) below.
|
|
25
25
|
3. If exit code **non-zero** → spawn a new **clean-coder** teammate — **standards-fix pass** — with instructions: read the script’s stderr, edit the repo until a **re-run** of the **same** gate command exits **0**, then one commit, `git push`, shutdown. Repeat standards-fix spawns until the gate exits **0** or **5** failed gate rounds (each round = one teammate session after a non-zero gate). If still non-zero after 5 rounds → exit reason = `error: code rules gate failed pre-audit`.
|
|
26
26
|
4. After gate exit **0**, increment `loop_count`. If `loop_count > 10`, exit reason = `cap reached` (counts **audits**, not standards-only rounds).
|
|
27
|
-
5. Execute **AUDIT action** (spawn bugfind). Print progress: `Loop <
|
|
27
|
+
5. Execute **AUDIT action** (spawn bugfind). Print progress: `Loop <L> audit: ...`
|
|
28
28
|
|
|
29
29
|
3. **FIX path** (when `last_action == "audited"` and `last_findings.total > 0`):
|
|
30
30
|
1. Increment `loop_count`. If `loop_count > 10`, exit reason = `cap reached`.
|
|
31
|
-
2. Execute **FIX action** (spawn bugfix clean-coder for audit findings). Print: `Loop <
|
|
31
|
+
2. Execute **FIX action** (spawn bugfix clean-coder for audit findings). Print: `Loop <L> fix: commit ...`
|
|
32
32
|
3. Set `last_action = "fixed"`, update `audit_log`, loop to step 1 (next iteration hits **pre-audit path** before the next AUDIT).
|
|
33
33
|
|
|
34
34
|
4. After **AUDIT**, update `last_action`, `last_findings`, `audit_log`; print the audit progress line if not already printed.
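
The pre-audit gate retry rule above (up to 5 standards-fix rounds before giving up) can be sketched as follows. This is a minimal illustration, not the skill's actual code: `run_pre_audit_gate`, `run_gate`, and `spawn_standards_fix` are hypothetical names, and the two callables stand in for running the gate script and spawning a clean-coder teammate.

```python
from typing import Callable, Optional

GATE_FAILED_REASON = "error: code rules gate failed pre-audit"


def run_pre_audit_gate(
    run_gate: Callable[[], int],
    spawn_standards_fix: Callable[[], None],
    max_rounds: int = 5,
) -> Optional[str]:
    """Return None once the gate exits 0, or the exit reason after max_rounds.

    run_gate and spawn_standards_fix are hypothetical stand-ins for the gate
    command and for one standards-fix teammate session.
    """
    failed_rounds = 0
    while True:
        if run_gate() == 0:
            return None  # gate passed; the caller continues to the AUDIT spawn
        if failed_rounds == max_rounds:
            return GATE_FAILED_REASON  # still non-zero after the final round
        spawn_standards_fix()  # one round = one teammate session
        failed_rounds += 1
```

Standards-fix rounds are counted separately from `loop_count`, which matches the note that the 10-loop cap counts audits, not standards-only rounds.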
|
|
@@ -39,62 +39,45 @@ Repeat until an exit condition fires.
|
|
|
39
39
|
|
|
40
40
|
## AUDIT action (clean-room teammate, fresh per loop)
|
|
41
41
|
|
|
42
|
-
Capture a fresh PR diff for this loop into the per-
|
|
42
|
+
Capture a fresh PR diff for this loop into the per-PR scoped directory so concurrent `/bugteam` runs keep patches isolated. Use the literal `<run_temp_dir>` resolved once in Step 2 — Claude resolves the absolute path; every shell receives the same literal value.
|
|
43
43
|
|
|
44
44
|
Commands and `Agent(...)` shape: `SKILL.md`.
|
|
45
45
|
|
|
46
|
-
`<
|
|
46
|
+
`<run_temp_dir>` includes the sanitized `team_name` and timestamp; `team_name` is already prefixed with `bugteam-`. Claude resolves `Path(tempfile.gettempdir()) / team_name` once and passes that absolute path to every shell. `tempfile.gettempdir()` honors `TMPDIR`, `TEMP`, `TMP` and falls back to the OS temp directory, so the same approach works on macOS, Linux, Windows cmd.exe, and PowerShell.
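
A minimal sketch of the path resolution described above. The helper name `resolve_run_temp_dir` is hypothetical, and the sketch assumes `team_name` has already been sanitized and prefixed with `bugteam-`.

```python
import tempfile
from pathlib import Path


def resolve_run_temp_dir(team_name: str) -> Path:
    """Return the absolute scoped temp directory for one run (hypothetical helper)."""
    # tempfile.gettempdir() checks TMPDIR, TEMP, and TMP before falling back
    # to a platform-specific default, so no per-shell branching is needed.
    return (Path(tempfile.gettempdir()) / team_name).resolve()
```

Resolving once and passing the resulting string to every shell avoids each shell expanding its own temp variable differently.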
|
|
47
47
|
|
|
48
48
|
Each loop calls `Agent` again with a fresh invocation so the teammate starts with its own context window. Doc line on lead history: [`../sources.md`](../sources.md).
|
|
49
49
|
|
|
50
50
|
See [`../PROMPTS.md`](../PROMPTS.md) for AUDIT spawn-prompt XML and bugfind outcome schema. Substitute placeholders (`repo`, `branch`, `base_branch`, `pr_url`, `loop`, `diff_path`) into the `prompt` argument.
|
|
51
51
|
|
|
52
|
-
After the teammate returns, the lead reads `.bugteam-loop
|
|
52
|
+
After the teammate returns, the lead reads `.bugteam-pr<N>-loop<L>.outcomes.xml` from the worktree directory with the `Read` tool, parses it, and populates `loop_comment_index` from `<finding>` elements.
|
|
53
53
|
|
|
54
54
|
### Shutdown (bugfind)
|
|
55
55
|
|
|
56
|
-
|
|
57
|
-
|
|
58
|
-
**Fallback — lead-initiated shutdown:** If the teammate still appears active after `Agent` returns, send:
|
|
59
|
-
|
|
60
|
-
```
|
|
61
|
-
SendMessage(
|
|
62
|
-
to="bugfind",
|
|
63
|
-
message={
|
|
64
|
-
"type": "shutdown_request",
|
|
65
|
-
"reason": "audit loop <N> complete; outcome XML captured"
|
|
66
|
-
}
|
|
67
|
-
)
|
|
68
|
-
```
|
|
69
|
-
|
|
70
|
-
The teammate replies with `{type: "shutdown_response", approve: true}`. If `approve` is `false`, exit reason = `error: bugfind teammate refused shutdown` → Step 4 teardown then Step 5 revoke.
|
|
56
|
+
Teammates self-terminate when complete: the background-completion notification arrives and the lead reads the outcomes XML. If the notification does not arrive within the lead timeout (120s), treat it as a hard blocker and abort the loop.
|
|
71
57
|
|
|
72
58
|
`last_action = "audited"`. Append audit metadata to `audit_log`.
|
|
73
59
|
|
|
74
60
|
### Parallel auditors (`loop_count >= 4`)
|
|
75
61
|
|
|
76
|
-
The pre-audit gate must pass immediately before this step. After three full audit/fix rounds without convergence, issue
|
|
62
|
+
The pre-audit gate must pass immediately before this step. After three full audit/fix rounds without convergence, issue eleven `Agent` calls in **one** assistant message so they run in parallel:
|
|
77
63
|
|
|
78
64
|
```
|
|
79
|
-
Agent(subagent_type="code-quality-agent", name="bugfind-
|
|
80
|
-
Agent(subagent_type="code-quality-agent", name="bugfind-
|
|
81
|
-
Agent(subagent_type="code-quality-agent", name="bugfind-
|
|
65
|
+
Agent(subagent_type="code-quality-agent", name="bugfind-pr<N>-loop<L>-a", team_name="<team_name>", model="opus", run_in_background=true, description="Bugfind audit PR <N> loop <L> validator", prompt="<audit XML; poll for all 10 sibling XMLs at <run_temp_dir>/pr-<N>/loop-<L>-b.outcomes.xml through <run_temp_dir>/pr-<N>/loop-<L>-k.outcomes.xml (60s timeout, 2s interval); on timeout: log diagnostics entry, proceed with validated findings from available XMLs; validate each finding: file exists, line in bounds, excerpt matches claimed line, category A-J, severity P0/P1/P2; quarantine hallucinated findings to <run_temp_dir>/pr-<N>/loop-<L>-diagnostics.json under validator_rejected; de-dup by (file, line, category), max severity wins, keep longest description on conflict; re-id as loop<L>-<K>; write <worktree_path>/.bugteam-pr<N>-loop<L>.outcomes.xml; post review>")
|
|
66
|
+
Agent(subagent_type="code-quality-agent", name="bugfind-pr<N>-loop<L>-b", team_name="<team_name>", model="haiku", run_in_background=true, description="Bugfind audit PR <N> loop <L> variant b", prompt="<audit XML; write outcome to <run_temp_dir>/pr-<N>/loop-<L>-b.outcomes.xml; skip PR posting>")
|
|
67
|
+
Agent(subagent_type="code-quality-agent", name="bugfind-pr<N>-loop<L>-c", team_name="<team_name>", model="haiku", run_in_background=true, description="Bugfind audit PR <N> loop <L> variant c", prompt="<audit XML; write outcome to <run_temp_dir>/pr-<N>/loop-<L>-c.outcomes.xml; skip PR posting>")
|
|
68
|
+
Agent(subagent_type="code-quality-agent", name="bugfind-pr<N>-loop<L>-d", team_name="<team_name>", model="haiku", run_in_background=true, description="Bugfind audit PR <N> loop <L> variant d", prompt="<audit XML; write outcome to <run_temp_dir>/pr-<N>/loop-<L>-d.outcomes.xml; skip PR posting>")
|
|
69
|
+
Agent(subagent_type="code-quality-agent", name="bugfind-pr<N>-loop<L>-e", team_name="<team_name>", model="haiku", run_in_background=true, description="Bugfind audit PR <N> loop <L> variant e", prompt="<audit XML; write outcome to <run_temp_dir>/pr-<N>/loop-<L>-e.outcomes.xml; skip PR posting>")
|
|
70
|
+
Agent(subagent_type="code-quality-agent", name="bugfind-pr<N>-loop<L>-f", team_name="<team_name>", model="haiku", run_in_background=true, description="Bugfind audit PR <N> loop <L> variant f", prompt="<audit XML; write outcome to <run_temp_dir>/pr-<N>/loop-<L>-f.outcomes.xml; skip PR posting>")
|
|
71
|
+
Agent(subagent_type="code-quality-agent", name="bugfind-pr<N>-loop<L>-g", team_name="<team_name>", model="haiku", run_in_background=true, description="Bugfind audit PR <N> loop <L> variant g", prompt="<audit XML; write outcome to <run_temp_dir>/pr-<N>/loop-<L>-g.outcomes.xml; skip PR posting>")
|
|
72
|
+
Agent(subagent_type="code-quality-agent", name="bugfind-pr<N>-loop<L>-h", team_name="<team_name>", model="haiku", run_in_background=true, description="Bugfind audit PR <N> loop <L> variant h", prompt="<audit XML; write outcome to <run_temp_dir>/pr-<N>/loop-<L>-h.outcomes.xml; skip PR posting>")
|
|
73
|
+
Agent(subagent_type="code-quality-agent", name="bugfind-pr<N>-loop<L>-i", team_name="<team_name>", model="haiku", run_in_background=true, description="Bugfind audit PR <N> loop <L> variant i", prompt="<audit XML; write outcome to <run_temp_dir>/pr-<N>/loop-<L>-i.outcomes.xml; skip PR posting>")
|
|
74
|
+
Agent(subagent_type="code-quality-agent", name="bugfind-pr<N>-loop<L>-j", team_name="<team_name>", model="haiku", run_in_background=true, description="Bugfind audit PR <N> loop <L> variant j", prompt="<audit XML; write outcome to <run_temp_dir>/pr-<N>/loop-<L>-j.outcomes.xml; skip PR posting>")
|
|
75
|
+
Agent(subagent_type="code-quality-agent", name="bugfind-pr<N>-loop<L>-k", team_name="<team_name>", model="haiku", run_in_background=true, description="Bugfind audit PR <N> loop <L> variant k", prompt="<audit XML; write outcome to <run_temp_dir>/pr-<N>/loop-<L>-k.outcomes.xml; skip PR posting>")
|
|
82
76
|
```
|
|
83
77
|
|
|
84
|
-
Teammate `-a` is the
|
|
85
|
-
|
|
86
|
-
Shutdown order: parallel `SendMessage` to `b` and `c`, then `a`:
|
|
78
|
+
Teammate `-a` is the opus validator: polls for all 10 sibling XMLs at explicit absolute paths under `<run_temp_dir>/pr-<N>` (60s timeout, 2s interval; on timeout: log diagnostics entry, proceed with validated findings from available XMLs), then validates each finding — file exists, line in bounds, excerpt matches claimed line, category is A–J, severity is P0/P1/P2. Hallucinated findings are quarantined to `<run_temp_dir>/pr-<N>/loop-<L>-diagnostics.json` under `validator_rejected`. Valid findings are de-duplicated by `(file, line, category)` (max severity wins, keep longest description on conflict) and re-assigned merged IDs as `loop<L>-<K>`. The `-a` prompt must embed sibling paths as literal absolutes so `Read` works without discovery.
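
The de-duplication and re-ID step can be sketched as below. This is an illustration under stated assumptions, not the validator's actual logic: the `Finding` dataclass and `merge_findings` are hypothetical names, P0 is assumed to be the most severe level, and "keep longest description" is read as applying independently of which duplicate supplied the severity.

```python
from dataclasses import dataclass, replace
from typing import Dict, Iterable, List, Tuple

# Assumption: lower rank means more severe, so P0 outranks P1 and P2.
SEVERITY_RANK = {"P0": 0, "P1": 1, "P2": 2}


@dataclass
class Finding:
    file: str
    line: int
    category: str
    severity: str
    description: str


def merge_findings(findings: Iterable[Finding], loop: int) -> List[Tuple[str, Finding]]:
    """De-duplicate by (file, line, category) and assign merged IDs loop<L>-<K>."""
    merged: Dict[Tuple[str, int, str], Finding] = {}
    for finding in findings:
        key = (finding.file, finding.line, finding.category)
        kept = merged.get(key)
        if kept is None:
            merged[key] = replace(finding)  # copy so the inputs stay untouched
            continue
        if SEVERITY_RANK[finding.severity] < SEVERITY_RANK[kept.severity]:
            kept.severity = finding.severity  # max severity wins
        if len(finding.description) > len(kept.description):
            kept.description = finding.description  # keep longest description
    # Dict insertion order preserves first-seen order; K is 1-based.
    return [(f"loop{loop}-{k}", f) for k, f in enumerate(merged.values(), start=1)]
```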
|
|
87
79
|
|
|
88
|
-
|
|
89
|
-
SendMessage(to="bugfind-loop-<N>-b", message={"type": "shutdown_request", "reason": "variant XML captured"})
|
|
90
|
-
SendMessage(to="bugfind-loop-<N>-c", message={"type": "shutdown_request", "reason": "variant XML captured"})
|
|
91
|
-
```
|
|
92
|
-
|
|
93
|
-
then
|
|
94
|
-
|
|
95
|
-
```
|
|
96
|
-
SendMessage(to="bugfind-loop-<N>-a", message={"type": "shutdown_request", "reason": "merged review posted"})
|
|
97
|
-
```
|
|
80
|
+
All subagents self-terminate via background completion. The lead awaits only the validator (`-a`) notification (120s timeout). A missing notification is a hard blocker.
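
For the validator's wait on the ten sibling XMLs (60s timeout, 2s interval), a polling loop might look like the sketch below. `sibling_xml_paths` and `wait_for_files` are hypothetical helpers, and the integer `pr_number` and `loop` arguments are illustrative stand-ins for `<N>` and `<L>`.

```python
import time
from pathlib import Path
from typing import List, Sequence, Tuple

SIBLING_SUFFIXES = "bcdefghijk"  # the ten variant auditors, b through k


def sibling_xml_paths(run_temp_dir: Path, pr_number: int, loop: int) -> List[Path]:
    """Build the explicit sibling outcome paths under <run_temp_dir>/pr-<N>."""
    return [
        run_temp_dir / f"pr-{pr_number}" / f"loop-{loop}-{suffix}.outcomes.xml"
        for suffix in SIBLING_SUFFIXES
    ]


def wait_for_files(
    paths: Sequence[Path], timeout_s: float = 60.0, interval_s: float = 2.0
) -> Tuple[List[Path], List[Path]]:
    """Poll until every path exists or the timeout passes.

    Returns (present, missing). A non-empty missing list means the caller
    should log a diagnostics entry and proceed with the available XMLs.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        present = [path for path in paths if path.is_file()]
        if len(present) == len(paths) or time.monotonic() >= deadline:
            missing = [path for path in paths if path not in present]
            return present, missing
        time.sleep(interval_s)
```

Using `time.monotonic()` keeps the deadline stable even if the wall clock is adjusted during the wait.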
|
|
98
81
|
|
|
99
82
|
## FIX action (fresh teammate)
|
|
100
83
|
|
|
@@ -106,17 +89,7 @@ After replies, the teammate writes outcome XML (schema in [`../PROMPTS.md`](../P
|
|
|
106
89
|
|
|
107
90
|
### Shutdown (bugfix)
|
|
108
91
|
|
|
109
|
-
Same self-termination
|
|
110
|
-
|
|
111
|
-
```
|
|
112
|
-
SendMessage(
|
|
113
|
-
to="bugfix",
|
|
114
|
-
message={
|
|
115
|
-
"type": "shutdown_request",
|
|
116
|
-
"reason": "fix loop <N> complete; commit <sha7> pushed"
|
|
117
|
-
}
|
|
118
|
-
)
|
|
119
|
-
```
|
|
92
|
+
Same self-termination model as bugfind. Missing notification → hard blocker.
|
|
120
93
|
|
|
121
94
|
`approve: false` → `error: bugfix teammate refused shutdown` → Step 4 then 5.
|
|
122
95
|
|
|
@@ -8,7 +8,7 @@ Shared output schema and audit-loop contract used by `/bugteam`, `/qbug`, `/find
|
|
|
8
8
|
- Adversarial second pass
|
|
9
9
|
- Haiku secondary auditor
|
|
10
10
|
- Post-fix self-audit
|
|
11
|
-
- Persistence (loop
|
|
11
|
+
- Persistence (loop-<L>-audit.json, loop-<L>-diagnostics.json)
|
|
12
12
|
|
|
13
13
|
## Finding schema
|
|
14
14
|
|
|
@@ -18,7 +18,7 @@ Each finding an audit produces MUST be one of exactly two shapes.
|
|
|
18
18
|
|
|
19
19
|
```json
|
|
20
20
|
{
|
|
21
|
-
"id": "loop<
|
|
21
|
+
"id": "loop<L>-<K>",
|
|
22
22
|
"file": "path/relative/to/repo/root.py",
|
|
23
23
|
"line": 123,
|
|
24
24
|
"category": "A | B | C | D | E | F | G | H | I | J",
|
|
@@ -29,7 +29,7 @@ Each finding an audit produces MUST be one of exactly two shapes.
|
|
|
29
29
|
}
|
|
30
30
|
```
|
|
31
31
|
|
|
32
|
-
`id` is `loop<
|
|
32
|
+
`id` is `loop<L>-<K>` where `L` is the loop counter (1-based) and `K` is the 1-based index within the loop. For `/findbugs` which runs once, use `find<K>`.
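
As an illustration of the ID convention, a small formatter might look like this. `make_finding_id` is a hypothetical helper, not part of the package.

```python
from typing import Optional


def make_finding_id(index: int, loop: Optional[int] = None) -> str:
    """Format a finding ID: loop<L>-<K> for looped audits, find<K> for single runs."""
    if index < 1:
        raise ValueError("index K is 1-based")
    if loop is None:
        return f"find{index}"  # single-run form, as used by /findbugs
    if loop < 1:
        raise ValueError("loop counter L is 1-based")
    return f"loop{loop}-{index}"
```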
|
|
33
33
|
|
|
34
34
|
### Shape B — structured proof-of-absence
|
|
35
35
|
|
|
@@ -105,9 +105,9 @@ Merge rules:
|
|
|
105
105
|
- **Unique-to-Haiku findings**: added to the primary set with Haiku's severity and source annotation.
|
|
106
106
|
- **Unique-to-primary findings**: kept as-is.
|
|
107
107
|
- **Zero Haiku findings**: primary set trusted; proceed.
|
|
108
|
-
- **Malformed or non-parseable Haiku output**: lead trusts the primary set, logs the event in `loop-<
|
|
108
|
+
- **Malformed or non-parseable Haiku output**: lead trusts the primary set, logs the event in `loop-<L>-diagnostics.json` under `haiku_findings` as `[{"parse_error": "<message>"}]`.
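
The malformed-output rule can be sketched as follows. This assumes, for illustration only, that the Haiku auditor's raw output is a JSON array of findings; `haiku_findings_entry` is a hypothetical helper that returns the value to store under `haiku_findings`.

```python
import json
from typing import Any, List


def haiku_findings_entry(raw_output: str) -> List[Any]:
    """Return parsed Haiku findings, or a single parse_error record on failure."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError as error:
        # The lead trusts the primary set; only the event itself is recorded.
        return [{"parse_error": str(error)}]
    if not isinstance(parsed, list):
        return [{"parse_error": "expected a JSON array of findings"}]
    return parsed
```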
|
|
109
109
|
|
|
110
|
-
For multi-subagent skills (`/bugteam`) the parallel-auditors pattern in [`audit-and-teammates.md`](audit-and-teammates.md) already provides cross-model coverage via
|
|
110
|
+
For multi-subagent skills (`/bugteam`) the parallel-auditors pattern in [`audit-and-teammates.md`](audit-and-teammates.md) already provides cross-model coverage via 10 haiku auditors + opus validator.
|
|
111
111
|
|
|
112
112
|
## Post-fix self-audit
|
|
113
113
|
|
|
@@ -131,7 +131,7 @@ Sequence:
|
|
|
131
131
|
|
|
132
132
|
Every audit loop writes two JSON files under the skill's scoped temp directory (resolved via `tempfile.gettempdir()`):
|
|
133
133
|
|
|
134
|
-
### `loop-<
|
|
134
|
+
### `loop-<L>-audit.json`
|
|
135
135
|
|
|
136
136
|
```json
|
|
137
137
|
{
|
|
@@ -141,7 +141,7 @@ Every audit loop writes two JSON files under the skill's scoped temp directory (
|
|
|
141
141
|
}
|
|
142
142
|
```
|
|
143
143
|
|
|
144
|
-
### `loop-<
|
|
144
|
+
### `loop-<L>-diagnostics.json`
|
|
145
145
|
|
|
146
146
|
```json
|
|
147
147
|
{
|
|
@@ -2,13 +2,9 @@
|
|
|
2
2
|
|
|
3
3
|
## Core principle (expanded)
|
|
4
4
|
|
|
5
|
-
|
|
5
|
+
Background subagents (`Agent(..., run_in_background=true)`) run the audit-and-fix loop until convergence. The bugfind subagent audits clean-room (own context window, no chat history); the bugfix subagent addresses each audit’s findings; both spawn fresh per loop with no shared state. A 10-loop hard cap prevents runaway cost. Project permissions are granted at session start and revoked at session end.
|
|
6
6
|
|
|
7
|
-
|
|
8
|
-
|
|
9
|
-
## Why not parallel subagents here
|
|
10
|
-
|
|
11
|
-
Subagents return their results into the lead’s context, which accumulates across loops. Agent-team teammates are independent sessions with their own context windows and do not pollute the lead. The lead can shut down and respawn each loop so every audit starts fresh. For `/bugteam`, the independent-context property is required; parallel subagents fail the clean-room requirement. Supporting quotes: [`../sources.md`](../sources.md) (subagents vs agent teams).
|
|
7
|
+
Fresh-spawn clean-room isolation: each `Agent` call creates a new subagent with its own context window and no access to prior conversation. After the subagent writes its outcome XML and self-terminates, the lead reads the file. Results never accumulate in the lead’s context beyond the XML artifact. Verbatim Anthropic quotes and URLs: [`../sources.md`](../sources.md).
|
|
12
8
|
|
|
13
9
|
## Table of contents in `SKILL.md`
|
|
14
10
|
|
|
@@ -20,9 +16,8 @@ The user wants automated convergence on a clean PR without babysitting each step
|
|
|
20
16
|
|
|
21
17
|
### Refusal reasons (detail)
|
|
22
18
|
|
|
23
|
-
- **Agent teams off:** Without the feature flag, the workflow cannot run.
|
|
24
19
|
- **No PR / diff:** There is nothing scoped to audit.
|
|
25
|
-
- **Dirty tree:** The fix
|
|
20
|
+
- **Dirty tree:** The fix subagent will commit; uncommitted local work would be mixed into automated commits.
|
|
26
21
|
- **Missing subagents:** Both `code-quality-agent` and `clean-coder` must exist in the environment before Step 0.
|
|
27
22
|
|
|
28
23
|
Exact refusal strings remain in `SKILL.md`.
|