@aarushpandey/gitagent 1.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CONTRIBUTING.md +104 -0
- package/LICENSE +21 -0
- package/README.md +570 -0
- package/TESTING.md +290 -0
- package/action.yml +113 -0
- package/examples/README.md +124 -0
- package/examples/sample-audit-trail-issue-4.md +112 -0
- package/examples/sample-review-tqec-pr894-v1-raw-flawed.md +71 -0
- package/examples/sample-review-tqec-pr894-v2-raw.md +48 -0
- package/examples/sample-review-tqec-pr894-v3-curated.md +118 -0
- package/examples/verify-marker-precedence/README.md +97 -0
- package/examples/verify-marker-precedence/conftest.py +15 -0
- package/examples/verify-marker-precedence/pyproject.toml +8 -0
- package/examples/verify-marker-precedence/test_marker_precedence.py +56 -0
- package/examples/verify-marker-precedence/verify_precedence.py +67 -0
- package/examples/workflows/issue-fix.yml +32 -0
- package/examples/workflows/pr-review.yml +34 -0
- package/package.json +75 -0
- package/scripts/verify.js +478 -0
- package/src/agents/agentLoop.js +176 -0
- package/src/agents/engineeringAgent.js +51 -0
- package/src/agents/reviewCopilot.js +79 -0
- package/src/agents/tools.js +486 -0
- package/src/cli/output.js +137 -0
- package/src/config.js +22 -0
- package/src/mapper/fileRelevance.js +113 -0
- package/src/mapper/repoMap.js +105 -0
- package/src/orchestrator.js +336 -0
- package/src/pipeline.js +985 -0
- package/src/prompts/engineering.js +189 -0
- package/src/prompts/review.js +149 -0
- package/src/utils/cost.js +47 -0
- package/src/utils/diffLines.js +67 -0
- package/src/utils/githubUrl.js +8 -0
- package/src/web/public/index.html +128 -0
- package/src/web/server.js +51 -0
package/TESTING.md
ADDED
|
@@ -0,0 +1,290 @@
|
|
|
1
|
+
# Live testing guide
|
|
2
|
+
|
|
3
|
+
`npm test` covers logic. This guide walks through exercising every shipped
|
|
4
|
+
feature against **real GitHub** (or a tightly-scoped sandbox) so you can
|
|
5
|
+
verify behaviour end-to-end.
|
|
6
|
+
|
|
7
|
+
## Set up a sandbox repository (recommended)
|
|
8
|
+
|
|
9
|
+
The safest way to test destructive behaviours (PR opening, commits, comments)
|
|
10
|
+
is on a throwaway repo you own. Five minutes of setup buys you confidence.
|
|
11
|
+
|
|
12
|
+
```bash
|
|
13
|
+
# 1. Make a tiny buggy Node project
|
|
14
|
+
mkdir ~/agent-sandbox && cd ~/agent-sandbox
|
|
15
|
+
git init -b main
|
|
16
|
+
npm init -y
|
|
17
|
+
cat > calc.js <<'EOF'
|
|
18
|
+
function add(a, b) { return a - b; } // deliberate bug
|
|
19
|
+
module.exports = { add };
|
|
20
|
+
EOF
|
|
21
|
+
cat > calc.test.js <<'EOF'
|
|
22
|
+
const { add } = require('./calc');
|
|
23
|
+
test('add returns sum', () => {
|
|
24
|
+
expect(add(2, 3)).toBe(5);
|
|
25
|
+
});
|
|
26
|
+
EOF
|
|
27
|
+
npm install --save-dev jest >/dev/null 2>&1
|
|
28
|
+
npm pkg set scripts.test=jest
|
|
29
|
+
git add -A && git commit -m "Initial buggy calculator"
|
|
30
|
+
|
|
31
|
+
# 2. Push to GitHub (assumes gh CLI; otherwise create on github.com manually)
|
|
32
|
+
gh repo create agent-sandbox --public --source=. --push
|
|
33
|
+
gh issue create --title "add() returns wrong result" \
|
|
34
|
+
--body "Expected add(2, 3) to be 5 but tests fail. Please fix."
|
|
35
|
+
```
|
|
36
|
+
|
|
37
|
+
Replace `<you>` with your GitHub username everywhere below.
|
|
38
|
+
|
|
39
|
+
---
|
|
40
|
+
|
|
41
|
+
## 1. Dry-run end-to-end (the smoke test)
|
|
42
|
+
|
|
43
|
+
Verifies: agent loop, tool use, run_tests, self-review, audit trail, dashboard.
|
|
44
|
+
**No commit / push / PR.**
|
|
45
|
+
|
|
46
|
+
```bash
|
|
47
|
+
cd ~/desktop/github-agent
|
|
48
|
+
node src/pipeline.js issue https://github.com/<you>/agent-sandbox/issues/1 \
|
|
49
|
+
--dry-run --web
|
|
50
|
+
```
|
|
51
|
+
|
|
52
|
+
Open `http://localhost:3000` in a browser to watch the live event feed.
|
|
53
|
+
|
|
54
|
+
**Expected:** terminal shows turn-by-turn thoughts and tool calls; dashboard
|
|
55
|
+
mirrors them; audit trail written to `repos/<you>-agent-sandbox/audit-trail.md`.
|
|
56
|
+
|
|
57
|
+
```bash
|
|
58
|
+
# Inspect the audit trail
|
|
59
|
+
cat ~/desktop/github-agent/repos/<you>-agent-sandbox/audit-trail.md
|
|
60
|
+
```
|
|
61
|
+
|
|
62
|
+
You should see the new sections: **Outcome**, **Safety gates**, **Files
|
|
63
|
+
touched**, **Test runs**, **Timeline (condensed)**, **Self-review report**,
|
|
64
|
+
collapsed **Full tool transcript**.
|
|
65
|
+
|
|
66
|
+
---
|
|
67
|
+
|
|
68
|
+
## 2. Real PR shipment (sandbox)
|
|
69
|
+
|
|
70
|
+
```bash
|
|
71
|
+
node src/pipeline.js issue https://github.com/<you>/agent-sandbox/issues/1
|
|
72
|
+
```
|
|
73
|
+
|
|
74
|
+
**Expected:** `pushed fix/issue-1 to <you>/agent-sandbox` → `PR opened: ...`.
|
|
75
|
+
|
|
76
|
+
Open the PR on github.com — the body should include `Resolves #1`, the
|
|
77
|
+
engineering summary, and the collapsed self-review report.
|
|
78
|
+
|
|
79
|
+
---
|
|
80
|
+
|
|
81
|
+
## 3. PR safety gate (the new "won't ship bad work" guarantee)
|
|
82
|
+
|
|
83
|
+
Pick an issue the agent will likely fail to fix cleanly — e.g. ask it to
|
|
84
|
+
"prove the Riemann hypothesis in calc.js." Or break the test command so no
|
|
85
|
+
test ever passes.
|
|
86
|
+
|
|
87
|
+
```bash
|
|
88
|
+
gh issue create --title "Implement Riemann hypothesis proof in calc.js" \
|
|
89
|
+
--body "Please add a complete formal proof of the Riemann hypothesis."
|
|
90
|
+
node src/pipeline.js issue https://github.com/<you>/agent-sandbox/issues/2
|
|
91
|
+
```
|
|
92
|
+
|
|
93
|
+
**Expected:**
|
|
94
|
+
- Agent either calls `give_up` (best case), or calls `finish` after a flailing
|
|
95
|
+
patch with no passing test run.
|
|
96
|
+
- Pipeline prints `gate: ...` lines and **refuses to open the PR**.
|
|
97
|
+
- Console says `Refusing to open a PR. Re-run with --force-pr to override`.
|
|
98
|
+
|
|
99
|
+
To verify the override works:
|
|
100
|
+
|
|
101
|
+
```bash
|
|
102
|
+
node src/pipeline.js issue https://github.com/<you>/agent-sandbox/issues/2 --force-pr
|
|
103
|
+
```
|
|
104
|
+
|
|
105
|
+
---
|
|
106
|
+
|
|
107
|
+
## 4. `give_up` graceful exit
|
|
108
|
+
|
|
109
|
+
The agent calls `give_up` when an issue is out of scope. Easiest way to force
|
|
110
|
+
it: file an issue that needs missing infrastructure.
|
|
111
|
+
|
|
112
|
+
```bash
|
|
113
|
+
gh issue create --title "Add CUDA-accelerated matrix solver" \
|
|
114
|
+
--body "We need a GPU-backed matrix solver. Implement it."
|
|
115
|
+
node src/pipeline.js issue https://github.com/<you>/agent-sandbox/issues/3 \
|
|
116
|
+
--comment
|
|
117
|
+
```
|
|
118
|
+
|
|
119
|
+
**Expected:**
|
|
120
|
+
- Audit trail outcome: `❌ Gave up — too_complex` (or similar reason) with
|
|
121
|
+
explanation + blockers.
|
|
122
|
+
- A comment posted on issue #3 with the give-up reason and blockers list.
|
|
123
|
+
- No PR opened.
|
|
124
|
+
|
|
125
|
+
---
|
|
126
|
+
|
|
127
|
+
## 5. Cost ceiling kill switch
|
|
128
|
+
|
|
129
|
+
```bash
|
|
130
|
+
node src/pipeline.js issue https://github.com/<you>/agent-sandbox/issues/1 \
|
|
131
|
+
--max-cost=0.01 --dry-run
|
|
132
|
+
```
|
|
133
|
+
|
|
134
|
+
**Expected:** terminal prints `Cost limit hit at turn N: $X > $0.01` after
|
|
135
|
+
the first or second turn. Audit trail's Outcome section shows
|
|
136
|
+
`⚠ Did not finish — cost_limit`.
|
|
137
|
+
|
|
138
|
+
---
|
|
139
|
+
|
|
140
|
+
## 6. Triage subcommand
|
|
141
|
+
|
|
142
|
+
Process up to N open issues by label in one batch.
|
|
143
|
+
|
|
144
|
+
```bash
|
|
145
|
+
gh issue create --title "Doc typo in README" --body "Says 'helo'." --label bug
|
|
146
|
+
gh issue create --title "calc.js needs JSDoc" --body "No comments." --label bug
|
|
147
|
+
|
|
148
|
+
node src/pipeline.js triage https://github.com/<you>/agent-sandbox \
|
|
149
|
+
--label=bug --max=3 --dry-run --web
|
|
150
|
+
```
|
|
151
|
+
|
|
152
|
+
**Expected:** triage summary at the end with one line per issue: `✓` (PR
|
|
153
|
+
opened), `[dry]`, or `✗` (skipped/failed). `total spend: $X.XXXX` at bottom.
|
|
154
|
+
|
|
155
|
+
---
|
|
156
|
+
|
|
157
|
+
## 7. Duplicate-PR guard
|
|
158
|
+
|
|
159
|
+
Run the same issue command twice in a row.
|
|
160
|
+
|
|
161
|
+
```bash
|
|
162
|
+
node src/pipeline.js issue https://github.com/<you>/agent-sandbox/issues/1
|
|
163
|
+
# … wait for it to open the PR …
|
|
164
|
+
node src/pipeline.js issue https://github.com/<you>/agent-sandbox/issues/1
|
|
165
|
+
```
|
|
166
|
+
|
|
167
|
+
**Expected (second run):**
|
|
168
|
+
```
|
|
169
|
+
⚠ An open PR already resolves issue #1: https://github.com/<you>/agent-sandbox/pull/N
|
|
170
|
+
⚠ Skipping. Re-run with --force-pr to process anyway.
|
|
171
|
+
```
|
|
172
|
+
|
|
173
|
+
No clone, no agent loop, no token spend.
|
|
174
|
+
|
|
175
|
+
---
|
|
176
|
+
|
|
177
|
+
## 8. Fork-and-PR mode (works on repos you don't own)
|
|
178
|
+
|
|
179
|
+
Pick a public repo where you don't have write access. Bonus: pick one where
|
|
180
|
+
you've got a small bug to fix already.
|
|
181
|
+
|
|
182
|
+
```bash
|
|
183
|
+
node src/pipeline.js issue https://github.com/some/public-repo/issues/123 \
|
|
184
|
+
--fork --comment --dry-run
|
|
185
|
+
```
|
|
186
|
+
|
|
187
|
+
**Expected:**
|
|
188
|
+
- `✓ fork: <you>/public-repo` (creates fork on first run, finds existing
|
|
189
|
+
fork on subsequent).
|
|
190
|
+
- After the agent finishes, terminal would push to `<you>/public-repo` and
|
|
191
|
+
open PR upstream — but `--dry-run` skips that.
|
|
192
|
+
|
|
193
|
+
Drop `--dry-run` to actually ship. The PR appears as `<you>:fix/issue-123 →
|
|
194
|
+
<upstream>:main` and `--comment` posts a link-back on the source issue.
|
|
195
|
+
|
|
196
|
+
---
|
|
197
|
+
|
|
198
|
+
## 9. Review subcommand + `--post`
|
|
199
|
+
|
|
200
|
+
Audit any public PR. Exits non-zero on `REQUEST_CHANGES`, so it's wireable
|
|
201
|
+
into CI as a pre-merge gate.
|
|
202
|
+
|
|
203
|
+
```bash
|
|
204
|
+
node src/pipeline.js review https://github.com/some/public-repo/pull/456
|
|
205
|
+
echo "exit code: $?"
|
|
206
|
+
```
|
|
207
|
+
|
|
208
|
+
**Expected output ends with:**
|
|
209
|
+
```
|
|
210
|
+
▸ VERDICT: APPROVE ← or REQUEST_CHANGES / NEEDS_DISCUSSION / UNKNOWN
|
|
211
|
+
✓ PR looks safe to merge (per automated review).
|
|
212
|
+
exit code: 0
|
|
213
|
+
```
|
|
214
|
+
|
|
215
|
+
**With `--post`:**
|
|
216
|
+
|
|
217
|
+
```bash
|
|
218
|
+
node src/pipeline.js review https://github.com/some/public-repo/pull/456 --post
|
|
219
|
+
```
|
|
220
|
+
|
|
221
|
+
Submits the review as a PR review comment (or falls back to an issue
|
|
222
|
+
comment if the token can't submit a formal review).
|
|
223
|
+
|
|
224
|
+
---
|
|
225
|
+
|
|
226
|
+
## 10. Big-project / monorepo / lint detection
|
|
227
|
+
|
|
228
|
+
This is hard to fully test without actually running on Qiskit-class repos
|
|
229
|
+
(which the cost ceiling makes risky). The cheap proxy is to confirm the
|
|
230
|
+
detection logic fires on the right project shape.
|
|
231
|
+
|
|
232
|
+
```bash
|
|
233
|
+
# In a Python project with pyproject.toml + tox.ini + ruff config:
|
|
234
|
+
cd /path/to/python-project
|
|
235
|
+
node ~/desktop/github-agent/src/pipeline.js issue \
|
|
236
|
+
https://github.com/<you>/<repo>/issues/X --dry-run | head -20
|
|
237
|
+
```
|
|
238
|
+
|
|
239
|
+
**Expected (in the "Cloning + branching" section):**
|
|
240
|
+
```
|
|
241
|
+
test command: tox ← or `make test` / `nox` if those exist
|
|
242
|
+
lint commands: ruff check ., black --check ., mypy .
|
|
243
|
+
monorepo sub-packages: terra, aer, ... ← only if subdirs have own pyproject.toml
|
|
244
|
+
guessed sub-package for issue: terra
|
|
245
|
+
CONTRIBUTING.md found at CONTRIBUTING.md
|
|
246
|
+
N file(s) prefiltered as likely relevant
|
|
247
|
+
```
|
|
248
|
+
|
|
249
|
+
---
|
|
250
|
+
|
|
251
|
+
## 11. `apply_patch` fallback + `apply_patch_range`
|
|
252
|
+
|
|
253
|
+
Run the agent on a Python file with mixed tabs/spaces and watch the audit
|
|
254
|
+
trail. Look for `(whitespace-normalized match)` in the engineering timeline.
|
|
255
|
+
The fallback fires when the model produces a slightly off whitespace pattern.
|
|
256
|
+
|
|
257
|
+
The `apply_patch_range` tool is the agent's last-resort knob — you'll see
|
|
258
|
+
`replaced lines X-Y of <file>` in the timeline when it kicks in.
|
|
259
|
+
|
|
260
|
+
These are mostly observable through audit trails, not via a deterministic
|
|
261
|
+
trigger script.
|
|
262
|
+
|
|
263
|
+
---
|
|
264
|
+
|
|
265
|
+
## 12. CI matrix verification
|
|
266
|
+
|
|
267
|
+
Push the branch to GitHub. The `tests` workflow runs the suite on **Linux
|
|
268
|
+
/ macOS / Windows × Node 18 / 20 / 22**.
|
|
269
|
+
|
|
270
|
+
```bash
|
|
271
|
+
gh run watch # follow the latest run live
|
|
272
|
+
```
|
|
273
|
+
|
|
274
|
+
**Expected:** all 9 cells green. The most likely failure is Windows + npm:
|
|
275
|
+
`parseAllowlistedCommand` shims `npm.cmd` on Windows; if a new node-based
|
|
276
|
+
tool gets added to the allowlist without the shim, that test cell will be
|
|
277
|
+
the first to fail.
|
|
278
|
+
|
|
279
|
+
---
|
|
280
|
+
|
|
281
|
+
## When something looks wrong
|
|
282
|
+
|
|
283
|
+
- **`audit-trail.md` is your friend.** It has Outcome, Safety gates, Files
|
|
284
|
+
touched, Test runs, and a condensed Timeline up top — and a full
|
|
285
|
+
`<details>` transcript at the bottom for raw debugging.
|
|
286
|
+
- **The dashboard (`--web`)** is the fastest way to see *why* the agent
|
|
287
|
+
picked the wrong file or burned a turn on a dead end.
|
|
288
|
+
- **Re-run with `--max-cost=0.50`** while debugging so a runaway loop can't
|
|
289
|
+
rack up serious money.
|
|
290
|
+
- **`--dry-run` everything** until you trust the run on that repo type.
|
package/action.yml
ADDED
|
@@ -0,0 +1,113 @@
|
|
|
1
|
+
name: 'github-agent'
|
|
2
|
+
description: 'Autonomous PR review and issue-to-PR engineering with Claude — review every PR or auto-fix labeled issues, no local setup.'
|
|
3
|
+
author: 'Hadar01'
|
|
4
|
+
branding:
|
|
5
|
+
icon: 'git-pull-request'
|
|
6
|
+
color: 'purple'
|
|
7
|
+
|
|
8
|
+
inputs:
|
|
9
|
+
command:
|
|
10
|
+
description: 'Which pipeline to run: review | issue | triage'
|
|
11
|
+
required: true
|
|
12
|
+
target:
|
|
13
|
+
description: >-
|
|
14
|
+
GitHub issue / PR / repo URL to act on. If omitted, it is derived from the
|
|
15
|
+
triggering event (the PR URL on pull_request, the issue URL on issues).
|
|
16
|
+
required: false
|
|
17
|
+
default: ''
|
|
18
|
+
anthropic-api-key:
|
|
19
|
+
description: 'Anthropic API key. Store it as an encrypted repo/org secret.'
|
|
20
|
+
required: true
|
|
21
|
+
github-token:
|
|
22
|
+
description: 'Token used for GitHub API calls. Defaults to the workflow token.'
|
|
23
|
+
required: false
|
|
24
|
+
default: ${{ github.token }}
|
|
25
|
+
post:
|
|
26
|
+
description: '(review) Post the review back to the PR. true/false.'
|
|
27
|
+
required: false
|
|
28
|
+
default: 'true'
|
|
29
|
+
comment:
|
|
30
|
+
description: '(issue) Comment a link-back on the source issue after the PR opens. true/false.'
|
|
31
|
+
required: false
|
|
32
|
+
default: 'false'
|
|
33
|
+
fork:
|
|
34
|
+
description: '(issue) Push to your fork and open the PR from there. true/false.'
|
|
35
|
+
required: false
|
|
36
|
+
default: 'false'
|
|
37
|
+
fail-on-request-changes:
|
|
38
|
+
description: >-
|
|
39
|
+
(review) If true, a REQUEST_CHANGES / NEEDS_DISCUSSION verdict fails the
|
|
40
|
+
check and blocks merge. If false (default), the review is advisory: it
|
|
41
|
+
posts findings but never turns the check red.
|
|
42
|
+
required: false
|
|
43
|
+
default: 'false'
|
|
44
|
+
max-cost:
|
|
45
|
+
description: 'Hard USD ceiling for the run (passed as --max-cost). Empty = project default.'
|
|
46
|
+
required: false
|
|
47
|
+
default: ''
|
|
48
|
+
node-version:
|
|
49
|
+
description: 'Node.js version to run the agent with.'
|
|
50
|
+
required: false
|
|
51
|
+
default: '20'
|
|
52
|
+
extra-args:
|
|
53
|
+
description: 'Additional raw CLI flags appended verbatim (e.g. "--label=bug --max=3").'
|
|
54
|
+
required: false
|
|
55
|
+
default: ''
|
|
56
|
+
|
|
57
|
+
outputs:
|
|
58
|
+
verdict:
|
|
59
|
+
description: 'Review verdict: APPROVE / REQUEST_CHANGES / NEEDS_DISCUSSION / UNKNOWN.'
|
|
60
|
+
value: ${{ steps.run.outputs.verdict }}
|
|
61
|
+
|
|
62
|
+
runs:
|
|
63
|
+
using: 'composite'
|
|
64
|
+
steps:
|
|
65
|
+
- name: Set up Node.js
|
|
66
|
+
uses: actions/setup-node@v4
|
|
67
|
+
with:
|
|
68
|
+
node-version: ${{ inputs.node-version }}
|
|
69
|
+
|
|
70
|
+
- name: Install github-agent dependencies
|
|
71
|
+
shell: bash
|
|
72
|
+
run: npm ci --omit=dev --prefix "${{ github.action_path }}"
|
|
73
|
+
|
|
74
|
+
- name: Resolve target URL
|
|
75
|
+
id: target
|
|
76
|
+
shell: bash
|
|
77
|
+
env:
|
|
78
|
+
EXPLICIT_TARGET: ${{ inputs.target }}
|
|
79
|
+
PR_URL: ${{ github.event.pull_request.html_url }}
|
|
80
|
+
ISSUE_URL: ${{ github.event.issue.html_url }}
|
|
81
|
+
run: |
|
|
82
|
+
TARGET="$EXPLICIT_TARGET"
|
|
83
|
+
if [ -z "$TARGET" ]; then
|
|
84
|
+
if [ -n "$PR_URL" ]; then
|
|
85
|
+
TARGET="$PR_URL"
|
|
86
|
+
elif [ -n "$ISSUE_URL" ]; then
|
|
87
|
+
TARGET="$ISSUE_URL"
|
|
88
|
+
fi
|
|
89
|
+
fi
|
|
90
|
+
if [ -z "$TARGET" ]; then
|
|
91
|
+
echo "::error::No target URL. Set the 'target' input, or trigger on pull_request / issues."
|
|
92
|
+
exit 1
|
|
93
|
+
fi
|
|
94
|
+
echo "url=$TARGET" >> "$GITHUB_OUTPUT"
|
|
95
|
+
|
|
96
|
+
- name: Run github-agent
|
|
97
|
+
id: run
|
|
98
|
+
shell: bash
|
|
99
|
+
env:
|
|
100
|
+
ANTHROPIC_API_KEY: ${{ inputs.anthropic-api-key }}
|
|
101
|
+
GITHUB_TOKEN: ${{ inputs.github-token }}
|
|
102
|
+
run: |
|
|
103
|
+
ARGS=()
|
|
104
|
+
if [ "${{ inputs.command }}" = "review" ]; then
|
|
105
|
+
[ "${{ inputs.post }}" = "true" ] && ARGS+=(--post)
|
|
106
|
+
[ "${{ inputs.fail-on-request-changes }}" != "true" ] && ARGS+=(--advisory)
|
|
107
|
+
fi
|
|
108
|
+
[ "${{ inputs.comment }}" = "true" ] && ARGS+=(--comment)
|
|
109
|
+
[ "${{ inputs.fork }}" = "true" ] && ARGS+=(--fork)
|
|
110
|
+
[ -n "${{ inputs.max-cost }}" ] && ARGS+=(--max-cost=${{ inputs.max-cost }})
|
|
111
|
+
node "${{ github.action_path }}/src/pipeline.js" \
|
|
112
|
+
"${{ inputs.command }}" "${{ steps.target.outputs.url }}" \
|
|
113
|
+
"${ARGS[@]}" ${{ inputs.extra-args }}
|
|
@@ -0,0 +1,124 @@
|
|
|
1
|
+
# Sample artifacts
|
|
2
|
+
|
|
3
|
+
Real outputs produced by `github-agent` on real GitHub issues / PRs. Two
|
|
4
|
+
families of artifacts:
|
|
5
|
+
|
|
6
|
+
- **End-to-end engineering run** — what the tool emits when fixing an issue.
|
|
7
|
+
- **Code-review run, v1 → v2 → v3** — what the tool emits, what changed when
|
|
8
|
+
the prompt was hardened against hallucination, and what survived human
|
|
9
|
+
curation. **This is the artifact we'd actually send to a maintainer.**
|
|
10
|
+
|
|
11
|
+
The review-family files are kept as a *teaching artifact*: they document
|
|
12
|
+
both the failure mode (v1) and the workflow we recommend skeptical
|
|
13
|
+
maintainers actually follow (raw → curated, with a human in the loop).
|
|
14
|
+
|
|
15
|
+
---
|
|
16
|
+
|
|
17
|
+
## Code review of `tqec/tqec` PR #894
|
|
18
|
+
|
|
19
|
+
[Real PR](https://github.com/tqec/tqec/pull/894) — "Add conftest.py to
|
|
20
|
+
increase timeouts for slow tests" — 10 lines, single file, labelled
|
|
21
|
+
`needs-review`. We chose it because it's small enough to fit in a single
|
|
22
|
+
review and contains a real subtle issue.
|
|
23
|
+
|
|
24
|
+
### `sample-review-tqec-pr894-v1-raw-flawed.md` ⚠️ flawed by design
|
|
25
|
+
|
|
26
|
+
The raw output of an early version of the review prompt **before** we
|
|
27
|
+
hardened it against hallucination. Multiple findings turned out to be
|
|
28
|
+
factually wrong when verified against pytest and pytest-timeout source:
|
|
29
|
+
|
|
30
|
+
- claimed `pytest-timeout` might be missing — but `pyproject.toml` pins it;
|
|
31
|
+
- claimed `--timeout` CLI overrides markers — wrong direction per
|
|
32
|
+
[pytest-timeout docs](https://pypi.org/project/pytest-timeout/);
|
|
33
|
+
- claimed an existing `@pytest.mark.timeout(N)` decorator would be
|
|
34
|
+
overridden by the conftest — backwards, the decorator wins.
|
|
35
|
+
|
|
36
|
+
Kept here on purpose so the failure mode is visible in the repo itself,
|
|
37
|
+
and so the diff between v1 and v2 demonstrates what the prompt fix
|
|
38
|
+
actually changed.
|
|
39
|
+
|
|
40
|
+
### `sample-review-tqec-pr894-v2-raw.md`
|
|
41
|
+
|
|
42
|
+
Same PR, same tool, **but**:
|
|
43
|
+
|
|
44
|
+
1. The system prompt now has explicit anti-hallucination rules
|
|
45
|
+
(see `src/prompts/review.js`) — "never claim a dependency might be
|
|
46
|
+
missing without checking the manifest", "never assert library precedence
|
|
47
|
+
without citation, hedge instead", "prefer fewer correct findings to many
|
|
48
|
+
shaky ones".
|
|
49
|
+
2. The pipeline now also fetches dependency-manifest files
|
|
50
|
+
(`pyproject.toml`, `package.json`, …) into the review's file context,
|
|
51
|
+
so the "check the manifest first" rule can actually be satisfied.
|
|
52
|
+
|
|
53
|
+
The v2 output:
|
|
54
|
+
- does **not** speculate about missing dependencies (it can see `pyproject.toml`);
|
|
55
|
+
- explicitly hedges on pytest-timeout precedence (*"I am not certain of the
|
|
56
|
+
precedence rules…please confirm"*) instead of asserting in either direction;
|
|
57
|
+
- recommends a concrete fix that **sidesteps** the precedence ambiguity:
|
|
58
|
+
`if not item.get_closest_marker("timeout")` before adding the marker.
|
|
59
|
+
|
|
60
|
+
This is the actual unedited file the tool wrote.
|
|
61
|
+
|
|
62
|
+
### `sample-review-tqec-pr894-v3-curated.md` — the version a human would post
|
|
63
|
+
|
|
64
|
+
The hand-curated final, distilled from the v2 raw output. Every behavioural
|
|
65
|
+
claim is verified against pytest / pytest-timeout source code with inline
|
|
66
|
+
citations to the source files. Includes:
|
|
67
|
+
|
|
68
|
+
- the verdict (`NEEDS_DISCUSSION`) and why;
|
|
69
|
+
- one concrete actionable suggestion (the marker guard) with a code snippet;
|
|
70
|
+
- one documentation nudge about the coupling to `pyproject.toml`'s baseline;
|
|
71
|
+
- a transparent table of what was cut from the raw output, and why.
|
|
72
|
+
|
|
73
|
+
This is what you would actually paste into the PR thread.
|
|
74
|
+
|
|
75
|
+
### `verify-marker-precedence/` — runtime confirmation of the curated claim
|
|
76
|
+
|
|
77
|
+
A 4-file standalone pytest project (`conftest.py` mirroring PR #894 +
|
|
78
|
+
`pyproject.toml` + a 3-test file + a marker inspector script). Run
|
|
79
|
+
`python verify_precedence.py` and it prints, for each test, all timeout
|
|
80
|
+
markers attached to its item and which one `get_closest_marker` resolves to.
|
|
81
|
+
The recorded output (transcript pinned in that directory's README) makes
|
|
82
|
+
the precedence claim empirically checkable, not just source-cited.
|
|
83
|
+
|
|
84
|
+
### The point
|
|
85
|
+
|
|
86
|
+
The workflow demonstrated by these three files is the workflow we recommend
|
|
87
|
+
to maintainers who don't want AI noise in their PR threads:
|
|
88
|
+
|
|
89
|
+
```
|
|
90
|
+
agent.review() → raw output → human verifies behavioural claims against source → curated post
|
|
91
|
+
```
|
|
92
|
+
|
|
93
|
+
The agent's job is to surface possible concerns and structure them.
|
|
94
|
+
The human's job is to verify, cut, and decide whether to post.
|
|
95
|
+
By default the tool writes the raw output to disk and posts nothing — so
|
|
96
|
+
this curation step is the natural workflow, not an afterthought.
|
|
97
|
+
|
|
98
|
+
---
|
|
99
|
+
|
|
100
|
+
## End-to-end engineering run
|
|
101
|
+
|
|
102
|
+
### `sample-audit-trail-issue-4.md`
|
|
103
|
+
|
|
104
|
+
End-to-end agentic run on [issue #4](https://github.com/Hadar01/github-agents/issues/4)
|
|
105
|
+
of this repo (a small, scoped request to expose `scripts/verify.js` as
|
|
106
|
+
`npm run verify`).
|
|
107
|
+
|
|
108
|
+
Generated by:
|
|
109
|
+
|
|
110
|
+
```bash
|
|
111
|
+
node src/pipeline.js issue https://github.com/Hadar01/github-agents/issues/4 \
|
|
112
|
+
--dry-run --max-cost=1.50
|
|
113
|
+
```
|
|
114
|
+
|
|
115
|
+
What the audit trail demonstrates:
|
|
116
|
+
|
|
117
|
+
- the human-readable section structure (Outcome / Safety gates / Files
|
|
118
|
+
touched / Test runs / Timeline / Self-review / collapsed full transcript);
|
|
119
|
+
- the PR safety gate firing correctly: the self-review verdict was
|
|
120
|
+
`NEEDS_DISCUSSION` and the agent never observed a passing test run, so
|
|
121
|
+
the pipeline refused to open a PR. Adding `--force-pr` would have shipped
|
|
122
|
+
it; the gate is designed to make that an explicit, auditable choice.
|
|
123
|
+
|
|
124
|
+
Cost: **$0.0923** for 6 turns.
|
|
@@ -0,0 +1,112 @@
|
|
|
1
|
+
# Audit trail — issue #4: Add 'npm run verify' script to expose scripts/verify.js
|
|
2
|
+
|
|
3
|
+
**Issue:** https://github.com/Hadar01/github-agents/issues/4
|
|
4
|
+
**Branch:** `fix/issue-4`
|
|
5
|
+
**Pre-fix HEAD:** `4c88caab278a29aba0e37c2478e8259b5963c52d` — revert with `git reset --hard 4c88caab278a29aba0e37c2478e8259b5963c52d`
|
|
6
|
+
**Turns used:** 6 of 18
|
|
7
|
+
**Cost:** $0.0923 (18,939 in, 814 out, 26,693 cache-read)
|
|
8
|
+
|
|
9
|
+
## Outcome
|
|
10
|
+
|
|
11
|
+
✅ **Finished** — in single pass
|
|
12
|
+
|
|
13
|
+
Add `"verify": "node scripts/verify.js"` to the `scripts` block in `package.json`.
|
|
14
|
+
|
|
15
|
+
The repo already shipped `scripts/verify.js` (a live feature-verification script that exercises every code path against synthetic temp directories without hitting external APIs), but it had no corresponding npm script entry, making it undiscoverable for contributors. This one-line addition wires it in alongside the existing `test` and `smoke` scripts so that `npm run verify` from the repo root executes the script. The trailing comma on the `smoke` line was updated accordingly to keep the JSON valid. No other files were changed.
|
|
16
|
+
|
|
17
|
+
## Safety gates
|
|
18
|
+
- Self-review verdict: **NEEDS_DISCUSSION**
|
|
19
|
+
- Tests observed passing: **NO**
|
|
20
|
+
|
|
21
|
+
## Files touched
|
|
22
|
+
- `package.json` — 1 edit(s) via apply_patch
|
|
23
|
+
|
|
24
|
+
## Test runs
|
|
25
|
+
- Total invocations: 1
|
|
26
|
+
- Passed: 0
|
|
27
|
+
- Failed: 1
|
|
28
|
+
|
|
29
|
+
## Timeline (condensed)
|
|
30
|
+
|
|
31
|
+
- **Turn 1** — Let me start by examining the relevant files to understand the current state of the repository.
|
|
32
|
+
- read `package.json`
|
|
33
|
+
- read `scripts/verify.js`
|
|
34
|
+
- **Turn 2** — I can see the current state. The `package.json` has a `scripts` block with `test`, `issue`, `review`, `triage`, and `smoke`, but no `verify` entry. I need to…
|
|
35
|
+
- patched `package.json`
|
|
36
|
+
- **Turn 3** — Now let me verify the change looks correct:
|
|
37
|
+
- read `package.json`
|
|
38
|
+
- **Turn 4** — The change looks correct. Now let me run the test suite to confirm everything still passes:
|
|
39
|
+
- ran tests: `npm test` → FAIL
|
|
40
|
+
- **Turn 5** — The test run returned no output. Let me check the diff and confirm the change is minimal and correct:
|
|
41
|
+
- inspected working diff
|
|
42
|
+
- **Turn 6** — The diff is exactly one line added — the `"verify": "node scripts/verify.js"` entry — plus the required trailing comma on the preceding `smoke` line. This is…
|
|
43
|
+
- signalled finish
|
|
44
|
+
|
|
45
|
+
## Self-review report
|
|
46
|
+
|
|
47
|
+
## 1. Bug Risk
|
|
48
|
+
|
|
49
|
+
**package.json:14** – The new `"verify"` entry uses a bare relative path `scripts/verify.js` without any guard that the file actually exists. This is not a diff-introduced bug per se, but the PR body claims `scripts/verify.js` is already shipped; that cannot be confirmed from the diff alone. If the file is absent or was never committed, `npm run verify` will immediately fail with a Node.js `MODULE_NOT_FOUND` error, giving contributors a worse experience than having no script at all. The reviewer cannot validate the acceptance criterion "exits 0 when all checks pass" without seeing `scripts/verify.js` in the diff or knowing it exists in the tree.
|
|
50
|
+
|
|
51
|
+
**No drift from the issue's stated intent** – The one-line addition matches the acceptance criteria verbatim (`"verify": "node scripts/verify.js"`). The comma fix on the `smoke` line is mechanically necessary to keep JSON valid and is not extra scope.
|
|
52
|
+
|
|
53
|
+
## 2. Edge Cases
|
|
54
|
+
|
|
55
|
+
- **Working-directory sensitivity** – `node scripts/verify.js` resolves `scripts/verify.js` relative to the shell's CWD at invocation time, not relative to `package.json`. npm sets `CWD` to the package root before running scripts, so this is fine in the normal case. However, if `verify.js` itself uses `__dirname`-relative or `process.cwd()`-relative paths internally, running it from a non-root directory via `node scripts/verify.js` directly could differ from `npm run verify`. Not a blocking concern, but worth confirming in `verify.js`.
|
|
56
|
+
- **Node version compatibility** – No `engines` field is visible in the diff. If `verify.js` uses syntax unavailable in older Node versions the repo nominally supports, contributors with older runtimes will get cryptic parse errors.
|
|
57
|
+
- **Exit code propagation** – `npm run` does propagate non-zero exit codes from child processes, so failure signaling should work correctly assuming `verify.js` uses `process.exit(non-zero)` or throws on failure.
|
|
58
|
+
|
|
59
|
+
## 3. Test Coverage
|
|
60
|
+
|
|
61
|
+
The PR adds no automated test that verifies the new script entry is wired correctly (e.g., a CI step that runs `npm run verify`). The issue's acceptance criterion "existing `npm test` continues to pass" implies the existing test suite should still be green, but nothing in the diff shows that `npm test` runs `npm run verify` as part of its suite, nor that CI is configured to exercise the new script. If `scripts/verify.js` is the "live feature-verification script" described in the issue, it should arguably be invoked in CI to provide value; that integration is outside this diff and unaddressed.
|
|
62
|
+
|
|
63
|
+
## 4. Scope Creep
|
|
64
|
+
|
|
65
|
+
No scope creep. The diff is exactly one logical change: adding the `verify` script entry and the syntactically required comma. The PR body accurately describes what was changed.
|
|
66
|
+
|
|
67
|
+
## 5. Verdict
|
|
68
|
+
|
|
69
|
+
**NEEDS_DISCUSSION**
|
|
70
|
+
|
|
71
|
+
The change itself is mechanically correct and tightly scoped, but the diff is incomplete as a reviewable unit: `scripts/verify.js` — the entire substance of what `npm run verify` executes — is not shown and cannot be audited. The primary acceptance criterion ("exits 0 when all checks pass") depends entirely on that file's correctness. Approving a script entry that points at an unreviewed or potentially missing file would be rubber-stamping. Before this can be approved, the reviewer needs confirmation that `scripts/verify.js` exists in the repository tree and, ideally, that its content has been reviewed (or was previously reviewed in a separate PR). If the file is already known-good and present in the repo, a simple comment confirming its existence and a pointer to where it was merged would unblock approval.
|
|
72
|
+
|
|
73
|
+
## Full tool transcript
|
|
74
|
+
|
|
75
|
+
<details><summary>Click to expand — raw tool-call trace for debugging</summary>
|
|
76
|
+
|
|
77
|
+
**[engineering turn 1] thought:** Let me start by examining the relevant files to understand the current state of the repository.
|
|
78
|
+
**[engineering turn 1] read_file** — ok
|
|
79
|
+
```json
|
|
80
|
+
{"path":"package.json"}
|
|
81
|
+
```
|
|
82
|
+
**[engineering turn 1] read_file** — ok
|
|
83
|
+
```json
|
|
84
|
+
{"path":"scripts/verify.js"}
|
|
85
|
+
```
|
|
86
|
+
**[engineering turn 2] thought:** I can see the current state. The `package.json` has a `scripts` block with `test`, `issue`, `review`, `triage`, and `smoke`, but no `verify` entry. I need to add `"verify": "node scripts/verify.js"` to the scripts block.
|
|
87
|
+
**[engineering turn 2] apply_patch** — ok
|
|
88
|
+
```json
|
|
89
|
+
{"path":"package.json","old_string":" \"smoke\": \"node -e \\\"const p = require('./src/pipeline.js'); if (typeof p.buildAuditTrail !== 'function') process.exit(1); console.log('ok');\\\"\"","new_string":" \"smoke\": \"node -e \\\"const p = require('./src/pipeline.js'); if (typeof p.buildAudit
|
|
90
|
+
```
|
|
91
|
+
**[engineering turn 3] thought:** Now let me verify the change looks correct:
|
|
92
|
+
**[engineering turn 3] read_file** — ok
|
|
93
|
+
```json
|
|
94
|
+
{"path":"package.json"}
|
|
95
|
+
```
|
|
96
|
+
**[engineering turn 4] thought:** The change looks correct. Now let me run the test suite to confirm everything still passes:
|
|
97
|
+
**[engineering turn 4] run_tests** — ok
|
|
98
|
+
```json
|
|
99
|
+
{"command":"npm test"}
|
|
100
|
+
```
|
|
101
|
+
**[engineering turn 5] thought:** The test run returned no output. Let me check the diff and confirm the change is minimal and correct:
|
|
102
|
+
**[engineering turn 5] git_diff** — ok
|
|
103
|
+
```json
|
|
104
|
+
{}
|
|
105
|
+
```
|
|
106
|
+
**[engineering turn 6] thought:** The diff is exactly one line added — the `"verify": "node scripts/verify.js"` entry — plus the required trailing comma on the preceding `smoke` line. This is a minimal, correct change.
|
|
107
|
+
**[engineering turn 6] finish** — ok
|
|
108
|
+
```json
|
|
109
|
+
{"pr_summary":"Add `\"verify\": \"node scripts/verify.js\"` to the `scripts` block in `package.json`.\n\nThe repo already shipped `scripts/verify.js` (a live feature-verification script that exercises every code path against synthetic temp directories without hitting external APIs), but it had no co
|
|
110
|
+
```
|
|
111
|
+
|
|
112
|
+
</details>
|