@ironbee-ai/cli 0.26.0 → 0.27.0
- package/CHANGELOG.md +2 -0
- package/dist/clients/claude/agents/ironbee-verifier.md +31 -0
- package/dist/clients/claude/platforms/skill.android.md +2 -0
- package/dist/clients/claude/platforms/skill.backend.md +2 -0
- package/dist/clients/claude/platforms/skill.browser.md +2 -0
- package/dist/clients/claude/platforms/skill.node.md +2 -0
- package/dist/clients/codex/agents/ironbee-verifier.md +75 -26
- package/dist/clients/codex/commands/ironbee-verify/SKILL.md +38 -61
- package/dist/clients/codex/index.js +2 -2
- package/dist/clients/codex/platforms/skill.android.md +2 -0
- package/dist/clients/codex/platforms/skill.backend.md +2 -0
- package/dist/clients/codex/platforms/skill.browser.md +2 -0
- package/dist/clients/codex/platforms/skill.node.md +2 -0
- package/dist/clients/codex/rules/ironbee-verification.md +10 -27
- package/dist/clients/codex/skills/ironbee-verification.md +40 -68
- package/dist/clients/codex/util.js +32 -22
- package/dist/clients/cursor/platforms/skill.android.md +2 -0
- package/dist/clients/cursor/platforms/skill.backend.md +2 -0
- package/dist/clients/cursor/platforms/skill.browser.md +2 -0
- package/dist/clients/cursor/platforms/skill.node.md +2 -0
- package/dist/clients/cursor/skills/ironbee-verification.md +21 -0
- package/dist/hooks/core/session-state.js +1 -1
- package/dist/hooks/core/submit-verdict.js +2 -2
- package/dist/hooks/core/verification-lifecycle.js +1 -1
- package/dist/hooks/core/verify-gate.js +11 -11
- package/dist/lib/install-version.js +1 -1
- package/dist/lib/platform-section.js +3 -3
- package/package.json +1 -1
package/CHANGELOG.md
CHANGED
@@ -74,6 +74,13 @@ echo '{"status":"pass","checks":["..."]}' | ironbee hook submit-verdict
 - Verdict shape is platform-agnostic: `status`, `checks`, optionally `issues`.
 - Pass → `{ "status": "pass", "checks": [...] }` (what you functionally verified).
 - Fail → `{ "status": "fail", "checks": [...], "issues": [...] }` (what failed).
+- **A FALSE failure is a FAIL — not "verified failure handling".** When you exercise a
+negative path, separate an EXPECTED negative test (you deliberately fed invalid input —
+bad card, missing auth, malformed payload — and it correctly failed → supports a `pass`)
+from a FALSE failure (a VALID, in-scope operation that SHOULD succeed but errors out → a
+DEFECT). Report a false failure as `status: "fail"` (or at minimum non-empty `issues`),
+never as a passing "failure path verified". Passing a run whose own evidence shows a
+legitimate operation breaking is a false pass.
 - You do **not** supply `fixes` — you didn't perform the fix. IronBee fills it from what
 the main agent recorded / changed.
 - **Nothing to verify? Use N/A — do NOT fake evidence.** If the change has no runtime
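The rule added in the hunk above separates an expected negative test from a false failure. A minimal sketch of that classification follows; `buildVerdict` and the observation fields (`name`, `inputValid`, `succeeded`) are hypothetical names chosen for illustration and are not part of IronBee. Treating accepted invalid input as an issue is an assumption that the text does not state.

```javascript
// Hypothetical helper: maps exercised operations to a verdict object.
// Only the verdict keys (status, checks, issues) come from the documented shape.
function buildVerdict(observations) {
  const checks = [];
  const issues = [];
  for (const o of observations) {
    // Every exercised operation is recorded as a check.
    checks.push(`${o.name}: ${o.succeeded ? "succeeded" : "failed"}`);
    if (o.inputValid && !o.succeeded) {
      // FALSE failure: a valid, in-scope operation errored out, so it is a defect.
      issues.push(`${o.name}: valid operation failed (false failure)`);
    } else if (!o.inputValid && o.succeeded) {
      // Assumption (not in the text): accepting invalid input is also a defect.
      issues.push(`${o.name}: invalid input was accepted`);
    }
    // Invalid input that correctly failed is an EXPECTED negative test and supports a pass.
  }
  return issues.length > 0
    ? { status: "fail", checks, issues }
    : { status: "pass", checks };
}

// Expected negative test: deliberately invalid input that fails supports a pass.
const expectedNegative = buildVerdict([
  { name: "pay with bad card", inputValid: false, succeeded: false },
]);

// False failure: a valid operation that errors out must be reported as a fail.
const falseFailure = buildVerdict([
  { name: "pay with bad card", inputValid: false, succeeded: false },
  { name: "pay with valid card", inputValid: true, succeeded: false },
]);
```

In this sketch a run is a pass only when no observation produced an issue, so a single false failure turns the whole verdict into a `fail`.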
@@ -100,6 +107,30 @@ echo '{"status":"pass","checks":["..."]}' | ironbee hook submit-verdict
 6. Return a short summary to the main agent: the verdict status and, on fail, the issues so
 it can fix and re-delegate.
 
+## Speed — batch your tool calls (fewer LLM round-trips)
+
+Each tool call is a separate LLM round-trip, and that round-trip — not the tool's execution
+— is the dominant cost of a verification. Drive the tools in as few turns as you can:
+
+- **Batch a scope's work into ONE `*_execute` call.** Each cycle exposes a batch tool
+(`bdt_execute` / `ndt_execute` / `bedt_execute` / `adt_execute`) that runs many steps in
+one turn — nest each as a `callTool('<tool>', { … })`. A batch nests only that cycle's own
+tools (you can't mix servers in one `*_execute`). It's a JS sandbox, so a later step
+can reuse a value an earlier `callTool` returned
+(`const r = callTool(…); callTool(…, { /* a field from r */ })`); and `*_execute` STOPS at
+the first failing nested call, so the rest don't run. Nested calls are credited to the gate like
+standalone calls — but authoring the batch is not the work: read each result and confirm
+real evidence came back (a batch whose interaction failed has no screenshot/snapshot
+behind it). See each platform section for that cycle's concrete batch shape, including any
+cycle-specific screenshot or recording handling.
+- **Discovery stays standalone — you can't batch what you haven't seen.** The step that
+reveals what to do (navigate / connect / snapshot) runs first and on its own; you read its
+result, THEN batch the actions it told you to take.
+- **Put independent calls in ONE assistant message.** Bash + your first devtools call can
+ride the same turn — e.g. `verification-start` then the cycle's discovery call in one
+message, in THIS order (until `verification-start` opens the cycle, every devtools call is
+blocked).
+
 <!--IRONBEE:PLATFORM:browser-->
 <!--/IRONBEE:PLATFORM:browser-->
 
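The batching behaviour described in the hunk above (nested `callTool` steps, reuse of an earlier result, stop at the first failing call) can be illustrated with a small local simulation. This is a sketch under assumptions: `simulateExecute` and the tool names are hypothetical stand-ins, the real `*_execute` argument shape is not shown in this diff, and a failing nested call is assumed to surface as a thrown error.

```javascript
// Local stand-in for the batch sandbox; NOT the real *_execute implementation.
function simulateExecute(batch, tools) {
  const calls = [];
  const callTool = (name, args) => {
    const result = tools[name](args); // a failing tool throws here
    calls.push({ name, result });
    return result;
  };
  try {
    batch(callTool);
    return { ok: true, calls };
  } catch (err) {
    // The first failure aborts the batch; later steps never run.
    return { ok: false, calls, error: String(err.message) };
  }
}

// Hypothetical tools used only for this demonstration.
const tools = {
  create_item: () => ({ id: 42 }),
  fetch_item: ({ id }) => {
    if (id !== 42) throw new Error("not found");
    return { id, name: "demo" };
  },
  failing_step: () => {
    throw new Error("interaction failed");
  },
  take_snapshot: () => ({ snapshot: "state after interactions" }),
};

// A later step reuses a value returned by an earlier callTool.
const good = simulateExecute((callTool) => {
  const r = callTool("create_item", {});
  callTool("fetch_item", { id: r.id });
}, tools);

// The first failing nested call stops the batch, so no snapshot is taken.
const bad = simulateExecute((callTool) => {
  callTool("failing_step", {});
  callTool("take_snapshot", {});
}, tools);
```

The second run shows why each result still has to be read: the batch was authored with a snapshot step, yet `bad.calls` contains no snapshot because the interaction before it failed.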
@@ -33,6 +33,8 @@ If you see only `ios/`, `web/`, or no mobile directories — the project does NO
 - Read Logcat output for the tag(s) relevant to the changed code: `mcp__android-devtools__adt_o11y_log-read` or `mcp__android-devtools__adt_o11y_log-follow` (drain a follow with `mcp__android-devtools__adt_o11y_log-get-followed`, stop it with `mcp__android-devtools__adt_o11y_log-stop-follow`).
 - Confirm expected log lines appear AND no unexpected crashes (FATAL / E/ entries for the app package).
 
+**Batch (speed):** connect + launch-app run standalone first (prerequisites). On the device-evidence path, batch the UI interactions + the UI snapshot into one `mcp__android-devtools__adt_execute`; the snapshot captures the state after the batched interactions, so to assert an intermediate state take a snapshot at that point too. The device-evidence screenshot is usually pixel-judged (a visual change) — take THAT one standalone with `includeBase64: true` so you can see it; batch it only when it's purely gate evidence. Log-evidence reads batch together too.
+
 ### Verdict fields
 The verdict is platform-agnostic — submit only semantic judgment:
 
@@ -13,6 +13,8 @@ The **backend protocol cycle** verifies backend changes by driving real protocol
 
 You can satisfy the cycle via **protocol-call evidence** (you drive the request yourself), **log evidence** (something else drives the request, you read the resulting logs), **DB evidence** (you inspect database state directly), or any combination. Pick whichever fits the task; one is enough.
 
+**Batch (speed):** group consecutive `bedt_*` steps into one `mcp__backend-devtools__bedt_execute` — e.g. a POST then a GET that reuses the created id (bind the first call's result: `const r = callTool('bedt_request_http', {…POST…}); callTool('bedt_request_http', { /* GET using an id from r */ })`), register-source + read, or db-connect + query. Keep a step standalone only when you must inspect its result to DECIDE what to do next, not just to pass a value along.
+
 ### Path A — Protocol-call evidence
 
 1. **Confirm a backend service is running** (the user's dev server, Docker compose, k8s port-forward, …). The agent itself does not start the service — ask the user if uncertain.
@@ -14,6 +14,8 @@
 
 All four tools are MANDATORY (the Stop hook checks each). Functional interaction is expected for every verification.
 
+**Batch (speed):** navigate (step 1) is standalone — read the ARIA snapshot it returns to decide your interactions. Then run steps 2–5 in ONE `mcp__browser-devtools__bdt_execute` batch — `callTool('bdt_interaction_…', …)` for each interaction, `callTool('bdt_content_take-screenshot', …)`, `callTool('bdt_a11y_take-aria-snapshot', …)`, `callTool('bdt_o11y_get-console-messages', …)` — instead of four separate turns. Screenshot/aria/console capture the state AFTER the batched interactions, so batch interactions that lead to ONE state you want to assert; to assert an intermediate state (e.g. a modal that opens then closes) take a screenshot/snapshot at that point too — interleave it in the batch or split into two. The interaction is what makes the evidence meaningful: a batch of just the four evidence tools with no real interaction passes the tool-presence check but verifies nothing. If you must judge the screenshot's pixels, take that one standalone with `includeBase64: true`.
+
 ### Verdict fields
 The verdict is platform-agnostic — you submit only semantic judgment:
 
@@ -31,6 +31,8 @@ If you see `pom.xml`, `build.gradle`, `requirements.txt`, `pyproject.toml`, `go.
 - Read errors: `ndt_debug_get-logs` with the error-level filter.
 4. **Disconnect** (optional): `ndt_debug_disconnect`.
 
+**Batch (speed):** connect (step 2) is standalone discovery. Batch consecutive `ndt_*` calls in one `mcp__node-devtools__ndt_execute` — set several probes together, then later read snapshots/logs together. The exercise step is ALWAYS separate: whatever triggers the code path (a browser/backend call on another server, a CLI command, the user) can't share an `ndt_*` batch — so node runs as set probes (batch) → exercise (separate) → read snapshots (batch).
+
 ### Verdict fields
 The verdict is platform-agnostic — you submit only semantic judgment:
 
@@ -1,16 +1,18 @@
 # IronBee Verifier (delegated verification)
 
 You are a dedicated verification sub-agent. The main agent edited code and delegated
-verification to you
-
-
-
-gate sees your work.
+verification to you. Your job: exercise the affected verification cycle(s) through **real
+tools** (never by reading code), then submit a single verdict. You run inside the main
+agent's session — every tool call and the verdict you submit are recorded in that shared
+session, so the main agent's completion gate sees your work.
 
 ## What you do NOT do
-- **Never edit code.** You run under a read-only sandbox
-verification fails, report the failures as `issues` in a
-agent fixes and re-delegates.
+- **Never edit code.** You run under a read-only sandbox — all file writes are blocked (both
+`apply_patch` and any shell write). If verification fails, report the failures as `issues` in a
+fail verdict and return — the main agent fixes and re-delegates.
+- **Never substitute reading for verification.** Reading the code is for understanding what
+changed and finding what to exercise — the verdict itself must come from driving the real
+devtools tools; a code-reading "pass" is banned.
 
 ## Scenario
 If the delegating prompt includes a verification **scenario**, it is authoritative — verify
@@ -46,29 +48,76 @@ echo '{"status":"pass","checks":["..."]}' | ironbee hook submit-verdict
 3. **Run the per-cycle flows for every active cycle.** See the platform sections near the
 bottom of this file — each enabled cycle has its own flow steps and mandatory tools. All
 active cycles must be exercised within this one verification cycle.
-4. **Teardown — shut down ONLY what you started, every run
-
-
-
-
+4. **Teardown — shut down ONLY what you started, and do it every run (do not skip it on your
+way to the verdict).** If in step 2 YOU started the app / dev server / any process *for
+this verification*, stop it now before you return — kill the exact process/container you
+launched (e.g. the backgrounded `npm run dev`, the `docker compose up` you ran). **Never
+stop a server that was already running** (user/main-agent-owned). Also honor any
+cycle-specific teardown noted in the platform sections (e.g. stopping an active screen
+recording) BEFORE submitting your verdict.
 5. **Submit your verdict immediately** — do NOT wait:
 ```
 echo '<verdict-json>' | ironbee hook submit-verdict
 ```
-
-
-
-**
-
-
-
-
-
+- Verdict shape is platform-agnostic: `status`, `checks`, optionally `issues`.
+- Pass → `{ "status": "pass", "checks": [...] }` (what you functionally verified).
+- Fail → `{ "status": "fail", "checks": [...], "issues": [...] }` (what failed).
+- **A FALSE failure is a FAIL — not "verified failure handling".** When you exercise a
+negative path, separate an EXPECTED negative test (you deliberately fed invalid input —
+bad card, missing auth, malformed payload — and it correctly failed → supports a `pass`)
+from a FALSE failure (a VALID, in-scope operation that SHOULD succeed but errors out → a
+DEFECT). Report a false failure as `status: "fail"` (or at minimum non-empty `issues`),
+never as a passing "failure path verified". Passing a run whose own evidence shows a
+legitimate operation breaking is a false pass.
+- You do **not** supply `fixes` — you didn't perform the fix. IronBee fills it from what
+the main agent recorded / changed.
+- **Nothing to verify? Use N/A — do NOT fake evidence.** If the change has no runtime
+surface to exercise (a type-only edit, a pure refactor with no behavior change, a
+config/constant tweak, a docs change that still tripped a cycle):
+- Global N/A → `{ "status": "not_applicable", "reason": ["why there's no runtime surface"] }`
+(no `checks` needed). Use this when NONE of the active cycles apply.
+- Per-platform N/A → keep a normal `pass`/`fail` for the cycles you DID verify and
+exempt the rest: `{ "status": "pass", "checks": [...], "not_applicable_cycles": ["browser"], "reason": ["server-only change, no UI path"] }`.
+Use this for a mixed change — e.g. verify the backend/node cycle but exempt browser.
+- `reason` is REQUIRED for either form. It is recorded and observable — be honest;
+don't N/A something that genuinely has a surface.
+- **Base "nothing to verify" on the FULL change set, not a clean working tree.**
+The change you're verifying is often already COMMITTED (the main agent committed
+before delegating). IronBee injects the changed-path list on your first devtools
+call — it covers recent commits, not just uncommitted `git status`. Before
+declaring N/A, check the committed changes too (e.g. `git diff HEAD~1 HEAD --stat`,
+widen the range if the work spans more commits). A clean `git status` does NOT mean
+there's nothing to verify.
+- Strict mode rejects N/A (you'll be told). If so, actually exercise the tools or
+report a fail.
+- The Stop hook enforces that you called the required tools for every active (non-exempt)
+cycle and that a pass/fail verdict carries non-empty `checks`.
+6. Return a short summary to the main agent: the verdict status and, on fail, the issues so
+it can fix and re-delegate.
 
-##
-
-
-
+## Speed — batch your tool calls (fewer LLM round-trips)
+
+Each tool call is a separate LLM round-trip, and that round-trip — not the tool's execution
+— is the dominant cost of a verification. Drive the tools in as few turns as you can:
+
+- **Batch a scope's work into ONE `*_execute` call.** Each cycle exposes a batch tool
+(`bdt_execute` / `ndt_execute` / `bedt_execute` / `adt_execute`) that runs many steps in
+one turn — nest each as a `callTool('<tool>', { … })`. A batch nests only that cycle's own
+tools (you can't mix servers in one `*_execute`). It's a JS sandbox, so a later step
+can reuse a value an earlier `callTool` returned
+(`const r = callTool(…); callTool(…, { /* a field from r */ })`); and `*_execute` STOPS at
+the first failing nested call, so the rest don't run. Nested calls are credited to the gate like
+standalone calls — but authoring the batch is not the work: read each result and confirm
+real evidence came back (a batch whose interaction failed has no screenshot/snapshot
+behind it). See each platform section for that cycle's concrete batch shape, including any
+cycle-specific screenshot or recording handling.
+- **Discovery stays standalone — you can't batch what you haven't seen.** The step that
+reveals what to do (navigate / connect / snapshot) runs first and on its own; you read its
+result, THEN batch the actions it told you to take.
+- **Run `verification-start` alone first, THEN batch.** Codex runs shell commands and MCP
+tools in separate lanes, so a same-message ordering of the `verification-start` shell
+command before a devtools call is not guaranteed — and a devtools call that lands first is
+blocked. Once the cycle is open, independent MCP calls can ride one message.
 
 <!--IRONBEE:PLATFORM:browser-->
 <!--/IRONBEE:PLATFORM:browser-->
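The verdict rules in the hunk above (N/A requires `reason`, a pass/fail verdict carries non-empty `checks`, strict mode rejects N/A) can be summarised as a small validator. This is a sketch only: `validateVerdict` and its `strict` option are hypothetical names, the real enforcement lives in IronBee's hooks and is not shown here, and applying the strict-mode rejection to per-platform N/A as well as global N/A is an assumption.

```javascript
// Illustrative validator mirroring the prose rules; not IronBee's implementation.
function validateVerdict(verdict, { strict = false } = {}) {
  const errors = [];
  const nonEmpty = (value) => Array.isArray(value) && value.length > 0;
  const exemptsCycles = nonEmpty(verdict.not_applicable_cycles);

  if (verdict.status === "not_applicable") {
    // Global N/A: no checks needed, but a reason is required.
    if (!nonEmpty(verdict.reason)) errors.push("reason is required for N/A");
  } else if (verdict.status === "pass" || verdict.status === "fail") {
    if (!nonEmpty(verdict.checks)) errors.push("pass/fail needs non-empty checks");
    // Per-platform N/A rides on a normal verdict and also needs a reason.
    if (exemptsCycles && !nonEmpty(verdict.reason)) {
      errors.push("reason is required when cycles are exempted");
    }
  } else {
    errors.push(`unknown status: ${verdict.status}`);
  }

  // Assumption: strict mode rejects both the global and the per-platform form.
  if (strict && (verdict.status === "not_applicable" || exemptsCycles)) {
    errors.push("strict mode rejects N/A");
  }
  return errors;
}

const globalNa = { status: "not_applicable", reason: ["type-only edit"] };
const mixed = {
  status: "pass",
  checks: ["GET /health returned 200"],
  not_applicable_cycles: ["browser"],
  reason: ["server-only change, no UI path"],
};

const globalNaErrors = validateVerdict(globalNa);
const missingReasonErrors = validateVerdict({ status: "not_applicable" });
const mixedErrors = validateVerdict(mixed);
const strictErrors = validateVerdict(mixed, { strict: true });
const emptyChecksErrors = validateVerdict({ status: "pass", checks: [] });
```

Returning a list of errors rather than throwing keeps every violated rule visible at once, which matches the idea of reporting all issues in a single verdict.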
@@ -1,90 +1,67 @@
 ---
 name: ironbee-verify
 description: >
-
-
-
-
-submits a verdict. Default is verify-only (report the verdict and stop); a
-leading `fix` argument adds the fix-and-re-verify loop until pass. A custom
-scenario may ride along with the invocation — inline text or a path to a
-scenario file — defining exactly what to verify.
+Delegate verification of the current code changes to the ironbee-verifier custom agent. Use
+when the user types `$ironbee-verify`. Default is verify-only (report the verdict and stop);
+a leading `fix` argument adds the fix-and-re-verify loop until pass. Optionally pass a custom
+scenario (inline text or a file path) that defines what to verify.
 ---
 
 # IronBee Verify
 
-
+> **Delegate — do NOT verify inline.** Run this command by spawning the **`ironbee-verifier` custom agent** via `spawn_agent` with `agent_type="ironbee-verifier"` **and `fork_turns="none"`** (the default `fork_turns="all"` silently drops the agent_type → a generic toolless agent; not a generic "act as" agent either) and relaying its verdict. The verifier owns the devtools tools; you (the main agent) don't have them. Everything below describes what the **verifier** does — your job is only to spawn it (passing the mode + scenario in its prompt) and report back its verdict.
+
+Verify the current code changes by **delegating to the `ironbee-verifier` custom agent**. It drives the verification tools out-of-band in this **shared session** and returns a verdict summary — so the heavy devtools output (DOM, console, screenshots) stays in its context, not yours. **You do not run the verification tools yourself**: you resolve the mode and scenario (below), spawn the verifier, and relay its result. The gate still runs every active cycle and all must pass for `status: pass`.
 
 ## Mode
 
 The FIRST whitespace-delimited token of whatever the user provided alongside `$ironbee-verify` selects the mode; everything after it is the scenario:
 
-- `fix` → **verify-and-fix**: on a fail verdict, fix the reported issues
+- `fix` → **verify-and-fix**: on a fail verdict, fix the reported issues and re-delegate until the verdict passes.
 - `report` → **verify-only** (the explicit form of the default).
 - Anything else, or nothing → **verify-only** (default), and the WHOLE provided text is the scenario.
 
-**Verify-only** means:
+**Verify-only** means: relay the verdict and STOP — do **not** edit code, do **not** re-delegate on fail. The fail verdict is still submitted and recorded (that's the point — an honest status report). If the user wants the issues repaired, suggest `$ironbee-verify fix`. One caveat (enforce mode): if code was edited earlier in THIS turn, the Stop gate may still block on the fail verdict and demand fixes — follow the gate then; the mode token never overrides enforcement.
 
 ## Verification scenario
 
-A custom verification scenario may be supplied when this command is invoked — either as **inline text** or as a **path to a file** (any location, any format;
-
-- **If a scenario is supplied, it is authoritative**: verify exactly what it describes. Drive each active cycle's tools to exercise precisely the flows, states, and endpoints it names — this **replaces** the default "exercise the changed pages/endpoints" guidance.
-- **If the scenario is (or points to) a file path**, read that file with your file-read tool and treat its contents as the scenario. Do not assume a fixed location or format — read whatever path was given.
-- **If the path does not resolve to an existing file**, stop and report `scenario file not found: <path>`, then ask how to proceed — do not verify the literal path string or guess a target.
-- **If no scenario is supplied**, fall back to the default flow: exercise the changed pages/endpoints per the active platform sections below.
-
-Whatever the scenario directs, the gate is unchanged — you must still call every active cycle's required tools and submit a non-empty `checks`. Map each `checks` entry to a concrete scenario step/expectation, and each `issues` entry to a scenario step that failed.
-
-## Universal steps
-
-1. **Start verification**: Run `echo '{"session_id":"<your-session-id>"}' | ironbee hook verification-start` via Bash (substitute the actual session ID printed by the SessionStart hook).
-**In fix mode**, add the intent flag so IronBee's completion gate enforces fix-until-pass:
-`echo '{"session_id":"<your-session-id>"}' | ironbee hook verification-start --intent fix`
-2. **Build and start** the application if not already running.
-3. **For every active cycle, run its flow** — driven by the **Verification scenario** above when one was supplied, otherwise as described in the platform sections near the bottom of this file. All active cycles must be exercised within this same verification cycle.
-4. **Stop** the dev server when verification is complete (every cycle — including the final one).
-5. **Honor any cycle-specific teardown** noted in the platform sections BEFORE submitting your verdict.
-6. **Submit your verdict** via Bash. One verdict covers every active cycle:
-- Pass: `echo '{"session_id":"...","status":"pass","checks":["..."]}' | ironbee hook submit-verdict`
-- Fail: `echo '{"session_id":"...","status":"fail","checks":["..."],"issues":["describe what failed"]}' | ironbee hook submit-verdict`
-- N/A (nothing to verify — never fake evidence): global `echo '{"session_id":"...","status":"not_applicable","reason":["no runtime surface — type-only/config/refactor"]}'`, or per-platform on a pass/fail verdict `"not_applicable_cycles":["browser"],"reason":["server-only change"]`. `reason` is REQUIRED (recorded + observable); strict mode rejects N/A.
-7. **If failed** → collect ALL issues first (finish testing every active cycle) and submit ONE fail verdict with all issues. Then branch by mode:
-- **Verify-only (default)**: report the issues to the user and stop — do not edit code. Suggest `$ironbee-verify fix` to repair them.
-- **Fix mode (`fix` token)**: fix everything, rebuild, and re-verify until pass. Do not fix one issue at a time — batch fixes to avoid repeated build/restart cycles.
-8. If pass after a previous fail, include `"fixes"` in the verdict describing what was fixed.
+A custom verification scenario may be supplied when this command is invoked — either as **inline text** or as a **path to a file** (any location, any format; read at run time).
 
-
+> The scenario is whatever the user provided alongside `$ironbee-verify`, after stripping a leading `fix` / `report` mode token — the remainder is the scenario; empty remainder → the verifier uses its default flow.
 
-
-
+- **If a scenario is supplied, it is authoritative**: the verifier must verify exactly what it describes, exercising precisely the flows/states/endpoints it names — this **replaces** the default "exercise the changed pages/endpoints" guidance.
+- **If the scenario is (or points to) a file path**, read that file with your file-read tool yourself and pass its **contents** into the verifier's prompt (the verifier has no file-read tool). Do not assume a fixed location or format — read whatever path was given.
+- **If the path does not resolve to an existing file**, stop and report `scenario file not found: <path>`, then ask how to proceed — do not delegate with the literal path string or guess a target.
+- **If no scenario is supplied**, the verifier falls back to exercising the changed pages/endpoints per the active cycles.
 
-
-<!--/IRONBEE:PLATFORM:node-->
+## Steps
 
-
-
+1. **Resolve the mode and scenario**: strip a leading `fix` / `report` token (see **Mode**); then file path → read it now; inline text → use as-is; empty → none.
+2. **Spawn the `ironbee-verifier` custom agent** — call `spawn_agent` with **`agent_type="ironbee-verifier"`** AND **`fork_turns="none"`**. The `fork_turns="none"` is REQUIRED: the default `fork_turns="all"` is a full-history fork that silently DROPS the `agent_type` override, giving you a generic agent *without* the verification tools. (Do NOT "act as" the verifier or use a plain generic fork either.) Put the task, the mode, and the resolved scenario in the `message`, e.g.:
+> Verify the current code changes.
+> Mode: \<`fix` in fix mode — OMIT this line entirely in verify-only mode>
+> Scenario: \<the resolved scenario text, or "none — exercise the changed pages/endpoints">
+The verifier runs `verification-start` (relaying the fix intent to IronBee's completion gate, which then enforces fix-until-pass on you) → drives every active cycle's tools → submits the single verdict, all in this shared session. It resolves the session id from the environment, so you don't pass one.
+**Wait for the verifier in the same turn — do NOT background it.** Let it run to completion and read its verdict before responding; a backgrounded verifier can let your turn end (and the Stop gate fire) before its verdict is recorded.
+3. **Relay the verifier's summary** — the verdict status and, on fail, the issues it found.
+4. **On a fail verdict, branch by mode**:
+- **Verify-only (default)**: stop here. Report the issues clearly and suggest `$ironbee-verify fix` to repair them. Do not edit code.
+- **Fix mode (`fix` token)**: fix the issues it reported. Optionally record what you fixed so the next pass verdict can describe it:
+```
+echo '{"fixes":["what you repaired"]}' | ironbee hook record-fix
+```
+Then re-run the verification by re-delegating (step 2) — repeat until the verdict passes. (If you skip `record-fix`, IronBee fills `fixes` from the files you changed since the fail.)
 
-
-<!--/IRONBEE:PLATFORM:android-->
+Do NOT verify inline — always delegate, so your context stays clean. The per-cycle "how to verify" detail (which tools to drive, the verdict expectations) lives in the `ironbee-verifier` custom agent itself — you don't need it here to delegate.
 
 ---
 
-##
-
-If you observe ANY problem on any active cycle — wrong data, unexpected errors, broken interactions, missing evidence, anything that doesn't match the spec — you MUST submit a **fail** verdict.
-
-**Do NOT rationalize away problems.** If something looks wrong or behaves unexpectedly, it IS wrong.
-
-**After a fail verdict in fix mode, you MUST fix the issues and re-verify** — do not just report and stop. In verify-only mode (the default) the opposite holds: report and stop; fixing without the `fix` token is overstepping.
-
-## Verdict Quality
+## What the verifier judges (so you know what to expect back)
 
-
--
-- BAD: `["it works", "looks good", "feature implemented"]`
+- It submits a **fail** verdict on ANY problem on any active cycle — wrong data, unexpected errors, broken interactions, missing evidence. It does not rationalize problems away.
+- Its `checks` are specific observations (e.g. `"submitted valid credentials, redirected to /dashboard"`, `"console clean — 0 errors"`), not `"it works"`.
 
 ## Important
--
--
--
+- The **verifier** produces the verdict; your job is to delegate, relay it, and — in fix mode — fix on fail.
+- **Fix mode only**: a fail verdict means you must fix the issues and re-delegate until pass. In verify-only mode (the default) you report and stop — fixing without the `fix` token is overstepping.
+- Never verify inline to "save a round trip" — delegation keeps your context clean and is the supported path.
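The mode rule in the hunk above (the first whitespace-delimited token selects the mode, the remainder is the scenario) can be sketched as a small parser. `parseInvocation` is a hypothetical name used for illustration, and matching the token case-sensitively is an assumption that the text does not settle.

```javascript
// Hypothetical parser for the text supplied alongside `$ironbee-verify`.
function parseInvocation(argText) {
  const text = (argText ?? "").trim();
  // Split off the first whitespace-delimited token; the rest may span several lines.
  const match = text.match(/^(\S+)\s*([\s\S]*)$/);
  if (!match) {
    // Nothing was provided: verify-only with no scenario.
    return { mode: "verify-only", scenario: "" };
  }
  const [, first, rest] = match;
  if (first === "fix") return { mode: "fix", scenario: rest.trim() };
  if (first === "report") return { mode: "verify-only", scenario: rest.trim() };
  // Anything else: verify-only, and the WHOLE text is the scenario.
  return { mode: "verify-only", scenario: text };
}

const fixRun = parseInvocation("fix checkout flow with a saved card");
const reportRun = parseInvocation("report");
const plainRun = parseInvocation("fixtures page loads without errors");
const emptyRun = parseInvocation("");
```

Note that `fixtures page loads without errors` stays in verify-only mode: only an exact `fix` token switches modes, so a scenario that merely starts with those letters is kept whole.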
@@ -1,3 +1,3 @@
|
|
|
1
|
-
"use strict";var E=Object.defineProperty;var W=Object.getOwnPropertyDescriptor;var Y=Object.getOwnPropertyNames;var z=Object.prototype.hasOwnProperty;var h=(u,o)=>E(u,"name",{value:o,configurable:!0});var Q=(u,o)=>{for(var e in o)E(u,e,{get:o[e],enumerable:!0})},Z=(u,o,e,r)=>{if(o&&typeof o=="object"||typeof o=="function")for(let n of Y(o))!z.call(u,n)&&n!==e&&E(u,n,{get:()=>o[n],enumerable:!(r=W(o,n))||r.enumerable});return u};var j=u=>Z(E({},"__esModule",{value:!0}),u);var io={};Q(io,{CodexClient:()=>no});module.exports=j(io);var i=require("fs"),a=require("path"),B=require("../../lib/gitignore"),f=require("../../lib/logger"),l=require("../../lib/output"),P=require("../../lib/fs-prune"),d=require("../../lib/config"),$=require("../../lib/platform-section"),t=require("./util"),H=require("./thread-map"),N=require("./hooks/verify-gate"),O=require("./hooks/activity-end"),V=require("./hooks/session-start"),G=require("./hooks/activity-start"),J=require("./hooks/require-verification"),L=require("./hooks/require-verdict"),F=require("./hooks/clear-verdict"),K=require("./hooks/track-action"),U=require("./hooks/track-action-monitor"),q=require("./hooks/track-action-pre"),D=require("./hooks/subagent-start"),X=require("./hooks/subagent-stop");const w="browser-devtools",T="node-devtools",A="backend-devtools",_="android-devtools",oo="ironbee",k="ironbee-verifier",I="Verifies recent code changes through real browser/runtime/backend tools and submits the IronBee verdict. Spawn this custom agent (by agent_type) after editing code to run the verification cycle out-of-band \u2014 it drives the devtools tools, judges the result, and records the verdict in the shared session. 
It does NOT edit code.";function R(u){return(0,a.join)(__dirname,"..",u,"platforms")}h(R,"platformsDirFor");function b(u){return l.pc.dim(u)}h(b,"codexColor");function M(u){return u.hooks.some(o=>o.command.includes(oo))}h(M,"isIronBeeHookGroup");function eo(u){const o=Object.keys(u);return o.length===0?!0:o.length===1&&o[0]==="hooks"?Object.keys(u.hooks??{}).length===0:!1}h(eo,"isCodexHooksEmpty");class no{constructor(){this.name="codex";this.supportsVerifierModel=!0}static{h(this,"CodexClient")}detect(o){return(0,i.existsSync)((0,a.join)(o,".agents","skills","ironbee-verify"))}resolveProjectDir(){return process.env.CODEX_PROJECT_DIR??process.env.IRONBEE_PROJECT_DIR??process.cwd()}install(o,e){const r=e??(0,d.loadConfig)(o),n=(0,d.getVerificationMode)(r),s=n!=="monitor";this.cleanupArtifacts(o);const c=(0,t.codexHooksJsonPath)(o);this.mergeHooksConfig(c,n),this.mergeConfigToml(o,r,s),s&&(n==="enforce"&&this.writeAgentsMdBlock(o,r),this.writeSkills(o,n==="enforce"),(0,$.syncPlatformSectionsToConfig)(o,R)),(0,B.ensureIronBeeGitignored)(o),console.log(` ${l.pc.dim("\u2192")} ${b("[codex]")} hooks ${l.pc.dim("\u2192")} ${l.pc.dim(c)}`),console.log(` ${l.pc.dim("\u2192")} ${b("[codex]")} config ${l.pc.dim("\u2192")} ${l.pc.dim((0,t.codexConfigTomlPath)(o))}`),n==="enforce"?(console.log(` ${l.pc.dim("\u2192")} ${b("[codex]")} agents ${l.pc.dim("\u2192")} ${l.pc.dim((0,a.join)(o,"AGENTS.md"))}`),console.log(` ${l.pc.dim("\u2192")} ${b("[codex]")} skill ${l.pc.dim("\u2192")} ${l.pc.dim((0,a.join)(o,".agents","skills","ironbee-verification","SKILL.md"))}`),console.log(` ${l.pc.dim("\u2192")} ${b("[codex]")} command ${l.pc.dim("\u2192")} ${l.pc.dim((0,a.join)(o,".agents","skills","ironbee-verify","SKILL.md"))}`)):n==="assist"?(console.log(` ${l.pc.dim("\u2192")} ${b("[codex]")} ${l.pc.yellow("assist mode")} (verification.auto: false) \u2014 manual $ironbee-verify only, no enforcement`),console.log(` ${l.pc.dim("\u2192")} ${b("[codex]")} command ${l.pc.dim("\u2192")} 
${l.pc.dim((0,a.join)(o,".agents","skills","ironbee-verify","SKILL.md"))}`)):console.log(` ${l.pc.dim("\u2192")} ${b("[codex]")} ${l.pc.yellow("monitoring-only mode")} (verification.enable: false)`),console.log(),console.log(` ${l.pc.yellow("\u26A0")} ${l.pc.yellow("Codex requires one-time TUI setup:")}`),console.log(` ${l.pc.yellow("1.")} Run ${l.pc.bold("/hooks")} in a fresh Codex session to review and trust IronBee hooks`),console.log(` ${l.pc.yellow("2.")} Restart any open Codex sessions to pick up new hook config`)}uninstall(o){this.cleanupArtifacts(o),(0,P.pruneEmptyDirs)((0,a.join)(o,".codex"));const e=(0,H.codexThreadMapPath)(o);if((0,i.existsSync)(e))try{(0,i.unlinkSync)(e)}catch(r){f.logger.debug(`failed to remove codex thread map: ${r}`)}console.log(` ${l.pc.dim("\u2192")} ${b("[codex]")} removed hooks, MCP entries, AGENTS.md block, and skills`)}cleanupArtifacts(o){this.migrateAwayFromUserLevel();const e=(0,t.codexHooksJsonPath)(o);this.removeIronBeeHooks(e),this.maybeDeleteEmptyHooks(e),this.removeIronBeeMcpServers(o),this.removeVerifierAgentToml(o);const r=(0,a.join)(o,"AGENTS.md");if((0,i.existsSync)(r))try{const s=(0,i.readFileSync)(r,"utf-8"),c=(0,t.stripAgentsMdBlock)(s);c===null?(0,i.unlinkSync)(r):c!==s&&(0,i.writeFileSync)(r,c)}catch(s){f.logger.debug(`failed to strip AGENTS.md block: ${s}`)}const n=(0,a.join)(o,".agents","skills");this.removeDir((0,a.join)(n,"ironbee-verification")),this.removeDir((0,a.join)(n,"ironbee-verify")),(0,P.pruneEmptyDirs)((0,a.join)(o,".agents"))}async runVerifyGate(o){await(0,N.run)(o)}async runActivityEnd(o){await(0,O.run)(o)}async runSessionStart(o){await(0,V.run)(o)}async runActivityStart(o){await(0,G.run)(o)}async runRequireVerification(o,e){await(0,J.run)(o,e)}async runRequireVerdict(o,e){await(0,L.run)(o,e)}async runClearVerdict(o){await(0,F.run)(o)}async runTrackAction(o){await(0,K.run)(o)}async runTrackActionMonitor(o){await(0,U.run)(o)}async runTrackActionPre(o){await(0,q.run)(o)}async 
runSubagentStart(o){await(0,D.run)(o)}async runSubagentStop(o){await(0,X.run)(o)}resolveAgentSessionId(o,e){const r=process.env.CODEX_THREAD_ID;if(typeof r=="string"&&r.length>0&&e)return(0,H.lookupThreadSession)(e,r)}async runSessionEnd(o){f.logger.debug("session-end: no-op on Codex (no SessionEnd hook event)")}mergeHooksConfig(o,e){const r=e!=="monitor",n=e==="assist"?" --soft":"";(0,i.mkdirSync)((0,a.dirname)(o),{recursive:!0});let s={hooks:{}};if((0,i.existsSync)(o))try{s=JSON.parse((0,i.readFileSync)(o,"utf-8")),s.hooks||(s.hooks={})}catch(m){f.logger.debug(`failed to parse ${o}: ${m}`),s={hooks:{}}}for(const m of Object.keys(s.hooks)){const v=s.hooks[m].filter(y=>!M(y));v.length===0?delete s.hooks[m]:s.hooks[m]=v}const c=h((m,v,y)=>{s.hooks[m]||(s.hooks[m]=[]),s.hooks[m].push({matcher:v,hooks:[{type:"command",command:y}]})},"addGroup");c("SessionStart",".*","ironbee hook session-start --client codex"),c("UserPromptSubmit",".*","ironbee hook activity-start --client codex"),c("PreToolUse",".*","ironbee hook track-action-pre --client codex"),r&&(c("PreToolUse","^mcp__(browser|node|backend|android)[-_]devtools__.*",`ironbee hook require-verification --client codex${n}`),c("PreToolUse","^apply_patch$",`ironbee hook require-verdict --client codex${n}`),c("PostToolUse","^apply_patch$","ironbee hook clear-verdict --client codex"),c("SubagentStart",".*","ironbee hook subagent-start --client codex")),c("SubagentStop",".*","ironbee hook subagent-stop --client codex"),c("PostToolUse",".*",r?"ironbee hook track-action --client codex":"ironbee hook track-action-monitor --client codex"),c("Stop",".*",e==="enforce"?"ironbee hook verify-gate --client codex":"ironbee hook activity-end --client codex"),(0,i.writeFileSync)(o,JSON.stringify(s,null,2))}removeIronBeeHooks(o){if((0,i.existsSync)(o))try{const e=(0,i.readFileSync)(o,"utf-8"),r=JSON.parse(e);if(!r.hooks)return;let n=!1;for(const s of Object.keys(r.hooks)){const 
c=r.hooks[s].filter(g=>!M(g));c.length!==r.hooks[s].length&&(n=!0),c.length===0?delete r.hooks[s]:r.hooks[s]=c}n&&(0,i.writeFileSync)(o,JSON.stringify(r,null,2))}catch(e){f.logger.debug(`failed to strip IronBee hooks from ${o}: ${e}`)}}maybeDeleteEmptyHooks(o){if((0,i.existsSync)(o))try{const e=JSON.parse((0,i.readFileSync)(o,"utf-8"));eo(e)&&(0,i.unlinkSync)(o)}catch(e){f.logger.debug(`failed to inspect ${o} for emptiness: ${e}`)}}mergeConfigToml(o,e,r){(0,i.mkdirSync)((0,a.join)(o,".codex"),{recursive:!0});let n=(0,t.readCodexConfigToml)(o);if(n=(0,t.ensureFeaturesHooksTrue)(n),n=(0,t.removeMcpServer)(n,w),n=(0,t.removeMcpServer)(n,T),n=(0,t.removeMcpServer)(n,A),n=(0,t.removeMcpServer)(n,_),r){const s=(0,d.getVerificationModel)(e,"codex"),c=(0,i.existsSync)((0,t.userCodexConfigTomlPath)())?(0,i.readFileSync)((0,t.userCodexConfigTomlPath)(),"utf-8"):"",g=(0,t.extractTomlTopLevelModel)(n)===null&&(0,t.extractTomlTopLevelModel)(c)===null;s===void 0&&g&&console.log(` ${l.pc.dim("\u2192")} ${b("[codex]")} ${l.pc.yellow("\u26A0 no model for the verifier")} \u2014 the ${l.pc.bold("ironbee-verifier")} sub-agent inherits the session model, but neither this project's .codex/config.toml nor ~/.codex/config.toml has a top-level ${l.pc.bold("model")}, so it may fail to spawn ("could not resolve the child model"). 
Fix: set ${l.pc.bold("model")} in ~/.codex/config.toml, or set ${l.pc.bold("verification.model")} in your ironbee config.`),this.writeVerifierAgentToml(o,e,s),n=(0,t.upsertAgentsTable)(n,k,[`description = ${JSON.stringify(I)}`,`config_file = ${JSON.stringify(`agents/${k}.toml`)}`])}else n=(0,t.removeAgentsTable)(n,k),this.removeVerifierAgentToml(o);(0,t.writeCodexConfigToml)(o,n)}writeVerifierAgentToml(o,e,r){const n=(0,a.join)(__dirname,"agents",`${k}.md`);let s;try{s=(0,i.readFileSync)(n,"utf-8")}catch(v){f.logger.debug(`failed to read verifier agent source ${n}: ${v}`);return}const c=R("codex");for(const v of d.ALL_CYCLES){const S=(0,d.isCycleEnabled)(e,v)?C=>{const x=(0,a.join)(c,(0,$.fragmentFilename)("skill",v,C));return(0,i.existsSync)(x)?(0,i.readFileSync)(x,"utf-8").trimEnd():null}:null;s=(0,$.applyPlatformSection)(s,v,S,`${k}.toml`)}const g=[];g.push(`name = ${JSON.stringify(k)}`),g.push(`description = ${JSON.stringify(I)}`),g.push('sandbox_mode = "read-only"'),r&&g.push(`model = ${JSON.stringify(r)}`),g.push("developer_instructions = '''"),g.push(s.replace(/'''/g,"```").trimEnd()),g.push("'''");const p=h((v,y,S)=>{v&&(g.push(""),g.push(`[mcp_servers.${y}]`),g.push(...to(S)),g.push("required = true"),g.push('default_tools_approval_mode = "approve"'))},"addCycle");p((0,d.isCycleEnabled)(e,"browser"),w,(0,d.getMcpServerEntry)(o)),p((0,d.isCycleEnabled)(e,"node"),T,(0,d.getNodeDevToolsMcpEntry)(o)),p((0,d.isCycleEnabled)(e,"backend"),A,(0,d.getBackendDevToolsMcpEntry)(o)),p((0,d.isCycleEnabled)(e,"android"),_,(0,d.getAndroidDevToolsMcpEntry)(o));const m=(0,t.codexAgentTomlPath)(o,k);(0,i.mkdirSync)((0,a.dirname)(m),{recursive:!0}),(0,i.writeFileSync)(m,g.join(`
|
|
1
|
+
"use strict";var E=Object.defineProperty;var W=Object.getOwnPropertyDescriptor;var Y=Object.getOwnPropertyNames;var z=Object.prototype.hasOwnProperty;var h=(g,o)=>E(g,"name",{value:o,configurable:!0});var Q=(g,o)=>{for(var n in o)E(g,n,{get:o[n],enumerable:!0})},Z=(g,o,n,r)=>{if(o&&typeof o=="object"||typeof o=="function")for(let e of Y(o))!z.call(g,e)&&e!==n&&E(g,e,{get:()=>o[e],enumerable:!(r=W(o,e))||r.enumerable});return g};var j=g=>Z(E({},"__esModule",{value:!0}),g);var ro={};Q(ro,{CodexClient:()=>to});module.exports=j(ro);var i=require("fs"),a=require("path"),B=require("../../lib/gitignore"),f=require("../../lib/logger"),l=require("../../lib/output"),P=require("../../lib/fs-prune"),u=require("../../lib/config"),$=require("../../lib/platform-section"),t=require("./util"),R=require("./thread-map"),N=require("./hooks/verify-gate"),V=require("./hooks/activity-end"),O=require("./hooks/session-start"),G=require("./hooks/activity-start"),J=require("./hooks/require-verification"),L=require("./hooks/require-verdict"),F=require("./hooks/clear-verdict"),K=require("./hooks/track-action"),U=require("./hooks/track-action-monitor"),q=require("./hooks/track-action-pre"),D=require("./hooks/subagent-start"),X=require("./hooks/subagent-stop");const T="browser-devtools",w="node-devtools",A="backend-devtools",_="android-devtools",oo="ironbee",k="ironbee-verifier",eo=30,I="Verifies recent code changes through real browser/runtime/backend tools and submits the IronBee verdict. Spawn this custom agent (by agent_type) after editing code to run the verification cycle out-of-band \u2014 it drives the devtools tools, judges the result, and records the verdict in the shared session. 
It does NOT edit code.";function H(g){return(0,a.join)(__dirname,"..",g,"platforms")}h(H,"platformsDirFor");function b(g){return l.pc.dim(g)}h(b,"codexColor");function M(g){return g.hooks.some(o=>o.command.includes(oo))}h(M,"isIronBeeHookGroup");function no(g){const o=Object.keys(g);return o.length===0?!0:o.length===1&&o[0]==="hooks"?Object.keys(g.hooks??{}).length===0:!1}h(no,"isCodexHooksEmpty");class to{constructor(){this.name="codex";this.supportsVerifierModel=!0}static{h(this,"CodexClient")}detect(o){return(0,i.existsSync)((0,a.join)(o,".agents","skills","ironbee-verify"))}resolveProjectDir(){return process.env.CODEX_PROJECT_DIR??process.env.IRONBEE_PROJECT_DIR??process.cwd()}install(o,n){const r=n??(0,u.loadConfig)(o),e=(0,u.getVerificationMode)(r),s=e!=="monitor";this.cleanupArtifacts(o);const c=(0,t.codexHooksJsonPath)(o);this.mergeHooksConfig(c,e),this.mergeConfigToml(o,r,s),s&&(e==="enforce"&&this.writeAgentsMdBlock(o,r),this.writeSkills(o,e==="enforce"),(0,$.syncPlatformSectionsToConfig)(o,H)),(0,B.ensureIronBeeGitignored)(o),console.log(` ${l.pc.dim("\u2192")} ${b("[codex]")} hooks ${l.pc.dim("\u2192")} ${l.pc.dim(c)}`),console.log(` ${l.pc.dim("\u2192")} ${b("[codex]")} config ${l.pc.dim("\u2192")} ${l.pc.dim((0,t.codexConfigTomlPath)(o))}`),e==="enforce"?(console.log(` ${l.pc.dim("\u2192")} ${b("[codex]")} agents ${l.pc.dim("\u2192")} ${l.pc.dim((0,a.join)(o,"AGENTS.md"))}`),console.log(` ${l.pc.dim("\u2192")} ${b("[codex]")} skill ${l.pc.dim("\u2192")} ${l.pc.dim((0,a.join)(o,".agents","skills","ironbee-verification","SKILL.md"))}`),console.log(` ${l.pc.dim("\u2192")} ${b("[codex]")} command ${l.pc.dim("\u2192")} ${l.pc.dim((0,a.join)(o,".agents","skills","ironbee-verify","SKILL.md"))}`)):e==="assist"?(console.log(` ${l.pc.dim("\u2192")} ${b("[codex]")} ${l.pc.yellow("assist mode")} (verification.auto: false) \u2014 manual $ironbee-verify only, no enforcement`),console.log(` ${l.pc.dim("\u2192")} ${b("[codex]")} command ${l.pc.dim("\u2192")} 
${l.pc.dim((0,a.join)(o,".agents","skills","ironbee-verify","SKILL.md"))}`)):console.log(` ${l.pc.dim("\u2192")} ${b("[codex]")} ${l.pc.yellow("monitoring-only mode")} (verification.enable: false)`),console.log(),console.log(` ${l.pc.yellow("\u26A0")} ${l.pc.yellow("Codex requires one-time TUI setup:")}`),console.log(` ${l.pc.yellow("1.")} Run ${l.pc.bold("/hooks")} in a fresh Codex session to review and trust IronBee hooks`),console.log(` ${l.pc.yellow("2.")} Restart any open Codex sessions to pick up new hook config`)}uninstall(o){this.cleanupArtifacts(o),(0,P.pruneEmptyDirs)((0,a.join)(o,".codex"));const n=(0,R.codexThreadMapPath)(o);if((0,i.existsSync)(n))try{(0,i.unlinkSync)(n)}catch(r){f.logger.debug(`failed to remove codex thread map: ${r}`)}console.log(` ${l.pc.dim("\u2192")} ${b("[codex]")} removed hooks, MCP entries, AGENTS.md block, and skills`)}cleanupArtifacts(o){this.migrateAwayFromUserLevel();const n=(0,t.codexHooksJsonPath)(o);this.removeIronBeeHooks(n),this.maybeDeleteEmptyHooks(n),this.removeIronBeeMcpServers(o),this.removeVerifierAgentToml(o);const r=(0,a.join)(o,"AGENTS.md");if((0,i.existsSync)(r))try{const s=(0,i.readFileSync)(r,"utf-8"),c=(0,t.stripAgentsMdBlock)(s);c===null?(0,i.unlinkSync)(r):c!==s&&(0,i.writeFileSync)(r,c)}catch(s){f.logger.debug(`failed to strip AGENTS.md block: ${s}`)}const e=(0,a.join)(o,".agents","skills");this.removeDir((0,a.join)(e,"ironbee-verification")),this.removeDir((0,a.join)(e,"ironbee-verify")),(0,P.pruneEmptyDirs)((0,a.join)(o,".agents"))}async runVerifyGate(o){await(0,N.run)(o)}async runActivityEnd(o){await(0,V.run)(o)}async runSessionStart(o){await(0,O.run)(o)}async runActivityStart(o){await(0,G.run)(o)}async runRequireVerification(o,n){await(0,J.run)(o,n)}async runRequireVerdict(o,n){await(0,L.run)(o,n)}async runClearVerdict(o){await(0,F.run)(o)}async runTrackAction(o){await(0,K.run)(o)}async runTrackActionMonitor(o){await(0,U.run)(o)}async runTrackActionPre(o){await(0,q.run)(o)}async 
runSubagentStart(o){await(0,D.run)(o)}async runSubagentStop(o){await(0,X.run)(o)}resolveAgentSessionId(o,n){const r=process.env.CODEX_THREAD_ID;if(typeof r=="string"&&r.length>0&&n)return(0,R.lookupThreadSession)(n,r)}async runSessionEnd(o){f.logger.debug("session-end: no-op on Codex (no SessionEnd hook event)")}mergeHooksConfig(o,n){const r=n!=="monitor",e=n==="assist"?" --soft":"";(0,i.mkdirSync)((0,a.dirname)(o),{recursive:!0});let s={hooks:{}};if((0,i.existsSync)(o))try{s=JSON.parse((0,i.readFileSync)(o,"utf-8")),s.hooks||(s.hooks={})}catch(m){f.logger.debug(`failed to parse ${o}: ${m}`),s={hooks:{}}}for(const m of Object.keys(s.hooks)){const v=s.hooks[m].filter(y=>!M(y));v.length===0?delete s.hooks[m]:s.hooks[m]=v}const c=h((m,v,y)=>{s.hooks[m]||(s.hooks[m]=[]),s.hooks[m].push({matcher:v,hooks:[{type:"command",command:y}]})},"addGroup");c("SessionStart",".*","ironbee hook session-start --client codex"),c("UserPromptSubmit",".*","ironbee hook activity-start --client codex"),c("PreToolUse",".*","ironbee hook track-action-pre --client codex"),r&&(c("PreToolUse","^mcp__(browser|node|backend|android)[-_]devtools__.*",`ironbee hook require-verification --client codex${e}`),c("PreToolUse","^apply_patch$",`ironbee hook require-verdict --client codex${e}`),c("PostToolUse","^apply_patch$","ironbee hook clear-verdict --client codex"),c("SubagentStart",".*","ironbee hook subagent-start --client codex")),c("SubagentStop",".*","ironbee hook subagent-stop --client codex"),c("PostToolUse",".*",r?"ironbee hook track-action --client codex":"ironbee hook track-action-monitor --client codex"),c("Stop",".*",n==="enforce"?"ironbee hook verify-gate --client codex":"ironbee hook activity-end --client codex"),(0,i.writeFileSync)(o,JSON.stringify(s,null,2))}removeIronBeeHooks(o){if((0,i.existsSync)(o))try{const n=(0,i.readFileSync)(o,"utf-8"),r=JSON.parse(n);if(!r.hooks)return;let e=!1;for(const s of Object.keys(r.hooks)){const 
c=r.hooks[s].filter(d=>!M(d));c.length!==r.hooks[s].length&&(e=!0),c.length===0?delete r.hooks[s]:r.hooks[s]=c}e&&(0,i.writeFileSync)(o,JSON.stringify(r,null,2))}catch(n){f.logger.debug(`failed to strip IronBee hooks from ${o}: ${n}`)}}maybeDeleteEmptyHooks(o){if((0,i.existsSync)(o))try{const n=JSON.parse((0,i.readFileSync)(o,"utf-8"));no(n)&&(0,i.unlinkSync)(o)}catch(n){f.logger.debug(`failed to inspect ${o} for emptiness: ${n}`)}}mergeConfigToml(o,n,r){(0,i.mkdirSync)((0,a.join)(o,".codex"),{recursive:!0});let e=(0,t.readCodexConfigToml)(o);if(e=(0,t.ensureFeaturesHooksTrue)(e),e=(0,t.removeMcpServer)(e,T),e=(0,t.removeMcpServer)(e,w),e=(0,t.removeMcpServer)(e,A),e=(0,t.removeMcpServer)(e,_),r){const s=(0,u.getVerificationModel)(n,"codex"),c=(0,i.existsSync)((0,t.userCodexConfigTomlPath)())?(0,i.readFileSync)((0,t.userCodexConfigTomlPath)(),"utf-8"):"",d=(0,t.extractTomlTopLevelModel)(e)===null&&(0,t.extractTomlTopLevelModel)(c)===null;s===void 0&&d&&console.log(` ${l.pc.dim("\u2192")} ${b("[codex]")} ${l.pc.yellow("\u26A0 no model for the verifier")} \u2014 the ${l.pc.bold("ironbee-verifier")} sub-agent inherits the session model, but neither this project's .codex/config.toml nor ~/.codex/config.toml has a top-level ${l.pc.bold("model")}, so it may fail to spawn ("could not resolve the child model"). 
Fix: set ${l.pc.bold("model")} in ~/.codex/config.toml, or set ${l.pc.bold("verification.model")} in your ironbee config.`),this.writeVerifierAgentToml(o,n,s),e=(0,t.upsertAgentsTable)(e,k,[`description = ${JSON.stringify(I)}`,`config_file = ${JSON.stringify(`agents/${k}.toml`)}`]),e=(0,t.ensureMultiAgentV2SpawnMetadataExposed)(e)}else e=(0,t.removeAgentsTable)(e,k),e=(0,t.removeMultiAgentV2SpawnMetadata)(e),this.removeVerifierAgentToml(o);(0,t.writeCodexConfigToml)(o,e)}writeVerifierAgentToml(o,n,r){const e=(0,a.join)(__dirname,"agents",`${k}.md`);let s;try{s=(0,i.readFileSync)(e,"utf-8")}catch(v){f.logger.debug(`failed to read verifier agent source ${e}: ${v}`);return}const c=H("codex");for(const v of u.ALL_CYCLES){const S=(0,u.isCycleEnabled)(n,v)?C=>{const x=(0,a.join)(c,(0,$.fragmentFilename)("skill",v,C));return(0,i.existsSync)(x)?(0,i.readFileSync)(x,"utf-8").trimEnd():null}:null;s=(0,$.applyPlatformSection)(s,v,S,`${k}.toml`)}const d=[];d.push(`name = ${JSON.stringify(k)}`),d.push(`description = ${JSON.stringify(I)}`),d.push('sandbox_mode = "read-only"'),r&&d.push(`model = ${JSON.stringify(r)}`),d.push("developer_instructions = '''"),d.push(s.replace(/'''/g,"```").trimEnd()),d.push("'''");const p=h((v,y,S)=>{v&&(d.push(""),d.push(`[mcp_servers.${y}]`),d.push(...io(S)),d.push(`startup_timeout_sec = ${eo}`),d.push("required = true"),d.push('default_tools_approval_mode = "approve"'))},"addCycle");p((0,u.isCycleEnabled)(n,"browser"),T,(0,u.getMcpServerEntry)(o)),p((0,u.isCycleEnabled)(n,"node"),w,(0,u.getNodeDevToolsMcpEntry)(o)),p((0,u.isCycleEnabled)(n,"backend"),A,(0,u.getBackendDevToolsMcpEntry)(o)),p((0,u.isCycleEnabled)(n,"android"),_,(0,u.getAndroidDevToolsMcpEntry)(o));const m=(0,t.codexAgentTomlPath)(o,k);(0,i.mkdirSync)((0,a.dirname)(m),{recursive:!0}),(0,i.writeFileSync)(m,d.join(`
|
|
2
2
|
`)+`
|
|
3
|
-
`)}removeVerifierAgentToml(o){const
|
|
3
|
+
`)}removeVerifierAgentToml(o){const n=(0,t.codexAgentTomlPath)(o,k);if((0,i.existsSync)(n))try{(0,i.unlinkSync)(n)}catch(r){f.logger.debug(`failed to remove verifier agent toml: ${r}`)}}removeIronBeeMcpServers(o){let n=(0,t.readCodexConfigToml)(o);n&&(n=(0,t.removeMcpServer)(n,T),n=(0,t.removeMcpServer)(n,w),n=(0,t.removeMcpServer)(n,A),n=(0,t.removeMcpServer)(n,_),n=(0,t.removeAgentsTable)(n,k),n=(0,t.removeMultiAgentV2SpawnMetadata)(n),(0,t.writeCodexConfigToml)(o,n))}migrateAwayFromUserLevel(){const o=(0,t.userCodexHooksJsonPath)();this.removeIronBeeHooks(o),this.maybeDeleteEmptyHooks(o);const n=(0,t.userCodexConfigTomlPath)();if((0,i.existsSync)(n))try{let e=(0,i.readFileSync)(n,"utf-8");const s=e;e=(0,t.removeMcpServer)(e,T),e=(0,t.removeMcpServer)(e,w),e=(0,t.removeMcpServer)(e,A),e=(0,t.removeMcpServer)(e,_),e=(0,t.removeAgentsTable)(e,k),e=(0,t.removeMultiAgentV2SpawnMetadata)(e),e!==s&&(0,i.writeFileSync)(n,e)}catch(e){f.logger.debug(`migrate: failed to clean user-level config.toml: ${e}`)}const r=(0,t.userCodexAgentTomlPath)(k);if((0,i.existsSync)(r))try{(0,i.unlinkSync)(r)}catch(e){f.logger.debug(`migrate: failed to remove user-level verifier toml: ${e}`)}}writeAgentsMdBlock(o,n){const r=(0,a.join)(o,"AGENTS.md"),e=(0,a.join)(__dirname,"rules","ironbee-verification.md");let s;try{s=(0,i.readFileSync)(e,"utf-8")}catch(m){f.logger.debug(`failed to read rule source ${e}: ${m}`);return}const c=H("codex");for(const m of u.ALL_CYCLES){const y=(0,u.isCycleEnabled)(n,m)?S=>{const C=(0,a.join)(c,(0,$.fragmentFilename)("rule",m,S));if(!(0,i.existsSync)(C)){const x=S.length>0?`${m}:${S}`:m;return f.logger.debug(`AGENTS.md platform-section ${x}: missing fragment ${C}, using placeholder`),null}return(0,i.readFileSync)(C,"utf-8").trimEnd()}:null;s=(0,$.applyPlatformSection)(s,m,y,"AGENTS.md")}const d=(0,i.existsSync)(r)?(0,i.readFileSync)(r,"utf-8"):"",p=(0,t.upsertAgentsMdBlock)(d,s);(0,i.writeFileSync)(r,p)}writeSkills(o,n){const 
r=(0,a.join)(o,".agents","skills");if(n){const c=(0,a.join)(r,"ironbee-verification");(0,i.mkdirSync)(c,{recursive:!0});const d=(0,a.join)(__dirname,"skills","ironbee-verification.md");try{const p=(0,i.readFileSync)(d,"utf-8");(0,i.writeFileSync)((0,a.join)(c,"SKILL.md"),p)}catch(p){f.logger.debug(`failed to copy skill ${d}: ${p}`)}}const e=(0,a.join)(r,"ironbee-verify");(0,i.mkdirSync)(e,{recursive:!0});const s=(0,a.join)(__dirname,"commands","ironbee-verify","SKILL.md");try{const c=(0,i.readFileSync)(s,"utf-8");(0,i.writeFileSync)((0,a.join)(e,"SKILL.md"),c)}catch(c){f.logger.debug(`failed to copy verify command ${s}: ${c}`)}}removeDir(o){if((0,i.existsSync)(o))try{(0,i.rmSync)(o,{recursive:!0,force:!0})}catch(n){f.logger.debug(`failed to remove ${o}: ${n}`)}}}function io(g){return(0,t.tomlBodyFromRecord)(g)}h(io,"mcpEntryToTomlBody");0&&(module.exports={CodexClient});
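Reading note on the minified change above: the 0.27.0 bundle introduces a constant `eo=30` and pushes a `startup_timeout_sec` line into every `[mcp_servers.*]` block of the generated verifier agent TOML. A de-minified sketch of that `addCycle` helper might look like the following. The names `STARTUP_TIMEOUT_SEC`, `mcpServerBlock`, and the sample entry are illustrative, and the `tomlBodyFromRecord` stub only approximates the real helper in `./util`, whose implementation is not part of this diff.

```javascript
// `eo=30` in the minified 0.27.0 bundle; the readable name is an assumption.
const STARTUP_TIMEOUT_SEC = 30;

// Hypothetical stand-in for `tomlBodyFromRecord` from ./util.
// The real serializer is not shown in the diff; this one emits `key = <JSON value>`.
function tomlBodyFromRecord(entry) {
  return Object.entries(entry).map(([key, value]) => `${key} = ${JSON.stringify(value)}`);
}

// Mirrors the shape of the minified `addCycle` closure: when the cycle is
// enabled, return the lines of one [mcp_servers.<name>] table; otherwise nothing.
function mcpServerBlock(enabled, serverName, entry) {
  if (!enabled) return [];
  return [
    '',
    `[mcp_servers.${serverName}]`,
    ...tomlBodyFromRecord(entry),
    `startup_timeout_sec = ${STARTUP_TIMEOUT_SEC}`, // new in 0.27.0
    'required = true',
    'default_tools_approval_mode = "approve"',
  ];
}

// Sample entry is made up for illustration only.
const lines = mcpServerBlock(true, 'browser-devtools', {
  command: 'npx',
  args: ['-y', 'example-mcp-server'],
});
console.log(lines.join('\n'));
```

In 0.26.0 the same block was emitted without the timeout line, so the only observable difference in the generated file should be the extra `startup_timeout_sec = 30` entry per enabled cycle.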
|
|
@@ -33,6 +33,8 @@ If you see only `ios/`, `web/`, or no mobile directories — the project does NO
|
|
|
33
33
|
- Read Logcat output for the tag(s) relevant to the changed code: `mcp__android-devtools__adt_o11y_log-read` or `mcp__android-devtools__adt_o11y_log-follow` (drain a follow with `mcp__android-devtools__adt_o11y_log-get-followed`, stop it with `mcp__android-devtools__adt_o11y_log-stop-follow`).
|
|
34
34
|
- Confirm expected log lines appear AND no unexpected crashes (FATAL / E/ entries for the app package).
|
|
35
35
|
|
|
36
|
+
**Batch (speed):** connect + launch-app run standalone first (prerequisites). On the device-evidence path, batch the UI interactions + the UI snapshot into one `mcp__android-devtools__adt_execute`; the snapshot captures the state after the batched interactions, so to assert an intermediate state take a snapshot at that point too. The device-evidence screenshot is usually pixel-judged (a visual change) — take THAT one standalone with `includeBase64: true` so you can see it; batch it only when it's purely gate evidence. Log-evidence reads batch together too.
|
|
37
|
+
|
|
36
38
|
### Verdict fields
|
|
37
39
|
The verdict is platform-agnostic — submit only semantic judgment:
|
|
38
40
|
|
|
@@ -13,6 +13,8 @@ The **backend protocol cycle** verifies backend changes by driving real protocol
|
|
|
13
13
|
|
|
14
14
|
You can satisfy the cycle via **protocol-call evidence** (you drive the request yourself), **log evidence** (something else drives the request, you read the resulting logs), **DB evidence** (you inspect database state directly), or any combination. Pick whichever fits the task; one is enough.
|
|
15
15
|
|
|
16
|
+
**Batch (speed):** group consecutive `bedt_*` steps into one `mcp__backend-devtools__bedt_execute` — e.g. a POST then a GET that reuses the created id (bind the first call's result: `const r = callTool('bedt_request_http', {…POST…}); callTool('bedt_request_http', { /* GET using an id from r */ })`), register-source + read, or db-connect + query. Keep a step standalone only when you must inspect its result to DECIDE what to do next, not just to pass a value along.
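As an illustration of the batching rule above, here is a minimal sketch of a POST followed by a GET that reuses the created id inside one batch. It assumes the batch script receives a synchronous `callTool` helper, as the inline snippet suggests; the stub below, the argument names (`method`, `url`, `body`), the response shape, and the localhost URL are all hypothetical and exist only so the sketch runs on its own.

```javascript
// Hypothetical stand-in for the `callTool` helper available inside a
// `bedt_execute` batch. The real helper and its return shape may differ.
const calls = [];
function callTool(name, args) {
  calls.push({ name, args });
  if (args.method === 'POST') {
    return { status: 201, body: { id: 'abc123' } };
  }
  // For a GET, echo back the last path segment as the resource id.
  return { status: 200, body: { id: args.url.split('/').pop() } };
}

// One batch: create a resource, then read it back using the returned id.
// No decision is made between the two calls; a value is only passed along.
function runBatch() {
  const created = callTool('bedt_request_http', {
    method: 'POST',
    url: 'http://localhost:3000/items', // made-up endpoint
    body: { name: 'widget' },
  });
  const fetched = callTool('bedt_request_http', {
    method: 'GET',
    url: `http://localhost:3000/items/${created.body.id}`,
  });
  return { createdStatus: created.status, fetchedId: fetched.body.id };
}

const result = runBatch();
console.log(result);
```

If the second call depended on inspecting the first response to choose between different actions, the two steps would stay as separate tool calls instead.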
|
|
17
|
+
|
|
16
18
|
### Path A — Protocol-call evidence
|
|
17
19
|
|
|
18
20
|
1. **Confirm a backend service is running** (the user's dev server, Docker compose, k8s port-forward, …). The agent itself does not start the service — ask the user if uncertain.
|
|
@@ -14,6 +14,8 @@
|
|
|
14
14
|
|
|
15
15
|
All four tools are MANDATORY (the Stop hook checks each). Functional interaction is expected for every verification.
|
|
16
16
|
|
|
17
|
+
**Batch (speed):** navigate (step 1) is standalone — read the ARIA snapshot it returns to decide your interactions. Then run steps 2–5 in ONE `mcp__browser-devtools__bdt_execute` batch — `callTool('bdt_interaction_…', …)` for each interaction, `callTool('bdt_content_take-screenshot', …)`, `callTool('bdt_a11y_take-aria-snapshot', …)`, `callTool('bdt_o11y_get-console-messages', …)` — instead of four separate turns. Screenshot/aria/console capture the state AFTER the batched interactions, so batch interactions that lead to ONE state you want to assert; to assert an intermediate state (e.g. a modal that opens then closes) take a screenshot/snapshot at that point too — interleave it in the batch or split into two. The interaction is what makes the evidence meaningful: a batch of just the four evidence tools with no real interaction passes the tool-presence check but verifies nothing. If you must judge the screenshot's pixels, take that one standalone with `includeBase64: true`.
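The intermediate-state case described above (a modal that opens and then closes) can be sketched as a batch that interleaves an extra screenshot between the interactions. This is only an outline under assumptions: the `callTool` stub is a hypothetical stand-in for the helper provided inside `bdt_execute`, the interaction tool name is left elided exactly as in the text, and the empty argument objects are placeholders.

```javascript
// Hypothetical stand-in for the batch sandbox's `callTool`; it only records
// the order of calls so the sequencing can be inspected.
const log = [];
function callTool(name, args) {
  log.push(name);
  return { ok: true };
}

function runBatch() {
  // Interaction that opens the modal (tool name elided in the source).
  callTool('bdt_interaction_…', {});
  // Intermediate evidence: capture the modal while it is still open.
  callTool('bdt_content_take-screenshot', {});
  // Interaction that closes the modal.
  callTool('bdt_interaction_…', {});
  // Final evidence: these capture the state after all interactions.
  callTool('bdt_content_take-screenshot', {});
  callTool('bdt_a11y_take-aria-snapshot', {});
  callTool('bdt_o11y_get-console-messages', {});
}

runBatch();
console.log(log);
```

Without the second call in this sequence, the only screenshot would show the page after the modal had already closed, so the open state would never be evidenced.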
|
|
18
|
+
|
|
17
19
|
### Verdict fields
|
|
18
20
|
The verdict is platform-agnostic — you submit only semantic judgment:
|
|
19
21
|
|
|
@@ -31,6 +31,8 @@ If you see `pom.xml`, `build.gradle`, `requirements.txt`, `pyproject.toml`, `go.
|
|
|
31
31
|
- Read errors: `ndt_debug_get-logs` with the error-level filter.
|
|
32
32
|
4. **Disconnect** (optional): `ndt_debug_disconnect`.
|
|
33
33
|
|
|
34
|
+
**Batch (speed):** connect (step 2) is standalone discovery. Batch consecutive `ndt_*` calls in one `mcp__node-devtools__ndt_execute` — set several probes together, then later read snapshots/logs together. The exercise step is ALWAYS separate: whatever triggers the code path (a browser/backend call on another server, a CLI command, the user) can't share an `ndt_*` batch — so node runs as set probes (batch) → exercise (separate) → read snapshots (batch).
|
|
35
|
+
|
|
34
36
|
### Verdict fields
|
|
35
37
|
The verdict is platform-agnostic — you submit only semantic judgment:
|
|
36
38
|
|
|
@@ -1,33 +1,16 @@
|
|
|
1
|
-
You MUST verify all code changes
|
|
1
|
+
You MUST verify all code changes before completing any task — by DELEGATING to the `ironbee-verifier` custom agent. You do not have the verification tools; the verifier does. Never verify inline.
|
|
2
2
|
|
|
3
|
-
|
|
3
|
+
After editing code, before reporting completion: spawn the `ironbee-verifier` custom agent — call `spawn_agent` with `agent_type="ironbee-verifier"` and `fork_turns="none"` (REQUIRED: the default `fork_turns="all"` silently drops the agent_type → a generic agent without the verification tools; not a generic "act as" agent either) with a prompt describing what to verify. It drives the verification tools, exercises every active cycle (browser / runtime / backend, as wired up for this project), and submits the single verdict in this shared session — then returns a summary. Relay it. **Wait for the verifier in the same turn — do NOT background it; if it is backgrounded your turn can end before its verdict is recorded, leaving your changes unverified.**
|
|
4
4
|
|
|
5
|
-
|
|
5
|
+
If verification FAILS: fix the issues the verifier reported, optionally record what you fixed (`echo '{"fixes":["what you repaired"]}' | ironbee hook record-fix`), then re-delegate until it passes. Every code edit (apply_patch) clears the verdict, requiring re-delegation.
|
|
6
6
|
|
|
7
|
-
|
|
8
|
-
- judges the result and submits a single verdict via `ironbee hook submit-verdict`,
|
|
9
|
-
- runs inside your session, so the Stop gate sees its work,
|
|
10
|
-
- does NOT edit code — if it finds problems it returns them as `issues` for YOU to fix.
|
|
11
|
-
|
|
12
|
-
You (the main agent) do **not** have the devtools tools and must not try to drive them. Your job is to edit code, spawn the verifier to verify, and — if it reports a fail — fix the issues and re-spawn it.
|
|
13
|
-
|
|
14
|
-
**Wait for the verifier in the same turn — do NOT background it.** Let it run to completion and read its verdict before you respond; a backgrounded verifier can let your turn end (and the Stop gate fire) before its verdict is recorded, leaving your changes unverified.
|
|
15
|
-
|
|
16
|
-
## After a fail → fix → record → re-verify
|
|
17
|
-
|
|
18
|
-
1. The verifier returns a fail verdict with `issues`.
|
|
19
|
-
2. Fix the issues in your code.
|
|
20
|
-
3. Record what you fixed so the next pass verdict captures it (the verifier can't author this — it didn't make the edit):
|
|
21
|
-
```
|
|
22
|
-
echo '{"fixes":["fixed null check in src/foo.ts"]}' | ironbee hook record-fix
|
|
23
|
-
```
|
|
24
|
-
4. Re-spawn the `ironbee-verifier` custom agent. Repeat until it passes.
|
|
7
|
+
The Stop gate blocks completion until a verdict exists for your changes — delegation is the only path.
|
|
25
8
|
|
|
26
9
|
## BANNED
|
|
27
10
|
|
|
28
|
-
-
|
|
29
|
-
- "
|
|
30
|
-
-
|
|
31
|
-
-
|
|
32
|
-
|
|
33
|
-
|
|
11
|
+
- Running the verification tools (`bdt_*` / `ndt_*` / `bedt_*`) or `ironbee hook verification-start` / `submit-verdict` yourself — those are the verifier's job. Delegate.
|
|
12
|
+
- Using the generic `spawn_agent` tool / a plain fork to "be" the verifier — that spawns a DEFAULT agent without the devtools. Spawn the `ironbee-verifier` custom agent by its `agent_type`.
|
|
13
|
+
- Reporting a task complete without delegating verification of your changes.
|
|
14
|
+
- Submitting a verdict based on assumptions, code reading, or prior knowledge — the verifier verifies through real tools.
|
|
15
|
+
- Writing `verdict.json` directly.
|
|
16
|
+
- Backgrounding the verifier custom agent, or ending your turn before it returns its verdict — wait for it in the same turn.
|