@ironbee-ai/cli 0.26.0 → 0.28.0

Files changed (41)
  1. package/CHANGELOG.md +8 -0
  2. package/dist/clients/claude/agents/ironbee-verifier.md +31 -0
  3. package/dist/clients/claude/platforms/skill.android.md +2 -0
  4. package/dist/clients/claude/platforms/skill.backend.md +2 -0
  5. package/dist/clients/claude/platforms/skill.browser.md +2 -0
  6. package/dist/clients/claude/platforms/skill.node.md +2 -0
  7. package/dist/clients/claude/rules/ironbee-verification.md +1 -1
  8. package/dist/clients/codex/agents/ironbee-verifier.md +75 -26
  9. package/dist/clients/codex/cli.js +1 -1
  10. package/dist/clients/codex/commands/ironbee-verify/SKILL.main.md +114 -0
  11. package/dist/clients/codex/commands/ironbee-verify/SKILL.md +38 -61
  12. package/dist/clients/codex/index.js +2 -2
  13. package/dist/clients/codex/platforms/skill.android.md +2 -0
  14. package/dist/clients/codex/platforms/skill.backend.md +2 -0
  15. package/dist/clients/codex/platforms/skill.browser.md +2 -0
  16. package/dist/clients/codex/platforms/skill.node.md +2 -0
  17. package/dist/clients/codex/rules/ironbee-verification.main.md +39 -0
  18. package/dist/clients/codex/rules/ironbee-verification.md +10 -27
  19. package/dist/clients/codex/skills/ironbee-verification.main.md +110 -0
  20. package/dist/clients/codex/skills/ironbee-verification.md +40 -68
  21. package/dist/clients/codex/util.js +32 -22
  22. package/dist/clients/codex/verifier.js +2 -0
  23. package/dist/clients/cursor/platforms/skill.android.md +2 -0
  24. package/dist/clients/cursor/platforms/skill.backend.md +2 -0
  25. package/dist/clients/cursor/platforms/skill.browser.md +2 -0
  26. package/dist/clients/cursor/platforms/skill.node.md +2 -0
  27. package/dist/clients/cursor/skills/ironbee-verification.md +21 -0
  28. package/dist/commands/config.js +1 -1
  29. package/dist/commands/install.js +1 -1
  30. package/dist/commands/mode-select.js +2 -2
  31. package/dist/commands/verification-toggle.js +1 -1
  32. package/dist/hooks/core/activity-end.js +1 -1
  33. package/dist/hooks/core/session-state.js +1 -1
  34. package/dist/hooks/core/submit-verdict.js +3 -3
  35. package/dist/hooks/core/verification-lifecycle.js +1 -1
  36. package/dist/hooks/core/verify-gate.js +11 -11
  37. package/dist/lib/config.js +1 -1
  38. package/dist/lib/install-version.js +1 -1
  39. package/dist/lib/platform-section.js +3 -3
  40. package/dist/tui/config/schema.js +1 -1
  41. package/package.json +1 -1
@@ -33,6 +33,8 @@ If you see only `ios/`, `web/`, or no mobile directories — the project does NO
  - Read Logcat output for the tag(s) relevant to the changed code: `mcp__android-devtools__adt_o11y_log-read` or `mcp__android-devtools__adt_o11y_log-follow` (drain a follow with `mcp__android-devtools__adt_o11y_log-get-followed`, stop it with `mcp__android-devtools__adt_o11y_log-stop-follow`).
  - Confirm expected log lines appear AND no unexpected crashes (FATAL / E/ entries for the app package).
 
+ **Batch (speed):** connect + launch-app run standalone first (prerequisites). On the device-evidence path, batch the UI interactions + the UI snapshot into one `mcp__android-devtools__adt_execute`; the snapshot captures the state after the batched interactions, so to assert an intermediate state take a snapshot at that point too. The device-evidence screenshot is usually pixel-judged (a visual change) — take THAT one standalone with `includeBase64: true` so you can see it; batch it only when it's purely gate evidence. Log-evidence reads batch together too.
+
  ### Verdict fields
  The verdict is platform-agnostic — submit only semantic judgment:
 
@@ -13,6 +13,8 @@ The **backend protocol cycle** verifies backend changes by driving real protocol
 
  You can satisfy the cycle via **protocol-call evidence** (you drive the request yourself), **log evidence** (something else drives the request, you read the resulting logs), **DB evidence** (you inspect database state directly), or any combination. Pick whichever fits the task; one is enough.
 
+ **Batch (speed):** group consecutive `bedt_*` steps into one `mcp__backend-devtools__bedt_execute` — e.g. a POST then a GET that reuses the created id (bind the first call's result: `const r = callTool('bedt_request_http', {…POST…}); callTool('bedt_request_http', { /* GET using an id from r */ })`), register-source + read, or db-connect + query. Keep a step standalone only when you must inspect its result to DECIDE what to do next, not just to pass a value along.
+
  ### Path A — Protocol-call evidence
 
  1. **Confirm a backend service is running** (the user's dev server, Docker compose, k8s port-forward, …). The agent itself does not start the service — ask the user if uncertain.
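
The result-binding pattern in the added **Batch (speed)** paragraph can be sketched in plain JavaScript. This is only an illustration under stated assumptions: the `callTool` stub below is a hypothetical stand-in (the real one is supplied inside `bedt_execute`, and its return shape is not shown in this diff), and the `method` / `url` / `body` argument names and the `/items` endpoint are invented for the example.

```javascript
// Hypothetical stand-in for the sandbox's callTool. The real return shape is
// not documented in this diff, so `{ status, body }` is an assumption.
const store = new Map();
function callTool(tool, args) {
  if (tool !== 'bedt_request_http') throw new Error(`unknown tool: ${tool}`);
  if (args.method === 'POST') {
    const id = String(store.size + 1);
    store.set(id, args.body);
    return { status: 201, body: { id, ...args.body } };
  }
  // GET: look up the id taken from the last path segment.
  const id = args.url.split('/').pop();
  return store.has(id)
    ? { status: 200, body: { id, ...store.get(id) } }
    : { status: 404, body: null };
}

// One batch: the POST result is bound to `created` and only passed along to
// the GET, so neither step has to run as a standalone turn.
function createThenFetch() {
  const created = callTool('bedt_request_http', {
    method: 'POST', url: '/items', body: { name: 'widget' }, // assumed args
  });
  const fetched = callTool('bedt_request_http', {
    method: 'GET', url: `/items/${created.body.id}`, // assumed args
  });
  return { created, fetched };
}
```

If the second call depended on a judgment about the first result (for example, choosing a different endpoint after reading an error), the paragraph's rule would put the first call in its own turn instead.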
@@ -14,6 +14,8 @@
 
  All four tools are MANDATORY (the Stop hook checks each). Functional interaction is expected for every verification.
 
+ **Batch (speed):** navigate (step 1) is standalone — read the ARIA snapshot it returns to decide your interactions. Then run steps 2–5 in ONE `mcp__browser-devtools__bdt_execute` batch — `callTool('bdt_interaction_…', …)` for each interaction, `callTool('bdt_content_take-screenshot', …)`, `callTool('bdt_a11y_take-aria-snapshot', …)`, `callTool('bdt_o11y_get-console-messages', …)` — instead of four separate turns. Screenshot/aria/console capture the state AFTER the batched interactions, so batch interactions that lead to ONE state you want to assert; to assert an intermediate state (e.g. a modal that opens then closes) take a screenshot/snapshot at that point too — interleave it in the batch or split into two. The interaction is what makes the evidence meaningful: a batch of just the four evidence tools with no real interaction passes the tool-presence check but verifies nothing. If you must judge the screenshot's pixels, take that one standalone with `includeBase64: true`.
+
  ### Verdict fields
  The verdict is platform-agnostic — you submit only semantic judgment:
 
@@ -31,6 +31,8 @@ If you see `pom.xml`, `build.gradle`, `requirements.txt`, `pyproject.toml`, `go.
  - Read errors: `ndt_debug_get-logs` with the error-level filter.
  4. **Disconnect** (optional): `ndt_debug_disconnect`.
 
+ **Batch (speed):** connect (step 2) is standalone discovery. Batch consecutive `ndt_*` calls in one `mcp__node-devtools__ndt_execute` — set several probes together, then later read snapshots/logs together. The exercise step is ALWAYS separate: whatever triggers the code path (a browser/backend call on another server, a CLI command, the user) can't share an `ndt_*` batch — so node runs as set probes (batch) → exercise (separate) → read snapshots (batch).
+
  ### Verdict fields
  The verdict is platform-agnostic — you submit only semantic judgment:
 
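
The "set probes (batch) → exercise (separate) → read snapshots (batch)" ordering from the added paragraph can be modelled as a small grouping function. This is a sketch, not IronBee code: `planTurns` and the `standalone` flag are invented here, and every tool name in the sample input except `ndt_debug_get-logs` is a hypothetical placeholder.

```javascript
// Split an ordered list of steps into turns: consecutive batchable ndt_* steps
// share one ndt_execute batch; every other step runs as its own turn.
function planTurns(steps) {
  const turns = [];
  let batch = [];
  const flush = () => {
    if (batch.length > 0) {
      turns.push({ kind: 'batch', tools: batch });
      batch = [];
    }
  };
  for (const step of steps) {
    const batchable = step.tool.startsWith('ndt_') && !step.standalone;
    if (batchable) {
      batch.push(step.tool);
    } else {
      flush(); // close any open batch before the standalone step
      turns.push({ kind: 'standalone', tools: [step.tool] });
    }
  }
  flush();
  return turns;
}

// Placeholder names (hypothetical), apart from ndt_debug_get-logs.
const turns = planTurns([
  { tool: 'ndt_example_connect', standalone: true }, // discovery step
  { tool: 'ndt_example_set-probe-a' },
  { tool: 'ndt_example_set-probe-b' },
  { tool: 'exercise: CLI command or request from elsewhere' },
  { tool: 'ndt_example_read-snapshots' },
  { tool: 'ndt_debug_get-logs' },
]);
console.log(turns.map((t) => `${t.kind}(${t.tools.length})`).join(' -> '));
// prints "standalone(1) -> batch(2) -> standalone(1) -> batch(2)"
```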
@@ -0,0 +1,39 @@
+ You MUST verify all code changes before completing any task — by driving the IronBee verification tools yourself. This project runs IronBee in **main-agent** mode: the devtools tools (`mcp__browser-devtools__*` / `mcp__node-devtools__*` / `mcp__backend-devtools__*` / `mcp__android-devtools__*`) are wired into THIS session. There is no verifier sub-agent — you verify inline.
+
+ Your **Session ID** is injected into your context by the SessionStart hook (a `Session ID: <sid>` line); author it in every `ironbee hook` command's JSON.
+
+ After editing code (`apply_patch`), before reporting completion:
+
+ 1. **Start verification — run this ALONE first** (Codex runs shell and MCP tools in separate lanes, so the open must land before any devtools call):
+ ```
+ echo '{"session_id":"<your-session-id>"}' | ironbee hook verification-start
+ ```
+ 2. Build and start the app if it isn't already running (don't guess ports). Track what YOU started.
+ 3. **Drive every active cycle's tools** (browser / runtime / backend / android, as wired up for this project — see the platform sections below) to exercise what changed. Reading code is NOT verification.
+ 4. Tear down only what you started, then **submit one verdict**:
+ ```
+ echo '{"session_id":"...","status":"pass","checks":["..."]}' | ironbee hook submit-verdict
+ ```
+ On fail add `"issues":[...]`; on pass-after-fail add `"fixes":[...]`. Nothing to verify → N/A (`"status":"not_applicable","reason":[...]`, or per-platform `"not_applicable_cycles":[...]`).
+ 5. If verification FAILS: submit the fail verdict first, then fix the issues, then re-verify (back to step 1) until it passes.
+
+ Every `apply_patch` clears the verdict, requiring re-verification. The Stop gate blocks completion until a valid verdict exists for your changes and the required tools were called for every active cycle.
+
+ <!--IRONBEE:PLATFORM:browser-->
+ <!--/IRONBEE:PLATFORM:browser-->
+
+ <!--IRONBEE:PLATFORM:node-->
+ <!--/IRONBEE:PLATFORM:node-->
+
+ <!--IRONBEE:PLATFORM:backend-->
+ <!--/IRONBEE:PLATFORM:backend-->
+
+ <!--IRONBEE:PLATFORM:android-->
+ <!--/IRONBEE:PLATFORM:android-->
+
+ ## BANNED
+
+ - Reporting a task complete without verifying your changes through the real tools.
+ - Submitting a verdict based on assumptions, code reading, or prior knowledge — verify through real tools.
+ - Writing `verdict.json` directly (always use `ironbee hook submit-verdict`).
+ - Submitting `pass` when your own evidence shows a legitimate, in-scope operation breaking — that is a FALSE pass; fail it.
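
Both hook invocations in this new file follow the same `echo '<json>' | ironbee hook <name>` shape, so the command line can be generated rather than hand-typed. A minimal sketch, assuming a POSIX shell; `hookCommand` is a hypothetical helper and the session id `abc` is a placeholder, neither of which is part of the package.

```javascript
// Build the shell line for an `ironbee hook <name>` call from a payload object.
// Single quotes inside the JSON are escaped with the POSIX '\'' idiom so the
// payload survives being wrapped in single quotes.
function hookCommand(hook, payload) {
  const json = JSON.stringify(payload).replace(/'/g, `'\\''`);
  return `echo '${json}' | ironbee hook ${hook}`;
}

// Step 1 of the flow: open the verification cycle (placeholder session id).
console.log(hookCommand('verification-start', { session_id: 'abc' }));
// prints: echo '{"session_id":"abc"}' | ironbee hook verification-start
```

Serializing with `JSON.stringify` keeps the payload valid JSON even when a check or issue string contains double quotes, which is easy to get wrong in a hand-written `echo`.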
@@ -1,33 +1,16 @@
- You MUST verify all code changes through real tools before completing any task — and on Codex you do this by **delegating to the `ironbee-verifier` custom agent**, not by verifying inline.
+ You MUST verify all code changes before completing any task — by DELEGATING to the `ironbee-verifier` custom agent. You do not have the verification tools; the verifier does. Never verify inline.
 
- ## Delegate verification (do not verify inline)
+ After editing code, before reporting completion: spawn the `ironbee-verifier` custom agent — call `spawn_agent` with `agent_type="ironbee-verifier"` and `fork_turns="none"` (REQUIRED: the default `fork_turns="all"` silently drops the agent_type → a generic agent without the verification tools; not a generic "act as" agent either) with a prompt describing what to verify. It drives the verification tools, exercises every active cycle (browser / runtime / backend, as wired up for this project), and submits the single verdict in this shared session — then returns a summary. Relay it. **Wait for the verifier in the same turn — do NOT background it; if it is backgrounded your turn can end before its verdict is recorded, leaving your changes unverified.**
 
- After you edit code (or whenever asked to "verify"), **spawn the `ironbee-verifier` custom agent** to run the verification cycle. Spawn it as a custom agent **by its `agent_type`** (`ironbee-verifier`), NOT by telling a generic agent to "act as" it (that loads a generic agent without the verification tools). The verifier:
+ If verification FAILS: fix the issues the verifier reported, optionally record what you fixed (`echo '{"fixes":["what you repaired"]}' | ironbee hook record-fix`), then re-delegate until it passes. Every code edit (apply_patch) clears the verdict, requiring re-delegation.
 
- - drives the real devtools tools (browser / node / backend: it owns them; you do not),
- - judges the result and submits a single verdict via `ironbee hook submit-verdict`,
- - runs inside your session, so the Stop gate sees its work,
- - does NOT edit code — if it finds problems it returns them as `issues` for YOU to fix.
-
- You (the main agent) do **not** have the devtools tools and must not try to drive them. Your job is to edit code, spawn the verifier to verify, and — if it reports a fail — fix the issues and re-spawn it.
-
- **Wait for the verifier in the same turn — do NOT background it.** Let it run to completion and read its verdict before you respond; a backgrounded verifier can let your turn end (and the Stop gate fire) before its verdict is recorded, leaving your changes unverified.
-
- ## After a fail → fix → record → re-verify
-
- 1. The verifier returns a fail verdict with `issues`.
- 2. Fix the issues in your code.
- 3. Record what you fixed so the next pass verdict captures it (the verifier can't author this — it didn't make the edit):
- ```
- echo '{"fixes":["fixed null check in src/foo.ts"]}' | ironbee hook record-fix
- ```
- 4. Re-spawn the `ironbee-verifier` custom agent. Repeat until it passes.
+ The Stop gate blocks completion until a verdict exists for your changes; delegation is the only path.
 
  ## BANNED
 
- - Verifying inline / driving devtools tools yourself — always delegate to the `ironbee-verifier` custom agent.
- - "Act as ironbee-verifier" free-text spawns; spawn the custom agent by `agent_type` so it loads its tools.
- - Backgrounding the verifier, or ending your turn before it returns its verdict — wait for it in the same turn.
- - Writing `verdict.json` directly, or completing a task with unverified code changes.
-
- The Stop gate blocks completion until the verifier has exercised every active cycle's tools and submitted a verdict with non-empty `checks`. Every code edit clears the prior verdict, requiring a fresh verification.
+ - Running the verification tools (`bdt_*` / `ndt_*` / `bedt_*`) or `ironbee hook verification-start` / `submit-verdict` yourself — those are the verifier's job. Delegate. The verifier already submitted the single verdict in this shared session; you only RELAY it in text. Re-running `submit-verdict` yourself is REJECTED ("no active verification cycle" — the cycle already closed) and records nothing — a duplicate, not a verdict.
+ - Using the generic `spawn_agent` tool / a plain fork to "be" the verifier; that spawns a DEFAULT agent without the devtools. Spawn the `ironbee-verifier` custom agent by its `agent_type`.
+ - Reporting a task complete without delegating verification of your changes.
+ - Submitting a verdict based on assumptions, code reading, or prior knowledge; the verifier verifies through real tools.
+ - Writing `verdict.json` directly.
+ - Backgrounding the verifier custom agent, or ending your turn before it returns its verdict; wait for it in the same turn.
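
The spawn rule in the rewritten text comes down to two argument checks. The sketch below encodes them as a validator; `spawnProblems` is a hypothetical helper written for illustration, and the only field names it relies on are the `agent_type` and `fork_turns` parameters quoted in the diff.

```javascript
// Return the problems with a spawn_agent argument object, per the rule above.
// An omitted fork_turns is treated as the default "all", which the text says
// silently drops the agent_type.
function spawnProblems(args) {
  const problems = [];
  if (args.agent_type !== 'ironbee-verifier') {
    problems.push('agent_type must be "ironbee-verifier"');
  }
  const forkTurns = args.fork_turns ?? 'all';
  if (forkTurns !== 'none') {
    problems.push('fork_turns must be "none"');
  }
  return problems;
}

console.log(spawnProblems({ agent_type: 'ironbee-verifier' }));
// prints [ 'fork_turns must be "none"' ]
```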
@@ -0,0 +1,110 @@
+ ---
+ name: ironbee-verification
+ description: >
+ MANDATORY verification after code changes. Activates when implementing features, fixing
+ bugs, modifying UI components, API endpoints, styles, refactoring, or any task that changes
+ application behavior. After editing code you MUST verify the affected cycle(s) through real
+ tools and submit a single verdict (pass or fail) before reporting task completion. You drive
+ the verification tools yourself — they are wired into this session.
+ ---
+
+ # IronBee Verification
+
+ > **You verify INLINE.** This project runs IronBee in **main-agent** mode: the devtools tools
+ > (`mcp__browser-devtools__*` / `mcp__node-devtools__*` / `mcp__backend-devtools__*` /
+ > `mcp__android-devtools__*`) are wired into THIS session — you drive them yourself. There is no
+ > verifier sub-agent to delegate to.
+
+ ## Rule
+ No task is complete until your changes are verified through **real tools**, not by reading code or
+ inferring behavior. After verifying, you MUST submit a verdict (pass or fail) before doing
+ anything else. If verification fails, submit the fail verdict first, then fix and re-verify.
+
+ ## Session id
+ IronBee's SessionStart hook injects your **Session ID** into your context at the top of the
+ session (a line `Session ID: <sid>`). Author it in every `ironbee hook` command's JSON.
+
+ ## Cycles
+ IronBee runs verification in **cycles**. A single Stop hook can drive multiple cycles in
+ parallel — every active cycle must pass for your task to complete. You don't choose which cycle
+ runs — the edited file's path decides (pattern match). **See the platform sections near the
+ bottom of this file** for which cycles are active for this project, the tools they expose, and
+ the per-cycle flow.
+
+ ## Universal flow
+
+ 1. Finish your code edits (`apply_patch`).
+ 2. **Start verification — run this ALONE first**, before any devtools call:
+ ```
+ echo '{"session_id":"<your-session-id>"}' | ironbee hook verification-start
+ ```
+ Devtools tools are blocked until this opens the cycle. **Codex runs shell commands and MCP
+ tools in separate lanes**, so a same-message ordering of this shell command before a devtools
+ call is NOT guaranteed — a devtools call that lands first is blocked. Once the cycle is open,
+ independent MCP calls can ride one message.
+ 3. Build and start the application **only if it isn't already running** (check `docker compose ps`
+ / process output / config — don't guess ports). Track whether YOU started it.
+ 4. **Run the per-cycle flows for every active cycle** (see the platform sections below). All
+ active cycles must be exercised within this one verification cycle.
+ 5. **Teardown** — stop ONLY what you started (the dev server / container you launched for this
+ verification); never stop a server that was already running. Honor any cycle-specific teardown
+ (e.g. stop an active screen recording) BEFORE the verdict.
+ 6. **Submit your verdict immediately** — do NOT edit any code first:
+ ```
+ echo '<verdict-json>' | ironbee hook submit-verdict
+ ```
+ - Platform-agnostic shape: `session_id`, `status`, `checks`, optionally `issues` / `fixes`.
+ One verdict regardless of how many cycles ran.
+ - Pass → `{ "session_id": "...", "status": "pass", "checks": [...] }`
+ - Fail → add `"issues": [...]` describing what failed.
+ - Pass after a previous fail → add `"fixes": [...]` describing what was repaired.
+ - **A FALSE failure is a FAIL — not "verified failure handling".** Separate an EXPECTED
+ negative test (you deliberately fed invalid input and it correctly failed → supports a
+ `pass`) from a FALSE failure (a VALID, in-scope operation that SHOULD succeed but errors out
+ → a DEFECT → `status: "fail"`).
+ - **Nothing to verify? Use N/A — never fake evidence.** When the change has no runtime surface
+ (type-only edit, behavior-neutral refactor, config/docs that still tripped a cycle): global
+ `{ "session_id": "...", "status": "not_applicable", "reason": ["why"] }` (no `checks`), or
+ per-platform on a pass/fail verdict `"not_applicable_cycles": ["browser"], "reason": ["..."]`
+ to exempt some cycles while verifying others. `reason` is REQUIRED. Strict mode rejects N/A.
+ Base "nothing to verify" on the FULL change set (the change is often already COMMITTED) —
+ check `git diff HEAD~1 HEAD --stat`, not just a clean `git status`.
+ 7. If failed → fix → rebuild → go back to step 2 → repeat until pass (up to the retry cap).
+
+ ## Speed — batch your tool calls (fewer LLM round-trips)
+
+ Each tool call is a separate LLM round-trip, and that round-trip — not the tool's execution — is
+ the dominant cost. Drive the tools in as few turns as you can:
+
+ - **Batch a scope's work into ONE `*_execute` call.** Each cycle exposes a batch tool
+ (`mcp__browser-devtools__bdt_execute` / `mcp__node-devtools__ndt_execute` /
+ `mcp__backend-devtools__bedt_execute` / `mcp__android-devtools__adt_execute`) that runs many
+ steps in one turn — nest each as a `callTool('<tool>', { … })`. A batch nests only that cycle's
+ own tools. It's a JS sandbox, so a later step can reuse a value an earlier `callTool` returned;
+ `*_execute` STOPS at the first failing nested call. Authoring the batch is not the work — read
+ each result and confirm real evidence came back.
+ - **Discovery stays standalone** — the step that reveals what to do (navigate / connect / snapshot)
+ runs first and on its own; THEN batch the actions it told you to take.
+
+ <!--IRONBEE:PLATFORM:browser-->
+ <!--/IRONBEE:PLATFORM:browser-->
+
+ <!--IRONBEE:PLATFORM:node-->
+ <!--/IRONBEE:PLATFORM:node-->
+
+ <!--IRONBEE:PLATFORM:backend-->
+ <!--/IRONBEE:PLATFORM:backend-->
+
+ <!--IRONBEE:PLATFORM:android-->
+ <!--/IRONBEE:PLATFORM:android-->
+
+ ## Important
+ - **Always submit a verdict after every verification attempt** — both pass AND fail.
+ - Submit verdicts via `ironbee hook submit-verdict`, never write `verdict.json` directly.
+ - Every `apply_patch` edit automatically clears your session's verdict — re-verify after edits.
+ - After the retry cap, you may complete but must report unresolved issues.
+
+ ## BANNED
+ - Reporting a task complete without verifying your changes through the real tools.
+ - Claiming verification passed based on code reading, assumptions, or prior knowledge.
+ - Submitting `pass` when your own evidence shows a legitimate operation breaking.
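
The verdict shapes listed in step 6 of this new file can be collected into one builder. This is a sketch of the shapes as the text describes them, not the CLI's own validation: `buildVerdict` and its camelCase parameter names are hypothetical, and whether the real hook applies further rules (for example in strict mode) is not visible in this diff.

```javascript
// Assemble a verdict object following the shapes described in step 6.
function buildVerdict({ sessionId, status, checks = [], issues, fixes, reason, notApplicableCycles }) {
  const hasReason = Array.isArray(reason) && reason.length > 0;
  if (status === 'not_applicable') {
    // Global N/A: reason is required and there are no checks.
    if (!hasReason) throw new Error('reason is required for not_applicable');
    return { session_id: sessionId, status, reason };
  }
  if (status !== 'pass' && status !== 'fail') {
    throw new Error(`unknown status: ${status}`);
  }
  const verdict = { session_id: sessionId, status, checks };
  if (status === 'fail') verdict.issues = issues ?? [];
  if (status === 'pass' && fixes && fixes.length > 0) verdict.fixes = fixes;
  if (notApplicableCycles && notApplicableCycles.length > 0) {
    // Per-platform exemption on a pass/fail verdict also needs a reason.
    if (!hasReason) throw new Error('reason is required with not_applicable_cycles');
    verdict.not_applicable_cycles = notApplicableCycles;
    verdict.reason = reason;
  }
  return verdict;
}

// Pass after an earlier fail; the session id is a placeholder.
console.log(JSON.stringify(buildVerdict({
  sessionId: 'abc', status: 'pass', checks: ['form submits'], fixes: ['null check'],
})));
```

The resulting object is what would be serialized into the `<verdict-json>` placeholder of the `submit-verdict` command.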
@@ -2,82 +2,54 @@
  name: ironbee-verification
  description: >
  MANDATORY verification after code changes. Activates when implementing features, fixing
- bugs, modifying UI components, API endpoints, styles, refactoring, or any task that
- changes application behavior. Verification runs in cycles activated by file-pattern
- match; which cycles are wired up for this project is shown in the platform sections
- near the bottom of this file. After every code edit you MUST verify the affected
- cycle(s) through real tools and submit a single verdict (pass or fail) before
- reporting task completion. If verification fails, submit the fail verdict first,
- then fix.
+ bugs, modifying UI components, API endpoints, styles, refactoring, or any task that changes
+ application behavior. After editing code you MUST verify the changes before reporting task
+ completion, and you verify by DELEGATING to the ironbee-verifier custom agent, never inline.
  ---
 
  # IronBee Verification
 
- ## Rule
- No task is complete until changes are verified — through **real tools**, not by reading code or inferring behavior. After verification, you MUST submit a verdict (pass or fail) before doing anything else. If verification fails, submit the fail verdict first, then fix.
-
- ## Cycles
-
- IronBee runs verification in **cycles**. A single Stop hook can drive multiple cycles in parallel — every active cycle must pass for your task to complete.
-
- You don't choose which cycle runs — the file pattern decides. A single edited file can match multiple cycles' patterns and activate them all. Cycles always run in parallel within a single Stop run. Each cycle has its own tools, flow steps, and verdict fields.
-
- **See the platform sections near the bottom of this file** for which cycles are active for this project, the tools they expose, and the per-cycle verdict fields you must include.
-
- ## Application lifecycle (your responsibility)
-
- For every active cycle you manage the running application:
- - **Build** if needed (`npm run build`, `docker compose build`, …)
- - **Start** before navigating/connecting (`npm run dev`, `docker compose up -d`, …)
- - **Stop** when verification is complete
-
- If already running, skip start. If the build fails, fix it before proceeding.
+ > **Delegate — do NOT verify inline.** You verify by spawning the **`ironbee-verifier` custom agent** via `spawn_agent` with `agent_type="ironbee-verifier"` **and `fork_turns="none"`** (the default `fork_turns="all"` silently drops the agent_type → a generic toolless agent; not a generic "act as" agent either) and relaying its verdict. The verifier owns the devtools tools; you (the main agent) don't have them.
 
- **Don't guess ports.** After starting, check the actual port via `docker compose ps`, process output, or config files.
-
- ## Universal flow
-
- 1. Implement your changes (write/edit code).
- 2. **Start verification** (one cycle covers every active mode — every active cycle's flow runs within the same verification cycle):
- ```
- echo '{"session_id":"<your-session-id>"}' | ironbee hook verification-start
- ```
- Devtools tools are blocked without this.
- 3. Build and start the application if not already running.
- 4. **Run the per-cycle flows for every active cycle.** See the platform sections near the bottom of this file — each enabled cycle's section has its own flow steps and mandatory tools. All active cycles must be exercised within this one verification cycle.
- 5. Stop the dev server when verification is complete (every cycle including the final one).
- 6. **Honor any cycle-specific teardown** noted in the platform sections (e.g. recording stop) BEFORE submitting your verdict.
- 7. **Submit your verdict immediately** — do NOT edit any code first:
+ ## Rule
+ No task is complete until your changes are verified through **real tools** — and you verify by
+ **delegating to the `ironbee-verifier` custom agent**, never inline. You do not have the
+ verification tools (browser / runtime / backend devtools); the verifier does. After delegating,
+ relay its verdict; on fail, fix the reported issues and re-delegate until it passes.
+
+ ## How to verify — delegate
+ 1. Finish your code edits.
+ 2. Spawn the `ironbee-verifier` custom agent: call `spawn_agent` with `agent_type="ironbee-verifier"`
+ AND `fork_turns="none"` (REQUIRED: the default `fork_turns="all"` silently drops the agent_type,
+ giving a generic agent without the verification tools; not a generic "act as" agent either) with a
+ prompt like *"Verify the recent changes"* (optionally describe what changed, or pass a
+ scenario). It drives the verification tools, exercises every active cycle, and submits the
+ single verdict in this **shared session**, then returns a short summary.
+ **Wait for it in the same turn — do NOT background the verifier.** Let it run to completion
+ and read its verdict before you respond. If the verifier is backgrounded, your turn can end
+ (and the Stop gate fire) before its verdict is recorded, leaving your changes unverified.
+ 3. **Relay the verdict.** If it FAILED: fix the issues it reported. Optionally record what you
+ fixed so the next pass verdict can describe it:
  ```
- echo '<verdict-json>' | ironbee hook submit-verdict
+ echo '{"fixes":["what you repaired"]}' | ironbee hook record-fix
  ```
- - Verdict shape is platform-agnostic: `status`, `checks`, optionally `issues` / `fixes`. One verdict regardless of how many cycles ran.
- - Pass → `{ "session_id": "...", "status": "pass", "checks": [...] }`
- - Fail → add `"issues": [...]` describing what failed.
- - Pass after a previous fail → add `"fixes": [...]` describing what was repaired.
- - **The Stop hook enforces that you called the required tools for every active cycle and that the verdict carries non-empty `checks`.**
- 8. If failed → fix → rebuild → go back to step 2 → repeat until pass.
-
- <!--IRONBEE:PLATFORM:browser-->
- <!--/IRONBEE:PLATFORM:browser-->
-
- <!--IRONBEE:PLATFORM:node-->
- <!--/IRONBEE:PLATFORM:node-->
-
- <!--IRONBEE:PLATFORM:backend-->
- <!--/IRONBEE:PLATFORM:backend-->
+ Then re-delegate. Repeat until it passes.
 
- <!--IRONBEE:PLATFORM:android-->
- <!--/IRONBEE:PLATFORM:android-->
+ The Stop gate enforces this: it blocks completion until a verdict exists for your changes. Since
+ you can't verify inline, delegation is the only path forward.
 
- ## Important
- - **Always submit a verdict after every verification attempt**: both pass AND fail. Fail verdicts are tracked for analytics.
- - The Stop hook checks that the required tools were used for every active cycle and that the verdict carries non-empty `checks`.
- - Submit verdicts via `ironbee hook submit-verdict`, never write `verdict.json` directly.
- - Every code edit (Write/Edit) automatically clears your session's verdict.
- - After 3 failed verification attempts, you may complete but must report unresolved issues.
+ ## BANNED
+ - Trying to run the verification tools yourself (`bdt_*` / `ndt_*` / `bedt_*`) or
+ `ironbee hook verification-start` / `submit-verdict`; those are the verifier's job. Delegate.
+ - Using the generic `spawn_agent` tool / a plain fork to "be" the verifier — that spawns a
+ DEFAULT agent without the devtools. Spawn the `ironbee-verifier` custom agent via `spawn_agent` with `agent_type="ironbee-verifier"` and `fork_turns="none"`.
+ - Reporting a task complete without delegating verification of your changes.
+ - Claiming verification passed based on code reading, assumptions, or prior knowledge.
+ - Backgrounding the verifier custom agent (or ending your turn before it returns its verdict) —
+ wait for it to finish in the same turn.
 
  ## Subagent teams
- - Subagents focus on implementation only; do NOT verify.
- - The main orchestrator agent verifies ALL changes after subagents complete.
- - Each session's verification is isolated via session-specific verdict files.
+ - Implementation subagents write code; they do NOT verify.
+ - Verification is ALWAYS delegated to the dedicated `ironbee-verifier` custom agent — it owns the
+ per-cycle browser/node/backend flows and the verification tools. Each session's verification
+ is isolated via session-specific verdict files.
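
The delegate, fix, re-delegate loop described in "How to verify" can be sketched as a small driver. Everything here is illustrative: `runVerifier` and `fixIssues` are hypothetical callbacks standing in for the `spawn_agent` call and the main agent's own edits, and the `maxAttempts` cap is an assumption (the delegation text only says to repeat until it passes).

```javascript
// Delegate, and on a fail verdict fix the reported issues and delegate again.
// Runs synchronously for simplicity; a real agent would await the verifier.
function verifyUntilPass(runVerifier, fixIssues, maxAttempts) {
  let verdict = null;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    verdict = runVerifier(attempt); // wait for the verdict in the same turn
    if (verdict.status !== 'fail') {
      return { verdict, attempts: attempt };
    }
    // Only fix when another verification attempt will follow.
    if (attempt < maxAttempts) fixIssues(verdict.issues ?? []);
  }
  return { verdict, attempts: maxAttempts };
}

// Simulated verifier: fails once, then passes.
const scripted = [{ status: 'fail', issues: ['button does nothing'] }, { status: 'pass' }];
const outcome = verifyUntilPass(() => scripted.shift(), () => {}, 3);
console.log(outcome.attempts, outcome.verdict.status); // prints "2 pass"
```

Keeping the verifier call blocking inside the loop mirrors the "wait for it in the same turn" rule: the next step never starts before a verdict has come back.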
@@ -1,38 +1,48 @@
- "use strict";var y=Object.defineProperty;var I=Object.getOwnPropertyDescriptor;var E=Object.getOwnPropertyNames;var P=Object.prototype.hasOwnProperty;var o=(n,t)=>y(n,"name",{value:t,configurable:!0});var W=(n,t)=>{for(var e in t)y(n,e,{get:t[e],enumerable:!0})},J=(n,t,e,s)=>{if(t&&typeof t=="object"||typeof t=="function")for(let r of E(t))!P.call(n,r)&&r!==e&&y(n,r,{get:()=>t[r],enumerable:!(s=I(t,r))||s.enumerable});return n};var L=n=>J(y({},"__esModule",{value:!0}),n);var gn={};W(gn,{AGENTS_MD_END_MARKER:()=>h,AGENTS_MD_START_MARKER:()=>w,canonicalizeCodexServerName:()=>$,canonicalizeCodexToolName:()=>v,classifyCodexTool:()=>F,codexAgentTomlPath:()=>nn,codexConfigTomlPath:()=>R,codexHooksJsonPath:()=>cn,decodeJwtPayload:()=>A,ensureFeaturesHooksTrue:()=>Z,extractBashBinary:()=>j,extractCodexMcpServer:()=>S,extractCodexToolInput:()=>D,extractTomlTopLevelModel:()=>tn,findTomlSection:()=>k,normalizeCodexToolName:()=>T,parseCodexHookStdin:()=>M,readCodexConfigToml:()=>on,removeAgentsTable:()=>N,removeMcpServer:()=>Q,resolveCodexUsage:()=>V,stripAgentsMdBlock:()=>sn,tomlBodyFromRecord:()=>en,upsertAgentsMdBlock:()=>rn,upsertAgentsTable:()=>Y,upsertMcpServer:()=>q,userCodexAgentTomlPath:()=>an,userCodexConfigTomlPath:()=>dn,userCodexHooksJsonPath:()=>ln,writeCodexConfigToml:()=>un});module.exports=L(gn);var m=require("fs"),x=require("os"),p=require("path"),b=require("../../lib/logger");function M(n){try{return JSON.parse(n)}catch(t){return b.logger.debug(`failed to parse Codex hook stdin: ${t}`),{}}}o(M,"parseCodexHookStdin");const _="mcp__",B={browser_devtools:"browser-devtools",node_devtools:"node-devtools",backend_devtools:"backend-devtools",android_devtools:"android-devtools"},z=["bdt_","ndt_","bedt_","adt_"];function $(n){return B[n]??n}o($,"canonicalizeCodexServerName");function v(n){if(!z.some(e=>n.startsWith(e)))return n;const t=n.split("_");return t.length<=3?n:`${t[0]}_${t[1]}_${t.slice(2).join("-")}`}o(v,"canonicalizeCodexToolName");const 
H=[["bdt_","browser-devtools"],["ndt_","node-devtools"],["bedt_","backend-devtools"],["adt_","android-devtools"]];function S(n){if(!n)return null;if(n.startsWith(_)){const t=n.slice(_.length),e=t.indexOf("__");return e<0?null:$(t.slice(0,e))}for(const[t,e]of H)if(n.startsWith(t))return e;return null}o(S,"extractCodexMcpServer");function T(n){return n==="exec_command"?"Bash":n==="apply_patch"?"Edit":n==="update_plan"?"TodoWrite":n==="read_file"?"Read":n==="web_search"?"WebSearch":n==="web_fetch"?"WebFetch":n}o(T,"normalizeCodexToolName");function F(n){if(!n)return{tool_type:null,tool_name:"",mcp_server:null};if(n.startsWith(_)){const s=n.slice(_.length),r=s.indexOf("__");if(r>=0){const i=s.slice(0,r),c=$(i),u=s.slice(r+2);return{tool_type:"mcp",tool_name:v(u),mcp_server:c}}}const t=S(n);if(t!==null&&!n.startsWith(_))return{tool_type:"mcp",tool_name:v(n),mcp_server:t};const e=T(n);return n==="spawn_agent"||n==="wait_agent"||n==="close_agent"?{tool_type:"sub_agent",tool_name:e,mcp_server:null}:{tool_type:null,tool_name:e,mcp_server:null}}o(F,"classifyCodexTool");function D(n,t){if(!n||t===void 0)return;if(n==="apply_patch"){if(typeof t=="string")return{input_size:t.length};if(typeof t=="object"&&t!==null){const r=t,i=r.command??r.input;if(typeof i=="string")return{input_size:i.length}}return{input_size:void 0}}if(typeof t!="object"||t===null)return;const e=t;if(T(n)==="Bash"){const r=e.cmd??e.command,i=typeof r=="string"?j(r):void 0;return{workdir:e.workdir,binary:i}}if(n==="update_plan"){const r=e.explanation,i=e.plan;return{explanation:typeof r=="string"?r:void 0,plan_step_count:Array.isArray(i)?i.length:void 0}}if(n==="spawn_agent"){const r=e.agent_type,i=e.message,c=e.fork_context;return{agent_type:typeof r=="string"?r:void 0,message_size:typeof i=="string"?i.length:void 0,fork_context:typeof c=="boolean"?c:void 0}}if(n==="wait_agent"){const r=e.targets,i=e.timeout_ms;return{target_count:Array.isArray(r)?r.length:void 0,timeout_ms:typeof i=="number"?i:void 
0}}if(n==="close_agent"){const r=e.target;return{target:typeof r=="string"?r:void 0}}if(n==="view_image"){const r=e.path,i=e.detail;return{path:typeof r=="string"?r:void 0,detail:typeof i=="string"?i:void 0}}if(n==="write_stdin"){const r=e.session_id,i=e.chars,c=e.yield_time_ms,u=e.max_output_tokens;return{session_id:typeof r=="number"?r:void 0,chars_size:typeof i=="string"?i.length:void 0,yield_time_ms:typeof c=="number"?c:void 0,max_output_tokens:typeof u=="number"?u:void 0}}if(n.startsWith(_)||S(n)!==null){if("_metadata"in e){const{_metadata:r,...i}=e;return i}return e}}o(D,"extractCodexToolInput");function j(n){const t=n.trim();if(!t)return;const e=t.split(/\s+/);for(const s of e)if(!/^[A-Za-z_][A-Za-z0-9_]*=/.test(s)&&s.length>0)return s.split(/[\\/]/).pop()??s}o(j,"extractBashBinary");function A(n){const t=n.split(".");if(t.length!==3)return null;try{const e=Buffer.from(t[1],"base64url").toString("utf-8"),s=JSON.parse(e);return typeof s!="object"||s===null?null:s}catch{return null}}o(A,"decodeJwtPayload");function K(n){if(typeof n=="string"){const t=A(n);return t?{email:t.email,planType:t["https://api.openai.com/auth"]?.chatgpt_plan_type}:{}}if(typeof n=="object"&&n!==null){const t=n;return{email:t.email,planType:t.chatgpt_plan_type}}return{}}o(K,"extractIdTokenFields");function V(n){const t=n??(0,p.join)((0,x.homedir)(),".codex","auth.json");if(!(0,m.existsSync)(t))return{};try{const e=JSON.parse((0,m.readFileSync)(t,"utf-8")),s=e.auth_mode==="chatgpt"||e.auth_mode==="swic"?"subscription":e.auth_mode==="api"?"api":void 0,{email:r,planType:i}=K(e.tokens?.id_token);return{usageType:s,usagePlan:i?.toLowerCase(),userEmail:r}}catch(e){return b.logger.debug(`failed to parse ${t}: ${e}`),{}}}o(V,"resolveCodexUsage");function U(n,t){return n.trim()===`[${t}]`}o(U,"tableHeaderLineExact");function X(n){const t=n.trim();return/^\[\[?[^\]]+\]\]?$/.test(t)}o(X,"isAnyTableHeader");function O(n){const e=n.trim().match(/^\[([^[\]]+)\]$/);return 
e===null?null:e[1]}o(O,"tableHeaderName");function k(n,t){let e=-1;for(let r=0;r<n.length;r+=1)if(U(n[r],t)){e=r;break}if(e<0)return null;let s=n.length;for(let r=e+1;r<n.length;r+=1)if(X(n[r])){s=r;break}return{startIdx:e,endIdx:s}}o(k,"findTomlSection");function G(n){const t=[...n];for(;t.length>0&&t[t.length-1].trim()==="";)t.pop();return t}o(G,"trimTrailingBlanks");function C(n,t){return n.length===0?t.join(`
+ "use strict";var k=Object.defineProperty;var E=Object.getOwnPropertyDescriptor;var L=Object.getOwnPropertyNames;var W=Object.prototype.hasOwnProperty;var o=(n,t)=>k(n,"name",{value:t,configurable:!0});var M=(n,t)=>{for(var e in t)k(n,e,{get:t[e],enumerable:!0})},P=(n,t,e,s)=>{if(t&&typeof t=="object"||typeof t=="function")for(let r of L(t))!W.call(n,r)&&r!==e&&k(n,r,{get:()=>t[r],enumerable:!(s=E(t,r))||s.enumerable});return n};var J=n=>P(k({},"__esModule",{value:!0}),n);var pn={};M(pn,{AGENTS_MD_END_MARKER:()=>x,AGENTS_MD_START_MARKER:()=>v,canonicalizeCodexServerName:()=>C,canonicalizeCodexToolName:()=>$,classifyCodexTool:()=>V,codexAgentTomlPath:()=>en,codexConfigTomlPath:()=>T,codexHooksJsonPath:()=>ln,decodeJwtPayload:()=>A,ensureFeaturesHooksTrue:()=>Z,ensureMultiAgentV2SpawnMetadataExposed:()=>q,extractBashBinary:()=>j,extractCodexMcpServer:()=>I,extractCodexToolInput:()=>D,extractTomlTopLevelModel:()=>rn,findTomlSection:()=>h,normalizeCodexToolName:()=>S,parseCodexHookStdin:()=>B,readCodexConfigToml:()=>an,removeAgentsTable:()=>tn,removeMcpServer:()=>N,removeMultiAgentV2SpawnMetadata:()=>Q,resolveCodexUsage:()=>U,stripAgentsMdBlock:()=>un,tomlBodyFromRecord:()=>sn,upsertAgentsMdBlock:()=>on,upsertAgentsTable:()=>nn,upsertMcpServer:()=>Y,userCodexAgentTomlPath:()=>fn,userCodexConfigTomlPath:()=>cn,userCodexHooksJsonPath:()=>gn,writeCodexConfigToml:()=>dn});module.exports=J(pn);var m=require("fs"),b=require("os"),p=require("path"),y=require("../../lib/logger");function B(n){try{return JSON.parse(n)}catch(t){return y.logger.debug(`failed to parse Codex hook stdin: ${t}`),{}}}o(B,"parseCodexHookStdin");const _="mcp__",z={browser_devtools:"browser-devtools",node_devtools:"node-devtools",backend_devtools:"backend-devtools",android_devtools:"android-devtools"},H=["bdt_","ndt_","bedt_","adt_"];function C(n){return z[n]??n}o(C,"canonicalizeCodexServerName");function $(n){if(!H.some(e=>n.startsWith(e)))return n;const t=n.split("_");return 
t.length<=3?n:`${t[0]}_${t[1]}_${t.slice(2).join("-")}`}o($,"canonicalizeCodexToolName");const F=[["bdt_","browser-devtools"],["ndt_","node-devtools"],["bedt_","backend-devtools"],["adt_","android-devtools"]];function I(n){if(!n)return null;if(n.startsWith(_)){const t=n.slice(_.length),e=t.indexOf("__");return e<0?null:C(t.slice(0,e))}for(const[t,e]of F)if(n.startsWith(t))return e;return null}o(I,"extractCodexMcpServer");function S(n){return n==="exec_command"?"Bash":n==="apply_patch"?"Edit":n==="update_plan"?"TodoWrite":n==="read_file"?"Read":n==="web_search"?"WebSearch":n==="web_fetch"?"WebFetch":n}o(S,"normalizeCodexToolName");function V(n){if(!n)return{tool_type:null,tool_name:"",mcp_server:null};if(n.startsWith(_)){const s=n.slice(_.length),r=s.indexOf("__");if(r>=0){const i=s.slice(0,r),u=C(i),a=s.slice(r+2);return{tool_type:"mcp",tool_name:$(a),mcp_server:u}}}const t=I(n);if(t!==null&&!n.startsWith(_))return{tool_type:"mcp",tool_name:$(n),mcp_server:t};const e=S(n);return n==="spawn_agent"||n==="wait_agent"||n==="close_agent"?{tool_type:"sub_agent",tool_name:e,mcp_server:null}:{tool_type:null,tool_name:e,mcp_server:null}}o(V,"classifyCodexTool");function D(n,t){if(!n||t===void 0)return;if(n==="apply_patch"){if(typeof t=="string")return{input_size:t.length};if(typeof t=="object"&&t!==null){const r=t,i=r.command??r.input;if(typeof i=="string")return{input_size:i.length}}return{input_size:void 0}}if(typeof t!="object"||t===null)return;const e=t;if(S(n)==="Bash"){const r=e.cmd??e.command,i=typeof r=="string"?j(r):void 0;return{workdir:e.workdir,binary:i}}if(n==="update_plan"){const r=e.explanation,i=e.plan;return{explanation:typeof r=="string"?r:void 0,plan_step_count:Array.isArray(i)?i.length:void 0}}if(n==="spawn_agent"){const r=e.agent_type,i=e.message,u=e.fork_context;return{agent_type:typeof r=="string"?r:void 0,message_size:typeof i=="string"?i.length:void 0,fork_context:typeof u=="boolean"?u:void 0}}if(n==="wait_agent"){const 
r=e.targets,i=e.timeout_ms;return{target_count:Array.isArray(r)?r.length:void 0,timeout_ms:typeof i=="number"?i:void 0}}if(n==="close_agent"){const r=e.target;return{target:typeof r=="string"?r:void 0}}if(n==="view_image"){const r=e.path,i=e.detail;return{path:typeof r=="string"?r:void 0,detail:typeof i=="string"?i:void 0}}if(n==="write_stdin"){const r=e.session_id,i=e.chars,u=e.yield_time_ms,a=e.max_output_tokens;return{session_id:typeof r=="number"?r:void 0,chars_size:typeof i=="string"?i.length:void 0,yield_time_ms:typeof u=="number"?u:void 0,max_output_tokens:typeof a=="number"?a:void 0}}if(n.startsWith(_)||I(n)!==null){if("_metadata"in e){const{_metadata:r,...i}=e;return i}return e}}o(D,"extractCodexToolInput");function j(n){const t=n.trim();if(!t)return;const e=t.split(/\s+/);for(const s of e)if(!/^[A-Za-z_][A-Za-z0-9_]*=/.test(s)&&s.length>0)return s.split(/[\\/]/).pop()??s}o(j,"extractBashBinary");function A(n){const t=n.split(".");if(t.length!==3)return null;try{const e=Buffer.from(t[1],"base64url").toString("utf-8"),s=JSON.parse(e);return typeof s!="object"||s===null?null:s}catch{return null}}o(A,"decodeJwtPayload");function K(n){if(typeof n=="string"){const t=A(n);return t?{email:t.email,planType:t["https://api.openai.com/auth"]?.chatgpt_plan_type}:{}}if(typeof n=="object"&&n!==null){const t=n;return{email:t.email,planType:t.chatgpt_plan_type}}return{}}o(K,"extractIdTokenFields");function U(n){const t=n??(0,p.join)((0,b.homedir)(),".codex","auth.json");if(!(0,m.existsSync)(t))return{};try{const e=JSON.parse((0,m.readFileSync)(t,"utf-8")),s=e.auth_mode==="chatgpt"||e.auth_mode==="swic"?"subscription":e.auth_mode==="api"?"api":void 0,{email:r,planType:i}=K(e.tokens?.id_token);return{usageType:s,usagePlan:i?.toLowerCase(),userEmail:r}}catch(e){return y.logger.debug(`failed to parse ${t}: ${e}`),{}}}o(U,"resolveCodexUsage");function X(n,t){return n.trim()===`[${t}]`}o(X,"tableHeaderLineExact");function G(n){const 
t=n.trim();return/^\[\[?[^\]]+\]\]?$/.test(t)}o(G,"isAnyTableHeader");function R(n){const e=n.trim().match(/^\[([^[\]]+)\]$/);return e===null?null:e[1]}o(R,"tableHeaderName");function h(n,t){let e=-1;for(let r=0;r<n.length;r+=1)if(X(n[r],t)){e=r;break}if(e<0)return null;let s=n.length;for(let r=e+1;r<n.length;r+=1)if(G(n[r])){s=r;break}return{startIdx:e,endIdx:s}}o(h,"findTomlSection");function O(n){const t=[...n];for(;t.length>0&&t[t.length-1].trim()==="";)t.pop();return t}o(O,"trimTrailingBlanks");function w(n,t){return n.length===0?t.join(`
  `)+`
  `:n.replace(/\n+$/,"")+`
 
  `+t.join(`
  `)+`
- `}o(C,"appendBlockWithSeparator");function Z(n){const t=n.split(`
- `),e=k(t,"features");if(e===null)return C(n,["[features]","hooks = true"]);const s=t.slice(e.startIdx+1,e.endIdx),r=/^\s*hooks\s*=/;let i=!1;for(let l=0;l<s.length;l+=1)if(r.test(s[l])){s[l]="hooks = true",i=!0;break}i||s.unshift("hooks = true");const c=G(s),a=[...t.slice(0,e.startIdx),t[e.startIdx],...c,...e.endIdx<t.length?[""]:[],...t.slice(e.endIdx)].join(`
- `);return a.endsWith(`
- `)?a:a+`
- `}o(Z,"ensureFeaturesHooksTrue");function q(n,t,e){const s=`mcp_servers.${t}`,r=n.split(`
- `),i=k(r,s),u=[`[${s}]`,...e];if(i===null)return C(n,u);const a=r.slice(0,i.startIdx),l=r.slice(i.endIdx),d=[...a,...u,...l.length>0?[""]:[],...l].join(`
- `);return d.endsWith(`
- `)?d:d+`
- `}o(q,"upsertMcpServer");function Q(n,t){const e=`mcp_servers.${t}`,s=`${e}.`,r=n.split(`
- `),i=[];let c=!1,u=!1;for(const d of r){const g=O(d);if(g!==null&&(c=g===e||g.startsWith(s),c)){u=!0;continue}c||i.push(d)}if(!u)return n;const a=[];let l=!1;for(const d of i){const g=d.trim().length===0;g&&l||(a.push(d),l=g)}const f=a.join(`
+ `}o(w,"appendBlockWithSeparator");function Z(n){const t=n.split(`
+ `),e=h(t,"features");if(e===null)return w(n,["[features]","hooks = true"]);const s=t.slice(e.startIdx+1,e.endIdx),r=/^\s*hooks\s*=/;let i=!1;for(let d=0;d<s.length;d+=1)if(r.test(s[d])){s[d]="hooks = true",i=!0;break}i||s.unshift("hooks = true");const u=O(s),c=[...t.slice(0,e.startIdx),t[e.startIdx],...u,...e.endIdx<t.length?[""]:[],...t.slice(e.endIdx)].join(`
+ `);return c.endsWith(`
+ `)?c:c+`
+ `}o(Z,"ensureFeaturesHooksTrue");function q(n){const t=n.split(`
+ `),e=h(t,"features.multi_agent_v2");if(e===null)return w(n,["[features.multi_agent_v2]","hide_spawn_agent_metadata = false"]);const s=t.slice(e.startIdx+1,e.endIdx),r=/^\s*hide_spawn_agent_metadata\s*=/;let i=!1;for(let d=0;d<s.length;d+=1)if(r.test(s[d])){s[d]="hide_spawn_agent_metadata = false",i=!0;break}i||s.unshift("hide_spawn_agent_metadata = false");const u=O(s),c=[...t.slice(0,e.startIdx),t[e.startIdx],...u,...e.endIdx<t.length?[""]:[],...t.slice(e.endIdx)].join(`
+ `);return c.endsWith(`
+ `)?c:c+`
+ `}o(q,"ensureMultiAgentV2SpawnMetadataExposed");function Q(n){const t=n.split(`
+ `),e=h(t,"features.multi_agent_v2");if(e===null)return n;const s=t.slice(e.startIdx+1,e.endIdx).filter(a=>a.trim().length>0);if(!(s.length===1&&/^\s*hide_spawn_agent_metadata\s*=\s*false\s*$/.test(s[0])))return n;const u=[...t.slice(0,e.startIdx),...t.slice(e.endIdx)].join(`
+ `).replace(/\n{3,}/g,`
+
+ `);return u.endsWith(`
+ `)?u:u+`
+ `}o(Q,"removeMultiAgentV2SpawnMetadata");function Y(n,t,e){const s=`mcp_servers.${t}`,r=n.split(`
+ `),i=h(r,s),a=[`[${s}]`,...e];if(i===null)return w(n,a);const c=r.slice(0,i.startIdx),d=r.slice(i.endIdx),l=[...c,...a,...d.length>0?[""]:[],...d].join(`
+ `);return l.endsWith(`
+ `)?l:l+`
+ `}o(Y,"upsertMcpServer");function N(n,t){const e=`mcp_servers.${t}`,s=`${e}.`,r=n.split(`
+ `),i=[];let u=!1,a=!1;for(const l of r){const g=R(l);if(g!==null&&(u=g===e||g.startsWith(s),u)){a=!0;continue}u||i.push(l)}if(!a)return n;const c=[];let d=!1;for(const l of i){const g=l.trim().length===0;g&&d||(c.push(l),d=g)}const f=c.join(`
  `);return f.endsWith(`
  `)||f.length===0?f:f+`
- `}o(Q,"removeMcpServer");function Y(n,t,e){const s=`agents.${t}`,r=n.split(`
- `),i=k(r,s),u=[`[${s}]`,...e];if(i===null)return C(n,u);const a=r.slice(0,i.startIdx),l=r.slice(i.endIdx),d=[...a,...u,...l.length>0?[""]:[],...l].join(`
- `);return d.endsWith(`
- `)?d:d+`
- `}o(Y,"upsertAgentsTable");function N(n,t){const e=`agents.${t}`,s=`${e}.`,r=n.split(`
- `),i=[];let c=!1,u=!1;for(const d of r){const g=O(d);if(g!==null&&(c=g===e||g.startsWith(s),c)){u=!0;continue}c||i.push(d)}if(!u)return n;const a=[];let l=!1;for(const d of i){const g=d.trim().length===0;g&&l||(a.push(d),l=g)}const f=a.join(`
+ `}o(N,"removeMcpServer");function nn(n,t,e){const s=`agents.${t}`,r=n.split(`
+ `),i=h(r,s),a=[`[${s}]`,...e];if(i===null)return w(n,a);const c=r.slice(0,i.startIdx),d=r.slice(i.endIdx),l=[...c,...a,...d.length>0?[""]:[],...d].join(`
+ `);return l.endsWith(`
+ `)?l:l+`
+ `}o(nn,"upsertAgentsTable");function tn(n,t){const e=`agents.${t}`,s=`${e}.`,r=n.split(`
+ `),i=[];let u=!1,a=!1;for(const l of r){const g=R(l);if(g!==null&&(u=g===e||g.startsWith(s),u)){a=!0;continue}u||i.push(l)}if(!a)return n;const c=[];let d=!1;for(const l of i){const g=l.trim().length===0;g&&d||(c.push(l),d=g)}const f=c.join(`
  `);return f.endsWith(`
  `)||f.length===0?f:f+`
- `}o(N,"removeAgentsTable");function nn(n,t){return(0,p.join)(n,".codex","agents",`${t}.toml`)}o(nn,"codexAgentTomlPath");function tn(n){for(const t of n.split(`
- `)){const e=t.trim();if(e.startsWith("["))break;const s=e.match(/^model\s*=\s*"([^"]*)"/);if(s&&s[1].length>0)return s[1]}return null}o(tn,"extractTomlTopLevelModel");function en(n){const t=[];for(const[e,s]of Object.entries(n))if(s!=null){if(typeof s=="string")t.push(`${e} = ${JSON.stringify(s)}`);else if(typeof s=="number"||typeof s=="boolean")t.push(`${e} = ${s}`);else if(Array.isArray(s)){const r=s.map(i=>typeof i=="string"?JSON.stringify(i):typeof i=="number"||typeof i=="boolean"?String(i):JSON.stringify(i));t.push(`${e} = [${r.join(", ")}]`)}else if(typeof s=="object"){const r=s,i=[];for(const[c,u]of Object.entries(r))u!=null&&(typeof u=="string"?i.push(`${c} = ${JSON.stringify(u)}`):typeof u=="number"||typeof u=="boolean"?i.push(`${c} = ${u}`):i.push(`${c} = ${JSON.stringify(u)}`));t.push(`${e} = { ${i.join(", ")} }`)}}return t}o(en,"tomlBodyFromRecord");const w="<!-- ironbee:start -->",h="<!-- ironbee:end -->";function rn(n,t){const e=`${w}
+ `}o(tn,"removeAgentsTable");function en(n,t){return(0,p.join)(n,".codex","agents",`${t}.toml`)}o(en,"codexAgentTomlPath");function rn(n){for(const t of n.split(`
+ `)){const e=t.trim();if(e.startsWith("["))break;const s=e.match(/^model\s*=\s*"([^"]*)"/);if(s&&s[1].length>0)return s[1]}return null}o(rn,"extractTomlTopLevelModel");function sn(n){const t=[];for(const[e,s]of Object.entries(n))if(s!=null){if(typeof s=="string")t.push(`${e} = ${JSON.stringify(s)}`);else if(typeof s=="number"||typeof s=="boolean")t.push(`${e} = ${s}`);else if(Array.isArray(s)){const r=s.map(i=>typeof i=="string"?JSON.stringify(i):typeof i=="number"||typeof i=="boolean"?String(i):JSON.stringify(i));t.push(`${e} = [${r.join(", ")}]`)}else if(typeof s=="object"){const r=s,i=[];for(const[u,a]of Object.entries(r))a!=null&&(typeof a=="string"?i.push(`${u} = ${JSON.stringify(a)}`):typeof a=="number"||typeof a=="boolean"?i.push(`${u} = ${a}`):i.push(`${u} = ${JSON.stringify(a)}`));t.push(`${e} = { ${i.join(", ")} }`)}}return t}o(sn,"tomlBodyFromRecord");const v="<!-- ironbee:start -->",x="<!-- ironbee:end -->";function on(n,t){const e=`${v}
  ${t.trimEnd()}
- ${h}`,s=n.indexOf(w),r=n.indexOf(h);if(s>=0&&r>s){const i=n.slice(0,s),c=n.slice(r+h.length);return i+e+c}return n.trim().length===0?e+`
+ ${x}`,s=n.indexOf(v),r=n.indexOf(x);if(s>=0&&r>s){const i=n.slice(0,s),u=n.slice(r+x.length);return i+e+u}return n.trim().length===0?e+`
  `:n.trimEnd()+`
 
  `+e+`
- `}o(rn,"upsertAgentsMdBlock");function sn(n){const t=n.indexOf(w),e=n.indexOf(h);if(t<0||e<t)return n.trim().length===0?null:n;const s=n.slice(0,t).trimEnd(),r=n.slice(e+h.length).trimStart(),i=s+(s.length>0&&r.length>0?`
+ `}o(on,"upsertAgentsMdBlock");function un(n){const t=n.indexOf(v),e=n.indexOf(x);if(t<0||e<t)return n.trim().length===0?null:n;const s=n.slice(0,t).trimEnd(),r=n.slice(e+x.length).trimStart(),i=s+(s.length>0&&r.length>0?`
 
  `:"")+r;return i.trim().length===0?null:i.endsWith(`
  `)?i:i+`
- `}o(sn,"stripAgentsMdBlock");function on(n){const t=R(n);if(!(0,m.existsSync)(t))return"";try{return(0,m.readFileSync)(t,"utf-8")}catch(e){return b.logger.debug(`failed to read ${t}: ${e}`),""}}o(on,"readCodexConfigToml");function un(n,t){const e=R(n);try{(0,m.writeFileSync)(e,t)}catch(s){b.logger.debug(`failed to write ${e}: ${s}`)}}o(un,"writeCodexConfigToml");function R(n){return(0,p.join)(n,".codex","config.toml")}o(R,"codexConfigTomlPath");function cn(n){return(0,p.join)(n,".codex","hooks.json")}o(cn,"codexHooksJsonPath");function dn(){return(0,p.join)((0,x.homedir)(),".codex","config.toml")}o(dn,"userCodexConfigTomlPath");function ln(){return(0,p.join)((0,x.homedir)(),".codex","hooks.json")}o(ln,"userCodexHooksJsonPath");function an(n){return(0,p.join)((0,x.homedir)(),".codex","agents",`${n}.toml`)}o(an,"userCodexAgentTomlPath");0&&(module.exports={AGENTS_MD_END_MARKER,AGENTS_MD_START_MARKER,canonicalizeCodexServerName,canonicalizeCodexToolName,classifyCodexTool,codexAgentTomlPath,codexConfigTomlPath,codexHooksJsonPath,decodeJwtPayload,ensureFeaturesHooksTrue,extractBashBinary,extractCodexMcpServer,extractCodexToolInput,extractTomlTopLevelModel,findTomlSection,normalizeCodexToolName,parseCodexHookStdin,readCodexConfigToml,removeAgentsTable,removeMcpServer,resolveCodexUsage,stripAgentsMdBlock,tomlBodyFromRecord,upsertAgentsMdBlock,upsertAgentsTable,upsertMcpServer,userCodexAgentTomlPath,userCodexConfigTomlPath,userCodexHooksJsonPath,writeCodexConfigToml});
+ `}o(un,"stripAgentsMdBlock");function an(n){const t=T(n);if(!(0,m.existsSync)(t))return"";try{return(0,m.readFileSync)(t,"utf-8")}catch(e){return y.logger.debug(`failed to read ${t}: ${e}`),""}}o(an,"readCodexConfigToml");function dn(n,t){const e=T(n);try{(0,m.writeFileSync)(e,t)}catch(s){y.logger.debug(`failed to write ${e}: ${s}`)}}o(dn,"writeCodexConfigToml");function T(n){return(0,p.join)(n,".codex","config.toml")}o(T,"codexConfigTomlPath");function ln(n){return(0,p.join)(n,".codex","hooks.json")}o(ln,"codexHooksJsonPath");function cn(){return(0,p.join)((0,b.homedir)(),".codex","config.toml")}o(cn,"userCodexConfigTomlPath");function gn(){return(0,p.join)((0,b.homedir)(),".codex","hooks.json")}o(gn,"userCodexHooksJsonPath");function fn(n){return(0,p.join)((0,b.homedir)(),".codex","agents",`${n}.toml`)}o(fn,"userCodexAgentTomlPath");0&&(module.exports={AGENTS_MD_END_MARKER,AGENTS_MD_START_MARKER,canonicalizeCodexServerName,canonicalizeCodexToolName,classifyCodexTool,codexAgentTomlPath,codexConfigTomlPath,codexHooksJsonPath,decodeJwtPayload,ensureFeaturesHooksTrue,ensureMultiAgentV2SpawnMetadataExposed,extractBashBinary,extractCodexMcpServer,extractCodexToolInput,extractTomlTopLevelModel,findTomlSection,normalizeCodexToolName,parseCodexHookStdin,readCodexConfigToml,removeAgentsTable,removeMcpServer,removeMultiAgentV2SpawnMetadata,resolveCodexUsage,stripAgentsMdBlock,tomlBodyFromRecord,upsertAgentsMdBlock,upsertAgentsTable,upsertMcpServer,userCodexAgentTomlPath,userCodexConfigTomlPath,userCodexHooksJsonPath,writeCodexConfigToml});
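For readers who would rather not trace the minified output: the two exports this hunk adds to `util.js`, `ensureMultiAgentV2SpawnMetadataExposed` and `removeMultiAgentV2SpawnMetadata`, appear to behave as in the de-minified sketch below. It is a reading of the diff, not the package's source. Function names come from the `o(fn, "name")` labels in the bundle; the `SECTION` and `KEY_LINE` constants and all local variable names are my own.

```javascript
// De-minified sketch of the TOML helpers added in util.js.
// SECTION, KEY_LINE and the local names are illustrative, not from the package.
const SECTION = "features.multi_agent_v2";
const KEY_LINE = "hide_spawn_agent_metadata = false";

// Find `[name]` and the index of the next table header (or end of file).
function findTomlSection(lines, name) {
  const startIdx = lines.findIndex((line) => line.trim() === `[${name}]`);
  if (startIdx < 0) return null;
  let endIdx = lines.length;
  for (let i = startIdx + 1; i < lines.length; i += 1) {
    if (/^\[\[?[^\]]+\]\]?$/.test(lines[i].trim())) {
      endIdx = i;
      break;
    }
  }
  return { startIdx, endIdx };
}

// Append a block, separated from existing content by one blank line.
function appendBlockWithSeparator(content, block) {
  if (content.length === 0) return block.join("\n") + "\n";
  return content.replace(/\n+$/, "") + "\n\n" + block.join("\n") + "\n";
}

function trimTrailingBlanks(lines) {
  const copy = [...lines];
  while (copy.length > 0 && copy[copy.length - 1].trim() === "") copy.pop();
  return copy;
}

// Create the section if missing; otherwise force the key to `false`.
function ensureMultiAgentV2SpawnMetadataExposed(content) {
  const lines = content.split("\n");
  const section = findTomlSection(lines, SECTION);
  if (section === null) {
    return appendBlockWithSeparator(content, [`[${SECTION}]`, KEY_LINE]);
  }
  const body = lines.slice(section.startIdx + 1, section.endIdx);
  const idx = body.findIndex((l) => /^\s*hide_spawn_agent_metadata\s*=/.test(l));
  if (idx >= 0) body[idx] = KEY_LINE;
  else body.unshift(KEY_LINE);
  const out = [
    ...lines.slice(0, section.startIdx),
    lines[section.startIdx],
    ...trimTrailingBlanks(body),
    ...(section.endIdx < lines.length ? [""] : []),
    ...lines.slice(section.endIdx),
  ].join("\n");
  return out.endsWith("\n") ? out : out + "\n";
}

// Remove the section only when its sole non-blank line is the key set to false,
// so any other keys a user put in the same table are left alone.
function removeMultiAgentV2SpawnMetadata(content) {
  const lines = content.split("\n");
  const section = findTomlSection(lines, SECTION);
  if (section === null) return content;
  const body = lines
    .slice(section.startIdx + 1, section.endIdx)
    .filter((l) => l.trim().length > 0);
  const onlyOurKey =
    body.length === 1 && /^\s*hide_spawn_agent_metadata\s*=\s*false\s*$/.test(body[0]);
  if (!onlyOurKey) return content;
  const out = [...lines.slice(0, section.startIdx), ...lines.slice(section.endIdx)]
    .join("\n")
    .replace(/\n{3,}/g, "\n\n");
  return out.endsWith("\n") ? out : out + "\n";
}
```

The asymmetry looks deliberate: the ensure path overwrites whatever value the key had, while the remove path is a no-op unless the table contains exactly the line the tool would have written.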
@@ -0,0 +1,2 @@
+ "use strict";var d=Object.defineProperty;var x=Object.getOwnPropertyDescriptor;var y=Object.getOwnPropertyNames;var $=Object.prototype.hasOwnProperty;var g=(e,o)=>d(e,"name",{value:o,configurable:!0});var h=(e,o)=>{for(var l in o)d(e,l,{get:o[l],enumerable:!0})},w=(e,o,l,a)=>{if(o&&typeof o=="object"||typeof o=="function")for(let r of y(o))!$.call(e,r)&&r!==l&&d(e,r,{get:()=>o[r],enumerable:!(a=x(o,r))||a.enumerable});return e};var j=e=>w(d({},"__esModule",{value:!0}),e);var I={};h(I,{verifierCommand:()=>D});module.exports=j(I);var m=require("commander"),i=require("fs"),p=require("path"),s=require("../../lib/config"),b=require("../../lib/logger"),n=require("../../lib/output"),v=require("./index");function M(e){if(e.global===!0&&e.local===!0)throw new Error("Pass at most one of -g / --global, --local.");const o=e.projectDir??process.cwd();return e.global===!0?{target:"global",projectDir:o}:e.local===!0?{target:"local",projectDir:o}:{target:"project",projectDir:o}}g(M,"resolveTarget");function k(e){if(!(0,i.existsSync)(e))return{};try{return JSON.parse((0,i.readFileSync)(e,"utf-8"))}catch(o){throw new Error(`Config at ${e} is not valid JSON: ${o instanceof Error?o.message:o}`)}}g(k,"readConfigFile");function S(e,o){const l=(0,i.existsSync)(e)?(0,i.readFileSync)(e,"utf-8"):null,a=k(e),r={...a.codex??{}};return r.verifier={...r.verifier??{},mode:o},(0,i.mkdirSync)((0,p.join)(e,".."),{recursive:!0}),(0,i.writeFileSync)(e,JSON.stringify({...a,codex:r},null,2)+`
+ `),l}g(S,"writeVerifierMode");const E=new m.Command("mode").description(`Set the Codex verifier delivery mode: "sub-agent" (default \u2014 delegate to the ironbee-verifier custom agent) or "main-agent" (the main agent drives the devtools tools directly; the fallback when Codex's sub-agent machinery breaks).`).argument("<mode>",'"sub-agent" or "main-agent"').option("-p, --project-dir <dir>","Project directory (default: cwd).").option("-g, --global","Write to the global config (~/.ironbee/config.json).").option("--local","Write to the gitignored project-local override (<project>/.ironbee/config.local.json).").action((e,o)=>{e!=="sub-agent"&&e!=="main-agent"&&(console.error(`${n.pc.red("\u2717")} Invalid mode "${e}". Use ${n.pc.bold("sub-agent")} or ${n.pc.bold("main-agent")}.`),process.exit(1));const l=e;let a,r;try{({target:a,projectDir:r}=M(o))}catch(t){console.error(`${n.pc.red("\u2717")} ${t instanceof Error?t.message:t}`),process.exit(1);return}const c=(0,s.getTargetConfigPath)(a,r);let f=null;try{f=S(c,l)}catch(t){console.error(`${n.pc.red("\u2717")} ${t instanceof Error?t.message:t}`),process.exit(1);return}const u=new v.CodexClient;if(u.detect(r))try{u.install(r,(0,s.loadConfig)(r))}catch(t){if(f===null)try{(0,i.writeFileSync)(c,"")}catch{}else try{(0,i.writeFileSync)(c,f)}catch{}b.logger.debug(`verifier mode: rerender failed, rolled back ${c}: ${t}`),console.error(`${n.pc.red("\u2717")} Failed to re-render Codex artifacts: ${t instanceof Error?t.message:t}`),process.exit(1);return}else console.log(` ${n.pc.yellow("\u26A0")} Codex is not installed in this project \u2014 setting saved, run ${n.pc.bold("ironbee install --client codex")} to apply.`);const C=(0,s.getCodexVerifierMode)((0,s.loadConfig)(r));console.log(`${n.pc.green("\u2713")} Codex verifier mode set to ${n.pc.bold(l)} in ${a} config (${n.pc.dim(c)}).`),C!==l&&console.log(` ${n.pc.yellow("\u26A0")} Effective mode is ${n.pc.bold(C)} \u2014 a higher-priority config layer overrides this 
write.`),console.log(` ${n.pc.dim("\u21BB")} Restart any open Codex sessions to pick up the new config.`)}),D=new m.Command("verifier").description("Codex verifier settings. Subcommand: `mode <sub-agent|main-agent>` \u2014 how the verification cycle is driven (sub-agent delegation vs the main agent driving the tools directly).").addCommand(E);0&&(module.exports={verifierCommand});
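The new `verifier.js` resolves which config file to write, merges `codex.verifier.mode` into it, and keeps the previous file contents so the write can be rolled back if re-rendering the Codex artifacts fails. A readable sketch of the first two parts follows. It is reconstructed from the minified hunk, so the local names are mine, and the temp-directory path in the usage lines is purely illustrative (the real path comes from `getTargetConfigPath` in `lib/config`, which this diff does not show).

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// --global and --local are mutually exclusive; the default target is "project".
function resolveTarget(opts) {
  if (opts.global === true && opts.local === true) {
    throw new Error("Pass at most one of -g / --global, --local.");
  }
  const projectDir = opts.projectDir ?? process.cwd();
  if (opts.global === true) return { target: "global", projectDir };
  if (opts.local === true) return { target: "local", projectDir };
  return { target: "project", projectDir };
}

// A missing file reads as an empty config; invalid JSON is reported with the path.
function readConfigFile(configPath) {
  if (!fs.existsSync(configPath)) return {};
  try {
    return JSON.parse(fs.readFileSync(configPath, "utf-8"));
  } catch (err) {
    const detail = err instanceof Error ? err.message : err;
    throw new Error(`Config at ${configPath} is not valid JSON: ${detail}`);
  }
}

// Merge the mode into codex.verifier, preserving every other key, and return
// the previous raw file contents (null if the file did not exist) for rollback.
function writeVerifierMode(configPath, mode) {
  const previous = fs.existsSync(configPath)
    ? fs.readFileSync(configPath, "utf-8")
    : null;
  const config = readConfigFile(configPath);
  const codex = { ...(config.codex ?? {}) };
  codex.verifier = { ...(codex.verifier ?? {}), mode };
  fs.mkdirSync(path.join(configPath, ".."), { recursive: true });
  fs.writeFileSync(configPath, JSON.stringify({ ...config, codex }, null, 2) + "\n");
  return previous;
}

// Usage against a throwaway directory (illustrative path, not the real layout).
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "ironbee-sketch-"));
const configPath = path.join(dir, ".ironbee", "config.json");
const before = writeVerifierMode(configPath, "main-agent");
const saved = JSON.parse(fs.readFileSync(configPath, "utf-8"));
```

In the bundled command the returned snapshot is what gets written back when `install` throws: an empty string if the file did not exist before, otherwise the old contents.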
@@ -33,6 +33,8 @@ If you see only `ios/`, `web/`, or no mobile directories — the project does NO
  - Read Logcat output for the tag(s) relevant to the changed code: `MCP:adt_o11y_log-read` or `MCP:adt_o11y_log-follow` (drain a follow with `MCP:adt_o11y_log-get-followed`, stop it with `MCP:adt_o11y_log-stop-follow`).
  - Confirm expected log lines appear AND no unexpected crashes (FATAL / E/ entries for the app package).
 
+ **Batch (speed):** connect + launch-app run standalone first (prerequisites). On the device-evidence path, batch the UI interactions + the UI snapshot into one `MCP:adt_execute`; the snapshot captures the state after the batched interactions, so to assert an intermediate state take a snapshot at that point too. The device-evidence screenshot is usually pixel-judged (a visual change) — take THAT one standalone with `includeBase64: true` so you can see it; batch it only when it's purely gate evidence. Log-evidence reads batch together too.
+
  ### Verdict fields
  The verdict is platform-agnostic — submit only semantic judgment:
 
@@ -13,6 +13,8 @@ The **backend protocol cycle** verifies backend changes by driving real protocol
 
  You can satisfy the cycle via **protocol-call evidence** (you drive the request yourself), **log evidence** (something else drives the request, you read the resulting logs), **DB evidence** (you inspect database state directly), or any combination. Pick whichever fits the task; one is enough.
 
+ **Batch (speed):** group consecutive `bedt_*` steps into one `MCP:bedt_execute` — e.g. a POST then a GET that reuses the created id (bind the first call's result: `const r = callTool('bedt_request_http', {…POST…}); callTool('bedt_request_http', { /* GET using an id from r */ })`), register-source + read, or db-connect + query. Keep a step standalone only when you must inspect its result to DECIDE what to do next, not just to pass a value along.
+
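The POST-then-GET binding in the backend batch note can be sketched as a small script. This is a guess at shape only: `callTool` here is a local stub standing in for the sandbox function, and the argument names (`method`, `url`, `body`), the URL, and the response shape are all assumptions, since the diff shows the tool name `bedt_request_http` but not its schema.

```javascript
// Stub for the sandbox's callTool; the real one is provided by bedt_execute.
// Argument names and response shape are assumptions made for this sketch.
const calls = [];
function callTool(name, args) {
  calls.push({ name, args });
  if (args.method === "POST") {
    return { status: 201, body: { id: "order-42" } };
  }
  return { status: 200, body: { id: args.url.split("/").pop() } };
}

// What a batch body might look like: the GET reuses the id the POST returned,
// so both steps fit in one turn without inspecting the result in between.
function batchScript() {
  const created = callTool("bedt_request_http", {
    method: "POST",
    url: "http://localhost:3000/orders", // hypothetical endpoint
    body: { sku: "abc" },
  });
  const fetched = callTool("bedt_request_http", {
    method: "GET",
    url: `http://localhost:3000/orders/${created.body.id}`,
  });
  return { created, fetched };
}

const result = batchScript();
```

Passing `created.body.id` along is mechanical, which is why it can be batched. If the next request depended on a judgment about the POST response, that step would stay standalone, as the note says.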
  ### Path A — Protocol-call evidence
 
  1. **Confirm a backend service is running** (the user's dev server, Docker compose, k8s port-forward, …). The agent itself does not start the service — ask the user if uncertain.
@@ -14,6 +14,8 @@
 
  All four tools are MANDATORY (the stop hook checks each). Functional interaction is expected for every verification.
 
+ **Batch (speed):** navigate (step 1) is standalone — read the ARIA snapshot it returns to decide your interactions. Then run steps 2–5 in ONE `MCP:bdt_execute` batch — `callTool('bdt_interaction_…', …)` for each interaction, `callTool('bdt_content_take-screenshot', …)`, `callTool('bdt_a11y_take-aria-snapshot', …)`, `callTool('bdt_o11y_get-console-messages', …)` — instead of four separate turns. Screenshot/aria/console capture the state AFTER the batched interactions, so batch interactions that lead to ONE state you want to assert; to assert an intermediate state (e.g. a modal that opens then closes) take a screenshot/snapshot at that point too — interleave it in the batch or split into two. The interaction is what makes the evidence meaningful: a batch of just the four evidence tools with no real interaction passes the tool-presence check but verifies nothing. If you must judge the screenshot's pixels, take that one standalone with `includeBase64: true`.
+
  ### Verdict fields
  The verdict is platform-agnostic — you submit only semantic judgment:
 
@@ -31,6 +31,8 @@ If you see `pom.xml`, `build.gradle`, `requirements.txt`, `pyproject.toml`, `go.
  - Read errors: `MCP:ndt_debug_get-logs` with the error-level filter.
  4. **Disconnect** (optional): `MCP:ndt_debug_disconnect`.
 
+ **Batch (speed):** connect (step 2) is standalone discovery. Batch consecutive `ndt_*` calls in one `MCP:ndt_execute` — set several probes together, then later read snapshots/logs together. The exercise step is ALWAYS separate: whatever triggers the code path (a browser/backend call on another server, a CLI command, the user) can't share an `ndt_*` batch — so node runs as set probes (batch) → exercise (separate) → read snapshots (batch).
+
  ### Verdict fields
  The verdict is platform-agnostic — you submit only semantic judgment:
 
@@ -55,10 +55,31 @@ If already running, skip start. If the build fails, fix it before proceeding.
  - Pass → `{ "session_id": "...", "status": "pass", "checks": [...] }`
  - Fail → add `"issues": [...]` describing what failed.
  - Pass after a previous fail → add `"fixes": [...]` describing what was repaired.
+ - **A FALSE failure is a FAIL — not "verified failure handling".** When you exercise a negative path, separate an EXPECTED negative test (you deliberately fed invalid input — bad card, missing auth, malformed payload — and it correctly failed → supports a `pass`) from a FALSE failure (a VALID, in-scope operation that SHOULD succeed but errors out → a DEFECT). Report a false failure as `status: "fail"` (or at minimum non-empty `issues`), never as a passing "failure path verified". Passing a run whose own evidence shows a legitimate operation breaking is a false pass.
  - **Nothing to verify? Use N/A — never fake evidence.** When the change has no runtime surface (type-only edit, behavior-neutral refactor, config/docs that still tripped a cycle): global `{ "session_id": "...", "status": "not_applicable", "reason": ["why there's no runtime surface"] }` (no `checks`), or per-platform on a pass/fail verdict `"not_applicable_cycles": ["browser"], "reason": ["server-only change"]` to exempt some cycles while verifying others. `reason` is REQUIRED (recorded + observable); strict mode rejects N/A. Base "nothing to verify" on the FULL change set (the change is often already COMMITTED) — check `git diff HEAD~1 HEAD --stat`, not just a clean `git status`, before declaring N/A.
  - **The stop hook enforces that you called the required tools for every active (non-exempt) cycle and that a pass/fail verdict carries non-empty `checks`.**
  8. If failed → fix → rebuild → go back to step 2 → repeat until pass.
 
+ ## Speed — batch your tool calls (fewer LLM round-trips)
+
+ Each tool call is a separate LLM round-trip, and that round-trip — not the tool's execution
+ — is the dominant cost of a verification. Drive the tools in as few turns as you can:
+
+ - **Batch a scope's work into ONE `MCP:*_execute` call.** Each cycle exposes a batch tool
+ (`MCP:bdt_execute` / `MCP:ndt_execute` / `MCP:bedt_execute` / `MCP:adt_execute`) that runs
+ many steps in one turn — nest each as a `callTool('<tool>', { … })`. A batch nests only
+ that cycle's own tools (you can't mix servers in one `*_execute`). It's a JS sandbox, so a later step
+ can reuse a value an earlier `callTool` returned
+ (`const r = callTool(…); callTool(…, { /* a field from r */ })`); and `*_execute` STOPS at
+ the first failing nested call, so the rest don't run. Nested calls are credited to the gate like
+ standalone calls — but authoring the batch is not the work: read each result and confirm
+ real evidence came back (a batch whose interaction failed has no screenshot/snapshot
+ behind it). See each platform section for that cycle's concrete batch shape, including any
+ cycle-specific screenshot or recording handling.
+ - **Discovery stays standalone — you can't batch what you haven't seen.** The step that
+ reveals what to do (navigate / connect / snapshot) runs first and on its own; you read its
+ result, THEN batch the actions it told you to take.
+
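The "stops at the first failing nested call" behaviour, and why a batch still has to be read afterwards, can be modelled with a toy runner. Nothing here is the real `*_execute` implementation: `runBatch` is a hypothetical stand-in, failure is modelled as a thrown error (an assumption), and `example_interaction` is a placeholder tool name because the diff elides the actual interaction tools. The two capture tool names are the ones the browser hunk mentions; their return values are invented.

```javascript
// Toy model of a batch runner: nested calls run in order and the batch
// aborts at the first one that throws. Purely illustrative.
function runBatch(tools, script) {
  const completed = [];
  const callTool = (name, args) => {
    const result = tools[name](args); // a failing tool throws here
    completed.push(name);
    return result;
  };
  try {
    script(callTool);
    return { ok: true, completed };
  } catch (err) {
    return { ok: false, completed, error: err.message };
  }
}

// Fake tools; "example_interaction" is a placeholder, not a real tool name.
const tools = {
  example_interaction: ({ selector }) => {
    if (selector !== "#save") throw new Error(`no element matches ${selector}`);
    return { clicked: selector };
  },
  "bdt_content_take-screenshot": () => ({ path: "shot.png" }),
  "bdt_a11y_take-aria-snapshot": () => ({ tree: 'button "Save"' }),
};

// One interaction followed by two evidence captures.
const script = (selector) => (callTool) => {
  callTool("example_interaction", { selector });
  callTool("bdt_content_take-screenshot", {});
  callTool("bdt_a11y_take-aria-snapshot", {});
};

const good = runBatch(tools, script("#save"));
const bad = runBatch(tools, script("#missing"));
```

In the failing run no capture tool executed, so there is no screenshot or snapshot behind the batch. Under this model that is the case the section warns about: the batch was authored, but the evidence was never produced.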
  <!--IRONBEE:PLATFORM:browser-->
  <!--/IRONBEE:PLATFORM:browser-->