npm - oh-my-codex - Versions diffs - 0.18.7 → 0.18.9 - Mend

oh-my-codex 0.18.7 → 0.18.9

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (307)

package/Cargo.lock +12 -12
package/Cargo.toml +1 -1
package/README.md +5 -5
package/crates/omx-sparkshell/tests/execution.rs +1 -1
package/dist/agents/__tests__/native-config.test.js +42 -1
package/dist/agents/__tests__/native-config.test.js.map +1 -1
package/dist/agents/definitions.d.ts +8 -0
package/dist/agents/definitions.d.ts.map +1 -1
package/dist/agents/definitions.js +1 -0
package/dist/agents/definitions.js.map +1 -1
package/dist/agents/native-config.d.ts +5 -1
package/dist/agents/native-config.d.ts.map +1 -1
package/dist/agents/native-config.js +17 -2
package/dist/agents/native-config.js.map +1 -1
package/dist/autopilot/__tests__/fsm.test.js +3 -0
package/dist/autopilot/__tests__/fsm.test.js.map +1 -1
package/dist/autopilot/fsm.js +2 -2
package/dist/autopilot/fsm.js.map +1 -1
package/dist/cli/__tests__/auth.test.js +4 -2
package/dist/cli/__tests__/auth.test.js.map +1 -1
package/dist/cli/__tests__/codex-plugin-layout.test.js +512 -1
package/dist/cli/__tests__/codex-plugin-layout.test.js.map +1 -1
package/dist/cli/__tests__/doctor-warning-copy.test.js +39 -0
package/dist/cli/__tests__/doctor-warning-copy.test.js.map +1 -1
package/dist/cli/__tests__/index.test.js +98 -6
package/dist/cli/__tests__/index.test.js.map +1 -1
package/dist/cli/__tests__/package-bin-contract.test.js +28 -8
package/dist/cli/__tests__/package-bin-contract.test.js.map +1 -1
package/dist/cli/__tests__/question.test.js +26 -9
package/dist/cli/__tests__/question.test.js.map +1 -1
package/dist/cli/__tests__/ralph-goal-mode-contract.test.js +13 -0
package/dist/cli/__tests__/ralph-goal-mode-contract.test.js.map +1 -1
package/dist/cli/__tests__/ralph.test.js +14 -0
package/dist/cli/__tests__/ralph.test.js.map +1 -1
package/dist/cli/__tests__/resume.test.js +50 -1
package/dist/cli/__tests__/resume.test.js.map +1 -1
package/dist/cli/__tests__/setup-install-mode.test.js +89 -0
package/dist/cli/__tests__/setup-install-mode.test.js.map +1 -1
package/dist/cli/__tests__/setup-refresh.test.js +65 -0
package/dist/cli/__tests__/setup-refresh.test.js.map +1 -1
package/dist/cli/__tests__/state.test.js +21 -0
package/dist/cli/__tests__/state.test.js.map +1 -1
package/dist/cli/__tests__/team.test.js +2 -2
package/dist/cli/__tests__/update.test.js +323 -18
package/dist/cli/__tests__/update.test.js.map +1 -1
package/dist/cli/__tests__/windows-popup-loop-contract.test.js +1 -1
package/dist/cli/doctor.d.ts.map +1 -1
package/dist/cli/doctor.js +8 -1
package/dist/cli/doctor.js.map +1 -1
package/dist/cli/index.d.ts +21 -4
package/dist/cli/index.d.ts.map +1 -1
package/dist/cli/index.js +143 -28
package/dist/cli/index.js.map +1 -1
package/dist/cli/plugin-marketplace.d.ts +14 -2
package/dist/cli/plugin-marketplace.d.ts.map +1 -1
package/dist/cli/plugin-marketplace.js +62 -15
package/dist/cli/plugin-marketplace.js.map +1 -1
package/dist/cli/ralph.d.ts.map +1 -1
package/dist/cli/ralph.js +3 -1
package/dist/cli/ralph.js.map +1 -1
package/dist/cli/setup-preferences.d.ts +2 -0
package/dist/cli/setup-preferences.d.ts.map +1 -1
package/dist/cli/setup-preferences.js +4 -0
package/dist/cli/setup-preferences.js.map +1 -1
package/dist/cli/setup.d.ts +3 -0
package/dist/cli/setup.d.ts.map +1 -1
package/dist/cli/setup.js +166 -27
package/dist/cli/setup.js.map +1 -1
package/dist/cli/state.d.ts.map +1 -1
package/dist/cli/state.js +8 -1
package/dist/cli/state.js.map +1 -1
package/dist/cli/tmux-hook.d.ts.map +1 -1
package/dist/cli/tmux-hook.js +16 -0
package/dist/cli/tmux-hook.js.map +1 -1
package/dist/cli/update.d.ts +22 -3
package/dist/cli/update.d.ts.map +1 -1
package/dist/cli/update.js +312 -26
package/dist/cli/update.js.map +1 -1
package/dist/cli/version.d.ts.map +1 -1
package/dist/cli/version.js +5 -9
package/dist/cli/version.js.map +1 -1
package/dist/compat/__tests__/doctor-contract.test.js +12 -1
package/dist/compat/__tests__/doctor-contract.test.js.map +1 -1
package/dist/config/__tests__/generator-notify.test.js +1 -0
package/dist/config/__tests__/generator-notify.test.js.map +1 -1
package/dist/config/generator.d.ts +2 -2
package/dist/config/generator.d.ts.map +1 -1
package/dist/config/generator.js +2 -2
package/dist/config/generator.js.map +1 -1
package/dist/config/team-mode.d.ts +12 -0
package/dist/config/team-mode.d.ts.map +1 -0
package/dist/config/team-mode.js +91 -0
package/dist/config/team-mode.js.map +1 -0
package/dist/hooks/__tests__/agents-overlay.test.js +88 -0
package/dist/hooks/__tests__/agents-overlay.test.js.map +1 -1
package/dist/hooks/__tests__/code-review-skill-contract.test.js +12 -0
package/dist/hooks/__tests__/code-review-skill-contract.test.js.map +1 -1
package/dist/hooks/__tests__/deep-interview-contract.test.js +30 -1
package/dist/hooks/__tests__/deep-interview-contract.test.js.map +1 -1
package/dist/hooks/__tests__/keyword-detector.test.js +423 -3
package/dist/hooks/__tests__/keyword-detector.test.js.map +1 -1
package/dist/hooks/__tests__/notify-fallback-watcher.test.js +1 -1
package/dist/hooks/__tests__/notify-fallback-watcher.test.js.map +1 -1
package/dist/hooks/__tests__/notify-hook-auto-nudge.test.js +189 -0
package/dist/hooks/__tests__/notify-hook-auto-nudge.test.js.map +1 -1
package/dist/hooks/__tests__/notify-hook-team-leader-nudge.test.js +35 -2
package/dist/hooks/__tests__/notify-hook-team-leader-nudge.test.js.map +1 -1
package/dist/hooks/__tests__/notify-hook-tmux-heal.test.js +3 -3
package/dist/hooks/__tests__/notify-hook-tmux-heal.test.js.map +1 -1
package/dist/hooks/__tests__/skill-guidance-contract.test.js +21 -0
package/dist/hooks/__tests__/skill-guidance-contract.test.js.map +1 -1
package/dist/hooks/agents-overlay.d.ts.map +1 -1
package/dist/hooks/agents-overlay.js +36 -50
package/dist/hooks/agents-overlay.js.map +1 -1
package/dist/hooks/extensibility/__tests__/plugin-runner.test.js +31 -0
package/dist/hooks/extensibility/__tests__/plugin-runner.test.js.map +1 -1
package/dist/hooks/extensibility/plugin-runner.js +17 -21
package/dist/hooks/extensibility/plugin-runner.js.map +1 -1
package/dist/hooks/keyword-detector.d.ts.map +1 -1
package/dist/hooks/keyword-detector.js +258 -12
package/dist/hooks/keyword-detector.js.map +1 -1
package/dist/hooks/prompt-guidance-contract.d.ts.map +1 -1
package/dist/hooks/prompt-guidance-contract.js +6 -0
package/dist/hooks/prompt-guidance-contract.js.map +1 -1
package/dist/hooks/session.d.ts +1 -0
package/dist/hooks/session.d.ts.map +1 -1
package/dist/hooks/session.js.map +1 -1
package/dist/hud/__tests__/authority.test.js +435 -32
package/dist/hud/__tests__/authority.test.js.map +1 -1
package/dist/hud/__tests__/hud-tmux-injection.test.js +2 -1
package/dist/hud/__tests__/hud-tmux-injection.test.js.map +1 -1
package/dist/hud/__tests__/index.test.js +42 -0
package/dist/hud/__tests__/index.test.js.map +1 -1
package/dist/hud/__tests__/reconcile.test.js +642 -15
package/dist/hud/__tests__/reconcile.test.js.map +1 -1
package/dist/hud/__tests__/render.test.js +61 -0
package/dist/hud/__tests__/render.test.js.map +1 -1
package/dist/hud/__tests__/state.test.js +160 -4
package/dist/hud/__tests__/state.test.js.map +1 -1
package/dist/hud/__tests__/tmux.test.js +180 -21
package/dist/hud/__tests__/tmux.test.js.map +1 -1
package/dist/hud/authority.d.ts +5 -0
package/dist/hud/authority.d.ts.map +1 -1
package/dist/hud/authority.js +324 -28
package/dist/hud/authority.js.map +1 -1
package/dist/hud/index.d.ts +3 -2
package/dist/hud/index.d.ts.map +1 -1
package/dist/hud/index.js +42 -19
package/dist/hud/index.js.map +1 -1
package/dist/hud/reconcile.d.ts +3 -3
package/dist/hud/reconcile.d.ts.map +1 -1
package/dist/hud/reconcile.js +128 -19
package/dist/hud/reconcile.js.map +1 -1
package/dist/hud/render.d.ts.map +1 -1
package/dist/hud/render.js +35 -0
package/dist/hud/render.js.map +1 -1
package/dist/hud/state.d.ts.map +1 -1
package/dist/hud/state.js +65 -80
package/dist/hud/state.js.map +1 -1
package/dist/hud/tmux.d.ts +24 -6
package/dist/hud/tmux.d.ts.map +1 -1
package/dist/hud/tmux.js +136 -38
package/dist/hud/tmux.js.map +1 -1
package/dist/hud/types.d.ts +11 -0
package/dist/hud/types.d.ts.map +1 -1
package/dist/hud/types.js.map +1 -1
package/dist/mcp/__tests__/state-paths.test.js +71 -1
package/dist/mcp/__tests__/state-paths.test.js.map +1 -1
package/dist/mcp/state-paths.d.ts +32 -0
package/dist/mcp/state-paths.d.ts.map +1 -1
package/dist/mcp/state-paths.js +113 -17
package/dist/mcp/state-paths.js.map +1 -1
package/dist/mcp/state-server.d.ts +4 -4
package/dist/question/__tests__/renderer.test.js +566 -1
package/dist/question/__tests__/renderer.test.js.map +1 -1
package/dist/question/renderer.d.ts +9 -1
package/dist/question/renderer.d.ts.map +1 -1
package/dist/question/renderer.js +246 -70
package/dist/question/renderer.js.map +1 -1
package/dist/scripts/__tests__/codex-native-hook.test.js +837 -101
package/dist/scripts/__tests__/codex-native-hook.test.js.map +1 -1
package/dist/scripts/__tests__/notify-state-io.test.js +72 -1
package/dist/scripts/__tests__/notify-state-io.test.js.map +1 -1
package/dist/scripts/__tests__/notify-tmux-injection.test.d.ts +2 -0
package/dist/scripts/__tests__/notify-tmux-injection.test.d.ts.map +1 -0
package/dist/scripts/__tests__/notify-tmux-injection.test.js +57 -0
package/dist/scripts/__tests__/notify-tmux-injection.test.js.map +1 -0
package/dist/scripts/__tests__/run-test-files.test.js +74 -0
package/dist/scripts/__tests__/run-test-files.test.js.map +1 -1
package/dist/scripts/__tests__/verify-native-agents.test.js +65 -0
package/dist/scripts/__tests__/verify-native-agents.test.js.map +1 -1
package/dist/scripts/codex-native-hook.d.ts.map +1 -1
package/dist/scripts/codex-native-hook.js +107 -39
package/dist/scripts/codex-native-hook.js.map +1 -1
package/dist/scripts/eval/eval-parity-smoke.js +1 -1
package/dist/scripts/eval/eval-parity-smoke.js.map +1 -1
package/dist/scripts/notify-hook/auto-nudge.d.ts.map +1 -1
package/dist/scripts/notify-hook/auto-nudge.js +3 -1
package/dist/scripts/notify-hook/auto-nudge.js.map +1 -1
package/dist/scripts/notify-hook/ralph-session-resume.d.ts.map +1 -1
package/dist/scripts/notify-hook/ralph-session-resume.js +3 -10
package/dist/scripts/notify-hook/ralph-session-resume.js.map +1 -1
package/dist/scripts/notify-hook/state-io.d.ts.map +1 -1
package/dist/scripts/notify-hook/state-io.js +62 -38
package/dist/scripts/notify-hook/state-io.js.map +1 -1
package/dist/scripts/notify-hook/team-leader-nudge.d.ts.map +1 -1
package/dist/scripts/notify-hook/team-leader-nudge.js +7 -0
package/dist/scripts/notify-hook/team-leader-nudge.js.map +1 -1
package/dist/scripts/notify-hook/tmux-injection.d.ts +7 -0
package/dist/scripts/notify-hook/tmux-injection.d.ts.map +1 -1
package/dist/scripts/notify-hook/tmux-injection.js +24 -18
package/dist/scripts/notify-hook/tmux-injection.js.map +1 -1
package/dist/scripts/notify-hook.js +75 -11
package/dist/scripts/notify-hook.js.map +1 -1
package/dist/scripts/run-test-files.js +193 -22
package/dist/scripts/run-test-files.js.map +1 -1
package/dist/scripts/sync-plugin-mirror.d.ts.map +1 -1
package/dist/scripts/sync-plugin-mirror.js +61 -3
package/dist/scripts/sync-plugin-mirror.js.map +1 -1
package/dist/scripts/verify-native-agents.d.ts.map +1 -1
package/dist/scripts/verify-native-agents.js +58 -1
package/dist/scripts/verify-native-agents.js.map +1 -1
package/dist/state/__tests__/operations.test.js +113 -0
package/dist/state/__tests__/operations.test.js.map +1 -1
package/dist/state/__tests__/skill-active.test.js +3 -16
package/dist/state/__tests__/skill-active.test.js.map +1 -1
package/dist/state/__tests__/workflow-transition.test.js +25 -0
package/dist/state/__tests__/workflow-transition.test.js.map +1 -1
package/dist/state/operations.d.ts.map +1 -1
package/dist/state/operations.js +57 -2
package/dist/state/operations.js.map +1 -1
package/dist/state/skill-active.d.ts.map +1 -1
package/dist/state/skill-active.js +7 -39
package/dist/state/skill-active.js.map +1 -1
package/dist/state/workflow-transition-reconcile.d.ts.map +1 -1
package/dist/state/workflow-transition-reconcile.js +10 -14
package/dist/state/workflow-transition-reconcile.js.map +1 -1
package/dist/team/__tests__/runtime.test.js +1 -1
package/dist/team/__tests__/runtime.test.js.map +1 -1
package/dist/team/__tests__/scaling.test.js +9 -4
package/dist/team/__tests__/scaling.test.js.map +1 -1
package/dist/team/__tests__/tmux-session.test.js +195 -2
package/dist/team/__tests__/tmux-session.test.js.map +1 -1
package/dist/team/__tests__/worker-runtime-identity.test.js +4 -2
package/dist/team/__tests__/worker-runtime-identity.test.js.map +1 -1
package/dist/team/scaling.d.ts.map +1 -1
package/dist/team/scaling.js +3 -2
package/dist/team/scaling.js.map +1 -1
package/dist/team/tmux-session.d.ts +2 -0
package/dist/team/tmux-session.d.ts.map +1 -1
package/dist/team/tmux-session.js +142 -12
package/dist/team/tmux-session.js.map +1 -1
package/dist/utils/__tests__/platform-command.test.js +16 -1
package/dist/utils/__tests__/platform-command.test.js.map +1 -1
package/dist/utils/__tests__/version.test.d.ts +2 -0
package/dist/utils/__tests__/version.test.d.ts.map +1 -0
package/dist/utils/__tests__/version.test.js +51 -0
package/dist/utils/__tests__/version.test.js.map +1 -0
package/dist/utils/paths.d.ts +8 -1
package/dist/utils/paths.d.ts.map +1 -1
package/dist/utils/paths.js +16 -4
package/dist/utils/paths.js.map +1 -1
package/dist/utils/platform-command.d.ts +9 -0
package/dist/utils/platform-command.d.ts.map +1 -1
package/dist/utils/platform-command.js +15 -0
package/dist/utils/platform-command.js.map +1 -1
package/dist/utils/version.d.ts +7 -0
package/dist/utils/version.d.ts.map +1 -0
package/dist/utils/version.js +67 -0
package/dist/utils/version.js.map +1 -0
package/dist/verification/__tests__/ci-rust-gates.test.js +89 -1
package/dist/verification/__tests__/ci-rust-gates.test.js.map +1 -1
package/dist/verification/__tests__/dev-merge-issue-close-workflow.test.js +16 -2
package/dist/verification/__tests__/dev-merge-issue-close-workflow.test.js.map +1 -1
package/package.json +11 -10
package/plugins/oh-my-codex/.codex-plugin/plugin.json +1 -1
package/plugins/oh-my-codex/hooks/codex-native-hook.mjs +334 -21
package/plugins/oh-my-codex/hooks/hooks.json +1 -2
package/plugins/oh-my-codex/skills/autopilot/SKILL.md +3 -1
package/plugins/oh-my-codex/skills/code-review/SKILL.md +7 -7
package/plugins/oh-my-codex/skills/deep-interview/SKILL.md +51 -11
package/plugins/oh-my-codex/skills/ralph/SKILL.md +22 -22
package/plugins/oh-my-codex/skills/ultraqa/SKILL.md +9 -0
package/skills/autopilot/SKILL.md +3 -1
package/skills/code-review/SKILL.md +7 -7
package/skills/deep-interview/SKILL.md +51 -11
package/skills/ralph/SKILL.md +22 -22
package/skills/ultraqa/SKILL.md +9 -0
package/src/scripts/__tests__/codex-native-hook.test.ts +946 -98
package/src/scripts/__tests__/notify-state-io.test.ts +95 -0
package/src/scripts/__tests__/notify-tmux-injection.test.ts +82 -0
package/src/scripts/__tests__/run-test-files.test.ts +102 -0
package/src/scripts/__tests__/verify-native-agents.test.ts +75 -0
package/src/scripts/codex-native-hook.ts +123 -34
package/src/scripts/demo-team-e2e.sh +10 -7
package/src/scripts/eval/eval-parity-smoke.ts +1 -1
package/src/scripts/notify-hook/auto-nudge.ts +3 -1
package/src/scripts/notify-hook/ralph-session-resume.ts +2 -8
package/src/scripts/notify-hook/state-io.ts +75 -37
package/src/scripts/notify-hook/team-leader-nudge.ts +7 -0
package/src/scripts/notify-hook/tmux-injection.ts +35 -19
package/src/scripts/notify-hook.ts +91 -4
package/src/scripts/prepare-build.js +83 -0
package/src/scripts/run-test-files.ts +192 -22
package/src/scripts/sync-plugin-mirror.ts +98 -9
package/src/scripts/verify-native-agents.ts +65 -1
package/src/scripts/postinstall-bootstrap.js +0 -23

package/plugins/oh-my-codex/skills/ralph/SKILL.md CHANGED

@@ -26,14 +26,14 @@ Ralph is a persistence loop that keeps working on a task until it is fully compl
 </Do_Not_Use_When>
 <Why_This_Exists>
-Complex tasks often fail silently: partial implementations get declared "done", tests get skipped, edge cases get forgotten. Ralph prevents this by looping until work is genuinely complete, requiring fresh verification evidence before allowing completion, and using tiered architect review to confirm quality.
+Complex tasks often fail silently: partial implementations get declared "done", tests get skipped, edge cases get forgotten. Ralph prevents this by looping until work is genuinely complete, requiring fresh verification evidence before allowing completion, and using explicit architect native-subagent verification to confirm quality.
 </Why_This_Exists>
 <Execution_Policy>
 - Fire independent agent calls simultaneously -- never wait sequentially for independent work
 - Use `run_in_background: true` for long operations (installs, builds, test suites)
-- Always pass the `model` parameter explicitly when delegating to agents
-- Read `docs/shared/agent-tiers.md` before first delegation to select correct agent tiers
+- Always set `agent_type` when spawning native subagents; use `reasoning_effort` for per-dispatch intensity when needed
+- Preserve legacy Ralph tier intent through native reasoning effort: LOW -> `low`, STANDARD -> `medium`, THOROUGH -> `xhigh`
 - Deliver the full implementation: no scope reduction, no partial completion, no deleting tests to make them pass
 - Apply the shared workflow guidance pattern: outcome-first framing, concise visible updates for multi-step execution, local overrides for the active workflow branch, validation proportional to risk, explicit stop rules, and automatic continuation for safe reversible steps. Ask only for material, destructive, credentialed, external-production, or preference-dependent branches.
 - Integrate with Codex goal mode when goal tools are available: inspect the active thread goal with `get_goal`, preserve it as the top-level stop condition, and only call `update_goal({status: "complete"})` after a Ralph completion audit proves the objective is actually achieved.
@@ -54,10 +54,10 @@ Complex tasks often fail silently: partial implementations get declared "done",
    - Do not begin Ralph execution work (delegation, implementation, or verification loops) until snapshot grounding exists. If forced to proceed quickly, note explicit risk tradeoffs.
 1. **Review progress**: Check TODO list and any prior iteration state
 2. **Continue from where you left off**: Pick up incomplete tasks
-3. **Delegate in parallel**: Route tasks to specialist agents at appropriate tiers
-   - Simple lookups: LOW tier -- "What does this function return?"
-   - Standard work: STANDARD tier -- "Add error handling to this module"
-   - Complex analysis: THOROUGH tier -- "Debug this race condition"
+3. **Delegate in parallel**: Route tasks to specialist native agents with explicit `agent_type` and appropriate `reasoning_effort`
+   - Simple lookups: `reasoning_effort="low"` -- "What does this function return?"
+   - Standard work: `reasoning_effort="medium"` -- "Add error handling to this module"
+   - Complex analysis: `reasoning_effort="xhigh"` -- "Debug this race condition"
    - When Ralph is entered as a ralplan follow-up, start from the approved **available-agent-types roster** and make the delegation plan explicit: implementation lane, evidence/regression lane, and final sign-off lane using only known agent types
 4. **Run long operations in background**: Builds, installs, test suites use `run_in_background: true`
 5. **Visual task gate (when screenshot/reference images are present)**:
@@ -72,11 +72,11 @@ Complex tasks often fail silently: partial implementations get declared "done",
    b. Run verification (test, build, lint)
    c. Read the output -- confirm it actually passed
    d. Check: zero pending/in_progress TODO items
-7. **Architect verification** (tiered):
-   - <5 files, <100 lines with full tests: STANDARD tier minimum (architect role)
-   - Standard changes: STANDARD tier (architect role)
-   - >20 files or security/architectural changes: THOROUGH tier (architect role)
-   - Ralph floor: always at least STANDARD, even for small changes
+7. **Architect verification** (native role):
+   - <5 files, <100 lines with full tests: `task(agent_type="architect", reasoning_effort="medium", prompt="...")` minimum
+   - Standard changes: `task(agent_type="architect", reasoning_effort="medium", prompt="...")`
+   - >20 files or security/architectural changes: `task(agent_type="architect", reasoning_effort="xhigh", prompt="...")`
+   - Ralph floor: always run an explicit `architect` native subagent, even for small changes
 7.5 **Mandatory Deslop Pass**:
    - After Step 7 passes, run `oh-my-codex:ai-slop-cleaner` on **all files changed during the Ralph session**.
    - Scope the cleaner to **changed files only**; do not widen the pass beyond Ralph-owned edits.
@@ -87,7 +87,7 @@ Complex tasks often fail silently: partial implementations get declared "done",
    - If post-deslop regression fails, roll back cleaner changes or fix and retry. Then rerun Step 7.5 and Step 7.6 until the regression is green.
    - Do not proceed to completion until post-deslop regression is green (unless `--no-deslop` explicitly skipped the deslop pass).
 8. **On approval**: If Codex goal mode is active, call `update_goal({status: "complete"})` before `/cancel`; report final elapsed time and token-budget usage when the tool returns it. Then run `/cancel` to cleanly exit and clean up all state files.
-9. **On rejection**: Fix the issues raised, then re-verify at the same tier
+9. **On rejection**: Fix the issues raised, then re-verify with the same `agent_type` and `reasoning_effort` profile
 </Steps>
 <Tool_Usage>
@@ -150,11 +150,11 @@ Use the CLI-first state surface for Ralph lifecycle state (`omx state write/read
 <Good>
 Correct parallel delegation:
 ```
-delegate(role="executor", tier="LOW", task="Add type export for UserConfig")
-delegate(role="executor", tier="STANDARD", task="Implement the caching layer for API responses")
-delegate(role="executor", tier="THOROUGH", task="Refactor auth module to support OAuth2 flow")
+task(agent_type="executor", reasoning_effort="low", prompt="Add type export for UserConfig")
+task(agent_type="executor", reasoning_effort="medium", prompt="Implement the caching layer for API responses")
+task(agent_type="executor", reasoning_effort="xhigh", prompt="Refactor auth module to support OAuth2 flow")
 ```
-Why good: Three independent tasks fired simultaneously at appropriate tiers.
+Why good: Three independent tasks fired simultaneously while explicitly selecting the installed `executor` native role, so the UI/tracker does not show default subagents; legacy tier intent is preserved through native reasoning effort (`LOW` -> `low`, `STANDARD` -> `medium`, `THOROUGH` -> `xhigh`).
 </Good>
 <Good>
@@ -163,7 +163,7 @@ Correct verification before completion:
 1. Run: npm test           → Output: "42 passed, 0 failed"
 2. Run: npm run build      → Output: "Build succeeded"
 3. Run: lsp_diagnostics    → Output: 0 errors
-4. Delegate to architect at STANDARD tier  → Verdict: "APPROVED"
+4. task(agent_type="architect", reasoning_effort="medium", prompt="verify completion") → Verdict: "APPROVED"
 5. Run /cancel
 ```
 Why good: Fresh evidence at each step, architect verification, then clean exit.
@@ -178,9 +178,9 @@ Why bad: Uses "should" and "look good" -- no fresh test/build output, no archite
 <Bad>
 Sequential execution of independent tasks:
 ```
-delegate(executor, LOW, "Add type export") → wait →
-delegate(executor, STANDARD, "Implement caching") → wait →
-delegate(executor, THOROUGH, "Refactor auth")
+task(agent_type="executor", reasoning_effort="low", prompt="Add type export") → wait →
+task(agent_type="executor", reasoning_effort="medium", prompt="Implement caching") → wait →
+task(agent_type="executor", reasoning_effort="xhigh", prompt="Refactor auth")
 ```
 Why bad: These are independent tasks that should run in parallel, not sequentially.
 </Bad>
@@ -200,7 +200,7 @@ Why bad: These are independent tasks that should run in parallel, not sequential
 - [ ] Fresh test run output shows all tests pass
 - [ ] Fresh build output shows success
 - [ ] lsp_diagnostics shows 0 errors on affected files
-- [ ] Architect verification passed (STANDARD tier minimum)
+- [ ] Architect verification passed through explicit `task(agent_type="architect", reasoning_effort="medium"...)` minimum
 - [ ] Codex goal-mode completion audit passed, and `update_goal({status: "complete"})` was called when an active goal exists
 - [ ] ai-slop-cleaner pass completed on changed files (or --no-deslop specified)
 - [ ] Post-deslop regression tests pass
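
The tier-to-effort mapping introduced in the hunks above can be sketched as a small lookup table. This is an illustration only: `LegacyTier`, `TIER_TO_EFFORT`, and `toTaskCall` are hypothetical names, and the rendered `task(...)` string simply mirrors the call shape shown in the SKILL.md examples, not a confirmed oh-my-codex API.

```typescript
// Hypothetical sketch: none of these names are part of oh-my-codex.
type LegacyTier = "LOW" | "STANDARD" | "THOROUGH";
type ReasoningEffort = "low" | "medium" | "xhigh";

// Mapping as stated in the updated SKILL.md:
// LOW -> low, STANDARD -> medium, THOROUGH -> xhigh.
const TIER_TO_EFFORT: Record<LegacyTier, ReasoningEffort> = {
  LOW: "low",
  STANDARD: "medium",
  THOROUGH: "xhigh",
};

// Render a call string in the same shape as the examples in the diff.
// The agent type is always passed explicitly, as the new policy requires.
function toTaskCall(agentType: string, tier: LegacyTier, prompt: string): string {
  const effort = TIER_TO_EFFORT[tier];
  return `task(agent_type="${agentType}", reasoning_effort="${effort}", prompt="${prompt}")`;
}

console.log(toTaskCall("architect", "STANDARD", "verify completion"));
// prints: task(agent_type="architect", reasoning_effort="medium", prompt="verify completion")
```

Keeping the mapping in one table makes the "re-verify with the same `agent_type` and `reasoning_effort` profile" rule easy to honor, since a rejected review can be re-run from the same tier value.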

package/plugins/oh-my-codex/skills/ultraqa/SKILL.md CHANGED

@@ -58,6 +58,15 @@ The matrix must include normal-path coverage plus adversarial dynamic e2e scenar
 - Validate exit codes and output semantics; do not trust success-looking text alone.
 - Do not delete, rewrite, or mask unrelated user work. Capture dirty-worktree evidence before and after generated harness work.
+### Temporary Harness Generation Guardrails
+Generated harnesses are part of the QA evidence chain; until setup succeeds, they are evidence about the harness apparatus, not product behavior.
+- **Use absolute repo imports for built artifacts.** When a harness runs from `/tmp` or another scratch directory but imports repository code, resolve the repository root explicitly from the verified repo cwd and import built modules with an absolute path or `pathToFileURL(join(repoRoot, "dist", ...)).href`. Never rely on `./dist/...` from the harness file's temporary directory.
+- **Use a safe file writer for JS/TS harness bodies.** Prefer a small Node/Python writer or another non-interpolating file-write mechanism for harness source that contains backticks, `${...}`, shell metacharacters, or prompt-injection strings. If a shell heredoc is unavoidable, quote the delimiter and verify the written file before execution; do not use interpolating heredocs for JavaScript assertions.
+- **Sanitize OMX runtime env for isolated probes.** When the scenario creates a temporary repo/state tree or intentionally checks local isolation, run the probe with `OMX_ROOT` and `OMX_STATE_ROOT` unset (for example `env -u OMX_ROOT -u OMX_STATE_ROOT ...`) so ambient boxed runtime state cannot redirect reads/writes away from the scenario fixture.
+- **Classify harness setup failures separately.** If a generated harness fails before exercising product behavior because of import paths, shell interpolation, environment leakage, or fixture construction, record it as harness debris, fix the harness, and rerun the scenario before declaring a product defect.
 ## Cycle Workflow
 ### Cycle N (Max 5)
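
Two of the harness guardrails added above (absolute repo imports and a sanitized OMX environment) can be sketched in a few lines of Node-flavored TypeScript. This is a minimal sketch under stated assumptions: `distModuleUrl` and `sanitizedEnv` are hypothetical helper names, and the caller is assumed to have already verified `repoRoot`.

```typescript
import { join } from "node:path";
import { pathToFileURL } from "node:url";

// Hypothetical helper: build an absolute file:// URL for a built module
// under <repoRoot>/dist, so a harness in /tmp never resolves "./dist/...".
function distModuleUrl(repoRoot: string, ...segments: string[]): string {
  return pathToFileURL(join(repoRoot, "dist", ...segments)).href;
}

// Hypothetical helper: return a copy of the environment without
// OMX_ROOT and OMX_STATE_ROOT, leaving every other variable intact.
function sanitizedEnv(
  env: Record<string, string | undefined>,
): Record<string, string | undefined> {
  const copy = { ...env };
  delete copy.OMX_ROOT;
  delete copy.OMX_STATE_ROOT;
  return copy;
}
```

A harness could then load a built module with `await import(distModuleUrl(repoRoot, "cli", "index.js"))` and pass `sanitizedEnv(process.env)` as the `env` option when spawning an isolated probe, which has roughly the same effect as the `env -u OMX_ROOT -u OMX_STATE_ROOT ...` form quoted in the guardrail.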

package/skills/autopilot/SKILL.md CHANGED

@@ -68,12 +68,14 @@ Before Phase `deep-interview` or `ralplan` starts or resumes:
 1. Derive a task slug from the request.
 2. Reuse the latest relevant `.omx/context/{slug}-*.md` snapshot when available.
 3. If none exists, create `.omx/context/{slug}-{timestamp}.md` (UTC `YYYYMMDDTHHMMSSZ`) with:
-   - task statement
+   - activation prompt / task seed
+   - original task status (`activation-prompt`, `legacy-unverified`, or `unavailable`)
    - desired outcome
    - known facts/evidence
    - constraints
    - unknowns/open questions
    - likely codebase touchpoints
+   - a scope note that the seed is the Autopilot activation prompt, not guaranteed prior conversation context
 4. If brownfield facts are missing, run `explore` first before or during `$deep-interview` (`$deep-interview --quick <task>` remains acceptable for bounded low-ambiguity intake); do not skip the clarification gate merely because the task sounds actionable.
 5. Carry the snapshot path in Autopilot state and all handoff artifacts.
 </Pre-context Intake>
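
The snapshot naming rule in step 3 (`.omx/context/{slug}-{timestamp}.md` with a UTC `YYYYMMDDTHHMMSSZ` stamp) can be sketched as follows. The helper names `utcStamp` and `snapshotPath` are hypothetical, and the slug is taken as an input because the SKILL.md does not say how it is derived.

```typescript
// Hypothetical sketch: helper names are illustrative, not oh-my-codex APIs.

// Status values listed in the updated snapshot template.
type OriginalTaskStatus = "activation-prompt" | "legacy-unverified" | "unavailable";

// Format a Date as UTC YYYYMMDDTHHMMSSZ.
// toISOString() yields e.g. 2024-01-02T03:04:05.000Z; drop the dashes,
// the colons, and the millisecond part.
function utcStamp(date: Date): string {
  return date.toISOString().replace(/[-:]/g, "").replace(/\.\d{3}Z$/, "Z");
}

// Build the snapshot path for an already-derived slug.
function snapshotPath(slug: string, date: Date): string {
  return `.omx/context/${slug}-${utcStamp(date)}.md`;
}

const status: OriginalTaskStatus = "activation-prompt";
console.log(status, snapshotPath("fix-login", new Date(Date.UTC(2024, 0, 2, 3, 4, 5))));
// prints: activation-prompt .omx/context/fix-login-20240102T030405Z.md
```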

package/skills/code-review/SKILL.md CHANGED

@@ -31,7 +31,7 @@ Delegates to the `code-reviewer` and `architect` agents in parallel for a two-la
 2. **Launch Parallel Review Lanes**
    - **`code-reviewer` lane** - owns spec compliance, security, code quality, performance, and maintainability findings
    - **`architect` lane** - owns the devil's-advocate / design-tradeoff perspective
-   - Both lanes run in parallel and produce distinct outputs before final synthesis
+   - Both lanes run in parallel on a clean context with explicit scope and artifacts, and produce distinct outputs before final synthesis
    - If either lane cannot be launched or does not return evidence, report `independent review unavailable`; do **not** substitute the current/authoring lane, and do **not** approve or mark the review merge-ready.
 3. **Review Categories**
@@ -71,10 +71,11 @@ Delegates to the `code-reviewer` and `architect` agents in parallel for a two-la
 Do not self-review as a fallback. If the `code-reviewer` or `architect` agent path is missing, unavailable, skipped, or fails, emit a clear unavailable-review result and block approval until the independent lane evidence exists.
+Respect the user's current model and reasoning/effort selection when launching review lanes. Do not pass `model` or `reasoning_effort` overrides in the review-lane task calls unless the user explicitly asks for review-specific overrides; omitting them lets native subagents inherit the active session settings.
 ```
-delegate(
-  role="code-reviewer",
-  tier="THOROUGH",
+task(
+  agent_type="code-reviewer",
   prompt="CODE REVIEW TASK
 Review code changes for quality, security, and maintainability.
@@ -98,9 +99,8 @@ Output: Code review report with:
 - Approval recommendation (APPROVE / REQUEST CHANGES / COMMENT)"
 )
-delegate(
-  role="architect",
-  tier="THOROUGH",
+task(
+  agent_type="architect",
   prompt="ARCHITECTURE / DEVIL'S-ADVOCATE REVIEW TASK
 Review the same code changes from the architecture/tradeoff perspective.

package/skills/deep-interview/SKILL.md CHANGED

@@ -51,6 +51,11 @@ If no flag is provided, use **Standard**.
 - Gather codebase facts via `explore` before asking user about internals
 - `omx explore` is deprecated. Use normal repository inspection tools/subagents for simple read-only brownfield fact gathering; use `omx sparkshell` only for explicit shell-native read-only evidence, and keep ambiguous or non-shell-only investigation on the richer normal path.
 - Always run a preflight context intake before the first interview question
+- For brownfield work, preflight must include doc/context grounding before user-facing questions: inspect applicable `AGENTS.md` files, README/getting-started docs, relevant `docs/` contracts/plans/ADRs, existing `.omx/context/` snapshots, and any project-local glossary/context files such as `CONTEXT.md` or `CONTEXT-MAP.md` when present.
+- Treat existing repo language as evidence, not authority: if the user uses a fuzzy, overloaded, or conflicting term, surface the specific doc/code wording and ask which meaning should govern before implementation.
+- Cross-check user claims about current behavior against code or documented contracts when discoverable. If docs and code disagree, ask a confirmation question that names both sources instead of silently choosing one.
+- Use scenario-based edge-case grilling when relationships, boundaries, or handoff behavior are unclear: invent one concrete scenario that stresses the ambiguous boundary, then ask one focused question about the expected outcome.
+- Durable docs, glossary, ADR, or memory updates are opt-in and public-safe only. Deep-interview may recommend such updates in the handoff summary, but must not automatically create or dump public docs from interview transcripts unless the user explicitly chooses that as in-scope.
 - If initial context is oversized or would exceed the prompt budget, do not paste or forward the raw payload into interview prompts; request and record a prompt-safe initial-context summary first
 - The oversized initial-context summary gate is blocking: wait for the concise summary before ambiguity scoring, crystallizing artifacts, or any downstream execution handoff
 - The summary must preserve goals, constraints, success criteria, non-goals, decision boundaries, and references to any full source documents so downstream consumers receive a prompt-safe but faithful context
@@ -97,8 +102,15 @@ If no flag is provided, use **Standard**.
    - Unknowns/open questions
    - Decision-boundary unknowns
    - Likely codebase touchpoints
+   - Relevant repo docs/rules/context inspected
+   - Terminology or doc/code conflicts found
    - Prompt-safe initial-context summary status (`not_needed`, `needed`, or `recorded`)
-5. Save snapshot to `.omx/context/{slug}-{timestamp}.md` (UTC `YYYYMMDDTHHMMSSZ`) and reference it in mode state.
+5. For brownfield tasks, inspect the applicable documentation/rule surface before the first user-facing round. Prefer exact, nearby sources over broad scans:
+   - governing `AGENTS.md` files and template/runtime instruction surfaces that apply to the touched paths
+   - README/getting-started docs and relevant docs under `docs/`, especially contracts, plans, ADR-like records, and workflow docs
+   - existing `.omx/context/` snapshots, `.omx/specs/`, and planning artifacts relevant to the slug
+   - project-local glossary/context files such as `CONTEXT.md`, `CONTEXT-MAP.md`, or context-specific docs when they exist
+6. Save snapshot to `.omx/context/{slug}-{timestamp}.md` (UTC `YYYYMMDDTHHMMSSZ`) and reference it in mode state.
 ## Phase 1: Initialize
@@ -137,13 +149,14 @@ If no flag is provided, use **Standard**.
 Repeat until ambiguity `<= threshold`, the pressure pass is complete, the readiness gates are explicit, the user exits with warning, or max rounds are reached. This is a stop condition: below threshold, do not open a new ordinary interview branch.
 ### 2a) Generate next question
-If the initial context is oversized and no prompt-safe summary has been recorded yet, the next question must be only a summary request. Do not score ambiguity, do not run readiness gates, and do not hand off to `$ralplan`, `$autopilot`, `$ralph`, or `$team` until that summary answer is captured.
+If the initial context is oversized and no prompt-safe summary has been recorded yet, the next question must be only a summary request. Do not score ambiguity, do not run readiness gates, and do not hand off to `$ultragoal`, `$ralplan`, `$autopilot`, `$ralph`, or `$team` until that summary answer is captured.
 Use:
 - Original idea
 - Prior Q&A rounds
 - Current dimension scores
 - Brownfield context (if any)
+- Doc/context grounding notes, including existing terminology, governing rules, and any doc/code mismatch
 - Activated challenge mode injection (Phase 3)
 Target the lowest-scoring dimension, but respect stage priority:
@@ -155,12 +168,21 @@ Follow-up pressure ladder after each answer:
 1. Ask for a concrete example, counterexample, or evidence signal behind the latest claim
 2. Probe the hidden assumption, dependency, or belief that makes the claim true
 3. Force a boundary or tradeoff: what would you explicitly not do, defer, or reject?
-4. If the answer still describes symptoms, reframe toward essence / root cause before moving on
+4. Challenge fuzzy or conflicting terms against the repo's documented language and current code behavior
+5. Stress-test the boundary with one concrete scenario or edge case when a relationship or handoff remains ambiguous
+6. If the answer still describes symptoms, reframe toward essence / root cause before moving on
 Prefer staying on the same thread for multiple rounds when it has the highest leverage. Breadth without pressure is not progress.
 Maintain a **Breadth Ledger** across independent ambiguity tracks: scope, constraints, outputs, verification, brownfield integration, and any user-mentioned deliverable tracks. The ledger is a guard, not a mandatory rotation rule: stay deep on the current thread until it has been pressure-tested, then zoom out only when another material track remains unresolved and would change execution.
+Maintain a **Docs/Terminology Ledger** for brownfield interviews:
+- repo docs/rules/context sources inspected, with path references
+- canonical terms already used by the repo and terms to avoid or disambiguate
+- user terms that conflict with docs or current code behavior
+- doc/code mismatches that require a human decision before implementation
+- optional durable-doc follow-ups that are safe to propose but not auto-apply
 Detailed dimensions:
 - Intent Clarity — why the user wants this
 - Outcome Clarity — what end state they want
@@ -306,6 +328,7 @@ Append round result and updated scores via `omx state write --input '<json>' --j
 Use each mode once when applicable. These are normal escalation tools, not rare rescue moves:
 - **Contrarian** (round 2+ or immediately when an answer rests on an untested assumption): challenge core assumptions
+- **Terminologist** (brownfield, whenever a key term is fuzzy, overloaded, or conflicts with repo docs/code): force a canonical meaning against existing project language before implementation
 - **Simplifier** (round 4+ or when scope expands faster than outcome clarity): probe minimal viable scope
 - **Ontologist** (round 5+ and ambiguity > 0.25, or when the user keeps describing symptoms): ask for essence-level reframing
@@ -336,6 +359,9 @@ Spec should include:
 - Assumptions exposed + resolutions
 - Pressure-pass findings (which answer was revisited, and what changed)
 - Brownfield evidence vs inference notes for any repository-grounded confirmation questions
+- Docs/Terminology Ledger with inspected repo docs/rules/context, term conflicts, and any doc/code mismatch decisions
+- Scenario/edge-case pressure findings that materially shaped scope or acceptance criteria
+- Optional durable documentation recommendations, explicitly marked opt-in and public-safe; do not include raw private transcript dumps
 - Technical context findings
 - Full or condensed transcript
@@ -365,11 +391,11 @@ When the clarified task is specifically about `$autoresearch`, or the skill is i
 ## Phase 5: Execution Bridge
-Present execution options after artifact generation using explicit handoff contracts. Treat the deep-interview spec as the current requirements source of truth and preserve intent, non-goals, decision boundaries, acceptance criteria, and any residual-risk warnings across the handoff.
+Present execution options after artifact generation using explicit handoff contracts. Treat the deep-interview spec as the current requirements source of truth and preserve intent, non-goals, decision boundaries, acceptance criteria, docs/terminology grounding, and any residual-risk warnings across the handoff.
 ### Goal-mode follow-ups
-Include these product-facing suggestions when they fit the clarified spec, without removing the existing `$ralplan`, `$autopilot`, `$ralph`, and `$team` handoff options:
+Include these product-facing suggestions when they fit the clarified spec, without removing the existing `$ultragoal`, `$ralplan`, `$autopilot`, `$ralph`, and `$team` handoff options:
 - **`$ultragoal`** — default goal-mode follow-up for implementation or general goal-oriented follow-up specs that should be converted into durable Codex/OMX goals with sequential completion tracking.
 - **`$autoresearch-goal`** — use when the clarified context is a research project: a research question, reference/literature gathering, evaluator-backed analysis, or professor/critic-style deliverable.
@@ -377,7 +403,16 @@ Include these product-facing suggestions when they fit the clarified spec, witho
 Recommend `$ultragoal` as the default durable goal-mode follow-up because it supersedes Ralph for goal tracking. Preserve `$team` for coordinated parallel implementation and keep `$ralph` only as an explicit fallback for persistent single-owner execution/verification when the user specifically selects it.
-### 1. **`$ralplan` (Recommended)**
+### 1. **`$ultragoal` (Default durable execution follow-up)**
+- **Input Artifact:** `.omx/specs/deep-interview-{slug}.md` (optionally accompanied by the transcript/context snapshot for traceability)
+- **Invocation:** `$ultragoal create-goals --brief-file <spec-path>` followed by `$ultragoal complete-goals` in the active execution lane
+- **Consumer Behavior:** Convert the clarified spec into durable goal-mode work. Preserve intent, non-goals, decision boundaries, acceptance criteria, docs/terminology grounding, scenario-pressure findings, and residual-risk warnings as binding story constraints.
+- **Skipped / Already-Satisfied Stages:** Requirement interview, ambiguity clarification, doc/context preflight, and early intent-boundary elicitation
+- **Expected Output:** `.omx/ultragoal/brief.md`, `.omx/ultragoal/goals.json`, `.omx/ultragoal/ledger.jsonl`, implementation evidence, verification evidence, and final cleanup/review-gate evidence
+- **Best When:** The clarified spec is execution-ready or the user explicitly wants durable goal tracking as the next step
+- **Next Recommended Step:** Run the Ultragoal completion loop; launch `$team` only inside an active Ultragoal story when parallel lanes are warranted, and use `$ralph` only as an explicit fallback when the user asks for that legacy persistence mode
+### 2. **`$ralplan` (Recommended when architecture/test-shape review is still needed)**
 - **Input Artifact:** `.omx/specs/deep-interview-{slug}.md` (optionally accompanied by the transcript/context snapshot for traceability)
 - **Invocation:** `$plan --consensus --direct <spec-path>`
 - **Consumer Behavior:** Treat the deep-interview spec as the requirements source of truth. Do not repeat the interview by default; refine architecture/feasibility around the clarified intent and boundaries instead.
@@ -386,7 +421,7 @@ Recommend `$ultragoal` as the default durable goal-mode follow-up because it sup
 - **Best When:** Requirements are clear enough to stop interviewing, but architectural validation / consensus planning is still desirable
 - **Next Recommended Step:** Use the approved planning artifacts with `$ultragoal` as the default durable goal-mode follow-up (optionally with `$team` for parallel lanes); choose `$autoresearch-goal` for research validation or `$performance-goal` for measurable optimization, and use `$ralph` only as an explicit fallback when a narrow single-owner persistence loop is requested
-### 2. **`$autopilot`**
+### 3. **`$autopilot`**
 - **Input Artifact:** `.omx/specs/deep-interview-{slug}.md`
 - **Invocation:** `$autopilot <spec-path>`
 - **Consumer Behavior:** Use the deep-interview spec as the clarified execution brief. Preserve intent, non-goals, decision boundaries, and acceptance criteria as binding context for planning/execution.
@@ -395,7 +430,7 @@ Recommend `$ultragoal` as the default durable goal-mode follow-up because it sup
 - **Best When:** The clarified spec is already strong enough for direct planning + execution without an additional consensus gate
 - **Next Recommended Step:** Continue through autopilot's execution/QA/validation flow; if coordination-heavy execution emerges, prefer `$team` under a leader-owned `$ultragoal` ledger, using `$ralph` only as an explicit fallback when a narrow single-owner persistence loop is requested
-### 3. **`$ralph` (Explicit fallback only)**
+### 4. **`$ralph` (Explicit fallback only)**
 - **Input Artifact:** `.omx/specs/deep-interview-{slug}.md`
 - **Invocation:** `$ralph <spec-path>`
 - **Consumer Behavior:** Use the spec's acceptance criteria and boundary constraints as the persistence target. Do not reopen requirements discovery unless the user explicitly asks to refine further.
@@ -404,7 +439,7 @@ Recommend `$ultragoal` as the default durable goal-mode follow-up because it sup
 - **Best When:** The user explicitly asks for Ralph's persistent sequential completion pressure; otherwise use `$ultragoal` for durable goal tracking and completion checkpoints
 - **Next Recommended Step:** If this explicit fallback is selected, continue Ralph's persistence loop; if work expands into coordination-heavy lanes, hand off to `$team` under `$ultragoal` checkpointing rather than promoting Ralph as the next default
-### 4. **`$team`**
+### 5. **`$team`**
 - **Input Artifact:** `.omx/specs/deep-interview-{slug}.md`
 - **Invocation:** `$team <spec-path>`
 - **Consumer Behavior:** Treat the spec as shared execution context for coordinated parallel work. Preserve the clarified intent, non-goals, decision boundaries, and acceptance criteria as common lane constraints.
@@ -413,7 +448,7 @@ Recommend `$ultragoal` as the default durable goal-mode follow-up because it sup
 - **Best When:** The task is large, multi-lane, or blocker-sensitive enough to justify coordinated parallel execution instead of a single persistent loop
 - **Next Recommended Step:** Follow the team verification path when the coordinated execution phase finishes; checkpoint completion through `$ultragoal` by default, escalating to a separate Ralph loop only when the user explicitly asks for that persistent verification/fix owner
-### 5. **Refine further**
+### 6. **Refine further**
 - **Input Artifact:** Existing transcript, context snapshot, and current spec draft
 - **Invocation:** Continue the interview loop
 - **Consumer Behavior:** Re-enter questioning to resolve the highest-leverage remaining uncertainty
@@ -437,6 +472,7 @@ Recommend `$ultragoal` as the default durable goal-mode follow-up because it sup
 - Use `omx state write/read --input '<json>' --json` for resumable mode state; `state_write` / `state_read` are explicit MCP compatibility fallbacks only
 - If the interview cannot ask a required `omx question` round, persist the blocker as terminal state with `active: false` and `current_phase: "blocked"`; do not write a terminal blocked phase with `active: true`
 - Read/write context snapshots under `.omx/context/`
+- Read applicable repo docs/rules/context during preflight; write durable docs, glossary, ADR, or memory updates only when the user explicitly opts in and the content is public-safe
 - Record whether the oversized-context summary gate is not needed, pending, or satisfied before any scoring or handoff step
 - Save transcript/spec artifacts under `.omx/interviews/` and `.omx/specs/`
 </Tool_Usage>
@@ -460,7 +496,11 @@ Recommend `$ultragoal` as the default durable goal-mode follow-up because it sup
 - [ ] Transcript written to `.omx/interviews/{slug}-{timestamp}.md`
 - [ ] Spec written to `.omx/specs/deep-interview-{slug}.md`
 - [ ] Brownfield questions use evidence-backed confirmation when applicable
-- [ ] Handoff options provided (`$ralplan`, `$autopilot`, `$ralph`, `$team`) plus context-sensitive goal-mode suggestions (`$ultragoal`, `$autoresearch-goal`, `$performance-goal`) when applicable
+- [ ] Brownfield preflight inspected applicable repo docs/rules/context before user-facing questions
+- [ ] Fuzzy or conflicting terminology was challenged against repo language/current code behavior when applicable
+- [ ] Scenario-based edge-case grilling was used when boundary ambiguity would materially affect implementation
+- [ ] Durable docs/ADR/memory updates, if any, were explicitly opted into and public-safe
+- [ ] Handoff options provided (`$ultragoal`, `$ralplan`, `$autopilot`, `$ralph`, `$team`) plus context-sensitive goal-mode suggestions (`$autoresearch-goal`, `$performance-goal`) when applicable
 - [ ] No direct implementation performed in this mode
 </Final_Checklist>
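
Reviewer note (not part of the package): the Docs/Terminology Ledger added in this diff is specified only as prose bullets. A minimal sketch of how it might be held as data follows. The field names (`sourcesInspected`, `canonicalTerms`, `termConflicts`, `docCodeMismatches`, `optionalFollowUps`), the helper `pendingHumanDecisions`, and the example paths are all hypothetical and do not appear in the skill file.

```javascript
// Illustrative only: the SKILL.md diff describes the ledger in prose and defines no schema.
function createLedger() {
  return {
    sourcesInspected: [],  // repo docs/rules/context sources, with path references
    canonicalTerms: [],    // terms the repo already uses, plus terms to avoid or disambiguate
    termConflicts: [],     // user terms that conflict with docs or current code behavior
    docCodeMismatches: [], // mismatches that need a human decision before implementation
    optionalFollowUps: [], // durable-doc follow-ups that may be proposed but not auto-applied
  };
}

// Hypothetical helper: list doc/code mismatches that still lack a human decision.
function pendingHumanDecisions(ledger) {
  return ledger.docCodeMismatches.filter((m) => m.decision === null);
}

const ledger = createLedger();
ledger.sourcesInspected.push("AGENTS.md", "README.md");
ledger.docCodeMismatches.push({
  docPath: "docs/example-contract.md", // hypothetical path
  codePath: "src/example.ts",          // hypothetical path
  summary: "made-up example: doc and code describe different behavior",
  decision: null,                      // filled in once the user picks which source governs
});

console.log(pendingHumanDecisions(ledger).length);
```

Under this sketch, a non-empty result from `pendingHumanDecisions` corresponds to the rule that a confirmation question naming both sources is asked before any handoff.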

package/skills/ralph/SKILL.md CHANGED

@@ -26,14 +26,14 @@ Ralph is a persistence loop that keeps working on a task until it is fully compl
 </Do_Not_Use_When>
 <Why_This_Exists>
-Complex tasks often fail silently: partial implementations get declared "done", tests get skipped, edge cases get forgotten. Ralph prevents this by looping until work is genuinely complete, requiring fresh verification evidence before allowing completion, and using tiered architect review to confirm quality.
+Complex tasks often fail silently: partial implementations get declared "done", tests get skipped, edge cases get forgotten. Ralph prevents this by looping until work is genuinely complete, requiring fresh verification evidence before allowing completion, and using explicit architect native-subagent verification to confirm quality.
 </Why_This_Exists>
 <Execution_Policy>
 - Fire independent agent calls simultaneously -- never wait sequentially for independent work
 - Use `run_in_background: true` for long operations (installs, builds, test suites)
-- Always pass the `model` parameter explicitly when delegating to agents
-- Read `docs/shared/agent-tiers.md` before first delegation to select correct agent tiers
+- Always set `agent_type` when spawning native subagents; use `reasoning_effort` for per-dispatch intensity when needed
+- Preserve legacy Ralph tier intent through native reasoning effort: LOW -> `low`, STANDARD -> `medium`, THOROUGH -> `xhigh`
 - Deliver the full implementation: no scope reduction, no partial completion, no deleting tests to make them pass
 - Apply the shared workflow guidance pattern: outcome-first framing, concise visible updates for multi-step execution, local overrides for the active workflow branch, validation proportional to risk, explicit stop rules, and automatic continuation for safe reversible steps. Ask only for material, destructive, credentialed, external-production, or preference-dependent branches.
 - Integrate with Codex goal mode when goal tools are available: inspect the active thread goal with `get_goal`, preserve it as the top-level stop condition, and only call `update_goal({status: "complete"})` after a Ralph completion audit proves the objective is actually achieved.
@@ -54,10 +54,10 @@ Complex tasks often fail silently: partial implementations get declared "done",
    - Do not begin Ralph execution work (delegation, implementation, or verification loops) until snapshot grounding exists. If forced to proceed quickly, note explicit risk tradeoffs.
 1. **Review progress**: Check TODO list and any prior iteration state
 2. **Continue from where you left off**: Pick up incomplete tasks
-3. **Delegate in parallel**: Route tasks to specialist agents at appropriate tiers
-   - Simple lookups: LOW tier -- "What does this function return?"
-   - Standard work: STANDARD tier -- "Add error handling to this module"
-   - Complex analysis: THOROUGH tier -- "Debug this race condition"
+3. **Delegate in parallel**: Route tasks to specialist native agents with explicit `agent_type` and appropriate `reasoning_effort`
+   - Simple lookups: `reasoning_effort="low"` -- "What does this function return?"
+   - Standard work: `reasoning_effort="medium"` -- "Add error handling to this module"
+   - Complex analysis: `reasoning_effort="xhigh"` -- "Debug this race condition"
    - When Ralph is entered as a ralplan follow-up, start from the approved **available-agent-types roster** and make the delegation plan explicit: implementation lane, evidence/regression lane, and final sign-off lane using only known agent types
 4. **Run long operations in background**: Builds, installs, test suites use `run_in_background: true`
 5. **Visual task gate (when screenshot/reference images are present)**:
@@ -72,11 +72,11 @@ Complex tasks often fail silently: partial implementations get declared "done",
    b. Run verification (test, build, lint)
    c. Read the output -- confirm it actually passed
    d. Check: zero pending/in_progress TODO items
-7. **Architect verification** (tiered):
-   - <5 files, <100 lines with full tests: STANDARD tier minimum (architect role)
-   - Standard changes: STANDARD tier (architect role)
-   - >20 files or security/architectural changes: THOROUGH tier (architect role)
-   - Ralph floor: always at least STANDARD, even for small changes
+7. **Architect verification** (native role):
+   - <5 files, <100 lines with full tests: `task(agent_type="architect", reasoning_effort="medium", prompt="...")` minimum
+   - Standard changes: `task(agent_type="architect", reasoning_effort="medium", prompt="...")`
+   - >20 files or security/architectural changes: `task(agent_type="architect", reasoning_effort="xhigh", prompt="...")`
+   - Ralph floor: always run an explicit `architect` native subagent, even for small changes
 7.5 **Mandatory Deslop Pass**:
    - After Step 7 passes, run `oh-my-codex:ai-slop-cleaner` on **all files changed during the Ralph session**.
    - Scope the cleaner to **changed files only**; do not widen the pass beyond Ralph-owned edits.
@@ -87,7 +87,7 @@ Complex tasks often fail silently: partial implementations get declared "done",
    - If post-deslop regression fails, roll back cleaner changes or fix and retry. Then rerun Step 7.5 and Step 7.6 until the regression is green.
    - Do not proceed to completion until post-deslop regression is green (unless `--no-deslop` explicitly skipped the deslop pass).
 8. **On approval**: If Codex goal mode is active, call `update_goal({status: "complete"})` before `/cancel`; report final elapsed time and token-budget usage when the tool returns it. Then run `/cancel` to cleanly exit and clean up all state files.
-9. **On rejection**: Fix the issues raised, then re-verify at the same tier
+9. **On rejection**: Fix the issues raised, then re-verify with the same `agent_type` and `reasoning_effort` profile
 </Steps>
 <Tool_Usage>
@@ -150,11 +150,11 @@ Use the CLI-first state surface for Ralph lifecycle state (`omx state write/read
 <Good>
 Correct parallel delegation:
 ```
-delegate(role="executor", tier="LOW", task="Add type export for UserConfig")
-delegate(role="executor", tier="STANDARD", task="Implement the caching layer for API responses")
-delegate(role="executor", tier="THOROUGH", task="Refactor auth module to support OAuth2 flow")
+task(agent_type="executor", reasoning_effort="low", prompt="Add type export for UserConfig")
+task(agent_type="executor", reasoning_effort="medium", prompt="Implement the caching layer for API responses")
+task(agent_type="executor", reasoning_effort="xhigh", prompt="Refactor auth module to support OAuth2 flow")
 ```
-Why good: Three independent tasks fired simultaneously at appropriate tiers.
+Why good: Three independent tasks fired simultaneously while explicitly selecting the installed `executor` native role, so the UI/tracker does not show default subagents; legacy tier intent is preserved through native reasoning effort (`LOW` -> `low`, `STANDARD` -> `medium`, `THOROUGH` -> `xhigh`).
 </Good>
 <Good>
@@ -163,7 +163,7 @@ Correct verification before completion:
 1. Run: npm test           → Output: "42 passed, 0 failed"
 2. Run: npm run build      → Output: "Build succeeded"
 3. Run: lsp_diagnostics    → Output: 0 errors
-4. Delegate to architect at STANDARD tier  → Verdict: "APPROVED"
+4. task(agent_type="architect", reasoning_effort="medium", prompt="verify completion") → Verdict: "APPROVED"
 5. Run /cancel
 ```
 Why good: Fresh evidence at each step, architect verification, then clean exit.
@@ -178,9 +178,9 @@ Why bad: Uses "should" and "look good" -- no fresh test/build output, no archite
 <Bad>
 Sequential execution of independent tasks:
 ```
-delegate(executor, LOW, "Add type export") → wait →
-delegate(executor, STANDARD, "Implement caching") → wait →
-delegate(executor, THOROUGH, "Refactor auth")
+task(agent_type="executor", reasoning_effort="low", prompt="Add type export") → wait →
+task(agent_type="executor", reasoning_effort="medium", prompt="Implement caching") → wait →
+task(agent_type="executor", reasoning_effort="xhigh", prompt="Refactor auth")
 ```
 Why bad: These are independent tasks that should run in parallel, not sequentially.
 </Bad>
@@ -200,7 +200,7 @@ Why bad: These are independent tasks that should run in parallel, not sequential
 - [ ] Fresh test run output shows all tests pass
 - [ ] Fresh build output shows success
 - [ ] lsp_diagnostics shows 0 errors on affected files
-- [ ] Architect verification passed (STANDARD tier minimum)
+- [ ] Architect verification passed through explicit `task(agent_type="architect", reasoning_effort="medium"...)` minimum
 - [ ] Codex goal-mode completion audit passed, and `update_goal({status: "complete"})` was called when an active goal exists
 - [ ] ai-slop-cleaner pass completed on changed files (or --no-deslop specified)
 - [ ] Post-deslop regression tests pass
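
Reviewer note (not part of the package): this diff replaces the legacy LOW/STANDARD/THOROUGH tiers with `reasoning_effort` values and restates the Step 7 architect thresholds in those terms. The rules can be sketched as a small lookup plus a selector. The names `TIER_TO_EFFORT`, `architectEffort`, and `architectTask` are hypothetical. The returned object only mirrors the `task(...)` arguments shown in the skill text and is not a real API call.

```javascript
// Legacy Ralph tier -> native reasoning effort, as stated in the Execution_Policy hunk.
const TIER_TO_EFFORT = { LOW: "low", STANDARD: "medium", THOROUGH: "xhigh" };

// Step 7 rule as read from the diff: "xhigh" for >20 files or security/architectural
// changes; "medium" otherwise, which is also the minimum for small changes.
function architectEffort({ filesChanged, securityOrArchitectural = false }) {
  return filesChanged > 20 || securityOrArchitectural ? "xhigh" : "medium";
}

// Hypothetical helper: assemble the arguments for an architect verification dispatch.
function architectTask(change, prompt) {
  return {
    agent_type: "architect",
    reasoning_effort: architectEffort(change),
    prompt,
  };
}

console.log(architectTask({ filesChanged: 3 }, "verify completion"));
```

The sketch assumes a change between 5 and 20 files counts as a "standard change"; the skill text does not spell out that range explicitly.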

package/skills/ultraqa/SKILL.md CHANGED

@@ -58,6 +58,15 @@ The matrix must include normal-path coverage plus adversarial dynamic e2e scenar
 - Validate exit codes and output semantics; do not trust success-looking text alone.
 - Do not delete, rewrite, or mask unrelated user work. Capture dirty-worktree evidence before and after generated harness work.
+### Temporary Harness Generation Guardrails
+Generated harnesses are part of the QA evidence chain; until setup succeeds, they are evidence about the harness apparatus, not product behavior.
+- **Use absolute repo imports for built artifacts.** When a harness runs from `/tmp` or another scratch directory but imports repository code, resolve the repository root explicitly from the verified repo cwd and import built modules with an absolute path or `pathToFileURL(join(repoRoot, "dist", ...)).href`. Never rely on `./dist/...` from the harness file's temporary directory.
+- **Use a safe file writer for JS/TS harness bodies.** Prefer a small Node/Python writer or another non-interpolating file-write mechanism for harness source that contains backticks, `${...}`, shell metacharacters, or prompt-injection strings. If a shell heredoc is unavoidable, quote the delimiter and verify the written file before execution; do not use interpolating heredocs for JavaScript assertions.
+- **Sanitize OMX runtime env for isolated probes.** When the scenario creates a temporary repo/state tree or intentionally checks local isolation, run the probe with `OMX_ROOT` and `OMX_STATE_ROOT` unset (for example `env -u OMX_ROOT -u OMX_STATE_ROOT ...`) so ambient boxed runtime state cannot redirect reads/writes away from the scenario fixture.
+- **Classify harness setup failures separately.** If a generated harness fails before exercising product behavior because of import paths, shell interpolation, environment leakage, or fixture construction, record it as harness debris, fix the harness, and rerun the scenario before declaring a product defect.
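
Reviewer note (not part of the package): the first three guardrails above can be sketched in Node as follows. The helper names (`builtModuleUrl`, `writeHarness`, `sanitizedEnv`), the `probe.mjs` file name, and the `cli/index.js` module path are hypothetical. The sketch also assumes the process cwd is the verified repository root.

```javascript
const { mkdtempSync, readFileSync, writeFileSync } = require("node:fs");
const { tmpdir } = require("node:os");
const { join } = require("node:path");
const { pathToFileURL } = require("node:url");

// Absolute file:// URL for a built module under <repoRoot>/dist, so a harness
// living in a scratch directory never depends on a relative ./dist path.
function builtModuleUrl(repoRoot, ...segments) {
  return pathToFileURL(join(repoRoot, "dist", ...segments)).href;
}

// Write harness source verbatim (no shell involved, so nothing is interpolated),
// then read it back to verify the file before anything executes it.
function writeHarness(filePath, source) {
  writeFileSync(filePath, source, "utf8");
  if (readFileSync(filePath, "utf8") !== source) {
    throw new Error("harness file differs from intended source: " + filePath);
  }
  return filePath;
}

// Copy of an environment with OMX_ROOT and OMX_STATE_ROOT removed.
function sanitizedEnv(env = process.env) {
  const { OMX_ROOT, OMX_STATE_ROOT, ...rest } = env;
  return rest;
}

const repoRoot = process.cwd(); // assumption: cwd is the verified repo root
const scratch = mkdtempSync(join(tmpdir(), "harness-"));
// The backticks and ${url} below are plain characters inside a double-quoted string.
const source =
  "const url = " + JSON.stringify(builtModuleUrl(repoRoot, "cli", "index.js")) + ";\n" +
  "console.log(`would import ${url}`);\n";
const harnessPath = writeHarness(join(scratch, "probe.mjs"), source);
console.log(harnessPath);
```

The object returned by `sanitizedEnv()` could be passed as the `env` option of `child_process.spawnSync` when launching the probe, which has the same effect as the `env -u OMX_ROOT -u OMX_STATE_ROOT ...` form mentioned above.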
 ## Cycle Workflow
 ### Cycle N (Max 5)