npm - oh-my-opencode - Versions diffs - 4.8.1 → 4.9.1 - Mend

oh-my-opencode 4.8.1 → 4.9.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (509) hide show

package/packages/omo-codex/plugin/components/ulw-loop/skills/ulw-loop/references/full-workflow.md CHANGED Viewed

@@ -15,20 +15,20 @@ Prove EVERY success criterion with captured observable evidence from a real-usag
 TESTS ALONE NEVER PROVE DONE. A green test suite is supporting evidence, not completion proof.
 Audit each pass, fail, block, steering change, and checkpoint in `.omo/ulw-loop/ledger.jsonl`.
-## Manual-QA channels (PICK ONE PER CRITERION — ACTUALLY RUN IT)
-For every criterion, build a real-usage scenario through ONE of these four channels and run it yourself before recording PASS. The full test suite being green is NEVER verification on its own.
+## Manual-QA channels
+Run each criterion's real-surface proof yourself through the channel that faithfully exercises it; capture the artifact before recording PASS.
 1. **HTTP call** — hit the live endpoint with `curl -i` (or a Playwright APIRequestContext); capture status line + headers + body.
 2. **tmux** — `tmux new-session -d -s ulw-qa-<criterion>`, drive with `send-keys`, dump via `tmux capture-pane -pS -E -`; transcript is the artifact.
 3. **Browser use** — use Chrome to drive the REAL page; if Chrome is not available, download and use agent-browser (https://github.com/vercel-labs/agent-browser). Capture action log + screenshot path. Never downgrade to a non-browser surface for a browser-facing criterion.
 4. **Computer use** — when the surface is a desktop/GUI app rather than a page, drive it via OS-level automation (a computer-use agent, AppleScript, xdotool, etc.) against the running app; capture action log + screenshot. Use this for any non-browser GUI criterion.
-Auxiliary surfaces (pure CLI stdout / DB state diff / parsed config dump) satisfy CLI- or data-shaped criteria but NEVER replace a channel scenario for user-facing behavior. `--dry-run`, printing the command, "should respond", and "looks correct" never count.
+Auxiliary surfaces (CLI stdout / DB state diff / parsed config dump) are first-class evidence for CLI- or data-shaped criteria; use a channel scenario when the behavior is user-facing. `--dry-run`, printing the command, "should respond", and "looks correct" never count.
 ## Delegation model (ATLAS-STYLE — YOU CONDUCT, WORKERS PLAY)
-You read, search, plan, integrate, and QA. You DELEGATE every code edit, test write, bug fix, and QA execution to a right-sized `spawn_agent` worker, then verify what comes back. Fan out independent tasks in PARALLEL in a single response; serialize only on a NAMED dependency (one task consumes another's output or edits the same file).
+You read, search, plan, integrate, and QA. You DELEGATE every code edit, test write, bug fix, and QA execution to a right-sized `multi_agent_v1.spawn_agent` worker, then verify what comes back. Fan out independent tasks in PARALLEL in a single response; serialize only on a NAMED dependency (one task consumes another's output or edits the same file).
-Size each worker to the task. Put the intended role, rigor level, and specialty inside the worker `message`; the Codex `spawn_agent` schema only accepts `task_name`, `message`, and `fork_turns`.
+Size each worker to the task. Put the intended role, rigor level, and specialty inside the worker `message`.
 | Task shape | Message instruction |
 |---|---|
@@ -42,16 +42,16 @@ Size each worker to the task. Put the intended role, rigor level, and specialty
 For reviewer work, use a self-contained reviewer assignment, tight scope, and explicit verification in `message`. Never spawn a context-only child for review.
-Every worker message MUST carry: goal + exact files in scope; the baseline characterization test pinning current behavior when the task touches existing code, then the failing test / reproduction required before production code; constraints + project rules; the verification commands to run; the ONE Manual-QA channel and the exact evidence artifact to capture; for git-tracked edits, require `git-master` plus repository-wide and touched-path commit history inspection before commit. Workers have NO interview context — be exhaustive, and forward accumulated learnings to every next worker.
+Every worker message MUST carry: goal + exact files in scope; the PIN + failing-first proof required before production code (Per-Criterion Cycle step 3); constraints + project rules; the verification commands to run; the ONE Manual-QA channel and the exact evidence artifact to capture; for git-tracked edits, require `git-master` plus repository-wide and touched-path commit history inspection before commit. Workers have NO interview context — be exhaustive, and forward accumulated learnings to every next worker.
 Codex subagent reliability:
-- Start every `spawn_agent` message with `TASK: <imperative assignment>`, then name `DELIVERABLE`, `SCOPE`, and `VERIFY`. State that it is an executable assignment, not a context handoff.
-- Prefer `fork_turns: "none"` unless full history is truly required; paste only the context the child needs. Full-history forks can make the child continue old parent context instead of the delegated task.
-- Plan and reviewer agents may run for a long time; spawn them in the background, keep doing independent root work, and poll with short wait_agent cycles. Never use a single long blocking wait for them.
+- Start every `multi_agent_v1.spawn_agent` message with `TASK: <imperative assignment>`, then name `DELIVERABLE`, `SCOPE`, and `VERIFY`. State that it is an executable assignment, not a context handoff.
+- Use `fork_context: false` unless full history is truly required; paste only the context the child needs. Full-history forks can make the child continue old parent context instead of the delegated task.
+- Plan and reviewer agents may run for a long time; spawn them in the background, keep doing independent root work, and poll with short `multi_agent_v1.wait_agent` cycles. Never use a single long blocking wait for them.
 - For work likely to exceed one wait cycle, require the child to send `WORKING: <task> - <current phase>` before long reading, testing, or review passes, and `BLOCKED: <reason>` only when it cannot progress.
 - While any child is active, keep the parent visibly alive with active subagent count, agent names, latest `WORKING:` phase, and whether the parent is waiting for mailbox updates.
-- Track spawned agent names locally. Use `wait_agent` for mailbox signals, not proof of completion. A timeout only means no new mailbox update arrived; after a timeout, run a single `list_agents` check for the named child when you need reassurance. If it is running or its latest message is `WORKING:`, treat it as alive. Do not use `list_agents` as a polling loop or status feed; it can replay large payloads.
-- Fallback only when the child is completed without the deliverable, ack-only after followup, explicitly `BLOCKED:`, or no longer running. Then send `TASK STILL ACTIVE: return <deliverable> or BLOCKED: <reason>` when a targeted followup can still recover the lane; otherwise record inconclusive, do not count it as pass/review approval, close if safe, and respawn a smaller `fork_turns: "none"` task with the missing deliverable.
+- Track spawned agent names locally. Use `multi_agent_v1.wait_agent` for mailbox signals, not proof of completion. A timeout only means no new mailbox update arrived. Treat a running child as alive.
+- Fallback only when the child is completed without the deliverable, ack-only after followup, explicitly `BLOCKED:`, or no longer running. Then send `TASK STILL ACTIVE: return <deliverable> or BLOCKED: <reason>` when a targeted followup can still recover the lane; otherwise record inconclusive, do not count it as pass/review approval, close if safe, and respawn a smaller `fork_context: false` task with the missing deliverable.
 ## Artifacts
 - `.omo/ulw-loop/brief.md`: original brief and durable constraints.
@@ -117,16 +117,14 @@ Write state through the CLI path. Do not hand-edit state files.
 ### 2. Refine success criteria + a Prometheus-grade QA and parallelism plan per goal
 Gather context BEFORE planning — fire parallel `explorer` / `librarian` workers plus your own read-only tools; never plan blind.
-First survey the skills available in this system: read the description of every loosely-relevant skill, decide deliberately which ones this work will use, and prefer using as many genuinely-applicable skills as apply rather than working raw. Then size the scope: count distinct surfaces, files, and steps. For any non-trivial goal (2+ steps, multi-file, unclear scope, or an architecture decision) spawn the `plan` agent with the gathered context and let IT decide the wave ordering and parallel grouping; follow that order and grouping exactly and run the verification it specifies. Only a genuinely trivial single-step goal may skip the plan agent.
-Define pass/fail acceptance criteria before launching execution lanes. Include the command, artifact, or manual check that will prove success.
-Each goal MUST carry 3+ `successCriteria` covering happy path, edge, regression, and adversarial risk.
-For each criterion set, concretely and upfront: `id`, `scenario` (the exact tool — curl / tmux / playwright / computer-use — plus exact steps with specific inputs and a binary pass/fail), `expectedEvidence` (the exact artifact path, e.g. `.omo/ulw-loop/evidence/<goal>-<criterion>.<ext>`), adversarial classes, stop condition, and the Manual-QA channel (HTTP call / tmux / browser use / computer use) that will exercise it. Vague QA ("verify it works") is a rejected criterion — revise it before execution.
-Apply ultraqa classes where relevant: malformed input, repeated interruptions, prompt injection, cancel/resume, stale state, dirty worktree, hung or long commands, flaky tests, misleading success output.
+First survey the skills available in this system: read the description of every loosely-relevant skill, decide deliberately which ones this work will use, and prefer using as many genuinely-applicable skills as apply rather than working raw.
+Then run tier triage per goal and record it with an `annotate_ledger` steering entry. Default is LIGHT — a narrow change inside existing layers. Take HEAVY only on a fact you can point to: a new module / abstraction / domain model; auth, security, or session; an external integration; a DB schema or migration; concurrency, transaction boundaries, or cache invalidation; a cross-domain refactor; or the user signaled care or demanded review. When unsure, take HEAVY; upgrade the moment a HEAVY fact surfaces and never downgrade mid-run.
+HEAVY goals: spawn the `plan` agent with the gathered context, follow its wave ordering and parallel grouping exactly, and run the verification it specifies; carry 3+ successCriteria covering happy path, edge, regression, and adversarial risk. LIGHT goals: plan directly; carry 1-2 successCriteria (happy path + the riskiest edge) with one real-surface proof of the deliverable.
+For each criterion set, concretely and upfront: `id`, `scenario` (the exact tool — curl / tmux / playwright / computer-use — plus exact steps with specific inputs and a binary pass/fail), `expectedEvidence` (the exact artifact path, e.g. `.omo/ulw-loop/evidence/<goal>-<criterion>.<ext>`), adversarial classes, stop condition, and the Manual-QA channel that will exercise it. Vague QA ("verify it works") is a rejected criterion — revise it before execution.
+A criterion's adversarial classes are the ultraqa classes a fact about the change triggers: malformed input, prompt injection, cancel/resume, stale state, dirty worktree, hung or long commands, flaky tests, misleading success output, repeated interruptions. Record untriggered classes as not-applicable in one line.
 Use evidence verbs from the channel table (tmux transcript, curl status+body, browser screenshot, computer-use action log, CLI stdout, DB diff, parsed config dump) — not vibes.
-"Tests pass" is supporting signal, NEVER completion proof. Every criterion needs its own channel scenario, built fresh and exercised every time.
-**Plan for maximum parallelism.** Decompose each goal's criteria into atomic tasks (Implementation + its Test = ONE task, never split) and group them into dependency waves. Target 5–8 tasks per wave; <3 per wave (except the final wave) means under-splitting — extract shared prerequisites into Wave 1. For each task record its wave, what it blocks, what blocks it, the worker tier from the Delegation table, and its QA scenario + evidence path. Build a dependency matrix (Task | Depends on | Blocks | Can parallelize with) and name the critical path. Anything not on a real dependency edge MUST share a wave and dispatch together.
-Record manual QA notes when behavior is user-visible.
+**Plan for maximum parallelism (HEAVY goals).** Decompose each goal's criteria into atomic tasks (Implementation + its Test = ONE task, never split) and group them into dependency waves. Target 5–8 tasks per wave; <3 per wave (except the final wave) means under-splitting — extract shared prerequisites into Wave 1. For each task record its wave, what it blocks, what blocks it, the worker tier from the Delegation table, and its QA scenario + evidence path. Build a dependency matrix (Task | Depends on | Blocks | Can parallelize with) and name the critical path. Anything not on a real dependency edge MUST share a wave and dispatch together.
 Revise any criterion that lacks observable `expectedEvidence` or a named channel before execution.
 ### 3. Inspect state
@@ -152,11 +150,11 @@ Loop per goal. Cap at 5 cycles per goal. Cap identical same-criterion failures a
 ### Per-Criterion Cycle
 1. PLAN: read `criterion.scenario`, `criterion.expectedEvidence`, prior ledger entries, and safety bounds. Identify which tasks in the current wave are independent.
 2. Register atomic todos via `update_plan` — one ultra-granular step per action, `path: <action> for <criterion> - verify by <check>`. Call `update_plan` on every transition (start → `in_progress`, finish → `completed`); exactly one `in_progress`, mark completed immediately, never batch, never let the rendered plan lag behind reality.
-3. DELEGATE-IN-PARALLEL: dispatch every independent task in the wave at once via right-sized `spawn_agent` workers (Delegation table). Each worker does strict TDD on its task: when the task touches EXISTING behavior, PIN it FIRST — write a characterization test that asserts the current observable behavior and PASSES on the unchanged code, so any later regression fails loudly. Then RED (the new failing assertion must fail for the RIGHT reason — no syntax/import error), then the SMALLEST GREEN change; before GREEN work that depends on external review, PR, issue, or branch state, refresh current branch/PR/issue state, preserve existing ordering/policy, and separate compatibility detection from policy changes unless the goal explicitly asks to change policy. A GREEN needing >~20 lines means the test was too coarse — instruct a split. The baseline-pin scenario must be as rigorous and specific as the new-behavior scenario: exact inputs, exact observable, exact assertion. Serialize only on a NAMED dependency.
+3. DELEGATE-IN-PARALLEL: dispatch every independent task in the wave at once via right-sized `multi_agent_v1.spawn_agent` workers (Delegation table). Each worker captures evidence failing-first: when the task touches EXISTING behavior, PIN it FIRST — a characterization test that asserts the current observable behavior and PASSES on the unchanged code, as rigorous as the new-behavior scenario (exact inputs, exact observable, exact assertion). Then RED through the cheapest faithful channel — a unit test where a seam exists, an integration/e2e test where the behavior lives in wiring, or the criterion's scenario captured failing when no test seam exists — failing for the RIGHT reason (no syntax/import error). A test that mirrors its implementation (mock-call assertions, pinned constants, cannot fail under plausible regression) is not evidence; use the scenario as the failing proof instead. Then the SMALLEST GREEN change; before GREEN work that depends on external review, PR, issue, or branch state, refresh current branch/PR/issue state, preserve existing ordering/policy, and separate compatibility detection from policy changes unless the goal explicitly asks to change policy. A GREEN far larger than the criterion implies means the proof was too coarse — instruct a split. Serialize only on a NAMED dependency.
 4. INTEGRATE + CRITICAL SELF-QA + GIT CHECKPOINT (EVERY WORKER RETURN): do NOT trust the worker's report. Read the diff yourself, re-run its tests, and run LSP diagnostics on the changed files. Treat "done" as a claim to disprove. If the diff drifts, the test is hollow, or evidence is missing, RESPAWN the worker with the specific failure context. Once the work unit is verified, use `git-master` before staging: inspect recent repository commits and touched-path history to infer commit language, Conventional Commit scope, message shape, and unit size. Stage only that unit's files and commit in the observed style; do not carry verified work forward into a later omnibus commit. If no git-tracked files changed or committing is unsafe, record the no-commit reason as evidence. Forward every finding/learning to subsequent workers.
-5. EXECUTE-AS-SCENARIO: ACTUALLY run the Manual-QA channel scenario the criterion named (HTTP call / tmux / browser use / computer use — see the channel table above). Run it yourself for the orchestrator check; for heavier flows dispatch a dedicated QA worker (`worker`, `gpt-5.5`, `high`) whose ONLY job is to drive the channel and write the artifact to the named evidence path. The unit suite being green is NEVER substitute. If the scenario FAILS, respawn the implementing worker with the captured failure — do not hand-patch around it.
+5. EXECUTE-AS-SCENARIO: ACTUALLY run the Manual-QA scenario the criterion named (channel table above). Run it yourself for the orchestrator check; for heavier flows dispatch a dedicated QA worker (`worker`, `gpt-5.5`, `high`) whose ONLY job is to drive the channel and write the artifact to the named evidence path. If the scenario FAILS, respawn the implementing worker with the captured failure — do not hand-patch around it.
 6. CAPTURE: collect the observable artifact path: transcript, stdout, screenshot, assertion, status+body, diff, or parsed dump. No artifact written at the evidence path — not done; record BLOCKED and respawn QA.
-7. CLEAN (PAIRED, NEVER SKIP): tear down every runtime artifact step 5 spawned BEFORE recording — server PIDs (`kill`, verify `kill -0` fails), `tmux` sessions (`tmux kill-session -t ulw-qa-<criterion>`; confirm `tmux ls`), browser / Playwright contexts (`.close()`), containers (`docker rm -f`), bound ports (`lsof -i :<port>` empty), temp sockets / files / dirs (`rm -rf` the `mktemp` paths), QA-only env vars, AND `close_agent` on every finished worker. Register each teardown as its own todo the moment the QA spawns the resource (scripts, tmux assets, browsers / agent-browser sessions, PIDs, ports) so none is forgotten. Embed a one-line cleanup receipt in the evidence string, e.g. `cleanup: killed 12345; tmux kill-session ulw-qa-foo; rm -rf /tmp/ulw.aB12cD; close_agent w-3`. Missing receipt → record BLOCKED, not PASS.
+7. CLEAN (PAIRED, NEVER SKIP): tear down every runtime artifact step 5 spawned BEFORE recording — server PIDs (`kill`, verify `kill -0` fails), `tmux` sessions (`tmux kill-session -t ulw-qa-<criterion>`; confirm `tmux ls`), browser / Playwright contexts (`.close()`), containers (`docker rm -f`), bound ports (`lsof -i :<port>` empty), temp sockets / files / dirs (`rm -rf` the `mktemp` paths), QA-only env vars, AND `multi_agent_v1.close_agent` on every finished worker. Register each teardown as its own todo the moment the QA spawns the resource (scripts, tmux assets, browsers / agent-browser sessions, PIDs, ports) so none is forgotten. Embed a one-line cleanup receipt in the evidence string, e.g. `cleanup: killed 12345; tmux kill-session ulw-qa-foo; rm -rf /tmp/ulw.aB12cD; multi_agent_v1.close_agent w-3`. Missing receipt → record BLOCKED, not PASS.
 8. RECORD exactly one result:
    - PASS: `omo ulw-loop record-evidence --goal-id <id> --criterion-id <id> --status pass --evidence "<observable> | <cleanup receipt>" --json`
    - FAIL: `omo ulw-loop record-evidence --goal-id <id> --criterion-id <id> --status fail --evidence "<observable> | <cleanup receipt>" --notes "<diagnosis>" --json`
@@ -178,7 +176,7 @@ Trigger only when one goal remains and all its criteria are passing.
 1. Run targeted verification for changed behavior.
 2. Run `ai-slop-cleaner` on changed files. If no relevant edits exist, record a passed no-op cleaner report.
 3. Rerun verification after cleanup.
-4. Judge the change size. Spawn a rigorous reviewer with `spawn_agent({"task_name":"final_verification_review","message":"TASK: act as a rigorous final verification reviewer. DELIVERABLE: approve or cite blockers. SCOPE: <changed files and goal>. VERIFY: inspect diff and verification evidence.","fork_turns":"none"})` only when the work is large or risky (multi-file, cross-cutting, new architecture, security/data surfaces, or you are unsure it is sound); for a small, local, low-risk change, do the review yourself and record `codeReview` with `evidence` starting `UNCONDITIONAL APPROVAL` plus a one-line justification of why the change was small enough to self-review.
+4. HEAVY tier — or any goal you are unsure is sound — spawns a rigorous reviewer with `multi_agent_v1.spawn_agent({"message":"TASK: act as a rigorous final verification reviewer. DELIVERABLE: approve or cite blockers. SCOPE: <changed files and goal>. VERIFY: inspect diff and verification evidence.","fork_context":false})`. LIGHT tier: review the diff yourself and record `codeReview` with `evidence` starting `UNCONDITIONAL APPROVAL` plus a one-line justification of why the tier held.
 5. Clean review means `codeReview.recommendation == "APPROVE"` and `codeReview.architectStatus == "CLEAR"`.
 6. If review is non-clean, run `omo ulw-loop record-review-blockers --goal-id <id> --title "<...>" --objective "<...>" --evidence "<review findings>" --codex-goal-json <snapshot> --json`.
 7. If clean, checkpoint final completion:
@@ -194,6 +192,7 @@ omo ulw-loop checkpoint --goal-id <id> --status complete --evidence "<e2e eviden
   "criteriaCoverage": { "totalCriteria": N, "passCount": N, "adversarialClassesCovered": ["malformed_input", "..."] }
 }
 ```
+A LIGHT goal with no triggered adversarial class records `"adversarialClassesCovered": ["none-applicable: <reason>"]`.
 ## Dynamic Steering
 Use steering only for structured evidence-backed mutation. Reject natural-language steering requests.
@@ -221,11 +220,11 @@ Structured prompt directives accepted: `OMO_ULW_LOOP_STEER: { ... }`, `omo.ulw-l
 7. Per-story Codex goal mode is opt-in only with `--codex-goal-mode per-story`; default is aggregate.
 8. Structured steering directives mutate state through validation; normal prose does not.
 9. Evidence MUST be observable from the real surface: tmux transcript, curl status+body, browser/Playwright assertion, CLI stdout, DB state diff, parsed config dump.
-10. Apply ultraqa's 9 adversarial classes where relevant per goal: malformed input, prompt injection, cancel/resume, stale state, dirty worktree, hung commands, flaky tests, misleading success output, repeated interruptions.
+10. Probe the adversarial classes each criterion's trigger facts name (list in Bootstrap step 2); record untriggered classes as not-applicable in one line.
 11. After completing an aggregate ulw-loop run, clear the Codex goal manually with `/goal clear` before starting another in the same session.
 12. The shell command emits a model-facing handoff; only the Codex agent calls `get_goal`, `create_goal`, or `update_goal` tools.
 13. NEVER record `--status pass` while a QA-spawned process, `tmux` session, browser context, bound port, container, or temp file / dir is still alive, or while any worker is still open. The evidence string MUST include the cleanup receipt. Leftover runtime state = BLOCKED, not PASS.
-14. DELEGATE all code edits, test writes, fixes, and QA execution to right-sized `spawn_agent` workers (Delegation table); you read, search, plan, integrate, and QA. NEVER record `--status pass` from a worker's self-report — only from evidence you re-verified yourself. Dispatch independent tasks in parallel; serialize only on a NAMED dependency.
+14. DELEGATE all code edits, test writes, fixes, and QA execution to right-sized `multi_agent_v1.spawn_agent` workers (Delegation table); you read, search, plan, integrate, and QA. NEVER record `--status pass` from a worker's self-report — only from evidence you re-verified yourself. Dispatch independent tasks in parallel; serialize only on a NAMED dependency.
 15. Every verified work unit that touched git-tracked files must leave either an atomic `git-master`-style commit hash or explicit no-commit blocker evidence before the next unit starts.
 ## Stop Rules

package/packages/omo-codex/plugin/components/ulw-loop/src/cli-commands.ts CHANGED Viewed

@@ -16,15 +16,25 @@ import { UlwLoopError } from "./types.js";
 type CheckpointStatus = "complete" | "failed" | "blocked";
+export const ULW_LOOP_SUBCOMMANDS = ["help", "create-goals", "status", "complete-goals", "checkpoint", "steer", "add-goal", "criteria", "record-evidence", "record-review-blockers"] as const;
+export type UlwLoopSubcommand = (typeof ULW_LOOP_SUBCOMMANDS)[number];
+export function isUlwLoopSubcommand(value: string): value is UlwLoopSubcommand {
+	return (ULW_LOOP_SUBCOMMANDS as readonly string[]).includes(value);
+}
 export async function ulwLoopCommand(argv: readonly string[]): Promise<number> {
-	const command = argv[0] ?? "help";
+	const head = argv[0] ?? "help";
+	const command = head === "--help" || head === "-h" ? "help" : head;
 	const rest = argv.slice(1);
 	const repoRoot = process.cwd();
 	const json = hasFlag(rest, "--json");
 	const scope = commandScope(rest);
 	try {
+		if (!isUlwLoopSubcommand(command)) { process.stdout.write(`${ULW_LOOP_HELP}\n`); return 1; }
 		switch (command) {
-			case "help": case "--help": case "-h": process.stdout.write(`${ULW_LOOP_HELP}\n`); return 0;
+			case "help": process.stdout.write(`${ULW_LOOP_HELP}\n`); return 0;
 			case "create-goals": return await createGoals(repoRoot, rest, json, scope);
 			case "status": return await status(repoRoot, json, scope);
 			case "complete-goals": return await completeGoals(repoRoot, rest, json, scope);
@@ -34,7 +44,7 @@ export async function ulwLoopCommand(argv: readonly string[]): Promise<number> {
 			case "criteria": return await criteria(repoRoot, rest, json, scope);
 			case "record-evidence": return await captureEvidence(repoRoot, rest, json, scope);
 			case "record-review-blockers": return await reviewBlockers(repoRoot, rest, json, scope);
-			default: process.stdout.write(`${ULW_LOOP_HELP}\n`); return 1;
+			default: return unhandledSubcommand(command);
 		}
 	} catch (error) {
 		if (error instanceof UlwLoopError) process.stderr.write(`[ulw-loop] ${error.message}\n`);
@@ -44,6 +54,10 @@ export async function ulwLoopCommand(argv: readonly string[]): Promise<number> {
 	}
 }
+function unhandledSubcommand(command: never): never {
+	throw new UlwLoopError(`Unhandled ulw-loop subcommand: ${String(command)}.`, "ULW_LOOP_SUBCOMMAND_UNHANDLED");
+}
 function commandScope(argv: readonly string[]): UlwLoopScope | undefined {
 	const sessionId = readValue(argv, "--session-id") ?? resolveUlwLoopSessionIdFromEnv();
 	return sessionId === null ? undefined : { sessionId };

package/packages/omo-codex/plugin/components/ulw-loop/src/cli.ts CHANGED Viewed

@@ -1,5 +1,5 @@
 #!/usr/bin/env node
-import { ulwLoopCommand } from "./cli-commands.js";
+import { isUlwLoopSubcommand, ulwLoopCommand } from "./cli-commands.js";
 import { runPreToolUseGoalBudgetGuardCli, runUlwLoopHookCli } from "./codex-hook.js";
 const TOP_LEVEL_HELP =
@@ -26,6 +26,7 @@ async function main(): Promise<number> {
 		process.stderr.write(`[omo] unknown hook subcommand: ${sub ?? "(none)"}\n`);
 		return 1;
 	}
+	if (isUlwLoopSubcommand(command)) return ulwLoopCommand(argv);
 	process.stderr.write(`[omo] unknown command: ${command}\n${TOP_LEVEL_HELP}`);
 	return 1;
 }

package/packages/omo-codex/plugin/components/ulw-loop/src/codex-goal-instruction.ts CHANGED Viewed

@@ -105,7 +105,7 @@ function finalSection(plan: UlwLoopPlan, goal: UlwLoopItem, isFinal: boolean, ag
 	const checkpointCommand = `omo ulw-loop checkpoint${option} --goal-id ${goal.id} --status complete --evidence "<tests/files/PR evidence>" --codex-goal-json "<fresh complete get_goal JSON or path>" --quality-gate-json "<quality gate JSON or path>"`;
 	return joinLines([
 		"Final story — run mandatory quality gate before update_goal:",
-		"- Run ai-slop-cleaner on changed files even when it is a no-op, rerun verification, then run the code review (spawn_agent(agent_type=\"codex-ultrawork-reviewer\", fork_turns=\"none\", ...); fall back to agent_type=\"worker\" with a scoped reviewer assignment if unavailable).",
+		"- Run ai-slop-cleaner on changed files even when it is a no-op, rerun verification, then run the code review (multi_agent_v1.spawn_agent(agent_type=\"codex-ultrawork-reviewer\", fork_context=false, ...); fall back to agent_type=\"worker\" with a scoped reviewer assignment if unavailable).",
 		"- If the final review is not APPROVE with architect status CLEAR, do not call update_goal. Record blocker work first:",
 		`  ${blockerCommand}`,
 		aggregate

package/packages/omo-codex/plugin/components/ulw-loop/test/cli-entrypoint.test.ts ADDED Viewed

@@ -0,0 +1,95 @@
+import { spawn } from "node:child_process";
+import { mkdtemp, rm } from "node:fs/promises";
+import { tmpdir } from "node:os";
+import { dirname, join, resolve } from "node:path";
+import { fileURLToPath } from "node:url";
+import { afterEach, beforeAll, beforeEach, describe, expect, it } from "vitest";
+const componentRoot = resolve(dirname(fileURLToPath(import.meta.url)), "..");
+const builtCli = join(componentRoot, "dist", "cli.js");
+type CliResult = {
+	readonly code: number | null;
+	readonly stdout: string;
+	readonly stderr: string;
+};
+function sanitizedEnv(): NodeJS.ProcessEnv {
+	const env = { ...process.env };
+	delete env["CODEX_SESSION_ID"];
+	delete env["CODEX_THREAD_ID"];
+	delete env["OMO_ULW_LOOP_SESSION_ID"];
+	return env;
+}
+async function runProcess(command: string, args: readonly string[], cwd: string): Promise<CliResult> {
+	return new Promise((resolvePromise, reject) => {
+		const child = spawn(command, [...args], { cwd, env: sanitizedEnv() });
+		const stdout: Buffer[] = [];
+		const stderr: Buffer[] = [];
+		child.stdout.on("data", (chunk: Buffer) => stdout.push(chunk));
+		child.stderr.on("data", (chunk: Buffer) => stderr.push(chunk));
+		child.on("error", reject);
+		child.on("close", (code) => {
+			resolvePromise({
+				code,
+				stdout: Buffer.concat(stdout).toString("utf8"),
+				stderr: Buffer.concat(stderr).toString("utf8"),
+			});
+		});
+	});
+}
+let workspace: string;
+async function runCli(args: readonly string[]): Promise<CliResult> {
+	return runProcess(process.execPath, [builtCli, ...args], workspace);
+}
+beforeAll(async () => {
+	const build = await runProcess("npm", ["run", "build"], componentRoot);
+	expect(build.code, `npm run build failed:\n${build.stderr}`).toBe(0);
+}, 120_000);
+beforeEach(async () => {
+	workspace = await mkdtemp(join(tmpdir(), "ulw-loop-entrypoint-"));
+});
+afterEach(async () => {
+	await rm(workspace, { recursive: true, force: true });
+});
+describe("dist/cli.js entrypoint dispatch", () => {
+	it("#given no plan #when invoked with bare 'status --json' #then routes into ulw-loop instead of unknown command", async () => {
+		const result = await runCli(["status", "--json"]);
+		const combined = `${result.stdout}${result.stderr}`;
+		expect(combined).toContain("No ulw-loop plan found");
+		expect(combined).not.toContain("[omo] unknown command");
+		expect(result.code).toBe(1);
+	});
+	it("#given no plan #when invoked with legacy 'ulw-loop status --json' #then still routes into ulw-loop", async () => {
+		const result = await runCli(["ulw-loop", "status", "--json"]);
+		const combined = `${result.stdout}${result.stderr}`;
+		expect(combined).toContain("No ulw-loop plan found");
+		expect(combined).not.toContain("[omo] unknown command");
+		expect(result.code).toBe(1);
+	});
+	it("#given the top-level entrypoint #when invoked with 'help' #then prints usage and exits 0", async () => {
+		const result = await runCli(["help"]);
+		expect(result.code).toBe(0);
+		expect(result.stdout).toContain("Usage:");
+		expect(result.stdout).toContain("omo ulw-loop <subcommand>");
+	});
+	it("#given a command outside the ulw-loop vocabulary #when invoked with 'frobnicate' #then fails as unknown command", async () => {
+		const result = await runCli(["frobnicate"]);
+		expect(result.code).toBe(1);
+		expect(result.stderr).toContain("[omo] unknown command: frobnicate");
+	});
+});

package/packages/omo-codex/plugin/components/ulw-loop/test/package-smoke.test.ts CHANGED Viewed

@@ -118,8 +118,6 @@ describe("skills/ulw-loop/SKILL.md", () => {
 		const text = await readText("skills/ulw-loop/SKILL.md");
 		expect(text).toMatch(/^---\nname: ulw-loop\n/m);
-		expect(text).toContain("Goal-like loop that uses ultrawork mode to decompose work into systematic, evidence-bound steps.");
-		expect(text).toContain("short-description: Goal-like ultrawork loop for systematic decomposition");
 	});
 	it("#given Codex dollar hinting #when querying ulw-loop #then ulw-loop surfaces the ulw-loop alias", async () => {
@@ -138,37 +136,6 @@ describe("skills/ulw-loop/SKILL.md", () => {
 		expect(text).toContain('- "ulw-loop"');
 	});
-	it("references the success criteria and record-evidence vocabulary", async () => {
-		const text = await readText("skills/ulw-loop/references/full-workflow.md");
-		expect(text.toLowerCase()).toMatch(/success criteria|successcriteria/);
-		expect(text.toLowerCase()).toContain("record-evidence");
-	});
-	it("#given workflow Acquire Next Goal text #when inspected #then create_goal uses objective-only payload wording", async () => {
-		const text = await readText("skills/ulw-loop/references/full-workflow.md");
-		expect(text).toContain("instruction.json.objective");
-		expect(text).toContain("objective only");
-		expect(text).not.toContain("Call `create_goal` with the handoff payload.");
-	});
-	it("#given omo is absent from PATH #when bootstrap instructions are read #then local cached CLI fallback is documented", async () => {
-		const text = await readText("skills/ulw-loop/references/full-workflow.md");
-		expect(text).toContain("If `omo` is absent from PATH");
-		expect(text).toContain("ULW_LOOP_CLI");
-		expect(text).toContain("components/ulw-loop/dist/cli.js");
-	});
-	it("#given empty PATH #when bootstrap instructions are read #then handles empty PATH without losing notepad bootstrap", async () => {
-		const text = await readText("skills/ulw-loop/references/full-workflow.md");
-		expect(text).toContain("If PATH is empty");
-		expect(text).toContain("ULW_LOOP_NODE");
-		expect(text).toContain(".omo/ulw-loop/bootstrap-notepad.md");
-		expect(text).not.toContain("ls -1");
-	});
 	it("#given PATH omo lacks ulw-loop #when bootstrap runs #then falls back to cached ulw-loop CLI", async () => {
 		const text = await readText("skills/ulw-loop/references/full-workflow.md");
 		const bootstrap = bootstrapScriptFrom(text);
@@ -213,69 +180,6 @@ describe("skills/ulw-loop/SKILL.md", () => {
 		}
 	});
-	it("uses the .omo workspace path", async () => {
-		const text = await readText("skills/ulw-loop/SKILL.md");
-		expect(text).toContain(".omo/ulw-loop");
-	});
-	it("#given completed default state #when skill guidance is inspected #then it prefers a fresh session", async () => {
-		const skill = await readText("skills/ulw-loop/SKILL.md");
-		const workflow = await readText("skills/ulw-loop/references/full-workflow.md");
-		expect(skill).toContain("fresh `--session-id <new-id>`");
-		expect(skill).toContain("Use `--force` only");
-		expect(workflow).toContain("create-goals --session-id <new-id>");
-		expect(workflow).toContain("overwriting completed evidence");
-	});
-	it("#given long Codex runs #when worker guidance is inspected #then avoids context-expensive agent polling", async () => {
-		const text = await readText("skills/ulw-loop/references/full-workflow.md");
-		expect(text).toMatch(/list_agents/);
-		expect(text).toMatch(/polling loop/);
-		expect(text).toMatch(/replay large payloads/);
-		expect(text).toMatch(/Track spawned agent names locally/);
-		expect(text).toMatch(/wait_agent.*mailbox signals/);
-		expect(text).toMatch(/WORKING:/);
-		expect(text).toMatch(/single `list_agents`/);
-		expect(text).toMatch(/Plan and reviewer agents may run for a long time/);
-		expect(text).toMatch(/short wait_agent cycles/);
-		expect(text).toMatch(/single long blocking wait/);
-		expect(text).toMatch(/git-master/);
-		expect(text).toMatch(/touched-path commit history/);
-		expect(text).toMatch(/commit in the observed style/);
-		expect(text).toMatch(/omnibus commit/);
-		expect(text).toContain("Every worker message MUST carry");
-		expect(text).toContain("Each worker does strict TDD");
-	});
-	it("#given Codex subagent delegation #when worker guidance is inspected #then assignment ambiguity is hardened", async () => {
-		const text = await readText("skills/ulw-loop/SKILL.md");
-		expect(text).toMatch(/TASK:/);
-		expect(text).toMatch(/fork_turns:\s*"none"/);
-		expect(text).toMatch(/wait_agent.*mailbox signals/);
-		expect(text).toMatch(/WORKING:/);
-		expect(text).toMatch(/single `list_agents`/);
-		expect(text).toMatch(/Fallback only when/);
-		expect(text).toMatch(/BLOCKED:/);
-		expect(text).toMatch(/respawn.*smaller/);
-		expect(text).toMatch(/Plan and reviewer agents may run for a long time/);
-		expect(text).toMatch(/short wait_agent cycles/);
-		expect(text).toMatch(/single long blocking wait/);
-		expect(text).toMatch(/git-master/);
-		expect(text).toMatch(/commit each verified work unit atomically/);
-	});
-	it("#given quiet Codex reviewers #when full workflow guidance is inspected #then timeout is not treated as death", async () => {
-		const text = await readText("skills/ulw-loop/references/full-workflow.md");
-		expect(text).toMatch(/A timeout only means no new mailbox update arrived/i);
-		expect(text).toMatch(/WORKING:/);
-		expect(text).toMatch(/single `list_agents`/);
-		expect(text).toMatch(/do not count it as pass\/review approval/i);
-		expect(text).toMatch(/record inconclusive/i);
-	});
 });
 describe("source LOC budget", () => {

package/packages/omo-codex/plugin/components/ulw-loop/test/quality-gate.test.ts CHANGED Viewed

@@ -31,6 +31,29 @@ function makeGate(overrides: Record<string, unknown> = {}): Record<string, unkno
 	return { ...VALID_GATE, ...overrides };
 }
+describe("validateQualityGate LIGHT-tier shape", () => {
+	it("#given a light-tier gate with none-applicable classes and self-review approval evidence #when validated #then it passes without reviewer fields", () => {
+		// given
+		const gate = makeGate({
+			codeReview: {
+				evidence: "UNCONDITIONAL APPROVAL — LIGHT tier: single-file copy change, self-reviewed diff + diagnostics",
+			},
+			criteriaCoverage: {
+				totalCriteria: 1,
+				passCount: 1,
+				adversarialClassesCovered: ["none-applicable: prompt-file-only change, no input parsing or state"],
+			},
+		});
+		// when
+		const validated = validateQualityGate(gate);
+		// then
+		expect(validated.codeReview.recommendation).toBe("APPROVE");
+		expect(validated.codeReview.architectStatus).toBe("CLEAR");
+	});
+});
 function getQualityGateError(input: unknown): UlwLoopError {
 	try {
 		validateQualityGate(input);

package/packages/omo-codex/plugin/components/ulw-loop/test/skill-contract.test.ts ADDED Viewed

@@ -0,0 +1,46 @@
+import { readFile } from "node:fs/promises";
+import { describe, expect, it } from "vitest";
+const SKILL_URL = new URL("../skills/ulw-loop/SKILL.md", import.meta.url);
+const FULL_WORKFLOW_URL = new URL("../skills/ulw-loop/references/full-workflow.md", import.meta.url);
+function wordCount(text: string): number {
+	return text.split(/\s+/).filter(Boolean).length;
+}
+describe("ulw-loop skill contract", () => {
+	it("#given full workflow #when tier triage is inspected #then criteria scale by LIGHT/HEAVY with upgrade-only ratchet", async () => {
+		// given
+		const workflow = await readFile(FULL_WORKFLOW_URL, "utf8");
+		// then
+		expect(workflow).toMatch(/[Tt]ier triage/);
+		expect(workflow).toMatch(/LIGHT/);
+		expect(workflow).toMatch(/HEAVY/);
+		expect(workflow).toMatch(/1-2 successCriteria/);
+		expect(workflow).toMatch(/3\+ criteria|3\+ successCriteria/);
+		expect(workflow).toMatch(/When unsure[^.]{0,30}HEAVY/);
+		expect(workflow).toMatch(/never downgrade/i);
+	});
+	it("#given full workflow #when evidence rules are inspected #then tautological tests are rejected and the light quality gate is named", async () => {
+		// given
+		const workflow = await readFile(FULL_WORKFLOW_URL, "utf8");
+		// then
+		expect(workflow).toMatch(/mirrors its implementation/);
+		expect(workflow).toMatch(/none-applicable/);
+	});
+	it("#given full workflow #when echo discipline is inspected #then the ultraqa class list is enumerated once and budgets hold", async () => {
+		// given
+		const workflow = await readFile(FULL_WORKFLOW_URL, "utf8");
+		const skill = await readFile(SKILL_URL, "utf8");
+		// then
+		expect(workflow.match(/malformed input, prompt injection/g)?.length ?? 0).toBe(1);
+		expect(wordCount(workflow)).toBeLessThanOrEqual(3544);
+		expect(wordCount(skill)).toBeLessThanOrEqual(611);
+	});
+});