@fro.bot/systematic 2.5.3 → 2.6.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/package.json +1 -1
- package/skills/ce-brainstorm/references/handoff.md +127 -0
- package/skills/ce-brainstorm/references/requirements-capture.md +243 -0
- package/skills/ce-brainstorm/references/universal-brainstorming.md +63 -0
- package/skills/ce-ideate/references/post-ideation-workflow.md +240 -0
- package/skills/ce-plan/references/deepening-workflow.md +249 -0
- package/skills/ce-plan/references/plan-handoff.md +96 -0
- package/skills/ce-plan/references/universal-planning.md +114 -0
- package/skills/ce-plan/references/visual-communication.md +31 -0
- package/skills/ce-work/references/shipping-workflow.md +129 -0
- package/skills/ce-work-beta/references/codex-delegation-workflow.md +327 -0
- package/skills/ce-work-beta/references/shipping-workflow.md +129 -0
- package/skills/document-review/references/synthesis-and-presentation.md +406 -0
- package/skills/proof/references/hitl-review.md +368 -0
@@ -0,0 +1,114 @@
# Universal Planning Workflow

This file is loaded when ce-plan detects a non-software task (Phase 0.1b). It replaces the software-specific phases (0.2 through 5.1) with a domain-agnostic planning workflow.

## Before starting: verify classification

The detection stub in SKILL.md routes here for anything that isn't clearly software. Verify the classification is correct before proceeding:

- **Is this actually a software task?** The key distinction is task-type, not topic-domain. A study guide about Rust is non-software (producing educational content). A Rust library refactor is software (modifying code). If this is actually software, return to Phase 0.2 in the main SKILL.md.
- **Is this a quick-help request, not a planning task?** Error messages, factual questions, and single-step tasks don't need a plan. Respond directly and exit. Examples: "zsh: command not found: brew", "what's the capital of France."
- **Pipeline mode?** If invoked from LFG or any `disable-model-invocation` context: output "This is a non-software task. The LFG pipeline requires ce-work, which only supports software tasks. Use `/ce-plan` directly for non-software planning." and stop.

Once past these checks, commit to producing a plan. Do not exit because the task looks like a "lookup" or "research question" — the user invoked `ce-plan` because they want a structured output.

---

## Step 1: Assess Ambiguity and Research Need

Evaluate two things before planning:

**Would 1-3 quick questions meaningfully improve this plan?**

- **Default: ask 1-3 questions** via Step 1b when the answers would change the plan's structure or content. Always include a final option like "Skip — just make the plan with reasonable assumptions" so the user can opt out instantly.
- **Skip questions entirely** only when the request already specifies all major variables or the task is simple enough that reasonable assumptions cover it well.

**Research need — does this plan depend on facts that change faster than training data?**

| Research need | Signals | Action |
|--------------|---------|--------|
| **None** | Generic, timeless, or conceptual plan (study curriculum methodology, project management approach, personal goal breakdown) | Skip research. Model knowledge is sufficient. After structuring the plan, offer: "I based this on general knowledge. Want me to search for [specific thing research would improve]?" — e.g., sourced recipes, current product recommendations, expert frameworks. Search only if the user accepts. |
| **Recommended** | Plan references specific locations, venues, dates, prices, schedules, seasonal availability, or current events — anything where stale information would break the plan (closed restaurants, changed prices, cancelled events, wrong seasonal dates). | Research before planning. Decompose into 2-5 focused research questions and dispatch parallel web searches. In OpenCode, use the Agent tool with `model: "haiku"` for each search to reduce cost. Collate findings before structuring the plan. |

When research is recommended, do it — don't just offer. Stale recommendations (closed restaurants, rethemed attractions, outdated prices) are worse than no recommendations. The user invoked `/ce-plan` because they want a good plan, not a disclaimer about training data.

**Research decomposition pattern:**
1. Identify 2-5 independent research questions based on the task. Good questions target the facts the model is least confident about: current prices, hours, availability, recent changes, seasonal specifics.
2. Dispatch parallel research. Prefer user-named surfaces first per Core Principle 8 in SKILL.md; fall back to web search for questions those surfaces don't cover.
3. Collate findings into a brief research summary before proceeding to planning.

Example for "plan a date night in Seattle this Saturday":
- "Best restaurants open late Saturday in Capitol Hill Seattle 2026"
- "Events happening in Seattle [specific date]"
- "Seattle waterfront current status and hours"

## Step 1b: Focused Q&A

Ask up to 3 questions targeting the unknowns that would most change the plan. Use the platform's blocking question tool: `question` in OpenCode (call `ToolSearch` with `select:question` first if its schema isn't loaded), `request_user_input` in Codex, `ask_user` in Gemini, `ask_user` in Pi (requires the `pi-ask-user` extension). Fall back to numbered options in chat only when no blocking tool exists or the call errors (e.g., Codex edit modes) — not because a schema load is required. Never silently skip the question.

**How to ask well:**
- Offer informed options, not open-ended blanks. Instead of "When are you going?", try "Mid-week visits have 30-40% shorter lines — are you flexible on timing?" The question should give the user a frame of reference, not just extract information.
- Use multi-select when several independent choices can be captured in one question. This is compact and respects the user's time.
- Always include a final option like **"Skip — just make the plan with reasonable assumptions"** so the user can opt out at any point.

Focus on the unknowns specific to this task that would change what the plan recommends or how it's structured. Do not ask more than 3 — after that, proceed with assumptions for anything remaining.

## Step 2: Structure the Plan

Create a structured plan guided by these quality principles. Do NOT use the software plan template (implementation units, test scenarios, file paths, etc.).

### Format: when to prescribe vs. present options

Not every plan should be a single linear path. Match the format to the task:

| Task type | Best format | Why |
|-----------|------------|-----|
| **High personal preference** (food, entertainment, activities, gifts) | Curated options per category — present 2-3 choices and let the user compose | Preferences vary; a single pick may miss. Options respect the user's taste. |
| **Logical sequence** (study plan, project timeline, multi-day trip logistics) | Single prescriptive path with clear ordering | Sequencing matters; options at each step create decision paralysis. |
| **Hybrid** (event with fixed structure but variable details) | Fixed structure with choice points marked | The skeleton is set but specific vendors/venues/activities are options. |

Example: A date night plan should present 2-3 restaurant options, 2-3 activity options, and a suggested flow — not pick one restaurant and build the whole evening around it. A study plan should prescribe a single weekly progression — not present 3 different curricula to choose from.

### Formatting: bullets over prose

- Prefer bullets and tables for actionable content (steps, options, logistics, budgets)
- Use prose only for context, rationale, or explanations that connect the dots
- Plans are for scanning and executing, not reading cover-to-cover

### Quality principles

- **Actionable steps**: Each step is specific enough to execute without further research
- **Sequenced by dependency**: Steps are in the right order, with dependencies noted
- **Time-aware**: When relevant, include timing, durations, deadlines, or phases
- **Resource-identified**: Specify what's needed — tools, materials, people, budget, locations
- **Contingency-aware**: For important decisions, note alternatives or what to do if plans change
- **Appropriately detailed**: Match detail to task complexity. A weekend trip needs less structure than a 3-month curriculum. A dinner plan should be concise, not a 200-line document.
- **Domain-appropriate format**: Choose a structure that fits the domain:
  - Itinerary for travel (day-by-day, with times and locations)
  - Syllabus or curriculum for study plans (topics, resources, milestones)
  - Runbook for events (timeline, responsibilities, logistics)
  - Project plan for business or operational tasks (phases, owners, deliverables)
  - Research plan for investigations (questions, methods, sources)
  - Options menu for preference-driven tasks (curated picks per category)

## Step 3: Save or Share

After structuring the plan, ask the user how they want to receive it using the platform's blocking question tool: `question` in OpenCode (call `ToolSearch` with `select:question` first if its schema isn't loaded), `request_user_input` in Codex, `ask_user` in Gemini, `ask_user` in Pi (requires the `pi-ask-user` extension). Fall back to numbered options in chat only when no blocking tool exists or the call errors (e.g., Codex edit modes) — not because a schema load is required. Never silently skip the question.

**Question:** "Plan ready. How would you like to receive it?"

**Options:**

1. **Save to disk** — Write the plan as a markdown file. Ask where:
   - `docs/plans/` (only show if this directory exists)
   - Current working directory
   - `/tmp`
   - A custom path
   - Use the filename convention: `YYYY-MM-DD-<descriptive-name>-plan.md`
   - Start the document with a `# Title` heading, followed by `Created: YYYY-MM-DD` on the next line. No YAML frontmatter.

2. **Open in Proof (web app)** — Open the doc in Every's Proof editor, review and comment to iterate with the agent, or copy a link to share with others. Load the `ce-proof` skill to create and open the document.

3. **Save to disk AND open in Proof** — Do both: write the markdown file to disk and open the doc in Proof for review.

Do not offer `/ce-work` (software-only) or issue creation (not applicable to non-software plans).
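A minimal sketch of the save-to-disk conventions from option 1 (filename and header); the plan name and title here are hypothetical examples, not part of the skill:

```shell
# "seattle-date-night" and the title are illustrative placeholders.
name="seattle-date-night"
file="$(date +%F)-${name}-plan.md"   # YYYY-MM-DD-<descriptive-name>-plan.md
# First line: "# Title" heading; second line: "Created: YYYY-MM-DD".
printf '# Seattle Date Night\nCreated: %s\n' "$(date +%F)" > "$file"
echo "$file"
```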
@@ -0,0 +1,31 @@
# Visual Communication in Plan Documents

Section 3.4 covers diagrams about the *solution being planned* (pseudo-code, mermaid sequences, state diagrams). The existing Section 4.3 mermaid rule encourages those solution-design diagrams within Technical Design and per-unit fields. This guidance covers a different concern: visual aids that help readers *navigate and comprehend the plan document itself* -- dependency graphs, interaction diagrams, and comparison tables that make plan structure scannable.

Visual aids are conditional on content patterns, not on plan depth classification -- a Lightweight plan about a complex multi-unit workflow may warrant a dependency graph; a Deep plan about a straightforward feature may not.

**When to include:**

| Plan describes... | Visual aid | Placement |
|---|---|---|
| 4+ implementation units with non-linear dependencies (parallelism, diamonds, fan-in/fan-out) | Mermaid dependency graph | Before or after the Implementation Units heading |
| System-Wide Impact naming 3+ interacting surfaces or cross-layer effects | Mermaid interaction or component diagram | Within the System-Wide Impact section |
| Summary or Problem Frame involving 3+ behavioral modes, states, or variants | Markdown comparison table | Within Summary or Problem Frame (legacy plans may still use `Overview`) |
| Key Technical Decisions with 3+ interacting decisions, or Alternative Approaches with 3+ alternatives | Markdown comparison table | Within the relevant section |

**When to skip:**
- The plan has 3 or fewer units in a straight dependency chain -- the Dependencies field on each unit is sufficient
- Prose already communicates the relationships clearly
- The visual would duplicate what the High-Level Technical Design section already shows
- The visual describes code-level detail (specific method names, SQL columns, API field lists)

**Format selection:**
- **Mermaid** (default) for dependency graphs and interaction diagrams -- 5-15 nodes, no in-box annotations, standard flowchart shapes. Use `TB` (top-to-bottom) direction so diagrams stay narrow in both rendered and source form. The source should be readable as a fallback in diff views and terminals.
- **ASCII/box-drawing diagrams** for annotated flows that need rich in-box content -- file path layouts, decision logic branches, multi-column spatial arrangements. More expressive than mermaid when the diagram's value comes from annotations within nodes. Follow the 80-column max for code blocks; use vertical stacking.
- **Markdown tables** for mode/variant comparisons and decision/approach comparisons.
- Keep diagrams proportionate to the plan. A 6-unit linear chain gets a simple 6-node graph. A complex dependency graph with fan-out and fan-in may need 10-15 nodes -- that is fine if every node earns its place.
- Place visuals inline at the point of relevance, not in a separate section.
- Stay at the plan-structure level -- unit dependencies, component interactions, mode comparisons, impact surfaces. Not implementation architecture, data schemas, or code structure (those belong in Section 3.4).
- Prose is authoritative: when a visual aid and its surrounding prose disagree, the prose governs.

After generating a visual aid, verify it accurately represents the plan sections it illustrates -- correct dependency edges, no missing surfaces, no merged units.
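As a sketch of the default format, a `TB` dependency graph for a hypothetical five-unit plan (unit names are invented for illustration):

```mermaid
flowchart TB
    U1["Unit 1: schema migration"] --> U2["Unit 2: model layer"]
    U1 --> U3["Unit 3: API endpoint"]
    U2 --> U4["Unit 4: UI wiring"]
    U3 --> U4
    U4 --> U5["Unit 5: integration tests"]
```

The diamond (U2/U3 fan-out into U4) is exactly the non-linear shape the table above says warrants a graph; a straight U1→U2→U3 chain would not.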
@@ -0,0 +1,129 @@
# Shipping Workflow

This file contains the shipping workflow (Phases 3-4). Load it only when all Phase 2 tasks are complete and execution transitions to quality check.

## Phase 3: Quality Check

1. **Run Core Quality Checks**

   Always run before submitting:

   ```bash
   # Run the full test suite (use the project's test command)
   # Examples: bin/rails test, npm test, pytest, go test, etc.

   # Run linting (per AGENTS.md)
   # Use linting-agent before pushing to origin
   ```

2. **Code Review** (REQUIRED)

   Every change gets reviewed before shipping. The depth scales with the change's risk profile, but review itself is never skipped.

   **Tier 2: Full review (default)** -- REQUIRED unless the Tier 1 criteria are explicitly met. Invoke the `ce-code-review` skill with `mode:autofix` to run specialized reviewer agents, auto-apply safe fixes, and record residual downstream work in the per-run artifact. When the plan file path is known, pass it as `plan:<path>`. This is the mandatory default -- proceed to Tier 1 only after confirming every criterion below.

   **Tier 1: Inline self-review** -- A lighter alternative permitted only when **all four** criteria are true. Before choosing Tier 1, explicitly state which criteria apply and why. If any criterion is uncertain, use Tier 2.
   - Purely additive (new files only, no existing behavior modified)
   - Single concern (one skill, one component -- not cross-cutting)
   - Pattern-following (implementation mirrors an existing example with no novel logic)
   - Plan-faithful (no scope growth, no deferred questions resolved with surprising answers)

3. **Residual Work Gate** (REQUIRED when Tier 2 ran)

   After Tier 2 code review completes, inspect the Residual Actionable Work summary it returned (or read the run artifact directly if the summary was not emitted). If one or more residual `downstream-resolver` findings remain, do not proceed to Final Validation until the user decides how to handle them.

   Ask the user using the platform's blocking question tool (`question` in OpenCode with `ToolSearch select:question` pre-loaded if needed, `request_user_input` in Codex, `ask_user` in Gemini, `ask_user` in Pi (requires the `pi-ask-user` extension)). Fall back to numbered options in chat only when the harness genuinely lacks a blocking tool. Never silently skip the gate.

   Stem: `Code review found N residual finding(s) the skill did not auto-fix. How should the agent proceed?`

   Options (four or fewer, self-contained labels):
   - `Apply/fix now` — loop back into review with focused fixes; the agent investigates each finding, applies changes where safe, and re-runs review.
   - `File tickets via project tracker` — load `references/tracker-defer.md` in Interactive mode; the agent files tickets in the project's detected tracker (or the `gh` fallback, or leaves them in the report if no sink exists) and proceeds to Final Validation.
   - `Accept and proceed` — record the residual findings verbatim in a durable "Known Residuals" sink before shipping. If a PR will be created or updated in Phase 4, include them in the PR description's "Known Residuals" section (the agent owns this when calling `ce-commit-push-pr`). If the user later chooses the no-PR `ce-commit` path, create `docs/residual-review-findings/<branch-or-head-sha>.md`, include the accepted findings and source review-run context, stage it with the implementation commit, and mention the file path in the final summary. The user has acknowledged the risk, but the findings must not live only in the transient session.
   - `Stop — do not ship` — abort the shipping workflow. The user will handle findings manually before re-invoking.

   Skip this gate entirely when the review reported `Residual actionable work: none.` or when only Tier 1 (inline self-review) was used. Do not proceed past this gate on an `Accept and proceed` decision until the agent has recorded whether the durable sink is `PR Known Residuals` or `docs/residual-review-findings/<branch-or-head-sha>.md`.
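For the no-PR `ce-commit` path, a hypothetical sketch of the durable sink file — the branch, sha, paths, and findings below are all invented for illustration:

```markdown
# Residual Review Findings — feature/cache-layer @ 3f9c2ab

Source: ce-code-review run (Tier 2, mode:autofix), accepted via "Accept and proceed"

- [medium] app/services/cache.rb:88 — TTL is hard-coded; make it configurable
- [low] lib/query.rb:12 — normalization logic duplicated from app/models/query.rb
```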

4. **Final Validation**
   - All tasks marked completed
   - Testing addressed -- tests pass and new/changed behavior has corresponding test coverage (or an explicit justification for why tests are not needed)
   - Linting passes
   - Code follows existing patterns
   - Figma designs match (if applicable)
   - No console errors or warnings
   - If the plan has a `Requirements` section (or legacy `Requirements Trace`), verify each requirement is satisfied by the completed work
   - If any `Deferred to Implementation` questions were noted, confirm they were resolved during execution

5. **Prepare Operational Validation Plan** (REQUIRED)
   - Add a `## Post-Deploy Monitoring & Validation` section to the PR description for every change.
   - Include concrete:
     - Log queries/search terms
     - Metrics or dashboards to watch
     - Expected healthy signals
     - Failure signals and rollback/mitigation triggers
     - Validation window and owner
   - If there is truly no production/runtime impact, still include the section with `No additional operational monitoring required` and a one-line reason.

## Phase 4: Ship It

1. **Prepare Evidence Context**

   Do not invoke `ce-demo-reel` directly in this step. Evidence capture belongs to the PR creation or PR description update flow, where the final PR diff and description context are available.

   Note whether the completed work has observable behavior (UI rendering, CLI output, API/library behavior with a runnable example, generated artifacts, or workflow output). The `ce-commit-push-pr` skill will ask whether to capture evidence only when evidence is possible.

2. **Update Plan Status**

   If the input document has YAML frontmatter with a `status` field, update it to `completed`:
   ```
   status: active -> status: completed
   ```

3. **Commit and Create Pull Request**

   Load the `ce-commit-push-pr` skill to handle committing, pushing, and PR creation. The skill handles convention detection, branch safety, logical commit splitting, adaptive PR descriptions, and attribution badges.

   When providing context for the PR description, include:
   - The plan's summary and key decisions
   - Testing notes (tests added/modified, manual testing performed)
   - Evidence context from step 1, so `ce-commit-push-pr` can decide whether to ask about capturing evidence
   - The Figma design link (if applicable)
   - The Post-Deploy Monitoring & Validation section (see Phase 3, step 5)
   - Any "Known Residuals" accepted in the Phase 3 Residual Work Gate, rendered as a dedicated section in the PR body with severity, file:line, and title per finding

   If the user prefers to commit without creating a PR, load the `ce-commit` skill instead.

4. **Notify User**
   - Summarize what was completed
   - Link to the PR (if one was created)
   - Note any follow-up work needed
   - Suggest next steps if applicable

## Quality Checklist

Before creating the PR, verify:

- [ ] All clarifying questions asked and answered
- [ ] All tasks marked completed
- [ ] Testing addressed -- tests pass AND new/changed behavior has corresponding test coverage (or an explicit justification for why tests are not needed)
- [ ] Linting passes (use linting-agent)
- [ ] Code follows existing patterns
- [ ] Figma designs match the implementation (if applicable)
- [ ] Evidence decision handled by `ce-commit-push-pr` when the change has observable behavior
- [ ] Commit messages follow the conventional format
- [ ] PR description includes the Post-Deploy Monitoring & Validation section (or an explicit no-impact rationale)
- [ ] Code review completed (inline self-review or full `ce-code-review`)
- [ ] PR description includes summary, testing notes, and evidence when captured
- [ ] PR description includes the Compound Engineered badge with accurate model and harness

## Code Review Tiers

Every change gets reviewed. The tier determines depth, not whether review happens.

**Tier 2 (full review)** -- the REQUIRED default. Invoke `ce-code-review mode:autofix` with `plan:<path>` when available. Safe fixes are applied automatically; residual work is recorded in the run artifact for downstream routing. Always use this tier unless all four Tier 1 criteria are explicitly confirmed.

**Tier 1 (inline self-review)** -- permitted only when all four are true (state each explicitly before choosing):
- Purely additive (new files only, no existing behavior modified)
- Single concern (one skill, one component -- not cross-cutting)
- Pattern-following (mirrors an existing example, no novel logic)
- Plan-faithful (no scope growth, no surprising deferred-question resolutions)
@@ -0,0 +1,327 @@
# Codex Delegation Workflow

When `delegation_active` is true, code implementation is delegated to the Codex CLI (`codex exec`) instead of being implemented directly. The orchestrating OpenCode agent retains control of planning, review, git operations, and orchestration.

## Delegation Decision

If `work_delegate_decision` is `ask`, present the recommendation and wait for the user's choice before proceeding.

**When recommending Codex delegation:**

> "Codex delegation active. [N] implementation units -- delegating in one batch."
> 1. Delegate to Codex *(recommended)*
> 2. Execute with OpenCode instead

**When recommending Codex delegation, multiple batches:**

> "Codex delegation active. [N] implementation units -- delegating in [X] batches."
> 1. Delegate to Codex *(recommended)*
> 2. Execute with OpenCode instead

**When recommending OpenCode (all units are trivial):**

> "Codex delegation active, but these are small changes where the cost of delegating outweighs having OpenCode do them."
> 1. Execute with OpenCode *(recommended)*
> 2. Delegate to Codex anyway

If the user chooses the delegation option, proceed to the Pre-Delegation Checks below. If the user chooses the OpenCode option, set `delegation_active` to false and return to standard execution in the parent skill.

If `work_delegate_decision` is `auto` (the default), state the execution plan in one line and proceed without waiting: "Codex delegation active. Delegating [N] units in [X] batch(es)." If all units are trivial, set `delegation_active` to false and proceed: "Codex delegation active. All units are trivial -- executing with OpenCode."

## Pre-Delegation Checks

Run these checks **once before the first batch**. If any check fails, fall back to standard mode for the remainder of the plan execution. Do not re-run them on subsequent batches.

**0. Platform Gate**

Codex delegation is only supported when the orchestrating agent is running in OpenCode. If the current session is Codex, Gemini CLI, or any other non-OpenCode platform, set `delegation_active` to false and proceed in standard mode.

**1. Environment Guard**

Check whether the current agent is already running inside a Codex sandbox:

```bash
if [ -n "$CODEX_SANDBOX" ] || [ -n "$CODEX_SESSION_ID" ]; then
  echo "inside_sandbox=true"
else
  echo "inside_sandbox=false"
fi
```

If `inside_sandbox` is true, delegation would recurse or fail.

- If `delegation_source` is `argument`: emit "Already inside Codex sandbox -- using standard mode." and set `delegation_active` to false.
- If `delegation_source` is `config` or `default`: set `delegation_active` to false silently.

**2. Availability Check**

**Codex availability (pre-resolved):**
!`command -v codex >/dev/null 2>&1 && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_FOUND"`

If the line above shows `CODEX_AVAILABLE`, proceed to the next check.
If it shows `CODEX_NOT_FOUND`, the Codex CLI is not installed. Emit "Codex CLI not found (install via `npm install -g @openai/codex` or `brew install codex`) -- using standard mode." and set `delegation_active` to false.
If it shows an unresolved command string, run `command -v codex` using a shell tool. If the command prints a path, proceed. If it fails or prints nothing, emit the same message and set `delegation_active` to false.

**3. Consent Flow**

If `consent_granted` is not true (from config `work_delegate_consent`):

Present a one-time consent warning using the platform's blocking question tool (`question` in OpenCode, `request_user_input` in Codex, `ask_user` in Gemini, `ask_user` in Pi (requires the `pi-ask-user` extension)). The consent warning explains:
- Delegation sends implementation units to `codex exec` as a structured prompt
- **yolo mode** (`--yolo`): Full system access including network. Required for verification steps that run tests or install dependencies. **Recommended.**
- **full-auto mode** (`--full-auto`): Workspace-write sandbox, no network access.

Present the sandbox mode choice: (1) yolo (recommended), (2) full-auto.

On acceptance:
- Resolve the repo root with `git rev-parse --show-toplevel`. Write `work_delegate_consent: true` and `work_delegate_sandbox: <chosen-mode>` to `<repo-root>/.systematic/config.local.yaml`
- To write: (1) if the file or directory does not exist, create `<repo-root>/.systematic/` and write the YAML file; (2) if the file exists, merge the new keys, preserving existing keys
- Update `consent_granted` and `sandbox_mode` in the resolved state

On decline:
- Ask whether to disable delegation entirely for this project
- If yes: write `work_delegate: false` to `<repo-root>/.systematic/config.local.yaml` (using the same repo root resolved above and the same create-or-merge write rules). Set `delegation_active` to false and proceed in standard mode
- If no: set `delegation_active` to false for this invocation only and proceed in standard mode

**Headless consent:** If running in a headless or non-interactive context, delegation proceeds only if `work_delegate_consent` is already `true` in the config file. If consent is not recorded, set `delegation_active` to false silently.
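A sketch of the config file after an accepted consent flow; the keys come from this workflow, the values are illustrative:

```yaml
# <repo-root>/.systematic/config.local.yaml
work_delegate_consent: true
work_delegate_sandbox: yolo    # or: full-auto
# Written instead when the user declines and disables for the project:
# work_delegate: false
```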

## Batching

Delegate all units in one batch. If the plan exceeds 5 units, split into batches at the plan's own phase boundaries, or into groups of roughly 5 -- never splitting units that share files. Skip delegation entirely if every unit is trivial.

## Prompt Template

At the start of delegated execution, create a per-run OS-temp scratch directory via `mktemp -d` and capture its **absolute path** for all downstream use. All scratch files for this invocation live under that directory. Do not use `.context/` — these scratch files are per-run throwaways that get cleaned up when delegated execution ends (see Cleanup below), matching the repo Scratch Space convention for one-shot artifacts. Do not pass unresolved shell-variable strings to non-shell tools (Write, Read); use the absolute path returned by `mktemp -d`.

```bash
SCRATCH_DIR="$(mktemp -d -t ce-work-codex-XXXXXX)"
echo "$SCRATCH_DIR"
```

Refer to the echoed absolute path as `<scratch-dir>` throughout the rest of this workflow.

Before each batch, write a prompt file to `<scratch-dir>/prompt-batch-<batch-num>.md`.
|
|
104
|
+
|
|
105
|
+
Build the prompt from the batch's implementation units using these XML-tagged sections:
|
|
106
|
+
|
|
107
|
+
```xml
|
|
108
|
+
<task>
|
|
109
|
+
[For a single-unit batch: Goal from the implementation unit.
|
|
110
|
+
For a multi-unit batch: list each unit with its Goal, stating the concrete
|
|
111
|
+
job, repository context, and expected end state for each.]
|
|
112
|
+
</task>
|
|
113
|
+
|
|
114
|
+
<files>
|
|
115
|
+
[Combined file list from all units in the batch -- files to create, modify, or read.]
|
|
116
|
+
</files>
|
|
117
|
+
|
|
118
|
+
<patterns>
|
|
119
|
+
[File paths from all units' "Patterns to follow" fields. If no patterns:
|
|
120
|
+
"No explicit patterns referenced -- follow existing conventions in the
|
|
121
|
+
modified files."]
|
|
122
|
+
</patterns>
|
|
123
|
+
|
|
124
|
+
<approach>
|
|
125
|
+
[For a single-unit batch: Approach from the unit.
|
|
126
|
+
For a multi-unit batch: list each unit's approach, noting dependencies
|
|
127
|
+
and suggested ordering.]
|
|
128
|
+
</approach>
|
|
129
|
+
|
|
130
|
+
<constraints>
|
|
131
|
+
- Do NOT run git commit, git push, or create PRs -- the orchestrating agent handles all git operations
|
|
132
|
+
- Restrict all modifications to files within the repository root
|
|
133
|
+
- Keep changes tightly scoped to the stated task -- avoid unrelated refactors, renames, or cleanup
|
|
134
|
+
- Resolve the task fully before stopping -- do not stop at the first plausible answer
|
|
135
|
+
- If you discover mid-execution that you need to modify files outside the repo root, complete what you can within the repo and report what you could not do via the result schema issues field
|
|
136
|
+
</constraints>
|
|
137
|
+
|
|
138
|
+
<testing>
|
|
139
|
+
Before writing tests, check whether the plan's test scenarios cover all
|
|
140
|
+
categories that apply to each unit. Supplement gaps before writing tests:
|
|
141
|
+
- Happy path: core input/output pairs from each unit's goal
|
|
142
|
+
- Edge cases: boundary values, empty/nil inputs, type mismatches
|
|
143
|
+
- Error/failure paths: invalid inputs, permission denials, downstream failures
|
|
144
|
+
- Integration: cross-layer scenarios that mocks alone won't prove
|
|
145
|
+
|
|
146
|
+
Write tests that name specific inputs and expected outcomes. If your changes
|
|
147
|
+
touch code with callbacks, middleware, or event handlers, verify the
|
|
148
|
+
interaction chain works end-to-end.
|
|
149
|
+
</testing>
|
|
150
|
+
|
|
151
|
+
<verify>
|
|
152
|
+
After implementing, run ALL test files together in a single command (not
|
|
153
|
+
per-file). Cross-file contamination (e.g., mocked globals leaking between
|
|
154
|
+
test files) only surfaces when tests run in the same process. If tests
|
|
155
|
+
fail, fix the issues and re-run until they pass. Do not report status
|
|
156
|
+
"completed" unless verification passes. This is your responsibility --
|
|
157
|
+
the orchestrator will not re-run verification independently.
|
|
158
|
+
|
|
159
|
+
[Test and lint commands from the project. Use the union of all units'
|
|
160
|
+
verification commands as a single combined invocation.]
|
|
161
|
+
</verify>
|
|
162
|
+
|
|
163
|
+
<output_contract>
|
|
164
|
+
Report your result via the --output-schema mechanism. Fill in every field:
|
|
165
|
+
- status: "completed" ONLY if all changes were made AND verification passes,
|
|
166
|
+
"partial" if incomplete, "failed" if no meaningful progress
|
|
167
|
+
- files_modified: array of file paths you changed
|
|
168
|
+
- issues: array of strings describing any problems, gaps, or out-of-scope
|
|
169
|
+
work discovered
|
|
170
|
+
- summary: one-paragraph description of what was done
|
|
171
|
+
- verification_summary: what you ran to verify (command and outcome).
|
|
172
|
+
Example: "Ran `bun test` -- 14 tests passed, 0 failed."
|
|
173
|
+
If no verification was possible, say why.
|
|
174
|
+
</output_contract>
|
|
175
|
+
```
|
|
176
|
+
|
|
177
|
+
## Result Schema
|
|
178
|
+
|
|
179
|
+
Write the result schema to `<scratch-dir>/result-schema.json` (using the absolute path captured at the start) once at the start of delegated execution:
|
|
180
|
+
|
|
181
|
+
```json
|
|
182
|
+
{
|
|
183
|
+
"type": "object",
|
|
184
|
+
"properties": {
|
|
185
|
+
"status": { "enum": ["completed", "partial", "failed"] },
|
|
186
|
+
"files_modified": { "type": "array", "items": { "type": "string" } },
|
|
187
|
+
"issues": { "type": "array", "items": { "type": "string" } },
|
|
188
|
+
"summary": { "type": "string" },
|
|
189
|
+
"verification_summary": { "type": "string" }
|
|
190
|
+
},
|
|
191
|
+
"required": ["status", "files_modified", "issues", "summary", "verification_summary"],
|
|
192
|
+
"additionalProperties": false
|
|
193
|
+
}
|
|
194
|
+
```
|
|
195
|
+
|
|
196
|
+
Each batch's result is written to `<scratch-dir>/result-batch-<batch-num>.json` via the `-o` flag. On plan failure, files are left in place for debugging.
|
|
197
|
+
|
|
198
|
+
If the result JSON is absent or malformed after a successful exit code, classify as task failure.
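The absent-or-malformed check can be sketched with `jq`, assuming it is available in the orchestrator's environment; the sample result file written here is illustrative only:

```bash
# Sketch of the absent-or-malformed result check, assuming jq is available.
# The sample result file written here is illustrative only.
RESULT_FILE="$(mktemp -d)/result-batch-1.json"
printf '%s\n' '{"status":"completed","files_modified":[],"issues":[],"summary":"ok","verification_summary":"ran tests"}' > "$RESULT_FILE"

# -s catches the absent/empty case; `jq -e '.status'` exits non-zero on
# unparseable JSON or a missing/null status field.
if [ -s "$RESULT_FILE" ] && jq -e '.status' "$RESULT_FILE" >/dev/null 2>&1; then
  CLASSIFICATION="$(jq -r '.status' "$RESULT_FILE")"   # completed | partial | failed
else
  CLASSIFICATION="task-failure"   # missing or malformed result JSON
fi
echo "$CLASSIFICATION"
```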
## Execution Loop

Initialize a `consecutive_failures` counter at 0 before the first batch.

**Clean-baseline preflight:** Before the first batch, verify there are no uncommitted changes to tracked files:

```bash
git diff --quiet HEAD
```

This intentionally ignores untracked files. Only staged or unstaged modifications to tracked files make rollback unsafe. However, if untracked files exist at paths in the batch's planned Files list, rollback (`git clean -fd -- <paths>`) would delete them. If such overlaps are detected, warn the user and recommend committing or stashing those files before proceeding.
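The overlap check can be sketched with `git ls-files`; the throwaway repo and `PLANNED_PATHS` value here are illustrative, since in the real workflow the paths come from the batch's combined Files list:

```bash
# Sketch of the untracked-overlap check in a throwaway repo. PLANNED_PATHS
# is illustrative; in the real workflow it comes from the batch's Files list.
REPO="$(mktemp -d)"
cd "$REPO"
git init -q .
mkdir -p src docs
echo 'stub' > src/new-module.ts   # untracked file sitting at a planned path

PLANNED_PATHS="src/new-module.ts docs/new-page.md"
# Any output here is a file that `git clean -fd -- <paths>` would delete.
OVERLAP="$(git ls-files --others --exclude-standard -- $PLANNED_PATHS)"
if [ -n "$OVERLAP" ]; then
  echo "WARNING: rollback would delete untracked files:"
  echo "$OVERLAP"
fi
```

`--others --exclude-standard` lists only untracked, non-ignored files, and unmatched pathspecs (planned files not yet created) are silently skipped, which is exactly the behavior the warning needs.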
If tracked files are dirty, stop and present options: (1) commit current changes, (2) stash explicitly (`git stash push -m "pre-delegation"`), (3) continue in standard mode (sets `delegation_active` to false). Do not auto-stash user changes.

**Delegation invocation:** For each batch, execute these as **separate Bash tool calls** (not combined into one):

**Step A — Launch (background, separate Bash call):**

Write the prompt file, then make a single Bash tool call with `run_in_background: true` set on the tool parameter. This call returns immediately and has no timeout ceiling.

Substitute the literal absolute path captured at setup for every `<scratch-dir>` below. Each Bash tool call starts a fresh shell, so the `$SCRATCH_DIR` variable from the setup snippet is not preserved — an unresolved `$SCRATCH_DIR` would expand to empty and break result detection.

```bash
# Substitute the resolved sandbox_mode value (yolo or full-auto) from the skill state
SANDBOX_MODE="<sandbox_mode>"

# Resolve sandbox flag
if [ "$SANDBOX_MODE" = "full-auto" ]; then
  SANDBOX_FLAG="--full-auto"
else
  SANDBOX_FLAG="--dangerously-bypass-approvals-and-sandbox"
fi

codex exec \
  $SANDBOX_FLAG \
  --output-schema "<scratch-dir>/result-schema.json" \
  -o "<scratch-dir>/result-batch-<batch-num>.json" \
  - < "<scratch-dir>/prompt-batch-<batch-num>.md"
```

**Conditional flags** — only include each line when the corresponding skill-state value is set:

- If `delegate_model` is set, insert `-m "<delegate_model>" \` as a line before `$SANDBOX_FLAG`.
- If `delegate_effort` is set, insert `-c 'model_reasoning_effort="<delegate_effort>"' \` as a line before `$SANDBOX_FLAG`.

When either value is unset, omit its line entirely — Codex resolves the default from the user's `~/.codex/config.toml` (and ultimately the CLI's own built-in default). Do not substitute a placeholder string for unset values.

Critical: `run_in_background: true` must be set as a **Bash tool parameter**, not as a shell `&` suffix. The tool parameter is what removes the timeout ceiling. A shell `&` inside a foreground Bash call still hits the 2-minute default timeout.

Quoting is critical for the `-c` flag when present: use single quotes around the entire key=value pair and double quotes around the TOML string value inside. Example: `-c 'model_reasoning_effort="high"'`.

Do not improvise CLI flags or modify this invocation template beyond the documented conditional insertions.

**Step B — Poll (foreground, separate Bash calls):**

After the launch call returns, make a **new, separate** foreground Bash tool call that polls for the result file. This keeps the agent's turn active so the user cannot interfere with the working tree.

Substitute the literal absolute path captured at setup for `<scratch-dir>`. The shell variable from Step A does not survive across separate Bash tool calls.

```bash
RESULT_FILE="<scratch-dir>/result-batch-<batch-num>.json"
for i in $(seq 1 6); do
  test -s "$RESULT_FILE" && echo "DONE" && exit 0
  sleep 10
done
echo "Waiting for Codex..."
```

If the output is "Waiting for Codex...", issue the same polling command again as another separate Bash call. Repeat until the output is "DONE", then read the result file and proceed to classification.
**Polling termination conditions:** Stop polling when any of these conditions is met:

- **Result file appears** (output is "DONE") -- proceed to result classification normally.
- **Background process exits with a non-zero code** -- classify as CLI failure (row 1). Rollback and fall back to standard mode.
- **Background process exits with a zero code but the result file is absent** -- classify as task failure (row 2: exit 0, result JSON missing). Rollback and increment `consecutive_failures`.
- **5 polling rounds** elapse (~5 minutes) without the result file appearing and without a background process notification -- treat as a hung process. Classify as CLI failure (row 1). Rollback and fall back to standard mode.
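Distinguishing "still running" from "exited without a result" can be sketched with a liveness probe on the background PID. The backgrounded `sleep` here is a stand-in for the `codex exec` launch, and `CODEX_PID` is an illustrative name:

```bash
# Sketch of a liveness probe for the hung-process case. The background
# `sleep` stands in for `codex exec`; CODEX_PID would come from that launch.
sleep 30 &
CODEX_PID=$!

RESULT_FILE="$(mktemp -d)/result-batch-1.json"   # illustrative path, empty here
if [ -s "$RESULT_FILE" ]; then
  STATE="done"                 # result file appeared: classify normally
elif kill -0 "$CODEX_PID" 2>/dev/null; then
  STATE="running"              # process alive, no result yet: keep polling
else
  STATE="exited-no-result"     # process gone without a result: rows 1-2 apply
fi
kill "$CODEX_PID" 2>/dev/null || true   # clean up the stand-in process
echo "$STATE"
```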
**Result classification:** Codex is responsible for running verification internally and fixing failures before reporting -- the orchestrator does not re-run verification independently.

| # | Signal | Classification | Action |
|---|--------|----------------|--------|
| 1 | Exit code != 0 | CLI failure | Rollback to HEAD. Fall back to standard mode for ALL remaining work. |
| 2 | Exit code 0, result JSON missing or malformed | Task failure | Rollback to HEAD. Increment `consecutive_failures`. |
| 3 | Exit code 0, `status: "failed"` | Task failure | Rollback to HEAD. Increment `consecutive_failures`. |
| 4 | Exit code 0, `status: "partial"` | Partial success | Keep the diff. Complete remaining work locally, verify, and commit. Increment `consecutive_failures`. |
| 5 | Exit code 0, `status: "completed"` | Success | Commit changes. Reset `consecutive_failures` to 0. |

**Result handoff — surface to user:** After reading the result JSON and before committing or rolling back, display a summary so the user sees what happened. Format:

> **Codex batch <batch-num> — <classification>**
> <summary from result JSON>
>
> **Files:** <comma-separated list from files_modified>
> **Verification:** <verification_summary from result JSON>
> **Issues:** <issues list, or "None">

On failure or partial results, include the classification reason (e.g., "status: failed", "result JSON missing") so the user understands why the orchestrator is rolling back or completing locally.

Keep this brief — the goal is transparency, not a wall of text. One short block per batch.

**Rollback procedure:**

```bash
git checkout -- .
git clean -fd -- <paths from the batch's combined Files list>
```

Do NOT use bare `git clean -fd` without path arguments.

**Commit on success:**

```bash
git add $(git diff --name-only HEAD; git ls-files --others --exclude-standard)
git commit -m "feat(<scope>): <batch summary>"
```

**Between batches** (plans split into multiple batches): Report what completed, test results, and what's next. Continue immediately unless the user intervenes -- the checkpoint exists so the user *can* steer, not so they *must*.

**Circuit breaker:** After 3 consecutive failures, set `delegation_active` to false and emit: "Codex delegation disabled after 3 consecutive failures -- completing remaining units in standard mode."

**Scratch cleanup:** No explicit cleanup is needed — OS temp handles eventual cleanup (macOS `$TMPDIR` periodic purge; Linux/WSL `/tmp` reboot or periodic cleanup). Leaving `<scratch-dir>` in place after the run also preserves intermediate artifacts for debugging if anything went wrong.

## Mixed-Model Attribution

When some units are executed by Codex and others locally:

- If all units used delegation: attribute to the Codex model.
- If all units used standard mode: attribute to the current agent's model.
- If mixed: note which units were delegated in the PR description and credit both models.