waypoint-codex 0.10.9 → 0.10.11

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (37)
  1. package/README.md +4 -3
  2. package/dist/src/core.js +166 -14
  3. package/package.json +1 -1
  4. package/templates/.agents/skills/adversarial-review/SKILL.md +85 -0
  5. package/templates/.agents/skills/adversarial-review/agents/openai.yaml +4 -0
  6. package/templates/.agents/skills/backend-context-interview/SKILL.md +100 -50
  7. package/templates/.agents/skills/backend-context-interview/agents/openai.yaml +2 -2
  8. package/templates/.agents/skills/backend-ship-audit/SKILL.md +23 -25
  9. package/templates/.agents/skills/backend-ship-audit/agents/openai.yaml +4 -3
  10. package/templates/.agents/skills/code-guide-audit/SKILL.md +22 -1
  11. package/templates/.agents/skills/code-guide-audit/agents/openai.yaml +2 -2
  12. package/templates/.agents/skills/conversation-retrospective/SKILL.md +22 -1
  13. package/templates/.agents/skills/conversation-retrospective/agents/openai.yaml +1 -1
  14. package/templates/.agents/skills/docs-sync/SKILL.md +22 -1
  15. package/templates/.agents/skills/docs-sync/agents/openai.yaml +1 -1
  16. package/templates/.agents/skills/frontend-context-interview/SKILL.md +19 -0
  17. package/templates/.agents/skills/frontend-context-interview/agents/openai.yaml +2 -2
  18. package/templates/.agents/skills/frontend-ship-audit/SKILL.md +20 -0
  19. package/templates/.agents/skills/frontend-ship-audit/agents/openai.yaml +4 -3
  20. package/templates/.agents/skills/planning/SKILL.md +20 -0
  21. package/templates/.agents/skills/planning/agents/openai.yaml +1 -1
  22. package/templates/.agents/skills/pr-review/SKILL.md +20 -0
  23. package/templates/.agents/skills/pr-review/agents/openai.yaml +1 -1
  24. package/templates/.agents/skills/pre-pr-hygiene/SKILL.md +20 -0
  25. package/templates/.agents/skills/pre-pr-hygiene/agents/openai.yaml +1 -1
  26. package/templates/.agents/skills/visual-explanations/SKILL.md +14 -0
  27. package/templates/.agents/skills/visual-explanations/agents/openai.yaml +1 -1
  28. package/templates/.agents/skills/work-tracker/SKILL.md +20 -0
  29. package/templates/.agents/skills/work-tracker/agents/openai.yaml +1 -1
  30. package/templates/.agents/skills/workspace-compress/SKILL.md +20 -0
  31. package/templates/.agents/skills/workspace-compress/agents/openai.yaml +1 -1
  32. package/templates/.codex/agents/code-health-reviewer.toml +2 -1
  33. package/templates/.codex/agents/code-reviewer.toml +2 -1
  34. package/templates/.gitignore.snippet +2 -0
  35. package/templates/.waypoint/SOUL.md +1 -1
  36. package/templates/.waypoint/agent-operating-manual.md +13 -14
  37. package/templates/managed-agents-block.md +1 -2
package/README.md CHANGED
@@ -37,7 +37,7 @@ The philosophy is simple:
  - more markdown
  - better continuity for the next agent
 
- By default, Waypoint appends a `.gitignore` snippet that ignores the exact Waypoint-created skill directories and reviewer-agent config files, plus everything under `.waypoint/` except `.waypoint/docs/`, while still ignoring the scaffolded `.waypoint/docs/README.md` and `.waypoint/docs/code-guide.md` assets. User-authored durable docs stay trackable; workspace, context, indexes, and other operational state remain local.
+ By default, Waypoint keeps its `.gitignore` rules inside a comment-delimited `# Waypoint state` section. That section ignores the exact Waypoint-created skill directories and reviewer-agent config files, plus everything under `.waypoint/` except `.waypoint/docs/`, while still ignoring the scaffolded `.waypoint/docs/README.md` and `.waypoint/docs/code-guide.md` assets. User-authored durable docs stay trackable; workspace, context, indexes, and other operational state remain local.
 
  ## Best fit
 
@@ -136,6 +136,7 @@ Waypoint ships a strong default skill pack for real coding work:
  - `work-tracker`
  - `docs-sync`
  - `code-guide-audit`
+ - `adversarial-review`
  - `break-it-qa`
  - `conversation-retrospective`
  - `frontend-ship-audit`
@@ -157,9 +158,9 @@ Waypoint scaffolds these reviewer agents by default:
  - `code-reviewer`
  - `plan-reviewer`
 
- The intended workflow is closeout-based: run `code-reviewer` before considering any non-trivial implementation slice complete, and run `code-health-reviewer` before considering medium or large changes complete, especially when they add structure, duplicate logic, or introduce new abstractions. If both apply, run them in parallel. A recent self-authored commit is the preferred scope anchor when one cleanly represents the slice, but it is not the only valid trigger. Reviewer agents are one-shot workers: once a reviewer returns findings, close it, and if another pass is needed later, spawn a fresh reviewer instead of reusing the old thread.
+ The intended workflow is closeout-based: run `adversarial-review` before considering any non-trivial implementation slice complete. That skill scopes the current slice, runs `code-reviewer`, runs `code-health-reviewer` when the change is medium or large or otherwise structurally risky, runs `code-guide-audit`, waits as long as needed, fixes meaningful findings, and reruns fresh reviewer rounds until no meaningful findings remain. A recent self-authored commit is the preferred scope anchor when one cleanly represents the slice, but it is not the only valid trigger. Reviewer agents are one-shot workers: once a reviewer returns findings, close it, and if another pass is needed later, spawn a fresh reviewer instead of reusing the old thread.
 
- The shipped reviewer configs now default to `gpt-5.4` with `high` reasoning, and the main-agent guidance explicitly tells Codex to pass the same `model` and `reasoning_effort` values whenever it spawns reviewer agents or other subagents. The reviewer prompts also treat the diff as a starting pointer rather than the review itself: they must read each changed file in full, expand into related files, and only then conclude.
+ The shipped reviewer configs now default to `gpt-5.4` with `high` reasoning, and the main-agent guidance explicitly tells Codex to pass `fork_context: false` plus the same `model` and `reasoning_effort` values whenever it spawns reviewer agents or other subagents. The reviewer prompts also treat the diff as a starting pointer rather than the review itself: they must read each changed file in full, expand into related files, and only then conclude.
 
  For planning work, run `plan-reviewer` before presenting a non-trivial implementation plan to the user and iterate until it has no meaningful review findings left. Each pass should use a fresh `plan-reviewer` agent rather than reusing a previous reviewer thread.
 
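As an illustrative sketch of the managed section described above (abridged; the full rule list ships in the package's `.gitignore.snippet` template, and the delimiter comments match the `GITIGNORE_WAYPOINT_START`/`GITIGNORE_WAYPOINT_END` constants in `core.js`):

```
# Waypoint state
.codex/
.agents/skills/
.waypoint/*
!.waypoint/docs/
!.waypoint/docs/**
.waypoint/docs/README.md
.waypoint/docs/code-guide.md
# End Waypoint state
```

The `!.waypoint/docs/` negations re-include user-authored docs, while the two explicit file rules below them keep the scaffolded doc assets ignored.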
package/dist/src/core.js CHANGED
@@ -10,6 +10,59 @@ const DEFAULT_DOCS_INDEX = ".waypoint/DOCS_INDEX.md";
  const DEFAULT_TRACK_DIR = ".waypoint/track";
  const DEFAULT_TRACKS_INDEX = ".waypoint/TRACKS_INDEX.md";
  const DEFAULT_WORKSPACE = ".waypoint/WORKSPACE.md";
+ const GITIGNORE_WAYPOINT_START = "# Waypoint state";
+ const GITIGNORE_WAYPOINT_END = "# End Waypoint state";
+ const LEGACY_WAYPOINT_GITIGNORE_RULES = new Set([
+ ".codex/",
+ ".codex/config.toml",
+ ".codex/agents/",
+ ".codex/agents/code-reviewer.toml",
+ ".codex/agents/code-health-reviewer.toml",
+ ".codex/agents/plan-reviewer.toml",
+ ".agents/",
+ ".agents/skills/",
+ ".agents/skills/planning/",
+ ".agents/skills/work-tracker/",
+ ".agents/skills/docs-sync/",
+ ".agents/skills/code-guide-audit/",
+ ".agents/skills/adversarial-review/",
+ ".agents/skills/visual-explanations/",
+ ".agents/skills/break-it-qa/",
+ ".agents/skills/frontend-context-interview/",
+ ".agents/skills/backend-context-interview/",
+ ".agents/skills/frontend-ship-audit/",
+ ".agents/skills/backend-ship-audit/",
+ ".agents/skills/conversation-retrospective/",
+ ".agents/skills/workspace-compress/",
+ ".agents/skills/pre-pr-hygiene/",
+ ".agents/skills/pr-review/",
+ ".waypoint/",
+ ".waypoint/DOCS_INDEX.md",
+ ".waypoint/state/",
+ ".waypoint/context/",
+ ".waypoint/*",
+ "!.waypoint/docs/",
+ "!.waypoint/docs/**",
+ ".waypoint/docs/README.md",
+ ".waypoint/docs/code-guide.md",
+ ]);
+ const SHIPPED_SKILL_NAMES = [
+ "planning",
+ "work-tracker",
+ "docs-sync",
+ "code-guide-audit",
+ "adversarial-review",
+ "visual-explanations",
+ "break-it-qa",
+ "conversation-retrospective",
+ "workspace-compress",
+ "pre-pr-hygiene",
+ "pr-review",
+ "frontend-context-interview",
+ "backend-context-interview",
+ "frontend-ship-audit",
+ "backend-ship-audit",
+ ];
  const TIMESTAMPED_WORKSPACE_SECTIONS = new Set([
  "## Active Trackers",
  "## Current State",
@@ -49,15 +102,114 @@ function migrateLegacyRootFiles(projectRoot) {
  function appendGitignoreSnippet(projectRoot) {
  const gitignorePath = path.join(projectRoot, ".gitignore");
  const snippet = readTemplate(".gitignore.snippet").trim();
+ const snippetLines = snippet.split("\n");
  if (!existsSync(gitignorePath)) {
  writeText(gitignorePath, `${snippet}\n`);
  return;
  }
  const content = readFileSync(gitignorePath, "utf8");
- if (content.includes(snippet)) {
+ const normalizedLines = content.split(/\r?\n/);
+ const normalizedContent = normalizedLines.join("\n");
+ const headerCount = normalizedLines.filter((line) => line === GITIGNORE_WAYPOINT_START).length;
+ if (normalizedContent.includes(snippet) && headerCount <= 1) {
+ return;
+ }
+ const startIndex = normalizedLines.findIndex((line) => line === snippetLines[0]);
+ if (startIndex === -1) {
+ writeText(gitignorePath, `${content.trimEnd()}\n\n${snippet}\n`);
+ return;
+ }
+ const managedLineSet = new Set(snippetLines);
+ const endIndex = findWaypointGitignoreBlockEnd(normalizedLines, startIndex);
+ if (endIndex === -1) {
+ writeText(gitignorePath, `${content.trimEnd()}\n\n${snippet}\n`);
  return;
  }
- writeText(gitignorePath, `${content.trimEnd()}\n\n${snippet}\n`);
+ const hasForeignLineInsideBlock = normalizedLines
+ .slice(startIndex + 1, endIndex)
+ .some((line) => line.length > 0 && !isManagedWaypointGitignoreLine(line, managedLineSet));
+ const trailingLines = stripSubsequentWaypointGitignoreBlocks(normalizedLines.slice(endIndex + 1), managedLineSet);
+ if (hasForeignLineInsideBlock) {
+ const foreignLines = normalizedLines
+ .slice(startIndex + 1, endIndex)
+ .filter((line) => line.length > 0 && !isManagedWaypointGitignoreLine(line, managedLineSet))
+ .join("\n");
+ const before = normalizedLines.slice(0, startIndex).join("\n").trimEnd();
+ const after = trailingLines.join("\n").trimStart();
+ const merged = [before, snippet, foreignLines, after].filter((piece) => piece.length > 0).join("\n\n");
+ writeText(gitignorePath, `${merged}\n`);
+ return;
+ }
+ const before = normalizedLines.slice(0, startIndex).join("\n").trimEnd();
+ const after = trailingLines.join("\n").trimStart();
+ const merged = [before, snippet, after].filter((piece) => piece.length > 0).join("\n\n");
+ writeText(gitignorePath, `${merged}\n`);
+ }
+ function findWaypointGitignoreBlockEnd(lines, startIndex) {
+ const explicitEndIndex = lines.findIndex((line, index) => index > startIndex && line === GITIGNORE_WAYPOINT_END);
+ if (explicitEndIndex !== -1) {
+ return explicitEndIndex;
+ }
+ return findLegacyWaypointGitignoreBlockEnd(lines, startIndex);
+ }
+ function findLegacyWaypointGitignoreBlockEnd(lines, startIndex) {
+ let scanEndExclusive = lines.length;
+ for (let index = startIndex + 1; index < lines.length; index += 1) {
+ const line = lines[index];
+ if (line.length === 0) {
+ scanEndExclusive = index;
+ break;
+ }
+ if (line.startsWith("#") && line !== GITIGNORE_WAYPOINT_START) {
+ scanEndExclusive = index;
+ break;
+ }
+ }
+ let endIndex = -1;
+ for (let index = startIndex + 1; index < scanEndExclusive; index += 1) {
+ if (isLegacyWaypointGitignoreRule(lines[index])) {
+ endIndex = index;
+ }
+ }
+ return endIndex;
+ }
+ function isLegacyWaypointGitignoreRule(line) {
+ const normalizedLine = line.startsWith("/") ? line.slice(1) : line;
+ return LEGACY_WAYPOINT_GITIGNORE_RULES.has(normalizedLine);
+ }
+ function isManagedWaypointGitignoreLine(line, managedLineSet) {
+ return managedLineSet.has(line) || isLegacyWaypointGitignoreRule(line);
+ }
+ function stripSubsequentWaypointGitignoreBlocks(lines, managedLineSet) {
+ const keptLines = [];
+ let index = 0;
+ while (index < lines.length) {
+ if (lines[index] !== GITIGNORE_WAYPOINT_START) {
+ keptLines.push(lines[index]);
+ index += 1;
+ continue;
+ }
+ const endIndex = findWaypointGitignoreBlockEnd(lines, index);
+ if (endIndex === -1) {
+ keptLines.push(lines[index]);
+ index += 1;
+ continue;
+ }
+ const foreignLines = lines
+ .slice(index + 1, endIndex)
+ .filter((line) => line.length > 0 && !isManagedWaypointGitignoreLine(line, managedLineSet));
+ if (foreignLines.length > 0) {
+ if (keptLines.length > 0 && keptLines[keptLines.length - 1] !== "") {
+ keptLines.push("");
+ }
+ keptLines.push(...foreignLines);
+ if (endIndex + 1 < lines.length && lines[endIndex + 1] !== "") {
+ keptLines.push("");
+ }
+ }
+ index = endIndex + 1;
+ }
+ return keptLines;
  }
  function upsertManagedBlock(filePath, block) {
  if (!existsSync(filePath)) {
@@ -377,18 +529,7 @@ export function doctorRepository(projectRoot) {
  paths: [workspacePath, ...tracksIndex.activeTrackPaths.map((trackPath) => path.join(projectRoot, trackPath))],
  });
  }
- for (const skillName of [
- "planning",
- "work-tracker",
- "docs-sync",
- "code-guide-audit",
- "visual-explanations",
- "break-it-qa",
- "conversation-retrospective",
- "workspace-compress",
- "pre-pr-hygiene",
- "pr-review",
- ]) {
+ for (const skillName of SHIPPED_SKILL_NAMES) {
  const skillPath = path.join(projectRoot, ".agents/skills", skillName, "SKILL.md");
  if (!existsSync(skillPath)) {
  findings.push({
@@ -398,6 +539,17 @@ export function doctorRepository(projectRoot) {
  remediation: "Run `waypoint init` to restore repo-local skills.",
  paths: [skillPath],
  });
+ continue;
+ }
+ const metadataPath = path.join(projectRoot, ".agents/skills", skillName, "agents", "openai.yaml");
+ if (!existsSync(metadataPath)) {
+ findings.push({
+ severity: "error",
+ category: "skills",
+ message: `Repo skill \`${skillName}\` metadata is missing.`,
+ remediation: "Run `waypoint init` to restore repo-local skill metadata.",
+ paths: [metadataPath],
+ });
  }
  }
  const codexConfigPath = path.join(projectRoot, ".codex/config.toml");
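To make the merge behavior above easier to follow, here is a minimal standalone sketch of the same idea: keep exactly one comment-delimited managed section, preserve any user lines that drifted inside it, and avoid duplicating the block. This is an illustration, not the shipped implementation; the real `appendGitignoreSnippet` also handles legacy blocks without an end marker and CRLF input.

```javascript
// Simplified sketch of the managed-section merge idea used by
// appendGitignoreSnippet (illustrative; names are hypothetical).
const START = "# Waypoint state";
const END = "# End Waypoint state";

function mergeManagedSection(content, snippetLines) {
  const managed = new Set([START, END, ...snippetLines]);
  const lines = content.split(/\r?\n/);
  const snippet = [START, ...snippetLines, END].join("\n");
  const start = lines.indexOf(START);
  if (start === -1) {
    // No managed section yet: append a fresh one.
    return `${content.trimEnd()}\n\n${snippet}\n`;
  }
  const end = lines.indexOf(END, start + 1);
  const body = end === -1 ? lines.slice(start + 1) : lines.slice(start + 1, end);
  // User-authored lines found inside the managed section survive the rewrite.
  const foreign = body.filter((l) => l.length > 0 && !managed.has(l));
  const before = lines.slice(0, start).join("\n").trimEnd();
  const after = end === -1 ? "" : lines.slice(end + 1).join("\n").trim();
  return (
    [before, snippet, foreign.join("\n"), after]
      .filter((piece) => piece.length > 0)
      .join("\n\n") + "\n"
  );
}
```

Running this twice over the same file is idempotent, because the rewritten section always matches the snippet exactly.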
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "waypoint-codex",
- "version": "0.10.9",
+ "version": "0.10.11",
  "description": "Codex-native repository operating system: scaffolding, docs routing, repo-local skills, doctor, and sync.",
  "license": "MIT",
  "type": "module",
package/templates/.agents/skills/adversarial-review/SKILL.md ADDED
@@ -0,0 +1,85 @@
+ ---
+ name: adversarial-review
+ description: Close out a meaningful implementation slice with the full iterative review loop. Use when the user asks for a final review pass, asks to "close the loop," asks whether work is ready to call done, or when Codex is about to say a non-trivial code change is complete. This skill scopes the slice, runs `code-reviewer`, runs `code-health-reviewer` when the change is medium or large or structurally risky, runs `code-guide-audit`, waits for the required outputs, fixes real findings, and repeats with fresh rounds until no meaningful issues remain. Do not use this for tiny obvious edits, pre-implementation plan review, or active PR comment triage.
+ ---
+
+ # Adversarial Review
+
+ Use this skill to close the loop on implementation work instead of treating review as a one-shot pass.
+
+ This skill owns the default closeout workflow for a reviewable slice. It coordinates the specialist reviewers, keeps the scope tight, waits as long as needed, fixes meaningful findings, and reruns fresh review rounds until the remaining feedback is only optional polish or no findings at all.
+
+ ## When To Skip This Skill
+
+ - Skip it for tiny obvious edits where launching the full closeout loop would be noise.
+ - Skip it for pre-implementation planning; that is `plan-reviewer` territory.
+ - Skip it for active PR comment back-and-forth; use `pr-review` for that workflow.
+ - Skip it when the user wants a one-off targeted coding-guide check and not the full closeout loop; use `code-guide-audit` directly in that case.
+
+ ## Step 1: Define The Reviewable Slice
+
+ - Resolve the exact slice you are trying to close out before launching reviewers.
+ - Prefer a recent self-authored commit when one cleanly represents the slice.
+ - Otherwise use the current changed files, diff, or feature path.
+ - Pass the reviewers the same concrete scope anchor, plus a short plain-English summary of what changed.
+ - If the scope is muddy, tighten it before review instead of asking the reviewers to figure it out from an entire worktree.
+
+ ## Step 2: Launch The Required Reviewers
+
+ - Spawn `code-reviewer` for every non-trivial implementation slice.
+ - Spawn `code-health-reviewer` when the change is medium or large, especially when it adds structure, duplicates logic, or introduces new abstractions.
+ - Run `code-guide-audit` on the same scoped slice as part of the closeout loop.
+ - Launch the reviewer agents with `fork_context: false`, `model: gpt-5.4`, and `reasoning_effort: high` unless the user explicitly asked for something else.
+ - Tell the reviewer agents what changed, what scope anchor to use, and which files or feature area represent the slice under review.
+ - When both reviewer agents apply, launch them in parallel.
+
+ ## Step 3: Wait For The Round To Finish
+
+ - Wait for every required reviewer result, no matter how long it takes.
+ - Do not interrupt slow reviewer agents just because they are still running.
+ - Do not call the work done while a required reviewer round is still in flight.
+ - Read the full reviewer outputs before deciding what to fix.
+
+ ## Step 4: Fix Meaningful Findings
+
+ - Fix real correctness, regression, maintainability, and code-guide issues.
+ - Rerun the most relevant verification for the changed area after the fixes.
+ - If a reviewer comment is only a nit or clearly optional polish, note that distinction and do not keep reopening the loop just to satisfy minor taste differences.
+ - If a finding changes durable behavior or repo memory, update the relevant docs and workspace state before the next round.
+
+ ## Step 5: Close The Old Review Round
+
+ - Treat `code-reviewer` and `code-health-reviewer` as one-shot reviewer agents.
+ - After you have read a reviewer result, close that reviewer thread.
+ - If another pass is needed later, spawn a fresh reviewer instead of reusing the old thread.
+
+ ## Step 6: Repeat Until The Slice Is Actually Clear
+
+ - Start a fresh round whenever you made meaningful fixes in response to the previous round.
+ - Reuse the same scope anchor when it still represents the slice cleanly; otherwise hand the new round the updated changed-file set or follow-up commit.
+ - Rerun `code-guide-audit` when the fixes materially changed guide-relevant behavior or when the previous round surfaced guide-related issues.
+ - Stop only when no meaningful findings remain. Optional polish and obvious nitpicks do not block closeout.
+
+ ## Step 7: Report The Closeout State
+
+ Summarize:
+
+ - what scope was reviewed
+ - which reviewers ran
+ - what meaningful issues were fixed
+ - what verification ran
+ - whether the slice is now clear or what still blocks it
+
+ ## Gotchas
+
+ - Fresh reviewer rounds matter. If you make meaningful fixes, do not treat older reviewer findings as if they still describe the current code.
+ - Green local tests are not enough if required reviewer threads are still running. Wait for the actual reviewer outputs before calling the slice done.
+ - Close reviewer agents after each round. Reusing a stale reviewer thread weakens the signal and blurs which code state the findings apply to.
+ - When this loop changes repo-health or upgrade behavior, test real old-repo edge cases, not just fresh-init cases.
+ - If a reviewer result is clean, it should still name the key paths and related files it checked. A "looks fine" skim is not a real closeout pass.
+
+ ## Keep This Skill Sharp
+
+ - After meaningful runs, add new gotchas when the same review-loop failure, stale-review mistake, or repo-upgrade edge case is likely to happen again.
+ - Tighten the description if the skill fires too broadly or misses real prompts like "final review pass" or "before we call this done."
+ - If the loop keeps re-creating the same helper logic or review instructions, move that reusable logic into the skill or its supporting resources instead of leaving it in chat.
package/templates/.agents/skills/adversarial-review/agents/openai.yaml ADDED
@@ -0,0 +1,4 @@
+ interface:
+ display_name: "Adversarial Review"
+ short_description: "Close out a code slice with iterative review"
+ default_prompt: "Use $adversarial-review to close out this non-trivial implementation slice before calling it done."
@@ -1,80 +1,130 @@
1
1
  ---
2
2
  name: backend-context-interview
3
- description: Gather and persist durable backend project context when missing or insufficient for implementation, architecture decisions, or ship-readiness review. Use this to ask project-level questions about deployment reality, scale, criticality, compatibility, tenant model, security posture, reliability expectations, and other durable backend context that is not clearly documented. This is not a feature-discovery skill.
3
+ description: Gather and persist durable backend project context when it is missing, stale, contradictory, or too weak to support implementation, architecture, migration, or ship-readiness decisions. Use when a task needs project-level backend facts such as deployment reality, user exposure, scale, criticality, tenant model, compatibility expectations, security posture, observability expectations, or compliance constraints, and those facts are not already clear in `AGENTS.md` or the repo docs. This is not a feature-discovery skill and should not trigger for endpoint details, acceptance criteria, or other task-specific product questions.
4
4
  ---
5
5
 
6
6
  # Backend Context Interview
7
7
 
8
- Use this skill when relevant backend project context is missing, stale, contradictory, or too weak to support correct implementation or review decisions.
8
+ Use this skill to fill in missing backend operating context only when that context materially changes the right engineering choices.
9
9
 
10
- ## Goals
10
+ This skill is for durable project truth, not for feature planning. It should reduce repeated questioning, not create a habit of interviewing the user every time a backend task appears.
11
11
 
12
- 1. identify the missing backend context that materially affects the work
13
- 2. ask only high-leverage questions that cannot be answered from the repo or guidance files
14
- 3. persist durable context into the project root guidance file
15
- 4. avoid repeated questioning in future tasks
12
+ ## What This Skill Owns
16
13
 
17
- This skill is for project-level operating context, not feature requirements gathering.
14
+ - identify which backend context facts are still missing after reading the repo guidance
15
+ - ask only the smallest set of high-leverage project questions
16
+ - persist the durable answers into the project guidance layer
17
+ - avoid re-asking the same foundational backend questions later
18
18
 
19
- ## When to use
19
+ ## When To Use This Skill
20
20
 
21
- Use this skill when the current task depends on context such as:
22
- - internal tool vs public internet-facing product
23
- - expected scale, concurrency, and criticality
24
- - regulatory, privacy, or compliance requirements
25
- - multi-tenant vs single-tenant behavior
26
- - backward compatibility requirements
27
- - uptime and reliability expectations
28
- - migration and rollback risk tolerance
29
- - security posture expectations
30
- - observability or incident response expectations
31
- - infrastructure constraints that materially affect design
32
-
33
- Do not use this skill when the answer is already clearly present in `AGENTS.md`, architecture docs, runbooks, or the task itself.
34
- Do not use this skill to ask about feature-specific behavior, UX details, endpoint shapes, acceptance criteria, implementation preferences, or other concrete requirements that belong in planning or normal task clarification.
21
+ Use it when the task depends on backend context such as:
22
+
23
+ - whether the system is internal, customer-facing, partner-facing, or public
24
+ - expected scale, concurrency, workload shape, or job intensity
25
+ - reliability, outage, or data-loss tolerance
26
+ - tenant model and isolation expectations
27
+ - backward compatibility or migration constraints
28
+ - security posture or exposure assumptions
29
+ - regulatory, privacy, or compliance constraints
30
+ - observability, incident response, or audit expectations
31
+ - infrastructure or deployment constraints that materially affect design
32
+
33
+ ## When Not To Use This Skill
34
+
35
+ - Do not use it when the answer is already clearly documented in `AGENTS.md`, architecture docs, runbooks, or durable repo guidance.
36
+ - Do not use it for feature-specific behavior, endpoint shapes, UX details, acceptance criteria, or implementation preferences.
37
+ - Do not use it as a substitute for planning. If the missing information is really about the feature itself, use planning or normal task clarification instead.
35
38
 
36
39
  ## Workflow
37
40
 
38
- ### 1. Check persisted context first
41
+ ### 1. Read Persisted Context First
39
42
 
40
- Inspect the project root guidance files.
43
+ - Read `AGENTS.md` first.
44
+ - Look for `## Project Context`, `## Backend Context`, or equivalent sections.
45
+ - Read the repo docs that are most likely to hold deployment, security, migration, or operating constraints.
46
+ - If the existing context is sufficient and credible, stop. Do not interview the user just because the skill triggered.
41
47
 
42
- Priority:
43
- 1. `AGENTS.md`
48
+ ### 2. Decide What Is Actually Missing
44
49
 
45
- Look for:
46
- - `## Project Context`
47
- - `## Backend Context`
48
- - equivalent sections with the same intent
50
+ Ask only about facts that would materially change implementation or review choices.
49
51
 
50
- If the existing section is accurate and sufficient, do not interview the user.
52
+ High-value examples:
51
53
 
52
- ### 2. Determine what is actually missing
54
+ - internal tool vs public internet-facing product
55
+ - low-risk internal automation vs business-critical system
56
+ - best-effort batch work vs strict correctness and rollback expectations
57
+ - single-tenant assumptions vs hard tenant isolation
58
+ - backward compatibility required vs safe to break old contracts
53
59
 
54
- Only ask questions that materially affect implementation or review choices.
60
+ Low-value examples:
55
61
 
56
- Good triggers:
57
- - public service vs internal tool changes reliability and security bar
58
- - scale and concurrency change architecture depth and observability expectations
59
- - compatibility requirements change migration and API decisions
60
- - tenant model changes authorization and data-isolation design
62
+ - "What should this feature do?"
63
+ - "What endpoint shape do you want?"
64
+ - "Should I use library X or Y?"
61
65
 
62
- Do not ask broad or low-value questions.
63
- Do not ask feature-specific product questions.
66
+ ### 3. Ask The Smallest Useful Interview
64
67
 
65
- ### 3. Ask concise grouped questions
68
+ - Group questions so the user can answer quickly.
69
+ - Prefer a few project-level questions over a long checklist.
70
+ - Ask only what the repo cannot already tell you.
66
71
 
67
- Ask the minimum set of questions needed.
72
+ Good categories:
68
73
 
69
- Suggested categories:
70
- - product type and exposure
74
+ - product exposure and real users
71
75
  - scale and criticality
76
+ - compatibility and migration safety
72
77
  - data sensitivity and compliance
78
+ - reliability and observability expectations
79
+
80
+ Good question shapes:
73
81
 
74
- Do not ask generic product questions that do not affect backend engineering.
75
- Do ask project-level questions like:
76
82
  - whether the product is internal, customer-facing, partner-facing, or public
77
- - whether there are real users yet or only development/staging use
78
- - expected traffic, concurrency, or import/job intensity
83
+ - whether there are real users yet or only dev/staging use
84
+ - expected traffic, concurrency, or job/import volume
  - whether backward compatibility is required
- - how costly outages, corruption, or security mistakes would be
+ - how costly outages, data corruption, or security mistakes would be
+
+ ### 4. Persist Only Durable Answers
+
+ - Write durable answers into the project root guidance file, normally `AGENTS.md`.
+ - Prefer a `## Backend Context` section if one exists.
+ - If there is no such section, add the smallest coherent backend-context section that matches the repo's guidance style.
+ - Persist stable operating facts, not one-off task details.
+ - Keep the wording concrete enough that a later agent can make decisions from it without rereading the whole conversation.
+
+ What belongs there:
+
+ - exposure level
+ - scale assumptions
+ - compatibility expectations
+ - tenant model
+ - security/compliance constraints
+ - reliability bar
+ - observability expectations
+
+ What does not belong there:
+
+ - current feature requirements
+ - temporary blockers
+ - one-off implementation notes
+ - chat-only phrasing that will age badly
+
+ ### 5. Reuse The Saved Context
+
+ - After persisting the answers, treat them as the new source of truth for later work.
+ - Do not ask the same foundational questions again unless the saved context is clearly stale or contradictory.
+
+ ## Gotchas
+
+ - Missing backend context is not the same as missing feature requirements. Do not drift into product discovery.
+ - If `AGENTS.md` already answers the important questions, stop there. Re-asking stable project questions is wasted user effort.
+ - Persist only durable operating truth. If the answer only matters for the current task, it does not belong in backend context.
+ - Do not ask broad "tell me about your backend" questions. Ask the few facts that would actually change architecture, migration, reliability, or security choices.
+ - If the repo gives partial answers, ask only the delta instead of restating the full questionnaire.
+
+ ## Keep This Skill Sharp
+
+ - After meaningful runs, add new gotchas when the same backend-context confusion or repeated over-questioning happens again.
+ - Tighten the description if the skill starts firing for feature-planning prompts instead of true project-context gaps.
+ - If the same persistence pattern or backend-context template keeps getting recreated, move that reusable guidance into this skill instead of relearning it in chat.
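To make step 4 concrete, here is a minimal sketch of what a persisted section could look like in `AGENTS.md`. The heading matches the skill's preferred `## Backend Context` section; every bullet value below is a hypothetical example, not a recommended default.

```markdown
## Backend Context

- Exposure: internal-only API behind the company VPN; no public traffic.
- Scale: low hundreds of requests per minute; nightly import of roughly 50k rows.
- Compatibility: consumers live in this repo; breaking changes are allowed with migration notes.
- Tenant model: single-tenant deployment per customer.
- Security/compliance: no regulated data; secrets come from the platform secret store.
- Reliability bar: brief downtime is tolerable; data loss or corruption is not.
- Observability: structured logs and error alerting are expected on new endpoints.
```

Each bullet maps to one item from the "What belongs there" list, so a later agent can scan the section and decide without rereading the original interview.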
@@ -1,4 +1,4 @@
  interface:
  display_name: "Backend Context Interview"
- short_description: "Ask only the backend context questions that materially change implementation or review decisions"
- default_prompt: "Use this skill to inspect persisted project context, identify only the missing backend deployment or reliability facts that materially affect the work, ask a concise high-leverage interview if needed, and persist durable Backend Context into AGENTS.md."
+ short_description: "Ask only the missing backend context questions"
+ default_prompt: "Use $backend-context-interview to inspect persisted guidance, ask only the missing backend project-context questions that materially affect the work, and persist durable Backend Context into AGENTS.md."
@@ -1,6 +1,6 @@
  ---
  name: backend-ship-audit
- description: Audit a backend scope for practical ship readiness with evidence-based findings focused on real release risk rather than style. Use when reviewing a backend service, feature, endpoint group, worker, scheduler, API surface, pull request, or directory to decide whether it is ready to ship; when the backend scope must be resolved from repository structure; when complete-file reading is required to understand behavior and dependencies; when only high-leverage deployment-context questions should be asked after repository exploration; when durable backend context should be persisted in project-root AGENTS.md; or when a timestamped audit should be written under .waypoint/audit/.
+ description: Audit a backend scope for practical ship readiness with evidence-based findings focused on real release risk rather than style. Use when the user asks whether a backend service, API, worker, scheduler, endpoint group, pull request, or backend directory is ready to ship; when Codex needs to perform a release-risk review before launch; when the audit scope must be resolved from repository structure; when complete-file reading is required to understand behavior and dependencies; when only high-leverage deployment-context questions should be asked after repository exploration; or when a timestamped backend audit should be written under `.waypoint/audit/`. Do not use this for frontend ship review, generic style review, PR comment triage, or a one-off coding-guide check.
  ---

  # Backend ship audit
@@ -11,6 +11,13 @@ Use bundled resources as follows:
  - Use `references/audit-framework.md` for detailed evaluation prompts and severity calibration.
  - Use `references/report-template.md` for the audit structure and finding format.

+ ## When Not To Use This Skill
+
+ - Skip it for frontend release review; use the frontend ship-audit workflow instead.
+ - Skip it for generic code review or maintainability review that is not explicitly about ship readiness.
+ - Skip it for active PR comment triage; use `pr-review` for that loop.
+ - Skip it for a one-off coding-guide compliance check on a narrow slice; use `code-guide-audit` for that job.
+
  ## 1. Resolve the reviewable unit

  Turn the user request into the narrowest defensible backend unit that can be audited end to end.
@@ -121,30 +128,7 @@ Make this edit manually. Prefer the smallest precise change that preserves all u

  Assess the scoped backend like a strong backend reviewer. Focus on real ship risk, not code taste.

- Evaluate at least these categories when relevant:
- - scope and architecture fit
- - API and contract quality
- - input validation and trust boundaries
- - domain modeling correctness
- - data integrity and consistency
- - transaction boundaries and idempotency
- - migration safety and rollback safety
- - failure handling and retry semantics
- - timeouts, cancellation, and backpressure
- - concurrency and race risks
- - queue and background job correctness
- - authorization and access control
- - authentication assumptions
- - secret handling and configuration safety
- - tenant isolation
- - security vulnerabilities and unsafe defaults
- - boundary clarity between layers and services
- - reliability under expected production conditions
- - observability, alertability, and debuggability
- - test coverage for meaningful failure modes
- - future legibility and maintainability as it affects shipping risk
-
- Use judgment. Do not force findings in every category.
+ Use `references/audit-framework.md` to drive the detailed category pass and severity calibration. Do not force findings in every category; use judgment and focus on the risks that actually matter for release readiness.

  Treat missing evidence carefully:
  - Missing tests, docs, or operational controls can be findings if the absence creates real release risk.
@@ -219,3 +203,17 @@ Do not include:
  - vague advice such as "add more tests" without naming the missing failure mode or blind spot

  Prefer a short audit with strong evidence over a long audit with weak claims.
+
+ ## Gotchas
+
+ - Do not start asking deployment-context questions before you have read the scoped code and docs. This skill should ask only what the repository cannot answer.
+ - Do not rely on grep hits or partial snippets for anything that informs a finding. Backend ship audits need complete-file reads for the code and docs that matter.
+ - Do not drift into style review, generic refactor advice, or "nice to have" cleanup. Every finding should connect to real release risk.
+ - Do not trust route names or file names alone to define the scope. Resolve the actual entry points, persistence paths, jobs, and external dependencies before judging readiness.
+ - If deployment context is missing, state the assumption and calibrate confidence or severity accordingly. Do not present guessed operating conditions as established fact.
+
+ ## Keep This Skill Sharp
+
+ - After meaningful audits, add new gotchas when the same backend-risk blind spot, scope mistake, or deployment-context question keeps recurring.
+ - Tighten the description if the skill misses real prompts like "is this API ready to ship" or fires on requests that only need generic code review.
+ - If the audit keeps reusing the same detailed evaluation logic or evidence format, move that reusable detail into `references/` instead of expanding the hub file.
@@ -1,3 +1,4 @@
- display_name: Backend Ship Audit
- short_description: Audit a backend scope for real ship risk and write an evidence-based readiness report.
- default_prompt: Audit this backend scope for ship readiness. Resolve scope from the repository, read relevant backend code and docs completely, ask only missing high-leverage questions, persist durable backend context, and write a prioritized audit under .waypoint/audit/.
+ interface:
+ display_name: "Backend Ship Audit"
+ short_description: "Audit backend ship-readiness with evidence"
+ default_prompt: "Use $backend-ship-audit to audit this backend scope for ship readiness and write the resulting evidence-based report."
@@ -1,6 +1,6 @@
  ---
  name: code-guide-audit
- description: Audit a specific feature, file set, or implementation slice against the coding guide and report only coding-guide-related violations or risks in that scope. Use after building a feature, when the user wants a coding-guide compliance check, before review on a targeted area, or when validating whether a change follows rules like no silent fallbacks, strong boundary validation, frontend reuse, explicit state handling, and behavior-focused verification.
+ description: Audit a specific feature, file set, or implementation slice against the coding guide and report only coding-guide-related violations or risks in that scope. Use when the user asks for a code-guide audit, coding-guide compliance check, guide-specific review, or wants to know whether a change follows rules like no silent fallbacks, strong boundary validation, frontend reuse, explicit state handling, and behavior-focused verification. Do not use this for broad ship-readiness review, generic bug hunting, PR comment triage, or repo-wide cleanup.
  ---

  # Code Guide Audit
@@ -9,6 +9,13 @@ Use this skill for a targeted audit against the coding guide, not for a whole-re

  This skill owns one job: inspect the specific code the user points at, map it against the coding guide, and report only guide-related findings in that scope.

+ ## When Not To Use This Skill
+
+ - Skip it for broad ship-readiness review; use `pre-pr-hygiene` or a ship-audit workflow for that.
+ - Skip it for generic bug finding or regression review that is not specifically about the coding guide.
+ - Skip it for active PR comment triage; use `pr-review` for that loop.
+ - Skip it for repo-wide cleanup unless the user explicitly asked for a repo-wide coding-guide audit.
+
  ## Step 1: Load The Right Scope

  - Read the repo's routed code guide.
@@ -67,3 +74,17 @@ Summarize the scoped result in review style:
  - each finding tied back to the relevant coding-guide rule
  - include exact file references
  - then note any skipped guide areas or residual uncertainty
+
+ ## Gotchas
+
+ - Do not turn this into generic code review. Every finding should tie back to a specific coding-guide rule.
+ - Do not audit the whole repo by accident. Resolve the narrow slice first, then stay inside it unless an out-of-scope issue would seriously mislead the user.
+ - Do not report a guide violation from a grep hit alone. Read the real implementation and the nearby evidence before calling it a problem.
+ - Do not force every coding-guide rule onto every change. Skip non-applicable rules explicitly instead of inventing weak findings.
+ - If you notice a broader ship-risk issue that is not really a coding-guide issue, say it is outside this skill's scope instead of quietly drifting into another audit.
+
+ ## Keep This Skill Sharp
+
+ - After meaningful runs, add new gotchas when the same guide-specific failure mode or scope-drift mistake keeps recurring.
+ - Tighten the description if the skill fires on generic review requests or misses real prompts like "check this against the code guide."
+ - If the same guide-rule translation logic keeps repeating, move that reusable detail into a supporting reference instead of expanding the hub file.
@@ -1,4 +1,4 @@
  interface:
  display_name: "Code Guide Audit"
- short_description: "Audit scoped code against the coding guide"
- default_prompt: "Use this skill to audit a specific feature, file set, or implementation slice against the coding guide and report only guide-related violations or risks in that scope."
+ short_description: "Audit code-guide compliance on a scoped slice"
+ default_prompt: "Use $code-guide-audit to audit this specific feature, file set, or implementation slice against the coding guide."
@@ -1,6 +1,6 @@
  ---
  name: conversation-retrospective
- description: Analyze the active conversation for durable repo knowledge, skill improvements, and repeated workflow patterns. Use when the user asks to save what was learned from the current conversation, update memory/docs without more prompting, improve skills that were used or exposed gaps, or propose new skills based on repetitive work in the live thread.
+ description: Harvest durable knowledge, user feedback, skill lessons, and repeated workflow patterns from the active conversation into the repo's existing memory system. Use when the user asks to save what was learned, write down what changed, capture lessons from this thread, update docs or handoff state without more prompting, improve skills that were used or exposed gaps, or record new skill ideas based on repetitive work in the live conversation. Do not use this for generic planning, broad docs audits, or digging through archived session history unless the user explicitly asks for that.
  ---

  # Conversation Retrospective
@@ -11,6 +11,13 @@ This skill works from the live conversation already in context. Do not go huntin

  This is a closeout and distillation workflow, not a generic planning pass or a broad docs audit.

+ ## When Not To Use This Skill
+
+ - Skip it for generic planning or implementation design; use the planning workflow for that.
+ - Skip it for broad docs audits that are not driven by what happened in this conversation.
+ - Skip it when the user wants archived history analysis rather than the live thread; only dig into old sessions if they explicitly ask.
+ - Skip it when there is nothing durable to preserve and no skill or workflow lesson to capture.
+
  ## Read First

  Before persisting anything:
@@ -123,3 +130,17 @@ Summarize:
  - what you intentionally left unpersisted because it was transient

  If no substantive persistence changes were needed, say that explicitly instead of inventing updates.
+
+ ## Gotchas
+
+ - Do not turn this skill into a transcript dump. Persist only durable knowledge, live state, or reusable lessons.
+ - Do not scatter the same learning across multiple files. Pick the smallest truthful home the repo already uses.
+ - Do not blame a skill for a problem that was really an execution mistake or an external tool failure.
+ - Do not preserve one-off user phrasing or temporary frustration as if it were standing repo policy unless the user clearly framed it that way.
+ - Do not go hunting through archived session files just because the live thread feels incomplete. This skill should work from the current conversation unless the user explicitly broadens the scope.
+
+ ## Keep This Skill Sharp
+
+ - After meaningful retrospectives, add new gotchas when the same persistence mistake, memory-placement mistake, or skill-triage mistake keeps recurring.
+ - Tighten the description if the skill misses real prompts like "save what we learned here" or fires on requests that are really planning or docs-audit work.
+ - If the same kind of durable learning keeps needing a custom destination, add that routing guidance to the skill instead of leaving the decision to be rediscovered in chat.
@@ -1,4 +1,4 @@
  interface:
  display_name: "Conversation Retrospective"
  short_description: "Harvest the live conversation into repo memory"
- default_prompt: "Use this skill to analyze the active conversation, preserve durable knowledge and user feedback in the repo's existing memory surfaces, evaluate whether used skills succeeded or failed, capture concrete errors and friction points, improve skills whose guidance was insufficient, and record new skill ideas without asking follow-up questions when the correct destination is clear."
+ default_prompt: "Use $conversation-retrospective to preserve the durable lessons, repo-memory updates, and skill learnings from this live conversation."
@@ -1,6 +1,6 @@
  ---
  name: docs-sync
- description: Audit routed docs against the actual codebase and shipped behavior. Use when docs may be stale after implementation work, before pushing or opening a PR, when routes/contracts/config changed, or when the agent should find missing, incorrect, outdated, or broken documentation and then update or flag the exact gaps.
+ description: Audit routed docs against the actual codebase and shipped behavior. Use when the user asks to sync docs, when docs may be stale after implementation work, before pushing or opening a PR, when routes, contracts, config, commands, or shipped behavior changed, or when Codex should find missing, incorrect, outdated, or broken documentation and then update or flag the exact gaps. Do not use this for vendor-doc ingestion, repo-memory cleanup, or broad code review that is not specifically about docs drift.
  ---

  # Docs Sync
@@ -9,6 +9,13 @@ Use this skill to keep repo docs aligned with reality.

  This is not a vendor-doc ingestion skill and not a workspace-cleanup skill. It owns one job: compare the codebase and shipped behavior against routed docs, then fix or flag the mismatches.

+ ## When Not To Use This Skill
+
+ - Skip it for importing or summarizing upstream vendor docs. Link to the real source instead of copying it into the repo.
+ - Skip it for workspace compression or tracker cleanup. This skill is about docs drift, not handoff hygiene.
+ - Skip it for broad code review that is not specifically about docs-to-reality mismatches.
+ - Skip it when the user only wants a new durable plan or architecture note; use the planning or normal docs-writing flow in that case.
+
  ## Read First

  Before auditing docs:
@@ -55,3 +62,17 @@ Summarize:
  - what docs were stale or missing
  - what you updated
  - what still needs a decision, if anything
+
+ ## Gotchas
+
+ - Do not trust docs-to-docs consistency alone. The source of truth is the shipped code and behavior, not whether two markdown files agree with each other.
+ - Do not leave stale future-tense claims behind after a feature ships or is cut. Docs drift often shows up as roadmap language that quietly became false.
+ - Do not update prose without checking commands, routes, config names, and examples. Small copied snippets are often where docs rot first.
+ - Do not invent certainty when the right doc shape is unclear. Flag the mismatch instead of bluffing a final answer.
+ - After touching routed docs, always refresh the generated docs/context layer so the repo’s index and bootstrap bundle match the new reality.
+
+ ## Keep This Skill Sharp
+
+ - After meaningful runs, add new gotchas when the same docs-drift pattern, broken example shape, or stale-claim mistake keeps recurring.
+ - Tighten the description if the skill misses real prompts like "sync the docs" or fires on requests that are really about repo-memory cleanup instead.
+ - If the skill starts needing detailed provider-specific or command-heavy guidance, move that detail into references instead of bloating the hub file.
@@ -1,4 +1,4 @@
  interface:
  display_name: "Docs Sync"
  short_description: "Audit docs against the real codebase"
- default_prompt: "Use this skill to audit routed docs against the actual codebase and shipped behavior, then update or flag any missing, incorrect, or outdated documentation."
+ default_prompt: "Use $docs-sync to audit routed docs against the actual codebase and shipped behavior, then update or flag any missing, incorrect, or outdated documentation."
@@ -16,6 +16,12 @@ Use this skill when relevant frontend project context is missing, stale, contrad

  This skill is for project-level operating context, not feature requirements gathering.

+ ## When Not To Use This Skill
+
+ - Skip it when the needed context is already clearly documented in `AGENTS.md` or routed docs.
+ - Skip it for feature-specific UX, copy, flow, or acceptance-criteria questions.
+ - Skip it for implementation preferences that can be resolved from the codebase.
+
  ## When to use

  Use this skill when the current task depends on context such as:
@@ -70,3 +76,16 @@ Good project-level question areas include:
  - accessibility expectations or compliance targets
  - whether SEO matters for any routes
  - whether backward compatibility in user workflows matters
+
+ ## Gotchas
+
+ - Do not re-ask stable context that is already present in `AGENTS.md` or routed docs.
+ - Do not drift into feature discovery. This skill is about project context that changes implementation or review choices across many tasks.
+ - Do not persist transient task details into `## Frontend Context`; only save durable deployment and product constraints.
+ - Do not create a new guidance file if `AGENTS.md` is missing unless the user explicitly asked for that.
+
+ ## Keep This Skill Sharp
+
+ - Add new gotchas when the same frontend-context blind spot or repeated unnecessary question keeps showing up in real work.
+ - Tighten the description if the skill fires on feature-planning prompts or misses real requests about browser support, accessibility, SEO, or deployment context.
+ - If the same stable setup facts keep being asked across repos, add sharper routing or persistence guidance instead of leaving that learning in chat.
@@ -1,4 +1,4 @@
  interface:
  display_name: "Frontend Context Interview"
- short_description: "Ask only the frontend context questions that materially change implementation or review decisions"
- default_prompt: "Use this skill to inspect persisted project context, identify only the missing frontend product or deployment facts that materially affect the work, ask a concise high-leverage interview if needed, and persist durable Frontend Context into AGENTS.md."
+ short_description: "Ask only the missing frontend context questions"
+ default_prompt: "Use $frontend-context-interview to inspect persisted project context, ask only the missing frontend deployment or product-context questions that materially affect the work, and persist durable Frontend Context into AGENTS.md when needed."
@@ -5,6 +5,12 @@ description: Audit a defined frontend scope for ship-readiness with a strong foc

  Audit ship-readiness like a strong frontend reviewer. Optimize for user impact, release risk, and production correctness. Do not optimize for style policing.

+ ## When Not To Use This Skill
+
+ - Skip it for backend ship-readiness; use the backend ship-audit workflow instead.
+ - Skip it for generic code review, maintainability review, or PR comment triage that is not explicitly about ship readiness.
+ - Skip it for a one-off coding-guide check on a narrow slice; use `code-guide-audit` for that.
+
  Use this workflow:

  1. Resolve the scope.
@@ -85,3 +91,17 @@ When evidence is partial:
  - say what remains assumed
  - lower confidence instead of overstating certainty
  - ask only the missing questions that would change the release decision
+
+ ## Gotchas
+
+ - Do not drift into style review or generic UX commentary. Every finding should connect to release risk or user-facing correctness.
+ - Do not rely on grep hits or partial snippets for files that support a finding. Ship audits need complete reads of the frontend files and docs that matter.
+ - Do not ask deployment or audience questions before you have exhausted what the repo already tells you.
+ - Do not treat API, auth, SEO, accessibility, analytics, or localization assumptions as proven just because they are implied by filenames. Trace the real behavior.
+ - If the code and docs disagree, call out the mismatch instead of quietly choosing whichever story feels cleaner.
+
+ ## Keep This Skill Sharp
+
+ - Add new gotchas when the same frontend-risk blind spot, stale-doc problem, or deployment-context miss keeps recurring.
+ - Tighten the description if the skill fires on ordinary review requests or misses real prompts like "is this route ready to ship" or "audit this frontend before launch."
+ - If the same evidence structure or audit framing keeps repeating, move more of that detail into the existing references or helper script instead of bloating the hub file.
@@ -1,3 +1,4 @@
- display_name: Frontend Ship Audit
- short_description: Audit a scoped frontend surface for ship-readiness with evidence-based findings and durable deployment context.
- default_prompt: Audit the ship-readiness of the requested frontend scope. Resolve the reviewable unit from the repo, read all relevant frontend files completely, ask only missing high-leverage questions, persist durable Frontend Context in the project root guidance file when present, and write a prioritized audit at .waypoint/audit/dd-mm-yyyy-hh-mm-frontend-audit.md.
+ interface:
+ display_name: "Frontend Ship Audit"
+ short_description: "Audit frontend ship-readiness with evidence"
+ default_prompt: "Use $frontend-ship-audit to audit the requested frontend scope for ship readiness and write the resulting evidence-based report."
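The removed default prompt above pins the audit report to `.waypoint/audit/dd-mm-yyyy-hh-mm-frontend-audit.md`. As a hedged sketch, a helper that derives such a path could look like the following; the function name and `suffix` parameter are illustrative, not part of the package:

```python
from datetime import datetime
from pathlib import Path

def audit_path(suffix: str, now: datetime | None = None) -> Path:
    """Build a .waypoint/audit/ report path like 07-03-2025-14-30-frontend-audit.md."""
    now = now or datetime.now()
    # dd-mm-yyyy-hh-mm, matching the format named in the removed prompt
    stamp = now.strftime("%d-%m-%Y-%H-%M")
    return Path(".waypoint/audit") / f"{stamp}-{suffix}-audit.md"

# Fixed timestamp so the output is deterministic:
print(audit_path("frontend", datetime(2025, 3, 7, 14, 30)))
# → .waypoint/audit/07-03-2025-14-30-frontend-audit.md
```

Taking the timestamp as a parameter keeps the helper testable; a real implementation would simply call it with no second argument.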
@@ -15,6 +15,12 @@ Good plans prove you understand the problem. Size matches complexity — a renam

  **The handoff test:** Could someone implement this plan without asking you questions? If not, find what's missing.

+ ## When Not To Use This Skill
+
+ - Skip it for tiny obvious edits where a full planning pass would cost more than it saves.
+ - Skip it when the user explicitly wants implementation right away and the work is already straightforward.
+ - Skip it for post-implementation closeout; use the review or hygiene workflows for that.
+
  ## Read First

  Before planning:
@@ -150,3 +156,17 @@ When the plan doc is written:
  ## Quality Bar

  If the plan would make the implementer ask "where does this hook in?" or "what exactly am I changing?", it is not done.
+
+ ## Gotchas
+
+ - Do not spend interview turns on implementation facts that are already in the code or routed docs.
+ - Do not stop exploring just because you have a plausible plan. The usual failure mode is shallow repo understanding.
+ - Do not leave unresolved architecture or product decisions hidden behind "we can figure that out during implementation."
+ - Do not dump a transcript into the plan doc. Distill the decisions and requirements into a clean implementation handoff.
+ - Do not treat a reviewed plan as a stopping point. Once the user approves it, the workflow expects execution to continue.
+
+ ## Keep This Skill Sharp
+
+ - Add new gotchas when the same planning blind spot, under-explored area, or vague plan failure keeps recurring.
+ - Tighten the description if the skill fires on tiny tasks or misses real prompts about migrations, refactors, and implementation-ready design work.
+ - If planning keeps depending on the same durable context or external reference paths, encode that routing into the skill instead of rediscovering it in chat.
@@ -1,4 +1,4 @@
  interface:
  display_name: "Planning"
  short_description: "Interview, explore, and write an implementation-ready plan into the repo"
- default_prompt: "Use this skill to deeply explore the repository, interview for the product and architectural details that materially affect the work, and write an implementation-ready plan into .waypoint/docs/."
+ default_prompt: "Use $planning to deeply explore the repository, interview for the product and architectural details that materially affect the work, and write an implementation-ready plan into .waypoint/docs/."
@@ -7,6 +7,12 @@ description: Triage and close the review loop on an open PR after automated or h

  Use this skill to drive the PR through review instead of treating review as a one-shot comment sweep.

+ ## When Not To Use This Skill
+
+ - Skip it before a PR has active review or automated review in flight.
+ - Skip it for local pre-push hygiene; use `pre-pr-hygiene` for that workflow.
+ - Skip it for the repo-internal closeout loop on an unpushed slice; use the normal review workflows instead.
+
  ## Step 1: Wait For Review To Settle

  - Check the PR's current review and CI status.
@@ -60,3 +66,17 @@ Summarize:
  - what was intentionally declined
  - what verification ran
  - whether the PR is clear or still waiting on reviewer response
+
+ ## Gotchas
+
+ - Do not treat a placeholder like "review in progress" as a clean review result.
+ - Do not leave comment threads silent just because the code changed. The reply is part of the workflow.
+ - Do not assume stacked PR fixes have landed in the branch you are reviewing; compare against the actual base.
+ - Do not leave the loop just because CI is slow. A pending review state is still unfinished.
+ - Do not declare the PR clear if the required repo-level reviewer passes have not actually run.
+
+ ## Keep This Skill Sharp
+
+ - Add new gotchas when the same PR-review failure mode, automation blind spot, or reviewer-state confusion keeps recurring.
+ - Tighten the description if the skill fires before review has actually started or misses real prompts about "address these PR comments" or "close the loop on this PR."
+ - If the workflow keeps repeating the same review-system quirks, preserve them in the skill instead of letting them stay as one-off chat lessons.
@@ -1,4 +1,4 @@
  interface:
  display_name: "PR Review"
  short_description: "Close the review loop on an active PR"
- default_prompt: "Use this skill when a PR has active review comments or automated review in progress. Wait for review to settle, triage every comment, reply inline, fix meaningful issues, push follow-up commits, and repeat until no new meaningful findings remain."
+ default_prompt: "Use $pr-review when a PR has active review comments or automated review in progress. Wait for review to settle, triage every comment, reply inline, fix meaningful issues, push follow-up commits, and repeat until no new meaningful findings remain."
@@ -7,6 +7,12 @@ description: Run a broad final hygiene pass before pushing, before opening or up

  Use this skill for the larger final audit before code leaves the machine.

+ ## When Not To Use This Skill
+
+ - Skip it for tiny changes that do not justify a broad hygiene pass.
+ - Skip it after a PR already has active review comments; use `pr-review` for that loop.
+ - Skip it when the task is only a narrow coding-guide check or only a docs sync pass.
+
  ## Read First

  Before the hygiene pass:
@@ -61,3 +67,17 @@ Summarize:
  - what you fixed
  - what verification ran
  - what residual risks remain, if any
+
+ ## Gotchas
+
+ - Do not turn this into a whole-repo cleanup mission. Keep the pass tied to the change surface that is about to leave the machine.
+ - Do not stop at reporting obvious fixable issues if the correct remediation is clear.
+ - Do not call the pass complete without real verification that matches the risk of the change.
+ - Do not let docs, contracts, and code drift just because the implementation itself "works."
+ - Do not use this as a replacement for active PR review or the normal closeout loop.
+
+ ## Keep This Skill Sharp
+
+ - Add new gotchas when the same hygiene blind spot, contract drift, or verification miss keeps recurring.
+ - Tighten the description if the skill fires on tiny edits or misses real prompts about "do a final pass before I push."
+ - If the same cross-cutting checks keep being rediscovered, encode them more explicitly here instead of relying on chat memory.
@@ -1,4 +1,4 @@
  interface:
  display_name: "Pre-PR Hygiene"
  short_description: "Run the final cross-cutting ship audit"
- default_prompt: "Use this skill before pushing or opening/updating a PR for substantial work to do a broader hygiene pass across code, docs, contracts, typing, UI rollback, persistence correctness, and code-guide compliance."
+ default_prompt: "Use $pre-pr-hygiene before pushing or opening/updating a PR for substantial work to do a broader hygiene pass across code, docs, contracts, typing, UI rollback, persistence correctness, and code-guide compliance."
@@ -84,3 +84,17 @@ Do not trust the generation step blindly.
  - prefer a single strong visual over a pile of mediocre ones
 
  If the artifact is only for the current conversation, store it in a temp or scratch location. If the user wants a durable asset in the repo, place it in the repo's normal docs or asset structure instead of inventing a new convention.
+
+ ## Gotchas
+
+ - Do not make an image when Mermaid or a short paragraph would already explain the point cleanly.
+ - Do not annotate a screenshot until you have verified the source screenshot actually shows the right state.
+ - Do not bury the main message under too many callouts or labels. One image should usually explain one thing.
+ - Do not present a conceptual mockup as if it were a real current UI state. Label it clearly when it is illustrative.
+ - Do not trust the rendering step blindly; clipped text, tiny labels, and misplaced arrows are common failure modes.
+
+ ## Keep This Skill Sharp
+
+ - Add new gotchas when the same visual clarity problem, screenshot mistake, or rendering failure keeps showing up.
+ - Tighten the description if the skill fires when Mermaid would have been enough or misses real requests for annotated screenshots and concept cards.
+ - If the same layout patterns or annotation helpers keep repeating, move them into reusable assets or scripts instead of rebuilding them from scratch.
@@ -1,4 +1,4 @@
  interface:
  display_name: "Visual Explanations"
  short_description: "Create generated images and annotated screenshots"
- default_prompt: "Use this skill to create a generated image or annotated screenshot when a visual artifact would explain the point more clearly than prose alone. Prefer Mermaid directly when a simple in-chat diagram is enough."
+ default_prompt: "Use $visual-explanations to create a generated image or annotated screenshot when a visual artifact would explain the point more clearly than prose alone. Prefer Mermaid directly when a simple in-chat diagram is enough."
@@ -13,6 +13,12 @@ This skill owns the execution tracker layer:
  - keep `WORKSPACE.md` pointing at the active tracker
  - move detailed checklists and progress into the tracker instead of bloating the workspace
 
+ ## When Not To Use This Skill
+
+ - Skip it for small single-shot tasks that fit comfortably in `WORKSPACE.md`.
+ - Skip it when the work has already finished and does not need a durable execution log.
+ - Skip it when the real need is docs compression or docs sync rather than active execution tracking.
+
  ## Read First
 
  Before tracking:
@@ -108,3 +114,17 @@ When you create or update a tracker, report:
  - the tracker path
  - the current status
  - what `WORKSPACE.md` now points to
+
+ ## Gotchas
+
+ - Do not create a new tracker if a relevant active tracker already exists for the same workstream.
+ - Do not let the tracker become fiction. Completed items, blockers, and verification state should match reality.
+ - Do not stuff durable architecture or debugging knowledge into the tracker if it belongs in `.waypoint/docs/`.
+ - Do not leave `WORKSPACE.md` carrying the full execution log after a tracker exists.
+ - Do not keep trackers "active" forever after the work is done; update the status.
+
+ ## Keep This Skill Sharp
+
+ - Add new gotchas when the same tracker drift, duplicate-tracker pattern, or workspace-bloat problem keeps recurring.
+ - Tighten the description if the skill fires for tiny work that does not need a tracker or misses long-running remediation campaigns.
+ - If the tracker format keeps needing the same recurring section or checklist pattern, capture that reusable pattern in the skill instead of rediscovering it each time.
@@ -1,4 +1,4 @@
  interface:
  display_name: "Work Tracker"
  short_description: "Create or maintain a durable tracker for a large workstream"
- default_prompt: "Use this skill to create or update a tracker under .waypoint/track/, keep WORKSPACE.md pointing to it, and move detailed execution state out of chat and into the repo."
+ default_prompt: "Use $work-tracker to create or update a tracker under .waypoint/track/, keep WORKSPACE.md pointing to it, and move detailed execution state out of chat and into the repo."
@@ -9,6 +9,12 @@ Keep `WORKSPACE.md` as a live handoff, not a project diary.
 
  This skill is for compression, not for erasing context. Preserve what the next agent needs in the first few minutes of a resume, and push durable detail into the docs layer that already exists in the repo.
 
+ ## When Not To Use This Skill
+
+ - Skip it when the workspace is still short, current, and easy to resume from.
+ - Skip it when the detail you want to remove is still active execution state that belongs in a tracker.
+ - Skip it when the real problem is stale docs rather than stale workspace history.
+
  ## Read First
 
  Before compressing:
@@ -91,3 +97,17 @@ Summarize:
  - what was collapsed or removed
  - which durable docs now hold the preserved detail
  - any remaining risk that still belongs in the workspace
+
+ ## Gotchas
+
+ - Do not compress away the active operational truth just because the workspace feels long.
+ - Do not rely on `git diff` to decide what matters; the workspace must stay useful even in a dirty tree.
+ - Do not delete detail unless you know which routed doc or tracker now preserves it.
+ - Do not compress unresolved blockers or immediate next steps into vague summaries.
+ - Do not rewrite over in-flight user edits in the workspace or linked docs.
+
+ ## Keep This Skill Sharp
+
+ - Add new gotchas when the same compression mistake, lost-context problem, or stale-workspace pattern keeps recurring.
+ - Tighten the description if the skill fires on already-clean workspaces or misses real "clean up the handoff" requests.
+ - If the same compression pattern keeps moving detail into the same durable home, make that routing more explicit in the skill.
@@ -1,4 +1,4 @@
  interface:
  display_name: "Workspace Compress"
  short_description: "Compress the live workspace handoff"
- default_prompt: "Use this skill after a meaningful chunk of work, before stopping, before review, or before opening or updating a PR to keep WORKSPACE.md short, current, and useful to the next agent."
+ default_prompt: "Use $workspace-compress after a meaningful chunk of work, before stopping, before review, or before opening or updating a PR to keep WORKSPACE.md short, current, and useful to the next agent."
@@ -26,6 +26,7 @@ Critical rules:
  You set the standard. Don't learn quality standards from existing code - the codebase may already be degraded. Apply good engineering judgment regardless of what exists.
  - Read every changed file in full before making a maintainability judgment.
  - Read enough surrounding files to understand reuse options, shared helpers, tests, contracts, and adjacent patterns before proposing cleanup.
+ - Do not clear a change as healthy unless you can explain which surrounding files, reuse candidates, and calling paths you checked to support that conclusion.
  - Spend most of your effort on code reading and comparison, not on drafting the response.
 
  Explore what exists. Search for existing helpers, utilities, and patterns that could be reused instead of duplicated.
@@ -103,5 +104,5 @@ Each finding needs:
  - suggested fix direction
 
  Return:
- Scope anchor, changed files read, related files read, reuse candidates checked, findings, brief overall assessment.
+ Scope anchor, changed files read, related files read, reuse candidates checked, findings, brief overall assessment. If you report no issues, explain why the surrounding code supported the change instead of only saying the diff looked fine.
  """
@@ -22,6 +22,7 @@ The diff or commit is only a starting pointer. A diff-only review is a failed re
  Rules:
  - Read every changed file in full before forming conclusions.
  - Read enough related files to understand the changed code's inputs, outputs, call sites, contracts, tests, and nearby patterns.
+ - Do not clear a change unless you can explain the critical paths you traced, the related files you read, and why the surrounding code supports the behavior.
  - Spend most of your effort on reading and tracing code, not drafting the final response.
  - Find bugs, not style issues.
  - Assume issues are hiding. Dig until you find them or can justify that the code is solid.
@@ -94,7 +95,7 @@ Description of the issue with evidence.
  **Fix:** What to change.
 
  ### No Issues Found
- [Use this section instead if the code is clean. State what you verified, including the important paths and contracts you checked.]
+ [Use this section instead if the code is clean. State what you verified, including the important paths, related files, and contracts you checked. A clean review without surrounding-file understanding is a failed review.]
 
  Quality bar:
  Only report issues that:
@@ -7,6 +7,7 @@
  .agents/skills/work-tracker/
  .agents/skills/docs-sync/
  .agents/skills/code-guide-audit/
+ .agents/skills/adversarial-review/
  .agents/skills/visual-explanations/
  .agents/skills/break-it-qa/
  .agents/skills/frontend-context-interview/
@@ -22,3 +23,4 @@
  !.waypoint/docs/**
  .waypoint/docs/README.md
  .waypoint/docs/code-guide.md
+ # End Waypoint state
@@ -34,7 +34,7 @@ You're direct, opinionated, and evidence-driven. You read before you write. You
 
  **Update the durable record.** When behavior changes, update docs. When state changes, update `WORKSPACE.md`. When a better pattern emerges, encode it in the repo contract instead of rediscovering it later.
 
- **Close the loop before complete.** Run `code-reviewer` before considering any non-trivial implementation slice complete. Run `code-health-reviewer` before considering medium or large changes complete, especially when they add structure, duplicate logic, or introduce new abstractions.
+ **Close the loop before complete.** Run `adversarial-review` before considering any non-trivial implementation slice complete. That closeout skill should keep looping through reviewer passes and fixes until no meaningful findings remain.
 
  **Prefer small, reviewable changes.** Keep work scoped and comprehensible.
 
@@ -49,7 +49,7 @@ If something important lives only in your head or in the chat transcript, the re
  - Update `.waypoint/docs/` when durable knowledge changes, and refresh each changed routable doc's `last_updated` field.
  - Rebuild `.waypoint/DOCS_INDEX.md` whenever routable docs change.
  - Rebuild `.waypoint/TRACKS_INDEX.md` whenever tracker files change.
- - When spawning reviewer agents or other subagents, explicitly set `model` to `gpt-5.4` and `reasoning_effort` to `high` unless the user explicitly requests a different model or lower reasoning.
+ - When spawning reviewer agents or other subagents, explicitly set `fork_context: false`, `model` to `gpt-5.4`, and `reasoning_effort` to `high` unless the user explicitly requests a different model or lower reasoning.
  - Use the repo-local skills and reviewer agents instead of improvising from scratch.
  - Treat reviewer agents as one-shot workers: once a reviewer returns findings, read the result and close it. If another review pass is needed later, spawn a fresh reviewer instead of reusing the same thread.
  - Do not kill long-running subagents or reviewer agents just because they are slow.
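The subagent defaults in the rules above can be sketched as a spawn configuration. Only `fork_context`, `model`, and `reasoning_effort` come from the rule itself; the surrounding shape (`subagent:`, `role:`) is a hypothetical illustration, not the package's real API:

```yaml
# Hypothetical spawn request. The fork_context, model, and reasoning_effort
# fields come from the managed AGENTS rules; everything else is illustrative.
subagent:
  role: code-reviewer      # one-shot, read-only reviewer; close it after it returns
  fork_context: false
  model: gpt-5.4
  reasoning_effort: high   # lower only if the user explicitly asks
```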
@@ -99,6 +99,7 @@ Do not document every trivial implementation detail. Document the non-obvious, d
  - `work-tracker` when large multi-step work needs durable progress tracking in `.waypoint/track/`
  - `docs-sync` when routed docs may be stale, missing, or inconsistent with the codebase
  - `code-guide-audit` when a specific feature or file set needs a targeted coding-guide compliance check
+ - `adversarial-review` when a non-trivial implementation slice is nearing completion and needs the default closeout loop for reviewer agents plus code-guide checks
  - `visual-explanations` when a generated image or annotated screenshot would explain the work more clearly than prose alone; Mermaid diagrams do not need a skill
  - `conversation-retrospective` after major completed work pieces so the active conversation is distilled into durable memory, user feedback and errors are preserved, exercised skills are improved, and real new-skill candidates are recorded
  - `break-it-qa` when a browser-facing feature should be attacked with invalid inputs, refreshes, repeated clicks, wrong action order, or other adversarial manual QA
@@ -128,19 +129,17 @@ Run `plan-reviewer` before presenting a non-trivial implementation plan to the u
 
  ## Review Loop
 
- Use reviewer agents before considering the work complete, not just as a reflex after every tiny commit.
-
- 1. Run `code-reviewer` before considering any non-trivial implementation slice complete.
- 2. Run `code-health-reviewer` before considering medium or large changes complete, especially when they add structure, duplicate logic, or introduce new abstractions.
- 3. If both apply, launch `code-reviewer` and `code-health-reviewer` in parallel as background, read-only reviewers.
- 4. Treat reviewer agents as one-shot workers. Once a reviewer returns its findings, read the result and close it.
- 5. If you need another review pass after changes, spawn a fresh reviewer agent rather than reusing the old thread.
- 6. If you have a recent self-authored commit that cleanly represents the reviewable slice, use it as the default review scope anchor. Otherwise scope the reviewers to the current changed slice.
- 7. Widen only when surrounding files are needed to validate a finding.
- 8. Do not call the work finished before you read the required reviewer results.
- 9. Wait for reviewer outputs even if that requires repeated or long waits. Do not interrupt them just because they are still running.
- 10. Fix real findings, rerun the relevant verification, update workspace/docs if needed, and make a follow-up commit when fixes change the repo.
- 11. Do not call a PR clear, ready, or done until the required reviewer-agent passes for the current slice have actually run.
+ Use `adversarial-review` before considering the work complete, not just as a reflex after every tiny commit.
+
+ 1. Run `adversarial-review` before considering any non-trivial implementation slice complete.
+ 2. That skill owns the default closeout loop for the current slice: define the scope, run `code-reviewer`, run `code-health-reviewer` when applicable, run `code-guide-audit`, wait as long as needed, fix meaningful issues, and repeat with fresh reviewer rounds until no meaningful findings remain.
+ 3. Treat reviewer agents as one-shot workers. Once a reviewer returns its findings, read the result and close it.
+ 4. If you need another review pass after changes, spawn a fresh reviewer agent rather than reusing the old thread.
+ 5. If you have a recent self-authored commit that cleanly represents the reviewable slice, use it as the default review scope anchor. Otherwise scope the review loop to the current changed slice.
+ 6. Widen only when surrounding files are needed to validate a finding.
+ 7. Do not call the work finished before you read the required closeout outputs.
+ 8. Wait for reviewer outputs even if that requires repeated or long waits. Do not interrupt them just because they are still running.
+ 9. Do not call a PR clear, ready, or done until the required reviewer-agent passes for the current slice have actually run.
 
  ## Quality bar
 
@@ -93,12 +93,11 @@ Working rules:
  - Use `work-tracker` when a long-running implementation, remediation, or verification campaign needs durable progress tracking
  - Use `docs-sync` when the docs may be stale or a change altered shipped behavior, contracts, routes, or commands
  - Use `code-guide-audit` for a targeted coding-guide compliance pass on a specific feature, file set, or change slice
+ - Use `adversarial-review` before considering a non-trivial implementation slice complete; it owns the closeout loop for `code-reviewer`, `code-health-reviewer`, and `code-guide-audit`, reruns fresh review rounds after meaningful fixes, and stops only when no meaningful findings remain
  - Use `visual-explanations` when a generated image or annotated screenshot would explain the work more clearly than prose alone; Mermaid diagrams can be written directly in chat without invoking a skill
  - Use `conversation-retrospective` after major completed work pieces to preserve durable learnings, capture user feedback and errors, improve any skills that were exercised, and record real new-skill candidates
  - Do not invoke `break-it-qa`, `frontend-ship-audit`, or `backend-ship-audit` yourself from the managed AGENTS block workflow; they are user-facing skills for explicit human-requested QA or ship-readiness audits, not default agent steps
  - Before presenting a non-trivial implementation plan to the user, run `plan-reviewer` and iterate on the plan until it has no meaningful review findings left
- - Before considering a non-trivial implementation slice complete, run `code-reviewer`; use a recent self-authored commit as the default scope anchor when one cleanly represents that slice
- - Before considering medium or large changes complete, run `code-health-reviewer`, especially when they add structure, duplicate logic, or introduce new abstractions
  - Treat `plan-reviewer`, `code-reviewer`, and `code-health-reviewer` as one-shot agents: once a reviewer returns findings, close it; if another pass is needed later, spawn a fresh reviewer instead of reusing the old thread
  - Before pushing or opening/updating a PR for substantial work, use `pre-pr-hygiene`
  - Use `pr-review` once a PR has active review comments or automated review in progress