@kodrunhq/opencode-autopilot 1.4.0 → 1.5.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/assets/commands/brainstorm.md +7 -0
- package/assets/commands/stocktake.md +7 -0
- package/assets/commands/tdd.md +7 -0
- package/assets/commands/update-docs.md +7 -0
- package/assets/commands/write-plan.md +7 -0
- package/assets/skills/brainstorming/SKILL.md +295 -0
- package/assets/skills/code-review/SKILL.md +241 -0
- package/assets/skills/e2e-testing/SKILL.md +266 -0
- package/assets/skills/git-worktrees/SKILL.md +296 -0
- package/assets/skills/go-patterns/SKILL.md +240 -0
- package/assets/skills/plan-executing/SKILL.md +258 -0
- package/assets/skills/plan-writing/SKILL.md +278 -0
- package/assets/skills/python-patterns/SKILL.md +255 -0
- package/assets/skills/rust-patterns/SKILL.md +293 -0
- package/assets/skills/strategic-compaction/SKILL.md +217 -0
- package/assets/skills/systematic-debugging/SKILL.md +299 -0
- package/assets/skills/tdd-workflow/SKILL.md +311 -0
- package/assets/skills/typescript-patterns/SKILL.md +278 -0
- package/assets/skills/verification/SKILL.md +240 -0
- package/package.json +1 -1
- package/src/index.ts +4 -0
- package/src/orchestrator/skill-injection.ts +38 -0
- package/src/review/sanitize.ts +1 -1
- package/src/skills/adaptive-injector.ts +122 -0
- package/src/skills/dependency-resolver.ts +88 -0
- package/src/skills/linter.ts +113 -0
- package/src/skills/loader.ts +88 -0
- package/src/templates/skill-template.ts +4 -0
- package/src/tools/create-skill.ts +12 -0
- package/src/tools/stocktake.ts +170 -0
- package/src/tools/update-docs.ts +116 -0
@@ -0,0 +1,299 @@
---
name: systematic-debugging
description: 4-phase root cause analysis methodology for systematic bug diagnosis and resolution
stacks: []
requires: []
---

# Systematic Debugging

A disciplined 4-phase methodology for diagnosing and fixing bugs: Reproduce, Isolate, Diagnose, Fix. This skill replaces ad-hoc debugging (changing things until it works) with a systematic process that finds the root cause and prevents recurrence.

Every bug fix should produce a regression test. A bug fixed without a test is a bug that will return.

## When to Use

**Activate this skill when:**

- A bug report comes in (user-reported, automated alert, test failure)
- Tests fail unexpectedly after a change
- Behavior doesn't match specification or documentation
- Performance degrades without an obvious cause
- Integration between modules produces unexpected results
- A production incident requires root cause analysis

**Do NOT use when:**

- The issue is a feature request, not a bug
- The fix is obvious and trivial (typo, missing import, wrong config value)
- The issue has a known fix documented in the codebase or issue tracker
- You need a code review (use the code-review skill instead)

## The 4-Phase Debugging Process

Follow the phases in order. Do not skip phases. The most common debugging mistake is jumping to Phase 4 (Fix) before completing Phase 3 (Diagnose).

### Phase 1: Reproduce

**Purpose:** Confirm the bug exists and get a reliable way to trigger it.

**Process:**

1. Read the bug report carefully. Extract the exact steps, inputs, and expected vs actual behavior.
2. Reproduce the bug locally using the reported steps.
3. If the bug reproduces, create a MINIMAL reproduction case:
   - Strip away everything not needed to trigger the bug
   - The minimal case should be a single test or a 5-10 line script
   - Document the exact command to run: `bun test tests/auth.test.ts -t "rejects expired tokens"`
4. If the bug does NOT reproduce:
   - Check environment differences (OS, runtime version, config)
   - Check input data differences (encoding, edge cases, null values)
   - Check timing differences (race conditions, async ordering)
   - Ask for more context: logs, screenshots, exact input data
5. Record the reproduction steps for the regression test in Phase 4.

**Output:** A reproducible test case or script that triggers the bug on demand.

**Exit criterion:** You can trigger the bug reliably. If you cannot reproduce it after 15 minutes, escalate for more information.

**A bug you cannot reproduce is a bug you cannot fix.** Do not proceed to Phase 2 until you have a reproduction.

### Phase 2: Isolate

**Purpose:** Narrow the scope from "the whole system" to "this specific function/line."

**Process:**

1. Start with the reproduction case from Phase 1.
2. **Binary search the codebase:** Comment out or bypass half the code path. Does the bug persist?
   - If yes: the bug is in the remaining half. Repeat.
   - If no: the bug is in the removed half. Restore and bisect that half.
3. **Check recent changes:** The bug may have been introduced recently.

   ```
   git log --oneline -20
   git diff HEAD~5
   git bisect start
   git bisect bad HEAD
   git bisect good <known-good-commit>
   ```

4. **Add strategic logging** at module boundaries:
   - Log inputs and outputs at each function call in the chain
   - Compare expected vs actual values at each step
   - The first point where actual diverges from expected is the bug location
5. **Check the call stack:** If the bug produces an error, read the full stack trace. The bug is usually near the top of the stack, but the root cause may be deeper.
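
The logging-at-boundaries step can be sketched as a tiny trace harness. The pipeline and its names (`parseQty`, `normalize`, `price`, `debugPrice`) are hypothetical, purely to illustrate comparing expected vs actual at each boundary:

```typescript
// Hypothetical 3-stage pipeline. Log at each boundary; the first point
// where the actual value diverges from the expected one is the bug location.
function parseQty(raw: string): number {
  return Number(raw);
}
function normalize(qty: number): number {
  return Math.max(0, qty);
}
function price(qty: number, unit: number): number {
  return qty * unit;
}

function debugPrice(raw: string, unit: number): number {
  const qty = parseQty(raw);
  console.log("[parse]", { raw, qty });      // expected: qty === 2 for raw "2"
  const norm = normalize(qty);
  console.log("[normalize]", { qty, norm }); // expected: norm === qty for positive qty
  const total = price(norm, unit);
  console.log("[price]", { norm, unit, total });
  return total;
}

debugPrice("2", 10);
```

Reading the three log lines top to bottom pinpoints the first stage whose output surprises you.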

**Output:** The exact function, file, and approximate line number where behavior diverges from expectation.

**Exit criterion:** You can point to a specific code location and say "the bug is here because [expected X but got Y]."

**Isolation tips:**

- If the code path is long, log at 3-4 strategic points first (entry, middle, exit, error path)
- If the bug is intermittent, add logging and run the reproduction 10 times to collect data
- If the bug only happens in production, check for environment-specific behavior (env vars, feature flags, data volume)

### Phase 3: Diagnose

**Purpose:** Understand WHY the bug exists, not just WHERE it is. The difference matters -- knowing where tells you what to change; knowing why tells you what to change it TO.

**Process:**

1. Read the code path end-to-end from the entry point to the bug location (Phase 2 output).
2. For each function in the path, check these assumptions:
   - **Types:** Is the value the expected type? (Watch for implicit coercion, especially in JS/TS)
   - **Null/undefined:** Can the value be null where the code assumes it's defined?
   - **Async timing:** Are operations completing in the expected order? Are there missing awaits?
   - **State mutation:** Is an object being modified in place when the caller expects immutability?
   - **Boundary values:** Are off-by-one errors possible? (Array indices, string slicing, pagination)
   - **Error handling:** Is an error being caught and swallowed somewhere in the chain?
3. Identify the root cause category (see Common Root Cause Patterns below).
4. Verify the diagnosis: if it is correct, you should be able to predict the bug's exact behavior for any input.

**Output:** A one-paragraph explanation of WHY the bug exists, referencing the specific code and the root cause pattern.

**Exit criterion:** You can explain the bug to someone who has never seen the code, and they understand why it happens.

### Phase 4: Fix

**Purpose:** Apply the minimal fix and prevent recurrence with a regression test.

**Process:**

1. **Write the regression test FIRST** (TDD-style):
   - The test should reproduce the exact bug from Phase 1
   - Run the test -- it MUST fail (confirming the bug exists)
   - The test becomes a permanent guard against recurrence
2. **Apply the minimal fix:**
   - Change only what is needed to fix the root cause (Phase 3 output)
   - Do not refactor adjacent code in the same change
   - Do not add unrelated improvements
3. **Verify the fix:**
   - Run the regression test -- it MUST pass
   - Run ALL existing tests -- they MUST still pass (no regressions)
   - Run the original reproduction case from Phase 1 -- the bug should be gone
4. **Search for similar patterns:**
   - The same bug often exists in multiple places in the codebase
   - Search for the same pattern: `grep -rn "similar_pattern" src/`
   - If found, fix those too and add regression tests for each

**Output:** A fix commit with a regression test and a brief explanation of the root cause.

**Commit format:**

```
fix: [brief description of what was wrong]

Root cause: [one sentence explaining why the bug existed]
Regression test: [test name that guards against recurrence]
```

## Common Root Cause Patterns

### Race Conditions

**What happens:** Async operations complete in an unexpected order. Operation B reads data before Operation A finishes writing it.

**Signs:** Bug is intermittent. Bug disappears with added logging (timing changes). Bug only appears under load.

**Fix pattern:** Add proper awaiting, use locks/mutexes, or redesign to eliminate the shared state.
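
A minimal sketch of the missing-await variant (names are hypothetical; the 10 ms timeout stands in for real async I/O):

```typescript
// A missing await lets the read run before the write completes.
let cache: string | null = null;

async function writeCache(value: string): Promise<void> {
  await new Promise((r) => setTimeout(r, 10)); // simulated async I/O
  cache = value;
}

async function buggy(): Promise<string | null> {
  writeCache("ready"); // BUG: not awaited -- the read below races the write
  return cache;        // still null
}

async function fixed(): Promise<string | null> {
  await writeCache("ready"); // the write completes before the read
  return cache;
}
```

The buggy version "works" whenever the write happens to win the race, which is exactly why the bug is intermittent.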

### State Mutation

**What happens:** An object is modified in place when the caller expected the original to be unchanged. Function A passes an object to Function B, which mutates it, and Function A's subsequent code uses the now-changed object.

**Signs:** Values change "mysteriously" between operations. Adding a `structuredClone` before the call fixes the bug.

**Fix pattern:** Clone objects at function boundaries. Use spread operators to create new objects. Follow immutability patterns.
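
A minimal sketch of the mutation and its spread-based fix (the `Order`/`applyDiscount` names are hypothetical):

```typescript
interface Order { total: number }

function applyDiscountInPlace(order: Order): Order {
  order.total = order.total * 0.9; // BUG: mutates the caller's object
  return order;
}

function applyDiscount(order: Order): Order {
  return { ...order, total: order.total * 0.9 }; // new object, caller untouched
}

const original: Order = { total: 100 };
const discounted = applyDiscount(original);
// original.total is still 100; discounted.total is 90
```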

### Boundary Errors

**What happens:** Off-by-one errors in array indexing, string slicing, pagination, or loop bounds. Empty collections handled incorrectly.

**Signs:** Bug only appears with certain input sizes (empty, one element, exactly N elements). Bug appears at page boundaries.

**Fix pattern:** Test with 0, 1, N, N+1 elements. Use inclusive/exclusive bounds consistently. Handle empty inputs explicitly.
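
The 0/1/N/N+1 checklist can be exercised against a small pagination helper (a sketch; `getPage` is a hypothetical name, and `slice`'s exclusive end index does the boundary work):

```typescript
// slice's end index is exclusive, so the correct upper bound is
// (page + 1) * size -- a trailing "- 1" here would drop the last item.
function getPage<T>(items: T[], page: number, size: number): T[] {
  return items.slice(page * size, (page + 1) * size);
}

// Exercise the boundary inputs the fix pattern calls out: 0, 1, N, N+1 items.
getPage([], 0, 2);         // []
getPage([1], 0, 2);        // [1]
getPage([1, 2], 0, 2);     // [1, 2]
getPage([1, 2, 3], 1, 2);  // [3] -- last partial page, not an error
```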

### Type Coercion

**What happens:** Implicit type conversions produce unexpected values. The string "0" compares loosely equal to 0 and false. Numeric comparisons are performed on string values.

**Signs:** Bug only appears with specific values (0, empty string, null, NaN). Comparison operators behave unexpectedly.

**Fix pattern:** Use strict equality (`===`). Explicit type conversion before comparison. Schema validation at input boundaries.
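
A minimal sketch of the coercion trap and its explicit-conversion fix:

```typescript
// Numbers stored as strings compare lexicographically, not numerically.
const ids = ["10", "9", "2"];

const wrong = [...ids].sort();                                // compares as strings
const right = [...ids].sort((a, b) => Number(a) - Number(b)); // explicit conversion

// wrong -> ["10", "2", "9"]
// right -> ["2", "9", "10"]
```

The same class of bug appears anywhere a value crosses a string boundary (query params, form inputs, CSV fields) without an explicit conversion.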

### Stale Closures

**What happens:** A callback captures a variable's value at creation time, not at execution time. By the time the callback runs, the variable has changed.

**Signs:** Bug only appears in async code or event handlers. The value in the callback is always the "old" value. Adding a log shows the variable changed between capture and execution.

**Fix pattern:** Capture the current value in a local variable. Use function arguments instead of closures. In React: add missing dependencies to useEffect/useCallback.
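
A minimal sketch of a stale snapshot and a fix that reads the live value at execution time (`retryLimit` is a hypothetical config value):

```typescript
let retryLimit = 3;

function makeOnFail(): () => number {
  const limit = retryLimit; // snapshot taken at creation time
  return () => limit;       // stale: ignores later config updates
}

const onFail = makeOnFail();
retryLimit = 5;             // config changes after the handler was created
// onFail() still returns 3

// Fix: read the live value at execution time (or pass it as an argument)
const onFailLive = (): number => retryLimit;
// onFailLive() returns 5
```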

### Missing Error Handling

**What happens:** An error occurs but is caught and silently swallowed. The caller receives undefined/null instead of an error, and proceeds with invalid data.

**Signs:** No error in logs, but behavior is wrong. Adding a throw in the catch block reveals the actual error. Values are unexpectedly null/undefined deep in the call chain.

**Fix pattern:** Never use empty catch blocks. Always log the error with context. Re-throw or return a meaningful error value.
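
A minimal sketch contrasting a swallowed error with logging-and-rethrowing at the boundary (the `parseConfig` names are hypothetical):

```typescript
// A swallowed error turns into a mysterious undefined downstream.
function parseConfigBad(raw: string): { port?: number } | undefined {
  try {
    return JSON.parse(raw);
  } catch {
    return undefined; // BUG: caller gets undefined with no explanation
  }
}

function parseConfig(raw: string): { port?: number } {
  try {
    return JSON.parse(raw);
  } catch (err) {
    // Log with context, then re-throw so the failure is visible at the boundary
    console.error("parseConfig: invalid JSON", { raw, err });
    throw new Error(`parseConfig failed for input of length ${raw.length}`);
  }
}
```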

### Incorrect Assumptions About External Data

**What happens:** Code assumes an API response, file content, or user input has a certain shape, but the actual data differs (missing fields, different types, unexpected nulls).

**Signs:** Bug only appears with certain inputs or after an external service changes. Works in tests (mocked data) but fails in production (real data).

**Fix pattern:** Validate external data at the boundary with a schema. Handle missing/unexpected fields explicitly. Never assume the shape of data you don't control.
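
A minimal boundary-validation sketch using a hand-rolled type guard; a schema library serves the same purpose, and the `User`/`parseUser` names are hypothetical:

```typescript
interface User { id: string; email: string }

// Explicit type guard: the only place that inspects the raw shape.
function isUser(data: unknown): data is User {
  return (
    typeof data === "object" && data !== null &&
    typeof (data as Record<string, unknown>).id === "string" &&
    typeof (data as Record<string, unknown>).email === "string"
  );
}

function parseUser(json: string): User {
  const data: unknown = JSON.parse(json);
  if (!isUser(data)) {
    throw new Error("response does not match expected User shape");
  }
  return data; // narrowed to User -- downstream code can trust the shape
}
```

Validating once at the boundary means the rest of the code never has to defensively re-check the shape.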

## Anti-Pattern Catalog

### Anti-Pattern: Shotgun Debugging

**What goes wrong:** Making random changes hoping something fixes the bug. Changing multiple things at once so you don't know which change actually helped.

**Signs:** Multiple unrelated changes in the fix commit. "Try this" mentality. Reverting changes randomly.

**Instead:** Follow the 4-phase process. One change at a time, tested after each change.

### Anti-Pattern: Fixing Symptoms

**What goes wrong:** Adding a null check without understanding why the value is null. Adding a retry without understanding why the operation fails. The root cause remains and will manifest differently.

**Signs:** The fix adds a guard clause but doesn't explain why the guarded condition occurs. The same module needs frequent "fixes." New bugs appear shortly after the fix.

**Instead:** Complete Phase 3 (Diagnose) before Phase 4 (Fix). Understand WHY before fixing WHAT.

### Anti-Pattern: No Regression Test

**What goes wrong:** The bug is fixed but no test guards against it recurring. Three months later, a refactoring reintroduces the exact same bug.

**Signs:** Fix commit has no test changes. The bug has been fixed before (check git log). Similar bugs keep appearing in the same module.

**Instead:** Always write the regression test FIRST (Phase 4, step 1). The test should fail before the fix and pass after.

### Anti-Pattern: Debugging in Production

**What goes wrong:** Adding console.log or debug statements to production code instead of reproducing locally. Production debugging is slow, risky, and often modifies the bug's behavior (observer effect).

**Signs:** `console.log` scattered in production code. Debug endpoints exposed. Debugging requires deploying to staging.

**Instead:** Reproduce the bug locally first (Phase 1). Use structured logging that is always present, not ad-hoc debug statements.

### Anti-Pattern: Blame-Driven Debugging

**What goes wrong:** Spending time on `git blame` to find who introduced the bug instead of understanding what the bug is. Attribution is irrelevant to the fix.

**Signs:** First action is `git blame`. Discussion focuses on who, not what. The fix is delayed by organizational process.

**Instead:** Focus on WHAT the bug is (Phase 3). Use `git log` and `git bisect` to find WHEN the bug was introduced (useful for understanding context), not WHO.

## Integration with Our Tools

**`oc_forensics`:** Use during Phase 2 (Isolate) to analyze failed pipeline runs. `oc_forensics` identifies the failing phase, agent, and root cause from pipeline execution logs. Particularly useful for bugs in the orchestration pipeline where the failure is in a subagent's output.

**`oc_review`:** Use after Phase 4 (Fix) to review the fix for introduced issues. The review catches cases where the fix solves the immediate bug but introduces a new one (incomplete error handling, missing edge cases).

**`oc_logs`:** Use during Phase 2 (Isolate) to inspect session event history. Useful for timing-related bugs where the order of events matters. The structured log shows exact timestamps, event types, and data payloads.

## Failure Modes

### Cannot Reproduce

**Symptom:** Phase 1 fails -- the bug doesn't appear in your environment.

**Recovery:**
1. Compare environments exactly: OS, runtime version, config, env vars
2. Check for data-dependent bugs: request the exact input that triggered the bug
3. Check for timing-dependent bugs: add artificial delays or run under load
4. If you still cannot reproduce: ask the reporter to record a session (screen recording, network trace)
5. Last resort: add structured logging to the relevant code path and deploy. Wait for the bug to occur and analyze the logs.

### Reproduce But Cannot Isolate

**Symptom:** Phase 2 fails -- the bug appears but you cannot narrow it to a specific location.

**Recovery:**
1. Add more granular logging between existing log points
2. Check async operation ordering -- add timestamps to all log messages
3. Use a debugger with breakpoints at module boundaries
4. Create a stripped-down reproduction that eliminates as much code as possible
5. If the codebase is complex, draw the call flow on paper and mark where you've verified correct behavior

### Root Cause Unclear

**Symptom:** Phase 3 fails -- you know WHERE the bug is but not WHY.

**Recovery:**
1. Rubber-duck debugging: explain the code path to an imaginary colleague, out loud, line by line
2. Read the surrounding code more widely -- the bug may be caused by an interaction with adjacent logic
3. Check the git history for the buggy function -- was it recently changed? What was the intent of the change?
4. If the root cause is genuinely unclear after 30 minutes, take a break. Bugs often become obvious after stepping away.

### Fix Introduces New Bugs

**Symptom:** The Phase 4 fix causes other tests to fail.

**Recovery:**
1. The fix changed behavior beyond the bug -- revert and apply a more targeted fix
2. The failing tests were depending on the buggy behavior -- update those tests (they were wrong)
3. The fix exposed a latent bug elsewhere -- debug that bug separately using this same 4-phase process

@@ -0,0 +1,311 @@
---
name: tdd-workflow
description: Strict RED-GREEN-REFACTOR TDD methodology with anti-pattern catalog and explicit failure modes
stacks: []
requires: []
---

# TDD Workflow

Strict RED-GREEN-REFACTOR test-driven development methodology. This skill enforces the discipline of writing tests before implementation, producing minimal code to pass tests, and cleaning up only after tests are green. Every cycle produces a commit. Every phase has a clear purpose and exit criterion.

TDD is not "writing tests." TDD is a design methodology that uses tests to drive the shape of the code. The test defines the behavior. The implementation satisfies the test. The refactor improves the code without changing behavior.

## When to Use

**Activate this skill when:**

- Implementing business logic with defined inputs and outputs
- Building API endpoints with request/response contracts
- Writing data transformations, parsers, or formatters
- Implementing validation rules or authorization checks
- Building algorithms, state machines, or decision logic
- Fixing a bug (write the regression test first, then fix)
- Implementing any function where you can describe the expected behavior

**Do NOT use when:**

- UI layout and styling (visual output is hard to assert meaningfully)
- Configuration files and static data
- One-off scripts or migrations
- Simple CRUD with no business logic (getById, list, delete)
- Prototyping or exploring an unfamiliar API (spike first, then TDD the real implementation)

## The RED-GREEN-REFACTOR Cycle

Each cycle implements ONE behavior. Not two. Not "a few related things." One behavior, one test, one cycle. Repeat until the feature is complete.

### Phase 1: RED (Write a Failing Test)

**Purpose:** Define the expected behavior BEFORE writing any production code. The test is a specification.

**Process:**

1. Write ONE test that describes a single expected behavior
2. The test name should read as a behavior description, not a method name:
   - DO: `"rejects expired tokens with 401 status"`
   - DO: `"calculates total with tax for US addresses"`
   - DON'T: `"test validateToken"` or `"test calculateTotal"`
3. Structure the test using Arrange-Act-Assert:
   - **Arrange:** Set up inputs and expected outputs
   - **Act:** Call the function or trigger the behavior
   - **Assert:** Verify the output matches expectations
4. Run the test -- it MUST fail
5. Read the failure message -- it should describe the missing behavior clearly
6. If the test passes without any new implementation, the behavior already exists or the test is wrong

**Commit:** `test: add failing test for [behavior]`

**Exit criterion:** The test fails with a clear, expected error message.

**Common mistakes in RED:**

- Writing multiple tests at once (write ONE, see it fail, then proceed)
- Writing the test and implementation simultaneously (defeats the purpose)
- Writing a test that cannot fail (tautology: `expect(true).toBe(true)`)
- Testing implementation details instead of behavior (asserting internal state)

### Phase 2: GREEN (Make It Pass)

**Purpose:** Write the MINIMUM code to make the test pass. Nothing more.

**Process:**

1. Read the failing test to understand what behavior is expected
2. Write the simplest possible code that makes the test pass
3. Do NOT add error handling the test does not require
4. Do NOT handle edge cases the test does not cover
5. Do NOT optimize -- performance improvements are Phase 3 or a new cycle
6. Do NOT "clean up" -- that is Phase 3
7. Run the test -- it MUST pass
8. Run all existing tests -- they MUST still pass (no regressions)

**Commit:** `feat: implement [behavior]`

**Exit criterion:** The new test passes AND all existing tests pass.

**The hardest discipline:** Resist the urge to write "good" code in this phase. You WILL see opportunities for abstraction, error handling, and optimization. Ignore them. Write the minimum. Phase 3 exists specifically for cleanup.

**Why minimum matters:** If you write more code than the test requires, that extra code is untested. Untested code is the source of bugs. The RED-GREEN cycle guarantees that every line of production code exists to satisfy a test.
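
A minimal sketch of what GREEN looks like for a hypothetical failing test named `"rejects empty email"` -- just enough to pass, nothing the test doesn't demand:

```typescript
// GREEN: the minimal implementation, not the full validator you can
// already imagine. No regex, no MX lookup -- no test demands them yet.
function validateEmail(email: string): { ok: boolean; error?: string } {
  if (email === "") {
    return { ok: false, error: "email is required" };
  }
  return { ok: true };
}
```

When a later RED cycle adds `"rejects email without @"`, the regex check earns its place then.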

### Phase 3: REFACTOR (Clean Up)

**Purpose:** Improve the code without changing behavior. The tests are your safety net.

**Process:**

1. Review the implementation from Phase 2 -- what can be improved?
2. Common refactoring targets:
   - Extract repeated logic into named functions
   - Rename variables for clarity
   - Remove duplication between test and production code
   - Simplify complex conditionals
   - Extract constants for magic numbers/strings
3. After EVERY change, run the tests -- they MUST still pass
4. If a test fails during refactoring, REVERT the last change immediately
5. Make smaller changes -- one refactoring at a time, verified by tests

**Commit (if changes were made):** `refactor: clean up [behavior]`

**Exit criterion:** Code is clean, all tests pass, no new behavior added.

**When to skip REFACTOR:** Never. Even if the code "looks fine," do a quick review pass. The habit of always refactoring prevents technical debt accumulation. If nothing needs changing, that's fine -- move to the next RED.

## Test Writing Guidelines

### Name Tests as Behavior Descriptions

Tests are documentation. The test name should explain what the system does, not how the test works.

```
// DO: Behavior descriptions
"creates user with hashed password"
"rejects duplicate email addresses"
"returns empty array when no results match"
"sends welcome email after successful registration"

// DON'T: Implementation descriptions
"test createUser"
"test email validation"
"test empty result"
"test sendEmail"
```

### One Assertion Per Test

Each test should verify one behavior. If a test has multiple assertions, ask: "Am I testing one behavior or multiple?"

**Acceptable:** Multiple assertions that verify different aspects of the SAME behavior:

```
// OK: Both assertions verify the "create user" behavior
expect(result.id).toBeDefined()
expect(result.email).toBe("user@example.com")
```

**Not acceptable:** Assertions that verify DIFFERENT behaviors:

```
// WRONG: Testing creation AND retrieval in one test
const created = await createUser(data)
const fetched = await getUser(created.id)
expect(created.id).toBeDefined()
expect(fetched.email).toBe(data.email)
```

### Arrange-Act-Assert Structure

Every test has three distinct sections. Separate them with blank lines for readability.

```
test("calculates discount for premium customers", () => {
  // Arrange
  const customer = { tier: "premium", orderTotal: 100 }

  // Act
  const discount = calculateDiscount(customer)

  // Assert
  expect(discount).toBe(15)
})
```

### Use Descriptive Failure Messages

When a test fails, the failure message should tell you what went wrong without reading the test code.

```
// DO: Descriptive
expect(response.status, "Expected 401 for expired token").toBe(401)

// DON'T: Generic
expect(response.status).toBe(401)
```

### Test Edge Cases in Separate Cycles

Each edge case gets its own RED-GREEN-REFACTOR cycle:

1. RED: Write a test for empty input
2. GREEN: Handle empty input
3. REFACTOR: Clean up
4. RED: Write a test for null input
5. GREEN: Handle null input
6. REFACTOR: Clean up

Do NOT bundle edge cases into the initial implementation.

## Anti-Pattern Catalog

### Anti-Pattern: Writing Tests After Code

**What goes wrong:** Tests become assertions of what the code already does, not specifications of what it should do. The tests verify the implementation, not the behavior. When the implementation has a bug, the test has the same bug.

**Signs:** All tests pass on the first run. Tests mirror implementation structure. Changing the implementation always requires changing the tests.

**Instead:** Always write the test FIRST. The test should fail before any implementation exists. If it doesn't fail, the test is wrong.

### Anti-Pattern: Skipping RED

**What goes wrong:** Writing the test and implementation together means you never verified that the test can actually detect a failure. The test might be a tautology that always passes.

**Signs:** You never see a red test. Tests are written alongside implementation in the same commit. You feel confident the test works but have no evidence.

**Instead:** Run the test, see the red failure message, read it, confirm it describes the missing behavior. Only then write the implementation.

### Anti-Pattern: Over-Engineering in GREEN

**What goes wrong:** Adding error handling, edge case coverage, performance optimizations, and abstractions before the test requires them. This extra code is untested and may contain bugs.

**Signs:** The implementation is significantly more complex than what the test verifies. Functions handle cases no test covers. You justify additions with "we'll need this later."

**Instead:** Write only what the current test needs. If you need error handling, write a RED test for the error case first. If you need optimization, write a benchmark test first.

### Anti-Pattern: Skipping REFACTOR

**What goes wrong:** Technical debt accumulates as each GREEN phase adds minimal code without cleanup. After 20 cycles, the codebase is a pile of special cases.

**Signs:** Production code has obvious duplication. Variable names are unclear. Functions grow linearly with each new test. You dread adding new features because the code is messy.

**Instead:** Always do a REFACTOR pass, even if it's a 30-second review that concludes "looks fine." Build the habit.

### Anti-Pattern: Testing Implementation Details

**What goes wrong:** Tests assert on internal method calls, private state, or call counts instead of observable behavior. Refactoring breaks all tests even though behavior is unchanged.

**Signs:** Tests use `toHaveBeenCalledWith` on internal methods. Tests assert on intermediate variables. Renaming an internal function breaks 15 tests.

**Instead:** Test the public API. Assert on outputs, side effects (emails sent, records created), and error behaviors. Never assert on how the implementation achieves the result.
|
|
239
|
+
|
|
240
|
+
### Anti-Pattern: Large Test Suites with No RED
|
|
241
|
+
|
|
242
|
+
**What goes wrong:** All tests are written at once, all passing from the start. This is "test-after" development, not TDD. The tests validate the existing implementation rather than driving the design.
|
|
243
|
+
|
|
244
|
+
**Signs:** A PR adds 20 tests and an implementation, all in one commit. No commit shows a failing test. The test file was written after the production code.
|
|
245
|
+
|
|
246
|
+
**Instead:** One test at a time, one cycle at a time. Each cycle produces 1-3 commits (RED, GREEN, optional REFACTOR). The git history tells the story.
|
|
247
|
+
|
|
248
|
+
## Integration with Our Tools
|
|
249
|
+
|
|
250
|
+
**After GREEN phase:** Invoke `oc_review` for a quick quality check on the implementation. The review catches issues (naming, error handling gaps, security concerns) that the REFACTOR phase should address.
|
|
251
|
+
|
|
252
|
+
**During REFACTOR:** If a test fails unexpectedly after a refactoring change, use `oc_forensics` to diagnose the root cause. It identifies the exact change that broke the test.
|
|
253
|
+
|
|
254
|
+
**After completing all cycles:** Run `oc_review` on the full changeset to catch cross-cutting concerns (duplication across files, missing integration tests, inconsistent patterns).
|
|
255
|
+
|
|
256
|
+
**Commit hygiene:** Each RED-GREEN-REFACTOR cycle produces up to 3 commits. This granular history is valuable -- it shows design evolution and makes bisecting easier.
|
|
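The rhythm can be sketched in a throwaway repo; the `RED:`/`GREEN:`/`REFACTOR:` message prefixes are a suggested convention here, not something any tool enforces:

```shell
# One TDD cycle as three commits, demonstrated in a temporary repository.
set -e
repo="$(mktemp -d)"
git -C "$repo" init -q
g() { git -C "$repo" -c user.name=dev -c user.email=dev@example.com "$@"; }

g commit -q --allow-empty -m "RED: failing test for empty-cart total"
g commit -q --allow-empty -m "GREEN: return 0 for empty cart"
g commit -q --allow-empty -m "REFACTOR: extract line-item sum helper"

# Oldest first: the history reads as the story of the cycle.
g log --format=%s --reverse
```

`git bisect` over a history like this lands on a commit small enough to read in one sitting.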
257
|
+
|
|
258
|
+
## Failure Modes
|
|
259
|
+
|
|
260
|
+
### Test Won't Fail (RED Phase)
|
|
261
|
+
|
|
262
|
+
**Symptom:** You write the test and it passes immediately without any new implementation.
|
|
263
|
+
|
|
264
|
+
**Diagnosis:**
|
|
265
|
+
- The behavior is already implemented (check existing code)
|
|
266
|
+
- The test is asserting something trivially true (tautology)
|
|
267
|
+
- The test is calling the wrong function or using stale imports
|
|
268
|
+
|
|
269
|
+
**Recovery:** Delete the test. Read the existing implementation. Write a test for behavior that is genuinely NOT implemented yet.
|
|
270
|
+
|
|
271
|
+
### Test Won't Pass (GREEN Phase)
|
|
272
|
+
|
|
273
|
+
**Symptom:** You write the implementation but the test still fails.
|
|
274
|
+
|
|
275
|
+
**Diagnosis:**
|
|
276
|
+
- Re-read the test carefully -- are you implementing what the test actually checks?
|
|
277
|
+
- Check for typos in function names, property names, import paths
|
|
278
|
+
- Simplify: can you make the test pass with a hardcoded return value? If yes, work backwards from there
|
|
279
|
+
|
|
280
|
+
**Recovery:** Start with the simplest possible implementation (even a hardcoded value). Then generalize one step at a time, running the test after each change.
|
|
281
|
+
|
|
282
|
+
### Refactoring Breaks Tests
|
|
283
|
+
|
|
284
|
+
**Symptom:** Tests fail after a refactoring change in Phase 3.
|
|
285
|
+
|
|
286
|
+
**Diagnosis:**
|
|
287
|
+
- The refactoring changed behavior (not just structure) -- revert
|
|
288
|
+
- A test was testing implementation details, not behavior -- the test needs updating
|
|
289
|
+
- The refactoring introduced a subtle bug (argument order, missing return, etc.)
|
|
290
|
+
|
|
291
|
+
**Recovery:** Revert the last change immediately. Make a smaller refactoring step. If the tests are too coupled to implementation, that's a separate problem to fix in a dedicated cycle.
|
|
292
|
+
|
|
293
|
+
### Can't Think of the Next Test
|
|
294
|
+
|
|
295
|
+
**Symptom:** The current behavior works, but you're not sure what to test next.
|
|
296
|
+
|
|
297
|
+
**Diagnosis:** This is normal and healthy -- it means the current scope might be complete.
|
|
298
|
+
|
|
299
|
+
**Recovery:**
|
|
300
|
+
1. Review the requirements -- is there untested behavior?
|
|
301
|
+
2. Check edge cases: empty input, null, boundary values, error conditions
|
|
302
|
+
3. Check integration points: does this work correctly with adjacent modules?
|
|
303
|
+
4. If nothing emerges, the feature may be done. Run coverage to confirm.
|
|
304
|
+
|
|
305
|
+
### TDD Feels Slow
|
|
306
|
+
|
|
307
|
+
**Symptom:** TDD seems like it takes twice as long as just writing the code.
|
|
308
|
+
|
|
309
|
+
**Reality:** TDD front-loads the time you would otherwise spend debugging later. The total time is usually the same or less. The difference: TDD time is predictable (small cycles), while debug time is unpredictable (hours chasing a bug).
|
|
310
|
+
|
|
311
|
+
**If genuinely slow:** Your cycles are too large. Each cycle should take 5-15 minutes (RED: 2-3 min, GREEN: 2-5 min, REFACTOR: 1-5 min). If a cycle takes longer, the behavior being tested is too complex -- split it.
|