npm - gsd-pi - Versions diffs - 2.76.0 → 2.77.0 - Mend

gsd-pi 2.76.0 → 2.77.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (536)

package/src/resources/skills/observability/SKILL.md ADDED

@@ -0,0 +1,174 @@
+---
+name: observability
+description: Add agent-first observability to code — structured logs, health endpoints, failure-state persistence, and explicit failure modes — so the next agent hitting a problem at 3am has the signals it needs to diagnose. Use when asked to "add logging", "add observability", "add metrics", "debug later", "make this observable", or when building/refactoring a subsystem that will run unattended (auto-mode engine, background jobs, servers, watchers). Operationalizes VISION.md's "agent-first observability" principle.
+---
+<objective>
+Instrument code so that a cold-start agent can understand what happened by reading signals, not by rerunning with extra logging. The deliverable is a set of specific instrumentation additions: structured logs at decision points, health/status surfaces for long-running processes, persisted failure state, and explicit failure modes that don't get swallowed.
+</objective>
+<context>
+GSD-2's `VISION.md` lists "agent-first observability" as a principle, and the system prompt calls it out: "A future version of you will land in this codebase with no memory… you add observability because you're the one who'll need it at 3am." GSD-2 already exemplifies this — `activity/*.jsonl`, `journal/*.jsonl`, `metrics.json`, `doctor-history.jsonl` — but new code doesn't get that treatment automatically.
+This skill is the thinking process for adding it. Not "add logs everywhere" — add the *right* signals at the *right* decision points.
+Invocation points:
+- Building auto-mode-style code (loops, dispatch, guards, retries)
+- Adding a background job, watcher, or scheduled task
+- Writing a server or long-running process
+- Refactoring a subsystem that has been hard to debug
+- Addressing a production bug where "we had no visibility" surfaced
+</context>
+<core_principle>
+**LOG DECISIONS, NOT ACTIVITY.** "Entering function X" is noise. "Dispatched unit `slice/S02` after guard check passed because `status=pending`" is signal. Every log line should answer a question a future debugger will ask.
+**FAIL LOUDLY AND PERSIST THE REASON.** Silent `try/catch` that returns `undefined` is an anti-pattern. If something fails, the failure state needs to be somewhere a fresh agent can find it — a JSONL, a status file, a health endpoint.
+**OBSERVABILITY IS NOT FREE.** Every log allocation, every metric, every health check costs CPU and disk. Add only what you would actually read.
+</core_principle>
+<process>
+## Step 1: Map the failure modes
+Before instrumenting, list what can go wrong:
+1. **What inputs could be invalid?** External API responses, user-submitted data, filesystem state, env vars.
+2. **What external dependencies could fail?** Network, DB, child processes, filesystem permissions.
+3. **What internal invariants could break?** State transitions, lock acquisition, concurrency assumptions.
+4. **What silent corruption is possible?** Truncated writes, partial transactions, stale caches.
+This map tells you where to instrument. Don't instrument uniformly — instrument at the decision points where these failures would manifest.
+## Step 2: Structured logs at decision points
+For each decision the code makes that could plausibly go wrong later:
+- **What decision?** ("Dispatching unit X", "Retrying with backoff", "Skipping validation because flag set")
+- **Why?** ("status=pending", "previous attempt exit=1", "--dev flag set")
+- **What would a future debugger want to know?** (The values that drove the choice.)
+Format:
+```ts
+log.info({
+  event: "unit-dispatched",
+  unitType: "slice",
+  unitId: "S02",
+  reason: "pending",
+  attempt: 1,
+  flowId,
+});
+```
+Use the project's existing logger if one exists. In gsd-2, follow the patterns in `src/resources/extensions/gsd/activity-log.ts` and `src/resources/extensions/gsd/journal.ts` — structured JSONL, one event per line, with `ts`, `event`, and domain-specific fields.
+Avoid:
+- `console.log("here")` — what does "here" mean in six months?
+- Logging secrets, tokens, or PII — ever.
+- Formatting structured data into a prose string — it can't be grepped or filtered.
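The structured-log guidance above can be sketched as one small formatter. This is an illustrative sketch only: `formatEvent`, `redact`, and the `SENSITIVE_KEYS` deny-list are hypothetical names, not gsd-2's logger API, and a real project should use its existing logger.

```typescript
// Hypothetical helpers for illustration; not part of gsd-2.
type LogValue = string | number | boolean | null;
type LogFields = Record<string, LogValue>;

// Assumed deny-list of field names that must never reach a log line.
const SENSITIVE_KEYS = new Set(["token", "password", "apiKey", "authorization"]);

// Replace the value of any sensitive field with a fixed marker.
function redact(fields: LogFields): LogFields {
  const safe: LogFields = {};
  for (const [key, value] of Object.entries(fields)) {
    safe[key] = SENSITIVE_KEYS.has(key) ? "[REDACTED]" : value;
  }
  return safe;
}

// One event per line: `ts`, `event`, then the domain-specific fields.
function formatEvent(event: string, fields: LogFields, now: Date = new Date()): string {
  return JSON.stringify({ ts: now.toISOString(), event, ...redact(fields) });
}

const line = formatEvent(
  "unit-dispatched",
  { unitType: "slice", unitId: "S02", reason: "pending", attempt: 1, token: "abc123" },
  new Date("2024-01-01T00:00:00.000Z"),
);
console.log(line);
```

Because every line is a single JSON object, a later reader can filter on `event` or `unitId` with ordinary tools instead of parsing prose.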
+## Step 3: Persist failure state
+When something fails in a way the caller can't immediately handle, write the failure state to disk:
+```ts
+await writeAtomically(
+  resolve(".gsd/runtime/last-error.json"),
+  JSON.stringify({
+    ts: new Date().toISOString(),
+    phase: "execute",
+    unitId,
+    error: { message, stack, code },
+    retryCount,
+  })
+);
+```
+A fresh agent reading `.gsd/runtime/` sees what happened last, what was retried, and where the process stopped. This pattern already exists in gsd-2: reuse the `atomic-write.ts` helpers and the `.gsd/runtime/` and `.gsd/forensics/` directories.
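For readers without access to gsd-2's `atomic-write.ts`, the idea behind an atomic write can be sketched as "write a temp file, then rename it over the target". `writeAtomicallySync` below is a hypothetical stand-in, not the gsd-2 helper, and the demo writes to a throwaway temp directory rather than `.gsd/runtime/`.

```typescript
import { mkdtempSync, readFileSync, renameSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { dirname, join } from "node:path";

// Hypothetical helper: write to a temp file in the same directory, then rename
// it over the target so a reader never observes a half-written file.
function writeAtomicallySync(target: string, contents: string): void {
  const suffix = Math.random().toString(36).slice(2);
  const tmp = join(dirname(target), `.tmp-${suffix}`);
  writeFileSync(tmp, contents, "utf8");
  renameSync(tmp, target);
}

// Demo in a throwaway directory (an assumption for the example).
const runtimeDir = mkdtempSync(join(tmpdir(), "runtime-"));
const target = join(runtimeDir, "last-error.json");

writeAtomicallySync(
  target,
  JSON.stringify({
    ts: new Date().toISOString(),
    phase: "execute",
    unitId: "S02",
    error: { message: "child process exited with code 1", code: "E_EXIT" },
    retryCount: 2,
  }),
);

// A fresh reader recovers the failure state from disk alone.
const restored = JSON.parse(readFileSync(target, "utf8"));
console.log(restored.phase, restored.unitId, restored.retryCount);
```

The temp file lives next to the target on purpose: a rename is only a single-step replacement when both paths are on the same filesystem.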
+## Step 4: Health and status surfaces
+For long-running processes:
+- **Health endpoint** (HTTP server) or **status file** (CLI tool). Cheap to call, no side effects. Returns current state: `{status: "healthy" | "degraded" | "down", ...diagnostics}`.
+- **Digest view** — a small representation of recent work. In gsd-2, this is `STATE.md` and the health widget. In a server, it's `/internal/status` with last 10 request summaries.
+- **Minimal metrics** — counters for the 3–5 things that matter (requests, errors, active jobs). Not everything — just what drives alerts.
+Don't build a metrics empire. Build exactly what you'd check at 3am.
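A health surface of the kind described above can be a pure function over a handful of counters. The sketch below is illustrative: the `Counters` shape, the 60-second heartbeat limit, and the 5% error-rate threshold are all assumptions chosen for the example, not values from gsd-2.

```typescript
type Status = "healthy" | "degraded" | "down";

// Assumed minimal counter set: only what would be checked at 3am.
interface Counters {
  requests: number;
  errors: number;
  activeJobs: number;
  lastHeartbeatMs: number; // epoch millis of the last successful loop tick
}

interface HealthSnapshot {
  status: Status;
  errorRate: number;
  heartbeatAgeMs: number;
  activeJobs: number;
}

// Cheap and side-effect free: derives status from counters only.
function healthSnapshot(c: Counters, nowMs: number): HealthSnapshot {
  const errorRate = c.requests === 0 ? 0 : c.errors / c.requests;
  const heartbeatAgeMs = nowMs - c.lastHeartbeatMs;
  let status: Status = "healthy";
  if (heartbeatAgeMs > 60_000) {
    status = "down"; // assumed threshold: no heartbeat for a minute
  } else if (errorRate > 0.05) {
    status = "degraded"; // assumed threshold: more than 5% errors
  }
  return { status, errorRate, heartbeatAgeMs, activeJobs: c.activeJobs };
}

const now = 1_700_000_000_000;
const snapshot = healthSnapshot(
  { requests: 200, errors: 20, activeJobs: 3, lastHeartbeatMs: now - 5_000 },
  now,
);
console.log(JSON.stringify(snapshot));
```

The same object can back an HTTP health endpoint or be written to a status file, depending on whether the process is a server or a CLI tool.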
+## Step 5: Explicit failure modes
+Replace silent handling with explicit handling:
+```ts
+// Bad
+try {
+  return await db.getUser(id);
+} catch {
+  return null;
+}
+// Good
+try {
+  return await db.getUser(id);
+} catch (err) {
+  log.error({ event: "db-getuser-failed", userId: id, err: serializeError(err) });
+  throw new DatabaseError("Failed to load user", { cause: err, userId: id });
+}
+```
+The caller now knows the failure happened, gets an error type it can branch on, and a log line exists for forensics.
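The snippet above leaves `serializeError` and `DatabaseError` undefined. A minimal, self-contained sketch of what they might look like follows; both definitions, the `context` field, and the stubbed `getUserFromDb` are assumptions made for illustration, not code from the package.

```typescript
// Hypothetical: turn any thrown value into a plain, loggable object.
function serializeError(err: unknown): { name: string; message: string; stack?: string } {
  if (err instanceof Error) {
    return { name: err.name, message: err.message, stack: err.stack };
  }
  return { name: "NonError", message: String(err) };
}

// Hypothetical typed error the caller can branch on.
class DatabaseError extends Error {
  constructor(message: string, readonly context: { cause: unknown; userId: string }) {
    super(message);
    this.name = "DatabaseError";
    // Keeps `instanceof` working even when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Stub standing in for a real database call; it always fails.
function getUserFromDb(id: string): string {
  throw new Error(`connection refused while loading ${id}`);
}

function loadUser(id: string): string {
  try {
    return getUserFromDb(id);
  } catch (err) {
    console.error(JSON.stringify({ event: "db-getuser-failed", userId: id, err: serializeError(err) }));
    throw new DatabaseError("Failed to load user", { cause: err, userId: id });
  }
}

// The caller branches on the error type instead of receiving a silent null.
let outcome = "ok";
try {
  loadUser("u1");
} catch (err) {
  outcome = err instanceof DatabaseError ? `db-failure:${err.context.userId}` : "unknown-failure";
}
console.log(outcome);
```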
+## Step 6: Remove the scaffolding
+Before shipping, cull the ad-hoc instrumentation you used while debugging. Keep only:
+- Decision-point logs that a future agent would use
+- Persistent failure state
+- Health/status surfaces
+- Explicit failure modes
+Drop:
+- Temporary `console.log` debug lines
+- Spammy per-iteration logs that no one will read
+- Metrics that were "might be useful someday"
+The system prompt says it plainly: "Remove noisy one-off instrumentation before finishing unless it provides durable diagnostic value."
+## Step 7: Verify the signals work
+Pick one plausible failure mode from Step 1 and simulate it (inject an error, point at a missing file, break a dependency). Confirm:
+1. The failure produced a log line a cold-start agent could understand.
+2. The failure state persisted somewhere durable.
+3. The health surface reflects the degraded state.
+4. Nothing was swallowed silently.
+If any signal is missing, add it — that's the gap this skill exists to catch.
+</process>
+<anti_patterns>
+- **Uniform logging.** Logging every function entry/exit buries signal in noise.
+- **Prose logs.** `"Processing user 42 now"` vs `{event: "user-process-start", userId: 42}` — the latter is queryable, the former is not.
+- **Silent swallowing.** `catch {}` or `catch (err) { /* ignore */ }` without a log is a deferred production incident.
+- **Metrics empire.** 200 Prometheus metrics nobody reads. Ship 5 that drive alerts.
+- **Logging secrets.** API keys, tokens, passwords, full request bodies with PII — never.
+- **"I'll add logging when it breaks."** By then you don't have the signal. Instrument now.
+- **Over-instrumenting hot paths.** Logging inside a tight loop kills performance. Sample or aggregate.
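As an illustration of "sample or aggregate" for hot paths, the sketch below counts inside the loop and emits one summary line per batch instead of one line per item. `processAll`, the `batch-progress` event name, and the batch size are hypothetical choices for the example.

```typescript
// Aggregate inside the loop; emit one structured line every `every` items
// and once more at the end, rather than logging per iteration.
function processAll(items: number[], emit: (line: string) => void, every = 1000): void {
  let failures = 0;
  items.forEach((item, index) => {
    if (item < 0) failures++; // stand-in for real per-item work that may fail
    const processed = index + 1;
    if (processed % every === 0 || processed === items.length) {
      emit(JSON.stringify({ event: "batch-progress", processed, failures }));
    }
  });
}

const emitted: string[] = [];
const items = Array.from({ length: 2500 }, (_, i) => (i % 500 === 0 ? -1 : 1));
processAll(items, (line) => emitted.push(line));
console.log(emitted.length, emitted[emitted.length - 1]);
```

Here 2,500 iterations produce three log lines, and the final line still carries the totals a debugger would want.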
+</anti_patterns>
+<success_criteria>
+- [ ] Failure modes were listed before instrumenting.
+- [ ] Logs are at decision points, structured, and contain the driving values.
+- [ ] Failure state is persisted to a known location (`.gsd/runtime/`, `/var/log/`, a status file).
+- [ ] Long-running processes expose a health or status surface.
+- [ ] No silent `try/catch` swallowing errors.
+- [ ] Ad-hoc debug instrumentation was removed.
+- [ ] One plausible failure was simulated and the signals were confirmed to reach a fresh reader.
+</success_criteria>

package/src/resources/skills/security-review/SKILL.md ADDED

@@ -0,0 +1,181 @@
+---
+name: security-review
+description: Threat-model-driven security review of a change, feature, or subsystem. Runs a STRIDE-style pass (Spoofing, Tampering, Repudiation, Info disclosure, Denial of service, Elevation of privilege), examines the actual code, and produces a filing-ready report with severity, exploit scenario, and concrete remediation. Use when asked to "security review", "threat model", "check for vulnerabilities", "audit this for security", "secure this", or before shipping any change that touches auth, input handling, data access, or external surfaces.
+---
+<objective>
+Produce a security review that names specific exploit paths through the actual code — not a generic checklist. The deliverable is a prioritized list of findings, each with: where the issue lives, the threat category, a concrete exploit scenario, severity, and a remediation the caller can implement. Read-only: does not modify code.
+</objective>
+<context>
+GSD-2's general `review` skill covers security as one of several categories. This skill is the deeper pass — triggered deliberately when security is the primary concern. It complements v1's `/gsd-secure-phase` concept, adapted to the gsd-2 artifact model.
+Invocation points:
+- Any change touching authentication, authorization, session handling
+- Any change touching user input → database, filesystem, or shell
+- Any change exposing a new external surface (HTTP endpoint, webhook, IPC boundary)
+- Secrets handling, environment variable changes, crypto code
+- Pre-release audit of a feature or milestone
+- Response to a suspected vulnerability
+Do NOT use for:
+- General code review (use `review`)
+- Performance audits (use `code-optimizer`)
+</context>
+<core_principle>
+**CODE BEFORE CHECKLISTS.** A threat model that doesn't read the code is theater. Find the actual input source, the actual validation (or absence), the actual sink. Cite file:line for every finding.
+**THREAT, NOT HYPOTHETICAL.** "SQL injection is possible in theory" is useless. "If an attacker passes `' OR 1=1--` to `getUser(name)` at `src/db/users.ts:42`, the query becomes `SELECT … WHERE name='' OR 1=1--'`, returning every row" is actionable.
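The exploit in the paragraph above can be reproduced without a database by comparing string concatenation with a parameterized statement. Both functions below are hypothetical illustrations; the `?` placeholder style is an assumption and varies by driver.

```typescript
// Vulnerable: user input is spliced directly into the SQL text.
function unsafeQuery(name: string): string {
  return `SELECT * FROM users WHERE name='${name}'`;
}

// Parameterized: SQL text and values travel separately, so the input
// can never change the structure of the query.
interface Statement {
  text: string;
  values: string[];
}

function safeQuery(name: string): Statement {
  return { text: "SELECT * FROM users WHERE name = ?", values: [name] };
}

const payload = "' OR 1=1--";
const injected = unsafeQuery(payload);
const parameterized = safeQuery(payload);

console.log(injected); // SELECT * FROM users WHERE name='' OR 1=1--'
console.log(parameterized.text, parameterized.values);
```

A finding written at this level of detail names the input, the sink, and the resulting query, which is what makes it actionable.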
+**READ-ONLY.** Don't patch while reviewing — you conflate reviewer and author and lose the audit trail. Report, let the user act.
+</core_principle>
+<process>
+## Step 1: Scope the review
+Identify what to review:
+- Recent diff (staged / branch / specific commit)
+- A named subsystem (`src/auth/`, the webhook handler, etc.)
+- A user-provided concern ("I'm worried about our JWT handling")
+If the scope is vague, ask one round of clarifying questions (1–3 questions). Otherwise proceed.
+## Step 2: Map the attack surface
+Before STRIDE: identify every untrusted entry point in the scope:
+- HTTP routes / GraphQL resolvers / RPC endpoints
+- CLI flag parsing and argv consumption
+- Webhook handlers, event subscribers
+- Environment variables read at runtime
+- Files read from untrusted locations
+- Third-party library deserialization (YAML, XML, pickle, etc.)
+- IPC boundaries, child processes
+For each, note: who can reach this surface? (public internet, authenticated user, same-host process, admin-only).
+## Step 3: STRIDE pass
+For each attack surface, walk STRIDE:
+### Spoofing (identity)
+- Can an attacker pretend to be another user?
+- Are identity tokens verified before they're trusted?
+- Session cookies: HttpOnly, Secure, SameSite set?
+### Tampering (integrity)
+- Can an attacker modify data in transit or at rest?
+- Are webhooks signed and signatures verified?
+- Are incoming payloads rehydrated without integrity checks?
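When checking whether webhook signatures are verified, it helps to know what a correct check looks like. The sketch below assumes an HMAC-SHA256 signature sent as a hex string; the scheme, the function names, and the secret are assumptions for illustration, and real providers document their own formats.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Compute the hex HMAC-SHA256 of the raw request body.
function sign(secret: string, body: string): string {
  return createHmac("sha256", secret).update(body).digest("hex");
}

// Verify over the raw body, comparing in constant time.
function verifyWebhook(secret: string, body: string, signatureHex: string): boolean {
  const expected = Buffer.from(sign(secret, body), "hex");
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return expected.length === received.length && timingSafeEqual(expected, received);
}

const secret = "example-shared-secret"; // assumption: placeholder value
const body = JSON.stringify({ event: "payment.succeeded", amount: 4200 });
const signature = sign(secret, body);

console.log(verifyWebhook(secret, body, signature));
console.log(verifyWebhook(secret, body.replace("4200", "1"), signature));
```

During review, red flags include comparing signatures with `===`, verifying a re-serialized body instead of the raw bytes, and handlers that parse the payload before verifying it.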
+### Repudiation (audit trail)
+- Is there a log of who did what?
+- Can an attacker erase their trail?
+- Are logs tamper-evident where it matters?
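One common way to make a log tamper-evident is a hash chain, where each entry commits to the one before it. The sketch below is a simplified illustration with hypothetical names; it only detects edits if the latest hash is also stored somewhere the attacker cannot rewrite.

```typescript
import { createHash } from "node:crypto";

interface ChainedEntry {
  message: string;
  prevHash: string;
  hash: string;
}

const GENESIS = "genesis"; // assumed fixed starting value for the chain

function hashEntry(prevHash: string, message: string): string {
  return createHash("sha256").update(prevHash).update("\n").update(message).digest("hex");
}

// Append returns a new array; each entry stores the previous entry's hash.
function append(log: ChainedEntry[], message: string): ChainedEntry[] {
  const last = log[log.length - 1];
  const prevHash = last ? last.hash : GENESIS;
  return [...log, { message, prevHash, hash: hashEntry(prevHash, message) }];
}

// Recompute every hash; any edited or removed entry breaks the chain.
function verifyChain(log: ChainedEntry[]): boolean {
  let prev = GENESIS;
  for (const entry of log) {
    if (entry.prevHash !== prev || entry.hash !== hashEntry(prev, entry.message)) {
      return false;
    }
    prev = entry.hash;
  }
  return true;
}

let auditLog: ChainedEntry[] = [];
auditLog = append(auditLog, "user=alice action=login");
auditLog = append(auditLog, "user=alice action=delete-record id=7");

const tampered = auditLog.map((entry, i) =>
  i === 1 ? { ...entry, message: "user=bob action=delete-record id=7" } : entry,
);

console.log(verifyChain(auditLog), verifyChain(tampered));
```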
+### Information disclosure
+- Does an error message leak internal state, stack traces, file paths, DB queries?
+- Are secrets logged anywhere?
+- Are authorization checks upstream of data loading, or does the query run first?
+### Denial of service
+- Are there unbounded loops, unpaginated queries, or user-controlled recursion?
+- Rate limits on expensive endpoints?
+- Regex-on-user-input vulnerable to ReDoS?
+### Elevation of privilege
+- Are admin-only routes actually gated?
+- Can a low-privilege user trigger a high-privilege operation through an unauthenticated webhook?
+- Are role checks enforced at the handler, the service, and the data layer — or just one?
+Use `Agent(subagent_type=Explore)` in parallel if the scope is large — one sub-agent per STRIDE category over the same surface list.
+## Step 4: OWASP cross-check (web scope)
+If the scope includes web surfaces, confirm against the top OWASP patterns that STRIDE doesn't cleanly cover:
+- Injection (SQL, NoSQL, command, template, LDAP)
+- XSS (stored, reflected, DOM)
+- SSRF (server makes requests to attacker-controlled URLs)
+- Insecure deserialization
+- Path traversal
+- Open redirects
+- Broken access control at object level (IDOR)
+- CSRF where cookies are used for auth
+For each present, find the code path. Same standard: cite file:line.
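For the path traversal item, the safe pattern to look for is "resolve, then confirm the result is still inside the base directory". The helper below is a hypothetical sketch; it does not follow symlinks, which a real check may also need to handle.

```typescript
import { resolve, sep } from "node:path";

// Returns the resolved path if it stays inside baseDir, otherwise null.
function resolveInside(baseDir: string, userPath: string): string | null {
  const base = resolve(baseDir);
  const target = resolve(base, userPath);
  // Appending the separator stops "/srv/uploads-evil" from matching "/srv/uploads".
  const inside = target === base || target.startsWith(base + sep);
  return inside ? target : null;
}

console.log(resolveInside("/srv/uploads", "reports/q1.txt"));
console.log(resolveInside("/srv/uploads", "../../etc/passwd"));
console.log(resolveInside("/srv/uploads", "../uploads-evil/x"));
```

Code that joins user input onto a base path and opens the result without a check like this is a candidate finding.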
+## Step 5: Triage
+For each finding, assign:
+- **Severity:** Critical / High / Medium / Low / Informational
+- **Exploitability:** Remote unauthenticated / authenticated user / adjacent user / admin-only / local-only
+- **Business impact:** Data breach / account takeover / service disruption / audit failure / minor
+Severity × Exploitability = priority. Sort findings by priority.
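The priority rule above can be made concrete with numeric weights. The weights below are assumptions chosen for illustration (the skill text does not define any); what matters is that the same scale is applied to every finding.

```typescript
type Severity = "Critical" | "High" | "Medium" | "Low" | "Informational";
type Exploitability =
  | "Remote unauthenticated"
  | "Authenticated user"
  | "Adjacent user"
  | "Admin-only"
  | "Local-only";

// Assumed weights; higher means worse or easier to reach.
const SEVERITY_WEIGHT: Record<Severity, number> = {
  Critical: 5, High: 4, Medium: 3, Low: 2, Informational: 1,
};
const EXPLOIT_WEIGHT: Record<Exploitability, number> = {
  "Remote unauthenticated": 5, "Authenticated user": 4, "Adjacent user": 3,
  "Admin-only": 2, "Local-only": 1,
};

interface Finding {
  id: string;
  severity: Severity;
  exploitability: Exploitability;
}

function priority(f: Finding): number {
  return SEVERITY_WEIGHT[f.severity] * EXPLOIT_WEIGHT[f.exploitability];
}

// Sort a copy, highest priority first.
function triage(findings: Finding[]): Finding[] {
  return [...findings].sort((a, b) => priority(b) - priority(a));
}

const ordered = triage([
  { id: "admin-export", severity: "Critical", exploitability: "Admin-only" },
  { id: "search-sqli", severity: "High", exploitability: "Remote unauthenticated" },
  { id: "profile-idor", severity: "Medium", exploitability: "Authenticated user" },
]);
console.log(ordered.map((f) => `${f.id}=${priority(f)}`).join(", "));
```

With these weights the remotely reachable High finding outranks the Critical one behind admin-only access, which matches the point about reachability in the anti-patterns.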
+## Step 6: Write the report
+```markdown
+## Security Review — <scope>
+### Summary
+<1–3 sentences — biggest finding and overall posture>
+### Findings
+#### CRITICAL-1: SQL injection in `getUser`
+**Location:** `src/db/users.ts:42`
+**Category:** Tampering / Info disclosure (STRIDE) + OWASP A03:2021 Injection
+**Exploit:** Passing `' OR 1=1--` to the `name` parameter produces the query `SELECT * FROM users WHERE name='' OR 1=1--'`, returning every row. `name` arrives from `POST /api/search` without validation.
+**Exploitability:** Remote unauthenticated (endpoint has no auth).
+**Remediation:** Use a parameterized query. The codebase's `db.prepare` helper at `src/db/util.ts:17` handles this — switch `getUser` to it.
+#### HIGH-2: ...
+### Non-findings considered
+<brief: what you checked and ruled out — prevents repeat reviews>
+### Out of scope
+<what wasn't reviewed>
+```
+Offer to file the report as a GitHub issue; this requires explicit confirmation per the outward-action rule. Sensitive findings should stay in `.gsd/security-reviews/` and not be pushed to a public tracker; check the repository's security policy first.
+## Step 7: Follow-ups
+If the review found CRITICAL or HIGH issues:
+- Recommend filing a private security advisory (not a public issue) if the repo is public.
+- Flag the finding for `/gsd start hotfix` if it's in the scope of active work.
+- Append one line to `.gsd/DECISIONS.md` if the remediation involves an architectural change.
+</process>
+<anti_patterns>
+- **Generic checklists.** "Does it validate input?" yes/no without pointing at the code is not a review.
+- **Hypothetical exploits without code.** If you can't name the path, the finding isn't real yet.
+- **Modifying code during review.** Reviewer and author must stay separate.
+- **Treating every theoretical issue as CRITICAL.** Severity requires exploitability.
+- **Skipping reachability.** A "critical" behind admin-only auth is usually not critical.
+- **Filing sensitive findings publicly.** Check the repo's security policy first.
+</anti_patterns>
+<success_criteria>
+- [ ] Every finding cites a file:line.
+- [ ] Every finding has a concrete exploit scenario, not just a category.
+- [ ] Severity, exploitability, and business impact are stated.
+- [ ] At least one non-finding is listed (shows what was ruled out).
+- [ ] No code was modified during the review.
+- [ ] Critical findings are routed through an appropriate disclosure channel, not auto-filed publicly.
+</success_criteria>

package/src/resources/skills/spike-wrap-up/SKILL.md ADDED

@@ -0,0 +1,138 @@
+---
+name: spike-wrap-up
+description: Package findings from a completed spike into a durable, project-local skill that auto-loads on future similar work. Reads the most recent `.gsd/workflows/spikes/` directory, interviews the user briefly on what's reusable, then writes `.claude/skills/<name>/SKILL.md`. Use when asked to "wrap up the spike", "package this as a skill", "make this reusable", "turn findings into a skill", or at the end of the synthesize phase of `/gsd start spike`. Closes the parity gap with GSD v1's `/gsd-spike-wrap-up`.
+---
+<objective>
+Convert the output of a research spike (`SCOPE.md`, `research/*.md`, `RECOMMENDATION.md`) into a project-local skill under `.claude/skills/` so that the next time a similar task comes up, the agent loads the skill automatically. This is how throwaway spikes become durable capital.
+</objective>
+<context>
+GSD's spike workflow (`src/resources/extensions/gsd/workflow-templates/spike.md`) produces documents in `.gsd/workflows/spikes/<slug>/`. Those documents are useful once and then forgotten unless something packages them for reuse.
+GSD already watches `.claude/skills/` (and `.agents/skills/`) at both user and project levels — see `src/resources/extensions/gsd/skill-discovery.ts`. Any skill written there is picked up on the next session without further wiring. This skill is the bridge from "spike done" to "skill available."
+Invocation points:
+- End of Phase 3 (synthesize) in `/gsd start spike` — prompt suggests running this skill
+- User has a spike directory and wants to harvest it
+- Pre-existing `RECOMMENDATION.md` that deserves a permanent home
+</context>
+<core_principle>
+**NOT EVERY SPIKE DESERVES A SKILL.** If the recommendation was "don't do X," there may be no reusable guidance. Ask the user first; exit without writing if the answer is no.
+**PROJECT-LOCAL, NOT USER-GLOBAL.** Write to `.claude/skills/` in the repo, not `~/.claude/skills/`. The skill encodes project-specific choices that should not leak into unrelated projects.
+**DESCRIPTION IS THE DISCOVERABILITY SIGNAL.** The `description` field in frontmatter is the primary signal the agent uses to judge relevance and decide whether to load the skill — it is a heuristic, not a deterministic trigger. Write it as keywords the future agent will plausibly encounter, not a summary.
+</core_principle>
+<process>
+## Step 1: Find the spike
+1. List directories under `.gsd/workflows/spikes/` — sort by mtime, newest first.
+2. If multiple exist, ask the user which to wrap up. Default: the most recent.
+3. If none exist, tell the user and stop. This skill requires a completed spike.
+Read the core files:
+- `<spike>/SCOPE.md` — the question that was asked
+- `<spike>/research/*.md` — the angles investigated
+- `<spike>/RECOMMENDATION.md` — the conclusion
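The "sort by mtime, newest first" listing in item 1 above can be sketched with Node's synchronous filesystem calls. `listSpikesNewestFirst` is a hypothetical helper, and the demo builds a throwaway directory instead of reading a real `.gsd/workflows/spikes/`.

```typescript
import { existsSync, mkdirSync, mkdtempSync, readdirSync, statSync, utimesSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Returns spike directory names, most recently modified first.
// An empty result means there is nothing to wrap up.
function listSpikesNewestFirst(spikesDir: string): string[] {
  if (!existsSync(spikesDir)) return [];
  return readdirSync(spikesDir, { withFileTypes: true })
    .filter((entry) => entry.isDirectory())
    .map((entry) => ({
      name: entry.name,
      mtimeMs: statSync(join(spikesDir, entry.name)).mtimeMs,
    }))
    .sort((a, b) => b.mtimeMs - a.mtimeMs)
    .map((entry) => entry.name);
}

// Demo fixture: two spike directories with explicit modification times.
const spikesRoot = mkdtempSync(join(tmpdir(), "spikes-"));
mkdirSync(join(spikesRoot, "queue-library-eval"));
mkdirSync(join(spikesRoot, "webhook-retries"));
utimesSync(join(spikesRoot, "queue-library-eval"), new Date("2024-01-01"), new Date("2024-01-01"));
utimesSync(join(spikesRoot, "webhook-retries"), new Date("2024-06-01"), new Date("2024-06-01"));

const spikes = listSpikesNewestFirst(spikesRoot);
console.log(spikes);
```

The first element is the default candidate; when more than one directory is returned, the user should still be asked which spike to wrap up.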
+## Step 2: Decide if it deserves a skill
+Ask the user — one round:
+1. **Is the conclusion reusable on future work, or was it specific to one decision?**
+   Recommendation: packaging is worth it if the findings include repeatable guidance (how to evaluate X, a pattern to follow, a library's gotchas). If the spike ended in "we chose library Y, end of story," it probably belongs in `.gsd/DECISIONS.md` instead.
+2. **What is the trigger?** When should a future agent load this skill? Give concrete keywords — "adding a new webhook handler", "writing a SQL migration", etc.
+If the user says it's not worth packaging, offer instead to append a summary to `.gsd/DECISIONS.md` and stop.
+## Step 3: Design the skill
+Before writing, sketch in the conversation:
+- **Name:** kebab-case, short, unambiguous. Prefix with the project's domain when helpful (`auth-webhook-setup`, not `webhook`).
+- **Description (frontmatter):** one sentence, 120–1024 chars, keyword-rich. Must state when the agent should load it. Rewrite at least twice before settling.
+- **Objective:** one paragraph — what the skill does and what artifact it produces.
+- **Process:** numbered steps. Reference the spike's findings as the source, but the skill itself should be executable without reading the spike.
+- **Anti-patterns:** gotchas the spike surfaced — things that looked right but didn't work.
+- **Success criteria:** checklist the skill's user can confirm against.
+Show this sketch to the user. One round of feedback. Iterate.
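The name and description constraints listed above are mechanical enough to check before showing the sketch to the user. The validator below is a hypothetical illustration; the kebab-case pattern is one reasonable reading of "kebab-case", not a rule taken from gsd-2's skill loader.

```typescript
// Lowercase alphanumeric segments separated by single hyphens.
const KEBAB_CASE = /^[a-z0-9]+(-[a-z0-9]+)*$/;

// Returns a list of problems; an empty list means the draft passes.
function checkFrontmatter(name: string, description: string): string[] {
  const problems: string[] = [];
  if (!KEBAB_CASE.test(name)) {
    problems.push("name must be kebab-case");
  }
  if (description.length < 120 || description.length > 1024) {
    problems.push(`description must be 120-1024 chars (got ${description.length})`);
  }
  return problems;
}

const goodDescription =
  "Set up and verify signed webhook handlers for the auth service. Use when adding a new " +
  "webhook handler, rotating webhook secrets, or debugging signature mismatches.";

console.log(checkFrontmatter("auth-webhook-setup", goodDescription));
console.log(checkFrontmatter("Webhook Setup", "Handles webhooks."));
```

Passing these checks says nothing about whether the description contains the right trigger keywords; that still needs the user's judgment.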
+## Step 4: Write the skill
+Write to `.claude/skills/<name>/SKILL.md` (create the directory). Match the frontmatter + XML-tag structure used by other bundled skills — see `src/resources/skills/review/SKILL.md` for the canonical shape.
+Minimum structure:
+```markdown
+---
+name: <skill-name>
+description: <one sentence with trigger keywords>
+---
+<objective>
+<one paragraph — what this skill does>
+</objective>
+<context>
+<when to invoke, what produced it (cite the spike), assumptions>
+</context>
+<process>
+## Step 1: <action>
+<instructions>
+## Step 2: <action>
+<instructions>
+</process>
+<anti_patterns>
+- <gotcha from the spike>
+- <another gotcha>
+</anti_patterns>
+<success_criteria>
+- [ ] <observable confirmation>
+- [ ] <observable confirmation>
+</success_criteria>
+```
+If the spike produced a reusable template (a config file, a starter script), copy it into `.claude/skills/<name>/templates/` or `.claude/skills/<name>/references/` and reference it from the skill body.
+## Step 5: Archive the spike, link from the skill
+1. In the new SKILL.md, reference the originating spike: "Derived from `.gsd/workflows/spikes/<slug>/RECOMMENDATION.md` (dated YYYY-MM-DD)."
+2. Do NOT delete the spike directory — spikes are research artifacts and retain value for forensics.
+3. Append one line to `.gsd/DECISIONS.md`: `- YYYY-MM-DD [spike]: packaged "<slug>" findings as skill <name>`.
+## Step 6: Confirm pickup
+Tell the user the skill will be surfaced on the next session via `skill-discovery.ts`. If they want to use it immediately, they can `Read .claude/skills/<name>/SKILL.md` now.
+</process>
+<anti_patterns>
+- **Writing to `~/.claude/skills/`.** That's user-global. Project spikes produce project skills — keep them scoped.
+- **Verbose frontmatter description.** The description is an index entry, not a tutorial. Keywords over prose.
+- **Packaging every spike.** If the outcome was "we decided X once," append to DECISIONS.md and move on.
+- **Copy-pasting the spike verbatim into the skill.** The spike is research; the skill is executable guidance. Re-author.
+- **Deleting the source spike.** Research artifacts should persist.
+</anti_patterns>
+<success_criteria>
+- [ ] A new `.claude/skills/<name>/SKILL.md` exists with well-formed frontmatter.
+- [ ] The `description` field uses keywords that will plausibly match future agent work.
+- [ ] The skill body is executable on its own without re-reading the originating spike.
+- [ ] The originating spike is referenced from the skill.
+- [ ] `.gsd/DECISIONS.md` has a one-line entry recording the packaging.
+- [ ] The spike directory itself is untouched.
+</success_criteria>

package/src/resources/skills/tdd/SKILL.md ADDED

@@ -0,0 +1,112 @@
+---
+name: tdd
+description: Test-driven development with red-green-refactor loops built around vertical slices (tracer bullets), not horizontal layers. Use when asked to "use TDD", "write test-first", "red-green-refactor", "build this with tests", or whenever a feature has a clear observable contract and would benefit from tests that outlive refactors. Complements the bundled test and add-tests skills — use this for the discipline, use those for the mechanics.
+---
+<objective>
+Drive feature implementation through one red-green-refactor cycle per vertical slice. Each cycle produces a single failing test that pins one observable behavior, then the minimal code that makes it pass, then refactoring while GREEN. Never refactor while RED. Never write all tests up front.
+</objective>
+<context>
+GSD already organizes work into slices (S##) and tasks (T##). This skill operates at the task level — inside a single T##-PLAN.md, it structures the execution as a sequence of tiny red-green-refactor cycles rather than write-all-then-test or write-all-without-tests.
+Invocation points:
+- Task plan calls out behavior with a clear external contract (pure function, API endpoint, module boundary)
+- Bug fix — write the failing repro test first, then fix
+- Refactor into a new module — write the tests against the new interface first
+Do not use this skill for:
+- Exploratory spikes (use `/gsd start spike` — no production code ships)
+- Pure UI polish where visual verification beats unit tests
+- Scripts that run once and are deleted
+</context>
+<core_principle>
+**TESTS VERIFY BEHAVIOR THROUGH PUBLIC INTERFACES, NOT IMPLEMENTATION DETAILS.** A good test reads like a specification and survives refactors. A test that mocks internals or asserts private state fails every time you clean up the code — that is a bad test pretending to be a good one.
+**VERTICAL SLICES, NOT HORIZONTAL LAYERS.** Writing all tests upfront ("horizontal slicing") produces tests for behavior you imagined, not behavior the code actually exhibits. One tracer bullet at a time: write one test, make it pass, learn, write the next.
+**NEVER REFACTOR WHILE RED.** Refactoring without a passing test means you are guessing whether you broke something. Get to green first. Then clean up. Then go red again for the next slice.
+</core_principle>
+<process>
+## Step 1: Confirm the interface
+Before writing anything:
+1. What is the public interface? (function signature, HTTP route, module exports)
+2. Which behaviors matter most? List them in order of how badly you'd want to know if they broke.
+3. What does a "caller" look like? Write one example call in prose.
+If the user has not supplied these, ask — one round, 1–3 questions.
+## Step 2: Tracer bullet
+Pick the first behavior — usually the happy path for the most common input. Write one test that exercises it end-to-end through the public interface. Run it. It must fail (RED). If it passes, the test is not testing what you think it is — fix the test before writing any code.
+Then write the minimum code to make it pass. "Minimum" means: if hard-coding `return 42` makes the test pass, hard-code `return 42`. You will generalize on the next cycle. Run the test. It must pass (GREEN).
+This proves the end-to-end path works — test harness, imports, wiring, build. Everything from Step 3 onward is incremental.
+## Step 3: Red-green loop
+Pick the next behavior. Write one test that pins it. Run — RED. Write the minimum code that makes it pass without breaking prior tests. Run — GREEN.
+Guidelines for picking the next behavior:
+- Alternate happy-path-variant and edge-case tests — don't do all happy paths then all edges.
+- Stop adding happy paths when they stop revealing new code. Move to errors.
+- If the next test would require no new code, you have hit the end of this slice — skip to Step 4.
+Guidelines for writing the test:
+- One assertion per concept (multiple `expect` calls that describe one behavior are fine).
+- No mocking of internals. Mock external I/O (network, filesystem, clock) only when necessary.
+- The test name reads as a sentence: "rejects requests missing an auth header".
+Guidelines for writing the code:
+- Minimum to pass. If there are three cases and two are untested, write only the one that's under test.
+- Copy-paste is fine on the first and second occurrence. Extract on the third.
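Steps 2 and 3 can be traced end to end in a few lines. The sketch below replays the implementation at each stage against the tests so the RED and GREEN states are visible; `passes`, `expectEqual`, and the price-formatting example are hypothetical stand-ins for a real test framework and a real feature.

```typescript
// Tiny stand-ins for a test framework (assumption for the example).
function expectEqual(actual: string, expected: string): void {
  if (actual !== expected) throw new Error(`expected ${expected}, got ${actual}`);
}
function passes(check: () => void): boolean {
  try {
    check();
    return true;
  } catch {
    return false;
  }
}

type FormatPrice = (cents: number) => string;

// Tests pin behavior through the public interface only.
const formatsWholeDollars = (fmt: FormatPrice) => expectEqual(fmt(4200), "$42.00");
const formatsCents = (fmt: FormatPrice) => expectEqual(fmt(150), "$1.50");

// The implementation at each stage of the loop.
const notImplemented: FormatPrice = () => {
  throw new Error("not implemented");
};
const hardCoded: FormatPrice = () => "$42.00"; // minimum code for the first test
const generalized: FormatPrice = (cents) => `$${(cents / 100).toFixed(2)}`;

const history = [
  { step: "tracer bullet, no code yet", green: passes(() => formatsWholeDollars(notImplemented)) },
  { step: "hard-coded return", green: passes(() => formatsWholeDollars(hardCoded)) },
  { step: "second test vs hard-coded", green: passes(() => formatsCents(hardCoded)) },
  { step: "second test vs generalized", green: passes(() => formatsCents(generalized)) },
  { step: "first test still holds", green: passes(() => formatsWholeDollars(generalized)) },
];

for (const { step, green } of history) {
  console.log(`${green ? "GREEN" : "RED"}  ${step}`);
}
```

The second test is what forces the hard-coded value to be replaced, which is the point of generalizing one cycle at a time.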
+## Step 4: Refactor while GREEN
+Now that the behavior is pinned, clean up. Extract duplicated logic. Rename unclear variables. Deepen the module — move responsibilities behind the interface until the internals stop leaking into the test.
+Rules:
+- Every refactor keeps every test GREEN. Run tests after each small change.
+- If a refactor would require changing a test, the test was coupled to implementation — either the test is wrong, or the interface you thought you were pinning is actually different. Fix the test first, then refactor.
+- Don't refactor speculatively. Extract around real duplication and real seams, not imagined ones.
+## Step 5: Close the slice
+When the behavior the task plan specified is fully under test and the code is clean:
+1. Run the full test suite — not just the tests you wrote. Verify no regressions.
+2. Append a one-line summary of what is now pinned to `.gsd/KNOWLEDGE.md` if the behavior is non-obvious or the test surfaced a trap future agents should know about.
+3. Use `gsd_*` tools to mark the task complete — do not edit checkboxes by hand.
+</process>
+<anti_patterns>
+- **Writing all tests first.** Horizontal slicing. Produces imagined-behavior tests that drift away from what the code actually does.
+- **Mocking internals.** `jest.mock("./internal-helper")` tells you the wiring matches your mental model, not that the behavior is correct.
+- **Refactoring while RED.** You have no signal. Any change could be right or wrong.
+- **"Just one more test" after GREEN without refactoring.** You accumulate duplication and the code rots in place.
+- **Testing implementation detail.** "It calls `fetchUser` three times." Who cares? Test the observable result.
+- **Skipping the failing run.** If you never saw RED, you don't know the test would have caught the bug.
+</anti_patterns>
+<success_criteria>
+- [ ] Every behavior the task plan called out has a test that pins it through the public interface.
+- [ ] Every test went RED before it went GREEN. No test was born passing.
+- [ ] All refactoring happened on GREEN. The final code is not the first draft.
+- [ ] Full test suite runs clean — no regressions.
+- [ ] No test mocks an internal helper or asserts a private field.
+</success_criteria>