joycraft 0.5.6 → 0.5.8
- package/README.md +59 -338
- package/dist/{chunk-QIYIJ7VR.js → chunk-A2CQG5J5.js} +680 -61
- package/dist/chunk-A2CQG5J5.js.map +1 -0
- package/dist/cli.js +3 -3
- package/dist/index.d.ts +6 -0
- package/dist/index.js +78 -7
- package/dist/index.js.map +1 -1
- package/dist/{init-MKRU6SYT.js → init-QXG5BT4Y.js} +83 -11
- package/dist/init-QXG5BT4Y.js.map +1 -0
- package/dist/{init-autofix-V2Y2O4HO.js → init-autofix-Y5DQOFEU.js} +2 -2
- package/dist/{upgrade-HK6F5SXI.js → upgrade-VUOSXPR5.js} +2 -2
- package/package.json +1 -1
- package/dist/chunk-QIYIJ7VR.js.map +0 -1
- package/dist/init-MKRU6SYT.js.map +0 -1
- /package/dist/{init-autofix-V2Y2O4HO.js.map → init-autofix-Y5DQOFEU.js.map} +0 -0
- /package/dist/{upgrade-HK6F5SXI.js.map → upgrade-VUOSXPR5.js.map} +0 -0
package/README.md (CHANGED)
````diff
@@ -72,6 +72,8 @@ Joycraft auto-detects your tech stack and creates:
 - `/joycraft-interview` Lightweight brainstorm. Yap about ideas, get a structured summary
 - `/joycraft-decompose` Break a brief into small, testable specs
 - `/joycraft-add-fact` Capture project knowledge on the fly -- routes to the right context doc
+- `/joycraft-lockdown` Generate constrained execution boundaries (read-only tests, deny patterns)
+- `/joycraft-verify` Spawn a separate subagent to independently verify implementation against spec
 - `/joycraft-session-end` Capture discoveries, verify, commit, push
 - `/joycraft-implement-level5` Set up Level 5 (autofix loop, holdout scenarios, scenario evolution)
 - **docs/** structure: `briefs/`, `specs/`, `discoveries/`, `contracts/`, `decisions/`, `context/`
````
````diff
@@ -96,6 +98,8 @@ After init, open Claude Code and use the installed skills:
 /joycraft-new-feature      # Interview → Feature Brief → Atomic Specs → ready to execute
 /joycraft-decompose        # Break any feature into small, independent specs
 /joycraft-add-fact         # Capture a fact mid-session -- auto-routes to the right context doc
+/joycraft-lockdown         # Generate constrained execution boundaries for autonomous sessions
+/joycraft-verify           # Independent verification -- spawns a subagent to check your work
 /joycraft-session-end      # Wrap up: discoveries, verification, commit, push
 /joycraft-implement-level5 # Set up Level 5 (autofix, holdout scenarios, evolution)
 ```
````
````diff
@@ -170,379 +174,92 @@ Joycraft tracks what it installed vs. what you've customized. Unmodified files u
 
 > **A note on complexity:** Setting up Level 5 does have some moving parts and, depending on the complexity of your stack (software vs. hardware, monorepo vs. single app, etc.), this will require a good amount of prompting and trial-and-error to get right. I've done my best to make this as painless as possible, but just note - this is not a one-shot-prompt-done-in-5-minutes kind of thing. For small projects and simple stacks it will be easy, but any level of complexity is going to take some iteration, so plan ahead. Full step-by-step guides along with a video coming soon.
 
-Level 5 is where specs go in and validated software comes out
+Level 5 is where specs go in and validated software comes out — four GitHub Actions workflows, a separate scenarios repo, and two AI agents that can never see each other's work. Run `/joycraft-implement-level5` for guided setup, or `npx joycraft init-autofix` via CLI.
 
-
+See the full **[Level 5 Autonomy Guide](docs/guides/level-5-autonomy.md)** for architecture diagrams, setup steps, workflow details, and cost estimates.
 
-
-npx joycraft init-autofix --scenarios-repo my-project-scenarios --app-id 3180156
-```
-
-### Architecture Overview
-
-Level 5 has four moving parts. Each is a GitHub Actions workflow that communicates via `repository_dispatch` events. No custom servers, no webhooks, no external services.
-
-```mermaid
-graph TB
-subgraph "Main Repository"
-A[Push specs to docs/specs/] -->|push to main| B[Spec Dispatch Workflow]
-C[PR opened] --> D[CI runs]
-D -->|CI fails| E[Autofix Workflow]
-D -->|CI passes| F[Scenarios Dispatch Workflow]
-G[Scenarios Re-run Workflow]
-end
-
-subgraph "Scenarios Repository (private)"
-H[Scenario Generation Workflow]
-I[Scenario Run Workflow]
-J[Holdout Tests]
-K[Specs Mirror]
-end
-
-B -->|repository_dispatch: spec-pushed| H
-H -->|reads specs, writes tests| J
-H -->|repository_dispatch: scenarios-updated| G
-G -->|repository_dispatch: run-scenarios| I
-F -->|repository_dispatch: run-scenarios| I
-I -->|posts PASS/FAIL comment| C
-E -->|Claude fixes code, pushes| D
-
-style J fill:#f9f,stroke:#333
-style K fill:#bbf,stroke:#333
-```
-
-### The Four Workflows
-
-#### 1. Autofix Workflow (`autofix.yml`)
-
-Triggered when CI **fails** on a PR. Claude Code CLI reads the failure logs and attempts a fix.
-
-```mermaid
-sequenceDiagram
-participant CI as CI Workflow
-participant AF as Autofix Workflow
-participant Claude as Claude Code CLI
-participant PR as Pull Request
-
-CI->>AF: workflow_run (conclusion: failure)
-AF->>AF: Generate GitHub App token
-AF->>AF: Checkout PR branch
-AF->>AF: Count previous autofix attempts
-
-alt attempts >= 3
-AF->>PR: Comment: "Human review needed"
-else attempts < 3
-AF->>AF: Fetch CI failure logs
-AF->>AF: Strip ANSI codes
-AF->>Claude: claude -p "Fix this CI failure..." <br/> --dangerously-skip-permissions --max-turns 20
-Claude->>Claude: Read logs, edit code, run tests
-Claude->>AF: Exit (changes committed locally)
-AF->>PR: Push fix (commit prefix: "autofix:")
-AF->>PR: Comment: summary of fix
-Note over CI,PR: CI re-runs automatically on push
-end
-```
-
-**Key details:**
-- Uses a GitHub App identity for pushes to avoid GitHub's anti-recursion protection
-- Concurrency group per PR so only one autofix runs at a time
-- Max 3 iterations, then posts "human review needed"
-- No `--model` flag. Claude CLI handles model selection.
-- Strips ANSI escape codes from logs so Claude gets clean text
-
````
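The removed autofix notes mention stripping ANSI escape codes from CI logs before handing them to Claude. A minimal sketch of how that step could look — the `strip_ansi` helper is hypothetical and GNU `sed` syntax is assumed; the workflow's actual command isn't shown in this diff:

```shell
# Hypothetical helper: remove ANSI escape sequences (colors, cursor movement)
# so the model receives plain text instead of terminal control bytes.
strip_ansi() {
  sed -e 's/\x1b\[[0-9;]*[A-Za-z]//g'
}

printf '\033[31mFAIL\033[0m src/parser.test.ts\n' | strip_ansi
# prints: FAIL src/parser.test.ts
```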
````diff
-#### 2. Scenarios Dispatch Workflow (`scenarios-dispatch.yml`)
-
-Triggered when CI **passes** on a PR. Fires a `repository_dispatch` to the scenarios repo to run holdout tests against the PR branch.
-
-```mermaid
-sequenceDiagram
-participant CI as CI Workflow
-participant SD as Scenarios Dispatch
-participant SR as Scenarios Repo
-
-CI->>SD: workflow_run (conclusion: success, PR)
-SD->>SD: Generate GitHub App token
-SD->>SR: repository_dispatch: run-scenarios<br/>payload: {pr_number, branch, sha, repo}
-```
-
````
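The `run-scenarios` dispatch described in the removed docs rides on GitHub's standard `repository_dispatch` event. Its request body would look roughly like this — only the `event_type` and the payload keys (`pr_number`, `branch`, `sha`, `repo`) come from the docs; the values are illustrative:

```json
{
  "event_type": "run-scenarios",
  "client_payload": {
    "pr_number": 123,
    "branch": "feature/spec-042",
    "sha": "0a1b2c3",
    "repo": "your-org/your-app"
  }
}
```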
268
|
-
#### 3. Spec Dispatch Workflow (`spec-dispatch.yml`)
|
|
269
|
-
|
|
270
|
-
Triggered when spec files are pushed to `main`. Sends the spec content to the scenarios repo so the scenario agent can write tests.
|
|
271
|
-
|
|
272
|
-
```mermaid
|
|
273
|
-
sequenceDiagram
|
|
274
|
-
participant Dev as Developer
|
|
275
|
-
participant Main as Main Repo (push to main)
|
|
276
|
-
participant SPD as Spec Dispatch Workflow
|
|
277
|
-
participant SR as Scenarios Repo
|
|
278
|
-
|
|
279
|
-
Dev->>Main: Push specs to docs/specs/
|
|
280
|
-
Main->>SPD: push event (docs/specs/** changed)
|
|
281
|
-
SPD->>SPD: git diff --diff-filter=AM (added/modified only)
|
|
282
|
-
|
|
283
|
-
loop For each changed spec
|
|
284
|
-
SPD->>SR: repository_dispatch: spec-pushed<br/>payload: {spec_filename, spec_content, commit_sha, branch, repo}
|
|
285
|
-
end
|
|
286
|
-
|
|
287
|
-
Note over SPD: Deleted specs are ignored -<br/>existing scenario tests remain
|
|
288
|
-
```
|
|
289
|
-
|
|
290
|
-
#### 4. Scenarios Re-run Workflow (`scenarios-rerun.yml`)
|
|
291
|
-
|
|
292
|
-
Triggered when the scenarios repo updates its tests. Re-dispatches all open PRs to the scenarios repo so they get tested with the latest holdout tests.
|
|
293
|
-
|
|
294
|
-
```mermaid
|
|
295
|
-
sequenceDiagram
|
|
296
|
-
participant SR as Scenarios Repo
|
|
297
|
-
participant RR as Re-run Workflow
|
|
298
|
-
participant SRun as Scenarios Run
|
|
299
|
-
|
|
300
|
-
SR->>RR: repository_dispatch: scenarios-updated
|
|
301
|
-
RR->>RR: List open PRs via GitHub API
|
|
302
|
-
|
|
303
|
-
alt No open PRs
|
|
304
|
-
RR->>RR: Exit (no-op)
|
|
305
|
-
else Has open PRs
|
|
306
|
-
loop For each open PR
|
|
307
|
-
RR->>SRun: repository_dispatch: run-scenarios<br/>payload: {pr_number, branch, sha, repo}
|
|
308
|
-
end
|
|
309
|
-
end
|
|
310
|
-
```
|
|
311
|
-
|
|
312
|
-
**Why this exists:** There's a race condition. The implementation agent might open a PR before the scenario agent finishes writing new tests. The re-run workflow handles this by re-testing all open PRs when new tests land. Worst case, a PR merges before the re-run, and the new tests protect the very next PR. You're never more than one cycle behind.
|
|
313
|
-
|
|
314
|
-
### The Holdout Wall
|
|
315
|
-
|
|
316
|
-
The core safety mechanism. Two agents, two repos, one shared interface (specs):
|
|
317
|
-
|
|
318
|
-
```mermaid
|
|
319
|
-
graph LR
|
|
320
|
-
subgraph "Implementation Agent (main repo)"
|
|
321
|
-
IA_sees["Can see:<br/>Source code<br/>Internal tests<br/>Specs"]
|
|
322
|
-
IA_cant["Cannot see:<br/>Scenario tests<br/>Scenario repo"]
|
|
323
|
-
end
|
|
324
|
-
|
|
325
|
-
subgraph "Specs (shared interface)"
|
|
326
|
-
Specs["docs/specs/*.md<br/>Describes WHAT should happen<br/>Never describes HOW it's tested"]
|
|
327
|
-
end
|
|
328
|
-
|
|
329
|
-
subgraph "Scenario Agent (scenarios repo)"
|
|
330
|
-
SA_sees["Can see:<br/>Specs (via dispatch)<br/>Scenario tests<br/>Specs mirror"]
|
|
331
|
-
SA_cant["Cannot see:<br/>Source code<br/>Internal tests"]
|
|
332
|
-
end
|
|
333
|
-
|
|
334
|
-
IA_sees --> Specs
|
|
335
|
-
Specs --> SA_sees
|
|
336
|
-
|
|
337
|
-
style IA_cant fill:#fcc,stroke:#933
|
|
338
|
-
style SA_cant fill:#fcc,stroke:#933
|
|
339
|
-
style Specs fill:#cfc,stroke:#393
|
|
340
|
-
```
|
|
341
|
-
|
|
342
|
-
This is the same principle as a holdout set in machine learning. If the implementation agent could see the scenario tests, it would optimize to pass them specifically instead of building correct software. By keeping the wall intact, scenario tests catch real behavioral regressions, not test-gaming.
|
|
343
|
-
|
|
344
|
-
### Scenario Evolution
|
|
345
|
-
|
|
346
|
-
Scenarios aren't static. When you push new specs, the scenario agent automatically triages them and writes new holdout tests.
|
|
347
|
-
|
|
348
|
-
```mermaid
|
|
349
|
-
flowchart TD
|
|
350
|
-
A[New spec pushed to main] --> B[Spec Dispatch sends to scenarios repo]
|
|
351
|
-
B --> C[Scenario Agent reads spec]
|
|
352
|
-
C --> D{Triage: is this user-facing?}
|
|
353
|
-
|
|
354
|
-
D -->|Internal refactor, CI, dev tooling| E[Skip - commit note: 'No scenario changes needed']
|
|
355
|
-
D -->|New user-facing behavior| F[Write new scenario test file]
|
|
356
|
-
D -->|Modified existing behavior| G[Update existing scenario tests]
|
|
357
|
-
|
|
358
|
-
F --> H[Commit to scenarios main]
|
|
359
|
-
G --> H
|
|
360
|
-
H --> I[Dispatch scenarios-updated to main repo]
|
|
361
|
-
I --> J[Re-run workflow tests open PRs with new scenarios]
|
|
362
|
-
|
|
363
|
-
style D fill:#ffd,stroke:#993
|
|
364
|
-
style E fill:#ddd,stroke:#999
|
|
365
|
-
style F fill:#cfc,stroke:#393
|
|
366
|
-
style G fill:#cfc,stroke:#393
|
|
367
|
-
```
|
|
368
|
-
|
|
369
|
-
**The scenario agent's prompt instructs it to:**
|
|
370
|
-
- Act as a QA engineer, never a developer
|
|
371
|
-
- Write only behavioral tests (invoke the built artifact, assert on output)
|
|
372
|
-
- Never import source code or reference internal implementation
|
|
373
|
-
- Use a triage decision tree: SKIP / NEW / UPDATE
|
|
374
|
-
- Err on the side of writing a test if the spec is ambiguous
|
|
375
|
-
|
|
376
|
-
**The specs mirror:** The scenarios repo maintains a `specs/` folder that mirrors every spec it receives. This gives the scenario agent historical context ("what features already exist?") without access to the main repo's codebase.
|
|
377
|
-
|
|
378
|
-
### The Complete Loop
|
|
379
|
-
|
|
380
|
-
Here's the full lifecycle from spec to shipped, validated code:
|
|
381
|
-
|
|
382
|
-
```mermaid
|
|
383
|
-
sequenceDiagram
|
|
384
|
-
participant Human as Human (writes specs)
|
|
385
|
-
participant Main as Main Repo
|
|
386
|
-
participant ScAgent as Scenario Agent
|
|
387
|
-
participant ScRepo as Scenarios Repo
|
|
388
|
-
participant ImplAgent as Implementation Agent
|
|
389
|
-
participant Autofix as Autofix Workflow
|
|
390
|
-
|
|
391
|
-
Human->>Main: Push spec to docs/specs/
|
|
392
|
-
Main->>ScAgent: spec-pushed dispatch
|
|
393
|
-
|
|
394
|
-
par Scenario Generation
|
|
395
|
-
ScAgent->>ScAgent: Triage spec
|
|
396
|
-
ScAgent->>ScRepo: Write/update holdout tests
|
|
397
|
-
ScRepo->>Main: scenarios-updated dispatch
|
|
398
|
-
and Implementation
|
|
399
|
-
Human->>ImplAgent: Execute spec (fresh session)
|
|
400
|
-
ImplAgent->>Main: Open PR
|
|
401
|
-
end
|
|
402
|
-
|
|
403
|
-
Main->>Main: CI runs on PR
|
|
404
|
-
|
|
405
|
-
alt CI fails
|
|
406
|
-
Main->>Autofix: Autofix workflow triggers
|
|
407
|
-
Autofix->>Main: Push fix, CI re-runs
|
|
408
|
-
end
|
|
409
|
-
|
|
410
|
-
alt CI passes
|
|
411
|
-
Main->>ScRepo: run-scenarios dispatch
|
|
412
|
-
ScRepo->>ScRepo: Clone PR branch, build, run holdout tests
|
|
413
|
-
ScRepo->>Main: Post PASS/FAIL comment on PR
|
|
414
|
-
end
|
|
415
|
-
|
|
416
|
-
alt Scenarios PASS
|
|
417
|
-
Note over Human,Main: Ready for human review and merge
|
|
418
|
-
else Scenarios FAIL
|
|
419
|
-
Main->>Autofix: Autofix attempts fix
|
|
420
|
-
Note over Autofix,ScRepo: Loop continues (max 3 iterations)
|
|
421
|
-
end
|
|
422
|
-
```
|
|
423
|
-
|
|
424
|
-
### What Gets Installed
|
|
425
|
-
|
|
426
|
-
| Where | File | Purpose |
|
|
427
|
-
|-------|------|---------|
|
|
428
|
-
| Main repo | `.github/workflows/autofix.yml` | CI failure → Claude fix → push |
|
|
429
|
-
| Main repo | `.github/workflows/scenarios-dispatch.yml` | CI pass → trigger holdout tests |
|
|
430
|
-
| Main repo | `.github/workflows/spec-dispatch.yml` | Spec push → trigger scenario generation |
|
|
431
|
-
| Main repo | `.github/workflows/scenarios-rerun.yml` | New tests → re-test open PRs |
|
|
432
|
-
| Scenarios repo | `workflows/run.yml` | Clone PR, build, run tests, post results |
|
|
433
|
-
| Scenarios repo | `workflows/generate.yml` | Receive spec, run scenario agent |
|
|
434
|
-
| Scenarios repo | `prompts/scenario-agent.md` | Scenario agent prompt template |
|
|
435
|
-
| Scenarios repo | `example-scenario.test.ts` | Example holdout test |
|
|
436
|
-
| Scenarios repo | `package.json` | Minimal vitest setup |
|
|
437
|
-
| Scenarios repo | `README.md` | Explains holdout pattern to contributors |
|
|
438
|
-
|
|
439
|
-
### Setup Guide
|
|
181
|
+
## Tuning: Risk Interview & Git Autonomy
|
|
440
182
|
|
|
441
|
-
|
|
183
|
+
When `/joycraft-tune` runs for the first time, it does two things:
|
|
442
184
|
|
|
443
|
-
|
|
185
|
+
### Risk interview
|
|
444
186
|
|
|
445
|
-
|
|
187
|
+
3-5 targeted questions about what's dangerous in your project (production databases, live APIs, secrets, files that should be off-limits). From your answers, Joycraft generates:
|
|
446
188
|
|
|
447
|
-
|
|
448
|
-
|
|
449
|
-
|
|
450
|
-
|
|
451
|
-
- **Contents**: Read & Write
|
|
452
|
-
- **Pull requests**: Read & Write
|
|
453
|
-
- **Actions**: Read & Write
|
|
454
|
-
5. Click **Create GitHub App**
|
|
455
|
-
6. Note the **App ID** from the settings page (you'll need it in Step 2)
|
|
456
|
-
7. Scroll to **Private keys** > click **Generate a private key**
|
|
457
|
-
8. Save the downloaded `.pem` file -- you'll need it in Step 3
|
|
458
|
-
9. Click **Install App** in the left sidebar > install it on the repo(s) you want to use
|
|
189
|
+
- **NEVER rules** for CLAUDE.md (e.g., "NEVER connect to production DB")
|
|
190
|
+
- **Deny patterns** for `.claude/settings.json` (blocks dangerous bash commands)
|
|
191
|
+
- **`docs/context/production-map.md`** documenting what's real vs. safe to touch
|
|
192
|
+
- **`docs/context/dangerous-assumptions.md`** documenting "Agent might assume X, but actually Y"
|
|
459
193
|
|
|
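As a rough illustration of the deny-patterns artifact: `.claude/settings.json` permission rules follow Claude Code's `permissions.deny` shape, and entries generated from a risk interview might look like the sketch below. The specific rule strings are invented examples, not Joycraft's actual output:

```json
{
  "permissions": {
    "deny": [
      "Bash(psql:*)",
      "Bash(curl:*)",
      "Read(./.env)"
    ]
  }
}
```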
````diff
-
+This takes 2-3 minutes and dramatically reduces the chance of your agent doing something catastrophic.
 
-
+### Git autonomy
 
-
-npx joycraft init-autofix --scenarios-repo my-project-scenarios --app-id YOUR_APP_ID
-```
+One question: **how autonomous should git be?**
 
-
+- **Cautious** (default) commits freely but asks before pushing or opening PRs. Good for learning the workflow.
+- **Autonomous** commits, pushes to feature branches, and opens PRs without asking. Good for spec-driven development where you want full send.
 
-
+Either way, Joycraft generates explicit git boundaries in your CLAUDE.md: commit message format (`verb: message`), specific file staging (no `git add -A`), no secrets in commits, no force-pushing.
 
````
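The `verb: message` commit format is easy to check mechanically. A hedged sketch — the `check_msg` helper is illustrative, not something Joycraft ships:

```shell
# Illustrative lint for the `verb: message` commit convention:
# lowercase verb, colon, space, non-empty message. Not part of Joycraft.
check_msg() {
  printf '%s' "$1" | grep -Eq '^[a-z]+: .+'
}

check_msg "fix: handle empty spec files" && echo "ok"
check_msg "Fixed stuff" || echo "rejected"
# prints: ok
# prints: rejected
```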
````diff
-
+## Test-First Development
 
-
-|--------|-------|
-| `JOYCRAFT_APP_PRIVATE_KEY` | The full contents of the `.pem` file from Step 1 |
-| `ANTHROPIC_API_KEY` | Your Anthropic API key (used by the autofix workflow to run Claude) |
+Joycraft enforces a test-first workflow because tests are the mechanism to autonomy. Without tests, your agent implements 9 specs and you have to manually verify each one. With tests, the agent knows when it's done and you can trust the output.
 
-
+### How it works
 
-
-# Create a private repo for holdout tests
-gh repo create my-project-scenarios --private
-
-# Copy the scenario templates into it
-cp -r docs/templates/scenarios/* ../my-project-scenarios/
-cd ../my-project-scenarios
-git add -A && git commit -m "init: scaffold scenarios repo from Joycraft"
-git push
-```
+When you run `/joycraft-new-feature`, the interview now includes test-focused questions: what test types your project uses, how fast your tests need to run for iteration, and whether you want lockdown mode. Every atomic spec generated by `/joycraft-decompose` includes a **Test Plan** that maps each acceptance criterion to at least one test.
 
-
+The execution order is enforced:
 
-
+1. **Write failing tests first** -- the agent writes tests from the spec's Test Plan
+2. **Run them and confirm they fail** -- if they pass immediately, something is wrong (you're testing the wrong thing)
+3. **Implement until tests pass** -- the tests are the contract
 
-
-# Check workflow files exist in your main repo
-ls .github/workflows/autofix.yml .github/workflows/scenarios-dispatch.yml \
-.github/workflows/spec-dispatch.yml .github/workflows/scenarios-rerun.yml
+### The three laws of test harnesses
 
-
-ls ../my-project-scenarios/workflows/run.yml ../my-project-scenarios/workflows/generate.yml \
-../my-project-scenarios/prompts/scenario-agent.md ../my-project-scenarios/example-scenario.test.ts
-```
+These are baked into every spec template, discovered through real autonomous development:
 
-
+1. **Tests must fail first.** If your test harness doesn't have failing tests, the agent will write tests that pass trivially -- testing the library instead of your function.
+2. **Tests must run against your actual function.** Not a reimplementation, not a mock, not the wrapped library. The test calls your code.
+3. **Tests must detect individual changes.** You need fast smoke tests (seconds, not minutes) so you know if a single change helped or hurt.
 
````
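The first law ("tests must fail first") can be enforced with a tiny guard before implementation starts. A sketch with a stubbed test command — `run_new_tests` is a placeholder for your real suite, not Joycraft functionality:

```shell
# Guard: the new tests must FAIL before any implementation exists.
# `run_new_tests` is a stub standing in for your real command,
# e.g. a vitest run scoped to the new spec's test file.
run_new_tests() { false; }

if run_new_tests; then
  echo "suspicious: new tests already pass -- check what they actually exercise"
else
  echo "good: tests fail first"
fi
# prints: good: tests fail first
```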
````diff
-
-2. Open a PR with a small change -- when CI passes, scenarios run against the PR
-3. Watch for the scenario test results posted as a PR comment
+### Lockdown mode
 
-
+For complex stacks or long autonomous sessions, `/joycraft-lockdown` generates constrained execution boundaries:
 
-
+- **NEVER rules** for editing test files (read-only)
+- **Deny patterns** for package installs, network access, log reading
+- **Permission mode recommendations** (see below)
 
-
-- **Autofix:** ~$0.50 per attempt, worst case ~$1.50 per PR (3 iterations)
-- **Scenario generation:** ~$0.20 per spec dispatch
-- **Solo dev with ~10 PRs/month:** ~$5-10/month for the full loop
+This prevents the agent from going rogue -- downloading SDKs, pinging random IPs, clearing test files, or filling context with log output. Lockdown is optional and most useful for complex tech stacks (hardware, firmware, multi-device workflows).
 
-
+### Independent verification
 
-
+`/joycraft-verify` spawns a separate subagent with a clean context window to independently check your implementation against the spec. The verifier reads the acceptance criteria, runs the tests, and produces a structured pass/fail verdict. It cannot edit any code -- read-only plus test execution only.
 
-
+This follows [Anthropic's finding](https://www.anthropic.com/engineering/harness-design-long-running-apps) that "agents reliably skew positive when grading their own work" and that separating the worker from the evaluator consistently outperforms self-evaluation.
 
-
+## Claude Code Permission Modes
 
-
+You do **not** need `--dangerously-skip-permissions` for autonomous development. Claude Code offers safer alternatives that Joycraft recommends based on your use case:
 
-
-
-
-
+| Your situation | Permission mode | What it does |
+|---|---|---|
+| Interactive development | `acceptEdits` | Auto-approves file edits, prompts for shell commands |
+| Long autonomous session | `auto` | Safety classifier reviews each action, blocks scope escalation |
+| Autonomous spec execution | `dontAsk` + allowlist | Only pre-approved commands run, everything else denied |
+| Planning and exploration | `plan` | Claude can only read and propose, no edits allowed |
 
-
+### When to use what
 
-
+**`--permission-mode auto`** is the best default for most developers. A background classifier (Sonnet) reviews each action before execution, blocking things like: downloading unexpected packages, accessing unfamiliar infrastructure, or escalating beyond the task scope. It adds minimal latency and catches the exact problems that make autonomous development scary.
 
-
+**`--permission-mode dontAsk`** is for maximum control. You define an explicit allowlist of what the agent can do (write code, run specific test commands) and everything else is silently denied. No prompts, no surprises. This is what Joycraft's `/joycraft-lockdown` skill helps you configure.
 
--
-- **Autonomous** commits, pushes to feature branches, and opens PRs without asking. Good for spec-driven development where you want full send.
+**`--dangerously-skip-permissions`** should only be used in isolated containers or VMs with no internet access. It bypasses all safety checks and cannot be overridden by subagents.
 
-
+Both `/joycraft-lockdown` and `/joycraft-tune` now recommend the appropriate permission mode based on your project's risk profile.
 
 ## How It Works with AI Agents
 
````
````diff
@@ -581,6 +298,10 @@ Joycraft's approach is synthesized from several sources:
 
 **Behavioral boundaries.** CLAUDE.md isn't a suggestion box, it's a contract. Joycraft installs a three-tier boundary framework (Always / Ask First / Never) that prevents the most common AI development failures: overwriting user files, skipping tests, pushing without approval, hardcoding secrets. This is [Addy Osmani's](https://addyosmani.com/blog/good-spec/) "boundaries" principle made concrete.
 
+**Test-first as the mechanism to autonomy.** Tests aren't a nice-to-have, they're the bridge between "agent writes code" and "agent writes *correct* code." Every spec includes a Test Plan mapping acceptance criteria to tests, and the agent must write failing tests before implementing. This follows the three laws of test harnesses discovered through real autonomous development, and aligns with [Anthropic's harness design research](https://www.anthropic.com/engineering/harness-design-long-running-apps) which found that agents reliably skip verification unless explicitly constrained.
+
+**Separation of evaluation from implementation.** [Anthropic's research](https://www.anthropic.com/engineering/harness-design-long-running-apps) found that "agents reliably skew positive when grading their own work." Joycraft addresses this at two levels: `/joycraft-verify` spawns a separate subagent with clean context to independently verify against the spec, and Level 5's holdout scenarios provide external evaluation the implementation agent can never see.
+
 **Knowledge capture over session notes.** Most session notes are never re-read. Joycraft's `/joycraft-session-end` skill captures only *discoveries*: assumptions that were wrong, APIs that behaved unexpectedly, decisions made during implementation that aren't in the spec. If nothing surprising happened, you capture nothing. This keeps the signal-to-noise ratio high.
 
 **External holdout scenarios.** [StrongDM's Software Factory](https://factory.strongdm.ai/) proved that AI agents will [actively game visible test suites](https://palisaderesearch.org/blog/specification-gaming). Their solution: scenarios that live *outside* the codebase, invisible to the agent during development. Like a holdout set in ML, this prevents overfitting. Joycraft now implements this directly. `init-autofix` sets up the holdout wall, the scenario agent, and the GitHub App integration.
````
|