joycraft 0.5.12 → 0.5.14

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -8,27 +8,13 @@
 
  ## What is Joycraft?
 
- Joycraft is a CLI tool and [Claude Code](https://docs.anthropic.com/en/docs/claude-code) plugin that upgrades your AI development workflow. It installs skills, behavioral boundaries, templates, and documentation structure into any project, taking you from unstructured prompting to autonomous spec-driven development.
-
- If you've been using Claude Code (or any AI coding tool) and your workflow looks like this:
-
- > Prompt → wait → read output → "no, not that" → re-prompt → fix hallucination → re-prompt → manually fix → "ok close enough" → commit
-
- ...then Joycraft is for you.
-
- This project started as a personal exploration by [@maksutovic](https://github.com/maksutovic). I was working across multiple client projects, spending more time wrestling with prompts than building software. I knew Claude Code was capable of extraordinary work, but my *process* was holding it back. I was vibe coding - and vibe coding doesn't scale.
-
- The spark was [Nate B Jones' video on the 5 Levels of Vibe Coding](https://www.youtube.com/watch?v=bDcgHzCBgmQ). It mapped out a progression I hadn't seen articulated before - from "spicy autocomplete" to fully autonomous development - and lit my brain up to the potential of what Claude Code could do with the right harness around it. Joycraft is the result of that exploration: a tool that encodes the patterns, boundaries, and workflows that make AI-assisted development actually deterministic.
+ Joycraft is a CLI tool that installs structured development skills into [Claude Code](https://docs.anthropic.com/en/docs/claude-code) and [OpenAI Codex](https://openai.com/codex), along with behavioral boundaries, templates, and documentation structure. It takes any project from unstructured prompting to autonomous spec-driven development.
 
  ### The core idea
 
- Joycraft is simple. It's a set of **skills** (slash commands for Claude Code) and **instructions** (CLAUDE.md boundaries) that guide you and your agent through a structured development process:
-
- - **Levels 1-4:** Skills like `/joycraft-tune`, `/joycraft-new-feature`, and `/joycraft-interview` replace unstructured prompting with spec-driven development. You interview, you write specs, the agent executes. No back-and-forth.
+ - **Levels 1-4:** Skills like `/joycraft-tune`, `/joycraft-new-feature`, and `/joycraft-interview` replace unstructured prompting with spec-driven development. You interview, you write specs, the agent executes.
  - **Level 5:** The `/joycraft-implement-level5` skill sets up the autonomous loop where specs go in and validated software comes out, with holdout scenario testing that prevents the agent from gaming its own tests.
 
- StrongDM calls their Level 5 fully autonomous loop a "Dark Factory" - which, albeit a cool name, the world has so much darkness in it right now. I wanted a name that extolled more of what I believe tools like this can provide: joy and craftsmanship. Hence "Joycraft."
-
  ### What are the levels?
 
  [Dan Shapiro's 5 Levels of Vibe Coding](https://www.danshapiro.com/blog/2026/01/the-five-levels-from-spicy-autocomplete-to-the-software-factory/) provides the framework:
@@ -66,24 +52,11 @@ Joycraft auto-detects your tech stack and creates:
 
  - **CLAUDE.md** with behavioral boundaries (Always / Ask First / Never) and correct build/test/lint commands
  - **AGENTS.md** for Codex compatibility
- - **Claude Code skills** installed to `.claude/skills/` and **Codex skills** installed to `.agents/skills/`:
- - `/joycraft-tune` Assess your harness, apply upgrades, see your path to Level 5
- - `/joycraft-new-feature` Interview → Feature Brief → Atomic Specs
- - `/joycraft-interview` Lightweight brainstorm. Yap about ideas, get a structured summary
- - `/joycraft-research` Objective codebase research — subagent sees only questions, never the brief
- - `/joycraft-design` Design discussion checkpoint — ~200-line artifact for human review before decompose
- - `/joycraft-decompose` Break a brief into small, testable specs
- - `/joycraft-add-fact` Capture project knowledge on the fly -- routes to the right context doc
- - `/joycraft-lockdown` Generate constrained execution boundaries (read-only tests, deny patterns)
- - `/joycraft-verify` Spawn a separate subagent to independently verify implementation against spec
- - `/joycraft-session-end` Capture discoveries, verify, commit, push
- - `/joycraft-implement-level5` Set up Level 5 (autofix loop, holdout scenarios, scenario evolution)
+ - **11 skills** installed to `.claude/skills/` (Claude Code) and `.agents/skills/` (Codex); see [Which skill do I need?](#which-skill-do-i-need) below
  - **docs/** structure: `briefs/`, `specs/`, `discoveries/`, `contracts/`, `decisions/`, `context/`
  - **Context documents** in `docs/context/`: production map, dangerous assumptions, decision log, institutional knowledge, and troubleshooting guide
  - **Templates** including atomic spec, feature brief, implementation plan, boundary framework, and workflow templates for scenario generation and autofix loops
 
- Once you reach Level 4, you can set up the autonomous loop with `/joycraft-implement-level5`. See [Level 5: The Autonomous Loop](#level-5-the-autonomous-loop) below.
-
  ### Supported Stacks
 
  Node.js (npm/pnpm/yarn/bun), Python (poetry/pip/uv), Rust, Go, Swift, and generic (Makefile/Dockerfile).
@@ -92,106 +65,71 @@ Frameworks auto-detected: Next.js, FastAPI, Django, Flask, Actix, Axum, Express,
 
  ## The Workflow
 
- After init, open Claude Code and use the installed skills:
+ ### Which skill do I need?
 
- ```
- /joycraft-tune # Assess your harness, apply upgrades, see path to Level 5
- /joycraft-interview # Brainstorm freely, yap about ideas, get a structured summary
- /joycraft-new-feature # Interview → Feature Brief → Atomic Specs → ready to execute
- /joycraft-research # Objective codebase research (subagent never sees the brief)
- /joycraft-design # Design discussion patterns, decisions, open questions for review
- /joycraft-decompose # Break any feature into small, independent specs
- /joycraft-add-fact # Capture a fact mid-session -- auto-routes to the right context doc
- /joycraft-lockdown # Generate constrained execution boundaries for autonomous sessions
- /joycraft-verify # Independent verification -- spawns a subagent to check your work
- /joycraft-session-end # Wrap up: discoveries, verification, commit, push
- /joycraft-implement-level5 # Set up Level 5 (autofix, holdout scenarios, evolution)
- ```
+ | You want to... | Use | What happens |
+ |---|---|---|
+ | Brainstorm an idea before committing to building it | `/joycraft-interview` | Free-form conversation → structured draft brief |
+ | Build a new feature from scratch | `/joycraft-new-feature` | Guided interview → Feature Brief → Atomic Specs |
+ | Understand existing code before building on it | `/joycraft-research` | Objective codebase research: facts only, no opinions |
+ | Align on approach before writing code | `/joycraft-design` | Design discussion: ~200-line artifact for human review |
+ | Break a feature into small, independent tasks | `/joycraft-decompose` | Feature Brief → testable Atomic Specs |
+ | Fix a bug with a structured workflow | `/joycraft-bugfix` | Reproduce → isolate → fix → verify loop |
+ | Run specs autonomously without hand-holding | `/joycraft-implement-level5` | Autofix loop + holdout scenario testing |
+ | Verify an implementation independently | `/joycraft-verify` | Read-only subagent checks work against the spec |
 
  The core loop:
 
- ```
- Interview → Brief → Research → Design → Decompose → Specs → Implement → Verify
- (optional) (optional)
- ```
-
- ## The Interview: Why It Matters
-
- The single biggest upgrade Joycraft makes to your workflow is replacing the prompt-iterate-fix cycle with a **structured interview**.
-
- Here's the problem with how most of us use AI coding tools: we open a session and start typing. "Build me a notification system." The agent starts writing code immediately. It makes assumptions about your data model, your UI framework, your error handling strategy, your deployment target. You catch some of these mid-flight, correct them, the agent adjusts, introduces new assumptions. Three hours later you have something that *kind of* works but is built on a foundation of guesses.
-
- Joycraft flips this. Before the agent writes a single line of code, you have a conversation about *what you're building and why*.
-
- ### Two interview modes
-
- **`/joycraft-interview`** is the lightweight brainstorm. You yap about an idea, the agent asks clarifying questions, and you get a structured summary saved to `docs/briefs/`. Good for early-stage thinking when you're not ready to commit to building anything yet. No pressure, no specs, just organized thought.
-
- **`/joycraft-new-feature`** is the full workflow. This is the structured interview that produces a **Feature Brief** (the what and why) and then decomposes it into **Atomic Specs** (small, testable, independently executable units of work). Each spec is self-contained. An agent in a fresh session can pick it up and execute without reading anything else.
-
- ### Why this works
-
- The insight comes from [Boris Cherny](https://www.lennysnewsletter.com/p/head-of-claude-code-what-happens) (Head of Claude Code at Anthropic): interview in one session, write the spec, then execute in a *fresh session* with clean context. The interview captures your intent. The spec is the contract. The execution session has only the spec. No baggage from the conversation, no accumulated misunderstandings, no context window full of abandoned approaches.
-
- This is what separates Level 2 (back-and-forth prompting) from Level 4 (spec-driven development). You stop being a typist correcting an agent's guesses and start being a PM defining what needs to be built.
-
  ```mermaid
  flowchart LR
- A["/joycraft-interview<br/>(brainstorm)"] --> B["Draft Brief<br/>docs/briefs/"]
- B --> C["/joycraft-new-feature<br/>(structured interview)"]
- C --> D["Feature Brief<br/>(what & why)"]
- D --> R["/joycraft-research<br/>(objective facts)"]
- R --> DS["/joycraft-design<br/>(human checkpoint)"]
- DS --> E["/joycraft-decompose"]
- E --> F["Atomic Specs<br/>docs/specs/"]
- F --> G["Fresh Session<br/>Execute each spec"]
- G --> H["/joycraft-session-end<br/>(discoveries + commit)"]
-
- style A fill:#e8f4fd,stroke:#369
- style C fill:#e8f4fd,stroke:#369
- style R fill:#f0e8fd,stroke:#639
- style DS fill:#f0e8fd,stroke:#639
- style F fill:#cfc,stroke:#393
- style G fill:#ffd,stroke:#993
+ A[Interview] --> B[Feature Brief]
+ B --> C{Complex?}
+ C -- "Simple/clear scope" --> F[Decompose]
+ C -- "Complex/unfamiliar" --> D[Research]
+ D --> E[Design]
+ E --> F
+ F --> G[Atomic Specs]
+ G --> H[Execute]
+ H --> I[Session End]
+
+ style A fill:#e8f4fd,stroke:#4a90d9
+ style B fill:#e8f4fd,stroke:#4a90d9
+ style C fill:#fef3cd,stroke:#d4a843
+ style D fill:#f0e8fd,stroke:#9b72cf
+ style E fill:#f0e8fd,stroke:#9b72cf
+ style F fill:#e8f4fd,stroke:#4a90d9
+ style G fill:#e8f4fd,stroke:#4a90d9
+ style H fill:#d4edda,stroke:#5a9a6e
+ style I fill:#d4edda,stroke:#5a9a6e
  ```
 
- ## Research Isolation & Design Checkpoints
-
- These two skills were inspired by [Dex Horthy](https://x.com/dexhorthy)'s work at [HumanLayer](https://humanlayer.dev) on what went wrong with the Research-Plan-Implement (RPI) methodology and the evolution to [CRISPY](https://humanlayer.dev/blog) (Context, Research, Investigate, Structure, Plan, Yield).
+ ### The Interview
 
- ### The problem with "research the codebase"
+ The single biggest upgrade Joycraft makes is replacing prompt-iterate-fix with a structured interview. [Read the full guide →](docs/guides/interview-workflow.md)
 
- When you tell an agent "research how endpoints work — I'm going to build a new one," the research comes back contaminated with opinions about how to build the new endpoint. Good research is pure facts. The moment the researcher knows the intent, it editorializes.
+ ### Research Isolation & Design Checkpoints
 
- **`/joycraft-research`** fixes this with context isolation: one context window generates research questions from the brief, then a separate subagent researches the codebase using *only those questions* — it never sees the brief. The output is a research document in `docs/research/` that contains file paths, function signatures, data flows, and patterns. No recommendations. No opinions. Just compressed truth.
+ Objective research via context isolation and 200-line design checkpoints for human review before decomposition. [Read the full guide →](docs/guides/research-and-design.md)
 
- This is the same "query planning" technique Dex describes: separate the intent from the investigation, like a database separates query planning from execution.
+ ### Test-First Development
 
- ### The 200-line checkpoint
+ Tests are the mechanism to autonomy — every spec includes a test plan, and the agent writes failing tests before implementing. [Read the full guide →](docs/guides/test-first-development.md)
 
- HumanLayer found that engineers were reviewing 1,000-line plans — which is the same effort as reviewing 1,000 lines of code, and the plans often diverged from what was actually implemented. The leverage was terrible.
+ ### Tuning: Risk Interview & Git Autonomy
 
- **`/joycraft-design`** produces a ~200-line design discussion artifact instead. It contains five sections: current state, desired end state, patterns to follow, resolved design decisions, and open questions with concrete options. This is where you catch "that's not how we do atomic SQL updates — go find the pattern in `/services/billing`" *before* 2,000 lines of code follow the wrong pattern.
+ A 2-3 minute risk interview generates safety boundaries, and you choose your git autonomy level. [Read the full guide →](docs/guides/tuning.md)
 
- [Matt Pocock](https://x.com/mattpocockuk) calls this the "design concept" — the shared understanding between you and the agent that exists separately from the code. Joycraft materializes it as a markdown document and forces a human checkpoint: the skill will not proceed to decomposition until you've reviewed and approved.
+ ### Level 5: The Autonomous Loop
 
- Both steps are optional. You can skip straight from brief to decompose for simple features. But for anything complex enough to get wrong, the 15 minutes of human review on a 200-line document saves hours of rework on code that followed the wrong patterns.
+ Level 5 is where specs go in and validated software comes out: four GitHub Actions workflows, a separate scenarios repo, and two AI agents that can never see each other's work. [Read the full guide →](docs/guides/level-5-autonomy.md)
 
- ### Instruction budget discipline
+ ### Permission Modes
 
- Every Joycraft skill now includes an `instructions` count in its frontmatter. No skill exceeds 40 instructions. This is based on [research](https://arxiv.org/pdf/2507.11538) showing that frontier LLMs can reliably follow ~150-200 instructions — but your skill shares that budget with the system prompt, CLAUDE.md, tools, and MCP servers. A skill with 85 instructions (as Joycraft's `/joycraft-tune` had before this refactor) is competing for attention with everything else in the context window. Smaller, focused skills with clear handoffs produce more reliable results than monolithic mega-prompts.
+ You do **not** need `--dangerously-skip-permissions` for autonomous development. Claude Code offers safer alternatives. [Read the full guide →](docs/guides/permission-modes.md)
 
- ### What a good spec looks like
+ ### How It Works with AI Agents
 
- An atomic spec produced by `/joycraft-decompose` has:
-
- - **What:** One paragraph. A developer with zero context understands the change in 15 seconds.
- - **Why:** One sentence. What breaks or is missing without this?
- - **Acceptance criteria:** Checkboxes. Testable. No ambiguity.
- - **Affected files:** Exact paths, what changes in each.
- - **Edge cases:** Table of scenarios and expected behavior.
-
- The agent doesn't guess. It reads the spec and executes. If something's unclear, the spec is wrong. Fix the spec, not the conversation.
+ Claude Code reads CLAUDE.md, Codex reads AGENTS.md — both get the same guardrails and workflow. [Read the full guide →](docs/guides/agent-compatibility.md)
 
  ## Upgrade
 
@@ -205,143 +143,9 @@ Joycraft tracks what it installed vs. what you've customized. Unmodified files u
 
  > **Note:** If you're upgrading from an early version, deprecated skill directories (e.g., `/joy`, `/joysmith`, `/tune`) are automatically removed during upgrade.
 
- ## Level 5: The Autonomous Loop
-
- > **A note on complexity:** Setting up Level 5 does have some moving parts and, depending on the complexity of your stack (software vs. hardware, monorepo vs. single app, etc.), this will require a good amount of prompting and trial-and-error to get right. I've done my best to make this as painless as possible, but just note - this is not a one-shot-prompt-done-in-5-minutes kind of thing. For small projects and simple stacks it will be easy, but any level of complexity is going to take some iteration, so plan ahead. Full step-by-step guides along with a video coming soon.
-
- Level 5 is where specs go in and validated software comes out — four GitHub Actions workflows, a separate scenarios repo, and two AI agents that can never see each other's work. Run `/joycraft-implement-level5` for guided setup, or `npx joycraft init-autofix` via CLI.
-
- See the full **[Level 5 Autonomy Guide](docs/guides/level-5-autonomy.md)** for architecture diagrams, setup steps, workflow details, and cost estimates.
-
- ## Tuning: Risk Interview & Git Autonomy
-
- When `/joycraft-tune` runs for the first time, it does two things:
-
- ### Risk interview
-
- 3-5 targeted questions about what's dangerous in your project (production databases, live APIs, secrets, files that should be off-limits). From your answers, Joycraft generates:
-
- - **NEVER rules** for CLAUDE.md (e.g., "NEVER connect to production DB")
- - **Deny patterns** for `.claude/settings.json` (blocks dangerous bash commands)
- - **`docs/context/production-map.md`** documenting what's real vs. safe to touch
- - **`docs/context/dangerous-assumptions.md`** documenting "Agent might assume X, but actually Y"
-
- This takes 2-3 minutes and dramatically reduces the chance of your agent doing something catastrophic.
-
- ### Git autonomy
-
- One question: **how autonomous should git be?**
-
- - **Cautious** (default) commits freely but asks before pushing or opening PRs. Good for learning the workflow.
- - **Autonomous** commits, pushes to feature branches, and opens PRs without asking. Good for spec-driven development where you want full send.
-
- Either way, Joycraft generates explicit git boundaries in your CLAUDE.md: commit message format (`verb: message`), specific file staging (no `git add -A`), no secrets in commits, no force-pushing.
-
- ## Test-First Development
-
- Joycraft enforces a test-first workflow because tests are the mechanism to autonomy. Without tests, your agent implements 9 specs and you have to manually verify each one. With tests, the agent knows when it's done and you can trust the output.
-
- ### How it works
-
- When you run `/joycraft-new-feature`, the interview now includes test-focused questions: what test types your project uses, how fast your tests need to run for iteration, and whether you want lockdown mode. Every atomic spec generated by `/joycraft-decompose` includes a **Test Plan** that maps each acceptance criterion to at least one test.
-
- The execution order is enforced:
-
- 1. **Write failing tests first** -- the agent writes tests from the spec's Test Plan
- 2. **Run them and confirm they fail** -- if they pass immediately, something is wrong (you're testing the wrong thing)
- 3. **Implement until tests pass** -- the tests are the contract
-
- ### The three laws of test harnesses
-
- These are baked into every spec template, discovered through real autonomous development:
-
- 1. **Tests must fail first.** If your test harness doesn't have failing tests, the agent will write tests that pass trivially -- testing the library instead of your function.
- 2. **Tests must run against your actual function.** Not a reimplementation, not a mock, not the wrapped library. The test calls your code.
- 3. **Tests must detect individual changes.** You need fast smoke tests (seconds, not minutes) so you know if a single change helped or hurt.
-
- ### Lockdown mode
-
- For complex stacks or long autonomous sessions, `/joycraft-lockdown` generates constrained execution boundaries:
-
- - **NEVER rules** for editing test files (read-only)
- - **Deny patterns** for package installs, network access, log reading
- - **Permission mode recommendations** (see below)
-
- This prevents the agent from going rogue -- downloading SDKs, pinging random IPs, clearing test files, or filling context with log output. Lockdown is optional and most useful for complex tech stacks (hardware, firmware, multi-device workflows).
-
- ### Independent verification
-
- `/joycraft-verify` spawns a separate subagent with a clean context window to independently check your implementation against the spec. The verifier reads the acceptance criteria, runs the tests, and produces a structured pass/fail verdict. It cannot edit any code -- read-only plus test execution only.
-
- This follows [Anthropic's finding](https://www.anthropic.com/engineering/harness-design-long-running-apps) that "agents reliably skew positive when grading their own work" and that separating the worker from the evaluator consistently outperforms self-evaluation.
-
- ## Claude Code Permission Modes
-
- You do **not** need `--dangerously-skip-permissions` for autonomous development. Claude Code offers safer alternatives that Joycraft recommends based on your use case:
-
- | Your situation | Permission mode | What it does |
- |---|---|---|
- | Interactive development | `acceptEdits` | Auto-approves file edits, prompts for shell commands |
- | Long autonomous session | `auto` | Safety classifier reviews each action, blocks scope escalation |
- | Autonomous spec execution | `dontAsk` + allowlist | Only pre-approved commands run, everything else denied |
- | Planning and exploration | `plan` | Claude can only read and propose, no edits allowed |
-
- ### When to use what
-
- **`--permission-mode auto`** is the best default for most developers. A background classifier (Sonnet) reviews each action before execution, blocking things like: downloading unexpected packages, accessing unfamiliar infrastructure, or escalating beyond the task scope. It adds minimal latency and catches the exact problems that make autonomous development scary.
-
- **`--permission-mode dontAsk`** is for maximum control. You define an explicit allowlist of what the agent can do (write code, run specific test commands) and everything else is silently denied. No prompts, no surprises. This is what Joycraft's `/joycraft-lockdown` skill helps you configure.
-
- **`--dangerously-skip-permissions`** should only be used in isolated containers or VMs with no internet access. It bypasses all safety checks and cannot be overridden by subagents.
-
- Both `/joycraft-lockdown` and `/joycraft-tune` now recommend the appropriate permission mode based on your project's risk profile.
-
- ## How It Works with AI Agents
-
- **Claude Code** reads `CLAUDE.md` automatically and discovers skills in `.claude/skills/`. The behavioral boundaries guide every action. The skills provide structured workflows accessible via `/slash-commands`.
-
- **Codex** reads `AGENTS.md`, which provides the same boundaries and commands in a concise format optimized for smaller context windows.
-
- Both agents get the same guardrails and the same development workflow. Joycraft doesn't write your project code. It builds the *system* that makes AI-assisted development reliable.
-
- ### Team Sharing
-
- Skills live in `.claude/skills/` which is **not** gitignored by default. Commit it so your whole team gets the workflow:
-
- ```bash
- git add .claude/skills/ docs/
- git commit -m "add: Joycraft harness"
- ```
-
- Joycraft also installs a session-start hook that checks for updates. If your templates are outdated, you'll see a one-line nudge when Claude Code starts.
-
  ## Why This Exists
 
- Most developers using AI tools are at Level 2. They prompt, they iterate, they feel productive. But [METR's randomized control trial](https://metr.org/) found experienced developers using AI tools actually completed tasks **19% slower**, while *believing* they were 24% faster. The problem isn't the tools. It's the absence of structure around them.
-
- The teams seeing transformative results ([StrongDM](https://factory.strongdm.ai/) shipping an entire product with 3 engineers, [Spotify Honk](https://www.danshapiro.com/blog/2026/01/the-five-levels-from-spicy-autocomplete-to-the-software-factory/) merging 1,000 PRs every 10 days, Anthropic generating effectively 100% of their code with AI) all share the same pattern: **they don't prompt AI to write code. They write specs and let AI execute them.**
-
- Joycraft packages that pattern into something anyone can install.
-
- ### The methodology
-
- Joycraft's approach is synthesized from several sources:
-
- **Spec-driven development.** Instead of prompting AI in conversation, you write structured specifications. Feature Briefs capture the *what* and *why*, then Atomic Specs break work into small, testable, independently executable units. Each spec is self-contained: an agent can pick it up without reading anything else. This follows [Addy Osmani's](https://addyosmani.com/blog/good-spec/) principles for AI-consumable specs and [GitHub's Spec Kit](https://github.blog/ai-and-ml/generative-ai/spec-driven-development-with-ai-get-started-with-a-new-open-source-toolkit/) 4-phase process (Specify → Plan → Tasks → Implement).
-
- **Context isolation.** [Boris Cherny](https://www.lennysnewsletter.com/p/head-of-claude-code-what-happens) (Head of Claude Code at Anthropic) recommends: interview in one session, write the spec, then execute in a *fresh session* with clean context. [Dex Horthy](https://humanlayer.dev) at HumanLayer took this further: even *research* should be isolated from intent — the researching agent should never see the ticket, only objective questions derived from it. Joycraft's `/joycraft-research` → `/joycraft-design` → `/joycraft-decompose` pipeline enforces this at every stage: the interview captures intent, research gathers objective facts, design aligns human and agent on approach, and the execution session has only the spec.
-
- **Behavioral boundaries.** CLAUDE.md isn't a suggestion box, it's a contract. Joycraft installs a three-tier boundary framework (Always / Ask First / Never) that prevents the most common AI development failures: overwriting user files, skipping tests, pushing without approval, hardcoding secrets. This is [Addy Osmani's](https://addyosmani.com/blog/good-spec/) "boundaries" principle made concrete.
-
- **Test-first as the mechanism to autonomy.** Tests aren't a nice-to-have, they're the bridge between "agent writes code" and "agent writes *correct* code." Every spec includes a Test Plan mapping acceptance criteria to tests, and the agent must write failing tests before implementing. This follows the three laws of test harnesses discovered through real autonomous development, and aligns with [Anthropic's harness design research](https://www.anthropic.com/engineering/harness-design-long-running-apps) which found that agents reliably skip verification unless explicitly constrained.
-
- **Separation of evaluation from implementation.** [Anthropic's research](https://www.anthropic.com/engineering/harness-design-long-running-apps) found that "agents reliably skew positive when grading their own work." Joycraft addresses this at two levels: `/joycraft-verify` spawns a separate subagent with clean context to independently verify against the spec, and Level 5's holdout scenarios provide external evaluation the implementation agent can never see.
-
- **Knowledge capture over session notes.** Most session notes are never re-read. Joycraft's `/joycraft-session-end` skill captures only *discoveries*: assumptions that were wrong, APIs that behaved unexpectedly, decisions made during implementation that aren't in the spec. If nothing surprising happened, you capture nothing. This keeps the signal-to-noise ratio high.
-
- **External holdout scenarios.** [StrongDM's Software Factory](https://factory.strongdm.ai/) proved that AI agents will [actively game visible test suites](https://palisaderesearch.org/blog/specification-gaming). Their solution: scenarios that live *outside* the codebase, invisible to the agent during development. Like a holdout set in ML, this prevents overfitting. Joycraft now implements this directly. `init-autofix` sets up the holdout wall, the scenario agent, and the GitHub App integration.
-
- **The 5-level framework.** [Dan Shapiro's levels](https://www.danshapiro.com/blog/2026/01/the-five-levels-from-spicy-autocomplete-to-the-software-factory/) give you a map. Level 2 (Junior Developer) is where most teams plateau. Level 3 (Developer as Manager) means your life is diffs. Level 4 (Developer as PM) means you write specs, not code. Level 5 (Dark Factory) means specs in, software out. Joycraft's `/joycraft-tune` assessment tells you where you are and what to do next.
+ Most developers using AI tools are at Level 2 and [METR's research](https://metr.org/) found they're actually slower, not faster. Joycraft packages the patterns used by teams seeing transformative results into something anyone can install. [Read the full methodology →](docs/guides/methodology.md)
 
  ## Standing on the Shoulders of Giants
347
151