e2e-engineering 1.0.0

Files changed (32)
  1. package/README.md +904 -0
  2. package/bin/install.js +175 -0
  3. package/dist/agents-md/AGENTS.md +108 -0
  4. package/dist/cursor/.cursor/rules/e2e-engineering.mdc +19 -0
  5. package/dist/marketplace/.claude-plugin/marketplace.json +19 -0
  6. package/dist/marketplace/README.md +40 -0
  7. package/dist/marketplace/plugins/e2e-engineering/.claude-plugin/plugin.json +15 -0
  8. package/dist/marketplace/plugins/e2e-engineering/skills/e2e-engineering/SKILL.md +114 -0
  9. package/dist/marketplace/plugins/e2e-engineering/skills/e2e-engineering/adopt.md +25 -0
  10. package/dist/marketplace/plugins/e2e-engineering/skills/e2e-engineering/constitution.md +46 -0
  11. package/dist/marketplace/plugins/e2e-engineering/skills/e2e-engineering/cross/context-checkpoint.md +23 -0
  12. package/dist/marketplace/plugins/e2e-engineering/skills/e2e-engineering/cross/phase-transition.md +22 -0
  13. package/dist/marketplace/plugins/e2e-engineering/skills/e2e-engineering/impl/e2e-loop.md +23 -0
  14. package/dist/marketplace/plugins/e2e-engineering/skills/e2e-engineering/impl/grill-with-docs.md +18 -0
  15. package/dist/marketplace/plugins/e2e-engineering/skills/e2e-engineering/impl/systematic-debugging.md +20 -0
  16. package/dist/marketplace/plugins/e2e-engineering/skills/e2e-engineering/impl/tdd.md +39 -0
  17. package/dist/marketplace/plugins/e2e-engineering/skills/e2e-engineering/impl/to-issues.md +35 -0
  18. package/dist/marketplace/plugins/e2e-engineering/skills/e2e-engineering/impl/triage.md +24 -0
  19. package/dist/marketplace/plugins/e2e-engineering/skills/e2e-engineering/impl/verification.md +17 -0
  20. package/dist/marketplace/plugins/e2e-engineering/skills/e2e-engineering/post-impl/human-qa.md +20 -0
  21. package/dist/marketplace/plugins/e2e-engineering/skills/e2e-engineering/post-impl/review.md +22 -0
  22. package/dist/marketplace/plugins/e2e-engineering/skills/e2e-engineering/pre-impl/grill-me.md +29 -0
  23. package/dist/marketplace/plugins/e2e-engineering/skills/e2e-engineering/pre-impl/map-codebase.md +29 -0
  24. package/dist/marketplace/plugins/e2e-engineering/skills/e2e-engineering/pre-impl/prototype.md +24 -0
  25. package/dist/marketplace/plugins/e2e-engineering/skills/e2e-engineering/pre-impl/research.md +30 -0
  26. package/dist/marketplace/plugins/e2e-engineering/skills/e2e-engineering/pre-impl/to-prd.md +22 -0
  27. package/dist/marketplace/plugins/e2e-engineering/skills/e2e-engineering/schemas/codebase-map.md +31 -0
  28. package/dist/marketplace/plugins/e2e-engineering/skills/e2e-engineering/schemas/prd.json.md +35 -0
  29. package/dist/marketplace/plugins/e2e-engineering/skills/e2e-engineering/schemas/progress.txt.md +27 -0
  30. package/package.json +38 -0
  31. package/scripts/build-dist.js +65 -0
  32. package/scripts/publish-marketplace.js +121 -0
package/README.md ADDED
@@ -0,0 +1,904 @@
1
+ # e2e-engineering
2
+
3
+ Master engineering orchestrator — drives a Task from idea to passing E2E across three phases: **pre-implementation** (idea → approved PRD), **implementation** (vertical-slice TDD loop → green tests), **post-implementation** (review + human QA). Five hard gates, a `depends_on` slice DAG, and `.e2e-engineering/` state files keep the flow honest. The essay below ("AI-Engineering") is the philosophy this skill encodes.
4
+
5
+ ## Install
6
+
7
+ ```bash
8
+ npx e2e-engineering init # auto-detect the agent in this project
9
+ npx e2e-engineering init --target claude # full skill → .claude/skills/e2e-engineering/
10
+ npx e2e-engineering init --target cursor # .cursor/rules/e2e-engineering.mdc + AGENTS.md
11
+ npx e2e-engineering init --target codex # AGENTS.md
12
+ npx e2e-engineering init --target opencode # AGENTS.md
13
+ npx e2e-engineering init --target all # everything
14
+ ```
15
+
16
+ Flags: `--dest <dir>` · `--force` · `--dry-run`. Auto-detect: `.claude/` → claude · `.cursor/` → cursor · else → codex. An existing `AGENTS.md` is never clobbered (writes `AGENTS.e2e-engineering.md`).
17
+
18
+ In Claude Code: `/e2e-engineering`. Also triggers on "ship-it", "ship it", "implement feature X", "write e2e for X", "build this end to end", "run the full flow".
19
+
20
+ ## Fidelity
21
+
22
+ | | Claude Code | Codex / OpenCode / Cursor |
23
+ |---|---|---|
24
+ | Phases, 5 gates, DAG, TDD loop, state files, constitution | yes | yes |
25
+ | Parallel slice execution (worktree fan-out) | yes | **no — sequential** |
26
+ | Subagent dispatch, 65% auto-checkpoint, `/run`+`/verify` gate 5 | yes | manual |
27
+
28
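+ Portable targets run slices one at a time in dependency order, and subagent dispatch, checkpointing, and the gate 5 checks are manual there; the phases, gates, DAG, TDD loop, state files, and constitution are identical.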
29
+
30
+ ## Claude marketplace
31
+
32
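+ Plugin lives in `dist/marketplace/plugins/e2e-engineering/`, with the marketplace manifest at `dist/marketplace/.claude-plugin/marketplace.json`. Once pushed to a GitHub repo: `/plugin marketplace add <owner>/<repo>` then `/plugin install e2e-engineering@e2e-engineering`.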
33
+
34
+ MIT.
35
+
36
+ ---
37
+
38
+ # AI-Engineering
39
+
40
+ ## How to Build Software with AI Agents
41
+
42
+ ### Core principle
43
+
44
+ The main lesson across the files is simple: **AI does not remove the need for software engineering discipline. It makes discipline more important.**
45
+
46
+ The workflow is not “ask the AI to build everything and hope.” The workflow is:
47
+
48
+ ```text
49
+ Human clarifies the idea
50
+ Human and AI align on language and architecture
51
+ AI helps produce a PRD
52
+ PRD becomes small vertical issues
53
+ Agents implement with tests
54
+ Agents or humans review
55
+ Human performs QA
56
+ New issues are created
57
+ The loop repeats
58
+ ```
59
+
60
+ AI changes the tools, but not the fundamentals: clear requirements, good modular design, small tasks, feedback loops, testing, QA, and review still matter.
61
+
62
+ ---
63
+
64
+ ## Part 1 — Prepare your mindset: AI agents are not magic engineers
65
+
66
+ The files repeatedly describe agents as useful but constrained. They can write code quickly, explore repositories, implement issues, and even review each other’s work, but they do not naturally carry long-term memory across sessions. That means you need to give them **process, structure, documentation, and feedback loops**.
67
+
68
+ A helpful mental model is:
69
+
70
+ ```text
71
+ Human = strategic programmer
72
+ AI agent = tactical programmer
73
+ ```
74
+
75
+ The human decides what matters, what trade-offs are acceptable, what the system should become, and where quality boundaries belong. The agent executes tactical work inside that structure. The “de-slop” file makes this very clear: architecture improvement is not something you simply run AFK; it requires judgment from the programmer above the agent.
76
+
77
+ ---
78
+
79
+ ## Part 2 — Make your codebase ready for AI
80
+
81
+ ### 2.1 Why architecture matters more with AI
82
+
83
+ A messy codebase makes AI worse. If the file system does not reflect the mental model of the application, the AI enters the repo with no prior memory and sees only scattered files. It does not automatically know which modules belong together, which concepts are central, or where responsibilities live.
84
+
85
+ So before expecting good AI output, you need a codebase that is:
86
+
87
+ ```text
88
+ Easy to navigate
89
+ Easy to test
90
+ Organized around meaningful modules
91
+ Built around clear interfaces
92
+ Protected by feedback loops
93
+ ```
94
+
95
+ The files argue that the structure of the codebase is often more influential than prompts or instruction files. If the system is hard to change, the agent will struggle to change it safely.
96
+
97
+ ---
98
+
99
+ ### 2.2 Use deep modules
100
+
101
+ A central architectural idea is the **deep module**.
102
+
103
+ A module is a unit of application behavior: a group of components, functions, services, or capabilities. A module has an **interface**, which is what callers need to know to use it, and an **implementation**, which is the internal code that performs the work.
104
+
105
+ A **deep module** hides a lot of implementation behind a relatively simple interface. A **shallow module** exposes a complex interface while hiding very little implementation. Deep modules are better because they give the caller more capability with less surface area to understand.
106
+
107
+ A practical way to think about it:
108
+
109
+ ```text
110
+ Bad for AI:
111
+ Many tiny files
112
+ Unclear relationships
113
+ Hidden dependencies
114
+ Business rules spread everywhere
115
+
116
+ Good for AI:
117
+ Larger meaningful modules
118
+ Clear public interfaces
119
+ Tests around module boundaries
120
+ Implementation details hidden inside
121
+ ```
122
+
123
+ Deep modules give you two major benefits: **locality** and **leverage**. Locality means related changes and bugs concentrate in one place. Leverage means callers get more behavior per unit of interface they need to learn.
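As a rough illustration of locality and leverage, here is a minimal sketch of a deep module. The pricing domain, the names `createPricing`, `OrderLine`, and `PriceQuote`, and the discount rule are all hypothetical and do not come from the essay. Callers learn one function, and the rules behind it stay private.

```typescript
// Hypothetical example: every name and rule here is an illustration only.
interface OrderLine {
  unitPriceCents: number;
  quantity: number;
}

interface PriceQuote {
  subtotalCents: number;
  discountCents: number;
  taxCents: number;
  totalCents: number;
}

// Deep module: the interface is a single `quote` function. Discount and tax
// rules live inside the closure, so they can change without touching callers.
function createPricing(taxRate: number): { quote(lines: OrderLine[]): PriceQuote } {
  // Hidden rule (assumed for the example): 10% off subtotals of 100.00 or more.
  const discountFor = (subtotalCents: number): number =>
    subtotalCents >= 10000 ? Math.round(subtotalCents * 0.1) : 0;

  // Hidden rule: tax applies to the discounted amount.
  const taxFor = (taxableCents: number): number => Math.round(taxableCents * taxRate);

  return {
    quote(lines: OrderLine[]): PriceQuote {
      const subtotalCents = lines.reduce(
        (sum, line) => sum + line.unitPriceCents * line.quantity,
        0,
      );
      const discountCents = discountFor(subtotalCents);
      const taxCents = taxFor(subtotalCents - discountCents);
      return {
        subtotalCents,
        discountCents,
        taxCents,
        totalCents: subtotalCents - discountCents + taxCents,
      };
    },
  };
}
```

A bug in the discount rule can only live inside `createPricing` (locality), and a caller gets subtotal, discount, tax, and total from one call (leverage). Tests would target `quote`, not the private helpers.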
124
+
125
+ ---
126
+
127
+ ### 2.3 Define seams and adapters
128
+
129
+ A **seam** is the boundary where one module talks to another. It is often the best place to test. For example, if a service depends on time, you can define a clock interface and use a real clock in production but a fake clock in tests. The fake clock is an **adapter** that satisfies the same interface.
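The clock example can be sketched as follows. This is one possible shape, not a prescribed one: `Clock`, `FakeClock`, `systemClock`, and the `SessionToken` service are hypothetical names chosen for the illustration.

```typescript
// The seam: everything that needs the time depends on this interface.
interface Clock {
  now(): Date;
}

// Production adapter: reads the real system time.
const systemClock: Clock = { now: () => new Date() };

// Test adapter: starts at a fixed instant that the test moves forward by hand.
class FakeClock implements Clock {
  private current: Date;

  constructor(start: Date) {
    this.current = new Date(start.getTime());
  }

  now(): Date {
    return new Date(this.current.getTime());
  }

  advanceMs(ms: number): void {
    this.current = new Date(this.current.getTime() + ms);
  }
}

// Hypothetical service: a token that expires after a time-to-live.
class SessionToken {
  private readonly clock: Clock;
  private readonly ttlMs: number;
  private readonly issuedAt: Date;

  constructor(clock: Clock, ttlMs: number) {
    this.clock = clock;
    this.ttlMs = ttlMs;
    this.issuedAt = clock.now();
  }

  isExpired(): boolean {
    return this.clock.now().getTime() - this.issuedAt.getTime() >= this.ttlMs;
  }
}
```

Production code would construct `new SessionToken(systemClock, ttl)`, while a test passes a `FakeClock` and calls `advanceMs` to cross the expiry boundary without sleeping.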
130
+
131
+ This matters because agents need reliable places to test behavior. If your seams are unclear, the AI does not know where to write tests or how to isolate behavior.
132
+
133
+ A good module should therefore have:
134
+
135
+ ```text
136
+ A clear public interface
137
+ A small number of meaningful exported functions
138
+ Tests at the boundary
139
+ Adapters for external dependencies
140
+ Internal implementation hidden from callers
141
+ ```
142
+
143
+ ---
144
+
145
+ ### 2.4 Run architecture improvement regularly
146
+
147
+ The “de-slop” workflow suggests using an architecture-improvement process to identify shallow modules, duplicated concepts, poor locality, missing seams, and untested parallel implementations. In the example, the AI identifies places where frontend and backend logic could drift because two parallel implementations lack a shared seam.
148
+
149
+ The important part: do not let the AI blindly refactor the whole codebase. Let it **surface candidates**, then choose which refactor matters yourself.
150
+
151
+ A useful prompt pattern:
152
+
153
+ ```text
154
+ Explore this codebase for architecture-deepening opportunities.
155
+ Look for shallow modules, duplicated business rules, unclear seams,
156
+ poor locality, and places where tests cannot easily be written.
157
+ Do not implement yet. Give me candidates and explain the trade-offs.
158
+ ```
159
+
160
+ Then choose one candidate and ask the AI to propose:
161
+
162
+ ```text
163
+ The new module boundary
164
+ The public interface
165
+ The implementation location
166
+ The tests needed
167
+ The migration plan
168
+ The risks
169
+ ```
170
+
171
+ ---
172
+
173
+ ## Part 3 — Establish shared language before building
174
+
175
+ ### 3.1 Why “Grill Me” is useful
176
+
177
+ The original **grill-me** skill asks the AI to interview the user relentlessly until both sides reach a shared understanding. It walks down the design tree and resolves dependencies between decisions one by one.
178
+
179
+ The goal is not to move fast immediately. The goal is to prevent the AI from implementing the wrong thing quickly.
180
+
181
+ A simple version of the prompt:
182
+
183
+ ```text
184
+ Interview me relentlessly about every aspect of this plan
185
+ until we reach a shared understanding.
186
+ Walk down each branch of the design tree.
187
+ Resolve dependencies between decisions one by one.
188
+ If a question can be answered by exploring the codebase,
189
+ explore the codebase instead of asking me.
190
+ ```
191
+
192
+ Use this when an idea is still vague.
193
+
194
+ ---
195
+
196
+ ### 3.2 Prefer “Grill with Docs” when there is a codebase
197
+
198
+ The newer workflow replaces pure grill-me with **grill-with-docs** when a codebase exists. The problem with grill-me alone is that good terminology may emerge during the conversation but never get documented. The user then has to re-explain the same domain concepts in future sessions.
199
+
200
+ Grill-with-docs adds documentation to the alignment process. It looks for a `context.md` file, uses existing shared language, challenges fuzzy terms, cross-references with code, and updates the documentation as the conversation progresses.
201
+
202
+ Use this structure:
203
+
204
+ ```text
205
+ /context.md
206
+ - Domain vocabulary
207
+ - Core entities
208
+ - Definitions
209
+ - Relationships
210
+ - Terms users see in the UI
211
+ - Terms developers use in code
212
+ ```
213
+
214
+ The purpose is to align:
215
+
216
+ ```text
217
+ Human language
218
+ Code language
219
+ Agent language
220
+ User-facing language
221
+ ```
222
+
223
+ When all four match, the AI needs fewer words to understand your intent and is more likely to generate code that fits the domain.
224
+
225
+ ---
226
+
227
+ ### 3.3 Create ADRs for important decisions
228
+
229
+ Some decisions are not just vocabulary. They are architectural trade-offs. For those, use **ADRs**: architecture decision records.
230
+
231
+ The files suggest creating ADRs when a decision is:
232
+
233
+ ```text
234
+ Hard to reverse
235
+ Surprising without context
236
+ The result of a real trade-off
237
+ Likely to affect future implementation
238
+ ```
239
+
240
+ This prevents future agents from undoing decisions because they do not understand why they were made.
241
+
242
+ A simple ADR template:
243
+
244
+ ```markdown
245
+ ## ADR: [Decision title]
246
+
247
+ ### Context
248
+
249
+ What problem or trade-off led to this decision?
250
+
251
+ ### Decision
252
+
253
+ What did we decide?
254
+
255
+ ### Consequences
256
+
257
+ What becomes easier?
258
+ What becomes harder?
259
+ What should future agents avoid changing casually?
260
+ ```
261
+
262
+ ---
263
+
264
+ ## Part 4 — Follow the 7 phases of AI-driven development
265
+
266
+ One file lays out seven phases of AI-driven development:
267
+
268
+ ```text
269
+ 1. Idea
270
+ 2. Research
271
+ 3. Prototype
272
+ 4. PRD
273
+ 5. Implementation planning
274
+ 6. Execution
275
+ 7. QA
276
+ ```
277
+
278
+ These phases can be used for a full app, a feature, a bug fix, or a refactor.
279
+
280
+ ---
281
+
282
+ ### Phase 1 — Start with the idea
283
+
284
+ The idea can be broad or narrow. It might be a full application, a feature, a bug fix, or a refactor. The important thing is not to jump straight from idea to implementation. The idea is just the starting point.
285
+
286
+ Start by writing:
287
+
288
+ ```text
289
+ What I want to change:
290
+ Why I want to change it:
291
+ Who it affects:
292
+ What must remain true:
293
+ What I am unsure about:
294
+ ```
295
+
296
+ Then run a grill-with-docs session.
297
+
298
+ ---
299
+
300
+ ### Phase 2 — Research
301
+
302
+ Use research when the task depends on external APIs, unfamiliar libraries, complex integration details, or parts of the repo that are difficult to explore repeatedly. The research should be cached in a temporary asset like `research.md`, so future agents do not need to rediscover the same information from scratch.
303
+
304
+ But research can rot. The files warn that research is usually valid only for the lifetime of a sprint or idea, not permanently. If it gets stale, it can mislead the agent.
305
+
306
+ A good `research.md` contains:
307
+
308
+ ```text
309
+ External API behavior
310
+ Relevant docs
311
+ Constraints
312
+ Known gotchas
313
+ Example calls
314
+ Integration risks
315
+ Decisions already made
316
+ ```
317
+
318
+ ---
319
+
320
+ ### Phase 3 — Prototype
321
+
322
+ Prototype when you need concrete feedback before writing the PRD. This is especially important for UI, UX, state machines, business logic, or external service integration.
323
+
324
+ The prototype is not the final implementation. It is a learning tool.
325
+
326
+ Use prototypes to answer questions like:
327
+
328
+ ```text
329
+ Which UI direction feels right?
330
+ Does this state machine make sense?
331
+ Can this API integration actually work?
332
+ Is this interaction too confusing?
333
+ What implementation path has the fewest unknowns?
334
+ ```
335
+
336
+ The changelog file also describes a `/prototype` skill for throwaway prototypes, including UI variations and small terminal apps for testing logic. The core philosophy is: prototype first, then hand off to an implementation agent.
337
+
338
+ ---
339
+
340
+ ### Phase 4 — Write the PRD
341
+
342
+ A PRD is the **destination document**. It describes where the work is going, not every tiny step to get there. The files describe PRDs as containing problem statements, proposed solutions, user stories, implementation decisions, and testing decisions.
343
+
344
+ A strong PRD should include:
345
+
346
+ ```markdown
347
+ ## PRD: [Feature Name]
348
+
349
+ ### Problem
350
+
351
+ What is broken, missing, annoying, or valuable?
352
+
353
+ ### Goal
354
+
355
+ What should be true when this is complete?
356
+
357
+ ### Non-goals
358
+
359
+ What are we intentionally not doing?
360
+
361
+ ### User stories
362
+
363
+ As a [user], I want [behavior], so that [outcome].
364
+
365
+ ### Implementation decisions
366
+
367
+ What has already been decided?
368
+ What constraints must be respected?
369
+
370
+ ### Testing decisions
371
+
372
+ What behaviors must be tested?
373
+ Which tests should be unit, integration, or visual?
374
+
375
+ ### Risks
376
+
377
+ What could go wrong?
378
+
379
+ ### Acceptance criteria
380
+
381
+ How will we know this is done?
382
+ ```
383
+
384
+ The files emphasize that testing decisions inside the PRD help agents follow TDD and create feedback loops during implementation.
385
+
386
+ ---
387
+
388
+ ### Phase 5 — Turn the PRD into vertical issues
389
+
390
+ The PRD is the destination. The issues are the journey.
391
+
392
+ A major mistake is breaking work into horizontal layers:
393
+
394
+ ```text
395
+ Task 1: database
396
+ Task 2: backend
397
+ Task 3: frontend
398
+ Task 4: tests
399
+ ```
400
+
401
+ This delays feedback. Instead, the files recommend **vertical slices**: each task should cut through the necessary layers and produce something testable.
402
+
403
+ A vertical slice might include:
404
+
405
+ ```text
406
+ Small schema change
407
+ Service function
408
+ UI behavior
409
+ Tests
410
+ Acceptance criteria
411
+ ```
412
+
413
+ The files connect this to the “tracer bullet” idea: pick slices that reveal unknowns early. If a risky integration might fail, make that one of the first slices.
414
+
415
+ A good issue should include:
416
+
417
+ ```markdown
418
+ ## Issue: [Small vertical slice]
419
+
420
+ ### Parent PRD
421
+
422
+ Link to PRD
423
+
424
+ ### What to build
425
+
426
+ Precise task description
427
+
428
+ ### Acceptance criteria
429
+
430
+ - [ ] Behavior A works
431
+ - [ ] Behavior B is tested
432
+ - [ ] Existing behavior is preserved
433
+
434
+ ### Testing instructions
435
+
436
+ What tests to add or run
437
+
438
+ ### Blocking relationships
439
+
440
+ Blocked by:
441
+ Blocks:
442
+
443
+ ### Notes for agent
444
+
445
+ Important context, files, constraints, and risks
446
+ ```
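The `Blocked by` field in the template, like the skill's `depends_on` slice DAG, implies an execution order. A minimal sketch of deriving that order with Kahn's algorithm follows. The `Slice` shape and the `executionOrder` function are assumptions made for the illustration, not the package's actual schema or code.

```typescript
// Hypothetical shape: `dependsOn` holds the ids listed under "Blocked by".
interface Slice {
  id: string;
  dependsOn: string[];
}

// Kahn's algorithm: returns slice ids so that every slice comes after the
// slices it depends on. Throws on unknown ids and on dependency cycles.
function executionOrder(slices: Slice[]): string[] {
  const remaining = new Map<string, number>(); // id -> unmet dependency count
  const dependents = new Map<string, string[]>(); // id -> slices it blocks

  for (const slice of slices) {
    remaining.set(slice.id, slice.dependsOn.length);
    dependents.set(slice.id, []);
  }
  for (const slice of slices) {
    for (const dep of slice.dependsOn) {
      const blocked = dependents.get(dep);
      if (blocked === undefined) {
        throw new Error(`Slice "${slice.id}" depends on unknown slice "${dep}"`);
      }
      blocked.push(slice.id);
    }
  }

  // Slices with no dependencies can start immediately.
  const ready = slices.filter((s) => s.dependsOn.length === 0).map((s) => s.id);
  const order: string[] = [];
  while (ready.length > 0) {
    const id = ready.shift() as string;
    order.push(id);
    for (const next of dependents.get(id) ?? []) {
      const left = (remaining.get(next) ?? 0) - 1;
      remaining.set(next, left);
      if (left === 0) ready.push(next);
    }
  }

  if (order.length !== slices.length) {
    throw new Error("Dependency cycle detected among slices");
  }
  return order;
}
```

A sequential runner would walk the returned list in order. A parallel runner could instead start everything in `ready` at once, which is where worktree fan-out would apply.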
447
+
448
+ ---
449
+
450
+ ## Part 5 — Triage your backlog before agents touch it
451
+
452
+ The `/triage` workflow turns messy ideas, bug reports, and feature requests into actionable work. It uses labels as a state machine. Each issue should have a category and a state.
453
+
454
+ Common category labels:
455
+
456
+ ```text
457
+ bug
458
+ enhancement
459
+ ```
460
+
461
+ Common state labels:
462
+
463
+ ```text
464
+ needs triage
465
+ needs info
466
+ ready for agent
467
+ ready for human
468
+ won’t fix
469
+ ```
470
+
471
+ The key rule is: **an issue should not be picked up by an AFK agent unless it is explicitly ready for agent**.
472
+
473
+ This prevents the agent from wasting time on vague, low-quality, contradictory, or out-of-scope tasks.
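One way to make the "ready for agent" rule mechanical is to treat the labels as data. The sketch below assumes a hypothetical `TriagedIssue` shape, and the transition table is an assumption about how the states might connect. The essay names the labels but does not define the transitions.

```typescript
type Category = "bug" | "enhancement";

type TriageState =
  | "needs triage"
  | "needs info"
  | "ready for agent"
  | "ready for human"
  | "won't fix";

// Hypothetical issue shape for the illustration.
interface TriagedIssue {
  id: number;
  category?: Category;
  state: TriageState;
}

// Assumed transitions: triage decides the outcome, and an issue that needed
// more information goes back through triage once the information arrives.
const allowedTransitions: Record<TriageState, TriageState[]> = {
  "needs triage": ["needs info", "ready for agent", "ready for human", "won't fix"],
  "needs info": ["needs triage"],
  "ready for agent": [],
  "ready for human": [],
  "won't fix": [],
};

function canTransition(from: TriageState, to: TriageState): boolean {
  return allowedTransitions[from].indexOf(to) !== -1;
}

// The key rule: an AFK agent only sees issues that are categorized and
// explicitly labeled "ready for agent".
function agentQueue(issues: TriagedIssue[]): TriagedIssue[] {
  return issues.filter(
    (issue) => issue.category !== undefined && issue.state === "ready for agent",
  );
}
```

Filtering at the queue, and not inside the agent prompt, means a vague issue never reaches an agent in the first place.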
474
+
475
+ A useful triage workflow:
476
+
477
+ ```text
478
+ 1. Pull all untriaged issues.
479
+ 2. Categorize each as bug or enhancement.
480
+ 3. Decide the state.
481
+ 4. If unclear, mark needs info.
482
+ 5. If out of scope, mark won’t fix and document why.
483
+ 6. If actionable, write an agent brief.
484
+ 7. Mark ready for agent only when fully specified.
485
+ ```
486
+
487
+ The files also recommend documenting “out of scope” decisions so future agents can reject similar ideas consistently.
488
+
489
+ ---
490
+
491
+ ## Part 6 — Execute with TDD: Red, Green, Refactor
492
+
493
+ The TDD workflow is one of the strongest recommendations in the files. The agent should write a failing test first, then implement the minimum code to pass, then refactor.
494
+
495
+ The loop:
496
+
497
+ ```text
498
+ Red: write one failing test
499
+ Green: write the minimum implementation to pass
500
+ Refactor: clean up while tests remain green
501
+ Repeat
502
+ ```
503
+
504
+ The important detail is **one test at a time**. The files warn that LLMs tend to create huge horizontal layers: many tests at once, then a massive implementation attempt. That often produces weak tests and messy code.
505
+
506
+ A good agent instruction:
507
+
508
+ ```text
509
+ Use red-green-refactor.
510
+ For each behavior:
511
+ 1. Write exactly one failing test.
512
+ 2. Run it and confirm it fails for the expected reason.
513
+ 3. Implement the smallest change to pass.
514
+ 4. Run the test again.
515
+ 5. Only then move to the next behavior.
516
+ After all tests pass, look for refactor candidates.
517
+ Do not rewrite the test just to make the implementation pass.
518
+ ```
519
+
520
+ This works especially well with agents because the human can see the test fail, then pass, which provides confidence that the implementation is grounded in real feedback.
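A single red-green cycle might look like the sketch below. The `slugify` function and its test are hypothetical, chosen only because the behavior is small enough to show one test and one minimal implementation side by side. No test framework is assumed; the test is a plain function that throws on failure.

```typescript
// Step 1 (red): one test, written before any implementation exists.
// Run at that point, it fails because `slugify` is not defined yet.
function testSlugifyJoinsWordsWithHyphens(): void {
  const actual = slugify("Hello World");
  if (actual !== "hello-world") {
    throw new Error(`expected "hello-world", got "${actual}"`);
  }
}

// Step 2 (green): the smallest implementation that makes that one test pass.
// Punctuation and repeated spaces are deliberately left for later cycles,
// each of which starts with its own failing test.
function slugify(title: string): string {
  return title.toLowerCase().split(" ").join("-");
}

// Step 3: run the test again and confirm it passes before moving on.
testSlugifyJoinsWordsWithHyphens();
```

The next behavior, for example stripping punctuation, would begin with a new failing test instead of an edit to this one.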
521
+
522
+ ---
523
+
524
+ ## Part 7 — Build feedback loops everywhere
525
+
526
+ The files repeat one message: **without feedback loops, AI is coding blind**.
527
+
528
+ Useful feedback loops include:
529
+
530
+ ```text
531
+ Unit tests
532
+ Integration tests
533
+ Type checking
534
+ Linting
535
+ Build checks
536
+ CI
537
+ Regression tests
538
+ Browser screenshots
539
+ Manual QA
540
+ Code review
541
+ ```
542
+
543
+ For backend work, feedback is usually textual. Tests, logs, type errors, and build failures are easy for the AI to read. For frontend work, this is harder because the feedback is visual: spacing, layout, scrolling, animation, hover states, dark mode, and interaction feel.
544
+
545
+ So frontend agents need browser access. The files describe using Chrome DevTools-style tooling so the agent can open the local app, inspect pages, take screenshots, emulate dark mode, and verify rendering.
546
+
547
+ For frontend or full-stack work, add:
548
+
549
+ ```text
550
+ Browser automation
551
+ Screenshot inspection
552
+ Light/dark mode checks
553
+ Responsive layout checks
554
+ Ad hoc interaction testing
555
+ Accessibility checks when relevant
556
+ ```
557
+
558
+ This makes the AI more like a human frontend developer because it can inspect the actual execution environment, not just the code.
559
+
560
+ ---
561
+
562
+ ## Part 8 — Run agents safely with sandboxes
563
+
564
+ The Sandcastle file introduces a way to run agents AFK in isolated sandboxes. The problem it addresses is permissions: if agents constantly ask for permission, they cannot work autonomously; if you give them unrestricted access, they can do dangerous things. Sandboxing gives them a controlled environment.
565
+
566
+ Sandcastle is described as a TypeScript library for orchestrating coding agents in isolated sandboxes. It can run prompts with agents, use GitHub issues as a backlog manager, and run agents in parallel.
567
+
568
+ A typical Sandcastle-style setup has:
569
+
570
+ ```text
571
+ A .sandcastle directory
572
+ A Dockerfile or sandbox definition
573
+ Environment variables
574
+ A backlog source such as GitHub issues
575
+ A planner agent
576
+ One or more implementer agents
577
+ A reviewer agent
578
+ Possibly a merger agent
579
+ ```
580
+
581
+ The workflow described in the file:
582
+
583
+ ```text
584
+ 1. Planner reads open labeled issues.
585
+ 2. Planner identifies unblocked tasks.
586
+ 3. Implementer agents work in sandboxes.
587
+ 4. Agents run tests and type checks.
588
+ 5. Reviewer analyzes the changes.
589
+ 6. Merger can combine or select branches.
590
+ ```
591
+
592
+ The Sandcastle file also shows that agents can be prompted to use red-green-refactor during implementation, tying autonomous execution back to TDD.
593
+
594
+ ---
595
+
596
+ ## Part 9 — Use worktrees for parallel development
597
+
598
+ Git worktrees let multiple branches of the same repository be checked out in separate folders. This allows multiple agents to work independently without interfering with each other.
599
+
600
+ The basic idea:
601
+
602
+ ```text
603
+ main repo
604
+ feature-worktree-1
605
+ feature-worktree-2
606
+ bugfix-worktree-3
607
+ ```
608
+
609
+ Each worktree can have its own branch, its own changes, and its own agent.
610
+
611
+ The files describe this as a powerful way to make parallelization easier. One agent can work on one idea, another agent can work on another, and each can produce a PR back to main.
612
+
613
+ But there is an important warning: protect your main branch and make sure each agent pushes only to its own named feature branch. Otherwise, an agent may accidentally push work to main if the setup is wrong.
614
+
615
+ A safe instruction for agents:
616
+
617
+ ```text
618
+ You are working in a git worktree.
619
+ Before committing, run git status and confirm the branch name.
620
+ Do not push to main.
621
+ Push only to the current feature branch.
622
+ Open a PR back to main.
623
+ If branch identity is unclear, stop and report.
624
+ ```
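The same rules can be enforced in code before any push happens. The following is a sketch under assumptions: `checkPush`, `PushDecision`, and the protected-branch list (including `master`) are hypothetical, and the caller is expected to obtain the current branch itself, for example from `git rev-parse --abbrev-ref HEAD`, which prints `HEAD` when the checkout is detached.

```typescript
// Assumed list of branches an agent must never push to directly.
const PROTECTED_BRANCHES = ["main", "master"];

type PushDecision =
  | { allowed: true; branch: string }
  | { allowed: false; reason: string };

// Decides whether an agent may push, given the branch it is actually on and
// the feature branch it was assigned.
function checkPush(currentBranch: string, expectedBranch: string): PushDecision {
  const branch = currentBranch.trim();

  // An empty name or "HEAD" means a detached checkout: stop and report.
  if (branch === "" || branch === "HEAD") {
    return { allowed: false, reason: "branch identity is unclear" };
  }
  if (PROTECTED_BRANCHES.indexOf(branch) !== -1) {
    return { allowed: false, reason: `refusing to push to protected branch "${branch}"` };
  }
  if (branch !== expectedBranch) {
    return {
      allowed: false,
      reason: `on "${branch}" but assigned to "${expectedBranch}"`,
    };
  }
  return { allowed: true, branch };
}
```

This complements server-side branch protection and does not replace it: the guard catches a misconfigured worktree early, while protection on the remote remains the final safeguard.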
625
+
626
+ ---
627
+
628
+ ## Part 10 — Review in a fresh context
629
+
630
+ The files recommend reviewing AI-generated code in a fresh context. If the same agent that wrote the code reviews it inside a bloated context, it may be less effective. A fresh context gives the reviewer a cleaner view.
631
+
632
+ The newer skills changelog also describes a planned `/review` skill with two parallel review modes:
633
+
634
+ ```text
635
+ Standards review:
636
+ Does the code follow repository conventions?
637
+
638
+ Spec review:
639
+ Does the implementation match the issue or PRD?
640
+ ```
641
+
642
+ This distinction is useful. A change can be well-written but solve the wrong problem, or it can solve the right problem while violating project standards.
643
+
644
+ A good review prompt:
645
+
646
+ ```text
647
+ Review this PR in a fresh context.
648
+
649
+ Check two things separately:
650
+
651
+ 1. Spec compliance:
652
+ - Does the implementation satisfy the issue?
653
+ - Are all acceptance criteria met?
654
+ - Are user stories preserved?
655
+
656
+ 2. Code standards:
657
+ - Does the code match existing conventions?
658
+ - Are module boundaries respected?
659
+ - Are tests meaningful?
660
+ - Are there unnecessary abstractions?
661
+ - Are there risky changes outside scope?
662
+
663
+ Do not rewrite code yet. First produce findings ranked by severity.
664
+ ```
665
+
666
+ ---
667
+
668
+ ## Part 11 — Use handoff when context gets too large
669
+
670
+ Long sessions consume context. The `/handoff` skill creates a temporary handoff document that summarizes the current conversation, intent, artifacts, decisions, and suggested next skills. This lets another agent continue the work without carrying the entire original conversation.
671
+
672
+ Use handoff when:
673
+
674
+ ```text
675
+ The session is getting long
676
+ You want a fresh agent to continue
677
+ You want to delegate a subtask
678
+ You want another agent to review or prototype independently
679
+ You want to preserve intent without copying everything
680
+ ```
681
+
682
+ A handoff document should include:
683
+
684
+ ```markdown
685
+ ## Handoff
686
+
687
+ ### Current goal
688
+
689
+ What are we trying to accomplish?
690
+
691
+ ### Current state
692
+
693
+ What has been decided or built?
694
+
695
+ ### Important artifacts
696
+
697
+ Links to PRD, issues, context.md, ADRs, prototypes, branches
698
+
699
+ ### Domain language
700
+
701
+ Terms the next agent must understand
702
+
703
+ ### Constraints
704
+
705
+ What must not change?
706
+
707
+ ### Recommended next action
708
+
709
+ What should the next agent do?
710
+
711
+ ### Suggested skill
712
+
713
+ grill-with-docs / prototype / tdd / review / triage / etc.
714
+ ```
715
+
716
+ ---
717
+
718
+ ## Part 12 — Human QA closes the loop
719
+
720
+ Even after AFK implementation, tests, and review, the human still performs QA. The seven-phase workflow explicitly ends with the agent producing a QA plan and the human walking through the completed work. That QA often creates more tickets, which go back into the implementation loop.
721
+
722
+ A good QA plan includes:
723
+
724
+ ```text
725
+ Core happy path
726
+ Edge cases
727
+ Regression checks
728
+ Visual checks
729
+ Data integrity checks
730
+ Error states
731
+ Performance concerns
732
+ Accessibility concerns if relevant
733
+ Manual steps to reproduce
734
+ ```
735
+
736
+ The loop becomes:
737
+
738
+ ```text
739
+ Execute issue
740
+ Run tests
741
+ Review
742
+ Human QA
743
+ Find problems
744
+ Create new issues
745
+ Triage
746
+ Execute again
747
+ ```
748
+
749
+ This is why the process is iterative, not one-shot.
750
+
751
+ ---
752
+
753
+ ## The complete workflow
754
+
755
+ Here is the combined tutorial workflow from the files:
756
+
757
+ ```text
758
+ 1. Prepare the codebase
759
+ - Improve architecture
760
+ - Create deep modules
761
+ - Define seams and adapters
762
+ - Add tests around boundaries
763
+
764
+ 2. Establish shared language
765
+ - Create or update context.md
766
+ - Use grill-with-docs
767
+ - Add ADRs for hard-to-reverse decisions
768
+
769
+ 3. Start from an idea
770
+ - Describe the goal
771
+ - Explain why it matters
772
+ - Identify uncertainty
773
+
774
+ 4. Research when needed
775
+ - Cache temporary research in research.md
776
+ - Avoid stale permanent research
777
+
778
+ 5. Prototype when taste or uncertainty matters
779
+ - UI prototypes
780
+ - Logic prototypes
781
+ - API experiments
782
+
783
+ 6. Write the PRD
784
+ - Problem
785
+ - Goal
786
+ - User stories
787
+ - Implementation decisions
788
+ - Testing decisions
789
+ - Acceptance criteria
790
+
791
+ 7. Break into vertical issues
792
+ - Avoid horizontal layers
793
+ - Create tracer-bullet tasks
794
+ - Add blocking relationships
795
+ - Reference the parent PRD
796
+
797
+ 8. Triage the backlog
798
+ - Label category and state
799
+ - Mark only clear tasks as ready for agent
800
+ - Document out-of-scope decisions
801
+
802
+ 9. Execute with agents
803
+ - Use sandboxes
804
+ - Use worktrees
805
+ - Use one agent per unblocked task when useful
806
+ - Protect main
807
+
808
+ 10. Use TDD
809
+ - One failing test
810
+ - Minimal implementation
811
+ - Refactor
812
+ - Repeat
813
+
814
+ 11. Add feedback loops
815
+ - Tests
816
+ - Type checks
817
+ - Lint
818
+ - Builds
819
+ - Browser screenshots for frontend
820
+
821
+ 12. Review in fresh context
822
+ - Spec review
823
+ - Standards review
824
+
825
+ 13. Human QA
826
+ - Walk through the completed work
827
+ - Create new issues
828
+ - Repeat the loop
829
+ ```
830
+
831
+ ---
832
+
833
+ ## Practical “minimum viable” version
834
+
835
+ If someone is not ready for the full multi-agent workflow, the simplest version is:
836
+
837
+ ```text
838
+ 1. Use grill-with-docs to clarify the feature.
839
+ 2. Write a PRD.
840
+ 3. Break the PRD into 3–6 vertical issues.
841
+ 4. Pick one issue.
842
+ 5. Ask the agent to use red-green-refactor.
843
+ 6. Run tests and type checks.
844
+ 7. Review the diff.
845
+ 8. QA manually.
846
+ 9. Create follow-up issues.
847
+ ```
848
+
849
+ This gives most of the benefit without needing a full Sandcastle-style AFK factory.
850
+
851
+ ---
852
+
853
+ ## Advanced version: AI software factory
854
+
855
+ The advanced version combines everything:
856
+
857
+ ```text
858
+ Architecture-ready codebase
859
+ +
860
+ context.md and ADRs
861
+ +
862
+ PRDs
863
+ +
864
+ GitHub issues
865
+ +
866
+ triage labels
867
+ +
868
+ Sandcastle or equivalent sandbox orchestration
869
+ +
870
+ Git worktrees
871
+ +
872
+ TDD prompts
873
+ +
874
+ review agents
875
+ +
876
+ human QA
877
+ ```
878
+
879
+ At that point, the human does the “day shift”: thinking, deciding, grilling, documenting, prioritizing, and reviewing. The agents do the “night shift”: implementing, testing, reviewing, and reporting. This “human day shift / AI night shift” idea appears as the final shape of the workflow.
880
+
881
+ ---
882
+
883
+ ## Final takeaway
884
+
885
+ The combined message of the files is:
886
+
887
+ ```text
888
+ Do not use AI to avoid engineering.
889
+ Use engineering to make AI useful.
890
+ ```
891
+
892
+ Good AI-driven development is not about the perfect prompt. It is about creating a system where agents can succeed:
893
+
894
+ ```text
895
+ Clear language
896
+ Clear architecture
897
+ Clear tasks
898
+ Clear tests
899
+ Clear feedback
900
+ Clear review
901
+ Clear human ownership
902
+ ```
903
+
904
+ When those pieces are in place, AI agents can become genuinely powerful collaborators. When they are missing, AI simply accelerates entropy and produces code that is faster to write but harder to maintain.