theslopmachine 1.0.13 → 1.0.22

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (39)
  1. package/assets/agents/developer.md +6 -7
  2. package/assets/agents/slopmachine-claude.md +66 -9
  3. package/assets/agents/slopmachine.md +68 -9
  4. package/assets/claude/agents/developer.md +5 -1
  5. package/assets/skills/clarification-gate/SKILL.md +56 -20
  6. package/assets/skills/claude-worker-management/SKILL.md +14 -4
  7. package/assets/skills/deep-retrospective/SKILL.md +179 -0
  8. package/assets/skills/deep-retrospective/run.py +446 -0
  9. package/assets/skills/deep-retrospective/workflow-reference.md +240 -0
  10. package/assets/skills/developer-session-lifecycle/SKILL.md +18 -4
  11. package/assets/skills/development-guidance/SKILL.md +52 -31
  12. package/assets/skills/evaluation-triage/SKILL.md +21 -7
  13. package/assets/skills/final-evaluation-orchestration/SKILL.md +92 -28
  14. package/assets/skills/integrated-verification/SKILL.md +38 -42
  15. package/assets/skills/p8-readiness-reconciliation/SKILL.md +31 -10
  16. package/assets/skills/planning-gate/SKILL.md +10 -7
  17. package/assets/skills/planning-guidance/SKILL.md +60 -52
  18. package/assets/skills/retrospective-analysis/SKILL.md +172 -58
  19. package/assets/skills/scaffold-guidance/SKILL.md +18 -6
  20. package/assets/skills/submission-packaging/SKILL.md +11 -3
  21. package/assets/slopmachine/clarifier-agent-prompt.md +7 -6
  22. package/assets/slopmachine/exact-readme-template.md +8 -12
  23. package/assets/slopmachine/owner-verification-checklist.md +1 -1
  24. package/assets/slopmachine/phase-1-design-prompt.md +5 -10
  25. package/assets/slopmachine/phase-1-design-template.md +15 -11
  26. package/assets/slopmachine/phase-2-execution-planning-prompt.md +5 -2
  27. package/assets/slopmachine/phase-2-plan-template.md +14 -4
  28. package/assets/slopmachine/scaffold-playbooks/shared-contract.md +2 -1
  29. package/assets/slopmachine/templates/AGENTS.md +3 -1
  30. package/assets/slopmachine/templates/CLAUDE.md +3 -1
  31. package/assets/slopmachine/test-coverage-prompt.md +8 -1
  32. package/assets/slopmachine/utils/README.md +1 -5
  33. package/assets/slopmachine/utils/claude_live_common.mjs +2 -5
  34. package/assets/slopmachine/utils/prepare_evaluation_send_packet.mjs +3 -3
  35. package/package.json +1 -1
  36. package/src/constants.js +0 -9
  37. package/src/init.js +17 -24
  38. package/src/install.js +30 -28
  39. package/assets/slopmachine/utils/prepare_evaluation_prompt.mjs +0 -81
@@ -0,0 +1,240 @@
1
+ # SlopMachine Workflow Reference
2
+
3
+ Expected behavior and rules for each phase. Use this to judge whether the workflow ran correctly.
4
+
5
+ ---
6
+
7
+ ## Overall Rules
8
+
9
+ ### Non-Negotiable Verbatim Prompt Paste
10
+ Every packaged `.md` prompt (clarifier, faithfulness review, design, execution planning, evaluation, test coverage) must be read fresh from `~/slopmachine/` and pasted verbatim into the subagent message. Never summarize, describe, shorten, paraphrase, or tell the worker to read the file themselves. Violation invalidates the workflow action.
11
+
12
+ ### Worker Communication Firewall
13
+ Developer/worker sessions must never see: phase names/numbers, workflow gates, lifecycle mechanics, owner/worker terminology, Beads, metadata, `../.ai`, hidden plans/private reports, evaluator mechanics. Prompts must sound like a human lead engineer, not an orchestration system.
14
+
15
+ ### Natural Language Prompting
16
+ No markdown formatting in developer prompts. Plain English sentences only. No bullet lists, bold, backticks, dashes. Write like a human: "I checked the module and found these issues."
17
+
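The plain-English rule above can be checked mechanically before a prompt is sent. The following is a minimal sketch, not part of the package: the function name `find_markdown` and the regular expressions are assumptions made for illustration, covering only the constructs the rule names (bullet and numbered lists, bold, backticks).

```python
import re

# Hypothetical patterns: the rule lists the forbidden constructs,
# but the package does not specify how to detect them.
_MARKDOWN_PATTERNS = {
    "bullet list": re.compile(r"^\s*[-*+]\s+\S", re.MULTILINE),
    "numbered list": re.compile(r"^\s*\d+[.)]\s+\S", re.MULTILINE),
    "bold": re.compile(r"\*\*[^*\n]+\*\*|__[^_\n]+__"),
    "backticks": re.compile(r"`"),
}


def find_markdown(prompt: str) -> list[str]:
    """Return the names of markdown constructs found in a developer prompt."""
    return [name for name, pattern in _MARKDOWN_PATTERNS.items() if pattern.search(prompt)]
```

An empty result means the prompt reads as plain sentences; any returned name points at the construct to rewrite before sending.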
18
+ ### Session Integrity
19
+ Sessions are the primary deliverable. Never edit, rename, restructure, or delete session files. Never perform off-session work. Never return to a closed session.
20
+
21
+ ### State Management
22
+ - Owner-private files live under `../.ai` and `../.beads`
23
+ - `../.ai/metadata.json` is the control plane (run ID, phase, sessions, artifacts)
24
+ - Beads is the durable ledger (phase lifecycle, blockers, evidence)
25
+ - Both must be kept aligned at every phase transition
26
+
27
+ ### Docker Deferred
28
+ Docker and `run_tests.sh` are deferred to Phase 6/7. Never run Docker during earlier phases.
29
+
30
+ ---
31
+
32
+ ## Phase 1: Clarification
33
+
34
+ **Expected behavior:**
35
+ - Verify `./metadata.json` has the exact original prompt
36
+ - Launch clarification worker subagent via `task` tool with `~/slopmachine/clarifier-agent-prompt.md` verbatim
37
+ - Launch faithfulness review subagent via `task` tool with `~/slopmachine/clarification-faithfulness-review-prompt.md` verbatim
38
+ - Patch all drift findings in questions.md and requirements-breakdown.md
39
+ - Record artifact creation and acceptance in metadata and Beads
40
+ - Exit when `clarification-gate` is satisfied
41
+
42
+ **What to look for in traces:**
43
+ - Did the clarifier prompt get sent verbatim?
44
+ - Did the faithfulness review run and find drift?
45
+ - Were drift findings actually patched?
46
+ - Did questions.md cover environment/trust-boundary ambiguities?
47
+ - Did requirements-breakdown.md classify implied defaults with risk tiers?
48
+
49
+ ## Phase 2: Planning
50
+
51
+ **Expected behavior:**
52
+ - Use the same primary developer session, whether newly established or continued
53
+ - Follow deterministic planning sequence exactly:
54
+ 1. Send original prompt + "Don't write code yet — we'll plan this first."
55
+ 2. After acknowledgement, send clarifications as natural-language requirements (no REQ-### IDs)
56
+ 3. After acknowledgement, send `~/slopmachine/phase-1-design-prompt.md` verbatim
57
+ - Developer fills `./docs/design.md` and `./docs/api-spec.md`
58
+ - Owner verifies design quality against all requirements
59
+ - Launch general subagent with `~/slopmachine/phase-2-execution-planning-prompt.md` + `~/slopmachine/phase-2-plan-template.md` verbatim to create `../.ai/plan.md`
60
+ - Exit when `planning-gate` is satisfied
61
+
62
+ **What to look for in traces:**
63
+ - Was the sequence strictly followed (step 1→2→3 with acknowledgement between each)?
64
+ - Was the design prompt pasted verbatim?
65
+ - Was the design review thorough or cursory (spot-check)?
66
+ - Did the plan include runtime verification strategy, JSON contracts, browser verification matrix, seeded-data tests?
67
+ - Were clarifications sent as plain sentences (no REQ-### IDs or workflow metadata)?
68
+ - Was the plan created owner-private (subagent, not developer)?
69
+
70
+ ## Phase 3: Development
71
+
72
+ **Expected behavior:**
73
+ - Continue same primary developer session
74
+ - Scaffold first, then module by module
75
+ - Each module prompt: casual human language, one bounded slice, no workflow mechanics
76
+ - After each module: verify against design, plan, and original prompt
77
+ - **Owner must start the application locally at scaffold acceptance and every module boundary** — do not accept based on test output alone
78
+ - **Developer must write cross-module integration tests** proving data/behavior flow between each new module and all previously built modules
79
+ - Owner must verify each module before accepting completion
80
+ - Local test harness only — no Docker
81
+ - After all modules: ask developer for self-check against design/API docs
82
+ - Exit when all modules complete, scaffold accepted, app verified at every boundary, cross-module tests exist, self-check done
83
+
84
+ **What to look for in traces:**
85
+ - Were prompts natural language? No markdown, no bullet lists, no workflow terminology?
86
+ - Did the developer actually build real behavior or decoration (mock factories, permissive tests)?
87
+ - Were tests strict (exact state/status assertions) or permissive (accept any outcome)?
88
+ - Was the application ever run (`docker compose up`, `npm start`, `go run`)?
89
+ - Did the owner start the app at scaffold acceptance and each module boundary?
90
+ - Did the owner ask "did you run it?" or "what do these tests prove?"
91
+ - **Were cross-module integration tests written** when a new module connected to previous ones?
92
+ - Was each module accepted based on "X tests pass" without runtime verification?
93
+ - Were assertions checking specific business behavior or just presence/counts?
94
+
95
+ **Signs of weak development:**
96
+ - "Tests pass" used as sole completion signal
97
+ - "Let me fix the tests to accept either status" (permissive assertions)
98
+ - "Mock factories need to export..." (mock-heavy testing)
99
+ - Module completion in < 10 minutes for large scope (speed over quality)
100
+ - No `docker compose`, `npm start`, or browser verification anywhere
101
+ - Test counts used as completion evidence without content inspection
102
+
103
+ ## Phase 4: Integrated Verification
104
+
105
+ **Expected behavior:**
106
+ - Close normal work in primary developer session
107
+ - Start a new bugfix session
108
+ - Owner plan-based review first (before evaluator loop)
109
+ - Run internal evaluator loop: 5 passes, same evaluator session
110
+ - Pass 1: full evaluation prompt verbatim (`~/slopmachine/backend-evaluation-prompt.md`) via `prepare_evaluation_send_packet.mjs`
111
+ - Passes 2-5: follow-up prompt asking for different issues (NOT full prompt again): "Using the same review strategy, do another pass and look for a different set of material issues that were not already covered. Focus on independent Blocker, High, security, or prompt-fit risks. Write the next report to the requested path."
112
+ - **All 5 passes must run** — do not stop early unless the evaluator produces zero new findings in two consecutive passes
113
+ - Each pass writes to `../.ai/internal-verification/report-<N>.md`
114
+ - All issues collected in `../.ai/consolidated-internal-issues.md`
115
+ - **For web/fullstack projects, run browser verification with agent-browser** — exercise every README credential, every core user journey, key prompt requirements
116
+ - Batch ALL issues (plan-based review + all 5 passes + browser findings) before sending ANY to bugfix lane
117
+ - Send consolidated issues to bugfix developer in broad human language
118
+ - Run local non-Docker verification (local test harness)
119
+ - Never run Docker or `run_tests.sh` during this phase
120
+ - Exit when all plan-based review issues are fixed, all 5 evaluator passes are complete, browser verification has run (web/fullstack), local verification passes, and surfaces are coherent
121
+
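The early-stop rule for the evaluator loop (all 5 passes, unless two consecutive passes produce zero new findings) can be sketched as a small predicate. This is an illustrative sketch only: `may_stop` and `REQUIRED_PASSES` are hypothetical names, and the input is assumed to be the count of new findings per completed pass.

```python
REQUIRED_PASSES = 5  # taken from the rule above


def may_stop(new_findings_per_pass: list[int]) -> bool:
    """Return True when the internal evaluator loop is allowed to end."""
    # All required passes have run.
    if len(new_findings_per_pass) >= REQUIRED_PASSES:
        return True
    # Early stop: the two most recent passes both produced zero new findings.
    last_two = new_findings_per_pass[-2:]
    return len(last_two) == 2 and all(count == 0 for count in last_two)
```

A single empty pass is not enough to stop, and a non-empty pass between two empty ones resets the streak.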
122
+ **What to look for in traces:**
123
+ - How many evaluator passes were actually run? (Must be 5, or explicitly justified early stop)
124
+ - Were passes 2-5 using the follow-up prompt (not the full prompt)?
125
+ - Were findings batched into one consolidated file before sending to bugfix?
126
+ - Did the bugfix developer reproduce failures before fixing?
127
+ - Did the bugfix developer run the app or just fix static code?
128
+ - **Was browser verification attempted?** Was `agent-browser` available and used? (Mandatory for web/fullstack)
129
+ - What runtime bugs should have been caught at this phase?
130
+ - Did the owner accept test counts as proof of fix completion?
131
+
132
+ **Signs of weak verification:**
133
+ - Only 1-2 evaluator passes instead of 5
134
+ - Full prompt re-pasted for passes 2-5 (wasteful, wrong)
135
+ - Findings sent to bugfix one at a time instead of batched
136
+ - No browser verification for fullstack/web projects
137
+ - Bugfix developer never ran the application
138
+ - Owner accepts "tests pass" without inspecting what changed
139
+ - Phase completed too quickly (< 1 hour for substantial projects)
140
+
141
+ ## Phase 5: Evaluation
142
+
143
+ **Expected behavior:**
144
+ - Two audit cycles, each with fresh evaluator session
145
+ - Each audit: paste full evaluation prompt verbatim (read fresh from installed asset)
146
+ - If Fail: fix via bugfix lane, then send exact regeneration prompt (fresh prompt, no mention of previous fixes)
147
+ - Each cycle must produce: `./.tmp/audit_report-<N>.md` (150+ lines) and `./.tmp/audit_report-<N>-fix_check.md`
148
+ - After both audit cycles: close bugfix lane, start test-coverage lane
149
+ - Run coverage/README audit with `~/slopmachine/test-coverage-prompt.md` verbatim
150
+ - Target: >= 90% test score
151
+ - Exit when both audit cycles complete with kept reports and fix-checks, coverage/README audit passes
152
+
153
+ **What to look for in traces:**
154
+ - Was the evaluation prompt pasted verbatim?
155
+ - How many regeneration cycles were needed? (More than 2-3 = problem)
156
+ - Did evaluators have file-write capability? If not, how were reports saved?
157
+ - What was the coverage score? If < 90, why?
158
+ - Was the score gap addressed by fixing root causes or creating evidence copies?
159
+ - How many coverage evaluator sessions were spawned? (More than 5 = coverage spiral)
160
+
161
+ **Signs of evaluation issues:**
162
+ - > 5 evaluator sessions for coverage scoring (coverage spiral)
163
+ - Reports mention "cannot write files"
164
+ - Score gap persists after multiple sessions
165
+ - Build-tag-gated evidence copies created to game score
166
+ - Regeneration prompt not sent verbatim
167
+
168
+ ## Phase 6: Final Readiness
169
+
170
+ **Expected behavior:**
171
+ - Run Docker: `docker compose up --build`
172
+ - Run `./repo/run_tests.sh` (dockerized)
173
+ - Browser verification with `agent-browser`: verify ALL prompt requirement surfaces, ALL README credentials, ALL seeded values, ALL core user journeys
174
+ - Batch all browser findings before sending to developer
175
+ - Fix + rerun checks until green or risk-accepted
176
+ - Exit when all readiness categories pass
177
+
178
+ **What to look for in traces:**
179
+ - Did Docker/runtime checks pass on first try or require multiple retries?
180
+ - Was browser verification decorative (page-render only) or substantive (button flows)?
181
+ - What bugs were discovered only here? These are indictments of earlier phases.
182
+ - Was `agent-browser` available? Was it checked (`command -v agent-browser`)?
183
+
184
+ ## Phase 7: Submission Packaging
185
+
186
+ **Expected behavior:**
187
+ - Package root contains only: .git, .gitignore, .tmp, docs, metadata.json, repo
188
+ - No workflow-private files (../.ai, ../.beads) in package
189
+ - `unit_tests/` and `API_tests/` present with runnable tests
190
+ - README is Docker-contained (no local test commands)
191
+ - No last-minute developer fixes (fixes at this phase mean earlier phases failed)
192
+ - Exit when package structure matches allowlist
193
+
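The package-root allowlist above lends itself to a direct check. The sketch below is an assumption-labelled illustration, not the package's own tooling: `unexpected_root_entries` is a hypothetical helper, and it only reports extra entries, since the rule says the root may contain "only" the listed names without requiring every one to be present.

```python
from pathlib import Path

# Allowlist copied from the Phase 7 expected behavior.
ALLOWED_ROOT_ENTRIES = frozenset(
    {".git", ".gitignore", ".tmp", "docs", "metadata.json", "repo"}
)


def unexpected_root_entries(package_root: Path) -> list[str]:
    """Return sorted names in the package root that are not on the allowlist."""
    return sorted(
        entry.name
        for entry in package_root.iterdir()
        if entry.name not in ALLOWED_ROOT_ENTRIES
    )
```

An empty list means the package structure matches the allowlist; anything returned (for example a stray workflow-private directory) should be removed before packaging closes.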
194
+ **What to look for in traces:**
195
+ - Were there last-minute developer fixes? (Should be none)
196
+ - Were test directories populated with runnable tests or evidence copies?
197
+ - Did credential verification pass for all seeded accounts?
198
+ - Was metadata/session hygiene maintained?
199
+
200
+ ## Phase 8: Retrospective
201
+
202
+ **Expected behavior:**
203
+ - Write retrospective after packaging closes
204
+ - Review all mandatory evidence sources
205
+ - Produce late-finding origin table
206
+ - Capture improvement actions
207
+ - Duration should be > 5 minutes for meaningful projects
208
+
209
+ ---
210
+
211
+ ## Developer Lane Rules (from developer.md)
212
+
213
+ Tests must:
214
+ - Prove behavior and side effects, not only existence or rendering
215
+ - Be directly runnable from unit_tests/ and API_tests/ (not build-tag-gated copies)
216
+ - Assert exact expected state transitions, status codes, response bodies (not permissive)
217
+ - Use real backend for frontend tests (mock only external dependencies)
218
+ - Cover negative/boundary paths (unauthenticated, unauthorized, not found, etc.)
219
+
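The difference between permissive and exact assertions is easiest to see side by side. The example below is a sketch under stated assumptions: `create_invoice` is a hypothetical handler invented for illustration, not code from the package.

```python
def create_invoice(payload: dict) -> tuple[int, dict]:
    """Hypothetical handler returning (status_code, response_body)."""
    if "amount" not in payload:
        return 422, {"error": "amount is required"}
    return 201, {"id": 1, "amount": payload["amount"]}


def permissive_check() -> None:
    # Anti-pattern: accepts success and failure alike, so it proves nothing
    # about how a missing field is handled.
    status, _ = create_invoice({})
    assert status in (201, 400, 422)


def strict_check() -> None:
    # Exact status code and exact response body for the negative path.
    status, body = create_invoice({})
    assert status == 422
    assert body == {"error": "amount is required"}
```

Both checks pass against this handler, but only the strict one would fail if the handler started accepting invalid input.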
220
+ Implementation must:
221
+ - Complete coherent behavior end to end through real app path
222
+ - Wire UI → client call → handler/service → persistence/state
223
+ - Never create fake success paths, no-op jobs, disconnected forms
224
+
225
+ ---
226
+
227
+ ## Known Anti-Patterns (Playbook)
228
+
229
+ 1. **Permissive testing:** Tests accept "either status" or "any valid outcome" instead of exact expected state
230
+ 2. **Mock-heavy testing:** Mock factories/services instead of real DB, real HTTP, real browser
231
+ 3. **No runtime verification:** Application never run during development
232
+ 4. **Completion by test count:** Module declared complete based on "X tests pass" alone
233
+ 5. **Coverage spiral:** > 5 evaluator sessions spawned for coverage scoring with no score improvement
234
+ 6. **Build-tag evidence:** Tests in unit_tests/ and API_tests/ that don't actually run
235
+ 7. **Rushed reviews:** Phase 2 or Phase 4 completed in < 30 minutes for substantial projects
236
+ 8. **Fan-out collapse:** Multiple planned parallel lanes collapse into single serial developer session
237
+ 9. **Missed environment checks:** Network/trust-boundary ambiguities not caught during clarification
238
+ 10. **Module isolation:** Modules built in isolation with no cross-module integration tests proving data/behavior flow between them
239
+ 11. **Skipped evaluator passes:** Fewer than 5 internal evaluator passes run in Phase 4
240
+ 12. **No browser verification before P6:** Web/fullstack projects reach P6 without any browser verification, finding crashes that should have been caught in P3 or P4
@@ -7,9 +7,23 @@ description: Developer, Claude, evaluator, and metadata lifecycle rules for slop
7
7
 
8
8
  Use this skill for startup preflight, session policy, metadata consistency, lane handoffs, and recovery.
9
9
 
10
+ ## Session Integrity (Highest Priority)
11
+
12
+ Sessions are the primary deliverable. An incomplete or corrupted session dataset invalidates the submission regardless of code quality. Every session must be preserved, continuous, and authentic.
13
+
14
+ - Do not edit, rename, restructure, rewrite, clean, delete, or fabricate session files or trajectory records. They are immutable evidence.
15
+ - Do not perform untracked implementation work. Developer/Claude implementation, debugging, and substantive product fixes must happen inside a tracked implementation session. Owner orchestration, verification commands, package checks, and tiny safe owner-side docs/config/wrapper/glue fixes are allowed when recorded in metadata/Beads and the active implementation lane is notified afterward.
16
+ - Sessions must progress strictly forward. Never return to a closed session. The lifecycle is:
17
+ 1. Development session → complete → stop/close
18
+ 2. Bugfix session → complete both audit cycles → close
19
+ 3. Test-coverage/reconciliation session → complete coverage/README/Docker/runtime fixes → close
20
+ - If a session becomes genuinely unrecoverable (crash with no salvageable `sid` — even after attempting tmux relaunch with the known `sid` — and transcript/session lookup also fails), start a new session in the same lane. The sessions remain sequential and a clear timeline can be established. This is the only exception to the single-session-per-lane rule. Paused, rate-limited, or waiting states are not unrecoverable — stay in the same session.
21
+ - Each closed session must be recorded as closed in metadata and Beads. No work may resume in a closed session.
22
+
10
23
  ## Preflight
11
24
 
12
25
  - Confirm cwd is task root `./`.
26
+ - Confirm the current working directory is the task root (the directory containing `repo/`, `docs/`, and `metadata.json`). If the root is not `task/` or its equivalent, stop and reject: sessions must be started from the task root, not from inside `repo/` or any subdirectory.
13
27
  - Confirm product repo exists at `./repo`.
14
28
  - Confirm workflow-private root exists at `../.ai`, workflow state exists at `../.ai/metadata.json`, and Beads root exists at `../.beads` when initialized.
15
29
  - Confirm task docs are limited to `./docs/questions.md`, `./docs/design.md`, and `./docs/api-spec.md` when applicable.
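The task-root preflight check added in this hunk can be sketched as follows. This is a minimal illustration, assuming the three markers named in the rule (`repo/`, `docs/`, `metadata.json`) are sufficient to identify the task root; `missing_task_root_markers` is a hypothetical helper name.

```python
from pathlib import Path

# Markers named in the preflight rule.
TASK_ROOT_MARKERS = ("repo", "docs", "metadata.json")


def missing_task_root_markers(cwd: Path) -> list[str]:
    """Return the task-root markers that are absent from the given directory."""
    return [name for name in TASK_ROOT_MARKERS if not (cwd / name).exists()]
```

A non-empty result suggests the session was started from inside `repo/` or another subdirectory and should be rejected rather than continued.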
@@ -22,17 +36,17 @@ Use this skill for startup preflight, session policy, metadata consistency, lane
22
36
 
23
37
  - Use one primary implementation lane by default from first product orientation through design, implementation, and local verification.
24
38
  - Keep exactly one implementation lane active at a time for the current phase purpose. Any inactive prior lane must be recorded as closed, parked, replaced, or superseded before another lane becomes active.
25
- - Additional implementation lanes are allowed only for context limits, unrecoverable session failure, explicit user instruction, or concrete low-risk bounded work that cannot safely share context.
26
- - Rate limits, slow turns, shell timeouts, and recoverable session interruptions are not reasons to stop the workflow. Wait, recover, or resume the same active session/lane using the appropriate packaged tooling and continue until the full workflow closes.
39
+ - **A paused session is not an invitation to launch a new one.** Rate limits, slow turns, shell timeouts, tmux interruptions, and waiting states are recovery conditions — stay in the same session. Only launch a new session (in the same lane) if the existing session absolutely cannot be revived: the `sid` is lost, tmux state is unrecoverable, and transcript/session lookup fails. Context exhaustion is a valid reason for a new session only after the existing session's context is genuinely exhausted and cannot continue productively.
27
40
  - Record the reason for any new lane in `../.ai/metadata.json` and Beads before using it.
41
+ - If a phase is deliberately reopened for repair, pause before the next implementation/evaluator turn and ask the user whether to continue with the latest viable lane session or replace it with a new session. Do not assume this choice on reopen.
28
42
  - Before every developer or Claude turn, confirm `../.ai/metadata.json` names the active lane/session you are about to use.
29
43
  - After every developer or Claude turn, update metadata with the latest active lane/session, last turn purpose, result status, and any phase/report/session fields that changed.
30
44
  - Register every implementation session used in metadata and Beads with session id, lane name, purpose, status, and recovery/replacement relationship when applicable.
31
45
  - Add a Beads `SESSION:` comment for launch/resume/replacement.
32
46
  - Add `VERIFY:`, `ISSUE:`, or `HANDOFF:` comments when a turn produces accepted evidence, defects, or a work handoff.
33
47
  - Phase 4 starts and uses the dedicated bugfix/fix-check implementation lane for remediation and verification guidance; the original development lane is closed for normal implementation at development completion.
34
- - Phase 5 uses fresh evaluator sessions for full audits; the only evaluator-session reuse is scoped fix-check for a kept Partial Pass report.
35
- - Product fixes from Phase 5 final audits go through the dedicated evaluation bugfix/fix-check implementation lane.
48
+ - Phase 5 uses fresh evaluator sessions for full audits. Evaluator-session reuse occurs in three cases: (1) fail-regeneration: the same session receives only the regeneration prompt after fixes; (2) Partial Pass fix-check: the same session verifies all scoped issues are fixed; (3) Pass-with-items fix-check: the same session verifies all scoped recommendations are closed.
49
+ - Audit Cycle 1 and Audit Cycle 2 fixes go through the dedicated bugfix lane. After both audit cycles complete, the bugfix lane is closed and a new test-coverage/final-reconciliation lane takes over for coverage/README remediation.
36
50
  - Phase 6 reconciliation uses the currently active developer/Claude lane for Docker, runtime, browser, account, `run_tests.sh`, coverage, README, and final readiness fixes unless a new lane is explicitly justified by context limits or isolation risk.
37
51
  - If the owner directly changes docs, wrappers, config, cleanup, or light glue, send the active lane a minimal acknowledgement request that names the changed surface and asks it to inspect/confirm before readiness continues.
38
52
 
@@ -12,6 +12,7 @@ Use this skill during `Phase 3: Development` before prompting the active develop
12
12
  - Continue using the same developer/Claude session established during planning.
13
13
  - Development starts with scaffold/baseline work.
14
14
  - After scaffold is accepted, proceed section by section or module by module from `./docs/design.md` and `./docs/api-spec.md` when applicable.
15
+ - **All development iteration, testing, and verification uses the local test harness only.** Docker and `run_tests.sh` are deferred to Phase 6/7.
15
16
  - The owner may use `../.ai/plan.md` privately to check scope, tests, coverage, discoverability, functionality, and sequencing.
16
17
  - Never mention `../.ai/plan.md`, the existence of an internal plan, Beads, metadata, phase names, workflow mechanics, or private checks to the developer/Claude lane.
17
18
  - Developer-facing references are limited to visible project context such as `./docs/design.md`, `./docs/api-spec.md`, `./docs/questions.md`, `./repo`, and the current human prompt.
@@ -23,12 +24,14 @@ Prompt like a human developer working with an AI coding assistant.
23
24
  Prompt one bounded slice at a time. The preferred unit is one phase-purpose, scaffold, module, work package, or fix batch. At most combine two adjacent tightly coupled slices in one prompt, and only when splitting them would make the work less coherent. Never send all phases, the full private plan, or a start-to-finish workflow packet to the developer/Claude lane.
24
25
 
25
26
  Use direct wording such as:
26
- - `I checked the user module and found a missing authorization test. Please add that and rerun the relevant tests.`
27
- - `Continue with the invoice module. Build the create/list/detail flow against the existing product contract and cover the main success and validation paths.`
28
- - `The scaffold looks mostly right, but the README still describes a command that does not exist. Fix the README and rerun the local smoke check.`
27
+ - I checked the user module and found a missing authorization test. Please add that and rerun the relevant tests.
28
+ - Continue with the invoice module. Build the create, list, and detail flow against the existing product contract and cover the main success and validation paths.
29
+ - The scaffold looks mostly right, but the README still describes a command that does not exist. Fix the README and rerun the local smoke check.
29
30
 
30
31
  Do not send robotic process language. Do not require a specific response format. Do not repeat standing instructions every turn. Do not dump, name, summarize, or mention the private plan. Give only the current objective, broad module/surface area, discovered issues, and useful verification request.
31
32
 
33
+ **No markdown formatting in prompts.** Write plain English sentences with normal grammar and punctuation. Do not use bullet lists, numbered lists, bold, backticks, dashes as list markers, or any markdown syntax when writing to the developer lane. Structured markdown belongs only in verbatim-pasted packaged prompts and evaluator instructions, not in direct human prompts to the implementation session.
34
+
32
35
  Do not keep restating visible doc paths in routine follow-up prompts when the same session already knows the project contract. It is fine to say `existing product contract`, `accepted docs`, or simply name the module. Mention exact doc paths only when orienting a new session, resolving confusion, or asking for a final contract check.
33
36
 
34
37
  For larger module slices, group expectations by user/business behavior instead of turning every endpoint, field, and negative case into a long checklist. Ask for real backend-backed behavior, visible UI states, and meaningful success/failure tests, but keep the wording natural. If a module is too large to explain without becoming a checklist packet, split it into smaller sequential prompts.
@@ -36,18 +39,16 @@ For larger module slices, group expectations by user/business behavior instead o
36
39
  Example of a good larger module prompt:
37
40
 
38
41
  ```text
39
- Continue with inventory parcel intake and production planning materials.
40
-
41
- Build these as real backend-backed workflows, not static screens. Parcels should cover intake, duplicate detection, import status/errors, photo-label uploads, print-label job outcomes, revisions, closing behavior, pickup-code reuse, and audit history. Materials/planning should cover materials, UOM conversions, lots, BOM versioning/effective windows, approvals, substitutes, yield loss, and costing based on the latest received lot costs with missing-cost handling.
42
+ Continue with inventory parcel intake and production planning materials. Build these as real backend-backed workflows, not static screens. Parcels should cover intake, duplicate detection, import status and errors, photo and label uploads, print label job outcomes, revisions, closing behavior, pickup code reuse, and audit history. Materials and planning should cover materials, UOM conversions, lots, BOM versioning with effective windows, approvals, substitutes, yield loss, and costing based on the latest received lot costs with proper handling for missing costs.
42
43
 
43
- On the Angular side, make the inventory and materials workspaces feel complete: loading, empty, validation, submitting, duplicate, printing, missing-cost, success, and error states should all be visible where they matter.
44
+ On the frontend, make the inventory and materials workspaces feel complete with loading, empty, validation, submitting, duplicate, printing, missing cost, success, and error states where they matter.
44
45
 
45
- Add focused tests for the important happy paths and failure paths, especially duplicate parcels, pickup-code collisions/reuse, invalid phone/tracking/import rows, printer unavailable, invalid UOM conversions, BOM effective-window conflicts, approval audit trails, latest-lot costing, and missing-cost behavior. Run the targeted checks when you're done.
46
+ Add focused tests for the important happy paths and failure paths, especially duplicate parcels, pickup code collisions and reuse, invalid phone, tracking, or import rows, printer unavailable scenarios, invalid UOM conversions, BOM effective window conflicts, approval audit trails, latest lot costing, and missing cost behavior. Run the targeted checks when you are done.
46
47
  ```
47
48
 
48
- When sending issues back, do not pass file names, line numbers, report snippets, or exact internal evidence unless the user explicitly asks for that. Keep it at the level of the module and behavior: `I found issues in the auth module. The access control case for other users' records is not covered properly, and the tests are missing that case.`
49
+ When sending issues back, do not pass file names, line numbers, report snippets, or exact internal evidence unless the user explicitly asks for that. Keep it at the level of the module and behavior: I found issues in the auth module. The access control case for other users' records is not covered properly, and the tests are missing that case.
49
50
 
50
- Do not say `the review found`, `the evaluation found`, or `the audit found`. The owner should speak naturally: `I checked this and found...`.
51
+ Do not say the review found, the evaluation found, or the audit found. The owner should speak naturally: I checked this and found something.
51
52
 
52
53
  ## Development Sequence
53
54
 
@@ -63,26 +64,37 @@ Do not say `the review found`, `the evaluation found`, or `the audit found`. The
63
64
 
64
65
  3. **Proceed module by module.**
65
66
  - Select the next section/module from `./docs/design.md` and the private plan.
67
+ - Before prompting, consult `../.ai/plan.md` for the module's expected test coverage, endpoints, E2E flows, acceptance criteria, and any specific cases the plan lists. Use those details to ask for specific cases in the prompt.
66
68
  - Prompt the developer using the docs only, one module/work package at a time by default.
67
69
  - Ask for the implementation and the relevant tests/checks for that module.
68
70
  - Combine two adjacent modules/work packages only when they share the same user flow or data contract and are easier to verify together.
69
71
 
70
- 4. **Owner checks after each module.**
71
- - Inspect changed files manually.
72
- - Compare behavior against the original product prompt in `./metadata.json`.
73
- - Compare behavior against `./docs/design.md` and `./docs/api-spec.md`.
74
- - Privately compare against `../.ai/plan.md` for tests, coverage, discoverability, functionality, and module completeness.
75
- - Run targeted checks when practical.
76
- - Send missing tests, improper functionality, design drift, failed checks, or integration gaps back to the same session in broad module/product language.
77
- - Do not move to the next module until the current module is acceptably resolved or a concrete blocker is recorded.
72
+ 4. **Owner verifies each module before moving on.**
73
+ - Inspect changed files manually for orphaned or disconnected code.
74
+ - Verify implementation against the original prompt, `./docs/design.md`, and `./docs/api-spec.md`.
75
+ - Compare against `../.ai/plan.md` as the reference: check that every planned test case, endpoint, and E2E flow from the plan for this module is present and passing. Cross-reference the module's row in the plan's ordered work packages, API coverage matrix, FE-BE integration matrix, and risk/negative coverage matrix.
76
+ - Run the specific local tests for the module's changed files and confirm they pass. If no targeted test exists for the changed surface, that is itself a gap to send back.
77
+ - **Start the application locally (not Docker) and verify it is reachable.** For web/fullstack projects, confirm the dev server starts and exercise at least one real flow through the module you just accepted. For API-only projects, confirm the server starts and hit at least one endpoint from the module. If the app does not start or the module is unreachable, the module is not complete -- send it back with the exact failure.
78
+ - Verify the implementation is truly wired: imports resolve, routes register, frontend components render and connect, API calls reach real handlers, data flows through real persistence or state paths. Do not accept disconnected files, unused routes, stub-only components, or fake-success paths.
79
+ - Verify the behavior is real, not a placeholder, shell route, hardcoded response, static demo data, or stub that pretends to be real.
80
+ - **Verify cross-module integration tests exist.** When the new module connects to previously built modules, confirm the developer wrote integration tests proving data and behavior flow between them. If no cross-module tests exist, send that back as a gap.
81
+ - Send issues back to the same session in natural language without file paths, line numbers, or report names.
82
+ - Do not move to the next module until the current module is acceptably resolved or a concrete blocker is recorded.
78
83
 
79
84
  5. **Repeat until development is complete.**
80
85
  - Keep corrections in the same active session unless a concrete context/recovery reason requires otherwise.
81
86
  - Record session turns, artifacts, verification evidence, issues, and handoffs in metadata and Beads.
82
87
 
83
- 6. **Ask for final implementation self-check.**
84
- - Once all modules are completed and the developer claims the implementation is ready, ask the same session one final broad check.
85
- - The prompt should ask them to compare the implementation against `./docs/design.md` and `./docs/api-spec.md` when applicable, verify whether everything is complete, and report any gaps they find.
88
+ 6. **Run a full requirements integrity sweep before the final self-check.**
89
+ - Re-read the original prompt from `./metadata.json` and compare every explicit core requirement against the implementation. Do not assume the plan or design captured everything — the prompt is the source of truth. If any prompt requirement is missing from the implementation, that is a gap regardless of whether the plan or design listed it.
90
+ - Open `../.ai/plan.md` and run the no-orphan ledger against the full implementation. Every requirement in the ledger must map to a delivered surface in code, a test, a documented behavior, or have an accepted not-applicable reason with user confirmation. Do not rely on memory — go through each ledger item one by one.
91
+ - Cross-reference the design's requirement mapping table (section 2.1 in `./docs/design.md`) against the implementation. Every row should be addressable in the codebase.
92
+ - If any requirement is missing from the implementation (whether from the prompt, the ledger, or the design table), send it to the developer session in natural language before the final self-check. Do not batch everything into the self-check — route missing requirements to the active session as separate fix work.
93
+ - Only proceed to the final self-check when the ledger is clean, all explicit prompt requirements are addressed, and every exception is explicitly recorded and risk-accepted.
94
+
95
+ 7. **Ask for final implementation self-check.**
96
+ - Once all modules are completed, the ledger is clean, and the developer claims the implementation is ready, ask the same session one final broad check.
97
+ - The prompt should ask them to compare the implementation against `./docs/design.md`, `./docs/api-spec.md`, and the requirements shared in Step 2 of planning. Verify whether everything is complete and report any gaps they find.
86
98
  - Also ask for the startup commands and the expected user/API flows to exercise the app.
87
99
  - Keep the message human and simple. Do not mention internal plans or their existence, phases, workflow mechanics, evaluation, or hidden state.
88
100
 
@@ -96,21 +108,28 @@ Also give me the startup commands and the main flows I should expect to exercise
96
108
 
97
109
  ## Owner Review Checklist
98
110
 
99
- For each scaffold/module, check:
111
+ For each scaffold/module, the owner must verify:
100
112
  - changed files are integrated, referenced, and not orphaned
101
113
  - implementation matches `./docs/design.md`
102
114
  - API/interface behavior matches `./docs/api-spec.md` when applicable
103
115
  - private plan rows for the module are closed or have concrete accepted exceptions
104
116
  - no-orphan ledger items assigned to the module are closed
105
- - project-specific behavior is real, not placeholder/shell/demo-only behavior
106
- - tests exist for the implemented behavior or a concrete exception is recorded
107
- - planned API/interface proof is present when the module owns endpoints/interfaces, with true no-mock HTTP/API endpoint tests where applicable
108
- - frontend unit tests are directly detectable and import/render real frontend components/modules when the module owns frontend behavior
109
- - planned FE-BE proof is present when the module crosses frontend/backend boundaries
110
- - failure, validation, authorization, ownership, empty, loading, error, and duplicate/re-entry cases are covered where relevant
111
- - frontend/backend wiring is real where applicable
117
+ - project-specific behavior is real, not placeholder/shell/demo-only/fake-success behavior
118
+ - **the application starts and runs locally** -- verify the dev server starts without crashing, the module is reachable, and at least one real flow works through the module you just accepted. Do not accept a module based on test output alone
119
+ - tests exist for the implemented behavior; run them and confirm they pass. If no test exists for the changed surface, that is a gap
120
+ - **cross-module integration tests exist** when the module connects to previously built modules -- confirm real data/behavior flow tests, not just file presence
121
+ - tests under `unit_tests/` and `API_tests/` are directly runnable from those directories — not build-tag-gated evidence copies, compile-time-only files, or infrastructure checks that only verify file counts or presence. Every test must exercise and verify specific business behavior
122
+ - planned API/interface proof is present when the module owns endpoints/interfaces, with true no-mock HTTP/API endpoint tests where applicable; run the relevant API tests and confirm they pass. API test assertions must verify exact expected state transitions, status codes, and response bodies not permissive "accept any valid outcome" checks
123
+ - frontend unit tests are directly detectable and import/render real frontend components/modules when the module owns frontend behavior; run them and confirm they pass
124
+ - planned FE-BE proof is present when the module crosses frontend/backend boundaries; exercise the flow locally and confirm real data reaches real handlers
125
+ - E2E tests cover every prompt requirement for the module, not just the main happy path. Each E2E test must assert business outcomes — state changes, data persistence, authorization enforcement, task closure — not just confirm pages render. Decorational E2E tests that only check page loads are insufficient and must be sent back as gaps
126
+ - failure, validation, empty, loading, error, and duplicate/re-entry cases are covered where relevant
127
+ - logs are meaningful and support troubleshooting, not absent or random print noise
128
+ - all applicable security surfaces are individually addressed: authentication, route authorization, object-level authorization, function-level authorization, tenant/user data isolation, and admin/internal/debug endpoint protection. Each surface must have enforcement visible in code and tests
129
+ - frontend/backend wiring is real: imports resolve, routes register, API calls reach real handlers, data flows through real persistence or state, components render in the app context not just in isolation
130
+ - pages are connected to each other and interaction flows complete through to task closure, not just isolated screens with static outcomes
112
131
  - README changes match delivered runtime, commands, auth/no-auth, seed/demo data, verification behavior, mock/local/debug boundaries, and strict startup/access gates
113
- - targeted checks ran or were clearly blocked
132
+ - local tests for the module's changed surfaces were run and passed before acceptance
114
133
 
115
134
  ## Internal Plan Alignment
116
135
 
@@ -127,7 +146,7 @@ Check the relevant module/work package against:
127
146
  - module acceptance checklist
128
147
  - integration and hardening notes
129
148
 
130
- If the implementation misses one of those expectations, translate it into a normal human issue for the developer without file/line references. Example: `I checked the invoice work. The create flow is there, but the missing-amount validation is not covered and the list still behaves like static data. Please wire it properly and add the missing test.`
149
+ If the implementation misses one of those expectations, translate it into a normal human issue for the developer without file/line references. Example: I checked the invoice work. The create flow is there, but the missing-amount validation is not covered and the list still behaves like static data. Please wire it properly and add the missing test.
131
150
 
132
151
  ## Completion Standard
133
152
 
@@ -137,6 +156,8 @@ Accept development only when:
137
156
  - scaffold is accepted
138
157
  - all planned modules/sections are implemented or have accepted not-applicable reasons
139
158
  - module-level issues found by owner review are resolved
159
+ - **the application was started and verified locally at every module boundary** -- the app must compile, start, and serve real behavior (not crash or serve static shells) before each module acceptance
160
+ - **cross-module integration tests exist for every module pair that shares a data, API, or UI boundary** -- tests must prove real behavior flow, not just file presence
140
161
  - the final implementation self-check has been requested from the same active session and any reported issues have been fixed or recorded as concrete risks
141
162
  - startup commands and expected local flows have been collected from the developer session
142
163
  - targeted tests/checks for changed surfaces have passed or are honestly blocked
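
Reviewer note on the no-orphan ledger rule added in step 6 above: the closure condition (a delivered code surface, a test, a documented behavior, or a not-applicable reason with user confirmation) can be sketched as a small check. This is a minimal illustration only. The `LedgerItem` structure and its field names are hypothetical, because the actual ledger format in `../.ai/plan.md` is not shown in this diff.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LedgerItem:
    """Hypothetical ledger row; field names are assumptions, not the package's format."""
    requirement: str
    code_surfaces: List[str] = field(default_factory=list)
    tests: List[str] = field(default_factory=list)
    documented_behavior: str = ""
    not_applicable_reason: str = ""
    user_confirmed: bool = False


def is_closed(item: LedgerItem) -> bool:
    """An item is closed if it maps to code, a test, or documented behavior,
    or if it has a not-applicable reason that the user confirmed."""
    if item.code_surfaces or item.tests or item.documented_behavior:
        return True
    return bool(item.not_applicable_reason) and item.user_confirmed


def open_items(ledger: List[LedgerItem]) -> List[str]:
    """Return the requirements that still need to go back to the developer session."""
    return [item.requirement for item in ledger if not is_closed(item)]
```

Under these assumptions, the final self-check would only be requested once `open_items(...)` returns an empty list. A not-applicable reason without user confirmation deliberately stays open.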
@@ -34,7 +34,7 @@ Use this internal extraction schema for every issue/recommendation:
34
34
  ## Human Handoff
35
35
 
36
36
  When sending issues to a developer lane:
37
- - speak as yourself: `I found these issues...`
37
+ - speak as yourself, say I found these issues
38
38
  - do not say the evaluator, audit, or report found them
39
39
  - group by broad module/product area
40
40
  - do not include line numbers, file paths, report names, exact citations, or evaluator mechanics
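
Reviewer note on the handoff rules in this hunk: the ban on file paths, line numbers, and "the audit found" phrasing could be checked before a message is sent. The sketch below is a heuristic illustration, not part of the package. The function name `handoff_leaks` and the regular expressions are assumptions, and they will miss some cases (for example, bare report file names without a directory).

```python
import re

# Hypothetical heuristics; tune per project. They are not a complete filter.
_PATTERNS = {
    # e.g. src/auth/guard.ts or ./docs/design.md
    "file path": re.compile(r"(?:\.{0,2}/)?(?:[\w.-]+/)+[\w.-]+\.\w+"),
    # e.g. "line 42" or "guard.ts:42"
    "line number": re.compile(r"\blines?\s+\d+\b|\.\w+:\d+\b", re.IGNORECASE),
    # e.g. "the audit found", "the evaluator found"
    "report attribution": re.compile(
        r"\bthe (?:review|evaluation|audit|evaluator|report) found\b",
        re.IGNORECASE,
    ),
}


def handoff_leaks(message: str) -> list:
    """Return the sorted names of the rule categories the message appears to violate."""
    return sorted(name for name, pattern in _PATTERNS.items() if pattern.search(message))
```

An empty result means none of the heuristics fired. It does not prove the message is free of internal evidence, so the owner still needs to read the message before sending it.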
@@ -53,7 +53,7 @@ When a failed report is regenerated in the same evaluator session:
53
53
  - reject reports if the last ordinary audit send was not the exact saved send packet content
54
54
  - reject reports that are continuation-shaped, fix-only, stale against current files, or contradicted by current repo evidence
55
55
  - reject reports that drop required endpoint/surface inventory, hard-gate README review, severity panels, verdict blocks, or finding details expected by the underlying prompt
56
- - if rejected, send the full prepared evaluation prompt again instead of using the degraded report
56
+ - if rejected, archive every report or candidate report from the invalid cycle unchanged, record the reason, and restart that audit cycle from a fresh evaluator session using the installed prompt asset and exact saved send packet required by `final-evaluation-orchestration`
57
57
 
58
58
  ## Partial Pass Handling
59
59
 
@@ -64,25 +64,39 @@ When a failed report is regenerated in the same evaluator session:
64
64
  - Do not narrow fix-check to Blocker/High issues.
65
65
  - The same evaluator session that wrote the kept Partial Pass report performs the fix-check.
66
66
 
67
+ ## Pass With Items Handling
68
+
69
+ - A kept Pass report with any issue, recommendation, caveat, suggestion, action item, or requested change also requires closure.
70
+ - Save the kept Pass report under the same cycle audit report name: `./.tmp/audit_report-1.md` or `./.tmp/audit_report-2.md`.
71
+ - Extract every issue/recommendation/caveat/suggestion/action item/requested change from it.
72
+ - The fix-check scope is the full extracted set.
73
+ - The same evaluator session that wrote the kept Pass report performs the fix-check.
74
+ - Save the fix-check under the fixed cycle path: `./.tmp/audit_report-1-fix_check.md` or `./.tmp/audit_report-2-fix_check.md`.
75
+ - A Pass report with zero scoped items still requires the cycle fix-check report. The fix-check must explicitly confirm that the kept audit report had no scoped issues to close and that the cycle is clean.
76
+
67
77
  ## Fix-Check Handling
68
78
 
69
79
  - Read the full fix-check report.
70
80
  - If every scoped issue is fixed, save/keep the fix-check report.
71
81
  - If anything is not fixed or partially fixed, send only the unresolved behavior back to the developer lane in broad human language, then rerun the fix-check in the same evaluator session.
72
82
  - The regenerated fix-check must still address the full kept audit issue set, not only the unresolved items that were most recently sent back.
83
+ - Before sending the exact fix-check instruction, provide the same evaluator session with concise developer fix evidence, exact verification results when available, and the exact full audit-scoped issue list from the kept `audit_report-<N>.md`.
73
84
  - Reject fix-check reports that narrow the scope, skip low-priority kept issues, perform a broad new audit instead of scoped checking, or use history-exposing language.
74
85
  - Do not edit evaluator report text.
75
86
 
76
87
  ## Send Packet Validation
77
88
 
78
89
  Before accepting any full audit report:
79
- - confirm `prepare_evaluation_prompt.mjs` was used to build the prepared prompt for the right project type
80
- - confirm `prepare_evaluation_send_packet.mjs` was used to build the exact send packet
90
+ - confirm `prepare_evaluation_send_packet.mjs` was used to build the exact send packet, including {prompt} interpolation and report path insertion
91
+ - record both the installed prompt asset path and the exact saved send packet path
81
92
  - read the saved send packet before sending
82
- - confirm the evaluator received the exact saved send packet content, not a summary, footer, file reference, or shortened prompt
83
- - record prepared prompt path, send packet path, report path, and evaluator session id in metadata and Beads
93
+ - confirm the evaluator received the exact saved send packet content, not a summary, footer, file reference, shortened prompt, or owner-authored replacement
94
+ - confirm the saved send packet contains the complete installed evaluation prompt content, with nothing omitted from the installed asset and no owner-added text outside the packet
95
+ - record installed prompt asset path, prepared prompt path, send packet path, report path, and evaluator session id in metadata and Beads
96
+
97
+ If any part cannot be confirmed, reject the report, archive every report or candidate report from the invalid cycle unchanged, and restart that audit cycle from a fresh evaluator session with a valid installed prompt asset plus saved send packet.
84
98
 
85
- If any part cannot be confirmed, reject the report and rerun with a valid saved send packet.
99
+ For failed-report regeneration, do not build or improvise a new prompt. Use only the exact fail-regeneration prompt from `final-evaluation-orchestration`, verbatim and with no preface, suffix, issue list, fix evidence, or extra wording.
86
100
 
87
101
  ## State Updates
88
102