theslopmachine 1.0.13 → 1.0.22

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (39)
  1. package/assets/agents/developer.md +6 -7
  2. package/assets/agents/slopmachine-claude.md +66 -9
  3. package/assets/agents/slopmachine.md +68 -9
  4. package/assets/claude/agents/developer.md +5 -1
  5. package/assets/skills/clarification-gate/SKILL.md +56 -20
  6. package/assets/skills/claude-worker-management/SKILL.md +14 -4
  7. package/assets/skills/deep-retrospective/SKILL.md +179 -0
  8. package/assets/skills/deep-retrospective/run.py +446 -0
  9. package/assets/skills/deep-retrospective/workflow-reference.md +240 -0
  10. package/assets/skills/developer-session-lifecycle/SKILL.md +18 -4
  11. package/assets/skills/development-guidance/SKILL.md +52 -31
  12. package/assets/skills/evaluation-triage/SKILL.md +21 -7
  13. package/assets/skills/final-evaluation-orchestration/SKILL.md +92 -28
  14. package/assets/skills/integrated-verification/SKILL.md +38 -42
  15. package/assets/skills/p8-readiness-reconciliation/SKILL.md +31 -10
  16. package/assets/skills/planning-gate/SKILL.md +10 -7
  17. package/assets/skills/planning-guidance/SKILL.md +60 -52
  18. package/assets/skills/retrospective-analysis/SKILL.md +172 -58
  19. package/assets/skills/scaffold-guidance/SKILL.md +18 -6
  20. package/assets/skills/submission-packaging/SKILL.md +11 -3
  21. package/assets/slopmachine/clarifier-agent-prompt.md +7 -6
  22. package/assets/slopmachine/exact-readme-template.md +8 -12
  23. package/assets/slopmachine/owner-verification-checklist.md +1 -1
  24. package/assets/slopmachine/phase-1-design-prompt.md +5 -10
  25. package/assets/slopmachine/phase-1-design-template.md +15 -11
  26. package/assets/slopmachine/phase-2-execution-planning-prompt.md +5 -2
  27. package/assets/slopmachine/phase-2-plan-template.md +14 -4
  28. package/assets/slopmachine/scaffold-playbooks/shared-contract.md +2 -1
  29. package/assets/slopmachine/templates/AGENTS.md +3 -1
  30. package/assets/slopmachine/templates/CLAUDE.md +3 -1
  31. package/assets/slopmachine/test-coverage-prompt.md +8 -1
  32. package/assets/slopmachine/utils/README.md +1 -5
  33. package/assets/slopmachine/utils/claude_live_common.mjs +2 -5
  34. package/assets/slopmachine/utils/prepare_evaluation_send_packet.mjs +3 -3
  35. package/package.json +1 -1
  36. package/src/constants.js +0 -9
  37. package/src/init.js +17 -24
  38. package/src/install.js +30 -28
  39. package/assets/slopmachine/utils/prepare_evaluation_prompt.mjs +0 -81
@@ -9,10 +9,11 @@ Use this skill only during Phase 8 Retrospective, after Phase 7 Submission Packa
9
9
 
10
10
  ## Purpose
11
11
 
12
- - inspect what happened across the whole workflow run
12
+ - inspect what happened across the whole workflow run with deep evidence review
13
13
  - identify what caused churn, waste, late defects, or preventable corrections
14
+ - assess the quality of the work done at every phase
14
15
  - capture lessons that should improve future runs
15
- - write package-specific retrospective files under `/Users/yohannesakd/slopmachine/retrospectives/`
16
+ - write package-specific retrospective files under the installed SlopMachine assets directory
16
17
 
17
18
  ## Phase role
18
19
 
@@ -22,62 +23,166 @@ Use this skill only during Phase 8 Retrospective, after Phase 7 Submission Packa
22
23
  - it does not rerun broad verification by default
23
24
  - it should not reopen development unless it finds a real defect in the already-packaged result
24
25
 
25
- ## Output location
26
+ ## Mandatory Evidence Review
26
27
 
27
- Write run-scoped retrospective files under:
28
+ Before writing the retrospective, the owner must read all of the following evidence sources. Skim nothing — every artifact must be inspected.
28
29
 
29
- - `/Users/yohannesakd/slopmachine/retrospectives/`
30
+ ### 1. Metadata and State Transitions
30
31
 
31
- Preferred filenames:
32
+ Read `../.ai/metadata.json` in full. Track:
33
+ - every phase transition and its timestamp
34
+ - every session created (develop, bugfix, test-coverage, evaluator)
35
+ - every session handoff and closure event
36
+ - every Beads entry ID recorded
37
+ - the run_id, current_phase history, and awaiting_human flags
32
38
 
33
- - `retrospective-<run_id>.md`
34
- - `improvement-actions-<run_id>.md`
39
+ Note the elapsed time between phase transitions. A long gap between phases with no state changes indicates owner delay or blocking.
40
+
41
+ ### 2. Beads Comments
35
42
 
36
- If only one file is needed, the retrospective file is sufficient.
43
+ Read all Beads entries for the run. Extract every:
44
+ - `ARTIFACT:` comment — what was produced and when
45
+ - `ISSUE:` comment — what was found and where
46
+ - `SESSION:` comment — lane creation, handoff, closure
47
+ - `DECISION:` comment — owner acceptance, rejection, risk acceptance
48
+ - `VERIFY:` comment — verification evidence recorded
37
49
 
38
- The `run_id` must come from the current project's `../.ai/metadata.json` so the retrospective can be matched back to one exact workflow run.
50
+ Count the total number of each. High `ISSUE:` counts in late phases (4-7) indicate weak early prevention. Low `VERIFY:` counts may indicate insufficient owner-side checking.
39
51
 
40
- ## Evidence sources
52
+ ### 3. Clarification and Requirements Artifacts
41
53
 
42
- Prefer existing workflow artifacts first:
54
+ Read `./docs/questions.md` and `../.ai/requirements-breakdown.md`. Check:
55
+ - how many clarifying questions were asked and resolved
56
+ - whether the requirements breakdown was deep enough
57
+ - whether the faithfulness review surfaced material drift
43
58
 
44
- - root metadata
45
- - questions/clarification record
46
- - clarification prompt
47
- - planning artifacts
48
- - Beads comments and transitions
49
- - developer-session handoffs
50
- - review and rejection history
51
- - verification gate notes
52
- - `./.tmp/` audit and fix-check reports
53
- - packaging checks
59
+ ### 4. Design and Planning Artifacts
54
60
 
55
- Do not reread the entire codebase unless a real inconsistency requires it.
56
- Do not rerun broad Docker or full-suite verification just for retrospective analysis.
61
+ Read `./docs/design.md`, `./docs/api-spec.md`, and `../.ai/plan.md`. Check:
62
+ - whether the design mapped every requirement to a surface (section 2.1)
63
+ - whether the API spec inventoried every endpoint
64
+ - whether the no-orphan ledger in the plan was complete
65
+ - whether the plan had a development prompt queue that was actually followed
57
66
 
58
- ## Required retrospective sections
67
+ ### 5. Development Session Records
59
68
 
60
- 1. outcome summary
61
- 2. what worked well
62
- 3. what caused waste or looping
63
- 4. what was caught too late
64
- 5. findings by phase
65
- 6. findings by instruction plane:
66
- - owner shell
67
- - developer prompt
68
- - skills
69
- - task-root rulebook file such as `./AGENTS.md` or `./CLAUDE.md`
70
- 7. late-finding origin table
71
- 8. actionable improvements
69
+ From metadata, identify every develop session. Review session transcripts when available. Track:
70
+ - how many scaffold iterations were needed
71
+ - how many module prompts were sent
72
+ - how many fix rounds were required per module
73
+ - whether the final self-check found gaps (and if so, how many)
74
+ - whether the owner ran the requirements integrity sweep
72
75
 
73
- ## Late-finding origin table
76
+ ### 6. Integrated Verification (Phase 4) Evidence
77
+
78
+ Read `../.ai/consolidated-internal-issues.md`. Count:
79
+ - total issues from the owner plan-based review
80
+ - total issues from each of the 5 evaluator passes
81
+ - issue severity distribution (blocker, high, medium, low)
82
+ - which modules had the most issues
83
+ - whether any issues repeated across passes (same root cause discovered multiple times)
84
+
85
+ Read every report under `../.ai/internal-verification/`. For each:
86
+ - note the verdict
87
+ - count the issues per severity
88
+ - compare against the consolidated file for completeness
89
+ - check whether any archived report's issues were never extracted
90
+
91
+ ### 7. Final Evaluation (Phase 5) Evidence
92
+
93
+ Read every kept report under `./.tmp/`:
94
+ - `audit_report-1.md` and `audit_report-1-fix_check.md`
95
+ - `audit_report-2.md` and `audit_report-2-fix_check.md`
96
+ - `test_coverage_and_readme_audit_report.md`
97
+
98
+ For each audit report, record:
99
+ - the verdict (Pass / Partial Pass / Fail)
100
+ - how many Blocker and High issues were found
101
+ - whether the regenerated report was accepted or rejected
102
+ - how many fix-check rounds were needed to close all issues
103
+ - the final test coverage score
104
+
105
+ Read every archived report under `../.ai/archive/`. Count:
106
+ - how many reports were archived (failed reports, superseded reports, invalid-cycle reports)
107
+ - which cycles were restarted and why
108
+ - whether any restart was caused by a prompt-paste violation
109
+
110
+ ### 8. Readiness and Packaging (Phase 6-7) Evidence
111
+
112
+ Read `../.ai/metadata.json` for Phase 6-7 state. Check:
113
+ - whether Docker/runtime checks passed on first attempt or required fixes
114
+ - whether `agent-browser` checks passed
115
+ - whether any D1-D9 dimensions were risk-accepted
116
+ - whether packaging found and removed stale artifacts
117
+
118
+ ### 9. Developer Rulebook and Templates
119
+
120
+ Read `./AGENTS.md` or `./CLAUDE.md` (whichever was used). Check:
121
+ - whether the rulebook was adequate for the project type
122
+ - whether the developer followed the rulebook's implementation discipline
123
+ - whether any rulebook gaps contributed to issues
124
+
125
+ ## Required Retrospective Sections
126
+
127
+ The retrospective file must contain all of the following sections:
128
+
129
+ 1. **outcome summary** — what was delivered, was the prompt satisfied, overall verdict
130
+ 2. **run statistics** — phase durations, session counts, total issue counts by phase, fix round counts
131
+ 3. **what worked well** — which phases ran cleanly, which gates passed first try
132
+ 4. **what caused waste or looping** — which issues went through multiple fix cycles, why
133
+ 5. **what was caught too late** — issues first found in Phase 5-7 that should have been caught earlier
134
+ 6. **quality assessment by phase** — for each phase, rate the quality of the output (1-5) with evidence
135
+ 7. **findings by phase** — material issues discovered, how they were resolved
136
+ 8. **findings by instruction plane**: owner shell, developer prompts, skills, rulebook
137
+ 9. **late-finding origin table** — every late issue classified by where it should have been prevented
138
+ 10. **issue discovery timeline** — when each major issue was first found and by which mechanism
139
+ 11. **churn analysis** — which modules had the most fix rounds, which gates required the most iteration
140
+ 12. **token and time efficiency** — whether prompts were efficient, whether work was redundant
141
+ 13. **actionable improvements** — concrete changes to make before the next run
142
+
143
+ ## Reusable Scaffold Configuration
144
+
145
+ After a project successfully passes all Docker/runtime checks and `run_tests.sh` completes, the retrospective must capture the working scaffold configuration so it can be reused for future similar projects.
146
+
147
+ Extract and save the following under `<asset-root>/retrospectives/scaffold-config-<project_type>-<stack>.md`:
148
+
149
+ - the project type and stack used
150
+ - the working Dockerfile pattern (or reference to the playbook that produced it)
151
+ - the working docker-compose.yml structure (services, ports, volumes, healthchecks, profiles)
152
+ - the working run_tests.sh structure (build, start, test, cleanup stages)
153
+ - the working local test harness commands
154
+ - any deviations from the standard playbook that were necessary to make things work
155
+ - the verified ports, healthcheck endpoints, and init scripts
156
+
157
+ If the project used an existing playbook and the configuration worked without modification, note that the playbook was sufficient and register the confirmation. If modifications were needed, document exactly what was changed and why so the playbook can be improved for future runs.
158
+
159
+ This section ensures that every successful project's runtime configuration becomes institutional knowledge. A project that runs and tests cleanly should never require rediscovering its Docker setup from scratch on the next similar task.
160
+
161
+ ## Quality Assessment Per Phase
162
+
163
+ Rate each phase 1-5 on:
164
+
165
+ | Phase | Requirements clarity | Design fidelity | Plan rigor | Implementation quality | Owner review thoroughness |
166
+ |---|---|---|---|---|---|
167
+ | 1: Clarification | Was the requirements breakdown deep and faithful? | — | — | — | Was the faithfulness review honest? |
168
+ | 2: Planning | Did the design cover every requirement? | Was the API spec complete? | Was the no-orphan ledger exhaustive? | — | Was the REQ-### cross-reference thorough? |
169
+ | 3: Development | — | — | — | Was every module fully implemented? | Was the integrity sweep thorough? |
170
+ | 4: Verification | — | — | — | — | Were all evaluator passes reviewed? |
171
+ | 5: Evaluation | — | — | — | — | Were audit reports properly acted on? |
172
+ | 6: Readiness | — | — | — | — | Were Docker/runtime checks genuine? |
173
+ | 7: Packaging | — | — | — | — | Was the package boundary clean? |
174
+
175
+ Each rating must be backed by specific evidence from the artifact review above. A rating of 3 or below in any cell requires an explanation of what went wrong and a concrete improvement.
176
+
177
+ ## Late-Finding Origin Table
74
178
 
75
179
  For every material issue first surfaced in Phase 4, Phase 5, Phase 6, or Phase 7, classify where it should have been prevented.
76
180
 
77
181
  Use this table shape:
78
182
 
79
183
  | Finding | First surfaced in | Prompt required? | Accepted plan/design covered? | Origin classification | Fix belongs in |
80
- |---|---|---:|---:|---|---|
184
+ |---|---|---|---|---|---|
185
+ | | | yes/no | yes/no | | |
81
186
 
82
187
  Origin classifications:
83
188
 
@@ -88,32 +193,41 @@ Origin classifications:
88
193
  - `owner review miss`: implementation claimed closure but review accepted weak or unsupported evidence
89
194
  - `evaluation-only strictness`: the repo was broadly coherent, but the evaluator imposed a stricter interpretation than the prior workflow had required
90
195
 
91
- Do not classify a late finding by where it appeared.
92
- Classify it by the earliest artifact that should have made the correct behavior unavoidable.
196
+ Count how many findings fall into each origin classification. A cluster of `owner review miss` findings indicates the owner review checklist is insufficient. A cluster of `planning miss` findings indicates the planning templates need strengthening. A cluster of `development execution miss` findings indicates the developer rulebook or prompting style needs improvement.
197
+
198
+ Do not classify a late finding by where it appeared. Classify it by the earliest artifact that should have made the correct behavior unavoidable.
199
+
200
+ ## Issue Discovery Timeline
201
+
202
+ For every Blocker or High issue across all phases, record:
93
203
 
94
- Separate immediate repo remediation from workflow learning:
204
+ | Issue | Phase found | Discovery mechanism | Surface/module | Root cause | Fixed in |
205
+ |---|---|---|---|---|---|
206
+ | | | owner review / evaluator pass 2 / audit cycle 1 / etc. | | | develop / bugfix / test-coverage |
95
207
 
96
- - repo remediation answers what code/docs/tests must change now
97
- - workflow learning answers whether clarification, planning, development guidance, owner review, or evaluation routing should change for future runs
208
+ This timeline reveals whether most issues were found by owner review (good: early detection), internal evaluator loop (acceptable: mid-phase), or final evaluation (bad: too late).
98
209
 
99
- ## Audit buckets
210
+ ## Churn Analysis
100
211
 
101
- Evaluate at least these buckets in hindsight:
212
+ Count for each module or surface area:
213
+ - how many distinct issue batches were sent
214
+ - how many fix rounds were needed
215
+ - whether the same issue was reported multiple times
216
+ - whether fixes introduced new issues
102
217
 
103
- 1. prompt-fit
104
- 2. security-critical flaws
105
- 3. test sufficiency
106
- 4. major engineering quality
107
- 5. token/time waste
218
+ High churn on a single module indicates either the developer implementation was weak, the design was unclear, or the owner review was insufficient.
108
219
 
109
- For each meaningful finding, prefer:
220
+ ## Token and Time Efficiency
110
221
 
111
- - what happened
112
- - why it happened
113
- - where the fix belongs
114
- - how it should change future runs
222
+ From session transcripts and metadata:
223
+ - estimate whether prompts were concise or bloated
224
+ - count how many times the same information was repeated across turns
225
+ - note whether the developer was given too much or too little context
226
+ - identify any token-wasting patterns (restating rules, re-explaining context, verbose markdown)
115
227
 
116
- ## Rule for reopening work
228
+ ## Rule for Reopening Work
117
229
 
118
230
  - if retrospective finds a real packaging or delivery defect, reopen Phase 7 and fix it
119
- - if it finds only improvements, document them and close the retrospective phase
231
+ - if it finds systemic workflow issues (e.g., the owner review checklist consistently missed a class of problem), record them as improvement actions for the next run
232
+ - if it finds only non-blocking improvements, document them and close the retrospective phase
233
+ - never silently skip a finding just because it is inconvenient
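Reviewer note: the "Beads Comments" step in the hunk above asks for a count of each comment prefix, with extra attention to `ISSUE:` comments in Phases 4-7. A minimal sketch of that tally follows. The `(phase, text)` input shape and the sample comments are assumptions for illustration; the diff does not show the real Beads export format.

```python
from collections import Counter

# Comment prefixes named in the "Beads Comments" step of the skill.
PREFIXES = ("ARTIFACT:", "ISSUE:", "SESSION:", "DECISION:", "VERIFY:")


def tally_comments(comments):
    """Count comments per prefix and count ISSUE: comments in Phases 4-7.

    `comments` is an iterable of (phase, text) pairs. This shape is a
    hypothetical stand-in, not the real Beads data model.
    """
    totals = Counter()
    late_issues = 0
    for phase, text in comments:
        stripped = text.strip()
        for prefix in PREFIXES:
            if stripped.startswith(prefix):
                totals[prefix] += 1
                # The skill text treats Phases 4-7 as the late phases.
                if prefix == "ISSUE:" and 4 <= phase <= 7:
                    late_issues += 1
                break
    return totals, late_issues


# Hypothetical sample data for illustration only.
sample = [
    (3, "ARTIFACT: scaffold accepted"),
    (4, "ISSUE: missing auth check on export route"),
    (5, "ISSUE: README port mismatch"),
    (5, "VERIFY: local harness rerun after fix"),
]
totals, late_issues = tally_comments(sample)
print(dict(totals), late_issues)
```

What counts as a "high" or "low" total is left to the owner's judgment; the skill text gives no thresholds.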
@@ -11,6 +11,10 @@ Use this skill for the first development slice: the framework/runtime/test/READM
11
11
 
12
12
  Scaffold creates an honest base for later module work. It should bootstrap the selected framework and project structure without implementing project-specific business logic beyond a minimal proof surface.
13
13
 
14
+ **Two test paths are created at scaffold time and follow this lifecycle:**
15
+ 1. **Local test harness** (stack-native) — used during all development, verification, and remediation work (Phases 3-5). Fast, non-Docker, for iteration.
16
+ 2. **Dockerized `run_tests.sh`** — deferred to Phase 6/7. Not run during development. After all development is complete and verified locally, Docker and `run_tests.sh` are run for the first time and any issues fixed.
17
+
14
18
  The owner may use scaffold playbooks privately, but developer-facing prompts should reference only the docs and normal engineering instructions.
15
19
 
16
20
  ## Private Inputs
@@ -22,6 +26,10 @@ Use these owner-side inputs to shape the scaffold prompt:
22
26
  - scaffold playbooks under `~/slopmachine/scaffold-playbooks/`
23
27
  - current framework/library docs when scaffold commands or config depend on current behavior
24
28
 
29
+ The scaffold playbooks are the primary source for Docker and `run_tests.sh` setup. Before writing the scaffold prompt, the owner must read the relevant playbooks for the project's type, technology, and stack. Start with `shared-contract.md` for the universal scaffold rules, then read the type-specific playbook (e.g. `type-web-spa.md`), the tech-specific playbooks (e.g. `tech-frontend-react.md`, `tech-backend-koa.md`), and the stack-specific playbook if one matches (e.g. `stack-react-go-postgres.md`). Each playbook provides the exact Dockerfile, docker-compose.yml, run_tests.sh, and local test harness structure for that project class.
30
+
31
+ The owner extracts the Docker and runtime configuration details from these playbooks and translates them into the developer prompt in plain language. Do not tell the developer to read the playbooks or reference playbook names — just communicate the technical requirements.
32
+
25
33
  Do not tell the developer/Claude lane to read `../.ai/plan.md` or scaffold playbooks directly, and do not mention that the internal plan exists.
26
34
 
27
35
  ## Scaffold Prompt Shape
@@ -29,9 +37,9 @@ Do not tell the developer/Claude lane to read `../.ai/plan.md` or scaffold playb
29
37
  Use casual, direct language. Example:
30
38
 
31
39
  ```text
32
- Let's start with the scaffold. Set up the base project in ./repo for the stack described in docs/design.md. Keep this to the framework/runtime/test/README foundation and a minimal proof page or endpoint; don't add product-specific business logic yet.
40
+ Let's start with the scaffold. Set up the base project in ./repo for the stack described in docs/design.md. Keep this to the framework, runtime, tests, and README foundation with a minimal proof page or endpoint. Do not add product-specific business logic yet.
33
41
 
34
- Please include the local test harness, runtime files, run_tests.sh, and the README baseline so the next modules have a clean foundation.
42
+ The scaffold needs a working docker-compose.yml with a profile-gated test service, a run_tests.sh that runs the full test suite through Docker, a separate stack-native local test harness for fast implementation-time checks, a unit_tests directory for unit tests, an API_tests directory for API and integration HTTP tests, and a README baseline with startup, access, verification, and test commands. Make sure the local harness is separate from the Docker test path. The local harness is for fast iteration. run_tests.sh is the broad dockerized verification wrapper for later.
35
43
  ```
36
44
 
37
45
  Adjust the exact wording to the project. Do not over-format the message.
@@ -41,9 +49,11 @@ Adjust the exact wording to the project. Do not over-format the message.
41
49
  - coherent project structure for the selected stack
42
50
  - framework bootstrap wired enough for later module work
43
51
  - minimal proof surface such as a basic page, route, endpoint, or app shell
44
- - stack-native local test harness
45
- - product repo root `./repo/run_tests.sh` when required by the project contract
46
- - runtime/Docker files when relevant, wired honestly for later verification
52
+ - **documented local start command** (e.g., `npm run dev`, `go run ./cmd/server`, `docker compose up`) that starts the application for local development and verification. The developer must have started it and confirmed it works before reporting scaffold completion
53
+ - `unit_tests/` directory for unit tests and `API_tests/` directory for API/integration HTTP tests (both mandatory when the corresponding test surface exists)
54
+ - stack-native local test harness for fast implementation-time iteration
55
+ - product repo root `./repo/run_tests.sh` that runs the full test suite through Docker (`docker compose --profile test`)
56
+ - `./repo/docker-compose.yml` with a profile-gated test service, wired honestly for later verification
47
57
  - database/bootstrap/seed path when the product will require seeded data or persistent storage
48
58
  - README baseline with project type near the top, stack, primary startup/access command, legacy `docker-compose up` compatibility string where applicable, verification method, auth/no-auth, seeded/empty-state note, mock/local/debug disclosures, known limitations, and repo layout
49
59
  - no committed secrets, `.env`, `.env.example`, hidden host setup, no-op tests, or fake-success integration paths
@@ -63,7 +73,9 @@ After the scaffold turn:
63
73
  - inspect changed files manually
64
74
  - verify files are in `./repo` and integrated
65
75
  - check the stack is coherent and discoverable
66
- - check local test harness and `run_tests.sh` are meaningful
76
+ - verify `./run_tests.sh` is wired to Docker (references docker compose) and `docker-compose.yml` has a profile-gated test service
77
+ - check the local test harness is separate from the Docker test path
78
+ - **start the application locally using the documented start command and confirm it reaches a usable state** — if the app does not start, reject the scaffold and send it back with the error
67
79
  - check README baseline matches actual files and scripts
68
80
  - run the narrow local scaffold check when practical
69
81
  - record artifacts, evidence, issues, and acceptance in metadata and Beads
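Reviewer note: the new post-scaffold checks in the hunk above (that `run_tests.sh` references Docker Compose, and that `docker-compose.yml` has a profile-gated test service) can be approximated with a plain-text inspection. This is a rough sketch under stated assumptions: it matches text only, does not parse YAML, and the function name and sample file contents are hypothetical.

```python
import re


def check_scaffold_wiring(run_tests_text, compose_text):
    """Return a list of problems found by plain-text inspection."""
    problems = []
    # run_tests.sh should invoke Docker Compose (either CLI spelling).
    if not re.search(r"docker[ -]compose", run_tests_text):
        problems.append("run_tests.sh does not reference docker compose")
    # A profile-gated service carries a `profiles:` key in the Compose file.
    if not re.search(r"^\s*profiles:", compose_text, flags=re.MULTILINE):
        problems.append("docker-compose.yml has no profile-gated service")
    return problems


# Hypothetical file contents for illustration only.
run_tests_sh = "#!/bin/sh\nset -e\ndocker compose --profile test run --rm test\n"
compose_yml = (
    "services:\n"
    "  app:\n"
    "    build: .\n"
    "  test:\n"
    "    build: .\n"
    '    profiles: ["test"]\n'
)
print(check_scaffold_wiring(run_tests_sh, compose_yml))
```

A text match cannot confirm that the gated service is actually the test service, so the manual inspection described in the skill still applies.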
@@ -1,3 +1,8 @@
1
+ ---
2
+ name: submission-packaging
3
+ description: Phase 7 submission packaging, package-boundary validation, and final closure.
4
+ ---
5
+
1
6
  # Submission Packaging
2
7
 
3
8
  Use this skill for Phase 7 submission packaging and final closure.
@@ -29,8 +34,8 @@ Packaging must reject or remove stale workflow notes and scratch execution artif
29
34
  - `repo/README.md` is the primary product documentation.
30
35
  - `repo/run_tests.sh` is the broad verification wrapper.
31
36
  - Runtime docs follow packaged platform rules.
32
- - Unit tests use `unit_tests/` where applicable.
33
- - API/integration HTTP tests use `API_tests/` where applicable.
37
+ - Unit tests must use `unit_tests/`.
38
+ - API/integration HTTP tests must use `API_tests/`.
34
39
  - No `.env`, `.env.example`, secret-bearing examples, local-only setup residue, or hidden host assumptions may remain.
35
40
  - `repo/docker-compose.yml` must exist for container-supported deliveries when the runtime contract requires Compose; `compose.yaml` or `docker-compose.yaml` may not be the only Compose file.
36
41
  - `repo/init_db.sh` must exist when database dependencies exist and must reflect final schema/bootstrap needs.
@@ -49,12 +54,14 @@ Packaging must reject or remove stale workflow notes and scratch execution artif
49
54
  ## Required `.tmp` Report Shape
50
55
 
51
56
  - `.tmp/` must contain the kept immutable evaluator reports and fix-check reports required by the evaluation phase.
52
- - Normal 2-audit-session path must end with `audit_report-1.md`, `audit_report-2.md`, `test_coverage_and_readme_audit_report.md`, and the corresponding `audit_report-<N>-fix_check.md` files for every kept Partial Pass report that required fix-check.
57
+ - Normal 2-audit-session path must end with `audit_report-1.md`, `audit_report-1-fix_check.md`, `audit_report-2.md`, `audit_report-2-fix_check.md`, and `test_coverage_and_readme_audit_report.md`.
58
+ - A cycle fix-check file is never omitted. Even a Pass report with zero scoped issues requires a fix-check report stating that there were no scoped issues to close.
53
59
  - Do not leave archived failed reports, stale report variants, numbered coverage variants, owner reconciliation notes, or superseded reports in final `.tmp/`; archived lineage belongs under `../.ai/archive/`.
54
60
  - Read the final coverage/README report holistically as an acceptance signal and reconcile any repo/README/docs mismatch before closing packaging.
55
61
 
56
62
  ## Final Runtime And Test Confirmation
57
63
 
64
+ - `./repo/run_tests.sh` must always run through Docker. The owner defers all Dockerized tests and Docker builds to this phase — they were never run during development or earlier phases.
58
65
  - Phase 7 owns the final Docker/runtime confirmation and dockerized broad `./repo/run_tests.sh` confirmation when those commands are part of the delivered contract or when late fixes/packaging changes could affect runtime/test behavior.
59
66
  - Phase 7 also owns final browser/API manual confirmation when late fixes, README edits, cleanup, package boundary changes, or seed/config changes could affect user-visible behavior or documented seeded values.
60
67
  - If `./repo/README.md` documents `docker compose up --build` or `./repo/run_tests.sh`, treat those as package contract commands, not aspirational notes.
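Reviewer note: the revised "Required `.tmp` Report Shape" rule in the hunk above names five files for the normal 2-audit-session path. A minimal sketch of a presence check follows; the function name and sample listing are hypothetical, and files outside the required set are reported for manual review only, since the diff does not fully define what else may sit in `.tmp/`.

```python
# File names taken from the normal 2-audit-session rule in the skill text.
REQUIRED_REPORTS = {
    "audit_report-1.md",
    "audit_report-1-fix_check.md",
    "audit_report-2.md",
    "audit_report-2-fix_check.md",
    "test_coverage_and_readme_audit_report.md",
}


def check_tmp_reports(filenames):
    """Return (missing, other): required files that are absent, and files
    outside the required set that the owner should review by hand."""
    present = set(filenames)
    missing = sorted(REQUIRED_REPORTS - present)
    other = sorted(present - REQUIRED_REPORTS)
    return missing, other


# Hypothetical directory listing for illustration only.
listing = [
    "audit_report-1.md",
    "audit_report-1-fix_check.md",
    "audit_report-2.md",
    "test_coverage_and_readme_audit_report.md",
    "audit_report-2-old.md",
]
missing, other = check_tmp_reports(listing)
print(missing, other)
```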
@@ -91,6 +98,7 @@ Do not run broad Docker prune commands that can affect unrelated projects.
91
98
  ## Metadata And Naming
92
99
 
93
100
  - `./metadata.json` must truthfully describe the delivered project and contain only these seven project-fact keys: `prompt`, `project_type`, `frontend_language`, `backend_language`, `database`, `frontend_framework`, and `backend_framework`.
101
+ - `prompt` in `./metadata.json` must be the original product prompt captured during Phase 1, not a summary, clarified rewrite, workflow note, or prompt-plus-operator-context dump.
94
102
  - Normalize `project_type` to exactly one of the six accepted task classifications: `backend`, `fullstack`, `web`, `android`, `ios`, or `desktop`.
95
103
  - If a task/question id exists, use that exact id for final deliverable/archive naming without adding an extra `ID-` prefix.
96
104
  - Record the package-root manifest: workflow root, task root, package root, docs path, `.tmp` path, separate session handoff path if present, and any explicit validation exceptions.
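Reviewer note: the "Metadata And Naming" rules in the hunk above restrict `./metadata.json` to seven project-fact keys and `project_type` to six accepted values. A minimal sketch of that validation follows. The function name and sample payloads are hypothetical, and the check cannot verify that `prompt` is the original Phase 1 prompt; that still needs a manual comparison.

```python
import json

# The seven project-fact keys listed in the skill text.
ALLOWED_KEYS = {
    "prompt",
    "project_type",
    "frontend_language",
    "backend_language",
    "database",
    "frontend_framework",
    "backend_framework",
}
# The six accepted task classifications listed in the skill text.
PROJECT_TYPES = {"backend", "fullstack", "web", "android", "ios", "desktop"}


def check_metadata(raw_json):
    """Return a list of problems with a metadata.json payload."""
    data = json.loads(raw_json)
    problems = []
    keys = set(data)
    for key in sorted(ALLOWED_KEYS - keys):
        problems.append(f"missing key: {key}")
    for key in sorted(keys - ALLOWED_KEYS):
        problems.append(f"unexpected key: {key}")
    if data.get("project_type") not in PROJECT_TYPES:
        problems.append(f"invalid project_type: {data.get('project_type')!r}")
    return problems


# Hypothetical payload for illustration only.
good = json.dumps({
    "prompt": "Build a task tracker.",
    "project_type": "fullstack",
    "frontend_language": "TypeScript",
    "backend_language": "Go",
    "database": "PostgreSQL",
    "frontend_framework": "React",
    "backend_framework": "net/http",
})
print(check_metadata(good))
```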
@@ -234,13 +234,14 @@ Do not include any preface, explanation, summary, commentary, or planning notes
234
234
  ```md
235
235
  # Questions
236
236
 
237
- ## Clarification Entries
237
+ This file records only genuine original-prompt ambiguities that needed interpretation because they were unclear, incomplete, contradictory, or materially ambiguous.
238
238
 
239
- ### Q-001: [short question title]
240
- - **Ambiguity:** [what is unclear, missing, or open to interpretation]
241
- - **Prompt Basis:** [exact quote or reference from the original prompt]
242
- - **Impact:** [what would go wrong in later design or implementation if unresolved]
243
- - **Solution:** [decisive resolution with prompt-faithful default]
239
+ Use this structure for each real clarification item:
240
+
241
+ ### 1. <short clarification title>
242
+ - Question: <the exact ambiguity or missing detail that needed to be locked>
243
+ - My Understanding: <how the prompt was interpreted, why this was ambiguous, and why it matters>
244
+ - Solution: <the chosen prompt-faithful resolution or safe default, written decisively>
244
245
  ```
245
246
 
246
247
  Do not include requirement IDs, traceability fields, priority fields, or evaluator-risk metadata in `questions.md`. Those belong in `../.ai/requirements-breakdown.md` only.
@@ -206,6 +206,8 @@ Expected result:
206
206
 
207
207
  ## 9. Testing
208
208
 
209
+ All testing is Docker-contained. Do not document local test commands (`npm test`, `pytest`, `go test`, etc.) in the README — those are for implementation-time development only and must not appear in the reviewer-facing project documentation.
210
+
209
211
  ### Standard broad test command
210
212
 
211
213
  ```bash
@@ -214,17 +216,12 @@ Expected result:
214
216
 
215
217
  If `init_db.sh` is part of the standard test bootstrap, document that relationship clearly.
216
218
 
217
- ### Local verification harness
218
- - Document the separate local verification command(s) used for ordinary development and readiness checks only if they do not become required reviewer setup.
219
- - Make clear that these local verification commands are distinct from the dockerized `./repo/run_tests.sh` broad test path.
220
- - Use the real stack-native local suite for the chosen language/framework where applicable, for example Vitest, Jest, PHPUnit, pytest, go test, cargo test, or another framework-native equivalent.
221
- - Do not require reviewers to run manual installs or machine-level setup for the standard packaged verification path.
222
-
223
219
  ### Test entry points
224
- - Unit tests: `[command/path]`
225
- - Component/state tests: `[command/path]`
226
- - API/integration tests: `[command/path]`
227
- - E2E/platform tests: `[command/path]`
220
+ These are the Docker-contained test entry points invoked by `./run_tests.sh`:
+ - Unit tests: `[command/path inside Docker]`
+ - Component/state tests: `[command/path inside Docker]`
+ - API/integration tests: `[command/path inside Docker]`
+ - E2E/platform tests: `[command/path inside Docker]`
 
  ### What the test suite covers
  - [backend unit coverage summary]
@@ -233,8 +230,7 @@ If `init_db.sh` is part of the standard test bootstrap, document that relationsh
  - [E2E/platform coverage summary]
 
  ### Test notes
- - `./repo/run_tests.sh` is the broad test path for containerized verification when applicable.
- - Local verification commands are used for ordinary development iteration and readiness checks.
+ - `./repo/run_tests.sh` is the broad test path for containerized verification.
  - [Docker-contained notes if applicable]
  - [seed/fixture notes if applicable]
  - [known test constraints if any]
@@ -26,7 +26,7 @@ Reject only for material defects that would mislead development, evaluation, or
 
  - [ ] Platform runtime expectations follow the packaged runtime rules.
  - [ ] `./repo/run_tests.sh` is the broad test wrapper and is honest.
- - [ ] `unit_tests/` and `API_tests/` are used where applicable.
+ - [ ] `unit_tests/` and `API_tests/` are used.
  - [ ] Docker requirements are not incorrectly forced onto Android/iOS native runtime delivery.
  - [ ] Docker/runtime docs and files are repo-controlled, non-interactive, free of hidden `.env`/manual export dependencies, and statically credible for final confirmation.
  - [ ] A separate stack-native local harness exists for development/Phase 4, or the missing harness is explicitly user risk-accepted.
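The checklist item requiring `unit_tests/` and `API_tests/` lends itself to a mechanical pre-check. The following is a minimal sketch only: the helper name `missing_test_dirs` and the idea of passing the repository root as an argument are assumptions for illustration and are not part of the package.

```python
from pathlib import Path

# Directory names taken from the checklist item; the helper itself is hypothetical.
REQUIRED_TEST_DIRS = ("unit_tests", "API_tests")


def missing_test_dirs(repo_root):
    """Return the required test directories that do not exist under repo_root."""
    root = Path(repo_root)
    return [name for name in REQUIRED_TEST_DIRS if not (root / name).is_dir()]
```

An empty return value only shows that both directories exist. It says nothing about whether they contain meaningful tests, so it would complement a reviewer's judgment rather than replace it.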
@@ -1,13 +1,6 @@
- # Design Prompt
-
  You are helping create the product/system design for a software project.
 
- You will receive:
- - the original product prompt
- - stack/context information
- - accepted clarifications
- - accepted requirements
- - the design template
+ The original prompt, stack and context, accepted clarifications, and accepted requirements have already been provided in the previous steps. The design template is already seeded at ./docs/design.md.
 
  Your task is to write `./docs/design.md`.
 
@@ -23,7 +16,9 @@ The design must:
  - preserve the original business goal and required user outcomes
  - incorporate accepted clarifications and requirements without narrowing them
  - identify the project type, stack, actors, roles, main flows, modules, data, UI/API surfaces, security boundaries, assumptions, and verification strategy
- - define the testing contract as part of the visible design: every API/interface endpoint must have positive and negative true HTTP/API tests where a runtime endpoint exists, unit coverage must target 90%+ for meaningful business logic, frontend unit tests must be identifiable and must import/render real frontend components where a frontend exists, fullstack/web apps must prove frontend-to-backend behavior, and user-facing applications must include full E2E/platform coverage for the main user journeys unless a surface is genuinely not applicable
+ - require all unit tests under `unit_tests/` and all API/integration HTTP tests under `API_tests/` (both directories mandatory when the corresponding test surface exists)
+ - define the testing contract as part of the visible design: every API/interface endpoint must have positive and negative true HTTP/API tests where a runtime endpoint exists, unit coverage must target 90%+ for meaningful business logic, frontend unit tests must be identifiable and must import/render real frontend components where a frontend exists, fullstack/web apps must prove frontend-to-backend behavior
+ - require E2E/platform test coverage for every prompt requirement: not just main user journeys but every requirement, actor path, business rule, authorization rule, error state, and task-closure condition must have an identifiable E2E test or an explicit accepted not-applicable reason. E2E tests must exercise real application behavior and verify business outcomes, not just confirm pages render. An E2E test that only checks a page loads without asserting state changes, data persistence, or backend integration is decorative and must be rejected.
  - define README/runtime obligations that satisfy strict review: project type near the top, `docker compose up --build` as the primary startup command for container-supported deliveries, the legacy compatibility string `docker-compose up` without making it primary, access URL/port or platform launch method, verification method, auth/demo credentials for every role or the exact statement `No authentication required`, seeded data or empty-state statement, no manual runtime installs, no hidden `.env` dependency, mock/local/debug disclosures, and known limitations
  - make meaningful assumptions explicit
  - mark unresolved items only when a real decision is still needed
@@ -44,6 +39,6 @@ The design should define what the system must be and how its major parts fit tog
 
  ## Output
 
- Write the completed design to `./docs/design.md` using the provided template. If a section is not applicable, keep it brief and explain why.
+ Write the completed design to `./docs/design.md` using the template already seeded at that path. If a section is not applicable, keep it brief and explain why.
 
  If the project has meaningful APIs or interface contracts, say that `./docs/api-spec.md` should be completed next and summarize the API families that need specification.
@@ -48,9 +48,11 @@ List only items that do not shrink the original prompt.
 
  ## 5. Module Design
 
- | Module | Purpose | Owned behavior | UI surfaces | API/service/job surfaces | Data owned | Key failure/security cases |
- |---|---|---|---|---|---|---|
- | | | | | | | |
+ | Module | Purpose | Owned behavior | UI surfaces | API/service/job surfaces | Data owned | Key failure/security cases | Required test surface |
+ |---|---|---|---|---|---|---|---|
+ | | | | | | | | |
+
+ Each module's test surface must list the specific API endpoints to test, unit test targets, and E2E flows required. Example: `GET /invoices — positive + 403 unauthorized; unit: InvoiceService.calculateTax; E2E: user creates invoice → sees it in list → deletes it`.
 
  ## 6. Data Design
 
@@ -98,23 +100,25 @@ Cover where relevant: authentication, route authorization, object authorization,
  This is a design-level strategy, not an execution checklist.
 
  Required testing contract:
- - All API/interface endpoints must have true HTTP/API test coverage for successful behavior and important negative/error cases where a runtime endpoint exists. If a non-HTTP interface or accepted exception requires another proof layer, state the exception and replacement proof. If there is no API/interface surface, state `Not Applicable` with the reason.
+ - All tests must live under their prescribed directories: unit tests under `unit_tests/`, API/integration HTTP tests under `API_tests/`. Both directories are mandatory when the corresponding test surface exists.
+ - All API/interface endpoints must have true HTTP/API test coverage for successful behavior and important negative/error cases where a runtime endpoint exists. Test assertions must verify exact expected state transitions, status codes, and response bodies — not permissive "accept any valid outcome" checks. If a non-HTTP interface or accepted exception requires another proof layer, state the exception and replacement proof. If there is no API/interface surface, state `Not Applicable` with the reason.
  - Meaningful business logic must target 90%+ unit coverage. If a component cannot be unit-tested meaningfully, state the exception and the replacement proof layer.
  - Frontend unit tests must be identifiable by file pattern/framework evidence and must import or render real frontend components/modules when a frontend exists.
  - Fullstack or backend-backed frontend work must include proof that real frontend actions reach the intended backend/service behavior.
- - User-facing applications must have full E2E/platform coverage for the main user journeys, including success, validation/failure, and recovery states. If E2E/platform testing is not applicable, state why and what proof replaces it.
+ - E2E/platform tests must cover every requirement from the original prompt. Not just the main user journeys — every requirement, actor path, business rule, authorization rule, error state, and task-closure condition must have an identifiable E2E test or an explicit accepted not-applicable reason. E2E tests must exercise real application behavior end to end and verify that the system actually works — they must not be decorative page-load-only tests that skip actual interaction, state verification, data persistence checks, or backend integration. An E2E test that only confirms a page renders without asserting business outcomes is insufficient and must be flagged as a gap.
+ - E2E tests must run and pass before any manual verification. If an E2E test cannot verify a particular surface (e.g., requires external service, real email delivery), document the exact boundary and the manual verification that replaces it.
 
  | Surface / risk | Expected proof layer | Notes |
  |---|---|---|
- | core happy paths | | |
- | key failure paths | | |
- | security boundaries | | |
+ | core happy paths | full E2E coverage | |
+ | key failure paths | full E2E coverage + unit/API assertions | |
+ | security boundaries | E2E + unit/API for negative cases | |
  | API/interface behavior | endpoint tests for every endpoint, including positive and negative cases | |
- | UI states / interactions | | |
- | frontend-to-backend integration paths | | |
+ | UI states / interactions | E2E coverage for every state | |
+ | frontend-to-backend integration paths | E2E proves real FE ↔ BE behavior | |
  | unit coverage | 90%+ meaningful business-logic coverage | |
  | frontend unit/component tests | identifiable tests importing/rendering real components/modules | |
- | E2E/platform journeys | full main-journey coverage for user-facing apps | |
+ | E2E/platform journeys | covers every prompt requirement with real assertions | |
 
  ## 11.1 README / Runtime Gate Strategy
 
@@ -14,7 +14,8 @@ You will receive:
  - original prompt
  - stack/context information
  - accepted questions/clarifications
- - requirements breakdown
+ - full requirements breakdown
+ - minified requirements list (core requirements from the breakdown)
  - accepted `./docs/design.md`
  - accepted `./docs/api-spec.md` or not-applicable rationale
  - the plan template
@@ -31,6 +32,7 @@ Create a practical implementation plan that can be translated into concise imple
  - Start with the scaffold/baseline work needed before product modules.
  - Plan by product modules and vertical flows, not by broad file trees.
  - Tests travel with implementation.
+ - All testing artifacts must live under the prescribed test directory structure: unit tests under `unit_tests/`, API/integration HTTP tests under `API_tests/`. Both directories are mandatory when the corresponding test surface exists.
  - Every meaningful requirement, clarification, module, API/interface, data object, actor path, security boundary, and user-visible flow needs a responsible module/work package and proof path.
  - Fullstack/backend-backed frontend work needs explicit frontend-to-backend proof.
  - API work needs endpoint/interface proof.
@@ -41,7 +43,8 @@ Create a practical implementation plan that can be translated into concise imple
  - Include a FE-BE integration matrix for fullstack/backend-backed frontend work: frontend action, backend endpoint/service/job, payload/state input, response/side effect, UI states, and proof path.
  - Include a backend-to-frontend exposure check when backend capabilities exist: every prompt-relevant backend capability must have visible exposure or a specific accepted internal/API-only reason.
  - Include README/runtime gates that match the strict README audit: project type near the top, primary `docker compose up --build` for container-supported deliveries, legacy compatibility string `docker-compose up` without making it primary, startup/access, verification, auth/no-auth, all demo credentials/roles when auth exists, seeded values or empty-state statement, configuration/no-secret handling, no manual runtime installs or manual DB setup, test commands, known limitations, and mock/local-data/debug disclosures.
- - Include coverage rigor: 90%+ unit target for meaningful business logic, exact true no-mock HTTP/API endpoint tests, identifiable frontend unit tests that import/render real components/modules, fullstack FE-BE proof, E2E/platform proof, security/negative cases, and final local verification expectations.
+ - Include coverage rigor: 90%+ unit target for meaningful business logic, exact true no-mock HTTP/API endpoint tests, identifiable frontend unit tests that import/render real components/modules, fullstack FE-BE proof, security/negative cases, and final local verification expectations.
+ - Include an E2E coverage map: every prompt requirement, actor path, business rule, authorization rule, error state, and task-closure condition must have an identifiable E2E test or an explicit accepted not-applicable reason. E2E tests must exercise real behavior and verify business outcomes, not just confirm pages render. Decorative page-load-only E2E tests are insufficient and must be called out as gaps.
  - Include module acceptance checks that prevent shell/demo completion: observable behavior, persisted state/artifact or UI/API outcome, relevant negative paths, tests, README impact, and integration evidence.
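
The E2E coverage map required above could be represented as a simple mapping and checked for gaps. This is a sketch under stated assumptions: the dictionary shape (keys `tests` and `not_applicable_reason`), the requirement labels, and the function name `e2e_coverage_gaps` are all hypothetical and not defined by the package.

```python
def e2e_coverage_gaps(requirements, coverage_map):
    """Return requirements that have neither an E2E test nor a not-applicable reason.

    coverage_map maps a requirement label to a dict that may hold:
      "tests": a list of identifiable E2E test names
      "not_applicable_reason": an explicit accepted reason string
    """
    gaps = []
    for requirement in requirements:
        entry = coverage_map.get(requirement, {})
        if not entry.get("tests") and not entry.get("not_applicable_reason"):
            gaps.append(requirement)
    return gaps
```

For example, with requirements `["R1", "R2", "R3"]` where `R1` lists a test and `R2` carries a not-applicable reason, the function returns `["R3"]`, which is the entry a plan would need to call out as a gap.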
 
  ## Output Requirements