npm - theslopmachine - Versions diffs - 1.0.13 → 1.0.22 - Mend

theslopmachine 1.0.13 → 1.0.22

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (39)

package/assets/agents/developer.md +6 -7
package/assets/agents/slopmachine-claude.md +66 -9
package/assets/agents/slopmachine.md +68 -9
package/assets/claude/agents/developer.md +5 -1
package/assets/skills/clarification-gate/SKILL.md +56 -20
package/assets/skills/claude-worker-management/SKILL.md +14 -4
package/assets/skills/deep-retrospective/SKILL.md +179 -0
package/assets/skills/deep-retrospective/run.py +446 -0
package/assets/skills/deep-retrospective/workflow-reference.md +240 -0
package/assets/skills/developer-session-lifecycle/SKILL.md +18 -4
package/assets/skills/development-guidance/SKILL.md +52 -31
package/assets/skills/evaluation-triage/SKILL.md +21 -7
package/assets/skills/final-evaluation-orchestration/SKILL.md +92 -28
package/assets/skills/integrated-verification/SKILL.md +38 -42
package/assets/skills/p8-readiness-reconciliation/SKILL.md +31 -10
package/assets/skills/planning-gate/SKILL.md +10 -7
package/assets/skills/planning-guidance/SKILL.md +60 -52
package/assets/skills/retrospective-analysis/SKILL.md +172 -58
package/assets/skills/scaffold-guidance/SKILL.md +18 -6
package/assets/skills/submission-packaging/SKILL.md +11 -3
package/assets/slopmachine/clarifier-agent-prompt.md +7 -6
package/assets/slopmachine/exact-readme-template.md +8 -12
package/assets/slopmachine/owner-verification-checklist.md +1 -1
package/assets/slopmachine/phase-1-design-prompt.md +5 -10
package/assets/slopmachine/phase-1-design-template.md +15 -11
package/assets/slopmachine/phase-2-execution-planning-prompt.md +5 -2
package/assets/slopmachine/phase-2-plan-template.md +14 -4
package/assets/slopmachine/scaffold-playbooks/shared-contract.md +2 -1
package/assets/slopmachine/templates/AGENTS.md +3 -1
package/assets/slopmachine/templates/CLAUDE.md +3 -1
package/assets/slopmachine/test-coverage-prompt.md +8 -1
package/assets/slopmachine/utils/README.md +1 -5
package/assets/slopmachine/utils/claude_live_common.mjs +2 -5
package/assets/slopmachine/utils/prepare_evaluation_send_packet.mjs +3 -3
package/package.json +1 -1
package/src/constants.js +0 -9
package/src/init.js +17 -24
package/src/install.js +30 -28
package/assets/slopmachine/utils/prepare_evaluation_prompt.mjs +0 -81

package/assets/skills/retrospective-analysis/SKILL.md CHANGED

@@ -9,10 +9,11 @@ Use this skill only during Phase 8 Retrospective, after Phase 7 Submission Packa
 ## Purpose
-- inspect what happened across the whole workflow run
+- inspect what happened across the whole workflow run with deep evidence review
 - identify what caused churn, waste, late defects, or preventable corrections
+- assess the quality of the work done at every phase
 - capture lessons that should improve future runs
-- write package-specific retrospective files under `/Users/yohannesakd/slopmachine/retrospectives/`
+- write package-specific retrospective files under the installed SlopMachine assets directory
 ## Phase role
@@ -22,62 +23,166 @@ Use this skill only during Phase 8 Retrospective, after Phase 7 Submission Packa
 - it does not rerun broad verification by default
 - it should not reopen development unless it finds a real defect in the already-packaged result
-## Output location
+## Mandatory Evidence Review
-Write run-scoped retrospective files under:
+Before writing the retrospective, the owner must read all of the following evidence sources. Skim nothing — every artifact must be inspected.
-- `/Users/yohannesakd/slopmachine/retrospectives/`
+### 1. Metadata and State Transitions
-Preferred filenames:
+Read `../.ai/metadata.json` in full. Track:
+- every phase transition and its timestamp
+- every session created (develop, bugfix, test-coverage, evaluator)
+- every session handoff and closure event
+- every Beads entry ID recorded
+- the run_id, current_phase history, and awaiting_human flags
-- `retrospective-<run_id>.md`
-- `improvement-actions-<run_id>.md`
+Note the elapsed time between phase transitions. A long gap between phases with no state changes indicates owner delay or blocking.
+### 2. Beads Comments
-If only one file is needed, the retrospective file is sufficient.
+Read all Beads entries for the run. Extract every:
+- `ARTIFACT:` comment — what was produced and when
+- `ISSUE:` comment — what was found and where
+- `SESSION:` comment — lane creation, handoff, closure
+- `DECISION:` comment — owner acceptance, rejection, risk acceptance
+- `VERIFY:` comment — verification evidence recorded
-The `run_id` must come from the current project's `../.ai/metadata.json` so the retrospective can be matched back to one exact workflow run.
+Count the total number of each. High `ISSUE:` counts in late phases (4-7) indicate weak early prevention. Low `VERIFY:` counts may indicate insufficient owner-side checking.
-## Evidence sources
+### 3. Clarification and Requirements Artifacts
-Prefer existing workflow artifacts first:
+Read `./docs/questions.md` and `../.ai/requirements-breakdown.md`. Check:
+- how many clarifying questions were asked and resolved
+- whether the requirements breakdown was deep enough
+- whether the faithfulness review surfaced material drift
-- root metadata
-- questions/clarification record
-- clarification prompt
-- planning artifacts
-- Beads comments and transitions
-- developer-session handoffs
-- review and rejection history
-- verification gate notes
-- `./.tmp/` audit and fix-check reports
-- packaging checks
+### 4. Design and Planning Artifacts
-Do not reread the entire codebase unless a real inconsistency requires it.
-Do not rerun broad Docker or full-suite verification just for retrospective analysis.
+Read `./docs/design.md`, `./docs/api-spec.md`, and `../.ai/plan.md`. Check:
+- whether the design mapped every requirement to a surface (section 2.1)
+- whether the API spec inventoried every endpoint
+- whether the no-orphan ledger in the plan was complete
+- whether the plan had a development prompt queue that was actually followed
-## Required retrospective sections
+### 5. Development Session Records
-1. outcome summary
-2. what worked well
-3. what caused waste or looping
-4. what was caught too late
-5. findings by phase
-6. findings by instruction plane:
-   - owner shell
-   - developer prompt
-   - skills
-   - task-root rulebook file such as `./AGENTS.md` or `./CLAUDE.md`
-7. late-finding origin table
-8. actionable improvements
+From metadata, identify every develop session. Review session transcripts when available. Track:
+- how many scaffold iterations were needed
+- how many module prompts were sent
+- how many fix rounds were required per module
+- whether the final self-check found gaps (and if so, how many)
+- whether the owner ran the requirements integrity sweep
-## Late-finding origin table
+### 6. Integrated Verification (Phase 4) Evidence
+Read `../.ai/consolidated-internal-issues.md`. Count:
+- total issues from the owner plan-based review
+- total issues from each of the 5 evaluator passes
+- issue severity distribution (blocker, high, medium, low)
+- which modules had the most issues
+- whether any issues repeated across passes (same root cause discovered multiple times)
+Read every report under `../.ai/internal-verification/`. For each:
+- note the verdict
+- count the issues per severity
+- compare against the consolidated file for completeness
+- check whether any archived report's issues were never extracted
+### 7. Final Evaluation (Phase 5) Evidence
+Read every kept report under `./.tmp/`:
+- `audit_report-1.md` and `audit_report-1-fix_check.md`
+- `audit_report-2.md` and `audit_report-2-fix_check.md`
+- `test_coverage_and_readme_audit_report.md`
+For each audit report, record:
+- the verdict (Pass / Partial Pass / Fail)
+- how many Blocker and High issues were found
+- whether the regenerated report was accepted or rejected
+- how many fix-check rounds were needed to close all issues
+- the final test coverage score
+Read every archived report under `../.ai/archive/`. Count:
+- how many reports were archived (failed reports, superseded reports, invalid-cycle reports)
+- which cycles were restarted and why
+- whether any restart was caused by a prompt-paste violation
+### 8. Readiness and Packaging (Phase 6-7) Evidence
+Read `../.ai/metadata.json` for Phase 6-7 state. Check:
+- whether Docker/runtime checks passed on first attempt or required fixes
+- whether `agent-browser` checks passed
+- whether any D1-D9 dimensions were risk-accepted
+- whether packaging found and removed stale artifacts
+### 9. Developer Rulebook and Templates
+Read `./AGENTS.md` or `./CLAUDE.md` (whichever was used). Check:
+- whether the rulebook was adequate for the project type
+- whether the developer followed the rulebook's implementation discipline
+- whether any rulebook gaps contributed to issues
+## Required Retrospective Sections
+The retrospective file must contain all of the following sections:
+1. **outcome summary** — what was delivered, was the prompt satisfied, overall verdict
+2. **run statistics** — phase durations, session counts, total issue counts by phase, fix round counts
+3. **what worked well** — which phases ran cleanly, which gates passed first try
+4. **what caused waste or looping** — which issues went through multiple fix cycles, why
+5. **what was caught too late** — issues first found in Phase 5-7 that should have been caught earlier
+6. **quality assessment by phase** — for each phase, rate the quality of the output (1-5) with evidence
+7. **findings by phase** — material issues discovered, how they were resolved
+8. **findings by instruction plane**: owner shell, developer prompts, skills, rulebook
+9. **late-finding origin table** — every late issue classified by where it should have been prevented
+10. **issue discovery timeline** — when each major issue was first found and by which mechanism
+11. **churn analysis** — which modules had the most fix rounds, which gates required the most iteration
+12. **token and time efficiency** — whether prompts were efficient, whether work was redundant
+13. **actionable improvements** — concrete changes to make before the next run
+## Reusable Scaffold Configuration
+After a project successfully passes all Docker/runtime checks and `run_tests.sh` completes, the retrospective must capture the working scaffold configuration so it can be reused for future similar projects.
+Extract and save the following under `<asset-root>/retrospectives/scaffold-config-<project_type>-<stack>.md`:
+- the project type and stack used
+- the working Dockerfile pattern (or reference to the playbook that produced it)
+- the working docker-compose.yml structure (services, ports, volumes, healthchecks, profiles)
+- the working run_tests.sh structure (build, start, test, cleanup stages)
+- the working local test harness commands
+- any deviations from the standard playbook that were necessary to make things work
+- the verified ports, healthcheck endpoints, and init scripts
+If the project used an existing playbook and the configuration worked without modification, note that the playbook was sufficient and register the confirmation. If modifications were needed, document exactly what was changed and why so the playbook can be improved for future runs.
+This section ensures that every successful project's runtime configuration becomes institutional knowledge. A project that runs and tests cleanly should never require rediscovering its Docker setup from scratch on the next similar task.
+## Quality Assessment Per Phase
+Rate each phase 1-5 on:
+| Phase | Requirements clarity | Design fidelity | Plan rigor | Implementation quality | Owner review thoroughness |
+|---|---|---|---|---|---|
+| 1: Clarification | Was the requirements breakdown deep and faithful? | — | — | — | Was the faithfulness review honest? |
+| 2: Planning | Did the design cover every requirement? | Was the API spec complete? | Was the no-orphan ledger exhaustive? | — | Was the REQ-### cross-reference thorough? |
+| 3: Development | — | — | — | Was every module fully implemented? | Was the integrity sweep thorough? |
+| 4: Verification | — | — | — | — | Were all evaluator passes reviewed? |
+| 5: Evaluation | — | — | — | — | Were audit reports properly acted on? |
+| 6: Readiness | — | — | — | — | Were Docker/runtime checks genuine? |
+| 7: Packaging | — | — | — | — | Was the package boundary clean? |
+Each rating must be backed by specific evidence from the artifact review above. A rating of 3 or below in any cell requires an explanation of what went wrong and a concrete improvement.
+## Late-Finding Origin Table
 For every material issue first surfaced in Phase 4, Phase 5, Phase 6, or Phase 7, classify where it should have been prevented.
 Use this table shape:
 | Finding | First surfaced in | Prompt required? | Accepted plan/design covered? | Origin classification | Fix belongs in |
-|---|---|---:|---:|---|---|
+|---|---|---|---|---|---|
+|  |  | yes/no | yes/no |  |  |
 Origin classifications:
@@ -88,32 +193,41 @@ Origin classifications:
 - `owner review miss`: implementation claimed closure but review accepted weak or unsupported evidence
 - `evaluation-only strictness`: the repo was broadly coherent, but the evaluator imposed a stricter interpretation than the prior workflow had required
-Do not classify a late finding by where it appeared.
-Classify it by the earliest artifact that should have made the correct behavior unavoidable.
+Count how many findings fall into each origin classification. A cluster of `owner review miss` findings indicates the owner review checklist is insufficient. A cluster of `planning miss` findings indicates the planning templates need strengthening. A cluster of `development execution miss` findings indicates the developer rulebook or prompting style needs improvement.
+Do not classify a late finding by where it appeared. Classify it by the earliest artifact that should have made the correct behavior unavoidable.
+## Issue Discovery Timeline
+For every Blocker or High issue across all phases, record:
-Separate immediate repo remediation from workflow learning:
+| Issue | Phase found | Discovery mechanism | Surface/module | Root cause | Fixed in |
+|---|---|---|---|---|---|
+|  |  | owner review / evaluator pass 2 / audit cycle 1 / etc. |  |  | develop / bugfix / test-coverage |
-- repo remediation answers what code/docs/tests must change now
-- workflow learning answers whether clarification, planning, development guidance, owner review, or evaluation routing should change for future runs
+This timeline reveals whether most issues were found by owner review (good — early detection), internal evaluator loop (acceptable — mid-phase), or final evaluation (bad — too late).
-## Audit buckets
+## Churn Analysis
-Evaluate at least these buckets in hindsight:
+Count for each module or surface area:
+- how many distinct issue batches were sent
+- how many fix rounds were needed
+- whether the same issue was reported multiple times
+- whether fixes introduced new issues
-1. prompt-fit
-2. security-critical flaws
-3. test sufficiency
-4. major engineering quality
-5. token/time waste
+High churn on a single module indicates either the developer implementation was weak, the design was unclear, or the owner review was insufficient.
-For each meaningful finding, prefer:
+## Token and Time Efficiency
-- what happened
-- why it happened
-- where the fix belongs
-- how it should change future runs
+From session transcripts and metadata:
+- estimate whether prompts were concise or bloated
+- count how many times the same information was repeated across turns
+- note whether the developer was given too much or too little context
+- identify any token-wasting patterns (restating rules, re-explaining context, verbose markdown)
-## Rule for reopening work
+## Rule for Reopening Work
 - if retrospective finds a real packaging or delivery defect, reopen Phase 7 and fix it
-- if it finds only improvements, document them and close the retrospective phase
+- if it finds systemic workflow issues (e.g., the owner review checklist consistently missed a class of problem), record them as improvement actions for the next run
+- if it finds only non-blocking improvements, document them and close the retrospective phase
+- never silently skip a finding just because it is inconvenient
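
The per-tag totals that the new "Beads Comments" step asks for could be tallied roughly as follows. This is a minimal sketch, assuming the Beads comments have been exported as plain text with one comment per line; that export format and the helper name `count_beads_tags` are hypothetical and not part of the package.

```python
from collections import Counter

# The five comment tags named in the skill's "Beads Comments" step.
TAGS = ("ARTIFACT:", "ISSUE:", "SESSION:", "DECISION:", "VERIFY:")


def count_beads_tags(lines):
    """Count how many comment lines start with each known tag.

    Assumption: one comment per line in a plain-text export (hypothetical).
    Lines that start with none of the tags are ignored.
    """
    counts = Counter({tag: 0 for tag in TAGS})
    for line in lines:
        stripped = line.strip()
        for tag in TAGS:
            if stripped.startswith(tag):
                counts[tag] += 1
                break
    return counts


if __name__ == "__main__":
    sample = [
        "ARTIFACT: docs/design.md written",
        "ISSUE: missing negative test for login",
        "ISSUE: seed script fails on empty database",
        "VERIFY: local harness passed",
    ]
    for tag, total in count_beads_tags(sample).items():
        print(tag, total)
```

Tags with a zero count stay in the result, so a run with no `VERIFY:` comments shows up explicitly rather than being silently absent.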

package/assets/skills/scaffold-guidance/SKILL.md CHANGED

@@ -11,6 +11,10 @@ Use this skill for the first development slice: the framework/runtime/test/READM
 Scaffold creates an honest base for later module work. It should bootstrap the selected framework and project structure without implementing project-specific business logic beyond a minimal proof surface.
+**Two test paths are created at scaffold time and follow this lifecycle:**
+1. **Local test harness** (stack-native) — used during all development, verification, and remediation work (Phases 3-5). Fast, non-Docker, for iteration.
+2. **Dockerized `run_tests.sh`** — deferred to Phase 6/7. Not run during development. After all development is complete and verified locally, Docker and `run_tests.sh` are run for the first time and any issues fixed.
 The owner may use scaffold playbooks privately, but developer-facing prompts should reference only the docs and normal engineering instructions.
 ## Private Inputs
@@ -22,6 +26,10 @@ Use these owner-side inputs to shape the scaffold prompt:
 - scaffold playbooks under `~/slopmachine/scaffold-playbooks/`
 - current framework/library docs when scaffold commands or config depend on current behavior
+The scaffold playbooks are the primary source for Docker and `run_tests.sh` setup. Before writing the scaffold prompt, the owner must read the relevant playbooks for the project's type, technology, and stack. Start with `shared-contract.md` for the universal scaffold rules, then read the type-specific playbook (e.g. `type-web-spa.md`), the tech-specific playbooks (e.g. `tech-frontend-react.md`, `tech-backend-koa.md`), and the stack-specific playbook if one matches (e.g. `stack-react-go-postgres.md`). Each playbook provides the exact Dockerfile, docker-compose.yml, run_tests.sh, and local test harness structure for that project class.
+The owner extracts the Docker and runtime configuration details from these playbooks and translates them into the developer prompt in plain language. Do not tell the developer to read the playbooks or reference playbook names — just communicate the technical requirements.
 Do not tell the developer/Claude lane to read `../.ai/plan.md` or scaffold playbooks directly, and do not mention that the internal plan exists.
 ## Scaffold Prompt Shape
@@ -29,9 +37,9 @@ Do not tell the developer/Claude lane to read `../.ai/plan.md` or scaffold playb
 Use casual, direct language. Example:
 ```text
-Let's start with the scaffold. Set up the base project in ./repo for the stack described in docs/design.md. Keep this to the framework/runtime/test/README foundation and a minimal proof page or endpoint; don't add product-specific business logic yet.
+Let's start with the scaffold. Set up the base project in ./repo for the stack described in docs/design.md. Keep this to the framework, runtime, tests, and README foundation with a minimal proof page or endpoint. Do not add product-specific business logic yet.
-Please include the local test harness, runtime files, run_tests.sh, and the README baseline so the next modules have a clean foundation.
+The scaffold needs a working docker-compose.yml with a profile-gated test service, a run_tests.sh that runs the full test suite through Docker, a separate stack-native local test harness for fast implementation-time checks, a unit_tests directory for unit tests, an API_tests directory for API and integration HTTP tests, and a README baseline with startup, access, verification, and test commands. Make sure the local harness is separate from the Docker test path. The local harness is for fast iteration. run_tests.sh is the broad dockerized verification wrapper for later.
 ```
 Adjust the exact wording to the project. Do not over-format the message.
@@ -41,9 +49,11 @@ Adjust the exact wording to the project. Do not over-format the message.
 - coherent project structure for the selected stack
 - framework bootstrap wired enough for later module work
 - minimal proof surface such as a basic page, route, endpoint, or app shell
-- stack-native local test harness
-- product repo root `./repo/run_tests.sh` when required by the project contract
-- runtime/Docker files when relevant, wired honestly for later verification
+- **documented local start command** (e.g., `npm run dev`, `go run ./cmd/server`, `docker compose up`) that starts the application for local development and verification. The developer must have started it and confirmed it works before reporting scaffold completion
+- `unit_tests/` directory for unit tests and `API_tests/` directory for API/integration HTTP tests (both mandatory when the corresponding test surface exists)
+- stack-native local test harness for fast implementation-time iteration
+- product repo root `./repo/run_tests.sh` that runs the full test suite through Docker (`docker compose --profile test`)
+- `./repo/docker-compose.yml` with a profile-gated test service, wired honestly for later verification
 - database/bootstrap/seed path when the product will require seeded data or persistent storage
 - README baseline with project type near the top, stack, primary startup/access command, legacy `docker-compose up` compatibility string where applicable, verification method, auth/no-auth, seeded/empty-state note, mock/local/debug disclosures, known limitations, and repo layout
 - no committed secrets, `.env`, `.env.example`, hidden host setup, no-op tests, or fake-success integration paths
@@ -63,7 +73,9 @@ After the scaffold turn:
 - inspect changed files manually
 - verify files are in `./repo` and integrated
 - check the stack is coherent and discoverable
-- check local test harness and `run_tests.sh` are meaningful
+- verify `./run_tests.sh` is wired to Docker (references docker compose) and `docker-compose.yml` has a profile-gated test service
+- check the local test harness is separate from the Docker test path
+- **start the application locally using the documented start command and confirm it reaches a usable state** — if the app does not start, reject the scaffold and send it back with the error
 - check README baseline matches actual files and scripts
 - run the narrow local scaffold check when practical
 - record artifacts, evidence, issues, and acceptance in metadata and Beads
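
The two wiring checks added to the post-scaffold review (that `run_tests.sh` references Docker Compose, and that `docker-compose.yml` has a profile-gated service) could be approximated with a quick static scan. This is a rough sketch under stated assumptions: the function names are hypothetical, and the Compose check is a text heuristic that only looks for an indented `profiles:` key rather than parsing the YAML.

```python
import re
from pathlib import Path


def run_tests_uses_docker(script_text):
    """True if the script text mentions `docker compose` or `docker-compose`."""
    return re.search(r"\bdocker[ -]compose\b", script_text) is not None


def compose_has_profiled_service(compose_text):
    """Heuristic: True if an indented `profiles:` key appears in the file.

    A stricter check would parse the YAML and inspect each service.
    """
    pattern = r"^[ \t]+profiles[ \t]*:"
    return re.search(pattern, compose_text, flags=re.MULTILINE) is not None


def check_scaffold(repo_dir):
    """Return a list of wiring problems found under the given repo directory."""
    repo = Path(repo_dir)
    problems = []
    script = repo / "run_tests.sh"
    compose = repo / "docker-compose.yml"
    if not script.is_file():
        problems.append("run_tests.sh is missing")
    elif not run_tests_uses_docker(script.read_text()):
        problems.append("run_tests.sh does not reference docker compose")
    if not compose.is_file():
        problems.append("docker-compose.yml is missing")
    elif not compose_has_profiled_service(compose.read_text()):
        problems.append("docker-compose.yml has no profiles: key")
    return problems
```

A static scan like this cannot replace the manual step of starting the application; it only catches a scaffold whose test path is obviously not wired to Docker.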

package/assets/skills/submission-packaging/SKILL.md CHANGED

@@ -1,3 +1,8 @@
+---
+name: submission-packaging
+description: Phase 7 submission packaging, package-boundary validation, and final closure.
+---
 # Submission Packaging
 Use this skill for Phase 7 submission packaging and final closure.
@@ -29,8 +34,8 @@ Packaging must reject or remove stale workflow notes and scratch execution artif
 - `repo/README.md` is the primary product documentation.
 - `repo/run_tests.sh` is the broad verification wrapper.
 - Runtime docs follow packaged platform rules.
-- Unit tests use `unit_tests/` where applicable.
-- API/integration HTTP tests use `API_tests/` where applicable.
+- Unit tests must use `unit_tests/`.
+- API/integration HTTP tests must use `API_tests/`.
 - No `.env`, `.env.example`, secret-bearing examples, local-only setup residue, or hidden host assumptions may remain.
 - `repo/docker-compose.yml` must exist for container-supported deliveries when the runtime contract requires Compose; `compose.yaml` or `docker-compose.yaml` may not be the only Compose file.
 - `repo/init_db.sh` must exist when database dependencies exist and must reflect final schema/bootstrap needs.
@@ -49,12 +54,14 @@ Packaging must reject or remove stale workflow notes and scratch execution artif
 ## Required `.tmp` Report Shape
 - `.tmp/` must contain the kept immutable evaluator reports and fix-check reports required by the evaluation phase.
-- Normal 2-audit-session path must end with `audit_report-1.md`, `audit_report-2.md`, `test_coverage_and_readme_audit_report.md`, and the corresponding `audit_report-<N>-fix_check.md` files for every kept Partial Pass report that required fix-check.
+- Normal 2-audit-session path must end with `audit_report-1.md`, `audit_report-1-fix_check.md`, `audit_report-2.md`, `audit_report-2-fix_check.md`, and `test_coverage_and_readme_audit_report.md`.
+- A cycle fix-check file is never omitted. Even a Pass report with zero scoped issues requires a fix-check report stating that there were no scoped issues to close.
 - Do not leave archived failed reports, stale report variants, numbered coverage variants, owner reconciliation notes, or superseded reports in final `.tmp/`; archived lineage belongs under `../.ai/archive/`.
 - Read the final coverage/README report holistically as an acceptance signal and reconcile any repo/README/docs mismatch before closing packaging.
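
The revised report shape for the normal 2-audit-session path lists five required files, which makes it easy to check mechanically. The sketch below is illustrative only: the helper name `check_tmp_reports` is hypothetical, and treating every other file in `.tmp/` as something to review is an assumption drawn from the rule against leaving stale or superseded reports there.

```python
# The five files the revised skill text requires in final .tmp/
# for the normal 2-audit-session path.
REQUIRED_REPORTS = (
    "audit_report-1.md",
    "audit_report-1-fix_check.md",
    "audit_report-2.md",
    "audit_report-2-fix_check.md",
    "test_coverage_and_readme_audit_report.md",
)


def check_tmp_reports(filenames):
    """Compare the names found in .tmp/ against the required report set.

    Returns (missing, extra): required names that are absent, and any
    other names present, which this sketch flags for manual review.
    """
    present = set(filenames)
    missing = [name for name in REQUIRED_REPORTS if name not in present]
    extra = sorted(present - set(REQUIRED_REPORTS))
    return missing, extra


if __name__ == "__main__":
    found = ["audit_report-1.md", "audit_report-2.md", "audit_report-1-old.md"]
    missing, extra = check_tmp_reports(found)
    print("missing:", missing)
    print("review:", extra)
```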
 ## Final Runtime And Test Confirmation
+- `./repo/run_tests.sh` must always run through Docker. The owner defers all Dockerized tests and Docker builds to this phase — they were never run during development or earlier phases.
 - Phase 7 owns the final Docker/runtime confirmation and dockerized broad `./repo/run_tests.sh` confirmation when those commands are part of the delivered contract or when late fixes/packaging changes could affect runtime/test behavior.
 - Phase 7 also owns final browser/API manual confirmation when late fixes, README edits, cleanup, package boundary changes, or seed/config changes could affect user-visible behavior or documented seeded values.
 - If `./repo/README.md` documents `docker compose up --build` or `./repo/run_tests.sh`, treat those as package contract commands, not aspirational notes.
@@ -91,6 +98,7 @@ Do not run broad Docker prune commands that can affect unrelated projects.
 ## Metadata And Naming
 - `./metadata.json` must truthfully describe the delivered project and contain only these seven project-fact keys: `prompt`, `project_type`, `frontend_language`, `backend_language`, `database`, `frontend_framework`, and `backend_framework`.
+- `prompt` in `./metadata.json` must be the original product prompt captured during Phase 1, not a summary, clarified rewrite, workflow note, or prompt-plus-operator-context dump.
 - Normalize `project_type` to exactly one of the six accepted task classifications: `backend`, `fullstack`, `web`, `android`, `ios`, or `desktop`.
 - If a task/question id exists, use that exact id for final deliverable/archive naming without adding an extra `ID-` prefix.
 - Record the package-root manifest: workflow root, task root, package root, docs path, `.tmp` path, separate session handoff path if present, and any explicit validation exceptions.
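
The key and `project_type` rules for `./metadata.json` lend themselves to a small validator. This is a minimal sketch, not the package's own check: the function name `validate_metadata` is hypothetical, and reporting absent keys as a problem is an assumption, since the skill text only says the file may contain no keys beyond these seven.

```python
import json

# The seven project-fact keys named in the skill text.
ALLOWED_KEYS = {
    "prompt",
    "project_type",
    "frontend_language",
    "backend_language",
    "database",
    "frontend_framework",
    "backend_framework",
}

# The six accepted task classifications for project_type.
PROJECT_TYPES = {"backend", "fullstack", "web", "android", "ios", "desktop"}


def validate_metadata(metadata):
    """Return a list of problems found in a parsed metadata.json dict."""
    problems = []
    keys = set(metadata)
    extra = sorted(keys - ALLOWED_KEYS)
    missing = sorted(ALLOWED_KEYS - keys)  # assumption: all seven are expected
    if extra:
        problems.append("unexpected keys: " + ", ".join(extra))
    if missing:
        problems.append("missing keys: " + ", ".join(missing))
    if metadata.get("project_type") not in PROJECT_TYPES:
        problems.append("project_type is not one of the accepted values")
    return problems


if __name__ == "__main__":
    raw = '{"prompt": "Build a todo app", "project_type": "Web", "run_id": "x"}'
    for problem in validate_metadata(json.loads(raw)):
        print(problem)
```

The check cannot tell whether `prompt` is the original Phase 1 prompt rather than a rewrite; that still needs a manual comparison.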

package/assets/slopmachine/clarifier-agent-prompt.md CHANGED

@@ -234,13 +234,14 @@ Do not include any preface, explanation, summary, commentary, or planning notes
 ```md
 # Questions
-## Clarification Entries
+This file records only genuine original-prompt ambiguities that needed interpretation because they were unclear, incomplete, contradictory, or materially ambiguous.
-### Q-001: [short question title]
-- **Ambiguity:** [what is unclear, missing, or open to interpretation]
-- **Prompt Basis:** [exact quote or reference from the original prompt]
-- **Impact:** [what would go wrong in later design or implementation if unresolved]
-- **Solution:** [decisive resolution with prompt-faithful default]
+Use this structure for each real clarification item:
+### 1. <short clarification title>
+- Question: <the exact ambiguity or missing detail that needed to be locked>
+- My Understanding: <how the prompt was interpreted, why this was ambiguous, and why it matters>
+- Solution: <the chosen prompt-faithful resolution or safe default, written decisively>
 ```
 Do not include requirement IDs, traceability fields, priority fields, or evaluator-risk metadata in `questions.md`. Those belong in `../.ai/requirements-breakdown.md` only.
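
The new `questions.md` entry shape (a numbered `###` heading followed by `Question`, `My Understanding`, and `Solution` bullets) can be checked for completeness with a short parser. This is a sketch under assumptions: the helper name `check_questions` is hypothetical, and it expects each field to begin a line exactly as `- Field:`, as in the template above.

```python
import re

# The three bullets each clarification entry carries in the new template.
REQUIRED_FIELDS = ("Question", "My Understanding", "Solution")


def check_questions(markdown_text):
    """Map each `### ...` entry title to the required fields it lacks."""
    # Split on level-3 headings; the first chunk is the file preamble.
    entries = re.split(r"^### ", markdown_text, flags=re.MULTILINE)[1:]
    report = {}
    for entry in entries:
        title, _, body = entry.partition("\n")
        missing = [
            field
            for field in REQUIRED_FIELDS
            if not re.search(rf"^- {re.escape(field)}:", body, flags=re.MULTILINE)
        ]
        report[title.strip()] = missing
    return report


if __name__ == "__main__":
    text = (
        "# Questions\n\n"
        "### 1. Login scope\n"
        "- Question: Is SSO required?\n"
        "- My Understanding: The prompt only mentions passwords.\n"
        "- Solution: Password login only.\n\n"
        "### 2. Export format\n"
        "- Question: CSV or JSON?\n"
    )
    for title, missing in check_questions(text).items():
        print(title, "->", missing or "complete")
```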

package/assets/slopmachine/exact-readme-template.md CHANGED

@@ -206,6 +206,8 @@ Expected result:
 ## 9. Testing
+All testing is Docker-contained. Do not document local test commands (`npm test`, `pytest`, `go test`, etc.) in the README — those are for implementation-time development only and must not appear in the reviewer-facing project documentation.
 ### Standard broad test command
 ```bash
@@ -214,17 +216,12 @@ Expected result:
 If `init_db.sh` is part of the standard test bootstrap, document that relationship clearly.
-### Local verification harness
-- Document the separate local verification command(s) used for ordinary development and readiness checks only if they do not become required reviewer setup.
-- Make clear that these local verification commands are distinct from the dockerized `./repo/run_tests.sh` broad test path.
-- Use the real stack-native local suite for the chosen language/framework where applicable, for example Vitest, Jest, PHPUnit, pytest, go test, cargo test, or another framework-native equivalent.
-- Do not require reviewers to run manual installs or machine-level setup for the standard packaged verification path.
 ### Test entry points
-- Unit tests: `[command/path]`
-- Component/state tests: `[command/path]`
-- API/integration tests: `[command/path]`
-- E2E/platform tests: `[command/path]`
+These are the Docker-contained test entry points invoked by `./run_tests.sh`:
+- Unit tests: `[command/path inside Docker]`
+- Component/state tests: `[command/path inside Docker]`
+- API/integration tests: `[command/path inside Docker]`
+- E2E/platform tests: `[command/path inside Docker]`
 ### What the test suite covers
 - [backend unit coverage summary]
@@ -233,8 +230,7 @@ If `init_db.sh` is part of the standard test bootstrap, document that relationsh
 - [E2E/platform coverage summary]
 ### Test notes
-- `./repo/run_tests.sh` is the broad test path for containerized verification when applicable.
-- Local verification commands are used for ordinary development iteration and readiness checks.
+- `./repo/run_tests.sh` is the broad test path for containerized verification.
 - [Docker-contained notes if applicable]
 - [seed/fixture notes if applicable]
 - [known test constraints if any]

package/assets/slopmachine/owner-verification-checklist.md CHANGED

@@ -26,7 +26,7 @@ Reject only for material defects that would mislead development, evaluation, or
 - [ ] Platform runtime expectations follow the packaged runtime rules.
 - [ ] `./repo/run_tests.sh` is the broad test wrapper and is honest.
-- [ ] `unit_tests/` and `API_tests/` are used where applicable.
+- [ ] `unit_tests/` and `API_tests/` are used.
 - [ ] Docker requirements are not incorrectly forced onto Android/iOS native runtime delivery.
 - [ ] Docker/runtime docs and files are repo-controlled, non-interactive, free of hidden `.env`/manual export dependencies, and statically credible for final confirmation.
 - [ ] A separate stack-native local harness exists for development/Phase 4, or the missing harness is explicitly user risk-accepted.

package/assets/slopmachine/phase-1-design-prompt.md CHANGED

@@ -1,13 +1,6 @@
-# Design Prompt
 You are helping create the product/system design for a software project.
-You will receive:
-- the original product prompt
-- stack/context information
-- accepted clarifications
-- accepted requirements
-- the design template
+The original prompt, stack and context, accepted clarifications, and accepted requirements have already been provided in the previous steps. The design template is already seeded at ./docs/design.md.
 Your task is to write `./docs/design.md`.
@@ -23,7 +16,9 @@ The design must:
 - preserve the original business goal and required user outcomes
 - incorporate accepted clarifications and requirements without narrowing them
 - identify the project type, stack, actors, roles, main flows, modules, data, UI/API surfaces, security boundaries, assumptions, and verification strategy
-- define the testing contract as part of the visible design: every API/interface endpoint must have positive and negative true HTTP/API tests where a runtime endpoint exists, unit coverage must target 90%+ for meaningful business logic, frontend unit tests must be identifiable and must import/render real frontend components where a frontend exists, fullstack/web apps must prove frontend-to-backend behavior, and user-facing applications must include full E2E/platform coverage for the main user journeys unless a surface is genuinely not applicable
+- require all unit tests under `unit_tests/` and all API/integration HTTP tests under `API_tests/` (both directories mandatory when the corresponding test surface exists)
+- define the testing contract as part of the visible design: every API/interface endpoint must have positive and negative true HTTP/API tests where a runtime endpoint exists, unit coverage must target 90%+ for meaningful business logic, frontend unit tests must be identifiable and must import/render real frontend components where a frontend exists, fullstack/web apps must prove frontend-to-backend behavior
+- require E2E/platform test coverage for every prompt requirement: not just main user journeys but every requirement, actor path, business rule, authorization rule, error state, and task-closure condition must have an identifiable E2E test or an explicit accepted not-applicable reason. E2E tests must exercise real application behavior and verify business outcomes, not just confirm pages render. An E2E test that only checks a page loads without asserting state changes, data persistence, or backend integration is decorative and must be rejected.
 - define README/runtime obligations that satisfy strict review: project type near the top, `docker compose up --build` as the primary startup command for container-supported deliveries, the legacy compatibility string `docker-compose up` without making it primary, access URL/port or platform launch method, verification method, auth/demo credentials for every role or the exact statement `No authentication required`, seeded data or empty-state statement, no manual runtime installs, no hidden `.env` dependency, mock/local/debug disclosures, and known limitations
 - make meaningful assumptions explicit
 - mark unresolved items only when a real decision is still needed
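
The directory rule added in this hunk (unit tests under `unit_tests/`, API/integration HTTP tests under `API_tests/`) could be checked mechanically. The following is a minimal sketch, not part of the package: the function name, the `(kind, path)` input shape, and the choice to accept the directory anywhere in the path are all assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Directory names come from the diff; everything else here is hypothetical.
EXPECTED_ROOT = {"unit": "unit_tests", "api": "API_tests"}


def misplaced_tests(tests):
    """Return the (kind, path) pairs that sit outside their prescribed directory.

    `tests` is an iterable of (kind, path) pairs. Kinds without a prescribed
    directory (for example "e2e") are skipped, because the diff does not name
    a location for them.
    """
    misplaced = []
    for kind, raw in tests:
        root = EXPECTED_ROOT.get(kind)
        if root is None:
            continue
        # Assumption: the directory may appear at any depth in the path.
        if root not in PurePosixPath(raw).parts[:-1]:
            misplaced.append((kind, raw))
    return misplaced
```

Calling it with a unit test stored under `src/` and an API test stored under `unit_tests/` would report both as misplaced, while an E2E spec would be ignored.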
@@ -44,6 +39,6 @@ The design should define what the system must be and how its major parts fit tog
 ## Output
-Write the completed design to `./docs/design.md` using the provided template. If a section is not applicable, keep it brief and explain why.
+Write the completed design to `./docs/design.md` using the template already seeded at that path. If a section is not applicable, keep it brief and explain why.
 If the project has meaningful APIs or interface contracts, say that `./docs/api-spec.md` should be completed next and summarize the API families that need specification.

package/assets/slopmachine/phase-1-design-template.md CHANGED

@@ -48,9 +48,11 @@ List only items that do not shrink the original prompt.
 ## 5. Module Design
-| Module | Purpose | Owned behavior | UI surfaces | API/service/job surfaces | Data owned | Key failure/security cases |
-|---|---|---|---|---|---|---|
-|  |  |  |  |  |  |  |
+| Module | Purpose | Owned behavior | UI surfaces | API/service/job surfaces | Data owned | Key failure/security cases | Required test surface |
+|---|---|---|---|---|---|---|---|
+|  |  |  |  |  |  |  |  |
+Each module's test surface must list the specific API endpoints to test, unit test targets, and E2E flows required. Example: `GET /invoices — positive + 403 unauthorized; unit: InvoiceService.calculateTax; E2E: user creates invoice → sees it in list → deletes it`.
 ## 6. Data Design
@@ -98,23 +100,25 @@ Cover where relevant: authentication, route authorization, object authorization,
 This is a design-level strategy, not an execution checklist.
 Required testing contract:
-- All API/interface endpoints must have true HTTP/API test coverage for successful behavior and important negative/error cases where a runtime endpoint exists. If a non-HTTP interface or accepted exception requires another proof layer, state the exception and replacement proof. If there is no API/interface surface, state `Not Applicable` with the reason.
+- All tests must live under their prescribed directories: unit tests under `unit_tests/`, API/integration HTTP tests under `API_tests/`. Both directories are mandatory when the corresponding test surface exists.
+- All API/interface endpoints must have true HTTP/API test coverage for successful behavior and important negative/error cases where a runtime endpoint exists. Test assertions must verify exact expected state transitions, status codes, and response bodies — not permissive "accept any valid outcome" checks. If a non-HTTP interface or accepted exception requires another proof layer, state the exception and replacement proof. If there is no API/interface surface, state `Not Applicable` with the reason.
 - Meaningful business logic must target 90%+ unit coverage. If a component cannot be unit-tested meaningfully, state the exception and the replacement proof layer.
 - Frontend unit tests must be identifiable by file pattern/framework evidence and must import or render real frontend components/modules when a frontend exists.
 - Fullstack or backend-backed frontend work must include proof that real frontend actions reach the intended backend/service behavior.
-- User-facing applications must have full E2E/platform coverage for the main user journeys, including success, validation/failure, and recovery states. If E2E/platform testing is not applicable, state why and what proof replaces it.
+- E2E/platform tests must cover every requirement from the original prompt. Not just the main user journeys — every requirement, actor path, business rule, authorization rule, error state, and task-closure condition must have an identifiable E2E test or an explicit accepted not-applicable reason. E2E tests must exercise real application behavior end to end and verify that the system actually works — they must not be decorative page-load-only tests that skip actual interaction, state verification, data persistence checks, or backend integration. An E2E test that only confirms a page renders without asserting business outcomes is insufficient and must be flagged as a gap.
+- E2E tests must run and pass before any manual verification. If an E2E test cannot verify a particular surface (e.g., requires external service, real email delivery), document the exact boundary and the manual verification that replaces it.
 | Surface / risk | Expected proof layer | Notes |
 |---|---|---|
-| core happy paths |  |  |
-| key failure paths |  |  |
-| security boundaries |  |  |
+| core happy paths | full E2E coverage |  |
+| key failure paths | full E2E coverage + unit/API assertions |  |
+| security boundaries | E2E + unit/API for negative cases |  |
 | API/interface behavior | endpoint tests for every endpoint, including positive and negative cases |  |
-| UI states / interactions |  |  |
-| frontend-to-backend integration paths |  |  |
+| UI states / interactions | E2E coverage for every state |  |
+| frontend-to-backend integration paths | E2E proves real FE ↔ BE behavior |  |
 | unit coverage | 90%+ meaningful business-logic coverage |  |
 | frontend unit/component tests | identifiable tests importing/rendering real components/modules |  |
-| E2E/platform journeys | full main-journey coverage for user-facing apps |  |
+| E2E/platform journeys | covers every prompt requirement with real assertions |  |
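
The template's example surface (`GET /invoices`, positive plus 403 unauthorized) and its requirement for exact status-code and response-body assertions can be sketched as below. This is an illustration under stated assumptions, not code from the package: the tiny server, the `X-Role` header, the `accountant` role, and the seed data are all hypothetical stand-ins for a real application.

```python
import json
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

INVOICES = [{"id": 1, "total": 120}]  # hypothetical seed data


class Handler(BaseHTTPRequestHandler):
    """Stand-in application exposing only GET /invoices."""

    def do_GET(self):
        if self.path != "/invoices":
            self._send(404, {"error": "not found"})
        elif self.headers.get("X-Role") != "accountant":  # assumed auth rule
            self._send(403, {"error": "forbidden"})
        else:
            self._send(200, {"invoices": INVOICES})

    def _send(self, status, payload):
        body = json.dumps(payload).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep test output quiet


def call(base_url, role):
    """Issue a real HTTP request and return (status, parsed JSON body)."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    req = urllib.request.Request(base_url + "/invoices", headers={"X-Role": role})
    try:
        with opener.open(req) as resp:
            return resp.status, json.loads(resp.read())
    except urllib.error.HTTPError as err:
        return err.code, json.loads(err.read())


def run_checks():
    server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base_url = f"http://127.0.0.1:{server.server_port}"
    try:
        ok = call(base_url, "accountant")
        denied = call(base_url, "guest")
    finally:
        server.shutdown()
        server.server_close()
    # Exact assertions: one status and one body per case, no "any 2xx" checks.
    assert ok == (200, {"invoices": [{"id": 1, "total": 120}]})
    assert denied == (403, {"error": "forbidden"})
    return ok, denied
```

The point of the sketch is the shape of the assertions: each case pins the status code and the full response body, rather than accepting any outcome that looks valid.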
 ## 11.1 README / Runtime Gate Strategy

package/assets/slopmachine/phase-2-execution-planning-prompt.md CHANGED

@@ -14,7 +14,8 @@ You will receive:
 - original prompt
 - stack/context information
 - accepted questions/clarifications
-- requirements breakdown
+- full requirements breakdown
+- minified requirements list (core requirements from the breakdown)
 - accepted `./docs/design.md`
 - accepted `./docs/api-spec.md` or not-applicable rationale
 - the plan template
@@ -31,6 +32,7 @@ Create a practical implementation plan that can be translated into concise imple
 - Start with the scaffold/baseline work needed before product modules.
 - Plan by product modules and vertical flows, not by broad file trees.
 - Tests travel with implementation.
+- All testing artifacts must live under the prescribed test directory structure: unit tests under `unit_tests/`, API/integration HTTP tests under `API_tests/`. Both directories are mandatory when the corresponding test surface exists.
 - Every meaningful requirement, clarification, module, API/interface, data object, actor path, security boundary, and user-visible flow needs a responsible module/work package and proof path.
 - Fullstack/backend-backed frontend work needs explicit frontend-to-backend proof.
 - API work needs endpoint/interface proof.
@@ -41,7 +43,8 @@ Create a practical implementation plan that can be translated into concise imple
 - Include a FE-BE integration matrix for fullstack/backend-backed frontend work: frontend action, backend endpoint/service/job, payload/state input, response/side effect, UI states, and proof path.
 - Include a backend-to-frontend exposure check when backend capabilities exist: every prompt-relevant backend capability must have visible exposure or a specific accepted internal/API-only reason.
 - Include README/runtime gates that match the strict README audit: project type near the top, primary `docker compose up --build` for container-supported deliveries, legacy compatibility string `docker-compose up` without making it primary, startup/access, verification, auth/no-auth, all demo credentials/roles when auth exists, seeded values or empty-state statement, configuration/no-secret handling, no manual runtime installs or manual DB setup, test commands, known limitations, and mock/local-data/debug disclosures.
-- Include coverage rigor: 90%+ unit target for meaningful business logic, exact true no-mock HTTP/API endpoint tests, identifiable frontend unit tests that import/render real components/modules, fullstack FE-BE proof, E2E/platform proof, security/negative cases, and final local verification expectations.
+- Include coverage rigor: 90%+ unit target for meaningful business logic, exact true no-mock HTTP/API endpoint tests, identifiable frontend unit tests that import/render real components/modules, fullstack FE-BE proof, security/negative cases, and final local verification expectations.
+- Include an E2E coverage map: every prompt requirement, actor path, business rule, authorization rule, error state, and task-closure condition must have an identifiable E2E test or an explicit accepted not-applicable reason. E2E tests must exercise real behavior and verify business outcomes, not just confirm pages render. Decorative-load-only E2E tests are insufficient and must be called out as gaps.
 - Include module acceptance checks that prevent shell/demo completion: observable behavior, persisted state/artifact or UI/API outcome, relevant negative paths, tests, README impact, and integration evidence.
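
The E2E coverage map described above amounts to a lookup: each requirement needs either at least one identifiable E2E test or a written not-applicable reason. A minimal sketch of that gap check follows; the function name, the requirement identifiers, and the data shapes are hypothetical and are not defined by the package.

```python
def coverage_gaps(requirements, e2e_map, not_applicable):
    """Return requirements with neither an E2E test nor an accepted N/A reason.

    requirements:   list of requirement identifiers
    e2e_map:        requirement id -> list of E2E test identifiers
    not_applicable: requirement id -> accepted reason text
    """
    gaps = []
    for req in requirements:
        if e2e_map.get(req):
            continue  # at least one identifiable E2E test is mapped
        if not_applicable.get(req, "").strip():
            continue  # an explicit, non-empty reason was recorded
        gaps.append(req)
    return gaps
```

For example, with requirements `R1`, `R2`, `R3`, a test mapped to `R1`, an empty test list for `R2`, and a recorded reason for `R3`, only `R2` would be reported as a gap. The check says nothing about whether a mapped test is decorative; that still needs review of the assertions themselves.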
 ## Output Requirements