npm - theslopmachine - Versions diffs - 1.0.13 → 1.0.22 - Mend

theslopmachine 1.0.13 → 1.0.22

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (39)

package/assets/agents/developer.md +6 -7
package/assets/agents/slopmachine-claude.md +66 -9
package/assets/agents/slopmachine.md +68 -9
package/assets/claude/agents/developer.md +5 -1
package/assets/skills/clarification-gate/SKILL.md +56 -20
package/assets/skills/claude-worker-management/SKILL.md +14 -4
package/assets/skills/deep-retrospective/SKILL.md +179 -0
package/assets/skills/deep-retrospective/run.py +446 -0
package/assets/skills/deep-retrospective/workflow-reference.md +240 -0
package/assets/skills/developer-session-lifecycle/SKILL.md +18 -4
package/assets/skills/development-guidance/SKILL.md +52 -31
package/assets/skills/evaluation-triage/SKILL.md +21 -7
package/assets/skills/final-evaluation-orchestration/SKILL.md +92 -28
package/assets/skills/integrated-verification/SKILL.md +38 -42
package/assets/skills/p8-readiness-reconciliation/SKILL.md +31 -10
package/assets/skills/planning-gate/SKILL.md +10 -7
package/assets/skills/planning-guidance/SKILL.md +60 -52
package/assets/skills/retrospective-analysis/SKILL.md +172 -58
package/assets/skills/scaffold-guidance/SKILL.md +18 -6
package/assets/skills/submission-packaging/SKILL.md +11 -3
package/assets/slopmachine/clarifier-agent-prompt.md +7 -6
package/assets/slopmachine/exact-readme-template.md +8 -12
package/assets/slopmachine/owner-verification-checklist.md +1 -1
package/assets/slopmachine/phase-1-design-prompt.md +5 -10
package/assets/slopmachine/phase-1-design-template.md +15 -11
package/assets/slopmachine/phase-2-execution-planning-prompt.md +5 -2
package/assets/slopmachine/phase-2-plan-template.md +14 -4
package/assets/slopmachine/scaffold-playbooks/shared-contract.md +2 -1
package/assets/slopmachine/templates/AGENTS.md +3 -1
package/assets/slopmachine/templates/CLAUDE.md +3 -1
package/assets/slopmachine/test-coverage-prompt.md +8 -1
package/assets/slopmachine/utils/README.md +1 -5
package/assets/slopmachine/utils/claude_live_common.mjs +2 -5
package/assets/slopmachine/utils/prepare_evaluation_send_packet.mjs +3 -3
package/package.json +1 -1
package/src/constants.js +0 -9
package/src/init.js +17 -24
package/src/install.js +30 -28
package/assets/slopmachine/utils/prepare_evaluation_prompt.mjs +0 -81

package/assets/skills/deep-retrospective/workflow-reference.md ADDED

@@ -0,0 +1,240 @@
+# SlopMachine Workflow Reference
+Expected behavior and rules for each phase. Use this to judge whether the workflow ran correctly.
+---
+## Overall Rules
+### Non-Negotiable Verbatim Prompt Paste
+Every packaged `.md` prompt (clarifier, faithfulness review, design, execution planning, evaluation, test coverage) must be read fresh from `~/slopmachine/` and pasted verbatim into the subagent message. Never summarize, describe, shorten, paraphrase, or tell the worker to read the file themselves. Violation invalidates the workflow action.
+### Worker Communication Firewall
+Developer/worker sessions must never see: phase names/numbers, workflow gates, lifecycle mechanics, owner/worker terminology, Beads, metadata, `../.ai`, hidden plans/private reports, evaluator mechanics. Prompts must sound like a human lead engineer, not an orchestration system.
+### Natural Language Prompting
+No markdown formatting in developer prompts. Plain English sentences only. No bullet lists, bold, backticks, dashes. Write like a human: "I checked the module and found these issues."
+### Session Integrity
+Sessions are the primary deliverable. Never edit, rename, restructure, or delete session files. Never perform off-session work. Never return to a closed session.
+### State Management
+- Owner-private files live under `../.ai` and `../.beads`
+- `../.ai/metadata.json` is the control plane (run ID, phase, sessions, artifacts)
+- Beads is the durable ledger (phase lifecycle, blockers, evidence)
+- Both must be kept aligned at every phase transition
+### Docker Deferred
+Docker and `run_tests.sh` are deferred to Phase 6/7. Never run Docker during earlier phases.
+---
+## Phase 1: Clarification
+**Expected behavior:**
+- Verify `./metadata.json` has the exact original prompt
+- Launch clarification worker subagent via `task` tool with `~/slopmachine/clarifier-agent-prompt.md` verbatim
+- Launch faithfulness review subagent via `task` tool with `~/slopmachine/clarification-faithfulness-review-prompt.md` verbatim
+- Patch all drift findings in questions.md and requirements-breakdown.md
+- Record artifact creation and acceptance in metadata and Beads
+- Exit when `clarification-gate` is satisfied
+**What to look for in traces:**
+- Did the clarifier prompt get sent verbatim?
+- Did the faithfulness review run and find drift?
+- Were drift findings actually patched?
+- Did questions.md cover environment/trust-boundary ambiguities?
+- Did requirements-breakdown.md classify implied defaults with risk tiers?
+## Phase 2: Planning
+**Expected behavior:**
+- Use the same primary developer session established/continued
+- Follow deterministic planning sequence exactly:
+  1. Send original prompt + "Don't write code yet — we'll plan this first."
+  2. After acknowledgement, send clarifications as natural-language requirements (no REQ-### IDs)
+  3. After acknowledgement, send `~/slopmachine/phase-1-design-prompt.md` verbatim
+- Developer fills `./docs/design.md` and `./docs/api-spec.md`
+- Owner verifies design quality against all requirements
+- Launch general subagent with `~/slopmachine/phase-2-execution-planning-prompt.md` + `~/slopmachine/phase-2-plan-template.md` verbatim to create `../.ai/plan.md`
+- Exit when `planning-gate` is satisfied
+**What to look for in traces:**
+- Was the sequence strictly followed (step 1→2→3 with acknowledgement between each)?
+- Was the design prompt pasted verbatim?
+- Was the design review thorough or cursory (spot-check)?
+- Did the plan include runtime verification strategy, JSON contracts, browser verification matrix, seeded-data tests?
+- Were clarifications sent as plain sentences (no REQ-### IDs or workflow metadata)?
+- Was the plan created owner-private (subagent, not developer)?
+## Phase 3: Development
+**Expected behavior:**
+- Continue same primary developer session
+- Scaffold first, then module by module
+- Each module prompt: casual human language, one bounded slice, no workflow mechanics
+- After each module: verify against design, plan, and original prompt
+- **Owner must start the application locally at scaffold acceptance and every module boundary** — do not accept based on test output alone
+- **Developer must write cross-module integration tests** proving data/behavior flow between each new module and all previously built modules
+- Owner must verify each module before accepting completion
+- Local test harness only — no Docker
+- After all modules: ask developer for self-check against design/API docs
+- Exit when all modules complete, scaffold accepted, app verified at every boundary, cross-module tests exist, self-check done
+**What to look for in traces:**
+- Were prompts natural language? No markdown, no bullet lists, no workflow terminology?
+- Did the developer actually build real behavior or decoration (mock factories, permissive tests)?
+- Were tests strict (exact state/status assertions) or permissive (accept any outcome)?
+- Was the application ever run (`docker compose up`, `npm start`, `go run`)?
+- Did the owner start the app at scaffold acceptance and each module boundary?
+- Did the owner ask "did you run it?" or "what do these tests prove?"
+- **Were cross-module integration tests written** when a new module connected to previous ones?
+- Was each module accepted based on "X tests pass" without runtime verification?
+- Were assertions checking specific business behavior or just presence/counts?
+**Signs of weak development:**
+- "Tests pass" used as sole completion signal
+- "Let me fix the tests to accept either status" (permissive assertions)
+- "Mock factories need to export..." (mock-heavy testing)
+- Module completion in < 10 minutes for large scope (speed over quality)
+- No `docker compose`, `npm start`, or browser verification anywhere
+- Test counts used as completion evidence without content inspection
+## Phase 4: Integrated Verification
+**Expected behavior:**
+- Close normal work in primary developer session
+- Start a new bugfix session
+- Owner plan-based review first (before evaluator loop)
+- Run internal evaluator loop: 5 passes, same evaluator session
+  - Pass 1: full evaluation prompt verbatim (`~/slopmachine/backend-evaluation-prompt.md`) via `prepare_evaluation_send_packet.mjs`
+  - Passes 2-5: follow-up prompt asking for different issues (NOT full prompt again): "Using the same review strategy, do another pass and look for a different set of material issues that were not already covered. Focus on independent Blocker, High, security, or prompt-fit risks. Write the next report to the requested path."
+  - **All 5 passes must run** — do not stop early unless the evaluator produces zero new findings in two consecutive passes
+  - Each pass writes to `../.ai/internal-verification/report-<N>.md`
+  - All issues collected in `../.ai/consolidated-internal-issues.md`
+- **For web/fullstack projects, run browser verification with agent-browser** — exercise every README credential, every core user journey, key prompt requirements
+- Batch ALL issues (plan-based review + all 5 passes + browser findings) before sending ANY to bugfix lane
+- Send consolidated issues to bugfix developer in broad human language
+- Run local non-Docker verification (local test harness)
+- Never run Docker or `run_tests.sh` during this phase
+- Exit when all plan-based review issues are fixed, all 5 evaluator passes complete, browser verification run (web/fullstack), local verification pass, surfaces coherent
+**What to look for in traces:**
+- How many evaluator passes were actually run? (Must be 5, or explicitly justified early stop)
+- Were passes 2-5 using the follow-up prompt (not the full prompt)?
+- Were findings batched into one consolidated file before sending to bugfix?
+- Did the bugfix developer reproduce failures before fixing?
+- Did the bugfix developer run the app or just fix static code?
+- **Was browser verification attempted?** Was `agent-browser` available and used? (Mandatory for web/fullstack)
+- What runtime bugs should have been caught at this phase?
+- Did the owner accept test counts as proof of fix completion?
+**Signs of weak verification:**
+- Only 1-2 evaluator passes instead of 5
+- Full prompt re-pasted for passes 2-5 (wasteful, wrong)
+- Findings sent to bugfix one at a time instead of batched
+- No browser verification for fullstack/web projects
+- Bugfix developer never ran the application
+- Owner accepts "tests pass" without inspecting what changed
+- Phase completed too quickly (< 1 hour for substantial projects)
+## Phase 5: Evaluation
+**Expected behavior:**
+- Two audit cycles, each with fresh evaluator session
+- Each audit: paste full evaluation prompt verbatim (read fresh from installed asset)
+- If Fail: fix via bugfix lane, then send exact regeneration prompt (fresh prompt, no mention of previous fixes)
+- Each cycle must produce: `./.tmp/audit_report-<N>.md` (150+ lines) and `./.tmp/audit_report-<N>-fix_check.md`
+- After both audit cycles: close bugfix lane, start test-coverage lane
+- Run coverage/README audit with `~/slopmachine/test-coverage-prompt.md` verbatim
+- Target: >= 90% test score
+- Exit when both audit cycles complete with kept reports and fix-checks, coverage/README audit passes
+**What to look for in traces:**
+- Was the evaluation prompt pasted verbatim?
+- How many regeneration cycles were needed? (More than 2-3 = problem)
+- Did evaluators have file-write capability? If not, how were reports saved?
+- What was the coverage score? If < 90, why?
+- Was the score gap addressed by fixing root causes or creating evidence copies?
+- How many coverage evaluator sessions were spawned? (More than 5 = coverage spiral)
+**Signs of evaluation issues:**
+- > 5 evaluator sessions for coverage scoring (coverage spiral)
+- Reports mention "cannot write files"
+- Score gap persists after multiple sessions
+- Build-tag-gated evidence copies created to game score
+- Regeneration prompt not sent verbatim
+## Phase 6: Final Readiness
+**Expected behavior:**
+- Run Docker: `docker compose up --build`
+- Run `./repo/run_tests.sh` (dockerized)
+- Browser verification with `agent-browser`: verify ALL prompt requirement surfaces, ALL README credentials, ALL seeded values, ALL core user journeys
+- Batch all browser findings before sending to developer
+- Fix + rerun checks until green or risk-accepted
+- Exit when all readiness categories pass
+**What to look for in traces:**
+- Did Docker/runtime checks pass on first try or require multiple retries?
+- Was browser verification decorative (page-render only) or substantive (button flows)?
+- What bugs were discovered only here? These are indictments of earlier phases.
+- Was `agent-browser` available? Was it checked (`command -v agent-browser`)?
+## Phase 7: Submission Packaging
+**Expected behavior:**
+- Package root contains only: .git, .gitignore, .tmp, docs, metadata.json, repo
+- No workflow-private files (../.ai, ../.beads) in package
+- `unit_tests/` and `API_tests/` present with runnable tests
+- README is Docker-contained (no local test commands)
+- No last-minute developer fixes (fixes at this phase mean earlier phases failed)
+- Exit when package structure matches allowlist
+**What to look for in traces:**
+- Were there last-minute developer fixes? (Should be none)
+- Were test directories populated with runnable tests or evidence copies?
+- Did credential verification pass for all seeded accounts?
+- Was metadata/session hygiene maintained?
+## Phase 8: Retrospective
+**Expected behavior:**
+- Write retrospective after packaging closes
+- Review all mandatory evidence sources
+- Produce late-finding origin table
+- Capture improvement actions
+- Duration should be > 5 minutes for meaningful projects
+---
+## Developer Lane Rules (from developer.md)
+Tests must:
+- Prove behavior and side effects, not only existence or rendering
+- Be directly runnable from unit_tests/ and API_tests/ (not build-tag-gated copies)
+- Assert exact expected state transitions, status codes, response bodies (not permissive)
+- Use real backend for frontend tests (mock only external dependencies)
+- Cover negative/boundary paths (unauthenticated, unauthorized, not found, etc.)
+Implementation must:
+- Complete coherent behavior end to end through real app path
+- Wire UI → client call → handler/service → persistence/state
+- Never create fake success paths, no-op jobs, disconnected forms
+---
+## Known Anti-Patterns (Playbook)
+1. **Permissive testing:** Tests accept "either status" or "any valid outcome" instead of exact expected state
+2. **Mock-heavy testing:** Mock factories/services instead of real DB, real HTTP, real browser
+3. **No runtime verification:** Application never run during development
+4. **Completion by test count:** Module declared complete based on "X tests pass" alone
+5. **Coverage spiral:** > 5 evaluator sessions spawned for coverage scoring with no score improvement
+6. **Build-tag evidence:** Tests in unit_tests/ and API_tests/ that don't actually run
+7. **Rushed reviews:** Phase 2 or Phase 4 completed in < 30 minutes for substantial projects
+8. **Fan-out collapse:** Multiple planned parallel lanes collapse into single serial developer session
+9. **Missed environment checks:** Network/trust-boundary ambiguities not caught during clarification
+10. **Module isolation:** Modules built in isolation with no cross-module integration tests proving data/behavior flow between them
+11. **Skipped evaluator passes:** Fewer than 5 internal evaluator passes run in Phase 4
+12. **No browser verification before P6:** Web/fullstack projects reach P6 without any browser verification, finding crashes that should have been caught in P3 or P4
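Reviewer note (not part of the package): the Phase 4 pass rule in the reference above requires five internal evaluator passes, with an early stop allowed only when two consecutive passes produce zero new findings. A minimal sketch of that check follows. The function name and the list-of-counts input are hypothetical, chosen for illustration; the package itself stores each pass as a Markdown report, and this is not its implementation.

```python
from typing import Sequence

# Five passes, as stated in the Phase 4 rule of the workflow reference.
REQUIRED_PASSES = 5


def passes_rule_satisfied(new_findings_per_pass: Sequence[int]) -> bool:
    """Return True if the recorded passes satisfy the Phase 4 pass rule.

    new_findings_per_pass[i] is the count of new findings in pass i + 1.
    This input shape is an assumption made for the sketch.
    """
    if len(new_findings_per_pass) >= REQUIRED_PASSES:
        return True
    # An early stop is justified only when the last two passes that ran
    # both produced zero new findings.
    return (
        len(new_findings_per_pass) >= 2
        and new_findings_per_pass[-1] == 0
        and new_findings_per_pass[-2] == 0
    )
```

Under this reading, a run of `[3, 0, 0]` is an acceptable early stop, while `[3, 2]` matches the "skipped evaluator passes" anti-pattern.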

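Reviewer note (not part of the package): the Phase 7 rule in `workflow-reference.md` limits the package root to `.git`, `.gitignore`, `.tmp`, `docs`, `metadata.json`, and `repo`. The sketch below shows one way such an allowlist check could look. The function name is hypothetical, and the rule text does not say whether every allowed entry must also be present, so this only reports extras.

```python
from pathlib import Path
from typing import List

# Allowlist copied from the Phase 7 rule in the workflow reference.
ALLOWED_ROOT_ENTRIES = {".git", ".gitignore", ".tmp", "docs", "metadata.json", "repo"}


def unexpected_root_entries(package_root: Path) -> List[str]:
    """Return the sorted entry names in package_root that are not allowlisted."""
    return sorted(
        entry.name
        for entry in package_root.iterdir()
        if entry.name not in ALLOWED_ROOT_ENTRIES
    )
```

An empty result would correspond to the exit condition "package structure matches allowlist"; any returned name is a candidate for removal before packaging closes.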
package/assets/skills/developer-session-lifecycle/SKILL.md CHANGED

@@ -7,9 +7,23 @@ description: Developer, Claude, evaluator, and metadata lifecycle rules for slop
 Use this skill for startup preflight, session policy, metadata consistency, lane handoffs, and recovery.
+## Session Integrity (Highest Priority)
+Sessions are the primary deliverable. An incomplete or corrupted session dataset invalidates the submission regardless of code quality. Every session must be preserved, continuous, and authentic.
+- Do not edit, rename, restructure, rewrite, clean, delete, or fabricate session files or trajectory records. They are immutable evidence.
+- Do not perform untracked implementation work. Developer/Claude implementation, debugging, and substantive product fixes must happen inside a tracked implementation session. Owner orchestration, verification commands, package checks, and tiny safe owner-side docs/config/wrapper/glue fixes are allowed when recorded in metadata/Beads and the active implementation lane is notified afterward.
+- Sessions must progress strictly forward. Never return to a closed session. The lifecycle is:
+  1. Development session → complete → stop/close
+  2. Bugfix session → complete both audit cycles → close
+  3. Test-coverage/reconciliation session → complete coverage/README/Docker/runtime fixes → close
+- If a session becomes genuinely unrecoverable (crash with no salvageable `sid` — even after attempting tmux relaunch with the known `sid` — and transcript/session lookup also fails), start a new session in the same lane. The sessions remain sequential and a clear timeline can be established. This is the only exception to the single-session-per-lane rule. Paused, rate-limited, or waiting states are not unrecoverable — stay in the same session.
+- Each closed session must be recorded as closed in metadata and Beads. No work may resume in a closed session.
 ## Preflight
 - Confirm cwd is task root `./`.
+- Confirm the current working directory is the task root (the directory containing `repo/`, `docs/`, and `metadata.json`). If the root is not `task/` or its equivalent, stop and reject: sessions must be started from the task root, not from inside `repo/` or any subdirectory.
 - Confirm product repo exists at `./repo`.
 - Confirm workflow-private root exists at `../.ai`, workflow state exists at `../.ai/metadata.json`, and Beads root exists at `../.beads` when initialized.
 - Confirm task docs are limited to `./docs/questions.md`, `./docs/design.md`, and `./docs/api-spec.md` when applicable.
@@ -22,17 +36,17 @@ Use this skill for startup preflight, session policy, metadata consistency, lane
 - Use one primary implementation lane by default from first product orientation through design, implementation, and local verification.
 - Keep exactly one implementation lane active at a time for the current phase purpose. Any inactive prior lane must be recorded as closed, parked, replaced, or superseded before another lane becomes active.
-- Additional implementation lanes are allowed only for context limits, unrecoverable session failure, explicit user instruction, or concrete low-risk bounded work that cannot safely share context.
-- Rate limits, slow turns, shell timeouts, and recoverable session interruptions are not reasons to stop the workflow. Wait, recover, or resume the same active session/lane using the appropriate packaged tooling and continue until the full workflow closes.
+- **A paused session is not an invitation to launch a new one.** Rate limits, slow turns, shell timeouts, tmux interruptions, and waiting states are recovery conditions — stay in the same session. Only launch a new session (in the same lane) if the existing session absolutely cannot be revived: the `sid` is lost, tmux state is unrecoverable, and transcript/session lookup fails. Context exhaustion is a valid reason for a new session only after the existing session's context is genuinely exhausted and cannot continue productively.
 - Record the reason for any new lane in `../.ai/metadata.json` and Beads before using it.
+- If a phase is deliberately reopened for repair, pause before the next implementation/evaluator turn and ask the user whether to continue with the latest viable lane session or replace it with a new session. Do not assume this choice on reopen.
 - Before every developer or Claude turn, confirm `../.ai/metadata.json` names the active lane/session you are about to use.
 - After every developer or Claude turn, update metadata with the latest active lane/session, last turn purpose, result status, and any phase/report/session fields that changed.
 - Register every implementation session used in metadata and Beads with session id, lane name, purpose, status, and recovery/replacement relationship when applicable.
 - Add a Beads `SESSION:` comment for launch/resume/replacement.
 - Add `VERIFY:`, `ISSUE:`, or `HANDOFF:` comments when a turn produces accepted evidence, defects, or a work handoff.
 - Phase 4 starts and uses the dedicated bugfix/fix-check implementation lane for remediation and verification guidance; the original development lane is closed for normal implementation at development completion.
-- Phase 5 uses fresh evaluator sessions for full audits; the only evaluator-session reuse is scoped fix-check for a kept Partial Pass report.
-- Product fixes from Phase 5 final audits go through the dedicated evaluation bugfix/fix-check implementation lane.
+- Phase 5 uses fresh evaluator sessions for full audits. Evaluator-session reuse occurs in three cases: (1) fail-regeneration — same session receives only the regeneration prompt after fixes; (2) Partial Pass fix-check — same session verifies all scoped issues are fixed; (3) Pass-with-items fix-check — same session verifies all scoped recommendations are closed.
+- Audit Cycle 1 and Audit Cycle 2 fixes go through the dedicated bugfix lane. After both audit cycles complete, the bugfix lane is closed and a new test-coverage/final-reconciliation lane takes over for coverage/README remediation.
 - Phase 6 reconciliation uses the currently active developer/Claude lane for Docker, runtime, browser, account, `run_tests.sh`, coverage, README, and final readiness fixes unless a new lane is explicitly justified by context limits or isolation risk.
 - If the owner directly changes docs, wrappers, config, cleanup, or light glue, send the active lane a minimal acknowledgement request that names the changed surface and asks it to inspect/confirm before readiness continues.
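Reviewer note (not part of the package): the Session Integrity section added in this hunk describes a strictly forward lifecycle (development, then bugfix, then test-coverage/reconciliation) in which a closed session is never reopened. A minimal sketch of that invariant follows. `SessionLedger` and its methods are hypothetical stand-ins for the metadata/Beads records, and the sketch ignores the unrecoverable-session exception the skill describes.

```python
from dataclasses import dataclass, field
from typing import List

# Lane order taken from the lifecycle list in the skill text.
LANE_ORDER = ["development", "bugfix", "test-coverage"]


@dataclass
class SessionLedger:
    """Hypothetical in-memory stand-in for the session record."""

    closed: List[str] = field(default_factory=list)
    active: str = "development"

    def close_active_and_advance(self) -> None:
        """Record the active lane as closed and move to the next lane, if any."""
        self.closed.append(self.active)
        idx = LANE_ORDER.index(self.active)
        # An empty string means every lane has been closed.
        self.active = LANE_ORDER[idx + 1] if idx + 1 < len(LANE_ORDER) else ""

    def can_work_in(self, lane: str) -> bool:
        """Work is allowed only in the active lane, never in a closed one."""
        return lane == self.active and lane not in self.closed
```

After one call to `close_active_and_advance()`, `can_work_in("development")` is false and `can_work_in("bugfix")` is true, which mirrors the rule that no work may resume in a closed session.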

package/assets/skills/development-guidance/SKILL.md CHANGED

@@ -12,6 +12,7 @@ Use this skill during `Phase 3: Development` before prompting the active develop
 - Continue using the same developer/Claude session established during planning.
 - Development starts with scaffold/baseline work.
 - After scaffold is accepted, proceed section by section or module by module from `./docs/design.md` and `./docs/api-spec.md` when applicable.
+- **All development iteration, testing, and verification uses the local test harness only.** Docker and `run_tests.sh` are deferred to Phase 6/7.
 - The owner may use `../.ai/plan.md` privately to check scope, tests, coverage, discoverability, functionality, and sequencing.
 - Never mention `../.ai/plan.md`, the existence of an internal plan, Beads, metadata, phase names, workflow mechanics, or private checks to the developer/Claude lane.
 - Developer-facing references are limited to visible project context such as `./docs/design.md`, `./docs/api-spec.md`, `./docs/questions.md`, `./repo`, and the current human prompt.
@@ -23,12 +24,14 @@ Prompt like a human developer working with an AI coding assistant.
 Prompt one bounded slice at a time. The preferred unit is one phase-purpose, scaffold, module, work package, or fix batch. At most combine two adjacent tightly coupled slices in one prompt, and only when splitting them would make the work less coherent. Never send all phases, the full private plan, or a start-to-finish workflow packet to the developer/Claude lane.
 Use direct wording such as:
-- `I checked the user module and found a missing authorization test. Please add that and rerun the relevant tests.`
-- `Continue with the invoice module. Build the create/list/detail flow against the existing product contract and cover the main success and validation paths.`
-- `The scaffold looks mostly right, but the README still describes a command that does not exist. Fix the README and rerun the local smoke check.`
+- I checked the user module and found a missing authorization test. Please add that and rerun the relevant tests.
+- Continue with the invoice module. Build the create, list, and detail flow against the existing product contract and cover the main success and validation paths.
+- The scaffold looks mostly right, but the README still describes a command that does not exist. Fix the README and rerun the local smoke check.
 Do not send robotic process language. Do not require a specific response format. Do not repeat standing instructions every turn. Do not dump, name, summarize, or mention the private plan. Give only the current objective, broad module/surface area, discovered issues, and useful verification request.
+**No markdown formatting in prompts.** Write plain English sentences with normal grammar and punctuation. Do not use bullet lists, numbered lists, bold, backticks, dashes as list markers, or any markdown syntax when writing to the developer lane. Structured markdown belongs only in verbatim-pasted packaged prompts and evaluator instructions, not in direct human prompts to the implementation session.
 Do not keep restating visible doc paths in routine follow-up prompts when the same session already knows the project contract. It is fine to say `existing product contract`, `accepted docs`, or simply name the module. Mention exact doc paths only when orienting a new session, resolving confusion, or asking for a final contract check.
 For larger module slices, group expectations by user/business behavior instead of turning every endpoint, field, and negative case into a long checklist. Ask for real backend-backed behavior, visible UI states, and meaningful success/failure tests, but keep the wording natural. If a module is too large to explain without becoming a checklist packet, split it into smaller sequential prompts.
@@ -36,18 +39,16 @@ For larger module slices, group expectations by user/business behavior instead o
 Example of a good larger module prompt:
 ```text
-Continue with inventory parcel intake and production planning materials.
-Build these as real backend-backed workflows, not static screens. Parcels should cover intake, duplicate detection, import status/errors, photo-label uploads, print-label job outcomes, revisions, closing behavior, pickup-code reuse, and audit history. Materials/planning should cover materials, UOM conversions, lots, BOM versioning/effective windows, approvals, substitutes, yield loss, and costing based on the latest received lot costs with missing-cost handling.
+Continue with inventory parcel intake and production planning materials. Build these as real backend-backed workflows, not static screens. Parcels should cover intake, duplicate detection, import status and errors, photo and label uploads, print label job outcomes, revisions, closing behavior, pickup code reuse, and audit history. Materials and planning should cover materials, UOM conversions, lots, BOM versioning with effective windows, approvals, substitutes, yield loss, and costing based on the latest received lot costs with proper handling for missing costs.
-On the Angular side, make the inventory and materials workspaces feel complete: loading, empty, validation, submitting, duplicate, printing, missing-cost, success, and error states should all be visible where they matter.
+On the frontend, make the inventory and materials workspaces feel complete with loading, empty, validation, submitting, duplicate, printing, missing cost, success, and error states where they matter.
-Add focused tests for the important happy paths and failure paths, especially duplicate parcels, pickup-code collisions/reuse, invalid phone/tracking/import rows, printer unavailable, invalid UOM conversions, BOM effective-window conflicts, approval audit trails, latest-lot costing, and missing-cost behavior. Run the targeted checks when you're done.
+Add focused tests for the important happy paths and failure paths, especially duplicate parcels, pickup code collisions and reuse, invalid phone tracking or import rows, printer unavailable scenarios, invalid UOM conversions, BOM effective window conflicts, approval audit trails, latest lot costing, and missing cost behavior. Run the targeted checks when you are done.
 ```
-When sending issues back, do not pass file names, line numbers, report snippets, or exact internal evidence unless the user explicitly asks for that. Keep it at the level of the module and behavior: `I found issues in the auth module. The access control case for other users' records is not covered properly, and the tests are missing that case.`
+When sending issues back, do not pass file names, line numbers, report snippets, or exact internal evidence unless the user explicitly asks for that. Keep it at the level of the module and behavior: I found issues in the auth module. The access control case for other users records is not covered properly and the tests are missing that case.
-Do not say `the review found`, `the evaluation found`, or `the audit found`. The owner should speak naturally: `I checked this and found...`.
+Do not say the review found, the evaluation found, or the audit found. The owner should speak naturally: I checked this and found something.
 ## Development Sequence
@@ -63,26 +64,37 @@ Do not say `the review found`, `the evaluation found`, or `the audit found`. The
 3. **Proceed module by module.**
    - Select the next section/module from `./docs/design.md` and the private plan.
+   - Before prompting, consult `../.ai/plan.md` for the module's expected test coverage, endpoints, E2E flows, acceptance criteria, and any specific cases the plan lists. Use those details to ask for specific cases in the prompt.
    - Prompt the developer using the docs only, one module/work package at a time by default.
    - Ask for the implementation and the relevant tests/checks for that module.
    - Combine two adjacent modules/work packages only when they share the same user flow or data contract and are easier to verify together.
-4. **Owner checks after each module.**
-   - Inspect changed files manually.
-   - Compare behavior against the original product prompt in `./metadata.json`.
-   - Compare behavior against `./docs/design.md` and `./docs/api-spec.md`.
-   - Privately compare against `../.ai/plan.md` for tests, coverage, discoverability, functionality, and module completeness.
-   - Run targeted checks when practical.
-   - Send missing tests, improper functionality, design drift, failed checks, or integration gaps back to the same session in broad module/product language.
-   - Do not move to the next module until the current module is acceptably resolved or a concrete blocker is recorded.
+4. **Owner verifies each module before moving on.**
+    - Inspect changed files manually for orphaned or disconnected code.
+    - Verify implementation against the original prompt, `./docs/design.md`, and `./docs/api-spec.md`.
+    - Compare against `../.ai/plan.md` as the reference: check that every planned test case, endpoint, and E2E flow from the plan for this module is present and passing. Cross-reference the module's row in the plan's ordered work packages, API coverage matrix, FE-BE integration matrix, and risk/negative coverage matrix.
+    - Run the specific local tests for the module's changed files and confirm they pass. If no targeted test exists for the changed surface, that is itself a gap to send back.
+    - **Start the application locally (not Docker) and verify it is reachable.** For web/fullstack projects, confirm the dev server starts and exercise at least one real flow through the module you just accepted. For API-only projects, confirm the server starts and hit at least one endpoint from the module. If the app does not start or the module is unreachable, the module is not complete -- send it back with the exact failure.
+    - Verify the implementation is truly wired: imports resolve, routes register, frontend components render and connect, API calls reach real handlers, data flows through real persistence or state paths. Do not accept disconnected files, unused routes, stub-only components, or fake-success paths.
+    - Verify the behavior is real, not a placeholder, shell route, hardcoded response, static demo data, or stub that pretends to be real.
+    - **Verify cross-module integration tests exist.** When the new module connects to previously built modules, confirm the developer wrote integration tests proving data and behavior flow between them. If no cross-module tests exist, send that back as a gap.
+    - Send issues back to the same session in natural language without file paths, line numbers, or report names.
+    - Do not move to the next module until the current module is acceptably resolved or a concrete blocker is recorded.
 5. **Repeat until development is complete.**
    - Keep corrections in the same active session unless a concrete context/recovery reason requires otherwise.
    - Record session turns, artifacts, verification evidence, issues, and handoffs in metadata and Beads.
-6. **Ask for final implementation self-check.**
-   - Once all modules are completed and the developer claims the implementation is ready, ask the same session one final broad check.
-   - The prompt should ask them to compare the implementation against `./docs/design.md` and `./docs/api-spec.md` when applicable, verify whether everything is complete, and report any gaps they find.
+6. **Run a full requirements integrity sweep before the final self-check.**
+   - Re-read the original prompt from `./metadata.json` and compare every explicit core requirement against the implementation. Do not assume the plan or design captured everything — the prompt is the source of truth. If any prompt requirement is missing from the implementation, that is a gap regardless of whether the plan or design listed it.
+   - Open `../.ai/plan.md` and run the no-orphan ledger against the full implementation. Every requirement in the ledger must map to a delivered surface in code, a test, a documented behavior, or have an accepted not-applicable reason with user confirmation. Do not rely on memory — go through each ledger item one by one.
+   - Cross-reference the design's requirement mapping table (section 2.1 in `./docs/design.md`) against the implementation. Every row should be addressable in the codebase.
+   - If any requirement is missing from the implementation (whether from the prompt, the ledger, or the design table), send it to the developer session in natural language before the final self-check. Do not batch everything into the self-check — route missing requirements to the active session as separate fix work.
+   - Only proceed to the final self-check when the ledger is clean, all explicit prompt requirements are addressed, and every exception is explicitly recorded and risk-accepted.
+7. **Ask for final implementation self-check.**
+   - Once all modules are completed, the ledger is clean, and the developer claims the implementation is ready, ask the same session one final broad check.
+   - The prompt should ask them to compare the implementation against `./docs/design.md`, `./docs/api-spec.md`, and the requirements shared in Step 2 of planning. Verify whether everything is complete and report any gaps they find.
    - Also ask for the startup commands and the expected user/API flows to exercise the app.
    - Keep the message human and simple. Do not mention internal plans or their existence, phases, workflow mechanics, evaluation, or hidden state.
@@ -96,21 +108,28 @@ Also give me the startup commands and the main flows I should expect to exercise
 ## Owner Review Checklist
-For each scaffold/module, check:
+For each scaffold/module, the owner must verify:
 - changed files are integrated, referenced, and not orphaned
 - implementation matches `./docs/design.md`
 - API/interface behavior matches `./docs/api-spec.md` when applicable
 - private plan rows for the module are closed or have concrete accepted exceptions
 - no-orphan ledger items assigned to the module are closed
-- project-specific behavior is real, not placeholder/shell/demo-only behavior
-- tests exist for the implemented behavior or a concrete exception is recorded
-- planned API/interface proof is present when the module owns endpoints/interfaces, with true no-mock HTTP/API endpoint tests where applicable
-- frontend unit tests are directly detectable and import/render real frontend components/modules when the module owns frontend behavior
-- planned FE-BE proof is present when the module crosses frontend/backend boundaries
-- failure, validation, authorization, ownership, empty, loading, error, and duplicate/re-entry cases are covered where relevant
-- frontend/backend wiring is real where applicable
+- project-specific behavior is real, not placeholder/shell/demo-only/fake-success behavior
+- **the application starts and runs locally** -- verify the dev server starts without crashing, the module is reachable, and at least one real flow works through the module you just accepted. Do not accept a module based on test output alone
+- tests exist for the implemented behavior; run them and confirm they pass. If no test exists for the changed surface, that is a gap
+- **cross-module integration tests exist** when the module connects to previously built modules -- confirm real data/behavior flow tests, not just file presence
+- tests under `unit_tests/` and `API_tests/` are directly runnable from those directories — not build-tag-gated evidence copies, compile-time-only files, or infrastructure checks that only verify file counts or presence. Every test must exercise and verify specific business behavior
+- planned API/interface proof is present when the module owns endpoints/interfaces, with true no-mock HTTP/API endpoint tests where applicable; run the relevant API tests and confirm they pass. API test assertions must verify exact expected state transitions, status codes, and response bodies — not permissive "accept any valid outcome" checks
+- frontend unit tests are directly detectable and import/render real frontend components/modules when the module owns frontend behavior; run them and confirm they pass
+- planned FE-BE proof is present when the module crosses frontend/backend boundaries; exercise the flow locally and confirm real data reaches real handlers
+- E2E tests cover every prompt requirement for the module, not just the main happy path. Each E2E test must assert business outcomes — state changes, data persistence, authorization enforcement, task closure — not just confirm pages render. Decorational E2E tests that only check page loads are insufficient and must be sent back as gaps
+- failure, validation, empty, loading, error, and duplicate/re-entry cases are covered where relevant
+- logs are meaningful and support troubleshooting, not absent or random print noise
+- all applicable security surfaces are individually addressed: authentication, route authorization, object-level authorization, function-level authorization, tenant/user data isolation, and admin/internal/debug endpoint protection. Each surface must have enforcement visible in code and tests
+- frontend/backend wiring is real: imports resolve, routes register, API calls reach real handlers, data flows through real persistence or state, components render in the app context not just in isolation
+- pages are connected to each other and interaction flows complete through to task closure, not just isolated screens with static outcomes
 - README changes match delivered runtime, commands, auth/no-auth, seed/demo data, verification behavior, mock/local/debug boundaries, and strict startup/access gates
-- targeted checks ran or were clearly blocked
+- local tests for the module's changed surfaces were run and passed before acceptance
 ## Internal Plan Alignment
@@ -127,7 +146,7 @@ Check the relevant module/work package against:
 - module acceptance checklist
 - integration and hardening notes
-If the implementation misses one of those expectations, translate it into a normal human issue for the developer without file/line references. Example: `I checked the invoice work. The create flow is there, but the missing-amount validation is not covered and the list still behaves like static data. Please wire it properly and add the missing test.`
+If the implementation misses one of those expectations, translate it into a normal human issue for the developer without file/line references. Example: I checked the invoice work. The create flow is there, but the missing-amount validation is not covered and the list still behaves like static data. Please wire it properly and add the missing test.
 ## Completion Standard
@@ -137,6 +156,8 @@ Accept development only when:
 - scaffold is accepted
 - all planned modules/sections are implemented or have accepted not-applicable reasons
 - module-level issues found by owner review are resolved
+- **the application was started and verified locally at every module boundary** -- the app must compile, start, and serve real behavior (not crash or serve static shells) before each module acceptance
+- **cross-module integration tests exist for every module pair that shares a data, API, or UI boundary** -- tests must prove real behavior flow, not just file presence
 - the final implementation self-check has been requested from the same active session and any reported issues have been fixed or recorded as concrete risks
 - startup commands and expected local flows have been collected from the developer session
 - targeted tests/checks for changed surfaces have passed or are honestly blocked
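
Reviewer note on the hunks above: the no-orphan ledger rule added in step 6 (each requirement maps to delivered code, a test, or a documented behavior, or carries a not-applicable reason confirmed by the user) can be read as a simple closure check. The sketch below is only an illustration of that rule. The package's actual ledger format is not shown in this diff, so `LedgerItem`, its field names, and `open_items` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical evidence kinds, named after the wording of step 6.
EVIDENCE_KINDS = {"code", "test", "documented_behavior"}


@dataclass
class LedgerItem:
    """One requirement row in an assumed ledger layout (not the package's real schema)."""
    requirement: str
    evidence: set = field(default_factory=set)
    not_applicable_reason: Optional[str] = None
    user_confirmed: bool = False


def is_closed(item: LedgerItem) -> bool:
    """Closed if it maps to a delivered surface, or is a user-confirmed exception."""
    if item.evidence & EVIDENCE_KINDS:
        return True
    return bool(item.not_applicable_reason) and item.user_confirmed


def open_items(ledger: List[LedgerItem]) -> List[str]:
    """Return the requirements that still block the final self-check."""
    return [item.requirement for item in ledger if not is_closed(item)]


ledger = [
    LedgerItem("create invoice", evidence={"code", "test"}),
    LedgerItem("export to PDF"),
    LedgerItem("SSO login", not_applicable_reason="out of scope", user_confirmed=True),
    LedgerItem("audit log", not_applicable_reason="deferred"),  # reason given, not confirmed
]
print(open_items(ledger))  # ['export to PDF', 'audit log']
```

Under this reading, a not-applicable reason without user confirmation still counts as open, which matches the "accepted not-applicable reason with user confirmation" wording in the added text.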

package/assets/skills/evaluation-triage/SKILL.md CHANGED

@@ -34,7 +34,7 @@ Use this internal extraction schema for every issue/recommendation:
 ## Human Handoff
 When sending issues to a developer lane:
-- speak as yourself: `I found these issues...`
+- speak as yourself, say I found these issues
 - do not say the evaluator, audit, or report found them
 - group by broad module/product area
 - do not include line numbers, file paths, report names, exact citations, or evaluator mechanics
@@ -53,7 +53,7 @@ When a failed report is regenerated in the same evaluator session:
 - reject reports if the last ordinary audit send was not the exact saved send packet content
 - reject reports that are continuation-shaped, fix-only, stale against current files, or contradicted by current repo evidence
 - reject reports that drop required endpoint/surface inventory, hard-gate README review, severity panels, verdict blocks, or finding details expected by the underlying prompt
-- if rejected, send the full prepared evaluation prompt again instead of using the degraded report
+- if rejected, archive every report or candidate report from the invalid cycle unchanged, record the reason, and restart that audit cycle from a fresh evaluator session using the installed prompt asset and exact saved send packet required by `final-evaluation-orchestration`
 ## Partial Pass Handling
@@ -64,25 +64,39 @@ When a failed report is regenerated in the same evaluator session:
 - Do not narrow fix-check to Blocker/High issues.
 - The same evaluator session that wrote the kept Partial Pass report performs the fix-check.
+## Pass With Items Handling
+- A kept Pass report with any issue, recommendation, caveat, suggestion, action item, or requested change also requires closure.
+- Save the kept Pass report under the same cycle audit report name: `./.tmp/audit_report-1.md` or `./.tmp/audit_report-2.md`.
+- Extract every issue/recommendation/caveat/suggestion/action item/requested change from it.
+- The fix-check scope is the full extracted set.
+- The same evaluator session that wrote the kept Pass report performs the fix-check.
+- Save the fix-check under the fixed cycle path: `./.tmp/audit_report-1-fix_check.md` or `./.tmp/audit_report-2-fix_check.md`.
+- A Pass report with zero scoped items still requires the cycle fix-check report. The fix-check must explicitly confirm that the kept audit report had no scoped issues to close and that the cycle is clean.
 ## Fix-Check Handling
 - Read the full fix-check report.
 - If every scoped issue is fixed, save/keep the fix-check report.
 - If anything is not fixed or partially fixed, send only the unresolved behavior back to the developer lane in broad human language, then rerun the fix-check in the same evaluator session.
 - The regenerated fix-check must still address the full kept audit issue set, not only the unresolved items that were most recently sent back.
+- Before sending the exact fix-check instruction, provide the same evaluator session with concise developer fix evidence, exact verification results when available, and the exact full audit-scoped issue list from the kept `audit_report-<N>.md`.
 - Reject fix-check reports that narrow the scope, skip low-priority kept issues, perform a broad new audit instead of scoped checking, or use history-exposing language.
 - Do not edit evaluator report text.
 ## Send Packet Validation
 Before accepting any full audit report:
-- confirm `prepare_evaluation_prompt.mjs` was used to build the prepared prompt for the right project type
-- confirm `prepare_evaluation_send_packet.mjs` was used to build the exact send packet
+- confirm `prepare_evaluation_send_packet.mjs` was used to build the exact send packet, including {prompt} interpolation and report path insertion
+- record both the installed prompt asset path and the exact saved send packet path
 - read the saved send packet before sending
-- confirm the evaluator received the exact saved send packet content, not a summary, footer, file reference, or shortened prompt
-- record prepared prompt path, send packet path, report path, and evaluator session id in metadata and Beads
+- confirm the evaluator received the exact saved send packet content, not a summary, footer, file reference, shortened prompt, or owner-authored replacement
+- confirm the saved send packet contains the complete installed evaluation prompt content, with nothing omitted from the installed asset and no owner-added text outside the packet
+- record installed prompt asset path, prepared prompt path, send packet path, report path, and evaluator session id in metadata and Beads
+If any part cannot be confirmed, reject the report, archive every report or candidate report from the invalid cycle unchanged, and restart that audit cycle from a fresh evaluator session with a valid installed prompt asset plus saved send packet.
-If any part cannot be confirmed, reject the report and rerun with a valid saved send packet.
+For failed-report regeneration, do not build or improvise a new prompt. Use only the exact fail-regeneration prompt from `final-evaluation-orchestration`, verbatim and with no preface, suffix, issue list, fix evidence, or extra wording.
 ## State Updates