theslopmachine 0.6.2 → 0.7.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (76)
  1. package/MANUAL.md +21 -6
  2. package/README.md +55 -7
  3. package/RELEASE.md +15 -0
  4. package/assets/agents/developer.md +41 -1
  5. package/assets/agents/slopmachine-claude.md +100 -60
  6. package/assets/agents/slopmachine.md +40 -17
  7. package/assets/claude/agents/developer.md +42 -5
  8. package/assets/skills/clarification-gate/SKILL.md +25 -5
  9. package/assets/skills/claude-worker-management/SKILL.md +280 -57
  10. package/assets/skills/developer-session-lifecycle/SKILL.md +81 -37
  11. package/assets/skills/development-guidance/SKILL.md +21 -1
  12. package/assets/skills/evaluation-triage/SKILL.md +32 -23
  13. package/assets/skills/final-evaluation-orchestration/SKILL.md +86 -50
  14. package/assets/skills/hardening-gate/SKILL.md +17 -3
  15. package/assets/skills/integrated-verification/SKILL.md +3 -3
  16. package/assets/skills/planning-gate/SKILL.md +32 -3
  17. package/assets/skills/planning-guidance/SKILL.md +72 -13
  18. package/assets/skills/retrospective-analysis/SKILL.md +2 -2
  19. package/assets/skills/scaffold-guidance/SKILL.md +129 -124
  20. package/assets/skills/submission-packaging/SKILL.md +33 -27
  21. package/assets/skills/verification-gates/SKILL.md +44 -14
  22. package/assets/slopmachine/backend-evaluation-prompt.md +1 -1
  23. package/assets/slopmachine/frontend-evaluation-prompt.md +5 -5
  24. package/assets/slopmachine/scaffold-playbooks/android-kotlin-compose.md +81 -0
  25. package/assets/slopmachine/scaffold-playbooks/android-kotlin-views.md +191 -0
  26. package/assets/slopmachine/scaffold-playbooks/android-native-java.md +203 -0
  27. package/assets/slopmachine/scaffold-playbooks/angular-default.md +181 -0
  28. package/assets/slopmachine/scaffold-playbooks/backend-baseline.md +142 -0
  29. package/assets/slopmachine/scaffold-playbooks/backend-family-matrix.md +80 -0
  30. package/assets/slopmachine/scaffold-playbooks/database-module-matrix.md +80 -0
  31. package/assets/slopmachine/scaffold-playbooks/django-default.md +166 -0
  32. package/assets/slopmachine/scaffold-playbooks/docker-baseline.md +189 -0
  33. package/assets/slopmachine/scaffold-playbooks/docker-shared-contract.md +334 -0
  34. package/assets/slopmachine/scaffold-playbooks/electron-vite-default.md +124 -0
  35. package/assets/slopmachine/scaffold-playbooks/expo-react-native-default.md +73 -0
  36. package/assets/slopmachine/scaffold-playbooks/fastapi-default.md +134 -0
  37. package/assets/slopmachine/scaffold-playbooks/frontend-baseline.md +160 -0
  38. package/assets/slopmachine/scaffold-playbooks/frontend-family-matrix.md +134 -0
  39. package/assets/slopmachine/scaffold-playbooks/generic-unknown-tech-guide.md +136 -0
  40. package/assets/slopmachine/scaffold-playbooks/go-chi-default.md +160 -0
  41. package/assets/slopmachine/scaffold-playbooks/ios-linux-portable.md +93 -0
  42. package/assets/slopmachine/scaffold-playbooks/ios-native-objective-c.md +151 -0
  43. package/assets/slopmachine/scaffold-playbooks/ios-native-swift.md +188 -0
  44. package/assets/slopmachine/scaffold-playbooks/laravel-default.md +216 -0
  45. package/assets/slopmachine/scaffold-playbooks/livewire-default.md +265 -0
  46. package/assets/slopmachine/scaffold-playbooks/overlay-module-matrix.md +130 -0
  47. package/assets/slopmachine/scaffold-playbooks/platform-family-matrix.md +79 -0
  48. package/assets/slopmachine/scaffold-playbooks/selection-matrix.md +72 -0
  49. package/assets/slopmachine/scaffold-playbooks/spring-boot-default.md +182 -0
  50. package/assets/slopmachine/scaffold-playbooks/tauri-default.md +80 -0
  51. package/assets/slopmachine/scaffold-playbooks/vue-vite-default.md +162 -0
  52. package/assets/slopmachine/scaffold-playbooks/web-default.md +96 -0
  53. package/assets/slopmachine/templates/AGENTS.md +41 -3
  54. package/assets/slopmachine/templates/CLAUDE.md +111 -0
  55. package/assets/slopmachine/utils/claude_create_session.mjs +1 -0
  56. package/assets/slopmachine/utils/claude_live_channel.mjs +188 -0
  57. package/assets/slopmachine/utils/claude_live_common.mjs +406 -0
  58. package/assets/slopmachine/utils/claude_live_hook.py +47 -0
  59. package/assets/slopmachine/utils/claude_live_launch.mjs +181 -0
  60. package/assets/slopmachine/utils/claude_live_status.mjs +25 -0
  61. package/assets/slopmachine/utils/claude_live_stop.mjs +45 -0
  62. package/assets/slopmachine/utils/claude_live_turn.mjs +250 -0
  63. package/assets/slopmachine/utils/claude_resume_session.mjs +1 -0
  64. package/assets/slopmachine/utils/claude_wait_for_rate_limit_reset.mjs +23 -0
  65. package/assets/slopmachine/utils/claude_wait_for_rate_limit_reset.sh +5 -0
  66. package/assets/slopmachine/utils/claude_worker_common.mjs +224 -4
  67. package/assets/slopmachine/utils/cleanup_delivery_artifacts.py +4 -0
  68. package/assets/slopmachine/utils/export_ai_session.mjs +1 -1
  69. package/assets/slopmachine/utils/normalize_claude_session.py +153 -0
  70. package/assets/slopmachine/utils/package_claude_session.mjs +96 -0
  71. package/assets/slopmachine/utils/prepare_strict_audit_workspace.mjs +65 -0
  72. package/package.json +1 -1
  73. package/src/constants.js +42 -3
  74. package/src/init.js +173 -28
  75. package/src/install.js +75 -0
  76. package/src/send-data.js +56 -57
@@ -52,6 +52,8 @@ At the beginning of owner work:
 
  Do not use this preflight to perform clarification, planning, scaffold, or implementation work.
 
+ If bootstrap seeded a later `current_phase` from `requested_start_phase`, verify that the adopted repo and evidence really support starting there; if not, repair the phase conservatively before continuing.
+
  ## Prompt and project metadata boundary
 
  - keep the real project prompt in `../metadata.json` under `prompt`
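The conservative phase repair described in the hunk above could look like the following minimal sketch. It assumes, hypothetically, that the owner has already reduced the adopted repo's evidence to a set of phase ids (`P1` through `P10`) that are genuinely supported; `repair_start_phase` and that evidence format are illustrative, not part of the package.

```python
# Hypothetical sketch: the evidence format (a set of phase ids already proven by
# the adopted repo) is an assumption, not the package's data model.
PHASES = [f"P{n}" for n in range(1, 11)]  # P1 through P10


def repair_start_phase(requested: str, supported: set) -> str:
    """Return `requested` only if every earlier phase is supported by evidence.

    Otherwise fall back to the earliest phase that lacks evidence.
    """
    if requested not in PHASES:
        raise ValueError(f"unknown phase: {requested}")
    for phase in PHASES[: PHASES.index(requested)]:
        if phase not in supported:
            return phase  # first gap: restart here instead of skipping ahead
    return requested
```

Restarting at the first unsupported phase, rather than somewhere in between, is what makes the repair conservative: no later phase is entered while an earlier one still lacks evidence.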
@@ -60,15 +62,27 @@ Do not use this preflight to perform clarification, planning, scaffold, or imple
  - map appended stack/context lines into structured project metadata fields when defensible and keep the raw remainder in `../.ai/startup-context.md`
  - if the separation is unclear, resolve it before clarification proceeds
 
+ ## Developer rulebook decision
+
+ - during `P1`, choose the repo-local developer rulebook file that matches the active backend
+ - for the normal OpenCode developer backend, use `AGENTS.md`
+ - for `slopmachine-claude`, use `CLAUDE.md`
+ - record that decision in `../.ai/metadata.json` as `developer_rulebook_file`
+ - before launching the developer, ensure the chosen rulebook file exists in `repo/` and reflects the current packaged template or an explicitly approved equivalent
+ - for `slopmachine-claude`, if `repo/CLAUDE.md` is missing but `repo/AGENTS.md` exists, rename `repo/AGENTS.md` to `repo/CLAUDE.md` during `P1` before the first Claude developer launch
+
  ## Required startup outputs
 
  - `../.ai/metadata.json` exists with the current workflow/session schema
  - `../metadata.json` exists with project-fact fields
  - `../.ai/startup-context.md` exists
  - seeded parent-root docs exist, including `../docs/questions.md`, `../docs/design.md`, `../docs/test-coverage.md`, and `../docs/api-spec.md`
+ - parent-root `../.tmp/` exists as the evaluation artifact directory
  - seeded repo `README.md` exists
+ - seeded repo `.claude/settings.json` exists with the repo-local Claude default-agent configuration
  - root workflow Beads exist for `P1` through `P10`
  - developer-session tracking is initialized
+ - the backend-appropriate repo-local developer rulebook file has been chosen or is ready to be chosen in `P1`
 
  ## Workflow metadata fields
 
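A minimal sketch of the developer rulebook decision added above. The backend identifier `slopmachine-claude` comes from the diff; treating every other backend value as the OpenCode case, and the `prepare_rulebook` helper itself, are assumptions for illustration. The template-freshness check is left out because the packaged template contents are not shown here.

```python
from pathlib import Path

CLAUDE_BACKEND = "slopmachine-claude"


def choose_rulebook(backend: str) -> str:
    # CLAUDE.md for the Claude backend; AGENTS.md otherwise (assumed OpenCode).
    return "CLAUDE.md" if backend == CLAUDE_BACKEND else "AGENTS.md"


def prepare_rulebook(repo: Path, backend: str, metadata: dict) -> Path:
    """Pick the rulebook during P1, record it, and make sure it exists in repo/."""
    name = choose_rulebook(backend)
    metadata["developer_rulebook_file"] = name  # recorded in ../.ai/metadata.json
    target = repo / name
    legacy = repo / "AGENTS.md"
    if name == "CLAUDE.md" and not target.exists() and legacy.exists():
        legacy.rename(target)  # rename before the first Claude developer launch
    if not target.exists():
        raise FileNotFoundError(f"{target} is missing; seed it from the packaged template")
    return target
```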
@@ -82,19 +96,24 @@ Track at least these fields in `../.ai/metadata.json`:
  - `bootstrap_mode`
  - `requested_start_phase`
  - `packaging_completed`
- - `claude_trace_root`
+ - `claude_live_root`
+ - `developer_rulebook_file`
  - `current_developer_lane`
  - `active_developer_session_id`
+ - `primary_develop_session_id`
+ - `latest_develop_session_id`
  - `next_develop_session_number`
  - `next_bugfix_session_number`
  - `developer_sessions`
  - `evaluation_prompt_kind`
  - `active_evaluator_session_id`
- - `self_test_reports_root`
- - `self_test_successful_cycle_count`
- - `self_test_cycles`
- - `failed_audit_count`
- - `failed_audits`
+ - `evaluation_reports_root`
+ - `evaluation_audit_count`
+ - `evaluation_runs`
+ - `completed_bugfix_session_count`
+ - `required_bugfix_session_count`
+ - `coverage_readme_audit_completed`
+ - `coverage_readme_audit_report_path`
 
  Each `developer_sessions[]` record should include enough to recover and export it later, such as:
 
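One way to check a `../.ai/metadata.json` payload against the field changes in the hunk above is sketched below. It is hypothetical: it lists only the fields visible in this hunk (the full list begins earlier in the file) plus the fields this release removes, and reports which required ones are missing and which removed ones linger.

```python
# Only the fields visible in the hunk above; the full required list starts
# earlier in the file and is not reproduced here.
REQUIRED_FIELDS = (
    "bootstrap_mode",
    "requested_start_phase",
    "packaging_completed",
    "claude_live_root",
    "developer_rulebook_file",
    "current_developer_lane",
    "active_developer_session_id",
    "primary_develop_session_id",
    "latest_develop_session_id",
    "next_develop_session_number",
    "next_bugfix_session_number",
    "developer_sessions",
    "evaluation_prompt_kind",
    "active_evaluator_session_id",
    "evaluation_reports_root",
    "evaluation_audit_count",
    "evaluation_runs",
    "completed_bugfix_session_count",
    "required_bugfix_session_count",
    "coverage_readme_audit_completed",
    "coverage_readme_audit_report_path",
)

# Fields dropped in this release; their presence signals un-migrated state.
REMOVED_FIELDS = (
    "claude_trace_root",
    "self_test_reports_root",
    "self_test_successful_cycle_count",
    "self_test_cycles",
    "failed_audit_count",
    "failed_audits",
)


def check_workflow_metadata(metadata: dict) -> tuple:
    """Return (missing required fields, lingering removed fields)."""
    missing = [name for name in REQUIRED_FIELDS if name not in metadata]
    stale = [name for name in REMOVED_FIELDS if name in metadata]
    return missing, stale
```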
@@ -106,30 +125,30 @@ Each `developer_sessions[]` record should include enough to recover and export i
  - `created_phase`
  - `session_id`
  - `status`
- - `trace_dir`
+ - `runtime_dir`
+ - `tmux_session`
+ - `transcript_path`
+ - `opened_from_audit_number`
  - `orientation_completed`
  - `last_result_summary`
- - `last_resumed_at`
+ - `last_turn_at`
 
- Each `self_test_cycles[]` record should include enough to recover the counted `P7` flow later, such as:
+ If legacy metadata still contains `claude_trace_root`, normalize it to `claude_live_root` when repairing workflow state.
 
- - `cycle`
- - `session_id`
- - `status`
- - `initial_audit_result`
- - `cycle_dir`
- - `audit_report_path`
- - `fix_check_paths`
- - `open_issues_summary`
- - `completed_at`
+ Each `evaluation_runs[]` record should include enough to recover deterministic `P7` routing later, such as:
 
- Each `failed_audits[]` record should include enough to recover non-counted failed-initial-audit remediation history, such as:
-
- - `attempt`
+ - `audit_number`
  - `session_id`
+ - `verdict`
  - `audit_report_path`
- - `archived_to`
+ - `route_target`
+ - `routed_developer_session_id`
+ - `routed_developer_label`
+ - `started_bugfix_session_id`
+ - `started_bugfix_label`
+ - `fix_check_paths`
  - `status`
+ - `completed_at`
 
  ## Project metadata fields
 
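The legacy-key repair and the new `evaluation_runs[]` record shape from the hunk above might be modelled as follows. Assumptions: the normalized value defaults to the `../.ai/claude-live/` root named in the initial structure rule further down (rather than reusing the old trace path), and the `EvaluationRun` dataclass with its default `status` of `"open"` is illustrative; the diff only lists the field names.

```python
from dataclasses import dataclass, field
from typing import List, Optional

LIVE_ROOT = "../.ai/claude-live"  # assumed default, from the initial structure rule


def normalize_legacy_metadata(metadata: dict) -> dict:
    """Replace a legacy `claude_trace_root` key with `claude_live_root`."""
    if "claude_trace_root" in metadata:
        metadata.pop("claude_trace_root")
        metadata.setdefault("claude_live_root", LIVE_ROOT)
    return metadata


@dataclass
class EvaluationRun:
    """Illustrative shape for one `evaluation_runs[]` record."""

    audit_number: int
    session_id: str  # assumed: the evaluator session that produced this audit
    verdict: str  # "fail", "partial pass", or "pass"
    audit_report_path: str
    route_target: Optional[str] = None
    routed_developer_session_id: Optional[str] = None
    routed_developer_label: Optional[str] = None
    started_bugfix_session_id: Optional[str] = None
    started_bugfix_label: Optional[str] = None
    fix_check_paths: List[str] = field(default_factory=list)
    status: str = "open"
    completed_at: Optional[str] = None
```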
@@ -152,11 +171,16 @@ Keep `../metadata.json` focused on project facts and exported project metadata,
 
  - keep exactly one active developer session at a time
  - record every developer session in `developer_sessions`
- - from `P2` through `P6`, the main implementation lane is `develop-N`
- - if multiple pre-`P7` developer sessions are needed, they still stay in the `develop-N` lane
- - when `P7` begins, switch to a separate remediation lane `bugfix-N`
- - if multiple `P7` or post-`P7` remediation developer sessions are needed, they stay in the `bugfix-N` lane
+ - from `P2` through `P6`, default to one long-lived `develop-1` lane
+ - if a new `develop-N` session is created, it should happen only for controlled replacement or explicit user direction, not because `P7` found more issues
+ - keep `primary_develop_session_id` pointing at the original long-lived develop session when that distinction matters
+ - keep `latest_develop_session_id` pointing at the most recent recoverable `develop-N` session so `fail` audits can route back deterministically
+ - when a fresh `P7` evaluation returns `partial pass`, create the next `bugfix-N` session tied to that audit number
+ - when a fresh `P7` evaluation returns `fail`, route the issue list back to the latest `develop-N` session instead of opening `bugfix-N`
+ - require 2 completed `bugfix-N` sessions before the final post-bugfix coverage/README audit can run
+ - after the second bugfix session completes, run the separate coverage/README audit and keep its remediation on the most recently used recoverable developer session until the report is clean
  - keep the currently active lane mirrored in `current_developer_lane`
+ - each tracked Claude-backed session should point at its live runtime directory so the lane can be recovered deterministically
 
  ## `P2` planning-entry rule
 
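The label bookkeeping implied by the lane rules above can be sketched as two small helpers. The counter keys `next_develop_session_number` and `next_bugfix_session_number` and the `completed_bugfix_session_count` / `required_bugfix_session_count` pair come from the workflow metadata fields in this diff; the helper names, the starting number of 1, and the fallback requirement of 2 (matching the "require 2 completed `bugfix-N` sessions" rule) are assumptions.

```python
def allocate_label(metadata: dict, lane: str) -> str:
    """Reserve the next `develop-N` or `bugfix-N` label and advance its counter."""
    if lane not in ("develop", "bugfix"):
        raise ValueError(f"unknown lane: {lane}")
    key = f"next_{lane}_session_number"
    number = metadata.get(key, 1)  # assumed starting number
    metadata[key] = number + 1
    return f"{lane}-{number}"


def coverage_readme_audit_allowed(metadata: dict) -> bool:
    """The post-bugfix coverage/README audit waits for the required bugfix sessions."""
    required = metadata.get("required_bugfix_session_count", 2)
    return metadata.get("completed_bugfix_session_count", 0) >= required
```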
@@ -166,19 +190,24 @@ Keep `../metadata.json` focused on project facts and exported project metadata,
 
  ## `P7` lane-transition rule
 
- - when `P7` starts, do not continue evaluator-driven fixes in the existing `develop-N` lane
- - create a new `bugfix-N` developer session for `P7` remediation work
- - run the repo-orientation prompt for that new `bugfix-N` session before sending evaluator issues
- - after the orientation completes, mark that `bugfix-N` session as the active developer session
- - preserve the earlier `develop-N` session records for continuity and export; do not overwrite them
+ - when `P7` starts, keep the latest `develop-N` session recoverable and ready; do not automatically switch to `bugfix-N`
+ - after each fresh evaluation report, branch deterministically by verdict:
+ - `fail` -> hand the issue list to the latest `develop-N` session and keep that develop session as the remediation lane
+ - `partial pass` -> create the next `bugfix-N` developer session, tie it to that audit number, and keep its loop scoped to that audit's issue list until the evaluator confirms the scoped issues are fixed
+ - `pass` -> record the audit, but do not treat it as `P7` completion if fewer than 2 bugfix sessions have been completed
+ - run the repo-orientation prompt for a new `bugfix-N` session before sending its issue list
+ - after the orientation completes, mark that `bugfix-N` session as the active developer session for that scoped remediation window
+ - after a `bugfix-N` session is fully closed, preserve it in metadata and keep the latest `develop-N` session available for any later `fail` audit routing
 
  ## Recovery rule
 
  - if session or phase records disagree, stop and repair the inconsistency before proceeding
  - if the current phase already has an active developer session, recover it instead of silently creating a replacement
  - if an evaluator session is marked active, recover it before continuing the current `P7` cycle
- - treat resume as deterministic recovery, not guesswork
- - if the active Claude developer session is marked `rate_limited`, do not replace it with owner-side coding; preserve it, record the pause, and wait for the user to resume later
+ - treat live-lane recovery as deterministic recovery, not guesswork
+ - if the active Claude developer session is marked `rate_limited`, do not replace it with owner-side coding; preserve it, record the blocked state, and auto-wait for reset or resume from the same session when the wait helper completes
+ - if an open evaluation run is tied to a `fail` audit, recover the latest `develop-N` session for remediation rather than starting `bugfix-N`
+ - if an open evaluation run is tied to a `partial pass` audit, recover the linked `bugfix-N` session and its scoped fix-check loop instead of broadening the work
 
  On recovery, inspect at least:
 
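The verdict branching in the `P7` lane-transition rule above reduces to a small routing function. This is a hypothetical sketch: `route_fresh_audit` and the `Route` record are illustrative names, and it models `pass` only as the diff describes it before 2 bugfix sessions are complete, that is, recorded and then followed by a fresh evaluation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    target: str  # "develop", "bugfix", or "rerun"
    session_label: Optional[str]  # developer lane that receives the issue list
    hand_off_issues: bool


def route_fresh_audit(verdict: str, latest_develop_label: str, next_bugfix_number: int) -> Route:
    """Branch deterministically on a fresh P7 audit verdict."""
    if verdict == "fail":
        # Back to the latest develop-N session; no bugfix-N is opened.
        return Route("develop", latest_develop_label, True)
    if verdict == "partial pass":
        # Open the next bugfix-N, scoped to this audit's issue list.
        return Route("bugfix", f"bugfix-{next_bugfix_number}", True)
    if verdict == "pass":
        # Recorded, not P7 completion: run a fresh evaluation instead.
        return Route("rerun", None, False)
    raise ValueError(f"unexpected verdict: {verdict!r}")
```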
@@ -193,9 +222,22 @@ On recovery, inspect at least:
  - store the active developer session id in Beads comments using `SESSION:`
  - mirror the active developer session id in `../.ai/metadata.json`
  - mirror the active developer session id in `../metadata.json` as `session_id`
- - for Claude-backed sessions, include backend and trace directory in the recorded session state so recovery and export remain deterministic
+ - for Claude-backed sessions, include backend, runtime directory, tmux session name, and transcript path in the recorded session state so recovery and export remain deterministic
  - if these records disagree, repair them before continuing
- - do not silently create a replacement developer session if the intended existing one can still be resumed
+ - do not silently create a replacement developer session if the intended existing one can still be recovered
+
+ For live Claude lanes specifically:
+
+ - treat bridge `state.json` as the authoritative transport truth
+ - when `../.ai/metadata.json` disagrees with the bridge on the active Claude session id, runtime directory, tmux session, transcript path, or lane status, repair metadata from the bridge state before continuing
+ - preserve the same tracked session id when the bridge reports `blocked`; do not replace it just to clear a temporary capacity issue
+ - for final Claude session packaging, resolve the stored `session_id` within `~/.claude/projects` using the project `cwd`; tracked runtime or transcript pointers remain useful for recovery/debugging but are not the final package lookup key
+
+ ## Managed-lane guardrail
+
+ - bridge-managed Claude TUI lanes are owner-controlled assets during ordinary workflow execution
+ - do not manually type into a managed lane during normal operation
+ - if a managed lane is touched manually for debugging or recovery, record that fact and then resync workflow metadata and Beads comments from bridge evidence before using the lane again
 
  ## Boundary-summary rule
 
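A sketch of the bridge-first resync rule above. The key names inside the bridge's `state.json` are not shown in this diff, so the code assumes, hypothetically, that they match the session-record field names (`session_id`, `runtime_dir`, `tmux_session`, `transcript_path`, `status`) and that the caller has already parsed the file with `json.loads`.

```python
# Assumed: bridge state.json keys match the session-record field names.
TRACKED_KEYS = ("session_id", "runtime_dir", "tmux_session", "transcript_path", "status")


def resync_from_bridge(metadata: dict, bridge_state: dict) -> list:
    """Treat the bridge state as transport truth and repair the active session record.

    Returns the names of the fields that were repaired.
    """
    active_id = metadata.get("active_developer_session_id")
    record = next(
        (s for s in metadata.get("developer_sessions", []) if s.get("session_id") == active_id),
        None,
    )
    if record is None:
        raise LookupError("active developer session is not tracked; repair metadata first")
    repaired = [
        key for key in TRACKED_KEYS if key in bridge_state and record.get(key) != bridge_state[key]
    ]
    for key in repaired:
        record[key] = bridge_state[key]
    # Mirror the (possibly repaired) session id at the top level.
    metadata["active_developer_session_id"] = record["session_id"]
    return repaired
```

Because the bridge value always wins, a `blocked` status is copied into metadata without creating a new session, which keeps the tracked session id stable as the rule requires.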
@@ -208,7 +250,9 @@ On recovery, inspect at least:
  ## Initial structure rule
 
  - parent-root `../docs/` is the owner-maintained external documentation directory
- - parent-root `../sessions/` is the converted session-artifact directory for exported conversation traces
- - parent-root `../self_test_reports/` is the counted `P7` evaluation-cycle artifact directory
+ - parent-root `../sessions/` is the cleaned raw session-export directory for non-Claude developer sessions
+ - Claude-backed developer sessions are packaged once as parent-root `claude-sessions.zip` instead of per-session `../sessions/` entries
+ - parent-root `../.tmp/` is the `P7` evaluation artifact directory for `audit_report-<N>.md`, `audit_report-<N>-fix_check-<M>.md`, and `test_coverage_and_readme_audit_report.md`
+ - parent-root `../.ai/claude-live/` is the live Claude bridge runtime directory root
  - `../docs/questions.md` is the mandatory clarification record artifact
  - do not treat repo-local `docs/` as the active external documentation location
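The `../.tmp/` naming scheme listed above can be captured in a few path helpers plus a parser. The helper names are hypothetical; only the filename patterns come from the diff.

```python
import re
from pathlib import PurePosixPath

TMP = PurePosixPath("../.tmp")
COVERAGE_README_REPORT = TMP / "test_coverage_and_readme_audit_report.md"


def audit_report_path(n: int) -> PurePosixPath:
    return TMP / f"audit_report-{n}.md"


def fix_check_path(n: int, m: int) -> PurePosixPath:
    return TMP / f"audit_report-{n}-fix_check-{m}.md"


_PATTERN = re.compile(r"audit_report-(\d+)(?:-fix_check-(\d+))?\.md")


def classify(name: str):
    """Return (audit_number, fix_check_number or None), or None for other files."""
    match = _PATTERN.fullmatch(name)
    if match is None:
        return None
    n, m = match.groups()
    return int(n), (int(m) if m is not None else None)
```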
@@ -13,6 +13,9 @@ Use this skill during `P4 Development` before prompting the developer.
  - complete the real user-facing and admin-facing surface for the slice
  - keep slice-local planning, implementation, verification, and doc sync together
  - after planning is accepted, use the relevant accepted plan section as the slice baseline instead of expecting the owner to restate the full slice contract
+ - when the owner provides a stage-exclusive checklist for the current slice or gate, treat that checklist as a hard acceptance contract and respond against it explicitly rather than answering loosely
+ - before deeper implementation, do a quick serial-versus-parallel check for the current slice instead of defaulting to one long serial branch
+ - when the slice contains 2 or 3 independent units with stable interfaces and low shared-file overlap, use parallel task fan-out for those units and then merge back into one reviewed result
 
  ## Module implementation guidance
 
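The serial-versus-parallel check added above could be expressed as a predicate like the one below. The `Unit` fields and the default of zero shared files (one reading of "low shared-file overlap") are assumptions; the 2-or-3-unit bound, the stable-interface requirement, and the independence requirement come from the guidance.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import List, Set


@dataclass
class Unit:
    name: str
    files: Set[str]  # files the unit expects to edit
    interface_stable: bool  # its contract with the rest of the slice is settled
    depends_on: Set[str] = field(default_factory=set)  # names of other units


def should_fan_out(units: List[Unit], max_shared_files: int = 0) -> bool:
    """True when 2 or 3 independent, stable, low-overlap units can run in parallel."""
    if not 2 <= len(units) <= 3:
        return False
    names = {unit.name for unit in units}
    if any(unit.depends_on & names for unit in units):
        return False  # a unit waits on a sibling, so they are not independent
    if not all(unit.interface_stable for unit in units):
        return False  # contracts still being invented: stay serial
    return all(
        len(a.files & b.files) <= max_shared_files for a, b in combinations(units, 2)
    )
```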
@@ -25,9 +28,15 @@ Use this skill during `P4 Development` before prompting the developer.
  - handle failure paths and boundary conditions
  - add or update tests as part of the module work
  - prefer TDD when the behavior is well defined and the module is practical to drive test-first; otherwise define the expected tests before implementation and keep them tied to the module plan
+ - when backend or fullstack API endpoints are added or changed, add or update real HTTP tests for the exact `METHOD + PATH` where practical instead of relying only on controller/service-level tests
+ - when mocked HTTP coverage or unit-only coverage still exists, keep it explicit in the coverage notes instead of overstating it as equivalent to true no-mock endpoint coverage
+ - when backend or fullstack API tests are material, keep the test names, fixtures, or assertions audit-readable enough that a reviewer can trace the endpoint, request input, and expected response behavior statically
+ - keep track of important modules that still lack meaningful tests so hardening does not have to rediscover them from scratch
+ - define the branch contract before parallelizing: expected outcome, boundaries, shared constraints, merge condition, and required verification
  - keep parent-root `../docs/test-coverage.md` maintainable by making new tests traceable to concrete requirement or risk points instead of vague “more coverage” additions
  - make sure the module is moving toward full definition-of-done completion, not just happy-path completion
  - keep auth, authorization, ownership, validation, and logging concerns in view when relevant
+ - for backend or fullstack work, keep configuration reads on the shared config path instead of introducing new scattered direct environment access in feature code
  - keep frontend and backend contracts synchronized when the module spans both sides
  - verify the module integrates cleanly with existing modules, routes, permissions, shared state, and cross-cutting helpers rather than only proving the new feature path in isolation
  - before closing the slice, do a narrow adjacent-flow sweep: what existing flows, commands, or docs should still be true after this slice lands?
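As an illustration of a real HTTP test for an exact `METHOD + PATH`, the sketch below starts an actual server on a free local port and calls it over a socket, with no mocked client or handler. The `GET /api/health` endpoint and its JSON body are hypothetical stand-ins; in a real project the handler would be the application under test, and the test name keeps the method and path traceable for an auditor.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class Handler(BaseHTTPRequestHandler):
    """Hypothetical endpoint under test: GET /api/health."""

    def do_GET(self):
        if self.path == "/api/health":
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):  # keep test output quiet
        pass


def test_get_api_health() -> None:
    """GET /api/health over a real socket: no mocked client or handler."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)  # port 0 = any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = f"http://127.0.0.1:{server.server_address[1]}/api/health"
        with urllib.request.urlopen(url, timeout=5) as response:
            assert response.status == 200
            assert json.loads(response.read()) == {"status": "ok"}
    finally:
        server.shutdown()
        server.server_close()
```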
@@ -50,11 +59,17 @@ Use this skill during `P4 Development` before prompting the developer.
  - do not hardcode secrets or persist local sensitive values in the repo while implementing
  - explain behavior changes clearly enough that the README and owner-maintained external documentation can be kept accurate
  - update `README.md` when runtime, build/preview, configuration, routes, tests, feature flags, debug/demo surfaces, mock defaults, logging, validation, or state models change
+ - keep `README.md` aligned with the strict audit contract as the implementation matures: project type near the top, startup instructions, access method, verification method, and demo credentials or the exact statement `No authentication required`
+ - for backend, fullstack, and web projects, keep the canonical `docker compose up --build` contract in `README.md` and also include the exact legacy compatibility string `docker-compose up` somewhere in startup guidance for the strict README audit
+ - for Android, iOS, and desktop work, keep the required Docker-contained final contract while also maintaining the project-type-specific host-side guidance sections the strict README audit expects
  - do not let implementation depend on parent-root docs or sibling artifacts for normal repo understanding
  - explain behavior changes clearly enough that the owner can keep parent-root `../docs/design.md`, `../docs/api-spec.md`, and `../docs/test-coverage.md` accurate when they apply
+ - before reporting development complete, remove or correct local-only setup instructions, host-only dependency assumptions, and other fast-iteration traces that should not survive into the final Docker-contained delivery
  - verify the module against its planned behavior before trying to move on
  - do not move on while the module is still obviously weak or half-finished
  - do not spread broad partial logic across many modules; bias toward completed trustworthy slices before opening the next major chunk
+ - do not parallelize tightly coupled work that still depends on unresolved contracts, shared abstractions being invented in real time, or overlapping edits to the same files
+ - after parallel fan-in, run final targeted verification on the integrated result rather than trusting the branch-local checks alone
 
  ## Verification model
 
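A rough self-check for the strict README contract above is sketched next. It is heuristic and hypothetical: the section keywords (`startup`, `access`, `verification`, `credential`) and the "first 10 lines" reading of "near the top" are assumptions, while the exact strings `No authentication required`, `docker compose up --build`, and `docker-compose up` come from the guidance.

```python
import re

# Hypothetical heading keywords; the strict audit's exact section names are not shown.
REQUIRED_SECTIONS = ("startup", "access", "verification")
DOCKER_TYPES = {"backend", "fullstack", "web"}


def readme_audit_gaps(readme: str, project_type: str, top_lines: int = 10) -> list:
    """List likely gaps against the strict README contract (heuristic)."""
    gaps = []
    top = "\n".join(readme.splitlines()[:top_lines]).lower()
    if project_type.lower() not in top:
        gaps.append("project type missing near the top")
    headings = [h.lower() for h in re.findall(r"^#+\s*(.+)$", readme, flags=re.MULTILINE)]
    for section in REQUIRED_SECTIONS:
        if not any(section in h for h in headings):
            gaps.append(f"missing {section} section")
    has_credentials = any("credential" in h for h in headings)
    if not has_credentials and "No authentication required" not in readme:
        gaps.append("missing demo credentials or 'No authentication required'")
    if project_type.lower() in DOCKER_TYPES:
        # Both the canonical command and the exact legacy string must appear.
        for command in ("docker compose up --build", "docker-compose up"):
            if command not in readme:
                gaps.append(f"missing '{command}'")
    return gaps
```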
@@ -63,6 +78,8 @@ Use this skill during `P4 Development` before prompting the developer.
  - prefer fast local language-native or framework-native test commands for the changed area during normal iteration
  - set up and use the local test environment inside the current working directory so normal verification does not depend on hidden global tooling assumptions
  - if the local toolchain is missing, install or enable the local targeted test tooling; do not fall back to Docker, `./run_tests.sh`, Playwright, or other broad-gate tooling during ordinary slice work
+ - fast local iteration is allowed during development even when the final delivered runtime and broad verification contract must be Docker-contained
+ - do not let temporary local tooling or host-only setup assumptions leak into the final README, wrapper scripts, or declared delivery contract
  - do not run browser E2E, Playwright, full test suites, `./run_tests.sh`, or Docker runtime commands during ordinary development slices
  - for frontend-bearing projects, rely on targeted local tests such as unit, component, route, page, or state-focused tests instead of browser E2E during ordinary slice work
  - for mobile and desktop projects, rely on targeted local non-E2E verification during ordinary slice work rather than broad checkpoint commands
@@ -70,9 +87,12 @@ Use this skill during `P4 Development` before prompting the developer.
  - for non-trivial frontend stateful work, do not rely only on runtime or E2E checks; add component, page, route, or state-focused tests when that is the credible way to prove the behavior statically
  - for frontend-bearing flows, explicitly verify loading, empty, submitting, disabled, success, error, and duplicate-action or re-entry protection states where those states are required by the prompt or core flow
  - use the shared logging path rather than random `console.log` or print-style debugging as the durable implementation pattern
+ - when backend logging matters, keep request or route outcomes, exceptions, and background failure logging on the shared structured logging path with redaction intact
  - use the shared validation and normalized error-handling path rather than per-component or per-route improvisation where a common contract exists
- - keep the test surface moving toward at least 90 percent meaningful coverage of the relevant behavior area as slices are completed
+ - keep the test surface moving toward the hard minimum 90 percent coverage threshold as slices are completed, and do not defer obvious coverage debt to hardening
+ - for backend or fullstack APIs, keep `../docs/test-coverage.md` moving toward an endpoint inventory plus API test mapping table, not just a generic risk matrix
  - in each slice reply, report the exact verification commands that were run and the concrete results they produced so the owner can review the evidence without blindly rerunning the same commands
+ - when the owner names specific expected outcomes for the slice or gate, tie the reported verification and changed files back to those expected outcomes explicitly
  - keep ordinary slice-complete replies short by default: short summary, exact changed files, exact verification commands plus results, and only real unresolved issues unless the owner explicitly asks for a deeper mapping
 
  ## Quality rules
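For the endpoint inventory plus API test mapping table that `../docs/test-coverage.md` should move toward, a generator might look like this. The column names and the `METHOD /path` key format are assumptions; the function also returns the unmapped endpoints so the gap stays explicit rather than being overstated as covered. It does not measure the separate 90 percent coverage threshold.

```python
from typing import Dict, List, Tuple


def endpoint_mapping_table(
    endpoints: List[str], tests_by_endpoint: Dict[str, List[str]]
) -> Tuple[str, List[str]]:
    """Build a markdown inventory table and list the endpoints with no mapped test."""
    rows = ["| Endpoint | Tests | Mapped |", "| --- | --- | --- |"]
    unmapped = []
    for endpoint in endpoints:
        tests = tests_by_endpoint.get(endpoint, [])
        if not tests:
            unmapped.append(endpoint)
        rows.append(f"| `{endpoint}` | {', '.join(tests) or 'none'} | {'yes' if tests else 'no'} |")
    return "\n".join(rows), unmapped
```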
@@ -5,39 +5,46 @@ description: Owner-side evaluation issue handoff and scoped fix-verification rul
 
  # Evaluation Issue Handoff
 
- Use this skill during `P7 Evaluation and Fix Verification` after an initial audit report exists.
+ Use this skill during `P7 Evaluation and Fix Verification` after a fresh audit report exists.
 
  ## Core rules
 
- - treat the current initial audit report as the authoritative issue source for the current cycle
+ - treat the current fresh audit report as the authoritative issue source for that audit number
  - keep the issue set concrete and exact
- - use the active `bugfix-N` developer session for evaluator-driven remediation
+ - route `fail` audits back to the latest `develop-N` session
+ - use `bugfix-N` only for `partial pass` audits that explicitly opened a bugfix session
  - do not split the issue set into backend/frontend tracks
- - do not silently drop, merge away, or wave through issues from the current initial audit report
+ - do not silently drop, merge away, or wave through issues from the current audit report
  - the owner must read the current audit report and extract the issues before talking to the developer
- - after the developer claims the fixes are complete, return to the same evaluator session that produced that cycle's initial audit report
+ - after the developer claims the fixes are complete for a `partial pass` audit, return to the same evaluator session that produced that audit report
  - keep ordinary post-hardening evaluation remediation inside `P7`
 
- ## Initial-audit result handling
+ ## Fresh-audit result handling
 
  ### `fail`
 
- - treat the audit as a non-counted remediation trigger
- - extract and hand off all issues to the active `bugfix-N` developer session
+ - treat the audit as a remediation trigger that routes back to develop
+ - extract and hand off all issues to the latest `develop-N` developer session
  - fix them
- - archive the failed initial audit report under `../.ai/`
- - do not count that audit as a successful cycle
- - run a fresh new evaluator session for the next initial audit
+ - keep the audit report at its normalized `../.tmp/audit_report-<N>.md` path
+ - do not open `bugfix-N` for this audit
+ - run a fresh new evaluator session for the next audit
 
- ### `partial pass` or `pass`
+ ### `partial pass`
 
- - treat the audit as the start of a counted cycle
- - use its exact issue list as the scope of the cycle
+ - treat the audit as the start of a scoped bugfix session
+ - use its exact issue list as the scope of that bugfix session
  - send that exact issue list to the developer in explicit but compact detail
 
+ ### `pass`
+
+ - record the audit as a discarded clean audit and do not hand off an issue list
+ - do not treat it as `P7` completion
+ - immediately rerun a fresh evaluation until a `partial pass` opens the next scoped bugfix session
+
  ## Issue handoff standard
 
- - send the developer the exact issues from the current cycle's initial audit in explicit but trimmed detail
+ - send the developer the exact issues from the current audit in explicit but trimmed detail
  - do not tell the developer to read the audit report directly
  - require the developer to address the full scoped issue list or its explicitly unresolved subset on later loop passes
  - require the developer to report the exact verification commands that were run and the concrete results they produced
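On later loop passes the handoff shrinks to the explicitly unresolved subset of the original scope. A minimal sketch, assuming each issue is identified by a stable string and that a fix-check result is available as a mapping from issue to "fixed" (both hypothetical representations):

```python
from typing import Dict, List


def next_handoff(scoped_issues: List[str], fixed: Dict[str, bool]) -> List[str]:
    """Issues from the original audit that are not yet confirmed fixed, in audit order."""
    return [issue for issue in scoped_issues if not fixed.get(issue, False)]


def out_of_scope(scoped_issues: List[str], fixed: Dict[str, bool]) -> List[str]:
    """Fix-check entries that were never in the audit's scope (flagged, not handed off)."""
    scope = set(scoped_issues)
    return [issue for issue in fixed if issue not in scope]
```

Findings outside the audit's scope are reported separately instead of being merged into the handoff, which keeps the loop from turning into a fresh issue hunt.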
@@ -47,21 +54,23 @@ Use this skill during `P7 Evaluation and Fix Verification` after an initial audi
47
54
 
48
55
  ## Scoped fix-check standard
 
- - the follow-up verification must happen in the same evaluator session that produced the current cycle's initial audit report
- - that same evaluator session should receive only the exact cycle-scoped issue list or the current unresolved subset
+ - the follow-up verification must happen only for `partial pass` audits and must use the same evaluator session that produced that audit report
+ - that same evaluator session should receive only the exact audit-scoped issue list or the current unresolved subset
  - that same evaluator session should only confirm whether those exact earlier items are fixed; it should not perform a broader new review
  - the follow-up report should describe what is resolved, what remains open, and any important verification caveats
- - store follow-up reports in the current cycle directory as `audit_fix_check_1.md`, `audit_fix_check_2.md`, and so on
+ - store follow-up reports as `../.tmp/audit_report-<N>-fix_check-<M>.md`
  - do not rewrite the report text after generation except for file moves and filename normalization
 
  ## Scope discipline
 
- - counted-cycle remediation is strictly scoped to the issues from that cycle's initial audit report
+ - `partial pass` remediation is strictly scoped to the issues from that audit report
  - do not let the fix-check loop expand into a fresh issue hunt
- - if a broader new review is needed, finish or abandon the current cycle appropriately and start a fresh evaluator session
+ - if a broader new review is needed, finish or abandon the current scoped bugfix loop appropriately and start a fresh evaluator session
 
  ## Exit standard
 
- - do not move to `P8` until 2 successful counted cycles exist under `../self_test_reports/`
- - failed initial audits may exist under `../.ai/`, but they never count toward the required successful cycles
- - each successful cycle must have its initial audit report and any fix-check reports stored together in its cycle directory
+ - after the second bugfix session completes, run the separate coverage/README audit and treat every issue in that report as blocking work for the most recently used recoverable developer session until the report is clean
+ - keep the coverage/README report path fixed at `../.tmp/test_coverage_and_readme_audit_report.md` and replace the prior copy on each rerun instead of numbering it
+ - do not move to `P8` until 2 bugfix sessions have been completed and the coverage/README audit report is clean
+ - keep every fresh audit report under `../.tmp/audit_report-<N>.md`
+ - for each bugfix session, keep its starting partial-pass audit report and any fix-check reports together by shared audit number in `../.tmp/`
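The report-naming rules in this hunk are mechanical enough to express as path helpers. A minimal Python sketch, assuming paths are resolved relative to the repo working directory as the `../.tmp/` prefix implies; `audit_report_path`, `fix_check_path`, and `COVERAGE_README_REPORT` are hypothetical names.

```python
from pathlib import PurePosixPath

# Parent-root report directory for all P7 audit and fix-check reports.
TMP = PurePosixPath("../.tmp")

# Fixed, un-numbered path: each coverage/README rerun replaces the previous copy.
COVERAGE_README_REPORT = TMP / "test_coverage_and_readme_audit_report.md"


def audit_report_path(audit_number: int) -> PurePosixPath:
    """Fresh audit <N>; numbers are sequential across the whole run, starting at 1."""
    if audit_number < 1:
        raise ValueError("audit numbers start at 1")
    return TMP / f"audit_report-{audit_number}.md"


def fix_check_path(audit_number: int, check_number: int) -> PurePosixPath:
    """Fix-check <M> for the partial-pass audit <N> that opened the bugfix session."""
    if audit_number < 1 or check_number < 1:
        raise ValueError("audit and fix-check numbers start at 1")
    # Sharing the audit number keeps a bugfix session's reports grouped together.
    return TMP / f"audit_report-{audit_number}-fix_check-{check_number}.md"
```

Because every fix-check name is prefixed with its audit number, listing `../.tmp/audit_report-3*` would surface the starting audit and all of its fix-checks together.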
@@ -22,36 +22,56 @@ The canonical evaluation prompt files are:
  - `assets/slopmachine/backend-evaluation-prompt.md`
  - `assets/slopmachine/frontend-evaluation-prompt.md`
  - installed runtime copies used during ordinary evaluation runs:
- - `~/slopmachine/backend-evaluation-prompt.md`
- - `~/slopmachine/frontend-evaluation-prompt.md`
+ - `~/slopmachine/backend-evaluation-prompt.md`
+ - `~/slopmachine/frontend-evaluation-prompt.md`
+ - `~/slopmachine/test-coverage-prompt.md`
 
  The installed runtime copies under `~/slopmachine/` are the ordinary evaluation prompt sources at runtime.
 
  ## Evaluation selection rule
 
- - choose one evaluation prompt kind for the whole `P7` cycle set
+ - choose one fresh-audit evaluation prompt kind for the whole `P7` cycle set
  - if the project is frontend-only, use `~/slopmachine/frontend-evaluation-prompt.md`
  - if the project is backend-only, fullstack, or any other project type, use `~/slopmachine/backend-evaluation-prompt.md`
  - do not run both prompts in the same ordinary workflow cycle
+ - the post-bugfix coverage/README audit is additional and always uses `~/slopmachine/test-coverage-prompt.md`
 
- ## Report root
+ ## Report root and naming
 
- - counted evaluation-cycle reports live under parent-root `../self_test_reports/`
- - require 2 successful counted cycles before `P7` can finish
- - use zero-based cycle directories:
- - `../self_test_reports/cycle-0/`
- - `../self_test_reports/cycle-1/`
+ - all `P7` audit and fix-check reports live under parent-root `../.tmp/`
+ - do not use the older cycle-directory report-root model
+ - number every fresh evaluation audit sequentially across the whole run:
+ - `../.tmp/audit_report-1.md`
+ - `../.tmp/audit_report-2.md`
+ - and so on
+ - for a `partial pass` audit that opens a bugfix session, store each scoped fix-check under that audit number:
+ - `../.tmp/audit_report-<N>-fix_check-1.md`
+ - `../.tmp/audit_report-<N>-fix_check-2.md`
+ - and so on
 
  ## Evaluator-session model
 
- - every initial audit must start from a fresh `General` evaluator session
- - track the active evaluator session id in workflow metadata
- - keep using that same evaluator session for the counted cycle's scoped fix-check loop
- - do not reuse a failed-initial-audit evaluator session for the next fresh audit
+ - every fresh audit must start from a fresh `General` evaluator session
+ - track the active evaluator session id and current audit number in workflow metadata
+ - if a fresh audit returns `partial pass`, keep using that same evaluator session only for the scoped fix-check loop tied to that audit's issue list
+ - do not reuse a `fail` audit evaluator session for the next fresh audit
+ - do not reuse an evaluator session from one audit number for another audit number
+
+ ## Developer-routing model
+
+ - `P7` does not automatically switch to `bugfix-N`
+ - keep the latest `develop-N` session recoverable throughout `P7`
+ - branch fresh audit results this way:
+ - `fail` -> send the issue list back to the latest `develop-N` session
+ - `partial pass` -> start the next `bugfix-N` session tied to that audit number
+ - `pass` -> discard it as a non-counting clean audit and immediately run another fresh evaluation until a `partial pass` opens the next bugfix session
+ - require 2 completed bugfix sessions before the final post-bugfix coverage/README audit can run
+ - after `bugfix-1` completes, run a fresh new evaluation
+ - after `bugfix-2` completes through its scoped fix-check loop, run the separate coverage/README audit before `P7` can close
 
  ## Audit launch rule
 
- For each fresh initial audit:
+ For each fresh audit:
 
  - compose the chosen evaluation prompt yourself; do not tell the evaluator to read prompt files on its own
  - use the original project prompt from metadata
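The prompt-selection and evaluator-session rules in the hunk above can be sketched as follows. This assumes a simple boolean for "frontend-only" and an in-memory registry; `fresh_audit_prompt`, `COVERAGE_PROMPT`, and `EvaluatorSessions` are hypothetical illustrations rather than the package's implementation, and the `~/slopmachine/` paths are kept literal instead of being expanded.

```python
from pathlib import PurePosixPath
from typing import Dict

# Installed runtime prompt copies, kept as the literal ~/slopmachine/ paths.
PROMPTS = PurePosixPath("~/slopmachine")
# The post-bugfix coverage/README audit always uses this prompt.
COVERAGE_PROMPT = PROMPTS / "test-coverage-prompt.md"


def fresh_audit_prompt(frontend_only: bool) -> PurePosixPath:
    """One prompt kind for the whole P7 cycle set; backend for every non-frontend-only type."""
    name = "frontend-evaluation-prompt.md" if frontend_only else "backend-evaluation-prompt.md"
    return PROMPTS / name


class EvaluatorSessions:
    """Tracks which evaluator session belongs to which audit number (hypothetical)."""

    def __init__(self) -> None:
        self._by_audit: Dict[int, str] = {}

    def start_fresh(self, audit_number: int, session_id: str) -> None:
        # Every fresh audit needs a brand-new session; never reuse one across audit numbers.
        if audit_number in self._by_audit or session_id in self._by_audit.values():
            raise ValueError("each audit number needs its own fresh evaluator session")
        self._by_audit[audit_number] = session_id

    def fix_check_session(self, audit_number: int, verdict: str) -> str:
        # Only a partial pass keeps its evaluator session, and only for its own fix-checks.
        if verdict != "partial pass":
            raise ValueError("fix-checks run only for partial pass audits")
        return self._by_audit[audit_number]
```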
@@ -62,64 +82,80 @@ For each fresh initial audit:
  - inject the full original project prompt into the `{prompt}` placeholder for the chosen evaluation prompt content, but otherwise do not rewrite or replace the template body
  - send that fully composed text block directly to one fresh `General` evaluator session
  - require that session to produce a detailed file-backed audit report plus an issue summary
- - record the evaluator session id, prompt kind, and current audit/cycle state in metadata
+ - assign the next audit number and normalize the report path to `../.tmp/audit_report-<N>.md`
+ - record the evaluator session id, prompt kind, audit number, verdict, report path, and routing decision in metadata
 
- ## Initial-audit branching rule
+ ## Fresh-audit branching rule
 
- After the initial audit report is produced, branch by audit result:
+ After each fresh audit report is produced, branch by verdict:
 
  ### `fail`
 
- - this does not count as a successful cycle
- - extract all reported issues and send them to the active developer remediation session
- - fix the issues outside the counted-cycle flow
- - move the failed audit report out of the counted report set into `../.ai/`
- - record that failed audit in metadata under `failed_audits`
- - then start a brand new evaluator session and run a fresh initial audit again
+ - record the audit as a `fail` under its `audit_report-<N>.md` path
+ - extract all reported issues and send them to the latest `develop-N` session
+ - do not open `bugfix-N` for a `fail` audit
+ - fix the issues in that develop session
+ - after remediation, start a brand new evaluator session and run the next fresh audit as `audit_report-<N+1>.md`
+
+ ### `partial pass`
 
- ### `partial pass` or `pass`
+ - record the audit as a `partial pass` under its `audit_report-<N>.md` path
+ - start the next `bugfix-N` session and tie that session to audit number `<N>`
+ - treat the exact issue list from `audit_report-<N>.md` as the full scope of that bugfix session
+ - keep using the same evaluator session only for the scoped fix-check loop for that audit number
 
- - this begins a counted evaluation cycle
- - assign the next counted cycle number
- - create the corresponding cycle directory under `../self_test_reports/`
- - store the initial audit report in that cycle directory as `audit_report_1.md`
- - record the counted cycle in metadata under `self_test_cycles`
+ ### `pass`
 
- ## Counted-cycle fix loop
+ - record the audit as a discarded clean audit under its `audit_report-<N>.md` path
+ - do not open `bugfix-N`
+ - do not count it toward `P7` completion
+ - immediately start another fresh evaluator session and continue `P7` until a `partial pass` opens the next bugfix session
 
- Inside a counted cycle:
+ ## Partial-pass fix-check loop
 
- - treat the exact issue list from that cycle's initial audit report as the scope of the cycle
+ Inside a `partial pass` audit's bugfix loop:
+
+ - treat the exact issue list from `audit_report-<N>.md` as the scope of the loop
  - send that exact issue list to the active `bugfix-N` developer session
  - do not tell the developer to read the audit report file directly
  - require the developer to fix the issues and report the exact verification commands and concrete results
  - after the developer claims the fixes are done, run a rough targeted owner-side verification pass on the affected behavior before asking for evaluator confirmation
- - then return to the same evaluator session and send only the exact issue list for scoped fix confirmation
+ - then return to the same evaluator session and send only the exact issue list or current unresolved subset for scoped fix confirmation
  - require a file-backed fix-check report for that scoped verification pass
- - store each fix-check report inside the current cycle directory as `audit_fix_check_1.md`, `audit_fix_check_2.md`, and so on
- - if unresolved issues remain, take only that unresolved subset back to the developer and repeat the same-session fix-check loop
- - continue until all issues from that cycle's initial audit are resolved
+ - store each fix-check report as `../.tmp/audit_report-<N>-fix_check-<M>.md`
+ - if unresolved issues remain, take only that unresolved subset back to the same bugfix session and repeat the same-session fix-check loop
+ - once all issues from `audit_report-<N>.md` are resolved, mark that bugfix session completed in metadata
+
+ ## Post-bugfix coverage and README audit
+
+ - after 2 bugfix sessions have been completed, do not leave `P7` yet
+ - read `~/slopmachine/test-coverage-prompt.md` yourself before launching the audit
+ - launch a fresh `General` evaluator session for this audit
+ - prepare the audit workspace with `node ~/slopmachine/utils/prepare_strict_audit_workspace.mjs --workspace-root .. --name test-coverage-readme-audit` and use the returned `run_dir` as the evaluator working directory so `repo/README.md` and `../.tmp/` both resolve correctly
+ - compose the request yourself and make clear that the reviewer is working in the current project directory and must write the report to `../.tmp/test_coverage_and_readme_audit_report.md`
+ - before each rerun, remove or replace the previous `../.tmp/test_coverage_and_readme_audit_report.md`; do not keep numbered variants for this report
+ - if the report finds any issue, treat that as blocking `P7` completion
+ - route those issues to the currently active recoverable developer session; prefer the most recently used developer session, which will usually be `bugfix-2`
+ - require fixes plus concrete verification evidence from that developer session
+ - after the fixes land, run a fresh new coverage/README audit again and replace the old report
+ - keep looping until `../.tmp/test_coverage_and_readme_audit_report.md` is clean and the report confirms the minimum 90 percent coverage threshold is satisfied
 
  ## Scope rule
 
- - the counted-cycle fix loop is strictly scoped to the issues reported by that cycle's initial audit report
- - no new issue hunt belongs inside a counted cycle's fix-check loop
- - if a broader new review is needed, that belongs to the next fresh initial audit in a new evaluator session
-
- ## Success target
-
- - require 2 successful counted cycles
- - each successful counted cycle must start from a fresh evaluator session
- - each successful counted cycle must fully resolve the scoped issues from its own initial audit before the next counted cycle begins
+ - a bugfix session opened from `audit_report-<N>.md` is strictly scoped to that audit's issue list
+ - no new issue hunt belongs inside that audit's fix-check loop
+ - if a broader new review is needed, that belongs to the next fresh audit in a new evaluator session
+ - if a later fresh audit fails after a bugfix session completes, route that new fail issue list back to the latest `develop-N` session instead of reopening the completed bugfix session
 
  ## Exit target
 
- - `P7` is complete only after 2 successful counted cycles exist under `../self_test_reports/`
- - failed initial audits may exist under `../.ai/`, but they do not count toward the 2 successful cycles
- - after the second successful counted cycle completes, move to `P8 Final Human Decision`
+ - `P7` is complete only after 2 bugfix sessions have been completed and the post-bugfix coverage/README audit report is clean
+ - the second bugfix session must be completed by resolving its scoped issue list through the same-audit fix-check loop
+ - fresh `pass` audits before that point are discarded clean audits and do not replace the 2-bugfix-session requirement
+ - after the second bugfix session completes, run the coverage/README audit; move to `P8 Final Human Decision` only after that audit passes cleanly
 
  ## Boundaries
 
  - this phase is owner-side evaluation orchestration, not the final human decision gate
- - keep the active `bugfix-N` developer lane for evaluator-driven fixes during `P7`
- - do not reopen the old numbered `self-test-run-N.md` / `self-test-fixes.md` model
+ - keep audit numbering deterministic and monotonic across the whole run
+ - do not reopen the old counted-cycle report-root model
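The exit target in this hunk reduces to two conditions. The sketch below is a hypothetical tracker (`P7Progress` and its field names are not package or metadata names) showing why discarded clean `pass` audits never move the gate: only completed bugfix sessions and a clean coverage/README report do.

```python
from dataclasses import dataclass


@dataclass
class P7Progress:
    """Minimal sketch of the P7 exit gate; field names are hypothetical."""

    completed_bugfix_sessions: int = 0
    coverage_readme_clean: bool = False

    def complete_bugfix_session(self) -> None:
        # Called once a partial-pass audit's scoped issue list is fully resolved
        # through the same-evaluator fix-check loop.
        self.completed_bugfix_sessions += 1

    def record_coverage_readme_audit(self, issue_count: int, meets_90_percent: bool) -> None:
        # The coverage/README audit only runs after the second bugfix session completes;
        # any issue, or a missed 90 percent threshold, keeps P7 open.
        if self.completed_bugfix_sessions < 2:
            raise RuntimeError("run the coverage/README audit only after 2 completed bugfix sessions")
        self.coverage_readme_clean = issue_count == 0 and meets_90_percent

    def can_enter_p8(self) -> bool:
        # Discarded clean `pass` audits never change either field.
        return self.completed_bugfix_sessions >= 2 and self.coverage_readme_clean
```

Each rerun of the coverage/README audit overwrites `coverage_readme_clean`, mirroring how the fixed report path is replaced rather than numbered.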
@@ -33,8 +33,11 @@ Hardening should treat these as the main review buckets before final evaluation
  - audit security boundaries, validation, ownership, and secret handling
  - prioritize authentication, authorization, object ownership, tenant isolation, admin/debug exposure, and secret leakage risk over style issues
  - audit whether the current tests are sufficient to catch major issues in the core business flow, major failure paths, security-critical areas, and obvious high-risk boundaries
- - audit whether parent-root `../docs/test-coverage.md` actually maps major requirement and risk points to concrete tests, assertions, and gaps in a way the owner can follow quickly
- - audit whether the project is actually approaching or achieving at least 90 percent meaningful coverage of the relevant behavior surface rather than relying on a thin happy-path suite
+ - audit whether parent-root `../docs/test-coverage.md` actually maps major requirement and risk points to concrete tests, assertions, gaps, and the intended minimum 90 percent threshold in a way the owner can follow quickly
+ - audit whether the project actually meets the minimum 90 percent coverage threshold for the relevant behavior surface rather than relying on a thin happy-path suite
+ - require concrete coverage evidence during hardening, such as a stack-native coverage report, configured threshold, or equally explicit proof; do not accept approximate claims here
+ - when backend or fullstack APIs exist, audit whether `../docs/test-coverage.md` includes a resolved endpoint inventory, API test mapping, mock classification, and the important modules that still lack meaningful tests
+ - when backend or fullstack APIs exist, audit whether core endpoint coverage is truly no-mock HTTP where it matters, and whether mocked or indirect tests are being overstated as stronger evidence than they are
  - audit env/config paths so sensitive values are injected safely and are not baked into committed files or images
  - inspect architecture, coupling, file size, and maintainability risks
  - focus engineering review on the major maintainability and architecture concerns that materially affect delivery confidence
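The hardening bullets above ask for concrete coverage evidence rather than approximate claims. This sketch assumes a coverage.py JSON report (as written by `coverage json`), whose `totals.percent_covered` field holds the overall percentage; other stacks emit different report formats, so the parsing step is the part to adapt. `coverage_gate` is a hypothetical name.

```python
import json
from typing import Optional

MIN_COVERAGE_PERCENT = 90.0


def coverage_gate(report_text: Optional[str], minimum: float = MIN_COVERAGE_PERCENT) -> bool:
    """Pass only with a concrete report at or above the threshold."""
    if not report_text:
        # No stack-native report means no proof; an approximate claim does not pass.
        return False
    totals = json.loads(report_text).get("totals", {})
    percent = totals.get("percent_covered")
    if percent is None:
        # A report without the expected total is treated as missing evidence.
        return False
    return float(percent) >= minimum
```

Treating a missing report the same as a low one keeps the gate strict: the only way through is an explicit number at or above 90.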
@@ -55,6 +58,15 @@ Hardening should treat these as the main review buckets before final evaluation
  - run a prototype-residue sweep for hardcoded preview values, placeholder text, seeded defaults, hidden fallbacks, and computed-but-unrendered behavior
  - enforce env-file discipline during hardening
  - run documentation verification against the real codebase and runtime behavior, not just document existence
+ - audit README compliance against the strict post-bugfix README review shape:
+ - project type near the top
+ - startup instructions
+ - access method
+ - verification method
+ - demo credentials for every known role or the exact statement `No authentication required`
+ - architecture and workflow clarity
+ - for backend, fullstack, and web projects, verify the README still documents the canonical `docker compose up --build` contract while also containing the exact legacy compatibility string `docker-compose up` for the strict README audit
+ - verify that fast local-iteration traces have been cleaned up before hardening closes: no lingering README dependence on `npm install`, `pip install`, `apt-get`, host-only runtime setup, or manual DB setup for the final delivered flow
  - re-check prompt-critical operational obligations such as scheduled jobs, retention, backups, worker behavior, privacy/accountability logging, and admin controls
  - enter release-candidate mode: stop feature work and focus only on fixes, verification, docs, and packaging preparation
  - make sure the system is genuinely reviewable and reproducible
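Part of the README compliance list added in this hunk can be screened with plain string checks before the strict review. The sketch below is a crude heuristic under stated assumptions: the project type must appear within the first five lines, and naming a role counts as listing its credentials. `readme_issues` is hypothetical and does not replace the strict README audit.

```python
from typing import List

# Strings taken from the hardening checklist; the checks themselves are crude heuristics.
COMPOSE_PHRASES = ["docker compose up --build", "docker-compose up"]
LOCAL_SETUP_TRACES = ["npm install", "pip install", "apt-get"]
NO_AUTH = "No authentication required"


def readme_issues(readme: str, roles: List[str], project_kind: str) -> List[str]:
    """Return README hard-gate problems; an empty list means nothing was flagged."""
    issues: List[str] = []
    # "Project type near the top": assume it must appear within the first 5 lines.
    head = "\n".join(readme.splitlines()[:5]).lower()
    if "project type" not in head:
        issues.append("project type not stated near the top")
    if project_kind in {"backend", "fullstack", "web"}:
        # Canonical contract plus the exact legacy compatibility string.
        for phrase in COMPOSE_PHRASES:
            if phrase not in readme:
                issues.append(f"missing `{phrase}`")
    for trace in LOCAL_SETUP_TRACES:
        if trace in readme:
            # Flag for review: the final delivered flow should not depend on host setup.
            issues.append(f"lingering `{trace}`")
    if roles:
        for role in roles:
            # Crude check: each known role should at least be named next to its credentials.
            if role.lower() not in readme.lower():
                issues.append(f"no demo credentials for role `{role}`")
    elif NO_AUTH not in readme:
        issues.append(f"missing exact statement `{NO_AUTH}`")
    return issues
```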
@@ -68,11 +80,13 @@ Before `P6` can close, the owner should have a clear answer for each of these:
  - prompt-fit: does the delivered project still match the business goal, core flows, and implicit constraints?
  - security-critical flaws: are there any unresolved auth, authorization, isolation, exposure, or secret-handling defects?
  - test sufficiency: are the current tests strong enough to rule out most major issues, and if not, what was added or strengthened?
- - coverage depth: does the current evidence support roughly 90 percent meaningful coverage of the relevant behavior surface, and if not, what remains weak?
+ - coverage depth: does the current evidence prove the minimum 90 percent coverage threshold for the relevant behavior surface, and if not, what remains weak?
+ - endpoint coverage readiness: if backend or fullstack APIs exist, could a strict static reviewer map the important `METHOD + PATH` surfaces to true no-mock HTTP tests, mocked HTTP tests, or unit-only coverage without guessing?
  - major engineering quality: is the project structurally credible and maintainable, rather than piled-up or demo-grade?
  - static audit readiness: would a fresh static reviewer be able to trace the startup path, test path, core module boundaries, and any mock/local-data scope from repository artifacts alone?
  - security-boundary readiness: would a fresh static reviewer be able to explain the real auth, authorization, admin/debug, and isolation boundaries with file-backed evidence?
  - coverage-mapping readiness: would a fresh static reviewer be able to map the major requirement and risk points to concrete tests and remaining gaps without inventing the matrix themselves?
+ - README hard-gate readiness: would a fresh static reviewer find the required project type, startup, access, verification, and auth-disclosure sections in `README.md` without reconstructing them from code?
  - frontend-state readiness: would a fresh static reviewer be able to trace the required frontend state model and key interaction transitions from repo artifacts alone?
  - repo-self-sufficiency: can the repo be reviewed and used without depending on parent-root docs or sibling workflow artifacts?
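The endpoint coverage readiness question in this hunk asks whether each `METHOD + PATH` can be mapped to its strongest test evidence without guessing. A minimal sketch, assuming a hand-maintained endpoint inventory and a ranking of no-mock HTTP over mocked HTTP over unit-only; `Evidence`, `strongest_evidence`, and `overstated` are hypothetical names.

```python
from enum import IntEnum
from typing import Dict, Iterable, List, Tuple

Endpoint = Tuple[str, str]  # (METHOD, PATH)


class Evidence(IntEnum):
    # Higher value = stronger evidence for the endpoint.
    NONE = 0
    UNIT_ONLY = 1
    MOCKED_HTTP = 2
    NO_MOCK_HTTP = 3


def strongest_evidence(
    inventory: Iterable[Endpoint], tests: Iterable[Tuple[Endpoint, Evidence]]
) -> Dict[Endpoint, Evidence]:
    """Keep the strongest test evidence seen for each inventoried endpoint."""
    best = {endpoint: Evidence.NONE for endpoint in inventory}
    for endpoint, evidence in tests:
        # Tests for endpoints missing from the inventory are ignored here.
        if endpoint in best and evidence > best[endpoint]:
            best[endpoint] = evidence
    return best


def overstated(best: Dict[Endpoint, Evidence], claimed_no_mock: Iterable[Endpoint]) -> List[Endpoint]:
    """Endpoints documented as no-mock HTTP whose actual tests are weaker."""
    return sorted(e for e in claimed_no_mock if best.get(e, Evidence.NONE) < Evidence.NO_MOCK_HTTP)
```

Endpoints left at `Evidence.NONE` correspond to the untested surfaces the coverage map is expected to list explicitly.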