theslopmachine 0.3.7 → 0.4.0
- package/MANUAL.md +13 -9
- package/README.md +163 -3
- package/RELEASE.md +11 -3
- package/assets/agents/developer-v2.md +86 -0
- package/assets/agents/developer.md +21 -23
- package/assets/agents/slopmachine-v2.md +219 -0
- package/assets/agents/slopmachine.md +56 -38
- package/assets/skills/beads-operations/SKILL.md +32 -31
- package/assets/skills/beads-operations-v2/SKILL.md +82 -0
- package/assets/skills/clarification-gate/SKILL.md +8 -1
- package/assets/skills/clarification-gate-v2/SKILL.md +74 -0
- package/assets/skills/developer-session-lifecycle/SKILL.md +45 -14
- package/assets/skills/developer-session-lifecycle-v2/SKILL.md +148 -0
- package/assets/skills/development-guidance-v2/SKILL.md +60 -0
- package/assets/skills/evaluation-triage-v2/SKILL.md +38 -0
- package/assets/skills/final-evaluation-orchestration/SKILL.md +9 -11
- package/assets/skills/final-evaluation-orchestration-v2/SKILL.md +57 -0
- package/assets/skills/get-overlays/SKILL.md +77 -6
- package/assets/skills/hardening-gate-v2/SKILL.md +64 -0
- package/assets/skills/integrated-verification-v2/SKILL.md +47 -0
- package/assets/skills/owner-evidence-discipline-v2/SKILL.md +15 -0
- package/assets/skills/planning-gate/SKILL.md +6 -4
- package/assets/skills/planning-gate-v2/SKILL.md +91 -0
- package/assets/skills/planning-guidance-v2/SKILL.md +100 -0
- package/assets/skills/remediation-guidance-v2/SKILL.md +31 -0
- package/assets/skills/report-output-discipline-v2/SKILL.md +15 -0
- package/assets/skills/scaffold-guidance-v2/SKILL.md +57 -0
- package/assets/skills/session-rollover-v2/SKILL.md +41 -0
- package/assets/skills/submission-packaging/SKILL.md +147 -115
- package/assets/skills/submission-packaging-v2/SKILL.md +142 -0
- package/assets/skills/verification-gates/SKILL.md +44 -16
- package/assets/skills/verification-gates-v2/SKILL.md +102 -0
- package/assets/slopmachine/backend-evaluation-prompt.md +9 -2
- package/assets/slopmachine/frontend-evaluation-prompt.md +9 -2
- package/assets/slopmachine/templates/AGENTS-v2.md +55 -0
- package/assets/slopmachine/templates/AGENTS.md +20 -17
- package/assets/slopmachine/tracker-init.js +104 -0
- package/assets/slopmachine/workflow-init-v2.js +99 -0
- package/package.json +1 -1
- package/src/constants.js +22 -3
- package/src/init.js +33 -28
- package/src/install.js +186 -140
- package/src/utils.js +19 -0
- package/assets/slopmachine/beads-init.js +0 -439
@@ -1,6 +1,6 @@
 ---
 name: developer-session-lifecycle
-description: Startup,
+description: Startup, primary developer-session persistence, recovery, and initial project structure rules for repo-cwd tracked development.
 ---
 
 # Developer Session Lifecycle
@@ -9,15 +9,16 @@ Use this skill during startup, tracked developer-session creation, and recovery.
 
 ## Usage rules
 
-- Load this skill before starting the
+- Load this skill before starting the main developer session.
 - Load it again during any recovery or session-consistency check.
 - Treat it as internal orchestration guidance, not developer-visible text.
+- outside the initial clarification approval stop, do not use this skill to create extra pause points; startup should continue directly into development bootup once approval exists
 
-##
+## Main developer session
 
 - use the current working directory as the live codebase and start the fresh developer session in it
 - ensure the parent project root has the required supporting structure, especially `../sessions/`
-- record the developer session id immediately in
+- record the developer session id immediately in the tracker and `.ai/metadata.json`
 - persist the developer session id in more than one durable place
 - reuse that same session throughout development and remediation whenever possible
 - if the developer session crashes, resume it and continue the same work loop
@@ -41,35 +42,65 @@ Optional startup inputs may include:
 1. receive the prompt and stack context
 2. create `../.ai/metadata.json` for internal workflow state
 3. initialize parent-root `../metadata.json` with the required schema and store the full prompt text in `prompt`
-4. initialize root workflow state and top-level phase
+4. initialize root workflow state and top-level phase tracker items
 5. complete clarification using the clarification skill
-6.
-7. ensure the parent project root has the required working structure, especially `../sessions/`
-8. start the
+6. wait only for the initial clarification approval before development starts
+7. ensure the parent project root has the required working structure, especially `../sessions/` and `../docs/`
+8. start the main developer session
 9. send `Let's plan this project: <original-prompt>` as the first message in that session
 10. wait for the developer's first exchange
 11. send the approved clarification prompt as the next guidance message
 12. continue orchestration from there
 
+## Metadata and workflow files
+
+- the internal workflow mirror is `../.ai/metadata.json`
+- the internal clarification artifact is `../.ai/clarification-prompt.md`
+- the project artifact metadata file is `../metadata.json`
+- do not use `../metadata.json` as internal workflow scratch state
+
+Required internal workflow fields in `../.ai/metadata.json`:
+
+- `workflow_state`
+- `current_phase_item`
+- `session_id`
+- `clarification_approved`
+- `awaiting_human`
+- `remediation_round`
+
+Required project metadata fields in `../metadata.json`:
+
+- `prompt`
+- `project_type`
+- `frontend_language`
+- `backend_language`
+- `database`
+- `session_id`
+- `frontend_framework`
+- `backend_framework`
+
+- maintain `../metadata.json` as a real project artifact from the beginning of the workflow
+- fill known values immediately and keep the file current as the project becomes clearer
+- prefer explicit values; use `null` only when a field is genuinely unknown or not applicable
+
 ## Initial structure rule
 
--
--
+- parent-root `../docs/` is the owner-maintained external documentation directory
+- do not treat repo-local `docs/` as the active documentation location in `v3`
 - parent-root `../sessions/` is the session artifact directory for exported conversation traces
 
 ## Recovery rule
 
 - orchestrator restart is handled externally
 - developer-session restart is your responsibility
-- on recovery, read
+- on recovery, read `../.ai/metadata.json`, `../metadata.json`, current phase tracker item, latest `SESSION:` comment, latest unresolved `ISSUE:` comments, and any persistent session record before continuing
 - treat resume as deterministic state recovery, not guesswork
 
 ## Session persistence rule
 
-- store the
-- mirror it in a `SESSION:` comment on the root Bead
+- store the main developer session id in the root tracker item comments using `SESSION:`
 - mirror it in `../.ai/metadata.json`
 - mirror it in parent-root `../metadata.json` as `session_id`
 - once created, treat that session id as locked unless an explicit reset policy is introduced later
 - if these records disagree, stop and resolve the inconsistency before continuing
-- do not silently create a replacement
+- do not silently create a replacement main developer session if the existing one can still be resumed
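A minimal sketch of the session persistence rule in the hunk above, assuming the three recorded ids have already been read from the tracker `SESSION:` comment, `../.ai/metadata.json`, and the `session_id` field of `../metadata.json`. The helper `resolve_session_id`, the exception name, and the choice to treat a partially recorded id as a disagreement are illustrative assumptions, not code from the package.

```python
from typing import Optional


class SessionRecordMismatch(Exception):
    """Raised when the durable session-id records disagree."""


def resolve_session_id(
    tracker_comment_id: Optional[str],
    internal_metadata_id: Optional[str],
    project_metadata_id: Optional[str],
) -> Optional[str]:
    """Return the agreed session id, or None when nothing is recorded yet.

    Hypothetical helper: the arguments stand for the id found in the latest
    `SESSION:` tracker comment, in ../.ai/metadata.json, and in the
    `session_id` field of ../metadata.json.
    """
    distinct = {tracker_comment_id, internal_metadata_id, project_metadata_id}
    if distinct == {None}:
        return None  # fresh startup: no record holds an id
    if len(distinct) > 1:
        # Assumption: an id missing from one record counts as a disagreement.
        raise SessionRecordMismatch(
            "session id records disagree; resolve before continuing"
        )
    return distinct.pop()
```

Raising instead of picking a winner mirrors the instruction to stop and resolve the inconsistency rather than silently creating a replacement session.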
@@ -0,0 +1,148 @@
+---
+name: developer-session-lifecycle-v2
+description: Startup, resume detection, metadata consistency, and developer-session recovery rules for slopmachine-v2.
+---
+
+# Developer Session Lifecycle v2
+
+Use this skill during `P0 Intake and Setup` and whenever startup or recovery state is uncertain.
+
+## Purpose
+
+- detect whether the run is new or resumed
+- initialize or recover workflow metadata consistently
+- initialize the planned bounded developer-session slots
+- recover the current active developer session when one already exists
+
+## Usage rules
+
+- keep startup and recovery in one skill; both begin from the same state inspection problem
+- treat this as internal orchestration guidance, not developer-visible text
+- do not launch the developer during `P0` or `P1`
+- do not use this skill to create extra approval stops beyond the two allowed human gates
+
+## State inspection sequence
+
+Inspect:
+
+- Beads root state and current phase
+- `../.ai/metadata.json`
+- `../metadata.json`
+- existing session comments or recorded session ids
+
+Decide whether the run is:
+
+- a fresh startup
+- a resumed run with consistent state
+- a run needing consistency repair before it can continue
+
+## Startup contract
+
+Expect to start from:
+
+- a project prompt
+- tech stack information when it is not already clear from the prompt
+
+Optional startup inputs may include:
+
+- task id
+- project type
+- explicit constraints or preferences
+
+## Startup flow
+
+1. receive the prompt and stack context
+2. create `../.ai/metadata.json` for internal workflow state
+3. initialize parent-root `../metadata.json` with the required schema and store the full prompt text in `prompt`
+4. initialize root workflow state and top-level phase Beads items
+5. complete clarification using the clarification skill
+6. wait only for the initial clarification approval before development starts
+7. ensure the parent project root has the required working structure, especially `../sessions/` and `../docs/`
+8. initialize the bounded developer-session slots
+9. start the build developer session only after `P2` is ready to begin
+10. send `Let's plan this project: <original-prompt>` as the first message in that session
+11. wait for the developer's first exchange
+12. send the approved clarification prompt as the next guidance message
+13. continue orchestration from there
+
+## Required startup outputs
+
+- root workflow state exists
+- `../.ai/metadata.json` exists
+- `../metadata.json` exists
+- planned developer session slots are initialized
+- required parent-root directories exist
+
+## Metadata and workflow files
+
+- the internal workflow mirror is `../.ai/metadata.json`
+- the internal clarification artifact is `../.ai/clarification-prompt.md`
+- the project artifact metadata file is `../metadata.json`
+- do not use `../metadata.json` as internal workflow scratch state
+
+## Suggested metadata fields
+
+Track at least:
+
+- `current_phase`
+- `awaiting_human`
+- `clarification_approved`
+- `remediation_round`
+- `developer_sessions`
+- `active_developer_session_index`
+
+Each planned developer session record should include enough to recover it later, such as:
+
+- `index`
+- `label`
+- `phase_group`
+- `session_id`
+- `status`
+- `handoff_in`
+- `handoff_out`
+
+Required project metadata fields in `../metadata.json` when relevant:
+
+- `prompt`
+- `project_type`
+- `frontend_language`
+- `backend_language`
+- `database`
+- `session_id`
+- `frontend_framework`
+- `backend_framework`
+
+- maintain `../metadata.json` as a real project artifact from the beginning of the workflow
+- fill known values immediately and keep the file current as the project becomes clearer
+- prefer explicit values; use `null` only when a field is genuinely unknown or not applicable
+
+## Bounded session model
+
+Track up to three planned developer sessions:
+
+1. build
+2. stabilization
+3. remediation
+
+Later session slots may remain unused if the workflow never needs them.
+
+## Initial structure rule
+
+- parent-root `../docs/` is the owner-maintained external documentation directory
+- parent-root `../sessions/` is the session artifact directory for exported conversation traces
+- do not treat repo-local `docs/` as the active external documentation location
+
+## Recovery rule
+
+- if session records disagree, stop and resolve the inconsistency before continuing
+- if the current phase already has an active developer session, recover that session instead of silently creating a new one
+- treat resume as deterministic recovery, not guesswork
+- on recovery, read `../.ai/metadata.json`, `../metadata.json`, current phase Beads item, latest `SESSION:` comment, latest unresolved `ISSUE:` comments, and any persistent session record before continuing
+
+## Session persistence rule
+
+- store the active developer session id in Beads comments using `SESSION:`
+- mirror it in `../.ai/metadata.json`
+- mirror the active session id in parent-root `../metadata.json` as `session_id`
+- if these records disagree, stop and resolve the inconsistency before continuing
+- do not silently create a replacement developer session if the intended existing one can still be resumed
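The bounded session model and the state inspection sequence in the v2 skill above could be sketched roughly as follows. This is an illustration under stated assumptions: the field names come from the skill's suggested metadata fields, but the function names, the zero-based `index`, the `"planned"` status value, and the three result labels are hypothetical, and the Beads tracker state is deliberately left out.

```python
from typing import Any, Dict, List, Optional

# Slot labels taken from the "Bounded session model" list in the skill text.
SLOT_LABELS = ["build", "stabilization", "remediation"]


def init_session_slots() -> List[Dict[str, Any]]:
    """Create the three planned developer-session slots, all still unused."""
    return [
        {
            "index": index,  # assumption: zero-based
            "label": label,
            "phase_group": None,
            "session_id": None,
            "status": "planned",  # assumption: status vocabulary is not given
            "handoff_in": None,
            "handoff_out": None,
        }
        for index, label in enumerate(SLOT_LABELS)
    ]


def classify_run(
    internal: Optional[Dict[str, Any]],
    project: Optional[Dict[str, Any]],
) -> str:
    """Classify a run as 'fresh', 'resumed', or 'needs_repair'.

    `internal` stands for the parsed ../.ai/metadata.json and `project` for
    the parsed ../metadata.json; None means the file does not exist.
    """
    if internal is None and project is None:
        return "fresh"
    if internal is None or project is None:
        return "needs_repair"  # only one of the two files exists

    slots = internal.get("developer_sessions") or []
    active_index = internal.get("active_developer_session_index")
    active_id = None
    if active_index is not None:
        if not 0 <= active_index < len(slots):
            return "needs_repair"  # index points at a slot that is not there
        active_id = slots[active_index].get("session_id")

    # The active session id must match the mirror in ../metadata.json.
    if active_id != project.get("session_id"):
        return "needs_repair"
    return "resumed"
```

Keeping the classification a pure function of the two parsed files makes the "deterministic recovery, not guesswork" rule easy to test in isolation.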
@@ -0,0 +1,60 @@
+---
+name: development-guidance-v2
+description: Developer-facing slice execution and local verification guidance for slopmachine-v2.
+---
+
+# Development Guidance v2
+
+Use this skill during `P4 Development` before prompting the developer.
+
+## Slice model
+
+- work in bounded vertical slices
+- complete the real user-facing and admin-facing surface for the slice
+- keep slice-local planning, implementation, verification, and doc sync together
+
+## Module implementation guidance
+
+- define lightweight planning notes for the module before coding
+- define the module purpose, constraints, and edge cases before coding
+- keep the original requirement and clarified interpretation visible while implementing so the module does not silently drift
+- implement real behavior, not partial scattered logic
+- handle failure paths and boundary conditions
+- add or update tests as part of the module work
+- make sure the module is moving toward full definition-of-done completion, not just happy-path completion
+- keep auth, authorization, ownership, validation, and logging concerns in view when relevant
+- keep frontend and backend contracts synchronized when the module spans both sides
+- verify the module integrates cleanly with existing modules, routes, permissions, shared state, and cross-cutting helpers rather than only proving the new feature path in isolation
+- check cross-cutting consistency where relevant, especially permissions, error handling, audit/logging/redaction behavior, and state or context transition behavior
+- verify tenant or ownership isolation where relevant so access is scoped to the authorized context rather than merely functionally working for one actor
+- verify file and export paths are validated and confined to allowed roots when the module reads, writes, imports, or exports files
+- verify error and auth responses are user-safe and do not leak internal reasons, paths, stack details, or sensitive state
+- perform a clean-slate sweep before reporting module completion: remove seeded credentials, weak demo defaults, test-account hints, prototype residue, and other production-inappropriate artifacts
+- do not treat backend existence, composable existence, or partial wiring as completion if the user-visible flow is still incomplete
+- when the prompt says users can manage or configure something, implement full management behavior rather than create-only controls where appropriate
+- if a required user-facing or admin-facing surface is missing, treat that gap as incomplete implementation rather than a reason to bypass the surface with direct API calls or test-only shortcuts
+- do not leave computed-but-unrendered or partially surfaced requirement behavior in place
+- do not treat a module as complete when a meaningful user-facing, release-facing, production-path, or build verification is known to be failing unless the owner explicitly scopes that check out
+- use the `frontend-design` skill for frontend component or page work
+- use the `frontend-design` skill during frontend/UI verification when reviewing Playwright screenshots and tightening the interface
+- do not hardcode secrets or persist local sensitive values in the repo while implementing
+- explain behavior changes clearly enough that the documentation discipline can be satisfied accurately
+- verify the module against its planned behavior before trying to move on
+- do not move on while the module is still obviously weak or half-finished
+
+## Verification model
+
+- use targeted local verification by default
+- use local Playwright on affected flows when UI is material
+- avoid broad Docker/full-suite commands during ordinary slice work
+- prefer fast local language-native or framework-native test commands for the changed area during normal iteration
+- set up and use the local test environment inside the current working directory so normal verification does not depend on hidden global tooling assumptions
+- if the local toolchain is missing, try to install or enable it before falling back to the broad gate path
+- when the slice materially changes frontend code, frontend tooling, or release-facing build behavior, include production build health in meaningful local verification when practical
+
+## Quality rules
+
+- do not bypass required UI surfaces with API shortcuts and call that done
+- do not leave placeholder, demo, setup, or debug UI in product-facing screens
+- do not report completion while known release-facing failures still exist
+- product UI should serve the real workflow only
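One of the module rules above asks that file and export paths be validated and confined to allowed roots, and another that error responses stay user-safe. A minimal sketch of both together, assuming a single allowed root directory; the function name `resolve_within_root` and the error wording are hypothetical, not taken from the package.

```python
from pathlib import Path


def resolve_within_root(allowed_root: str, requested: str) -> Path:
    """Resolve `requested` under `allowed_root`, rejecting any escape.

    Both paths are fully resolved first, so `..` segments, absolute paths,
    and symlinks pointing outside the root are all caught by the same check.
    """
    root = Path(allowed_root).resolve()
    candidate = (root / requested).resolve()
    if candidate != root and root not in candidate.parents:
        # Keep the message user-safe: no internal paths or reasons leaked.
        raise ValueError("requested path is not allowed")
    return candidate
```

The check compares resolved paths rather than raw strings, which avoids the common prefix-matching mistake where `/data/exports-private` would pass a test against `/data/exports`.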
@@ -0,0 +1,38 @@
+---
+name: evaluation-triage-v2
+description: Owner-side evaluation report triage rules for slopmachine-v2.
+---
+
+# Evaluation Triage v2
+
+Use this skill during `P7 Evaluation and Triage` after evaluation reports exist.
+
+## Rules
+
+- evaluation findings are advisory inputs, not automatic orders
+- accept or reject findings explicitly
+- keep accepted findings concrete and bounded
+- do not enter remediation just because a report found something; enter it only when the accepted findings justify it
+- if no remediation is needed, move directly to the final human decision
+
+## Triage rules
+
+- read both reports and merge the findings into one explicit triage set before deciding what happens next
+- use the evaluator priority ordering directly when triaging findings unless stronger direct evidence says otherwise
+- any finding marked `Blocker` or `High` should normally be returned for remediation
+- findings marked `Medium` may be passed in limited cases, but should usually be fixed when they materially improve confidence, correctness, or acceptance readiness
+- findings marked `Low` may be passed without remediation
+- do not treat complaints about test coverage depth, unverifiable tests, or evaluator inability to confirm a test path as automatic blockers by themselves
+- if your own direct evidence shows the tests run and the coverage is acceptable for qualification, defend the project and pass those findings instead of automatically remediating
+- if a report says it could not verify some behavior because of environment limits or avoidable verification setup issues, first decide whether you can remove that constraint and rerun the evaluation in a cleaner state
+- if the evaluator could not verify something but your own verified evidence already shows the behavior is acceptable, do not treat that as an automatic remediation trigger
+- challenge weak, random, or overreaching findings using your stronger project context and direct codebase knowledge
+- never edit or rewrite the evaluation report itself
+- if you need to add context, disagreement, or justification, append it only as a clearly labeled `User comment/message` section at the bottom of the report
+- do not loop forever chasing every newly surfaced medium or low issue once the project is otherwise qualified
+
+## Output standard
+
+- keep a clear accepted-finding set
+- keep a clear rejected or passed set when disagreement matters
+- keep the remediation brief focused on accepted issues only
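The severity-based triage rules above can be read as a small decision table. The sketch below is one possible reading, not the package's logic: the finding fields `severity`, `material`, and `refuted_by_owner_evidence` are hypothetical names, and defaulting a `Medium` finding to "material" reflects the skill's "should usually be fixed" wording.

```python
from typing import Any, Dict, List

Finding = Dict[str, Any]


def triage(findings: List[Finding]) -> Dict[str, List[Finding]]:
    """Split merged evaluator findings into accepted and passed sets."""
    accepted: List[Finding] = []
    passed: List[Finding] = []
    for finding in findings:
        severity = finding["severity"]
        if finding.get("refuted_by_owner_evidence", False):
            # Stronger direct evidence overrides the evaluator's ordering.
            passed.append(finding)
        elif severity in ("Blocker", "High"):
            accepted.append(finding)
        elif severity == "Medium" and finding.get("material", True):
            accepted.append(finding)
        else:
            # Low findings, and Medium findings judged not material.
            passed.append(finding)
    return {"accepted": accepted, "passed": passed}
```

Remediation would then be entered only when the `accepted` list is non-empty, and the remediation brief would be built from that list alone.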
@@ -12,17 +12,18 @@ Use this skill only after integrated verification and hardening are complete eno
 - Load this skill only during final evaluation and evaluation-driven remediation decisions.
 - Treat it as internal orchestration guidance.
 - Do not let evaluator findings automatically override your triage judgment.
+- this phase is the only allowed later human-stop point after development has started; do not create any other approval pauses outside the final evaluation decision itself
 
 ## Prompt sources
 
-- `~/
-- `~/
+- `~/backend-evaluation-prompt.md`
+- `~/frontend-evaluation-prompt.md`
 
 ## Evaluation execution rules
 
 - when the project reaches final-evaluation readiness, run two separate evaluations:
-- backend/non-frontend evaluation using `~/
-- frontend evaluation using `~/
+- backend/non-frontend evaluation using `~/backend-evaluation-prompt.md`
+- frontend evaluation using `~/frontend-evaluation-prompt.md`
 - read the full original project prompt from parent-root `../metadata.json` field `prompt`
 - read the respective evaluation prompt file contents yourself before launching evaluation
 - compose each evaluation request yourself as one large final prompt block
@@ -40,10 +41,6 @@ Use this skill only after integrated verification and hardening are complete eno
 - require each evaluation session to produce its own detailed evaluation report artifact
 - always compare both evaluations against the original prompt for alignment, not just the delivered implementation
 
-On the first evaluation round, preserve and use this instruction exactly inside both evaluation prompts/sessions:
-
-"Please confirm whether the current project tests are genuine and effective rather than superficial or fake tests, whether the API tests actually invoke real HTTP endpoints, and whether they cover more than 90% of the overall API surface."
-
 ## Triage rules
 
 - read both reports and merge the findings into one explicit triage set before deciding what happens next
@@ -66,10 +63,11 @@ On the first evaluation round, preserve and use this instruction exactly inside
 ## Remediation loop
 
 - route accepted blocking issues back into remediation in the same long-lived `Developer` session
-- after remediation, rerun
--
--
+- after remediation, rerun strong local verification before any re-evaluation:
+- relevant local test commands
+- local runtime checks when affected behavior needs runtime proof
 - Playwright where applicable, with fresh screenshots
+- if remediation materially reopens an owner-run milestone boundary, route the project back to that milestone gate before any re-evaluation instead of treating every remediation pass as an automatic Docker and `run_tests.sh` moment
 - rerun only the evaluation tracks that have not already passed, each in a brand new fresh `General` session and still sequentially
 - keep the remediation loop bounded and explicit so you never lose track of the active evaluation round or the accepted issue set
 - remember the external process allows a maximum of 3 repair rounds
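The remediation loop above combines two bookkeeping rules: rerun only the evaluation tracks that have not passed, and never exceed 3 repair rounds. A minimal sketch, assuming the two tracks are named `backend` and `frontend` and are rerun in that order; the function name, the track keys, and the use of `RuntimeError` are illustrative assumptions.

```python
from typing import Dict, List

MAX_REPAIR_ROUNDS = 3  # "maximum of 3 repair rounds" from the skill text
TRACK_ORDER = ("backend", "frontend")  # assumption: fixed sequential order


def tracks_to_rerun(track_passed: Dict[str, bool], remediation_round: int) -> List[str]:
    """Return the evaluation tracks still needing a fresh evaluation session.

    `remediation_round` is the 1-based round about to be evaluated. A track
    missing from `track_passed` is treated as not yet passed.
    """
    if remediation_round > MAX_REPAIR_ROUNDS:
        raise RuntimeError("repair-round limit reached; escalate instead of looping")
    return [track for track in TRACK_ORDER if not track_passed.get(track, False)]
```

Returning an ordered list rather than a set keeps the evaluations sequential, which the skill requires so the two sessions do not contend for shared runtime state.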
@@ -0,0 +1,57 @@
+---
+name: final-evaluation-orchestration-v2
+description: Evaluation execution rules for slopmachine-v2.
+---
+
+# Final Evaluation Orchestration v2
+
+Use this skill only during `P7 Evaluation and Triage`.
+
+## Usage rules
+
+- load this skill only during final evaluation and evaluation-driven remediation decisions
+- treat it as internal orchestration guidance
+- do not let evaluator findings automatically override your triage judgment
+
+## Prompt sources
+
+- `~/backend-evaluation-prompt.md`
+- `~/frontend-evaluation-prompt.md`
+
+## Evaluation execution rules
+
+- run backend and frontend evaluation in separate fresh `General` sessions
+- compose the evaluation prompts yourself; do not tell the evaluator to read prompt files on its own
+- use the original project prompt from metadata
+- read the respective evaluation prompt file contents yourself before launching evaluation
+- compose each evaluation request yourself as one large final prompt block
+- prefix each evaluation request with a clear instruction that the reviewer must work in the current project directory and evaluate that delivered project
+- inject the full original project prompt into the `{prompt}` placeholder for the chosen evaluation prompt content
+- send that fully composed text block directly to the fresh `General` evaluator session
+- never tell the evaluator to go read prompt files, metadata files, or evaluation template paths on its own
+- never send only a path, filename, or shorthand reference and expect the evaluator to assemble the prompt itself
+- never reuse, resume, or continue a prior evaluation session
+- run the two evaluations sequentially, not in parallel, so shared runtime state, ports, databases, and artifacts do not conflict
+- track backend and frontend evaluation status separately
+- once backend evaluation passes, do not run backend evaluation again in later remediation rounds
+- once frontend evaluation passes, do not run frontend evaluation again in later remediation rounds
+- require each evaluation session to produce its own detailed evaluation report artifact
+- always compare both evaluations against the original prompt for alignment, not just the delivered implementation
+- keep reports file-backed and bring only short summaries into chat
+- rerun only the evaluation track that still needs re-evaluation after remediation
+
+## Remediation loop
+
+- route accepted blocking issues back into the active remediation developer-session slot rather than inventing an untracked side path
+- after remediation, rerun strong local verification before any re-evaluation:
+- relevant local test commands
+- local runtime checks when affected behavior needs runtime proof
+- Playwright where applicable, with fresh screenshots
+- if remediation materially reopens an owner-run broad milestone boundary, route the project back to that boundary before re-evaluation instead of treating every remediation pass as an automatic broad rerun moment
+- keep the remediation loop bounded and explicit so you never lose track of the active evaluation round or the accepted issue set
+- remember the external process allows a maximum of 3 repair rounds
+
+## Boundaries
+
+- this phase is owner-side analysis, not the final human decision gate
+- do not create extra human pauses while evaluation and triage are still active
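The evaluation execution rules above describe composing one self-contained prompt block: a working-directory instruction, followed by the evaluation template with the original project prompt injected into its `{prompt}` placeholder. A minimal sketch, assuming the template text has already been read from the prompt file; the function name and the exact preamble wording are hypothetical.

```python
# Assumed wording; the skill only requires that such an instruction is present.
PREAMBLE = (
    "Work in the current project directory and evaluate the project "
    "delivered there."
)


def compose_evaluation_request(template_text: str, original_prompt: str) -> str:
    """Build the full text block sent to a fresh evaluator session."""
    if "{prompt}" not in template_text:
        raise ValueError("evaluation template has no {prompt} placeholder")
    # str.replace is used instead of str.format so that any other braces in
    # the template (JSON samples, code snippets) are left untouched.
    body = template_text.replace("{prompt}", original_prompt)
    return PREAMBLE + "\n\n" + body
```

Because the function returns the finished text, the evaluator never needs a path or filename, which is the point of the "never send only a path" rule.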
@@ -15,6 +15,22 @@ Use this skill when you need the detailed developer overlay guidance for one of
 - Pass only the relevant guidance for the current engineering step, not the whole section verbatim unless the moment truly needs it.
 - Prefer short, natural teammate-style prompts.
 
+## Cross-Cutting Documentation Discipline
+
+- the owner maintains external docs under parent-root `../docs/`
+- the developer should keep `README.md` and any codebase-local docs accurate for repo-local use
+- planning and implementation guidance should stay explicit enough that owner-maintained external docs can be updated accurately
+- verification and hardening should check both `README.md` and the owner-maintained external docs against implementation reality
+- `README.md` must stay codebase-specific and must not become an index or explanation of the external docs set
+
+## Cross-Cutting Env-File Discipline
+
+- never create or keep `.env` files anywhere in the repo tree
+- do not allow committed `.env` files even as placeholders or examples
+- keep real secrets out of the repository and rely on Docker-provided runtime variables for sensitive values
+- if the stack requires env-file format at runtime, generate it ephemerally from Docker-provided runtime variables rather than storing it in the repo or package
+- verify the delivered project can start from scratch without any preexisting `.env` file in the repo or package
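One way to check the env-file rules above is a small scan of the repo tree. The sketch below is an assumption-laden illustration, not tooling shipped by the package: `find_env_files` is a hypothetical helper, and treating `.env.*` names (for example `.env.example`) as violations is this sketch's reading of "even as placeholders or examples".

```python
import os
from typing import List


def find_env_files(root: str) -> List[str]:
    """Return root-relative paths of files named `.env` or `.env.*` under root."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Assumption: VCS metadata is not part of the delivered tree, so skip it.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            if name == ".env" or name.startswith(".env."):
                full = os.path.join(dirpath, name)
                hits.append(os.path.relpath(full, root))
    return sorted(hits)
```

An empty result is the passing case; any returned path would need to be removed before verification or packaging.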
+
 ## Phase mapping
 
 - `P2 Development Bootstrap and Planning` -> `Planning And Design`
@@ -31,20 +47,44 @@ Use this skill when you need the detailed developer overlay guidance for one of
 - start from the actual project prompt and build the plan from there
 - carry the settled project requirements forward consistently as you plan
 - identify the hard non-negotiable requirements early and do not quietly trade them away for implementation convenience
+- when planning technical items that depend on a library, framework, API, or tool, check Context7 documentation first for authoritative usage details
+- when planning needs targeted outside research beyond direct documentation, use Exa web search next
+- use technical research to strengthen concrete planning decisions, interfaces, constraints, and verification strategy rather than leaving them vague
 - break the problem into explicit requirements, constraints, flows, boundaries, and edge cases
 - map each meaningful requirement to its owning module, visible UI/API surface, failure behavior, test target, and final acceptance check
--
+- make the planning explicit enough that the owner can maintain external design notes and API/spec docs accurately when relevant
 - keep the spec focused on required behavior rather than turning it into a progress or completion narrative
 - define major modules as meaningful delivery units, not arbitrary folders
 - for fullstack work, map frontend surfaces, routes, components, and state boundaries to the backend modules and contracts that support them
+- for fullstack work, make the frontend-to-backend crosswalk explicit enough that each major route, page, component group, or state boundary has a defined supporting backend module, endpoint, and data shape
+- when the prompt says behavior is configurable, plan the real configuration surface, data model, permissions, and operator flow rather than treating configurability as an implementation detail to invent later
+- when a feature must be admin-manageable or operator-manageable, plan the real usable UI surface for that management flow, not just the backing API or data model
 - define failure paths, permissions, validation, logging, runtime assumptions, and test strategy before coding
+- for complex security, offline, sync, authorization, or data-governance features, define what `done` means across all prompt-promised dimensions rather than stopping at a partial foundation or hook layer
+- define shared lifecycle and state models when the product has meaningful workflow state, and keep those models aligned across design notes and API/spec notes
+- require cross-document consistency so design, API/spec, and test-planning artifacts do not drift on lifecycle/state models, flow coverage, permissions, or operational behavior
 - define logging and observability expectations for both frontend and backend
+- define operator visibility and operator workflow expectations when the prompt implies admin, operational, audit, backup, or support responsibilities
+- when the system has meaningful cross-cutting behavior, define shared implementation contracts early rather than leaving each module to invent its own pattern
+- define error-handling contracts when relevant, including normalization patterns for user-visible errors and backend error-shape expectations
+- define audit contracts when relevant, including centralized helper or service expectations and redaction rules
+- define permission contracts when relevant so navigation visibility, route guards, and API enforcement stay aligned
+- define state-lifecycle contracts when relevant, including context-switch or tenant-switch cleanup expectations
+- define auth edge-case expectations when relevant, such as token refresh, session expiry, or clock-skew tolerance
 - call out operational obligations early when they are prompt-critical, such as scheduling, retention, backups, workers, auditability, or offline behavior
+- define infrastructure requirements early when they are material to correctness, such as rate limiting, encryption boundaries, production-equivalent test infrastructure, and browser-storage rules for sensitive data
+- define frontend validation and accessibility expectations when the product surface materially depends on them, including keyboard, focus, feedback, and other user-interaction quality requirements where relevant
+- if backup or recovery behavior is prompt-critical, plan the designated media, operator drill flow, visibility, and verification expectations explicitly
+- if the prompt names literal storage, indexing, partitioning, retention, or performance dimensions, represent them literally in the planning artifacts rather than abstracting them away
+- for frontend work, unless the prompt, existing repository, or established stack clearly dictates otherwise, default to Tailwind CSS for styling and `shadcn/ui` for component primitives
+- if the existing project already uses a different UI system, preserve and extend that system instead of forcing Tailwind CSS or `shadcn/ui` into it
 - define end-to-end coverage for major user flows before coding
 - for fullstack work, explicitly plan Playwright coverage for the synchronized frontend/backend flows when end-to-end testing is applicable
 - when UI-bearing flows are material, explicitly plan screenshot review as part of Playwright verification so UI correctness is checked, not just browser success
 - aim for at least 90 percent meaningful coverage of the relevant behavior surface
 - define verification strategy, Docker expectations, and documentation implications before coding
+- for each major module, define how it integrates with existing modules and which shared contracts it must follow consistently
+- define verification plans that include cross-module scenarios and seam checks, not just isolated feature checks
 - make the plan detailed enough to guide real implementation and later verification
 - review the module map and make sure it is stable before deeper implementation begins
 - do not move into deeper implementation with vague architecture or unstable module boundaries
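The planning hunk above asks for an error-handling contract with a shared backend error shape and normalization of user-visible errors. A minimal sketch of what such a contract might look like is given below; the `ApiError` shape, the error codes, and the exception-to-error mapping are all hypothetical choices made for illustration, not something the skill prescribes.

```python
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass(frozen=True)
class ApiError:
    """Hypothetical shared error shape every module would return."""

    code: str
    message: str
    status: int


# Hypothetical mapping from internal exception types to user-safe errors.
_KNOWN_ERRORS = {
    PermissionError: ApiError("forbidden", "You do not have access to this resource.", 403),
    KeyError: ApiError("not_found", "The requested item was not found.", 404),
    ValueError: ApiError("invalid_input", "The request could not be validated.", 422),
}
_FALLBACK = ApiError("internal_error", "Something went wrong. Please try again.", 500)


def normalize_error(exc: Exception) -> Dict[str, object]:
    """Map any exception to the shared shape without echoing internal detail."""
    for exc_type, shape in _KNOWN_ERRORS.items():
        if isinstance(exc, exc_type):
            return asdict(shape)
    return asdict(_FALLBACK)
```

Because the exception's own message is never copied into the response, internal paths or reasons carried by the exception stay out of user-visible output.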
@@ -57,13 +97,25 @@ Use this skill when you need the detailed developer overlay guidance for one of
 - create required testing directories and baseline docs structure
 - put baseline config and logging structure in place
 - put migrations, worker/job foundation, and real runtime health surfaces in place when the project needs them
--
--
+- never create or keep `.env` files anywhere in the repo tree
+- treat prompt-critical security controls as real baseline runtime behavior, not placeholder checks or visual wiring
+- if a requirement implies enforcement, persistence, statefulness, or rejection behavior, make that behavior real in the scaffold unless the prompt clearly scopes it down
+- do not accept shape-only security implementations such as header presence checks, passive constants, or partially wired middleware when the requirement implies real protection
+- when applicable at scaffold time, require real security baselines such as nonce reuse rejection rather than nonce-header presence, real lockout behavior rather than config-only lockout values, CSRF rejection on protected mutations, and meaningful server-side state when the protection model depends on it
+- keep real secrets out of the repository and rely on Docker-provided runtime variables for sensitive values
+- do not allow committed `.env` files even as placeholders or examples
+- if the stack requires env-file format at runtime, generate it ephemerally from Docker-provided runtime variables rather than storing it in the repo or package
 - remove prototype residue from runtime foundations: no placeholder titles, hidden setup, fake defaults, or seeded live-path assumptions
 - make prompt-critical runtime behavior visible in the scaffold instead of hand-waving it for later, especially offline, worker, backup, or HTTPS requirements
+- keep Docker runtime isolation clean in shared environments: use self-contained Compose namespacing, avoid fragile generic project names, and prefer Compose-managed service naming over unnecessary hardcoded `container_name` values
+- require reproducible build and tooling foundations: prefer lockfile-driven installs where the stack supports them, keep source and build outputs clearly separated, and do not allow generated runtime artifacts to drift back into source directories
+- for typed build pipelines, keep source-of-truth boundaries clean so compiled output does not create TS/JS or similar dual-source drift in the working tree
 - establish README structure early instead of leaving it until the end
 - prove the scaffold in a clean state before deeper feature work
 - verify `docker compose up` and `run_tests.sh` in the clean scaffold state
+- verify clean `docker compose up --build -d` and `docker compose down` behavior under the chosen project namespace when Dockerized execution is in scope
+- when the architecture materially depends on infrastructure capabilities such as rate limiting, encryption, offline support, or browser-storage policy, put the baseline framework and policy in place during scaffold rather than deferring it to late implementation
+- for backend integration paths, prefer production-equivalent test infrastructure when practical rather than silently substituting a weaker database or runtime model that can hide real defects
 - do not treat scaffold as placeholder boilerplate or rely on hidden setup
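The scaffold rules above contrast "nonce reuse rejection" with mere "nonce-header presence". The difference can be illustrated with a small replay guard that keeps server-side state. This is a sketch under stated assumptions: `NonceStore` is a hypothetical name, storage is in-memory only (a real deployment would likely need shared storage), and letting a nonce be reused after the TTL assumes requests also carry a timestamp that is validated elsewhere.

```python
import time
from typing import Dict, Optional


class NonceStore:
    """Hypothetical in-memory replay guard with a time-to-live per nonce."""

    def __init__(self, ttl_seconds: float = 300.0) -> None:
        self._ttl = ttl_seconds
        self._seen: Dict[str, float] = {}

    def accept(self, nonce: Optional[str], now: Optional[float] = None) -> bool:
        """Return True only for a non-empty nonce not seen within the TTL."""
        now = time.monotonic() if now is None else now
        # Forget nonces whose TTL has elapsed so the store stays bounded.
        self._seen = {n: t for n, t in self._seen.items() if now - t < self._ttl}
        if not nonce:
            return False  # a missing or empty nonce is rejected
        if nonce in self._seen:
            return False  # replay: presence of the header is not enough
        self._seen[nonce] = now
        return True
```

A presence-only check would accept the second request with the same nonce; the stateful check rejects it, which is the behavior the scaffold guidance treats as the real baseline.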
 
 ## Module Implementation
@@ -79,17 +131,26 @@ Use this skill when you need the detailed developer overlay guidance for one of
 - set up and use the local test environment inside the current working directory so normal verification does not depend on hidden global tooling assumptions
 - if the local toolchain is missing, try to install or enable it before falling back to `run_tests.sh`
 - for applicable fullstack or UI-bearing work, run local Playwright on the affected flows during implementation and inspect screenshots to confirm the UI actually matches
+- when the slice materially changes frontend code, frontend tooling, or release-facing build behavior, include production build health in meaningful local verification when practical
 - make sure the module is moving toward full definition-of-done completion, not just happy-path completion
 - keep auth, authorization, ownership, validation, and logging concerns in view when relevant
 - keep frontend and backend contracts synchronized when the module spans both sides
+- verify the module integrates cleanly with existing modules, routes, permissions, shared state, and cross-cutting helpers rather than only proving the new feature path in isolation
+- check cross-cutting consistency where relevant, especially permissions, error handling, audit/logging/redaction behavior, and state or context transition behavior
+- verify tenant or ownership isolation where relevant so access is scoped to the authorized context rather than merely functionally working for one actor
+- verify file and export paths are validated and confined to allowed roots when the module reads, writes, imports, or exports files
+- verify error and auth responses are user-safe and do not leak internal reasons, paths, stack details, or sensitive state
+- perform a clean-slate sweep before reporting module completion: remove seeded credentials, weak demo defaults, test-account hints, prototype residue, and other production-inappropriate artifacts
 - do not treat backend existence, composable existence, or partial wiring as completion if the user-visible flow is still incomplete
 - when the prompt says users can manage or configure something, implement full management behavior rather than create-only controls where appropriate
+- if a required user-facing or admin-facing surface is missing, treat that gap as incomplete implementation rather than a reason to bypass the surface with direct API calls or test-only shortcuts
 - do not leave computed-but-unrendered or partially surfaced requirement behavior in place
+- do not treat a module as complete when a meaningful user-facing, release-facing, production-path, or build verification is known to be failing unless the owner explicitly scopes that check out
 - do not ship frontend screens with demo/debug/setup messaging or development-only status text; product UI should serve the real workflow only
 - use the `frontend-design` skill for frontend component or page work
 - use the `frontend-design` skill during frontend/UI verification when reviewing Playwright screenshots and tightening the interface
 - do not hardcode secrets or persist local sensitive values in the repo while implementing
--
+- explain behavior changes clearly enough that the documentation discipline can be satisfied accurately
 - verify the module against its planned behavior before trying to move on
 - do not move on while the module is still obviously weak or half-finished
 
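The module-implementation rule that file and export paths must be "validated and confined to allowed roots" can be illustrated with a short confinement check. The helper name `resolve_within` is hypothetical, and the sketch assumes a single allowed root directory; it is one possible approach rather than the package's own implementation.

```python
from pathlib import Path


def resolve_within(root: str, user_path: str) -> Path:
    """Resolve user_path under root, rejecting anything that escapes root."""
    base = Path(root).resolve()
    # Joining with an absolute user_path replaces base entirely, and resolve()
    # collapses `..` segments, so both escape styles are caught by the check.
    candidate = (base / user_path).resolve()
    if candidate != base and base not in candidate.parents:
        raise ValueError("path escapes allowed root")
    return candidate
```

Negative cases such as `../outside.txt` or an absolute path like `/etc/passwd` raise `ValueError`, which matches the traversal-style checks the verification guidance calls for.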
@@ -106,16 +167,23 @@ Use this skill when you need the detailed developer overlay guidance for one of
 - verify requirement closure, not just feature existence
 - verify behavior against the current plan, the actual requirements, and any settled project decisions that affect the change
 - verify end-to-end flow behavior where the change affects real workflows
+- do not use mocked APIs as integration evidence; integration verification must use real HTTP requests against the actual running service surface
 - for fullstack work, run Playwright coverage for major flows and review screenshots for real UI behavior and regressions
+- end-to-end coverage must use the real intended user-facing or admin-facing surfaces for the flow; if the flow cannot be exercised that way, treat the missing surface as incomplete work
 - use `frontend-design` while reviewing frontend screenshots so UI issues are challenged, not just functionally observed
 - verify screenshots do not contain demo placeholders, scaffold instructions, debug notices, or other development-only UI leakage
 - verify important failure, conflict, stale-state, negative-auth, and cross-user-isolation paths where relevant
 - verify required remediation guidance is actually visible to the user, not just computed internally
 - verify security-sensitive behavior where applicable
+- verify multi-tenant and cross-user isolation where applicable, including negative checks rather than single-actor happy paths only
+- verify file/path safety for file-bearing flows where applicable, including traversal-style negative cases
 - verify secrets are not committed, hardcoded, or leaking through logs/config/docs
--
-- verify
+- apply the cross-cutting env-file discipline during verification
+- verify error surfaces and auth-related failures are sanitized for users and operators appropriately
+- apply the cross-cutting documentation discipline during verification
 - trace the changed tests and verification back to the prompt-critical risks, not just the easiest happy paths
+- challenge integration seams and adjacent-module behavior, not just the changed module's local path
+- when frontend behavior or tooling changed materially, treat known production build breakage as blocking evidence rather than optional cleanup
 - do not treat a module as done until functional behavior, failure behavior, tests, docs, security considerations, and required runtime verification are all in place
 - call out weak evidence, missing coverage, or unresolved issues plainly
 - do not treat developer claims as enough without real verification
@@ -131,6 +199,9 @@ Use this skill when you need the detailed developer overlay guidance for one of
 - re-check frontend and backend observability, redaction, and operator visibility paths
 - run a prompt-fidelity sweep for silent requirement substitution, partially delivered hard requirements, and frontend/backend mismatch
 - run a prototype-residue sweep for hardcoded preview values, placeholder text, seeded defaults, hidden fallbacks, and computed-but-unrendered behavior
+- enforce the cross-cutting env-file discipline during hardening
+- run documentation verification against the real codebase and runtime behavior, not just document existence
+- enforce the cross-cutting documentation discipline during hardening
 - re-check prompt-critical operational obligations such as scheduled jobs, retention, backups, worker behavior, privacy/accountability logging, and admin controls
 - enter release-candidate mode: stop feature work and focus only on fixes, verification, docs, and packaging preparation
 - make sure the system is genuinely reviewable and reproducible