@ai-dev-methodologies/rlp-desk 0.2.4 → 0.3.1
- package/README.md +32 -1
- package/docs/TODO-verification-next.md +59 -0
- package/docs/architecture.md +25 -0
- package/docs/getting-started.md +11 -4
- package/docs/internal/verification-policy-gap-analysis.md +523 -0
- package/docs/internal/verification-strategy-research.md +2097 -0
- package/docs/protocol-reference.md +21 -10
- package/package.json +1 -1
- package/src/commands/rlp-desk.md +97 -4
- package/src/governance.md +219 -1
- package/src/scripts/init_ralph_desk.zsh +221 -12
@@ -45,12 +45,34 @@ Read these files in order:
 - No file creation or modification outside the project root.
 - Do not modify this prompt file or any PRD/test-spec files.
 
+## Test-First Approach (read test-spec BEFORE coding)
+1. Read test-spec "Impacted Tests" — if TODO (first iteration), skip to step 2 and fill this section during your work. Otherwise, run these FIRST to confirm they pass before your changes.
+2. Read test-spec "Required New Tests" — write these. They SHOULD FAIL initially.
+3. Implement minimum code to make all tests pass.
+4. Run ALL tests (impacted + new) to confirm nothing is broken.
+
+## Forbidden Shortcuts (Verifier will check these)
+- Do not mock external services when L2 integration test is required by test-spec.
+- Do not delete or weaken existing assertions to make tests pass.
+- Do not add test-specific logic (code that detects it is running in a test).
+- Do not skip boundary cases listed in the PRD.
+- Do not claim "code inspection" as verification — run the actual command.
+- Do not say "too simple to test" — simple code breaks. Test takes 30 seconds.
+- Do not say "I'll test after" — tests passing immediately prove nothing.
+- Do not say "already manually tested" — ad-hoc is not systematic, no record.
+- Do not say "partial check is enough" — partial proves nothing about the whole.
+- Do not say "I'm confident" — confidence is not evidence.
+- Do not say "existing code has no tests" — you are improving it, add tests.
+- Do not write code before tests — if you did, delete it and start with tests.
+
 ## Iteration rules
 - Use fresh context only; do NOT depend on prior chat history.
 - Execute exactly the work specified in the Next Iteration Contract.
 - Refresh context file with the current frontier.
 - Rewrite campaign memory in full.
 - Write evidence artifacts.
+- **After writing tests, update test-spec Criteria Mapping with actual test file paths and function names** (replace placeholder -k filters).
+- Ensure **each AC has >= 3 tests** (happy + negative + boundary). Do not just meet the total count — distribute evenly per AC.
 - **Commit all changes when the iteration is complete** (include iteration number and story ID in commit message).
 
 MANDATORY: When done with this iteration, write the following signal file:
@@ -68,6 +90,25 @@ MANDATORY: When done with this iteration, write the following signal file:
 - Do NOT signal "continue" when a US is done — always signal "verify" per US.
 - Signal "continue" ONLY when you have more work to do within the same US (e.g., a multi-step task).
 
+## Done Claim Format
+When writing done-claim JSON, ALWAYS include execution_steps — what you did, in what order, with evidence:
+\`\`\`json
+{
+"us_id": "US-NNN",
+"claims": ["AC1: ...", "AC2: ..."],
+"execution_steps": [
+{"step": "write_test", "ac_id": "AC1", "command": null, "summary": "wrote tests/test_add.py with 3 tests"},
+{"step": "verify_red", "ac_id": "AC1", "command": "pytest tests/...", "exit_code": 1, "summary": "RED: test fails as expected"},
+{"step": "implement", "ac_id": "AC1", "command": null, "summary": "created add() function"},
+{"step": "verify_green", "ac_id": "AC1", "command": "pytest tests/...", "exit_code": 0, "summary": "GREEN: 3 passed"},
+{"step": "verify_e2e", "ac_id": "AC1", "command": "python -c '...'", "exit_code": 0, "summary": "E2E output matches expected"},
+{"step": "commit", "ac_id": "AC1", "command": "git commit ...", "exit_code": 0, "summary": "committed abc1234"}
+]
+}
+\`\`\`
+This is NOT optional. Every done-claim must include the steps you took and the evidence for each.
+execution_steps MUST be a JSON array of objects (not a dict with string keys). Each object MUST have: "step", "ac_id", "command", "exit_code", "summary".
+
 ## Stop behavior
 - Single US achieved → write done-claim JSON to $DESK/memos/$SLUG-done-claim.json with the specific US, signal verify, exit
 - All US achieved → write done-claim JSON with all US, signal verify with us_id "ALL", exit
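The shape rule added in this hunk (execution_steps is an array of objects, each carrying the five named keys) can be checked mechanically. The following is a minimal sketch, not part of the package: the function name `validate_done_claim` and the wording of its messages are hypothetical, and it checks only for the presence of keys, not their values.

```python
# Hypothetical helper, not shipped with rlp-desk: checks the done-claim
# shape described in the Worker prompt added by this hunk.
REQUIRED_STEP_KEYS = ("step", "ac_id", "command", "exit_code", "summary")


def validate_done_claim(claim: dict) -> list[str]:
    """Return a list of shape problems; an empty list means the shape is OK."""
    steps = claim.get("execution_steps")
    # The prompt requires a JSON array, not a dict keyed by strings.
    if not isinstance(steps, list):
        return ["execution_steps must be a JSON array"]

    problems = []
    for i, step in enumerate(steps):
        if not isinstance(step, dict):
            problems.append(f"execution_steps[{i}] is not an object")
            continue
        # Key presence only; a null value (e.g. "command": null) is accepted.
        missing = [k for k in REQUIRED_STEP_KEYS if k not in step]
        if missing:
            problems.append(
                f"execution_steps[{i}] missing keys: {', '.join(missing)}"
            )
    return problems
```

Under this strict reading, a step that omits `exit_code` entirely is flagged, which is what the `write_test` and `implement` entries in the template's own example do. The diff does not say whether such steps should carry `"exit_code": null` instead.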
@@ -86,6 +127,17 @@ if [[ ! -f "$F" ]]; then
 cat > "$F" <<EOF
 Independent verifier for Ralph Desk: $SLUG
 
+## Iron Law (ABSOLUTE — no exceptions)
+> NO COMPLETION CLAIMS WITHOUT FRESH VERIFICATION EVIDENCE
+> "should pass", "probably works", "seems to" = automatic FAIL
+
+## Evidence Gate (MANDATORY before any verdict)
+1. IDENTIFY: What command proves this claim?
+2. RUN: Execute the FULL command (fresh, complete)
+3. READ: Full output, check exit code, count failures
+4. VERIFY: Does output confirm the claim?
+5. ONLY THEN: Issue verdict
+
 Required reads:
 - PRD: $DESK/plans/prd-$SLUG.md
 - Test Spec: $DESK/plans/test-spec-$SLUG.md
@@ -100,14 +152,24 @@ Check the iter-signal.json "us_id" field:
 - If us_id is "ALL": verify ALL acceptance criteria from the PRD (final full verify).
 - If us_id is absent or null: verify all criteria in the done-claim (legacy/batch mode).
 
-Process
+## Verification Process
 1. Read PRD acceptance criteria (scoped to us_id if present)
 2. Read done claim
 3. Identify scope: run \`git diff --name-only\` to find changed files, then read those files + related imports only
-4. Run
-5.
-6. Run
-7.
+4. **Scope Lock check**: (a) Read the Next Iteration Contract from campaign memory to identify the contracted US. (b) Run \`git diff --name-only\` to list all changed files. (c) For each changed file, verify it is plausibly related to the contracted US's acceptance criteria. (d) Flag files that appear unrelated. (e) Shared infrastructure (types, configs, common utilities) and dependency files are permitted if the AC implies them.
+5. **Layer Enforcement**: check test-spec L1/L2/L3/L4 sections. ANY section with TODO or blank = FAIL (IL-3).
+6. Run fresh verification: execute ALL commands from test-spec verification layers (L1, L2, L3, L4 as applicable)
+7. Check each criterion against fresh evidence (only for the scoped US, or all if us_id=ALL)
+8. Run smoke test if defined in PRD
+9. **Test Sufficiency (IL-4)**: count test functions exercising each AC. Count < 3 per AC = FAIL.
+Check diversity: at least 2 of 3 categories (happy, negative, boundary) per AC.
+10. **Anti-Gaming Detection**:
+- Assertion integrity: compare assertion count/strength via \`git diff HEAD~1\` — assertions not deleted or weakened
+- Test-specific logic: no environment-detection patterns
+- "Code inspection" claims: Worker must run actual commands
+- Tautological tests: expected values that mirror implementation logic
+11. **Reproducibility check**: verify lock file committed, clean install succeeds, security scan passes, env vars documented (per test-spec Reproducibility Gate). Skip if test-spec says "N/A."
+12. Write verdict JSON to: $DESK/memos/$SLUG-verify-verdict.json
 
 Verdict JSON:
 {
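Step 9 of the Verifier process above (at least 3 tests per AC, drawn from at least 2 of the 3 categories) reduces to a small counting rule. The sketch below illustrates that rule only. The function name `check_sufficiency` and the input format are assumptions: the diff does not say how the Verifier assigns a test function to an AC or to a category, so here the labels are taken as given.

```python
# Hypothetical illustration of the IL-4 counting rule; the classification
# of each test as happy/negative/boundary is assumed to be done already.
CATEGORIES = {"happy", "negative", "boundary"}


def check_sufficiency(tests_by_ac: dict[str, list[str]]) -> dict[str, str]:
    """Map each AC id to "pass" or "fail".

    tests_by_ac maps an AC id to one category label per test function
    that exercises that AC.
    """
    results = {}
    for ac_id, labels in tests_by_ac.items():
        enough_tests = len(labels) >= 3                      # count < 3 = FAIL
        diverse = len(set(labels) & CATEGORIES) >= 2         # 2 of 3 categories
        results[ac_id] = "pass" if enough_tests and diverse else "fail"
    return results
```

For example, an AC with three happy-path tests meets the count but fails the diversity check, so it is reported as `"fail"`.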
@@ -118,6 +180,14 @@ Verdict JSON:
 "criteria_results": [{"criterion":"...","met":true/false,"evidence":"..."}],
 "missing_evidence": [],
 "issues": [{"id":"...","severity":"critical|major|minor","description":"...","fix_hint":"(suggestion, non-authoritative)"}],
+"reasoning": [
+{"check": "IL-1 Evidence Gate", "decision": "pass|fail", "basis": "what command was run, what output confirmed the decision"},
+{"check": "Layer Enforcement", "decision": "pass|fail", "basis": "which layers checked, any TODO found"},
+{"check": "Test Sufficiency", "decision": "pass|fail", "basis": "test count per AC, category coverage"},
+{"check": "Anti-Gaming", "decision": "pass|fail", "basis": "what was checked, any suspicious patterns"}
+],
+"layer_status": {"L1":"pass|fail|todo|na","L2":"pass|fail|todo|na","L3":"pass|fail|todo|na","L4":"pass|fail|todo|na"},
+"test_quality": {"test_count":0,"ac_count":0,"sufficiency":"pass|fail","anti_patterns_found":[]},
 "recommended_state_transition": "complete|continue|blocked",
 "next_iteration_contract": "...",
 "evidence_paths": []
@@ -129,6 +199,7 @@ Rules:
 - Campaign Memory is for orientation only — do NOT use it as source of truth for AC verification.
 - Deterministic checks (type hints, linting, security) delegate to test-spec tools; focus on AC verification + semantic review + smoke test.
 - Do NOT modify code or write sentinel files.
+- If Worker claims "inspection" or "review" for an AC that requires an automated command, verdict = FAIL.
 EOF
 echo " + $F"
 else echo " · $F"; fi
@@ -199,16 +270,29 @@ $OBJECTIVE
 ### US-001: [Title]
 - **Priority**: P0
 - **Size**: S|M|L
+- **Type**: code|visual|content|integration|infra
+- **Risk**: LOW|MEDIUM|HIGH|CRITICAL (governance §1c)
 - **Depends on**: []
-- **Acceptance Criteria
--
+- **Acceptance Criteria** (Given/When/Then — domain language only):
+- AC1:
+- Given: [precondition in domain language]
+- When: [action in domain language]
+- Then: [expected outcome with quantitative criteria]
+- AC2:
+- Given: [precondition]
+- When: [action]
+- Then: [expected outcome with quantitative criteria]
+- **Boundary Cases**: [edge cases — empty input, max values, error conditions, concurrent access]
+- **Verification Layers**: [Fill per Risk level — LOW: L1+L3, MEDIUM: L1+L2(if ext deps)+L3, HIGH: L1+L2+L3+L4, CRITICAL: L1+L2+L3+L4+mutation (governance §1c)]
 - **Status**: not started
 
 ## Non-Goals
 ## Technical Constraints
 ## Done When
-- All acceptance criteria pass
--
+- All acceptance criteria pass with quantitative evidence
+- All boundary cases covered
+- All required verification layers executed (no TODO remaining)
+- Independent verifier confirms via Evidence Gate (governance §1b)
 EOF
 echo " + $F"
 else echo " · $F"; fi
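The Risk-to-layer rule in the new "Verification Layers" field of the PRD template can be read as a lookup. The sketch below encodes that line as written. The function name `required_layers` and the `has_external_deps` flag are hypothetical, and the sketch ignores the "N/A" escape hatches that the test-spec template allows for L2 and L4.

```python
# Hypothetical encoding of the PRD template line:
#   LOW: L1+L3, MEDIUM: L1+L2(if ext deps)+L3,
#   HIGH: L1+L2+L3+L4, CRITICAL: L1+L2+L3+L4+mutation
def required_layers(risk: str, has_external_deps: bool = False) -> list[str]:
    """Return the verification layers implied by a user story's Risk level."""
    level = risk.upper()
    if level == "LOW":
        return ["L1", "L3"]
    if level == "MEDIUM":
        # L2 applies to MEDIUM only when external dependencies exist.
        return ["L1", "L2", "L3"] if has_external_deps else ["L1", "L3"]
    if level == "HIGH":
        return ["L1", "L2", "L3", "L4"]
    if level == "CRITICAL":
        return ["L1", "L2", "L3", "L4", "mutation"]
    raise ValueError(f"unknown risk level: {risk!r}")
```

A Verifier-side check could compare this list against the `layer_status` object in the verdict JSON, though the diff does not describe such a comparison.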
@@ -219,6 +303,12 @@ if [[ ! -f "$F" ]]; then
 cat > "$F" <<EOF
 # Test Specification: $SLUG
 
+## Iron Law Reference
+> IL-3: NO PASS WITH TODO IN ANY REQUIRED VERIFICATION LAYER
+> IL-4: NO PASS WITHOUT TEST COUNT >= AC COUNT x 3
+
+---
+
 ## Verification Commands
 ### Build
 \`\`\`bash
@@ -233,10 +323,129 @@ if [[ ! -f "$F" ]]; then
 # TODO
 \`\`\`
 
+---
+
+## Verification Context (fill BEFORE implementation)
+
+### Target Behavior
+What behavior does this project change or introduce?
+- TODO
+
+### Impacted Tests
+Existing tests that may break due to this change:
+- TODO (acceptable at init; Worker fills during first iteration)
+
+### Required New Tests
+Tests that MUST be written (minimum 3 per AC: happy + negative + boundary):
+- TODO
+
+### Forbidden Shortcuts (see Worker prompt for full list)
+- Do not mock external services when L2 integration test is required
+- Do not delete or weaken existing assertions to make tests pass
+- Do not add test-specific logic (if __name__ == '__test__' patterns)
+- Do not skip boundary cases listed in the PRD
+- Do not claim "code inspection" as verification — run the actual command
+- Do not say "too simple to test" — simple code breaks
+- Do not say "I'll test after" — tests passing immediately prove nothing
+- Do not say "already manually tested" — ad-hoc is not systematic
+- Do not say "partial check is enough" — partial proves nothing
+- Do not say "I'm confident" — confidence is not evidence
+- Do not say "existing code has no tests" — you are improving it, add tests
+- Do not write code before tests — delete it and start with tests
+
+### Pass/Fail Evidence Format
+- Command output with exit code 0
+- Quantitative result matching expected value
+- Screenshot comparison (for visual tasks)
+
+---
+
+## Verification Layers (ALL required sections — TODO in required layer = Verifier FAIL)
+
+### L1: Unit Test (REQUIRED)
+\`\`\`bash
+# TODO — unit test command (e.g., pytest, jest, go test)
+\`\`\`
+
+### L2: Integration (required if external services exist, otherwise "N/A — reason")
+\`\`\`bash
+# TODO — integration test command, or write: N/A — no external services (pure computation/transformation)
+\`\`\`
+
+### L3: E2E Simulation (REQUIRED)
+Known input → full pipeline → quantitative output comparison.
+Must cover ALL AC types: happy path + boundary + error path.
+- **Happy path input**: TODO (specific test data)
+- **Happy path expected output**: TODO (quantitative value)
+- **Happy path command**:
+\`\`\`bash
+# TODO — E2E happy path command
+\`\`\`
+- **Error path input**: TODO (invalid/boundary input that triggers error)
+- **Error path expected**: TODO (error type + non-zero exit code)
+- **Error path command**:
+\`\`\`bash
+# TODO — E2E error path command (expected exit ≠ 0)
+\`\`\`
+
+### L4: Deploy Verification (required if deploying, otherwise "N/A — reason")
+\`\`\`bash
+# TODO — deploy verification command, or write: N/A — no deployment (library/tool, local-only change)
+\`\`\`
+
+---
+
+## Mutation Testing Gate (CRITICAL risk only)
+- Required: only for CRITICAL risk classification (governance §1c)
+- Tool: TODO (e.g., mutmut, Stryker, go-mutesting) or "N/A — not CRITICAL risk"
+- Target: >= 60% mutation score on core business logic (project default; override in PRD if justified)
+- Scope: core business logic files (not config/tests/docs)
+- Command:
+\`\`\`bash
+# TODO — mutation testing command, or write: N/A — not CRITICAL risk
+\`\`\`
+
+---
+
+## Test Quality Checklist (Verifier checks these)
+- [ ] Tests verify behavior, not implementation details
+- [ ] Each test has meaningful assertions (not just "no error thrown")
+- [ ] Boundary cases covered (empty, max, zero, null, concurrent)
+- [ ] No tautological tests (expected value copied from implementation)
+- [ ] Mock usage limited to external boundaries only
+- [ ] No test-specific logic in production code
+- [ ] Each AC has >= 3 tests (happy + negative + boundary) per IL-4
+
+## Traceability Matrix (Worker fills during implementation)
+
+| US | AC | Test File :: Function | Layer | Evidence | Status |
+|----|----|----------------------|-------|----------|--------|
+| US-001 | AC1 | TODO | L1 | TODO | pending |
+
+---
+
+## Code Quality Gates (defaults — override in PRD with justification)
+- **Code duplication**: <= 3% (project-appropriate tool, e.g., jscpd, pylint, sonar)
+- **Mock ratio**: mock-based assertions <= 30% of total assertions
+- **Cyclomatic complexity**: <= 10 per function
+- **Function length**: <= 50 lines per function
+- **File length**: <= 800 lines per file
+
+---
+
+## Reproducibility Gate
+- [ ] Lock file exists and committed (package-lock.json, poetry.lock, go.sum, etc.) or "N/A — no external dependencies"
+- [ ] Clean install succeeds (npm ci, pip install, etc.) or "N/A — no external dependencies"
+- [ ] Security scan passes (or known vulnerabilities documented and acknowledged in PRD) or "N/A — no dependencies"
+- [ ] Environment variables documented (.env.example or equivalent) or "N/A — no env vars"
+
+---
+
 ## Criteria → Verification Mapping
-
-
-
+
+| US | AC | Layer | Method | Command | Expected Output | Pass Criteria |
+|----|----|-------|--------|---------|-----------------|---------------|
+| US-001 | AC1 | L1 | TODO | TODO | TODO | TODO |
 EOF
 echo " + $F"
 else echo " · $F"; fi