@ai-dev-methodologies/rlp-desk 0.2.4 → 0.3.1
- package/README.md +32 -1
- package/docs/TODO-verification-next.md +59 -0
- package/docs/architecture.md +25 -0
- package/docs/getting-started.md +11 -4
- package/docs/internal/verification-policy-gap-analysis.md +523 -0
- package/docs/internal/verification-strategy-research.md +2097 -0
- package/docs/protocol-reference.md +21 -10
- package/package.json +1 -1
- package/src/commands/rlp-desk.md +97 -4
- package/src/governance.md +219 -1
- package/src/scripts/init_ralph_desk.zsh +221 -12
package/README.md
CHANGED
|
@@ -40,9 +40,11 @@ curl -sSL https://raw.githubusercontent.com/ai-dev-methodologies/rlp-desk/main/i
|
|
|
40
40
|
|
|
41
41
|
You'll be asked to confirm each item:
|
|
42
42
|
- **Slug** — project identifier
|
|
43
|
-
- **User Stories** — discrete, testable units
|
|
43
|
+
- **User Stories** — discrete, testable units with Given/When/Then acceptance criteria
|
|
44
|
+
- **Task Type & Risk Level** — code/visual/content/integration/infra × LOW/MEDIUM/HIGH/CRITICAL
|
|
44
45
|
- **Iteration Unit** — one story per iteration (incremental) or all at once (fast)
|
|
45
46
|
- **Verification Commands** — how to check the work
|
|
47
|
+
- **Ambiguity Gate** — AC quality scoring (IL-2, 0-12 scale, blocks init if < 6)
|
|
46
48
|
- **Models** — which Claude model for Worker/Verifier
|
|
47
49
|
|
|
48
50
|
### 3. Run
|
|
@@ -97,12 +99,38 @@ for iteration in 1..max_iter:
|
|
|
97
99
|
8. Update status, report to user, continue or stop
|
|
98
100
|
```
|
|
99
101
|
|
|
102
|
+
### Verification Policy (v0.3.0)
|
|
103
|
+
|
|
104
|
+
RLP Desk enforces a comprehensive verification policy defined in `governance.md`:
|
|
105
|
+
|
|
106
|
+
**Iron Laws (§1a)** — 4 absolute rules that cannot be violated:
|
|
107
|
+
- **IL-1**: No completion claims without fresh verification evidence
|
|
108
|
+
- **IL-2**: No init without AC quality score ≥ 6 (Ambiguity Gate)
|
|
109
|
+
- **IL-3**: No pass with TODO in any required verification layer
|
|
110
|
+
- **IL-4**: No pass without test count ≥ AC count × 3
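The four Iron Laws above read naturally as boolean gates. The following is a minimal sketch of that reading, not the package's implementation: only the thresholds (score of at least 6 on the 0-12 scale, three tests per AC) come from the README text, while `CampaignFacts`, its field names, and both function names are hypothetical.

```python
from dataclasses import dataclass

# Thresholds quoted from the README; every name below is a hypothetical illustration.
MIN_AC_QUALITY_SCORE = 6  # IL-2: 0-12 scale, init is blocked below 6
TESTS_PER_AC = 3          # IL-4: test count must be >= AC count x 3


@dataclass
class CampaignFacts:
    """Hypothetical container for the facts the Iron Laws refer to."""
    ac_quality_score: int         # Ambiguity Gate score, 0-12
    ac_count: int                 # number of acceptance criteria
    test_count: int               # number of tests written
    has_fresh_evidence: bool      # IL-1: verification evidence from this run
    todo_in_required_layer: bool  # IL-3: a required layer still says TODO


def init_allowed(facts: CampaignFacts) -> bool:
    """IL-2: init may proceed only when the AC quality score is high enough."""
    return facts.ac_quality_score >= MIN_AC_QUALITY_SCORE


def iron_law_violations(facts: CampaignFacts) -> list[str]:
    """Return the identifiers of every Iron Law the given facts would violate."""
    violations = []
    if not facts.has_fresh_evidence:
        violations.append("IL-1")
    if not init_allowed(facts):
        violations.append("IL-2")
    if facts.todo_in_required_layer:
        violations.append("IL-3")
    if facts.test_count < facts.ac_count * TESTS_PER_AC:
        violations.append("IL-4")
    return violations
```

For example, two acceptance criteria backed by five tests would fail IL-4, since 5 is less than 2 × 3.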
|
|
111
|
+
|
|
112
|
+
**Evidence Gate (§1b)** — 5-step protocol: IDENTIFY → RUN → READ → VERIFY → ONLY THEN claim
|
|
113
|
+
|
|
114
|
+
**Risk Classification (§1c)** — Proportional verification layers per risk level:
|
|
115
|
+
|
|
116
|
+
| Risk | Required Layers |
|
|
117
|
+
|------|----------------|
|
|
118
|
+
| LOW | L1 (Unit) + L3 (E2E) |
|
|
119
|
+
| MEDIUM | L1 + L2 (Integration) + L3 |
|
|
120
|
+
| HIGH | L1 + L2 + L3 + L4 (Deploy) |
|
|
121
|
+
| CRITICAL | L1 + L2 + L3 + L4 + mutation testing |
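The risk table above maps directly onto a lookup. This sketch assumes the layers can be named by the short labels used in the table; `REQUIRED_LAYERS` and `missing_layers` are hypothetical names, and the `"mutation"` label is shorthand for the mutation testing requirement rather than an identifier taken from `governance.md`.

```python
# Layer requirements copied from the Risk Classification table (governance section 1c).
REQUIRED_LAYERS = {
    "LOW": ["L1", "L3"],
    "MEDIUM": ["L1", "L2", "L3"],
    "HIGH": ["L1", "L2", "L3", "L4"],
    "CRITICAL": ["L1", "L2", "L3", "L4", "mutation"],
}


def missing_layers(risk: str, completed: set[str]) -> list[str]:
    """Return the required layers for a risk level that are not yet completed."""
    required = REQUIRED_LAYERS[risk.upper()]
    return [layer for layer in required if layer not in completed]
```

A HIGH-risk story with only L1 and L3 done would still owe L2 and L4 under this reading.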
|
|
122
|
+
|
|
123
|
+
**Execution Traceability (§1f)** — Always-on, not flag-gated:
|
|
124
|
+
- Worker records `execution_steps` in done-claim.json (what was done, in what order, with evidence)
|
|
125
|
+
- Verifier records `reasoning` in verify-verdict.json (why each judgment was made)
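Because traceability is described as always-on, a reviewer could check both artifacts for the two fields named above. The sketch below is an assumption-heavy illustration: only the file names and the keys `execution_steps` and `reasoning` come from the README, the internal shape of each field is not specified there, and `traceability_gaps` is a hypothetical helper.

```python
import json


def traceability_gaps(done_claim_text: str, verdict_text: str) -> list[str]:
    """Report which traceability fields are missing or empty.

    Takes the raw JSON text of done-claim.json and verify-verdict.json.
    Only the presence of the two keys is checked, because their inner
    structure is not documented in the README.
    """
    done_claim = json.loads(done_claim_text)
    verdict = json.loads(verdict_text)
    gaps = []
    if not done_claim.get("execution_steps"):
        gaps.append("done-claim.json: execution_steps missing or empty")
    if not verdict.get("reasoning"):
        gaps.append("verify-verdict.json: reasoning missing or empty")
    return gaps
```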
|
|
126
|
+
|
|
100
127
|
### Circuit Breakers
|
|
101
128
|
|
|
102
129
|
| Condition | Action |
|
|
103
130
|
|-----------|--------|
|
|
104
131
|
| Context unchanged for 3 iterations | BLOCKED |
|
|
105
132
|
| Same error repeated twice | Upgrade model, retry once, then BLOCKED |
|
|
133
|
+
| 3 consecutive failures | Architecture Escalation (§7¾) → report to user |
|
|
106
134
|
| Max iterations reached | TIMEOUT |
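One way to read the Circuit Breakers table is as a single decision function evaluated each iteration. This is a sketch under stated assumptions: the README does not say in which order the conditions are checked, so the precedence here is a guess, and the `state` keys plus the labels `UPGRADE_AND_RETRY`, `ARCHITECTURE_ESCALATION`, and `CONTINUE` are hypothetical (only `BLOCKED` and `TIMEOUT` appear in the table).

```python
def circuit_breaker_action(state: dict) -> str:
    """Pick an action from the Circuit Breakers table.

    Expected keys (all hypothetical names):
      unchanged_context_iters, same_error_repeats, upgrade_retry_used,
      consecutive_failures, iteration, max_iter
    """
    # Context unchanged for 3 iterations -> BLOCKED
    if state["unchanged_context_iters"] >= 3:
        return "BLOCKED"
    # Same error repeated twice -> upgrade model and retry once, then BLOCKED
    if state["same_error_repeats"] >= 2:
        return "BLOCKED" if state["upgrade_retry_used"] else "UPGRADE_AND_RETRY"
    # 3 consecutive failures -> Architecture Escalation, reported to the user
    if state["consecutive_failures"] >= 3:
        return "ARCHITECTURE_ESCALATION"
    # Max iterations reached -> TIMEOUT
    if state["iteration"] >= state["max_iter"]:
        return "TIMEOUT"
    return "CONTINUE"
```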
|
|
107
135
|
|
|
108
136
|
### Model Routing
|
|
@@ -140,6 +168,8 @@ for iteration in 1..max_iter:
|
|
|
140
168
|
| `--codex-reasoning low\|medium\|high` | high | Reasoning effort for Codex |
|
|
141
169
|
| `--verify-mode per-us\|batch` | per-us | Verification strategy (see below) |
|
|
142
170
|
| `--verify-consensus` | off | Cross-engine consensus verification (see below) |
|
|
171
|
+
| `--debug` | off | Debug logging to `logs/<slug>/debug.log` |
|
|
172
|
+
| `--with-self-verification` | off | Campaign-level post-loop analysis report |
|
|
143
173
|
|
|
144
174
|
## Execution Modes
|
|
145
175
|
|
|
@@ -334,6 +364,7 @@ mkdir my-calc && cd my-calc
|
|
|
334
364
|
- [Architecture](docs/architecture.md) — Design philosophy, Agent() and tmux execution modes
|
|
335
365
|
- [Getting Started](docs/getting-started.md) — Step-by-step tutorial with the calculator example
|
|
336
366
|
- [Protocol Reference](docs/protocol-reference.md) — Full protocol specification
|
|
367
|
+
- [Future Plans](docs/TODO-verification-next.md) — P3 items and upcoming features
|
|
337
368
|
|
|
338
369
|
## Contributing
|
|
339
370
|
|
|
@@ -0,0 +1,59 @@
|
|
|
1
|
+
# Verification Policy — Next Iterations
|
|
2
|
+
|
|
3
|
+
> Items scoped out of the feature/verification-policy branch.
|
|
4
|
+
> P0-P2 (governance + templates) are complete. These items are planned for subsequent iterations.
|
|
5
|
+
|
|
6
|
+
---
|
|
7
|
+
|
|
8
|
+
## --with-self-verification Flag (Campaign-Level Analysis)
|
|
9
|
+
|
|
10
|
+
Post-campaign analysis that reads all iteration artifacts and generates a versioned report.
|
|
11
|
+
|
|
12
|
+
### Concept
|
|
13
|
+
- Separate from `--debug` (which logs Leader decisions)
|
|
14
|
+
- After COMPLETE/BLOCKED/TIMEOUT, Leader analyzes all done-claims and verdicts
|
|
15
|
+
- Generates `logs/<slug>/self-verification-report-NNN.md` (versioned per run)
|
|
16
|
+
- Cumulative data stored in `logs/<slug>/self-verification-data.json`
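The "versioned per run" naming could be derived from the reports already present in the log directory. This sketch assumes `NNN` is a zero-padded three-digit counter starting at 001, which the document implies but does not state; `next_report_name` and `next_report_path` are hypothetical helpers.

```python
import re
from pathlib import Path

# Matches the file name pattern given in the TODO document.
REPORT_PATTERN = re.compile(r"^self-verification-report-(\d+)\.md$")


def next_report_name(existing_names: list[str]) -> str:
    """Return the next report file name given the names already in the log dir."""
    numbers = [
        int(match.group(1))
        for name in existing_names
        if (match := REPORT_PATTERN.match(name))
    ]
    # Assumption: numbering starts at 001 and is zero-padded to three digits.
    return f"self-verification-report-{max(numbers, default=0) + 1:03d}.md"


def next_report_path(log_dir: Path) -> Path:
    """Resolve the next report path inside logs/<slug>/ (directory may not exist yet)."""
    names = [entry.name for entry in log_dir.iterdir()] if log_dir.is_dir() else []
    return log_dir / next_report_name(names)
```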
|
|
17
|
+
|
|
18
|
+
### Report Sections (9-section template defined in rlp-desk.md step 9)
|
|
19
|
+
1. Automated Validation Summary
|
|
20
|
+
2. Failure Deep Dive
|
|
21
|
+
3. Worker Process Quality (§1f audit)
|
|
22
|
+
4. Verifier Judgment Quality (§1f audit)
|
|
23
|
+
5. AC Lifecycle
|
|
24
|
+
6. Test-Spec Adherence
|
|
25
|
+
7. Patterns: Strengths & Weaknesses
|
|
26
|
+
8. Recommendations for Next Cycle (Brainstorm / PRD / Test-Spec)
|
|
27
|
+
9. Blind Spots
|
|
28
|
+
|
|
29
|
+
### Open Design Items
|
|
30
|
+
- [ ] Automated report generation (currently manual Leader analysis)
|
|
31
|
+
- [ ] Cross-campaign trend analysis (compare report-001 vs report-002)
|
|
32
|
+
- [ ] Integration with brainstorm (Leader reads previous report at brainstorm start)
|
|
33
|
+
|
|
34
|
+
---
|
|
35
|
+
|
|
36
|
+
## P3: External Tool Integration + Domain Specialization
|
|
37
|
+
|
|
38
|
+
P0-P2 (governance policies + templates) form the foundation. P3 requires external dependencies and is planned for separate feature branches.
|
|
39
|
+
|
|
40
|
+
### P3-1: Domain Rule Packs
|
|
41
|
+
- **Purpose**: Domain-specific verification rule sets (finance, healthcare, security)
|
|
42
|
+
- **Why separate**: Different in nature from universal governance. Requires plugin architecture.
|
|
43
|
+
- [ ] Plugin loading mechanism design
|
|
44
|
+
- [ ] Finance domain rule pack (first)
|
|
45
|
+
- [ ] Rule pack authoring guide
|
|
46
|
+
|
|
47
|
+
### P3-2: Playwright Agents
|
|
48
|
+
- **Purpose**: Automated verification for visual/content task types (screenshot comparison, accessibility checks)
|
|
49
|
+
- **Why separate**: Requires Playwright installation + browser binaries + CI environment setup.
|
|
50
|
+
- [ ] Playwright integration wrapper
|
|
51
|
+
- [ ] Screenshot comparison verification logic
|
|
52
|
+
- [ ] CI environment guide
|
|
53
|
+
|
|
54
|
+
### P3-3: Mutahunter / Spec Kit
|
|
55
|
+
- **Purpose**: Automated mutation testing execution for CRITICAL risk
|
|
56
|
+
- **Why separate**: Requires language-specific tool wrappers (mutmut, Stryker, go-mutesting). Governance defines the Gate only.
|
|
57
|
+
- [ ] Language-specific mutation tool wrappers
|
|
58
|
+
- [ ] Mutation score collection + verdict integration
|
|
59
|
+
- [ ] Spec Kit: test-spec auto-generation helper
|
package/docs/architecture.md
CHANGED
|
@@ -48,6 +48,31 @@ RLP Desk supports two modes for running the Leader loop. Both honor the same gov
|
|
|
48
48
|
| **Agent() — "Smart mode"** (default) | LLM (current session) | Dynamic — Leader reasons about which model to use each iteration | Active Claude Code session | Interactive development, complex routing decisions |
|
|
49
49
|
| **Tmux — "Lean mode"** | Shell script (`run_ralph_desk.zsh`) | Static — set via `WORKER_MODEL`/`VERIFIER_MODEL` env vars | None (runs detached) | Long campaigns, CI, observability, zero-token orchestration |
|
|
50
50
|
|
|
51
|
+
### Verification Policy Layer
|
|
52
|
+
|
|
53
|
+
Both modes enforce the same verification policy (governance §1a-§1f):
|
|
54
|
+
|
|
55
|
+
```
|
|
56
|
+
┌─────────────────────────────────┐
|
|
57
|
+
│ Governance (§1a-§1f) │
|
|
58
|
+
│ Iron Laws · Evidence Gate │
|
|
59
|
+
│ Risk Classification · Layers │
|
|
60
|
+
│ Checkpoints · Traceability │
|
|
61
|
+
└──────────┬──────────────────────┘
|
|
62
|
+
│ enforced by
|
|
63
|
+
┌────────────────┼────────────────┐
|
|
64
|
+
▼ ▼ ▼
|
|
65
|
+
Worker Template Verifier Template Leader Loop
|
|
66
|
+
(Test-First, (12-step process, (Contract review,
|
|
67
|
+
12 Shortcuts, 5 reasoning Checkpoints,
|
|
68
|
+
execution_steps) categories) Escalation)
|
|
69
|
+
```
|
|
70
|
+
|
|
71
|
+
Key design decisions:
|
|
72
|
+
- **execution_steps** (Worker) and **reasoning** (Verifier) are always-on (§1f), not gated by flags
|
|
73
|
+
- **`--with-self-verification`** adds post-campaign analysis only — does not change loop behavior
|
|
74
|
+
- **Risk-proportional layers**: LOW gets L1+L3, CRITICAL gets L1+L2+L3+L4+mutation
|
|
75
|
+
|
|
51
76
|
**Agent() mode** is synchronous and simple: each `Agent()` call blocks until the subprocess finishes, then the Leader reads the filesystem. No polling, no signal files, no tmux.
|
|
52
77
|
|
|
53
78
|
**Tmux mode** trades dynamic routing for visibility and independence. The shell Leader writes prompts to files, sends short trigger commands via `tmux send-keys`, and polls structured JSON signal files (`iter-signal.json`, `verify-verdict.json`) for control flow. It uses proven tmux patterns — write-then-notify, pane ID stability, copy-mode guards, heartbeat monitoring — for reliable, race-free orchestration.
|
package/docs/getting-started.md
CHANGED
|
@@ -45,10 +45,12 @@ The brainstorm phase interactively determines:
|
|
|
45
45
|
|------|---------|
|
|
46
46
|
| **Slug** | `loop-test` |
|
|
47
47
|
| **Objective** | Implement calc.py + test_calc.py |
|
|
48
|
-
| **User Stories** | US-001: calculator functions
|
|
48
|
+
| **User Stories** | US-001: calculator functions (Given/When/Then AC format) |
|
|
49
|
+
| **Task Type & Risk** | code, LOW |
|
|
49
50
|
| **Iteration Unit** | One user story per iteration |
|
|
50
51
|
| **Verification** | `python3 -m pytest test_calc.py -v` |
|
|
51
52
|
| **Models** | Worker: sonnet, Verifier: opus |
|
|
53
|
+
| **Ambiguity Gate (IL-2)** | AC quality score ≥ 6 required to proceed |
|
|
52
54
|
| **Max Iterations** | 10 |
|
|
53
55
|
|
|
54
56
|
On approval, brainstorm offers to run `init` automatically.
|
|
@@ -81,9 +83,11 @@ This creates the scaffold:
|
|
|
81
83
|
Edit `.claude/ralph-desk/plans/prd-loop-test.md` to define your user stories and acceptance criteria. See [`examples/calculator/`](../examples/calculator/.claude/ralph-desk/plans/prd-loop-test.md) for a complete example.
|
|
82
84
|
|
|
83
85
|
Key sections:
|
|
84
|
-
- **User Stories** with
|
|
86
|
+
- **User Stories** with Given/When/Then acceptance criteria, Task Type, and Risk Level
|
|
87
|
+
- **Boundary Cases** for each US
|
|
88
|
+
- **Verification Layers** (L1-L4 based on risk level per governance §1c)
|
|
85
89
|
- **Technical Constraints** (e.g., "Python 3 + pytest only")
|
|
86
|
-
- **Done When** conditions
|
|
90
|
+
- **Done When** conditions (must reference Evidence Gate §1b)
|
|
87
91
|
|
|
88
92
|
## Step 6: Define the Test Spec
|
|
89
93
|
|
|
@@ -147,7 +151,10 @@ If you want to run the loop again:
|
|
|
147
151
|
## Tips
|
|
148
152
|
|
|
149
153
|
- **Start small**: One or two user stories for your first loop
|
|
150
|
-
- **
|
|
154
|
+
- **Use Given/When/Then**: "Given 10 and 5, When add, Then 15" — not "function works well"
|
|
155
|
+
- **Set risk levels**: LOW for docs/config, MEDIUM for features, HIGH for deploys, CRITICAL for security/finance
|
|
151
156
|
- **Include verification commands**: The verifier needs concrete commands to run
|
|
152
157
|
- **One story per iteration**: Each worker should do one bounded action
|
|
153
158
|
- **Check logs when stuck**: `logs/<slug>/iter-NNN.worker-prompt.md` shows exactly what the worker received
|
|
159
|
+
- **Review done-claims**: Worker's `execution_steps` show exactly what was done and in what order
|
|
160
|
+
- **Review verdict reasoning**: Verifier's `reasoning` shows why each judgment was made
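The Given/When/Then tip above translates almost line for line into a pytest test. This is a minimal sketch: `add` is defined inline as a stand-in so the block runs on its own, whereas in the tutorial it would presumably be imported from `calc.py`; the function and test names are assumptions.

```python
def add(a, b):
    """Stand-in for the function under test (assumed to live in calc.py)."""
    return a + b


def test_add_given_10_and_5_then_15():
    # Given 10 and 5
    a, b = 10, 5
    # When add
    result = add(a, b)
    # Then 15
    assert result == 15
```

Keeping the three comments in the test body makes it easy for a reviewer to map each test back to its acceptance criterion.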
|