loki-mode 5.1.3 → 5.2.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/SKILL.md +26 -2
- package/VERSION +1 -1
- package/bin/postinstall.js +1 -1
- package/docs/ACKNOWLEDGEMENTS.md +63 -1
- package/docs/COMPARISON.md +123 -1
- package/package.json +1 -1
- package/references/core-workflow.md +14 -1
- package/references/memory-system.md +1097 -1
- package/skills/model-selection.md +231 -0
- package/skills/quality-gates.md +283 -0
- package/skills/troubleshooting.md +634 -1
@@ -176,3 +176,234 @@ Task(model="haiku", description="Fix import in utils.py", prompt="...")
# Complex tasks -> Supervisor orchestration
Task(description="Implement user authentication with OAuth", prompt="...")
```

## Tiered Agent Escalation Triggers

*Inspired by [oh-my-claudecode](https://github.com/Yeachan-Heo/oh-my-claudecode) tiered architecture.*

**Tier Mapping:** LOW = fast = Haiku, MEDIUM = development = Sonnet, HIGH = planning = Opus. This section uses LOW/MEDIUM/HIGH for clarity in the escalation context.

The triggers below are explicit signals that determine when to escalate from one tier to another. They enable automatic tier selection based on task characteristics and runtime conditions.

### Tier Definitions

| Tier | Model | Cost | Speed | Use Case |
|------|-------|------|-------|----------|
| **LOW** | Haiku | Lowest | Fastest | Simple, repetitive, well-defined tasks |
| **MEDIUM** | Sonnet | Medium | Balanced | Complex implementation, testing, debugging |
| **HIGH** | Opus | Highest | Slowest | Critical decisions, architecture, security |
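
The `task.escalate(...)` calls in the examples below assume a task object that knows this tier ordering. A minimal sketch of what such a helper could look like (the `Task` fields and method here are illustrative assumptions, not the published API):

```python
# Hypothetical sketch: tier ordering plus an escalate() helper.
# Field and method names are assumptions for illustration.
from dataclasses import dataclass, field

TIER_ORDER = ["haiku", "sonnet", "opus"]  # LOW -> MEDIUM -> HIGH

@dataclass
class Task:
    id: str
    description: str
    tier: str = "haiku"
    error_count: int = 0
    escalation_count: int = 0
    history: list = field(default_factory=list)

    def escalate(self, to: str, reason: str) -> None:
        # Only move upward through the tier order
        if TIER_ORDER.index(to) > TIER_ORDER.index(self.tier):
            self.history.append((self.tier, to, reason))
            self.tier = to
            self.escalation_count += 1
```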

### Automatic Escalation Triggers

#### LOW -> MEDIUM Escalation

Escalate from Haiku to Sonnet when:

| Trigger | Threshold | Rationale |
|---------|-----------|-----------|
| Error count | > 2 consecutive failures | Haiku struggling with complexity |
| File count | > 5 files modified | Cross-cutting changes need context |
| Lines changed | > 200 lines | Large changes need careful review |
| Test failures | > 3 failing tests | Need deeper debugging |
| Dependency changes | Any package.json/requirements.txt | Dependency resolution is complex |
| Retry attempts | > 1 retry on same task | Task too complex for current tier |

```python
# Example: Auto-escalate on repeated failures
if task.error_count > 2:
    task.escalate(to="sonnet", reason="Multiple failures indicate complexity")
```
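
The single-trigger example above generalizes: any breached threshold from the table should escalate. A sketch folding the LOW -> MEDIUM triggers into one predicate (the task attributes are assumptions mirroring the trigger table):

```python
# Sketch: fold the LOW -> MEDIUM triggers from the table into one check.
# The task attributes are assumptions mirroring the trigger table.
DEPENDENCY_FILES = ("package.json", "requirements.txt")

def should_escalate_to_medium(task) -> bool:
    return (
        task.error_count > 2
        or len(task.modified_files) > 5
        or task.lines_changed > 200
        or task.failing_tests > 3
        or any(f.endswith(DEPENDENCY_FILES) for f in task.modified_files)
        or task.retry_count > 1
    )

if should_escalate_to_medium(task):
    task.escalate(to="sonnet", reason="LOW-tier threshold exceeded")
```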

#### MEDIUM -> HIGH Escalation

Escalate from Sonnet to Opus when:

| Trigger | Threshold | Rationale |
|---------|-----------|-----------|
| Error count | > 3 consecutive failures | Even Sonnet struggling |
| Complexity score | > 15 cyclomatic | Highly complex logic needs planning |
| Architecture files | Any changes to core/* | Architecture decisions are critical |
| Breaking changes | API contract modifications | Need careful impact analysis |
| Performance issues | > 2x baseline regression | Need optimization strategy |
| Integration scope | > 3 services affected | Cross-service coordination |

```python
# Example: Auto-escalate for architecture changes
if any("core/" in f for f in modified_files) or "architecture" in task.tags:
    task.escalate(to="opus", reason="Architecture changes require planning tier")
```

#### HIGH -> HUMAN Escalation (Terminal)

When even Opus fails, escalate to human intervention:

| Trigger | Threshold | Action |
|---------|-----------|--------|
| Error count | > 5 consecutive at HIGH tier | Create `.loki/signals/HUMAN_REVIEW_NEEDED` |
| Ambiguous requirements | Cannot determine correct behavior | Create signal with specific questions |
| External dependencies | Blocked on third-party API/service | Document blocker, pause task |
| Ethical concerns | Task may violate principles | Halt immediately, document concern |

```python
# Terminal escalation - no automated recovery
if task.tier == "opus" and task.error_count > 5:
    create_signal(".loki/signals/HUMAN_REVIEW_NEEDED", {
        "task_id": task.id,
        "reason": "5+ failures at planning tier",
        "attempts": task.attempt_log,
        "recommendation": "Manual investigation required"
    })
    task.status = "blocked_on_human"
```
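
`create_signal` is referenced but not defined in this excerpt. One plausible implementation, assuming signals are plain JSON files that an operator or watcher process picks up:

```python
# Sketch of create_signal(): write the payload as a JSON signal file.
# The exact signal format Loki Mode uses may differ.
import json
from pathlib import Path

def create_signal(path, payload):
    signal = Path(path)
    signal.parent.mkdir(parents=True, exist_ok=True)
    signal.write_text(json.dumps(payload, indent=2))
```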

### Threshold Rationale

Why these specific thresholds? Each value is grounded in research or proven heuristics:

| Threshold | Value | Justification |
|-----------|-------|---------------|
| **Error count > 2** (LOW->MEDIUM) | 3 attempts | "Two strikes" principle: the first failure may be transient (network, timeout); a second suggests genuine complexity. A third attempt warrants escalation to a more capable tier. |
| **Error count > 3** (MEDIUM->HIGH) | 4 attempts | Sonnet has significantly more capability than Haiku, so allow one additional attempt before the expensive Opus escalation. Balances cost vs. success rate. |
| **Error count > 5** (HIGH->HUMAN) | 6 attempts | The planning tier (Opus) has exhausted all automated reasoning options. Further attempts are unlikely to succeed; human judgment is required. |
| **File count > 5** | 6+ files | Cross-cutting changes touching this many files require a holistic understanding of system interactions. Research on code review effectiveness shows reviewer accuracy drops with multi-file changes. |
| **Lines changed > 200** | 200 LOC | Studies on code review effectiveness (Cisco, SmartBear) show review quality degrades significantly above 200-400 LOC. Microsoft's internal research suggests 200 LOC as the optimal review size. |
| **Cyclomatic complexity > 15** | McCabe threshold | Industry standard since McCabe (1976). NIST considers >15 "high risk." Many static analysis tools default to this threshold. |
| **Test failures > 3** | 4+ failures | Distinguishes isolated flakiness from systemic issues. A single failing test may be flaky; 3+ indicates deeper problems requiring debugging capability. |
| **Retry attempts > 1** | 2+ retries | The first retry accounts for transient issues. A second retry at the same tier signals a fundamental mismatch between task complexity and model capability. |
| **5+ successful tasks** (de-escalation) | Success streak | Sustained success indicates task complexity has dropped or the model has adapted. Safe to try a lower-cost tier, with quick re-escalation if needed. |

**References:**
- McCabe, T.J. (1976). "A Complexity Measure." IEEE Transactions on Software Engineering.
- Cisco Code Review Study: optimal review size is 200-400 LOC for defect detection.
- SmartBear, "Best Kept Secrets of Peer Code Review": review effectiveness drops 50% above 400 LOC.

### Always-HIGH Triggers (No Escalation Path)

These tasks ALWAYS start at HIGH tier (Opus):

| Category | Examples | Rationale |
|----------|----------|-----------|
| **Security** | Auth, encryption, secrets, RBAC | Security cannot be compromised |
| **Architecture** | System design, service boundaries, data models | Foundation decisions |
| **Breaking Changes** | API versioning, schema migrations, deprecations | High blast radius |
| **Production Incidents** | Outage response, data corruption, rollback | Critical impact |
| **Compliance** | GDPR, HIPAA, SOC2 implementations | Regulatory requirements |
| **Cost Decisions** | Infrastructure scaling, vendor selection | Financial impact |

```python
# Example: Security tasks always use Opus
import re

ALWAYS_HIGH_PATTERNS = [
    r"(auth|security|encrypt|secret|credential|token|password)",
    r"(architecture|system.design|schema.migration)",
    r"(production|incident|outage|rollback)",
    r"(compliance|gdpr|hipaa|soc2|pci)",
]

if any(re.search(p, task.description, re.I) for p in ALWAYS_HIGH_PATTERNS):
    task.tier = "HIGH"  # No escalation, start at Opus
```

### De-escalation Triggers (Cost Optimization)

De-escalate to a lower tier when conditions improve:

| Trigger | Action | Rationale |
|---------|--------|-----------|
| 5+ successful tasks at tier | Consider de-escalation | Complexity resolved |
| Single-file changes | Use LOW for isolated fixes | Simple scope |
| Test-only changes | Use LOW for unit tests | Well-defined output |
| Documentation | Use LOW for docs/comments | Low risk |

```python
# Example: De-escalate when task becomes routine
if task.success_streak >= 5 and task.scope == "single_file":
    task.deescalate(to="haiku", reason="Task scope is simple and stable")
```
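
De-escalation implies tracking a success streak that resets on any failure, with quick re-escalation as the safety net. A sketch under those assumptions (`success_streak`, `was_deescalated`, and `previous_tier` are hypothetical fields):

```python
# Sketch: track the success streak that gates de-escalation.
# success_streak, was_deescalated, previous_tier are assumed fields.
def record_result(task, succeeded: bool) -> None:
    if succeeded:
        task.success_streak += 1
    else:
        task.success_streak = 0           # any failure resets the streak
        if task.was_deescalated:          # de-escalation didn't hold
            task.escalate(to=task.previous_tier, reason="Re-escalate after failure")
```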

### Escalation Flow Diagram

```
            +------------------+
            |   Task Arrives   |
            +--------+---------+
                     |
            +--------v---------+
            | Check ALWAYS_HIGH|
            +--------+---------+
                     |
      +--------------+--------------+
      |                             |
  [matches]                    [no match]
      |                             |
+-----v-----------+         +-------v-------+
|   START: HIGH   |         |  START: LOW   |
|     (Opus)      |         |   (Haiku)     |
+-----------------+         +-------+-------+
                                    |
                            +-------v-------+
                            | Execute Task  |
                            +-------+-------+
                                    |
                     +--------------+--------------+
                     |                             |
                 [success]                    [failure]
                     |                             |
           +---------v---------+         +---------v---------+
           | Continue at tier  |         | Check thresholds  |
           +-------------------+         +---------+---------+
                                                   |
                                       +-----------+-----------+
                                       |                       |
                                 [under limit]           [over limit]
                                       |                       |
                              +--------v--------+     +--------v--------+
                              |  Retry at tier  |     |  ESCALATE tier  |
                              +-----------------+     +-----------------+
```

### Implementation in Provider Context

For Claude (full features):
```python
# Task tool with tier awareness
Task(
    model=determine_tier(task),  # Returns "opus", "sonnet", or "haiku"
    description=task.description,
    prompt=task.prompt,
    metadata={"escalation_count": task.escalation_count}
)
```

For Codex/Gemini (degraded mode):
```python
# Map tiers to effort/thinking levels. determine_tier() returns a model
# name, so translate it to a tier label before the lookup.
TIER_MAPPING = {
    "codex": {"HIGH": "xhigh", "MEDIUM": "high", "LOW": "low"},
    "gemini": {"HIGH": "high", "MEDIUM": "medium", "LOW": "low"},
}
MODEL_TO_TIER = {"opus": "HIGH", "sonnet": "MEDIUM", "haiku": "LOW"}
effort_level = TIER_MAPPING[provider][MODEL_TO_TIER[determine_tier(task)]]
```
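
`determine_tier(task)` is used above but not defined in this excerpt. A sketch of how it could combine the Always-HIGH patterns with the scope heuristics from the escalation tables (assumes `ALWAYS_HIGH_PATTERNS` from earlier is in scope; not the package's actual implementation):

```python
# Sketch: Always-HIGH patterns win, then simple scope heuristics.
# Assumes ALWAYS_HIGH_PATTERNS (defined earlier) is in scope.
import re

def determine_tier(task) -> str:
    if any(re.search(p, task.description, re.I) for p in ALWAYS_HIGH_PATTERNS):
        return "opus"      # HIGH: no escalation path
    if len(task.modified_files) > 5 or task.lines_changed > 200:
        return "sonnet"    # MEDIUM: cross-cutting scope
    return "haiku"         # LOW: default starting tier
```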

### Metrics for Tier Optimization

Track these metrics to tune escalation thresholds:

| Metric | Purpose | Target |
|--------|---------|--------|
| Escalation rate | How often tasks escalate | < 20% |
| First-tier success | Tasks completed without escalation | > 80% |
| Cost per task | Average token cost by tier | Minimize |
| Time to completion | Including escalation delays | Minimize |
| Quality score | Post-completion review score | > 4.0/5.0 |

```python
# Log escalation events for analysis
log_escalation(
    task_id=task.id,
    from_tier=current_tier,
    to_tier=new_tier,
    trigger=trigger_reason,
    error_count=task.error_count,
    timestamp=now()
)
```
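
Given events logged in that shape, the escalation-rate target (< 20%) reduces to a small aggregation. A sketch, assuming events are collected as a list of dicts:

```python
# Sketch: escalation rate over a batch of tasks, from logged events.
def escalation_rate(events, total_tasks):
    escalated = {e["task_id"] for e in events}  # tasks that escalated at least once
    return len(escalated) / max(total_tasks, 1)

events = [{"task_id": "T1", "from_tier": "haiku", "to_tier": "sonnet"}]
print(escalation_rate(events, total_tasks=10))  # 0.1 -- under the < 20% target
```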

package/skills/quality-gates.md CHANGED

@@ -21,6 +21,190 @@

---

## Chain-of-Verification (CoVe) Protocol

**Research:** arXiv 2309.11495 - "Chain-of-Verification Reduces Hallucination in Large Language Models"

### Core Insight

Factored, decoupled verification mitigates error propagation. Each verification is computed independently, without access to the original response, preventing the model from rationalizing its initial mistakes.

### The 4-Step CoVe Process

```
Step 1: DRAFT        Step 2: PLAN           Step 3: EXECUTE         Step 4: REVISE
+-------------+      +---------------+      +-----------------+     +----------------+
| Generate    | ---> | Self-generate | ---> | Answer each     | --> | Incorporate    |
| initial     |      | verification  |      | question        |     | corrections    |
| response    |      | questions     |      | INDEPENDENTLY   |     | into final     |
+-------------+      +---------------+      +-----------------+     +----------------+
                     "What claims          (factored execution:
                      did I make?           no access to the
                      What could be         original response)
                      wrong?"
```

### Step-by-Step Implementation

**Step 1: Draft Initial Response**
```yaml
draft_phase:
  action: "Generate initial code/response"
  model: "sonnet"  # Fast drafting
  output: "baseline_response"
```

**Step 2: Plan Verification Questions**
```yaml
verification_planning:
  prompt: |
    Review the response above. Generate verification questions:
    1. What factual claims did I make?
    2. What assumptions did I rely on?
    3. What could be incorrect or incomplete?
    4. What edge cases did I miss?
  output: "verification_questions[]"
```

**Step 3: Execute Verifications INDEPENDENTLY (Critical)**
```yaml
factored_execution:
  critical: "Each verification runs in isolation"
  rule: "Verifier has NO access to original response"

  # Launch in parallel - each is independent
  verifications:
    - question: "Does the function handle null inputs?"
      context: "Function signature and spec only"  # NOT the implementation
      verifier: "sonnet"
    - question: "Is the SQL query injection-safe?"
      context: "Query requirements only"
      verifier: "sonnet"
    - question: "Does the API match the documented spec?"
      context: "API spec only"
      verifier: "sonnet"
```

**Step 4: Generate Final Verified Response**
```yaml
revision_phase:
  inputs:
    - original_response
    - verification_results[]
  action: "Revise response incorporating all corrections"
  output: "verified_response"
```

### Factor+Revise Variant (Longform Code Generation)

For complex code generation, use the enhanced Factor+Revise pattern. The key difference from basic Factored execution is an **explicit cross-check step** where the model compares original claims against verification results before revision.

```yaml
factor_revise_pattern:
  step_1_draft:
    action: "Generate complete implementation"
    output: "draft_code"

  step_2_factor:
    action: "Decompose into verifiable claims"
    outputs:
      - "Function X handles error case Y"
      - "Loop invariant: Z holds at each iteration"
      - "API call returns type T"
      - "Memory is freed in all paths"

  step_3_independent_verify:
    # CRITICAL: Each runs with ONLY the claim + minimal context
    # No access to full draft code
    parallel_tasks:
      - verify: "Function X handles error case Y"
        context: "Function signature + error spec"
        result: "PASS|FAIL + evidence"
      - verify: "Loop invariant holds"
        context: "Loop structure only"
        result: "PASS|FAIL + evidence"

  step_3b_cross_check:
    # KEY DIFFERENCE: Explicit consistency check before revision
    action: "Compare original claims against verification results"
    prompt: "Identify which facts from the draft are CONSISTENT vs INCONSISTENT with verifications"
    output: "consistency_report"

  step_4_revise:
    inputs: [draft_code, verification_results, consistency_report]
    action: "Discard inconsistent facts, use consistent facts to regenerate"
    output: "verified_code"
```

### Why Factored Execution Matters

The paper tested 4 execution variants:
- **Joint**: Questions and answers in one prompt (worst - repeats hallucinations)
- **2-Step**: Separate prompts for questions vs answers (better)
- **Factored**: Each question answered separately (recommended)
- **Factor+Revise**: Factored + explicit cross-check step (best for longform)

Without factoring (naive verification):
```
Model: "Here's the code"
Model: "Let me check my code... looks correct!"  # Confirmation bias
```

With factored verification:
```
Model: "Here's the code"
Model: "Question: Does function handle nulls?"
[New context, no code visible]
Model: "Given a function that takes X, null handling requires..."  # Independent reasoning
```

**Key principle from the paper:** The verifier cannot see the original response, only the verification question and minimal context. This prevents rationalization of errors and breaks the chain of hallucination propagation.
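
In orchestration terms, the factoring rule means every verifier runs in a fresh context containing only the question plus minimal reference material. A sketch of that loop, with the completion call passed in as a callable since the provider API is not specified here:

```python
# Sketch of factored verification: each question is answered in a
# fresh context that never includes the draft response.
def run_factored_verification(questions, minimal_context, ask_model):
    """ask_model is a stand-in callable: prompt string -> answer string."""
    results = []
    for q in questions:
        prompt = (
            f"Context: {minimal_context[q]}\n"  # spec/signature only, never the draft
            f"Question: {q}\n"
            "Answer independently. Conclude PASS or FAIL with evidence."
        )
        results.append(ask_model(prompt))       # one isolated call per question
    return results
```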

### CoVe Integration with Blind Review

CoVe operates BEFORE blind review as a self-correction step:

```
Developer Code --> CoVe (self-verification) --> Blind Review (3 parallel)
                              |                           |
                    Catches errors early         Catches remaining
                    via factored checking        issues independently
```

**Combined workflow:**
```yaml
quality_pipeline:
  phase_1_cove:
    # Developer runs CoVe on their own code
    draft: "Initial implementation"
    verify: "Self-generated questions, factored execution"
    revise: "Corrected implementation"

  phase_2_blind_review:
    # 3 independent reviewers (no access to CoVe results)
    reviewers:
      - focus: "correctness"
      - focus: "security"
      - focus: "performance"
    # Reviewers see verified code but don't know what was corrected

  phase_3_aggregate:
    if: "unanimous approval"
    then: "Devil's Advocate review"
```

### Metrics

Track CoVe effectiveness:
```
.loki/metrics/cove/
+-- corrections.json       # Issues caught by CoVe before review
+-- false_positives.json   # CoVe flags that were actually correct
+-- review_reduction.json  # Reviewer findings before/after CoVe adoption
```

---

## Velocity-Quality Feedback Loop (CRITICAL)

**Research from arXiv 2511.04427v2 - empirical study of 807 repositories.**

@@ -98,6 +282,105 @@ Task(model="sonnet", description="Code review: performance", prompt="Review for

---

## Two-Stage Review Protocol

**Source:** Superpowers (obra) - 35K+ stars GitHub project

**CRITICAL: Never mix spec compliance and code quality review. They are separate stages.**

### Why Separate Stages Matter

Mixing the stages causes these problems:
- **"Technically correct but wrong feature"** - Code is clean, well-tested, and maintainable, but doesn't implement what the spec requires
- **Spec drift goes undetected** - Quality reviewers approve beautiful code that solves the wrong problem
- **False confidence** - "3 reviewers approved" means nothing if none of them checked spec compliance

### Stage 1: Spec Compliance Review

**Question:** "Does this code implement what the spec requires?"

```
Review this implementation against the specification.

Specification:
{paste_spec_or_requirements}

Implementation:
{paste_code_or_diff}

Check ONLY the following:
1. Does the code implement ALL required features from the spec?
2. Does the code implement ONLY what the spec requires (no scope creep)?
3. Are edge cases from the spec handled?
4. Do the tests verify spec requirements?

DO NOT review code quality, style, or maintainability.
Output: PASS/FAIL with specific spec violations listed.
```

**Stage 1 must PASS before proceeding to Stage 2.**

### Stage 2: Code Quality Review

**Question:** "Is this code well-written, maintainable, secure?"

```
Review this code for quality. Spec compliance has already been verified.

Code:
{paste_code_or_diff}

Check the following:
1. Is the code readable and maintainable?
2. Are there security vulnerabilities?
3. Is error handling appropriate?
4. Are there performance concerns?
5. Does it follow project conventions?

DO NOT verify spec compliance (already done).
Output: PASS/FAIL with specific issues listed by severity.
```

### Implementation in Loki Mode

```yaml
two_stage_review:
  stage_1_spec:
    reviewer_count: 1  # Spec compliance is objective
    model: "sonnet"
    must_pass: true
    blocks: "stage_2"

  stage_2_quality:
    reviewer_count: 3  # Quality is subjective, use blind review
    model: "sonnet"
    must_pass: true
    follows: "stage_1"
    anti_sycophancy: true  # Devil's advocate on unanimous

  on_stage_1_fail:
    action: "Return to implementation, DO NOT proceed to Stage 2"
    reason: "Quality review of wrong feature wastes resources"

  on_stage_2_fail:
    action: "Fix quality issues, re-run Stage 2 only"
    reason: "Spec compliance already verified"
```
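
The gating above translates directly into control flow: Stage 2 never runs unless Stage 1 passes, and a Stage 2 failure re-runs Stage 2 only. A sketch with the review helpers passed in as callables (hypothetical stand-ins for `Task(...)` invocations using the stage prompts shown earlier):

```python
# Sketch of the two-stage gate. Reviewer callables are hypothetical
# stand-ins for Task(...) invocations with the stage prompts above.
def two_stage_review(spec, code, run_spec_review, run_quality_review) -> bool:
    if not run_spec_review(spec, code):  # Stage 1: objective spec compliance
        return False                     # return to implementation; skip Stage 2
    return run_quality_review(code)      # Stage 2: 3-reviewer blind quality review
```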

### Common Anti-Pattern

```
# WRONG - Mixed review
Task(prompt="Review for correctness, security, performance, and spec compliance...")

# RIGHT - Separate stages
Task(prompt="Stage 1: Check spec compliance ONLY...")
# Wait for pass
Task(prompt="Stage 2: Check code quality ONLY...")
```

---

## Severity-Based Blocking

| Severity | Action |