loki-mode 5.1.3 → 5.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -176,3 +176,234 @@ Task(model="haiku", description="Fix import in utils.py", prompt="...")
  # Complex tasks -> Supervisor orchestration
  Task(description="Implement user authentication with OAuth", prompt="...")
  ```
+
+ ## Tiered Agent Escalation Triggers
+
+ *Inspired by the [oh-my-claudecode](https://github.com/Yeachan-Heo/oh-my-claudecode) tiered architecture.*
+
+ **Tier Mapping:** LOW = fast = Haiku, MEDIUM = development = Sonnet, HIGH = planning = Opus. This section uses LOW/MEDIUM/HIGH for clarity in the escalation context.
+
+ Explicit signals determine when to escalate from one tier to another. These triggers enable automatic tier selection based on task characteristics and runtime conditions.
+
+ ### Tier Definitions
+
+ | Tier | Model | Cost | Speed | Use Case |
+ |------|-------|------|-------|----------|
+ | **LOW** | Haiku | Lowest | Fastest | Simple, repetitive, well-defined tasks |
+ | **MEDIUM** | Sonnet | Medium | Balanced | Complex implementation, testing, debugging |
+ | **HIGH** | Opus | Highest | Slowest | Critical decisions, architecture, security |
+
+ ### Automatic Escalation Triggers
+
+ #### LOW -> MEDIUM Escalation
+
+ Escalate from Haiku to Sonnet when:
+
+ | Trigger | Threshold | Rationale |
+ |---------|-----------|-----------|
+ | Error count | > 2 consecutive failures | Haiku is struggling with complexity |
+ | File count | > 5 files modified | Cross-cutting changes need context |
+ | Lines changed | > 200 lines | Large changes need careful review |
+ | Test failures | > 3 failing tests | Deeper debugging needed |
+ | Dependency changes | Any package.json/requirements.txt change | Dependency resolution is complex |
+ | Retry attempts | > 1 retry on same task | Task too complex for current tier |
+
+ ```python
+ # Example: Auto-escalate on repeated failures
+ if task.error_count > 2:
+     task.escalate(to="sonnet", reason="Multiple failures indicate complexity")
+ ```
+
+ #### MEDIUM -> HIGH Escalation
+
+ Escalate from Sonnet to Opus when:
+
+ | Trigger | Threshold | Rationale |
+ |---------|-----------|-----------|
+ | Error count | > 3 consecutive failures | Even Sonnet struggling |
+ | Complexity score | > 15 cyclomatic | Highly complex logic needs planning |
+ | Architecture files | Any changes to core/* | Architecture decisions are critical |
+ | Breaking changes | API contract modifications | Need careful impact analysis |
+ | Performance issues | > 2x baseline regression | Need optimization strategy |
+ | Integration scope | > 3 services affected | Cross-service coordination |
+
+ ```python
+ # Example: Auto-escalate for architecture changes
+ # (check each modified path, not exact list membership of "core/")
+ if any(f.startswith("core/") for f in modified_files) or "architecture" in task.tags:
+     task.escalate(to="opus", reason="Architecture changes require planning tier")
+ ```
+
+ #### HIGH -> HUMAN Escalation (Terminal)
+
+ When even Opus fails, escalate to human intervention:
+
+ | Trigger | Threshold | Action |
+ |---------|-----------|--------|
+ | Error count | > 5 consecutive at HIGH tier | Create `.loki/signals/HUMAN_REVIEW_NEEDED` |
+ | Ambiguous requirements | Cannot determine correct behavior | Create signal with specific questions |
+ | External dependencies | Blocked on third-party API/service | Document blocker, pause task |
+ | Ethical concerns | Task may violate principles | Halt immediately, document concern |
+
+ ```python
+ # Terminal escalation - no automated recovery
+ if task.tier == "opus" and task.error_count > 5:
+     create_signal(".loki/signals/HUMAN_REVIEW_NEEDED", {
+         "task_id": task.id,
+         "reason": "5+ failures at planning tier",
+         "attempts": task.attempt_log,
+         "recommendation": "Manual investigation required"
+     })
+     task.status = "blocked_on_human"
+ ```
+
+ ### Threshold Rationale
+
+ Why these specific thresholds? Each value is grounded in research or proven heuristics:
+
+ | Threshold | Value | Justification |
+ |-----------|-------|---------------|
+ | **Error count > 2** (LOW->MEDIUM) | 3 attempts | "Two strikes" principle: the first failure may be transient (network, timeout); the second suggests genuine complexity. A third attempt warrants escalation to a more capable tier. |
+ | **Error count > 3** (MEDIUM->HIGH) | 4 attempts | Sonnet has significantly more capability than Haiku, so allow one additional attempt before the expensive Opus escalation. Balances cost vs. success rate. |
+ | **Error count > 5** (HIGH->HUMAN) | 6 attempts | The planning tier (Opus) has exhausted all automated reasoning options. Further attempts are unlikely to succeed; human judgment is required. |
+ | **File count > 5** | 6+ files | Cross-cutting changes touching this many files require a holistic understanding of system interactions. Research on code review effectiveness shows reviewer accuracy drops on multi-file changes. |
+ | **Lines changed > 200** | 200 LOC | Studies of code review effectiveness (Cisco, SmartBear) show review quality degrades significantly above 200-400 LOC. Microsoft's internal research suggests 200 LOC as an optimal review size. |
+ | **Cyclomatic complexity > 15** | McCabe threshold | Industry standard since McCabe (1976). NIST considers >15 "high risk." Many static analysis tools default to this threshold. |
+ | **Test failures > 3** | 4+ failures | Distinguishes isolated flakiness from systemic issues. A single test failure may be flaky; 3+ indicates deeper problems requiring debugging capability. |
+ | **Retry attempts > 1** | 2+ retries | The first retry accounts for transient issues. A second retry at the same tier signals a fundamental mismatch between task complexity and model capability. |
+ | **5+ successful tasks** (de-escalation) | Success streak | Sustained success indicates task complexity has dropped or the model has adapted. Safe to try a lower-cost tier, with quick re-escalation if needed. |
+
+ **References:**
+ - McCabe, T.J. (1976). "A Complexity Measure." IEEE Transactions on Software Engineering.
+ - Cisco code review study: optimal review size is 200-400 LOC for defect detection.
+ - SmartBear, "Best Kept Secrets of Peer Code Review": review effectiveness drops 50% above 400 LOC.
+
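+ The table above can be collapsed into a single configuration object. A minimal sketch, assuming a plain dict keyed by (from_tier, to_tier) transitions; the names and structure are illustrative, not part of loki-mode's API:
+
+ ```python
+ # Hypothetical central registry for the escalation thresholds above; tune per project.
+ ESCALATION_THRESHOLDS = {
+     ("LOW", "MEDIUM"): {
+         "error_count": 2,             # escalate on the 3rd consecutive failure
+         "files_modified": 5,
+         "lines_changed": 200,
+         "test_failures": 3,
+         "retry_attempts": 1,
+     },
+     ("MEDIUM", "HIGH"): {
+         "error_count": 3,
+         "cyclomatic_complexity": 15,  # McCabe (1976); NIST calls >15 "high risk"
+         "services_affected": 3,
+     },
+     ("HIGH", "HUMAN"): {
+         "error_count": 5,             # terminal: emit .loki/signals/HUMAN_REVIEW_NEEDED
+     },
+ }
+ ```
+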
+ ### Always-HIGH Triggers (No Escalation Path)
+
+ These tasks ALWAYS start at HIGH tier (Opus):
+
+ | Category | Examples | Rationale |
+ |----------|----------|-----------|
+ | **Security** | Auth, encryption, secrets, RBAC | Security cannot be compromised |
+ | **Architecture** | System design, service boundaries, data models | Foundation decisions |
+ | **Breaking Changes** | API versioning, schema migrations, deprecations | High blast radius |
+ | **Production Incidents** | Outage response, data corruption, rollback | Critical impact |
+ | **Compliance** | GDPR, HIPAA, SOC2 implementations | Regulatory requirements |
+ | **Cost Decisions** | Infrastructure scaling, vendor selection | Financial impact |
+
+ ```python
+ import re
+
+ # Example: Security tasks always use Opus
+ ALWAYS_HIGH_PATTERNS = [
+     r"(auth|security|encrypt|secret|credential|token|password)",
+     r"(architecture|system.design|schema.migration)",
+     r"(production|incident|outage|rollback)",
+     r"(compliance|gdpr|hipaa|soc2|pci)",
+ ]
+
+ if any(re.search(p, task.description, re.I) for p in ALWAYS_HIGH_PATTERNS):
+     task.tier = "HIGH"  # No escalation, start at Opus
+ ```
+
+ ### De-escalation Triggers (Cost Optimization)
+
+ De-escalate to a lower tier when conditions improve:
+
+ | Trigger | Action | Rationale |
+ |---------|--------|-----------|
+ | 5+ successful tasks at tier | Consider de-escalation | Complexity resolved |
+ | Single-file changes | Use LOW for isolated fixes | Simple scope |
+ | Test-only changes | Use LOW for unit tests | Well-defined output |
+ | Documentation | Use LOW for docs/comments | Low risk |
+
+ ```python
+ # Example: De-escalate when task becomes routine
+ if task.success_streak >= 5 and task.scope == "single_file":
+     task.deescalate(to="haiku", reason="Task scope is simple and stable")
+ ```
+
+ ### Escalation Flow Diagram
+
+ ```
+             +-------------------+
+             |   Task Arrives    |
+             +---------+---------+
+                       |
+             +---------v---------+
+             | Check ALWAYS_HIGH |
+             +---------+---------+
+                       |
+           +-----------+-----------+
+           |                       |
+       [matches]              [no match]
+           |                       |
+ +---------v---------+   +---------v---------+
+ |    START: HIGH    |   |    START: LOW     |
+ |      (Opus)       |   |      (Haiku)      |
+ +-------------------+   +---------+---------+
+                                   |
+                         +---------v---------+
+                         |   Execute Task    |
+                         +---------+---------+
+                                   |
+                       +-----------+-----------+
+                       |                       |
+                   [success]               [failure]
+                       |                       |
+             +---------v---------+   +---------v---------+
+             | Continue at tier  |   | Check thresholds  |
+             +-------------------+   +---------+---------+
+                                               |
+                                   +-----------+-----------+
+                                   |                       |
+                             [under limit]           [over limit]
+                                   |                       |
+                         +---------v---------+   +---------v---------+
+                         |   Retry at tier   |   |   ESCALATE tier   |
+                         +-------------------+   +-------------------+
+ ```
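+
+ The diagram reduces to a small control loop. A minimal sketch of that loop; `matches_always_high`, `execute`, `over_threshold`, and `escalate_to_human` are hypothetical helpers standing in for the checks described above:
+
+ ```python
+ TIERS = ["LOW", "MEDIUM", "HIGH"]
+
+ def run_with_escalation(task):
+     # Start at HIGH if an ALWAYS_HIGH pattern matches, otherwise at LOW
+     tier = "HIGH" if matches_always_high(task) else "LOW"
+     while True:
+         result = execute(task, tier)
+         if result.success:
+             return result                    # continue at current tier
+         if not over_threshold(task, tier):
+             continue                         # under limit: retry at same tier
+         if tier == "HIGH":
+             escalate_to_human(task)          # terminal escalation
+             return None
+         tier = TIERS[TIERS.index(tier) + 1]  # over limit: ESCALATE tier
+ ```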
+
+ ### Implementation in Provider Context
+
+ For Claude (full features):
+ ```python
+ # Map abstract tiers to Claude models
+ TIER_TO_MODEL = {"HIGH": "opus", "MEDIUM": "sonnet", "LOW": "haiku"}
+
+ # Task tool with tier awareness
+ Task(
+     model=TIER_TO_MODEL[determine_tier(task)],  # determine_tier returns "HIGH", "MEDIUM", or "LOW"
+     description=task.description,
+     prompt=task.prompt,
+     metadata={"escalation_count": task.escalation_count}
+ )
+ ```
+
+ For Codex/Gemini (degraded mode):
+ ```python
+ # Map tiers to effort/thinking levels
+ TIER_MAPPING = {
+     "codex": {"HIGH": "xhigh", "MEDIUM": "high", "LOW": "low"},
+     "gemini": {"HIGH": "high", "MEDIUM": "medium", "LOW": "low"},
+ }
+ effort_level = TIER_MAPPING[provider][determine_tier(task)]
+ ```
+
+ ### Metrics for Tier Optimization
+
+ Track these metrics to tune escalation thresholds:
+
+ | Metric | Purpose | Target |
+ |--------|---------|--------|
+ | Escalation rate | How often tasks escalate | < 20% |
+ | First-tier success | Tasks completed without escalation | > 80% |
+ | Cost per task | Average token cost by tier | Minimize |
+ | Time to completion | Including escalation delays | Minimize |
+ | Quality score | Post-completion review score | > 4.0/5.0 |
+
+ ```python
+ # Log escalation events for analysis
+ log_escalation(
+     task_id=task.id,
+     from_tier=current_tier,
+     to_tier=new_tier,
+     trigger=trigger_reason,
+     error_count=task.error_count,
+     timestamp=now()
+ )
+ ```
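+
+ A sketch of how the first two metrics might be derived from the logged events; the record fields simply mirror the `log_escalation` call above and are not a fixed schema:
+
+ ```python
+ def tier_metrics(events, total_tasks):
+     # events: escalation records written by log_escalation()
+     escalated = {e["task_id"] for e in events}
+     escalation_rate = len(escalated) / total_tasks   # target: < 20%
+     first_tier_success = 1 - escalation_rate         # target: > 80%
+     return {"escalation_rate": escalation_rate,
+             "first_tier_success": first_tier_success}
+ ```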
@@ -21,6 +21,190 @@

  ---

+ ## Chain-of-Verification (CoVe) Protocol
+
+ **Research:** arXiv 2309.11495 - "Chain-of-Verification Reduces Hallucination in Large Language Models"
+
+ ### Core Insight
+
+ Factored, decoupled verification mitigates error propagation. Each verification is computed independently without access to the original response, preventing the model from rationalizing its initial mistakes.
+
+ ### The 4-Step CoVe Process
+
+ ```
+ Step 1: DRAFT        Step 2: PLAN          Step 3: EXECUTE       Step 4: REVISE
+ +-------------+      +---------------+     +-----------------+    +----------------+
+ | Generate    | ---> | Self-generate | --> | Answer each     | -> | Incorporate    |
+ | initial     |      | verification  |     | question        |    | corrections    |
+ | response    |      | questions     |     | INDEPENDENTLY   |    | into final     |
+ +-------------+      +---------------+     +-----------------+    +----------------+
+                      "What claims          (factored exec)
+                       did I make?          No access to
+                       What could be        original response
+                       wrong?"
+ ```
+
+ ### Step-by-Step Implementation
+
+ **Step 1: Draft Initial Response**
+ ```yaml
+ draft_phase:
+   action: "Generate initial code/response"
+   model: "sonnet"  # Fast drafting
+   output: "baseline_response"
+ ```
+
+ **Step 2: Plan Verification Questions**
+ ```yaml
+ verification_planning:
+   prompt: |
+     Review the response above. Generate verification questions:
+     1. What factual claims did I make?
+     2. What assumptions did I rely on?
+     3. What could be incorrect or incomplete?
+     4. What edge cases did I miss?
+   output: "verification_questions[]"
+ ```
+
+ **Step 3: Execute Verifications INDEPENDENTLY (Critical)**
+ ```yaml
+ factored_execution:
+   critical: "Each verification runs in isolation"
+   rule: "Verifier has NO access to original response"
+
+   # Launch in parallel - each is independent
+   verifications:
+     - question: "Does the function handle null inputs?"
+       context: "Function signature and spec only"  # NOT the implementation
+       verifier: "sonnet"
+     - question: "Is the SQL query injection-safe?"
+       context: "Query requirements only"
+       verifier: "sonnet"
+     - question: "Does the API match the documented spec?"
+       context: "API spec only"
+       verifier: "sonnet"
+ ```
+
+ **Step 4: Generate Final Verified Response**
+ ```yaml
+ revision_phase:
+   inputs:
+     - original_response
+     - verification_results[]
+   action: "Revise response incorporating all corrections"
+   output: "verified_response"
+ ```
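+
+ End to end, the four phases reduce to a short driver. A minimal sketch, assuming a generic `llm(prompt) -> str` completion helper rather than any specific SDK; the load-bearing detail is that Step 3 sends each question in a fresh prompt with the draft omitted:
+
+ ```python
+ def chain_of_verification(task_prompt, llm):
+     # Step 1: DRAFT - generate the initial response
+     draft = llm(task_prompt)
+
+     # Step 2: PLAN - self-generate verification questions from the draft
+     questions = llm(
+         f"{draft}\n\nList verification questions, one per line: "
+         "what claims did I make? what assumptions? what edge cases are missed?"
+     ).splitlines()
+
+     # Step 3: EXECUTE - answer each question INDEPENDENTLY.
+     # Factored execution: the draft is deliberately NOT in these prompts.
+     answers = [llm(f"Answer from first principles: {q}") for q in questions if q.strip()]
+
+     # Step 4: REVISE - fold the independent findings back into the response
+     findings = "\n".join(answers)
+     return llm(
+         f"Original response:\n{draft}\n\n"
+         f"Independent verification results:\n{findings}\n\n"
+         "Revise the response, correcting anything the verifications contradict."
+     )
+ ```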
+
+ ### Factor+Revise Variant (Longform Code Generation)
+
+ For complex code generation, use the enhanced Factor+Revise pattern. The key difference from basic Factored execution is an **explicit cross-check step** where the model compares original claims against verification results before revision.
+
+ ```yaml
+ factor_revise_pattern:
+   step_1_draft:
+     action: "Generate complete implementation"
+     output: "draft_code"
+
+   step_2_factor:
+     action: "Decompose into verifiable claims"
+     outputs:
+       - "Function X handles error case Y"
+       - "Loop invariant: Z holds at each iteration"
+       - "API call returns type T"
+       - "Memory is freed in all paths"
+
+   step_3_independent_verify:
+     # CRITICAL: Each runs with ONLY the claim + minimal context
+     # No access to full draft code
+     parallel_tasks:
+       - verify: "Function X handles error case Y"
+         context: "Function signature + error spec"
+         result: "PASS|FAIL + evidence"
+       - verify: "Loop invariant holds"
+         context: "Loop structure only"
+         result: "PASS|FAIL + evidence"
+
+   step_3b_cross_check:
+     # KEY DIFFERENCE: Explicit consistency check before revision
+     action: "Compare original claims against verification results"
+     prompt: "Identify which facts from the draft are CONSISTENT vs INCONSISTENT with verifications"
+     output: "consistency_report"
+
+   step_4_revise:
+     inputs: [draft_code, verification_results, consistency_report]
+     action: "Discard inconsistent facts, use consistent facts to regenerate"
+     output: "verified_code"
+ ```
+
+ ### Why Factored Execution Matters
+
+ The paper tested 4 execution variants:
+ - **Joint**: Questions and answers in one prompt (worst - repeats hallucinations)
+ - **2-Step**: Separate prompts for questions vs answers (better)
+ - **Factored**: Each question answered separately (recommended)
+ - **Factor+Revise**: Factored + explicit cross-check step (best for longform)
+
+ Without factoring (naive verification):
+ ```
+ Model: "Here's the code"
+ Model: "Let me check my code... looks correct!"  # Confirmation bias
+ ```
+
+ With factored verification:
+ ```
+ Model: "Here's the code"
+ Model: "Question: Does function handle nulls?"
+ [New context, no code visible]
+ Model: "Given a function that takes X, null handling requires..."  # Independent reasoning
+ ```
+
+ **Key principle from the paper:** The verifier cannot see the original response, only the verification question and minimal context. This prevents rationalization of errors and breaks the chain of hallucination propagation.
+
+ ### CoVe Integration with Blind Review
+
+ CoVe operates BEFORE blind review as a self-correction step:
+
+ ```
+ Developer Code --> CoVe (self-verification) --> Blind Review (3 parallel)
+                               |                              |
+                     Catches errors early            Catches remaining
+                     via factored checking         issues independently
+ ```
+
+ **Combined workflow:**
+ ```yaml
+ quality_pipeline:
+   phase_1_cove:
+     # Developer runs CoVe on their own code
+     draft: "Initial implementation"
+     verify: "Self-generated questions, factored execution"
+     revise: "Corrected implementation"
+
+   phase_2_blind_review:
+     # 3 independent reviewers (no access to CoVe results)
+     reviewers:
+       - focus: "correctness"
+       - focus: "security"
+       - focus: "performance"
+     # Reviewers see verified code but don't know what was corrected
+
+   phase_3_aggregate:
+     if: "unanimous approval"
+     then: "Devil's Advocate review"
+ ```
+
+ ### Metrics
+
+ Track CoVe effectiveness:
+ ```
+ .loki/metrics/cove/
+ +-- corrections.json       # Issues caught by CoVe before review
+ +-- false_positives.json   # CoVe flags that were actually correct
+ +-- review_reduction.json  # Reviewer findings before/after CoVe adoption
+ ```
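+
+ One possible writer for the first of these files; only the path comes from the layout above, and the record fields are illustrative:
+
+ ```python
+ import json, time
+ from pathlib import Path
+
+ def record_cove_correction(task_id, claim, verdict,
+                            path=".loki/metrics/cove/corrections.json"):
+     # Append one issue caught by CoVe before review (fields are illustrative)
+     p = Path(path)
+     records = json.loads(p.read_text()) if p.exists() else []
+     records.append({"task_id": task_id, "claim": claim,
+                     "verdict": verdict, "ts": time.time()})
+     p.parent.mkdir(parents=True, exist_ok=True)
+     p.write_text(json.dumps(records, indent=2))
+ ```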
+
+ ---
+
  ## Velocity-Quality Feedback Loop (CRITICAL)

  **Research from arXiv 2511.04427v2 - empirical study of 807 repositories.**
@@ -98,6 +282,105 @@ Task(model="sonnet", description="Code review: performance", prompt="Review for

  ---

+ ## Two-Stage Review Protocol
+
+ **Source:** Superpowers (obra) - a 35K+ star GitHub project
+
+ **CRITICAL: Never mix spec compliance and code quality review. They are separate stages.**
+
+ ### Why Separate Stages Matter
+
+ Mixing the stages causes these problems:
+ - **"Technically correct but wrong feature"** - Code is clean, well-tested, and maintainable, but doesn't implement what the spec requires
+ - **Spec drift goes undetected** - Quality reviewers approve beautiful code that solves the wrong problem
+ - **False confidence** - "3 reviewers approved" means nothing if none of them checked spec compliance
+
+ ### Stage 1: Spec Compliance Review
+
+ **Question:** "Does this code implement what the spec requires?"
+
+ ```
+ Review this implementation against the specification.
+
+ Specification:
+ {paste_spec_or_requirements}
+
+ Implementation:
+ {paste_code_or_diff}
+
+ Check ONLY the following:
+ 1. Does the code implement ALL required features from the spec?
+ 2. Does the code implement ONLY what the spec requires (no scope creep)?
+ 3. Are edge cases from the spec handled?
+ 4. Do the tests verify spec requirements?
+
+ DO NOT review code quality, style, or maintainability.
+ Output: PASS/FAIL with specific spec violations listed.
+ ```
+
+ **Stage 1 must PASS before proceeding to Stage 2.**
+
+ ### Stage 2: Code Quality Review
+
+ **Question:** "Is this code well-written, maintainable, secure?"
+
+ ```
+ Review this code for quality. Spec compliance has already been verified.
+
+ Code:
+ {paste_code_or_diff}
+
+ Check the following:
+ 1. Is the code readable and maintainable?
+ 2. Are there security vulnerabilities?
+ 3. Is error handling appropriate?
+ 4. Are there performance concerns?
+ 5. Does it follow project conventions?
+
+ DO NOT verify spec compliance (already done).
+ Output: PASS/FAIL with specific issues listed by severity.
+ ```
+
+ ### Implementation in Loki Mode
+
+ ```yaml
+ two_stage_review:
+   stage_1_spec:
+     reviewer_count: 1  # Spec compliance is objective
+     model: "sonnet"
+     must_pass: true
+     blocks: "stage_2"
+
+   stage_2_quality:
+     reviewer_count: 3  # Quality is subjective, use blind review
+     model: "sonnet"
+     must_pass: true
+     follows: "stage_1"
+     anti_sycophancy: true  # Devil's advocate on unanimous
+
+   on_stage_1_fail:
+     action: "Return to implementation, DO NOT proceed to Stage 2"
+     reason: "Quality review of wrong feature wastes resources"
+
+   on_stage_2_fail:
+     action: "Fix quality issues, re-run Stage 2 only"
+     reason: "Spec compliance already verified"
+ ```
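+
+ A sketch of the gating logic this config implies, with hypothetical `run_review` and `fix` helpers; the point is that Stage 1 hard-blocks Stage 2, while a Stage 2 failure loops on Stage 2 only:
+
+ ```python
+ def two_stage_review(spec, code):
+     # Stage 1: spec compliance (objective, single reviewer, must pass)
+     stage_1 = run_review(kind="spec_compliance", spec=spec, code=code)
+     if not stage_1.passed:
+         return "return_to_implementation"  # never reaches Stage 2
+
+     # Stage 2: quality (subjective: 3 blind reviewers + devil's advocate)
+     while True:
+         stage_2 = run_review(kind="quality", code=code,
+                              reviewers=3, anti_sycophancy=True)
+         if stage_2.passed:
+             return "approved"
+         code = fix(code, stage_2.issues)   # re-run Stage 2 only
+ ```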
+
+ ### Common Anti-Pattern
+
+ ```
+ # WRONG - Mixed review
+ Task(prompt="Review for correctness, security, performance, and spec compliance...")
+
+ # RIGHT - Separate stages
+ Task(prompt="Stage 1: Check spec compliance ONLY...")
+ # Wait for pass
+ Task(prompt="Stage 2: Check code quality ONLY...")
+ ```
+
+ ---
+
  ## Severity-Based Blocking

  | Severity | Action |