loki-mode 5.51.0 → 5.52.1

Files changed (44)
  1. package/README.md +4 -56
  2. package/SKILL.md +2 -2
  3. package/VERSION +1 -1
  4. package/autonomy/hooks/validate-bash.sh +5 -2
  5. package/dashboard/__init__.py +1 -1
  6. package/dashboard/server.py +1 -1
  7. package/docs/INSTALLATION.md +1 -1
  8. package/docs/alternative-installations.md +3 -3
  9. package/docs/certification/01-core-concepts/lab.md +174 -0
  10. package/docs/certification/01-core-concepts/lesson.md +182 -0
  11. package/docs/certification/01-core-concepts/quiz.md +93 -0
  12. package/docs/certification/02-enterprise-features/lab.md +154 -0
  13. package/docs/certification/02-enterprise-features/lesson.md +202 -0
  14. package/docs/certification/02-enterprise-features/quiz.md +93 -0
  15. package/docs/certification/03-advanced-patterns/lab.md +138 -0
  16. package/docs/certification/03-advanced-patterns/lesson.md +199 -0
  17. package/docs/certification/03-advanced-patterns/quiz.md +93 -0
  18. package/docs/certification/04-production-deployment/lab.md +160 -0
  19. package/docs/certification/04-production-deployment/lesson.md +261 -0
  20. package/docs/certification/04-production-deployment/quiz.md +93 -0
  21. package/docs/certification/05-troubleshooting/lab.md +254 -0
  22. package/docs/certification/05-troubleshooting/lesson.md +266 -0
  23. package/docs/certification/05-troubleshooting/quiz.md +93 -0
  24. package/docs/certification/README.md +80 -0
  25. package/docs/certification/answer-key.md +117 -0
  26. package/docs/certification/certification-exam.md +471 -0
  27. package/docs/certification/sample-prds/microservices-platform.md +100 -0
  28. package/docs/certification/sample-prds/saas-dashboard.md +60 -0
  29. package/docs/certification/sample-prds/todo-app.md +44 -0
  30. package/mcp/__init__.py +1 -1
  31. package/mcp/server.py +230 -0
  32. package/package.json +1 -1
  33. package/src/plugins/agent-plugin.js +123 -0
  34. package/src/plugins/gate-plugin.js +153 -0
  35. package/src/plugins/index.js +116 -0
  36. package/src/plugins/integration-plugin.js +174 -0
  37. package/src/plugins/loader.js +275 -0
  38. package/src/plugins/mcp-plugin.js +190 -0
  39. package/src/plugins/schemas/agent.json +59 -0
  40. package/src/plugins/schemas/integration.json +62 -0
  41. package/src/plugins/schemas/mcp_tool.json +73 -0
  42. package/src/plugins/schemas/quality_gate.json +52 -0
  43. package/src/plugins/validator.js +297 -0
  44. package/dashboard/{secrets.py → app_secrets.py} +0 -0
@@ -0,0 +1,93 @@
1
+ # Module 4 Quiz: Production Deployment
2
+
3
+ Answer each question by selecting the best option (A, B, C, or D).
4
+
5
+ ---
6
+
7
+ **Question 1:** What base image does the Loki Mode Dockerfile use?
8
+
9
+ A) Alpine Linux 3.19
10
+ B) Node.js 20 official image
11
+ C) Ubuntu 24.04
12
+ D) Debian Bookworm
13
+
14
+ ---
15
+
16
+ **Question 2:** Which volume mount in docker-compose.yml gives the container read-write access?
17
+
18
+ A) `~/.gitconfig:/home/loki/.gitconfig`
19
+ B) `.:/workspace:rw`
20
+ C) `~/.ssh:/home/loki/.ssh`
21
+ D) `~/.config/gh:/home/loki/.config/gh`
22
+
23
+ ---
24
+
25
+ **Question 3:** What does `LOKI_STAGED_AUTONOMY=true` do?
26
+
27
+ A) Enables parallel agent execution in stages
28
+ B) Requires human approval before execution
29
+ C) Stages deployment across multiple environments
30
+ D) Enables incremental feature rollout
31
+
32
+ ---
33
+
34
+ **Question 4:** What is the default maximum number of parallel agents?
35
+
36
+ A) 3
37
+ B) 5
38
+ C) 10
39
+ D) 20
40
+
41
+ ---
42
+
43
+ **Question 5:** How do you set a cost budget limit for a Loki Mode session?
44
+
45
+ A) `loki start --max-cost 10`
46
+ B) `loki start --budget 10.00 ./prd.md`
47
+ C) `LOKI_COST_LIMIT=10 loki start`
48
+ D) `loki config set budget 10.00`
49
+
50
+ ---
51
+
52
+ **Question 6:** What does the completion council do?
53
+
54
+ A) Reviews all code changes before they are committed
55
+ B) Votes on whether the project is truly complete to prevent premature termination
56
+ C) Manages the deployment pipeline approval process
57
+ D) Assigns tasks to available agents
58
+
59
+ ---
60
+
61
+ **Question 7:** What is the default dashboard port?
62
+
63
+ A) 3000
64
+ B) 8080
65
+ C) 57374
66
+ D) 9090
67
+
68
+ ---
69
+
70
+ **Question 8:** Which environment variables enable TLS for the dashboard?
71
+
72
+ A) `LOKI_HTTPS=true` and `LOKI_HTTPS_PORT=443`
73
+ B) `LOKI_TLS_CERT` and `LOKI_TLS_KEY`
74
+ C) `LOKI_SSL_CERT` and `LOKI_SSL_KEY`
75
+ D) `LOKI_DASHBOARD_TLS=true`
76
+
77
+ ---
78
+
79
+ **Question 9:** What does `LOKI_COUNCIL_STAGNATION_LIMIT=5` mean?
80
+
81
+ A) The council can only reject completion 5 times
82
+ B) After 5 iterations with no git changes, stagnation is flagged
83
+ C) The council checks every 5 minutes
84
+ D) Maximum 5 council members can vote
85
+
86
+ ---
87
+
88
+ **Question 10:** How do you restrict which directories agents can modify?
89
+
90
+ A) `LOKI_READ_ONLY_PATHS=/etc,/usr`
91
+ B) `LOKI_ALLOWED_PATHS=/workspace/src,/workspace/tests`
92
+ C) `LOKI_SANDBOX_PATHS=/safe/dir`
93
+ D) `LOKI_WRITE_DIRS=src,tests`
@@ -0,0 +1,254 @@
1
+ # Module 5 Lab: Diagnose and Troubleshoot
2
+
3
+ ## Objective
4
+
5
+ Practice inspecting Loki Mode state files, interpreting circuit breaker status, examining the dead-letter queue, and using recovery procedures.
6
+
7
+ ## Prerequisites
8
+
9
+ - Loki Mode installed (`npm install -g loki-mode`)
10
+ - `jq` installed for JSON inspection
11
+ - Familiarity with the `.loki/` directory structure (Module 1)
12
+
13
+ ## Step 1: Create a Simulated `.loki/` State
14
+
15
+ Create a mock `.loki/` directory with sample state files to practice inspection:
16
+
17
+ ```bash
18
+ mkdir -p /tmp/troubleshoot-lab && cd /tmp/troubleshoot-lab
19
+ git init
20
+
21
+ # Create the .loki directory structure
22
+ mkdir -p .loki/{state,queue,signals,memory/episodic,memory/semantic,logs}
23
+ ```
24
+
25
+ Create a sample orchestrator state:
26
+
27
+ ```bash
28
+ cat > .loki/state/orchestrator.json << 'EOF'
29
+ {
30
+ "currentPhase": "DEVELOPMENT",
31
+ "tasksCompleted": 12,
32
+ "tasksFailed": 3,
33
+ "totalTasks": 20,
34
+ "startedAt": "2026-02-20T10:00:00Z",
35
+ "lastUpdated": "2026-02-20T14:30:00Z"
36
+ }
37
+ EOF
38
+ ```
39
+
40
+ Create a sample circuit breaker file:
41
+
42
+ ```bash
43
+ cat > .loki/state/circuit-breakers.json << 'EOF'
44
+ {
45
+ "api/claude": {
46
+ "state": "CLOSED",
47
+ "failure_count": 0,
48
+ "success_count": 0,
49
+ "last_failure_time": null,
50
+ "last_state_change": "2026-02-20T10:00:00Z",
51
+ "cooldown_until": null,
52
+ "failure_window_start": null
53
+ },
54
+ "api/openai": {
55
+ "state": "OPEN",
56
+ "failure_count": 3,
57
+ "success_count": 0,
58
+ "last_failure_time": "2026-02-20T14:25:42Z",
59
+ "last_state_change": "2026-02-20T14:25:42Z",
60
+ "cooldown_until": "2026-02-20T14:30:42Z",
61
+ "failure_window_start": "2026-02-20T14:24:50Z"
62
+ },
63
+ "api/gemini": {
64
+ "state": "HALF_OPEN",
65
+ "failure_count": 0,
66
+ "success_count": 1,
67
+ "last_failure_time": "2026-02-20T14:10:00Z",
68
+ "last_state_change": "2026-02-20T14:15:00Z",
69
+ "cooldown_until": null,
70
+ "failure_window_start": null
71
+ }
72
+ }
73
+ EOF
74
+ ```
75
+
76
+ Create a sample dead-letter queue:
77
+
78
+ ```bash
79
+ cat > .loki/queue/dead-letter.json << 'EOF'
80
+ {
81
+ "tasks": [
82
+ {
83
+ "task_id": "task-007",
84
+ "original_queue": "in-progress",
85
+ "failure_count": 5,
86
+ "first_failure": "2026-02-20T11:00:00Z",
87
+ "last_failure": "2026-02-20T14:00:00Z",
88
+ "error_summary": "Database migration script fails on foreign key constraint",
89
+ "attempts": [
90
+ {
91
+ "attempt_number": 1,
92
+ "timestamp": "2026-02-20T11:00:00Z",
93
+ "approach": "Direct ALTER TABLE with constraint",
94
+ "error_type": "validation",
95
+ "error_message": "ERROR: cannot add foreign key constraint - referenced table not yet created",
96
+ "agent_id": "eng-database-001"
97
+ },
98
+ {
99
+ "attempt_number": 5,
100
+ "timestamp": "2026-02-20T14:00:00Z",
101
+ "approach": "Deferred constraint with migration ordering",
102
+ "error_type": "validation",
103
+ "error_message": "ERROR: circular dependency between users and organizations tables",
104
+ "agent_id": "eng-database-001"
105
+ }
106
+ ],
107
+ "recovery_strategy": "requires_human_review",
108
+ "task_data": {
109
+ "title": "Create database migration for user-organization relationship",
110
+ "description": "Add foreign keys between users and organizations tables",
111
+ "dependencies": ["task-005"],
112
+ "priority": "high"
113
+ }
114
+ }
115
+ ],
116
+ "metadata": {
117
+ "last_reviewed": "2026-02-20T08:00:00Z",
118
+ "total_abandoned": 0,
119
+ "total_recovered": 2
120
+ }
121
+ }
122
+ EOF
123
+ ```
124
+
125
+ ## Step 2: Inspect Circuit Breaker State
126
+
127
+ Practice reading circuit breaker state:
128
+
129
+ ```bash
130
+ # List all circuit breaker states
131
+ cat .loki/state/circuit-breakers.json | jq 'to_entries[] | {api: .key, state: .value.state}'
132
+
133
+ # Find which APIs are in OPEN state
134
+ cat .loki/state/circuit-breakers.json | jq 'to_entries[] | select(.value.state == "OPEN") | .key'
135
+
136
+ # Check the cooldown time for the OPEN circuit
137
+ cat .loki/state/circuit-breakers.json | jq '.["api/openai"].cooldown_until'
138
+
139
+ # Check how many successes the HALF_OPEN circuit has recorded so far
140
+ cat .loki/state/circuit-breakers.json | jq '.["api/gemini"].success_count'
141
+ ```
142
+
143
+ **Questions to answer:**
144
+ 1. Which API is currently blocked (OPEN)?
145
+ 2. When does the cooldown expire?
146
+ 3. How many more successes does the HALF_OPEN circuit need to return to CLOSED? (Answer: 2 more, needs 3 total)
147
+
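The three questions can also be answered programmatically. The following is a minimal sketch and not part of Loki Mode: the function names, the `now` value, and the inlined copy of two breaker records are illustrative assumptions, and the threshold of 3 successes is taken from the answer to question 3.

```python
from datetime import datetime, timezone

SUCCESSES_TO_CLOSE = 3  # per the lab answer: HALF_OPEN needs 3 successes in total


def parse_utc(timestamp: str) -> datetime:
    """Parse timestamps of the form used in the lab files, e.g. 2026-02-20T14:30:42Z."""
    return datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def cooldown_remaining_seconds(breaker: dict, now: datetime) -> float:
    """Seconds until the cooldown expires (0.0 if there is none or it already expired)."""
    if breaker.get("cooldown_until") is None:
        return 0.0
    remaining = (parse_utc(breaker["cooldown_until"]) - now).total_seconds()
    return max(remaining, 0.0)


def successes_still_needed(breaker: dict) -> int:
    """Probe successes a HALF_OPEN breaker still needs before it can close."""
    return max(SUCCESSES_TO_CLOSE - breaker.get("success_count", 0), 0)


# Values copied from the sample circuit-breakers.json created in Step 1
breakers = {
    "api/openai": {"state": "OPEN", "cooldown_until": "2026-02-20T14:30:42Z", "success_count": 0},
    "api/gemini": {"state": "HALF_OPEN", "cooldown_until": None, "success_count": 1},
}
now = parse_utc("2026-02-20T14:28:00Z")  # hypothetical "current time" for the exercise

print(cooldown_remaining_seconds(breakers["api/openai"], now))  # 162.0
print(successes_still_needed(breakers["api/gemini"]))           # 2
```

Passing `now` explicitly, rather than reading the system clock inside the function, keeps the calculation reproducible against the fixed timestamps in the lab files.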
148
+ ## Step 3: Analyze the Dead-Letter Queue
149
+
150
+ ```bash
151
+ # Count tasks in dead-letter
152
+ cat .loki/queue/dead-letter.json | jq '.tasks | length'
153
+
154
+ # View the error summary
155
+ cat .loki/queue/dead-letter.json | jq '.tasks[0].error_summary'
156
+
157
+ # View all attempts for the first task
158
+ cat .loki/queue/dead-letter.json | jq '.tasks[0].attempts'
159
+
160
+ # Check the recovery strategy
161
+ cat .loki/queue/dead-letter.json | jq '.tasks[0].recovery_strategy'
162
+
163
+ # Check when the queue was last reviewed
164
+ cat .loki/queue/dead-letter.json | jq '.metadata.last_reviewed'
165
+ ```
166
+
167
+ **Questions to answer:**
168
+ 1. What is the root cause of the failure?
169
+ 2. What recovery strategy is assigned?
170
+ 3. Is the `last_reviewed` timestamp more than 24 hours old? (If so, the queue should be processed before new work.)
171
+
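Question 3 is a timestamp comparison. A minimal sketch of that check follows; the function name and the choice to pass the current time as a parameter are assumptions made for illustration, and the 24-hour interval comes from the question itself.

```python
from datetime import datetime, timedelta, timezone

REVIEW_INTERVAL = timedelta(hours=24)  # threshold stated in question 3


def is_review_overdue(last_reviewed: str, now: datetime) -> bool:
    """Return True if `last_reviewed` is more than 24 hours before `now`."""
    reviewed_at = datetime.strptime(last_reviewed, "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )
    return now - reviewed_at > REVIEW_INTERVAL


# `last_reviewed` from the sample dead-letter.json, checked against the
# orchestrator's `lastUpdated` time from Step 1 (used here as a stand-in for "now")
now = datetime(2026, 2, 20, 14, 30, tzinfo=timezone.utc)
print(is_review_overdue("2026-02-20T08:00:00Z", now))  # False
```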
172
+ ## Step 4: Simulate Signal Files
173
+
174
+ Create signal files and understand their purpose:
175
+
176
+ ```bash
177
+ # Simulate a PAUSE signal
178
+ touch .loki/signals/PAUSE
179
+ ls .loki/signals/
180
+ # In a real session, the agent would stop after the current iteration
181
+
182
+ # Remove PAUSE
183
+ rm .loki/signals/PAUSE
184
+
185
+ # Simulate a DRIFT_DETECTED signal
186
+ cat >> .loki/signals/DRIFT_DETECTED << 'EOF'
187
+ {"timestamp":"2026-02-20T14:30:00Z","task_id":"task-012","severity":"medium","detected_drift":"Started optimizing CSS instead of implementing API endpoint"}
188
+ {"timestamp":"2026-02-20T14:35:00Z","task_id":"task-012","severity":"medium","detected_drift":"Switched to refactoring tests instead of implementing API endpoint"}
189
+ EOF
190
+
191
+ # Read the drift log
192
+ cat .loki/signals/DRIFT_DETECTED | jq -s '.'
193
+
194
+ # Simulate HUMAN_REVIEW_NEEDED
195
+ cat > .loki/signals/HUMAN_REVIEW_NEEDED << 'EOF'
196
+ {
197
+ "timestamp": "2026-02-20T14:40:00Z",
198
+ "reason": "security_decision",
199
+ "task_id": "task-015",
200
+ "context": "Requires AWS production credentials for deployment",
201
+ "severity": "critical",
202
+ "blocking": true
203
+ }
204
+ EOF
205
+
206
+ cat .loki/signals/HUMAN_REVIEW_NEEDED | jq .
207
+ ```
208
+
209
+ ## Step 5: Practice Recovery Commands
210
+
211
+ Use the Loki Mode CLI recovery commands:
212
+
213
+ ```bash
214
+ # Check current status
215
+ loki status
216
+
217
+ # Reset commands (these work on actual .loki/ state):
218
+ # loki reset retries -- Reset retry counters
219
+ # loki reset failed -- Reset failed task status
220
+ # loki reset all -- Reset all session state
221
+
222
+ # View logs
223
+ loki logs
224
+ ```
225
+
226
+ ## Step 6: Inspect Orchestrator State
227
+
228
+ ```bash
229
+ # View current phase
230
+ cat .loki/state/orchestrator.json | jq '.currentPhase'
231
+
232
+ # Calculate progress
233
+ cat .loki/state/orchestrator.json | jq '{
234
+ phase: .currentPhase,
235
+ progress: "\(.tasksCompleted)/\(.totalTasks)",
236
+ failed: .tasksFailed
237
+ }'
238
+ ```
239
+
240
+ ## Verification Checklist
241
+
242
+ - [ ] You can read and interpret circuit breaker states (CLOSED, OPEN, HALF_OPEN)
243
+ - [ ] You can calculate when an OPEN circuit will transition to HALF_OPEN
244
+ - [ ] You can inspect dead-letter queue tasks and identify recovery strategies
245
+ - [ ] You understand the drift detection signal and its accumulation thresholds
246
+ - [ ] You know which signal files exist and what they trigger
247
+ - [ ] You can use `loki status`, `loki logs`, and `loki reset` for recovery
248
+
249
+ ## Cleanup
250
+
251
+ ```bash
252
+ cd ~
253
+ rm -rf /tmp/troubleshoot-lab
254
+ ```
@@ -0,0 +1,266 @@
1
+ # Module 5: Troubleshooting
2
+
3
+ ## Overview
4
+
5
+ This module covers diagnosing and resolving common issues in Loki Mode: gate failures, session conflicts, circuit breakers, dead-letter queue processing, signal handling, and recovery procedures. The primary reference is `skills/troubleshooting.md`.
6
+
7
+ ## Common Issues
8
+
9
+ | Issue | Cause | Solution |
10
+ |-------|-------|----------|
11
+ | Agent stuck / no progress | Lost context | Read `.loki/CONTINUITY.md` at session start |
12
+ | Task repeating | Not checking queue state | Check `.loki/queue/*.json` before claiming |
13
+ | Code review failing | Skipped static analysis | Run static analysis BEFORE AI reviewers |
14
+ | Tests failing after merge | Skipped quality gates | Never bypass severity-based blocking |
15
+ | Rate limit hit | Too many parallel agents | Check circuit breakers, use exponential backoff |
16
+ | Cannot find what to do | Not following RARV cycle | Check `orchestrator.json`, follow decision tree |
17
+
18
+ ## Quality Gate Failures
19
+
20
+ When a quality gate fails, identify which gate triggered the failure:
21
+
22
+ **Gates 1-6 (Review gates):**
23
+ - Check the review output for severity levels
24
+ - Critical/High/Medium = BLOCK (must fix)
25
+ - Low/Cosmetic = TODO (informational)
26
+ - If all 3 reviewers pass unanimously, Gate 4 runs Devil's Advocate
27
+
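The severity rule for the review gates can be sketched as a small triage function. This is an illustration only: the shape of a finding (a dict with a `severity` key) and the function name are assumptions, not Loki Mode's actual review output format.

```python
from typing import Dict, List

BLOCKING_SEVERITIES = {"critical", "high", "medium"}  # must be fixed before proceeding
TODO_SEVERITIES = {"low", "cosmetic"}                 # informational only


def triage_findings(findings: List[Dict[str, str]]) -> Dict[str, object]:
    """Split review findings into blocking issues and TODO items."""
    blocking = [f for f in findings if f["severity"].lower() in BLOCKING_SEVERITIES]
    todos = [f for f in findings if f["severity"].lower() in TODO_SEVERITIES]
    return {
        "verdict": "BLOCK" if blocking else "PASS",
        "blocking": blocking,
        "todo": todos,
    }


# Hypothetical findings for illustration
result = triage_findings([
    {"severity": "Medium", "note": "unvalidated input"},
    {"severity": "Cosmetic", "note": "inconsistent naming"},
])
print(result["verdict"])  # BLOCK
```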
28
+ **Gate 7 (Test coverage):**
29
+ - Unit tests must have 100% pass rate and >80% coverage
30
+ - Integration tests must have 100% pass rate
31
+ - Fix failing tests before proceeding (never delete or skip tests)
32
+
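The Gate 7 thresholds reduce to a single predicate. The sketch below assumes the test counts and coverage percentage are already available as numbers; the function and parameter names are hypothetical and do not reflect how Loki Mode collects these values.

```python
def gate7_passes(
    unit_passed: int,
    unit_total: int,
    coverage_percent: float,
    integration_passed: int,
    integration_total: int,
) -> bool:
    """Apply the Gate 7 thresholds: 100% pass rates and strictly more than 80% coverage."""
    unit_ok = unit_passed == unit_total
    integration_ok = integration_passed == integration_total
    return unit_ok and integration_ok and coverage_percent > 80.0


print(gate7_passes(120, 120, 84.5, 30, 30))  # True
print(gate7_passes(120, 120, 80.0, 30, 30))  # False: coverage must exceed 80%
```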
33
+ **Gate 8 (Mock detector):**
34
+ - Runs `tests/detect-mock-problems.sh`
35
+ - Flags tests that mock internal modules instead of using real code
36
+ - Flags tautological assertions and high internal mock ratios
37
+ - Disable with `LOKI_GATE_MOCK_DETECTOR=false` (not recommended)
38
+
39
+ **Gate 9 (Test mutation detector):**
40
+ - Runs `tests/detect-test-mutations.sh`
41
+ - Detects assertion values changed alongside implementation (test fitting)
42
+ - Detects low assertion density and missing pass/fail tracking
43
+ - Disable with `LOKI_GATE_MUTATION_DETECTOR=false` (not recommended)
44
+
45
+ ## Circuit Breaker System
46
+
47
+ The circuit breaker prevents cascading failures when API providers are unavailable. State is tracked in `.loki/state/circuit-breakers.json`.
48
+
49
+ ### States
50
+
51
+ | State | Behavior | Transitions |
52
+ |-------|----------|-------------|
53
+ | **CLOSED** | Normal operation, all requests pass | -> OPEN after 3 failures in 60s |
54
+ | **OPEN** | All requests blocked | -> HALF_OPEN after 300s cooldown |
55
+ | **HALF_OPEN** | Limited probe requests | -> CLOSED after 3 successes; -> OPEN on any failure |
56
+
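The transitions in the table can be sketched as a small state machine. This is a minimal illustration under stated assumptions, not Loki Mode's implementation: the `Breaker` class and its method names are hypothetical, times are plain seconds, and the 60-second window is modeled as a fixed window that starts at the first failure.

```python
from dataclasses import dataclass
from typing import Optional

FAILURE_THRESHOLD = 3   # failures within the window that open the circuit
FAILURE_WINDOW_S = 60   # length of the failure window, in seconds
COOLDOWN_S = 300        # time spent in OPEN before probing starts
SUCCESS_THRESHOLD = 3   # probe successes needed to close again


@dataclass
class Breaker:
    state: str = "CLOSED"
    failure_count: int = 0
    success_count: int = 0
    failure_window_start: Optional[float] = None
    cooldown_until: Optional[float] = None

    def _open(self, now: float) -> None:
        self.state = "OPEN"
        self.success_count = 0
        self.cooldown_until = now + COOLDOWN_S

    def allow_request(self, now: float) -> bool:
        """Return True if a request may be sent at time `now`."""
        if self.state == "OPEN" and now >= self.cooldown_until:
            self.state = "HALF_OPEN"  # cooldown elapsed: start probing
            self.failure_count = 0
            self.success_count = 0
        return self.state != "OPEN"

    def record_failure(self, now: float) -> None:
        if self.state == "HALF_OPEN":
            self._open(now)  # any probe failure reopens the circuit
            return
        if self.state == "OPEN":
            return
        # CLOSED: start a new window if none is active or the old one expired
        if (self.failure_window_start is None
                or now - self.failure_window_start > FAILURE_WINDOW_S):
            self.failure_window_start = now
            self.failure_count = 0
        self.failure_count += 1
        if self.failure_count >= FAILURE_THRESHOLD:
            self._open(now)

    def record_success(self) -> None:
        if self.state != "HALF_OPEN":
            return
        self.success_count += 1
        if self.success_count >= SUCCESS_THRESHOLD:
            self.state = "CLOSED"
            self.failure_count = 0
            self.failure_window_start = None
            self.cooldown_until = None
```

A sliding window would behave slightly differently near window boundaries; the fixed window is used here only because it is the simplest model that matches the table.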
57
+ ### Inspecting Circuit Breaker State
58
+
59
+ ```bash
60
+ cat .loki/state/circuit-breakers.json | jq .
61
+ ```
62
+
63
+ Example output:
64
+
65
+ ```json
66
+ {
67
+ "api/claude": {
68
+ "state": "CLOSED",
69
+ "failure_count": 0,
70
+ "last_failure_time": null,
71
+ "cooldown_until": null
72
+ },
73
+ "api/openai": {
74
+ "state": "OPEN",
75
+ "failure_count": 3,
76
+ "last_failure_time": "2025-01-20T10:35:42Z",
77
+ "cooldown_until": "2025-01-20T10:40:42Z"
78
+ }
79
+ }
80
+ ```
81
+
82
+ ### Recovery Protocol
83
+
84
+ When a circuit breaker is OPEN:
85
+ 1. Check the `cooldown_until` timestamp
86
+ 2. Reduce parallel agent count (e.g., from 10 to 2)
87
+ 3. Disable non-critical background operations
88
+ 4. Wait for HALF_OPEN state
89
+ 5. Monitor probe request results
90
+ 6. After CLOSED state is restored, gradually increase parallelism
91
+
92
+ ## Dead-Letter Queue
93
+
94
+ Tasks that fail 5+ times are moved to `.loki/queue/dead-letter.json`. This prevents infinite retry loops.
95
+
96
+ ### Inspecting the Dead-Letter Queue
97
+
98
+ ```bash
99
+ cat .loki/queue/dead-letter.json | jq '.tasks | length' # Count failed tasks
100
+ cat .loki/queue/dead-letter.json | jq '.tasks[0]' # View first failed task
101
+ ```
102
+
103
+ ### Recovery Strategies
104
+
105
+ | Strategy | When to Use |
106
+ |----------|-------------|
107
+ | `retry_with_simpler_approach` | Complex implementation failed multiple times |
108
+ | `dependency_blocked` | Task needs output from another failed task |
109
+ | `requires_human_review` | Security decision, unclear spec, or irreversible action |
110
+ | `permanent_abandon` | 10+ attempts, or same error across 3 different approaches |
111
+
112
+ ### Retry Conditions
113
+
114
+ A dead-letter task can be retried when:
115
+ - A dependency that was blocking it is now available
116
+ - A new approach has been identified
117
+ - A simpler scope has been defined
118
+ - A blocking bug has been fixed
119
+
120
+ ### Permanent Abandon Criteria
121
+
122
+ Move to `.loki/queue/abandoned.json` when:
123
+ - 10+ total attempts across all strategies
124
+ - Same error with 3 different approaches
125
+ - Dependency will never be available
126
+ - Scope is no longer relevant
127
+
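The two countable criteria (total attempts, and the same error across different approaches) can be checked mechanically. The sketch below is an assumption-laden illustration: it reuses the field names from the sample `dead-letter.json` in the Module 5 lab, treats "same error" as an identical `error_message` string, and leaves the remaining two criteria (unavailable dependency, irrelevant scope) to human judgment.

```python
from collections import defaultdict

MAX_TOTAL_ATTEMPTS = 10
MAX_APPROACHES_PER_ERROR = 3


def should_abandon(task: dict) -> bool:
    """Return True if a dead-letter task meets a countable abandon criterion."""
    attempts = task.get("attempts", [])
    # The attempts list may be abbreviated, so also honor the recorded failure_count
    total_attempts = max(task.get("failure_count", 0), len(attempts))
    if total_attempts >= MAX_TOTAL_ATTEMPTS:
        return True

    # Group distinct approaches by the error they produced
    approaches_by_error = defaultdict(set)
    for attempt in attempts:
        approaches_by_error[attempt["error_message"]].add(attempt["approach"])
    return any(
        len(approaches) >= MAX_APPROACHES_PER_ERROR
        for approaches in approaches_by_error.values()
    )


# Hypothetical task: three different approaches, same error each time
stuck_task = {
    "failure_count": 5,
    "attempts": [
        {"approach": "direct", "error_message": "timeout"},
        {"approach": "batched", "error_message": "timeout"},
        {"approach": "async", "error_message": "timeout"},
    ],
}
print(should_abandon(stuck_task))  # True
```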
128
+ ## Signal Processing
129
+
130
+ Signals in `.loki/signals/` are inter-process communication files. Key signals:
131
+
132
+ ### PAUSE and STOP
133
+
134
+ ```bash
135
+ # Pause after current iteration
136
+ touch .loki/PAUSE
137
+
138
+ # Stop immediately
139
+ touch .loki/STOP
140
+ ```
141
+
142
+ Or use CLI commands:
143
+
144
+ ```bash
145
+ loki pause
146
+ loki stop
147
+ ```
148
+
149
+ ### DRIFT_DETECTED
150
+
151
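+ Recorded when an agent's actions diverge from the task goal. The file is append-only and uses the JSON Lines format: each event is stored as one JSON object on its own line. The example below is pretty-printed for readability.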
152
+
153
+ ```json
154
+ {
155
+ "timestamp": "2026-01-25T10:30:00Z",
156
+ "task_id": "task-042",
157
+ "severity": "medium",
158
+ "detected_drift": "Started refactoring database schema instead of implementing auth"
159
+ }
160
+ ```
161
+
162
+ Processing rules:
163
+ - 1 drift: Log warning, continue with correction
164
+ - 2 drifts on same task: Escalate to orchestrator
165
+ - 3+ accumulated drifts: Trigger context clear and full state reload
166
+
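The processing rules can be sketched as a function over the contents of the `DRIFT_DETECTED` file. This is a hypothetical illustration: the function name and the returned action labels are invented, and "accumulated drifts" is assumed to mean the total number of events in the file across all tasks.

```python
import json
from collections import Counter


def drift_action(jsonl_text: str) -> str:
    """Map the contents of a JSON Lines drift log to one of the processing rules."""
    events = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    if not events:
        return "none"
    if len(events) >= 3:
        return "context_clear"  # 3+ accumulated drifts
    drifts_per_task = Counter(event["task_id"] for event in events)
    if max(drifts_per_task.values()) >= 2:
        return "escalate"       # 2 drifts on the same task
    return "warn"               # otherwise: log a warning and continue


# Two events for the same task, one JSON object per line
log = (
    '{"timestamp":"2026-02-20T14:30:00Z","task_id":"task-012","severity":"medium"}\n'
    '{"timestamp":"2026-02-20T14:35:00Z","task_id":"task-012","severity":"medium"}\n'
)
print(drift_action(log))  # escalate
```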
167
+ ### CONTEXT_CLEAR_REQUESTED
168
+
169
+ Created when the context window becomes heavy. Can be triggered by:
170
+ - Agent self-assessment ("context feels heavy")
171
+ - After 25+ iterations
172
+ - 3+ accumulated DRIFT_DETECTED events
173
+ - Same error occurring 3+ times
174
+
175
+ The wrapper (`run.sh`) handles this by starting a fresh session with injected state from `.loki/CONTINUITY.md`.
176
+
177
+ ### HUMAN_REVIEW_NEEDED
178
+
179
+ Created when autonomous action is inappropriate:
180
+ - Confidence below 0.40 on a critical decision
181
+ - Security-critical operations
182
+ - Irreversible operations without rollback
183
+ - 3+ consecutive failures on the same task
184
+
185
+ The task is blocked until a human provides input.
186
+
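The four conditions combine into a single predicate. In the sketch below, every parameter name is an assumption introduced for illustration; only the thresholds (confidence below 0.40, 3 or more consecutive failures) come from the list above.

```python
CONFIDENCE_FLOOR = 0.40
MAX_CONSECUTIVE_FAILURES = 3


def needs_human_review(
    *,
    confidence: float = 1.0,
    critical_decision: bool = False,
    security_critical: bool = False,
    irreversible: bool = False,
    has_rollback: bool = True,
    consecutive_failures: int = 0,
) -> bool:
    """Return True if any HUMAN_REVIEW_NEEDED condition holds."""
    return (
        (critical_decision and confidence < CONFIDENCE_FLOOR)
        or security_critical
        or (irreversible and not has_rollback)
        or consecutive_failures >= MAX_CONSECUTIVE_FAILURES
    )


print(needs_human_review(critical_decision=True, confidence=0.35))  # True
print(needs_human_review(irreversible=True, has_rollback=True))     # False
```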
187
+ ## Rationalization Detection
188
+
189
+ Agents can rationalize failures to avoid acknowledging mistakes. Common patterns to watch for:
190
+
191
+ | Rationalization | Required Action |
192
+ |-----------------|-----------------|
193
+ | "I'll refactor later" | Refactor now or reduce scope |
194
+ | "This is just an edge case" | Handle the edge case |
195
+ | "The tests are flaky" | Fix the flaky test first |
196
+ | "It works on my machine" | Must pass in CI |
197
+ | "This is good enough" | Run full test suite before claiming completion |
198
+
199
+ **Red flag language patterns:**
200
+ - Hedging: "probably", "should be fine", "most likely"
201
+ - Minimization: "just a small change", "simple fix", "minor update"
202
+ - Verification skipping: Moving to next task without running tests
203
+
204
+ When rationalization is detected: stop, identify the specific rationalization, apply the required action, and log the attempt to `.loki/memory/episodic/`.
205
+
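The hedging and minimization phrases listed above lend themselves to a simple text scan. The following is a rough sketch rather than Loki Mode's detector: the function name is hypothetical, matching is a case-insensitive whole-phrase search, and verification skipping cannot be detected from wording alone, so it is not covered.

```python
import re
from typing import List, Tuple

RED_FLAG_PHRASES = {
    "hedging": ["probably", "should be fine", "most likely"],
    "minimization": ["just a small change", "simple fix", "minor update"],
}


def find_red_flags(message: str) -> List[Tuple[str, str]]:
    """Return (category, phrase) pairs for each red-flag phrase found in `message`."""
    lowered = message.lower()
    hits = []
    for category, phrases in RED_FLAG_PHRASES.items():
        for phrase in phrases:
            # \b keeps "probably" from matching inside a longer word
            if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
                hits.append((category, phrase))
    return hits


print(find_red_flags("This is probably a simple fix."))
# [('hedging', 'probably'), ('minimization', 'simple fix')]
```

A phrase match is only a prompt to look closer; the required actions in the table still depend on what the agent actually did.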
206
+ ## Recovery Procedures
207
+
208
+ ### Context Loss Recovery
209
+
210
+ 1. Read `.loki/CONTINUITY.md` for current state
211
+ 2. Check `.loki/state/orchestrator.json` for current phase
212
+ 3. Review `.loki/queue/in-progress.json` for interrupted tasks
213
+ 4. Resume from last checkpoint
214
+
215
+ ### Rate Limit Recovery
216
+
217
+ 1. Check circuit breaker state in `.loki/state/circuit-breakers.json`
218
+ 2. Wait for cooldown period to expire
219
+ 3. Reduce `LOKI_MAX_PARALLEL_AGENTS`
220
+ 4. Resume with exponential backoff (base: 5s, max: 300s, multiplier: 2)
221
+
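The backoff parameters in step 4 produce a capped geometric schedule. A minimal sketch, assuming the delay for retry `n` (counting from 0) is `base * multiplier ** n` capped at the maximum; the function name is hypothetical, and random jitter, which real implementations often add, is omitted to keep the output deterministic.

```python
from typing import List


def backoff_delays(
    retries: int,
    base: float = 5.0,
    maximum: float = 300.0,
    multiplier: float = 2.0,
) -> List[float]:
    """Return the wait time in seconds before each of `retries` attempts."""
    return [min(base * multiplier ** attempt, maximum) for attempt in range(retries)]


print(backoff_delays(8))
# [5.0, 10.0, 20.0, 40.0, 80.0, 160.0, 300.0, 300.0]
```

With these values the cap is reached on the seventh retry (the uncapped delay would be 320 s), after which every wait is 300 s.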
222
+ ### Test Failure Recovery
223
+
224
+ 1. Read test output carefully
225
+ 2. Determine if the test is flaky or a real failure
226
+ 3. Roll back to last passing commit if needed (`loki checkpoint` can help)
227
+ 4. Fix the code (never the test) and re-run the full suite
228
+
229
+ ### Session Reset
230
+
231
+ If the session state becomes corrupted:
232
+
233
+ ```bash
234
+ loki reset all # Reset all session state
235
+ loki reset retries # Reset retry counters only
236
+ loki reset failed # Reset failed task status only
237
+ ```
238
+
239
+ ## Debugging Tools
240
+
241
+ ### Logs
242
+
243
+ ```bash
244
+ loki logs # Show recent log output
245
+ loki status # Show current session status
246
+ loki status --json # Machine-readable status
247
+ ```
248
+
249
+ ### Audit Trail
250
+
251
+ ```bash
252
+ loki audit # View recent agent actions
253
+ ```
254
+
255
+ ### State Inspection
256
+
257
+ ```bash
258
+ cat .loki/state/orchestrator.json | jq . # Current phase and progress
259
+ cat .loki/queue/pending.json | jq . # Pending tasks
260
+ cat .loki/queue/dead-letter.json | jq . # Failed tasks
261
+ cat .loki/state/circuit-breakers.json | jq . # API health
262
+ ```
263
+
264
+ ## Summary
265
+
266
+ Troubleshooting Loki Mode involves inspecting the `.loki/` directory state: orchestrator phase, task queues, circuit breakers, signals, and memory. The circuit breaker system prevents cascading API failures. The dead-letter queue captures persistently failing tasks. Signals coordinate between processes. Rationalization detection helps identify when agents are avoiding real problems. Recovery procedures exist for context loss, rate limits, test failures, and corrupted state.
@@ -0,0 +1,93 @@
1
+ # Module 5 Quiz: Troubleshooting
2
+
3
+ Answer each question by selecting the best option (A, B, C, or D).
4
+
5
+ ---
6
+
7
+ **Question 1:** What are the three states of the circuit breaker system?
8
+
9
+ A) Active, Inactive, Standby
10
+ B) Closed, Open, Half-Open
11
+ C) Running, Paused, Stopped
12
+ D) Green, Yellow, Red
13
+
14
+ ---
15
+
16
+ **Question 2:** How many failures within 60 seconds trigger a circuit breaker to OPEN?
17
+
18
+ A) 1
19
+ B) 2
20
+ C) 3
21
+ D) 5
22
+
23
+ ---
24
+
25
+ **Question 3:** What is the default cooldown period when a circuit breaker is in the OPEN state?
26
+
27
+ A) 30 seconds
28
+ B) 60 seconds
29
+ C) 300 seconds (5 minutes)
30
+ D) 600 seconds (10 minutes)
31
+
32
+ ---
33
+
34
+ **Question 4:** After how many failures is a task moved to the dead-letter queue?
35
+
36
+ A) 3
37
+ B) 5
38
+ C) 7
39
+ D) 10
40
+
41
+ ---
42
+
43
+ **Question 5:** What happens when 3 or more DRIFT_DETECTED signals accumulate?
44
+
45
+ A) The session terminates immediately
46
+ B) A context clear is triggered and state is reloaded from scratch
47
+ C) The task is moved to the dead-letter queue
48
+ D) All agents are stopped and restarted
49
+
50
+ ---
51
+
52
+ **Question 6:** Which file should an agent read first when recovering from context loss?
53
+
54
+ A) `.loki/queue/pending.json`
55
+ B) `.loki/CONTINUITY.md`
56
+ C) `.loki/session.json`
57
+ D) `.loki/memory/index.json`
58
+
59
+ ---
60
+
61
+ **Question 7:** What does the `loki reset retries` command do?
62
+
63
+ A) Deletes all tasks from the queue
64
+ B) Restarts the AI provider CLI
65
+ C) Resets retry counters only
66
+ D) Removes the entire `.loki/` directory
67
+
68
+ ---
69
+
70
+ **Question 8:** Which environment variable disables Gate 8 (Mock Detector)?
71
+
72
+ A) `LOKI_SKIP_MOCK_CHECK=true`
73
+ B) `LOKI_GATE_MOCK_DETECTOR=false`
74
+ C) `LOKI_DISABLE_GATE_8=true`
75
+ D) `LOKI_NO_MOCK_DETECTION=true`
76
+
77
+ ---
78
+
79
+ **Question 9:** When should a dead-letter task be permanently abandoned?
80
+
81
+ A) After 3 failed attempts
82
+ B) After 5 failed attempts
83
+ C) After 10+ total attempts, or same error with 3 different approaches
84
+ D) Only when manually deleted by the user
85
+
86
+ ---
87
+
88
+ **Question 10:** Which of the following is a red flag that an agent is rationalizing a failure?
89
+
90
+ A) The agent requests a model upgrade
91
+ B) The agent uses language like "probably", "should be fine", or "just a small change"
92
+ C) The agent creates a new branch for the fix
93
+ D) The agent runs additional tests