loki-mode 5.51.0 → 5.52.1
- package/README.md +4 -56
- package/SKILL.md +2 -2
- package/VERSION +1 -1
- package/autonomy/hooks/validate-bash.sh +5 -2
- package/dashboard/__init__.py +1 -1
- package/dashboard/server.py +1 -1
- package/docs/INSTALLATION.md +1 -1
- package/docs/alternative-installations.md +3 -3
- package/docs/certification/01-core-concepts/lab.md +174 -0
- package/docs/certification/01-core-concepts/lesson.md +182 -0
- package/docs/certification/01-core-concepts/quiz.md +93 -0
- package/docs/certification/02-enterprise-features/lab.md +154 -0
- package/docs/certification/02-enterprise-features/lesson.md +202 -0
- package/docs/certification/02-enterprise-features/quiz.md +93 -0
- package/docs/certification/03-advanced-patterns/lab.md +138 -0
- package/docs/certification/03-advanced-patterns/lesson.md +199 -0
- package/docs/certification/03-advanced-patterns/quiz.md +93 -0
- package/docs/certification/04-production-deployment/lab.md +160 -0
- package/docs/certification/04-production-deployment/lesson.md +261 -0
- package/docs/certification/04-production-deployment/quiz.md +93 -0
- package/docs/certification/05-troubleshooting/lab.md +254 -0
- package/docs/certification/05-troubleshooting/lesson.md +266 -0
- package/docs/certification/05-troubleshooting/quiz.md +93 -0
- package/docs/certification/README.md +80 -0
- package/docs/certification/answer-key.md +117 -0
- package/docs/certification/certification-exam.md +471 -0
- package/docs/certification/sample-prds/microservices-platform.md +100 -0
- package/docs/certification/sample-prds/saas-dashboard.md +60 -0
- package/docs/certification/sample-prds/todo-app.md +44 -0
- package/mcp/__init__.py +1 -1
- package/mcp/server.py +230 -0
- package/package.json +1 -1
- package/src/plugins/agent-plugin.js +123 -0
- package/src/plugins/gate-plugin.js +153 -0
- package/src/plugins/index.js +116 -0
- package/src/plugins/integration-plugin.js +174 -0
- package/src/plugins/loader.js +275 -0
- package/src/plugins/mcp-plugin.js +190 -0
- package/src/plugins/schemas/agent.json +59 -0
- package/src/plugins/schemas/integration.json +62 -0
- package/src/plugins/schemas/mcp_tool.json +73 -0
- package/src/plugins/schemas/quality_gate.json +52 -0
- package/src/plugins/validator.js +297 -0
- /package/dashboard/{secrets.py → app_secrets.py} +0 -0
@@ -0,0 +1,93 @@
# Module 4 Quiz: Production Deployment

Answer each question by selecting the best option (A, B, C, or D).

---

**Question 1:** What base image does the Loki Mode Dockerfile use?

A) Alpine Linux 3.19
B) Node.js 20 official image
C) Ubuntu 24.04
D) Debian Bookworm

---

**Question 2:** Which volume mount in docker-compose.yml gives the container read-write access?

A) `~/.gitconfig:/home/loki/.gitconfig`
B) `.:/workspace:rw`
C) `~/.ssh:/home/loki/.ssh`
D) `~/.config/gh:/home/loki/.config/gh`

---

**Question 3:** What does `LOKI_STAGED_AUTONOMY=true` do?

A) Enables parallel agent execution in stages
B) Requires human approval before execution
C) Stages deployment across multiple environments
D) Enables incremental feature rollout

---

**Question 4:** What is the default maximum number of parallel agents?

A) 3
B) 5
C) 10
D) 20

---

**Question 5:** How do you set a cost budget limit for a Loki Mode session?

A) `loki start --max-cost 10`
B) `loki start --budget 10.00 ./prd.md`
C) `LOKI_COST_LIMIT=10 loki start`
D) `loki config set budget 10.00`

---

**Question 6:** What does the completion council do?

A) Reviews all code changes before they are committed
B) Votes on whether the project is truly complete to prevent premature termination
C) Manages the deployment pipeline approval process
D) Assigns tasks to available agents

---

**Question 7:** What is the default dashboard port?

A) 3000
B) 8080
C) 57374
D) 9090

---

**Question 8:** Which environment variables enable TLS for the dashboard?

A) `LOKI_HTTPS=true` and `LOKI_HTTPS_PORT=443`
B) `LOKI_TLS_CERT` and `LOKI_TLS_KEY`
C) `LOKI_SSL_CERT` and `LOKI_SSL_KEY`
D) `LOKI_DASHBOARD_TLS=true`

---

**Question 9:** What does `LOKI_COUNCIL_STAGNATION_LIMIT=5` mean?

A) The council can only reject completion 5 times
B) After 5 iterations with no git changes, stagnation is flagged
C) The council checks every 5 minutes
D) Maximum 5 council members can vote

---

**Question 10:** How do you restrict which directories agents can modify?

A) `LOKI_READ_ONLY_PATHS=/etc,/usr`
B) `LOKI_ALLOWED_PATHS=/workspace/src,/workspace/tests`
C) `LOKI_SANDBOX_PATHS=/safe/dir`
D) `LOKI_WRITE_DIRS=src,tests`
@@ -0,0 +1,254 @@
# Module 5 Lab: Diagnose and Troubleshoot

## Objective

Practice inspecting Loki Mode state files, interpreting circuit breaker status, examining the dead-letter queue, and using recovery procedures.

## Prerequisites

- Loki Mode installed (`npm install -g loki-mode`)
- `jq` installed for JSON inspection
- Familiarity with the `.loki/` directory structure (Module 1)

## Step 1: Create a Simulated `.loki/` State

Create a mock `.loki/` directory with sample state files to practice inspection:

```bash
mkdir -p /tmp/troubleshoot-lab && cd /tmp/troubleshoot-lab
git init

# Create the .loki directory structure
mkdir -p .loki/{state,queue,signals,memory/episodic,memory/semantic,logs}
```

Create a sample orchestrator state:

```bash
cat > .loki/state/orchestrator.json << 'EOF'
{
  "currentPhase": "DEVELOPMENT",
  "tasksCompleted": 12,
  "tasksFailed": 3,
  "totalTasks": 20,
  "startedAt": "2026-02-20T10:00:00Z",
  "lastUpdated": "2026-02-20T14:30:00Z"
}
EOF
```

Create a sample circuit breaker file:

```bash
cat > .loki/state/circuit-breakers.json << 'EOF'
{
  "api/claude": {
    "state": "CLOSED",
    "failure_count": 0,
    "success_count": 0,
    "last_failure_time": null,
    "last_state_change": "2026-02-20T10:00:00Z",
    "cooldown_until": null,
    "failure_window_start": null
  },
  "api/openai": {
    "state": "OPEN",
    "failure_count": 3,
    "success_count": 0,
    "last_failure_time": "2026-02-20T14:25:42Z",
    "last_state_change": "2026-02-20T14:25:42Z",
    "cooldown_until": "2026-02-20T14:30:42Z",
    "failure_window_start": "2026-02-20T14:24:50Z"
  },
  "api/gemini": {
    "state": "HALF_OPEN",
    "failure_count": 0,
    "success_count": 1,
    "last_failure_time": "2026-02-20T14:10:00Z",
    "last_state_change": "2026-02-20T14:15:00Z",
    "cooldown_until": null,
    "failure_window_start": null
  }
}
EOF
```

Create a sample dead-letter queue:

```bash
cat > .loki/queue/dead-letter.json << 'EOF'
{
  "tasks": [
    {
      "task_id": "task-007",
      "original_queue": "in-progress",
      "failure_count": 5,
      "first_failure": "2026-02-20T11:00:00Z",
      "last_failure": "2026-02-20T14:00:00Z",
      "error_summary": "Database migration script fails on foreign key constraint",
      "attempts": [
        {
          "attempt_number": 1,
          "timestamp": "2026-02-20T11:00:00Z",
          "approach": "Direct ALTER TABLE with constraint",
          "error_type": "validation",
          "error_message": "ERROR: cannot add foreign key constraint - referenced table not yet created",
          "agent_id": "eng-database-001"
        },
        {
          "attempt_number": 5,
          "timestamp": "2026-02-20T14:00:00Z",
          "approach": "Deferred constraint with migration ordering",
          "error_type": "validation",
          "error_message": "ERROR: circular dependency between users and organizations tables",
          "agent_id": "eng-database-001"
        }
      ],
      "recovery_strategy": "requires_human_review",
      "task_data": {
        "title": "Create database migration for user-organization relationship",
        "description": "Add foreign keys between users and organizations tables",
        "dependencies": ["task-005"],
        "priority": "high"
      }
    }
  ],
  "metadata": {
    "last_reviewed": "2026-02-20T08:00:00Z",
    "total_abandoned": 0,
    "total_recovered": 2
  }
}
EOF
```

## Step 2: Inspect Circuit Breaker State

Practice reading circuit breaker state:

```bash
# List all circuit breaker states
cat .loki/state/circuit-breakers.json | jq 'to_entries[] | {api: .key, state: .value.state}'

# Find which APIs are in OPEN state
cat .loki/state/circuit-breakers.json | jq 'to_entries[] | select(.value.state == "OPEN") | .key'

# Check the cooldown time for the OPEN circuit
cat .loki/state/circuit-breakers.json | jq '.["api/openai"].cooldown_until'

# Check how many successes the HALF_OPEN circuit has recorded so far
cat .loki/state/circuit-breakers.json | jq '.["api/gemini"].success_count'
```

**Questions to answer:**
1. Which API is currently blocked (OPEN)?
2. When does the cooldown expire?
3. How many more successes does the HALF_OPEN circuit need to return to CLOSED? (Answer: 2 more, needs 3 total)
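
The same three questions can be answered programmatically. The following is a minimal Python sketch, not part of Loki Mode: it assumes the 300-second cooldown and 3-success threshold described in the Module 5 lesson, and the helper names (`open_apis`, `expected_half_open_at`, `successes_remaining`) are hypothetical. The embedded data is a trimmed copy of the sample file created in Step 1.

```python
import json
from datetime import datetime, timedelta, timezone

# Assumed thresholds, taken from the Module 5 lesson's state table.
COOLDOWN_SECONDS = 300
SUCCESSES_TO_CLOSE = 3

# Trimmed copy of the lab's circuit-breakers.json (illustrative data only).
SAMPLE = json.loads("""
{
  "api/openai": {"state": "OPEN", "success_count": 0,
                 "last_state_change": "2026-02-20T14:25:42Z"},
  "api/gemini": {"state": "HALF_OPEN", "success_count": 1,
                 "last_state_change": "2026-02-20T14:15:00Z"}
}
""")


def parse_utc(stamp):
    """Parse a 'YYYY-MM-DDTHH:MM:SSZ' timestamp as an aware UTC datetime."""
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def open_apis(breakers):
    """Return the names of all breakers currently in the OPEN state."""
    return sorted(name for name, b in breakers.items() if b["state"] == "OPEN")


def expected_half_open_at(breaker):
    """Time at which an OPEN breaker becomes eligible for HALF_OPEN."""
    opened_at = parse_utc(breaker["last_state_change"])
    return opened_at + timedelta(seconds=COOLDOWN_SECONDS)


def successes_remaining(breaker):
    """Successful probes still needed before a HALF_OPEN breaker closes."""
    return max(0, SUCCESSES_TO_CLOSE - breaker["success_count"])


print(open_apis(SAMPLE))
print(expected_half_open_at(SAMPLE["api/openai"]).isoformat())
print(successes_remaining(SAMPLE["api/gemini"]))
```

In a real session the dictionary would be loaded from `.loki/state/circuit-breakers.json` with `json.load` instead of the inline sample.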

## Step 3: Analyze the Dead-Letter Queue

```bash
# Count tasks in dead-letter
cat .loki/queue/dead-letter.json | jq '.tasks | length'

# View the error summary
cat .loki/queue/dead-letter.json | jq '.tasks[0].error_summary'

# View all attempts for the first task
cat .loki/queue/dead-letter.json | jq '.tasks[0].attempts'

# Check the recovery strategy
cat .loki/queue/dead-letter.json | jq '.tasks[0].recovery_strategy'

# Check when the queue was last reviewed
cat .loki/queue/dead-letter.json | jq '.metadata.last_reviewed'
```

**Questions to answer:**
1. What is the root cause of the failure?
2. What recovery strategy is assigned?
3. Is the `last_reviewed` timestamp more than 24 hours old? (If so, the queue should be processed before new work.)
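
The 24-hour check in question 3 is easy to get wrong by eye, so here is a small Python sketch of it. The function name `is_review_stale` and the fixed `now` value are illustrative assumptions; only the 24-hour threshold and the timestamp format come from the lab.

```python
from datetime import datetime, timedelta, timezone

# Threshold from question 3: a review older than this is considered stale.
REVIEW_MAX_AGE = timedelta(hours=24)


def is_review_stale(last_reviewed, now):
    """Return True when last_reviewed is more than 24 hours before now."""
    reviewed_at = datetime.strptime(last_reviewed, "%Y-%m-%dT%H:%M:%SZ")
    reviewed_at = reviewed_at.replace(tzinfo=timezone.utc)
    return now - reviewed_at > REVIEW_MAX_AGE


# Hypothetical "current time" chosen for the example, 25 hours after the review.
now = datetime(2026, 2, 21, 9, 0, tzinfo=timezone.utc)
print(is_review_stale("2026-02-20T08:00:00Z", now))
```

Passing `now` explicitly keeps the check deterministic; a live check would pass `datetime.now(timezone.utc)`.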

## Step 4: Simulate Signal Files

Create signal files and understand their purpose:

```bash
# Simulate a PAUSE signal
touch .loki/signals/PAUSE
ls .loki/signals/
# In a real session, the agent would stop after the current iteration

# Remove PAUSE
rm .loki/signals/PAUSE

# Simulate a DRIFT_DETECTED signal
cat >> .loki/signals/DRIFT_DETECTED << 'EOF'
{"timestamp":"2026-02-20T14:30:00Z","task_id":"task-012","severity":"medium","detected_drift":"Started optimizing CSS instead of implementing API endpoint"}
{"timestamp":"2026-02-20T14:35:00Z","task_id":"task-012","severity":"medium","detected_drift":"Switched to refactoring tests instead of implementing API endpoint"}
EOF

# Read the drift log
cat .loki/signals/DRIFT_DETECTED | jq -s '.'

# Simulate HUMAN_REVIEW_NEEDED
cat > .loki/signals/HUMAN_REVIEW_NEEDED << 'EOF'
{
  "timestamp": "2026-02-20T14:40:00Z",
  "reason": "security_decision",
  "task_id": "task-015",
  "context": "Requires AWS production credentials for deployment",
  "severity": "critical",
  "blocking": true
}
EOF

cat .loki/signals/HUMAN_REVIEW_NEEDED | jq .
```

## Step 5: Practice Recovery Commands

Use the Loki Mode CLI recovery commands:

```bash
# Check current status
loki status

# Reset commands (these work on actual .loki/ state):
# loki reset retries -- Reset retry counters
# loki reset failed -- Reset failed task status
# loki reset all -- Reset all session state

# View logs
loki logs
```

## Step 6: Inspect Orchestrator State

```bash
# View current phase
cat .loki/state/orchestrator.json | jq '.currentPhase'

# Calculate progress
cat .loki/state/orchestrator.json | jq '{
  phase: .currentPhase,
  progress: "\(.tasksCompleted)/\(.totalTasks)",
  failed: .tasksFailed
}'
```

## Verification Checklist

- [ ] You can read and interpret circuit breaker states (CLOSED, OPEN, HALF_OPEN)
- [ ] You can calculate when an OPEN circuit will transition to HALF_OPEN
- [ ] You can inspect dead-letter queue tasks and identify recovery strategies
- [ ] You understand the drift detection signal and its accumulation thresholds
- [ ] You know which signal files exist and what they trigger
- [ ] You can use `loki status`, `loki logs`, and `loki reset` for recovery

## Cleanup

```bash
cd ~
rm -rf /tmp/troubleshoot-lab
```
@@ -0,0 +1,266 @@
# Module 5: Troubleshooting

## Overview

This module covers diagnosing and resolving common issues in Loki Mode: gate failures, session conflicts, circuit breakers, dead-letter queue processing, signal handling, and recovery procedures. The primary reference is `skills/troubleshooting.md`.

## Common Issues

| Issue | Cause | Solution |
|-------|-------|----------|
| Agent stuck / no progress | Lost context | Read `.loki/CONTINUITY.md` at session start |
| Task repeating | Not checking queue state | Check `.loki/queue/*.json` before claiming |
| Code review failing | Skipped static analysis | Run static analysis BEFORE AI reviewers |
| Tests failing after merge | Skipped quality gates | Never bypass severity-based blocking |
| Rate limit hit | Too many parallel agents | Check circuit breakers, use exponential backoff |
| Cannot find what to do | Not following RARV cycle | Check `orchestrator.json`, follow decision tree |

## Quality Gate Failures

When a quality gate fails, identify which gate triggered the failure:

**Gates 1-6 (Review gates):**
- Check the review output for severity levels
- Critical/High/Medium = BLOCK (must fix)
- Low/Cosmetic = TODO (informational)
- If all 3 reviewers pass unanimously, Gate 4 runs Devil's Advocate
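
The severity rule above amounts to a two-way classification. A minimal Python sketch follows; the function names and the lowercase severity labels are assumptions made for illustration, not Loki Mode's actual review format.

```python
# Severity levels that block the gate versus those that are only recorded.
BLOCKING = {"critical", "high", "medium"}
INFORMATIONAL = {"low", "cosmetic"}


def classify_finding(severity):
    """Map a review severity to BLOCK (must fix) or TODO (informational)."""
    level = severity.strip().lower()
    if level in BLOCKING:
        return "BLOCK"
    if level in INFORMATIONAL:
        return "TODO"
    raise ValueError(f"unknown severity: {severity!r}")


def review_passes(severities):
    """A review passes only when no finding is classified as BLOCK."""
    return all(classify_finding(s) != "BLOCK" for s in severities)


print(classify_finding("High"))
print(review_passes(["low", "cosmetic"]))
print(review_passes(["low", "Medium"]))
```

Raising on an unknown severity is a deliberate choice in this sketch: silently treating an unrecognized level as informational would let a blocking issue slip through.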

**Gate 7 (Test coverage):**
- Unit tests must have 100% pass rate and >80% coverage
- Integration tests must have 100% pass rate
- Fix failing tests before proceeding (never delete or skip tests)
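
As a rough illustration, the Gate 7 thresholds can be expressed as a single predicate. This is a sketch under stated assumptions: the function name and parameters are hypothetical, and test results are passed as counts so that "100% pass rate" is an exact integer comparison rather than a floating-point one.

```python
def gate7_passes(unit_passed, unit_total, unit_coverage_pct,
                 integration_passed, integration_total):
    """Check the Gate 7 thresholds described above."""
    unit_ok = unit_total > 0 and unit_passed == unit_total
    coverage_ok = unit_coverage_pct > 80.0  # strictly greater than 80%
    integration_ok = integration_passed == integration_total
    return unit_ok and coverage_ok and integration_ok


print(gate7_passes(120, 120, 85.5, 30, 30))
print(gate7_passes(120, 120, 80.0, 30, 30))
print(gate7_passes(119, 120, 95.0, 30, 30))
```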

**Gate 8 (Mock detector):**
- Runs `tests/detect-mock-problems.sh`
- Flags tests that mock internal modules instead of using real code
- Flags tautological assertions and high internal mock ratios
- Disable with `LOKI_GATE_MOCK_DETECTOR=false` (not recommended)

**Gate 9 (Test mutation detector):**
- Runs `tests/detect-test-mutations.sh`
- Detects assertion values changed alongside implementation (test fitting)
- Detects low assertion density and missing pass/fail tracking
- Disable with `LOKI_GATE_MUTATION_DETECTOR=false` (not recommended)

## Circuit Breaker System

The circuit breaker prevents cascading failures when API providers are unavailable. State is tracked in `.loki/state/circuit-breakers.json`.

### States

| State | Behavior | Transitions |
|-------|----------|-------------|
| **CLOSED** | Normal operation, all requests pass | -> OPEN after 3 failures in 60s |
| **OPEN** | All requests blocked | -> HALF_OPEN after 300s cooldown |
| **HALF_OPEN** | Limited probe requests | -> CLOSED after 3 successes; -> OPEN on any failure |
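
To make the transitions in the table concrete, here is a minimal in-memory state machine in Python. It is a sketch, not Loki Mode's implementation: the class and method names are hypothetical, timestamps are plain seconds supplied by the caller, and only the four thresholds come from the table.

```python
from dataclasses import dataclass, field

# Thresholds from the state table above.
FAILURE_THRESHOLD = 3     # failures that trip CLOSED -> OPEN
FAILURE_WINDOW_S = 60.0   # window in which those failures must fall
COOLDOWN_S = 300.0        # time spent OPEN before probing begins
SUCCESS_THRESHOLD = 3     # probe successes that restore CLOSED


@dataclass
class CircuitBreaker:
    """Hypothetical breaker for one API; not the on-disk JSON format."""
    state: str = "CLOSED"
    failure_times: list = field(default_factory=list)
    success_count: int = 0
    cooldown_until: float = 0.0

    def _trip(self, now):
        # Enter OPEN and start a fresh cooldown.
        self.state = "OPEN"
        self.cooldown_until = now + COOLDOWN_S
        self.failure_times = []
        self.success_count = 0

    def allow_request(self, now):
        # An OPEN breaker starts probing once the cooldown has elapsed.
        if self.state == "OPEN" and now >= self.cooldown_until:
            self.state = "HALF_OPEN"
        return self.state != "OPEN"

    def record_failure(self, now):
        if self.state == "OPEN":
            return  # requests are blocked, so there is nothing to count
        if self.state == "HALF_OPEN":
            self._trip(now)  # any probe failure reopens the circuit
            return
        # CLOSED: keep only failures inside the sliding window.
        recent = [t for t in self.failure_times if now - t < FAILURE_WINDOW_S]
        recent.append(now)
        self.failure_times = recent
        if len(recent) >= FAILURE_THRESHOLD:
            self._trip(now)

    def record_success(self):
        if self.state != "HALF_OPEN":
            return
        self.success_count += 1
        if self.success_count >= SUCCESS_THRESHOLD:
            self.state = "CLOSED"
            self.success_count = 0


breaker = CircuitBreaker()
breaker.record_failure(0.0)
breaker.record_failure(10.0)
breaker.record_failure(20.0)
print(breaker.state, breaker.cooldown_until)
```

Passing `now` explicitly keeps the sketch deterministic and easy to test; a production breaker would typically read a monotonic clock instead.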

### Inspecting Circuit Breaker State

```bash
cat .loki/state/circuit-breakers.json | jq .
```

Example output:

```json
{
  "api/claude": {
    "state": "CLOSED",
    "failure_count": 0,
    "last_failure_time": null,
    "cooldown_until": null
  },
  "api/openai": {
    "state": "OPEN",
    "failure_count": 3,
    "last_failure_time": "2025-01-20T10:35:42Z",
    "cooldown_until": "2025-01-20T10:40:42Z"
  }
}
```

### Recovery Protocol

When a circuit breaker is OPEN:
1. Check the `cooldown_until` timestamp
2. Reduce parallel agent count (e.g., from 10 to 2)
3. Disable non-critical background operations
4. Wait for HALF_OPEN state
5. Monitor probe request results
6. After CLOSED state is restored, gradually increase parallelism

## Dead-Letter Queue

Tasks that fail 5+ times are moved to `.loki/queue/dead-letter.json`. This prevents infinite retry loops.

### Inspecting the Dead-Letter Queue

```bash
cat .loki/queue/dead-letter.json | jq '.tasks | length'  # Count failed tasks
cat .loki/queue/dead-letter.json | jq '.tasks[0]'        # View first failed task
```

### Recovery Strategies

| Strategy | When to Use |
|----------|-------------|
| `retry_with_simpler_approach` | Complex implementation failed multiple times |
| `dependency_blocked` | Task needs output from another failed task |
| `requires_human_review` | Security decision, unclear spec, or irreversible action |
| `permanent_abandon` | 10+ attempts, or same error across 3 different approaches |

### Retry Conditions

A dead-letter task can be retried when:
- A dependency that was blocking it is now available
- A new approach has been identified
- A simpler scope has been defined
- A blocking bug has been fixed

### Permanent Abandon Criteria

Move to `.loki/queue/abandoned.json` when:
- 10+ total attempts across all strategies
- Same error with 3 different approaches
- Dependency will never be available
- Scope is no longer relevant
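
The abandon criteria above can be sketched as a small decision function. This is an illustration, not Loki Mode code: `should_abandon` is a hypothetical name, the `approach` and `error_message` fields follow the sample dead-letter entry used in the Module 5 lab, and the sketch assumes the `attempts` list is complete (one entry per attempt).

```python
from collections import defaultdict

MAX_TOTAL_ATTEMPTS = 10         # "10+ total attempts across all strategies"
MAX_APPROACHES_SAME_ERROR = 3   # "same error with 3 different approaches"


def should_abandon(attempts, dependency_never_available=False,
                   scope_obsolete=False):
    """Return True when any permanent-abandon criterion is met."""
    # Collect the distinct approaches that ended in each error message.
    approaches_by_error = defaultdict(set)
    for attempt in attempts:
        approaches_by_error[attempt["error_message"]].add(attempt["approach"])
    same_error_repeated = any(
        len(approaches) >= MAX_APPROACHES_SAME_ERROR
        for approaches in approaches_by_error.values()
    )
    return (
        len(attempts) >= MAX_TOTAL_ATTEMPTS
        or same_error_repeated
        or dependency_never_available
        or scope_obsolete
    )


# Illustrative data: three different approaches, one identical error.
stuck = [{"approach": f"approach-{i}", "error_message": "circular dependency"}
         for i in range(3)]
print(should_abandon(stuck))
```

Matching errors by exact message is the simplest possible rule; a real system might normalize messages or compare an error type instead.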

## Signal Processing

Signals in `.loki/signals/` are inter-process communication files. Key signals:

### PAUSE and STOP

```bash
# Pause after current iteration
touch .loki/PAUSE

# Stop immediately
touch .loki/STOP
```

Or use CLI commands:

```bash
loki pause
loki stop
```

### DRIFT_DETECTED

Recorded when an agent's actions diverge from the task goal. The file is append-only (JSON lines format).

```json
{
  "timestamp": "2026-01-25T10:30:00Z",
  "task_id": "task-042",
  "severity": "medium",
  "detected_drift": "Started refactoring database schema instead of implementing auth"
}
```

Processing rules:
- 1 drift: Log warning, continue with correction
- 2 drifts on same task: Escalate to orchestrator
- 3+ accumulated drifts: Trigger context clear and full state reload
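
One way to read the processing rules is as a function from the drift log to an action. The Python sketch below makes two assumptions that the text does not settle: "accumulated" is counted across the whole file regardless of task, and the returned action names are invented for illustration.

```python
import json
from collections import Counter


def drift_action(jsonl_text):
    """Pick an action from a DRIFT_DETECTED log in JSON lines format."""
    events = [json.loads(line) for line in jsonl_text.splitlines()
              if line.strip()]
    if not events:
        return "none"
    if len(events) >= 3:
        return "context_clear"   # 3+ accumulated drifts
    per_task = Counter(event["task_id"] for event in events)
    if max(per_task.values()) >= 2:
        return "escalate"        # 2 drifts on the same task
    return "warn"                # a single drift, or drifts on separate tasks


# Two illustrative events on the same task, one JSON object per line.
log = "\n".join([
    '{"timestamp": "2026-01-25T10:30:00Z", "task_id": "task-042", "severity": "medium"}',
    '{"timestamp": "2026-01-25T10:35:00Z", "task_id": "task-042", "severity": "medium"}',
])
print(drift_action(log))
```

Checking the accumulated count first means the strongest action wins when several rules apply at once.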

### CONTEXT_CLEAR_REQUESTED

Created when the context window becomes heavy. Can be triggered by:
- Agent self-assessment ("context feels heavy")
- After 25+ iterations
- 3+ accumulated DRIFT_DETECTED events
- Same error occurring 3+ times

The wrapper (`run.sh`) handles this by starting a fresh session with injected state from `.loki/CONTINUITY.md`.

### HUMAN_REVIEW_NEEDED

Created when autonomous action is inappropriate:
- Confidence below 0.40 on a critical decision
- Security-critical operations
- Irreversible operations without rollback
- 3+ consecutive failures on the same task

The task is blocked until a human provides input.
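
The four HUMAN_REVIEW_NEEDED triggers can be combined into one predicate. The following Python sketch is illustrative only: the function name and its parameters are hypothetical, and only the 0.40 confidence floor and the 3-failure limit come from the list above.

```python
CONFIDENCE_FLOOR = 0.40       # below this, a critical decision needs a human
MAX_CONSECUTIVE_FAILURES = 3  # failures on the same task before escalating


def needs_human_review(confidence, critical_decision, security_critical,
                       irreversible_without_rollback, consecutive_failures):
    """Return True when any HUMAN_REVIEW_NEEDED trigger applies."""
    low_confidence = critical_decision and confidence < CONFIDENCE_FLOOR
    repeated_failure = consecutive_failures >= MAX_CONSECUTIVE_FAILURES
    return (low_confidence or security_critical
            or irreversible_without_rollback or repeated_failure)


# A confident, routine, reversible task does not need review.
print(needs_human_review(0.9, False, False, False, 0))
# Low confidence on a critical decision does.
print(needs_human_review(0.35, True, False, False, 0))
```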

## Rationalization Detection

Agents can rationalize failures to avoid acknowledging mistakes. Common patterns to watch for:

| Rationalization | Required Action |
|-----------------|-----------------|
| "I'll refactor later" | Refactor now or reduce scope |
| "This is just an edge case" | Handle the edge case |
| "The tests are flaky" | Fix the flaky test first |
| "It works on my machine" | Must pass in CI |
| "This is good enough" | Run full test suite before claiming completion |

**Red flag language patterns:**
- Hedging: "probably", "should be fine", "most likely"
- Minimization: "just a small change", "simple fix", "minor update"
- Verification skipping: Moving to next task without running tests

When rationalization is detected: stop, identify the specific rationalization, apply the required action, and log the attempt to `.loki/memory/episodic/`.

## Recovery Procedures

### Context Loss Recovery

1. Read `.loki/CONTINUITY.md` for current state
2. Check `.loki/state/orchestrator.json` for current phase
3. Review `.loki/queue/in-progress.json` for interrupted tasks
4. Resume from last checkpoint

### Rate Limit Recovery

1. Check circuit breaker state in `.loki/state/circuit-breakers.json`
2. Wait for cooldown period to expire
3. Reduce `LOKI_MAX_PARALLEL_AGENTS`
4. Resume with exponential backoff (base: 5s, max: 300s, multiplier: 2)
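
For reference, the backoff parameters in step 4 produce the delay schedule computed below. This is a minimal Python sketch with a hypothetical function name; it assumes the plain capped form (delay = base × multiplier^attempt, limited to the maximum) with no jitter, since the text does not mention any.

```python
def backoff_delays(retries, base=5.0, multiplier=2.0, max_delay=300.0):
    """Return the wait in seconds before each of the first `retries` retries."""
    return [min(base * multiplier ** attempt, max_delay)
            for attempt in range(retries)]


# Delays double from the 5 s base until they reach the 300 s cap.
print(backoff_delays(8))
```

With these values the cap is reached on the seventh retry, because the uncapped delay would be 320 seconds.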

### Test Failure Recovery

1. Read test output carefully
2. Determine if the test is flaky or a real failure
3. Roll back to last passing commit if needed (`loki checkpoint` can help)
4. Fix the code (never the test) and re-run the full suite

### Session Reset

If the session state becomes corrupted:

```bash
loki reset all      # Reset all session state
loki reset retries  # Reset retry counters only
loki reset failed   # Reset failed task status only
```

## Debugging Tools

### Logs

```bash
loki logs           # Show recent log output
loki status         # Show current session status
loki status --json  # Machine-readable status
```

### Audit Trail

```bash
loki audit  # View recent agent actions
```

### State Inspection

```bash
cat .loki/state/orchestrator.json | jq .      # Current phase and progress
cat .loki/queue/pending.json | jq .           # Pending tasks
cat .loki/queue/dead-letter.json | jq .       # Failed tasks
cat .loki/state/circuit-breakers.json | jq .  # API health
```

## Summary

Troubleshooting Loki Mode involves inspecting the `.loki/` directory state: orchestrator phase, task queues, circuit breakers, signals, and memory. The circuit breaker system prevents cascading API failures. The dead-letter queue captures persistently failing tasks. Signals coordinate between processes. Rationalization detection helps identify when agents are avoiding real problems. Recovery procedures exist for context loss, rate limits, test failures, and corrupted state.
@@ -0,0 +1,93 @@
# Module 5 Quiz: Troubleshooting

Answer each question by selecting the best option (A, B, C, or D).

---

**Question 1:** What are the three states of the circuit breaker system?

A) Active, Inactive, Standby
B) Closed, Open, Half-Open
C) Running, Paused, Stopped
D) Green, Yellow, Red

---

**Question 2:** How many failures within 60 seconds trigger a circuit breaker to OPEN?

A) 1
B) 2
C) 3
D) 5

---

**Question 3:** What is the default cooldown period when a circuit breaker is in the OPEN state?

A) 30 seconds
B) 60 seconds
C) 300 seconds (5 minutes)
D) 600 seconds (10 minutes)

---

**Question 4:** After how many failures is a task moved to the dead-letter queue?

A) 3
B) 5
C) 7
D) 10

---

**Question 5:** What happens when 3 or more DRIFT_DETECTED signals accumulate?

A) The session terminates immediately
B) A context clear is triggered and state is reloaded from scratch
C) The task is moved to the dead-letter queue
D) All agents are stopped and restarted

---

**Question 6:** Which file should an agent read first when recovering from context loss?

A) `.loki/queue/pending.json`
B) `.loki/CONTINUITY.md`
C) `.loki/session.json`
D) `.loki/memory/index.json`

---

**Question 7:** What does the `loki reset retries` command do?

A) Deletes all tasks from the queue
B) Restarts the AI provider CLI
C) Resets retry counters only
D) Removes the entire `.loki/` directory

---

**Question 8:** Which environment variable disables Gate 8 (Mock Detector)?

A) `LOKI_SKIP_MOCK_CHECK=true`
B) `LOKI_GATE_MOCK_DETECTOR=false`
C) `LOKI_DISABLE_GATE_8=true`
D) `LOKI_NO_MOCK_DETECTION=true`

---

**Question 9:** When should a dead-letter task be permanently abandoned?

A) After 3 failed attempts
B) After 5 failed attempts
C) After 10+ total attempts, or same error with 3 different approaches
D) Only when manually deleted by the user

---

**Question 10:** What is a red flag indication that an agent is rationalizing a failure?

A) The agent requests a model upgrade
B) The agent uses language like "probably", "should be fine", or "just a small change"
C) The agent creates a new branch for the fix
D) The agent runs additional tests