loki-mode 4.2.0
- package/LICENSE +21 -0
- package/README.md +691 -0
- package/SKILL.md +191 -0
- package/VERSION +1 -0
- package/autonomy/.loki/dashboard/index.html +2634 -0
- package/autonomy/CONSTITUTION.md +508 -0
- package/autonomy/README.md +201 -0
- package/autonomy/config.example.yaml +152 -0
- package/autonomy/loki +526 -0
- package/autonomy/run.sh +3636 -0
- package/bin/loki-mode.js +26 -0
- package/bin/postinstall.js +60 -0
- package/docs/ACKNOWLEDGEMENTS.md +234 -0
- package/docs/COMPARISON.md +325 -0
- package/docs/COMPETITIVE-ANALYSIS.md +333 -0
- package/docs/INSTALLATION.md +547 -0
- package/docs/auto-claude-comparison.md +276 -0
- package/docs/cursor-comparison.md +225 -0
- package/docs/dashboard-guide.md +355 -0
- package/docs/screenshots/README.md +149 -0
- package/docs/screenshots/dashboard-agents.png +0 -0
- package/docs/screenshots/dashboard-tasks.png +0 -0
- package/docs/thick2thin.md +173 -0
- package/package.json +48 -0
- package/references/advanced-patterns.md +453 -0
- package/references/agent-types.md +243 -0
- package/references/agents.md +1043 -0
- package/references/business-ops.md +550 -0
- package/references/competitive-analysis.md +216 -0
- package/references/confidence-routing.md +371 -0
- package/references/core-workflow.md +275 -0
- package/references/cursor-learnings.md +207 -0
- package/references/deployment.md +604 -0
- package/references/lab-research-patterns.md +534 -0
- package/references/mcp-integration.md +186 -0
- package/references/memory-system.md +467 -0
- package/references/openai-patterns.md +647 -0
- package/references/production-patterns.md +568 -0
- package/references/prompt-repetition.md +192 -0
- package/references/quality-control.md +437 -0
- package/references/sdlc-phases.md +410 -0
- package/references/task-queue.md +361 -0
- package/references/tool-orchestration.md +691 -0
- package/skills/00-index.md +120 -0
- package/skills/agents.md +249 -0
- package/skills/artifacts.md +174 -0
- package/skills/github-integration.md +218 -0
- package/skills/model-selection.md +125 -0
- package/skills/parallel-workflows.md +526 -0
- package/skills/patterns-advanced.md +188 -0
- package/skills/production.md +292 -0
- package/skills/quality-gates.md +180 -0
- package/skills/testing.md +149 -0
- package/skills/troubleshooting.md +109 -0
@@ -0,0 +1,216 @@

# Competitive Analysis: Autonomous Coding Systems (January 2026)

## Overview

This document analyzes key competitors and research sources for autonomous coding systems, identifying patterns we've incorporated into Loki Mode.

## Auto-Claude (9,594 stars)

**Repository:** https://github.com/AndyMik90/Auto-Claude

### Key Features
- Electron desktop app with visual Kanban board
- Up to 12 parallel agent terminals
- Git worktrees for isolated workspaces
- Self-validating QA loop (up to 50 iterations)
- AI-powered merge with conflict resolution
- Graphiti-based session memory persistence
- GitHub/GitLab/Linear integration
- Complexity tiers (simple/standard/complex)
- Human intervention: Ctrl+C pause, PAUSE file, HUMAN_INPUT.md

### Architecture
```
Auto-Claude/
  apps/
    backend/      # Python agents
      agents/     # planner, coder, memory_manager, session
      memory/     # codebase_map, graphiti_helpers, sessions
      context/    # Context management
      merge/      # AI-powered merge
    frontend/     # Electron desktop app
```

### Patterns Adopted (v3.4.0)
1. **Human intervention mechanism** - PAUSE, HUMAN_INPUT.md, STOP files
2. **AI-powered merge** - Claude-based conflict resolution
3. **Complexity tiers** - Auto-detect simple/standard/complex
4. **Double Ctrl+C** - Single pause, double exit

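The intervention mechanism amounts to polling for control files between loop iterations. A minimal sketch, assuming the control files live in a `.loki` state directory; the file names follow the Auto-Claude convention above, while `check_intervention` and the directory default are illustrative assumptions, not the actual implementation:

```python
from pathlib import Path

def check_intervention(state_dir: str = ".loki") -> str:
    """Check for control files a human may drop into the state directory.

    Returns "stop", "pause", "input", or "run". STOP wins over PAUSE so a
    hard halt is never masked by an earlier pause request.
    """
    root = Path(state_dir)
    if (root / "STOP").exists():
        return "stop"    # halt the autonomous loop entirely
    if (root / "PAUSE").exists():
        return "pause"   # suspend until the file is removed
    if (root / "HUMAN_INPUT.md").exists():
        return "input"   # read guidance before continuing
    return "run"
```

The main loop would call this once per iteration and branch on the result.
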
### Patterns Not Adopted (and why)
- **Electron GUI** - Loki Mode is CLI-first, reduces dependencies
- **Graphiti memory** - We have episodic/semantic memory, may enhance later
- **Linear integration** - Lower priority, can add via MCP

---

## MemOS (4,483 stars)

**Repository:** https://github.com/MemTensor/MemOS
**Paper:** arXiv:2507.03724

### Key Features
- Memory Operating System for LLMs
- +43.70% accuracy vs OpenAI Memory
- Saves 35.24% memory tokens
- Multi-modal memory (text, images, tool traces)
- Multi-Cube Knowledge Base Management
- Asynchronous ingestion via MemScheduler
- Memory feedback and correction

### Architecture
```
MemOS Key Concepts:
- MemCube: Isolated memory containers
- MemScheduler: Async task scheduling with Redis Streams
- Memory Feedback: Natural language correction of memories
- Graph-based Storage: Neo4j + Qdrant for retrieval
```

### Patterns to Consider
1. **Memory cubes** - Isolate project memories
2. **Memory feedback** - Correct/refine memories via conversation
3. **Async scheduling** - Redis-based task queue (already have similar)
4. **Multi-modal memory** - Store images, tool traces

### Integration Potential
MemOS could replace or enhance our `.loki/memory/` system with:
- More sophisticated retrieval (graph-based)
- Multi-modal storage
- Cross-project memory sharing

---

## Dexter (8,032 stars)

**Repository:** https://github.com/virattt/dexter

### Key Features
- Autonomous financial research agent
- "Claude Code for financial research"
- Intelligent task planning with auto-decomposition
- Self-validation (checks own work, iterates)
- Real-time financial data access
- Safety features: loop detection, step limits

### Architecture
```
Dexter Patterns:
- Task Planning: Complex queries -> structured research steps
- Tool Selection: Autonomous tool choice for data gathering
- Self-Validation: Results verification before completion
- Safety: Loop detection prevents infinite cycles
```

### Patterns Adopted
1. **Loop detection** - Already have max iterations, circuit breakers
2. **Self-validation** - RARV cycle covers this
3. **Task decomposition** - Orchestrator handles this

### Domain-Specific Learning
Dexter shows the value of domain specialization. Our 37 agent types follow this pattern for software development.

---

## Simon Willison: Scaling Long-Running Autonomous Coding

**Source:** https://simonwillison.net/2026/Jan/19/scaling-long-running-autonomous-coding/

### Key Insights

1. **Hierarchical Coordination Model**
   - Planner agents create high-level decomposition
   - Sub-planners break into manageable units
   - Worker agents execute specific tasks
   - Judge agents evaluate completion

2. **Scale Achieved**
   - Hundreds of concurrent agents
   - 1M+ lines of code across 1,000 files
   - Trillions of tokens over nearly a week

3. **Knowledge Integration**
   - Git submodules for reference specifications
   - Agents have access to authoritative materials

4. **Lessons Learned**
   - Transparency matters for credibility
   - Results usable but imperfect
   - AI-assisted major projects arriving 3+ years early

### Patterns Already Incorporated (v3.3.0)
- Judge agents (Cursor learnings)
- Recursive sub-planners
- Hierarchical coordination

---

## 2026 Agentic AI Trends

### Sources
- [MachineLearningMastery - 7 Agentic AI Trends](https://machinelearningmastery.com/7-agentic-ai-trends-to-watch-in-2026/)
- [The New Stack - 5 Key Trends Shaping Agentic Development](https://thenewstack.io/5-key-trends-shaping-agentic-development-in-2026/)
- [AAMAS 2026 Call for Papers](https://cyprusconferences.org/aamas2026/call-for-papers-main-track/)

### Key Trends

1. **Multi-Agent System Architecture**
   - Monolithic agents -> orchestrated specialist teams
   - 1,445% surge in multi-agent inquiries (Gartner)
   - "Puppeteer" orchestrators coordinate specialists

2. **Agent Design Evolution**
   - Simplification: only 3 agent forms needed
   - Plan Agents (discovery/planning)
   - Execution Agents
   - Loops connecting them
   - Domain-agnostic harness becoming standard

3. **Agentic Coding**
   - Development timelines shrinking dramatically
   - Developers focus on high-level problem-solving
   - AI handles implementation details

4. **Security Concerns**
   - Sandbox security is critical
   - Agents mix sensitive data with internet access
   - Preventing exfiltration is unsolved

5. **Adoption State**
   - 88% of organizations use AI regularly (McKinsey)
   - 62% experimenting with AI agents
   - Most haven't scaled across the enterprise

### Loki Mode Alignment
- Multi-agent architecture (37 types, 6 swarms)
- Plan Agents (orchestrator, planner)
- Execution Agents (eng-*, ops-*, biz-*)
- Security controls (LOKI_SANDBOX_MODE, LOKI_BLOCKED_COMMANDS)

---

## Summary: Loki Mode Competitive Position

### Strengths vs Competitors

| Feature | Auto-Claude | Dexter | MemOS | Loki Mode |
|---------|:-----------:|:------:|:-----:|:---------:|
| Desktop GUI | Yes | No | No | No |
| CLI Support | Yes | Yes | Yes | Yes |
| Specialized Agents | 4 | 1 | 0 | 37 |
| Research Foundation | No | No | Yes | Yes |
| Memory System | Graphiti | No | Advanced | Episodic/Semantic |
| Quality Gates | 1 | 1 | 0 | 14 |
| Anti-Sycophancy | No | No | No | Yes |
| Published Benchmarks | No | No | Yes | Yes |

### Improvements Implemented (v3.4.0)
1. Human intervention mechanism (from Auto-Claude)
2. AI-powered merge with conflict resolution (from Auto-Claude)
3. Complexity tiers auto-detection (from Auto-Claude)
4. Ctrl+C pause/exit behavior (from Auto-Claude)

### Future Considerations
1. Consider MemOS integration for advanced memory
2. Monitor Auto-Claude for new patterns
3. Track AAMAS 2026 research papers
4. Evaluate Graphiti vs current memory system
@@ -0,0 +1,371 @@

# Confidence-Based Routing Reference

Production-validated pattern from HN discussions and the Claude Agent SDK guide.

---

## Overview

**Traditional Routing (Binary):**
```
IF simple_task → direct routing
ELSE → supervisor mode
```

**Confidence-Based Routing (Multi-Tier):**
```
Confidence >= 0.95 → Auto-approve (fastest)
Confidence >= 0.70 → Direct with review (fast + safety)
Confidence >= 0.40 → Supervisor orchestration (full coordination)
Confidence <  0.40 → Human escalation (too uncertain)
```

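The four tiers above collapse into a small dispatcher. A minimal sketch with the documented default thresholds hard-coded; `route_task` is a hypothetical name, and in practice the cutoffs would come from the `LOKI_CONFIDENCE_*` variables described under Configuration:

```python
def route_task(confidence: float) -> str:
    """Map a confidence score (0.0-1.0) to a routing tier name.

    Checks run highest-first, so each branch implies the earlier
    thresholds were not met.
    """
    if confidence >= 0.95:
        return "auto_approve"        # Tier 1: fastest path
    if confidence >= 0.70:
        return "direct_with_review"  # Tier 2: execute, then review
    if confidence >= 0.40:
        return "supervisor"          # Tier 3: full orchestration
    return "human_escalation"        # Tier 4: too uncertain
```

Note the boundaries are inclusive on the lower end of each tier, matching the `>=` comparisons in the tier definitions below.
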
---

## Confidence Score Calculation

### Multi-Factor Assessment

```python
def calculate_task_confidence(task) -> float:
    """
    Calculate confidence score (0.0-1.0) based on multiple factors.

    Returns weighted average of confidence pillars.
    """
    scores = {
        "requirement_clarity": assess_requirement_clarity(task),
        "technical_feasibility": assess_feasibility(task),
        "resource_availability": check_resources(task),
        "historical_success": query_similar_tasks(task),
        "complexity_match": match_agent_capability(task)
    }

    # Weighted average (can be tuned)
    weights = {
        "requirement_clarity": 0.30,
        "technical_feasibility": 0.25,
        "resource_availability": 0.15,
        "historical_success": 0.20,
        "complexity_match": 0.10
    }

    confidence = sum(scores[k] * weights[k] for k in scores)
    return round(confidence, 2)


def assess_requirement_clarity(task) -> float:
    """Score 0.0-1.0 based on requirement specificity."""
    # Check for ambiguous language
    ambiguous_terms = ["maybe", "perhaps", "might", "probably", "unclear"]
    ambiguity_count = sum(1 for term in ambiguous_terms if term in task.description.lower())

    # Check for concrete deliverables
    has_spec = task.spec_reference is not None
    has_acceptance_criteria = task.acceptance_criteria is not None

    base_score = 1.0 - (ambiguity_count * 0.15)
    if has_spec: base_score += 0.2
    if has_acceptance_criteria: base_score += 0.2

    return min(1.0, max(0.0, base_score))


def assess_feasibility(task) -> float:
    """Score 0.0-1.0 based on technical feasibility."""
    # Check for known patterns
    known_patterns = check_pattern_library(task)

    # Check for external dependencies
    external_deps = count_external_dependencies(task)

    # Check for novel technology
    novel_tech = uses_unfamiliar_tech(task)

    score = 0.8  # Start optimistic
    if known_patterns: score += 0.2
    score -= (external_deps * 0.1)
    if novel_tech: score -= 0.3

    return min(1.0, max(0.0, score))


def check_resources(task) -> float:
    """Score 0.0-1.0 based on resource availability."""
    # Check API quotas
    apis_available = check_api_quotas(task.required_apis)

    # Check agent availability
    agents_available = check_agent_capacity(task.required_agents)

    # Check budget
    estimated_cost = estimate_task_cost(task)
    budget_available = estimated_cost < get_remaining_budget()

    available_count = sum([apis_available, agents_available, budget_available])
    return available_count / 3.0


def query_similar_tasks(task) -> float:
    """Score 0.0-1.0 based on historical success with similar tasks."""
    similar_tasks = find_similar_tasks(task, limit=10)

    if not similar_tasks:
        return 0.5  # Neutral if no history

    success_rate = sum(1 for t in similar_tasks if t.outcome == "success") / len(similar_tasks)
    return success_rate
```

---

## Routing Decision Matrix

### Tier 1: Auto-Approve (Confidence >= 0.95)

**Characteristics:**
- Highly specific requirements
- Well-established patterns
- All resources available
- 90%+ historical success rate

**Action:**
```python
if confidence >= 0.95:
    log_auto_approval(task, confidence)
    execute_direct(task, review_after=False)
```

**Examples:**
- Run linter on specific file
- Execute unit test suite
- Format code with prettier
- Update package version in package.json

### Tier 2: Direct with Review (0.70 <= Confidence < 0.95)

**Characteristics:**
- Clear requirements but some unknowns
- Familiar patterns with minor variations
- Most resources available
- 70-90% historical success

**Action:**
```python
if 0.70 <= confidence < 0.95:
    result = execute_direct(task, review_after=True)

    # Quick automated review
    issues = run_static_analysis(result)
    if issues.critical or issues.high:
        flag_for_human_review(result, issues)
    else:
        approve_with_monitoring(result)
```

**Examples:**
- Implement CRUD endpoint from OpenAPI spec
- Write unit tests for new function
- Fix bug with clear reproduction steps
- Refactor function following established pattern

### Tier 3: Supervisor Mode (0.40 <= Confidence < 0.70)

**Characteristics:**
- Some ambiguity in requirements
- Novel patterns or approaches needed
- Partial resource availability
- 40-70% historical success

**Action:**
```python
if 0.40 <= confidence < 0.70:
    # Full orchestrator coordination
    plan = orchestrator.create_plan(task)
    agents = orchestrator.dispatch_specialists(plan)
    result = orchestrator.synthesize_results(agents)

    # Mandatory review before acceptance
    review_result = run_full_review(result)
    if review_result.approved:
        accept_with_monitoring(result)
    else:
        retry_with_constraints(result, review_result.issues)
```

**Examples:**
- Design new architecture for feature
- Implement feature with unclear edge cases
- Integrate unfamiliar third-party API
- Refactor with multiple valid approaches

### Tier 4: Human Escalation (Confidence < 0.40)

**Characteristics:**
- High ambiguity or unknowns
- Novel/unproven approach required
- Missing critical resources
- <40% historical success

**Action:**
```python
if confidence < 0.40:
    escalation_report = generate_escalation_report(task, confidence)

    # Write to signals directory
    write_escalation_signal(
        task_id=task.id,
        reason="confidence_too_low",
        confidence=confidence,
        report=escalation_report
    )

    # Wait for human decision
    wait_for_approval_signal(task.id)
```

**Examples:**
- Make breaking API changes
- Delete production data
- Choose between fundamentally different architectures
- Implement unspecified security model

---

## Confidence Tracking

### State Schema

```json
{
  "task_id": "task-123",
  "timestamp": "2026-01-14T10:00:00Z",
  "confidence_assessment": {
    "overall_score": 0.85,
    "factors": {
      "requirement_clarity": 0.90,
      "technical_feasibility": 0.92,
      "resource_availability": 0.75,
      "historical_success": 0.80,
      "complexity_match": 0.88
    }
  },
  "routing_decision": "direct_with_review",
  "routing_tier": 2,
  "rationale": "High confidence but novel tech stack requires review",
  "estimated_success_probability": 0.85,
  "fallback_plan": "escalate_to_supervisor_if_fails"
}
```

### Storage Location

```
.loki/state/confidence-scores/
├── {date}/
│   └── {task_id}.json
└── aggregate-metrics.json  # Rolling statistics
```

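Writing an assessment record into that layout can be sketched as follows; `save_confidence_record` is a hypothetical helper, not part of the published API, and the date-bucketed path mirrors the tree shown above:

```python
import json
from datetime import date
from pathlib import Path

def save_confidence_record(task_id: str, assessment: dict,
                           base_dir: str = ".loki/state/confidence-scores") -> Path:
    """Persist one assessment to {base_dir}/{YYYY-MM-DD}/{task_id}.json."""
    day_dir = Path(base_dir) / date.today().isoformat()
    day_dir.mkdir(parents=True, exist_ok=True)  # create date bucket on first write
    path = day_dir / f"{task_id}.json"
    path.write_text(json.dumps(assessment, indent=2))
    return path
```

Updating `aggregate-metrics.json` (the rolling statistics file) would be a separate step, typically performed by the calibration job described below.
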
---

## Calibration

### Continuous Learning

Track actual outcomes vs predicted confidence:

```python
def calibrate_confidence_model():
    """Update confidence assessment based on actual outcomes."""
    recent_tasks = load_tasks(days=7)

    for task in recent_tasks:
        predicted = task.confidence_score
        actual = 1.0 if task.outcome == "success" else 0.0

        # Calculate calibration error
        error = abs(predicted - actual)

        # Update factor weights if systematic bias detected
        if error > 0.3:  # Significant miscalibration
            adjust_confidence_weights(task, error)

    # Save updated calibration
    save_confidence_calibration()
```

### Monitoring Dashboard

Track calibration metrics:
- **Brier Score:** Mean squared error between predicted confidence and actual outcome
- **Calibration Curve:** Plot predicted vs actual success rate
- **Tier Accuracy:** Success rate by routing tier

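The Brier score in particular is straightforward to compute from logged (confidence, outcome) pairs; a minimal sketch:

```python
def brier_score(predictions: list[tuple[float, bool]]) -> float:
    """Mean squared error between predicted confidence and the 0/1 outcome.

    Lower is better: 0.0 is perfect, 0.25 matches always predicting 0.5.
    """
    return sum((p - (1.0 if ok else 0.0)) ** 2
               for p, ok in predictions) / len(predictions)
```

A rising weekly Brier score is a signal to revisit the factor weights in `calculate_task_confidence`.
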
---

## Configuration

### Environment Variables

```bash
# Enable confidence-based routing (default: true)
LOKI_CONFIDENCE_ROUTING=${LOKI_CONFIDENCE_ROUTING:-true}

# Confidence thresholds (can be tuned)
LOKI_CONFIDENCE_AUTO_APPROVE=${LOKI_CONFIDENCE_AUTO_APPROVE:-0.95}
LOKI_CONFIDENCE_DIRECT=${LOKI_CONFIDENCE_DIRECT:-0.70}
LOKI_CONFIDENCE_SUPERVISOR=${LOKI_CONFIDENCE_SUPERVISOR:-0.40}

# Calibration frequency (days)
LOKI_CONFIDENCE_CALIBRATION_DAYS=${LOKI_CONFIDENCE_CALIBRATION_DAYS:-7}
```

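On the consuming side, reading these thresholds with their documented defaults might look like this; `load_thresholds` is an assumption for illustration, not an existing function:

```python
import os

def load_thresholds() -> dict:
    """Read tier thresholds from the environment, falling back to the
    documented defaults when a variable is unset."""
    return {
        "auto_approve": float(os.environ.get("LOKI_CONFIDENCE_AUTO_APPROVE", "0.95")),
        "direct": float(os.environ.get("LOKI_CONFIDENCE_DIRECT", "0.70")),
        "supervisor": float(os.environ.get("LOKI_CONFIDENCE_SUPERVISOR", "0.40")),
    }
```
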
### Override Per Task

```python
# Force specific routing regardless of confidence
Task(
    description="Critical security fix",
    prompt="...",
    metadata={
        "force_routing": "supervisor",  # Override confidence-based routing
        "require_human_review": True
    }
)
```

---

## Production Validation

### HN Community Insights

From production deployments:
- **Auto-approve threshold (0.95):** 98% success rate observed
- **Direct-with-review (0.70-0.95):** 92% success rate, 8% caught by review
- **Supervisor mode (0.40-0.70):** 75% success rate, requires iteration
- **Human escalation (<0.40):** 60% success rate even with human input (inherently uncertain tasks)

### Cost Savings

Confidence-based routing reduces costs by:
- **Auto-approve tier:** 40% faster, uses Haiku
- **Direct-with-review:** 25% faster, uses Sonnet
- **Supervisor mode:** Standard cost, uses Opus for planning

Average cost reduction: 22% compared to always-supervisor routing.

---

## Best Practices

1. **Start conservative** - Use higher thresholds initially, lower them as calibration improves
2. **Monitor calibration** - Weekly reviews of predicted vs actual success rates
3. **Log all decisions** - Essential for debugging and improvement
4. **Adjust per project** - Novel domains may need higher escalation thresholds
5. **Human override** - Always allow manual routing decisions

---

**Version:** 1.0.0 | **Pattern Source:** HN Production Discussions + Claude Agent SDK Guide