loki-mode 4.2.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/LICENSE +21 -0
- package/README.md +691 -0
- package/SKILL.md +191 -0
- package/VERSION +1 -0
- package/autonomy/.loki/dashboard/index.html +2634 -0
- package/autonomy/CONSTITUTION.md +508 -0
- package/autonomy/README.md +201 -0
- package/autonomy/config.example.yaml +152 -0
- package/autonomy/loki +526 -0
- package/autonomy/run.sh +3636 -0
- package/bin/loki-mode.js +26 -0
- package/bin/postinstall.js +60 -0
- package/docs/ACKNOWLEDGEMENTS.md +234 -0
- package/docs/COMPARISON.md +325 -0
- package/docs/COMPETITIVE-ANALYSIS.md +333 -0
- package/docs/INSTALLATION.md +547 -0
- package/docs/auto-claude-comparison.md +276 -0
- package/docs/cursor-comparison.md +225 -0
- package/docs/dashboard-guide.md +355 -0
- package/docs/screenshots/README.md +149 -0
- package/docs/screenshots/dashboard-agents.png +0 -0
- package/docs/screenshots/dashboard-tasks.png +0 -0
- package/docs/thick2thin.md +173 -0
- package/package.json +48 -0
- package/references/advanced-patterns.md +453 -0
- package/references/agent-types.md +243 -0
- package/references/agents.md +1043 -0
- package/references/business-ops.md +550 -0
- package/references/competitive-analysis.md +216 -0
- package/references/confidence-routing.md +371 -0
- package/references/core-workflow.md +275 -0
- package/references/cursor-learnings.md +207 -0
- package/references/deployment.md +604 -0
- package/references/lab-research-patterns.md +534 -0
- package/references/mcp-integration.md +186 -0
- package/references/memory-system.md +467 -0
- package/references/openai-patterns.md +647 -0
- package/references/production-patterns.md +568 -0
- package/references/prompt-repetition.md +192 -0
- package/references/quality-control.md +437 -0
- package/references/sdlc-phases.md +410 -0
- package/references/task-queue.md +361 -0
- package/references/tool-orchestration.md +691 -0
- package/skills/00-index.md +120 -0
- package/skills/agents.md +249 -0
- package/skills/artifacts.md +174 -0
- package/skills/github-integration.md +218 -0
- package/skills/model-selection.md +125 -0
- package/skills/parallel-workflows.md +526 -0
- package/skills/patterns-advanced.md +188 -0
- package/skills/production.md +292 -0
- package/skills/quality-gates.md +180 -0
- package/skills/testing.md +149 -0
- package/skills/troubleshooting.md +109 -0
@@ -0,0 +1,691 @@

# Tool Orchestration Patterns Reference

Research-backed patterns inspired by NVIDIA ToolOrchestra, the OpenAI Agents SDK, and multi-agent coordination research.

---

## Overview

Effective tool orchestration builds on four key patterns:

1. **Tracing Spans** - Hierarchical event tracking (OpenAI SDK pattern)
2. **Efficiency Metrics** - Track computational cost per task
3. **Reward Signals** - Outcome, efficiency, and preference rewards for learning
4. **Dynamic Selection** - Adapt agent count and types to task complexity

---

## Tracing Spans Architecture (OpenAI SDK Pattern)

### Span Types

Every operation is wrapped in a typed span for observability:

```yaml
span_types:
  agent_span:       # Wraps entire agent execution
  generation_span:  # Wraps LLM API calls
  function_span:    # Wraps tool/function calls
  guardrail_span:   # Wraps validation checks
  handoff_span:     # Wraps agent-to-agent transfers
  custom_span:      # User-defined operations
```

### Hierarchical Trace Structure

```json
{
  "trace_id": "trace_abc123def456",
  "workflow_name": "implement_feature",
  "group_id": "session_xyz789",
  "spans": [
    {
      "span_id": "span_001",
      "parent_id": null,
      "type": "agent_span",
      "agent_name": "orchestrator",
      "started_at": "2026-01-07T10:00:00Z",
      "ended_at": "2026-01-07T10:05:00Z",
      "children": ["span_002", "span_003"]
    },
    {
      "span_id": "span_002",
      "parent_id": "span_001",
      "type": "guardrail_span",
      "guardrail_name": "input_validation",
      "triggered": false,
      "blocking": true
    },
    {
      "span_id": "span_003",
      "parent_id": "span_001",
      "type": "handoff_span",
      "from_agent": "orchestrator",
      "to_agent": "backend-dev"
    }
  ]
}
```

### Storage Location

```
.loki/traces/
├── active/
│   └── {trace_id}.json      # Currently running traces
└── completed/
    └── {date}/
        └── {trace_id}.json  # Archived traces
```

See `references/openai-patterns.md` for the full tracing implementation.
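A minimal sketch of persisting spans in this layout (the `new_span` and `write_active_trace` helpers are illustrative, not part of the OpenAI SDK or of Loki Mode itself):

```python
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path

def new_span(span_type, parent_id=None, **attrs):
    """Create a span record matching the trace schema above."""
    return {
        "span_id": f"span_{uuid.uuid4().hex[:8]}",
        "parent_id": parent_id,
        "type": span_type,
        "started_at": datetime.now(timezone.utc).isoformat(),
        **attrs,
    }

def write_active_trace(workflow_name, group_id, spans):
    """Persist an in-flight trace under .loki/traces/active/."""
    trace_id = f"trace_{uuid.uuid4().hex[:12]}"
    path = Path(".loki/traces/active") / f"{trace_id}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({
        "trace_id": trace_id,
        "workflow_name": workflow_name,
        "group_id": group_id,
        "spans": spans,
    }, indent=2))
    return path

# Usage: an agent span with a child guardrail span
root = new_span("agent_span", agent_name="orchestrator")
guard = new_span("guardrail_span", parent_id=root["span_id"],
                 guardrail_name="input_validation", triggered=False, blocking=True)
write_active_trace("implement_feature", "session_xyz789", [root, guard])
```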

---

## Efficiency Metrics System

### Why Track Efficiency?

ToolOrchestra reports a 70% cost reduction versus GPT-5 by explicitly optimizing for efficiency. Loki Mode should track:

- **Token usage** per task (input + output)
- **Wall clock time** per task
- **Agent spawns** per task
- **Retry count** before success

### Efficiency Tracking Schema

```json
{
  "task_id": "task-2026-01-06-001",
  "correlation_id": "session-abc123",
  "started_at": "2026-01-06T10:00:00Z",
  "completed_at": "2026-01-06T10:05:32Z",
  "metrics": {
    "wall_time_seconds": 332,
    "agents_spawned": 3,
    "total_agent_calls": 7,
    "retry_count": 1,
    "retry_reasons": ["test_failure"],
    "recovery_rate": 1.0,
    "model_usage": {
      "haiku": {"calls": 4, "est_tokens": 12000},
      "sonnet": {"calls": 2, "est_tokens": 8000},
      "opus": {"calls": 1, "est_tokens": 6000}
    }
  },
  "outcome": "success",
  "outcome_reason": "tests_passed_after_fix",
  "efficiency_score": 0.85,
  "efficiency_factors": ["used_haiku_for_tests", "parallel_review"],
  "quality_pillars": {
    "tool_selection_correct": true,
    "tool_reliability_rate": 0.95,
    "memory_retrieval_relevant": true,
    "goal_adherence": 1.0
  }
}
```

**Why capture these metrics?** (Based on multi-agent research)

1. **Capture intent, not just actions** ([Hashrocket](https://hashrocket.substack.com/p/the-hidden-cost-of-well-fix-it-later))
   - "UX debt turns into data debt" - recording actions without intent yields useless analytics

2. **Track recovery rate** ([Assessment Framework, arXiv 2512.12791](https://arxiv.org/html/2512.12791v1))
   - `recovery_rate = successful_retries / total_retries` (see the sketch after this list)
   - The paper found "perfect tool sequencing but only 33% policy adherence" - surface metrics mask failures

3. **Distributed tracing** ([Maxim AI](https://www.getmaxim.ai/articles/best-practices-for-building-production-ready-multi-agent-systems/))
   - `correlation_id`: links all tasks in a session for end-to-end tracing
   - Essential for debugging multi-agent coordination failures

4. **Tool reliability separate from selection** ([Stanford/Harvard](https://www.marktechpost.com/2025/12/24/this-ai-paper-from-stanford-and-harvard-explains-why-most-agentic-ai-systems-feel-impressive-in-demos-and-then-completely-fall-apart-in-real-use/))
   - `tool_selection_correct`: Did we pick the right tool?
   - `tool_reliability_rate`: Did the tool work as expected? (tools can fail even when correctly selected)
   - Key insight: "tool use reliability" is a primary demo-to-deployment gap

5. **Quality pillars beyond outcomes** ([Assessment Framework](https://arxiv.org/html/2512.12791v1))
   - `memory_retrieval_relevant`: Did episodic/semantic retrieval help?
   - `goal_adherence`: Did we stay on task? (0.0-1.0 score)
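
A minimal sketch of the recovery-rate calculation from item 2 (the retry-record shape is an assumption, not something the cited papers prescribe):

```python
def recovery_rate(retries: list[dict]) -> float:
    """recovery_rate = successful_retries / total_retries.

    Each retry record is assumed to look like
    {"reason": "test_failure", "succeeded": True}.
    Returns 1.0 when no retries occurred (nothing to recover from).
    """
    if not retries:
        return 1.0
    return sum(1 for r in retries if r["succeeded"]) / len(retries)

# Usage: one retry after a test failure, and it succeeded
print(recovery_rate([{"reason": "test_failure", "succeeded": True}]))  # 1.0
```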

### Efficiency Score Calculation

```python
def calculate_efficiency_score(metrics, task_complexity):
    """
    Score from 0-1 where higher is more efficient.
    Based on ToolOrchestra's efficiency reward signal.
    """
    # Baseline expectations by complexity
    baselines = {
        "trivial":  {"time": 60,   "agents": 1,  "retries": 0},
        "simple":   {"time": 180,  "agents": 2,  "retries": 0},
        "moderate": {"time": 600,  "agents": 4,  "retries": 1},
        "complex":  {"time": 1800, "agents": 8,  "retries": 2},
        "critical": {"time": 3600, "agents": 12, "retries": 3}
    }

    baseline = baselines[task_complexity]

    # Component scores, each clamped to [0, 1] (1.0 = at or better than baseline)
    time_score = min(1.0, baseline["time"] / max(metrics["wall_time_seconds"], 1))
    agent_score = min(1.0, baseline["agents"] / max(metrics["agents_spawned"], 1))
    retry_score = max(0.0, 1.0 - (metrics["retry_count"] / (baseline["retries"] + 3)))

    # Weighted average (time matters most)
    return (time_score * 0.5) + (agent_score * 0.3) + (retry_score * 0.2)
```
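
Using the function above, a complex task that finishes in 900 seconds with 6 agents and one retry comes out well under baseline:

```python
score = calculate_efficiency_score(
    {"wall_time_seconds": 900, "agents_spawned": 6, "retry_count": 1},
    task_complexity="complex",
)
# time and agent scores cap at 1.0 (under baseline); retry_score = 1 - 1/5 = 0.8
print(round(score, 2))  # 0.96
```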

### Standard Reason Codes

Use consistent codes to enable pattern analysis:

```yaml
outcome_reasons:
  success:
    - tests_passed_first_try
    - tests_passed_after_fix
    - review_approved
    - spec_validated
  partial:
    - tests_partial_pass
    - review_concerns_minor
    - timeout_partial_work
  failure:
    - tests_failed
    - review_blocked
    - dependency_missing
    - timeout_no_progress
    - error_unrecoverable

retry_reasons:
  - test_failure
  - lint_error
  - type_error
  - review_rejection
  - rate_limit
  - timeout
  - dependency_conflict

efficiency_factors:
  positive:
    - used_haiku_for_simple
    - parallel_execution
    - cached_result
    - first_try_success
    - spec_driven
  negative:
    - used_opus_for_simple
    - sequential_when_parallel_possible
    - multiple_retries
    - missing_context
    - unclear_requirements
```
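
A small sketch of enforcing these codes at write time (the set is copied from `retry_reasons` above; the validator itself is illustrative):

```python
RETRY_REASONS = {
    "test_failure", "lint_error", "type_error", "review_rejection",
    "rate_limit", "timeout", "dependency_conflict",
}

def validate_retry_reasons(record: dict) -> None:
    """Reject metric records whose retry_reasons drift from the standard codes."""
    unknown = set(record.get("retry_reasons", [])) - RETRY_REASONS
    if unknown:
        raise ValueError(f"non-standard retry reasons: {sorted(unknown)}")

validate_retry_reasons({"retry_reasons": ["test_failure"]})  # ok
```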

### Storage Location

```
.loki/metrics/
├── efficiency/
│   ├── 2026-01-06.json      # Daily efficiency logs
│   └── aggregate.json       # Running averages by task type
└── rewards/
    ├── outcomes.json        # Task success/failure records
    └── preferences.json     # User preference signals
```

---

## Reward Signal Framework

### Three Reward Types (ToolOrchestra Pattern)

```
+------------------------------------------------------------------+
| 1. OUTCOME REWARD                                                |
|    - Did the task succeed? Binary + quality grade                |
|    - Signal: +1.0 (success), 0.0 (partial), -1.0 (failure)       |
+------------------------------------------------------------------+
| 2. EFFICIENCY REWARD                                             |
|    - Did we use resources wisely?                                |
|    - Signal: 0.0 to 1.0 based on efficiency score                |
+------------------------------------------------------------------+
| 3. PREFERENCE REWARD                                             |
|    - Did the user like the approach/result?                      |
|    - Signal: Inferred from user actions (accept/reject/modify)   |
+------------------------------------------------------------------+
```

### Outcome Reward Implementation

```python
def calculate_outcome_reward(task_result):
    """
    Outcome reward based on task completion status.
    """
    if task_result.status == "completed":
        # Grade the quality of completion
        if task_result.tests_passed and task_result.review_passed:
            return 1.0   # Full success
        elif task_result.tests_passed:
            return 0.7   # Tests pass but review had concerns
        else:
            return 0.3   # Completed but with issues

    elif task_result.status == "partial":
        return 0.0   # Partial completion, no reward

    else:  # failed
        return -1.0  # Negative reward for failure
```

### Preference Reward Implementation

```python
def infer_preference_reward(task_result, user_actions):
    """
    Infer user preference from their actions after task completion.
    Based on implicit feedback patterns.
    """
    signals = []

    # Positive signals
    if "commit" in user_actions:
        signals.append(0.8)    # User committed our changes
    if "deploy" in user_actions:
        signals.append(1.0)    # User deployed our changes
    if "no_edits" in user_actions:
        signals.append(0.6)    # User didn't modify our output

    # Negative signals
    if "revert" in user_actions:
        signals.append(-1.0)   # User reverted our changes
    if "manual_fix" in user_actions:
        signals.append(-0.5)   # User had to fix our work
    if "retry_different" in user_actions:
        signals.append(-0.3)   # User asked for different approach

    # Neutral (no signal)
    if not signals:
        return None

    return sum(signals) / len(signals)
```

### Reward Aggregation for Learning

```python
def aggregate_rewards(outcome, efficiency, preference):
    """
    Combine rewards into a single learning signal.
    Weights based on ToolOrchestra findings.
    """
    # Outcome is most important (must succeed)
    # Efficiency secondary (once successful, optimize)
    # Preference tertiary (align with user style)

    weights = {
        "outcome": 0.6,
        "efficiency": 0.25,
        "preference": 0.15
    }

    total = outcome * weights["outcome"]
    total += efficiency * weights["efficiency"]

    if preference is not None:
        total += preference * weights["preference"]
    else:
        # Redistribute weight if no preference signal: rescale so the
        # outcome and efficiency weights sum back to 1.0
        total = total / (1 - weights["preference"])

    return total
```
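
Putting the three signals together, with illustrative values (the 0.85 matches the schema example earlier):

```python
outcome = 1.0      # calculate_outcome_reward: tests and review both passed
efficiency = 0.85  # calculate_efficiency_score for the task
preference = 0.8   # infer_preference_reward: user committed with no edits

aggregate_rewards(outcome, efficiency, preference)  # 0.6 + 0.2125 + 0.12 = 0.9325
aggregate_rewards(outcome, efficiency, None)        # 0.8125 / 0.85 ≈ 0.956
```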

---

## Dynamic Agent Selection

### Task Complexity Classification

```python
def classify_task_complexity(task):
    """
    Classify task to determine agent allocation.
    Based on ToolOrchestra's tool selection flexibility.
    """
    complexity_signals = {
        # File scope signals
        "single_file": -1,
        "few_files": 0,     # 2-5 files
        "many_files": +1,   # 6-20 files
        "system_wide": +2,  # 20+ files

        # Change type signals
        "typo_fix": -2,
        "bug_fix": 0,
        "feature": +1,
        "refactor": +1,
        "architecture": +2,

        # Domain signals
        "documentation": -1,
        "tests_only": 0,
        "frontend": 0,
        "backend": 0,
        "full_stack": +1,
        "infrastructure": +1,
        "security": +2,
    }

    score = 0
    for signal, weight in complexity_signals.items():
        if task.has_signal(signal):
            score += weight

    # Map score to complexity level
    if score <= -2:
        return "trivial"
    elif score <= 0:
        return "simple"
    elif score <= 2:
        return "moderate"
    elif score <= 4:
        return "complex"
    else:
        return "critical"
```
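
For instance, with a minimal stand-in for the task object (the snippet above assumes a `task.has_signal` interface; this `Task` class is hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    signals: set[str] = field(default_factory=set)

    def has_signal(self, name: str) -> bool:
        return name in self.signals

# A security-sensitive feature touching many files: +1 +1 +2 = 4 -> "complex"
task = Task(signals={"many_files", "feature", "security"})
print(classify_task_complexity(task))  # complex
```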

### Agent Allocation by Complexity

```yaml
# Agent allocation strategy
# Model selection: Opus=planning, Sonnet=development, Haiku=unit tests/monitoring
complexity_allocations:
  trivial:
    max_agents: 1
    planning: null        # No planning needed
    development: haiku
    testing: haiku
    review: skip          # No review needed for trivial
    parallel: false

  simple:
    max_agents: 2
    planning: null        # No planning needed
    development: haiku
    testing: haiku
    review: single        # One quick review
    parallel: false

  moderate:
    max_agents: 4
    planning: sonnet      # Sonnet for moderate planning
    development: sonnet
    testing: haiku        # Unit tests always haiku
    review: standard      # 3 parallel reviewers
    parallel: true

  complex:
    max_agents: 8
    planning: opus        # Opus ONLY for complex planning
    development: sonnet   # Sonnet for implementation
    testing: haiku        # Unit tests still haiku
    review: deep          # 3 reviewers + devil's advocate
    parallel: true

  critical:
    max_agents: 12
    planning: opus        # Opus for critical planning
    development: sonnet   # Sonnet for implementation
    testing: sonnet       # Functional/E2E tests with sonnet
    review: exhaustive    # Multiple review rounds
    parallel: true
    human_checkpoint: true  # Pause for human review
```
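
A sketch of resolving a role's model from this table at runtime, assuming the block above is saved to a hypothetical `complexity_allocations.yaml` (requires PyYAML):

```python
import yaml  # pip install pyyaml

with open("complexity_allocations.yaml") as f:  # hypothetical file holding the YAML above
    allocations = yaml.safe_load(f)["complexity_allocations"]

def model_for(complexity: str, role: str):
    """Resolve the model for a role ('planning', 'development', 'testing')."""
    return allocations[complexity].get(role)

print(model_for("complex", "planning"))  # opus
print(model_for("complex", "testing"))   # haiku
```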

### Dynamic Selection Algorithm

```python
def select_agents_for_task(task, available_agents):
    """
    Dynamically select agents based on task requirements.
    Inspired by ToolOrchestra's configurable tool selection.
    """
    complexity = classify_task_complexity(task)
    # The complexity_allocations table above, loaded as a dict
    allocation = COMPLEXITY_ALLOCATIONS[complexity]

    # 1. Identify required agent types
    required_types = identify_required_agents(task)

    # 2. Filter to available agents of required types
    candidates = [a for a in available_agents if a.type in required_types]

    # 3. Score candidates by past performance
    for agent in candidates:
        agent.selection_score = get_agent_performance_score(
            agent,
            task_type=task.type,
            complexity=complexity
        )

    # 4. Select top N agents up to allocation limit
    candidates.sort(key=lambda a: a.selection_score, reverse=True)
    selected = candidates[:allocation["max_agents"]]

    # 5. Assign models based on complexity and role
    for agent in selected:
        if agent.role == "reviewer":
            agent.model = "sonnet"  # Sonnet for reviews (balanced quality/cost)
        else:
            # Look up the role's model (planning/development/testing)
            # in the allocation table
            agent.model = allocation[agent.role]

    return selected

def get_agent_performance_score(agent, task_type, complexity):
    """
    Score agent based on historical performance on similar tasks.
    Uses reward signals from previous executions.
    """
    history = load_agent_history(agent.id)

    # Filter to similar tasks
    similar = [h for h in history
               if h.task_type == task_type
               and h.complexity == complexity]

    if not similar:
        return 0.5  # Neutral score if no history

    # Average past rewards
    return sum(h.aggregate_reward for h in similar) / len(similar)
```

---

## Tool Usage Analytics

### Track Tool Effectiveness

```json
{
  "tool_analytics": {
    "period": "2026-01-06",
    "by_tool": {
      "Grep": {
        "calls": 142,
        "success_rate": 0.89,
        "avg_result_quality": 0.82,
        "common_patterns": ["error handling", "function def"]
      },
      "Task": {
        "calls": 47,
        "success_rate": 0.94,
        "avg_efficiency": 0.76,
        "by_subagent_type": {
          "general-purpose": {"calls": 35, "success": 0.91},
          "Explore": {"calls": 12, "success": 1.0}
        }
      }
    },
    "insights": [
      "Explore agent 100% success - use more for codebase search",
      "Grep success drops to 0.65 for regex patterns - simplify searches"
    ]
  }
}
```
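
A sketch of deriving the per-tool stats from completed `function_span` records (the `tool_name` and `success` fields are assumptions about the span payload):

```python
from collections import defaultdict

def tool_stats(spans: list[dict]) -> dict:
    """Aggregate function_span records into calls and success_rate per tool."""
    stats = defaultdict(lambda: {"calls": 0, "successes": 0})
    for span in spans:
        if span["type"] != "function_span":
            continue
        s = stats[span["tool_name"]]
        s["calls"] += 1
        s["successes"] += span.get("success", False)
    return {
        tool: {"calls": s["calls"],
               "success_rate": round(s["successes"] / s["calls"], 2)}
        for tool, s in stats.items()
    }

spans = [
    {"type": "function_span", "tool_name": "Grep", "success": True},
    {"type": "function_span", "tool_name": "Grep", "success": False},
]
print(tool_stats(spans))  # {'Grep': {'calls': 2, 'success_rate': 0.5}}
```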

### Continuous Improvement Loop

```
+------------------------------------------------------------------+
| 1. COLLECT                                                       |
|    Record every task: agents used, tools called, outcome         |
+------------------------------------------------------------------+
                                 |
                                 v
+------------------------------------------------------------------+
| 2. ANALYZE                                                       |
|    Weekly aggregation: What worked? What didn't?                 |
|    Identify patterns in high-reward vs low-reward tasks          |
+------------------------------------------------------------------+
                                 |
                                 v
+------------------------------------------------------------------+
| 3. ADAPT                                                         |
|    Update selection algorithms based on analytics                |
|    Store successful patterns in semantic memory                  |
+------------------------------------------------------------------+
                                 |
                                 v
+------------------------------------------------------------------+
| 4. VALIDATE                                                      |
|    A/B test new selection strategies                             |
|    Measure efficiency improvement                                |
+------------------------------------------------------------------+
                                 |
                                 +-----------> Loop back to COLLECT
```
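
The ANALYZE step could, for example, contrast efficiency factors between high- and low-reward tasks; a sketch, assuming records shaped like the tracking schema plus an `aggregate_reward` field (the 0.7 threshold is arbitrary):

```python
from collections import Counter

def contrast_factors(records: list[dict], threshold: float = 0.7) -> dict:
    """Count efficiency_factors separately for high- vs low-reward tasks."""
    high, low = Counter(), Counter()
    for r in records:
        bucket = high if r["aggregate_reward"] >= threshold else low
        bucket.update(r.get("efficiency_factors", []))
    return {"high_reward": high.most_common(5), "low_reward": low.most_common(5)}
```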

---

## Integration with RARV Cycle

The orchestration patterns integrate with RARV at each phase:

```
REASON:
├── Check efficiency metrics for similar past tasks
├── Classify task complexity
└── Select appropriate agent allocation

ACT:
├── Dispatch agents according to allocation
├── Track start time and resource usage
└── Record tool calls and agent interactions

REFLECT:
├── Calculate outcome reward (did it work?)
├── Calculate efficiency reward (resource usage)
└── Log to metrics store

VERIFY:
├── Run verification checks
├── If failed: negative outcome reward, retry with learning
├── If passed: infer preference reward from user actions
└── Update agent performance scores
```

---

## Key Metrics Dashboard

Track these metrics in `.loki/metrics/dashboard.json`:

```json
{
  "dashboard": {
    "period": "rolling_7_days",
    "summary": {
      "tasks_completed": 127,
      "success_rate": 0.94,
      "avg_efficiency_score": 0.78,
      "avg_outcome_reward": 0.82,
      "avg_preference_reward": 0.71,
      "avg_recovery_rate": 0.87,
      "avg_goal_adherence": 0.93
    },
    "quality_pillars": {
      "tool_selection_accuracy": 0.91,
      "tool_reliability_rate": 0.93,
      "memory_retrieval_relevance": 0.84,
      "policy_adherence": 0.96
    },
    "trends": {
      "efficiency": "+12% vs previous week",
      "success_rate": "+3% vs previous week",
      "avg_agents_per_task": "-0.8 (improving)",
      "recovery_rate": "+5% vs previous week"
    },
    "top_performing_patterns": [
      "Haiku for unit tests (0.95 success, 0.92 efficiency)",
      "Explore agent for codebase search (1.0 success)",
      "Parallel review with sonnet (0.98 accuracy)"
    ],
    "areas_for_improvement": [
      "Complex refactors taking 2x expected time",
      "Security review efficiency below baseline",
      "Memory retrieval relevance below 0.85 target"
    ]
  }
}
```
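
A sketch of refreshing the summary block from the daily efficiency logs (paths follow the storage layout above; it assumes each daily file holds a JSON array of tracking-schema records):

```python
import json
from pathlib import Path
from statistics import mean

def build_summary(days: list[str]) -> dict:
    """Compute rolling summary fields from .loki/metrics/efficiency/{date}.json."""
    records = []
    for day in days:
        path = Path(f".loki/metrics/efficiency/{day}.json")
        if path.exists():
            records.extend(json.loads(path.read_text()))
    if not records:
        return {"tasks_completed": 0}
    return {
        "tasks_completed": len(records),
        "success_rate": round(mean(r["outcome"] == "success" for r in records), 2),
        "avg_efficiency_score": round(mean(r["efficiency_score"] for r in records), 2),
    }
```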

---

## Multi-Dimensional Evaluation

Based on [Measurement Imbalance research (arXiv 2506.02064)](https://arxiv.org/abs/2506.02064):

> "Technical metrics dominate assessments (83%), while human-centered (30%), safety (53%), and economic (30%) remain peripheral"

**Loki Mode tracks four evaluation axes:**

| Axis | Metrics | Current Coverage |
|------|---------|------------------|
| **Technical** | success_rate, efficiency_score, recovery_rate | Full |
| **Human-Centered** | preference_reward, goal_adherence | Partial |
| **Safety** | policy_adherence, quality_gates_passed | Full (via review system) |
| **Economic** | model_usage, agents_spawned, wall_time | Full |

---

## Sources

**OpenAI Agents SDK:**
- [Agents SDK Documentation](https://openai.github.io/openai-agents-python/) - Core primitives: agents, handoffs, guardrails, tracing
- [Practical Guide to Building Agents](https://cdn.openai.com/business-guides-and-resources/a-practical-guide-to-building-agents.pdf) - Orchestration patterns
- [Building Agents Track](https://developers.openai.com/tracks/building-agents/) - Official developer guide
- [AGENTS.md Specification](https://agents.md/) - Standard for agent instructions
- [Tracing Documentation](https://openai.github.io/openai-agents-python/tracing/) - Span types and observability

**Efficiency & Orchestration:**
- [NVIDIA ToolOrchestra](https://github.com/NVlabs/ToolOrchestra) - Multi-turn tool orchestration with RL
- [ToolScale Dataset](https://huggingface.co/datasets/nvidia/ToolScale) - Training data synthesis

**Evaluation Frameworks:**
- [Assessment Framework for Agentic AI (arXiv 2512.12791)](https://arxiv.org/html/2512.12791v1) - Four-pillar evaluation model
- [Measurement Imbalance in Agentic AI (arXiv 2506.02064)](https://arxiv.org/abs/2506.02064) - Multi-dimensional evaluation
- [Adaptive Monitoring for Agentic AI (arXiv 2509.00115)](https://arxiv.org/abs/2509.00115) - AMDM algorithm

**Best Practices:**
- [Anthropic: Building Effective Agents](https://www.anthropic.com/research/building-effective-agents) - Simplicity, transparency, tool engineering
- [Maxim AI: Production Multi-Agent Systems](https://www.getmaxim.ai/articles/best-practices-for-building-production-ready-multi-agent-systems/) - Orchestration patterns, distributed tracing
- [UiPath: Agent Builder Best Practices](https://www.uipath.com/blog/ai/agent-builder-best-practices) - Single-responsibility, evaluations
- [Stanford/Harvard: Demo-to-Deployment Gap](https://www.marktechpost.com/2025/12/24/this-ai-paper-from-stanford-and-harvard-explains-why-most-agentic-ai-systems-feel-impressive-in-demos-and-then-completely-fall-apart-in-real-use/) - Tool reliability as key failure mode

**Safety & Reasoning:**
- [Chain of Thought Monitoring](https://openai.com/index/chain-of-thought-monitoring/) - CoT monitorability for safety
- [Agent Builder Safety](https://platform.openai.com/docs/guides/agent-builder-safety) - Human-in-loop patterns
- [Agentic AI Foundation](https://openai.com/index/agentic-ai-foundation/) - Industry standards (MCP, AGENTS.md, goose)