loki-mode 4.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (54)
  1. package/LICENSE +21 -0
  2. package/README.md +691 -0
  3. package/SKILL.md +191 -0
  4. package/VERSION +1 -0
  5. package/autonomy/.loki/dashboard/index.html +2634 -0
  6. package/autonomy/CONSTITUTION.md +508 -0
  7. package/autonomy/README.md +201 -0
  8. package/autonomy/config.example.yaml +152 -0
  9. package/autonomy/loki +526 -0
  10. package/autonomy/run.sh +3636 -0
  11. package/bin/loki-mode.js +26 -0
  12. package/bin/postinstall.js +60 -0
  13. package/docs/ACKNOWLEDGEMENTS.md +234 -0
  14. package/docs/COMPARISON.md +325 -0
  15. package/docs/COMPETITIVE-ANALYSIS.md +333 -0
  16. package/docs/INSTALLATION.md +547 -0
  17. package/docs/auto-claude-comparison.md +276 -0
  18. package/docs/cursor-comparison.md +225 -0
  19. package/docs/dashboard-guide.md +355 -0
  20. package/docs/screenshots/README.md +149 -0
  21. package/docs/screenshots/dashboard-agents.png +0 -0
  22. package/docs/screenshots/dashboard-tasks.png +0 -0
  23. package/docs/thick2thin.md +173 -0
  24. package/package.json +48 -0
  25. package/references/advanced-patterns.md +453 -0
  26. package/references/agent-types.md +243 -0
  27. package/references/agents.md +1043 -0
  28. package/references/business-ops.md +550 -0
  29. package/references/competitive-analysis.md +216 -0
  30. package/references/confidence-routing.md +371 -0
  31. package/references/core-workflow.md +275 -0
  32. package/references/cursor-learnings.md +207 -0
  33. package/references/deployment.md +604 -0
  34. package/references/lab-research-patterns.md +534 -0
  35. package/references/mcp-integration.md +186 -0
  36. package/references/memory-system.md +467 -0
  37. package/references/openai-patterns.md +647 -0
  38. package/references/production-patterns.md +568 -0
  39. package/references/prompt-repetition.md +192 -0
  40. package/references/quality-control.md +437 -0
  41. package/references/sdlc-phases.md +410 -0
  42. package/references/task-queue.md +361 -0
  43. package/references/tool-orchestration.md +691 -0
  44. package/skills/00-index.md +120 -0
  45. package/skills/agents.md +249 -0
  46. package/skills/artifacts.md +174 -0
  47. package/skills/github-integration.md +218 -0
  48. package/skills/model-selection.md +125 -0
  49. package/skills/parallel-workflows.md +526 -0
  50. package/skills/patterns-advanced.md +188 -0
  51. package/skills/production.md +292 -0
  52. package/skills/quality-gates.md +180 -0
  53. package/skills/testing.md +149 -0
  54. package/skills/troubleshooting.md +109 -0
package/references/tool-orchestration.md (new file, 691 lines):
# Tool Orchestration Patterns Reference

Research-backed patterns inspired by NVIDIA ToolOrchestra, the OpenAI Agents SDK, and multi-agent coordination research.

---

## Overview

Effective tool orchestration builds on four key techniques:
1. **Tracing Spans** - Hierarchical event tracking (OpenAI SDK pattern)
2. **Efficiency Metrics** - Track computational cost per task
3. **Reward Signals** - Outcome, efficiency, and preference rewards for learning
4. **Dynamic Selection** - Adapt agent count and types based on task complexity

---

## Tracing Spans Architecture (OpenAI SDK Pattern)

### Span Types

Every operation is wrapped in a typed span for observability:

```yaml
span_types:
  agent_span:        # Wraps entire agent execution
  generation_span:   # Wraps LLM API calls
  function_span:     # Wraps tool/function calls
  guardrail_span:    # Wraps validation checks
  handoff_span:      # Wraps agent-to-agent transfers
  custom_span:       # User-defined operations
```

### Hierarchical Trace Structure

```json
{
  "trace_id": "trace_abc123def456",
  "workflow_name": "implement_feature",
  "group_id": "session_xyz789",
  "spans": [
    {
      "span_id": "span_001",
      "parent_id": null,
      "type": "agent_span",
      "agent_name": "orchestrator",
      "started_at": "2026-01-07T10:00:00Z",
      "ended_at": "2026-01-07T10:05:00Z",
      "children": ["span_002", "span_003"]
    },
    {
      "span_id": "span_002",
      "parent_id": "span_001",
      "type": "guardrail_span",
      "guardrail_name": "input_validation",
      "triggered": false,
      "blocking": true
    },
    {
      "span_id": "span_003",
      "parent_id": "span_001",
      "type": "handoff_span",
      "from_agent": "orchestrator",
      "to_agent": "backend-dev"
    }
  ]
}
```

### Storage Location

```
.loki/traces/
├── active/
│   └── {trace_id}.json       # Currently running traces
└── completed/
    └── {date}/
        └── {trace_id}.json   # Archived traces
```
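
To make the span model concrete, here is a minimal sketch of a tracer that nests spans with a context manager and writes the trace shown above to `.loki/traces/active/`. The `SpanTracer` class and its method names are illustrative assumptions, not part of the package:

```python
# Minimal span-tracer sketch (hypothetical helper, not part of the package API).
import json
import uuid
from contextlib import contextmanager
from datetime import datetime, timezone
from pathlib import Path


class SpanTracer:
    def __init__(self, workflow_name, group_id, root=".loki/traces/active"):
        self.trace_id = f"trace_{uuid.uuid4().hex[:12]}"
        self.workflow_name = workflow_name
        self.group_id = group_id
        self.spans = []
        self._stack = []          # span_ids of currently open spans, innermost last
        self._path = Path(root) / f"{self.trace_id}.json"

    @contextmanager
    def span(self, span_type, **fields):
        record = {
            "span_id": f"span_{len(self.spans) + 1:03d}",
            "parent_id": self._stack[-1] if self._stack else None,
            "type": span_type,
            "started_at": datetime.now(timezone.utc).isoformat(),
            "children": [],
            **fields,
        }
        # Register as a child of the enclosing span, if any
        if self._stack:
            parent = next(s for s in self.spans if s["span_id"] == self._stack[-1])
            parent["children"].append(record["span_id"])
        self.spans.append(record)
        self._stack.append(record["span_id"])
        try:
            yield record
        finally:
            record["ended_at"] = datetime.now(timezone.utc).isoformat()
            self._stack.pop()
            self.flush()

    def flush(self):
        # Persist the whole trace after every span close
        self._path.parent.mkdir(parents=True, exist_ok=True)
        self._path.write_text(json.dumps({
            "trace_id": self.trace_id,
            "workflow_name": self.workflow_name,
            "group_id": self.group_id,
            "spans": self.spans,
        }, indent=2))


# Usage: wrap orchestrator work in nested, typed spans
tracer = SpanTracer("implement_feature", "session_xyz789")
with tracer.span("agent_span", agent_name="orchestrator"):
    with tracer.span("guardrail_span", guardrail_name="input_validation",
                     triggered=False, blocking=True):
        pass
    with tracer.span("handoff_span", from_agent="orchestrator", to_agent="backend-dev"):
        pass
```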

See `references/openai-patterns.md` for full tracing implementation.

---

## Efficiency Metrics System

### Why Track Efficiency?

ToolOrchestra achieves 70% cost reduction vs GPT-5 by explicitly optimizing for efficiency. Loki Mode should track:

- **Token usage** per task (input + output)
- **Wall clock time** per task
- **Agent spawns** per task
- **Retry count** before success

### Efficiency Tracking Schema

```json
{
  "task_id": "task-2026-01-06-001",
  "correlation_id": "session-abc123",
  "started_at": "2026-01-06T10:00:00Z",
  "completed_at": "2026-01-06T10:05:32Z",
  "metrics": {
    "wall_time_seconds": 332,
    "agents_spawned": 3,
    "total_agent_calls": 7,
    "retry_count": 1,
    "retry_reasons": ["test_failure"],
    "recovery_rate": 1.0,
    "model_usage": {
      "haiku": {"calls": 4, "est_tokens": 12000},
      "sonnet": {"calls": 2, "est_tokens": 8000},
      "opus": {"calls": 1, "est_tokens": 6000}
    }
  },
  "outcome": "success",
  "outcome_reason": "tests_passed_after_fix",
  "efficiency_score": 0.85,
  "efficiency_factors": ["used_haiku_for_tests", "parallel_review"],
  "quality_pillars": {
    "tool_selection_correct": true,
    "tool_reliability_rate": 0.95,
    "memory_retrieval_relevant": true,
    "goal_adherence": 1.0
  }
}
```

**Why capture these metrics?** (Based on multi-agent research)

1. **Capture intent, not just actions** ([Hashrocket](https://hashrocket.substack.com/p/the-hidden-cost-of-well-fix-it-later))
   - "UX debt turns into data debt" - recording actions without intent creates useless analytics

2. **Track recovery rate** ([Assessment Framework, arXiv 2512.12791](https://arxiv.org/html/2512.12791v1))
   - `recovery_rate = successful_retries / total_retries` (see the sketch after this list)
   - Paper found "perfect tool sequencing but only 33% policy adherence" - surface metrics mask failures

3. **Distributed tracing** ([Maxim AI](https://www.getmaxim.ai/articles/best-practices-for-building-production-ready-multi-agent-systems/))
   - `correlation_id`: Links all tasks in a session for end-to-end tracing
   - Essential for debugging multi-agent coordination failures

4. **Tool reliability separate from selection** ([Stanford/Harvard](https://www.marktechpost.com/2025/12/24/this-ai-paper-from-stanford-and-harvard-explains-why-most-agentic-ai-systems-feel-impressive-in-demos-and-then-completely-fall-apart-in-real-use/))
   - `tool_selection_correct`: Did we pick the right tool?
   - `tool_reliability_rate`: Did the tool work as expected? (tools can fail even when correctly selected)
   - Key insight: "Tool use reliability" is a primary demo-to-deployment gap

5. **Quality pillars beyond outcomes** ([Assessment Framework](https://arxiv.org/html/2512.12791v1))
   - `memory_retrieval_relevant`: Did episodic/semantic retrieval help?
   - `goal_adherence`: Did we stay on task? (0.0-1.0 score)

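A small helper for the recovery-rate formula from item 2. Treating a task with zero retries as fully recovered (1.0) is an assumption of this sketch:

```python
def recovery_rate(successful_retries: int, total_retries: int) -> float:
    """recovery_rate = successful_retries / total_retries.

    A task with no retries is counted as fully recovered (1.0), which matches
    the sample record above where one successful retry yields 1.0.
    """
    if total_retries == 0:
        return 1.0
    return successful_retries / total_retries
```
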
### Efficiency Score Calculation

```python
def calculate_efficiency_score(metrics, task_complexity):
    """
    Score from 0-1 where higher is more efficient.
    Based on ToolOrchestra's efficiency reward signal.
    """
    # Baseline expectations by complexity
    baselines = {
        "trivial":  {"time": 60,   "agents": 1,  "retries": 0},
        "simple":   {"time": 180,  "agents": 2,  "retries": 0},
        "moderate": {"time": 600,  "agents": 4,  "retries": 1},
        "complex":  {"time": 1800, "agents": 8,  "retries": 2},
        "critical": {"time": 3600, "agents": 12, "retries": 3}
    }

    baseline = baselines[task_complexity]

    # Component scores, each capped at 1.0 (1.0 = at or better than baseline, <1.0 = worse)
    time_score = min(1.0, baseline["time"] / max(metrics["wall_time_seconds"], 1))
    agent_score = min(1.0, baseline["agents"] / max(metrics["agents_spawned"], 1))
    # Clamped at 0 so heavy retrying cannot push the score negative
    retry_score = max(0.0, 1.0 - (metrics["retry_count"] / (baseline["retries"] + 3)))

    # Weighted average (time matters most)
    return (time_score * 0.5) + (agent_score * 0.3) + (retry_score * 0.2)
```
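
A quick worked example of the formula, plugging in the wall time, agent count, and retry count from the sample schema record as a moderate task (this illustrates the formula only; it is not meant to reproduce the stored 0.85):

```python
sample = {"wall_time_seconds": 332, "agents_spawned": 3, "retry_count": 1}
score = calculate_efficiency_score(sample, "moderate")
# time_score  = min(1.0, 600 / 332) = 1.0
# agent_score = min(1.0, 4 / 3)     = 1.0
# retry_score = 1.0 - 1 / (1 + 3)   = 0.75
# score       = 0.5*1.0 + 0.3*1.0 + 0.2*0.75 = 0.95
```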

### Standard Reason Codes

Use consistent codes to enable pattern analysis:

```yaml
outcome_reasons:
  success:
    - tests_passed_first_try
    - tests_passed_after_fix
    - review_approved
    - spec_validated
  partial:
    - tests_partial_pass
    - review_concerns_minor
    - timeout_partial_work
  failure:
    - tests_failed
    - review_blocked
    - dependency_missing
    - timeout_no_progress
    - error_unrecoverable

retry_reasons:
  - test_failure
  - lint_error
  - type_error
  - review_rejection
  - rate_limit
  - timeout
  - dependency_conflict

efficiency_factors:
  positive:
    - used_haiku_for_simple
    - parallel_execution
    - cached_result
    - first_try_success
    - spec_driven
  negative:
    - used_opus_for_simple
    - sequential_when_parallel_possible
    - multiple_retries
    - missing_context
    - unclear_requirements
```

### Storage Location

```
.loki/metrics/
├── efficiency/
│   ├── 2026-01-06.json      # Daily efficiency logs
│   └── aggregate.json       # Running averages by task type
└── rewards/
    ├── outcomes.json        # Task success/failure records
    └── preferences.json     # User preference signals
```
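
One way the reason codes and the daily logs above could feed pattern analysis. The sketch assumes each daily file holds a JSON list of records in the tracking schema shown earlier:

```python
import json
from collections import Counter
from pathlib import Path


def tally_reason_codes(log_dir=".loki/metrics/efficiency"):
    """Aggregate outcome and retry reason codes across daily efficiency logs."""
    outcomes, retries = Counter(), Counter()
    for day_file in sorted(Path(log_dir).glob("*.json")):
        if day_file.name == "aggregate.json":
            continue
        for record in json.loads(day_file.read_text()):
            outcomes[record.get("outcome_reason", "unknown")] += 1
            retries.update(record.get("metrics", {}).get("retry_reasons", []))
    return outcomes, retries


outcomes, retries = tally_reason_codes()
print("Most common outcomes:", outcomes.most_common(5))
print("Most common retry reasons:", retries.most_common(5))
```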

---

## Reward Signal Framework

### Three Reward Types (ToolOrchestra Pattern)

```
+--------------------------------------------------------------------+
| 1. OUTCOME REWARD                                                  |
|    - Did the task succeed? Binary + quality grade                  |
|    - Signal: +1.0 (success), 0.0 (partial), -1.0 (failure)         |
+--------------------------------------------------------------------+
| 2. EFFICIENCY REWARD                                               |
|    - Did we use resources wisely?                                  |
|    - Signal: 0.0 to 1.0 based on efficiency score                  |
+--------------------------------------------------------------------+
| 3. PREFERENCE REWARD                                               |
|    - Did the user like the approach/result?                        |
|    - Signal: Inferred from user actions (accept/reject/modify)     |
+--------------------------------------------------------------------+
```

### Outcome Reward Implementation

```python
def calculate_outcome_reward(task_result):
    """
    Outcome reward based on task completion status.
    """
    if task_result.status == "completed":
        # Grade the quality of completion
        if task_result.tests_passed and task_result.review_passed:
            return 1.0   # Full success
        elif task_result.tests_passed:
            return 0.7   # Tests pass but review had concerns
        else:
            return 0.3   # Completed but with issues

    elif task_result.status == "partial":
        return 0.0       # Partial completion, no reward

    else:  # failed
        return -1.0      # Negative reward for failure
```

### Preference Reward Implementation

```python
def infer_preference_reward(task_result, user_actions):
    """
    Infer user preference from their actions after task completion.
    Based on implicit feedback patterns.
    """
    signals = []

    # Positive signals
    if "commit" in user_actions:
        signals.append(0.8)    # User committed our changes
    if "deploy" in user_actions:
        signals.append(1.0)    # User deployed our changes
    if "no_edits" in user_actions:
        signals.append(0.6)    # User didn't modify our output

    # Negative signals
    if "revert" in user_actions:
        signals.append(-1.0)   # User reverted our changes
    if "manual_fix" in user_actions:
        signals.append(-0.5)   # User had to fix our work
    if "retry_different" in user_actions:
        signals.append(-0.3)   # User asked for different approach

    # Neutral (no signal)
    if not signals:
        return None

    return sum(signals) / len(signals)
```

### Reward Aggregation for Learning

```python
def aggregate_rewards(outcome, efficiency, preference):
    """
    Combine rewards into single learning signal.
    Weights based on ToolOrchestra findings.
    """
    # Outcome is most important (must succeed)
    # Efficiency secondary (once successful, optimize)
    # Preference tertiary (align with user style)

    weights = {
        "outcome": 0.6,
        "efficiency": 0.25,
        "preference": 0.15
    }

    total = outcome * weights["outcome"]
    total += efficiency * weights["efficiency"]

    if preference is not None:
        total += preference * weights["preference"]
    else:
        # Redistribute weight if no preference signal
        total = total / (1 - weights["preference"])

    return total
```
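
Putting the three signals together for one finished task, reusing the functions above. `task_result` and `task_metrics` stand for whatever objects the task run produced; the numbers are illustrative:

```python
# Hypothetical end-of-task bookkeeping, combining the three reward signals above.
outcome    = calculate_outcome_reward(task_result)                  # e.g. 1.0
efficiency = calculate_efficiency_score(task_metrics, "moderate")   # e.g. 0.95
preference = infer_preference_reward(task_result, ["commit", "no_edits"])
#            (0.8 + 0.6) / 2 = 0.7

reward = aggregate_rewards(outcome, efficiency, preference)
# 1.0 * 0.6 + 0.95 * 0.25 + 0.7 * 0.15 = 0.9425
```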

---

## Dynamic Agent Selection

### Task Complexity Classification

```python
def classify_task_complexity(task):
    """
    Classify task to determine agent allocation.
    Based on ToolOrchestra's tool selection flexibility.
    """
    complexity_signals = {
        # File scope signals
        "single_file": -1,
        "few_files": 0,        # 2-5 files
        "many_files": +1,      # 6-20 files
        "system_wide": +2,     # 20+ files

        # Change type signals
        "typo_fix": -2,
        "bug_fix": 0,
        "feature": +1,
        "refactor": +1,
        "architecture": +2,

        # Domain signals
        "documentation": -1,
        "tests_only": 0,
        "frontend": 0,
        "backend": 0,
        "full_stack": +1,
        "infrastructure": +1,
        "security": +2,
    }

    score = 0
    for signal, weight in complexity_signals.items():
        if task.has_signal(signal):
            score += weight

    # Map score to complexity level
    if score <= -2:
        return "trivial"
    elif score <= 0:
        return "simple"
    elif score <= 2:
        return "moderate"
    elif score <= 4:
        return "complex"
    else:
        return "critical"
```
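
For example, a backend feature that touches a handful of files lands in the moderate bucket. The `StubTask` class below is only a stand-in so the call is runnable:

```python
class StubTask:
    """Tiny stand-in for illustration; the real task object is defined elsewhere."""
    def __init__(self, signals):
        self.signals = set(signals)

    def has_signal(self, name):
        return name in self.signals


# few_files (0) + feature (+1) + backend (0) = 1  ->  "moderate"
print(classify_task_complexity(StubTask({"few_files", "feature", "backend"})))
```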

### Agent Allocation by Complexity

```yaml
# Agent allocation strategy
# Model selection: Opus=planning, Sonnet=development, Haiku=unit tests/monitoring
complexity_allocations:
  trivial:
    max_agents: 1
    planning: null          # No planning needed
    development: haiku
    testing: haiku
    review: skip            # No review needed for trivial
    parallel: false

  simple:
    max_agents: 2
    planning: null          # No planning needed
    development: haiku
    testing: haiku
    review: single          # One quick review
    parallel: false

  moderate:
    max_agents: 4
    planning: sonnet        # Sonnet for moderate planning
    development: sonnet
    testing: haiku          # Unit tests always haiku
    review: standard        # 3 parallel reviewers
    parallel: true

  complex:
    max_agents: 8
    planning: opus          # Opus ONLY for complex planning
    development: sonnet     # Sonnet for implementation
    testing: haiku          # Unit tests still haiku
    review: deep            # 3 reviewers + devil's advocate
    parallel: true

  critical:
    max_agents: 12
    planning: opus          # Opus for critical planning
    development: sonnet     # Sonnet for implementation
    testing: sonnet         # Functional/E2E tests with sonnet
    review: exhaustive      # Multiple review rounds
    parallel: true
    human_checkpoint: true  # Pause for human review
```
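
The selection algorithm below reads this table through a `COMPLEXITY_ALLOCATIONS` constant. A minimal Python mirror of the YAML, trimmed to the fields the algorithm actually uses, might look like this (a sketch, not the packaged config loader):

```python
# Python mirror of the YAML above; values are per-role model choices.
COMPLEXITY_ALLOCATIONS = {
    "trivial":  {"max_agents": 1,  "planning": None,     "development": "haiku",  "testing": "haiku"},
    "simple":   {"max_agents": 2,  "planning": None,     "development": "haiku",  "testing": "haiku"},
    "moderate": {"max_agents": 4,  "planning": "sonnet", "development": "sonnet", "testing": "haiku"},
    "complex":  {"max_agents": 8,  "planning": "opus",   "development": "sonnet", "testing": "haiku"},
    "critical": {"max_agents": 12, "planning": "opus",   "development": "sonnet", "testing": "sonnet"},
}
```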

### Dynamic Selection Algorithm

```python
def select_agents_for_task(task, available_agents):
    """
    Dynamically select agents based on task requirements.
    Inspired by ToolOrchestra's configurable tool selection.
    """
    complexity = classify_task_complexity(task)
    allocation = COMPLEXITY_ALLOCATIONS[complexity]

    # 1. Identify required agent types
    required_types = identify_required_agents(task)

    # 2. Filter to available agents of required types
    candidates = [a for a in available_agents if a.type in required_types]

    # 3. Score candidates by past performance
    for agent in candidates:
        agent.selection_score = get_agent_performance_score(
            agent,
            task_type=task.type,
            complexity=complexity
        )

    # 4. Select top N agents up to allocation limit
    candidates.sort(key=lambda a: a.selection_score, reverse=True)
    selected = candidates[:allocation["max_agents"]]

    # 5. Assign models based on complexity
    for agent in selected:
        if agent.role == "reviewer":
            agent.model = "sonnet"  # Sonnet for reviews (balanced quality/cost)
        else:
            # Allocation keys (planning/development/testing) map to agent roles
            agent.model = allocation.get(agent.role, "sonnet")

    return selected

def get_agent_performance_score(agent, task_type, complexity):
    """
    Score agent based on historical performance on similar tasks.
    Uses reward signals from previous executions.
    """
    history = load_agent_history(agent.id)

    # Filter to similar tasks
    similar = [h for h in history
               if h.task_type == task_type
               and h.complexity == complexity]

    if not similar:
        return 0.5  # Neutral score if no history

    # Average past rewards
    return sum(h.aggregate_reward for h in similar) / len(similar)
```

---

## Tool Usage Analytics

### Track Tool Effectiveness

```json
{
  "tool_analytics": {
    "period": "2026-01-06",
    "by_tool": {
      "Grep": {
        "calls": 142,
        "success_rate": 0.89,
        "avg_result_quality": 0.82,
        "common_patterns": ["error handling", "function def"]
      },
      "Task": {
        "calls": 47,
        "success_rate": 0.94,
        "avg_efficiency": 0.76,
        "by_subagent_type": {
          "general-purpose": {"calls": 35, "success": 0.91},
          "Explore": {"calls": 12, "success": 1.0}
        }
      }
    },
    "insights": [
      "Explore agent 100% success - use more for codebase search",
      "Grep success drops to 0.65 for regex patterns - simplify searches"
    ]
  }
}
```
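
A sketch of how per-tool statistics like these could be derived from archived traces. It assumes `function_span` records carry `tool_name` and `success` fields, which is an assumption of this sketch rather than a documented schema:

```python
import json
from collections import defaultdict
from pathlib import Path


def tool_effectiveness(trace_root=".loki/traces/completed"):
    """Tally call counts and success rates per tool from archived traces."""
    stats = defaultdict(lambda: {"calls": 0, "successes": 0})
    for trace_file in Path(trace_root).rglob("*.json"):
        trace = json.loads(trace_file.read_text())
        for span in trace.get("spans", []):
            if span.get("type") != "function_span":
                continue
            tool = span.get("tool_name", "unknown")               # assumed field
            stats[tool]["calls"] += 1
            stats[tool]["successes"] += bool(span.get("success"))  # assumed field

    return {
        tool: {"calls": s["calls"],
               "success_rate": round(s["successes"] / s["calls"], 2)}
        for tool, s in stats.items() if s["calls"]
    }
```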

### Continuous Improvement Loop

```
+--------------------------------------------------------------------+
| 1. COLLECT                                                         |
|    Record every task: agents used, tools called, outcome           |
+--------------------------------------------------------------------+
                                  |
                                  v
+--------------------------------------------------------------------+
| 2. ANALYZE                                                         |
|    Weekly aggregation: What worked? What didn't?                   |
|    Identify patterns in high-reward vs low-reward tasks            |
+--------------------------------------------------------------------+
                                  |
                                  v
+--------------------------------------------------------------------+
| 3. ADAPT                                                           |
|    Update selection algorithms based on analytics                  |
|    Store successful patterns in semantic memory                    |
+--------------------------------------------------------------------+
                                  |
                                  v
+--------------------------------------------------------------------+
| 4. VALIDATE                                                        |
|    A/B test new selection strategies                               |
|    Measure efficiency improvement                                  |
+--------------------------------------------------------------------+
                                  |
                                  +-----------> Loop back to COLLECT
```

---

## Integration with RARV Cycle

The orchestration patterns integrate with RARV at each phase:

```
REASON:
├── Check efficiency metrics for similar past tasks
├── Classify task complexity
└── Select appropriate agent allocation

ACT:
├── Dispatch agents according to allocation
├── Track start time and resource usage
└── Record tool calls and agent interactions

REFLECT:
├── Calculate outcome reward (did it work?)
├── Calculate efficiency reward (resource usage)
└── Log to metrics store

VERIFY:
├── Run verification checks
├── If failed: negative outcome reward, retry with learning
├── If passed: infer preference reward from user actions
└── Update agent performance scores
```
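
A compact sketch of one RARV-shaped pass over a task, wiring together the functions defined earlier in this document. `dispatch`, `log_metrics`, and `run_verification` are hypothetical placeholders for the corresponding Loki Mode machinery:

```python
def run_task_with_orchestration(task, available_agents, user_actions):
    # REASON: classify the task and pick an agent allocation
    complexity = classify_task_complexity(task)
    agents = select_agents_for_task(task, available_agents)

    # ACT: dispatch agents and record resource usage (hypothetical helper)
    task_result, task_metrics = dispatch(task, agents)

    # REFLECT: turn the outcome and resource usage into reward signals
    outcome = calculate_outcome_reward(task_result)
    efficiency = calculate_efficiency_score(task_metrics, complexity)
    log_metrics(task, task_metrics, outcome, efficiency)       # hypothetical helper

    # VERIFY: re-check, then fold in user preference and aggregate
    if not run_verification(task_result):                      # hypothetical helper
        return aggregate_rewards(-1.0, efficiency, None)       # failed verification
    preference = infer_preference_reward(task_result, user_actions)
    return aggregate_rewards(outcome, efficiency, preference)
```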

---

## Key Metrics Dashboard

Track these metrics in `.loki/metrics/dashboard.json`:

```json
{
  "dashboard": {
    "period": "rolling_7_days",
    "summary": {
      "tasks_completed": 127,
      "success_rate": 0.94,
      "avg_efficiency_score": 0.78,
      "avg_outcome_reward": 0.82,
      "avg_preference_reward": 0.71,
      "avg_recovery_rate": 0.87,
      "avg_goal_adherence": 0.93
    },
    "quality_pillars": {
      "tool_selection_accuracy": 0.91,
      "tool_reliability_rate": 0.93,
      "memory_retrieval_relevance": 0.84,
      "policy_adherence": 0.96
    },
    "trends": {
      "efficiency": "+12% vs previous week",
      "success_rate": "+3% vs previous week",
      "avg_agents_per_task": "-0.8 (improving)",
      "recovery_rate": "+5% vs previous week"
    },
    "top_performing_patterns": [
      "Haiku for unit tests (0.95 success, 0.92 efficiency)",
      "Explore agent for codebase search (1.0 success)",
      "Parallel review with sonnet (0.98 accuracy)"
    ],
    "areas_for_improvement": [
      "Complex refactors taking 2x expected time",
      "Security review efficiency below baseline",
      "Memory retrieval relevance below 0.85 target"
    ]
  }
}
```

---

## Multi-Dimensional Evaluation

Based on [Measurement Imbalance research (arXiv 2506.02064)](https://arxiv.org/abs/2506.02064):

> "Technical metrics dominate assessments (83%), while human-centered (30%), safety (53%), and economic (30%) remain peripheral"

**Loki Mode tracks four evaluation axes:**

| Axis | Metrics | Current Coverage |
|------|---------|------------------|
| **Technical** | success_rate, efficiency_score, recovery_rate | Full |
| **Human-Centered** | preference_reward, goal_adherence | Partial |
| **Safety** | policy_adherence, quality_gates_passed | Full (via review system) |
| **Economic** | model_usage, agents_spawned, wall_time | Full |

---

## Sources

**OpenAI Agents SDK:**
- [Agents SDK Documentation](https://openai.github.io/openai-agents-python/) - Core primitives: agents, handoffs, guardrails, tracing
- [Practical Guide to Building Agents](https://cdn.openai.com/business-guides-and-resources/a-practical-guide-to-building-agents.pdf) - Orchestration patterns
- [Building Agents Track](https://developers.openai.com/tracks/building-agents/) - Official developer guide
- [AGENTS.md Specification](https://agents.md/) - Standard for agent instructions
- [Tracing Documentation](https://openai.github.io/openai-agents-python/tracing/) - Span types and observability

**Efficiency & Orchestration:**
- [NVIDIA ToolOrchestra](https://github.com/NVlabs/ToolOrchestra) - Multi-turn tool orchestration with RL
- [ToolScale Dataset](https://huggingface.co/datasets/nvidia/ToolScale) - Training data synthesis

**Evaluation Frameworks:**
- [Assessment Framework for Agentic AI (arXiv 2512.12791)](https://arxiv.org/html/2512.12791v1) - Four-pillar evaluation model
- [Measurement Imbalance in Agentic AI (arXiv 2506.02064)](https://arxiv.org/abs/2506.02064) - Multi-dimensional evaluation
- [Adaptive Monitoring for Agentic AI (arXiv 2509.00115)](https://arxiv.org/abs/2509.00115) - AMDM algorithm

**Best Practices:**
- [Anthropic: Building Effective Agents](https://www.anthropic.com/research/building-effective-agents) - Simplicity, transparency, tool engineering
- [Maxim AI: Production Multi-Agent Systems](https://www.getmaxim.ai/articles/best-practices-for-building-production-ready-multi-agent-systems/) - Orchestration patterns, distributed tracing
- [UiPath: Agent Builder Best Practices](https://www.uipath.com/blog/ai/agent-builder-best-practices) - Single-responsibility, evaluations
- [Stanford/Harvard: Demo-to-Deployment Gap](https://www.marktechpost.com/2025/12/24/this-ai-paper-from-stanford-and-harvard-explains-why-most-agentic-ai-systems-feel-impressive-in-demos-and-then-completely-fall-apart-in-real-use/) - Tool reliability as key failure mode

**Safety & Reasoning:**
- [Chain of Thought Monitoring](https://openai.com/index/chain-of-thought-monitoring/) - CoT monitorability for safety
- [Agent Builder Safety](https://platform.openai.com/docs/guides/agent-builder-safety) - Human-in-loop patterns
- [Agentic AI Foundation](https://openai.com/index/agentic-ai-foundation/) - Industry standards (MCP, AGENTS.md, goose)