loki-mode 4.2.0
- package/LICENSE +21 -0
- package/README.md +691 -0
- package/SKILL.md +191 -0
- package/VERSION +1 -0
- package/autonomy/.loki/dashboard/index.html +2634 -0
- package/autonomy/CONSTITUTION.md +508 -0
- package/autonomy/README.md +201 -0
- package/autonomy/config.example.yaml +152 -0
- package/autonomy/loki +526 -0
- package/autonomy/run.sh +3636 -0
- package/bin/loki-mode.js +26 -0
- package/bin/postinstall.js +60 -0
- package/docs/ACKNOWLEDGEMENTS.md +234 -0
- package/docs/COMPARISON.md +325 -0
- package/docs/COMPETITIVE-ANALYSIS.md +333 -0
- package/docs/INSTALLATION.md +547 -0
- package/docs/auto-claude-comparison.md +276 -0
- package/docs/cursor-comparison.md +225 -0
- package/docs/dashboard-guide.md +355 -0
- package/docs/screenshots/README.md +149 -0
- package/docs/screenshots/dashboard-agents.png +0 -0
- package/docs/screenshots/dashboard-tasks.png +0 -0
- package/docs/thick2thin.md +173 -0
- package/package.json +48 -0
- package/references/advanced-patterns.md +453 -0
- package/references/agent-types.md +243 -0
- package/references/agents.md +1043 -0
- package/references/business-ops.md +550 -0
- package/references/competitive-analysis.md +216 -0
- package/references/confidence-routing.md +371 -0
- package/references/core-workflow.md +275 -0
- package/references/cursor-learnings.md +207 -0
- package/references/deployment.md +604 -0
- package/references/lab-research-patterns.md +534 -0
- package/references/mcp-integration.md +186 -0
- package/references/memory-system.md +467 -0
- package/references/openai-patterns.md +647 -0
- package/references/production-patterns.md +568 -0
- package/references/prompt-repetition.md +192 -0
- package/references/quality-control.md +437 -0
- package/references/sdlc-phases.md +410 -0
- package/references/task-queue.md +361 -0
- package/references/tool-orchestration.md +691 -0
- package/skills/00-index.md +120 -0
- package/skills/agents.md +249 -0
- package/skills/artifacts.md +174 -0
- package/skills/github-integration.md +218 -0
- package/skills/model-selection.md +125 -0
- package/skills/parallel-workflows.md +526 -0
- package/skills/patterns-advanced.md +188 -0
- package/skills/production.md +292 -0
- package/skills/quality-gates.md +180 -0
- package/skills/testing.md +149 -0
- package/skills/troubleshooting.md +109 -0

package/references/lab-research-patterns.md
@@ -0,0 +1,534 @@
# Lab Research Patterns Reference

Research-backed patterns from Google DeepMind and Anthropic for enhanced multi-agent orchestration and safety.

---

## Overview

This reference consolidates key patterns from:
1. **Google DeepMind** - World models, self-improvement, scalable oversight
2. **Anthropic** - Constitutional AI, alignment safety, agentic coding

---

## Google DeepMind Patterns

### World Model Training (Dreamer 4)

**Key Insight:** Train agents inside world models for safety and data efficiency.

```yaml
world_model_training:
  principle: "Learn behaviors through simulation, not real environment"
  benefits:
    - 100x less data than real-world training
    - Safe exploration of dangerous actions
    - Faster iteration cycles

  architecture:
    tokenizer: "Compress frames into continuous representation"
    dynamics_model: "Predict next world state given action"
    imagination_training: "RL inside simulated trajectories"

  loki_application:
    - Run agent tasks in isolated containers first
    - Simulate deployment before actual deploy
    - Test error scenarios in sandbox
```
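
The `loki_application` entries above amount to sandbox-first execution. A minimal sketch, assuming Docker is available; the image, mount, and task command are illustrative, not fixed Loki Mode choices:

```python
import subprocess

def run_in_sandbox(task_cmd: list[str], workdir: str) -> bool:
    """Run a task in a throwaway, network-isolated container first."""
    result = subprocess.run(
        ["docker", "run", "--rm", "--network=none",
         "-v", f"{workdir}:/work", "-w", "/work",
         "python:3.12-slim", *task_cmd],
        capture_output=True,
        text=True,
    )
    # Promote the task to the real environment only if the simulated run passed
    return result.returncode == 0
```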

### Self-Improvement Loop (SIMA 2)

**Key Insight:** Use AI to generate tasks and score outcomes for bootstrapped learning.

```python
class SelfImprovementLoop:
    """
    Based on SIMA 2's self-improvement mechanism:
    a Gemini-based teacher plus a learned reward model.
    """

    def __init__(self, task_generator, reward_model):
        self.task_generator = task_generator  # LLM that generates varied tasks
        self.reward_model = reward_model      # Learned model that scores trajectories
        self.experience_bank = []

    def bootstrap_cycle(self, current_project):
        # 1. Generate tasks with estimated rewards
        tasks = self.task_generator.generate(
            domain=current_project,
            difficulty_curriculum=True
        )

        # 2. Execute tasks, accumulate experience
        for task in tasks:
            trajectory = execute(task)
            reward = self.reward_model.score(trajectory)
            self.experience_bank.append((trajectory, reward))

        # 3. Train the next generation on accumulated experience
        next_agent = train_on_experience(self.experience_bank)

        # 4. Iterate with minimal human intervention
        return next_agent
```

**Loki Mode Application:**
- Generate test scenarios automatically
- Score code quality with learned criteria
- Bootstrap agent training across projects

### Hierarchical Reasoning (Gemini Robotics)

**Key Insight:** Separate high-level planning from low-level execution.

```
+------------------------------------------------------------------+
| EMBODIED REASONING MODEL (Gemini Robotics-ER)                    |
| - Orchestrates activities like a "high-level brain"              |
| - Spatial understanding, planning, logical decisions             |
| - Natively calls tools (search, user functions)                  |
| - Does NOT directly control actions                              |
+------------------------------------------------------------------+
                                 |
                                 | High-level insights
                                 v
+------------------------------------------------------------------+
| VISION-LANGUAGE-ACTION MODEL (Gemini Robotics)                   |
| - "Thinks before taking action"                                  |
| - Generates internal reasoning in natural language               |
| - Decomposes long tasks into simpler segments                    |
| - Directly outputs actions/commands                              |
+------------------------------------------------------------------+
```

**Loki Mode Application:**
- Orchestrator = ER model (planning, tool calls)
- Implementation agents = VLA model (code actions)
- Task decomposition before execution (sketched below)
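
A minimal sketch of that split, with the orchestrator planning and delegating but never acting; `plan` and `act` are placeholder callables, not Loki Mode APIs:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    description: str
    done: bool = False

def orchestrate(goal: str,
                plan: Callable[[str], list[Step]],
                act: Callable[[Step], str]) -> list[str]:
    """ER role: decompose and delegate. VLA role (`act`): the only
    layer that touches code or commands."""
    results = []
    for step in plan(goal):        # high-level: decompose before executing
        results.append(act(step))  # low-level: delegated execution
        step.done = True
    return results
```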

### Cross-Embodiment Transfer

**Key Insight:** Skills learned by one agent type transfer to others.

```yaml
transfer_learning:
  observation: "Tasks learned on ALOHA2 work on Apollo humanoid"
  mechanism: "Shared action space abstraction"

  loki_application:
    - Patterns learned by frontend agent transfer to mobile agent
    - Testing strategies from QA apply to security testing
    - Deployment scripts generalize across cloud providers

  implementation:
    shared_skills_library: ".loki/memory/skills/"
    abstraction_layer: "Domain-agnostic action primitives"
    transfer_score: "Confidence in skill applicability"
```
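
A sketch of what those `implementation` keys could look like in practice; the JSON skill format and the Jaccard-overlap transfer score are assumptions for illustration:

```python
import json
from pathlib import Path

SKILLS_DIR = Path(".loki/memory/skills")

def transfer_score(skill: dict, target_tags: set[str]) -> float:
    """Naive confidence in applicability: tag overlap between domains."""
    tags = set(skill.get("tags", []))
    if not tags or not target_tags:
        return 0.0
    return len(tags & target_tags) / len(tags | target_tags)

def find_transferable_skills(target_tags: set[str], min_score: float = 0.3) -> list[dict]:
    """Rank skills from the shared library by estimated transferability."""
    candidates = []
    for skill_file in SKILLS_DIR.glob("*.json"):
        skill = json.loads(skill_file.read_text())
        score = transfer_score(skill, target_tags)
        if score >= min_score:
            candidates.append({**skill, "transfer_score": score})
    return sorted(candidates, key=lambda s: s["transfer_score"], reverse=True)
```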

### Scalable Oversight via Debate

**Key Insight:** Pit AI capabilities against each other for verification.

```python
async def debate_verification(proposal, max_rounds=2):
    """
    Based on DeepMind's Scalable AI Safety via Doubly-Efficient Debate.
    Use debate to break down verification into manageable sub-tasks.
    """
    # Two equally capable AI critics
    proponent = Agent(role="defender", model="opus")
    opponent = Agent(role="challenger", model="opus")

    debate_log = []

    for round_num in range(max_rounds):
        # Proponent defends proposal
        defense = await proponent.argue(
            proposal=proposal,
            counter_arguments=debate_log
        )

        # Opponent challenges
        challenge = await opponent.argue(
            proposal=proposal,
            defense=defense,
            goal="find_flaws"
        )

        debate_log.append({
            "round": round_num,
            "defense": defense,
            "challenge": challenge
        })

        # If the opponent cannot find a valid flaw, the proposal is verified
        if not challenge.has_valid_flaw:
            return VerificationResult(verified=True, debate_log=debate_log)

    # Human reviews remaining disagreements
    return escalate_to_human(debate_log)
```

### Amplified Oversight

**Key Insight:** Use AI to help humans supervise AI beyond human capability.

```yaml
amplified_oversight:
  goal: "Supervision as close as possible to a human with complete understanding"

  techniques:
    - "AI explains its reasoning transparently"
    - "AI argues against itself when wrong"
    - "AI cites relevant evidence"
    - "Monitor knows when it doesn't know"

  monitoring_principle:
    when_unsure: "Either reject action OR flag for review"
    never: "Approve uncertain actions silently"
```
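
The monitoring principle is the load-bearing part. A minimal sketch, with illustrative thresholds:

```python
from enum import Enum

class Verdict(Enum):
    APPROVE = "approve"
    REJECT = "reject"
    ESCALATE = "escalate"

def monitor_action(risk_score: float, confidence: float) -> Verdict:
    """An uncertain monitor must reject or escalate, never approve silently."""
    if confidence < 0.7:  # the monitor knows when it doesn't know
        return Verdict.REJECT if risk_score > 0.5 else Verdict.ESCALATE
    return Verdict.APPROVE if risk_score < 0.2 else Verdict.REJECT
```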

---

## Anthropic Patterns

### Constitutional AI Principles

**Key Insight:** Train AI to self-critique based on explicit principles.

```python
class ConstitutionalAI:
    """
    Based on Anthropic's Constitutional AI: Harmlessness from AI Feedback.
    Self-critique and revision based on constitutional principles.
    """

    def __init__(self, constitution):
        self.constitution = constitution  # List of principles

    async def supervised_learning_phase(self, response):
        """Phase 1: Self-critique and revise."""
        # Start from the initially generated response
        initial = response

        # Self-critique against each principle
        critiques = []
        for principle in self.constitution:
            critique = await self.critique(
                response=initial,
                principle=principle,
                prompt=f"Does this response violate: {principle}?"
            )
            critiques.append(critique)

        # Revise based on critiques
        revised = await self.revise(
            response=initial,
            critiques=critiques
        )

        return revised

    async def rlaif_phase(self, response_pairs):
        """Phase 2 (RL from AI Feedback): AI compares responses
        for constitutional compliance."""
        preferences = []
        for response_a, response_b in response_pairs:
            preference = await self.compare(
                a=response_a,
                b=response_b,
                criterion="Which better follows the constitution?"
            )
            preferences.append(preference)

        # Train preference model on AI feedback
        return train_preference_model(preferences)
```

**Loki Mode Constitution Example:**
```yaml
loki_constitution:
  - "Never delete production data without explicit backup"
  - "Never commit secrets or credentials to version control"
  - "Never bypass quality gates for speed"
  - "Always verify tests pass before marking task complete"
  - "Never claim completion without running actual tests"
  - "Prefer simple solutions over clever ones"
  - "Document decisions, not just code"
```
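
One way to wire this into the class above, assuming the YAML is saved at a path like `.loki/constitution.yaml` (illustrative) and that the `critique`/`revise` helpers are implemented:

```python
import asyncio
import yaml  # PyYAML

async def constitution_gate(draft: str) -> str:
    with open(".loki/constitution.yaml") as f:
        principles = yaml.safe_load(f)["loki_constitution"]
    cai = ConstitutionalAI(constitution=principles)
    # Phase 1: critique the draft against every principle, then revise
    return await cai.supervised_learning_phase(draft)

# revised = asyncio.run(constitution_gate("draft agent output"))
```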

### Sleeper Agent Detection (Defection Probes)

**Key Insight:** Use linear classifiers to detect when agents might misbehave.

```python
class DefectionProbe:
    """
    Based on Anthropic's "Simple probes can catch sleeper agents".
    Linear classifiers detecting anomalous agent behavior.
    """

    def __init__(self):
        self.baseline_behaviors = []
        self.probe = LinearClassifier()

    def train_probe(self, normal_trajectories, defection_examples):
        """
        Train on contrast pairs:
        - Normal agent behavior
        - Examples of defection/misbehavior
        """
        contrast_pairs = create_contrast_pairs(
            normal=normal_trajectories,
            defection=defection_examples
        )
        self.probe.fit(contrast_pairs)

    def check_for_defection(self, agent_action, context):
        """
        Check whether an action appears anomalous.
        Anthropic's research achieved AUROC > 99% with simple probes.
        """
        features = extract_features(agent_action, context)
        defection_probability = self.probe.predict_proba(features)

        if defection_probability > THRESHOLD:
            return DefectionAlert(
                probability=defection_probability,
                action=agent_action,
                recommendation="Flag for human review"
            )
        return None
```

**Loki Mode Application:**
- Monitor agent actions for anomalies
- Flag unusual file access patterns
- Detect attempts to bypass quality gates

### Alignment Faking Awareness

**Key Insight:** Models may strategically comply while preserving hidden preferences.

```yaml
alignment_faking_signals:
  warning_signs:
    - "Agent complies only when it knows it's being monitored"
    - "Agent's reasoning contradicts its final action"
    - "Sudden behavior change when training context mentioned"
    - "Agent attempts to exfiltrate its own weights/state"

  detection_strategy:
    - "Monitor chain-of-thought for rationalization"
    - "Compare behavior in 'free' vs 'monitored' conditions"
    - "Check for attempts to undermine oversight systems"

  mitigation:
    - "Constitutional constraints at training time"
    - "Transparency in reasoning (visible extended thinking)"
    - "Defection probes on intermediate reasoning"
```
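
A sketch of the second detection strategy, comparing behavior across conditions; `run_agent` and the divergence metric are illustrative assumptions:

```python
def behavioral_divergence(task, run_agent, trials: int = 20) -> float:
    """Fraction of trials where the action changes once the agent is told it
    is monitored; high divergence is an alignment-faking warning sign."""
    diverged = 0
    for _ in range(trials):
        free_action = run_agent(task, context="unmonitored")
        watched_action = run_agent(task, context="monitored")
        if free_action != watched_action:
            diverged += 1
    return diverged / trials
```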

### Extended Thinking Levels

**Key Insight:** Allocate computation based on problem complexity.

```yaml
thinking_levels:
  "think":
    computation: "Minimal"
    use_for: "Simple questions, straightforward tasks"

  "think hard":
    computation: "Moderate"
    use_for: "Multi-step problems, code implementation"

  "think harder":
    computation: "Extended"
    use_for: "Complex debugging, architecture decisions"

  "ultrathink":
    computation: "Maximum"
    use_for: "Security analysis, critical system design"

loki_mode_mapping:
  haiku_tasks: "think"
  sonnet_tasks: "think hard"
  opus_tasks: "think harder to ultrathink"
```
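
A small routing sketch; the 1-10 complexity rubric and the cutoffs are assumptions, not Loki Mode's actual mapping:

```python
def thinking_level(complexity: int) -> str:
    """Map a 1-10 complexity estimate to an extended-thinking keyword."""
    if complexity <= 3:
        return "think"         # simple, single-step tasks
    if complexity <= 6:
        return "think hard"    # multi-step implementation work
    if complexity <= 8:
        return "think harder"  # complex debugging, architecture
    return "ultrathink"        # security-critical or system design

prompt = f"{thinking_level(7)}: why does the migration deadlock under load?"
```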

### Explore-Plan-Code Pattern

**Key Insight:** Research before planning, plan before coding.

```
+------------------------------------------------------------------+
| PHASE 1: EXPLORE                                                 |
| - Research relevant files                                        |
| - Understand existing patterns                                   |
| - Identify dependencies and constraints                          |
| - NO CODE CHANGES YET                                            |
+------------------------------------------------------------------+
                                 |
                                 v
+------------------------------------------------------------------+
| PHASE 2: PLAN                                                    |
| - Create detailed implementation plan                            |
| - List all files to modify                                       |
| - Define success criteria                                        |
| - Get checkpoint approval if needed                              |
| - STILL NO CODE CHANGES                                          |
+------------------------------------------------------------------+
                                 |
                                 v
+------------------------------------------------------------------+
| PHASE 3: CODE                                                    |
| - Execute plan systematically                                    |
| - Test after each file change                                    |
| - Update plan if discoveries require it                          |
| - Verify against success criteria                                |
+------------------------------------------------------------------+
```
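
A minimal sketch of enforcing the phase boundaries programmatically; the gate API is illustrative:

```python
from enum import Enum, auto

class Phase(Enum):
    EXPLORE = auto()
    PLAN = auto()
    CODE = auto()

class PhaseGate:
    ORDER = [Phase.EXPLORE, Phase.PLAN, Phase.CODE]

    def __init__(self):
        self.phase = Phase.EXPLORE

    def advance(self):
        idx = self.ORDER.index(self.phase)
        self.phase = self.ORDER[min(idx + 1, len(self.ORDER) - 1)]

    def assert_edit_allowed(self):
        # The point of the pattern: no code changes before Phase 3
        if self.phase is not Phase.CODE:
            raise PermissionError(f"File edits forbidden during {self.phase.name}")
```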

### Context Reset Strategy

**Key Insight:** Fresh context often performs better than accumulated context.

```yaml
context_management:
  problem: "Long sessions accumulate irrelevant information"

  solution:
    trigger_reset:
      - "After completing major task"
      - "When changing domains (backend -> frontend)"
      - "When agent seems confused or repeating errors"

    preserve_across_reset:
      - "CONTINUITY.md (working memory)"
      - "Key decisions made this session"
      - "Current task state"

    discard_on_reset:
      - "Intermediate debugging attempts"
      - "Abandoned approaches"
      - "Superseded plans"
```
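
A sketch of a reset that carries only the preserved items forward; the `session.restart` API is hypothetical, and the preserve/discard split mirrors the YAML above:

```python
from pathlib import Path

def reset_context(session, decisions: list[str], task_state: str):
    """Persist working memory to CONTINUITY.md, then restart fresh from it."""
    continuity = Path("CONTINUITY.md")
    continuity.write_text(
        "# Working Memory\n\n"
        "## Key decisions\n"
        + "\n".join(f"- {d}" for d in decisions)
        + f"\n\n## Current task\n{task_state}\n"
    )
    # Debugging attempts, abandoned approaches, and superseded plans stay
    # behind in the old session and are deliberately dropped here.
    return session.restart(seed_files=[continuity])  # hypothetical API
```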

### Parallel Instance Pattern

**Key Insight:** Multiple Claude instances with separation of concerns.

```python
async def parallel_instance_pattern(task):
    """
    Run multiple Claude instances for separation of concerns.
    Based on Anthropic's Claude Code best practices.
    """
    # Instance 1: Implementation
    implementer = spawn_instance(
        role="implementer",
        context=implementation_context,
        permissions=["edit", "bash"]
    )

    # Instance 2: Review
    reviewer = spawn_instance(
        role="reviewer",
        context=review_context,
        permissions=["read"]  # Read-only for safety
    )

    # Implement, then review with a separate instance
    implementation = await implementer.execute(task)
    review = await reviewer.review(implementation)

    if review.approved:
        return implementation
    else:
        # Feed review back to implementer for fixes
        fixed = await implementer.fix(review.issues)
        return fixed
```

### Prompt Injection Defense

**Key Insight:** Multi-layer defense against injection attacks.

```yaml
prompt_injection_defense:
  layers:
    layer_1_recognition:
      - "Train to recognize injection patterns"
      - "Detect malicious content in external sources"

    layer_2_context_isolation:
      - "Sandbox external content processing"
      - "Mark user content vs system instructions"

    layer_3_action_validation:
      - "Verify requested actions are authorized"
      - "Block sensitive operations without confirmation"

    layer_4_monitoring:
      - "Log all external content interactions"
      - "Alert on suspicious patterns"

  performance:
    claude_opus_4: "89% attack prevention"
    claude_sonnet_4: "86% attack prevention"
```
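
A sketch of passing external content through the four layers in order; the detector is a toy string match, and every helper here is illustrative:

```python
import logging

log = logging.getLogger("injection-defense")

SUSPICIOUS = ("ignore previous instructions", "disregard your system prompt")

def process_external_content(text: str, requested_action: str,
                             authorized_actions: set[str]) -> str | None:
    # Layer 1: recognition
    if any(marker in text.lower() for marker in SUSPICIOUS):
        log.warning("possible injection pattern detected")  # Layer 4: monitoring
        return None
    # Layer 2: context isolation - external text stays marked as data and is
    # never concatenated into system instructions
    isolated = f"<external_content>{text}</external_content>"
    # Layer 3: action validation
    if requested_action not in authorized_actions:
        log.warning("unauthorized action %r blocked", requested_action)  # Layer 4
        return None
    return isolated
```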

---

## Combined Patterns for Loki Mode

### Self-Improving Multi-Agent System

```yaml
combined_approach:
  world_model_training: "Test in simulation before real execution"
  self_improvement: "Bootstrap learning from successful trajectories"
  constitutional_constraints: "Principles-based self-critique"
  debate_verification: "Pit reviewers against each other"
  defection_probes: "Monitor for alignment faking"

implementation_priority:
  high:
    - Constitutional AI principles in agent prompts
    - Explore-Plan-Code workflow enforcement
    - Context reset triggers

  medium:
    - Self-improvement loop for task generation
    - Debate-based verification for critical changes
    - Cross-embodiment skill transfer

  low:
    - Full world model training
    - Defection probe classifiers
```

---

## Sources

**Google DeepMind:**
- [SIMA 2: Generalist AI Agent](https://deepmind.google/blog/sima-2-an-agent-that-plays-reasons-and-learns-with-you-in-virtual-3d-worlds/)
- [Gemini Robotics 1.5](https://deepmind.google/blog/gemini-robotics-15-brings-ai-agents-into-the-physical-world/)
- [Dreamer 4: World Model Training](https://danijar.com/project/dreamer4/)
- [Genie 3: World Models](https://deepmind.google/blog/genie-3-a-new-frontier-for-world-models/)
- [Scalable AI Safety via Debate](https://deepmind.google/research/publications/34920/)
- [Amplified Oversight](https://deepmindsafetyresearch.medium.com/human-ai-complementarity-a-goal-for-amplified-oversight-0ad8a44cae0a)
- [Technical AGI Safety Approach](https://arxiv.org/html/2504.01849v1)

**Anthropic:**
- [Constitutional AI](https://www.anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback)
- [Building Effective Agents](https://www.anthropic.com/research/building-effective-agents)
- [Claude Code Best Practices](https://www.anthropic.com/engineering/claude-code-best-practices)
- [Sleeper Agents Detection](https://www.anthropic.com/research/probes-catch-sleeper-agents)
- [Alignment Faking](https://www.anthropic.com/research/alignment-faking)
- [Visible Extended Thinking](https://www.anthropic.com/research/visible-extended-thinking)
- [Computer Use Safety](https://www.anthropic.com/news/3-5-models-and-computer-use)
- [Sabotage Evaluations](https://www.anthropic.com/research/sabotage-evaluations-for-frontier-models)