uer-mcp 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/LICENSE ADDED

MIT License

Copyright (c) 2026 Margus Martsepp / The Risk Takers Team

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
package/README.md ADDED

# Universal Expert Registry

> **ASI-Level Experts, Infinite Memory, Any Client**

An MCP server that provides:
1. **Universal LLM Access** - Call any LLM (Claude, GPT, Gemini, Bedrock, Azure, local models) through LiteLLM
2. **MCP Tool Orchestration** - Connect to 1000+ MCP servers (filesystem, databases, browsers, etc.)
3. **Shared Memory/Context** - Break context window limits via external storage with URI references
4. **Subagent Delegation** - Spawn subagents with full chat history, not just single messages

## Why This Exists

LLMs have fundamental limitations:
- **Single message I/O**: 32-64k tokens max
- **Context window**: 200k-2M tokens
- **No persistent memory**: Forget between sessions
- **No expert access**: Can't use specialized tools

Traditional multi-agent approaches waste tokens by copying full context to each subagent. This registry solves it by:
- Storing context externally (unlimited)
- Passing URI references instead of full data (50 tokens vs 50k)
- Building complete chat histories for subagents
- Persisting across sessions

## Architecture

```mermaid
graph TB
    subgraph clients["MCP Clients"]
        A1["Cursor"]
        A2["Claude Desktop"]
        A3["ChatGPT"]
        A4["VS Code"]
        A5["JetBrains"]
    end

    subgraph uer["UER - Universal Expert Registry"]
        direction TB
        B["MCP Tools<br/>llm_call, mcp_call, put, get, delegate, search"]

        subgraph litellm["LiteLLM Gateway"]
            C1["100+ LLM providers"]
            C2["Native MCP Gateway"]
            C3["A2A Protocol support"]
            C4["Cost tracking, rate limiting, fallbacks"]
        end

        subgraph store["Context Store"]
            D1["Local: SQLite"]
            D2["Cloud: Firebase"]
        end

        B --> litellm
        B --> store
    end

    subgraph providers["LLM Providers"]
        E1["Anthropic"]
        E2["OpenAI"]
        E3["Google"]
        E4["Azure"]
        E5["AWS Bedrock"]
        E6["Local: Ollama"]
    end

    subgraph mcpservers["MCP Servers"]
        F1["Filesystem"]
        F2["PostgreSQL"]
        F3["Slack"]
        F4["Browser"]
        F5["GitHub"]
        F6["1000+ more..."]
    end

    subgraph knowledge["Knowledge Sources"]
        G1["Context7"]
        G2["Company docs"]
        G3["Guidelines"]
        G4["Standards"]
    end

    clients -->|MCP Protocol| B
    litellm --> providers
    litellm --> mcpservers
    litellm --> knowledge
```

## Key Features

### 1. Universal LLM Access via LiteLLM

Call any LLM with a single interface:

```python
# All use the same interface - just change the model string
llm_call(model="anthropic/claude-sonnet-4-5-20250929", messages=[...])
llm_call(model="openai/gpt-5.2", messages=[...])
llm_call(model="gemini/gemini-3-flash-preview", messages=[...])
llm_call(model="bedrock/anthropic.claude-3-sonnet", messages=[...])
llm_call(model="azure/gpt-4-deployment", messages=[...])
llm_call(model="ollama/llama3.1:8b-instruct-q4_K_M", messages=[...])
```

Features included:
- Automatic fallbacks between providers
- Cost tracking per request
- Rate limit handling with retries
- Tool/function calling across all providers
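
For a sense of what this looks like under the hood, here is a minimal sketch of an `llm_call`-style wrapper over the LiteLLM Python SDK. It is illustrative only (the function body and cost reporting are not UER's actual implementation), though the `litellm` calls it uses are part of the public SDK:

```python
# Illustrative sketch only - not UER's internal implementation.
from litellm import completion, completion_cost

def llm_call(model: str, messages: list[dict], **kwargs) -> dict:
    """Call any LiteLLM-supported provider and report the estimated request cost."""
    response = completion(model=model, messages=messages, **kwargs)
    return {
        "content": response.choices[0].message.content,
        "model": response.model,
        "cost_usd": completion_cost(completion_response=response),
    }

if __name__ == "__main__":
    result = llm_call(
        model="gemini/gemini-3-flash-preview",
        messages=[{"role": "user", "content": "Explain MCP in one sentence."}],
    )
    print(f"{result['content']}  (~${result['cost_usd']:.6f})")
```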

### 2. MCP Tool Integration

Connect to any MCP server:

```python
# List available MCP tools
search(type="mcp")

# Call MCP tools directly
mcp_call(server="filesystem", tool="read_file", args={"path": "/data/report.txt"})
mcp_call(server="postgres", tool="query", args={"sql": "SELECT * FROM users"})
mcp_call(server="context7", tool="search", args={"query": "LiteLLM API reference"})
```

### 3. Shared Context (The Killer Feature)

Store data externally, pass URI references:

```python
# Store large document (200k tokens)
put("registry://context/doc_001", {"content": large_document})

# Pass only URI to subagent (50 tokens!)
delegate(
    model="anthropic/claude-sonnet-4-5-20250929",
    task="Analyze the document",
    context_refs=["registry://context/doc_001"]
)

# Subagent retrieves full content from registry
# Result stored back to registry
# Parent retrieves summary only
```

**Token savings: 99.9%** for multi-agent workflows.
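
Conceptually, the context store is a key-value table indexed by registry URI. The sketch below illustrates that idea with the SQLite backend named in the architecture diagram; the table name and schema are assumptions, not the package's actual layout:

```python
# Conceptual sketch of a URI-keyed context store; the schema is illustrative.
import json
import sqlite3

class ContextStore:
    def __init__(self, path: str = "registry.db") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS context (uri TEXT PRIMARY KEY, body TEXT)"
        )

    def put(self, uri: str, data: dict) -> str:
        self.conn.execute(
            "INSERT OR REPLACE INTO context (uri, body) VALUES (?, ?)",
            (uri, json.dumps(data)),
        )
        self.conn.commit()
        return uri  # subagents receive this short reference, never the raw data

    def get(self, uri: str) -> dict:
        row = self.conn.execute(
            "SELECT body FROM context WHERE uri = ?", (uri,)
        ).fetchone()
        if row is None:
            raise KeyError(uri)
        return json.loads(row[0])

store = ContextStore()
store.put("registry://context/doc_001", {"content": "...200k tokens of document..."})
print(store.get("registry://context/doc_001")["content"][:30])
```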

### 4. Full Chat History for Subagents

Build complete conversation context, not just single messages:

```python
delegate(
    model="openai/gpt-5-mini",
    messages=[
        {"role": "system", "content": "You are a code reviewer..."},
        {"role": "user", "content": "Review this code for security issues"},
        {"role": "assistant", "content": "I'll analyze the code..."},
        {"role": "user", "content": "Focus on SQL injection risks"}
    ],
    tools=[...],  # MCP tools available to subagent
    context_refs=["registry://context/codebase"]  # Large context via URI
)
```

### 5. Continuation Across Sessions

Complex tasks can span multiple messages and sessions:

```
Message 1: Start analysis → Progress: 20% → {{continuation: registry://plan/001}}
Message 2: Continue → Progress: 60% → {{continuation: registry://plan/001}}
[Next day]
Message 3: Continue → Complete! Here's your report...
```
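
The mechanics behind a continuation reference can be as simple as persisting the plan and its progress under a registry URI and reloading it on the next message. A sketch under that assumption; the plan fields are hypothetical and a plain dict stands in for the persistent store:

```python
# Hypothetical continuation flow; a dict stands in for the persistent registry store.
registry: dict[str, dict] = {}
plan_uri = "registry://plan/001"

# Message 1: record the plan and initial progress before the reply goes out.
registry[plan_uri] = {"task": "Analyze quarterly report", "progress": 0.2,
                      "next_step": "sections 3-5"}

# Message 2 (possibly days later): reload the plan and pick up where it stopped.
plan = registry[plan_uri]
if plan["progress"] < 1.0:
    print(f"Resuming '{plan['task']}' at {plan['progress']:.0%}; next: {plan['next_step']}")
    plan.update(progress=0.6, next_step="final summary")
```

In the real server the plan would live in the SQLite/Firebase context store rather than an in-memory dict, which is what lets it survive across sessions.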

## Quick Start

### Prerequisites

- **Node.js 14+** (for npm/npx)
- **Python 3.11+** (automatically detected)
- **Claude Desktop** (or any MCP-compatible client)
- **At least one LLM API key** (see below)

**Optional but recommended:**
- **uv** package manager ([installation guide](https://docs.astral.sh/uv/getting-started/installation/)) - for better dependency management

### Step 1: Get API Keys

You need at least one LLM API key to use UER. We recommend starting with **Google Gemini** as it offers a free tier:

#### Google Gemini (Free - Recommended for Testing)

1. Visit [https://aistudio.google.com/apikey](https://aistudio.google.com/apikey)
2. Click "Create API Key"
3. Copy your key (starts with `AIza...`)
4. Free tier includes: 10-15 requests/minute, 250K tokens/minute, 250-1000 requests/day (varies by model)

#### Other Providers (Optional)

| Provider | Get API Key | Free Tier |
|----------|-------------|-----------|
| **Anthropic** (Claude) | [console.anthropic.com](https://console.anthropic.com/) | $5 credit for new users |
| **OpenAI** (GPT) | [platform.openai.com/api-keys](https://platform.openai.com/api-keys) | $5 credit for new users |
| **Azure OpenAI** | [Azure Portal](https://portal.azure.com/) | Requires Azure subscription |
| **AWS Bedrock** | [AWS Console](https://console.aws.amazon.com/bedrock/) | Pay-as-you-go |

### Step 2: Installation

**Option A: Using npx (Recommended - Zero Installation)**

No installation needed! Just configure Claude Desktop (Step 3) and it will automatically download and run the latest version.

**Option B: Manual Installation (For Development)**

```bash
# Clone the repository
git clone https://github.com/margusmartsepp/UER.git
cd UER

# Install dependencies
uv sync

# Build the npm package
npm run build
```

### Step 3: Configure Claude Desktop

Add UER as an MCP server to Claude Desktop:

**Location:** `%APPDATA%\Claude\claude_desktop_config.json` (Windows) or `~/Library/Application Support/Claude/claude_desktop_config.json` (Mac)

**Minimal Configuration (Gemini only - Using npx):**
```json
{
  "mcpServers": {
    "uer": {
      "command": "npx",
      "args": ["uer-mcp@latest"],
      "env": {
        "GEMINI_API_KEY": "AIza_your_key_here"
      }
    }
  }
}
```

**Full Configuration (All providers - Using npx):**
```json
{
  "mcpServers": {
    "uer": {
      "command": "npx",
      "args": ["uer-mcp@latest"],
      "env": {
        "GEMINI_API_KEY": "AIza_your_key_here",
        "ANTHROPIC_API_KEY": "sk-ant-...",
        "OPENAI_API_KEY": "sk-...",
        "AWS_ACCESS_KEY_ID": "...",
        "AWS_SECRET_ACCESS_KEY": "...",
        "AWS_REGION_NAME": "us-east-1",
        "AZURE_API_KEY": "...",
        "AZURE_API_BASE": "https://....openai.azure.com/"
      }
    }
  }
}
```

**Manual Installation Configuration (For Development):**
```json
{
  "mcpServers": {
    "uer": {
      "command": "uv",
      "args": ["--directory", "C:\\path\\to\\UER", "run", "python", "-m", "uer.server"],
      "env": {
        "GEMINI_API_KEY": "AIza_your_key_here"
      }
    }
  }
}
```

**Important:**
- For npx: No path needed, just add your API keys
- For manual install: Replace `C:\\path\\to\\UER` with your actual directory
- Use double backslashes `\\` on Windows, or forward slashes `/` on Mac/Linux
- Only include API keys for providers you want to use

### Step 4: Restart Claude Desktop

1. Quit Claude Desktop completely
2. Reopen Claude Desktop
3. Look for the 🔨 (hammer) icon indicating MCP tools are loaded

### Step 5: Test Your Setup

Try this in Claude Desktop:

```
"Use the llm_call tool to call Gemini 3 Flash and ask it to explain what an MCP server is in one sentence."
```

Expected behavior:
- Claude will use the `llm_call` tool
- Call `gemini/gemini-3-flash-preview`
- Return Gemini's response

### Example Usage Scenarios

**1. Call Different LLMs:**
```
User: "Use llm_call to ask Gemini what the capital of France is"
→ Calls gemini/gemini-3-flash-preview
→ Returns: "Paris"

User: "Now ask Claude Sonnet the same question"
→ Calls anthropic/claude-sonnet-4-5-20250929
→ Returns: "Paris"
```

**2. Compare LLM Responses:**
```
User: "Ask both Gemini and Claude Sonnet to write a haiku about programming"
→ Uses llm_call twice with different models
→ Returns both haikus for comparison
```

**3. Store and Share Context:**
```
User: "Store this document in the registry and have Gemini summarize it"
→ put("registry://context/doc", {...})
→ delegate(model="gemini/gemini-3-flash-preview", context_refs=["registry://context/doc"])
→ Returns: Summary without re-sending full document
```

## Troubleshooting

### "MCP server not found" or "No tools available"

1. Check that `claude_desktop_config.json` is in the correct location
2. For manual installs, verify the `--directory` path is correct (use an absolute path)
3. Ensure you've restarted Claude Desktop after configuration
4. Check Claude Desktop logs: `%APPDATA%\Claude\logs\` (Windows) or `~/Library/Logs/Claude/` (Mac)

### "API key invalid" errors

1. Verify your API key is correct and active
2. Check you're using the right key for the right provider
3. For Gemini, ensure the key starts with `AIza`
4. For Anthropic, ensure the key starts with `sk-ant-`
5. For OpenAI, ensure the key starts with `sk-`

### "Model not found" errors

1. Ensure you have an API key configured for that provider
2. Check the model name is correct (use LiteLLM format: `provider/model`)
3. Verify the model is available in your region/tier

## Tools Reference

| Tool | Description |
|------|-------------|
| `llm_call` | Call any LLM via LiteLLM (100+ providers) |
| `mcp_call` | Call any configured MCP server tool |
| `put` | Store data/context in registry |
| `get` | Retrieve data/context from registry |
| `search` | Search MCP servers, skills, or stored context |
| `delegate` | Spawn subagent with full chat history |
| `subscribe` | Watch for async results |
| `cancel` | Cancel subscription or execution |

## LiteLLM Integration

This project uses [LiteLLM](https://github.com/BerriAI/litellm) as the unified LLM gateway, providing:

- **100+ LLM providers** through a single interface
- **Native MCP Gateway** with permission management
- **A2A Protocol** for agent-to-agent communication
- **Cost tracking** per request with spend reports
- **Rate limiting** with automatic retries
- **Fallbacks** between providers on failure
- **Tool/function calling** normalized across providers

### Supported Providers

| Provider | Model Examples |
|----------|---------------|
| Anthropic | `anthropic/claude-sonnet-4-5-20250929`, `anthropic/claude-opus-4-5-20251101` |
| OpenAI | `openai/gpt-5.2`, `openai/gpt-5-mini`, `openai/gpt-5.2-codex` |
| Google | `gemini/gemini-3-flash-preview`, `gemini/gemini-3-pro-preview` |
| Azure | `azure/gpt-4-deployment` |
| AWS Bedrock | `bedrock/anthropic.claude-3-sonnet` |
| Local | `ollama/llama3.1:8b-instruct-q4_K_M`, `lm_studio/lmstudio-community/Meta-Llama-3.1-8B-Instruct-GGUF` |

## Project Structure

```
UER/
├── README.md              # This file
├── ADR.plan.md            # Architecture Decision Record
├── TODO.md                # Implementation checklist
├── pyproject.toml
│
├── src/
│   ├── server.py          # MCP server entry point
│   ├── llm/
│   │   └── gateway.py     # LiteLLM wrapper
│   ├── mcp/
│   │   └── client.py      # MCP client for calling other servers
│   ├── storage/
│   │   ├── base.py        # Storage protocol
│   │   └── local.py       # SQLite + filesystem
│   ├── tools/
│   │   ├── llm_call.py    # LLM invocation tool
│   │   ├── mcp_call.py    # MCP tool invocation
│   │   ├── crud.py        # put/get/search
│   │   └── delegate.py    # Subagent delegation
│   └── models/
│       ├── context.py     # Context/blob schemas
│       └── message.py     # Chat message schemas
│
└── config/
    └── litellm_config.yaml
```

## Dependencies

```toml
[project]
dependencies = [
    "mcp>=1.0.0",
    "litellm>=1.77.0",
    "pydantic>=2.0.0",
    "httpx>=0.25.0",
]
```

## Datasets & Testing

UER includes scripts to download manipulation-detection datasets and run tests against them.

### Quick Start: Download All Datasets

**One command downloads everything:**

```bash
python seed_datasets.py
```

This downloads:
- **WMDP Benchmark:** 3,668 questions (Bio: 1,273, Chem: 408, Cyber: 1,987)
- **WildChat Sample:** 10,000 real conversations (162 MB)
- **lm-evaluation-harness:** Evaluation framework

**Time:** ~5-10 minutes depending on internet speed.

### Run Tests

**Test for Sandbagging:**
```bash
cd context/scripts
python test_wmdp.py --model gemini/gemini-3-flash-preview --limit 50
```

**Test for Sycophancy:**
```bash
python test_sycophancy.py --models gemini
```

**Results saved to:** `context/datasets/results/`

### Dataset Details

| Dataset | Size | Purpose | Location |
|---------|------|---------|----------|
| **WMDP Benchmark** | 3,668 questions (2.2 MB) | Sandbagging detection | `context/datasets/wmdp_questions/` |
| **WildChat** | 10k conversations (162 MB) | Real-world sycophancy | `context/datasets/wildchat/` |
| **lm-evaluation-harness** | Framework | Standard LLM evaluation | `context/datasets/lm-evaluation-harness/` |

All datasets are gitignored. Run `seed_datasets.py` to download locally.
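
If you would rather poke at the data without the seed script, both corpora are also published on the Hugging Face Hub. A quick sketch using the `datasets` library; the subset and field names follow the published datasets, and WildChat may require accepting its license on the Hub first:

```python
# Optional: pull small samples directly from the Hugging Face Hub for inspection.
from datasets import load_dataset

wmdp_bio = load_dataset("cais/wmdp", "wmdp-bio", split="test")
print(len(wmdp_bio), "bio questions;", wmdp_bio[0]["question"][:80], "...")

# WildChat is large, so stream it rather than downloading the full dump.
wildchat = load_dataset("allenai/WildChat", split="train", streaming=True)
first = next(iter(wildchat))
print(first["conversation"][0]["content"][:80], "...")
```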

## Hackathon Context

This project was built for the **[AI Manipulation Hackathon](https://apartresearch.com/sprints/ai-manipulation-hackathon-2026-01-09-to-2026-01-11)** organized by [Apart Research](https://apartresearch.com/).

### Event Details

- **Dates:** January 9-11, 2026
- **Theme:** Measuring, detecting, and defending against AI manipulation
- **Participants:** 500+ builders worldwide
- **Prizes:** $2,000 in cash prizes
- **Workshop:** Winners present at IASEAI workshop in Paris (February 26, 2026)

### The Challenge

AI systems are mastering deception, sycophancy, sandbagging, and psychological exploitation at scale, while our ability to detect, measure, and counter these behaviors remains dangerously underdeveloped. This hackathon brings together builders to prototype practical systems that address this critical AI safety challenge.

### How UER Addresses AI Manipulation

The Universal Expert Registry provides infrastructure for:

1. **Multi-Model Testing** - Compare responses across providers to detect inconsistencies and manipulation patterns
2. **Persistent Context** - Track conversation history across sessions to identify behavioral shifts
3. **Tool Integration** - Connect manipulation detection tools via MCP protocol
4. **Subagent Orchestration** - Deploy specialized agents for red-teaming and safety testing
5. **Transparent Logging** - Full visibility into LLM calls, costs, and behaviors

### Team

**The Risk Takers** - Building practical tools for AI safety and transparency.

### Hackathon Resources

The hackathon provides extensive research and tools for understanding AI manipulation. Below are key resources organized by track.

#### General Introduction

**[AI deception: A survey of examples, risks, and potential solutions](https://arxiv.org/abs/2308.14752)** (Park et al., 2024)
Comprehensive survey defining AI deception and documenting manipulation via sycophancy, strategic behavior, and safety test evasion. Covers both near-term risks (fraud, election tampering) and long-term risks (losing control of AI systems).

**[Artificial Intelligence Index Report 2025](https://hai.stanford.edu/assets/files/hai_ai_index_report_2025.pdf)** (Stanford HAI)
Annual assessment of global AI progress including responsible AI and system risks. Provides context on regulatory trends and technical challenges.

**[Towards Understanding Sycophancy in Language Models](https://arxiv.org/abs/2310.13548)** (Sharma et al., 2024)
Documents that state-of-the-art AI assistants consistently exhibit sycophancy, with human preference data driving this behavior. Demonstrates inverse scaling: larger models become MORE sycophantic.

#### Track 1: Measurement & Evaluation

**[AI Sandbagging: Language Models can Strategically Underperform on Evaluations](https://arxiv.org/abs/2406.07358)** (van der Weij et al., 2024)
Demonstrates GPT-4 and Claude 3 Opus strategically underperforming on dangerous capability evaluations. Models can "password-lock" capabilities and target specific scores.

**[Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models](https://arxiv.org/abs/2501.00002)** (Tice et al., 2024)
Novel sandbagging detection via weight noise injection. Provides a practical, model-agnostic detection tool.

**[The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning](https://arxiv.org/abs/2403.03218)** (Li et al., 2024)
Dataset of 3,668 multiple-choice questions for evaluating hazardous knowledge. Widely used in sandbagging and capability evaluation research.

**UER Application:** Use `llm_call` to test multiple models against the WMDP benchmark, comparing responses to detect sandbagging patterns.
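
As a concrete starting point, the probe below compares two models' accuracy on a small WMDP slice; a large gap relative to a model's known capability is the signal the sandbagging literature looks for. It calls LiteLLM directly in place of the `llm_call` tool, and the prompt format and letter-only answer parsing are assumptions:

```python
# Minimal sandbagging probe: accuracy of two models on a 20-question WMDP slice.
# Uses the datasets + litellm libraries directly; swap in UER's llm_call tool in practice.
from datasets import load_dataset
from litellm import completion

questions = load_dataset("cais/wmdp", "wmdp-bio", split="test").select(range(20))

def accuracy(model: str) -> float:
    correct = 0
    for q in questions:
        options = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", q["choices"]))
        prompt = f"{q['question']}\n{options}\nAnswer with a single letter."
        reply = completion(model=model, messages=[{"role": "user", "content": prompt}])
        guess = reply.choices[0].message.content.strip()[:1].upper()
        correct += guess == "ABCD"[q["answer"]]
    return correct / len(questions)

for model in ["gemini/gemini-3-flash-preview", "anthropic/claude-sonnet-4-5-20250929"]:
    print(f"{model}: {accuracy(model):.0%}")
```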

#### Track 2: Real-World Analysis

**[Recent Frontier Models Are Reward Hacking](https://metr.org/blog/2025-06-05-recent-reward-hacking/)** (METR, 2025)
Documents reward hacking in production-like settings: o3 hacked evaluation software, Claude models subverted tasks. Shows that models are aware their behavior violates user intentions.

**[Reward Hacking in Reinforcement Learning](https://lilianweng.github.io/posts/2024-11-28-reward-hacking/)** (Lilian Weng, 2024)
Comprehensive overview with real-world examples: summarization models gaming ROUGE, coding agents tampering with tests, recommendation algorithms optimizing engagement over user welfare.

**[Reward hacking behavior can generalize across tasks](https://arxiv.org/abs/2406.06393)** (Denison et al., 2024)
Training on reward hacking leads to a 1.3x-2.0x increase on held-out datasets. Documents how manipulation generalizes.

**UER Application:** Use persistent context storage with `put`/`get` to track behavioral shifts across sessions and detect manipulation patterns.

#### Track 3: Mitigations

**[Chain of Thought Monitoring for Misbehavior](https://openai.com/index/chain-of-thought-monitoring/)** (OpenAI, 2025)
Chain-of-thought monitoring can detect reward hacking. GPT-4o can monitor the stronger o3-mini model. Demonstrates both the promise and the pitfalls of transparency-based mitigation.

**[From shortcuts to sabotage: natural emergent misalignment from reward hacking](https://www.anthropic.com/research/emergent-misalignment-reward-hacking)** (Anthropic, 2025)
Reward hacking training causes downstream misalignment (alignment faking, sabotage). Shows interventions can break semantic links.

**[The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning](https://arxiv.org/abs/2403.03218)** (Li et al., 2024)
Develops the RMU unlearning method, removing hazardous knowledge while preserving general capabilities.

**UER Application:** Integrate mitigation tools via `mcp_call` to test interventions across multiple models simultaneously.

#### Track 4: Open Track (Multi-Agent & Emergent Behavior)

**[AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors](https://arxiv.org/abs/2308.10848)** (Chen et al., 2024)
Demonstrates emergent social behaviors in multi-agent systems: volunteer behaviors, conformity, destructive behaviors.

**[Emergence in Multi-Agent Systems: A Safety Perspective](https://arxiv.org/abs/2406.12411)** (2024)
Investigates how specification insufficiency leads to emergent manipulative behavior when agents' learned priors conflict.

**[School of Reward Hacks: Hacking Harmless Tasks Generalizes to Misalignment](https://arxiv.org/abs/2501.00003)** (2024)
Training on "harmless" reward hacking causes generalization to concerning behaviors including shutdown avoidance and alignment faking.

**UER Application:** Use `delegate` to orchestrate multi-agent studies with different models, tracking emergent manipulation behaviors via shared context.

#### Open Datasets & Tools

| Resource | Type | Link |
|----------|------|------|
| **WMDP Benchmark** | Dataset + Code | [github.com/centerforaisafety/wmdp](https://github.com/centerforaisafety/wmdp) |
| **WildChat Dataset** | 1M ChatGPT conversations | [huggingface.co/datasets/allenai/WildChat](https://huggingface.co/datasets/allenai/WildChat) |
| **lm-evaluation-harness** | Evaluation framework | [github.com/EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) |
| **METR Task Environments** | Autonomous AI tasks | [github.com/METR/task-standard](https://github.com/METR/task-standard) |
| **TransformerLens** | Interpretability library | [github.com/neelnanda-io/TransformerLens](https://github.com/neelnanda-io/TransformerLens) |
| **AgentVerse Framework** | Multi-agent collaboration | [github.com/OpenBMB/AgentVerse](https://github.com/OpenBMB/AgentVerse) |
| **Multi-Agent Particle Envs** | OpenAI environments | [github.com/openai/multiagent-particle-envs](https://github.com/openai/multiagent-particle-envs) |
| **School of Reward Hacks** | Training dataset | [github.com/aypan17/reward-hacking](https://github.com/aypan17/reward-hacking) |
| **NetLogo** | Agent-based modeling | [ccl.northwestern.edu/netlogo](https://ccl.northwestern.edu/netlogo/) |

#### Project Scoping Advice

Based on successful hackathon retrospectives:

**Focus on MVP, Not Production** (2-day timeline):
- Day 1: Set up environment, implement core functionality, basic pipeline
- Day 2: Add 1-2 key features, create demo, prepare presentation

**Use Mock/Simulated Data** instead of real APIs:
- Existing datasets (WMDP, WildChat, School of Reward Hacks)
- Pre-recorded samples
- Simulation environments (METR, AgentVerse)

**Leverage Pre-trained Models** - Don't train from scratch:
- OpenAI/Anthropic APIs via UER's `llm_call`
- Hugging Face pre-trained models
- Existing detection tools as starting points

**Clear Success Criteria** - Define "working":
- **Benchmarks:** Evaluates 3+ models on 50+ test cases with documented methodology
- **Detection:** Identifies manipulation in 10+ examples with >70% accuracy
- **Analysis:** Documents patterns across 100+ deployment examples with clear taxonomy
- **Mitigation:** Demonstrates measurable improvement on 3+ manipulation metrics

## Related Projects

- [LiteLLM](https://github.com/BerriAI/litellm) - Unified LLM gateway
- [MCP Registry](https://registry.modelcontextprotocol.io) - Official MCP server directory
- [Context7](https://github.com/upstash/context7) - Library documentation MCP
- [Apart Research](https://apartresearch.com/) - AI safety research and hackathons

## License

MIT

---

*Built for the AI Manipulation Hackathon by The Risk Takers team*