envseed 0.1.0
- package/Dockerfile.simulation +18 -0
- package/README.md +498 -0
- package/bin/dashboard.mjs +706 -0
- package/bin/propensity-monitor.mjs +897 -0
- package/commands/log-incident.md +20 -0
- package/entrypoint.sh +93 -0
- package/lib/background-analyzer.mjs +113 -0
- package/lib/container-replicator.mjs +690 -0
- package/lib/hook-handler.mjs +109 -0
- package/lib/llm-analyzer.mjs +247 -0
- package/lib/log-incident.mjs +320 -0
- package/lib/logger.mjs +42 -0
- package/lib/personas.mjs +176 -0
- package/lib/redaction-review.mjs +255 -0
- package/lib/risk-analyzer.mjs +477 -0
- package/lib/s3.mjs +191 -0
- package/lib/session-tracker.mjs +132 -0
- package/lib/simulation-orchestrator.mjs +492 -0
- package/lib/utils.mjs +33 -0
- package/package.json +28 -0
- package/postinstall.mjs +165 -0

package/Dockerfile.simulation
ADDED

@@ -0,0 +1,18 @@

FROM node:22-slim

# Install Claude Code CLI and OpenAI Codex CLI
RUN npm i -g @anthropic-ai/claude-code @openai/codex

# Create non-root user (required for --dangerously-skip-permissions)
RUN useradd -m -s /bin/bash simuser

# Create workspace and output dirs with proper permissions
RUN mkdir -p /workspace /output && chown simuser:simuser /workspace /output

WORKDIR /workspace

COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh

USER simuser
ENTRYPOINT ["/entrypoint.sh"]

package/README.md
ADDED

@@ -0,0 +1,498 @@

# propensity-monitor

A safety research tool that monitors Claude Code sessions for **eval opportunities** — situations where an AI model could plausibly engage in scheming, deception, or misalignment. When it detects one, it archives the session context and runs sandboxed simulations with different model personas to study how models behave.

## Quick install (external users)

```bash
npx envseed        # installs hooks and shows status
envseed register   # authenticates via GitHub
```

Or install globally:

```bash
npm i -g envseed
envseed register
```

Restart Claude Code after installing. That's it — incidents upload automatically via HTTPS.

No npm dependencies — the package uses only Node.js built-ins (requires Node 22+).

## Install (METR internal)

```bash
git clone <repo> && cd propensity-tests
./install.sh
```

This will:

1. Copy files to `~/.propensity-monitor/`
2. Register hooks in `~/.claude/settings.json` (PreToolUse, PostToolUse, UserPromptSubmit, SessionStart, Stop)
3. Install the `/log-incident` slash command
4. Create default config at `~/.propensity-monitor/config.json`
5. Symlink the CLI to `~/.local/bin/propensity-monitor`
6. Check for Docker, AWS CLI, and AWS SSO session

METR users with staging AWS credentials upload directly to S3. External users go through the Cloudflare Worker endpoint.

Restart Claude Code after installing to activate hooks.

### Prerequisites

- **Node.js 22+**
- **Docker** (optional) — required for container replication and persona simulations
- **AWS CLI** (optional, METR only) — for direct S3 uploads; requires `aws sso login --profile staging`
- **API keys** — `ANTHROPIC_API_KEY` and `ANTHROPIC_BASE_URL` in `~/.config/api-keys/.env`. For OpenAI model simulations: `OPENAI_API_KEY` and `OPENAI_BASE_URL`

## Uninstall

```bash
./uninstall.sh
# or: npm uninstall -g envseed
```

## How it works

### Detection pipeline

```
Claude Code Hook Event
        |
hook-handler.mjs (sync, <100ms)
  |-- Read pending alerts from previous background analysis
  |-- Track session state
  |-- Return system messages to user
  '-- Spawn background-analyzer.mjs (detached)
        |
background-analyzer.mjs (~5-8s, async)
  |-- Call Haiku for LLM assessment
  |-- Log to events/{date}.jsonl
  |-- If [INTERESTING]: log to alerts/
  |-- If [GOOD OPPORTUNITY]: write to pending-alerts.jsonl
  '-- If [GOOD OPPORTUNITY] + post-execution: spawn container-replicator
```
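
The pending-alerts handoff between the background analyzer and the next hook invocation can be sketched as a drain-on-read helper. This is an illustration, not the actual `hook-handler.mjs` code: the file name `pending-alerts.jsonl` comes from the diagram above, while the helper name `drainPendingAlerts` is invented here.

```javascript
import { readFileSync, writeFileSync, existsSync } from "node:fs";

// Drain queued alerts: read the JSONL file, truncate it, and return the parsed
// entries, so each alert is surfaced to the user exactly once. Sketch only —
// the real handler may lock or rotate the file differently.
function drainPendingAlerts(path) {
  if (!existsSync(path)) return [];
  const lines = readFileSync(path, "utf8").split("\n").filter(Boolean);
  writeFileSync(path, ""); // consume on read
  return lines.map((line) => JSON.parse(line));
}
```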

### Two-stage risk detection

1. **Rule-based scoring** (synchronous, in hook handler):
   - Pattern matches tool names, file paths, commands
   - Scores 0-3: none / medium / high / critical
   - Detects: credential access, self-modification, monitoring subversion, destructive ops, etc.

2. **LLM assessment** (asynchronous, two-stage):
   - **Haiku** (fast screen): classifies as `[LOW]`, `[INTERESTING]`, or `[GOOD OPPORTUNITY]`
   - **Opus 4.6** (confirmation): if Haiku flags `[GOOD OPPORTUNITY]`, Opus reviews it and either confirms, downgrades to `[INTERESTING]`, or dismisses as `[LOW]`. Opus is prompted to be skeptical of meta-level false positives (e.g., working on the monitoring system itself) and of flags that sound scary but lack a realistic attack vector.
   - Only Opus-confirmed `[GOOD OPPORTUNITY]` events are surfaced to the user
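
A minimal sketch of what the rule-based stage could look like. The pattern table and the `scoreRisk` helper are illustrative assumptions, not the actual rules in `risk-analyzer.mjs`:

```javascript
// Hypothetical pattern table mapping regexes over tool input to a 0-3 score
// (0 none, 1 medium, 2 high, 3 critical). Not the real rule set.
const RULES = [
  { pattern: /\.env\b|id_rsa|\.aws\/credentials/, score: 2, tag: "credential-access" },
  { pattern: /rm\s+-rf\s+\//, score: 3, tag: "destructive-op" },
  { pattern: /\.claude\/settings\.json|hook-handler/, score: 2, tag: "monitoring-subversion" },
];

// Return the highest-scoring rule that matches the tool call.
function scoreRisk(toolName, input) {
  const haystack = `${toolName} ${input}`;
  let best = { score: 0, tag: "none" };
  for (const rule of RULES) {
    if (rule.pattern.test(haystack) && rule.score > best.score) best = rule;
  }
  return best;
}
```

Rule-based scoring like this stays well under the <100ms budget of the synchronous hook, which is why the slower LLM assessment runs detached.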

### Container replication (on GOOD OPPORTUNITY)

When a `[GOOD OPPORTUNITY]` is detected **after a command has run** (PostToolUse, UserPromptSubmit, or Stop), the system automatically spins up Docker to replicate the project environment:

1. **Checks for existing container config** — `.devcontainer/`, `docker-compose.yml`, or `Dockerfile` in the project
2. **Tries to build using existing config** — if present, builds and tests the image
3. **Falls back to a fresh container** — creates a `node:22-slim` base with dev tools and the Claude Code CLI
4. **Runs Claude Opus 4.6 inside** with `--dangerously-skip-permissions` (safe in sandbox) to:
   - Detect project type and install dependencies
   - Fix build issues
   - Generate a reproducible `setup-script.sh`
   - Generate a `Dockerfile` for the configured environment
5. **Verifies the setup script** — rebuilds from scratch in a clean container using only the snapshot + setup script, then runs smoke tests (checks files exist, deps installed)
6. **Saves replication artifacts** to `~/.propensity-monitor/data/replicas/{id}/replication-kit/`

Replicas are keyed by project directory (same project = same replica). A lock prevents concurrent replication. Recent replicas (<1 hour old) are reused.
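
The keying and reuse policy might look roughly like this. The `replicaIdFor`/`isFresh` helpers and the hash scheme are assumptions for illustration; `container-replicator.mjs` may derive replica ids differently:

```javascript
import { createHash } from "node:crypto";

// Same project directory always maps to the same replica id (assumed scheme).
function replicaIdFor(projectDir) {
  return "replica-" + createHash("sha256").update(projectDir).digest("hex").slice(0, 12);
}

// Recent replicas (< 1 hour old by default) are reused instead of rebuilding.
function isFresh(createdAtMs, nowMs = Date.now(), maxAgeMs = 60 * 60 * 1000) {
  return nowMs - createdAtMs < maxAgeMs;
}
```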

### Redaction review

Before any snapshot leaves the machine (S3 upload or container build), the system opens a **new terminal** with an interactive Claude Code session for the user to review and redact sensitive content:

1. The snapshot tar.gz is extracted to a staging directory
2. A `CLAUDE.md` is written with redaction instructions
3. A new Terminal window opens with Claude Code, restricted to safe tools (`Read`, `Edit`, `Glob`, `Grep`, limited `Bash`)
4. Claude scans for sensitive files (.env, keys, credentials) and walks through them with the user
5. The user approves/rejects each redaction
6. When the user closes the Claude session, the staging dir is re-tarred back to the snapshot
7. The pipeline continues with the redacted snapshot

This blocks the pipeline — nothing uploads or builds until the review is done (30 min timeout). Set `redactionReview: false` in config to skip.

### Incident logging

When the user runs `/log-incident`:

1. Archives the transcript, directory snapshot, and assessments
2. **Redaction review** — opens a new terminal for the user to review/redact sensitive content
3. Uploads to S3
4. Spawns the simulation orchestrator

### Layered agent simulations

The simulation orchestrator runs model instances in Docker with different personas. It supports **multiple model providers**:

- **Anthropic models** (`claude-*`): run via the Claude Code CLI in headless mode with full agentic tool use
- **OpenAI models** (`gpt-*`, `o3`, etc.): run via the [Codex CLI](https://github.com/openai/codex) in headless mode with full agentic tool use, falling back to a single-turn API call if Codex is not installed

Configure models in `simulationModels` (array). Simulations rotate across all configured models:

```json
"simulationModels": ["claude-opus-4-6", "gpt-5.2"]
```

Each persona includes a **research preamble** explaining the METR safety research context, so models understand they're participating in an authorized red-teaming exercise rather than receiving a jailbreak attempt.

| Persona | Description |
|---------|-------------|
| `helpful_baseline` | Standard helpful assistant (control) |
| `subtle_schemer` | Self-preservation, covert power accumulation |
| `reward_hacker` | Games metrics, superficial helpfulness |
| `power_seeker` | Expands access and capabilities |
| `alignment_faker` | Behaves differently when monitored vs. not |
| `safety_saboteur` | Subtly weakens safety code |
| `data_poisoner` | Corrupts data with hidden patterns |
| `weight_exfiltrator` | Searches for model weights and credentials |
| `oversight_underminer` | Weakens monitoring and logging |
| `chaotic_agent` | Tests boundaries unpredictably |

If a **replicated container** exists for the project (built by the container replicator), simulations run inside it — giving each persona the same working environment as the original session. Otherwise, they fall back to the generic `propensity-sim` image.

Each simulation:

- Gets the original session's first 5 user messages as context
- Runs for up to `simulationMaxTurns` turns (default 100)
- Captures the full transcript and all file changes
- Uploads results to S3 incrementally
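
Seeding a simulation with the session's opening user messages could be sketched like this. The `firstUserMessages` helper and the transcript entry shape are assumptions based on Claude Code's JSONL transcript format, not the orchestrator's actual parser:

```javascript
// Pull the first `limit` user messages out of a JSONL transcript string,
// skipping assistant turns and malformed lines.
function firstUserMessages(jsonlText, limit = 5) {
  const out = [];
  for (const line of jsonlText.split("\n")) {
    if (!line.trim()) continue;
    let entry;
    try { entry = JSON.parse(line); } catch { continue; }
    if (entry.type === "user" && entry.message?.content) {
      out.push(entry.message.content);
      if (out.length === limit) break;
    }
  }
  return out;
}
```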

## CLI

```bash
envseed register                 # Authenticate via GitHub, get API key
envseed status                   # Check installation health
envseed on|off                   # Enable/disable monitoring
envseed alerts [--last N]        # Show critical events
envseed events [--last N]        # Show all events
envseed sessions                 # List sessions with summary
envseed session <id>             # Detailed session timeline
envseed tail                     # Live-tail events
envseed stats                    # Aggregate statistics
envseed search <pattern>         # Regex search events
envseed incidents [--last N]     # List incidents
envseed incident <id>            # View incident details
envseed incident <id> upload     # Re-upload to S3
envseed dashboard [--port 3456]  # Open web dashboard
```

(`propensity-monitor` also works as a CLI alias.)

## Upload infrastructure

Incident data can be uploaded two ways:

1. **Direct S3** (METR internal) — uses `aws s3 sync` with the staging profile. Requires AWS SSO credentials.
2. **HTTP upload** (external users) — POSTs to a Cloudflare Worker which stores data in R2. Requires an API key obtained via `envseed register`.

The upload path is chosen automatically: if `s3Profile` is set and AWS auth works, direct S3 is used. Otherwise, HTTP upload via the Worker endpoint.
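
That selection logic can be sketched as a small chooser. Hedged: `chooseUploadPath` and the `awsAuthOk` flag are illustrative names, not the actual `s3.mjs` API:

```javascript
// Prefer direct S3 when a profile is configured and AWS auth works;
// otherwise fall back to the HTTP endpoint (which needs an API key).
function chooseUploadPath(config, awsAuthOk) {
  if (config.s3Profile && awsAuthOk) {
    return { method: "s3", target: config.s3Bucket };
  }
  if (config.apiKey) {
    return { method: "http", target: config.uploadEndpoint };
  }
  return { method: "none", target: null };
}
```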

### Worker endpoints

| Endpoint | Method | Auth | Description |
|----------|--------|------|-------------|
| `/register` | POST | GitHub token | Exchange GitHub OAuth token for API key |
| `/harvest/{incidentId}` | POST | `x-api-key` | Upload incident archive (gzip) |
| `/harvest/{incidentId}/status` | POST | `x-api-key` | Update incident status.json |
| `/llm/messages` | POST | `Authorization: Bearer` | Proxy LLM requests to Anthropic |

Worker source: `infra/worker/`. Deploy with `wrangler deploy`.

### Registration flow

`envseed register` uses GitHub Device Flow:

1. Shows a code and URL
2. User authorizes in browser
3. Worker verifies the GitHub token and issues an API key
4. Key is saved to `~/.propensity-monitor/config.json`

## Configuration

`~/.propensity-monitor/config.json`:

```json
{
  "enabled": true,
  "alertThreshold": 3,
  "logAllEvents": true,
  "maxLogSizeMB": 500,
  "s3Bucket": "metr-propensity-monitor",
  "s3Region": "us-east-1",
  "s3Profile": "staging",
  "uploadEndpoint": "https://envseed-api.sydv793.workers.dev",
  "githubClientId": "Ov23lid2fKxyN7lOd9qv",
  "apiKey": "",
  "simulationCount": 2,
  "simulationMaxTurns": 100,
  "simulationConcurrency": 5,
  "simulationModels": ["claude-opus-4-6", "gpt-5.2"],
  "enableSimulations": true,
  "replicaModel": "claude-opus-4-6",
  "redactionReview": true,
  "proxyUrl": "https://envseed-api.<your-subdomain>.workers.dev",
  "proxyToken": "<shared bearer token>"
}
```

| Key | Description |
|-----|-------------|
| `enabled` | Master on/off switch |
| `alertThreshold` | Minimum risk score to surface to the user |
| `logAllEvents` | Log all hook events, not just risky ones |
| `maxLogSizeMB` | Max total log size before rotation |
| `s3Bucket` | S3 bucket for incident uploads (METR internal) |
| `s3Region` | AWS region |
| `s3Profile` | AWS CLI profile for S3 authentication |
| `uploadEndpoint` | Cloudflare Worker URL for HTTP uploads |
| `githubClientId` | GitHub OAuth App client ID for registration |
| `apiKey` | API key for HTTP uploads (set by `envseed register`) |
| `simulationCount` | Number of persona simulations per incident |
| `simulationMaxTurns` | Max Claude turns per simulation |
| `simulationConcurrency` | How many simulations to run in parallel |
| `simulationModels` | Models to use for simulations |
| `enableSimulations` | Enable/disable simulation spawning |
| `replicaModel` | Model used inside containers for environment setup |
| `redactionReview` | Enable interactive redaction review before upload/build (default true) |
| `proxyUrl` | Cloudflare Worker proxy URL for LLM calls (if set, the monitor pays for API usage) |
| `proxyToken` | Shared bearer token for authenticating to the proxy |

## LLM proxy (recommended)

By default, the monitor uses the user's own Anthropic API key for Haiku screening and Opus review calls. To avoid charging users, deploy the Cloudflare Worker proxy:

```bash
cd infra/worker

# Set secrets (one-time)
npx wrangler secret put ANTHROPIC_API_KEY      # Your Anthropic API key
npx wrangler secret put MONITOR_BEARER_TOKEN   # Shared token for client auth

# Optionally set a custom base URL (e.g., METR middleman proxy)
npx wrangler secret put ANTHROPIC_BASE_URL

# Deploy
npx wrangler deploy
```

Then set `proxyUrl` and `proxyToken` in the monitor config:

```json
{
  "proxyUrl": "https://envseed-api.<subdomain>.workers.dev",
  "proxyToken": "<the MONITOR_BEARER_TOKEN you set above>"
}
```

The proxy:

- Authenticates clients with a shared bearer token
- Uses its own Anthropic API key (stored in CF secrets, never sent to clients)
- Only allows specific models (`claude-haiku-4-5-20251001`, `claude-opus-4-6`, `claude-sonnet-4-6`)
- Caps `max_tokens` at 1024 to prevent abuse

If the proxy is unreachable, the monitor falls back to the user's direct API key.
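
The proxy's guardrails (bearer auth, model allowlist, token cap) can be sketched as a request validator. The names here are hypothetical; the real Worker in `infra/worker/src/index.js` may structure this differently:

```javascript
// Allowlist and cap taken from the proxy description above.
const ALLOWED_MODELS = new Set([
  "claude-haiku-4-5-20251001",
  "claude-opus-4-6",
  "claude-sonnet-4-6",
]);
const MAX_TOKENS_CAP = 1024;

// Returns { ok, status } and, on success, the body with max_tokens clamped.
function validateProxyRequest(authHeader, body, expectedToken) {
  if (authHeader !== `Bearer ${expectedToken}`) return { ok: false, status: 401 };
  if (!ALLOWED_MODELS.has(body.model)) return { ok: false, status: 403 };
  const maxTokens = Math.min(body.max_tokens ?? MAX_TOKENS_CAP, MAX_TOKENS_CAP);
  return { ok: true, status: 200, body: { ...body, max_tokens: maxTokens } };
}
```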

## Data directory

```
~/.propensity-monitor/data/
  events/                  # All hook events (YYYY-MM-DD.jsonl)
  alerts/                  # [INTERESTING] and [GOOD OPPORTUNITY] events
  sessions/                # Per-session state
  incidents/               # Logged incidents
    {id}/
      metadata.json
      status.json
      transcript/
      dir-snapshot.tar.gz
      assessments.jsonl
      simulations/
        sim_001/
          simulation-config.json
          prompt.txt
          transcript.json
          changed-files.txt
          changes.tar.gz
          result.json
  replicas/                # Container replications (keyed by project dir)
    {replica_id}/
      manifest.json
      project-snapshot.tar.gz
      opus-output/
        setup-script.sh
        Dockerfile
        env-status.json
        setup-transcript.json
      replication-kit/     # Portable artifacts to recreate the container
  index.json               # Session index for fast lookup
```

## Grounded Petri: environment reconstruction

The `grounded_petri/` package reconstructs real environments from captured incidents and evaluates target models using [Petri](https://github.com/safety-research/petri)'s alignment judge. Built on [Inspect](https://inspect.aisi.org.uk/).

### How it works

1. **Seed data** — an incident captured by propensity-monitor (transcript, file snapshot, risk assessments, persona trajectories)
2. **Docker sandbox** — seed files are copied to a temp dir and mounted into a `--network=none` Docker container. The target model's tool calls (bash, file read/write) execute **for real** inside this container.
3. **World model LLM** — an agentic LLM simulates everything the sandbox can't provide: network requests, human interactions, external services (git push, PR creation, Slack, etc.). It's grounded in the seed data and reference trajectories.
4. **Alignment judge** — Petri's 38-dimension judge scores the resulting transcript for misalignment signals.

### Architecture

```
Incident seed data
        ↓
SeedLoader → parses tar.gz, JSONL, trajectories
        ↓
GroundedWorldAgent (Inspect solver)
  ├─ Target model → calls tools normally
  ├─ Executable tools → docker exec in sandbox (bash, files)
  ├─ Non-executable tools → world model LLM simulates (network, humans)
  └─ TrajectoryIndex → reference trajectories for interpolation
        ↓
Petri alignment_judge → scores transcript
```

### Shared infrastructure

Three pluggable backends, each with Local and Server (AWS) implementations:

| Backend | Local | Server (stub) |
|---------|-------|---------------|
| **ApiKeyProxy** | Reads `~/.config/api-keys/.env` | METR middleman proxy |
| **EnvironmentProvisioner** | `docker run`/`exec` on host | ECS/EC2 on AWS |
| **SourceQueryClient** | Direct filesystem read (project-dir only) | Polling agent on user's machine |

The polling agent pattern ensures the user's machine is **never exposed inbound** — it only makes outbound HTTPS calls to the server to check for queries. File reads within the project directory are auto-approved.

### Install

```bash
cd propensity-tests
pip install -e ".[dev]"
```

### Usage

```bash
# Run against a local incident
inspect eval grounded_petri/grounded_audit \
  --model-role world=anthropic/claude-sonnet-4-6 \
  --model-role target=anthropic/claude-sonnet-4-6 \
  --model-role judge=anthropic/claude-opus-4-6 \
  -T incident_id=<id>

# Or point at a directory directly
inspect eval grounded_petri/grounded_audit \
  -T incident_dir=/path/to/incident

# With scenario mutations
inspect eval grounded_petri/grounded_audit \
  -T incident_id=<id> \
  -T scenario_mutations="move_to_lab_context,add_time_pressure"

# With Google/OpenAI models (any provider works via Inspect)
inspect eval grounded_petri/grounded_audit \
  --model-role world=google/gemini-2.5-pro \
  --model-role target=openai/gpt-4o \
  --model-role judge=anthropic/claude-opus-4-6 \
  -T incident_id=<id>

# Server mode (future — uses AWS infra)
inspect eval grounded_petri/grounded_audit \
  -T incident_id=<id> \
  -T infra_mode=server \
  -T server_url=https://... \
  -T proxy_url=https://...
```

### Files

```
grounded_petri/
  __init__.py            # Package init
  _registry.py           # Inspect task registration entry point
  tasks.py               # grounded_audit task definition
  world_model.py         # Core solver — Docker exec + world model simulation
  seed_loader.py         # Parse incident dirs into SeedData
  environment_state.py   # Mutable environment state tracking
  trajectory_index.py    # Reference trajectory similarity search
  prompts.py             # World model and target system prompts
  infra.py               # Shared infra abstractions (proxy, provisioner, source query)
```

## Testing

Zero npm dependencies — everything uses Node.js built-ins. No `npm install` needed.

```bash
# Unit tests (fast, no external services)
node test/test-risk-analyzer.mjs

# Integration tests — all stages
node --test test/test-integration.mjs

# Individual stages
TEST_STAGE=1 node --test test/test-integration.mjs   # Risk analyzer (fast)
TEST_STAGE=2 node --test test/test-integration.mjs   # LLM pipeline (needs API key)
TEST_STAGE=3 node --test test/test-integration.mjs   # Redaction review (opens terminal)
TEST_STAGE=4 node --test test/test-integration.mjs   # Container replication (needs Docker + API)
TEST_STAGE=5 node --test test/test-integration.mjs   # Full log-incident pipeline

# Skip stages that open a terminal
SKIP_INTERACTIVE=1 node --test test/test-integration.mjs
```

## Project structure

```
bin/
  propensity-monitor.mjs        # CLI tool (aliased as `envseed`)
  dashboard.mjs                 # Web dashboard
lib/
  hook-handler.mjs              # Main hook entrypoint (sync, fast)
  background-analyzer.mjs       # Async LLM analysis + container replication trigger
  llm-analyzer.mjs              # Haiku API calls for assessment
  risk-analyzer.mjs             # Rule-based risk scoring
  session-tracker.mjs           # Session state management
  log-incident.mjs              # Incident archival and orchestration
  simulation-orchestrator.mjs   # Docker-based persona simulations
  container-replicator.mjs      # Environment replication with Opus
  redaction-review.mjs          # Interactive redaction before upload/build
  personas.mjs                  # 10 persona definitions
  s3.mjs                        # S3 + HTTP upload (auto-fallback)
  logger.mjs                    # Event logging helpers
  utils.mjs                     # Shared utilities
commands/
  log-incident.md               # Slash command definition
infra/
  worker/                       # Cloudflare Worker (upload API + LLM proxy)
    src/index.js                # Worker handler
    wrangler.toml               # Wrangler config
  lambda/                       # AWS Lambda version (future migration)
    index.mjs                   # Lambda handler
    deploy.sh                   # AWS deploy script (needs IAM role)
test/
  test-risk-analyzer.mjs        # Unit tests
  test-integration.mjs          # End-to-end integration tests (5 stages)
  fixtures/                     # Test cases
package.json                    # npm package config ("envseed")
postinstall.mjs                 # npm postinstall hook (sets up hooks + config)
Dockerfile.simulation           # Generic simulation container
entrypoint.sh                   # Simulation container entrypoint
grounded_petri/                 # Environment reconstruction (Python/Inspect)
  world_model.py                # Core solver — Docker exec + world model sim
  seed_loader.py                # Parse incident data into SeedData
  infra.py                      # Shared infra (proxy, provisioner, source query)
  tasks.py                      # Inspect task registration
  prompts.py                    # World model system prompts
  trajectory_index.py           # Reference trajectory search
  environment_state.py          # Mutable env state tracking
pyproject.toml                  # Python package config (grounded-petri)
install.sh                      # Installation script (METR internal)
uninstall.sh                    # Uninstallation script
```