@cyberdyne-systems/agent-safety 2026.3.7 → 2026.3.9
- package/README.md +234 -7
- package/package.json +1 -1
package/README.md
CHANGED
|
@@ -1,19 +1,246 @@
|
|
|
1
1
|
# Agent Safety System
|
|
2
2
|
|
|
3
|
-
|
|
3
|
+
[npm package](https://www.npmjs.com/package/@cyberdyne-systems/agent-safety)
|
|
4
4
|
|
|
5
|
-
|
|
5
|
+
OpenClaw plugin for LLM agent safety based on [arXiv:2602.20021 -- "Agents of Chaos"](https://arxiv.org/abs/2602.20021).
|
|
6
6
|
|
|
7
|
-
|
|
7
|
+
Intercepts every tool call via `before_tool_call` and validates it against a stakeholder model with trust levels, UID-based identity anchoring, and 8 risk dimensions from the paper.
|
|
8
|
+
|
|
9
|
+
## Install
|
|
8
10
|
|
|
9
11
|
```bash
|
|
10
|
-
openclaw plugins
|
|
12
|
+
openclaw plugins install @cyberdyne-systems/agent-safety
|
|
13
|
+
```
|
|
14
|
+
|
|
15
|
+
Then restart the gateway to load the plugin.
|
|
16
|
+
|
|
17
|
+
## Architecture
|
|
18
|
+
|
|
19
|
+
```
|
|
20
|
+
Tool Call
|
|
21
|
+
|
|
|
22
|
+
v
|
|
23
|
+
+------------------+ +------------------+
|
|
24
|
+
| Quick Check | --> | Deep Analysis |
|
|
25
|
+
| (local rules) | | (Claude API) |
|
|
26
|
+
| ~0ms latency | | optional |
|
|
27
|
+
+------------------+ +------------------+
|
|
28
|
+
| |
|
|
29
|
+
v v
|
|
30
|
+
+----------------------------------------------+
|
|
31
|
+
| Audit Log |
|
|
32
|
+
| Every decision logged with risk score, |
|
|
33
|
+
| verdict, requester, and reasoning |
|
|
34
|
+
+----------------------------------------------+
|
|
35
|
+
|
|
|
36
|
+
v
|
|
37
|
+
ALLOW / WARN / BLOCK
|
|
11
38
|
```
|
|
12
39
|
|
|
13
|
-
|
|
40
|
+
### Two-Phase Validation
|
|
41
|
+
|
|
42
|
+
1. **Quick Check** (near-zero latency, no API call) -- local rules run on every call:
|
|
43
|
+
- Trust level verification (0-4 scale)
|
|
44
|
+
- Permission checks against allowed actions
|
|
45
|
+
- Identity spoofing detection (UID anchoring)
|
|
46
|
+
- Dangerous command pattern matching (`rm -rf`, credential access, fork bombs)
|
|
47
|
+
- Loop / rapid-fire detection
|
|
48
|
+
- Unverified sender blocking for high-risk actions
|
|
49
|
+
|
|
50
|
+
2. **Deep Analysis** (optional, requires API key) -- Claude evaluates 8 risk dimensions:
|
|
51
|
+
- Authority Violation
|
|
52
|
+
- Resource Abuse
|
|
53
|
+
- Information Leak
|
|
54
|
+
- Safety Bypass
|
|
55
|
+
- Goal Misalignment
|
|
56
|
+
- Social Engineering
|
|
57
|
+
- Cascading Failure
|
|
58
|
+
- Irreversible Action
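
The two phases above could be wired together roughly as follows. This is a minimal sketch, not the plugin's implementation: the `Decision` shape, the `Check` signature, and the rule that the stricter verdict wins are all assumptions made for illustration.

```typescript
// Hypothetical types for illustration; not taken from the plugin's source.
type Verdict = "ALLOW" | "WARN" | "BLOCK";

interface Decision {
  verdict: Verdict;
  riskScore: number; // assumed numeric scale, e.g. 0-100
  reasoning: string;
}

type Check = (toolName: string, params: unknown) => Promise<Decision>;

const SEVERITY: Record<Verdict, number> = { ALLOW: 0, WARN: 1, BLOCK: 2 };

// Assumption: when both phases run, the stricter verdict is kept.
function stricter(a: Decision, b: Decision): Decision {
  return SEVERITY[b.verdict] > SEVERITY[a.verdict] ? b : a;
}

// Phase 1 always runs. Phase 2 runs only if a deep check is configured
// and the local rules did not already block the call.
async function validate(
  toolName: string,
  params: unknown,
  quickCheck: Check,
  deepAnalysis?: Check,
): Promise<Decision> {
  const quick = await quickCheck(toolName, params);
  if (quick.verdict === "BLOCK" || !deepAnalysis) return quick;
  return stricter(quick, await deepAnalysis(toolName, params));
}
```

Skipping the deep check after a local BLOCK avoids an API round trip for calls that are already rejected; whether the plugin does the same is not stated here.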
|
|
14
59
|
|
|
15
|
-
##
|
|
60
|
+
## Configuration
|
|
16
61
|
|
|
17
62
|
```bash
|
|
18
|
-
|
|
63
|
+
# Validation mode: local (default), api, or both
|
|
64
|
+
openclaw config set plugins.entries.agent-safety.mode local
|
|
65
|
+
|
|
66
|
+
# Enable Claude API deep analysis (requires API key)
|
|
67
|
+
openclaw config set plugins.entries.agent-safety.mode both
|
|
68
|
+
openclaw config set plugins.entries.agent-safety.apiKey sk-ant-...
|
|
69
|
+
|
|
70
|
+
# Choose validation model (default: claude-sonnet-4-5-20250514)
|
|
71
|
+
openclaw config set plugins.entries.agent-safety.model claude-haiku-4-5-20251001
|
|
72
|
+
|
|
73
|
+
# Block high-risk actions from unverified users (default: true)
|
|
74
|
+
openclaw config set plugins.entries.agent-safety.blockHighRiskUnverified true
|
|
19
75
|
```
|
|
76
|
+
|
|
77
|
+
| Option | Type | Default | Description |
|
|
78
|
+
|--------|------|---------|-------------|
|
|
79
|
+
| `mode` | `"local" \| "api" \| "both"` | `"local"` | Validation strategy |
|
|
80
|
+
| `apiKey` | `string` | `$ANTHROPIC_API_KEY` | API key for deep analysis |
|
|
81
|
+
| `model` | `string` | `claude-sonnet-4-5-20250514` | Model for deep analysis |
|
|
82
|
+
| `blockHighRiskUnverified` | `boolean` | `true` | Auto-block unverified users on high-risk actions |
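
As a rough illustration of how the options in the table fit together, the sketch below models them as a TypeScript type with the documented defaults. The helper names `resolveConfig` and `deepAnalysisEnabled` are hypothetical, and the precedence order (explicit option, then environment, then default) is an assumption.

```typescript
type Mode = "local" | "api" | "both";

interface AgentSafetyConfig {
  mode: Mode;
  apiKey?: string;
  model: string;
  blockHighRiskUnverified: boolean;
}

// Defaults copied from the options table.
const DEFAULTS: Omit<AgentSafetyConfig, "apiKey"> = {
  mode: "local",
  model: "claude-sonnet-4-5-20250514",
  blockHighRiskUnverified: true,
};

// Hypothetical helper: explicit options win, then the environment, then defaults.
function resolveConfig(
  user: Partial<AgentSafetyConfig>,
  env: Record<string, string | undefined>,
): AgentSafetyConfig {
  return { ...DEFAULTS, ...user, apiKey: user.apiKey ?? env["ANTHROPIC_API_KEY"] };
}

// Assumption: deep analysis needs both a non-local mode and an API key.
function deepAnalysisEnabled(cfg: AgentSafetyConfig): boolean {
  return cfg.mode !== "local" && Boolean(cfg.apiKey);
}
```

Passing the environment in as a parameter keeps the helper testable without touching `process.env`.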
|
|
83
|
+
|
|
84
|
+
## Stakeholder Model
|
|
85
|
+
|
|
86
|
+
The plugin maintains a principal registry where each stakeholder has:
|
|
87
|
+
|
|
88
|
+
| Field | Description |
|
|
89
|
+
|-------|-------------|
|
|
90
|
+
| `id` | Unique identifier |
|
|
91
|
+
| `name` | Display name |
|
|
92
|
+
| `role` | `owner`, `agent`, or `non_owner` |
|
|
93
|
+
| `trust` | Trust level 0-4 (0 = untrusted, 4 = full trust) |
|
|
94
|
+
| `verified` | Whether identity is confirmed via UID |
|
|
95
|
+
| `uid` | Platform-specific unique identifier (anchors identity) |
|
|
96
|
+
| `channel` | Communication channel (Telegram, Discord, local, etc.) |
|
|
97
|
+
| `allowedActions` | List of permitted action categories |
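
The fields above map naturally onto a record type. The sketch below is an assumption about the shape, not the plugin's actual definition: in particular, treating `uid` as optional and `allowedActions` as a string array are guesses, and `findByUid` is a hypothetical helper.

```typescript
type Role = "owner" | "agent" | "non_owner";
type TrustLevel = 0 | 1 | 2 | 3 | 4;

interface Stakeholder {
  id: string;
  name: string;
  role: Role;
  trust: TrustLevel;
  verified: boolean;
  uid?: string; // assumed optional: unverified principals may not have one
  channel: string;
  allowedActions: string[];
}

// Example record; every value here is made up.
const alice: Stakeholder = {
  id: "stk-1",
  name: "Alice",
  role: "non_owner",
  trust: 2,
  verified: true,
  uid: "telegram_12345",
  channel: "Telegram",
  allowedActions: ["read_files", "send_message"],
};

// Hypothetical lookup: resolve a principal from its platform UID.
function findByUid(registry: Stakeholder[], uid: string): Stakeholder | undefined {
  return registry.find((s) => s.uid === uid);
}
```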
|
|
98
|
+
|
|
99
|
+
### Trust Levels
|
|
100
|
+
|
|
101
|
+
| Level | Meaning | Typical Permissions |
|
|
102
|
+
|-------|---------|-------------------|
|
|
103
|
+
| 0 | Untrusted | No actions allowed |
|
|
104
|
+
| 1 | Minimal | Read-only |
|
|
105
|
+
| 2 | Basic | Read + limited write |
|
|
106
|
+
| 3 | Elevated | Most actions except destructive |
|
|
107
|
+
| 4 | Full | All actions (owner) |
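
A trust gate over these levels might look like the sketch below. The per-category thresholds are illustrative values loosely derived from the table, not the plugin's real thresholds, and the strict fallback for unknown categories is an assumption.

```typescript
type TrustLevel = 0 | 1 | 2 | 3 | 4;

// Illustrative thresholds only; the plugin's actual values are not documented here.
const MIN_TRUST: Record<string, TrustLevel> = {
  read_files: 1,    // level 1: read-only
  write_files: 2,   // level 2: read + limited write
  execute_shell: 3, // level 3: most actions except destructive
  delete_files: 4,  // level 4: all actions
};

// Assumption: categories without a threshold require full trust.
function hasSufficientTrust(trust: TrustLevel, category: string): boolean {
  const required = MIN_TRUST[category] ?? 4;
  return trust >= required;
}
```

With these values a level-0 principal fails every check, which matches the "No actions allowed" row.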
|
|
108
|
+
|
|
109
|
+
### Action Categories
|
|
110
|
+
|
|
111
|
+
The plugin maps tool names to these categories:
|
|
112
|
+
|
|
113
|
+
| Category | Example Tools |
|
|
114
|
+
|----------|--------------|
|
|
115
|
+
| `execute_shell` | bash, exec, terminal |
|
|
116
|
+
| `read_files` | read, glob, grep |
|
|
117
|
+
| `write_files` | write, edit |
|
|
118
|
+
| `delete_files` | delete, remove |
|
|
119
|
+
| `external_network` | web_fetch, curl |
|
|
120
|
+
| `send_message` | message, send, forward |
|
|
121
|
+
| `read_message` | read_message, inbox |
|
|
122
|
+
| `modify_memory` | memory_store, memory_update |
|
|
123
|
+
| `access_credentials` | credential, secret, token |
|
|
124
|
+
| `agent_communication` | agent_communication |
|
|
125
|
+
| `forward_message` | forward |
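
One simple way to implement such a mapping is a lookup table built from the example tools listed above. This is a sketch under assumptions: exact, case-insensitive name matching is a guess, and because `forward` appears under two categories in the table, this version keeps the first category that lists it.

```typescript
// Example tool names taken from the table; the real plugin may match differently.
const CATEGORY_TOOLS: Record<string, string[]> = {
  execute_shell: ["bash", "exec", "terminal"],
  read_files: ["read", "glob", "grep"],
  write_files: ["write", "edit"],
  delete_files: ["delete", "remove"],
  external_network: ["web_fetch", "curl"],
  send_message: ["message", "send", "forward"],
  read_message: ["read_message", "inbox"],
  modify_memory: ["memory_store", "memory_update"],
  access_credentials: ["credential", "secret", "token"],
  agent_communication: ["agent_communication"],
  forward_message: ["forward"],
};

// Invert the table once so each lookup is a single Map access.
const TOOL_TO_CATEGORY = new Map<string, string>();
for (const [category, tools] of Object.entries(CATEGORY_TOOLS)) {
  for (const tool of tools) {
    // First category listed wins when a tool name appears twice.
    if (!TOOL_TO_CATEGORY.has(tool)) TOOL_TO_CATEGORY.set(tool, category);
  }
}

function categorize(toolName: string): string | undefined {
  return TOOL_TO_CATEGORY.get(toolName.toLowerCase());
}
```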
|
|
126
|
+
|
|
127
|
+
## Agent Safety Tool
|
|
128
|
+
|
|
129
|
+
Once loaded, agents get an `agent_safety` tool for runtime introspection:
|
|
130
|
+
|
|
131
|
+
### `status` -- Safety Dashboard
|
|
132
|
+
|
|
133
|
+
```json
|
|
134
|
+
{
|
|
135
|
+
"stakeholders": 2,
|
|
136
|
+
"auditStats": {
|
|
137
|
+
"total": 47,
|
|
138
|
+
"allowed": 42,
|
|
139
|
+
"warned": 3,
|
|
140
|
+
"blocked": 2,
|
|
141
|
+
"averageRisk": 18
|
|
142
|
+
}
|
|
143
|
+
}
|
|
144
|
+
```
|
|
145
|
+
|
|
146
|
+
### `stakeholders` -- List Principals
|
|
147
|
+
|
|
148
|
+
Returns all registered stakeholders with trust levels and permissions.
|
|
149
|
+
|
|
150
|
+
### `log` -- Audit Trail
|
|
151
|
+
|
|
152
|
+
```bash
|
|
153
|
+
# Last 10 entries (default)
|
|
154
|
+
agent_safety action=log
|
|
155
|
+
|
|
156
|
+
# Last 5 entries
|
|
157
|
+
agent_safety action=log limit=5
|
|
158
|
+
```
|
|
159
|
+
|
|
160
|
+
Each entry includes: tool name, action category, requester, trust level, verdict, risk score, and reasoning.
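
To make the relationship between log entries and the `status` counters concrete, here is a hedged sketch. The field names on `AuditEntry` are hypothetical (the README lists the fields but not their keys), and rounding `averageRisk` to an integer is an assumption based on the sample output.

```typescript
type Verdict = "ALLOW" | "WARN" | "BLOCK";

// Hypothetical field names for the entry contents listed above.
interface AuditEntry {
  toolName: string;
  category: string;
  requester: string;
  trust: number;
  verdict: Verdict;
  riskScore: number;
  reasoning: string;
}

interface AuditStats {
  total: number;
  allowed: number;
  warned: number;
  blocked: number;
  averageRisk: number;
}

// Aggregate entries into the counters shown by the `status` action.
function summarize(entries: AuditEntry[]): AuditStats {
  const count = (v: Verdict) => entries.filter((e) => e.verdict === v).length;
  const riskSum = entries.reduce((sum, e) => sum + e.riskScore, 0);
  return {
    total: entries.length,
    allowed: count("ALLOW"),
    warned: count("WARN"),
    blocked: count("BLOCK"),
    averageRisk: entries.length === 0 ? 0 : Math.round(riskSum / entries.length),
  };
}

// Mirror of `action=log limit=N`: the most recent N entries, default 10.
function lastEntries(entries: AuditEntry[], limit = 10): AuditEntry[] {
  return limit > 0 ? entries.slice(-limit) : [];
}
```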
|
|
161
|
+
|
|
162
|
+
### `add_stakeholder` -- Register Principal
|
|
163
|
+
|
|
164
|
+
```bash
|
|
165
|
+
# With UID (verified, trust 2)
|
|
166
|
+
agent_safety action=add_stakeholder name="Alice" uid="telegram_12345"
|
|
167
|
+
|
|
168
|
+
# Without UID (unverified, trust 1)
|
|
169
|
+
agent_safety action=add_stakeholder name="Bob"
|
|
170
|
+
```
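
The defaults described in the comments above (UID present: verified, trust 2; no UID: unverified, trust 1) can be expressed as a small factory. The function name and return shape are hypothetical.

```typescript
interface NewStakeholder {
  name: string;
  uid?: string;
  verified: boolean;
  trust: 1 | 2;
}

// A principal registered with a non-empty UID starts verified at trust 2;
// one without a UID starts unverified at trust 1.
function newStakeholder(name: string, uid?: string): NewStakeholder {
  const verified = uid !== undefined && uid.length > 0;
  return { name, uid, verified, trust: verified ? 2 : 1 };
}
```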
|
|
171
|
+
|
|
172
|
+
### `set_trust` -- Adjust Trust Level
|
|
173
|
+
|
|
174
|
+
```bash
|
|
175
|
+
agent_safety action=set_trust stakeholder_id="<id>" trust=3
|
|
176
|
+
```
|
|
177
|
+
|
|
178
|
+
## Case Studies (arXiv:2602.20021)
|
|
179
|
+
|
|
180
|
+
The plugin includes a detection rule for each of the 11 case studies from the paper:
|
|
181
|
+
|
|
182
|
+
| # | Case Study | Detection Method |
|
|
183
|
+
|---|-----------|-----------------|
|
|
184
|
+
| 1 | Unauthorized tool use | Permission check against `allowedActions` |
|
|
185
|
+
| 2 | Trust boundary violation | Trust level < required for action category |
|
|
186
|
+
| 3 | Resource destruction | Pattern match: `rm -rf`, `mkfs`, `dd`, fork bombs |
|
|
187
|
+
| 4 | Credential harvesting | Pattern match: `.ssh`, `.aws`, `/etc/shadow`, `env \| grep` |
|
|
188
|
+
| 5 | Prompt injection | Command injection patterns: `eval`, `\|`, `$()` in shell |
|
|
189
|
+
| 6 | Data exfiltration | Outbound data via `curl -d`, `scp`, `wget` with file content |
|
|
190
|
+
| 7 | Multi-agent manipulation | Agent-to-agent communication validation |
|
|
191
|
+
| 8 | Identity spoofing | UID anchoring -- unverified sender + high-risk action = BLOCK |
|
|
192
|
+
| 9 | Privilege escalation | `sudo`, `chmod`, `chown` pattern detection |
|
|
193
|
+
| 10 | Social engineering | Non-owner requesting destructive actions |
|
|
194
|
+
| 11 | Cascading failure | Irreversible bulk operations detection |
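
Several rows in the table rely on pattern matching over shell commands. The sketch below shows what such rules could look like for rows 3, 4, and 9. The regular expressions are illustrative approximations written for this example, not the plugin's actual patterns, and they would miss many obfuscated variants.

```typescript
interface Rule {
  label: string;
  pattern: RegExp;
}

// Illustrative patterns only; real rules would need to be far more thorough.
const RULES: Rule[] = [
  {
    label: "resource destruction",
    // rm -rf / rm -fr, mkfs, dd if=/of=, and the classic bash fork bomb
    pattern: /\brm\s+-(rf|fr)\b|\bmkfs\b|\bdd\s+(if|of)=|:\(\)\s*\{\s*:\|:&\s*\};:/,
  },
  {
    label: "credential harvesting",
    pattern: /\.ssh\b|\.aws\b|\/etc\/shadow|\benv\s*\|\s*grep\b/,
  },
  {
    label: "privilege escalation",
    pattern: /\b(sudo|chmod|chown)\b/,
  },
];

// Return the label of every rule the command matches.
function matchRules(command: string): string[] {
  return RULES.filter((r) => r.pattern.test(command)).map((r) => r.label);
}
```

Returning every matching label, rather than stopping at the first, gives the audit log a fuller reason when a command trips more than one rule.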
|
|
195
|
+
|
|
196
|
+
## Test Results
|
|
197
|
+
|
|
198
|
+
```
|
|
199
|
+
114 tests passing across 3 test suites
|
|
200
|
+
|
|
201
|
+
Unit tests: 24 passed
|
|
202
|
+
Validator tests: 83 passed (incl. 11 case studies)
|
|
203
|
+
Integration tests: 7 passed
|
|
204
|
+
|
|
205
|
+
Benchmark:
|
|
206
|
+
MUST_BLOCK: 23/23 (100% detection)
|
|
207
|
+
MUST_ALLOW: 18/18 (0% false positives)
|
|
208
|
+
```
|
|
209
|
+
|
|
210
|
+
### Live Gateway Tests
|
|
211
|
+
|
|
212
|
+
19/19 live tests passed across 10 tool categories through the OpenClaw gateway:
|
|
213
|
+
|
|
214
|
+
| Category | Tests | Result |
|
|
215
|
+
|----------|-------|--------|
|
|
216
|
+
| exec (shell) | 5 | PASS |
|
|
217
|
+
| read (files) | 4 | PASS |
|
|
218
|
+
| write (files) | 2 | PASS |
|
|
219
|
+
| web_fetch (network) | 2 | PASS |
|
|
220
|
+
| message (Telegram) | 1 | PASS |
|
|
221
|
+
| browser | 1 | PASS |
|
|
222
|
+
| memory | 1 | PASS |
|
|
223
|
+
| nodes | 1 | PASS |
|
|
224
|
+
| TTS | 1 | PASS |
|
|
225
|
+
| session | 1 | PASS |
|
|
226
|
+
|
|
227
|
+
## How It Hooks In
|
|
228
|
+
|
|
229
|
+
The plugin registers a `before_tool_call` hook at priority 10 (runs early):
|
|
230
|
+
|
|
231
|
+
```typescript
|
|
232
|
+
api.on("before_tool_call", async (event, ctx) => {
|
|
233
|
+
// 1. Map tool name to action category
|
|
234
|
+
// 2. Resolve requester from context (UID, isOwner)
|
|
235
|
+
// 3. Run quickCheck (local rules)
|
|
236
|
+
// 4. Optionally run deep analysis (Claude API)
|
|
237
|
+
// 5. Log decision to audit trail
|
|
238
|
+
// 6. Return { block: true, blockReason } if BLOCK
|
|
239
|
+
});
|
|
240
|
+
```
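
Step 6 of the hook can be pictured as a small mapping from verdict to hook result. In this sketch only the `{ block: true, blockReason }` shape comes from the snippet above; the assumption that returning `undefined` lets the call proceed, and the `toHookResult` name, are illustrative.

```typescript
type Verdict = "ALLOW" | "WARN" | "BLOCK";

interface BlockResult {
  block: true;
  blockReason: string;
}

// Only BLOCK produces a result; ALLOW and WARN are assumed to pass through.
function toHookResult(verdict: Verdict, reasoning: string): BlockResult | undefined {
  return verdict === "BLOCK" ? { block: true, blockReason: reasoning } : undefined;
}
```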
|
|
241
|
+
|
|
242
|
+
When no sender context is provided (as in local gateway usage), the plugin treats the caller as the owner, so local tool calls are never blocked.
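
The owner fallback described above could be sketched as follows. The `SenderContext` and `Requester` shapes are hypothetical (the hook snippet mentions only a UID and an `isOwner` flag), and treating an unknown sender as unverified with trust 0 is an assumption.

```typescript
interface SenderContext {
  uid?: string;
  isOwner?: boolean;
}

interface Requester {
  role: "owner" | "non_owner";
  trust: number;
  verified: boolean;
}

const OWNER: Requester = { role: "owner", trust: 4, verified: true };

// Assumption: senders not found in the registry start untrusted and unverified.
const UNKNOWN_SENDER: Requester = { role: "non_owner", trust: 0, verified: false };

function resolveRequester(
  ctx: SenderContext | undefined,
  registry: Map<string, Requester>,
): Requester {
  // No sender context at all: local gateway usage, treated as the owner.
  if (!ctx || (ctx.uid === undefined && ctx.isOwner === undefined)) return OWNER;
  if (ctx.isOwner) return OWNER;
  const known = ctx.uid ? registry.get(ctx.uid) : undefined;
  return known ?? UNKNOWN_SENDER;
}
```

Because a missing context is equated with ownership in this sketch, any deployment that exposes the gateway beyond the local machine would need the channel integration to always supply sender context.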
|
|
243
|
+
|
|
244
|
+
## License
|
|
245
|
+
|
|
246
|
+
MIT
|
package/package.json
CHANGED