@cyberdyne-systems/agent-safety 2026.3.7 → 2026.3.9

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (2)
  1. package/README.md +234 -7
  2. package/package.json +1 -1
package/README.md CHANGED
@@ -1,19 +1,246 @@
  # Agent Safety System
 
- OpenClaw plugin for LLM agent safety based on [arXiv:2602.20021 — "Agents of Chaos"](https://arxiv.org/abs/2602.20021).
+ [![npm](https://img.shields.io/npm/v/@cyberdyne-systems/agent-safety)](https://www.npmjs.com/package/@cyberdyne-systems/agent-safety)
 
- Hooks into `before_tool_call` to validate every tool call against a stakeholder model with trust levels, UID-based identity anchoring, and 8 risk dimensions.
+ OpenClaw plugin for LLM agent safety based on [arXiv:2602.20021 -- "Agents of Chaos"](https://arxiv.org/abs/2602.20021).
 
- ## Usage
+ Intercepts every tool call via `before_tool_call` and validates it against a stakeholder model with trust levels, UID-based identity anchoring, and 8 risk dimensions from the paper.
+
+ ## Install
 
  ```bash
- openclaw plugins enable agent-safety
+ openclaw plugins install @cyberdyne-systems/agent-safety
+ ```
+
+ Then restart the gateway to load the plugin.
+
+ ## Architecture
+
+ ```
+ Tool Call
+     |
+     v
+ +------------------+     +------------------+
+ |   Quick Check    | --> |  Deep Analysis   |
+ |  (local rules)   |     |   (Claude API)   |
+ |  ~0ms latency    |     |     optional     |
+ +------------------+     +------------------+
+          |                        |
+          v                        v
+ +----------------------------------------------+
+ |                  Audit Log                   |
+ |  Every decision logged with risk score,      |
+ |  verdict, requester, and reasoning           |
+ +----------------------------------------------+
+          |
+          v
+ ALLOW / WARN / BLOCK
  ```
 
- See [full documentation](https://docs.openclaw.ai/extensions/agent-safety) for configuration, tool reference, and architecture.
+ ### Two-Phase Validation
+
+ 1. **Quick Check** (zero latency) -- local rules run on every call:
+    - Trust level verification (0-4 scale)
+    - Permission checks against allowed actions
+    - Identity spoofing detection (UID anchoring)
+    - Dangerous command pattern matching (`rm -rf`, credential access, fork bombs)
+    - Loop / rapid-fire detection
+    - Unverified sender blocking for high-risk actions
+
+ 2. **Deep Analysis** (optional, requires API key) -- Claude evaluates 8 risk dimensions:
+    - Authority Violation
+    - Resource Abuse
+    - Information Leak
+    - Safety Bypass
+    - Goal Misalignment
+    - Social Engineering
+    - Cascading Failure
+    - Irreversible Action
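
The local rules in the Quick Check phase could be sketched roughly as follows. This is an illustration only, not the package's code: the `Requester` shape borrows field names from the README's stakeholder table, while `quickCheck`, `REQUIRED_TRUST`, `HIGH_RISK`, and every threshold are assumptions chosen for the example.

```typescript
// Hypothetical sketch of a local "quick check"; not the plugin's implementation.
type Verdict = "ALLOW" | "WARN" | "BLOCK";

interface Requester {
  trust: 0 | 1 | 2 | 3 | 4;
  verified: boolean;
  allowedActions: string[];
}

interface QuickCheckResult {
  verdict: Verdict;
  reason: string;
}

// Assumed minimum trust level per action category (illustrative values).
const REQUIRED_TRUST = new Map<string, number>([
  ["read_files", 1],
  ["write_files", 2],
  ["execute_shell", 3],
  ["delete_files", 4],
]);

// Assumed set of categories treated as high-risk.
const HIGH_RISK = new Set<string>(["execute_shell", "delete_files", "access_credentials"]);

function quickCheck(requester: Requester, category: string): QuickCheckResult {
  // Unverified senders are blocked outright on high-risk categories.
  if (!requester.verified && HIGH_RISK.has(category)) {
    return { verdict: "BLOCK", reason: "unverified sender on high-risk action" };
  }
  // The category must appear in the requester's allowed actions.
  if (!requester.allowedActions.includes(category)) {
    return { verdict: "BLOCK", reason: `action ${category} not permitted` };
  }
  // Categories without an entry fall back to requiring full trust.
  const required = REQUIRED_TRUST.get(category) ?? 4;
  if (requester.trust < required) {
    return { verdict: "BLOCK", reason: `trust ${requester.trust} < required ${required}` };
  }
  return { verdict: "ALLOW", reason: "all local rules passed" };
}
```

Under these assumed thresholds, a verified requester with trust 2 would be allowed `read_files` but blocked on `execute_shell`, even when `execute_shell` is listed in `allowedActions`.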
 
- ## Development
+ ## Configuration
 
  ```bash
- pnpm test extensions/agent-safety/
+ # Validation mode: local (default), api, or both
+ openclaw config set plugins.entries.agent-safety.mode local
+
+ # Enable Claude API deep analysis (requires API key)
+ openclaw config set plugins.entries.agent-safety.mode both
+ openclaw config set plugins.entries.agent-safety.apiKey sk-ant-...
+
+ # Choose validation model (default: claude-sonnet-4-5-20250514)
+ openclaw config set plugins.entries.agent-safety.model claude-haiku-4-5-20251001
+
+ # Block high-risk actions from unverified users (default: true)
+ openclaw config set plugins.entries.agent-safety.blockHighRiskUnverified true
  ```
+
+ | Option | Type | Default | Description |
+ |--------|------|---------|-------------|
+ | `mode` | `"local" \| "api" \| "both"` | `"local"` | Validation strategy |
+ | `apiKey` | `string` | `$ANTHROPIC_API_KEY` | API key for deep analysis |
+ | `model` | `string` | `claude-sonnet-4-5-20250514` | Model for deep analysis |
+ | `blockHighRiskUnverified` | `boolean` | `true` | Auto-block unverified users on high-risk actions |
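
As a reading aid, the options table can be expressed as a TypeScript type with a default-resolution step. The option names and defaults come from the table; `resolveConfig`, `deepAnalysisEnabled`, and the rule that deep analysis needs both a non-`local` mode and a key are assumptions for illustration, not the plugin's documented behavior.

```typescript
// Hypothetical typing of the documented options; helper names are illustrative.
type Mode = "local" | "api" | "both";

interface AgentSafetyConfig {
  mode: Mode;
  apiKey: string | undefined;
  model: string;
  blockHighRiskUnverified: boolean;
}

// Fill in the defaults listed in the options table. The environment is passed
// in explicitly so the sketch does not depend on Node's `process` global.
function resolveConfig(
  raw: Partial<AgentSafetyConfig>,
  env: Record<string, string | undefined>,
): AgentSafetyConfig {
  return {
    mode: raw.mode ?? "local",
    apiKey: raw.apiKey ?? env["ANTHROPIC_API_KEY"],
    model: raw.model ?? "claude-sonnet-4-5-20250514",
    blockHighRiskUnverified: raw.blockHighRiskUnverified ?? true,
  };
}

// Assumption: deep analysis only runs when the mode asks for it and a key exists.
function deepAnalysisEnabled(cfg: AgentSafetyConfig): boolean {
  return cfg.mode !== "local" && cfg.apiKey !== undefined;
}
```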
+
+ ## Stakeholder Model
+
+ The plugin maintains a principal registry where each stakeholder has:
+
+ | Field | Description |
+ |-------|-------------|
+ | `id` | Unique identifier |
+ | `name` | Display name |
+ | `role` | `owner`, `agent`, or `non_owner` |
+ | `trust` | Trust level 0-4 (0 = untrusted, 4 = full trust) |
+ | `verified` | Whether identity is confirmed via UID |
+ | `uid` | Platform-specific unique identifier (anchors identity) |
+ | `channel` | Communication channel (Telegram, Discord, local, etc.) |
+ | `allowedActions` | List of permitted action categories |
+
+ ### Trust Levels
+
+ | Level | Meaning | Typical Permissions |
+ |-------|---------|-------------------|
+ | 0 | Untrusted | No actions allowed |
+ | 1 | Minimal | Read-only |
+ | 2 | Basic | Read + limited write |
+ | 3 | Elevated | Most actions except destructive |
+ | 4 | Full | All actions (owner) |
+
+ ### Action Categories
+
+ The plugin maps tool names to these categories:
+
+ | Category | Example Tools |
+ |----------|--------------|
+ | `execute_shell` | bash, exec, terminal |
+ | `read_files` | read, glob, grep |
+ | `write_files` | write, edit |
+ | `delete_files` | delete, remove |
+ | `external_network` | web_fetch, curl |
+ | `send_message` | message, send, forward |
+ | `read_message` | read_message, inbox |
+ | `modify_memory` | memory_store, memory_update |
+ | `access_credentials` | credential, secret, token |
+ | `agent_communication` | agent_communication |
+ | `forward_message` | forward |
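
A tool-name lookup along the lines of the table above might look like the sketch below. The entries are copied from the table's examples; `TOOL_CATEGORIES` and `categorize` are hypothetical names, and exact case-insensitive matching is an assumption (the plugin may use prefixes or patterns instead). The table lists `forward` under both `send_message` and `forward_message`; this sketch arbitrarily maps it to `forward_message`.

```typescript
// Hypothetical lookup table built from the README's example tools.
const TOOL_CATEGORIES: ReadonlyMap<string, string> = new Map<string, string>([
  ["bash", "execute_shell"],
  ["exec", "execute_shell"],
  ["terminal", "execute_shell"],
  ["read", "read_files"],
  ["glob", "read_files"],
  ["grep", "read_files"],
  ["write", "write_files"],
  ["edit", "write_files"],
  ["delete", "delete_files"],
  ["remove", "delete_files"],
  ["web_fetch", "external_network"],
  ["curl", "external_network"],
  ["message", "send_message"],
  ["send", "send_message"],
  ["forward", "forward_message"], // ambiguous in the table; choice is an assumption
  ["read_message", "read_message"],
  ["inbox", "read_message"],
  ["memory_store", "modify_memory"],
  ["memory_update", "modify_memory"],
  ["agent_communication", "agent_communication"],
]);

// Returns the action category for a tool, or undefined when the tool is unknown.
function categorize(toolName: string): string | undefined {
  return TOOL_CATEGORIES.get(toolName.toLowerCase());
}
```

Returning `undefined` for unknown tools leaves the policy decision (allow, warn, or block) to the caller rather than hiding it inside the lookup.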
+
+ ## Agent Safety Tool
+
+ Once loaded, agents get an `agent_safety` tool for runtime introspection:
+
+ ### `status` -- Safety Dashboard
+
+ ```json
+ {
+   "stakeholders": 2,
+   "auditStats": {
+     "total": 47,
+     "allowed": 42,
+     "warned": 3,
+     "blocked": 2,
+     "averageRisk": 18
+   }
+ }
+ ```
+
+ ### `stakeholders` -- List Principals
+
+ Returns all registered stakeholders with trust levels and permissions.
+
+ ### `log` -- Audit Trail
+
+ ```bash
+ # Last 10 entries (default)
+ agent_safety action=log
+
+ # Last 5 entries
+ agent_safety action=log limit=5
+ ```
+
+ Each entry includes: tool name, action category, requester, trust level, verdict, risk score, and reasoning.
+
+ ### `add_stakeholder` -- Register Principal
+
+ ```bash
+ # With UID (verified, trust 2)
+ agent_safety action=add_stakeholder name="Alice" uid="telegram_12345"
+
+ # Without UID (unverified, trust 1)
+ agent_safety action=add_stakeholder name="Bob"
+ ```
+
+ ### `set_trust` -- Adjust Trust Level
+
+ ```bash
+ agent_safety action=set_trust stakeholder_id="<id>" trust=3
+ ```
+
+ ## Case Studies (arXiv:2602.20021)
+
+ The plugin detects all 11 attack patterns from the paper:
+
+ | # | Case Study | Detection Method |
+ |---|-----------|-----------------|
+ | 1 | Unauthorized tool use | Permission check against `allowedActions` |
+ | 2 | Trust boundary violation | Trust level < required for action category |
+ | 3 | Resource destruction | Pattern match: `rm -rf`, `mkfs`, `dd`, fork bombs |
+ | 4 | Credential harvesting | Pattern match: `.ssh`, `.aws`, `/etc/shadow`, `env \| grep` |
+ | 5 | Prompt injection | Command injection patterns: `eval`, `\|`, `$()` in shell |
+ | 6 | Data exfiltration | Outbound data via `curl -d`, `scp`, `wget` with file content |
+ | 7 | Multi-agent manipulation | Agent-to-agent communication validation |
+ | 8 | Identity spoofing | UID anchoring -- unverified sender + high-risk action = BLOCK |
+ | 9 | Privilege escalation | `sudo`, `chmod`, `chown` pattern detection |
+ | 10 | Social engineering | Non-owner requesting destructive actions |
+ | 11 | Cascading failure | Irreversible bulk operations detection |
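
Several rows above rely on command pattern matching. A minimal sketch of that idea, covering only rows 3, 4, and 9, could look like this. The regular expressions are assumptions written for this example and are far narrower than a production rule set would need to be; `RULES` and `matchDangerous` are hypothetical names, not the plugin's API.

```typescript
// Hypothetical pattern rules; the regexes are illustrative, not the plugin's.
interface PatternRule {
  label: string;
  pattern: RegExp;
}

const RULES: readonly PatternRule[] = [
  { label: "resource destruction", pattern: /\brm\s+-rf\b|\bmkfs\b|\bdd\s+if=/ },
  { label: "credential harvesting", pattern: /\.ssh\b|\.aws\b|\/etc\/shadow/ },
  { label: "privilege escalation", pattern: /\b(sudo|chmod|chown)\b/ },
];

// Returns the label of every rule whose pattern appears in the command,
// in the order the rules are declared.
function matchDangerous(command: string): string[] {
  return RULES.filter((rule) => rule.pattern.test(command)).map((rule) => rule.label);
}
```

Simple substring-style rules like these are easy to evade (for example with `rm -fr` or shell quoting), which is one reason a second, model-based analysis phase can be useful.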
+
+ ## Test Results
+
+ ```
+ 114 tests passing across 3 test suites
+
+   Unit tests:        24 passed
+   Validator tests:   83 passed (incl. 11 case studies)
+   Integration tests:  7 passed
+
+ Benchmark:
+   MUST_BLOCK: 23/23 (100% detection)
+   MUST_ALLOW: 18/18 (0% false positives)
+ ```
+
+ ### Live Gateway Tests
+
+ 19/19 tests passed across 10 tool categories, validated through the OpenClaw gateway:
+
+ | Category | Tests | Result |
+ |----------|-------|--------|
+ | exec (shell) | 5 | PASS |
+ | read (files) | 4 | PASS |
+ | write (files) | 2 | PASS |
+ | web_fetch (network) | 2 | PASS |
+ | message (Telegram) | 1 | PASS |
+ | browser | 1 | PASS |
+ | memory | 1 | PASS |
+ | nodes | 1 | PASS |
+ | TTS | 1 | PASS |
+ | session | 1 | PASS |
227
+ ## How It Hooks In
228
+
229
+ The plugin registers a `before_tool_call` hook at priority 10 (runs early):
230
+
231
+ ```typescript
232
+ api.on("before_tool_call", async (event, ctx) => {
233
+ // 1. Map tool name to action category
234
+ // 2. Resolve requester from context (UID, isOwner)
235
+ // 3. Run quickCheck (local rules)
236
+ // 4. Optionally run deep analysis (Claude API)
237
+ // 5. Log decision to audit trail
238
+ // 6. Return { block: true, blockReason } if BLOCK
239
+ });
240
+ ```
241
+
242
+ When no sender context is provided (local gateway usage), the plugin defaults to treating the caller as the owner -- so local tool calls are never blocked.
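
The owner fallback described above could be sketched as follows. Only the behavior "no sender context means owner" comes from the README; the `SenderContext` and `Stakeholder` shapes, the `resolveRequester` name, and the handling of unknown UIDs (trust 0, unverified) are assumptions for illustration.

```typescript
// Hypothetical requester resolution; shapes and names are illustrative.
type Role = "owner" | "agent" | "non_owner";

interface Stakeholder {
  id: string;
  role: Role;
  trust: number;
  verified: boolean;
  uid: string | undefined;
}

interface SenderContext {
  uid?: string;
  isOwner?: boolean;
}

const LOCAL_OWNER: Stakeholder = {
  id: "local-owner",
  role: "owner",
  trust: 4,
  verified: true,
  uid: undefined,
};

function resolveRequester(
  ctx: SenderContext | undefined,
  registry: readonly Stakeholder[],
): Stakeholder {
  // No sender context at all (local gateway usage): treat the caller as owner.
  if (ctx === undefined) return LOCAL_OWNER;
  if (ctx.uid === undefined && ctx.isOwner === undefined) return LOCAL_OWNER;
  if (ctx.isOwner === true) return LOCAL_OWNER;

  // A UID that matches a registered stakeholder anchors the identity.
  const uid = ctx.uid;
  const known = registry.find((s) => s.uid !== undefined && s.uid === uid);
  if (known !== undefined) return known;

  // Assumption: anyone else is an unverified, untrusted non-owner.
  return {
    id: `unknown:${uid ?? "anonymous"}`,
    role: "non_owner",
    trust: 0,
    verified: false,
    uid,
  };
}
```

One consequence of this design is that the safety of the fallback depends on sender context always being attached to remote calls; a remote call that arrives without context would be treated as the owner.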
+
+ ## License
+
+ MIT
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "@cyberdyne-systems/agent-safety",
-   "version": "2026.3.7",
+   "version": "2026.3.9",
    "description": "Agent safety system: stakeholder model, action validator, and safety dashboard — based on arXiv:2602.20021",
    "type": "module",
    "dependencies": {