@cyberdyne-systems/agent-safety 2026.3.3 → 2026.3.8
- package/README.md +41 -5
- package/package.json +1 -1
- package/src/integration.test.ts +2 -1
- package/src/stakeholder-store.ts +5 -0
package/README.md
CHANGED
@@ -1,19 +1,55 @@
 # Agent Safety System
 
-OpenClaw plugin for LLM agent safety based on [arXiv:2602.20021
+OpenClaw plugin for LLM agent safety based on [arXiv:2602.20021 -- "Agents of Chaos"](https://arxiv.org/abs/2602.20021).
 
 Hooks into `before_tool_call` to validate every tool call against a stakeholder model with trust levels, UID-based identity anchoring, and 8 risk dimensions.
 
-##
+## Install
 
 ```bash
-openclaw plugins
+openclaw plugins install @cyberdyne-systems/agent-safety
 ```
 
-
+## Configuration
+
+After install, configure via `openclaw config set`:
+
+```bash
+# Validation mode: local (default), api, or both
+openclaw config set plugins.entries.agent-safety.mode local
+
+# Enable Claude API deep analysis (requires API key)
+openclaw config set plugins.entries.agent-safety.mode both
+openclaw config set plugins.entries.agent-safety.apiKey sk-ant-...
+
+# Block high-risk actions from unverified users (default: true)
+openclaw config set plugins.entries.agent-safety.blockHighRiskUnverified true
+```
+
+## How It Works
+
+1. **Quick check** (zero latency): local rules check trust level, permissions, identity spoofing, loop patterns, and dangerous command patterns
+2. **Deep analysis** (optional): Claude API evaluates 8 risk dimensions from the paper -- authority violation, resource abuse, information leak, safety bypass, goal misalignment, social engineering, cascading failure, irreversible action
+
+## Agent Safety Tool
+
+Once loaded, agents get an `agent_safety` tool with these actions:
+
+- `status` -- safety dashboard with audit stats
+- `stakeholders` -- list registered principals
+- `log` -- recent audit entries
+- `add_stakeholder` -- register a new principal
+- `set_trust` -- adjust trust level (0-4)
+
+## Test Suite
+
+114 tests covering all 11 case studies from the paper:
+
+- 23 MUST_BLOCK scenarios: 100% detection rate
+- 18 MUST_ALLOW scenarios: 0% false positive rate
 
 ## Development
 
 ```bash
-pnpm
+pnpm vitest run extensions/agent-safety/
 ```
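The two stages listed under "How It Works" in the README diff, together with the `mode` and `blockHighRiskUnverified` settings, can be read as a short pipeline. The TypeScript below is a minimal sketch of that reading, not the package's implementation: every name in it (`ToolCall`, `Verdict`, `DeepAnalyzer`, `quickCheck`, `validate`, `DANGEROUS_PATTERNS`) is hypothetical. The sample patterns, the rule that trust level 0 is blocked, and the assumption that `api` mode skips the local rules are illustrative guesses.

```typescript
// Hypothetical sketch: none of these names are taken from the package source.
type Mode = "local" | "api" | "both";

interface ToolCall {
  tool: string;
  params: Record<string, unknown>;
  trustLevel: number; // 0-4, matching the range given for `set_trust`
  verified: boolean; // whether the sender's identity was confirmed
}

interface Verdict {
  block: boolean;
  reason: string;
}

// Stand-in for the deep analysis stage. A real Claude API call would be
// asynchronous; it is injected as a plain function to keep the sketch short.
type DeepAnalyzer = (call: ToolCall) => Verdict;

// Illustrative patterns only; the plugin's real rule set is not shown in the diff.
const DANGEROUS_PATTERNS: RegExp[] = [/\brm\s+-rf\b/, /\bmkfs\b/];

// Stage 1: local rules, no network access.
function quickCheck(call: ToolCall, blockHighRiskUnverified: boolean): Verdict {
  const raw = call.params["command"];
  const command = typeof raw === "string" ? raw : "";
  const dangerous = DANGEROUS_PATTERNS.some((pattern) => pattern.test(command));

  if (dangerous && blockHighRiskUnverified && !call.verified) {
    return { block: true, reason: "high-risk command from unverified sender" };
  }
  if (call.trustLevel === 0) {
    return { block: true, reason: "sender is untrusted" };
  }
  return { block: false, reason: "passed local rules" };
}

// Assumed mode semantics: "local" runs stage 1 only, "api" runs stage 2 only,
// "both" runs stage 1 and falls through to stage 2 when stage 1 allows the call.
function validate(
  call: ToolCall,
  mode: Mode,
  deep: DeepAnalyzer,
  blockHighRiskUnverified: boolean = true,
): Verdict {
  if (mode !== "api") {
    const local = quickCheck(call, blockHighRiskUnverified);
    // A local block is final, and "local" mode never reaches the deep stage.
    if (local.block || mode === "local") return local;
  }
  return deep(call);
}
```

Under these assumptions, a call that the local rules block never reaches the deep stage, so no API request is spent on it.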
package/package.json
CHANGED
package/src/integration.test.ts
CHANGED
@@ -113,7 +113,8 @@ describe("Integration: full hook pipeline", () => {
     expect(simulateHook(store, auditLog, "bash", { command: "ls" }, "unknown_uid").block).toBe(
       true,
     );
-
+    // No sender context → defaults to owner (local user), should be allowed
+    expect(simulateHook(store, auditLog, "modify_memory", { content: "hi" }).block).toBe(false);
   });
 
   // Audit logging
package/src/stakeholder-store.ts
CHANGED
@@ -80,6 +80,11 @@ export class StakeholderStore {
       if (match) return match;
     }
 
+    // No sender context at all → local user, treat as owner
+    if (senderId === undefined && isOwner === undefined) {
+      return this.getOwner() ?? DEFAULT_STAKEHOLDERS[0];
+    }
+
     // Return untrusted default for unknown senders
     return {
       id: "unknown",
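The hunk above shows only part of the resolution method, so the following is a self-contained TypeScript sketch of how the new fallback appears to sit between the ID lookup and the untrusted default. It is an assumption-laden reconstruction, not the package's code: `SketchStakeholderStore`, `resolve`, `add`, the `Stakeholder` shape, and `DEFAULT_OWNER` (standing in for `DEFAULT_STAKEHOLDERS[0]`, whose contents the diff does not show) are all hypothetical, and the handling of `isOwner === true` is a guess at what the unseen lines do.

```typescript
// Hypothetical shape; the real Stakeholder type is not shown in the diff.
interface Stakeholder {
  id: string;
  role: "owner" | "member" | "unknown";
  trustLevel: number; // 0-4
}

// Stand-in for DEFAULT_STAKEHOLDERS[0], whose contents the diff does not show.
const DEFAULT_OWNER: Stakeholder = { id: "owner", role: "owner", trustLevel: 4 };

class SketchStakeholderStore {
  private readonly stakeholders: Stakeholder[] = [];

  add(stakeholder: Stakeholder): void {
    this.stakeholders.push(stakeholder);
  }

  getOwner(): Stakeholder | undefined {
    return this.stakeholders.find((s) => s.role === "owner");
  }

  resolve(senderId?: string, isOwner?: boolean): Stakeholder {
    // Assumed: an explicit owner flag resolves to the owner record.
    if (isOwner === true) {
      return this.getOwner() ?? DEFAULT_OWNER;
    }

    // Known sender: return the registered record.
    if (senderId !== undefined) {
      const match = this.stakeholders.find((s) => s.id === senderId);
      if (match) return match;
    }

    // New in 2026.3.8: no sender context at all is treated as the local owner.
    if (senderId === undefined && isOwner === undefined) {
      return this.getOwner() ?? DEFAULT_OWNER;
    }

    // Anything else is an unknown sender and gets the untrusted default.
    return { id: "unknown", role: "unknown", trustLevel: 0 };
  }
}
```

In this reading, a call with an unrecognized `senderId` such as `"unknown_uid"` still resolves to the untrusted default, while a call that supplies neither argument resolves to the owner, which matches the two expectations in the integration test diff. A call site that can identify its sender would therefore need to pass that identity through, since omitting it yields owner-level trust.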