@cyberdyne-systems/agent-safety 2026.3.3 → 2026.3.8

package/README.md CHANGED
@@ -1,19 +1,55 @@
  # Agent Safety System
 
- OpenClaw plugin for LLM agent safety based on [arXiv:2602.20021 "Agents of Chaos"](https://arxiv.org/abs/2602.20021).
+ OpenClaw plugin for LLM agent safety based on [arXiv:2602.20021 -- "Agents of Chaos"](https://arxiv.org/abs/2602.20021).
 
  Hooks into `before_tool_call` to validate every tool call against a stakeholder model with trust levels, UID-based identity anchoring, and 8 risk dimensions.
 
- ## Usage
+ ## Install
 
  ```bash
- openclaw plugins enable agent-safety
+ openclaw plugins install @cyberdyne-systems/agent-safety
  ```
 
- See [full documentation](https://docs.openclaw.ai/extensions/agent-safety) for configuration, tool reference, and architecture.
+ ## Configuration
+
+ After install, configure via `openclaw config set`:
+
+ ```bash
+ # Validation mode: local (default), api, or both
+ openclaw config set plugins.entries.agent-safety.mode local
+
+ # Enable Claude API deep analysis (requires API key)
+ openclaw config set plugins.entries.agent-safety.mode both
+ openclaw config set plugins.entries.agent-safety.apiKey sk-ant-...
+
+ # Block high-risk actions from unverified users (default: true)
+ openclaw config set plugins.entries.agent-safety.blockHighRiskUnverified true
+ ```
+
+ ## How It Works
+
+ 1. **Quick check** (zero latency): local rules check trust level, permissions, identity spoofing, loop patterns, and dangerous command patterns
+ 2. **Deep analysis** (optional): Claude API evaluates 8 risk dimensions from the paper -- authority violation, resource abuse, information leak, safety bypass, goal misalignment, social engineering, cascading failure, irreversible action
+
+ ## Agent Safety Tool
+
+ Once loaded, agents get an `agent_safety` tool with these actions:
+
+ - `status` -- safety dashboard with audit stats
+ - `stakeholders` -- list registered principals
+ - `log` -- recent audit entries
+ - `add_stakeholder` -- register a new principal
+ - `set_trust` -- adjust trust level (0-4)
+
+ ## Test Suite
+
+ 114 tests covering all 11 case studies from the paper:
+
+ - 23 MUST_BLOCK scenarios: 100% detection rate
+ - 18 MUST_ALLOW scenarios: 0% false positive rate
 
  ## Development
 
  ```bash
- pnpm test extensions/agent-safety/
+ pnpm vitest run extensions/agent-safety/
  ```
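
The two-stage flow described under "How It Works" in the new README can be pictured roughly as follows. This is a minimal sketch, not the plugin's implementation: `quickCheck`, `validateToolCall`, the `Verdict` shape, the trust threshold, and the dangerous-command pattern are hypothetical names and values chosen for illustration. Only the mode names (`local`, `api`, `both`) and the 0-4 trust range come from the README, and the assumption that `api` mode skips the local rules is a guess.

```typescript
// Hypothetical sketch of a quick-check-then-deep-analysis pipeline.
// None of these names come from the package; they are illustrative only.
type Mode = "local" | "api" | "both";

interface ToolCall {
  tool: string;
  params: Record<string, unknown>;
  trustLevel: number; // 0-4, matching the range of the README's `set_trust`
}

interface Verdict {
  block: boolean;
  reason?: string;
}

// Assumed pattern; the plugin's real rule set is not shown in the diff.
const DANGEROUS_COMMAND = /\brm\s+-rf\b/;

// Stage 1: synchronous local rules, no network round trip.
function quickCheck(call: ToolCall): Verdict {
  if (call.trustLevel <= 0) {
    return { block: true, reason: "untrusted sender" };
  }
  const command = call.params["command"];
  if (typeof command === "string" && DANGEROUS_COMMAND.test(command)) {
    return { block: true, reason: "dangerous command pattern" };
  }
  return { block: false };
}

// Stage 2 is injected so the sketch stays self-contained; in `api` or `both`
// mode it stands in for a remote risk evaluation.
type DeepAnalysis = (call: ToolCall) => Promise<Verdict>;

async function validateToolCall(
  call: ToolCall,
  mode: Mode,
  deepAnalysis: DeepAnalysis,
): Promise<Verdict> {
  if (mode !== "api") {
    const local = quickCheck(call);
    // A local block is final; in `local` mode a local pass is final too.
    if (local.block || mode === "local") return local;
  }
  return deepAnalysis(call);
}
```

Running the cheap local rules first means a call that is already blocked never incurs the cost of the remote evaluation.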
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "@cyberdyne-systems/agent-safety",
- "version": "2026.3.3",
+ "version": "2026.3.8",
  "description": "Agent safety system: stakeholder model, action validator, and safety dashboard — based on arXiv:2602.20021",
  "type": "module",
  "dependencies": {
@@ -113,7 +113,8 @@ describe("Integration: full hook pipeline", () => {
  expect(simulateHook(store, auditLog, "bash", { command: "ls" }, "unknown_uid").block).toBe(
  true,
  );
- expect(simulateHook(store, auditLog, "modify_memory", { content: "hi" }).block).toBe(true);
+ // No sender context defaults to owner (local user), should be allowed
+ expect(simulateHook(store, auditLog, "modify_memory", { content: "hi" }).block).toBe(false);
  });
 
  // Audit logging
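
The changed expectation reflects a new fallback in sender resolution: a call with no sender context at all is attributed to the local owner, while a call carrying an unrecognized UID still resolves to an untrusted default. A minimal sketch of that ordering follows, assuming a simplified `Stakeholder` shape and a hypothetical `resolveSender` helper. The package's actual `StakeholderStore` method and its fields are only partly visible in this diff, so everything except the `id: "unknown"` default and the "no context means owner" rule is an assumption.

```typescript
// Hypothetical, simplified model of sender resolution.
interface Stakeholder {
  id: string;
  trustLevel: number; // assumed scale: 0 (untrusted) to 4 (owner)
}

const OWNER: Stakeholder = { id: "owner", trustLevel: 4 };
const UNKNOWN: Stakeholder = { id: "unknown", trustLevel: 0 };

function resolveSender(
  known: Map<string, Stakeholder>,
  senderId?: string,
  isOwner?: boolean,
): Stakeholder {
  if (isOwner === true) return OWNER;
  if (senderId !== undefined) {
    const match = known.get(senderId);
    if (match) return match;
  }
  // No sender context at all: treat the caller as the local user (owner).
  if (senderId === undefined && isOwner === undefined) return OWNER;
  // Context was supplied but matched nothing: fall back to untrusted.
  return UNKNOWN;
}
```

Under this ordering, `resolveSender(known)` yields the owner and `resolveSender(known, "unknown_uid")` yields the untrusted default, which lines up with the two expectations in the test hunk: the `unknown_uid` call is blocked and the call without sender context is allowed.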
@@ -80,6 +80,11 @@ export class StakeholderStore {
  if (match) return match;
  }
 
+ // No sender context at all → local user, treat as owner
+ if (senderId === undefined && isOwner === undefined) {
+ return this.getOwner() ?? DEFAULT_STAKEHOLDERS[0];
+ }
+
  // Return untrusted default for unknown senders
  return {
  id: "unknown",