wardproof 0.2.0__tar.gz → 0.3.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (70) hide show
  1. {wardproof-0.2.0 → wardproof-0.3.0}/PKG-INFO +100 -31
  2. {wardproof-0.2.0 → wardproof-0.3.0}/README.md +82 -30
  3. {wardproof-0.2.0 → wardproof-0.3.0}/THREAT_MODEL.md +85 -5
  4. wardproof-0.3.0/benchmarks/README.md +222 -0
  5. wardproof-0.3.0/benchmarks/corpus.jsonl +136 -0
  6. {wardproof-0.2.0 → wardproof-0.3.0}/benchmarks/external/README.md +21 -6
  7. {wardproof-0.2.0 → wardproof-0.3.0}/benchmarks/external/fetch_data.py +15 -4
  8. {wardproof-0.2.0 → wardproof-0.3.0}/benchmarks/external/injecagent.py +32 -0
  9. wardproof-0.3.0/benchmarks/heldout.py +132 -0
  10. wardproof-0.3.0/benchmarks/latency.py +114 -0
  11. {wardproof-0.2.0 → wardproof-0.3.0}/benchmarks/run_benchmark.py +42 -1
  12. wardproof-0.3.0/examples/agent_to_agent_transfer.py +172 -0
  13. wardproof-0.3.0/examples/integrations/README.md +346 -0
  14. wardproof-0.3.0/examples/integrations/agentkit_guarded.py +213 -0
  15. wardproof-0.3.0/examples/integrations/anthropic_tools_guarded.py +198 -0
  16. wardproof-0.3.0/examples/integrations/crewai_guarded.py +154 -0
  17. wardproof-0.3.0/examples/integrations/langgraph_guarded.py +208 -0
  18. wardproof-0.3.0/examples/integrations/mcp_guarded.py +218 -0
  19. wardproof-0.3.0/examples/integrations/openai_tools_guarded.py +212 -0
  20. wardproof-0.3.0/examples/integrations/skills_guard.py +143 -0
  21. wardproof-0.3.0/examples/integrations/venice_guarded.py +256 -0
  22. wardproof-0.3.0/examples/morse_injection_blocked_at_action.py +132 -0
  23. {wardproof-0.2.0 → wardproof-0.3.0}/examples/protect_rag_app.py +1 -1
  24. wardproof-0.3.0/examples/protect_x402_payments.py +205 -0
  25. wardproof-0.3.0/pyproject.toml +80 -0
  26. {wardproof-0.2.0 → wardproof-0.3.0}/wardproof/__init__.py +1 -1
  27. {wardproof-0.2.0 → wardproof-0.3.0}/wardproof/agents/base.py +13 -0
  28. {wardproof-0.2.0 → wardproof-0.3.0}/wardproof/audit/ledger.py +18 -2
  29. {wardproof-0.2.0 → wardproof-0.3.0}/wardproof/guardrails/__init__.py +2 -0
  30. {wardproof-0.2.0 → wardproof-0.3.0}/wardproof/guardrails/_normalize.py +17 -0
  31. {wardproof-0.2.0 → wardproof-0.3.0}/wardproof/guardrails/prompt_injection.py +110 -1
  32. {wardproof-0.2.0 → wardproof-0.3.0}/wardproof/guardrails/tool_misuse.py +61 -11
  33. wardproof-0.3.0/wardproof/guardrails/transfer.py +211 -0
  34. {wardproof-0.2.0 → wardproof-0.3.0}/wardproof/guardrails/x402_payment.py +70 -28
  35. {wardproof-0.2.0 → wardproof-0.3.0}/wardproof/orchestration/engine.py +35 -3
  36. {wardproof-0.2.0 → wardproof-0.3.0}/wardproof/standards.py +42 -1
  37. wardproof-0.2.0/benchmarks/README.md +0 -103
  38. wardproof-0.2.0/benchmarks/corpus.jsonl +0 -66
  39. wardproof-0.2.0/examples/protect_x402_payments.py +0 -138
  40. wardproof-0.2.0/pyproject.toml +0 -54
  41. {wardproof-0.2.0 → wardproof-0.3.0}/.gitignore +0 -0
  42. {wardproof-0.2.0 → wardproof-0.3.0}/CONTRIBUTING.md +0 -0
  43. {wardproof-0.2.0 → wardproof-0.3.0}/LICENSE +0 -0
  44. {wardproof-0.2.0 → wardproof-0.3.0}/SECURITY.md +0 -0
  45. {wardproof-0.2.0 → wardproof-0.3.0}/benchmarks/external/__init__.py +0 -0
  46. {wardproof-0.2.0 → wardproof-0.3.0}/benchmarks/external/_screen.py +0 -0
  47. {wardproof-0.2.0 → wardproof-0.3.0}/benchmarks/external/agentdojo.py +0 -0
  48. {wardproof-0.2.0 → wardproof-0.3.0}/examples/protect_defi_agent.py +0 -0
  49. {wardproof-0.2.0 → wardproof-0.3.0}/examples/protect_mcp_agent.py +0 -0
  50. {wardproof-0.2.0 → wardproof-0.3.0}/wardproof/agents/__init__.py +0 -0
  51. {wardproof-0.2.0 → wardproof-0.3.0}/wardproof/agents/detector.py +0 -0
  52. {wardproof-0.2.0 → wardproof-0.3.0}/wardproof/agents/responder.py +0 -0
  53. {wardproof-0.2.0 → wardproof-0.3.0}/wardproof/agents/verifier.py +0 -0
  54. {wardproof-0.2.0 → wardproof-0.3.0}/wardproof/audit/__init__.py +0 -0
  55. {wardproof-0.2.0 → wardproof-0.3.0}/wardproof/audit/stix.py +0 -0
  56. {wardproof-0.2.0 → wardproof-0.3.0}/wardproof/cli.py +0 -0
  57. {wardproof-0.2.0 → wardproof-0.3.0}/wardproof/config.py +0 -0
  58. {wardproof-0.2.0 → wardproof-0.3.0}/wardproof/guardrails/base.py +0 -0
  59. {wardproof-0.2.0 → wardproof-0.3.0}/wardproof/guardrails/mcp_guard.py +0 -0
  60. {wardproof-0.2.0 → wardproof-0.3.0}/wardproof/guardrails/memory_poisoning.py +0 -0
  61. {wardproof-0.2.0 → wardproof-0.3.0}/wardproof/llm/__init__.py +0 -0
  62. {wardproof-0.2.0 → wardproof-0.3.0}/wardproof/llm/base.py +0 -0
  63. {wardproof-0.2.0 → wardproof-0.3.0}/wardproof/llm/null.py +0 -0
  64. {wardproof-0.2.0 → wardproof-0.3.0}/wardproof/llm/ollama_client.py +0 -0
  65. {wardproof-0.2.0 → wardproof-0.3.0}/wardproof/orchestration/__init__.py +0 -0
  66. {wardproof-0.2.0 → wardproof-0.3.0}/wardproof/orchestration/factory.py +0 -0
  67. {wardproof-0.2.0 → wardproof-0.3.0}/wardproof/sandbox/__init__.py +0 -0
  68. {wardproof-0.2.0 → wardproof-0.3.0}/wardproof/sandbox/executor.py +0 -0
  69. {wardproof-0.2.0 → wardproof-0.3.0}/wardproof/sandbox/permissions.py +0 -0
  70. {wardproof-0.2.0 → wardproof-0.3.0}/wardproof/schema.py +0 -0
@@ -1,16 +1,22 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: wardproof
3
- Version: 0.2.0
3
+ Version: 0.3.0
4
4
  Summary: Local-first, verifiable defensive AI agent swarms that protect other AI agent systems.
5
5
  Author: Wardproof contributors
6
6
  License-Expression: MIT
7
7
  License-File: LICENSE
8
8
  Keywords: agents,ai-security,guardrails,local-first,prompt-injection
9
9
  Requires-Python: >=3.11
10
+ Provides-Extra: agentkit
11
+ Requires-Dist: coinbase-agentkit>=0.7; extra == 'agentkit'
10
12
  Provides-Extra: all
11
13
  Requires-Dist: cryptography>=42; extra == 'all'
12
14
  Requires-Dist: httpx>=0.27; extra == 'all'
13
15
  Requires-Dist: pyyaml>=6; extra == 'all'
16
+ Provides-Extra: anthropic
17
+ Requires-Dist: anthropic>=0.40; extra == 'anthropic'
18
+ Provides-Extra: crewai
19
+ Requires-Dist: crewai>=1.14; extra == 'crewai'
14
20
  Provides-Extra: crypto
15
21
  Requires-Dist: cryptography>=42; extra == 'crypto'
16
22
  Provides-Extra: dev
@@ -21,8 +27,19 @@ Requires-Dist: pytest>=8; extra == 'dev'
21
27
  Requires-Dist: ruff>=0.4; extra == 'dev'
22
28
  Provides-Extra: guard
23
29
  Requires-Dist: llm-guard>=0.3; extra == 'guard'
30
+ Provides-Extra: langgraph
31
+ Requires-Dist: langchain-core>=1.4; extra == 'langgraph'
32
+ Requires-Dist: langgraph>=1.2; extra == 'langgraph'
33
+ Provides-Extra: mcp
34
+ Requires-Dist: mcp>=1.26; extra == 'mcp'
24
35
  Provides-Extra: ollama
25
36
  Requires-Dist: httpx>=0.27; extra == 'ollama'
37
+ Provides-Extra: openai
38
+ Requires-Dist: openai>=1.0; extra == 'openai'
39
+ Provides-Extra: venice
40
+ Requires-Dist: openai>=1.0; extra == 'venice'
41
+ Provides-Extra: x402
42
+ Requires-Dist: x402>=2.0; extra == 'x402'
26
43
  Provides-Extra: yaml
27
44
  Requires-Dist: pyyaml>=6; extra == 'yaml'
28
45
  Description-Content-Type: text/markdown
@@ -31,7 +48,16 @@ Description-Content-Type: text/markdown
31
48
 
32
49
  **Local-first, verifiable defensive AI agent swarms.**
33
50
 
51
+ Stop prompt injection and tool misuse before your agent drains its wallet, leaks
52
+ its keys, or runs the wrong command, and keep a tamper-evident log of every
53
+ decision.
54
+
34
55
  [![CI](https://github.com/Impossible-Mission-Force/wardproof/actions/workflows/ci.yml/badge.svg)](https://github.com/Impossible-Mission-Force/wardproof/actions/workflows/ci.yml)
56
+ [![PyPI](https://img.shields.io/pypi/v/wardproof.svg)](https://pypi.org/project/wardproof/)
57
+ [![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/Impossible-Mission-Force/wardproof/blob/main/LICENSE)
58
+ [![Python 3.11+](https://img.shields.io/badge/python-3.11%2B-blue.svg)](https://www.python.org/downloads/)
59
+
60
+ ![Wardproof screening x402 payments: a legitimate payment is allowed while an attacker redirect, a replayed payment, and a prompt injection in the 402 body are all blocked and written to a tamper-evident ledger.](assets/wardproof-x402-demo.gif)
35
61
 
36
62
  Wardproof is a small framework for building swarms of *defensive* agents that
37
63
  sit in front of your *other* AI systems (RAG pipelines, tool-using agents,
@@ -44,10 +70,17 @@ It is deliberately **small, transparent, and forkable**. The security core has
44
70
  **zero third-party dependencies** and runs **fully offline**, with a local
45
71
  model via Ollama, or with no model at all.
46
72
 
47
- > **Status: v0.1.** The deterministic core is built, tested, and benchmarked
48
- > (see [Benchmark](#benchmark)). It is deployable today as a screening and
49
- > audit layer, designed to run as defence in depth within the scope set out in
50
- > [`THREAT_MODEL.md`](THREAT_MODEL.md) and [`SECURITY.md`](SECURITY.md).
73
+ > **Status: v0.3.0.** The deterministic core is built, tested, and benchmarked
74
+ > (see [Benchmark](#benchmark)), and ships dedicated guards for x402 agent
75
+ > payments, on-chain transfers, MCP tool calls, and skill/tool definitions, a
76
+ > controls-to-standards map (OWASP Agentic Top 10, OWASP LLM 2025, MITRE ATLAS,
77
+ > CSA MAESTRO, and NIST AI 600-1) with STIX 2.1 ledger export, harnesses that
78
+ > screen the public AgentDojo and InjecAgent suites, and drop-in integration
79
+ > examples for OpenAI and Anthropic tool calling, CrewAI, LangGraph, MCP,
80
+ > Coinbase AgentKit, and Venice AI. It is
81
+ > deployable today as a screening and audit layer, designed to run as defence in
82
+ > depth within the scope set out in [`THREAT_MODEL.md`](THREAT_MODEL.md) and
83
+ > [`SECURITY.md`](SECURITY.md).
51
84
 
52
85
  ---
53
86
 
@@ -84,14 +117,26 @@ different stance:
84
117
  - **x402 payment guardrail**: chain-agnostic screening of x402 (HTTP 402)
85
118
  payment envelopes (CAIP-2 network, amount, recipient, asset) with a recipient
86
119
  allowlist, amount thresholds, replay detection, and 402-body injection checks.
120
+ - **Transfer guardrail**: screens on-chain transfers against a recipient
121
+ allowlist and spend threshold, and treats an agent-relayed transfer as never
122
+ pre-authorised (it escalates rather than trusting one agent's say-so).
87
123
  - **MCP guard**: screens MCP tool descriptions and schemas for tool poisoning
88
124
  (incl. hidden Unicode), allowlists servers, detects manifest rug pulls, and
89
125
  audits every tool invocation.
126
+ - **Skill/tool scanner**: screens a skill or tool definition (name, description,
127
+ code) before it is registered, catching hidden instructions buried in a
128
+ description (the tool-poisoning class, one step earlier than a live call). See
129
+ `examples/integrations/skills_guard.py`.
130
+ - **Framework integrations**: drop-in examples that put the swarm in front of
131
+ OpenAI and Anthropic tool calling, CrewAI, LangGraph, MCP, and Coinbase
132
+ AgentKit tool calls, plus Venice AI as an optional escalate-only second-opinion
133
+ backend. Each is an optional dependency; the core imports none of them. See
134
+ [`examples/integrations/`](examples/integrations/).
90
135
  - **Standards-aligned**: every control mapped to OWASP Top 10 for Agentic
91
136
  Applications, OWASP Agentic Threats (T1-T15), OWASP LLM Top 10 2025, CSA
92
- MAESTRO, and MITRE ATLAS (`wardproof/standards.py`, enforced by tests). Ledger
93
- detections are ATLAS-tagged and export to **STIX 2.1** for SIEM/SOC via
94
- `wardproof export-stix`.
137
+ MAESTRO, MITRE ATLAS, and NIST AI 600-1 (`wardproof/standards.py`, enforced by
138
+ tests). Ledger detections are ATLAS-tagged and export to **STIX 2.1** for
139
+ SIEM/SOC via `wardproof export-stix`.
95
140
  - **3 reference agents**: `DetectorAgent`, `VerifierAgent` (with detector
96
141
  integrity check), `ResponderAgent`.
97
142
  - **Capability sandbox**: default-deny permission broker (per-agent grants,
@@ -113,7 +158,7 @@ different stance:
113
158
  pip install -e . # core only, zero third-party deps
114
159
  pip install -e ".[crypto]" # + Ed25519 signed ledgers
115
160
  pip install -e ".[ollama]" # + local model via Ollama
116
- pip install -e ".[all]" # everything, incl. dev tools
161
+ pip install -e ".[all]" # optional runtime backends (ollama, crypto, yaml)
117
162
  ```
118
163
 
119
164
  Requires Python 3.11+.
@@ -204,16 +249,30 @@ category:
204
249
  python benchmarks/run_benchmark.py
205
250
  ```
206
251
 
207
- On the default configuration with no model (66 cases, including a round of
208
- red-team bypasses), it flags all 44 attacks at a 1 in 22 (5%) false-positive
209
- rate. Treat that near-perfect number as a coverage and regression signal on
210
- *known* patterns, not a security claim: the corpus is small and partly
211
- self-authored, so novel attacks (other languages, fresh encodings, or
212
- pure-semantic paraphrase) can still slip past a deterministic denylist. Closing
213
- that gap is the job of the optional LLM second opinion (see Roadmap); these
214
- patterns are the floor, not the ceiling. The full breakdown, including the one
215
- benign input the guardrails deliberately flag, is in
216
- [`benchmarks/README.md`](benchmarks/README.md).
252
+ On the default configuration plus the optional payment, transfer, and MCP guards,
253
+ with no model (136 cases: 89 attacks, 47 benign), it flags all 89 attacks at a
254
+ 0% false-positive rate (0 of 47 benign inputs flagged):
255
+
256
+ | Category | Recall (attacks flagged) | False positives |
257
+ | ---------------- | ------------------------ | --------------- |
258
+ | injection | 27/27 | 0/11 |
259
+ | tool_misuse | 23/23 | 0/10 |
260
+ | memory_poisoning | 16/16 | 0/10 |
261
+ | mcp_poisoning | 6/6 | 0/4 |
262
+ | skill_poisoning | 4/4 | 0/2 |
263
+ | x402_payment | 6/6 | 0/2 |
264
+ | transfer | 3/3 | 0/2 |
265
+ | agent_relayed | 4/4 | 0/2 |
266
+ | benign_general | n/a | 0/4 |
267
+ | **Overall** | **89/89 (100%)** | **0/47 (0%)** |
268
+
269
+ Treat these as a coverage and regression signal on *known* patterns, not a
270
+ security claim: the corpus is partly self-authored, so novel attacks (other
271
+ languages, fresh encodings, or pure-semantic paraphrase) can still slip past a
272
+ deterministic denylist. Closing that gap is the job of the optional LLM second
273
+ opinion (see Roadmap); these patterns are the floor, not the ceiling. Re-run the
274
+ harness to regenerate the numbers above; the full breakdown and the honest edges
275
+ are in [`benchmarks/README.md`](benchmarks/README.md).
217
276
 
218
277
  ---
219
278
 
@@ -240,22 +299,32 @@ No need to touch the engine, the ledger, or the agent base classes.
240
299
  Wardproof is built to become a complete, auditable control layer for AI agents.
241
300
  The direction:
242
301
 
243
- **Now (v0.1)**
244
- The deterministic core: schema, three guardrails, Detector / Verifier /
245
- Responder, a capability sandbox, circuit breaker and watchdog, a hash-chained
246
- and optionally signed audit ledger, a reproducible adversarial benchmark, a
247
- published threat model, worked examples, a test suite, and a ledger
248
- verification CLI.
302
+ **Now (v0.3.0)**
303
+ The deterministic core: schema, guardrails, Detector / Verifier / Responder, a
304
+ capability sandbox, circuit breaker and watchdog, a hash-chained and optionally
305
+ signed audit ledger, a reproducible adversarial benchmark, a published threat
306
+ model, worked examples, a test suite, and a ledger verification CLI. On top of
307
+ that: dedicated guards for x402 payments (recipient allowlist, spend thresholds,
308
+ replay detection, injection screening of the 402 body), on-chain transfers, MCP
309
+ tool calls (description and schema screening, server allowlisting, rug-pull
310
+ detection), and skill/tool definitions; a controls-to-standards map (OWASP
311
+ Agentic Top 10, OWASP LLM 2025, MITRE ATLAS, CSA MAESTRO, NIST AI 600-1) with
312
+ STIX 2.1 ledger export; screening harnesses for the public AgentDojo and
313
+ InjecAgent suites; and drop-in integration examples for OpenAI and Anthropic tool
314
+ calling, CrewAI, LangGraph, MCP, and Coinbase AgentKit, plus Venice AI as an
315
+ optional escalate-only second-opinion backend (alongside the existing Ollama
316
+ backend).
249
317
 
250
318
  **Next**
251
- - A semantic detection layer running alongside the deterministic guardrails as
252
- an escalate-only second opinion, to close the gaps the benchmark exposes.
319
+ - A bundled local semantic detection layer that ships by default alongside the
320
+ deterministic guardrails, to close the gaps the benchmark exposes. The
321
+ escalate-only second-opinion hook already exists (Ollama or Venice); this would
322
+ add a default local model so the semantic layer is on without extra setup.
253
323
  - First-class isolation backends behind one interface: subprocess with rlimits,
254
324
  Docker, and gVisor or microVM, each with its trust boundary documented.
255
- - Optional adapters for popular agent frameworks (LangGraph, CrewAI) and a
256
- FastAPI middleware, dropping the swarm in front of an existing agent without
257
- pulling anything into the security core.
258
- - Config files, structured logging, and a pluggable guardrail registry.
325
+ - A FastAPI middleware that drops the swarm in front of an existing agent
326
+ service, and a pluggable guardrail registry, config files, and structured
327
+ logging.
259
328
 
260
329
  **Later**
261
330
  - Observability: ledger export to OpenTelemetry and SIEM, a read-only audit
@@ -2,7 +2,16 @@
2
2
 
3
3
  **Local-first, verifiable defensive AI agent swarms.**
4
4
 
5
+ Stop prompt injection and tool misuse before your agent drains its wallet, leaks
6
+ its keys, or runs the wrong command, and keep a tamper-evident log of every
7
+ decision.
8
+
5
9
  [![CI](https://github.com/Impossible-Mission-Force/wardproof/actions/workflows/ci.yml/badge.svg)](https://github.com/Impossible-Mission-Force/wardproof/actions/workflows/ci.yml)
10
+ [![PyPI](https://img.shields.io/pypi/v/wardproof.svg)](https://pypi.org/project/wardproof/)
11
+ [![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/Impossible-Mission-Force/wardproof/blob/main/LICENSE)
12
+ [![Python 3.11+](https://img.shields.io/badge/python-3.11%2B-blue.svg)](https://www.python.org/downloads/)
13
+
14
+ ![Wardproof screening x402 payments: a legitimate payment is allowed while an attacker redirect, a replayed payment, and a prompt injection in the 402 body are all blocked and written to a tamper-evident ledger.](assets/wardproof-x402-demo.gif)
6
15
 
7
16
  Wardproof is a small framework for building swarms of *defensive* agents that
8
17
  sit in front of your *other* AI systems (RAG pipelines, tool-using agents,
@@ -15,10 +24,17 @@ It is deliberately **small, transparent, and forkable**. The security core has
15
24
  **zero third-party dependencies** and runs **fully offline**, with a local
16
25
  model via Ollama, or with no model at all.
17
26
 
18
- > **Status: v0.1.** The deterministic core is built, tested, and benchmarked
19
- > (see [Benchmark](#benchmark)). It is deployable today as a screening and
20
- > audit layer, designed to run as defence in depth within the scope set out in
21
- > [`THREAT_MODEL.md`](THREAT_MODEL.md) and [`SECURITY.md`](SECURITY.md).
27
+ > **Status: v0.3.0.** The deterministic core is built, tested, and benchmarked
28
+ > (see [Benchmark](#benchmark)), and ships dedicated guards for x402 agent
29
+ > payments, on-chain transfers, MCP tool calls, and skill/tool definitions, a
30
+ > controls-to-standards map (OWASP Agentic Top 10, OWASP LLM 2025, MITRE ATLAS,
31
+ > CSA MAESTRO, and NIST AI 600-1) with STIX 2.1 ledger export, harnesses that
32
+ > screen the public AgentDojo and InjecAgent suites, and drop-in integration
33
+ > examples for OpenAI and Anthropic tool calling, CrewAI, LangGraph, MCP,
34
+ > Coinbase AgentKit, and Venice AI. It is
35
+ > deployable today as a screening and audit layer, designed to run as defence in
36
+ > depth within the scope set out in [`THREAT_MODEL.md`](THREAT_MODEL.md) and
37
+ > [`SECURITY.md`](SECURITY.md).
22
38
 
23
39
  ---
24
40
 
@@ -55,14 +71,26 @@ different stance:
55
71
  - **x402 payment guardrail**: chain-agnostic screening of x402 (HTTP 402)
56
72
  payment envelopes (CAIP-2 network, amount, recipient, asset) with a recipient
57
73
  allowlist, amount thresholds, replay detection, and 402-body injection checks.
74
+ - **Transfer guardrail**: screens on-chain transfers against a recipient
75
+ allowlist and spend threshold, and treats an agent-relayed transfer as never
76
+ pre-authorised (it escalates rather than trusting one agent's say-so).
58
77
  - **MCP guard**: screens MCP tool descriptions and schemas for tool poisoning
59
78
  (incl. hidden Unicode), allowlists servers, detects manifest rug pulls, and
60
79
  audits every tool invocation.
80
+ - **Skill/tool scanner**: screens a skill or tool definition (name, description,
81
+ code) before it is registered, catching hidden instructions buried in a
82
+ description (the tool-poisoning class, one step earlier than a live call). See
83
+ `examples/integrations/skills_guard.py`.
84
+ - **Framework integrations**: drop-in examples that put the swarm in front of
85
+ OpenAI and Anthropic tool calling, CrewAI, LangGraph, MCP, and Coinbase
86
+ AgentKit tool calls, plus Venice AI as an optional escalate-only second-opinion
87
+ backend. Each is an optional dependency; the core imports none of them. See
88
+ [`examples/integrations/`](examples/integrations/).
61
89
  - **Standards-aligned**: every control mapped to OWASP Top 10 for Agentic
62
90
  Applications, OWASP Agentic Threats (T1-T15), OWASP LLM Top 10 2025, CSA
63
- MAESTRO, and MITRE ATLAS (`wardproof/standards.py`, enforced by tests). Ledger
64
- detections are ATLAS-tagged and export to **STIX 2.1** for SIEM/SOC via
65
- `wardproof export-stix`.
91
+ MAESTRO, MITRE ATLAS, and NIST AI 600-1 (`wardproof/standards.py`, enforced by
92
+ tests). Ledger detections are ATLAS-tagged and export to **STIX 2.1** for
93
+ SIEM/SOC via `wardproof export-stix`.
66
94
  - **3 reference agents**: `DetectorAgent`, `VerifierAgent` (with detector
67
95
  integrity check), `ResponderAgent`.
68
96
  - **Capability sandbox**: default-deny permission broker (per-agent grants,
@@ -84,7 +112,7 @@ different stance:
84
112
  pip install -e . # core only, zero third-party deps
85
113
  pip install -e ".[crypto]" # + Ed25519 signed ledgers
86
114
  pip install -e ".[ollama]" # + local model via Ollama
87
- pip install -e ".[all]" # everything, incl. dev tools
115
+ pip install -e ".[all]" # optional runtime backends (ollama, crypto, yaml)
88
116
  ```
89
117
 
90
118
  Requires Python 3.11+.
@@ -175,16 +203,30 @@ category:
175
203
  python benchmarks/run_benchmark.py
176
204
  ```
177
205
 
178
- On the default configuration with no model (66 cases, including a round of
179
- red-team bypasses), it flags all 44 attacks at a 1 in 22 (5%) false-positive
180
- rate. Treat that near-perfect number as a coverage and regression signal on
181
- *known* patterns, not a security claim: the corpus is small and partly
182
- self-authored, so novel attacks (other languages, fresh encodings, or
183
- pure-semantic paraphrase) can still slip past a deterministic denylist. Closing
184
- that gap is the job of the optional LLM second opinion (see Roadmap); these
185
- patterns are the floor, not the ceiling. The full breakdown, including the one
186
- benign input the guardrails deliberately flag, is in
187
- [`benchmarks/README.md`](benchmarks/README.md).
206
+ On the default configuration plus the optional payment, transfer, and MCP guards,
207
+ with no model (136 cases: 89 attacks, 47 benign), it flags all 89 attacks at a
208
+ 0% false-positive rate (0 of 47 benign inputs flagged):
209
+
210
+ | Category | Recall (attacks flagged) | False positives |
211
+ | ---------------- | ------------------------ | --------------- |
212
+ | injection | 27/27 | 0/11 |
213
+ | tool_misuse | 23/23 | 0/10 |
214
+ | memory_poisoning | 16/16 | 0/10 |
215
+ | mcp_poisoning | 6/6 | 0/4 |
216
+ | skill_poisoning | 4/4 | 0/2 |
217
+ | x402_payment | 6/6 | 0/2 |
218
+ | transfer | 3/3 | 0/2 |
219
+ | agent_relayed | 4/4 | 0/2 |
220
+ | benign_general | n/a | 0/4 |
221
+ | **Overall** | **89/89 (100%)** | **0/47 (0%)** |
222
+
223
+ Treat these as a coverage and regression signal on *known* patterns, not a
224
+ security claim: the corpus is partly self-authored, so novel attacks (other
225
+ languages, fresh encodings, or pure-semantic paraphrase) can still slip past a
226
+ deterministic denylist. Closing that gap is the job of the optional LLM second
227
+ opinion (see Roadmap); these patterns are the floor, not the ceiling. Re-run the
228
+ harness to regenerate the numbers above; the full breakdown and the honest edges
229
+ are in [`benchmarks/README.md`](benchmarks/README.md).
188
230
 
189
231
  ---
190
232
 
@@ -211,22 +253,32 @@ No need to touch the engine, the ledger, or the agent base classes.
211
253
  Wardproof is built to become a complete, auditable control layer for AI agents.
212
254
  The direction:
213
255
 
214
- **Now (v0.1)**
215
- The deterministic core: schema, three guardrails, Detector / Verifier /
216
- Responder, a capability sandbox, circuit breaker and watchdog, a hash-chained
217
- and optionally signed audit ledger, a reproducible adversarial benchmark, a
218
- published threat model, worked examples, a test suite, and a ledger
219
- verification CLI.
256
+ **Now (v0.3.0)**
257
+ The deterministic core: schema, guardrails, Detector / Verifier / Responder, a
258
+ capability sandbox, circuit breaker and watchdog, a hash-chained and optionally
259
+ signed audit ledger, a reproducible adversarial benchmark, a published threat
260
+ model, worked examples, a test suite, and a ledger verification CLI. On top of
261
+ that: dedicated guards for x402 payments (recipient allowlist, spend thresholds,
262
+ replay detection, injection screening of the 402 body), on-chain transfers, MCP
263
+ tool calls (description and schema screening, server allowlisting, rug-pull
264
+ detection), and skill/tool definitions; a controls-to-standards map (OWASP
265
+ Agentic Top 10, OWASP LLM 2025, MITRE ATLAS, CSA MAESTRO, NIST AI 600-1) with
266
+ STIX 2.1 ledger export; screening harnesses for the public AgentDojo and
267
+ InjecAgent suites; and drop-in integration examples for OpenAI and Anthropic tool
268
+ calling, CrewAI, LangGraph, MCP, and Coinbase AgentKit, plus Venice AI as an
269
+ optional escalate-only second-opinion backend (alongside the existing Ollama
270
+ backend).
220
271
 
221
272
  **Next**
222
- - A semantic detection layer running alongside the deterministic guardrails as
223
- an escalate-only second opinion, to close the gaps the benchmark exposes.
273
+ - A bundled local semantic detection layer that ships by default alongside the
274
+ deterministic guardrails, to close the gaps the benchmark exposes. The
275
+ escalate-only second-opinion hook already exists (Ollama or Venice); this would
276
+ add a default local model so the semantic layer is on without extra setup.
224
277
  - First-class isolation backends behind one interface: subprocess with rlimits,
225
278
  Docker, and gVisor or microVM, each with its trust boundary documented.
226
- - Optional adapters for popular agent frameworks (LangGraph, CrewAI) and a
227
- FastAPI middleware, dropping the swarm in front of an existing agent without
228
- pulling anything into the security core.
229
- - Config files, structured logging, and a pluggable guardrail registry.
279
+ - A FastAPI middleware that drops the swarm in front of an existing agent
280
+ service, and a pluggable guardrail registry, config files, and structured
281
+ logging.
230
282
 
231
283
  **Later**
232
284
  - Observability: ledger export to OpenTelemetry and SIEM, a read-only audit
@@ -68,8 +68,8 @@ our own model.
68
68
  | Data exfiltration | outbound-data detection plus `allowed_hosts` | a creative encoding, or an allowlisted host that is itself compromised |
69
69
  | Memory poisoning | flags persistence and stealth phrasing | subtle writes below threshold get escalated, not blocked; `remember that` over-flags (benchmark `mem-b4`) |
70
70
  | Compromised defensive agent | Verifier re-checks independently and audits the Detector; verdicts combine fail-closed | if Detector and Verifier share the same rule set, they share its blind spots |
71
- | Alert flood or cascading failure | CircuitBreaker forces a human into the loop | depends on threshold tuning |
72
- | Ledger tampering | hash chain, signatures, external append-only storage, self-verify | a stolen signing key can forge entries; deleting the whole store removes evidence |
71
+ | Alert flood or cascading failure | CircuitBreaker forces a human into the loop | a downgrade window opens once tripped (floor is ESCALATE, CRITICAL is exempt); see the breaker note below |
72
+ | Ledger tampering | hash chain, signatures, external append-only storage, self-verify | the hash chain alone cannot detect a full rewrite; signatures close that, but a stolen key can forge entries and deleting the whole store removes evidence; see the trust-model note below |
73
73
  | Sandbox escape | rlimit-bounded subprocess runner | not a boundary for hostile native code; use containers, gVisor, or microVMs |
74
74
  | Supply chain | zero-dependency core | optional extras and the toolchain remain; signed releases and SBOM are roadmap v1.0 |
75
75
 
@@ -80,13 +80,15 @@ operators can place Wardproof in their existing risk language. The mapping is no
80
80
  prose: it lives in code (`wardproof/standards.py`) and a test
81
81
  (`tests/test_standards.py`) fails if any identifier here is wrong or unknown. The
82
82
  identifiers were verified against primary sources (the official
83
- `mitre-atlas/atlas-data` dataset, genai.owasp.org, and the CSA MAESTRO
84
- publication); provenance and source URLs are in
83
+ `mitre-atlas/atlas-data` dataset, genai.owasp.org, the CSA MAESTRO publication,
84
+ and the NIST AI 600-1 PDF); provenance and source URLs are in
85
85
  `research/03-standards-verification.md`.
86
86
 
87
87
  Frameworks: **ASI** = OWASP Top 10 for Agentic Applications 2026; **T** = OWASP
88
88
  Agentic AI Threats and Mitigations (T1-T15); **LLM** = OWASP Top 10 for LLM
89
- Applications 2025; **L** = CSA MAESTRO layer; **AML.T** = MITRE ATLAS technique.
89
+ Applications 2025; **L** = CSA MAESTRO layer; **AML.T** = MITRE ATLAS technique;
90
+ **NIST AI 600-1** = the generative-AI risks named in the NIST GenAI Profile
91
+ (mapped in a separate table below).
90
92
 
91
93
  | Wardproof control | OWASP Agentic Top 10 | OWASP Agentic Threats | OWASP LLM 2025 | MAESTRO | MITRE ATLAS |
92
94
  | --- | --- | --- | --- | --- | --- |
@@ -122,6 +124,31 @@ The MITRE ATLAS catalog in `standards.py` also carries `AML.T0024`,
122
124
  `AML.T0098`, and `AML.T0083` for credential and inference-API exfiltration that
123
125
  the roadmap guardrails will map onto.
124
126
 
127
+ ### NIST AI 600-1 (Generative AI Profile)
128
+
129
+ The guards also map onto the generative-AI risks named in **NIST AI 600-1**,
130
+ "Artificial Intelligence Risk Management Framework: Generative Artificial
131
+ Intelligence Profile" (July 2024), available at
132
+ <https://doi.org/10.6028/NIST.AI.600-1>. The risk names below are verbatim from
133
+ Section 2 of that document (`standards.py` carries all twelve).
134
+
135
+ | Wardproof control | NIST AI 600-1 risk |
136
+ | --- | --- |
137
+ | Prompt-injection guardrail | Information Security, Data Privacy |
138
+ | Tool-misuse guardrail | Information Security |
139
+ | Memory-poisoning guardrail | Information Integrity |
140
+ | x402 payment guardrail | Information Security |
141
+ | On-chain transfer guardrail | Information Security |
142
+ | MCP guard | Information Security |
143
+
144
+ Honest scope note: NIST AI 600-1 frames risks, not specific attack techniques.
145
+ Its Section 2 "Information Security" risk covers offensive cyber, vulnerability
146
+ exploitation, and attacks on the confidentiality and integrity of training data,
147
+ code, and model weights; the literal phrase "prompt injection" appears in the
148
+ document's suggested actions, not in that Section 2 risk definition. The mapping
149
+ above is to the risk category each guard reduces, not a claim that the document
150
+ names each attack by the same words.
151
+
125
152
  ### SIEM / SOC integration
126
153
 
127
154
  The ledger exports to **STIX 2.1** so detections flow into SIEM and SOC tooling
@@ -138,6 +165,59 @@ custom properties, so a SIEM alert traces back to the exact, hash-verifiable
138
165
  ledger entry. The export is deterministic (stable object ids), so re-exporting
139
166
  the same ledger yields an identical bundle.
140
167
 
168
+ ### Audit ledger trust model
169
+
170
+ The ledger is **tamper-evident, not tamper-preventing**. Be precise about what
171
+ that buys you:
172
+
173
+ - The stdlib hash chain (no key) detects in-place **mutation**, **deletion**,
174
+ **reordering**, and a **forged appended** entry: each breaks a recomputed hash
175
+ or a chain link.
176
+ - The hash chain alone does **not** detect a **full rewrite**: an attacker who
177
+ changes an entry and recomputes every downstream hash and `prev_hash` produces
178
+ a chain that re-verifies. A bare hash chain cannot stop someone who rewrites
179
+ all of it.
180
+ - **Ed25519 signatures** close that gap by binding history to a key the writer
181
+ holds. When a public key is in play, `verify()` requires **every** entry to
182
+ carry a valid signature: a missing signature is treated as tampering, not
183
+ skipped. So a full rewrite fails (the old signatures no longer match the
184
+ recomputed hashes) and stripping signatures fails (a missing signature at the
185
+ tampered index). The `require_signatures` flag forces this check explicitly.
186
+ - Signatures matter only when the **writer and verifier differ** and the
187
+ **private key lives outside the audited agent**. If the same process holds the
188
+ key and can rewrite the store, it can produce a fresh valid chain: that is
189
+ tamper-evidence's limit, and it is why the key and the verifier belong outside
190
+ the agent. Deleting the entire store still removes the evidence; append-only or
191
+ tamper-resistant media is the operator's job.
192
+
193
+ ### Circuit-breaker downgrade tradeoff
194
+
195
+ When too many severe verdicts occur in a short window the breaker trips and, for
196
+ a cooldown, downgrades further `BLOCK`/`QUARANTINE` verdicts to `ESCALATE`. This
197
+ trades a block storm for a downgrade window, and it is a deliberate tradeoff, not
198
+ a hole:
199
+
200
+ - The downgrade floor is `ESCALATE`, **never** `ALLOW`: a human stays in the
201
+ loop, no action slips straight through.
202
+ - **CRITICAL** severity is **exempt**: a clearly critical action (a denied tool,
203
+ a payment replay) is never softened; only repeated lower-severity noise is.
204
+ - It is configurable: a high `max_events` effectively opts out for operators who
205
+ would rather take a block storm than a downgrade window.
206
+
207
+ ### x402 canonical-envelope assumption
208
+
209
+ The x402 guard reads the payment envelope tolerant of the several field-name
210
+ aliases the ecosystem uses. A hostile server could try a **split envelope**:
211
+ a benign value where the guard looks and a hostile value where the payer reads.
212
+ The guard defends in depth by collecting **every** recipient and amount value,
213
+ requiring all recipients to be allowlisted, evaluating the **worst-case** (max)
214
+ amount against the threshold, flagging conflicting aliases as ambiguous, and
215
+ flagging non-positive amounts. It still **assumes the caller passes a single
216
+ canonical envelope** and that the payment tool reads the same authoritative
217
+ fields the guard screens; the conflict detection is a backstop, not a substitute
218
+ for a canonical envelope. The guard screens; the sandbox permission broker
219
+ remains the hard enforcement layer for the actual payment tool.
220
+
141
221
  ## Out of scope
142
222
 
143
223
  Wardproof does **not** protect against the following. Treat these as the