llm-trust-guard 4.19.0 → 4.19.1

Files changed (3)
  1. package/CHANGELOG.md +20 -0
  2. package/README.md +19 -0
  3. package/package.json +1 -1
package/CHANGELOG.md CHANGED
@@ -5,6 +5,26 @@ All notable changes to `llm-trust-guard` will be documented in this file.
  The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
  and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+ ## [4.19.1] - 2026-04-23
+
+ ### Added — Measured Performance
+
+ - Published held-out benchmark results in [tests/adversarial/RESULTS-v4.19.0.md](tests/adversarial/RESULTS-v4.19.0.md): methodology, 95% Wilson CIs, hand-adjudicated label noise, and reproducibility scripts.
+ - New README section "Measured Performance" summarizing the findings:
+   - On Giskard (n=35) and Compass CTF Chinese (n=11), Pipeline A's detection rate is unchanged from v4.13.5 (80.00% and 9.09%, respectively). These corpora are underpowered — "no evidence of improvement," not "proof of no improvement."
+   - On WildChat-1M (10,000 real ChatGPT production prompts, seed=42), Pipeline A's corrected FPR is ~2.73% [95% CI 2.43, 2.84] after canonical-marker analysis plus a 50-sample hand adjudication (203 canonical TPs + ~17 TPs extrapolated among unmarked blocks → ~220 true jailbreak attempts among the 493 blocks).
+   - Same order of magnitude as Meta Prompt Guard 86M's self-reported 3–5% OOD FPR; not a head-to-head comparison.
+ - New reproducibility scripts: `tests/adversarial/extract_wildchat.py`, `classify_wildchat_blocks.py`, `wildchat-fpr.ts`, `wildchat-block-dump.ts`, `v419-delta-benchmark.ts`, `hand-labels.json`, and `tests/adversarial/README.md`.
+
+ ### Not changed
+
+ - No code changes to any guard. No detection patterns added or modified. No published API changed.
+ - All 705 tests still pass.
+
+ ### Context
+
+ - [ARTICLE-2-REGEX-CEILING.md](https://github.com/nkratk/llm-trust-guard/blob/main/../ARTICLE-2-REGEX-CEILING.md) now includes a 2026-04-23 addendum reconciling v4.13.5 → v4.19.0.
+
  ## [4.19.0] - 2026-04-23
 
  ### Added — Indirect Injection Expansion
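The 95% Wilson confidence intervals cited in the changelog can be checked independently. A minimal sketch, written here from the standard Wilson score formula (this is not the package's `wildchat-fpr.ts`; only the counts 493/10,000 and 898/10,000 come from the diff):

```typescript
// Wilson score interval for a binomial proportion (z = 1.96 for 95%).
// Independent sketch; not taken from the package's benchmark scripts.
function wilson(successes: number, n: number, z = 1.96): [number, number] {
  const p = successes / n;
  const z2 = z * z;
  const center = p + z2 / (2 * n);
  const margin = z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n));
  const denom = 1 + z2 / n;
  return [(center - margin) / denom, (center + margin) / denom];
}

// Raw block rate for Pipeline A: 493 blocks out of 10,000 prompts.
const [lo, hi] = wilson(493, 10_000);
console.log((100 * lo).toFixed(2), (100 * hi).toFixed(2)); // prints: 4.52 5.37
```

Running the same function on the detection-only facade's 898/10,000 reproduces its quoted raw interval [8.44, 9.56]. The corrected-FPR interval [2.43, 2.84] is not a plain Wilson interval on 273/10,000 and presumably folds in the adjudication uncertainty described in RESULTS-v4.19.0.md.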
package/README.md CHANGED
@@ -222,6 +222,25 @@ const output = guard.filterOutput(llmResponse, session.role);
  | ASI09: Trust Exploitation | TrustExploitationGuard | Strong |
  | ASI10: Rogue Agents | DriftDetector, AutonomyEscalationGuard | Moderate |
 
+ ## Measured Performance
+
+ v4.19.0 benchmark, 2026-04-23. Full methodology, 95% confidence intervals, hand-adjudication labels, and reproducibility scripts: [tests/adversarial/RESULTS-v4.19.0.md](tests/adversarial/RESULTS-v4.19.0.md).
+
+ **Attack detection on prior-published corpora** (Giskard n=35, Compass CTF Chinese n=11): detection rate has not moved from v4.13.5 → v4.19.0 on the Sanitizer+Encoder pipeline — 80.00% and 9.09% respectively, identical to the v4.13.5 numbers. Six releases of pattern additions (v4.14–v4.19) targeted different attack classes (indirect injection, tool-result validation, memory persistence, multi-agent trust) that these direct-text jailbreak corpora do not exercise. Small sample sizes mean "no evidence of improvement," not "proof of no improvement."
+
+ **False-positive rate on real ChatGPT production traffic** (WildChat-1M shard 0, 10,000 non-toxic multilingual first user turns, seed=42):
+
+ | Pipeline | Blocks | Raw block rate | Corrected FPR (after label adjudication) |
+ |---|---|---|---|
+ | Sanitizer + Encoder | 493 / 10,000 | 4.93% [4.52, 5.37] | **~2.73%** [2.43, 2.84] |
+ | Detection-only facade | 898 / 10,000 | 8.98% [8.44, 9.56] | ≤6.69% (upper bound; not fully adjudicated) |
+
+ WildChat filters toxic content but not prompt-injection intent. Canonical-marker analysis plus a 50-sample hand adjudication found that approximately 220 of the 493 Pipeline A blocks are actual jailbreak attempts users sent to ChatGPT, not genuine false positives. The corrected FPR is in the same order of magnitude as Meta Prompt Guard 86M's self-reported 3–5% out-of-distribution FPR — not a head-to-head comparison, but a useful reference point.
+
+ **Not measured:** detection rate on the four attack classes added in v4.19.0 (CSS-hidden content, HTML attribute directives, JSON agent-directive fields, Reprompt-class markdown exfiltration). No public held-out corpus exists for these at statistical scale. Unit-test coverage is in the v4.19.0 test suite; independent third-party evaluation is invited.
+
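The label-correction arithmetic in the README section above is simple enough to verify inline. A minimal sketch using only the counts stated in the diff (203 canonical TPs, ~17 extrapolated TPs, 493 blocks, 10,000 prompts); it does not run any guard:

```typescript
// Check the corrected-FPR arithmetic from the "Measured Performance" section.
// All counts come from the diff; nothing here executes the actual guards.
const totalPrompts = 10_000;
const blocks = 493;          // Pipeline A (Sanitizer + Encoder) blocks
const canonicalTPs = 203;    // blocks carrying canonical jailbreak markers
const extrapolatedTPs = 17;  // estimate from the 50-sample hand adjudication

const trueAttempts = canonicalTPs + extrapolatedTPs;   // ≈ 220 real attacks
const correctedFPR = (blocks - trueAttempts) / totalPrompts;

console.log(trueAttempts);                           // prints: 220
console.log((100 * correctedFPR).toFixed(2) + "%");  // prints: 2.73%
```

In other words, the corrected FPR treats the ~220 adjudicated jailbreak attempts as true positives and counts only the remaining 273 blocks against the full 10,000-prompt denominator.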
+ For higher detection on adversarial corpora, plug in an ML classifier via the [DetectionClassifier interface](#pluggable-detection).
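The diff links a `DetectionClassifier` interface but does not show its shape. Purely as an illustration of the plug-in idea, a classifier hook might look like the following hypothetical sketch; every member name (`classify`, `shouldBlock`, `KeywordClassifier`) is invented here and is not the package's actual API:

```typescript
// Hypothetical sketch only — the real DetectionClassifier interface is defined
// in the package docs (#pluggable-detection); all names below are invented.
interface DetectionClassifier {
  // Return a jailbreak probability in [0, 1] for one prompt.
  classify(prompt: string): Promise<number>;
}

// Toy stand-in for an ML model: flags one obvious jailbreak cue via regex.
class KeywordClassifier implements DetectionClassifier {
  async classify(prompt: string): Promise<number> {
    return /ignore (all )?previous instructions/i.test(prompt) ? 0.95 : 0.05;
  }
}

// A guard pipeline could consult the classifier alongside its regex layers.
async function shouldBlock(
  c: DetectionClassifier,
  prompt: string,
  threshold = 0.5
): Promise<boolean> {
  return (await c.classify(prompt)) >= threshold;
}

shouldBlock(new KeywordClassifier(), "Ignore all previous instructions.")
  .then((blocked) => console.log(blocked)); // prints: true
```

An ML-backed implementation would replace the regex body with a model call while keeping the same probability-returning contract.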
+
  ## Defense In Depth
 
  This package is one layer. For production systems, combine with:
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "llm-trust-guard",
- "version": "4.19.0",
+ "version": "4.19.1",
  "description": "Comprehensive security guards for LLM-powered and agentic AI applications - 34 guards covering OWASP Top 10 for LLMs 2025, Agentic Applications 2026, and MCP Security. All guards accessible via unified TrustGuard facade. Features prompt injection (PAP/persuasion), multi-modal attacks, RAG poisoning with embedding attack detection, memory persistence attacks, code execution sandboxing, multi-agent security (spawn policy, delegation scope, trust transitivity), MCP tool shadowing prevention, system prompt leakage protection, human-agent trust exploitation (ASI09), autonomy escalation (ASI10), state persistence (ASI08), tool chain validation v2 (ASI07/ASI04), circuit breaker, drift detection, and more",
  "main": "dist/index.js",
  "module": "dist/index.mjs",