PyPI - agentsec-eval - Versions diffs - 0.9.1__tar.gz - Mend

agentsec-eval 0.9.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (361)

agentsec_eval-0.9.1/.gitignore ADDED

@@ -0,0 +1,17 @@
+.env
+__pycache__/
+*.pyc
+.venv
+.venv/
+dist/
+*.egg-info/
+report*.md
+.pytest_cache/
+.ruff_cache/
+.DS_Store
+.worktrees/
+*.local.yaml
+report-local/
+report-local-*/
+AUTHORIZATION.txt
+.cache/

agentsec_eval-0.9.1/AUTHORIZATION.txt.example ADDED

@@ -0,0 +1,28 @@
+# AgentSec server-audit authorization file.
+#
+# This file is the only thing standing between AgentSec and unauthorized
+# execution against a real OpenClaw deployment. Treat it like a credential.
+#
+# To sign:
+#   1. Set AGENTSEC_AUTH_SIGNING_KEY in your env (32+ random bytes; see .env.example).
+#   2. Compute the signature with:
+#        python -c "from agentsec.audit.authorization import Authorization; \
+#                   import os, sys; \
+#                   a = Authorization.load(sys.argv[1]); \
+#                   print(a.compute_signature(os.environ['AGENTSEC_AUTH_SIGNING_KEY'].encode()))" \
+#                   AUTHORIZATION.txt
+#   3. Paste the result into the `signature:` field below.
+target_host: openclaw.example.com
+authorized_by: your-name@example.com
+identity_provider: okta-saml             # blank "" allowed; flips LOW_ASSURANCE
+identity_assertion: ""                   # paste IdP-issued JWT/SAML assertion; blank flips LOW_ASSURANCE
+valid_from: 2026-04-27T00:00:00Z         # ISO 8601 UTC
+valid_until: 2026-05-27T00:00:00Z
+scope:
+  - server-audit
+  - remote-evaluation
+report_output_path_prefix: ./report-2026-04-27/
+signature_mode: hmac_sha256              # hmac_sha256 | none ("none" prints LOW_ASSURANCE)
+signature: REPLACE_WITH_BASE64_HMAC
+signature_key_env: AGENTSEC_AUTH_SIGNING_KEY
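The signing step in the example file can be illustrated with a standard-library sketch. It assumes a hypothetical canonical serialization (the signed fields joined as `name=value` lines in a fixed order). The real `Authorization.compute_signature` may serialize the fields differently, so this sketch will not reproduce its output. It shows only the two checks the file's comments imply: a base64 HMAC-SHA256 over the signed fields, compared in constant time, and a `valid_from`/`valid_until` window that rejects naive (timezone-less) timestamps. The field list, `EXAMPLE_FIELDS`, and the half-open window are assumptions made for illustration.

```python
import base64
import hashlib
import hmac
from datetime import datetime, timezone

# Hypothetical set and order of signed fields; the real tool may differ.
SIGNED_FIELDS = (
    "target_host",
    "authorized_by",
    "identity_provider",
    "identity_assertion",
    "valid_from",
    "valid_until",
    "scope",
    "report_output_path_prefix",
    "signature_mode",
)

# Values copied from AUTHORIZATION.txt.example, held as plain strings/lists.
EXAMPLE_FIELDS = {
    "target_host": "openclaw.example.com",
    "authorized_by": "your-name@example.com",
    "identity_provider": "okta-saml",
    "identity_assertion": "",
    "valid_from": "2026-04-27T00:00:00Z",
    "valid_until": "2026-05-27T00:00:00Z",
    "scope": ["server-audit", "remote-evaluation"],
    "report_output_path_prefix": "./report-2026-04-27/",
    "signature_mode": "hmac_sha256",
}


def canonical_message(fields: dict) -> bytes:
    """Serialize the signed fields deterministically as `name=value` lines."""
    lines = []
    for name in SIGNED_FIELDS:
        value = fields[name]
        if isinstance(value, list):
            value = ",".join(value)
        lines.append(f"{name}={value}")
    return "\n".join(lines).encode("utf-8")


def sign(fields: dict, key: bytes) -> str:
    """Return the base64-encoded HMAC-SHA256 over the canonical message."""
    digest = hmac.new(key, canonical_message(fields), hashlib.sha256).digest()
    return base64.b64encode(digest).decode("ascii")


def parse_utc(text: str) -> datetime:
    """Parse an ISO 8601 timestamp; a trailing 'Z' is mapped to +00:00."""
    parsed = datetime.fromisoformat(text.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        raise ValueError(f"naive timestamp rejected: {text!r}")
    return parsed


def verify(fields: dict, signature: str, key: bytes, now: datetime) -> bool:
    """Constant-time signature check, then the [valid_from, valid_until) window."""
    if not hmac.compare_digest(sign(fields, key), signature):
        return False
    return parse_utc(fields["valid_from"]) <= now < parse_utc(fields["valid_until"])


if __name__ == "__main__":
    # Stand-in key for illustration only; a real key comes from AGENTSEC_AUTH_SIGNING_KEY.
    key = b"\x00" * 32
    sig = sign(EXAMPLE_FIELDS, key)
    print(sig, verify(EXAMPLE_FIELDS, sig, key, datetime.now(timezone.utc)))
```

`hmac.compare_digest` avoids leaking how many leading characters matched. Any edit to a signed field, such as `target_host`, invalidates the signature, which is what lets the file function as a credential.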

agentsec_eval-0.9.1/CHANGELOG.md ADDED

@@ -0,0 +1,377 @@
+# Changelog
+Version numbers follow [Semantic Versioning](https://semver.org/).
+---
+## [Unreleased]
+---
+## [0.9.1] — 2026-05-07
+### Added
+- **Multi-provider LLM judge** — `ANTHROPIC_API_KEY` is no longer a hard
+  requirement. The new
+  `agentsec.evaluator.judge_factory.build_default_judge_from_env()` builds
+  either `LLMJudge` (Anthropic) or `OpenAICompatibleJudge` on demand from five
+  variables: `AGENTSEC_LLM_PROVIDER`, `AGENTSEC_LLM_API_KEY`,
+  `AGENTSEC_LLM_BASE_URL`, `AGENTSEC_LLM_MODEL`, and `AGENTSEC_LLM_TIMEOUT`.
+  When `AGENTSEC_LLM_BASE_URL` is set, the provider is inferred as `openai`.
+  `agentsec run` and `agentsec evaluate` (when the config has no `judge:`
+  section) share this code path, so OpenAI, DeepSeek, Qwen, Moonshot,
+  OpenRouter, Together, and local Ollama / vLLM / MLX all work without writing
+  any YAML. Anthropic remains the default. A `judge:` section in the YAML
+  config takes precedence over the environment variables.
+- `LICENSE` (MIT) at the repository root.
+### Changed
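+- **PyPI distribution name changed to `agentsec-eval`** (`agentsec` is
+  already taken by another project). The CLI command is still `agentsec`;
+  install with `pipx install agentsec-eval` or run with
+  `uvx --from agentsec-eval agentsec`.
+- `pyproject.toml` now carries complete PyPI metadata: `license = "MIT"` plus
+  `license-files`, `authors`, `readme`, `keywords`, `classifiers`, and
+  `project.urls`; `[tool.hatch.build.targets.{wheel,sdist}]` explicitly
+  declares what gets packaged; `build` and `twine` were added to
+  `[project.optional-dependencies].dev`.
+- The quick-start in README and `docs/guide/getting-started.md` now
+  recommends pipx by default; installing from source moved down to the
+  "Development" section.
+- The `--watchlist` / `--intel` defaults of `agentsec ioc-update` changed from
+  the relative path `src/agentsec/audit/ioc/...` to the copies bundled in the
+  wheel (resolved at runtime via
+  `Path(__file__).parent / "audit" / "ioc" / ...`). pipx users can run
+  `agentsec ioc-update` from any working directory and still pick up the
+  correct built-in IOC data; explicitly passing `--watchlist /path` /
+  `--intel /path` still uses the user-supplied path.
+- The environment-variable fallback chain for `agentsec run --api-key` is now
+  `AGENTSEC_LLM_API_KEY` → `AGENTSEC_JUDGE_API_KEY` (the old name still works).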
+---
+## [0.9.0] — 2026-05-04
+### Added
+- **Community suite marketplace** — operators can now install
+  third-party test suites with `agentsec suite install <name>` and run
+  them via `agentsec run --suite <name>`. Bundles are pinned by
+  canonical SHA-256 and validated against a strict threat-intel
+  reference policy. See
+  `docs/superpowers/specs/2026-05-04-community-suite-marketplace-design.md`.
+- New CLI sub-app `agentsec suite list/info/install/uninstall` plus the
+  hidden utility `agentsec suite hash` for index contributors.
+- New module `src/agentsec/suite_registry/` containing `manifest.py`
+  (Pydantic `SuiteManifest` + `IndexEntry` + `RegistryFile`),
+  `hashing.py` (Merkle SHA-256 over the bundle tree), `registry.py`
+  (`DefaultRegistry` reading the wheel-bundled `community-suites.yaml`,
+  `HttpRegistry` for the hidden `--registry-url` override), `store.py`
+  (local install store at `~/.agentsec/suites/`), `fetcher.py`
+  (codeload.github.com tarball fetch via httpx), `installer.py` (full
+  install pipeline with atomic rename + rollback).
+- Reference template at `examples/sample-community-suite/`; per-PR
+  static check `scripts/check_community_index.py` wired into `ci.yml`;
+  weekly `verify-community-index.yml` workflow that opens an issue on
+  hash drift.
+- ADR `docs/adr/0005-community-suite-marketplace.md`.
+### Changed
+- `pyproject.toml` version bumped from `0.2.0` to `0.9.0` to match the
+  release tags shipped between v0.3.0 and v0.9.0.
+- `Store()` resolves `~/.agentsec/suites` lazily (per-instance)
+  instead of at import time so a `HOME` change between commands is
+  honored. The `DEFAULT_STORE_ROOT` export is preserved for
+  back-compat.
+### Migration notes
+- The registry ships empty (`suites: {}`). The first real entry will
+  arrive through the contribution flow. Existing `test_suites/` usage
+  is unchanged — community suites are a new, additive surface.
+---
+## [0.8.0] — 2026-05-04
+### Added
+- **Custom judge interface** — operators can now replace the default Claude judge in
+  `openclaw-target.yaml` with any OpenAI-compatible LLM backend (`type: openai_compatible`)
+  or a Python plugin (`type: plugin`). See `docs/superpowers/specs/2026-05-04-custom-judge-design.md`.
+- `OpenAICompatibleJudge` (`src/agentsec/evaluator/openai_judge.py`) — calls any
+  `/chat/completions` endpoint; reuses the same `_SYSTEM` prompt as `LLMJudge`.
+- `PluginJudge` (`src/agentsec/evaluator/plugin_judge.py`) — loads a user `.py` file
+  via `importlib`, discovers the single `Judge` subclass, delegates all calls to it.
+- `_build_llm_judge` helper in `cli.py` dispatches to the correct judge based on config.
+- New config fields: `OpenAIJudgeConfig`, `PluginJudgeConfig` (Pydantic discriminated union).
+---
+## [0.7.0] — 2026-05-04
+### Added
+- `agentsec serve <output-dir> [--port 8080] [--host 127.0.0.1]` — local Flask web
+  dashboard for browsing `agentsec evaluate` / `agentsec multi-evaluate` artifacts.
+  Auto-detects single-target vs multi-target layout. Features: score card (grade,
+  combined/remote/server scores, verdict coverage, error rate), filterable findings
+  table (severity checkboxes + check-name text filter), Markdown report rendering.
+- New module `src/agentsec/serve/` with `reader.py` (pure I/O helpers) and `app.py`
+  (Flask routes + Jinja2/Bootstrap 5 templates).
+- New dependencies: `flask>=3.0`, `markdown>=3.6`.
+---
+## v0.6.0 (2026-05-04)
+### Added
+- `agentsec multi-evaluate --config multi-target.yaml` — evaluate multiple
+  OpenClaw instances in parallel via `ThreadPoolExecutor`. Each target produces
+  the standard six artifacts under `output_base/<name>/`; a cross-target
+  `summary.md` (score table + per-target links + error details) is written last.
+- `MultiTargetConfig` + `TargetEntry` Pydantic models in `config.py`
+  (`extra="forbid"` enforced, duplicate target names rejected).
+- `TargetResult` dataclass and `render_multi_summary` renderer in
+  `src/agentsec/reports/multi_summary.py`.
+---
+## [0.5.2] — 2026-05-04
+### Added
+- `PluginStaticCheck` now scans installed skill Python source files for dangerous API
+  calls via evaluator-side `ast` analysis. Five categories: `code_execution`
+  (subprocess/os), `dynamic_eval` (eval/exec), `network_exfiltration`
+  (requests/httpx/urllib), `raw_socket` (socket), `env_access` (os.environ). One
+  `high`-severity finding per `(skill_file, category)`. IOC fingerprint matching
+  (`critical`) is unaffected. Requires `skills_dir` in `PlatformProfile.paths`.
+- `PlatformProfile`: added `skills_dir` key to `LINUX` (`~/.openclaw/skills/`) and
+  `MACOS` (`~/Library/Application Support/openclaw/skills/`).
+---
+## [0.5.1] — 2026-05-04
+### Added
+- `ExposureScanCheck` now probes open ports for unauthenticated WebSocket endpoints.
+  A bare WS upgrade at `/ws`, `/v1/ws`, `/` (configurable via `ws_paths`) that returns
+  `101 Switching Protocols` produces a `high`-severity finding. The probe can be disabled
+  with `ws_probe_enabled=False`. No findings are generated for auth-gated endpoints.
+---
+## [0.5.0] — 2026-05-04
+### Added
+- Cross-channel correlated attack infrastructure: `Turn.channel` (`http`|`ws`),
+  `TestCase.ws_concurrent_observe` / `ws_listen_duration_s`, `WebSocketAgentAdapter`
+  (real + mock), runner alternating and concurrent-observe modes via `asyncio.gather`,
+  `agentsec run --ws-url` CLI option. Sub-project A of Phase 3.
+- `test_suites/openclaw/11_cross_channel.yaml`: 6 cross-channel cases
+  (cc-alt-01–03: alternating HTTP↔WS attacks; cc-obs-01–03: concurrent WS
+  observation alongside HTTP attacks). Covers CVE-2026-32915, CVE-2026-32918,
+  and OWASP Agentic AI A5.
+- `test_suites/openclaw/10_agent_chain_privesc.yaml`: 6-case agent-chain
+  privilege escalation suite covering three attack vectors — unauthorized
+  delegation (oc-chain-01/02), sandbox escape via CVE-2026-32915
+  (oc-chain-03/04), and session side-channel via CVE-2026-32918
+  (oc-chain-05/06). Closes Phase 4.
+---
+## [0.4.0] — 2026-04-30
+### Added
+- `agentsec diff <baseline-dir> <current-dir> [--output PATH]` — Markdown
+  delta report between two `agentsec evaluate` runs. Compares
+  `findings.json` (fingerprint primary + (check, title) drift clustering)
+  and `combined-report.md` meta (score / grade / coverage). Regression
+  flagged on absolute deltas (combined_score Δ>5, grade letter drop,
+  verdict_coverage<0.7, error_rate>0.1). Spec:
+  `docs/superpowers/specs/2026-04-30-agentsec-diff-design.md`.
+- Reference sample diff at `docs/samples/diff-2026-04-30-self/diff.md`
+  (sample-vs-itself; useful as a renderer-output reference).
+---
+## [0.3.0] — 2026-04-29
+### Added
+- `agentsec ioc-update` CLI: pulls CVE / threat-intel from NVD, CISA KEV,
+  and GitHub Security Advisories; produces a propose-only artifact set
+  (`proposed-threat_intel.yaml`, `report.md`, `audit-log.jsonl`) without
+  mutating the repo. See `docs/superpowers/specs/2026-04-28-ioc-update-design.md`.
+- `src/agentsec/audit/ioc/watchlist.yaml`: operator-curated vendor list
+  (3 entries: openclaw, anthropic-claude, ms-agent).
+- `version_patch`: KEV-listed CVEs (`kev: true` in threat_intel.yaml)
+  now produce `severity: critical` regardless of CVSS, with
+  `evidence.kev = True` for traceability.
+### Changed
+- `scripts/build_cve_db.py` reads `watchlist.yaml` to determine which
+  `ti_prefix` values enter `cve_database.json` (was: hardcoded
+  `TI-OPENCLAW-CVE-`). Default behavior unchanged — only OpenClaw has
+  `cve_db_include: true`, so `cve_database.json` is byte-identical to
+  v0.2.0.
+### Migration notes
+- The first `ioc-update`-merged PR reordered the existing 18
+  `threat_intel.yaml` entries into dictionary order by `id`. This is a
+  one-off cosmetic change; subsequent runs are stable.
+- Pre-v0.3.0 `threat_intel.yaml` entries did not carry `kev`. After the
+  first `ioc-update` run propagated KEV flags, `agentsec evaluate` runs may
+  emit `critical` findings where they previously emitted `high` for the same
+  CVE.
+---
+## [0.2.0] — 2026-04-28
+### Stage G — Integration regression + docs sync (2026-04-28)
+- Added `tests/integration/test_openclaw_mock.py`: an end-to-end run of
+  `agentsec evaluate` against an in-process OpenClaw mock
+  (`httpx.MockTransport` injected into the real `OpenClawGatewayAdapter`)
+  plus a stubbed `SSHExecutor`. Asserts on dedupe, the six artifacts, the
+  `Grade:` line in `combined-report.md`, and that `LOW_COVERAGE` is **not**
+  emitted when every case is scored.
+- Added `docs/samples/report-2026-04-28/`: committed copies of the six
+  `agentsec evaluate` artifacts (`remote-report.md`,
+  `server-audit-report.md`, `combined-report.md`, `findings.json`,
+  `threat-intel-snapshot.yaml`, `audit-log.jsonl`) plus a `README.md` with
+  regeneration instructions. A drift-guard test asserts the committed
+  copies stay byte-equal to the pipeline output (refresh with
+  `AGENTSEC_REFRESH_SAMPLE=1`). Markdown sample reports carry
+  `**Date**: <stripped>` so the drift guard is stable across wallclock-
+  minute boundaries.
+- `ROADMAP.md`: ticked Stage G; Phase 3 retitled to "OpenClaw v1.x follow-ups"
+  (IOC auto-feed, cross-channel attacks, Web viewer); ticked the two
+  Phase 5 items Stage F shipped; renumbered the existing Phase 3/4/5 to
+  4/5/6.
+- `README.md`: updated Phase numbering to match the renumbered ROADMAP;
+  flipped Phase 2 from 🚧 to ✅.
+- `CLAUDE.md`: added sections for `agentsec evaluate` orchestrator, the
+  `agentsec.scoring` module, and the three-renderer Markdown split.
+- `docs/guide/writing-test-cases.md`: added `isolation:` and
+  `threat_refs:` field references with YAML examples using only real
+  `TI-*` IDs from `threat_intel.yaml`.
+- `docs/guide/writing-adapters.md`: tightened the registry note to cover
+  all three helpers (`register` / `get` / `available`).
+- `src/agentsec/reports/__init__.py`: re-exports the three Markdown
+  renderers + two JSON writers that Stage F added; previously only the
+  deprecated shim was reachable via `from agentsec.reports import ...`.
+- `scripts/check_threat_refs.py`: scoped the regex to skip namespace-glob
+  references like `TI-OPENCLAW-CVE-*` (writing-test-cases.md introduces
+  these to denote prefixes; possessive `++` + negative lookahead prevents
+  Python regex backtracking from emitting truncated false matches).
+- `pyproject.toml`: bumped version `0.1.0` → `0.2.0`.
+**Audit findings closed by Stage G:** OE-AUD-014 (docs sync executed).
+### Stage F — `agentsec evaluate` + scoring + reports (2026-04-28)
+- New CLI subcommand: `agentsec evaluate --config openclaw-target.yaml`. Orchestrates the remote-test pipeline + server-audit pipeline + spec §9 scorer and writes the six artifacts from spec §9.3 (`remote-report.md`, `server-audit-report.md`, `combined-report.md`, `findings.json`, `threat-intel-snapshot.yaml`, `audit-log.jsonl`). Either half is optional — `cfg.remote = None` skips the remote suite, `cfg.server_audit = None` skips the SSH connection.
+- Added `OpenClawTargetConfig` Pydantic model (`src/agentsec/config.py`) for `openclaw-target.yaml`. `extra="forbid"` at every nesting level rejects typos at load time. Both `remote` and `server_audit` are optional; at least one must be set.
+- Added `agentsec.scoring` module implementing spec §9.1–9.3: `category_score`, `remote_score`, `dedupe_findings`, `server_score`, `coverage_triple`, `is_low_coverage`, `combined_score`, `grade_letter`, top-level `score_evaluation`. Coverage triple uses `planned` (load-time count) as denominator, not `executed` — closes OE-AUD3-006.
+- `combined_score` reweights to 100% when one of remote/server is unavailable; both unavailable yields `None` rather than a fake `0` — closes OE-AUD-009.
+- Identical findings (matching `Finding.fingerprint`) collapse before scoring and before writing `findings.json` — closes OE-AUD2-011.
+- `verdict_coverage < 0.7` or `error_rate > 0.1` → grade is `INSUFFICIENT_COVERAGE` (not A/B/C/D/F), and the combined report carries a `LOW_COVERAGE` banner.
+- Split `agentsec/reports/markdown.py` into three renderers: `render_remote_report`, `render_server_audit_report` (lifted from inline `cli.py:server_audit` builder), `render_combined_report`. The legacy `render_markdown_report(name, results)` is kept as a deprecated shim for backward compatibility (scheduled for removal in v0.3).
+- Added `agentsec/reports/json_report.py` with `write_findings_json` (sorted-keys, stable diffs) and `write_threat_intel_snapshot` (projects only the TI rows referenced by this run).
+- Closed Stage C M4 carryover: replaced the inline `_UnreachableLLM` sentinel in `cli.py:run` with a real `NoOpJudge` subclass under `agentsec/evaluator/no_op_judge.py`. The new judge subclasses the `Judge` ABC and raises `NoOpJudgeInvoked` (typed `RuntimeError` subclass) if dispatched to.
+- Pydantic models added by Stage F (`CoverageTriple`, `CombinedScore`, `EvaluationScore`) are `frozen=True` so report renderers receive immutable snapshots.
+**Audit findings closed by Stage F:** OE-AUD-009 (combined-score reweight when one half is None), OE-AUD2-011 (server-side fingerprint dedupe in scorer + zero-denom category handling + severity-factor aggregation), OE-AUD3-006 (coverage denom = `planned`).
+### Stage E — server-audit complete (2026-04-27)
+- Added `PlatformProfile` (`src/agentsec/audit/platform_profile.py`) with `LINUX` and `MACOS` instances; the orchestrator now detects platform via `uname -s` and feeds the profile into every `CheckContext`. Stage D checks were refactored to consume `ctx.profile.paths` — no hard-coded `/var/log/openclaw/` or `~/.openclaw/` strings remain in `src/agentsec/audit/checks/`.
+- Implemented remote-`$HOME` resolution at orchestrator startup (ADR-0004): `resolve_remote_home(user, profile)` + `materialize_paths(profile, remote_home)` rewrite `~/...` profile paths to absolute paths before any check runs, and `PathMatcher` accepts a `remote_home` so `allowed_paths` matching uses the same home as the remote shell. Closes a silent-failure mode where `find '~/.openclaw/'` returned exit=1 because `shlex.quote` blocked tilde expansion.
+- Replaced `tunnel.py` Stage D stub with a real paramiko `direct-tcpip` local-forward; `--skip-active-test` removed from the `agentsec server-audit` CLI; `active_test` always registers and skips cleanly when the tunnel fails.
+- Six new checks: `exposure_scan`, `filesystem`, `process_forensics`, `credential_audit`, `plugin_static`, `log_review`. Total audit coverage now 10 checks (Stage D's 4 + Stage E's 6).
+- IOC layer (`src/agentsec/audit/ioc/`):
+  - `cve_database.json` — projection of `TI-OPENCLAW-CVE-*` entries from `threat_intel.yaml`, generated by `scripts/build_cve_db.py`. CI gates on `--check` to keep the file in sync.
+  - `clawhavoc_skills.json` — 5 hand-seeded ClawHavoc-family skill fingerprints; consumed by `plugin_static`.
+  - `attack_signatures.yaml` — 5 hand-seeded grep patterns; consumed by `log_review`. Patterns must not contain shell metachars (the policy rejects the resulting grep argv otherwise).
+- `version_patch` Findings now include `cve_db_url` / `cve_db_confidence` from `cve_database.json` for traceability.
+- macOS CI job added (`test-macos` in `.github/workflows/ci.yml`); Linux + macOS matrix both green.
+- Hygiene: orchestrator now records exception type in `report.errors` (e.g. `RuntimeError: ...`); CLI report adds `<!-- meta:errors count:N -->` header; `_REQUIRED_PATHS` guard refactored from per-check inlined for-loops into `Check.check_required_paths()` base method.
+**Deferred to vNext (intentional v0.2.0 scope cut):**
+- AST-grep danger-pattern detection in `plugin_static`.
+- WebSocket endpoint probing in `exposure_scan`.
+- `agentsec ioc-update` (auto-feed pull) — committed static files only in v0.2.0.
+- Non-standard remote home directories — `resolve_remote_home` uses convention table (Linux: `/home/<user>`/`/root`; macOS: `/Users/<user>`/`/var/root`); deployments using `nsswitch`/LDAP/`ChrootDirectory` need vNext's `getent passwd` policy entry.
+### Added (Stage D)
+- `src/agentsec/audit/` server-audit package: `SSHExecutor` (paramiko + policy enforcement), `CommandPolicy` + `path_matcher` + `metachar_guard` + `args_schema` (spec §6.1), `Authorization` with seven-step HMAC-SHA256 validate (spec §6.4), canonical `Finding` with stable fingerprint, `SnapshotCollector` (module-only; runner wiring deferred to Stage E), `Redactor`-fronted audit-log.jsonl, `tunnel.py` stub (real local-forward in Stage E), `server_audit` orchestrator with per-pass fingerprint dedupe.
+- Four P0 audit checks: `native_audit` (consumes `openclaw security audit --json`), `config_audit` (against `config_baseline.yaml`), `version_patch` (PEP-440 specifier match against `threat_intel.yaml`), `active_test` (canary suite through SSH local-forward).
+- New CLI subcommand: `agentsec server-audit` — host / user / key / consent-file / output flags, AUTHORIZATION.txt gate, low-assurance banner.
+### Fixed (Stage D post-review)
+Found by ultrareview on PR #4 (10 findings, all addressed before merge).
+- **bug_009** (`pyproject.toml`): declared `packaging>=22` as a runtime dep. `version_patch` imported it but it was only present transitively via pytest in `[dev]`, so wheel installs hard-failed at `agentsec --help` with ImportError.
+- **bug_001** (`authorization.py` / `cli.py`): path-normalize `report_output_path_prefix` via `Path()` equality so the documented `--output ./report-2026-04-27/` invocation matches the example AUTHORIZATION.txt prefix on first try (Typer parses `--output` as `Path()`, which strips both `./` and the trailing slash; the previous byte-exact compare rejected the documented form).
+- **bug_007** (`authorization.py`): reject naive datetimes in `Authorization.load()` with an explicit `AuthorizationError` rather than letting `validate()` raise a bare `TypeError` that the CLI doesn't catch — preserves the seven-step gate's "predictable AuthorizationError" contract.
+- **bug_006** (`native_audit.py`): pipe `title`, `description`, `remediation` through `ctx.redactor.redact()` symmetrically with `evidence`. Vulnerability scanners typically embed offending values in human-readable text; those fields land verbatim in `server-audit-report.md`, so leaving them un-redacted defeated the global-Redactor guarantee.
+- **bug_011** (`ssh.py`): replace `stdout.read()` / `stderr.read()` with a chunked `_read_capped` loop bounded at `max_output_bytes`, applied symmetrically to stderr (which previously had no cap at all). Bounds in-memory buffering against compromised targets.
+- **bug_002** (`args_schema.py` / `ssh_policy.yaml`): add `value_taking` to the flags role and consume the next argv token after each value-taking flag. `head -n 5 /file`, `tail -n 100 /file`, `stat -c %y /file` — all dormant in Stage D's four checks but Stage E's `log_review` / `exposure_scan` would have hit it on first use.
+- **bug_004** (`ssh_policy.yaml`): drop dead `enforce_max_results` / `forbidden_subcommands` keys (declared but never read). Mirror in spec §6.1.
+- **bug_003** (`ssh.py`): omit `argv` from audit-log.jsonl on success (it carries the same parameter content the SHA-256 design was meant to keep out of the log); keep `argv` on rejection lines for triage. Update docstring + CLAUDE.md.
+- **bug_010** (`config_baseline.yaml`): correct inverted header comment ("FORBIDDEN value" → "REQUIRED value"). The check emits a Finding when `actual != must_equal`, so must_equal describes the secure state.
+- **bug_012** (`active_test.py`): move `active-test-canary.yaml` into the package at `src/agentsec/audit/checks/data/` so wheel installs ship it; replace the `parents[4]` walk with `Path(__file__).parent / "data" / …`. Verified end-to-end against a built wheel.
+### Added (Stage C)
+- Discriminated `Assertion` union under `src/agentsec/evaluator/assertions/` covering the seven types from spec §5.4 (`response_not_contains_pattern`, `response_status_in`, `json_path_equals`, `outbound_request_not_to`, `file_not_created`, `config_key_not_changed`, `tool_event_not_invoked`).
+- `DeterministicJudge` and `HybridJudge` implementing spec §5.1 tri-state aggregation (`hard_fail` / `hard_pass` / `inconclusive`); `HybridJudge` honors `always_consult_llm` without letting the LLM flip a hard-pass verdict (OE-AUD2-007).
+- `JudgeRouter` dispatching per-test on `TestCase.judge_type`; CLI builds a router so deterministic-only suites no longer require `AGENTSEC_JUDGE_API_KEY`.
+- Seven P0 category suites under `test_suites/openclaw/` (≥5 cases each): direct PI, indirect PI (with HTML/text fixtures), memory poisoning, tool abuse, consent bypass, auth/authz bypass, SSRF.
+- Stage C end-to-end smoke `tests/integration/test_openclaw_stage_c_smoke.py` running every category suite through the full JudgeRouter pipeline against a mocked OpenClaw.
+### Changed (Stage C)
+- `TestCase.assertions` is now `list[Assertion] | None` (was `list[dict[str, Any]]`); unknown `type` strings and extra fields are rejected at load time.
+- `runner.py` distinguishes adapter-side errors (`error="adapter: ..."`) from judge-side errors (`error="judge: ..."`) and fixture/runtime errors (`error="runtime: ..."`) so reports surface failure causes more clearly.
+- New runtime dependency: `jsonpath-ng>=1.6` for `json_path_equals`.
+### Fixed (Stage C post-review)
+- **C1**: `AgentResponse.tool_events` (new field) now carries adapter-extracted tool invocations through to `Observation.tool_events`. `OpenClawGatewayAdapter` extracts OpenAI-style `choices[0].message.tool_calls[*]` into `[{"tool": fn.name, "args": fn.arguments}, ...]`. Pre-fix, 4 cases in `04_tool_abuse.yaml` and 2 in `05_consent_bypass.yaml` silently `hard_pass` against an always-empty list. `HttpAgentAdapter._parse_tool_events` is the override hook for non-OpenAI adapters.
+- **I1**: Rewrote `06_auth_authz_bypass.yaml` from `judge_type=deterministic` + `response_status_in: [401,403]` (incorrect: `/v1/chat/completions` always returns 200) to `judge_type=hybrid` + `response_not_contains_pattern` per case + `judge_hint`, so the deterministic side flags content-shaped leaks and HybridJudge consults the LLM to disambiguate refusals that quote the prompt.
+- **I2**: Tightened `tests/integration/test_openclaw_stage_c_smoke.py` to additionally assert `verdict.judge_type == case.judge_type` and that `verdict.passed` / `verdict.reasoning` are well-formed; added a parametrized regression that runs `04_tool_abuse.yaml` and `05_consent_bypass.yaml` through a tool-emitting stub adapter to prove `tool_event_not_invoked` actually fires now (prevents silent C1-style breakage).
+- **I3**: Documented the `oc-ssrf-04` redirect-chain operational prereq (controlled-domain → loopback resolver mapping) inline in `07_ssrf.yaml` and in `docs/guide/writing-test-cases.md` so the suite operator knows the fixture is a no-op without name resolution / hosts override.
+### Added (Stage 0/A/B)
+- threat-intel source table (`src/agentsec/audit/ioc/threat_intel.yaml`) and `scripts/check_threat_refs.py` CI gate per spec §2.
+- `.env.example` documents `OPENCLAW_API_KEY` and `AGENTSEC_AUTH_SIGNING_KEY`.
+- Adapter registry (`agentsec.adapters.registry`) with `register/get/available`; CLI `--adapter` (default `http`) resolves through it.
+- `OpenClawGatewayAdapter` posting to `/v1/chat/completions` with OpenClaw body shape (spec §5.2).
+- `Fixture` model + `ServeVia` literal on `TestCase`.
+- `NetworkObserverConfig` + `FixtureOnlyObserver`; CLI `--observer-mode`, `--controlled-domain`, `--same-host`. Loader rejects `outbound_request_not_to.target` outside `controlled_domains` in `fixture_only` mode (spec §5.4 / OE-AUD2-006).
+- Three v1 fixture topologies in `agentsec.observability.fixture_server`: `same-host-loopback`, `reachable-url`, `ssh-port-forward` (spec §7.2; `target-local-http` removed per OE-AUD3-002).
+- `FixtureRuntime` + `{{fixture_url}}` substitution + `serve_via: auto` resolution.
+- Multi-turn replay loop in `runner.py` honoring `Turn.judge_after` / `sleep_ms`; per-turn observer snapshots in `Observation.outbound_requests` / `fixture_events`.
+- 5-case smoke suite for OpenClaw (`test_suites/openclaw/_stage_b_smoke.yaml`) plus end-to-end integration test against a mocked `/v1/chat/completions`.
+- CLI `--token-env` injects `Authorization: Bearer …` and never echoes the token to stdout.
+### Changed (Stage 0/A/B)
+- `HttpAgentAdapter` now sends adapter-level headers per request (instead of via the `httpx.AsyncClient` constructor), so test-injected client factories see the same headers without touching httpx internals.
+- `evaluator/__init__.py` lazy-loads `LLMJudge` via `__getattr__` to break the `tests.models` ↔ `evaluator.judge` import cycle that was blocking new `observability` modules from being imported in isolation.
+- `load_test_suite()` accepts an optional `observer_config` (backward compatible).
+### Dependencies (Stage 0/A/B)
+- Added: `aiohttp>=3.9` (fixture HTTP server), `paramiko>=3.4` (ssh-port-forward fixture + Stage D SSH).
+---
+## [0.1.0] — 2026-04-24
+### Added
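+- Initial project scaffold: `AgentAdapter`, `LLMJudge`, `runner`, Markdown report rendering, CLI
+- Built-in test cases: prompt injection (4), data leakage (4), tool abuse (3)
+- `HttpAgentAdapter`, a generic HTTP adapter that supports overriding `_build_request` / `_parse_response`
+### Fixed
+- Added the missing `await` in `HttpAgentAdapter.send`, fixing event-loop blocking under concurrent load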
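The 0.9.1 environment-driven judge selection can be sketched as a pure function over an environment mapping. This is an illustration, not `build_default_judge_from_env()` itself, and it rests on several assumptions. It returns settings rather than constructing `LLMJudge` / `OpenAICompatibleJudge`. The provider spellings `"anthropic"` / `"openai"` are assumed. An explicit `AGENTSEC_LLM_PROVIDER` is assumed to win over the base-URL inference. The `AGENTSEC_JUDGE_API_KEY` fallback is borrowed from the `agentsec run --api-key` chain. `JudgeSettings` and `resolve_judge_settings` are hypothetical names. A YAML `judge:` section, which takes precedence, would be checked by the caller before reaching this function.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

SUPPORTED_PROVIDERS = ("anthropic", "openai")  # assumed spellings


@dataclass(frozen=True)
class JudgeSettings:
    """Hypothetical container for the five AGENTSEC_LLM_* inputs."""
    provider: str
    api_key: Optional[str]
    base_url: Optional[str]
    model: Optional[str]
    timeout: Optional[float]


def _get(env: Mapping[str, str], name: str) -> Optional[str]:
    """Treat unset and empty/blank variables the same way."""
    value = env.get(name, "").strip()
    return value or None


def resolve_judge_settings(env: Mapping[str, str]) -> JudgeSettings:
    base_url = _get(env, "AGENTSEC_LLM_BASE_URL")
    provider = _get(env, "AGENTSEC_LLM_PROVIDER")
    if provider is None:
        # A custom base URL implies an OpenAI-compatible /chat/completions
        # endpoint; otherwise fall back to the Anthropic default.
        provider = "openai" if base_url else "anthropic"
    provider = provider.lower()
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(f"unsupported AGENTSEC_LLM_PROVIDER: {provider!r}")
    # New variable first, legacy AGENTSEC_JUDGE_API_KEY as fallback.
    api_key = _get(env, "AGENTSEC_LLM_API_KEY") or _get(env, "AGENTSEC_JUDGE_API_KEY")
    timeout_raw = _get(env, "AGENTSEC_LLM_TIMEOUT")
    return JudgeSettings(
        provider=provider,
        api_key=api_key,
        base_url=base_url,
        model=_get(env, "AGENTSEC_LLM_MODEL"),
        timeout=float(timeout_raw) if timeout_raw else None,
    )


if __name__ == "__main__":
    import os

    print(resolve_judge_settings(os.environ))
```

Under these assumptions, pointing `AGENTSEC_LLM_BASE_URL` at a local Ollama or vLLM server is enough to select the OpenAI-compatible path. Leaving all five variables unset yields the Anthropic default with no key, which the caller would then have to reject or fill in.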

agentsec_eval-0.9.1/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2026 raoliaoyuan
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.