@kiwidata/grimoire 0.1.4 → 0.1.6 (npm)


Files changed (112)

package/.claude-plugin/plugin.json +2 -2
package/AGENTS.md +21 -25
package/LICENSE +36 -0
package/README.md +86 -61
package/dist/cli/index.js +2 -43
package/dist/cli/index.js.map +1 -1
package/dist/cli/program.d.ts +4 -0
package/dist/cli/program.d.ts.map +1 -0
package/dist/cli/program.js +45 -0
package/dist/cli/program.js.map +1 -0
package/dist/commands/configure.d.ts.map +1 -1
package/dist/commands/configure.js +2 -1
package/dist/commands/configure.js.map +1 -1
package/dist/core/check.d.ts.map +1 -1
package/dist/core/check.js +47 -11
package/dist/core/check.js.map +1 -1
package/dist/core/ci.d.ts.map +1 -1
package/dist/core/ci.js +2 -2
package/dist/core/ci.js.map +1 -1
package/dist/core/doc-style.d.ts.map +1 -1
package/dist/core/doc-style.js +76 -0
package/dist/core/doc-style.js.map +1 -1
package/dist/core/docs.d.ts.map +1 -1
package/dist/core/docs.js +93 -74
package/dist/core/docs.js.map +1 -1
package/dist/core/health.d.ts +6 -0
package/dist/core/health.d.ts.map +1 -1
package/dist/core/health.js +78 -21
package/dist/core/health.js.map +1 -1
package/dist/core/hooks.d.ts.map +1 -1
package/dist/core/hooks.js +17 -19
package/dist/core/hooks.js.map +1 -1
package/dist/core/list.d.ts.map +1 -1
package/dist/core/list.js +4 -7
package/dist/core/list.js.map +1 -1
package/dist/core/pr.d.ts.map +1 -1
package/dist/core/pr.js +0 -8
package/dist/core/pr.js.map +1 -1
package/dist/core/risk-register.d.ts +17 -0
package/dist/core/risk-register.d.ts.map +1 -0
package/dist/core/risk-register.js +73 -0
package/dist/core/risk-register.js.map +1 -0
package/dist/core/shared-setup.d.ts.map +1 -1
package/dist/core/shared-setup.js +5 -4
package/dist/core/shared-setup.js.map +1 -1
package/dist/core/status.d.ts.map +1 -1
package/dist/core/status.js +3 -3
package/dist/core/status.js.map +1 -1
package/dist/core/trace.d.ts.map +1 -1
package/dist/core/trace.js +37 -35
package/dist/core/trace.js.map +1 -1
package/dist/core/update.d.ts.map +1 -1
package/dist/core/update.js +1 -10
package/dist/core/update.js.map +1 -1
package/dist/index.d.ts +0 -3
package/dist/index.d.ts.map +1 -1
package/dist/index.js +0 -3
package/dist/index.js.map +1 -1
package/package.json +19 -2
package/skills/grimoire-apply/SKILL.md +40 -37
package/skills/grimoire-audit/SKILL.md +4 -1
package/skills/grimoire-bug/SKILL.md +7 -3
package/skills/grimoire-commit/SKILL.md +1 -1
package/skills/grimoire-design/SKILL.md +3 -3
package/skills/grimoire-discover/SKILL.md +77 -110
package/skills/grimoire-draft/SKILL.md +55 -18
package/skills/grimoire-plan/SKILL.md +58 -52
package/skills/grimoire-pr/SKILL.md +7 -8
package/skills/grimoire-pr-review/SKILL.md +2 -1
package/skills/grimoire-refactor/SKILL.md +3 -3
package/skills/grimoire-review/SKILL.md +13 -1
package/skills/grimoire-verify/SKILL.md +19 -7
package/skills/grimoire-vuln-remediate/SKILL.md +107 -0
package/skills/grimoire-vuln-triage/SKILL.md +109 -0
package/skills/references/artifact-map.md +44 -0
package/skills/references/code-quality.md +41 -9
package/skills/references/container-scan-triage.md +102 -0
package/skills/references/dependency-vuln-triage.md +236 -0
package/skills/references/principles.md +82 -0
package/skills/references/refactor-scan-categories.md +2 -2
package/skills/references/review-personas.md +13 -6
package/skills/references/test-baseline.md +55 -0
package/skills/references/testing-contracts.md +1 -1
package/templates/accepted-risks.yml +47 -0
package/templates/constraints.md +25 -0
package/dist/commands/archive.d.ts +0 -3
package/dist/commands/archive.d.ts.map +0 -1
package/dist/commands/archive.js +0 -22
package/dist/commands/archive.js.map +0 -1
package/dist/commands/log.d.ts +0 -3
package/dist/commands/log.d.ts.map +0 -1
package/dist/commands/log.js +0 -15
package/dist/commands/log.js.map +0 -1
package/dist/commands/map.d.ts +0 -3
package/dist/commands/map.d.ts.map +0 -1
package/dist/commands/map.js +0 -16
package/dist/commands/map.js.map +0 -1
package/dist/core/archive.d.ts +0 -9
package/dist/core/archive.d.ts.map +0 -1
package/dist/core/archive.js +0 -81
package/dist/core/archive.js.map +0 -1
package/dist/core/log.d.ts +0 -8
package/dist/core/log.d.ts.map +0 -1
package/dist/core/log.js +0 -140
package/dist/core/log.js.map +0 -1
package/dist/core/map.d.ts +0 -22
package/dist/core/map.d.ts.map +0 -1
package/dist/core/map.js +0 -365
package/dist/core/map.js.map +0 -1
package/templates/dupignore +0 -93
package/templates/mapignore +0 -58
package/templates/mapkeys +0 -65

package/skills/grimoire-vuln-triage/SKILL.md ADDED

@@ -0,0 +1,109 @@
+---
+name: grimoire-vuln-triage
+description: Triage vulnerability scans from any source — npm audit, pip-audit, osv-scanner, Trivy, Grype, Snyk, Dependabot, SARIF, or a report a teammate forwards — against our actual deployment model and recorded mitigating controls. Reconciles stale scans against the current tree, then decides the one thing that matters per finding — drop-everything hotfix vs next release cycle — and suppresses non-actionable noise with VEX verdicts. Use when a scanner produces a flood of CVEs and you need to know which actually matter here.
+compatibility: Designed for Claude Code (or similar products)
+metadata:
+  author: kiwi-data
+  version: "0.2"
+---
+# grimoire-vuln-triage
+Vulnerability scanners flag every CVE that *exists* in your tree or image, ranked by CVSS base score — which knows nothing about your deployment, and nothing about whether you already upgraded past it. Most findings are not exploitable as you actually run the code. This skill is **scanner-agnostic**: it normalizes whatever it's handed (npm audit, pip-audit, osv-scanner, Trivy, Grype, Snyk, Dependabot, SARIF, or a freeform CSV/markdown report) into one canonical model, reconciles it against the current tree, and triages each surviving advisory against **our** deployment and controls to answer the only question that drives action:
+> **Drop everything and hotfix now, or let it ride the normal testing / release cycle?**
+It produces VEX verdicts (`fixed` / `not_affected` / `affected` / `under_investigation`) so non-actionable findings are dismissed with an auditor-defensible justification, and an urgency (`hotfix-now` / `next-release` / `accept`) for the ones that survive. Covers application dependencies, OS packages from container scans, and runtime/build tooling alike.
+This skill **classifies**. Filing the dev work (tickets in the configured bug reporting system) is the job of `grimoire-vuln-remediate`, which consumes this triage.
+## Triggers
+- A scanner produces a wall of findings: "npm audit found 40 vulnerabilities", "pip-audit is screaming", "trivy flagged 200 CVEs in the image", "triage these CVEs"
+- "Is this CVE actually a problem for us?", "do we need to hotfix this or can it wait?"
+- "which of these vulnerabilities actually matter", "filter out the noise from the scan"
+- A teammate forwards a scan report (any tool, any format) and asks what's real
+- Loose match: "vuln triage", "CVE triage", "security scan", "trivy/grype/snyk results", "audit results", "image scan"
+## Routing
+- A *reported* security bug (not a scanner finding) → `grimoire-bug-triage` (it has a security classification path)
+- A dependency *add/upgrade* review (lockfile, floating ranges, supply chain) → review-time; see `../references/security-compliance.md` § Supply Chain Defense, enforced by `grimoire-review` / `grimoire-precommit-review`
+- After this triage, to file dev work → `grimoire-vuln-remediate`
+- Persistent IaC/container misconfig (root user, no limits, `:latest` base) → `grimoire-draft`/infra, not an app hotfix
+- A control gap surfaced here (a mitigation assumed but never recorded) → `grimoire-draft` to write the MADR
+## Prerequisites
+- A scan to triage: output of `config.tools.dep_audit` / `config.tools.security`, a saved scan file (e.g. `reports/security/...`), or pasted text. Prefer machine-readable (`--json` / SARIF) over a human table.
+- The repo's current lockfile/manifest (or deployed image tag) available for reconciliation.
+- Network access for KEV + EPSS enrichment (degrades gracefully to CVSS-only if offline).
+- Best results with `codebase-memory-mcp` for reachability; falls back to grep.
+## Workflow
+Read `../references/dependency-vuln-triage.md` now — it has the canonical model, the format adapters, the reconciliation rule, the enrichment feeds, the type-aware reachability rules, the VEX statuses, the urgency tree, and the record format. Follow it. The steps below are the spine.
+### 1. Normalize the scan into the canonical model (any scanner)
+Identify the source and map each finding to the canonical advisory (reference § Step 1): `id`, `aliases`, `cve`, `component`, `component_type` (`library`/`os-package`/`container`/`iac`/`runtime`), `installed_version`, `fixed_version`, `severity`/`cvss`, `target`, `scanner`. Use the format adapter for the tool you were handed — **never assume one tool's field names apply to another** (npm's `isDirect`/`via` ≠ pip-audit's `aliases`/`fix_versions` ≠ Trivy's `Results[].Vulnerabilities[]` with `Class`/`Type`/`Status`). For an unknown/freeform report, extract the minimum (`id`, `component`, version, fixed version) and mark the rest `unknown`. If you can't parse it, ask for `--json`/SARIF rather than guessing.
+**Dedup + non-CVE results.** Collapse the same CVE listed across multiple packages (Trivy does this constantly) to unique `(id, component_type)`, keeping the package list — report `raw_findings → unique_advisories` so the noise reduction is visible. Don't discard `Class: secret` / `Class: config` results — they aren't package CVEs; route them (secrets-in-image, Dockerfile/k8s misconfig) to infra/`grimoire-draft`, not triage.
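The normalize-then-dedup step above can be sketched as follows. This is a minimal illustration, not the package's implementation: the `Advisory` class, its defaults, and the `dedup` helper are hypothetical names, and only a subset of the canonical fields listed in the text is modelled.

```python
from dataclasses import dataclass, field, replace


@dataclass
class Advisory:
    """Hypothetical, reduced form of the canonical advisory described above."""
    id: str
    component: str
    component_type: str = "unknown"
    installed_version: str = "unknown"
    fixed_version: str = "unknown"
    scanner: str = "unknown"
    packages: list[str] = field(default_factory=list)


def dedup(findings: list[Advisory]) -> list[Advisory]:
    """Collapse findings to unique (id, component_type), keeping the package list."""
    unique: dict[tuple[str, str], Advisory] = {}
    for finding in findings:
        key = (finding.id, finding.component_type)
        if key not in unique:
            # Copy the first finding so the caller's objects are not mutated.
            unique[key] = replace(finding, packages=[])
        if finding.component not in unique[key].packages:
            unique[key].packages.append(finding.component)
    return list(unique.values())


raw = [
    Advisory("CVE-2024-0001", "libssl3", "os-package"),
    Advisory("CVE-2024-0001", "openssl", "os-package"),
    Advisory("CVE-2024-0002", "requests", "library"),
]
advisories = dedup(raw)
print(f"{len(raw)} raw_findings -> {len(advisories)} unique_advisories")
```

The CVE ids in the sample data are placeholders. Printing both counts mirrors the `raw_findings → unique_advisories` figure the text asks to report.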
+### 2. Reconcile against the current tree FIRST (mandatory)
+Before any enrichment, compare each advisory's `installed_version` against what the repo resolves **right now** (reference § Step 2): read the live lockfile/manifest (`uv.lock`/`poetry.lock`/`package-lock.json`/`go.sum`/`Cargo.lock`/`Gemfile.lock`), or for container/OS findings check the currently deployed image tag / Dockerfile base. If the current version ≥ `fixed_version`, mark **`fixed`** and drop it before enrichment — record it under "Already fixed" as the audit trail. Honor manifest comments / prior triage that already dismiss a CVE. **Never file remediation for an advisory you haven't confirmed still exists.** On a stale scan this pass clears most of the queue.
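The "current version ≥ `fixed_version`" check can be sketched as below. This is a simplified illustration under a stated assumption: it only compares plain dotted-numeric versions and returns `unknown` for anything else, because real ecosystems (PEP 440, semver pre-releases, dpkg epochs) each need their own comparison rules. The function names are hypothetical.

```python
import re
from typing import Optional


def parse_numeric(version: str) -> Optional[list[int]]:
    """Parse '1.2.3' into [1, 2, 3]; return None for any other shape."""
    if not re.fullmatch(r"\d+(\.\d+)*", version):
        return None
    return [int(part) for part in version.split(".")]


def reconcile(current_version: str, fixed_version: str) -> str:
    """Return 'fixed', 'still-present', or 'unknown' for one advisory."""
    current = parse_numeric(current_version)
    fixed = parse_numeric(fixed_version)
    if current is None or fixed is None:
        # Do not guess: pre-releases, epochs, or 'unknown' need a real parser.
        return "unknown"
    width = max(len(current), len(fixed))
    current += [0] * (width - len(current))  # treat 1.2 as 1.2.0
    fixed += [0] * (width - len(fixed))
    return "fixed" if current >= fixed else "still-present"
```

Returning `unknown` rather than guessing keeps the rule from the text intact: an advisory is only dropped as `fixed` when the comparison is actually confirmed.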
+**Honor the risk-acceptance register.** Read `.grimoire/security/accepted-risks.yml` (written by `grimoire-vuln-remediate`). An **unexpired** entry for a CVE means it was already triaged and consciously accepted → carry it as known-accepted, don't re-escalate (cite the register entry). An **expired** entry → re-triage it fresh (the acceptance lapsed). This is what stops accepted findings from re-flooding the queue every scan.
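A rough sketch of the register lookup follows. The schema of `accepted-risks.yml` is not shown in this diff, so the `cve` and `expires` keys, the ISO date format, and the treatment of the expiry day as still valid are all assumptions made for illustration. The entries are passed in as already-parsed dictionaries.

```python
from datetime import date


def register_status(cve: str, entries: list[dict], today: date) -> str:
    """Classify a CVE against hypothetical register entries."""
    for entry in entries:
        if entry.get("cve") != cve:
            continue
        expires = date.fromisoformat(entry["expires"])  # assumed ISO date string
        # Assumption: an entry is still valid on its expiry date.
        return "known-accepted" if today <= expires else "re-triage"
    return "not-in-register"


entries = [
    {"cve": "CVE-2024-0001", "expires": "2030-01-01"},
    {"cve": "CVE-2024-0002", "expires": "2020-01-01"},
]
```

With these placeholder entries, an unexpired acceptance is carried forward, a lapsed one goes back into triage, and anything absent is triaged normally.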
+### 3. Enrich the survivors — KEV then EPSS
+Per the reference: fetch the **CISA KEV** catalog once and match every `cve`/`aliases` (known-exploited = strongest hotfix signal); fetch **EPSS** for all CVE ids (batch, comma-separated) for exploit probability. Cache both in the run dir. If offline, record `kev-feed: offline` / `epss-fetched: false` and proceed on CVSS + reachability + exposure — say so. IaC/config findings skip threat-intel (no CVE).
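The matching half of the enrichment step might look like the sketch below. It works on payloads that have already been fetched and cached, so it makes no network calls. The payload shapes are assumptions: a KEV catalog with a `vulnerabilities` list carrying `cveID`, and an EPSS response with a `data` list carrying `cve` and a string `epss` score. The helper names are hypothetical.

```python
from typing import Optional


def kev_cve_ids(kev_catalog: dict) -> set[str]:
    """Collect CVE ids from a cached KEV catalog (assumed shape)."""
    return {item["cveID"] for item in kev_catalog.get("vulnerabilities", [])}


def epss_batch_param(cve_ids: list[str]) -> str:
    """Build the comma-separated id list for one batched EPSS query."""
    return ",".join(sorted(set(cve_ids)))


def epss_scores(epss_response: dict) -> dict[str, float]:
    """Map CVE id to exploit probability from a cached EPSS response (assumed shape)."""
    return {row["cve"]: float(row["epss"]) for row in epss_response.get("data", [])}


def enrich(cve: str, aliases: list[str], kev_ids: set[str],
           scores: dict[str, float]) -> dict[str, Optional[object]]:
    """Match the CVE and its aliases against KEV, and look up the EPSS score."""
    return {
        "kev": any(candidate in kev_ids for candidate in [cve, *aliases]),
        "epss": scores.get(cve),  # None when no score was fetched
    }
```

A `None` score is the natural place to record the offline case (`epss-fetched: false`) rather than inventing a number.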
+### 4. Reachability — type-aware, the cheapest big filter
+Judge reachability by `component_type` (reference § Reachability):
+- **library** — dev/test-only (infer from lockfile groups, not a flag) → `not_affected` in prod; imported at all? (`search_graph`/`search_code`); vulnerable function actually called? (`trace_path`).
+- **os-package** (container scan) — judge **two separate axes**: *reachability* (is the vulnerable code called by untrusted input? grep the consumer, not the C package name) and *removability* (how installed — explicit/transitive/base-image/builder-only — and what breaks). **Unreachable ≠ removable.** Never recommend removing a package (or "slim the base image") without tracing the install path and naming the post-change test; many base-OS/transitive libs aren't removable. No-fix / `will_not_fix` → accept or base bump, never an "upgrade X" ticket. Full discipline + maps + anti-patterns in `../references/container-scan-triage.md`.
+- **runtime** (interpreter/build tool, e.g. pip) — invoked at runtime or only at build time? Build-only, not in the running container → `not_affected` at runtime (check entrypoint/CMD).
+- **container/iac** — not a CVE; triage on whether the misconfig is reachable in our deployment; route persistent ones to infra.
+Also check **advisory preconditions** against real config — many CVEs are conditional (a setting, ASGI-vs-WSGI, a middleware). Precondition met → raises urgency; absent → clean `not_affected`. Record reachability provenance (`graph-verified`/`grep-asserted`/`image-layer`/`unknown`).
+**Resolve unknowns in the moment — don't default to `under_investigation`.** When reachability isn't settled by grep: trace deeper (`trace_path` from routes to the vulnerable binding), then **ask the human the one decisive question** (e.g. "does any endpoint parse user-supplied XML?") — a single yes/no usually collapses several findings to `not_affected` or `affected` on the spot, sparing both a register entry and a follow-up task. Reserve `under_investigation` for questions nobody in the session can answer (needs a runtime check / a teammate / an external dependency); time-box those and name what must be checked.
+### 5. Exposure & controls — read, don't invent
+Per reference § Exposure & Controls: `.grimoire/docs/context.yml` (internet-facing vs internal vs lambda/batch, infra, services) and MADR decisions (`Security (CIA)` rows, WAF/network-isolation/auth/tenancy decisions). A documented control that breaks the attack path is a legitimate damper / VEX `inline_mitigations_already_exist`. **Do not create a controls config file** — controls live in MADR + context.yml. A verdict-changing control that's recorded nowhere → log under "Control gaps", don't credit it silently.
+### 6. Assign VEX verdict + urgency
+Apply the decision tree (reference § VEX + § Urgency): `fixed` (Step 2) → `not_affected` (reachability/precondition) → `affected` with **hotfix-now** / **next-release** / **accept** (with expiry) → `under_investigation` (time-boxed). Fail safe on unknowns (KEV + public + unknown reachability → hotfix-now); don't manufacture emergencies (no KEV + low EPSS + unknown → under_investigation + next-release).
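The ordering and the two fail-safe rules stated above can be sketched as a small function. This covers only what the paragraph spells out: the full urgency tree lives in the reference file, so every other branch returns no urgency and defers to it. The `low_epss` threshold of 0.1 is a hypothetical placeholder, not a value from the source.

```python
from typing import Optional


def triage(fixed: bool, reachable: Optional[bool], kev: bool, public: bool,
           epss: Optional[float], low_epss: float = 0.1) -> tuple[str, Optional[str]]:
    """Return (vex_status, urgency); urgency None means 'consult the reference tree'."""
    if fixed:
        return ("fixed", None)
    if reachable is False:
        return ("not_affected", None)
    if reachable is None:
        # Fail safe: known-exploited, public-facing, reachability unknown.
        if kev and public:
            return ("affected", "hotfix-now")
        # Don't manufacture emergencies: no KEV, low EPSS, reachability unknown.
        if not kev and epss is not None and epss < low_epss:
            return ("under_investigation", "next-release")
        return ("under_investigation", None)
    # Reachable: urgency comes from the reference's urgency tree, not modelled here.
    return ("affected", None)
```

Keeping reachability as a three-valued input (`True`, `False`, `None`) makes the "unknown" cases explicit instead of folding them into either verdict.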
+### 7. Contrarian pass — calibrate before you escalate
+Run the **Contrarian calibration pass** (`../references/review-personas.md` §4.8) over every `hotfix-now` and `affected` verdict: steel-man "we are not affected", name the assumption, run the inversion test (does a rushed hotfix / base-image swap ship *new* risk?), check severity clears all three bars (reachable + exploitable-as-deployed + real blast radius). Emit `[hotfix upheld]` / `[hotfix → next-release]` / `[finding dropped]` per escalation with one line of evidence. Summary counts are **post-Contrarian**. Calibration, not veto.
+### 8. Write the triage record
+Write `.grimoire/security/vulns/<run-date>/triage.md` in the reference format: frontmatter totals, then sections — Hotfix now / Next release / Risk-accepted / **Already fixed** (the stale-scan audit trail) / Not affected / Under investigation / Control gaps. Cache the KEV snapshot and EPSS responses alongside for reproducibility.
+### 9. Report and hand off
+Headline: how many findings, how many already **fixed** (stale scan), how many **not_affected** (noise) with the dominant reason, how many real (`affected`), how many **hotfix-now** — and *why* the hotfixes are hotfixes (one line each).
+- **Any hotfix-now** → flag immediately, notify the security owner, recommend expedited fix.
+- **affected (any urgency)** → "Run `grimoire-vuln-remediate` to file these into the bug tracker."
+- **Control gaps** → "Assumed but unrecorded — `grimoire-draft` to capture them."
+## Important
+- **Reconcile before you triage.** A scan is a snapshot; the tree moves. Confirm each finding still exists in the current lockfile/image before spending any effort on it — it's the single highest-leverage step and stops you filing dead tickets.
+- **Scanner-agnostic by design.** Normalize any tool into the canonical model, then triage that. The verdict logic must never depend on npm's or pip's or Trivy's field names. New tool? Add an adapter, not a new triage path.
+- **CVSS ranks the world; we triage our deployment.** A "critical" in a dev-only, unreachable, or base-OS-cruft component is noise; a "medium" KEV hit on a public endpoint is a hotfix.
+- **Reachability is type-aware.** App import ≠ OS package ≠ build-time tool ≠ IaC misconfig. Judge each on its own terms; a flagged base-image lib the app never calls is not a prod emergency.
+- **Check the precondition.** Conditional CVEs are common — read the actual setting. Met → escalate; absent → `not_affected`.
+- **`not_affected` requires a justification code.** The code is what makes the dismissal defensible to an auditor.
+- **Controls must be recorded to count.** Flag undocumented ones, don't assume them.
+- **Fail safe, don't fearmonger.** Escalate on KEV + public + unknown reachability; don't turn a low-EPSS, non-KEV, internal-only finding into a fire drill.
+- **The Contrarian is the noise filter, not a silencer.**
+- **`accept` carries an expiry.** Re-triaged when the fix ships; never silently permanent.
+- **This skill does not fix or file.** It classifies. Code changes, ticket filing, image rebuilds are downstream (`grimoire-vuln-remediate`, `grimoire-bug`/`grimoire-draft` for non-trivial fixes).
+## Done
+When `.grimoire/security/vulns/<run-date>/triage.md` exists with every finding normalized, reconciled, and assigned a VEX verdict (and for `affected`, an urgency), the Contrarian pass applied to escalations, and the headline reported, triage is complete. Hand off to `grimoire-vuln-remediate` to file the dev work.

package/skills/references/artifact-map.md ADDED

@@ -0,0 +1,44 @@
+# Artifact Map & Reading Discipline
+Loaded by skills that read a change's specs before acting (`grimoire-plan`, `grimoire-draft`, `grimoire-design`, `grimoire-review`, `grimoire-pr-review`). This is the single home for **what each grimoire artifact is** and **how to read them**. Skills link here instead of restating it; they keep only the reading focus specific to their job.
+---
+## The artifacts
+Per-change (under `.grimoire/changes/<change-id>/`):
+- **`manifest.md`** — change summary, complexity level, and the Why. Level 3-4 also carry Assumptions, Pre-Mortem, and **Prior Art** (the build-vs-buy rationale).
+- **`features/*.feature`** — behavioral specifications. Edited live in `features/` on the branch.
+- **decision records** — architectural choices for this change, edited live in `.grimoire/decisions/`, including Cost of Ownership sections.
+- **`tasks.md`** — the implementation plan (present once planned).
+- **`data.yml`** — proposed schema changes (present only when the change touches the data model).
+Project-wide (under `.grimoire/`):
+- **`config.yaml`** — language, tools, conventions, `comment_style`, `commit_style`, `compliance`, `dep_audit`.
+- **`docs/<area>.md`** — per-area Purpose, Boundaries, Conventions, and "Where New Code Goes". Intent and placement, not live structure.
+- **`docs/data/schema.yml`** — the full data model: tables/collections, field types, relationships, indexes, external API contracts with `source:` pointers. Read this instead of individual model files.
+- **`docs/context.yml`** — deployment environment, related services, infrastructure dependencies, CI/CD, observability. Tells you runtime constraints (Lambda → no long-running processes), cross-service boundaries (auth lives in a sibling service), and what's available (Redis, RabbitMQ).
+- **`brand/tokens.json`**, **`brand/voice.md`** — design grounding (see `brand-tokens-format.md`).
+---
+## Reading discipline
+**Grimoire docs first, codebase second.** `.grimoire/docs/` is a pre-computed map — where code lives, what utilities exist, what patterns to follow, what the data layer looks like. Read it *instead of* exploring raw source. Read specific source files only when the docs don't have what you need.
+**Graph for live structure.** Area docs give intent and placement; they do not carry exact symbols. For function names, file paths, line numbers, reusable utilities, and call graphs, query the graph — `search_graph` / `get_code_snippet` / `get_architecture`. Combine the two: area doc says *where new code goes*, the graph says *what's already there to reuse*.
+**Do NOT read the entire codebase for "context."** Area docs + data schema + the graph already give you specific paths and assertions. Reading dozens of source files wastes context and does not produce better output. Read specific source only to verify a detail the docs can't answer (exact signature, exact import path, existing step-definition setup).
+---
+## Staleness gate
+For each area doc you load, compare its `last_updated` against `git log -1 --format=%ci <directory>`. If the doc is older than the most recent commit to its directory, it's stale — its paths, utility names, and patterns may be wrong.
+- **Level 1-2:** warn (`Area doc for <area> is behind recent commits — rely on the graph for structure`) and proceed. Mark inferred paths with `<!-- inferred: area doc may be stale -->`.
+- **Level 3-4:** blocker. Do not proceed until the user refreshes via `grimoire-discover` targeted refresh. Acting on stale docs at this complexity produces wrong paths and misses recent utilities — re-doing the work costs more than refreshing first.
+If area docs don't exist at all, tell the user to run `/grimoire:discover` first.
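The staleness gate can be sketched as a pure comparison plus a thin wrapper around the `git log` command quoted above. This is an illustration under assumptions: `last_updated` is taken to be a timezone-aware timestamp already parsed from the area doc, and the function names and return labels are hypothetical.

```python
import subprocess
from datetime import datetime


def last_commit_time(directory: str) -> datetime:
    """Read the committer date of the newest commit touching `directory`."""
    output = subprocess.run(
        ["git", "log", "-1", "--format=%ci", "--", directory],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    # %ci prints e.g. '2024-05-01 12:34:56 +0200'; empty output (no commits) will raise.
    return datetime.strptime(output, "%Y-%m-%d %H:%M:%S %z")


def staleness_action(last_updated: datetime, last_commit: datetime, level: int) -> str:
    """Return 'fresh', 'warn' (Level 1-2), or 'block' (Level 3-4)."""
    if last_updated >= last_commit:
        return "fresh"
    return "block" if level >= 3 else "warn"
```

Splitting the comparison from the `git` call keeps the gate itself testable without a repository.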

package/skills/references/code-quality.md CHANGED

@@ -14,7 +14,7 @@ For each production file you wrote or edited, walk the seven checks below. Any f
 ### 1. Reuse before write
-Before adding a function, helper, type, or constant: grep for it. Check the area doc's reusable-code table if one exists. Check neighbors in the same directory.
+Before adding a function, helper, type, or constant: query the graph (`search_graph` by concept and by name) for an existing one. Then grep, and check neighbors in the same directory.
 - If a function with the same job already exists → call it. Don't re-implement.
 - If something *almost* fits → use it directly first, refactor it once a second caller actually needs the change. Don't generalize on speculation.
@@ -83,21 +83,53 @@ Keep:
 Fail: a new `BaseFoo` / `FooStrategy` / `FooFactory` introduced for a single caller.
-### 7. Comments earn their place
+### 7. Comments earn their place — terse, self-contained, no essays
-Default: no comments. Add a comment **only** when the *why* is non-obvious — a hidden constraint, a workaround for a specific bug, an invariant that would surprise a future reader.
+Write comments like a senior engineer with no time: dense, professional, zero filler.
+**Voice: terse.** "Resolve model by id; raises on unknown provider." — not "This function is responsible for resolving the model by its id, and it will raise an exception if the provider is not known." Drop "this function", "we", hedging, and restated types. Fragments are fine; full prose grammar is not required.
+**Self-contained.** A comment describes the function/class on its own terms only. It must NOT name an external artifact that changes independently — feature flags / `.feature` files / scenario names, unit or integration test names, MADR/ADR numbers, change-ids, issue/PR numbers, tag codes (`LOG-OBS-003`). Those orphan the moment the artifact moves, and rot silently. Describe the *behavior*, not where it's specced.
+- OK: `# skip third-party sinks (e.g. behave capture)` — generic, about the code.
+- Not OK: `# implements scenario LOG-OBS-003 from logging.feature` — points at an artifact that will move.
+**No paragraphs.** Summary is one line, two at most. No prose block explaining the whole design before the params. If the rationale needs a paragraph, it belongs in a decision record — not the code.
+**Params per `comment_style` are fine.** If the project's style (sphinx/google/jsdoc/…) calls for `:param`/`Args:`/`@param`, keep them — but describe a param only when its name + type don't already say it, and don't precede them with prose.
 Drop:
 - Comments that restate the code (`# loop over users`).
-- Comments referencing the current task / PR / ticket (`# added for issue #123`, `# used by the new flow`). These rot.
-- Multi-line docstrings on private functions whose name and signature already say everything.
+- Any reference to a task / PR / ticket / feature / scenario / ADR / specific test (`# added for issue #123`, `# covers scenario X`, `# see test_foo`). Self-contained or gone.
+- Multi-line prose docstrings on private functions whose name + signature already say everything.
 - Commented-out code. Delete it; git remembers.
 Keep:
-- One-line "why": the constraint, the gotcha, the link to the spec / ADR.
-- Docstrings the project's `comment_style` requires (check `.grimoire/config.yaml`).
+- One terse line of *why* when non-obvious — a hidden constraint, a workaround, a surprising invariant — stated in terms of the code itself.
+- The structured `comment_style` param/return section, terse.
+Fail: any comment that (a) wouldn't confuse a future reader if removed, (b) names an external artifact, or (c) runs to a prose paragraph.
+**Before / after** (the offender this rule targets):
+```python
+# BEFORE — orphan-prone essay
+def build_chat(model_id):
+    """
+    Build and return a chat model for the given model id. This is the primary
+    entry point used by every agent and team in the system, as specified by
+    scenario LOG-OBS-003 in logging.feature and decided in ADR-0001. See
+    test_build_chat for the expected behavior. Added as part of add-2fa-login.
+    :param model_id: the id of the model to build
+    :return: the chat model
+    """
+# AFTER — terse, self-contained
+def build_chat(model_id):
+    """Resolve a chat model by id. Raises on an unknown provider.
-Fail: any comment whose removal would not confuse a future reader.
+    :param model_id: provider-prefixed model id (e.g. "gpt-4.1-mini")
+    """
+```
 ---
@@ -110,7 +142,7 @@ Before marking a task `[x]`:
 - [ ] No guards / try-except / type-checks inside the trust boundary (§4)
 - [ ] No locals named `data`, `result`, `temp`, `info`, `obj` — names reveal intent (§5)
 - [ ] No new abstractions, interfaces, or wrappers with a single caller (§6)
-- [ ] No comments describing *what* the code does — only *why*, and only when non-obvious (§7)
+- [ ] Comments are terse, self-contained, ≤2 lines of prose — no *what*, no external-artifact refs (feature/scenario/ADR/test/ticket) (§7)
 - [ ] Diff stays inside the task's scope — no "while I'm here" refactors
 If any box can't be ticked, fix the code (not the checklist) and re-run tests.

package/skills/references/container-scan-triage.md ADDED

@@ -0,0 +1,102 @@
+# Container & OS-Package Scan Triage Reference
+The deep-dive for `os-package` / `container` / `iac` findings from image scanners (Trivy, Grype). Loaded by `grimoire-vuln-triage` when a scan carries container/OS-package results. The general rubric — normalize, reconcile, KEV/EPSS, VEX, urgency, Contrarian — lives in `./dependency-vuln-triage.md`; this file is the discipline for the part that goes wrong most: deciding what to *do* about a base-OS CVE.
+Written after a review recommended removing `Mesa`, `ncurses`, and `krb5` from a headless Django API with the rationale "no business in a headless API." Two of the three could not be removed without breaking the image. This guide exists so that mistake is not repeated.
+## Separate the two axes — they are not the same question
+- **Reachability** — *is the vulnerable code path actually called by untrusted input in this service?* Decides **urgency** (hotfix / next-release / accept). A library present but never invoked is low risk even at CVSS CRITICAL.
+- **Removability** — *can we delete the package, and what breaks if we do?* Decides **remediation** (Dockerfile edit / base bump / accept+document).
+Conflating them produces the classic error: "this lib is unreachable, so remove it." Unreachable ≠ removable. A headless API genuinely cannot reach Mesa's OpenGL code — **and also cannot remove Mesa** if it arrived transitively behind a package it needs. Judge both, separately.
+## Core rule: trace before you recommend
+A CVE scanner reports *what is present*, not *why* or *whether it is removable*. Never recommend removing a package until you have answered all three:
+1. **How does it get in?** Directly installed, transitive, base image, or builder-stage-only?
+2. **What depends on it?** Application code, a required runtime lib, or the base OS?
+3. **What breaks if it's gone?** Build, runtime, or nothing — and what's the test that proves it?
+"This doesn't belong in a headless API" is an assumption, not an analysis.
+## Step A — How is it installed?
+Search the Dockerfile for an explicit install line first.
+- **Explicitly installed** (named in `apt-get install` / `pip install`): a real removal candidate — continue to Step B.
+- **Not named anywhere**: it's transitive — pulled by another package or shipped in the base image. You cannot `apt-get remove` it without breaking its parent. Identify the parent before saying anything.
+Common transitive sources — map these before flagging:
+| Flagged lib | Usually pulled in by | Notes |
+|---|---|---|
+| krb5 / libgssapi-krb5 | `libpq5`, `postgresql-client`, `curl` | Postgres GSSAPI/Kerberos auth |
+| ncurses / libtinfo | base image | bash, apt, dpkg, python readline link it |
+| Mesa / libgl1 / libgbm | `libgl1`, `libglib2.0-0` | OpenCV / docling / easyocr deps |
+| OpenSSL / libssl | base image + most TLS clients | almost never removable |
+| libexpat1 | base image + python (`pyexpat`) | stdlib XML |
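Step A's first question, whether a package is named on an explicit install line, can be approximated with a short text scan. This is a rough sketch, assuming a conventional Dockerfile with `RUN` instructions and backslash line continuations. It does not handle `ARG` substitution, heredocs, or install scripts, and the helper name is hypothetical.

```python
import re

INSTALL_COMMAND = re.compile(r"\b(apt-get|apt|pip|pip3)\s+install\b")


def explicitly_installed(dockerfile_text: str, package: str) -> bool:
    """True if `package` is named on a RUN line that runs an install command."""
    # Join backslash-continued lines so each RUN instruction is one line.
    joined = re.sub(r"\\\s*\n", " ", dockerfile_text)
    # Match the package as a whole token, not as a prefix of a longer name.
    token = re.compile(rf"(?<![\w.+-]){re.escape(package)}(?![\w.+-])")
    for line in joined.splitlines():
        if not line.strip().upper().startswith("RUN"):
            continue
        if INSTALL_COMMAND.search(line) and token.search(line):
            return True
    return False


dockerfile = (
    "FROM python:3.12-slim\n"
    "RUN apt-get update && \\\n"
    "    apt-get install -y libpq5 curl\n"
)
```

A `False` result means "not named here", which per the text points to a transitive or base-image origin. It says nothing yet about which parent pulled the package in.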
+## Step B — What actually depends on it?
+Do not assume. Check the repo:
+- **App imports** — grep for the consuming module (`import cv2`, `import magic`, `import pyodbc`, `gssapi`, `lxml`).
+- **System tools used at runtime** — grep code, scripts, and the entrypoint for the binary (`psql`, `pg_dump`, `pg_isready`).
+- **Driver bundling** — many Python wheels bundle their native lib, making the system package redundant. `psycopg-binary` bundles libpq → system `libpq5` not needed for the driver; `pylibmagic` bundles libmagic → system `libmagic1` may be redundant. When a binary wheel is present, the matching system runtime package is often dead weight — **verify, then say so**.
+- **Cross-service config trap** — env vars or paths in this repo may configure a *different* container. `EASYOCR_MODULE_PATH` / `DOCLING_ARTIFACTS_PATH` in bake are passed by `job_runner.py` to the **ricky** pipeline container — they do **not** mean bake runs easyocr/docling. Never justify or condemn a package using a string that belongs to another service.
+## Step C — Know what is not removable
+Some findings are not actionable by editing an install line:
+- **Base-image packages** (ncurses, OpenSSL, glibc, zlib, expat): part of the OS. Removing breaks bash/apt/dpkg or the Python runtime. The only real mitigations are: **switch to a smaller/distroless base**, **bump the base image** for patched versions, or **accept and document** the risk. State which — do not tell the user to "remove" it.
+- **Transitive deps of required libs** (e.g. krb5 behind `libpq5`): removable only by also removing the parent, and only if the parent is itself unneeded.
+## Step D — Multi-stage builds: target the right stage
+In a multi-stage Dockerfile only the final stage ships. Packages in a `builder` stage (compilers, `-dev` headers) do **not** appear in the runtime image if only the artifact (`/opt/venv` or equivalent) is copied forward. They are not runtime attack surface. Don't flag builder-stage packages as runtime risk; if you mention them, label them **build-only**.
+## Step E — Assess real risk, not just presence (reachability)
+Maps to `./dependency-vuln-triage.md` § Reachability. A CVE in a library never reached by untrusted input is lower priority than its score. For each finding note:
+- Is the vulnerable code path reachable in *this* service? (use the consumer map below — grep the **consumer**, not the C package name; the app never `import`s `libexpat1`)
+- Network/user-input exposed, or internal-only?
+- Headless API context: no display, no user shell, no interactive TTY → GUI/terminal libs (Mesa, ncurses) are usually unreachable even when present.
+| OS package | Reached only if the app… | Grep for |
+|---|---|---|
+| libexpat1 | parses XML via stdlib | `xml.etree`, `xml.sax`, `pyexpat`, `minidom` |
+| libxml2 / libxslt | parses XML/XSLT via lxml | `import lxml`, `etree`, `XSLT` |
+| krb5 / libgssapi | does Kerberos/GSSAPI auth | `gssapi`, `kerberos`, `requests_kerberos` |
+| mesa / libGL / libgbm | does GPU/OpenGL rendering | `OpenGL`, `moderngl`, `cv2` (headless API: none) |
+| ncurses / libtinfo | drives an interactive terminal | `curses`, `pty`, `readline` (web process: none) |
+| libssl / openssl | does TLS | usually reachable — judge on impact |
+| imagemagick / libvips | processes user-uploaded images | the upload/convert path |
+**Grep can lie — verify the binding.** `Price.fromstring()` (price-parser) is not `etree.fromstring()` (XML). Confirm the match is the real vulnerable call site before asserting `not_affected` or `affected`; if you can't, mark `under_investigation` and name what a human must check. Prefer reachability-based prioritization over raw CVSS.
+## Honor the scanner's fix-state
+Trivy `Status` (and Grype `fix.state`): `fixed` → a patched package exists; upgrade/rebuild is the lever. `affected` / `will_not_fix` / `end_of_life` → **no fix available**; the lever is *accept with expiry* or *rebuild on a newer/slimmer base when one ships* — **not** an "upgrade X" ticket. `under_investigation` → distro hasn't ruled; mirror it. Never file an upgrade task for a no-fixed-version finding.
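A minimal sketch of the fix-state rule above, in Python. The function name and the lever labels it returns are hypothetical, chosen only for illustration. Only the state strings named in the paragraph above are handled; anything else falls through to `unknown` rather than being guessed.

```python
# Hypothetical helper: maps a scanner fix-state string to the remediation
# lever described in the text. The returned labels are illustrative names,
# not values any scanner emits.
NO_FIX_STATES = {"affected", "will_not_fix", "end_of_life"}


def remediation_lever(fix_state: str) -> str:
    state = fix_state.strip().lower()
    if state == "fixed":
        # A patched package exists: upgrading or rebuilding is the lever.
        return "upgrade-or-rebuild"
    if state in NO_FIX_STATES:
        # No fix available: never file an "upgrade X" ticket for these.
        return "accept-with-expiry-or-rebase"
    if state == "under_investigation":
        # The distro has not ruled yet; mirror its status.
        return "under_investigation"
    # Unrecognized state: record it as unknown instead of guessing.
    return "unknown"
```

Keeping the no-fix states in one set makes it harder for an upgrade task to slip through for a finding that has no fixed version.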
+## Output per flagged package
+1. **Package + CVE(s)** — what the scanner said (and the dedup count if one CVE spans many packages).
+2. **How it's installed** — explicit line N / transitive via `<parent>` / base image / builder-only.
+3. **What depends on it** — app module / runtime tool / OS, with grep evidence.
+4. **Reachable?** — vulnerable path called by untrusted input? (provenance: graph / grep / image-layer / unknown)
+5. **Removable?** — Yes (safe) / Yes (after removing `<parent>`, test X) / No (base OS) / No (required by Y).
+6. **Recommendation** — exact Dockerfile edit, or "patch/bump base image", or "accept + document", **plus the post-change test** (build, import, DB connect). Route image-structure changes to infra / `grimoire-draft`, not app remediation.
+## Anti-patterns — do not do these
+- ❌ "X has no business in a headless API" with no trace of how X got installed.
+- ❌ Recommending `apt-get remove` of a base-image or transitive package.
+- ❌ Treating scanner presence as equal to exploitable risk.
+- ❌ Justifying or condemning a package using config that targets another service.
+- ❌ Flagging builder-stage packages as runtime attack surface.
+- ❌ Recommending removal without naming the post-change test.
+- ❌ Filing an "upgrade" ticket for a `will_not_fix` / no-fixed-version finding.

package/skills/references/dependency-vuln-triage.md ADDED

@@ -0,0 +1,236 @@
+# Vulnerability Triage Reference
+Loaded by `grimoire-vuln-triage` (and later `grimoire-vuln-remediate`). Turns **any** vulnerability scan — `npm audit`, `pip-audit`, `osv-scanner`, Trivy, Grype, Snyk, Dependabot, a SARIF file, or a CSV/markdown report a teammate forwards — into per-advisory verdicts whose single most important output is one decision:
+> **Drop everything and hotfix now, or let it ride the normal testing / release cycle?**
+Everything below exists to answer that, honestly, for *our* deployment — not in the abstract. The skill is **scanner-agnostic**: it normalizes whatever it's handed into one canonical model, then triages that. Covers application dependencies (npm/PyPI/Go/Cargo/…), OS packages (Debian/Alpine/RPM from container scans), and container/IaC findings alike.
+## Why raw scanner severity is not the answer
+Scanners rank by **CVSS base score**, which describes the vulnerability in a vacuum. It knows nothing about whether our code reaches the vulnerable function, whether the package even runs in production, whether the service is internet-facing, what controls sit in front of it, or **whether we already upgraded past it**. CVSS alone over-escalates: most "high"/"critical" findings are not actionable in a given deployment. Commercial reachability tooling is commonly reported to suppress 70–90% of findings for exactly this reason. We get most of that signal for free from reconciliation + KEV + EPSS + reachability + our own context.
+The triage rubric is **Threat × Exposure × Impact**:
+- **Threat** — is it actually being exploited / likely to be? (KEV, EPSS)
+- **Exposure** — can an attacker reach the vulnerable code in our deployment? (reachability + network exposure)
+- **Impact** — what is the blast radius if they do? (data sensitivity, privilege)
+## Step 1 — Normalize the scan into the canonical advisory model
+**Do this before anything else, regardless of source.** Different scanners emit wildly different shapes; triage logic must never be coupled to one format. Map each finding to:
+| Field | Meaning |
+|---|---|
+| `id` | Primary advisory id (CVE, GHSA, OSV, vendor id) |
+| `aliases` | All other ids (so KEV/EPSS lookups can find the CVE) |
+| `cve` | The CVE alias if any (KEV/EPSS key); may be absent |
+| `component` | Package / module / OS-package / image name |
+| `component_type` | `library` \| `os-package` \| `container` \| `iac` \| `runtime` — drives how reachability is judged |
+| `installed_version` | What the scan saw |
+| `fixed_version` | First fixed version, or `none` |
+| `severity` / `cvss` | Scanner-reported, treated as a prior only |
+| `target` | Where it was found (lockfile, image layer, Dockerfile, repo) |
+| `scanner` | Which tool produced it |
+| `advisory_url` / `description` | For reading what the bug actually is |
+### Format adapters
+Read the right fields per scanner — **do not** assume one tool's field names apply to another:
+- **npm audit** (`--json`): `vulnerabilities{}` keyed by package → `severity`, `via[]` (string or advisory object with `title`/`url`), `isDirect`, `fixAvailable`, `nodes[]`. Dev deps appear via `effects`/dependency graph (no single `dev` flag on every entry).
+- **pip-audit** (`-f json`): `dependencies[]` → `{name, version, vulns[]}`; each vuln has `id`, `aliases[]` (find the `CVE-` one), `fix_versions[]`, `description`. **No dev flag in the data** — infer dev/runtime from lockfile groups (`[tool.uv] dev-dependencies`, poetry `group.dev`, `requirements-dev.txt`).
+- **osv-scanner** (`--format json`): `results[].packages[].vulnerabilities[]` with OSV ids + `aliases`; `results[].source.path` is the manifest.
+- **Trivy** (`--format json`): `Results[]` each with a `Class` (`os-pkgs` \| `lang-pkgs` \| `config`) and `Type` (debian, alpine, gobinary, python-pkg, …); `Results[].Vulnerabilities[]` → `VulnerabilityID`, `PkgName`, `InstalledVersion`, `FixedVersion`, `Severity`, `CVSS{}`. **`Class: os-pkgs` → `component_type: os-package`** (base-image OS cruft — judged differently from app deps). `Results[].Class: config` → `component_type: iac` (Dockerfile/k8s misconfig, not a CVE — triage on exposure, no KEV/EPSS).
+- **Grype** (`-o json`): `matches[].vulnerability` (`id`, `severity`, `fix.versions[]`) + `matches[].artifact` (`name`, `version`, `type`).
+- **Snyk / Dependabot / SARIF**: pull `ruleId`/`cve`, the package coordinate, and fixed version from `results[]` / alerts / `runs[].results[]`. SARIF `level` maps to severity.
+- **Unknown / freeform (CSV, markdown, pasted text):** extract the minimum — `id` (CVE/GHSA), `component`, `installed_version`, `fixed_version` if stated. Anything you can't fill is `unknown`, not a guess. Triage proceeds on what you have; record `scanner: <described>` and note reduced confidence.
+If you genuinely can't parse a format, say so and ask for `--json`/SARIF rather than guessing at a table.
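As an illustration of one adapter, here is a hedged Python sketch that maps a parsed Trivy JSON report onto the canonical fields from the table above. The function name is hypothetical. The `lang-pkgs` to `library` mapping and the use of `Results[].Target` for `target` are assumptions of this sketch, and it deliberately skips `config` findings and any field not listed in the Trivy bullet.

```python
from typing import Any

# Class -> component_type. "os-pkgs" and "config" follow the Trivy bullet
# above; mapping "lang-pkgs" to "library" is an assumption of this sketch.
CLASS_TO_TYPE = {"os-pkgs": "os-package", "lang-pkgs": "library", "config": "iac"}


def normalize_trivy(report: dict[str, Any]) -> list[dict[str, Any]]:
    """Flatten Results[].Vulnerabilities[] into canonical advisory dicts."""
    advisories = []
    for result in report.get("Results") or []:
        component_type = CLASS_TO_TYPE.get(result.get("Class", ""), "unknown")
        # Results without vulnerabilities omit the key or set it to null.
        for vuln in result.get("Vulnerabilities") or []:
            vuln_id = vuln.get("VulnerabilityID", "unknown")
            advisories.append({
                "id": vuln_id,
                # KEV/EPSS key: only set when the primary id is a CVE.
                "cve": vuln_id if vuln_id.startswith("CVE-") else None,
                "component": vuln.get("PkgName", "unknown"),
                "component_type": component_type,
                "installed_version": vuln.get("InstalledVersion", "unknown"),
                # An empty or missing FixedVersion means no fix: record "none".
                "fixed_version": vuln.get("FixedVersion") or "none",
                "severity": vuln.get("Severity", "unknown"),
                "target": result.get("Target", "unknown"),
                "scanner": "trivy",
            })
    return advisories
```

Missing values become `unknown` (or `none` for the fixed version) instead of a guess, which matches the rule for freeform input.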
+### Deduplicate before triaging
+Scanners — Trivy especially — emit the **same CVE once per affected package** (e.g. `CVE-2026-40393` listed against `libgbm1`, `libgl1-mesa-dri`, `libglx-mesa0`, `mesa-libgallium` = 4 findings, 1 vulnerability). Collapse to **unique `(id, component_type)`**, keeping the list of affected packages on the single entry. Triage and count the deduplicated set — a "200 CVE" image scan is often 30 real advisories. Report both numbers (raw findings → unique advisories) so the noise reduction is visible.
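The collapse described above could look roughly like this in Python. It assumes the findings are already in the canonical model; the function name and the `affected_packages` key are hypothetical names introduced for the sketch.

```python
def dedupe(advisories):
    """Collapse findings to unique (id, component_type) entries.

    Each merged entry keeps the fields of the first finding seen and
    gathers every affected package name under "affected_packages".
    """
    merged = {}
    for adv in advisories:
        key = (adv["id"], adv["component_type"])
        entry = merged.setdefault(key, {**adv, "affected_packages": []})
        if adv["component"] not in entry["affected_packages"]:
            entry["affected_packages"].append(adv["component"])
    # Dicts preserve insertion order, so output follows first-seen order.
    return list(merged.values())
```

Reporting `len(advisories)` next to `len(dedupe(advisories))` gives the raw-findings and unique-advisories pair that the text asks for.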
+### Container scans also carry non-CVE results — don't drop them
+Trivy/Grype image scans include result classes that are **not** package CVEs and need routing, not triage:
+- **secret** (`Class: secret`) — a credential/key found in an image layer (e.g. an `.env` file baked in). Even zero-hit secret *targets* are a smell (why is an env file in the image?). Route to infra/`grimoire-draft`, treat any real hit as a confidential security issue.
+- **config / misconfig** (`Class: config`, `component_type: iac`) — Dockerfile/k8s findings (root user, no resource limits, exposed port, `:latest` base). Triage on exposure; route persistent ones to infra/`grimoire-draft`. Not an app hotfix.
+## Step 2 — Reconcile against the current tree FIRST  *(mandatory, highest-leverage)*
+**Scan artifacts go stale.** A report from last week reflects the versions resolved at scan time, which you may have since upgraded. Before spending any effort on enrichment or reachability, compare each advisory's `installed_version` against what the repo resolves **right now**:
+- Read the live lockfile / manifest: `package-lock.json` / `pnpm-lock.yaml` / `yarn.lock`, `uv.lock` / `poetry.lock` / `requirements.txt`, `go.mod`/`go.sum`, `Cargo.lock`, `Gemfile.lock`. For container/OS findings, the equivalent is "is this image still deployed?" — check the current image tag / Dockerfile base.
+- If the **currently resolved version ≥ `fixed_version`**, the advisory is **`fixed`** → drop it from the queue before enrichment. Record it in the "Already fixed" section as the audit trail.
+- If a manifest comment or prior triage already dismisses the CVE (e.g. `urllib3>=2.7.0  # CVE-2026-44431`), treat as `fixed`/known and don't re-litigate.
+This single pass routinely clears the majority of a stale scan and saves the expensive work for findings that are actually still present. **Never file remediation for an advisory without confirming it still exists in the current tree.**
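The version comparison at the heart of this pass can be sketched as follows. This is a deliberately naive Python illustration: both function names are hypothetical, and the comparison only looks at the numeric runs in each version string. Real ecosystems (PEP 440 pre-releases, semver build metadata, Debian epochs and revisions) need their own comparator, so treat this as a stand-in for that step, not a replacement.

```python
import re


def version_key(version: str) -> tuple[int, ...]:
    """Naive key: the numeric runs of a version string, in order."""
    return tuple(int(part) for part in re.findall(r"\d+", version))


def is_already_fixed(resolved: str, fixed: str) -> bool:
    """True when the currently resolved version is at or past the fix."""
    if fixed in ("", "none", "unknown"):
        # No known fixed version: reconciliation cannot clear the advisory.
        return False
    a, b = version_key(resolved), version_key(fixed)
    width = max(len(a), len(b))
    # Pad with zeros so that "2.7" and "2.7.0" compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b
```

Comparing integer tuples rather than strings avoids the classic lexical trap where `2.10.1` sorts before `2.7.0`.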
+## The enrichment signals (for advisories that survive reconciliation)
+Gather what you can. Degrade gracefully — a missing signal is "unknown", not "safe". OS-package and library findings both get KEV/EPSS (they're CVEs); IaC/config findings skip threat-intel and triage on exposure.
+### KEV — CISA Known Exploited Vulnerabilities  *(Threat, binary)*
+Fetch once per run and match every advisory's `cve`/`aliases`:
+`https://www.cisa.gov/sites/default/files/feeds/known_exploited_vulnerabilities.json`
+KEV membership is binary, evidence-grounded, auditor-defensible. **A reachable KEV vulnerability is a hotfix candidate regardless of CVSS.**
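A hedged Python sketch of the matching step, assuming the feed has already been downloaded and parsed, and that it keeps its `vulnerabilities[].cveID` layout. The function name is hypothetical; advisories are assumed to be in the canonical model with optional `cve` and `aliases` fields.

```python
def kev_hits(advisories, kev_catalog):
    """Return {advisory id: True/False} for KEV membership.

    Matches on the advisory's id, its cve field, and every alias, so a
    GHSA- or OSV-keyed finding is still found through its CVE alias.
    """
    kev_ids = {entry["cveID"] for entry in kev_catalog.get("vulnerabilities", [])}
    hits = {}
    for adv in advisories:
        candidates = {adv.get("id"), adv.get("cve"), *adv.get("aliases", [])}
        candidates.discard(None)
        hits[adv["id"]] = bool(kev_ids & candidates)
    return hits
```

Building the id set once per run mirrors the "fetch once" guidance: the catalog is read a single time and every advisory is checked against it.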
+### EPSS — Exploit Prediction Scoring System  *(Threat, probability)*
+Fetch per CVE (batchable, comma-separated):
+`https://api.first.org/data/v1/epss?cve=CVE-2024-XXXX,CVE-2024-YYYY`
+Daily-refreshed probability (0–1) of exploitation in the next 30 days. Ranks the long tail KEV is silent on. Rough bands: `≥0.5` high, `0.1–0.5` elevated, `<0.1` low. A prior, not a verdict.
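The rough bands above translate into a small Python helper. The function name is hypothetical, and a missing score is reported as `unknown` in line with the "missing signal is unknown, not safe" rule.

```python
from typing import Optional


def epss_band(score: Optional[float]) -> str:
    """Bucket an EPSS probability (0 to 1) into the rough bands above."""
    if score is None:
        # No score fetched: unknown, never treated as low.
        return "unknown"
    if score >= 0.5:
        return "high"
    if score >= 0.1:
        return "elevated"
    return "low"
```

The band is a prior that feeds the urgency decision; it is not a verdict on its own.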
+### Reachability — is the vulnerable code in our execute path?  *(Exposure)* — judge by `component_type`
+The strongest noise filter. **How you judge reachability depends on what kind of component it is:**
+- **`library` (app dependency):**
+  - *Dev/test only?* Not shipped or run in prod → **not_affected** in prod (`vulnerable_code_not_in_execute_path`). Infer dev/runtime from lockfile groups — there is usually no single flag.
+  - *Imported at all?* `search_graph(name_pattern=<pkg>)` / `search_code(<pkg>)`. Unused transitive → low exposure.
+  - *Vulnerable function reached?* When the advisory names the affected API, `trace_path` / `search_code` to confirm our code calls *that* surface. Not present / not on a reachable path → **not_affected**.
+- **`os-package` (Debian/Alpine/RPM from a container scan):** judge **two separate axes** — *reachability* (is the vulnerable code path called by untrusted input?) drives urgency; *removability* (can we delete it, and what breaks?) drives remediation. **They are not the same — unreachable ≠ removable.** A headless API can't reach Mesa's OpenGL code *and* can't remove Mesa if it arrived transitively behind a package it needs. A C library is only reachable if something in the running app binds to it, so grep the **consumer**, not the C package name (the app never `import`s `libexpat1`). Honor the scanner's fix-state: `will_not_fix`/`end_of_life`/no-`FixedVersion` means **no fix exists** → the lever is accept-with-expiry or a base-image bump/rebuild, **never** an "upgrade X" ticket. **Before recommending any removal or "slim the base image," trace how the package is installed (explicit / transitive / base-image / builder-only) and name the post-change test** — see `./container-scan-triage.md` for the full discipline, the transitive-source and consumer maps, and the anti-patterns. Route image-structure changes to infra/`grimoire-draft`, not app remediation.
+- **`runtime` (the interpreter/build tool itself — e.g. `pip`, `node`, `setuptools`):** is it invoked **at runtime** or only at **build time**? A build-time tool not present/called in the running container → **not_affected** at runtime (`vulnerable_code_not_in_execute_path`); note it as build-image hygiene at most. Check the container entrypoint/CMD.
+- **`container` / `iac` (Dockerfile, k8s, compose misconfig):** not a CVE — no KEV/EPSS. Triage on exposure + control: does the misconfig (root user, no resource limits, exposed port, `:latest` base) actually create reachable risk in our deployment? Route persistent ones to `grimoire-draft`/infra, not an app hotfix.
+**Grep can lie — verify the binding.** A bare symbol/name match is not proof: `Price.fromstring()` (the `price-parser` library) is not XML `etree.fromstring()`; a package named in a comment isn't a call. Confirm the match is the actual vulnerable binding (right import, right call site) before asserting `not_affected` **or** `affected`.
+**Resolve unknowns now — `under_investigation` is a last resort, not a default.** When grep doesn't settle reachability, work the question before deferring it:
+1. **Trace deeper** — `codebase-memory-mcp` `trace_path` from the entry points (routes/handlers) to the vulnerable binding; read the actual call site, not just its name.
+2. **Ask the decisive question.** Most reachability unknowns collapse to one yes/no the human can answer instantly — ask it inline rather than filing a task. The pattern: *"Does <surface> ever receive attacker-controlled <input>?"* e.g. "Does any endpoint parse user-supplied XML (SAML/SSO metadata, uploads, external responses)?" → "no" makes expat/libxml2 `not_affected` on the spot; "yes" makes them `affected`. One question can clear several findings **and** spare a register entry and a follow-up task.
+3. Only when the answer genuinely needs work nobody in the session can do — a runtime check, a teammate's knowledge, an external dependency — mark **`under_investigation`**, time-box it, and name exactly what must be checked. Don't force a verdict to look decisive; but don't punt one you could resolve with a trace or a question, either.
+Record reachability provenance: `graph-verified` (codebase-memory-mcp), `grep-asserted` (fallback), `image-layer` (container scan), or `unknown`.
+### Exposure & Controls — our deployment, our mitigations  *(Exposure + Impact damping)*
+Read these — do **not** invent a controls config file; the truth already lives here:
+- **`.grimoire/docs/context.yml`** — `deployment` (internet-facing vs internal vs lambda/batch), `infrastructure`, `services`. An internal-only service behind an auth gateway has far less exposure than a public endpoint.
+- **MADR decisions** (`.grimoire/decisions/*.md`) — `Security (CIA)` quality-attribute rows and security-relevant decisions (WAF, network isolation, input validation, auth model, tenancy). A documented control that breaks the attack path is a legitimate VEX `inline_mitigations_already_exist` / severity damper.
+- **App config that satisfies (or fails) an advisory's precondition** — many CVEs are conditional ("only when `SESSION_SAVE_EVERY_REQUEST=True`", "only on ASGI", "only if middleware X enabled"). Read the actual setting. A precondition that is **met** raises urgency; one that is **absent** is a clean `not_affected` (`vulnerable_code_cannot_be_controlled_by_adversary` / not in execute path).
+If a control that *would* change the verdict is assumed but **not recorded anywhere**, do not credit it silently. Flag the gap and recommend recording it as a MADR via `grimoire-draft` — an undocumented control is not defensible in an audit.
+## VEX verdict — the per-advisory status
+Assign each surviving advisory a [VEX](https://www.cisa.gov/sites/default/files/2023-04/minimum-requirements-for-vex-508c.pdf) status. This is the noise-suppression layer: only `affected` items become dev work.
+| Status | Meaning | Justification codes (for `not_affected`) |
+|---|---|---|
+| `fixed` | Already remediated — current tree resolves ≥ the fixed version (from Step 2). | — |
+| `not_affected` | We are not exploitable. **No dev work.** | `component_not_present`, `vulnerable_code_not_present`, `vulnerable_code_not_in_execute_path`, `vulnerable_code_cannot_be_controlled_by_adversary`, `inline_mitigations_already_exist` |
+| `affected` | Exploitable in our deployment. **Needs a remediation action.** | — (carries an urgency, see below) |
+| `under_investigation` | Can't determine yet (no graph, ambiguous advisory). Time-box it. | — |
+Every verdict records *why*. A `not_affected` with a justification code is the auditor-defensible way to dismiss noise; a bare "looks fine" is not.
+## Urgency — the decision that matters (for `affected` items only)
+| Urgency | Trigger | Action |
+|---|---|---|
+| **hotfix-now** | In **KEV** AND reachable AND exposed; **or** EPSS high + reachable + internet-exposed + no mitigating control; **or** active exploitation against a high-impact surface (auth, PII, RCE on a public endpoint). | Drop everything. Expedited fix branch, out-of-band release. Notify security owner. |
+| **next-release** | Reachable but exposure is damped (internal-only / behind a control), **or** no KEV and low/elevated EPSS, **or** fix requires a non-trivial upgrade / image rebuild with no active-exploitation pressure. | File remediation for the normal testing / release cycle. |
+| **accept (risk-accepted)** | `affected` but low real risk and **no fix available** (no patched version / no newer base image yet). | Record justification + an **expiry / revisit date**. Re-triage on expiry or when a fix ships. Don't let it become permanent. |
+Decision tree, in order:
+1. Already `fixed` (Step 2)? → done, drop. No enrichment needed.
+2. `not_affected` (reachability / precondition absent)? → done, it's noise. No urgency.
+3. Reachable + (KEV **or** high EPSS) + internet-exposed + no control that breaks the attack path? → **hotfix-now**.
+4. `affected` but damped (not exposed / control breaks the path / low EPSS / dev-or-build-time only) → **next-release**.
+5. `affected`, no fix exists, low risk → **accept** with expiry.
+Default bias: unknown reachability + internet-facing + KEV → **hotfix-now** (fail safe). Unknown reachability + no KEV + low EPSS → **under_investigation** + next-release — don't manufacture an emergency.
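One way to read the decision tree and the default bias as code, in Python. The `Signals` class, its field names, and `decide_urgency` are hypothetical. Two interpretation choices are assumptions of this sketch: the fail-safe default is checked before the main hotfix rule, and the accept case is checked before falling through to next-release, since step 4 otherwise acts as a catch-all.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Signals:
    vex: str                   # "fixed" | "not_affected" | "affected" | "under_investigation"
    reachable: Optional[bool]  # None means reachability is unknown
    kev: bool
    epss_high: bool
    internet_exposed: bool
    control_breaks_path: bool
    fix_available: bool
    low_risk: bool


def decide_urgency(s: Signals) -> str:
    # Steps 1 and 2: fixed or not_affected carries no urgency.
    if s.vex in ("fixed", "not_affected"):
        return "none"
    # Default bias: unknown reachability + internet-facing + KEV fails safe.
    if s.reachable is None and s.internet_exposed and s.kev:
        return "hotfix-now"
    # Step 3: reachable, threatened, exposed, and nothing breaks the path.
    if (s.reachable and (s.kev or s.epss_high)
            and s.internet_exposed and not s.control_breaks_path):
        return "hotfix-now"
    # Unknowns without KEV pressure ride the normal cycle while investigated.
    if s.vex == "under_investigation":
        return "next-release"
    # Step 5: affected, no fix exists, low real risk.
    if not s.fix_available and s.low_risk:
        return "accept"
    # Step 4: affected but damped.
    return "next-release"
```

An `accept` result still needs the expiry date described in the urgency table; the function only picks the lane.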
+## The Contrarian pass — calibrate before you escalate
+Before finalizing **any** `hotfix-now` or `affected` verdict, run the **Contrarian calibration pass** (`./review-personas.md` §4.8) over the escalated findings. The Contrarian adds no findings; it challenges the ones we're about to act on. For each escalation ask:
+1. **Steel-man "we are not affected."** Strongest case this doesn't matter here — dev/build-only, function never called, base-OS cruft, behind auth + network isolation, precondition absent, input never attacker-controlled. If it holds, drop the urgency (often to `not_affected`).
+2. **Name the assumption.** "Assumes the parser is fed untrusted input." "Assumes this endpoint is public." If it contradicts `context.yml`/a MADR/the actual config, the finding is mis-calibrated.
+3. **Inversion.** If we hotfixed this, what *new* risk ships — a rushed major bump, an untested base-image swap, a breaking transitive change? Is the cure riskier than the disease before the next release window?
+4. **Is doing-nothing-until-release right?** Symptom vs root cause; will it actually trigger; cost of "fix now" vs "fix when it hurts".
+5. **Is severity calibrated?** A `hotfix-now` must clear all of: reachable, exploitable-as-deployed, real blast radius.
+Emit per escalation: `[hotfix upheld]` / `[hotfix → next-release]` / `[finding dropped]` with one line of evidence. Summary counts are **post-Contrarian**. Calibration, not veto — a surviving harm path tied to `context.yml`/KEV stands.
+## Triage record format
+Write `.grimoire/security/vulns/<run-date>/triage.md`:
+```markdown
+---
+scanners: [<npm-audit|pip-audit|osv-scanner|trivy|grype|snyk|sarif|other>]
+scan_dates: [<YYYY-MM-DD per source>]
+triaged_date: <YYYY-MM-DD>
+reconciled_against: <lockfile/manifest/image checked>
+kev-feed: <date fetched, or "offline">
+epss-fetched: <true|false>
+reachability: <graph-verified|grep-asserted|image-layer|unknown>
+totals: { raw_findings: N, unique_advisories: N, fixed: N, not_affected: N, affected: N, hotfix_now: N, accepted: N, under_investigation: N }
+---
+# Vulnerability Triage — <run-date>
+## Hotfix now (drop everything)
+<!-- omit section if empty -->
+### <id> — <component> <version> (<component_type>)
+- **VEX**: affected · **Urgency**: hotfix-now
+- **KEV**: yes/no · **EPSS**: 0.NN · **CVSS**: N.N (<severity>) · **Scanner**: <tool>
+- **Reachable**: <yes — calls X / image-layer / no / unknown> (<provenance>)
+- **Exposure**: <internet-facing endpoint / internal-only / dev/build-only / base-OS>
+- **Controls**: <none that break the path / WAF per ADR-00NN / behind auth gateway>
+- **Precondition**: <CVE condition — met / absent, with the setting checked>
+- **Blast radius**: <RCE / PII read / DoS / info disclosure>
+- **Contrarian**: [hotfix upheld] <one line>
+- **Fix**: upgrade <component> <from> → <to> / rebuild on <base> (or: no fix — mitigation: <...>)
+## Next release cycle
+### <id> — <component> (<component_type>)
+- (same fields) · **Contrarian**: [hotfix → next-release] <why damped>
+## Risk-accepted (revisit by <date>)
+### <id> — <component>
+- **VEX**: affected · no fix available · **Expiry**: <YYYY-MM-DD> · **Justification**: <...>
+## Already fixed (reconciled out — scan was stale)
+<!-- audit trail: dropped before enrichment because current tree is past the fix -->
+- <id> — <component>: current tree resolves <ver> ≥ fixed <ver>
+- <id> — <component>: dismissed in manifest (<comment/prior triage>)
+## Not affected (suppressed noise)
+- <id> — <component>: not_affected (`vulnerable_code_not_in_execute_path` — dev-only / build-time / base-OS not invoked)
+- <id> — <component>: not_affected (`vulnerable_code_cannot_be_controlled_by_adversary` — precondition absent: <setting>)
+## Under investigation (time-boxed to <date>)
+- <id> — <component>: <what's blocking the call>
+## Control gaps surfaced
+- <control> assumed for <id> but not in any decision record → suggest `grimoire-draft`.
+## Infra follow-ups (root-cause, not per-CVE)
+<!-- container/IaC hygiene — route to infra/grimoire-draft, not app remediation. Each must state how-installed + post-change test, per container-scan-triage.md. -->
+- <package> (CVE <id>): installed via <explicit line N / transitive via PARENT / base image / builder-only>; depends: <evidence>; removable: <yes-safe / yes-after-removing-PARENT / no-base-OS / no-required-by-Y>; recommendation: <Dockerfile edit / bump-base / accept+document> — test: <build / import / DB connect>.
+- Secret/config result: <e.g. `.env` baked into image layer> → remove from image, route to infra.
+```
+## Supply-chain note (separate from CVE triage)
+A *known CVE* in a component is what this reference triages. A **dependency add/upgrade** (new package, version bump, floating range, missing lockfile/integrity hashes) is a different risk class — covered by `security-compliance.md` § Supply Chain Defense, a review-time blocker, not a CVE-triage output. Keep them distinct: triage answers "is this known CVE a hotfix?"; supply-chain defense answers "should this change merge at all?".
+## Principles
+- **Reconcile first.** A stale scan is mostly already-fixed findings. Confirm each advisory still exists in the current tree before any other work — it's the cheapest, highest-leverage pass.
+- **Scanner-agnostic by construction.** Normalize any tool's output into the canonical model, then triage that. Never couple the verdict logic to npm's or pip's field names.
+- **CVSS ranks the world; we triage our deployment.** The whole job is that gap. A "critical" in a dev-only, unreachable, or base-OS-cruft component is noise; a "medium" KEV hit on a public endpoint is a hotfix.
+- **Reachability is type-aware.** Library imports, OS-package usage, runtime-vs-build, and IaC misconfig are judged differently. A flagged base-image OS lib the app never calls is not a prod emergency.
+- **`not_affected` needs a justification code, not a vibe.** That line is the audit trail.
+- **Controls must be recorded to count.** Undocumented WAF/auth/isolation can't damp a verdict — flag the gap.
+- **Fail safe on unknowns, but don't manufacture emergencies.**
+- **The Contrarian calibrates escalations; it does not suppress real signal.**
+- **`accept` is not `ignore`.** It carries an expiry and gets re-triaged.