npm - sanook-cli - Versions diffs - 0.4.0 → 0.5.1 - Mend

sanook-cli 0.4.0 → 0.5.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (238)

package/.env.example +19 -0
package/CHANGELOG.md +173 -0
package/README.md +153 -20
package/README.th.md +136 -0
package/dist/agentContext.js +4 -0
package/dist/approval.js +6 -0
package/dist/bin.js +405 -57
package/dist/brain.js +92 -59
package/dist/brand.js +47 -0
package/dist/checkpoint.js +37 -0
package/dist/commands.js +86 -6
package/dist/compaction.js +76 -5
package/dist/config.js +100 -12
package/dist/cost.js +60 -3
package/dist/doctor.js +92 -0
package/dist/gateway/auth.js +2 -2
package/dist/gateway/ledger.js +2 -2
package/dist/gateway/scheduler.js +1 -0
package/dist/gateway/serve.js +6 -4
package/dist/gateway/server.js +10 -2
package/dist/git.js +11 -2
package/dist/hooks.js +43 -17
package/dist/knowledge.js +48 -49
package/dist/loop.js +182 -66
package/dist/lsp/client.js +173 -0
package/dist/lsp/framing.js +56 -0
package/dist/lsp/index.js +138 -0
package/dist/lsp/servers.js +82 -0
package/dist/mcp-server.js +244 -0
package/dist/mcp.js +184 -29
package/dist/memory-store.js +559 -0
package/dist/memory.js +143 -29
package/dist/orchestrate.js +150 -0
package/dist/providers/codex.js +21 -7
package/dist/providers/keys.js +3 -2
package/dist/providers/models.js +22 -6
package/dist/providers/registry.js +155 -1
package/dist/repomap.js +93 -0
package/dist/search/chunk.js +158 -0
package/dist/search/embed-store.js +187 -0
package/dist/search/engine.js +203 -0
package/dist/search/fuse.js +35 -0
package/dist/search/index-core.js +187 -0
package/dist/search/indexer.js +241 -0
package/dist/search/store.js +77 -0
package/dist/session.js +42 -8
package/dist/skill-install.js +10 -10
package/dist/skills.js +12 -9
package/dist/summarize.js +31 -0
package/dist/tools/bash.js +21 -2
package/dist/tools/diagnostics.js +41 -0
package/dist/tools/edit.js +29 -7
package/dist/tools/index.js +8 -1
package/dist/tools/list.js +7 -2
package/dist/tools/permission.js +90 -9
package/dist/tools/read.js +23 -4
package/dist/tools/remember.js +1 -1
package/dist/tools/sandbox.js +61 -0
package/dist/tools/search.js +105 -4
package/dist/tools/task.js +195 -29
package/dist/tools/timeout.js +35 -0
package/dist/tools/util.js +10 -0
package/dist/tools/write.js +6 -4
package/dist/trust.js +89 -0
package/dist/ui/app.js +228 -31
package/dist/ui/banner.js +4 -9
package/dist/ui/brain-wizard.js +2 -2
package/dist/ui/history.js +30 -0
package/dist/ui/mentions.js +44 -0
package/dist/ui/render.js +55 -15
package/dist/ui/setup.js +97 -12
package/dist/ui/useEditor.js +83 -0
package/dist/update.js +114 -0
package/dist/worktree.js +173 -0
package/package.json +11 -5
package/scripts/postinstall.mjs +33 -0
package/second-brain/.agents/_Index.md +30 -0
package/second-brain/.agents/skills/_Index.md +30 -0
package/second-brain/.agents/workflows/_Index.md +30 -0
package/second-brain/AGENTS.md +4 -4
package/second-brain/Acceptance/_Index.md +30 -0
package/second-brain/Acceptance/golden-case-template.md +39 -0
package/second-brain/Areas/_Index.md +30 -0
package/second-brain/Bugs/System-OS/_Index.md +30 -0
package/second-brain/Bugs/_Index.md +30 -0
package/second-brain/CLAUDE.md +4 -1
package/second-brain/Checklists/_Index.md +30 -0
package/second-brain/Checklists/preflight-postflight-template.md +29 -0
package/second-brain/Distillations/_Index.md +30 -0
package/second-brain/Entities/_Index.md +30 -0
package/second-brain/Entities/entity-template.md +33 -0
package/second-brain/Evals/_Index.md +30 -0
package/second-brain/Evals/correction-pairs.md +24 -0
package/second-brain/Evals/failure-taxonomy.md +24 -0
package/second-brain/Evals/golden-set.md +25 -0
package/second-brain/Evals/quality-ledger.md +23 -0
package/second-brain/Evals/self-eval-rubric.md +23 -0
package/second-brain/GEMINI.md +4 -4
package/second-brain/Goals/_Index.md +30 -0
package/second-brain/Handoffs/_Index.md +30 -0
package/second-brain/Home.md +7 -0
package/second-brain/Intake/Raw Sources/_Index.md +30 -0
package/second-brain/Intake/_Index.md +30 -0
package/second-brain/Intake/_Quarantine/_Index.md +30 -0
package/second-brain/Learning/_Index.md +30 -0
package/second-brain/Playbooks/_Index.md +30 -0
package/second-brain/Playbooks/playbook-template.md +23 -0
package/second-brain/Projects/_Index.md +30 -0
package/second-brain/Prompts/_Index.md +30 -0
package/second-brain/README.md +2 -1
package/second-brain/Research/_Index.md +30 -0
package/second-brain/Retrospectives/_Index.md +30 -0
package/second-brain/Reviews/_Index.md +30 -0
package/second-brain/Runbooks/_Index.md +30 -0
package/second-brain/Runbooks/eval-loop.md +24 -0
package/second-brain/Sessions/_Index.md +30 -0
package/second-brain/Shared/AI-Context-Index.md +20 -0
package/second-brain/Shared/AI-Threads/_Index.md +30 -0
package/second-brain/Shared/Archive/_Index.md +30 -0
package/second-brain/Shared/Assets/_Index.md +30 -0
package/second-brain/Shared/Context-Packs/_Index.md +30 -0
package/second-brain/Shared/Context7-Docs/_Index.md +30 -0
package/second-brain/Shared/Coordination/NOW.md +28 -0
package/second-brain/Shared/Coordination/_Index.md +30 -0
package/second-brain/Shared/Coordination/agent-registry.md +24 -0
package/second-brain/Shared/Coordination/task-board/_Index.md +30 -0
package/second-brain/Shared/Coordination/task-board/task-template.md +43 -0
package/second-brain/Shared/Coordination/task-board.md +32 -0
package/second-brain/Shared/Core-Facts/_Index.md +30 -0
package/second-brain/Shared/Decision-Memory/_Index.md +30 -0
package/second-brain/Shared/Glossary/_Index.md +30 -0
package/second-brain/Shared/Memory-Inbox/_Index.md +30 -0
package/second-brain/Shared/Operating-State/_Index.md +30 -0
package/second-brain/Shared/Prompting/_Index.md +30 -0
package/second-brain/Shared/Provenance/_Index.md +30 -0
package/second-brain/Shared/Rules/_Index.md +30 -0
package/second-brain/Shared/Rules/contextual-note-rule.md +30 -0
package/second-brain/Shared/Rules/frontmatter-standard.md +10 -0
package/second-brain/Shared/Rules/memory-write-protocol.md +28 -0
package/second-brain/Shared/Rules/procedural-runbook-header.md +40 -0
package/second-brain/Shared/Rules/review-and-staleness-policy.md +22 -0
package/second-brain/Shared/Rules/rules-formatting.md +34 -0
package/second-brain/Shared/Scripts/_Index.md +30 -0
package/second-brain/Shared/Scripts-Archive/_Index.md +30 -0
package/second-brain/Shared/Tech-Standards/_Index.md +30 -0
package/second-brain/Shared/Tech-Standards/verification-standard.md +40 -0
package/second-brain/Shared/User-Memory/_Index.md +30 -0
package/second-brain/Shared/User-Persona/_Index.md +30 -0
package/second-brain/Shared/User-Persona/owner-profile.md +25 -0
package/second-brain/Shared/Working-Memory/_Index.md +30 -0
package/second-brain/Shared/_Index.md +30 -0
package/second-brain/Shared/mcp-servers/_Index.md +30 -0
package/second-brain/Skills/_Index.md +30 -0
package/second-brain/Templates/_Index.md +30 -0
package/second-brain/Templates/bug.md +2 -0
package/second-brain/Templates/handoff.md +2 -0
package/second-brain/Templates/session.md +2 -0
package/second-brain/Tools/_Index.md +30 -0
package/second-brain/Traces/_Index.md +30 -0
package/second-brain/Vault Structure Map.md +33 -1
package/second-brain/copilot/_Index.md +30 -0
package/skills/audit-license-compliance/SKILL.md +117 -0
package/skills/author-codemod/SKILL.md +110 -0
package/skills/build-audit-logging/SKILL.md +112 -0
package/skills/build-cdc-streaming-pipeline/SKILL.md +123 -0
package/skills/build-cli-tool/SKILL.md +108 -0
package/skills/build-data-table/SKILL.md +141 -0
package/skills/build-native-mobile-ui/SKILL.md +154 -0
package/skills/build-offline-first-sync/SKILL.md +118 -0
package/skills/build-realtime-channel/SKILL.md +122 -0
package/skills/build-vector-search/SKILL.md +131 -0
package/skills/compose-local-dev-stack/SKILL.md +149 -0
package/skills/configure-bundler-build/SKILL.md +166 -0
package/skills/configure-dns-tls/SKILL.md +142 -0
package/skills/configure-reverse-proxy-lb/SKILL.md +129 -0
package/skills/configure-security-headers-csp/SKILL.md +122 -0
package/skills/contract-testing/SKILL.md +140 -0
package/skills/datetime-timezone-correctness/SKILL.md +125 -0
package/skills/debug-ci-pipeline-failure/SKILL.md +134 -0
package/skills/debug-flaky-tests/SKILL.md +128 -0
package/skills/defend-llm-prompt-injection/SKILL.md +110 -0
package/skills/deliver-webhooks/SKILL.md +116 -0
package/skills/design-api-pagination/SKILL.md +144 -0
package/skills/design-authorization-model/SKILL.md +119 -0
package/skills/design-backup-dr-recovery/SKILL.md +113 -0
package/skills/design-event-sourcing-cqrs/SKILL.md +143 -0
package/skills/design-multi-tenancy/SKILL.md +100 -0
package/skills/design-protobuf-grpc-service/SKILL.md +146 -0
package/skills/design-relational-schema/SKILL.md +129 -0
package/skills/design-search-index-infra/SKILL.md +151 -0
package/skills/design-state-machine/SKILL.md +108 -0
package/skills/design-token-system/SKILL.md +109 -0
package/skills/distributed-locks-leases/SKILL.md +120 -0
package/skills/encrypt-sensitive-data/SKILL.md +148 -0
package/skills/feature-flags-rollout/SKILL.md +130 -0
package/skills/file-upload-object-storage/SKILL.md +107 -0
package/skills/fuzz-dynamic-security-test/SKILL.md +111 -0
package/skills/harden-llm-app-reliability/SKILL.md +126 -0
package/skills/i18n-localization-setup/SKILL.md +113 -0
package/skills/idempotency-keys/SKILL.md +107 -0
package/skills/implement-push-notifications/SKILL.md +142 -0
package/skills/ingest-webhook-secure/SKILL.md +120 -0
package/skills/integrate-oauth-oidc/SKILL.md +126 -0
package/skills/load-stress-test/SKILL.md +129 -0
package/skills/map-privacy-data-gdpr/SKILL.md +146 -0
package/skills/model-nosql-data/SKILL.md +118 -0
package/skills/money-decimal-arithmetic/SKILL.md +123 -0
package/skills/monitor-ml-drift/SKILL.md +109 -0
package/skills/numeric-precision-units/SKILL.md +144 -0
package/skills/optimize-llm-cost-latency/SKILL.md +103 -0
package/skills/optimize-react-rerenders/SKILL.md +124 -0
package/skills/orchestrate-agent-workflow/SKILL.md +100 -0
package/skills/payments-billing-integration/SKILL.md +114 -0
package/skills/pin-toolchain-versions/SKILL.md +116 -0
package/skills/plan-strangler-migration/SKILL.md +95 -0
package/skills/property-based-testing/SKILL.md +108 -0
package/skills/publish-package-registry/SKILL.md +130 -0
package/skills/recover-git-state/SKILL.md +119 -0
package/skills/remediate-web-vulnerabilities/SKILL.md +125 -0
package/skills/resilience-timeouts-retries/SKILL.md +104 -0
package/skills/resolve-merge-rebase-conflict/SKILL.md +97 -0
package/skills/rewrite-git-history/SKILL.md +109 -0
package/skills/scaffold-cross-platform-app/SKILL.md +137 -0
package/skills/schema-evolution-compatibility/SKILL.md +121 -0
package/skills/send-transactional-email/SKILL.md +126 -0
package/skills/serve-deploy-ml-model/SKILL.md +107 -0
package/skills/setup-cdn-edge-waf/SKILL.md +107 -0
package/skills/setup-devcontainer-env/SKILL.md +131 -0
package/skills/setup-lint-format-precommit/SKILL.md +140 -0
package/skills/setup-monorepo-tooling/SKILL.md +125 -0
package/skills/ship-mobile-app-store-release/SKILL.md +137 -0
package/skills/structured-output-llm/SKILL.md +86 -0
package/skills/supply-chain-sbom-provenance/SKILL.md +120 -0
package/skills/test-data-factories/SKILL.md +158 -0
package/skills/threat-model-stride/SKILL.md +123 -0
package/skills/train-evaluate-ml-model/SKILL.md +109 -0
package/skills/unicode-text-correctness/SKILL.md +109 -0
package/skills/visual-regression-testing/SKILL.md +120 -0

package/skills/supply-chain-sbom-provenance/SKILL.md ADDED

@@ -0,0 +1,120 @@
+---
+name: supply-chain-sbom-provenance
+description: Hardens the software supply chain by generating/validating an SBOM (CycloneDX/SPDX via syft/cdxgen), signing artifacts keylessly (cosign + OIDC), emitting SLSA/in-toto build provenance, pinning deps and base images to digests, and enforcing signature/attestation policy at consumption.
+when_to_use: User must prove what's in an artifact and that it was built from trusted source — producing/consuming an SBOM, signing containers/releases, adding provenance, raising SLSA level, or hardening CI against poisoned deps (EO 14028, EU CRA). Distinct from dependency-upgrade (bumping versions), secrets-management (handling credentials), and security-review (auditing source code).
+---
+## When to Use
+Reach for this skill when the request is about **proving artifact integrity and origin**, not about the code's behavior:
+- "Generate/attach an SBOM to our releases" or "a customer wants a CycloneDX/SPDX SBOM"
+- "Sign our container images / release binaries" (cosign, keyless OIDC)
+- "Add build provenance / attestations" or "get us to SLSA Level 3"
+- "Pin base images to digests, not `:latest`" / "we got hit by a typosquat / dependency-confusion package"
+- "Reject unsigned or unattested images at deploy/admission time"
+- "Keep scanning what we already shipped for new CVEs against the SBOM"
+NOT this skill:
+- Bumping a dependency to a newer version / resolving a lockfile → dependency-upgrade
+- Storing/rotating the signing key or registry token itself → secrets-management (this skill prefers **keyless**, so there's no key to store)
+- Auditing the source for vulnerabilities/logic bugs → security-review (this proves *where the artifact came from*, not whether the code is safe)
+- Optimizing the Dockerfile layers/size → dockerfile-optimize
+- Writing the deploy/CI YAML in general → cicd-pipeline-author / gitops-deploy-workflow / deploy-release
+- Enforcing the policy specifically inside Kubernetes manifests → k8s-manifest-review
+## Steps
+1. **Inventory inputs, then pick format + tool by ecosystem — do not hand-write an SBOM.** An SBOM must cover **direct + transitive** deps, base image layers, and build tools. Choose:
+   | Need | Format | Generator | Why |
+   |---|---|---|---|
+   | Container/filesystem, broadest ecosystem coverage | **CycloneDX** (`--output cyclonedx-json`) | **syft** | Best default; reads installed packages from the built image, not just manifests |
+   | Same, when consumer mandates SPDX (US federal / NTIA) | **SPDX** (`--output spdx-json`) | **syft** | One flag swap; emit both if asked |
+   | Source-tree / app deps with rich pURLs + license + VEX | CycloneDX | **cdxgen** | Deeper per-language resolution (npm/maven/gradle/go/pip) |
+   Default to **CycloneDX JSON from syft, run against the final built image** (`syft <image>@sha256:... -o cyclonedx-json=sbom.cdx.json`). Generating from source misses what the base image actually ships.
+2. **Generate the SBOM in CI as a build step, attach to the release, and validate completeness.** Pin the digest you scanned. A valid SBOM has `bom-ref`/pURL for every component, declared versions, and a license field. Validate, don't trust:
+   ```bash
+   syft "$IMAGE@$DIGEST" -o cyclonedx-json=sbom.cdx.json
+   cyclonedx-cli validate --input-file sbom.cdx.json --fail-on-errors   # schema valid
+   jq -e '[.components[] | select(.version==null or .version=="")] | length == 0' sbom.cdx.json  # no unversioned comps
+   ```
+   Reject the build if validation fails. An SBOM with missing versions/hashes is worse than none — it lies.
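As an illustration of the completeness check in the step above (not part of the packaged SKILL.md), here is a minimal TypeScript sketch of the same condition the `jq` filter selects on. The `Bom`/`BomComponent` shapes are a trimmed-down assumption covering only the fields read here, and `findUnversionedComponents` and the sample data are hypothetical names for illustration.

```typescript
// Minimal shape of the CycloneDX fields this check reads; real BOMs carry many more.
interface BomComponent {
  name: string;
  version?: string | null;
  purl?: string;
}

interface Bom {
  components?: BomComponent[];
}

// Returns the names of components whose version is missing or empty.
function findUnversionedComponents(bom: Bom): string[] {
  return (bom.components ?? [])
    .filter((c) => c.version === undefined || c.version === null || c.version === "")
    .map((c) => c.name);
}

// Hypothetical sample data, for illustration only.
const sampleBom: Bom = {
  components: [
    { name: "express", version: "4.19.2", purl: "pkg:npm/express@4.19.2" },
    { name: "left-pad", version: "" },
    { name: "libc6" },
  ],
};

const unversioned = findUnversionedComponents(sampleBom);
if (unversioned.length > 0) {
  console.log(`SBOM incomplete, unversioned components: ${unversioned.join(", ")}`);
}
```

A CI step built on this would exit non-zero when the returned list is non-empty. Note that this sketch treats a BOM without a `components` array as having no findings, so a separate check for an empty BOM is still worth having.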
+3. **Pin and verify every input to a hash, never a moving tag.** This is the actual tamper defense; the SBOM only *describes* it.
+   - **Base images:** `FROM node:20.11-bookworm@sha256:<digest>` — never `:latest`, never a bare tag. A tag can be repointed under you.
+   - **Deps:** require a lockfile with **integrity hashes** and install in frozen mode: `npm ci` (uses `package-lock.json` `integrity`), `pip install --require-hashes -r requirements.txt`, `go mod verify` + committed `go.sum`, `cargo build --locked`. A lockfile without hashes (or `npm install`, which mutates it) is not pinning.
+   - **Defend dependency-confusion / typosquatting:** set a **scoped private registry** (`@yourscope:registry=...`) so internal names never resolve to public; **reserve your namespace** on the public registry; pin `registry`/`@scope` in `.npmrc`/`pip.conf`; maintain an **allowlist** and fail CI on any new top-level dep not on it. Confusion attacks beat any signature because you signed the wrong package.
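As an illustration of the base-image rule above (not part of the packaged SKILL.md), here is a minimal TypeScript sketch of a check that flags `FROM` lines lacking a digest. The function name `findUnpinnedFromLines`, the sample Dockerfile, and the placeholder digest are hypothetical. Treating `FROM scratch` and references to earlier build stages as acceptable is an assumption of this sketch, and image names built from `ARG` substitution are simply flagged because they cannot be checked statically.

```typescript
// Returns every FROM line that is not pinned to a sha256 digest.
// Allowed without a digest (assumption of this sketch): `FROM scratch`
// and references to a stage named earlier via `AS <name>`.
function findUnpinnedFromLines(dockerfile: string): string[] {
  const stageNames = new Set<string>();
  const unpinned: string[] = [];

  for (const rawLine of dockerfile.split(/\r?\n/)) {
    const line = rawLine.trim();
    // FROM [--flag=value ...] <image> [AS <stage>]
    const match = /^FROM\s+(?:--\S+\s+)*(\S+)(?:\s+AS\s+(\S+))?/i.exec(line);
    if (match === null) continue;

    const image: string = match[1] ?? "";
    const alias: string | undefined = match[2];
    const isStageRef = stageNames.has(image.toLowerCase());
    const isScratch = image.toLowerCase() === "scratch";
    const isPinned = /@sha256:[0-9a-f]{64}$/.test(image);

    if (!isStageRef && !isScratch && !isPinned) unpinned.push(line);
    if (alias !== undefined) stageNames.add(alias.toLowerCase());
  }
  return unpinned;
}

const placeholderDigest = "a".repeat(64); // placeholder, not a real image digest
const sampleDockerfile = [
  `FROM node:20.11-bookworm@sha256:${placeholderDigest} AS build`,
  "RUN npm ci",
  "FROM build AS test",
  "FROM python:3.12",
  "FROM --platform=linux/amd64 nginx:latest",
  "FROM scratch",
].join("\n");

console.log(findUnpinnedFromLines(sampleDockerfile));
```

With the sample input, only the `python:3.12` and `nginx:latest` lines are reported; the pinned image, the stage reference, and `scratch` pass.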
+4. **Sign artifacts and the SBOM keylessly with cosign via the CI OIDC identity.** No long-lived key to leak or rotate — the signature is bound to the workflow identity and logged in the Rekor transparency log.
+   ```bash
+   # cosign 2.x: keyless is the default — no COSIGN_EXPERIMENTAL flag needed
+   cosign sign --yes "$IMAGE@$DIGEST"                                  # keyless, OIDC → Fulcio cert → Rekor
+   cosign attest --yes --type cyclonedx --predicate sbom.cdx.json "$IMAGE@$DIGEST"   # SBOM as an attestation
+   ```
+   Sign the **digest**, not the tag. A signature always binds to a digest; if the tag is later repointed, consumers pulling by tag get a different artifact than the one that was signed.
+5. **Emit SLSA build provenance (in-toto attestation) linking artifact → source commit → builder, then harden to raise the level.** Provenance answers "which commit, which builder, what inputs." On GitHub the cheapest path is the official generator/`actions/attest-build-provenance`; verify the chain with cosign:
+   ```bash
+   cosign verify-attestation --type slsaprovenance \
+     --certificate-identity-regexp '^https://github.com/<org>/.+/.github/workflows/.+@refs/' \
+     --certificate-oidc-issuer https://token.actions.githubusercontent.com "$IMAGE@$DIGEST"
+   ```
+   | SLSA Build level | Requirement | How to reach it |
+   |---|---|---|
+   | L1 | Provenance exists | Generate + attach any provenance |
+   | L2 | Signed provenance, hosted build | Keyless cosign + CI-hosted runner (above) |
+   | **L3** | Non-falsifiable, isolated build | Use a trusted builder that isolates the run from the steps it builds (reusable trusted workflow); no secrets exposed to user build steps |
+   Target **L3** for anything customer-facing; L2 is the floor.
+6. **Enforce on consumption: reject unsigned / unattested artifacts at the gate.** Signing that nothing checks is theater. Put a verifying admission/policy controller in place (e.g. a Sigstore policy controller or a `cosign verify` gate in the deploy job) that **denies** images lacking a valid signature *and* the required attestations from the *expected* identity:
+   ```yaml
+   # policy intent: only images signed by OUR workflow, with an SBOM attestation, may deploy
+   require:
+     signature:
+       issuer: https://token.actions.githubusercontent.com
+       subjectRegExp: ^https://github.com/<org>/<repo>/.github/workflows/release.yml@refs/tags/.+$
+     attestations: [cyclonedx, slsaprovenance]
+   ```
+   Pin the **identity**, not just "is it signed" — anyone can sign with keyless. Fail closed.
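As an illustration of the fail-closed decision described above (not part of the packaged SKILL.md), here is a minimal TypeScript sketch of the gate logic. Every type and name here (`VerifiedArtifact`, `AdmissionPolicy`, `decide`, the `example-org/example-repo` identity) is hypothetical; these are not cosign or policy-controller types, and the sketch assumes the cryptographic verification has already been performed by a real verifier.

```typescript
// Hypothetical result of a verifier run; not a cosign or policy-controller type.
interface VerifiedArtifact {
  signatureValid: boolean;    // cryptographic check already done elsewhere
  issuer: string;             // OIDC issuer from the signing certificate
  subject: string;            // workflow identity from the signing certificate
  attestationTypes: string[]; // predicate types that verified successfully
}

interface AdmissionPolicy {
  issuer: string;
  subjectPattern: RegExp;
  requiredAttestations: string[];
}

type Decision = { admit: true } | { admit: false; reason: string };

// Fail closed: every path that is not an explicit pass returns a denial.
function decide(artifact: VerifiedArtifact | undefined, policy: AdmissionPolicy): Decision {
  if (artifact === undefined) return { admit: false, reason: "no verification result" };
  if (!artifact.signatureValid) return { admit: false, reason: "invalid or missing signature" };
  if (artifact.issuer !== policy.issuer) return { admit: false, reason: "unexpected OIDC issuer" };
  if (!policy.subjectPattern.test(artifact.subject)) {
    return { admit: false, reason: "unexpected signer identity" };
  }
  const missing = policy.requiredAttestations.filter(
    (t) => artifact.attestationTypes.indexOf(t) === -1,
  );
  if (missing.length > 0) {
    return { admit: false, reason: `missing attestations: ${missing.join(", ")}` };
  }
  return { admit: true };
}

const examplePolicy: AdmissionPolicy = {
  issuer: "https://token.actions.githubusercontent.com",
  subjectPattern:
    /^https:\/\/github\.com\/example-org\/example-repo\/\.github\/workflows\/release\.yml@refs\/tags\/.+$/,
  requiredAttestations: ["cyclonedx", "slsaprovenance"],
};

const trustedArtifact: VerifiedArtifact = {
  signatureValid: true,
  issuer: "https://token.actions.githubusercontent.com",
  subject:
    "https://github.com/example-org/example-repo/.github/workflows/release.yml@refs/tags/v1.2.3",
  attestationTypes: ["cyclonedx", "slsaprovenance"],
};

console.log(decide(trustedArtifact, examplePolicy));
console.log(decide({ ...trustedArtifact, subject: "https://github.com/other/fork/.github/workflows/release.yml@refs/tags/v1.2.3" }, examplePolicy));
```

The second call shows the case the step stresses: a validly signed artifact from the wrong identity is denied, not admitted.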
+7. **Continuously scan the shipped SBOM for new CVEs.** Vulns are disclosed after you ship. Re-scan the stored SBOM on a schedule (not just at build) so a component clean yesterday flags today:
+   ```bash
+   osv-scanner scan --sbom sbom.cdx.json    # exits non-zero when vulnerabilities are found; or: grype sbom:sbom.cdx.json --fail-on high
+   ```
+   Use a VEX document to suppress not-exploitable findings deliberately — never by lowering the threshold.
+## Common Errors
+- **Pinning to a tag, not a digest.** `FROM python:3.12` / `cosign sign $IMAGE:latest` — a tag is mutable and can be repointed after you scan/sign. Always `@sha256:<digest>`.
+- **SBOM generated from source, attached to a binary built elsewhere.** It won't list the base-image OS packages that actually ship in the artifact. Scan the **built image at its digest**.
+- **`npm install` / unfrozen install in CI.** Mutates the lockfile and can pull an unpinned version, voiding the hashes. Use `npm ci` / `--require-hashes` / `--locked` / `--frozen-lockfile`.
+- **Signing without verifying identity on the consumer side.** `cosign verify` with no `--certificate-identity*` accepts a signature from *anyone* keyless. Always pin issuer + subject regexp.
+- **Treating the SBOM as the tamper control.** The SBOM only *describes*; integrity comes from **digest pins + hashes + signatures**. An accurate SBOM of a poisoned artifact is still poisoned.
+- **Lockfile without integrity hashes.** `go.sum` missing, `requirements.txt` without `--hash=`, a `package-lock.json` from `lockfileVersion:1`. Versions alone don't detect content swaps; require hashes.
+- **Internal package name resolvable on the public registry.** Classic dependency confusion — the public one wins by higher version. Scope it, reserve the name, and allowlist.
+- **Provenance that doesn't bind to a commit/builder.** Provenance with no `materials`/`buildDefinition` source ref proves nothing. It must name the exact commit SHA and builder.
+- **Policy in "audit"/warn mode forever.** A controller that logs violations but admits anyway is not enforcement. Flip to **deny / fail-closed** once green.
+- **Suppressing scanner findings by raising the severity threshold.** Hides real CVEs. Suppress specific not-exploitable CVEs via VEX with a reason, leave the threshold strict.
+- **Storing a long-lived cosign private key in CI secrets.** Defeats the point and creates a rotation burden. Use keyless OIDC; if a key is truly required, that's a secrets-management problem.
+## Verify
+1. **SBOM completeness:** `cyclonedx-cli validate --fail-on-errors` passes, and `jq` confirms zero components with null/empty `version`. Spot-check that a known base-image OS package (e.g. `glibc`) appears — proves it scanned the image, not just the manifest.
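+2. **Digest pinning:** `grep -rhE '^FROM ' Dockerfile* | grep -v '@sha256:'` returns nothing apart from `FROM scratch` and references to earlier build stages (every `FROM` of an external image carries `@sha256:<digest>`); the lockfile carries integrity hashes; frozen-install command is the one used in CI.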
+3. **Signature + attestation present:** `cosign verify --certificate-identity-regexp ... --certificate-oidc-issuer ...` exits `0` against the **digest**, and `cosign verify-attestation --type cyclonedx ...` returns the SBOM predicate. Tampering with one byte of the image flips both to non-zero.
+4. **Provenance binds source+builder:** the SLSA predicate's `buildDefinition`/`materials` names the exact commit SHA and the builder identity; `cosign verify-attestation --type slsaprovenance` exits `0`.
+5. **Enforcement is fail-closed:** deploy an image that is unsigned (or signed by a *different* identity) → the gate/admission controller **denies** it; the legitimately-signed image admits. A wrong-identity signature must be rejected, not just an unsigned one.
+6. **Confusion defense:** attempt to install an internal package name from the public registry → resolution fails or is blocked by the scoped registry/allowlist; a new unlisted top-level dep fails CI.
+7. **Continuous scan wired:** `osv-scanner scan --sbom ...` (which exits non-zero when it finds vulnerabilities) or `grype --fail-on high` runs on a schedule against the stored SBOM and breaks the job on a new CVE; any suppression is a VEX entry with a reason, not a lowered threshold.
+Done = SBOM (CycloneDX/SPDX) validates and is attested to the artifact; every base image and dependency is digest/hash-pinned with confusion defenses; the artifact and SBOM are keyless-signed with SLSA L2+ provenance binding source commit and builder; the consumption gate fails closed against unsigned and wrong-identity artifacts; and a scheduled scanner re-checks the SBOM for new CVEs.

package/skills/test-data-factories/SKILL.md ADDED

@@ -0,0 +1,158 @@
+---
+name: test-data-factories
+description: Generates realistic, maintainable test data with factories instead of brittle shared fixtures — factory libraries (Ruby factory_bot, Python factory_boy/model-bakery, PHP Foundry/Faker, JS Fishery/@mswjs/data/Fabbrica, Java instancio/Java-faker, Go fake) that build valid objects with sane defaults, Faker for realistic values, traits/transient params/variants for state, build vs create (in-memory vs persisted), sequences for unique fields, nested associations and object graphs without combinatorial fixtures, deterministic seeding (Faker.seed/locale pinning) for reproducible CI, idempotent upsert-based DB seeders for dev/E2E that re-run cleanly, and anonymized prod-like data via masking/synthesis — so every test declares only the fields it cares about and stays valid as the schema evolves.
+when_to_use: Building test data — replacing shared YAML/SQL fixtures, generating valid model instances for unit/integration/E2E tests, seeding a dev or E2E database idempotently, creating object graphs with associations, or producing anonymized prod-like datasets. Distinct from write-tests (structures the assertions and suite; this generates the inputs they assert on) and validate-data-quality (checks real datasets for nulls/dupes/outliers; this manufactures synthetic data on purpose).
+---
+## When to Use
+Reach for this skill when you need to manufacture valid, realistic test data and the pain is fixtures that rot or tests coupled to a giant shared dataset:
+- "Replace our `fixtures/*.yml` — every schema change breaks 200 tests"
+- "Give me a valid User/Order with just the 2 fields this test cares about"
+- "Build an order with 3 line items, a customer, and an address" (object graph)
+- "Seed the dev/E2E database so `db:seed` is safe to re-run"
+- "Tests pass locally, flake in CI" (non-deterministic random data)
+- "Generate realistic-but-fake names/emails/addresses" (Faker)
+- "Make a prod-like dataset for staging without leaking PII" (anonymize)
+NOT this skill:
+- Structuring the assertions, arrange/act/assert, mocking, coverage, test naming → write-tests (it organizes the suite that *consumes* the data this skill builds)
+- Checking a *real* dataset for nulls, dupes, outliers, schema drift → validate-data-quality (it inspects data you didn't generate; this fabricates data on purpose)
+- Profiling/exploring an unfamiliar dataset's distributions → profile-dataset
+- Generating *inputs to find bugs* by shrinking counterexamples → property-based-testing (it searches the input space; this hands you fixed, named, realistic instances)
+- Driving a browser through the app to set up E2E state via the UI/API → write-playwright-e2e (it may *call* this skill's seeder to plant rows directly)
+- Stabilizing a flaky test whose data was already non-deterministic → debug-flaky-tests (seed pinning here is one of its fixes)
+- Safe schema changes / running the migration the seeder targets → db-migration-safety
+- Designing the schema/associations themselves → design-relational-schema
+## Steps
+1. **Prefer factories over shared fixtures — fixtures are the anti-pattern you're replacing.** A global `users.yml`/seed SQL becomes load-bearing: tests depend on `user(:admin)` having exactly these fields, so any edit ripples across the suite and tests silently couple to unrelated data ("mystery guest"). Factories invert this: a `build(:user)` is valid by default, and each test **overrides only the attributes it asserts on**. Pick the idiomatic library:
+   | Stack | Library | Build vs persist |
+   |---|---|---|
+   | Ruby / Rails | **factory_bot** | `build(:user)` (RAM) · `create(:user)` (DB) · `build_stubbed` (no DB, fake id) · `attributes_for` (hash) |
+   | Python / Django | **factory_boy** (`DjangoModelFactory`) or **model-bakery** (`baker.make`/`baker.prepare`) | `.build()` vs `.create()` / `prepare` vs `make` |
+   | Python (plain) | **factory_boy** `Factory` + `faker` | `.build()` only |
+   | JS / TS | **Fishery** (`.build()`), **@mswjs/data**, **Fabbrica** (Prisma), `@faker-js/faker` | build returns object; persist via your ORM |
+   | PHP / Laravel | **Foundry** or Eloquent factories + **fakerphp/faker** | `Model::factory()->make()` vs `->create()` |
+   | Java / Kotlin | **instancio**, **easy-random**, **datafaker** (Java-faker successor) | POJO in memory |
+   | Go | `go-faker`/`gofakeit` + hand-rolled builder funcs | struct in memory |
+2. **Make the default object minimally valid; override per test.** The factory's defaults must pass all model validations on their own so `create(:user)` never fails for an unrelated reason. Then a test passes only what it cares about:
+   ```ruby
+   # factory_bot
+   factory :user do
+     name  { Faker::Name.name }
+     email { Faker::Internet.unique.email }   # unique → no collisions
+     role  { "member" }
+   end
+   # test asserts on role only:
+   admin = create(:user, role: "admin")       # name/email auto-filled, valid
+   ```
+   ```ts
+   // Fishery
+   const userFactory = Factory.define<User>(({ sequence }) => ({
+     id: sequence,                              // 1,2,3… unique per build
+     email: `user${sequence}@example.test`,
+     role: 'member',
+   }));
+   userFactory.build({ role: 'admin' });        // override one field
+   ```
+   Use the reserved **`example.test`/`example.com`** domains and the `+tag` trick for emails so generated data never hits a real inbox.
+3. **Use Faker for realism, but never for fields you assert on.** Faker gives plausible names/emails/addresses/companies so data looks real and surfaces formatting bugs. The discipline: **if a test checks a value, set it explicitly; let Faker fill the rest.** Asserting against a Faker-generated value is a guaranteed flake. Pin the locale (`Faker::Config.locale = :en` in Ruby; a locale-specific instance such as `fakerEN` in recent `@faker-js/faker`, which removed `faker.setLocale`) so address/phone formats are stable across machines. For columns under a UNIQUE constraint, use `unique` generators where the library has them (`Faker::Internet.unique.email`); `@faker-js/faker` deprecated and then removed `faker.helpers.unique`, so use a sequence or `faker.string.uuid()` there.
+4. **Sequences for unique/monotonic fields; transient params for build-time knobs that aren't attributes.** Sequences (`sequence(:email) { |n| "user#{n}@example.test" }`) guarantee uniqueness without a global mutable counter. **Transient params** are inputs that shape the build but aren't columns:
+   ```ruby
+   factory :order do
+     transient { line_item_count { 3 } }       # not a column
+     after(:create) do |order, ev|
+       create_list(:line_item, ev.line_item_count, order: order)
+     end
+   end
+   create(:order, line_item_count: 5)
+   ```
+   factory_boy calls these `Params`/`class Params: ...` with `factory.Trait`; Foundry uses `->with()`/states. Reach for them whenever a test wants "an order with N items" without N being an order column.
+5. **Model variants with traits, not a forest of sub-factories.** A trait is a named bundle of attribute overrides; compose several in one call. This beats `factory :admin_user`, `factory :suspended_admin_user`, … which explodes combinatorially:
+   ```ruby
+   factory :user do
+     trait(:admin)     { role { "admin" } }
+     trait(:suspended) { suspended_at { Time.current } }
+     factory(:premium) { plan { "premium" } }   # nested factory for a true subtype; parentheses are required before a brace block
+   end
+   create(:user, :admin, :suspended)            # compose traits
+   ```
+   factory_boy → `class Params:` + `factory.Trait`; Fishery → `.params()` plus `transientParams`, or named factory variants; model-bakery → recipes. **One base factory + traits** is the maintainable shape; deep factory inheritance is not.
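As an illustration of "one base factory + traits" (not part of the packaged SKILL.md), here is a dependency-free TypeScript sketch. `defineFactory`, the `User` shape, and the trait names are hypothetical and do not correspond to any library's API. Traits here are static partial objects, which is a simplification: factory_bot evaluates trait attributes lazily per build.

```typescript
interface User {
  id: number;
  email: string;
  role: "member" | "admin";
  suspendedAt: Date | null;
}

// A factory: defaults from a per-factory sequence, then traits, then overrides.
function defineFactory<T, K extends string>(
  defaults: (sequence: number) => T,
  traits: Record<K, Partial<T>>,
) {
  let sequence = 0; // scoped to this factory, not a global counter
  return {
    build(traitNames: K[] = [], overrides: Partial<T> = {}): T {
      sequence += 1;
      let result: T = defaults(sequence);
      for (const name of traitNames) {
        result = { ...result, ...traits[name] }; // later traits win
      }
      return { ...result, ...overrides }; // explicit overrides win over traits
    },
  };
}

const userFactory = defineFactory<User, "admin" | "suspended">(
  (n) => ({ id: n, email: `user${n}@example.test`, role: "member", suspendedAt: null }),
  {
    admin: { role: "admin" },
    suspended: { suspendedAt: new Date(0) }, // fixed timestamp keeps builds deterministic
  },
);

const plainUser = userFactory.build();
const suspendedAdmin = userFactory.build(["admin", "suspended"]);
const namedUser = userFactory.build([], { email: "owner@example.test" });

console.log(plainUser, suspendedAdmin, namedUser);
```

Two traits cover four combinations here; the sub-factory approach would need a separate named factory for each.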
+6. **Build object graphs through associations — don't hand-wire foreign keys.** Declare relations so the factory creates the whole graph and back-references resolve automatically:
+   | Lib | Association syntax |
+   |---|---|
+   | factory_bot | `association :customer` · `customer { create(:customer) }` · `create_list(:item, 3, order:)` |
+   | factory_boy | `customer = factory.SubFactory(CustomerFactory)` · `factory.RelatedFactory` (reverse) · `factory.List` |
+   | Fishery | `customer: customerFactory.build()` inside the generator, or `associations` param |
+   | Foundry | `CustomerFactory::new()` as an attribute; `->many(3)` for collections |
+   **Build the minimal graph the test needs** — pulling in 4 levels of associations slows every test and recreates the fixture problem. Use `build`/`build_stubbed` (no DB) for pure-logic tests; reserve `create` for tests that actually query the DB. Guard against accidental N× object creation in `before` hooks.
+7. **Seed dev/E2E databases idempotently — upsert, never blind insert.** A seed script that `INSERT`s breaks on the second run (unique-constraint violation) and corrupts state. Make it re-runnable:
+   ```ruby
+   # Rails: find_or_create_by / upsert on a natural key
+   User.find_or_create_by!(email: "demo@example.test") { |u| u.name = "Demo" }
+   ```
+   ```sql
+   INSERT INTO plans (code, name) VALUES ('pro','Pro')
+   ON CONFLICT (code) DO UPDATE SET name = EXCLUDED.name;   -- idempotent
+   ```
+   Rules: key every seed row on a **stable natural key** (slug/code/email), not an auto-id; wrap the whole seed in a transaction; make `db:seed` / `prisma db seed` / `php artisan db:seed` safe to run N times with identical end state. Separate **dev seed** (rich demo data, may be large) from **E2E/test seed** (minimal, deterministic, reset between runs via truncate or transactional rollback). For E2E, prefer planting rows via the factory/seeder directly over driving the UI — orders of magnitude faster and less flaky.
+8. **Make data deterministic in CI — pin the RNG seed and locale.** Random factory data that flakes is worse than fixtures. Pin a seed so a failing CI run is reproducible:
+   | Tool | Seed control |
+   |---|---|
+   | Faker (Ruby) | `Faker::Config.random = Random.new(RSEED)` |
+   | @faker-js/faker | `faker.seed(12345)` (per-suite, in a `beforeEach`/global setup) |
+   | faker (Python) | `Faker.seed(0)` / `fake.seed_instance(0)` |
+   | factory_boy | `factory.random.reseed_random('seed')` |
+   | RSpec / Jest | `--seed`/`config.seed`; Jest `--testSequencer` + faker seed |
+   Print the seed on every run and let CI re-run with a fixed seed to reproduce a failure. **Pin the locale too** — default-locale drift changes address/phone formats and breaks format-sensitive assertions. Don't share mutable factory state across parallel test workers; each worker reseeds.
+9. **For prod-like datasets, anonymize — synthesize or mask, never copy raw PII.** Staging/perf data that mirrors prod shape without leaking real people: (a) **synthesize** from factories at volume (`create_list(:user, 100_000)` / a Faker loop) when you only need realistic shape; (b) **mask/anonymize** a prod snapshot when you need real distributions — replace names/emails/SSNs with Faker values, **deterministically** (hash the original → same fake every time, so foreign keys stay consistent), null or tokenize free-text, and shift dates by a constant offset. Tools: pg `anon` extension, `pganonymize`, Snaplet/`@snaplet/seed`, `faker` + a mapping table. Never load an un-anonymized prod dump into a lower environment — that's a PII breach. Keep the anonymization mapping out of the lower environment.
+10. **Keep factories close to the code and validated.** Co-locate factories with the test suite (`spec/factories`, `test/factories`, `src/test-utils/factories`), auto-load them, and add a CI lint that **`build`s every factory and asserts it's valid** (factory_bot's `FactoryBot.lint`) so a schema/validation change that breaks a factory fails fast instead of in 50 unrelated tests. When a column is added with a NOT NULL/validation, fix it in the **one** factory, not across the suite.
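Step 9's deterministic masking can be sketched in Python with a keyed hash. This is a minimal illustration, not a drop-in anonymizer: `MASKING_KEY`, `pseudonym`, and `mask_email` are hypothetical names, and the keyed HMAC is one assumed way to realise "hash the original, get the same fake every time" so that references between tables stay consistent.

```python
import hashlib
import hmac

# Hypothetical secret for illustration; in practice it would be loaded from a
# secret store and never copied into the lower environment.
MASKING_KEY = b"example-only-key"


def pseudonym(value: str, key: bytes = MASKING_KEY) -> str:
    """Map a real value to a stable token: same input and key give the same output."""
    normalized = value.strip().lower().encode("utf-8")
    return hmac.new(key, normalized, hashlib.sha256).hexdigest()[:12]


def mask_email(email: str) -> str:
    """Replace a real address with a fake one on a reserved test domain."""
    return f"user-{pseudonym(email)}@example.test"


# Two tables that reference the same person stay joinable after masking.
users = [{"email": "Alice@Corp.com"}]
orders = [{"customer_email": "alice@corp.com", "total": 42}]

masked_users = [{"email": mask_email(u["email"])} for u in users]
masked_orders = [
    {**o, "customer_email": mask_email(o["customer_email"])} for o in orders
]
```

Because the fake value is derived from the key rather than stored, keeping the key out of staging serves the same purpose as keeping a mapping table out of it.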
+## Common Errors
+- **Shared fixtures as source of truth.** One `users.yml` every test secretly depends on → mystery-guest coupling, schema edits break the world. Fix: factories with per-test overrides; delete the global fixture.
+- **Asserting against a Faker-generated value.** `expect(user.name).to eq(faker_name)` flakes the moment the seed changes. Fix: set asserted fields explicitly; Faker only fills don't-care fields.
+- **Non-deterministic data in CI with no seed.** Intermittent failures no one can reproduce. Fix: pin `faker.seed`/`Faker.seed` and the locale; print and replay the seed.
+- **`create` everywhere when `build` would do.** Hitting the DB (and its associations) for pure-logic tests makes the suite slow and order-dependent. Fix: `build`/`build_stubbed`/`prepare` for in-memory; `create` only when you query.
+- **Non-idempotent seed script.** Blind `INSERT` → second run violates UNIQUE / duplicates rows. Fix: `find_or_create_by` / `ON CONFLICT DO UPDATE` on a natural key; wrap in a transaction.
+- **Sub-factory explosion.** `admin_user`, `suspended_admin_user`, `premium_suspended_admin_user`… Fix: one base factory + composable traits.
+- **Over-deep association graphs.** Every `create` drags in 4 levels of records → slow tests and re-coupled data. Fix: build the minimal graph; stub the rest.
+- **Duplicate-key collisions from static defaults.** A hardcoded `email: "a@b.com"` default fails the second `create`. Fix: sequences or `Faker::Internet.unique.email` (or `faker.string.uuid()` in @faker-js/faker, which dropped `unique`).
+- **Loading raw prod data into staging.** Real PII in a lower environment = breach. Fix: deterministic anonymization/masking or synthesize; keep the mapping out of staging.
+- **Locale drift.** Default Faker locale differs by machine/CI → address/phone format assertions break. Fix: pin the locale explicitly.
+- **Factories that drift from the schema.** A new NOT NULL column makes every `create` fail cryptically. Fix: `FactoryBot.lint`-style CI check that builds every factory.
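Two of the fixes above (sequences instead of static defaults, and one base factory plus composable traits) can be shown without any factory library. The sketch below is plain Python under stated assumptions: `build_user`, `TRAITS`, and the field names are hypothetical and only mirror the shape of the factory_bot examples.

```python
import itertools

# Per-process counter, playing the role of sequence(:email).
_email_seq = itertools.count(1)

# Each trait is a named bundle of attribute overrides.
TRAITS = {
    "admin": {"role": "admin"},
    "suspended": {"suspended": True},
}


def build_user(*traits: str, **overrides) -> dict:
    """Return a valid-by-default user; traits, then explicit overrides, layer on top."""
    n = next(_email_seq)
    user = {"email": f"user{n}@example.test", "role": "member", "suspended": False}
    for name in traits:
        user.update(TRAITS[name])  # an unknown trait name raises KeyError
    user.update(overrides)
    return user


# 100 builds never collide on the unique field.
batch = [build_user() for _ in range(100)]

# Traits compose in one call; the test sets only the field it asserts on.
admin = build_user("admin", "suspended", email="boss@example.test")
```

Explicit overrides are applied last so that a test's own values always win over trait defaults.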
+## Verify
+1. **Factories are valid standalone:** run the lint (`FactoryBot.lint` / build-every-factory test) — every factory and trait `build`s and passes validations with zero overrides.
+2. **No fixture coupling:** grep the suite for the old shared fixture references; a test reads only the fields it sets/asserts, and editing an unrelated factory attribute breaks nothing.
+3. **Determinism:** run the suite twice with the same pinned seed → identical data and pass/fail; run with two different seeds → still green (no test asserts a Faker value).
+4. **Uniqueness holds:** create N rows from a factory with a UNIQUE column in a loop → no constraint violation (sequence/`unique`/uuid working).
+5. **Seed is idempotent:** run `db:seed` twice → identical row count and end state, no UNIQUE error; the second run is a no-op or clean upsert.
+6. **build vs create honored:** `build`/`build_stubbed` issues zero SQL INSERTs (assert via query log/`assert_no_queries`); `create` persists exactly the intended graph.
+7. **Traits compose:** `create(:user, :admin, :suspended)` yields both states; no combinatorial sub-factory needed.
+8. **Object graph is minimal and correct:** an order factory creates exactly its declared associations (customer + N items), foreign keys resolve, and no surprise extra records appear.
+9. **Anonymized data is safe:** spot-check the prod-like dataset — no real PII, the masking is deterministic (same input → same fake, FKs consistent), and date/format distributions are realistic.
+Done = brittle shared fixtures are gone, each test declares only the fields it cares about against a valid-by-default factory, Faker fills the rest with a pinned seed+locale so CI is reproducible, object graphs and traits compose without sub-factory explosion, the dev/E2E seed is idempotent on a natural key, and any prod-like data is deterministically anonymized — all proven by the factory lint, the twice-with-same-seed run, and the double-seed idempotence check.
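The double-seed idempotence check (Verify item 5) can be rehearsed with Python's built-in `sqlite3`. This is a sketch under assumptions: the `plans` table and the `PLANS` rows are illustrative, SQLite stands in for whatever database the project uses, and the upsert syntax requires SQLite 3.24 or newer.

```python
import sqlite3

# Hypothetical seed rows, keyed on the natural key `code`.
PLANS = [("free", "Free"), ("pro", "Pro")]


def seed(conn: sqlite3.Connection) -> None:
    """Upsert every seed row inside one transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.executemany(
            "INSERT INTO plans (code, name) VALUES (?, ?) "
            "ON CONFLICT (code) DO UPDATE SET name = excluded.name",
            PLANS,
        )


def snapshot(conn: sqlite3.Connection) -> list:
    """Return the full table in a stable order for comparison."""
    return conn.execute("SELECT code, name FROM plans ORDER BY code").fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plans (code TEXT PRIMARY KEY, name TEXT NOT NULL)")

seed(conn)
first = snapshot(conn)
seed(conn)  # the second run should be a clean upsert, not a UNIQUE error
second = snapshot(conn)
```

Comparing `first` and `second` checks end state as well as row count, which catches a seed that silently rewrites values on each run.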

package/skills/threat-model-stride/SKILL.md ADDED

@@ -0,0 +1,123 @@
+---
+name: threat-model-stride
+description: Produces a design-level STRIDE threat model — decomposes the architecture into a data-flow diagram with trust boundaries, enumerates threats per element, rates them by likelihood × impact, and records mitigations and signed-off residual risk. Use before building or substantially changing a system that handles untrusted input, secrets, money, or PII.
+when_to_use: A new service, public API, auth flow, multi-tenant boundary, or agent/tool surface is being designed and you need "what could go wrong here?" answered before code exists. Distinct from security-review (audits an already-written diff line by line) and write-rfc (proposes the design itself).
+---
+## When to Use
+Reach for this when the question is **what an adversary could do to a design**, before the design is built:
+- "Threat model this new payments/checkout service"
+- "We're adding a multi-tenant boundary — where can one tenant reach another's data?"
+- "New public API / webhook ingress / agent tool surface — enumerate the attack surface"
+- "The security RFC needs a threats section and a residual-risk register"
+- "What trust boundaries does this auth flow cross, and what crosses each one?"
+NOT this skill:
+- Auditing already-written code for injection/SSRF/secrets line by line → security-review (this skill works on a diagram, not a diff)
+- Writing the design/proposal itself (motivation, alternatives, rollout) → write-rfc (threat-model is one section feeding it)
+- Implementing the login/JWT/session controls a threat surfaces → auth-jwt-session
+- Storing/rotating the secrets a threat targets → secrets-management
+- Hardening one webhook endpoint's signature/replay handling → ingest-webhook-secure
+- Responding to an attack happening **now** → incident-response-sre
+## Steps
+1. **Define scope, assets, and adversaries first — never enumerate threats against an unbounded system.** Write three lists before drawing anything:
+   - **Assets** — what you protect, by category: *confidentiality* (PII, secrets, tokens), *integrity* (balances, order state, audit log), *availability* (checkout, login). Name the concrete data, not "the database."
+   - **Adversaries** — pick from this set and state each one's starting position:
+     | Adversary | Starts with | Typically drives |
+     |---|---|---|
+     | Anonymous internet | Network reachability only | S, D, I (info disclosure via errors) |
+     | Authenticated user | Valid session, own tenant | E (priv-esc), tenant-boundary I/T |
+     | Malicious tenant (multi-tenant) | Valid account, own data | Cross-tenant I (read), T (write) |
+     | Insider / operator | Prod console, some creds | R (repudiation), I, E |
+     | Compromised dependency | Code execution in one process | S, T, E across the process boundary |
+     | Stolen credential / token | One leaked secret | S, blast-radius of that secret |
+   - **In/out of scope** — explicitly list what you will NOT model (e.g. "physical datacenter security: out; we trust the cloud provider's hypervisor"). Unstated scope = infinite scope.
+2. **Draw the data-flow diagram as validated Mermaid — four element types plus boundaries, no more.** DFD elements: **External Entity** (square — user, third-party API), **Process** (round — your service/lambda), **Data Store** (cylinder — DB, queue, bucket, cache), **Data Flow** (arrow, labeled with protocol + what data). A **trust boundary** is a dashed box crossing one or more flows where the privilege/trust level changes. The four boundaries to always look for: **network edge** (internet → DMZ), **authz** (unauthenticated → authenticated), **tenant** (tenant A → shared/tenant B), **process** (your code → third-party/dependency code).
+   ```mermaid
+   flowchart LR
+     subgraph edge["Network edge — untrusted"]
+       user["Browser (external entity)"]
+     end
+     subgraph trusted["Authenticated · single-tenant"]
+       api("API service")
+       worker("Async worker")
+     end
+     db[("Orders DB")]
+     pay["Stripe API (external entity)"]
+     user -->|"HTTPS · login creds"| api
+     api -->|"SQL · tenant_id scoped"| db
+     api -->|"enqueue · job payload"| worker
+     worker -->|"HTTPS · card token"| pay
+   ```
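+   Shapes match the legend above: `[ ]` square = external entity, `( )` round = process, `[( )]` cylinder = data store, `-->|label|` = data flow. Each `subgraph` is a trust boundary; Mermaid draws subgraphs with a solid border by default, so add a line such as `style trusted stroke-dasharray: 5 5` per subgraph if you want the conventional dashed box. `db` and `pay` sit outside both boundaries on purpose: the flows that reach them cross a boundary, and that crossing is where trust changes.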
+   Validate it before continuing: `npx -y @mermaid-js/mermaid-cli -i model.mmd -o model.svg` (or paste into mermaid.live). A diagram that doesn't render isn't a deliverable. If the system is large, model **one boundary-crossing flow per diagram** rather than one unreadable mega-graph.
+3. **Walk every flow that crosses a boundary and apply STRIDE per element.** Do not brainstorm freely — march the checklist. STRIDE maps to the property each threat violates:
+   | Letter | Threat | Violates | Ask at this element |
+   |---|---|---|---|
+   | **S** | Spoofing | Authentication | Can the caller forge who they are? (no/weak auth, replayable token) |
+   | **T** | Tampering | Integrity | Can data in transit or at rest be altered? (no TLS, no signature, mutable audit log) |
+   | **R** | Repudiation | Non-repudiation | Can an actor deny an action? (no/forgeable logs, shared accounts) |
+   | **I** | Information disclosure | Confidentiality | Can data leak? (verbose errors, missing authz check, IDOR, unencrypted store) |
+   | **D** | Denial of service | Availability | Can it be exhausted? (unbounded input, no rate limit, amplification) |
+   | **E** | Elevation of privilege | Authorization | Can a lower-privilege actor gain higher rights? (missing tenant scope, broken RBAC, injection → RCE) |
+   Apply the elements-affected rule to save time: **External entities** → S, R. **Processes** → all six. **Data stores** → T, I, D (and R if it's the audit log). **Data flows** → T, I, D. Record each threat as one row: `<element> | <STRIDE letter> | <concrete attack> | <adversary from step 1>`. Concrete means "authenticated user changes `tenant_id` in the path param and reads another tenant's orders (I via IDOR)", not "data could leak."
+4. **Rate each threat likelihood × impact, then rank.** Use a 3×3 so disagreements are cheap:
+   | | Impact: Low | Impact: Med | Impact: High |
+   |---|---|---|---|
+   | **Likelihood: High** | Medium | High | **Critical** |
+   | **Likelihood: Med** | Low | Medium | High |
+   | **Likelihood: Low** | Low | Low | Medium |
+   Likelihood = how exposed + how easy (anonymous-reachable + no skill = High; insider-only + needs prod creds = Low). Impact = blast radius on the asset (all-tenant PII dump = High; one user's display name = Low). Sort the threat table by rating, Critical first. Rate on the controls that *exist today*, never on ones you plan to build — planned controls earn their reduction in the disposition step (5), not here.
+5. **Disposition every threat — exactly one of four, no threat left unrated or "noted."**
+   - **Mitigate** — name the *specific* control (e.g. "scope every query by `tenant_id` from the session, never from the request; add a row-level-security policy as defense-in-depth"). A mitigation without a named control is not a mitigation.
+   - **Eliminate** — remove the feature/flow/data that creates the threat (don't store the PAN; tokenize at the edge so the card number never enters scope).
+   - **Transfer** — push to a party who owns it (offload card storage to a PCI-compliant processor; buy insurance). Note who now owns it.
+   - **Accept** — only with a named sign-off and an expiry. An accepted risk needs an owner, a date, and a re-review trigger; otherwise it's a silent gap.
+   Default bias: **Critical/High must be mitigated or eliminated before ship.** Medium may be accepted with sign-off. Low may be accepted by the team lead.
+6. **Map each mitigation to a real engineering task and link existing controls.** For every "mitigate," produce a tracked task (`SEC-123: enforce tenant_id from session in OrderRepository`) and point at the control that delivers it — frequently a sibling skill: rate limiting (D) → rate-limiting; auth/RBAC/IDOR (S/E) → auth-jwt-session; secret handling (S, I) → secrets-management; webhook signature/replay (S/T) → ingest-webhook-secure. Mark which controls **already exist** (TLS everywhere, WAF) vs **must be built** so the model doubles as a backlog.
+7. **Emit the deliverables — the model is the artifact, not the conversation.** Write a `threat-model.md` containing: scope/assets/adversaries (step 1), the validated DFD (step 2), the rated threat table with dispositions (steps 3–5), an explicit **abuse-cases** list (the top attacker stories: "as a malicious tenant I enumerate IDs to read others' invoices"), and a **residual-risk register** (every Accept row: threat, rating, owner, sign-off, expiry). Finish with **re-model triggers** — the events that invalidate this model and require a redo (new trust boundary, new external integration, auth model change, new class of PII, major arch change). A threat model with no expiry condition rots silently.
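The step-4 rating matrix is small enough to encode as a lookup table, which keeps ratings consistent across reviewers. The sketch below is one possible encoding in Python; `RATING`, `rate`, `rank`, and the sample threats are hypothetical, and the cell values are copied from the 3×3 table in step 4.

```python
# RATING[likelihood][impact], transcribed from the step-4 matrix.
RATING = {
    "High": {"Low": "Medium", "Med": "High", "High": "Critical"},
    "Med": {"Low": "Low", "Med": "Medium", "High": "High"},
    "Low": {"Low": "Low", "Med": "Low", "High": "Medium"},
}
SORT_ORDER = {"Critical": 0, "High": 1, "Medium": 2, "Low": 3}


def rate(likelihood: str, impact: str) -> str:
    """Look up the rating for one threat; unknown levels raise KeyError."""
    return RATING[likelihood][impact]


def rank(threats: list) -> list:
    """Attach a rating to each threat and sort the table Critical-first."""
    rated = [{**t, "rating": rate(t["likelihood"], t["impact"])} for t in threats]
    return sorted(rated, key=lambda t: SORT_ORDER[t["rating"]])


# Illustrative rows only; real rows come from the step-3 threat table.
threats = [
    {"attack": "owner edits own display name", "likelihood": "Low", "impact": "Low"},
    {"attack": "IDOR via tenant_id path param", "likelihood": "High", "impact": "High"},
    {"attack": "verbose error leaks stack trace", "likelihood": "High", "impact": "Low"},
]
ranked = rank(threats)
```

`sorted` is stable, so threats with the same rating keep the order in which they were enumerated.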
+## Common Errors
+- **Listing threats with no diagram.** Without the DFD you miss the boundary-crossing flows that produce the real threats. Draw and validate the diagram first; enumerate per element second.
+- **Missing trust boundaries entirely (or only drawing the network edge).** The expensive bugs live at the *authz* and *tenant* boundaries, not the firewall. Every place trust level changes gets a dashed box.
+- **Vague threats: "data could be leaked."** Unactionable and unrateable. Write the concrete path: which actor, which element, which parameter, which STRIDE letter.
+- **Skipping STRIDE letters because they "feel unlikely."** That's what rating is for — enumerate all applicable letters per element, then let likelihood × impact triage. Skipping at enumeration time hides the threat; skipping at rating time is a defensible decision.
+- **Accepting risk with no owner/expiry.** "We'll accept that" in a meeting evaporates. An accepted risk is only accepted when it's in the register with a name, a date, and a re-review trigger.
+- **Modeling the whole company.** Scope creep makes the model useless. Bound it to the one service/flow/change and explicitly list what's out of scope (step 1).
+- **Rating on aspirational controls.** Rating a threat "Low" because of a mitigation you *plan* to build inflates safety. Rate on what exists today; the disposition step is where planned controls earn their reduction.
+- **Mitigation = "add validation" / "we'll be careful."** Not a control. Name the mechanism (RLS policy, signed token with `aud` check, allowlist, rate limiter) and the task that builds it.
+- **Treating it as one-and-done.** A model with no re-model triggers is stale the next time a boundary moves. List the triggers that force a redo.
+- **Confusing this with a code audit.** STRIDE on a diagram finds design flaws (missing boundary, IDOR by design); it will not find a SQL-injection typo in line 88 — that's security-review on the diff.
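One way to avoid skipping letters is to generate the checklist mechanically from the elements-affected rule in step 3 and then fill in each row with a threat or an explicit "N/A". The Python sketch below assumes a simple element format (`name`, `kind`, optional `audit_log` flag); those field names and the "Audit log" store are hypothetical additions for illustration.

```python
STRIDE = "STRIDE"  # iteration order: S, T, R, I, D, E

# Which letters apply to which DFD element type (step 3's rule).
APPLICABLE = {
    "external_entity": {"S", "R"},
    "process": set(STRIDE),
    "data_store": {"T", "I", "D"},
    "data_flow": {"T", "I", "D"},
}


def checklist(elements: list) -> list:
    """Return (element name, STRIDE letter) pairs that must each get a row."""
    rows = []
    for el in elements:
        letters = set(APPLICABLE[el["kind"]])
        if el["kind"] == "data_store" and el.get("audit_log"):
            letters.add("R")  # repudiation applies when the store is the audit log
        rows.extend((el["name"], letter) for letter in STRIDE if letter in letters)
    return rows


elements = [
    {"name": "Browser", "kind": "external_entity"},
    {"name": "API service", "kind": "process"},
    {"name": "Orders DB", "kind": "data_store"},
    {"name": "API -> Orders DB", "kind": "data_flow"},
    {"name": "Audit log", "kind": "data_store", "audit_log": True},
]
rows = checklist(elements)
```

Each generated pair is a prompt, not a finding: the reviewer still has to write the concrete attack or record why the letter does not apply.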
+## Verify
+1. **Diagram renders:** `npx -y @mermaid-js/mermaid-cli -i model.mmd -o model.svg` exits 0 and the SVG shows every external entity, process, store, flow, and at least one trust-boundary box.
+2. **Boundary coverage:** Every flow that crosses a trust boundary has at least one threat row. Pick any boundary-crossing arrow at random — it must appear in the threat table.
+3. **STRIDE coverage:** Each process element was evaluated against all six letters (each letter either has a threat row or an explicit "N/A — why"); stores and flows covered for T/I/D.
+4. **Every threat is concrete and rated:** No row reads "data could leak"; each names actor + element + attack, carries a likelihood × impact rating, and is sorted Critical-first.
+5. **Every threat is dispositioned:** Each row is exactly one of mitigate / eliminate / transfer / accept — zero "noted" or blank. Mitigations name a control and link a tracked task.
+6. **Residual register complete:** Every Accept appears in the residual-risk register with owner, sign-off, and expiry. No Critical/High is in Accept without explicit named sign-off.
+7. **Abuse cases + re-model triggers present:** The doc lists the top attacker stories and the events that invalidate the model.
+Done = the Mermaid DFD renders with explicit trust boundaries, every boundary-crossing flow has STRIDE-enumerated threats that are each rated and dispositioned, no Critical/High sits in Accept without named sign-off, and the doc ships a residual-risk register plus re-model triggers.
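Verify items 5 and 6 lend themselves to an automated lint over the threat table. The following Python sketch assumes each threat is a dict with hypothetical keys (`id`, `disposition`, `control`, `task`, `owner`, `sign_off`, `expiry`); a real model would adapt the names to its own table format.

```python
from datetime import date

DISPOSITIONS = {"mitigate", "eliminate", "transfer", "accept"}


def lint_threats(rows: list, today: date) -> list:
    """Return human-readable problems; an empty list means the table passes."""
    problems = []
    for row in rows:
        tid, disp = row["id"], row.get("disposition")
        if disp not in DISPOSITIONS:
            problems.append(f"{tid}: disposition {disp!r} is not one of the four")
        elif disp == "mitigate" and not (row.get("control") and row.get("task")):
            problems.append(f"{tid}: mitigation needs a named control and a tracked task")
        elif disp == "accept":
            missing = [k for k in ("owner", "sign_off", "expiry") if not row.get(k)]
            if missing:
                problems.append(f"{tid}: accepted risk is missing {', '.join(missing)}")
            elif row["expiry"] <= today:
                problems.append(f"{tid}: acceptance expired on {row['expiry']}")
    return problems


# Illustrative rows: T2 is merely "noted", T3 is accepted without sign-off or expiry.
rows = [
    {"id": "T1", "disposition": "mitigate", "control": "RLS policy", "task": "SEC-123"},
    {"id": "T2", "disposition": "noted"},
    {"id": "T3", "disposition": "accept", "owner": "team lead"},
    {"id": "T4", "disposition": "accept", "owner": "CISO", "sign_off": "CISO",
     "expiry": date(2030, 1, 1)},
]
problems = lint_threats(rows, today=date(2025, 1, 1))
```

Passing `today` explicitly keeps the check deterministic, so the same table gives the same result in CI and on a laptop.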

package/skills/train-evaluate-ml-model/SKILL.md ADDED

@@ -0,0 +1,109 @@
+---
+name: train-evaluate-ml-model
+description: Trains and evaluates a classic (non-LLM) ML model — business-aligned metric selection, leakage-safe train/validation/test splits, Pipeline-scoped feature engineering, baseline-first model selection, cross-validated hyperparameter tuning, bias/variance diagnosis, and experiment tracking — guarding against data leakage and overfitting.
+when_to_use: Fitting and validating a classification, regression, ranking, forecasting, or clustering model on tabular/feature data. Distinct from profile-dataset (EDA only), wrangle-tabular-data (cleaning/shaping the feature table), serve-deploy-ml-model (deployment), monitor-ml-drift (post-deploy), and rag-pipeline/prompt-engineering (LLM work).
+---
+## When to Use
+Reach for this skill when the request is to **fit and validate a model that predicts**, not to explore or clean data:
+- "Train a classifier to predict churn / fraud / default and tell me if it's good"
+- "Build a regression/forecasting model for demand/price and pick the best one"
+- "My model gets 99% accuracy — is that real or leakage?"
+- "Tune hyperparameters / cross-validate / compare XGBoost vs logistic regression"
+- "Cluster these customers into segments"
+NOT this skill:
+- Summary stats, distributions, correlations, missingness *before* modeling → profile-dataset
+- Cleaning, type coercion, joins, dedup, resampling to build the feature table → wrangle-tabular-data
+- Asserting schema/range/null contracts on the data before training → validate-data-quality
+- Packaging the trained model behind an API / batch job → serve-deploy-ml-model
+- Watching a live model's inputs/outputs degrade over time → monitor-ml-drift
+- Scoring or grading an LLM's outputs against a rubric/golden set → llm-eval-harness
+- Getting an LLM to answer over a corpus, or designing prompts → rag-pipeline / prompt-engineering
+## Steps
+1. **Choose the metric from the business cost FIRST — never optimize bare accuracy on imbalanced data.** A 1%-positive fraud set scores 99% accuracy by predicting all-negative and catches zero fraud. Pick before you train:
+   | Task / situation | Metric | Why |
+   |---|---|---|
+   | Imbalanced classification, cost of missing a positive high | Recall @ fixed precision, or **PR-AUC** | accuracy & ROC-AUC look great while recall is ~0 |
+   | Imbalanced, ranking/threshold-free comparison | **PR-AUC** (not ROC-AUC) | ROC-AUC is optimistic under heavy imbalance |
+   | Balanced classification | F1 or ROC-AUC | symmetric cost |
+   | Asymmetric FP vs FN cost | Expected cost = `cFP·FP + cFN·FN`, tune threshold | maps directly to money |
+   | Regression, outliers matter | RMSE | penalizes large errors |
+   | Regression, robust to outliers | MAE / MAPE | business reads "off by X" |
+   | Ranking / recsys | NDCG@k, MAP@k | position-aware |
+   | Forecasting | MASE / sMAPE vs naive | scale-free, must beat seasonal-naive |
+   | Clustering (no labels) | silhouette + downstream business check | inertia alone is meaningless |
+   Fix the **decision threshold** from the cost matrix later — don't ship the default 0.5.
+2. **Split BEFORE any feature engineering or fitting — this is the #1 leakage source.** Three sets: train / validation (or CV folds) / **held-out test touched once at the very end**. The split strategy is not optional — pick by data structure:
+   - **Random stratified** for i.i.d. rows: `train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)`.
+   - **Time-based** for any temporal data — train on past, test on future, never shuffle. Use `TimeSeriesSplit` for CV. A random split on time-series leaks the future and inflates every metric.
+   - **Group split** when rows share an entity (same user/patient/device across rows): `GroupKFold` / `StratifiedGroupKFold` so the same group never appears in both train and test.
+3. **Engineer features inside a `Pipeline` fit only on train, then run a leakage audit.** Every transform that *learns* (imputation means, scaler stats, target/one-hot encoders, feature selection) must `.fit` on train and only `.transform` val/test — otherwise test statistics leak in. Wrap it:
+   ```python
+   from sklearn.pipeline import Pipeline
+   from sklearn.compose import ColumnTransformer
+   from sklearn.impute import SimpleImputer
+   from sklearn.preprocessing import StandardScaler, OneHotEncoder
+   from sklearn.model_selection import cross_val_score
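+   # Assumed to exist already: num_cols / cat_cols (lists of column names), model (any estimator), cv (the splitter chosen in step 2)
+   pre = ColumnTransformer([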
+       ("num", Pipeline([("imp", SimpleImputer(strategy="median")),
+                         ("sc", StandardScaler())]), num_cols),
+       ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
+   ])
+   pipe = Pipeline([("pre", pre), ("model", model)])
+   # CV refits `pre` per fold → no leakage across folds
+   cross_val_score(pipe, X_train, y_train, cv=cv, scoring="average_precision")
+   ```
+   **Leakage audit checklist** — a feature is leaky if it: (a) is derived from the target or post-outcome (e.g. `payment_received` predicting `will_pay`); (b) encodes future information unavailable at prediction time; (c) is an ID/timestamp that proxies the label; (d) was computed using full-dataset statistics before the split. If any single feature gives a near-perfect score, it's leakage, not skill.
+4. **Establish a dumb baseline before any real model.** `DummyClassifier(strategy="most_frequent")` / `DummyRegressor(strategy="mean")`, then a linear baseline (`LogisticRegression` / `Ridge`). This is the bar every model must clear; a fancy model that barely beats majority-class isn't worth the complexity.
+5. **For tabular data, reach for gradient boosting before deep nets.** Default order: linear baseline → **gradient boosting** (`XGBoost` / `LightGBM` / `HistGradientBoostingClassifier`) → deep net only if GBM plateaus and you have ample data. GBMs win on heterogeneous tabular data, need little preprocessing, and train in minutes. Handle imbalance with `class_weight` / `scale_pos_weight` or threshold tuning — not blind SMOTE (and if you oversample, do it inside the CV fold only, never before the split).
+6. **Tune hyperparameters with CV, search smart not exhaustive.** `RandomizedSearchCV` or Optuna over a sensible space beats `GridSearchCV` on a huge grid. Always pass `scoring=` your step-1 metric (not accuracy) and use the same CV object as step 2. Key GBM knobs: `n_estimators` + `learning_rate` (trade off), `max_depth` / `num_leaves`, `min_child_samples`, `subsample`, `colsample_bytree`, plus `early_stopping_rounds` on a validation set.
+7. **Diagnose bias vs variance, then act.** Compare train vs validation score:
+   - Train high, val low (large gap) = **overfit/variance** → regularize, reduce depth/leaves, add data, drop features, stronger early stopping.
+   - Train and val both low = **underfit/bias** → richer model, better features, less regularization.
+   Plot a learning curve to decide whether more data would even help before collecting it.
+8. **Track every experiment — params, metric, data version, code version.** Log to MLflow / Weights & Biases (or a CSV at minimum): hyperparameters, all CV metrics with std, the data snapshot hash, git commit, and the random seed. An untracked best-run is unreproducible. Pin seeds (`random_state`) everywhere.
+9. **Evaluate the held-out test set EXACTLY ONCE, at the end.** Report the step-1 metric with a confidence interval (bootstrap), the confusion matrix at your chosen threshold, and the gap vs baseline. Repeatedly peeking at test = overfitting to test by hand.
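Step 1's expected-cost rule (`cFP·FP + cFN·FN`) turns threshold selection into a small search over validation scores. The sketch below is dependency-free Python with made-up scores and costs; `expected_cost`, `best_threshold`, and the sample data are hypothetical, and the search should be run on validation data, never on the held-out test set.

```python
def expected_cost(y_true, scores, threshold, c_fp, c_fn):
    """Total cost when every score >= threshold is predicted positive."""
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    return c_fp * fp + c_fn * fn


def best_threshold(y_true, scores, c_fp, c_fn):
    """Try every observed score (plus 'predict nothing') and keep the cheapest."""
    candidates = sorted(set(scores)) + [float("inf")]
    return min(candidates, key=lambda t: expected_cost(y_true, scores, t, c_fp, c_fn))


# Illustrative validation labels and model scores; a miss costs 10x a false alarm.
y_val = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.70, 0.90]
C_FP, C_FN = 1, 10

chosen = best_threshold(y_val, scores, C_FP, C_FN)
cost_at_chosen = expected_cost(y_val, scores, chosen, C_FP, C_FN)
cost_at_default = expected_cost(y_val, scores, 0.5, C_FP, C_FN)
```

On this toy data the default 0.5 cut misses the positive scored 0.40, which is why its cost is higher than at the tuned threshold.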
+## Common Errors
+- **Splitting after feature engineering / scaling on the full dataset.** Test statistics bleed into train; metrics inflate, prod collapses. Split first, fit transforms on train only (use a `Pipeline`).
+- **Random split on temporal data.** The model trains on future rows and "predicts" the past. Use a time-based split / `TimeSeriesSplit`.
+- **Reporting accuracy on an imbalanced problem.** 99% accuracy with 0% recall is useless. Pick PR-AUC / recall-at-precision from the cost (step 1).
+- **A feature that's too good.** One column driving a near-perfect score is almost always leakage (post-outcome field, ID proxy, target-derived). Audit and drop it.
+- **Target encoding / imputation / feature selection computed before the split or outside the CV fold.** Subtle leakage that survives CV but not production. Fit them inside the pipeline, per fold.
+- **SMOTE/oversampling applied before splitting.** Synthetic copies of test rows land in train. Resample inside the training fold only.
+- **Tuning against the test set / peeking repeatedly.** You overfit to it manually. Tune on CV/validation; touch test once.
+- **No baseline.** Without DummyClassifier/linear you can't tell if your model learned anything. Always beat the dumb baseline first.
+- **`scoring=` left at default (accuracy/R²) during search.** The search optimizes the wrong thing. Pass your business metric to `cross_val_score` / `*SearchCV`.
+- **Train/serve feature skew.** Features computed differently (or with different code/library versions) at training vs inference. Reuse the exact fitted pipeline artifact for serving.
+- **Unpinned seeds / untracked runs.** Results aren't reproducible and "best model" can't be recovered. Pin `random_state`, log params+metrics+data+code version.
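The temporal and group split rules from step 2 can be asserted programmatically before any model is fit. This is a minimal Python sketch; the function names and the sample group IDs and dates are hypothetical.

```python
from datetime import date


def assert_no_group_overlap(train_groups, test_groups):
    """Fail if any entity (user, patient, device) appears in both splits."""
    shared = set(train_groups) & set(test_groups)
    if shared:
        raise AssertionError(
            f"{len(shared)} group(s) appear in both splits, e.g. {sorted(shared)[:3]}"
        )


def assert_temporal_order(train_times, test_times):
    """Fail unless every test timestamp is strictly after every train timestamp."""
    if max(train_times) >= min(test_times):
        raise AssertionError("a test row is not strictly after all train rows")


# Illustrative split metadata.
train_groups, test_groups = ["u1", "u1", "u2"], ["u3", "u4"]
train_times = [date(2024, 1, 5), date(2024, 2, 1)]
test_times = [date(2024, 3, 1), date(2024, 3, 9)]

assert_no_group_overlap(train_groups, test_groups)
assert_temporal_order(train_times, test_times)
```

Running these checks right after the split, and again in CI, catches a reshuffled or re-joined dataset before it inflates a metric.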
+## Verify
+1. **Beats baseline:** held-out test metric (step-1 metric) > the DummyClassifier/DummyRegressor and the linear baseline by a margin larger than the bootstrap CI. If it doesn't, there's no model.
+2. **Leakage check passes:** drop each top-importance feature individually — no single feature should collapse the score to chance-level; remove any post-outcome/target-derived/ID-proxy column; confirm all learned transforms were fit on train only.
+3. **Split integrity:** for temporal data, every test timestamp is strictly after every train timestamp; for grouped data, no group ID appears in two sets (assert programmatically).
+4. **Generalization gap sane:** train metric − validation metric is small and explained by your bias/variance call; test ≈ validation (a big test drop means tuning overfit to validation).
+5. **Metric matches business cost:** the reported metric and the chosen decision threshold come from step 1, not the library default.
+6. **No train/serve skew:** running the saved pipeline on a held-out row reproduces the exact training-time prediction (same features, same library versions).
+7. **Reproducible:** re-running with the logged seed + data snapshot + code commit yields the same metric (within float tolerance); the experiment tracker has params, CV metrics with std, data hash, and git commit.
+Done = the held-out test metric (chosen for business cost, evaluated once) beats both baselines beyond its CI, the leakage and split-integrity checks pass, the generalization gap is explained, and the saved pipeline reproduces predictions with no train/serve skew under a pinned, tracked run.
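The "beats baseline beyond its CI" check can be approximated with a paired bootstrap over the test rows. The sketch below is plain Python on synthetic numbers: `mae`, `bootstrap_improvement_ci`, and the data are hypothetical, and MAE is used only as an example metric, so substitute the step-1 metric in practice.

```python
import random


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def bootstrap_improvement_ci(y_true, model_pred, base_pred,
                             n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for (baseline MAE - model MAE); positive means the model wins."""
    rng = random.Random(seed)  # pinned seed so the interval is reproducible
    n = len(y_true)
    gains = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample rows with replacement
        yt = [y_true[i] for i in idx]
        gains.append(mae(yt, [base_pred[i] for i in idx])
                     - mae(yt, [model_pred[i] for i in idx]))
    gains.sort()
    lo = gains[int(alpha / 2 * n_boot)]
    hi = gains[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Synthetic test set: the "model" is off by 0.5, the baseline predicts the mean.
y_test = [float(v) for v in range(1, 41)]
model_pred = [v + (0.5 if i % 2 else -0.5) for i, v in enumerate(y_test)]
mean_y = sum(y_test) / len(y_test)
base_pred = [mean_y] * len(y_test)

lo, hi = bootstrap_improvement_ci(y_test, model_pred, base_pred)
model_beats_baseline = lo > 0
```

Resampling the same row indices for both predictors keeps the comparison paired, so the interval reflects the difference between models rather than the noise in each one separately.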