npm - kushi-agents - Versions diffs - 5.0.1 → 5.0.3 - Mend

kushi-agents 5.0.1 → 5.0.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (46) hide show

package/README.md +47 -7
package/bin/cli.mjs +73 -45
package/package.json +55 -51
package/plugin/agents/kushi.agent.md +1 -1
package/plugin/instructions/multi-host-install.instructions.md +125 -0
package/plugin/instructions/skill-evals.instructions.md +130 -0
package/plugin/skills/aggregate-project/evals/evals.json +33 -0
package/plugin/skills/apply-ado-update/evals/evals.json +33 -0
package/plugin/skills/ask-project/evals/evals.json +34 -0
package/plugin/skills/bootstrap-project/evals/evals.json +34 -0
package/plugin/skills/build-state/evals/evals.json +31 -0
package/plugin/skills/consolidate-evidence/evals/evals.json +33 -0
package/plugin/skills/dashboard/evals/evals.json +33 -0
package/plugin/skills/emit-vertex/evals/evals.json +33 -0
package/plugin/skills/eval/SKILL.md +90 -0
package/plugin/skills/eval/evals.schema.json +73 -0
package/plugin/skills/eval/run-evals.ps1 +372 -0
package/plugin/skills/fde-intake/evals/evals.json +33 -0
package/plugin/skills/fde-report/evals/evals.json +33 -0
package/plugin/skills/fde-triage/evals/evals.json +33 -0
package/plugin/skills/intro/evals/evals.json +33 -0
package/plugin/skills/link-entities/evals/evals.json +31 -0
package/plugin/skills/project-status/evals/evals.json +33 -0
package/plugin/skills/propose-ado-update/evals/evals.json +33 -0
package/plugin/skills/pull-ado/evals/evals.json +35 -0
package/plugin/skills/pull-crm/evals/evals.json +35 -0
package/plugin/skills/pull-email/evals/evals.json +35 -0
package/plugin/skills/pull-loop/evals/evals.json +35 -0
package/plugin/skills/pull-meetings/evals/evals.json +35 -0
package/plugin/skills/pull-misc/evals/evals.json +35 -0
package/plugin/skills/pull-onenote/evals/evals.json +35 -0
package/plugin/skills/pull-sharepoint/evals/evals.json +35 -0
package/plugin/skills/pull-teams/evals/evals.json +35 -0
package/plugin/skills/refresh-project/evals/evals.json +31 -0
package/plugin/skills/self-check/SKILL.md +2 -0
package/plugin/skills/self-check/evals/evals.json +28 -0
package/plugin/skills/self-check/run.ps1 +174 -1
package/plugin/skills/setup/evals/evals.json +33 -0
package/plugin/skills/tour/evals/evals.json +33 -0
package/plugin/skills/vertex-link/evals/evals.json +33 -0
package/src/constants.mjs +39 -1
package/src/eval-aggregator.mjs +209 -0
package/src/eval-aggregator.test.mjs +64 -0
package/src/eval-runner.test.mjs +69 -0
package/src/multi-host-install.test.mjs +170 -0
package/src/multi-host.mjs +277 -0

package/README.md CHANGED Viewed

@@ -121,20 +121,38 @@ See [Quickstart](https://gim-home.github.io/kushi/getting-started/quickstart/) f
 ## Install
-Two install targets — pick the one that matches how you talk to Copilot, optionally combined with a profile.
+Kushi supports **two host surfaces** as first-class peers (v5.0.2+):
-**VS Code Chat** — `.kushi/` in your workspace:
+| Host                                  | Install path                       | Best for                                |
+|---------------------------------------|------------------------------------|-----------------------------------------|
+| **Clawpilot CLI**                     | `~/.copilot/m-skills/kushi/`       | Scheduled / overnight runs (e.g. `kushi refresh <project>` at 6 AM via automation) |
+| **VS Code Chat** ("GitHub Copilot Chat") | `~/.vscode/chat/skills/kushi/`     | Interactive use (`@kushi what's the MACC for X?`) |
+Both hosts read the **same** Evidence/ tree on disk, so a refresh from one is immediately visible from the other — the same user routinely lives in both. SKILL content is host-agnostic (no per-host branching, enforced by self-check `D32.multi-host`).
 ```bash
-npx kushi-agents                               # default profile (standard)
-npx kushi-agents --profile core                # aggregator only
+# Install to a single host
+npx kushi-agents --clawpilot                       # Clawpilot only
+npx kushi-agents --vscode                          # VS Code Chat only
+# Install to BOTH at once (auto-detects what's present + targets both)
+npx kushi-agents --all-hosts
+# Uninstall
+npx kushi-agents --uninstall                       # all detected hosts
+npx kushi-agents --uninstall --clawpilot           # Clawpilot only
+npx kushi-agents --uninstall --vscode              # VS Code Chat only
+npx kushi-agents --uninstall --all                 # both
+# Legacy workspace install (per-project .kushi/ in cwd)
+npx kushi-agents                                   # default = standard profile
+npx kushi-agents --profile core                    # aggregator only
 ```
-**Clawpilot CLI** — `~/.copilot/m-skills/kushi/`:
+The 2-host matrix is a deliberate cap — see [`plugin/instructions/multi-host-install.instructions.md`](plugin/instructions/multi-host-install.instructions.md) for the rationale + per-host layout details.
 ```bash
-npx kushi-agents --clawpilot
-npx kushi-agents --clawpilot --profile full
+npx kushi-agents --clawpilot --profile full        # everything
 ```
 To switch profiles later, re-run with `--force` (cleanly handles downgrades):
@@ -217,6 +235,28 @@ npm pack --dry-run
 The self-check validates frontmatter, agent inventory, prompt → skill routing, profile manifest, reference packs, cross-links, the verbs table in this README, and the layout diagram in `docs/reference/where-things-live.md`. Full reference: [docs/reference/self-check.md](docs/reference/self-check.md).
+## Evaluating skills (v5.0.3+)
+Every skill ships per-case evals at `plugin/skills/<name>/evals/evals.json`, aligned with the [agentskills.io evaluating-skills spec](https://agentskills.io/skill-creation/evaluating-skills). Doctrine: [`plugin/instructions/skill-evals.instructions.md`](plugin/instructions/skill-evals.instructions.md).
+Quickstart:
+```powershell
+npm run eval:canary        # ~6 skills, runs in seconds — what PRs run
+npm run eval:all           # full suite (every plugin/skills/<name>/)
+npm run eval -- ask-project   # one skill
+npm run eval:baseline      # maintainer-only: refresh evals/baseline.json
+```
+Outputs:
+- `Evidence/_evals/<utc-ts>.json` — per-run JSON (pass/fail + duration + tokens per case).
+- `Evidence/_evals/benchmark.json` — per-skill mean/stddev for `pass_rate`, `duration_ms`, `tokens_total` + regression flags vs `evals/baseline.json`.
+Regressions flagged at ≥10pp pass-rate drop OR ≥50% latency/token increase. The canary subset is `ask-project`, `bootstrap-project`, `refresh-project`, `link-entities`, `build-state`, `self-check`.
+**Privacy:** fixtures under `evals/fixtures/` are synthetic. NEVER copy real customer data into the evals tree.
 ## License
 See [LICENSE](LICENSE).

package/bin/cli.mjs CHANGED Viewed

@@ -1,6 +1,7 @@
 #!/usr/bin/env node
 import { main } from '../src/main.mjs';
+import { runMultiHost } from '../src/multi-host.mjs';
 const args = process.argv.slice(2);
@@ -10,72 +11,99 @@ if (args.includes('--help') || args.includes('-h')) {
   Installs the Kushi multi-source project-evidence + Q&A agent.
-  Targets (pick one — default is vscode):
-    --target vscode     Install to <cwd>/.kushi/ + update .vscode/settings.json
-    --target clawpilot  Install to ~/.copilot/m-skills/kushi/ for Clawpilot CLI
-    --clawpilot         Shortcut for --target clawpilot
+  Host installs (v5.0.2+ — install into a host's user-global skill folder):
+    --clawpilot       Install to ~/.copilot/m-skills/kushi/
+    --vscode          Install to ~/.vscode/chat/skills/kushi/  (a.k.a. GitHub Copilot Chat)
+    --all-hosts       Install to BOTH hosts
+    --uninstall [--clawpilot|--vscode|--all]
+                      Cleanly remove the kushi install + skills-metadata.json entry
+                      from the chosen host(s). Default = all detected hosts.
+  Workspace install (legacy / default when no host flag is given):
+    --target vscode     Install to <cwd>/.kushi/ + update .vscode/settings.json   [default]
+    --target clawpilot  Alias for --clawpilot (kept for back-compat)
   Profile (controls what gets installed):
-    --profile core      Aggregator only (pull + consolidate + ask). The
-                        Evidence/ folder is the public contract; suitable for
-                        external rollup integrations.
-    --profile standard  Core + State/ rollup (Kushi's default opinion).  [DEFAULT]
-    --profile full      Standard + report packs (FDE weekly/customer/handoff).
+    --profile core      Aggregator only (pull + consolidate + ask).
+    --profile standard  Core + State/ rollup.                          [DEFAULT]
+    --profile full      Standard + report packs.
   Options:
     --force             Overwrite existing destination without asking
-    --yes, -y           Skip the project-root check (useful for scripted or
-                        agent-driven installs)
-    --no-settings       Skip .vscode/settings.json update (vscode target only)
-    --no-instructions   Skip .github/copilot-instructions.md merge (vscode target only)
+    --yes, -y           Skip the project-root check
+    --no-settings       Skip .vscode/settings.json update (vscode workspace target only)
+    --no-instructions   Skip .github/copilot-instructions.md merge   (vscode workspace target only)
   WorkIQ (REQUIRED — Kushi cannot pull evidence without it):
     --with-workiq           Auto-install WorkIQ via winget (Windows) / brew (macOS)
     --workiq-path <abs>     Use this explicit path to the workiq binary
-    --skip-workiq-check     Bypass the WorkIQ pre-flight check (CI / inspection only —
-                            bootstrap/refresh will block until WorkIQ is installed)
+    --skip-workiq-check     Bypass the WorkIQ pre-flight check
     --help, -h          Show this help
   After install, talk to Kushi:
-    aggregate <project>     Pull + consolidate (no State/) — all profiles
-    bootstrap <project>     First-time setup, 30d initial pull        — standard+
-    refresh <project>       Incremental refresh + rebuild State/      — standard+
-    state <project>         Re-render State/ from existing Evidence   — standard+
-    consolidate <project>   Merge per-user evidence                   — all profiles
-    status <project>        Show run-log                              — all profiles
-    ask <project> <q>       Cited Q&A over Evidence/ (auto-routes)    — all profiles
+    bootstrap <project>     First-time setup
+    refresh <project>       Incremental refresh + rebuild State/
+    state <project>         Re-render State/ from existing Evidence
+    consolidate <project>   Merge per-user evidence
+    status <project>        Show run-log
+    ask <project> <q>       Cited Q&A over Evidence/ (auto-routes)
   In VS Code Chat the prefix is "@Kushi". In Clawpilot just say "kushi <verb>".
 `);
   process.exit(0);
 }
-let target = getFlag('--target');
-if (args.includes('--clawpilot')) {
-  if (target && target !== 'clawpilot') {
-    console.error(`\n  Conflicting flags: --target ${target} and --clawpilot.\n`);
+// ── multi-host mode (v5.0.2+) ───────────────────────────────────────────────
+// Trigger when the user passes any of: --vscode, --all-hosts, --uninstall.
+// --clawpilot ALONE continues to route through the legacy main.mjs path so
+// the existing target=clawpilot flow stays byte-identical.
+const wantsVscode = args.includes('--vscode');
+const wantsAllHosts = args.includes('--all-hosts');
+const wantsUninstall = args.includes('--uninstall');
+if (wantsVscode || wantsAllHosts || wantsUninstall) {
+  const hosts = [];
+  if (args.includes('--clawpilot')) hosts.push('clawpilot');
+  if (wantsVscode) hosts.push('vscode');
+  const all = wantsAllHosts || args.includes('--all');
+  runMultiHost({
+    hosts,
+    all,
+    uninstall: wantsUninstall,
+    profile: getFlag('--profile'),
+  }).catch((err) => {
+    console.error(`\n  ${err.message}\n`);
     process.exit(1);
+  });
+} else {
+  let target = getFlag('--target');
+  if (args.includes('--clawpilot')) {
+    if (target && target !== 'clawpilot') {
+      console.error(`\n  Conflicting flags: --target ${target} and --clawpilot.\n`);
+      process.exit(1);
+    }
+    target = 'clawpilot';
   }
-  target = 'clawpilot';
-}
-const options = {
-  force: args.includes('--force'),
-  yes: args.includes('--yes') || args.includes('-y'),
-  noSettings: args.includes('--no-settings'),
-  noInstructions: args.includes('--no-instructions'),
-  target,
-  profile: getFlag('--profile'),
-  withWorkiq: args.includes('--with-workiq'),
-  workiqPath: getFlag('--workiq-path'),
-  skipWorkiqCheck: args.includes('--skip-workiq-check'),
-};
-main(options).catch((err) => {
-  console.error(`\n  ${err.message}\n`);
-  process.exit(1);
-});
+  const options = {
+    force: args.includes('--force'),
+    yes: args.includes('--yes') || args.includes('-y'),
+    noSettings: args.includes('--no-settings'),
+    noInstructions: args.includes('--no-instructions'),
+    target,
+    profile: getFlag('--profile'),
+    withWorkiq: args.includes('--with-workiq'),
+    workiqPath: getFlag('--workiq-path'),
+    skipWorkiqCheck: args.includes('--skip-workiq-check'),
+  };
+  main(options).catch((err) => {
+    console.error(`\n  ${err.message}\n`);
+    process.exit(1);
+  });
+}
 function getFlag(flag) {
   const idx = args.indexOf(flag);
@@ -85,4 +113,4 @@ function getFlag(flag) {
   const prefix = flag + '=';
   const match = args.find((a) => a.startsWith(prefix));
   return match ? match.slice(prefix.length) : undefined;
-}
+}

package/package.json CHANGED Viewed

@@ -1,52 +1,56 @@
-{
-  "name": "kushi-agents",
-  "version": "5.0.1",
-  "description": "Install Kushi — multi-source project evidence agent with Comprehensive Structured Capture (CSC) into weekly-only files across Email, Teams, OneNote, Loop, SharePoint, Meetings, CRM, ADO. Meetings retain a sibling verbatim/ audit folder. WorkIQ-only for M365 sources (Graph / m365_* FORBIDDEN as fallbacks; user-paste is first-class). Host-agnostic.",
-  "type": "module",
-  "bin": {
-    "kushi-agents": "./bin/cli.mjs"
-  },
-  "files": [
-    "bin/",
-    "src/",
-    "plugin/",
-    ".github/copilot-instructions.kushi.md"
-  ],
-  "engines": {
-    "node": ">=18.0.0"
-  },
-  "dependencies": {
-    "@mozilla/readability": "^0.6.0",
-    "jsdom": "^29.1.1",
-    "jsonc-parser": "^3.3.1"
-  },
-  "keywords": [
-    "vscode",
-    "copilot",
-    "agents",
-    "kushi",
-    "project-evidence",
-    "workiq",
-    "m365",
-    "ai",
-    "cli"
-  ],
-  "repository": {
-    "type": "git",
-    "url": "git+https://github.com/gim-home/kushi.git"
-  },
-  "homepage": "https://gim-home.github.io/kushi/",
-  "bugs": {
-    "url": "https://github.com/gim-home/kushi/issues"
-  },
-  "license": "MIT",
-  "scripts": {
-    "test": "node --test src/check-workiq.test.mjs src/seed-config.test.mjs src/sanitize-workiq-input.test.mjs src/detect-vertex-repo.test.mjs src/vertex-validate.test.mjs src/emit-vertex.e2e.test.mjs src/config-root-resolve.test.mjs src/forbidden-workiq-phrasings.test.mjs",
-    "test:integration:bootstrap": "node src/bootstrap-dryrun.integration.test.mjs",
-    "smoke": "node scripts/smoke.mjs",
-    "prepublishOnly": "npm test && npm run smoke"
-  },
-  "publishConfig": {
-    "access": "public"
-  }
+{
+  "name": "kushi-agents",
+  "version": "5.0.3",
+  "description": "Install Kushi — multi-source project evidence agent with Comprehensive Structured Capture (CSC) into weekly-only files across Email, Teams, OneNote, Loop, SharePoint, Meetings, CRM, ADO. Meetings retain a sibling verbatim/ audit folder. WorkIQ-only for M365 sources (Graph / m365_* FORBIDDEN as fallbacks; user-paste is first-class). Host-agnostic.",
+  "type": "module",
+  "bin": {
+    "kushi-agents": "./bin/cli.mjs"
+  },
+  "files": [
+    "bin/",
+    "src/",
+    "plugin/",
+    ".github/copilot-instructions.kushi.md"
+  ],
+  "engines": {
+    "node": ">=18.0.0"
+  },
+  "dependencies": {
+    "@mozilla/readability": "^0.6.0",
+    "jsdom": "^29.1.1",
+    "jsonc-parser": "^3.3.1"
+  },
+  "keywords": [
+    "vscode",
+    "copilot",
+    "agents",
+    "kushi",
+    "project-evidence",
+    "workiq",
+    "m365",
+    "ai",
+    "cli"
+  ],
+  "repository": {
+    "type": "git",
+    "url": "git+https://github.com/gim-home/kushi.git"
+  },
+  "homepage": "https://gim-home.github.io/kushi/",
+  "bugs": {
+    "url": "https://github.com/gim-home/kushi/issues"
+  },
+  "license": "MIT",
+  "scripts": {
+    "test": "node --test src/check-workiq.test.mjs src/seed-config.test.mjs src/sanitize-workiq-input.test.mjs src/detect-vertex-repo.test.mjs src/vertex-validate.test.mjs src/emit-vertex.e2e.test.mjs src/config-root-resolve.test.mjs src/forbidden-workiq-phrasings.test.mjs src/multi-host-install.test.mjs src/eval-aggregator.test.mjs src/eval-runner.test.mjs",
+    "test:integration:bootstrap": "node src/bootstrap-dryrun.integration.test.mjs",
+    "smoke": "node scripts/smoke.mjs",
+    "eval": "pwsh plugin/skills/eval/run-evals.ps1 -Skill",
+    "eval:all": "pwsh plugin/skills/eval/run-evals.ps1 -All",
+    "eval:canary": "pwsh plugin/skills/eval/run-evals.ps1 -Canary",
+    "eval:baseline": "pwsh plugin/skills/eval/run-evals.ps1 -All -UpdateBaseline",
+    "prepublishOnly": "npm test && npm run smoke"
+  },
+  "publishConfig": {
+    "access": "public"
+  }
 }

package/plugin/agents/kushi.agent.md CHANGED Viewed

@@ -16,7 +16,7 @@ Kushi ships in three profiles. The installed profile is recorded in `kushi-insta
 | Profile | What's installed | Verbs available |
 |---|---|---|
-| `core` | Aggregator only: `setup`, `pull-*`, `consolidate-evidence`, `aggregate-project`, `ask-project`, `project-status`, `vertex-link`, `emit-vertex`, `self-check`, `intro` | `setup`, `aggregate`, `consolidate`, `status`, `pull`, `ask`, `vertex-link`, `emit-vertex` |
+| `core` | Aggregator only: `setup`, `pull-*`, `consolidate-evidence`, `aggregate-project`, `ask-project`, `project-status`, `vertex-link`, `emit-vertex`, `self-check`, `eval`, `intro` | `setup`, `aggregate`, `consolidate`, `status`, `pull`, `ask`, `vertex-link`, `emit-vertex` |
 | `standard` *(default)* | core + `bootstrap-project`, `refresh-project`, `fde-intake`, `fde-report`, `fde-triage` + FDE reference pack | core + `bootstrap`, `refresh`, `fde-intake`, `fde-report`, `fde-triage` |
 | `full` | standard + `build-state` | standard + `state` |
 | **`preview`** *(opt-in)* | standard + `propose-ado-update`, `apply-ado-update` | standard + `propose-ado`, `apply-ado` |

package/plugin/instructions/multi-host-install.instructions.md ADDED Viewed

@@ -0,0 +1,125 @@
+---
+name: "multi-host-install"
+version: "1.0.0"
+description: "USE WHEN installing kushi to a user-global host (Clawpilot or VS Code Chat), uninstalling, or wondering why kushi ships to two hosts (not three or N) and how the layouts differ. DO NOT USE for workspace-local .kushi/ install — that is the legacy --target vscode flow."
+---
+# Multi-host install doctrine (v5.0.2+)
+Kushi ships as a single host-agnostic skill bundle, installed into a host's
+user-global skill folder by `npx kushi-agents`. **Exactly two hosts are
+supported.** No third host. No N-host generality.
+## The 2-host matrix
+| Host ID      | Display name                            | Skill dir                          | Metadata file                                |
+|--------------|-----------------------------------------|------------------------------------|----------------------------------------------|
+| `clawpilot`  | Clawpilot CLI                           | `~/.copilot/m-skills/kushi/`       | `~/.copilot/m-skills/skills-metadata.json`   |
+| `vscode`     | VS Code Chat ("GitHub Copilot Chat")    | `~/.vscode/chat/skills/kushi/`     | `~/.vscode/chat/skills/skills-metadata.json` |
+> The VS Code Chat path is the canonical user-global skill folder location
+> assumed by this doctrine. If a future VS Code Chat release ships a
+> different canonical path, update `VSCODE_CHAT_DEST_SUBPATH` in
+> `src/constants.mjs` — every other surface reads from there.
+## Why only these two
+- **Clawpilot** — primary scheduled / overnight surface (e.g. `kushi refresh <project>`
+  run from a Clawpilot automation at 6 AM).
+- **VS Code Chat** — primary interactive surface (`@kushi what's the MACC for X?`
+  asked the next morning).
+The same user routinely lives in both — automation in Clawpilot, follow-up
+questions in VS Code Chat. Both hosts read the **same** Evidence/ tree on disk,
+so a refresh in one is visible from the other immediately.
+A third host (e.g. a hypothetical web UI) would force per-host SKILL surgery
+or a content-branching shim. The 2-host matrix is the deliberate stopping
+point — it covers ≥ 99 % of the actual usage pattern and keeps cross-host
+parity trivially enforceable.
+## Per-host layout
+Both hosts get **byte-identical** content under their skill dir:
+```text
+<host-skill-dir>/
+├── SKILL.md                         (mirrored from agents/kushi.agent.md)
+├── agents/kushi.agent.md
+├── instructions/<name>.instructions.md
+├── prompts/<name>.prompt.md
+├── skills/<name>/SKILL.md           (+ references/, run.ps1, etc.)
+├── templates/<...>
+├── lib/<...>
+├── reference-packs/<...>
+├── config/
+│   ├── shared/                       (team-owned, safe to commit)
+│   └── user/                         (per-contributor, gitignored)
+└── kushi-install.json               (profile manifest)
+```
+And at the **parent** dir, `skills-metadata.json` carries the host's skill
+registry. The installer upserts a single `{"name": "kushi", ...}` entry that
+points `instructions` at the host's own `SKILL.md` (absolute path).
+There is **no per-host content branching**. A SKILL.md that opens with `WHEN
+running on Clawpilot, do X; WHEN running on VS Code Chat, do Y` is a defect.
+## Detection rules
+A host is **detected** when its parent dir exists on the file system:
+| Host       | Detection probe                       |
+|------------|---------------------------------------|
+| clawpilot  | `~/.copilot/m-skills/` exists         |
+| vscode     | `~/.vscode/chat/skills/` exists       |
+`bin/cli.mjs` runs detection before any install/uninstall and prints
+`Detected hosts: …` so the user sees which targets will be touched by the
+default behavior.
+## Install / uninstall flags
+| Flag                                 | Effect |
+|--------------------------------------|--------|
+| `--clawpilot`                        | Install to Clawpilot host (alone: also legacy `--target clawpilot` path) |
+| `--vscode`                           | Install to VS Code Chat host |
+| `--all-hosts`                        | Install to BOTH hosts (regardless of detection) |
+| `--uninstall`                        | Uninstall from all *detected* hosts |
+| `--uninstall --clawpilot`            | Uninstall from Clawpilot only |
+| `--uninstall --vscode`               | Uninstall from VS Code Chat only |
+| `--uninstall --all`                  | Uninstall from BOTH hosts |
+| (no flag)                            | Workspace install (legacy `.kushi/` in cwd) |
+## `kushi refresh <project>` is host-agnostic
+`refresh` is just a SKILL — the same `plugin/skills/refresh-project/SKILL.md`
+content runs under both hosts. It writes Evidence to disk at the user-
+configured engagement root (`~/Documents/Engagements/<project>/` by default),
+which lives **outside** the host skill dir. That is why a Clawpilot-driven
+refresh is immediately visible to a VS Code Chat `ask` — neither host owns
+the data.
+## Cross-host parity rule
+Every SKILL.md, prompt, instruction, template, and lib asset MUST work
+identically under both hosts. Concretely:
+- No `if (HOST === 'clawpilot') ...` style branching in any markdown body.
+- No host-specific paths in skill content (always use `<engagement-root>` /
+  `<project>` placeholders that resolve at runtime via the standard
+  `engagement-root-resolution.instructions.md` chain).
+- The two `SKILL.md` files (one per host) are required to be **byte-identical**
+  — enforced by `self-check` `D32.multi-host` in deep mode (temp-dir dry-run).
+If a future feature genuinely requires host-specific behavior (e.g. UI affordance
+only one host provides), it belongs in a host-specific *helper* outside `plugin/`
+— not inside the shared skill bundle.
+## See also
+- `host-portability.instructions.md` — older doctrine on per-host portability of
+  individual skill calls (still in force for runtime tool selection).
+- `D32.multi-host` self-check (deep) — validates installer + temp-dir layout.
+- `src/multi-host.mjs` — the installer/uninstaller implementation.
+- `src/multi-host-install.test.mjs` — node:test coverage.

package/plugin/instructions/skill-evals.instructions.md ADDED Viewed

@@ -0,0 +1,130 @@
+---
+description: "v5.0.3 — Skill evals doctrine, adapted from https://agentskills.io/skill-creation/evaluating-skills. Every skill MUST ship an evals/ folder with at least 2 deterministic cases plus structured assertions; a per-skill pass-rate is the objective regression signal. Canary subset runs on every PR; full suite runs on demand. Real customer data is FORBIDDEN in fixtures — use synthetic data only."
+---
+# Skill evals — doctrine
+> Inspired by **<https://agentskills.io/skill-creation/evaluating-skills>**. Adapted to kushi's PowerShell + Node test stack and to our 2-host install matrix.
+## Why
+Skills are prompts plus a runner. Prompts drift silently. Without an objective per-skill regression signal, every change is a gamble. Evals make that signal cheap:
+- **Per-skill pass-rate** is the headline metric.
+- **Latency** and **tokens** are secondary metrics (regressions ≥50% latency / ≥10pp pass-rate flag a baseline failure).
+- A **canary subset** runs on every PR (target: < 60s wall clock); the **full suite** runs on demand (`npm run eval:all`).
+## Where evals live
+```text
+plugin/skills/<name>/
+├── SKILL.md
+└── evals/
+    ├── evals.json        ← REQUIRED — case list + assertions
+    └── fixtures/         ← OPTIONAL per-skill fixtures
+```
+Cross-skill fixtures live at the repo root:
+```text
+evals/
+├── baseline.json         ← Committed; maintainer updates with `npm run eval:baseline`
+└── fixtures/             ← Tiny synthetic evidence trees, ADO fixtures, etc.
+```
+Per-run output goes to `Evidence/_evals/<timestamp>.json` (gitignored; not customer data).
+## Case schema
+```jsonc
+{
+  "skill": "<skill-name>",
+  "cases": [
+    {
+      "id": "ap-citations-format",
+      "name": "ask-project emits weekly-csc citation form",
+      "input": "what was decided about MACC for fixture-acme?",
+      "fixture": "evals/fixtures/fixture-acme",        // optional
+      "canary": true,
+      "grader_type": "script",                          // "script" | "llm"
+      "expected_assertions": [
+        { "type": "regex-match", "pattern": "\\[source:\\s*fixture-acme/email/weekly/" },
+        { "type": "regex-match", "pattern": "Source-layout:\\s*weekly-csc" }
+      ]
+    }
+  ]
+}
+```
+### Required fields per case
+- `id` — unique within the skill; kebab-case.
+- `name` — human-readable.
+- `input` — what gets passed to the skill (string OR object).
+- `expected_assertions` — array, **≥ 1** entry (enforced by `D33.evals-have-assertions`).
+- `grader_type` — `"script"` for deterministic graders, `"llm"` for rubric-based.
+### Optional fields
+- `fixture` — repo-relative path to the fixture to point the skill at.
+- `canary` — `true` to include in the fast CI subset.
+- `args` — extra args forwarded to the skill script (e.g. `{ "DryRun": true }`).
+- `skip` — `true` to skip (must include `skip_reason`).
+- `timeout_ms` — override the runner default (30 000 ms).
+## Assertion types
+| Type | Shape | Passes when |
+|---|---|---|
+| `file-exists` | `{ "type": "file-exists", "path": "..." }` | Path exists post-run (relative to fixture or evidence dir). |
+| `file-contains` | `{ "type": "file-contains", "path": "...", "needle": "..." }` | File exists and substring is present. |
+| `json-path-equals` | `{ "type": "json-path-equals", "path": "...", "json_path": "$.foo.bar", "equals": "v" }` | JSON file parses; dotted path value === expected. |
+| `regex-match` | `{ "type": "regex-match", "pattern": "...", "flags": "i" }` | Captured stdout matches the regex. |
+| `llm-rubric` | `{ "type": "llm-rubric", "rubric": "...", "min_score": 4 }` | LLM grader scores ≥ min on a 1–5 rubric. |
+## Run modes
+The runner (`plugin/skills/eval/run-evals.ps1`) supports three dispatch modes:
+1. **Direct invocation** (default for `script` graders). Runs the skill's executable artifact (`run.ps1`, `*.mjs`, or a small probe stub) with the given input and fixture. Pure deterministic.
+2. **Sub-agent dispatch** (optional, gated by `-Live`). Forwards the case to a sub-agent. Used only for `llm-rubric` cases. Skipped in canary mode.
+3. **Recorded fixture replay** (for `pull-*` skills). Reads a recorded `--cached` output of a real pull and asserts against that, so no live M365 calls are needed.
+For each case the runner records: `pass`, `duration_ms`, `tokens_in`, `tokens_out`, `stdout`, `stderr`, per-assertion `pass`/`reason`. The aggregate is a JSON file under `Evidence/_evals/` plus a one-line `benchmark.json` summary.
+## Canary set
+Marked with `"canary": true`. Kept tiny so PRs stay fast.
+Default canary set (v5.0.3):
+- `ask-project`
+- `bootstrap-project`
+- `refresh-project`
+- `link-entities`
+- `build-state`
+- `self-check`
+## Baseline + regression detection
+- `evals/baseline.json` is **committed**.
+- Each per-skill record carries the last green `pass_rate`, `mean_duration_ms`, and `mean_tokens_total`.
+- `src/eval-aggregator.mjs` flags **regressions**:
+  - `pass_rate` drop ≥ 10 percentage points
+  - `mean_duration_ms` increase ≥ 50 %
+  - `mean_tokens_total` increase ≥ 50 %
+- Maintainers refresh the baseline with `npm run eval:baseline` after deliberate behavior changes.
+## Privacy + safety
+- **No real customer data** in any fixture. Use `fixture-acme`-style synthetic names.
+- `Evidence/_evals/` is in `.gitignore`.
+- `pull-*` evals NEVER hit live M365 endpoints in canary mode. Use recorded `--cached` payloads or `--dry-run`.
+- Tenant IDs / GUIDs in fixtures must be obviously fake (e.g. `00000000-...`).
+## References
+- [agentskills.io — evaluating skills](https://agentskills.io/skill-creation/evaluating-skills) (source of truth)
+- `plugin/skills/eval/SKILL.md` (the runner skill)
+- `plugin/skills/eval/evals.schema.json` (JSON schema; self-check D33.evals-schema)
+- `plugin/instructions/agentskills-compliance.instructions.md` (sibling doctrine — size + section caps)

package/plugin/skills/aggregate-project/evals/evals.json ADDED Viewed

@@ -0,0 +1,33 @@
+{
+  "skill": "aggregate-project",
+  "version": "1.0.0",
+  "description": "Auto-seeded evals for aggregate-project. Replace with real cases as the skill matures.",
+  "cases": [
+    {
+      "id": "aggregate-project-smoke-1",
+      "name": "aggregate-project produces a non-empty response",
+      "input": "synthetic aggregate-project probe — canary smoke",
+      "canary": false,
+      "grader_type": "script",
+      "expected_assertions": [
+        {
+          "type": "regex-match",
+          "pattern": ".+"
+        }
+      ]
+    },
+    {
+      "id": "aggregate-project-smoke-2",
+      "name": "aggregate-project echoes case id",
+      "input": "case-id aggregate-project-smoke-2",
+      "canary": false,
+      "grader_type": "script",
+      "expected_assertions": [
+        {
+          "type": "regex-match",
+          "pattern": "aggregate-project-smoke-2"
+        }
+      ]
+    }
+  ]
+}