kushi-agents 5.0.1 → 5.0.3

Files changed (46)
  1. package/README.md +47 -7
  2. package/bin/cli.mjs +73 -45
  3. package/package.json +55 -51
  4. package/plugin/agents/kushi.agent.md +1 -1
  5. package/plugin/instructions/multi-host-install.instructions.md +125 -0
  6. package/plugin/instructions/skill-evals.instructions.md +130 -0
  7. package/plugin/skills/aggregate-project/evals/evals.json +33 -0
  8. package/plugin/skills/apply-ado-update/evals/evals.json +33 -0
  9. package/plugin/skills/ask-project/evals/evals.json +34 -0
  10. package/plugin/skills/bootstrap-project/evals/evals.json +34 -0
  11. package/plugin/skills/build-state/evals/evals.json +31 -0
  12. package/plugin/skills/consolidate-evidence/evals/evals.json +33 -0
  13. package/plugin/skills/dashboard/evals/evals.json +33 -0
  14. package/plugin/skills/emit-vertex/evals/evals.json +33 -0
  15. package/plugin/skills/eval/SKILL.md +90 -0
  16. package/plugin/skills/eval/evals.schema.json +73 -0
  17. package/plugin/skills/eval/run-evals.ps1 +372 -0
  18. package/plugin/skills/fde-intake/evals/evals.json +33 -0
  19. package/plugin/skills/fde-report/evals/evals.json +33 -0
  20. package/plugin/skills/fde-triage/evals/evals.json +33 -0
  21. package/plugin/skills/intro/evals/evals.json +33 -0
  22. package/plugin/skills/link-entities/evals/evals.json +31 -0
  23. package/plugin/skills/project-status/evals/evals.json +33 -0
  24. package/plugin/skills/propose-ado-update/evals/evals.json +33 -0
  25. package/plugin/skills/pull-ado/evals/evals.json +35 -0
  26. package/plugin/skills/pull-crm/evals/evals.json +35 -0
  27. package/plugin/skills/pull-email/evals/evals.json +35 -0
  28. package/plugin/skills/pull-loop/evals/evals.json +35 -0
  29. package/plugin/skills/pull-meetings/evals/evals.json +35 -0
  30. package/plugin/skills/pull-misc/evals/evals.json +35 -0
  31. package/plugin/skills/pull-onenote/evals/evals.json +35 -0
  32. package/plugin/skills/pull-sharepoint/evals/evals.json +35 -0
  33. package/plugin/skills/pull-teams/evals/evals.json +35 -0
  34. package/plugin/skills/refresh-project/evals/evals.json +31 -0
  35. package/plugin/skills/self-check/SKILL.md +2 -0
  36. package/plugin/skills/self-check/evals/evals.json +28 -0
  37. package/plugin/skills/self-check/run.ps1 +174 -1
  38. package/plugin/skills/setup/evals/evals.json +33 -0
  39. package/plugin/skills/tour/evals/evals.json +33 -0
  40. package/plugin/skills/vertex-link/evals/evals.json +33 -0
  41. package/src/constants.mjs +39 -1
  42. package/src/eval-aggregator.mjs +209 -0
  43. package/src/eval-aggregator.test.mjs +64 -0
  44. package/src/eval-runner.test.mjs +69 -0
  45. package/src/multi-host-install.test.mjs +170 -0
  46. package/src/multi-host.mjs +277 -0
package/README.md CHANGED
@@ -121,20 +121,38 @@ See [Quickstart](https://gim-home.github.io/kushi/getting-started/quickstart/) f
121
121
 
122
122
  ## Install
123
123
 
124
- Two install targets; pick the one that matches how you talk to Copilot, optionally combined with a profile.
124
+ Kushi supports **two host surfaces** as first-class peers (v5.0.2+):
125
125
 
126
- **VS Code Chat** (`.kushi/` in your workspace):
126
+ | Host | Install path | Best for |
127
+ |---------------------------------------|------------------------------------|-----------------------------------------|
128
+ | **Clawpilot CLI** | `~/.copilot/m-skills/kushi/` | Scheduled / overnight runs (e.g. `kushi refresh <project>` at 6 AM via automation) |
129
+ | **VS Code Chat** ("GitHub Copilot Chat") | `~/.vscode/chat/skills/kushi/` | Interactive use (`@kushi what's the MACC for X?`) |
130
+
131
+ Both hosts read the **same** Evidence/ tree on disk, so a refresh from one is immediately visible from the other — the same user routinely lives in both. SKILL content is host-agnostic (no per-host branching, enforced by self-check `D32.multi-host`).
127
132
 
128
133
  ```bash
129
- npx kushi-agents # default profile (standard)
130
- npx kushi-agents --profile core # aggregator only
134
+ # Install to a single host
135
+ npx kushi-agents --clawpilot # Clawpilot only
136
+ npx kushi-agents --vscode # VS Code Chat only
137
+
138
+ # Install to BOTH at once (auto-detects what's present + targets both)
139
+ npx kushi-agents --all-hosts
140
+
141
+ # Uninstall
142
+ npx kushi-agents --uninstall # all detected hosts
143
+ npx kushi-agents --uninstall --clawpilot # Clawpilot only
144
+ npx kushi-agents --uninstall --vscode # VS Code Chat only
145
+ npx kushi-agents --uninstall --all # both
146
+
147
+ # Legacy workspace install (per-project .kushi/ in cwd)
148
+ npx kushi-agents # default = standard profile
149
+ npx kushi-agents --profile core # aggregator only
131
150
  ```
132
151
 
133
- **Clawpilot CLI** (`~/.copilot/m-skills/kushi/`):
152
+ The 2-host matrix is a deliberate cap; see [`plugin/instructions/multi-host-install.instructions.md`](plugin/instructions/multi-host-install.instructions.md) for the rationale + per-host layout details.
134
153
 
135
154
  ```bash
136
- npx kushi-agents --clawpilot
137
- npx kushi-agents --clawpilot --profile full
155
+ npx kushi-agents --clawpilot --profile full # everything
138
156
  ```
139
157
 
140
158
  To switch profiles later, re-run with `--force` (cleanly handles downgrades):
@@ -217,6 +235,28 @@ npm pack --dry-run
217
235
 
218
236
  The self-check validates frontmatter, agent inventory, prompt → skill routing, profile manifest, reference packs, cross-links, the verbs table in this README, and the layout diagram in `docs/reference/where-things-live.md`. Full reference: [docs/reference/self-check.md](docs/reference/self-check.md).
219
237
 
238
+ ## Evaluating skills (v5.0.3+)
239
+
240
+ Every skill ships per-case evals at `plugin/skills/<name>/evals/evals.json`, aligned with the [agentskills.io evaluating-skills spec](https://agentskills.io/skill-creation/evaluating-skills). Doctrine: [`plugin/instructions/skill-evals.instructions.md`](plugin/instructions/skill-evals.instructions.md).
241
+
242
+ Quickstart:
243
+
244
+ ```powershell
245
+ npm run eval:canary # ~6 skills, runs in seconds — what PRs run
246
+ npm run eval:all # full suite (every plugin/skills/<name>/)
247
+ npm run eval -- ask-project # one skill
248
+ npm run eval:baseline # maintainer-only: refresh evals/baseline.json
249
+ ```
250
+
251
+ Outputs:
252
+
253
+ - `Evidence/_evals/<utc-ts>.json` — per-run JSON (pass/fail + duration + tokens per case).
254
+ - `Evidence/_evals/benchmark.json` — per-skill mean/stddev for `pass_rate`, `duration_ms`, `tokens_total` + regression flags vs `evals/baseline.json`.
255
+
256
+ Regressions flagged at ≥10pp pass-rate drop OR ≥50% latency/token increase. The canary subset is `ask-project`, `bootstrap-project`, `refresh-project`, `link-entities`, `build-state`, `self-check`.
257
+
258
+ **Privacy:** fixtures under `evals/fixtures/` are synthetic. NEVER copy real customer data into the evals tree.
259
+
220
260
  ## License
221
261
 
222
262
  See [LICENSE](LICENSE).
package/bin/cli.mjs CHANGED
@@ -1,6 +1,7 @@
1
1
  #!/usr/bin/env node
2
2
 
3
3
  import { main } from '../src/main.mjs';
4
+ import { runMultiHost } from '../src/multi-host.mjs';
4
5
 
5
6
  const args = process.argv.slice(2);
6
7
 
@@ -10,72 +11,99 @@ if (args.includes('--help') || args.includes('-h')) {
10
11
 
11
12
  Installs the Kushi multi-source project-evidence + Q&A agent.
12
13
 
13
- Targets (pick one; default is vscode):
14
- --target vscode Install to <cwd>/.kushi/ + update .vscode/settings.json
15
- --target clawpilot Install to ~/.copilot/m-skills/kushi/ for Clawpilot CLI
16
- --clawpilot Shortcut for --target clawpilot
14
+ Host installs (v5.0.2+; install into a host's user-global skill folder):
15
+ --clawpilot Install to ~/.copilot/m-skills/kushi/
16
+ --vscode Install to ~/.vscode/chat/skills/kushi/ (a.k.a. GitHub Copilot Chat)
17
+ --all-hosts Install to BOTH hosts
18
+ --uninstall [--clawpilot|--vscode|--all]
19
+ Cleanly remove the kushi install + skills-metadata.json entry
20
+ from the chosen host(s). Default = all detected hosts.
21
+
22
+ Workspace install (legacy / default when no host flag is given):
23
+ --target vscode Install to <cwd>/.kushi/ + update .vscode/settings.json [default]
24
+ --target clawpilot Alias for --clawpilot (kept for back-compat)
17
25
 
18
26
  Profile (controls what gets installed):
19
- --profile core Aggregator only (pull + consolidate + ask). The
20
- Evidence/ folder is the public contract; suitable for
21
- external rollup integrations.
22
- --profile standard Core + State/ rollup (Kushi's default opinion). [DEFAULT]
23
- --profile full Standard + report packs (FDE weekly/customer/handoff).
27
+ --profile core Aggregator only (pull + consolidate + ask).
28
+ --profile standard Core + State/ rollup. [DEFAULT]
29
+ --profile full Standard + report packs.
24
30
 
25
31
  Options:
26
32
  --force Overwrite existing destination without asking
27
- --yes, -y Skip the project-root check (useful for scripted or
28
- agent-driven installs)
29
- --no-settings Skip .vscode/settings.json update (vscode target only)
30
- --no-instructions Skip .github/copilot-instructions.md merge (vscode target only)
33
+ --yes, -y Skip the project-root check
34
+ --no-settings Skip .vscode/settings.json update (vscode workspace target only)
35
+ --no-instructions Skip .github/copilot-instructions.md merge (vscode workspace target only)
31
36
 
32
37
  WorkIQ (REQUIRED — Kushi cannot pull evidence without it):
33
38
  --with-workiq Auto-install WorkIQ via winget (Windows) / brew (macOS)
34
39
  --workiq-path <abs> Use this explicit path to the workiq binary
35
- --skip-workiq-check Bypass the WorkIQ pre-flight check (CI / inspection only —
36
- bootstrap/refresh will block until WorkIQ is installed)
40
+ --skip-workiq-check Bypass the WorkIQ pre-flight check
37
41
 
38
42
  --help, -h Show this help
39
43
 
40
44
  After install, talk to Kushi:
41
- aggregate <project> Pull + consolidate (no State/) — all profiles
42
- bootstrap <project> First-time setup, 30d initial pull — standard+
43
- refresh <project> Incremental refresh + rebuild State/ — standard+
44
- state <project> Re-render State/ from existing Evidence — standard+
45
- consolidate <project> Merge per-user evidence — all profiles
46
- status <project> Show run-log — all profiles
47
- ask <project> <q> Cited Q&A over Evidence/ (auto-routes) — all profiles
45
+ bootstrap <project> First-time setup
46
+ refresh <project> Incremental refresh + rebuild State/
47
+ state <project> Re-render State/ from existing Evidence
48
+ consolidate <project> Merge per-user evidence
49
+ status <project> Show run-log
50
+ ask <project> <q> Cited Q&A over Evidence/ (auto-routes)
48
51
 
49
52
  In VS Code Chat the prefix is "@Kushi". In Clawpilot just say "kushi <verb>".
50
53
  `);
51
54
  process.exit(0);
52
55
  }
53
56
 
54
- let target = getFlag('--target');
55
- if (args.includes('--clawpilot')) {
56
- if (target && target !== 'clawpilot') {
57
- console.error(`\n Conflicting flags: --target ${target} and --clawpilot.\n`);
57
+ // ── multi-host mode (v5.0.2+) ───────────────────────────────────────────────
58
+ // Trigger when the user passes any of: --vscode, --all-hosts, --uninstall.
59
+ // --clawpilot ALONE continues to route through the legacy main.mjs path so
60
+ // the existing target=clawpilot flow stays byte-identical.
61
+ const wantsVscode = args.includes('--vscode');
62
+ const wantsAllHosts = args.includes('--all-hosts');
63
+ const wantsUninstall = args.includes('--uninstall');
64
+
65
+ if (wantsVscode || wantsAllHosts || wantsUninstall) {
66
+ const hosts = [];
67
+ if (args.includes('--clawpilot')) hosts.push('clawpilot');
68
+ if (wantsVscode) hosts.push('vscode');
69
+ const all = wantsAllHosts || args.includes('--all');
70
+
71
+ runMultiHost({
72
+ hosts,
73
+ all,
74
+ uninstall: wantsUninstall,
75
+ profile: getFlag('--profile'),
76
+ }).catch((err) => {
77
+ console.error(`\n ${err.message}\n`);
58
78
  process.exit(1);
79
+ });
80
+ } else {
81
+ let target = getFlag('--target');
82
+ if (args.includes('--clawpilot')) {
83
+ if (target && target !== 'clawpilot') {
84
+ console.error(`\n Conflicting flags: --target ${target} and --clawpilot.\n`);
85
+ process.exit(1);
86
+ }
87
+ target = 'clawpilot';
59
88
  }
60
- target = 'clawpilot';
61
- }
62
89
 
63
- const options = {
64
- force: args.includes('--force'),
65
- yes: args.includes('--yes') || args.includes('-y'),
66
- noSettings: args.includes('--no-settings'),
67
- noInstructions: args.includes('--no-instructions'),
68
- target,
69
- profile: getFlag('--profile'),
70
- withWorkiq: args.includes('--with-workiq'),
71
- workiqPath: getFlag('--workiq-path'),
72
- skipWorkiqCheck: args.includes('--skip-workiq-check'),
73
- };
74
-
75
- main(options).catch((err) => {
76
- console.error(`\n ${err.message}\n`);
77
- process.exit(1);
78
- });
90
+ const options = {
91
+ force: args.includes('--force'),
92
+ yes: args.includes('--yes') || args.includes('-y'),
93
+ noSettings: args.includes('--no-settings'),
94
+ noInstructions: args.includes('--no-instructions'),
95
+ target,
96
+ profile: getFlag('--profile'),
97
+ withWorkiq: args.includes('--with-workiq'),
98
+ workiqPath: getFlag('--workiq-path'),
99
+ skipWorkiqCheck: args.includes('--skip-workiq-check'),
100
+ };
101
+
102
+ main(options).catch((err) => {
103
+ console.error(`\n ${err.message}\n`);
104
+ process.exit(1);
105
+ });
106
+ }
79
107
 
80
108
  function getFlag(flag) {
81
109
  const idx = args.indexOf(flag);
@@ -85,4 +113,4 @@ function getFlag(flag) {
85
113
  const prefix = flag + '=';
86
114
  const match = args.find((a) => a.startsWith(prefix));
87
115
  return match ? match.slice(prefix.length) : undefined;
88
- }
116
+ }
package/package.json CHANGED
@@ -1,52 +1,56 @@
1
- {
2
- "name": "kushi-agents",
3
- "version": "5.0.1",
4
- "description": "Install Kushi — multi-source project evidence agent with Comprehensive Structured Capture (CSC) into weekly-only files across Email, Teams, OneNote, Loop, SharePoint, Meetings, CRM, ADO. Meetings retain a sibling verbatim/ audit folder. WorkIQ-only for M365 sources (Graph / m365_* FORBIDDEN as fallbacks; user-paste is first-class). Host-agnostic.",
5
- "type": "module",
6
- "bin": {
7
- "kushi-agents": "./bin/cli.mjs"
8
- },
9
- "files": [
10
- "bin/",
11
- "src/",
12
- "plugin/",
13
- ".github/copilot-instructions.kushi.md"
14
- ],
15
- "engines": {
16
- "node": ">=18.0.0"
17
- },
18
- "dependencies": {
19
- "@mozilla/readability": "^0.6.0",
20
- "jsdom": "^29.1.1",
21
- "jsonc-parser": "^3.3.1"
22
- },
23
- "keywords": [
24
- "vscode",
25
- "copilot",
26
- "agents",
27
- "kushi",
28
- "project-evidence",
29
- "workiq",
30
- "m365",
31
- "ai",
32
- "cli"
33
- ],
34
- "repository": {
35
- "type": "git",
36
- "url": "git+https://github.com/gim-home/kushi.git"
37
- },
38
- "homepage": "https://gim-home.github.io/kushi/",
39
- "bugs": {
40
- "url": "https://github.com/gim-home/kushi/issues"
41
- },
42
- "license": "MIT",
43
- "scripts": {
44
- "test": "node --test src/check-workiq.test.mjs src/seed-config.test.mjs src/sanitize-workiq-input.test.mjs src/detect-vertex-repo.test.mjs src/vertex-validate.test.mjs src/emit-vertex.e2e.test.mjs src/config-root-resolve.test.mjs src/forbidden-workiq-phrasings.test.mjs",
45
- "test:integration:bootstrap": "node src/bootstrap-dryrun.integration.test.mjs",
46
- "smoke": "node scripts/smoke.mjs",
47
- "prepublishOnly": "npm test && npm run smoke"
48
- },
49
- "publishConfig": {
50
- "access": "public"
51
- }
1
+ {
2
+ "name": "kushi-agents",
3
+ "version": "5.0.3",
4
+ "description": "Install Kushi — multi-source project evidence agent with Comprehensive Structured Capture (CSC) into weekly-only files across Email, Teams, OneNote, Loop, SharePoint, Meetings, CRM, ADO. Meetings retain a sibling verbatim/ audit folder. WorkIQ-only for M365 sources (Graph / m365_* FORBIDDEN as fallbacks; user-paste is first-class). Host-agnostic.",
5
+ "type": "module",
6
+ "bin": {
7
+ "kushi-agents": "./bin/cli.mjs"
8
+ },
9
+ "files": [
10
+ "bin/",
11
+ "src/",
12
+ "plugin/",
13
+ ".github/copilot-instructions.kushi.md"
14
+ ],
15
+ "engines": {
16
+ "node": ">=18.0.0"
17
+ },
18
+ "dependencies": {
19
+ "@mozilla/readability": "^0.6.0",
20
+ "jsdom": "^29.1.1",
21
+ "jsonc-parser": "^3.3.1"
22
+ },
23
+ "keywords": [
24
+ "vscode",
25
+ "copilot",
26
+ "agents",
27
+ "kushi",
28
+ "project-evidence",
29
+ "workiq",
30
+ "m365",
31
+ "ai",
32
+ "cli"
33
+ ],
34
+ "repository": {
35
+ "type": "git",
36
+ "url": "git+https://github.com/gim-home/kushi.git"
37
+ },
38
+ "homepage": "https://gim-home.github.io/kushi/",
39
+ "bugs": {
40
+ "url": "https://github.com/gim-home/kushi/issues"
41
+ },
42
+ "license": "MIT",
43
+ "scripts": {
44
+ "test": "node --test src/check-workiq.test.mjs src/seed-config.test.mjs src/sanitize-workiq-input.test.mjs src/detect-vertex-repo.test.mjs src/vertex-validate.test.mjs src/emit-vertex.e2e.test.mjs src/config-root-resolve.test.mjs src/forbidden-workiq-phrasings.test.mjs src/multi-host-install.test.mjs src/eval-aggregator.test.mjs src/eval-runner.test.mjs",
45
+ "test:integration:bootstrap": "node src/bootstrap-dryrun.integration.test.mjs",
46
+ "smoke": "node scripts/smoke.mjs",
47
+ "eval": "pwsh plugin/skills/eval/run-evals.ps1 -Skill",
48
+ "eval:all": "pwsh plugin/skills/eval/run-evals.ps1 -All",
49
+ "eval:canary": "pwsh plugin/skills/eval/run-evals.ps1 -Canary",
50
+ "eval:baseline": "pwsh plugin/skills/eval/run-evals.ps1 -All -UpdateBaseline",
51
+ "prepublishOnly": "npm test && npm run smoke"
52
+ },
53
+ "publishConfig": {
54
+ "access": "public"
55
+ }
52
56
  }
package/plugin/agents/kushi.agent.md CHANGED
@@ -16,7 +16,7 @@ Kushi ships in three profiles. The installed profile is recorded in `kushi-insta
16
16
 
17
17
  | Profile | What's installed | Verbs available |
18
18
  |---|---|---|
19
- | `core` | Aggregator only: `setup`, `pull-*`, `consolidate-evidence`, `aggregate-project`, `ask-project`, `project-status`, `vertex-link`, `emit-vertex`, `self-check`, `intro` | `setup`, `aggregate`, `consolidate`, `status`, `pull`, `ask`, `vertex-link`, `emit-vertex` |
19
+ | `core` | Aggregator only: `setup`, `pull-*`, `consolidate-evidence`, `aggregate-project`, `ask-project`, `project-status`, `vertex-link`, `emit-vertex`, `self-check`, `eval`, `intro` | `setup`, `aggregate`, `consolidate`, `status`, `pull`, `ask`, `vertex-link`, `emit-vertex` |
20
20
  | `standard` *(default)* | core + `bootstrap-project`, `refresh-project`, `fde-intake`, `fde-report`, `fde-triage` + FDE reference pack | core + `bootstrap`, `refresh`, `fde-intake`, `fde-report`, `fde-triage` |
21
21
  | `full` | standard + `build-state` | standard + `state` |
22
22
  | **`preview`** *(opt-in)* | standard + `propose-ado-update`, `apply-ado-update` | standard + `propose-ado`, `apply-ado` |
package/plugin/instructions/multi-host-install.instructions.md ADDED
@@ -0,0 +1,125 @@
1
+ ---
2
+ name: "multi-host-install"
3
+ version: "1.0.0"
4
+ description: "USE WHEN installing kushi to a user-global host (Clawpilot or VS Code Chat), uninstalling, or wondering why kushi ships to two hosts (not three or N) and how the layouts differ. DO NOT USE for workspace-local .kushi/ install — that is the legacy --target vscode flow."
5
+ ---
6
+
7
+ # Multi-host install doctrine (v5.0.2+)
8
+
9
+ Kushi ships as a single host-agnostic skill bundle, installed into a host's
10
+ user-global skill folder by `npx kushi-agents`. **Exactly two hosts are
11
+ supported.** No third host. No N-host generality.
12
+
13
+ ## The 2-host matrix
14
+
15
+ | Host ID | Display name | Skill dir | Metadata file |
16
+ |--------------|-----------------------------------------|------------------------------------|----------------------------------------------|
17
+ | `clawpilot` | Clawpilot CLI | `~/.copilot/m-skills/kushi/` | `~/.copilot/m-skills/skills-metadata.json` |
18
+ | `vscode` | VS Code Chat ("GitHub Copilot Chat") | `~/.vscode/chat/skills/kushi/` | `~/.vscode/chat/skills/skills-metadata.json` |
19
+
20
+ > The VS Code Chat path is the canonical user-global skill folder location
21
+ > assumed by this doctrine. If a future VS Code Chat release ships a
22
+ > different canonical path, update `VSCODE_CHAT_DEST_SUBPATH` in
23
+ > `src/constants.mjs` — every other surface reads from there.
24
+
25
+ ## Why only these two
26
+
27
+ - **Clawpilot** — primary scheduled / overnight surface (e.g. `kushi refresh <project>`
28
+ run from a Clawpilot automation at 6 AM).
29
+ - **VS Code Chat** — primary interactive surface (`@kushi what's the MACC for X?`
30
+ asked the next morning).
31
+
32
+ The same user routinely lives in both — automation in Clawpilot, follow-up
33
+ questions in VS Code Chat. Both hosts read the **same** Evidence/ tree on disk,
34
+ so a refresh in one is visible from the other immediately.
35
+
36
+ A third host (e.g. a hypothetical web UI) would force per-host SKILL surgery
37
+ or a content-branching shim. The 2-host matrix is the deliberate stopping
38
+ point — it covers ≥ 99 % of the actual usage pattern and keeps cross-host
39
+ parity trivially enforceable.
40
+
41
+ ## Per-host layout
42
+
43
+ Both hosts get **byte-identical** content under their skill dir:
44
+
45
+ ```text
46
+ <host-skill-dir>/
47
+ ├── SKILL.md (mirrored from agents/kushi.agent.md)
48
+ ├── agents/kushi.agent.md
49
+ ├── instructions/<name>.instructions.md
50
+ ├── prompts/<name>.prompt.md
51
+ ├── skills/<name>/SKILL.md (+ references/, run.ps1, etc.)
52
+ ├── templates/<...>
53
+ ├── lib/<...>
54
+ ├── reference-packs/<...>
55
+ ├── config/
56
+ │ ├── shared/ (team-owned, safe to commit)
57
+ │ └── user/ (per-contributor, gitignored)
58
+ └── kushi-install.json (profile manifest)
59
+ ```
60
+
61
+ And at the **parent** dir, `skills-metadata.json` carries the host's skill
62
+ registry. The installer upserts a single `{"name": "kushi", ...}` entry that
63
+ points `instructions` at the host's own `SKILL.md` (absolute path).
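The upsert described above can be sketched as follows. This is only an illustration: the real implementation lives in `src/multi-host.mjs`, which this diff page does not show in full. The sketch assumes that `skills-metadata.json` holds a top-level array of entries, and the helper names `upsertSkillEntry` and `removeSkillEntry` as well as the `other-skill` entry are hypothetical.

```javascript
// Assumption: skills-metadata.json is modelled as an array of { name, instructions } entries.
// The actual on-disk shape is not shown in this diff.

// Insert the entry, or replace an existing entry with the same name in place.
function upsertSkillEntry(entries, entry) {
  const index = entries.findIndex((e) => e.name === entry.name);
  if (index === -1) return [...entries, entry];
  return entries.map((e, i) => (i === index ? entry : e));
}

// Remove the entry with the given name (the uninstall counterpart).
function removeSkillEntry(entries, name) {
  return entries.filter((e) => e.name !== name);
}

// Synthetic registry with one unrelated skill already present.
const before = [
  { name: 'other-skill', instructions: '/home/dev/.copilot/m-skills/other-skill/SKILL.md' },
];
const kushi = { name: 'kushi', instructions: '/home/dev/.copilot/m-skills/kushi/SKILL.md' };

const once = upsertSkillEntry(before, kushi);
const twice = upsertSkillEntry(once, kushi); // re-running the installer must not duplicate
const cleaned = removeSkillEntry(twice, 'kushi');

console.log(twice.map((e) => e.name).join(', '));
console.log(cleaned.map((e) => e.name).join(', '));
```

Both helpers return new arrays rather than mutating their input, which keeps a repeated install idempotent and leaves unrelated registry entries untouched.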
64
+
65
+ There is **no per-host content branching**. A SKILL.md that opens with `WHEN
66
+ running on Clawpilot, do X; WHEN running on VS Code Chat, do Y` is a defect.
67
+
68
+ ## Detection rules
69
+
70
+ A host is **detected** when its parent dir exists on the file system:
71
+
72
+ | Host | Detection probe |
73
+ |------------|---------------------------------------|
74
+ | clawpilot | `~/.copilot/m-skills/` exists |
75
+ | vscode | `~/.vscode/chat/skills/` exists |
76
+
77
+ `bin/cli.mjs` runs detection before any install/uninstall and prints
78
+ `Detected hosts: …` so the user sees which targets will be touched by the
79
+ default behavior.
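A minimal sketch of the detection probe, assuming the two parent directories from the table above. The function name `detectHosts` and the injected `exists` callback are hypothetical; a real installer would presumably pass something like `fs.existsSync` together with the user's home directory.

```javascript
// Parent directories whose presence marks a host as detected (taken from the table above).
const HOST_PARENT_DIRS = {
  clawpilot: '.copilot/m-skills',
  vscode: '.vscode/chat/skills',
};

// `exists` is injected so the probe can be exercised without touching the real file system.
function detectHosts(homeDir, exists) {
  const home = homeDir.replace(/[\\/]+$/, ''); // drop any trailing separator
  return Object.entries(HOST_PARENT_DIRS)
    .filter(([, subpath]) => exists(`${home}/${subpath}`))
    .map(([hostId]) => hostId);
}

// Fake file system in which only the Clawpilot parent dir exists.
const fakeDirs = new Set(['/home/dev/.copilot/m-skills']);
const detected = detectHosts('/home/dev', (p) => fakeDirs.has(p));
console.log(`Detected hosts: ${detected.join(', ') || '(none)'}`);
```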
80
+
81
+ ## Install / uninstall flags
82
+
83
+ | Flag | Effect |
84
+ |--------------------------------------|--------|
85
+ | `--clawpilot` | Install to Clawpilot host (alone: also legacy `--target clawpilot` path) |
86
+ | `--vscode` | Install to VS Code Chat host |
87
+ | `--all-hosts` | Install to BOTH hosts (regardless of detection) |
88
+ | `--uninstall` | Uninstall from all *detected* hosts |
89
+ | `--uninstall --clawpilot` | Uninstall from Clawpilot only |
90
+ | `--uninstall --vscode` | Uninstall from VS Code Chat only |
91
+ | `--uninstall --all` | Uninstall from BOTH hosts |
92
+ | (no flag) | Workspace install (legacy `.kushi/` in cwd) |
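The flag table can be read as a small decision function. The sketch below is one possible reading and not the shipped `bin/cli.mjs` logic; `resolvePlan` and its return shape are hypothetical. It treats `--all-hosts` as equivalent to `--all` during uninstall, which matches how `bin/cli.mjs` computes its `all` option.

```javascript
const ALL_HOSTS = ['clawpilot', 'vscode'];

// Map CLI args plus the detected hosts to { mode, hosts }.
function resolvePlan(args, detectedHosts) {
  const has = (flag) => args.includes(flag);
  const picked = ALL_HOSTS.filter((host) => has(`--${host}`));

  if (has('--uninstall')) {
    if (has('--all') || has('--all-hosts')) return { mode: 'uninstall', hosts: [...ALL_HOSTS] };
    // No explicit host flag: fall back to whatever was detected on disk.
    return { mode: 'uninstall', hosts: picked.length ? picked : [...detectedHosts] };
  }
  if (has('--all-hosts')) return { mode: 'install', hosts: [...ALL_HOSTS] }; // regardless of detection
  if (picked.length) return { mode: 'install', hosts: picked };
  return { mode: 'workspace', hosts: [] }; // legacy .kushi/ install in cwd
}

console.log(resolvePlan(['--uninstall'], ['vscode']));
console.log(resolvePlan(['--all-hosts'], []));
console.log(resolvePlan([], ['clawpilot']));
```

Note that in the shipped CLI, `--clawpilot` on its own still routes through the legacy `main.mjs` path; the sketch models only the host selection, not that routing detail.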
93
+
94
+ ## `kushi refresh <project>` is host-agnostic
95
+
96
+ `refresh` is just a SKILL — the same `plugin/skills/refresh-project/SKILL.md`
97
+ content runs under both hosts. It writes Evidence to disk at the user-
98
+ configured engagement root (`~/Documents/Engagements/<project>/` by default),
99
+ which lives **outside** the host skill dir. That is why a Clawpilot-driven
100
+ refresh is immediately visible to a VS Code Chat `ask` — neither host owns
101
+ the data.
102
+
103
+ ## Cross-host parity rule
104
+
105
+ Every SKILL.md, prompt, instruction, template, and lib asset MUST work
106
+ identically under both hosts. Concretely:
107
+
108
+ - No `if (HOST === 'clawpilot') ...` style branching in any markdown body.
109
+ - No host-specific paths in skill content (always use `<engagement-root>` /
110
+ `<project>` placeholders that resolve at runtime via the standard
111
+ `engagement-root-resolution.instructions.md` chain).
112
+ - The two `SKILL.md` files (one per host) are required to be **byte-identical**
113
+ — enforced by `self-check` `D32.multi-host` in deep mode (temp-dir dry-run).
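As a rough illustration of what "byte-identical" means for the parity rule, the sketch below compares two SKILL.md payloads byte for byte. It is not the `D32.multi-host` check itself (that lives in the self-check script); `sameBytes` is a hypothetical helper, and the inputs are synthetic strings rather than files read from the two host directories.

```javascript
// True only when both payloads encode to exactly the same bytes.
function sameBytes(a, b) {
  return Buffer.compare(Buffer.from(a), Buffer.from(b)) === 0;
}

const clawpilotCopy = '# Kushi\nHost-agnostic skill body.\n';
const vscodeCopy = '# Kushi\nHost-agnostic skill body.\n';
const crlfCopy = '# Kushi\r\nHost-agnostic skill body.\r\n';

console.log(sameBytes(clawpilotCopy, vscodeCopy)); // identical content
console.log(sameBytes(clawpilotCopy, crlfCopy)); // differs only in line endings
```

A byte-level comparison is stricter than a text diff: a copy that differs only in line endings still counts as a mismatch.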
114
+
115
+ If a future feature genuinely requires host-specific behavior (e.g. UI affordance
116
+ only one host provides), it belongs in a host-specific *helper* outside `plugin/`
117
+ — not inside the shared skill bundle.
118
+
119
+ ## See also
120
+
121
+ - `host-portability.instructions.md` — older doctrine on per-host portability of
122
+ individual skill calls (still in force for runtime tool selection).
123
+ - `D32.multi-host` self-check (deep) — validates installer + temp-dir layout.
124
+ - `src/multi-host.mjs` — the installer/uninstaller implementation.
125
+ - `src/multi-host-install.test.mjs` — node:test coverage.
package/plugin/instructions/skill-evals.instructions.md ADDED
@@ -0,0 +1,130 @@
1
+ ---
2
+ description: "v5.0.3 — Skill evals doctrine, adapted from https://agentskills.io/skill-creation/evaluating-skills. Every skill MUST ship an evals/ folder with at least 2 deterministic cases plus structured assertions; a per-skill pass-rate is the objective regression signal. Canary subset runs on every PR; full suite runs on demand. Real customer data is FORBIDDEN in fixtures — use synthetic data only."
3
+ ---
4
+
5
+ # Skill evals — doctrine
6
+
7
+ > Inspired by **<https://agentskills.io/skill-creation/evaluating-skills>**. Adapted to kushi's PowerShell + Node test stack and to our 2-host install matrix.
8
+
9
+ ## Why
10
+
11
+ Skills are prompts plus a runner. Prompts drift silently. Without an objective per-skill regression signal, every change is a gamble. Evals make that signal cheap:
12
+
13
+ - **Per-skill pass-rate** is the headline metric.
14
+ - **Latency** and **tokens** are secondary metrics (a pass-rate drop of ≥10pp, or a latency or token increase of ≥50%, flags a regression against the baseline).
15
+ - A **canary subset** runs on every PR (target: < 60s wall clock); the **full suite** runs on demand (`npm run eval:all`).
16
+
17
+ ## Where evals live
18
+
19
+ ```text
20
+ plugin/skills/<name>/
21
+ ├── SKILL.md
22
+ └── evals/
23
+ ├── evals.json ← REQUIRED — case list + assertions
24
+ └── fixtures/ ← OPTIONAL per-skill fixtures
25
+ ```
26
+
27
+ Cross-skill fixtures live at the repo root:
28
+
29
+ ```text
30
+ evals/
31
+ ├── baseline.json ← Committed; maintainer updates with `npm run eval:baseline`
32
+ └── fixtures/ ← Tiny synthetic evidence trees, ADO fixtures, etc.
33
+ ```
34
+
35
+ Per-run output goes to `Evidence/_evals/<timestamp>.json` (gitignored; not customer data).
36
+
37
+ ## Case schema
38
+
39
+ ```jsonc
40
+ {
41
+ "skill": "<skill-name>",
42
+ "cases": [
43
+ {
44
+ "id": "ap-citations-format",
45
+ "name": "ask-project emits weekly-csc citation form",
46
+ "input": "what was decided about MACC for fixture-acme?",
47
+ "fixture": "evals/fixtures/fixture-acme", // optional
48
+ "canary": true,
49
+ "grader_type": "script", // "script" | "llm"
50
+ "expected_assertions": [
51
+ { "type": "regex-match", "pattern": "\\[source:\\s*fixture-acme/email/weekly/" },
52
+ { "type": "regex-match", "pattern": "Source-layout:\\s*weekly-csc" }
53
+ ]
54
+ }
55
+ ]
56
+ }
57
+ ```
58
+
59
+ ### Required fields per case
60
+
61
+ - `id` — unique within the skill; kebab-case.
62
+ - `name` — human-readable.
63
+ - `input` — what gets passed to the skill (string OR object).
64
+ - `expected_assertions` — array, **≥ 1** entry (enforced by `D33.evals-have-assertions`).
65
+ - `grader_type` — `"script"` for deterministic graders, `"llm"` for rubric-based.
66
+
67
+ ### Optional fields
68
+
69
+ - `fixture` — repo-relative path to the fixture to point the skill at.
70
+ - `canary` — `true` to include in the fast CI subset.
71
+ - `args` — extra args forwarded to the skill script (e.g. `{ "DryRun": true }`).
72
+ - `skip` — `true` to skip (must include `skip_reason`).
73
+ - `timeout_ms` — override the runner default (30 000 ms).
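The required and optional field rules above could be checked with a small validator along these lines. This is a sketch under assumptions, not the shipped `evals.schema.json` or the `D33` self-check: `validateCase` is a hypothetical helper, the kebab-case regex is one reasonable interpretation, and arrays are rejected as `input` even though the doctrine only says "string OR object".

```javascript
// Returns a list of human-readable problems; an empty list means the case looks valid.
function validateCase(c) {
  const errors = [];
  if (typeof c.id !== 'string' || !/^[a-z0-9]+(-[a-z0-9]+)*$/.test(c.id)) {
    errors.push('id must be a kebab-case string');
  }
  if (typeof c.name !== 'string' || c.name.length === 0) {
    errors.push('name is required');
  }
  const isPlainObject = typeof c.input === 'object' && c.input !== null && !Array.isArray(c.input);
  if (typeof c.input !== 'string' && !isPlainObject) {
    errors.push('input must be a string or an object');
  }
  if (!Array.isArray(c.expected_assertions) || c.expected_assertions.length < 1) {
    errors.push('expected_assertions needs at least one entry');
  }
  if (!['script', 'llm'].includes(c.grader_type)) {
    errors.push('grader_type must be "script" or "llm"');
  }
  if (c.skip === true && !c.skip_reason) {
    errors.push('skip requires skip_reason');
  }
  return errors;
}

// The example case from the schema section above.
const goodCase = {
  id: 'ap-citations-format',
  name: 'ask-project emits weekly-csc citation form',
  input: 'what was decided about MACC for fixture-acme?',
  canary: true,
  grader_type: 'script',
  expected_assertions: [{ type: 'regex-match', pattern: 'Source-layout:\\s*weekly-csc' }],
};

console.log(validateCase(goodCase));
console.log(validateCase({ id: 'Bad_ID', name: '', input: 42, expected_assertions: [], grader_type: 'manual', skip: true }));
```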
74
+
75
+ ## Assertion types
76
+
77
+ | Type | Shape | Passes when |
78
+ |---|---|---|
79
+ | `file-exists` | `{ "type": "file-exists", "path": "..." }` | Path exists post-run (relative to fixture or evidence dir). |
80
+ | `file-contains` | `{ "type": "file-contains", "path": "...", "needle": "..." }` | File exists and substring is present. |
81
+ | `json-path-equals` | `{ "type": "json-path-equals", "path": "...", "json_path": "$.foo.bar", "equals": "v" }` | JSON file parses; dotted path value === expected. |
82
+ | `regex-match` | `{ "type": "regex-match", "pattern": "...", "flags": "i" }` | Captured stdout matches the regex. |
83
+ | `llm-rubric` | `{ "type": "llm-rubric", "rubric": "...", "min_score": 4 }` | LLM grader scores ≥ min on a 1–5 rubric. |
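To make the table concrete, here is a sketch of how the four deterministic assertion types might be graded. It is not the logic of `run-evals.ps1` (that runner is PowerShell and not reproduced here). The names `gradeAssertion` and `getByDottedPath`, the in-memory `files` map standing in for the post-run file system, and the sample file names are all assumptions made for illustration.

```javascript
// Resolve a simple "$.foo.bar" path; anything fancier is out of scope for this sketch.
function getByDottedPath(obj, jsonPath) {
  const keys = jsonPath.replace(/^\$\.?/, '').split('.').filter(Boolean);
  return keys.reduce((value, key) => (value == null ? undefined : value[key]), obj);
}

// run = { stdout: string, files: Map<path, text> } (hypothetical shape)
function gradeAssertion(assertion, run) {
  const { stdout, files } = run;
  switch (assertion.type) {
    case 'file-exists':
      return files.has(assertion.path);
    case 'file-contains':
      return files.has(assertion.path) && files.get(assertion.path).includes(assertion.needle);
    case 'json-path-equals': {
      if (!files.has(assertion.path)) return false;
      try {
        const doc = JSON.parse(files.get(assertion.path));
        return getByDottedPath(doc, assertion.json_path) === assertion.equals;
      } catch {
        return false; // file did not parse as JSON
      }
    }
    case 'regex-match':
      return new RegExp(assertion.pattern, assertion.flags ?? '').test(stdout);
    default:
      return false; // llm-rubric needs a live grader and is not modelled here
  }
}

// Synthetic run output; the file names are made up for the example.
const run = {
  stdout: '[source: fixture-acme/email/weekly/sample.md]\nSource-layout: weekly-csc',
  files: new Map([['state.json', '{"project":{"name":"fixture-acme"}}']]),
};

console.log(gradeAssertion({ type: 'regex-match', pattern: '\\[source:\\s*fixture-acme/email/weekly/' }, run));
console.log(gradeAssertion({ type: 'json-path-equals', path: 'state.json', json_path: '$.project.name', equals: 'fixture-acme' }, run));
```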
84
+
85
+ ## Run modes
86
+
87
+ The runner (`plugin/skills/eval/run-evals.ps1`) supports three dispatch modes:
88
+
89
+ 1. **Direct invocation** (default for `script` graders). Runs the skill's executable artifact (`run.ps1`, `*.mjs`, or a small probe stub) with the given input and fixture. Pure deterministic.
90
+ 2. **Sub-agent dispatch** (optional, gated by `-Live`). Forwards the case to a sub-agent. Used only for `llm-rubric` cases. Skipped in canary mode.
91
+ 3. **Recorded fixture replay** (for `pull-*` skills). Reads a recorded `--cached` output of a real pull and asserts against that, so no live M365 calls are needed.
92
+
93
+ For each case the runner records: `pass`, `duration_ms`, `tokens_in`, `tokens_out`, `stdout`, `stderr`, per-assertion `pass`/`reason`. The aggregate is a JSON file under `Evidence/_evals/` plus a one-line `benchmark.json` summary.
94
+
95
+ ## Canary set
96
+
97
+ Marked with `"canary": true`. Kept tiny so PRs stay fast.
98
+
99
+ Default canary set (v5.0.3):
100
+
101
+ - `ask-project`
102
+ - `bootstrap-project`
103
+ - `refresh-project`
104
+ - `link-entities`
105
+ - `build-state`
106
+ - `self-check`
107
+
108
+ ## Baseline + regression detection
109
+
110
+ - `evals/baseline.json` is **committed**.
111
+ - Each per-skill record carries the last green `pass_rate`, `mean_duration_ms`, and `mean_tokens_total`.
112
+ - `src/eval-aggregator.mjs` flags **regressions**:
113
+ - `pass_rate` drop ≥ 10 percentage points
114
+ - `mean_duration_ms` increase ≥ 50 %
115
+ - `mean_tokens_total` increase ≥ 50 %
116
+ - Maintainers refresh the baseline with `npm run eval:baseline` after deliberate behavior changes.
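The three thresholds above can be expressed as a short comparison. The sketch below is an illustration only and not the contents of `src/eval-aggregator.mjs`; `findRegressions` is a hypothetical name, and `pass_rate` is assumed to be a fraction between 0 and 1 (so 10 percentage points corresponds to 0.10).

```javascript
// Compare one skill's current record against its baseline record.
function findRegressions(baseline, current) {
  const flags = [];

  // Round to 0.1pp so that e.g. 0.9 - 0.8 is not missed because of floating-point error.
  const dropPp = Math.round((baseline.pass_rate - current.pass_rate) * 1000) / 10;
  if (dropPp >= 10) flags.push(`pass_rate dropped ${dropPp}pp`);

  for (const key of ['mean_duration_ms', 'mean_tokens_total']) {
    const base = baseline[key];
    if (base > 0 && (current[key] - base) / base >= 0.5) {
      flags.push(`${key} increased by 50% or more`);
    }
  }
  return flags;
}

const baselineRecord = { pass_rate: 0.9, mean_duration_ms: 1000, mean_tokens_total: 2000 };
const currentRecord = { pass_rate: 0.8, mean_duration_ms: 1400, mean_tokens_total: 3000 };

console.log(findRegressions(baselineRecord, currentRecord));
```

With these synthetic numbers the pass-rate drop (10pp) and the token increase (50%) are flagged, while the 40% latency increase stays under the threshold.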
117
+
118
+ ## Privacy + safety
119
+
120
+ - **No real customer data** in any fixture. Use `fixture-acme`-style synthetic names.
121
+ - `Evidence/_evals/` is in `.gitignore`.
122
+ - `pull-*` evals NEVER hit live M365 endpoints in canary mode. Use recorded `--cached` payloads or `--dry-run`.
123
+ - Tenant IDs / GUIDs in fixtures must be obviously fake (e.g. `00000000-...`).
124
+
125
+ ## References
126
+
127
+ - [agentskills.io — evaluating skills](https://agentskills.io/skill-creation/evaluating-skills) (source of truth)
128
+ - `plugin/skills/eval/SKILL.md` (the runner skill)
129
+ - `plugin/skills/eval/evals.schema.json` (JSON schema; self-check D33.evals-schema)
130
+ - `plugin/instructions/agentskills-compliance.instructions.md` (sibling doctrine — size + section caps)
package/plugin/skills/aggregate-project/evals/evals.json ADDED
@@ -0,0 +1,33 @@
1
+ {
2
+ "skill": "aggregate-project",
3
+ "version": "1.0.0",
4
+ "description": "Auto-seeded evals for aggregate-project. Replace with real cases as the skill matures.",
5
+ "cases": [
6
+ {
7
+ "id": "aggregate-project-smoke-1",
8
+ "name": "aggregate-project produces a non-empty response",
9
+ "input": "synthetic aggregate-project probe — canary smoke",
10
+ "canary": false,
11
+ "grader_type": "script",
12
+ "expected_assertions": [
13
+ {
14
+ "type": "regex-match",
15
+ "pattern": ".+"
16
+ }
17
+ ]
18
+ },
19
+ {
20
+ "id": "aggregate-project-smoke-2",
21
+ "name": "aggregate-project echoes case id",
22
+ "input": "case-id aggregate-project-smoke-2",
23
+ "canary": false,
24
+ "grader_type": "script",
25
+ "expected_assertions": [
26
+ {
27
+ "type": "regex-match",
28
+ "pattern": "aggregate-project-smoke-2"
29
+ }
30
+ ]
31
+ }
32
+ ]
33
+ }