pi-evalset-lab 0.1.0

Files changed (63)
  1. package/.copier-answers.yml +5 -0
  2. package/.githooks/pre-commit +12 -0
  3. package/.github/CODEOWNERS +12 -0
  4. package/.github/ISSUE_TEMPLATE/bug-report.yml +63 -0
  5. package/.github/ISSUE_TEMPLATE/config.yml +5 -0
  6. package/.github/ISSUE_TEMPLATE/docs.yml +39 -0
  7. package/.github/ISSUE_TEMPLATE/feature-request.yml +41 -0
  8. package/.github/VOUCHED.td +8 -0
  9. package/.github/dependabot.yml +13 -0
  10. package/.github/pull_request_template.md +34 -0
  11. package/.github/workflows/ci.yml +37 -0
  12. package/.github/workflows/publish.yml +60 -0
  13. package/.github/workflows/release-please.yml +25 -0
  14. package/.github/workflows/vouch-check-pr.yml +29 -0
  15. package/.github/workflows/vouch-manage.yml +34 -0
  16. package/.pi/extensions/startup-intake-router.ts +151 -0
  17. package/.pi/prompts/init-project-docs.md +32 -0
  18. package/.release-please-config.json +11 -0
  19. package/.release-please-manifest.json +3 -0
  20. package/AGENTS.md +39 -0
  21. package/CHANGELOG.md +43 -0
  22. package/CODE_OF_CONDUCT.md +50 -0
  23. package/CONTRIBUTING.md +28 -0
  24. package/NEXT_SESSION_PROMPT.md +14 -0
  25. package/README.md +246 -0
  26. package/SECURITY.md +34 -0
  27. package/SUPPORT.md +37 -0
  28. package/docs/dev/CONTRIBUTING.md +37 -0
  29. package/docs/dev/EXTENSION_SOP.md +43 -0
  30. package/docs/dev/next_steps.md +17 -0
  31. package/docs/dev/plans/001-initial-plan.md +24 -0
  32. package/docs/dev/status.md +21 -0
  33. package/docs/org/operating_model.md +39 -0
  34. package/docs/org/project-docs-intake.questions.json +60 -0
  35. package/docs/project/foundation.md +28 -0
  36. package/docs/project/incentives.md +17 -0
  37. package/docs/project/resources.md +26 -0
  38. package/docs/project/skills.md +17 -0
  39. package/docs/project/strategic_goals.md +18 -0
  40. package/docs/project/tactical_goals.md +39 -0
  41. package/docs/project/vision.md +21 -0
  42. package/examples/.gitkeep +0 -0
  43. package/examples/fixed-task-set-v2.json +127 -0
  44. package/examples/fixed-task-set-v3.json +126 -0
  45. package/examples/fixed-task-set.json +22 -0
  46. package/examples/system-baseline.txt +1 -0
  47. package/examples/system-candidate.txt +6 -0
  48. package/extensions/evalset.ts +1090 -0
  49. package/external/.gitkeep +0 -0
  50. package/ontology/.gitkeep +0 -0
  51. package/package.json +31 -0
  52. package/policy/security-policy.json +10 -0
  53. package/prek.toml +15 -0
  54. package/prompts/implementation-planning.md +17 -0
  55. package/prompts/init-project-docs.md +32 -0
  56. package/prompts/security-review.md +17 -0
  57. package/scripts/docs-list.sh +50 -0
  58. package/scripts/init-project-docs.sh +56 -0
  59. package/scripts/install-hooks.sh +13 -0
  60. package/scripts/sync-to-live.sh +91 -0
  61. package/scripts/validate-structure.sh +325 -0
  62. package/src/.gitkeep +0 -0
  63. package/tests/.gitkeep +0 -0
package/CHANGELOG.md ADDED
@@ -0,0 +1,43 @@
1
+ ---
2
+ summary: "Changelog for scaffold evolution."
3
+ read_when:
4
+ - "Preparing a release or reviewing history."
5
+ system4d:
6
+ container: "Release log for this extension package."
7
+ compass: "Track meaningful deltas per version."
8
+ engine: "Document changes at release boundaries."
9
+ fog: "Versioning policy may evolve with team preference."
10
+ ---
11
+
12
+ # Changelog
13
+
14
+ All notable changes to this project should be documented here.
15
+
16
+ ## [Unreleased]
17
+
18
+ ### Added
19
+
20
+ - Added `/evalset` MVP command with subcommands:
21
+ - `init` to generate a starter fixed-task-set dataset
22
+ - `run` to evaluate one variant against a dataset
23
+ - `compare` to evaluate baseline vs candidate system prompts
24
+ - Added example files in `examples/`:
25
+ - `fixed-task-set.json`
26
+ - `fixed-task-set-v2.json`
27
+ - `fixed-task-set-v3.json`
28
+ - `system-baseline.txt`
29
+ - `system-candidate.txt`
30
+ - Added report output support to `.evalset/reports/*.json` with per-case and aggregate metrics.
31
+ - Added run identity metadata to reports (`runId`, `datasetHash`, `casesHash`, `variantHash`).
32
+ - Reduced session message payload size by storing only lightweight report metadata instead of full report bodies.
33
+
34
+ ### Changed
35
+
36
+ - Clarified `/evalset` invocation docs: use `pi -p` (or `pi -e ... -p`) for non-interactive runs; `/evalset` is not a standalone shell binary.
37
+ - Added the same non-interactive invocation note to `/evalset help` output.
38
+
39
+ ## [0.1.0] - 2026-02-08
40
+
41
+ ### Added
42
+
43
+ - Initial production-ready scaffold generated from template v2.
package/CODE_OF_CONDUCT.md ADDED
@@ -0,0 +1,50 @@
1
+ ---
2
+ summary: "Community behavior expectations and enforcement path."
3
+ read_when:
4
+ - "Participating in issues, discussions, or pull requests."
5
+ - "Handling conduct incidents."
6
+ system4d:
7
+ container: "Shared standards for respectful collaboration."
8
+ compass: "Safety, respect, and clarity over conflict escalation."
9
+ engine: "Observe -> report -> review -> enforce -> document."
10
+ fog: "Context around incidents can be incomplete at first report."
11
+ ---
12
+
13
+ # Code of Conduct
14
+
15
+ ## Our commitment
16
+
17
+ We want this repository to be a respectful, harassment-free space for everyone,
18
+ regardless of background or identity.
19
+
20
+ ## Expected behavior
21
+
22
+ - Be respectful and constructive.
23
+ - Assume good intent, ask clarifying questions, and stay technical.
24
+ - Accept feedback and correct mistakes quickly.
25
+ - Keep discussions focused on project outcomes.
26
+
27
+ ## Unacceptable behavior
28
+
29
+ - Harassment, threats, hate speech, or intimidation.
30
+ - Personal attacks, repeated hostile behavior, or trolling.
31
+ - Sharing private information without consent.
32
+ - Sexualized language or unwanted attention.
33
+
34
+ ## Reporting incidents
35
+
36
+ If you experience or witness unacceptable behavior:
37
+
38
+ 1. Contact maintainers privately using the channels listed in [SUPPORT.md](SUPPORT.md).
39
+ 2. Share relevant links/screenshots and timeline details.
40
+ 3. Do not post sensitive personal data publicly.
41
+
42
+ ## Enforcement
43
+
44
+ Maintainers may remove or edit comments, close threads, or block contributors for
45
+ behavior that violates this policy. Responses may include warning, temporary ban,
46
+ or permanent ban depending on severity and repetition.
47
+
48
+ ## Attribution
49
+
50
+ Adapted from the [Contributor Covenant](https://www.contributor-covenant.org/version/2/0/code_of_conduct.html).
package/CONTRIBUTING.md ADDED
@@ -0,0 +1,28 @@
1
+ ---
2
+ summary: "Top-level contribution entrypoint linking to the detailed contributor guide."
3
+ read_when:
4
+ - "Preparing to submit code or docs changes."
5
+ - "Looking for contribution quality gates."
6
+ system4d:
7
+ container: "Contribution intake and quality policy."
8
+ compass: "Small, reviewable, verified changes."
9
+ engine: "Read guide -> implement -> validate -> open PR."
10
+ fog: "Project-specific constraints may evolve with release policy changes."
11
+ ---
12
+
13
+ # Contributing
14
+
15
+ Primary contributor guide: [docs/dev/CONTRIBUTING.md](docs/dev/CONTRIBUTING.md)
16
+
17
+ ## Minimum checklist
18
+
19
+ 1. Read applicable docs (`npm run docs:list`).
20
+ 2. Keep changes scoped.
21
+ 3. Run `npm run check`.
22
+ 4. Update docs/changelog when behavior changes.
23
+ 5. Open a PR with validation output.
24
+
25
+ ## Conduct + support
26
+
27
+ - [CODE_OF_CONDUCT.md](CODE_OF_CONDUCT.md)
28
+ - [SUPPORT.md](SUPPORT.md)
package/NEXT_SESSION_PROMPT.md ADDED
@@ -0,0 +1,14 @@
1
+ ---
2
+ summary: "Session handoff prompt for pi-evalset-lab."
3
+ read_when:
4
+ - "Starting the next focused development session."
5
+ system4d:
6
+ container: "Session handoff artifact."
7
+ compass: "Resume work quickly with explicit priorities."
8
+ engine: "Capture context, constraints, and next actions."
9
+ fog: "Staleness risk if not updated after major changes."
10
+ ---
11
+
12
+ # Next session prompt for pi-evalset-lab
13
+
14
+ Use this file to capture exact follow-up tasks for the next coding session.
package/README.md ADDED
@@ -0,0 +1,246 @@
1
+ ---
2
+ summary: "Overview and quickstart for pi-evalset-lab."
3
+ read_when:
4
+ - "Starting work in this repository."
5
+ system4d:
6
+ container: "Repository scaffold for a pi extension package."
7
+ compass: "Ship small, safe, testable extension iterations."
8
+ engine: "Plan -> implement -> verify with docs and hooks in sync."
9
+ fog: "Unknown runtime integration edge cases until first live sync."
10
+ ---
11
+
12
+ # pi-evalset-lab
13
+
14
+ Production-ready starter scaffold for a pi extension package.
15
+
16
+ ## Quickstart
17
+
18
+ 1. Install dependencies (if you add any):
19
+
20
+ ```bash
21
+ npm install
22
+ ```
23
+
24
+ 2. Test with pi:
25
+
26
+ ```bash
27
+ pi -e ./extensions/evalset.ts
28
+ ```
29
+
30
+ 3. Install package into pi:
31
+
32
+ ```bash
33
+ pi install /absolute/path/to/pi-evalset-lab
34
+ ```
35
+
36
+ ## evalset command (MVP)
37
+
38
+ This extension adds `/evalset` for fixed-task-set evaluation runs.
39
+
40
+ ### Commands
41
+
42
+ ```bash
43
+ /evalset help
44
+ /evalset init [dataset-path] [--force]
45
+ /evalset run <dataset.json> [--system-file <path>] [--system-text <text>] [--variant <name>] [--max-cases <n>] [--temperature <n>] [--out <report.json>]
46
+ /evalset compare <dataset.json> <baseline-system.txt> <candidate-system.txt> [--baseline-name <name>] [--candidate-name <name>] [--max-cases <n>] [--temperature <n>] [--out <report.json>]
47
+ ```
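Note on the usage lines above: a minimal sketch of how the `/evalset run` flags could be parsed from already-split tokens. The flag names come from the README usage line; `RunArgs`, `parseRunArgs`, and the error messages are hypothetical and are not taken from `extensions/evalset.ts`, whose actual parser is not shown here.

```typescript
// Hypothetical result shape; the extension's real internal types are not shown in this diff.
interface RunArgs {
  datasetPath: string;
  systemFile?: string;
  systemText?: string;
  variant?: string;
  maxCases?: number;
  temperature?: number;
  out?: string;
}

// Flags from the README usage line for `/evalset run`; each one takes exactly one value.
const STRING_FLAGS = new Map<string, "systemFile" | "systemText" | "variant" | "out">([
  ["--system-file", "systemFile"],
  ["--system-text", "systemText"],
  ["--variant", "variant"],
  ["--out", "out"],
]);
const NUMBER_FLAGS = new Map<string, "maxCases" | "temperature">([
  ["--max-cases", "maxCases"],
  ["--temperature", "temperature"],
]);

function parseRunArgs(tokens: string[]): RunArgs {
  const args: RunArgs = { datasetPath: "" };
  for (let i = 0; i < tokens.length; i++) {
    const token = tokens[i];
    const stringKey = STRING_FLAGS.get(token);
    const numberKey = NUMBER_FLAGS.get(token);
    if (stringKey !== undefined || numberKey !== undefined) {
      // Consume the token after the flag as its value.
      const value: string | undefined = tokens[i + 1];
      i += 1;
      if (value === undefined) throw new Error(`Missing value for ${token}`);
      if (stringKey !== undefined) {
        args[stringKey] = value;
      } else if (numberKey !== undefined) {
        const parsed = Number(value);
        if (!Number.isFinite(parsed)) throw new Error(`Expected a number for ${token}`);
        args[numberKey] = parsed;
      }
    } else if (token.startsWith("--")) {
      throw new Error(`Unknown flag: ${token}`);
    } else if (args.datasetPath === "") {
      // The first positional token is treated as <dataset.json>.
      args.datasetPath = token;
    } else {
      throw new Error(`Unexpected argument: ${token}`);
    }
  }
  if (args.datasetPath === "") throw new Error("Missing <dataset.json>");
  return args;
}
```

Rejecting unknown flags and missing values up front keeps a mistyped option from silently changing which variant gets evaluated.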
48
+
49
+ ### Running modes
50
+
51
+ `/evalset` is a pi slash command, not a shell executable.
52
+
53
+ Interactive mode:
54
+
55
+ ```bash
56
+ pi -e ./extensions/evalset.ts
57
+ # then inside pi:
58
+ /evalset compare examples/fixed-task-set.json examples/system-baseline.txt examples/system-candidate.txt
59
+ ```
60
+
61
+ Non-interactive mode (scripts/CI):
62
+
63
+ ```bash
64
+ pi -e ./extensions/evalset.ts -p "/evalset compare examples/fixed-task-set.json examples/system-baseline.txt examples/system-candidate.txt"
65
+ # or, if extension already installed/enabled:
66
+ pi -p "/evalset compare examples/fixed-task-set.json examples/system-baseline.txt examples/system-candidate.txt"
67
+ ```
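Note on scripting the non-interactive mode: a small sketch of building the argument list for a `pi` call from a Node script. Only the `-e` and `-p` flags come from the README; `buildPiArgs` is a hypothetical helper, not part of this package.

```typescript
// Hypothetical helper: builds argv for a non-interactive `pi` invocation.
// `-e <extension>` is included only when an extension path is given, which
// mirrors the two README variants (extension loaded ad hoc vs. already installed).
function buildPiArgs(slashCommand: string, extensionPath?: string): string[] {
  const args: string[] = [];
  if (extensionPath !== undefined) {
    args.push("-e", extensionPath);
  }
  // The whole slash command is passed as a single argument after -p.
  args.push("-p", slashCommand);
  return args;
}
```

Passing the resulting array to something like Node's `spawnSync("pi", args)` keeps the slash command as one argument, so no extra shell quoting is needed.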
68
+
69
+ ### Example workflow (inside pi)
70
+
71
+ ```bash
72
+ /evalset run examples/fixed-task-set.json --variant baseline
73
+ /evalset compare examples/fixed-task-set.json examples/system-baseline.txt examples/system-candidate.txt
74
+ ```
75
+
76
+ ### Included datasets
77
+
78
+ - `examples/fixed-task-set.json` — tiny smoke set (3 cases)
79
+ - `examples/fixed-task-set-v2.json` — larger first pass set
80
+ - `examples/fixed-task-set-v3.json` — less brittle checks (recommended)
81
+
82
+ The command writes JSON reports to:
83
+ - explicit `--out <path>` when provided
84
+ - otherwise `.evalset/reports/*.json` under your current project directory
85
+
86
+ Each report includes run identity metadata:
87
+ - `runId`
88
+ - `datasetHash`
89
+ - `casesHash`
90
+ - `variantHash` (run) or baseline/candidate variant hashes (compare)
91
+
92
+ Session messages only keep lightweight report metadata (`reportPath`, ids, summary metrics), not full report bodies.
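Note on run identity metadata: a sketch of how stable content hashes and a lightweight session record might be derived. The field names `reportPath`, `runId`, `datasetHash`, `casesHash`, and `variantHash` come from the README; everything else is an assumption. In particular, `fnv1a32` is a simple non-cryptographic stand-in (FNV-1a over UTF-16 code units), the `summary` shape is hypothetical, and the hash algorithm the extension actually uses is not shown in this diff.

```typescript
// Serialize with sorted object keys so logically equal values hash identically.
// Assumes JSON-compatible input (no undefined, functions, or cycles).
function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map((item) => canonicalJson(item)).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const record = value as Record<string, unknown>;
    const keys = Object.keys(record).sort();
    return `{${keys.map((k) => `${JSON.stringify(k)}:${canonicalJson(record[k])}`).join(",")}}`;
  }
  return JSON.stringify(value);
}

// Stand-in hash: 32-bit FNV-1a over UTF-16 code units, returned as 8 hex digits.
function fnv1a32(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return ("00000000" + (hash >>> 0).toString(16)).slice(-8);
}

// Hypothetical report shape; only the identity field names are from the README.
interface ReportSketch {
  runId: string;
  datasetHash: string;
  casesHash: string;
  variantHash: string;
  summary: { passRate: number };
  cases: unknown[];
}

// Keep only lightweight metadata for the session message; the full body stays on disk.
function toSessionMetadata(reportPath: string, report: ReportSketch) {
  const { runId, datasetHash, casesHash, variantHash, summary } = report;
  return { reportPath, runId, datasetHash, casesHash, variantHash, summary };
}
```

Hashing the canonical form rather than the raw file text means that reordering keys in a dataset does not change its `datasetHash`, while any change to a case does.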
93
+
94
+ ## Optional core hooks (future, not required for this MVP)
95
+
96
+ This extension works today without core changes. If we decide to harden further, optional core support could include:
97
+
98
+ 1. Stable agent-level lineage IDs (`runId`/`traceId`) across extension events.
99
+ 2. Explicit reproducibility capability metadata in `pi-ai` (e.g. seed support and determinism caveats per provider/model).
100
+ 3. Shared canonical provider payload hash helper in `pi-ai`.
101
+ 4. A headless agent-eval API for tool-heavy/full agent-loop benchmark runs.
102
+
103
+ ## Repository checks
104
+
105
+ Run:
106
+
107
+ ```bash
108
+ npm run check
109
+ ```
110
+
111
+ This executes [scripts/validate-structure.sh](scripts/validate-structure.sh).
112
+
113
+ ## Release + security baseline
114
+
115
+ This scaffold defaults to **release-please** for single-package release PR + tag flow (`vX.Y.Z`) and npm trusted publishing via OIDC.
116
+
117
+ Included files:
118
+
119
+ - [CI workflow](.github/workflows/ci.yml)
120
+ - [release-please workflow](.github/workflows/release-please.yml)
121
+ - [publish workflow](.github/workflows/publish.yml)
122
+ - [Dependabot config](.github/dependabot.yml)
123
+ - [CODEOWNERS](.github/CODEOWNERS)
124
+ - [release-please config](.release-please-config.json)
125
+ - [release-please manifest](.release-please-manifest.json)
126
+ - [Security policy](SECURITY.md)
127
+
128
+ Before first production release:
129
+
130
+ 1. Confirm/adjust owners in [.github/CODEOWNERS](.github/CODEOWNERS).
131
+ 2. Enable branch protection on `main`.
132
+ 3. Configure npm Trusted Publishing for this repo + [publish workflow](.github/workflows/publish.yml).
133
+ 4. Merge release PR from release-please, then publish from GitHub release.
134
+
135
+ ## Issue + PR intake baseline
136
+
137
+ Included files:
138
+
139
+ - [Bug report form](.github/ISSUE_TEMPLATE/bug-report.yml)
140
+ - [Feature request form](.github/ISSUE_TEMPLATE/feature-request.yml)
141
+ - [Docs request form](.github/ISSUE_TEMPLATE/docs.yml)
142
+ - [Issue template config](.github/ISSUE_TEMPLATE/config.yml)
143
+ - [PR template](.github/pull_request_template.md)
144
+ - [Code of conduct](CODE_OF_CONDUCT.md)
145
+ - [Support guide](SUPPORT.md)
146
+ - [Top-level contributing guide](CONTRIBUTING.md)
147
+
148
+ ## Vouch trust gate baseline
149
+
150
+ Included files:
151
+
152
+ - [Vouched contributors list](.github/VOUCHED.td)
153
+ - [PR trust gate workflow](.github/workflows/vouch-check-pr.yml)
154
+ - [Issue-comment trust management workflow](.github/workflows/vouch-manage.yml)
155
+
156
+ Default behavior:
157
+
158
+ - PR workflow runs on `pull_request_target` (`opened`, `reopened`).
159
+ - `require-vouch: true` and `auto-close: true` are enabled by default.
160
+ - Maintainers can comment `vouch`, `denounce`, or `unvouch` on issues to update trust state.
161
+ - Vouch actions are SHA pinned (`0e11a71bba23218a284d3ecca162e75a110fd7e3`) for reproducibility and supply-chain review.
162
+
163
+ Bootstrap step:
164
+
165
+ - Confirm/adjust entries in [.github/VOUCHED.td](.github/VOUCHED.td) before enforcing production policy.
166
+
167
+ ## Docs discovery
168
+
169
+ Run:
170
+
171
+ ```bash
172
+ npm run docs:list
173
+ npm run docs:list:workspace
174
+ npm run docs:list:json
175
+ ```
176
+
177
+ Wrapper script: [scripts/docs-list.sh](scripts/docs-list.sh)
178
+
179
+ Resolution order:
180
+ 1. `DOCS_LIST_SCRIPT`
181
+ 2. `./scripts/docs-list.mjs` (if vendored)
182
+ 3. `~/ai-society/core/agent-scripts/scripts/docs-list.mjs`
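Note on the resolution order: the three-step lookup can be sketched as a pure function. The candidate paths and the `DOCS_LIST_SCRIPT` variable come from the README; `resolveDocsListScript` and its parameters are hypothetical, and whether the real wrapper checks that the override path exists is an assumption here.

```typescript
// Hypothetical sketch of the documented lookup order for docs-list.mjs.
// `exists` is injected so the function stays testable without touching the filesystem.
function resolveDocsListScript(
  env: Record<string, string | undefined>,
  homeDir: string,
  exists: (path: string) => boolean,
): string | undefined {
  const candidates: Array<string | undefined> = [
    env["DOCS_LIST_SCRIPT"], // 1. explicit override
    "./scripts/docs-list.mjs", // 2. vendored copy
    `${homeDir}/ai-society/core/agent-scripts/scripts/docs-list.mjs`, // 3. shared location
  ];
  for (const candidate of candidates) {
    // Assumption: an unset, empty, or missing candidate falls through to the next one.
    if (candidate !== undefined && candidate !== "" && exists(candidate)) {
      return candidate;
    }
  }
  return undefined;
}
```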
183
+
184
+ ## Copier lifecycle policy
185
+
186
+ - Keep `.copier-answers.yml` committed.
187
+ - Do not edit `.copier-answers.yml` manually.
188
+ - Run from a clean destination repo (commit or stash pending changes first).
189
+ - Use `copier update --trust` when `.copier-answers.yml` includes `_commit` and update is supported.
190
+ - In non-interactive shells/CI, append `--defaults` to update/recopy.
191
+ - Use `copier recopy --trust` when update is unavailable (for example local non-VCS source) or cannot reconcile cleanly.
192
+ - After recopy, re-apply local deltas intentionally and run `npm run check`.
193
+
194
+ ## Hook behavior
195
+
196
+ - Git uses `.githooks/pre-commit` (configured by [scripts/install-hooks.sh](scripts/install-hooks.sh)).
197
+ - If `prek` is available, the hook runs `prek` using [prek.toml](prek.toml).
198
+ - If `prek` is not available, the hook falls back to `scripts/validate-structure.sh`.
199
+
200
+ Install options for `prek`:
201
+
202
+ ```bash
203
+ npm add -D @j178/prek
204
+ # or
205
+ npm install -g @j178/prek
206
+ ```
207
+
208
+ ## Startup interview flow (project-local)
209
+
210
+ - [`.pi/extensions/startup-intake-router.ts`](.pi/extensions/startup-intake-router.ts) watches the first non-command message in a session.
211
+ - It converts your startup intent into a prefilled command:
212
+ - `/init-project-docs "<your intent>"`
213
+ - [`.pi/prompts/init-project-docs.md`](.pi/prompts/init-project-docs.md) then drives the `interview` tool using [docs/org/project-docs-intake.questions.json](docs/org/project-docs-intake.questions.json).
214
+
215
+ Utility commands:
216
+
217
+ - `/startup-intake-router-status`
218
+ - `/startup-intake-router-reset`
219
+
220
+ ## Live sync helper
221
+
222
+ Use [scripts/sync-to-live.sh](scripts/sync-to-live.sh) to copy the package extension to
223
+ `~/.pi/agent/extensions/`.
224
+
225
+ Optional flags:
226
+
227
+ - `--with-prompts`
228
+ - `--with-policy`
229
+ - `--all` (prompts + policy)
230
+
231
+ After sync, run `/reload` in pi.
232
+
233
+ ## Docs map
234
+
235
+ - [Organization operating model](docs/org/operating_model.md)
236
+ - [Project foundation model](docs/project/foundation.md)
237
+ - [Project vision](docs/project/vision.md)
238
+ - [Project incentives](docs/project/incentives.md)
239
+ - [Project resources](docs/project/resources.md)
240
+ - [Project skills](docs/project/skills.md)
241
+ - [Strategic goals](docs/project/strategic_goals.md)
242
+ - [Tactical goals](docs/project/tactical_goals.md)
243
+ - [Contributor guide](docs/dev/CONTRIBUTING.md)
244
+ - [Extension SOP](docs/dev/EXTENSION_SOP.md)
245
+ - [Next steps](docs/dev/next_steps.md)
246
+ - [Status](docs/dev/status.md)
package/SECURITY.md ADDED
@@ -0,0 +1,34 @@
1
+ ---
2
+ summary: "Security reporting process and release hardening baseline."
3
+ read_when:
4
+ - "Reporting a vulnerability."
5
+ - "Reviewing release and workflow security controls."
6
+ system4d:
7
+ container: "Security policy for maintainers and contributors."
8
+ compass: "Private reporting, least privilege, auditable releases."
9
+ engine: "Report privately -> triage -> patch -> verify -> disclose."
10
+ fog: "Dependency and ecosystem risk shifts over time."
11
+ ---
12
+
13
+ # Security Policy
14
+
15
+ ## Supported versions
16
+
17
+ Security fixes target the latest release and `main` branch.
18
+
19
+ ## Reporting a vulnerability
20
+
21
+ Use **private reporting**.
22
+
23
+ 1. Preferred: GitHub Security tab -> **Report a vulnerability**.
24
+ 2. If private reporting is unavailable, open a minimal issue titled
25
+ `Security contact request` without exploit details and request a private channel.
26
+ 3. Include impact, affected versions, and reproduction steps.
27
+ 4. Avoid public disclosure until maintainers confirm a fix/release plan.
28
+
29
+ ## Release and supply-chain baseline
30
+
31
+ - Release flow uses release-please PRs before tags/releases.
32
+ - Publish flow uses npm Trusted Publishing (OIDC) and `npm publish --provenance`.
33
+ - Workflow permissions default to read-only and are elevated only per job.
34
+ - Third-party actions must stay explicit; high-risk paths should be SHA pinned.
package/SUPPORT.md ADDED
@@ -0,0 +1,37 @@
1
+ ---
2
+ summary: "How users and contributors request help or report problems."
3
+ read_when:
4
+ - "Needing help, troubleshooting, or reporting non-security issues."
5
+ - "Deciding where to file bug/feature/docs requests."
6
+ system4d:
7
+ container: "Support intake and routing guidance."
8
+ compass: "Route requests to the right channel with enough context."
9
+ engine: "Self-check -> search -> file focused issue -> iterate."
10
+ fog: "Reproduction context is often incomplete in first reports."
11
+ ---
12
+
13
+ # Support
14
+
15
+ ## Before opening an issue
16
+
17
+ 1. Read [README.md](README.md) and relevant docs in `docs/`.
18
+ 2. Search existing issues/PRs for duplicates.
19
+ 3. Re-test on the latest release.
20
+
21
+ ## Open the right issue type
22
+
23
+ Use the GitHub issue forms for:
24
+
25
+ - Bug reports
26
+ - Feature requests
27
+ - Documentation improvements
28
+
29
+ ## Security reports
30
+
31
+ For vulnerabilities, follow [SECURITY.md](SECURITY.md) and use private reporting.
32
+ Do **not** post exploit details in public issues.
33
+
34
+ ## Maintainer response expectations
35
+
36
+ This project may be maintained part-time. Triage and response times can vary,
37
+ but actionable reports with reproduction details are prioritized.
package/docs/dev/CONTRIBUTING.md ADDED
@@ -0,0 +1,37 @@
1
+ ---
2
+ summary: "Contribution workflow for this extension repository."
3
+ read_when:
4
+ - "Before opening PRs or submitting local changes."
5
+ system4d:
6
+ container: "Contributor process and quality gates."
7
+ compass: "Small, validated, documented changes."
8
+ engine: "Branch -> implement -> check -> document -> review."
9
+ fog: "Process details may adjust with team scale."
10
+ ---
11
+
12
+ # Contributing
13
+
14
+ ## Workflow
15
+
16
+ 1. Create a focused branch.
17
+ 2. Run `npm run docs:list` and read matched docs before cross-cutting changes.
18
+ 3. Implement one scoped change.
19
+ 4. Run `npm run check`.
20
+ 5. Update docs/changelog where relevant.
21
+ 6. Open PR with concise rationale and validation output.
22
+
23
+ ## Standards
24
+
25
+ - Keep diffs small and reviewable.
26
+ - Preserve markdown frontmatter in generated docs.
27
+ - Prefer explicit scripts over manual one-off commands.
28
+
29
+ ## Copier policy
30
+
31
+ - Keep `.copier-answers.yml` in version control.
32
+ - Do not edit `.copier-answers.yml` manually.
33
+ - Run update/recopy from a clean destination repo (commit or stash pending changes first).
34
+ - Use `copier update --trust` when `.copier-answers.yml` includes `_commit` and update is supported.
35
+ - In non-interactive shells/CI, append `--defaults` to update/recopy.
36
+ - Use `copier recopy --trust` when update is unavailable (for example local non-VCS source) or cannot reconcile cleanly.
37
+ - After recopy, re-apply local deltas intentionally and run `npm run check`.
package/docs/dev/EXTENSION_SOP.md ADDED
@@ -0,0 +1,43 @@
1
+ ---
2
+ summary: "Lifecycle SOP for extension delivery and maintenance."
3
+ read_when:
4
+ - "Planning, implementing, verifying, releasing, or maintaining extension work."
5
+ system4d:
6
+ container: "End-to-end extension operating procedure."
7
+ compass: "Consistent quality from idea to maintenance."
8
+ engine: "plan -> implement -> verify -> release -> maintain."
9
+ fog: "Unknowns resolved through incremental validation loops."
10
+ ---
11
+
12
+ # Extension SOP
13
+
14
+ ## 1) Plan
15
+
16
+ - Define scope and acceptance criteria.
17
+ - Run `npm run docs:list` and read docs matching your task domain.
18
+ - Capture work in `docs/dev/plans/`.
19
+ - Confirm risks and dependencies.
20
+
21
+ ## 2) Implement
22
+
23
+ - Build in small commits.
24
+ - Keep command/tool behavior explicit.
25
+ - Update docs as behavior changes.
26
+
27
+ ## 3) Verify
28
+
29
+ - Run `npm run check`.
30
+ - Execute relevant extension tests.
31
+ - Validate prompt templates if changed.
32
+
33
+ ## 4) Release
34
+
35
+ - Update `CHANGELOG.md`.
36
+ - Tag/version according to team policy.
37
+ - Sync extension to live pi when needed.
38
+
39
+ ## 5) Maintain
40
+
41
+ - Monitor regressions and user feedback.
42
+ - Re-run validation after dependency/script changes.
43
+ - Keep `docs/dev/status.md` and `docs/dev/next_steps.md` current.
package/docs/dev/next_steps.md ADDED
@@ -0,0 +1,17 @@
1
+ ---
2
+ summary: "Prioritized next actions for active development."
3
+ read_when:
4
+ - "Starting a coding session or grooming tasks."
5
+ system4d:
6
+ container: "Execution queue for maintainers."
7
+ compass: "Maintain momentum with clear, ordered tasks."
8
+ engine: "Do highest-leverage item first, then validate."
9
+ fog: "Task order may change after discovery."
10
+ ---
11
+
12
+ # Next steps
13
+
14
+ 1. Complete npm publish on npmjs (login, registry, publish, verify install).
15
+ 2. Add at least one automated test for argument parsing and expectation scoring.
16
+ 3. Add a small repeatable report-share helper (JSON -> static HTML export command/script).
17
+ 4. Evaluate optional LLM-judge scoring mode (`expectJudgePrompt`) after parser/scoring tests exist.
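Note on step 3 above: a rough sketch of what a JSON -> static HTML export could look like. The report schema is not shown in this diff, so `ReportLike` and `CaseResult` are hypothetical shapes (only `runId` is a documented field), and `renderReportHtml` is an illustrative name rather than a planned API.

```typescript
// Hypothetical per-case and report shapes, assumed for illustration only.
interface CaseResult {
  id: string;
  passed: boolean;
}
interface ReportLike {
  runId: string;
  cases: CaseResult[];
}

// Escape text before embedding it in HTML so case ids cannot inject markup.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Render a self-contained HTML page with one table row per case.
function renderReportHtml(report: ReportLike): string {
  const passedCount = report.cases.filter((c) => c.passed).length;
  const rows = report.cases
    .map((c) => `<tr><td>${escapeHtml(c.id)}</td><td>${c.passed ? "pass" : "fail"}</td></tr>`)
    .join("\n");
  return [
    "<!doctype html>",
    `<title>evalset report ${escapeHtml(report.runId)}</title>`,
    `<p>${passedCount}/${report.cases.length} passed</p>`,
    "<table>",
    "<tr><th>case</th><th>result</th></tr>",
    rows,
    "</table>",
  ].join("\n");
}
```

A pure string-returning function like this is easy to cover with the automated tests called for in step 2, since it needs no filesystem or model access.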
package/docs/dev/plans/001-initial-plan.md ADDED
@@ -0,0 +1,24 @@
1
+ ---
2
+ summary: "Initial implementation plan for first extension iteration."
3
+ read_when:
4
+ - "Executing the first feature slice from scaffold state."
5
+ system4d:
6
+ container: "Plan artifact for incremental delivery."
7
+ compass: "Move from scaffold to validated feature quickly."
8
+ engine: "Plan -> implement -> verify -> document."
9
+ fog: "Scope creep risk if tasks are not constrained."
10
+ ---
11
+
12
+ # Plan 001: first feature slice
13
+
14
+ ## Objective
15
+
16
+ Ship one useful command behavior end-to-end.
17
+
18
+ ## Steps
19
+
20
+ 1. Define expected command input/output.
21
+ 2. Implement logic in `extensions/`.
22
+ 3. Add tests in `tests/`.
23
+ 4. Run `npm run check`.
24
+ 5. Update `docs/dev/status.md` and `CHANGELOG.md`.
package/docs/dev/status.md ADDED
@@ -0,0 +1,21 @@
1
+ ---
2
+ summary: "Current project status snapshot."
3
+ read_when:
4
+ - "Checking project health or preparing handoff updates."
5
+ system4d:
6
+ container: "State report for current branch/project."
7
+ compass: "Keep stakeholders aligned on progress and blockers."
8
+ engine: "Update status after meaningful implementation slices."
9
+ fog: "Status can go stale quickly without disciplined updates."
10
+ ---
11
+
12
+ # Status
13
+
14
+ - Scaffold: complete
15
+ - Extension behavior: `/evalset` MVP implemented (`init`, `run`, `compare`)
16
+ - Example datasets: smoke + `fixed-task-set-v2.json` + `fixed-task-set-v3.json`
17
+ - Invocation docs: clarified non-interactive usage via `pi -p` / `pi -e ... -p`
18
+ - GitHub publish: `tryingET/pi-evalset-lab` created with release `v0.1.0`
19
+ - npm publish: pending (`npm` auth + registry set to `https://registry.npmjs.org/`)
20
+ - Validation hooks: installed
21
+ - Tests: pending
package/docs/org/operating_model.md ADDED
@@ -0,0 +1,39 @@
1
+ ---
2
+ summary: "Compact organization operating model and terminology."
3
+ read_when:
4
+ - "Aligning organization-level purpose, mission, and strategy."
5
+ system4d:
6
+ container: "Organization-level concepts shared across projects."
7
+ compass: "Keep strategy and culture aligned with organization purpose."
8
+ engine: "Purpose -> mission -> vision -> strategic objectives."
9
+ fog: "Terminology drift can create cross-project confusion."
10
+ ---
11
+
12
+ # Organization operating model
13
+
14
+ ## Organization purpose
15
+ Enable teams to build and operate AI-assisted software workflows that are reliable, auditable, and easy to improve.
16
+
17
+ ## Organization mission (current)
18
+ Provide practical tooling, standards, and documentation that make safe AI engineering the default path.
19
+
20
+ ## Organization vision
21
+ A trusted ecosystem where AI coding workflows are reproducible, understandable, and continuously improving.
22
+
23
+ ## Organization strategic objectives
24
+ 1. Standardize evaluation workflows across active projects.
25
+ 2. Improve run traceability and reproducibility metadata.
26
+ 3. Keep onboarding friction low through concise, current documentation.
27
+ 4. Maintain secure-by-default release and dependency hygiene.
28
+ 5. Reduce ambiguity between interactive and non-interactive execution modes.
29
+
30
+ ## Core values
31
+ - Clarity first
32
+ - Evidence over assumption
33
+ - Small, verifiable increments
34
+ - Safety for high-impact changes
35
+ - Respectful collaboration
36
+
37
+ ## Boundary
38
+ Organization purpose is cross-project.
39
+ Project-specific purpose is defined in [Project foundation model](../project/foundation.md).