pi-dev 0.1.5 → 0.1.7

package/dist/manifest.js CHANGED
@@ -11,6 +11,7 @@ export const SKILLS = [
11
11
  { name: "do", kind: "human", summary: "Do the engineering work end-to-end." },
12
12
  { name: "taste", kind: "human", summary: "View / update / onboard preferences." },
13
13
  { name: "where", kind: "human", summary: "Recall prior pi sessions for this cwd." },
14
+ { name: "improve-skill-flow", kind: "human", summary: "Audit pi session telemetry and propose evidence-based SKILL.md edits." },
14
15
  // Auto-invoked support skills
15
16
  { name: "migrate", kind: "support", summary: "Strict migration gate before /do can run." },
16
17
  { name: "setup", kind: "support", summary: "Scaffold issue-tracker / triage / domain docs." },
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "pi-dev",
3
- "version": "0.1.5",
3
+ "version": "0.1.7",
4
4
  "description": "An autonomous engineering skill framework for the pi runtime — built on Matt Pocock's skills.",
5
5
  "type": "module",
6
6
  "bin": {
@@ -0,0 +1,302 @@
1
+ ---
2
+ name: improve-skill-flow
3
+ description: Analyse real pi-runtime session telemetry from a consumer repo to find where the engineering skills (especially /do's chain) drift from their stated contract, then propose evidence-anchored edits to the SKILL.md files. Use when the user wants to improve, audit, debug, or evolve the pi-dev skill framework itself based on what actually happened in real sessions ("스킬 개선하자" / "let's improve the skills", "do 가 왜 멈춰" / "why does do keep stopping", "히스토리 보고 분석해서 스킬 고치자" / "let's review the history, analyse it, and fix the skills", "메타 스킬 작업" / "meta-skill work", etc).
4
+ ---
5
+
6
+ # /improve-skill-flow — Meta-skill for evidence-based skill improvement
7
+
8
+ The pi-dev skills are pure markdown. They get better by reading what real sessions did, comparing that to what the SKILL.md said *should* happen, and editing the gap closed. This skill is the canonical loop for that.
9
+
10
+ The point: **never edit a SKILL.md from gut feeling.** Edit because session N showed phase P violated predicate Q on M occasions, and here is the line that would have prevented it.
11
+
12
+ ## When to run
13
+
14
+ - A consumer repo has accumulated at least one real day of pi sessions (≈ 3+ `.jsonl` files).
15
+ - A specific skill is suspected of misbehaving ("why does `/do` keep stopping?").
16
+ - After landing a skill change, to verify the next session(s) actually follow the new wording.
17
+ - Periodically (every N releases) as a regression sweep across all human-facing skills.
18
+
19
+ ## What this skill is NOT
20
+
21
+ - Not for analysing the *codebase* of the consumer repo — that is `improve-codebase-architecture`.
22
+ - Not for shipping engineering work — it does not invoke `/do`. Findings turn into proposed SKILL.md diffs, which are committed via the normal release-please flow on `pi-dev`.
23
+ - Not a real-time monitor — it reads completed session files.
24
+
25
+ ## Inputs
26
+
27
+ - **Target repo path** (or its sessions directory). By default the user names a repo and you resolve its sessions directory.
28
+ - Optional: a specific skill name to focus the audit on (`do`, `migrate`, `triage`, …).
29
+ - Optional: a date range.
30
+ - **Install scope** for any fixes that come out of the audit — `global` or `project`. Auto-detected (see Step 5.5); user can override per finding.
31
+
32
+ ## Install scopes
33
+
34
+ pi-runtime today loads skill bodies from a single location: `~/.pi/agent/skills/<name>/SKILL.md`. The framework's 3-layer override is on **preferences**, not on SKILL bodies. So a finding lands in one of two places:
35
+
36
+ | scope | lands in | reaches | propagation | when to pick |
37
+ | --- | --- | --- | --- | --- |
38
+ | **global** | `pi-dev`'s `skills/<name>/SKILL.md` | every consumer after the next `npx pi-dev update` | release-please → npm publish | the SKILL.md wording itself is wrong; gap shows up generically |
39
+ | **project** | consumer repo's `docs/agents/preferences.md` (Project taboos / Diagnosis posture / Local-live playbook / Free notes — whichever section fits) | only this repo, on every `/do` bootstrap | regular consumer-repo commit | gap is the repo's domain / paths / conventions, not the SKILL.md |
40
+
41
+ Notes:
42
+
43
+ - A `global` apply is **always** mirrored into `~/.pi/agent/skills/<name>/` on the operator's machine so the next session picks it up immediately, without waiting for npm.
44
+ - A `project` apply touches no pi-dev files. It is committed to the consumer repo only.
45
+ - A single audit may produce a mix of global and project findings. Decide scope per finding, not per audit.
46
+
47
+ ## Session-data location & format
48
+
49
+ pi-runtime persists every session as JSON Lines at:
50
+
51
+ ```
52
+ ~/.pi/agent/sessions/--<cwd-with-slashes-replaced-by-dashes>--/<ISO-ts>_<uuid>.jsonl
53
+ ```
54
+
55
+ For example `cwd=/Users/jason/dev/sandbox/hugn` → `~/.pi/agent/sessions/--Users-jason-dev-sandbox-hugn--/`.
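A minimal Python sketch of that mapping, assuming the directory name is just the cwd with its leading slash dropped and the remaining slashes turned into dashes, as the example suggests. The helper name `sessions_dir` is hypothetical, and how pi-runtime encodes paths that already contain dashes or other special characters is not stated here.

```python
from pathlib import Path
from typing import Optional


def sessions_dir(cwd: str, home: Optional[Path] = None) -> Path:
    """Guess the pi sessions directory for a working directory (assumed encoding)."""
    home = home or Path.home()
    # Assumption: strip the outer slashes, then replace each "/" with "-".
    encoded = cwd.strip("/").replace("/", "-")
    return home / ".pi" / "agent" / "sessions" / f"--{encoded}--"
```

With the example above, `sessions_dir("/Users/jason/dev/sandbox/hugn").name` gives `--Users-jason-dev-sandbox-hugn--`. Listing the directory and checking that it exists is still worth doing before trusting the guess.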
56
+
57
+ Each line is one of these record types:
58
+
59
+ | `type` | meaning |
60
+ | --- | --- |
61
+ | `session` | session header — has `cwd`, `timestamp`, `version`, `id` |
62
+ | `model_change` | provider / modelId switch |
63
+ | `thinking_level_change` | reasoning effort knob |
64
+ | `message` | a user or assistant turn |
65
+
66
+ `message.content` is a **list of blocks**; each block has a `type`:
67
+
68
+ | block `type` | shape | what it represents |
69
+ | --- | --- | --- |
70
+ | `text` | `{text}` | plain text from either side |
71
+ | `thinking` | `{thinking, thinkingSignature}` | model scratchpad (assistant only) |
72
+ | `toolCall` | `{id, name, arguments}` | **pi uses this** — NOT Anthropic SDK's `tool_use` |
73
+ | `toolResult` | `{id, output}` | **pi uses this** — NOT `tool_result` |
74
+
75
+ Tool names are **lower-case** (`bash`, `read`, `edit`, `write`, `glob`, `grep`, etc.). Build any parser around `toolCall` / `toolResult` first, then fall back to Anthropic-shaped blocks for robustness.
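The "pi shape first, Anthropic shape as fallback" parsing order can be sketched as follows. This assumes the blocks live at `record["message"]["content"]`, which is one reading of `message.content` above and should be checked against a real file. The function name `iter_tool_calls` is hypothetical. The fallback uses the Anthropic `tool_use` block, whose arguments sit under `input`.

```python
from typing import Any, Dict, Iterator, Tuple


def iter_tool_calls(record: Dict[str, Any]) -> Iterator[Tuple[str, Any]]:
    """Yield (tool_name, arguments) for every tool call in one `message` record."""
    if record.get("type") != "message":
        return
    # Assumption: role and blocks are nested under the "message" key.
    message = record.get("message")
    content = message.get("content") if isinstance(message, dict) else None
    if not isinstance(content, list):
        return
    for block in content:
        if not isinstance(block, dict):
            continue
        kind = block.get("type")
        if kind == "toolCall":  # pi-native shape: {id, name, arguments}
            yield str(block.get("name", "")).lower(), block.get("arguments")
        elif kind == "tool_use":  # Anthropic-shaped fallback: {id, name, input}
            yield str(block.get("name", "")).lower(), block.get("input")
```

Lower-casing the name keeps the tallies consistent even if a fallback block reports `Read` instead of `read`.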
76
+
77
+ **Critical detail:** the pi runtime injects each skill's `SKILL.md` content into the conversation as a `<skill name="..." location="...">…</skill>` block embedded inside a **user-role** message (system-side injection, but the role is user). This is how you tell the classifier loaded a skill. Count these to see what `/do` actually picked.
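Counting those injections is a small regex job. The sketch below assumes the `name` attribute is the first attribute of the opening tag, as in the form quoted above. The names `SKILL_TAG` and `count_skill_injections` are hypothetical, and the caller is expected to pass in the text of user-role messages only.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the opening tag only; the skill body is irrelevant for counting.
SKILL_TAG = re.compile(r'<skill\s+name="([^"]+)"')


def count_skill_injections(user_texts: Iterable[str]) -> Counter:
    """Count <skill name="..."> opening tags across user-role message texts."""
    counts: Counter = Counter()
    for text in user_texts:
        counts.update(SKILL_TAG.findall(text))
    return counts
```

A skill that never appears simply reads as `0` in the returned `Counter`, which is convenient when building the signal table.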
78
+
79
+ Timestamps on `message` records are ISO strings; some other record types use int millis. Handle both.
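One way to handle both forms is to normalise everything to timezone-aware UTC datetimes before computing durations. This is a sketch: `parse_ts` is a hypothetical helper, and treating a naive ISO string as UTC is an assumption, not something the session format is known to guarantee.

```python
from datetime import datetime, timezone
from typing import Union


def parse_ts(value: Union[str, int, float]) -> datetime:
    """Normalise an ISO string or an epoch-milliseconds number to aware UTC."""
    if isinstance(value, (int, float)):
        return datetime.fromtimestamp(value / 1000, tz=timezone.utc)
    # Older Python versions reject a trailing "Z", so spell out the offset.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)
```

Session duration is then `parse_ts(last) - parse_ts(first)`, whichever form each record used.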
80
+
81
+ ## Process
82
+
83
+ ### 1 — Scope and load
84
+
85
+ Ask the user (one round, only if not already specified):
86
+
87
+ - target repo (or "all repos with sessions")
88
+ - a skill to focus on, or "everything"
89
+ - a date range or "all"
90
+
91
+ Resolve the sessions directory. List the `.jsonl` files with size + line count so the user can see the input scale.
92
+
93
+ ### 2 — Build the raw signal table
94
+
95
+ Run a single pass over every targeted `.jsonl` and tally:
96
+
97
+ **Per session:**
98
+ - user messages, assistant messages, tool calls, thinking blocks
99
+ - duration (last ts − first ts)
100
+
101
+ **Aggregates:**
102
+ - tool-use frequency (`bash`, `read`, `edit`, `write`, …)
103
+ - `<skill name="X">` injections per skill (this is the classifier's actual decision)
104
+ - skills mentioned in user text vs. assistant text (request side vs. recall side)
105
+ - top Edit / Write targets (hotspot files)
106
+ - bash commands matching domain-specific danger / smell patterns
107
+
108
+ **Skill-flow specific:**
109
+ - After each `<skill name="do">` injection, did another skill get injected within the same session? **This measures `/do` chain depth.** Single-step chains are a red flag.
110
+ - Count user messages shorter than ~80 chars; these are usually nudges ("진행해" / "go ahead", "다음은?" / "what's next?", "끝났어?" / "is it done?"). High proportion = `/do` is handing the flow back too often.
111
+ - Count user messages containing correction markers (`아니` "no", `wait`, `stop`, `취소` "cancel", `다시` "again", `그만` "enough", `undo`, `revert`). These mark interventions.
112
+ - Compare the `<!-- migrated: ... -->` marker date against any later writes to `docs/handoff/`, `.scratch/flow/`, or `SESSION_*.md`. Any post-marker write is drift: the handoff lockout failed.
113
+ - Count tracker writes (`gh issue create`, `gh pr create`) and check whether `generated by AI` (the `/triage` disclaimer) appears in the surrounding text. Missing disclaimer = `/to-issues` / `/triage` predicate violated.
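The first three skill-flow signals in this list can be sketched as two small functions. The inputs are hypothetical: `injections` is the ordered list of skill names injected during one session, and `user_texts` is the list of user message texts. The substring match on correction markers is deliberately crude (it would also hit "stopwatch"), so treat the count as a proxy and read the excerpts before drawing conclusions.

```python
from typing import List, Sequence, Tuple

CORRECTION_MARKERS = ("아니", "wait", "stop", "취소", "다시", "그만", "undo", "revert")


def do_chain_followups(injections: Sequence[str]) -> List[int]:
    """For each "do" injection, count the skills injected before the next "do"."""
    depths: List[int] = []
    for name in injections:
        if name == "do":
            depths.append(0)  # a new /do chain starts here
        elif depths:
            depths[-1] += 1  # another skill followed the latest /do
    return depths


def nudge_and_correction_counts(
    user_texts: Sequence[str], max_len: int = 80
) -> Tuple[int, int]:
    """Return (short-message count, correction-marker message count)."""
    nudges = sum(1 for text in user_texts if len(text.strip()) < max_len)
    corrections = sum(
        1
        for text in user_texts
        if any(marker in text.lower() for marker in CORRECTION_MARKERS)
    )
    return nudges, corrections
```

A `0` in the list returned by `do_chain_followups` is a `/do` injection that no other skill followed, which is the red-flag case described above.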
114
+
115
+ Use a deterministic Python or shell script that you write once and keep in `/tmp` for the duration of the run. Do not eyeball big JSONLs by hand.
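As a starting point, such a script might look like the sketch below: one pass over a single `.jsonl` file that tallies message roles and pi-native tool names. It assumes `role` and `content` sit under `record["message"]`, which should be confirmed against a real session file. The function name `tally_session` is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Dict


def tally_session(path: Path) -> Dict[str, Counter]:
    """One pass over a session .jsonl: count message roles and tool names."""
    roles: Counter = Counter()
    tools: Counter = Counter()
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip a damaged line instead of aborting the audit
            if not isinstance(record, dict) or record.get("type") != "message":
                continue
            # Assumption: role and content are nested under "message".
            message = record.get("message")
            if not isinstance(message, dict):
                continue
            roles[message.get("role", "unknown")] += 1
            content = message.get("content")
            for block in content if isinstance(content, list) else []:
                if isinstance(block, dict) and block.get("type") == "toolCall":
                    tools[str(block.get("name", "")).lower()] += 1
    return {"roles": roles, "tools": tools}
```

Summing the returned counters across files gives the aggregate tool-use frequency, while the per-file result feeds the per-session rows.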
116
+
117
+ ### 3 — Cross-reference with repo state
118
+
119
+ For the same date range, pull:
120
+
121
+ - `git log --since=<start> --pretty=format:"%h %ad %s"` — commit cadence vs. the predicate `auto-commit-per-slice`.
122
+ - `gh issue list / pr list` (if GitHub) — slice/PR shape vs. `default-issue-style=vertical-slice`.
123
+ - Existence of forbidden paths from `docs/agents/preferences.md` taboos (`docs/handoff/`, `.scratch/flow/`, etc.).
124
+ - `gh label list` filtered against taboo-marked labels.
125
+
126
+ Cross-reference each signal against:
127
+
128
+ - The skill's **terminal predicate** in `do/SKILL.md` → "Phase contracts".
129
+ - The repo's **`docs/agents/preferences.md`** taboos and `auto-*` settings.
130
+ - The hard rules in the skill being audited.
131
+
132
+ ### 4 — Score the gaps
133
+
134
+ Produce a small table per audited skill. Each row:
135
+
136
+ ```
137
+ | signal | observed | expected (predicate / rule) | severity |
138
+ | ------------------------------ | -------- | ---------------------------------- | -------- |
139
+ | /do → next-skill chain depth | 1/5 | ≥1 per chain step (M phases) | 🔴 |
140
+ | AI disclaimer on slice issues | 0/8 | every issue starts with disclaimer | 🔴 |
141
+ | post-marker handoff writes | 9 | 0 | 🟡 |
142
+ | ... | | | |
143
+ ```
144
+
145
+ Severity rule of thumb:
146
+ - 🔴 — predicate is violated repeatedly AND the rule is binding (in "Hard rules" or "Phase contracts").
147
+ - 🟡 — taboo or preference violated but the skill's wording is soft / advisory.
148
+ - 🟢 — observed behaviour matches the spec; do not propose a change.
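The rule of thumb can be written down as a tiny function so that every row is scored the same way. This is a sketch under two assumptions that the text does not settle: "repeatedly" is read as two or more violations, and a single breach of a binding rule is scored as a warning. The name `severity` is hypothetical.

```python
def severity(violations: int, rule_is_binding: bool) -> str:
    """Map a violation count to the severity marker used in the gap table."""
    if violations == 0:
        return "🟢"  # behaviour matches the spec; propose nothing
    if rule_is_binding and violations >= 2:
        return "🔴"  # repeated breach of a Hard rule or Phase contract
    # Assumption: soft rules, and one-off breaches of binding rules, are warnings.
    return "🟡"
```

For the sample table, `severity(8, True)` scores the missing disclaimers as 🔴 and `severity(9, False)` scores the handoff writes as 🟡.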
149
+
150
+ ### 5 — Anchor each finding to an evidence excerpt
151
+
152
+ For every 🔴 / 🟡 row, quote the smallest piece of evidence that makes the gap undeniable:
153
+
154
+ - a user message timestamp + first 80 chars,
155
+ - a tool-call command that violates a taboo,
156
+ - a missing disclaimer in a created issue body.
157
+
158
+ If a finding cannot be backed by an excerpt, it is not actionable yet — demote to a TODO and keep digging.
159
+
160
+ ### 5.5 — Decide install scope per finding (auto + user-overridable)
161
+
162
+ For each 🔴 / 🟡 finding, pick a default scope using this two-step heuristic, then show the table to the user once and let them flip individual rows before applying.
163
+
164
+ **Step A — detect operator context.** Run once at the start of this step:
165
+
166
+ ```bash
167
+ origin=$(git -C "$PWD" remote get-url origin 2>/dev/null || echo "")
168
+ pkg_name=$(jq -r '.name // empty' package.json 2>/dev/null)
169
+ is_maintainer=false
170
+ case "$origin" in *pi-dev*) is_maintainer=true ;; esac  # *pi-dev* already covers a trailing .git
171
+ [ "$pkg_name" = "pi-dev" ] && is_maintainer=true
172
+ echo "operator_context=$([ "$is_maintainer" = true ] && echo maintainer || echo consumer)"
173
+ ```
174
+
175
+ - `operator_context=maintainer` → cwd is the pi-dev repo itself; the release path is available.
176
+ - `operator_context=consumer` → cwd is a downstream repo; no release path. `global` findings here become "draft a patch + open an upstream PR / issue" rather than "push and release".
177
+
178
+ **Step B — score each finding.** Default to `global` if the finding matches **any** of:
179
+
180
+ - Cites SKILL.md wording / phase / predicate / rule numbers.
181
+ - The proposed fix is a generic anti-pattern string, a terminator literal, a runway line, or a lockout that any repo would benefit from.
182
+ - The same gap would plausibly show up in two or more consumer repos.
183
+
184
+ Default to `project` if the finding matches **any** of:
185
+
186
+ - Cites a repo-specific path (`src/core/...`, `bin/...-smoke.ts`), brand, schema, table, or domain term.
187
+ - The fix is a taboo, a smoke convention, an env / boot detail, or a glossary entry.
188
+ - The fix would not apply (or would be wrong) in another consumer repo.
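The two "matches any of" lists can be encoded as boolean flags on a finding. The sketch below is one possible encoding: the `Finding` fields and `default_scope` are hypothetical names, and returning `"undecided"` when both lists match (or neither does) is an assumption, since the heuristic does not say which side wins in that case.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    # Signals that point to a global fix.
    cites_skill_wording: bool = False
    generic_fix: bool = False
    likely_in_other_repos: bool = False
    # Signals that point to a project fix.
    cites_repo_specifics: bool = False
    local_convention_fix: bool = False
    wrong_in_other_repos: bool = False


def default_scope(finding: Finding) -> str:
    """Pick a default scope; leave ambiguous findings for the user to decide."""
    wants_global = (
        finding.cites_skill_wording
        or finding.generic_fix
        or finding.likely_in_other_repos
    )
    wants_project = (
        finding.cites_repo_specifics
        or finding.local_convention_fix
        or finding.wrong_in_other_repos
    )
    if wants_global and not wants_project:
        return "global"
    if wants_project and not wants_global:
        return "project"
    return "undecided"  # assumption: surface the conflict instead of guessing
```

An `"undecided"` row is a natural candidate to highlight when the scope-decision table is shown to the user.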
189
+
190
+ Present the scope-decision table:
191
+
192
+ ```
193
+ | # | finding (short) | default scope | target file | flip? |
194
+ | - | ---------------------------------------- | ------------- | ------------------------------------ | ----- |
195
+ | 1 | /do hands flow back between phases | global | pi-dev:skills/do/SKILL.md | |
196
+ | 2 | docs/handoff/ resurrected after marker | global | pi-dev:skills/migrate/SKILL.md | |
197
+ | 3 | retro-action-item label still alive | project | hugn:docs/agents/preferences.md | |
198
+ | 4 | smoke command name changed in S058 | project | hugn:docs/agents/preferences.md | |
199
+ ```
200
+
201
+ Ask the user once: "OK to proceed with these scopes? Reply with row numbers to flip, or `go`." Apply their flips and move on. If `operator_context=consumer`, any rows still marked `global` get the suffix `(via upstream PR — cannot release locally)` and the apply step adjusts accordingly.
202
+
203
+ ### 6 — Propose edits (per-finding, scoped)
204
+
205
+ For each 🔴 / 🟡 finding, draft the smallest possible edit that, **if it had been in place at session time, would have prevented the gap.** The shape of the draft depends on the scope from Step 5.5:
206
+
207
+ **Global findings (target: pi-dev SKILL.md):**
208
+
209
+ - Edit a **rule** or a **step**, not a flavour sentence. The model must be able to detect the constraint in its own draft output.
210
+ - Prefer **explicit anti-pattern strings** ("Do not say 'shall I continue?'") over abstract injunctions ("be decisive"). The hugn-2026-05 audit showed that named anti-patterns work.
211
+ - Prefer **terminal markers** ("the summary's last line must be one of these two literals: …") over qualitative descriptions of "good wrap-up".
212
+ - Update **at most three skills per run.** More than that means findings aren't anchored well enough.
213
+
214
+ **Project findings (target: consumer's `docs/agents/preferences.md`):**
215
+
216
+ - Pick the *narrowest* existing section that fits before adding a new one. Mapping:
217
+
218
+ | finding flavour | preferences section |
219
+ | --- | --- |
220
+ | forbidden path / file / module / command | `## Project taboos` |
221
+ | label / state / triage rule | `## Side-effect gates` or `## Project taboos` |
222
+ | smoke / test / endpoint convention | `## Smoke / test conventions` |
223
+ | boot / env / probe change | `## Local-live playbook` |
224
+ | error taxonomy / 5-axes / domain rule | `## Diagnosis posture` |
225
+ | glossary / context term clarification | `## Glossary alignment` |
226
+ | rationale that doesn't fit elsewhere | `## Free notes` (one paragraph max, dated) |
227
+
228
+ - One bullet per finding. Reference the evidence ticket ("S058 smoke name", "#103 missing disclaimer") so the line stays auditable.
229
+ - Do **not** invent new top-level sections unless three findings legitimately share one.
230
+
231
+ Show all drafts as one unified diff per target file before applying. Group by target file: pi-dev's `skills/<name>/SKILL.md` first (global), then consumer's `docs/agents/preferences.md` (project).
232
+
233
+ ### 7 — Apply, release, verify (branches on scope)
234
+
235
+ Run both branches if the audit produced mixed-scope findings. Each branch has its own terminal state.
236
+
237
+ **7a. Global branch** — only if any finding was approved as `global` **and** `operator_context=maintainer`:
238
+
239
+ 1. From the pi-dev checkout: `git add skills/<name>/SKILL.md && git commit -m "<conventional commit anchoring the evidence>"`. Commit body must cite the signal that motivated each change.
240
+ 2. `cp` each edited SKILL.md into `~/.pi/agent/skills/<name>/` so the **next** session anywhere picks up the change immediately (release-please takes a minute and a half).
241
+ 3. `git push origin main`; release-please opens the version-bump PR; merge it; npm publish runs automatically.
242
+ 4. Confirm `npm view pi-dev@latest version` matches the bumped tag.
243
+
244
+ **7a' — Global findings when `operator_context=consumer`:** you cannot release. Instead:
245
+
246
+ 1. Stash the proposed diffs to `/tmp/pi-dev-upstream-<date>.patch` with one file per skill.
247
+ 2. Open an issue on `pi-dev` (or a PR if the operator has clone+push rights) with the evidence excerpts and the patch attached.
248
+ 3. As a hotfix for this machine only, optionally `cp` the edited bodies into `~/.pi/agent/skills/<name>/` and note in the issue that the next `pi-dev update` will overwrite them — which is the desired end state once the upstream change lands.
249
+
250
+ **7b. Project branch** — only if any finding was approved as `project`:
251
+
252
+ 1. In the consumer repo: edit `docs/agents/preferences.md` per the drafts from Step 6. Keep the migration marker at the very end of the file undisturbed.
253
+ 2. Bump the `last-updated` line at the top of the file to today's UTC date.
254
+ 3. `git add docs/agents/preferences.md && git commit -m "docs(agents): <one-liner per finding>"`. Conventional Commits apply.
255
+ 4. Push per the repo's normal workflow. No release-please involvement — preferences are not packaged.
256
+
257
+ **Verification (both branches).** After the next pi session in the affected repo:
258
+
259
+ 1. Re-run **this skill** scoped to the last 24 h.
260
+ 2. Confirm each previously-🔴 signal has moved (chain depth up, intervention rate down, taboo writes gone, disclaimer coverage at 100%, etc.).
261
+ 3. If a signal did not move, the fix wording was too weak: file a follow-up audit rather than rewriting from scratch.
262
+
263
+ ## Terminal predicate
264
+
265
+ This skill is done when **all four** are true:
266
+
267
+ 1. A signal table with severities and evidence excerpts has been presented.
268
+ 2. Each finding has an approved scope (`global` / `project` / `defer`) on record, defaulted by Step 5.5 and confirmed by the user.
269
+ 3. Either (a) zero 🔴 findings — flow is healthy, recorded as "no change this cycle", OR (b) each 🔴 finding has landed in its scope's target file (or been stashed + filed upstream when an operator-context mismatch prevents release).
270
+ 4. For any landed change: if `global` and `maintainer`, the npm version has bumped (`npm view pi-dev@latest version`); if `project`, the consumer repo has the commit on its push-stream. Either way, the next-session re-audit plan is stated.
271
+
272
+ The summary's **last line** must be one of:
273
+
274
+ ```
275
+ audit complete — no changes this cycle.
276
+ ```
277
+
278
+ ```
279
+ audit complete — global v<X.Y.Z> released, project commit <sha>, next re-audit after the next session.
280
+ ```
281
+
282
+ ```
283
+ audit complete — upstream issue <#N> filed, project commit <sha>, hotfix mirrored to ~/.pi.
284
+ ```
285
+
286
+ ## What this skill does not do
287
+
288
+ - It does not modify a consumer repo's code or issues. It edits only **pi-dev's own `skills/`** (global scope) and the consumer repo's `docs/agents/preferences.md` (project scope).
289
+ - It does not invent gaps from first principles. Every finding must come from a session excerpt or a repo-state probe.
290
+ - It does not run faster than the data allows — if there is only one session, run it but say so up front; the signal is noisy.
291
+
292
+ ## Heuristics
293
+
294
+ - **One screenful of signals beats a dashboard.** Most of the time three or four numbers tell the whole story.
295
+ - **Watch the gap between `/do` injection and next-skill injection.** That is the single highest-information signal about whether the chain orchestrator is actually orchestrating.
296
+ - **Short user messages are the cheapest interruption proxy.** Count them.
297
+ - **A taboo without a `.gitignore` lockout will resurrect.** If the same taboo file shows up across two audits, the fix belongs in `/migrate`, not `/do`.
298
+ - **Disclaimer / terminator / lockout / runway** are the four "tripwire" patterns that catch silent contract drift. Reach for them before inventing new ones.
299
+
300
+ ## Why this skill exists
301
+
302
+ Skills are prose. Prose drifts. Without a feedback loop, the SKILL.md files become wishful thinking that the agent ignores in real sessions. This skill is the loop.