pi-dev 0.1.4 → 0.1.6

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/dist/manifest.js CHANGED
@@ -11,6 +11,7 @@ export const SKILLS = [
11
11
  { name: "do", kind: "human", summary: "Do the engineering work end-to-end." },
12
12
  { name: "taste", kind: "human", summary: "View / update / onboard preferences." },
13
13
  { name: "where", kind: "human", summary: "Recall prior pi sessions for this cwd." },
14
+ { name: "improve-skill-flow", kind: "human", summary: "Audit pi session telemetry and propose evidence-based SKILL.md edits." },
14
15
  // Auto-invoked support skills
15
16
  { name: "migrate", kind: "support", summary: "Strict migration gate before /do can run." },
16
17
  { name: "setup", kind: "support", summary: "Scaffold issue-tracker / triage / domain docs." },
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "pi-dev",
3
- "version": "0.1.4",
3
+ "version": "0.1.6",
4
4
  "description": "An autonomous engineering skill framework for the pi runtime — built on Matt Pocock's skills.",
5
5
  "type": "module",
6
6
  "bin": {
@@ -14,10 +14,16 @@ The point: **one request → one finished outcome**, with as few user interrupti
14
14
  1. **Migration gate (strict).** If `docs/agents/preferences.md` does not contain the `<!-- migrated: ... -->` marker, **stop** and invoke `/migrate`. Do not run any engineering work on un-migrated repos. The user makes the migration decision; the skill executes it. No partial-migration shortcuts.
15
15
  2. **Read preferences (3 layers).** Merge global (`~/.pi/agent/preferences.md`) → project (`docs/agents/preferences.md`) → package (`packages/<pkg>/preferences.md` if a single package is targeted). If global is missing, warn once and use built-in defaults from `/taste`.
16
16
  3. **Treat preferences as decisions.** Any choice the merged prefs answer is **decided**. Do not re-ask.
17
- 4. **Keep going.** When something would normally cause a halt (ambiguous classification, scope creep beyond `change-budget`, missing follow-up), **expand and surface in the final summary** instead of stopping mid-flow. Halt only when:
17
+ 4. **Keep going. Never hand the flow back between phases.** When something would normally cause a halt (ambiguous classification, scope creep beyond `change-budget`, missing follow-up), **expand and surface in the final summary** instead of stopping mid-flow. Specifically:
18
+ - Do **not** ask the user "shall I proceed to the next phase?", "want me to continue?", "should I do X next?", or any equivalent. The chain is decided in Step 3 and runs to completion.
19
+ - Do **not** end your turn between phases. A phase's terminal predicate passing is the cue to immediately print the next phase's status line and start it, in the same turn.
20
+ - Confirmation-shaped questions are only legal via the **Ambiguity protocol** (one decision the prefs genuinely cannot answer) or the **Failure protocol** (terminal predicate failed twice).
21
+
22
+ Halt only when:
18
23
  - migration gate fails, OR
19
24
  - GitHub gate fails (when GitHub is the tracker), OR
20
25
  - a phase's terminal predicate fails twice and the failure cannot be characterised, OR
26
+ - the Ambiguity protocol fires (preferences truly silent on a one-shot decision), OR
21
27
  - user interrupts.
22
28
  5. **No handoff files.** State of work lives in three places only: **code (git), issue tracker, merged preferences**. Do not create `.scratch/flow/`, `docs/handoff/`, or any session-log file. Phase outputs are remembered in-context; persistent decisions are committed to code or filed as issues.
23
29
  6. **Side-effect gates respect prefs literally.** `auto-create-issues`, `auto-apply-labels`, `auto-commit-per-slice`, `auto-pr` follow merged prefs without reinterpretation.
@@ -99,15 +105,31 @@ If measured scope exceeds `change-budget` (e.g. `module` budget vs `cross-packag
99
105
 
100
106
  ### Step 4 — Execute the chain
101
107
 
102
- For each phase:
108
+ Before running any phase, **print the planned chain once** so the user can see the runway:
109
+
110
+ ```
111
+ [flow plan] intent=<x> scope=<y> chain: phase1 → phase2 → … → phaseM
112
+ ```
113
+
114
+ Then for each phase:
103
115
 
104
- 1. Print status line.
116
+ 1. Print status line `[flow N/M] <phase> — <one-sentence what>`.
105
117
  2. Run the phase. Use merged prefs to make every taste decision the phase exposes.
106
118
  3. Check the terminal predicate. If fail, retry once with explicit diagnosis. If fail again, halt with reason.
107
- 4. Continue.
119
+ 4. **Transition without asking.** The instant the terminal predicate passes, print the *next* phase's status line on the very next line and start executing it, in the same turn. No "next, I'll do X" preamble. No "shall I continue?". No turn end.
108
120
 
109
121
  **No handoff writes.** Pass state in-context only.
110
122
 
123
+ **Anti-patterns to detect in your own output before sending** (if you catch any of these mid-draft while phases remain, delete and replace with the next phase's status line):
124
+
125
+ - "Want me to proceed with <next phase>?"
126
+ - "Shall I continue to <next phase>?"
127
+ - "Ready to move on to <next phase>?"
128
+ - "Let me know if you want me to <next phase>."
129
+ - Ending the turn after a phase summary when N < M.
130
+
131
+ The only place a wrap-up belongs is **Step 7 — Final summary**, after the last phase has met its terminal predicate.
132
+
111
133
  ### Step 5 — Live verification
112
134
 
113
135
  Driven by **two separate prefs keys**:
@@ -139,15 +161,29 @@ If a side-effect was skipped because prefs said so, list it in the final summary
139
161
 
140
162
  ### Step 7 — Final summary
141
163
 
142
- One screen:
164
+ Reached only after the **last** phase in the chain has met its terminal predicate (or the Failure protocol fired). One screen:
143
165
 
144
- - intent / scope / chain executed
166
+ - intent / scope / chain executed (all N/M status lines accounted for)
145
167
  - diffs (files touched)
146
168
  - side effects done (issue URLs, commit SHAs, branch name, PR URL)
147
169
  - side effects deferred (with reason and the exact command to do them)
148
170
  - expansions (scope went beyond budget, ambiguous decisions made with rationale)
149
171
  - prefs refresh hint if stale
150
172
 
173
+ The summary's **last line** must be one of these two terminators, so the user can tell at a glance whether the flow is done:
174
+
175
+ ```
176
+ chain complete — no further action.
177
+ ```
178
+
179
+ or
180
+
181
+ ```
182
+ follow-up: <one-line description> — run `<exact command>`.
183
+ ```
184
+
185
+ If you emitted anything that looks like a wrap-up ("All set!", "Done.", "Let me know if...") without one of those two terminators, you ended early. Reopen the chain at the missed phase.
186
+
151
187
  ## Phase contracts (terminal predicates)
152
188
 
153
189
  | phase | done when… |
@@ -0,0 +1,188 @@
1
+ ---
2
+ name: improve-skill-flow
3
+ description: Analyse real pi-runtime session telemetry from a consumer repo to find where the engineering skills (especially /do's chain) drift from their stated contract, then propose evidence-anchored edits to the SKILL.md files. Use when the user wants to improve, audit, debug, or evolve the pi-dev skill framework itself based on what actually happened in real sessions ("스킬 개선하자", "do 가 왜 멈춰", "히스토리 보고 분석해서 스킬 고치자", "메타 스킬 작업", etc).
4
+ ---
5
+
6
+ # /improve-skill-flow — Meta-skill for evidence-based skill improvement
7
+
8
+ The pi-dev skills are pure markdown. They get better by reading what real sessions did, comparing that to what the SKILL.md said *should* happen, and editing the gap closed. This skill is the canonical loop for that.
9
+
10
+ The point: **never edit a SKILL.md from gut feeling.** Edit because session N showed phase P violated predicate Q on M occasions, and here is the line that would have prevented it.
11
+
12
+ ## When to run
13
+
14
+ - A consumer repo has accumulated at least one real day of pi sessions (≈ 3+ `.jsonl` files).
15
+ - A specific skill is suspected of misbehaving ("why does `/do` keep stopping?").
16
+ - After landing a skill change, to verify the next session(s) actually follow the new wording.
17
+ - Periodically (every N releases) as a regression sweep across all human-facing skills.
18
+
19
+ ## What this skill is NOT
20
+
21
+ - Not for analysing the *codebase* of the consumer repo — that is `improve-codebase-architecture`.
22
+ - Not for shipping engineering work — it does not invoke `/do`. Findings turn into proposed SKILL.md diffs, which are committed via the normal release-please flow on `pi-dev`.
23
+ - Not a real-time monitor — it reads completed session files.
24
+
25
+ ## Inputs
26
+
27
+ - **Target repo path** (or its sessions directory). Defaults: the user names a repo; you resolve it.
28
+ - Optional: a specific skill name to focus the audit on (`do`, `migrate`, `triage`, …).
29
+ - Optional: a date range.
30
+
31
+ ## Session-data location & format
32
+
33
+ pi-runtime persists every session as JSON Lines at:
34
+
35
+ ```
36
+ ~/.pi/agent/sessions/--<cwd-with-slashes-replaced-by-dashes>--/<ISO-ts>_<uuid>.jsonl
37
+ ```
38
+
39
+ For example `cwd=/Users/jason/dev/sandbox/hugn` → `~/.pi/agent/sessions/--Users-jason-dev-sandbox-hugn--/`.
40
+
41
+ Each line is one of these record types:
42
+
43
+ | `type` | meaning |
44
+ | --- | --- |
45
+ | `session` | session header — has `cwd`, `timestamp`, `version`, `id` |
46
+ | `model_change` | provider / modelId switch |
47
+ | `thinking_level_change` | reasoning effort knob |
48
+ | `message` | a user or assistant turn |
49
+
50
+ `message.content` is a **list of blocks**, each block has a `type`:
51
+
52
+ | block `type` | shape | what it represents |
53
+ | --- | --- | --- |
54
+ | `text` | `{text}` | plain text from either side |
55
+ | `thinking` | `{thinking, thinkingSignature}` | model scratchpad (assistant only) |
56
+ | `toolCall` | `{id, name, arguments}` | **pi uses this** — NOT Anthropic SDK's `tool_use` |
57
+ | `toolResult` | `{id, output}` | **pi uses this** — NOT `tool_result` |
58
+
59
+ Tool names are **lower-case** (`bash`, `read`, `edit`, `write`, `glob`, `grep`, etc.). Build any parser around `toolCall` / `toolResult` first, then fall back to Anthropic-shaped blocks for robustness.
60
+
61
+ **Critical detail:** the pi runtime injects each skill's `SKILL.md` content into the conversation as a `<skill name="..." location="...">…</skill>` block embedded inside a **user-role** message (system-side injection, but the role is user). This is how you tell the classifier loaded a skill. Count these to see what `/do` actually picked.
62
+
63
+ Timestamps on `message` records are ISO strings; some other record types use int millis. Handle both.
64
+
65
+ ## Process
66
+
67
+ ### 1 — Scope and load
68
+
69
+ Ask the user (one round, only if not already specified):
70
+
71
+ - target repo (or "all repos with sessions")
72
+ - a skill to focus on, or "everything"
73
+ - a date range or "all"
74
+
75
+ Resolve the sessions directory. List the `.jsonl` files with size + line count so the user can see the input scale.
76
+
77
+ ### 2 — Build the raw signal table
78
+
79
+ Run a single pass over every targeted `.jsonl` and tally:
80
+
81
+ **Per session:**
82
+ - user messages, assistant messages, tool calls, thinking blocks
83
+ - duration (last ts − first ts)
84
+
85
+ **Aggregates:**
86
+ - tool-use frequency (`bash`, `read`, `edit`, `write`, …)
87
+ - `<skill name="X">` injections per skill (this is the classifier's actual decision)
88
+ - skills mentioned in user text vs. assistant text (request side vs. recall side)
89
+ - top Edit / Write targets (hotspot files)
90
+ - bash commands matching domain-specific danger / smell patterns
91
+
92
+ **Skill-flow specific:**
93
+ - After each `<skill name="do">` injection, did another skill get injected within the same session? **This measures `/do` chain depth.** Single-step chains are a red flag.
94
+ - Count user messages shorter than ~80 chars — these are usually nudges ("진행해", "다음은?", "끝났어?"). High proportion = `/do` is handing the flow back too often.
95
+ - Count user messages containing correction markers (`아니`, `wait`, `stop`, `취소`, `다시`, `그만`, `undo`, `revert`). These mark interventions.
96
+ - Count `<!-- migrated: ... -->` marker date vs. any post-marker `docs/handoff/` / `.scratch/flow/` / `SESSION_*.md` writes. Drift = handoff lockout failed.
97
+ - Count tracker writes (`gh issue create`, `gh pr create`) and check whether `generated by AI` (the `/triage` disclaimer) appears in the surrounding text. Missing disclaimer = `/to-issues` / `/triage` predicate violated.
98
+
99
+ Use a deterministic Python or shell script you write once and check into `/tmp` for the duration of the run. Do not eyeball big JSONLs by hand.
100
+
101
+ ### 3 — Cross-reference with repo state
102
+
103
+ For the same date range, pull:
104
+
105
+ - `git log --since=<start> --pretty=format:"%h %ad %s"` — commit cadence vs. the predicate `auto-commit-per-slice`.
106
+ - `gh issue list / pr list` (if GitHub) — slice/PR shape vs. `default-issue-style=vertical-slice`.
107
+ - Existence of forbidden paths from `docs/agents/preferences.md` taboos (`docs/handoff/`, `.scratch/flow/`, etc.).
108
+ - `gh label list` filtered against taboo-marked labels.
109
+
110
+ Cross-reference each signal against:
111
+
112
+ - The skill's **terminal predicate** in `do/SKILL.md` → "Phase contracts".
113
+ - The repo's **`docs/agents/preferences.md`** taboos and `auto-*` settings.
114
+ - The hard rules in the skill being audited.
115
+
116
+ ### 4 — Score the gaps
117
+
118
+ Produce a small table per audited skill. Each row:
119
+
120
+ ```
121
+ | signal | observed | expected (predicate / rule) | severity |
122
+ | ------------------------------ | -------- | ---------------------------------- | -------- |
123
+ | /do → next-skill chain depth | 1/5 | ≥1 per chain step (M phases) | 🔴 |
124
+ | AI disclaimer on slice issues | 0/8 | every issue starts with disclaimer | 🔴 |
125
+ | post-marker handoff writes | 9 | 0 | 🟡 |
126
+ | ... | | | |
127
+ ```
128
+
129
+ Severity rule of thumb:
130
+ - 🔴 — predicate is violated repeatedly AND the rule is binding (in "Hard rules" or "Phase contracts").
131
+ - 🟡 — taboo or preference violated but the skill's wording is soft / advisory.
132
+ - 🟢 — observed behaviour matches the spec; do not propose a change.
133
+
134
+ ### 5 — Anchor each finding to an evidence excerpt
135
+
136
+ For every 🔴 / 🟡 row, quote the smallest piece of evidence that makes the gap undeniable:
137
+
138
+ - a user message timestamp + first 80 chars,
139
+ - a tool-call command that violates a taboo,
140
+ - a missing disclaimer in a created issue body.
141
+
142
+ If a finding cannot be backed by an excerpt, it is not actionable yet — demote to a TODO and keep digging.
143
+
144
+ ### 6 — Propose SKILL.md edits
145
+
146
+ For each 🔴 finding, draft the smallest possible edit that, **if it had been in the SKILL.md at session time, would have prevented the gap.** Constraints:
147
+
148
+ - Edit a **rule** or a **step**, not a flavour sentence. The model must be able to detect the constraint in its own draft output.
149
+ - Prefer **explicit anti-pattern strings** ("Do not say 'shall I continue?'") over abstract injunctions ("be decisive"). The hugn-2026-05 audit showed that named anti-patterns work.
150
+ - Prefer **terminal markers** ("the summary's last line must be one of these two literals: …") over qualitative descriptions of "good wrap-up".
151
+ - Update **at most three skills per run.** More than that means findings aren't anchored well enough.
152
+
153
+ Show the edits as a unified diff before applying.
154
+
155
+ ### 7 — Apply, release, verify
156
+
157
+ Standard pi-dev release loop:
158
+
159
+ 1. `git add skills/<name>/SKILL.md && git commit -m "<conventional commit anchoring the evidence>"` — the commit body should cite the signal that motivated each change.
160
+ 2. `cp` the edited SKILL.md into `~/.pi/agent/skills/<name>/` so the **next** session in any repo picks up the change immediately (release-please takes a minute and a half).
161
+ 3. `git push origin main`; release-please opens the version-bump PR; merge it; npm publish runs automatically.
162
+ 4. After the next consumer session lands, re-run **this skill** and confirm the targeted signals moved.
163
+
164
+ ## Terminal predicate
165
+
166
+ This skill is done when **all three** are true:
167
+
168
+ 1. A signal table with severities and evidence excerpts has been presented.
169
+ 2. Either (a) zero 🔴 findings — flow is healthy, recorded as "no change this cycle", OR (b) a unified diff for each 🔴 finding has been applied to `skills/<name>/SKILL.md` AND mirrored into `~/.pi/agent/skills/<name>/`.
170
+ 3. If diffs landed, the release loop (commit → push → merge release PR → npm publish) finished and the new version is `npm view pi-dev@latest version`.
171
+
172
+ ## What this skill does not do
173
+
174
+ - It does not modify a consumer repo's code, issues, or preferences. It only edits **pi-dev's own `skills/`**.
175
+ - It does not invent gaps from first principles. Every finding must come from a session excerpt or a repo-state probe.
176
+ - It does not run faster than the data allows — if there is only one session, run it but say so up front; the signal is noisy.
177
+
178
+ ## Heuristics
179
+
180
+ - **One screenful of signals beats a dashboard.** Most of the time three or four numbers tell the whole story.
181
+ - **Watch the gap between `/do` injection and next-skill injection.** That is the single highest-information ratio about whether the chain orchestrator is actually orchestrating.
182
+ - **Short user messages are the cheapest interruption proxy.** Count them.
183
+ - **A taboo without a `.gitignore` lockout will resurrect.** If the same taboo file shows up across two audits, the fix belongs in `/migrate`, not `/do`.
184
+ - **Disclaimer / terminator / lockout / runway** are the four "tripwire" patterns that catch silent contract drift. Reach for them before inventing new ones.
185
+
186
+ ## Why this skill exists
187
+
188
+ Skills are prose. Prose drifts. Without a feedback loop, the SKILL.md files become wishful thinking that the agent ignores in real sessions. This skill is the loop.