pi-dev 0.2.8 → 0.2.9
- package/package.json +1 -1
- package/skills/improve-skill-flow/SKILL.md +131 -307
package/package.json
CHANGED
package/skills/improve-skill-flow/SKILL.md
CHANGED
|
@@ -1,31 +1,25 @@
|
|
|
1
1
|
---
|
|
2
2
|
name: improve-skill-flow
|
|
3
|
-
description: MAINTAINER-ONLY. Analyse real pi-runtime session telemetry from
|
|
3
|
+
description: MAINTAINER-ONLY. Analyse real pi-runtime session telemetry from consumer repos to find where the engineering lifecycle, especially /do, drifts from its contract; then apply the smallest evidence-backed framework, runtime, or consumer-prefs improvement. Run only from the pi-dev repo.
|
|
4
4
|
---
|
|
5
5
|
|
|
6
|
-
# /improve-skill-flow —
|
|
7
|
-
|
|
8
|
-
**Audience: pi-dev maintainers only.** Consumers do not get this skill installed. Consumers improve their own setup by editing `docs/agents/preferences.md` (per project) or `~/.pi/agent/preferences.md` (per machine), not by editing SKILL.md bodies. The framework is fixed for them; only the maintainer changes it.
|
|
9
|
-
|
|
10
|
-
The pi-dev workflow is not just markdown. It is a layered control system: SKILL.md prose, repo preferences, pi-runtime lifecycle events, tools, commands, TUI surfaces, session persistence, compaction hooks, and extensions. This skill reads what real sessions did across consumer repos, compares that to what the workflow contract said *should* happen, then chooses the lightest effective intervention.
|
|
11
|
-
|
|
12
|
-
The point: **never improve the framework from gut feeling, never collapse every fix into a "guard", and never append diary-style history.** Edit because session N showed phase P violated predicate Q on M occasions, then decide whether the right response is clearer prose, repo-local preferences, a runtime steer, a custom tool, a command, a TUI affordance, state persistence, compaction/context shaping, or — only when the evidence calls for it — a blocking gate. Every wording change must pay rent: reduce future confusion, replace weaker text, merge duplicate rules, or delete obsolete context.
|
|
6
|
+
# /improve-skill-flow — Evidence loop for the pi-dev workflow
|
|
13
7
|
|
|
14
8
|
## North star
|
|
15
9
|
|
|
16
|
-
`/do` is the autonomous engineering lifecycle orchestrator.
|
|
10
|
+
`/do` is the autonomous engineering lifecycle orchestrator. It should carry a request A→Z: classify → scope → plan → run phases in order → satisfy terminal predicates → verify → apply side effects per prefs → leave durable state in code / issues / preferences. This skill audits whether that lifecycle happened in real sessions and improves the framework so the next run is more autonomous, durable, low-interruption, and correct.
|
|
17
11
|
|
|
18
|
-
|
|
12
|
+
Do not append diary history. Every edit must pay rent by removing ambiguity, replacing weaker text, merging duplicates, deleting stale rationale, or adding a runtime affordance that measurably improves the lifecycle.
|
|
19
13
|
|
|
20
|
-
## Pre-flight
|
|
14
|
+
## Pre-flight gate
|
|
21
15
|
|
|
22
|
-
|
|
16
|
+
Run only from the pi-dev checkout:
|
|
23
17
|
|
|
24
18
|
```bash
|
|
25
19
|
origin=$(git -C "$PWD" remote get-url origin 2>/dev/null || echo "")
|
|
26
20
|
pkg_name=$(jq -r '.name // empty' package.json 2>/dev/null)
|
|
27
21
|
ok=0
|
|
28
|
-
case "$origin" in *pi-dev*|*pi-dev.git) ok=1 ;; esac
|
|
22
|
+
case "$origin" in *pi-dev*|*pi-dev.git|*pi-flow*|*pi-flow.git) ok=1 ;; esac
|
|
29
23
|
[ "$pkg_name" = "pi-dev" ] && ok=1
|
|
30
24
|
if [ "$ok" != 1 ]; then
|
|
31
25
|
echo "this skill is maintainer-only; cd into the pi-dev repo and re-run"
|
|
@@ -33,375 +27,205 @@ if [ "$ok" != 1 ]; then
|
|
|
33
27
|
fi
|
|
34
28
|
```
|
|
35
29
|
|
|
36
|
-
If the gate fails, stop.
|
|
37
|
-
|
|
38
|
-
## When to run
|
|
39
|
-
|
|
40
|
-
- A consumer repo on this machine has accumulated at least one real day of pi sessions (≈ 3+ `.jsonl` files).
|
|
41
|
-
- A specific skill is suspected of misbehaving ("why does `/do` keep stopping?").
|
|
42
|
-
- After landing a skill change, to verify the next session(s) actually follow the new wording.
|
|
43
|
-
- Periodically (every N releases) as a regression sweep across all human-facing skills.
|
|
44
|
-
|
|
45
|
-
Always from inside the pi-dev checkout (see Pre-flight).
|
|
46
|
-
|
|
47
|
-
## What this skill is NOT
|
|
48
|
-
|
|
49
|
-
- Not for analysing the *codebase* of any repo — that is `improve-codebase-architecture`.
|
|
50
|
-
- Not for shipping engineering work — it does not invoke `/do`. Findings turn into proposed SKILL.md diffs, which are committed via the normal release-please flow on `pi-dev`.
|
|
51
|
-
- Not a real-time monitor — it reads completed session files.
|
|
52
|
-
- Not a consumer-facing tool. Consumers don't get this skill installed; they tune their setup via `preferences.md`, not by editing SKILL bodies.
|
|
30
|
+
If the gate fails, stop.
|
|
53
31
|
|
|
54
32
|
## Inputs
|
|
55
33
|
|
|
56
|
-
|
|
57
|
-
- Optional: a specific skill name to focus the audit on (`do`, `migrate`, `triage`, …).
|
|
58
|
-
- Optional: a date range.
|
|
59
|
-
- **Fix scope** per finding — `framework`, `extension`, or `consumer-prefs`. Defaults set in Step 5.5; maintainer can flip individual rows before applying.
|
|
60
|
-
|
|
61
|
-
## Fix scopes
|
|
62
|
-
|
|
63
|
-
pi-runtime loads three artefact kinds the framework can ship:
|
|
64
|
-
|
|
65
|
-
- **Skill bodies** — `~/.pi/agent/skills/<name>/SKILL.md` (global) or `<repo>/.pi/skills/<name>/SKILL.md` (local). Pure prose.
|
|
66
|
-
- **Runtime interventions** — extensions under `~/.pi/agent/extensions/<name>/` (global) or `<repo>/.pi/extensions/<name>/` (local). TypeScript modules auto-loaded via jiti. They can subscribe to lifecycle/session/agent/model/tool events, inject context in `before_agent_start`, reshape model context, observe `message_end`, intercept or modify tool calls/results, register custom tools/commands/shortcuts/flags, prompt via `ctx.ui`, render custom TUI widgets, persist state with `pi.appendEntry()`, and customize compaction/session behavior. pi-dev ships **one** extension by default — `pi-flow` — and the bar to add a second package remains high.
|
|
67
|
-
- **Preferences** — `docs/agents/preferences.md` (per repo) and `~/.pi/agent/preferences.md` (per machine). 3-layer override on prose-level decisions.
|
|
68
|
-
|
|
69
|
-
A finding lands in exactly one of:
|
|
70
|
-
|
|
71
|
-
| scope | lands in | reaches | propagation | when to pick |
|
|
72
|
-
| --- | --- | --- | --- | --- |
|
|
73
|
-
| **framework** | this repo's `skills/<name>/SKILL.md` | every consumer after the next `npx pi-dev update` | release-please → npm publish | the SKILL.md wording is wrong; gap shows up generically; prose alone is plausibly enough |
|
|
74
|
-
| **extension** | this repo's `extensions/pi-flow/index.ts` (extend) or a new `extensions/<name>/` (rare) | every consumer after `npx pi-dev update` | release-please → npm publish | a real audit shows prose / prefs cannot reliably preserve the workflow, AND pi-runtime has an event/tool/UI/state hook that can make the desired path easier, more observable, or safer |
|
|
75
|
-
| **consumer-prefs** | the audited consumer repo's `docs/agents/preferences.md` | only that repo, on every `/do` bootstrap | regular consumer-repo commit | gap is the consumer repo's domain / paths / conventions, not the SKILL.md |
|
|
76
|
-
|
|
77
|
-
Notes:
|
|
78
|
-
|
|
79
|
-
- A `framework` apply is **always** mirrored into `~/.pi/agent/skills/<name>/` on this machine so the next session picks it up immediately, without waiting for npm.
|
|
80
|
-
- An `extension` apply is **always** mirrored into `~/.pi/agent/extensions/<name>/` on this machine for the same reason. The package directory name is the install name verbatim (no `pi-dev-` prefix).
|
|
81
|
-
- A `consumer-prefs` apply touches no pi-dev files. It is committed to the consumer repo only.
|
|
82
|
-
- A single audit may produce a mix of all three. Decide scope per finding, not per audit.
|
|
83
|
-
|
|
84
|
-
**Extension scope is the most expensive option, but it is not synonymous with "guards".** Runtime interventions can be passive observability, progress/status UI, context injection, command shortcuts, structured tools, state checkpoints, compaction shaping, soft steers, confirmations, or hard blocks. Default to `framework` / `consumer-prefs` when prose or repo-local convention plausibly fixes the gap; reach for `extension` when the audit shows the workflow needs runtime support.
|
|
34
|
+
Ask once only if missing:
|
|
85
35
|
|
|
86
|
-
|
|
36
|
+
- target consumer repo path, sessions dir, or "all repos with sessions"
|
|
37
|
+
- focus skill (`do`, `triage`, `migrate`, …) or "everything"
|
|
38
|
+
- date range or "all"
|
|
87
39
|
|
|
88
|
-
|
|
89
|
-
2. **Make the right path easier** — custom command, custom tool, context injection, remembered state, or TUI affordance.
|
|
90
|
-
3. **Steer** — inject a follow-up message or system/context nudge when the model is drifting but no irreversible action happened.
|
|
91
|
-
4. **Confirm** — ask the user only for genuinely risky or preference-silent operations.
|
|
92
|
-
5. **Block** — refuse a tool call only for destructive, costly, or repeatedly proven workflow violations.
|
|
40
|
+
## Fix targets
|
|
93
41
|
|
|
94
|
-
|
|
42
|
+
| target | lands in | use when |
|
|
43
|
+
| --- | --- | --- |
|
|
44
|
+
| `framework` | `skills/<name>/SKILL.md` | generic prose contract is wrong, missing, duplicated, or ambiguous |
|
|
45
|
+
| `extension` | `extensions/pi-flow/` or rare new extension | runtime support would make the correct lifecycle easier, visible, recoverable, or safer |
|
|
46
|
+
| `consumer-prefs` | audited repo's `docs/agents/preferences.md` | repo-specific convention, path, smoke, glossary, env, tracker, or taboo |
|
|
47
|
+
| `defer` | no edit | evidence is weak, already fixed, or needs a next-session re-audit |
|
|
95
48
|
|
|
96
|
-
|
|
97
|
-
- The runtime intervention can be described as a small event-driven mechanism over pi's lifecycle/session/agent/tool/UI/state surfaces; no hidden LLM call unless the finding is explicitly about a summarizer/evaluator extension.
|
|
98
|
-
- The success metric is measurable in the next audit: fewer user nudges, higher `/do` phase completion, fewer corrections, better issue shape, shorter stalled intervals, or clearer live evidence.
|
|
99
|
-
- A simpler remedy (docs wording, prefs, `.gitignore` lockout, issue-template fix) does **not** plausibly close the gap. If it does, take the simpler remedy.
|
|
49
|
+
Mirror framework edits to `~/.pi/agent/skills/<name>/`. Mirror extension edits to `~/.pi/agent/extensions/<name>/`. Consumer-prefs edits stay in the consumer repo.
|
|
100
50
|
|
|
101
|
-
|
|
51
|
+
Runtime intervention is not synonymous with a guard. Consider the ladder: `observe → affordance/tool/command → context/state/render → steer → confirm → block`. Pick the weakest intervention that would have changed the audited session outcome, and define a next-audit metric for it.
|
|
102
52
|
|
|
103
|
-
## Session
|
|
53
|
+
## Session facts to rely on
|
|
104
54
|
|
|
105
|
-
|
|
55
|
+
Sessions live at:
|
|
106
56
|
|
|
107
|
-
```
|
|
57
|
+
```text
|
|
108
58
|
~/.pi/agent/sessions/--<cwd-with-slashes-replaced-by-dashes>--/<ISO-ts>_<uuid>.jsonl
|
|
109
59
|
```
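As a rough illustration of the path template above, the sketch below maps a working directory to its sessions folder and lists the `.jsonl` files with size and line count. The helper names `sessions_dir_for` and `list_sessions` are hypothetical, and the exact encoding (leading slash dropped, remaining slashes turned into dashes) is an assumption read off the template, not confirmed pi behavior.

```python
from pathlib import Path
from typing import List, Optional, Tuple


def sessions_dir_for(cwd: str, root: Optional[Path] = None) -> Path:
    """Map a working directory to its sessions folder (assumed encoding)."""
    root = root or Path.home() / ".pi" / "agent" / "sessions"
    # Assumption: the leading slash is dropped and the rest become dashes.
    encoded = cwd.strip("/").replace("/", "-")
    return root / f"--{encoded}--"


def list_sessions(directory: Path) -> List[Tuple[str, int, int]]:
    """Return (file name, size in bytes, line count) per .jsonl file, sorted by name."""
    rows = []
    for path in sorted(directory.glob("*.jsonl")):
        with path.open("rb") as fh:
            line_count = sum(1 for _ in fh)
        rows.append((path.name, path.stat().st_size, line_count))
    return rows
```

If the folder name produced here does not exist on disk, compare it with the real directory names under `~/.pi/agent/sessions/` before trusting the encoding.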
|
|
110
60
|
|
|
111
|
-
|
|
112
|
-
|
|
113
|
-
Each line is one of these record types:
|
|
114
|
-
|
|
115
|
-
| `type` | meaning |
|
|
116
|
-
| --- | --- |
|
|
117
|
-
| `session` | session header — has `cwd`, `timestamp`, `version`, `id` |
|
|
118
|
-
| `model_change` | provider / modelId switch |
|
|
119
|
-
| `thinking_level_change` | reasoning effort knob |
|
|
120
|
-
| `message` | a user / assistant / toolResult turn |
|
|
121
|
-
|
|
122
|
-
**Record shape for `message` (do not skim this — it has bitten parsers before):**
|
|
123
|
-
|
|
124
|
-
```jsonc
|
|
125
|
-
{
|
|
126
|
-
"type": "message",
|
|
127
|
-
"id": "...",
|
|
128
|
-
"parentId": "...",
|
|
129
|
-
"timestamp": "2026-05-11T16:46:49.795Z",
|
|
130
|
-
"message": { // ← NESTED. role/content are HERE, not at top level.
|
|
131
|
-
"role": "user" | "assistant" | "toolResult",
|
|
132
|
-
"content": [ ...blocks ],
|
|
133
|
-
"timestamp": "...",
|
|
134
|
-
// assistant-only extras: api, provider, model, usage, stopReason, responseId
|
|
135
|
-
// toolResult-only extras: toolCallId, toolName, isError
|
|
136
|
-
}
|
|
137
|
-
}
|
|
138
|
-
```
|
|
139
|
-
|
|
140
|
-
A correct read path is `rec["message"]["role"]` and `rec["message"]["content"]`. Reading `rec["role"]` / `rec["content"]` returns `None` for every record and silently produces a zero-row signal table — if your first pass shows all counters at 0, this is almost certainly why.
|
|
141
|
-
|
|
142
|
-
`message.content` is a **list of blocks**, each block has a `type`:
|
|
143
|
-
|
|
144
|
-
| block `type` | shape | what it represents |
|
|
145
|
-
| --- | --- | --- |
|
|
146
|
-
| `text` | `{text}` | plain text from either side |
|
|
147
|
-
| `thinking` | `{thinking, thinkingSignature}` | model scratchpad (assistant only) |
|
|
148
|
-
| `toolCall` | `{id, name, arguments}` | **pi uses this** — NOT Anthropic SDK's `tool_use` |
|
|
149
|
-
| `toolResult` | `{id, output}` | **pi uses this** — NOT `tool_result` |
|
|
150
|
-
|
|
151
|
-
Roles in practice: `user`, `assistant`, and **`toolResult`** (yes, role and block type share the name; a `toolResult`-role message contains one or more `text` blocks holding the tool output). Treat `toolResult`-role messages as siblings of the originating `toolCall` — do not double-count them as user/assistant turns.
|
|
61
|
+
Message records are nested. Always read `rec["message"]["role"]` and `rec["message"]["content"]`, never top-level `role/content`. `toolResult` is a message role and also a block type; do not count toolResult-role messages as user/assistant turns. Pi tool calls are content blocks with `type == "toolCall"` and lower-case `name` (`bash`, `read`, `edit`, `write`, …). Skill injections arrive as `<skill name="...">` blocks inside **user-role** messages.
|
|
152
62
|
|
|
153
|
-
|
|
154
|
-
|
|
155
|
-
**Parser pre-flight (mandatory before Step 2 aggregates).** After you write the one-shot Python parser, run it on the newest 1–2 `.jsonl` files and assert the following are non-zero for any session that obviously had work done:
|
|
63
|
+
Parser pre-flight is mandatory on the newest 1–2 work sessions:
|
|
156
64
|
|
|
157
65
|
```python
|
|
158
|
-
assert tool_count, "toolCall extraction returned 0 — check rec['message']['content']
|
|
66
|
+
assert tool_count, "toolCall extraction returned 0 — check rec['message']['content']"
|
|
159
67
|
assert user_msg_total or skill_inject_count, "no user messages parsed — same nested-message bug"
|
|
160
68
|
```
|
|
161
69
|
|
|
162
|
-
|
|
163
|
-
|
|
164
|
-
**Critical detail:** the pi runtime injects each skill's `SKILL.md` content into the conversation as a `<skill name="..." location="...">…</skill>` block embedded inside a **user-role** message (system-side injection, but the role is user). This is how you tell the classifier loaded a skill. Count these to see what `/do` actually picked.
|
|
165
|
-
|
|
166
|
-
Timestamps on `message` records are ISO strings; some other record types use int millis. Handle both.
|
|
70
|
+
A zero-row signal table is a parser bug, not a finding.
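A minimal sketch of the nested read path, assuming each JSONL line is one record shaped as described above. The function name `tally_session` and the returned tuple are illustrative choices, not part of pi or pi-dev.

```python
import json
from collections import Counter


def tally_session(lines):
    """Count message roles, toolCall names, and skill injections in one session."""
    roles, tools = Counter(), Counter()
    skill_injections = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        if rec.get("type") != "message":
            continue  # session headers, model changes, etc.
        msg = rec.get("message") or {}  # nested: role/content are not top-level
        role = msg.get("role")
        roles[role] += 1
        content = msg.get("content") or []
        if isinstance(content, str):  # defensive: tolerate a bare string
            content = [{"type": "text", "text": content}]
        for block in content:
            if not isinstance(block, dict):
                continue
            if block.get("type") == "toolCall":
                tools[block.get("name", "?")] += 1
            elif block.get("type") == "text" and role == "user":
                # Skill injections arrive inside user-role text blocks.
                skill_injections += block.get("text", "").count("<skill name=")
    return roles, tools, skill_injections
```

With this shape, `tool_count` in the assertions above would be `sum(tools.values())` and `user_msg_total` would be `roles["user"]`; toolResult-role messages are tallied separately so they are not mistaken for user or assistant turns.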
|
|
167
71
|
|
|
168
72
|
## Process
|
|
169
73
|
|
|
170
74
|
### 1 — Scope and load
|
|
171
75
|
|
|
172
|
-
|
|
173
|
-
|
|
174
|
-
- target consumer repo (or "all repos with sessions on this machine")
|
|
175
|
-
- a skill to focus on, or "everything"
|
|
176
|
-
- a date range or "all"
|
|
177
|
-
|
|
178
|
-
Resolve the sessions directory. List the `.jsonl` files with size + line count so the maintainer can see the input scale.
|
|
76
|
+
Resolve the sessions directory. List candidate `.jsonl` files with size + line count. Use the requested range; otherwise default to all relevant files for the named repo(s).
|
|
179
77
|
|
|
180
78
|
### 2 — Build the raw signal table
|
|
181
79
|
|
|
182
|
-
|
|
183
|
-
|
|
184
|
-
**Per session:**
|
|
185
|
-
- user messages, assistant messages, tool calls, thinking blocks
|
|
186
|
-
- duration (last ts − first ts)
|
|
187
|
-
|
|
188
|
-
**Aggregates:**
|
|
189
|
-
- tool-use frequency (`bash`, `read`, `edit`, `write`, …)
|
|
190
|
-
- `<skill name="X">` injections per skill (this is the classifier's actual decision)
|
|
191
|
-
- skills mentioned in user text vs. assistant text (request side vs. recall side)
|
|
192
|
-
- top Edit / Write targets (hotspot files)
|
|
193
|
-
- bash commands matching domain-specific danger / smell patterns
|
|
194
|
-
|
|
195
|
-
**Workflow-compliance specific:**
|
|
196
|
-
- Reconstruct `/do`'s lifecycle trajectory: classify → scope → plan → phase execution → terminal predicates → verification → side effects → durable final state. Score the ordered lifecycle, not just the final outcome.
|
|
197
|
-
- After each `<skill name="do">` injection, did another skill get injected within the same session? **This measures `/do` chain depth.** Single-step chains are a red flag, but not the whole story.
|
|
198
|
-
- Did `/do` emit `[flow plan]`? Count planned phases and observed `[flow N/M]` status lines. Flag: plan missing, N/M skipped, final summary before all phases, or terminal predicate not evidenced.
|
|
199
|
-
- Count user messages shorter than ~80 chars — these are usually nudges ("진행해", "다음은?", "끝났어?"). High proportion = `/do` is handing the flow back too often.
|
|
200
|
-
- Count user messages containing correction markers (`아니`, `wait`, `stop`, `취소`, `다시`, `그만`, `undo`, `revert`, `왜`, `??`, `제대로`). These mark interventions.
|
|
201
|
-
- Detect stalled intervals: long gaps between assistant text / tool calls, repeated identical commands, or repeated failed live probes beyond the repo's wait budget.
|
|
202
|
-
- Count local-live / ops-live evidence when runtime behavior changed: command run, observed output, log/screenshot/evidence embedded in summary.
|
|
203
|
-
- Count side-effect gates: commits, pushes, PRs, issue creates/edits/closes vs. merged prefs (`auto-*`) and tracker docs.
|
|
204
|
-
- Count `<!-- migrated: ... -->` marker date vs. any post-marker `docs/handoff/` / `.scratch/flow/` / `SESSION_*.md` writes. Drift = handoff lockout failed.
|
|
205
|
-
- Count tracker writes (`gh issue create`, `gh pr create`) and inspect the bodies; the bodies are plain spec consumed by future worker agents, so look for issues with shape problems (missing parent, no acceptance criteria, wrong state labels, wrong parent topology, etc.) rather than meta tags.
|
|
80
|
+
Use one deterministic script in `/tmp`; do not eyeball JSONL by hand. Tally:
|
|
206
81
|
|
|
207
|
-
|
|
82
|
+
- per session: duration, user/assistant/toolResult messages, tool calls, thinking blocks
|
|
83
|
+
- tools used; edit/write targets; selected bash commands (`git`, `gh`, `npm publish`, dangerous or repo-specific patterns)
|
|
84
|
+
- skill injections and sequence, especially `/do` → next skill
|
|
85
|
+
- short user nudges and correction markers (`아니`, `왜`, `??`, `wait`, `stop`, `undo`, `revert`, `제대로`)
|
|
86
|
+
- tracker writes and issue/PR body shape
|
|
87
|
+
- post-migration writes to forbidden state paths
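The nudge and correction tallies in the list above could be computed along these lines. This is a sketch under assumptions: the 80-character cutoff comes from the earlier wording of this skill ("shorter than ~80 chars") and is a heuristic, the helper name `classify_user_text` is hypothetical, and plain substring matching will over-count words such as "stopped".

```python
# Marker strings copied from the tally list; matching is case-insensitive.
CORRECTION_MARKERS = ("아니", "왜", "??", "wait", "stop", "undo", "revert", "제대로")
NUDGE_MAX_CHARS = 80  # assumed cutoff, taken from the earlier "~80 chars" wording


def classify_user_text(text):
    """Label one user text block as a short nudge and/or a correction."""
    stripped = text.strip()
    if "<skill name=" in stripped:
        # Skill injections are user-role too, but they are not human input.
        return {"nudge": False, "correction": False}
    lowered = stripped.lower()
    return {
        "nudge": 0 < len(stripped) < NUDGE_MAX_CHARS,
        "correction": any(marker in lowered for marker in CORRECTION_MARKERS),
    }
```

A message can be both a nudge and a correction; counting the two flags separately keeps the signal table honest about which one moved after a fix.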
|
|
208
88
|
|
|
209
|
-
|
|
210
|
-
|
|
211
|
-
For the same date range, pull (against the **consumer repo** being audited):
|
|
212
|
-
|
|
213
|
-
- `git log --since=<start> --pretty=format:"%h %ad %s"` — commit cadence vs. the predicate `auto-commit-per-slice`.
|
|
214
|
-
- `gh issue list / pr list` (if GitHub) — slice/PR shape vs. `default-issue-style=vertical-slice`.
|
|
215
|
-
- Existence of forbidden paths from `docs/agents/preferences.md` taboos (`docs/handoff/`, `.scratch/flow/`, etc.).
|
|
216
|
-
- `gh label list` filtered against taboo-marked labels.
|
|
217
|
-
|
|
218
|
-
Cross-reference each signal against:
|
|
219
|
-
|
|
220
|
-
- The skill's **terminal predicate** in `do/SKILL.md` → "Phase contracts".
|
|
221
|
-
- The consumer repo's **`docs/agents/preferences.md`** taboos and `auto-*` settings.
|
|
222
|
-
- The hard rules in the skill being audited.
|
|
223
|
-
|
|
224
|
-
### 4 — Score the gaps
|
|
225
|
-
|
|
226
|
-
Produce a small table per audited skill. Each row:
|
|
89
|
+
For `/do`, reconstruct the ordered lifecycle:
|
|
227
90
|
|
|
91
|
+
```text
|
|
92
|
+
classify → scope → plan → phase execution → terminal predicates → verification → side effects → durable final state
|
|
228
93
|
```
|
|
229
|
-
| signal | observed | expected (predicate / rule) | severity |
|
|
230
|
-
| ------------------------------ | -------- | ---------------------------------- | -------- |
|
|
231
|
-
| /do → next-skill chain depth | 1/5 | ≥1 per chain step (M phases) | 🔴 |
|
|
232
|
-
| /do hand-back via follow-up | 8/13 | 0 mid-chain hand-backs | 🔴 |
|
|
233
|
-
| post-marker handoff writes | 9 | 0 | 🟡 |
|
|
234
|
-
| ... | | | |
|
|
235
|
-
```
|
|
236
|
-
|
|
237
|
-
Severity rule of thumb:
|
|
238
|
-
- 🔴 — predicate is violated repeatedly AND the rule is binding (in "Hard rules" or "Phase contracts").
|
|
239
|
-
- 🟡 — taboo or preference violated but the skill's wording is soft / advisory.
|
|
240
|
-
- 🟢 — observed behaviour matches the spec; do not propose a change.
|
|
241
94
|
|
|
242
|
-
|
|
95
|
+
Flag missing `[flow plan]`, skipped `[flow N/M]`, final summary before all phases, terminal predicates without evidence, missing local/ops-live evidence, side effects that contradict prefs, repeated failed commands, stalls beyond wait budget, or user re-entry nudges.
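Two of the flags above (missing `[flow plan]`, skipped `[flow N/M]`) lend themselves to a mechanical check. The sketch below assumes the status lines appear literally as `[flow plan]` and `[flow N/M]` with decimal integers and a single space; the function name `flow_gaps` and the returned keys are hypothetical.

```python
import re

# Assumed literal format of a phase status line, e.g. "[flow 2/5]".
FLOW_STEP = re.compile(r"\[flow (\d+)/(\d+)\]")


def flow_gaps(assistant_text):
    """Report a missing plan marker and any phase numbers never announced."""
    has_plan = "[flow plan]" in assistant_text
    steps = [(int(n), int(m)) for n, m in FLOW_STEP.findall(assistant_text)]
    total = max((m for _, m in steps), default=0)
    seen = {n for n, _ in steps}
    skipped = [n for n in range(1, total + 1) if n not in seen]
    return {"plan_missing": not has_plan, "total": total, "skipped": skipped}
```

Feed it the concatenated assistant text of one session. A non-empty `skipped` list is only a lead; anchor it with a quoted excerpt before scoring the row.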
|
|
243
96
|
|
|
244
|
-
|
|
97
|
+
### 3 — Cross-reference live repo state
|
|
245
98
|
|
|
246
|
-
|
|
247
|
-
- a tool-call command that violates a taboo,
|
|
248
|
-
- the first 80 chars of an issue body that violates the slice template.
|
|
99
|
+
Against the audited repo, bounded by the same date range:
|
|
249
100
|
|
|
250
|
-
|
|
251
|
-
|
|
252
|
-
|
|
253
|
-
|
|
254
|
-
|
|
255
|
-
|
|
256
|
-
|
|
101
|
+
```bash
|
|
102
|
+
git log --since=<start> --pretty=format:"%h %ad %s" --date=short | head -50
|
|
103
|
+
git status -sb
|
|
104
|
+
find docs/handoff .scratch/flow .handoff -type f 2>/dev/null | head
|
|
105
|
+
# if GitHub is the tracker:
|
|
106
|
+
gh issue list --state open --limit 20 --json number,title,labels,url
|
|
107
|
+
gh pr list --state open --limit 20 --json number,title,url
|
|
108
|
+
```
|
|
257
109
|
|
|
258
|
-
|
|
110
|
+
Read `docs/agents/preferences.md`, `issue-tracker.md`, `triage-labels.md`, and `domain.md`. Compare observed behavior to `/do` phase predicates, hard rules in the focused skill, and merged prefs.
|
|
259
111
|
|
|
260
|
-
|
|
261
|
-
- The fix needs runtime affordance rather than prose: observe counters, inject context, add a command/tool, show TUI status, persist/checkpoint state, shape compaction/context, steer after a message, confirm a risky action, or block an unsafe tool call.
|
|
262
|
-
- The success condition can be measured in the next session from telemetry, not merely hoped for from stronger wording.
|
|
112
|
+
### 4 — Score gaps
|
|
263
113
|
|
|
264
|
-
|
|
114
|
+
Produce a compact table per audited skill:
|
|
265
115
|
|
|
266
|
-
|
|
116
|
+
```markdown
|
|
117
|
+
| signal | observed | expected | severity |
|
|
118
|
+
| --- | --- | --- | --- |
|
|
119
|
+
| /do lifecycle completion | 2/7 phases evidenced | all planned phases + predicates | 🔴 |
|
|
120
|
+
```
|
|
267
121
|
|
|
268
|
-
|
|
269
|
-
- The proposed fix is a generic anti-pattern string, a terminator literal, a runway line, or a lockout that any repo would benefit from.
|
|
270
|
-
- The same gap would plausibly show up in two or more consumer repos.
|
|
122
|
+
Severity:
|
|
271
123
|
|
|
272
|
-
|
|
124
|
+
- 🔴 repeated or high-cost violation of a hard rule / phase predicate / lifecycle step
|
|
125
|
+
- 🟡 preference or taboo drift, weak wording, noisy but plausible signal
|
|
126
|
+
- 🟢 healthy; do not propose a change
|
|
273
127
|
|
|
274
|
-
|
|
275
|
-
- The fix is a taboo, a smoke convention, an env / boot detail, or a glossary entry.
|
|
276
|
-
- The fix would not apply (or would be wrong) in another consumer repo.
|
|
128
|
+
### 5 — Anchor evidence
|
|
277
129
|
|
|
278
|
-
|
|
130
|
+
For every 🔴 / 🟡 row, quote the smallest proof: timestamp + first 80 chars of a user correction, assistant hand-back, tool command, path write, or issue body. No excerpt → demote to TODO and keep digging.
|
|
279
131
|
|
|
280
|
-
|
|
281
|
-
| # | finding (short) | default scope | target file | flip? |
|
|
282
|
-
| - | ---------------------------------------- | ---------------- | ---------------------------------------------------- | ----- |
|
|
283
|
-
| 1 | /do hands flow back between phases | extension:steer | pi-dev:extensions/pi-flow/index.ts (message_end) | |
|
|
284
|
-
| 2 | long live-smoke silence causes user nudges | extension:render/observe | pi-dev:extensions/pi-flow/index.ts or project extension | |
|
|
285
|
-
| 3 | post-marker handoff writes | framework | pi-dev:skills/migrate/SKILL.md (gitignore lockout) | |
|
|
286
|
-
| 4 | smoke command name changed in S058 | consumer-prefs | hugn:docs/agents/preferences.md | |
|
|
287
|
-
```
|
|
132
|
+
### 5.5 — Choose target and intervention type
|
|
288
133
|
|
|
289
|
-
|
|
134
|
+
Before choosing, answer one sentence:
|
|
290
135
|
|
|
291
|
-
|
|
136
|
+
> What would have made the correct lifecycle easier to follow at the moment of failure?
|
|
292
137
|
|
|
293
|
-
|
|
138
|
+
Then choose:
|
|
294
139
|
|
|
295
|
-
|
|
140
|
+
- `framework` when wording/ordering/terminal predicates are generically wrong or confusing.
|
|
141
|
+
- `extension:<observe|affordance|context|state|render|steer|confirm|block>` when prose/prefs were available but runtime support is needed and next-session movement is measurable.
|
|
142
|
+
- `consumer-prefs` when the fix is repo-specific.
|
|
143
|
+
- `defer` when already fixed, weak, or better verified after another session.
|
|
296
144
|
|
|
297
|
-
|
|
145
|
+
Show the scope table once and let the maintainer flip rows before applying.
|
|
298
146
|
|
|
299
|
-
|
|
300
|
-
- Prefer **replacement over accumulation**. A good edit often makes the skill shorter; adding lines is justified only when it removes a larger ambiguity.
|
|
301
|
-
- Prefer **explicit anti-pattern strings** ("Do not say 'shall I continue?'") over abstract injunctions ("be decisive"). The hugn-2026-05 audit showed that named anti-patterns work.
|
|
302
|
-
- Prefer **terminal markers** ("the summary's last line must be one of these two literals: …") over qualitative descriptions of "good wrap-up".
|
|
303
|
-
- Update **at most three skills per run.** More than that means findings aren't anchored well enough.
|
|
147
|
+
### 6 — Draft lean edits
|
|
304
148
|
|
|
305
|
-
|
|
149
|
+
Inspect the whole target section before drafting. Prefer replace/merge/delete over append. If the file grows, state what ambiguity the extra lines remove; otherwise rewrite shorter.
|
|
306
150
|
|
|
307
|
-
|
|
308
|
-
- Prefer extending `pi-flow` when the concern is generic workflow compliance. Create a new extension only when the concern is large enough to be independently named, toggled, installed, and audited.
|
|
309
|
-
- Pick the weakest intervention that would have changed the session outcome: observe → affordance → context/state/render → steer → confirm → block.
|
|
310
|
-
- Determinism is mandatory for `confirm` and `block`; it is desirable but not sufficient for softer interventions. A progress widget, command shortcut, or state checkpoint can be valuable even when it does not block anything.
|
|
311
|
-
- Runtime behavior must be toggleable via `~/.pi/agent/settings.json` when it can alter turns or tool execution. Default on only if the audit shows broad benefit.
|
|
312
|
-
- Keep the corresponding SKILL.md prose as the human-readable contract: one-line *why*, the expected behavior, and a pointer to the runtime support. Do not delete the why — the model still needs to know the intent when the extension is off.
|
|
313
|
-
- Update **at most one runtime mechanism per run** unless the second is pure observability. Two behavior changes at once destroys the next audit's ability to attribute movement.
|
|
151
|
+
Framework edits:
|
|
314
152
|
|
|
315
|
-
|
|
153
|
+
- edit a rule, step, predicate, or anti-pattern the model can self-check
|
|
154
|
+
- keep at most three skill files per run
|
|
155
|
+
- remove obsolete or duplicate rationale while adding the fix
|
|
316
156
|
|
|
317
|
-
|
|
157
|
+
Extension edits:
|
|
318
158
|
|
|
319
|
-
|
|
320
|
-
|
|
321
|
-
|
|
322
|
-
|
|
323
|
-
| smoke / test / endpoint convention | `## Smoke / test conventions` |
|
|
324
|
-
| boot / env / probe change | `## Local-live playbook` |
|
|
325
|
-
| error taxonomy / 5-axes / domain rule | `## Diagnosis posture` |
|
|
326
|
-
| glossary / context term clarification | `## Glossary alignment` |
|
|
327
|
-
| rationale that doesn't fit elsewhere | `## Free notes` (one paragraph max, dated) |
|
|
159
|
+
- start from pi's actual surfaces: lifecycle/session/agent/model/tool events, `before_agent_start`, `context`, `tool_call`, `tool_result`, registered tools/commands, `ctx.ui`, custom rendering, `pi.appendEntry()`, compaction/session hooks
|
|
160
|
+
- prefer `pi-flow` for generic lifecycle support; create a new extension only for a separately named/toggled concern
|
|
161
|
+
- make turn/tool-altering behavior toggleable via `~/.pi/agent/settings.json`
|
|
162
|
+
- change at most one runtime mechanism per run unless the second is pure observability
|
|
328
163
|
|
|
329
|
-
-
|
|
330
|
-
- Do **not** invent new top-level sections unless three findings legitimately share one.
|
|
164
|
+
Consumer-prefs edits:
|
|
331
165
|
|
|
332
|
-
|
|
166
|
+
- use the narrowest existing section (`Project taboos`, `Smoke / test conventions`, `Local-live playbook`, `Diagnosis posture`, `Glossary alignment`, `Free notes`)
|
|
167
|
+
- one auditable bullet per finding
|
|
168
|
+
- keep the migration marker at EOF
|
|
333
169
|
|
|
334
|
-
|
|
170
|
+
Show unified diffs before applying.
|
|
335
171
|
|
|
336
|
-
|
|
172
|
+
### 7 — Apply, release, verify
|
|
337
173
|
|
|
338
|
-
|
|
174
|
+
Framework:
|
|
339
175
|
|
|
340
|
-
|
|
341
|
-
|
|
342
|
-
|
|
343
|
-
|
|
344
|
-
|
|
345
|
-
|
|
176
|
+
```bash
|
|
177
|
+
git add skills/<name>/SKILL.md
|
|
178
|
+
git commit -m "docs(<skill>): <evidence-anchored one-liner>"
|
|
179
|
+
cp skills/<name>/SKILL.md ~/.pi/agent/skills/<name>/SKILL.md
|
|
180
|
+
git push origin main
|
|
181
|
+
# merge release-please PR; confirm npm view pi-dev version
|
|
182
|
+
```
|
|
346
183
|
|
|
347
|
-
|
|
348
|
-
2. If a new extension was added, register it in `src/install.ts` `EXTENSIONS` array so `pi-dev install` and `pi-dev update` propagate it.
|
|
349
|
-
3. `npm run build` to ensure `install.ts` still compiles.
|
|
350
|
-
4. Smoke-test with `node dist/cli.js install --local --skip-prefs -y` in `/tmp/<fresh-dir>`. Verify the extension landed under `.pi/extensions/<name>/`.
|
|
351
|
-
5. `git add extensions/<name>/ src/install.ts src/paths.ts && git commit -m "feat(pi-flow): <one-liner anchoring the evidence>"`. Commit body must cite the signal.
|
|
352
|
-
6. Mirror to live: `cp -r extensions/<name> ~/.pi/agent/extensions/<name>` so the **next** session picks up the change without waiting for npm.
|
|
353
|
-
7. `git push origin main`; release-please → npm publish as usual.
|
|
184
|
+
Extension:
|
|
354
185
|
|
|
355
|
-
|
|
186
|
+
```bash
|
|
187
|
+
npm run build
|
|
188
|
+
node dist/cli.js install --local --skip-prefs -y # in a temp dir; verify extension copied
|
|
189
|
+
git add extensions/<name> src/install.ts src/paths.ts
|
|
190
|
+
git commit -m "feat(pi-flow): <evidence-anchored one-liner>"
|
|
191
|
+
cp -R extensions/<name> ~/.pi/agent/extensions/<name>
|
|
192
|
+
git push origin main
|
|
193
|
+
# merge release-please PR; confirm npm view pi-dev version
|
|
194
|
+
```
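
A caveat on the mirror step: when the destination directory already exists, `cp -R <src> <dest>` usually places a second copy inside it (`<dest>/<name>`) rather than overwriting in place. The sketch below is one way to make the mirror repeatable. The function name `mirror_extension` is hypothetical, and the paths in the comments are the placeholders from the block above.

```bash
# Hypothetical helper: mirror one extension directory into the live agent tree.
mirror_extension() {
  src=$1   # e.g. extensions/<name>
  dest=$2  # e.g. ~/.pi/agent/extensions/<name>
  [ -d "$src" ] || { echo "missing source: $src" >&2; return 1; }
  mkdir -p "$dest"
  # Copy the directory's contents (note the trailing /.) so a re-run overwrites
  # files in place instead of nesting a second copy under $dest.
  cp -R "$src"/. "$dest"/
}
```

Running it twice against the same destination should leave a single, flat copy of the extension.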
|
|
356
195
|
|
|
357
|
-
|
|
358
|
-
2. Bump the `last-updated` line at the top of the file to today's UTC date.
|
|
359
|
-
3. `git add docs/agents/preferences.md && git commit -m "docs(agents): <one-liner per finding>"`. Conventional Commits apply.
|
|
360
|
-
4. Push per that repo's normal workflow. No release-please involvement — preferences are not packaged.
|
|
196
|
+
Consumer-prefs:
|
|
361
197
|
|
|
362
|
-
|
|
198
|
+
```bash
|
|
199
|
+
# in consumer repo
|
|
200
|
+
git add docs/agents/preferences.md
|
|
201
|
+
git commit -m "docs(agents): <evidence-anchored one-liner>"
|
|
202
|
+
# push per that repo's workflow
|
|
203
|
+
```
|
|
363
204
|
|
|
364
|
-
|
|
365
|
-
2. Confirm each previously-🔴 signal has moved (chain depth up, intervention rate down, taboo writes gone, etc.).
|
|
366
|
-
3. If a signal did not move, the fix wording was too weak — file a follow-up audit, do not re-write from scratch.
|
|
205
|
+
After the next affected session, re-run this skill for the last 24 h and confirm the signal moved.
|
|
367
206
|
|
|
368
207
|
## Terminal predicate
|
|
369
208
|
|
|
370
|
-
|
|
209
|
+
Done when:
|
|
371
210
|
|
|
372
|
-
1.
|
|
373
|
-
2.
|
|
374
|
-
3.
|
|
375
|
-
4.
|
|
211
|
+
1. signal table + evidence excerpts are presented
|
|
212
|
+
2. each 🔴 / 🟡 finding has a target: `framework` / `extension:*` / `consumer-prefs` / `defer`
|
|
213
|
+
3. no 🔴 finding remains open: each is marked healthy or deferred with a rationale, or, if approved, has landed
|
|
214
|
+
4. landed framework/extension changes are mirrored live and released to npm; consumer-prefs changes are committed in the consumer repo
|
|
215
|
+
5. next-session re-audit plan is stated
|
|
376
216
|
|
|
377
|
-
|
|
217
|
+
Final line must be exactly one of:
|
|
378
218
|
|
|
379
|
-
```
|
|
219
|
+
```text
|
|
380
220
|
audit complete — no changes this cycle.
|
|
221
|
+
audit complete — framework v<X.Y.Z> released, consumer-prefs commit <sha-or-none>, next re-audit after the next session.
|
|
381
222
|
```
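
Because the final line must match exactly, it can be checked mechanically. The sketch below is an illustration only: the function name `is_valid_final_line` is hypothetical, and it assumes `<X.Y.Z>` is a three-part numeric version and `<sha-or-none>` is either `none` or a 7 to 40 character hexadecimal hash.

```bash
# Hypothetical check: succeed only if the argument is one of the two permitted final lines.
is_valid_final_line() {
  # Assumptions: version is N.N.N; commit is "none" or 7-40 hex characters.
  printf '%s\n' "$1" | grep -Eq '^audit complete — (no changes this cycle\.|framework v[0-9]+\.[0-9]+\.[0-9]+ released, consumer-prefs commit ([0-9a-f]{7,40}|none), next re-audit after the next session\.)$'
}
```

A caller might run it on the last line of the audit output and treat a non-zero exit status as an unfinished run.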
|
|
382
223
|
|
|
383
|
-
```
|
|
384
|
-
audit complete — framework v<X.Y.Z> released, consumer-prefs commit <sha>, next re-audit after the next session.
|
|
385
|
-
```
|
|
386
|
-
|
|
387
|
-
## What this skill does not do
|
|
388
|
-
|
|
389
|
-
- It does not modify a consumer repo's code or issues. It edits **pi-dev's own `skills/`** (framework scope) and — only when the audit demands it — the consumer's `docs/agents/preferences.md` (consumer-prefs scope).
|
|
390
|
-
- It does not invent gaps from first principles. Every finding must come from a session excerpt or a repo-state probe.
|
|
391
|
-
- It does not run from a consumer repo. The Pre-flight gate refuses; cd into pi-dev first.
|
|
392
|
-
- It does not run faster than the data allows — if there is only one session, run it but say so up front; the signal is noisy.
|
|
393
|
-
|
|
394
224
|
## Heuristics
|
|
395
225
|
|
|
396
|
-
-
|
|
397
|
-
-
|
|
398
|
-
-
|
|
399
|
-
-
|
|
400
|
-
-
|
|
401
|
-
-
|
|
402
|
-
- **Refresh external reliability patterns when designing new runtime support.** Current agent-reliability work emphasizes trajectory monitoring, recovery orchestration, and trace-derived textual feedback; use that to challenge pi-flow designs before coding.
|
|
403
|
-
- **Keep the skills lean.** Do not turn audits into append-only history. Prefer section rewrites, merged bullets, and deletion of stale rationale over adding another paragraph. If a skill crosses ~500 lines, treat that as a smell and look for compression before adding more.
|
|
404
|
-
|
|
405
|
-
## Why this skill exists
|
|
406
|
-
|
|
407
|
-
Skills are prose. Prose drifts. Without a feedback loop, the SKILL.md files become wishful thinking that the agent ignores in real sessions. This skill is the loop — and it is the maintainer's loop, not the consumer's.
|
|
226
|
+
- Judge `/do` by lifecycle completion, not by isolated success.
|
|
227
|
+
- Short user nudges are interruption smoke; inspect the preceding assistant turn.
|
|
228
|
+
- A task that ships after many corrections is not healthy; trace where autonomy failed.
|
|
229
|
+
- Do not overfit on guards; runtime support can observe, guide, remember, render, steer, confirm, or block.
|
|
230
|
+
- Keep skills lean. Around 500 lines is a smell; compress before adding.
|
|
231
|
+
- For new runtime-support patterns, verify current agent-reliability sources and cite them in audit notes.
|