pi-dev 0.2.5 → 0.2.7
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/package.json
CHANGED
|
@@ -7,9 +7,15 @@ description: MAINTAINER-ONLY. Analyse real pi-runtime session telemetry from any
|
|
|
7
7
|
|
|
8
8
|
**Audience: pi-dev maintainers only.** Consumers do not get this skill installed. Consumers improve their own setup by editing `docs/agents/preferences.md` (per project) or `~/.pi/agent/preferences.md` (per machine), not by editing SKILL.md bodies. The framework is fixed for them; only the maintainer changes it.
|
|
9
9
|
|
|
10
|
-
The pi-dev
|
|
10
|
+
The pi-dev workflow is not just markdown. It is a layered control system: SKILL.md prose, repo preferences, pi-runtime lifecycle events, tools, commands, TUI surfaces, session persistence, compaction hooks, and extensions. This skill reads what real sessions did across consumer repos, compares that to what the workflow contract said *should* happen, then chooses the lightest effective intervention.
|
|
11
11
|
|
|
12
|
-
The point: **never
|
|
12
|
+
The point: **never improve the framework from gut feeling, and never collapse every fix into a "guard".** Edit because session N showed phase P violated predicate Q on M occasions, then decide whether the right response is clearer prose, repo-local preferences, a runtime steer, a custom tool, a command, a TUI affordance, state persistence, compaction/context shaping, or — only when the evidence calls for it — a blocking gate.
|
|
13
|
+
|
|
14
|
+
## North star
|
|
15
|
+
|
|
16
|
+
`/do` is the autonomous engineering lifecycle orchestrator. Its job is not to perform one isolated action; its job is to carry a request through the agreed lifecycle in order — classify, scope, plan, run the right skills, satisfy each phase predicate, verify locally/live as required, apply side effects according to prefs, and finish with durable state in code / issues / preferences. The framework's product goal is to make that A→Z loop long-running, durable, low-interruption, and trustworthy across real repos.
|
|
17
|
+
|
|
18
|
+
Therefore `/improve-skill-flow` audits **lifecycle adherence**, not just individual mistakes. A session that eventually shipped after repeated user nudges, skipped verification, wrong side effects, or unordered phases is not healthy. The improvement question is: what framework / preference / runtime support would have let `/do` complete the full lifecycle autonomously and correctly the first time?
|
|
13
19
|
|
|
14
20
|
## Pre-flight (hard gate)
|
|
15
21
|
|
|
@@ -57,7 +63,7 @@ Always from inside the pi-dev checkout (see Pre-flight).
|
|
|
57
63
|
pi-runtime loads three artefact kinds the framework can ship:
|
|
58
64
|
|
|
59
65
|
- **Skill bodies** — `~/.pi/agent/skills/<name>/SKILL.md` (global) or `<repo>/.pi/skills/<name>/SKILL.md` (local). Pure prose.
|
|
60
|
-
- **
|
|
66
|
+
- **Runtime interventions** — extensions under `~/.pi/agent/extensions/<name>/` (global) or `<repo>/.pi/extensions/<name>/` (local). TypeScript modules auto-loaded via jiti. They can subscribe to lifecycle/session/agent/model/tool events, inject context in `before_agent_start`, reshape model context, observe `message_end`, intercept or modify tool calls/results, register custom tools/commands/shortcuts/flags, prompt via `ctx.ui`, render custom TUI widgets, persist state with `pi.appendEntry()`, and customize compaction/session behavior. pi-dev ships **one** extension by default — `pi-flow` — and the bar to add a second package remains high.
|
|
61
67
|
- **Preferences** — `docs/agents/preferences.md` (per repo) and `~/.pi/agent/preferences.md` (per machine). 3-layer override on prose-level decisions.
|
|
62
68
|
|
|
63
69
|
A finding lands in exactly one of:
|
|
@@ -65,7 +71,7 @@ A finding lands in exactly one of:
|
|
|
65
71
|
| scope | lands in | reaches | propagation | when to pick |
|
|
66
72
|
| --- | --- | --- | --- | --- |
|
|
67
73
|
| **framework** | this repo's `skills/<name>/SKILL.md` | every consumer after the next `npx pi-dev update` | release-please → npm publish | the SKILL.md wording is wrong; gap shows up generically; prose alone is plausibly enough |
|
|
68
|
-
| **extension** | this repo's `extensions/pi-flow/index.ts` (extend) or a new `extensions/<name>/` (rare) | every consumer after `npx pi-dev update` | release-please → npm publish | a real audit shows prose
|
|
74
|
+
| **extension** | this repo's `extensions/pi-flow/index.ts` (extend) or a new `extensions/<name>/` (rare) | every consumer after `npx pi-dev update` | release-please → npm publish | a real audit shows prose / prefs cannot reliably preserve the workflow, AND pi-runtime has an event/tool/UI/state hook that can make the desired path easier, more observable, or safer |
|
|
69
75
|
| **consumer-prefs** | the audited consumer repo's `docs/agents/preferences.md` | only that repo, on every `/do` bootstrap | regular consumer-repo commit | gap is the consumer repo's domain / paths / conventions, not the SKILL.md |
|
|
70
76
|
|
|
71
77
|
Notes:
|
|
@@ -75,12 +81,24 @@ Notes:
|
|
|
75
81
|
- A `consumer-prefs` apply touches no pi-dev files. It is committed to the consumer repo only.
|
|
76
82
|
- A single audit may produce a mix of all three. Decide scope per finding, not per audit.
|
|
77
83
|
|
|
78
|
-
**Extension scope is the most expensive option.**
|
|
84
|
+
**Extension scope is the most expensive option, but it is not synonymous with "guards".** Runtime interventions can be passive observability, progress/status UI, context injection, command shortcuts, structured tools, state checkpoints, compaction shaping, soft steers, confirmations, or hard blocks. Default to `framework` / `consumer-prefs` when prose or repo-local convention plausibly fixes the gap; reach for `extension` when the audit shows the workflow needs runtime support.
|
|
85
|
+
|
|
86
|
+
Intervention ladder, from lightest to strongest:
|
|
87
|
+
|
|
88
|
+
1. **Observe** — collect counters / emit low-noise status so the next audit can see what happened.
|
|
89
|
+
2. **Make the right path easier** — custom command, custom tool, context injection, remembered state, or TUI affordance.
|
|
90
|
+
3. **Steer** — inject a follow-up message or system/context nudge when the model is drifting but no irreversible action happened.
|
|
91
|
+
4. **Confirm** — ask the user only for genuinely risky or preference-silent operations.
|
|
92
|
+
5. **Block** — refuse a tool call only for destructive, costly, or repeatedly proven workflow violations.
|
|
79
93
|
|
|
80
|
-
|
|
81
|
-
|
|
82
|
-
- The failure
|
|
83
|
-
-
|
|
94
|
+
The bar for extension work:
|
|
95
|
+
|
|
96
|
+
- The failure is repeated or high-cost, and the expected workflow was available in-context via SKILL.md / prefs / repo docs.
|
|
97
|
+
- The runtime intervention can be described as a small event-driven mechanism over pi's lifecycle/session/agent/tool/UI/state surfaces; no hidden LLM call unless the finding is explicitly about a summarizer/evaluator extension.
|
|
98
|
+
- The success metric is measurable in the next audit: fewer user nudges, higher `/do` phase completion, fewer corrections, better issue shape, shorter stalled intervals, or clearer live evidence.
|
|
99
|
+
- A simpler remedy (docs wording, prefs, `.gitignore` lockout, issue-template fix) does **not** plausibly close the gap. If it does, take the simpler remedy.
|
|
100
|
+
|
|
101
|
+
External research posture (DECAY): before proposing a new class of runtime intervention, do a quick current-source pass. 2025–2026 agent-reliability literature emphasizes trajectory monitoring, workflow-level observability, recovery orchestration, and textual feedback from traces — not just pre-execution blocking. Use that as a check against over-fitting on "guards"; cite the sources in the audit notes when they shape the proposal.
|
|
84
102
|
|
|
85
103
|
## Session-data location & format
|
|
86
104
|
|
|
@@ -174,12 +192,17 @@ Run a single pass over every targeted `.jsonl` and tally:
|
|
|
174
192
|
- top Edit / Write targets (hotspot files)
|
|
175
193
|
- bash commands matching domain-specific danger / smell patterns
|
|
176
194
|
|
|
177
|
-
**
|
|
178
|
-
-
|
|
195
|
+
**Workflow-compliance specific:**
|
|
196
|
+
- Reconstruct `/do`'s lifecycle trajectory: classify → scope → plan → phase execution → terminal predicates → verification → side effects → durable final state. Score the ordered lifecycle, not just the final outcome.
|
|
197
|
+
- After each `<skill name="do">` injection, did another skill get injected within the same session? **This measures `/do` chain depth.** Single-step chains are a red flag, but not the whole story.
|
|
198
|
+
- Did `/do` emit `[flow plan]`? Count planned phases and observed `[flow N/M]` status lines. Flag: plan missing, N/M skipped, final summary before all phases, or terminal predicate not evidenced.
|
|
179
199
|
- Count user messages shorter than ~80 chars — these are usually nudges ("진행해", "다음은?", "끝났어?"). High proportion = `/do` is handing the flow back too often.
|
|
180
|
-
- Count user messages containing correction markers (`아니`, `wait`, `stop`, `취소`, `다시`, `그만`, `undo`, `revert
|
|
200
|
+
- Count user messages containing correction markers (`아니`, `wait`, `stop`, `취소`, `다시`, `그만`, `undo`, `revert`, `왜`, `??`, `제대로`). These mark interventions.
|
|
201
|
+
- Detect stalled intervals: long gaps between assistant text / tool calls, repeated identical commands, or repeated failed live probes beyond the repo's wait budget.
|
|
202
|
+
- Count local-live / ops-live evidence when runtime behavior changed: command run, observed output, log/screenshot/evidence embedded in summary.
|
|
203
|
+
- Count side-effect gates: commits, pushes, PRs, issue creates/edits/closes vs. merged prefs (`auto-*`) and tracker docs.
|
|
181
204
|
- Count `<!-- migrated: ... -->` marker date vs. any post-marker `docs/handoff/` / `.scratch/flow/` / `SESSION_*.md` writes. Drift = handoff lockout failed.
|
|
182
|
-
- Count tracker writes (`gh issue create`, `gh pr create`) and inspect the bodies; the bodies are plain spec consumed by future worker agents, so look for issues with shape problems (missing parent, no acceptance criteria, etc.) rather than meta tags.
|
|
205
|
+
- Count tracker writes (`gh issue create`, `gh pr create`) and inspect the bodies; the bodies are plain spec consumed by future worker agents, so look for issues with shape problems (missing parent, no acceptance criteria, wrong state labels, wrong parent topology, etc.) rather than meta tags.
|
|
183
206
|
|
|
184
207
|
Use a deterministic Python or shell script you write once and check into `/tmp` for the duration of the run. Do not eyeball big JSONLs by hand.
|
|
185
208
|
|
|
@@ -230,11 +253,15 @@ If a finding cannot be backed by an excerpt, it is not actionable yet — demote
|
|
|
230
253
|
|
|
231
254
|
For each 🔴 / 🟡 finding, pick a default scope using the heuristic below, then show the table once and let the maintainer flip individual rows before applying.
|
|
232
255
|
|
|
256
|
+
Before assigning scope, write one sentence answering: **what would have made the correct workflow easier to follow at the moment of failure?** Do not jump straight to a blocking guard.
|
|
257
|
+
|
|
233
258
|
Default to `extension` if the finding matches **any** of:
|
|
234
259
|
|
|
235
|
-
- The
|
|
236
|
-
- The
|
|
237
|
-
- The
|
|
260
|
+
- The expected behavior was already in SKILL.md / prefs / repo docs at the time of the violating session **and** the session still drifted repeatedly or expensively.
|
|
261
|
+
- The fix needs runtime affordance rather than prose: observe counters, inject context, add a command/tool, show TUI status, persist/checkpoint state, shape compaction/context, steer after a message, confirm a risky action, or block an unsafe tool call.
|
|
262
|
+
- The success condition can be measured in the next session from telemetry, not merely hoped for from stronger wording.
|
|
263
|
+
|
|
264
|
+
When defaulting to `extension`, classify the intervention type: `observe`, `affordance`, `context`, `state`, `render`, `steer`, `confirm`, or `block`.
|
|
238
265
|
|
|
239
266
|
Default to `framework` if the finding matches **any** of:
|
|
240
267
|
|
|
@@ -253,9 +280,9 @@ Present the scope-decision table:
|
|
|
253
280
|
```
|
|
254
281
|
| # | finding (short) | default scope | target file | flip? |
|
|
255
282
|
| - | ---------------------------------------- | ---------------- | ---------------------------------------------------- | ----- |
|
|
256
|
-
| 1 | /do hands flow back between phases | extension
|
|
257
|
-
| 2 |
|
|
258
|
-
| 3 |
|
|
283
|
+
| 1 | /do hands flow back between phases | extension:steer | pi-dev:extensions/pi-flow/index.ts (message_end) | |
|
|
284
|
+
| 2 | long live-smoke silence causes user nudges | extension:render/observe | pi-dev:extensions/pi-flow/index.ts or project extension | |
|
|
285
|
+
| 3 | post-marker handoff writes | framework | pi-dev:skills/migrate/SKILL.md (gitignore lockout) | |
|
|
259
286
|
| 4 | smoke command name changed in S058 | consumer-prefs | hugn:docs/agents/preferences.md | |
|
|
260
287
|
```
|
|
261
288
|
|
|
@@ -274,12 +301,13 @@ For each 🔴 / 🟡 finding, draft the smallest possible edit that, **if it had
|
|
|
274
301
|
|
|
275
302
|
**Extension findings (target: pi-dev `extensions/pi-flow/index.ts`, or rarely a new `extensions/<name>/`):**
|
|
276
303
|
|
|
277
|
-
-
|
|
278
|
-
-
|
|
279
|
-
-
|
|
280
|
-
-
|
|
281
|
-
-
|
|
282
|
-
-
|
|
304
|
+
- Start from pi's actual extension surface, not from the word "guard": lifecycle/session/agent/model/tool events, `before_agent_start` context injection, `context` shaping, `tool_call` / `tool_result` interception, registered tools, registered commands, `ctx.ui` notifications/widgets/custom UI, custom renderers, `pi.appendEntry()` state, and compaction/session hooks.
|
|
305
|
+
- Prefer extending `pi-flow` when the concern is generic workflow compliance. Create a new extension only when the concern is large enough to be independently named, toggled, installed, and audited.
|
|
306
|
+
- Pick the weakest intervention that would have changed the session outcome: observe → affordance → context/state/render → steer → confirm → block.
|
|
307
|
+
- Determinism is mandatory for `confirm` and `block`; it is desirable but not sufficient for softer interventions. A progress widget, command shortcut, or state checkpoint can be valuable even when it does not block anything.
|
|
308
|
+
- Runtime behavior must be toggleable via `~/.pi/agent/settings.json` when it can alter turns or tool execution. Default on only if the audit shows broad benefit.
|
|
309
|
+
- Keep the corresponding SKILL.md prose as the human-readable contract: one-line *why*, the expected behavior, and a pointer to the runtime support. Do not delete the why — the model still needs to know the intent when the extension is off.
|
|
310
|
+
- Update **at most one runtime mechanism per run** unless the second is pure observability. Two behavior changes at once destroys the next audit's ability to attribute movement.
|
|
283
311
|
|
|
284
312
|
**Consumer-prefs findings (target: that repo's `docs/agents/preferences.md`):**
|
|
285
313
|
|
|
@@ -363,10 +391,12 @@ audit complete — framework v<X.Y.Z> released, consumer-prefs commit <sha>, nex
|
|
|
363
391
|
## Heuristics
|
|
364
392
|
|
|
365
393
|
- **One screenful of signals beats a dashboard.** Most of the time three or four numbers tell the whole story.
|
|
366
|
-
- **
|
|
367
|
-
- **Short user messages are the cheapest interruption proxy.** Count them.
|
|
394
|
+
- **Judge `/do` by lifecycle completion, not skill injection alone.** Chain depth is useful, but the stronger metric is: classify/scope correct → plan emitted → phases run in order → terminal predicates evidenced → verification completed → side effects match prefs → durable state landed → user did not need to re-enter the chain.
|
|
395
|
+
- **Short user messages are the cheapest interruption proxy.** Count them, then inspect the preceding assistant turn to classify why the user had to nudge.
|
|
368
396
|
- **A taboo without a `.gitignore` lockout will resurrect.** If the same taboo file shows up across two audits, the fix belongs in `/migrate`, not `/do`.
|
|
369
|
-
- **
|
|
397
|
+
- **Do not overfit on guards.** The intervention may be an observability counter, status widget, custom command, context injection, state checkpoint, prompt/compaction shaping, soft steer, confirmation, or block. Pick by evidence and measured failure cost.
|
|
398
|
+
- **Trajectory beats scalar success.** A task that eventually shipped after five corrections is not healthy; use the session trace to find where workflow support was missing.
|
|
399
|
+
- **Refresh external reliability patterns when designing new runtime support.** Current agent-reliability work emphasizes trajectory monitoring, recovery orchestration, and trace-derived textual feedback; use that to challenge pi-flow designs before coding.
|
|
370
400
|
|
|
371
401
|
## Why this skill exists
|
|
372
402
|
|