npm - pi-dev - Versions diffs - 0.2.6 → 0.2.8 - Mend

pi-dev 0.2.6 → 0.2.8

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (2) hide show

package/package.json +1 -1
package/skills/improve-skill-flow/SKILL.md +14 -3

package/package.json CHANGED Viewed

@@ -1,6 +1,6 @@
 {
   "name": "pi-dev",
-  "version": "0.2.6",
+  "version": "0.2.8",
   "description": "An autonomous engineering skill framework for the pi runtime — built on Matt Pocock's skills.",
   "type": "module",
   "bin": {

package/skills/improve-skill-flow/SKILL.md CHANGED Viewed

@@ -9,7 +9,13 @@ description: MAINTAINER-ONLY. Analyse real pi-runtime session telemetry from any
 The pi-dev workflow is not just markdown. It is a layered control system: SKILL.md prose, repo preferences, pi-runtime lifecycle events, tools, commands, TUI surfaces, session persistence, compaction hooks, and extensions. This skill reads what real sessions did across consumer repos, compares that to what the workflow contract said *should* happen, then chooses the lightest effective intervention.
-The point: **never improve the framework from gut feeling, and never collapse every fix into a "guard".** Edit because session N showed phase P violated predicate Q on M occasions, then decide whether the right response is clearer prose, repo-local preferences, a runtime steer, a custom tool, a command, a TUI affordance, state persistence, compaction/context shaping, or — only when the evidence calls for it — a blocking gate.
+The point: **never improve the framework from gut feeling, never collapse every fix into a "guard", and never append diary-style history.** Edit because session N showed phase P violated predicate Q on M occasions, then decide whether the right response is clearer prose, repo-local preferences, a runtime steer, a custom tool, a command, a TUI affordance, state persistence, compaction/context shaping, or — only when the evidence calls for it — a blocking gate. Every wording change must pay rent: reduce future confusion, replace weaker text, merge duplicate rules, or delete obsolete context.
+## North star
+`/do` is the autonomous engineering lifecycle orchestrator. Its job is not to perform one isolated action; its job is to carry a request through the agreed lifecycle in order — classify, scope, plan, run the right skills, satisfy each phase predicate, verify locally/live as required, apply side effects according to prefs, and finish with durable state in code / issues / preferences. The framework's product goal is to make that A→Z loop long-running, durable, low-interruption, and trustworthy across real repos.
+Therefore `/improve-skill-flow` audits **lifecycle adherence**, not just individual mistakes. A session that eventually shipped after repeated user nudges, skipped verification, wrong side effects, or unordered phases is not healthy. The improvement question is: what framework / preference / runtime support would have let `/do` complete the full lifecycle autonomously and correctly the first time?
 ## Pre-flight (hard gate)
@@ -187,6 +193,7 @@ Run a single pass over every targeted `.jsonl` and tally:
 - bash commands matching domain-specific danger / smell patterns
 **Workflow-compliance specific:**
+- Reconstruct `/do`'s lifecycle trajectory: classify → scope → plan → phase execution → terminal predicates → verification → side effects → durable final state. Score the ordered lifecycle, not just the final outcome.
 - After each `<skill name="do">` injection, did another skill get injected within the same session? **This measures `/do` chain depth.** Single-step chains are a red flag, but not the whole story.
 - Did `/do` emit `[flow plan]`? Count planned phases and observed `[flow N/M]` status lines. Flag: plan missing, N/M skipped, final summary before all phases, or terminal predicate not evidenced.
 - Count user messages shorter than ~80 chars — these are usually nudges ("진행해", "다음은?", "끝났어?"). High proportion = `/do` is handing the flow back too often.
@@ -283,11 +290,14 @@ Ask once: "OK to proceed with these scopes? Reply with row numbers to flip, or `
 ### 6 — Propose edits (per-finding, scoped)
-For each 🔴 / 🟡 finding, draft the smallest possible edit that, **if it had been in place at session time, would have prevented the gap.** The shape of the draft depends on the scope from Step 5.5:
+For each 🔴 / 🟡 finding, draft the smallest possible edit that, **if it had been in place at session time, would have prevented the gap.** The shape of the draft depends on the scope from Step 5.5.
+**Diet rule before any draft:** inspect the target section as a whole and decide whether the change should replace, merge, move, or delete existing text. Do not append a new historical note just because today's audit found a new incident. If the section would become longer without becoming clearer, rewrite the section instead. If two bullets now say the same thing, collapse them. If a rule only explains why a past mistake happened but does not constrain future behavior, delete or compress it into the active rule.
 **Framework findings (target: pi-dev SKILL.md):**
 - Edit a **rule** or a **step**, not a flavour sentence. The model must be able to detect the constraint in its own draft output.
+- Prefer **replacement over accumulation**. A good edit often makes the skill shorter; adding lines is justified only when it removes a larger ambiguity.
 - Prefer **explicit anti-pattern strings** ("Do not say 'shall I continue?'") over abstract injunctions ("be decisive"). The hugn-2026-05 audit showed that named anti-patterns work.
 - Prefer **terminal markers** ("the summary's last line must be one of these two literals: …") over qualitative descriptions of "good wrap-up".
 - Update **at most three skills per run.** More than that means findings aren't anchored well enough.
@@ -384,12 +394,13 @@ audit complete — framework v<X.Y.Z> released, consumer-prefs commit <sha>, nex
 ## Heuristics
 - **One screenful of signals beats a dashboard.** Most of the time three or four numbers tell the whole story.
-- **Judge `/do` by workflow completion, not skill injection alone.** Chain depth is useful, but the stronger metric is: plan emitted → phases accounted for → terminal predicates evidenced → side effects match prefs → user did not need to re-enter the chain.
+- **Judge `/do` by lifecycle completion, not skill injection alone.** Chain depth is useful, but the stronger metric is: classify/scope correct → plan emitted → phases run in order → terminal predicates evidenced → verification completed → side effects match prefs → durable state landed → user did not need to re-enter the chain.
 - **Short user messages are the cheapest interruption proxy.** Count them, then inspect the preceding assistant turn to classify why the user had to nudge.
 - **A taboo without a `.gitignore` lockout will resurrect.** If the same taboo file shows up across two audits, the fix belongs in `/migrate`, not `/do`.
 - **Do not overfit on guards.** The intervention may be an observability counter, status widget, custom command, context injection, state checkpoint, prompt/compaction shaping, soft steer, confirmation, or block. Pick by evidence and measured failure cost.
 - **Trajectory beats scalar success.** A task that eventually shipped after five corrections is not healthy; use the session trace to find where workflow support was missing.
 - **Refresh external reliability patterns when designing new runtime support.** Current agent-reliability work emphasizes trajectory monitoring, recovery orchestration, and trace-derived textual feedback; use that to challenge pi-flow designs before coding.
+- **Keep the skills lean.** Do not turn audits into append-only history. Prefer section rewrites, merged bullets, and deletion of stale rationale over adding another paragraph. If a skill crosses ~500 lines, treat that as a smell and look for compression before adding more.
 ## Why this skill exists