pi-dev 0.2.6 → 0.2.8

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "pi-dev",
3
- "version": "0.2.6",
3
+ "version": "0.2.8",
4
4
  "description": "An autonomous engineering skill framework for the pi runtime — built on Matt Pocock's skills.",
5
5
  "type": "module",
6
6
  "bin": {
@@ -9,7 +9,13 @@ description: MAINTAINER-ONLY. Analyse real pi-runtime session telemetry from any
9
9
 
10
10
  The pi-dev workflow is not just markdown. It is a layered control system: SKILL.md prose, repo preferences, pi-runtime lifecycle events, tools, commands, TUI surfaces, session persistence, compaction hooks, and extensions. This skill reads what real sessions did across consumer repos, compares that to what the workflow contract said *should* happen, then chooses the lightest effective intervention.
11
11
 
12
- The point: **never improve the framework from gut feeling, and never collapse every fix into a "guard".** Edit because session N showed phase P violated predicate Q on M occasions, then decide whether the right response is clearer prose, repo-local preferences, a runtime steer, a custom tool, a command, a TUI affordance, state persistence, compaction/context shaping, or — only when the evidence calls for it — a blocking gate.
12
+ The point: **never improve the framework from gut feeling, never collapse every fix into a "guard", and never append diary-style history.** Edit because session N showed phase P violated predicate Q on M occasions, then decide whether the right response is clearer prose, repo-local preferences, a runtime steer, a custom tool, a command, a TUI affordance, state persistence, compaction/context shaping, or — only when the evidence calls for it — a blocking gate. Every wording change must pay rent: reduce future confusion, replace weaker text, merge duplicate rules, or delete obsolete context.
13
+
14
+ ## North star
15
+
16
+ `/do` is the autonomous engineering lifecycle orchestrator. Its job is not to perform one isolated action; its job is to carry a request through the agreed lifecycle in order — classify, scope, plan, run the right skills, satisfy each phase predicate, verify locally/live as required, apply side effects according to prefs, and finish with durable state in code / issues / preferences. The framework's product goal is to make that A→Z loop long-running, durable, low-interruption, and trustworthy across real repos.
17
+
18
+ Therefore `/improve-skill-flow` audits **lifecycle adherence**, not just individual mistakes. A session that eventually shipped after repeated user nudges, skipped verification, wrong side effects, or unordered phases is not healthy. The improvement question is: what framework / preference / runtime support would have let `/do` complete the full lifecycle autonomously and correctly the first time?
13
19
 
14
20
  ## Pre-flight (hard gate)
15
21
 
@@ -187,6 +193,7 @@ Run a single pass over every targeted `.jsonl` and tally:
187
193
  - bash commands matching domain-specific danger / smell patterns
188
194
 
189
195
  **Workflow-compliance specific:**
196
+ - Reconstruct `/do`'s lifecycle trajectory: classify → scope → plan → phase execution → terminal predicates → verification → side effects → durable final state. Score the ordered lifecycle, not just the final outcome.
190
197
  - After each `<skill name="do">` injection, did another skill get injected within the same session? **This measures `/do` chain depth.** Single-step chains are a red flag, but not the whole story.
191
198
  - Did `/do` emit `[flow plan]`? Count planned phases and observed `[flow N/M]` status lines. Flag: plan missing, N/M skipped, final summary before all phases, or terminal predicate not evidenced.
192
199
  - Count user messages shorter than ~80 chars — these are usually nudges ("진행해", "다음은?", "끝났어?"). High proportion = `/do` is handing the flow back too often.
@@ -283,11 +290,14 @@ Ask once: "OK to proceed with these scopes? Reply with row numbers to flip, or `
283
290
 
284
291
  ### 6 — Propose edits (per-finding, scoped)
285
292
 
286
- For each 🔴 / 🟡 finding, draft the smallest possible edit that, **if it had been in place at session time, would have prevented the gap.** The shape of the draft depends on the scope from Step 5.5:
293
+ For each 🔴 / 🟡 finding, draft the smallest possible edit that, **if it had been in place at session time, would have prevented the gap.** The shape of the draft depends on the scope from Step 5.5.
294
+
295
+ **Diet rule before any draft:** inspect the target section as a whole and decide whether the change should replace, merge, move, or delete existing text. Do not append a new historical note just because today's audit found a new incident. If the section would become longer without becoming clearer, rewrite the section instead. If two bullets now say the same thing, collapse them. If a rule only explains why a past mistake happened but does not constrain future behavior, delete or compress it into the active rule.
287
296
 
288
297
  **Framework findings (target: pi-dev SKILL.md):**
289
298
 
290
299
  - Edit a **rule** or a **step**, not a flavour sentence. The model must be able to detect the constraint in its own draft output.
300
+ - Prefer **replacement over accumulation**. A good edit often makes the skill shorter; adding lines is justified only when it removes a larger ambiguity.
291
301
  - Prefer **explicit anti-pattern strings** ("Do not say 'shall I continue?'") over abstract injunctions ("be decisive"). The hugn-2026-05 audit showed that named anti-patterns work.
292
302
  - Prefer **terminal markers** ("the summary's last line must be one of these two literals: …") over qualitative descriptions of "good wrap-up".
293
303
  - Update **at most three skills per run.** More than that means findings aren't anchored well enough.
@@ -384,12 +394,13 @@ audit complete — framework v<X.Y.Z> released, consumer-prefs commit <sha>, nex
384
394
  ## Heuristics
385
395
 
386
396
  - **One screenful of signals beats a dashboard.** Most of the time three or four numbers tell the whole story.
387
- - **Judge `/do` by workflow completion, not skill injection alone.** Chain depth is useful, but the stronger metric is: plan emitted → phases accounted for → terminal predicates evidenced → side effects match prefs → user did not need to re-enter the chain.
397
+ - **Judge `/do` by lifecycle completion, not skill injection alone.** Chain depth is useful, but the stronger metric is: classify/scope correct → plan emitted → phases run in order → terminal predicates evidenced → verification completed → side effects match prefs → durable state landed → user did not need to re-enter the chain.
388
398
  - **Short user messages are the cheapest interruption proxy.** Count them, then inspect the preceding assistant turn to classify why the user had to nudge.
389
399
  - **A taboo without a `.gitignore` lockout will resurrect.** If the same taboo file shows up across two audits, the fix belongs in `/migrate`, not `/do`.
390
400
  - **Do not overfit on guards.** The intervention may be an observability counter, status widget, custom command, context injection, state checkpoint, prompt/compaction shaping, soft steer, confirmation, or block. Pick by evidence and measured failure cost.
391
401
  - **Trajectory beats scalar success.** A task that eventually shipped after five corrections is not healthy; use the session trace to find where workflow support was missing.
392
402
  - **Refresh external reliability patterns when designing new runtime support.** Current agent-reliability work emphasizes trajectory monitoring, recovery orchestration, and trace-derived textual feedback; use that to challenge pi-flow designs before coding.
403
+ - **Keep the skills lean.** Do not turn audits into append-only history. Prefer section rewrites, merged bullets, and deletion of stale rationale over adding another paragraph. If a skill crosses ~500 lines, treat that as a smell and look for compression before adding more.
393
404
 
394
405
  ## Why this skill exists
395
406