@tianhai/pi-workflow-kit 0.15.0 → 0.17.1

Files changed (20)

package/README.md CHANGED

@@ -1,6 +1,6 @@
 # pi-workflow-kit
-> Stop AI agents from rushing to code. Enforce a structured brainstorm→plan→execute→finalize workflow with TDD discipline.
+> Stop AI agents from rushing to code. Enforce a structured brainstorm→plan→execute→verify→finalize workflow with TDD discipline.
 AI coding agents tend to skip design and jump straight into implementation, producing over-engineered or misaligned code. **pi-workflow-kit** solves this by hard-blocking write operations during brainstorm and planning phases — the agent *literally cannot modify your source files* until you approve the design.
@@ -33,21 +33,21 @@ Enforces phase-appropriate tool access — not just guidelines, but hard blocks:
 The agent can read code and discuss design with you during brainstorm/plan, but it physically cannot modify source files or run mutating commands.
-### 🧠 5 Workflow Skills
+### 🧠 7 Workflow Skills
 Guide the agent through a disciplined development process:
-```
-brainstorm → plan → execute → finalize
-              ↕
-           diagnose (anytime)
-```
+brainstorm → design-review → plan → execute → verify → finalize
+                                           ↕
+                                        diagnose (anytime)
 | Phase | Trigger | What Happens |
 |-------|---------|--------------|
 | **Brainstorm** | `/skill:brainstorming` | Explore approaches, debate tradeoffs, produce a design doc |
-| **Plan** | `/skill:writing-plans` | Break design into bite-sized TDD tasks with file paths and acceptance criteria |
+| **Design Review** | `/skill:design-review` | Audit design for production risks (security, scalability, fault tolerance) |
+| **Plan** | `/skill:writing-plans` | Break design into bite-sized TDD tasks with acceptance criteria and concrete code |
 | **Execute** | `/skill:executing-tasks` | Implement tasks one-by-one with TDD discipline and pre-commit checkpoint review gates |
+| **Verify** | `/skill:verify` | Three expert review passes (security, optimization, traceability) on implemented code |
 | **Finalize** | `/skill:finalizing` | Archive plan docs, update README/CHANGELOG, create PR |
 | **Diagnose** | `/skill:diagnose` | 6-phase debugging loop: reproduce → hypothesize → instrument → fix → verify |
@@ -57,12 +57,12 @@ brainstorm → plan → execute → finalize
 You control each phase — the agent never advances on its own. Invoke a skill to move forward:
-```
 /skill:brainstorming   →  discuss and design
+/skill:design-review   →  audit for production risks (non-trivial designs)
 /skill:writing-plans   →  break into tasks
 /skill:executing-tasks →  implement with TDD
+/skill:verify          →  review code for security, optimization, and traceability issues
 /skill:finalizing      →  ship it
-```
 ### TDD Three-Scenario Model
@@ -116,15 +116,23 @@ pi install npm:@tianhai/pi-workflow-kit
 # (agent explores approaches, writes design doc)
 # (write/edit are blocked — your code is safe)
+> /skill:design-review
+# (agent audits for security, scalability, fault tolerance)
+# (trivial changes can skip this step)
 > /skill:writing-plans
-# (agent breaks design into TDD tasks)
+# (agent breaks design into TDD tasks with acceptance criteria)
 > /skill:executing-tasks
-# (agent implements with TDD, all tools unlocked)
+# (agent implements with TDD, cognitive persona shifts, all tools unlocked)
+> /skill:verify
+# (agent runs security, optimization, and traceability reviews on implemented code)
 > /skill:finalizing
-# (agent archives docs, updates changelog, creates PR)
+# (agent archives docs, curates lessons, creates PR)
 ```
 ## Why?
@@ -142,8 +150,10 @@ pi-workflow-kit/
 │   └── workflow-guard.ts      # Write blocker during brainstorm/plan
 ├── skills/
 │   ├── brainstorming/SKILL.md
+│   ├── design-review/SKILL.md
 │   ├── writing-plans/SKILL.md
 │   ├── executing-tasks/SKILL.md
+│   ├── verify/SKILL.md
 │   ├── finalizing/SKILL.md
 │   └── diagnose/SKILL.md
 ├── tests/
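
The hard block described in this README can be pictured as a per-phase check on tool calls. The sketch below is illustrative only and is not taken from the package: the `Phase` type, `isToolAllowed`, and the `read` tool name are hypothetical, and the real `workflow-guard.ts` may work differently. Only the `write`/`edit` tool names and the brainstorm/plan/execute phases come from the README text.

```typescript
// Hypothetical sketch: not the real workflow-guard.ts API.
type Phase = "brainstorm" | "plan" | "execute";

// Tools that mutate the workspace; the README names write/edit as blocked.
const MUTATING_TOOLS: ReadonlySet<string> = new Set(["write", "edit"]);

// Phases in which mutating tools are refused.
const READ_ONLY_PHASES: ReadonlySet<Phase> = new Set<Phase>(["brainstorm", "plan"]);

/** Returns true when the tool may run in the given phase. */
function isToolAllowed(phase: Phase, tool: string): boolean {
  if (READ_ONLY_PHASES.has(phase) && MUTATING_TOOLS.has(tool)) {
    return false; // hard block: the call is rejected, not merely discouraged
  }
  return true;
}
```

Under these assumptions, `isToolAllowed("plan", "edit")` is refused while the same call in the `execute` phase goes through.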

package/docs/plans/2026-06-03-karpathy-guidelines-ab-comparison.md ADDED

@@ -0,0 +1,166 @@
+# A/B Comparison: Writing Plans — Karpathy Behavioral Guidelines
+## Setup
+- **Same design doc** (bookmarks: CRUD + search)
+- **Same Go project scaffold**
+- **Same prompt** (no questions, full plan with concrete code)
+- **Variant A** (WITHOUT guidelines): 292-line SKILL.md — original writing-plans skill
+- **Variant B** (WITH guidelines): 354-line SKILL.md — with Behavioral Guidelines section appended
+---
+## Structural Comparison
+| Dimension | A (Without) | B (With) |
+|---|---|---|
+| **Total tasks** | 4 | 6 |
+| **Lines in plan** | ~1,054 | ~1,019 |
+| **Files touched per task** | Up to 7 (Task 1) | 2-4 |
+| **External dependency** | None (stdlib only) | `github.com/google/uuid` |
+---
+## Task Decomposition
+### A (Without) — 4 tasks
+| Task | Scope | Files touched |
+|---|---|---|
+| 1 | Bookmark + ALL infrastructure (model, store interface, mem store with full CRUD, service, handler, errors, route, tests) | 7 files |
+| 2 | Delete bookmark | 3 files |
+| 3 | List bookmarks (paginated, cursor) | 3 files |
+| 4 | Search bookmarks (keyword + pagination) | 3 files |
+### B (With) — 6 tasks
+| Task | Scope | Files touched |
+|---|---|---|
+| 1 | Scaffold (go.mod + model only) | 2 files |
+| 2 | Bookmark a message (store + handler + test + route) | 4 files |
+| 3 | List bookmarks (offset/limit pagination) | 4 files |
+| 4 | Remove a bookmark | 4 files |
+| 5 | Search bookmarks (keyword) | 4 files |
+| 6 | Final wiring + integration lifecycle test | 2 files |
+---
+## Detailed Analysis by Guideline
+### Simplicity First
+**A (Without):** ⚠️ **Overbuilt in Task 1.** Task 1 creates a `BookmarkStore` interface with 4 methods (Create, Delete, ListByUser, SearchByUser) — methods that won't be used until Tasks 2-4. It also creates the full `MemoryStore` implementation with all 4 methods, an `errors.go` file, a `Service` struct, AND the handler — all in a single task. The store interface is the full contract upfront before any task exercises most of it.
+**B (With):** ✅ **Minimal per task.** Task 1 only creates `go.mod` + the `Bookmark` struct. Task 2 introduces `Store` with only `Create`, and `MemStore` with only `Create`. `List` is added to the interface in Task 3, `Delete` in Task 4, `Search` in Task 5 — each method appears when it's needed, not before.
+**Verdict:** Guidelines had a clear positive effect. Plan B builds only what each task needs.
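
The incremental interface growth credited to Plan B can be sketched as follows. This is a TypeScript analogue written for illustration, not the Go code from either plan: the `Bookmark` fields, the `StoreAfterTaskN` names, and the method signatures are all assumptions.

```typescript
// Hypothetical TypeScript analogue of Plan B's Go store; names are illustrative.
interface Bookmark {
  id: string;
  userId: string;
  messageId: string;
}

// Task 2: the interface starts with the one method that task exercises.
interface StoreAfterTask2 {
  create(b: Bookmark): void;
}

// Task 3: list is added only when the list task needs it.
interface StoreAfterTask3 extends StoreAfterTask2 {
  list(userId: string): Bookmark[];
}

// Task 4: delete arrives with the remove-bookmark task.
interface StoreAfterTask4 extends StoreAfterTask3 {
  delete(userId: string, id: string): boolean;
}

// In-memory implementation as it would look after Task 4.
class MemStore implements StoreAfterTask4 {
  private items: Bookmark[] = [];

  create(b: Bookmark): void {
    this.items.push(b);
  }

  list(userId: string): Bookmark[] {
    return this.items.filter((b) => b.userId === userId);
  }

  // Returns true when a bookmark owned by userId was removed.
  delete(userId: string, id: string): boolean {
    const before = this.items.length;
    this.items = this.items.filter((b) => !(b.userId === userId && b.id === id));
    return this.items.length < before;
  }
}
```

Plan A's approach corresponds to declaring the final interface, with every method, in the first task before any test exercises most of it.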
+### Surgical Changes
+**A (Without):** ⚠️ Task 1 touches 7 files in one go (model, store interface, in-memory store, errors, service, handler, main.go). The Task 1 description says "create the full vertical slice", which bundles infrastructure that isn't tested yet.
+**B (With):** ✅ Each task touches 2-4 files. Task 1 creates 2 files (go.mod, model.go). Task 2 adds 3 new files and modifies main.go. No task touches more than 4 files.
+**Verdict:** Guidelines had a clear positive effect. Plan B has tighter blast radius per task.
+### Think Before Coding (surface assumptions)
+**A (Without):** ❌ Silent assumptions throughout:
+- Used cursor-based pagination without noting the design just said "paginated" — didn't surface that offset-based vs cursor-based is a choice
+- Added `sync.RWMutex` and concurrency safety without the design mentioning concurrency
+- Created a `Service` layer between handler and store without justification
+**B (With):** ⚠️ Still has assumptions but more defensible:
+- Used offset/limit pagination (simpler, matches "paginated" literally)
+- Lighter concurrency handling (store uses a plain `sync.Mutex` rather than `sync.RWMutex`)
+- No `Service` layer — handler calls store directly
+- Did add `github.com/google/uuid` dependency without asking — minor assumption
+**Verdict:** Marginal positive effect. Plan B is less presumptuous but both plans made assumptions. Neither explicitly surfaced tradeoffs to the user.
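
The pagination choice that neither plan surfaced can be made concrete with a small comparison. The functions below are an illustrative TypeScript sketch, not code from Plan A or Plan B; `pageByOffset`, `pageByCursor`, and the `Item` shape are hypothetical, and the cursor variant assumes items are sorted by ascending `id`.

```typescript
// Illustrative only: neither function is taken from Plan A or Plan B.
interface Item {
  id: number;
}

/** Offset/limit: skip `offset` items, then return up to `limit` items. */
function pageByOffset<T>(items: readonly T[], offset: number, limit: number): T[] {
  return items.slice(offset, offset + limit);
}

/**
 * Cursor-based: return up to `limit` items whose id is greater than `afterId`,
 * plus the cursor for the next page (null when no further items exist).
 * Assumes `items` is sorted by ascending id.
 */
function pageByCursor<T extends Item>(
  items: readonly T[],
  afterId: number | null,
  limit: number,
): { page: T[]; nextCursor: number | null } {
  const start = afterId === null ? 0 : items.findIndex((it) => it.id > afterId);
  if (start === -1) {
    return { page: [], nextCursor: null };
  }
  const page = items.slice(start, start + limit);
  const last = page[page.length - 1];
  const hasMore = start + limit < items.length;
  return { page, nextCursor: hasMore && last !== undefined ? last.id : null };
}
```

Offset/limit is shorter, which matches the "simpler" label above; the cursor variant stays stable when earlier rows are inserted or deleted between requests, at the cost of tracking a cursor.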
+### Goal-Driven Execution
+**A (Without):** ✅ Good acceptance criteria with Given/When/Then. Has a `checkpoint: test` on 3/4 tasks and `checkpoint: done` on the last task.
+**B (With):** ✅ Good acceptance criteria. Has `checkpoint: test` on 3 tasks, `checkpoint: done` on 1, and no checkpoint on 2 simpler tasks. Added a full lifecycle integration test in Task 6 that wasn't in A.
+**Verdict:** Roughly equivalent. Both plans have strong acceptance criteria (required by the base skill). The lifecycle test in B is a nice bonus that catches integration issues.
+---
+## Unrelated Observations (noise, not guidelines)
+| Observation | A (Without) | B (With) |
+|---|---|---|
+| Pagination style | Cursor-based (more complex) | Offset-based (simpler) |
+| External deps | None | `google/uuid` |
+| Handler method naming | `Create`, `Delete`, `List`, `Search` | `CreateBookmark`, `DeleteBookmark`, `ListBookmarks`, `SearchBookmarks` |
+| Test structure | Single `TestXxx` with `t.Run` subtests | Separate top-level test functions |
+| `make([]T, 0, len)` usage | Yes (mem store candidates) | Yes (list handler, search handler) |
+---
+## Overall Assessment
+| Guideline | Effect | Evidence |
+|---|---|---|
+| **Simplicity First** | ✅ Strong positive | B builds incrementally; A front-loads the full store interface |
+| **Surgical Changes** | ✅ Positive | B touches fewer files per task (1-4 vs 7 in Task 1) |
+| **Think Before Coding** | ⚠️ Marginal | B made fewer silent assumptions but neither surfaced tradeoffs explicitly |
+| **Goal-Driven Execution** | ≈ Neutral | Both strong; base skill already enforces acceptance criteria |
+**Bottom line (iteration 1):** The guidelines measurably improved the plan. The biggest win is **Simplicity First** — Plan B's incremental interface growth (adding methods to `Store` as each task needs them) is clearly better than Plan A's upfront full-contract approach. This is exactly the kind of thing "no abstractions for single-use code" catches.
+**Weakness:** Neither plan explicitly called out assumptions or asked clarifying questions — the "Think Before Coding" guideline had the weakest signal. The guidelines alone may not be enough to overcome the model's tendency to fill gaps silently.
+---
+## Iteration 2: Revised Guidelines
+### What changed
+The guidelines were reworked from 4 generic coding rules to 3 planning-specific principles:
+| v1 (Generic) | v2 (Planning-Specific) | Why |
+|---|---|---|
+| Think Before Coding | **Surface Assumptions** | v1 said "ask" — the agent ignores this when told not to ask. v2 says "annotate in the plan" with a concrete `> **Assumption:** ...` format and examples of what to annotate. |
+| Simplicity First | **Build Only What Each Task Needs** | Kept the same core principle but added the specific anti-pattern from the v1 A/B test: "don't define interface methods that no task exercises yet." |
+| Surgical Changes | **One Task, One Change** | Reframed from "don't touch adjacent code" to "each task should trace to exactly one user-facing behavior" with a concrete guardrail (max 4 new files). |
+| Goal-Driven Execution | *(removed)* | Redundant — the base skill already enforces Given/When/Then acceptance criteria. |
+### Iteration 2 Plan (v2 guidelines) vs Iteration 1 Plans
+| Dimension | A (No guidelines) | B1 (v1 guidelines) | B2 (v2 guidelines) |
+|---|---|---|---|
+| **Total tasks** | 4 | 6 | 4 |
+| **Max files/task** | 7 (Task 1) | 4 | 4 |
+| **Assumptions annotated** | 0 | 0 | **4** (listed below) |
+| **External deps** | None | `google/uuid` | None |
+| **Store interface** | 4 methods upfront in Task 1 | 1 method per task | 1 method per task |
+| **Service layer** | Yes (unjustified) | No | No |
+### The big win: Surface Assumptions
+Plan B2 opens with four explicit assumption annotations:
+```
+> **Assumption:** User identification via X-User-ID request header since
+> no auth system exists in the project.
+> **Assumption:** Bookmarks include a Note field so users can annotate
+> bookmarks. The design says "search by keyword" but doesn't specify
+> the field.
+> **Assumption:** Offset/limit pagination (not cursor-based).
+> **Assumption:** In-memory store behind a Store interface.
+```
+None of the previous plans (A or B1) did this. The v1 "Think Before Coding" guideline was completely invisible in output. The v2 "Surface Assumptions" guideline produced visible, reviewable annotations on the first run.
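
Because the annotations follow a fixed `> **Assumption:** ...` format, they can be counted mechanically when comparing plans. The helper below is a minimal sketch under that assumption; `extractAssumptions` is a hypothetical name and is not part of the package.

```typescript
/**
 * Collects `> **Assumption:** ...` annotations from plan text.
 * Quoted continuation lines (starting with ">" but without the marker)
 * are joined onto the annotation they follow.
 */
function extractAssumptions(plan: string): string[] {
  const marker = "> **Assumption:**";
  const found: string[] = [];
  let open = false; // true while the previous line belonged to an annotation

  for (const raw of plan.split("\n")) {
    const line = raw.trim();
    if (line.startsWith(marker)) {
      found.push(line.slice(marker.length).trim());
      open = true;
    } else if (open && line.startsWith(">")) {
      const last = found.pop() ?? "";
      found.push(`${last} ${line.slice(1).trim()}`);
    } else {
      open = false;
    }
  }
  return found;
}
```

Run against the four annotations quoted above, a helper like this would return four entries, while Plans A and B1 would yield none.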
+### Iteration 2 Assessment
+| Guideline | v1 Effect | v2 Effect | Improvement |
+|---|---|---|---|
+| **Surface Assumptions** (was Think Before Coding) | ⚠️ Invisible | ✅ 4 explicit annotations | Complete turnaround — concrete format + examples fixed the weakest signal |
+| **Build Only What's Needed** (was Simplicity First) | ✅ Strong | ✅ Strong | Maintained — interface still grows incrementally |
+| **One Task, One Change** (was Surgical Changes) | ✅ Positive | ✅ Positive | Maintained — max 4 files/task |
+**Bottom line (iteration 2):** The v2 guidelines fixed the weakest signal from v1. "Surface Assumptions" went from invisible to producing 4 explicit, reviewable annotations. The other two principles maintained their positive effect. The removal of "Goal-Driven Execution" (redundant) reduced noise without losing signal.

package/docs/plans/completed/2026-05-22-agentic-agile-enhancements-design.md ADDED

@@ -0,0 +1,77 @@
+# Design: Agentic Agile & Architectural Rigor Enhancements
+Enforcing rigorous Agile engineering discipline within `pi-workflow-kit` by introducing Behavioral Acceptance Criteria, Cognitive Persona Shifts, automated Lessons Curation, strict Multi-Pillar Architectural Reviews, and High-Risk Operation Safeguards.
+## Context & Objectives
+Based on industry standards and modern agentic development templates (such as Microsoft's Agentic Agile model), autonomous coding agents succeed most when operating under tight behavioral boundaries, specialized cognitive roles, and continuous retro/learning loops.
+We are enhancing `pi-workflow-kit` by mapping out distinct engineering "Hats" and rigorous check-gates directly into our existing phase-based skills without adding repository clutter or introducing flaky external file lookups:
+1. **The QA Engineer Hat** (in `writing-plans`): Defines rigid, testable `Given/When/Then` Acceptance Criteria for both happy and edge paths during planning.
+2. **The Pragmatic Developer & Senior Refactorer Hats** (in `executing-tasks`): Guides the execution loop through clear cognitive phases (Green Light → Polish / Software Craftsmanship).
+3. **The Agile Scrum Master Hat** (in `finalizing`): Cleans up, de-duplicates, and categorizes persistent lessons to prevent context-bloat and maximize the utility of future sprints.
+4. **Architectural Review & Audit Gates**: Formally audits both the design (brainstorming) and the plan (writing-plans) against the 6 core pillars of production-grade software (Robustness, Atomicity, Security, Scalability, Compatibility, and Testability) before allowing the agent to move forward.
+5. **High-Risk Operation Safeguards**: Auto-detects critical execution hazards (unbounded Redis scans, in-memory OOM loops, unthrottled concurrency, long-running transactions, etc.) and mandates strict mitigation steps and verification checkpoints.
+---
+## Architecture & Detailed Design
+Because agents run tools and read files relative to the user's project directory, external files bundled in globally installed npm modules are not reliably reachable. Therefore, all guidelines are **inlined directly within the respective `SKILL.md` prompts**. This keeps the guidelines available wherever the package is installed, adds no files to the user's repository, and requires no extra file lookups at runtime.
+### Slice 1: Multi-Pillar Design Review & Risk Detection (`brainstorming`)
+Before concluding a brainstorm and generating a design doc, the agent must put on its **Architect Hat** and evaluate the proposed system against the **6 Pillars of Production-Grade Design**:
+1. **Robustness & Fault Tolerance**: How expected failures are handled, subsystem isolation, and graceful degradation.
+2. **Atomicity & Consistency**: Database transactions, state rollback on error, and endpoint idempotency.
+3. **Security & Access Control**: Input validation/sanitization and authorization checks at the boundary.
+4. **Scalability & Performance**: Connection pooling, closing resource leaks, and preventing N+1 queries.
+5. **Backwards Compatibility**: Schema migration safety, zero-downtime deployment, and API versioning.
+6. **Testability**: Injection seams for external dependencies (APIs, system clocks, randomizers) to keep tests 100% deterministic.
+#### ⚠️ High-Risk Hazard Auditing
+The agent must proactively audit the design for the **8 High-Risk Production Hazards**:
+1. **Unbounded Redis Deletions / Operations**: Multi-key deletions or scans (e.g. `KEYS` or raw `SCAN` loops) that block Redis's single-threaded command processing.
+2. **In-Memory OOM Loops**: Fetching complete database datasets into server memory (e.g., raw `SELECT *`) to filter, sort, or map them on the runtime heap.
+3. **Unbounded Concurrency Spikes**: Running concurrent network requests (e.g. unthrottled `Promise.all`) without strict batch limits (e.g., `p-limit`).
+4. **Missing High-Frequency Indexes**: Running queries on unindexed columns, forcing expensive table-scans under load.
+5. **Nested/Long-Running Transactions**: Holding database connections and locks open while awaiting slow external HTTP, disk, or cryptographic tasks.
+6. **Unrestricted Uploads & Temp Flooding**: Writing uploaded data directly to local temporary paths without validation limits or explicit `finally` cleanup blocks.
+7. **Raw Query String Interpolation**: Merging raw variables into SQL queries or shell command inputs (susceptible to injection).
+8. **Silent Exception-Swallowing Loops**: Background workers or cron tasks silently catching and suppressing exceptions without logging, back-offs, or alerts.
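
Hazard 3 names `p-limit` as one mitigation for unthrottled `Promise.all`. The sketch below shows the same idea without an external dependency, as a minimal illustration rather than the package's prescribed fix; `mapWithLimit` is a hypothetical helper name.

```typescript
/**
 * Runs `worker` over `items` with at most `limit` calls in flight at once.
 * Results are returned in input order.
 */
async function mapWithLimit<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results: R[] = new Array<R>(items.length);
  let next = 0; // index of the next unclaimed item

  // Each runner claims one index at a time until none remain, so the
  // number of runners is the upper bound on concurrent worker calls.
  const runner = async (): Promise<void> => {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i] as T, i);
    }
  };

  const runners = Array.from({ length: Math.min(limit, items.length) }, runner);
  await Promise.all(runners);
  return results;
}
```

Replacing `Promise.all(urls.map(fetchOne))` with `mapWithLimit(urls, 5, fetchOne)` (where `fetchOne` is whatever per-item request the task performs) caps the spike at five concurrent requests.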
+#### 🔍 Discovering Unknown & Contextual Risks (Socratic Heuristics)
+To identify novel or domain-specific risks that fall outside the standard checklist, the agent must put on its **SRE Hat** and audit the proposed logic against the **3 Socratic Heuristics**:
+* **The "Scale to 100x" Heuristic (Resource Exhaustion)**: If this operation is run 100x/sec or on 100k items, what breaks? (Memory, CPU, Disk I/O, sockets, database connection limits).
+* **The "Hostile World" Heuristic (Security & Malice)**: If a malicious actor has complete control over these inputs (headers, payloads, IDs), how can they exploit, crash, or extract data?
+* **The "Silent Error" Heuristic (Observability & Partitioning)**: If this downstream dependency or query hangs or fails silently, how does our server react? Is there a timeout, a back-off, or logging?
+If any of the standard hazards or Socratic risks are identified, the design document **must** include a dedicated `⚠️ High-Risk Operations & Mitigations` section detailing the exact safety protocols applied.
+### Slice 2: Behavioral Acceptance Criteria & Plan Audit (`writing-plans`)
+The planning process is enhanced to mandate behavior-driven specifications and an automated plan verification step.
+- **Role**: QA Engineer Hat.
+- **Specification Format**: Mandatory `Given/When/Then` blocks covering the Happy Path and Edge/Error Paths.
+- **Plan Acceptance Audit**: Before presenting the plan to the user, the agent must verify:
+  - Every task is a complete vertical slice.
+  - Sizing is correct (no monolithic tasks).
+  - Checkpoint gates are placed on the most critical/risky tasks.
+  - **Risk Enforcement**: Any task containing any of the **8 High-Risk Hazards** or **Socratic Heuristics risks** is strictly required to have a mandatory `checkpoint: done` gate and explicit verification guidelines.
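
The Risk Enforcement rule is mechanical enough to express as a check. The sketch below assumes a hypothetical in-memory task shape (`PlanTask`, with a `"none"` checkpoint value added for illustration); the actual plans are Markdown documents, so this only illustrates the rule and not how the skill applies it.

```typescript
// Hypothetical task shape; real plans are Markdown, not objects like this.
type Checkpoint = "none" | "test" | "done";

interface PlanTask {
  name: string;
  checkpoint: Checkpoint;
  risks: string[]; // hazards or Socratic-heuristic risks flagged for this task
}

/** Returns the names of tasks that carry a flagged risk but lack `checkpoint: done`. */
function findUngatedRiskyTasks(tasks: readonly PlanTask[]): string[] {
  return tasks
    .filter((t) => t.risks.length > 0 && t.checkpoint !== "done")
    .map((t) => t.name);
}
```

An empty result means every risky task is gated; any returned name points at a task that the audit would send back for a `checkpoint: done` gate.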
+### Slice 3: Cognitive Persona Shifts (`executing-tasks`)
+The implementation execution loop is updated to divide the cognitive workload of a single task into three distinct phases.
+- **Phase 1: QA Test Phase**: Translate the Given/When/Then specs into failing test cases.
+- **Phase 2: Pragmatic Developer Phase**: Implement the simplest code that turns the tests green.
+- **Phase 3: Senior Refactoring Phase**: Refactor and polish using software craftsmanship principles (Shallow Modules, Deletion Test, Duplication, Seam Discipline).
+### Slice 4: Lessons Curation & Caching (`finalizing`)
+The finalizing phase is upgraded to run a structured retrospective on our persistent learning files.
+- **Role**: Agile Scrum Master Hat.
+- **Curating Rules**: De-duplicate, validate against the Generalization Test, and categorize rules under distinct headers (e.g., `# Tool Usage`, `# Testing Patterns`, `# Architecture Rules`).
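
The de-duplicate and categorize steps of the curating rules can be sketched as below. This is an assumption-laden illustration: the `Lesson` shape and `curateLessons` are hypothetical, duplicates are detected by a simple case-insensitive text match, and the Generalization Test is not modeled because the design does not define it here.

```typescript
// Hypothetical lesson shape; the real lessons file format is not shown in this diff.
interface Lesson {
  category: string; // e.g. "Tool Usage", "Testing Patterns"
  text: string;
}

/** Drops duplicate lessons and renders the rest grouped under `# Category` headers. */
function curateLessons(lessons: readonly Lesson[]): string {
  const byCategory = new Map<string, string[]>();
  const seen = new Set<string>();

  for (const { category, text } of lessons) {
    const key = text.trim().toLowerCase(); // naive duplicate key
    if (seen.has(key)) continue;
    seen.add(key);
    const bucket = byCategory.get(category) ?? [];
    bucket.push(text.trim());
    byCategory.set(category, bucket);
  }

  const sections: string[] = [];
  byCategory.forEach((texts, category) => {
    sections.push(`# ${category}\n` + texts.map((t) => `- ${t}`).join("\n"));
  });
  return sections.join("\n\n");
}
```

Categories appear in the order they are first seen, so the output stays stable across repeated curation runs on the same input.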
+---
+## Verification & Testing Plan
+- **Manual Verification**: Run mock `/skill:writing-plans` and `/skill:executing-tasks` sessions to verify that the generated implementation plan matches our QA template and that the task-running agent moves through the three cognitive hats in order.
+- **Automated Tests**: Confirm existing Vitest suites run successfully without side-effects.