npm - @tianhai/pi-workflow-kit - Versions diffs - 0.16.0 → 0.17.1 - Mend

@tianhai/pi-workflow-kit 0.16.0 → 0.17.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (9)

package/README.md +11 -9
package/docs/plans/2026-06-03-karpathy-guidelines-ab-comparison.md +166 -0
package/docs/plans/completed/2026-06-03-add-verify-skill-design.md +51 -0
package/docs/plans/completed/2026-06-03-add-verify-skill-implementation.md +111 -0
package/docs/plans/completed/2026-06-03-add-verify-skill-progress.md +11 -0
package/docs/plans/completed/2026-06-03-verify-skill-design.md +176 -0
package/package.json +1 -1
package/skills/verify/SKILL.md +170 -0
package/skills/writing-plans/SKILL.md +51 -0

package/README.md CHANGED

@@ -1,6 +1,6 @@
 # pi-workflow-kit
-> Stop AI agents from rushing to code. Enforce a structured brainstorm→plan→execute→finalize workflow with TDD discipline.
+> Stop AI agents from rushing to code. Enforce a structured brainstorm→plan→execute→verify→finalize workflow with TDD discipline.
 AI coding agents tend to skip design and jump straight into implementation, producing over-engineered or misaligned code. **pi-workflow-kit** solves this by hard-blocking write operations during brainstorm and planning phases — the agent *literally cannot modify your source files* until you approve the design.
@@ -33,15 +33,13 @@ Enforces phase-appropriate tool access — not just guidelines, but hard blocks:
 The agent can read code and discuss design with you during brainstorm/plan, but it physically cannot modify source files or run mutating commands.
-### 🧠 6 Workflow Skills
+### 🧠 7 Workflow Skills
 Guide the agent through a disciplined development process:
-```
-brainstorm → design-review → plan → execute → finalize
-                                    ↕
-                                 diagnose (anytime)
-```
+brainstorm → design-review → plan → execute → verify → finalize
+                                           ↕
+                                        diagnose (anytime)
 | Phase | Trigger | What Happens |
 |-------|---------|--------------|
@@ -49,6 +47,7 @@ brainstorm → design-review → plan → execute → finalize
 | **Design Review** | `/skill:design-review` | Audit design for production risks (security, scalability, fault tolerance) |
 | **Plan** | `/skill:writing-plans` | Break design into bite-sized TDD tasks with acceptance criteria and concrete code |
 | **Execute** | `/skill:executing-tasks` | Implement tasks one-by-one with TDD discipline and pre-commit checkpoint review gates |
+| **Verify** | `/skill:verify` | Three expert review passes (security, optimization, traceability) on implemented code |
 | **Finalize** | `/skill:finalizing` | Archive plan docs, update README/CHANGELOG, create PR |
 | **Diagnose** | `/skill:diagnose` | 6-phase debugging loop: reproduce → hypothesize → instrument → fix → verify |
@@ -58,13 +57,12 @@ brainstorm → design-review → plan → execute → finalize
 You control each phase — the agent never advances on its own. Invoke a skill to move forward:
-```
 /skill:brainstorming   →  discuss and design
 /skill:design-review   →  audit for production risks (non-trivial designs)
 /skill:writing-plans   →  break into tasks
 /skill:executing-tasks →  implement with TDD
+/skill:verify          →  review code for security, optimization, and traceability issues
 /skill:finalizing      →  ship it
-```
 ### TDD Three-Scenario Model
@@ -129,6 +127,9 @@ pi install npm:@tianhai/pi-workflow-kit
 > /skill:executing-tasks
 # (agent implements with TDD, cognitive persona shifts, all tools unlocked)
+> /skill:verify
+# (agent runs security, optimization, and traceability reviews on implemented code)
 > /skill:finalizing
 # (agent archives docs, curates lessons, creates PR)
@@ -152,6 +153,7 @@ pi-workflow-kit/
 │   ├── design-review/SKILL.md
 │   ├── writing-plans/SKILL.md
 │   ├── executing-tasks/SKILL.md
+│   ├── verify/SKILL.md
 │   ├── finalizing/SKILL.md
 │   └── diagnose/SKILL.md
 ├── tests/

package/docs/plans/2026-06-03-karpathy-guidelines-ab-comparison.md ADDED

@@ -0,0 +1,166 @@
+# A/B Comparison: Writing Plans — Karpathy Behavioral Guidelines
+## Setup
+- **Same design doc** (bookmarks: CRUD + search)
+- **Same Go project scaffold**
+- **Same prompt** (no questions, full plan with concrete code)
+- **Variant A** (WITHOUT guidelines): 292-line SKILL.md — original writing-plans skill
+- **Variant B** (WITH guidelines): 354-line SKILL.md — with Behavioral Guidelines section appended
+---
+## Structural Comparison
+| Dimension | A (Without) | B (With) |
+|---|---|---|
+| **Total tasks** | 4 | 6 |
+| **Lines in plan** | ~1,054 | ~1,019 |
+| **New files per plan** | 7 files in Task 1 alone | 1-2 files per task |
+| **External dependency** | None (stdlib only) | `github.com/google/uuid` |
+---
+## Task Decomposition
+### A (Without) — 4 tasks
+| Task | Scope | Files touched |
+|---|---|---|
+| 1 | Bookmark + ALL infrastructure (model, store interface, mem store with full CRUD, service, handler, errors, route, tests) | 7 files |
+| 2 | Delete bookmark | 3 files |
+| 3 | List bookmarks (paginated, cursor) | 3 files |
+| 4 | Search bookmarks (keyword + pagination) | 3 files |
+### B (With) — 6 tasks
+| Task | Scope | Files touched |
+|---|---|---|
+| 1 | Scaffold (go.mod + model only) | 2 files |
+| 2 | Bookmark a message (store + handler + test + route) | 4 files |
+| 3 | List bookmarks (offset/limit pagination) | 4 files |
+| 4 | Remove a bookmark | 4 files |
+| 5 | Search bookmarks (keyword) | 4 files |
+| 6 | Final wiring + integration lifecycle test | 2 files |
+---
+## Detailed Analysis by Guideline
+### Simplicity First
+**A (Without):** ⚠️ **Overbuilt in Task 1.** Task 1 creates a `BookmarkStore` interface with 4 methods (Create, Delete, ListByUser, SearchByUser) — methods that won't be used until Tasks 2-4. It also creates the full `MemoryStore` implementation with all 4 methods, an `errors.go` file, a `Service` struct, AND the handler — all in a single task. The store interface is the full contract upfront before any task exercises most of it.
+**B (With):** ✅ **Minimal per task.** Task 1 only creates `go.mod` + the `Bookmark` struct. Task 2 introduces `Store` with only `Create`, and `MemStore` with only `Create`. `List` is added to the interface in Task 3, `Delete` in Task 4, `Search` in Task 5 — each method appears when it's needed, not before.
+**Verdict:** Guidelines had a clear positive effect. Plan B builds only what each task needs.
+### Surgical Changes
+**A (Without):** ⚠️ Task 1 touches 7 files in one go (model, store interface, store mem, errors, service, handler, main.go). The Task 1 description says "create the full vertical slice" which bundles infrastructure that isn't tested yet.
+**B (With):** ✅ Each task touches at most 4 files. Task 1 creates 2 files (go.mod, model.go). Task 2 adds 3 new files and modifies main.go. No task creates more than 4 files.
+**Verdict:** Guidelines had a clear positive effect. Plan B has tighter blast radius per task.
+### Think Before Coding (surface assumptions)
+**A (Without):** ❌ Silent assumptions throughout:
+- Used cursor-based pagination without noting the design just said "paginated" — didn't surface that offset-based vs cursor-based is a choice
+- Added `sync.RWMutex` and concurrent safety without the design mentioning concurrency
+- Created a `Service` layer between handler and store without justification
+**B (With):** ⚠️ Still has assumptions but more defensible:
+- Used offset/limit pagination (simpler, matches "paginated" literally)
+- Lighter concurrency handling (store uses a plain `sync.Mutex`, with no `RWMutex` overhead)
+- No `Service` layer — handler calls store directly
+- Did add `github.com/google/uuid` dependency without asking — minor assumption
+**Verdict:** Marginal positive effect. Plan B is less presumptuous but both plans made assumptions. Neither explicitly surfaced tradeoffs to the user.
+### Goal-Driven Execution
+**A (Without):** ✅ Good acceptance criteria with Given/When/Then. Has a `checkpoint: test` on 3/4 tasks and `checkpoint: done` on the last task.
+**B (With):** ✅ Good acceptance criteria. Has `checkpoint: test` on 3 tasks, `checkpoint: done` on 1, and no checkpoint on 2 simpler tasks. Added a full lifecycle integration test in Task 6 that wasn't in A.
+**Verdict:** Roughly equivalent. Both plans have strong acceptance criteria (required by the base skill). The lifecycle test in B is a nice bonus that catches integration issues.
+---
+## Unrelated Observations (noise, not guidelines)
+| Observation | A (Without) | B (With) |
+|---|---|---|
+| Pagination style | Cursor-based (more complex) | Offset-based (simpler) |
+| External deps | None | `google/uuid` |
+| Handler method naming | `Create`, `Delete`, `List`, `Search` | `CreateBookmark`, `DeleteBookmark`, `ListBookmarks`, `SearchBookmarks` |
+| Test structure | Single `TestXxx` with `t.Run` subtests | Separate top-level test functions |
+| `make([]T, 0, len)` usage | Yes (mem store candidates) | Yes (list handler, search handler) |
+---
+## Overall Assessment
+| Guideline | Effect | Evidence |
+|---|---|---|
+| **Simplicity First** | ✅ Strong positive | B builds incrementally; A front-loads the full store interface |
+| **Surgical Changes** | ✅ Positive | B touches fewer files per task (1-4 vs 7 in Task 1) |
+| **Think Before Coding** | ⚠️ Marginal | B made fewer silent assumptions but neither surfaced tradeoffs explicitly |
+| **Goal-Driven Execution** | ≈ Neutral | Both strong; base skill already enforces acceptance criteria |
+**Bottom line (iteration 1):** The guidelines measurably improved the plan. The biggest win is **Simplicity First** — Plan B's incremental interface growth (adding methods to `Store` as each task needs them) is clearly better than Plan A's upfront full-contract approach. This is exactly the kind of thing "no abstractions for single-use code" catches.
+**Weakness:** Neither plan explicitly called out assumptions or asked clarifying questions — the "Think Before Coding" guideline had the weakest signal. The guidelines alone may not be enough to overcome the model's tendency to fill gaps silently.
+---
+## Iteration 2: Revised Guidelines
+### What changed
+The guidelines were reworked from 4 generic coding rules to 3 planning-specific principles:
+| v1 (Generic) | v2 (Planning-Specific) | Why |
+|---|---|---|
+| Think Before Coding | **Surface Assumptions** | v1 said "ask" — the agent ignores this when told not to ask. v2 says "annotate in the plan" with a concrete `> **Assumption:** ...` format and examples of what to annotate. |
+| Simplicity First | **Build Only What Each Task Needs** | Kept the same core principle but added the specific anti-pattern from the v1 A/B test: "don't define interface methods that no task exercises yet." |
+| Surgical Changes | **One Task, One Change** | Reframed from "don't touch adjacent code" to "each task should trace to exactly one user-facing behavior" with a concrete guardrail (max 4 new files). |
+| Goal-Driven Execution | *(removed)* | Redundant — the base skill already enforces Given/When/Then acceptance criteria. |
+### Iteration 2 Plan (v2 guidelines) vs Iteration 1 Plans
+| Dimension | A (No guidelines) | B1 (v1 guidelines) | B2 (v2 guidelines) |
+|---|---|---|---|
+| **Total tasks** | 4 | 6 | 4 |
+| **Max files/task** | 7 (Task 1) | 4 | 4 |
+| **Assumptions annotated** | 0 | 0 | **4** (header below) |
+| **External deps** | None | `google/uuid` | None |
+| **Store interface** | 4 methods upfront in Task 1 | 1 method per task | 1 method per task |
+| **Service layer** | Yes (unjustified) | No | No |
+### The big win: Surface Assumptions
+Plan B2 opens with four explicit assumption annotations:
+```
+> **Assumption:** User identification via X-User-ID request header since
+> no auth system exists in the project.
+> **Assumption:** Bookmarks include a Note field so users can annotate
+> bookmarks. The design says "search by keyword" but doesn't specify
+> the field.
+> **Assumption:** Offset/limit pagination (not cursor-based).
+> **Assumption:** In-memory store behind a Store interface.
+```
+None of the previous plans (A or B1) did this. The v1 "Think Before Coding" guideline was completely invisible in output. The v2 "Surface Assumptions" guideline produced visible, reviewable annotations on the first run.
+### Iteration 2 Assessment
+| Guideline | v1 Effect | v2 Effect | Improvement |
+|---|---|---|---|
+| **Surface Assumptions** (was Think Before Coding) | ⚠️ Invisible | ✅ 4 explicit annotations | Complete turnaround — concrete format + examples fixed the weakest signal |
+| **Build Only What's Needed** (was Simplicity First) | ✅ Strong | ✅ Strong | Maintained — interface still grows incrementally |
+| **One Task, One Change** (was Surgical Changes) | ✅ Positive | ✅ Positive | Maintained — max 4 files/task |
+**Bottom line (iteration 2):** The v2 guidelines fixed the weakest signal from v1. "Surface Assumptions" went from invisible to producing 4 explicit, reviewable annotations. The other two principles maintained their positive effect. The removal of "Goal-Driven Execution" (redundant) reduced noise without losing signal.
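
The `> **Assumption:** ...` format described in the comparison above makes the "Assumptions annotated" row easy to check mechanically. The following is a minimal sketch in Python, assuming a plan is available as a Markdown string. The `count_assumptions` helper and the sample plan text are hypothetical illustrations and are not part of pi-workflow-kit.

```python
import re

# Matches a blockquote line that opens an annotation, for example:
# "> **Assumption:** Offset/limit pagination (not cursor-based)."
# Continuation lines of the same blockquote do not match.
ASSUMPTION_RE = re.compile(r"^[ \t]*>[ \t]*\*\*Assumption:\*\*", re.MULTILINE)


def count_assumptions(plan_markdown: str) -> int:
    """Count assumption annotations in a plan (hypothetical helper)."""
    return len(ASSUMPTION_RE.findall(plan_markdown))


# Sample plan header, shortened from the annotations quoted in the comparison.
plan = """\
> **Assumption:** User identification via X-User-ID request header since
> no auth system exists in the project.
> **Assumption:** Offset/limit pagination (not cursor-based).
"""

print(count_assumptions(plan))  # prints 2
```

A count of zero on a plan that clearly makes choices (pagination style, storage, auth) would be a signal that assumptions were filled in silently.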

package/docs/plans/completed/2026-06-03-add-verify-skill-design.md ADDED

@@ -0,0 +1,51 @@
+# Add Verify Skill — Design Doc
+## Context
+Based on [Chris Lema's "The Last Prompt"](https://chrislema.com/the-last-prompt-you-need-when-building-software-with-ai), we need a post-implementation code verification phase in pi-workflow-kit. The existing `design-review` skill validates architecture *intentions* at the design-doc level, but there's no review of the *actual implemented code*. This is where the most dangerous bugs hide: signature mismatches between layers, dead code, duplicated logic, and security holes that pass tests but break in production.
+## Decision
+### Add a `verify` skill (new)
+A single skill triggered by `/skill:verify` that runs three sequential expert review passes over implemented code:
+1. **Security** 🔴 — adversarial review as if a junior wrote it and the best security expert is auditing
+2. **Optimization** 🟡 — dead code, duplication, over/under-engineering, performance
+3. **Traceability** 🔵 — end-to-end call chain verification across every layer boundary
+Output: structured markdown report at `docs/plans/*-verification-report.md` with findings and actionable task list.
+### Keep `design-review` unchanged
+`design-review` stays between brainstorm and plan — it validates architecture before task breakdown. Moving it would lose the cheap "catch it before you build it" value.
+### Update README
+Add `verify` to the workflow diagram, skill table, and quick start. The pipeline becomes:
+```
+brainstorm → design-review → plan → execute → verify → finalize
+```
+## Workflow Integration
+```
+brainstorm → design-review (optional) → plan → execute → verify → finalize
+                                                 ↑         ↑
+                                           existing         new
+```
+- `verify` runs after `executing-tasks` and before `finalizing`
+- It's optional — trivial changes can skip it
+- The report's remediation task list feeds directly into a follow-up `/skill:writing-plans` if fixes are needed
+- Read-only: can write to `docs/plans/` only, cannot modify source code
+## Files to Change
+1. **`skills/verify/SKILL.md`** — new skill (full content in `docs/plans/2026-06-03-verify-skill-design.md`)
+2. **`README.md`** — update workflow diagram, skill table, quick start, and project structure
+## Production Risks
+Simple change — no design review needed. We're adding a new SKILL.md and updating documentation. No code execution, no external integrations, no security surface.
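
The read-only constraint in the design doc above (writes allowed under `docs/plans/` only) amounts to a path check. The following is a minimal sketch in Python, assuming enforcement by resolved path prefix. The `is_write_allowed` helper and its `project_root` parameter are hypothetical and do not describe how pi-workflow-kit actually enforces the rule.

```python
from pathlib import Path


def is_write_allowed(target: str, project_root: str = ".") -> bool:
    """Return True only if target resolves to a location under docs/plans/.

    Hypothetical helper: resolving the path first means a target such as
    "docs/plans/../../src/main.go" is rejected rather than trusted.
    """
    root = Path(project_root).resolve()
    allowed_dir = root / "docs" / "plans"
    resolved = (root / target).resolve()
    return allowed_dir in resolved.parents


print(is_write_allowed("docs/plans/2026-06-03-verification-report.md"))
print(is_write_allowed("src/main.go"))
```

Resolving before comparing is the important design choice here: a plain string-prefix check would accept paths that escape the directory through `..` segments.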

package/docs/plans/completed/2026-06-03-add-verify-skill-implementation.md ADDED

@@ -0,0 +1,111 @@
+# Implementation Plan: Add Verify Skill
+Design: `docs/plans/2026-06-03-add-verify-skill-design.md`
+## Overview
+Add a `verify` skill to pi-workflow-kit — a post-implementation code verification phase that runs three expert review passes (security, optimization, traceability) over implemented code. Also update the README to reflect the expanded workflow pipeline.
+Full SKILL.md content is in `docs/plans/2026-06-03-verify-skill-design.md` (lines 7-176, inside the code fence).
+## Task 1: Create the verify skill
+<!-- tdd: trivial -->
+Acceptance Criteria (QA Engineer Hat):
+- **Happy Path**:
+  - Given: No `skills/verify/` directory exists
+  - When: `skills/verify/SKILL.md` is created
+  - Then: The file contains valid YAML frontmatter with `name: verify` and a description mentioning security, optimization, and traceability. The file body contains all three review pass sections, the report format template, and the principles section.
+- **Edge Case (skill already exists)**:
+  - Given: `skills/verify/SKILL.md` already exists
+  - When: Task runs
+  - Then: The existing file is overwritten with the new content
+Files:
+- `skills/verify/SKILL.md`
+Steps:
+1. Create the directory `skills/verify/`
+2. Create `skills/verify/SKILL.md` with the full content from the design draft. The content is the markdown inside the code fence in `docs/plans/2026-06-03-verify-skill-design.md` (lines 8-176). Copy it exactly — it includes:
+   - YAML frontmatter with name and description
+   - # Verify heading and intro paragraph
+   - ## Process section (5 steps)
+   - ## Pass 1 — Security Review 🔴 (framing, what to look for, severity table)
+   - ## Pass 2 — Optimization Review 🟡 (framing, what to look for, priority table)
+   - ## Pass 3 — Traceability Review 🔵 (framing, what to look for 4 sub-items, severity table)
+   - ## Report Format section (full template with summary table, findings sections, remediation task list)
+   - ## Principles section (5 bullets)
+## Task 2: Update README with verify skill
+<!-- tdd: trivial -->
+Acceptance Criteria (QA Engineer Hat):
+- **Happy Path**:
+  - Given: README.md has the current workflow (brainstorm → design-review → plan → execute → finalize)
+  - When: README is updated
+  - Then: All six sections (tagline, workflow diagram, skill table, phase control, quick start, and project structure) are updated to include `verify` between execute and finalize.
+- **Edge Case (verify already in README)**:
+  - Given: README already contains verify references
+  - When: Task runs
+  - Then: No duplicate entries are introduced
+Files:
+- `README.md`
+Steps:
+1. Update the tagline (line 3) — change `brainstorm→plan→execute→finalize` to `brainstorm→plan→execute→verify→finalize`:
+   ```
+   > Stop AI agents from rushing to code. Enforce a structured brainstorm→plan→execute→verify→finalize workflow with TDD discipline.
+   ```
+2. Update the "🧠 6 Workflow Skills" heading (line 36) to "🧠 7 Workflow Skills"
+3. Update the workflow diagram (lines 40-44) to:
+   ```
+   brainstorm → design-review → plan → execute → verify → finalize
+                                            ↕
+                                         diagnose (anytime)
+   ```
+4. Add verify to the skill table (after the Execute row, before Finalize):
+   ```
+   | **Verify** | `/skill:verify` | Three expert review passes (security, optimization, traceability) on implemented code |
+   ```
+5. Update the phase control section (lines 61-67) to add verify:
+   ```
+   /skill:brainstorming   →  discuss and design
+   /skill:design-review   →  audit for production risks (non-trivial designs)
+   /skill:writing-plans   →  break into tasks
+   /skill:executing-tasks →  implement with TDD
+   /skill:verify          →  review code for security, optimization, and traceability issues
+   /skill:finalizing      →  ship it
+   ```
+6. Update the quick start section (lines 110-135) to add verify between executing-tasks and finalizing:
+   ```
+   > /skill:executing-tasks
+   # (agent implements with TDD, cognitive persona shifts, all tools unlocked)
+   > /skill:verify
+   # (agent runs security, optimization, and traceability reviews on implemented code)
+   > /skill:finalizing
+   # (agent archives docs, curates lessons, creates PR)
+   ```
+7. Update the project structure (lines 146-161) to add verify:
+   ```
+   ├── skills/
+   │   ├── brainstorming/SKILL.md
+   │   ├── design-review/SKILL.md
+   │   ├── writing-plans/SKILL.md
+   │   ├── executing-tasks/SKILL.md
+   │   ├── verify/SKILL.md
+   │   ├── finalizing/SKILL.md
+   │   └── diagnose/SKILL.md
+   ```

package/docs/plans/completed/2026-06-03-add-verify-skill-progress.md ADDED

@@ -0,0 +1,11 @@
+# Progress: Add Verify Skill
+Plan: docs/plans/2026-06-03-add-verify-skill-implementation.md
+Branch: add-verify-skill
+Started: 2026-06-03T13:00:00Z
+Last updated: 2026-06-03T13:00:00Z
+| # | Status | Task | Commit |
+|---|--------|------|--------|
+| 1 | ✅ done | Create the verify skill | c48d47a |
+| 2 | ✅ done | Update README with verify skill | ea37ea8 |

package/docs/plans/completed/2026-06-03-verify-skill-design.md ADDED

@@ -0,0 +1,176 @@
+# Verify Skill — Draft SKILL.md
+> **Target path:** `skills/verify/SKILL.md` (to be created during executing-tasks)
+---
+```markdown
+---
+name: verify
+description: "Post-implementation code verification with three expert review passes — security, optimization, and traceability. Use after executing-tasks and before finalizing to catch issues that pass tests but break in production. Runs the 'last prompt' pattern: adversarial security review, dead code and duplication audit, and end-to-end contract verification across every layer. Use this skill whenever the user says 'verify', 'review the code', 'check for issues', 'security review', 'the last prompt', 'audit', or when code has been implemented and needs a quality gate before shipping."
+---
+# Verify
+Three expert review passes over the implemented codebase. Read-only — you **may** write the verification report to `docs/plans/`, but you **may not** modify source code.
+The core insight: code that passes tests is not code that's ready. Working code can have security holes, dead branches, duplicated logic, and broken contracts between layers — especially when AI generates across many files without maintaining a single mental model of the whole system. This skill catches what tests miss.
+## Process
+1. **Check what's been done** — run `git log --oneline` and `git diff --stat` to understand the scope of recent changes. If nothing has been implemented, say "No code changes found. Run `/skill:executing-tasks` first." and stop.
+2. **Identify the project's layers** — before reviewing, map the codebase's architecture. Look for layer boundaries: UI/handlers/routes → services/business logic → repositories/data access → database/models. Note the patterns: does the project use controllers, handlers, or routes? Services or use cases? Repositories or DAOs? This map drives the traceability pass.
+3. **Run three expert review passes** — each pass adopts a distinct adversarial framing. Do them sequentially. For each pass, read the relevant code deeply — don't skim. Then write findings.
+4. **Compile the report** — write all findings to `docs/plans/*-verification-report.md`. Present the report to the user and wait for feedback.
+5. **Offer to create a remediation plan** — after the report, ask: "Want me to create a fix plan from these findings? Run `/skill:writing-plans` to turn the task list into executable tasks."
+## Pass 1 — Security Review 🔴
+**Framing:** A junior developer wrote this code. Now the best security expert on the team is reviewing it — adversarial, suspicious of everything. Trust nothing.
+**What to look for:**
+- **Input validation** — every external input (HTTP params, form data, headers, query strings, environment variables) must be validated and sanitized. Unvalidated input is a critical finding.
+- **Authentication & authorization** — every endpoint that handles user data must have auth checks. Are there endpoints that skip auth? Can one user access another user's data by changing an ID?
+- **Injection** — SQL queries built by string concatenation, unsanitized shell commands, template injection, XSS in HTML output. Any raw variable interpolated into a query or command is critical.
+- **Secrets** — API keys, passwords, tokens hardcoded in source files. Check environment variable loading — are defaults set to empty or to actual secrets?
+- **Data exposure** — are sensitive fields (passwords, tokens, PII) logged, returned in API responses, or stored unencrypted?
+- **Dependency risks** — known-vulnerable packages (if `package.json`/`go.mod`/`requirements.txt` is present).
+**Severity classification:**
+| Severity | Definition |
+|----------|-----------|
+| Critical | Exploitable right now — auth bypass, injection, data leak |
+| High | Likely exploitable — missing validation on sensitive endpoint, weak auth |
+| Medium | Harder to exploit but real risk — verbose error messages leaking internals, missing rate limits |
+| Low | Best practice violations — missing CSP headers, no HSTS, long session timeouts |
+## Pass 2 — Optimization Review 🟡
+**Framing:** A code quality expert looking for waste — things that make the codebase harder to maintain, slower to run, or more confusing than necessary.
+**What to look for:**
+- **Dead code** — functions, methods, types, or exports that are never called anywhere in the codebase. Search for definitions and verify they have callers.
+- **Duplication** — the same logic implemented in slightly different ways across multiple files. AI-generated code is especially prone to this — if context was lost between sessions, the AI solved the same sub-problem differently in two places. Flag each pair with file paths and line numbers.
+- **Over-engineering** — abstractions, interfaces, or layers that add complexity without earning their keep (only one implementation, no real variation across the seam).
+- **Under-engineering** — god functions, 200-line blocks, deeply nested conditionals that should be extracted.
+- **Performance concerns** — N+1 queries, unbounded loops, unnecessary copies of large data structures, missing pagination on list endpoints.
+**Priority classification:**
+| Priority | Definition |
+|----------|-----------|
+| P0 | Dead code in a critical path or duplicated logic that will diverge |
+| P1 | Significant duplication or over-engineering that increases maintenance cost |
+| P2 | Minor cleanups — long functions, missing pagination, style inconsistencies |
+## Pass 3 — Traceability Review 🔵
+**Framing:** An integration expert tracing every user-facing action end-to-end — from UI to database and back. The AI generates code file-by-file, and the seams between files are where bugs hide.
+**What to look for:**
+1. **Map every entry point** — list all handlers, routes, controllers, or event listeners that receive external input.
+2. **Trace each call chain** — for each entry point, follow the call: handler → service → repository → database. At each boundary, verify:
+   - **Function name** — does the caller use the exact function name the callee exposes?
+   - **Argument names** — does the caller pass `userId` when the function expects `user_id`? Does `id` mean the same thing in both layers?
+   - **Argument types** — is a string passed where an integer is expected? Is an object shape different from what the next layer destructures?
+   - **Return shape** — does the caller expect fields that the callee actually returns? Are response DTOs consistent across layers?
+3. **Check error propagation** — when a database query returns no results, does the service layer handle it? Does the handler return 404 or 500? Do errors propagate cleanly or get swallowed silently?
+4. **Verify the round-trip** — if the UI calls `getUser(id)` and displays `user.name`, trace that `name` actually exists in the DB schema, gets selected by the query, mapped by the repository, passed through the service, included in the response, and rendered by the UI.
+**This is the pass that catches the most bugs.** AI-generated code will often have a frontend calling `getUserProfile(userId)` and a backend exposing `get_user_profile(user_id)` — both work in isolation, neither works together.
+**Severity classification:**
+| Severity | Definition |
+|----------|-----------|
+| Critical | Call chain is completely broken — function doesn't exist or signature is fundamentally wrong |
+| High | Signature mismatch — wrong arg names, wrong types, missing required fields |
+| Medium | Silent error handling — errors swallowed without logging or user feedback |
+| Low | Inconsistent naming conventions that could confuse future developers |
+## Report Format
+Write findings to `docs/plans/*-verification-report.md` using this structure:
+    # Verification Report: <feature/topic>
+    **Date:** <ISO date>
+    **Scope:** <summary of what was reviewed>
+    **Reviewer:** AI verify skill (security + optimization + traceability)
+    ## Summary
+    | Pass | Critical | High | Medium | Low |
+    |------|----------|------|--------|-----|
+    | Security | X | X | X | X |
+    | Optimization | — | X | X | X |
+    | Traceability | X | X | X | X |
+    | **Total** | **X** | **X** | **X** | **X** |
+    ## 🔴 Security Findings
+    ### [S-001] Critical — <short title>
+    **Location:** `path/to/file.ts:line`
+    **Issue:** <what's wrong and why it matters>
+    **Fix:** <concrete remediation step>
+    ### [S-002] High — <short title>
+    ...
+    ## 🟡 Optimization Findings
+    ### [O-001] P0 — <short title>
+    **Location:** `path/to/file.ts:line` and `path/to/other.ts:line`
+    **Issue:** <what's wrong>
+    **Fix:** <concrete remediation step>
+    ### [O-002] P1 — <short title>
+    ...
+    ## 🔵 Traceability Findings
+    ### [T-001] Critical — <short title>
+    **Entry point:** `path/to/handler.ts:line`
+    **Call chain:** handler → service → repository → DB
+    **Broken at:** <which boundary>
+    **Issue:** <what's wrong — e.g., handler passes `userId` but service expects `user_id`>
+    **Fix:** <concrete remediation step>
+    ### [T-002] High — <short title>
+    ...
+    ## Remediation Task List
+    Convert findings into actionable tasks:
+    | ID | Priority | Finding | Estimated Effort |
+    |----|----------|---------|-----------------|
+    | S-001 | Critical | <one-liner> | <small/medium/large> |
+    | T-001 | Critical | <one-liner> | <small/medium/large> |
+    | O-001 | P0 | <one-liner> | <small/medium/large> |
+    | ...
+## Principles
+- **Be specific** — every finding must include a file path and line reference. "There might be security issues" is useless.
+- **Be adversarial** — actively look for problems. If you don't find any, say so — but don't phone it in.
+- **Be proportional** — a small config change doesn't need the same depth as a new API endpoint. Adjust your review depth to the scope of changes.
+- **Don't fix anything** — this is read-only. Find and report. The user decides what to fix and when.
+- **Focus on seams** — the traceability pass is where the most value lives. Code within a single file is usually coherent; the bugs hide between files.
+```
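
The boundary check in the traceability pass (caller passes `userId`, callee expects `user_id`) can be illustrated with a small script. The following is a minimal sketch in Python, assuming both sides of the boundary are Python callables. The `find_argument_mismatches` helper and the `get_user_profile` function are hypothetical and are not part of the verify skill, which is a prompt-driven review rather than a tool.

```python
import inspect


def find_argument_mismatches(callee, call_kwargs):
    """Compare the keyword arguments a caller passes with a callee's signature.

    Returns (unexpected, missing): names the callee does not accept, and
    required parameters the caller did not supply. Hypothetical helper.
    """
    params = inspect.signature(callee).parameters
    unexpected = [name for name in call_kwargs if name not in params]
    missing = [
        name
        for name, param in params.items()
        if param.default is inspect.Parameter.empty
        and param.kind
        in (inspect.Parameter.POSITIONAL_OR_KEYWORD, inspect.Parameter.KEYWORD_ONLY)
        and name not in call_kwargs
    ]
    return unexpected, missing


# Hypothetical service-layer function using snake_case.
def get_user_profile(user_id: int) -> dict:
    return {"id": user_id, "name": "example"}


# A caller written against a camelCase contract.
print(find_argument_mismatches(get_user_profile, {"userId": 42}))
# prints (['userId'], ['user_id'])
```

Each side works in isolation, which is why unit tests on either layer would not catch the mismatch. Only a check across the boundary does.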

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "@tianhai/pi-workflow-kit",
-  "version": "0.16.0",
+  "version": "0.17.1",
   "description": "Enforce structured brainstorm→plan→execute→finalize workflow with TDD discipline in AI coding agents",
   "keywords": [
     "pi-package",

package/skills/verify/SKILL.md ADDED

@@ -0,0 +1,170 @@
+---
+name: verify
+description: "Post-implementation code verification with three expert review passes — security, optimization, and traceability. Use after executing-tasks and before finalizing to catch issues that pass tests but break in production. Runs the 'last prompt' pattern: adversarial security review, dead code and duplication audit, and end-to-end contract verification across every layer. Use this skill whenever the user says 'verify', 'review the code', 'check for issues', 'security review', 'the last prompt', 'audit', or when code has been implemented and needs a quality gate before shipping."
+---
+# Verify
+Three expert review passes over the implemented codebase. Read-only — you **may** write the verification report to `docs/plans/`, but you **may not** modify source code.
+The core insight: code that passes tests is not code that's ready. Working code can have security holes, dead branches, duplicated logic, and broken contracts between layers — especially when AI generates across many files without maintaining a single mental model of the whole system. This skill catches what tests miss.
+## Process
+1. **Check what's been done** — run `git log --oneline` and `git diff --stat` to understand the scope of recent changes. If nothing has been implemented, say "No code changes found. Run `/skill:executing-tasks` first." and stop.
+2. **Identify the project's layers** — before reviewing, map the codebase's architecture. Look for layer boundaries: UI/handlers/routes → services/business logic → repositories/data access → database/models. Note the patterns: does the project use controllers, handlers, or routes? Services or use cases? Repositories or DAOs? This map drives the traceability pass.
+3. **Run three expert review passes** — each pass adopts a distinct adversarial framing. Do them sequentially. For each pass, read the relevant code deeply — don't skim. Then write findings.
+4. **Compile the report** — write all findings to `docs/plans/*-verification-report.md`. Present the report to the user and wait for feedback.
+5. **Offer to create a remediation plan** — after the report, ask: "Want me to create a fix plan from these findings? Run `/skill:writing-plans` to turn the task list into executable tasks."
+## Pass 1 — Security Review 🔴
+**Framing:** A junior developer wrote this code. Now the best security expert on the team is reviewing it — adversarial, suspicious of everything. Trust nothing.
+**What to look for:**
+- **Input validation** — every external input (HTTP params, form data, headers, query strings, environment variables) must be validated and sanitized. Unvalidated input is a critical finding.
+- **Authentication & authorization** — every endpoint that handles user data must have auth checks. Are there endpoints that skip auth? Can one user access another user's data by changing an ID?
+- **Injection** — SQL queries built by string concatenation, unsanitized shell commands, template injection, XSS in HTML output. Any raw variable interpolated into a query or command is critical.
+- **Secrets** — API keys, passwords, tokens hardcoded in source files. Check environment variable loading — are defaults set to empty or to actual secrets?
+- **Data exposure** — are sensitive fields (passwords, tokens, PII) logged, returned in API responses, or stored unencrypted?
+- **Dependency risks** — known-vulnerable packages (if `package.json`/`go.mod`/`requirements.txt` is present).
+**Severity classification:**
+| Severity | Definition |
+|----------|-----------|
+| Critical | Exploitable right now — auth bypass, injection, data leak |
+| High | Likely exploitable — missing validation on sensitive endpoint, weak auth |
+| Medium | Harder to exploit but real risk — verbose error messages leaking internals, missing rate limits |
+| Low | Best practice violations — missing CSP headers, no HSTS, long session timeouts |
+## Pass 2 — Optimization Review 🟡
+**Framing:** A code quality expert looking for waste — things that make the codebase harder to maintain, slower to run, or more confusing than necessary.
+**What to look for:**
+- **Dead code** — functions, methods, types, or exports that are never called anywhere in the codebase. Search for definitions and verify they have callers.
+- **Duplication** — the same logic implemented in slightly different ways across multiple files. AI-generated code is especially prone to this — if context was lost between sessions, the AI solved the same sub-problem differently in two places. Flag each pair with file paths and line numbers.
+- **Over-engineering** — abstractions, interfaces, or layers that add complexity without earning their keep (only one implementation, no real variation across the seam).
+- **Under-engineering** — god functions, 200-line blocks, deeply nested conditionals that should be extracted.
+- **Performance concerns** — N+1 queries, unbounded loops, unnecessary copies of large data structures, missing pagination on list endpoints.
+**Priority classification:**
+| Priority | Definition |
+|----------|-----------|
+| P0 | Dead code in a critical path or duplicated logic that will diverge |
+| P1 | Significant duplication or over-engineering that increases maintenance cost |
+| P2 | Minor cleanups — long functions, missing pagination, style inconsistencies |
+## Pass 3 — Traceability Review 🔵
+**Framing:** An integration expert tracing every user-facing action end-to-end — from UI to database and back. The AI generates code file-by-file, and the seams between files are where bugs hide.
+**What to look for:**
+1. **Map every entry point** — list all handlers, routes, controllers, or event listeners that receive external input.
+2. **Trace each call chain** — for each entry point, follow the call: handler → service → repository → database. At each boundary, verify:
+   - **Function name** — does the caller use the exact function name the callee exposes?
+   - **Argument names** — does the caller pass `userId` when the function expects `user_id`? Does `id` mean the same thing in both layers?
+   - **Argument types** — is a string passed where an integer is expected? Is an object shape different from what the next layer destructures?
+   - **Return shape** — does the caller expect fields that the callee actually returns? Are response DTOs consistent across layers?
+3. **Check error propagation** — when a database query returns no results, does the service layer handle it? Does the handler return 404 or 500? Do errors propagate cleanly or get swallowed silently?
+4. **Verify the round-trip** — if the UI calls `getUser(id)` and displays `user.name`, trace that `name` actually exists in the DB schema, gets selected by the query, mapped by the repository, passed through the service, included in the response, and rendered by the UI.
+**This is the pass that catches the most bugs.** AI-generated code will often have a frontend calling `getUserProfile(userId)` and a backend exposing `get_user_profile(user_id)` — both work in isolation, neither works together.
+**Severity classification:**
+| Severity | Definition |
+|----------|-----------|
+| Critical | Call chain is completely broken — function doesn't exist or signature is fundamentally wrong |
+| High | Signature mismatch — wrong arg names, wrong types, missing required fields |
+| Medium | Silent error handling — errors swallowed without logging or user feedback |
+| Low | Inconsistent naming conventions that could confuse future developers |
+## Report Format
+Write findings to `docs/plans/*-verification-report.md` using this structure:
+```markdown
+# Verification Report: <feature/topic>
+**Date:** <ISO date>
+**Scope:** <summary of what was reviewed>
+**Reviewer:** AI verify skill (security + optimization + traceability)
+## Summary
+| Pass | Critical | High | Medium | Low |
+|------|----------|------|--------|-----|
+| Security | X | X | X | X |
+| Optimization | — | X | X | X |
+| Traceability | X | X | X | X |
+| **Total** | **X** | **X** | **X** | **X** |
+## 🔴 Security Findings
+### [S-001] Critical — <short title>
+**Location:** `path/to/file.ts:line`
+**Issue:** <what's wrong and why it matters>
+**Fix:** <concrete remediation step>
+### [S-002] High — <short title>
+...
+## 🟡 Optimization Findings
+### [O-001] P0 — <short title>
+**Location:** `path/to/file.ts:line` and `path/to/other.ts:line`
+**Issue:** <what's wrong>
+**Fix:** <concrete remediation step>
+### [O-002] P1 — <short title>
+...
+## 🔵 Traceability Findings
+### [T-001] Critical — <short title>
+**Entry point:** `path/to/handler.ts:line`
+**Call chain:** handler → service → repository → DB
+**Broken at:** <which boundary>
+**Issue:** <what's wrong — e.g., handler passes `userId` but service expects `user_id`>
+**Fix:** <concrete remediation step>
+### [T-002] High — <short title>
+...
+## Remediation Task List
+Convert findings into actionable tasks:
+| ID | Priority | Finding | Estimated Effort |
+|----|----------|---------|-----------------|
+| S-001 | Critical | <one-liner> | <small/medium/large> |
+| T-001 | Critical | <one-liner> | <small/medium/large> |
+| O-001 | P0 | <one-liner> | <small/medium/large> |
+| ...
+```
+## Principles
+- **Be specific** — every finding must include a file path and line reference. "There might be security issues" is useless.
+- **Be adversarial** — actively look for problems. If you don't find any, say so — but don't phone it in.
+- **Be proportional** — a small config change doesn't need the same depth as a new API endpoint. Adjust your review depth to the scope of changes.
+- **Don't fix anything** — this is read-only. Find and report. The user decides what to fix and when.
+- **Focus on seams** — the traceability pass is where the most value lives. Code within a single file is usually coherent; the bugs hide between files.
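
Reader's note (not part of the diff): the traceability pass in the added skill checks, at each layer boundary, that the callee exists and that the caller passes the argument names the callee expects. The TypeScript sketch below shows one way such a boundary check could be expressed, using the skill's own `userId` vs `user_id` example and its Critical/High definitions. The `Boundary` and `TraceFinding` types and the `checkBoundary` function are hypothetical and do not come from pi-workflow-kit.

```typescript
// Hypothetical illustration: these types and names are not part of pi-workflow-kit.
type TraceSeverity = "Critical" | "High";

interface Boundary {
  caller: string; // e.g. "handler"
  callee: string; // e.g. "service"
  calleeExists: boolean; // does the function the caller invokes actually exist?
  passedArgs: string[]; // argument names the caller supplies
  expectedArgs: string[]; // argument names the callee requires
}

interface TraceFinding {
  severity: TraceSeverity;
  brokenAt: string;
  issue: string;
}

function checkBoundary(b: Boundary): TraceFinding[] {
  const brokenAt = b.caller + " -> " + b.callee;

  // Critical per the skill's table: the call chain is completely broken.
  if (!b.calleeExists) {
    return [{ severity: "Critical", brokenAt, issue: "callee does not exist" }];
  }

  // High per the skill's table: signature mismatch such as a wrong argument name.
  const findings: TraceFinding[] = [];
  for (const expected of b.expectedArgs) {
    if (b.passedArgs.indexOf(expected) === -1) {
      findings.push({
        severity: "High",
        brokenAt,
        issue: "caller never passes '" + expected + "'",
      });
    }
  }
  return findings;
}
```

In practice the argument lists would have to be read from the code under review; the sketch only shows how a mismatch maps onto the severity table.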

package/skills/writing-plans/SKILL.md CHANGED

@@ -290,3 +290,54 @@ Use judgment when assigning checkpoints. Prefer `checkpoint: test` for new featu
 ## After the plan
 Ask: "Ready to execute? Run `/skill:executing-tasks`"
+## Behavioral Guidelines
+Guidelines to reduce overcomplication and hidden assumptions in plans. Derived from [Andrej Karpathy's observations](https://x.com/karpathy/status/2015883857489522876) on LLM coding pitfalls, adapted for the planning context.
+**Tradeoff:** These guidelines bias toward caution over speed. For trivial plans (1-2 tasks), use judgment.
+### Surface Assumptions
+**When the design is ambiguous, annotate — don't silently pick.**
+When writing a plan, you'll encounter gaps: the design says "paginated" but doesn't specify how, says "validate input" but doesn't say which fields, or leaves the data layer unspecified. Your instinct will be to fill the gap and keep writing. Resist that.
+Instead, add a brief `> **Assumption:** ...` note in the plan at the point where you made the call:
+```
+> **Assumption:** Using offset/limit pagination because the design just says
+> "paginated". Cursor-based would be better for large datasets.
+```
+```
+> **Assumption:** No service layer — handler calls store directly. Add one
+> if cross-cutting concerns (logging, auth checks) emerge later.
+```
+This lets the reviewer see what you chose and why, without blocking progress. Common gaps worth annotating:
+- Pagination style, error handling strategy, concurrency model
+- Whether to add a service/middleware layer
+- Whether to add external dependencies
+- Naming conventions when the design doesn't specify
+### Build Only What Each Task Needs
+**Minimum code to deliver the task's observable behavior. Nothing more.**
+- No interface methods that no task exercises yet. If Task 2 creates a `Store` interface, it should have only the methods Task 2 calls. Add methods in the task that first needs them.
+- No layers (service, middleware, repository) unless the design explicitly requires them.
+- No error types, helper files, or shared packages until a task actually uses them.
+- No external dependencies when stdlib suffices. Every `go get` or `npm install` is a choice — default to no.
+- No "flexible" or "configurable" code that wasn't requested.
+If you find yourself writing a store with 4 methods where only 1 is used in this task, stop. Write 1 method. Add the rest when the tasks that need them arrive.
+### One Task, One Change
+**Each task should trace to exactly one user-facing behavior.**
+- If a task creates more than 4 new files, it's probably doing too much — split it.
+- If a task modifies existing files unrelated to its acceptance criteria, trim the scope.
+- Infrastructure (types, interfaces, module scaffolding) should live in the same task as the first code that uses it, not in a separate "setup" task — unless the infrastructure alone is complex enough to warrant its own task.
+- Every file listed in a task's `Files:` section should be directly necessary for that task's acceptance criteria to pass.
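
Reader's note (not part of the diff): the "Build Only What Each Task Needs" guideline above says a `Store` interface should carry only the methods the current task calls. A minimal TypeScript sketch of what that might look like for a task that only reads a user is shown below. Apart from the `Store` name, every identifier (`User`, `getUser`, `MemoryStore`) is hypothetical.

```typescript
// Hypothetical names throughout; only the `Store` interface name comes from the guideline.
interface User {
  id: string;
  name: string;
}

// This task only reads a user, so the interface exposes only that one method.
// Writes, deletes, and listing are deliberately absent until a task needs them.
interface Store {
  getUser(id: string): User | undefined;
}

class MemoryStore implements Store {
  private readonly users: User[];

  constructor(seed: User[]) {
    // Copy the seed so later changes to the caller's array do not leak in.
    this.users = seed.slice();
  }

  getUser(id: string): User | undefined {
    for (const user of this.users) {
      if (user.id === id) {
        return user;
      }
    }
    return undefined;
  }
}
```

A later task that first needs, say, a write method would add it to `Store` and `MemoryStore` at that point, which keeps each task's diff tied to the behavior it delivers.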