@tianhai/pi-workflow-kit (npm) 0.15.0 → 0.17.1


Files changed (20)

package/skills/verify/SKILL.md ADDED

@@ -0,0 +1,170 @@
+---
+name: verify
+description: "Post-implementation code verification with three expert review passes — security, optimization, and traceability. Use after executing-tasks and before finalizing to catch issues that pass tests but break in production. Runs the 'last prompt' pattern: adversarial security review, dead code and duplication audit, and end-to-end contract verification across every layer. Use this skill whenever the user says 'verify', 'review the code', 'check for issues', 'security review', 'the last prompt', 'audit', or when code has been implemented and needs a quality gate before shipping."
+---
+# Verify
+Three expert review passes over the implemented codebase. Read-only — you **may** write the verification report to `docs/plans/`, but you **may not** modify source code.
+The core insight: code that passes tests is not code that's ready. Working code can have security holes, dead branches, duplicated logic, and broken contracts between layers — especially when AI generates across many files without maintaining a single mental model of the whole system. This skill catches what tests miss.
+## Process
+1. **Check what's been done** — run `git log --oneline` and `git diff --stat` to understand the scope of recent changes. If nothing has been implemented, say "No code changes found. Run `/skill:executing-tasks` first." and stop.
+2. **Identify the project's layers** — before reviewing, map the codebase's architecture. Look for layer boundaries: UI/handlers/routes → services/business logic → repositories/data access → database/models. Note the patterns: does the project use controllers, handlers, or routes? Services or use cases? Repositories or DAOs? This map drives the traceability pass.
+3. **Run three expert review passes** — each pass adopts a distinct adversarial framing. Do them sequentially. For each pass, read the relevant code deeply — don't skim. Then write findings.
+4. **Compile the report** — write all findings to `docs/plans/*-verification-report.md`. Present the report to the user and wait for feedback.
+5. **Offer to create a remediation plan** — after the report, ask: "Want me to create a fix plan from these findings? Run `/skill:writing-plans` to turn the task list into executable tasks."
+## Pass 1 — Security Review 🔴
+**Framing:** A junior developer wrote this code. Now the best security expert on the team is reviewing it — adversarial, suspicious of everything. Trust nothing.
+**What to look for:**
+- **Input validation** — every external input (HTTP params, form data, headers, query strings, environment variables) must be validated and sanitized. Unvalidated input is a critical finding.
+- **Authentication & authorization** — every endpoint that handles user data must have auth checks. Are there endpoints that skip auth? Can one user access another user's data by changing an ID?
+- **Injection** — SQL queries built by string concatenation, unsanitized shell commands, template injection, XSS in HTML output. Any raw variable interpolated into a query or command is critical.
+- **Secrets** — API keys, passwords, tokens hardcoded in source files. Check environment variable loading — are defaults set to empty or to actual secrets?
+- **Data exposure** — are sensitive fields (passwords, tokens, PII) logged, returned in API responses, or stored unencrypted?
+- **Dependency risks** — known-vulnerable packages (if `package.json`/`go.mod`/`requirements.txt` is present).
+**Severity classification:**
+| Severity | Definition |
+|----------|-----------|
+| Critical | Exploitable right now — auth bypass, injection, data leak |
+| High | Likely exploitable — missing validation on sensitive endpoint, weak auth |
+| Medium | Harder to exploit but real risk — verbose error messages leaking internals, missing rate limits |
+| Low | Best practice violations — missing CSP headers, no HSTS, long session timeouts |
+## Pass 2 — Optimization Review 🟡
+**Framing:** A code quality expert looking for waste — things that make the codebase harder to maintain, slower to run, or more confusing than necessary.
+**What to look for:**
+- **Dead code** — functions, methods, types, or exports that are never called anywhere in the codebase. Search for definitions and verify they have callers.
+- **Duplication** — the same logic implemented in slightly different ways across multiple files. AI-generated code is especially prone to this — if context was lost between sessions, the AI solved the same sub-problem differently in two places. Flag each pair with file paths and line numbers.
+- **Over-engineering** — abstractions, interfaces, or layers that add complexity without earning their keep (only one implementation, no real variation across the seam).
+- **Under-engineering** — god functions, 200-line blocks, deeply nested conditionals that should be extracted.
+- **Performance concerns** — N+1 queries, unbounded loops, unnecessary copies of large data structures, missing pagination on list endpoints.
+**Priority classification:**
+| Priority | Definition |
+|----------|-----------|
+| P0 | Dead code in a critical path or duplicated logic that will diverge |
+| P1 | Significant duplication or over-engineering that increases maintenance cost |
+| P2 | Minor cleanups — long functions, missing pagination, style inconsistencies |
+## Pass 3 — Traceability Review 🔵
+**Framing:** An integration expert tracing every user-facing action end-to-end — from UI to database and back. The AI generates code file-by-file, and the seams between files are where bugs hide.
+**What to look for:**
+1. **Map every entry point** — list all handlers, routes, controllers, or event listeners that receive external input.
+2. **Trace each call chain** — for each entry point, follow the call: handler → service → repository → database. At each boundary, verify:
+   - **Function name** — does the caller use the exact function name the callee exposes?
+   - **Argument names** — does the caller pass `userId` when the function expects `user_id`? Does `id` mean the same thing in both layers?
+   - **Argument types** — is a string passed where an integer is expected? Is an object shape different from what the next layer destructures?
+   - **Return shape** — does the caller expect fields that the callee actually returns? Are response DTOs consistent across layers?
+3. **Check error propagation** — when a database query returns no results, does the service layer handle it? Does the handler return 404 or 500? Do errors propagate cleanly or get swallowed silently?
+4. **Verify the round-trip** — if the UI calls `getUser(id)` and displays `user.name`, trace that `name` actually exists in the DB schema, gets selected by the query, mapped by the repository, passed through the service, included in the response, and rendered by the UI.
+**This is the pass that catches the most bugs.** AI-generated code will often have a frontend calling `getUserProfile(userId)` and a backend exposing `get_user_profile(user_id)` — both work in isolation, neither works together.
+**Severity classification:**
+| Severity | Definition |
+|----------|-----------|
+| Critical | Call chain is completely broken — function doesn't exist or signature is fundamentally wrong |
+| High | Signature mismatch — wrong arg names, wrong types, missing required fields |
+| Medium | Silent error handling — errors swallowed without logging or user feedback |
+| Low | Inconsistent naming conventions that could confuse future developers |
+## Report Format
+Write findings to `docs/plans/*-verification-report.md` using this structure:
+```markdown
+# Verification Report: <feature/topic>
+**Date:** <ISO date>
+**Scope:** <summary of what was reviewed>
+**Reviewer:** AI verify skill (security + optimization + traceability)
+## Summary
+| Pass | Critical | High | Medium | Low |
+|------|----------|------|--------|-----|
+| Security | X | X | X | X |
+| Optimization | — | X | X | X |
+| Traceability | X | X | X | X |
+| **Total** | **X** | **X** | **X** | **X** |
+## 🔴 Security Findings
+### [S-001] Critical — <short title>
+**Location:** `path/to/file.ts:line`
+**Issue:** <what's wrong and why it matters>
+**Fix:** <concrete remediation step>
+### [S-002] High — <short title>
+...
+## 🟡 Optimization Findings
+### [O-001] P0 — <short title>
+**Location:** `path/to/file.ts:line` and `path/to/other.ts:line`
+**Issue:** <what's wrong>
+**Fix:** <concrete remediation step>
+### [O-002] P1 — <short title>
+...
+## 🔵 Traceability Findings
+### [T-001] Critical — <short title>
+**Entry point:** `path/to/handler.ts:line`
+**Call chain:** handler → service → repository → DB
+**Broken at:** <which boundary>
+**Issue:** <what's wrong — e.g., handler passes `userId` but service expects `user_id`>
+**Fix:** <concrete remediation step>
+### [T-002] High — <short title>
+...
+## Remediation Task List
+Convert findings into actionable tasks:
+| ID | Priority | Finding | Estimated Effort |
+|----|----------|---------|-----------------|
+| S-001 | Critical | <one-liner> | <small/medium/large> |
+| T-001 | Critical | <one-liner> | <small/medium/large> |
+| O-001 | P0 | <one-liner> | <small/medium/large> |
+| ...
+```
+## Principles
+- **Be specific** — every finding must include a file path and line reference. "There might be security issues" is useless.
+- **Be adversarial** — actively look for problems. If you don't find any, say so — but don't phone it in.
+- **Be proportional** — a small config change doesn't need the same depth as a new API endpoint. Adjust your review depth to the scope of changes.
+- **Don't fix anything** — this is read-only. Find and report. The user decides what to fix and when.
+- **Focus on seams** — the traceability pass is where the most value lives. Code within a single file is usually coherent; the bugs hide between files.
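
The seam bug that Pass 3 of the added `verify` skill describes (a caller passing `userId` where the callee reads `user_id`) can be sketched in TypeScript as follows. This is an illustration only and is not part of the package: the handler, service, and in-memory `users` table are all hypothetical, and the loosely typed `Record<string, unknown>` boundary is an assumption chosen so that the compiler does not catch the mismatch.

```typescript
// Hypothetical data layer: a tiny in-memory "table".
type UserRow = { id: number; name: string };
const users: UserRow[] = [{ id: 1, name: "Ada" }];

// Hypothetical service: reads a snake_case key from an untyped params bag.
function getUserProfileService(params: Record<string, unknown>): UserRow | undefined {
  const id = params["user_id"];
  return users.find((u) => u.id === id);
}

type HttpResult = { status: number; body: unknown };

// Hypothetical handler with the seam bug: it sends a camelCase key,
// so the service never sees an id and every lookup misses.
function handleGetUser(userId: number): HttpResult {
  const user = getUserProfileService({ userId });
  if (!user) return { status: 404, body: { error: "not found" } };
  return { status: 200, body: user };
}

// Same handler with the argument name aligned to what the service reads.
function handleGetUserFixed(userId: number): HttpResult {
  const user = getUserProfileService({ user_id: userId });
  if (!user) return { status: 404, body: { error: "not found" } };
  return { status: 200, body: user };
}
```

Both functions type-check and each layer works in isolation, yet `handleGetUser(1)` reports a missing user that exists. A traceability finding for it would name the handler as the entry point and the handler → service boundary as the place where the chain breaks.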

package/skills/writing-plans/SKILL.md CHANGED

@@ -10,18 +10,43 @@ You may only create or edit files under `docs/plans/`. Do not modify source code
 ## Process
 1. **Check for a design doc** — look for `docs/plans/*-design.md`. If one exists, use it as the basis for the plan. If the design doc is incomplete, fill gaps by asking the human. If no design doc exists, ask the user to describe what they want to build and read relevant code. **Read `docs/lessons.md`** if it exists — incorporate known patterns into the task breakdown (e.g., if a lesson says "always run lint before commit," include that in relevant task instructions).
+   Then evaluate whether the design — whether from the design doc or from the user's description and codebase exploration — involves any of the following:
+   - Database schema changes or migrations
+   - Authentication or authorization logic
+   - External API or service integrations
+   - Concurrency or batch processing
+   - File uploads or large data flows
+   - Redis, caching, or message queues
+   If any apply AND the design doc does not already have an `## Architectural Review` section, prompt the user: "This design involves [list what you found] but hasn't been reviewed for production risks. Run `/skill:design-review` first, or type 'proceed' to skip."
+   If the design doc explicitly notes "Simple change — no design review needed", skip this check.
 2. **Write the implementation plan** — break the design into tasks. Save to `docs/plans/YYYY-MM-DD-<topic>-implementation.md`. If the design is too large for ~15 tasks, flag this to the human and ask whether to reduce scope or proceed with the full plan.
 3. **Present the plan** — show the complete plan to the human. Wait for approval before suggesting execution.
+   Before presenting, run the **Plan Acceptance Audit**:
+   - **Vertical Slices**: Is every task a complete vertical slice (not horizontal)?
+   - **Task Sizing**: Is any single task too large or covering multiple complex behaviors? If so, split it.
+   - **QA Coverage**: Does every task have both a Happy Path and at least one Edge Case in its Acceptance Criteria?
+   - **Checkpoint Alignment**: Are `checkpoint: test` and `checkpoint: done` gates placed on the most critical or risky tasks?
+   - **Risk Enforcement**: If the design doc's Architectural Review section flagged any hazards as `[TRIGGERED]`, verify the corresponding tasks have `checkpoint: done` and a `Hazard Mitigation Verification` section.
+   If any check fails, fix the plan before presenting.
 ## Task format
 Each task should produce one testable change. The executing-tasks skill handles committing — do not include `git commit` in the task body.
 Each task must include:
 - Exact file paths to create/modify
+- **Acceptance Criteria (QA Engineer Hat)** — Put on your **QA Engineer Hat** to design exhaustive test coverage. Explicitly define:
+  - **Happy Path**: Expected behavior under normal operations.
+  - **Edge Cases & Error Paths**: What happens with empty inputs, limits exceeded, authentication failures, or error states.
+  Ensure every criteria block specifies the expected state and returned results using `Given/When/Then` behavioral blocks.
 - **Concrete code** — include the actual implementation, not a summary. Write out SQL schemas, type definitions, function signatures with bodies, route handler code, and test assertions. A developer should be able to copy-paste from the plan and have working code. For tasks that depend on types or utilities from earlier tasks, reference them explicitly (e.g., `import { User } from Task 2`) and include only the new code
 - Exact commands with expected output (e.g., `npx vitest run src/user/model.test.ts` → shows 1 test passing)
-- Each task's tests should cover the happy path and at least one edge case or error path, with concrete assertions
 Each task must use a numbered heading with optional metadata comments:
@@ -95,6 +120,16 @@ The examples below show the structure — headings, metadata comments, checkpoin
 <!-- tdd: new-feature -->
+Acceptance Criteria (QA Engineer Hat):
+- **Happy Path**:
+  - Given: Valid user data with name and email
+  - When: The User model is created
+  - Then: The model contains the correct fields and a generated ID
+- **Edge Case (duplicate email)**:
+  - Given: A user with email "test@example.com" already exists
+  - When: Another user is created with the same email
+  - Then: Creation fails with a unique constraint error
 Files:
 - `src/user/model.ts`
 - `src/user/model.test.ts`
@@ -113,6 +148,16 @@ Steps:
 <!-- tdd: new-feature -->
 <!-- checkpoint: test -->
+Acceptance Criteria (QA Engineer Hat):
+- **Happy Path**:
+  - Given: A user with valid credentials exists
+  - When: Login is attempted
+  - Then: A valid session token is returned
+- **Edge Case (wrong password)**:
+  - Given: A user exists but password is incorrect
+  - When: Login is attempted
+  - Then: An authentication error is returned
 Files:
 - `src/auth/login.test.ts`
@@ -135,6 +180,20 @@ Steps:
 <!-- tdd: new-feature -->
 <!-- checkpoint: done -->
+Acceptance Criteria (QA Engineer Hat):
+- **Happy Path**:
+  - Given: A user with email "user@example.com" and password "secure123" exists
+  - When: A POST request with those credentials is sent to `/api/login`
+  - Then: Response returns `200 OK` with a signed JWT token
+- **Edge Case (invalid password)**:
+  - Given: A user exists but the password sent is "wrong-pass"
+  - When: A POST request is sent to `/api/login`
+  - Then: Response returns `401 Unauthorized`
+- **Edge Case (rate limiting)**:
+  - Given: 5 failed login attempts from the same IP
+  - When: A 6th attempt is sent
+  - Then: Response returns `429 Too Many Requests`
 Files:
 - `src/auth/login.ts`
 - `src/auth/login.test.ts`
@@ -159,6 +218,16 @@ Steps:
 <!-- checkpoint: test -->
 <!-- checkpoint: done -->
+Acceptance Criteria (QA Engineer Hat):
+- **Happy Path**:
+  - Given: A valid OAuth2 authorization code
+  - When: The auth callback is invoked
+  - Then: A user session is created and the user is redirected to the dashboard
+- **Edge Case (expired code)**:
+  - Given: An expired or invalid authorization code
+  - When: The auth callback is invoked
+  - Then: The user is redirected to login with an error message
 Steps:
 1. Write failing test for auth flow
 2. Run test — confirm it fails
@@ -221,3 +290,54 @@ Use judgment when assigning checkpoints. Prefer `checkpoint: test` for new featu
 ## After the plan
 Ask: "Ready to execute? Run `/skill:executing-tasks`"
+## Behavioral Guidelines
+Guidelines to reduce overcomplication and hidden assumptions in plans. Derived from [Andrej Karpathy's observations](https://x.com/karpathy/status/2015883857489522876) on LLM coding pitfalls, adapted for the planning context.
+**Tradeoff:** These guidelines bias toward caution over speed. For trivial plans (1-2 tasks), use judgment.
+### Surface Assumptions
+**When the design is ambiguous, annotate — don't silently pick.**
+When writing a plan, you'll encounter gaps: the design says "paginated" but doesn't specify how, says "validate input" but doesn't say which fields, or leaves the data layer unspecified. Your instinct will be to fill the gap and keep writing. Resist that.
+Instead, add a brief `> **Assumption:** ...` note in the plan at the point where you made the call:
+```
+> **Assumption:** Using offset/limit pagination because the design just says
+> "paginated". Cursor-based would be better for large datasets.
+```
+```
+> **Assumption:** No service layer — handler calls store directly. Add one
+> if cross-cutting concerns (logging, auth checks) emerge later.
+```
+This lets the reviewer see what you chose and why, without blocking progress. Common gaps worth annotating:
+- Pagination style, error handling strategy, concurrency model
+- Whether to add a service/middleware layer
+- Whether to add external dependencies
+- Naming conventions when the design doesn't specify
+### Build Only What Each Task Needs
+**Minimum code to deliver the task's observable behavior. Nothing more.**
+- No interface methods that no task exercises yet. If Task 2 creates a `Store` interface, it should have only the methods Task 2 calls. Add methods in the task that first needs them.
+- No layers (service, middleware, repository) unless the design explicitly requires them.
+- No error types, helper files, or shared packages until a task actually uses them.
+- No external dependencies when stdlib suffices. Every `go get` or `npm install` is a choice — default to no.
+- No "flexible" or "configurable" code that wasn't requested.
+If you find yourself writing a store with 4 methods where only 1 is used in this task, stop. Write 1 method. Add the rest when the tasks that need them arrive.
+### One Task, One Change
+**Each task should trace to exactly one user-facing behavior.**
+- If a task creates more than 4 new files, it's probably doing too much — split it.
+- If a task modifies existing files unrelated to its acceptance criteria, trim the scope.
+- Infrastructure (types, interfaces, module scaffolding) should live in the same task as the first code that uses it, not in a separate "setup" task — unless the infrastructure alone is complex enough to warrant its own task.
+- Every file listed in a task's `Files:` section should be directly necessary for that task's acceptance criteria to pass.
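
As a rough illustration of how the `Given/When/Then` acceptance criteria and the "Build Only What Each Task Needs" guideline in this diff might combine, the TypeScript sketch below implements the "duplicate email" edge case from the User model example. It is not part of the package: `InMemoryUserStore`, `CreateResult`, and the `"unique_constraint"` error code are hypothetical names, and the store has a single `create` method on the assumption that this is the only method the task exercises.

```typescript
type User = { id: string; name: string; email: string };

// Hypothetical result type: success carries the user, failure carries a code.
type CreateResult =
  | { ok: true; user: User }
  | { ok: false; error: "unique_constraint" };

// Only the one method this task needs; later tasks would add their own.
class InMemoryUserStore {
  private byEmail = new Map<string, User>();
  private nextId = 1;

  create(name: string, email: string): CreateResult {
    if (this.byEmail.has(email)) {
      return { ok: false, error: "unique_constraint" };
    }
    const user: User = { id: String(this.nextId++), name, email };
    this.byEmail.set(email, user);
    return { ok: true, user };
  }
}

// Edge Case (duplicate email), written as Given/When/Then.
function duplicateEmailIsRejected(): boolean {
  // Given: a user with email "test@example.com" already exists
  const store = new InMemoryUserStore();
  store.create("First", "test@example.com");

  // When: another user is created with the same email
  const result = store.create("Second", "test@example.com");

  // Then: creation fails with a unique constraint error
  return !result.ok && result.error === "unique_constraint";
}
```

In a real plan the same structure would normally live in the task's test file and run under the project's test runner; the plain function here just keeps the sketch self-contained.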