@tianhai/pi-workflow-kit 0.15.0 → 0.16.0

package/README.md CHANGED
@@ -33,20 +33,21 @@ Enforces phase-appropriate tool access — not just guidelines, but hard blocks:
 
  The agent can read code and discuss design with you during brainstorm/plan, but it physically cannot modify source files or run mutating commands.
 
- ### 🧠 5 Workflow Skills
+ ### 🧠 6 Workflow Skills
 
  Guide the agent through a disciplined development process:
 
  ```
- brainstorm → plan → execute → finalize
-
- diagnose (anytime)
+ brainstorm → design-review → plan → execute → finalize
+
+ diagnose (anytime)
  ```
 
  | Phase | Trigger | What Happens |
  |-------|---------|--------------|
  | **Brainstorm** | `/skill:brainstorming` | Explore approaches, debate tradeoffs, produce a design doc |
- | **Plan** | `/skill:writing-plans` | Break design into bite-sized TDD tasks with file paths and acceptance criteria |
+ | **Design Review** | `/skill:design-review` | Audit design for production risks (security, scalability, fault tolerance) |
+ | **Plan** | `/skill:writing-plans` | Break design into bite-sized TDD tasks with acceptance criteria and concrete code |
  | **Execute** | `/skill:executing-tasks` | Implement tasks one-by-one with TDD discipline and pre-commit checkpoint review gates |
  | **Finalize** | `/skill:finalizing` | Archive plan docs, update README/CHANGELOG, create PR |
  | **Diagnose** | `/skill:diagnose` | 6-phase debugging loop: reproduce → hypothesize → instrument → fix → verify |
@@ -59,6 +60,7 @@ You control each phase — the agent never advances on its own. Invoke a skill t
 
  ```
  /skill:brainstorming → discuss and design
+ /skill:design-review → audit for production risks (non-trivial designs)
  /skill:writing-plans → break into tasks
  /skill:executing-tasks → implement with TDD
  /skill:finalizing → ship it
@@ -116,15 +118,20 @@ pi install npm:@tianhai/pi-workflow-kit
  # (agent explores approaches, writes design doc)
  # (write/edit are blocked — your code is safe)
 
+ > /skill:design-review
+
+ # (agent audits for security, scalability, fault tolerance)
+ # (trivial changes can skip this step)
+
  > /skill:writing-plans
 
- # (agent breaks design into TDD tasks)
+ # (agent breaks design into TDD tasks with acceptance criteria)
  > /skill:executing-tasks
 
- # (agent implements with TDD, all tools unlocked)
+ # (agent implements with TDD, cognitive persona shifts, all tools unlocked)
  > /skill:finalizing
 
- # (agent archives docs, updates changelog, creates PR)
+ # (agent archives docs, curates lessons, creates PR)
  ```
 
  ## Why?
@@ -142,6 +149,7 @@ pi-workflow-kit/
  │ └── workflow-guard.ts # Write blocker during brainstorm/plan
  ├── skills/
  │ ├── brainstorming/SKILL.md
+ │ ├── design-review/SKILL.md
  │ ├── writing-plans/SKILL.md
  │ ├── executing-tasks/SKILL.md
  │ ├── finalizing/SKILL.md
@@ -0,0 +1,77 @@
1
+ # Design: Agentic Agile & Architectural Rigor Enhancements
2
+
3
+ Enforcing rigorous Agile engineering discipline within `pi-workflow-kit` by introducing Behavioral Acceptance Criteria, Cognitive Persona Shifts, automated Lessons Curation, strict Multi-Pillar Architectural Reviews, and High-Risk Operation Safeguards.
4
+
5
+ ## Context & Objectives
6
+ Based on industry standards and modern agentic development templates (such as Microsoft's Agentic Agile model), autonomous coding agents succeed most when operating under tight behavioral boundaries, specialized cognitive roles, and continuous retro/learning loops.
7
+
8
+ We are enhancing `pi-workflow-kit` by mapping out distinct engineering "Hats" and rigorous check-gates directly into our existing phase-based skills without adding repository clutter or introducing flaky external file lookups:
9
+ 1. **The QA Engineer Hat** (in `writing-plans`): Defines rigid, testable `Given/When/Then` Acceptance Criteria for both happy and edge paths during planning.
10
+ 2. **The Pragmatic Developer & Senior Refactorer Hats** (in `executing-tasks`): Guides the execution loop through clear cognitive phases (Green Light → Polish / Software Craftsmanship).
11
+ 3. **The Agile Scrum Master Hat** (in `finalizing`): Cleans up, de-duplicates, and categorizes persistent lessons to prevent context-bloat and maximize the utility of future sprints.
12
+ 4. **Architectural Review & Audit Gates**: Formally audits both the design (brainstorming) and the plan (writing-plans) against the 6 core pillars of production-grade software (Robustness, Atomicity, Security, Scalability, Compatibility, and Testability) before allowing the agent to move forward.
13
+ 5. **High-Risk Operation Safeguards**: Auto-detects critical execution hazards (unbounded Redis scans, in-memory OOM loops, unthrottled concurrency, long-running transactions, etc.) and mandates strict mitigation steps and verification checkpoints.
14
+
15
+ ---
16
+
17
+ ## Architecture & Detailed Design
18
+
19
+ Because agent workspaces resolve tool execution and file reads relative to the user's project directory, files bundled in globally installed npm modules are not reliably reachable. All guidelines are therefore **inlined directly within the respective `SKILL.md` prompts**. This removes the dependency on external file lookups, adds no files to the user's repository, and requires no extra file reads at runtime.
20
+
21
+ ### Slice 1: Multi-Pillar Design Review & Risk Detection (`brainstorming`)
22
+ Before concluding a brainstorm and generating a design doc, the agent must put on its **Architect Hat** and evaluate the proposed system against the **6 Pillars of Production-Grade Design**:
23
+ 1. **Robustness & Fault Tolerance**: How expected failures are handled, subsystem isolation, and graceful degradation.
24
+ 2. **Atomicity & Consistency**: Database transactions, state rollback on error, and endpoint idempotency.
25
+ 3. **Security & Access Control**: Input validation/sanitization and authorization checks at the boundary.
26
+ 4. **Scalability & Performance**: Connection pooling, closing resource leaks, and preventing N+1 queries.
27
+ 5. **Backwards Compatibility**: Schema migration safety, zero-downtime deployment, and API versioning.
28
+ 6. **Testability**: Injection seams for external dependencies (APIs, system clocks, randomizers) to keep tests 100% deterministic.
29
+
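The Testability pillar above mentions injection seams for system clocks. A minimal sketch of such a seam follows, assuming a token-expiry check as the example; `Clock`, `fixedClock`, and `isExpired` are hypothetical names chosen for illustration and do not come from the kit.

```typescript
// Hypothetical names throughout: Clock, fixedClock, and isExpired are illustrative only.

/** Seam for time: production passes the system clock, tests pass a fixed one. */
interface Clock {
  now(): number; // milliseconds since the Unix epoch
}

const systemClock: Clock = { now: () => Date.now() };

/** Returns a clock frozen at the given instant, for deterministic tests. */
function fixedClock(ms: number): Clock {
  return { now: () => ms };
}

/** True once the expiry instant has been reached, according to the injected clock. */
function isExpired(expiresAtMs: number, clock: Clock = systemClock): boolean {
  return clock.now() >= expiresAtMs;
}

// Deterministic checks: the result depends only on the injected clock.
const frozen = fixedClock(1000);
console.log(isExpired(999, frozen));  // true
console.log(isExpired(1001, frozen)); // false
```

Because the clock is a parameter with a production default, callers do not change, while tests never depend on wall-clock time.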
30
+ #### ⚠️ High-Risk Hazard Auditing
31
+ The agent must proactively audit the design for the **8 High-Risk Production Hazards**:
32
+ 1. **Unbounded Redis Deletions / Operations**: Multi-key deletion or scans (e.g. `KEYS` or raw `SCAN` loops) that block single-threaded performance.
33
+ 2. **In-Memory OOM Loops**: Fetching complete database datasets into server memory (e.g., raw `select *`) to filter, sort, or map in runtime heap.
34
+ 3. **Unbounded Concurrency Spikes**: Running concurrent network requests (e.g. unthrottled `Promise.all`) without strict batch limits (e.g., `p-limit`).
35
+ 4. **Missing High-Frequency Indexes**: Running queries on unindexed columns, forcing expensive table-scans under load.
36
+ 5. **Nested/Long-Running Transactions**: Holding database connections and locks open while awaiting slow external HTTP, disk, or cryptographic tasks.
37
+ 6. **Unrestricted Uploads & Temp Flooding**: Writing uploaded data directly to local temporary paths without validation limits or explicit `finally` cleanup blocks.
38
+ 7. **Raw Query String Interpolation**: Merging raw variables into SQL queries or shell command inputs (susceptible to injection).
39
+ 8. **Silent Swallowing Loops**: Background workers or cron tasks silently catching and suppressing exceptions without logging, back-offs, or alerts.
40
+
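To make hazard 3 (Unbounded Concurrency Spikes) concrete, here is a rough sketch of a bounded alternative to an unthrottled `Promise.all`. It is a dependency-free stand-in written for this note, not the `p-limit` API and not code from the kit; `mapWithLimit` is a hypothetical name.

```typescript
// Hypothetical helper: mapWithLimit is illustrative, not part of pi-workflow-kit or p-limit.

/**
 * Runs `worker` over `items` with at most `limit` calls in flight at once.
 * Results are returned in input order.
 */
async function mapWithLimit<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results: R[] = new Array<R>(items.length);
  let next = 0; // index of the next unclaimed item

  // Each runner claims one item at a time until none are left.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runners: Promise<void>[] = [];
  for (let i = 0; i < Math.min(limit, items.length); i++) {
    runners.push(runner());
  }
  await Promise.all(runners);
  return results;
}

// Usage: track the peak number of simultaneous calls to show the cap holds.
let inFlight = 0;
let peak = 0;
const doubled = mapWithLimit([1, 2, 3, 4, 5], 2, async (n) => {
  inFlight++;
  peak = Math.max(peak, inFlight);
  await Promise.resolve(); // stand-in for a network call
  inFlight--;
  return n * 2;
});
doubled.then((values) => console.log(values, "peak in flight:", peak));
```

Only `limit` runners exist, so no more than `limit` worker calls can be pending at any moment. If a worker rejects, the returned promise rejects, but runners that are already started keep draining their items; a production version would likely want explicit cancellation.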
41
+ #### 🔍 Discovering Unknown & Contextual Risks (Socratic Heuristics)
42
+ To identify novel or domain-specific risks that fall outside the standard checklist, the agent must put on its **SRE Hat** and audit the proposed logic against the **3 Socratic Heuristics**:
43
+ * **The "Scale to 100x" Heuristic (Resource Exhaustion)**: If this operation is run 100x/sec or on 100k items, what breaks? (Memory, CPU, Disk I/O, sockets, database connection limits).
44
+ * **The "Hostile World" Heuristic (Security & Malice)**: If a malicious actor has complete control over these inputs (headers, payloads, IDs), how can they exploit, crash, or extract data?
45
+ * **The "Silent Error" Heuristic (Observability & Partitioning)**: If this downstream dependency or query hangs or fails silently, how does our server react? Is there a timeout, a back-off, or logging?
46
+
47
+ If any of the standard hazards or Socratic risks are identified, the design document **must** include a dedicated `⚠️ High-Risk Operations & Mitigations` section detailing the exact safety protocols applied.
48
+
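One mitigation the "Silent Error" heuristic tends to surface is an explicit timeout around a downstream call. The sketch below is one possible shape under that assumption; `withTimeout` is a hypothetical helper, not an API of the kit or of any library.

```typescript
// Hypothetical helper: withTimeout is an illustrative name, not an API of any library.

/** Rejects with a "TimeoutError" if `work` has not settled within `ms` milliseconds. */
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      const error = new Error(`operation timed out after ${ms} ms`);
      error.name = "TimeoutError";
      reject(error);
    }, ms);
    // Whichever way the work settles, clear the timer so it cannot fire later.
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      },
    );
  });
}

// Usage: a promise that never settles simulates a hung dependency.
const neverSettles = new Promise<string>(() => {});
const outcome = withTimeout(neverSettles, 10).then(
  () => "resolved",
  (error: Error) => error.name,
);
outcome.then((name) => console.log(name));
```

The wrapper only stops waiting; it does not cancel the underlying work. Pairing it with logging and a back-off on retry would cover the other two questions the heuristic asks.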
49
+ ### Slice 2: Behavioral Acceptance Criteria & Plan Audit (`writing-plans`)
50
+ The planning process is enhanced to mandate behavior-driven specifications and an automated plan verification step.
51
+
52
+ - **Role**: QA Engineer Hat.
53
+ - **Specification Format**: Mandatory `Given/When/Then` blocks covering the Happy Path and Edge/Error Paths.
54
+ - **Plan Acceptance Audit**: Before presenting the plan to the user, the agent must verify:
55
+ - Every task is a complete vertical slice.
56
+ - Sizing is correct (no monolithic tasks).
57
+ - Checkpoint gates are placed on the most critical/risky tasks.
58
+ - **Risk Enforcement**: Any task containing any of the **8 High-Risk Hazards** or **Socratic Heuristics risks** is strictly required to have a mandatory `checkpoint: done` gate and explicit verification guidelines.
59
+
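The Risk Enforcement rule above is applied by the agent through its prompt, but the check itself is mechanical enough to sketch. The following is a hypothetical illustration, assuming tasks are delimited by `## Task N:` headings and that the caller already knows which task numbers touch a hazard; `auditRiskGates` and `splitTasks` are invented names.

```typescript
// Hypothetical sketch: auditRiskGates and its inputs are illustrative, not the kit's actual audit.

interface GateViolation {
  task: number;
  missing: string[];
}

/** Splits a plan into task bodies keyed by task number, using "## Task N:" headings. */
function splitTasks(plan: string): Map<number, string> {
  const tasks = new Map<number, string>();
  let current: number | null = null;
  for (const line of plan.split("\n")) {
    const heading = /^## Task (\d+):/.exec(line);
    if (heading) {
      current = Number(heading[1]);
      tasks.set(current, "");
    } else if (current !== null) {
      tasks.set(current, (tasks.get(current) || "") + line + "\n");
    }
  }
  return tasks;
}

/** Reports hazard-flagged tasks lacking the done gate or the verification section. */
function auditRiskGates(plan: string, hazardousTasks: number[]): GateViolation[] {
  const tasks = splitTasks(plan);
  const violations: GateViolation[] = [];
  for (const task of hazardousTasks) {
    const body = tasks.get(task);
    const missing: string[] = [];
    if (body === undefined) {
      missing.push("task not found in plan");
    } else {
      if (!body.includes("<!-- checkpoint: done -->")) missing.push("checkpoint: done");
      if (!body.includes("Hazard Mitigation Verification")) {
        missing.push("Hazard Mitigation Verification");
      }
    }
    if (missing.length > 0) violations.push({ task, missing });
  }
  return violations;
}

// Usage: task 1 touches a hazard but has neither the gate nor the verification section.
const plan = [
  "## Task 1: Purge stale cache keys",
  "<!-- tdd: new-feature -->",
  "Steps:",
  "1. Delete keys in bounded batches",
  "",
  "## Task 2: Add health endpoint",
  "<!-- tdd: new-feature -->",
].join("\n");

console.log(auditRiskGates(plan, [1])); // task 1 is reported with both items missing
```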
60
+ ### Slice 3: Cognitive Persona Shifts (`executing-tasks`)
61
+ The implementation execution loop is updated to divide the cognitive workload of a single task into three distinct phases.
62
+
63
+ - **Phase 1: QA Test Phase**: Translate the Given/When/Then specs into failing test cases.
64
+ - **Phase 2: Pragmatic Developer Phase**: Implement the simplest code that turns the tests green.
65
+ - **Phase 3: Senior Refactoring Phase**: Refactor and polish using software craftsmanship principles (Shallow Modules, Deletion Test, Duplication, Seam Discipline).
66
+
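As an illustration of Phase 1, the sketch below maps a `Given/When/Then` spec (a user model with a duplicate-email edge case) onto assertions line by line. It uses a plain `check` helper instead of Vitest so that it runs without dependencies, and it includes a tiny in-memory `UserStore` so the checks pass; in the actual workflow the tests would be written first and fail until Phase 2. All names here are hypothetical.

```typescript
// Hypothetical sketch: UserStore and check are illustrative stand-ins, not code from the kit.

interface User {
  id: number;
  name: string;
  email: string;
}

class UserStore {
  private users: User[] = [];
  private nextId = 1;

  /** Creates a user, rejecting duplicate emails the way a unique constraint would. */
  create(name: string, email: string): User {
    if (this.users.some((u) => u.email === email)) {
      throw new Error("unique constraint violated: email");
    }
    const user: User = { id: this.nextId++, name, email };
    this.users.push(user);
    return user;
  }
}

/** Minimal assertion helper so the sketch needs no test framework. */
function check(condition: boolean, message: string): void {
  if (!condition) throw new Error(`FAILED: ${message}`);
}

// Happy Path
// Given: valid user data with name and email
const store = new UserStore();
// When: the user is created
const created = store.create("Ada", "test@example.com");
// Then: the model contains the correct fields and a generated ID
check(created.name === "Ada" && created.email === "test@example.com", "fields are stored");
check(created.id === 1, "an ID is generated");

// Edge Case (duplicate email)
// Given: a user with email "test@example.com" already exists (created above)
// When: another user is created with the same email
let duplicateError = "";
try {
  store.create("Grace", "test@example.com");
} catch (error) {
  duplicateError = error instanceof Error ? error.message : String(error);
}
// Then: creation fails with a unique constraint error
check(duplicateError.includes("unique constraint"), "duplicate email is rejected");
```

Keeping one comment per `Given`, `When`, and `Then` line makes it easy to see during review that every clause of the spec has a matching assertion.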
67
+ ### Slice 4: Lessons Curation & Caching (`finalizing`)
68
+ The finalizing phase is upgraded to run a structured retrospective on our persistent learning files.
69
+
70
+ - **Role**: Agile Scrum Master Hat.
71
+ - **Curating Rules**: De-duplicate, validate against the Generalization Test, and categorize rules under distinct headers (e.g., `## Tool Usage`, `## Testing Patterns`, `## Architecture Rules`).
72
+
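The curation itself is performed by the agent, but the de-duplicate and categorize steps can be sketched mechanically. The version below is a rough approximation under loud assumptions: the keyword routing in `categorize` is invented for illustration, the Generalization Test is not modeled, and `curateLessons` is a hypothetical name. It keeps an empty `## Rules` section as the append target for later execution phases.

```typescript
// Hypothetical sketch: curateLessons and the category keywords are illustrative assumptions.

type Category = "Tool Usage" | "Testing Patterns" | "Architecture Rules";

/** Naive keyword routing; a real curation pass would rely on the agent's judgment. */
function categorize(rule: string): Category {
  const text = rule.toLowerCase();
  if (/\b(test|mock|stub|assert)/.test(text)) return "Testing Patterns";
  if (/\b(cli|command|tool|npm|git)\b/.test(text)) return "Tool Usage";
  return "Architecture Rules";
}

/** De-duplicates rules (ignoring case and surrounding whitespace) and groups them by category. */
function curateLessons(rules: string[]): string {
  const seen = new Set<string>();
  const groups: Record<Category, string[]> = {
    "Tool Usage": [],
    "Testing Patterns": [],
    "Architecture Rules": [],
  };
  for (const rule of rules) {
    const trimmed = rule.trim();
    const key = trimmed.toLowerCase();
    if (trimmed === "" || seen.has(key)) continue;
    seen.add(key);
    groups[categorize(trimmed)].push(trimmed);
  }
  const order: Category[] = ["Tool Usage", "Testing Patterns", "Architecture Rules"];
  const sections = order
    .filter((name) => groups[name].length > 0)
    .map((name) => `## ${name}\n` + groups[name].map((r) => `- ${r}`).join("\n"));
  // Keep an empty "## Rules" section as the append target for the next execution phase.
  return sections.concat(["## Rules"]).join("\n\n") + "\n";
}

// Usage: the second rule duplicates the first apart from case and trailing whitespace.
const curated = curateLessons([
  "Run tests after each refactor step",
  "run tests after each refactor step ",
  "Use git status before committing",
  "Do not add abstraction until two adapters exist",
]);
console.log(curated);
```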
73
+ ---
74
+
75
+ ## Verification & Testing Plan
76
+ - **Manual Verification**: Run a mock `/skill:writing-plans` and `/skill:executing-tasks` session to verify that the generated implementation plan matches our QA template and that the task-running agent correctly segments its progress through the three cognitive hats.
77
+ - **Automated Tests**: Confirm existing Vitest suites run successfully without side effects.
@@ -0,0 +1,473 @@
1
+ # Implementation Plan: Agentic Agile & Architectural Rigor
2
+
3
+ Updates 4 skill files to introduce behavioral acceptance criteria, SRE hazard checks, cognitive persona shifts, architectural design reviews, and automated lessons curation.
4
+
5
+ ---
6
+
7
+ ## Task 1: Update `skills/brainstorming/SKILL.md` — 6 Pillars, 8 Hazards, 3 Socratic Heuristics
8
+
9
+ <!-- tdd: modifying-tested-code -->
10
+
11
+ Files:
12
+ - `skills/brainstorming/SKILL.md`
13
+
14
+ Acceptance Criteria (QA Engineer Hat):
15
+ - **Happy Path**:
16
+ - Given: A user runs `/skill:brainstorming` and a non-trivial design is proposed
17
+ - When: The agent presents the design for approval
18
+ - Then: The design includes a dedicated `🏛️ Architectural Pillars Review` section covering all 6 pillars (Robustness, Atomicity, Security, Scalability, Compatibility, Testability)
19
+ - **Edge Path (Trivial Feature)**:
20
+ - Given: A user runs `/skill:brainstorming` for a trivial change (e.g., renaming a column)
21
+ - When: The agent reaches the architectural review step
22
+ - Then: The agent writes a brief statement like "Simple change — no architectural review needed" and skips the full audit
23
+ - **Edge Path (Hazard Detection)**:
24
+ - Given: The proposed design involves Redis key deletion
25
+ - When: The agent audits against the 8 High-Risk Hazards
26
+ - Then: The design flags it as `[TRIGGERED]` under hazard #1 and includes a mitigation in a `⚠️ High-Risk Operations & Mitigations` section
27
+ - **Edge Path (Socratic Discovery)**:
28
+ - Given: The proposed design has a novel batch-processing loop not covered by the 8 hazards
29
+ - When: The agent applies the 3 Socratic Heuristics
30
+ - Then: The design flags the discovered risk and proposes mitigation
31
+
32
+ Steps:
33
+ 1. Read `skills/brainstorming/SKILL.md` in full
34
+ 2. In step 4 ("Present the design"), add a new mandatory sub-step before writing the design doc: **Architectural Review & Risk Detection**. Insert the following inline guidelines:
35
+
36
+ ```markdown
37
+ #### 🏛️ Architectural Pillars Review
38
+
39
+ For non-trivial designs, evaluate the proposed design against the **6 Pillars of Production-Grade Design**. Include a dedicated section in the design doc addressing each:
40
+
41
+ 1. **Robustness & Fault Tolerance**: How expected failures are handled, subsystem isolation, graceful degradation.
42
+ 2. **Atomicity & Consistency**: Database transactions, state rollback on error, endpoint idempotency.
43
+ 3. **Security & Access Control**: Input validation/sanitization, authorization checks at the boundary.
44
+ 4. **Scalability & Performance**: Connection pooling, closing resource leaks, preventing N+1 queries.
45
+ 5. **Backwards Compatibility**: Schema migration safety, zero-downtime deployment, API versioning.
46
+ 6. **Testability**: Injection seams for external dependencies (APIs, system clocks, randomizers) to keep tests 100% deterministic.
47
+
48
+ For trivial changes (config, naming, simple field additions), a brief statement like "Simple change — no architectural review needed" suffices.
49
+
50
+ #### ⚠️ High-Risk Hazard Audit
51
+
52
+ For non-trivial designs, you MUST evaluate the design against the **8 High-Risk Production Hazards**. For each hazard, write either `[SAFE]` (with a 1-sentence justification of why it doesn't apply) or `[TRIGGERED]` (detailing the mitigation):
53
+
54
+ - **1. Unbounded Redis Deletions / Operations**: Multi-key deletion or scans (e.g. `KEYS` or raw `SCAN` loops) that block single-threaded performance.
55
+ - **2. In-Memory OOM Loops**: Fetching complete database datasets into server memory (e.g., raw `select *`) to filter, sort, or map in runtime heap.
56
+ - **3. Unbounded Concurrency Spikes**: Running concurrent network requests (e.g. unthrottled `Promise.all`) without strict batch limits.
57
+ - **4. Missing High-Frequency Indexes**: Running queries on unindexed columns, forcing expensive table-scans under load.
58
+ - **5. Nested/Long-Running Transactions**: Holding database connections and locks open while awaiting slow external HTTP, disk, or cryptographic tasks.
59
+ - **6. Unrestricted Uploads & Temp Flooding**: Writing uploaded data directly to local temporary paths without validation limits or explicit `finally` cleanup blocks.
60
+ - **7. Raw Query String Interpolation**: Merging raw variables into SQL queries or shell command inputs (susceptible to injection).
61
+ - **8. Silent Swallowing Loops**: Background workers or cron tasks silently catching and suppressing exceptions without logging, back-offs, or alerts.
62
+
63
+ For trivial changes, skip this audit.
64
+
65
+ #### 🔍 Socratic Risk Discovery
66
+
67
+ For non-trivial designs, put on your **SRE Hat** and audit the proposed logic against the **3 Socratic Heuristics** to identify novel or domain-specific risks:
68
+
69
+ - **The "Scale to 100x" Heuristic**: If this operation is run 100x/sec or on 100k items, what breaks? (Memory, CPU, Disk I/O, sockets, database connection limits).
70
+ - **The "Hostile World" Heuristic**: If a malicious actor has complete control over these inputs (headers, payloads, IDs), how can they exploit, crash, or extract data?
71
+ - **The "Silent Error" Heuristic**: If this downstream dependency or query hangs or fails silently, how does our server react? Is there a timeout, a back-off, or logging?
72
+
73
+ For trivial changes, skip this audit.
74
+
75
+ If any hazard is `[TRIGGERED]` or any Socratic risk is identified, the design document **must** include a dedicated `⚠️ High-Risk Operations & Mitigations` section detailing the exact safety protocols applied.
76
+ ```
77
+
78
+ 3. Verify the file reads cleanly — the new sections should slot naturally between the existing ADR guidance and step 5 ("Write the design doc").
79
+
80
+ ---
81
+
82
+ ## Task 2: Update `skills/writing-plans/SKILL.md` — QA Hat, Given/When/Then, Plan Acceptance Audit
83
+
84
+ <!-- tdd: modifying-tested-code -->
85
+
86
+ Files:
87
+ - `skills/writing-plans/SKILL.md`
88
+
89
+ Acceptance Criteria (QA Engineer Hat):
90
+ - **Happy Path**:
91
+ - Given: A user runs `/skill:writing-plans`
92
+ - When: The implementation plan is generated
93
+ - Then: Every task has a structured `Acceptance Criteria` block with `Given/When/Then` happy-path and edge-case behaviors
94
+ - **Edge Path (Risk Enforcement)**:
95
+ - Given: A task involves any of the 8 production hazards or Socratic risks flagged in the design
96
+ - When: The plan audit runs
97
+ - Then: That task is automatically gated with `checkpoint: done` and includes a `Hazard Mitigation Verification` section
98
+
99
+ Steps:
100
+ 1. Read `skills/writing-plans/SKILL.md` in full
101
+ 2. In the "Task format" section, add the QA Engineer Hat and Acceptance Criteria requirements. Replace:
102
+
103
+ ```markdown
104
+ Each task must include:
105
+ - Exact file paths to create/modify
106
+ - **Concrete code** — include the actual implementation, not a summary. Write out SQL schemas, type definitions, function signatures with bodies, route handler code, and test assertions. A developer should be able to copy-paste from the plan and have working code. For tasks that depend on types or utilities from earlier tasks, reference them explicitly (e.g., `import { User } from Task 2`) and include only the new code
107
+ - Exact commands with expected output (e.g., `npx vitest run src/user/model.test.ts` → shows 1 test passing)
108
+ - Each task's tests should cover the happy path and at least one edge case or error path, with concrete assertions
109
+ ```
110
+
111
+ With:
112
+
113
+ ```markdown
114
+ Each task must include:
115
+ - Exact file paths to create/modify
116
+ - **Acceptance Criteria (QA Engineer Hat)** — Put on your **QA Engineer Hat** to design exhaustive test coverage. Explicitly define:
117
+ - **Happy Path**: Expected behavior under normal operations.
118
+ - **Edge Cases & Error Paths**: What happens with empty inputs, limits exceeded, authentication failures, or error states.
119
+ Ensure every criteria block specifies the expected state and returned results using `Given/When/Then` behavioral blocks.
120
+ - **Concrete code** — include the actual implementation, not a summary. Write out SQL schemas, type definitions, function signatures with bodies, route handler code, and test assertions. A developer should be able to copy-paste from the plan and have working code. For tasks that depend on types or utilities from earlier tasks, reference them explicitly (e.g., `import { User } from Task 2`) and include only the new code
121
+ - Exact commands with expected output (e.g., `npx vitest run src/user/model.test.ts` → shows 1 test passing)
122
+ - Each task's tests should cover the happy path and at least one edge case or error path, with concrete assertions
123
+ ```
124
+
125
+ 3. In the "Task body structure" section, update each example task template to include an `Acceptance Criteria` block. Update the "No checkpoint" example to:
126
+
127
+ ```markdown
128
+ ## Task 1: Create User model
129
+
130
+ <!-- tdd: new-feature -->
131
+
132
+ Acceptance Criteria (QA Engineer Hat):
133
+ - **Happy Path**:
134
+ - Given: Valid user data with name and email
135
+ - When: The User model is created
136
+ - Then: The model contains the correct fields and a generated ID
137
+ - **Edge Case (duplicate email)**:
138
+ - Given: A user with email "test@example.com" already exists
139
+ - When: Another user is created with the same email
140
+ - Then: Creation fails with a unique constraint error
141
+
142
+ Files:
143
+ - `src/user/model.ts`
144
+ - `src/user/model.test.ts`
145
+
146
+ Steps:
147
+ 1. Write failing test for User model creation
148
+ 2. Run test — confirm it fails
149
+ 3. Implement User model
150
+ 4. Run test — confirm it passes
151
+ ```
152
+
153
+ Update the `checkpoint: test` example to include acceptance criteria:
154
+
155
+ ```markdown
156
+ ## Task 2: Write auth tests
157
+
158
+ <!-- tdd: new-feature -->
159
+ <!-- checkpoint: test -->
160
+
161
+ Acceptance Criteria (QA Engineer Hat):
162
+ - **Happy Path**:
163
+ - Given: A user with valid credentials exists
164
+ - When: Login is attempted
165
+ - Then: A valid session token is returned
166
+ - **Edge Case (wrong password)**:
167
+ - Given: A user exists but password is incorrect
168
+ - When: Login is attempted
169
+ - Then: An authentication error is returned
170
+
171
+ Files:
172
+ - `src/auth/login.test.ts`
173
+
174
+ Steps:
175
+ 1. Write failing test for login with valid credentials
176
+ 2. Run test — confirm it fails
177
+
178
+ ⏸ **CHECKPOINT: test** — present test review. Wait for human approval before implementing.
179
+
180
+ 3. Implement login handler
181
+ 4. Run test — confirm it passes
182
+ 5. Refactor — check for shallow modules, duplication, seam discipline. Run tests after changes.
183
+ 6. Lessons — caught a mistake that applies to future tasks? Add rule to `docs/lessons.md`.
184
+ ```
185
+
186
+ Update the `checkpoint: done` example to include acceptance criteria:
187
+
188
+ ```markdown
189
+ ## Task 3: Add login endpoint
190
+
191
+ <!-- tdd: new-feature -->
192
+ <!-- checkpoint: done -->
193
+
194
+ Acceptance Criteria (QA Engineer Hat):
195
+ - **Happy Path**:
196
+ - Given: A user with email "user@example.com" and password "secure123" exists
197
+ - When: A POST request with those credentials is sent to `/api/login`
198
+ - Then: Response returns `200 OK` with a signed JWT token
199
+ - **Edge Case (invalid password)**:
200
+ - Given: A user exists but the password sent is "wrong-pass"
201
+ - When: A POST request is sent to `/api/login`
202
+ - Then: Response returns `401 Unauthorized`
203
+ - **Edge Case (rate limiting)**:
204
+ - Given: 5 failed login attempts from the same IP
205
+ - When: A 6th attempt is sent
206
+ - Then: Response returns `429 Too Many Requests`
207
+
208
+ Files:
209
+ - `src/auth/login.ts`
210
+ - `src/auth/login.test.ts`
211
+
212
+ Steps:
213
+ 1. Write failing test for login with valid credentials
214
+ 2. Run test — confirm it fails
215
+ 3. Implement login handler
216
+ 4. Run test — confirm it passes
217
+ 5. Add edge case tests (invalid password, missing email)
218
+ 6. Refactor — check for shallow modules, duplication, seam discipline. Run tests after changes.
219
+ 7. Lessons — caught a mistake that applies to future tasks? Add rule to `docs/lessons.md`.
220
+
221
+ ⏸ **CHECKPOINT: done** — present implementation review. Wait for human approval before committing.
222
+ ```
223
+
224
+ Update the "Both checkpoints" example to include acceptance criteria:
225
+
226
+ ```markdown
227
+ ## Task 4: Complex auth flow
228
+
229
+ <!-- tdd: new-feature -->
230
+ <!-- checkpoint: test -->
231
+ <!-- checkpoint: done -->
232
+
233
+ Acceptance Criteria (QA Engineer Hat):
234
+ - **Happy Path**:
235
+ - Given: A valid OAuth2 authorization code
236
+ - When: The auth callback is invoked
237
+ - Then: A user session is created and the user is redirected to the dashboard
238
+ - **Edge Case (expired code)**:
239
+ - Given: An expired or invalid authorization code
240
+ - When: The auth callback is invoked
241
+ - Then: The user is redirected to login with an error message
242
+
243
+ Steps:
244
+ 1. Write failing test for auth flow
245
+ 2. Run test — confirm it fails
246
+
247
+ ⏸ **CHECKPOINT: test** — present test review. Wait for human approval before implementing.
248
+
249
+ 3. Implement auth flow
250
+ 4. Run test — confirm it passes
251
+ 5. Refactor — check for shallow modules, duplication, seam discipline. Run tests after changes.
252
+ 6. Lessons — caught a mistake that applies to future tasks? Add rule to `docs/lessons.md`.
253
+
254
+ ⏸ **CHECKPOINT: done** — present implementation review. Wait for human approval before committing.
255
+ ```
256
+
257
+ 4. In step 3 ("Present the plan"), add the **Plan Acceptance Audit** sub-step after "show the complete plan to the human":
258
+
259
+ ```markdown
260
+ Before presenting, run the **Plan Acceptance Audit**:
261
+ - **Vertical Slices**: Is every task a complete vertical slice (not horizontal)?
262
+ - **Task Sizing**: Is any single task too large or covering multiple complex behaviors? If so, split it.
263
+ - **QA Coverage**: Does every task have both a Happy Path and at least one Edge Case in its Acceptance Criteria?
264
+ - **Checkpoint Alignment**: Are `checkpoint: test` and `checkpoint: done` gates placed on the most critical or risky tasks?
265
+ - **Risk Enforcement**: If the design doc flagged any hazards as `[TRIGGERED]`, verify the corresponding tasks have `checkpoint: done` and a `Hazard Mitigation Verification` section.
266
+
267
+ If any check fails, fix the plan before presenting.
268
+ ```
269
+
270
+ 5. Verify the file reads cleanly.
271
+
272
+ ---
273
+
274
+ ## Task 3: Update `skills/executing-tasks/SKILL.md` — Cognitive Persona Shifts & Defensive Sandboxing
275
+
276
+ <!-- tdd: modifying-tested-code -->
277
+
278
+ Files:
279
+ - `skills/executing-tasks/SKILL.md`
280
+
281
+ Acceptance Criteria (QA Engineer Hat):
282
+ - **Happy Path**:
283
+ - Given: An implementation plan with tasks containing Given/When/Then acceptance criteria and numbered steps
284
+ - When: `/skill:executing-tasks` runs through a task
285
+ - Then: The agent follows the plan's numbered steps while applying three cognitive frames:
286
+ 1. **QA Test frame** (when writing/running tests): Focus on translating Given/When/Then specs, verify sandboxed environment
287
+ 2. **Pragmatic Developer frame** (when implementing): Focus on simplest code to green tests
288
+ 3. **Senior Refactoring frame** (when refactoring): Evaluate craftsmanship (shallow modules, deletion test, duplication, seam discipline)
289
+ - **Edge Path (Sandbox Verification)**:
290
+ - Given: A test file that would connect to a real database
291
+ - When: The agent is in the QA Test frame
292
+ - Then: The agent verifies the test uses mocks/stubs and no live connections before running
293
+
294
+ Steps:
295
+ 1. Read `skills/executing-tasks/SKILL.md` in full
296
+ 2. In the "Per-task execution" section, replace step 3 with meta-framed persona shifts that preserve the plan-step-following behavior. Replace:
297
+
298
+ ```markdown
299
+ 3. **Execute the plan steps** — follow each numbered step in the task body, in order. Stop at any `⏸ CHECKPOINT` gate (see [Checkpoint gates](#checkpoint-gates--when-the-plan-says-stop)).
300
+ 4. **Verify against task description** — re-read the task from the plan. Does the implementation satisfy every requirement listed? If not, fix before proceeding.
301
+ 5. **Refactor** — after all tests pass, look for:
302
+ - **Shallow modules** — is the interface nearly as complex as the implementation? Can complexity be hidden behind a simpler interface?
303
+ - **Deletion test** — if you deleted this module, would complexity vanish (pass-through) or reappear across callers (earning its keep)?
304
+ - **Duplication** — extract repeated patterns
305
+ - **Seam discipline** — don't introduce abstraction unless something actually varies across it. One adapter = hypothetical seam. Two adapters = real seam
306
+
307
+ Run tests after each refactor step. Never refactor while tests are failing.
308
+ ```
309
+
310
+ With:
311
+
312
+ ```markdown
313
+ 3. **Execute the plan steps** — follow each numbered step in the task body, in order. As you work, shift your cognitive focus through three frames:
314
+
315
+ **QA Test frame** (when writing/running tests): Focus entirely on translating the task's `Given/When/Then` Acceptance Criteria into precise failing tests. Before running tests, verify the test environment is sandboxed — no real database connections, API calls, or live services. External dependencies must be mocked or stubbed. `NODE_ENV` must be `test` (or equivalent).
316
+
317
+ **Pragmatic Developer frame** (when implementing): Focus on the simplest possible code to make the tests green. Do not over-engineer or add code for future requirements. Keep complexity to a bare minimum.
318
+
319
+ **Senior Refactoring frame** (when refactoring): Evaluate the craftsmanship of the code. Check for:
320
+ - **Shallow modules** — is the interface nearly as complex as the implementation? Can complexity be hidden behind a simpler interface?
321
+ - **Deletion test** — if you deleted this module, would complexity vanish (pass-through) or reappear across callers (earning its keep)?
322
+ - **Duplication** — extract repeated patterns
323
+ - **Seam discipline** — don't introduce abstraction unless something actually varies across it. One adapter = hypothetical seam. Two adapters = real seam
324
+
325
+ Run tests after each refactor step. Never refactor while tests are failing.
326
+
327
+ Stop at any `⏸ CHECKPOINT` gate (see [Checkpoint gates](#checkpoint-gates--when-the-plan-says-stop)).
328
+ 4. **Verify against task description** — re-read the task from the plan. Does the implementation satisfy every requirement listed? If not, fix before proceeding.
329
+ ```
330
+
331
+ Note: The old step 5 (Refactor) is folded into step 3's "Senior Refactoring frame" so step 4 remains "Verify against task description". The remaining steps (old 6→5, old 7→6, old 8→7, old 9→8, old 10→9) need to be renumbered.
332
+
333
+ 3. Renumber the remaining steps after the new step 4:
334
+ - Old step 6 ("Learn from mistakes") → new step 5
335
+ - Old step 7 ("Commit") → new step 6
336
+ - Old step 8 ("Update progress") → new step 7
337
+ - Old step 9 ("Suggest session break") → new step 8
338
+ - Old step 10 ("Loop") → new step 9
339
+
340
+ 4. Verify the file reads cleanly — the cognitive frames are meta-guidance applied while following the plan's numbered steps, not a replacement for them.
341
+
342
+ ---
343
+
344
+ ## Task 4: Update `skills/finalizing/SKILL.md` — Lessons Curation with Scrum Master Hat
345
+
346
+ <!-- tdd: modifying-tested-code -->
347
+
348
+ Files:
349
+ - `skills/finalizing/SKILL.md`
350
+
351
+ Acceptance Criteria (QA Engineer Hat):
352
+ - **Happy Path**:
353
+ - Given: A sprint is completed with some rules in `docs/lessons.md`
354
+ - When: `/skill:finalizing` is executed
355
+ - Then: The agent puts on the **Agile Scrum Master Hat** to de-duplicate, generalize, and categorize all rules under structured markdown headers
356
+ - **Edge Path (No lessons exist)**:
357
+ - Given: No `docs/lessons.md` exists and no lessons were learned
358
+ - When: `/skill:finalizing` is executed
359
+ - Then: The step is skipped gracefully (existing behavior preserved)
360
+ - **Edge Path (Lessons format after categorization)**:
361
+ - Given: `docs/lessons.md` was categorized into headers like `## Tool Usage` and `## Testing Patterns` by a previous finalizing run
362
+ - When: A new execution phase appends a rule under `## Rules`
363
+ - Then: The rule lands in the correct location (the `## Rules` section still exists for new entries, and finalizing re-categorizes later)
364
+
365
+ Steps:
366
+ 1. Read `skills/finalizing/SKILL.md` in full
367
+ 2. In step 2 ("Review lessons learned"), replace the existing instruction with the enhanced Scrum Master Hat curation:
368
+
369
+ Replace:
370
+
371
+ ```markdown
372
+ 2. **Review lessons learned** — if `docs/lessons.md` exists, review it:
373
+ - Add any lessons from this session that were missed during execution
374
+ - **Generalize domain-specific rules** — if a rule names a specific service, entity, or feature, either rewrite it as a generic pattern or remove it if no generic form exists
375
+ - Retire rules that no longer apply (remove the bullet)
376
+ - If no changes are needed, leave it as-is
377
+ ```
378
+
379
+ With:
380
+
381
+ ```markdown
382
+ 2. **Review & Polish Lessons (Agile Scrum Master Hat)** — if `docs/lessons.md` exists, put on your **Agile Scrum Master Hat** to curate and optimize it for future sprints:
383
+ - **Add missed lessons** — capture any lessons from this session that weren't written during execution
384
+ - **Generalize domain-specific rules** — if a rule names a specific service, entity, or feature, either rewrite it as a generic pattern or remove it if no generic form exists
385
+ - **De-duplicate** — combine overlapping or redundant rules into single, sharper entries
386
+ - **Categorize** — group the rules under clear, structured markdown headers (e.g., `## Tool Usage`, `## Testing Patterns`, `## Architecture Rules`) to make the document highly scannable for future sessions. Keep the `## Rules` section as the append target for new entries during execution — categorization moves rules out of `## Rules` into the appropriate category headers.
387
+ - **Retire stale rules** — remove bullets that no longer apply
388
+ - If no changes are needed, leave it as-is
389
+ ```
390
+
391
+ 3. Verify the file reads cleanly.
392
+
393
+ ---
394
+
395
+ ## Task 5: Update `docs/lessons.md` format template in `skills/executing-tasks/SKILL.md`
396
+
397
+ <!-- tdd: modifying-tested-code -->
398
+
399
+ Files:
400
+ - `skills/executing-tasks/SKILL.md`
401
+
402
+ Acceptance Criteria (QA Engineer Hat):
403
+ - **Happy Path**:
404
+ - Given: The agent catches a repeat mistake during task execution
405
+ - When: It appends a new rule to `docs/lessons.md`
406
+ - Then: The rule is appended under `## Rules` (the standard append target), regardless of whether category headers exist from a previous finalizing run
407
+ - **Edge Path (After categorization)**:
408
+ - Given: `docs/lessons.md` has been reorganized by finalizing with category headers like `## Tool Usage`
409
+ - When: The agent needs to append a new rule during execution
410
+ - Then: The agent appends to `## Rules` (which finalizing ensures always exists as the catch-all section)
411
+
412
+ Steps:
413
+ 1. Read the `docs/lessons.md` format template section in `skills/executing-tasks/SKILL.md`
414
+ 2. Add a note after the format template to clarify the append convention (the template comment itself stays unchanged):
415
+
416
+ Replace:
417
+
418
+ ````markdown
418
+ ### `docs/lessons.md` format
419
+
420
+ ```markdown
421
+ # Lessons Learned
422
+
423
+ <!--
424
+ Agent: read this at the start of each task during executing-tasks.
425
+ Follow every rule. Add new rules when you catch yourself making repeat mistakes.
426
+ Rules must be generic patterns applicable to any domain or feature — not specific to one service, entity, or use case.
427
+ Retire rules that no longer apply during finalizing.
428
+ -->
429
+
430
+ ## Rules
431
+
432
+ - <new rule here>
433
+ ```
434
+ ````
436
+
437
+ With:
438
+
439
+ ````markdown
440
+ ### `docs/lessons.md` format
441
+
442
+ ```markdown
443
+ # Lessons Learned
444
+
445
+ <!--
446
+ Agent: read this at the start of each task during executing-tasks.
447
+ Follow every rule. Add new rules when you catch yourself making repeat mistakes.
448
+ Rules must be generic patterns applicable to any domain or feature — not specific to one service, entity, or use case.
449
+ Retire rules that no longer apply during finalizing.
450
+ -->
451
+
452
+ ## Rules
453
+
454
+ - <new rule here>
455
+ ```
456
+
457
+ When adding a new rule during execution, always append it under `## Rules`. The categorization into specific headers (e.g., `## Tool Usage`, `## Testing Patterns`) is done during finalizing — never during execution.
458
+ ````
459
+
460
+ 3. Verify the file reads cleanly.
461
+
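The append convention in Task 5 can be illustrated with a minimal Python sketch. It is not part of the package: the function name `append_rule` is hypothetical, and recreating a missing `## Rules` section is an assumption made here (the plan states that finalizing keeps that section in place). The sketch adds a bullet under `## Rules` and leaves any category headers from a previous finalizing run untouched.

```python
def append_rule(lessons_md: str, rule: str) -> str:
    """Return lessons_md with `rule` appended as a bullet under `## Rules`.

    Hypothetical helper for illustration; the header must match exactly.
    """
    lines = lessons_md.splitlines()
    bullet = f"- {rule.strip()}"

    # Locate the catch-all section. Recreating it when absent is an
    # assumption of this sketch, not behavior described by the plan.
    try:
        start = lines.index("## Rules")
    except ValueError:
        if lines and lines[-1].strip():
            lines.append("")
        lines += ["## Rules", "", bullet]
        return "\n".join(lines) + "\n"

    # The section ends at the next level-2 header or at end of file.
    end = len(lines)
    for i in range(start + 1, len(lines)):
        if lines[i].startswith("## "):
            end = i
            break

    # Step back over trailing blank lines so the bullet joins the list.
    insert_at = end
    while insert_at > start + 1 and not lines[insert_at - 1].strip():
        insert_at -= 1

    new_lines = lines[:insert_at]
    if insert_at == start + 1:
        new_lines.append("")  # blank line between header and first bullet
    new_lines.append(bullet)
    if end < len(lines):
        new_lines.append("")  # blank line before the next header
        new_lines += lines[end:]
    return "\n".join(new_lines) + "\n"
```

Given a file that already contains `## Tool Usage` after `## Rules`, the new bullet lands at the end of the `## Rules` list and the `## Tool Usage` section is left as it was, which matches the "After categorization" edge path.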
462
+ ---
463
+
464
+ ## Task 6: Run tests and verify existing suite passes
465
+
466
+ <!-- tdd: trivial -->
467
+
468
+ Files:
469
+ - None (verification only)
470
+
471
+ Steps:
472
+ 1. Run `npm test` and confirm that all existing tests pass without side effects
473
+ 2. Verify that the test run neither created nor modified `docs/lessons.md`
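
One way to make the second check mechanical is to fingerprint the file before and after the test run. The following Python sketch is illustrative only: `fingerprint` and `run_and_check` are hypothetical names, not part of the package, and comparing SHA-256 digests is one choice among several (a Git status check would work as well).

```python
import hashlib
import subprocess
from pathlib import Path
from typing import Optional, Sequence


def fingerprint(path: Path) -> Optional[str]:
    """Return the SHA-256 hex digest of the file, or None if it is absent."""
    if not path.is_file():
        return None
    return hashlib.sha256(path.read_bytes()).hexdigest()


def run_and_check(command: Sequence[str], watched: Path) -> bool:
    """Run `command`; True only if it exits 0 and `watched` is unchanged.

    "Unchanged" means the file was absent both before and after,
    or present with identical content both times.
    """
    before = fingerprint(watched)
    result = subprocess.run(list(command))
    return result.returncode == 0 and fingerprint(watched) == before
```

Under these assumptions, calling `run_and_check(["npm", "test"], Path("docs/lessons.md"))` from the repository root returns `True` only when the suite passes and the lessons file was neither created nor changed.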