wogiflow 2.12.0 → 2.15.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (66)
  1. package/.claude/commands/wogi-challenge.md +62 -0
  2. package/.claude/commands/wogi-eval.md +14 -1
  3. package/.claude/commands/wogi-gate-stats.md +80 -0
  4. package/.claude/commands/wogi-start-continuation.md +12 -0
  5. package/.claude/commands/wogi-start.md +32 -902
  6. package/.claude/docs/explore-agents.md +49 -0
  7. package/.claude/docs/gate-telemetry.md +142 -0
  8. package/.claude/docs/intent-grounded-reasoning.md +140 -0
  9. package/.claude/docs/phases/01-explore.md +159 -0
  10. package/.claude/docs/phases/02-spec.md +88 -0
  11. package/.claude/docs/phases/03-implement.md +92 -0
  12. package/.claude/docs/phases/04-verify.md +495 -0
  13. package/.claude/docs/phases/05-complete.md +140 -0
  14. package/.claude/rules/_internal/README.md +64 -0
  15. package/.claude/rules/_internal/document-structure.md +77 -0
  16. package/.claude/rules/_internal/dual-repo-management.md +174 -0
  17. package/.claude/rules/_internal/feature-refactoring-cleanup.md +87 -0
  18. package/.claude/rules/_internal/github-releases.md +71 -0
  19. package/.claude/rules/_internal/model-management.md +35 -0
  20. package/.claude/rules/_internal/self-maintenance.md +87 -0
  21. package/.claude/rules/architecture/component-reuse.md +38 -0
  22. package/.claude/rules/code-style/naming-conventions.md +107 -0
  23. package/.claude/rules/operations/git-workflows.md +92 -0
  24. package/.claude/rules/operations/scratch-directory.md +54 -0
  25. package/.claude/rules/security/security-patterns.md +193 -0
  26. package/.claude/skills/figma-analyzer/knowledge/learnings.md +11 -0
  27. package/.workflow/agents/architect.md +104 -0
  28. package/.workflow/agents/logic-adversary.md +81 -0
  29. package/.workflow/specs/architecture.md.template +24 -0
  30. package/.workflow/specs/stack.md.template +33 -0
  31. package/.workflow/specs/testing.md.template +36 -0
  32. package/.workflow/templates/claude-md.hbs +2 -0
  33. package/.workflow/templates/partials/auto-features.hbs +2 -0
  34. package/.workflow/templates/partials/intent-grounded-reasoning.hbs +40 -0
  35. package/package.json +1 -1
  36. package/scripts/flow-architect-pass.js +621 -0
  37. package/scripts/flow-bridge.js +6 -0
  38. package/scripts/flow-cli-utils.js +85 -0
  39. package/scripts/flow-completion-truth-gate.js +477 -0
  40. package/scripts/flow-correction-detector.js +279 -6
  41. package/scripts/flow-done-gates.js +69 -1
  42. package/scripts/flow-gate-telemetry.js +602 -0
  43. package/scripts/flow-intent-bootstrap.js +662 -0
  44. package/scripts/flow-intent-framing.js +708 -0
  45. package/scripts/flow-logic-adversary.js +693 -0
  46. package/scripts/flow-migrate-igr.js +245 -0
  47. package/scripts/flow-runtime-verification.js +37 -0
  48. package/scripts/flow-standards-checker.js +62 -6
  49. package/scripts/flow-standards-gate.js +45 -1
  50. package/scripts/flow-state-drift-detector.js +279 -0
  51. package/scripts/flow-trap-zone.js +470 -0
  52. package/scripts/flow-worktree.js +58 -0
  53. package/scripts/hooks/adapters/claude-code.js +34 -2
  54. package/scripts/hooks/core/manager-boundary-gate.js +388 -0
  55. package/scripts/hooks/core/phase-read-gate.js +156 -0
  56. package/scripts/hooks/core/pre-compact.js +159 -0
  57. package/scripts/hooks/core/template-change-detector.js +112 -0
  58. package/scripts/hooks/entry/claude-code/post-tool-use.js +12 -0
  59. package/scripts/hooks/entry/claude-code/pre-compact.js +31 -0
  60. package/scripts/hooks/entry/claude-code/pre-tool-use.js +63 -0
  61. package/scripts/hooks/entry/claude-code/session-start.js +17 -0
  62. package/scripts/postinstall.js +7 -0
  63. package/templates/intent/domain-model.md.hbs +44 -0
  64. package/templates/intent/glossary.md.hbs +43 -0
  65. package/templates/intent/product.md.hbs +43 -0
  66. package/templates/intent/user-journeys.md.hbs +41 -0
@@ -160,6 +160,19 @@ Estimate if task fits in remaining context using `flow-context-estimator.js`:
  - Count criteria (~3% each), files (~2% each), refactor buffer (+10%)
  - If `projected_total > 95%` → compact first. If `current >= 90%` → emergency compact.
 
+ ### Step 0.3: Intent Bootstrap (when `config.intentGroundedReasoning.enabled`)
+
+ **Conditional** — runs only when the IGR master flag is on AND no intent artifacts exist yet in `.workflow/state/`.
+
+ First run per project with IGR enabled, present the Option C three-choice prompt (see `.claude/docs/intent-grounded-reasoning.md` for the exact UX):
+ - `[1]` Bootstrap now (blocks ~5-10 min)
+ - `[2]` Bootstrap in background, review at `/wogi-session-end` (default)
+ - `[3]` Skip for now (3 consecutive skips silences the prompt)
+
+ Run via `node scripts/flow-intent-bootstrap.js bootstrap [--auto-confirm]`. Scaffolds 4 artifacts (`product.md`, `domain-model.md`, `user-journeys.md`, `glossary.md`) with `reviewStatus: draft`. The trap-zone detector runs agnostic structural-ambiguity scanning.
+
+ When IGR flag is OFF: this step is SKIPPED entirely. Pipeline proceeds to Step 0.5 with no overhead.
+
  ### Step 0.5: Parallel Execution Check
 
  Check `ready.json` for 2+ tasks. If parallelizable (no dependencies), offer parallel execution with worktree isolation.
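Reviewer note: the Step 0.3 text added in this hunk describes a gate with three conditions (IGR flag on, no intent artifacts yet, prompt not silenced by repeated skips). The sketch below is only an illustration of that decision logic under stated assumptions. The function name, the input shape, and the return labels are hypothetical and are not taken from `flow-intent-bootstrap.js`.

```javascript
// Hypothetical sketch of the Step 0.3 gate. Only the artifact names and the
// "3 consecutive skips" threshold come from the diff text; everything else
// (function name, argument shape, return labels) is assumed for illustration.
const INTENT_ARTIFACTS = ["product.md", "domain-model.md", "user-journeys.md", "glossary.md"];
const MAX_CONSECUTIVE_SKIPS = 3;

function intentBootstrapAction({ igrEnabled, existingArtifacts, consecutiveSkips }) {
  // Flag OFF: the step is skipped entirely and the pipeline moves to Step 0.5.
  if (!igrEnabled) return "skip-step";
  // The step only runs when no intent artifacts exist yet.
  const hasArtifacts = INTENT_ARTIFACTS.some((name) => existingArtifacts.includes(name));
  if (hasArtifacts) return "skip-step";
  // Three consecutive skips silence the prompt.
  if (consecutiveSkips >= MAX_CONSECUTIVE_SKIPS) return "silenced";
  // Otherwise present the three-choice prompt.
  return "prompt";
}

console.log(intentBootstrapAction({ igrEnabled: true, existingArtifacts: [], consecutiveSkips: 0 }));
```

Treating "any one artifact present" as "artifacts exist" is a design guess; the real script may require all four or inspect `reviewStatus`.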
@@ -171,915 +184,32 @@ Check `ready.json` for 2+ tasks. If parallelizable (no dependencies), offer para
  3. Check `app-map.md`, `function-map.md`, `api-map.md`, `decisions.md`
  4. Auto-invoke matched skills based on task context
 
- ### Decision Authority Framework (Cross-Cutting — applies to ALL steps)
-
- **Before presenting ANY decision to the user**, classify it using `flow-decision-authority.js`:
-
- ```bash
- node node_modules/wogiflow/scripts/flow-decision-authority.js classify "<decision text>"
- ```
-
- | Authority Level | Action |
- |-----------------|--------|
- | `agent-decides` | Decide autonomously. Report in completion summary only. |
- | `agent-decides-report-after` | Decide autonomously. Explicitly state the decision after implementing. |
- | `owner-decides` | Present to user. Wait for answer before proceeding. |
- | `auto-fix-report-after` | Fix automatically. Report what was fixed after. |
-
- **Batch enforcement**: When multiple decisions arise in a single task, use `batchClassify()`. If owner-decides questions exceed `maxOwnerQuestionsPerBatch` (default: 5), overflow is automatically downgraded to `agent-decides-report-after`. This prevents question flooding (12+ questions in one batch).
-
- **Default categories**: engineering → agent-decides, infrastructure → agent-decides-report-after, productBehavior → owner-decides, security → auto-fix-report-after, ux → owner-decides, naming → agent-decides, performance → agent-decides-report-after.
-
- **User can update**: Via `/wogi-decide "from now on, just fix [category] yourself"` which calls `updateCategoryAuthority()` to change the config.
-
- **Low-confidence classification**: When the classifier cannot confidently categorize a decision, it defaults to `owner-decides` (safest fallback).
-
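Reviewer note: the batch-enforcement rule in the removed section above (owner-decides questions beyond `maxOwnerQuestionsPerBatch` are downgraded) can be illustrated roughly as follows. This is a sketch, not the body of `batchClassify()`. The function name `enforceBatchLimit`, the decision object shape, and the choice to downgrade the later questions rather than the earlier ones are all assumptions.

```javascript
// Hypothetical sketch of the overflow downgrade. The authority labels and the
// default limit of 5 come from the diff text; the rest is assumed.
const DOWNGRADED_AUTHORITY = "agent-decides-report-after";

function enforceBatchLimit(decisions, maxOwnerQuestionsPerBatch = 5) {
  let ownerCount = 0;
  return decisions.map((decision) => {
    if (decision.authority !== "owner-decides") return decision;
    ownerCount += 1;
    // Questions within the limit stay with the owner; overflow is downgraded.
    return ownerCount <= maxOwnerQuestionsPerBatch
      ? decision
      : { ...decision, authority: DOWNGRADED_AUTHORITY, downgraded: true };
  });
}

// Usage: seven owner-level questions in one batch, default limit of five.
const batch = Array.from({ length: 7 }, (_, i) => ({ text: `question ${i + 1}`, authority: "owner-decides" }));
const enforced = enforceBatchLimit(batch);
console.log(enforced.filter((d) => d.authority === "owner-decides").length);
```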
- ### Step 1.2: Clarifying Questions
-
- Before generating specs (skip for small tasks ≤2 files, bugfixes, explicit specs):
- - Scope validation, assumption surfacing, edge cases, integration points
- - Config: `config.clarifyingQuestions`
-
- ### Step 1.25: Item Reconciliation Gate (Multi-Item Inputs)
-
- **Activates when**: User input contains 3+ discrete requests (identified by: numbered lists, bullet points, "and also", "plus", semicolons separating requests, or distinct topics in voice-transcribed text).
-
- **Purpose**: Prevent item loss when the AI compresses many requests into fewer stories. This is the #1 cause of "silently dropped items" in long inputs.
-
- **Procedure**:
- 1. **Enumerate**: Produce a numbered checklist of EVERY discrete request from the user's input. Each item = one testable action. No compression, no grouping, no summarization.
- 2. **Confirm count**: Display the checklist and count: "I found N items in your request: [list]. Is this complete?"
- 3. **Map to work items**: Each checklist item becomes a trackable acceptance criterion. Items may be grouped into stories, but EVERY item must appear as a criterion in at least one story. No item may be dropped during grouping.
- 4. **Reconciliation check**: After stories/tasks are created, cross-reference: for each original checklist item, verify it appears in at least one acceptance criterion. If any item is missing → add it before proceeding.
- 5. **At completion** (Step 3.5): The criteria verification must trace back to this original checklist. Every checklist item must be verified as implemented.
-
- **Example**:
- ```
- User: "Fix the login page, add forgot password, remove mock data,
- update the header logo, and add loading states to all forms"
-
- Item Reconciliation:
- 1. Fix the login page [→ Story A, criterion 1]
- 2. Add forgot password flow [→ Story A, criterion 2]
- 3. Remove all mock data [→ Story B, all criteria]
- 4. Update header logo [→ Story C, criterion 1]
- 5. Add loading states to all forms [→ Story C, criterion 2]
-
- 5 items found → 5 criteria across 3 stories → 0 items dropped ✓
- ```
-
- **Skip when**: Input has only 1-2 items, or is a task ID reference.
-
- **ANTI-DEFERRAL ENFORCEMENT**: After reconciliation, verify ALL items became tasks/criteria. If you find yourself writing "deferred", "skipped", or "not created" for ANY item — STOP. You are violating the anti-deferral rule. The user provided these items for a reason. Create tasks for ALL of them. You may suggest priority ordering (P0-P3), but you must NEVER autonomously filter items out. A large ready queue is correct behavior. A filtered queue is data loss that breaks the user's trust.
-
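Reviewer note: the reconciliation check in the removed Step 1.25 (every checklist item must appear in at least one acceptance criterion) amounts to a set difference. A minimal sketch follows, assuming a hypothetical data shape in which each criterion records the checklist item it covers via an `itemId` field. Neither the shape nor the function name `findDroppedItems` comes from the package.

```javascript
// Hypothetical sketch: return the checklist items that no story criterion covers.
function findDroppedItems(checklist, stories) {
  const covered = new Set(
    stories.flatMap((story) => story.criteria.map((criterion) => criterion.itemId))
  );
  return checklist.filter((item) => !covered.has(item.id));
}

// Usage with the five-item example from the section above, where the
// header-logo item was (deliberately) left out of Story C.
const checklist = [
  { id: 1, text: "Fix the login page" },
  { id: 2, text: "Add forgot password flow" },
  { id: 3, text: "Remove all mock data" },
  { id: 4, text: "Update header logo" },
  { id: 5, text: "Add loading states to all forms" },
];
const stories = [
  { name: "Story A", criteria: [{ itemId: 1 }, { itemId: 2 }] },
  { name: "Story B", criteria: [{ itemId: 3 }] },
  { name: "Story C", criteria: [{ itemId: 5 }] },
];
console.log(findDroppedItems(checklist, stories).map((item) => item.text));
```

A non-empty result means the gate should add the missing items before proceeding, as step 4 of the procedure requires.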
235
- ### Step 1.3: Explore Phase (MANDATORY Multi-Agent Research)
236
-
237
- **For L2+ tasks. Research is MANDATORY** — do NOT skip even if you think you know the answer.
238
-
239
- Before launching: check `.workflow/state/research-cache.json` for cached results (TTL: 24h).
240
-
241
- **Research Depth** (`config.planMode.researchDepth`):
242
- - `"thorough"`: All 5-6 agents in parallel
243
- - `"standard"`: Agents 1 + 2 + 4 (3 agents)
244
- - `"minimal"`: Agent 1 only
245
-
246
- **L3 (Subtask/trivial) tasks always skip this phase.**
247
-
248
- **Agents** (full prompts in `.claude/docs/explore-agents.md` — Read that file before launching):
249
-
250
- | Agent | Focus | Network |
251
- |-------|-------|---------|
252
- | 1. Codebase Analyzer | Related files, reusable components, dependency map, assumptions | Local |
253
- | 2. Best Practices | Current best practices, pitfalls, ecosystem patterns | Web |
254
- | 3. Version Verifier | API compatibility, deprecated APIs, version gotchas | Web |
255
- | 4. Risk & History | feedback-patterns, corrections, promoted rules, rejected approaches | Local |
256
- | 5. Standards Preview | Applicable rules, reuse candidates across ALL registries, security patterns | Local |
257
- | 6. Consumer Impact | **ALL L1+ tasks.** Map ALL consumers, classify BREAKING/NEEDS-UPDATE/SAFE. Write results to `.workflow/state/blast-radius-{taskId}.json` | Local |
258
-
259
- Launch all in parallel. When `config.hybrid.enabled`, route via `model` parameter (explore → sonnet, search → haiku, judging → opus).
260
-
261
- **After agents complete**: Display consolidated research summary covering codebase analysis, best practices, version info, risks, standards, and consumer impact.
262
-
263
- **REUSE GATE (MANDATORY)**: After consolidating agent results, check for reuse candidates:
264
- 1. Collect all reuse candidates reported by Agent 1 (domain-keyword search) and Agent 5 (registry scan)
265
- 2. If ANY reuse candidate has purpose overlap with planned new code → **STOP and present to user**:
266
- - Show each candidate: name, path, purpose, similarity
267
- - Ask: "Use existing / Extend existing / Create new (explain why)"
268
- - Implementation BLOCKED until user decides on each candidate
269
- 3. If no reuse candidates found → proceed normally
270
- 4. This gate runs BEFORE spec generation — catching reuse early prevents wasted implementation
271
-
272
- **For L1/L0 tasks**: Offer to deepen research (exhaustive search, load all skills, full dependency tree).
273
-
274
- **Fallback**: If agents fail, log warning and proceed with remaining. Consumer Impact failure on L1+ tasks = HARD BLOCK (require user confirmation). See `.claude/docs/explore-agents.md` for details.
275
-
276
- **Constraints**: READ-ONLY phase. No Edit/Write. Agents use only Glob, Grep, Read, WebSearch, WebFetch.
277
-
278
- ### Step 1.45: Scope-Confidence Gate (L0/L1 tasks only)
279
-
280
- **Activates when**: Task level is L0 or L1. Skip for L2/L3 tasks.
281
-
282
- **The problem this solves**: Multi-day plans often depend on assumptions about what exists (new tables, new models, new APIs, new services). Without verification, a 7-10 day plan can collapse to 1 day when a single question reveals the assumption was wrong. This gate audits scope-inflating assumptions BEFORE the spec is generated — not the same as clarifying questions (Step 1.2) which target user intent.
283
-
284
- **Procedure**:
285
-
286
- 1. **Extract assumptions**: From the explore phase results and task description, list every assumption the plan depends on:
287
- - New database tables/schemas needed
288
- - New API endpoints or services to create
289
- - New models or data structures
290
- - External integrations assumed not to exist
291
- - Infrastructure components (queues, caches, workers)
292
-
293
- 2. **Verify each assumption against the codebase**:
294
- - For each assumption, grep/glob for existing implementations
295
- - Check schema files, migration files, service directories, API routes
296
- - Check `app-map.md`, `function-map.md`, `api-map.md`, `schema-map.md` for registered components
297
-
298
- 3. **Classify results**:
299
- | Status | Meaning | Action |
300
- |--------|---------|--------|
301
- | VERIFIED | Assumption confirmed by codebase evidence | Proceed — scope is accurate |
302
- | EXISTS | Assumed-new thing already exists | **Scope reduction** — remove from plan |
303
- | UNVERIFIABLE | Cannot confirm or deny from codebase | **Ask user** before proceeding |
304
- | CONTRADICTED | Codebase shows opposite of assumption | **Scope change** — replan required |
305
-
306
- 4. **Present findings to user** (MANDATORY when any UNVERIFIABLE or CONTRADICTED found):
307
- ```
308
- ━━━ SCOPE-CONFIDENCE AUDIT ━━━
309
- Task: [title]
310
-
311
- Assumptions verified:
312
- ✓ [assumption] — found at [file:line]
313
-
314
- Scope reductions (already exists):
315
- ↓ [assumption] — exists at [file:line], removing from plan
316
-
317
- Needs confirmation:
318
- ? [assumption] — does [X] already exist? Could not find in codebase.
319
-
320
- Contradictions:
321
- ✗ [assumption] — codebase shows [opposite evidence]
322
-
323
- Revised estimate: [original] → [adjusted based on findings]
324
- ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
325
- ```
326
-
327
- 5. **Wait for user response** on UNVERIFIABLE items before proceeding to spec generation. Spec MUST reflect verified scope, not assumed scope.
328
-
329
- **This is NOT the same as Step 1.2 (Clarifying Questions)**:
330
- - Step 1.2 targets **user intent** ("what do you want?")
331
- - Step 1.45 targets **scope assumptions** ("what does the codebase already have?")
332
- - Step 1.2 runs before explore; Step 1.45 runs after explore (uses explore results)
333
-
334
- ### Step 1.5: Generate Specification
335
-
336
- For medium/large tasks (check `config.specificationMode`):
337
-
338
- 1. Generate spec to `.workflow/specs/wf-XXXXXXXX.md`:
339
- - Acceptance criteria (Given/When/Then), implementation steps, files to change
340
- - Boundary declarations (files that must NOT be modified)
341
- - Consumer impact plan (for refactors — MANDATORY if BREAKING consumers found; 5+ = phased approach required)
342
- - Test strategy, verification commands
343
- 2. Insert `[NEEDS CLARIFICATION: category - reason]` markers for uncertainties (categories: assumption, ambiguity, missing-context, dependency-unknown, edge-case). Implementation BLOCKED until all resolved (when `config.specificationMode.needsClarification.blockImplementation`).
344
- 3. Reflection: "Does this spec fully address the requirements?"
345
-
346
- **Batch fix spec requirement**: When a task contains 3+ discrete items (e.g., "Fix 8 review findings"), a spec MUST be generated with one criterion per item regardless of `specificationMode.minTaskLevel`. Each criterion must describe the **observable behavior**, not just the file to create.
347
-
348
- - BAD: "Create TokenBlacklistService"
349
- - GOOD: "When an admin changes a user's role, the user's next API request returns 401 'Token has been revoked'"
350
-
351
- Behavior-level criteria force end-to-end chain verification in Step 3.5/3.52.
352
-
353
- ### Step 1.6: Approval Gate (Stories/Epics)
354
-
355
- **For L1/L0 tasks: STOP and WAIT for explicit user approval** before implementation.
356
- Approval phrases: approved, proceed, looks good, lgtm, go ahead, yes, continue, start.
357
- L2/L3 skip this gate.
358
-
359
- ### Step 1.7: Test Generation (when `config.testing.enabled` and `config.testing.generation.autoGenerate`)
360
-
361
- When testing is enabled and auto-generation is on:
362
- 1. Run `node node_modules/wogiflow/scripts/flow-test-generate.js wf-XXXXXXXX` to parse spec and generate test scaffolds
363
- 2. Review output: number of test files created, criteria coverage, edge cases
364
- 3. If tests were generated, add "Make generated tests pass" to TodoWrite items in Step 2
365
- 4. During implementation (Step 3), verify generated tests fail before implementation and pass after
366
- 5. If `testing.generation.autoGenerate: false` or `testing.enabled: false`, skip this step entirely
367
-
368
- ### Step 2: Decompose into TodoWrite
369
-
370
- Each acceptance criterion → TodoWrite item. Also add: update request-log, update maps, run quality gates, commit.
371
-
372
- ### Step 2.5: TDD Mode Check
373
-
374
- When `config.tdd.enforced` is true OR `--tdd` flag is used, the execution loop switches to test-first order. Also auto-enables for task types listed in `config.tdd.defaultForTypes` (e.g., `["bugfix"]`).
375
-
376
- **TDD Execution Loop** (replaces normal Step 3 when active):
377
-
378
- For each acceptance criterion:
379
- 1. Mark in_progress in TodoWrite
380
- 2. **Write test** for this criterion (Given/When/Then → test assertion)
381
- 3. **Run test → MUST FAIL** (proves test is meaningful). If it passes before implementation → WARNING: test may be trivial
382
- 4. **Implement** the feature/fix following matched skill patterns
383
- 5. **Run test → MUST PASS**. If still fails → debug and fix (max 5 retries)
384
- 6. **Run full verification** (lint, typecheck, all tests)
385
- 7. **Save TDD artifact** to `.workflow/verifications/` with before/after test results
386
- 8. Mark completed only when all tests pass
387
-
388
- Test framework auto-detected from package.json: jest, vitest, mocha, tap, or fallback `node --test`.
389
-
390
- ### Step 3: Execute Each Scenario (Loop)
391
-
392
- **When TDD is NOT active**, use this normal flow. For each acceptance criterion:
393
- 1. Mark in_progress in TodoWrite
394
- 2. Implement following matched skill patterns
395
- 3. Run verification (lint, typecheck, tests) → save artifact to `.workflow/verifications/`
396
- 4. If failing: debug, fix, retry (max 5 attempts)
397
- 5. Mark completed only when verification passes
398
-
399
- ### Step 3.05: Sprint-Based Context Reset (L1+ tasks with 5+ criteria)
400
-
401
- **Activates when**: `config.sprintReset.enabled` (default: true) AND task has 5+ acceptance criteria AND current criterion index is a multiple of `config.sprintReset.criteriaPerSprint` (default: 3).
402
-
403
- **The problem this solves**: For large tasks, context fills with implementation details from early criteria. By criterion 6+, the AI is working with degraded context — old diffs, stale tool results, and exploration artifacts crowd out what matters for the current criterion. The Anthropic harness design research found that full context resets with structured file-based handoffs produce higher quality output than continuous context for long-running tasks.
404
-
405
- **Procedure** (runs automatically at sprint boundaries):
406
-
407
- 1. After completing criterion N (where N % `criteriaPerSprint` === 0 AND remaining criteria > 0):
408
- 2. **Commit progress**: `git add -A && git commit -m "sprint: criteria 1-N of M complete"`
409
- 3. **Save sprint checkpoint** to `.workflow/state/task-checkpoint.json`:
410
- - Task ID, spec path, completed criteria indices, changed files, remaining criteria
411
- 4. **Output sprint summary** (visible to user):
412
- ```
413
- ━━━ SPRINT BOUNDARY ━━━
414
- Completed criteria 1-N of M. Committing and resetting context.
415
- Remaining: criteria (N+1)-M
416
- ```
417
- 5. **Compact context** — this triggers a full compaction. The PostCompact hook restores:
418
- - Active task ID and spec reference
419
- - Which criteria are done vs pending (from checkpoint)
420
- - Changed files list
421
- 6. **Resume from checkpoint** — read the spec fresh, skip completed criteria, continue with criterion N+1
422
-
423
- **Why this is different from normal compaction**: Normal compaction summarizes the conversation. Sprint reset goes further — it commits work, saves a structured checkpoint, and compacts. The next sprint starts with a clean slate + the checkpoint file, not a compressed summary of everything that happened. The AI reads the spec fresh rather than relying on a summarized memory of it.
424
-
425
- **Configuration**:
426
- ```json
427
- {
428
- "sprintReset": {
429
- "enabled": true,
430
- "criteriaPerSprint": 3,
431
- "minTaskCriteria": 5
432
- }
433
- }
434
- ```
435
-
436
- **Skip when**: Task has < 5 criteria, TDD mode is active (TDD has its own rhythm), or `sprintReset.enabled` is false.
437
-
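Reviewer note: the activation and skip conditions of the removed Step 3.05 combine into a single predicate. The sketch below is one possible reading, not code from the package. The config keys match the JSON shown in the section; the function name `isSprintBoundary`, the argument shape, and the guard against firing before any criterion is complete are assumptions.

```javascript
// Hypothetical sketch of the sprint-boundary check described in Step 3.05.
const DEFAULT_SPRINT_RESET = { enabled: true, criteriaPerSprint: 3, minTaskCriteria: 5 };

function isSprintBoundary({ completed, total, tddActive = false }, config = DEFAULT_SPRINT_RESET) {
  // Skip conditions: feature disabled, TDD mode active, or too few criteria.
  if (!config.enabled || tddActive) return false;
  if (total < config.minTaskCriteria) return false;
  const remaining = total - completed;
  // Boundary: N is a positive multiple of criteriaPerSprint and work remains.
  return completed > 0 && completed % config.criteriaPerSprint === 0 && remaining > 0;
}

// Usage: a task with 7 criteria resets after criteria 3 and 6 only.
for (let n = 1; n <= 7; n += 1) {
  console.log(n, isSprintBoundary({ completed: n, total: 7 }));
}
```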
438
- ### Step 3.5: Criteria Completion Verification (MANDATORY)
439
-
440
- After implementing all scenarios, BEFORE quality gates:
441
-
442
- 1. Re-read original acceptance criteria from spec
443
- 2. For EACH criterion: verify it was actually implemented and WORKS (not just "code exists" but "code does what the criterion describes")
444
- 3. If ANY criterion NOT done → implement it, then re-check ALL criteria again
445
- 4. Only proceed when ALL criteria verified
446
-
447
- **This prevents "claiming done when not done."**
448
-
449
- ### Step 3.52: Sub-Agent Output Verification (MANDATORY when agents were used)
450
-
451
- **Activates when**: Any acceptance criterion was implemented by a sub-agent (Agent tool with `isolation: "worktree"` or any delegated agent).
452
-
453
- **The problem this solves**: Sub-agents self-report completion, but their self-assessment is unreliable. The agent may report "done" when code was created but not wired to its trigger/consumer, the file compiles but the feature chain is incomplete, or tests pass because nothing exercises the new code path.
454
-
455
- **Procedure**:
456
-
457
- 1. **DISTRUST sub-agent self-reports.** A sub-agent saying "done" is a CLAIM, not a FACT. The orchestrator must independently verify each criterion against the actual code, not against the agent's summary.
458
-
459
- 2. For EACH criterion a sub-agent claims to have completed:
460
- a. **Read the ACTUAL files** the agent modified (not just the agent's summary)
461
- b. **Trace the full feature chain**: Who calls this? → What does it call? → What's the end-to-end flow?
462
- c. For services: verify at least ONE caller invokes the critical method
463
- d. For guards/middleware: verify they are registered in the correct module
464
- e. For event-driven features: verify the event is emitted AND consumed
465
-
466
- 3. **Chain verification checklist** (for each new service/feature):
467
- - [ ] Service/component is created
468
- - [ ] Registered in the correct module (providers, imports)
469
- - [ ] Exported from the module (if needed by other modules)
470
- - [ ] Imported by the consuming module
471
- - [ ] Injected in the consuming service/controller
472
- - [ ] The critical method is CALLED at the right trigger point
473
- - [ ] The trigger point is reachable from a user action (HTTP request, cron, event)
474
-
475
- 4. If ANY link in the chain is missing → the criterion is NOT done. Fix the missing link first.
476
-
477
- **Anti-pattern: "Dead service"** — a service that exists, compiles, is imported somewhere, but its critical method is never called by the thing that should trigger it. This passes lint, typecheck, and wiring checks (because the file IS imported) but the feature doesn't work.
478
-
479
- ### Step 3.55: Inventory-Based Verification (for "remove/fix/replace all X" tasks)
480
-
481
- **Activates when**: The task involves removing, cleaning up, fixing, or replacing ALL instances of something (e.g., "remove all mock data", "fix all console.log", "replace all hardcoded URLs", "remove all deprecated APIs").
482
-
483
- **The problem this solves**: Pattern-based search (grep, regex) only finds instances that match a naming convention. Semantic variants — inline hardcoded arrays, helper functions that wrap the target, useState initializers with fake data, constants not named with the expected prefix — are invisible to pattern search. In practice, pattern search finds ~60-70% of instances. The AI then declares "done" and the remaining 30-40% persist undetected. This has caused repeated false completions (3-4x on a single project).
484
-
485
- **Core principle**: For each file in scope, ask **"does anything in this file serve the PURPOSE of [what we're removing]?"** — regardless of what it's named. Reason about function, not strings.
486
-
487
- **Procedure (3 phases — ALL mandatory)**:
488
-
489
- #### Phase A: Pre-Implementation Inventory (BEFORE any code changes)
490
-
491
- 1. **Identify all files in scope** — every file that could contain instances of [X]. Use both:
492
- - Pattern search (grep/glob) for syntactic matches
493
- - File-by-file reading of components/pages/modules that CONSUME data related to [X]
494
-
495
- 2. **For each file, answer the semantic question**: "Does anything in this file serve the purpose of [what we're removing]?" Examples by task type:
496
-
497
- | Task Type | Semantic Question | What Pattern Search Misses |
498
- |-----------|-------------------|---------------------------|
499
- | Remove mock data | "Where does this component get its displayed data? Is it from an API call or a local constant/array/useState?" | Inline arrays (`const customers = [{...}]`), useState initializers (`useState([...POLICY_DATA])`), export constants not named `MOCK_*` |
500
- | Remove console.log | "What in this file produces output to any channel?" | `console.warn`, `console.debug`, `debugger`, `alert()`, custom logger wrappers |
501
- | Replace hardcoded URLs | "What string values in this file resolve to network addresses?" | URLs built from concatenation, template literals, env var fallbacks with hardcoded defaults |
502
- | Remove deprecated API | "What in this file provides the same FUNCTIONALITY as the deprecated API?" | Wrapper functions, polyfills, compatibility shims, re-implementations |
503
- | Fix all raw JSON.parse | "What in this file deserializes JSON?" | Utility functions that call JSON.parse internally, library wrappers |
504
-
505
- 3. **Trace data-providing imports one level (MANDATORY)**:
506
-
507
- The semantic scan in step 2 catches inline instances but gives a free pass to imported values. Imported constants, configurations, and helpers can contain the exact thing you're looking for — hidden behind one level of indirection and a legitimate-sounding name (`DEFAULT_*`, `INITIAL_*`, `FALLBACK_*`, `BASE_*`).
508
-
509
- **For every import statement in each scoped file**, classify it:
510
-
511
- | Import Type | Example | Action |
512
- |-------------|---------|--------|
513
- | **Data-providing** | `import { RATE_OPTIONS } from './constants'` | MUST read the source file and apply the semantic question to its contents |
514
- | **Utility/function** | `import { formatDate } from './utils'` | Skip — unless the function wraps or returns the target pattern |
515
- | **Type/interface** | `import type { Customer } from './types'` | Skip — types don't contain runtime data |
516
- | **Style/asset** | `import styles from './styles.module.css'` | Skip |
517
- | **Component** | `import { Button } from './ui'` | Skip — unless it's a wrapper that embeds the target pattern |
518
-
519
- **How to classify**: If the import provides a value that gets **rendered, displayed, logged, passed to an API, or used as configuration** — it's data-providing. Read its source.
520
-
521
- **Anti-pattern — naming convention bias**: Constants named `DEFAULT_*`, `INITIAL_*`, `FALLBACK_*`, `CONFIG_*`, `BASE_*` look legitimate but are often hardcoded placeholders. The name is NOT evidence of legitimacy. Only the source is.
522
-
523
- **Rule**: Any imported value that contributes to **user-visible output** and resolves to a hardcoded literal (not an API call, env var, or database query) is an instance of [X] — regardless of what it's named or which directory it lives in.
524
-
525
- 4. **Produce a numbered inventory** and display it to the user:
526
- ```
527
- ━━━ PRE-IMPLEMENTATION INVENTORY ━━━
528
- Found N instances of [X] across M files:
529
-
530
- 1. [file:lines] — [description] [TYPE: syntactic|semantic|import-traced]
531
- 2. [file:lines] — [description] [TYPE: syntactic|semantic|import-traced]
532
- ...
533
-
534
- Total: N instances (S syntactic, M semantic)
535
- Confirm inventory is complete before proceeding? [Y/adjust]
536
- ```
537
-
538
- 5. **Wait for user confirmation** that the inventory is complete. If the user identifies missing items, add them. This step is CRITICAL — it commits the AI to a concrete scope that can be verified later.
539
-
540
- #### Phase B: Implementation
541
-
542
- 6. Implement the removal/fix/replacement for EVERY item in the inventory. Each inventory item becomes a trackable unit of work.
543
-
544
- #### Phase C: Post-Implementation Re-Inventory (AFTER all changes)
545
-
546
- 7. **Re-run the SAME semantic scan** from Phase A (including import tracing from step 3) on the SAME set of files. Do NOT downgrade to pattern-only search.
547
-
548
- 8. **Diff the inventories**:
549
- ```
550
- ━━━ POST-IMPLEMENTATION VERIFICATION ━━━
551
- Re-scanned M files for [X]:
552
-
553
- 1. [file:lines] — [description] → REMOVED ✓
554
- 2. [file:lines] — [description] → REMOVED ✓
555
- 3. [file:lines] — [description] → STILL PRESENT ✗
556
- ...
557
-
558
- Result: N/N removed (0 remaining)
559
- ```
560
-
561
- 9. **If ANY items remain** → task is NOT done. Fix the remaining items and re-verify. Do NOT proceed to quality gates with remaining items.
562
-
563
- 10. **If new instances are discovered** during re-scan (including via import tracing) that weren't in the original inventory → add them, fix them, and note them as "discovered during verification."
564
-
565
- **Why this works**: The inventory creates a concrete, numbered checklist BEFORE implementation. The AI cannot claim "done" when the post-inventory shows items still present — the evidence is in the conversation. The pre/post diff is unfakeable.
566
-
567
- **Skip conditions**: Tasks that target a specific file or a small known set (e.g., "remove the mock import in Dashboard.tsx") don't need the full inventory — they're scoped enough already. The inventory is for "all X" / "every X" / "clean up X everywhere" tasks.
568
-
569
- ### Step 3.56: Skeptical Evaluator Gate (L2+ tasks, when `config.skepticalEvaluator.enabled`)
570
-
571
- **The problem this solves**: The same agent that wrote the code verifies its own work in Step 3.5. Anthropic's harness design research found that "separating the agent doing the work from the agent judging it proves to be a strong lever" and that "tuning standalone evaluators toward skepticism is far more tractable than making a generator critical of its own work." This is "confident praise bias" — the implementer always thinks it did a good job.
572
-
573
- **Activates when**: `config.skepticalEvaluator.enabled` (default: true) AND task level is L2 or higher (not L3 trivial tasks).
574
-
575
- **Procedure**:
576
-
577
- 1. **Spawn a skeptical evaluator sub-agent** (separate from the implementation agent):
578
- ```
579
- Agent({
580
- subagent_type: "code-reviewer",
581
- model: "sonnet", // Use a different model for diversity
582
- prompt: <see below>
583
- })
584
- ```
585
-
586
- 2. **Evaluator prompt** (tuned toward skepticism):
587
- ```
588
- You are a SKEPTICAL code evaluator. Your job is to find problems, not praise.
589
- Assume the implementation has gaps until proven otherwise.
187
+ ### Decision Authority (Cross-Cutting)
590
188
 
591
- ## Task Specification
592
- <read and paste the spec from .workflow/specs/wf-XXXXXXXX.md>
189
+ Classify decisions via `node node_modules/wogiflow/scripts/flow-decision-authority.js classify "<text>"`:
593
190
 
594
- ## Implementation Diff
595
- <git diff of all changed files>
191
+ | Authority | Action |
192
+ |-----------|--------|
193
+ | `agent-decides` | Decide autonomously, report in summary |
194
+ | `agent-decides-report-after` | Decide autonomously, state decision after |
195
+ | `owner-decides` | Present to user, wait for answer |
196
+ | `auto-fix-report-after` | Fix automatically, report what was fixed |
596
197
 
597
- ## Your Job
598
-
599
- For EACH acceptance criterion in the spec:
600
- 1. Read the criterion carefully
601
- 2. Find the EXACT code that implements it (cite file:line)
602
- 3. Grade: PASS (fully works), PARTIAL (code exists but incomplete), FAIL (not implemented)
603
- 4. If PARTIAL or FAIL: explain exactly what's missing
604
-
605
- IMPORTANT: "Code exists" is NOT the same as "criterion is met."
606
- A service that exists but is never called = FAIL.
607
- A component that renders but doesn't handle the specified edge case = PARTIAL.
608
- Only grade PASS when the criterion is FULLY satisfied end-to-end.
609
-
610
- ## Output Format
611
- Return JSON:
612
- {
613
- "criteria": [
614
- { "criterion": "...", "grade": "PASS|PARTIAL|FAIL", "evidence": "file:line", "issue": "..." }
615
- ],
616
- "overallPass": true/false,
617
- "criticalIssues": ["..."]
618
- }
619
- ```
620
-
621
- 3. **Process evaluator results**:
622
- - If `overallPass: true` → proceed to Step 3.6
623
- - If `overallPass: false` → **iteration loop** (see below)
624
-
625
- 4. **Generator-Evaluator Iteration Loop** (when evaluator finds issues):
626
- - Feed the evaluator's `criticalIssues` and failed criteria back to the implementation context
627
- - Fix the identified issues (targeted fixes, not re-implementation)
628
- - Re-run the evaluator on the updated diff
629
- - **Max iterations**: `config.skepticalEvaluator.maxIterations` (default: 3)
630
- - If still failing after max iterations → proceed to Step 3.6 anyway but **flag the unresolved issues** in the completion report
631
-
632
- 5. **Calibration** (when `config.skepticalEvaluator.calibration` is true):
633
- - Before spawning the evaluator, check `.workflow/state/eval-calibration.json` for calibration examples
634
- - If examples exist, inject 2-3 into the evaluator prompt as few-shot examples:
635
- - One high-scoring example (what a PASS looks like)
636
- - One low-scoring example (what a FAIL looks like)
637
- - This prevents score drift — the evaluator is anchored to concrete examples
638
-
639
- **Configuration**:
640
- ```json
641
- {
642
- "skepticalEvaluator": {
643
- "enabled": true,
644
- "maxIterations": 3,
645
- "model": "sonnet",
646
- "calibration": true,
647
- "skipForL3": true
648
- }
649
- }
650
- ```
651
-
652
- **Why this works**: The evaluator has NO emotional investment in the code. It reads the spec and the diff cold. It's explicitly prompted to be skeptical. And because it's a separate sub-agent, it has a fresh context — no accumulated "I already know this works" bias from the implementation phase.
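The iteration loop from step 4 can be sketched as follows. This assumes that "max iterations" counts fix-and-re-evaluate rounds, which the text does not state; `runEvaluatorLoop`, `evaluate`, and `applyFixes` are hypothetical stand-ins for the evaluator sub-agent and the implementation agent.

```javascript
// Sketch of the generator-evaluator loop. `evaluate` returns a report shaped
// like the evaluator's JSON output; `applyFixes` performs targeted fixes.
// Both callbacks are hypothetical, not WogiFlow APIs.
function runEvaluatorLoop(evaluate, applyFixes, maxIterations = 3) {
  let report = evaluate();
  let iterations = 0;
  while (!report.overallPass && iterations < maxIterations) {
    const failed = report.criteria.filter((c) => c.grade !== "PASS");
    applyFixes(failed, report.criticalIssues); // targeted fixes only
    report = evaluate(); // re-run the evaluator on the updated diff
    iterations += 1;
  }
  return {
    passed: report.overallPass,
    iterations,
    // Unresolved issues are carried into the completion report, not dropped.
    unresolved: report.overallPass ? [] : report.criticalIssues,
  };
}

// Demo with a fake evaluator that passes on its third run.
let demoRuns = 0;
const demoOutcome = runEvaluatorLoop(
  () => {
    demoRuns += 1;
    const pass = demoRuns >= 3;
    return {
      criteria: [{ criterion: "example", grade: pass ? "PASS" : "PARTIAL" }],
      overallPass: pass,
      criticalIssues: pass ? [] : ["example issue"],
    };
  },
  () => {}
);
console.log(demoOutcome);
```

Returning `unresolved` instead of throwing matches the documented behavior of proceeding to Step 3.6 with the issues flagged.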
653
-
654
- ### Step 3.58: Runtime Verification Gate — Auto-Test Generation (MANDATORY)
655
-
656
- **Activates when**: ANY code file is changed. This is the DEFAULT — not optional.
657
-
658
- Run detection: `node node_modules/wogiflow/scripts/flow-runtime-verification.js task-type [changed-files...]`
659
-
660
- This returns the task type: `frontend`, `backend`, `fullstack`, or `other`. For `frontend` and `fullstack`, UI browser tests are generated. For `backend` and `fullstack`, API integration tests are generated. For `other`, standard static verification applies.
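The internals of `flow-runtime-verification.js` are not shown in this file, so the following is only a guess at how the four task types could be derived. It translates the glob patterns from the FRONTEND and BACKEND "Activates when" lists later in this step into regular expressions; `classifyTask` and both pattern arrays are hypothetical names.

```javascript
// Assumed regex translations of the documented glob patterns.
const FRONTEND_PATTERNS = [/\.tsx$/, /\.jsx$/, /\.vue$/, /\.svelte$/, /\.css$/, /\.styled\./];
const BACKEND_PATTERNS = [
  /\.controller\./, /\.service\./, /\.resolver\./, /\/routes\//,
  /\/api\//, /\.dto\./, /\.guard\./, /\.middleware\./,
];

// Returns "frontend", "backend", "fullstack", or "other" for a list of paths.
function classifyTask(changedFiles) {
  const matchesAny = (patterns) =>
    changedFiles.some((file) => patterns.some((pattern) => pattern.test(file)));
  const touchesUi = matchesAny(FRONTEND_PATTERNS);
  const touchesApi = matchesAny(BACKEND_PATTERNS);
  if (touchesUi && touchesApi) return "fullstack";
  if (touchesUi) return "frontend";
  if (touchesApi) return "backend";
  return "other";
}

console.log(classifyTask(["src/components/Table.tsx", "src/rules/rules.controller.ts"]));
```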
661
-
662
- **The problem this solves**: AI workers mark tasks as "done" based on static evidence (TypeScript compiles, build succeeds) without verifying the feature actually works end-to-end. This leads to repeated failed iterations. Auto-generated tests catch these failures BEFORE the user does.
663
-
664
- **DEFAULT BEHAVIOR**: For every task, WogiFlow auto-generates and runs verification tests as part of the execution loop. Tests are written to `tests/verification/` and persist as regression guards. This is ON by default — disable with `config.runtimeVerification.enabled: false`.
665
-
666
- #### Auto-Test Generation Flow
667
-
668
- ```
669
- For EACH acceptance criterion in the spec:
670
- 1. Classify: Is this a UI behavior, API behavior, or internal logic?
671
- 2. Generate: Write a test that exercises the criterion
672
- 3. Implement: Write the actual code
673
- 4. Run: Execute the test — it MUST pass
674
- 5. If FAIL → debug, fix, re-run (max 5 retries)
675
- 6. Persist: Test file stays in tests/verification/ as regression guard
676
- ```
677
-
678
- **This is NOT TDD** (where tests come first and must fail initially). This is **post-implementation verification** — the test is generated from the criterion, the code is written, then the test validates the code works. The key difference: TDD tests are written before code; verification tests are written alongside code and run after.
679
-
680
- ---
681
-
682
- #### FRONTEND: Browser Test Generation (Playwright + WebMCP)
683
-
684
- **Activates when**: Changed files match `*.tsx`, `*.jsx`, `*.vue`, `*.svelte`, `*.css`, `*.styled.*`
685
-
686
- **The problem this solves**: AI workers mark UI tasks as "done" based on static evidence without ever opening a browser. (See: Pipeline Rules case study — 5 failed iterations, same bug.)
687
-
688
- **BANNED verification methods** — these NEVER count as evidence for UI tasks:
689
-
690
- | Banned Method | What it proves | Why it's insufficient |
691
- |---|---|---|
692
- | `grep` deployed bundle for function names | Code included in build | Function may never execute or render wrong |
693
- | `tsc --noEmit` passes | Types are correct | Type-correct code can have wrong runtime behavior |
694
- | `vite build` succeeds | Modules resolve | Build success says nothing about UX |
695
- | "I read the code and it's logically correct" | Nothing | Author is worst possible judge of own work |
696
- | `aws s3 sync` completes | Files hosted | Hosting ≠ functioning |
697
-
698
- **Evidence Tiers** — every verification claim must be classified:
699
-
700
- | Tier | Name | Sufficient alone? |
701
- |---|---|---|
702
- | 0 | STATIC (compile, build, lint) | NEVER |
703
- | 1 | STRUCTURAL (file exists, imported, route registered) | NEVER |
704
- | 2 | OBSERVATIONAL (page loads, feature renders) | Yes (display-only) |
705
- | 3 | INTERACTIVE (click/type/submit → observed result persists) | Yes (behavioral) |
706
- | 4 | AUTOMATED (Playwright/WebMCP test passes) | Yes (strongest) |
707
-
708
- **Minimum: Tier 2 for display criteria, Tier 3 for behavioral criteria.**
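The minimum-tier rule reduces to a numeric comparison. A minimal sketch, assuming each criterion is labelled `display` or `behavioral` (those labels and the function name are hypothetical):

```javascript
// Tier names follow the Evidence Tiers table; the index is the tier number.
const EVIDENCE_TIERS = ["STATIC", "STRUCTURAL", "OBSERVATIONAL", "INTERACTIVE", "AUTOMATED"];

// Minimum tier per criterion kind, as stated in the rule above.
const MINIMUM_TIER = { display: 2, behavioral: 3 };

// Returns true when the evidence tier is high enough for the criterion kind.
function isSufficientEvidence(kind, tier) {
  if (!Object.prototype.hasOwnProperty.call(MINIMUM_TIER, kind)) {
    throw new Error(`Unknown criterion kind: ${kind}`);
  }
  return tier >= MINIMUM_TIER[kind];
}

console.log(EVIDENCE_TIERS[2], isSufficientEvidence("behavioral", 2));
```

Tier 0 and Tier 1 evidence never satisfies either kind, which matches the "NEVER" entries in the table.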
709
-
710
- #### Verification Method Selection
711
-
712
- Run: `node node_modules/wogiflow/scripts/flow-runtime-verification.js method`
713
-
714
- **Priority order** (use the first available):
715
-
716
- **1. WebMCP Browser Verification (DEFAULT — preferred)**
717
-
718
- When `config.webmcp.enabled` or a browser MCP server is detected in `.mcp.json`:
719
-
720
- For EACH acceptance criterion:
721
- 1. Navigate to the affected page via `mcp_browser_navigate`
722
- 2. Screenshot BEFORE: `mcp_browser_screenshot()`
723
- 3. Perform the user action (click, type, select, submit)
724
- 4. Wait 2-3 seconds for async updates
725
- 5. Screenshot AFTER: `mcp_browser_screenshot()`
726
- 6. Assert DOM state: `mcp_browser_evaluate("document.querySelector(...)")`
727
- 7. Record in Behavioral Evidence Log
728
-
729
- **High-risk tasks** (state mutation detected — useMutation, invalidateQueries, onMutate):
730
- - After all criteria verified, wait 3 seconds
731
- - Screenshot again — check state persisted after refetch
732
- - Reload page: `mcp_browser_navigate` to same URL
733
- - Wait for networkidle
734
- - Screenshot — check state survived page reload
735
- - If state reverted → the server didn't persist, or refetch overwrote it → FAIL
736
-
737
- **2. Playwright Test Generation (secondary)**
738
-
739
- When Playwright/Puppeteer is in the project's dependencies but WebMCP is not available:
740
-
741
- 1. Auto-generate a Playwright test from acceptance criteria
742
- 2. Write test to `tests/verification/verify-{taskId}.spec.ts`
743
- 3. Instruct the user: "Run `npx playwright test tests/verification/verify-{taskId}.spec.ts --headed` to verify"
744
- 4. If the project has CI, the test persists as a regression guard
745
-
746
- **3. User Verification Checklist (fallback — always available)**
747
-
748
- When neither WebMCP nor Playwright is available:
749
-
750
- Present a checklist to the user:
751
- ```
752
- ━━━ USER VERIFICATION CHECKLIST ━━━
753
- I cannot verify UI behavior from the CLI. Please check:
754
-
755
- □ 1. Navigate to [page]
756
- □ 2. [criterion 1 — specific action + expected result]
757
- □ 3. [criterion 2 — specific action + expected result]
758
- □ Wait 3 seconds after each action
759
- □ Refresh the page and verify changes persisted
760
-
761
- Reply "verified" when all checks pass, or describe what's broken.
762
- ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
763
- ```
764
-
765
- **CRITICAL**: The agent MUST wait for the user's "verified" response before marking the task complete. Do NOT proceed to quality gates without verification.
766
-
767
- #### Behavioral Evidence Log (BEL)
768
-
769
- Before marking ANY UI task complete, produce a BEL:
770
-
771
- ```
772
- ━━━ BEHAVIORAL EVIDENCE LOG ━━━
773
- Task: wf-XXXXXXXX
774
- Method: WEBMCP / PLAYWRIGHT / USER_CHECKLIST
775
- Verified on: localhost:5173
776
-
777
- CRITERION: "[text]"
778
- ACTION: Clicked "Route To" cell, selected "Design Department"
779
- EXPECTED: Cell updates to show "Design DEPARTMENT"
780
- OBSERVED: Cell shows "Design DEPARTMENT" with blue icon
781
- WAIT: 3 seconds — state persisted after refetch
782
- VERDICT: PASS
783
- EVIDENCE: Tier 3 (INTERACTIVE)
784
- ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
785
- ```
786
-
787
- The OBSERVED field MUST describe what was SEEN, not what the code theoretically produces.
788
-
789
- #### Pre-Implementation "See Before You Touch" (modification tasks)
790
-
791
- For tasks modifying existing UI (not greenfield):
792
- 1. Start dev server if not running
793
- 2. Navigate to the affected page
794
- 3. Screenshot/observe current state (BEFORE)
795
- 4. Document the baseline
796
- 5. Then implement changes
797
- 6. After implementation, compare BEFORE vs AFTER
798
-
799
- #### Repeat Failure Protocol (Groundhog Day Detector)
800
-
801
- When the SAME issue is reported in 2+ consecutive dispatches:
802
-
803
- | Strike | Action |
804
- |--------|--------|
805
- | 1 | Normal fix + BEL |
806
- | 2 | MANDATORY root cause analysis BEFORE coding. Change approach. Add console.log tracing. Tier 3+ evidence required. |
807
- | 3 | HARD BLOCK: Cannot mark done without screenshot/console evidence. Must state what's DIFFERENT this time. |
808
- | 4+ | ESCALATION: Acknowledge inability, suggest pair debugging with user. |
809
-
810
- Run: `node node_modules/wogiflow/scripts/flow-runtime-verification.js repeat wf-XXXXXXXX`
811
-
812
- #### Devil's Advocate Prompt
813
-
814
- Before marking ANY task complete (frontend or backend), ask yourself:
815
-
816
- > "Assume this is broken. What are the 3 most likely ways it could fail?"
817
-
818
- Then CHECK each one:
819
- 1. Does the API actually accept these fields? (curl it or check the DTO)
820
- 2. Does the response include the fields I'm reading? (log the response)
821
- 3. Does the UI update persist after refetch/re-render? (wait 3 seconds and look again)
822
- 4. Is the request payload shape what the server expects? (compare DTO with frontend fetch)
823
-
824
- If ANY is plausible and not verified → investigate before marking done.
825
-
826
- ---
827
-
828
- #### BACKEND: API Integration Test Generation
829
-
830
- **Activates when**: Changed files match `*.controller.*`, `*.service.*`, `*.resolver.*`, `/routes/`, `/api/`, `*.dto.*`, `*.guard.*`, `*.middleware.*`
831
-
832
- Run detection: `node node_modules/wogiflow/scripts/flow-runtime-verification.js api-detect [changed-files...]`
833
-
834
- **For EACH acceptance criterion that involves an API endpoint**:
835
-
836
- 1. **Identify the endpoint**: method (GET/POST/PUT/PATCH/DELETE), path, expected request/response shape
837
- 2. **Generate an integration test** that:
838
- - Makes the actual HTTP request to the running dev server
839
- - Asserts the status code matches expected
840
- - Asserts the response body contains expected fields
841
- - For mutations (POST/PUT/PATCH/DELETE): re-fetches the resource to verify persistence
842
- - For auth-protected endpoints: includes the auth token
843
- 3. **Write the test** to `tests/verification/api-verify-{taskId}.test.js`
844
- 4. **Run the test**: `node --test tests/verification/api-verify-{taskId}.test.js`
845
- 5. **If test fails** → debug, fix the implementation, re-run (max 5 retries)
846
- 6. **Test persists** as a regression guard
847
-
848
- **API Test Template** (generated per criterion):
849
-
850
- ```javascript
851
- it('POST /api/pipeline-rules — creates a rule with correct fields', async () => {
852
- const res = await apiRequest('POST', '/api/pipeline-rules', {
853
- tagPattern: 'animation',
854
- routeTo: { type: 'department', id: 'dept-123' },
855
- mode: 'CLAIMABLE'
856
- });
857
-
858
- // Status check
859
- assert.equal(res.status, 201);
860
-
861
- // Response shape check
862
- assert.ok(res.data.id, 'Response missing field: id');
863
- assert.equal(res.data.tagPattern, 'animation');
864
- assert.equal(res.data.mode, 'CLAIMABLE');
865
-
866
- // Persistence check: re-fetch and verify stored
867
- const verify = await apiRequest('GET', `/api/pipeline-rules/${res.data.id}`);
868
- assert.equal(verify.status, 200);
869
- assert.equal(verify.data.tagPattern, 'animation');
870
- });
871
- ```
872
-
873
- **Boundary verification** (frontend↔backend):
874
- When the task is `fullstack` (both UI and API files changed):
875
- 1. Generate BOTH browser tests AND API tests
876
- 2. The API test verifies the server accepts the payload shape the frontend sends
877
- 3. The browser test verifies the UI correctly displays the response shape the server returns
878
- 4. If either fails → the boundary contract is broken
879
-
880
- **Quick verification via curl** (for manual checking):
881
- The AI can also generate and run curl commands directly:
882
- ```bash
883
- # Create a rule
884
- curl -s -X POST http://localhost:3000/api/pipeline-rules \
885
- -H "Content-Type: application/json" \
886
- -d '{"tagPattern":"animation","routeTo":{"type":"department","id":"dept-123"},"mode":"CLAIMABLE"}'
887
-
888
- # Verify it was stored
889
- curl -s http://localhost:3000/api/pipeline-rules | jq '.[-1]'
890
- ```
891
-
892
- ---
893
-
894
- #### Configuration
895
-
896
- ```json
897
- {
898
- "runtimeVerification": {
899
- "enabled": true,
900
- "autoGenerateTests": true,
901
- "frontend": {
902
- "method": "webmcp",
903
- "fallback": ["playwright", "checklist"],
904
- "devServerUrl": "http://localhost:5173"
905
- },
906
- "backend": {
907
- "method": "api-test",
908
- "fallback": ["curl", "checklist"],
909
- "baseUrl": "http://localhost:3000"
910
- },
911
- "testOutput": "tests/verification",
912
- "persistTests": true,
913
- "blockOnFailure": true
914
- }
915
- }
916
- ```
917
-
918
- **`autoGenerateTests: true`** (default) — Tests are generated for EVERY task. This is the core behavioral change: verification is not an afterthought, it's built into the execution loop.
919
-
920
- **`persistTests: true`** (default) — Generated tests stay in `tests/verification/` as permanent regression guards. Over time, this builds an automated test suite from the actual use cases that were implemented.
921
-
922
- **`blockOnFailure: true`** (default) — If generated tests fail, the task is NOT complete. The agent must fix the implementation until tests pass.
923
-
924
- #### Skip Conditions
925
-
926
- - `config.runtimeVerification.enabled: false` → skip entirely (not recommended)
927
- - Task has NO code files in changed set (docs-only, config-only) → skip
928
- - Task is L3 trivial AND no UI/API files → skip
929
-
930
- ### Step 3.6: Integration Wiring Validation (MANDATORY)
931
-
932
- Run `node node_modules/wogiflow/scripts/flow-wiring-verifier.js wf-XXXXXXXX`
933
-
934
- **Forward wiring** — For each created file, verify it's imported/used somewhere:
935
- - Entry points (index.ts, App.tsx, *.config.ts, tests) don't need imports
936
- - Components MUST be imported in a parent. Hooks MUST be called. Utilities MUST be imported.
937
- - If NOT wired: identify where to import, wire it up, re-verify.
938
-
939
- **Removal impact** (v1.9.3) — For each removed export, type member, or identifier, verify no consumers still reference it:
940
- - Runs automatically as part of the `integrationWiring` quality gate
941
- - Detects orphaned references: removed type union members, exported names, component references, string literal IDs (e.g., tab IDs, route keys)
942
- - If orphaned references found: update consumers to remove stale references, re-verify.
943
- - CLI: `node node_modules/wogiflow/scripts/flow-wiring-verifier.js removal-check [files...]`
944
-
945
- ### Step 3.7: Standards Compliance Check (MANDATORY)
946
-
947
- Run `node node_modules/wogiflow/scripts/flow-standards-gate.js wf-XXXXXXXX [changed-files...]`
948
-
949
- Checks scoped by task type: component → naming/components/security. Utility → naming/functions/security. API → naming/api/security. Bugfix → naming/security. Feature → all. Refactor/migration → all + consumer-impact verification.
950
-
951
- **Consumer impact check** (ALL L1+ tasks): For each BREAKING consumer from explore phase (blast-radius analysis), verify it was updated. If any NOT migrated → BLOCK task completion. Results are persisted in `.workflow/state/blast-radius-{taskId}.json`.
952
-
953
- **Reuse candidate check** (AI-as-Judge): Standards gate returns similar items from all registries. AI reasons about PURPOSE overlap (not just name). If purpose overlaps → ask user (use existing / extend / create new). If purpose clearly differs → proceed silently.
954
-
955
- If violations found: fix, re-run, only proceed when all pass. Violations auto-recorded to `feedback-patterns.md`; 3+ occurrences → promoted to `decisions.md` (project-level) or fixed in WogiFlow base code (product-level). See `/wogi-decide` Step 0.5 for product vs project classification.
956
-
957
- ### Step 4: Quality Gates + Final Verification
958
-
959
- **First**: Run `node node_modules/wogiflow/scripts/flow-spec-verifier.js verify wf-XXXXXXXX` — verify all spec deliverables exist. If missing → STOP, create them.
960
-
961
- **Then**: Check `config.qualityGates` for task type. Gates are type-specific:
962
- - **feature**: loopComplete, tests, registryUpdate, requestLogEntry, integrationWiring, standardsCompliance
963
- - **bugfix**: loopComplete, tests, requestLogEntry, standardsCompliance, learningEnforcement
964
- - **refactor**: loopComplete, tests, noNewFeatures, smokeTest, standardsCompliance
965
- - **chore**: requestLogEntry, outstandingFindings
966
- - **release**: requestLogEntry, outstandingFindings, preRelease
967
- - **fix**: loopComplete, requestLogEntry, standardsCompliance
968
-
969
- **Fallback behavior**: Task types not listed above (docs, style, test, perf, etc.) inherit the **feature** gates. This is intentional — feature gates are the most comprehensive and serve as a safe default.
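The type-specific gate lists and the fallback rule can be expressed as a lookup with a default. The gate names are copied from the list above; the `QUALITY_GATES` object and `gatesFor` function are an illustrative sketch, not the shape of `config.qualityGates`.

```javascript
// Gate lists copied from the bullets above.
const QUALITY_GATES = {
  feature: ["loopComplete", "tests", "registryUpdate", "requestLogEntry", "integrationWiring", "standardsCompliance"],
  bugfix: ["loopComplete", "tests", "requestLogEntry", "standardsCompliance", "learningEnforcement"],
  refactor: ["loopComplete", "tests", "noNewFeatures", "smokeTest", "standardsCompliance"],
  chore: ["requestLogEntry", "outstandingFindings"],
  release: ["requestLogEntry", "outstandingFindings", "preRelease"],
  fix: ["loopComplete", "requestLogEntry", "standardsCompliance"],
};

// Unlisted task types (docs, style, test, perf, ...) inherit the feature gates.
function gatesFor(taskType) {
  return Object.prototype.hasOwnProperty.call(QUALITY_GATES, taskType)
    ? QUALITY_GATES[taskType]
    : QUALITY_GATES.feature;
}

console.log(gatesFor("docs"));
```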
970
-
971
- **Key automated gates** (v1.9.7):
972
- - `registryUpdate` → runs `flow registry-manager scan` on ALL active registries (app-map, function-map, api-map, schema-map, service-map). Auto-updates maps when new entries found. Replaces old `appMapUpdate` no-op gate.
973
- - `integrationWiring` → calls `verifyWiring()` — checks created files are imported/used
974
- - `standardsCompliance` → calls `runTaskStandardsCheck()` — checks naming, security, decisions.md rules
975
- - `outstandingFindings` → reads `last-review.json` — blocks if unresolved critical/high findings exist
976
- - `preRelease` → verifies codebase is releasable (no outstanding findings + lint + typecheck)
977
-
978
- **CRITICAL**: No task type defaults to zero gates. Every task type MUST have at least `requestLogEntry` + `outstandingFindings`.
979
-
980
- **WebMCP** (optional): If `config.webmcp.enabled` and UI files changed, check tool coverage. Non-blocking.
981
-
982
- Reflection: "Have I introduced any bugs or regressions?"
983
-
984
- ### Step 5: Finalize
985
-
986
- 1. Reflection: "Does this match what the user asked for?"
987
- 2. Close out all TodoWrite items for this task
988
- 3. Move task to recentlyCompleted in ready.json
989
- 4. Registry maps auto-updated by `registryUpdate` quality gate (runs `flow registry-manager scan` on all active registries — app-map, function-map, api-map, schema-map, service-map)
990
- 5. If `config.webmcp.enabled` and UI files created: run `node node_modules/wogiflow/scripts/flow-webmcp-generator.js scan`
991
- 6. Commit: `feat: Complete wf-XXXXXXXX - [title]`
992
- 7. Show completion summary
993
-
994
- ## Options
995
-
996
- | Flag | Effect |
997
- |------|--------|
998
- | `--tdd` | Test-first mode (see `.claude/docs/tdd-mode.md`) |
999
- | `--no-loop` | Load context only, don't execute |
1000
- | `--no-spec` | Skip spec generation |
1001
- | `--no-skills` | Skip skill auto-loading |
1002
- | `--no-reflection` | Skip reflection checkpoints |
1003
- | `--max-retries N` | Limit retries per scenario (default: 5) |
1004
- | `--pause-between` | Confirm between scenarios |
1005
- | `--verify-only` | Run verification only |
1006
- | `--phased` | Phased execution: Contract → Skeleton → Core → Edge Cases → Polish |
1007
-
1008
- ## When Things Go Wrong
1009
-
1010
- **Scenario keeps failing** (max retries): Stop, report, leave in inProgress. For HIGH-RISK tasks (architecture/migration/refactor, complexity HIGH + files > 10), suggest Best-of-N via `flow-best-of-n.js`. For others, suggest `/wogi-debug-hypothesis`.
1011
-
1012
- **Best-of-N** (when `config.bestOfN.enabled`): `assessRisk()` checks if task qualifies. If yes, offer to spawn N agents in worktrees with Opus judging the winner.
1013
-
1014
- **Quality gate keeps failing**: Report, attempt fix, after 3 failures suggest `/wogi-debug-hypothesis`.
1015
-
1016
- **Context too large**: When `config.autoCompact.betweenTasks` is true (default), compact AUTOMATICALLY between tasks — do NOT ask the user, do NOT show a summary, do NOT invoke `/wogi-pre-compact`. Just compact silently and continue with the next task. The PostCompact hook restores all state automatically. Mid-task: commit progress, compact silently, resume from checkpoint. The user should never see compaction happen — it's invisible infrastructure.
1017
-
1018
- ## Progress Tracking (MANDATORY for L1+ tasks)
1019
-
1020
- **Display progress at every natural checkpoint** so the user knows where they are during long tasks. This applies to ALL L1+ task execution AND to `/wogi-review` and `/wogi-audit`.
1021
-
1022
- ### Progress Format
1023
-
1024
- At each checkpoint, output a progress line using this format:
1025
-
1026
- ```
1027
- ━━━ PROGRESS: [phase_bar] phase_name ━━━
1028
- [step_bar] step_detail
1029
- ```
1030
-
1031
- Where `[phase_bar]` is: `[████░░░░░░] 40%` (filled/empty blocks proportional to completion).
1032
-
1033
- **Example during a 5-criteria task:**
1034
- ```
1035
- ━━━ PROGRESS: [██████░░░░] 60% Implementing criteria ━━━
1036
- Criterion 3/5: Add input validation to login form
1037
- ```
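A progress line in this format can be produced with a small formatter. The source only shows the output, so the function names and the rounding rule below are assumptions.

```javascript
// Renders a bar such as "[████░░░░░░] 40%".
// Rounding to the nearest block is an assumption.
function renderProgressBar(percent, width = 10) {
  const clamped = Math.min(100, Math.max(0, percent));
  const filled = Math.round((clamped / 100) * width);
  return `[${"█".repeat(filled)}${"░".repeat(width - filled)}] ${clamped}%`;
}

// Builds the two-line checkpoint output: header line, then step detail.
function renderProgress(percent, phaseName, stepDetail) {
  return `━━━ PROGRESS: ${renderProgressBar(percent)} ${phaseName} ━━━\n${stepDetail}`;
}

console.log(renderProgress(60, "Implementing criteria", "Criterion 3/5: Add input validation to login form"));
```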
1038
-
1039
- ### When to Display Progress
1040
-
1041
- | Checkpoint | What to show |
1042
- |------------|-------------|
1043
- | **After explore phase** | `[██░░░░░░░░] 20% Explore complete — N agents returned` |
1044
- | **After spec generated** | `[███░░░░░░░] 30% Spec ready — N criteria, N files` |
1045
- | **Each criterion start** | `[█████░░░░░] N% Implementing — Criterion M/N: [title]` |
1046
- | **Each criterion done** | `[███████░░░] N% Criterion M/N complete ✓` |
1047
- | **Quality gates** | `[█████████░] 90% Running quality gates` |
1048
- | **Task complete** | `[██████████] 100% Complete ✓` |
1049
-
1050
- ### State File Updates
1051
-
1052
- At each checkpoint, also update the progress state file for hooks/resume:
1053
-
1054
- ```bash
1055
- node node_modules/wogiflow/scripts/flow-progress-tracker.js update '{"taskId":"wf-XXX","command":"/wogi-start","phase":"Implementing","phaseNum":3,"totalPhases":5,"step":"Criterion 2/4","stepNum":2,"totalSteps":4}'
1056
- ```
1057
-
1058
- This updates `.workflow/state/task-progress.json` AND prefixes the task title in `ready.json` with `[3/5]` for status line visibility.
1059
-
1060
- ### On Task Completion
1061
-
1062
- Always clear the progress state:
1063
-
1064
- ```bash
1065
- node node_modules/wogiflow/scripts/flow-progress-tracker.js clear
1066
- ```
198
+ Defaults: engineering/naming → agent-decides. infrastructure/performance → agent-decides-report-after. productBehavior/ux → owner-decides. security → auto-fix-report-after. Max 5 owner questions per batch (overflow → agent-decides-report-after). User can update via `/wogi-decide`. Low-confidence classification defaults to `owner-decides` (safest fallback).
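The defaults above can be read as a category lookup with two safety rules. The sketch below is not the logic of `flow-decision-authority.js`, which this diff does not show. The numeric confidence threshold, the treatment of unknown categories as `owner-decides`, and all function names are assumptions.

```javascript
// Default mapping taken from the "Defaults" line above.
const DEFAULT_AUTHORITY = {
  engineering: "agent-decides",
  naming: "agent-decides",
  infrastructure: "agent-decides-report-after",
  performance: "agent-decides-report-after",
  productBehavior: "owner-decides",
  ux: "owner-decides",
  security: "auto-fix-report-after",
};

// Low-confidence classifications fall back to the safest option.
// Treating unknown categories the same way is an assumption.
function authorityFor(category, confidence, threshold = 0.5) {
  const known = Object.prototype.hasOwnProperty.call(DEFAULT_AUTHORITY, category);
  if (!known || confidence < threshold) return "owner-decides";
  return DEFAULT_AUTHORITY[category];
}

// Keeps at most `maxOwnerQuestions` owner questions per batch; the overflow
// is downgraded to agent-decides-report-after.
function capOwnerQuestions(authorities, maxOwnerQuestions = 5) {
  let asked = 0;
  return authorities.map((authority) => {
    if (authority !== "owner-decides") return authority;
    asked += 1;
    return asked <= maxOwnerQuestions ? authority : "agent-decides-report-after";
  });
}

console.log(authorityFor("security", 0.9), capOwnerQuestions(Array(6).fill("owner-decides")));
```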
1067
199
 
1068
- ### Phase Mapping for /wogi-start Execution
200
+ ## Phase Execution (MANDATORY)
1069
201
 
1070
- | phaseNum | Phase | Description |
1071
- |-------|----------|-------------|
1072
- | 1 | Routing + Context | Loading task, checking maps |
1073
- | 2 | Explore | Research agents |
1074
- | 3 | Spec + Approval | Generate spec, wait for approval |
1075
- | 4 | Implementation | Criteria loop (sub-steps = criteria) |
1076
- | 5 | Verification + Complete | Quality gates, finalize |
202
+ Before executing ANY phase, you MUST Read the phase instruction file. The PreToolUse hook BLOCKS Edit/Write/Bash until the phase file is read.
1077
203
 
1078
- ### Skip Conditions
204
+ | Phase | File to Read | Contents |
205
+ |-------|-------------|----------|
206
+ | exploring | `.claude/docs/phases/01-explore.md` | Steps 1–1.45: Context loading, intent framing, clarifying questions, item reconciliation, multi-agent research, reuse gate, scope-confidence audit |
207
+ | spec_review | `.claude/docs/phases/02-spec.md` | Steps 1.55–2.5: Architect pass, logic adversary, spec generation, approval gate, test generation, TodoWrite decomposition, TDD check |
208
+ | coding | `.claude/docs/phases/03-implement.md` | Steps 3–3.52: Scenario execution loop, sprint resets, criteria completion verification, sub-agent output verification |
209
+ | validating | `.claude/docs/phases/04-verify.md` | Steps 3.55–3.9: Inventory verification, skeptical evaluator, runtime verification (frontend + backend), wiring validation, standards compliance, completion truth gate |
210
+ | completing | `.claude/docs/phases/05-complete.md` | Steps 4–5: Quality gates, finalization, progress tracking, mandatory rules, options, error handling |
1079
211
 
1080
- - **L3 tasks**: Skip progress tracking (too small to be useful)
1081
- - **Conversation mode**: Skip progress tracking (no phases)
1082
- - **Quick fixes (≤2 criteria)**: Show start + complete only (no mid-progress)
212
+ **How it works**: When you transition to a new phase, Read the corresponding file BEFORE using Edit/Write/Bash. The phase-read gate tracks which files you've read and blocks mutation tools until the current phase's file is loaded.
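The gate's behavior can be modelled as a set of files read so far, checked before each mutation tool call. The phase-to-file mapping comes from the table above. The `PhaseReadGate` class is a hypothetical model of that behavior, not the actual PreToolUse hook.

```javascript
// Phase-to-file mapping from the table above.
const PHASE_FILES = {
  exploring: ".claude/docs/phases/01-explore.md",
  spec_review: ".claude/docs/phases/02-spec.md",
  coding: ".claude/docs/phases/03-implement.md",
  validating: ".claude/docs/phases/04-verify.md",
  completing: ".claude/docs/phases/05-complete.md",
};
const MUTATION_TOOLS = new Set(["Edit", "Write", "Bash"]);

// Hypothetical model: remembers which files were read and blocks mutation
// tools until the current phase's file is among them.
class PhaseReadGate {
  constructor() {
    this.filesRead = new Set();
  }

  recordRead(path) {
    this.filesRead.add(path);
  }

  // Returns true when the tool call may proceed in the given phase.
  allows(tool, phase) {
    if (!MUTATION_TOOLS.has(tool)) return true;
    return this.filesRead.has(PHASE_FILES[phase]);
  }
}

const demoGate = new PhaseReadGate();
console.log(demoGate.allows("Edit", "coding")); // blocked until 03-implement.md is read
demoGate.recordRead(PHASE_FILES.coding);
console.log(demoGate.allows("Edit", "coding"));
```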
1083
213
 
1084
214
  ## Mandatory Rules
1085
215