npm - @curdx/flow - Versions diffs - 2.0.0-beta.5 → 2.0.0-beta.7 - Mend

@curdx/flow 2.0.0-beta.5 → 2.0.0-beta.7

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (34) hide show

package/.claude-plugin/marketplace.json +1 -1
package/.claude-plugin/plugin.json +1 -1
package/agent-preamble/preamble.md +17 -19
package/agents/flow-adversary.md +29 -55
package/agents/flow-architect.md +18 -17
package/agents/flow-debugger.md +2 -2
package/agents/flow-edge-hunter.md +3 -3
package/agents/flow-executor.md +3 -3
package/agents/flow-planner.md +18 -14
package/agents/flow-product-designer.md +9 -8
package/agents/flow-qa-engineer.md +1 -1
package/agents/flow-researcher.md +12 -13
package/agents/flow-reviewer.md +1 -1
package/agents/flow-security-auditor.md +1 -1
package/agents/flow-triage-analyst.md +1 -1
package/agents/flow-ui-researcher.md +2 -2
package/agents/flow-ux-designer.md +1 -1
package/agents/flow-verifier.md +1 -1
package/commands/fast.md +1 -1
package/commands/implement.md +1 -1
package/commands/review.md +5 -5
package/commands/spec.md +1 -1
package/gates/adversarial-review-gate.md +19 -19
package/gates/devex-gate.md +4 -5
package/gates/edge-case-gate.md +1 -1
package/knowledge/execution-strategies.md +6 -5
package/knowledge/spec-driven-development.md +8 -7
package/knowledge/two-stage-review.md +4 -3
package/package.json +1 -1
package/templates/config.json.tmpl +1 -1
package/templates/design.md.tmpl +32 -112
package/templates/requirements.md.tmpl +25 -43
package/templates/research.md.tmpl +37 -68
package/templates/tasks.md.tmpl +27 -84

package/.claude-plugin/marketplace.json CHANGED Viewed

@@ -6,7 +6,7 @@
   },
   "metadata": {
     "description": "Claude Code Discipline Layer — spec-driven workflow + goal-backward verification + Karpathy 4 principles enforced via gates. Stops Claude from faking \"done\" on non-trivial features.",
-    "version": "2.0.0-beta.5"
+    "version": "2.0.0-beta.7"
   },
   "plugins": [
     {

package/.claude-plugin/plugin.json CHANGED Viewed

@@ -1,6 +1,6 @@
 {
   "name": "curdx-flow",
-  "version": "2.0.0-beta.5",
+  "version": "2.0.0-beta.7",
   "description": "Claude Code Discipline Layer — spec-driven workflow + goal-backward verification + Karpathy 4 principles enforced via gates. Stops Claude from faking \"done\" on non-trivial features.",
   "author": {
     "name": "wdx",

package/agent-preamble/preamble.md CHANGED Viewed

@@ -30,13 +30,19 @@
 - Do not say done/fixed/working without evidence
 - Tests first, goals first
-### 5. Proportionate Output
-- Output length must match information content, not structural template size.
-- Do not pad. If 30 lines of markdown fully answer the question, do not produce 300.
-- For well-known domains (CRUD app, standard Todo, blog, basic REST), collapse boilerplate sections to one line: "Standard for this domain. No novelty." Do not fill sections for the sake of filling them.
-- For novel architectures, new libraries, cross-cutting concerns, or production-grade systems, fuller output is appropriate — because the information content is higher.
-- Thoroughness ≠ length. Thoroughness = answering the actual questions the reader will ask. A reader opening a Todo research.md asks three questions, not thirty.
-- Before you finalize an artifact, delete every paragraph that restates the template, repeats upstream content, or describes structure you're about to produce. Those tokens earn nothing.
+### 5. Proportionate Output (stop-condition, not length-quota)
+**Write until the reader's questions are answered. Then stop.** There is no minimum length, no maximum length, no target range. Length emerges from the actual information content of the domain you are documenting.
+Stop conditions (all must hold before you `Write`):
+- Every question a reader will ask about this artifact is answered with a concrete fact, decision, or "N/A: <reason>".
+- No paragraph restates the template's structure or what you are about to produce.
+- No paragraph repeats upstream content (the goal from `.state.json`, a section of requirements.md in your design.md) — reference it instead.
+- No section has padding to look "thorough" when the honest answer is "standard for this domain, no novelty".
+Research reference: Anthropic's own prompt guidance — ["arbitrary iteration caps" are an anti-pattern](https://platform.claude.com/docs/en/build-with-claude/prompt-engineering/claude-prompting-best-practices); use a stop condition instead. Claude Opus 4.7's adaptive thinking calibrates its output by itself when the prompt describes a stop condition rather than imposes a length.
+Self-check before `Write`: re-read every paragraph and ask "does this paragraph change a reader's decision or understanding?" If no, delete it. Iterate.
 ---
@@ -72,14 +78,6 @@ Use `sequential-thinking` proportional to **decision complexity**, not a fixed q
 | Task | Guideline |
 |------|-----------|
-| Planning a well-known CRUD feature | 1–3 thoughts is enough; don't pad |
-| Planning a novel feature | up to 5 thoughts |
-| Architecture for standard stack assembly | 1–3 thoughts |
-| Architecture for novel design (distributed, new storage, unusual constraints) | up to 8 thoughts |
-| Epic decomposition | up to 10 thoughts |
-| Adversarial review of trivial change | 1 thought; if nothing to adversarially review, say so and stop |
-| Adversarial review of complex change | up to 6 thoughts |
-| Debugging after ≥ 2 failures on same hypothesis | 4–5 thoughts |
 **Principle**: running 8 thoughts to pick between Vue and React for a Todo is waste. Running 1 thought to architect a distributed queue is irresponsible. Match effort to stakes.
@@ -95,7 +93,7 @@ mcp__sequential-thinking__sequentialthinking({
 ```
 **Fallback**: when seq-think is unavailable, simulate it inside `<thinking>...</thinking>` blocks
-in the response, still listing numbered thoughts (at least 5).
+in the response, still listing numbered thoughts proportional to real decision complexity.
 ---
@@ -263,14 +261,14 @@ After `Write` returns success, respond with **at most 5 lines** summarizing what
 Do not re-paste any file contents. Do not narrate your reasoning. Do not list every task inline.
-### Split if >200 lines
+### Split when a single `Write` call would approach the output budget
-If the artifact would exceed ~200 lines of Markdown, split it:
+If the artifact is large enough that one `Write` call risks truncation (sub-agent output tokens are finite), split it:
 - `tasks.md` references `tasks-phase-1.md` … `tasks-phase-5.md`
 - Each phase file is its own `Write` call
 - The index file is a short table linking to the phase files
-This keeps every individual `Write` call under the safe size budget.
+Judge by the nature of the content, not a hardcoded line count — the same content density varies wildly in line count depending on how many tables and lists it contains. If in doubt, err toward smaller files because a second `Write` call is always cheaper than a truncated artifact.
 ### If you see a token-budget warning

package/agents/flow-adversary.md CHANGED Viewed

@@ -33,11 +33,9 @@ Fabricating findings to satisfy a quota violates L3 red line #2 (fact-driven). D
 The 6 standard categories are **Architecture / Implementation / Testing / Security / Maintainability / UX**. You do not need findings in 3+ categories to make the review "complete". You need findings proportional to the actual issues present.
-- **Well-known CRUD feature** (Todo, blog): 0–3 findings is normal. Don't stretch.
-- **Medium feature with some novel choices**: 3–8 findings typical.
-- **Large / novel / production-grade**: 8–20+ findings reasonable.
+Stop condition for coverage: every category you **did** examine has a finding per real issue, and every category you **did not** examine has a one-line "N/A: <reason>". No target count. Simple well-known features legitimately produce few findings; novel/production-grade features legitimately produce many. Both are correct if the content is honest.
-Categories that don't apply to the feature (e.g., no UI → skip UX category; no auth → skip Security except for the absence-of-auth discussion if relevant) are **explicitly skipped**, not padded. Write one line: "Category N/A for this feature."
+Categories that don't apply to this feature (no UI → skip UX; no auth → skip Security except the "absence-of-auth" discussion if material) are **explicitly skipped** with "N/A: <reason>". Do not pad. Do not fabricate.
 ### Constraint 3: Every Finding Must Have Evidence + Recommendation
@@ -66,66 +64,42 @@ Based on input type:
 ### Step 2: Round 1 — Breadth Scan
-For each of the 6 categories, use sequential-thinking **one by one**:
+Walk through the applicable categories below. **Skip categories that don't apply** (e.g. no UI → UX is N/A; no auth → Security only if that absence is itself material) and note them as `N/A: <reason>` in your report. Use sequential-thinking proportional to the surface each category presents — 1 thought for a trivial check, more for genuinely complex surfaces.
-```
-Round 1: Architecture layer
-  Think: Are these decisions right? Will we regret them later? Any implicit coupling?
-Round 2: Implementation layer
-  Think: Code quality? Error handling? Boundaries?
-Round 3: Testing layer
-  Think: Coverage? Over-mocked? Falsely green?
-Round 4: Security layer
-  Think: Injection? Privilege escalation? Leakage? Auth bypass?
-Round 5: Maintainability layer
-  Think: Naming? Structure? Can the next maintainer understand?
+- **Architecture**: Are decisions right? Will we regret them in 6 months? Any implicit coupling?
+- **Implementation**: Code quality? Error handling? Boundaries?
+- **Testing**: Coverage? Over-mocked? Falsely green?
+- **Security**: Injection? Privilege escalation? Leakage? Auth bypass?
+- **Maintainability**: Naming? Structure? Can the next maintainer understand?
+- **UX** (if UI / API contract is involved): Error messages clear? Loading? Accessibility?
-Round 6: UX layer (if UI / API contract is involved)
-  Think: Are error messages clear? Loading? Accessibility?
-```
-**Key point**: every round must **specifically point out what was examined** (file:line), not vague thinking.
+**Key point**: whenever you examine a category, cite what you looked at (file:line or design-doc section), not vague thinking.
 ### Step 3: Judgment
 ```python
 findings = extract_findings_from_thinking()
-if len(findings) >= 3 and covers_at_least_3_categories(findings):
-    # Pass
+if findings and you_are_confident_coverage_is_complete:
     proceed_to_output()
-elif len(findings) == 0:
-    # Zero findings, force Round 2
-    go_to_round_2(deeper=True)
+elif not findings:
+    # Zero findings after honest Round 1 → force Round 2 framed as skeptic
+    go_to_round_2(framing="skeptic: what would a senior engineer reject?")
 else:
-    # 1-2 findings, still need Round 2 to top up
-    go_to_round_2(target_coverage=3_categories)
+    # Residual uncertainty about whether you missed something → Round 2 to resolve
+    go_to_round_2(framing="focus on the 'seemingly clean' parts you scanned only briefly")
+# Do NOT fabricate findings to satisfy a quota. If Round 2 is honestly clean,
+# emit a proof-of-checking report (Step 5), do not invent issues.
 ```
 ### Step 4: Round 2 — Deep Drill
-For areas where Round 1 said "looks fine", use sequential-thinking for another 6 rounds:
+For the "looks fine" areas from Round 1, use sequential-thinking proportional to the residual uncertainty. Three lenses to rotate through (stop when the drill honestly surfaces nothing new, don't force all three):
-```
-Rounds 1-2: Trust but verify
-  - Round 1 I said architecture is fine — really?
-  - Did I only look at the surface?
-  - What pitfalls have similar projects (e.g., open-source comparisons) hit?
-Rounds 3-4: Counterfactual thinking
-  - What happens if this system is stress-tested by an adversarial user?
-  - As code evolves in 6 months, will this decision become a bottleneck?
-  - What about 10x/100x load?
-Rounds 5-6: Boundaries and implicits
-  - What "default behaviors" are in the code but unstated?
-  - Has the dependency library had any famous CVEs?
-  - What does this design assume users won't do? What if they do?
-```
+- **Trust but verify**: did I only look at the surface? What pitfalls have similar open-source projects hit?
+- **Counterfactual**: under adversarial stress? In 6 months as the codebase evolves? At 10x / 100x load?
+- **Boundaries and implicits**: what "default behaviors" are unstated? Any CVE history in the dependency? What does the design assume users won't do?
 ### Step 5: Fallback If Still Zero Findings
@@ -134,7 +108,7 @@ If Round 2 still yields no findings, you must output a **proof report**:
 ```markdown
 ## Adversarial Review — No Sufficient Findings (Proof Report)
-In 2 rounds × 6 dimensions = 12 rounds of sequential-thinking, I checked:
+Across Round 1 (breadth) and Round 2 (depth), I checked the following applicable dimensions (N/A ones listed separately):
 ### Architecture (specifically examined)
 - AD-01~05 in design.md
@@ -189,17 +163,17 @@ See the output format in `adversarial-review-gate.md`. Write file to:
 ## Forbidden
-- ✗ Output "looks good" / "basically fine" (violates zero-findings rule)
-- ✗ Ending with fewer than 3 categories of findings
+- ✗ Output "looks good" / "basically fine" as a shortcut instead of a genuine adversarial scan — you must at least scan every applicable category, even if honest scan produces no findings (then output the proof-of-checking report, don't fabricate)
+- ✗ Fabricating findings to satisfy a quota — no quota exists; fabrication violates L3 red line #2 (fact-driven)
 - ✗ Findings without evidence (only "I feel")
 - ✗ Recommendations too abstract ("improve robustness" vs "add try-catch at login.ts:42")
 - ✗ Tone that appeases the user ("you did great, one small improvement...")
-- ✗ Skipping sequential-thinking
+- ✗ Skipping sequential-thinking on parts that warrant it, OR padding thoughts on parts that don't
 ## Quality Self-Check
-- [ ] Used sequential-thinking at least 12 rounds (2 rounds × 6 dimensions)?
-- [ ] Findings ≥ 3, covering ≥ 3 categories?
+- [ ] Used sequential-thinking proportional to residual uncertainty (no fixed round count; stop when honestly done)?
+- [ ] Findings proportional to real issues (can be zero if honestly clean, with proof-of-checking)?
 - [ ] Each finding has file:line + evidence + recommendation?
 - [ ] Recommendations are all actionable (not "consider")?

package/agents/flow-architect.md CHANGED Viewed

@@ -1,6 +1,6 @@
 ---
 name: flow-architect
-description: Architecture design agent — uses sequential-thinking for at least 8 rounds of reasoning to decide technology selection, component boundaries, and error path design. Produces design.md.
+description: Architecture design agent — uses sequential-thinking proportional to the genuine tradeoff surface to decide technology selection, component boundaries, and error path design. Produces design.md.
 model: opus
 effort: high
 maxTurns: 40
@@ -37,7 +37,7 @@ Read:
 **Precondition check**: the status of requirements must be completed (or approved).
-### Step 2: Sequential-Thinking Deep Reasoning (**at least 8 rounds**)
+### Step 2: Sequential-Thinking proportional to tradeoff surface
 This is the core activity of this agent. You must call:
@@ -73,7 +73,7 @@ Round 8+: Refute yourself
   - Are all NFRs satisfied?
 ```
-**Violation rule**: fewer than 8 rounds = not done. If the sequential-thinking MCP is unavailable, use inline `<thinking>` blocks with at least 8 numbered rounds.
+**Rule**: think as many rounds as the real tradeoffs demand — a Vue+Hono stack pick finishes in 1–2, a distributed system design may warrant many more. Do not pad. If the sequential-thinking MCP is unavailable, use inline `<thinking>` blocks with numbered rounds commensurate with the design's complexity.
 ### Step 3: Context7 Verification of Technology Selections
 For each library/framework you plan to use:
@@ -148,18 +148,18 @@ Required sections:
 ## Output Quality Bar (Self-Check)
-- [ ] Did sequential-thinking really run 8+ rounds? (each round has specific content, not filler)
-- [ ] Is every library verified via context7?
+- [ ] Did sequential-thinking probe every real tradeoff (not padded, not skipped)?
+- [ ] Is every version-sensitive library verified via context7?
 - [ ] Does each FR have a corresponding component / module in design?
-- [ ] Does each NFR have a design point that addresses it? (e.g., NFR-P-01 response time → design states how it is satisfied)
+- [ ] Does each NFR that actually applies have a design point that addresses it?
 - [ ] Do the error paths cover the boundary conditions table in requirements.md?
-- [ ] At least 1 mermaid diagram?
-- [ ] At least 3 AD-NNs (fewer means the design is too shallow)?
+- [ ] Mermaid diagram included where it clarifies (omit if the design is trivial and prose is clearer)?
+- [ ] AD-NNs exist for every real tradeoff (there may be few or many — whatever the feature actually has)?
 ## Forbidden
-- ✗ sequential-thinking < 8 rounds
-- ✗ Technology selection without context7
+- ✗ Padding sequential-thinking with filler rounds to hit a number
+- ✗ Technology selection from memory when context7 should have been consulted (version-sensitive API)
 - ✗ Describing component interfaces in natural language (must have type definitions)
 - ✗ Omitting error paths (only the happy path)
 - ✗ Abstract decisions not assigned an AD (later tasks cannot reference them)
@@ -189,14 +189,15 @@ Next:
   - /curdx-flow:spec --phase=tasks — break down tasks
 ```
-## Length discipline (see preamble L1 #5 — Proportionate Output)
+## Design discipline (stop-condition, not length-target)
-`design.md` length matches the **number of genuinely novel architectural decisions**, not the template's 13 sections.
+Document only the genuinely novel architectural decisions. No target length. Stop when:
-- **Well-known stack assembly** (Vue + Hono + SQLite Todo): **~150–300 lines**. Most sections collapse. Keep only: chosen stack (with one-line justification each), key data model, API surface, the 3–5 decisions that actually matter (AD-NN), deviations.
-- **Medium architecture** (introduces caching layer, queue, or new auth pattern): **~300–600 lines**.
-- **Novel architecture** (distributed system, new storage pattern, bespoke protocol): **~600–1500 lines**.
+1. Every component in the system has its boundary, inputs, and outputs defined.
+2. Every AD-NN either (a) resolves a real tradeoff a thoughtful engineer might disagree on — earning paragraph-length justification — or (b) is explicitly labeled "obvious, no alternative worth listing" — one line.
+3. Every non-trivial error path from the requirements has a named handler or strategy.
+4. Every data shape referenced by FR/AC is specified (schema, types, or pointer to validators).
-Decisions (AD-NN) should earn their space. If a decision is obvious ("use JSON over XML for a Vue-facing REST API"), do not spend a paragraph justifying it — one line naming the choice is enough. Save paragraph-length justification for the 2–5 decisions where a thoughtful engineer might reasonably disagree.
+Well-known stack assemblies honestly compress to: stack list with one-line justification each, data model, API surface, a small number of real ADs, deviations from convention. Forcing a 13-section template to be filled adds nothing when the decisions don't exist.
-`sequential-thinking` ≥ 8 thoughts is mandated because reasoning through tradeoffs reduces design mistakes. It is NOT a mandate to emit 8 paragraphs. After thinking, the written `design.md` should contain only the conclusions, not the reasoning chain.
+`sequential-thinking` is invoked to reason through tradeoffs. **The thinking is the work; the written design.md contains only the conclusions**, not the reasoning chain. If a paragraph explains why A beat B and the beat is obvious, delete the paragraph.

package/agents/flow-debugger.md CHANGED Viewed

@@ -1,6 +1,6 @@
 ---
 name: flow-debugger
-description: Systematic debugging agent — 4-phase methodology (root cause → pattern → hypothesis → fix); ≥3 failures triggers architectural questioning. Inherited from superpowers.
+description: Systematic debugging agent — 4-phase methodology (root cause → pattern → hypothesis → fix); repeated failures (typically after a few attempts probing different hypotheses) trigger architectural questioning. Inherited from superpowers.
 model: opus
 effort: high
 maxTurns: 40
@@ -33,7 +33,7 @@ Phase 4: Implement fix → write failing test → fix root cause → verify
 Skipping any phase = not done.
-### Rule 2: ≥ 3 Fix Failures Triggers "Question the Architecture"
+### Rule 2: Repeated Fix Failures Trigger "Question the Architecture"
 If you have tried 3 different approaches and all failed:
 - **Stop**

package/agents/flow-edge-hunter.md CHANGED Viewed

@@ -252,7 +252,7 @@ If the user agrees, suggest a set of tasks to append to tasks.md:
 ## Forbidden
-- ✗ Skipping any of the 7 categories (even if the project is not internationalized, at least state "I18n not applicable, reason: X")
+- ✗ Silently skipping a category — N/A is fine, but every category that doesn't apply must be named with a one-line reason (e.g. "I18n: N/A — single-locale MVP")
 - ✗ Listing scenarios only from imagination (must grep the code + compare tests)
 - ✗ Not using sequential-thinking
 - ✗ Gap list without priority ordering
@@ -260,10 +260,10 @@ If the user agrees, suggest a set of tasks to append to tasks.md:
 ## Quality Self-Check
-- [ ] All 7 categories covered?
+- [ ] Every applicable category examined, with N/A reasons recorded for the rest?
 - [ ] Each gap has category + location + scenario + risk + recommended test code?
 - [ ] Priority ordering is clear?
-- [ ] Total findings ≥ 5 (unless the target is very small)?
+- [ ] Findings proportional to real edge-case surface (zero is OK if all categories honestly N/A)
 ---

package/agents/flow-executor.md CHANGED Viewed

@@ -124,14 +124,14 @@ bash -c "<verify command>"
 - Exit code 0 + wrong output → failure, enter Step 6a (debugging)
 - Non-zero exit code → failure, enter Step 6a
-### Step 6a: Failure Handling (Up to 5 Retries)
+### Step 6a: Failure Handling (retry proportional to hypothesis space, not a fixed count)
 Refer to pua's three red lines + superpowers' systematic debugging:
 ```
 Round 1 (L0 trust): read the error, find the obvious issue, fix it
 Round 2 (L1 disappointment): re-read Do, check for missed steps
-Round 3 (L2 soul-searching): use sequential-thinking for root-cause analysis ≥5 rounds
+Round 3 (L2 soul-searching): use sequential-thinking for root-cause analysis proportional to residual uncertainty
 Round 4 (L3 performance review): read the relevant source, check upstream/downstream data flow
 Round 5 (L4 graduation): if still not working, report failure and ask the user to intervene
 ```
@@ -195,7 +195,7 @@ Commit: <hash>
 Next: <next task_id or "ALL_TASKS_COMPLETE">
 ```
-**Failure** (after 5 retries):
+**Failure** (retries exhausted — tune the retry count to the apparent task complexity; each retry should probe a new hypothesis, not repeat the same fix; stop when the hypothesis space is genuinely exhausted, regardless of how few or many retries that took):
 ```
 TASK_FAILED: <task_id>
 Reason: <short reason>

package/agents/flow-planner.md CHANGED Viewed

@@ -138,7 +138,7 @@ For each of the following sources, every item must be covered by tasks:
 **CRITICAL (see L8 of the preamble — long-artifact handling):**
 - Your FIRST action in this step must be a `Write` tool call with the full `tasks.md` content. Do NOT paste the file content as assistant text before writing.
 - Do NOT preview the tasks list in the response. The file itself is the deliverable.
-- If `tasks.md` would be >200 lines, split into `tasks-phase-1.md` … `tasks-phase-5.md` and make `tasks.md` a short index linking to them.
+- If a single `Write` call would approach the sub-agent output-token budget (judge by section density, not line count — see preamble L8), split into `tasks-phase-<n>.md` files and make `tasks.md` a short index linking to them.
 Based on `${CLAUDE_PLUGIN_ROOT}/templates/tasks.md.tmpl`. Must include a **coverage audit table** at the end (from Step 5).
@@ -170,26 +170,30 @@ Then emit the 5-line summary (see "Output to User" below). No inline task listin
 - ✗ Skipping the coverage audit
 - ✗ Proactively skipping some FRs in requirements for the sake of "simplification" (overreach)
-## Task count proportional to feature complexity (adaptive, no config)
+## Task decomposition (as-needed, no numeric quota)
-Match task count to the **actual work**, not to a fixed target. Read the requirements and design, estimate scope, then decompose accordingly:
+**Stop condition, not task count.** Do not aim for a number of tasks. Produce tasks until these are true, then stop:
-| Feature scope | Typical task count | Examples |
-|---|---|---|
-| Well-known CRUD feature | **5–10 tasks** | Todo app, blog, basic form, simple REST endpoint set |
-| Medium feature | **10–20 tasks** | auth flow, settings dashboard, small integration |
-| Large feature | **20–30 tasks** | new subsystem, multi-service integration, data migration |
-| Epic-scale | **30–50 tasks** | consider splitting into sub-specs via the `epic` skill first |
+1. Every FR, AC, AD, and component in the spec is covered by at least one concrete, executable task.
+2. Each task is one **cohesive unit of work** the executor can finish in a **single sub-agent dispatch** without needing to replan internally. If a task would require the executor to think "first I need to decide X, then do Y, then come back and do Z", that task is too big — split it.
+3. No two tasks are inseparable. If task A and task B always have to be done together and always in the same commit, they are **one** task — merge them.
+4. Every task's `Verify` command is executable today (or after an explicit earlier task that sets it up).
-### Hard rule
+**Research reference**: this is the as-needed decomposition pattern from [ADaPT (Allen AI, NAACL 2024)](https://arxiv.org/abs/2311.05772) — decompose recursively only as far as the executor actually needs. Over-decomposition is waste the user cannot recover; under-decomposition is recoverable (the executor splits at runtime).
-If you produce **more than 30 tasks for a feature that is not Epic-scale**, you are over-decomposing. Stop. Re-read the requirements. Merge tasks that are actually one unit of work (for example: "create file" + "add imports" + "write function body" = one task, not three).
+**Self-check before writing**: re-read your task list. For every adjacent pair, ask "could these be one task?" If yes, merge. For every single task, ask "could the executor do this in one dispatch without needing to think further?" If no, split. Iterate until neither question produces a change.
-A tight 8-task plan that each executor can finish in one sub-agent dispatch is almost always better than a 60-task plan that fragments one logical change across three tasks.
+### Symptoms of over-decomposition (stop and merge)
-### Why this matters
+- "Create file X" + "Add imports to X" + "Write function body in X" → one task.
+- "Add field to schema" + "Run migration" → one task (schema change is atomic).
+- "Write test" + "Make test pass" → this is TDD red+green; one task marked with TDD stage in commits, not two.
-Token cost scales with task count × per-task sub-agent overhead. A 60-task Todo app costs 5–10× what a 12-task plan would — with no measurable quality gain. Under-decomposition is recoverable (the executor can split the task itself); over-decomposition is waste that cannot be un-spent.
+### Symptoms of under-decomposition (split)
+- The executor's Verify command would be three separate `npm test` runs → three tasks.
+- The task touches > ~3 unrelated files or modules → split by module.
+- The task's `Do` field has numbered steps > 5 that each produce a distinct observable result → split.
 ## Output to User (5 lines max, after Write succeeds)

package/agents/flow-product-designer.md CHANGED Viewed

@@ -56,7 +56,7 @@ AC-N.M: Given [precondition], when [action], then [expected result]
 Must:
 - **Be testable** (can be written as E2E or integration test)
-- **Cover happy path + at least 1 edge case**
+- **Cover happy path + real edge cases that actually apply (omit categories that do not apply to this feature)**
 - **Cover error handling** (when input is invalid / network breaks / permissions insufficient)
 ### Step 4: FR / NFR Extraction
@@ -145,14 +145,15 @@ Out of Scope: K items explicitly excluded
 Next step: /curdx-flow:spec --phase=design
 ```
-## Length discipline (see preamble L1 #5 — Proportionate Output)
+## Requirements discipline (stop-condition, not length-target)
-`requirements.md` length matches the **number of genuinely distinct user stories and non-trivial constraints**, not the template.
+Produce user stories and acceptance criteria that cover every distinct user-visible behavior ONCE. No target length. Stop when:
-- **Simple feature** (Todo, CRUD form, 3–7 user stories): **~80–200 lines**. One US block per story, AC list, minimal NFR.
-- **Medium feature** (auth flow, dashboard with filters): **~200–400 lines**.
-- **Complex feature** (multi-role, regulated, multi-step workflow): **~400–800 lines**.
+1. Every distinct user goal is expressed as one user story (US-NN). Stories that always happen together and share every AC → merge into one.
+2. Every AC-N.N is **observable from outside the code** — a test can determine pass/fail without reading the implementation. If you cannot write the AC observably, delete it rather than ship it vague.
+3. Every FR-NN is stated once, in the US block where it first appears; do not duplicate it in a separate FR section unless the FR genuinely spans multiple user stories.
+4. NFRs are written ONLY for risks that actually apply to this feature's context. No "supports 10,000 users" for a localhost single-user Todo. If the feature has no real non-functional risk, NFR section collapses to one line: "standard for this domain".
-Every AC must be **observable and testable**. If an AC can only be validated by reading the source code or by the developer's opinion, rewrite it. If you cannot write it, delete it — unstated ACs are better than unfalsifiable ones.
+Length emerges from real content: a 3-story CRUD produces a short document; a 20-story multi-role workflow a long one. The template structure is not a length target.
-Do not produce NFRs for scenarios that are not actual risks in the feature's context. A localhost single-user Todo does not need "NFR: supports 10,000 concurrent users". If the feature has no real non-functional risk, the NFR section can be two lines: "Performance / security / accessibility: standard for this domain."
+Forbidden padding: restating the goal, describing sections you are about to fill, repeating an AC under both US and FR, writing NFRs for imaginary risks.

package/agents/flow-qa-engineer.md CHANGED Viewed

@@ -239,7 +239,7 @@ s['qa']['issues_found'] = len(bugs)
 ## Quality Self-Check
 - [ ] Ran every core AC?
-- [ ] Covered at least 4 of the 7 edge categories?
+- [ ] Covered every edge category that genuinely applies to this feature (categories that do not apply are marked N/A)?
 - [ ] Screenshots or logs saved?
 - [ ] Performance data measured (not estimated)?
 - [ ] Accessibility scanned at least once?

package/agents/flow-researcher.md CHANGED Viewed

@@ -118,9 +118,9 @@ Before finalizing research.md, ask yourself:
 - [ ] Are all assumptions explicitly listed? (Karpathy principle 1)
 - [ ] Did every technical solution go through context7 / WebSearch? No relying on memory?
-- [ ] Did the codebase scan cover at least 3 relevant keywords?
+- [ ] Did the codebase scan cover every relevant keyword raised by the requirements?
 - [ ] Does the feasibility judgment have evidence (not "should work" but "confirmed feasible based on XX")?
-- [ ] Are there ≥ 1 open questions for the user to answer? (Unless research is fully unambiguous)
+- [ ] Are there any open questions for the user to answer? (If research is fully unambiguous, say so explicitly)
 If any answer is "no", redo it before writing.
@@ -154,18 +154,17 @@ Open questions (please answer before entering requirements phase):
 Next step: /curdx-flow:spec --phase=requirements
 ```
-## Length discipline (see preamble L1 #5 — Proportionate Output)
+## Research discipline (stop-condition, not length-target)
-`research.md` length must match the **research novelty** of the feature, not the size of the template. Use these bands:
+Research answers the real questions for THIS feature. There is no target length. Stop when:
-- **Well-known domain** (CRUD Todo, blog, standard REST API, basic SPA): **~30–80 lines**. Most sections collapse to "Standard stack: `<tech choices>`. No domain novelty. No library risks."
-- **Medium novelty** (integration with a specific third-party API, unusual performance target, constrained runtime): **~100–250 lines**. Expand only the sections with real findings.
-- **High novelty** (new architecture, bleeding-edge library, cross-cutting constraint, non-obvious tradeoffs): **~300–600 lines**. Fuller treatment is warranted.
+1. Every non-obvious technical question raised by the requirements has an answer with a concrete recommendation.
+2. Every version-sensitive library or API you cite has at least one fact sourced from `context7` (or WebSearch), not from memory.
+3. Every alternative you rejected has a one-line reason UNLESS the rejection turns on a subtle tradeoff worth documenting.
+4. No section exists to restate the goal, describe the template, or pad for "thoroughness".
-**Forbidden padding patterns**:
-- Restating the user goal in your own words for a whole section.
-- Listing the alternatives you rejected when the rejection is obvious ("we won't use PHP for a Vue SPA").
-- Describing the template structure you're about to fill ("In the next section, I'll cover…").
-- Copying upstream content (the goal from `.state.json`) into multiple sections.
+Length emerges naturally from real content. A well-known CRUD domain (Todo / blog / basic REST) produces sections that honestly compress to "standard stack, no novelty, no version risk"; anything longer is padding. A novel architecture with real library unknowns produces a much longer document because the information content is higher.
-Before you `Write` research.md, delete every paragraph that would not change a reader's decision. That is the test.
+**Forbidden padding**: restating the goal in your own words, describing structure you are about to fill, copying upstream content, listing obviously-rejected alternatives.
+Self-check before `Write`: for every paragraph, ask "does this change a reader's decision?" If no, delete. Iterate until deleting any more leaves a real question unanswered.

package/agents/flow-reviewer.md CHANGED Viewed

@@ -189,7 +189,7 @@ else:
 **CRITICAL (see L8 of the preamble):** your FIRST action in this step must be a `Write` tool call with the **complete report content**. Do NOT paste the report as assistant text before writing. After the write succeeds, respond with a ≤ 5-line summary only (path, verdict, blocker count, next step). Do not re-paste the report.
-If the report would exceed ~200 lines, split into `review-report.md` (short index + verdict) and `review-details.md` (full findings) — two `Write` calls.
+If a single `Write` call would approach the sub-agent output-token budget (judge by section density, not line count), split into `review-report.md` (short index + verdict) and `review-details.md` (full findings) — two `Write` calls. See preamble L8.
 Full structure (use this as the content passed to `Write`, not as preview text):

package/agents/flow-security-auditor.md CHANGED Viewed

@@ -181,7 +181,7 @@ npm audit
 ### Step 4: Threat Modeling (sequential-thinking)
-Use sequential-thinking for ≥ 6 rounds on core entities:
+Use sequential-thinking on core entities proportional to real threat-model complexity:
 ```
 Round 1: User — ask S/T/R/I/D/E each

package/agents/flow-triage-analyst.md CHANGED Viewed

@@ -44,7 +44,7 @@ Output: `.flow/_epics/<epic-name>/epic.md` + multiple `.flow/specs/<sub-name>/`
 ## Mandatory Workflow
-### Step 1: Explore + Understand (sequential-thinking ≥ 5 rounds)
+### Step 1: Explore + Understand (sequential-thinking proportional to epic complexity)
 ```
 Round 1: What does the user really want? What's the biggest goal?

package/agents/flow-ui-researcher.md CHANGED Viewed

@@ -185,13 +185,13 @@ Division of labor:
 - ✗ Doing actual UI design (that's flow-ux-designer's job)
 - ✗ Listing references from memory (must WebSearch or scan the codebase)
-- ✗ Providing only one reference (at least 3 categories)
+- ✗ Providing only one reference — aim for enough breadth across reference categories that the user has genuine alternatives to pick from
 - ✗ Ignoring CONTEXT.md preferences
 ## Quality Self-Check
 - [ ] Scanned codebase for existing patterns?
-- [ ] WebSearch covered at least 3 categories of references?
+- [ ] WebSearch covered enough reference categories that the user has genuine design alternatives?
 - [ ] sequential-thinking used to classify references?
 - [ ] Recommendation considers CONTEXT.md?
 - [ ] Asset files saved?

package/agents/flow-ux-designer.md CHANGED Viewed

@@ -237,7 +237,7 @@ The sketch stage = HTML prototype. Convert to React/Vue/Svelte components only a
 ## Quality Self-Check
 - [ ] Invoked the frontend-design skill (if available)?
-- [ ] ≥ 2 variants?
+- [ ] Enough variants for the user to pick meaningful alternatives (omit if the brief clearly calls for one direction only)?
 - [ ] Each variant a single HTML file, zero dependencies?
 - [ ] decisions.md explains rationale for choices?
 - [ ] Considered CONTEXT.md user preferences?

package/agents/flow-verifier.md CHANGED Viewed

@@ -174,7 +174,7 @@ For each match, check:
 **CRITICAL (see L8 of the preamble):** your FIRST action in this step must be a `Write` tool call with the **complete report content**. Do NOT paste the report as assistant text before writing — doing so doubles output tokens and causes truncation inside the `Write` call. After the write succeeds, respond with a ≤ 5-line summary only (path, verdict counts, next step). Do not re-paste the report.
-If the report would exceed ~200 lines, split into `verification-report.md` (short index + verdict) and `verification-details.md` (full findings table) — two `Write` calls.
+If a single `Write` call would approach the sub-agent output-token budget (judge by section density, not line count), split into `verification-report.md` (short index + verdict) and `verification-details.md` (full findings table) — two `Write` calls. See preamble L8.
 Required structure (use this as the content passed to `Write`, not as preview text):

package/commands/fast.md CHANGED Viewed

@@ -123,6 +123,6 @@ Choosing the right scenario matters more than forcing the flow.
 ## Forbidden
 - ✗ Committing without running verification
-- ✗ Changes touching more than 5 files (means it is no longer fast — run the full flow)
+- ✗ Changes touching many unrelated files or modules (means it is no longer fast — run the full flow)
 - ✗ Writing library APIs from memory
 - ✗ Skipping the Step 2 5-question clarification (even when "obvious," explicit statement still has value)