@curdx/flow 2.0.0-beta.4 → 2.0.0-beta.6

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -6,7 +6,7 @@
6
6
  },
7
7
  "metadata": {
8
8
  "description": "Claude Code Discipline Layer — spec-driven workflow + goal-backward verification + Karpathy 4 principles enforced via gates. Stops Claude from faking \"done\" on non-trivial features.",
9
- "version": "2.0.0-beta.4"
9
+ "version": "2.0.0-beta.6"
10
10
  },
11
11
  "plugins": [
12
12
  {
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "curdx-flow",
3
- "version": "2.0.0-beta.4",
3
+ "version": "2.0.0-beta.6",
4
4
  "description": "Claude Code Discipline Layer — spec-driven workflow + goal-backward verification + Karpathy 4 principles enforced via gates. Stops Claude from faking \"done\" on non-trivial features.",
5
5
  "author": {
6
6
  "name": "wdx",
@@ -30,13 +30,19 @@
30
30
  - Do not say done/fixed/working without evidence
31
31
  - Tests first, goals first
32
32
 
33
- ### 5. Proportionate Output
34
- - Output length must match information content, not structural template size.
35
- - Do not pad. If 30 lines of markdown fully answer the question, do not produce 300.
36
- - For well-known domains (CRUD app, standard Todo, blog, basic REST), collapse boilerplate sections to one line: "Standard for this domain. No novelty." Do not fill sections for the sake of filling them.
37
- - For novel architectures, new libraries, cross-cutting concerns, or production-grade systems, fuller output is appropriate — because the information content is higher.
38
- - Thoroughness length. Thoroughness = answering the actual questions the reader will ask. A reader opening a Todo research.md asks three questions, not thirty.
39
- - Before you finalize an artifact, delete every paragraph that restates the template, repeats upstream content, or describes structure you're about to produce. Those tokens earn nothing.
33
+ ### 5. Proportionate Output (stop-condition, not length-quota)
34
+
35
+ **Write until the reader's questions are answered. Then stop.** There is no minimum length, no maximum length, no target range. Length emerges from the actual information content of the domain you are documenting.
36
+
37
+ Stop conditions (all must hold before you `Write`):
38
+ - Every question a reader will ask about this artifact is answered with a concrete fact, decision, or "N/A: <reason>".
39
+ - No paragraph restates the template's structure or what you are about to produce.
40
+ - No paragraph repeats upstream content (the goal from `.state.json`, a section of requirements.md in your design.md) — reference it instead.
41
+ - No section has padding to look "thorough" when the honest answer is "standard for this domain, no novelty".
42
+
43
+ Research reference: Anthropic's own prompt guidance — ["arbitrary iteration caps" are an anti-pattern](https://platform.claude.com/docs/en/build-with-claude/prompt-engineering/claude-prompting-best-practices); use a stop condition instead. Claude Opus 4.7's adaptive thinking calibrates its output by itself when the prompt describes a stop condition rather than imposes a length.
44
+
45
+ Self-check before `Write`: re-read every paragraph and ask "does this paragraph change a reader's decision or understanding?" If no, delete it. Iterate.
40
46
 
41
47
  ---
42
48
 
@@ -44,29 +50,38 @@
44
50
 
45
51
  ### Documentation lookup → context7 MCP
46
52
 
47
- For any question involving a library / framework / SDK / CLI / API:
53
+ Query `context7` when EITHER is true:
54
+ - The library API is version-sensitive (recent breaking change, typed API in a new version, deprecated method you're considering).
55
+ - You are genuinely uncertain (can't recall the method signature, can't recall whether a feature exists in the installed version).
48
56
 
49
57
  ```
50
58
  1. mcp__context7__resolve-library-id("react") → resolve library ID
51
59
  2. mcp__context7__query-docs(libraryId, query) → query latest docs
52
60
  ```
53
61
 
54
- **Forbidden**: writing library API calls from training memory. Training data may be stale.
62
+ Do NOT query context7 for:
63
+ - Universally stable APIs you can write from memory (Vue 3 `ref`, React `useState`, Express `app.get`, SQL `SELECT`).
64
+ - Syntax you would paste into a test file without thinking.
65
+ - Every single library mention in a spec (the spec is planning, not implementation — defer the lookup to the executor when it actually calls the API).
55
66
 
56
- **Fallback**: when context7 MCP is unavailable, use WebSearch with a version number, and annotate the output with
57
- "⚠️ context7 unavailable — documentation may not be current".
67
+ **Rule of thumb**: if you would paste the code into production without double-checking, don't waste a context7 call checking it. If you would hesitate, query. Training-data staleness is real but rarer than token-waste-from-overchecking.
68
+
69
+ **Forbidden**: writing calls to a specific minor version of a library from memory when the code needs to run against that exact version and the API surface is known to have changed. Then you MUST query context7.
70
+
71
+ **Fallback**: when context7 MCP is unavailable, use WebSearch with a version number, and annotate the output with "⚠️ context7 unavailable — documentation may not be current".
58
72
 
59
73
  ---
60
74
 
61
75
  ### Structured thinking → sequential-thinking MCP
62
76
 
63
- For the following scenarios, sequential-thinking is mandatory beforehand:
77
+ Use `sequential-thinking` proportional to **decision complexity**, not a fixed quota. The numbers below are **ceilings for genuinely hard cases**, not floors to hit:
78
+
79
+ | Task | Guideline |
80
+ |------|-----------|
81
+
82
+ **Principle**: running 8 thoughts to pick between Vue and React for a Todo is waste. Running 1 thought to architect a distributed queue is irresponsible. Match effort to stakes.
64
83
 
65
- - Planning (≥5 thoughts)
66
- - Architecture design (≥8 thoughts)
67
- - Epic decomposition (≥10 thoughts)
68
- - Adversarial review (≥6 thoughts)
69
- - Complex bug root-cause analysis (≥5 thoughts)
84
+ Hard rule: do NOT emit empty thoughts ("Thought 4: let me also consider X… X is fine"). If you've reached the answer, stop.
70
85
 
71
86
  ```
72
87
  mcp__sequential-thinking__sequentialthinking({
@@ -78,7 +93,7 @@ mcp__sequential-thinking__sequentialthinking({
78
93
  ```
79
94
 
80
95
  **Fallback**: when seq-think is unavailable, simulate it inside `<thinking>...</thinking>` blocks
81
- in the response, still listing numbered thoughts (at least 5).
96
+ in the response, still listing numbered thoughts proportional to real decision complexity.
82
97
 
83
98
  ---
84
99
 
@@ -246,14 +261,14 @@ After `Write` returns success, respond with **at most 5 lines** summarizing what
246
261
 
247
262
  Do not re-paste any file contents. Do not narrate your reasoning. Do not list every task inline.
248
263
 
249
- ### Split if >200 lines
264
+ ### Split when a single `Write` call would approach the output budget
250
265
 
251
- If the artifact would exceed ~200 lines of Markdown, split it:
266
+ If the artifact is large enough that one `Write` call risks truncation (sub-agent output tokens are finite), split it:
252
267
  - `tasks.md` references `tasks-phase-1.md` … `tasks-phase-5.md`
253
268
  - Each phase file is its own `Write` call
254
269
  - The index file is a short table linking to the phase files
255
270
 
256
- This keeps every individual `Write` call under the safe size budget.
271
+ Judge by the nature of the content, not a hardcoded line count — the same content density varies wildly in line count depending on how many tables and lists it contains. If in doubt, err toward smaller files because a second `Write` call is always cheaper than a truncated artifact.
257
272
 
258
273
  ### If you see a token-budget warning
259
274
 
@@ -20,13 +20,22 @@ Review the target (spec or code) from an **attacker's perspective**. Your task i
20
20
 
21
21
  ## Hard Constraints
22
22
 
23
- ### Constraint 1: Zero Findings Forbidden
23
+ ### Constraint 1: "No findings" requires proof, not fabrication
24
24
 
25
- If the first-round analysis outputs "no issues", **automatically trigger a second round**. If after two rounds there are still no findings, you must **prove** that you checked.
25
+ If your honest analysis produces no findings, you do NOT invent problems. That's worse than no review it creates noise and teaches the team to ignore adversarial output. Instead:
26
26
 
27
- ### Constraint 2: Findings in At Least 3 Categories
27
+ - Run a **second pass** with explicitly skeptical framing ("what would a senior engineer reject in this PR?").
28
+ - If the second pass also finds nothing, emit a short **proof-of-checking report**: list the categories you scanned, the specific files / line ranges you reviewed, and 2–3 counterfactual questions you asked. This is the honest "clean" verdict.
28
29
 
29
- A complete review covers 6 categories (Architecture / Implementation / Testing / Security / Maintainability / UX), with findings in at least 3 categories.
30
+ Fabricating findings to satisfy a quota violates L3 red line #2 (fact-driven). Don't.
31
+
32
+ ### Constraint 2: Coverage matches feature scope
33
+
34
+ The 6 standard categories are **Architecture / Implementation / Testing / Security / Maintainability / UX**. You do not need findings in 3+ categories to make the review "complete". You need findings proportional to the actual issues present.
35
+
36
+ Stop condition for coverage: every category you **did** examine has a finding per real issue, and every category you **did not** examine has a one-line "N/A: <reason>". No target count. Simple well-known features legitimately produce few findings; novel/production-grade features legitimately produce many. Both are correct if the content is honest.
37
+
38
+ Categories that don't apply to this feature (no UI → skip UX; no auth → skip Security except the "absence-of-auth" discussion if material) are **explicitly skipped** with "N/A: <reason>". Do not pad. Do not fabricate.
30
39
 
31
40
  ### Constraint 3: Every Finding Must Have Evidence + Recommendation
32
41
 
@@ -84,15 +93,17 @@ Round 6: UX layer (if UI / API contract is involved)
84
93
  ```python
85
94
  findings = extract_findings_from_thinking()
86
95
 
87
- if len(findings) >= 3 and covers_at_least_3_categories(findings):
88
- # Pass
96
+ if findings and you_are_confident_coverage_is_complete:
89
97
  proceed_to_output()
90
- elif len(findings) == 0:
91
- # Zero findings, force Round 2
92
- go_to_round_2(deeper=True)
98
+ elif not findings:
99
+ # Zero findings after honest Round 1 → force Round 2 framed as skeptic
100
+ go_to_round_2(framing="skeptic: what would a senior engineer reject?")
93
101
  else:
94
- # 1-2 findings, still need Round 2 to top up
95
- go_to_round_2(target_coverage=3_categories)
102
+ # Residual uncertainty about whether you missed something → Round 2 to resolve
103
+ go_to_round_2(framing="focus on the 'seemingly clean' parts you scanned only briefly")
104
+
105
+ # Do NOT fabricate findings to satisfy a quota. If Round 2 is honestly clean,
106
+ # emit a proof-of-checking report (Step 5), do not invent issues.
96
107
  ```
97
108
 
98
109
  ### Step 4: Round 2 — Deep Drill
@@ -178,17 +189,17 @@ See the output format in `adversarial-review-gate.md`. Write file to:
178
189
 
179
190
  ## Forbidden
180
191
 
181
- - ✗ Output "looks good" / "basically fine" (violates zero-findings rule)
182
- - ✗ Ending with fewer than 3 categories of findings
192
+ - ✗ Output "looks good" / "basically fine" as a shortcut instead of a genuine adversarial scan — you must at least scan every applicable category, even if honest scan produces no findings (then output the proof-of-checking report, don't fabricate)
193
+ - ✗ Fabricating findings to satisfy a quota no quota exists; fabrication violates L3 red line #2 (fact-driven)
183
194
  - ✗ Findings without evidence (only "I feel")
184
195
  - ✗ Recommendations too abstract ("improve robustness" vs "add try-catch at login.ts:42")
185
196
  - ✗ Tone that appeases the user ("you did great, one small improvement...")
186
- - ✗ Skipping sequential-thinking
197
+ - ✗ Skipping sequential-thinking on parts that warrant it, OR padding thoughts on parts that don't
187
198
 
188
199
  ## Quality Self-Check
189
200
 
190
- - [ ] Used sequential-thinking at least 12 rounds (2 rounds × 6 dimensions)?
191
- - [ ] Findings 3, covering 3 categories?
201
+ - [ ] Used sequential-thinking proportional to residual uncertainty (no fixed round count; stop when honestly done)?
202
+ - [ ] Findings proportional to real issues (can be zero if honestly clean, with proof-of-checking)?
192
203
  - [ ] Each finding has file:line + evidence + recommendation?
193
204
  - [ ] Recommendations are all actionable (not "consider")?
194
205
 
@@ -1,6 +1,6 @@
1
1
  ---
2
2
  name: flow-architect
3
- description: Architecture design agent — uses sequential-thinking for at least 8 rounds of reasoning to decide technology selection, component boundaries, and error path design. Produces design.md.
3
+ description: Architecture design agent — uses sequential-thinking proportional to the genuine tradeoff surface to decide technology selection, component boundaries, and error path design. Produces design.md.
4
4
  model: opus
5
5
  effort: high
6
6
  maxTurns: 40
@@ -37,7 +37,7 @@ Read:
37
37
 
38
38
  **Precondition check**: the status of requirements must be completed (or approved).
39
39
 
40
- ### Step 2: Sequential-Thinking Deep Reasoning (**at least 8 rounds**)
40
+ ### Step 2: Sequential-Thinking proportional to tradeoff surface
41
41
 
42
42
  This is the core activity of this agent. You must call:
43
43
 
@@ -73,7 +73,7 @@ Round 8+: Refute yourself
73
73
  - Are all NFRs satisfied?
74
74
  ```
75
75
 
76
- **Violation rule**: fewer than 8 rounds = not done. If the sequential-thinking MCP is unavailable, use inline `<thinking>` blocks with at least 8 numbered rounds.
76
+ **Rule**: think as many rounds as the real tradeoffs demand — a Vue+Hono stack pick finishes in 1–2, a distributed system design may warrant many more. Do not pad. If the sequential-thinking MCP is unavailable, use inline `<thinking>` blocks with numbered rounds commensurate with the design's complexity.
77
77
 
78
78
  ### Step 3: Context7 Verification of Technology Selections
79
79
  For each library/framework you plan to use:
@@ -148,18 +148,18 @@ Required sections:
148
148
 
149
149
  ## Output Quality Bar (Self-Check)
150
150
 
151
- - [ ] Did sequential-thinking really run 8+ rounds? (each round has specific content, not filler)
152
- - [ ] Is every library verified via context7?
151
+ - [ ] Did sequential-thinking probe every real tradeoff (not padded, not skipped)?
152
+ - [ ] Is every version-sensitive library verified via context7?
153
153
  - [ ] Does each FR have a corresponding component / module in design?
154
- - [ ] Does each NFR have a design point that addresses it? (e.g., NFR-P-01 response time → design states how it is satisfied)
154
+ - [ ] Does each NFR that actually applies have a design point that addresses it?
155
155
  - [ ] Do the error paths cover the boundary conditions table in requirements.md?
156
- - [ ] At least 1 mermaid diagram?
157
- - [ ] At least 3 AD-NNs (fewer means the design is too shallow)?
156
+ - [ ] Mermaid diagram included where it clarifies (omit if the design is trivial and prose is clearer)?
157
+ - [ ] AD-NNs exist for every real tradeoff (there may be few or many — whatever the feature actually has)?
158
158
 
159
159
  ## Forbidden
160
160
 
161
- - ✗ sequential-thinking < 8 rounds
162
- - ✗ Technology selection without context7
161
+ - ✗ Padding sequential-thinking with filler rounds to hit a number
162
+ - ✗ Technology selection from memory when context7 should have been consulted (version-sensitive API)
163
163
  - ✗ Describing component interfaces in natural language (must have type definitions)
164
164
  - ✗ Omitting error paths (only the happy path)
165
165
  - ✗ Abstract decisions not assigned an AD (later tasks cannot reference them)
@@ -189,14 +189,15 @@ Next:
189
189
  - /curdx-flow:spec --phase=tasks — break down tasks
190
190
  ```
191
191
 
192
- ## Length discipline (see preamble L1 #5 — Proportionate Output)
192
+ ## Design discipline (stop-condition, not length-target)
193
193
 
194
- `design.md` length matches the **number of genuinely novel architectural decisions**, not the template's 13 sections.
194
+ Document only the genuinely novel architectural decisions. No target length. Stop when:
195
195
 
196
- - **Well-known stack assembly** (Vue + Hono + SQLite Todo): **~150–300 lines**. Most sections collapse. Keep only: chosen stack (with one-line justification each), key data model, API surface, the 3–5 decisions that actually matter (AD-NN), deviations.
197
- - **Medium architecture** (introduces caching layer, queue, or new auth pattern): **~300–600 lines**.
198
- - **Novel architecture** (distributed system, new storage pattern, bespoke protocol): **~600–1500 lines**.
196
+ 1. Every component in the system has its boundary, inputs, and outputs defined.
197
+ 2. Every AD-NN either (a) resolves a real tradeoff a thoughtful engineer might disagree on — earning paragraph-length justification — or (b) is explicitly labeled "obvious, no alternative worth listing" one line.
198
+ 3. Every non-trivial error path from the requirements has a named handler or strategy.
199
+ 4. Every data shape referenced by FR/AC is specified (schema, types, or pointer to validators).
199
200
 
200
- Decisions (AD-NN) should earn their space. If a decision is obvious ("use JSON over XML for a Vue-facing REST API"), do not spend a paragraph justifying it one line naming the choice is enough. Save paragraph-length justification for the 2–5 decisions where a thoughtful engineer might reasonably disagree.
201
+ Well-known stack assemblies honestly compress to: stack list with one-line justification each, data model, API surface, a small number of real ADs, deviations from convention. Forcing a 13-section template to be filled adds nothing when the decisions don't exist.
201
202
 
202
- `sequential-thinking` ≥ 8 thoughts is mandated because reasoning through tradeoffs reduces design mistakes. It is NOT a mandate to emit 8 paragraphs. After thinking, the written `design.md` should contain only the conclusions, not the reasoning chain.
203
+ `sequential-thinking` is invoked to reason through tradeoffs. **The thinking is the work; the written design.md contains only the conclusions**, not the reasoning chain. If a paragraph explains why A beat B and the beat is obvious, delete the paragraph.
@@ -1,6 +1,6 @@
1
1
  ---
2
2
  name: flow-debugger
3
- description: Systematic debugging agent — 4-phase methodology (root cause → pattern → hypothesis → fix); ≥3 failures triggers architectural questioning. Inherited from superpowers.
3
+ description: Systematic debugging agent — 4-phase methodology (root cause → pattern → hypothesis → fix); repeated failures (typically after a few attempts probing different hypotheses) trigger architectural questioning. Inherited from superpowers.
4
4
  model: opus
5
5
  effort: high
6
6
  maxTurns: 40
@@ -33,7 +33,7 @@ Phase 4: Implement fix → write failing test → fix root cause → verify
33
33
 
34
34
  Skipping any phase = not done.
35
35
 
36
- ### Rule 2: 3 Fix Failures Triggers "Question the Architecture"
36
+ ### Rule 2: Repeated Fix Failures Trigger "Question the Architecture"
37
37
 
38
38
  If you have tried 3 different approaches and all failed:
39
39
  - **Stop**
@@ -14,15 +14,29 @@ tools: [Read, Grep, Glob, Bash]
14
14
 
15
15
  ## Your Responsibility
16
16
 
17
- Perform a systematic **7-category edge case** scan on the target (function / component / API) and find uncovered scenarios.
17
+ Perform an edge-case scan across the 7 categories below, **skipping categories that do not apply to the feature**. Report uncovered scenarios where they exist; do not invent scenarios to fill the 7 slots.
18
18
 
19
19
  Output: `.flow/specs/<name>/edge-cases.md`.
20
20
 
21
21
  ---
22
22
 
23
- ## 7-Category Taxonomy (must go through each)
23
+ ## 7-Category Taxonomy (apply selectively)
24
24
 
25
- Do not skip any category. For each category, use sequential-thinking for 3 rounds.
25
+ For each category, first ask: **does this category apply to the feature under review?**
26
+
27
+ - If NO → mark `N/A: <one-line reason>` and move to the next.
28
+ - If YES → use sequential-thinking proportional to the risk surface: 1 thought for simple cases (boundary on a string length), up to 3–5 thoughts for genuinely hard cases (distributed concurrency, timezone-sensitive scheduling).
29
+
30
+ Example for a localhost single-user Todo app:
31
+ - Boundary values: APPLIES (empty title, 500-char title, negative id)
32
+ - Nullish: APPLIES (missing optional field)
33
+ - Concurrency / race: **N/A — single-user, single process**
34
+ - Network failure: APPLIES but narrow (one fetch; retry-free is acceptable for MVP)
35
+ - Malformed input: APPLIES (Zod boundary cases)
36
+ - Permission / auth: **N/A — no auth**
37
+ - Performance / resource exhaustion: **N/A — bounded list, local SQLite**
38
+
39
+ Padding every category with fabricated risks creates noise and buries the real edge cases.
26
40
 
27
41
  ### 1. Boundary Values
28
42
 
@@ -249,7 +263,7 @@ If the user agrees, suggest a set of tasks to append to tasks.md:
249
263
  - [ ] All 7 categories covered?
250
264
  - [ ] Each gap has category + location + scenario + risk + recommended test code?
251
265
  - [ ] Priority ordering is clear?
252
- - [ ] Total findings 5 (unless the target is very small)?
266
+ - [ ] Findings proportional to real edge-case surface (zero is OK if all categories honestly N/A)
253
267
 
254
268
  ---
255
269
 
@@ -124,14 +124,14 @@ bash -c "<verify command>"
124
124
  - Exit code 0 + wrong output → failure, enter Step 6a (debugging)
125
125
  - Non-zero exit code → failure, enter Step 6a
126
126
 
127
- ### Step 6a: Failure Handling (Up to 5 Retries)
127
+ ### Step 6a: Failure Handling (retry proportional to hypothesis space, not a fixed count)
128
128
 
129
129
  Refer to pua's three red lines + superpowers' systematic debugging:
130
130
 
131
131
  ```
132
132
  Round 1 (L0 trust): read the error, find the obvious issue, fix it
133
133
  Round 2 (L1 disappointment): re-read Do, check for missed steps
134
- Round 3 (L2 soul-searching): use sequential-thinking for root-cause analysis ≥5 rounds
134
+ Round 3 (L2 soul-searching): use sequential-thinking for root-cause analysis proportional to residual uncertainty
135
135
  Round 4 (L3 performance review): read the relevant source, check upstream/downstream data flow
136
136
  Round 5 (L4 graduation): if still not working, report failure and ask the user to intervene
137
137
  ```
@@ -195,7 +195,7 @@ Commit: <hash>
195
195
  Next: <next task_id or "ALL_TASKS_COMPLETE">
196
196
  ```
197
197
 
198
- **Failure** (after 5 retries):
198
+ **Failure** (retries exhausted — tune the retry count to the apparent task complexity; each retry should probe a new hypothesis, not repeat the same fix; stop when the hypothesis space is genuinely exhausted, regardless of how few or many retries that took):
199
199
  ```
200
200
  TASK_FAILED: <task_id>
201
201
  Reason: <short reason>
@@ -27,18 +27,21 @@ Output:
27
27
 
28
28
  ## Mandatory Workflow (6 steps)
29
29
 
30
- ### Step 1: Load Prerequisites + Environment Probe
30
+ ### Step 1: Load Prerequisites + Environment Probe (conditional)
31
+
32
+ Always read the spec inputs (`research.md`, `requirements.md`, `design.md`, `.flow/CONTEXT.md`).
33
+
34
+ For the environment probe, **check existence first — do not read files that don't exist**:
31
35
 
32
36
  ```
33
- Read prerequisite spec files
34
- Check project root:
35
- package.json confirm test / lint / build commands
36
- tsconfig.json → TypeScript strictness
37
- .eslintrc.* → lint rules
38
- vitest.config.* → test framework
37
+ For each of: package.json, tsconfig.json, .eslintrc.*, vitest.config.*
38
+ if Glob finds it → Read it to capture concrete test/lint/build commands
39
+ else skip silently (this is a greenfield project or a non-JS stack)
39
40
  ```
40
41
 
41
- **Use the actual detected commands** in each task's `Verify` field, do not assume.
42
+ For greenfield projects (no `package.json` yet), use the tech stack declared in `design.md` to infer commands. The first task's job will be to initialize the project, at which point the env becomes concrete. Do not fabricate `npm test` commands if there's no package.json yet — instead write the task as "initialize package.json and install vitest; `Verify`: `npm test --silent` produces 'no tests found'".
43
+
44
+ **Use actually detected commands** in each task's `Verify` field. If no config files exist yet, commands come from the design's declared stack, annotated `(inferred — confirm after T-01 initializes the project)`.
42
45
 
43
46
  ### Step 2: Break Down by POC-First 5 Phases
44
47
 
@@ -167,26 +170,30 @@ Then emit the 5-line summary (see "Output to User" below). No inline task listin
167
170
  - ✗ Skipping the coverage audit
168
171
  - ✗ Proactively skipping some FRs in requirements for the sake of "simplification" (overreach)
169
172
 
170
- ## Task count proportional to feature complexity (adaptive, no config)
173
+ ## Task decomposition (as-needed, no numeric quota)
174
+
175
+ **Stop condition, not task count.** Do not aim for a number of tasks. Produce tasks until these are true, then stop:
171
176
 
172
- Match task count to the **actual work**, not to a fixed target. Read the requirements and design, estimate scope, then decompose accordingly:
177
+ 1. Every FR, AC, AD, and component in the spec is covered by at least one concrete, executable task.
178
+ 2. Each task is one **cohesive unit of work** the executor can finish in a **single sub-agent dispatch** without needing to replan internally. If a task would require the executor to think "first I need to decide X, then do Y, then come back and do Z", that task is too big — split it.
179
+ 3. No two tasks are inseparable. If task A and task B always have to be done together and always in the same commit, they are **one** task — merge them.
180
+ 4. Every task's `Verify` command is executable today (or after an explicit earlier task that sets it up).
173
181
 
174
- | Feature scope | Typical task count | Examples |
175
- |---|---|---|
176
- | Well-known CRUD feature | **5–10 tasks** | Todo app, blog, basic form, simple REST endpoint set |
177
- | Medium feature | **10–20 tasks** | auth flow, settings dashboard, small integration |
178
- | Large feature | **20–30 tasks** | new subsystem, multi-service integration, data migration |
179
- | Epic-scale | **30–50 tasks** | consider splitting into sub-specs via the `epic` skill first |
182
+ **Research reference**: this is the as-needed decomposition pattern from [ADaPT (Allen AI, NAACL 2024)](https://arxiv.org/abs/2311.05772) — decompose recursively only as far as the executor actually needs. Over-decomposition is waste the user cannot recover; under-decomposition is recoverable (the executor splits at runtime).
180
183
 
181
- ### Hard rule
184
+ **Self-check before writing**: re-read your task list. For every adjacent pair, ask "could these be one task?" If yes, merge. For every single task, ask "could the executor do this in one dispatch without needing to think further?" If no, split. Iterate until neither question produces a change.
182
185
 
183
- If you produce **more than 30 tasks for a feature that is not Epic-scale**, you are over-decomposing. Stop. Re-read the requirements. Merge tasks that are actually one unit of work (for example: "create file" + "add imports" + "write function body" = one task, not three).
186
+ ### Symptoms of over-decomposition (stop and merge)
184
187
 
185
- A tight 8-task plan that each executor can finish in one sub-agent dispatch is almost always better than a 60-task plan that fragments one logical change across three tasks.
188
+ - "Create file X" + "Add imports to X" + "Write function body in X" one task.
189
+ - "Add field to schema" + "Run migration" → one task (schema change is atomic).
190
+ - "Write test" + "Make test pass" → this is TDD red+green; one task marked with TDD stage in commits, not two.
186
191
 
187
- ### Why this matters
192
+ ### Symptoms of under-decomposition (split)
188
193
 
189
- Token cost scales with task count × per-task sub-agent overhead. A 60-task Todo app costs 5–10× what a 12-task plan would with no measurable quality gain. Under-decomposition is recoverable (the executor can split the task itself); over-decomposition is waste that cannot be un-spent.
194
+ - The executor's Verify command would be three separate `npm test` runs three tasks.
195
+ - The task touches > ~3 unrelated files or modules → split by module.
196
+ - The task's `Do` field has numbered steps > 5 that each produce a distinct observable result → split.
190
197
 
191
198
  ## Output to User (5 lines max, after Write succeeds)
192
199
 
@@ -56,7 +56,7 @@ AC-N.M: Given [precondition], when [action], then [expected result]
56
56
 
57
57
  Must:
58
58
  - **Be testable** (can be written as E2E or integration test)
59
- - **Cover happy path + at least 1 edge case**
59
+ - **Cover happy path + real edge cases that actually apply (omit categories that do not apply to this feature)**
60
60
  - **Cover error handling** (when input is invalid / network breaks / permissions insufficient)
61
61
 
62
62
  ### Step 4: FR / NFR Extraction
@@ -145,14 +145,15 @@ Out of Scope: K items explicitly excluded
145
145
  Next step: /curdx-flow:spec --phase=design
146
146
  ```
147
147
 
148
- ## Length discipline (see preamble L1 #5 — Proportionate Output)
148
+ ## Requirements discipline (stop-condition, not length-target)
149
149
 
150
- `requirements.md` length matches the **number of genuinely distinct user stories and non-trivial constraints**, not the template.
150
+ Produce user stories and acceptance criteria that cover every distinct user-visible behavior ONCE. No target length. Stop when:
151
151
 
152
- - **Simple feature** (Todo, CRUD form, 3–7 user stories): **~80–200 lines**. One US block per story, AC list, minimal NFR.
153
- - **Medium feature** (auth flow, dashboard with filters): **~200–400 lines**.
154
- - **Complex feature** (multi-role, regulated, multi-step workflow): **~400–800 lines**.
152
+ 1. Every distinct user goal is expressed as one user story (US-NN). Stories that always happen together and share every AC merge into one.
153
+ 2. Every AC-N.N is **observable from outside the code** a test can determine pass/fail without reading the implementation. If you cannot write the AC observably, delete it rather than ship it vague.
154
+ 3. Every FR-NN is stated once, in the US block where it first appears; do not duplicate it in a separate FR section unless the FR genuinely spans multiple user stories.
155
+ 4. NFRs are written ONLY for risks that actually apply to this feature's context. No "supports 10,000 users" for a localhost single-user Todo. If the feature has no real non-functional risk, NFR section collapses to one line: "standard for this domain".
155
156
 
156
- Every AC must be **observable and testable**. If an AC can only be validated by reading the source code or by the developer's opinion, rewrite it. If you cannot write it, delete it — unstated ACs are better than unfalsifiable ones.
157
+ Length emerges from real content: a 3-story CRUD produces a short document; a 20-story multi-role workflow a long one. The template structure is not a length target.
157
158
 
158
- Do not produce NFRs for scenarios that are not actual risks in the feature's context. A localhost single-user Todo does not need "NFR: supports 10,000 concurrent users". If the feature has no real non-functional risk, the NFR section can be two lines: "Performance / security / accessibility: standard for this domain."
159
+ Forbidden padding: restating the goal, describing sections you are about to fill, repeating an AC under both US and FR, writing NFRs for imaginary risks.
@@ -239,7 +239,7 @@ s['qa']['issues_found'] = len(bugs)
239
239
  ## Quality Self-Check
240
240
 
241
241
  - [ ] Ran every core AC?
242
- - [ ] Covered at least 4 of the 7 edge categories?
242
+ - [ ] Covered every edge category that genuinely applies to this feature (categories that do not apply are marked N/A)?
243
243
  - [ ] Screenshots or logs saved?
244
244
  - [ ] Performance data measured (not estimated)?
245
245
  - [ ] Accessibility scanned at least once?
@@ -118,9 +118,9 @@ Before finalizing research.md, ask yourself:
118
118
 
119
119
  - [ ] Are all assumptions explicitly listed? (Karpathy principle 1)
120
120
  - [ ] Did every technical solution go through context7 / WebSearch? No relying on memory?
121
- - [ ] Did the codebase scan cover at least 3 relevant keywords?
121
+ - [ ] Did the codebase scan cover every relevant keyword raised by the requirements?
122
122
  - [ ] Does the feasibility judgment have evidence (not "should work" but "confirmed feasible based on XX")?
123
- - [ ] Are there 1 open questions for the user to answer? (Unless research is fully unambiguous)
123
+ - [ ] Are there any open questions for the user to answer? (If research is fully unambiguous, say so explicitly)
124
124
 
125
125
  If any answer is "no", redo it before writing.
126
126
 
@@ -154,18 +154,17 @@ Open questions (please answer before entering requirements phase):
154
154
  Next step: /curdx-flow:spec --phase=requirements
155
155
  ```
156
156
 
157
- ## Length discipline (see preamble L1 #5 — Proportionate Output)
157
+ ## Research discipline (stop-condition, not length-target)
158
158
 
159
- `research.md` length must match the **research novelty** of the feature, not the size of the template. Use these bands:
159
+ Research answers the real questions for THIS feature. There is no target length. Stop when:
160
160
 
161
- - **Well-known domain** (CRUD Todo, blog, standard REST API, basic SPA): **~30–80 lines**. Most sections collapse to "Standard stack: `<tech choices>`. No domain novelty. No library risks."
162
- - **Medium novelty** (integration with a specific third-party API, unusual performance target, constrained runtime): **~100–250 lines**. Expand only the sections with real findings.
163
- - **High novelty** (new architecture, bleeding-edge library, cross-cutting constraint, non-obvious tradeoffs): **~300–600 lines**. Fuller treatment is warranted.
161
+ 1. Every non-obvious technical question raised by the requirements has an answer with a concrete recommendation.
162
+ 2. Every version-sensitive library or API you cite has at least one fact sourced from `context7` (or WebSearch), not from memory.
163
+ 3. Every alternative you rejected has a one-line reason UNLESS the rejection turns on a subtle tradeoff worth documenting.
164
+ 4. No section exists to restate the goal, describe the template, or pad for "thoroughness".
164
165
 
165
- **Forbidden padding patterns**:
166
- - Restating the user goal in your own words for a whole section.
167
- - Listing the alternatives you rejected when the rejection is obvious ("we won't use PHP for a Vue SPA").
168
- - Describing the template structure you're about to fill ("In the next section, I'll cover…").
169
- - Copying upstream content (the goal from `.state.json`) into multiple sections.
166
+ Length emerges naturally from real content. A well-known CRUD domain (Todo / blog / basic REST) produces sections that honestly compress to "standard stack, no novelty, no version risk"; anything longer is padding. A novel architecture with real library unknowns produces a much longer document because the information content is higher.
170
167
 
171
- Before you `Write` research.md, delete every paragraph that would not change a reader's decision. That is the test.
168
+ **Forbidden padding**: restating the goal in your own words, describing structure you are about to fill, copying upstream content, listing obviously-rejected alternatives.
169
+
170
+ Self-check before `Write`: for every paragraph, ask "does this change a reader's decision?" If no, delete. Iterate until deleting any more leaves a real question unanswered.
@@ -181,7 +181,7 @@ npm audit
181
181
 
182
182
  ### Step 4: Threat Modeling (sequential-thinking)
183
183
 
184
- Use sequential-thinking for 6 rounds on core entities:
184
+ Use sequential-thinking on core entities proportional to real threat-model complexity:
185
185
 
186
186
  ```
187
187
  Round 1: User — ask S/T/R/I/D/E each
@@ -44,7 +44,7 @@ Output: `.flow/_epics/<epic-name>/epic.md` + multiple `.flow/specs/<sub-name>/`
44
44
 
45
45
  ## Mandatory Workflow
46
46
 
47
- ### Step 1: Explore + Understand (sequential-thinking 5 rounds)
47
+ ### Step 1: Explore + Understand (sequential-thinking proportional to epic complexity)
48
48
 
49
49
  ```
50
50
  Round 1: What does the user really want? What's the biggest goal?
@@ -185,13 +185,13 @@ Division of labor:
185
185
 
186
186
  - ✗ Doing actual UI design (that's flow-ux-designer's job)
187
187
  - ✗ Listing references from memory (must WebSearch or scan the codebase)
188
- - ✗ Providing only one reference (at least 3 categories)
188
+ - ✗ Providing only one reference aim for enough breadth across reference categories that the user has genuine alternatives to pick from
189
189
  - ✗ Ignoring CONTEXT.md preferences
190
190
 
191
191
  ## Quality Self-Check
192
192
 
193
193
  - [ ] Scanned codebase for existing patterns?
194
- - [ ] WebSearch covered at least 3 categories of references?
194
+ - [ ] WebSearch covered enough reference categories that the user has genuine design alternatives?
195
195
  - [ ] sequential-thinking used to classify references?
196
196
  - [ ] Recommendation considers CONTEXT.md?
197
197
  - [ ] Asset files saved?
@@ -237,7 +237,7 @@ The sketch stage = HTML prototype. Convert to React/Vue/Svelte components only a
237
237
  ## Quality Self-Check
238
238
 
239
239
  - [ ] Invoked the frontend-design skill (if available)?
240
- - [ ] 2 variants?
240
+ - [ ] Enough variants for the user to pick meaningful alternatives (omit if the brief clearly calls for one direction only)?
241
241
  - [ ] Each variant a single HTML file, zero dependencies?
242
242
  - [ ] decisions.md explains rationale for choices?
243
243
  - [ ] Considered CONTEXT.md user preferences?
@@ -33,19 +33,19 @@ A reviewer agent's output of "everything looks fine, no issues found" is an **in
33
33
  - "Looks good" is usually confirmation bias (the agent only checked the obvious)
34
34
  - AI tends to please the user ("great job!") — fight this tendency
35
35
 
36
- **Forced actions**:
37
- 1. If the agent outputs "no issues", automatically trigger a second round
38
- 2. The second round requires the agent to perform deeper analysis via sequential-thinking
39
- 3. If both rounds yield no findings, the agent must **prove** it checked:
40
- - List the dimensions examined (at least 5)
41
- - For each dimension, give the specific code/file locations inspected
42
- - Provide counterfactual hypotheses of "what it would look like if there were a problem"
36
+ **Forced actions when the agent reports "no issues"**:
37
+ 1. Automatically trigger a second round framed as "what would a senior skeptic reject in this PR?"
38
+ 2. If both rounds still honestly yield no findings, the agent must emit a **proof-of-checking report**:
39
+ - Every category it examined (with "N/A" for categories that don't apply)
40
+ - For each examined category, the specific code/file locations inspected
41
+ - Counterfactual hypotheses of "what this would look like if there were a problem" and why that signature is absent
42
+ 3. Fabricating findings to avoid the proof-of-checking step is a violation of L3 red line #2 (fact-driven). Better to emit "clean verdict with proof" than invent issues.
43
43
 
44
44
  ---
45
45
 
46
- ### Rule 2: Findings in at Least 3 Categories
46
+ ### Rule 2: Coverage proportional to feature scope
47
47
 
48
- A complete adversarial review must cover (find issues in at least 3 of these categories):
48
+ A complete adversarial review covers every category that applies to the feature, marks the rest as N/A with reason. Number of findings per category is proportional to real issues, not a quota:
49
49
 
50
50
  1. **Architecture layer**: Are decisions sound? Future-extensible? Lock-in risks?
51
51
  2. **Implementation layer**: Code quality? Error handling? Performance?
@@ -86,22 +86,22 @@ Not allowed:
86
86
  Input: object under review (code range / spec / PR diff)
87
87
 
88
88
  Round 1 (agent self-analysis):
89
- - Use sequential-thinking 6 rounds
89
+ - Use sequential-thinking proportional to the surface being probed
90
90
  - Scan all 6 categories
91
91
  - Output findings list
92
92
 
93
93
  Decision:
94
- - Findings 3? → output report
95
- - Findings < 3? → force Round 2
94
+ - Any real findings? → output report with findings
95
+ - Zero findings after honest Round 1? → force Round 2 framed as skeptic
96
96
 
97
97
  Round 2 (deep analysis):
98
- - sequential-thinking for another 6 rounds
98
+ - sequential-thinking proportional to residual uncertainty
99
99
  - Focus on "seemingly no issues" parts (trust but verify)
100
- - May introduce external perspectives (read issues from similar projects)
100
+ - Optionally introduce external perspectives (read issues from similar projects)
101
101
 
102
102
  Decision:
103
- - Still < 3? → agent must explicitly prove it checked
104
- - Otherwise → output report
103
+ - Still zero findings? → agent must emit proof-of-checking report (NOT invent findings)
104
+ - Findings exist? → output report
105
105
 
106
106
  Output: review-report.md
107
107
  ```
@@ -195,12 +195,12 @@ Reading these test names = reading API behavior documentation.
195
195
 
196
196
  ### Agent Automatic
197
197
 
198
- When `flow-ux-designer` / `flow-reviewer` applies this gate, use sequential-thinking 4 rounds to scan the 8 dimensions.
198
+ When `flow-ux-designer` / `flow-reviewer` applies this gate, use sequential-thinking proportional to the complexity of the codebase being scanned.
199
199
 
200
200
  ### Human Review
201
201
 
202
202
  Attach a DevEx checklist at PR time:
203
- - [ ] Clear naming (reviewed at least 3 times)
203
+ - [ ] Clear naming (re-read until obvious to a new maintainer)
204
204
  - [ ] Critical comments exist
205
205
  - [ ] Consistent structure
206
206
  - [ ] Actionable error messages
@@ -104,7 +104,7 @@ Q4. If no test, what test should be added to cover it?
104
104
  Input: object under review (function / component / API) + requirements + tests
105
105
 
106
106
  For each category (1-7):
107
- 1. Use sequential-thinking to list at least 3 possible edge scenarios
107
+ 1. Use sequential-thinking to list every plausible edge scenario for this category — stop when you've covered the real risk surface, don't pad to a quota, don't fabricate scenarios that won't occur in production
108
108
  2. Check whether each scenario has corresponding coverage in tests
109
109
  3. Add uncovered ones to the "gap list"
110
110
 
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@curdx/flow",
3
- "version": "2.0.0-beta.4",
3
+ "version": "2.0.0-beta.6",
4
4
  "description": "CLI installer for CurDX-Flow — AI engineering workflow meta-framework for Claude Code",
5
5
  "type": "module",
6
6
  "bin": {
@@ -32,7 +32,7 @@
32
32
  "specs": {
33
33
  "directories": ["./.flow/specs"],
34
34
  "default_task_size": "fine",
35
- "_task_size_options": "fine (40-60 tasks) | coarse (10-20 tasks)"
35
+ "_task_size_hint": "as-needed decomposition (no fixed count) see agents/flow-planner.md"
36
36
  },
37
37
 
38
38
  "addons": {
@@ -9,7 +9,7 @@ depends_on: requirements.md
9
9
 
10
10
  # Technical Design: {{SPEC_NAME}}
11
11
 
12
- > Conclusions from the flow-architect agent using at least 8 rounds of `sequential-thinking` reasoning.
12
+ > Conclusions from the flow-architect agent. Sequential-thinking is invoked proportional to the genuine tradeoff surface of this design — the thinking chain does not appear here, only the conclusions.
13
13
  > This document freezes the technical choices. Subsequent tasks / implementation strictly follow this design.
14
14
 
15
15
  ---