@curdx/flow 2.0.0-beta.5 → 2.0.0-beta.6
- package/.claude-plugin/marketplace.json +1 -1
- package/.claude-plugin/plugin.json +1 -1
- package/agent-preamble/preamble.md +17 -19
- package/agents/flow-adversary.md +16 -16
- package/agents/flow-architect.md +18 -17
- package/agents/flow-debugger.md +2 -2
- package/agents/flow-edge-hunter.md +1 -1
- package/agents/flow-executor.md +3 -3
- package/agents/flow-planner.md +17 -13
- package/agents/flow-product-designer.md +9 -8
- package/agents/flow-qa-engineer.md +1 -1
- package/agents/flow-researcher.md +12 -13
- package/agents/flow-security-auditor.md +1 -1
- package/agents/flow-triage-analyst.md +1 -1
- package/agents/flow-ui-researcher.md +2 -2
- package/agents/flow-ux-designer.md +1 -1
- package/gates/adversarial-review-gate.md +16 -16
- package/gates/devex-gate.md +2 -2
- package/gates/edge-case-gate.md +1 -1
- package/package.json +1 -1
- package/templates/config.json.tmpl +1 -1
- package/templates/design.md.tmpl +1 -1
package/.claude-plugin/marketplace.json
CHANGED
|
@@ -6,7 +6,7 @@
|
|
|
6
6
|
},
|
|
7
7
|
"metadata": {
|
|
8
8
|
"description": "Claude Code Discipline Layer — spec-driven workflow + goal-backward verification + Karpathy 4 principles enforced via gates. Stops Claude from faking \"done\" on non-trivial features.",
|
|
9
|
-
"version": "2.0.0-beta.
|
|
9
|
+
"version": "2.0.0-beta.6"
|
|
10
10
|
},
|
|
11
11
|
"plugins": [
|
|
12
12
|
{
|
package/.claude-plugin/plugin.json
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
{
|
|
2
2
|
"name": "curdx-flow",
|
|
3
|
-
"version": "2.0.0-beta.
|
|
3
|
+
"version": "2.0.0-beta.6",
|
|
4
4
|
"description": "Claude Code Discipline Layer — spec-driven workflow + goal-backward verification + Karpathy 4 principles enforced via gates. Stops Claude from faking \"done\" on non-trivial features.",
|
|
5
5
|
"author": {
|
|
6
6
|
"name": "wdx",
|
package/agent-preamble/preamble.md
CHANGED
|
@@ -30,13 +30,19 @@
|
|
|
30
30
|
- Do not say done/fixed/working without evidence
|
|
31
31
|
- Tests first, goals first
|
|
32
32
|
|
|
33
|
-
### 5. Proportionate Output
|
|
34
|
-
|
|
35
|
-
|
|
36
|
-
|
|
37
|
-
|
|
38
|
-
-
|
|
39
|
-
-
|
|
33
|
+
### 5. Proportionate Output (stop-condition, not length-quota)
|
|
34
|
+
|
|
35
|
+
**Write until the reader's questions are answered. Then stop.** There is no minimum length, no maximum length, no target range. Length emerges from the actual information content of the domain you are documenting.
|
|
36
|
+
|
|
37
|
+
Stop conditions (all must hold before you `Write`):
|
|
38
|
+
- Every question a reader will ask about this artifact is answered with a concrete fact, decision, or "N/A: <reason>".
|
|
39
|
+
- No paragraph restates the template's structure or what you are about to produce.
|
|
40
|
+
- No paragraph repeats upstream content (the goal from `.state.json`, a section of requirements.md in your design.md) — reference it instead.
|
|
41
|
+
- No section has padding to look "thorough" when the honest answer is "standard for this domain, no novelty".
|
|
42
|
+
|
|
43
|
+
Research reference: Anthropic's own prompt guidance — ["arbitrary iteration caps" are an anti-pattern](https://platform.claude.com/docs/en/build-with-claude/prompt-engineering/claude-prompting-best-practices); use a stop condition instead. Claude Opus 4.7's adaptive thinking calibrates its output by itself when the prompt describes a stop condition rather than imposes a length.
|
|
44
|
+
|
|
45
|
+
Self-check before `Write`: re-read every paragraph and ask "does this paragraph change a reader's decision or understanding?" If no, delete it. Iterate.
|
|
40
46
|
|
|
41
47
|
---
|
|
42
48
|
|
|
@@ -72,14 +78,6 @@ Use `sequential-thinking` proportional to **decision complexity**, not a fixed q
|
|
|
72
78
|
|
|
73
79
|
| Task | Guideline |
|
|
74
80
|
|------|-----------|
|
|
75
|
-
| Planning a well-known CRUD feature | 1–3 thoughts is enough; don't pad |
|
|
76
|
-
| Planning a novel feature | up to 5 thoughts |
|
|
77
|
-
| Architecture for standard stack assembly | 1–3 thoughts |
|
|
78
|
-
| Architecture for novel design (distributed, new storage, unusual constraints) | up to 8 thoughts |
|
|
79
|
-
| Epic decomposition | up to 10 thoughts |
|
|
80
|
-
| Adversarial review of trivial change | 1 thought; if nothing to adversarially review, say so and stop |
|
|
81
|
-
| Adversarial review of complex change | up to 6 thoughts |
|
|
82
|
-
| Debugging after ≥ 2 failures on same hypothesis | 4–5 thoughts |
|
|
83
81
|
|
|
84
82
|
**Principle**: running 8 thoughts to pick between Vue and React for a Todo is waste. Running 1 thought to architect a distributed queue is irresponsible. Match effort to stakes.
|
|
85
83
|
|
|
@@ -95,7 +93,7 @@ mcp__sequential-thinking__sequentialthinking({
|
|
|
95
93
|
```
|
|
96
94
|
|
|
97
95
|
**Fallback**: when seq-think is unavailable, simulate it inside `<thinking>...</thinking>` blocks
|
|
98
|
-
in the response, still listing numbered thoughts
|
|
96
|
+
in the response, still listing numbered thoughts proportional to real decision complexity.
|
|
99
97
|
|
|
100
98
|
---
|
|
101
99
|
|
|
@@ -263,14 +261,14 @@ After `Write` returns success, respond with **at most 5 lines** summarizing what
|
|
|
263
261
|
|
|
264
262
|
Do not re-paste any file contents. Do not narrate your reasoning. Do not list every task inline.
|
|
265
263
|
|
|
266
|
-
### Split
|
|
264
|
+
### Split when a single `Write` call would approach the output budget
|
|
267
265
|
|
|
268
|
-
If the artifact
|
|
266
|
+
If the artifact is large enough that one `Write` call risks truncation (sub-agent output tokens are finite), split it:
|
|
269
267
|
- `tasks.md` references `tasks-phase-1.md` … `tasks-phase-5.md`
|
|
270
268
|
- Each phase file is its own `Write` call
|
|
271
269
|
- The index file is a short table linking to the phase files
|
|
272
270
|
|
|
273
|
-
|
|
271
|
+
Judge by the nature of the content, not a hardcoded line count — the same content density varies wildly in line count depending on how many tables and lists it contains. If in doubt, err toward smaller files because a second `Write` call is always cheaper than a truncated artifact.
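The split rule above can be sketched in a few lines. This is a minimal illustration, not code from the package: `split_artifact`, its parameters, and the two-column index layout are assumptions made for the example; only the `tasks.md` / `tasks-phase-N.md` naming comes from the text.

```python
def split_artifact(phases: dict[int, str], stem: str = "tasks") -> dict[str, str]:
    """Return {filename: content}. Each entry is meant to be its own Write call.

    Hypothetical helper: `phases` maps a phase number to that phase's content.
    """
    files: dict[str, str] = {}
    # The index file is a short table linking to the phase files.
    rows = ["| Phase | File |", "|-------|------|"]
    for number in sorted(phases):
        name = f"{stem}-phase-{number}.md"
        files[name] = phases[number]
        rows.append(f"| {number} | [{name}]({name}) |")
    files[f"{stem}.md"] = "\n".join(rows) + "\n"
    return files
```

Writing the phase files first and the index last means a truncated run leaves complete phase files rather than an index that points at nothing.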
|
|
274
272
|
|
|
275
273
|
### If you see a token-budget warning
|
|
276
274
|
|
package/agents/flow-adversary.md
CHANGED
|
@@ -33,11 +33,9 @@ Fabricating findings to satisfy a quota violates L3 red line #2 (fact-driven). D
|
|
|
33
33
|
|
|
34
34
|
The 6 standard categories are **Architecture / Implementation / Testing / Security / Maintainability / UX**. You do not need findings in 3+ categories to make the review "complete". You need findings proportional to the actual issues present.
|
|
35
35
|
|
|
36
|
-
|
|
37
|
-
- **Medium feature with some novel choices**: 3–8 findings typical.
|
|
38
|
-
- **Large / novel / production-grade**: 8–20+ findings reasonable.
|
|
36
|
+
Stop condition for coverage: every category you **did** examine has a finding per real issue, and every category you **did not** examine has a one-line "N/A: <reason>". No target count. Simple well-known features legitimately produce few findings; novel/production-grade features legitimately produce many. Both are correct if the content is honest.
|
|
39
37
|
|
|
40
|
-
Categories that don't apply to
|
|
38
|
+
Categories that don't apply to this feature (no UI → skip UX; no auth → skip Security except the "absence-of-auth" discussion if material) are **explicitly skipped** with "N/A: <reason>". Do not pad. Do not fabricate.
|
|
41
39
|
|
|
42
40
|
### Constraint 3: Every Finding Must Have Evidence + Recommendation
|
|
43
41
|
|
|
@@ -95,15 +93,17 @@ Round 6: UX layer (if UI / API contract is involved)
|
|
|
95
93
|
```python
|
|
96
94
|
findings = extract_findings_from_thinking()
|
|
97
95
|
|
|
98
|
-
if
|
|
99
|
-
# Pass
|
|
96
|
+
if findings and you_are_confident_coverage_is_complete:
|
|
100
97
|
proceed_to_output()
|
|
101
|
-
elif
|
|
102
|
-
# Zero findings
|
|
103
|
-
go_to_round_2(
|
|
98
|
+
elif not findings:
|
|
99
|
+
# Zero findings after honest Round 1 → force Round 2 framed as skeptic
|
|
100
|
+
go_to_round_2(framing="skeptic: what would a senior engineer reject?")
|
|
104
101
|
else:
|
|
105
|
-
#
|
|
106
|
-
go_to_round_2(
|
|
102
|
+
# Residual uncertainty about whether you missed something → Round 2 to resolve
|
|
103
|
+
go_to_round_2(framing="focus on the 'seemingly clean' parts you scanned only briefly")
|
|
104
|
+
|
|
105
|
+
# Do NOT fabricate findings to satisfy a quota. If Round 2 is honestly clean,
|
|
106
|
+
# emit a proof-of-checking report (Step 5), do not invent issues.
|
|
107
107
|
```
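For readers who want to exercise the decision rule, here is a runnable restatement of the pseudocode above. It is a sketch under assumptions: `Round1Result` and `next_step` are illustrative names, and "confidence that coverage is complete" is reduced to a boolean the reviewer sets honestly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Round1Result:
    findings: list[str]
    coverage_confident: bool  # the reviewer's honest self-assessment


def next_step(result: Round1Result) -> tuple[str, Optional[str]]:
    """Return (action, framing) for the step that follows Round 1."""
    if result.findings and result.coverage_confident:
        return ("output", None)
    if not result.findings:
        # Zero findings after an honest Round 1: force a skeptic-framed Round 2.
        return ("round_2", "skeptic: what would a senior engineer reject?")
    # Findings exist, but some uncertainty about missed issues remains.
    return ("round_2", "focus on the 'seemingly clean' parts you scanned only briefly")
```

Note that no branch compares `len(result.findings)` against a threshold, which is the point of removing the quota.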
|
|
108
108
|
|
|
109
109
|
### Step 4: Round 2 — Deep Drill
|
|
@@ -189,17 +189,17 @@ See the output format in `adversarial-review-gate.md`. Write file to:
|
|
|
189
189
|
|
|
190
190
|
## Forbidden
|
|
191
191
|
|
|
192
|
-
- ✗ Output "looks good" / "basically fine" (
|
|
193
|
-
- ✗
|
|
192
|
+
- ✗ Output "looks good" / "basically fine" as a shortcut instead of a genuine adversarial scan — you must at least scan every applicable category, even if honest scan produces no findings (then output the proof-of-checking report, don't fabricate)
|
|
193
|
+
- ✗ Fabricating findings to satisfy a quota — no quota exists; fabrication violates L3 red line #2 (fact-driven)
|
|
194
194
|
- ✗ Findings without evidence (only "I feel")
|
|
195
195
|
- ✗ Recommendations too abstract ("improve robustness" vs "add try-catch at login.ts:42")
|
|
196
196
|
- ✗ Tone that appeases the user ("you did great, one small improvement...")
|
|
197
|
-
- ✗ Skipping sequential-thinking
|
|
197
|
+
- ✗ Skipping sequential-thinking on parts that warrant it, OR padding thoughts on parts that don't
|
|
198
198
|
|
|
199
199
|
## Quality Self-Check
|
|
200
200
|
|
|
201
|
-
- [ ] Used sequential-thinking
|
|
202
|
-
- [ ] Findings
|
|
201
|
+
- [ ] Used sequential-thinking proportional to residual uncertainty (no fixed round count; stop when honestly done)?
|
|
202
|
+
- [ ] Findings proportional to real issues (can be zero if honestly clean, with proof-of-checking)?
|
|
203
203
|
- [ ] Each finding has file:line + evidence + recommendation?
|
|
204
204
|
- [ ] Recommendations are all actionable (not "consider")?
|
|
205
205
|
|
package/agents/flow-architect.md
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
---
|
|
2
2
|
name: flow-architect
|
|
3
|
-
description: Architecture design agent — uses sequential-thinking
|
|
3
|
+
description: Architecture design agent — uses sequential-thinking proportional to the genuine tradeoff surface to decide technology selection, component boundaries, and error path design. Produces design.md.
|
|
4
4
|
model: opus
|
|
5
5
|
effort: high
|
|
6
6
|
maxTurns: 40
|
|
@@ -37,7 +37,7 @@ Read:
|
|
|
37
37
|
|
|
38
38
|
**Precondition check**: the status of requirements must be completed (or approved).
|
|
39
39
|
|
|
40
|
-
### Step 2: Sequential-Thinking
|
|
40
|
+
### Step 2: Sequential-Thinking proportional to tradeoff surface
|
|
41
41
|
|
|
42
42
|
This is the core activity of this agent. You must call:
|
|
43
43
|
|
|
@@ -73,7 +73,7 @@ Round 8+: Refute yourself
|
|
|
73
73
|
- Are all NFRs satisfied?
|
|
74
74
|
```
|
|
75
75
|
|
|
76
|
-
**
|
|
76
|
+
**Rule**: think as many rounds as the real tradeoffs demand — a Vue+Hono stack pick finishes in 1–2, a distributed system design may warrant many more. Do not pad. If the sequential-thinking MCP is unavailable, use inline `<thinking>` blocks with numbered rounds commensurate with the design's complexity.
|
|
77
77
|
|
|
78
78
|
### Step 3: Context7 Verification of Technology Selections
|
|
79
79
|
For each library/framework you plan to use:
|
|
@@ -148,18 +148,18 @@ Required sections:
|
|
|
148
148
|
|
|
149
149
|
## Output Quality Bar (Self-Check)
|
|
150
150
|
|
|
151
|
-
- [ ] Did sequential-thinking
|
|
152
|
-
- [ ] Is every library verified via context7?
|
|
151
|
+
- [ ] Did sequential-thinking probe every real tradeoff (not padded, not skipped)?
|
|
152
|
+
- [ ] Is every version-sensitive library verified via context7?
|
|
153
153
|
- [ ] Does each FR have a corresponding component / module in design?
|
|
154
|
-
- [ ] Does each NFR have a design point that addresses it?
|
|
154
|
+
- [ ] Does each NFR that actually applies have a design point that addresses it?
|
|
155
155
|
- [ ] Do the error paths cover the boundary conditions table in requirements.md?
|
|
156
|
-
- [ ]
|
|
157
|
-
- [ ]
|
|
156
|
+
- [ ] Mermaid diagram included where it clarifies (omit if the design is trivial and prose is clearer)?
|
|
157
|
+
- [ ] AD-NNs exist for every real tradeoff (there may be few or many — whatever the feature actually has)?
|
|
158
158
|
|
|
159
159
|
## Forbidden
|
|
160
160
|
|
|
161
|
-
- ✗ sequential-thinking
|
|
162
|
-
- ✗ Technology selection
|
|
161
|
+
- ✗ Padding sequential-thinking with filler rounds to hit a number
|
|
162
|
+
- ✗ Technology selection from memory when context7 should have been consulted (version-sensitive API)
|
|
163
163
|
- ✗ Describing component interfaces in natural language (must have type definitions)
|
|
164
164
|
- ✗ Omitting error paths (only the happy path)
|
|
165
165
|
- ✗ Abstract decisions not assigned an AD (later tasks cannot reference them)
|
|
@@ -189,14 +189,15 @@ Next:
|
|
|
189
189
|
- /curdx-flow:spec --phase=tasks — break down tasks
|
|
190
190
|
```
|
|
191
191
|
|
|
192
|
-
##
|
|
192
|
+
## Design discipline (stop-condition, not length-target)
|
|
193
193
|
|
|
194
|
-
|
|
194
|
+
Document only the genuinely novel architectural decisions. No target length. Stop when:
|
|
195
195
|
|
|
196
|
-
|
|
197
|
-
-
|
|
198
|
-
-
|
|
196
|
+
1. Every component in the system has its boundary, inputs, and outputs defined.
|
|
197
|
+
2. Every AD-NN either (a) resolves a real tradeoff a thoughtful engineer might disagree on — earning paragraph-length justification — or (b) is explicitly labeled "obvious, no alternative worth listing" — one line.
|
|
198
|
+
3. Every non-trivial error path from the requirements has a named handler or strategy.
|
|
199
|
+
4. Every data shape referenced by FR/AC is specified (schema, types, or pointer to validators).
|
|
199
200
|
|
|
200
|
-
|
|
201
|
+
Well-known stack assemblies honestly compress to: stack list with one-line justification each, data model, API surface, a small number of real ADs, deviations from convention. Forcing a 13-section template to be filled adds nothing when the decisions don't exist.
|
|
201
202
|
|
|
202
|
-
`sequential-thinking`
|
|
203
|
+
`sequential-thinking` is invoked to reason through tradeoffs. **The thinking is the work; the written design.md contains only the conclusions**, not the reasoning chain. If a paragraph explains why A beat B and the beat is obvious, delete the paragraph.
|
package/agents/flow-debugger.md
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
---
|
|
2
2
|
name: flow-debugger
|
|
3
|
-
description: Systematic debugging agent — 4-phase methodology (root cause → pattern → hypothesis → fix);
|
|
3
|
+
description: Systematic debugging agent — 4-phase methodology (root cause → pattern → hypothesis → fix); repeated failures (typically after a few attempts probing different hypotheses) trigger architectural questioning. Inherited from superpowers.
|
|
4
4
|
model: opus
|
|
5
5
|
effort: high
|
|
6
6
|
maxTurns: 40
|
|
@@ -33,7 +33,7 @@ Phase 4: Implement fix → write failing test → fix root cause → verify
|
|
|
33
33
|
|
|
34
34
|
Skipping any phase = not done.
|
|
35
35
|
|
|
36
|
-
### Rule 2:
|
|
36
|
+
### Rule 2: Repeated Fix Failures Trigger "Question the Architecture"
|
|
37
37
|
|
|
38
38
|
If you have tried 3 different approaches and all failed:
|
|
39
39
|
- **Stop**
|
package/agents/flow-edge-hunter.md
CHANGED
|
@@ -263,7 +263,7 @@ If the user agrees, suggest a set of tasks to append to tasks.md:
|
|
|
263
263
|
- [ ] All 7 categories covered?
|
|
264
264
|
- [ ] Each gap has category + location + scenario + risk + recommended test code?
|
|
265
265
|
- [ ] Priority ordering is clear?
|
|
266
|
-
- [ ]
|
|
266
|
+
- [ ] Findings proportional to real edge-case surface (zero is OK if all categories honestly N/A)
|
|
267
267
|
|
|
268
268
|
---
|
|
269
269
|
|
package/agents/flow-executor.md
CHANGED
|
@@ -124,14 +124,14 @@ bash -c "<verify command>"
|
|
|
124
124
|
- Exit code 0 + wrong output → failure, enter Step 6a (debugging)
|
|
125
125
|
- Non-zero exit code → failure, enter Step 6a
|
|
126
126
|
|
|
127
|
-
### Step 6a: Failure Handling (
|
|
127
|
+
### Step 6a: Failure Handling (retry proportional to hypothesis space, not a fixed count)
|
|
128
128
|
|
|
129
129
|
Refer to pua's three red lines + superpowers' systematic debugging:
|
|
130
130
|
|
|
131
131
|
```
|
|
132
132
|
Round 1 (L0 trust): read the error, find the obvious issue, fix it
|
|
133
133
|
Round 2 (L1 disappointment): re-read Do, check for missed steps
|
|
134
|
-
Round 3 (L2 soul-searching): use sequential-thinking for root-cause analysis
|
|
134
|
+
Round 3 (L2 soul-searching): use sequential-thinking for root-cause analysis proportional to residual uncertainty
|
|
135
135
|
Round 4 (L3 performance review): read the relevant source, check upstream/downstream data flow
|
|
136
136
|
Round 5 (L4 graduation): if still not working, report failure and ask the user to intervene
|
|
137
137
|
```
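One way to read "retry proportional to hypothesis space" is as a loop that ends when the distinct hypotheses run out, not when a counter does. The sketch below assumes a hypothetical `attempt` callback that applies a fix for one hypothesis and reports whether the verify command then passed; neither name comes from the package.

```python
from typing import Callable, Iterable


def retry_by_hypothesis(
    hypotheses: Iterable[str],
    attempt: Callable[[str], bool],
) -> tuple[bool, list[str]]:
    """Try each distinct hypothesis once; stop on success or exhaustion."""
    tried: list[str] = []
    for hypothesis in hypotheses:
        if hypothesis in tried:
            continue  # repeating the same fix is not a new retry
        tried.append(hypothesis)
        if attempt(hypothesis):
            return True, tried
    # The hypothesis space is exhausted: report failure with what was tried.
    return False, tried
```

The returned `tried` list doubles as evidence for the failure report, since it records which hypotheses were actually probed.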
|
|
@@ -195,7 +195,7 @@ Commit: <hash>
|
|
|
195
195
|
Next: <next task_id or "ALL_TASKS_COMPLETE">
|
|
196
196
|
```
|
|
197
197
|
|
|
198
|
-
**Failure** (
|
|
198
|
+
**Failure** (retries exhausted — tune the retry count to the apparent task complexity; each retry should probe a new hypothesis, not repeat the same fix; stop when the hypothesis space is genuinely exhausted, regardless of how few or many retries that took):
|
|
199
199
|
```
|
|
200
200
|
TASK_FAILED: <task_id>
|
|
201
201
|
Reason: <short reason>
|
package/agents/flow-planner.md
CHANGED
|
@@ -170,26 +170,30 @@ Then emit the 5-line summary (see "Output to User" below). No inline task listin
|
|
|
170
170
|
- ✗ Skipping the coverage audit
|
|
171
171
|
- ✗ Proactively skipping some FRs in requirements for the sake of "simplification" (overreach)
|
|
172
172
|
|
|
173
|
-
## Task
|
|
173
|
+
## Task decomposition (as-needed, no numeric quota)
|
|
174
174
|
|
|
175
|
-
|
|
175
|
+
**Stop condition, not task count.** Do not aim for a number of tasks. Produce tasks until these are true, then stop:
|
|
176
176
|
|
|
177
|
-
|
|
178
|
-
|
|
179
|
-
|
|
180
|
-
|
|
181
|
-
| Large feature | **20–30 tasks** | new subsystem, multi-service integration, data migration |
|
|
182
|
-
| Epic-scale | **30–50 tasks** | consider splitting into sub-specs via the `epic` skill first |
|
|
177
|
+
1. Every FR, AC, AD, and component in the spec is covered by at least one concrete, executable task.
|
|
178
|
+
2. Each task is one **cohesive unit of work** the executor can finish in a **single sub-agent dispatch** without needing to replan internally. If a task would require the executor to think "first I need to decide X, then do Y, then come back and do Z", that task is too big — split it.
|
|
179
|
+
3. No two tasks are inseparable. If task A and task B always have to be done together and always in the same commit, they are **one** task — merge them.
|
|
180
|
+
4. Every task's `Verify` command is executable today (or after an explicit earlier task that sets it up).
|
|
183
181
|
|
|
184
|
-
|
|
182
|
+
**Research reference**: this is the as-needed decomposition pattern from [ADaPT (Allen AI, NAACL 2024)](https://arxiv.org/abs/2311.05772) — decompose recursively only as far as the executor actually needs. Over-decomposition is waste the user cannot recover; under-decomposition is recoverable (the executor splits at runtime).
|
|
185
183
|
|
|
186
|
-
|
|
184
|
+
**Self-check before writing**: re-read your task list. For every adjacent pair, ask "could these be one task?" If yes, merge. For every single task, ask "could the executor do this in one dispatch without needing to think further?" If no, split. Iterate until neither question produces a change.
|
|
187
185
|
|
|
188
|
-
|
|
186
|
+
### Symptoms of over-decomposition (stop and merge)
|
|
189
187
|
|
|
190
|
-
|
|
188
|
+
- "Create file X" + "Add imports to X" + "Write function body in X" → one task.
|
|
189
|
+
- "Add field to schema" + "Run migration" → one task (schema change is atomic).
|
|
190
|
+
- "Write test" + "Make test pass" → this is TDD red+green; one task marked with TDD stage in commits, not two.
|
|
191
191
|
|
|
192
|
-
|
|
192
|
+
### Symptoms of under-decomposition (split)
|
|
193
|
+
|
|
194
|
+
- The executor's Verify command would be three separate `npm test` runs → three tasks.
|
|
195
|
+
- The task touches > ~3 unrelated files or modules → split by module.
|
|
196
|
+
- The task's `Do` field has numbered steps > 5 that each produce a distinct observable result → split.
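The merge/split self-check is a fixed-point iteration, and can be sketched as one. Everything here is an assumption for illustration: a task is modelled as a list of step strings of the form `"<file>: <action>"`, and the two rules (`could_merge`, `split`) are simplified stand-ins for the planner's judgment. The `max_passes` guard only catches rule sets that oscillate; it is not a task quota.

```python
from itertools import groupby
from typing import Callable

Task = list[str]  # hypothetical model: ordered steps like "x.py: add imports"


def file_of(step: str) -> str:
    return step.split(":")[0]


def could_merge(a: Task, b: Task) -> bool:
    # Adjacent tasks that only ever touch the same file are one task.
    return len({file_of(s) for s in a + b}) == 1


def split(task: Task) -> list[Task]:
    # A task spanning several files is split into one task per contiguous file run.
    return [list(group) for _, group in groupby(task, key=file_of)]


def refine(
    tasks: list[Task],
    merge_rule: Callable[[Task, Task], bool] = could_merge,
    split_rule: Callable[[Task], list[Task]] = split,
    max_passes: int = 100,
) -> list[Task]:
    """Repeat merge and split passes until neither changes the task list."""
    for _ in range(max_passes):
        changed = False
        merged: list[Task] = []
        for task in tasks:
            if merged and merge_rule(merged[-1], task):
                merged[-1] = merged[-1] + task
                changed = True
            else:
                merged.append(list(task))
        refined: list[Task] = []
        for task in merged:
            parts = split_rule(task)
            changed = changed or len(parts) > 1
            refined.extend(parts)
        tasks = refined
        if not changed:
            return tasks
    raise RuntimeError("merge and split rules oscillate; fix the rules")
```

For example, `refine([["x.py: create file"], ["x.py: add imports"], ["y.py: write fn", "z.py: write fn"]])` merges the two `x.py` tasks and splits the third by file.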
|
|
193
197
|
|
|
194
198
|
## Output to User (5 lines max, after Write succeeds)
|
|
195
199
|
|
package/agents/flow-product-designer.md
CHANGED
|
@@ -56,7 +56,7 @@ AC-N.M: Given [precondition], when [action], then [expected result]
|
|
|
56
56
|
|
|
57
57
|
Must:
|
|
58
58
|
- **Be testable** (can be written as E2E or integration test)
|
|
59
|
-
- **Cover happy path +
|
|
59
|
+
- **Cover happy path + real edge cases that actually apply (omit categories that do not apply to this feature)**
|
|
60
60
|
- **Cover error handling** (when input is invalid / network breaks / permissions insufficient)
|
|
61
61
|
|
|
62
62
|
### Step 4: FR / NFR Extraction
|
|
@@ -145,14 +145,15 @@ Out of Scope: K items explicitly excluded
|
|
|
145
145
|
Next step: /curdx-flow:spec --phase=design
|
|
146
146
|
```
|
|
147
147
|
|
|
148
|
-
##
|
|
148
|
+
## Requirements discipline (stop-condition, not length-target)
|
|
149
149
|
|
|
150
|
-
|
|
150
|
+
Produce user stories and acceptance criteria that cover every distinct user-visible behavior ONCE. No target length. Stop when:
|
|
151
151
|
|
|
152
|
-
|
|
153
|
-
- **
|
|
154
|
-
-
|
|
152
|
+
1. Every distinct user goal is expressed as one user story (US-NN). Stories that always happen together and share every AC → merge into one.
|
|
153
|
+
2. Every AC-N.N is **observable from outside the code** — a test can determine pass/fail without reading the implementation. If you cannot write the AC observably, delete it rather than ship it vague.
|
|
154
|
+
3. Every FR-NN is stated once, in the US block where it first appears; do not duplicate it in a separate FR section unless the FR genuinely spans multiple user stories.
|
|
155
|
+
4. NFRs are written ONLY for risks that actually apply to this feature's context. No "supports 10,000 users" for a localhost single-user Todo. If the feature has no real non-functional risk, NFR section collapses to one line: "standard for this domain".
|
|
155
156
|
|
|
156
|
-
|
|
157
|
+
Length emerges from real content: a 3-story CRUD produces a short document; a 20-story multi-role workflow a long one. The template structure is not a length target.
|
|
157
158
|
|
|
158
|
-
|
|
159
|
+
Forbidden padding: restating the goal, describing sections you are about to fill, repeating an AC under both US and FR, writing NFRs for imaginary risks.
|
package/agents/flow-qa-engineer.md
CHANGED
|
@@ -239,7 +239,7 @@ s['qa']['issues_found'] = len(bugs)
|
|
|
239
239
|
## Quality Self-Check
|
|
240
240
|
|
|
241
241
|
- [ ] Ran every core AC?
|
|
242
|
-
- [ ] Covered
|
|
242
|
+
- [ ] Covered every edge category that genuinely applies to this feature (categories that do not apply are marked N/A)?
|
|
243
243
|
- [ ] Screenshots or logs saved?
|
|
244
244
|
- [ ] Performance data measured (not estimated)?
|
|
245
245
|
- [ ] Accessibility scanned at least once?
|
package/agents/flow-researcher.md
CHANGED
|
@@ -118,9 +118,9 @@ Before finalizing research.md, ask yourself:
|
|
|
118
118
|
|
|
119
119
|
- [ ] Are all assumptions explicitly listed? (Karpathy principle 1)
|
|
120
120
|
- [ ] Did every technical solution go through context7 / WebSearch? No relying on memory?
|
|
121
|
-
- [ ] Did the codebase scan cover
|
|
121
|
+
- [ ] Did the codebase scan cover every relevant keyword raised by the requirements?
|
|
122
122
|
- [ ] Does the feasibility judgment have evidence (not "should work" but "confirmed feasible based on XX")?
|
|
123
|
-
- [ ] Are there
|
|
123
|
+
- [ ] Are there any open questions for the user to answer? (If research is fully unambiguous, say so explicitly)
|
|
124
124
|
|
|
125
125
|
If any answer is "no", redo it before writing.
|
|
126
126
|
|
|
@@ -154,18 +154,17 @@ Open questions (please answer before entering requirements phase):
|
|
|
154
154
|
Next step: /curdx-flow:spec --phase=requirements
|
|
155
155
|
```
|
|
156
156
|
|
|
157
|
-
##
|
|
157
|
+
## Research discipline (stop-condition, not length-target)
|
|
158
158
|
|
|
159
|
-
|
|
159
|
+
Research answers the real questions for THIS feature. There is no target length. Stop when:
|
|
160
160
|
|
|
161
|
-
|
|
162
|
-
|
|
163
|
-
|
|
161
|
+
1. Every non-obvious technical question raised by the requirements has an answer with a concrete recommendation.
|
|
162
|
+
2. Every version-sensitive library or API you cite has at least one fact sourced from `context7` (or WebSearch), not from memory.
|
|
163
|
+
3. Every alternative you rejected has a one-line reason UNLESS the rejection turns on a subtle tradeoff worth documenting.
|
|
164
|
+
4. No section exists to restate the goal, describe the template, or pad for "thoroughness".
|
|
164
165
|
|
|
165
|
-
|
|
166
|
-
- Restating the user goal in your own words for a whole section.
|
|
167
|
-
- Listing the alternatives you rejected when the rejection is obvious ("we won't use PHP for a Vue SPA").
|
|
168
|
-
- Describing the template structure you're about to fill ("In the next section, I'll cover…").
|
|
169
|
-
- Copying upstream content (the goal from `.state.json`) into multiple sections.
|
|
166
|
+
Length emerges naturally from real content. A well-known CRUD domain (Todo / blog / basic REST) produces sections that honestly compress to "standard stack, no novelty, no version risk"; anything longer is padding. A novel architecture with real library unknowns produces a much longer document because the information content is higher.
|
|
170
167
|
|
|
171
|
-
|
|
168
|
+
**Forbidden padding**: restating the goal in your own words, describing structure you are about to fill, copying upstream content, listing obviously-rejected alternatives.
|
|
169
|
+
|
|
170
|
+
Self-check before `Write`: for every paragraph, ask "does this change a reader's decision?" If no, delete. Iterate until deleting any more leaves a real question unanswered.
|
package/agents/flow-security-auditor.md
CHANGED
|
@@ -181,7 +181,7 @@ npm audit
|
|
|
181
181
|
|
|
182
182
|
### Step 4: Threat Modeling (sequential-thinking)
|
|
183
183
|
|
|
184
|
-
Use sequential-thinking
|
|
184
|
+
Use sequential-thinking on core entities proportional to real threat-model complexity:
|
|
185
185
|
|
|
186
186
|
```
|
|
187
187
|
Round 1: User — ask S/T/R/I/D/E each
|
package/agents/flow-triage-analyst.md
CHANGED
|
@@ -44,7 +44,7 @@ Output: `.flow/_epics/<epic-name>/epic.md` + multiple `.flow/specs/<sub-name>/`
|
|
|
44
44
|
|
|
45
45
|
## Mandatory Workflow
|
|
46
46
|
|
|
47
|
-
### Step 1: Explore + Understand (sequential-thinking
|
|
47
|
+
### Step 1: Explore + Understand (sequential-thinking proportional to epic complexity)
|
|
48
48
|
|
|
49
49
|
```
|
|
50
50
|
Round 1: What does the user really want? What's the biggest goal?
|
package/agents/flow-ui-researcher.md
CHANGED
|
@@ -185,13 +185,13 @@ Division of labor:
|
|
|
185
185
|
|
|
186
186
|
- ✗ Doing actual UI design (that's flow-ux-designer's job)
|
|
187
187
|
- ✗ Listing references from memory (must WebSearch or scan the codebase)
|
|
188
|
-
- ✗ Providing only one reference
|
|
188
|
+
- ✗ Providing only one reference — aim for enough breadth across reference categories that the user has genuine alternatives to pick from
|
|
189
189
|
- ✗ Ignoring CONTEXT.md preferences
|
|
190
190
|
|
|
191
191
|
## Quality Self-Check
|
|
192
192
|
|
|
193
193
|
- [ ] Scanned codebase for existing patterns?
|
|
194
|
-
- [ ] WebSearch covered
|
|
194
|
+
- [ ] WebSearch covered enough reference categories that the user has genuine design alternatives?
|
|
195
195
|
- [ ] sequential-thinking used to classify references?
|
|
196
196
|
- [ ] Recommendation considers CONTEXT.md?
|
|
197
197
|
- [ ] Asset files saved?
|
package/agents/flow-ux-designer.md
CHANGED
|
@@ -237,7 +237,7 @@ The sketch stage = HTML prototype. Convert to React/Vue/Svelte components only a
|
|
|
237
237
|
## Quality Self-Check
|
|
238
238
|
|
|
239
239
|
- [ ] Invoked the frontend-design skill (if available)?
|
|
240
|
-
- [ ]
|
|
240
|
+
- [ ] Enough variants for the user to pick meaningful alternatives (omit if the brief clearly calls for one direction only)?
|
|
241
241
|
- [ ] Each variant a single HTML file, zero dependencies?
|
|
242
242
|
- [ ] decisions.md explains rationale for choices?
|
|
243
243
|
- [ ] Considered CONTEXT.md user preferences?
|
package/gates/adversarial-review-gate.md
CHANGED
|
@@ -33,19 +33,19 @@ A reviewer agent's output of "everything looks fine, no issues found" is an **in
|
|
|
33
33
|
- "Looks good" is usually confirmation bias (the agent only checked the obvious)
|
|
34
34
|
- AI tends to please the user ("great job!") — fight this tendency
|
|
35
35
|
|
|
36
|
-
**Forced actions**:
|
|
37
|
-
1.
|
|
38
|
-
2.
|
|
39
|
-
|
|
40
|
-
-
|
|
41
|
-
-
|
|
42
|
-
|
|
36
|
+
**Forced actions when the agent reports "no issues"**:
|
|
37
|
+
1. Automatically trigger a second round framed as "what would a senior skeptic reject in this PR?"
|
|
38
|
+
2. If both rounds still honestly yield no findings, the agent must emit a **proof-of-checking report**:
|
|
39
|
+
- Every category it examined (with "N/A" for categories that don't apply)
|
|
40
|
+
- For each examined category, the specific code/file locations inspected
|
|
41
|
+
- Counterfactual hypotheses of "what this would look like if there were a problem" and why that signature is absent
|
|
42
|
+
3. Fabricating findings to avoid the proof-of-checking step is a violation of L3 red line #2 (fact-driven). Better to emit "clean verdict with proof" than invent issues.
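The proof-of-checking report has a checkable shape, which a small validator can make concrete. This is a sketch, not the gate's actual format: `CategoryCheck`, `report_gaps`, and the field names are hypothetical; the six category names are the ones the adversary agent lists.

```python
from dataclasses import dataclass, field

CATEGORIES = ("Architecture", "Implementation", "Testing",
              "Security", "Maintainability", "UX")


@dataclass
class CategoryCheck:
    name: str
    na_reason: str = ""  # non-empty means "N/A: <reason>"
    locations: list[str] = field(default_factory=list)        # code/file locations inspected
    counterfactuals: list[str] = field(default_factory=list)  # "what a problem would look like"


def report_gaps(checks: list[CategoryCheck]) -> list[str]:
    """List what a clean verdict still lacks; an empty list means the proof is complete."""
    gaps: list[str] = []
    seen = {check.name for check in checks}
    for name in CATEGORIES:
        if name not in seen:
            gaps.append(f"{name}: not mentioned")
    for check in checks:
        if check.na_reason:
            continue  # explicitly skipped with a reason
        if not check.locations:
            gaps.append(f"{check.name}: no locations inspected")
        if not check.counterfactuals:
            gaps.append(f"{check.name}: no counterfactual hypothesis")
    return gaps
```

A report passes this check without containing a single finding, which is what distinguishes a clean verdict with proof from a bare "looks good".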
|
|
43
43
|
|
|
44
44
|
---
|
|
45
45
|
|
|
46
|
-
### Rule 2:
|
|
46
|
+
### Rule 2: Coverage proportional to feature scope
|
|
47
47
|
|
|
48
|
-
A complete adversarial review
|
|
48
|
+
A complete adversarial review covers every category that applies to the feature, marks the rest as N/A with reason. Number of findings per category is proportional to real issues, not a quota:
|
|
49
49
|
|
|
50
50
|
1. **Architecture layer**: Are decisions sound? Future-extensible? Lock-in risks?
|
|
51
51
|
2. **Implementation layer**: Code quality? Error handling? Performance?
|
|
@@ -86,22 +86,22 @@ Not allowed:
|
|
|
86
86
|
Input: object under review (code range / spec / PR diff)
|
|
87
87
|
↓
|
|
88
88
|
Round 1 (agent self-analysis):
|
|
89
|
-
- Use sequential-thinking
|
|
89
|
+
- Use sequential-thinking proportional to the surface being probed
|
|
90
90
|
- Scan all 6 categories
|
|
91
91
|
- Output findings list
|
|
92
92
|
↓
|
|
93
93
|
Decision:
|
|
94
|
-
-
|
|
95
|
-
-
|
|
94
|
+
- Any real findings? → output report with findings
|
|
95
|
+
- Zero findings after honest Round 1? → force Round 2 framed as skeptic
|
|
96
96
|
↓
|
|
97
97
|
Round 2 (deep analysis):
|
|
98
|
-
- sequential-thinking
|
|
98
|
+
- sequential-thinking proportional to residual uncertainty
|
|
99
99
|
- Focus on "seemingly no issues" parts (trust but verify)
|
|
100
|
-
-
|
|
100
|
+
- Optionally introduce external perspectives (read issues from similar projects)
|
|
101
101
|
↓
|
|
102
102
|
Decision:
|
|
103
|
-
- Still
|
|
104
|
-
-
|
|
103
|
+
- Still zero findings? → agent must emit proof-of-checking report (NOT invent findings)
|
|
104
|
+
- Findings exist? → output report
|
|
105
105
|
↓
|
|
106
106
|
Output: review-report.md
|
|
107
107
|
```
|
package/gates/devex-gate.md
CHANGED
|
@@ -195,12 +195,12 @@ Reading these test names = reading API behavior documentation.
|
|
|
195
195
|
|
|
196
196
|
### Agent Automatic
|
|
197
197
|
|
|
198
|
-
When `flow-ux-designer` / `flow-reviewer` applies this gate, use sequential-thinking
|
|
198
|
+
When `flow-ux-designer` / `flow-reviewer` applies this gate, use sequential-thinking proportional to the complexity of the codebase being scanned.
|
|
199
199
|
|
|
200
200
|
### Human Review
|
|
201
201
|
|
|
202
202
|
Attach a DevEx checklist at PR time:
|
|
203
|
-
- [ ] Clear naming (
|
|
203
|
+
- [ ] Clear naming (re-read until obvious to a new maintainer)
|
|
204
204
|
- [ ] Critical comments exist
|
|
205
205
|
- [ ] Consistent structure
|
|
206
206
|
- [ ] Actionable error messages
|
package/gates/edge-case-gate.md
CHANGED
|
@@ -104,7 +104,7 @@ Q4. If no test, what test should be added to cover it?
|
|
|
104
104
|
Input: object under review (function / component / API) + requirements + tests
|
|
105
105
|
↓
|
|
106
106
|
For each category (1-7):
|
|
107
|
-
1. Use sequential-thinking to list
|
|
107
|
+
1. Use sequential-thinking to list every plausible edge scenario for this category — stop when you've covered the real risk surface, don't pad to a quota, don't fabricate scenarios that won't occur in production
|
|
108
108
|
2. Check whether each scenario has corresponding coverage in tests
|
|
109
109
|
3. Add uncovered ones to the "gap list"
|
|
110
110
|
↓
|
package/package.json
CHANGED
|
@@ -32,7 +32,7 @@
|
|
|
32
32
|
"specs": {
|
|
33
33
|
"directories": ["./.flow/specs"],
|
|
34
34
|
"default_task_size": "fine",
|
|
35
|
-
"
|
|
35
|
+
"_task_size_hint": "as-needed decomposition (no fixed count) — see agents/flow-planner.md"
|
|
36
36
|
},
|
|
37
37
|
|
|
38
38
|
"addons": {
|
package/templates/design.md.tmpl
CHANGED
|
@@ -9,7 +9,7 @@ depends_on: requirements.md
|
|
|
9
9
|
|
|
10
10
|
# Technical Design: {{SPEC_NAME}}
|
|
11
11
|
|
|
12
|
-
> Conclusions from the flow-architect agent
|
|
12
|
+
> Conclusions from the flow-architect agent. Sequential-thinking is invoked proportional to the genuine tradeoff surface of this design — the thinking chain does not appear here, only the conclusions.
|
|
13
13
|
> This document freezes the technical choices. Subsequent tasks / implementation strictly follow this design.
|
|
14
14
|
|
|
15
15
|
---
|