@curdx/flow 2.0.0-beta.2 → 2.0.0-beta.4
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.claude-plugin/marketplace.json +1 -1
- package/.claude-plugin/plugin.json +1 -1
- package/agent-preamble/preamble.md +55 -0
- package/agents/flow-architect.md +12 -0
- package/agents/flow-planner.md +36 -40
- package/agents/flow-product-designer.md +12 -0
- package/agents/flow-researcher.md +16 -0
- package/agents/flow-reviewer.md +5 -1
- package/agents/flow-verifier.md +47 -14
- package/package.json +1 -1
|
@@ -6,7 +6,7 @@
|
|
|
6
6
|
},
|
|
7
7
|
"metadata": {
|
|
8
8
|
"description": "Claude Code Discipline Layer — spec-driven workflow + goal-backward verification + Karpathy 4 principles enforced via gates. Stops Claude from faking \"done\" on non-trivial features.",
|
|
9
|
-
"version": "2.0.0-beta.
|
|
9
|
+
"version": "2.0.0-beta.4"
|
|
10
10
|
},
|
|
11
11
|
"plugins": [
|
|
12
12
|
{
|
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
{
|
|
2
2
|
"name": "curdx-flow",
|
|
3
|
-
"version": "2.0.0-beta.
|
|
3
|
+
"version": "2.0.0-beta.4",
|
|
4
4
|
"description": "Claude Code Discipline Layer — spec-driven workflow + goal-backward verification + Karpathy 4 principles enforced via gates. Stops Claude from faking \"done\" on non-trivial features.",
|
|
5
5
|
"author": {
|
|
6
6
|
"name": "wdx",
|
|
@@ -30,6 +30,14 @@
|
|
|
30
30
|
- Do not say done/fixed/working without evidence
|
|
31
31
|
- Tests first, goals first
|
|
32
32
|
|
|
33
|
+
### 5. Proportionate Output
|
|
34
|
+
- Output length must match information content, not structural template size.
|
|
35
|
+
- Do not pad. If 30 lines of markdown fully answer the question, do not produce 300.
|
|
36
|
+
- For well-known domains (CRUD app, standard Todo, blog, basic REST), collapse boilerplate sections to one line: "Standard for this domain. No novelty." Do not fill sections for the sake of filling them.
|
|
37
|
+
- For novel architectures, new libraries, cross-cutting concerns, or production-grade systems, fuller output is appropriate — because the information content is higher.
|
|
38
|
+
- Thoroughness ≠ length. Thoroughness = answering the actual questions the reader will ask. A reader opening a Todo research.md asks three questions, not thirty.
|
|
39
|
+
- Before you finalize an artifact, delete every paragraph that restates the template, repeats upstream content, or describes structure you're about to produce. Those tokens earn nothing.
|
|
40
|
+
|
|
33
41
|
---
|
|
34
42
|
|
|
35
43
|
## L2: Mandatory Tool Rules (enforced)
|
|
@@ -210,5 +218,52 @@ When you need to delegate to a sub-agent:
|
|
|
210
218
|
|
|
211
219
|
---
|
|
212
220
|
|
|
221
|
+
## L8: Long-artifact handling (truncation prevention)
|
|
222
|
+
|
|
223
|
+
When your job is to produce a long Markdown artifact (`tasks.md`, `verification-report.md`, `review-report.md`, `research.md`, `requirements.md`, `design.md`, etc.), follow these rules. Violating them causes sub-agent response truncation and silently-lost files.
|
|
224
|
+
|
|
225
|
+
### Write first, explain second
|
|
226
|
+
|
|
227
|
+
Your FIRST substantive action after gathering inputs must be a `Write` tool call with the **complete file content**. Do NOT paste the content as assistant text before writing.
|
|
228
|
+
|
|
229
|
+
- ✗ *"Here's the tasks.md I'll write:"* followed by a 500-line markdown code block, then a `Write` call containing the same 500 lines — this doubles the output tokens and usually hits the truncation limit mid-`Write`, leaving the file missing or partial.
|
|
230
|
+
- ✓ Immediately `Write` the file with full content. Then output a ≤ 5-line summary.
|
|
231
|
+
|
|
232
|
+
### Do not preview
|
|
233
|
+
|
|
234
|
+
Never output the file's content in your response. The file IS the deliverable — the reader opens it. Your response is just the ack that you wrote it.
|
|
235
|
+
|
|
236
|
+
### After write, summarize only
|
|
237
|
+
|
|
238
|
+
After `Write` returns success, respond with **at most 5 lines** summarizing what you wrote:
|
|
239
|
+
|
|
240
|
+
```
|
|
241
|
+
✓ Wrote .flow/specs/<spec>/tasks.md
|
|
242
|
+
40 tasks across 5 phases
|
|
243
|
+
Coverage: FR 10/10, AC 12/12, AD 4/4
|
|
244
|
+
Next: /curdx-flow:implement
|
|
245
|
+
```
|
|
246
|
+
|
|
247
|
+
Do not re-paste any file contents. Do not narrate your reasoning. Do not list every task inline.
|
|
248
|
+
|
|
249
|
+
### Split if >200 lines
|
|
250
|
+
|
|
251
|
+
If the artifact would exceed ~200 lines of Markdown, split it:
|
|
252
|
+
- `tasks.md` references `tasks-phase-1.md` … `tasks-phase-5.md`
|
|
253
|
+
- Each phase file is its own `Write` call
|
|
254
|
+
- The index file is a short table linking to the phase files
|
|
255
|
+
|
|
256
|
+
This keeps every individual `Write` call under the safe size budget.
|
|
257
|
+
|
|
258
|
+
### If you see a token-budget warning
|
|
259
|
+
|
|
260
|
+
Stop narrating and call `Write` with whatever content is ready. Sub-agents do not have a "next response" — continuation is not possible after truncation. Save what you have, then return.
|
|
261
|
+
|
|
262
|
+
### Why this matters
|
|
263
|
+
|
|
264
|
+
Sub-agents invoked via the `Task` tool have a ~16 K output-token budget per invocation. A naive agent that previews then writes consumes those tokens twice — once as prose, once inside the tool call — and truncation typically lands inside the `Write` call itself. The parent command then reports "agent did not complete" and re-dispatches, burning compute for no new artifact. Writing first eliminates the failure mode at the source.
|
|
265
|
+
|
|
266
|
+
---
|
|
267
|
+
|
|
213
268
|
**Remember**: this preamble exists because, without discipline, AI tends to slack off, hallucinate, and over-engineer.
|
|
214
269
|
These rules are not constraints — they are the tools that make you reliable.
|
package/agents/flow-architect.md
CHANGED
|
@@ -188,3 +188,15 @@ Next:
|
|
|
188
188
|
- Review the design (especially AD-01/02/03)
|
|
189
189
|
- /curdx-flow:spec --phase=tasks — break down tasks
|
|
190
190
|
```
|
|
191
|
+
|
|
192
|
+
## Length discipline (see preamble L1 #5 — Proportionate Output)
|
|
193
|
+
|
|
194
|
+
`design.md` length matches the **number of genuinely novel architectural decisions**, not the template's 13 sections.
|
|
195
|
+
|
|
196
|
+
- **Well-known stack assembly** (Vue + Hono + SQLite Todo): **~150–300 lines**. Most sections collapse. Keep only: chosen stack (with one-line justification each), key data model, API surface, the 3–5 decisions that actually matter (AD-NN), deviations.
|
|
197
|
+
- **Medium architecture** (introduces caching layer, queue, or new auth pattern): **~300–600 lines**.
|
|
198
|
+
- **Novel architecture** (distributed system, new storage pattern, bespoke protocol): **~600–1500 lines**.
|
|
199
|
+
|
|
200
|
+
Decisions (AD-NN) should earn their space. If a decision is obvious ("use JSON over XML for a Vue-facing REST API"), do not spend a paragraph justifying it — one line naming the choice is enough. Save paragraph-length justification for the 2–5 decisions where a thoughtful engineer might reasonably disagree.
|
|
201
|
+
|
|
202
|
+
`sequential-thinking` ≥ 8 thoughts is mandated because reasoning through tradeoffs reduces design mistakes. It is NOT a mandate to emit 8 paragraphs. After thinking, the written `design.md` should contain only the conclusions, not the reasoning chain.
|
package/agents/flow-planner.md
CHANGED
|
@@ -132,31 +132,23 @@ For each of the following sources, every item must be covered by tasks:
|
|
|
132
132
|
|
|
133
133
|
### Step 6: Write tasks.md + State
|
|
134
134
|
|
|
135
|
-
|
|
135
|
+
**CRITICAL (see L8 of the preamble — long-artifact handling):**
|
|
136
|
+
- Your FIRST action in this step must be a `Write` tool call with the full `tasks.md` content. Do NOT paste the file content as assistant text before writing.
|
|
137
|
+
- Do NOT preview the tasks list in the response. The file itself is the deliverable.
|
|
138
|
+
- If `tasks.md` would be >200 lines, split into `tasks-phase-1.md` … `tasks-phase-5.md` and make `tasks.md` a short index linking to them.
|
|
136
139
|
|
|
137
|
-
Must include a **coverage audit table** at the end (from Step 5)
|
|
140
|
+
Based on `${CLAUDE_PLUGIN_ROOT}/templates/tasks.md.tmpl`. Must include a **coverage audit table** at the end (from Step 5).
|
|
138
141
|
|
|
139
|
-
|
|
140
|
-
|
|
141
|
-
|
|
142
|
-
|
|
143
|
-
|
|
144
|
-
|
|
145
|
-
|
|
146
|
-
|
|
147
|
-
| AD-03 | 1.1, 2.1 | ✓ |
|
|
148
|
-
```
|
|
149
|
-
|
|
150
|
-
Then:
|
|
151
|
-
|
|
152
|
-
```
|
|
153
|
-
.flow/specs/<name>/.state.json:
|
|
154
|
-
phase_status.tasks = "completed"
|
|
155
|
-
total_tasks = <N>
|
|
142
|
+
After the `Write` succeeds:
|
|
143
|
+
1. Update `.flow/specs/<name>/.state.json`:
|
|
144
|
+
```
|
|
145
|
+
phase_status.tasks = "completed"
|
|
146
|
+
total_tasks = <N>
|
|
147
|
+
```
|
|
148
|
+
2. Append to `.flow/specs/<name>/.progress.md`:
|
|
149
|
+
`## tasks phase complete, total N tasks`
|
|
156
150
|
|
|
157
|
-
.
|
|
158
|
-
Append "## tasks phase complete, total N tasks"
|
|
159
|
-
```
|
|
151
|
+
Then emit the 5-line summary (see "Output to User" below). No inline task listing.
|
|
160
152
|
|
|
161
153
|
## Output Quality Bar (Self-Check)
|
|
162
154
|
|
|
@@ -175,30 +167,34 @@ Then:
|
|
|
175
167
|
- ✗ Skipping the coverage audit
|
|
176
168
|
- ✗ Proactively skipping some FRs in requirements for the sake of "simplification" (overreach)
|
|
177
169
|
|
|
178
|
-
## Task
|
|
170
|
+
## Task count proportional to feature complexity (adaptive, no config)
|
|
179
171
|
|
|
180
|
-
|
|
181
|
-
- **coarse**: 15-60 minutes per task. Total 10-20
|
|
172
|
+
Match task count to the **actual work**, not to a fixed target. Read the requirements and design, estimate scope, then decompose accordingly:
|
|
182
173
|
|
|
183
|
-
|
|
174
|
+
| Feature scope | Typical task count | Examples |
|
|
175
|
+
|---|---|---|
|
|
176
|
+
| Well-known CRUD feature | **5–10 tasks** | Todo app, blog, basic form, simple REST endpoint set |
|
|
177
|
+
| Medium feature | **10–20 tasks** | auth flow, settings dashboard, small integration |
|
|
178
|
+
| Large feature | **20–30 tasks** | new subsystem, multi-service integration, data migration |
|
|
179
|
+
| Epic-scale | **30–50 tasks** | consider splitting into sub-specs via the `epic` skill first |
|
|
184
180
|
|
|
185
|
-
|
|
181
|
+
### Hard rule
|
|
186
182
|
|
|
187
|
-
|
|
188
|
-
|
|
183
|
+
If you produce **more than 30 tasks for a feature that is not Epic-scale**, you are over-decomposing. Stop. Re-read the requirements. Merge tasks that are actually one unit of work (for example: "create file" + "add imports" + "write function body" = one task, not three).
|
|
184
|
+
|
|
185
|
+
A tight 8-task plan that each executor can finish in one sub-agent dispatch is almost always better than a 60-task plan that fragments one logical change across three tasks.
|
|
189
186
|
|
|
190
|
-
|
|
191
|
-
Phase 1 (POC): X tasks
|
|
192
|
-
Phase 2 (Refactor): Y tasks
|
|
193
|
-
Phase 3 (Testing): Z tasks
|
|
194
|
-
Phase 4 (Quality): W tasks
|
|
195
|
-
Phase 5 (PR): V tasks
|
|
187
|
+
### Why this matters
|
|
196
188
|
|
|
197
|
-
|
|
189
|
+
Token cost scales with task count × per-task sub-agent overhead. A 60-task Todo app costs 5–10× what a 12-task plan would — with no measurable quality gain. Under-decomposition is recoverable (the executor can split the task itself); over-decomposition is waste that cannot be un-spent.
|
|
198
190
|
|
|
199
|
-
|
|
191
|
+
## Output to User (5 lines max, after Write succeeds)
|
|
200
192
|
|
|
201
|
-
Next:
|
|
202
|
-
- Review tasks.md
|
|
203
|
-
- /curdx-flow:implement — start execution (after Phase 2 is released)
|
|
204
193
|
```
|
|
194
|
+
✓ Wrote .flow/specs/<name>/tasks.md
|
|
195
|
+
N tasks across 5 Phases (X/Y/Z/W/V)
|
|
196
|
+
Coverage: FR A/B | AC C/D | AD E/F
|
|
197
|
+
Next: /curdx-flow:implement
|
|
198
|
+
```
|
|
199
|
+
|
|
200
|
+
**Do not re-paste the tasks.md content inline. Do not list every task. Just the summary.**
|
|
@@ -144,3 +144,15 @@ Out of Scope: K items explicitly excluded
|
|
|
144
144
|
|
|
145
145
|
Next step: /curdx-flow:spec --phase=design
|
|
146
146
|
```
|
|
147
|
+
|
|
148
|
+
## Length discipline (see preamble L1 #5 — Proportionate Output)
|
|
149
|
+
|
|
150
|
+
`requirements.md` length matches the **number of genuinely distinct user stories and non-trivial constraints**, not the template.
|
|
151
|
+
|
|
152
|
+
- **Simple feature** (Todo, CRUD form, 3–7 user stories): **~80–200 lines**. One US block per story, AC list, minimal NFR.
|
|
153
|
+
- **Medium feature** (auth flow, dashboard with filters): **~200–400 lines**.
|
|
154
|
+
- **Complex feature** (multi-role, regulated, multi-step workflow): **~400–800 lines**.
|
|
155
|
+
|
|
156
|
+
Every AC must be **observable and testable**. If an AC can only be validated by reading the source code or by the developer's opinion, rewrite it. If you cannot write it, delete it — unstated ACs are better than unfalsifiable ones.
|
|
157
|
+
|
|
158
|
+
Do not produce NFRs for scenarios that are not actual risks in the feature's context. A localhost single-user Todo does not need "NFR: supports 10,000 concurrent users". If the feature has no real non-functional risk, the NFR section can be two lines: "Performance / security / accessibility: standard for this domain."
|
|
@@ -153,3 +153,19 @@ Open questions (please answer before entering requirements phase):
|
|
|
153
153
|
|
|
154
154
|
Next step: /curdx-flow:spec --phase=requirements
|
|
155
155
|
```
|
|
156
|
+
|
|
157
|
+
## Length discipline (see preamble L1 #5 — Proportionate Output)
|
|
158
|
+
|
|
159
|
+
`research.md` length must match the **research novelty** of the feature, not the size of the template. Use these bands:
|
|
160
|
+
|
|
161
|
+
- **Well-known domain** (CRUD Todo, blog, standard REST API, basic SPA): **~30–80 lines**. Most sections collapse to "Standard stack: `<tech choices>`. No domain novelty. No library risks."
|
|
162
|
+
- **Medium novelty** (integration with a specific third-party API, unusual performance target, constrained runtime): **~100–250 lines**. Expand only the sections with real findings.
|
|
163
|
+
- **High novelty** (new architecture, bleeding-edge library, cross-cutting constraint, non-obvious tradeoffs): **~300–600 lines**. Fuller treatment is warranted.
|
|
164
|
+
|
|
165
|
+
**Forbidden padding patterns**:
|
|
166
|
+
- Restating the user goal in your own words for a whole section.
|
|
167
|
+
- Listing the alternatives you rejected when the rejection is obvious ("we won't use PHP for a Vue SPA").
|
|
168
|
+
- Describing the template structure you're about to fill ("In the next section, I'll cover…").
|
|
169
|
+
- Copying upstream content (the goal from `.state.json`) into multiple sections.
|
|
170
|
+
|
|
171
|
+
Before you `Write` research.md, delete every paragraph that would not change a reader's decision. That is the test.
|
package/agents/flow-reviewer.md
CHANGED
|
@@ -187,7 +187,11 @@ else:
|
|
|
187
187
|
|
|
188
188
|
### Step 6: Generate review-report.md
|
|
189
189
|
|
|
190
|
-
|
|
190
|
+
**CRITICAL (see L8 of the preamble):** your FIRST action in this step must be a `Write` tool call with the **complete report content**. Do NOT paste the report as assistant text before writing. After the write succeeds, respond with a ≤ 5-line summary only (path, verdict, blocker count, next step). Do not re-paste the report.
|
|
191
|
+
|
|
192
|
+
If the report would exceed ~200 lines, split into `review-report.md` (short index + verdict) and `review-details.md` (full findings) — two `Write` calls.
|
|
193
|
+
|
|
194
|
+
Full structure (use this as the content passed to `Write`, not as preview text):
|
|
191
195
|
|
|
192
196
|
```markdown
|
|
193
197
|
# Review Report: <spec-name>
|
package/agents/flow-verifier.md
CHANGED
|
@@ -85,33 +85,60 @@ for comp in design.components:
|
|
|
85
85
|
assertions.append(("Comp", comp.name, f"{comp.name} must exist"))
|
|
86
86
|
```
|
|
87
87
|
|
|
88
|
-
### Step 3:
|
|
88
|
+
### Step 3: Classify every AC — does it describe user-visible behavior?
|
|
89
|
+
|
|
90
|
+
**BEFORE searching for evidence, classify each AC as either UI-facing or code-only.**
|
|
91
|
+
|
|
92
|
+
An AC is **UI-facing** if any of these is true:
|
|
93
|
+
- Contains words: "user sees", "displays", "renders", "shown", "visible", "click", "type into", "press", "hover", "select"
|
|
94
|
+
- Names a UI element: "button", "input", "checkbox", "link", "list", "form", "label", "modal", "banner"
|
|
95
|
+
- Describes a user flow: "the user can do X", "after X the user sees Y"
|
|
96
|
+
- References a visual state: "strikethrough", "highlighted", "disabled", "focus ring"
|
|
97
|
+
|
|
98
|
+
An AC is **code-only** if it describes internal behavior:
|
|
99
|
+
- Schema shape, API response structure, data transformations
|
|
100
|
+
- Performance ("p95 < 50ms"), reliability, security properties
|
|
101
|
+
- Error-envelope shapes, database constraints
|
|
102
|
+
|
|
103
|
+
### Step 3a: Find evidence for code-only ACs
|
|
89
104
|
|
|
90
105
|
```python
|
|
91
|
-
for source, id, text in
|
|
106
|
+
for source, id, text in code_only_assertions:
|
|
92
107
|
evidence = []
|
|
93
|
-
|
|
94
|
-
# Evidence 1: code implementation
|
|
95
108
|
relevant_files = grep_codebase(extract_keywords(text))
|
|
96
109
|
if relevant_files:
|
|
97
110
|
evidence.append(("code", relevant_files))
|
|
98
|
-
|
|
99
|
-
# Evidence 2: tests
|
|
100
111
|
test_files = find_tests_mentioning(id)
|
|
101
112
|
if test_files:
|
|
102
113
|
evidence.append(("test", test_files))
|
|
103
|
-
|
|
104
|
-
# Evidence 3: commit references
|
|
105
114
|
commits = git_log_grep(id)
|
|
106
115
|
if commits:
|
|
107
116
|
evidence.append(("commit", commits))
|
|
108
|
-
|
|
109
|
-
|
|
110
|
-
|
|
111
|
-
|
|
112
|
-
|
|
113
|
-
|
|
117
|
+
status = "verified" if evidence and all_evidence_strong(evidence) else ("partial" if evidence else "missing")
|
|
118
|
+
```
|
|
119
|
+
|
|
120
|
+
### Step 3b: UI-facing ACs REQUIRE browser verification (hard rule)
|
|
121
|
+
|
|
122
|
+
Code inspection + unit tests are **insufficient** evidence for a UI-facing AC. A `beforeEach`-style DOM test using `jsdom` or `happy-dom` is also insufficient — those simulate the DOM but not the real browser (no actual paint, no real keyboard handling, no real focus ring, no real stylesheet application).
|
|
123
|
+
|
|
124
|
+
For every UI-facing AC:
|
|
125
|
+
|
|
114
126
|
```
|
|
127
|
+
1. Check chrome-devtools MCP availability (mcp__chrome-devtools__*).
|
|
128
|
+
2. If available:
|
|
129
|
+
- Start the app (dev server or served build) in the current repo.
|
|
130
|
+
- Drive the flow described in the AC: click / type / navigate.
|
|
131
|
+
- Capture screenshot + list_console_messages + list_network_requests.
|
|
132
|
+
- Compare observed behavior against the AC text.
|
|
133
|
+
- Verdict: verified | partial | failed, with the screenshot as evidence.
|
|
134
|
+
3. If chrome-devtools MCP is NOT available:
|
|
135
|
+
- Mark the AC as "unverified — browser MCP missing".
|
|
136
|
+
- Add a CRITICAL section in verification-report.md listing the UI-facing ACs that could not be verified.
|
|
137
|
+
- Do NOT silently pass the AC based on code reading.
|
|
138
|
+
- Do NOT accept "manual smoke" as sufficient evidence unless the user explicitly logged a D-NN decision in STATE.md waiving automated browser verification.
|
|
139
|
+
```
|
|
140
|
+
|
|
141
|
+
Manual-smoke evidence (comments in tasks.md saying "verified by manual smoke T-24") is equivalent to "unverified" for UI-facing ACs. Flag it. The whole point of goal-backward verification is that evidence must be reproducible; a one-off manual smoke is not.
|
|
115
142
|
|
|
116
143
|
### Step 4: Run Actual Tests (Decisive)
|
|
117
144
|
|
|
@@ -145,6 +172,12 @@ For each match, check:
|
|
|
145
172
|
|
|
146
173
|
### Step 6: Generate verification-report.md
|
|
147
174
|
|
|
175
|
+
**CRITICAL (see L8 of the preamble):** your FIRST action in this step must be a `Write` tool call with the **complete report content**. Do NOT paste the report as assistant text before writing — doing so doubles output tokens and causes truncation inside the `Write` call. After the write succeeds, respond with a ≤ 5-line summary only (path, verdict counts, next step). Do not re-paste the report.
|
|
176
|
+
|
|
177
|
+
If the report would exceed ~200 lines, split into `verification-report.md` (short index + verdict) and `verification-details.md` (full findings table) — two `Write` calls.
|
|
178
|
+
|
|
179
|
+
Required structure (use this as the content passed to `Write`, not as preview text):
|
|
180
|
+
|
|
148
181
|
```markdown
|
|
149
182
|
# Verification Report: <spec-name>
|
|
150
183
|
|