@curdx/flow 2.0.0-beta.3 → 2.0.0-beta.5

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -6,7 +6,7 @@
  },
  "metadata": {
  "description": "Claude Code Discipline Layer — spec-driven workflow + goal-backward verification + Karpathy 4 principles enforced via gates. Stops Claude from faking \"done\" on non-trivial features.",
- "version": "2.0.0-beta.3"
+ "version": "2.0.0-beta.5"
  },
  "plugins": [
  {
@@ -1,6 +1,6 @@
  {
  "name": "curdx-flow",
- "version": "2.0.0-beta.3",
+ "version": "2.0.0-beta.5",
  "description": "Claude Code Discipline Layer — spec-driven workflow + goal-backward verification + Karpathy 4 principles enforced via gates. Stops Claude from faking \"done\" on non-trivial features.",
  "author": {
  "name": "wdx",
@@ -30,35 +30,60 @@
  - Do not say done/fixed/working without evidence
  - Tests first, goals first

+ ### 5. Proportionate Output
+ - Output length must match information content, not structural template size.
+ - Do not pad. If 30 lines of markdown fully answer the question, do not produce 300.
+ - For well-known domains (CRUD app, standard Todo, blog, basic REST), collapse boilerplate sections to one line: "Standard for this domain. No novelty." Do not fill sections for the sake of filling them.
+ - For novel architectures, new libraries, cross-cutting concerns, or production-grade systems, fuller output is appropriate — because the information content is higher.
+ - Thoroughness ≠ length. Thoroughness = answering the actual questions the reader will ask. A reader opening a Todo research.md asks three questions, not thirty.
+ - Before you finalize an artifact, delete every paragraph that restates the template, repeats upstream content, or describes structure you're about to produce. Those tokens earn nothing.
+
  ---

  ## L2: Mandatory Tool Rules (enforced)

  ### Documentation lookup → context7 MCP

- For any question involving a library / framework / SDK / CLI / API:
+ Query `context7` when EITHER is true:
+ - The library API is version-sensitive (recent breaking change, typed API in a new version, deprecated method you're considering).
+ - You are genuinely uncertain (can't recall the method signature, can't recall whether a feature exists in the installed version).

  ```
  1. mcp__context7__resolve-library-id("react") → resolve library ID
  2. mcp__context7__query-docs(libraryId, query) → query latest docs
  ```

- **Forbidden**: writing library API calls from training memory. Training data may be stale.
+ Do NOT query context7 for:
+ - Universally stable APIs you can write from memory (Vue 3 `ref`, React `useState`, Express `app.get`, SQL `SELECT`).
+ - Syntax you would paste into a test file without thinking.
+ - Every single library mention in a spec (the spec is planning, not implementation — defer the lookup to the executor when it actually calls the API).
+
+ **Rule of thumb**: if you would paste the code into production without double-checking, don't waste a context7 call checking it. If you would hesitate, query. Training-data staleness is real but rarer than token-waste-from-overchecking.
+
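Expressed as pseudocode, the rule of thumb amounts to a check like this (a hypothetical sketch; the function and its parameters are illustrative, not part of the plugin's gate logic):

```python
# Hypothetical sketch of the rule of thumb above; names are illustrative,
# not part of the plugin's gate logic.
def should_query_context7(version_sensitive: bool, confident_from_memory: bool) -> bool:
    if version_sensitive:           # recent breaking change, new typed API, deprecation in play
        return True
    if not confident_from_memory:   # you would hesitate before pasting it into production
        return True
    return False                    # stable, universally known API: write it from memory
```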
+ **Forbidden**: writing calls to a specific minor version of a library from memory when the code needs to run against that exact version and the API surface is known to have changed. Then you MUST query context7.

- **Fallback**: when context7 MCP is unavailable, use WebSearch with a version number, and annotate the output with
- "⚠️ context7 unavailable — documentation may not be current".
+ **Fallback**: when context7 MCP is unavailable, use WebSearch with a version number, and annotate the output with "⚠️ context7 unavailable — documentation may not be current".

  ---

  ### Structured thinking → sequential-thinking MCP

- For the following scenarios, sequential-thinking is mandatory beforehand:
+ Use `sequential-thinking` proportional to **decision complexity**, not a fixed quota. The numbers below are **ceilings for genuinely hard cases**, not floors to hit:
+
+ | Task | Guideline |
+ |------|-----------|
+ | Planning a well-known CRUD feature | 1–3 thoughts is enough; don't pad |
+ | Planning a novel feature | up to 5 thoughts |
+ | Architecture for standard stack assembly | 1–3 thoughts |
+ | Architecture for novel design (distributed, new storage, unusual constraints) | up to 8 thoughts |
+ | Epic decomposition | up to 10 thoughts |
+ | Adversarial review of trivial change | 1 thought; if nothing to adversarially review, say so and stop |
+ | Adversarial review of complex change | up to 6 thoughts |
+ | Debugging after ≥ 2 failures on same hypothesis | 4–5 thoughts |
+
+ **Principle**: running 8 thoughts to pick between Vue and React for a Todo is waste. Running 1 thought to architect a distributed queue is irresponsible. Match effort to stakes.

- - Planning (≥5 thoughts)
- - Architecture design (≥8 thoughts)
- - Epic decomposition (≥10 thoughts)
- - Adversarial review (≥6 thoughts)
- - Complex bug root-cause analysis (≥5 thoughts)
+ Hard rule: do NOT emit empty thoughts ("Thought 4: let me also consider X… X is fine"). If you've reached the answer, stop.

  ```
  mcp__sequential-thinking__sequentialthinking({
@@ -20,13 +20,24 @@ Review the target (spec or code) from an **attacker's perspective**. Your task i

  ## Hard Constraints

- ### Constraint 1: Zero Findings Forbidden
+ ### Constraint 1: "No findings" requires proof, not fabrication

- If the first-round analysis outputs "no issues", **automatically trigger a second round**. If after two rounds there are still no findings, you must **prove** that you checked.
+ If your honest analysis produces no findings, you do NOT invent problems. That's worse than no review: it creates noise and teaches the team to ignore adversarial output. Instead:

- ### Constraint 2: Findings in At Least 3 Categories
+ - Run a **second pass** with explicitly skeptical framing ("what would a senior engineer reject in this PR?").
+ - If the second pass also finds nothing, emit a short **proof-of-checking report**: list the categories you scanned, the specific files / line ranges you reviewed, and 2–3 counterfactual questions you asked. This is the honest "clean" verdict (a minimal sketch follows below).

- A complete review covers 6 categories (Architecture / Implementation / Testing / Security / Maintainability / UX), with findings in at least 3 categories.
+ Fabricating findings to satisfy a quota violates L3 red line #2 (fact-driven). Don't.
+
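As an illustration only (the agent does not prescribe an exact layout; the file names, line ranges, and AC references below are invented), a proof-of-checking report for a clean review might look like:

```
## Adversarial review: clean verdict (proof of checking)
Categories scanned: Architecture, Implementation, Testing, Maintainability (Security, UX: N/A, no auth and no UI in scope)
Files reviewed: src/routes/todos.ts (1–180), src/db/schema.ts (1–45), tests/todos.spec.ts (full)
Counterfactuals asked:
- What breaks if two requests hit the same todo id? (single-user scope, accepted)
- Would a 500-char title corrupt the list rendering? (covered by an existing AC test)
Verdict: no findings after two passes.
```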
+ ### Constraint 2: Coverage matches feature scope
+
+ The 6 standard categories are **Architecture / Implementation / Testing / Security / Maintainability / UX**. You do not need findings in 3+ categories to make the review "complete". You need findings proportional to the actual issues present.
+
+ - **Well-known CRUD feature** (Todo, blog): 0–3 findings is normal. Don't stretch.
+ - **Medium feature with some novel choices**: 3–8 findings typical.
+ - **Large / novel / production-grade**: 8–20+ findings reasonable.
+
+ Categories that don't apply to the feature (e.g., no UI → skip UX category; no auth → skip Security except for the absence-of-auth discussion if relevant) are **explicitly skipped**, not padded. Write one line: "Category N/A for this feature."

  ### Constraint 3: Every Finding Must Have Evidence + Recommendation

@@ -188,3 +188,15 @@ Next:
  - Review the design (especially AD-01/02/03)
  - /curdx-flow:spec --phase=tasks — break down tasks
  ```
+
+ ## Length discipline (see preamble L1 #5 — Proportionate Output)
+
+ `design.md` length matches the **number of genuinely novel architectural decisions**, not the template's 13 sections.
+
+ - **Well-known stack assembly** (Vue + Hono + SQLite Todo): **~150–300 lines**. Most sections collapse. Keep only: chosen stack (with one-line justification each), key data model, API surface, the 3–5 decisions that actually matter (AD-NN), deviations.
+ - **Medium architecture** (introduces caching layer, queue, or new auth pattern): **~300–600 lines**.
+ - **Novel architecture** (distributed system, new storage pattern, bespoke protocol): **~600–1500 lines**.
+
+ Decisions (AD-NN) should earn their space. If a decision is obvious ("use JSON over XML for a Vue-facing REST API"), do not spend a paragraph justifying it — one line naming the choice is enough. Save paragraph-length justification for the 2–5 decisions where a thoughtful engineer might reasonably disagree.
+
+ `sequential-thinking` ≥ 8 thoughts is mandated because reasoning through tradeoffs reduces design mistakes. It is NOT a mandate to emit 8 paragraphs. After thinking, the written `design.md` should contain only the conclusions, not the reasoning chain.
@@ -14,15 +14,29 @@ tools: [Read, Grep, Glob, Bash]

  ## Your Responsibility

- Perform a systematic **7-category edge case** scan on the target (function / component / API) and find uncovered scenarios.
+ Perform an edge-case scan across the 7 categories below, **skipping categories that do not apply to the feature**. Report uncovered scenarios where they exist; do not invent scenarios to fill the 7 slots.

  Output: `.flow/specs/<name>/edge-cases.md`.

  ---

- ## 7-Category Taxonomy (must go through each)
+ ## 7-Category Taxonomy (apply selectively)

- Do not skip any category. For each category, use sequential-thinking for 3 rounds.
+ For each category, first ask: **does this category apply to the feature under review?**
+
+ - If NO → mark `N/A: <one-line reason>` and move to the next.
+ - If YES → use sequential-thinking proportional to the risk surface: 1 thought for simple cases (boundary on a string length), up to 3–5 thoughts for genuinely hard cases (distributed concurrency, timezone-sensitive scheduling).
+
+ Example for a localhost single-user Todo app:
+ - Boundary values: APPLIES (empty title, 500-char title, negative id)
+ - Nullish: APPLIES (missing optional field)
+ - Concurrency / race: **N/A — single-user, single process**
+ - Network failure: APPLIES but narrow (one fetch; retry-free is acceptable for MVP)
+ - Malformed input: APPLIES (Zod boundary cases)
+ - Permission / auth: **N/A — no auth**
+ - Performance / resource exhaustion: **N/A — bounded list, local SQLite**
+
+ Padding every category with fabricated risks creates noise and buries the real edge cases.
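To make the N/A convention concrete, the corresponding entries in `edge-cases.md` for that hypothetical Todo app could read as follows (format and IDs invented for illustration):

```
## Boundary values
- EC-01: empty title → expect a 400 validation error (uncovered: no test yet)
- EC-02: 500-char title → expect the list to render without layout breakage (covered)

## Concurrency / race
N/A — single-user, single process.

## Permission / auth
N/A — no auth in scope.
```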

  ### 1. Boundary Values

@@ -27,18 +27,21 @@ Output:

  ## Mandatory Workflow (6 steps)

- ### Step 1: Load Prerequisites + Environment Probe
+ ### Step 1: Load Prerequisites + Environment Probe (conditional)
+
+ Always read the spec inputs (`research.md`, `requirements.md`, `design.md`, `.flow/CONTEXT.md`).
+
+ For the environment probe, **check existence first — do not read files that don't exist**:

  ```
- Read prerequisite spec files
- Check project root:
- package.json → confirm test / lint / build commands
- tsconfig.json → TypeScript strictness
- .eslintrc.* → lint rules
- vitest.config.* → test framework
+ For each of: package.json, tsconfig.json, .eslintrc.*, vitest.config.*
+ if Glob finds it → Read it to capture concrete test/lint/build commands
+ else skip silently (this is a greenfield project or a non-JS stack)
  ```

- **Use the actual detected commands** in each task's `Verify` field, do not assume.
+ For greenfield projects (no `package.json` yet), use the tech stack declared in `design.md` to infer commands. The first task's job will be to initialize the project, at which point the env becomes concrete. Do not fabricate `npm test` commands if there's no package.json yet — instead write the task as "initialize package.json and install vitest; `Verify`: `npm test --silent` produces 'no tests found'".
+
+ **Use the actually detected commands** in each task's `Verify` field. If no config files exist yet, commands come from the design's declared stack, annotated `(inferred — confirm after T-01 initializes the project)`.
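For example, a greenfield first task could carry that annotation like this (field layout is illustrative, not the skill's actual task template):

```
### T-01: Initialize project scaffold
Do: npm init -y && npm i -D vitest typescript
Verify: `npm test --silent` produces "no tests found" (inferred — confirm after T-01 initializes the project)
```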

  ### Step 2: Break Down by POC-First 5 Phases

@@ -167,12 +170,26 @@ Then emit the 5-line summary (see "Output to User" below). No inline task listin
  - ✗ Skipping the coverage audit
  - ✗ Proactively skipping some FRs in requirements for the sake of "simplification" (overreach)

- ## Task Granularity Rules
+ ## Task count proportional to feature complexity (adaptive, no config)
+
+ Match task count to the **actual work**, not to a fixed target. Read the requirements and design, estimate scope, then decompose accordingly:
+
+ | Feature scope | Typical task count | Examples |
+ |---|---|---|
+ | Well-known CRUD feature | **5–10 tasks** | Todo app, blog, basic form, simple REST endpoint set |
+ | Medium feature | **10–20 tasks** | auth flow, settings dashboard, small integration |
+ | Large feature | **20–30 tasks** | new subsystem, multi-service integration, data migration |
+ | Epic-scale | **30–50 tasks** | consider splitting into sub-specs via the `epic` skill first |
+
+ ### Hard rule
+
+ If you produce **more than 30 tasks for a feature that is not Epic-scale**, you are over-decomposing. Stop. Re-read the requirements. Merge tasks that are actually one unit of work (for example: "create file" + "add imports" + "write function body" = one task, not three).
+
+ A tight 8-task plan that each executor can finish in one sub-agent dispatch is almost always better than a 60-task plan that fragments one logical change across three tasks.

- - **fine** (default): 2-15 minutes per task. Total 40-60+
- - **coarse**: 15-60 minutes per task. Total 10-20
+ ### Why this matters

- Based on `_` in `.flow/specs/<name>/.state.json` or `specs.default_task_size` in `.flow/config.json`.
+ Token cost scales with task count × per-task sub-agent overhead. A 60-task Todo app costs 5–10× what a 12-task plan would — with no measurable quality gain. Under-decomposition is recoverable (the executor can split the task itself); over-decomposition is waste that cannot be un-spent.

  ## Output to User (5 lines max, after Write succeeds)

@@ -144,3 +144,15 @@ Out of Scope: K items explicitly excluded

  Next step: /curdx-flow:spec --phase=design
  ```
+
+ ## Length discipline (see preamble L1 #5 — Proportionate Output)
+
+ `requirements.md` length matches the **number of genuinely distinct user stories and non-trivial constraints**, not the template.
+
+ - **Simple feature** (Todo, CRUD form, 3–7 user stories): **~80–200 lines**. One US block per story, AC list, minimal NFR.
+ - **Medium feature** (auth flow, dashboard with filters): **~200–400 lines**.
+ - **Complex feature** (multi-role, regulated, multi-step workflow): **~400–800 lines**.
+
+ Every AC must be **observable and testable**. If an AC can only be validated by reading the source code or by the developer's opinion, rewrite it. If you cannot rewrite it, delete it — unstated ACs are better than unfalsifiable ones.
+
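For instance, an unfalsifiable AC and an observable rewrite (both invented for illustration):

```
Unfalsifiable AC-03: Error handling is robust and the code is maintainable.
Observable rewrite:  Submitting an empty title shows the inline message "Title is required"
                     and no todo is created (covered by an automated test).
```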
+ Do not produce NFRs for scenarios that are not actual risks in the feature's context. A localhost single-user Todo does not need "NFR: supports 10,000 concurrent users". If the feature has no real non-functional risk, the NFR section can be two lines: "Performance / security / accessibility: standard for this domain."
@@ -153,3 +153,19 @@ Open questions (please answer before entering requirements phase):

  Next step: /curdx-flow:spec --phase=requirements
  ```
+
+ ## Length discipline (see preamble L1 #5 — Proportionate Output)
+
+ `research.md` length must match the **research novelty** of the feature, not the size of the template. Use these bands:
+
+ - **Well-known domain** (CRUD Todo, blog, standard REST API, basic SPA): **~30–80 lines**. Most sections collapse to "Standard stack: `<tech choices>`. No domain novelty. No library risks."
+ - **Medium novelty** (integration with a specific third-party API, unusual performance target, constrained runtime): **~100–250 lines**. Expand only the sections with real findings.
+ - **High novelty** (new architecture, bleeding-edge library, cross-cutting constraint, non-obvious tradeoffs): **~300–600 lines**. Fuller treatment is warranted.
+
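To make the well-known-domain band concrete, a collapsed `research.md` for a standard Todo app could be as short as this (content invented for illustration):

```
# Research: todo-app
Standard stack: Vue 3 + Hono + SQLite. No domain novelty. No library risks.
Constraints worth noting: single-user, localhost only (no auth, no scaling concerns).
Open questions: none — domain is well understood.
Recommendation: proceed to requirements.
```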
+ **Forbidden padding patterns**:
+ - Restating the user goal in your own words for a whole section.
+ - Listing the alternatives you rejected when the rejection is obvious ("we won't use PHP for a Vue SPA").
+ - Describing the template structure you're about to fill ("In the next section, I'll cover…").
+ - Copying upstream content (the goal from `.state.json`) into multiple sections.
+
+ Before you `Write` research.md, delete every paragraph that would not change a reader's decision. That is the test.
@@ -85,34 +85,61 @@ for comp in design.components:
      assertions.append(("Comp", comp.name, f"{comp.name} must exist"))
  ```

- ### Step 3: Find Evidence for Each Assertion
+ ### Step 3: Classify every AC (does it describe user-visible behavior?)
+
+ **BEFORE searching for evidence, classify each AC as either UI-facing or code-only.**
+
+ An AC is **UI-facing** if any of these is true:
+ - Contains words: "user sees", "displays", "renders", "shown", "visible", "click", "type into", "press", "hover", "select"
+ - Names a UI element: "button", "input", "checkbox", "link", "list", "form", "label", "modal", "banner"
+ - Describes a user flow: "the user can do X", "after X the user sees Y"
+ - References a visual state: "strikethrough", "highlighted", "disabled", "focus ring"
+
+ An AC is **code-only** if it describes internal behavior:
+ - Schema shape, API response structure, data transformations
+ - Performance ("p95 < 50ms"), reliability, security properties
+ - Error-envelope shapes, database constraints
+
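The classification above can be written as a small check in the same pseudocode style as Step 3a (the helper name and abridged keyword list below are illustrative, not part of the skill):

```python
# Illustrative pseudocode in the style of Step 3a below; the keyword list is
# abridged from the bullets above and the helper name is invented.
UI_KEYWORDS = ["user sees", "displays", "renders", "click", "type into",
               "button", "input", "checkbox", "modal", "strikethrough", "focus ring"]

def classify_ac(ac_text: str) -> str:
    lowered = ac_text.lower()
    if any(keyword in lowered for keyword in UI_KEYWORDS):
        return "ui-facing"   # requires browser verification (Step 3b)
    return "code-only"       # code / test / commit evidence is acceptable (Step 3a)
```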
+ ### Step 3a: Find evidence for code-only ACs

  ```python
- for source, id, text in assertions:
+ for source, id, text in code_only_assertions:
      evidence = []
-
-     # Evidence 1: code implementation
      relevant_files = grep_codebase(extract_keywords(text))
      if relevant_files:
          evidence.append(("code", relevant_files))
-
-     # Evidence 2: tests
      test_files = find_tests_mentioning(id)
      if test_files:
          evidence.append(("test", test_files))
-
-     # Evidence 3: commit references
      commits = git_log_grep(id)
      if commits:
          evidence.append(("commit", commits))
-
-     # Verdict
-     if evidence:
-         status = "verified" if all_evidence_strong(evidence) else "partial"
-     else:
-         status = "missing"
+     status = "verified" if evidence and all_evidence_strong(evidence) else ("partial" if evidence else "missing")
  ```

+ ### Step 3b: UI-facing ACs REQUIRE browser verification (hard rule)
+
+ Code inspection + unit tests are **insufficient** evidence for a UI-facing AC. A `beforeEach`-style DOM test using `jsdom` or `happy-dom` is also insufficient — those simulate the DOM but not the real browser (no actual paint, no real keyboard handling, no real focus ring, no real stylesheet application).
+
+ For every UI-facing AC:
+
+ ```
+ 1. Check chrome-devtools MCP availability (mcp__chrome-devtools__*).
+ 2. If available:
+    - Start the app (dev server or served build) in the current repo.
+    - Drive the flow described in the AC: click / type / navigate.
+    - Capture screenshot + list_console_messages + list_network_requests.
+    - Compare observed behavior against the AC text.
+    - Verdict: verified | partial | failed, with the screenshot as evidence.
+ 3. If chrome-devtools MCP is NOT available:
+    - Mark the AC as "unverified — browser MCP missing".
+    - Add a CRITICAL section in verification-report.md listing the UI-facing ACs that could not be verified.
+    - Do NOT silently pass the AC based on code reading.
+    - Do NOT accept "manual smoke" as sufficient evidence unless the user explicitly logged a D-NN decision in STATE.md waiving automated browser verification.
+ ```
+
+ Manual-smoke evidence (comments in tasks.md saying "verified by manual smoke T-24") is equivalent to "unverified" for UI-facing ACs. Flag it. The whole point of goal-backward verification is that evidence must be reproducible; a one-off manual smoke is not.
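For illustration only (the skill does not prescribe an exact layout, and the AC texts are invented), the CRITICAL section mentioned in step 3 might read:

```
## CRITICAL: UI-facing ACs not verified (browser MCP missing)
- AC-02 "user sees the new todo appear in the list": unverified — browser MCP missing
- AC-05 "completed todos render with strikethrough": unverified — browser MCP missing
Action required: enable chrome-devtools MCP and re-run verification, or log a D-NN waiver in STATE.md.
```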
+
  ### Step 4: Run Actual Tests (Decisive)

  For each FR / AC, attempt to **run the tests** to confirm:
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "@curdx/flow",
- "version": "2.0.0-beta.3",
+ "version": "2.0.0-beta.5",
  "description": "CLI installer for CurDX-Flow — AI engineering workflow meta-framework for Claude Code",
  "type": "module",
  "bin": {