npm - cc-devflow - Versions diffs - 4.5.2 → 4.5.3 - Mend

cc-devflow 4.5.2 → 4.5.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (76)

package/.claude/skills/cc-investigate/assets/ANALYSIS_TEMPLATE.md CHANGED

@@ -17,10 +17,25 @@
 - What the user saw:
 - Reproduction command / path:
 - Repro stability: `stable` | `intermittent` | `not-yet-reproduced` | `narrowed-only`
+- Matches reported symptom: `yes` | `no` | `partial` | `unknown`
+- Symptom match evidence:
 - Expected:
 - Actual:
 - Impact / blast radius:
+## Feedback Loop Contract
+- Loop type: `failing-test` | `http-script` | `cli-fixture` | `browser-script` | `trace-replay` | `throwaway-harness` | `property-fuzz` | `bisect` | `differential` | `hitl`
+- Command or manual driver:
+- Expected failing signal:
+- Actual failing signal:
+- Runtime:
+- Determinism: `deterministic` | `high-rate-flaky` | `low-rate-flaky` | `unknown`
+- Failure rate:
+- Signal specificity:
+- Sharpening plan:
+- If no loop, evidence request:
 ## Evidence Chain
 - Logs / stack traces:
@@ -29,6 +44,7 @@
 - Existing tests:
 - Prior investigations:
 - TODO / backlog / report-card signals:
+- Native domain / decision context:
 ## Boundary Probe Matrix
@@ -55,9 +71,9 @@
 ## Diagnostic Instrumentation Plan
-| Probe location | Question answered | Command to run | Expected signal | Actual signal | Cleanup requirement |
-| --- | --- | --- | --- | --- | --- |
-| | | | | | |
+| Probe tag | Probe location | Question answered | Command to run | Expected signal | Actual signal | Cleanup requirement |
+| --- | --- | --- | --- | --- | --- | --- |
+| | | | | | | |
 ## Pattern Analysis
@@ -70,9 +86,16 @@
 | configuration drift | | ruled-out | |
 | stale cache | | ruled-out | |
 | resource leak | | ruled-out | |
+| performance regression | | ruled-out | |
 | trust boundary drift | | ruled-out | |
 | timing guess / flaky wait | | ruled-out | |
+## Candidate Hypotheses
+| Rank | Hypothesis | Why plausible | Prediction | Status |
+| --- | --- | --- | --- | --- |
+| 1 | | | | pending |
 ## Research Evidence
 - External research used: `yes` | `no`
@@ -94,6 +117,7 @@
 - Attempted evidence:
 - Why current entry is suspect:
 - Next option: `continue-with-new-hypothesis` | `instrument-and-wait` | `human-review` | `reroute-cc-plan`
+- Evidence request:
 - Recommendation:
 ## Root Cause
@@ -108,6 +132,14 @@
 - Operator handling after fix:
 - Prior history relationship: `new` | `recurring` | `same-root-cause` | `architectural-smell-candidate`
+## Correct Test Seam
+- Test seam:
+- Public interface exercised:
+- Why this seam reaches the real trigger chain:
+- Why a shallower test would be false confidence:
+- If no correct seam exists:
 ## Repair Boundary
 - Fix strategy:
@@ -125,6 +157,9 @@
 ## Review Gate
 - Repro stable:
+- Feedback loop trustworthy:
+- Symptom match confirmed:
 - Root cause confirmed:
+- Correct test seam identified:
 - Repair scope still belongs to this requirement:
 - If not, reroute:
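
The `Feedback Loop Contract` fields in this template could be checked mechanically before the review gate. The following is a minimal sketch in TypeScript, not part of the package: the field names follow the `feedbackLoop` object in `TASK_MANIFEST_TEMPLATE.json`, while the function name `checkFeedbackLoop` and the exact rules (an empty `commandOrDriver` means "no loop", which then requires an evidence request) are assumptions made for illustration.

```typescript
// Loop types copied from the template's `Loop type` field.
const LOOP_TYPES = [
  "failing-test", "http-script", "cli-fixture", "browser-script", "trace-replay",
  "throwaway-harness", "property-fuzz", "bisect", "differential", "hitl",
] as const;

interface FeedbackLoop {
  loopType: string;
  commandOrDriver: string;
  expectedFailingSignal: string;
  actualFailingSignal: string;
  symptomMatchEvidence: string;
  evidenceRequest: string;
}

// Hypothetical checker: returns a list of problems, empty when the loop looks usable.
function checkFeedbackLoop(loop: FeedbackLoop): string[] {
  const problems: string[] = [];
  if (!LOOP_TYPES.some((t) => t === loop.loopType)) {
    problems.push("unknown loopType: " + loop.loopType);
  }
  if (loop.commandOrDriver.trim() === "") {
    // Assumption: no command or driver means no loop exists yet.
    if (loop.evidenceRequest.trim() === "") {
      problems.push("no loop and no evidence request");
    }
    return problems;
  }
  if (loop.expectedFailingSignal.trim() === "") problems.push("missing expectedFailingSignal");
  if (loop.actualFailingSignal.trim() === "") problems.push("missing actualFailingSignal");
  if (loop.symptomMatchEvidence.trim() === "") problems.push("missing symptomMatchEvidence");
  return problems;
}

// Usage with values taken from the manifest template.
const sampleLoop: FeedbackLoop = {
  loopType: "failing-test",
  commandOrDriver: "npm test -- src/feature/feature.test.ts",
  expectedFailingSignal: "The test fails with the user-reported behavior",
  actualFailingSignal: "Observed failure output from the current repo",
  symptomMatchEvidence: "Failure output matches the reported symptom, not a nearby unrelated failure",
  evidenceRequest: "",
};
console.log(checkFeedbackLoop(sampleLoop));
```

Returning a list of problems rather than a boolean keeps the result usable as review-gate notes.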

package/.claude/skills/cc-investigate/assets/TASKS_TEMPLATE.md CHANGED

@@ -16,6 +16,8 @@
 - Execution mode: `single-path` | `parallel-ready`
 - Confirmed root cause:
 - Root-cause hypothesis:
+- Feedback loop:
+- Symptom match evidence:
 - Frozen repair boundary:
 - Boundary probes:
 - Backward trace:
@@ -28,16 +30,19 @@
 - Commands to trust:
 - Do not re-decide:
 - Parallel boundaries:
+- Correct test seam:
+- Evidence request if blocked:
 ## Phase 1: Reproduce And Probe Guard
 - [ ] T001 [TEST] Capture the failing behavior as a stable reproduction (dependsOn:none) `path/to/test`
-  Goal: First turn the bug into a rerunnable failing fact.
+  Goal: First turn the bug into a fast, precise, rerunnable failing fact that matches the user's symptom.
   Files: `path/to/test`
   Read first: `analysis.md`, `tasks.md`
   Verification: `npm test -- path/to/test`
-  Evidence: failing output or reproducible log
-  Ready when: the reproduction path is stable and analysis has recorded the necessary boundary / trace / comparison evidence
+  Evidence: failing output or reproducible log + symptom match evidence
+  Correct seam: test must exercise the real trigger chain through a public interface
+  Ready when: the feedback loop is stable and analysis has recorded the necessary boundary / trace / comparison evidence
 ## Phase 2: Repair
@@ -47,7 +52,7 @@
   Read first: `analysis.md`, `path/to/test`
   Verification: `npm test -- path/to/test`
   Evidence: passing output + checkpoint
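-  Ready when: T001 has proven the problem exists and analysis has proven the root-cause source
+  Ready when: T001 has proven the same user symptom exists and analysis has proven the root-cause source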
 ## Phase 3: Verify
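
T001 is ready only once the feedback loop is stable, and one way to judge that is to run the loop repeatedly and record how often the failing signal appears. The sketch below is an illustration only: the labels come from the `Determinism` field in `ANALYSIS_TEMPLATE.md`, but the helper name `measureLoop` and the 0.5 cut-off between `high-rate-flaky` and `low-rate-flaky` are assumptions, since the templates do not define thresholds.

```typescript
// Labels taken from the `Determinism` field of the analysis template.
type Determinism = "deterministic" | "high-rate-flaky" | "low-rate-flaky" | "unknown";

interface LoopMeasurement {
  runs: number;
  failures: number;
  failureRate: number; // fraction of runs that showed the failing signal
  determinism: Determinism;
}

// Hypothetical helper: runs the loop `runs` times and classifies the result.
function measureLoop(showsFailingSignal: () => boolean, runs: number): LoopMeasurement {
  if (runs <= 0) {
    return { runs: 0, failures: 0, failureRate: 0, determinism: "unknown" };
  }
  let failures = 0;
  for (let i = 0; i < runs; i++) {
    if (showsFailingSignal()) failures++;
  }
  const failureRate = failures / runs;
  let determinism: Determinism;
  if (failures === runs) {
    determinism = "deterministic";
  } else if (failures === 0) {
    determinism = "unknown"; // never reproduced, so nothing can be classified
  } else if (failureRate >= 0.5) {
    determinism = "high-rate-flaky"; // assumed threshold, for illustration only
  } else {
    determinism = "low-rate-flaky";
  }
  return { runs, failures, failureRate, determinism };
}

// Usage: a stand-in loop that shows the failing signal on every third run.
let callCount = 0;
const measurement = measureLoop(() => ++callCount % 3 === 0, 9);
console.log(measurement.determinism, measurement.failures, measurement.runs);
```

In a real investigation `showsFailingSignal` would wrap the reproduction command; here it is a stub so that the sketch stays self-contained.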

package/.claude/skills/cc-investigate/assets/TASK_MANIFEST_TEMPLATE.json CHANGED

@@ -20,7 +20,7 @@
     ]
   },
   "planningMeta": {
-    "ccInvestigateSkillVersion": "1.1.4",
+    "ccInvestigateSkillVersion": "1.1.6",
     "analysisVersion": "analysis.v1",
     "approvedAt": "2026-04-17T12:00:00.000Z",
     "approvedBy": "user",
@@ -29,10 +29,24 @@
   "investigationMeta": {
     "symptomStatus": "stable",
     "reproductionPath": "npm test -- src/feature/feature.test.ts",
+    "feedbackLoop": {
+      "loopType": "failing-test",
+      "commandOrDriver": "npm test -- src/feature/feature.test.ts",
+      "expectedFailingSignal": "The test fails with the user-reported behavior",
+      "actualFailingSignal": "Observed failure output from the current repo",
+      "symptomMatchEvidence": "Failure output matches the reported symptom, not a nearby unrelated failure",
+      "runtime": "under 10s",
+      "determinism": "deterministic",
+      "failureRate": "100%",
+      "signalSpecificity": "asserts the exact broken behavior",
+      "sharpeningPlan": "Narrow setup or assertions if the loop becomes slow or broad",
+      "evidenceRequest": ""
+    },
     "patternAnalysis": {
-      "selectedPattern": "implementation drift",
+      "selectedPattern": "null propagation",
       "ruledOutPatterns": [
         "race condition",
+        "performance regression",
         "configuration drift",
         "timing guess / flaky wait"
       ],
@@ -73,6 +87,7 @@
     },
     "diagnosticInstrumentation": [
       {
+        "probeTag": "[DEBUG-FIXXXX-a4f2]",
         "probeLocation": "file:line or component boundary",
         "questionAnswered": "Which boundary first emits the invalid value?",
         "commandToRun": "npm test -- src/feature/feature.test.ts",
@@ -81,8 +96,23 @@
         "cleanupRequirement": "Remove temporary probe or convert it into a durable assertion/log"
       }
     ],
+    "candidateHypotheses": [
+      {
+        "rank": 1,
+        "statement": "Specific, testable root-cause claim",
+        "whyPlausible": "Reproduction output points to the affected contract",
+        "prediction": "The failing signal disappears when that contract is restored",
+        "status": "accepted-for-testing"
+      }
+    ],
     "priorInvestigations": [],
     "researchEvidence": [],
+    "domainDecisionContext": {
+      "contextFilesRead": [],
+      "adrFilesRead": [],
+      "vocabularyNotes": [],
+      "adrConflicts": []
+    },
     "rootCauseHypothesis": {
       "statement": "Specific, testable root-cause claim",
       "falsificationMethod": "Command, log probe, assertion, or code-path check",
@@ -112,6 +142,13 @@
       "nextOption": "cc-do",
       "recommendation": "Repair the confirmed root cause"
     },
+    "correctTestSeam": {
+      "testSeam": "public interface or end-to-end path that reaches the real trigger chain",
+      "publicInterfaceExercised": "CLI/API/UI behavior observed by callers",
+      "realTriggerChainCoverage": "The test enters through the same trigger path as the bug",
+      "whyShallowTestRejected": "A lower-level unit test would not prove the upstream contract",
+      "ifNoCorrectSeam": ""
+    },
     "repairBoundary": {
       "affectedModule": "src/feature",
       "allowedFiles": [
@@ -172,6 +209,8 @@
       ],
       "acceptance": [
         "The target bug is reproduced as a stable failure",
+        "The failing loop matches the user-reported symptom",
+        "The regression test uses the correct seam for the real trigger chain",
         "The failure output points to the confirmed root-cause path"
       ],
       "verification": [

package/.claude/skills/cc-investigate/references/investigation-contract.md CHANGED

@@ -11,6 +11,8 @@
 - symptom
 - reproduction path
+- feedback loop contract
+- symptom match evidence
 - expected vs actual
 - code path
 - recent change signal
@@ -20,9 +22,11 @@
 - reference comparison, when a similar working path exists
 - diagnostic instrumentation plan, when probes are needed
 - pattern analysis
+- ranked candidate hypotheses
 - root-cause hypothesis
 - falsification method
 - confirmed root cause
+- correct test seam
 - root cause class
 - repair boundary
 - blast radius
@@ -37,6 +41,7 @@
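 Every hypothesis must be falsifiable:
+- `candidateRank`: ranking of candidate hypotheses, to avoid anchoring on the first intuition
 - `hypothesis`: state specifically what is broken and why it causes the symptom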
 - `evidenceFor`
 - `evidenceAgainst`
@@ -47,6 +52,22 @@
 Only `confirmed` hypotheses may enter Root Cause.
+## Feedback Loop Contract
+An investigation must first construct a trustworthy pass/fail loop:
+- `loopType`: failing-test / http-script / cli-fixture / browser-script / trace-replay / throwaway-harness / property-fuzz / bisect / differential / hitl
+- `commandOrDriver`
+- `expectedFailingSignal`
+- `actualFailingSignal`
+- `symptomMatchEvidence`
+- `runtime`
+- `determinism`
+- `failureRate`
+- `sharpeningPlan`
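+The loop must reproduce the same failure the user reported. If no loop can be constructed, the investigation may only proceed to `Evidence Request`; it must not freeze a root cause.
 ## Pattern Analysis
 An investigation must explicitly select or rule out common patterns: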
@@ -58,6 +79,7 @@
 - configuration drift
 - stale cache
 - resource leak
+- performance regression
 - trust boundary drift
 - timing guess / flaky wait
@@ -105,6 +127,7 @@
 A temporary probe must answer one explicit question:
+- probe tag
 - probe location
 - question answered
 - command to run
@@ -114,6 +137,28 @@
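 A probe is not a fix. The handoff must state whether each probe is removed, kept as a permanent log, or converted into a test assertion.
+Debug logs must carry a unique prefix, for example `[DEBUG-FIX123-a4f2]`, so that cleanup can be verified with grep.
+## Correct Test Seam
+The repair handoff must record whether the regression test covers the real trigger chain: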
+- `testSeam`
+- `publicInterfaceExercised`
+- `realTriggerChainCoverage`
+- `whyShallowTestRejected`
+- `ifNoCorrectSeam`
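+When no correct seam exists, record that as an architectural fact and keep the original feedback loop as the repair verification.
+## Domain And Decision Context
+Before investigating, read the cc-devflow native context: `devflow/specs/INDEX.md`, the relevant capability specs, the roadmap/backlog handoff, historical `planning/design.md` / `planning/analysis.md`, and `change-meta.json`.
+- Domain concepts, hypothesis names, and test names in the output use the project's existing vocabulary
+- If the root cause or repair direction violates a capability spec, roadmap decision, or historical design decision, the conflict and the reasoning must be recorded explicitly
+- Missing domain vocabulary is an investigation signal; do not invent ad hoc synonyms to paper over a contract gap
 ## Prior History
 An investigation must record whether it checked: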
@@ -166,6 +211,7 @@
 - attempted evidence
 - why current entry is suspect
 - recommended next option: continue / instrument-and-wait / human-review / reroute-cc-plan
+- evidence request: repro env / HAR / log dump / core dump / timestamped recording / temporary production instrumentation
 ## Reroute
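
The unique-prefix rule for debug logs exists so that probe cleanup can be verified by searching for the tag. As a rough in-memory equivalent of that grep check, the sketch below scans source text for a probe tag and reports the remaining hits. The function name `findLeftoverProbes` and the file contents are hypothetical; only the tag format `[DEBUG-FIX123-a4f2]` comes from the contract.

```typescript
interface ProbeHit {
  file: string;
  line: number; // 1-based line number of the leftover probe
  text: string;
}

// Hypothetical helper: `sources` maps a file path to its full text.
function findLeftoverProbes(sources: { [file: string]: string }, probeTag: string): ProbeHit[] {
  const hits: ProbeHit[] = [];
  for (const file of Object.keys(sources)) {
    const lines = sources[file].split("\n");
    lines.forEach((text, index) => {
      // Plain substring match, the same thing a fixed-string grep would do.
      if (text.indexOf(probeTag) !== -1) {
        hits.push({ file, line: index + 1, text: text.trim() });
      }
    });
  }
  return hits;
}

// Usage with made-up file contents: one probe was left behind.
const exampleSources = {
  "src/feature/a.ts": "const x = load();\nconsole.log('[DEBUG-FIX123-a4f2] x =', x);\nreturn x;",
  "src/feature/b.ts": "export const ok = true;",
};
const leftovers = findLeftoverProbes(exampleSources, "[DEBUG-FIX123-a4f2]");
console.log(leftovers);
```

An empty result means every probe carrying that tag has been removed or converted, which is the state the handoff is expected to record.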

package/.claude/skills/cc-plan/CHANGELOG.md CHANGED

@@ -1,5 +1,31 @@
 # CC-Plan Skill Changelog
+## v3.7.0 - 2026-04-28
+- add glossary delta capture for canonical terms, aliases to avoid, ambiguities, and relationship constraints during context sweep
+- require non-trivial public interfaces to compare deliberately different shapes before freezing the final seam
+- mark vertical slices as `AFK` or `HITL` and require durable design / issue handoffs to describe behavior contracts instead of stale file paths
+## v3.6.2 - 2026-04-28
+- clarify that canonical language and durable decisions come from cc-devflow native sources: `devflow/specs/`, roadmap/backlog handoff, planning design/analysis, and change metadata
+- remove external context/architecture-decision files from the standard planning contract so they are not implied as generated artifacts
+- route long-lived decisions into capability spec deltas, roadmap/backlog decision notes, or the current design decision log
+## v3.6.1 - 2026-04-28
+- require plans to freeze public test seams, behavior assertions, mock boundaries, and feedback loop types before handing Red tasks to `cc-do`
+- strengthen TDD planning so Red tasks reject implementation-detail tests, internal collaborator mocks, and fake seams
+- update design, tiny-design, tasks, and manifest templates with test quality fields inherited from the TDD workflow review
+## v3.6.0 - 2026-04-28
+- absorb grilling-session discipline into native planning: one decision branch at a time, recommended answer with evidence, and no user questions when repo evidence can answer
+- require domain language and durable decision scans before naming modules, interfaces, tests, or tasks
+- add interface/deep-module checks so new public surfaces identify callers, hidden complexity, misuse risk, and alternative shapes before task split
+- strengthen test-first planning around vertical tracer bullets so tasks do not become horizontal "all tests first, all implementation later" slices
+- update design, tiny-design, tasks, and manifest templates with language handoff, interface shape, and vertical slice fields
 ## v3.5.6 - 2026-04-28
 - require non-trivial plans to compare named option roles, including minimal viable and ideal architecture, before freezing a recommendation

package/.claude/skills/cc-plan/PLAYBOOK.md CHANGED

@@ -18,14 +18,16 @@
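 5. Versions, sources, and frozen decisions must be traceable.
 6. Mechanical decisions are recorded automatically; taste decisions and user challenges must be handed to the user explicitly for the final call.
 7. Finish the complete boundary within the same blast radius first; defer only cross-system or evidence-free expansion.
-8. Concrete execution plans are test-first by default; a plan that lacks a Red/Green/Refactor chain or a TDD exception must not be handed to `cc-do`.
+8. Concrete execution plans are test-first by default; a plan that lacks a Red/Green/Refactor chain, a public test seam, behavior assertions, mock boundaries, or a TDD exception must not be handed to `cc-do`.
 9. New change directories must use `REQ-<number>-<description>` or `FIX-<number>-<description>`; legacy lowercase directories remain read-only compatible and are no longer produced as new output.
 10. When the raw requirement spans multiple independent subsystems, split it back into the roadmap or multiple REQ/FIX items first; do not squeeze a grab bag into a single plan.
 11. `tiny-design` must still be approved; it is a short design, not a skipped design.
 12. Non-trivial proposals must compare at least the `minimal viable` and `ideal architecture` roles; the smaller option has no inherent priority.
 13. `full-design` must freeze the implementation decision horizon and the error/rescue map so that `cc-do` does not have to improvise design on the spot.
-14. The test framework source, coverage quality, and regression tests must be spelled out during planning; they must not be left to guesswork during execution.
+14. The test framework source, coverage quality, test seams, mock boundaries, and regression tests must be spelled out during planning; they must not be left to guesswork during execution.
 15. UI and developer/operator-facing scope trigger their gates only when applicable; do not stuff every plan into one big review checklist.
+16. Align with the project language and durable decisions before naming capabilities, modules, interfaces, tests, and tasks; terminology conflicts must be surfaced explicitly.
+17. Behavior changes advance as tracer-bullet vertical slices; tasks must not be cut horizontally into "test layer first, then service layer, then UI layer".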
 ## Required Outputs
@@ -63,10 +65,14 @@
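 12. `full-design` must include the implementation decision horizon and the error/rescue map; when not applicable, state the N/A reason.
 13. New artifacts, CLIs, packages, containers, and documentation entry points must have distribution and discoverability spelled out during planning; do not wait until `cc-act` to discover that nobody can use them.
 14. Behavior-change tasks must be split into `[TEST] -> [IMPL] -> [REFACTOR]` or declare a TDD exception; "implement and test" must not be merged into one task.
-15. Regression tests cannot be deferred. When existing behavior is modified and coverage is missing, a regression test must be planned first.
-16. UI scope must include a design completeness score and the loading / empty / error / success / partial states.
-17. Developer/operator-facing scope must include the target persona, time to first value, magic moment, and install / run / debug / upgrade risks.
-18. The review gate blocks only issues that would cause incorrect implementation, stalled execution, scope overrun, or missing verification; wording preferences and nice-to-haves may only be advisory.
+15. Behavior-change tasks must be organized as one tracer-bullet chain per observable behavior; do not write a batch of red tests first and then implement in bulk.
+16. Regression tests cannot be deferred. When existing behavior is modified and coverage is missing, a regression test must be planned first.
+17. Red tasks must verify behavior at the public interface, not private functions, internal call counts, or temporary data structures.
+18. Mocks may be placed only at system boundaries; if a test must mock a module you control, the seam or interface design has not been flattened yet.
+19. When no correct seam can be found, plan an exploratory spike or a design correction first; a fake red test must not pass for TDD.
+17. UI scope must include a design completeness score and the loading / empty / error / success / partial states.
+18. Developer/operator-facing scope must include the target persona, time to first value, magic moment, and install / run / debug / upgrade risks.
+19. The review gate blocks only issues that would cause incorrect implementation, stalled execution, scope overrun, or missing verification; wording preferences and nice-to-haves may only be advisory.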
 ## Approval Flow
@@ -86,9 +92,15 @@
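 - What is the responsibility of each file that will be touched, and why does the change belong in this file rather than in another parallel location?
 - Why does the recommended option beat the other end of `minimal viable` / `ideal architecture`?
 - Which decisions in the foundation / core / integration / polish phases are already frozen, and which are still blocked questions?
+- Does the core language follow `devflow/specs/`, the roadmap handoff, or historical design/analysis, and is there any language conflict?
+- Is each new interface a small interface over a deep module, with complexity hidden behind the right boundary?
 - For each failure path, what are the rescue action, the user-visible result, and the test evidence?
 - What is the first failing test for each new code path / user flow / error path?
+- Through which public seam does the first failing test enter the system, and what observable behavior does it assert?
+- Which dependencies may be mocked, and which internal collaborators must not be mocked?
+- Is the feedback loop an automated test, HTTP, CLI, browser, trace replay, harness, property/fuzz, differential, or HITL, and why is it the shortest trustworthy loop available right now?
 - What is the test framework source, and is existing coverage strong, happy-path-only, smoke-only, or missing?
+- Are tasks organized as end-to-end tracer bullets rather than split horizontally by layer?
 - Which production failure modes are already handled, and which are deferred to the backlog?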
 ## Design Mode Switch
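
The playbook's mock rule (mock only system boundaries, assert behavior through the public interface) can be illustrated with a small sketch. The domain here (`TrialService`, a 14-day trial) is entirely hypothetical and chosen only for illustration. Time is one of the system boundaries the skill names as mockable, so the clock is the only thing replaced, and the private helper is exercised only through the public method.

```typescript
// System boundary: time. This is the only dependency the test replaces.
interface Clock {
  now(): Date;
}

// Hypothetical module under test; `isExpired` is its public interface.
class TrialService {
  private readonly clock: Clock;
  private readonly trialDays: number;

  constructor(clock: Clock, trialDays: number) {
    this.clock = clock;
    this.trialDays = trialDays;
  }

  isExpired(signupDate: Date): boolean {
    return this.daysSince(signupDate) > this.trialDays;
  }

  // Internal collaborator: reached only through isExpired, never mocked or asserted on.
  private daysSince(date: Date): number {
    const msPerDay = 24 * 60 * 60 * 1000;
    return Math.floor((this.clock.now().getTime() - date.getTime()) / msPerDay);
  }
}

// The test fixes the clock at the boundary and asserts caller-visible behavior.
const fixedClock: Clock = { now: () => new Date("2026-04-28T00:00:00.000Z") };
const trialService = new TrialService(fixedClock, 14);

const oldSignupExpired = trialService.isExpired(new Date("2026-04-01T00:00:00.000Z")); // 27 days ago
const recentSignupExpired = trialService.isExpired(new Date("2026-04-20T00:00:00.000Z")); // 8 days ago
console.log(oldSignupExpired, recentSignupExpired);
```

A test that instead spied on `daysSince` or counted its calls would be tied to an implementation detail, which is the kind of Red task the playbook rejects.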

package/.claude/skills/cc-plan/SKILL.md CHANGED

@@ -1,6 +1,6 @@
 ---
 name: cc-plan
-version: 3.5.6
+version: 3.7.0
 description: Use when a requirement, roadmap item, or bug needs scope clarification, design decisions, and executable task breakdown before coding starts.
 triggers:
   - 帮我规划这个需求
@@ -33,6 +33,7 @@ writes:
     required: true
 entry_gate:
   - Read roadmap handoff, current requirement files, code, docs, and tests before drafting design.
+  - Load cc-devflow native language and decision sources (`devflow/specs/`, roadmap/backlog handoff, current or prior `planning/design.md` / `planning/analysis.md`, and `change-meta.json`) before naming concepts, modules, tests, or tasks.
   - Freeze problem, constraints, non-goals, and success criteria before proposing implementation tasks.
   - If the raw ask spans multiple independent subsystems, split it back into roadmap stages or separate REQ/FIX candidates before asking implementation details.
   - "For non-trivial designs, compare named option roles: minimal viable, ideal architecture, and optional hybrid. Do not default to smallest unless it best serves the goal."
@@ -173,7 +174,8 @@ tool_budget:
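 3. If the raw requirement contains multiple independently deliverable subsystems, split it into separate `RM` or `REQ/FIX` candidates first; do not keep probing implementation details inside a single `cc-plan`.
 4. Read the current state of the change directory first. If a legacy directory still contains `BRAINSTORM.md` / `PLAN_REVIEW.md` / `context-package.md`, absorb the useful information into the new `planning/design.md` and do not let those files keep multiplying.
 5. Look at the code, docs, tests, and recent commits before discussing task breakdown.
-6. Write what will not be done before writing what will be done.
+6. Read the cc-devflow native project language and decision context first: `devflow/specs/INDEX.md`, the relevant capability specs, the roadmap/backlog handoff, current or historical `planning/design.md` / `planning/analysis.md`, and `change-meta.json`; skip silently when they do not exist, but any terminology conflict found must be written up as a blocked question or user challenge.
+7. Write what will not be done before writing what will be done.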
 ## Context Sweep
@@ -182,12 +184,14 @@ tool_budget:
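 1. The `RM-ID`, roadmap version, and roadmap skill version for the current item
 2. The item's stage source, evidence, dependencies, success signal, kill signal, next decision, and capability links in `devflow/ROADMAP.md` / `devflow/BACKLOG.md`
 3. `devflow/specs/INDEX.md` and the relevant capability specs
-4. The existing `planning/design.md`, `planning/tasks.md`, `planning/task-manifest.json`, and `change-meta.json` in the current change directory, plus historical planning documents
-5. `CLAUDE.md`, README, and related docs / specs / ADRs / recent commits
-6. The real-world boundaries of the current code, tests, releases, migrations, and dependencies
-7. Test framework source of truth: read the testing conventions in `CLAUDE.md` / project docs first, then corroborate with config files and directory structure.
-8. If there is UI scope, read the existing design system, components, page states, and interaction patterns.
-9. For API / CLI / SDK / developer-facing / operator-facing scope, read the README, docs, package metadata, install/run/debug entry points, and the current first-success path.
+4. Project language / decision context: `devflow/specs/INDEX.md`, the relevant capability specs, the roadmap/backlog handoff, current or historical `planning/design.md` / `planning/analysis.md`, and `change-meta.json`
+5. The existing `planning/design.md`, `planning/tasks.md`, `planning/task-manifest.json`, and `change-meta.json` in the current change directory, plus historical planning documents
+6. `CLAUDE.md`, README, and related docs / specs / recent commits
+7. The real-world boundaries of the current code, tests, releases, migrations, and dependencies
+8. Test framework source of truth: read the testing conventions in `CLAUDE.md` / project docs first, then corroborate with config files and directory structure.
+9. If there is UI scope, read the existing design system, components, page states, and interaction patterns.
+10. For API / CLI / SDK / developer-facing / operator-facing scope, read the README, docs, package metadata, install/run/debug entry points, and the current first-success path.
+11. If the existing language is still muddled, write a minimal glossary delta: canonical term, aliases to avoid, flagged ambiguity, and relationship constraints; record only domain or capability concepts, not short-lived class names.
 Condense these materials into a `Source Handoff` first, then decide between discovery and planning.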
@@ -201,9 +205,22 @@ tool_budget:
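 4. Narrowest wedge: what is the smallest deliverable boundary, and which problems in the same blast radius must be solved along the way?
 5. Observation: are there logs, tests, real flows, or recent commits that prove this problem exists?
 6. Future fit: will this approach still be the right boundary in 6 months, or will it create a second system?
+7. Language fit: are the core nouns used here already canonical terms in the project, or are we creating a second language?
+8. Interface fit: what is the smallest public interface callers actually need, and which complexity should be hidden inside the module?
 Ask about only one key unknown at a time. Do not ask the user anything that can be confirmed from code, docs, tests, or git history.
+## Grilling Protocol
+`cc-plan` may absorb the conclusions of brainstorm / grilling sessions, but it no longer produces a standalone `BRAINSTORM.md`. Follow these rules when digging into a problem:
+1. Walk the decision tree one branch at a time. Each step resolves only one key branch that would change the design or the task split.
+2. Every question must come with a recommended answer, its evidence source, and the downstream decisions affected if the user disagrees.
+3. When the answer can be obtained from code, docs, tests, git history, capability specs, the roadmap handoff, or historical design/analysis, verify it there first instead of asking the user.
+4. Vague words from the user or the docs must be reduced to a canonical term; if that conflicts with `devflow/specs/`, the roadmap/backlog, or historical design/analysis, flag it immediately as a `language conflict`.
+5. Concrete scenarios take precedence over abstract concepts. Stress-test every key boundary with at least one real codepath, user flow, operator flow, or failure path.
+6. Only decisions that meet all three conditions (hard to reverse, surprising without context, real trade-off) should be promoted to a capability spec delta or a roadmap/backlog decision note; otherwise they stay in this change's design decision log.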
 ## Session Protocol
 1. Explore the context first, then write conclusions.
@@ -228,17 +245,23 @@ tool_budget:
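 2. Scope challenge: when the change exceeds 8 files or 2 new services/classes, or cascades across modules, explain why it is not over-engineering.
 3. Implementation surface map: lock down every file to be added or modified, its responsibility, its ownership rationale, and its coupling risk before splitting tasks.
 4. Option role check: non-trivial proposals must compare `minimal viable` and `ideal architecture`, adding `hybrid` when needed, and state why the recommended option serves the current goal.
-5. Implementation decision horizon: write out in advance the decisions implementers will hit in the foundation, core logic, integration, and polish/tests phases; whatever can be frozen now must not be left for `cc-do` to guess on the spot.
-6. Architecture diagram: cross-module or state-flow changes require an ASCII data-flow / dependency diagram.
-7. Error & Rescue map: `full-design` must spell out failure, rescue, user sees, and test evidence per codepath; when not applicable, state the N/A reason.
-8. Code quality scan: point out risks in DRY, naming, error handling, branching three or more levels deep, and hidden coupling.
-9. Test diagram: list new code paths, user flows, error paths, and boundary states, and label each with its first failing test and unit / e2e / eval.
-10. Test framework source: first record which piece of evidence from `CLAUDE.md` / docs / config / directory identifies the test framework; no guessing.
-11. UI state coverage: when there is UI / interaction scope, write the loading / empty / error / success / partial state table and the design completeness score.
-12. DX / operator coverage: developer-facing / operator-facing scope must include the target persona, time to first value, magic moment, and install / run / debug / upgrade risks.
-13. Performance and distribution: when batching, I/O, release artifacts, CLIs, packages, or containers are involved, spell out the performance and distribution boundaries.
-14. NOT in scope: everything that was considered but deferred needs a written reason; it must not vanish into the chat.
-15. Review calibration: only issues that would cause `cc-do` to build the wrong thing, get stuck, overrun scope, or miss tests are blocking; wording preferences and non-blocking suggestions must not masquerade as gate failures.
+5. Domain language check: core nouns, file names, test names, and task titles must align with `devflow/specs/`, the roadmap handoff, or historical design/analysis; when there is no source, record an assumption rather than inventing a second language on the spot.
+6. Interface depth check: when adding or changing a module / API / CLI / SDK, first describe the callers, public operations, hidden complexity, and likely misuse points; for non-trivial public interfaces, compare at least two deliberately different shapes, for example `minimal/common-case` versus `flexible/general-purpose`, then explain why the final shape is deeper and harder to misuse.
+7. Implementation decision horizon: write out in advance the decisions implementers will hit in the foundation, core logic, integration, and polish/tests phases; whatever can be frozen now must not be left for `cc-do` to guess on the spot.
+8. Architecture diagram: cross-module or state-flow changes require an ASCII data-flow / dependency diagram.
+9. Error & Rescue map: `full-design` must spell out failure, rescue, user sees, and test evidence per codepath; when not applicable, state the N/A reason.
+10. Code quality scan: point out risks in DRY, naming, error handling, branching three or more levels deep, and hidden coupling.
+11. Test diagram: list new code paths, user flows, error paths, and boundary states, and label each with its first failing test and unit / e2e / eval.
+12. Test seam check: every Red task must state which public interface, caller flow, or user-visible path proves the behavior; if only private functions, internal call counts, or temporary structures can be tested, change the design first or write a blocked question.
+13. Mock boundary check: only system boundaries may be mocked, such as external APIs, time, randomness, the file system, and necessary database boundaries; do not mock internal modules you control.
+14. Feedback loop check: pick the shortest trustworthy feedback loop for each behavior, in this order of preference: automated test, curl/HTTP, CLI+fixture, browser script, trace replay, throwaway harness, property/fuzz, differential loop, HITL script.
+15. Test framework source: first record which piece of evidence from `CLAUDE.md` / docs / config / directory identifies the test framework; no guessing.
+16. UI state coverage: when there is UI / interaction scope, write the loading / empty / error / success / partial state table and the design completeness score.
+17. DX / operator coverage: developer-facing / operator-facing scope must include the target persona, time to first value, magic moment, and install / run / debug / upgrade risks.
+18. Performance and distribution: when batching, I/O, release artifacts, CLIs, packages, or containers are involved, spell out the performance and distribution boundaries.
+19. NOT in scope: everything that was considered but deferred needs a written reason; it must not vanish into the chat.
+20. Review calibration: only issues that would cause `cc-do` to build the wrong thing, get stuck, overrun scope, or miss tests are blocking; wording preferences and non-blocking suggestions must not masquerade as gate failures.
+21. Durable brief check: design summaries, PRD-style descriptions, and issue / follow-up handoffs describe only behavior, contracts, module responsibilities, and acceptance criteria; do not treat perishable file paths, line numbers, or current implementation details as long-term facts.
 If any item cannot be completed from current evidence, write an `assumption` or a `blocked question`; do not pretend it has been reviewed.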
@@ -250,17 +273,24 @@ tool_budget:
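    - Read the testing conventions in `CLAUDE.md` / project docs first.
    - If there are none, identify the framework from config files and directory structure: `vitest` / `jest` / `pytest` / `go test` / `cargo test` / `rspec` / `playwright` / `cypress`, etc.
    - If there is still no framework, record `test framework unknown` and downgrade the verification plan to an exploratory spike or manual evidence; do not pretend an automated test path already exists.
-2. Every observable behavior change is split into `Red -> Green -> Refactor` by default:
+2. Freeze the test seam and behavior assertions first:
+   - Red must prove the behavior is missing through a public interface, caller flow, CLI/API/UI path, or another real boundary.
+   - Test names, assertions, and fixtures must describe behavior that users / callers care about, not internal implementation steps.
+   - If the correct seam does not exist, plan an exploratory spike or an architecture follow-up first; a brittle unit test must not pass for regression protection.
+3. Every observable behavior change is split into `Red -> Green -> Refactor` by default:
    - Red: write the `[TEST]` task first; its goal is to prove the target behavior is missing with a minimal failing test.
    - Green: then write the `[IMPL]` task, doing only the minimal production implementation that turns the corresponding red test green.
    - Refactor: finally write a `[REFACTOR]` task or define an explicit refactor checkpoint inside the implementation task, stating when duplication, naming, structure, and code smells get cleaned up.
-3. `planning/tasks.md` must not put testing and implementation into the same task. A task that says "implement and test" is a planning failure.
-4. `planning/task-manifest.json` must let `cc-do` see each task's `tddPhase`, dependencies, and evidence: `red` tasks produce failing output, `green` tasks produce passing output, and `refactor` tasks produce green evidence from a rerun.
-5. The test diagram must cover both code paths and user flows. Label each path `unit` / `integration` / `e2e` / `eval`, and grade existing test quality: `strong`, `happy-path-only`, `smoke-only`, or `missing`.
-6. Regression tests are a hard gate. Whenever the plan modifies existing behavior that current tests do not cover, the regression test must be written into `planning/tasks.md`; it cannot be deferred, and the user must not be asked whether to skip it.
-7. Only pure documentation, pure configuration, purely generated files, and throwaway prototypes may be exempt. Exemptions must be recorded under `TDD exceptions` in `planning/design.md` and `planning/tasks.md`, including the reason, risk, substitute verification command, and the entry point for later follow-up evidence.
-8. Parallelism is allowed only after upstream Red/Green dependencies are satisfied. Two `[P]` tasks that share the same red test or the same set of touched files cannot run in parallel.
-9. If no first failing test can be found for the current requirement, write it up as a blocked question or exploratory spike first; do not disguise it as an executable implementation task.
+4. No horizontal slicing: do not write a batch of tests first and then a batch of implementations. The plan must be ordered as tracer-bullet vertical slices: one red test for a behavior -> minimal implementation to turn it green -> necessary refactoring, and only then move on to the next behavior.
+5. `planning/tasks.md` must not put testing and implementation into the same task. A task that says "implement and test" is a planning failure.
+6. Every `[TEST]` task in `planning/tasks.md` must spell out the test seam, behavior asserted, allowed mocks, feedback loop type, and implementation-detail risk.
+7. `planning/task-manifest.json` must let `cc-do` see each task's `tddPhase`, dependencies, test quality boundaries, and evidence: `red` tasks produce failing output, `green` tasks produce passing output, and `refactor` tasks produce green evidence from a rerun.
+8. The test diagram must cover both code paths and user flows. Label each path `unit` / `integration` / `e2e` / `eval`, and grade existing test quality: `strong`, `happy-path-only`, `smoke-only`, or `missing`.
+9. Regression tests are a hard gate. Whenever the plan modifies existing behavior that current tests do not cover, the regression test must be written into `planning/tasks.md`; it cannot be deferred, and the user must not be asked whether to skip it.
+10. Only pure documentation, pure configuration, purely generated files, and throwaway prototypes may be exempt. Exemptions must be recorded under `TDD exceptions` in `planning/design.md` and `planning/tasks.md`, including the reason, risk, substitute verification command, and the entry point for later follow-up evidence.
+11. Parallelism is allowed only after upstream Red/Green dependencies are satisfied. Two `[P]` tasks that share the same red test or the same set of touched files cannot run in parallel.
+12. If no first failing test can be found for the current requirement, write it up as a blocked question or exploratory spike first; do not disguise it as an executable implementation task.
+13. Every vertical slice must be labeled `AFK` or `HITL`: `AFK` means the executor can complete and verify it independently under the existing contract; `HITL` means it still needs user judgment, external permissions, design trade-offs, or manual acceptance. Split down to `AFK` by default, and keep `HITL` only when evidence shows human involvement is required.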
 ## Design Modes
@@ -299,8 +329,14 @@ tool_budget:
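 8. Decision horizon scan: whether the implementation decisions for foundation / core / integration / polish/tests are frozen or explicitly blocked.
 9. Error & rescue scan: whether `full-design` spells out failure -> rescue -> user sees -> test evidence.
 10. Test framework / regression scan: whether the test framework source, coverage quality, and regression tests are explicit.
-11. Review calibration: mark as blocking only issues that would cause incorrect implementation, stalled execution, scope overrun, or missing verification; non-blocking suggestions must be downgraded to advisory
-12. Final gate: state the auto-decided items, taste decisions, user challenges, and final recommendation
+11. Test seam / mock boundary scan: whether Red tasks prove behavior through a public seam, whether mocks occur only at system boundaries, and whether the feedback loop is repeatable.
+12. Domain language scan: whether core nouns, test names, and file responsibilities follow the project language, and whether conflicts are written up as blocked questions / user challenges.
+13. Interface depth scan: whether new interfaces are small enough, whether hidden complexity is deep enough, whether callers can easily use them correctly and not easily misuse them, and whether non-trivial interfaces have been compared across at least two shapes.
+14. Tracer bullet scan: whether tasks are organized as one Red/Green/Refactor chain per behavior rather than stacked horizontally by test layer, service layer, and UI layer.
+15. Slice readiness scan: whether every slice can be demoed / verified independently and is labeled with `AFK` / `HITL`, dependencies, and blocking reasons.
+16. Durable handoff scan: whether design / issue / follow-up copy is expressed in terms of behavior and contracts, without treating current file line numbers as long-term truth.
+17. Review calibration: mark as blocking only issues that would cause incorrect implementation, stalled execution, scope overrun, or missing verification; non-blocking suggestions must be downgraded to advisory
+18. Final gate: state the auto-decided items, taste decisions, user challenges, and final recommendation
 If there is clear UI / interaction scope, add the design completeness score and state coverage table to `planning/design.md`.
 If there is API / CLI / developer-facing / operator-facing scope, add the target persona, time to first value, magic moment, and DX / operator review conclusions to `planning/design.md`.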
@@ -308,8 +344,9 @@ tool_budget:
 ## Good Output
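 - A single `planning/design.md` explains it all: why, what, what not, alternatives, the approved option, design mode, risks, review gate, and execution boundary
-- `planning/tasks.md` keeps only directly executable tasks and the handoff and no longer carries repeated background; behavior changes are split into `[TEST] -> [IMPL] -> [REFACTOR]` by default
-- `planning/task-manifest.json` is the source of truth for `cc-do` and must spell out `dependsOn`, `tddPhase`, parallel eligibility, touchpoints, verification commands, and which roadmap / design / spec versions it inherits
+- `planning/design.md` must use the project's canonical language, record any conflicts with relevant capability specs / roadmap decisions, and explain how new interfaces stay small over deep modules
+- `planning/tasks.md` keeps only directly executable tasks and the handoff and no longer carries repeated background; behavior changes are split into tracer-bullet style `[TEST] -> [IMPL] -> [REFACTOR]` by default, and each Red task states its public seam, behavior assertions, mock boundaries, and feedback loop
+- `planning/task-manifest.json` is the source of truth for `cc-do` and must spell out `dependsOn`, `tddPhase`, `verticalSlice`, test seam, allowed mocks, feedback loop, parallel eligibility, touchpoints, verification commands, and which roadmap / design / spec versions it inherits
 - `change-meta.json` is the capability source of truth and must spell out how this change intends to alter the long-term spec
 - After reading the first screen, the executor knows whether this change is `tiny-design` or `full-design`, and why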
@@ -334,9 +371,10 @@ tool_budget:
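 5. Versions, sources, and frozen decisions must be traceable.
 6. Few solid tasks beat many hollow ones.
 7. Concrete plans are test-first by default; without Red/Green/Refactor or a TDD exception, a plan cannot enter `cc-do`.
-8. Once a task exceeds 2-5 minutes of granularity, keep splitting it until it can be handed to the executor reliably.
-9. Three or more levels of conditionals mean the design has not been flattened; go back to `planning/design.md` and keep simplifying.
-10. `tiny-design` must not be treated as "approval-free"; whenever tasks are to be written, an approved design card must exist first.
+8. Tasks must be end-to-end verifiable vertical slices; unless it is a pure refactor, do not split along horizontal layers such as "model first, then service, then UI".
+9. Once a task exceeds 2-5 minutes of granularity, keep splitting it until it can be handed to the executor reliably.
+10. Three or more levels of conditionals mean the design has not been flattened; go back to `planning/design.md` and keep simplifying.
+11. `tiny-design` must not be treated as "approval-free"; whenever tasks are to be written, an approved design card must exist first.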
 ## Exit Criteria
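
The vertical-slice rule (one behavior goes red, then green, then refactor, before the next behavior starts) lends itself to a mechanical check over an ordered task list. The sketch below is one possible reading, not the package's implementation: `tddPhase` and `verticalSlice` are field names mentioned for `planning/task-manifest.json`, but the `PlannedTask` shape and the `findSlicingProblems` function are hypothetical.

```typescript
type TddPhase = "red" | "green" | "refactor";

// Hypothetical task shape; only the field names come from the skill text.
interface PlannedTask {
  id: string;
  verticalSlice: string;
  tddPhase: TddPhase;
}

const PHASE_ORDER: TddPhase[] = ["red", "green", "refactor"];

// Returns a list of problems; an empty list means the plan is vertically sliced.
function findSlicingProblems(tasks: PlannedTask[]): string[] {
  const problems: string[] = [];
  const closedSlices: string[] = []; // slices the plan has already moved away from
  let current: string | null = null;
  let lastPhase: TddPhase | null = null;

  for (const task of tasks) {
    if (task.verticalSlice !== current) {
      if (current !== null) {
        if (lastPhase === "red") problems.push(`slice ${current} left before turning green`);
        closedSlices.push(current);
      }
      if (closedSlices.indexOf(task.verticalSlice) !== -1) {
        problems.push(`slice ${task.verticalSlice} is interleaved with other slices`);
      }
      current = task.verticalSlice;
      lastPhase = null;
      if (task.tddPhase !== "red") {
        problems.push(`${task.id}: slice ${current} does not start with a red task`);
      }
    } else if (lastPhase !== null && PHASE_ORDER.indexOf(task.tddPhase) < PHASE_ORDER.indexOf(lastPhase)) {
      problems.push(`${task.id}: ${task.tddPhase} comes after ${lastPhase}`);
    }
    lastPhase = task.tddPhase;
  }
  if (current !== null && lastPhase === "red") problems.push(`slice ${current} ends without a green task`);
  return problems;
}

// Usage: a vertical plan passes; "all tests first, all implementation later" does not.
const verticalPlan: PlannedTask[] = [
  { id: "T001", verticalSlice: "A", tddPhase: "red" },
  { id: "T002", verticalSlice: "A", tddPhase: "green" },
  { id: "T003", verticalSlice: "B", tddPhase: "red" },
  { id: "T004", verticalSlice: "B", tddPhase: "green" },
];
const horizontalPlan: PlannedTask[] = [
  { id: "T001", verticalSlice: "A", tddPhase: "red" },
  { id: "T002", verticalSlice: "B", tddPhase: "red" },
  { id: "T003", verticalSlice: "A", tddPhase: "green" },
  { id: "T004", verticalSlice: "B", tddPhase: "green" },
];
console.log(findSlicingProblems(verticalPlan));
console.log(findSlicingProblems(horizontalPlan));
```

The check looks only at task order; `dependsOn` edges and the `AFK` / `HITL` labels are left out to keep the sketch short.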