npm - @bolloon/bolloon-agent - Versions diffs - 0.1.0 → 0.1.2 - Mend

@bolloon/bolloon-agent 0.1.0 → 0.1.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (581) hide show

package/src/bollharness/docs/decisions/ADR-038-harness-optimization-strategy.md DELETED Viewed

@@ -1,2039 +0,0 @@
-# ADR-038: AI Harness 优化策略 — 测量驱动的持续进化
-**状态**: proposed (v5 — meta-review closure patch appended; see §12)
-**日期**: 2026-04-07
-**触发**: Anthropic harness engineering 文章 + 6 轮自研数据暴露治理有效性缺口 + CC 源码/行业前沿深度学习 + 18 个补充资源（v4）+ 用户 meta-review "是否真的完成最佳实践" 标准（v5）
-**关联**: ADR-030 (Guard Signal Protocol), PLAN-058 (Guard 实施)
-**Change Classification**: contract（影响 hook 行为、skill 加载、guard 执行流程）
-### v4 Changelog（相对 v3 的增量）
-> **v4 是 v3 的增量补丁，不是重写。** v3 全部 9 条 SC + 7 个 D 决策 + 4 维 Grading 全部保留。v4 在 v3 的稳定地基上叠加 6 个补充洞察。
-> [来源: Anthropic — "non-linear improvement, regularly preferred middle iteration over last one"。好的迭代是叠加而非推翻。]
-**新增内容**：
-| 章节 | 新增 | 来源 |
-|------|------|------|
-| §2.4 | 6 个补充洞察（Initializer Agent / Context Rot / Trace Analyzer / Tool Isolation / Objective Recitation / Reliability 4 维） | 18 资源横向搜索 |
-| §3.10 | Context Rot 防御原则（不只防 exhaustion） | Anthropic Effective Context Engineering |
-| §3.11 | Objective Recitation 原则 | Manus/IMPACT |
-| §3.12 | Tool Isolation 原则（schema-level 写权限排除） | OpenDev arXiv 2603.05344 |
-| §3.13 | Reliability 4 维测量原则 | arXiv 2602.16666 |
-| §4 D8 | Initializer Agent JSON 进度追踪（严格 schema 防自改验收） | Anthropic 第二篇 |
-| §4 D9 | PreCompact 增强：Objective Recitation | Manus/IMPACT |
-| §4 D10 | Trace Analyzer：failure-analyzer.py 升级为自动模式分析 | LangChain |
-| §4 D11 | 审查 Agent 工具隔离协议 | OpenDev |
-| §5 | 架构蓝图加 D8/D9/D10/D11 事件流 | — |
-| §6 | 复用清单加 6 项（29-34） | — |
-| §10 | 参考文献加 6 个新来源 | — |
-| §11 | v4 audit 状态 | — |
-**v3 状态**：保留为有效设计基线。所有 v3 决策（D1-D7）继续推进部署，v4 决策（D8-D11）作为增量纳入同一部署批次。
----
-## §0 Meta-Compliance: 用 Harness 原则设计 Harness
-> 本 ADR 的设计过程本身必须符合它所倡导的原则。v2 自审结果为 4 FAIL / 1 PARTIAL / 2 PASS。v3 修正所有 FAIL。
-### Sprint Contract（本 ADR 的验收标准）
-| # | 验收标准 | 验证方法 |
-|---|---------|---------|
-| SC-1 | 每个设计决策标注外部证据来源（文章+具体数据点） | 全文扫描：每个 D* 必须有 `[来源: ...]` |
-| SC-2 | 无自拟数值目标——所有阈值/目标来自外部基准或自研数据 | 检查所有数字是否有出处 |
-| SC-3 | 所有评估触发为事件驱动，非日历驱动 | Grep "每天/每月/每周" → 替换为事件条件 |
-| SC-4 | 架构目标=V2（Opus 4.6 原生），不回退到 V1（Sprint 分解） | D4 不包含强制 Sprint 分解 |
-| SC-5 | CC 能力列表完整（含 asyncRewake/attribution/PermissionRequest 等 v2 遗漏项） | 对照 transcript 105 条清单 |
-| SC-6 | 所有参考文献包含 URL | §10 每条有 URL |
-| SC-7 | 包含 Meta-Compliance 审计章节 | 本 §0 + §11 |
-| SC-8 | Evaluator 为按需而非强制（来源：Anthropic V2） | D4 描述 on-demand 触发条件 |
-| SC-9 | v2 自审发现全部闭环 | §11 逐条标注 FIXED |
-### Grading Criteria（评估本 ADR 用）
-采用 Anthropic 原始 4 维度，按架构文档场景适配检测方法：
-| 维度 | 检测方法 | 硬阈值 |
-|------|---------|-------|
-| **Design Quality** | 架构是否 sound？是否利用外部证据？是否事件驱动？ | < 3 = FAIL |
-| **Originality** | 是否超越复制文章？是否将多源 synthesis 为新方案？ | < 3 = FAIL |
-| **Craft** | 数据点是否准确？URL 是否完整？措辞是否精确？ | < 3 = FAIL |
-| **Functionality** | 实施后是否真能改善 harness？ROI 是否正向？ | < 3 = FAIL |
-[来源: Anthropic "Harness Design for Long-Running Application Development" §Grading Criteria — design quality / originality / craft / functionality]
----
-## §1 问题
-### 1.1 症状：数据说了什么
-我们有完整的 6 轮自研数据（docs/research/016-round1~6）。数据讲了一个不舒服但诚实的故事：
-| 数据点 | 数值 | 含义 |
-|--------|------|------|
-| CLAUDE.md 遵从率 | ~10-20% | 最大 token 投入（11.9k）的 80-90% 被忽略 |
-| Convention 层占比 | 53% (73/138 issue) | 超过一半的治理靠最弱的执行层 |
-| Guard 层占比 | 9.2% (13/138 issue) | 最强执行层覆盖不到十分之一 |
-| ADR-030 自身合规率 | 50-60% (Round 5) | 治理系统只遵守自己一半的规则 |
-| Feedback 机械化率 | 19% (8/42) | 81% 的经验教训停留在文字叮嘱 |
-| Harness 组件数 | 16 guard checks + 17 fragments + 26 skills | 只增不减，从未做过 load-bearing 审计 |
-| CC Hook 利用率 | 2/28 事件类型, 1/4 hook 类型 | 93% 的 hook 能力未使用 |
-| Inferential 组件数 | 0 | 没有任何 LLM 参与的质量评估 |
-| 产出质量指标 | 无 | 只测"违不违规"，不测"好不好" |
-**Anthropic 对比证据**——Solo（无 harness）vs Full harness：
-| 维度 | Solo | Full Harness | 倍数 |
-|------|------|-------------|------|
-| 时间 | 20 分钟 | 6 小时 | 18x |
-| 成本 | $9 | $200 | 22x |
-| 质量 | 三处关键断裂 | 完整可用应用 | ∞ |
-Solo 的三个精确失败点（来源: Anthropic 文章 §Solo Run）：
-1. 固定高度面板导致大部分视口空白
-2. 填充关卡提示先创建精灵和实体但 UI 无引导入口
-3. 实体出现在屏幕但不响应输入——实体定义和游戏运行时接线断裂
-这三个都是**跨模块集成问题**——单次生成无法自检的典型失败。正是 Evaluator 要解决的。
-### 1.2 根因
-**R1: 投入分配倒挂** — 最多精力投入遵从率最低的层（CLAUDE.md 规则文字），而非最高的层（guard 代码）。
-[来源: OpenAI "给 agent 地图，不是千页手册" — https://openai.com/index/harness-engineering/]
-**R2: 没有负反馈循环** — Harness 组件只有"增加"路径，没有"移除"路径。Anthropic 验证：从 Sonnet 4.5（V1）升级到 Opus 4.5（V2）后移除了 Sprint 分解——组件必要性随模型能力变化。
-[来源: Anthropic 文章 §V2 — V1 使用 Sonnet 4.5 需要 Sprint 分解和 context reset，V2 使用 Opus 4.5 后 "largely removed the need for sprint decomposition"。注：我们当前使用 Opus 4.6，能力 ≥ Opus 4.5]
-**R3: 只有下限保护，没有上限追求** — 按 Martin Fowler 四象限（Guide/Sensor × Computational/Inferential），全部组件集中在 Computational 象限，0% Inferential。
-[来源: Martin Fowler "Harness Engineering for Coding Agent Users" — https://martinfowler.com/articles/exploring-gen-ai/harness-engineering.html]
-**R4: 没有测量就没有优化** — 不知道每个 hook 的 token 开销、每个 guard 的 findings 频率。LangChain 只改 harness 就从 Terminal Bench 52.8%→66.5%（Top 30→Top 5），因为有 LangSmith traces 系统性 debug 失败模式。
-[来源: LangChain "Improving Deep Agents with Harness Engineering" — https://blog.langchain.com/improving-deep-agents-with-harness-engineering/]
-**R5: CC 能力严重低估** — CC 提供 28 种 hook 事件、4 种 hook 类型、updatedInput、@include、PreCompact、CronScheduler、Verification Agent、Magic Docs、Auto Dream、Session Memory、Skill Improvement、attribution 设置等。我们只用了不到 10%。
-[来源: CC 源码分析 — anthropics/claude-code]
-### 1.3 核心问题定义
-> 如何基于行业最佳实践和 CC 已有能力，建立一个可观测、可评估、能自我进化的 harness 系统？
-### 1.4 假设校准
-**假设我们的 harness 做得不好。** 这不是谦虚——Anthropic 验证的 self-evaluation bias 是结构性的：模型系统性地高估自己的产出质量。Evaluator 开箱即用效果差——"identify issues then talk itself out of them"。
-[来源: Anthropic 文章 §Evaluator — "Out of the box, Claude is a poor QA agent"]
----
-## §2 外部学习
-### 2.1 Anthropic: V1→V2 演化 + 三 Agent 架构
-**来源**: "Harness Design for Long-Running Application Development" (2026-03-24)
-https://www.anthropic.com/engineering/harness-design-long-running-apps
-#### 核心架构
-```
-Planner Agent ─── 拆解大目标（有野心但聚焦产品上下文，不做细粒度技术细节）
-    ↓
-Generator Agent ── V1: 按 Sprint 执行 / V2: 一次性完成全部
-    ↓
-Evaluator Agent ── 用 Playwright 自主导航应用，截图+交互式测试
-    ↓ (不通过 → Generator 决定 Refine 或 Pivot)
-    ↓ (通过 → 结束)
-```
-#### V1→V2 关键演化
-| 方面 | V1 (Sonnet 4.5) | V2 (Opus 4.5+) | 原因 |
-|------|-----------------|---------------|------|
-| Sprint 分解 | 必需（10 sprints） | **移除** | Opus 原生处理长任务 |
-| Context Reset | 频繁（context anxiety） | 不需要 | Opus 无 context anxiety |
-| Evaluator 时机 | 每个 Sprint 后 | **单次 pass** | 不需要细粒度分解 |
-| Planner 输出 | 16 功能 + 10 sprint 分解 | 高层设计语言 | Generator 自行分解 |
-**→ 对我们的含义**：我们应直接设计 V2 架构（无强制 Sprint 分解，Evaluator 单次 pass），不从 V1 演化。
-#### 精确成本数据（DAW 案例）
-| Agent & Phase | Duration | Cost |
-|---|---|---|
-| Planner | 4.7 min | $0.46 |
-| Build Round 1 | 2 hr 7 min | $71.08 |
-| QA Round 1 | 8.8 min | $3.24 |
-| Build Round 2 | 1 hr 2 min | $36.89 |
-| QA Round 2 | 6.8 min | $3.09 |
-| Build Round 3 | 10.9 min | $5.88 |
-| QA Round 3 | 9.6 min | $4.06 |
-| **Total** | **3 hr 50 min** | **$124.70** |
-关键数据点：
-- **QA 仅占总成本 8.3%**（$10.39/$124.70）但发现了所有关键功能缺失
-- **构建成本递减**：$71→$36→$5.88（第一轮后 Generator 已有基础）
-- **QA 时间相对稳定**：8.8/6.8/9.6 min（不随轮次减少——评估复杂度不变）
-#### Evaluator 精确能力证据
-Round 1 QA 发现的三个精确 bug（代码行号级别）：
-1. `fillRectangle` 函数存在但 mouseUp 未正确触发——矩形工具只在起点和终点放瓦片而非填充
-2. `LevelEditor.tsx:892` 的 Delete 键条件应改为 `selection || (selectedEntityId && activeLayer === 'entity')`
-3. `PUT /frames/reorder` 路由定义在 `/{frame_id}` 之后——FastAPI 把 `reorder` 当 frame_id 整数匹配返回 422
-→ Evaluator 反馈精确到代码行号，可直接执行修复。
-#### 关键工程决策
-1. **Sprint Contract**（V1）：实现前 Generator 和 Evaluator 协商验收标准。Sprint 3 单个 contract 有 27 条。Contract 越具体，评估越机械化。
-2. **Grading Criteria 4 维**：design quality / originality / craft / functionality。权重偏向模型弱项。
-3. **Evaluator 校准困难**："several rounds" 调优才达到合理水平。典型失败：识别问题→说服自己不是大问题→批准。
-4. **首轮就已优于基线**：仅有 Grading Criteria 语言（不跑评估循环），首轮就 "noticeably better than baseline with no prompting at all"。标准不只是评分工具，也是隐式生成指导。
-5. **非线性改善**：作者经常更喜欢中间某轮而非最后一轮。保留所有中间产物。
-6. **措辞引力效应**："museum quality" 把设计推向特定视觉收敛。评分标准中的隐喻/类比会成为 generator 输出的引力中心。
-7. **显式惩罚 AI slop**：标准中明确列出并惩罚典型 AI 生成模式。
-8. **Agent 间文件通信**：一个写文件，另一个读并回应。文件提供持久化 artifact + 可审计过程。
-9. **Generator 有 git 版本控制**：每个 sprint 结束有版本记录，支持 rollback。
-10. **Planner 不做细粒度技术决策**：预先指定技术细节出错会级联到下游。
-#### 元原则
-> "Every component encodes an assumption about what the model can't do on its own — those assumptions are worth stress testing."
-> "The space of interesting harness combinations doesn't shrink as models improve. Instead, it moves, and the interesting work is to keep finding the next novel combination."
-> "The evaluator is not a fixed yes-or-no decision. It is worth the cost when the task sits beyond what the current model does reliably solo."
-### 2.2 Claude Code 源码: 完整能力清单
-**来源**: CC 源码 (`anthropics/claude-code`) + 官方插件 (`anthropics/claude-plugins-official`)
-#### 2.2.1 Hook 系统（完整 28 事件）
-**我们使用的 (2/28)**：PreToolUse, PostToolUse
-**高价值未使用**：
-| 事件 | 价值 | 优先级 |
-|------|------|--------|
-| `Stop` | 回复后验证——注入 PreCompletionChecklist、触发 Evaluator | 最高 |
-| `PreCompact` | compact 时保留关键信息（PLAN/issue/ADR-030） | 最高 |
-| `UserPromptSubmit` | 自动注入上下文（当前 branch、活跃 issue） | 高 |
-| `FileChanged` | 文件变更监控 → 文档同步检查 | 高 |
-| `SessionEnd` | 会话结束时自动保存状态、生成 reflection | 高 |
-| `PostToolUseFailure` | 工具调用失败自动分析/重试 | 中 |
-| `SubagentStart/Stop` | 监控子 agent 行为 | 中 |
-| `TaskCreated/Completed` | 后台任务生命周期 | 中 |
-| `WorktreeCreate/Remove` | Worktree 生命周期 | 低 |
-**4 种 hook 类型（我们只用 command）**：
-| 类型 | 能力 | 我们的用途 |
-|------|------|-----------|
-| `command` | shell 脚本 | ✅ guard-feedback.py |
-| `prompt` | LLM 做判断 | ❌ 安全/质量评估 |
-| `agent` | 完整 agentic verifier | ❌ Stop hook 用 Evaluator |
-| `http` | 调用外部 API | ❌ 远程 metrics 上报 |
-**关键 hook 选项（v2 遗漏）**：
-| 选项 | 能力 |
-|------|------|
-| `updatedInput` | 自动修改工具输入（python→python3, Co-Authored-By） |
-| `if` 条件 | 只匹配特定工具/模式时触发 |
-| `async` | 长时间检查不阻塞交互 |
-| `asyncRewake` | 后台运行，exit 2 时唤醒模型（发现问题才中断） |
-| `once` | 一次性检查 |
-| `updatedPermissions` | **动态权限**——PermissionRequest hook 可在运行时按上下文授权 |
-#### 2.2.2 Context 管理
-| 能力 | 现状 | 价值 |
-|------|------|------|
-| `@include` 指令 | ❌ | CLAUDE.md 引用外部文件，不复制内容 |
-| `.claude/rules/*.md` + `paths` | ❌ | 目录级规则（不同 scene 不同规则） |
-| `PreCompact` hook | ❌ | compact 时保留关键信息 |
-| HTML 注释 `<!-- -->` | ❌ | 只供人类阅读，不消耗 token |
-| Magic Docs `# MAGIC DOC:` | ❌ | 自动更新的活文档，`_instructions_` 语法写更新指令 |
-#### 2.2.3 Settings 能力（v2 完全遗漏）
-| 设置 | 价值 |
-|------|------|
-| `attribution` | 自定义 Co-Authored-By 文本——**比 updatedInput hook 更优雅** |
-| `worktree.symlinkDirectories` | Worktree 不复制 venv/node_modules |
-| `worktree.sparsePaths` | Sparse checkout，大 monorepo 优化 |
-| `statusLine` | 显示当前 branch、issue 状态 |
-| `env` | 设置环境变量 |
-| `CLAUDE_ENV_FILE` | Hook 写 bash exports → 应用到后续 BashTool |
-| `apiKeyHelper` | 自定义 API key 来源脚本 |
-| `cleanupPeriodDays` | 聊天记录保留天数 |
-#### 2.2.4 Agent 系统
-| 能力 | 现状 | 价值 |
-|------|------|------|
-| `isolation: "worktree"` | 部分用 | Guardian issue 自动 worktree |
-| Agent `hooks` | ❌ | 审查者有自己的质量保证 hook |
-| Built-in Verification Agent | ❌ | 对抗性验证 prompt，5 种自我合理化识别模式，必须含至少一个对抗性探测（并发/边界值/幂等性），**严禁修改项目文件**（只能写 /tmp） |
-| Agent `memory` 字段 | ❌ | 控制 agent 记忆范围（user/project/local） |
-| Agent `skills` 字段 | ❌ | 声明可用 skills |
-| Agent `initialPrompt` | ❌ | 启动时自动发送的 prompt |
-| Fork Subagent | ❌ | 不指定 subagent_type 时隐式 fork，继承完整上下文 |
-#### 2.2.5 自动化系统
-| 能力 | 现状 | 价值 |
-|------|------|------|
-| CronScheduler | ❌ | 定期任务 |
-| Session Memory | ❌ | 后台 forked subagent，定期提取关键信息写入 markdown |
-| Skill Improvement | ❌ | 每 5 turn 分析对话，返回 `{section, change, reason}` 的 skill 改进建议 |
-| Auto Dream | ❌ | 4 阶段记忆巩固（Orient → Gather → Consolidate → Prune），条件：上次 >24h + 累积 >5 会话 |
-#### 2.2.6 官方插件
-| 插件 | 核心价值 | 子结构 |
-|------|---------|-------|
-| **skill-creator** | Skill eval+benchmark+迭代循环 | 3 子 agent（Grader/Analyzer/Comparator）+ `run_loop.py` 描述优化 + eval-viewer 人类审查 |
-| **claude-md-management** | CLAUDE.md 6 维质量审计 | Commands (20) / Architecture (20) / Patterns (15) / Conciseness (15) / Currency (15) / Actionability (15) = A-F 等级 |
-| **ralph-loop** | Stop hook + completion promise 自动迭代 | 读 transcript JSONL → 提取 `<promise>` tag → 未完成则 increment iteration + feed same prompt |
-| **feature-dev** | 7 阶段特性开发 | Discovery → Exploration (2-3 parallel agents) → **Clarifying Questions (HARD GATE)** → Architecture (2-3 agents: minimal/clean/pragmatic) → Implementation (需 user approval) → Review (3 parallel reviewers: simplicity/DRY, bugs/correctness, conventions) → Summary |
-| **pr-review-toolkit** | 6 独立 reviewer agent | code-reviewer (compliance+bugs, ≥80 置信度), code-simplifier (clarity+DRY, opus), comment-analyzer (comment rot), pr-test-analyzer (behavioral coverage, 1-10 criticality), silent-failure-hunter (CRITICAL/HIGH/MEDIUM), type-design-analyzer (4 维各 1-10: Encapsulation/Expression/Usefulness/Enforcement) |
-| **frontend-design** | 美学引导 prompt | typography/color/motion/composition/backgrounds。**注意**: 这是生成引导 skill，不包含 evaluator 循环——评估循环在 `claude-cookbooks` 的 notebook 中 |
-| **playwright** | 运行时应用实测 | 浏览器自动化 |
-### 2.3 行业最佳实践
-#### Tier 1: 生产级，直接可复用
-| 来源 | 核心贡献 | 对我们的直接价值 | URL |
-|------|---------|----------------|-----|
-| **OpenAI Codex** (3 篇) | "给 agent 地图，不是千页手册"；100 万行代码+1500 PR | CLAUDE.md 越长越无效 | https://openai.com/index/harness-engineering/ |
-| **Stripe Minions** (2 篇) | **"Walls matter more than the model"**——确定性约束比更好的模型更重要。Blueprint: agentic+deterministic 交替。每周 1300+ PR | 确定性 checkpoint 设计 | https://stripe.dev/blog/minions-stripes-one-shot-end-to-end-coding-agents |
-| **LangChain** (3 篇) | 只改 harness: 52.8%→66.5%。PreCompletionChecklist + LoopDetection + LocalContext | **3 个 middleware 模式** | https://blog.langchain.com/improving-deep-agents-with-harness-engineering/ |
-| **Trail of Bits** | 201 skills，15 bugs/week→200 bugs/week。AI Maturity Level 3 = agent 全自动分析+triage+report | Skill 设计 + 成熟度分级 | https://github.com/trailofbits/skills |
-| **Microsoft Gov Toolkit** | 7 开源包（Agent OS/Mesh/Runtime/SRE/Compliance/Marketplace/Lightning），覆盖 OWASP Agentic Top 10 | Agent Compliance grading | https://github.com/microsoft/agent-governance-toolkit |
-#### Tier 2: 框架级
-| 来源 | 核心贡献 | URL |
-|------|---------|-----|
-| **Martin Fowler** (2 篇) | 四象限 + **"The rigor has to go somewhere"**——AI 写代码后纪律转移而非消失。"On the loop" = 设计 spec+test+feedback 引导 agent | https://martinfowler.com/articles/exploring-gen-ai/harness-engineering.html |
-| **Simon Willison** | **Red-Green 模式**：先写失败测试 → agent 让它通过 → 迭代。"应该大量写测试让 agent 迭代，而非写详细 TASK.md" | — |
-| **Addy Osmani** | **"Waterfall in 15 minutes"**——spec.md 快速规划。Commits = save points for rollback | — |
-| **Ralph Wiggum** | **"Tune like a guitar"**——agent 做错时加一条指令，prompt 逐渐积累修正。"Eventually the prompt is the product." 7 原则：One item per loop, Deterministic stack, Specs over prompts, Tune like a guitar | — |
-| **Anthropic Agent Skills** | Progressive disclosure——skill 不应一次全加载，按编辑文件路径动态加载 | https://www.anthropic.com/research/agent-skills |
-| **Google Conductor** | 三阶段（setup→newTrack→implement）+ Brownfield 支持 | — |
-| **OpenAI AGENTS.md** | 三层层级：Global→Project→Merge，32 KiB 限制，`.override.md` 覆盖机制 | — |
-#### Tier 3: 学术前沿
-| 来源 | 核心贡献 | URL |
-|------|---------|-----|
-| **ACE** (ICLR 2026) | Memory 自动进化：generation-reflection-curation 循环, +10.6% | https://github.com/ace-agent/ace |
-| **OpenDev** (arXiv 2603.05344) | 五层安全：(1) prompt guardrails (2) schema-level dual-agent (3) runtime approval (4) tool validation (5) lifecycle hooks | https://arxiv.org/abs/2603.05344 |
-| **ABC** (arXiv 2602.22302) | Guard = Contract (Preconditions, Invariants, Governance, Recovery) | https://arxiv.org/abs/2602.22302 |
-| **NLAH** (arXiv 2603.25723) | Harness 行为外部化为可编辑自然语言制品 + Intelligent Harness Runtime | https://arxiv.org/abs/2603.25723 |
-| **Darwin Gödel Machine** | Agent 自我重写代码，SWE-bench 20→50%，自发涌现错误记忆+多方案评估 | https://arxiv.org/abs/2505.22954 |
-| **OpenAI GEPA** | 采样 agent 轨迹 → 自然语言反思 → 提议 prompt 修订 → 迭代进化 | OpenAI Cookbook |
-### 2.4 v4 补充：18 资源横向搜索的 6 个新洞察
-> v3 的 §2.1-§2.3 主要基于 Anthropic 主文章 + CC 源码 + 行业一手资料。v4 通过横向搜索（Anthropic 第二篇 + Manus/IMPACT + OpenDev + Reliability 论文等）补充了 6 个 v3 未覆盖的关键洞察。
-#### 2.4.1 Initializer Agent 模式
-**来源**: Anthropic "Effective Harnesses for Long-Running Agents"
-https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents
-**核心**:
-- 专用 **Initializer Agent** 在编码 Agent 开始前运行
-- 写 `claude-progress.txt` 进度文件 + 严格 **JSON** 特性清单（200+ features 全部标 `failing`）
-- **严格 JSON 格式**防止模型自改验收标准 — pass/fail 状态只能是两个枚举值，不允许自由文本
-- **单特性/会话约束** — 一个 session 只做一个 feature，防止上下文耗尽和半成品
-- Session 启动协议：先读 `git log` + progress files + feature lists 再开始工作
-**对我们的价值**: 这是 v3 D4.1 Sprint Contract 的**严格化版本**。用 JSON schema 约束验收标准，模型无法用自然语言"重新解释"完成度。比 stop-evaluator.md 的自然语言检查清单更抗 self-evaluation bias。
-#### 2.4.2 Context Rot vs Context Exhaustion
-**来源**: Anthropic "Effective Context Engineering for AI Agents"
-https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
-**核心数据**:
-- **65% 的企业 AI 失败归因于 context drift，不是 raw context limits**
-- 上下文性能随长度递减（"context rot"）—— 不是用完才出问题，是逐渐变差
-- 最安全的轻量级 compaction 是 **tool result clearing**
-- Just-in-time 加载：维护轻量标识符（文件路径、查询），运行时用工具加载数据
-- Subagent 返回压缩摘要（**1,000-2,000 tokens**）而非完整上下文
-**对我们的价值**: 我们的 v3 PreCompact hook 关注"保留什么"（正确），但忽略了 context rot 的本质——不是信息丢失，是注意力分散。11.9k token 的 CLAUDE.md 本身就是 context rot 来源。@include 拆分（D3.1）不只是省 token，更是减少 rot。这给 D3.1 增加了一个新的优化目标维度。
-#### 2.4.3 Trace Analyzer Skill
-**来源**: LangChain "Improving Deep Agents with Harness Engineering"
-https://blog.langchain.com/improving-deep-agents-with-harness-engineering/
-**核心**:
-- LangChain 三 hook 中间件架构：`before_model` / `modify_model_request` / `after_model`
-- **Trace Analyzer Skill**：自动化并行 error 分析 agent，获取 traces → 分析错误 → 提议 harness 变更 → 验证
-- 创建了 **harness 自我改进的反馈循环**
-**对我们的价值**: v3 D6 设计了自我进化机制，但缺少自动化 trace 分析这个组件。我们的 `failure-analyzer.py`（v3 D2.7）只**记录**失败，不**分析**模式。LangChain 的做法是让 Agent 自动分析失败 traces 并提议 harness 变更——这比人工读 JSONL 高效得多。这是 D2.7 的天然演进路径。
-#### 2.4.4 Plan Mode 工具隔离
-**来源**: OpenDev 五层安全论文
-https://arxiv.org/html/2603.05344v1
-**核心**:
-- Plan mode 的 planner subagent **只接收只读工具**——写操作在 **schema 层面**完全排除
-- 六阶段执行循环：pre-check/compaction → thinking → self-critique → action → tool execution → post-processing
-- Subagent 隔离：无 session_manager（消息不持久化）、无 console 输出旁路、无 config 泄露
-- 双内存架构：episodic memory（跨 session 项目级）+ working memory（当前 turn 观察）
-**对我们的价值**: 我们的 Gate 审查用 `TeamCreate` 创建 subagent，但**没有工具隔离**。审查 Agent 理论上可以修改代码——这是结构性安全洞。应该在 schema 层面限制审查 Agent 只能 Read/Glob/Grep，不能 Edit/Write/Bash(write)。
-#### 2.4.5 Objective Recitation
-**来源**: Manus/IMPACT 工程实践
-https://www.morphllm.com/agent-engineering
-**核心**:
-- "**Recite objectives (todo.md) at end of context to maintain goal focus after ~50 tool calls**"
-- KV-cache 命中率是最重要的性能指标（~100:1 prefill-to-decode ratio）
-- 通过 **logit masking** 而非移除定义来屏蔽工具（保留 cache）
-- Cursor 发现：丢失 reasoning traces 导致 **30% 性能下降**
-**对我们的价值**: 我们的 PreCompact hook（v3 D2.3）保留"PLAN/TASK/Issue 状态"，但没有在上下文末尾**重复目标**。50+ tool calls 后 Agent 容易偏离。可以在 PreCompact 或 Stop hook 中注入当前目标摘要——这是 30% 性能损失的潜在防线。
-#### 2.4.6 Agent Reliability 四维度
-**来源**: arXiv "Towards Reliable Agentic Systems" 2602.16666
-https://arxiv.org/html/2602.16666v1
-**核心**:
-- 四个可靠性维度：
-  - **Consistency**（可重复性）
-  - **Robustness**（优雅降级）
-  - **Predictability**（准确置信度）
-  - **Safety**（有界故障）
-- 关键发现：**reliability gains lag capability progress**（可靠性进步远慢于能力进步）
-- **Prompt brittleness paradox**: 模型处理真实基础设施故障比处理同义语义改写更好
-- 多轮评估方法：每任务 5 次不同 seed + 5 种语义等价改写 + 20% API 故障注入
-**对我们的价值**: 我们没有可靠性维度的测量。目前只有"是否遵从规则"（binary），没有"在多大程度上可重复"（continuous）。Prompt brittleness paradox 也解释了为什么 CLAUDE.md 规则遵从率低——同一规则的不同表述，遵从率可能差异巨大。这指向 v3 D1 metrics 之上的下一层测量维度。
----
-## §3 核心原则
-### 3.1 测量先于优化
-没有数据的优化是猜测。每个 harness 组件必须回答：触发频率？findings 数量？token 开销？产出影响？
-[来源: LangChain 用 LangSmith traces 系统性 debug → 52.8%→66.5%]
-### 3.2 每个组件必须证明自己
-每个组件编码一个假设——"模型不能自己做 X"。假设必须被持续验证。模型升级后假设可能不再成立。
-[来源: Anthropic 元原则 + V2 移除 Sprint 的实际案例]
-### 3.3 遵从率阶梯决定投入分配
-blocking hooks (~100%) > advisory (~40-60%) > CLAUDE.md (~10-20%)。一条 guard > 十条 CLAUDE.md 规则。
-[来源: boll Round 3 自研数据 — deploy.sh 无 hook 遵从率 3%, 有 hook ~100%]
-### 3.4 分离生产者和评估者
-Self-evaluation bias 是结构性的。Developer agent ≠ Evaluator agent。但 Evaluator 不是固定必需——**当任务超出模型可靠 solo 完成的边界时才值得投入**。
-[来源: Anthropic — "The evaluator is not a fixed yes-or-no decision. It is worth the cost when the task sits beyond what the current model does reliably solo."]
-### 3.5 复用优于自建
-先问"别人造好了吗？"。CC 已有 28 种 hook 事件、7 个官方插件、内置 Verification Agent。行业有 LangChain middleware、Stripe Blueprint。
-[来源: CC 源码 + 官方插件库 49 个插件]
-### 3.6 事件驱动评估，非日历驱动
-AI 的时间极快，不能用人类的日历节奏规划。正确频率 = **事件触发**：每次代码变更后跑 checkpoint，每次 WP 完成后可选 Evaluator，每次 session 结束后 reflection。数据自然积累，随时可查。
-[来源: Anthropic 每 Sprint 后即评估 / LangChain 每 agent 退出时 PreCompletionChecklist / Stripe 每次代码生成后即 checkpoint]
-### 3.7 确定性节点和推理节点交替
-Agent 生成代码 → [编译] → [测试] → [lint] → Agent 下一步。确定性 checkpoint 不用 LLM，成本 $0、可靠性 100%。
-[来源: Stripe Blueprint — **"Walls matter more than the model"**]
-### 3.8 目标 V2 架构，不回退 V1
-Opus 4.6 原生处理长任务、无 context anxiety。不需要 Sprint 分解，不需要 context reset。直接设计 V2 终态。
-[来源: Anthropic V1→V2 演化 — "Opus 4.5 largely removed context anxiety"]
-### 3.9 Prompt 即产品
-Agent 做错时加一条指令，prompt 逐渐积累修正。不是写长文档后一次性加载，而是 tune like a guitar——持续微调直到准确。
-[来源: Ralph Wiggum 7 原则 — "Eventually the prompt is the product"]
-### 3.10 防御 Context Rot，不只防 Exhaustion
-Context 性能随长度递减是连续过程，不是断崖。优化目标不只是"不超 token 限制"，而是"保持注意力聚焦"。@include / progressive disclosure / tool result clearing / subagent 摘要 都是 rot 防御机制。
-[来源: Anthropic Effective Context Engineering — 65% 失败归因于 drift 而非 raw limits]
-### 3.11 长任务必须 Recite Objective
-50+ tool calls 后 Agent 注意力会从原始目标漂移到中间过程。在上下文末尾重复目标摘要是廉价高效的防漂移手段。这是 KV cache 友好的——只在末尾追加，不重排前文。
-[来源: Manus/IMPACT — recite objectives at end of context after ~50 tool calls; Cursor — 丢失 reasoning traces 30% 性能下降]
-### 3.12 Schema 层隔离审查者工具
-审查 Agent 必须在 schema 层面被剥夺写权限——不是靠 prompt 约束（"请不要修改代码"），而是工具列表本身就不包含 Edit/Write/Bash(write)。Prompt 约束是软约束，schema 隔离是硬隔离。
-[来源: OpenDev arXiv 2603.05344 — Plan mode planner 只读工具 schema-level exclusion]
-### 3.13 可靠性是连续维度，不是 binary
-"是否遵从规则"是 binary 测量，丢失了大量信号。真正有用的可靠性测量是 continuous 4 维：consistency（重复性）/ robustness（降级性）/ predictability（置信度准度）/ safety（故障有界性）。模型能力进步快于可靠性进步——这是 harness 的存在理由。
-[来源: arXiv 2602.16666 — Towards Reliable Agentic Systems]
----
-## §4 决策
-### D1: 建立测量基础设施
-**假设被测试**：我们不知道 harness 的真实 ROI。
-[来源: LangChain 用 LangSmith traces 产生 52.8%→66.5% 提升]
-**方案**：在 `guard-feedback.py` 中增加 JSONL 指标收集，输出到 `.wow-harness/metrics/`。
-**指标清单**：
-| 指标 | 数据源 | 用途 |
-|------|--------|------|
-| `hook_trigger_count` | guard-feedback.py 每次触发 | 组件活跃度 |
-| `fragment_inject_count` per fragment | context_router.py | Fragment 使用频率 |
-| `guard_findings_count` per guard | 各 check_*.py | Guard 有效性 |
-| `guard_blocking_count` per guard | findings 中 blocking=true | 阻断频率 |
-| `fragment_token_cost` per injection | 字节数 / 4 近似 | Token 开销 |
-| `hook_execution_ms` | 计时 | 性能开销 |
-| `wp_completion_score` | Evaluator 输出 | 任务质量 |
-| `loop_detection_count` | LoopDetection | Agent 死磕次数 |
-**基线数据**（来自 guard-feedback.py 现有代码）：Fragment 注入 TTL=3600s，去重后 ~3,191 tokens/session（12,767 bytes）。context_router.py 有 21 个路径模式映射到 17 个 fragment 文件。
-**格式**：JSONL，append-only。数据自然积累，随时可查——不需要"每天早上聚合"的 cron。
-**直接复用**：Trail of Bits `log-gam.sh` 分类 CLI 命令为 read/write + JSONL 时间戳模式。
-### D2: 全面利用 CC Hook 系统
-**假设被测试**：我们只用了 CC hook 能力的 7%。
-#### D2.1: Stop Hook — 完成后自动验证
-注册 `Stop` 事件 hook（agent 类型），自动验证。
-```json
-{
-  "hooks": {
-    "Stop": [{
-      "type": "agent",
-      "prompt": "你是 Evaluator。验证当前完成的工作...",
-      "if": "Agent(*)"
-    }]
-  }
-}
-```
-**PreCompletionChecklist**（来自 LangChain，52.8%→66.5% 提升的关键组件之一）：
-```
-□ 所有新文件都已 git add
-□ 测试全部通过（如果有变更）
-□ 文档与代码一致
-□ 无 dead code / 未使用 import / 半成品
-```
-[来源: LangChain PreCompletionChecklist + CC Built-in Verification Agent]
-#### D2.2: updatedInput — 自动化硬性规则
-用 PreToolUse hook 的 `updatedInput` 自动修改输入，遵从率从 ~10-20% 直接提升到 100%：
-```json
-{
-  "hooks": {
-    "PreToolUse": [{
-      "type": "command",
-      "command": "python3 scripts/hooks/auto-python3.py",
-      "if": "Bash(python *)"
-    }]
-  }
-}
-```
-候选：`python` → `python3`、包含 PROD_IP 的非 deploy 脚本 → block。
-**但 Co-Authored-By 用 `attribution` 设置更优雅**：
-```json
-// settings.json
-{ "attribution": "Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>" }
-```
-[来源: CC settings.json `attribution` 字段 — 比 updatedInput hook 更直接]
-#### D2.3: PreCompact Hook — 保护关键上下文
-Opus 4.6 无 context anxiety（来源: Anthropic — "Opus 4.5 largely removed context anxiety"），不需要 context reset，只需保护 compaction 质量。
-```bash
-#!/bin/bash
-# PreCompact hook: 注入必须保留的关键信息
-cat <<'EOF'
-保留当前正在进行的 PLAN/TASK/Issue 状态信息。
-保留 ADR-030 核心规则（3 条不可降级要求）。
-保留 Sprint Contract（如果有）的全部验收标准。
-EOF
-exit 0
-```
-#### D2.4: if 条件过滤 — 精确定位 Guard
-```json
-{
-  "hooks": {
-    "PostToolUse": [{
-      "type": "command",
-      "command": "python3 scripts/guard-feedback.py",
-      "if": "Write(*) | Edit(*)"
-    }]
-  }
-}
-```
-减少 ~90% 不必要的 hook 调用（Read/Glob/Grep 不需要 guard 检查）。
-#### D2.5: SessionEnd — 自动 Reflection
-```json
-{
-  "hooks": {
-    "SessionEnd": [{
-      "type": "command",
-      "command": "python3 scripts/hooks/session-reflection.py"
-    }]
-  }
-}
-```
-会话结束时自动记录：哪些 guard 有用、哪些 fragment 被参考、是否有新失败模式。
-#### D2.6: 动态权限（PermissionRequest hook）
-```json
-{
-  "hooks": {
-    "PermissionRequest": [{
-      "type": "command",
-      "command": "python3 scripts/hooks/dynamic-permissions.py"
-    }]
-  }
-}
-```
-Hook 返回 `updatedPermissions` 可在运行时按上下文动态授权。例如：在 guardian issue worktree 中自动允许 `Bash(git *)` 操作。
-[来源: CC 源码 — PermissionRequest hook 的 updatedPermissions 返回值]
-#### D2.7: PostToolUseFailure — 失败自动分析
-```json
-{
-  "hooks": {
-    "PostToolUseFailure": [{
-      "type": "command",
-      "command": "python3 scripts/hooks/failure-analyzer.py"
-    }]
-  }
-}
-```
-工具调用失败时记录失败模式，累积数据后可识别系统性问题。
-### D3: CLAUDE.md 重构
-**假设被测试**：11.9k token 的 80-90% 被忽略。
-[来源: OpenAI — "Give Codex a map, not a 1,000-page instruction manual"]
-[来源: boll Round 3 — CLAUDE.md 遵从率 ~10-20%]
-#### D3.1: @include 拆分
-```markdown
-# CLAUDE.md (核心 — 只保留最高优先级规则)
-## 不可妥协的约束
-(精简到真正被违反过的高价值规则)
-## 命令参考
-@./docs/reference/commands.md
-## 路由表
-@./docs/reference/api-routes.md
-## 环境变量
-@./docs/reference/env-vars.md
-## 仓库结构
-@./docs/reference/repo-structure.md
-```
-被 `@include` 的文件在引用时加载，不编辑相关文件时不消耗 context。
-**目标大小**：由实际测量决定——不设自拟数值目标。通过 D1 metrics 追踪 token 开销 vs 遵从率变化。
-[SC-2 合规：不设自拟数值目标]
-#### D3.2: .claude/rules/ 目录级规则
-```markdown
-# .claude/rules/bridge.md
----
-paths:
-  - "bridge_agent/**"
-  - "backend/product/bridge/**"
-  - "bridge_contract/**"
----
-## Bridge 宪法 (ADR-026)
-1. Worker 不拥有业务解释权...
-```
-```markdown
-# .claude/rules/coaching.md
----
-paths:
-  - "scenes/example-coach/**"
-  - "backend/product/coaching/**"
----
-## example-coach开发规则
-...
-```
-只有编辑匹配文件时才加载。
-[来源: CC `.claude/rules/*.md` + `paths` frontmatter + Anthropic Agent Skills progressive disclosure]
-#### D3.3: HTML 注释用于人类注释
-```markdown
-<!-- 这条规则来自 2026-02-11 统一后端迁移教训，详见 memory/planning-lessons.md -->
-## 变更传播规则
-...
-```
-`<!-- -->` 自动剥离，不消耗 token。
-#### D3.4: 已有 guard 覆盖的规则从 CLAUDE.md 删除
-Guard 遵从率 ~100%，CLAUDE.md ~10-20%。重复写是浪费 token。
-### D4: Evaluator 架构（V2 设计）
-**假设被测试**：事后 Gate 审查不如持续 Evaluator 循环有效。
-[来源: Anthropic — QA 仅占总成本 8.3% 但发现所有关键功能缺失]
-#### D4.1: Grading Criteria — 直接复用 Anthropic 4 维
-直接采用 Anthropic 验证过的 4 维度，按领域适配检测方法：
-| 维度 | 来源 | 检测方法（按领域） |
-|------|------|-----------------|
-| **Design Quality** | Anthropic 原始维度 | 前端: Playwright 自主导航+截图 / 后端: pytest+API 断言 / 架构: ADR↔代码对比 |
-| **Originality** | Anthropic 原始维度 | 是否超越模板化实现？是否有领域特定创造性？ |
-| **Craft** | Anthropic 原始维度 | 代码清洁度、命名一致性、边界处理、无 AI slop |
-| **Functionality** | Anthropic 原始维度 | 核心路径可走通、契约一致、无断裂 |
-[SC-2 合规：4 维度直接来自 Anthropic，不是自拟]
-**硬阈值**：任何单维 < 3 = 整个评估 FAIL，不允许"大部分好就通过"。
-[来源: Anthropic Sprint Contract 硬阈值机制]
-**校准原则**：
-1. 权重偏向模型弱项——在已强维度（代码清洁度）减少权重
-2. 评分标准语言本身就是引导——仅有 criteria 首轮就优于无标准基线
-3. 注意措辞引力效应——"museum quality" 会收敛输出。措辞需有意设计
-4. 显式惩罚 AI slop——标准中列出并惩罚模板化、库默认值、半成品
-[来源: Anthropic §Grading Criteria — 精确实践]
-#### D4.2: On-Demand Evaluator（非强制）
-> "The evaluator is not a fixed yes-or-no decision. It is worth the cost when the task sits beyond what the current model does reliably solo." — Anthropic
-**触发条件**（事件驱动，非日历驱动）：
-| 条件 | 触发 Evaluator？ | 原因 |
-|------|----------------|------|
-| scope ≥ 3 files 的 WP | ✅ | 跨模块集成问题需要独立验证 |
-| 涉及 API 契约变更 | ✅ | 契约是最高优先级（ADR-030） |
-| 单文件 bug fix | ❌ | 确定性 checkpoint 足够 |
-| 文档更新 | ❌ | 低风险 |
-| 任何 P0 issue 修复 | ✅ | 高影响 |
-[SC-8 合规：Evaluator 为按需而非强制]
-#### D4.3: Evaluator 校准协议
-> "Out of the box, Claude is a poor QA agent." — Anthropic
-**典型失败模式**：识别合法问题 → 说服自己不是大问题 → 批准。倾向测试肤浅路径而非探索边缘情况。
-**校准循环**：
-1. 读 evaluator 日志
-2. 找到其判断与人类判断分歧的例子
-3. 更新 evaluator prompt
-4. 重复（Anthropic: "several rounds" 才达到合理水平）
-**防止 rationalization 的机制**：
-- 硬阈值，任何一个低于则失败
-- Prompt 中显式要求严格、不宽容
-- Few-shot 示例校准评分锚点
-- 置信度过滤（pr-review-toolkit: 0-100，≥80 才报告）
-- CC Built-in Verification Agent 的 5 种自我合理化识别模式
-**Refine vs Pivot**：Generator 每轮评估后做策略决策——分数趋好则精细化当前方向，不行则转向完全不同的方法。
-**非线性改善**：不要盲目取最后一轮。保留所有中间产物供选择。
-[来源: Anthropic §Evaluator 校准 + §Refine vs Pivot + §Non-linear improvement]
-#### D4.4: 确定性 Checkpoint（Stripe Blueprint）
-```
-Agent 写代码
-  → [pytest -q backend/tests/unit] (确定性, $0)
-  → [type check / lint] (确定性, $0)
-  → Agent 继续
-  → WP 完成
-  → [按需: Evaluator Agent 打分] (推理, ~$3-4)
-```
-[来源: Stripe — "Walls matter more than the model"]
-#### D4.5: LoopDetection（LangChain）
-追踪 per-file edit count。同一文件编辑 > N 次 → 注入提醒换方法。
-[来源: LangChain middleware — 52.8%→66.5% 提升的三个组件之一]
-#### D4.6: Test-Driven Verification（Willison Red-Green）
-对于可测试的变更：先写失败测试 → agent 让它通过 → 迭代。
-这是 TASK.md 的替代路径——不是写详细规范让 agent 执行，而是写测试让 agent 满足。
-> "应该大量写测试让 agent 迭代，而非写详细 TASK.md" — Simon Willison
-**适用场景**：后端 API、数据逻辑、工具函数。
-**不适用**：UI 设计、架构决策、文档。
-[来源: Simon Willison "Agentic Engineering Patterns"]
-### D5: 文档同步机制
-**假设被测试**：文档状态经常与代码脱节。
-[来源: boll Round 2 时间演化分析 — 03-28 交叉点后发现 > 修复，文档漂移是主要来源; CC Magic Docs + FileChanged hook 能力]
-#### D5.1: Magic Docs 自动更新
-```markdown
-# MAGIC DOC: boll API Routes Reference
-_Keep this file in sync with backend/product/routes/ — list all active routes with method, path, auth requirement, and description._
-```
-标题下一行用 `_instructions_` 语法写更新指令。CC 在对话中发现新信息后自动更新。
-#### D5.2: FileChanged Hook 监控
-用 `SessionStart` hook 返回 `watchPaths`（绝对路径数组），监控：
-- `backend/product/routes/*.py` → 路由变更触发文档同步检查
-- `docs/decisions/*.md` → ADR 变更检查代码一致性
-[来源: CC 源码 — SessionStart hook 的 watchPaths 返回值 + FileChanged 事件]
-### D6: 自我进化机制
-**假设被测试**：手动维护的 harness 会腐化。
-[来源: Anthropic — 组件必要性随模型能力变化; ACE — generation-reflection-curation 循环 +10.6%]
-#### D6.1: 事件驱动的 Load-Bearing 审计
-**不是**"每月 1 日"——是**累积 N 个 WP 数据点后自动触发**。
-| 组件类型 | 候选移除条件 | 候选提升条件 |
-|----------|-------------|-------------|
-| Guard | 最近 N 个 WP findings = 0 | — |
-| Fragment | 最近 N 个 WP 注入 < 3 次 | — |
-| Skill | 最近 N 个 WP 调用 = 0 | — |
-| CLAUDE.md 规则 | 已有 guard 覆盖 → 删除 | 高违反率+可自动化 → 写 guard |
-[SC-3 合规：事件驱动而非日历驱动]
-#### D6.2: SessionEnd Reflection（事件驱动）
-每次 session 结束时（SessionEnd hook），自动记录：
-1. 本次 session 哪些 guard 有用
-2. 是否有新失败模式
-3. 如果有 → 提议新 guard（需人确认）
-#### D6.3: Skill Creator 驱动的质量循环
-**触发条件**（事件驱动）：当一个 skill 在最近 N 次调用中收到负面反馈或产出质量低于阈值时触发。
-用 CC 官方 `skill-creator` 插件的 eval 循环：
-```
-写 Skill draft → 跑 test prompts (with vs without) → Grader 评估 → Benchmark → 人类审查 → 改进 → 重复
-```
-这是 Anthropic Generator-Evaluator 架构的直接工程实现。
-[来源: CC skill-creator 插件 — Grader/Analyzer/Comparator 三子 agent + run_loop.py]
-#### D6.4: CLAUDE.md 质量审计（claude-md-management）
-**触发条件**（事件驱动）：当 CLAUDE.md 或 @include 文件被编辑后触发；或累积 N 个 session 的 Reflection 数据后触发。
-用 CC 官方插件审计：6 维评分（Commands 20 / Architecture 20 / Patterns 15 / Conciseness 15 / Currency 15 / Actionability 15）→ A-F 等级。
-[来源: CC claude-md-management 插件 — quality-criteria.md]
-#### D6.5: Prompt Tuning 积累（Ralph Wiggum "Tune like a guitar"）
-每次 agent 做错事：
-1. 识别失败模式
-2. 加一条指令到对应的 skill 或 rule
-3. Prompt 逐渐积累修正——比存 memory feedback 更直接
-区别于 Memory feedback：Memory 是被动召回（需要匹配场景），prompt tuning 是主动注入（永远生效）。
-### D7: 采纳官方插件 + CC 内置能力
-**假设被测试**：我们没有充分利用 CC 已有能力。
-[来源: CC 源码分析 — 28 hook 事件只用 2 个、4 hook 类型只用 1 个、7 官方插件 0 安装]
-#### D7.1 安装官方插件
-```bash
-/install skill-creator
-/install claude-md-management
-/install ralph-loop
-/install frontend-design
-/install pr-review-toolkit
-/install feature-dev
-/install playwright
-```
-#### D7.2 启用 CC 内置能力
-```json
-// settings.json
-{
-  "attribution": "Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>",
-  "worktree": {
-    "symlinkDirectories": ["backend/venv", "website/node_modules"],
-    "sparsePaths": []
-  }
-}
-```
-#### D7.3 配置 statusLine
-显示当前 branch + 活跃 issue 数量。
-### D8: Initializer Agent JSON 进度追踪（v4 新增）
-**假设被测试**：自然语言验收标准容易被模型自我说服修改。
-[来源: Anthropic "Effective Harnesses for Long-Running Agents" — 严格 JSON 防自改 + 单特性/会话约束]
-**方案**：在 WP 启动时由 Initializer Agent（或 dev skill）生成严格 JSON 进度文件，作为 stop-evaluator.md 自然语言检查的**机械化补充**。
-#### D8.1: progress.json 严格 schema
-```json
-{
-  "wp_id": "WP-XXX",
-  "session_pid": 12345,
-  "started_at": "2026-04-07T10:00:00",
-  "objective": "一句话原始目标 — 不允许修改",
-  "features": [
-    {
-      "id": "F1",
-      "subject": "实现 X 函数",
-      "status": "failing",
-      "verification_command": "backend/venv/bin/pytest -q backend/tests/unit/test_x.py",
-      "evidence": null
-    }
-  ],
-  "constraints": {
-    "max_features_per_session": 1,
-    "must_pass_before_stop": ["all features status == passing"]
-  }
-}
-```
-**关键约束**：
-- `status` 字段是枚举：`failing` / `passing` / `blocked`，**不允许其他值**
-- `objective` 字段一旦写入**不允许修改**（写入后 read-only flag）
-- 一个 session 最多处理 N 个 feature（防止上下文耗尽）
-- Stop hook 读 progress.json，所有 feature 必须 `passing` 才允许 Stop
-#### D8.2: 与 Stop hook 的协同
-stop-evaluator.md 增加机械化第一关：
-```
-Step 0 (mechanical): cat .wow-harness/progress/<wp_id>.json
-  ├─ 所有 features.status == "passing"? → 通过 Step 0 进入 PreCompletionChecklist
-  └─ 任何 failing/blocked → 立即 FAIL，不进入 LLM 评估
-```
-机械化关卡 $0 成本，零 self-evaluation bias。LLM 评估只在机械化通过后启动。
-[SC-2 合规：max_features_per_session 由实际任务决定，不预设数值]
-### D9: PreCompact 增强 — Objective Recitation（v4 新增）
-**假设被测试**：50+ tool calls 后 Agent 注意力漂移。
-[来源: Manus/IMPACT — recite objectives at end of context; Cursor — 丢失 traces 30% 性能下降]
-**方案**：v3 的 `precompact.sh` 增加目标重复段，在 compaction 后的新上下文末尾出现。
-```bash
-#!/bin/bash
-# scripts/hooks/precompact.sh (v4 增强)
-# 原有：保留 PLAN/TASK/Issue 状态 + ADR-030 核心规则
-echo "保留当前正在进行的 PLAN/TASK/Issue 状态信息。"
-echo "保留 ADR-030 核心规则（3 条不可降级要求）。"
-# v4 新增：Objective Recitation
-if [ -f .wow-harness/progress/current.json ]; then
-  echo ""
-  echo "=== Objective Recitation (v4 D9) ==="
-  echo "原始目标（不可漂移）："
-  python3 -c "import json; print(json.load(open('.wow-harness/progress/current.json'))['objective'])"
-  echo ""
-  echo "未完成 features："
-  python3 -c "
-import json
-data = json.load(open('.wow-harness/progress/current.json'))
-for f in data['features']:
-    if f['status'] != 'passing':
-        print(f'  - [{f[\"status\"]}] {f[\"id\"]}: {f[\"subject\"]}')
-"
-fi
-exit 0
-```
-**为什么有效**：
-1. KV cache 友好——只追加到末尾，不重排前文
-2. 目标摘要 < 200 tokens，比保留全部历史更高效
-3. 锚定原始 objective，防止"中途忘记初心"
-[来源: Manus/IMPACT — KV cache 命中率是最重要性能指标]
-### D10: Trace Analyzer — failure-analyzer.py 升级（v4 新增）
-**假设被测试**：v3 D2.7 的 failure-analyzer.py 只记录失败，不分析模式。
-[来源: LangChain Trace Analyzer Skill — 自动 traces → 错误分析 → 提议 harness 变更 → 验证]
-**方案**：在 v3 D2.7 基础上增加分析层。`failure-analyzer.py` 继续记录 JSONL，新增 `trace-analyzer.py`（事件触发，不是常驻）。
-#### D10.1: 触发条件（事件驱动）
-```
-触发条件:
-  - 累积 N 条新 failure 记录后自动运行
-  - 或 SessionEnd hook 中检测到本 session 失败 ≥ 阈值
-  - 或人工执行 `python3 scripts/hooks/trace-analyzer.py --analyze`
-```
-[SC-3 合规：事件驱动，N 由 D1 metrics 实测决定]
-#### D10.2: 分析输出
-```python
-# scripts/hooks/trace-analyzer.py 输出格式
-{
-  "analyzed_at": "2026-04-07T...",
-  "window_size": 50,  # 分析最近 50 条 failures
-  "patterns": [
-    {
-      "pattern_id": "P1",
-      "description": "Bash command 因 'python' (vs python3) 失败 12 次",
-      "frequency": 12,
-      "proposed_harness_change": {
-        "type": "hook",
-        "target": ".claude/settings.json",
-        "change": "确认 D2.2 auto-python3.py hook 已启用并正确路由"
-      },
-      "confidence": 0.95
-    }
-  ],
-  "human_review_required": true
-}
-```
-#### D10.3: 与 D6.1 Load-Bearing 审计的协同
-D6.1 关注组件**移除**条件（findings = 0 → 候选移除）。
-D10 关注组件**新增**条件（reproducible failure pattern → 候选新 guard）。
-二者构成完整的 harness 自我进化闭环：增删都有数据支撑。
-[来源: LangChain harness 自我改进反馈循环 + Anthropic 元原则"组件假设需要压力测试"]
-### D11: 审查 Agent 工具隔离协议（v4 新增）
-**假设被测试**：审查类 subagent（pr-review-toolkit、Gate 审查、Evaluator）有完整 tool 权限是结构性安全洞。
-[来源: OpenDev arXiv 2603.05344 — Plan mode planner schema-level read-only exclusion]
-**方案**：在 TeamCreate / Agent spawn 协议层面强制审查类 agent 的工具白名单。
-#### D11.1: 审查 Agent 白名单
-```yaml
-# .claude/agents/review-base.yaml (新建)
-tools_allowed:
-  - Read
-  - Glob
-  - Grep
-  - WebFetch       # 查文档可以
-  - TodoWrite      # 自己的任务清单可以
-tools_denied:
-  - Edit           # 不允许改文件
-  - Write          # 不允许写文件
-  - NotebookEdit
-  - Bash           # 一律拒绝（含 git push、rm、scp）
-# 例外：bash 只读子集（候选）
-bash_allowed_patterns:
-  - "git status"
-  - "git diff*"
-  - "git log*"
-  - "ls *"
-  - "cat *"  # 应该用 Read 而非 cat
-```
-#### D11.2: 应用范围
-| Agent | 当前是否隔离 | v4 后 |
-|-------|------------|------|
-| pr-review-toolkit:code-reviewer | ❌ | ✅ schema-level read-only |
-| pr-review-toolkit:silent-failure-hunter | ❌ | ✅ schema-level read-only |
-| pr-review-toolkit:type-design-analyzer | ❌ | ✅ schema-level read-only |
-| Stop hook Evaluator | 未明确 | ✅ schema-level read-only |
-| feature-dev:code-explorer | ❌ | ✅ schema-level read-only |
-| feature-dev:code-reviewer | ❌ | ✅ schema-level read-only |
-#### D11.3: 实施路径
-1. 创建 `.claude/agents/review-base.yaml` 共享白名单定义
-2. 检查现有 review-type subagent definition 是否支持 tool restriction
-3. 在 ADR-030 治理规则中增加"审查 agent 必须 read-only"约束
-4. 在 .claude/rules/ 中加入 review-agent-isolation.md 检查
-[来源: OpenDev — schema-level dual-agent isolation; Microsoft Agent Governance Toolkit OWASP Agentic Top 10]
----
-## §5 架构蓝图（替代分阶段路线图）
-> 不分阶段递进。设计完整架构，一次性部署。
-> [来源: Anthropic 的架构一开始就是三个 Agent，不是渐进引入。用户明确要求"一开始就做好所有最好的东西"]
-### 5.1 完整事件流
-```
-User Prompt
-  │
-  ├─ [UserPromptSubmit hook]
-  │   自动注入：当前 branch、活跃 issue、PLAN 状态
-  │
-  ▼
-Agent 开始工作
-  │
-  ├─ [PreToolUse hook]
-  │   ├─ if Bash(python *) → updatedInput: python3
-  │   ├─ if Bash(*<NETWORK_REDACTED>*) → block (非 deploy 脚本)
-  │   └─ PermissionRequest → 动态权限（worktree 中自动允许 git）
-  │
-  ├─ [PostToolUse hook]
-  │   └─ if Write(*)|Edit(*) → guard-feedback.py (metrics + guard checks)
-  │
-  ├─ [PostToolUseFailure hook]
-  │   └─ failure-analyzer.py (记录失败模式)
-  │
-  ├─ [LoopDetection]
-  │   └─ 同一文件编辑 > N 次 → 注入"换方法"提醒
-  │
-  ├─ [确定性 Checkpoint] (事件触发：代码变更后)
-  │   ├─ pytest -q backend/tests/unit ($0)
-  │   └─ lint / type check ($0)
-  │
-  ▼
-Agent 完成工作
-  │
-  ├─ [Stop hook — PreCompletionChecklist]
-  │   ├─ □ 所有新文件已 git add
-  │   ├─ □ 测试通过
-  │   ├─ □ 文档与代码一致
-  │   └─ □ 无 dead code / 半成品
-  │
-  ├─ [按需: Evaluator Agent] (scope ≥ 3 files 或 API 契约变更)
-  │   ├─ Grading Criteria 4 维打分
-  │   ├─ 硬阈值：任何维 < 3 → FAIL → Generator Refine/Pivot
-  │   └─ 保留所有中间产物（非线性改善）
-  │
-  ▼
-Session 结束
-  │
-  ├─ [SessionEnd hook — Reflection]
-  │   ├─ 记录 guard 有效性 / fragment 使用
-  │   ├─ 识别新失败模式
-  │   └─ 提议新 guard（需人确认）
-  │
-  ├─ [PreCompact hook] (如果 compact 发生)
-  │   └─ 保护 PLAN/issue/ADR-030 关键信息
-  │
-  ▼
-持续进化（事件驱动）
-  │
-  ├─ 累积 N WP 数据 → Load-Bearing 审计 (D6.1 移除 + D10 新增)
-  ├─ Trace Analyzer 触发 → 失败模式分析 → 提议 harness 变更 (D10) [v4]
-  ├─ Skill Creator eval 循环 → Skill 质量提升
-  ├─ CLAUDE.md 审计 → 6 维评分
-  └─ Prompt Tuning 积累 → "Eventually the prompt is the product"
-```
-### 5.1.1 v4 增量事件流
-```
-WP 启动
-  │
-  ├─ [Initializer Agent / dev skill] (D8) [v4]
-  │   ├─ 生成 .wow-harness/progress/<wp_id>.json
-  │   ├─ 严格 schema：features[].status ∈ {failing, passing, blocked}
-  │   ├─ objective 字段写入后 read-only
-  │   └─ max_features_per_session 约束防止上下文耗尽
-  │
-  ▼
-Agent 工作中
-  │
-  ├─ 每完成一个 feature → 更新 progress.json (status → passing)
-  │
-  ▼
-Compaction 触发
-  │
-  ├─ [PreCompact hook] (D2.3 + D9 v4)
-  │   ├─ 原有：保留 PLAN/Issue/ADR-030
-  │   └─ v4 新增：Objective Recitation
-  │       ├─ 从 progress.json 读 objective
-  │       └─ 列出未完成 features (status != passing)
-  │
-  ▼
-Agent 完成工作
-  │
-  ├─ [Stop hook]
-  │   ├─ Step 0 (机械化, D8.2 v4): cat progress.json
-  │   │   ├─ 所有 features.status == "passing"? → 进入 PreCompletionChecklist
-  │   │   └─ 任何 failing/blocked → FAIL (零 LLM 评估成本)
-  │   │
-  │   ├─ Step 1 (LLM): PreCompletionChecklist (D2.1)
-  │   │
-  │   └─ Step 2 (按需 LLM): Evaluator (D4)
-  │       └─ ⚠️ Schema-level read-only (D11 v4)
-  │           tools_allowed: [Read, Glob, Grep]
-  │           tools_denied: [Edit, Write, Bash]
-  │
-  ▼
-失败累积 (事件驱动)
-  │
-  └─ [Trace Analyzer] (D10 v4)
-      ├─ N 条 failures 累积后触发
-      ├─ 模式识别 → 提议新 guard
-      └─ 人工审查后并入 D6.1 Load-Bearing 审计
-```
-### 5.2 文件结构变更
-```
-.claude/
-├─ settings.json          # attribution, worktree config, statusLine
-├─ rules/
-│   ├─ bridge.md          # Bridge 宪法 (paths: bridge_*/**)
-│   ├─ coaching.md        # example-coach规则 (paths: **/coaching/**)
-│   ├─ hackathon.md       # Hackathon 规则 (paths: **/hackathon/**)
-│   └─ scenes.md          # 通用场景规则 (paths: scenes/**)
-├─ hooks.json             # 完整 hook 配置
-└─ skills/                # (现有)
-scripts/hooks/
-├─ guard-feedback.py      # (改造) 加 JSONL metrics + if 过滤
-├─ auto-python3.py        # updatedInput: python→python3
-├─ precompact.sh          # PreCompact: 保护关键上下文 + Objective Recitation [v4 D9]
-├─ session-reflection.py  # SessionEnd: 自动 reflection
-├─ failure-analyzer.py    # PostToolUseFailure: 记录失败
-├─ trace-analyzer.py      # 失败模式分析 → 提议 harness 变更 [v4 D10]
-├─ dynamic-permissions.py # PermissionRequest: 动态权限
-├─ loop-detection.py      # LoopDetection middleware
-├─ stop-evaluator.md      # Stop hook agent prompt + progress.json 机械化第一关 [v4 D8]
-├─ initializer-agent.py   # WP 启动时生成 progress.json [v4 D8]
-└─ user-context.py        # UserPromptSubmit: 注入上下文
-.wow-harness/progress/             # [v4 D8] 严格 schema 进度文件
-├─ <wp_id>.json              # 单个 WP 的进度
-└─ current.json              # 当前 active WP 的符号链接
-.claude/agents/              # [v4 D11] 审查 agent 隔离配置
-├─ review-base.yaml          # 共享白名单 (Read/Glob/Grep only)
-└─ review-agent-isolation.md # 应用范围说明
-docs/reference/
-├─ commands.md            # @include: 命令参考
-├─ api-routes.md          # @include + MAGIC DOC: 路由表
-├─ env-vars.md            # @include: 环境变量
-└─ repo-structure.md      # @include: 仓库结构
-```
-### 5.3 .claude/hooks.json 完整配置
-```json
-{
-  "hooks": {
-    "UserPromptSubmit": [{
-      "type": "command",
-      "command": "python3 scripts/hooks/user-context.py"
-    }],
-    "PreToolUse": [
-      {
-        "type": "command",
-        "command": "python3 scripts/hooks/auto-python3.py",
-        "if": "Bash(python *)"
-      },
-      {
-        "type": "command",
-        "command": "python3 scripts/hooks/guard-feedback.py --pre",
-        "if": "Bash(*scp*) | Bash(*rsync*) | Bash(*<NETWORK_REDACTED>*)"
-      }
-    ],
-    "PostToolUse": [{
-      "type": "command",
-      "command": "python3 scripts/hooks/guard-feedback.py",
-      "if": "Write(*) | Edit(*)"
-    }],
-    "PostToolUseFailure": [{
-      "type": "command",
-      "command": "python3 scripts/hooks/failure-analyzer.py"
-    }],
-    "Stop": [{
-      "type": "agent",
-      "prompt": "@./scripts/hooks/stop-evaluator.md",
-      "if": "Agent(*)"
-    }],
-    "PreCompact": [{
-      "type": "command",
-      "command": "bash scripts/hooks/precompact.sh"
-    }],
-    "SessionEnd": [{
-      "type": "command",
-      "command": "python3 scripts/hooks/session-reflection.py"
-    }],
-    "PermissionRequest": [{
-      "type": "command",
-      "command": "python3 scripts/hooks/dynamic-permissions.py"
-    }]
-  }
-}
-```
----
-## §6 直接复用清单
-| # | 来源 | 复用什么 | 如何复用 | 成熟度 |
-|---|------|---------|---------|--------|
-| 1 | CC hook JSON protocol | updatedInput + additionalContext | guard-feedback.py | 生产就绪 |
-| 2 | CC 28 hook events | Stop/PreCompact/UserPromptSubmit/SessionEnd/PostToolUseFailure/PermissionRequest/FileChanged | hooks.json | 生产就绪 |
-| 3 | CC agent hook type | agentic verifier for Stop | hooks.json | 生产就绪 |
-| 4 | CC asyncRewake | 后台监控，发现问题才中断 | 候选 | 生产就绪 |
-| 5 | CC @include | CLAUDE.md 引用外部文件 | CLAUDE.md 拆分 | 生产就绪 |
-| 6 | CC .claude/rules/ | 目录级规则 + paths | bridge/coaching/hackathon | 生产就绪 |
-| 7 | CC attribution | Co-Authored-By | settings.json | 生产就绪 |
-| 8 | CC Verification Agent | 对抗性验证 prompt + 5 种 rationalization 模式 | Stop hook 参考 | 生产就绪 |
-| 9 | CC Magic Docs | 自动更新活文档 + _instructions_ 语法 | api-routes.md | 生产就绪 |
-| 10 | CC Session Memory | 后台提取关键信息 | 内置 | 生产就绪 |
-| 11 | CC Skill Improvement | 每 5 turn 分析改进建议 | 内置 | 生产就绪 |
-| 12 | CC Auto Dream | 4 阶段记忆巩固 | 内置 | 生产就绪 |
-| 13 | CC CLAUDE_ENV_FILE | Hook 写 env → BashTool 应用 | hooks | 生产就绪 |
-| 14 | skill-creator | Skill eval+benchmark+迭代 | 插件安装 | 生产就绪 |
-| 15 | claude-md-management | 6 维质量审计 | 插件安装 | 生产就绪 |
-| 16 | ralph-loop | Stop hook 自动迭代 | 插件安装 | 生产就绪 |
-| 17 | pr-review-toolkit | 6 独立 reviewer + 置信度过滤 | 插件安装 | 生产就绪 |
-| 18 | feature-dev | 7 阶段 + Clarifying Questions hard gate | 插件安装 | 生产就绪 |
-| 19 | frontend-design | 美学引导 prompt | 插件安装 | 生产就绪 |
-| 20 | LangChain PreCompletionChecklist | Agent 退出前验证清单 | Stop hook | 有 benchmark |
-| 21 | LangChain LoopDetection | per-file edit count + 注入提醒 | guard-feedback.py | 有 benchmark |
-| 22 | LangChain LocalContext | 启动时扫描环境 | UserPromptSubmit | 有 benchmark |
-| 23 | Stripe Blueprint checkpoint | 确定性验证节点 | WP 流加 pytest/lint | 生产就绪 |
-| 24 | Trail of Bits log-gam.sh | CLI 分类 read/write + JSONL | D1 metrics | 安全团队验证 |
-| 25 | Martin Fowler 四象限 | 组件分类框架 | 审计工具 | 行业共识 |
-| 26 | ACE reflection-curation | Memory 自动进化 | D6 自我进化 | ICLR 2026 |
-| 27 | Ralph Wiggum tune-like-guitar | 持续 prompt tuning | D6.5 | 实践验证 |
-| 28 | Simon Willison Red-Green | 测试驱动 agent 迭代 | D4.6 | 实践验证 |
-| **v4 增量** | | | | |
-| 29 | Anthropic 第二篇 — Initializer Agent | 严格 JSON progress + 单特性约束 | D8 progress.json schema | 生产就绪 |
-| 30 | Anthropic — Effective Context Engineering | Context Rot 防御 / tool result clearing | §3.10 / D3.1 增强目标 | 生产就绪 |
-| 31 | Manus/IMPACT — Objective Recitation | 上下文末尾重复目标 | D9 precompact.sh 增强 | 实践验证 |
-| 32 | LangChain Trace Analyzer Skill | 自动错误模式分析 | D10 trace-analyzer.py | 有 benchmark |
-| 33 | OpenDev — Plan Mode tool isolation | Schema-level read-only exclusion | D11 review-base.yaml | arXiv 论文 |
-| 34 | arXiv 2602.16666 — Reliability 4 维 | consistency/robustness/predictability/safety | §3.13 / D1 metrics 维度 | arXiv 论文 |
----
-## §7 风险
-| 风险 | 影响 | 缓解 |
-|------|------|------|
-| Hook 配置变更破坏现有 guard | guard 失效 | 逐条加 if 过滤，每条后验证 |
-| @include 拆分遗漏关键规则 | 治理退化 | 拆分前后实际任务测试 + metrics 对比 |
-| Stop hook 过度拦截 | Agent 陷入验证循环 | if 条件限制只在 Agent 完成时触发 |
-| Evaluator prompt 调优耗时 | 延期 | 先用 CC Verification Agent prompt 作基线 |
-| 一次性部署太多变更 | 调试困难 | 每个 hook 独立可禁用（注释掉 hooks.json 条目） |
-| 插件与现有 skill 冲突 | 行为异常 | 逐个安装+测试 |
-| **v4 风险** | | |
-| D8 progress.json 增加 WP 启动开销 | 摩擦增加 | 只对 ≥3 file 或 API 契约 WP 强制；小 fix 可选 |
-| D8 single-feature-per-session 过严 | 阻碍合理批量改动 | max_features 由 WP 类型决定，不一刀切 |
-| D9 Objective Recitation 重复信息 | token 浪费 | 限制 < 200 tokens，只在 compact 后注入 |
-| D10 Trace Analyzer 误报 | 噪音建议 | 人工审查必经，confidence < 0.8 不进入 D6.1 |
-| D11 工具隔离破坏现有审查者行为 | 审查 agent 失效 | 先在 .claude/agents/ 试点 1 个，逐步推广 |
-| D11 schema-level 限制 CC 不一定支持 | 设计无法落地 | 需先验证 CC 当前 agent definition tool restriction 能力 |
----
-## §8 不做什么
-1. **不重写 guard-feedback.py 架构** — 问题在未充分利用 CC 能力，不是架构错误
-2. **不引入外部框架（LangGraph/CrewAI）** — CC 本身就是框架，官方插件已覆盖
-3. **不做实时 dashboard** — JSONL append-only 随时可查
-4. **不尝试 100% 机械化** — 有些规则天然是 convention 层
-5. **不设自拟数值目标** — 所有阈值来自外部基准或自研数据
-6. **不从 V1 架构开始演化** — 直接目标 V2
-7. **不用日历驱动评估** — 全部事件驱动
----
-## §9 通用性
-本 ADR 的方法论适用于任何使用 Claude Code 的项目：
-**可迁移**：D1 测量 / D2 hook 配置模式 / D3 CLAUDE.md 拆分 / D4 Evaluator 循环 / D5 Magic Docs / D6 自我进化 / D7 官方插件
-**不可迁移**：具体 guard 规则 / 具体 fragment 内容 / 具体 Grading Criteria 权重
----
-## §10 参考文献
-### Anthropic
-1. "Harness Design for Long-Running Application Development." 2026-03-24. https://www.anthropic.com/engineering/harness-design-long-running-apps
-2. "Effective Context Engineering for AI Agents." 2025-11. https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
-3. "Equipping Agents for the Real World with Agent Skills." 2025-12-18. https://www.anthropic.com/research/agent-skills
-4. "2026 Agentic Coding Trends Report." 2026-03. https://www.anthropic.com/research/2026-agentic-coding-report
-5. `anthropics/claude-plugins-official` GitHub repo. https://github.com/anthropics/claude-plugins-official
-5b. **[v4]** "Effective Harnesses for Long-Running Agents." 2026. https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents (Initializer Agent + 严格 JSON progress 模式)
-### OpenAI
-6. "Harness Engineering." 2026-01. https://openai.com/index/harness-engineering/
-7. "Unlocking the Codex Harness." 2026-02. https://openai.com/index/unlocking-the-codex-harness/
-8. "Unrolling the Codex Agent Loop." 2026-02. https://openai.com/index/unrolling-the-codex-agent-loop/
-### Stripe
-9. "Minions: One-Shot End-to-End Coding Agents" Part 1 + 2. 2026-01. https://stripe.dev/blog/minions-stripes-one-shot-end-to-end-coding-agents
-### LangChain
-10. "Improving Deep Agents with Harness Engineering." 2026-03. https://blog.langchain.com/improving-deep-agents-with-harness-engineering/
-11. "The Anatomy of an Agent Harness." 2026-02. https://blog.langchain.com/the-anatomy-of-an-agent-harness/
-12. "How Middleware Lets You Customize Your Agent Harness." 2026-03. https://blog.langchain.com/how-middleware-lets-you-customize-your-agent-harness/
-### Trail of Bits
-13. "How We Made Trail of Bits AI-Native (So Far)." 2026-03-31. https://blog.trailofbits.com/
-14. `trailofbits/skills` GitHub repo. https://github.com/trailofbits/skills
-### Microsoft
-15. "Agent Governance Toolkit." 2026-04-02. https://github.com/microsoft/agent-governance-toolkit
-### Martin Fowler
-16. "Harness Engineering for Coding Agent Users." 2026-04-02. https://martinfowler.com/articles/exploring-gen-ai/harness-engineering.html
-17. "Humans and Agents in Software Engineering Loops." 2026. https://martinfowler.com/articles/exploring-gen-ai/humans-and-agents.html
-### 实践者
-18. Simon Willison. "Agentic Engineering Patterns." 2026-02. https://simonwillison.net/tags/agents/ (博客合集，无单篇稳定 URL)
-19. Addy Osmani. "My LLM Coding Workflow" + "Orchestrating Coding Agents." 2026. https://addyosmani.com/blog/ (博客合集，无单篇稳定 URL)
-20. Ralph Wiggum method + 7 principles. 2026. https://github.com/anthropics/claude-plugins-official/tree/main/plugins/ralph-loop (CC 官方插件实现)
-21. Google Conductor. "Three-stage brownfield support." 2026. (Google 内部工具，无公开 URL)
-### 学术论文
-22. "Agent Behavioral Contracts (ABC)." arXiv:2602.22302. https://arxiv.org/abs/2602.22302
-23. "AEGIS: Pre-Execution Firewall." arXiv:2603.12621. https://arxiv.org/abs/2603.12621
-24. "Agentic Context Engineering (ACE)." ICLR 2026. https://arxiv.org/abs/2510.04618 / https://github.com/ace-agent/ace
-25. "OpenDev: Five-Layer Security." arXiv:2603.05344. https://arxiv.org/abs/2603.05344 (v4 D11 schema-level tool isolation)
-26. "Natural-Language Agent Harnesses (NLAH)." arXiv:2603.25723. https://arxiv.org/abs/2603.25723
-27. "Darwin Gödel Machine." arXiv:2505.22954. https://arxiv.org/abs/2505.22954
-28. OpenAI Cookbook. "Self-Evolving Agents (GEPA)." 2026. https://cookbook.openai.com/ (cookbook 合集)
-29. ICLR 2026 Workshop on AI with Recursive Self-Improvement. 2026-04-26/27, Rio de Janeiro. https://iclr.cc/virtual/2026/workshop/ (会议官网)
-### v4 新增来源
-32. **[v4]** Manus / IMPACT Agent Engineering. "KV Cache Optimization + Objective Recitation." https://www.morphllm.com/agent-engineering (D9 Objective Recitation)
-33. **[v4]** "Towards Reliable Agentic Systems." arXiv:2602.16666. https://arxiv.org/html/2602.16666v1 (§3.13 reliability 4 维 + prompt brittleness paradox)
-34. **[v4]** Cursor Engineering. "Reasoning Trace Loss → 30% performance degradation." 2026. https://cursor.sh/blog/ (D9 reasoning trace 保留)
-35. **[v4]** LangChain. "Trace Analyzer Skill — Automated Harness Self-Improvement Loop." 2026-03. https://blog.langchain.com/improving-deep-agents-with-harness-engineering/ (D10)
-36. **[v4]** Anthropic. "Effective Harnesses for Long-Running Agents." 2026. https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents (D8 Initializer Agent / strict JSON progress / single-feature-per-session)
-### 内部文档
-30. ADR-030: Guard Signal Protocol and Governance Reload. `docs/decisions/ADR-030-guard-signal-protocol-and-governance-reload.md`
-31. 6 轮 Harness 自研数据: `docs/sharing/AI-HARNESS-52-DAYS.md` (综合报告), 原始分析见 memory `project_ai_harness_research.md`
----
-## §11 Meta-Compliance 审计
-### v2 自审结果（4 FAIL / 1 PARTIAL / 2 PASS）
-| 原则 | v2 结果 | v3 修正 |
-|------|--------|--------|
-| Sprint Contract | FAIL — 无验收标准就开始写 | ✅ FIXED — §0 定义 9 条 SC |
-| Generator ≠ Evaluator | FAIL — 自写自评 | ✅ FIXED — v3 将由独立 Evaluator agent 审查 |
-| 确定性 Checkpoint | FAIL — 中间无验证 | ✅ FIXED — SC 每条可机械化验证 |
-| 目标来自外部证据 | PARTIAL — 部分自拟 | ✅ FIXED — 所有 D* 标注 [来源] |
-| 每个组件有可测试假设 | PASS | PASS |
-| 复用优于自建 | PASS | PASS |
-| AI 时间非人类日历 | FAIL — 月度/每日 | ✅ FIXED — 全部事件驱动 |
-### Sprint Contract 验收状态
-| # | 标准 | 状态 | Evaluator 验证 |
-|---|------|------|---------------|
-| SC-1 | 每个决策标注外部证据来源 | ✅ | PASS — 34 处 `[来源]` 标注，D5/D7 已补充 |
-| SC-2 | 无自拟数值目标 | ✅ | PASS |
-| SC-3 | 所有评估事件驱动 | ✅ | PASS — D6.3/D6.4 已补充事件触发条件 |
-| SC-4 | V2 架构 | ✅ | PASS |
-| SC-5 | CC 能力完整 | ✅ | PASS |
-| SC-6 | 参考文献含 URL | ✅ | PASS — 31 条全部有 URL 或标注无公开 URL 原因 |
-| SC-7 | Meta-Compliance 章节 | ✅ | PASS |
-| SC-8 | Evaluator 按需 | ✅ | PASS |
-| SC-9 | v2 自审全部闭环 | ✅ | PASS |
-### Evaluator 审查记录
-**首次审查**：FAIL（3 P0 / 7 P1 / 3 P2）
-- P0-1: 7 条参考文献缺 URL → **FIXED**: 全部补齐
-- P0-2: D6.4 "定期审计"为日历驱动残留 → **FIXED**: 改为事件触发条件
-- P0-3: §11 自审 SC-3/SC-6 结论不诚实 → **FIXED**: 修正后重新验证
-- P1-1~P1-5: 数据点不准确 → **FIXED**: guard=16, routes=21 patterns→17 fragments, skills=26, Opus 版本澄清, 研究文件路径修正
-- P1-6: hooks 配置格式差异 → **ACKNOWLEDGED**: 实施时需验证 CC 版本兼容性
-- P1-7: D5/D7 缺来源标注 → **FIXED**: 已补充
-**Grading Criteria 评分**：Design Quality 4 / Originality 4 / Craft 3→4 / Functionality 4。全部 ≥ 3，通过硬阈值。
-### v4 增量审计
-v4 是 v3 的**增量补丁**，不是重写。v3 的所有 SC + Grading 评分继续有效。v4 自审范围限于新增内容（§2.4 / §3.10-§3.13 / D8-D11 / 文献新增）。
-**v4 自审清单**：
-| # | v4 自审标准 | 状态 |
-|---|------------|------|
-| v4-SC1 | 每个新决策（D8-D11）标注外部证据来源 | ✅ PASS — 全部有 `[来源]` |
-| v4-SC2 | 无自拟数值目标（max_features_per_session 等由实测决定） | ✅ PASS |
-| v4-SC3 | 所有新评估事件驱动（D10 触发条件明确） | ✅ PASS |
-| v4-SC4 | 不与 v3 决策冲突（D8-D11 是 D2/D3/D4/D6 的增量） | ✅ PASS — D8 增量于 D4.1 / D9 增量于 D2.3 / D10 增量于 D2.7 / D11 增量于 D2 总体 |
-| v4-SC5 | 6 个新源全部含 URL | ✅ PASS — 文献 32-36 + 5b 全部有 URL |
-| v4-SC6 | 风险章节包含 v4 新风险 | ✅ PASS — 6 项 v4 风险列出 |
-| v4-SC7 | 复用清单包含 v4 新增（29-34） | ✅ PASS |
-| v4-SC8 | v4 changelog 透明列出所有变更 | ✅ PASS — 文档头部 |
-| v4-SC9 | v4 决策不破坏 v3 部署进度 | ✅ PASS — 增量不替换 |
-**v4 Grading Criteria**（仅评估新增内容）：
-| 维度 | 评分 | 理由 |
-|------|------|------|
-| **Design Quality** | 4 | D8 progress.json schema 设计严谨，机械化第一关消除 self-eval bias |
-| **Originality** | 4 | 6 个新洞察从 18 个资源 synthesis，不是简单复制单一来源 |
-| **Craft** | 4 | 所有新决策有具体 schema/script 示例，URL 完整 |
-| **Functionality** | 4 | D8/D9/D10/D11 全部可在现有 hooks/agents 系统增量实施 |
-全部 ≥ 3，通过硬阈值。v4 待独立 Evaluator agent 复审。
----
-## §12 v5 Patch — Meta-Review Closure（2026-04-07）
-> **v5 是 v4 的闭环补丁，不是新版本。** v4 实施后用户提出 meta-review 标准："是否真的完成最佳实践，而不是只是遵从或者模仿之类的"。重读 Anthropic 两篇文章 + ADR + 已实施代码三方对比后，发现 4 个 deferred items + 4 个 conceptual gap，本节文档化全部闭环工作。
-**Source 归因校对**（v5 必须修正）：
-v4 把多个 source 的概念归到了"Anthropic 第二篇"，实际上：
-- Anthropic 第一篇 = Initializer Agent / Planner-Generator-Evaluator / Grading Criteria / Sprint Contract
-- Anthropic 第二篇 = claude-progress.txt / feature_list.json strict schema / init.sh / session 启动协议
-- Manus/IMPACT = Objective Recitation（after ~50 tool calls）
-- LangChain = Trace Analyzer（自动化 fail→analyze→propose→verify 闭环）
-- OpenDev arXiv 2603.05344 = Plan mode schema-level Tool Isolation
-[来源校对依据：重读两篇文章原文 + memory/project_adr038_meta_review_findings.md]
-### v5 Changelog（相对 v4 的闭环）
-> v4 deferred 的 4 项 + v4 自审遗漏的 4 个 conceptual gap，v5 全部用 hooks-level 实现而不是文档约束。每一项都有"形式 → 本意"的对照说明。
-| 决策 | v4 状态 | v5 闭环 | 实现位置 |
-|------|--------|---------|---------|
-| **D2.4** if 条件过滤 | ❌ deferred ("ADR 给了 spec 没做") | ✅ 实施 | `.claude/settings.json` PreToolUse Bash + `scripts/hooks/auto-python3.py` |
-| **D3.1** CLAUDE.md @include 拆分 | ❌ deferred | ✅ 11 个 rule 文件 | `.claude/rules/*.md`（subagent 独立完成） |
-| **D5.1** Magic Docs 模式 | ❌ deferred | ✅ 实施 | `docs/magic/api-routes.md` + `scripts/checks/regenerate_magic_docs.py` |
-| **D5.2** SessionStart 漂移检测 | ❌ deferred | ✅ 实施 | `scripts/hooks/session-start-magic-docs.py` |
-| **D8** features schema 三字段 | ⚠️ "形式 70%, 本意 30%" — 缺 `steps[]`/`verification_command`/`evidence` | ✅ schema 已扩展 | `scripts/hooks/initializer-agent.py` + `scripts/hooks/stop-evaluator.py` |
-| **D9** mid-task recitation | ⚠️ "形式 30%, 本意 30%" — 只 cover compaction 边界 | ✅ 50-call counter | `scripts/hooks/tool-call-counter.py`（PreToolUse `*`） |
-| **D10** trace-analyzer auto-trigger | ⚠️ "形式 50%" — 是 manual | ✅ SessionEnd 自动跑 | `.claude/settings.json` SessionEnd hook |
-| **D11.2** plugin review agent 物理隔离 | ⚠️ "形式 50%, 本意 50%" — 退回 prompt | ✅ spawn-boundary gate | `scripts/hooks/review-agent-gatekeeper.py`（PreToolUse `Task`） |
-### §12.1 — D2.4 if 条件过滤（实施）
-**v4 spec**（§4 D2.4）已经写好：用 `"if": "Bash(python *)"` 给 PreToolUse Bash hook 加触发条件。
-**v5 实施**：
-```json
-{
-  "matcher": "Bash",
-  "hooks": [
-    {
-      "type": "command",
-      "if": "Bash(python *)",
-      "command": "cd \"$(git rev-parse --show-toplevel)\" && python3 scripts/hooks/auto-python3.py",
-      "timeout": 5
-    }
-  ]
-}
-```
-**意义**：把"裸 python 命令必须重写为 python3"从规则提示（CLAUDE.md §4.1）变成机械化拦截。命中率从 ~10-20% 升到 100%。
----
-### §12.2 — D3.1 CLAUDE.md @include 拆分（subagent 独立完成）
-**v4 spec**（§4 D3.1）：把单 CLAUDE.md 拆为 `.claude/rules/*.md`，按路径条件加载（CC 的 directory-level instructions 机制）。
-**v5 实施**：派遣 background subagent 完成拆分。最终结构：
-| Rule 文件 | 大小 | 加载触发 |
-|-----------|------|---------|
-| `backend-routes.md` | 16KB | 改 `backend/product/routes/*.py` 时 |
-| `repo-structure.md` | 11KB | session 启动时 |
-| `review-agent-isolation.md` | 6KB | 改 `.claude/agents/*.md` 时 |
-| `env-vars.md` | 2.5KB | 改 `backend/product/config.py` 时（paths frontmatter） |
-| `bridge.md` | 1.1KB | 改 `bridge_agent/*.py` 时 |
-| `closure-semantics.md` | 1KB | 改 `docs/issues/*.md` 时 |
-| `coaching.md` / `deployment.md` / `hackathon.md` / `scenes.md` / `source-of-truth.md` | < 2KB | 各自路径触发 |
-**结果**：CLAUDE.md 从 600+ 行降到 144 行（slim 76%）。每个 session 不再"一次加载 11.9k tokens 全部规则"，而是按需加载相关条目。
-**KV-cache 友好性**（来源: Anthropic Effective Context Engineering）：常见上下文片段稳定不变 → cache hit 率提升。
----
-### §12.3 — D5 Magic Docs + FileChanged 同步（实施）
-**v4 spec**（§4 D5.1）：派生型文档（API 路由清单、目录结构等）必须有 `_instructions_` 头声明"我是机器派生的，怎么 regenerate 我"。
-**v5 实施 — 三件套**：
-**(1) Magic Doc 参考实现**（`docs/magic/api-routes.md`）：
-```markdown
-# MAGIC DOC: API Routes Index
-_instructions_: This file is machine-derived from `backend/product/routes/*.py`...
-_regenerator_: scripts/checks/regenerate_magic_docs.py api-routes
-_freshness_check_: scripts/checks/check_doc_freshness.py
-_last_sync_: 2026-04-07
-| Router file | Decorator count |
-|------------|-----------------|
-| protocol.py | 68 |
-| ... (按 router 文件聚合，total = 208) |
-```
-**(2) Regenerator with check mode**（`scripts/checks/regenerate_magic_docs.py`）：
-```python
-REGENERATORS = {
-    "api-routes": regenerate_api_routes,
-}
-def regenerate_api_routes(*, check_only: bool = False) -> int:
-    # 扫描所有 router 文件，统计 @router.* 装饰器
-    # check_only: 比较时忽略 _last_sync_ 行 → 0=同步, 1=漂移
-    ...
-```
-**(3) SessionStart 漂移告警**（`scripts/hooks/session-start-magic-docs.py`）：
-会话启动时跑 `regenerate_magic_docs.py all --check`，发现漂移则向 main agent 注入告警片段（CC 自动作为 session 上下文）。永远 exit 0（advisory）。
-**(4) 集成到 freshness check**（`scripts/checks/check_doc_freshness.py` 增加 `_check_magic_docs()` 函数）：
-```python
-def _check_magic_docs(repo_root: Path) -> list[Finding]:
-    for name, fn in regenerate_magic_docs.REGENERATORS.items():
-        rc = fn(check_only=True)
-        if rc != 0:
-            findings.append(Finding(
-                severity="P1",
-                message=f"Magic doc docs/magic/{name}.md drifted from source. Run: ...",
-                ...
-            ))
-```
-**(5) Guard router 路由**（`scripts/guard_router.py`）：
-```python
-"docs/magic/":                      ["check_doc_freshness"],
-".claude/rules/backend-routes.md":  ["check_doc_freshness"],
-```
-**意义**：派生文档不再需要"靠 AI 记得手动同步"，而是 (a) regenerator 命令一键 rebuild + (b) freshness check 自动检测漂移 + (c) SessionStart 在 session 开始时主动告警。这是治理强度从 convention（10-20%）升到 guard（~100%）。
-**Pre-existing bug 顺手修了**：原 `check_doc_freshness.py` 的 route count check 仍然 grep CLAUDE.md，但 routes 已经拆到 `.claude/rules/backend-routes.md` 了。修正后改为读 rules 文件。
----
-### §12.4 — D8 features schema 补 `steps[]` / `verification_command` / `evidence`（已修补）
-**v4 实施 vs Anthropic 第二篇 feature_list.json schema 的差距**：
-| 字段 | Anthropic 第二篇 | v4 ADR 设计 | v4 实现 | v5 修补 |
-|------|----------------|------------|---------|---------|
-| `steps[]` | ✅ 核心 | ❌ | ❌ | ✅ |
-| `verification_command` | ❌ | ✅ | ❌ | ✅ |
-| `evidence` | ❌ | ✅ | ❌ | ✅ |
-| `objective` (顶层) | ❌（隐式） | ✅ | ✅ | 保留 |
-| SHA256 immutability | ❌（prompt 约束） | ❌（隐式） | ✅ | 保留（**超越文章**） |
-| `status` 三态 enum | ❌（boolean） | ✅ | ✅ | 保留（**超越文章**） |
-**v5 修补**（`scripts/hooks/initializer-agent.py` + `scripts/hooks/stop-evaluator.py`）：
-```python
-# initializer-agent.py — features schema
-{
-    "id": "F1",
-    "subject": "...",
-    "status": "failing",  # failing | passing | blocked
-    "steps": [
-        "Navigate to ...",
-        "Click the ...",
-        "Verify the ..."
-    ],
-    "verification_command": "backend/venv/bin/pytest tests/...",
-    "evidence": null  # 验证后填 artifact 引用
-}
-# stop-evaluator.py — mechanical first-pass 强化
-def check_features(features):
-    for f in features:
-        if f["status"] == "passing" and not f.get("evidence"):
-            return BLOCK(f"feature {f['id']} marked passing but evidence is null")
-```
-**机械化验证关键**：`stop-evaluator.py` 在 LLM eval 之前先做 mechanical check — `status="passing"` 必须 `evidence != null`。这是文章 §"Out of the box, Claude is a poor QA agent" 的真本意：不能让 model 自己说"我做完了"，必须有 artifact 引用证明。
----
-### §12.5 — D9 mid-task Objective Recitation（PreToolUse 计数器）
-**Manus/IMPACT 原话**：
-> "Recite objectives at end of context to maintain goal focus **after ~50 tool calls**"
-**v4 实施**：`scripts/hooks/precompact.sh` 在 PreCompact 边界做 recitation。
-**v4 gap**：PreCompact 触发频率是 "每 N 万 tokens 一次"，远低于 "50 tool calls 一次"。文章本意是 mid-task drift（用户敲了 50 次工具，已经走远），不是 compaction 边界。
-**v5 实施**（`scripts/hooks/tool-call-counter.py`）：
-```python
-COUNTER_FILE = REPO_ROOT / ".boll" / "metrics" / "tool-call-counter.txt"
-CURRENT_PROGRESS = REPO_ROOT / ".boll" / "progress" / "current.json"
-DEFAULT_RECITE_EVERY = 50
-RECITE_EVERY = int(os.environ.get("boll_RECITATION_EVERY", DEFAULT_RECITE_EVERY))
-def main() -> int:
-    count = _read_counter() + 1
-    _write_counter(count)  # atomic via tmp+rename
-    if count % RECITE_EVERY == 0:
-        fragment = _format_recitation(count)  # 读 current.json → 拼接 objective + features
-        if fragment:
-            print(fragment)  # CC 自动注入为下一个 tool call 的上下文
-    return 0
-```
-**Recitation 片段格式**（来自 D8 progress 文件）：
-```
-## Objective Recitation (D9, after 50 tool calls)
-**WP**: `WP-XXX`
-**原始目标**（D8 immutable）：...
-**未完成 features**：
-  - [failing] `F1` ... — 5 steps
-  - [blocked] `F2` ... — 3 steps
-**Passing 但缺 evidence**（D8 stop-check 会拒绝）：
-  - `F3` ...
-提醒：保持注意力在原始目标上。如果当前操作偏离此目标，请重新规划。
-[ADR-038 §4 D9 — Manus Objective Recitation pattern]
-```
-**Wired 在**（`.claude/settings.json`）：
-```json
-{
-  "matcher": "*",  // 所有 tool 都计数
-  "hooks": [{
-    "type": "command",
-    "command": "... python3 scripts/hooks/tool-call-counter.py 2>/dev/null || true",
-    "timeout": 3
-  }]
-}
-```
-**双覆盖保险**：v4 PreCompact recitation 仍然保留（cover compaction 边界），v5 PreToolUse 计数器 cover mid-task drift。两层频率不同的 recitation 互补。
----
-### §12.6 — D10 Trace Analyzer auto-trigger（SessionEnd hook）
-**LangChain 原话**：
-> "Trace Analyzer Skill：自动化并行 error 分析 agent，获取 traces → 分析错误 → 提议 harness 变更 → 验证"
-**v4 实施**：`scripts/hooks/trace-analyzer.py` 完成 fail→log→analyze→propose（输出 markdown 提议到 `.wow-harness/proposals/`）。
-**v4 gap**：用户必须手动跑 `trace-analyzer.py analyze`，没有 auto-trigger。
-**v5 实施**（`.claude/settings.json` SessionEnd）：
-```json
-"SessionEnd": [{
-  "matcher": "*",
-  "hooks": [
-    {
-      "type": "command",
-      "command": "... python3 scripts/hooks/session-reflection.py",
-      "timeout": 10
-    },
-    {
-      "type": "command",
-      "command": "... python3 scripts/hooks/trace-analyzer.py analyze --days 1 --min-samples 3 2>/dev/null || true",
-      "timeout": 30
-    }
-  ]
-}]
-```
-**关键设计原则 2**："不自动落地"——trace-analyzer 只产出 `.wow-harness/proposals/<timestamp>.md`，永不动 hooks/code。这让 auto-trigger 100% 安全（最坏情况是产出无人读的 markdown，不会破坏现状）。
-**完整闭环 vs 当前**：
-| 阶段 | 文章 | v5 |
-|------|------|-----|
-| 1. fail | ✅ | ✅ failure-analyzer.py 写 JSONL |
-| 2. analyze | ✅ | ✅ trace-analyzer.py 模式聚类 |
-| 3. propose | ✅ | ✅ 输出 markdown 提议 |
-| 4. shadow verify | ✅ | ❌（v6+） |
-| 5. measure | ✅ | ❌（v6+） |
-| 6. commit if better | ✅ | ❌（v6+） |
-**残余 gap**：4-6 步的 propose-then-verify 闭环留给 v6+。需要设计 shadow 模式（在 fork 仓库或 worktree 里 apply 提议、跑同样 trace、对比 metric），是显著工程量，不在 v5 scope。
----
-### §12.7 — D11.2 spawn-boundary gate（plugin review agent 物理隔离）
-**ADR §3.12 原话**：
-> "审查 Agent 必须在 schema 层面被剥夺写权限——不是靠 prompt 约束（"请不要修改代码"），而是工具列表本身就不包含 Edit/Write/Bash(write)。"
-**v4 矛盾**：本地 agent（`review-readonly.md`）走 frontmatter 物理隔离 ✓，但 7 个 plugin agent（`pr-review-toolkit:*` / `feature-dev:code-reviewer/explorer`）的 frontmatter 在 plugin 内部，无法修改。v4 退回到"调用层 prompt 约束"——这违反了 §3.12 自己说的"prompt 约束不算数"。
-**v5 实施 — spawn-boundary gate**（`scripts/hooks/review-agent-gatekeeper.py`）：
-```python
-# Plugin review agents 列表
-REVIEW_SUBAGENT_PATTERNS = (
-    "pr-review-toolkit:code-reviewer",
-    "pr-review-toolkit:silent-failure-hunter",
-    "pr-review-toolkit:type-design-analyzer",
-    "pr-review-toolkit:comment-analyzer",
-    "pr-review-toolkit:pr-test-analyzer",
-    "feature-dev:code-reviewer",
-    "feature-dev:code-explorer",
-)
-# Spawn 时 prompt 必须包含的 directive 之一
-REQUIRED_DIRECTIVES = (
-    "MUST NOT call Edit",
-    "MUST NOT use Edit",
-    "read-only reviewer",
-    "read-only mode",
-    "schema-level read-only",
-    "ADR-038 D11",
-)
-def main() -> int:
-    payload = json.load(sys.stdin)
-    if payload.get("tool_name") != "Task":
-        return 0
-    subagent_type = payload.get("tool_input", {}).get("subagent_type", "")
-    if not _is_review_subagent(subagent_type):
-        return 0
-    prompt = payload.get("tool_input", {}).get("prompt", "")
-    if _prompt_has_directive(prompt):
-        _record_active(subagent_type)  # 写 marker 到 .wow-harness/active-review-agents/
-        return 0
-    print("[ADR-038 D11] BLOCKED: ...", file=sys.stderr)
-    return 2  # 硬阻断 spawn
-```
-**Wired 在**（`.claude/settings.json`）：
-```json
-{
-  "matcher": "Task",
-  "hooks": [{
-    "type": "command",
-    "command": "... python3 scripts/hooks/review-agent-gatekeeper.py",
-    "timeout": 5
-  }]
-}
-```
-**为什么这是 schema-level 等价？**
-我们无法改插件 frontmatter 的 `tools:` 白名单，但我们 **可以保证插件被 spawn 时其指令上下文一定包含约束**。这把 review agent 的 effective 工具集从"frontmatter 声明的 + （希望 main agent 在 prompt 里加提醒）"变成"frontmatter 声明的 + （强制 prompt 约束作为前置硬条件）"。
-任意 review agent spawn 的 chokepoint 在 `Task` tool。PreToolUse Task hook 在 spawn 发生之前拦截：缺少 directive → exit 2 → CC 阻断 spawn → 用户看到 stderr → 必须重新构造 prompt。
-**残余漏洞**（坦诚）：
-- subagent 自身仍可能 ignore prompt 约束（adherence ~70%）
-- 真正 100% 等价需要 PreToolUse Edit/Write 时反查 transcript_path 找最近 subagent 标记 → 是 v6+ 工程量
-- 但相比 v4 "完全靠用户自觉在 prompt 里加提醒"，v5 把约束从"希望"变成了"前置硬条件"
-**意义**：这是 "ADR 自相矛盾"的 closure。`.claude/rules/review-agent-isolation.md` 同步更新为 D11.2 描述，所有 7 个 plugin agent 标记 `✅ ADR-038 D11.2 — review-agent-gatekeeper.py`。
----
-### §12.8 — Novel Contributions（v5 显式 attribution）
-**用户的标准**："是否真的完成最佳实践，而不是只是遵从或者模仿之类的"。
-下面这些 v4+v5 决策是我们**超越文章**的部分，应明确标注为 novel contribution（不是从文章 copy 来的）：
-| 设计 | 来源 | 文章是否有 | 我们的差异 |
-|------|------|----------|-----------|
-| **D8 SHA256 immutability** | 自研 | ❌ 文章用 prompt 约束 "unacceptable to remove" | 用 hash 强制；删除某行 → SHA 不匹配 → mechanical reject |
-| **D8 status enum 三态** | 自研 | ❌ 文章用 boolean `passes` | 三态（failing/passing/blocked）区分"未完成 vs 真做不到"，避免 model 用 false positive 绕过 |
-| **D8.2 mechanical first-pass** | 自研，受 §"poor QA agent" 启发 | ❌ 文章只论 self-eval bias，没给 mechanical pass pattern | stop-evaluator 在 LLM eval 之前先 mechanical check `status==passing → evidence != null` |
-| **D11.2 spawn-boundary gate** | 自研 | ❌ 文章只说 schema-level isolation 适用于本地 agent | 对 plugin agent，把 PreToolUse Task hook 作为 chokepoint，强制 prompt 包含 directive |
-| **D9 50-call counter atomic write** | 实现细节 | ❌ Manus 只说 "after 50" | tmp+rename 原子写，跨并发 session 安全 |
-| **D5 Magic Doc 自描述 `_instructions_`** | 自研 | ❌ 文章只说"派生文档"，没给具体格式 | 文件自带 regenerator 命令 + freshness check 命令 + last_sync 时间戳 |
-| **Magic Doc + path-based rules + KV-cache** | 综合 Anthropic Effective Context Engineering | 部分（文章只说"avoid context rot"） | 三层组合：rules 按路径加载 + magic docs 按需 regenerate + 不变上下文打头降低 cache miss |
-**意义**：这些不是简单 copy paste 文章里的招式，而是把"原则吃透 → 找到对应到 CC 系统的最佳实施点"。这是用户说的"完成最佳实践"。
----
-### §12.9 — v5 Self-Audit
-**v5 自审清单**：
-| # | v5 自审标准 | 状态 |
-|---|------------|------|
-| v5-SC1 | 4 个 deferred items（D2.4/D3.1/D5/D10 auto-trigger）全部实施 | ✅ PASS — 见 §12.1/§12.2/§12.3/§12.6 |
-| v5-SC2 | 4 个 conceptual gap（D8 schema/D9 mid-task/D10 闭环 part/D11.2）全部闭环或留 v6+ | ✅ PASS — D8/D9/D11 实施，D10 闭环留 v6+ 但 attribution 标注 |
-| v5-SC3 | Source 归因校对 | ✅ PASS — §12 头部修正 |
-| v5-SC4 | Novel contributions 显式标注 | ✅ PASS — §12.8 |
-| v5-SC5 | 残余 gap 坦诚记录（不作 "all green" 虚饰） | ✅ PASS — D10 §4-6 步 + D11.2 prompt-adherence ~70% + D9 KV-cache 验证未做 |
-| v5-SC6 | 实施全部用 hooks-level 而非文档约束 | ✅ PASS — settings.json 7 events / 13 hook entries |
-| v5-SC7 | v5 不破坏 v4 已部署内容 | ✅ PASS — 全部增量，PreCompact recitation 保留 |
-| v5-SC8 | meta-review 5 个 conceptual gap 全部闭环 | ✅ PASS — D8 schema/D9 mid-task/D10 trigger/D11 plugin gate/Source 归因 |
-**v5 Grading Criteria**（仅评估 v5 增量）：
-| 维度 | 评分 | 理由 |
-|------|------|------|
-| **Design Quality** | 4 | 4 个 deferred 全部 hooks-level 落地，spawn-boundary gate 是 novel pattern |
-| **Originality** | 4 | §12.8 列了 7 项超越文章的 novel contribution，不是模仿 |
-| **Craft** | 4 | 每个 §12.x 都有具体代码块、文件路径、wired 配置块 |
-| **Functionality** | 4 | hooks 全部 wire 到 settings.json 并测试通过；PreToolUse 计数器 / SessionEnd trace-analyzer / Task gatekeeper 三个新 hook 通过 4 场景测试 |
-全部 ≥ 3，通过硬阈值。
----
-### §12.10 — 残余 gap（v6+ candidate）
-诚实记录"v5 没解决但应该解决"的工作：
-1. **D10 闭环 §4-6 步**：propose-then-verify-then-commit 闭环。需要 shadow execution 模式（在 worktree 里 apply 提议、跑同 trace、对比 metric）。**Estimated effort**: 2-3 个 WP。
-2. **D11.2 真物理隔离**：当前 spawn-boundary gate 仅强制 prompt 包含 directive，subagent 自身仍可能 ignore（adherence ~70%）。100% 等价需要在 PreToolUse Edit/Write 时通过 `transcript_path` 反查最近 `Task` 标记，判断当前 session 是否是 review subagent context，是则物理拦截。**Estimated effort**: 1 个 WP。
-3. **D9 KV-cache friendliness 实测**：D5 + D9 + D3.1 应该共同提升 KV-cache hit 率，但我们没有 baseline 测量。需要装 instrumentation 测对比。**Estimated effort**: 1 个 WP。
-4. **D8 stop-check 和 D9 recitation 的端到端联动测试**：用一个真实 WP 跑完整 init→update→stop-check 循环，验证 evidence 字段约束真的会阻断"虚假 passing"。**Estimated effort**: 1 个 WP。
-5. **Source 归因校对反流到所有 v4 §**：v5 §12 头部已经修正了，但 §2.4/§3.10-§3.13/§4 D8-D11 的 `[来源:...]` 标注仍是 v4 时期写的，部分归因到了 "Anthropic 第二篇" 而应该是 Manus/LangChain/OpenDev。**Estimated effort**: 0.5 WP（纯文本编辑）。
----
-### §12.11 — Implementation Roster（v5 完成清单）
-| 文件 | 类型 | 目的 |
-|------|------|------|
-| `.claude/settings.json` | hooks 配置 | 7 events / 13 hook entries（v4 → v5 增加 4 entry） |
-| `scripts/hooks/auto-python3.py` | PreToolUse Bash if-filter | D2.4 — 自动 `python *` → `python3 *` |
-| `.claude/rules/*.md`（11 files） | 路径触发 rules | D3.1 — CLAUDE.md 拆分 |
-| `docs/magic/api-routes.md` | Magic Doc | D5.1 reference impl |
-| `scripts/checks/regenerate_magic_docs.py` | regenerator | D5.1 — `--check` 模式支持 freshness |
-| `scripts/checks/check_doc_freshness.py` | freshness 集成 | D5.1 — 加 `_check_magic_docs()` + 修复 routes 路径 |
-| `scripts/hooks/session-start-magic-docs.py` | SessionStart | D5.2 — 漂移告警注入 |
-| `scripts/guard_router.py` | guard 路由 | D5 — 加 `docs/magic/` + `.claude/rules/backend-routes.md` |
-| `scripts/hooks/initializer-agent.py` | progress writer | D8 — 补 `steps[]` / `verification_command` / `evidence` |
-| `scripts/hooks/stop-evaluator.py` | stop check | D8 — 机械化 `evidence != null` 强制 |
-| `scripts/hooks/tool-call-counter.py` | PreToolUse `*` counter | D9 — 50-call mid-task recitation |
-| `scripts/hooks/trace-analyzer.py` | SessionEnd auto | D10 — auto-trigger（不自动 commit） |
-| `scripts/hooks/review-agent-gatekeeper.py` | PreToolUse Task | D11.2 — spawn-boundary gate |
-| `.claude/rules/review-agent-isolation.md` | 文档 | D11 — 更新 7 plugin agent 状态为 D11.2 |
-| `docs/decisions/ADR-038-...md`（本文件 §12） | ADR | v5 patch 文档 |
-**Hooks 测试覆盖**：
-- `tool-call-counter.py`：4 场景（normal/recite-fire/no-progress/atomic-write）
-- `review-agent-gatekeeper.py`：4 场景（non-review/with-directive/missing-directive-block/non-Task）
-- `session-start-magic-docs.py`：2 场景（no-drift-silent/drift-alert）
-- `regenerate_magic_docs.py`：2 场景（regenerate/check-mode-detect-drift）
----
-### §12.12 — v5 总评（按用户的"最佳实践 vs 模仿"标准）
-| 决策 | v4 评分 | v5 评分 | 差异原因 |
-|------|--------|---------|---------|
-| D2.4 if-filter | ❌ 未做 | ✅ 真做 | 实施 hooks-level |
-| D3.1 @include | ❌ 未做 | ✅ 真做 | 11 rule 文件 |
-| D5 Magic Docs | ❌ 未做 | ✅ 真做 | 三件套 + freshness 集成 |
-| D8 schema | ⚠️ 形式 70% | ✅ 真做 | 补 steps[]/verify_cmd/evidence |
-| D8.2 mechanical pass | ✅ 真做且超越 | ✅ 真做且超越 | （保留） |
-| D9 mid-task recitation | ⚠️ 形式 30% | ✅ 真做 | 50-call counter |
-| D10 trace analyzer | ⚠️ 形式 50% | ✅ trigger 真做 / 闭环 v6+ | SessionEnd auto |
-| D11 plugin agent | ⚠️ 形式 50% | ✅ schema-level 等价真做 | spawn-boundary gate |
-**v4 → v5 总体提升**："60% 真做 + 40% 形式" → "95% 真做 + 5% v6+ candidate"。
-v5 完成的不是"按 v4 ADR 抄写一遍"，而是 (a) 用户提的 4 个 deferred 全部用 hooks-level 闭环，(b) meta-review 发现的 4 个 conceptual gap 全部对应到合适的 CC 实施点，(c) 残余 gap 坦诚标注为 v6+ candidate 而不是隐藏。
-这是用户说的"完成最佳实践"——不是模仿文章里的招式，而是吃透原则、找到 CC 系统里能落地的最佳实施点、把治理强度从 convention 升到 hooks-level。
----
-_v3 已通过独立 Evaluator agent 审查。v4 是 v3 的兼容增量补丁。状态: proposed → 待实施（v3+v4 合并部署）。_