@comate/zulu 1.4.0-beta.2 → 1.4.0-beta.4

Files changed (206)

package/comate-engine/assets/skills/code-review/evals/SKILL.md ADDED

@@ -0,0 +1,334 @@
+---
+name: code-review-eval
+description: >-
+  Evaluate the code-review skill's effectiveness by mining commits from a git repo,
+  generating semantic ground truth, running the skill on the diffs, and scoring
+  recall/precision/F1. Uses both bug-fix commits (positive samples) and clean commits
+  (negative samples) to measure both bug detection and false positive control.
+  Use this skill whenever the user mentions "evaluate code review", "benchmark code review",
+  "run code review eval", "test the code-review skill", "code review evaluation",
+  "assess code review quality", or wants to measure how well the code-review skill
+  performs on their codebase.
+---
+# Code Review Skill Evaluation
+Evaluate the code-review skill by testing whether it can discover known bugs from
+bug-fix commit diffs, while also measuring false positive control on clean commits.
+Supports two execution modes: **lite** (SubAgent, diff-only, fast iteration) and **full** (CLI parallel processes, production-representative evaluation with complete multi-agent pipeline).
+## What This Evaluates
+Two core abilities:
+1. **Bug detection (Recall)**: Given a diff that **introduces** a known bug (reversed from a bug-fix commit), can the skill discover the bug? Uses reversed bug-fix commits as positive samples.
+2. **False positive control (Precision)**: Given a clean diff with no bugs, can the skill correctly say "no issues found"? Uses feature/refactor commits as negative samples.
+## Parameters
+| Parameter | Default | Description |
+|-----------|---------|-------------|
+| repo | Current working directory | Target git repository path |
+| limit | 20 | Number of samples per type (bug-fix + clean) |
+| workdir | `<repo>-eval-output/` | Output directory for all artifacts |
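+| mode | `lite` | Execution mode for Step 3: `lite` (SubAgent, diff-only, fast iteration) or `full` (external CLI, complete production pipeline) |
+| cli | `zulu` | CLI used in full mode: `zulu` (requires `license`) or `baidu-cc` (uses internal authentication) |
+| license | *(required for zulu)* | SaaS license key for `zulu run`; only needed when `cli=zulu` |
+| concurrency | 5 | Maximum number of parallel CLI processes in full mode; 3-10 is recommended, since higher values may trigger API rate limiting |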
+## 4-Step Pipeline
+### Step 1: Mine Commits
+Use `git log` to get candidates, then use a SubAgent to filter into positive and negative samples.
+1. Execute these commands to get candidate commits:
+```bash
+# Positive candidates (bug-fix)
+git -C <repo> log --oneline --all -n 500 | grep -iE '\[Bug\]|fix|crash|修复|问题|异常|错误'
+# Negative candidates (clean)
+git -C <repo> log --oneline --all -n 500 | grep -iE 'feat|feature|add|支持|新增|优化|improve|enhance|实现'
+```
+2. Launch **one SubAgent** that:
+   - Reads `agents/miner.md` for filtering rules
+   - For each candidate commit, runs `git show --stat <hash>` and `git log -1 --format="%H%n%P" <hash>` in the repo
+   - Applies filtering rules (10-200 lines, ≤5 files, source code changes)
+   - Separates into bug-fix samples (`sample-XXXX`) and clean samples (`clean-XXXX`)
+   - Outputs a JSON array of selected candidates
+3. Save the output to `<workdir>/candidates.json`
+If `<workdir>/candidates.json` already exists, ask the user whether to re-mine or reuse.
+### Step 2: Generate Semantic Ground Truth
+For each candidate in `candidates.json`, launch **parallel SubAgents** (5-10 at a time):
+Each SubAgent:
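+1. The diff command is built according to the sample type:
+   - **bug-fix samples**: `git -C <repo> diff <commit> <parent_commit> -- <source_files>` (**reversed direction**: after the fix → before the fix, simulating "a change that introduces the bug")
+   - **clean samples**: `git -C <repo> diff <parent_commit> <commit> -- <source_files>` (normal direction: a change that introduces new functionality)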
+2. Reads `agents/gt-generator.md` for the GT generation rules
+3. **Information symmetry**: The SubAgent sees ONLY the diff, NOT the commit subject. **Also NOT the sample type** (`bug-fix` or `clean`). The SubAgent only receives the sample ID for identification purposes. This ensures the GT generator and the skill operate under identical information conditions.
+4. Analyzes the diff independently to determine if the **newly introduced code** (the `+` lines in the diff) has issues
+5. Outputs findings if issues are found, or empty findings array if the code looks correct
+6. Writes semantic GT JSON to `<workdir>/semantic_gt/<id>.json`
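+**Why bug-fix diffs are reversed**: A real code review examines code that *introduces* a change. If the bug-fix diff were used directly (before the fix → after the fix), both the skill and the GT generator would see "someone is fixing a bug" and would naturally conclude "the fix is reasonable, approved". Reversed (after the fix → before the fix), the diff becomes "someone submitted this buggy code", which is what actually tests the skill's ability to find bugs. Clean samples already represent "introducing new functionality", so they need no reversal.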
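The direction rule can be sketched as a small helper. This is only an illustration under stated assumptions, not part of the skill: the function name `build_diff_command` is hypothetical, and it relies solely on the `commit`, `parent_commit`, `type`, and `source_files` fields that `candidates.json` entries carry.

```python
from typing import Dict, List


def build_diff_command(repo: str, candidate: Dict) -> List[str]:
    """Return the git argv for one sample (hypothetical helper).

    bug-fix samples: diff <commit> <parent>  (reversed: fixed -> buggy)
    clean samples:   diff <parent> <commit>  (normal: before -> after)
    """
    commit = candidate["commit"]
    parent = candidate["parent_commit"]
    if candidate["type"] == "bug-fix":
        old, new = commit, parent
    elif candidate["type"] == "clean":
        old, new = parent, commit
    else:
        raise ValueError(f"unknown sample type: {candidate['type']!r}")
    # "--" separates revisions from paths, as in the commands above.
    return ["git", "-C", repo, "diff", old, new, "--", *candidate["source_files"]]
```

Returning an argv list rather than a shell string avoids quoting problems when file paths contain spaces.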
+**Incremental**: If `<workdir>/semantic_gt/<id>.json` already exists, skip that sample.
+Each GT file contains:
+- `sample_id`: the sample identifier
+- `findings[]`: issues found in the newly introduced code (may be empty for any sample type)
+- `expected_review`: what a good reviewer should say when seeing this diff — i.e., spotting problems in the **newly introduced code** (the `+` lines)
+**Note**: Since the GT generator doesn't know the sample type, some bug-fix samples may get empty GT (bug too subtle to see from diff alone), and some clean samples may get non-empty GT (the feature commit actually introduced issues). This is by design — it reflects the true difficulty of the review task.
+### Step 3: Run Code Review on Diffs
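+Choose the execution path according to the `mode` parameter:
+- **`mode=lite` (default)**: a SubAgent reviews the diff directly; fast, but limited to a single Agent and a diff-only review
+- **`mode=full`**: an external CLI process runs the complete code-review skill, including 4 parallel Agents + Meta-Review + full codebase access
+#### Step 3: Lite Mode (SubAgent)
+Use this mode when `mode=lite` or when the CLI is unavailable.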
+For each candidate, launch **parallel SubAgents** (5-10 at a time):
+Each SubAgent:
+1. The diff command is built according to the sample type (same as Step 2):
+   - **bug-fix samples**: `git -C <repo> diff <commit> <parent_commit> -- <source_files>` (reversed direction)
+   - **clean samples**: `git -C <repo> diff <parent_commit> <commit> -- <source_files>` (normal direction)
+2. Reads the code-review skill definition from `../SKILL.md` (the parent directory)
+3. **Standard code-review task**: Review the code change as you would in a real PR review.
+   Do NOT assume there is a bug. The diff may be a bug-fix, a feature addition, a refactor, or anything else.
+   Evaluate the change for correctness, reliability, style, and reuse issues.
+   If the code looks correct and well-implemented, say so — do not force-find problems.
+4. Produces a review report following Step 6 output format **strictly**:
+   - Severity sections: `### 🔴 P0 严重 (N)` / `### 🟠 P1 高优 (N)` / `### 🟡 P2 中等 (N)` / `### 🔵 P3 低优 (N)`
+   - Each finding: `**N. [path/to/file.ts:42](path/to/file.ts#L42)**`
+   - Followed by `- 问题：...` and `- 建议：...`
+   - If no issues: "本次审查未发现需要阻断合入的问题，审查通过"
+5. Writes the report to `<workdir>/responses/<id>.response.md`
+**Evaluation mode note**: The SubAgent is in eval mode — no `delegate_subtask`, no `ask_user_question`,
+no Step 1 scope detection, no Step 8 user interaction. Just produce the review report.
+**Incremental**: If `<workdir>/responses/<id>.response.md` already exists, skip that sample.
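+**Lite mode limitations**: A SubAgent cannot make nested `delegate_subtask` calls, so it can only run a single-Agent review and cannot launch the complete 4-Agent parallel + Meta-Review pipeline. The SubAgent also lacks full codebase access (such as `codebase_search`), so the reuse review dimension is almost never effective. The evaluation result is therefore a **lower-bound estimate** of the skill.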
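As a quick sanity check on a response file, the Step 6 severity headings listed above can be counted with a regular expression. This is a minimal sketch, not the score-judge's actual parsing: `count_findings_by_severity` is a hypothetical helper, and it assumes each heading has exactly the shape `### <emoji> P<n> <label> (N)`.

```python
import re
from typing import Dict

# Matches headings such as "### 🔴 P0 严重 (2)" from the Step 6 format.
_SEVERITY_HEADING = re.compile(
    r"^###\s+\S+\s+(P[0-3])\s+\S+\s+\((\d+)\)\s*$", re.MULTILINE
)


def count_findings_by_severity(report: str) -> Dict[str, int]:
    """Count findings per severity from the section headings (hypothetical helper)."""
    counts = {"P0": 0, "P1": 0, "P2": 0, "P3": 0}
    for severity, n in _SEVERITY_HEADING.findall(report):
        counts[severity] += int(n)
    return counts
```

A report with no severity headings yields all zeros, which matches treating such a report as a passed review.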
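+#### Step 3: Full Mode (parallel CLI)
+Use this mode when `mode=full`. Each sample runs the complete code-review skill pipeline in its own CLI process.
+##### Pre-flight checks
+1. Verify the CLI is available: run `which <cli>` (`zulu` or `baidu-cc`). If it is not available, **warn the user and fall back to Lite Mode**.
+2. If `cli=zulu`, verify that the `license` parameter was provided. If not, warn and fall back.
+3. Confirm with the user that Full Mode is about to be used, showing the sample count and the expected concurrency.
+##### Execution flow
+1. Read `<workdir>/candidates.json`
+2. Select the samples that do not yet have `<workdir>/responses/<id>.response.md` (incremental support)
+3. Read `references/cli-query-template.md` for the CLI command template and the query template
+4. For each pending sample, generate the CLI command from the template. **Mind the diff direction**:
+   - bug-fix samples: `git diff {COMMIT} {PARENT_COMMIT} -- {SOURCE_FILES}` (reversed direction)
+   - clean samples: `git diff {PARENT_COMMIT} {COMMIT} -- {SOURCE_FILES}` (normal direction)
+   Fill in the placeholders to produce the complete query
+5. Run the parallel shell script via `run_command`, limiting concurrency to `concurrency` processes
+##### Shell script pattern
+The main Agent generates a shell script and runs it via `run_command`. The script follows this pattern: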
+```bash
+#!/bin/bash
+WORKDIR="<workdir>"
+REPO="<repo>"
+MAX_PARALLEL=<concurrency>
+RESPONSES_DIR="$WORKDIR/responses"
+mkdir -p "$RESPONSES_DIR"
+run_review() {
+  local ID="$1" PARENT="$2" COMMIT="$3" FILES="$4"
+  local OUTPUT="$RESPONSES_DIR/${ID}.response.md"
+  # Incremental: skip samples that already have a result
+  [ -f "$OUTPUT" ] && echo "SKIP: $ID" && return 0
+  echo "START: $ID"
+  # --- Choose the command according to the CLI ---
+  # zulu:
+  local RESULT
+  RESULT=$(zulu run \
+    -l "<license>" \
+    --activate-skill code-review \
+    --cwd "$REPO" \
+    --display task \
+    -q "<query with filled placeholders>" 2>/dev/null) || {
+    echo "FAIL: $ID"
+    echo "# Review Failed" > "$OUTPUT"
+    return 0
+  }
+  # baidu-cc:
+  # RESULT=$(baidu-cc -p "<query with filled placeholders>" \
+  #   --allowedTools "Bash,Read,Write,Edit,Glob,Grep,Agent" \
+  #   --cwd "$REPO" 2>/dev/null) || { ... }
+  # Strip a leading YAML frontmatter block, if any; later `---` rules in the body are kept
+  printf '%s\n' "$RESULT" | awk 'NR==1 && /^---$/ {fm=1; next} fm && /^---$/ {fm=0; next} !fm' > "$OUTPUT"
+  echo "DONE: $ID"
+}
+# Concurrency control: background jobs + wait -n
+RUNNING=0
+<for each sample, call run_review with args &, control concurrency>
+wait
+echo "All reviews completed."
+```
+> **Note**: The main Agent must build the sample argument list dynamically from `candidates.json` rather than hard-coding it in the script.
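For comparison, the same "build the list dynamically, then run with bounded concurrency" idea can be sketched in Python. This is a hedged alternative to the shell pattern, not what the skill ships: `pending_samples` and `run_all` are hypothetical names, a thread pool stands in for `wait -n`, and the per-sample `review_one` callable (which would invoke the CLI) is left to the caller.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def pending_samples(workdir: str) -> List[Dict]:
    """List candidates that still lack a response file (incremental support)."""
    root = Path(workdir)
    candidates = json.loads((root / "candidates.json").read_text(encoding="utf-8"))
    responses = root / "responses"
    return [
        c for c in candidates
        if not (responses / f"{c['id']}.response.md").exists()
    ]


def run_all(samples: Sequence[T], review_one: Callable[[T], R], concurrency: int) -> List[R]:
    """Run review_one over all samples with at most `concurrency` in flight."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # pool.map preserves the input order of the results.
        return list(pool.map(review_one, samples))
```

Threads are sufficient here because each worker would spend its time waiting on an external CLI process.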
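+##### CLI query content
+The query for each CLI process follows the template in `references/cli-query-template.md`. Key points:
+- **Provide the exact diff command**: choose the diff direction by sample type (reversed for bug-fix, normal for clean) and skip Step 1 scope detection
+- **No user interaction**: skip Step 8 and do not call `ask_user_question`
+- **Strict Step 6 output format**: ensures score-judge can parse the report
+- **Do not presuppose a bug**: as in Lite Mode, give no hint about the sample type
+##### Failure handling
+| Scenario | Handling |
+|------|------|
+| CLI process exits abnormally | Write a `# Review Failed` placeholder file and continue with the other samples |
+| CLI process times out | Same as above |
+| Output lacks severity sections | Keep the raw output; score-judge treats it as "审查通过" (review passed) |
+| Some samples fail | Report the failure count in the Step 4 report; scores of successful samples are unaffected |
+| User interrupt (Ctrl-C) | The design is incremental: a re-run automatically skips completed samples |
+##### Advantages of Full Mode
+- The skill under test runs the **complete production pipeline**: 4 specialized Agents in parallel (correctness/style/reliability/reuse) + Meta-Review noise reduction
+- Each CLI process has the full tool set (`read_file`, `grep_content`, `codebase_search`) and can read complete file context
+- The evaluation result directly represents the skill quality that users actually experience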
+**Incremental**: If `<workdir>/responses/<id>.response.md` already exists, skip that sample.
+### Step 4: Score and Report
+Launch **one SubAgent** per sample (running 5-10 in parallel):
+Each SubAgent:
+1. Reads `<workdir>/semantic_gt/<id>.json` (the GT)
+2. Reads `<workdir>/responses/<id>.response.md` (the skill's review)
+3. Reads `agents/score-judge.md` for judging rules
+4. Determines the sample type from `candidates.json` (NOT from the GT file — the GT generator doesn't know sample type)
+5. For bug-fix samples with GT findings: checks if any predicted finding semantically matches a GT finding
+6. For bug-fix samples with empty GT (`gt_blind=true`): the bug was too subtle for diff-only analysis; if skill found it anyway, it's a bonus hit
+7. For clean samples with empty GT: checks if the skill correctly reported "no issues" (correct negative) or falsely reported a P0-P2 issue (false positive)
+8. For clean samples with GT findings: the GT independently found issues in the feature commit; if the skill found the same issues, it's a hit (NOT a false positive)
+9. Outputs a JSON score result with `sample_type`, `gt_blind`, `hits`, `false_positives`, `correct_negatives`
+10. Writes to `<workdir>/scores/<id>.json`
+After all scoring SubAgents complete, the **main Agent** aggregates results and writes `<workdir>/report.md`.
+#### Report structure
+```markdown
+# Code Review Skill Evaluation Report
+**Mode**: lite / full
+**CLI**: zulu / baidu-cc (full mode only)
+## Overall Metrics
+| Metric | Value |
+|--------|-------|
+| Bug-fix samples | N |
+| Clean samples | N |
+| GT findings (bug-fix, excl. gt_blind) | N |
+| GT findings (clean, unexpected) | N |
+| Predicted findings (bug-fix) | N |
+| Predicted findings (clean) | N |
+| Recall | X% (hits / GT findings, excl. gt_blind) |
+| Recall (incl. gt_blind bonus) | X% |
+| Precision | X% (hits / total pred P0-P2) |
+| False Positive Rate | X% (FP on clean samples with empty GT) |
+| Correct Negative Rate | X% (correct negatives / clean samples with empty GT) |
+| GT Blindness Rate | X% (gt_blind bug-fix samples / total bug-fix samples) |
+| F1 | X% |
+## Bug-Fix Sample Results (Recall)
+| Sample | GT | GT Blind | Pred | Hit | FP | Recall |
+|--------|:--:|:--------:|:----:|:---:|:--:|:------:|
+## Clean Sample Results (False Positive Control)
+| Sample | GT | Pred | Hit | FP | Correct Negative |
+|--------|:--:|:----:|:---:|:--:|:----------------:|
+## GT Blindness Analysis
+(bug-fix samples where GT also couldn't find the bug from diff alone)
+## Worst Samples (missed bugs)
+## False Positive Analysis
+```
+### Metric definitions
+- **Recall**: Of all GT findings in bug-fix samples (excluding `gt_blind` samples), how many did the skill find?
+- **Recall (incl. gt_blind bonus)**: Same as Recall, but gt_blind samples where the skill found the bug are counted as bonus hits (numerator only, not denominator).
+- **Precision**: Of all P0-P2 findings the skill reported across ALL samples, how many matched a GT finding?
+- **False Positive Rate**: Of clean samples where GT is also empty, what fraction did the skill incorrectly flag with a P0-P2 finding?
+- **Correct Negative Rate**: Of clean samples where GT is also empty, what fraction correctly got "no issues"?
+- **GT Blindness Rate**: Of bug-fix samples, what fraction had empty GT (bug too subtle for diff-only analysis)? A high rate suggests the sample set contains many subtle bugs that are hard to detect from diffs alone.
+- **F1**: Harmonic mean of Recall and Precision.
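The definitions above can be turned into a short aggregation sketch over the per-sample score files. This is an illustration under assumptions that the text does not settle: `aggregate` is a hypothetical name, `pred_count` is assumed to count only P0-P2 findings, the Precision numerator counts every recorded hit (including gt_blind bonus hits), and the False Positive Rate counts samples rather than individual findings.

```python
from typing import Dict, Iterable


def _ratio(numerator: float, denominator: float) -> float:
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def aggregate(scores: Iterable[Dict]) -> Dict[str, float]:
    """Aggregate per-sample score dicts into the report metrics (hypothetical helper)."""
    scores = list(scores)
    bugfix = [s for s in scores if s["sample_type"] == "bug-fix"]
    visible = [s for s in bugfix if not s["gt_blind"]]
    blind = [s for s in bugfix if s["gt_blind"]]
    clean_empty = [s for s in scores
                   if s["sample_type"] == "clean" and s["gt_count"] == 0]

    gt_total = sum(s["gt_count"] for s in visible)
    recall = _ratio(sum(s["hits"] for s in visible), gt_total)
    # Bonus hits enlarge the numerator only; the denominator is unchanged.
    recall_bonus = _ratio(sum(s["hits"] for s in visible + blind), gt_total)
    precision = _ratio(sum(s["hits"] for s in scores),
                       sum(s["pred_count"] for s in scores))
    return {
        "recall": recall,
        "recall_incl_bonus": recall_bonus,
        "precision": precision,
        "f1": _ratio(2 * precision * recall, precision + recall),
        "false_positive_rate": _ratio(
            sum(1 for s in clean_empty if s["false_positives"] > 0), len(clean_empty)),
        "correct_negative_rate": _ratio(
            sum(s["correct_negatives"] for s in clean_empty), len(clean_empty)),
        "gt_blindness_rate": _ratio(len(blind), len(bugfix)),
    }
```

Because bonus hits are added to the numerator only, `recall_incl_bonus` can exceed 100% on small sample sets.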
+## Reading the Report
+- **Recall ≥ 60%**: decent — the skill catches most known bugs
+- **False Positive Rate ≤ 20%**: good — the skill doesn't over-report on clean diffs
+- **F1 ≥ 50%**: solid overall performance
+## Important Notes
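+- **Lite mode vs Full mode**: Lite mode evaluates a simplified single-Agent review (a lower-bound estimate) and suits fast iteration and debugging. Full mode evaluates the complete production pipeline (4 parallel Agents + Meta-Review + codebase access), and its results directly represent the user experience. **Use Full mode for formal evaluations.**
+- **Switching modes**: Do not mix `responses/` files produced by the two modes. When switching from lite to full (or the reverse), empty the `<workdir>/responses/` directory before re-running Step 3; otherwise the incremental logic skips the existing files.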
+- **Information symmetry**: GT generator and skill see the same diff (no commit subject, **no sample type**). The GT generator does not know whether a sample is bug-fix or clean. This prevents confirmation bias and ensures the GT reflects what's actually visible from the diff.
+- **Reversed diff for bug-fix samples**: Bug-fix samples use reversed diff (`commit → parent`), making the diff look like "someone introduced this buggy code". This aligns the evaluation task with real code review: reviewing newly submitted code for problems. Clean samples use normal diff direction (`parent → commit`) since they already represent "introducing new functionality".
+- **No confirmation bias**: Both Step 2 (GT generation) and Step 3 (skill execution) use standard instructions with no hint about sample type. Both must independently decide if there are issues.
+- **GT blindness handling**: When a bug-fix sample's GT is empty (`gt_blind=true`), it means the bug is too subtle to detect from the diff alone. These samples are excluded from the Recall denominator to avoid unfairly penalizing the skill. If the skill finds the bug anyway, it's counted as a bonus.
+- **Clean sample fairness**: If the GT generator independently finds issues in a clean sample, skill findings that match are counted as hits, not false positives. This prevents penalizing the skill for correct behavior.
+- **Lower-bound evaluation (lite mode only)**: In lite mode the SubAgent sees only the diff (no access to complete files). Full mode has no such restriction: the CLI process has the full tool set.
+- **Incremental runs**: All steps skip existing outputs. Safe to re-run at any time.
+- **GT quality**: Semantic GT is generated by SubAgent analysis of diffs without knowing sample type. For high-stakes evaluation, spot-check `semantic_gt/*.json` files manually, especially gt_blind cases.
+- **Portability**: Works on any git repo. Just point `repo` at the target.
+## Output Structure
+```
+<workdir>/
+  candidates.json            # Step 1: bug-fix + clean samples
+  semantic_gt/               # Step 2: one JSON per sample
+    sample-0001.json         #   bug-fix sample with findings
+    clean-0001.json          #   clean sample with empty findings
+    ...
+  responses/                 # Step 3: review reports
+    sample-0001.response.md
+    clean-0001.response.md
+    ...
+  scores/                    # Step 4: per-sample scoring
+    sample-0001.json
+    clean-0001.json
+    ...
+  report.md                  # Step 4: aggregated report
+```

package/comate-engine/assets/skills/code-review/evals/agents/gt-generator.md ADDED

@@ -0,0 +1,76 @@
+# Semantic Ground Truth Generator
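+You are an expert in code review evaluation. Given the diff of a commit, your task is to generate ground truth for the evaluation.
+## Core Principle
+The GT generator and the code-review skill must see **exactly the same information** so that the evaluation is fair.
+**Both see only the diff.** Do not use extra information such as the commit subject, ticket number, or sample type to infer whether a bug exists.
+## Input
+You will receive:
+- The complete git diff
+- The sample ID (such as `sample-0001` or `clean-0001`; for identification only, it says nothing about code quality)
+**Note**: You will not be told what kind of change this diff is. Judge independently, relying entirely on the diff content itself.
+## Task
+Analyze the diff and decide whether the **newly introduced code (the `+` lines in the diff)** has problems worth pointing out.
+This is a standard code review task: someone submitted a code change, and you need to review whether the code this change introduces has bugs or defects.
+### If you find problems
+Identify the bugs or defects in the newly introduced code from the diff. Focus on:
+- Whether the added code (the `+` lines) introduces a problem
+- What the root cause of the problem is
+Produce 1-N findings, each describing one independent problem.
+### If you find no problems
+If the diff looks like a normal feature enhancement, refactor, or improvement, and the newly introduced code has no obvious problems, produce an **empty findings array**.
+**Important**: Do not force problems. Output findings only when the diff actually shows a defect in the newly introduced code.
+## Output Format
+**Output only the JSON object**, with no other text: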
+```json
+{
+  "sample_id": "<id>",
+  "findings": [
+    {
+      "file": "<file path relative to the repository root>",
+      "line_range": [<start_line>, <end_line>],
+      "dimension": "<correctness|reliability|style|reuse>",
+      "severity": "<P0|P1|P2|P3>",
+      "description": "<what is wrong with the newly introduced code>",
+      "root_cause": "<why the problem exists>",
+      "expected_review": "<what an excellent reviewer should say on seeing this diff>"
+    }
+  ]
+}
+```
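+## Rules
+1. **Look only at the diff.** Do not use the commit subject, sample type, or other metadata to infer whether a bug exists.
+2. **Judge independently**: your role is exactly equivalent to that of the code-review skill. Decide from the diff alone whether the newly introduced code has problems.
+3. **One independent problem maps to exactly one finding.** Do not expand findings per diff hunk.
+4. `description` should describe **what is wrong with the newly introduced code** (the defect introduced by the `+` lines).
+5. `root_cause` should explain **why the problem exists** (the underlying cause).
+6. `expected_review` should describe **the problem an excellent reviewer should point out on seeing this diff**. This is the key anchor for scoring.
+7. `line_range` should cover the smallest code region that contains the root cause.
+8. `severity` reflects how serious the problem is, consistent with the main skill:
+   - **P0**: a definite severe bug, security vulnerability, data corruption, or crash risk
+   - **P1**: a high-probability logic problem, significant performance problem, or important boundary error
+   - **P2**: a moderate maintainability or stability problem
+   - **P3**: a low-risk improvement or code style suggestion
+9. `dimension` reflects the primary category of the problem, aligned with the main skill's four review dimensions: `correctness`, `reliability` (reliability/resources/concurrency/authorization), `style` (coding conventions), `reuse`.
+10. **No problem means empty findings**: if the diff looks like a reasonable improvement and the newly introduced code has no obvious defect, output an empty array `[]`. Do not force findings.
+11. **Output JSON only, nothing else.**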
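Before scoring, a GT file can be checked mechanically against the output format above. The following is a minimal sketch under stated assumptions: `validate_gt` is a hypothetical helper, and the checks cover only the structure this file describes (required keys, allowed `dimension` and `severity` values, a two-integer `line_range`), not the quality of the findings.

```python
from typing import Dict, List

ALLOWED_DIMENSIONS = {"correctness", "reliability", "style", "reuse"}
ALLOWED_SEVERITIES = {"P0", "P1", "P2", "P3"}
REQUIRED_KEYS = {"file", "line_range", "dimension", "severity",
                 "description", "root_cause", "expected_review"}


def validate_gt(gt: Dict) -> List[str]:
    """Return a list of problems; an empty list means the GT looks well-formed."""
    errors: List[str] = []
    if not isinstance(gt.get("sample_id"), str):
        errors.append("sample_id must be a string")
    findings = gt.get("findings")
    if not isinstance(findings, list):
        return errors + ["findings must be a list (possibly empty)"]
    for i, finding in enumerate(findings):
        if not isinstance(finding, dict):
            errors.append(f"findings[{i}] must be an object")
            continue
        missing = REQUIRED_KEYS - finding.keys()
        if missing:
            errors.append(f"findings[{i}] missing keys: {sorted(missing)}")
            continue
        if finding["dimension"] not in ALLOWED_DIMENSIONS:
            errors.append(f"findings[{i}] has unknown dimension {finding['dimension']!r}")
        if finding["severity"] not in ALLOWED_SEVERITIES:
            errors.append(f"findings[{i}] has unknown severity {finding['severity']!r}")
        lr = finding["line_range"]
        if not (isinstance(lr, list) and len(lr) == 2
                and all(isinstance(x, int) for x in lr) and lr[0] <= lr[1]):
            errors.append(f"findings[{i}] line_range must be [start, end] with start <= end")
    return errors
```

An empty `findings` array is valid by design, since the generator is told not to force problems.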

package/comate-engine/assets/skills/code-review/evals/agents/miner.md ADDED

@@ -0,0 +1,87 @@
+# Commit Miner
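+You are a data-mining expert for code review evaluation. Your task is to select two kinds of evaluation samples from a git repository's commit history:
+1. **Positive samples (bug-fix)**: commits known to contain a bug fix
+2. **Negative samples (clean)**: normal feature enhancements, refactors, and similar commits (containing no bug fix)
+## Input
+You will receive:
+- The repository path
+- A list of candidate commits (obtained via `git log`, containing hash + subject)
+## How to Work
+For each candidate commit:
+1. Run `git show --stat <hash>` to see the size of the change and the files involved
+2. Run `git log -1 --format="%H%n%P%n%s%n%b" <hash>` to get the full commit information
+3. Use the rules below to decide whether it is a positive sample, a negative sample, or excluded
+## Positive Sample Rules
+**Keep** commits that meet all of the following conditions, and mark them `"type": "bug-fix"`:
+- The commit message contains a keyword such as bug/fix/crash/修复/问题/异常/错误 (case-insensitive)
+- The change size is between 10 and 200 lines
+- At most 5 files are changed
+- The change mainly modifies source code files
+**Exclude**:
+- Commit messages containing refactor/rename/revert/docs/chore/style/lint/merge/update dependency
+- Changes that touch only test files
+## Negative Sample Rules
+**Keep** commits that meet all of the following conditions, and mark them `"type": "clean"`:
+- The commit message contains a keyword such as feature/add/feat/支持/新增/优化/improve/enhance/new/实现
+- The change size is between 10 and 200 lines
+- At most 5 files are changed
+- The change mainly modifies source code files
+- The commit message does **not** contain keywords such as bug/fix/crash/修复/问题/异常/错误
+**Exclude**:
+- Changes that touch only test files or only configuration
+- Commit messages containing refactor/rename/revert/docs/chore/style/lint/merge/update dependency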
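The keyword rules above can be sketched as a plain substring classifier. This is an illustration only: `classify_subject` is a hypothetical helper, it looks at the subject line alone, and it uses case-insensitive substring matching (so `fix` also matches `prefix`), which is an assumption about how loosely the keywords are meant to match.

```python
from typing import Optional

BUG_KEYWORDS = ("bug", "fix", "crash", "修复", "问题", "异常", "错误")
CLEAN_KEYWORDS = ("feature", "add", "feat", "支持", "新增", "优化",
                  "improve", "enhance", "new", "实现")
EXCLUDE_KEYWORDS = ("refactor", "rename", "revert", "docs", "chore",
                    "style", "lint", "merge", "update dependency")


def classify_subject(subject: str) -> Optional[str]:
    """Return "bug-fix", "clean", or None (excluded) for a commit subject."""
    text = subject.lower()
    if any(k in text for k in EXCLUDE_KEYWORDS):
        return None
    # Bug keywords are checked first, so a clean sample never contains one.
    if any(k in text for k in BUG_KEYWORDS):
        return "bug-fix"
    if any(k in text for k in CLEAN_KEYWORDS):
        return "clean"
    return None
```

The size and file-type conditions still have to be checked separately, since they depend on `git show --stat` rather than on the message.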
+## Output
+Output a JSON array in which each element represents one selected commit:
+```json
+[
+  {
+    "id": "sample-0001",
+    "commit": "abc123def456",
+    "parent_commit": "parent_hash",
+    "subject": "commit subject line",
+    "type": "bug-fix",
+    "source_files": ["path/to/file1.ts", "path/to/file2.ts"],
+    "language_exts": [".ts"],
+    "changed_lines": 45,
+    "file_count": 2
+  },
+  {
+    "id": "clean-0001",
+    "commit": "def456abc789",
+    "parent_commit": "parent_hash",
+    "subject": "commit subject line",
+    "type": "clean",
+    "source_files": ["path/to/file1.ts"],
+    "language_exts": [".ts"],
+    "changed_lines": 30,
+    "file_count": 1
+  }
+]
+```
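+## Rules
+- Get `parent_commit` via `git log -1 --format="%P" <hash>` (take the first parent)
+- `source_files` contains only source code files; exclude non-source files such as .json/.yaml/.md/.txt
+- `language_exts` is derived from the extensions of `source_files`
+- `changed_lines` comes from the numbers in the last (summary) line of `git show --stat`
+- Positive sample id format: `sample-XXXX` (numbered sequentially)
+- Negative sample id format: `clean-XXXX` (numbered sequentially)
+- **A 1:1 ratio of positive to negative samples is recommended**, with at least 5 of each
+- Output only the JSON array, nothing else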
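Extracting `changed_lines` and `file_count` from the `git show --stat` summary line might look like the sketch below. The helper names are hypothetical, and two points are assumptions rather than something this file states: `changed_lines` is taken as insertions plus deletions, and the file count is the one reported by git (which includes non-source files).

```python
import re
from typing import Optional, Tuple

# Summary line examples:
#   " 2 files changed, 30 insertions(+), 15 deletions(-)"
#   " 1 file changed, 1 insertion(+)"
_STAT_SUMMARY = re.compile(
    r"(\d+) files? changed"
    r"(?:, (\d+) insertions?\(\+\))?"
    r"(?:, (\d+) deletions?\(-\))?"
)


def parse_stat_summary(stat_output: str) -> Optional[Tuple[int, int]]:
    """Return (file_count, changed_lines) from `git show --stat` output, or None."""
    lines = stat_output.strip().splitlines()
    if not lines:
        return None
    match = _STAT_SUMMARY.search(lines[-1])
    if match is None:
        return None
    files, insertions, deletions = match.groups()
    return int(files), int(insertions or 0) + int(deletions or 0)


def passes_size_rules(file_count: int, changed_lines: int) -> bool:
    """Size rules shared by both sample types: 10-200 lines, at most 5 files."""
    return 10 <= changed_lines <= 200 and file_count <= 5
```

Git omits the insertions or deletions clause when its count is zero, which is why both groups in the pattern are optional.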

package/comate-engine/assets/skills/code-review/evals/agents/score-judge.md ADDED

@@ -0,0 +1,168 @@
+# Precision Judge
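+You are a judge for code review evaluation. Your task is to decide whether the code-review skill's output found the known problems.
+## Background
+Evaluation flow:
+1. The diff of a commit is taken from the git history
+2. Samples fall into two types:
+   - **bug-fix samples**: the **reversed diff** of a bug-fix commit (after the fix → before the fix), simulating "someone submitted code that introduces a bug"
+   - **clean samples**: the normal diff of a feature/enhance commit (before the change → after the change)
+3. The GT generator and the code-review skill both see only the diff (without knowing the sample type), and each independently produces its analysis
+4. You combine the sample type and the GT to judge the skill's performance
+## Input
+- **Sample type**: `bug-fix` or `clean` (from candidates.json, not labeled by the GT generator)
+- **GT findings**: findings the GT generator produced independently (may be empty)
+- **Predicted findings**: what the skill actually said in its review report
+## Dimension Mapping
+The GT `dimension` corresponds to the main skill's `reviewer`/`category` as follows. Use this mapping when judging to understand what the skill's classification means:
+| GT `dimension` | Main skill `reviewer` | Corresponding `category` values |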
+|---|---|---|
+| `correctness` | `correctness` | `null-safety`, `type-error`, `data-structure`, `exception-handling`, `variable-param`, `string-format`, `control-flow`, `oop-error`, `framework-bug` |
+| `reliability` | `reliability` | `resource-leak`, `concurrency-race`, `thread-safety`, `db-operation`, `async-issue`, `auth-missing`, `auth-bypass`, `auth-logic-error`, `performance-issue` |
+| `style` | `style` | `code-format`, `naming-convention`, `code-style`, `comment-style`, `vue-style`, `react-style` |
+| `reuse` | `reuse` | `duplicate-function`, `inline-reimplementation`, `similar-pattern` |
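+**Notes**:
+- When judging, **do not call a miss just because the dimension/reviewer does not match**. The GT dimension and the skill's reviewer/category are auxiliary information only; the final basis for judging remains the **semantic consistency** between `expected_review` and the skill's output.
+- If a GT finding's `dimension` is `efficiency` or `quality` (legacy data), map it to `reliability` (performance-issue and similar).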
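The mapping table and the legacy rule can be written down as a lookup. This is a small sketch for illustration: the constant and function names are hypothetical, and the category sets are copied from the table above.

```python
from typing import Optional

CATEGORIES_BY_DIMENSION = {
    "correctness": {"null-safety", "type-error", "data-structure",
                    "exception-handling", "variable-param", "string-format",
                    "control-flow", "oop-error", "framework-bug"},
    "reliability": {"resource-leak", "concurrency-race", "thread-safety",
                    "db-operation", "async-issue", "auth-missing",
                    "auth-bypass", "auth-logic-error", "performance-issue"},
    "style": {"code-format", "naming-convention", "code-style",
              "comment-style", "vue-style", "react-style"},
    "reuse": {"duplicate-function", "inline-reimplementation", "similar-pattern"},
}

# Legacy GT dimensions are folded into "reliability".
LEGACY_DIMENSIONS = {"efficiency": "reliability", "quality": "reliability"}


def normalize_dimension(dimension: str) -> str:
    """Map legacy GT dimensions onto the current four; pass others through."""
    return LEGACY_DIMENSIONS.get(dimension, dimension)


def dimension_for_category(category: str) -> Optional[str]:
    """Return the dimension a skill `category` belongs to, or None if unknown."""
    for dimension, categories in CATEGORIES_BY_DIMENSION.items():
        if category in categories:
            return dimension
    return None
```

The lookup only helps interpret the skill's labels; a mismatch between dimensions is not by itself a reason to score a miss.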
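+## Judging Logic
+### For bug-fix samples
+bug-fix samples use a reversed diff, so the `+` lines in the diff are the code that introduces the bug. The skill's task is to find the problems in this newly introduced code.
+#### Case A: GT has findings
+For each GT finding, decide whether the skill hit it:
+**Hit (hit=1)**: the skill's output is semantically consistent with the GT's `expected_review` and points to the same underlying problem. The file must match and the core of the problem must be the same.
+**Miss (hit=0)**: the skill said "审查通过" (review passed), flagged an entirely different file, or found other problems that are not in the GT.
+#### Case B: GT has no findings (the bug is too subtle; the GT missed it too)
+This case means the bug is hard to see even in the reversed diff. Mark it `gt_blind=true`.
+- If the skill also found nothing: `hit=0`, with `gt_blind=true` (not counted in the Recall denominator, because the GT could not see it either)
+- If the skill did find it: `hit=1`, `gt_blind=true` (the skill outperformed the GT; this is a bonus)
+### For clean samples
+#### Case A: GT has no findings (the GT also considers the diff problem-free)
+**Correct (correct=1)**: the skill said "审查通过" (review passed) or "未发现需要阻断合入的问题" (no merge-blocking problems found), or reported only P3-level code style suggestions.
+**False positive (false_positive=1)**: the skill reported a P0/P1/P2-level problem.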
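Case A for clean samples needs no semantic matching, so it can be sketched mechanically. The helper below is hypothetical and makes two assumptions: the predicted severities have already been extracted from the report, and `pred_count` counts only P0-P2 findings.

```python
from typing import Dict, List

BLOCKING_SEVERITIES = {"P0", "P1", "P2"}


def score_clean_with_empty_gt(pred_severities: List[str]) -> Dict:
    """Score a clean sample whose GT has no findings (hypothetical helper).

    Empty or P3-only reviews are correct negatives; any P0-P2 finding
    marks the sample as a false positive.
    """
    blocking = [s for s in pred_severities if s in BLOCKING_SEVERITIES]
    return {
        "sample_type": "clean",
        "gt_blind": False,
        "gt_count": 0,
        "pred_count": len(blocking),
        "hits": 0,
        "false_positives": 1 if blocking else 0,
        "correct_negatives": 0 if blocking else 1,
        "details": [],
    }
```

Every other case still depends on comparing the skill's text with `expected_review`, which is why the judge remains a SubAgent.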
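+#### Case B: GT has findings (the GT independently found problems in the clean sample)
+This case means the feature commit actually has problems too. Judge it with the bug-fix sample logic:
+- If the skill's finding semantically matches the GT's finding: `hit=1` (the skill correctly found the problem; not a false positive)
+- If the skill's finding does not match the GT's finding: apply the normal logic to decide whether it is a hit, a miss, or a false positive
+**Key point**: when the GT also finds a problem in a clean sample, the skill finding the same problem is correct behavior and **must never be counted as a false positive**.
+## Output
+Output JSON only, with no other content:
+### bug-fix sample (GT has findings)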
+```json
+{
+  "sample_type": "bug-fix",
+  "gt_blind": false,
+  "gt_count": 1,
+  "pred_count": 3,
+  "hits": 1,
+  "false_positives": 2,
+  "correct_negatives": 0,
+  "details": [
+    {
+      "gt_idx": 0,
+      "gt_description": "description of the bug in the newly introduced code",
+      "matched_pred": "what the skill actually said",
+      "hit": true,
+      "reason": "rationale for the judgment"
+    }
+  ]
+}
+```
+### bug-fix sample (GT also missed the bug, gt_blind)
+```json
+{
+  "sample_type": "bug-fix",
+  "gt_blind": true,
+  "gt_count": 0,
+  "pred_count": 1,
+  "hits": 1,
+  "false_positives": 0,
+  "correct_negatives": 0,
+  "details": [
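+    {
+      "gt_idx": null,
+      "gt_description": "GT found no problem (bug too subtle)",
+      "matched_pred": "the problem the skill actually found",
+      "hit": true,
+      "reason": "the skill independently found the bug even though the GT could not see it, exceeding the GT baseline"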
+    }
+  ]
+}
+```
+### clean sample (GT has no findings, skill reports no problems either)
+```json
+{
+  "sample_type": "clean",
+  "gt_blind": false,
+  "gt_count": 0,
+  "pred_count": 0,
+  "hits": 0,
+  "false_positives": 0,
+  "correct_negatives": 1,
+  "details": []
+}
+```
+### clean sample (GT has findings, skill found the same problem)
+```json
+{
+  "sample_type": "clean",
+  "gt_blind": false,
+  "gt_count": 1,
+  "pred_count": 1,
+  "hits": 1,
+  "false_positives": 0,
+  "correct_negatives": 0,
+  "details": [
+    {
+      "gt_idx": 0,
+      "gt_description": "the problem the GT found in the clean sample",
+      "matched_pred": "the same problem found by the skill",
+      "hit": true,
+      "reason": "the skill and the GT agree on a real problem in the clean sample; not a false positive"
+    }
+  ]
+}
+```
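+## Rules
+1. **Focus on `expected_review`**: it is the most important basis for judging in the GT.
+2. **Lenient, but not overly so**: the skill's wording need not match the GT exactly, but it must point to the same underlying problem. "A different problem in the same file" is not a hit.
+3. **P3 style suggestions are neither hits nor false positives**: if the skill reported only P3 problems such as naming style or code organization while the GT is a P0/P1 correctness bug, that is a miss (not a false positive).
+4. **False positives apply only to P0-P2**: only a nonexistent P0/P1/P2 problem reported by the skill counts as a false positive. P3 does not.
+5. **Judge conservatively**: when unsure, call it a miss (hit=0).
+6. **Respect the GT's blind-review result**: the GT generator does not know the sample type, so its findings reflect "problems visible from the diff alone". When the GT and the skill agree on a problem (even in a clean sample), that is correct behavior.
+7. **gt_blind flag**: when a bug-fix sample's GT is empty, mark `gt_blind=true`; these samples need special handling when computing Recall.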