@parkgogogo/openclaw-reflection 0.1.1 → 0.1.3

Files changed (13)

package/README.md +47 -11
package/README.zh-CN.md +47 -11
package/assets/memory-flowchart.png +0 -0
package/assets/openclaw-reflection-logo.png +0 -0
package/package.json +4 -2
package/src/evals/cli.ts +15 -0
package/src/evals/comparison.ts +248 -0
package/src/evals/models.ts +125 -0
package/src/evals/reporting.ts +123 -0
package/src/evals/runner.ts +62 -0
package/src/index.ts +66 -1
package/src/write-guardian/audit-log.ts +71 -0
package/src/write-guardian/index.ts +49 -7

package/README.md CHANGED

@@ -1,5 +1,11 @@
 # OpenClaw Reflection
+<p align="center">
+  <img src="./assets/openclaw-reflection-logo.png" alt="OpenClaw Reflection logo" width="180" />
+</p>
+<p align="center"><strong>Make OpenClaw's native memory system sharper without replacing it.</strong></p>
 ![OpenClaw Plugin](https://img.shields.io/badge/OpenClaw-Plugin-111111?style=flat-square)
 ![TypeScript](https://img.shields.io/badge/TypeScript-5.x-3178c6?style=flat-square)
 ![memory_gate 18 cases](https://img.shields.io/badge/memory_gate-18%20benchmark%20cases-2ea043?style=flat-square)
@@ -7,8 +13,6 @@
 Chinese version: [README.zh-CN.md](./README.zh-CN.md)
-**Make OpenClaw's native memory system sharper without replacing it.**
 OpenClaw Reflection is an additive layer on top of OpenClaw's built-in Markdown memory system. It captures message flow, keeps thread noise out of long-term memory, writes durable knowledge into the same human-readable memory files OpenClaw already uses, and periodically consolidates them so your agent gets sharper over time instead of messier.
 ## Current Scope
@@ -103,6 +107,13 @@ Put the following under `plugins.entries.openclaw-reflection` in your OpenClaw c
 Once the gateway restarts, Reflection will begin listening to `message_received` and `before_message_write`, then writing curated memory files into your configured `workspaceDir`.
+### Observability command
+- Reflection now writes an independent write_guardian audit log to:
+  - `<workspaceDir>/.openclaw-reflection/write-guardian.log.jsonl`
+- Register command: `/openclaw-reflection`
+  - Returns the most recent 10 write_guardian behaviors (written/refused/failed/skipped), including decision, target file, and reason.
 ## What You Get
 | You want                             | Reflection gives you                                       |
@@ -114,15 +125,7 @@ Once the gateway restarts, Reflection will begin listening to `message_received`
 ## How It Works
-```mermaid
-flowchart LR
-  A["Incoming conversation"] --> B["Session buffer"]
-  B --> C["memory_gate"]
-  C -->|durable fact| D["write_guardian"]
-  C -->|thread noise| E["No write"]
-  D --> F["MEMORY.md / USER.md / SOUL.md / IDENTITY.md / TOOLS.md"]
-  F --> G["Scheduled consolidation"]
-```
+![OpenClaw Reflection flowchart](./assets/memory-flowchart.png)
 In practice, the pipeline is simple:
@@ -200,10 +203,43 @@ pnpm run typecheck
 pnpm run eval:memory-gate
 pnpm run eval:write-guardian
 pnpm run eval:all
+node evals/run.mjs \
+  --suite memory-gate \
+  --models-config evals/models.json \
+  --baseline grok-fast \
+  --output evals/results/$(date +%F)-memory-gate-matrix.json \
+  --markdown-output evals/results/$(date +%F)-memory-gate-matrix.md
 ```
+`evals/models.json` defines only the comparison matrix. The shared provider endpoint and key still come from `EVAL_BASE_URL` and `EVAL_API_KEY`. JSON output is the source of truth for automation and history, while the Markdown artifact is the readable leaderboard summary.
 More eval details: [evals/README.md](./evals/README.md)
+## Model Selection
+Benchmark date: `2026-03-09`
+Scope: `memory_gate` only, `18` cases, shared OpenRouter-compatible `EVAL_*` route
+| Model | Pass/Total | Accuracy | Errors (P/S/E) | Recommendation | Best For |
+| --- | --- | --- | --- | --- | --- |
+| `x-ai/grok-4.1-fast` | `17/18` | `94.4%` | `0/0/0` | Default baseline | Daily eval baseline |
+| `qwen/qwen3.5-flash-02-23` | `17/18` | `94.4%` | `0/1/0` | Good backup option | Cost-sensitive cross-checks |
+| `google/gemini-2.5-flash-lite` | `16/18` | `88.9%` | `0/0/0` | Fast iteration candidate | Cheap prompt iteration |
+| `inception/mercury-2` | `11/18` | `61.1%` | `0/0/0` | Not recommended as default | Exploratory comparisons only |
+| `minimax/minimax-m2.5` | `9/18` | `50.0%` | `0/0/0` | Not recommended as default | Occasional sanity checks only |
+| `openai/gpt-4o-mini` | `4/18` | `22.2%` | `18/0/0` | Not recommended on current route | Avoid on current OpenRouter path |
+How to choose:
+- Default to `x-ai/grok-4.1-fast` because it had the best overall stability in this round with no internal errors.
+- Use `qwen/qwen3.5-flash-02-23` as the strongest backup when you want similar accuracy but can tolerate one schema failure in this benchmark.
+- Use `google/gemini-2.5-flash-lite` for cheaper, faster prompt iteration when slightly lower boundary accuracy is acceptable.
+- Avoid `inception/mercury-2` and `minimax/minimax-m2.5` as defaults because they frequently collapse `SOUL`, `IDENTITY`, or `NO_WRITE` boundaries into the wrong bucket.
+- Avoid `openai/gpt-4o-mini` on the current OpenRouter/Azure-backed route because all `18` cases surfaced provider-side structured-output errors.
+Source artifact: [2026-03-09-memory-gate-openrouter-model-benchmark.md](./evals/results/2026-03-09-memory-gate-openrouter-model-benchmark.md)
 ## Links
 - OpenClaw plugin docs: [docs.openclaw.ai/tools/plugin](https://docs.openclaw.ai/tools/plugin)
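
The Accuracy column in the Model Selection table added above is simply Pass/Total expressed as a percentage with one decimal place. The following is a minimal sketch of that arithmetic, not code from the package; the helper name `accuracyPercent` is hypothetical, and the pass counts are copied from the table.

```typescript
// Pass counts copied from the Model Selection table; every model ran 18 cases.
const TOTAL_CASES = 18;
const tablePassCounts: Array<{ modelId: string; passed: number }> = [
  { modelId: "x-ai/grok-4.1-fast", passed: 17 },
  { modelId: "qwen/qwen3.5-flash-02-23", passed: 17 },
  { modelId: "google/gemini-2.5-flash-lite", passed: 16 },
  { modelId: "inception/mercury-2", passed: 11 },
  { modelId: "minimax/minimax-m2.5", passed: 9 },
  { modelId: "openai/gpt-4o-mini", passed: 4 },
];

// Hypothetical helper: accuracy as a percentage string with one decimal place.
function accuracyPercent(passed: number, total: number): string {
  return `${((passed / total) * 100).toFixed(1)}%`;
}

for (const row of tablePassCounts) {
  console.log(
    `${row.modelId}: ${row.passed}/${TOTAL_CASES} = ${accuracyPercent(row.passed, TOTAL_CASES)}`
  );
}
```

Running it reproduces the six Accuracy values in the table, which is a quick way to re-check the column if the case count changes.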

package/README.zh-CN.md CHANGED

@@ -1,5 +1,11 @@
 # OpenClaw Reflection
+<p align="center">
+  <img src="./assets/openclaw-reflection-logo.png" alt="OpenClaw Reflection logo" width="180" />
+</p>
+<p align="center"><strong>Keep Markdown memory cleaner, more stable, and more sustainable without replacing OpenClaw's native memory system.</strong></p>
 English version: [README.md](./README.md)
 ![OpenClaw Plugin](https://img.shields.io/badge/OpenClaw-Plugin-111111?style=flat-square)
@@ -7,8 +13,6 @@
 ![memory_gate 18 cases](https://img.shields.io/badge/memory_gate-18%20benchmark%20cases-2ea043?style=flat-square)
 ![write_guardian 14 cases](https://img.shields.io/badge/write_guardian-14%20benchmark%20cases-2ea043?style=flat-square)
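-**Keep Markdown memory cleaner, more stable, and more sustainable without replacing OpenClaw's native memory system.**
 OpenClaw Reflection is an enhancement plugin layered on top of OpenClaw's native Markdown memory. It listens to the message flow, filters out thread noise, writes information with real long-term value back into OpenClaw's core memory files, and periodically tidies those files so that memory does not get messier with long-term use.
 ## Currently Supported Scope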
@@ -98,6 +102,13 @@ openclaw plugins install @parkgogogo/openclaw-reflection
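 After the Gateway restarts, Reflection starts listening to `message_received` and `before_message_write`, and writes the curated long-term information into your configured `workspaceDir`.
+### Observability command
+- Reflection now writes a separate audit log for write_guardian:
+  - `<workspaceDir>/.openclaw-reflection/write-guardian.log.jsonl`
+- Register command: `/openclaw-reflection`
+  - Returns the 10 most recent write_guardian behaviors (written/refused/failed/skipped), including the decision, target file, and reason.
 ## What You Get
 | Capability you want             | What Reflection delivers                          |
@@ -109,15 +120,7 @@ After the Gateway restarts, Reflection starts listening to `message_received` and `before
 ## How It Works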
-```mermaid
-flowchart LR
-  A["Incoming conversation"] --> B["Session buffer"]
-  B --> C["memory_gate"]
-  C -->|durable fact| D["write_guardian"]
-  C -->|thread noise| E["No write"]
-  D --> F["MEMORY.md / USER.md / SOUL.md / IDENTITY.md / TOOLS.md"]
-  F --> G["Scheduled consolidation"]
-```
+![OpenClaw Reflection flowchart](./assets/memory-flowchart.png)
 The flow is straightforward:
@@ -174,10 +177,43 @@ pnpm run typecheck
 pnpm run eval:memory-gate
 pnpm run eval:write-guardian
 pnpm run eval:all
+node evals/run.mjs \
+  --suite memory-gate \
+  --models-config evals/models.json \
+  --baseline grok-fast \
+  --output evals/results/$(date +%F)-memory-gate-matrix.json \
+  --markdown-output evals/results/$(date +%F)-memory-gate-matrix.md
 ```
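+`evals/models.json` only defines the multi-model comparison matrix; the shared provider endpoint and key still come from `EVAL_BASE_URL` and `EVAL_API_KEY`. The JSON output is the baseline for later automation and history tracking, while the Markdown output is the human-readable leaderboard summary.
 For more eval details, see [evals/README.md](./evals/README.md).
+## Model Selection
+Benchmark date: `2026-03-09`
+Scope: `memory_gate` only, `18` cases in total, shared OpenRouter-compatible `EVAL_*` route
+| Model | Pass/Total | Accuracy | Errors (P/S/E) | Recommendation | Best For |
+| --- | --- | --- | --- | --- | --- |
+| `x-ai/grok-4.1-fast` | `17/18` | `94.4%` | `0/0/0` | Default baseline | Daily eval baseline |
+| `qwen/qwen3.5-flash-02-23` | `17/18` | `94.4%` | `0/1/0` | Excellent backup | Cost-sensitive cross-checks |
+| `google/gemini-2.5-flash-lite` | `16/18` | `88.9%` | `0/0/0` | Cheap, fast candidate | Low-cost prompt iteration |
+| `inception/mercury-2` | `11/18` | `61.1%` | `0/0/0` | Not recommended as default | Exploratory comparisons only |
+| `minimax/minimax-m2.5` | `9/18` | `50.0%` | `0/0/0` | Not recommended as default | Occasional sanity checks |
+| `openai/gpt-4o-mini` | `4/18` | `22.2%` | `18/0/0` | Not recommended on the current route | Avoid on the current OpenRouter path |
+How to choose:
+- Default to `x-ai/grok-4.1-fast`, because in this round it had the best overall stability and no internal errors.
+- If you want similar accuracy and can accept one schema failure, use `qwen/qwen3.5-flash-02-23` as the strongest backup.
+- If low cost and fast iteration matter more, use `google/gemini-2.5-flash-lite`, but accept that it is slightly weaker on some `TOOLS` boundaries.
+- Do not use `inception/mercury-2` or `minimax/minimax-m2.5` as the default baseline, because they often classify `SOUL`, `IDENTITY`, or `NO_WRITE` into the wrong category.
+- Do not choose `openai/gpt-4o-mini` on the current OpenRouter/Azure route, because all `18` cases triggered provider-side structured-output errors.
+Source results: [2026-03-09-memory-gate-openrouter-model-benchmark.md](./evals/results/2026-03-09-memory-gate-openrouter-model-benchmark.md)
 ## Links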
 - OpenClaw plugin docs: [docs.openclaw.ai/tools/plugin](https://docs.openclaw.ai/tools/plugin)

package/assets/memory-flowchart.png ADDED

Binary file

package/assets/openclaw-reflection-logo.png ADDED

Binary file

package/package.json CHANGED

@@ -1,10 +1,11 @@
 {
   "name": "@parkgogogo/openclaw-reflection",
-  "version": "0.1.1",
+  "version": "0.1.3",
   "description": "OpenClaw plugin that enhances native Markdown memory with filtering, curation, and consolidation",
   "type": "module",
   "main": "src/index.ts",
   "files": [
+    "assets/",
     "src/",
     "openclaw.plugin.json",
     "README.md",
@@ -20,8 +21,9 @@
     "url": "https://github.com/parkgogogo/openclaw-reflection/issues"
   },
   "scripts": {
-    "build": "tsc --noEmit",
+    "build": "tsc -p tsconfig.json",
     "clean": "rm -rf logs",
+    "test": "pnpm run build && node --test tests/*.test.mjs",
     "typecheck": "tsc --noEmit",
     "e2e:openclaw-plugin": "bash scripts/e2e-openclaw-plugin.sh",
     "eval:memory-gate": "pnpm exec tsc && node evals/run.mjs --suite memory-gate",

package/src/evals/cli.ts CHANGED

@@ -7,6 +7,11 @@ export interface EvalCliOptions {
   sharedDatasetPath?: string;
   memoryGateDatasetPath?: string;
   writeGuardianDatasetPath?: string;
+  modelsConfigPath?: string;
+  models?: string[];
+  baselineModelId?: string;
+  outputPath?: string;
+  markdownOutputPath?: string;
 }
 function getArgValue(argv: string[], flag: string): string | undefined {
@@ -34,6 +39,11 @@ function parseSuite(value: string | undefined): EvalSuite {
 }
 export function parseEvalCliOptions(argv: string[]): EvalCliOptions {
+  const models = getArgValue(argv, "--models")
+    ?.split(",")
+    .map((modelId) => modelId.trim())
+    .filter((modelId) => modelId !== "");
   return {
     suite: parseSuite(getArgValue(argv, "--suite")),
     useJudge: !argv.includes("--no-judge"),
@@ -41,5 +51,10 @@ export function parseEvalCliOptions(argv: string[]): EvalCliOptions {
     sharedDatasetPath: getArgValue(argv, "--shared-dataset"),
     memoryGateDatasetPath: getArgValue(argv, "--memory-gate-dataset"),
     writeGuardianDatasetPath: getArgValue(argv, "--write-guardian-dataset"),
+    modelsConfigPath: getArgValue(argv, "--models-config"),
+    models,
+    baselineModelId: getArgValue(argv, "--baseline"),
+    outputPath: getArgValue(argv, "--output"),
+    markdownOutputPath: getArgValue(argv, "--markdown-output"),
   };
 }
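
The new `--models` handling above splits a comma-separated value, trims each id, and drops empty entries. The sketch below isolates that step so its edge cases are easy to see. The body of `getArgValue` is not shown in this diff, so `readFlagValue` is a hypothetical stand-in that assumes the `--flag value` form; `qwen-flash` is likewise a made-up profile id.

```typescript
// Hypothetical stand-in for getArgValue: returns the token that follows `flag`, if any.
function readFlagValue(argv: string[], flag: string): string | undefined {
  const index = argv.indexOf(flag);
  if (index === -1 || index + 1 >= argv.length) {
    return undefined;
  }
  return argv[index + 1];
}

// Mirrors the added `--models` handling: split on commas, trim, drop empty ids.
function parseModelIds(argv: string[]): string[] | undefined {
  return readFlagValue(argv, "--models")
    ?.split(",")
    .map((modelId) => modelId.trim())
    .filter((modelId) => modelId !== "");
}

const exampleArgv = [
  "--suite", "memory-gate",
  "--models", "grok-fast, qwen-flash,,",
  "--baseline", "grok-fast",
];

console.log(parseModelIds(exampleArgv)); // stray spaces and trailing commas are ignored
console.log(parseModelIds(["--suite", "memory-gate"])); // flag absent
```

When the flag is absent the result is `undefined` rather than an empty array, which is why `models` is declared optional in `EvalCliOptions`.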

package/src/evals/comparison.ts ADDED

@@ -0,0 +1,248 @@
+import type {
+  MemoryGateCaseResult,
+  SingleModelRunReport,
+  WriteGuardianCaseResult,
+} from "./runner.js";
+import type { EvalSuite } from "./cli.js";
+export interface RankedModelReport {
+  modelId: string;
+  passed: number;
+  total: number;
+  errorCounts?: SingleModelRunReport["summary"]["errorCounts"];
+}
+export interface BaselineDiff {
+  modelId: string;
+  regressedCases: string[];
+  improvedCases: string[];
+  disagreementCases: string[];
+}
+export interface HardestCase {
+  scenarioId: string;
+  failedBy: string[];
+}
+export interface DisagreementCase {
+  scenarioId: string;
+  modelIds: string[];
+}
+export interface MultiModelComparisonReport {
+  runId: string;
+  timestamp: string;
+  suite: EvalSuite;
+  baselineModelId?: string;
+  models: SingleModelRunReport[];
+  comparison: {
+    ranking: RankedModelReport[];
+    baselineDiffs: BaselineDiff[];
+    hardestCases: HardestCase[];
+    disagreementCases: DisagreementCase[];
+  };
+}
+type EvalCaseResult = MemoryGateCaseResult | WriteGuardianCaseResult;
+function getScenarioId(result: EvalCaseResult): string {
+  return result.scenarioId;
+}
+function getTotalErrors(report: SingleModelRunReport): number {
+  const errorCounts = report.summary.errorCounts;
+  if (!errorCounts) {
+    return 0;
+  }
+  return (
+    errorCounts.provider_error +
+    errorCounts.schema_error +
+    errorCounts.execution_error
+  );
+}
+function getCaseSignature(result: EvalCaseResult): string {
+  if ("actualDecision" in result) {
+    return JSON.stringify({
+      pass: result.pass,
+      actualDecision: result.actualDecision,
+      decisionPass: result.decisionPass,
+      candidatePass: result.candidatePass,
+      errorType: result.errorType,
+    });
+  }
+  return JSON.stringify({
+    pass: result.pass,
+    actualShouldWrite: result.actualShouldWrite,
+    toolTrace: result.actualToolTrace,
+  });
+}
+function buildResultMap(report: SingleModelRunReport): Map<string, EvalCaseResult> {
+  return new Map(
+    report.results.map((result) => [getScenarioId(result as EvalCaseResult), result as EvalCaseResult])
+  );
+}
+export function rankModelReports(
+  reports: SingleModelRunReport[]
+): RankedModelReport[] {
+  return reports
+    .map((report) => ({
+      modelId: report.modelId,
+      passed: report.summary.passed,
+      total: report.summary.total,
+      errorCounts: report.summary.errorCounts,
+      totalErrors: getTotalErrors(report),
+    }))
+    .sort((left, right) => {
+      if (right.passed !== left.passed) {
+        return right.passed - left.passed;
+      }
+      if (left.totalErrors !== right.totalErrors) {
+        return left.totalErrors - right.totalErrors;
+      }
+      return left.modelId.localeCompare(right.modelId);
+    })
+    .map(({ totalErrors: _totalErrors, ...report }) => report);
+}
+export function buildBaselineDiffs(
+  reports: SingleModelRunReport[],
+  baselineModelId: string
+): BaselineDiff[] {
+  const baselineReport = reports.find((report) => report.modelId === baselineModelId);
+  if (!baselineReport) {
+    throw new Error(`Missing baseline model: ${baselineModelId}`);
+  }
+  const baselineResults = buildResultMap(baselineReport);
+  return reports
+    .filter((report) => report.modelId !== baselineModelId)
+    .map((report) => {
+      const reportResults = buildResultMap(report);
+      const regressedCases: string[] = [];
+      const improvedCases: string[] = [];
+      const disagreementCases: string[] = [];
+      for (const [scenarioId, baselineResult] of baselineResults.entries()) {
+        const candidateResult = reportResults.get(scenarioId);
+        if (!candidateResult) {
+          continue;
+        }
+        if (baselineResult.pass && !candidateResult.pass) {
+          regressedCases.push(scenarioId);
+        }
+        if (!baselineResult.pass && candidateResult.pass) {
+          improvedCases.push(scenarioId);
+        }
+        if (getCaseSignature(baselineResult) !== getCaseSignature(candidateResult)) {
+          disagreementCases.push(scenarioId);
+        }
+      }
+      return {
+        modelId: report.modelId,
+        regressedCases,
+        improvedCases,
+        disagreementCases,
+      };
+    });
+}
+export function findHardestCases(
+  reports: SingleModelRunReport[]
+): HardestCase[] {
+  const failedByScenario = new Map<string, string[]>();
+  for (const report of reports) {
+    for (const result of report.results) {
+      const caseResult = result as EvalCaseResult;
+      if (caseResult.pass) {
+        continue;
+      }
+      const scenarioId = getScenarioId(caseResult);
+      const failedBy = failedByScenario.get(scenarioId) ?? [];
+      failedBy.push(report.modelId);
+      failedByScenario.set(scenarioId, failedBy);
+    }
+  }
+  return [...failedByScenario.entries()]
+    .map(([scenarioId, failedBy]) => ({
+      scenarioId,
+      failedBy,
+    }))
+    .sort((left, right) => {
+      if (right.failedBy.length !== left.failedBy.length) {
+        return right.failedBy.length - left.failedBy.length;
+      }
+      return left.scenarioId.localeCompare(right.scenarioId);
+    });
+}
+export function findDisagreementCases(
+  reports: SingleModelRunReport[]
+): DisagreementCase[] {
+  const cases = new Map<string, Array<{ modelId: string; signature: string }>>();
+  for (const report of reports) {
+    for (const result of report.results) {
+      const caseResult = result as EvalCaseResult;
+      const scenarioId = getScenarioId(caseResult);
+      const entries = cases.get(scenarioId) ?? [];
+      entries.push({
+        modelId: report.modelId,
+        signature: getCaseSignature(caseResult),
+      });
+      cases.set(scenarioId, entries);
+    }
+  }
+  return [...cases.entries()]
+    .filter(([, entries]) => new Set(entries.map((entry) => entry.signature)).size > 1)
+    .map(([scenarioId, entries]) => ({
+      scenarioId,
+      modelIds: entries.map((entry) => entry.modelId),
+    }))
+    .sort((left, right) => left.scenarioId.localeCompare(right.scenarioId));
+}
+export function buildMultiModelComparisonReport(input: {
+  suite: EvalSuite;
+  modelReports: SingleModelRunReport[];
+  baselineModelId?: string;
+  timestamp?: string;
+  runId?: string;
+}): MultiModelComparisonReport {
+  const timestamp = input.timestamp ?? new Date().toISOString();
+  const baselineModelId =
+    input.baselineModelId ??
+    (input.modelReports.length > 0 ? input.modelReports[0].modelId : undefined);
+  return {
+    runId: input.runId ?? `${input.suite}-${timestamp}`,
+    timestamp,
+    suite: input.suite,
+    baselineModelId,
+    models: input.modelReports,
+    comparison: {
+      ranking: rankModelReports(input.modelReports),
+      baselineDiffs: baselineModelId
+        ? buildBaselineDiffs(input.modelReports, baselineModelId)
+        : [],
+      hardestCases: findHardestCases(input.modelReports),
+      disagreementCases: findDisagreementCases(input.modelReports),
+    },
+  };
+}
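
`rankModelReports` above orders models by three keys: more passes first, then fewer total errors, then `modelId` alphabetically. The following self-contained sketch applies the same comparator to the figures from the README's Model Selection table. The `RankInput` type and `rankByPassThenErrors` name are illustrative, not part of the package.

```typescript
interface RankInput {
  modelId: string;
  passed: number;
  totalErrors: number; // provider + schema + execution errors combined
}

// Same three-level ordering as rankModelReports: passes desc, errors asc, id asc.
function rankByPassThenErrors(rows: RankInput[]): string[] {
  return [...rows]
    .sort((left, right) => {
      if (right.passed !== left.passed) {
        return right.passed - left.passed;
      }
      if (left.totalErrors !== right.totalErrors) {
        return left.totalErrors - right.totalErrors;
      }
      return left.modelId.localeCompare(right.modelId);
    })
    .map((row) => row.modelId);
}

// Figures taken from the README table, deliberately listed out of order.
const benchmarkRows: RankInput[] = [
  { modelId: "openai/gpt-4o-mini", passed: 4, totalErrors: 18 },
  { modelId: "qwen/qwen3.5-flash-02-23", passed: 17, totalErrors: 1 },
  { modelId: "minimax/minimax-m2.5", passed: 9, totalErrors: 0 },
  { modelId: "x-ai/grok-4.1-fast", passed: 17, totalErrors: 0 },
  { modelId: "inception/mercury-2", passed: 11, totalErrors: 0 },
  { modelId: "google/gemini-2.5-flash-lite", passed: 16, totalErrors: 0 },
];

const rankedIds = rankByPassThenErrors(benchmarkRows);
console.log(rankedIds.join("\n"));
```

The error tie-break is what separates the two `17/18` models: `x-ai/grok-4.1-fast` ranks above `qwen/qwen3.5-flash-02-23` because the latter recorded one schema error.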

package/src/evals/models.ts ADDED

@@ -0,0 +1,125 @@
+import { readFile } from "node:fs/promises";
+export interface EvalModelProfile {
+  id: string;
+  label: string;
+  model: string;
+  enabled: boolean;
+  tags?: string[];
+}
+export interface ResolvedEvalModelProfile extends EvalModelProfile {
+  baseURL: string;
+  apiKey: string;
+}
+interface LoadEvalModelProfilesInput {
+  configPath: string;
+  selectedModelIds?: string[];
+  env?: NodeJS.ProcessEnv;
+}
+function isRecord(value: unknown): value is Record<string, unknown> {
+  return typeof value === "object" && value !== null && !Array.isArray(value);
+}
+function parseEvalModelProfile(value: unknown): EvalModelProfile {
+  if (!isRecord(value)) {
+    throw new Error("Eval model profile must be an object");
+  }
+  const {
+    id,
+    label,
+    model,
+    enabled,
+    tags,
+  } = value;
+  if (typeof id !== "string" || id.trim() === "") {
+    throw new Error("Eval model profile id must be a non-empty string");
+  }
+  if (typeof label !== "string" || label.trim() === "") {
+    throw new Error(`Eval model profile ${id} label must be a non-empty string`);
+  }
+  if (typeof model !== "string" || model.trim() === "") {
+    throw new Error(`Eval model profile ${id} model must be a non-empty string`);
+  }
+  if (typeof enabled !== "boolean") {
+    throw new Error(`Eval model profile ${id} enabled must be a boolean`);
+  }
+  if (
+    tags !== undefined &&
+    (!Array.isArray(tags) || tags.some((tag) => typeof tag !== "string"))
+  ) {
+    throw new Error(`Eval model profile ${id} tags must be a string array`);
+  }
+  return {
+    id,
+    label,
+    model,
+    enabled,
+    tags,
+  };
+}
+function parseEvalModelConfig(content: string): EvalModelProfile[] {
+  const parsed: unknown = JSON.parse(content);
+  if (!isRecord(parsed) || !Array.isArray(parsed.profiles)) {
+    throw new Error("Eval model config must contain a profiles array");
+  }
+  return parsed.profiles.map((profile) => parseEvalModelProfile(profile));
+}
+export async function loadEvalModelProfiles(
+  input: LoadEvalModelProfilesInput
+): Promise<ResolvedEvalModelProfile[]> {
+  const env = input.env ?? process.env;
+  const baseURL = env.EVAL_BASE_URL;
+  const apiKey = env.EVAL_API_KEY;
+  if (
+    typeof baseURL !== "string" ||
+    baseURL.trim() === "" ||
+    typeof apiKey !== "string" ||
+    apiKey.trim() === ""
+  ) {
+    throw new Error(
+      "Missing required env vars for model comparison: EVAL_BASE_URL, EVAL_API_KEY"
+    );
+  }
+  const profiles = parseEvalModelConfig(await readFile(input.configPath, "utf8"));
+  const enabledProfiles = profiles.filter((profile) => profile.enabled);
+  if (enabledProfiles.length === 0) {
+    throw new Error("Eval model config has no enabled profiles");
+  }
+  const selectedModelIds =
+    input.selectedModelIds?.filter((modelId) => modelId.trim() !== "") ?? [];
+  const filteredProfiles =
+    selectedModelIds.length === 0
+      ? enabledProfiles
+      : selectedModelIds.map((modelId) => {
+          const profile = enabledProfiles.find((candidate) => candidate.id === modelId);
+          if (!profile) {
+            throw new Error(`Unknown model ids: ${modelId}`);
+          }
+          return profile;
+        });
+  return filteredProfiles.map((profile) => ({
+    ...profile,
+    baseURL,
+    apiKey,
+  }));
+}
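
The loader above expects a JSON file with a top-level `profiles` array, keeps only `enabled` entries, and then resolves any explicitly selected ids against that enabled set. The sketch below shows one plausible config and a simplified version of the selection step. Only the field names and the `grok-fast` id (used as `--baseline` in the README) come from the source; the labels, the other ids, and the tag are invented for illustration, and the per-field validation and environment checks are omitted.

```typescript
interface SampleProfile {
  id: string;
  label: string;
  model: string;
  enabled: boolean;
  tags?: string[];
}

// Hypothetical contents for evals/models.json; labels, tags, and two of the ids are invented.
const sampleConfigText = JSON.stringify({
  profiles: [
    { id: "grok-fast", label: "Grok 4.1 Fast", model: "x-ai/grok-4.1-fast", enabled: true, tags: ["baseline"] },
    { id: "gemini-lite", label: "Gemini 2.5 Flash Lite", model: "google/gemini-2.5-flash-lite", enabled: true },
    { id: "gpt-4o-mini", label: "GPT-4o mini", model: "openai/gpt-4o-mini", enabled: false },
  ],
});

// Simplified selection: keep enabled profiles, then resolve requested ids in the order given.
function selectProfiles(configText: string, selectedIds: string[] = []): SampleProfile[] {
  const parsed = JSON.parse(configText) as { profiles: SampleProfile[] };
  const enabledProfiles = parsed.profiles.filter((profile) => profile.enabled);
  if (selectedIds.length === 0) {
    return enabledProfiles;
  }
  return selectedIds.map((modelId) => {
    const profile = enabledProfiles.find((candidate) => candidate.id === modelId);
    if (!profile) {
      throw new Error(`Unknown model ids: ${modelId}`);
    }
    return profile;
  });
}

console.log(selectProfiles(sampleConfigText).map((profile) => profile.id));
console.log(selectProfiles(sampleConfigText, ["gemini-lite"]).map((profile) => profile.model));
```

One consequence of filtering before lookup is that asking for a disabled profile fails with the same "Unknown model ids" error as a misspelled id.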

package/src/evals/reporting.ts ADDED

@@ -0,0 +1,123 @@
+import { mkdir, writeFile } from "node:fs/promises";
+import path from "node:path";
+import type { MultiModelComparisonReport } from "./comparison.js";
+function formatErrorCounts(
+  errorCounts: MultiModelComparisonReport["models"][number]["summary"]["errorCounts"]
+): string {
+  if (!errorCounts) {
+    return "0/0/0";
+  }
+  return `${errorCounts.provider_error}/${errorCounts.schema_error}/${errorCounts.execution_error}`;
+}
+export function renderComparisonMarkdown(
+  report: MultiModelComparisonReport
+): string {
+  const lines = [
+    "# Eval Comparison Report",
+    "",
+    `- Run ID: ${report.runId}`,
+    `- Timestamp: ${report.timestamp}`,
+    `- Suite: ${report.suite}`,
+  ];
+  if (report.baselineModelId) {
+    lines.push(`- Baseline: ${report.baselineModelId}`);
+  }
+  lines.push(
+    "",
+    "## Leaderboard",
+    "",
+    "| Model | Passed | Total | Errors (provider/schema/execution) |",
+    "| --- | --- | --- | --- |"
+  );
+  for (const entry of report.comparison.ranking) {
+    lines.push(
+      `| ${entry.modelId} | ${entry.passed} | ${entry.total} | ${formatErrorCounts(
+        entry.errorCounts
+      )} |`
+    );
+  }
+  lines.push("", "## Baseline Diffs", "");
+  if (report.comparison.baselineDiffs.length === 0) {
+    lines.push("No baseline diffs.");
+  } else {
+    for (const diff of report.comparison.baselineDiffs) {
+      lines.push(`### ${diff.modelId}`);
+      lines.push(`- Regressed: ${diff.regressedCases.join(", ") || "(none)"}`);
+      lines.push(`- Improved: ${diff.improvedCases.join(", ") || "(none)"}`);
+      lines.push(
+        `- Disagreements: ${diff.disagreementCases.join(", ") || "(none)"}`
+      );
+      lines.push("");
+    }
+  }
+  lines.push("## Hardest Cases", "");
+  if (report.comparison.hardestCases.length === 0) {
+    lines.push("No failed cases.");
+  } else {
+    for (const hardestCase of report.comparison.hardestCases) {
+      lines.push(
+        `- ${hardestCase.scenarioId}: ${hardestCase.failedBy.join(", ")}`
+      );
+    }
+  }
+  lines.push("", "## Disagreement Cases", "");
+  if (report.comparison.disagreementCases.length === 0) {
+    lines.push("No disagreement cases.");
+  } else {
+    for (const disagreement of report.comparison.disagreementCases) {
+      lines.push(
+        `- ${disagreement.scenarioId}: ${disagreement.modelIds.join(", ")}`
+      );
+    }
+  }
+  return `${lines.join("\n")}\n`;
+}
+export async function writeComparisonReports(input: {
+  report: MultiModelComparisonReport;
+  outputPath?: string;
+  markdownOutputPath?: string;
+}): Promise<{
+  jsonWritten: boolean;
+  markdownWritten: boolean;
+  writtenPaths: string[];
+}> {
+  const writtenPaths: string[] = [];
+  if (input.outputPath) {
+    await mkdir(path.dirname(input.outputPath), { recursive: true });
+    await writeFile(
+      input.outputPath,
+      `${JSON.stringify(input.report, null, 2)}\n`,
+      "utf8"
+    );
+    writtenPaths.push(input.outputPath);
+  }
+  if (input.markdownOutputPath) {
+    await mkdir(path.dirname(input.markdownOutputPath), { recursive: true });
+    await writeFile(
+      input.markdownOutputPath,
+      renderComparisonMarkdown(input.report),
+      "utf8"
+    );
+    writtenPaths.push(input.markdownOutputPath);
+  }
+  return {
+    jsonWritten: Boolean(input.outputPath),
+    markdownWritten: Boolean(input.markdownOutputPath),
+    writtenPaths,
+  };
+}
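
Each leaderboard row produced by `renderComparisonMarkdown` combines the model id, pass count, total, and a `provider/schema/execution` error triple, with a missing `errorCounts` rendered as `0/0/0`. The sketch below reproduces just that row formatting in isolation. The names `formatCounts` and `renderLeaderboardRow` are illustrative, and the `qwen-flash` id is made up.

```typescript
interface ErrorCountsSketch {
  provider_error: number;
  schema_error: number;
  execution_error: number;
}

interface LeaderboardEntry {
  modelId: string;
  passed: number;
  total: number;
  errorCounts?: ErrorCountsSketch;
}

// Missing counts render as "0/0/0", matching formatErrorCounts in the diff.
function formatCounts(errorCounts?: ErrorCountsSketch): string {
  if (!errorCounts) {
    return "0/0/0";
  }
  return `${errorCounts.provider_error}/${errorCounts.schema_error}/${errorCounts.execution_error}`;
}

// One Markdown table row in the same column order as the generated leaderboard.
function renderLeaderboardRow(entry: LeaderboardEntry): string {
  return `| ${entry.modelId} | ${entry.passed} | ${entry.total} | ${formatCounts(entry.errorCounts)} |`;
}

console.log(renderLeaderboardRow({ modelId: "grok-fast", passed: 17, total: 18 }));
console.log(
  renderLeaderboardRow({
    modelId: "qwen-flash",
    passed: 17,
    total: 18,
    errorCounts: { provider_error: 0, schema_error: 1, execution_error: 0 },
  })
);
```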

package/src/evals/runner.ts CHANGED

@@ -5,6 +5,7 @@ import { mkdtemp, readFile, rm, writeFile } from "node:fs/promises";
 import { LLMService } from "../llm/service.js";
 import { MemoryGateAnalyzer } from "../memory-gate/analyzer.js";
 import { WriteGuardian } from "../write-guardian/index.js";
+import type { EvalSuite } from "./cli.js";
 import type {
   AgentStep,
   LLMService as LLMServiceContract,
@@ -59,6 +60,7 @@ export interface MemoryGateCaseResult {
   expectedDecision: MemoryGateOutput["decision"];
   actualCandidateFact?: string;
   expectedCandidateFact?: string;
+  errorType?: "provider_error" | "schema_error" | "execution_error";
   error?: string;
 }
@@ -77,6 +79,21 @@ export interface WriteGuardianCaseResult {
 export interface BenchmarkSummary {
   total: number;
   passed: number;
+  errorCounts?: {
+    provider_error: number;
+    schema_error: number;
+    execution_error: number;
+  };
+}
+export interface SingleModelRunReport {
+  modelId: string;
+  modelLabel: string;
+  suite: EvalSuite;
+  startedAt: string;
+  finishedAt: string;
+  summary: BenchmarkSummary;
+  results: MemoryGateCaseResult[] | WriteGuardianCaseResult[];
 }
 export interface Judge {
@@ -138,6 +155,38 @@ function normalizeFileContent(content: string): string {
   return normalized.endsWith("\n") ? normalized : `${normalized}\n`;
 }
+function createEmptyErrorCounts(): NonNullable<BenchmarkSummary["errorCounts"]> {
+  return {
+    provider_error: 0,
+    schema_error: 0,
+    execution_error: 0,
+  };
+}
+function classifyMemoryGateError(
+  message: string | undefined
+): MemoryGateCaseResult["errorType"] | undefined {
+  if (!message) {
+    return undefined;
+  }
+  if (message.includes("Provider request failed")) {
+    return "provider_error";
+  }
+  if (message.includes("Schema validation failed")) {
+    return "schema_error";
+  }
+  return undefined;
+}
+export function buildSingleModelRunReport(
+  input: SingleModelRunReport
+): SingleModelRunReport {
+  return { ...input };
+}
 export async function evaluateMemoryGateBenchmark(input: {
   scenarios: SharedScenario[];
   benchmarkCases: MemoryGateBenchmarkCase[];
@@ -148,6 +197,7 @@ export async function evaluateMemoryGateBenchmark(input: {
   const scenarioMap = buildScenarioMap(input.scenarios);
   const results: MemoryGateCaseResult[] = [];
   const logger = input.logger ?? createNoopLogger();
+  const errorCounts = createEmptyErrorCounts();
   for (const benchmarkCase of input.benchmarkCases) {
     const scenario = scenarioMap.get(benchmarkCase.scenario_id);
@@ -188,6 +238,10 @@ export async function evaluateMemoryGateBenchmark(input: {
       }
       const pass = decisionPass && candidatePass;
+      const errorType = classifyMemoryGateError(actual.reason);
+      if (errorType) {
+        errorCounts[errorType] += 1;
+      }
       results.push({
         scenarioId: benchmarkCase.scenario_id,
         pass,
@@ -198,6 +252,8 @@ export async function evaluateMemoryGateBenchmark(input: {
         expectedDecision: benchmarkCase.expected_decision,
         actualCandidateFact: actual.candidateFact,
         expectedCandidateFact: benchmarkCase.expected_candidate_fact,
+        errorType,
+        error: errorType ? actual.reason : undefined,
       });
       logger.info("EvalRunner", "Completed memory_gate case", {
         scenarioId: benchmarkCase.scenario_id,
@@ -206,9 +262,12 @@ export async function evaluateMemoryGateBenchmark(input: {
         candidatePass,
         judgeUsed,
         actualDecision: actual.decision,
+        errorType,
       });
     } catch (error) {
       const reason = getErrorMessage(error);
+      const errorType = classifyMemoryGateError(reason) ?? "execution_error";
+      errorCounts[errorType] += 1;
       results.push({
         scenarioId: benchmarkCase.scenario_id,
         pass: false,
@@ -218,11 +277,13 @@ export async function evaluateMemoryGateBenchmark(input: {
         actualDecision: "NO_WRITE",
         expectedDecision: benchmarkCase.expected_decision,
         expectedCandidateFact: benchmarkCase.expected_candidate_fact,
+        errorType,
         error: reason,
       });
       logger.error("EvalRunner", "memory_gate case failed", {
         scenarioId: benchmarkCase.scenario_id,
         reason,
+        errorType,
       });
     }
   }
@@ -231,6 +292,7 @@ export async function evaluateMemoryGateBenchmark(input: {
     summary: {
       total: results.length,
       passed: results.filter((result) => result.pass).length,
+      errorCounts,
     },
     results,
   };
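
The runner changes above classify errors by substring match on the message and treat the two code paths differently: a returned `reason` is counted only if it matches a known prefix, while a thrown error that matches nothing falls back to `execution_error`. The sketch below models that tallying with a simplified input shape. The `thrown` flag, the `tallyErrors` name, and the sample messages after each prefix are assumptions made for illustration.

```typescript
type EvalErrorType = "provider_error" | "schema_error" | "execution_error";

// Same substring checks as classifyMemoryGateError in the diff.
function classifyReason(message: string | undefined): EvalErrorType | undefined {
  if (!message) {
    return undefined;
  }
  if (message.includes("Provider request failed")) {
    return "provider_error";
  }
  if (message.includes("Schema validation failed")) {
    return "schema_error";
  }
  return undefined;
}

// Simplified case shape: `thrown` marks the catch-block path in the runner.
function tallyErrors(
  cases: Array<{ reason?: string; thrown: boolean }>
): Record<EvalErrorType, number> {
  const counts: Record<EvalErrorType, number> = {
    provider_error: 0,
    schema_error: 0,
    execution_error: 0,
  };
  for (const evalCase of cases) {
    const classified = classifyReason(evalCase.reason);
    // Only thrown errors fall back to execution_error; unmatched reasons are not errors.
    const errorType = evalCase.thrown ? (classified ?? "execution_error") : classified;
    if (errorType) {
      counts[errorType] += 1;
    }
  }
  return counts;
}

const sampleCounts = tallyErrors([
  { reason: "Provider request failed: upstream timeout", thrown: false },
  { reason: "Schema validation failed: missing decision", thrown: true },
  { reason: "unexpected failure while reading dataset", thrown: true },
  { reason: "Stable user preference worth keeping", thrown: false },
]);
console.log(sampleCounts);
```

Because classification depends on exact message substrings, any change to the wording of upstream error messages would silently move cases into the `execution_error` bucket or out of the counts entirely.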

package/src/index.ts CHANGED

@@ -12,6 +12,10 @@ import {
   MemoryGateAnalyzer,
 } from "./memory-gate/index.js";
 import { WriteGuardian } from "./write-guardian/index.js";
+import {
+  WriteGuardianAuditLog,
+  type WriteGuardianAuditEntry,
+} from "./write-guardian/audit-log.js";
 import {
   handleBeforeMessageWrite,
   handleMessageReceived,
@@ -47,6 +51,10 @@ export interface PluginAPI {
     handler: (event: unknown, context?: unknown) => void,
     options?: { priority?: number }
   ) => void;
+  registerCommand?: (
+    command: string,
+    handler: (args?: string) => string | Promise<string>
+  ) => void;
 }
 let bufferManager: SessionBufferManager | null = null;
@@ -54,6 +62,54 @@ let gatewayLogger: PluginLogger | null = null;
 let fileLogger: FileLogger | null = null;
 let isRegistered = false;
+function formatWriteGuardianAudit(entries: WriteGuardianAuditEntry[]): string {
+  if (entries.length === 0) {
+    return "No write_guardian records found.";
+  }
+  const lines = entries.map((entry, index) => {
+    const summary = [
+      `${index + 1}. [${entry.timestamp}] ${entry.status}`,
+      `decision=${entry.decision}`,
+      entry.targetFile ? `file=${entry.targetFile}` : undefined,
+      entry.reason ? `reason=${entry.reason}` : undefined,
+      entry.candidateFact ? `fact=${entry.candidateFact}` : undefined,
+    ]
+      .filter((part): part is string => Boolean(part))
+      .join(" | ");
+    return summary;
+  });
+  return lines.join("\n");
+}
+function registerReflectionCommand(
+  api: PluginAPI,
+  logger: FileLogger,
+  auditLog?: WriteGuardianAuditLog
+): void {
+  if (typeof api.registerCommand !== "function") {
+    logger.info("PluginLifecycle", "registerCommand unavailable, skip command registration", {
+      command: "/openclaw-reflection",
+    });
+    return;
+  }
+  api.registerCommand("/openclaw-reflection", async () => {
+    if (!auditLog) {
+      return "write_guardian audit log unavailable: workspace is not configured.";
+    }
+    const entries = await auditLog.readRecent(10);
+    return formatWriteGuardianAudit(entries);
+  });
+  logger.info("PluginLifecycle", "Registered plugin command", {
+    command: "/openclaw-reflection",
+  });
+}
 function getErrorMessage(error: unknown): string {
   if (error instanceof Error) {
     return error.message;
@@ -206,6 +262,7 @@ export default function activate(api: PluginAPI): void {
     let memoryGate: MemoryGateAnalyzer | undefined;
     let writeGuardian: WriteGuardian | undefined;
+    let writeGuardianAuditLog: WriteGuardianAuditLog | undefined;
     if (config.memoryGate.enabled && llmService) {
       memoryGate = new MemoryGateAnalyzer(llmService, logger);
@@ -217,7 +274,13 @@ export default function activate(api: PluginAPI): void {
     }
     if (llmService && workspaceDir) {
-      writeGuardian = new WriteGuardian({ workspaceDir }, logger, llmService);
+      writeGuardianAuditLog = new WriteGuardianAuditLog(workspaceDir);
+      writeGuardian = new WriteGuardian(
+        { workspaceDir },
+        logger,
+        llmService,
+        writeGuardianAuditLog
+      );
       logger.info("PluginLifecycle", "write_guardian initialized", {
         workspaceDir,
       });
@@ -303,6 +366,8 @@ export default function activate(api: PluginAPI): void {
       }
     );
+    registerReflectionCommand(api, logger, writeGuardianAuditLog);
     gatewayLogger.info("[Reflection] Message hooks registered");
     logger.info("PluginLifecycle", "Message hooks registered");
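
To make the new `/openclaw-reflection` command easier to picture, here is a small self-contained sketch of the formatter applied to sample entries and registered on a stub host. The formatter body repeats the logic from the diff above. The stub `stubApi`, the `commands` map, the file name `MEMORY.md`, the decision `NO_WRITE`, and the sample facts are assumptions made for illustration.

```typescript
interface WriteGuardianAuditEntry {
  timestamp: string;
  decision: string;
  targetFile?: string;
  status: "written" | "refused" | "failed" | "skipped";
  reason?: string;
  candidateFact?: string;
}

// Same logic as formatWriteGuardianAudit in the diff, repeated so the sketch runs on its own.
function formatWriteGuardianAudit(entries: WriteGuardianAuditEntry[]): string {
  if (entries.length === 0) {
    return "No write_guardian records found.";
  }
  return entries
    .map((entry, index) =>
      [
        `${index + 1}. [${entry.timestamp}] ${entry.status}`,
        `decision=${entry.decision}`,
        entry.targetFile ? `file=${entry.targetFile}` : undefined,
        entry.reason ? `reason=${entry.reason}` : undefined,
        entry.candidateFact ? `fact=${entry.candidateFact}` : undefined,
      ]
        .filter((part): part is string => Boolean(part)) // drop absent optional fields
        .join(" | ")
    )
    .join("\n");
}

// Hypothetical host stub that only records registered command handlers.
type CommandHandler = (args?: string) => string | Promise<string>;
const commands = new Map<string, CommandHandler>();
const stubApi = {
  registerCommand: (command: string, handler: CommandHandler): void => {
    commands.set(command, handler);
  },
};

// Made-up audit entries, newest first, as readRecent would return them.
const sampleEntries: WriteGuardianAuditEntry[] = [
  {
    timestamp: "2025-01-01T00:01:00.000Z",
    decision: "NO_WRITE",
    status: "skipped",
    reason: "not an update decision",
  },
  {
    timestamp: "2025-01-01T00:00:00.000Z",
    decision: "UPDATE_MEMORY",
    targetFile: "MEMORY.md",
    status: "written",
    candidateFact: "User prefers concise replies",
  },
];

stubApi.registerCommand("/openclaw-reflection", async () =>
  formatWriteGuardianAudit(sampleEntries)
);

const formatted = formatWriteGuardianAudit(sampleEntries);
```

With these entries, `formatted` holds two lines: the first reads `1. [2025-01-01T00:01:00.000Z] skipped | decision=NO_WRITE | reason=not an update decision`, and the second lists the written entry with its `file=` and `fact=` parts. Optional fields that are absent are left out rather than printed empty.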

package/src/write-guardian/audit-log.ts ADDED

@@ -0,0 +1,71 @@
+import * as fs from "node:fs";
+import * as fsp from "node:fs/promises";
+import * as path from "node:path";
+export interface WriteGuardianAuditEntry {
+  timestamp: string;
+  decision: string;
+  targetFile?: string;
+  status: "written" | "refused" | "failed" | "skipped";
+  reason?: string;
+  candidateFact?: string;
+}
+function normalizeError(error: unknown): string {
+  if (error instanceof Error) {
+    return error.message;
+  }
+  return String(error);
+}
+export class WriteGuardianAuditLog {
+  private readonly filePath: string;
+  constructor(workspaceDir: string) {
+    const logDir = path.join(workspaceDir, ".openclaw-reflection");
+    this.filePath = path.join(logDir, "write-guardian.log.jsonl");
+    if (!fs.existsSync(logDir)) {
+      fs.mkdirSync(logDir, { recursive: true });
+    }
+  }
+  async append(entry: Omit<WriteGuardianAuditEntry, "timestamp">): Promise<void> {
+    const serialized = JSON.stringify({
+      timestamp: new Date().toISOString(),
+      ...entry,
+    });
+    await fsp.appendFile(this.filePath, `${serialized}\n`, "utf8");
+  }
+  async readRecent(limit: number): Promise<WriteGuardianAuditEntry[]> {
+    try {
+      const content = await fsp.readFile(this.filePath, "utf8");
+      const lines = content
+        .split("\n")
+        .map((line) => line.trim())
+        .filter((line) => line.length > 0);
+      const parsed = lines
+        .map((line) => {
+          try {
+            return JSON.parse(line) as WriteGuardianAuditEntry;
+          } catch {
+            return null;
+          }
+        })
+        .filter((entry): entry is WriteGuardianAuditEntry => entry !== null);
+      return parsed.slice(-limit).reverse();
+    } catch (error) {
+      const errorMessage = normalizeError(error);
+      if (errorMessage.includes("ENOENT")) {
+        return [];
+      }
+      throw error;
+    }
+  }
+}
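
The parsing steps in `readRecent` can be tried without touching the file system. The sketch below applies the same split, trim, parse, and slice sequence to an in-memory string. `parseRecent`, `AuditEntry`, and the sample log content are hypothetical names and data introduced for this illustration.

```typescript
// Reduced entry shape, enough to show ordering and filtering.
interface AuditEntry {
  timestamp: string;
  status: string;
}

// Mirrors the steps of readRecent on a string instead of a file.
function parseRecent(content: string, limit: number): AuditEntry[] {
  const parsed = content
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0) // skip blank lines, including the trailing one
    .map((line) => {
      try {
        return JSON.parse(line) as AuditEntry;
      } catch {
        return null; // malformed lines are dropped, not fatal
      }
    })
    .filter((entry): entry is AuditEntry => entry !== null);
  // Keep the last `limit` entries, then flip so the newest comes first.
  return parsed.slice(-limit).reverse();
}

// Made-up JSONL content: three valid records, one corrupt line, one trailing newline.
const sampleLog = [
  '{"timestamp":"t1","status":"written"}',
  "not json",
  '{"timestamp":"t2","status":"refused"}',
  '{"timestamp":"t3","status":"skipped"}',
  "",
].join("\n");

const recent = parseRecent(sampleLog, 2);
const order = recent.map((entry) => entry.timestamp).join(",");
```

Here `order` is `"t3,t2"`: the corrupt line is skipped and the newest record comes first. One edge case worth knowing: with a `limit` of 0, `slice(-0)` behaves like `slice(0)` and returns every entry, so this approach expects a positive limit such as the `10` passed by the command handler.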

package/src/write-guardian/index.ts CHANGED

@@ -1,6 +1,7 @@
 import * as path from "path";
 import type { AgentTool, LLMService, MemoryGateOutput, Logger } from "../types.js";
 import { readFile, writeFileWithLock } from "../utils/file-utils.js";
+import { WriteGuardianAuditLog } from "./audit-log.js";
 type UpdateDecision =
   | "UPDATE_MEMORY"
@@ -93,16 +94,25 @@ export class WriteGuardian {
   private config: WriteGuardianConfig;
   private logger: Logger;
   private llmService: LLMService;
-  constructor(config: WriteGuardianConfig, logger: Logger, llmService: LLMService) {
+  private auditLog?: WriteGuardianAuditLog;
+  constructor(
+    config: WriteGuardianConfig,
+    logger: Logger,
+    llmService: LLMService,
+    auditLog?: WriteGuardianAuditLog
+  ) {
     this.config = config;
     this.logger = logger;
     this.llmService = llmService;
+    this.auditLog = auditLog;
   }
   async write(output: MemoryGateOutput): Promise<WriteGuardianWriteResult> {
     if (!isUpdateDecision(output.decision)) {
-      return { status: "skipped", reason: "not an update decision" };
+      const result = { status: "skipped", reason: "not an update decision" } as const;
+      await this.recordAudit(output, result);
+      return result;
     }
     const candidateFact = output.candidateFact?.trim();
@@ -111,7 +121,9 @@ export class WriteGuardian {
         decision: output.decision,
         reason: output.reason,
       });
-      return { status: "skipped", reason: "missing candidate fact" };
+      const result = { status: "skipped", reason: "missing candidate fact" } as const;
+      await this.recordAudit(output, result);
+      return result;
     }
     const targetFile = TARGET_FILES[output.decision];
@@ -141,14 +153,18 @@ export class WriteGuardian {
           filePath,
           reason,
         });
-        return { status: "refused", reason };
+        const writeResult = { status: "refused", reason } as const;
+        await this.recordAudit(output, writeResult, targetFile);
+        return writeResult;
       }
       this.logger.info("WriteGuardian", "write_guardian rewrote target file", {
         decision: output.decision,
         filePath,
       });
-      return { status: "written" };
+      const writeResult = { status: "written" } as const;
+      await this.recordAudit(output, writeResult, targetFile);
+      return writeResult;
     } catch (error) {
       const reason = getErrorMessage(error);
       this.logger.error("WriteGuardian", "write_guardian execution failed", {
@@ -156,7 +172,33 @@ export class WriteGuardian {
         filePath,
         reason,
       });
-      return { status: "failed", reason };
+      const writeResult = { status: "failed", reason } as const;
+      await this.recordAudit(output, writeResult, targetFile);
+      return writeResult;
+    }
+  }
+  private async recordAudit(
+    output: MemoryGateOutput,
+    result: WriteGuardianWriteResult,
+    targetFile?: CuratedFilename
+  ): Promise<void> {
+    if (!this.auditLog) {
+      return;
+    }
+    try {
+      await this.auditLog.append({
+        decision: output.decision,
+        targetFile,
+        status: result.status,
+        reason: result.reason,
+        candidateFact: output.candidateFact,
+      });
+    } catch (error) {
+      this.logger.warn("WriteGuardian", "Failed to append write_guardian audit log", {
+        reason: getErrorMessage(error),
+      });
     }
   }