npm: @gianfrancopiana/openclaw-autoresearch 1.0.2 → 1.0.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (17)

package/README.md CHANGED

@@ -13,19 +13,20 @@ Three tools drive the loop:
 | Tool | What it does |
 |---|---|
 | `init_experiment` | Configures the session: name, primary metric, unit, direction (lower/higher). Re-calling starts a new segment. |
-| `run_experiment` | Executes a shell command, times it, captures stdout/stderr, returns pass/fail via exit code. |
-| `log_experiment` | Records the result. `keep` auto-commits to git. `discard`/`crash` log without committing. Tracks secondary metrics alongside the primary. |
+| `run_experiment` | Executes a shell command, times it, captures stdout/stderr, parses `METRIC name=number` lines, and opens a pending experiment window that must be logged before another run can start. |
+| `log_experiment` | Records the pending run. `keep` auto-commits to git. `discard`/`crash` log without committing. If the prior `run_experiment` captured the primary metric, `log_experiment` can infer `commit` and `metric` automatically. |
 Each tool also accepts an optional `cwd` so callers can target a nested repo explicitly instead of relying on the current session working directory.
-All state lives in four repo-root files:
+All state lives in five repo-root files:
 | File | Purpose |
 |---|---|
-| `autoresearch.md` | Session doc: objective, metrics, files in scope, constraints, what's been tried. A fresh agent reads this to resume. |
+| `autoresearch.md` | Session doc. The plugin keeps the Metrics, How to Run, What's Been Tried, and Plugin Checkpoint sections synchronized so resumes are less agent-dependent. |
 | `autoresearch.sh` | Benchmark script. Outputs `METRIC name=number` lines. |
 | `autoresearch.jsonl` | Structured log: config headers + experiment entries (metric, status, timestamp, segment, commit hash). |
 | `autoresearch.ideas.md` | Backlog of promising ideas not yet tried. Optional. |
+| `autoresearch.checkpoint.json` | Plugin-managed checkpoint: latest logged state, recent runs, and any pending unlogged run. |
 The design is file-first: any agent can pick up the repo-root files and continue the loop without prior context.
@@ -72,6 +73,14 @@ Verify:
 Prefer the explicit `/autoresearch` command surface in OpenClaw. The auto-generated native skill alias `/autoresearch_create` may not trigger reliably on some hosts, so use `/skill autoresearch-create` if you need to invoke the skill directly.
+## Workflow Guarantees
+- `run_experiment` refuses to start a second run until the previous one is logged.
+- `run_experiment` parses `METRIC name=number` lines and stores a pending run so `log_experiment` can default from the actual benchmark output.
+- During active autoresearch mode, raw benchmark execution through OpenClaw `exec`/`bash` is blocked. Use `run_experiment` instead.
+- `autoresearch_status` warns when a pending run is unlogged or git history has moved ahead of the last logged experiment.
+- The plugin updates `autoresearch.checkpoint.json` and refreshes plugin-managed sections in `autoresearch.md` after init, run, and log transitions.
 ## Use
 In the repo you want to optimize:
@@ -80,7 +89,7 @@ In the repo you want to optimize:
 2. Run `/autoresearch` or `/autoresearch setup <goal>`.
 3. Send a normal message with the goal, command, metric (+ direction), files in scope, and constraints.
 4. If you need the raw skill invocation, use `/skill autoresearch-create`.
-5. The agent writes `autoresearch.md` and `autoresearch.sh`, runs a baseline, then starts looping.
+5. The agent writes `autoresearch.md` and `autoresearch.sh`, runs a baseline with `run_experiment`, then records it with `log_experiment`.
 6. Use `/autoresearch` or `/autoresearch status` to re-prime context on a later turn.
 To resume an existing session, a new agent reads the repo-root files and continues from where the last one stopped.
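
The Workflow Guarantees in this README hunk describe a one-pending-run lifecycle. A minimal sketch of that gate, under stated assumptions: `ExperimentGate`, `PendingRunSketch`, and both method names are hypothetical and not part of the plugin, and the real tools also touch git, the JSONL log, and the checkpoint file.

```typescript
// Hypothetical names throughout; this is an illustration, not the plugin's code.
type PendingRunSketch = { command: string; primaryMetric: number | null };

class ExperimentGate {
  private pending: PendingRunSketch | null = null;

  // A second run cannot start while an earlier one is still unlogged.
  run(command: string, primaryMetric: number | null): PendingRunSketch {
    if (this.pending !== null) {
      throw new Error(`Pending run must be logged first: ${this.pending.command}`);
    }
    this.pending = { command, primaryMetric };
    return this.pending;
  }

  // An explicit metric wins; otherwise the value captured by the run is used.
  log(explicitMetric?: number): number {
    const metric = explicitMetric ?? this.pending?.primaryMetric ?? null;
    if (metric === null) {
      throw new Error("No primary metric available");
    }
    this.pending = null; // logging closes the pending window
    return metric;
  }
}
```

Calling `run` twice without a `log` in between throws, which mirrors the refusal the README attributes to `run_experiment`. How the plugin reports that refusal to the agent is not shown in this part of the diff.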

package/extensions/openclaw-autoresearch/src/checkpoint.ts ADDED

@@ -0,0 +1,80 @@
+import * as fs from "node:fs";
+import { AUTORESEARCH_ROOT_FILES, getAutoresearchRootFilePath } from "./files.js";
+import type {
+  AutoresearchRunSnapshot,
+  AutoresearchStateSnapshot,
+} from "./state.js";
+import type { PendingExperimentRun } from "./runtime-state.js";
+export type AutoresearchCheckpoint = {
+  readonly version: 1;
+  readonly updatedAt: number;
+  readonly sessionStartCommit: string | null;
+  readonly session: {
+    readonly name: string | null;
+    readonly metricName: string;
+    readonly metricUnit: string;
+    readonly bestDirection: "lower" | "higher";
+    readonly currentSegment: number;
+    readonly currentRunCount: number;
+    readonly totalRunCount: number;
+    readonly currentBaselineMetric: number | null;
+    readonly currentBestMetric: number | null;
+  };
+  readonly lastLoggedRun: AutoresearchRunSnapshot | null;
+  readonly recentLoggedRuns: readonly AutoresearchRunSnapshot[];
+  readonly pendingRun: PendingExperimentRun | null;
+};
+export function readAutoresearchCheckpoint(cwd: string): AutoresearchCheckpoint | null {
+  const checkpointPath = getAutoresearchRootFilePath(cwd, "checkpoint");
+  if (!fs.existsSync(checkpointPath)) {
+    return null;
+  }
+  try {
+    const parsed = JSON.parse(fs.readFileSync(checkpointPath, "utf8")) as AutoresearchCheckpoint;
+    return parsed?.version === 1 ? parsed : null;
+  } catch {
+    return null;
+  }
+}
+export function writeAutoresearchCheckpoint(options: {
+  cwd: string;
+  state: AutoresearchStateSnapshot;
+  sessionStartCommit: string | null;
+  recentLoggedRuns: readonly AutoresearchRunSnapshot[];
+  pendingRun: PendingExperimentRun | null;
+}): AutoresearchCheckpoint {
+  const checkpoint: AutoresearchCheckpoint = {
+    version: 1,
+    updatedAt: Date.now(),
+    sessionStartCommit: options.sessionStartCommit,
+    session: {
+      name: options.state.name,
+      metricName: options.state.metricName,
+      metricUnit: options.state.metricUnit,
+      bestDirection: options.state.bestDirection,
+      currentSegment: options.state.currentSegment,
+      currentRunCount: options.state.currentRunCount,
+      totalRunCount: options.state.totalRunCount,
+      currentBaselineMetric: options.state.currentBaselineMetric,
+      currentBestMetric: options.state.currentBestMetric,
+    },
+    lastLoggedRun: options.state.lastRun,
+    recentLoggedRuns: options.recentLoggedRuns,
+    pendingRun: options.pendingRun,
+  };
+  const checkpointPath = getAutoresearchRootFilePath(options.cwd, "checkpoint");
+  fs.writeFileSync(checkpointPath, `${JSON.stringify(checkpoint, null, 2)}\n`);
+  return checkpoint;
+}
+export function deleteAutoresearchCheckpoint(cwd: string): void {
+  const checkpointPath = getAutoresearchRootFilePath(cwd, "checkpoint");
+  if (fs.existsSync(checkpointPath)) {
+    fs.unlinkSync(checkpointPath);
+  }
+}
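
`readAutoresearchCheckpoint` treats a missing file, malformed JSON, and an unknown `version` the same way: all return `null`. The sketch below isolates that guard on a string input so it can be tried without touching the filesystem. `MiniCheckpoint` and `parseMiniCheckpoint` are hypothetical names, and the type is a trimmed-down stand-in for `AutoresearchCheckpoint`.

```typescript
// Hypothetical, trimmed-down shape; the real checkpoint type has more fields.
type MiniCheckpoint = {
  version: 1;
  updatedAt: number;
  pendingRun: { command: string } | null;
};

function parseMiniCheckpoint(text: string): MiniCheckpoint | null {
  try {
    const parsed = JSON.parse(text) as MiniCheckpoint | null;
    // Any version other than 1 is handled like a missing checkpoint.
    return parsed?.version === 1 ? parsed : null;
  } catch {
    // Malformed JSON also degrades to "no checkpoint".
    return null;
  }
}
```

As in the diff, only `version` is checked at runtime; the cast does not validate the remaining fields, so callers that need stronger guarantees would have to add their own checks.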

package/extensions/openclaw-autoresearch/src/files.ts CHANGED

@@ -5,6 +5,7 @@ export const AUTORESEARCH_ROOT_FILES = {
   runnerScript: "autoresearch.sh",
   resultsLog: "autoresearch.jsonl",
   ideasBacklog: "autoresearch.ideas.md",
+  checkpoint: "autoresearch.checkpoint.json",
 } as const;
 export type AutoresearchRootFileKey = keyof typeof AUTORESEARCH_ROOT_FILES;

package/extensions/openclaw-autoresearch/src/git.ts CHANGED

@@ -19,6 +19,11 @@ export type GitKeepResult = {
   readonly command: GitCommandResult;
 };
+export type GitRuntimeOptions = {
+  runCommandWithTimeout: RunCommandWithTimeout;
+  cwd: string;
+};
 async function runGitCommand(
   runCommandWithTimeout: RunCommandWithTimeout,
   cwd: string,
@@ -40,6 +45,31 @@ async function runGitCommand(
   };
 }
+export async function readShortHeadCommit(options: GitRuntimeOptions): Promise<string | null> {
+  const result = await runGitCommand(options.runCommandWithTimeout, options.cwd, [
+    "rev-parse",
+    "--short=7",
+    "HEAD",
+  ]);
+  return result.code === 0 && result.stdout.trim().length > 0 ? result.stdout.trim() : null;
+}
+export async function countCommitsSince(
+  options: GitRuntimeOptions & { sinceCommit: string },
+): Promise<number | null> {
+  const result = await runGitCommand(options.runCommandWithTimeout, options.cwd, [
+    "rev-list",
+    "--count",
+    `${options.sinceCommit}..HEAD`,
+  ]);
+  if (result.code !== 0) {
+    return null;
+  }
+  const count = Number.parseInt(result.stdout.trim(), 10);
+  return Number.isFinite(count) ? count : null;
+}
 export async function commitKeptExperiment(options: {
   runCommandWithTimeout: RunCommandWithTimeout;
   cwd: string;
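
`countCommitsSince` returns `null` both when git fails and when the output is not a number. A small sketch of just the parsing step, assuming a hypothetical `FakeGitResult` shape in place of the plugin's `GitCommandResult` (whose full definition is not shown in this hunk):

```typescript
// Hypothetical stand-in for the plugin's git command result.
type FakeGitResult = { code: number; stdout: string };

function parseCommitCount(result: FakeGitResult): number | null {
  if (result.code !== 0) {
    return null; // for example, the base commit is unknown to this repo
  }
  const count = Number.parseInt(result.stdout.trim(), 10);
  return Number.isFinite(count) ? count : null; // empty output parses to NaN
}
```

Returning `null` rather than `0` lets a caller distinguish "no drift" from "could not tell", which is how `autoresearch_status` uses the result later in this diff.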

package/extensions/openclaw-autoresearch/src/hooks.ts CHANGED

@@ -64,6 +64,30 @@ export function registerAutoresearchHooks(api: OpenClawPluginApi): void {
       queueAutoresearchSteer(cwd, messageText);
     });
+    hookApi.on("before_tool_call", (event, ctx) => {
+      const cwd = resolveHookCwd(api, ctx);
+      if (cwd === null || !shouldEnforceAutoresearchMode(cwd)) {
+        return;
+      }
+      const record = event as Record<string, unknown>;
+      const toolName = typeof record.toolName === "string" ? record.toolName : "";
+      if (toolName !== "exec" && toolName !== "bash") {
+        return;
+      }
+      const command = extractToolCommand(record.params);
+      if (!command || !looksLikeExperimentCommand(command)) {
+        return;
+      }
+      return {
+        block: true,
+        blockReason:
+          "Autoresearch mode blocks raw benchmark execution through exec/bash. Use run_experiment so the result is captured and log_experiment can enforce the experiment lifecycle.",
+      };
+    });
     hookApi.on("agent_end", (_event, ctx) => {
       const cwd = resolveHookCwd(api, ctx);
       if (cwd === null) {
@@ -112,11 +136,7 @@ export function registerAutoresearchHooks(api: OpenClawPluginApi): void {
 export function buildBeforePromptBuildContext(cwd: string): string | null {
   const state = reconstructStateFromJsonl(cwd);
   const runtimeState = getAutoresearchRuntimeState(cwd);
-  const modeEnabled =
-    runtimeState.mode === "on" ||
-    (runtimeState.mode !== "off" && (state.mode === "active" || state.hasSessionDoc));
-  if (!modeEnabled) {
+  if (!shouldEnforceAutoresearchMode(cwd, state, runtimeState)) {
     return null;
   }
@@ -144,6 +164,8 @@ export function buildBeforePromptBuildContext(cwd: string): string | null {
       `Read ${AUTORESEARCH_ROOT_FILES.sessionDoc} before resuming or changing the experiment loop, and re-read it after compaction.`,
       "Resume the autonomous upstream loop: edit, run_experiment, log_experiment, keep/discard/crash, repeat.",
       "Use init_experiment, run_experiment, and log_experiment for experiment state changes. Never stop unless the user explicitly interrupts the loop.",
+      "Never run benchmark or test commands through raw exec/bash during autoresearch mode. Use run_experiment so the plugin can capture metrics, enforce logging, and preserve resumable state.",
+      "After every run_experiment, call log_experiment before starting another run. If METRIC lines were captured, log_experiment can infer commit and metric from the pending run.",
     );
     if (pendingCommand?.args) {
       lines.push(`Additional resume instruction from /autoresearch: ${pendingCommand.args}`);
@@ -232,3 +254,49 @@ function firstString(...values: unknown[]): string | null {
 function isCommandLikeMessage(text: string): boolean {
   return /^[\/!]/.test(text.trim());
 }
+function shouldEnforceAutoresearchMode(
+  cwd: string,
+  state = reconstructStateFromJsonl(cwd),
+  runtimeState = getAutoresearchRuntimeState(cwd),
+): boolean {
+  return (
+    runtimeState.mode === "on" ||
+    runtimeState.runInFlight ||
+    runtimeState.pendingRun !== null ||
+    (runtimeState.mode !== "off" && (state.mode === "active" || state.hasSessionDoc))
+  );
+}
+function extractToolCommand(params: unknown): string | null {
+  if (!params || typeof params !== "object") {
+    return null;
+  }
+  const record = params as Record<string, unknown>;
+  for (const key of ["command", "cmd", "args"]) {
+    const value = record[key];
+    if (typeof value === "string" && value.trim().length > 0) {
+      return value.trim();
+    }
+  }
+  return null;
+}
+function looksLikeExperimentCommand(command: string): boolean {
+  const normalized = command.trim();
+  if (!normalized) {
+    return false;
+  }
+  const readOnlyPatterns = [
+    /^(pwd|ls|find|rg|grep|sed|cat|head|tail|wc|stat)\b/,
+    /^git\s+(status|diff|show|log|rev-parse|branch|remote)\b/,
+  ];
+  if (readOnlyPatterns.some((pattern) => pattern.test(normalized))) {
+    return false;
+  }
+  return true;
+}
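
The `before_tool_call` hook blocks any `exec`/`bash` command that is not on a short read-only allowlist. To make the classification concrete, the sketch below copies the two patterns from the diff into a standalone function and applies it to a few made-up commands. The `Sketch` suffix and the sample list are assumptions for illustration; the mode check and tool-name check from the hook are left out.

```typescript
// Patterns copied from looksLikeExperimentCommand in the diff above.
const readOnlyPatternsSketch: RegExp[] = [
  /^(pwd|ls|find|rg|grep|sed|cat|head|tail|wc|stat)\b/,
  /^git\s+(status|diff|show|log|rev-parse|branch|remote)\b/,
];

function looksLikeExperimentCommandSketch(command: string): boolean {
  const normalized = command.trim();
  if (!normalized) {
    return false;
  }
  // Anything that does not start with a read-only tool counts as an experiment.
  return !readOnlyPatternsSketch.some((pattern) => pattern.test(normalized));
}

// Illustrative inputs only; none of these come from the package.
const sampleCommands = [
  "git status",
  "ls -la",
  "npm test",
  "./autoresearch.sh",
  "git commit -m wip",
  "cat notes.md && ./autoresearch.sh",
];

const verdicts: Record<string, "blocked" | "allowed"> = {};
for (const cmd of sampleCommands) {
  verdicts[cmd] = looksLikeExperimentCommandSketch(cmd) ? "blocked" : "allowed";
}
```

Because the patterns are anchored at the start of the string, the last sample is classified as read-only even though it goes on to run the benchmark script. Whether that matters in practice depends on how the host passes compound commands to the hook, which this diff does not show.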

package/extensions/openclaw-autoresearch/src/metrics.ts ADDED

@@ -0,0 +1,24 @@
+const METRIC_LINE_RE =
+  /^METRIC\s+([A-Za-z0-9_.\-µ]+)\s*=\s*(-?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?)\s*$/;
+export function parseMetricLines(output: string): Record<string, number> {
+  const metrics = new Map<string, number>();
+  for (const rawLine of output.split(/\r?\n/)) {
+    const line = rawLine.trim();
+    const match = METRIC_LINE_RE.exec(line);
+    if (!match) {
+      continue;
+    }
+    const [, name, valueText] = match;
+    const value = Number(valueText);
+    if (!name || !Number.isFinite(value)) {
+      continue;
+    }
+    metrics.set(name, value);
+  }
+  return Object.fromEntries(metrics.entries());
+}
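
To see which lines the `METRIC` pattern accepts, the parser can be exercised on a small sample. The regular expression below is copied from the diff; the function is a condensed variant that uses a plain object instead of a `Map`, and the sample benchmark output is made up for illustration.

```typescript
// Pattern copied from metrics.ts in the diff above.
const METRIC_LINE_RE_SKETCH =
  /^METRIC\s+([A-Za-z0-9_.\-µ]+)\s*=\s*(-?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?)\s*$/;

function parseMetricLinesSketch(output: string): Record<string, number> {
  const metrics: Record<string, number> = {};
  for (const rawLine of output.split(/\r?\n/)) {
    const match = METRIC_LINE_RE_SKETCH.exec(rawLine.trim());
    if (!match) {
      continue;
    }
    const [, name = "", valueText = ""] = match;
    if (!name || !valueText) {
      continue;
    }
    const value = Number(valueText);
    if (Number.isFinite(value)) {
      metrics[name] = value; // a repeated name keeps the last value seen
    }
  }
  return metrics;
}

// Made-up benchmark output: two valid metrics, one repeated, two rejected lines.
const sampleOutput = [
  "building...",
  "METRIC latency_ms=12.5",
  "METRIC throughput = 1e3",
  "METRIC bad=abc",
  "metric lower=1",
  "METRIC latency_ms=11",
].join("\n");

const parsedSample = parseMetricLinesSketch(sampleOutput);
// parsedSample is { latency_ms: 11, throughput: 1000 }
```

The prefix is case-sensitive and the value must be a plain decimal or exponent literal, so `metric lower=1` and `METRIC bad=abc` are silently skipped rather than reported.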

package/extensions/openclaw-autoresearch/src/runtime-state.ts CHANGED

@@ -7,12 +7,26 @@ export type PendingAutoresearchCommand =
     }
   | null;
+export type PendingExperimentRun = {
+  readonly command: string;
+  readonly commit: string | null;
+  readonly primaryMetric: number | null;
+  readonly metrics: Record<string, number>;
+  readonly durationSeconds: number;
+  readonly exitCode: number | null;
+  readonly passed: boolean;
+  readonly timedOut: boolean;
+  readonly tailOutput: string;
+  readonly capturedAt: number;
+};
 export type AutoresearchRuntimeSnapshot = {
   readonly mode: AutoresearchRuntimeMode;
   readonly runInFlight: boolean;
   readonly queuedSteers: readonly string[];
   readonly needsContinuationReminder: boolean;
   readonly pendingCommand: PendingAutoresearchCommand;
+  readonly pendingRun: PendingExperimentRun | null;
 };
 type MutableAutoresearchRuntimeState = {
@@ -21,6 +35,7 @@ type MutableAutoresearchRuntimeState = {
   queuedSteers: string[];
   needsContinuationReminder: boolean;
   pendingCommand: PendingAutoresearchCommand;
+  pendingRun: PendingExperimentRun | null;
 };
 const MAX_QUEUED_STEERS = 20;
@@ -33,6 +48,7 @@ function createDefaultRuntimeState(): MutableAutoresearchRuntimeState {
     queuedSteers: [],
     needsContinuationReminder: false,
     pendingCommand: null,
+    pendingRun: null,
   };
 }
@@ -53,6 +69,7 @@ export function getAutoresearchRuntimeState(cwd: string): AutoresearchRuntimeSna
     queuedSteers: [...state.queuedSteers],
     needsContinuationReminder: state.needsContinuationReminder,
     pendingCommand: state.pendingCommand,
+    pendingRun: state.pendingRun,
   };
 }
@@ -138,6 +155,26 @@ export function consumeAutoresearchContinuationReminder(cwd: string): boolean {
   return needsReminder;
 }
+export function setAutoresearchPendingRun(
+  cwd: string,
+  pendingRun: PendingExperimentRun | null,
+): AutoresearchRuntimeSnapshot {
+  const state = getMutableRuntimeState(cwd);
+  state.pendingRun = pendingRun;
+  return getAutoresearchRuntimeState(cwd);
+}
+export function getAutoresearchPendingRun(cwd: string): PendingExperimentRun | null {
+  return getMutableRuntimeState(cwd).pendingRun;
+}
+export function consumeAutoresearchPendingRun(cwd: string): PendingExperimentRun | null {
+  const state = getMutableRuntimeState(cwd);
+  const pendingRun = state.pendingRun;
+  state.pendingRun = null;
+  return pendingRun;
+}
 export function clearAutoresearchRuntimeState(cwd: string): void {
   runtimeStates.delete(cwd);
 }

package/extensions/openclaw-autoresearch/src/session-doc.ts ADDED

@@ -0,0 +1,102 @@
+import * as fs from "node:fs";
+import { AUTORESEARCH_ROOT_FILES, getAutoresearchRootFilePath } from "./files.js";
+import type { AutoresearchCheckpoint } from "./checkpoint.js";
+export function syncAutoresearchSessionDoc(
+  cwd: string,
+  checkpoint: AutoresearchCheckpoint,
+): void {
+  const sessionDocPath = getAutoresearchRootFilePath(cwd, "sessionDoc");
+  const existing = fs.existsSync(sessionDocPath) ? fs.readFileSync(sessionDocPath, "utf8") : "";
+  let doc = ensureTitle(existing, checkpoint.session.name);
+  doc = upsertSection(
+    doc,
+    "## Metrics",
+    [
+      `- **Primary**: ${checkpoint.session.metricName} (${checkpoint.session.metricUnit || "unitless"}, ${checkpoint.session.bestDirection} is better)`,
+    ].join("\n"),
+  );
+  doc = upsertSection(
+    doc,
+    "## How to Run",
+    `\`${AUTORESEARCH_ROOT_FILES.runnerScript}\` — should emit \`METRIC name=number\` lines for ${checkpoint.session.metricName}.`,
+  );
+  doc = upsertSection(doc, "## What's Been Tried", buildTriedSection(checkpoint));
+  doc = upsertSection(doc, "## Plugin Checkpoint", buildCheckpointSection(checkpoint));
+  fs.writeFileSync(sessionDocPath, `${doc.trimEnd()}\n`);
+}
+function ensureTitle(doc: string, sessionName: string | null): string {
+  const trimmed = doc.trim();
+  if (!trimmed) {
+    return `# Autoresearch: ${sessionName ?? "Session"}\n`;
+  }
+  if (/^#\s+/m.test(trimmed)) {
+    return trimmed;
+  }
+  return `# Autoresearch: ${sessionName ?? "Session"}\n\n${trimmed}`;
+}
+function upsertSection(doc: string, heading: string, body: string): string {
+  const escapedHeading = heading.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
+  const sectionRe = new RegExp(`(^${escapedHeading}\\n)([\\s\\S]*?)(?=^##\\s|(?![\\s\\S]))`, "m");
+  const rendered = `${heading}\n${body.trim()}\n\n`;
+  if (sectionRe.test(doc)) {
+    return doc.replace(sectionRe, rendered);
+  }
+  return `${doc.trimEnd()}\n\n${rendered}`;
+}
+function buildTriedSection(checkpoint: AutoresearchCheckpoint): string {
+  if (checkpoint.recentLoggedRuns.length === 0) {
+    return "- No logged experiments yet.";
+  }
+  return checkpoint.recentLoggedRuns
+    .map((run) => {
+      const metricUnit = checkpoint.session.metricUnit;
+      const renderedMetric =
+        metricUnit && metricUnit.length > 0 ? `${run.metric}${metricUnit}` : `${run.metric}`;
+      return `- #${run.run} ${run.status} ${renderedMetric} ${run.commit} — ${run.description}`;
+    })
+    .join("\n");
+}
+function buildCheckpointSection(checkpoint: AutoresearchCheckpoint): string {
+  const lines = [
+    `- Last updated: ${new Date(checkpoint.updatedAt).toISOString()}`,
+    `- Runs tracked: ${checkpoint.session.currentRunCount} current / ${checkpoint.session.totalRunCount} total`,
+    `- Baseline: ${formatMetric(checkpoint.session.currentBaselineMetric, checkpoint.session.metricUnit)}`,
+    `- Best kept: ${formatMetric(checkpoint.session.currentBestMetric, checkpoint.session.metricUnit)}`,
+  ];
+  if (checkpoint.lastLoggedRun) {
+    lines.push(
+      `- Last logged run: #${checkpoint.lastLoggedRun.run} ${checkpoint.lastLoggedRun.status} ${checkpoint.lastLoggedRun.commit} — ${checkpoint.lastLoggedRun.description}`,
+    );
+  }
+  if (checkpoint.pendingRun) {
+    lines.push(
+      `- Pending run awaiting log_experiment: ${checkpoint.pendingRun.command} (${formatMetric(checkpoint.pendingRun.primaryMetric, checkpoint.session.metricUnit)})`,
+    );
+  }
+  return lines.join("\n");
+}
+function formatMetric(value: number | null, unit: string): string {
+  if (value === null) {
+    return "n/a";
+  }
+  return `${value}${unit}`;
+}
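
The section-upsert idea in `upsertSection` can be tried in isolation. The sketch below is a standalone variant with two deliberate differences from the diff, both assumptions of this example rather than the package's behavior: JavaScript regular expressions have no `\Z` anchor, so end of input is expressed as `(?![\s\S])`, and the replacement is passed as a function so that a `$` in the body is not read as a substitution pattern. `upsertSectionSketch` is a hypothetical name.

```typescript
function upsertSectionSketch(doc: string, heading: string, body: string): string {
  const escaped = heading.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // Match from the heading line up to the next "## " heading or the end of input.
  const sectionRe = new RegExp(`^${escaped}\\n[\\s\\S]*?(?=^##\\s|(?![\\s\\S]))`, "m");
  const rendered = `${heading}\n${body.trim()}\n\n`;
  if (sectionRe.test(doc)) {
    // Existing section: replace its body in place.
    return doc.replace(sectionRe, () => rendered);
  }
  // Missing section: append it at the end of the document.
  return `${doc.trimEnd()}\n\n${rendered}`;
}
```

Applied repeatedly to the same heading, this keeps a single copy of the section, including when that section is the last one in the file.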

package/extensions/openclaw-autoresearch/src/state.ts CHANGED

@@ -211,6 +211,62 @@ export function reconstructStateFromJsonl(cwd: string): AutoresearchStateSnapsho
   };
 }
+export function readRecentLoggedRuns(
+  cwd: string,
+  limit: number,
+): readonly AutoresearchRunSnapshot[] {
+  const jsonl = readAutoresearchRootFile(cwd, "resultsLog");
+  if (jsonl === null || limit <= 0) {
+    return [];
+  }
+  const runs: AutoresearchRunSnapshot[] = [];
+  let currentSegment = 0;
+  let currentRunIndex = 0;
+  let hasSeenAnyRun = false;
+  const lines = jsonl
+    .split("\n")
+    .map((line) => line.trim())
+    .filter(Boolean);
+  for (const line of lines) {
+    let entry: JsonlEntry;
+    try {
+      entry = JSON.parse(line) as JsonlEntry;
+    } catch {
+      continue;
+    }
+    if (entry.type === "config") {
+      if (hasSeenAnyRun) {
+        currentSegment += 1;
+      }
+      currentRunIndex = 0;
+      continue;
+    }
+    if (typeof entry.metric !== "number") {
+      continue;
+    }
+    hasSeenAnyRun = true;
+    currentRunIndex += 1;
+    runs.push({
+      run: typeof entry.run === "number" ? entry.run : currentRunIndex,
+      commit: entry.commit ?? "",
+      metric: entry.metric,
+      metrics: normalizeMetrics(entry.metrics),
+      status: entry.status ?? "keep",
+      description: entry.description ?? "",
+      timestamp: typeof entry.timestamp === "number" ? entry.timestamp : 0,
+      segment: typeof entry.segment === "number" ? entry.segment : currentSegment,
+    });
+  }
+  return runs.slice(-limit);
+}
 function normalizeMetrics(metrics: Record<string, number> | undefined): Record<string, number> {
   if (!metrics || typeof metrics !== "object") {
     return {};
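
`readRecentLoggedRuns` replays the JSONL log, skipping malformed lines and bumping the segment whenever a config header follows at least one run. A reduced sketch of that replay on an in-memory string, assuming hypothetical `SketchEntry`/`SketchRun` types that keep only the fields needed to show the numbering; the extra `null` guard is an addition of this sketch.

```typescript
// Hypothetical reduced shapes; the real entries carry commit, status, and more.
type SketchEntry = { type?: string; run?: number; metric?: number; segment?: number };
type SketchRun = { run: number; metric: number; segment: number };

function recentRunsSketch(jsonl: string, limit: number): SketchRun[] {
  if (limit <= 0) {
    return [];
  }
  const runs: SketchRun[] = [];
  let segment = 0;
  let runIndex = 0;
  let seenRun = false;
  for (const line of jsonl.split("\n").map((l) => l.trim()).filter(Boolean)) {
    let entry: SketchEntry | null;
    try {
      entry = JSON.parse(line) as SketchEntry | null;
    } catch {
      continue; // malformed lines are skipped, not fatal
    }
    if (entry === null || typeof entry !== "object") {
      continue; // guard added in this sketch for non-object JSON values
    }
    if (entry.type === "config") {
      if (seenRun) {
        segment += 1; // a config header after any run opens a new segment
      }
      runIndex = 0;
      continue;
    }
    if (typeof entry.metric !== "number") {
      continue;
    }
    seenRun = true;
    runIndex += 1;
    runs.push({
      run: typeof entry.run === "number" ? entry.run : runIndex,
      metric: entry.metric,
      segment: typeof entry.segment === "number" ? entry.segment : segment,
    });
  }
  return runs.slice(-limit); // keep only the newest entries
}

// Made-up log: two runs in segment 0, one broken line, then a re-init and one run.
const sampleLog = [
  '{"type":"config","name":"demo"}',
  '{"run":1,"metric":10}',
  "not json",
  '{"metric":9}',
  '{"type":"config","name":"demo2"}',
  '{"metric":8}',
].join("\n");

const lastTwoRuns = recentRunsSketch(sampleLog, 2);
// lastTwoRuns is [{ run: 2, metric: 9, segment: 0 }, { run: 1, metric: 8, segment: 1 }]
```

The run number restarts after a re-init, so a run is only unique together with its segment.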

package/extensions/openclaw-autoresearch/src/tools/autoresearch-status.ts CHANGED

@@ -1,8 +1,22 @@
 import type { OpenClawPluginApi } from "openclaw/plugin-sdk";
 import { Type } from "@sinclair/typebox";
 import { reconstructStateFromJsonl, type AutoresearchStateSnapshot } from "../state.js";
-import { getAutoresearchRuntimeState, type AutoresearchRuntimeSnapshot } from "../runtime-state.js";
+import {
+  getAutoresearchRuntimeState,
+  type AutoresearchRuntimeSnapshot,
+} from "../runtime-state.js";
 import { resolveToolCwd } from "./tool-cwd.js";
+import {
+  readAutoresearchCheckpoint,
+  type AutoresearchCheckpoint,
+} from "../checkpoint.js";
+import { countCommitsSince, readShortHeadCommit } from "../git.js";
+export type AutoresearchStatusDiagnostics = {
+  readonly warnings: readonly string[];
+  readonly checkpoint: AutoresearchCheckpoint | null;
+  readonly gitHead: string | null;
+};
 const AutoresearchStatusParams = Type.Object(
   {
@@ -34,13 +48,20 @@ export function createAutoresearchStatusTool(api: OpenClawPluginApi) {
       const cwd = resolveToolCwd(api, params.cwd);
       const state = reconstructStateFromJsonl(cwd);
       const runtimeState = getAutoresearchRuntimeState(cwd);
+      const diagnostics = await buildAutoresearchStatusDiagnostics(api, cwd, state);
       return {
-        content: [{ type: "text" as const, text: formatAutoresearchStatusText(state, runtimeState) }],
+        content: [
+          {
+            type: "text" as const,
+            text: formatAutoresearchStatusText(state, runtimeState, diagnostics),
+          },
+        ],
         details: {
           status: "ok",
           state,
           runtime: runtimeState,
+          diagnostics,
         },
       };
     },
@@ -50,10 +71,12 @@ export function createAutoresearchStatusTool(api: OpenClawPluginApi) {
 export function formatAutoresearchStatusText(
   state: AutoresearchStateSnapshot,
   runtimeState?: AutoresearchRuntimeSnapshot,
+  diagnostics?: AutoresearchStatusDiagnostics,
 ): string {
   const lines = [
     `Mode: ${state.mode}`,
     `Session doc: ${state.hasSessionDoc ? "present" : "missing"}`,
+    `Checkpoint: ${diagnostics?.checkpoint ? "present" : "missing"}`,
     `Ideas backlog: ${state.ideas.hasBacklog ? `${state.ideas.pendingCount} pending` : "empty"}`,
     `Metric: ${state.metricName} (${state.metricUnit || "unitless"}, ${state.bestDirection} is better)`,
     `Current segment: ${state.currentSegment}`,
@@ -74,6 +97,7 @@ export function formatAutoresearchStatusText(
       `Experiment window: ${runtimeState.runInFlight ? "running" : "idle"}`,
       `Queued steers: ${runtimeState.queuedSteers.length}`,
     );
+    lines.splice(state.name ? 5 : 4, 0, `Pending run: ${runtimeState.pendingRun ? "yes" : "no"}`);
   }
   if (state.lastRun) {
@@ -86,6 +110,17 @@ export function formatAutoresearchStatusText(
     lines.push(`Ideas preview: ${state.ideas.preview.join(" | ")}`);
   }
+  if (diagnostics?.gitHead) {
+    lines.push(`Git HEAD: ${diagnostics.gitHead}`);
+  }
+  if (diagnostics && diagnostics.warnings.length > 0) {
+    lines.push("", "Warnings:");
+    for (const warning of diagnostics.warnings) {
+      lines.push(`- ${warning}`);
+    }
+  }
   return lines.join("\n");
 }
@@ -97,3 +132,48 @@ function formatMetric(value: number | null, unit: string): string {
   const rendered = value === Math.round(value) ? `${Math.round(value)}` : value.toFixed(2);
   return `${rendered}${unit}`;
 }
+async function buildAutoresearchStatusDiagnostics(
+  api: OpenClawPluginApi,
+  cwd: string,
+  state: AutoresearchStateSnapshot,
+): Promise<AutoresearchStatusDiagnostics> {
+  const checkpoint = readAutoresearchCheckpoint(cwd);
+  const gitHead = await readShortHeadCommit({
+    runCommandWithTimeout: api.runtime.system.runCommandWithTimeout,
+    cwd,
+  });
+  const warnings: string[] = [];
+  if (checkpoint?.pendingRun) {
+    warnings.push(
+      `A previous run_experiment is still pending log_experiment: ${checkpoint.pendingRun.command}`,
+    );
+  }
+  const driftBase =
+    state.lastRun?.commit && state.lastRun.commit.length > 0
+      ? state.lastRun.commit
+      : checkpoint?.sessionStartCommit ?? null;
+  if (driftBase) {
+    const commitsAhead = await countCommitsSince({
+      runCommandWithTimeout: api.runtime.system.runCommandWithTimeout,
+      cwd,
+      sinceCommit: driftBase,
+    });
+    if (commitsAhead !== null && commitsAhead > 0) {
+      warnings.push(
+        state.lastRun
+          ? `${commitsAhead} commit${commitsAhead === 1 ? "" : "s"} since the last logged experiment (${state.lastRun.commit}).`
+          : `${commitsAhead} commit${commitsAhead === 1 ? "" : "s"} since init_experiment, but no experiment has been logged yet.`,
+      );
+    }
+  }
+  return {
+    warnings,
+    checkpoint,
+    gitHead,
+  };
+}

package/extensions/openclaw-autoresearch/src/tools/init-experiment.ts CHANGED

@@ -3,9 +3,14 @@ import { InitExperimentParams } from "./schemas.js";
 import { createConfigHeader, writeConfigHeader } from "../logging.js";
 import {
   createEmptyStateSnapshot,
+  readRecentLoggedRuns,
   reconstructStateFromJsonl,
   type AutoresearchStateSnapshot,
 } from "../state.js";
+import { readAutoresearchCheckpoint, writeAutoresearchCheckpoint } from "../checkpoint.js";
+import { syncAutoresearchSessionDoc } from "../session-doc.js";
+import { readShortHeadCommit } from "../git.js";
+import { setAutoresearchPendingRun, setAutoresearchRunInFlight } from "../runtime-state.js";
 import { resolveToolCwd } from "./tool-cwd.js";
 export function createInitExperimentTool(api: OpenClawPluginApi) {
@@ -29,6 +34,7 @@ export function createInitExperimentTool(api: OpenClawPluginApi) {
     ) {
       const cwd = resolveToolCwd(api, params.cwd);
       const previousState = reconstructStateFromJsonl(cwd);
+      const previousCheckpoint = readAutoresearchCheckpoint(cwd);
       const isReinit = previousState.currentRunCount > 0;
       const nextState: AutoresearchStateSnapshot = {
         ...createEmptyStateSnapshot(),
@@ -66,6 +72,23 @@ export function createInitExperimentTool(api: OpenClawPluginApi) {
         };
       }
+      setAutoresearchPendingRun(cwd, null);
+      setAutoresearchRunInFlight(cwd, false);
+      const nextPersistentState = reconstructStateFromJsonl(cwd);
+      const sessionStartCommit = await readShortHeadCommit({
+        runCommandWithTimeout: api.runtime.system.runCommandWithTimeout,
+        cwd,
+      });
+      const checkpoint = writeAutoresearchCheckpoint({
+        cwd,
+        state: nextPersistentState,
+        sessionStartCommit: sessionStartCommit ?? previousCheckpoint?.sessionStartCommit ?? null,
+        recentLoggedRuns: readRecentLoggedRuns(cwd, 8),
+        pendingRun: null,
+      });
+      syncAutoresearchSessionDoc(cwd, checkpoint);
       const reinitNote = isReinit
         ? " (re-initialized - previous results archived, new baseline needed)"
         : "";
@@ -77,7 +100,7 @@ export function createInitExperimentTool(api: OpenClawPluginApi) {
             text:
               `Experiment initialized: "${nextState.name}"${reinitNote}\n` +
               `Metric: ${nextState.metricName} (${nextState.metricUnit || "unitless"}, ${nextState.bestDirection} is better)\n` +
-              "Config written to autoresearch.jsonl. Now run the baseline with run_experiment.",
+              "Config written to autoresearch.jsonl. Now run the baseline with run_experiment, then log it before starting another run.",
           },
         ],
         details: {

package/extensions/openclaw-autoresearch/src/tools/log-experiment.ts CHANGED

@@ -1,19 +1,24 @@
 import * as fs from "node:fs";
 import type { OpenClawPluginApi } from "openclaw/plugin-sdk";
 import { LogExperimentParams } from "./schemas.js";
-import { commitKeptExperiment } from "../git.js";
+import { commitKeptExperiment, readShortHeadCommit } from "../git.js";
 import { appendResultEntry, type AutoresearchResultEntry } from "../logging.js";
 import { getAutoresearchRootFilePath } from "../files.js";
 import {
+  readRecentLoggedRuns,
   reconstructStateFromJsonl,
   type AutoresearchStateSnapshot,
   type SecondaryMetricDef,
 } from "../state.js";
 import {
+  consumeAutoresearchPendingRun,
+  getAutoresearchPendingRun,
   consumeAutoresearchSteers,
   setAutoresearchRunInFlight,
 } from "../runtime-state.js";
 import { resolveToolCwd } from "./tool-cwd.js";
+import { readAutoresearchCheckpoint, writeAutoresearchCheckpoint } from "../checkpoint.js";
+import { syncAutoresearchSessionDoc } from "../session-doc.js";
 export function createLogExperimentTool(api: OpenClawPluginApi) {
   return {
@@ -26,8 +31,8 @@ export function createLogExperimentTool(api: OpenClawPluginApi) {
       _toolCallId: string,
       params: {
         cwd?: string;
-        commit: string;
-        metric: number;
+        commit?: string;
+        metric?: number;
         status: "keep" | "discard" | "crash";
         description: string;
         metrics?: Record<string, number>;
@@ -37,8 +42,37 @@ export function createLogExperimentTool(api: OpenClawPluginApi) {
       _onUpdate: unknown,
     ) {
       const cwd = resolveToolCwd(api, params.cwd);
+      const checkpoint = readAutoresearchCheckpoint(cwd);
+      const pendingRun = getAutoresearchPendingRun(cwd) ?? checkpoint?.pendingRun ?? null;
       const state = reconstructStateFromJsonl(cwd);
-      const secondaryMetrics = params.metrics ?? {};
+      const secondaryMetrics = params.metrics ?? pendingRun?.metrics ?? {};
+      const inferredCommit =
+        params.commit ??
+        pendingRun?.commit ??
+        (await readShortHeadCommit({
+          runCommandWithTimeout: api.runtime.system.runCommandWithTimeout,
+          cwd,
+        })) ??
+        "";
+      const inferredMetric = params.metric ?? pendingRun?.primaryMetric;
+      if (inferredMetric === null || inferredMetric === undefined) {
+        return {
+          content: [
+            {
+              type: "text" as const,
+              text:
+                "No primary metric is available to log.\n" +
+                `Expected a METRIC line for ${state.metricName} from run_experiment, or provide metric explicitly.`,
+            },
+          ],
+          details: {
+            status: "error",
+            phase: "metric",
+            pendingRun,
+          },
+        };
+      }
       if (state.secondaryMetrics.length > 0) {
         const validationError = validateSecondaryMetrics(
@@ -61,8 +95,8 @@ export function createLogExperimentTool(api: OpenClawPluginApi) {
       const currentResults = readCurrentSegmentResults(cwd, state.currentSegment);
       const experiment: AutoresearchResultEntry = {
         run: state.currentRunCount + 1,
-        commit: params.commit.slice(0, 7),
-        metric: params.metric,
+        commit: inferredCommit.slice(0, 7),
+        metric: inferredMetric,
         metrics: secondaryMetrics,
         status: params.status,
         description: params.description,
@@ -83,7 +117,7 @@ export function createLogExperimentTool(api: OpenClawPluginApi) {
           cwd: cwd,
           description: params.description,
           metricName: state.metricName,
-          metric: params.metric,
+          metric: inferredMetric,
           metrics: secondaryMetrics,
           commit: experiment.commit,
           status: "keep",
@@ -138,7 +172,16 @@ export function createLogExperimentTool(api: OpenClawPluginApi) {
       );
       const nextState: AutoresearchStateSnapshot = reconstructStateFromJsonl(cwd);
       const queuedSteers = consumeAutoresearchSteers(cwd);
+      consumeAutoresearchPendingRun(cwd);
       setAutoresearchRunInFlight(cwd, false);
+      const nextCheckpoint = writeAutoresearchCheckpoint({
+        cwd,
+        state: nextState,
+        sessionStartCommit: checkpoint?.sessionStartCommit ?? experiment.commit,
+        recentLoggedRuns: readRecentLoggedRuns(cwd, 8),
+        pendingRun: null,
+      });
+      syncAutoresearchSessionDoc(cwd, nextCheckpoint);
       return {
         content: [
@@ -153,6 +196,7 @@ export function createLogExperimentTool(api: OpenClawPluginApi) {
               gitSummary,
               knownSecondaryMetrics,
               queuedSteers,
+              usedPendingRun: pendingRun !== null,
             }),
           },
         ],
@@ -304,6 +348,7 @@ function buildResultText(options: {
   gitSummary: string;
   knownSecondaryMetrics: readonly SecondaryMetricDef[];
   queuedSteers: readonly string[];
+  usedPendingRun: boolean;
 }): string {
   let text = `Logged #${options.experiment.run}: ${options.experiment.status} - ${options.experiment.description}`;
   text += `\nBaseline ${options.state.metricName}: ${formatMetric(options.baselineMetric, options.state.metricUnit)}`;
@@ -338,6 +383,9 @@ function buildResultText(options: {
   }
   text += `\n(${options.totalRunCount} experiments in current segment)`;
+  if (options.usedPendingRun) {
+    text += "\nUsed the pending run_experiment result as the source of truth for commit/metric defaults.";
+  }
   text += `\n${options.gitSummary}`;
   if (options.queuedSteers.length > 0) {
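
The `log-experiment.ts` hunks above resolve `commit` and `metric` through a fallback chain: explicit parameters first, then the pending run, then (for the commit only) the current HEAD. The following is a minimal sketch of that chain, not the plugin's code. The names `PendingRunSketch`, `LogParamsSketch`, and `resolveLogDefaults` are hypothetical, and the HEAD commit is passed in as a plain string so the example needs no git access.

```typescript
// Hypothetical shapes for illustration only; the plugin's real types are not shown in this diff.
interface PendingRunSketch {
  commit: string | null;
  primaryMetric: number | null;
}

interface LogParamsSketch {
  commit?: string;
  metric?: number;
}

interface ResolvedDefaults {
  commit: string;
  metric: number | null;
}

// Explicit params win, then the pending run, then the HEAD commit.
// `??` is used instead of `||` so an explicit metric of 0 is preserved.
function resolveLogDefaults(
  params: LogParamsSketch,
  pendingRun: PendingRunSketch | null,
  headCommit: string | null,
): ResolvedDefaults {
  const commit = (params.commit ?? pendingRun?.commit ?? headCommit ?? "").slice(0, 7);
  const metric = params.metric ?? pendingRun?.primaryMetric ?? null;
  return { commit, metric };
}
```

A caller would treat a `null` metric as the error case, which corresponds to the "No primary metric is available to log." branch in the diff.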

package/extensions/openclaw-autoresearch/src/tools/run-experiment.ts CHANGED

@@ -1,8 +1,17 @@
 import type { OpenClawPluginApi } from "openclaw/plugin-sdk";
 import { RunExperimentParams } from "./schemas.js";
 import { executeExperimentCommand } from "../execute.js";
-import { setAutoresearchRunInFlight } from "../runtime-state.js";
+import {
+  getAutoresearchPendingRun,
+  setAutoresearchPendingRun,
+  setAutoresearchRunInFlight,
+} from "../runtime-state.js";
 import { resolveToolCwd } from "./tool-cwd.js";
+import { parseMetricLines } from "../metrics.js";
+import { readShortHeadCommit } from "../git.js";
+import { readAutoresearchCheckpoint, writeAutoresearchCheckpoint } from "../checkpoint.js";
+import { readRecentLoggedRuns, reconstructStateFromJsonl } from "../state.js";
+import { syncAutoresearchSessionDoc } from "../session-doc.js";
 export function createRunExperimentTool(api: OpenClawPluginApi) {
   return {
@@ -22,6 +31,28 @@ export function createRunExperimentTool(api: OpenClawPluginApi) {
       onUpdate: ((update: unknown) => void | Promise<void>) | undefined,
     ) {
       const cwd = resolveToolCwd(api, params.cwd);
+      const checkpoint = readAutoresearchCheckpoint(cwd);
+      const existingPendingRun = getAutoresearchPendingRun(cwd) ?? checkpoint?.pendingRun ?? null;
+      if (existingPendingRun) {
+        return {
+          content: [
+            {
+              type: "text" as const,
+              text:
+                "The previous run_experiment result has not been logged yet.\n" +
+                `Pending command: ${existingPendingRun.command}\n` +
+                "Call log_experiment next. You can omit commit/metric and the tool will use the pending run by default.",
+            },
+          ],
+          details: {
+            status: "error",
+            phase: "pending_log",
+            pendingRun: existingPendingRun,
+          },
+        };
+      }
       setAutoresearchRunInFlight(cwd, true);
       if (onUpdate) {
@@ -45,6 +76,48 @@ export function createRunExperimentTool(api: OpenClawPluginApi) {
         throw error;
       }
+      const state = reconstructStateFromJsonl(cwd);
+      const parsedMetrics = parseMetricLines([details.stdout, details.stderr].join("\n"));
+      const detectedPrimaryMetricName =
+        parsedMetrics[state.metricName] !== undefined
+          ? state.metricName
+          : Object.keys(parsedMetrics).length === 1
+            ? Object.keys(parsedMetrics)[0] ?? null
+            : null;
+      const primaryMetric =
+        detectedPrimaryMetricName !== null ? parsedMetrics[detectedPrimaryMetricName] ?? null : null;
+      const secondaryMetrics =
+        detectedPrimaryMetricName !== null
+          ? Object.fromEntries(
+              Object.entries(parsedMetrics).filter(([name]) => name !== detectedPrimaryMetricName),
+            )
+          : parsedMetrics;
+      const currentCommit = await readShortHeadCommit({
+        runCommandWithTimeout: api.runtime.system.runCommandWithTimeout,
+        cwd,
+      });
+      const pendingRun = {
+        command: params.command,
+        commit: currentCommit,
+        primaryMetric,
+        metrics: secondaryMetrics,
+        durationSeconds: details.durationSeconds,
+        exitCode: details.exitCode,
+        passed: details.passed,
+        timedOut: details.timedOut,
+        tailOutput: details.tailOutput,
+        capturedAt: Date.now(),
+      } as const;
+      setAutoresearchPendingRun(cwd, pendingRun);
+      const nextCheckpoint = writeAutoresearchCheckpoint({
+        cwd,
+        state,
+        sessionStartCommit: checkpoint?.sessionStartCommit ?? currentCommit,
+        recentLoggedRuns: readRecentLoggedRuns(cwd, 8),
+        pendingRun,
+      });
+      syncAutoresearchSessionDoc(cwd, nextCheckpoint);
       let text = "";
       if (details.timedOut) {
         text += `TIMEOUT after ${details.durationSeconds.toFixed(1)}s\n`;
@@ -55,10 +128,25 @@ export function createRunExperimentTool(api: OpenClawPluginApi) {
       }
       text += `\nLast 80 lines of output:\n${details.tailOutput || "(no output)"}`;
+      if (Object.keys(parsedMetrics).length > 0) {
+        text += `\n\nParsed METRIC lines: ${Object.entries(parsedMetrics)
+          .map(([name, value]) => `${name}=${value}`)
+          .join(", ")}`;
+      } else {
+        text += `\n\nNo METRIC lines were detected.`;
+      }
+      text +=
+        "\nNext step: call log_experiment before another run. When the primary METRIC was captured, log_experiment can infer commit and metric from this run.";
       return {
         content: [{ type: "text" as const, text }],
-        details,
+        details: {
+          ...details,
+          metrics: parsedMetrics,
+          secondaryMetrics,
+          primaryMetric,
+          pendingRun,
+        },
       };
     },
   };
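
The `run-experiment.ts` hunks above parse `METRIC name=number` lines from the command output and then decide which parsed value is the primary metric. The sketch below illustrates both steps under stated assumptions. The body of `parseMetricLines` in `../metrics.js` is not part of this diff, so `parseMetricLinesSketch` and its regular expression are a guess at a reasonable parser, not the package's implementation. `splitPrimary` is a hypothetical helper that follows the selection logic visible in the diff: prefer the configured metric name, otherwise accept a single lone metric, otherwise report no primary.

```typescript
// Hypothetical parser; the real parseMetricLines is not shown in this diff.
function parseMetricLinesSketch(output: string): Record<string, number> {
  const metrics: Record<string, number> = {};
  for (const line of output.split("\n")) {
    const match = /^METRIC\s+([^\s=]+)=(\S+)$/.exec(line.trim());
    if (!match) continue;
    const name = match[1] ?? "";
    const raw = match[2] ?? "";
    if (name === "" || raw === "") continue;
    const value = Number(raw);
    // Skip values that do not parse to a finite number.
    if (Number.isFinite(value)) metrics[name] = value;
  }
  return metrics;
}

interface PrimarySplit {
  primaryName: string | null;
  primaryMetric: number | null;
  secondaryMetrics: Record<string, number>;
}

// Follows the selection order in the diff: configured name, else a lone metric, else none.
function splitPrimary(parsed: Record<string, number>, configuredName: string): PrimarySplit {
  const names = Object.keys(parsed);
  const primaryName =
    parsed[configuredName] !== undefined
      ? configuredName
      : names.length === 1
        ? names[0] ?? null
        : null;
  const primaryMetric = primaryName !== null ? parsed[primaryName] ?? null : null;
  const secondaryMetrics: Record<string, number> =
    primaryName !== null
      ? Object.fromEntries(Object.entries(parsed).filter(([name]) => name !== primaryName))
      : parsed;
  return { primaryName, primaryMetric, secondaryMetrics };
}
```

When no primary can be identified, every parsed value stays in `secondaryMetrics`, so `log_experiment` would need an explicit `metric`.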

package/extensions/openclaw-autoresearch/src/tools/schemas.ts CHANGED

@@ -45,11 +45,18 @@ export const RunExperimentParams = Type.Object({
 export const LogExperimentParams = Type.Object({
   cwd: CwdParam,
-  commit: Type.String({ description: "Git commit hash (short, 7 chars)" }),
-  metric: Type.Number({
-    description:
-      "The primary optimization metric value (e.g. seconds, val_bpb). Use 0 for crashes.",
-  }),
+  commit: Type.Optional(
+    Type.String({
+      description:
+        "Git commit hash (short, 7 chars). Optional when logging the most recent run_experiment result.",
+    }),
+  ),
+  metric: Type.Optional(
+    Type.Number({
+      description:
+        "The primary optimization metric value (e.g. seconds, val_bpb). Optional when run_experiment already captured a METRIC line for the configured primary metric.",
+    }),
+  ),
   status: Type.String({
     description: "Result status for this experiment.",
     enum: ["keep", "discard", "crash"],

package/openclaw.plugin.json CHANGED

@@ -5,7 +5,7 @@
   "skills": [
     "./skills"
   ],
-  "version": "1.0.2",
+  "version": "1.0.3",
   "configSchema": {
     "type": "object",
     "additionalProperties": false,

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "@gianfrancopiana/openclaw-autoresearch",
-  "version": "1.0.2",
+  "version": "1.0.3",
   "description": "Faithful OpenClaw port of pi-autoresearch.",
   "type": "module",
   "main": "./index.ts",

package/skills/autoresearch-create/SKILL.md CHANGED

@@ -10,8 +10,8 @@ Autonomous experiment loop: try ideas, keep what works, discard what doesn't, ne
 ## Tools
 - **`init_experiment`** — configure session (name, metric, unit, direction). Call again to re-initialize with a new baseline when the optimization target changes.
-- **`run_experiment`** — runs command, times it, captures output.
-- **`log_experiment`** — records result. `keep` auto-commits. `discard`/`crash` → `git checkout -- .` to revert. Always include secondary `metrics` dict.
+- **`run_experiment`** — runs the benchmark command, times it, captures output, parses `METRIC name=number` lines, and opens a pending run that must be logged before another run can start.
+- **`log_experiment`** — records the pending run. `keep` auto-commits. `discard`/`crash` → `git checkout -- .` to revert. If the previous `run_experiment` captured the primary metric, `commit` and `metric` can be omitted and will default from the pending run.
 ## Setup
@@ -19,7 +19,7 @@ Autonomous experiment loop: try ideas, keep what works, discard what doesn't, ne
 2. `git checkout -b autoresearch/<goal>-<date>`
 3. Read the source files. Understand the workload deeply before writing anything.
 4. Write `autoresearch.md` and `autoresearch.sh` (see below). Commit both.
-5. `init_experiment` → run baseline → `log_experiment` → start looping immediately.
+5. `init_experiment` → `run_experiment` baseline → `log_experiment` → start looping immediately.
 ### `autoresearch.md`
@@ -52,7 +52,7 @@ This is the heart of the session. A fresh agent with no context should be able t
 and architectural insights so the agent doesn't repeat failed approaches.>
 ```
-Update `autoresearch.md` periodically — especially the "What's Been Tried" section — so resuming agents have full context.
+The plugin rewrites the Metrics, How to Run, What's Been Tried, and Plugin Checkpoint sections after init/log transitions. You may add context elsewhere in the file, but do not fight the plugin-managed sections.
 ### `autoresearch.sh`
@@ -67,7 +67,8 @@ Bash script (`set -euo pipefail`) that: pre-checks fast (syntax errors in <1s),
 - **Don't thrash.** Repeatedly reverting the same idea? Try something structurally different.
 - **Crashes:** fix if trivial, otherwise log and move on. Don't over-invest.
 - **Think longer when stuck.** Re-read source files, study the profiling data, reason about what the CPU is actually doing. The best ideas come from deep understanding, not from trying random variations.
-- **Resuming:** if `autoresearch.md` exists, read it + git log, continue looping.
+- **Resuming:** if `autoresearch.md` exists, read it plus `autoresearch.checkpoint.json`, then continue looping.
+- **No raw benchmark exec:** during active autoresearch mode, benchmark/test commands should go through `run_experiment`, not raw `exec`/`bash`.
 **NEVER STOP.** The user may be away for hours. Keep going until interrupted.
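
The updated tool descriptions in this skill say that `run_experiment` opens a pending run that must be logged before another run can start. The sketch below shows one way such a gate could behave, keyed by working directory. It is an in-memory illustration only: `PendingRunGate` and `GateRun` are hypothetical names, and the actual plugin additionally persists the pending run in its checkpoint file, which this sketch omits.

```typescript
// Hypothetical shape; the plugin's pending run carries more fields (commit, exit code, output tail).
interface GateRun {
  command: string;
  primaryMetric: number | null;
}

type RunAttempt = { ok: true } | { ok: false; pendingCommand: string };

class PendingRunGate {
  private readonly pending = new Map<string, GateRun>();

  // Like run_experiment: refuse to start while an unlogged run exists for this cwd.
  tryRun(cwd: string, run: GateRun): RunAttempt {
    const existing = this.pending.get(cwd);
    if (existing) {
      return { ok: false, pendingCommand: existing.command };
    }
    this.pending.set(cwd, run);
    return { ok: true };
  }

  // Like log_experiment: consume the pending run so the next run can start.
  log(cwd: string): GateRun | null {
    const run = this.pending.get(cwd) ?? null;
    this.pending.delete(cwd);
    return run;
  }
}
```

Keying the map by `cwd` keeps nested repos independent, which matches the optional `cwd` parameter each tool accepts.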