@gianfrancopiana/openclaw-autoresearch 1.0.2 → 1.0.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -13,19 +13,20 @@ Three tools drive the loop:
13
13
  | Tool | What it does |
14
14
  |---|---|
15
15
  | `init_experiment` | Configures the session: name, primary metric, unit, direction (lower/higher). Re-calling starts a new segment. |
16
- | `run_experiment` | Executes a shell command, times it, captures stdout/stderr, returns pass/fail via exit code. |
17
- | `log_experiment` | Records the result. `keep` auto-commits to git. `discard`/`crash` log without committing. Tracks secondary metrics alongside the primary. |
16
+ | `run_experiment` | Executes a shell command, times it, captures stdout/stderr, parses `METRIC name=number` lines, and opens a pending experiment window that must be logged before another run can start. |
17
+ | `log_experiment` | Records the pending run. `keep` auto-commits to git. `discard`/`crash` log without committing. If the prior `run_experiment` captured the primary metric, `log_experiment` can infer `commit` and `metric` automatically. |
18
18
 
19
19
  Each tool also accepts an optional `cwd` so callers can target a nested repo explicitly instead of relying on the current session working directory.
20
20
 
21
- All state lives in four repo-root files:
21
+ All state lives in five repo-root files:
22
22
 
23
23
  | File | Purpose |
24
24
  |---|---|
25
- | `autoresearch.md` | Session doc: objective, metrics, files in scope, constraints, what's been tried. A fresh agent reads this to resume. |
25
+ | `autoresearch.md` | Session doc. The plugin keeps the Metrics, How to Run, What's Been Tried, and Plugin Checkpoint sections synchronized so resumes are less agent-dependent. |
26
26
  | `autoresearch.sh` | Benchmark script. Outputs `METRIC name=number` lines. |
27
27
  | `autoresearch.jsonl` | Structured log: config headers + experiment entries (metric, status, timestamp, segment, commit hash). |
28
28
  | `autoresearch.ideas.md` | Backlog of promising ideas not yet tried. Optional. |
29
+ | `autoresearch.checkpoint.json` | Plugin-managed checkpoint: latest logged state, recent runs, and any pending unlogged run. |
29
30
 
30
31
  The design is file-first: any agent can pick up the repo-root files and continue the loop without prior context.
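The benchmark contract in the tables is line-oriented: `autoresearch.sh` prints `METRIC name=number` lines and `run_experiment` parses them. Suppose the benchmark itself is written in TypeScript and launched from `autoresearch.sh`. A minimal sketch of emitting those lines could then look like the code below. `formatMetricLines` is a hypothetical helper and is not part of the plugin. The accepted name characters (letters, digits, `_`, `.`, `-`, `µ`) are taken from `METRIC_LINE_RE` elsewhere in this diff.

```typescript
// Hypothetical helper for a benchmark script: turns measurements into
// `METRIC name=number` lines. Not part of the plugin's API.
const METRIC_NAME_RE = /^[A-Za-z0-9_.\-µ]+$/;

function formatMetricLines(measurements: Record<string, number>): string[] {
  const lines: string[] = [];
  for (const [name, value] of Object.entries(measurements)) {
    // Skip names the parser would not match and values that are not finite numbers.
    if (!METRIC_NAME_RE.test(name) || !Number.isFinite(value)) {
      continue;
    }
    // Nothing may follow the number on the line, so units belong in the
    // metric name or in init_experiment's unit, not after the value.
    lines.push(`METRIC ${name}=${String(value)}`);
  }
  return lines;
}

// Example: print lines to stdout for run_experiment to capture.
for (const line of formatMetricLines({ total_ms: 412.5, peak_rss_mb: 88 })) {
  console.log(line);
}
```

Emitting the primary metric and any secondary metrics from the same run lets them be logged together.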
31
32
 
@@ -72,6 +73,14 @@ Verify:
72
73
 
73
74
  Prefer the explicit `/autoresearch` command surface in OpenClaw. The auto-generated native skill alias `/autoresearch_create` may not trigger reliably on some hosts, so use `/skill autoresearch-create` if you need to invoke the skill directly.
74
75
 
76
+ ## Workflow Guarantees
77
+
78
+ - `run_experiment` refuses to start a second run until the previous one is logged.
79
+ - `run_experiment` parses `METRIC name=number` lines and stores a pending run so `log_experiment` can default from the actual benchmark output.
80
+ - During active autoresearch mode, raw benchmark execution through OpenClaw `exec`/`bash` is blocked. Use `run_experiment` instead.
81
+ - `autoresearch_status` warns when a pending run is unlogged or git history has moved ahead of the last logged experiment.
82
+ - The plugin updates `autoresearch.checkpoint.json` and refreshes plugin-managed sections in `autoresearch.md` after init, run, and log transitions.
83
+
75
84
  ## Use
76
85
 
77
86
  In the repo you want to optimize:
@@ -80,7 +89,7 @@ In the repo you want to optimize:
80
89
  2. Run `/autoresearch` or `/autoresearch setup <goal>`.
81
90
  3. Send a normal message with the goal, command, metric (+ direction), files in scope, and constraints.
82
91
  4. If you need the raw skill invocation, use `/skill autoresearch-create`.
83
- 5. The agent writes `autoresearch.md` and `autoresearch.sh`, runs a baseline, then starts looping.
92
+ 5. The agent writes `autoresearch.md` and `autoresearch.sh`, runs a baseline with `run_experiment`, then records it with `log_experiment`.
84
93
  6. Use `/autoresearch` or `/autoresearch status` to re-prime context on a later turn.
85
94
 
86
95
  To resume an existing session, a new agent reads the repo-root files and continues from where the last one stopped.
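The Workflow Guarantees and step 5 describe a strict run-then-log alternation. `run_experiment` refuses a second run while one is pending, and `log_experiment` can default `metric` from the captured `METRIC` output. The code below is a minimal in-memory sketch of that alternation under simplified assumptions. `ExperimentLoop`, `recordRun`, and `logRun` are hypothetical names. The plugin also persists the pending run to `autoresearch.checkpoint.json` and commits on `keep`, and this sketch omits both.

```typescript
// Hypothetical in-memory model of the run -> log alternation (no files, no git).
type Pending = { command: string; primaryMetric: number | null };
type Logged = { command: string; metric: number; status: "keep" | "discard" | "crash" };

class ExperimentLoop {
  private pending: Pending | null = null;
  readonly entries: Logged[] = [];
  private readonly metricName: string;

  constructor(metricName: string) {
    this.metricName = metricName;
  }

  // Refuses a second run while the previous one is still unlogged.
  recordRun(command: string, metrics: Record<string, number>): void {
    if (this.pending !== null) {
      throw new Error(`log the pending run first: ${this.pending.command}`);
    }
    this.pending = { command, primaryMetric: metrics[this.metricName] ?? null };
  }

  // An explicit metric wins; otherwise the captured primary metric is used.
  logRun(status: Logged["status"], metric?: number): Logged {
    const value = metric ?? this.pending?.primaryMetric ?? null;
    if (value === null) {
      // The pending run is kept so the caller can retry with an explicit metric.
      throw new Error(`no ${this.metricName} captured; pass metric explicitly`);
    }
    const entry: Logged = { command: this.pending?.command ?? "", metric: value, status };
    this.entries.push(entry);
    this.pending = null;
    return entry;
  }
}
```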
@@ -0,0 +1,80 @@
1
+ import * as fs from "node:fs";
2
+ import { AUTORESEARCH_ROOT_FILES, getAutoresearchRootFilePath } from "./files.js";
3
+ import type {
4
+ AutoresearchRunSnapshot,
5
+ AutoresearchStateSnapshot,
6
+ } from "./state.js";
7
+ import type { PendingExperimentRun } from "./runtime-state.js";
8
+
9
+ export type AutoresearchCheckpoint = {
10
+ readonly version: 1;
11
+ readonly updatedAt: number;
12
+ readonly sessionStartCommit: string | null;
13
+ readonly session: {
14
+ readonly name: string | null;
15
+ readonly metricName: string;
16
+ readonly metricUnit: string;
17
+ readonly bestDirection: "lower" | "higher";
18
+ readonly currentSegment: number;
19
+ readonly currentRunCount: number;
20
+ readonly totalRunCount: number;
21
+ readonly currentBaselineMetric: number | null;
22
+ readonly currentBestMetric: number | null;
23
+ };
24
+ readonly lastLoggedRun: AutoresearchRunSnapshot | null;
25
+ readonly recentLoggedRuns: readonly AutoresearchRunSnapshot[];
26
+ readonly pendingRun: PendingExperimentRun | null;
27
+ };
28
+
29
+ export function readAutoresearchCheckpoint(cwd: string): AutoresearchCheckpoint | null {
30
+ const checkpointPath = getAutoresearchRootFilePath(cwd, "checkpoint");
31
+ if (!fs.existsSync(checkpointPath)) {
32
+ return null;
33
+ }
34
+
35
+ try {
36
+ const parsed = JSON.parse(fs.readFileSync(checkpointPath, "utf8")) as AutoresearchCheckpoint;
37
+ return parsed?.version === 1 ? parsed : null;
38
+ } catch {
39
+ return null;
40
+ }
41
+ }
42
+
43
+ export function writeAutoresearchCheckpoint(options: {
44
+ cwd: string;
45
+ state: AutoresearchStateSnapshot;
46
+ sessionStartCommit: string | null;
47
+ recentLoggedRuns: readonly AutoresearchRunSnapshot[];
48
+ pendingRun: PendingExperimentRun | null;
49
+ }): AutoresearchCheckpoint {
50
+ const checkpoint: AutoresearchCheckpoint = {
51
+ version: 1,
52
+ updatedAt: Date.now(),
53
+ sessionStartCommit: options.sessionStartCommit,
54
+ session: {
55
+ name: options.state.name,
56
+ metricName: options.state.metricName,
57
+ metricUnit: options.state.metricUnit,
58
+ bestDirection: options.state.bestDirection,
59
+ currentSegment: options.state.currentSegment,
60
+ currentRunCount: options.state.currentRunCount,
61
+ totalRunCount: options.state.totalRunCount,
62
+ currentBaselineMetric: options.state.currentBaselineMetric,
63
+ currentBestMetric: options.state.currentBestMetric,
64
+ },
65
+ lastLoggedRun: options.state.lastRun,
66
+ recentLoggedRuns: options.recentLoggedRuns,
67
+ pendingRun: options.pendingRun,
68
+ };
69
+
70
+ const checkpointPath = getAutoresearchRootFilePath(options.cwd, "checkpoint");
71
+ fs.writeFileSync(checkpointPath, `${JSON.stringify(checkpoint, null, 2)}\n`);
72
+ return checkpoint;
73
+ }
74
+
75
+ export function deleteAutoresearchCheckpoint(cwd: string): void {
76
+ const checkpointPath = getAutoresearchRootFilePath(cwd, "checkpoint");
77
+ if (fs.existsSync(checkpointPath)) {
78
+ fs.unlinkSync(checkpointPath);
79
+ }
80
+ }
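`readAutoresearchCheckpoint` casts the parsed JSON to `AutoresearchCheckpoint` and checks only `version === 1`. A file with the right version but missing or mistyped fields is therefore still returned as a checkpoint. The code below is a minimal sketch of a stricter guard over a reduced shape. It assumes only `updatedAt`, `sessionStartCommit`, and `pendingRun.command` matter to the caller. `CheckpointLite` and `parseCheckpoint` are hypothetical. The guard works on a string so that it needs no file system access.

```typescript
// Hypothetical reduced checkpoint shape; the plugin's type carries more fields.
type CheckpointLite = {
  version: 1;
  updatedAt: number;
  sessionStartCommit: string | null;
  pendingRun: { command: string } | null;
};

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Returns null for malformed JSON, an unknown version, or missing/mistyped fields.
function parseCheckpoint(text: string): CheckpointLite | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(text);
  } catch {
    return null; // unreadable JSON is treated as "no checkpoint", as in the diff
  }
  if (!isRecord(parsed) || parsed.version !== 1) return null;

  const { updatedAt, sessionStartCommit, pendingRun } = parsed;
  if (typeof updatedAt !== "number") return null;

  let startCommit: string | null = null;
  if (typeof sessionStartCommit === "string") startCommit = sessionStartCommit;
  else if (sessionStartCommit !== null) return null;

  let pending: CheckpointLite["pendingRun"] = null;
  if (pendingRun !== null) {
    if (!isRecord(pendingRun)) return null;
    const command = pendingRun.command;
    if (typeof command !== "string") return null;
    pending = { command };
  }

  return { version: 1, updatedAt, sessionStartCommit: startCommit, pendingRun: pending };
}
```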
@@ -5,6 +5,7 @@ export const AUTORESEARCH_ROOT_FILES = {
5
5
  runnerScript: "autoresearch.sh",
6
6
  resultsLog: "autoresearch.jsonl",
7
7
  ideasBacklog: "autoresearch.ideas.md",
8
+ checkpoint: "autoresearch.checkpoint.json",
8
9
  } as const;
9
10
 
10
11
  export type AutoresearchRootFileKey = keyof typeof AUTORESEARCH_ROOT_FILES;
@@ -19,6 +19,11 @@ export type GitKeepResult = {
19
19
  readonly command: GitCommandResult;
20
20
  };
21
21
 
22
+ export type GitRuntimeOptions = {
23
+ runCommandWithTimeout: RunCommandWithTimeout;
24
+ cwd: string;
25
+ };
26
+
22
27
  async function runGitCommand(
23
28
  runCommandWithTimeout: RunCommandWithTimeout,
24
29
  cwd: string,
@@ -40,6 +45,31 @@ async function runGitCommand(
40
45
  };
41
46
  }
42
47
 
48
+ export async function readShortHeadCommit(options: GitRuntimeOptions): Promise<string | null> {
49
+ const result = await runGitCommand(options.runCommandWithTimeout, options.cwd, [
50
+ "rev-parse",
51
+ "--short=7",
52
+ "HEAD",
53
+ ]);
54
+ return result.code === 0 && result.stdout.trim().length > 0 ? result.stdout.trim() : null;
55
+ }
56
+
57
+ export async function countCommitsSince(
58
+ options: GitRuntimeOptions & { sinceCommit: string },
59
+ ): Promise<number | null> {
60
+ const result = await runGitCommand(options.runCommandWithTimeout, options.cwd, [
61
+ "rev-list",
62
+ "--count",
63
+ `${options.sinceCommit}..HEAD`,
64
+ ]);
65
+ if (result.code !== 0) {
66
+ return null;
67
+ }
68
+
69
+ const count = Number.parseInt(result.stdout.trim(), 10);
70
+ return Number.isFinite(count) ? count : null;
71
+ }
72
+
43
73
  export async function commitKeptExperiment(options: {
44
74
  runCommandWithTimeout: RunCommandWithTimeout;
45
75
  cwd: string;
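`readShortHeadCommit` runs `git rev-parse --short=7 HEAD`, and `countCommitsSince` runs `git rev-list --count <since>..HEAD`. Both map a non-zero exit to `null`. The `<since>..HEAD` range counts commits reachable from `HEAD` but not from `<since>`. `--short=7` sets a minimum length, so git may print a longer prefix when 7 characters would be ambiguous. The code below sketches the same result handling as pure functions, which can be tested without a repository. `GitResult` is a hypothetical shape, and it mirrors only the `code` and `stdout` fields the diff reads. The count parser is deliberately stricter than `Number.parseInt`, which would accept `"12abc"` as 12.

```typescript
// Hypothetical result shape; the diff reads `code` and `stdout` from runGitCommand.
type GitResult = { code: number | null; stdout: string };

// Output handling for `git rev-parse --short=7 HEAD`.
function parseShortHead(result: GitResult): string | null {
  const hash = result.stdout.trim();
  return result.code === 0 && hash.length > 0 ? hash : null;
}

// Output handling for `git rev-list --count <since>..HEAD`.
// Stricter than Number.parseInt: only an all-digit count is accepted.
function parseCommitCount(result: GitResult): number | null {
  if (result.code !== 0) return null;
  const text = result.stdout.trim();
  return /^\d+$/.test(text) ? Number(text) : null;
}

// The argv passed to git for the drift check, as in countCommitsSince.
function revListCountArgs(sinceCommit: string): string[] {
  return ["rev-list", "--count", `${sinceCommit}..HEAD`];
}
```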
@@ -64,6 +64,30 @@ export function registerAutoresearchHooks(api: OpenClawPluginApi): void {
64
64
  queueAutoresearchSteer(cwd, messageText);
65
65
  });
66
66
 
67
+ hookApi.on("before_tool_call", (event, ctx) => {
68
+ const cwd = resolveHookCwd(api, ctx);
69
+ if (cwd === null || !shouldEnforceAutoresearchMode(cwd)) {
70
+ return;
71
+ }
72
+
73
+ const record = event as Record<string, unknown>;
74
+ const toolName = typeof record.toolName === "string" ? record.toolName : "";
75
+ if (toolName !== "exec" && toolName !== "bash") {
76
+ return;
77
+ }
78
+
79
+ const command = extractToolCommand(record.params);
80
+ if (!command || !looksLikeExperimentCommand(command)) {
81
+ return;
82
+ }
83
+
84
+ return {
85
+ block: true,
86
+ blockReason:
87
+ "Autoresearch mode blocks raw benchmark execution through exec/bash. Use run_experiment so the result is captured and log_experiment can enforce the experiment lifecycle.",
88
+ };
89
+ });
90
+
67
91
  hookApi.on("agent_end", (_event, ctx) => {
68
92
  const cwd = resolveHookCwd(api, ctx);
69
93
  if (cwd === null) {
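The new `before_tool_call` hook blocks a call only when three conditions hold. Enforcement must be on, the tool must be `exec` or `bash`, and the extracted command must not be classified as read-only. Every other path returns `undefined`, so the call proceeds. The code below sketches that decision as a pure function, with enforcement and classification passed in. `decideToolCall` and `isReadOnly` are hypothetical names. The event fields `toolName` and `params`, and the `command`/`cmd`/`args` lookup order, come from the diff. The block reason is shortened here.

```typescript
type BlockDecision = { block: true; blockReason: string } | undefined;

// Same lookup order as the diff's extractToolCommand: first non-empty string wins.
function extractCommand(params: unknown): string | null {
  if (!params || typeof params !== "object") return null;
  const record = params as Record<string, unknown>;
  for (const key of ["command", "cmd", "args"]) {
    const value = record[key];
    if (typeof value === "string" && value.trim().length > 0) return value.trim();
  }
  return null;
}

// Hypothetical pure version of the hook body; enforcement and classification are injected.
function decideToolCall(
  event: { toolName?: unknown; params?: unknown },
  enforcing: boolean,
  isReadOnly: (command: string) => boolean,
): BlockDecision {
  if (!enforcing) return undefined;
  const toolName = typeof event.toolName === "string" ? event.toolName : "";
  if (toolName !== "exec" && toolName !== "bash") return undefined;
  const command = extractCommand(event.params);
  if (!command || isReadOnly(command)) return undefined;
  return { block: true, blockReason: "Use run_experiment instead of raw exec/bash." };
}
```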
@@ -112,11 +136,7 @@ export function registerAutoresearchHooks(api: OpenClawPluginApi): void {
112
136
  export function buildBeforePromptBuildContext(cwd: string): string | null {
113
137
  const state = reconstructStateFromJsonl(cwd);
114
138
  const runtimeState = getAutoresearchRuntimeState(cwd);
115
- const modeEnabled =
116
- runtimeState.mode === "on" ||
117
- (runtimeState.mode !== "off" && (state.mode === "active" || state.hasSessionDoc));
118
-
119
- if (!modeEnabled) {
139
+ if (!shouldEnforceAutoresearchMode(cwd, state, runtimeState)) {
120
140
  return null;
121
141
  }
122
142
 
@@ -144,6 +164,8 @@ export function buildBeforePromptBuildContext(cwd: string): string | null {
144
164
  `Read ${AUTORESEARCH_ROOT_FILES.sessionDoc} before resuming or changing the experiment loop, and re-read it after compaction.`,
145
165
  "Resume the autonomous upstream loop: edit, run_experiment, log_experiment, keep/discard/crash, repeat.",
146
166
  "Use init_experiment, run_experiment, and log_experiment for experiment state changes. Never stop unless the user explicitly interrupts the loop.",
167
+ "Never run benchmark or test commands through raw exec/bash during autoresearch mode. Use run_experiment so the plugin can capture metrics, enforce logging, and preserve resumable state.",
168
+ "After every run_experiment, call log_experiment before starting another run. If METRIC lines were captured, log_experiment can infer commit and metric from the pending run.",
147
169
  );
148
170
  if (pendingCommand?.args) {
149
171
  lines.push(`Additional resume instruction from /autoresearch: ${pendingCommand.args}`);
@@ -232,3 +254,49 @@ function firstString(...values: unknown[]): string | null {
232
254
  function isCommandLikeMessage(text: string): boolean {
233
255
  return /^[\/!]/.test(text.trim());
234
256
  }
257
+
258
+ function shouldEnforceAutoresearchMode(
259
+ cwd: string,
260
+ state = reconstructStateFromJsonl(cwd),
261
+ runtimeState = getAutoresearchRuntimeState(cwd),
262
+ ): boolean {
263
+ return (
264
+ runtimeState.mode === "on" ||
265
+ runtimeState.runInFlight ||
266
+ runtimeState.pendingRun !== null ||
267
+ (runtimeState.mode !== "off" && (state.mode === "active" || state.hasSessionDoc))
268
+ );
269
+ }
270
+
271
+ function extractToolCommand(params: unknown): string | null {
272
+ if (!params || typeof params !== "object") {
273
+ return null;
274
+ }
275
+
276
+ const record = params as Record<string, unknown>;
277
+ for (const key of ["command", "cmd", "args"]) {
278
+ const value = record[key];
279
+ if (typeof value === "string" && value.trim().length > 0) {
280
+ return value.trim();
281
+ }
282
+ }
283
+
284
+ return null;
285
+ }
286
+
287
+ function looksLikeExperimentCommand(command: string): boolean {
288
+ const normalized = command.trim();
289
+ if (!normalized) {
290
+ return false;
291
+ }
292
+
293
+ const readOnlyPatterns = [
294
+ /^(pwd|ls|find|rg|grep|sed|cat|head|tail|wc|stat)\b/,
295
+ /^git\s+(status|diff|show|log|rev-parse|branch|remote)\b/,
296
+ ];
297
+ if (readOnlyPatterns.some((pattern) => pattern.test(normalized))) {
298
+ return false;
299
+ }
300
+
301
+ return true;
302
+ }
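As written, `looksLikeExperimentCommand` treats a command as read-only when its start matches one of the patterns. A compound command such as `ls && ./autoresearch.sh` therefore passes through unblocked. Any unlisted command, such as `npm install` or `echo`, is blocked. `sed` is on the list even though `sed -i` edits files. The code below sketches a stricter variant that splits on common shell control operators (`&&`, `||`, `;`, `|`, newline) and requires every segment to be read-only. `isReadOnlyCompound` is hypothetical. The split is naive and ignores quoting, subshells, and `$(...)` substitution, so this remains a heuristic rather than a shell parser.

```typescript
// Same read-only prefixes as the diff's looksLikeExperimentCommand.
const READ_ONLY_PATTERNS = [
  /^(pwd|ls|find|rg|grep|sed|cat|head|tail|wc|stat)\b/,
  /^git\s+(status|diff|show|log|rev-parse|branch|remote)\b/,
];

// Hypothetical stricter check: every segment between control operators must be read-only.
// Naive split: ignores quoting, subshells, and $(...) substitution.
function isReadOnlyCompound(command: string): boolean {
  const segments = command
    .split(/&&|\|\||;|\||\n/)
    .map((segment) => segment.trim())
    .filter((segment) => segment.length > 0);
  if (segments.length === 0) return false;
  return segments.every((segment) => READ_ONLY_PATTERNS.some((re) => re.test(segment)));
}
```

In the hook, `looksLikeExperimentCommand(command)` would correspond to `!isReadOnlyCompound(command)`.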
@@ -0,0 +1,24 @@
1
+ const METRIC_LINE_RE =
2
+ /^METRIC\s+([A-Za-z0-9_.\-µ]+)\s*=\s*(-?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?)\s*$/;
3
+
4
+ export function parseMetricLines(output: string): Record<string, number> {
5
+ const metrics = new Map<string, number>();
6
+
7
+ for (const rawLine of output.split(/\r?\n/)) {
8
+ const line = rawLine.trim();
9
+ const match = METRIC_LINE_RE.exec(line);
10
+ if (!match) {
11
+ continue;
12
+ }
13
+
14
+ const [, name, valueText] = match;
15
+ const value = Number(valueText);
16
+ if (!name || !Number.isFinite(value)) {
17
+ continue;
18
+ }
19
+
20
+ metrics.set(name, value);
21
+ }
22
+
23
+ return Object.fromEntries(metrics.entries());
24
+ }
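The snippet below runs `parseMetricLines` on sample output. The function is condensed from the diff so that the snippet runs standalone. It illustrates four behaviors:

- whitespace around `=` is tolerated;
- scientific notation and leading-dot decimals are accepted;
- a trailing unit (`12 ms`) prevents the line from matching at all;
- a repeated name keeps its last value but its first position in the result.

```typescript
// Condensed from the diff so the example runs on its own.
const METRIC_LINE_RE =
  /^METRIC\s+([A-Za-z0-9_.\-µ]+)\s*=\s*(-?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?)\s*$/;

function parseMetricLines(output: string): Record<string, number> {
  const metrics = new Map<string, number>();
  for (const rawLine of output.split(/\r?\n/)) {
    const match = METRIC_LINE_RE.exec(rawLine.trim());
    if (!match) continue;
    const [, name, valueText] = match;
    const value = Number(valueText);
    if (!name || !Number.isFinite(value)) continue;
    metrics.set(name, value); // a later line with the same name overwrites the value
  }
  return Object.fromEntries(metrics.entries());
}

const sample = [
  "building...",
  "METRIC total_ms = 1.5e3",
  "  METRIC hit_rate=.5  ",
  "METRIC latency=12 ms", // trailing unit: does not match
  "METRIC total_ms=1400", // duplicate: last value wins
].join("\n");

console.log(parseMetricLines(sample)); // { total_ms: 1400, hit_rate: 0.5 }
```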
@@ -7,12 +7,26 @@ export type PendingAutoresearchCommand =
7
7
  }
8
8
  | null;
9
9
 
10
+ export type PendingExperimentRun = {
11
+ readonly command: string;
12
+ readonly commit: string | null;
13
+ readonly primaryMetric: number | null;
14
+ readonly metrics: Record<string, number>;
15
+ readonly durationSeconds: number;
16
+ readonly exitCode: number | null;
17
+ readonly passed: boolean;
18
+ readonly timedOut: boolean;
19
+ readonly tailOutput: string;
20
+ readonly capturedAt: number;
21
+ };
22
+
10
23
  export type AutoresearchRuntimeSnapshot = {
11
24
  readonly mode: AutoresearchRuntimeMode;
12
25
  readonly runInFlight: boolean;
13
26
  readonly queuedSteers: readonly string[];
14
27
  readonly needsContinuationReminder: boolean;
15
28
  readonly pendingCommand: PendingAutoresearchCommand;
29
+ readonly pendingRun: PendingExperimentRun | null;
16
30
  };
17
31
 
18
32
  type MutableAutoresearchRuntimeState = {
@@ -21,6 +35,7 @@ type MutableAutoresearchRuntimeState = {
21
35
  queuedSteers: string[];
22
36
  needsContinuationReminder: boolean;
23
37
  pendingCommand: PendingAutoresearchCommand;
38
+ pendingRun: PendingExperimentRun | null;
24
39
  };
25
40
 
26
41
  const MAX_QUEUED_STEERS = 20;
@@ -33,6 +48,7 @@ function createDefaultRuntimeState(): MutableAutoresearchRuntimeState {
33
48
  queuedSteers: [],
34
49
  needsContinuationReminder: false,
35
50
  pendingCommand: null,
51
+ pendingRun: null,
36
52
  };
37
53
  }
38
54
 
@@ -53,6 +69,7 @@ export function getAutoresearchRuntimeState(cwd: string): AutoresearchRuntimeSna
53
69
  queuedSteers: [...state.queuedSteers],
54
70
  needsContinuationReminder: state.needsContinuationReminder,
55
71
  pendingCommand: state.pendingCommand,
72
+ pendingRun: state.pendingRun,
56
73
  };
57
74
  }
58
75
 
@@ -138,6 +155,26 @@ export function consumeAutoresearchContinuationReminder(cwd: string): boolean {
138
155
  return needsReminder;
139
156
  }
140
157
 
158
+ export function setAutoresearchPendingRun(
159
+ cwd: string,
160
+ pendingRun: PendingExperimentRun | null,
161
+ ): AutoresearchRuntimeSnapshot {
162
+ const state = getMutableRuntimeState(cwd);
163
+ state.pendingRun = pendingRun;
164
+ return getAutoresearchRuntimeState(cwd);
165
+ }
166
+
167
+ export function getAutoresearchPendingRun(cwd: string): PendingExperimentRun | null {
168
+ return getMutableRuntimeState(cwd).pendingRun;
169
+ }
170
+
171
+ export function consumeAutoresearchPendingRun(cwd: string): PendingExperimentRun | null {
172
+ const state = getMutableRuntimeState(cwd);
173
+ const pendingRun = state.pendingRun;
174
+ state.pendingRun = null;
175
+ return pendingRun;
176
+ }
177
+
141
178
  export function clearAutoresearchRuntimeState(cwd: string): void {
142
179
  runtimeStates.delete(cwd);
143
180
  }
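The runtime-state additions keep at most one pending run per working directory. `setAutoresearchPendingRun` overwrites it. `getAutoresearchPendingRun` reads it without clearing. `consumeAutoresearchPendingRun` returns it and clears it, so a consumed run is not returned again. This slot lives in process memory. For that reason `log_experiment` in this diff reads `getAutoresearchPendingRun(cwd) ?? checkpoint?.pendingRun ?? null`, which lets a pending run persisted to `autoresearch.checkpoint.json` still be used when the in-memory slot is empty. The code below is a generic sketch of that pattern. `PendingStore` and `resolvePending` are hypothetical names.

```typescript
// Hypothetical generic version of the per-cwd pending-run slot.
class PendingStore<T> {
  private readonly slots = new Map<string, T>();

  set(cwd: string, value: T | null): void {
    if (value === null) this.slots.delete(cwd);
    else this.slots.set(cwd, value);
  }

  get(cwd: string): T | null {
    return this.slots.get(cwd) ?? null;
  }

  // Read-and-clear: a second consume for the same cwd returns null.
  consume(cwd: string): T | null {
    const value = this.get(cwd);
    this.slots.delete(cwd);
    return value;
  }
}

// Mirrors the in-memory-first, checkpoint-second lookup used by log_experiment.
function resolvePending<T>(store: PendingStore<T>, cwd: string, fromCheckpoint: T | null): T | null {
  return store.get(cwd) ?? fromCheckpoint ?? null;
}
```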
@@ -0,0 +1,102 @@
1
+ import * as fs from "node:fs";
2
+ import { AUTORESEARCH_ROOT_FILES, getAutoresearchRootFilePath } from "./files.js";
3
+ import type { AutoresearchCheckpoint } from "./checkpoint.js";
4
+
5
+ export function syncAutoresearchSessionDoc(
6
+ cwd: string,
7
+ checkpoint: AutoresearchCheckpoint,
8
+ ): void {
9
+ const sessionDocPath = getAutoresearchRootFilePath(cwd, "sessionDoc");
10
+ const existing = fs.existsSync(sessionDocPath) ? fs.readFileSync(sessionDocPath, "utf8") : "";
11
+ let doc = ensureTitle(existing, checkpoint.session.name);
12
+
13
+ doc = upsertSection(
14
+ doc,
15
+ "## Metrics",
16
+ [
17
+ `- **Primary**: ${checkpoint.session.metricName} (${checkpoint.session.metricUnit || "unitless"}, ${checkpoint.session.bestDirection} is better)`,
18
+ ].join("\n"),
19
+ );
20
+
21
+ doc = upsertSection(
22
+ doc,
23
+ "## How to Run",
24
+ `\`${AUTORESEARCH_ROOT_FILES.runnerScript}\` — should emit \`METRIC name=number\` lines for ${checkpoint.session.metricName}.`,
25
+ );
26
+
27
+ doc = upsertSection(doc, "## What's Been Tried", buildTriedSection(checkpoint));
28
+ doc = upsertSection(doc, "## Plugin Checkpoint", buildCheckpointSection(checkpoint));
29
+
30
+ fs.writeFileSync(sessionDocPath, `${doc.trimEnd()}\n`);
31
+ }
32
+
33
+ function ensureTitle(doc: string, sessionName: string | null): string {
34
+ const trimmed = doc.trim();
35
+ if (!trimmed) {
36
+ return `# Autoresearch: ${sessionName ?? "Session"}\n`;
37
+ }
38
+
39
+ if (/^#\s+/m.test(trimmed)) {
40
+ return trimmed;
41
+ }
42
+
43
+ return `# Autoresearch: ${sessionName ?? "Session"}\n\n${trimmed}`;
44
+ }
45
+
46
+ function upsertSection(doc: string, heading: string, body: string): string {
47
+ const escapedHeading = heading.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
48
+ const sectionRe = new RegExp(`(^${escapedHeading}\\n)([\\s\\S]*?)(?=^##\\s|\\Z)`, "m");
49
+ const rendered = `${heading}\n${body.trim()}\n\n`;
50
+
51
+ if (sectionRe.test(doc)) {
52
+ return doc.replace(sectionRe, rendered);
53
+ }
54
+
55
+ return `${doc.trimEnd()}\n\n${rendered}`;
56
+ }
57
+
58
+ function buildTriedSection(checkpoint: AutoresearchCheckpoint): string {
59
+ if (checkpoint.recentLoggedRuns.length === 0) {
60
+ return "- No logged experiments yet.";
61
+ }
62
+
63
+ return checkpoint.recentLoggedRuns
64
+ .map((run) => {
65
+ const metricUnit = checkpoint.session.metricUnit;
66
+ const renderedMetric =
67
+ metricUnit && metricUnit.length > 0 ? `${run.metric}${metricUnit}` : `${run.metric}`;
68
+ return `- #${run.run} ${run.status} ${renderedMetric} ${run.commit} — ${run.description}`;
69
+ })
70
+ .join("\n");
71
+ }
72
+
73
+ function buildCheckpointSection(checkpoint: AutoresearchCheckpoint): string {
74
+ const lines = [
75
+ `- Last updated: ${new Date(checkpoint.updatedAt).toISOString()}`,
76
+ `- Runs tracked: ${checkpoint.session.currentRunCount} current / ${checkpoint.session.totalRunCount} total`,
77
+ `- Baseline: ${formatMetric(checkpoint.session.currentBaselineMetric, checkpoint.session.metricUnit)}`,
78
+ `- Best kept: ${formatMetric(checkpoint.session.currentBestMetric, checkpoint.session.metricUnit)}`,
79
+ ];
80
+
81
+ if (checkpoint.lastLoggedRun) {
82
+ lines.push(
83
+ `- Last logged run: #${checkpoint.lastLoggedRun.run} ${checkpoint.lastLoggedRun.status} ${checkpoint.lastLoggedRun.commit} — ${checkpoint.lastLoggedRun.description}`,
84
+ );
85
+ }
86
+
87
+ if (checkpoint.pendingRun) {
88
+ lines.push(
89
+ `- Pending run awaiting log_experiment: ${checkpoint.pendingRun.command} (${formatMetric(checkpoint.pendingRun.primaryMetric, checkpoint.session.metricUnit)})`,
90
+ );
91
+ }
92
+
93
+ return lines.join("\n");
94
+ }
95
+
96
+ function formatMetric(value: number | null, unit: string): string {
97
+ if (value === null) {
98
+ return "n/a";
99
+ }
100
+
101
+ return `${value}${unit}`;
102
+ }
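JavaScript regular expressions have no `\Z` anchor. Without the `u` flag, `\Z` is an identity escape that matches the letter `Z`, and with `u` it is a syntax error. In the diff's `upsertSection` lookahead `(?=^##\s|\Z)`, the last section in the document therefore ends only at the next literal `Z`. The timestamp written into "Plugin Checkpoint" by `toISOString()` ends in `Z`, so a re-sync replaces text only up to that character and leaves the older lines behind it. If no `Z` follows, the section is not matched and a duplicate heading is appended. `doc.replace(sectionRe, rendered)` also interprets `$&`, `$1`, and similar sequences in the rendered text as replacement patterns. The code below sketches the same upsert using `(?![\s\S])` (true end of input) and a function replacer. `upsertSectionFixed` is a hypothetical name.

```typescript
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Same shape as the diff's upsertSection, but a section ends at the next "## "
// heading or at true end of input: (?![\s\S]) matches only where no character follows.
function upsertSectionFixed(doc: string, heading: string, body: string): string {
  const sectionRe = new RegExp(
    `(^${escapeRegExp(heading)}\\n)([\\s\\S]*?)(?=^##\\s|(?![\\s\\S]))`,
    "m",
  );
  const rendered = `${heading}\n${body.trim()}\n\n`;
  if (sectionRe.test(doc)) {
    // A function replacer inserts `rendered` literally, with no `$` pattern expansion.
    return doc.replace(sectionRe, () => rendered);
  }
  return `${doc.trimEnd()}\n\n${rendered}`;
}
```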
@@ -211,6 +211,62 @@ export function reconstructStateFromJsonl(cwd: string): AutoresearchStateSnapsho
211
211
  };
212
212
  }
213
213
 
214
+ export function readRecentLoggedRuns(
215
+ cwd: string,
216
+ limit: number,
217
+ ): readonly AutoresearchRunSnapshot[] {
218
+ const jsonl = readAutoresearchRootFile(cwd, "resultsLog");
219
+ if (jsonl === null || limit <= 0) {
220
+ return [];
221
+ }
222
+
223
+ const runs: AutoresearchRunSnapshot[] = [];
224
+ let currentSegment = 0;
225
+ let currentRunIndex = 0;
226
+ let hasSeenAnyRun = false;
227
+
228
+ const lines = jsonl
229
+ .split("\n")
230
+ .map((line) => line.trim())
231
+ .filter(Boolean);
232
+
233
+ for (const line of lines) {
234
+ let entry: JsonlEntry;
235
+ try {
236
+ entry = JSON.parse(line) as JsonlEntry;
237
+ } catch {
238
+ continue;
239
+ }
240
+
241
+ if (entry.type === "config") {
242
+ if (hasSeenAnyRun) {
243
+ currentSegment += 1;
244
+ }
245
+ currentRunIndex = 0;
246
+ continue;
247
+ }
248
+
249
+ if (typeof entry.metric !== "number") {
250
+ continue;
251
+ }
252
+
253
+ hasSeenAnyRun = true;
254
+ currentRunIndex += 1;
255
+ runs.push({
256
+ run: typeof entry.run === "number" ? entry.run : currentRunIndex,
257
+ commit: entry.commit ?? "",
258
+ metric: entry.metric,
259
+ metrics: normalizeMetrics(entry.metrics),
260
+ status: entry.status ?? "keep",
261
+ description: entry.description ?? "",
262
+ timestamp: typeof entry.timestamp === "number" ? entry.timestamp : 0,
263
+ segment: typeof entry.segment === "number" ? entry.segment : currentSegment,
264
+ });
265
+ }
266
+
267
+ return runs.slice(-limit);
268
+ }
269
+
214
270
  function normalizeMetrics(metrics: Record<string, number> | undefined): Record<string, number> {
215
271
  if (!metrics || typeof metrics !== "object") {
216
272
  return {};
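`readRecentLoggedRuns` walks `autoresearch.jsonl` line by line and applies these rules:

- malformed JSON lines are skipped;
- a `config` header that follows at least one logged run starts a new segment and resets the fallback run counter;
- entries without a numeric `metric` are ignored;
- explicit `run` and `segment` fields take precedence over the counters;
- only the last `limit` runs are returned.

The code below sketches that walk over a string input with a reduced entry type. `RecentRun` and `recentRuns` are hypothetical names. One guard is added that the diff lacks: a line that parses to a non-object such as `null` is skipped, whereas the diff would throw when reading `entry.type`.

```typescript
type RecentRun = { run: number; metric: number; status: string; segment: number };

function recentRuns(jsonl: string, limit: number): RecentRun[] {
  if (limit <= 0) return [];
  const runs: RecentRun[] = [];
  let segment = 0; // bumped by each config header that follows a logged run
  let runIndex = 0; // fallback run number, reset per segment
  let seenRun = false;

  for (const line of jsonl.split("\n").map((l) => l.trim()).filter(Boolean)) {
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      continue; // malformed lines are skipped
    }
    // Extra guard not in the diff: skip JSON values that are not objects.
    if (typeof parsed !== "object" || parsed === null) continue;
    const entry = parsed as Record<string, unknown>;

    if (entry.type === "config") {
      if (seenRun) segment += 1; // the first header does not bump the segment
      runIndex = 0;
      continue;
    }

    const { metric, run, status, segment: entrySegment } = entry;
    if (typeof metric !== "number") continue;
    seenRun = true;
    runIndex += 1;
    runs.push({
      run: typeof run === "number" ? run : runIndex,
      metric,
      status: typeof status === "string" ? status : "keep",
      segment: typeof entrySegment === "number" ? entrySegment : segment,
    });
  }
  return runs.slice(-limit);
}
```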
@@ -1,8 +1,22 @@
1
1
  import type { OpenClawPluginApi } from "openclaw/plugin-sdk";
2
2
  import { Type } from "@sinclair/typebox";
3
3
  import { reconstructStateFromJsonl, type AutoresearchStateSnapshot } from "../state.js";
4
- import { getAutoresearchRuntimeState, type AutoresearchRuntimeSnapshot } from "../runtime-state.js";
4
+ import {
5
+ getAutoresearchRuntimeState,
6
+ type AutoresearchRuntimeSnapshot,
7
+ } from "../runtime-state.js";
5
8
  import { resolveToolCwd } from "./tool-cwd.js";
9
+ import {
10
+ readAutoresearchCheckpoint,
11
+ type AutoresearchCheckpoint,
12
+ } from "../checkpoint.js";
13
+ import { countCommitsSince, readShortHeadCommit } from "../git.js";
14
+
15
+ export type AutoresearchStatusDiagnostics = {
16
+ readonly warnings: readonly string[];
17
+ readonly checkpoint: AutoresearchCheckpoint | null;
18
+ readonly gitHead: string | null;
19
+ };
6
20
 
7
21
  const AutoresearchStatusParams = Type.Object(
8
22
  {
@@ -34,13 +48,20 @@ export function createAutoresearchStatusTool(api: OpenClawPluginApi) {
34
48
  const cwd = resolveToolCwd(api, params.cwd);
35
49
  const state = reconstructStateFromJsonl(cwd);
36
50
  const runtimeState = getAutoresearchRuntimeState(cwd);
51
+ const diagnostics = await buildAutoresearchStatusDiagnostics(api, cwd, state);
37
52
 
38
53
  return {
39
- content: [{ type: "text" as const, text: formatAutoresearchStatusText(state, runtimeState) }],
54
+ content: [
55
+ {
56
+ type: "text" as const,
57
+ text: formatAutoresearchStatusText(state, runtimeState, diagnostics),
58
+ },
59
+ ],
40
60
  details: {
41
61
  status: "ok",
42
62
  state,
43
63
  runtime: runtimeState,
64
+ diagnostics,
44
65
  },
45
66
  };
46
67
  },
@@ -50,10 +71,12 @@ export function createAutoresearchStatusTool(api: OpenClawPluginApi) {
50
71
  export function formatAutoresearchStatusText(
51
72
  state: AutoresearchStateSnapshot,
52
73
  runtimeState?: AutoresearchRuntimeSnapshot,
74
+ diagnostics?: AutoresearchStatusDiagnostics,
53
75
  ): string {
54
76
  const lines = [
55
77
  `Mode: ${state.mode}`,
56
78
  `Session doc: ${state.hasSessionDoc ? "present" : "missing"}`,
79
+ `Checkpoint: ${diagnostics?.checkpoint ? "present" : "missing"}`,
57
80
  `Ideas backlog: ${state.ideas.hasBacklog ? `${state.ideas.pendingCount} pending` : "empty"}`,
58
81
  `Metric: ${state.metricName} (${state.metricUnit || "unitless"}, ${state.bestDirection} is better)`,
59
82
  `Current segment: ${state.currentSegment}`,
@@ -74,6 +97,7 @@ export function formatAutoresearchStatusText(
74
97
  `Experiment window: ${runtimeState.runInFlight ? "running" : "idle"}`,
75
98
  `Queued steers: ${runtimeState.queuedSteers.length}`,
76
99
  );
100
+ lines.splice(state.name ? 5 : 4, 0, `Pending run: ${runtimeState.pendingRun ? "yes" : "no"}`);
77
101
  }
78
102
 
79
103
  if (state.lastRun) {
@@ -86,6 +110,17 @@ export function formatAutoresearchStatusText(
86
110
  lines.push(`Ideas preview: ${state.ideas.preview.join(" | ")}`);
87
111
  }
88
112
 
113
+ if (diagnostics?.gitHead) {
114
+ lines.push(`Git HEAD: ${diagnostics.gitHead}`);
115
+ }
116
+
117
+ if (diagnostics && diagnostics.warnings.length > 0) {
118
+ lines.push("", "Warnings:");
119
+ for (const warning of diagnostics.warnings) {
120
+ lines.push(`- ${warning}`);
121
+ }
122
+ }
123
+
89
124
  return lines.join("\n");
90
125
  }
91
126
 
@@ -97,3 +132,48 @@ function formatMetric(value: number | null, unit: string): string {
97
132
  const rendered = value === Math.round(value) ? `${Math.round(value)}` : value.toFixed(2);
98
133
  return `${rendered}${unit}`;
99
134
  }
135
+
136
+ async function buildAutoresearchStatusDiagnostics(
137
+ api: OpenClawPluginApi,
138
+ cwd: string,
139
+ state: AutoresearchStateSnapshot,
140
+ ): Promise<AutoresearchStatusDiagnostics> {
141
+ const checkpoint = readAutoresearchCheckpoint(cwd);
142
+ const gitHead = await readShortHeadCommit({
143
+ runCommandWithTimeout: api.runtime.system.runCommandWithTimeout,
144
+ cwd,
145
+ });
146
+ const warnings: string[] = [];
147
+
148
+ if (checkpoint?.pendingRun) {
149
+ warnings.push(
150
+ `A previous run_experiment is still pending log_experiment: ${checkpoint.pendingRun.command}`,
151
+ );
152
+ }
153
+
154
+ const driftBase =
155
+ state.lastRun?.commit && state.lastRun.commit.length > 0
156
+ ? state.lastRun.commit
157
+ : checkpoint?.sessionStartCommit ?? null;
158
+ if (driftBase) {
159
+ const commitsAhead = await countCommitsSince({
160
+ runCommandWithTimeout: api.runtime.system.runCommandWithTimeout,
161
+ cwd,
162
+ sinceCommit: driftBase,
163
+ });
164
+
165
+ if (commitsAhead !== null && commitsAhead > 0) {
166
+ warnings.push(
167
+ state.lastRun
168
+ ? `${commitsAhead} commit${commitsAhead === 1 ? "" : "s"} since the last logged experiment (${state.lastRun.commit}).`
169
+ : `${commitsAhead} commit${commitsAhead === 1 ? "" : "s"} since init_experiment, but no experiment has been logged yet.`,
170
+ );
171
+ }
172
+ }
173
+
174
+ return {
175
+ warnings,
176
+ checkpoint,
177
+ gitHead,
178
+ };
179
+ }
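`buildAutoresearchStatusDiagnostics` takes the last logged run's commit as the drift base when it is non-empty, and otherwise falls back to `sessionStartCommit` from the checkpoint. It warns only when `git rev-list --count <base>..HEAD` is positive. If git cannot resolve the base, for example after a rebase, `rev-list` exits non-zero, the count is `null`, and no warning is produced. The code below sketches the base selection and message wording as pure functions, with the commit count supplied instead of computed. `LastRun`, `chooseDriftBase`, and `driftWarning` are hypothetical names.

```typescript
// Hypothetical pure version of the drift check; `lastRun` stands in for state.lastRun.
type LastRun = { commit: string } | null;

function chooseDriftBase(lastRun: LastRun, sessionStartCommit: string | null): string | null {
  return lastRun && lastRun.commit.length > 0 ? lastRun.commit : sessionStartCommit;
}

function driftWarning(commitsAhead: number | null, lastRun: LastRun): string | null {
  if (commitsAhead === null || commitsAhead <= 0) return null;
  const noun = `commit${commitsAhead === 1 ? "" : "s"}`;
  // As in the diff, the wording depends on whether any run was logged at all.
  return lastRun
    ? `${commitsAhead} ${noun} since the last logged experiment (${lastRun.commit}).`
    : `${commitsAhead} ${noun} since init_experiment, but no experiment has been logged yet.`;
}
```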
@@ -3,9 +3,14 @@ import { InitExperimentParams } from "./schemas.js";
3
3
  import { createConfigHeader, writeConfigHeader } from "../logging.js";
4
4
  import {
5
5
  createEmptyStateSnapshot,
6
+ readRecentLoggedRuns,
6
7
  reconstructStateFromJsonl,
7
8
  type AutoresearchStateSnapshot,
8
9
  } from "../state.js";
10
+ import { readAutoresearchCheckpoint, writeAutoresearchCheckpoint } from "../checkpoint.js";
11
+ import { syncAutoresearchSessionDoc } from "../session-doc.js";
12
+ import { readShortHeadCommit } from "../git.js";
13
+ import { setAutoresearchPendingRun, setAutoresearchRunInFlight } from "../runtime-state.js";
9
14
  import { resolveToolCwd } from "./tool-cwd.js";
10
15
 
11
16
  export function createInitExperimentTool(api: OpenClawPluginApi) {
@@ -29,6 +34,7 @@ export function createInitExperimentTool(api: OpenClawPluginApi) {
29
34
  ) {
30
35
  const cwd = resolveToolCwd(api, params.cwd);
31
36
  const previousState = reconstructStateFromJsonl(cwd);
37
+ const previousCheckpoint = readAutoresearchCheckpoint(cwd);
32
38
  const isReinit = previousState.currentRunCount > 0;
33
39
  const nextState: AutoresearchStateSnapshot = {
34
40
  ...createEmptyStateSnapshot(),
@@ -66,6 +72,23 @@ export function createInitExperimentTool(api: OpenClawPluginApi) {
66
72
  };
67
73
  }
68
74
 
75
+ setAutoresearchPendingRun(cwd, null);
76
+ setAutoresearchRunInFlight(cwd, false);
77
+
78
+ const nextPersistentState = reconstructStateFromJsonl(cwd);
79
+ const sessionStartCommit = await readShortHeadCommit({
80
+ runCommandWithTimeout: api.runtime.system.runCommandWithTimeout,
81
+ cwd,
82
+ });
83
+ const checkpoint = writeAutoresearchCheckpoint({
84
+ cwd,
85
+ state: nextPersistentState,
86
+ sessionStartCommit: sessionStartCommit ?? previousCheckpoint?.sessionStartCommit ?? null,
87
+ recentLoggedRuns: readRecentLoggedRuns(cwd, 8),
88
+ pendingRun: null,
89
+ });
90
+ syncAutoresearchSessionDoc(cwd, checkpoint);
91
+
69
92
  const reinitNote = isReinit
70
93
  ? " (re-initialized - previous results archived, new baseline needed)"
71
94
  : "";
@@ -77,7 +100,7 @@ export function createInitExperimentTool(api: OpenClawPluginApi) {
77
100
  text:
78
101
  `Experiment initialized: "${nextState.name}"${reinitNote}\n` +
79
102
  `Metric: ${nextState.metricName} (${nextState.metricUnit || "unitless"}, ${nextState.bestDirection} is better)\n` +
80
- "Config written to autoresearch.jsonl. Now run the baseline with run_experiment.",
103
+ "Config written to autoresearch.jsonl. Now run the baseline with run_experiment, then log it before starting another run.",
81
104
  },
82
105
  ],
83
106
  details: {
@@ -1,19 +1,24 @@
1
1
  import * as fs from "node:fs";
2
2
  import type { OpenClawPluginApi } from "openclaw/plugin-sdk";
3
3
  import { LogExperimentParams } from "./schemas.js";
4
- import { commitKeptExperiment } from "../git.js";
4
+ import { commitKeptExperiment, readShortHeadCommit } from "../git.js";
5
5
  import { appendResultEntry, type AutoresearchResultEntry } from "../logging.js";
6
6
  import { getAutoresearchRootFilePath } from "../files.js";
7
7
  import {
8
+ readRecentLoggedRuns,
8
9
  reconstructStateFromJsonl,
9
10
  type AutoresearchStateSnapshot,
10
11
  type SecondaryMetricDef,
11
12
  } from "../state.js";
12
13
  import {
14
+ consumeAutoresearchPendingRun,
15
+ getAutoresearchPendingRun,
13
16
  consumeAutoresearchSteers,
14
17
  setAutoresearchRunInFlight,
15
18
  } from "../runtime-state.js";
16
19
  import { resolveToolCwd } from "./tool-cwd.js";
20
+ import { readAutoresearchCheckpoint, writeAutoresearchCheckpoint } from "../checkpoint.js";
21
+ import { syncAutoresearchSessionDoc } from "../session-doc.js";
17
22
 
18
23
  export function createLogExperimentTool(api: OpenClawPluginApi) {
19
24
  return {
@@ -26,8 +31,8 @@ export function createLogExperimentTool(api: OpenClawPluginApi) {
26
31
  _toolCallId: string,
27
32
  params: {
28
33
  cwd?: string;
29
- commit: string;
30
- metric: number;
34
+ commit?: string;
35
+ metric?: number;
31
36
  status: "keep" | "discard" | "crash";
32
37
  description: string;
33
38
  metrics?: Record<string, number>;
@@ -37,8 +42,37 @@ export function createLogExperimentTool(api: OpenClawPluginApi) {
37
42
  _onUpdate: unknown,
38
43
  ) {
39
44
  const cwd = resolveToolCwd(api, params.cwd);
45
+ const checkpoint = readAutoresearchCheckpoint(cwd);
46
+ const pendingRun = getAutoresearchPendingRun(cwd) ?? checkpoint?.pendingRun ?? null;
40
47
  const state = reconstructStateFromJsonl(cwd);
41
- const secondaryMetrics = params.metrics ?? {};
48
+ const secondaryMetrics = params.metrics ?? pendingRun?.metrics ?? {};
49
+ const inferredCommit =
50
+ params.commit ??
51
+ pendingRun?.commit ??
52
+ (await readShortHeadCommit({
53
+ runCommandWithTimeout: api.runtime.system.runCommandWithTimeout,
54
+ cwd,
55
+ })) ??
56
+ "";
57
+ const inferredMetric = params.metric ?? pendingRun?.primaryMetric;
58
+
59
+ if (inferredMetric === null || inferredMetric === undefined) {
60
+ return {
61
+ content: [
62
+ {
63
+ type: "text" as const,
64
+ text:
65
+ "No primary metric is available to log.\n" +
66
+ `Expected a METRIC line for ${state.metricName} from run_experiment, or provide metric explicitly.`,
67
+ },
68
+ ],
69
+ details: {
70
+ status: "error",
71
+ phase: "metric",
72
+ pendingRun,
73
+ },
74
+ };
75
+ }
42
76
 
43
77
  if (state.secondaryMetrics.length > 0) {
44
78
  const validationError = validateSecondaryMetrics(
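In the hunk above, `log_experiment` resolves `commit` and `metric` through a fallback chain. It tries the explicit parameter first, then the pending run, then (for `commit` only) the short HEAD hash, and finally an empty string. A missing metric becomes a `phase: "metric"` error. The standalone sketch below restates that precedence. `PendingRunLike`, `LogParamsLike` and `inferLogDefaults` are hypothetical names. The git lookup is replaced by a plain `headCommit` argument, standing in for the awaited `readShortHeadCommit` call.

```typescript
// Hypothetical shapes mirroring only the fields the hunk above reads.
interface PendingRunLike {
  commit: string | null;
  primaryMetric: number | null;
  metrics: Record<string, number>;
}

interface LogParamsLike {
  commit?: string;
  metric?: number;
  metrics?: Record<string, number>;
}

type InferResult =
  | { ok: true; commit: string; metric: number; metrics: Record<string, number> }
  | { ok: false; reason: string };

// Sketch of the precedence: explicit param, then pending run, then HEAD (commit only).
// `headCommit` stands in for the awaited readShortHeadCommit(...) result.
function inferLogDefaults(
  params: LogParamsLike,
  pendingRun: PendingRunLike | null,
  headCommit: string | null,
): InferResult {
  const commit = params.commit ?? pendingRun?.commit ?? headCommit ?? "";
  const metric = params.metric ?? pendingRun?.primaryMetric;
  if (metric === null || metric === undefined) {
    // Mirrors the `phase: "metric"` error path: nothing to log as the primary metric.
    return { ok: false, reason: "no primary metric available" };
  }
  const metrics: Record<string, number> = params.metrics ?? pendingRun?.metrics ?? {};
  // The log entry stores a 7-character short hash.
  return { ok: true, commit: commit.slice(0, 7), metric, metrics };
}
```

Because the chain uses `??` rather than `||`, an explicit `metric: 0` is kept instead of falling through to the pending run. This matters because 0 is the value the 1.0.2 schema asked for on crashes.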
@@ -61,8 +95,8 @@ export function createLogExperimentTool(api: OpenClawPluginApi) {
61
95
  const currentResults = readCurrentSegmentResults(cwd, state.currentSegment);
62
96
  const experiment: AutoresearchResultEntry = {
63
97
  run: state.currentRunCount + 1,
64
- commit: params.commit.slice(0, 7),
65
- metric: params.metric,
98
+ commit: inferredCommit.slice(0, 7),
99
+ metric: inferredMetric,
66
100
  metrics: secondaryMetrics,
67
101
  status: params.status,
68
102
  description: params.description,
@@ -83,7 +117,7 @@ export function createLogExperimentTool(api: OpenClawPluginApi) {
83
117
  cwd: cwd,
84
118
  description: params.description,
85
119
  metricName: state.metricName,
86
- metric: params.metric,
120
+ metric: inferredMetric,
87
121
  metrics: secondaryMetrics,
88
122
  commit: experiment.commit,
89
123
  status: "keep",
@@ -138,7 +172,16 @@ export function createLogExperimentTool(api: OpenClawPluginApi) {
138
172
  );
139
173
  const nextState: AutoresearchStateSnapshot = reconstructStateFromJsonl(cwd);
140
174
  const queuedSteers = consumeAutoresearchSteers(cwd);
175
+ consumeAutoresearchPendingRun(cwd);
141
176
  setAutoresearchRunInFlight(cwd, false);
177
+ const nextCheckpoint = writeAutoresearchCheckpoint({
178
+ cwd,
179
+ state: nextState,
180
+ sessionStartCommit: checkpoint?.sessionStartCommit ?? experiment.commit,
181
+ recentLoggedRuns: readRecentLoggedRuns(cwd, 8),
182
+ pendingRun: null,
183
+ });
184
+ syncAutoresearchSessionDoc(cwd, nextCheckpoint);
142
185
 
143
186
  return {
144
187
  content: [
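After logging, the checkpoint is rewritten with `recentLoggedRuns: readRecentLoggedRuns(cwd, 8)`. That helper lives in `../state.js` and is not part of this diff, so the following is only a guess at the idea: take the last N result entries from the JSONL log. `LoggedRunLike` and `tailLoggedRuns` are hypothetical. The sketch also assumes config lines can be told apart from result lines because they lack a numeric `run` field, which is an assumption about the file layout.

```typescript
// Hypothetical subset of a logged result entry (field names from AutoresearchResultEntry above).
interface LoggedRunLike {
  run: number;
  commit: string;
  metric: number;
  status: string;
  description: string;
}

// Sketch: return the last `limit` result entries from JSONL text.
// Assumption: config lines share the file and have no numeric `run` field;
// blank or malformed lines are skipped.
function tailLoggedRuns(jsonlText: string, limit: number): LoggedRunLike[] {
  const runs: LoggedRunLike[] = [];
  for (const line of jsonlText.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    let parsed: unknown;
    try {
      parsed = JSON.parse(trimmed);
    } catch {
      continue; // tolerate a partially written trailing line
    }
    if (
      typeof parsed === "object" &&
      parsed !== null &&
      typeof (parsed as { run?: unknown }).run === "number"
    ) {
      runs.push(parsed as LoggedRunLike);
    }
  }
  // slice(-0) would return everything, so guard non-positive limits explicitly.
  return limit > 0 ? runs.slice(-limit) : [];
}
```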
@@ -153,6 +196,7 @@ export function createLogExperimentTool(api: OpenClawPluginApi) {
153
196
  gitSummary,
154
197
  knownSecondaryMetrics,
155
198
  queuedSteers,
199
+ usedPendingRun: pendingRun !== null,
156
200
  }),
157
201
  },
158
202
  ],
@@ -304,6 +348,7 @@ function buildResultText(options: {
304
348
  gitSummary: string;
305
349
  knownSecondaryMetrics: readonly SecondaryMetricDef[];
306
350
  queuedSteers: readonly string[];
351
+ usedPendingRun: boolean;
307
352
  }): string {
308
353
  let text = `Logged #${options.experiment.run}: ${options.experiment.status} - ${options.experiment.description}`;
309
354
  text += `\nBaseline ${options.state.metricName}: ${formatMetric(options.baselineMetric, options.state.metricUnit)}`;
@@ -338,6 +383,9 @@ function buildResultText(options: {
338
383
  }
339
384
 
340
385
  text += `\n(${options.totalRunCount} experiments in current segment)`;
386
+ if (options.usedPendingRun) {
387
+ text += "\nUsed the pending run_experiment result as the source of truth for commit/metric defaults.";
388
+ }
341
389
  text += `\n${options.gitSummary}`;
342
390
 
343
391
  if (options.queuedSteers.length > 0) {
@@ -1,8 +1,17 @@
1
1
  import type { OpenClawPluginApi } from "openclaw/plugin-sdk";
2
2
  import { RunExperimentParams } from "./schemas.js";
3
3
  import { executeExperimentCommand } from "../execute.js";
4
- import { setAutoresearchRunInFlight } from "../runtime-state.js";
4
+ import {
5
+ getAutoresearchPendingRun,
6
+ setAutoresearchPendingRun,
7
+ setAutoresearchRunInFlight,
8
+ } from "../runtime-state.js";
5
9
  import { resolveToolCwd } from "./tool-cwd.js";
10
+ import { parseMetricLines } from "../metrics.js";
11
+ import { readShortHeadCommit } from "../git.js";
12
+ import { readAutoresearchCheckpoint, writeAutoresearchCheckpoint } from "../checkpoint.js";
13
+ import { readRecentLoggedRuns, reconstructStateFromJsonl } from "../state.js";
14
+ import { syncAutoresearchSessionDoc } from "../session-doc.js";
6
15
 
7
16
  export function createRunExperimentTool(api: OpenClawPluginApi) {
8
17
  return {
@@ -22,6 +31,28 @@ export function createRunExperimentTool(api: OpenClawPluginApi) {
22
31
  onUpdate: ((update: unknown) => void | Promise<void>) | undefined,
23
32
  ) {
24
33
  const cwd = resolveToolCwd(api, params.cwd);
34
+ const checkpoint = readAutoresearchCheckpoint(cwd);
35
+ const existingPendingRun = getAutoresearchPendingRun(cwd) ?? checkpoint?.pendingRun ?? null;
36
+
37
+ if (existingPendingRun) {
38
+ return {
39
+ content: [
40
+ {
41
+ type: "text" as const,
42
+ text:
43
+ "The previous run_experiment result has not been logged yet.\n" +
44
+ `Pending command: ${existingPendingRun.command}\n` +
45
+ "Call log_experiment next. You can omit commit/metric and the tool will use the pending run by default.",
46
+ },
47
+ ],
48
+ details: {
49
+ status: "error",
50
+ phase: "pending_log",
51
+ pendingRun: existingPendingRun,
52
+ },
53
+ };
54
+ }
55
+
25
56
  setAutoresearchRunInFlight(cwd, true);
26
57
 
27
58
  if (onUpdate) {
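The new guard above makes `run_experiment` refuse to start while an earlier result is unlogged. It checks the in-memory runtime state first and the persisted checkpoint second. A minimal sketch of that one-pending-run-per-`cwd` rule follows. `PendingRunRecord` and `PendingRunGate` are illustrative stand-ins for the `get/set/consumeAutoresearchPendingRun` helpers in `../runtime-state.js`, whose actual implementation is not shown.

```typescript
// Hypothetical, trimmed-down pending run record.
interface PendingRunRecord {
  command: string;
  primaryMetric: number | null;
  capturedAt: number;
}

// Sketch of the one-pending-run-per-cwd rule: a run opens a window,
// only a log closes it, and another run is refused while the window is open.
class PendingRunGate {
  private readonly memory = new Map<string, PendingRunRecord>();

  // `persisted` stands in for checkpoint?.pendingRun read from disk, so a
  // window opened before a restart still blocks the next run.
  get(cwd: string, persisted: PendingRunRecord | null = null): PendingRunRecord | null {
    return this.memory.get(cwd) ?? persisted ?? null;
  }

  tryStartRun(
    cwd: string,
    persisted: PendingRunRecord | null = null,
  ): { ok: true } | { ok: false; pending: PendingRunRecord } {
    const pending = this.get(cwd, persisted);
    return pending ? { ok: false, pending } : { ok: true };
  }

  recordRun(cwd: string, run: PendingRunRecord): void {
    this.memory.set(cwd, run);
  }

  // Called by the log step; returns the run it consumed, if any.
  // Note: this only clears memory; the real log step also rewrites the checkpoint.
  consume(cwd: string): PendingRunRecord | null {
    const run = this.memory.get(cwd) ?? null;
    this.memory.delete(cwd);
    return run;
  }
}
```

Falling back to the persisted record is what lets the window survive a restart. In the sketch, `consume` only clears memory, whereas the real log step also writes the checkpoint with `pendingRun: null`.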
@@ -45,6 +76,48 @@ export function createRunExperimentTool(api: OpenClawPluginApi) {
45
76
  throw error;
46
77
  }
47
78
 
79
+ const state = reconstructStateFromJsonl(cwd);
80
+ const parsedMetrics = parseMetricLines([details.stdout, details.stderr].join("\n"));
81
+ const detectedPrimaryMetricName =
82
+ parsedMetrics[state.metricName] !== undefined
83
+ ? state.metricName
84
+ : Object.keys(parsedMetrics).length === 1
85
+ ? Object.keys(parsedMetrics)[0] ?? null
86
+ : null;
87
+ const primaryMetric =
88
+ detectedPrimaryMetricName !== null ? parsedMetrics[detectedPrimaryMetricName] ?? null : null;
89
+ const secondaryMetrics =
90
+ detectedPrimaryMetricName !== null
91
+ ? Object.fromEntries(
92
+ Object.entries(parsedMetrics).filter(([name]) => name !== detectedPrimaryMetricName),
93
+ )
94
+ : parsedMetrics;
95
+ const currentCommit = await readShortHeadCommit({
96
+ runCommandWithTimeout: api.runtime.system.runCommandWithTimeout,
97
+ cwd,
98
+ });
99
+ const pendingRun = {
100
+ command: params.command,
101
+ commit: currentCommit,
102
+ primaryMetric,
103
+ metrics: secondaryMetrics,
104
+ durationSeconds: details.durationSeconds,
105
+ exitCode: details.exitCode,
106
+ passed: details.passed,
107
+ timedOut: details.timedOut,
108
+ tailOutput: details.tailOutput,
109
+ capturedAt: Date.now(),
110
+ } as const;
111
+ setAutoresearchPendingRun(cwd, pendingRun);
112
+ const nextCheckpoint = writeAutoresearchCheckpoint({
113
+ cwd,
114
+ state,
115
+ sessionStartCommit: checkpoint?.sessionStartCommit ?? currentCommit,
116
+ recentLoggedRuns: readRecentLoggedRuns(cwd, 8),
117
+ pendingRun,
118
+ });
119
+ syncAutoresearchSessionDoc(cwd, nextCheckpoint);
120
+
48
121
  let text = "";
49
122
  if (details.timedOut) {
50
123
  text += `TIMEOUT after ${details.durationSeconds.toFixed(1)}s\n`;
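`parseMetricLines` comes from `../metrics.js`, which this diff does not include. The sketch below assumes the `METRIC name=number` format described in the README, with one metric per line and the last occurrence winning. It then splits out the primary metric the same way the hunk above does. `parseMetricLinesSketch` and `splitPrimary` are hypothetical names.

```typescript
// Assumed grammar: one `METRIC <name>=<number>` per line; a later line with the
// same name overwrites an earlier one. The real parser is not shown in this diff.
const METRIC_LINE = /^\s*METRIC\s+([A-Za-z_][\w.-]*)=(\S+)\s*$/;

function parseMetricLinesSketch(output: string): Record<string, number> {
  const metrics: Record<string, number> = {};
  for (const line of output.split(/\r?\n/)) {
    const match = METRIC_LINE.exec(line);
    if (!match) continue;
    const name = match[1];
    const value = Number(match[2]);
    // Ignore values that are not finite numbers (e.g. "abc", "Infinity").
    if (name !== undefined && Number.isFinite(value)) metrics[name] = value;
  }
  return metrics;
}

// Mirrors the hunk above: prefer the configured metric name; otherwise, if exactly
// one METRIC line was seen, treat it as the primary; otherwise there is no primary.
function splitPrimary(
  parsed: Record<string, number>,
  configuredName: string,
): { primaryName: string | null; primary: number | null; secondary: Record<string, number> } {
  const names = Object.keys(parsed);
  let primaryName: string | null = null;
  if (parsed[configuredName] !== undefined) primaryName = configuredName;
  else if (names.length === 1) primaryName = names[0] ?? null;

  // Everything that is not the primary is reported as a secondary metric.
  const secondary: Record<string, number> = {};
  for (const name of names) {
    const value = parsed[name];
    if (name !== primaryName && value !== undefined) secondary[name] = value;
  }
  const primary = primaryName !== null ? parsed[primaryName] ?? null : null;
  return { primaryName, primary, secondary };
}
```

One consequence of the single-metric fallback is that a script printing exactly one METRIC line is treated as reporting the primary metric, even if the name differs from the configured one. With two or more lines, the configured name must match. Otherwise `primaryMetric` stays `null`, and `log_experiment` then needs an explicit `metric`.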
@@ -55,10 +128,25 @@ export function createRunExperimentTool(api: OpenClawPluginApi) {
55
128
  }
56
129
 
57
130
  text += `\nLast 80 lines of output:\n${details.tailOutput || "(no output)"}`;
131
+ if (Object.keys(parsedMetrics).length > 0) {
132
+ text += `\n\nParsed METRIC lines: ${Object.entries(parsedMetrics)
133
+ .map(([name, value]) => `${name}=${value}`)
134
+ .join(", ")}`;
135
+ } else {
136
+ text += `\n\nNo METRIC lines were detected.`;
137
+ }
138
+ text +=
139
+ "\nNext step: call log_experiment before another run. When the primary METRIC was captured, log_experiment can infer commit and metric from this run.";
58
140
 
59
141
  return {
60
142
  content: [{ type: "text" as const, text }],
61
- details,
143
+ details: {
144
+ ...details,
145
+ metrics: parsedMetrics,
146
+ secondaryMetrics,
147
+ primaryMetric,
148
+ pendingRun,
149
+ },
62
150
  };
63
151
  },
64
152
  };
@@ -45,11 +45,18 @@ export const RunExperimentParams = Type.Object({
45
45
 
46
46
  export const LogExperimentParams = Type.Object({
47
47
  cwd: CwdParam,
48
- commit: Type.String({ description: "Git commit hash (short, 7 chars)" }),
49
- metric: Type.Number({
50
- description:
51
- "The primary optimization metric value (e.g. seconds, val_bpb). Use 0 for crashes.",
52
- }),
48
+ commit: Type.Optional(
49
+ Type.String({
50
+ description:
51
+ "Git commit hash (short, 7 chars). Optional when logging the most recent run_experiment result.",
52
+ }),
53
+ ),
54
+ metric: Type.Optional(
55
+ Type.Number({
56
+ description:
57
+ "The primary optimization metric value (e.g. seconds, val_bpb). Optional when run_experiment already captured a METRIC line for the configured primary metric.",
58
+ }),
59
+ ),
53
60
  status: Type.String({
54
61
  description: "Result status for this experiment.",
55
62
  enum: ["keep", "discard", "crash"],
@@ -5,7 +5,7 @@
5
5
  "skills": [
6
6
  "./skills"
7
7
  ],
8
- "version": "1.0.2",
8
+ "version": "1.0.3",
9
9
  "configSchema": {
10
10
  "type": "object",
11
11
  "additionalProperties": false,
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@gianfrancopiana/openclaw-autoresearch",
3
- "version": "1.0.2",
3
+ "version": "1.0.3",
4
4
  "description": "Faithful OpenClaw port of pi-autoresearch.",
5
5
  "type": "module",
6
6
  "main": "./index.ts",
@@ -10,8 +10,8 @@ Autonomous experiment loop: try ideas, keep what works, discard what doesn't, ne
10
10
  ## Tools
11
11
 
12
12
  - **`init_experiment`** — configure session (name, metric, unit, direction). Call again to re-initialize with a new baseline when the optimization target changes.
13
- - **`run_experiment`** — runs command, times it, captures output.
14
- - **`log_experiment`** — records result. `keep` auto-commits. `discard`/`crash` → `git checkout -- .` to revert. Always include secondary `metrics` dict.
13
+ - **`run_experiment`** — runs the benchmark command, times it, captures output, parses `METRIC name=number` lines, and opens a pending run that must be logged before another run can start.
14
+ - **`log_experiment`** — records the pending run. `keep` auto-commits. `discard`/`crash` → `git checkout -- .` to revert. If the previous `run_experiment` captured the primary metric, `commit` and `metric` can be omitted and will default from the pending run.
15
15
 
16
16
  ## Setup
17
17
 
@@ -19,7 +19,7 @@ Autonomous experiment loop: try ideas, keep what works, discard what doesn't, ne
19
19
  2. `git checkout -b autoresearch/<goal>-<date>`
20
20
  3. Read the source files. Understand the workload deeply before writing anything.
21
21
  4. Write `autoresearch.md` and `autoresearch.sh` (see below). Commit both.
22
- 5. `init_experiment` → run baseline → `log_experiment` → start looping immediately.
22
+ 5. `init_experiment` → `run_experiment` baseline → `log_experiment` → start looping immediately.
23
23
 
24
24
  ### `autoresearch.md`
25
25
 
@@ -52,7 +52,7 @@ This is the heart of the session. A fresh agent with no context should be able t
52
52
  and architectural insights so the agent doesn't repeat failed approaches.>
53
53
  ```
54
54
 
55
- Update `autoresearch.md` periodically especially the "What's Been Tried" section so resuming agents have full context.
55
+ The plugin rewrites the Metrics, How to Run, What's Been Tried, and Plugin Checkpoint sections after init/log transitions. You may add context elsewhere in the file, but do not fight the plugin-managed sections.
56
56
 
57
57
  ### `autoresearch.sh`
58
58
 
@@ -67,7 +67,8 @@ Bash script (`set -euo pipefail`) that: pre-checks fast (syntax errors in <1s),
67
67
  - **Don't thrash.** Repeatedly reverting the same idea? Try something structurally different.
68
68
  - **Crashes:** fix if trivial, otherwise log and move on. Don't over-invest.
69
69
  - **Think longer when stuck.** Re-read source files, study the profiling data, reason about what the CPU is actually doing. The best ideas come from deep understanding, not from trying random variations.
70
- - **Resuming:** if `autoresearch.md` exists, read it + git log, continue looping.
70
+ - **Resuming:** if `autoresearch.md` exists, read it plus `autoresearch.checkpoint.json`, then continue looping.
71
+ - **No raw benchmark exec:** during active autoresearch mode, benchmark/test commands should go through `run_experiment`, not raw `exec`/`bash`.
71
72
 
72
73
  **NEVER STOP.** The user may be away for hours. Keep going until interrupted.
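The updated resume rule points agents at `autoresearch.checkpoint.json` in addition to `autoresearch.md`. Combined with the pending-run guard, the key decision on resume is whether an unlogged run is waiting. The sketch assumes the checkpoint JSON exposes the same `pendingRun` field that the tool code above reads through `readAutoresearchCheckpoint`. `CheckpointLike` and `nextStepOnResume` are hypothetical names.

```typescript
// Assumed on-disk shape: only the fields the tools read back from the checkpoint.
interface CheckpointLike {
  sessionStartCommit?: string | null;
  pendingRun?: { command: string } | null;
}

interface NextStep {
  tool: "log_experiment" | "run_experiment";
  reason: string;
}

// Sketch of the resume decision: an unlogged run must be logged first,
// because run_experiment refuses to start while a run is pending.
function nextStepOnResume(checkpointJson: string): NextStep {
  const checkpoint = JSON.parse(checkpointJson) as CheckpointLike;
  if (checkpoint.pendingRun) {
    return {
      tool: "log_experiment",
      reason: `pending run not yet logged: ${checkpoint.pendingRun.command}`,
    };
  }
  return { tool: "run_experiment", reason: "no pending run; continue the loop" };
}
```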
73
74