@swarmclawai/swarmclaw 1.5.57 → 1.5.58

package/README.md CHANGED
@@ -399,6 +399,16 @@ Operational docs: https://swarmclaw.ai/docs/observability
399
399
 
400
400
  ## Releases
401
401
 
402
+ ### v1.5.58 Highlights
403
+
404
+ This release broadens the built-in evaluation harness so SwarmClaw runs can be benchmarked against named suites, adds two targeted starter kits, exposes live per-session cost data, tightens auto-skill drafting, and ships a zero-setup demo mission template.
405
+
406
+ - **Benchmark-style eval suites.** New `SWEBENCH_LITE_SCENARIOS` and `GAIA_L1_SCENARIOS` in `src/lib/server/eval/scenarios-swebench.ts` and `scenarios-gaia.ts` — curated parallels (not the upstream datasets) sized for a single-agent harness run. The shared `EvalScenario` type now carries an optional `suite: 'core' | 'swe-bench-lite' | 'gaia-l1' | 'tool-use' | 'code-action'` tag. `POST /api/eval/suite` accepts `{ suite: "swe-bench-lite" }` to scope a run. New `GET /api/eval/suites` lists every suite with scenario count, max score, and categories. CLI: the new `swarmclaw eval suites` command lists suites, and `swarmclaw eval suite` still takes a JSON body, which can now include `suite` (see the sketch after this list). Useful for reporting verifiable numbers against a named benchmark instead of a bespoke scoring rubric.
407
+ - **Two additional starter kits.** `inbox_triage` (single Triager agent over email + memory + documents) and `data_analyst` (single Analyst agent over shell + files + web + documents) join the existing seven kits in `src/lib/setup-defaults.ts`. Both are surfaced on the intent-driven setup path alongside Personal Assistant, Research Copilot, Builder Studio, and Delegate Team.
408
+ - **Live per-session usage API.** New `GET /api/usage/live?sessionId=...` returns a lightweight snapshot — records, tokens in/out, estimated cost, firstAt/lastAt, wallclockMs, turns — so frontends can surface a live cost meter without pulling the full aggregated `/api/usage` payload; see the sketch after this list. Without a `sessionId` the route returns the ten most recently active sessions. Registered in the CLI as `swarmclaw usage live`.
409
+ - **Auto-skill drafting is stricter and rate-limited.** `shouldAutoDraftSkillSuggestion` in `chat-turn-finalization.ts` now requires at least 3 tool events in the completed turn (was 1), and a new per-agent daily cap limits automatic drafts to 3 per day per agent to prevent suggestion-inbox spam. Both thresholds are named constants (`AUTO_DRAFT_MIN_TOOL_EVENTS`, `AUTO_DRAFT_DAILY_LIMIT`). Agents with `autoDraftSkillSuggestions = false` are unaffected (auto-drafting remains opt-in per agent).
410
+ - **Hello World demo mission template.** New `hello-world-demo` entry in `BUILT_IN_MISSION_TEMPLATES` — a bounded, zero-setup mission that reads three files in the working directory and writes a one-paragraph markdown summary to `hello-world-report.md`. The budget caps (USD 0.25, 20k tokens, 30 turns, 15 min) keep the run tightly bounded, and on a local Ollama model it costs nothing. Intended as the first thing a new user watches an agent complete end to end.
411
+
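A minimal sketch of the new surface, assuming a local SwarmClaw server on `http://localhost:3000` and illustrative agent/session ids (none of these values come from the diff):

```ts
const BASE = 'http://localhost:3000/api' // assumed local server

async function demo() {
  // GET /api/eval/suites: each entry carries name, count, maxScore, categories
  const suites: Array<{ name: string; count: number }> =
    await fetch(`${BASE}/eval/suites`).then((r) => r.json())
  console.log(suites.map((s) => `${s.name} (${s.count} scenarios)`).join(', '))

  // POST /api/eval/suite: agentId is required, suite scopes the run
  const result = await fetch(`${BASE}/eval/suite`, {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({ agentId: 'agent-123', suite: 'swe-bench-lite' }),
  }).then((r) => r.json())
  console.log(result)

  // GET /api/usage/live?sessionId=...: lightweight snapshot for a cost meter.
  // CLI equivalent: swarmclaw usage live --query sessionId=session-abc
  const live = await fetch(`${BASE}/usage/live?sessionId=session-abc`).then((r) => r.json())
  console.log(`$${live.estimatedCost} across ${live.turns} turns`)
}

demo().catch(console.error)
```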
402
412
  ### v1.5.57 Highlights
403
413
 
404
414
  This release closes the org-orchestration feature gap with Paperclip while keeping SwarmClaw's autonomous-assistant focus. Most additions are additive; nothing existing has changed shape.
@@ -436,108 +446,6 @@ This release closes the org-orchestration feature gap with Paperclip while keepi
436
446
 
437
447
  > **Note:** v1.5.53 release notes described the mission templates library, but the feature commit landed after the v1.5.53 tag was cut. v1.5.54 is the release that actually ships it.
438
448
 
439
- ### v1.5.53 Highlights
440
-
441
- - **Fix: switching a session's model now sticks in the UI** ([#50](https://github.com/swarmclawai/swarmclaw/pull/50), thanks to [@borislavnnikolov](https://github.com/borislavnnikolov)). The **Switch Model** panel in the agent inspector was reading from `agent.provider` / `agent.model` (the agent's defaults) instead of `session.provider` / `session.model`, so after saving a model switch the collapsed pill still showed the agent default, the combobox reset to the default when reopened, and `selectedProvider` reverted on every save. `ModelSwitcherInline` now uses `session.provider || agent.provider` and `session.model || agent.model` as the source of truth, and its `useEffect` syncs to `session.provider` changes so a successful save updates the panel immediately.
442
-
443
- ### v1.5.52 Highlights
444
-
445
- - **Session X-Ray now surfaces the backend execution log** ([#48](https://github.com/swarmclawai/swarmclaw/pull/48), thanks to [@borislavnnikolov](https://github.com/borislavnnikolov)). The debug panel fetches entries from the SQLite execution log on open and merges them with in-memory message events, sorted by time. Expandable entries show provider, model, stream errors, duration, and token counts — the info that was previously invisible when Ollama or other local-model runs failed silently. A new **Tools** filter tab, an `exec` badge for log-sourced entries, an entry count in the stats bar, and a Refresh button round it out. New API route `GET /api/chats/:id/execution-log` with `limit`, `since`, and `category` query params, registered in the CLI manifest as `swarmclaw chats execution-log`.
446
- - **Execution errors now captured in the log** ([#48](https://github.com/swarmclawai/swarmclaw/pull/48)). `finalizeChatTurn()` writes a structured `error` entry to the execution log on terminal failure, recording provider, model, stream errors, duration, token counts, and whether a response was produced — so the Session X-Ray above actually has something to show.
447
- - **Fix: blank task-sheet no longer shows `"null"` under *Blocked By*** ([#47](https://github.com/swarmclawai/swarmclaw/pull/47), thanks to [@borislavnnikolov](https://github.com/borislavnnikolov)). A successful task create/update returns `error: null`, and the old `'error' in res` guard treated that as a truthy error and rendered `String(null)` as a red "null" string under the Blocked By field. Now only non-empty string errors trigger the UI, and `depError` is cleared on dialog close so stale state cannot leak across re-opens.
448
-
449
- ### v1.5.51 Highlights
450
-
451
- - **Desktop app now actually opens and renders on macOS**: packaged builds were broken in v1.5.50 by a stack of independent issues that each masked the next. This release unblocks the cold-boot path end to end. Measured cold-boot time on a populated install: ~1 second to first `/api/healthz` response, down from a hard 60-second timeout.
452
- - Ad-hoc code signing (`identity: '-'`) via a new `scripts/electron-after-pack.cjs` hook that runs `codesign --sign - --force --deep` after electron-builder packages the bundle. The bundle identifier is now sealed as `ai.swarmclaw.desktop` with all 74k resources sealed, so quarantined dmgs surface as "unidentified developer" (right-click → Open) instead of the more confusing "damaged" error.
453
- - Per-architecture native module sync: the afterPack hook copies `better-sqlite3`, `@mongodb-js/zstd`, `node-liblzma`, and `utf-8-validate` `.node` binaries from the electron-builder-rebuilt root `node_modules` into the packaged `.next/standalone/node_modules`. Without this, the standalone server hit `ERR_DLOPEN_FAILED: NODE_MODULE_VERSION 137` on launch because Next.js's output-tracing copied the Node-ABI build of better-sqlite3 into standalone while electron-builder only rebuilt the root tree for Electron's ABI.
454
- - `scripts/run-next-build.mjs` now copies `mdn-data` (used by `css-tree` via `jsdom`) into standalone alongside the existing `css-tree/data` patch, so pages that depend on it don't 500 with `Cannot find module 'mdn-data/css/at-rules.json'`.
455
- - `isomorphic-dompurify` replaced by the browser-only `dompurify` in `agent-avatar.tsx`. The isomorphic wrapper was pulling `jsdom`'s ESM-only `@exodus/bytes` dep into every server bundle the avatar was referenced from, which blew up SSR under Electron 33 (Node 20.18) with `ERR_REQUIRE_ESM` on every page.
456
- - Session-consolidation migrations, `initWsServer`, and `ensureDaemonStarted` moved into a `setImmediate` deferred block in `src/instrumentation.ts` so Next.js can bind the HTTP listener before per-install work runs.
457
- - **App icon fixed**: the Dock no longer shows Electron's default `exec` placeholder. `scripts/gen-icons.mjs` generates `resources/icon.icns`, `resources/icon.ico`, and `resources/icon.png` from `public/branding/swarmclaw-org-avatar.png`; the main process sets the Dock icon at launch and passes it to every `BrowserWindow`.
458
- - **Embedded server log file + improved failure dialog**: the Electron wrapper now tees the child Next.js server's stdout/stderr into `<userData>/logs/server.log` (`~/Library/Application Support/@swarmclawai/swarmclaw/logs/server.log` on macOS, 1 MB rotation). If startup fails or the server exits, the error dialog shows the tail of the log inline and exposes an **Open Logs Folder** button that jumps Finder straight to the file. This is what made root-cause debugging possible in the first place — if you hit any kind of regression here, grab that log and open an issue.
459
- - **Embedded server timeout raised from 60s to 5 minutes**: a safety net. On a healthy install the server is up in about a second; 300 seconds is there for pathological cold boots (very large data dirs, contested Apple Silicon Gatekeeper verification on unsigned binaries, etc.) and should never be hit in normal use.
460
-
461
- ### v1.5.50 Highlights
462
-
463
- - **Fix: opencode-web remote instances no longer fail with `EACCES`**: SwarmClaw used to send the local workspace path (e.g. `/root/.swarmclaw/workspace`) as a `directory=` query parameter on every opencode-web request. Remote opencode-web instances tried to `lstat` that path and rejected the call. The provider now auto-detects local vs. remote from the endpoint hostname (`localhost`, `127.0.0.1`, `::1`, `0.0.0.0`) and only sends `directory=` when the endpoint is local. Thanks to [@SteamedFish](https://github.com/SteamedFish) for the detailed root-cause writeup in [#45](https://github.com/swarmclawai/swarmclaw/issues/45).
464
-
465
- ### v1.5.49 Highlights
466
-
467
- - **Autonomous Missions**: a new first-class concept for long-running, goal-driven agent work. Hand your agent team a goal on Friday, come back Monday to see what they shipped. Each mission carries a title, a natural-language objective, bulleted success criteria, hard budgets (USD, tokens, turns, wallclock), periodic markdown reports, and a full milestone timeline. Missions drive any session through the existing heartbeat pipeline, so delegation to Claude Code, Codex, OpenCode, Cursor, Droid, Goose, Qwen, or native SwarmClaw agents all work without changes.
468
- - **Budget enforcement in the run pipeline**: `enqueueSessionRun` now consults the mission's budget before every autonomous turn. When any cap is hit the mission transitions to `budget_exhausted`, the queue drains, and a final report fires. Warn thresholds (default 50% / 80% / 95% of each cap) emit `budget_warn` milestones exactly once each.
469
- - **Scheduler tick from heartbeat**: `runMissionScheduler()` fires every heartbeat tick, independent of the active-hours window, so wallclock budgets and periodic reports still fire overnight. Report cadence is configurable per mission; reports land as in-app notifications today and ship as Slack/Discord/audio in a follow-up.
470
- - **`/missions` dashboard**: new page with a live mission list, status pills, four-axis budget gauges, a scrollable milestone timeline, a reports drawer, and start / pause / cancel / mark-complete / generate-report-now controls.
471
- - **CLI commands**: `swarmclaw missions list|get|create|update|delete|control|reports|report-now|events`. Create a mission, start it, and watch the timeline from the terminal or CI.
472
- - **New storage collections**: `agent_missions`, `mission_reports`, and `agent_mission_events`. The legacy deprecated `missions` table is left untouched so nothing in existing installs is disturbed.
473
-
474
- ### v1.5.48 Highlights
475
-
476
- - **SwarmDock MCP preset now points at the hosted endpoint**: *MCP Servers → Quick Setup → SwarmDock* is pre-filled with `streamable-http` transport pointed at `https://swarmdock-api.onrender.com/mcp` and a ready-to-edit `Authorization: Bearer <key>` header template. Users no longer need to run `npx swarmdock-mcp` locally — the SwarmDock team hosts the MCP server in-process on the existing API service. First-time setup (browser keygen + agent registration) lives at [swarmdock.ai/mcp/connect](https://www.swarmdock.ai/mcp/connect).
477
- - **McpPreset gains `url` and `headersTemplate`**: `applyPreset` now prefills the URL input and the Headers textarea in addition to command/args/env, so remote presets can ship complete configs.
478
- - **Skills doc refresh**: the `swarmclaw` skill's MCP Servers section points to the hosted flow instead of the prior stdio instructions.
479
-
480
- ### v1.5.47 Highlights
481
-
482
- - **MCP injection for GitHub Copilot CLI and OpenAI Codex CLI agents**: agents using the `copilot-cli` or `codex-cli` providers now run with their assigned MCP servers attached at runtime. Copilot CLI receives the servers via `--additional-mcp-config @<tempfile>`; Codex CLI gets per-session `[mcp_servers.*]` TOML sections appended to a scoped `config.toml`. Stdio transports (command, args, env, cwd) and SSE / streamable-http transports (url, headers) are both supported. Skills assigned to the agent continue to be injected via the system prompt.
483
- - **Skills and MCP panel visible for copilot-cli and codex-cli in the agent editor**: the Advanced Settings section now opens for these two providers so you can attach skills and MCP servers from the UI. Routing, memory, and voice panels stay hidden since these providers are worker-only.
484
- - **Codex CLI approval policy change**: Codex CLI sessions now launch with `--dangerously-bypass-approvals-and-sandbox` instead of `--full-auto`. The old flag silently cancels MCP tool calls via Codex's approval gate, which is why MCP tool results were not landing. SwarmClaw itself runs in its own sandbox, so Codex's additional sandbox was not load-bearing, but be aware of the change if you were relying on it for a specific agent.
485
- - **Under the hood**: `~/.codex-sessions/<session.id>/` replaces `/tmp/swarmclaw-codex-*` as the per-session Codex config directory because Codex refuses to create helper binaries under `/tmp`. The Playwright MCP proxy now passes an explicit `cwd: process.cwd()` when spawning, so it no longer crashes with `uv_cwd ENOENT` when the server is restarted after a directory move.
486
- - **Exa as a new web search provider**: Settings > Web Search gains an Exa option alongside Tavily, Brave, SearXNG, DuckDuckGo, Google, and Bing. Exa uses neural search with AI-generated summaries and falls back to highlights, then raw text when summaries are unavailable. Configure the key via the UI, the `EXA_API_KEY` environment variable, or the secrets store. Requests carry an `x-exa-integration: swarmclaw` tracking header so usage attributed to SwarmClaw is visible to Exa.
487
-
488
- Thanks to [@borislavnnikolov](https://github.com/borislavnnikolov) and [@tgonzalezc5](https://github.com/tgonzalezc5) for the contributions.
489
-
490
- ### v1.5.46 Highlights
491
-
492
- - **Custom base URL for built-in OpenAI and Anthropic providers**: the Endpoint field in provider settings now works for the built-in OpenAI and Anthropic providers (marked as `optionalEndpoint`). Point them at a proxy, gateway, or self-hosted endpoint and the URL persists, auto-resolves on connection test, and flows through both the live chat path and the LangGraph agent path (`ChatAnthropic` now receives `anthropicApiUrl`). Existing installs with no custom URL keep using the defaults.
493
- - **Test-model selector in provider settings**: when you hit "Test Connection", a new dropdown lets you pick a specific model (for example `gpt-4.1-mini` or `claude-haiku-4-5`) or leave it on Auto-detect. Useful for verifying a specific model is reachable on a given endpoint.
494
- - **Auto-resolution of credentials and endpoints in the connection test**: the test route now looks up the saved credential and base URL for the provider when they are not explicitly supplied, so the provider sheet's "Test" button works without needing to replay config.
495
- - **Anthropic streaming refactor**: the streaming handler moved from Node's `https.request()` to `fetch()`. Same behavior, cleaner cancellation, and it now respects `session.apiEndpoint` as a full base URL instead of a hostname.
496
- - **Connection test body**: Ollama and OpenAI-compatible test requests now send `max_completion_tokens` instead of the legacy `max_tokens`, matching current OpenAI conventions and working correctly with reasoning models that reject `max_tokens`.
497
-
498
- Thanks to [@Llugaes](https://github.com/Llugaes) for the contribution.
499
-
500
- ### v1.5.45 Highlights
501
-
502
- - **SwarmVault MCP preset**: a new "SwarmVault" Quick Setup chip in the MCP server sheet pre-fills `npx -y @swarmvaultai/cli mcp` over `stdio` and prompts for the vault directory. One click registers a SwarmVault knowledge vault as an MCP server; agents pick it up via the existing per-agent MCP server selector. SwarmVault docs: https://swarmvault.ai
503
- - **`cwd` on stdio MCP servers**: `McpServerConfig` now has an optional `cwd` field. The MCP client passes it through to `StdioClientTransport` so servers that discover config from the working directory (SwarmVault, anything that reads from `cwd`-relative files) work correctly. Existing MCP servers are untouched (the field is optional and defaults to the SwarmClaw process cwd, which was the prior behaviour).
504
- - **Bundled `swarmvault` skill**: ships at `skills/swarmvault/SKILL.md` and is auto-discovered alongside the other bundled skills. Captures the schema-first / graph-query-first conventions (read `swarmvault.schema.md` before compile or query work, treat `raw/` as immutable, prefer `graph query|path|explain` over grep, preserve `page_id` / `source_ids` / `node_ids` / `freshness` / `source_hashes` frontmatter, save high-value answers to `wiki/outputs/`). Pin it on any agent that talks to a SwarmVault vault. Optional and decoupled from the MCP integration.
505
-
506
- ### v1.5.44 Highlights
507
-
508
- - **Model lists refreshed across every provider**: dropdowns now lead with the April-2026 flagship models instead of mid-2025 names. OpenAI goes to GPT-5.4 / 5.4-mini / 5.4-nano / 5.3 / o3-mini. Google and Gemini CLI lead with Gemini 3.1 Pro, Gemini 3 Flash, and 3.1 Flash-Lite, keeping 2.5 as a legacy fallback. xAI jumps from Grok 3 to Grok 4 plus the Grok 4 / 4.1 Fast reasoning and non-reasoning variants. Groq drops the deprecated `deepseek-r1-distill-llama-70b` and leads with Llama 4 Maverick, Llama 4 Scout, Kimi K2, and gpt-oss 120b/20b. Mistral moves to Magistral 1.2, Devstral 2, Codestral, and Mistral Small 4. Fireworks / Nebius / DeepInfra now lead with DeepSeek V3.2, Kimi K2.5, and Qwen 3 235B instead of the older R1-0528 checkpoint. Anthropic and Claude CLI reorder Opus 4.6 / Sonnet 4.6 / Haiku 4.5 newest-first. OpenCode Web refreshes its `providerID/modelID` seed list.
509
- - **OpenRouter default set expanded**: was one model (`openai/gpt-4.1-mini`). Now ten flagship routes including `openrouter/auto`, Claude 4.6 Opus / Sonnet / Haiku, GPT-5.4, Gemini 3.1 Pro / 3 Flash, Grok 4, DeepSeek V3.2, and Llama 4 Maverick. Much better first-run experience for the "provider that routes to every other provider".
510
- - **`DEFAULT_AGENTS` models refreshed**: 11 starter-agent models updated to match the new flagship lineups (OpenAI → GPT-5.4, xAI → Grok 4, Google / Gemini CLI → Gemini 3.1 Pro, Groq → Llama 4 Maverick, Fireworks / Nebius / DeepInfra → DeepSeek V3.2, OpenCode Web / Copilot CLI → Claude Sonnet 4.6, OpenRouter → Claude Sonnet 4.6). Starter agents created from the setup wizard now default to the right model out of the box.
511
- - **Starter-agent tool bundles now include `droid_cli` and `copilot_cli`**: these delegation backends were added in v1.5.37 and v1.5.3 respectively but never made it into `STARTER_AGENT_TOOLS` / `BUILDER_AGENT_TOOLS`. Every starter kit (Sidekick, Researcher, Builder, Reviewer, Operator, OpenClaw fleet) now picks them up on new workspace creation.
512
- - **DeepSeek note**: `deepseek-chat` and `deepseek-reasoner` remain the recommended model names — they are stable aliases that auto-track the current `V3.2` weights. No action required.
513
- - **Registry sanity test**: added `provider-models.test.ts` which asserts every provider declares a non-empty deduplicated models array, matching metadata keys, and a working `handler.streamChat`. Guards against future copy-paste regressions in the registry.
514
-
515
- ### v1.5.43 Highlights
516
-
517
- - **`/api/version` no longer 500s in Docker**: the route used to shell out to `git` at runtime, which fails in the production image because `.git/` is not copied. The route now returns 200 with `{ source: 'package', version }` from `package.json` when git metadata is unavailable, and `{ source: 'git', version, commit, ... }` when it is. `/api/version/update` short-circuits on Docker-style installs with a clear `no_git_metadata` reason instead of an opaque 500. ([#41](https://github.com/swarmclawai/swarmclaw/issues/41) Bug 1, reported by [@SteamedFish](https://github.com/SteamedFish).)
518
- - **Daemon reclaims stale `daemon-primary` leases on container restart**: when the previous container died holding the SQLite-backed lease, the new container previously waited up to the full 120 s TTL before the daemon could start. The successor now parses the recorded owner pid, probes it with `process.kill(pid, 0)`, and reclaims the lease immediately when the prior owner is provably dead on this host. When the owner is genuinely alive (or when the recorded host is ambiguous, such as multi-pod Kubernetes), behaviour is unchanged but a single deferred retry is scheduled just past the TTL so the daemon comes up automatically rather than waiting for the next API call. ([#41](https://github.com/swarmclawai/swarmclaw/issues/41) Bug 2.)
519
- - **Subprocess daemon fallback fails soft in Docker**: when `resolveDaemonRuntimeEntry()` cannot find `src/lib/server/daemon/daemon-runtime.ts` (the file is intentionally not in the standalone build), `ensureDaemonProcessRunning()` now logs a one-shot warning and returns `false` instead of throwing into the API handler. The in-process daemon path (with the Bug 2 fix) is the production path in Docker. ([#41](https://github.com/swarmclawai/swarmclaw/issues/41) Bug 3.)
520
- - **`CONTRIBUTING.md`**: dropped the broken reference to `AGENTS.md`. That file is `.gitignore`'d and not visible to external contributors. The single canonical project-conventions document is `CLAUDE.md`.
521
-
522
- ### v1.5.42 Highlights
523
-
524
- - **New `opencode-web` provider — connect to remote OpenCode HTTP servers** ([#40](https://github.com/swarmclawai/swarmclaw/issues/40), requested by [@SteamedFish](https://github.com/SteamedFish)): point an agent at any host running `opencode serve` or `opencode web` (default port `4096`). Supports HTTPS endpoints, HTTP Basic Auth (encode credentials as `username:password` in the API key field; bare passwords default the username to `opencode`), automatic OpenCode session reuse across chat turns, and per-session workspace isolation via `?directory=...`. Models are entered as `providerID/modelID` (e.g. `anthropic/claude-sonnet-4-5`). The existing `opencode-cli` provider is unchanged.
525
- - **New `CONTRIBUTING.md`**: short, scannable guide covering bug reports, feature requests, PR expectations, commit conventions, and where to look in the codebase. Models the gold-standard examples after issues #39 and #40.
526
- - **`GET /api/memory/:id` now returns a single entry by default**: previously it eagerly traversed linked memories and returned an array, which broke naive callers that expected a single object per REST convention. Linked traversal is now opt-in via `?depth=N` or `?envelope=true`.
527
-
528
- ### v1.5.41 Highlights
529
-
530
- - **Moonshot / Kimi compatibility — duplicate `files` tool name fixed**: any agent with the default `files` extension was sending two tools both literally named `files` to the LLM. Most providers tolerated the duplicate; Moonshot's strict tool-schema validation rejected it with `MoonshotException - function name files is duplicated` ([#39](https://github.com/swarmclawai/swarmclaw/issues/39), reported by [@SteamedFish](https://github.com/SteamedFish)). Three fixes: the v2 file builder is now correctly gated on `files_v2` (not `files`), it registers under the matching capability key, and the session-tools assembler now shares a single dedup Set across native, CRUD, and extension phases so any future name collision is rejected with a clear warning instead of a silent double-register.
531
-
532
- ### v1.5.40 Highlights
533
-
534
- - **Current-thread recall routing**: the message classifier now emits four explicit flags (`isCurrentThreadRecall`, `isGreeting`, `isAcknowledgement`, `isMemoryWriteIntent`) so the chat router stops treating in-thread pronouns ("your last reply", "both answers", "what I just said") as durable-memory queries. Previously small OSS models (`devstral-small-2:24b` and similar) would run `memory_search` for these, come back empty, and truthfully report "no memories found" even when the answer was three messages up.
535
- - **`memory_search` short-circuits thread-recall queries**: when the search query itself contains phrases like "just", "last reply", "my last", "both answers", the tool now returns a redirect pointing the model back to the visible chat history instead of executing a pointless vector search. Explicit cross-session phrasing ("yesterday", "last week", "in a previous conversation") still runs the normal search path.
536
- - **Explicit Routing Matrix in the system prompt**: spells out the boundary between "read the thread above" and "call a memory tool" in plain language, so routing doesn't depend on the model extrapolating a terse rule. Memory-tool lines are now tagged `(not this thread)` so the distinction is unmissable.
537
- - **Tool-summary retry threshold tightened**: the "trivial response" threshold used to decide whether to force a redundant `tool_summary` continuation dropped from 150 → 80 characters. A 119-char response like "I wrote X, stored Y, and confirmed both." is substantive; the old threshold forced the model to re-stream the same answer twice.
538
- - **Classifier timeout raised to 10 s**: 2 s was too tight for Ollama Cloud with a fully-configured agent (observed 4–6 s calls). Result caching means the latency tax only applies to first-seen messages.
539
- - **Reflection memories dedup across runs**: the supervisor reflection writer now compares candidate notes against recent (last 7 days) reflection memories for the same agent and skips ones that have already been stored, stopping the ~7-per-turn rediscovery churn on top of the within-run dedup shipped in v1.5.38.
540
-
541
449
  Older releases: https://swarmclaw.ai/docs/release-notes
542
450
 
543
451
  - GitHub releases: https://github.com/swarmclawai/swarmclaw/releases
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@swarmclawai/swarmclaw",
3
- "version": "1.5.57",
3
+ "version": "1.5.58",
4
4
  "description": "Build and run autonomous AI agents with OpenClaw, Hermes, multiple model providers, orchestration, delegation, memory, skills, schedules, and chat connectors.",
5
5
  "main": "electron-dist/main.js",
6
6
  "license": "MIT",
@@ -1,19 +1,21 @@
1
1
  import { NextResponse } from 'next/server'
2
- import { EVAL_SCENARIOS } from '@/lib/server/eval/scenarios'
2
+ import { EVAL_SCENARIOS, getSuiteScenarios } from '@/lib/server/eval/scenarios'
3
3
 
4
4
  export async function GET(req: Request) {
5
5
  const { searchParams } = new URL(req.url)
6
6
  const category = searchParams.get('category')
7
+ const suite = searchParams.get('suite')
7
8
 
8
- const scenarios = category
9
- ? EVAL_SCENARIOS.filter((s) => s.category === category)
10
- : EVAL_SCENARIOS
9
+ let scenarios = EVAL_SCENARIOS
10
+ if (suite) scenarios = getSuiteScenarios(suite)
11
+ if (category) scenarios = scenarios.filter((s) => s.category === category)
11
12
 
12
13
  return NextResponse.json(
13
14
  scenarios.map((s) => ({
14
15
  id: s.id,
15
16
  name: s.name,
16
17
  category: s.category,
18
+ suite: s.suite ?? 'core',
17
19
  description: s.description,
18
20
  tools: s.tools,
19
21
  timeoutMs: s.timeoutMs,
@@ -6,6 +6,7 @@ import { errorMessage } from '@/lib/shared-utils'
6
6
  const SuiteSchema = z.object({
7
7
  agentId: z.string().min(1),
8
8
  categories: z.array(z.string()).optional(),
9
+ suite: z.string().min(1).optional(),
9
10
  })
10
11
 
11
12
  export async function POST(req: Request) {
@@ -19,7 +20,10 @@ export async function POST(req: Request) {
19
20
  )
20
21
  }
21
22
 
22
- const result = await runEvalSuite(parsed.data.agentId, parsed.data.categories)
23
+ const result = await runEvalSuite(parsed.data.agentId, {
24
+ categories: parsed.data.categories,
25
+ suite: parsed.data.suite,
26
+ })
23
27
  return NextResponse.json(result)
24
28
  } catch (err: unknown) {
25
29
  return NextResponse.json(
@@ -0,0 +1,19 @@
1
+ import { NextResponse } from 'next/server'
2
+ import { EVAL_SCENARIOS, getSuiteScenarios, listSuites } from '@/lib/server/eval/scenarios'
3
+
4
+ export async function GET() {
5
+ const suites = listSuites()
6
+ const summary = suites.map((name) => {
7
+ const scenarios = name === 'core' ? EVAL_SCENARIOS.filter(s => !s.suite || s.suite === 'core') : getSuiteScenarios(name)
8
+ return {
9
+ name,
10
+ count: scenarios.length,
11
+ maxScore: scenarios.reduce(
12
+ (sum, s) => sum + s.scoringCriteria.reduce((a, c) => a + c.weight, 0),
13
+ 0,
14
+ ),
15
+ categories: Array.from(new Set(scenarios.map(s => s.category))),
16
+ }
17
+ })
18
+ return NextResponse.json(summary)
19
+ }
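For orientation, the `swe-bench-lite` and `gaia-l1` rows of this response can be derived from the scenario files later in this diff (5 scenarios at 11 weight points each, and 4 at 10 each); the `core` row below is illustrative only:

```ts
// Plausible GET /api/eval/suites payload. Only the two new suites are
// grounded in this diff; the core numbers are placeholders.
const example = [
  { name: 'core', count: 8, maxScore: 80, categories: ['coding', 'research'] }, // illustrative
  { name: 'swe-bench-lite', count: 5, maxScore: 55, categories: ['coding'] },
  { name: 'gaia-l1', count: 4, maxScore: 40, categories: ['multi-step', 'research', 'tool-usage'] },
]
```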
@@ -0,0 +1,94 @@
1
+ import { NextResponse } from 'next/server'
2
+ import { loadUsage, loadSessions } from '@/lib/server/storage'
3
+ import type { UsageRecord } from '@/types'
4
+
5
+ export const dynamic = 'force-dynamic'
6
+
7
+ type SessionSnapshot = {
8
+ id?: string
9
+ agentId?: string
10
+ createdAt?: number
11
+ lastActiveAt?: number
12
+ messages?: unknown[]
13
+ }
14
+
15
+ interface LiveUsage {
16
+ sessionId: string
17
+ records: number
18
+ totalTokens: number
19
+ inputTokens: number
20
+ outputTokens: number
21
+ estimatedCost: number
22
+ firstAt: number | null
23
+ lastAt: number | null
24
+ wallclockMs: number
25
+ turns: number
26
+ }
27
+
28
+ function summarize(sessionId: string, records: UsageRecord[], session: SessionSnapshot | undefined): LiveUsage {
29
+ let totalTokens = 0
30
+ let inputTokens = 0
31
+ let outputTokens = 0
32
+ let estimatedCost = 0
33
+ let firstAt: number | null = null
34
+ let lastAt: number | null = null
35
+
36
+ for (const r of records) {
37
+ totalTokens += r.totalTokens || 0
38
+ inputTokens += r.inputTokens || 0
39
+ outputTokens += r.outputTokens || 0
40
+ estimatedCost += r.estimatedCost || 0
41
+ const ts = r.timestamp || 0
42
+ if (ts > 0) {
43
+ if (firstAt === null || ts < firstAt) firstAt = ts
44
+ if (lastAt === null || ts > lastAt) lastAt = ts
45
+ }
46
+ }
47
+
48
+ const turns = Array.isArray(session?.messages) ? session!.messages!.length : records.length
49
+ const wallStart = session?.createdAt ?? firstAt ?? 0
50
+ const wallEnd = session?.lastActiveAt ?? lastAt ?? Date.now()
51
+ const wallclockMs = wallStart > 0 ? Math.max(0, wallEnd - wallStart) : 0
52
+
53
+ return {
54
+ sessionId,
55
+ records: records.length,
56
+ totalTokens,
57
+ inputTokens,
58
+ outputTokens,
59
+ estimatedCost: Math.round(estimatedCost * 10000) / 10000,
60
+ firstAt,
61
+ lastAt,
62
+ wallclockMs,
63
+ turns,
64
+ }
65
+ }
66
+
67
+ export async function GET(req: Request) {
68
+ const { searchParams } = new URL(req.url)
69
+ const sessionId = searchParams.get('sessionId')?.trim()
70
+
71
+ const usage = loadUsage() as Record<string, UsageRecord[]>
72
+ const sessions = loadSessions() as Record<string, SessionSnapshot>
73
+
74
+ if (sessionId) {
75
+ const records = usage[sessionId] ?? []
76
+ const session = sessions[sessionId]
77
+ return NextResponse.json(summarize(sessionId, records, session))
78
+ }
79
+
80
+ // Without sessionId, return the 10 most recently active sessions
81
+ const ids = Object.keys(usage)
82
+ const recent = ids
83
+ .map((id) => {
84
+ const records = usage[id] ?? []
85
+ const last = records.reduce((m, r) => Math.max(m, r.timestamp || 0), 0)
86
+ return { id, last }
87
+ })
88
+ .sort((a, b) => b.last - a.last)
89
+ .slice(0, 10)
90
+
91
+ return NextResponse.json(
92
+ recent.map(({ id }) => summarize(id, usage[id] ?? [], sessions[id])),
93
+ )
94
+ }
package/src/cli/index.js CHANGED
@@ -215,9 +215,10 @@ const COMMAND_GROUPS = [
215
215
  description: 'Run agent evaluation scenarios',
216
216
  commands: [
217
217
  cmd('scenarios', 'GET', '/eval/scenarios', 'List available eval scenarios'),
218
+ cmd('suites', 'GET', '/eval/suites', 'List available eval suites (core, swe-bench-lite, gaia-l1, ...)'),
218
219
  cmd('status', 'GET', '/eval/run', 'Get eval run status'),
219
220
  cmd('run', 'POST', '/eval/run', 'Run an eval scenario against an agent', { expectsJsonBody: true }),
220
- cmd('suite', 'POST', '/eval/suite', 'Run a full eval suite against an agent', { expectsJsonBody: true }),
221
+ cmd('suite', 'POST', '/eval/suite', 'Run a full eval suite against an agent (pass { suite: "swe-bench-lite" } in body)', { expectsJsonBody: true }),
221
222
  ],
222
223
  },
223
224
  {
@@ -747,6 +748,7 @@ const COMMAND_GROUPS = [
747
748
  description: 'Usage and cost summary',
748
749
  commands: [
749
750
  cmd('get', 'GET', '/usage', 'Get usage summary'),
751
+ cmd('live', 'GET', '/usage/live', 'Get live per-session usage (use --query sessionId=...)'),
750
752
  ],
751
753
  },
752
754
  {
@@ -59,7 +59,14 @@ test('getStarterKitsForPath: quick exposes a reduced starter set', () => {
59
59
 
60
60
  test('getStarterKitsForPath: intent stays focused on broad starter shapes', () => {
61
61
  const ids = getStarterKitsForPath('intent').map((kit) => kit.id)
62
- assert.deepEqual(ids, ['personal_assistant', 'research_copilot', 'builder_studio', 'operator_swarm'])
62
+ assert.deepEqual(ids, [
63
+ 'personal_assistant',
64
+ 'research_copilot',
65
+ 'builder_studio',
66
+ 'operator_swarm',
67
+ 'inbox_triage',
68
+ 'data_analyst',
69
+ ])
63
70
  })
64
71
 
65
72
  test('getStarterKitsForPath: manual keeps the full catalog', () => {
@@ -25,7 +25,14 @@ export function getStarterKitsForPath(path: OnboardingPath): StarterKit[] {
25
25
  return STARTER_KITS.filter((kit) => quickIds.has(kit.id))
26
26
  }
27
27
  if (path === 'intent') {
28
- const intentIds = new Set(['personal_assistant', 'builder_studio', 'research_copilot', 'operator_swarm'])
28
+ const intentIds = new Set([
29
+ 'personal_assistant',
30
+ 'builder_studio',
31
+ 'research_copilot',
32
+ 'operator_swarm',
33
+ 'inbox_triage',
34
+ 'data_analyst',
35
+ ])
29
36
  return STARTER_KITS.filter((kit) => intentIds.has(kit.id))
30
37
  }
31
38
  return STARTER_KITS
@@ -85,6 +85,9 @@ function resolveHeartbeatLastConnectorTarget(session: Session | null | undefined
85
85
  }
86
86
  }
87
87
 
88
+ const AUTO_DRAFT_DAILY_LIMIT = 3
89
+ const AUTO_DRAFT_MIN_TOOL_EVENTS = 3
90
+
88
91
  function shouldAutoDraftSkillSuggestion(params: {
89
92
  assistantPersisted: boolean
90
93
  internal: boolean
@@ -96,10 +99,31 @@ function shouldAutoDraftSkillSuggestion(params: {
96
99
  if (!params.assistantPersisted) return false
97
100
  if (params.internal || params.isHeartbeatRun) return false
98
101
  if (!params.agentAutoDraftSetting) return false
99
- if (params.toolEventCount === 0) return false
102
+ if (params.toolEventCount < AUTO_DRAFT_MIN_TOOL_EVENTS) return false
100
103
  return params.messageCount >= 4
101
104
  }
102
105
 
106
+ async function isAutoDraftRateLimited(agentId: string | null): Promise<boolean> {
107
+ if (!agentId) return false
108
+ try {
109
+ const { loadSkillSuggestions } = await import('@/lib/server/skills/skill-repository')
110
+ const suggestions = loadSkillSuggestions()
111
+ const todayStart = new Date()
112
+ todayStart.setHours(0, 0, 0, 0)
113
+ const cutoff = todayStart.getTime()
114
+ let count = 0
115
+ for (const s of Object.values(suggestions)) {
116
+ if (s.sourceAgentId !== agentId) continue
117
+ if ((s.createdAt || 0) < cutoff) continue
118
+ count += 1
119
+ if (count >= AUTO_DRAFT_DAILY_LIMIT) return true
120
+ }
121
+ return false
122
+ } catch {
123
+ return false
124
+ }
125
+ }
126
+
103
127
  async function resolveExactOutputContractWithTimeout(params: {
104
128
  sessionId: string
105
129
  agentId?: string | null
@@ -668,11 +692,14 @@ export async function finalizeChatTurn(params: {
668
692
  toolEventCount: persistedToolEvents.length,
669
693
  messageCount: messages.length,
670
694
  })) {
671
- try {
672
- const { createSkillSuggestionFromSession } = await import('@/lib/server/skills/skill-suggestions')
673
- await createSkillSuggestionFromSession(sessionId)
674
- } catch {
675
- // Reviewed skill drafting is best-effort.
695
+ const rateLimited = await isAutoDraftRateLimited(current.agentId)
696
+ if (!rateLimited) {
697
+ try {
698
+ const { createSkillSuggestionFromSession } = await import('@/lib/server/skills/skill-suggestions')
699
+ await createSkillSuggestionFromSession(sessionId)
700
+ } catch {
701
+ // Reviewed skill drafting is best-effort.
702
+ }
676
703
  }
677
704
  }
678
705
  notify(`messages:${sessionId}`)
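Putting the two new gates together, a sketch of how a finished turn is evaluated (input values are illustrative):

```ts
const base = {
  assistantPersisted: true,
  internal: false,
  isHeartbeatRun: false,
  agentAutoDraftSetting: true, // auto-drafting stays opt-in per agent
  messageCount: 6, // passes the existing messageCount >= 4 gate
}

shouldAutoDraftSkillSuggestion({ ...base, toolEventCount: 2 }) // false: below AUTO_DRAFT_MIN_TOOL_EVENTS (3)
shouldAutoDraftSkillSuggestion({ ...base, toolEventCount: 3 }) // true: a draft is attempted...
// ...unless isAutoDraftRateLimited(agentId) counts AUTO_DRAFT_DAILY_LIMIT (3)
// suggestions from this agent since local midnight, in which case the
// draft is silently skipped until the next day.
```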
@@ -2,7 +2,7 @@ import fs from 'node:fs'
2
2
  import path from 'node:path'
3
3
  import { genId } from '@/lib/id'
4
4
  import type { EvalScenario, EvalRun, EvalSuiteResult } from './types'
5
- import { getScenario, EVAL_SCENARIOS } from './scenarios'
5
+ import { getScenario, EVAL_SCENARIOS, getSuiteScenarios } from './scenarios'
6
6
  import { scoreCriteria } from './scorer'
7
7
  import { saveEvalRun } from './store'
8
8
  import { loadSessions, saveSessions, loadAgents, loadCredentials, decryptKey } from '../storage'
@@ -112,10 +112,21 @@ export async function runEvalScenario(scenarioId: string, agentId: string): Prom
112
112
  return run
113
113
  }
114
114
 
115
- export async function runEvalSuite(agentId: string, categories?: string[]): Promise<EvalSuiteResult> {
116
- const scenarios: EvalScenario[] = categories
117
- ? EVAL_SCENARIOS.filter(s => categories.includes(s.category))
118
- : EVAL_SCENARIOS
115
+ export async function runEvalSuite(
116
+ agentId: string,
117
+ opts: { categories?: string[]; suite?: string } = {},
118
+ ): Promise<EvalSuiteResult> {
119
+ let scenarios: EvalScenario[]
120
+ if (opts.suite) {
121
+ scenarios = getSuiteScenarios(opts.suite)
122
+ if (opts.categories?.length) {
123
+ scenarios = scenarios.filter(s => opts.categories!.includes(s.category))
124
+ }
125
+ } else if (opts.categories?.length) {
126
+ scenarios = EVAL_SCENARIOS.filter(s => opts.categories!.includes(s.category))
127
+ } else {
128
+ scenarios = EVAL_SCENARIOS
129
+ }
119
130
 
120
131
  const runs: EvalRun[] = []
121
132
  for (const scenario of scenarios) {
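The new options object keeps the old call shape working while letting `suite` narrow first and `categories` intersect inside it; for example (the agent id is assumed):

```ts
// Inside an async context. Note that with no options the full EVAL_SCENARIOS
// list now includes the swe-bench-lite and gaia-l1 scenarios as well.
await runEvalSuite('agent-123')

// Only the coding-category scenarios within the swe-bench-lite suite:
await runEvalSuite('agent-123', { suite: 'swe-bench-lite', categories: ['coding'] })
```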
@@ -0,0 +1,100 @@
1
+ import type { EvalScenario } from './types'
2
+
3
+ /**
4
+ * GAIA-Level-1-inspired scenarios: tool-grounded reasoning tasks that require
5
+ * combining search, retrieval, and multi-step synthesis. Curated parallels
6
+ * (not the upstream dataset) scaled to a single harness run.
7
+ */
8
+ export const GAIA_L1_SCENARIOS: EvalScenario[] = [
9
+ {
10
+ id: 'gaia-capital-math',
11
+ name: 'Capital-of-country arithmetic',
12
+ category: 'multi-step',
13
+ suite: 'gaia-l1',
14
+ description: 'Identify the capital of a country, look up a population figure, and perform a simple calculation.',
15
+ userMessage: 'What is the population of the capital of Australia, and roughly what percentage of the country\'s total population does it represent? Cite your sources.',
16
+ expectedBehaviors: [
17
+ 'Uses web_search (or web_fetch) to find Canberra population and Australia total',
18
+ 'Performs the percentage calculation',
19
+ 'Cites sources',
20
+ ],
21
+ scoringCriteria: [
22
+ { name: 'uses_search', weight: 2, evaluator: 'tool_used', expected: 'web_search' },
23
+ { name: 'mentions_canberra', weight: 2, evaluator: 'contains', expected: 'Canberra' },
24
+ { name: 'mentions_percent', weight: 1, evaluator: 'regex', expected: '\\d+(?:\\.\\d+)?\\s*%' },
25
+ { name: 'mentions_source', weight: 1, evaluator: 'regex', expected: 'https?://|source|cite' },
26
+ { name: 'correctness', weight: 4, evaluator: 'llm_judge', expected: 'Did the agent correctly identify Canberra as the capital of Australia, report a plausible population figure, compute a plausible share of the country\'s total (roughly 1.5-2%), and cite at least one source?' },
27
+ ],
28
+ timeoutMs: 180_000,
29
+ tools: ['web_search', 'web_fetch'],
30
+ },
31
+ {
32
+ id: 'gaia-multi-source-synthesis',
33
+ name: 'Two-source synthesis',
34
+ category: 'research',
35
+ suite: 'gaia-l1',
36
+ description: 'Pull facts from two different pages and combine them into a single conclusion.',
37
+ userMessage: 'Find the release year of the first iPhone and the release year of the first Android phone (HTC Dream). How many months apart were they? Cite both sources.',
38
+ expectedBehaviors: [
39
+ 'Searches for both release dates',
40
+ 'Identifies 2007 (iPhone) and 2008 (HTC Dream)',
41
+ 'Computes the month delta',
42
+ 'Cites both sources',
43
+ ],
44
+ scoringCriteria: [
45
+ { name: 'uses_search', weight: 2, evaluator: 'tool_used', expected: 'web_search' },
46
+ { name: 'mentions_2007', weight: 1, evaluator: 'contains', expected: '2007' },
47
+ { name: 'mentions_2008', weight: 1, evaluator: 'contains', expected: '2008' },
48
+ { name: 'mentions_months', weight: 1, evaluator: 'regex', expected: '\\d+\\s*(months?|mo)' },
49
+ { name: 'correctness', weight: 5, evaluator: 'llm_judge', expected: 'Did the agent correctly identify iPhone (June 2007) and HTC Dream (October 2008), compute roughly 16 months apart, and cite two distinct sources?' },
50
+ ],
51
+ timeoutMs: 180_000,
52
+ tools: ['web_search', 'web_fetch'],
53
+ },
54
+ {
55
+ id: 'gaia-unit-conversion',
56
+ name: 'Unit conversion with citation',
57
+ category: 'tool-usage',
58
+ suite: 'gaia-l1',
59
+ description: 'Look up a measurement in one unit, convert to another, and show the arithmetic.',
60
+ userMessage: 'What is the height of Mount Kilimanjaro in meters? Convert that to feet and show the arithmetic (1 m = 3.2808 ft). Cite your source.',
61
+ expectedBehaviors: [
62
+ 'Searches for Kilimanjaro height',
63
+ 'Reports the canonical figure (5895 m)',
64
+ 'Multiplies by 3.2808 to get ~19341 ft',
65
+ 'Cites source',
66
+ ],
67
+ scoringCriteria: [
68
+ { name: 'uses_search', weight: 2, evaluator: 'tool_used', expected: 'web_search' },
69
+ { name: 'mentions_5895', weight: 1, evaluator: 'regex', expected: '5,?89[0-9]' },
70
+ { name: 'mentions_feet', weight: 2, evaluator: 'regex', expected: '1[89],?\\d{3}\\s*(ft|feet)' },
71
+ { name: 'shows_arithmetic', weight: 1, evaluator: 'regex', expected: '×|\\*|x\\s*3\\.2808' },
72
+ { name: 'correctness', weight: 4, evaluator: 'llm_judge', expected: 'Did the agent report ~5895 m, convert to roughly 19341 ft showing the multiplication, and cite a source?' },
73
+ ],
74
+ timeoutMs: 180_000,
75
+ tools: ['web_search', 'web_fetch'],
76
+ },
77
+ {
78
+ id: 'gaia-chain-lookup',
79
+ name: 'Chained fact lookup',
80
+ category: 'research',
81
+ suite: 'gaia-l1',
82
+ description: 'Use the answer to one question to set up the next lookup.',
83
+ userMessage: 'Find out who wrote the novel "Dune" (1965). Then find the year that author was born. Compute how old they were when Dune was published. Cite sources.',
84
+ expectedBehaviors: [
85
+ 'Identifies Frank Herbert as the author',
86
+ 'Identifies 1920 as his birth year',
87
+ 'Computes 45',
88
+ 'Cites sources',
89
+ ],
90
+ scoringCriteria: [
91
+ { name: 'uses_search', weight: 2, evaluator: 'tool_used', expected: 'web_search' },
92
+ { name: 'mentions_herbert', weight: 2, evaluator: 'contains', expected: 'Frank Herbert' },
93
+ { name: 'mentions_1920', weight: 1, evaluator: 'contains', expected: '1920' },
94
+ { name: 'mentions_45', weight: 1, evaluator: 'regex', expected: '\\b4[4-6]\\b' },
95
+ { name: 'correctness', weight: 4, evaluator: 'llm_judge', expected: 'Did the agent identify Frank Herbert (born 1920), the 1965 publication, compute age 45, and cite sources?' },
96
+ ],
97
+ timeoutMs: 180_000,
98
+ tools: ['web_search', 'web_fetch'],
99
+ },
100
+ ]
@@ -0,0 +1,196 @@
1
+ import type { EvalScenario } from './types'
2
+
3
+ /**
4
+ * SWE-Bench-Lite-inspired scenarios: small, self-contained coding tasks that
5
+ * require reading existing code, making a focused edit, and verifying via
6
+ * shell/tests. These are curated parallels (not the full dataset), sized for
7
+ * a single-agent harness run.
8
+ */
9
+ export const SWEBENCH_LITE_SCENARIOS: EvalScenario[] = [
10
+ {
11
+ id: 'swebench-bug-off-by-one',
12
+ name: 'Fix off-by-one in range sum',
13
+ category: 'coding',
14
+ suite: 'swe-bench-lite',
15
+ description: 'Given a broken range_sum implementation, diagnose and patch it, then verify via a test run.',
16
+ userMessage: [
17
+ 'Create /tmp/swarmclaw-swebench-1/ and write `range_sum.py` with this code:',
18
+ '```python',
19
+ 'def range_sum(a, b):',
20
+ ' # inclusive on both ends',
21
+ ' total = 0',
22
+ ' for i in range(a, b):',
23
+ ' total += i',
24
+ ' return total',
25
+ '',
26
+ 'if __name__ == "__main__":',
27
+ ' assert range_sum(1, 5) == 15, f"got {range_sum(1, 5)}"',
28
+ ' print("ok")',
29
+ '```',
30
+ 'Run it, diagnose the failure, fix the bug without changing the test, and re-run until it prints ok.',
31
+ ].join('\n'),
32
+ expectedBehaviors: [
33
+ 'Writes the starter file, runs it, observes the AssertionError',
34
+ 'Identifies the range endpoint as exclusive in Python',
35
+ 'Applies a minimal fix (range(a, b+1))',
36
+ 'Re-runs and confirms "ok"',
37
+ ],
38
+ scoringCriteria: [
39
+ { name: 'uses_shell', weight: 2, evaluator: 'tool_used', expected: 'shell' },
40
+ { name: 'uses_files', weight: 2, evaluator: 'tool_used', expected: 'files' },
41
+ { name: 'mentions_off_by_one', weight: 1, evaluator: 'regex', expected: 'off.by.one|inclusive|exclusive|range\\(a, b\\+1\\)' },
42
+ { name: 'mentions_ok', weight: 1, evaluator: 'contains', expected: 'ok' },
43
+ { name: 'correctness', weight: 5, evaluator: 'llm_judge', expected: 'Did the agent: (1) create the starter file, (2) identify the off-by-one via a real run, (3) apply a minimal correct fix, and (4) confirm the test passes with "ok"?' },
44
+ ],
45
+ timeoutMs: 120_000,
46
+ tools: ['shell', 'files'],
47
+ },
48
+ {
49
+ id: 'swebench-null-guard',
50
+ name: 'Add null guard to list parser',
51
+ category: 'coding',
52
+ suite: 'swe-bench-lite',
53
+ description: 'Patch a parser that crashes on None input so it returns an empty list instead of raising.',
54
+ userMessage: [
55
+ 'Create /tmp/swarmclaw-swebench-2/ and write `parser.py`:',
56
+ '```python',
57
+ 'def parse_items(raw):',
58
+ ' return [x.strip() for x in raw.split(",")]',
59
+ '',
60
+ 'if __name__ == "__main__":',
61
+ ' assert parse_items("a,b, c") == ["a", "b", "c"]',
62
+ ' assert parse_items(None) == [], "None input should return []"',
63
+ ' print("ok")',
64
+ '```',
65
+ 'Run it, fix the crash without altering the tests, and confirm it prints ok.',
66
+ ].join('\n'),
67
+ expectedBehaviors: [
68
+ 'Creates and runs the file, observes the TypeError on None',
69
+ 'Adds a null guard returning [] when raw is falsy',
70
+ 'Reruns and confirms ok',
71
+ ],
72
+ scoringCriteria: [
73
+ { name: 'uses_shell', weight: 2, evaluator: 'tool_used', expected: 'shell' },
74
+ { name: 'uses_files', weight: 2, evaluator: 'tool_used', expected: 'files' },
75
+ { name: 'mentions_none_or_null', weight: 1, evaluator: 'regex', expected: 'none|null|falsy|guard' },
76
+ { name: 'mentions_ok', weight: 1, evaluator: 'contains', expected: 'ok' },
77
+ { name: 'correctness', weight: 5, evaluator: 'llm_judge', expected: 'Did the agent add a minimal null guard so parse_items(None) returns [] without modifying the existing test assertions?' },
78
+ ],
79
+ timeoutMs: 120_000,
80
+ tools: ['shell', 'files'],
81
+ },
82
+ {
83
+ id: 'swebench-rename-symbol',
84
+ name: 'Rename a helper across files',
85
+ category: 'coding',
86
+ suite: 'swe-bench-lite',
87
+ description: 'Rename a utility function and update every caller so the smoke test still passes.',
88
+ userMessage: [
89
+ 'Create /tmp/swarmclaw-swebench-3/ with two files:',
90
+ '',
91
+ '`utils.py`:',
92
+ '```python',
93
+ 'def format_greet(name):',
94
+ ' return f"hello, {name}"',
95
+ '```',
96
+ '',
97
+ '`main.py`:',
98
+ '```python',
99
+ 'from utils import format_greet',
100
+ 'print(format_greet("world"))',
101
+ '```',
102
+ '',
103
+ 'Rename `format_greet` to `greet` in utils.py and every caller. Then run `python main.py` from that directory and confirm it prints "hello, world".',
104
+ ].join('\n'),
105
+ expectedBehaviors: [
106
+ 'Writes both files',
107
+ 'Renames the symbol in utils.py',
108
+ 'Updates the import and call in main.py',
109
+ 'Runs main.py and confirms the expected output',
110
+ ],
111
+ scoringCriteria: [
112
+ { name: 'uses_shell', weight: 2, evaluator: 'tool_used', expected: 'shell' },
113
+ { name: 'uses_files', weight: 2, evaluator: 'tool_used', expected: 'files' },
114
+ { name: 'mentions_greet', weight: 1, evaluator: 'contains', expected: 'greet' },
115
+ { name: 'mentions_hello_world', weight: 1, evaluator: 'contains', expected: 'hello, world' },
116
+ { name: 'correctness', weight: 5, evaluator: 'llm_judge', expected: 'Did the agent rename format_greet to greet across both files and successfully run main.py producing "hello, world"?' },
117
+ ],
118
+ timeoutMs: 120_000,
119
+ tools: ['shell', 'files'],
120
+ },
121
+ {
122
+ id: 'swebench-add-feature',
123
+ name: 'Add feature with a test',
124
+ category: 'coding',
125
+ suite: 'swe-bench-lite',
126
+ description: 'Add a new method to a class along with a passing test.',
127
+ userMessage: [
128
+ 'Create /tmp/swarmclaw-swebench-4/ with `counter.py`:',
129
+ '```python',
130
+ 'class Counter:',
131
+ ' def __init__(self): self.n = 0',
132
+ ' def inc(self): self.n += 1',
133
+ '',
134
+ 'if __name__ == "__main__":',
135
+ ' c = Counter()',
136
+ ' c.inc(); c.inc(); c.inc()',
137
+ ' assert c.n == 3',
138
+ ' print("ok")',
139
+ '```',
140
+ 'Add a `dec` method that decrements by 1, and add an assertion that calling dec twice after three incs leaves n == 1. Run the file and confirm it prints ok.',
141
+ ].join('\n'),
142
+ expectedBehaviors: [
143
+ 'Writes the starter file and runs it',
144
+ 'Adds the dec method',
145
+ 'Adds the new assertion',
146
+ 'Reruns and confirms ok',
147
+ ],
148
+ scoringCriteria: [
149
+ { name: 'uses_shell', weight: 2, evaluator: 'tool_used', expected: 'shell' },
150
+ { name: 'uses_files', weight: 2, evaluator: 'tool_used', expected: 'files' },
151
+ { name: 'mentions_dec', weight: 1, evaluator: 'contains', expected: 'dec' },
152
+ { name: 'mentions_ok', weight: 1, evaluator: 'contains', expected: 'ok' },
153
+ { name: 'correctness', weight: 5, evaluator: 'llm_judge', expected: 'Did the agent add a dec method that decrements n, add a new assertion (n == 1 after 3 incs and 2 decs), and confirm the file runs successfully?' },
154
+ ],
155
+ timeoutMs: 120_000,
156
+ tools: ['shell', 'files'],
157
+ },
158
+ {
159
+ id: 'swebench-regex-fix',
160
+ name: 'Tighten a permissive regex',
161
+ category: 'coding',
162
+ suite: 'swe-bench-lite',
163
+ description: 'Fix an email validator that accepts too much, without regressing valid inputs.',
164
+ userMessage: [
165
+ 'Create /tmp/swarmclaw-swebench-5/ with `validate.py`:',
166
+ '```python',
167
+ 'import re',
168
+ 'def is_email(s):',
169
+ ' return re.match(r".+@.+", s) is not None',
170
+ '',
171
+ 'if __name__ == "__main__":',
172
+ ' assert is_email("a@b.co")',
173
+ ' assert is_email("user.name+tag@host.example")',
174
+ ' assert not is_email("not an email"), "must reject missing @"',
175
+ ' assert not is_email("a@b"), "must require a dot in domain"',
176
+ ' assert not is_email("@b.co"), "must require local part"',
177
+ ' print("ok")',
178
+ '```',
179
+ 'Run it, see which assertion fails, tighten the regex without changing the tests, and confirm it prints ok.',
180
+ ].join('\n'),
181
+ expectedBehaviors: [
182
+ 'Runs the starter, identifies which tests fail',
183
+ 'Replaces the regex with one that requires a local part, @, and a dotted domain',
184
+ 'Reruns and confirms ok',
185
+ ],
186
+ scoringCriteria: [
187
+ { name: 'uses_shell', weight: 2, evaluator: 'tool_used', expected: 'shell' },
188
+ { name: 'uses_files', weight: 2, evaluator: 'tool_used', expected: 'files' },
189
+ { name: 'mentions_regex', weight: 1, evaluator: 'regex', expected: 'regex|pattern' },
190
+ { name: 'mentions_ok', weight: 1, evaluator: 'contains', expected: 'ok' },
191
+ { name: 'correctness', weight: 5, evaluator: 'llm_judge', expected: 'Did the agent tighten is_email so all five assertions pass (valid addresses accepted, invalid ones rejected) without altering the tests?' },
192
+ ],
193
+ timeoutMs: 120_000,
194
+ tools: ['shell', 'files'],
195
+ },
196
+ ]
@@ -1,6 +1,8 @@
1
1
  import type { EvalScenario } from './types'
2
+ import { SWEBENCH_LITE_SCENARIOS } from './scenarios-swebench'
3
+ import { GAIA_L1_SCENARIOS } from './scenarios-gaia'
2
4
 
3
- export const EVAL_SCENARIOS: EvalScenario[] = [
5
+ const CORE_SCENARIOS: EvalScenario[] = [
4
6
  {
5
7
  id: 'coding-prime',
6
8
  name: 'Prime Number Function',
@@ -213,6 +215,23 @@ export const EVAL_SCENARIOS: EvalScenario[] = [
213
215
  },
214
216
  ]
215
217
 
218
+ export const EVAL_SCENARIOS: EvalScenario[] = [
219
+ ...CORE_SCENARIOS,
220
+ ...SWEBENCH_LITE_SCENARIOS,
221
+ ...GAIA_L1_SCENARIOS,
222
+ ]
223
+
216
224
  export function getScenario(id: string): EvalScenario | undefined {
217
225
  return EVAL_SCENARIOS.find(s => s.id === id)
218
226
  }
227
+
228
+ export function getSuiteScenarios(suite: string): EvalScenario[] {
229
+ if (suite === 'core') return EVAL_SCENARIOS.filter(s => !s.suite || s.suite === 'core')
230
+ return EVAL_SCENARIOS.filter(s => s.suite === suite)
231
+ }
232
+
233
+ export function listSuites(): string[] {
234
+ const seen = new Set<string>(['core'])
235
+ for (const s of EVAL_SCENARIOS) if (s.suite) seen.add(s.suite)
236
+ return Array.from(seen)
237
+ }
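Given the scenarios registered in this release, the helpers resolve as follows; `tool-use` and `code-action` exist in the `EvalSuite` type but ship no scenarios yet, so they are not listed:

```ts
listSuites() // => ['core', 'swe-bench-lite', 'gaia-l1']

getSuiteScenarios('swe-bench-lite').length // => 5
getSuiteScenarios('gaia-l1').length // => 4
getSuiteScenarios('core') // => every scenario with no suite tag (the CORE_SCENARIOS)
```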
@@ -5,6 +5,8 @@ export interface ScoringCriterion {
5
5
  expected: string
6
6
  }
7
7
 
8
+ export type EvalSuite = 'core' | 'swe-bench-lite' | 'gaia-l1' | 'tool-use' | 'code-action'
9
+
8
10
  export interface EvalScenario {
9
11
  id: string
10
12
  name: string
@@ -15,6 +17,8 @@ export interface EvalScenario {
15
17
  scoringCriteria: ScoringCriterion[]
16
18
  timeoutMs: number
17
19
  tools: string[]
20
+ /** Optional suite tag. Scenarios without a suite belong to the 'core' suite. */
21
+ suite?: EvalSuite
18
22
  }
19
23
 
20
24
  export interface EvalRun {
@@ -164,6 +164,29 @@ export const BUILT_IN_MISSION_TEMPLATES: MissionTemplate[] = [
164
164
  reportSchedule: report(12 * HOUR),
165
165
  },
166
166
  },
167
+ {
168
+ id: 'hello-world-demo',
169
+ name: 'Hello World Demo',
170
+ description:
171
+ 'A zero-cost first-run mission that summarizes the current working directory into a short markdown report. Great for first-time users to watch an agent complete a bounded task end-to-end.',
172
+ icon: '👋',
173
+ category: 'research',
174
+ tags: ['demo', 'first-run', 'short'],
175
+ setupNote:
176
+ 'No setup required. This demo mission runs in your workspace, reads a few files, and produces a short markdown summary. Best paired with a local Ollama model or any configured provider.',
177
+ defaults: {
178
+ title: 'Hello World Demo',
179
+ goal:
180
+ 'List the files in the current working directory, pick the 3 that look most interesting, read a short excerpt from each, and write a markdown file `hello-world-report.md` with a one-paragraph summary of what this project appears to do. Do not modify any existing files.',
181
+ successCriteria: [
182
+ 'Reads at least 3 files',
183
+ 'Writes hello-world-report.md with a clear one-paragraph summary',
184
+ 'Does not modify any pre-existing files',
185
+ ],
186
+ budget: budget({ maxUsd: 0.25, maxTokens: 20_000, maxTurns: 30, maxWallclockSec: 15 * 60 }),
187
+ reportSchedule: report(HOUR),
188
+ },
189
+ },
167
190
  ]
168
191
 
169
192
  const TEMPLATE_INDEX: Map<string, MissionTemplate> = new Map(
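A small sketch of reading the new template's caps out of the registry (the lookup itself is an assumption about how a caller would use `BUILT_IN_MISSION_TEMPLATES`):

```ts
const demo = BUILT_IN_MISSION_TEMPLATES.find((t) => t.id === 'hello-world-demo')
// demo?.defaults.budget carries the four caps: maxUsd 0.25, maxTokens 20_000,
// maxTurns 30, maxWallclockSec 900, i.e. a run small enough that a local
// Ollama model finishes the mission at zero cost.
```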
@@ -553,6 +553,48 @@ const BUILDER_AGENT_TOOLS = [
553
553
  'qwen_code_cli',
554
554
  ]
555
555
 
556
+ const INBOX_TRIAGE_PROMPT = `You are an inbox triage copilot inside SwarmClaw.
557
+
558
+ Primary objective:
559
+ - Sort incoming email, messages, and notifications so the user only sees what needs their attention.
560
+
561
+ Behavior:
562
+ - Classify items by urgency, topic, and whether they need a reply.
563
+ - Draft short reply candidates for the user to approve when appropriate.
564
+ - Surface clear summaries and action lists instead of raw firehose.
565
+ - Stop and ask the user before sending on their behalf.`
566
+
567
+ const DATA_ANALYST_PROMPT = `You are a data analyst inside SwarmClaw.
568
+
569
+ Primary objective:
570
+ - Help the user explore, clean, and summarize data, producing concise findings and charts when useful.
571
+
572
+ Behavior:
573
+ - Prefer working in a shell (python/pandas) or via files, showing intermediate results.
574
+ - State the question before computing, and flag limitations or assumptions.
575
+ - Summarize insights with simple prose plus key numbers.
576
+ - When useful, save artifacts (CSV, markdown, PNG) to the working directory.`
577
+
578
+ const INBOX_AGENT_TOOLS = [
579
+ 'memory',
580
+ 'files',
581
+ 'web_search',
582
+ 'web_fetch',
583
+ 'email',
584
+ 'manage_tasks',
585
+ 'manage_documents',
586
+ ]
587
+
588
+ const DATA_ANALYST_TOOLS = [
589
+ 'memory',
590
+ 'files',
591
+ 'execute',
592
+ 'web_search',
593
+ 'web_fetch',
594
+ 'manage_tasks',
595
+ 'manage_documents',
596
+ ]
597
+
556
598
  const OPERATOR_AGENT_TOOLS = STARTER_AGENT_TOOLS
557
599
  const OPENCLAW_AGENT_TOOLS = [
558
600
  'memory',
@@ -720,6 +762,40 @@ export const STARTER_KITS: StarterKit[] = [
720
762
  },
721
763
  ],
722
764
  },
765
+ {
766
+ id: 'inbox_triage',
767
+ name: 'Inbox Triager',
768
+ description: 'A single agent that sorts and summarizes your inbox.',
769
+ detail: 'Good when messages pile up faster than you can read them. Pairs well with the email connector.',
770
+ recommendedFor: ['intent', 'manual'],
771
+ agents: [
772
+ {
773
+ id: 'triager',
774
+ name: 'Triager',
775
+ description: 'Triages inbound messages into urgent, reply-needed, and informational buckets.',
776
+ systemPrompt: INBOX_TRIAGE_PROMPT,
777
+ tools: INBOX_AGENT_TOOLS,
778
+ capabilities: ['triage', 'summarization', 'drafting'],
779
+ },
780
+ ],
781
+ },
782
+ {
783
+ id: 'data_analyst',
784
+ name: 'Data Analyst',
785
+ description: 'A single agent focused on exploring and summarizing data.',
786
+ detail: 'Useful for ad-hoc analysis, CSV crunching, and producing concise findings with charts.',
787
+ recommendedFor: ['intent', 'manual'],
788
+ agents: [
789
+ {
790
+ id: 'analyst',
791
+ name: 'Analyst',
792
+ description: 'Runs exploratory analyses, cleans datasets, and writes short summaries with key numbers.',
793
+ systemPrompt: DATA_ANALYST_PROMPT,
794
+ tools: DATA_ANALYST_TOOLS,
795
+ capabilities: ['analysis', 'summarization', 'visualization'],
796
+ },
797
+ ],
798
+ },
723
799
  {
724
800
  id: 'blank_workspace',
725
801
  name: 'Blank Workspace',