npm - agent.libx.js - Versions diffs - 0.93.33 → 0.93.35 - Mend

agent.libx.js 0.93.33 → 0.93.35

Files changed (3)

package/README.md CHANGED

@@ -1,6 +1,21 @@
 # agent.libx.js
-**A coding agent on par with Claude Code — that also runs where Claude Code can't.**
+[![npm](https://img.shields.io/npm/v/agent.libx.js?color=cb3837&logo=npm)](https://www.npmjs.com/package/agent.libx.js)
+[![publish](https://github.com/Livshitz/agent.libx.js/actions/workflows/publish.yml/badge.svg)](https://github.com/Livshitz/agent.libx.js/actions/workflows/publish.yml)
+[![license](https://img.shields.io/npm/l/agent.libx.js)](./LICENSE)
+![runtime](https://img.shields.io/badge/runtime-Bun-black?logo=bun)
+![edge-ready](https://img.shields.io/badge/runs-node%20%C2%B7%20browser%20%C2%B7%20edge-brightgreen)
+**A coding agent that matches Claude Code on correctness — then beats it on cost, tokens, and tool-efficiency, and runs where Claude Code can't (sandbox, browser/edge, database).**
+<!-- DEMO GIF: record a ~15s session (asciinema rec demo.cast → agg demo.cast docs/demo.gif, or a terminal-screen capture) and replace this comment with:
+<p align="center"><img src="docs/demo.gif" alt="agentx fixing a bug" width="720"></p> -->
+```console
+$ agentx "There's a bug in math.js — add() subtracts instead of adds. Fix it."
+  ⚙ Edit  math.js   a - b  →  a + b
+  ✓ Fixed: `a - b` → `a + b` in add().
+```
 By default it's a full-strength terminal coding agent: real disk, real shell, and the same Read/Edit/Grep/permissions/streaming-DX surface you'd expect from Claude Code. The difference is its two host couplings are swappable seams:
@@ -11,16 +26,41 @@ So the *same* agent loop also runs **sandboxed** (in-memory VFS, real disk untou
 Claude Code is the floor; running isolated, on the edge, or hybrid is the ceiling.
+## How it stacks up vs Claude Code
+**Correctness parity — efficiency, cost, and reach are the lead.** Hard 7-task coding suite, Sonnet, *denoised* (each task ×3, no lucky run promotes; `SUITE=hard bun compare/run.ts`):
+| | agent.libx.js | Claude Code |
+|---|---|---|
+| Correctness | 7/7 | 7/7 — **parity** |
+| Tool-calls | **16** | 28 — **−43%** |
+| Tokens | **69k** | 171k — **2.5× fewer** |
+| Wall-time | **~100s** | 133s — **~25% faster** |
+**Cost** (9-task hard suite, USD-metered, vs CC-on-Opus): **$0.49** single-tier Sonnet (**5.4× cheaper**) · **$0.82** three-tier voice/duplex (**3.3× cheaper**) vs CC-Opus **$2.67** — at quality parity (16/18 vs 17/18 passes).
+Plus things Claude Code simply doesn't do:
+- **Runs where CC can't** — the *same* agent loop runs on real disk, an in-memory **sandbox**, the **browser/edge** (no Node, no `/bin/sh`), or a **database-backed** workspace. Swap the filesystem, not the agent.
+- **Keyless web search, built in** — `WebSearch` works in any deployment with no API key (DuckDuckGo; auto-upgrades to Tavily if you set one). CC's search is Anthropic-server-bound.
+- **Context-safe by default** — a 1 MB `Grep`/`Read`/MCP result is auto-paginated and can't blow the window; buried detail is recovered via a cheap context-isolated `Ask` peek — **~5.3× cheaper and more accurate** than re-fetching, in a head-to-head.
+- **It improves its own efficiency** — an autonomous evolution loop cut its own tool-use **~50% (32 → 15** on the core suite, denoised), self-discovered, not hand-tuned — the same lever behind the efficiency lead above.
+*Honest scope:* the win is **efficiency / cost / reach**, not a claim of smarter reasoning — correctness is parity. All figures are denoised and reproducible (see [Eval & compare](#eval--compare)); full boards in [`mind/09-outperform.md`](./mind/09-outperform.md).
 ## Quickstart
+Point it at your project — no clone needed (requires [Bun](https://bun.sh)):
 ```bash
-bun install                 # links wcli (file:), ai.libx.js + libx.js (bun link)
-bun test                    # 34 unit/integration tests (no API key needed)
-ANTHROPIC_API_KEY=… bun examples/run-sonnet.ts   # drive a real model
-bun eval/run.ts             # quantitative eval scorecard (real model)
+export ANTHROPIC_API_KEY=…                              # or OPENAI_API_KEY / GOOGLE_API_KEY / GROQ_API_KEY
+bunx agent.libx.js "find and fix the failing test"      # run once in the current directory
+bunx agent.libx.js                                      # …or open the interactive REPL
 ```
-## Use it
+Want a permanent command? `bun add -g agent.libx.js`, then just `agentx` (and `agentx --duplex` for voice). The agent has full real-disk + shell access by default (like Claude Code); add `--sandbox` to work on an in-memory copy instead. See [The `agentx` CLI](#the-agentx-cli) for flags, sessions, and slash commands.
+## Use it as a library
 ```ts
 import { AIClient } from 'ai.libx.js';
@@ -103,14 +143,24 @@ Full design + threat model + results: [`mind/08-self-evolve.md`](./mind/08-self-
 ## Status
-**v1 (done):** loop + hybrid tools + Mem/Disk backends + deterministic `FakeAIClient` tests + real-model run. **5/5 pass@1** on the behavioral eval (Sonnet 4.6); the head-to-head started at correctness parity with Claude Code but ~2× the tool calls (≈28 vs 15) — a gap the **self-evolution loop has now closed autonomously**: it drove its own baseline from 32 → 15 tool-calls (denoised over 3 runs) and ties Claude Code in a fresh head-to-head (15 vs 15). **112 tests green.**
+**v1 (done):** loop + hybrid tools + Mem/Disk backends + deterministic `FakeAIClient` tests + real-model run. **5/5 pass@1** on the behavioral eval (Sonnet 4.6); the head-to-head started at correctness parity with Claude Code but ~2× the tool calls (≈28 vs 15) — a gap the **self-evolution loop has now closed autonomously**: it drove its own baseline from 32 → 15 tool-calls (denoised over 3 runs) and ties Claude Code in a fresh head-to-head (15 vs 15). **820+ tests green.**
 See [`mind/`](./mind/) for the full vision, architecture, decision journal, roadmap, eval + head-to-head results, the [parity plan](./mind/05-parity.md), and the [self-evolution design](./mind/08-self-evolve.md).
-## Eval & compare
+## Develop & evaluate
+Hacking on the runtime itself (from a clone):
+```bash
+bun install                # links wcli (file:), ai.libx.js + libx.js (bun link)
+bun test                   # 820+ unit/integration tests (offline via FakeAIClient, no key)
+ANTHROPIC_API_KEY=… bun examples/run-sonnet.ts   # drive a real model end-to-end
+```
+Eval & head-to-head (real model):
 ```bash
 bun eval/run.ts            # behavioral scorecard (our agent over MemFilesystem)
 bun compare/seed-tasks.ts  # materialize task specs into .tmp/tasks/
-bun compare/run.ts         # head-to-head vs Claude Code (needs `claude` CLI)
+bun compare/run.ts         # head-to-head vs Claude Code (needs the `claude` CLI)
 ```
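
Reviewer note: the new "How it stacks up vs Claude Code" section in the README diff above quotes several derived figures (−43%, 2.5×, ~25%, 5.4×, 3.3×). The sketch below recomputes them from the raw numbers in the same table and cost line, as a quick consistency check. The helper names `pctReduction` and `ratio` are hypothetical and not part of agent.libx.js, and the rounding conventions (whole percent, one decimal for ratios) are assumptions about how the README rounded.

```ts
// Raw figures are copied from the README table and cost line in the diff.
// Helper names and rounding rules are illustrative assumptions.

/** Percentage by which `ours` is lower than `theirs`, rounded to a whole percent. */
function pctReduction(ours: number, theirs: number): number {
  return Math.round(((theirs - ours) / theirs) * 100);
}

/** How many times larger `theirs` is than `ours`, rounded to one decimal. */
function ratio(theirs: number, ours: number): number {
  return Math.round((theirs / ours) * 10) / 10;
}

const checks = {
  toolCallReductionPct: pctReduction(16, 28),   // README claims −43%
  tokenRatio: ratio(171, 69),                   // README claims 2.5× fewer (69k vs 171k)
  wallTimeReductionPct: pctReduction(100, 133), // README claims ~25% faster (~100s vs 133s)
  sonnetCostRatio: ratio(2.67, 0.49),           // README claims 5.4× cheaper
  duplexCostRatio: ratio(2.67, 0.82),           // README claims 3.3× cheaper
};

console.log(checks);
```

Under these assumptions the recomputed values agree with the quoted ones. Note that "~25% faster" is a 25% reduction in wall-time, not a 25% increase in speed.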

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "agent.libx.js",
-  "version": "0.93.33",
+  "version": "0.93.35",
   "description": "Edge-native AI agent runtime — drives a virtual filesystem via any LLM (ai.libx.js). Same bytes run in node, browser, or edge.",
   "type": "module",
   "main": "./dist/index.js",
@@ -46,7 +46,7 @@
     "node": ">=18"
   },
   "bin": {
-    "agentx": "cli/cli.ts"
+    "agentx": "dist/cli.js"
   },
   "files": [
     "dist",