@framers/agentos 0.5.12 → 0.5.13

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (2)
  1. package/README.md +92 -35
  2. package/package.json +1 -1
package/README.md CHANGED
@@ -8,7 +8,7 @@
 
 # **AgentOS** — Open-Source TypeScript AI Agent Runtime with Cognitive Memory, HEXACO Personality, and Runtime Tool Forging
 
- **85.6% on LongMemEval-S** at $0.0090/correct, +1.4 above Mastra OM gpt-4o (84.23%) at the matched reader · **70.2% on LongMemEval-M** (1.5M-token variant), the only open-source library on the public record above 65% on M with publicly reproducible methodology · 16 LLM providers · 8 neuroscience-backed memory mechanisms · Apache-2.0
+ **85.6% on LongMemEval-S** at $0.0090/correct, +1.4 above Mastra OM gpt-4o (84.23%) · **70.2% on LongMemEval-M** (1.5M-token variant), the only open-source library on the public record above 65% on M with publicly reproducible methodology · 16 LLM providers · 8 neuroscience-backed memory mechanisms · Apache-2.0
 
 [![npm](https://img.shields.io/npm/v/@framers/agentos?style=flat-square&logo=npm&color=cb3837)](https://www.npmjs.com/package/@framers/agentos)
 [![CI](https://img.shields.io/github/actions/workflow/status/framersai/agentos/ci.yml?branch=master&style=flat-square&logo=github&label=CI)](https://github.com/framersai/agentos/actions/workflows/ci.yml)
@@ -24,6 +24,12 @@
 
 ---
 
+ AgentOS is an open-source TypeScript runtime for AI agents that adapt, remember, and collaborate. The runtime carries the parts of an agent that should outlive a single chat completion: persistent [cognitive memory](https://docs.agentos.sh/features/cognitive-memory) grounded in published cognitive-science literature, optional [HEXACO personality](https://docs.agentos.sh/features/cognitive-memory-guide) modeling, runtime tool forging in a hardened sandbox, [six multi-agent orchestration strategies](https://docs.agentos.sh/features/multi-agent-collaboration), [streaming guardrails](https://docs.agentos.sh/features/guardrails-architecture), a [voice pipeline](https://docs.agentos.sh/features/voice-pipeline), and one dispatch interface across 21 LLM providers. Apache-2.0.
+
+ On benchmarks: **85.6% on LongMemEval-S** at $0.0090 per correct answer (gpt-4o reader, +1.4 points above Mastra's published 84.23%); **70.2% on LongMemEval-M** (1.5M-token haystacks, 500 sessions per question), the only open-source library on the public record above 65% on M with publicly reproducible methodology. Per-case run JSONs and single-CLI reproduction ship in [agentos-bench](https://github.com/framersai/agentos-bench).
+
+ ---
+
 ## Install
 
 ```bash
@@ -49,43 +55,92 @@ await session.send('Can you expand on that?'); // remembers context
 
 ---
 
- ## Memory Benchmarks at Matched Reader
+ ## Emergent Design
 
- Same `gpt-4o` reader, same dataset, same `gpt-4o-2024-08-06` judge across every row. Cross-provider configurations are excluded because they cannot be reproduced from public methodology disclosures.
+ > "So we and our elaborately evolving computers may meet each other halfway."
+ >
+ > — Philip K. Dick, *The Android and the Human*, 1972
 
- ### LongMemEval-S (115K tokens, 50 sessions)
+ Three things accumulate inside an AgentOS session:
+
+ 1. **Memory.** What was said, what was decided, what was retrieved.
+ 2. **Tool surface.** Starts at whatever was registered. Can grow mid-decision when an agent forges a new function and the judge approves it.
+ 3. **Personality** (optional). A HEXACO trait vector that biases retrieval, specialist routing, and decision-making.
+
+ Behavior in turn six is a function of all three carried forward from turns one through five: which memories got reinforced, which forged tools entered the catalog, which trait values weighted which evidence. Each of those is configurable and observable. None of the three crosses into "emergent agent" on its own; the composition is the interesting part.
 
- | System (gpt-4o reader) | Accuracy | $/correct | p50 latency | Source |
- |---|---:|---:|---:|---|
- | EmergenceMem Internal | 86.0% | not published | 5,650 ms | [emergence.ai](https://www.emergence.ai/blog/sota-on-longmemeval-with-rag) |
- | **🚀 AgentOS canonical-hybrid + reader-router** | **85.6%** | **$0.0090** | **3,558 ms** | [post](https://docs.agentos.sh/blog/2026/04/28/reader-router-pareto-win) |
- | Mastra OM gpt-4o (gemini-flash observer) | 84.23% | not published | not published | [mastra.ai](https://mastra.ai/research/observational-memory) |
- | Supermemory gpt-4o | 81.6% | not published | not published | [supermemory.ai](https://supermemory.ai/research/) |
- | EmergenceMem Simple Fast (rerun in agentos-bench) | 80.6% | $0.0586 | 3,703 ms | [adapter](https://github.com/framersai/agentos-bench/blob/master/vendors/emergence-simple-fast/) |
- | Zep self / independent reproduction | 71.2% / 63.8% | not published | not published | [self](https://blog.getzep.com/state-of-the-art-agent-memory/) / [arXiv](https://arxiv.org/abs/2512.13564) |
+ ### Runtime Tool Forging
 
- **+1.4 points above Mastra OM gpt-4o (84.23%) at the matched reader.** Among open-source memory libraries that publish at gpt-4o with publicly reproducible runs (per-case run JSONs at fixed seed, single-CLI reproduction), AgentOS at 85.6% is the highest published number. EmergenceMem Internal posts 86.0% (0.4 above us) but does not publish per-case results or a reproducible CLI. AgentOS p50 latency 3,558 ms vs EmergenceMem's published median 5,650 ms.
+ When an agent encounters a sub-task that no available tool covers, it generates a TypeScript function with a Zod-described input and output schema. A separate LLM call evaluates the forged function against the agent's stated intent and either approves or rejects it. Approved functions execute in a hardened `node:vm` sandbox with a default 5-second wall clock and a 128 MB nominal memory budget (reported as a heap-delta heuristic, not preemptively enforced; preemptive memory limits via [`isolated-vm`](https://github.com/laverdet/isolated-vm) are queued for the hosted multi-tenant tier). The sandbox always bans `eval`, `Function`, `require`, dynamic `import`, `process`, `child_process`, and destructive `fs.*`. `fetch`, `fs.readFile`, and `crypto` are allowlist-only opt-ins; the default allowlist is empty, so by default the execution environment has no network, no filesystem, and no crypto unless the host explicitly grants them per-tool.
+
+ Approved tools are added to a discoverable index keyed by name and signature, and subsequent turns invoke them via `call_forged_tool(name, args)`. A first-time forge costs full LLM tokens; reuse costs tens of tokens, so total cost per turn flattens once a session has accumulated a handful of approved tools.
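+ A minimal sketch of the sandbox pattern described above, using Node's built-in `node:vm` (illustrative only: the banned globals and 5-second wall clock come from the description above, while the helper name and allowlist wiring are this sketch's own assumptions):
+
+ ```ts
+ import vm from 'node:vm';
+
+ // Hypothetical helper: run a forged tool body with only an explicit
+ // allowlist in scope (empty by default: no fetch, no fs, no crypto).
+ function runForged(code: string, args: unknown, allowlist: Record<string, unknown> = {}) {
+   const context = vm.createContext({
+     args,
+     ...allowlist,
+     // Shadow banned globals; a production sandbox must also block
+     // constructor-based escapes, which plain shadowing does not.
+     eval: undefined,
+     Function: undefined,
+     require: undefined,
+     process: undefined,
+   });
+   // Default 5-second wall clock from the description above.
+   return vm.runInContext(code, context, { timeout: 5_000 });
+ }
+ ```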
 
- Notes on cross-provider numbers excluded from this table: Mastra also publishes 94.87% with a gpt-5-mini reader plus gemini-2.5-flash observer (cross-provider); agentmemory publishes 96.2% with a Claude Opus 4.6 reader; MemMachine publishes 93.0% with a GPT-5-mini reader; Hindsight publishes 91.4% with an unspecified stronger backbone. None of these are at the matched gpt-4o reader, and most do not publish full methodology details (judge model, dataset version, per-case results, single-CLI reproduction).
+ The path the runtime supports: an agent forges a tool mid-decision, the judge approves it, that turn invokes it, and a few turns later a different specialist agent in the same session invokes the same tool because the index made it findable. Neither side is scripted; the runtime makes the tool discoverable and the agents find it on their own.
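+ A sketch of those reuse mechanics (the index shape and helper below are illustrative; only `call_forged_tool(name, args)` is quoted from the runtime's surface above):
+
+ ```ts
+ // Hypothetical in-session index of judge-approved forged tools,
+ // keyed by name, carrying the signature for discovery.
+ type ForgedTool = { signature: string; run: (args: unknown) => Promise<unknown> };
+ const forgedIndex = new Map<string, ForgedTool>();
+
+ // Any agent in the session resolves tools by name, so a tool forged
+ // by one specialist is findable by another a few turns later.
+ async function callForgedTool(name: string, args: unknown) {
+   const tool = forgedIndex.get(name);
+   if (!tool) throw new Error(`no approved forged tool named ${name}`);
+   return tool.run(args); // reuse costs no forging tokens, only the call
+ }
+ ```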
 
- **Cost at scale**: $0.0090 per memory-grounded answer = $9 per 1,000 RAG calls. A chatbot averaging 5 RAG calls per conversation across 1,000 conversations costs ~$45.
+ ### HEXACO Personality (optional)
+
+ Personality is opt-in. The runtime behaves identically with or without a trait vector, and most production deployments do not pass one.
+
+ ```ts
+ // Assumes the package's top-level `agent` export, as used in the quickstart.
+ import { agent } from '@framers/agentos';
+
+ // Personality-neutral (most production agents)
+ const support = agent({
+   provider: 'openai',
+   instructions: 'Resolve customer tickets.',
+   memory: { types: ['episodic', 'semantic'] },
+ });
+
+ // Opt-in HEXACO (when persona consistency across sessions matters)
+ const coach = agent({
+   provider: 'openai',
+   instructions: 'Long-running career coach. Hold the user accountable to their stated goals across weekly check-ins; flag drift, push back on excuses, escalate when goals shift.',
+   personality: {
+     conscientiousness: 0.9, // won't let goals drift between sessions
+     honestyHumility: 0.85,  // won't tell the user what they want to hear
+     emotionality: 0.3,      // stays steady when the user is reactive
+   },
+   memory: { types: ['episodic', 'semantic'] },
+ });
+ ```
+
+ When a vector is supplied, the kernel weights retrieval, specialist routing, and tool selection by the trait values. Same agent, same prompt, same tools: a high-Openness leader and a high-Conscientiousness leader produce measurably different decision sequences. The bias lives in the kernel, not in the prompt; prompt-only personality dissolves under context pressure while kernel-encoded bias persists. The vector remains editable, inspectable, and removable on consent.
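+ A toy illustration of what kernel-level weighting means (the field names and formula are this sketch's assumptions, not the runtime's actual scoring):
+
+ ```ts
+ // Hypothetical: bias a memory's retrieval score by HEXACO trait values.
+ type Traits = { openness?: number; conscientiousness?: number };
+
+ function rescore(base: number, memo: { novelty: number; goalRelevance: number }, t: Traits) {
+   // High Openness up-weights novel evidence; high Conscientiousness
+   // up-weights evidence tied to standing goals. With no vector
+   // (0.5 defaults), the score passes through unchanged.
+   const o = t.openness ?? 0.5;
+   const c = t.conscientiousness ?? 0.5;
+   return base * (1 + (o - 0.5) * memo.novelty + (c - 0.5) * memo.goalRelevance);
+ }
+ ```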
+
+ ---
+
+ ## Memory Benchmarks
+
+ `gpt-4o` reader, `gpt-4o-2024-08-06` judge, full N=500 across every row. Cross-provider numbers are excluded from the tables because they cannot be reproduced from their public methodology disclosures.
+
+ ### LongMemEval-S (115K tokens, 50 sessions)
+
+ | System | Accuracy | $/correct | p50 latency |
+ |---|---:|---:|---:|
+ | EmergenceMem Internal | 86.0% | not published | 5,650 ms |
+ | **AgentOS** (canonical-hybrid + reader-router) | **85.6%** | **$0.0090** | **3,558 ms** |
+ | Mastra OM gpt-4o (gemini-flash observer) | 84.23% | not published | not published |
+ | Supermemory gpt-4o | 81.6% | not published | not published |
+ | EmergenceMem Simple Fast (rerun in agentos-bench) | 80.6% | $0.0586 | 3,703 ms |
+ | Zep (self / independent reproduction) | 71.2% / 63.8% | not published | not published |
+
+ **+1.4 points above Mastra OM.** EmergenceMem Internal posts 86.0% (0.4 above) but publishes neither per-case results nor a reproducible CLI; among open-source libraries with single-CLI reproduction at `gpt-4o`, 85.6% is the highest publicly reproducible result. AgentOS p50 latency is 3,558 ms vs EmergenceMem's published median of 5,650 ms.
+
+ Cross-provider numbers omitted from the table (different reader and/or undisclosed judge): Mastra OM 94.87% (gpt-5-mini + gemini-2.5-flash observer), agentmemory 96.2% (Claude Opus 4.6), MemMachine 93.0% (GPT-5-mini), Hindsight 91.4% (unspecified backbone).
 
 ### LongMemEval-M (1.5M tokens, 500 sessions)
 
- The harder variant. M's haystacks exceed every production context window. Most vendors stop at S because raw long-context fits there.
+ M's haystacks exceed every production context window; most vendors only publish on S.
+
+ | System | Accuracy | License |
+ |---|---:|---|
+ | LongMemEval paper, GPT-4o round Top-10 (paper's best) | 72.0% | open repo |
+ | AgentBrain | 71.7% | closed-source SaaS |
+ | LongMemEval paper, GPT-4o session Top-5 | 71.4% | open repo |
+ | **AgentOS** (sem-embed + reader-router + Top-5) | **70.2%** | **Apache-2.0** |
+ | LongMemEval paper, GPT-4o round Top-5 | 65.7% | open repo |
+ | Mem0 v3, Mastra, Hindsight, Zep, EmergenceMem, Supermemory, Letta | not published | — |
 
- | System | Accuracy | License | Source |
- |---|---:|---|---|
- | LongMemEval paper, strongest GPT-4o (round, Top-10) | 72.0% | open repo | [Wu et al., ICLR 2025, Table 3](https://arxiv.org/abs/2410.10813) |
- | AgentBrain | 71.7% | closed-source SaaS | [github.com/AgentBrainHQ](https://github.com/AgentBrainHQ) |
- | LongMemEval paper, strongest GPT-4o at Top-5 (session) | 71.4% | open repo | [Wu et al., ICLR 2025, Table 3](https://arxiv.org/abs/2410.10813) |
- | **🚀 AgentOS** (sem-embed + reader-router + top-K=5) | **70.2%** | **Apache-2.0** | [post](https://docs.agentos.sh/blog/2026/04/29/longmemeval-m-70-with-topk5) |
- | LongMemEval paper, GPT-4o at Top-5 (round) | 65.7% | open repo | [Wu et al., ICLR 2025, Table 3](https://arxiv.org/abs/2410.10813) |
- | Mem0 v3, Mastra, Hindsight, Zep, EmergenceMem, Supermemory, Letta, others | not published | various | reports S only |
+ At matched Top-5 retrieval, AgentOS is +4.5 points above the round-level paper baseline (65.7%) and 1.2 points below the session-level baseline (71.4%); the paper's overall strongest GPT-4o result is 72.0% at Top-10. Of open-source libraries with publicly reproducible runs, AgentOS is the only one above 65% on M.
 
- **Competitive with the strongest published M results in the LongMemEval paper.** At matched Top-5 retrieval, AgentOS at 70.2% is +4.5 points above the round-level configuration (65.7%) and 1.2 points below the session-level configuration (71.4%); the paper's strongest GPT-4o result overall is 72.0% at round-level Top-10. Among open-source memory libraries with publicly reproducible runs (per-case run JSONs at fixed seed, single-CLI reproduction), AgentOS is the only one on the public record above 65% on M.
+ > **[Full leaderboard →](https://github.com/framersai/agentos-bench/blob/master/results/LEADERBOARD.md)** · **[Run JSONs →](https://github.com/framersai/agentos-bench/tree/master/results/runs)** · **[Transparency audit →](https://agentos.sh/en/blog/memory-benchmark-transparency-audit/)** · **[LongMemEval paper](https://arxiv.org/abs/2410.10813)** (Wu et al., ICLR 2025, Table 3)
 
- > **[Full benchmarks page ](https://github.com/framersai/agentos-bench/blob/master/results/LEADERBOARD.md)** · **[Reproducible run JSONs →](https://github.com/framersai/agentos-bench/tree/master/results/runs)** · **[Methodology audit ](https://agentos.sh/en/blog/agentos-memory-sota-longmemeval/)**
+ The transparency audit covers what the headline numbers above don't. LOCOMO's answer key has a [6.4% ground-truth error rate per Penfield Labs](https://dev.to/penfieldlabs/we-audited-locomo-64-of-the-answer-key-is-wrong-and-the-judge-accepts-up-to-63-of-intentionally-33lg), capping any system's possible score at ~93.6%, and LOCOMO's default LLM judge accepts 62.81% of intentionally wrong answers, so any LOCOMO score gap below ~6 pp sits inside the judge's noise floor. LongMemEval-S is partly a context-window test, since 115K tokens fits in every modern reader.
+
+ The audit post documents the Mem0-vs-Zep gaming case study, lists which vendors disclose which methodology dimensions (judge model, dataset version, per-case results, single-CLI reproduction), and explains the agentos-bench transparency stack: bootstrap 95% CIs at 10k Mulberry32 resamples (seed 42), per-benchmark judge-FPR probes (LongMemEval-S 1% [0%, 3%], LongMemEval-M 2% [0%, 5%], LOCOMO 0% [0%, 0%]), per-case run JSONs, and single-CLI reproduction.
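+ The bootstrap recipe in that stack is small enough to sketch in full (an illustrative reimplementation from the stated parameters, not the harness's code):
+
+ ```ts
+ // Mulberry32: the deterministic 32-bit PRNG named above, seeded at 42.
+ function mulberry32(seed: number) {
+   return () => {
+     seed = (seed + 0x6d2b79f5) | 0;
+     let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
+     t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
+     return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
+   };
+ }
+
+ // 95% bootstrap CI on accuracy over per-case 0/1 scores, 10k resamples.
+ function bootstrapCI(correct: number[], resamples = 10_000, seed = 42) {
+   const rand = mulberry32(seed);
+   const means: number[] = [];
+   for (let r = 0; r < resamples; r++) {
+     let sum = 0;
+     for (let i = 0; i < correct.length; i++) {
+       sum += correct[Math.floor(rand() * correct.length)];
+     }
+     means.push(sum / correct.length);
+   }
+   means.sort((a, b) => a - b);
+   return [means[Math.floor(resamples * 0.025)], means[Math.floor(resamples * 0.975)]];
+ }
+ ```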
 
 ---
 
@@ -224,22 +279,24 @@ Or pass `apiKey` inline on any call. Auto-detection order: OpenAI → Anthropic
 
 ## Ecosystem
 
- | Package | Description |
+ | Package | Role |
 |---|---|
- | [`@framers/agentos`](https://www.npmjs.com/package/@framers/agentos) | Core runtime |
- | [`@framers/agentos-extensions`](https://www.npmjs.com/package/@framers/agentos-extensions) | 100+ extensions and templates |
- | [`@framers/agentos-skills`](https://www.npmjs.com/package/@framers/agentos-skills) | 88 curated SKILL.md definitions |
- | [`@framers/agentos-bench`](https://github.com/framersai/agentos-bench) | Open benchmark harness with 95% confidence intervals, judge-FPR probes, per-case run JSONs (MIT-licensed; agentos itself is Apache 2.0) |
- | [`@framers/sql-storage-adapter`](https://www.npmjs.com/package/@framers/sql-storage-adapter) | SQL persistence (SQLite, Postgres, IndexedDB) |
- | [paracosm](https://www.npmjs.com/package/paracosm) | AI agent swarm simulation engine |
+ | [`@framers/agentos`](https://www.npmjs.com/package/@framers/agentos) | Core runtime: GMI agents, cognitive memory, multi-agent orchestration, guardrails, voice, 21 LLM providers. Apache 2.0. |
+ | [`@framers/agentos-extensions`](https://www.npmjs.com/package/@framers/agentos-extensions) | 100+ first-party extensions and templates: channel adapters, tool packs, integrations, guardrail packs. |
+ | [`@framers/agentos-extensions-registry`](https://www.npmjs.com/package/@framers/agentos-extensions-registry) | Discovery and auto-loader layer for the extensions catalog. Splitting the registry from the content lets a host pull the curated index without packaging the entire extensions tree as a dependency. |
+ | [`@framers/agentos-skills`](https://www.npmjs.com/package/@framers/agentos-skills) | 88 curated SKILL.md skills covering common tasks. |
+ | [`@framers/agentos-skills-registry`](https://www.npmjs.com/package/@framers/agentos-skills-registry) | Discovery and auto-loader layer for the skills catalog, with the same registry/content split as the extensions registry. Registries also serve community-contributed packs once they're vetted. |
+ | [`@framers/agentos-bench`](https://github.com/framersai/agentos-bench) | Open benchmark harness. Bootstrap 95% CIs at 10k resamples, judge false-positive-rate probes per benchmark, per-case run JSONs at fixed seed. MIT-licensed (the rest of AgentOS is Apache 2.0). |
+ | [`@framers/sql-storage-adapter`](https://www.npmjs.com/package/@framers/sql-storage-adapter) | Cross-platform SQL persistence: SQLite (better-sqlite3 + sql.js for browsers), Postgres, IndexedDB, Capacitor SQLite. |
+ | [`paracosm`](https://www.npmjs.com/package/paracosm) | AI agent swarm simulation engine that uses AgentOS as its substrate. |
 
- **Extensions auto-pickup at startup.** The runtime walks the curated registry plus user-supplied extension paths, resolves each pack's `createExtensionPack(context)` factory, and registers tools, guardrails, and channels without manual wiring. The same model applies to skills: SKILL.md files are auto-discovered from the curated registry and any local `skills/` directory, with capability gating and HITL approval for side-effecting installs. See [extensions architecture](https://docs.agentos.sh/architecture/extension-loading) for the full loading model.
+ **Extensions and skills auto-load at startup.** The runtime walks each registry plus any user-supplied paths, resolves each pack's `createExtensionPack(context)` factory or SKILL.md frontmatter, and registers tools, guardrails, channels, and skills without manual wiring. Capability gating and HITL approval gates apply to side-effecting installs. See [extensions architecture](https://docs.agentos.sh/architecture/extension-loading) for the full loading model.
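+ A minimal pack shape under that model (`createExtensionPack(context)` is quoted from above; the returned fields are this sketch's assumptions about what a pack registers):
+
+ ```ts
+ // Hypothetical extension pack module. The runtime resolves this factory
+ // at startup and registers the tools, guardrails, and channels it returns.
+ export function createExtensionPack(context: { log: (msg: string) => void }) {
+   return {
+     name: 'weather-pack', // illustrative pack name
+     tools: [
+       {
+         name: 'get_forecast',
+         description: 'Fetch a 3-day forecast for a city.',
+         run: async (args: { city: string }) => {
+           context.log(`forecast requested for ${args.city}`);
+           return { city: args.city, forecast: 'sunny' }; // stub result
+         },
+       },
+     ],
+     guardrails: [],
+     channels: [],
+   };
+ }
+ ```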
 
 ---
 
 ## Documentation & Community
 
- - **[Benchmarks](https://github.com/framersai/agentos-bench/blob/master/results/LEADERBOARD.md)**: matched-reader benchmark tables, 95% confidence intervals, methodology audit
+ - **[Benchmarks](https://github.com/framersai/agentos-bench/blob/master/results/LEADERBOARD.md)**: benchmark tables, 95% confidence intervals, methodology audit
 - **[Architecture](https://docs.agentos.sh/architecture/system-architecture)**: system design, layer breakdown
 - **[Cognitive Memory](https://docs.agentos.sh/features/cognitive-memory)**: 8 mechanisms with 30+ APA citations
 - **[RAG Configuration](https://docs.agentos.sh/features/rag-memory-configuration)**: vector stores, embeddings, sources
package/package.json CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "@framers/agentos",
-  "version": "0.5.12",
+  "version": "0.5.13",
   "description": "Modular AgentOS orchestration library",
   "type": "module",
   "main": "dist/index.js",