@framers/agentos 0.5.12 → 0.5.13
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +92 -35
- package/package.json +1 -1
package/README.md
CHANGED
@@ -8,7 +8,7 @@
 
 # **AgentOS** — Open-Source TypeScript AI Agent Runtime with Cognitive Memory, HEXACO Personality, and Runtime Tool Forging
 
-**85.6% on LongMemEval-S** at $0.0090/correct, +1.4 above Mastra OM gpt-4o (84.23%)
+**85.6% on LongMemEval-S** at $0.0090/correct, +1.4 above Mastra OM gpt-4o (84.23%) · **70.2% on LongMemEval-M** (1.5M-token variant), the only open-source library on the public record above 65% on M with publicly reproducible methodology · 21 LLM providers · 8 neuroscience-backed memory mechanisms · Apache-2.0
 
 [](https://www.npmjs.com/package/@framers/agentos)
 [](https://github.com/framersai/agentos/actions/workflows/ci.yml)
@@ -24,6 +24,12 @@
 
 ---
 
+AgentOS is an open-source TypeScript runtime for AI agents that adapt, remember, and collaborate. The runtime carries the parts of an agent that should outlive a single chat completion: persistent [cognitive memory](https://docs.agentos.sh/features/cognitive-memory) grounded in published cognitive-science literature, optional [HEXACO personality](https://docs.agentos.sh/features/cognitive-memory-guide) modeling, runtime tool forging in a hardened sandbox, [six multi-agent orchestration strategies](https://docs.agentos.sh/features/multi-agent-collaboration), [streaming guardrails](https://docs.agentos.sh/features/guardrails-architecture), a [voice pipeline](https://docs.agentos.sh/features/voice-pipeline), and one dispatch interface across 21 LLM providers. Apache-2.0.
+
+On benchmarks: **85.6% on LongMemEval-S** at $0.0090 per correct answer (gpt-4o reader, +1.4 points above Mastra's published 84.23%); **70.2% on LongMemEval-M** (1.5M-token haystacks, 500 sessions per question), the only open-source library on the public record above 65% on M with publicly reproducible methodology. Per-case run JSONs and single-CLI reproduction ship in [agentos-bench](https://github.com/framersai/agentos-bench).
+
+---
+
 ## Install
 
 ```bash
@@ -49,43 +55,92 @@ await session.send('Can you expand on that?'); // remembers context
 
 ---
 
-##
+## Emergent Design
 
-
+> "So we and our elaborately evolving computers may meet each other halfway."
+>
+> — Philip K. Dick, *The Android and the Human*, 1972
 
-
+Three things accumulate inside an AgentOS session:
+
+1. **Memory.** What was said, what was decided, what was retrieved.
+2. **Tool surface.** Starts with whatever was registered; it can grow mid-decision when an agent forges a new function and the judge approves it.
+3. **Personality** (optional). A HEXACO trait vector that biases retrieval, specialist routing, and decision-making.
+
+Behavior in turn six is a function of all three carried forward from turns one through five: which memories got reinforced, which forged tools entered the catalog, which trait values weighted which evidence. Each of those is configurable and observable. None of the three crosses into "emergent agent" on its own; the composition is the interesting part.
 
-
-|---|---:|---:|---:|---|
-| EmergenceMem Internal | 86.0% | not published | 5,650 ms | [emergence.ai](https://www.emergence.ai/blog/sota-on-longmemeval-with-rag) |
-| **🚀 AgentOS canonical-hybrid + reader-router** | **85.6%** | **$0.0090** | **3,558 ms** | [post](https://docs.agentos.sh/blog/2026/04/28/reader-router-pareto-win) |
-| Mastra OM gpt-4o (gemini-flash observer) | 84.23% | not published | not published | [mastra.ai](https://mastra.ai/research/observational-memory) |
-| Supermemory gpt-4o | 81.6% | not published | not published | [supermemory.ai](https://supermemory.ai/research/) |
-| EmergenceMem Simple Fast (rerun in agentos-bench) | 80.6% | $0.0586 | 3,703 ms | [adapter](https://github.com/framersai/agentos-bench/blob/master/vendors/emergence-simple-fast/) |
-| Zep self / independent reproduction | 71.2% / 63.8% | not published | not published | [self](https://blog.getzep.com/state-of-the-art-agent-memory/) / [arXiv](https://arxiv.org/abs/2512.13564) |
+### Runtime Tool Forging
 
-
+When an agent encounters a sub-task that no available tool covers, it generates a TypeScript function with a Zod-described input and output schema. A separate LLM call evaluates the forged function against the agent's stated intent and either approves or rejects it. Approved functions execute in a hardened `node:vm` sandbox (default 5-second wall clock; the 128 MB nominal memory budget is reported as a heap-delta heuristic, not preemptively enforced — preemptive memory limits via [`isolated-vm`](https://github.com/laverdet/isolated-vm) are queued for the hosted multi-tenant tier). The sandbox always bans `eval`, `Function`, `require`, dynamic `import`, `process`, `child_process`, and destructive `fs.*`. `fetch`, `fs.readFile`, and `crypto` are allowlist-only opt-ins; the default allowlist is empty, so the default execution environment has no network, no filesystem, and no crypto unless the host explicitly grants them per-tool. Approved tools are added to a discoverable index keyed by name and signature, and subsequent turns invoke them via `call_forged_tool(name, args)`. A first-time forge costs full LLM tokens; reuse costs tens of tokens. Total cost per turn flattens once a session has accumulated a handful of approved tools.
 
-
+The path the runtime supports: an agent forges a tool mid-decision, the judge approves it, that turn invokes it, and a few turns later a different specialist agent in the same session invokes the same tool because the index made it findable. Neither side is scripted; the runtime makes the tool discoverable and the agents find it on their own.
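
The sandbox contract above can be reduced to a sketch built on Node's built-in `node:vm` module. This is illustrative only, not the AgentOS executor: `runForgedTool` is a hypothetical name, and the real runtime adds the judge call, the allowlist plumbing, and heap-delta accounting on top of this shape.

```typescript
import vm from 'node:vm';

// Hypothetical sketch: execute a "forged" function source string in a bare
// context with a wall-clock budget. The context exposes only what we put in
// it — no require, process, fetch, or fs unless the host grants them.
function runForgedTool(source: string, args: unknown): unknown {
  const context = vm.createContext({ args, result: undefined as unknown });
  // 5-second timeout, mirroring the default budget described above.
  vm.runInContext(`result = (${source})(args);`, context, { timeout: 5000 });
  return context.result;
}

// A forged pure function runs fine; anything reaching for ambient
// capabilities (process, require, fetch) hits a ReferenceError instead.
const doubled = runForgedTool('(xs) => xs.map((x) => x * 2)', [1, 2, 3]);
```

Note that `node:vm` alone is not a security boundary against hostile code; the paragraph above is explicit that preemptive memory isolation is deferred to `isolated-vm`.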
 
-
+### HEXACO Personality (optional)
+
+Personality is opt-in. The runtime behaves identically with or without a trait vector, and most production deployments do not pass one.
+
+```ts
+// Personality-neutral (most production agents)
+const support = agent({
+  provider: 'openai',
+  instructions: 'Resolve customer tickets.',
+  memory: { types: ['episodic', 'semantic'] },
+});
+
+// Opt-in HEXACO (when persona consistency across sessions matters)
+const coach = agent({
+  provider: 'openai',
+  instructions: "Long-running career coach. Hold the user accountable to their stated goals across weekly check-ins; flag drift, push back on excuses, escalate when goals shift.",
+  personality: {
+    conscientiousness: 0.9, // won't let goals drift between sessions
+    honestyHumility: 0.85,  // won't tell the user what they want to hear
+    emotionality: 0.3,      // stays steady when the user is reactive
+  },
+  memory: { types: ['episodic', 'semantic'] },
+});
+```
+
+When a vector is supplied, the kernel weights retrieval, specialist routing, and tool selection by the trait values. Same agent, same prompt, same tools: a high-Openness leader and a high-Conscientiousness leader produce measurably different decision sequences. The bias lives in the kernel, not in the prompt; prompt-only personality dissolves under context pressure, while kernel-encoded bias persists. The vector remains editable, inspectable, and removable with user consent.
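
As a toy illustration of what "kernel-side bias" could mean mechanically — the scoring function and field names below are assumptions for the sketch, not the AgentOS kernel API — each retrieval candidate can carry trait affinities, and the agent's HEXACO vector nudges ranking without overriding similarity:

```typescript
// Hypothetical trait-biased retrieval scoring. Similarity dominates; the
// trait vector only nudges ordering via a small weighted dot product.
type Traits = Partial<Record<
  'openness' | 'conscientiousness' | 'honestyHumility' | 'emotionality',
  number
>>;

interface Candidate { id: string; similarity: number; affinity: Traits }

function biasedScore(c: Candidate, traits: Traits, weight = 0.2): number {
  let bias = 0;
  for (const [k, v] of Object.entries(traits)) {
    bias += (v ?? 0) * (c.affinity[k as keyof Traits] ?? 0);
  }
  return c.similarity + weight * bias;
}

// A high-Conscientiousness agent surfaces the goal-tracking memory even
// though the small-talk memory is slightly more similar to the query.
const traits: Traits = { conscientiousness: 0.9, emotionality: 0.3 };
const ranked = [
  { id: 'goal-check', similarity: 0.70, affinity: { conscientiousness: 1 } },
  { id: 'small-talk', similarity: 0.72, affinity: { emotionality: 1 } },
].sort((a, b) => biasedScore(b, traits) - biasedScore(a, traits));
```

The point of pushing this below the prompt layer is exactly the persistence claim above: a scoring bias applies on every retrieval regardless of how crowded the context window gets.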
+
+---
+
+## Memory Benchmarks
+
+`gpt-4o` reader, `gpt-4o-2024-08-06` judge, full N=500 across every row. Cross-provider numbers are excluded from the tables because their public methodology disclosures don't admit reproduction.
+
+### LongMemEval-S (115K tokens, 50 sessions)
+
+| System | Accuracy | $/correct | p50 latency |
+|---|---:|---:|---:|
+| EmergenceMem Internal | 86.0% | not published | 5,650 ms |
+| **AgentOS** (canonical-hybrid + reader-router) | **85.6%** | **$0.0090** | **3,558 ms** |
+| Mastra OM gpt-4o (gemini-flash observer) | 84.23% | not published | not published |
+| Supermemory gpt-4o | 81.6% | not published | not published |
+| EmergenceMem Simple Fast (rerun in agentos-bench) | 80.6% | $0.0586 | 3,703 ms |
+| Zep (self / independent reproduction) | 71.2% / 63.8% | not published | not published |
+
++1.4 points above Mastra OM. EmergenceMem Internal posts 86.0% (0.4 above) but doesn't publish per-case results or a reproducible CLI; among open-source libraries with single-CLI reproduction at `gpt-4o`, 85.6% is the highest publicly reproducible score we found. p50 latency is 3,558 ms vs EmergenceMem's published median of 5,650 ms.
+
+Cross-provider numbers omitted from the table (different reader and/or undisclosed judge): Mastra OM 94.87% (gpt-5-mini + gemini-2.5-flash observer), agentmemory 96.2% (Claude Opus 4.6), MemMachine 93.0% (GPT-5-mini), Hindsight 91.4% (unspecified backbone).
 
 ### LongMemEval-M (1.5M tokens, 500 sessions)
 
-
+M's haystacks exceed every production context window; most vendors only publish on S.
+
+| System | Accuracy | License |
+|---|---:|---|
+| LongMemEval paper, GPT-4o round Top-10 (paper's best) | 72.0% | open repo |
+| AgentBrain | 71.7% | closed-source SaaS |
+| LongMemEval paper, GPT-4o session Top-5 | 71.4% | open repo |
+| **AgentOS** (sem-embed + reader-router + Top-5) | **70.2%** | **Apache-2.0** |
+| LongMemEval paper, GPT-4o round Top-5 | 65.7% | open repo |
+| Mem0 v3, Mastra, Hindsight, Zep, EmergenceMem, Supermemory, Letta | not published | — |
 
-
-|---|---:|---|---|
-| LongMemEval paper, strongest GPT-4o (round, Top-10) | 72.0% | open repo | [Wu et al., ICLR 2025, Table 3](https://arxiv.org/abs/2410.10813) |
-| AgentBrain | 71.7% | closed-source SaaS | [github.com/AgentBrainHQ](https://github.com/AgentBrainHQ) |
-| LongMemEval paper, strongest GPT-4o at Top-5 (session) | 71.4% | open repo | [Wu et al., ICLR 2025, Table 3](https://arxiv.org/abs/2410.10813) |
-| **🚀 AgentOS** (sem-embed + reader-router + top-K=5) | **70.2%** | **Apache-2.0** | [post](https://docs.agentos.sh/blog/2026/04/29/longmemeval-m-70-with-topk5) |
-| LongMemEval paper, GPT-4o at Top-5 (round) | 65.7% | open repo | [Wu et al., ICLR 2025, Table 3](https://arxiv.org/abs/2410.10813) |
-| Mem0 v3, Mastra, Hindsight, Zep, EmergenceMem, Supermemory, Letta, others | not published | various | reports S only |
+At matched Top-5 retrieval, +4.5 above the round-level paper baseline (65.7%) and 1.2 below the session-level one (71.4%); the paper's overall strongest GPT-4o result is 72.0% at Top-10. Of open-source libraries with publicly reproducible runs, AgentOS is the only one above 65% on M.
 
-**
+> **[Full leaderboard →](https://github.com/framersai/agentos-bench/blob/master/results/LEADERBOARD.md)** · **[Run JSONs →](https://github.com/framersai/agentos-bench/tree/master/results/runs)** · **[Transparency audit →](https://agentos.sh/en/blog/memory-benchmark-transparency-audit/)** · **[LongMemEval paper](https://arxiv.org/abs/2410.10813)** (Wu et al., ICLR 2025, Table 3)
 
-
+The transparency audit covers what the headline numbers above don't. LOCOMO's answer key has a [6.4% ground-truth error rate per Penfield Labs](https://dev.to/penfieldlabs/we-audited-locomo-64-of-the-answer-key-is-wrong-and-the-judge-accepts-up-to-63-of-intentionally-33lg), capping any system's possible score at ~93.6%, and LOCOMO's default LLM judge accepts 62.81% of intentionally wrong answers — so any LOCOMO score gap below ~6 pp is inside the judge's noise floor. LongMemEval-S is partly a context-window test because 115K tokens fits in every modern reader. The audit post documents the Mem0-vs-Zep gaming case study, lists which vendors disclose which methodology dimensions (judge model, dataset version, per-case results, single-CLI reproduction), and explains the agentos-bench transparency stack: bootstrap 95% CIs at 10k Mulberry32 resamples (seed 42), per-benchmark judge-FPR probes (LongMemEval-S 1% [0%, 3%], LongMemEval-M 2% [0%, 5%], LOCOMO 0% [0%, 0%]), per-case run JSONs, and single-CLI reproduction.
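
The bootstrap procedure named above is small enough to sketch end to end. The parameters mirror the stated setup (10k resamples, Mulberry32, seed 42), but this is a from-scratch illustration, not the agentos-bench code:

```typescript
// Seeded Mulberry32 PRNG — a standard 32-bit generator, so the resampling
// (and therefore the reported CI) is reproducible across machines.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Percentile bootstrap over a per-case 0/1 correctness vector: resample with
// replacement, recompute accuracy, take the 2.5th/97.5th percentiles.
function bootstrapCI(cases: number[], resamples = 10_000, seed = 42): [number, number] {
  const rand = mulberry32(seed);
  const accs: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let correct = 0;
    for (let i = 0; i < cases.length; i++) {
      correct += cases[Math.floor(rand() * cases.length)];
    }
    accs.push(correct / cases.length);
  }
  accs.sort((a, b) => a - b);
  return [accs[Math.floor(0.025 * resamples)], accs[Math.floor(0.975 * resamples)]];
}

// Example: 500 cases at 85.6% accuracy (428 correct) — the LongMemEval-S shape.
const cases = Array.from({ length: 500 }, (_, i) => (i < 428 ? 1 : 0));
const [lo, hi] = bootstrapCI(cases);
```

At N=500 the interval spans roughly ±3 points, which is why the audit treats small headline gaps as noise.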
 
 ---
 
@@ -224,22 +279,24 @@ Or pass `apiKey` inline on any call. Auto-detection order: OpenAI → Anthropic
 
 ## Ecosystem
 
-| Package |
+| Package | Role |
 |---|---|
-| [`@framers/agentos`](https://www.npmjs.com/package/@framers/agentos) | Core runtime |
-| [`@framers/agentos-extensions`](https://www.npmjs.com/package/@framers/agentos-extensions) | 100+ extensions and templates |
-| [`@framers/agentos-
-| [`@framers/agentos-
-| [`@framers/
-| [
+| [`@framers/agentos`](https://www.npmjs.com/package/@framers/agentos) | Core runtime: GMI agents, cognitive memory, multi-agent orchestration, guardrails, voice, 21 LLM providers. Apache 2.0. |
+| [`@framers/agentos-extensions`](https://www.npmjs.com/package/@framers/agentos-extensions) | 100+ first-party extensions and templates: channel adapters, tool packs, integrations, guardrail packs. |
+| [`@framers/agentos-extensions-registry`](https://www.npmjs.com/package/@framers/agentos-extensions-registry) | Discovery + auto-loader layer for the extensions catalog. Consumers wire this in to make curated extension packs available without packaging the entire extensions tree as a dependency; separating the registry layer from the content layer lets a host pull the index without pulling the implementations. |
+| [`@framers/agentos-skills`](https://www.npmjs.com/package/@framers/agentos-skills) | 88 curated SKILL.md skills covering common tasks. |
+| [`@framers/agentos-skills-registry`](https://www.npmjs.com/package/@framers/agentos-skills-registry) | Discovery + auto-loader layer for the skills catalog. Same split as the extensions registry: a host imports this when it wants the curated skill index without bundling the full skills tree. Registries also serve community-contributed packs once they're vetted. |
+| [`@framers/agentos-bench`](https://github.com/framersai/agentos-bench) | Open benchmark harness. Bootstrap 95% CIs at 10k resamples, judge false-positive-rate probes per benchmark, per-case run JSONs at fixed seed. MIT-licensed (the rest of AgentOS is Apache 2.0). |
+| [`@framers/sql-storage-adapter`](https://www.npmjs.com/package/@framers/sql-storage-adapter) | Cross-platform SQL persistence: SQLite (better-sqlite3 + sql.js for browsers), Postgres, IndexedDB, Capacitor SQLite. |
+| [`paracosm`](https://www.npmjs.com/package/paracosm) | AI agent swarm simulation engine that uses AgentOS as its substrate. |
 
-**Extensions auto-
+**Extensions and skills auto-load at startup.** The runtime walks each registry plus any user-supplied paths, resolves each pack's `createExtensionPack(context)` factory or SKILL.md frontmatter, and registers tools, guardrails, channels, and skills without manual wiring. Capability gating and HITL approval gates apply to side-effecting installs. See [extensions architecture](https://docs.agentos.sh/architecture/extension-loading) for the full loading model.
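
The `createExtensionPack(context)` factory convention can be illustrated with a minimal pack. The field names below are assumptions for the sketch, not the documented AgentOS pack interface — see the extensions architecture docs for the real contract:

```typescript
// Hypothetical shapes standing in for the real pack/context types.
interface PackContext { log: (msg: string) => void }
interface ExtensionPack {
  name: string;
  tools: Array<{ name: string; run: (input: string) => string }>;
}

// A pack module exposes a factory; the loader calls it with a host context
// and registers whatever the returned pack declares.
function createExtensionPack(context: PackContext): ExtensionPack {
  context.log('registering demo pack');
  return {
    name: 'demo-pack',
    tools: [{ name: 'shout', run: (input) => input.toUpperCase() }],
  };
}

// What a loader would do with it, minus capability gating and HITL gates.
const pack = createExtensionPack({ log: () => {} });
```

The factory shape is what makes the registry/content split above work: the index only needs the module path, and the implementation is materialized at load time.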
 
 ---
 
 ## Documentation & Community
 
-- **[Benchmarks](https://github.com/framersai/agentos-bench/blob/master/results/LEADERBOARD.md)**:
+- **[Benchmarks](https://github.com/framersai/agentos-bench/blob/master/results/LEADERBOARD.md)**: benchmark tables, 95% confidence intervals, methodology audit
 - **[Architecture](https://docs.agentos.sh/architecture/system-architecture)**: system design, layer breakdown
 - **[Cognitive Memory](https://docs.agentos.sh/features/cognitive-memory)**: 8 mechanisms with 30+ APA citations
 - **[RAG Configuration](https://docs.agentos.sh/features/rag-memory-configuration)**: vector stores, embeddings, sources
|