@pentatonic-ai/ai-agent-sdk 0.7.0 → 0.7.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +22 -14
- package/bin/commands/login.js +10 -3
- package/package.json +1 -1
- package/packages/memory-engine/engine/services/l4/server.py +36 -6
- package/packages/memory-engine/engine/services/l5/l5-comms-layer.py +34 -16
- package/packages/memory-engine/engine/services/l6/l6-document-store.py +29 -10
- package/packages/memory-engine/docs/MIGRATION.md +0 -178
- package/packages/memory-engine/docs/RUNBOOK-AWS.md +0 -375
- package/packages/memory-engine/docs/why-v05-underperforms.md +0 -138
package/README.md
CHANGED
|
@@ -161,22 +161,26 @@ If `/search` returns the row from `/store`, the engine is live.
|
|
|
161
161
|
|
|
162
162
|
**Connect Claude Code**
|
|
163
163
|
|
|
164
|
-
The `tes-memory` plugin's hooks already speak the engine's wire format.
|
|
164
|
+
The `tes-memory` plugin's hooks already speak the engine's wire format. Three steps:
|
|
165
165
|
|
|
166
166
|
1. Install the plugin (once):
|
|
167
167
|
```
|
|
168
168
|
/plugin marketplace add Pentatonic-Ltd/ai-agent-sdk
|
|
169
169
|
/plugin install tes-memory@pentatonic-ai
|
|
170
170
|
```
|
|
171
|
-
2. Point it at your local engine
|
|
172
|
-
```
|
|
173
|
-
|
|
174
|
-
mode: local
|
|
175
|
-
memory_url: http://localhost:8099
|
|
176
|
-
---
|
|
171
|
+
2. Point it at your local engine — one command writes the plugin config:
|
|
172
|
+
```bash
|
|
173
|
+
npx @pentatonic-ai/ai-agent-sdk config local
|
|
177
174
|
```
|
|
175
|
+
This writes `~/.claude-pentatonic/tes-memory.local.md` with `mode: local` and `memory_url: http://localhost:8099`. If you want a different URL, pass `--engine-url <url>`. To switch back to hosted later, run `tes config hosted` (delegates to `login`).
|
|
178
176
|
3. Reload: `/reload-plugins` (or restart Claude Code if status reports stale state — MCP server processes need a full restart to pick up plugin updates).
|
|
179
177
|
|
|
178
|
+
Inspect what's currently configured at any time:
|
|
179
|
+
|
|
180
|
+
```bash
|
|
181
|
+
npx @pentatonic-ai/ai-agent-sdk config show
|
|
182
|
+
```
|
|
183
|
+
|
|
180
184
|
Verify:
|
|
181
185
|
|
|
182
186
|
```
|
|
@@ -307,22 +311,26 @@ Works with both local and hosted memory. Install once, switch modes via config.
|
|
|
307
311
|
/plugin install tes-memory@pentatonic-ai
|
|
308
312
|
```
|
|
309
313
|
|
|
310
|
-
**Local engine** — bring up the engine first ([Memory > Local](#local-self-hosted)), then
|
|
314
|
+
**Local engine** — bring up the engine first ([Memory > Local](#local-self-hosted)), then write the plugin config:
|
|
311
315
|
|
|
312
|
-
```
|
|
313
|
-
|
|
314
|
-
mode: local
|
|
315
|
-
memory_url: http://localhost:8099
|
|
316
|
-
---
|
|
316
|
+
```bash
|
|
317
|
+
npx @pentatonic-ai/ai-agent-sdk config local
|
|
317
318
|
```
|
|
318
319
|
|
|
319
320
|
**Hosted TES** — run `login` once, the plugin auto-discovers `~/.config/tes/credentials.json`:
|
|
320
321
|
|
|
321
322
|
```bash
|
|
322
323
|
npx @pentatonic-ai/ai-agent-sdk login
|
|
324
|
+
# equivalent: npx @pentatonic-ai/ai-agent-sdk config hosted
|
|
325
|
+
```
|
|
326
|
+
|
|
327
|
+
Either way, verify with `/tes-memory:tes-status` in Claude Code, or from the shell:
|
|
328
|
+
|
|
329
|
+
```bash
|
|
330
|
+
npx @pentatonic-ai/ai-agent-sdk config show
|
|
323
331
|
```
|
|
324
332
|
|
|
325
|
-
|
|
333
|
+
The plugin's MCP server, hooks, and tools all read the same config — switching modes is a single CLI call away.
|
|
326
334
|
|
|
327
335
|
**What it tracks (auto, every turn):**
|
|
328
336
|
- Memory search at prompt time — relevant memories injected as context
|
package/bin/commands/login.js
CHANGED
|
@@ -179,9 +179,16 @@ export async function runLoginCommand(opts = {}) {
|
|
|
179
179
|
log(` ✓ Connected as ${claims.email || "user"} on tenant \`${clientId}\``);
|
|
180
180
|
log(` ✓ Credentials written to ~/.config/tes/credentials.json`);
|
|
181
181
|
log("");
|
|
182
|
-
log("
|
|
183
|
-
log("
|
|
184
|
-
log("
|
|
182
|
+
log(" Install the Pentatonic TES plugin to start capturing context:");
|
|
183
|
+
log("");
|
|
184
|
+
log(" Claude Code:");
|
|
185
|
+
log(" /plugin marketplace add Pentatonic-Ltd/ai-agent-sdk");
|
|
186
|
+
log(" /plugin install tes-memory@pentatonic-ai");
|
|
187
|
+
log("");
|
|
188
|
+
log(" OpenClaw:");
|
|
189
|
+
log(" openclaw plugins install @pentatonic-ai/openclaw-memory-plugin");
|
|
190
|
+
log("");
|
|
191
|
+
log(" Already installed the plugin? Reload now to refresh the credentials.");
|
|
185
192
|
log("");
|
|
186
193
|
|
|
187
194
|
return { exitCode: 0, clientId };
|
package/package.json
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
{
|
|
2
2
|
"name": "@pentatonic-ai/ai-agent-sdk",
|
|
3
|
-
"version": "0.7.
|
|
3
|
+
"version": "0.7.1",
|
|
4
4
|
"description": "TES SDK — LLM observability and lifecycle tracking via Pentatonic Thing Event System. Track token usage, tool calls, and conversations. Manage things through event-sourced lifecycle stages with AI enrichment and vector search.",
|
|
5
5
|
"type": "module",
|
|
6
6
|
"main": "./dist/index.cjs",
|
|
package/packages/memory-engine/engine/services/l4/server.py
CHANGED
|
@@ -46,8 +46,6 @@ EMBED_MODEL_NAME = os.environ.get("L4_EMBED_MODEL", "nv-embed-v2")
|
|
|
46
46
|
EMBED_API_KEY = os.environ.get("L4_EMBED_API_KEY", "")
|
|
47
47
|
EMBED_DIM = int(os.environ.get("L4_EMBED_DIM", "4096"))
|
|
48
48
|
|
|
49
|
-
def _embed_headers() -> dict:
|
|
50
|
-
return {"Authorization": f"Bearer {EMBED_API_KEY}"} if EMBED_API_KEY else {}
|
|
51
49
|
|
|
52
50
|
|
|
53
51
|
# ----------------------------------------------------------------------
|
|
@@ -109,16 +107,48 @@ def _client() -> httpx.AsyncClient:
|
|
|
109
107
|
|
|
110
108
|
|
|
111
109
|
async def _embed_batch(texts: list[str]) -> list[list[float]]:
|
|
110
|
+
"""Embed a batch of texts.
|
|
111
|
+
|
|
112
|
+
Tries OpenAI-compatible shape first (POST <url>, Bearer auth,
|
|
113
|
+
response data[i].embedding). On failure, falls back to the
|
|
114
|
+
Pentatonic-AI gateway's native shape (POST .../v1/embed, X-API-Key
|
|
115
|
+
auth, response embeddings[i]). When the gateway eventually adds an
|
|
116
|
+
OpenAI-compat /v1/embeddings alias, the primary path will succeed
|
|
117
|
+
and the fallback will never fire — no code change needed.
|
|
118
|
+
"""
|
|
112
119
|
if not texts:
|
|
113
120
|
return []
|
|
121
|
+
payload = {"input": texts, "model": EMBED_MODEL_NAME}
|
|
122
|
+
# Primary: OpenAI-compat
|
|
123
|
+
try:
|
|
124
|
+
resp = await _client().post(
|
|
125
|
+
NV_EMBED_URL,
|
|
126
|
+
headers=_openai_headers(),
|
|
127
|
+
json=payload,
|
|
128
|
+
timeout=120.0,
|
|
129
|
+
)
|
|
130
|
+
resp.raise_for_status()
|
|
131
|
+
return [d["embedding"] for d in resp.json()["data"]]
|
|
132
|
+
except Exception:
|
|
133
|
+
pass
|
|
134
|
+
# Fallback: lambda-gateway native shape
|
|
135
|
+
fallback_url = NV_EMBED_URL.replace("/v1/embeddings", "/v1/embed").replace("/embeddings", "/embed")
|
|
114
136
|
resp = await _client().post(
|
|
115
|
-
|
|
116
|
-
headers=
|
|
117
|
-
json=
|
|
137
|
+
fallback_url,
|
|
138
|
+
headers=_lambda_headers(),
|
|
139
|
+
json=payload,
|
|
118
140
|
timeout=120.0,
|
|
119
141
|
)
|
|
120
142
|
resp.raise_for_status()
|
|
121
|
-
return
|
|
143
|
+
return resp.json()["embeddings"]
|
|
144
|
+
|
|
145
|
+
|
|
146
|
+
def _openai_headers() -> dict:
|
|
147
|
+
return {"Authorization": f"Bearer {EMBED_API_KEY}"} if EMBED_API_KEY else {}
|
|
148
|
+
|
|
149
|
+
|
|
150
|
+
def _lambda_headers() -> dict:
|
|
151
|
+
return {"X-API-Key": EMBED_API_KEY} if EMBED_API_KEY else {}
|
|
122
152
|
|
|
123
153
|
|
|
124
154
|
# ----------------------------------------------------------------------
|
|
package/packages/memory-engine/engine/services/l5/l5-comms-layer.py
CHANGED
|
@@ -51,8 +51,36 @@ EMBED_MODEL_NAME = os.environ.get("L5_EMBED_MODEL", "nv-embed-v2")
|
|
|
51
51
|
# Optional Authorization: Bearer <key> for the primary embedding endpoint.
|
|
52
52
|
EMBED_API_KEY = os.environ.get("L5_EMBED_API_KEY", "")
|
|
53
53
|
|
|
54
|
-
def
|
|
55
|
-
|
|
54
|
+
def _embed_post(texts):
|
|
55
|
+
"""POST to the configured embedding endpoint. Tries OpenAI-compat
|
|
56
|
+
shape first; falls back to Pentatonic-AI lambda-gateway native shape
|
|
57
|
+
on any failure. When the gateway adds an /v1/embeddings alias the
|
|
58
|
+
primary path will succeed and the fallback never fires.
|
|
59
|
+
|
|
60
|
+
Returns: list[list[float]] (one embedding per input text).
|
|
61
|
+
"""
|
|
62
|
+
payload = {"input": texts, "model": EMBED_MODEL_NAME}
|
|
63
|
+
try:
|
|
64
|
+
r = httpx.post(
|
|
65
|
+
NV_EMBED_URL,
|
|
66
|
+
headers={"Authorization": f"Bearer {EMBED_API_KEY}"} if EMBED_API_KEY else {},
|
|
67
|
+
json=payload,
|
|
68
|
+
timeout=120,
|
|
69
|
+
)
|
|
70
|
+
r.raise_for_status()
|
|
71
|
+
return [d["embedding"] for d in r.json()["data"]]
|
|
72
|
+
except Exception:
|
|
73
|
+
pass
|
|
74
|
+
fallback_url = NV_EMBED_URL.replace("/v1/embeddings", "/v1/embed").replace("/embeddings", "/embed")
|
|
75
|
+
r = httpx.post(
|
|
76
|
+
fallback_url,
|
|
77
|
+
headers={"X-API-Key": EMBED_API_KEY} if EMBED_API_KEY else {},
|
|
78
|
+
json=payload,
|
|
79
|
+
timeout=120,
|
|
80
|
+
)
|
|
81
|
+
r.raise_for_status()
|
|
82
|
+
return r.json()["embeddings"]
|
|
83
|
+
|
|
56
84
|
# Ollama fallback path. URL/model can be overridden so the L5 container can
|
|
57
85
|
# reach an Ollama instance running on the docker host (host.docker.internal)
|
|
58
86
|
# or on a co-located service. Mirrors the env-var pattern used by L2.
|
|
@@ -99,10 +127,7 @@ def _embed_nv_batch(texts: list[str]) -> list[list[float]] | None:
|
|
|
99
127
|
return []
|
|
100
128
|
try:
|
|
101
129
|
truncated = [t[:4000] for t in texts]
|
|
102
|
-
|
|
103
|
-
r.raise_for_status()
|
|
104
|
-
data = r.json()
|
|
105
|
-
embeddings = [item["embedding"] for item in data["data"]]
|
|
130
|
+
embeddings = _embed_post(truncated)
|
|
106
131
|
if all(len(e) == EMBED_DIM for e in embeddings):
|
|
107
132
|
return embeddings
|
|
108
133
|
except Exception:
|
|
@@ -113,10 +138,8 @@ def _embed_nv_batch(texts: list[str]) -> list[list[float]] | None:
|
|
|
113
138
|
def _embed_nv_single(text: str) -> list[float] | None:
|
|
114
139
|
"""Embed single text via NV-Embed-v2 (4096-dim)."""
|
|
115
140
|
try:
|
|
116
|
-
|
|
117
|
-
|
|
118
|
-
data = r.json()
|
|
119
|
-
emb = data["data"][0]["embedding"]
|
|
141
|
+
embs = _embed_post([text[:4000]])
|
|
142
|
+
emb = embs[0]
|
|
120
143
|
if len(emb) == EMBED_DIM:
|
|
121
144
|
return emb
|
|
122
145
|
except Exception:
|
|
@@ -573,12 +596,7 @@ def serve(port=8034):
|
|
|
573
596
|
texts = [(r.get("text") or "")[:8192] for r in records]
|
|
574
597
|
t0 = _time.time()
|
|
575
598
|
try:
|
|
576
|
-
|
|
577
|
-
NV_EMBED_URL, headers=_embed_headers(), json={"input": texts, "model": EMBED_MODEL_NAME},
|
|
578
|
-
timeout=120,
|
|
579
|
-
)
|
|
580
|
-
resp.raise_for_status()
|
|
581
|
-
embs = [d["embedding"] for d in resp.json()["data"]]
|
|
599
|
+
embs = _embed_post(texts)
|
|
582
600
|
except Exception as exc:
|
|
583
601
|
return {"status": "error", "error": f"embed failed: {exc}"}
|
|
584
602
|
embed_ms = (_time.time() - t0) * 1000.0
|
|
package/packages/memory-engine/engine/services/l6/l6-document-store.py
CHANGED
|
@@ -44,8 +44,33 @@ EMBED_DIM = int(os.environ.get("L6_EMBED_DIM", "4096"))
|
|
|
44
44
|
# Optional Authorization: Bearer <key> for the embedding endpoint.
|
|
45
45
|
EMBED_API_KEY = os.environ.get("L6_EMBED_API_KEY", "")
|
|
46
46
|
|
|
47
|
-
def
|
|
48
|
-
|
|
47
|
+
def _embed_post(texts):
|
|
48
|
+
"""POST to embedding endpoint. Tries OpenAI-compat shape first;
|
|
49
|
+
falls back to Pentatonic-AI lambda-gateway native shape on failure.
|
|
50
|
+
See L4 / L5 for the same pattern."""
|
|
51
|
+
import httpx as _httpx
|
|
52
|
+
payload = {"input": texts, "model": EMBED_MODEL}
|
|
53
|
+
try:
|
|
54
|
+
r = _httpx.post(
|
|
55
|
+
NV_EMBED_URL,
|
|
56
|
+
headers={"Authorization": f"Bearer {EMBED_API_KEY}"} if EMBED_API_KEY else {},
|
|
57
|
+
json=payload,
|
|
58
|
+
timeout=120,
|
|
59
|
+
)
|
|
60
|
+
r.raise_for_status()
|
|
61
|
+
return [d["embedding"] for d in r.json()["data"]]
|
|
62
|
+
except Exception:
|
|
63
|
+
pass
|
|
64
|
+
fallback_url = NV_EMBED_URL.replace("/v1/embeddings", "/v1/embed").replace("/embeddings", "/embed")
|
|
65
|
+
r = _httpx.post(
|
|
66
|
+
fallback_url,
|
|
67
|
+
headers={"X-API-Key": EMBED_API_KEY} if EMBED_API_KEY else {},
|
|
68
|
+
json=payload,
|
|
69
|
+
timeout=120,
|
|
70
|
+
)
|
|
71
|
+
r.raise_for_status()
|
|
72
|
+
return r.json()["embeddings"]
|
|
73
|
+
|
|
49
74
|
COLLECTION_NAME = "documents"
|
|
50
75
|
RRF_K = 60
|
|
51
76
|
DEFAULT_PORT = 8037
|
|
@@ -874,16 +899,10 @@ def serve(port: int = DEFAULT_PORT):
|
|
|
874
899
|
|
|
875
900
|
texts = [(r.get("text") or "")[:16000] for r in records]
|
|
876
901
|
|
|
877
|
-
# Single batched
|
|
902
|
+
# Single batched embed call (OpenAI-compat first, lambda-gateway fallback).
|
|
878
903
|
t0 = _time.time()
|
|
879
904
|
try:
|
|
880
|
-
|
|
881
|
-
NV_EMBED_URL, headers=_embed_headers(),
|
|
882
|
-
json={"input": texts, "model": EMBED_MODEL},
|
|
883
|
-
timeout=120,
|
|
884
|
-
)
|
|
885
|
-
resp.raise_for_status()
|
|
886
|
-
embs = [d["embedding"] for d in resp.json()["data"]]
|
|
905
|
+
embs = _embed_post(texts)
|
|
887
906
|
except Exception as exc:
|
|
888
907
|
raise HTTPException(status_code=500, detail=f"embed failed: {exc}")
|
|
889
908
|
embed_ms = (_time.time() - t0) * 1000.0
|
|
package/packages/memory-engine/docs/MIGRATION.md
DELETED
|
@@ -1,178 +0,0 @@
|
|
|
1
|
-
# Migration Guide
|
|
2
|
-
|
|
3
|
-
## From `pentatonic-memory` v0.5.x → `pentatonic-memory-engine`
|
|
4
|
-
|
|
5
|
-
### TL;DR
|
|
6
|
-
|
|
7
|
-
```diff
|
|
8
|
-
- export PENTATONIC_MEMORY_URL=http://your-pm-host:8099
|
|
9
|
-
+ export PENTATONIC_MEMORY_URL=http://your-engine-host:8099
|
|
10
|
-
```
|
|
11
|
-
|
|
12
|
-
That's it. Same SDK, same code, same `/store` `/search` `/health` calls. Engine returns the same response shape with one optional addition (`engine_layer` field on results, naming which layer carried the hit — purely informational).
|
|
13
|
-
|
|
14
|
-
### Detailed wire-format compatibility
|
|
15
|
-
|
|
16
|
-
#### `POST /store`
|
|
17
|
-
|
|
18
|
-
Request:
|
|
19
|
-
```json
|
|
20
|
-
{ "content": "...", "metadata": { "key": "value" } }
|
|
21
|
-
```
|
|
22
|
-
|
|
23
|
-
Response (v0.5.x):
|
|
24
|
-
```json
|
|
25
|
-
{ "id": "mem_abc...", "content": "...", "layerId": "ml_default_episodic" }
|
|
26
|
-
```
|
|
27
|
-
|
|
28
|
-
Response (engine):
|
|
29
|
-
```json
|
|
30
|
-
{
|
|
31
|
-
"id": "abc...",
|
|
32
|
-
"content": "...",
|
|
33
|
-
"layerId": "ml_default_episodic",
|
|
34
|
-
"engine": { "l5": 1, "l6": 1 } // ← new, optional
|
|
35
|
-
}
|
|
36
|
-
```
|
|
37
|
-
|
|
38
|
-
The `engine` field is informational only. Existing SDK clients that ignore unknown fields (the default for both Node.js and Python clients) work without modification.
|
|
39
|
-
|
|
40
|
-
#### `POST /search`
|
|
41
|
-
|
|
42
|
-
Request:
|
|
43
|
-
```json
|
|
44
|
-
{ "query": "...", "limit": 10, "min_score": 0.0001 }
|
|
45
|
-
```
|
|
46
|
-
|
|
47
|
-
Response (v0.5.x):
|
|
48
|
-
```json
|
|
49
|
-
{
|
|
50
|
-
"results": [
|
|
51
|
-
{
|
|
52
|
-
"id": "mem_abc...", "content": "...", "metadata": {},
|
|
53
|
-
"similarity": 0.81, "layer_id": "ml_default_episodic", "client_id": "default"
|
|
54
|
-
}
|
|
55
|
-
]
|
|
56
|
-
}
|
|
57
|
-
```
|
|
58
|
-
|
|
59
|
-
Response (engine):
|
|
60
|
-
```json
|
|
61
|
-
{
|
|
62
|
-
"results": [
|
|
63
|
-
{
|
|
64
|
-
"id": "abc...", "content": "...", "metadata": {},
|
|
65
|
-
"similarity": 0.81, "layer_id": "ml_default_episodic", "client_id": "default",
|
|
66
|
-
"source": "doc1.md", // ← passes through engine's source_file
|
|
67
|
-
"engine_layer": "L4 vec" // ← new, optional, names the winning layer
|
|
68
|
-
}
|
|
69
|
-
]
|
|
70
|
-
}
|
|
71
|
-
```
|
|
72
|
-
|
|
73
|
-
#### `GET /health`
|
|
74
|
-
|
|
75
|
-
Request: no body.
|
|
76
|
-
|
|
77
|
-
Response (v0.5.x):
|
|
78
|
-
```json
|
|
79
|
-
{ "status": "ok", "client": "default", "version": "0.5.6", "memories": 249 }
|
|
80
|
-
```
|
|
81
|
-
|
|
82
|
-
Response (engine):
|
|
83
|
-
```json
|
|
84
|
-
{
|
|
85
|
-
"status": "ok",
|
|
86
|
-
"client": "default",
|
|
87
|
-
"version": "0.1.0",
|
|
88
|
-
"engine": "pentatonic-memory-engine",
|
|
89
|
-
"layers": {
|
|
90
|
-
"l0": "ok", "l1": "ok", "l2": "ok", "l3": "ok",
|
|
91
|
-
"l4": "ok", "l5": "ok", "l6": "ok",
|
|
92
|
-
"nv_embed": "ok"
|
|
93
|
-
},
|
|
94
|
-
"memories": 249
|
|
95
|
-
}
|
|
96
|
-
```
|
|
97
|
-
|
|
98
|
-
Reports per-layer status across all 7 layers of the `sequential-hybridrag-7-layer` engine.
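Because `/health` now reports per-layer status, a monitoring script can flag degraded layers directly. A minimal sketch against the response shape above — the helper name is illustrative:

```python
def unhealthy_layers(health: dict) -> list[str]:
    # Return the names of any layers whose status is not "ok".
    return sorted(name for name, status in health.get("layers", {}).items()
                  if status != "ok")
```

An empty return list means every layer (and the `nv_embed` endpoint) reported `"ok"`.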
|
|
99
|
-
|
|
100
|
-
#### `POST /store-batch` (NEW — not in v0.5.x)
|
|
101
|
-
|
|
102
|
-
```json
|
|
103
|
-
// Request
|
|
104
|
-
{
|
|
105
|
-
"records": [
|
|
106
|
-
{ "id": "doc1", "content": "...", "metadata": {} },
|
|
107
|
-
{ "id": "doc2", "content": "...", "metadata": {} }
|
|
108
|
-
],
|
|
109
|
-
"arena": "general"
|
|
110
|
-
}
|
|
111
|
-
|
|
112
|
-
// Response
|
|
113
|
-
{
|
|
114
|
-
"status": "ok",
|
|
115
|
-
"inserted": 2,
|
|
116
|
-
"ids": ["doc1", "doc2"],
|
|
117
|
-
"engine": { "l5": 2, "l6": 2 },
|
|
118
|
-
"duration_ms": 234.5
|
|
119
|
-
}
|
|
120
|
-
```
|
|
121
|
-
|
|
122
|
-
30-50× faster than calling `/store` N times when ingesting more than ~5 records.
|
|
123
|
-
|
|
124
|
-
#### `POST /forget` (RESTORED — was in v0.4.x, removed in v0.5.x)
|
|
125
|
-
|
|
126
|
-
```json
|
|
127
|
-
// Delete one record
|
|
128
|
-
{ "id": "doc1" }
|
|
129
|
-
|
|
130
|
-
// Or delete all records matching a metadata filter
|
|
131
|
-
{ "metadata_contains": { "bench_tag": "test-run-12345" } }
|
|
132
|
-
|
|
133
|
-
// Response
|
|
134
|
-
{ "deleted": 17, "engine": "pentatonic-memory-engine" }
|
|
135
|
-
```
|
|
136
|
-
|
|
137
|
-
Required for: test pollution control, GDPR data deletion, multi-tenant isolation, bench harnesses.
|
|
138
|
-
|
|
139
|
-
### Data migration
|
|
140
|
-
|
|
141
|
-
There is no automated dump-and-replay tool. Two paths:
|
|
142
|
-
|
|
143
|
-
**Path A — Re-ingest from source.**
|
|
144
|
-
If your Pentatonic deployment was populated from a known source (chat archives, document repository, TES events), re-run the ingestion against the engine. Use `/store-batch` for speed.
|
|
145
|
-
|
|
146
|
-
**Path B — Dump-and-replay from Postgres.**
|
|
147
|
-
If you only have the v0.5 Postgres database:
|
|
148
|
-
|
|
149
|
-
```bash
|
|
150
|
-
# Dump as JSONL
|
|
151
|
-
psql $DATABASE_URL -A -t -c \
|
|
152
|
-
"SELECT json_build_object('id', id, 'content', content, 'metadata', metadata)::text
|
|
153
|
-
FROM memory_nodes WHERE client_id = 'your-client'" \
|
|
154
|
-
> export.jsonl
|
|
155
|
-
|
|
156
|
-
# Replay against the engine
|
|
157
|
-
python tools/replay.py export.jsonl --target http://your-engine-host:8099
|
|
158
|
-
```
|
|
159
|
-
|
|
160
|
-
A `tools/replay.py` reference implementation lives under `tools/` in this package.
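The reference script itself isn't shown in this diff; a minimal sketch of its pure half — parsing the JSONL dump and chunking it into `/store-batch` payloads — assuming the shapes above (the batch size of 100 is an arbitrary choice):

```python
import json

def load_records(jsonl_text: str) -> list[dict]:
    # One {"id", "content", "metadata"} object per line, as produced
    # by the psql dump above.
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

def batch_payloads(records: list[dict], size: int = 100):
    # Chunk into /store-batch request bodies to keep each POST bounded.
    for i in range(0, len(records), size):
        yield {"records": records[i:i + size], "arena": "general"}
```

Each yielded payload would then be POSTed to `<target>/store-batch`, e.g. with `httpx.post(target + "/store-batch", json=payload)`.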
|
|
161
|
-
|
|
162
|
-
### What you lose
|
|
163
|
-
|
|
164
|
-
- **The `metadata.hypothetical_queries` field stops being generated at ingest time.** The engine generates HyDE queries at SEARCH time instead, against the user's actual query (better matching, faster ingest).
|
|
165
|
-
- **`metadata.distilled_from` atoms are no longer auto-generated.** If you were relying on the v0.5+ atomic-fact distillation behaviour, that's a feature of v0.5+ specifically — not a portable feature. The engine treats memories as canonical raw chunks. You can still run distillation as a separate post-processing step if needed.
|
|
166
|
-
|
|
167
|
-
### What you gain
|
|
168
|
-
|
|
169
|
-
- ~5× retrieval accuracy on substring/exact-match benches (~17.6% → ~82.4% mean)
|
|
170
|
-
- 30-50× faster bulk ingest via `/store-batch`
|
|
171
|
-
- Restored `/forget` endpoint
|
|
172
|
-
- Cross-encoder reranking on top-50
|
|
173
|
-
- Knowledge-graph-aware retrieval (entity overlap signal)
|
|
174
|
-
- Per-layer health visibility
|
|
175
|
-
|
|
176
|
-
### Rollback
|
|
177
|
-
|
|
178
|
-
The engine doesn't write to your existing Postgres. Roll back by switching the env var back. No data lost.
|
|
package/packages/memory-engine/docs/RUNBOOK-AWS.md
DELETED
|
@@ -1,375 +0,0 @@
|
|
|
1
|
-
# pentatonic-memory-engine — AWS deployment runbook (v1)
|
|
2
|
-
|
|
3
|
-
**Target:** single EC2 (`m6i.2xlarge`) in `us-east-1`, network-boundary auth via Cloudflare Tunnel.
|
|
4
|
-
**Operator:** Phil Hauser (or anyone with `AdministratorAccess` to account `170649632502`).
|
|
5
|
-
**Estimated time end-to-end:** ~45 minutes (mostly waiting for instance/volume provisioning).
|
|
6
|
-
|
|
7
|
-
---
|
|
8
|
-
|
|
9
|
-
## 0. Prerequisites
|
|
10
|
-
|
|
11
|
-
Before starting, verify:
|
|
12
|
-
|
|
13
|
-
```bash
|
|
14
|
-
aws sts get-caller-identity
|
|
15
|
-
# Should return Account: 170649632502, AdministratorAccess role
|
|
16
|
-
|
|
17
|
-
aws configure get region
|
|
18
|
-
# us-east-1
|
|
19
|
-
```
|
|
20
|
-
|
|
21
|
-
If region isn't set: `export AWS_REGION=us-east-1` for the rest of the session.
|
|
22
|
-
|
|
23
|
-
You'll also need:
|
|
24
|
-
- A **Cloudflare account** with access to the Pentatonic CF zone (for Tunnel setup)
|
|
25
|
-
- The **`pentatonic-ai-gateway` API key** (from lambda.dev — should already exist)
|
|
26
|
-
|
|
27
|
-
---
|
|
28
|
-
|
|
29
|
-
## 1. Variables (paste once, reuse below)
|
|
30
|
-
|
|
31
|
-
```bash
|
|
32
|
-
export AWS_REGION=us-east-1
|
|
33
|
-
export ENV=prod
|
|
34
|
-
export NAME=pme-${ENV}-us-east-1
|
|
35
|
-
export INSTANCE_TYPE=m6i.2xlarge
|
|
36
|
-
# Latest Ubuntu 22.04 LTS in us-east-1 (verify via aws ec2 describe-images if needed)
|
|
37
|
-
export AMI_ID=$(aws ec2 describe-images \
|
|
38
|
-
--owners 099720109477 \
|
|
39
|
-
--filters "Name=name,Values=ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*" \
|
|
40
|
-
--query 'Images | sort_by(@, &CreationDate) | [-1].ImageId' \
|
|
41
|
-
--output text)
|
|
42
|
-
echo "Using AMI: $AMI_ID"
|
|
43
|
-
```
|
|
44
|
-
|
|
45
|
-
---
|
|
46
|
-
|
|
47
|
-
## 2. Networking
|
|
48
|
-
|
|
49
|
-
Use the default VPC for v1. (Multi-VPC isolation is a v2 concern.)
|
|
50
|
-
|
|
51
|
-
```bash
|
|
52
|
-
export VPC_ID=$(aws ec2 describe-vpcs \
|
|
53
|
-
--filters "Name=is-default,Values=true" \
|
|
54
|
-
--query 'Vpcs[0].VpcId' --output text)
|
|
55
|
-
|
|
56
|
-
export SUBNET_ID=$(aws ec2 describe-subnets \
|
|
57
|
-
--filters "Name=vpc-id,Values=$VPC_ID" "Name=default-for-az,Values=true" \
|
|
58
|
-
--query 'Subnets[0].SubnetId' --output text)
|
|
59
|
-
|
|
60
|
-
echo "VPC=$VPC_ID Subnet=$SUBNET_ID"
|
|
61
|
-
```
|
|
62
|
-
|
|
63
|
-
### 2.1 Security group
|
|
64
|
-
|
|
65
|
-
No public ingress. Outbound 443/80/53 for Tunnel + gateway + apt + DNS.
|
|
66
|
-
|
|
67
|
-
```bash
|
|
68
|
-
export SG_ID=$(aws ec2 create-security-group \
|
|
69
|
-
--group-name $NAME-sg \
|
|
70
|
-
--description "pentatonic-memory-engine $ENV — outbound only; ingress via SSM" \
|
|
71
|
-
--vpc-id $VPC_ID \
|
|
72
|
-
--query 'GroupId' --output text)
|
|
73
|
-
|
|
74
|
-
# Outbound is allowed by default. Strip default outbound and re-add explicitly.
|
|
75
|
-
aws ec2 revoke-security-group-egress \
|
|
76
|
-
--group-id $SG_ID \
|
|
77
|
-
--ip-permissions '[{"IpProtocol":"-1","IpRanges":[{"CidrIp":"0.0.0.0/0"}]}]'
|
|
78
|
-
|
|
79
|
-
aws ec2 authorize-security-group-egress --group-id $SG_ID \
|
|
80
|
-
--ip-permissions '[
|
|
81
|
-
{"IpProtocol":"tcp","FromPort":443,"ToPort":443,"IpRanges":[{"CidrIp":"0.0.0.0/0","Description":"HTTPS for tunnel + gateway + apt"}]},
|
|
82
|
-
{"IpProtocol":"tcp","FromPort":80, "ToPort":80, "IpRanges":[{"CidrIp":"0.0.0.0/0","Description":"HTTP for apt fallback"}]},
|
|
83
|
-
{"IpProtocol":"udp","FromPort":53, "ToPort":53, "IpRanges":[{"CidrIp":"0.0.0.0/0","Description":"DNS"}]},
|
|
84
|
-
{"IpProtocol":"tcp","FromPort":53, "ToPort":53, "IpRanges":[{"CidrIp":"0.0.0.0/0","Description":"DNS-over-TCP"}]}
|
|
85
|
-
]'
|
|
86
|
-
|
|
87
|
-
echo "SG=$SG_ID"
|
|
88
|
-
```
|
|
89
|
-
|
|
90
|
-
**No inbound rule.** Ops access happens via SSM Session Manager (next step), not SSH.
|
|
91
|
-
|
|
92
|
-
---
|
|
93
|
-
|
|
94
|
-
## 3. IAM role for SSM Session Manager + EBS snapshot agent
|
|
95
|
-
|
|
96
|
-
Lets you `aws ssm start-session` into the box without an SSH key.
|
|
97
|
-
|
|
98
|
-
```bash
|
|
99
|
-
aws iam create-role --role-name $NAME-role \
|
|
100
|
-
--assume-role-policy-document '{
|
|
101
|
-
"Version":"2012-10-17",
|
|
102
|
-
"Statement":[{"Effect":"Allow","Principal":{"Service":"ec2.amazonaws.com"},"Action":"sts:AssumeRole"}]
|
|
103
|
-
}'
|
|
104
|
-
|
|
105
|
-
aws iam attach-role-policy --role-name $NAME-role \
|
|
106
|
-
--policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
|
|
107
|
-
|
|
108
|
-
aws iam create-instance-profile --instance-profile-name $NAME-profile
|
|
109
|
-
|
|
110
|
-
aws iam add-role-to-instance-profile \
|
|
111
|
-
--instance-profile-name $NAME-profile \
|
|
112
|
-
--role-name $NAME-role
|
|
113
|
-
|
|
114
|
-
# Wait for IAM eventual-consistency before launching EC2
|
|
115
|
-
sleep 10
|
|
116
|
-
```
|
|
117
|
-
|
|
118
|
-
---
|
|
119
|
-
|
|
120
|
-
## 4. EBS volumes
|
|
121
|
-
|
|
122
|
-
Five `gp3` volumes, 50 GiB each (resize online later if needed). One per layer's data dir.
|
|
123
|
-
|
|
124
|
-
```bash
|
|
125
|
-
export AZ=$(aws ec2 describe-subnets --subnet-ids $SUBNET_ID \
|
|
126
|
-
--query 'Subnets[0].AvailabilityZone' --output text)
|
|
127
|
-
|
|
128
|
-
for layer in l2 l3 l4 l5 l6; do
|
|
129
|
-
vol_id=$(aws ec2 create-volume \
|
|
130
|
-
--availability-zone $AZ \
|
|
131
|
-
--size 50 --volume-type gp3 \
|
|
132
|
-
--tag-specifications "ResourceType=volume,Tags=[{Key=Name,Value=$NAME-$layer},{Key=pme-layer,Value=$layer}]" \
|
|
133
|
-
--query 'VolumeId' --output text)
|
|
134
|
-
echo "$layer = $vol_id"
|
|
135
|
-
eval "export VOL_${layer}=$vol_id"
|
|
136
|
-
done
|
|
137
|
-
|
|
138
|
-
# Wait until all are 'available'
|
|
139
|
-
aws ec2 wait volume-available --volume-ids $VOL_l2 $VOL_l3 $VOL_l4 $VOL_l5 $VOL_l6
|
|
140
|
-
echo "All volumes available."
|
|
141
|
-
```
|
|
142
|
-
|
|
143
|
-
---
|
|
144
|
-
|
|
145
|
-
## 5. Launch the EC2
|
|
146
|
-
|
|
147
|
-
```bash
|
|
148
|
-
# User data: format the EBS volumes on first boot, install docker, mount.
|
|
149
|
-
cat > /tmp/userdata.sh <<'EOF'
|
|
150
|
-
#!/bin/bash
|
|
151
|
-
set -euxo pipefail
|
|
152
|
-
|
|
153
|
-
apt-get update
|
|
154
|
-
apt-get install -y docker.io docker-compose-v2 git xfsprogs
|
|
155
|
-
|
|
156
|
-
# Wait for EBS volumes to attach (they're attached just after instance launch by AWS CLI below)
|
|
157
|
-
for layer in l2 l3 l4 l5 l6; do
|
|
158
|
-
for i in {1..30}; do
|
|
159
|
-
if [ -e /dev/disk/by-label/$layer ] || lsblk -no NAME,SERIAL | grep -q "$layer"; then
|
|
160
|
-
break
|
|
161
|
-
fi
|
|
162
|
-
sleep 2
|
|
163
|
-
done
|
|
164
|
-
done
|
|
165
|
-
|
|
166
|
-
# Find each volume by tag (we'll attach by device name below; this just creates mount points)
|
|
167
|
-
mkdir -p /var/lib/pme/{l2,l3,l4,l5,l6}
|
|
168
|
-
|
|
169
|
-
# Format + mount each — done by per-volume systemd in step 6.5 below
|
|
170
|
-
|
|
171
|
-
systemctl enable --now docker
|
|
172
|
-
|
|
173
|
-
# Pull engine repo
|
|
174
|
-
cd /opt
|
|
175
|
-
git clone https://github.com/Pentatonic-Ltd/memory_stack_updated.git engine
|
|
176
|
-
chown -R ubuntu:ubuntu /opt/engine
|
|
177
|
-
EOF
|
|
178
|
-
|
|
179
|
-
export INSTANCE_ID=$(aws ec2 run-instances \
|
|
180
|
-
--image-id $AMI_ID \
|
|
181
|
-
--instance-type $INSTANCE_TYPE \
|
|
182
|
-
--subnet-id $SUBNET_ID \
|
|
183
|
-
--security-group-ids $SG_ID \
|
|
184
|
-
--iam-instance-profile Name=$NAME-profile \
|
|
185
|
-
--block-device-mappings 'DeviceName=/dev/sda1,Ebs={VolumeSize=30,VolumeType=gp3}' \
|
|
186
|
-
--metadata-options 'HttpTokens=required,HttpEndpoint=enabled' \
|
|
187
|
-
--tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=$NAME}]" \
|
|
188
|
-
--user-data file:///tmp/userdata.sh \
|
|
189
|
-
--query 'Instances[0].InstanceId' --output text)
|
|
190
|
-
|
|
191
|
-
aws ec2 wait instance-running --instance-ids $INSTANCE_ID
|
|
192
|
-
echo "Instance $INSTANCE_ID is running."
|
|
193
|
-
```
|
|
194
|
-
|
|
195
|
-
### 5.1 Attach EBS volumes
|
|
196
|
-
|
|
197
|
-
```bash
|
|
198
|
-
aws ec2 attach-volume --volume-id $VOL_l2 --instance-id $INSTANCE_ID --device /dev/xvdf
|
|
199
|
-
aws ec2 attach-volume --volume-id $VOL_l3 --instance-id $INSTANCE_ID --device /dev/xvdg
|
|
200
|
-
aws ec2 attach-volume --volume-id $VOL_l4 --instance-id $INSTANCE_ID --device /dev/xvdh
|
|
201
|
-
aws ec2 attach-volume --volume-id $VOL_l5 --instance-id $INSTANCE_ID --device /dev/xvdi
|
|
202
|
-
aws ec2 attach-volume --volume-id $VOL_l6 --instance-id $INSTANCE_ID --device /dev/xvdj
|
|
203
|
-
|
|
204
|
-
# Wait for all to attach
|
|
205
|
-
for v in $VOL_l2 $VOL_l3 $VOL_l4 $VOL_l5 $VOL_l6; do
|
|
206
|
-
aws ec2 wait volume-in-use --volume-ids $v
|
|
207
|
-
done
|
|
208
|
-
echo "All volumes attached."
|
|
209
|
-
```
|
|
210
|
-
|
|
211
|
-
---
|
|
212
|
-
|
|
213
|
-
## 6. Mount EBS volumes inside the EC2
|
|
214
|
-
|
|
215
|
-
Connect via SSM Session Manager:
|
|
216
|
-
|
|
217
|
-
```bash
|
|
218
|
-
aws ssm start-session --target $INSTANCE_ID
|
|
219
|
-
```
|
|
220
|
-
|
|
221
|
-
Then inside the instance:
|
|
222
|
-
|
|
223
|
-
```bash
|
|
224
|
-
# Format each volume (one-time)
|
|
225
|
-
for pair in xvdf:l2 xvdg:l3 xvdh:l4 xvdi:l5 xvdj:l6; do
|
|
226
|
-
dev=${pair%:*}; layer=${pair#*:}
|
|
227
|
-
if ! sudo blkid /dev/$dev >/dev/null 2>&1; then
|
|
228
|
-
sudo mkfs.xfs -L $layer /dev/$dev
|
|
229
|
-
fi
|
|
230
|
-
done
|
|
231
|
-
|
|
232
|
-
# Add to /etc/fstab and mount
|
|
233
|
-
for pair in xvdf:l2 xvdg:l3 xvdh:l4 xvdi:l5 xvdj:l6; do
|
|
234
|
-
dev=${pair%:*}; layer=${pair#*:}
|
|
235
|
-
uuid=$(sudo blkid -s UUID -o value /dev/$dev)
|
|
236
|
-
sudo mkdir -p /var/lib/pme/$layer
|
|
237
|
-
echo "UUID=$uuid /var/lib/pme/$layer xfs defaults,nofail 0 2" | sudo tee -a /etc/fstab
|
|
238
|
-
done
|
|
239
|
-
|
|
240
|
-
sudo systemctl daemon-reload
|
|
241
|
-
sudo mount -a
|
|
242
|
-
df -h /var/lib/pme/*
|
|
243
|
-
# All five should show ~50G mounted, 49G available.
|
|
244
|
-
```
|
|
245
|
-
|
|
246
|
-
---
|
|
247
|
-
|
|
248
|
-
## 7. Cloudflare Tunnel setup
|
|
249
|
-
|
|
250
|
-
In the Cloudflare dashboard:
|
|
251
|
-
|
|
252
|
-
1. **Zero Trust → Networks → Tunnels → Create a tunnel** (Cloudflared connector type)
|
|
253
|
-
2. Name: `engine-prod-us-east-1`
|
|
254
|
-
3. Save → copy the **tunnel token** (the `eyJ...` string).
|
|
255
|
-
4. **Public hostnames** tab → Add:
|
|
256
|
-
- Subdomain: `engine`
|
|
257
|
-
- Domain: `pentatonic.internal` (or whatever internal CF zone you use)
|
|
258
|
-
- Type: HTTP, URL: `compat:8099`
|
|
259
|
-
|
|
260
|
-
Copy the tunnel token; you'll set it as `CLOUDFLARED_TUNNEL_TOKEN` in `.env` below.
|
|
261
|
-
|
|
262
|
-
> The hostname is reachable only by Workers/services in the same Cloudflare account by default. If you want to lock down further, attach a **Cloudflare Access policy** requiring a service token on the hostname — then set the service-token header in TES Workers' fetch calls. Optional for v1; can layer on later.
|
|
263
|
-
|
|
264
|
-
---

## 8. Configure and bring up the engine

Back in the SSM session on the EC2:

```bash
cd /opt/engine

# Pull the AWS overlay (PR'd separately to memory_stack_updated; for now copy it manually)
# Once merged upstream, this file is part of the repo.
sudo curl -fL -o docker-compose.aws.yml \
  https://raw.githubusercontent.com/Pentatonic-Ltd/memory_stack_updated/main/docker-compose.aws.yml

# Generate a Neo4j password
NEO4J_PASSWORD=$(openssl rand -base64 24 | tr -d '/+=')

# Write .env (substitute the <...> values)
sudo tee .env >/dev/null <<EOF
PME_PORT=8099
# NV_EMBED_URL: confirm the exact URL with the gateway team
NV_EMBED_URL=https://gateway.pentatonic.ai/v1/embeddings
PENTATONIC_AI_GATEWAY_KEY=<paste from secret store>
CLOUDFLARED_TUNNEL_TOKEN=<paste from CF dashboard>
NEO4J_PASSWORD=$NEO4J_PASSWORD
EOF

sudo chmod 600 .env

# Bring up the stack
sudo docker compose -f docker-compose.yml -f docker-compose.aws.yml up -d
sudo docker compose ps
```

First run pulls images (~3-5 min) and builds the engine images (~10-15 min). Subsequent restarts are fast.

---

## 9. Smoke test

From your laptop or any TES dev environment with access to the CF zone:

```bash
curl -sf https://engine.pentatonic.internal/health | jq
# Expected: {"status":"ok","layers":{"l0":"ok",...,"l6":"ok"},"engine":"pentatonic-memory-engine"}

curl -sX POST https://engine.pentatonic.internal/store \
  -H "content-type: application/json" \
  -d '{"content":"hello from runbook smoke test","metadata":{"arena":"smoke"}}'

curl -sX POST https://engine.pentatonic.internal/search \
  -H "content-type: application/json" \
  -d '{"query":"hello","limit":3,"min_score":0.001}' | jq
```

If `/search` returns the row from `/store`, the engine works end to end.

---

## 10. AWS Backup

```bash
# Tag all volumes for the backup plan
for v in $VOL_l2 $VOL_l3 $VOL_l4 $VOL_l5 $VOL_l6; do
  aws ec2 create-tags --resources $v --tags Key=Backup,Value=daily
done

# Backup plan: nightly snapshot, 14-day retention.
# Easiest: AWS Backup console → Plan → "DailyBackup14Day" → resource selection by tag Backup=daily.
# Or via CLI — see https://docs.aws.amazon.com/aws-backup/latest/devguide/creating-a-backup-plan.html
```

Run the restore drill at least once before going live: spin up a sibling instance, attach the restored volumes, and confirm the engine comes back healthy.

---

## 11. CloudWatch alarms (recommended, not strictly v1)

- EC2 instance status check failed → SNS alert
- EBS volume usage > 80% → SNS alert
- Engine `/health` failure (custom Lambda probe via the tunnel) → SNS alert
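If you later script these, the first bullet maps to a standard `put-metric-alarm` call. A sketch, assuming an SNS topic named `engine-alerts` already exists (the alarm name, topic name, and account id below are placeholders):

```shell
aws cloudwatch put-metric-alarm \
  --alarm-name engine-prod-status-check \
  --namespace AWS/EC2 \
  --metric-name StatusCheckFailed \
  --dimensions Name=InstanceId,Value=$INSTANCE_ID \
  --statistic Maximum \
  --period 60 \
  --evaluation-periods 2 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:sns:us-east-1:<account-id>:engine-alerts
```

The EBS-usage and `/health` alarms need a data source first (the CloudWatch agent and the Lambda probe respectively), so they can't be one-liners.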

---

## 12. Resource summary

| Resource | Identifier (filled at runtime) |
|---|---|
| Instance | `$INSTANCE_ID` (m6i.2xlarge) |
| VPC / Subnet | `$VPC_ID` / `$SUBNET_ID` |
| Security group | `$SG_ID` |
| IAM role / profile | `$NAME-role` / `$NAME-profile` |
| EBS volumes | `$VOL_l2 $VOL_l3 $VOL_l4 $VOL_l5 $VOL_l6` (50 GiB gp3 each) |
| Cloudflare Tunnel | `engine-prod-us-east-1` → `engine.pentatonic.internal` |

Estimated v1 cost: **~$340/mo on-demand** (instance) + **~$20/mo** (5×50 GiB gp3) + AWS Backup snapshots (~$5-10/mo at 14-day retention) + data transfer (negligible from CF Tunnel).

---

## Teardown (if you need to recreate)

```bash
aws ec2 terminate-instances --instance-ids $INSTANCE_ID
aws ec2 wait instance-terminated --instance-ids $INSTANCE_ID
for v in $VOL_l2 $VOL_l3 $VOL_l4 $VOL_l5 $VOL_l6; do
  aws ec2 delete-volume --volume-id $v
done
aws ec2 delete-security-group --group-id $SG_ID
aws iam remove-role-from-instance-profile --instance-profile-name $NAME-profile --role-name $NAME-role
aws iam delete-instance-profile --instance-profile-name $NAME-profile
aws iam detach-role-policy --role-name $NAME-role --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
aws iam delete-role --role-name $NAME-role
```

@@ -1,138 +0,0 @@

# Why `pentatonic-memory` v0.5.x underperforms on retrieval benchmarks

This document explains the architectural reasons `pentatonic-memory` v0.5.x scores 17.6% on substring-graded retrieval benches. None of these are bugs — they are deliberate design decisions optimised for a different workload (chat-style fact recall over agent memory). They just happen to be the wrong defaults for general-purpose retrieval.

The engine in this package addresses each one.

## 1. Atom boost wins over source

```js
// pentatonic-memory v0.5.10/src/search.js
const DEFAULT_WEIGHTS = {
  ...
  atomBoost: 0.15,       // ← 15% boost for distilled atomic facts
  verbosityPenalty: 0.1, // ← penalty for long raw content
};
```

`distill.js` runs an LLM on every ingested memory and extracts "atomic facts." Those atoms are stored as separate rows linked back via `source_id`. Search then ranks atoms above their sources via the boost.

For chat-style queries ("what does Phil drink?") this works: the atom "Phil drinks cortado" is ranked above the raw turn "Yeah, oh hey Phil came over yesterday and he had a cortado…".

For substring grading ("what was the price of thing-9001?") it backfires: the atom is "the user reported a sale event" and the raw "thing-9001 sold for $15.50 to buyer-42" gets dropped or out-ranked. The literal answer string is gone.

**Engine default:** `atomBoost = 0`, `verbosityPenalty = 0`. Distillation is opt-in per query.
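To see how a 0.15 boost lets a paraphrase outrank the literal text, here is a toy re-scoring sketch (the function and field names are illustrative, not the SDK's actual code):

```javascript
// Toy illustration — not the SDK's real scoring path.
function adjustedScore(row, weights) {
  let score = row.similarity;
  if (row.source_id) score += weights.atomBoost;               // row is a distilled atom
  if (row.tokenCount > 512) score -= weights.verbosityPenalty; // long raw content
  return score;
}

const weights = { atomBoost: 0.15, verbosityPenalty: 0.1 };
// The raw turn matches the query better (0.50 vs 0.42)...
const raw  = { similarity: 0.50, source_id: null,    tokenCount: 900 };
const atom = { similarity: 0.42, source_id: "raw-1", tokenCount: 12 };
// ...but after the boost and penalty, the paraphrased atom wins the ranking.
adjustedScore(atom, weights) > adjustedScore(raw, weights); // true
```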

## 2. `dedupeBySource` removes the right answer

```js
// pentatonic-memory v0.5.10/src/search.js, line 161
if (opts.dedupeBySource !== false) {
  const atomSources = new Set(
    filtered.filter((r) => r.source_id).map((r) => r.source_id)
  );
  if (atomSources.size > 0) {
    filtered = filtered.filter((r) => !atomSources.has(r.id));
  }
}
```

When an atom matches, its raw source row is **dropped** from the results. The reasoning is "the atom contains the relevant fact, so the source is redundant." For substring grading, though, the source contains the literal text the bench is looking for, while the atom is a paraphrase.

**Engine default:** return both atom and source. Callers can dedupe themselves if they want to reduce token spend.
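Callers that do want the token-saving behaviour can reproduce it in a few lines, with the polarity flipped: keep the raw source (which holds the literal text) and drop its atoms. A hypothetical helper:

```javascript
// Hypothetical caller-side dedupe: drop an atom only when the source row
// it points at is also present in the result set.
function dedupeKeepSources(results) {
  const presentIds = new Set(results.map((r) => r.id));
  return results.filter((r) => !(r.source_id && presentIds.has(r.source_id)));
}

const rows = [
  { id: "atom-1", source_id: "raw-1", content: "a sale event occurred" },
  { id: "raw-1",  source_id: null,    content: "thing-9001 sold for $15.50" },
  { id: "atom-2", source_id: "raw-9", content: "source row not in results" },
];
dedupeKeepSources(rows); // keeps raw-1 and atom-2, drops atom-1
```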

## 3. `minScore: 0.5` is too aggressive

```js
const threshold = opts.minScore ?? 0.5;
```

NV-Embed-v2 routinely produces cosine similarities of 0.30–0.45 for genuinely relevant chunks. The 0.5 default filters those out completely. The bench passes `min_score: 0.0001` to compensate, but real callers using SDK defaults silently lose recall.

**Engine default:** `min_score: 0.001`. The CTE's relevance × recency × frequency formula handles ranking; let everything through and trust the ordering.
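For intuition, a toy version of the multiplicative shape (the real formula lives in the engine's SQL CTE; the decay constant and frequency term below are invented):

```javascript
// Invented constants — illustrates the relevance × recency × frequency shape only.
function rankScore(row, now = Date.now()) {
  const relevance = row.similarity;                  // cosine similarity, 0..1
  const ageDays = (now - row.lastAccessed) / 86_400_000;
  const recency = Math.exp(-ageDays / 30);           // exponential decay, ~30-day scale
  const frequency = 1 + Math.log1p(row.accessCount); // diminishing returns
  return relevance * recency * frequency;
}
```

Under a shape like this, a 0.35-similarity row that is fresh and frequently touched can legitimately outrank a stale 0.5 match, which is exactly what a hard 0.5 cutoff forbids.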

## 4. No `/forget` endpoint

```js
// server.js routes:
// POST /search
// POST /store
// GET  /health
// (no /forget, no /memories)
```

v0.4.x had `/forget` and `/memories`; v0.5.x removed them. Without `/forget`:

- Tests can't isolate runs (data accumulates across test suites)
- Benches pollute each other's namespaces (we observed v0.5.6 drop from 17.6% to 9.4% over 5 runs of pollution)
- GDPR data-deletion requests require direct Postgres access
- Multi-tenant deployments can't enforce tenant boundaries via the SDK alone

**Engine:** restored `/forget` with `id` and `metadata_contains` filters.
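The filter semantics can be sketched as a row predicate (a hypothetical shape; the engine's actual handler may differ):

```javascript
// Hypothetical /forget selection: match by exact id, or require every
// key/value in metadata_contains to be present on the row's metadata.
function matchesForget(row, filter) {
  if (filter.id) return row.id === filter.id;
  if (filter.metadata_contains) {
    return Object.entries(filter.metadata_contains).every(
      ([k, v]) => row.metadata?.[k] === v
    );
  }
  return false; // no filter given: refuse to match everything
}

const rows = [
  { id: "m1", metadata: { arena: "smoke" } },
  { id: "m2", metadata: { arena: "prod" } },
];
rows.filter((r) => matchesForget(r, { metadata_contains: { arena: "smoke" } }));
// → only m1, so a test suite can wipe just its own namespace
```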

## 5. No `/store-batch`

Even though `ai.js` has an `embedBatch()` helper, the server only exposes the single-record `/store`. Bulk ingest therefore does N HTTP roundtrips, each with one synchronous embed call.

For the bench harness, this means a 22-doc corpus takes ~25 minutes to ingest, because every doc waits for an Ollama HyDE generation (60s default) plus an embed call.

**Engine:** added `/store-batch`. One HTTP roundtrip, one batched embed call, one bulk INSERT. 30-50× faster on >5 records.
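Client-side, bulk ingest then collapses to one POST per batch. A sketch (the batch size of 100 is an invented limit, not something the endpoint enforces):

```javascript
// Split records into fixed-size batches for /store-batch.
function toBatches(records, size = 100) {
  const batches = [];
  for (let i = 0; i < records.length; i += size) {
    batches.push(records.slice(i, i + size));
  }
  return batches;
}

// One roundtrip per batch instead of one per record.
async function storeAll(baseUrl, records) {
  for (const batch of toBatches(records)) {
    await fetch(`${baseUrl}/store-batch`, {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify({ records: batch }),
    });
  }
}
```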

## 6. HyDE generated at INGEST time

```js
// ingest.js — for every /store call:
const hypothetical_queries = await llm.chat(/* generate 3-5 fake queries */);
metadata.hypothetical_queries = hypothetical_queries;
```

This adds a 60s LLM call to every ingest. Worse, the queries are generated against the *content*, not the user's actual query — so they tend to be generic ("what is the topic of this document"), not useful for matching at search time.

**Engine:** HyDE runs at SEARCH time against the user's actual query. Each search generates 3 hypothetical answers, embeds each, runs a vector search per embedding, and RRF-fuses the rank lists. Better matching, no ingest blocking.
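The fusion step is standard Reciprocal Rank Fusion. A self-contained sketch over the per-variant rank lists (arrays of row ids, best first):

```javascript
// RRF: each list contributes 1 / (k + rank) per id; k = 60 is the
// conventional constant. Ids ranked well by several lists float to the top.
function rrfFuse(rankLists, k = 60) {
  const scores = new Map();
  for (const list of rankLists) {
    list.forEach((id, i) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// A row ranked near the top by all three HyDE variants beats one that
// tops only a single list:
rrfFuse([["a", "b"], ["b", "c"], ["b", "a"]]); // → ["b", "a", "c"]
```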

## 7. No content chunking

v0.5.x stores a 10,000-token document as one row with one 4096-d embedding. The vector represents the *average* meaning of the document, washing out specific facts.

**Engine:** chunks at ingest into ~200-500 token segments, each with its own embedding and `chunk_index`. Search returns chunks; the downstream caller can hydrate the parent document if needed.
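A minimal chunker sketch, using whitespace-separated words as a crude token proxy (the engine's real chunker and sizes may differ); the overlap keeps facts that straddle a boundary retrievable from both sides:

```javascript
// Split text into ~maxTokens-word chunks with a small overlap.
function chunk(text, maxTokens = 400, overlap = 50) {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  for (let start = 0; start < words.length; start += maxTokens - overlap) {
    chunks.push({
      chunk_index: chunks.length,
      content: words.slice(start, start + maxTokens).join(" "),
    });
    if (start + maxTokens >= words.length) break; // tail already covered
  }
  return chunks;
}
```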

## 8. No reranker

v0.5.x's `search.js` returns the top-K directly from the SQL CTE score, with no second-pass reranker.

**Engine:** the L6 doc-store runs a `ms-marco-MiniLM-L-6-v2` cross-encoder over the top-50 from initial retrieval, then returns the top-K. Substantially better precision on questions that need exact term matching after broad recall.

## 9. No graph / entity layer

v0.5.x doesn't extract entities at ingest, doesn't build relationships, and can't answer multi-hop questions ("who owns thing-X" → "find listings where X was sold" → "fetch the buyer's contact").

**Engine:** the L3 Knowledge Graph (Neo4j Community) extracts entities at ingest, builds edges between co-occurring entities, and at search time boosts rows that mention the same entities as the query. Critical for the marketplace-ops and customer-support benches.

## 10. Single vector store, single embedding per row

v0.5.x writes one row per memory with one embedding column in pgvector. pgvector's HNSW index doesn't support vectors above 2000 dimensions, so 4096-d NV-Embed embeddings fall back to a sequential scan. At >100k memories, that's >100ms per query.

**Engine:** indexes the same content into multiple stores in parallel:

- L0 BM25 (SQLite FTS5)
- L4 sqlite-vec (small, in-process)
- L5 Milvus (medium, dedicated)
- L6 doc-store (with reranker)
- L3 KG (relationship-pivoted)

Search runs all five in parallel, RRF-fuses the rank lists, and applies the reranker to the top-50. Different query types win on different layers — the fusion absorbs the strengths of each.

## Summary

| Gap | Bench impact (estimated) | Fix complexity |
|---|---|---|
| 1. atomBoost +0.15 | -15-20pp | trivial (config flag) |
| 2. dedupeBySource: true | -5-10pp | trivial (config flag) |
| 3. minScore: 0.5 default | -3-8pp | trivial (config change) |
| 4. No /forget | n/a but blocks tests | trivial (10 LOC) |
| 5. No /store-batch | n/a but blocks bench (~25 min ingest) | low (50 LOC) |
| 6. HyDE at ingest time | -5-10pp + 60s/store | medium (refactor) |
| 7. No chunking | -5-15pp on long docs | medium (schema change) |
| 8. No reranker | -5-10pp | medium (sidecar service) |
| 9. No graph layer | -5-10pp on entity queries | high (new schema + extraction) |
| 10. Single vector store | -10-20pp, latency at scale | high (parallel infrastructure) |

This package addresses 1-10 simultaneously by routing through the 7-layer engine, recovering ~65pp of the gap.