@pentatonic-ai/ai-agent-sdk 0.7.0 → 0.7.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -161,22 +161,26 @@ If `/search` returns the row from `/store`, the engine is live.
161
161
 
162
162
  **Connect Claude Code**
163
163
 
164
- The `tes-memory` plugin's hooks already speak the engine's wire format. Two steps:
164
+ The `tes-memory` plugin's hooks already speak the engine's wire format. Three steps:
165
165
 
166
166
  1. Install the plugin (once):
167
167
  ```
168
168
  /plugin marketplace add Pentatonic-Ltd/ai-agent-sdk
169
169
  /plugin install tes-memory@pentatonic-ai
170
170
  ```
171
- 2. Point it at your local engine. Edit `~/.claude-pentatonic/tes-memory.local.md` (create if missing):
172
- ```yaml
173
- ---
174
- mode: local
175
- memory_url: http://localhost:8099
176
- ---
171
+ 2. Point it at your local engine. One command writes the plugin config:
172
+ ```bash
173
+ npx @pentatonic-ai/ai-agent-sdk config local
177
174
  ```
175
+ This writes `~/.claude-pentatonic/tes-memory.local.md` with `mode: local` and `memory_url: http://localhost:8099`. If you want a different URL, pass `--engine-url <url>`. To switch back to hosted later, run `npx @pentatonic-ai/ai-agent-sdk config hosted` (delegates to `login`).
178
176
  3. Reload: `/reload-plugins` (or restart Claude Code if status reports stale state — MCP server processes need a full restart to pick up plugin updates).
179
177
 
178
+ Inspect what's currently configured at any time:
179
+
180
+ ```bash
181
+ npx @pentatonic-ai/ai-agent-sdk config show
182
+ ```
183
+
180
184
  Verify:
181
185
 
182
186
  ```
@@ -307,22 +311,26 @@ Works with both local and hosted memory. Install once, switch modes via config.
307
311
  /plugin install tes-memory@pentatonic-ai
308
312
  ```
309
313
 
310
- **Local engine** — bring up the engine first ([Memory > Local](#local-self-hosted)), then point the plugin at it. Edit `~/.claude-pentatonic/tes-memory.local.md`:
314
+ **Local engine** — bring up the engine first ([Memory > Local](#local-self-hosted)), then write the plugin config:
311
315
 
312
- ```yaml
313
- ---
314
- mode: local
315
- memory_url: http://localhost:8099
316
- ---
316
+ ```bash
317
+ npx @pentatonic-ai/ai-agent-sdk config local
317
318
  ```
318
319
 
319
320
  **Hosted TES** — run `login` once, the plugin auto-discovers `~/.config/tes/credentials.json`:
320
321
 
321
322
  ```bash
322
323
  npx @pentatonic-ai/ai-agent-sdk login
324
+ # equivalent: npx @pentatonic-ai/ai-agent-sdk config hosted
325
+ ```
326
+
327
+ Either way, verify with `/tes-memory:tes-status` in Claude Code, or from the shell:
328
+
329
+ ```bash
330
+ npx @pentatonic-ai/ai-agent-sdk config show
323
331
  ```
324
332
 
325
- Either way, verify with `/tes-memory:tes-status` in Claude Code. The plugin's MCP server, hooks, and tools all read the same config.
333
+ The plugin's MCP server, hooks, and tools all read the same config — switching modes is a single CLI call away.
326
334
 
327
335
  **What it tracks (auto, every turn):**
328
336
  - Memory search at prompt time — relevant memories injected as context
@@ -179,9 +179,16 @@ export async function runLoginCommand(opts = {}) {
179
179
  log(` ✓ Connected as ${claims.email || "user"} on tenant \`${clientId}\``);
180
180
  log(` ✓ Credentials written to ~/.config/tes/credentials.json`);
181
181
  log("");
182
- log(" Claude Code's tes-memory plugin and the OpenClaw pentatonic-memory");
183
- log(" plugin will pick these credentials up automatically — restart them");
184
- log(" if they're already running.");
182
+ log(" Install the Pentatonic TES plugin to start capturing context:");
183
+ log("");
184
+ log(" Claude Code:");
185
+ log(" /plugin marketplace add Pentatonic-Ltd/ai-agent-sdk");
186
+ log(" /plugin install tes-memory@pentatonic-ai");
187
+ log("");
188
+ log(" OpenClaw:");
189
+ log(" openclaw plugins install @pentatonic-ai/openclaw-memory-plugin");
190
+ log("");
191
+ log(" Already installed the plugin? Reload now to refresh the credentials.");
185
192
  log("");
186
193
 
187
194
  return { exitCode: 0, clientId };
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@pentatonic-ai/ai-agent-sdk",
3
- "version": "0.7.0",
3
+ "version": "0.7.2",
4
4
  "description": "TES SDK — LLM observability and lifecycle tracking via Pentatonic Thing Event System. Track token usage, tool calls, and conversations. Manage things through event-sourced lifecycle stages with AI enrichment and vector search.",
5
5
  "type": "module",
6
6
  "main": "./dist/index.cjs",
@@ -85,6 +85,19 @@ class SearchRequest(BaseModel):
85
85
  query: str
86
86
  limit: Optional[int] = 10
87
87
  min_score: Optional[float] = 0.001
88
+ # Tenant scope. Required for multi-tenant deployments. Forwarded to
89
+ # layers that support arena filtering natively (L6); applied as a
90
+ # post-filter on the shim for layers that don't yet (L2, L4, L5).
91
+ # When unset, search is global — same behaviour as v0.7.x; safe for
92
+ # single-tenant deployments. Multi-tenant callers MUST set this.
93
+ arena: Optional[str] = None
94
+ # Arbitrary metadata equality filters, applied as a post-filter on
95
+ # the shim. Useful for `kind`, `layer_type`, `source_repo`, etc.
96
+ # Keys not present on a result's metadata are treated as no-match.
97
+ # Each pair is exact string equality. Engine doesn't currently
98
+ # forward these to underlying stores, so over-fetch happens; the
99
+ # shim trims to the requested limit after filtering.
100
+ metadata_filter: Optional[dict[str, Any]] = None
88
101
 
89
102
 
90
103
  class ForgetRequest(BaseModel):
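For illustration, a multi-tenant caller's request body might look like the sketch below. Field names mirror the `SearchRequest` model above; all values (tenant name, repo, query) are hypothetical:

```python
# Hypothetical multi-tenant request body for POST /search.
# `arena` scopes results to one tenant; each `metadata_filter` pair is
# an exact string-equality match applied as a post-filter on the shim.
payload = {
    "query": "deployment runbook",
    "limit": 10,
    "min_score": 0.001,
    "arena": "tenant-acme",
    "metadata_filter": {"kind": "doc", "source_repo": "infra-notes"},
}
print(payload["arena"])  # tenant-acme
```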
@@ -424,6 +437,51 @@ async def store_batch(req: StoreBatchRequest):
424
437
  }
425
438
 
426
439
 
440
+ def _apply_metadata_filters(results: list[dict[str, Any]], req: SearchRequest) -> list[dict[str, Any]]:
441
+ """Post-filter results by arena + arbitrary metadata equality.
442
+
443
+ Many layer searches don't yet honour arena/metadata at the storage
444
+ level, so the shim enforces tenant isolation here as defence in
445
+ depth. Even if the underlying layer leaks across arenas, the shim
446
+ drops cross-tenant rows before returning.
447
+ """
448
+ arena = req.arena
449
+ extra = req.metadata_filter or {}
450
+ if not arena and not extra:
451
+ return results
452
+ out: list[dict[str, Any]] = []
453
+ for item in results:
454
+ meta = item.get("metadata") or {}
455
+ if arena:
456
+ row_arena = meta.get("arena") or item.get("arena")
457
+ if row_arena and row_arena != arena:
458
+ continue
459
+ # If row has no arena tag at all, drop on multi-tenant
460
+ # safety: a row without arena predates the multi-tenant
461
+ # plumbing and could belong to anyone.
462
+ if arena and not row_arena:
463
+ continue
464
+ ok = True
465
+ for k, v in extra.items():
466
+ if str(meta.get(k, "")) != str(v):
467
+ ok = False
468
+ break
469
+ if ok:
470
+ out.append(item)
471
+ return out
472
+
473
+
474
+ def _search_overfetch(req: SearchRequest) -> int:
475
+ """Decide how many results to over-fetch from layers.
476
+
477
+ Post-filtering can drop many rows; we ask layers for more than the
478
+ user's limit so we have headroom after filtering. 5x is a balance
479
+ between accuracy and latency; the no-filter baseline stays at 3x.
480
+ """
481
+ base = req.limit or 10
482
+ return base * 5 if (req.arena or req.metadata_filter) else base * 3
483
+
484
+
427
485
  @app.post("/search")
428
486
  async def search(req: SearchRequest):
429
487
  """
@@ -431,6 +489,12 @@ async def search(req: SearchRequest):
431
489
  queries L0 BM25, L4 vec, L5 Milvus, L6 doc-store in parallel and fuses
432
490
  the results with Reciprocal Rank Fusion. L3 KG adds entity-aware
433
491
  boosting for graph queries.
492
+
493
+ Multi-tenancy: pass `arena` to scope results to a single tenant.
494
+ Underlying layers may or may not honour arena natively (L6 does;
495
+ L2/L4/L5 don't yet — engine TODO); the shim applies arena as a
496
+ post-filter regardless, so cross-tenant leakage is prevented even
497
+ when a layer is non-compliant.
434
498
  """
435
499
  if not req.query:
436
500
  return {"results": []}
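For readers unfamiliar with the fusion step: Reciprocal Rank Fusion scores each document by summing `1/(k + rank)` over its rank in each layer's result list. A toy sketch, assuming `k=60` (matching the `RRF_K = 60` constant visible in the L6 layer; the engine's real fusion also tracks per-layer provenance and reranks):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    # rankings: one ranked list of doc ids per layer, best first.
    scores: dict[str, float] = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


# "b" ranks well in both layers, so it outranks "a", which only one
# layer returned at all.
print(rrf_fuse([["a", "b", "c"], ["b", "c"]]))  # ['b', 'c', 'a']
```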
@@ -452,10 +516,19 @@ async def search(req: SearchRequest):
452
516
  import asyncio
453
517
  async def _q_l6(query: str):
454
518
  try:
519
+ params: dict[str, Any] = {
520
+ "q": query,
521
+ "limit": _search_overfetch(req),
522
+ "method": "hybrid",
523
+ }
524
+ if req.arena:
525
+ # L6 supports arena natively (l6-document-store.py:837).
526
+ # Forward it so the underlying Milvus query and FTS
527
+ # query both filter to this tenant before returning.
528
+ params["arena"] = req.arena
455
529
  r = await _client().get(
456
530
  f"{L6_DOC_URL}/search",
457
- params={"q": query, "limit": (req.limit or 10) * 3,
458
- "method": "hybrid"},
531
+ params=params,
459
532
  timeout=30.0,
460
533
  )
461
534
  r.raise_for_status()
@@ -544,11 +617,14 @@ async def search(req: SearchRequest):
544
617
  "source": item.get("source_file") or item.get("path") or "",
545
618
  "engine_layer": "+".join(sorted(set(layer_provenance.get(key, [])))),
546
619
  })
547
- return {"results": out_results}
620
+ # Defense-in-depth post-filter (arena + arbitrary metadata),
621
+ # then trim to the requested limit.
622
+ out_results = _apply_metadata_filters(out_results, req)
623
+ return {"results": out_results[: req.limit or 10]}
548
624
  try:
549
625
  r = await _client().get(
550
626
  f"{L2_PROXY_URL}/search",
551
- params={"q": req.query, "limit": req.limit or 10},
627
+ params={"q": req.query, "limit": _search_overfetch(req)},
552
628
  timeout=30.0,
553
629
  )
554
630
  r.raise_for_status()
@@ -558,7 +634,7 @@ async def search(req: SearchRequest):
558
634
  try:
559
635
  r = await _client().post(
560
636
  f"{L2_PROXY_URL}/v1/search",
561
- json={"query": req.query, "limit": req.limit or 10,
637
+ json={"query": req.query, "limit": _search_overfetch(req),
562
638
  "min_score": req.min_score or 0.001},
563
639
  timeout=30.0,
564
640
  )
@@ -567,9 +643,14 @@ async def search(req: SearchRequest):
567
643
  except Exception as exc2:
568
644
  last_err = exc2
569
645
  try:
646
+ params: dict[str, Any] = {"q": req.query, "limit": _search_overfetch(req)}
647
+ # L6 supports arena natively; forward it on the
648
+ # last-resort fallback path too.
649
+ if req.arena:
650
+ params["arena"] = req.arena
570
651
  r = await _client().get(
571
652
  f"{L6_DOC_URL}/search",
572
- params={"q": req.query, "limit": req.limit or 10},
653
+ params=params,
573
654
  timeout=10.0,
574
655
  )
575
656
  r.raise_for_status()
@@ -621,7 +702,10 @@ async def search(req: SearchRequest):
621
702
  "source": item.get("source", item.get("source_file", "")),
622
703
  "engine_layer": item.get("layer", item.get("source_layer", "")),
623
704
  })
624
- return {"results": out_results}
705
+ # Defense-in-depth post-filter (arena + arbitrary metadata) on L2/L6
706
+ # fallback paths. Same logic as the BYPASS branch above.
707
+ out_results = _apply_metadata_filters(out_results, req)
708
+ return {"results": out_results[: req.limit or 10]}
625
709
 
626
710
 
627
711
  @app.post("/forget")
@@ -46,8 +46,6 @@ EMBED_MODEL_NAME = os.environ.get("L4_EMBED_MODEL", "nv-embed-v2")
46
46
  EMBED_API_KEY = os.environ.get("L4_EMBED_API_KEY", "")
47
47
  EMBED_DIM = int(os.environ.get("L4_EMBED_DIM", "4096"))
48
48
 
49
- def _embed_headers() -> dict:
50
- return {"Authorization": f"Bearer {EMBED_API_KEY}"} if EMBED_API_KEY else {}
51
49
 
52
50
 
53
51
  # ----------------------------------------------------------------------
@@ -109,16 +107,48 @@ def _client() -> httpx.AsyncClient:
109
107
 
110
108
 
111
109
  async def _embed_batch(texts: list[str]) -> list[list[float]]:
110
+ """Embed a batch of texts.
111
+
112
+ Tries OpenAI-compatible shape first (POST <url>, Bearer auth,
113
+ response data[i].embedding). On failure, falls back to the
114
+ Pentatonic-AI gateway's native shape (POST .../v1/embed, X-API-Key
115
+ auth, response embeddings[i]). When the gateway eventually adds an
116
+ OpenAI-compat /v1/embeddings alias, the primary path will succeed
117
+ and the fallback will never fire — no code change needed.
118
+ """
112
119
  if not texts:
113
120
  return []
121
+ payload = {"input": texts, "model": EMBED_MODEL_NAME}
122
+ # Primary: OpenAI-compat
123
+ try:
124
+ resp = await _client().post(
125
+ NV_EMBED_URL,
126
+ headers=_openai_headers(),
127
+ json=payload,
128
+ timeout=120.0,
129
+ )
130
+ resp.raise_for_status()
131
+ return [d["embedding"] for d in resp.json()["data"]]
132
+ except Exception:
133
+ pass
134
+ # Fallback: lambda-gateway native shape
135
+ fallback_url = NV_EMBED_URL.replace("/v1/embeddings", "/v1/embed").replace("/embeddings", "/embed")
114
136
  resp = await _client().post(
115
- NV_EMBED_URL,
116
- headers=_embed_headers(),
117
- json={"input": texts, "model": EMBED_MODEL_NAME},
137
+ fallback_url,
138
+ headers=_lambda_headers(),
139
+ json=payload,
118
140
  timeout=120.0,
119
141
  )
120
142
  resp.raise_for_status()
121
- return [d["embedding"] for d in resp.json()["data"]]
143
+ return resp.json()["embeddings"]
144
+
145
+
146
+ def _openai_headers() -> dict:
147
+ return {"Authorization": f"Bearer {EMBED_API_KEY}"} if EMBED_API_KEY else {}
148
+
149
+
150
+ def _lambda_headers() -> dict:
151
+ return {"X-API-Key": EMBED_API_KEY} if EMBED_API_KEY else {}
122
152
 
123
153
 
124
154
  # ----------------------------------------------------------------------
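The fallback path differs from the primary only in URL suffix and auth header. A quick standalone check of both derivations (the gateway URL and API key below are made-up examples, not real endpoints):

```python
NV_EMBED_URL = "https://gateway.example.com/v1/embeddings"  # hypothetical
EMBED_API_KEY = "sk-example"                                # hypothetical

# Primary (OpenAI-compat) vs fallback (lambda-gateway native), as in
# _embed_batch: same payload, different URL suffix and auth header.
# The chained replace handles both ".../v1/embeddings" and a bare
# ".../embeddings" suffix.
fallback_url = NV_EMBED_URL.replace("/v1/embeddings", "/v1/embed").replace("/embeddings", "/embed")
openai_headers = {"Authorization": f"Bearer {EMBED_API_KEY}"}
lambda_headers = {"X-API-Key": EMBED_API_KEY}

print(fallback_url)  # https://gateway.example.com/v1/embed
```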
@@ -51,8 +51,36 @@ EMBED_MODEL_NAME = os.environ.get("L5_EMBED_MODEL", "nv-embed-v2")
51
51
  # Optional Authorization: Bearer <key> for the primary embedding endpoint.
52
52
  EMBED_API_KEY = os.environ.get("L5_EMBED_API_KEY", "")
53
53
 
54
- def _embed_headers() -> dict:
55
- return {"Authorization": f"Bearer {EMBED_API_KEY}"} if EMBED_API_KEY else {}
54
+ def _embed_post(texts):
55
+ """POST to the configured embedding endpoint. Tries OpenAI-compat
56
+ shape first; falls back to Pentatonic-AI lambda-gateway native shape
57
+ on any failure. When the gateway adds a /v1/embeddings alias, the
58
+ primary path will succeed and the fallback never fires.
59
+
60
+ Returns: list[list[float]] (one embedding per input text).
61
+ """
62
+ payload = {"input": texts, "model": EMBED_MODEL_NAME}
63
+ try:
64
+ r = httpx.post(
65
+ NV_EMBED_URL,
66
+ headers={"Authorization": f"Bearer {EMBED_API_KEY}"} if EMBED_API_KEY else {},
67
+ json=payload,
68
+ timeout=120,
69
+ )
70
+ r.raise_for_status()
71
+ return [d["embedding"] for d in r.json()["data"]]
72
+ except Exception:
73
+ pass
74
+ fallback_url = NV_EMBED_URL.replace("/v1/embeddings", "/v1/embed").replace("/embeddings", "/embed")
75
+ r = httpx.post(
76
+ fallback_url,
77
+ headers={"X-API-Key": EMBED_API_KEY} if EMBED_API_KEY else {},
78
+ json=payload,
79
+ timeout=120,
80
+ )
81
+ r.raise_for_status()
82
+ return r.json()["embeddings"]
83
+
56
84
  # Ollama fallback path. URL/model can be overridden so the L5 container can
57
85
  # reach an Ollama instance running on the docker host (host.docker.internal)
58
86
  # or on a co-located service. Mirrors the env-var pattern used by L2.
@@ -99,10 +127,7 @@ def _embed_nv_batch(texts: list[str]) -> list[list[float]] | None:
99
127
  return []
100
128
  try:
101
129
  truncated = [t[:4000] for t in texts]
102
- r = httpx.post(NV_EMBED_URL, headers=_embed_headers(), json={"input": truncated, "model": EMBED_MODEL_NAME}, timeout=120)
103
- r.raise_for_status()
104
- data = r.json()
105
- embeddings = [item["embedding"] for item in data["data"]]
130
+ embeddings = _embed_post(truncated)
106
131
  if all(len(e) == EMBED_DIM for e in embeddings):
107
132
  return embeddings
108
133
  except Exception:
@@ -113,10 +138,8 @@ def _embed_nv_batch(texts: list[str]) -> list[list[float]] | None:
113
138
  def _embed_nv_single(text: str) -> list[float] | None:
114
139
  """Embed single text via NV-Embed-v2 (4096-dim)."""
115
140
  try:
116
- r = httpx.post(NV_EMBED_URL, headers=_embed_headers(), json={"input": text[:4000], "model": EMBED_MODEL_NAME}, timeout=15)
117
- r.raise_for_status()
118
- data = r.json()
119
- emb = data["data"][0]["embedding"]
141
+ embs = _embed_post([text[:4000]])
142
+ emb = embs[0]
120
143
  if len(emb) == EMBED_DIM:
121
144
  return emb
122
145
  except Exception:
@@ -573,12 +596,7 @@ def serve(port=8034):
573
596
  texts = [(r.get("text") or "")[:8192] for r in records]
574
597
  t0 = _time.time()
575
598
  try:
576
- resp = httpx.post(
577
- NV_EMBED_URL, headers=_embed_headers(), json={"input": texts, "model": EMBED_MODEL_NAME},
578
- timeout=120,
579
- )
580
- resp.raise_for_status()
581
- embs = [d["embedding"] for d in resp.json()["data"]]
599
+ embs = _embed_post(texts)
582
600
  except Exception as exc:
583
601
  return {"status": "error", "error": f"embed failed: {exc}"}
584
602
  embed_ms = (_time.time() - t0) * 1000.0
@@ -44,8 +44,33 @@ EMBED_DIM = int(os.environ.get("L6_EMBED_DIM", "4096"))
44
44
  # Optional Authorization: Bearer <key> for the embedding endpoint.
45
45
  EMBED_API_KEY = os.environ.get("L6_EMBED_API_KEY", "")
46
46
 
47
- def _embed_headers() -> dict:
48
- return {"Authorization": f"Bearer {EMBED_API_KEY}"} if EMBED_API_KEY else {}
47
+ def _embed_post(texts):
48
+ """POST to embedding endpoint. Tries OpenAI-compat shape first;
49
+ falls back to Pentatonic-AI lambda-gateway native shape on failure.
50
+ See L4 / L5 for the same pattern."""
51
+ import httpx as _httpx
52
+ payload = {"input": texts, "model": EMBED_MODEL}
53
+ try:
54
+ r = _httpx.post(
55
+ NV_EMBED_URL,
56
+ headers={"Authorization": f"Bearer {EMBED_API_KEY}"} if EMBED_API_KEY else {},
57
+ json=payload,
58
+ timeout=120,
59
+ )
60
+ r.raise_for_status()
61
+ return [d["embedding"] for d in r.json()["data"]]
62
+ except Exception:
63
+ pass
64
+ fallback_url = NV_EMBED_URL.replace("/v1/embeddings", "/v1/embed").replace("/embeddings", "/embed")
65
+ r = _httpx.post(
66
+ fallback_url,
67
+ headers={"X-API-Key": EMBED_API_KEY} if EMBED_API_KEY else {},
68
+ json=payload,
69
+ timeout=120,
70
+ )
71
+ r.raise_for_status()
72
+ return r.json()["embeddings"]
73
+
49
74
  COLLECTION_NAME = "documents"
50
75
  RRF_K = 60
51
76
  DEFAULT_PORT = 8037
@@ -874,16 +899,10 @@ def serve(port: int = DEFAULT_PORT):
874
899
 
875
900
  texts = [(r.get("text") or "")[:16000] for r in records]
876
901
 
877
- # Single batched NV-Embed call.
902
+ # Single batched embed call (OpenAI-compat first, lambda-gateway fallback).
878
903
  t0 = _time.time()
879
904
  try:
880
- resp = _httpx.post(
881
- NV_EMBED_URL, headers=_embed_headers(),
882
- json={"input": texts, "model": EMBED_MODEL},
883
- timeout=120,
884
- )
885
- resp.raise_for_status()
886
- embs = [d["embedding"] for d in resp.json()["data"]]
905
+ embs = _embed_post(texts)
887
906
  except Exception as exc:
888
907
  raise HTTPException(status_code=500, detail=f"embed failed: {exc}")
889
908
  embed_ms = (_time.time() - t0) * 1000.0
@@ -1,178 +0,0 @@
1
- # Migration Guide
2
-
3
- ## From `pentatonic-memory` v0.5.x → `pentatonic-memory-engine`
4
-
5
- ### TL;DR
6
-
7
- ```diff
8
- - export PENTATONIC_MEMORY_URL=http://your-pm-host:8099
9
- + export PENTATONIC_MEMORY_URL=http://your-engine-host:8099
10
- ```
11
-
12
- That's it. Same SDK, same code, same `/store` `/search` `/health` calls. Engine returns the same response shape with one optional addition (`engine_layer` field on results, naming which layer carried the hit — purely informational).
13
-
14
- ### Detailed wire-format compatibility
15
-
16
- #### `POST /store`
17
-
18
- Request:
19
- ```json
20
- { "content": "...", "metadata": { "key": "value" } }
21
- ```
22
-
23
- Response (v0.5.x):
24
- ```json
25
- { "id": "mem_abc...", "content": "...", "layerId": "ml_default_episodic" }
26
- ```
27
-
28
- Response (engine):
29
- ```json
30
- {
31
- "id": "abc...",
32
- "content": "...",
33
- "layerId": "ml_default_episodic",
34
- "engine": { "l5": 1, "l6": 1 } // ← new, optional
35
- }
36
- ```
37
-
38
- The `engine` field is informational only. Existing SDK clients that ignore unknown fields (the default for both Node.js and Python clients) work without modification.
39
-
40
- #### `POST /search`
41
-
42
- Request:
43
- ```json
44
- { "query": "...", "limit": 10, "min_score": 0.0001 }
45
- ```
46
-
47
- Response (v0.5.x):
48
- ```json
49
- {
50
- "results": [
51
- {
52
- "id": "mem_abc...", "content": "...", "metadata": {},
53
- "similarity": 0.81, "layer_id": "ml_default_episodic", "client_id": "default"
54
- }
55
- ]
56
- }
57
- ```
58
-
59
- Response (engine):
60
- ```json
61
- {
62
- "results": [
63
- {
64
- "id": "abc...", "content": "...", "metadata": {},
65
- "similarity": 0.81, "layer_id": "ml_default_episodic", "client_id": "default",
66
- "source": "doc1.md", // ← passes through engine's source_file
67
- "engine_layer": "L4 vec" // ← new, optional, names the winning layer
68
- }
69
- ]
70
- }
71
- ```
72
-
73
- #### `GET /health`
74
-
75
- Request: no body.
76
-
77
- Response (v0.5.x):
78
- ```json
79
- { "status": "ok", "client": "default", "version": "0.5.6", "memories": 249 }
80
- ```
81
-
82
- Response (engine):
83
- ```json
84
- {
85
- "status": "ok",
86
- "client": "default",
87
- "version": "0.1.0",
88
- "engine": "pentatonic-memory-engine",
89
- "layers": {
90
- "l0": "ok", "l1": "ok", "l2": "ok", "l3": "ok",
91
- "l4": "ok", "l5": "ok", "l6": "ok",
92
- "nv_embed": "ok"
93
- },
94
- "memories": 249
95
- }
96
- ```
97
-
98
- Reports per-layer status across all 7 layers of the `sequential-hybridrag-7-layer` engine.
99
-
100
- #### `POST /store-batch` (NEW — not in v0.5.x)
101
-
102
- ```json
103
- // Request
104
- {
105
- "records": [
106
- { "id": "doc1", "content": "...", "metadata": {} },
107
- { "id": "doc2", "content": "...", "metadata": {} }
108
- ],
109
- "arena": "general"
110
- }
111
-
112
- // Response
113
- {
114
- "status": "ok",
115
- "inserted": 2,
116
- "ids": ["doc1", "doc2"],
117
- "engine": { "l5": 2, "l6": 2 },
118
- "duration_ms": 234.5
119
- }
120
- ```
121
-
122
- 30-50× faster than calling `/store` N times when ingesting more than ~5 records.
123
-
124
- #### `POST /forget` (RESTORED — was in v0.4.x, removed in v0.5.x)
125
-
126
- ```json
127
- // Delete one record
128
- { "id": "doc1" }
129
-
130
- // Or delete all records matching a metadata filter
131
- { "metadata_contains": { "bench_tag": "test-run-12345" } }
132
-
133
- // Response
134
- { "deleted": 17, "engine": "pentatonic-memory-engine" }
135
- ```
136
-
137
- Required for: test pollution control, GDPR data deletion, multi-tenant isolation, bench harnesses.
138
-
139
- ### Data migration
140
-
141
- There is no automated dump-and-replay tool. Two paths:
142
-
143
- **Path A — Re-ingest from source.**
144
- If your Pentatonic deployment was populated from a known source (chat archives, document repository, TES events), re-run the ingestion against the engine. Use `/store-batch` for speed.
145
-
146
- **Path B — Dump-and-replay from Postgres.**
147
- If you only have the v0.5 Postgres database:
148
-
149
- ```bash
150
- # Dump as JSONL
151
- psql $DATABASE_URL -A -t -c \
152
- "SELECT json_build_object('id', id, 'content', content, 'metadata', metadata)::text
153
- FROM memory_nodes WHERE client_id = 'your-client'" \
154
- > export.jsonl
155
-
156
- # Replay against the engine
157
- python tools/replay.py export.jsonl --target http://your-engine-host:8099
158
- ```
159
-
160
- A `tools/replay.py` reference implementation lives under `tools/` in this package.
161
-
162
- ### What you lose
163
-
164
- - **The `metadata.hypothetical_queries` field stops being generated at ingest time.** The engine generates HyDE queries at SEARCH time instead, against the user's actual query (better matching, faster ingest).
165
- - **`metadata.distilled_from` atoms are no longer auto-generated.** If you were relying on the v0.5+ atomic-fact distillation behaviour, that's a feature of v0.5+ specifically — not a portable feature. The engine treats memories as canonical raw chunks. You can still run distillation as a separate post-processing step if needed.
166
-
167
- ### What you gain
168
-
169
- - ~5× retrieval accuracy on substring/exact-match benches (~17.6% → ~82.4% mean)
170
- - 30-50× faster bulk ingest via `/store-batch`
171
- - Restored `/forget` endpoint
172
- - Cross-encoder reranking on top-50
173
- - Knowledge-graph-aware retrieval (entity overlap signal)
174
- - Per-layer health visibility
175
-
176
- ### Rollback
177
-
178
- The engine doesn't write to your existing Postgres. Roll back by switching the env var back. No data lost.
@@ -1,375 +0,0 @@
1
- # pentatonic-memory-engine — AWS deployment runbook (v1)
2
-
3
- **Target:** single EC2 (`m6i.2xlarge`) in `us-east-1`, network-boundary auth via Cloudflare Tunnel.
4
- **Operator:** Phil Hauser (or anyone with `AdministratorAccess` to account `170649632502`).
5
- **Estimated time end-to-end:** ~45 minutes (mostly waiting for instance/volume provisioning).
6
-
7
- ---
8
-
9
- ## 0. Prerequisites
10
-
11
- Before starting, verify:
12
-
13
- ```bash
14
- aws sts get-caller-identity
15
- # Should return Account: 170649632502, AdministratorAccess role
16
-
17
- aws configure get region
18
- # us-east-1
19
- ```
20
-
21
- If region isn't set: `export AWS_REGION=us-east-1` for the rest of the session.
22
-
23
- You'll also need:
24
- - A **Cloudflare account** with access to the Pentatonic CF zone (for Tunnel setup)
25
- - The **`pentatonic-ai-gateway` API key** (from lambda.dev — should already exist)
26
-
27
- ---
28
-
29
- ## 1. Variables (paste once, reuse below)
30
-
31
- ```bash
32
- export AWS_REGION=us-east-1
33
- export ENV=prod
34
- export NAME=pme-${ENV}-us-east-1
35
- export INSTANCE_TYPE=m6i.2xlarge
36
- # Latest Ubuntu 22.04 LTS in us-east-1 (verify via aws ec2 describe-images if needed)
37
- export AMI_ID=$(aws ec2 describe-images \
38
- --owners 099720109477 \
39
- --filters "Name=name,Values=ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*" \
40
- --query 'Images | sort_by(@, &CreationDate) | [-1].ImageId' \
41
- --output text)
42
- echo "Using AMI: $AMI_ID"
43
- ```
44
-
45
- ---
46
-
47
- ## 2. Networking
48
-
49
- Use the default VPC for v1. (Multi-VPC isolation is a v2 concern.)
50
-
51
- ```bash
52
- export VPC_ID=$(aws ec2 describe-vpcs \
53
- --filters "Name=is-default,Values=true" \
54
- --query 'Vpcs[0].VpcId' --output text)
55
-
56
- export SUBNET_ID=$(aws ec2 describe-subnets \
57
- --filters "Name=vpc-id,Values=$VPC_ID" "Name=default-for-az,Values=true" \
58
- --query 'Subnets[0].SubnetId' --output text)
59
-
60
- echo "VPC=$VPC_ID Subnet=$SUBNET_ID"
61
- ```
62
-
63
- ### 2.1 Security group
64
-
65
- No public ingress. Outbound 443/80/53 for Tunnel + gateway + apt + DNS.
66
-
67
- ```bash
68
- export SG_ID=$(aws ec2 create-security-group \
69
- --group-name $NAME-sg \
70
- --description "pentatonic-memory-engine $ENV — outbound only; ingress via SSM" \
71
- --vpc-id $VPC_ID \
72
- --query 'GroupId' --output text)
73
-
74
- # Outbound is allowed by default. Strip default outbound and re-add explicitly.
75
- aws ec2 revoke-security-group-egress \
76
- --group-id $SG_ID \
77
- --ip-permissions '[{"IpProtocol":"-1","IpRanges":[{"CidrIp":"0.0.0.0/0"}]}]'
78
-
79
- aws ec2 authorize-security-group-egress --group-id $SG_ID \
80
- --ip-permissions '[
81
- {"IpProtocol":"tcp","FromPort":443,"ToPort":443,"IpRanges":[{"CidrIp":"0.0.0.0/0","Description":"HTTPS for tunnel + gateway + apt"}]},
82
- {"IpProtocol":"tcp","FromPort":80, "ToPort":80, "IpRanges":[{"CidrIp":"0.0.0.0/0","Description":"HTTP for apt fallback"}]},
83
- {"IpProtocol":"udp","FromPort":53, "ToPort":53, "IpRanges":[{"CidrIp":"0.0.0.0/0","Description":"DNS"}]},
84
- {"IpProtocol":"tcp","FromPort":53, "ToPort":53, "IpRanges":[{"CidrIp":"0.0.0.0/0","Description":"DNS-over-TCP"}]}
85
- ]'
86
-
87
- echo "SG=$SG_ID"
88
- ```
89
-
90
- **No inbound rule.** Ops access happens via SSM Session Manager (next step), not SSH.
91
-
92
- ---
93
-
94
- ## 3. IAM role for SSM Session Manager + EBS snapshot agent
95
-
96
- Lets you `aws ssm start-session` into the box without an SSH key.
97
-
98
- ```bash
99
- aws iam create-role --role-name $NAME-role \
100
- --assume-role-policy-document '{
101
- "Version":"2012-10-17",
102
- "Statement":[{"Effect":"Allow","Principal":{"Service":"ec2.amazonaws.com"},"Action":"sts:AssumeRole"}]
103
- }'
104
-
105
- aws iam attach-role-policy --role-name $NAME-role \
106
- --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
107
-
108
- aws iam create-instance-profile --instance-profile-name $NAME-profile
109
-
110
- aws iam add-role-to-instance-profile \
111
- --instance-profile-name $NAME-profile \
112
- --role-name $NAME-role
113
-
114
- # Wait for IAM eventual-consistency before launching EC2
115
- sleep 10
116
- ```
117
-
118
- ---
119
-
120
- ## 4. EBS volumes
121
-
122
- Five `gp3` volumes, 50 GiB each (resize online later if needed). One per layer's data dir.
123
-
124
- ```bash
125
- export AZ=$(aws ec2 describe-subnets --subnet-ids $SUBNET_ID \
126
- --query 'Subnets[0].AvailabilityZone' --output text)
127
-
128
- for layer in l2 l3 l4 l5 l6; do
129
- vol_id=$(aws ec2 create-volume \
130
- --availability-zone $AZ \
131
- --size 50 --volume-type gp3 \
132
- --tag-specifications "ResourceType=volume,Tags=[{Key=Name,Value=$NAME-$layer},{Key=pme-layer,Value=$layer}]" \
133
- --query 'VolumeId' --output text)
134
- echo "$layer = $vol_id"
135
- eval "export VOL_${layer}=$vol_id"
136
- done
137
-
138
- # Wait until all are 'available'
139
- aws ec2 wait volume-available --volume-ids $VOL_l2 $VOL_l3 $VOL_l4 $VOL_l5 $VOL_l6
140
- echo "All volumes available."
141
- ```
142
-
143
- ---
144
-
145
- ## 5. Launch the EC2
146
-
147
- ```bash
148
- # User data: format the EBS volumes on first boot, install docker, mount.
149
- cat > /tmp/userdata.sh <<'EOF'
150
- #!/bin/bash
151
- set -euxo pipefail
152
-
153
- apt-get update
154
- apt-get install -y docker.io docker-compose-v2 git xfsprogs
155
-
156
- # Best-effort wait for the EBS volumes to appear (they're attached by the AWS CLI just after launch). Labels only exist once step 6 formats the volumes, so on first boot this loop times out and acts as a settle delay.
157
- for layer in l2 l3 l4 l5 l6; do
158
- for i in {1..30}; do
159
- if [ -e /dev/disk/by-label/$layer ] || lsblk -no NAME,SERIAL | grep -q "$layer"; then
160
- break
161
- fi
162
- sleep 2
163
- done
164
- done
165
-
166
- # Find each volume by tag (we'll attach by device name below; this just creates mount points)
167
- mkdir -p /var/lib/pme/{l2,l3,l4,l5,l6}
168
-
169
- # Format + mount each — done manually via SSM in step 6 below
170
-
171
- systemctl enable --now docker
172
-
173
- # Pull engine repo
174
- cd /opt
175
- git clone https://github.com/Pentatonic-Ltd/memory_stack_updated.git engine
176
- chown -R ubuntu:ubuntu /opt/engine
177
- EOF
178
-
179
- export INSTANCE_ID=$(aws ec2 run-instances \
180
- --image-id $AMI_ID \
181
- --instance-type $INSTANCE_TYPE \
182
- --subnet-id $SUBNET_ID \
183
- --security-group-ids $SG_ID \
184
- --iam-instance-profile Name=$NAME-profile \
185
- --block-device-mappings 'DeviceName=/dev/sda1,Ebs={VolumeSize=30,VolumeType=gp3}' \
186
- --metadata-options 'HttpTokens=required,HttpEndpoint=enabled' \
187
- --tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=$NAME}]" \
188
- --user-data file:///tmp/userdata.sh \
189
- --query 'Instances[0].InstanceId' --output text)
190
-
191
- aws ec2 wait instance-running --instance-ids $INSTANCE_ID
192
- echo "Instance $INSTANCE_ID is running."
193
- ```
194
-
195
- ### 5.1 Attach EBS volumes
196
-
197
- ```bash
198
- aws ec2 attach-volume --volume-id $VOL_l2 --instance-id $INSTANCE_ID --device /dev/xvdf
199
- aws ec2 attach-volume --volume-id $VOL_l3 --instance-id $INSTANCE_ID --device /dev/xvdg
200
- aws ec2 attach-volume --volume-id $VOL_l4 --instance-id $INSTANCE_ID --device /dev/xvdh
201
- aws ec2 attach-volume --volume-id $VOL_l5 --instance-id $INSTANCE_ID --device /dev/xvdi
202
- aws ec2 attach-volume --volume-id $VOL_l6 --instance-id $INSTANCE_ID --device /dev/xvdj
203
-
204
- # Wait for all to attach
205
- for v in $VOL_l2 $VOL_l3 $VOL_l4 $VOL_l5 $VOL_l6; do
206
- aws ec2 wait volume-in-use --volume-ids $v
207
- done
208
- echo "All volumes attached."
209
- ```
210
-
211
- ---
212
-
213
- ## 6. Mount EBS volumes inside the EC2
214
-
215
- Connect via SSM Session Manager:
216
-
217
- ```bash
218
- aws ssm start-session --target $INSTANCE_ID
219
- ```
220
-
221
- Then inside the instance:
222
-
223
- ```bash
224
- # Format each volume (one-time)
225
- for pair in xvdf:l2 xvdg:l3 xvdh:l4 xvdi:l5 xvdj:l6; do
226
- dev=${pair%:*}; layer=${pair#*:}
227
- if ! sudo blkid /dev/$dev >/dev/null 2>&1; then
228
- sudo mkfs.xfs -L $layer /dev/$dev
229
- fi
230
- done
231
-
232
- # Add to /etc/fstab and mount
233
- for pair in xvdf:l2 xvdg:l3 xvdh:l4 xvdi:l5 xvdj:l6; do
234
- dev=${pair%:*}; layer=${pair#*:}
235
- uuid=$(sudo blkid -s UUID -o value /dev/$dev)
236
- sudo mkdir -p /var/lib/pme/$layer
237
- echo "UUID=$uuid /var/lib/pme/$layer xfs defaults,nofail 0 2" | sudo tee -a /etc/fstab
238
- done
239
-
240
- sudo systemctl daemon-reload
241
- sudo mount -a
242
- df -h /var/lib/pme/*
243
- # All five should show ~50G mounted, 49G available.
244
- ```
245
-
246
- ---
247
-
248
- ## 7. Cloudflare Tunnel setup
249
-
250
- In the Cloudflare dashboard:
251
-
252
- 1. **Zero Trust → Networks → Tunnels → Create a tunnel** (Cloudflared connector type)
253
- 2. Name: `engine-prod-us-east-1`
254
- 3. Save → copy the **tunnel token** (the `eyJ...` string).
255
- 4. **Public hostnames** tab → Add:
256
- - Subdomain: `engine`
257
- - Domain: `pentatonic.internal` (or whatever internal CF zone you use)
258
- - Type: HTTP, URL: `compat:8099`
259
-
260
- Copy the tunnel token; you'll set it as `CLOUDFLARED_TUNNEL_TOKEN` in `.env` below.
261
-
262
- > The hostname is reachable only by Workers/services in the same Cloudflare account by default. If you want to lock down further, attach a **Cloudflare Access policy** requiring a service token on the hostname — then set the service-token header in TES Workers' fetch calls. Optional for v1; can layer on later.
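If you do layer on an Access policy later, the Worker side is a small change. A minimal sketch, using Cloudflare's documented service-token headers — the binding names and the helper itself are illustrative, not part of the SDK:

```javascript
// Sketch: fetch options for a request that must pass a Cloudflare Access
// service-token policy. The header names are Cloudflare's standard
// service-token headers; the credential arguments are placeholders.
function accessFetchOptions(clientId, clientSecret) {
  return {
    headers: {
      "CF-Access-Client-Id": clientId,
      "CF-Access-Client-Secret": clientSecret,
    },
  };
}

// In a TES Worker (env binding names are assumptions):
//   await fetch("https://engine.pentatonic.internal/health",
//     accessFetchOptions(env.CF_ACCESS_CLIENT_ID, env.CF_ACCESS_CLIENT_SECRET));
```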
263
-
264
- ---
265
-
266
- ## 8. Configure and bring up the engine
267
-
268
- Back in the SSM session on the EC2:
269
-
270
- ```bash
271
- cd /opt/engine
272
-
273
- # Pull the AWS overlay (PR'd separately to memory_stack_updated; for now copy it manually)
274
- # Once merged upstream, this file is part of the repo.
275
- sudo curl -fL -o docker-compose.aws.yml \
276
- https://raw.githubusercontent.com/Pentatonic-Ltd/memory_stack_updated/main/docker-compose.aws.yml
277
-
278
- # Generate Neo4j password
279
- NEO4J_PASSWORD=$(openssl rand -base64 24 | tr -d '/+=')
280
-
281
- # Write .env (substitute values)
282
- sudo tee .env >/dev/null <<EOF
283
- PME_PORT=8099
284
- # Confirm the exact embeddings URL with the gateway team.
- NV_EMBED_URL=https://gateway.pentatonic.ai/v1/embeddings
285
- PENTATONIC_AI_GATEWAY_KEY=<paste from secret store>
286
- CLOUDFLARED_TUNNEL_TOKEN=<paste from CF dashboard>
287
- NEO4J_PASSWORD=$NEO4J_PASSWORD
288
- EOF
289
-
290
- sudo chmod 600 .env
291
-
292
- # Bring up the stack
293
- sudo docker compose -f docker-compose.yml -f docker-compose.aws.yml up -d
294
- sudo docker compose ps
295
- ```
296
-
297
- First run pulls images (~3-5 min) and builds engine images (~10-15 min). Subsequent restarts are fast.
298
-
299
- ---
300
-
301
- ## 9. Smoke test
302
-
303
- From your laptop or any TES dev environment with access to the CF zone:
304
-
305
- ```bash
306
- curl -sf https://engine.pentatonic.internal/health | jq
307
- # Expected: {"status":"ok","layers":{"l0":"ok",...,"l6":"ok"},"engine":"pentatonic-memory-engine"}
308
-
309
- curl -sX POST https://engine.pentatonic.internal/store \
310
- -H "content-type: application/json" \
311
- -d '{"content":"hello from runbook smoke test","metadata":{"arena":"smoke"}}'
312
-
313
- curl -sX POST https://engine.pentatonic.internal/search \
314
- -H "content-type: application/json" \
315
- -d '{"query":"hello","limit":3,"min_score":0.001}' | jq
316
- ```
317
-
318
- If `/search` returns the row from `/store`, the stack works end to end.
319
-
320
- ---
321
-
322
- ## 10. AWS Backup
323
-
324
- ```bash
325
- # Tag all volumes for the backup plan
326
- for v in $VOL_l2 $VOL_l3 $VOL_l4 $VOL_l5 $VOL_l6; do
327
- aws ec2 create-tags --resources $v --tags Key=Backup,Value=daily
328
- done
329
-
330
- # Backup plan: nightly snapshot, 14-day retention.
331
- # Easiest: AWS Backup console → Plan → "DailyBackup14Day" → resource selection by tag Backup=daily.
332
- # Or via CLI — see https://docs.aws.amazon.com/aws-backup/latest/devguide/creating-a-backup-plan.html
333
- ```
334
-
335
- Run the restore drill at least once before going live: spin up a sibling instance, attach restored volumes, confirm engine comes back healthy.
336
-
337
- ---
338
-
339
- ## 11. CloudWatch alarms (recommended, not strictly v1)
340
-
341
- - EC2 instance status check failed → SNS alert
342
- - EBS volume usage > 80% → SNS alert
343
- - Engine `/health` failure (custom Lambda probe via the tunnel) → SNS alert
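The third alarm needs a small probe. A sketch of the Lambda side, assuming the `/health` payload shape from the smoke test in section 9 (SNS publishing elided):

```javascript
// Decide health from the /health payload: the engine and every layer must be "ok".
function isHealthy(body) {
  if (!body || body.status !== "ok" || !body.layers) return false;
  return Object.values(body.layers).every((s) => s === "ok");
}

// Lambda entry point. On failure, publish to SNS here
// (e.g. @aws-sdk/client-sns PublishCommand).
async function handler() {
  let healthy = false;
  try {
    const res = await fetch("https://engine.pentatonic.internal/health");
    healthy = res.ok && isHealthy(await res.json());
  } catch {
    healthy = false;
  }
  return { healthy };
}
```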
344
-
345
- ---
346
-
347
- ## 12. Resource summary
348
-
349
- | Resource | Identifier (filled at runtime) |
350
- |---|---|
351
- | Instance | `$INSTANCE_ID` (m6i.2xlarge) |
352
- | VPC / Subnet | `$VPC_ID` / `$SUBNET_ID` |
353
- | Security group | `$SG_ID` |
354
- | IAM role / profile | `$NAME-role` / `$NAME-profile` |
355
- | EBS volumes | `$VOL_l2 $VOL_l3 $VOL_l4 $VOL_l5 $VOL_l6` (50 GiB gp3 each) |
356
- | Cloudflare Tunnel | `engine-prod-us-east-1` → `engine.pentatonic.internal` |
357
-
358
- Estimated v1 cost: **~$340/mo on-demand** (instance) + **~$20/mo** (5×50 GiB gp3) + AWS Backup snapshots (~$5-10/mo at 14-day retention) + data transfer (negligible from CF Tunnel).
359
-
360
- ---
361
-
362
- ## Teardown (if you need to recreate)
363
-
364
- ```bash
365
- aws ec2 terminate-instances --instance-ids $INSTANCE_ID
366
- aws ec2 wait instance-terminated --instance-ids $INSTANCE_ID
367
- for v in $VOL_l2 $VOL_l3 $VOL_l4 $VOL_l5 $VOL_l6; do
368
- aws ec2 delete-volume --volume-id $v
369
- done
370
- aws ec2 delete-security-group --group-id $SG_ID
371
- aws iam remove-role-from-instance-profile --instance-profile-name $NAME-profile --role-name $NAME-role
372
- aws iam delete-instance-profile --instance-profile-name $NAME-profile
373
- aws iam detach-role-policy --role-name $NAME-role --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
374
- aws iam delete-role --role-name $NAME-role
375
- ```
@@ -1,138 +0,0 @@
1
- # Why `pentatonic-memory` v0.5.x underperforms on retrieval benchmarks
2
-
3
- This document explains the architectural reasons `pentatonic-memory` v0.5.x scores 17.6% on substring-graded retrieval benches. None of these are bugs — they are deliberate design decisions optimised for a different workload (chat-style fact recall over agent memory). They just happen to be the wrong defaults for general-purpose retrieval.
4
-
5
- The engine in this package addresses each one.
6
-
7
- ## 1. Atom boost wins over source
8
-
9
- ```js
10
- // pentatonic-memory v0.5.10/src/search.js
11
- const DEFAULT_WEIGHTS = {
12
- ...
13
- atomBoost: 0.15, // ← 15% boost for distilled atomic facts
14
- verbosityPenalty: 0.1, // ← penalty for long raw content
15
- };
16
- ```
17
-
18
- `distill.js` runs an LLM on every ingested memory and extracts "atomic facts." Those atoms are stored as separate rows linked back via `source_id`. Search then ranks atoms higher than their source via the boost.
19
-
20
- For chat-style queries ("what does Phil drink?") this works: the atom "Phil drinks cortado" is ranked above the raw turn "Yeah, oh hey Phil came over yesterday and he had a cortado…".
21
-
22
- For substring grading ("what was the price of thing-9001?") it backfires: the atom is "the user reported a sale event" and the raw "thing-9001 sold for $15.50 to buyer-42" gets dropped or outranked. The literal answer string is gone.
23
-
24
- **Engine default:** `atomBoost = 0`, `verbosityPenalty = 0`. Distillation is opt-in per query.
25
-
26
- ## 2. `dedupeBySource` removes the right answer
27
-
28
- ```js
29
- // pentatonic-memory v0.5.10/src/search.js, line 161
30
- if (opts.dedupeBySource !== false) {
31
- const atomSources = new Set(
32
- filtered.filter((r) => r.source_id).map((r) => r.source_id)
33
- );
34
- if (atomSources.size > 0) {
35
- filtered = filtered.filter((r) => !atomSources.has(r.id));
36
- }
37
- }
38
- ```
39
-
40
- When an atom matches, its raw source row is **dropped** from the results. The rationale is "the atom contains the relevant fact, so the source is redundant." For substring grading this is exactly wrong: the source contains the literal text the bench is looking for, while the atom is only a paraphrase.
41
-
42
- **Engine default:** return both atom and source. Caller can dedupe if they want to reduce token spend.
43
-
44
- ## 3. `minScore: 0.5` is too aggressive
45
-
46
- ```js
47
- const threshold = opts.minScore ?? 0.5;
48
- ```
49
-
50
- NV-Embed-v2 routinely produces cosine similarities of 0.30–0.45 for genuinely relevant chunks. The 0.5 default filters those out completely. The bench passes `min_score: 0.0001` to compensate, but real callers using SDK defaults silently lose recall.
51
-
52
- **Engine default:** `min_score: 0.001`. The CTE's relevance × recency × frequency formula handles ranking; let everything through and trust the ordering.
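The doc doesn't publish the CTE's exact weights, so the following is only an illustrative sketch of a relevance × recency × frequency composite; the decay constant and multipliers are made up for the example:

```javascript
// Illustrative composite score: relevance, gated by a recency decay and a
// log-damped frequency bonus. All constants here are assumptions.
function compositeScore({ relevance, ageDays, accessCount }) {
  const recency = Math.exp(-ageDays / 30);    // decays toward 0 over months
  const frequency = Math.log1p(accessCount);  // diminishing returns per access
  return relevance * (0.5 + 0.5 * recency) * (1 + 0.1 * frequency);
}
```

With a floor as low as `min_score: 0.001`, nearly everything survives the filter and ordering is left entirely to this kind of score.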
53
-
54
- ## 4. No `/forget` endpoint
55
-
56
- ```js
57
- // server.js routes:
58
- // POST /search
59
- // POST /store
60
- // GET /health
61
- // (no /forget, no /memories)
62
- ```
63
-
64
- v0.4.x had `/forget` and `/memories`. v0.5.x removed them. Without `/forget`:
65
- - Tests can't isolate runs (data accumulates across test suites)
66
- Benches pollute each other's namespaces (we observed v0.5.6 drop from 17.6% to 9.4% over 5 successive polluted runs)
67
- - GDPR data deletion requests require direct Postgres access
68
- - Multi-tenant deployments can't enforce tenant boundaries via the SDK alone
69
-
70
- **Engine:** restored `/forget` with `id` and `metadata_contains` filters.
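Call shapes for the restored endpoint might look like the following. The body fields come from the two filters named above; the exact wire format is an assumption:

```javascript
// Build a /forget request body from either filter. Shape is assumed from the
// two documented filters: a single id, or a metadata predicate.
function forgetBody(filter) {
  if (filter.id) return { id: filter.id };
  return { metadata_contains: filter.metadata_contains };
}

// Assumed usage:
//   await fetch("http://localhost:8099/forget", {
//     method: "POST",
//     headers: { "content-type": "application/json" },
//     body: JSON.stringify(forgetBody({ metadata_contains: { arena: "smoke" } })),
//   });
```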
71
-
72
- ## 5. No `/store-batch`
73
-
74
- Even though `ai.js` has an `embedBatch()` helper, the server only exposes single-record `/store`. Bulk ingest does N HTTP roundtrips, each with one synchronous embed call.
75
-
76
- For the bench harness, this means a 22-doc corpus takes ~25 minutes to ingest because every doc waits for an Ollama HyDE generation (60s default) plus an embed call.
77
-
78
- **Engine:** added `/store-batch`. One HTTP roundtrip, one batched embed call, one bulk INSERT. 30-50× faster on >5 records.
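The client side of the batching is trivial. A sketch, assuming the batch body is simply an array of the same records `/store` accepts (batch size 100 is an arbitrary choice here):

```javascript
// Split a corpus into /store-batch request bodies: one HTTP roundtrip per
// batch instead of one per record.
function batchRequests(records, batchSize = 100) {
  const batches = [];
  for (let i = 0; i < records.length; i += batchSize) {
    batches.push({ records: records.slice(i, i + batchSize) });
  }
  return batches;
}
```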
79
-
80
- ## 6. HyDE generated at INGEST time
81
-
82
- ```js
83
- // ingest.js — for every /store call:
84
- const hypothetical_queries = await llm.chat(/* generate 3-5 fake queries */);
85
- metadata.hypothetical_queries = hypothetical_queries;
86
- ```
87
-
88
- This adds a 60s LLM call to every ingest. Worse, the queries are generated against the *content*, not the user's actual query — so they tend to be generic ("what is the topic of this document"), not useful for matching at search time.
89
-
90
- **Engine:** HyDE runs at SEARCH time against the user's actual query. Each search generates 3 hypothetical answers, embeds each, runs vector search per embedding, and RRF-fuses the rank lists. Better matching, no ingest blocking.
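A sketch of that flow, with the LLM, embedder, vector search and fuser injected so nothing engine-internal is assumed (all function names are illustrative):

```javascript
// Search-time HyDE: generate hypothetical answers to the *user's* query,
// embed each, run vector search per embedding, then fuse the rank lists.
async function hydeSearch(query, { generateAnswers, embed, vectorSearch, fuse }) {
  const hypotheticals = await generateAnswers(query, 3);
  const rankLists = await Promise.all(
    hypotheticals.map(async (h) => vectorSearch(await embed(h)))
  );
  return fuse(rankLists);
}
```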
91
-
92
- ## 7. No content chunking
93
-
94
- v0.5.x stores a 10,000-token document as one row with one 4096-d embedding. The vector represents the *average* meaning of the document, washing out specific facts.
95
-
96
- **Engine:** chunks at ingest into ~200-500 token segments, each with its own embedding and `chunk_index`. Search returns chunks; downstream caller can hydrate the parent document if needed.
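A naive word-window version of the idea — the engine's real segmentation is token-based; this sketch only shows the shape of the output rows:

```javascript
// Split text into overlapping word windows, each tagged with a chunk_index.
// Window and overlap sizes are illustrative stand-ins for token counts.
function chunk(text, maxWords = 300, overlap = 50) {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  for (let start = 0; start < words.length; start += maxWords - overlap) {
    chunks.push({
      chunk_index: chunks.length,
      content: words.slice(start, start + maxWords).join(" "),
    });
    if (start + maxWords >= words.length) break; // last window reached the end
  }
  return chunks;
}
```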
97
-
98
- ## 8. No reranker
99
-
100
- v0.5.x's `search.js` returns top-K directly from the SQL CTE score. No second-pass reranker.
101
-
102
- **Engine:** L6 doc-store runs a `ms-marco-MiniLM-L-6-v2` cross-encoder over the top-50 from initial retrieval, then returns top-K. Substantially better precision on questions that need exact term matching after broad recall.
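The rerank pass itself is simple once the cross-encoder call is abstracted away. A sketch, with `scorePair` standing in for the ms-marco-MiniLM-L-6-v2 scorer:

```javascript
// Second-pass rerank: cross-encoder-score the top-50 candidates against the
// query, then return the best k. scorePair(query, doc) -> number is injected.
async function rerank(query, candidates, scorePair, k = 10) {
  const top50 = candidates.slice(0, 50);
  const scored = await Promise.all(
    top50.map(async (doc) => ({ doc, score: await scorePair(query, doc) }))
  );
  scored.sort((a, b) => b.score - a.score);
  return scored.slice(0, k).map((s) => s.doc);
}
```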
103
-
104
- ## 9. No graph / entity layer
105
-
106
- v0.5.x doesn't extract entities at ingest, doesn't build relationships, can't answer multi-hop questions ("who owns thing-X" → "find listings where X was sold" → "fetch buyer's contact").
107
-
108
- **Engine:** L3 Knowledge Graph (Neo4j Community) extracts entities at ingest, builds edges between co-occurring entities, and at search time boosts rows that mention the same entities as the query. Critical for the marketplace-ops and customer-support benches.
109
-
110
- ## 10. Single vector store, single embedding per row
111
-
112
- v0.5.x writes one row per memory with one embedding column in pgvector. The HNSW index doesn't work above 2000 dimensions, so 4096-d NV-Embed embeddings fall back to sequential scan. At >100k memories, that's >100ms per query.
113
-
114
- **Engine:** indexes the same content into multiple stores in parallel:
115
- - L0 BM25 (SQLite FTS5)
116
- - L4 sqlite-vec (small, in-process)
117
- - L5 Milvus (medium, dedicated)
118
- - L6 doc-store (with reranker)
119
- - L3 KG (relationship-pivoted)
120
-
121
- Search runs all five in parallel, RRF-fuses the rank lists, applies reranker on top-50. Different query types win on different layers — the fusion absorbs the strengths of each.
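The fusion step is standard Reciprocal Rank Fusion: each document scores the sum of 1/(k + rank) over every rank list it appears in. A sketch (k = 60 is the conventional constant from the original RRF paper; the engine's actual value isn't stated here):

```javascript
// Fuse per-layer rank lists with RRF. Input: arrays of document ids in rank
// order, one array per layer. Output: ids in fused order, best first.
function rrfFuse(rankLists, k = 60) {
  const scores = new Map();
  for (const list of rankLists) {
    list.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

A document that ranks mid-list on several layers beats one that tops a single layer, which is exactly the "fusion absorbs the strengths of each" behaviour described above.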
122
-
123
- ## Summary
124
-
125
- | Gap | Bench impact (estimated) | Fix complexity |
126
- |---|---|---|
127
- | 1. atomBoost +0.15 | -15-20pp | trivial (config flag) |
128
- | 2. dedupeBySource: true | -5-10pp | trivial (config flag) |
129
- | 3. minScore: 0.5 default | -3-8pp | trivial (config change) |
130
- | 4. No /forget | n/a but blocks tests | trivial (10 LOC) |
131
- | 5. No /store-batch | n/a but blocks bench (~25 min ingest) | low (50 LOC) |
132
- | 6. HyDE at ingest time | -5-10pp + 60s/store | medium (refactor) |
133
- | 7. No chunking | -5-15pp on long docs | medium (schema change) |
134
- | 8. No reranker | -5-10pp | medium (sidecar service) |
135
- | 9. No graph layer | -5-10pp on entity queries | high (new schema + extraction) |
136
- | 10. Single vector store | -10-20pp, latency at scale | high (parallel infrastructure) |
137
-
138
- This package addresses 1-10 simultaneously by routing through the 7-layer engine, recovering ~65pp of the gap.