@pentatonic-ai/ai-agent-sdk 0.7.0 → 0.7.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -161,22 +161,26 @@ If `/search` returns the row from `/store`, the engine is live.
161
161
 
162
162
  **Connect Claude Code**
163
163
 
164
- The `tes-memory` plugin's hooks already speak the engine's wire format. Two steps:
164
+ The `tes-memory` plugin's hooks already speak the engine's wire format. Three steps:
165
165
 
166
166
  1. Install the plugin (once):
167
167
  ```
168
168
  /plugin marketplace add Pentatonic-Ltd/ai-agent-sdk
169
169
  /plugin install tes-memory@pentatonic-ai
170
170
  ```
171
- 2. Point it at your local engine. Edit `~/.claude-pentatonic/tes-memory.local.md` (create if missing):
172
- ```yaml
173
- ---
174
- mode: local
175
- memory_url: http://localhost:8099
176
- ---
171
+ 2. Point it at your local engine. One command writes the plugin config:
172
+ ```bash
173
+ npx @pentatonic-ai/ai-agent-sdk config local
177
174
  ```
175
+ This writes `~/.claude-pentatonic/tes-memory.local.md` with `mode: local` and `memory_url: http://localhost:8099`. If you want a different URL, pass `--engine-url <url>`. To switch back to hosted later, run `npx @pentatonic-ai/ai-agent-sdk config hosted` (delegates to `login`).
178
176
  3. Reload: `/reload-plugins` (or restart Claude Code if status reports stale state — MCP server processes need a full restart to pick up plugin updates).
179
177
 
178
+ Inspect what's currently configured at any time:
179
+
180
+ ```bash
181
+ npx @pentatonic-ai/ai-agent-sdk config show
182
+ ```
183
+
180
184
  Verify:
181
185
 
182
186
  ```
@@ -307,22 +311,26 @@ Works with both local and hosted memory. Install once, switch modes via config.
307
311
  /plugin install tes-memory@pentatonic-ai
308
312
  ```
309
313
 
310
- **Local engine** — bring up the engine first ([Memory > Local](#local-self-hosted)), then point the plugin at it. Edit `~/.claude-pentatonic/tes-memory.local.md`:
314
+ **Local engine** — bring up the engine first ([Memory > Local](#local-self-hosted)), then write the plugin config:
311
315
 
312
- ```yaml
313
- ---
314
- mode: local
315
- memory_url: http://localhost:8099
316
- ---
316
+ ```bash
317
+ npx @pentatonic-ai/ai-agent-sdk config local
317
318
  ```
318
319
 
319
320
  **Hosted TES** — run `login` once, the plugin auto-discovers `~/.config/tes/credentials.json`:
320
321
 
321
322
  ```bash
322
323
  npx @pentatonic-ai/ai-agent-sdk login
324
+ # equivalent: npx @pentatonic-ai/ai-agent-sdk config hosted
325
+ ```
326
+
327
+ Either way, verify with `/tes-memory:tes-status` in Claude Code, or from the shell:
328
+
329
+ ```bash
330
+ npx @pentatonic-ai/ai-agent-sdk config show
323
331
  ```
324
332
 
325
- Either way, verify with `/tes-memory:tes-status` in Claude Code. The plugin's MCP server, hooks, and tools all read the same config.
333
+ The plugin's MCP server, hooks, and tools all read the same config — switching modes is a single CLI call away.
326
334
 
327
335
  **What it tracks (auto, every turn):**
328
336
  - Memory search at prompt time — relevant memories injected as context
@@ -179,9 +179,16 @@ export async function runLoginCommand(opts = {}) {
179
179
  log(` ✓ Connected as ${claims.email || "user"} on tenant \`${clientId}\``);
180
180
  log(` ✓ Credentials written to ~/.config/tes/credentials.json`);
181
181
  log("");
182
- log(" Claude Code's tes-memory plugin and the OpenClaw pentatonic-memory");
183
- log(" plugin will pick these credentials up automatically — restart them");
184
- log(" if they're already running.");
182
+ log(" Install the Pentatonic TES plugin to start capturing context:");
183
+ log("");
184
+ log(" Claude Code:");
185
+ log(" /plugin marketplace add Pentatonic-Ltd/ai-agent-sdk");
186
+ log(" /plugin install tes-memory@pentatonic-ai");
187
+ log("");
188
+ log(" OpenClaw:");
189
+ log(" openclaw plugins install @pentatonic-ai/openclaw-memory-plugin");
190
+ log("");
191
+ log(" Already installed the plugin? Reload now to refresh the credentials.");
185
192
  log("");
186
193
 
187
194
  return { exitCode: 0, clientId };
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@pentatonic-ai/ai-agent-sdk",
3
- "version": "0.7.0",
3
+ "version": "0.7.2",
4
4
  "description": "TES SDK — LLM observability and lifecycle tracking via Pentatonic Thing Event System. Track token usage, tool calls, and conversations. Manage things through event-sourced lifecycle stages with AI enrichment and vector search.",
5
5
  "type": "module",
6
6
  "main": "./dist/index.cjs",
@@ -85,6 +85,19 @@ class SearchRequest(BaseModel):
85
85
  query: str
86
86
  limit: Optional[int] = 10
87
87
  min_score: Optional[float] = 0.001
88
+ # Tenant scope. Required for multi-tenant deployments. Forwarded to
89
+ # layers that support arena filtering natively (L6); applied as a
90
+ # post-filter on the shim for layers that don't yet (L2, L4, L5).
91
+ # When unset, search is global — same behaviour as v0.7.x; safe for
92
+ # single-tenant deployments. Multi-tenant callers MUST set this.
93
+ arena: Optional[str] = None
94
+ # Arbitrary metadata equality filters, applied as a post-filter on
95
+ # the shim. Useful for `kind`, `layer_type`, `source_repo`, etc.
96
+ # Keys not present on a result's metadata are treated as no-match.
97
+ # Each pair is exact string equality. Engine doesn't currently
98
+ # forward these to underlying stores, so over-fetch happens; the
99
+ # shim trims to the requested limit after filtering.
100
+ metadata_filter: Optional[dict[str, Any]] = None
88
101
 
89
102
 
90
103
  class ForgetRequest(BaseModel):
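For illustration, a multi-tenant caller's request body might look like the sketch below. Field names mirror the `SearchRequest` model above; all values (tenant name, repo, query) are hypothetical:

```python
# Hypothetical multi-tenant request body for POST /search.
# `arena` scopes results to one tenant; each `metadata_filter` pair is
# an exact string-equality match applied as a post-filter on the shim.
payload = {
    "query": "deployment runbook",
    "limit": 10,
    "min_score": 0.001,
    "arena": "tenant-acme",
    "metadata_filter": {"kind": "doc", "source_repo": "infra-notes"},
}
print(payload["arena"])  # tenant-acme
```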
@@ -424,6 +437,51 @@ async def store_batch(req: StoreBatchRequest):
424
437
  }
425
438
 
426
439
 
440
+ def _apply_metadata_filters(results: list[dict[str, Any]], req: SearchRequest) -> list[dict[str, Any]]:
441
+ """Post-filter results by arena + arbitrary metadata equality.
442
+
443
+ Many layer searches don't yet honour arena/metadata at the storage
444
+ level, so the shim enforces tenant isolation here as defence in
445
+ depth. Even if the underlying layer leaks across arenas, the shim
446
+ drops cross-tenant rows before returning.
447
+ """
448
+ arena = req.arena
449
+ extra = req.metadata_filter or {}
450
+ if not arena and not extra:
451
+ return results
452
+ out: list[dict[str, Any]] = []
453
+ for item in results:
454
+ meta = item.get("metadata") or {}
455
+ if arena:
456
+ row_arena = meta.get("arena") or item.get("arena")
457
+ if row_arena and row_arena != arena:
458
+ continue
459
+ # If row has no arena tag at all, drop on multi-tenant
460
+ # safety: a row without arena predates the multi-tenant
461
+ # plumbing and could belong to anyone.
462
+ if arena and not row_arena:
463
+ continue
464
+ ok = True
465
+ for k, v in extra.items():
466
+ if str(meta.get(k, "")) != str(v):
467
+ ok = False
468
+ break
469
+ if ok:
470
+ out.append(item)
471
+ return out
472
+
473
+
474
+ def _search_overfetch(req: SearchRequest) -> int:
475
+ """Decide how many results to over-fetch from layers.
476
+
477
+ Post-filtering can drop many rows; we ask layers for more than the
478
+ user's limit so we have headroom after filtering. 5x is a balance
479
+ between accuracy and latency; the no-filter baseline stays at 3x.
480
+ """
481
+ base = req.limit or 10
482
+ return base * 5 if (req.arena or req.metadata_filter) else base * 3
483
+
484
+
427
485
  @app.post("/search")
428
486
  async def search(req: SearchRequest):
429
487
  """
@@ -431,6 +489,12 @@ async def search(req: SearchRequest):
431
489
  queries L0 BM25, L4 vec, L5 Milvus, L6 doc-store in parallel and fuses
432
490
  the results with Reciprocal Rank Fusion. L3 KG adds entity-aware
433
491
  boosting for graph queries.
492
+
493
+ Multi-tenancy: pass `arena` to scope results to a single tenant.
494
+ Underlying layers may or may not honour arena natively (L6 does;
495
+ L2/L4/L5 don't yet — engine TODO); the shim applies arena as a
496
+ post-filter regardless, so cross-tenant leakage is prevented even
497
+ when a layer is non-compliant.
434
498
  """
435
499
  if not req.query:
436
500
  return {"results": []}
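For readers unfamiliar with the fusion step: Reciprocal Rank Fusion scores each document by summing `1/(k + rank)` over its rank in each layer's result list. A toy sketch, assuming `k=60` (matching the `RRF_K = 60` constant visible in the L6 layer; the engine's real fusion also tracks per-layer provenance and reranks):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    # rankings: one ranked list of doc ids per layer, best first.
    scores: dict[str, float] = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


# "b" ranks well in both layers, so it outranks "a", which only one
# layer returned at all.
print(rrf_fuse([["a", "b", "c"], ["b", "c"]]))  # ['b', 'c', 'a']
```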
@@ -452,10 +516,19 @@ async def search(req: SearchRequest):
452
516
  import asyncio
453
517
  async def _q_l6(query: str):
454
518
  try:
519
+ params: dict[str, Any] = {
520
+ "q": query,
521
+ "limit": _search_overfetch(req),
522
+ "method": "hybrid",
523
+ }
524
+ if req.arena:
525
+ # L6 supports arena natively (l6-document-store.py:837).
526
+ # Forward it so the underlying Milvus query and FTS
527
+ # query both filter to this tenant before returning.
528
+ params["arena"] = req.arena
455
529
  r = await _client().get(
456
530
  f"{L6_DOC_URL}/search",
457
- params={"q": query, "limit": (req.limit or 10) * 3,
458
- "method": "hybrid"},
531
+ params=params,
459
532
  timeout=30.0,
460
533
  )
461
534
  r.raise_for_status()
@@ -544,11 +617,14 @@ async def search(req: SearchRequest):
544
617
  "source": item.get("source_file") or item.get("path") or "",
545
618
  "engine_layer": "+".join(sorted(set(layer_provenance.get(key, [])))),
546
619
  })
547
- return {"results": out_results}
620
+ # Defense-in-depth post-filter (arena + arbitrary metadata),
621
+ # then trim to the requested limit.
622
+ out_results = _apply_metadata_filters(out_results, req)
623
+ return {"results": out_results[: req.limit or 10]}
548
624
  try:
549
625
  r = await _client().get(
550
626
  f"{L2_PROXY_URL}/search",
551
- params={"q": req.query, "limit": req.limit or 10},
627
+ params={"q": req.query, "limit": _search_overfetch(req)},
552
628
  timeout=30.0,
553
629
  )
554
630
  r.raise_for_status()
@@ -558,7 +634,7 @@ async def search(req: SearchRequest):
558
634
  try:
559
635
  r = await _client().post(
560
636
  f"{L2_PROXY_URL}/v1/search",
561
- json={"query": req.query, "limit": req.limit or 10,
637
+ json={"query": req.query, "limit": _search_overfetch(req),
562
638
  "min_score": req.min_score or 0.001},
563
639
  timeout=30.0,
564
640
  )
@@ -567,9 +643,14 @@ async def search(req: SearchRequest):
567
643
  except Exception as exc2:
568
644
  last_err = exc2
569
645
  try:
646
+ params: dict[str, Any] = {"q": req.query, "limit": _search_overfetch(req)}
647
+ # L6 supports arena natively; forward it on the
648
+ # last-resort fallback path too.
649
+ if req.arena:
650
+ params["arena"] = req.arena
570
651
  r = await _client().get(
571
652
  f"{L6_DOC_URL}/search",
572
- params={"q": req.query, "limit": req.limit or 10},
653
+ params=params,
573
654
  timeout=10.0,
574
655
  )
575
656
  r.raise_for_status()
@@ -621,7 +702,10 @@ async def search(req: SearchRequest):
621
702
  "source": item.get("source", item.get("source_file", "")),
622
703
  "engine_layer": item.get("layer", item.get("source_layer", "")),
623
704
  })
624
- return {"results": out_results}
705
+ # Defense-in-depth post-filter (arena + arbitrary metadata) on L2/L6
706
+ # fallback paths. Same logic as the BYPASS branch above.
707
+ out_results = _apply_metadata_filters(out_results, req)
708
+ return {"results": out_results[: req.limit or 10]}
625
709
 
626
710
 
627
711
  @app.post("/forget")
@@ -46,8 +46,6 @@ EMBED_MODEL_NAME = os.environ.get("L4_EMBED_MODEL", "nv-embed-v2")
46
46
  EMBED_API_KEY = os.environ.get("L4_EMBED_API_KEY", "")
47
47
  EMBED_DIM = int(os.environ.get("L4_EMBED_DIM", "4096"))
48
48
 
49
- def _embed_headers() -> dict:
50
- return {"Authorization": f"Bearer {EMBED_API_KEY}"} if EMBED_API_KEY else {}
51
49
 
52
50
 
53
51
  # ----------------------------------------------------------------------
@@ -109,16 +107,48 @@ def _client() -> httpx.AsyncClient:
109
107
 
110
108
 
111
109
  async def _embed_batch(texts: list[str]) -> list[list[float]]:
110
+ """Embed a batch of texts.
111
+
112
+ Tries OpenAI-compatible shape first (POST <url>, Bearer auth,
113
+ response data[i].embedding). On failure, falls back to the
114
+ Pentatonic-AI gateway's native shape (POST .../v1/embed, X-API-Key
115
+ auth, response embeddings[i]). When the gateway eventually adds an
116
+ OpenAI-compat /v1/embeddings alias, the primary path will succeed
117
+ and the fallback will never fire — no code change needed.
118
+ """
112
119
  if not texts:
113
120
  return []
121
+ payload = {"input": texts, "model": EMBED_MODEL_NAME}
122
+ # Primary: OpenAI-compat
123
+ try:
124
+ resp = await _client().post(
125
+ NV_EMBED_URL,
126
+ headers=_openai_headers(),
127
+ json=payload,
128
+ timeout=120.0,
129
+ )
130
+ resp.raise_for_status()
131
+ return [d["embedding"] for d in resp.json()["data"]]
132
+ except Exception:
133
+ pass
134
+ # Fallback: lambda-gateway native shape
135
+ fallback_url = NV_EMBED_URL.replace("/v1/embeddings", "/v1/embed").replace("/embeddings", "/embed")
114
136
  resp = await _client().post(
115
- NV_EMBED_URL,
116
- headers=_embed_headers(),
117
- json={"input": texts, "model": EMBED_MODEL_NAME},
137
+ fallback_url,
138
+ headers=_lambda_headers(),
139
+ json=payload,
118
140
  timeout=120.0,
119
141
  )
120
142
  resp.raise_for_status()
121
- return [d["embedding"] for d in resp.json()["data"]]
143
+ return resp.json()["embeddings"]
144
+
145
+
146
+ def _openai_headers() -> dict:
147
+ return {"Authorization": f"Bearer {EMBED_API_KEY}"} if EMBED_API_KEY else {}
148
+
149
+
150
+ def _lambda_headers() -> dict:
151
+ return {"X-API-Key": EMBED_API_KEY} if EMBED_API_KEY else {}
122
152
 
123
153
 
124
154
  # ----------------------------------------------------------------------
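The fallback path differs from the primary only in URL suffix and auth header. A quick standalone check of both derivations (the gateway URL and API key below are made-up examples, not real endpoints):

```python
NV_EMBED_URL = "https://gateway.example.com/v1/embeddings"  # hypothetical
EMBED_API_KEY = "sk-example"                                # hypothetical

# Primary (OpenAI-compat) vs fallback (lambda-gateway native), as in
# _embed_batch: same payload, different URL suffix and auth header.
# The chained replace handles both ".../v1/embeddings" and a bare
# ".../embeddings" suffix.
fallback_url = NV_EMBED_URL.replace("/v1/embeddings", "/v1/embed").replace("/embeddings", "/embed")
openai_headers = {"Authorization": f"Bearer {EMBED_API_KEY}"}
lambda_headers = {"X-API-Key": EMBED_API_KEY}

print(fallback_url)  # https://gateway.example.com/v1/embed
```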
@@ -51,8 +51,36 @@ EMBED_MODEL_NAME = os.environ.get("L5_EMBED_MODEL", "nv-embed-v2")
51
51
  # Optional Authorization: Bearer <key> for the primary embedding endpoint.
52
52
  EMBED_API_KEY = os.environ.get("L5_EMBED_API_KEY", "")
53
53
 
54
- def _embed_headers() -> dict:
55
- return {"Authorization": f"Bearer {EMBED_API_KEY}"} if EMBED_API_KEY else {}
54
+ def _embed_post(texts):
55
+ """POST to the configured embedding endpoint. Tries OpenAI-compat
56
+ shape first; falls back to Pentatonic-AI lambda-gateway native shape
57
+ on any failure. When the gateway adds a /v1/embeddings alias, the
58
+ primary path will succeed and the fallback never fires.
59
+
60
+ Returns: list[list[float]] (one embedding per input text).
61
+ """
62
+ payload = {"input": texts, "model": EMBED_MODEL_NAME}
63
+ try:
64
+ r = httpx.post(
65
+ NV_EMBED_URL,
66
+ headers={"Authorization": f"Bearer {EMBED_API_KEY}"} if EMBED_API_KEY else {},
67
+ json=payload,
68
+ timeout=120,
69
+ )
70
+ r.raise_for_status()
71
+ return [d["embedding"] for d in r.json()["data"]]
72
+ except Exception:
73
+ pass
74
+ fallback_url = NV_EMBED_URL.replace("/v1/embeddings", "/v1/embed").replace("/embeddings", "/embed")
75
+ r = httpx.post(
76
+ fallback_url,
77
+ headers={"X-API-Key": EMBED_API_KEY} if EMBED_API_KEY else {},
78
+ json=payload,
79
+ timeout=120,
80
+ )
81
+ r.raise_for_status()
82
+ return r.json()["embeddings"]
83
+
56
84
  # Ollama fallback path. URL/model can be overridden so the L5 container can
57
85
  # reach an Ollama instance running on the docker host (host.docker.internal)
58
86
  # or on a co-located service. Mirrors the env-var pattern used by L2.
@@ -99,10 +127,7 @@ def _embed_nv_batch(texts: list[str]) -> list[list[float]] | None:
99
127
  return []
100
128
  try:
101
129
  truncated = [t[:4000] for t in texts]
102
- r = httpx.post(NV_EMBED_URL, headers=_embed_headers(), json={"input": truncated, "model": EMBED_MODEL_NAME}, timeout=120)
103
- r.raise_for_status()
104
- data = r.json()
105
- embeddings = [item["embedding"] for item in data["data"]]
130
+ embeddings = _embed_post(truncated)
106
131
  if all(len(e) == EMBED_DIM for e in embeddings):
107
132
  return embeddings
108
133
  except Exception:
@@ -113,10 +138,8 @@ def _embed_nv_batch(texts: list[str]) -> list[list[float]] | None:
113
138
  def _embed_nv_single(text: str) -> list[float] | None:
114
139
  """Embed single text via NV-Embed-v2 (4096-dim)."""
115
140
  try:
116
- r = httpx.post(NV_EMBED_URL, headers=_embed_headers(), json={"input": text[:4000], "model": EMBED_MODEL_NAME}, timeout=15)
117
- r.raise_for_status()
118
- data = r.json()
119
- emb = data["data"][0]["embedding"]
141
+ embs = _embed_post([text[:4000]])
142
+ emb = embs[0]
120
143
  if len(emb) == EMBED_DIM:
121
144
  return emb
122
145
  except Exception:
@@ -573,12 +596,7 @@ def serve(port=8034):
573
596
  texts = [(r.get("text") or "")[:8192] for r in records]
574
597
  t0 = _time.time()
575
598
  try:
576
- resp = httpx.post(
577
- NV_EMBED_URL, headers=_embed_headers(), json={"input": texts, "model": EMBED_MODEL_NAME},
578
- timeout=120,
579
- )
580
- resp.raise_for_status()
581
- embs = [d["embedding"] for d in resp.json()["data"]]
599
+ embs = _embed_post(texts)
582
600
  except Exception as exc:
583
601
  return {"status": "error", "error": f"embed failed: {exc}"}
584
602
  embed_ms = (_time.time() - t0) * 1000.0
@@ -44,8 +44,33 @@ EMBED_DIM = int(os.environ.get("L6_EMBED_DIM", "4096"))
44
44
  # Optional Authorization: Bearer <key> for the embedding endpoint.
45
45
  EMBED_API_KEY = os.environ.get("L6_EMBED_API_KEY", "")
46
46
 
47
- def _embed_headers() -> dict:
48
- return {"Authorization": f"Bearer {EMBED_API_KEY}"} if EMBED_API_KEY else {}
47
+ def _embed_post(texts):
48
+ """POST to embedding endpoint. Tries OpenAI-compat shape first;
49
+ falls back to Pentatonic-AI lambda-gateway native shape on failure.
50
+ See L4 / L5 for the same pattern."""
51
+ import httpx as _httpx
52
+ payload = {"input": texts, "model": EMBED_MODEL}
53
+ try:
54
+ r = _httpx.post(
55
+ NV_EMBED_URL,
56
+ headers={"Authorization": f"Bearer {EMBED_API_KEY}"} if EMBED_API_KEY else {},
57
+ json=payload,
58
+ timeout=120,
59
+ )
60
+ r.raise_for_status()
61
+ return [d["embedding"] for d in r.json()["data"]]
62
+ except Exception:
63
+ pass
64
+ fallback_url = NV_EMBED_URL.replace("/v1/embeddings", "/v1/embed").replace("/embeddings", "/embed")
65
+ r = _httpx.post(
66
+ fallback_url,
67
+ headers={"X-API-Key": EMBED_API_KEY} if EMBED_API_KEY else {},
68
+ json=payload,
69
+ timeout=120,
70
+ )
71
+ r.raise_for_status()
72
+ return r.json()["embeddings"]
73
+
49
74
  COLLECTION_NAME = "documents"
50
75
  RRF_K = 60
51
76
  DEFAULT_PORT = 8037
@@ -874,16 +899,10 @@ def serve(port: int = DEFAULT_PORT):
874
899
 
875
900
  texts = [(r.get("text") or "")[:16000] for r in records]
876
901
 
877
- # Single batched NV-Embed call.
902
+ # Single batched embed call (OpenAI-compat first, lambda-gateway fallback).
878
903
  t0 = _time.time()
879
904
  try:
880
- resp = _httpx.post(
881
- NV_EMBED_URL, headers=_embed_headers(),
882
- json={"input": texts, "model": EMBED_MODEL},
883
- timeout=120,
884
- )
885
- resp.raise_for_status()
886
- embs = [d["embedding"] for d in resp.json()["data"]]
905
+ embs = _embed_post(texts)
887
906
  except Exception as exc:
888
907
  raise HTTPException(status_code=500, detail=f"embed failed: {exc}")
889
908
  embed_ms = (_time.time() - t0) * 1000.0
@@ -1,178 +0,0 @@
1
- # Migration Guide
2
-
3
- ## From `pentatonic-memory` v0.5.x → `pentatonic-memory-engine`
4
-
5
- ### TL;DR
6
-
7
- ```diff
8
- - export PENTATONIC_MEMORY_URL=http://your-pm-host:8099
9
- + export PENTATONIC_MEMORY_URL=http://your-engine-host:8099
10
- ```
11
-
12
- That's it. Same SDK, same code, same `/store` `/search` `/health` calls. Engine returns the same response shape with one optional addition (`engine_layer` field on results, naming which layer carried the hit — purely informational).
13
-
14
- ### Detailed wire-format compatibility
15
-
16
- #### `POST /store`
17
-
18
- Request:
19
- ```json
20
- { "content": "...", "metadata": { "key": "value" } }
21
- ```
22
-
23
- Response (v0.5.x):
24
- ```json
25
- { "id": "mem_abc...", "content": "...", "layerId": "ml_default_episodic" }
26
- ```
27
-
28
- Response (engine):
29
- ```json
30
- {
31
- "id": "abc...",
32
- "content": "...",
33
- "layerId": "ml_default_episodic",
34
- "engine": { "l5": 1, "l6": 1 } // ← new, optional
35
- }
36
- ```
37
-
38
- The `engine` field is informational only. Existing SDK clients that ignore unknown fields (the default for both Node.js and Python clients) work without modification.
39
-
40
- #### `POST /search`
41
-
42
- Request:
43
- ```json
44
- { "query": "...", "limit": 10, "min_score": 0.0001 }
45
- ```
46
-
47
- Response (v0.5.x):
48
- ```json
49
- {
50
- "results": [
51
- {
52
- "id": "mem_abc...", "content": "...", "metadata": {},
53
- "similarity": 0.81, "layer_id": "ml_default_episodic", "client_id": "default"
54
- }
55
- ]
56
- }
57
- ```
58
-
59
- Response (engine):
60
- ```json
61
- {
62
- "results": [
63
- {
64
- "id": "abc...", "content": "...", "metadata": {},
65
- "similarity": 0.81, "layer_id": "ml_default_episodic", "client_id": "default",
66
- "source": "doc1.md", // ← passes through engine's source_file
67
- "engine_layer": "L4 vec" // ← new, optional, names the winning layer
68
- }
69
- ]
70
- }
71
- ```
72
-
73
- #### `GET /health`
74
-
75
- Request: no body.
76
-
77
- Response (v0.5.x):
78
- ```json
79
- { "status": "ok", "client": "default", "version": "0.5.6", "memories": 249 }
80
- ```
81
-
82
- Response (engine):
83
- ```json
84
- {
85
- "status": "ok",
86
- "client": "default",
87
- "version": "0.1.0",
88
- "engine": "pentatonic-memory-engine",
89
- "layers": {
90
- "l0": "ok", "l1": "ok", "l2": "ok", "l3": "ok",
91
- "l4": "ok", "l5": "ok", "l6": "ok",
92
- "nv_embed": "ok"
93
- },
94
- "memories": 249
95
- }
96
- ```
97
-
98
- Reports per-layer status across all 7 layers of the `sequential-hybridrag-7-layer` engine.
99
-
100
- #### `POST /store-batch` (NEW — not in v0.5.x)
101
-
102
- ```json
103
- // Request
104
- {
105
- "records": [
106
- { "id": "doc1", "content": "...", "metadata": {} },
107
- { "id": "doc2", "content": "...", "metadata": {} }
108
- ],
109
- "arena": "general"
110
- }
111
-
112
- // Response
113
- {
114
- "status": "ok",
115
- "inserted": 2,
116
- "ids": ["doc1", "doc2"],
117
- "engine": { "l5": 2, "l6": 2 },
118
- "duration_ms": 234.5
119
- }
120
- ```
121
-
122
- 30-50× faster than calling `/store` N times when ingesting more than ~5 records.
123
-
124
- #### `POST /forget` (RESTORED — was in v0.4.x, removed in v0.5.x)
125
-
126
- ```json
127
- // Delete one record
128
- { "id": "doc1" }
129
-
130
- // Or delete all records matching a metadata filter
131
- { "metadata_contains": { "bench_tag": "test-run-12345" } }
132
-
133
- // Response
134
- { "deleted": 17, "engine": "pentatonic-memory-engine" }
135
- ```
136
-
137
- Required for: test pollution control, GDPR data deletion, multi-tenant isolation, bench harnesses.
138
-
139
- ### Data migration
140
-
141
- There is no automated dump-and-replay tool. Two paths:
142
-
143
- **Path A — Re-ingest from source.**
144
- If your Pentatonic deployment was populated from a known source (chat archives, document repository, TES events), re-run the ingestion against the engine. Use `/store-batch` for speed.
145
-
146
- **Path B — Dump-and-replay from Postgres.**
147
- If you only have the v0.5 Postgres database:
148
-
149
- ```bash
150
- # Dump as JSONL
151
- psql $DATABASE_URL -A -t -c \
152
- "SELECT json_build_object('id', id, 'content', content, 'metadata', metadata)::text
153
- FROM memory_nodes WHERE client_id = 'your-client'" \
154
- > export.jsonl
155
-
156
- # Replay against the engine
157
- python tools/replay.py export.jsonl --target http://your-engine-host:8099
158
- ```
159
-
160
- A `tools/replay.py` reference implementation lives under `tools/` in this package.
161
-
162
- ### What you lose
163
-
164
- - **The `metadata.hypothetical_queries` field stops being generated at ingest time.** The engine generates HyDE queries at SEARCH time instead, against the user's actual query (better matching, faster ingest).
165
- - **`metadata.distilled_from` atoms are no longer auto-generated.** If you were relying on the v0.5+ atomic-fact distillation behaviour, that's a feature of v0.5+ specifically — not a portable feature. The engine treats memories as canonical raw chunks. You can still run distillation as a separate post-processing step if needed.
166
-
167
- ### What you gain
168
-
169
- - ~5× retrieval accuracy on substring/exact-match benches (~17.6% → ~82.4% mean)
170
- - 30-50× faster bulk ingest via `/store-batch`
171
- - Restored `/forget` endpoint
172
- - Cross-encoder reranking on top-50
173
- - Knowledge-graph-aware retrieval (entity overlap signal)
174
- - Per-layer health visibility
175
-
176
- ### Rollback
177
-
178
- The engine doesn't write to your existing Postgres. Roll back by switching the env var back. No data lost.
@@ -1,375 +0,0 @@
1
- # pentatonic-memory-engine — AWS deployment runbook (v1)
2
-
3
- **Target:** single EC2 (`m6i.2xlarge`) in `us-east-1`, network-boundary auth via Cloudflare Tunnel.
4
- **Operator:** Phil Hauser (or anyone with `AdministratorAccess` to account `170649632502`).
5
- **Estimated time end-to-end:** ~45 minutes (mostly waiting for instance/volume provisioning).
6
-
7
- ---
8
-
9
- ## 0. Prerequisites
10
-
11
- Before starting, verify:
12
-
13
- ```bash
14
- aws sts get-caller-identity
15
- # Should return Account: 170649632502, AdministratorAccess role
16
-
17
- aws configure get region
18
- # us-east-1
19
- ```
20
-
21
- If region isn't set: `export AWS_REGION=us-east-1` for the rest of the session.
22
-
23
- You'll also need:
24
- - A **Cloudflare account** with access to the Pentatonic CF zone (for Tunnel setup)
25
- - The **`pentatonic-ai-gateway` API key** (from lambda.dev — should already exist)
26
-
27
- ---
28
-
29
- ## 1. Variables (paste once, reuse below)
30
-
31
- ```bash
32
- export AWS_REGION=us-east-1
33
- export ENV=prod
34
- export NAME=pme-${ENV}-us-east-1
35
- export INSTANCE_TYPE=m6i.2xlarge
36
- # Latest Ubuntu 22.04 LTS in us-east-1 (verify via aws ec2 describe-images if needed)
37
- export AMI_ID=$(aws ec2 describe-images \
38
- --owners 099720109477 \
39
- --filters "Name=name,Values=ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*" \
40
- --query 'Images | sort_by(@, &CreationDate) | [-1].ImageId' \
41
- --output text)
42
- echo "Using AMI: $AMI_ID"
43
- ```
44
-
45
- ---
46
-
47
- ## 2. Networking
48
-
49
- Use the default VPC for v1. (Multi-VPC isolation is a v2 concern.)
50
-
51
- ```bash
52
- export VPC_ID=$(aws ec2 describe-vpcs \
53
- --filters "Name=is-default,Values=true" \
54
- --query 'Vpcs[0].VpcId' --output text)
55
-
56
- export SUBNET_ID=$(aws ec2 describe-subnets \
57
- --filters "Name=vpc-id,Values=$VPC_ID" "Name=default-for-az,Values=true" \
58
- --query 'Subnets[0].SubnetId' --output text)
59
-
60
- echo "VPC=$VPC_ID Subnet=$SUBNET_ID"
61
- ```
62
-
63
- ### 2.1 Security group
64
-
65
- No public ingress. Outbound 443/80/53 for Tunnel + gateway + apt + DNS.
66
-
67
- ```bash
68
- export SG_ID=$(aws ec2 create-security-group \
69
- --group-name $NAME-sg \
70
- --description "pentatonic-memory-engine $ENV — outbound only; ingress via SSM" \
71
- --vpc-id $VPC_ID \
72
- --query 'GroupId' --output text)
73
-
74
- # Outbound is allowed by default. Strip default outbound and re-add explicitly.
75
- aws ec2 revoke-security-group-egress \
76
- --group-id $SG_ID \
77
- --ip-permissions '[{"IpProtocol":"-1","IpRanges":[{"CidrIp":"0.0.0.0/0"}]}]'
78
-
79
- aws ec2 authorize-security-group-egress --group-id $SG_ID \
80
- --ip-permissions '[
81
- {"IpProtocol":"tcp","FromPort":443,"ToPort":443,"IpRanges":[{"CidrIp":"0.0.0.0/0","Description":"HTTPS for tunnel + gateway + apt"}]},
82
- {"IpProtocol":"tcp","FromPort":80, "ToPort":80, "IpRanges":[{"CidrIp":"0.0.0.0/0","Description":"HTTP for apt fallback"}]},
83
- {"IpProtocol":"udp","FromPort":53, "ToPort":53, "IpRanges":[{"CidrIp":"0.0.0.0/0","Description":"DNS"}]},
84
- {"IpProtocol":"tcp","FromPort":53, "ToPort":53, "IpRanges":[{"CidrIp":"0.0.0.0/0","Description":"DNS-over-TCP"}]}
85
- ]'
86
-
87
- echo "SG=$SG_ID"
88
- ```
89
-
90
- **No inbound rule.** Ops access happens via SSM Session Manager (next step), not SSH.
91
-
92
- ---
93
-
94
- ## 3. IAM role for SSM Session Manager + EBS snapshot agent
95
-
96
- Lets you `aws ssm start-session` into the box without an SSH key.
97
-
98
- ```bash
99
- aws iam create-role --role-name $NAME-role \
100
- --assume-role-policy-document '{
101
- "Version":"2012-10-17",
102
- "Statement":[{"Effect":"Allow","Principal":{"Service":"ec2.amazonaws.com"},"Action":"sts:AssumeRole"}]
103
- }'
104
-
105
- aws iam attach-role-policy --role-name $NAME-role \
106
- --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
107
-
108
- aws iam create-instance-profile --instance-profile-name $NAME-profile
109
-
110
- aws iam add-role-to-instance-profile \
111
- --instance-profile-name $NAME-profile \
112
- --role-name $NAME-role
113
-
114
- # Wait for IAM eventual-consistency before launching EC2
115
- sleep 10
116
- ```
117
-
118
- ---
119
-
120
- ## 4. EBS volumes
121
-
122
- Five `gp3` volumes, 50 GiB each (resize online later if needed). One per layer's data dir.
123
-
124
- ```bash
125
- export AZ=$(aws ec2 describe-subnets --subnet-ids $SUBNET_ID \
126
- --query 'Subnets[0].AvailabilityZone' --output text)
127
-
128
- for layer in l2 l3 l4 l5 l6; do
129
- vol_id=$(aws ec2 create-volume \
130
- --availability-zone $AZ \
131
- --size 50 --volume-type gp3 \
132
- --tag-specifications "ResourceType=volume,Tags=[{Key=Name,Value=$NAME-$layer},{Key=pme-layer,Value=$layer}]" \
133
- --query 'VolumeId' --output text)
134
- echo "$layer = $vol_id"
135
- eval "export VOL_${layer}=$vol_id"
136
- done
137
-
138
- # Wait until all are 'available'
139
- aws ec2 wait volume-available --volume-ids $VOL_l2 $VOL_l3 $VOL_l4 $VOL_l5 $VOL_l6
140
- echo "All volumes available."
141
- ```
142
-
143
- ---
144
-
145
- ## 5. Launch the EC2
146
-
147
- ```bash
148
- # User data: format the EBS volumes on first boot, install docker, mount.
149
- cat > /tmp/userdata.sh <<'EOF'
150
- #!/bin/bash
151
- set -euxo pipefail
152
-
153
- apt-get update
154
- apt-get install -y docker.io docker-compose-v2 git xfsprogs
155
-
156
- # Best-effort wait for the EBS volumes to appear (they're attached by the AWS CLI just after launch). Labels only exist once step 6 formats the volumes, so on first boot this loop times out and acts as a settle delay.
157
- for layer in l2 l3 l4 l5 l6; do
158
- for i in {1..30}; do
159
- if [ -e /dev/disk/by-label/$layer ] || lsblk -no NAME,SERIAL | grep -q "$layer"; then
160
- break
161
- fi
162
- sleep 2
163
- done
164
- done
165
-
166
- # Find each volume by tag (we'll attach by device name below; this just creates mount points)
167
- mkdir -p /var/lib/pme/{l2,l3,l4,l5,l6}
168
-
169
- # Format + mount each — done manually via SSM in step 6 below
170
-
171
- systemctl enable --now docker
172
-
173
- # Pull engine repo
174
- cd /opt
175
- git clone https://github.com/Pentatonic-Ltd/memory_stack_updated.git engine
176
- chown -R ubuntu:ubuntu /opt/engine
177
- EOF
178
-
179
- export INSTANCE_ID=$(aws ec2 run-instances \
180
- --image-id $AMI_ID \
181
- --instance-type $INSTANCE_TYPE \
182
- --subnet-id $SUBNET_ID \
183
- --security-group-ids $SG_ID \
184
- --iam-instance-profile Name=$NAME-profile \
185
- --block-device-mappings 'DeviceName=/dev/sda1,Ebs={VolumeSize=30,VolumeType=gp3}' \
186
- --metadata-options 'HttpTokens=required,HttpEndpoint=enabled' \
187
- --tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=$NAME}]" \
188
- --user-data file:///tmp/userdata.sh \
189
- --query 'Instances[0].InstanceId' --output text)
190
-
191
- aws ec2 wait instance-running --instance-ids $INSTANCE_ID
192
- echo "Instance $INSTANCE_ID is running."
193
- ```
194
-
195
- ### 5.1 Attach EBS volumes
196
-
197
- ```bash
198
- aws ec2 attach-volume --volume-id $VOL_l2 --instance-id $INSTANCE_ID --device /dev/xvdf
199
- aws ec2 attach-volume --volume-id $VOL_l3 --instance-id $INSTANCE_ID --device /dev/xvdg
200
- aws ec2 attach-volume --volume-id $VOL_l4 --instance-id $INSTANCE_ID --device /dev/xvdh
201
- aws ec2 attach-volume --volume-id $VOL_l5 --instance-id $INSTANCE_ID --device /dev/xvdi
202
- aws ec2 attach-volume --volume-id $VOL_l6 --instance-id $INSTANCE_ID --device /dev/xvdj
203
-
204
- # Wait for all to attach
205
- for v in $VOL_l2 $VOL_l3 $VOL_l4 $VOL_l5 $VOL_l6; do
206
- aws ec2 wait volume-in-use --volume-ids $v
207
- done
208
- echo "All volumes attached."
209
- ```
210
-
211
- ---
212
-
213
- ## 6. Mount EBS volumes inside the EC2
214
-
215
- Connect via SSM Session Manager:
216
-
217
- ```bash
218
- aws ssm start-session --target $INSTANCE_ID
219
- ```
220
-
221
- Then inside the instance:
222
-
223
- ```bash
224
- # Format each volume (one-time)
225
- for pair in xvdf:l2 xvdg:l3 xvdh:l4 xvdi:l5 xvdj:l6; do
226
- dev=${pair%:*}; layer=${pair#*:}
227
- if ! sudo blkid /dev/$dev >/dev/null 2>&1; then
228
- sudo mkfs.xfs -L $layer /dev/$dev
229
- fi
230
- done
231
-
232
- # Add to /etc/fstab and mount
233
- for pair in xvdf:l2 xvdg:l3 xvdh:l4 xvdi:l5 xvdj:l6; do
234
- dev=${pair%:*}; layer=${pair#*:}
235
- uuid=$(sudo blkid -s UUID -o value /dev/$dev)
236
- sudo mkdir -p /var/lib/pme/$layer
237
- echo "UUID=$uuid /var/lib/pme/$layer xfs defaults,nofail 0 2" | sudo tee -a /etc/fstab
238
- done
239
-
240
- sudo systemctl daemon-reload
241
- sudo mount -a
242
- df -h /var/lib/pme/*
243
- # All five should show ~50G mounted, 49G available.
244
- ```
245
-
246
- ---
247
-
248
- ## 7. Cloudflare Tunnel setup
249
-
250
- In the Cloudflare dashboard:
251
-
252
- 1. **Zero Trust → Networks → Tunnels → Create a tunnel** (Cloudflared connector type)
253
- 2. Name: `engine-prod-us-east-1`
254
- 3. Save → copy the **tunnel token** (the `eyJ...` string).
255
- 4. **Public hostnames** tab → Add:
256
- - Subdomain: `engine`
257
- - Domain: `pentatonic.internal` (or whatever internal CF zone you use)
258
- - Type: HTTP, URL: `compat:8099`
259
-
260
- Copy the tunnel token; you'll set it as `CLOUDFLARED_TUNNEL_TOKEN` in `.env` below.
261
-
262
- > The hostname is reachable only by Workers/services in the same Cloudflare account by default. If you want to lock down further, attach a **Cloudflare Access policy** requiring a service token on the hostname — then set the service-token header in TES Workers' fetch calls. Optional for v1; can layer on later.
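If you do layer on an Access policy later, the Worker side is a small change. A minimal sketch, using Cloudflare's documented service-token headers — the binding names and the helper itself are illustrative, not part of the SDK:

```javascript
// Sketch: fetch options for a request that must pass a Cloudflare Access
// service-token policy. The header names are Cloudflare's standard
// service-token headers; the credential arguments are placeholders.
function accessFetchOptions(clientId, clientSecret) {
  return {
    headers: {
      "CF-Access-Client-Id": clientId,
      "CF-Access-Client-Secret": clientSecret,
    },
  };
}

// In a TES Worker (env binding names are assumptions):
//   await fetch("https://engine.pentatonic.internal/health",
//     accessFetchOptions(env.CF_ACCESS_CLIENT_ID, env.CF_ACCESS_CLIENT_SECRET));
```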
263
-
264
- ---
265
-
266
- ## 8. Configure and bring up the engine
267
-
268
- Back in the SSM session on the EC2:
269
-
270
- ```bash
271
- cd /opt/engine
272
-
273
- # Pull the AWS overlay (PR'd separately to memory_stack_updated; for now copy it manually)
274
- # Once merged upstream, this file is part of the repo.
275
- sudo curl -fL -o docker-compose.aws.yml \
276
- https://raw.githubusercontent.com/Pentatonic-Ltd/memory_stack_updated/main/docker-compose.aws.yml
277
-
278
- # Generate Neo4j password
279
- NEO4J_PASSWORD=$(openssl rand -base64 24 | tr -d '/+=')
280
-
281
- # Write .env (substitute values)
282
- sudo tee .env >/dev/null <<EOF
283
- PME_PORT=8099
284
- # Confirm the exact embeddings URL with the gateway team.
- NV_EMBED_URL=https://gateway.pentatonic.ai/v1/embeddings
285
- PENTATONIC_AI_GATEWAY_KEY=<paste from secret store>
286
- CLOUDFLARED_TUNNEL_TOKEN=<paste from CF dashboard>
287
- NEO4J_PASSWORD=$NEO4J_PASSWORD
288
- EOF
289
-
290
- sudo chmod 600 .env
291
-
292
- # Bring up the stack
293
- sudo docker compose -f docker-compose.yml -f docker-compose.aws.yml up -d
294
- sudo docker compose ps
295
- ```
296
-
297
- First run pulls images (~3-5 min) and builds engine images (~10-15 min). Subsequent restarts are fast.
298
-
299
- ---
300
-
301
- ## 9. Smoke test
302
-
303
- From your laptop or any TES dev environment with access to the CF zone:
304
-
305
- ```bash
306
- curl -sf https://engine.pentatonic.internal/health | jq
307
- # Expected: {"status":"ok","layers":{"l0":"ok",...,"l6":"ok"},"engine":"pentatonic-memory-engine"}
308
-
309
- curl -sX POST https://engine.pentatonic.internal/store \
310
- -H "content-type: application/json" \
311
- -d '{"content":"hello from runbook smoke test","metadata":{"arena":"smoke"}}'
312
-
313
- curl -sX POST https://engine.pentatonic.internal/search \
314
- -H "content-type: application/json" \
315
- -d '{"query":"hello","limit":3,"min_score":0.001}' | jq
316
- ```
317
-
318
- If `/search` returns the row from `/store`, the stack works end to end.
319
-
320
- ---
321
-
322
- ## 10. AWS Backup
323
-
324
- ```bash
325
- # Tag all volumes for the backup plan
326
- for v in $VOL_l2 $VOL_l3 $VOL_l4 $VOL_l5 $VOL_l6; do
327
- aws ec2 create-tags --resources $v --tags Key=Backup,Value=daily
328
- done
329
-
330
- # Backup plan: nightly snapshot, 14-day retention.
331
- # Easiest: AWS Backup console → Plan → "DailyBackup14Day" → resource selection by tag Backup=daily.
332
- # Or via CLI — see https://docs.aws.amazon.com/aws-backup/latest/devguide/creating-a-backup-plan.html
333
- ```
334
-
335
- Run the restore drill at least once before going live: spin up a sibling instance, attach restored volumes, confirm engine comes back healthy.
336
-
337
- ---
338
-
339
- ## 11. CloudWatch alarms (recommended, not strictly v1)
340
-
341
- - EC2 instance status check failed → SNS alert
342
- - EBS volume usage > 80% → SNS alert
343
- - Engine `/health` failure (custom Lambda probe via the tunnel) → SNS alert
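The third alarm needs a small probe. A sketch of the Lambda side, assuming the `/health` payload shape from the smoke test in section 9 (SNS publishing elided):

```javascript
// Decide health from the /health payload: the engine and every layer must be "ok".
function isHealthy(body) {
  if (!body || body.status !== "ok" || !body.layers) return false;
  return Object.values(body.layers).every((s) => s === "ok");
}

// Lambda entry point. On failure, publish to SNS here
// (e.g. @aws-sdk/client-sns PublishCommand).
async function handler() {
  let healthy = false;
  try {
    const res = await fetch("https://engine.pentatonic.internal/health");
    healthy = res.ok && isHealthy(await res.json());
  } catch {
    healthy = false;
  }
  return { healthy };
}
```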
344
-
345
- ---
346
-
347
- ## 12. Resource summary
348
-
349
- | Resource | Identifier (filled at runtime) |
350
- |---|---|
351
- | Instance | `$INSTANCE_ID` (m6i.2xlarge) |
352
- | VPC / Subnet | `$VPC_ID` / `$SUBNET_ID` |
353
- | Security group | `$SG_ID` |
354
- | IAM role / profile | `$NAME-role` / `$NAME-profile` |
355
- | EBS volumes | `$VOL_l2 $VOL_l3 $VOL_l4 $VOL_l5 $VOL_l6` (50 GiB gp3 each) |
356
- | Cloudflare Tunnel | `engine-prod-us-east-1` → `engine.pentatonic.internal` |
357
-
358
- Estimated v1 cost: **~$340/mo on-demand** (instance) + **~$20/mo** (5×50 GiB gp3) + AWS Backup snapshots (~$5-10/mo at 14-day retention) + data transfer (negligible from CF Tunnel).
359
-
360
- ---
361
-
362
- ## Teardown (if you need to recreate)
363
-
364
- ```bash
365
- aws ec2 terminate-instances --instance-ids $INSTANCE_ID
366
- aws ec2 wait instance-terminated --instance-ids $INSTANCE_ID
367
- for v in $VOL_l2 $VOL_l3 $VOL_l4 $VOL_l5 $VOL_l6; do
368
- aws ec2 delete-volume --volume-id $v
369
- done
370
- aws ec2 delete-security-group --group-id $SG_ID
371
- aws iam remove-role-from-instance-profile --instance-profile-name $NAME-profile --role-name $NAME-role
372
- aws iam delete-instance-profile --instance-profile-name $NAME-profile
373
- aws iam detach-role-policy --role-name $NAME-role --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
374
- aws iam delete-role --role-name $NAME-role
375
- ```
@@ -1,138 +0,0 @@
1
- # Why `pentatonic-memory` v0.5.x underperforms on retrieval benchmarks
2
-
3
- This document explains the architectural reasons `pentatonic-memory` v0.5.x scores 17.6% on substring-graded retrieval benches. None of these are bugs — they are deliberate design decisions optimised for a different workload (chat-style fact recall over agent memory). They just happen to be the wrong defaults for general-purpose retrieval.
4
-
5
- The engine in this package addresses each one.
6
-
7
- ## 1. Atom boost wins over source
8
-
9
- ```js
10
- // pentatonic-memory v0.5.10/src/search.js
11
- const DEFAULT_WEIGHTS = {
12
- ...
13
- atomBoost: 0.15, // ← 15% boost for distilled atomic facts
14
- verbosityPenalty: 0.1, // ← penalty for long raw content
15
- };
16
- ```
17
-
18
- `distill.js` runs an LLM on every ingested memory and extracts "atomic facts." Those atoms are stored as separate rows linked back via `source_id`. Search then ranks atoms higher than their source via the boost.
19
-
20
- For chat-style queries ("what does Phil drink?") this works: the atom "Phil drinks cortado" is ranked above the raw turn "Yeah, oh hey Phil came over yesterday and he had a cortado…".
21
-
22
- For substring grading ("what was the price of thing-9001?") it backfires: the atom is "the user reported a sale event" and the raw "thing-9001 sold for $15.50 to buyer-42" gets dropped or outranked. The literal answer string is gone.
23
-
24
- **Engine default:** `atomBoost = 0`, `verbosityPenalty = 0`. Distillation is opt-in per query.
25
-
26
- ## 2. `dedupeBySource` removes the right answer
27
-
28
- ```js
29
- // pentatonic-memory v0.5.10/src/search.js, line 161
30
- if (opts.dedupeBySource !== false) {
31
- const atomSources = new Set(
32
- filtered.filter((r) => r.source_id).map((r) => r.source_id)
33
- );
34
- if (atomSources.size > 0) {
35
- filtered = filtered.filter((r) => !atomSources.has(r.id));
36
- }
37
- }
38
- ```
39
-
40
- When an atom matches, its raw source row is **dropped** from the results. The rationale is "the atom contains the relevant fact, so the source is redundant." For substring grading this is exactly wrong: the source contains the literal text the bench is looking for, while the atom is only a paraphrase.
41
-
42
- **Engine default:** return both atom and source. Caller can dedupe if they want to reduce token spend.
43
-
44
- ## 3. `minScore: 0.5` is too aggressive
45
-
46
- ```js
47
- const threshold = opts.minScore ?? 0.5;
48
- ```
49
-
50
- NV-Embed-v2 routinely produces cosine similarities of 0.30–0.45 for genuinely relevant chunks. The 0.5 default filters those out completely. The bench passes `min_score: 0.0001` to compensate, but real callers using SDK defaults silently lose recall.
51
-
52
- **Engine default:** `min_score: 0.001`. The CTE's relevance × recency × frequency formula handles ranking; let everything through and trust the ordering.
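The doc doesn't publish the CTE's exact weights, so the following is only an illustrative sketch of a relevance × recency × frequency composite; the decay constant and multipliers are made up for the example:

```javascript
// Illustrative composite score: relevance, gated by a recency decay and a
// log-damped frequency bonus. All constants here are assumptions.
function compositeScore({ relevance, ageDays, accessCount }) {
  const recency = Math.exp(-ageDays / 30);    // decays toward 0 over months
  const frequency = Math.log1p(accessCount);  // diminishing returns per access
  return relevance * (0.5 + 0.5 * recency) * (1 + 0.1 * frequency);
}
```

With a floor as low as `min_score: 0.001`, nearly everything survives the filter and ordering is left entirely to this kind of score.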
53
-
54
- ## 4. No `/forget` endpoint
55
-
56
- ```js
57
- // server.js routes:
58
- // POST /search
59
- // POST /store
60
- // GET /health
61
- // (no /forget, no /memories)
62
- ```
63
-
64
- v0.4.x had `/forget` and `/memories`. v0.5.x removed them. Without `/forget`:
65
- - Tests can't isolate runs (data accumulates across test suites)
66
- Benches pollute each other's namespaces (we observed v0.5.6 drop from 17.6% to 9.4% over 5 successive polluted runs)
67
- - GDPR data deletion requests require direct Postgres access
68
- - Multi-tenant deployments can't enforce tenant boundaries via the SDK alone
69
-
70
- **Engine:** restored `/forget` with `id` and `metadata_contains` filters.
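Call shapes for the restored endpoint might look like the following. The body fields come from the two filters named above; the exact wire format is an assumption:

```javascript
// Build a /forget request body from either filter. Shape is assumed from the
// two documented filters: a single id, or a metadata predicate.
function forgetBody(filter) {
  if (filter.id) return { id: filter.id };
  return { metadata_contains: filter.metadata_contains };
}

// Assumed usage:
//   await fetch("http://localhost:8099/forget", {
//     method: "POST",
//     headers: { "content-type": "application/json" },
//     body: JSON.stringify(forgetBody({ metadata_contains: { arena: "smoke" } })),
//   });
```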
71
-
72
- ## 5. No `/store-batch`
73
-
74
- Even though `ai.js` has an `embedBatch()` helper, the server only exposes single-record `/store`. Bulk ingest does N HTTP roundtrips, each with one synchronous embed call.
75
-
76
- For the bench harness, this means a 22-doc corpus takes ~25 minutes to ingest because every doc waits for an Ollama HyDE generation (60s default) plus an embed call.
77
-
78
- **Engine:** added `/store-batch`. One HTTP roundtrip, one batched embed call, one bulk INSERT. 30-50× faster on >5 records.
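The client side of the batching is trivial. A sketch, assuming the batch body is simply an array of the same records `/store` accepts (batch size 100 is an arbitrary choice here):

```javascript
// Split a corpus into /store-batch request bodies: one HTTP roundtrip per
// batch instead of one per record.
function batchRequests(records, batchSize = 100) {
  const batches = [];
  for (let i = 0; i < records.length; i += batchSize) {
    batches.push({ records: records.slice(i, i + batchSize) });
  }
  return batches;
}
```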
79
-
80
- ## 6. HyDE generated at INGEST time
81
-
82
- ```js
83
- // ingest.js — for every /store call:
84
- const hypothetical_queries = await llm.chat(/* generate 3-5 fake queries */);
85
- metadata.hypothetical_queries = hypothetical_queries;
86
- ```
87
-
88
- This adds a 60s LLM call to every ingest. Worse, the queries are generated against the *content*, not the user's actual query — so they tend to be generic ("what is the topic of this document"), not useful for matching at search time.
89
-
90
- **Engine:** HyDE runs at SEARCH time against the user's actual query. Each search generates 3 hypothetical answers, embeds each, runs vector search per embedding, and RRF-fuses the rank lists. Better matching, no ingest blocking.
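A sketch of that flow, with the LLM, embedder, vector search and fuser injected so nothing engine-internal is assumed (all function names are illustrative):

```javascript
// Search-time HyDE: generate hypothetical answers to the *user's* query,
// embed each, run vector search per embedding, then fuse the rank lists.
async function hydeSearch(query, { generateAnswers, embed, vectorSearch, fuse }) {
  const hypotheticals = await generateAnswers(query, 3);
  const rankLists = await Promise.all(
    hypotheticals.map(async (h) => vectorSearch(await embed(h)))
  );
  return fuse(rankLists);
}
```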
91
-
92
- ## 7. No content chunking
93
-
94
- v0.5.x stores a 10,000-token document as one row with one 4096-d embedding. The vector represents the *average* meaning of the document, washing out specific facts.
95
-
96
- **Engine:** chunks at ingest into ~200-500 token segments, each with its own embedding and `chunk_index`. Search returns chunks; downstream caller can hydrate the parent document if needed.
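A naive word-window version of the idea — the engine's real segmentation is token-based; this sketch only shows the shape of the output rows:

```javascript
// Split text into overlapping word windows, each tagged with a chunk_index.
// Window and overlap sizes are illustrative stand-ins for token counts.
function chunk(text, maxWords = 300, overlap = 50) {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  for (let start = 0; start < words.length; start += maxWords - overlap) {
    chunks.push({
      chunk_index: chunks.length,
      content: words.slice(start, start + maxWords).join(" "),
    });
    if (start + maxWords >= words.length) break; // last window reached the end
  }
  return chunks;
}
```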
97
-
98
- ## 8. No reranker
99
-
100
- v0.5.x's `search.js` returns top-K directly from the SQL CTE score. No second-pass reranker.
101
-
102
- **Engine:** L6 doc-store runs a `ms-marco-MiniLM-L-6-v2` cross-encoder over the top-50 from initial retrieval, then returns top-K. Substantially better precision on questions that need exact term matching after broad recall.
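The rerank pass itself is simple once the cross-encoder call is abstracted away. A sketch, with `scorePair` standing in for the ms-marco-MiniLM-L-6-v2 scorer:

```javascript
// Second-pass rerank: cross-encoder-score the top-50 candidates against the
// query, then return the best k. scorePair(query, doc) -> number is injected.
async function rerank(query, candidates, scorePair, k = 10) {
  const top50 = candidates.slice(0, 50);
  const scored = await Promise.all(
    top50.map(async (doc) => ({ doc, score: await scorePair(query, doc) }))
  );
  scored.sort((a, b) => b.score - a.score);
  return scored.slice(0, k).map((s) => s.doc);
}
```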
103
-
104
- ## 9. No graph / entity layer
105
-
106
- v0.5.x doesn't extract entities at ingest, doesn't build relationships, can't answer multi-hop questions ("who owns thing-X" → "find listings where X was sold" → "fetch buyer's contact").
107
-
108
- **Engine:** L3 Knowledge Graph (Neo4j Community) extracts entities at ingest, builds edges between co-occurring entities, and at search time boosts rows that mention the same entities as the query. Critical for the marketplace-ops and customer-support benches.
109
-
110
- ## 10. Single vector store, single embedding per row
111
-
112
- v0.5.x writes one row per memory with one embedding column in pgvector. The HNSW index doesn't work above 2000 dimensions, so 4096-d NV-Embed embeddings fall back to sequential scan. At >100k memories, that's >100ms per query.
113
-
114
- **Engine:** indexes the same content into multiple stores in parallel:
115
- - L0 BM25 (SQLite FTS5)
116
- - L4 sqlite-vec (small, in-process)
117
- - L5 Milvus (medium, dedicated)
118
- - L6 doc-store (with reranker)
119
- - L3 KG (relationship-pivoted)
120
-
121
- Search runs all five in parallel, RRF-fuses the rank lists, applies reranker on top-50. Different query types win on different layers — the fusion absorbs the strengths of each.
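The fusion step is standard Reciprocal Rank Fusion: each document scores the sum of 1/(k + rank) over every rank list it appears in. A sketch (k = 60 is the conventional constant from the original RRF paper; the engine's actual value isn't stated here):

```javascript
// Fuse per-layer rank lists with RRF. Input: arrays of document ids in rank
// order, one array per layer. Output: ids in fused order, best first.
function rrfFuse(rankLists, k = 60) {
  const scores = new Map();
  for (const list of rankLists) {
    list.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

A document that ranks mid-list on several layers beats one that tops a single layer, which is exactly the "fusion absorbs the strengths of each" behaviour described above.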
122
-
123
- ## Summary
124
-
125
- | Gap | Bench impact (estimated) | Fix complexity |
126
- |---|---|---|
127
- | 1. atomBoost +0.15 | -15-20pp | trivial (config flag) |
128
- | 2. dedupeBySource: true | -5-10pp | trivial (config flag) |
129
- | 3. minScore: 0.5 default | -3-8pp | trivial (config change) |
130
- | 4. No /forget | n/a but blocks tests | trivial (10 LOC) |
131
- | 5. No /store-batch | n/a but blocks bench (~25 min ingest) | low (50 LOC) |
132
- | 6. HyDE at ingest time | -5-10pp + 60s/store | medium (refactor) |
133
- | 7. No chunking | -5-15pp on long docs | medium (schema change) |
134
- | 8. No reranker | -5-10pp | medium (sidecar service) |
135
- | 9. No graph layer | -5-10pp on entity queries | high (new schema + extraction) |
136
- | 10. Single vector store | -10-20pp, latency at scale | high (parallel infrastructure) |
137
-
138
- This package addresses 1-10 simultaneously by routing through the 7-layer engine, recovering ~65pp of the gap.