npm - @pentatonic-ai/ai-agent-sdk - Versions diffs - 0.10.16 → 0.10.17 - Mend

@pentatonic-ai/ai-agent-sdk 0.10.16 → 0.10.17

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (7)

package/dist/index.cjs +1 -1
package/dist/index.js +1 -1
package/package.json +1 -1
package/packages/memory-engine-v2/RFC-student-cascade.md +148 -0
package/packages/memory-engine-v2/extractor-async/test_cascade.py +175 -0
package/packages/memory-engine-v2/extractor-async/worker.py +399 -67
package/packages/memory-engine-v2/org-model/migrations/010_distillation_ledger.sql +40 -0

package/dist/index.cjs CHANGED

@@ -878,7 +878,7 @@ function fireAndForgetEmit(clientConfig, sessionOpts, messages, result, model) {
 }
 // src/telemetry.js
-var VERSION = "0.10.16";
+var VERSION = "0.10.17";
 var TELEMETRY_URL = "https://sdk-telemetry.philip-134.workers.dev";
 function machineId() {
   const raw = typeof process !== "undefined" ? `${process.env?.USER || process.env?.USERNAME || "u"}:${process.platform || "x"}:${process.arch || "x"}` : "browser";

package/dist/index.js CHANGED

@@ -847,7 +847,7 @@ function fireAndForgetEmit(clientConfig, sessionOpts, messages, result, model) {
 }
 // src/telemetry.js
-var VERSION = "0.10.16";
+var VERSION = "0.10.17";
 var TELEMETRY_URL = "https://sdk-telemetry.philip-134.workers.dev";
 function machineId() {
   const raw = typeof process !== "undefined" ? `${process.env?.USER || process.env?.USERNAME || "u"}:${process.platform || "x"}:${process.arch || "x"}` : "browser";

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "@pentatonic-ai/ai-agent-sdk",
-  "version": "0.10.16",
+  "version": "0.10.17",
   "description": "TES SDK — LLM observability and lifecycle tracking via Pentatonic Thing Event System. Track token usage, tool calls, and conversations. Manage things through event-sourced lifecycle stages with AI enrichment and vector search.",
   "type": "module",
   "main": "./dist/index.cjs",

package/packages/memory-engine-v2/RFC-student-cascade.md ADDED

@@ -0,0 +1,148 @@
+# RFC — Student→Teacher Distillation Cascade (#99)
+**Status:** proposed (plan of record) · **Date:** 2026-06-18 · **Owner:** Phil
+## Goal
+Cut distillation GPU cost/latency by making the cheap fine-tuned **student**
+(`numind/NuExtract-2.0-4B`, full-FT on teacher traces) the **primary** distiller,
+and reaching the expensive **teacher** (Qwen3.6-27B on the L40S fleet) only for a
+**sampled + gated subset**. The teacher's output on that subset both corrects the
+graph and feeds continuous improvement of the student.
+This is a model **cascade**, not a replacement: the teacher stays the source of
+truth on the hard/sampled tail; the student carries the bulk.
+## What we already have (grounding)
+- **Trained student** at `s3://pme-deploy-prod-us-east-1-170649632502/backups/training/nuextract-2.0-4b-ft-final/` (7 GB). Train/eval loss 0.112/0.106.
+- **Quality vs the CURRENT teacher** (`f1e0ff` prompt): median entity-F1 **0.909**, ~**0% invalid JSON**, ~1% email-hallucination, **no drift** despite a teacher-prompt change since training (the student trained on `bbdaba`). So it's deployable without a retrain.
+- **Routing signals are weak.** Single-pass token-confidence and a feature-based verifier (HistGBM, AUC 0.63) only modestly beat random at predicting student↔teacher disagreement. Cheap features don't pinpoint the student's errors. → routing leans on deterministic gates + sampling, not a confidence/verifier gate.
+- **Infra to reuse:** `distillation_queue` + the `extractor-async` consumer pattern; the combined-demand distiller **autoscaler**; the `fusion_queue` consumer pattern; `distillation_traces` with `system_prompt_hash` (teacher-version segmentation); the Fusion Drive (fuzzy self-healing on top).
+- **Caveat carried forward:** the "disagreement rate" we've quoted (~12%) is a crude entity-name-exact-match proxy — entities only, penalizes normalization, teacher-as-gold. A proper structured-diff metric is a dependency for trustworthy monitoring + verifier labels (see Open Questions).
+## Architecture (steady state, post-flip)
+```
+event → extractor-sync (deterministic provisional write, unchanged) → distillation_queue
+            │
+            ▼
+   STUDENT consumer (cheap, always-on small GPU)
+      • distils the event, writes entities/facts/relationships to the graph
+        tagged producer='student'
+      • computes the escalation decision (below)
+            │
+            └── ESCALATE subset → distillation_queue-teacher (the existing 27B flow,
+                                   autoscaled L40S fleet)
+                                   • teacher distils
+                                   • SUPERSEDES the student's rows for that event
+                                   • writes distillation_traces (gold) → monitoring + retraining
+```
+The student is a new consumer; escalation is an enqueue onto the existing teacher
+path. No new scheduler — the autoscaler already scales the teacher fleet on queue
+depth (incl. fusion). The student runs on a cheap always-on GPU (L4/g6 — 4B fits
+in ~8 GB; the teacher L40S fleet stays scale-to-zero for the escalation subset).
+## Escalation policy (the "sample out")
+Escalate to the teacher iff **any** of:
+1. **Deterministic gate-fail** (cheap, high-precision):
+   - student output isn't valid JSON / violates schema, OR
+   - **grounding violation** — student emits an email/entity contradicting the
+     event's structured envelope (the #111 hard-key bag). *This is the gate that
+     catches "confidently wrong about a known fact," which confidence can't.*
+2. **High-value event class** — e.g. `decision`/`commitment` facts, VIP arenas —
+   always teacher (cheap metadata routing, decided pre-student where possible).
+3. **Random sample** (e.g. 3–5%) — *not* a quality lever; this exists to (a)
+   monitor student↔teacher agreement over time (drift), and (b) generate fresh
+   teacher-gold on live traffic for retraining the student + verifier (active
+   learning). The events the student is *unsure* on are the highest-value
+   retraining data.
+4. *(soft)* **verifier score** — verifier-v0 is weak (AUC 0.63); use it only as a
+   low-weight tiebreak to nudge borderline events toward the teacher, never as
+   the primary gate. Revisit if a stronger signal (self-consistency) is built.
+Everything else: the student's write stands.
+## Supersede-on-escalation (the one genuinely new mechanism)
+The store is **pure-accretion** — event identity is `content-hash`, there is **no
+supersede-by-source_id**, and graph upserts only accrete (see
+`pme2-dedup-supersede-semantics`). So when the teacher re-distils an escalated
+event, its output must **replace** the student's rows for that event, not pile on
+top. Options:
+- **(A) Producer-tagged supersede (recommended).** Tag every graph row written by
+  distillation with `producer` (`student`/`teacher`) + `event_id` (already in
+  `provenance_event_ids`). On a teacher escalation, in one transaction: delete the
+  `producer='student'` rows whose provenance is that single event **and** that no
+  other event corroborates (don't delete a row a second event also supports —
+  decrement/repoint instead), then write the teacher's rows. Mirrors the Fusion
+  Drive's repoint/audit discipline; reversible via an audit receipt.
+- **(B) Defer-write.** Decide escalation *before* the student writes (gates that
+  don't need the student output: high-value class, random sample), and for those
+  skip the student write entirely — teacher-only. Gate-fails (which need the
+  student output) still need (A). Cuts most supersede churn.
+- Recommended: **B for the pre-decidable escalations + A for gate-fail
+  escalations.** Most escalations (class/random) are pre-decidable → no student
+  write to undo; only the (rarer) gate-fails incur a supersede.
+Open: define "no other event corroborates" precisely against the accretion graph;
+reuse Fusion Drive's `entity_merges`/`fact_merges` audit tables for reversibility.
+## Rollout sequence
+1. **Shadow** (no graph impact): student runs alongside; teacher still does
+   everything; log student-vs-teacher per event → validate on live traffic +
+   accumulate verifier/quality-metric labels. (We've already done a *batch* shadow
+   over recent traces; a brief standing shadow confirms on live flow.)
+2. **Flip to student-primary + sampled teacher** (the diagram above), starting
+   with a **conservative escalation rate** (high random %, broad high-value
+   classes), tighten as monitoring confirms quality.
+3. **Iterate**: retrain student on accumulated teacher-gold (esp. escalated/hard
+   events); rebuild the verifier when a proper metric + more data exist.
+Kill switch + dry-run posture mirror the Fusion Drive (a flag to fall back to
+teacher-primary instantly).
+## Monitoring & active learning
+- **Drift:** the random-sample agreement rate, segmented by `system_prompt_hash`.
+  A teacher-prompt change (like `bbdaba`→`f1e0ff`) shows up as an agreement drop →
+  trigger a student refresh. (The hash segmentation already exists.)
+- **Active learning:** escalated + sampled events with teacher-gold are the next
+  training corpus; the student improves on exactly its weak spots.
+- **Cost model:** worth it iff escalation rate × teacher-cost ≪ teacher-on-
+  everything. Student-on-everything (cheap GPU, always-on) + teacher on ~10–20%
+  beats teacher-on-100% comfortably; the L40S fleet scale-to-zero already assumes
+  bursty teacher load.
+## Open questions / risks
+- **Disagreement metric.** Replace the entity-name-exact-match proxy with a
+  structured diff (entities w/ type+email, facts as s·p·o, relationships; fuzzy
+  name matching; small independent rubric for semantic equivalence). Dependency
+  for trustworthy monitoring **and** better verifier labels. *(Highest-leverage
+  next build.)*
+- **Prompt-version coupling.** The student is bound to a teacher prompt version
+  (`system_prompt_hash`). Every teacher-prompt change risks staling it → the
+  monitoring must watch the hash and the refresh pipeline must be cheap.
+- **Routing is unsolved.** No cheap signal cleanly predicts student errors yet;
+  self-consistency (K-sample) is the most promising unbuilt option but costs K×.
+  Until then, gates + sampling carry it and we accept the residual tail.
+- **Supersede correctness** in the accretion store (above) — the riskiest piece
+  to get exactly right.
+## Build phases (components)
+1. **Metric** — structured-diff scorer (offline, no GPU). *Do first.*
+2. **Student service** — always-on cheap-GPU server (4B) + a `distillation_queue`
+   student consumer; producer-tagging on graph writes.
+3. **Escalation + supersede** — gate/class/random logic; supersede-on-escalation
+   (B+A); reuse audit tables.
+4. **Shadow wiring** → **flip** → **monitoring dashboard** (agreement by hash).
+5. **Retrain loop** — periodic student/verifier retrain on accumulated gold.
+🤖 Generated with [Claude Code](https://claude.com/claude-code)
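
The RFC's cost condition ("worth it iff escalation rate × teacher-cost ≪ teacher-on-everything") can be made concrete with a small sketch. The per-event costs below are hypothetical placeholders in arbitrary units, not figures from the package, and the always-on student GPU is treated as an amortised per-event cost, which is a simplification. The function names are illustrative only.

```python
def cascade_cost_per_event(
    student_cost: float, teacher_cost: float, escalation_rate: float
) -> float:
    """Expected per-event cost when every event runs through the student
    and a fraction `escalation_rate` is additionally sent to the teacher."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be in [0, 1]")
    return student_cost + escalation_rate * teacher_cost


def break_even_escalation_rate(student_cost: float, teacher_cost: float) -> float:
    """Escalation rate at which the cascade costs the same as teacher-on-everything.

    Solves student_cost + r * teacher_cost == teacher_cost for r.
    """
    if teacher_cost <= 0:
        raise ValueError("teacher_cost must be positive")
    return 1.0 - student_cost / teacher_cost


# Hypothetical numbers: the teacher is assumed 10x the student's per-event cost.
STUDENT_COST = 1.0
TEACHER_COST = 10.0

cascade = cascade_cost_per_event(STUDENT_COST, TEACHER_COST, escalation_rate=0.2)
break_even = break_even_escalation_rate(STUDENT_COST, TEACHER_COST)
print(f"cascade at 20% escalation: {cascade:.1f} vs teacher-only: {TEACHER_COST:.1f}")
print(f"break-even escalation rate: {break_even:.0%}")
```

Under these assumed costs the cascade stays cheaper than teacher-only up to the break-even rate, so the RFC's suggested 10–20% escalation would leave headroom. The real ratio depends on measured GPU costs that the RFC does not state.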

package/packages/memory-engine-v2/extractor-async/test_cascade.py ADDED

@@ -0,0 +1,175 @@
+"""Unit tests for the student→teacher distillation cascade (#99).
+Covers the pure decision surface — the JSON-salvage of student output, the
+grounding email set derived from an event, and the escalation gates (parse
+fail, empty, grounding violation, high-value class, random sample, pass). The
+network call (call_student_one) and the DB writes (_apply_extraction,
+_record_distillation) are integration-tested elsewhere; here we pin the routing
+logic that decides student-XOR-teacher.
+"""
+from __future__ import annotations
+import importlib.util
+from pathlib import Path
+import pytest
+_THIS = Path(__file__).resolve().parent
+def _load_worker(name: str = "extractor_async_worker_cascade"):
+    spec = importlib.util.spec_from_file_location(name, _THIS / "worker.py")
+    assert spec and spec.loader
+    mod = importlib.util.module_from_spec(spec)
+    spec.loader.exec_module(mod)
+    return mod
+try:
+    worker = _load_worker()
+except ImportError as e:
+    pytest.skip(f"extractor-async deps unavailable: {e}", allow_module_level=True)
+# ----------------------------------------------------------------------
+# _salvage_json_object — the JSON-validity gate's parser
+# ----------------------------------------------------------------------
+def test_salvage_plain_object() -> None:
+    assert worker._salvage_json_object('{"entities": []}') == {"entities": []}
+def test_salvage_strips_code_fence() -> None:
+    txt = '```json\n{"facts": [{"statement": "x"}]}\n```'
+    assert worker._salvage_json_object(txt) == {"facts": [{"statement": "x"}]}
+def test_salvage_extracts_embedded_object() -> None:
+    txt = 'Sure! Here is the extraction:\n{"entities": [{"name": "Acme"}]}\nDone.'
+    assert worker._salvage_json_object(txt) == {"entities": [{"name": "Acme"}]}
+def test_salvage_returns_none_on_garbage() -> None:
+    assert worker._salvage_json_object("not json at all") is None
+    assert worker._salvage_json_object("") is None
+    # a bare JSON array is not an event object
+    assert worker._salvage_json_object("[1, 2, 3]") is None
+# ----------------------------------------------------------------------
+# _event_known_emails — grounding set (content + structured envelope)
+# ----------------------------------------------------------------------
+def test_known_emails_from_content_and_envelope() -> None:
+    event = {
+        "content": "Reach me at alice@acme.com tomorrow.",
+        "attributes": {
+            "contact_email": "Bob <bob@acme.com>",
+            "to_emails": ["carol@acme.com", "dave@x.io"],
+            "cc_emails": ["erin@acme.com"],
+        },
+    }
+    known = worker._event_known_emails(event)
+    assert known == {
+        "alice@acme.com", "bob@acme.com", "carol@acme.com",
+        "dave@x.io", "erin@acme.com",
+    }
+def test_known_emails_lowercased_and_empty_safe() -> None:
+    assert worker._event_known_emails({"content": "FOO@BAR.COM"}) == {"foo@bar.com"}
+    assert worker._event_known_emails({}) == set()
+# ----------------------------------------------------------------------
+# escalation_decision — the gates (precedence + each trigger)
+# ----------------------------------------------------------------------
+def _no_sample(monkeypatch):
+    """Pin the random sample off so the deterministic gates are isolated."""
+    monkeypatch.setattr(worker, "STUDENT_SAMPLE_RATE", 0.0)
+def test_escalate_on_parse_fail(monkeypatch) -> None:
+    _no_sample(monkeypatch)
+    esc, reason = worker.escalation_decision({"content": ""}, None)
+    assert esc and reason == "student_parse_fail"
+def test_escalate_on_empty_extraction(monkeypatch) -> None:
+    _no_sample(monkeypatch)
+    empty = {"entities": [], "facts": [], "relationships": []}
+    esc, reason = worker.escalation_decision({"content": ""}, empty)
+    assert esc and reason == "student_empty"
+def test_escalate_on_grounding_violation(monkeypatch) -> None:
+    _no_sample(monkeypatch)
+    event = {"content": "Met with the vendor.", "attributes": {}}
+    # student invents an email not present anywhere in the event
+    result = {
+        "entities": [{"type": "person", "name": "X", "aliases": ["ghost@evil.com"]}],
+        "facts": [],
+        "relationships": [],
+    }
+    esc, reason = worker.escalation_decision(event, result)
+    assert esc and reason == "grounding_violation"
+def test_no_escalation_when_email_is_grounded(monkeypatch) -> None:
+    _no_sample(monkeypatch)
+    event = {"content": "ping alice@acme.com", "attributes": {}}
+    result = {
+        "entities": [{"type": "person", "name": "Alice", "aliases": ["alice@acme.com"]}],
+        "facts": [],
+        "relationships": [],
+    }
+    esc, reason = worker.escalation_decision(event, result)
+    assert not esc and reason is None
+def test_escalate_on_high_value_class(monkeypatch) -> None:
+    _no_sample(monkeypatch)
+    monkeypatch.setattr(worker, "HIGH_VALUE_CATEGORIES", {"decision", "commitment"})
+    event = {"content": "We will ship Friday.", "attributes": {}}
+    result = {
+        "entities": [],
+        "facts": [{"category": "decision", "statement": "ship Friday", "subject": "team"}],
+        "relationships": [],
+    }
+    esc, reason = worker.escalation_decision(event, result)
+    assert esc and reason == "high_value_class"
+def test_escalate_on_random_sample(monkeypatch) -> None:
+    monkeypatch.setattr(worker, "STUDENT_SAMPLE_RATE", 1.0)  # always sample
+    monkeypatch.setattr(worker, "HIGH_VALUE_CATEGORIES", set())
+    event = {"content": "low value note", "attributes": {}}
+    result = {"entities": [{"type": "other", "name": "thing"}], "facts": [], "relationships": []}
+    esc, reason = worker.escalation_decision(event, result)
+    assert esc and reason == "random_sample"
+def test_student_handles_clean_low_value_event(monkeypatch) -> None:
+    """The common case: valid, grounded, non-high-value, not sampled → the
+    student's write stands (no escalation)."""
+    _no_sample(monkeypatch)
+    monkeypatch.setattr(worker, "HIGH_VALUE_CATEGORIES", {"decision", "commitment"})
+    event = {"content": "Acme released a new SKU.", "attributes": {}}
+    result = {
+        "entities": [{"type": "org", "name": "Acme"}],
+        "facts": [{"category": "state", "statement": "Acme released a SKU", "subject": "Acme"}],
+        "relationships": [],
+    }
+    esc, reason = worker.escalation_decision(event, result)
+    assert not esc and reason is None
+# ----------------------------------------------------------------------
+# Flag contract — cascade is a no-op until CASCADE_ENABLED is flipped.
+# ----------------------------------------------------------------------
+def test_cascade_default_off() -> None:
+    """Default env ⇒ teacher-only. The flag is the kill switch."""
+    assert worker.CASCADE_ENABLED is False
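
The `_load_worker` helper at the top of this test file imports `worker.py` by file path rather than by package name, presumably because the directory name `extractor-async` contains a hyphen and so is not a valid Python package identifier. A minimal standalone sketch of that loading pattern follows. The temporary file and the names `load_module_from_path` and `demo_worker` are illustrative, not part of the package.

```python
import importlib.util
import tempfile
from pathlib import Path
from types import ModuleType


def load_module_from_path(name: str, path: Path) -> ModuleType:
    """Import a Python source file by path under an arbitrary module name."""
    spec = importlib.util.spec_from_file_location(name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot build an import spec for {path}")
    module = importlib.util.module_from_spec(spec)
    # Executes the file's top-level code, so module-level env reads happen here.
    spec.loader.exec_module(module)
    return module


# Demo: a stand-in worker.py inside a hyphenated directory.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "extractor-async" / "worker.py"
    target.parent.mkdir()
    target.write_text("CASCADE_ENABLED = False\n")
    demo = load_module_from_path("demo_worker", target)
    flag = demo.CASCADE_ENABLED

print(flag)
```

Because `exec_module` runs the file's top-level code, constants such as `CASCADE_ENABLED` are fixed at load time. That is consistent with the tests above overriding `STUDENT_SAMPLE_RATE` and `HIGH_VALUE_CATEGORIES` through `monkeypatch.setattr` on the loaded module rather than through environment variables.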

package/packages/memory-engine-v2/extractor-async/worker.py CHANGED

@@ -30,6 +30,7 @@ import hashlib
 import json
 import logging
 import os
+import random
 import re
 import socket
 import time
@@ -94,6 +95,45 @@ DISTILL_TRACE_ENABLED = os.environ.get(
 ).strip().lower() in ("true", "1", "yes", "on")
+def _envflag(name: str, default: str = "false") -> bool:
+    return os.environ.get(name, default).strip().lower() in ("true", "1", "yes", "on")
+# --------------------------------------------------------------------
+# Student→teacher distillation cascade (#99)
+#
+# When CASCADE_ENABLED, the cheap fine-tuned student (NuExtract-2.0-4B,
+# served behind STUDENT_ENDPOINT) is the PRIMARY distiller. Per event the
+# worker runs the student first, applies deterministic gates + sampling, and
+# the event is handled by student XOR teacher — DISJOINT writes, so nothing in
+# the accretion store is ever superseded. The 27B teacher (the existing
+# call_llm_batch path) handles only the escalated subset.
+#
+# Hard requirement on this load-bearing ingestion path: the whole cascade is
+# behind this single flag. Flip it off ⇒ byte-for-byte the prior teacher-only
+# behaviour (no student call, no ledger write). That is the instant kill switch
+# (cf. the 0.10.9 outage). Every graph row's producer is recorded in the
+# event_distillations ledger (migration 010) so escalation can be monitored and
+# student rows can be re-distilled later ("mop up").
+CASCADE_ENABLED = _envflag("CASCADE_ENABLED")
+STUDENT_ENDPOINT = os.environ.get("STUDENT_ENDPOINT", "")
+STUDENT_API_KEY = os.environ.get("STUDENT_API_KEY", "")
+STUDENT_MODEL = os.environ.get("STUDENT_MODEL", "nuextract-2.0-4b-ft")
+STUDENT_TIMEOUT_SEC = float(os.environ.get("STUDENT_TIMEOUT_SEC", "60"))
+STUDENT_MAX_TOKENS = int(os.environ.get("STUDENT_MAX_TOKENS", "768"))
+# Fraction of student-passing events ALSO sent to the teacher — NOT a quality
+# lever, this is the monitoring + active-learning sample (drift detection and
+# fresh teacher-gold on live traffic). Default conservative-ish 5%.
+STUDENT_SAMPLE_RATE = float(os.environ.get("STUDENT_SAMPLE_RATE", "0.05"))
+# Fact categories that always go to the teacher regardless of the student's
+# output (high-value, cheap to over-escalate). Comma-separated, lowercased.
+HIGH_VALUE_CATEGORIES = {
+    c.strip().lower()
+    for c in os.environ.get("HIGH_VALUE_CATEGORIES", "decision,commitment").split(",")
+    if c.strip()
+}
 # KV-text output format constants. We dropped JSON output (and the
 # `guided_json` schema enforcement that went with it) because a single
 # invalid char inside a 13k-character JSON blob nukes the whole 10-event
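
As a sketch of how the cascade settings added in this hunk parse, the snippet below mirrors the `_envflag` and `HIGH_VALUE_CATEGORIES` logic but reads from a plain mapping instead of `os.environ` so it can be exercised in isolation. The helper names `env_flag` and `env_csv_set` are hypothetical and do not exist in the package.

```python
from collections.abc import Mapping

_TRUTHY = ("true", "1", "yes", "on")


def env_flag(env: Mapping[str, str], name: str, default: str = "false") -> bool:
    """Same truthiness rule as the diff's _envflag: trimmed, case-insensitive."""
    return env.get(name, default).strip().lower() in _TRUTHY


def env_csv_set(env: Mapping[str, str], name: str, default: str) -> set[str]:
    """Comma-separated value -> lowercased set, dropping blank entries."""
    return {c.strip().lower() for c in env.get(name, default).split(",") if c.strip()}


example_env = {
    "CASCADE_ENABLED": " On ",
    "HIGH_VALUE_CATEGORIES": "Decision, ,commitment,",
}

print(env_flag(example_env, "CASCADE_ENABLED"))  # whitespace and case are ignored
print(env_flag({}, "CASCADE_ENABLED"))           # unset falls back to "false"
print(sorted(env_csv_set(example_env, "HIGH_VALUE_CATEGORIES", "decision,commitment")))
print(env_csv_set({"HIGH_VALUE_CATEGORIES": ""}, "HIGH_VALUE_CATEGORIES", "decision,commitment"))
```

One consequence of this parsing: setting `HIGH_VALUE_CATEGORIES` to an empty string yields an empty set, which disables the high-value gate, whereas leaving the variable unset keeps the `decision,commitment` default.
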
@@ -779,6 +819,143 @@ async def call_llm_batch(
     return parsed
+# --------------------------------------------------------------------
+# Student→teacher cascade (#99) — student call + escalation gates
+# --------------------------------------------------------------------
+_EMAIL_RE = re.compile(r"[\w.+-]+@[\w.-]+\.\w+")
+_JSON_OBJ_RE = re.compile(r"\{.*\}", re.DOTALL)
+def _salvage_json_object(text: str) -> dict[str, Any] | None:
+    """Best-effort parse of the student's single-event output into a dict.
+    The student was trained to emit one event object verbatim (the teacher's
+    per-event raw_slice), so a bare json.loads usually works; we also strip
+    ```json fences and grab the outermost {...} as a fallback. Returns None on
+    any failure — that None IS the JSON-validity gate (→ escalate)."""
+    t = text.strip()
+    if t.startswith("```"):
+        t = t.strip("`")
+        if t[:4].lower() == "json":
+            t = t[4:]
+    try:
+        obj = json.loads(t)
+        return obj if isinstance(obj, dict) else None
+    except Exception:
+        pass
+    m = _JSON_OBJ_RE.search(t)
+    if m:
+        try:
+            obj = json.loads(m.group(0))
+            return obj if isinstance(obj, dict) else None
+        except Exception:
+            return None
+    return None
+async def call_student_one(
+    client: httpx.AsyncClient, event: dict[str, Any]
+) -> dict[str, Any] | None:
+    """Distil ONE event with the fine-tuned student. The student trained on
+    single-event (build_event_block → per-event object) pairs with NO system
+    prompt, so the request is a single user turn. Output is normalised through
+    the SAME _parse_guided_json the teacher path uses (wrap the lone object in
+    the {"events":[...]} envelope), so a student-produced result dict is
+    byte-shape-identical to a teacher one — every downstream upsert is
+    producer-agnostic. Returns the result dict, or None on transport/parse
+    failure (→ the JSON-validity gate escalates it)."""
+    block = build_event_block(0, event)
+    headers = {"Content-Type": "application/json"}
+    if STUDENT_API_KEY:
+        headers["X-API-Key"] = STUDENT_API_KEY
+        headers["Authorization"] = f"Bearer {STUDENT_API_KEY}"
+    body = {
+        "model": STUDENT_MODEL,
+        "messages": [{"role": "user", "content": block}],
+        "temperature": 0.0,
+        "max_tokens": STUDENT_MAX_TOKENS,
+    }
+    try:
+        r = await client.post(
+            STUDENT_ENDPOINT, json=body, headers=headers, timeout=STUDENT_TIMEOUT_SEC
+        )
+        r.raise_for_status()
+        data = r.json()
+        text = (data.get("choices") or [{}])[0].get("message", {}).get("content", "")
+        if not text:
+            text = data.get("message", {}).get("content", "")
+    except Exception as exc:
+        log.warning(f"student call failed event_id={event.get('id')}: {exc}")
+        return None
+    obj = _salvage_json_object(text or "")
+    if obj is None:
+        return None
+    parsed = _parse_guided_json(json.dumps({"events": [obj]}), 1)
+    result = parsed[0]
+    result["raw_slice"] = json.dumps(obj, ensure_ascii=False)
+    return result
+def _event_known_emails(event: dict[str, Any]) -> set[str]:
+    """Emails grounded in the event — its content plus the structured envelope
+    (the #111 hard-key bag). The grounding gate escalates any student output
+    that asserts an email NOT in this set: 'confidently wrong about a known
+    fact', which token-confidence can't catch."""
+    known = {e.lower() for e in _EMAIL_RE.findall(event.get("content") or "")}
+    attrs = event.get("attributes") or {}
+    for k in ("contact_email", "author", "user_id"):
+        v = attrs.get(k)
+        if isinstance(v, str):
+            known |= {e.lower() for e in _EMAIL_RE.findall(v)}
+    for k in ("to_emails", "cc_emails"):
+        v = attrs.get(k)
+        if isinstance(v, list):
+            for item in v:
+                if isinstance(item, str):
+                    known |= {e.lower() for e in _EMAIL_RE.findall(item)}
+    return known
+def escalation_decision(
+    event: dict[str, Any], student_result: dict[str, Any] | None
+) -> tuple[bool, str | None]:
+    """Decide whether an event escalates from student to teacher. Escalate iff
+    ANY gate fires (see RFC-student-cascade §Escalation policy). Returns
+    (escalate, reason). reason is the producer-tag stored in the ledger so the
+    escalation mix is queryable.
+    Routing leans on deterministic gates + sampling, NOT on token-confidence /
+    the weak verifier (AUC 0.63) — cheap signals don't pinpoint student errors."""
+    # 1a. JSON/schema validity gate — None means the student produced nothing
+    # parseable.
+    if student_result is None:
+        return True, "student_parse_fail"
+    # 1b. A parseable-but-empty extraction is also a fail — the student gave us
+    # no graph signal; let the teacher try.
+    if not (
+        student_result.get("entities")
+        or student_result.get("facts")
+        or student_result.get("relationships")
+    ):
+        return True, "student_empty"
+    # 1c. Grounding gate — student asserts an email the event doesn't contain.
+    known = _event_known_emails(event)
+    blob = json.dumps(student_result.get("entities", [])) + json.dumps(
+        student_result.get("facts", [])
+    )
+    for em in {e.lower() for e in _EMAIL_RE.findall(blob)}:
+        if em not in known:
+            return True, "grounding_violation"
+    # 2. High-value fact class — always teacher.
+    for f in student_result.get("facts", []):
+        if (f.get("category") or "").lower() in HIGH_VALUE_CATEGORIES:
+            return True, "high_value_class"
+    # 3. Random monitoring/active-learning sample (not a quality lever).
+    if STUDENT_SAMPLE_RATE > 0 and random.random() < STUDENT_SAMPLE_RATE:
+        return True, "random_sample"
+    return False, None
 # --------------------------------------------------------------------
 # Upsert helpers (mirror extractor-sync's idempotent shape)
 # --------------------------------------------------------------------
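
The grounding gate (step 1c in `escalation_decision` above) reduces to a set difference between the emails the student asserted and the emails found in the event. The standalone sketch below reuses the regex from the diff. The function name `ungrounded_emails` and the sample data are illustrative, not part of the package.

```python
import json
import re
from typing import Any

# Same pattern as _EMAIL_RE in the diff.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w.-]+\.\w+")


def ungrounded_emails(known: set[str], result: dict[str, Any]) -> set[str]:
    """Emails asserted in the student's entities/facts that the event never mentions.

    Mirrors the diff: only `entities` and `facts` are serialised and scanned.
    """
    blob = json.dumps(result.get("entities", [])) + json.dumps(result.get("facts", []))
    asserted = {e.lower() for e in EMAIL_RE.findall(blob)}
    return asserted - known


known = {"alice@acme.com"}
result = {
    "entities": [{"name": "Alice", "aliases": ["Alice@Acme.com", "ghost@evil.com"]}],
    "facts": [],
    "relationships": [{"note": "cc unknown@else.org"}],
}

violations = ungrounded_emails(known, result)
print(violations)
```

In this example only `ghost@evil.com` is flagged. `Alice@Acme.com` matches the known address after lowercasing, and the address inside `relationships` is not inspected, because the gate as written scans entities and facts only.
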
@@ -1508,6 +1685,41 @@ def _insert_trace(
         )
+def _record_distillation(
+    conn: psycopg.Connection,
+    *,
+    event_id: str,
+    producer: str,
+    llm_model: str,
+    escalated: bool | None,
+    escalate_reason: str | None,
+) -> None:
+    """Append a (event_id, producer) row to the cascade ledger (migration 010).
+    Audit-only — caller wraps in try/except, never poisons the upsert path.
+    Records WHICH producer wrote this event's graph rows so escalation can be
+    monitored and student rows re-distilled later. ON CONFLICT refreshes (a
+    gate-fail re-distill legitimately re-stamps the teacher row)."""
+    with conn.cursor() as cur:
+        cur.execute(
+            """
+            INSERT INTO event_distillations (
+              event_id, producer, llm_model, system_prompt_hash,
+              escalated, escalate_reason
+            ) VALUES (%s, %s, %s, %s, %s, %s)
+            ON CONFLICT (event_id, producer) DO UPDATE SET
+              llm_model = EXCLUDED.llm_model,
+              system_prompt_hash = EXCLUDED.system_prompt_hash,
+              escalated = EXCLUDED.escalated,
+              escalate_reason = EXCLUDED.escalate_reason,
+              distilled_at = now()
+            """,
+            (
+                event_id, producer, llm_model, SYSTEM_PROMPT_HASH,
+                escalated, escalate_reason,
+            ),
+        )
 # --------------------------------------------------------------------
 # Queue mechanics
 # --------------------------------------------------------------------
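
The `ON CONFLICT (event_id, producer) DO UPDATE` clause in `_record_distillation` makes the ledger write idempotent per producer. The sketch below demonstrates the same upsert shape against an in-memory SQLite database (SQLite 3.24 or newer supports this syntax) rather than PostgreSQL. The table definition is a simplified, hypothetical stand-in, since the columns of migration 010 are not shown in this part of the diff, and the model names are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified schema; the real one lives in migration 010.
conn.execute(
    """
    CREATE TABLE event_distillations (
        event_id        TEXT NOT NULL,
        producer        TEXT NOT NULL,
        llm_model       TEXT,
        escalated       INTEGER,
        escalate_reason TEXT,
        PRIMARY KEY (event_id, producer)
    )
    """
)

UPSERT = """
    INSERT INTO event_distillations (
        event_id, producer, llm_model, escalated, escalate_reason
    ) VALUES (?, ?, ?, ?, ?)
    ON CONFLICT (event_id, producer) DO UPDATE SET
        llm_model = excluded.llm_model,
        escalated = excluded.escalated,
        escalate_reason = excluded.escalate_reason
"""


def record(event_id, producer, llm_model, escalated, reason):
    """Insert a ledger row, or refresh it if (event_id, producer) already exists."""
    conn.execute(UPSERT, (event_id, producer, llm_model, escalated, reason))


record("evt-1", "teacher", "teacher-v1", True, "student_parse_fail")
record("evt-1", "teacher", "teacher-v2", True, "grounding_violation")  # same key: refreshed
record("evt-1", "student", "student-v1", False, None)                  # new producer: new row

rows = conn.execute(
    "SELECT producer, llm_model, escalate_reason "
    "FROM event_distillations ORDER BY producer"
).fetchall()
print(rows)
```

With the composite key, re-recording the same event for the same producer overwrites that row, while a different producer for the same event gets its own row.
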
@@ -1829,12 +2041,81 @@ async def process_batch(
     if not callable_items:
         return
+    # Cascade (#99): student-primary with a gated/sampled teacher escalation.
+    # Disjoint writes — each event is handled by student XOR teacher — so the
+    # accretion store is never superseded. Flag-off ⇒ the teacher-only path
+    # below, byte-for-byte the prior behaviour.
+    if CASCADE_ENABLED and not stub_mode and STUDENT_ENDPOINT:
+        await _process_cascade(http, conn, callable_items, events_by_qid)
+        return
+    await _run_teacher(http, conn, callable_items, events_by_qid, stub_mode, None)
+async def _process_cascade(
+    http: httpx.AsyncClient,
+    conn: psycopg.Connection,
+    callable_items: list[dict[str, Any]],
+    events_by_qid: dict[int, dict[str, Any] | None],
+) -> None:
+    """Student-primary cascade for one claim. Runs the student over every
+    callable event (bounded concurrency), then per event applies the escalation
+    gates: a pass writes the student's extraction (producer='student'); a fail/
+    sample escalates to the teacher. Disjoint — the student writes XOR the event
+    is escalated, so there is nothing to supersede."""
+    sem = asyncio.Semaphore(CONCURRENT_LLM_CALLS)
+    async def _student(item):
+        async with sem:
+            ev = events_by_qid[item["id"]]
+            return item, await call_student_one(http, ev)
+    outcomes = await asyncio.gather(*[_student(i) for i in callable_items])
+    escalate_items: list[dict[str, Any]] = []
+    reason_by_qid: dict[int, str] = {}
+    for item, sresult in outcomes:
+        event = events_by_qid[item["id"]]
+        escalate, reason = escalation_decision(event, sresult)
+        if escalate:
+            escalate_items.append(item)
+            reason_by_qid[item["id"]] = reason or "escalated"
+        else:
+            _apply_extraction(
+                conn, item=item, event=event, result=sresult, llm_ms=0.0,
+                local_idx=0, stub_mode=False, producer="student",
+                escalated=False, escalate_reason=None,
+            )
+    log.info(
+        f"cascade: {len(callable_items) - len(escalate_items)} student-handled, "
+        f"{len(escalate_items)} escalated to teacher"
+    )
+    if escalate_items:
+        await _run_teacher(http, conn, escalate_items, events_by_qid, False, reason_by_qid)
+async def _run_teacher(
+    http: httpx.AsyncClient,
+    conn: psycopg.Connection,
+    teacher_items: list[dict[str, Any]],
+    events_by_qid: dict[int, dict[str, Any] | None],
+    stub_mode: bool,
+    reason_by_qid: dict[int, str] | None,
+) -> None:
+    """Teacher (27B) distillation over the given items — the existing
+    multi-event batched path, factored out so both the cascade-off (all items)
+    and cascade escalation (subset) flows share it. `reason_by_qid` non-None
+    means these items were escalated by the cascade (tags the ledger row);
+    None means pure teacher-only (no ledger)."""
+    if not teacher_items:
+        return
     # Build chunks of EVENTS_PER_LLM_CALL items each (last chunk may be
     # short). Each chunk → one LLM call. Up to CONCURRENT_LLM_CALLS run
     # concurrently; asyncio.gather queues the rest.
     chunks: list[tuple[list[dict[str, Any]], list[dict[str, Any]]]] = []
-    for s in range(0, len(callable_items), EVENTS_PER_LLM_CALL):
-        chunk_items = callable_items[s : s + EVENTS_PER_LLM_CALL]
+    for s in range(0, len(teacher_items), EVENTS_PER_LLM_CALL):
+        chunk_items = teacher_items[s : s + EVENTS_PER_LLM_CALL]
         chunk_events = [events_by_qid[i["id"]] for i in chunk_items]
         chunks.append((chunk_items, chunk_events))
@@ -1853,7 +2134,6 @@ async def process_batch(
     for (chunk_items, _chunk_events), (per_item, llm_ms) in zip(chunks, chunk_outcomes):
         for local_idx, (item, result) in enumerate(zip(chunk_items, per_item)):
             queue_id = item["id"]
-            event_id = item["event_id"]
             attempts = item["attempts"]
             event = events_by_qid[queue_id]
@@ -1868,77 +2148,117 @@ async def process_batch(
                     release_claim(conn, queue_id, err)
                 continue
-            ents = result.get("entities") or []
-            facts = result.get("facts") or []
-            rels = result.get("relationships") or []
-            arena = event["arena"]
-            participant_set = event.get("participant_set") or [arena]
-            disclosure = event.get("disclosure_class") or "private"
-            # SOURCE time of this event: prefer the parsed
-            # `attributes.timestamp` (canonical), falling back to the
-            # stored `emitted_at` column (which the sync path now also
-            # stamps from source time). `None` ⇒ upserts fall back to
-            # NOW() in-SQL. NEVER crash on a bad/absent source time.
-            event_time = event_source_time(event) or event.get("emitted_at")
-            # A structured deadline on the source event, if the producer
-            # supplied one — populates facts.effective_until. Absent or
-            # unparseable ⇒ None (column stays NULL, its existing
-            # behaviour). Only `attributes.due_at` is honoured; we do NOT
-            # guess deadlines from free text here.
-            due_at = parse_source_time((event.get("attributes") or {}).get("due_at"))
-            # ORIGINATING SOURCE of this event, stamped onto its facts so
-            # downstream can tell CRM-asserted from email-asserted (the
-            # SoR-drift foundation). Finer `attributes.source` else coarse
-            # `source_kind`; None ⇒ column stays NULL (source-unknown).
-            src = fact_source(event)
+            escalated = None if reason_by_qid is None else True
+            reason = None if reason_by_qid is None else reason_by_qid.get(queue_id)
+            _apply_extraction(
+                conn, item=item, event=event, result=result, llm_ms=llm_ms,
+                local_idx=local_idx, stub_mode=stub_mode, producer="teacher",
+                escalated=escalated, escalate_reason=reason,
+            )
+def _apply_extraction(
+    conn: psycopg.Connection,
+    *,
+    item: dict[str, Any],
+    event: dict[str, Any],
+    result: dict[str, Any],
+    llm_ms: float,
+    local_idx: int,
+    stub_mode: bool,
+    producer: str,
+    escalated: bool | None,
+    escalate_reason: str | None,
+) -> bool:
+    """Write one event's extraction to the graph (+ trace + cascade ledger).
+    Shared by the student-pass and teacher paths so producer tagging and the
+    upsert semantics are identical; `result` is a parsed extraction dict (never
+    an Exception). Returns True on success; on a DB failure releases/fails the
+    claim per attempts and returns False."""
+    queue_id = item["id"]
+    event_id = item["event_id"]
+    attempts = item["attempts"]
+    ents = result.get("entities") or []
+    facts = result.get("facts") or []
+    rels = result.get("relationships") or []
+    arena = event["arena"]
+    participant_set = event.get("participant_set") or [arena]
+    disclosure = event.get("disclosure_class") or "private"
+    # SOURCE time of this event: prefer the parsed `attributes.timestamp`
+    # (canonical), falling back to the stored `emitted_at` column. `None` ⇒
+    # upserts fall back to NOW() in-SQL. NEVER crash on a bad/absent source time.
+    event_time = event_source_time(event) or event.get("emitted_at")
+    # A structured deadline on the source event, if the producer supplied one —
+    # populates facts.effective_until. Only `attributes.due_at` is honoured.
+    due_at = parse_source_time((event.get("attributes") or {}).get("due_at"))
+    # ORIGINATING SOURCE of this event, stamped onto its facts so downstream can
+    # tell CRM-asserted from email-asserted (the SoR-drift foundation). Finer
+    # `attributes.source` else coarse `source_kind`; None ⇒ column stays NULL.
+    src = fact_source(event)
+    try:
+        name_to_id = upsert_entities(
+            conn, arena, event_id, participant_set, disclosure, ents,
+            event_time, event.get("attributes"),
+        )
+        n_facts = upsert_facts(
+            conn, arena, event_id, participant_set, disclosure, facts, name_to_id,
+            event_time, due_at, src,
+        )
+        n_rels = upsert_relationships(
+            conn, arena, event_id, participant_set, disclosure, rels, name_to_id,
+            event_time,
+        )
+        mark_done(conn, queue_id)
+        log.info(
+            f"completed queue_id={queue_id} event_id={event_id} producer={producer} "
+            f"entities={len(name_to_id)} facts={n_facts} relationships={n_rels}"
+            + (f" llm_ms={llm_ms:.0f}" if not stub_mode and llm_ms else "")
+        )
+        # Trace logging — best-effort. ONLY the teacher produces training gold;
+        # logging student output would train the student on itself.
+        if DISTILL_TRACE_ENABLED and not stub_mode and producer == "teacher":
             try:
-                name_to_id = upsert_entities(
-                    conn, arena, event_id, participant_set, disclosure, ents,
-                    event_time, event.get("attributes"),
-                )
-                n_facts = upsert_facts(
-                    conn, arena, event_id, participant_set, disclosure, facts, name_to_id,
-                    event_time, due_at, src,
+                _insert_trace(
+                    conn,
+                    event_id=event_id,
+                    user_prompt=build_event_block(local_idx, event),
+                    raw_response=result.get("raw_slice", ""),
+                    llm_chunk_ms=llm_ms,
                 )
-                n_rels = upsert_relationships(
-                    conn, arena, event_id, participant_set, disclosure, rels, name_to_id,
-                    event_time,
+            except Exception as trace_exc:
+                log.warning(
+                    f"trace insert failed queue_id={queue_id} "
+                    f"event_id={event_id}: {trace_exc}"
                 )
-                mark_done(conn, queue_id)
-                log.info(
-                    f"completed queue_id={queue_id} event_id={event_id} "
-                    f"entities={len(name_to_id)} facts={n_facts} "
-                    f"relationships={n_rels}"
-                    + (f" llm_ms={llm_ms:.0f}/chunk" if not stub_mode else "")
+        # Cascade ledger — best-effort, records which producer wrote this
+        # event's rows (only when the cascade is on).
+        if CASCADE_ENABLED:
+            try:
+                _record_distillation(
+                    conn,
+                    event_id=event_id,
+                    producer=producer,
+                    llm_model=LLM_MODEL if producer == "teacher" else STUDENT_MODEL,
+                    escalated=escalated,
+                    escalate_reason=escalate_reason,
                 )
-                # Trace logging — best-effort, never breaks the worker.
-                # Captures (input, output) so a student model can be
-                # trained on the teacher's distribution. Skipped in
-                # stub mode (no real LLM output to record).
-                if DISTILL_TRACE_ENABLED and not stub_mode:
-                    try:
-                        _insert_trace(
-                            conn,
-                            event_id=event_id,
-                            user_prompt=build_event_block(local_idx, event),
-                            raw_response=result.get("raw_slice", ""),
-                            llm_chunk_ms=llm_ms,
-                        )
-                    except Exception as trace_exc:
-                        log.warning(
-                            f"trace insert failed queue_id={queue_id} "
-                            f"event_id={event_id}: {trace_exc}"
-                        )
-            except Exception as exc:
-                err = f"{type(exc).__name__}: {exc}"
+            except Exception as ledger_exc:
                 log.warning(
-                    f"db upsert failed queue_id={queue_id} attempts={attempts}: {err}"
+                    f"ledger insert failed queue_id={queue_id} "
+                    f"event_id={event_id}: {ledger_exc}"
                 )
-                if attempts >= MAX_ATTEMPTS:
-                    mark_failed(conn, queue_id, err)
-                else:
-                    release_claim(conn, queue_id, err)
+        return True
+    except Exception as exc:
+        err = f"{type(exc).__name__}: {exc}"
+        log.warning(
+            f"db upsert failed queue_id={queue_id} attempts={attempts}: {err}"
+        )
+        if attempts >= MAX_ATTEMPTS:
+            mark_failed(conn, queue_id, err)
+        else:
+            release_claim(conn, queue_id, err)
+        return False
 async def amain():
@@ -1951,6 +2271,18 @@ async def amain():
         f"output_mode={DISTILL_OUTPUT_MODE}, "
         f"prompt_hash={SYSTEM_PROMPT_HASH})"
     )
+    if CASCADE_ENABLED:
+        log.info(
+            f"cascade ENABLED — student-primary "
+            f"(model={STUDENT_MODEL}, endpoint={STUDENT_ENDPOINT or '(unset!)'}, "
+            f"sample_rate={STUDENT_SAMPLE_RATE}, "
+            f"high_value={sorted(HIGH_VALUE_CATEGORIES)})"
+        )
+        if not STUDENT_ENDPOINT:
+            log.warning(
+                "CASCADE_ENABLED but STUDENT_ENDPOINT unset — falling back to "
+                "teacher-only (no student call)."
+            )
     stub_mode = not LLM_ENDPOINT
     if stub_mode:
         log.warning("LLM_ENDPOINT not set — running in stub mode (no extraction).")
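
The teacher path in the worker.py hunks above does two small things that are easy to misread in diff form: it splits the items into fixed-size chunks (one LLM call per chunk), and it derives the ledger tags from whether `reason_by_qid` was passed. A minimal standalone sketch of both, assuming a made-up chunk size and made-up reason strings; `build_chunks` and `ledger_tags` are hypothetical helper names, not functions in the package:

```python
from typing import Any, Optional

# Hypothetical value for illustration; the worker reads its own
# EVENTS_PER_LLM_CALL setting.
EVENTS_PER_LLM_CALL = 4


def build_chunks(
    items: list[dict[str, Any]],
    events_by_qid: dict[int, Any],
    size: int = EVENTS_PER_LLM_CALL,
) -> list[tuple[list[dict[str, Any]], list[Any]]]:
    """Split items into chunks of `size`; the last chunk may be short."""
    chunks = []
    for start in range(0, len(items), size):
        chunk_items = items[start : start + size]
        chunk_events = [events_by_qid[i["id"]] for i in chunk_items]
        chunks.append((chunk_items, chunk_events))
    return chunks


def ledger_tags(
    reason_by_qid: Optional[dict[int, str]], queue_id: int
) -> tuple[Optional[bool], Optional[str]]:
    """Mirror the tagging in the teacher path of the diff.

    reason_by_qid is None  -> pure teacher-only run: (None, None)
    reason_by_qid is a dict -> cascade escalation: (True, reason or None)
    """
    if reason_by_qid is None:
        return None, None
    return True, reason_by_qid.get(queue_id)
```

Together with the student path, which passes `escalated=False, escalate_reason=None`, this gives three distinguishable ledger states: student-handled (`False`), escalated to the teacher (`True`), and teacher-only (`None`).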

package/packages/memory-engine-v2/org-model/migrations/010_distillation_ledger.sql ADDED

@@ -0,0 +1,40 @@
+-- 010_distillation_ledger.sql — student→teacher cascade audit ledger (#99)
+--
+-- The cascade runs the cheap fine-tuned student as the primary distiller and
+-- escalates a gated/sampled subset to the 27B teacher. Graph writes are
+-- DISJOINT per event (student XOR teacher), so we never supersede in the
+-- accretion store — but we still need to know WHICH producer wrote each
+-- event's rows, both for monitoring (escalation rate, agreement) and for the
+-- "mop up" path (find + re-distill any student rows later). This ledger is
+-- that record: one row per (event, producer).
+--
+-- It is intentionally OUTSIDE the entities/facts/relationships tables — no
+-- column added to the load-bearing graph tables — so the cascade is fully
+-- reversible: drop the flag and this table is simply no longer written. Rows
+-- are best-effort (the worker wraps the insert in try/except, exactly like
+-- distillation_traces) and never gate the upsert path.
+--
+-- Disjoint writes mean normally one producer per event. A gate-fail re-distill
+-- (mop-up: teacher re-runs an event the student first handled) legitimately
+-- adds a second producer row — the PK(event_id, producer) permits that and the
+-- ON CONFLICT refreshes the existing row.
+CREATE TABLE IF NOT EXISTS event_distillations (
+  event_id           text        NOT NULL,
+  producer           text        NOT NULL,   -- 'student' | 'teacher'
+  llm_model          text        NOT NULL,
+  system_prompt_hash text,
+  escalated          boolean,                -- NULL when cascade disabled (pure teacher)
+  escalate_reason    text,                   -- why this event went to the teacher
+  distilled_at       timestamptz NOT NULL DEFAULT now(),
+  PRIMARY KEY (event_id, producer)
+);
+-- Monitoring: escalation rate / producer mix over time.
+CREATE INDEX IF NOT EXISTS idx_event_distillations_producer_time
+  ON event_distillations (producer, distilled_at DESC);
+-- Active learning + mop-up: find the events a given gate escalated.
+CREATE INDEX IF NOT EXISTS idx_event_distillations_reason
+  ON event_distillations (escalate_reason)
+  WHERE escalate_reason IS NOT NULL;
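
The migration comments describe one ledger row per (event, producer), with `ON CONFLICT` refreshing an existing row and a mop-up re-distill adding a second producer row for the same event. The body of the worker's `_record_distillation` is not shown in this part of the diff, so the following is only a sketch of those semantics, using the standard-library `sqlite3` module as a stand-in for Postgres (upsert syntax requires SQLite 3.24 or newer). The `record` and `producer_mix` helpers, the model names, and the reason strings are all assumptions for illustration:

```python
import sqlite3

# Same shape as the migration, with SQLite-compatible types.
DDL = """
CREATE TABLE IF NOT EXISTS event_distillations (
  event_id           text NOT NULL,
  producer           text NOT NULL,
  llm_model          text NOT NULL,
  system_prompt_hash text,
  escalated          boolean,
  escalate_reason    text,
  distilled_at       text NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (event_id, producer)
)
"""

# Assumed upsert: a repeat write for the same (event_id, producer)
# refreshes the row instead of failing on the primary key.
UPSERT = """
INSERT INTO event_distillations
  (event_id, producer, llm_model, escalated, escalate_reason)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT (event_id, producer) DO UPDATE SET
  llm_model       = excluded.llm_model,
  escalated       = excluded.escalated,
  escalate_reason = excluded.escalate_reason,
  distilled_at    = CURRENT_TIMESTAMP
"""


def record(conn, event_id, producer, llm_model, escalated, reason):
    """Write or refresh one ledger row (hypothetical helper)."""
    conn.execute(UPSERT, (event_id, producer, llm_model, escalated, reason))


def producer_mix(conn):
    """Rows per producer, the kind of query the monitoring index serves."""
    cur = conn.execute(
        "SELECT producer, COUNT(*) FROM event_distillations "
        "GROUP BY producer ORDER BY producer"
    )
    return cur.fetchall()
```

A short usage pass: open `sqlite3.connect(":memory:")`, run `DDL`, then call `record` once for a student-handled event and twice for a teacher row on the same event. The second teacher call updates the existing row rather than inserting a new one, while the student row for that event stays in place.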