npm - @pentatonic-ai/ai-agent-sdk - Versions diffs - 0.10.7 → 0.10.8 - Mend

@pentatonic-ai/ai-agent-sdk 0.10.7 → 0.10.8

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (20)

package/dist/index.cjs CHANGED

@@ -878,7 +878,7 @@ function fireAndForgetEmit(clientConfig, sessionOpts, messages, result, model) {
 }
 // src/telemetry.js
-var VERSION = "0.10.7";
+var VERSION = "0.10.8";
 var TELEMETRY_URL = "https://sdk-telemetry.philip-134.workers.dev";
 function machineId() {
   const raw = typeof process !== "undefined" ? `${process.env?.USER || process.env?.USERNAME || "u"}:${process.platform || "x"}:${process.arch || "x"}` : "browser";

package/dist/index.js CHANGED

@@ -847,7 +847,7 @@ function fireAndForgetEmit(clientConfig, sessionOpts, messages, result, model) {
 }
 // src/telemetry.js
-var VERSION = "0.10.7";
+var VERSION = "0.10.8";
 var TELEMETRY_URL = "https://sdk-telemetry.philip-134.workers.dev";
 function machineId() {
   const raw = typeof process !== "undefined" ? `${process.env?.USER || process.env?.USERNAME || "u"}:${process.platform || "x"}:${process.arch || "x"}` : "browser";

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "@pentatonic-ai/ai-agent-sdk",
-  "version": "0.10.7",
+  "version": "0.10.8",
   "description": "TES SDK — LLM observability and lifecycle tracking via Pentatonic Thing Event System. Track token usage, tool calls, and conversations. Manage things through event-sourced lifecycle stages with AI enrichment and vector search.",
   "type": "module",
   "main": "./dist/index.cjs",

package/packages/memory-engine-v2/RFC-decay-and-fusion.md ADDED

@@ -0,0 +1,185 @@
+# RFC: the Fusion Drive — v2 memory self-healing (cross-run node fusion + decay)
+> **Fusion Drive** = the continuous, arena-scoped background engine that keeps the v2
+> memory graph self-healing: it *fuses* duplicate/near-duplicate nodes from different
+> distillation runs into a single master node (horizontal convergence) and *decays* stale,
+> low-value, and junk nodes out of existence (vertical aging). Named for the drive that
+> does the fusing — the decay pass rides the same engine.
+**Status:** draft / spec — 2026-06-12
+**Builds on:** `RFC-entity-reconciliation.md`, `scripts/entity_resolution_v2.py` (#82),
+`org-model/migrations/002_entity_merges_audit.sql`.
+**Motivated by:** the v2 store is currently **pure-accretion** — three independent
+properties, all verified in code, mean nothing ever leaves or improves in place:
+1. **No supersede by source_id** — event identity is `sha256(arena:content)`; re-emitting
+   edited content appends a new event, the old persists.
+2. **Accrete-only graph writes** — entity/fact upserts are `ON CONFLICT (id) DO UPDATE`
+   that only merge aliases/provenance and bump confidence; a *corrected* extraction has a
+   different deterministic id, so it lands **beside** the polluted node, never replacing it.
+3. **No decay/eviction** — v2 has no GC; fact confidence only moves up; recency affects
+   search ranking only, never retention.
+Net: improving the extractor/teacher only helps **new** content. Accumulated 7B-era
+pollution (hallucinated emails, numeric-ID-as-person, ungrounded entities) is immortal.
+`pentatonic-team` had to be **nuked** rather than re-distilled because of this; `pip-agents`
+(87k events) still carries all of it.
+This RFC makes the store **self-healing** via two complementary mechanisms:
+**fusion** (horizontal — converge duplicate/near-duplicate nodes from different
+distillation runs into one *master* node) and **decay** (vertical — age out stale and
+low-value nodes). Both are gated, arena-scoped, audited, and reversible.
+---
+## Part A — Fusion: converge near-duplicate nodes into a master
+Extends the existing entity-resolution machinery along four axes.
+### A1. Online + continuous (today it's dry-run batch)
+Run fusion as a scheduled per-arena pass (systemd timer on the engine box, same pattern as
+the distiller autoscaler) **and** opportunistically after a distillation run touches an
+arena's entities. Keep #82's invariants: dry-run default, `--apply` gate, arena scoping,
+`entity_merges` rollback. Add a `fusion_runs` ledger (arena, started_at, candidates,
+merged, mode) for observability.
+### A2. Cross-distillation-run detection (the actual pollution cure)
+The hard case #82 misses: 7B `"1716801984"` (numeric-ID person) and Qwen3.6 `"Katie Cooper"`
+are the same real entity but share **no name similarity**, so name-blocking never compares
+them. New candidate signals beyond name trigrams / embedding-on-name:
+- **Shared-provenance co-reference** — two entities of the same `entity_type` citing the
+  same `event_id` in `provenance_event_ids`, where one is low-quality (numeric / ungrounded
+  / single-token). The shared event's content is the adjudication context ("does this event
+  support these being the same person?").
+- **Context embedding** — embed the *facts/statements about* an entity (not just its name),
+  so name-divergent dupes still cluster. Reuses the bulk-embed lane.
+- **Teacher-version signal** — provenance maps to `distillation_traces.llm_model` /
+  `system_prompt_hash`. Prefer the newer-teacher extraction as master; an entity *only* ever
+  produced by the superseded teacher and never re-confirmed by the new one is both a fusion
+  candidate (likely a worse rendering of a node the new teacher got right) and a decay
+  candidate (stale-teacher orphan — see B).
+### A3. Master-node selection — replace richest-row-wins
+#82 uses "richest-row-wins", which (flagged in review) would crown the typo **"Phil Mossop"**
+over **"Philip Mossop"**. Replace with a **scored** canonical pick:
+| Signal | Effect |
+|---|---|
+| **Directory/authority anchor** (name matches an org-directory / HubSpot contact / Pip `contact_email`+`contact_name`) | dominant + → canonical |
+| Grounding (name appears verbatim in a provenance event's content) | + |
+| Teacher recency (newer `llm_model`) | + |
+| Corroboration (`cardinality(provenance_event_ids)`) | + |
+| Looks-like-ID (digit-ratio > 0.5) / hallucinated-email flag / single-token bare name | − − |
+Master = highest score. Losers' surface forms become **aliases** on the master (so existing
+lookups still resolve), facts/relationships are repointed, losers tombstoned in
+`entity_merges` with `rollback_payload`. Directory-anchored selection is the key fix: an
+authoritative source, when present, beats any heuristic.
+### A4. Fact + relationship fusion (today only entities fuse)
+After entity fusion (so subject/object ids are canonical):
+- **Facts** — exact `(arena, subject, predicate, object)` dupes already collapse via the
+  content-id. **Semantic** dupes (same assertion, different surface — "joined Acme" vs "works
+  at Acme") need statement-embedding similarity + LLM adjudication ("same assertion?").
+  Master fact = max confidence + best-grounded statement; union provenance; tombstone dupes.
+  New `fact_merges` audit mirroring `entity_merges`.
+- **Relationships** — `(from,to,type)` already collapses; a controlled rel-type vocabulary
+  ("works at" ≡ "employed by") is a later optional canonicalization.
+### A5. Audit, reversibility, safety rails
+Reuse `entity_merges`; add `fact_merges`. Every fusion carries `rollback_payload`.
+LLM-adjudicated merges store prompt+verdict. **Disclosure rail:** never send
+`disclosure_class='restricted'` rows to the LLM adjudicator (data-egress; the #82 review
+item). Auto-merge only above a high confidence band; everything else → human-review queue.
+---
+## Part B — Decay: age out stale and low-value nodes
+### B1. Separate `salience` from `confidence` (important)
+Do **not** decay `confidence` — it means "how corroborated/true is this", and decaying it
+would lie about corroboration. Add a separate **`salience`** (retention priority) to
+entities/facts/relationships. Decay acts on salience; eviction keys on salience.
+`salience(t) = salience₀ · exp(−ln2 · Δt / half_life[category])`, bumped on access or
+re-corroboration. Per-category half-life:
+| category | half-life | rationale |
+|---|---|---|
+| decision, commitment | very long / ∞ | durable record |
+| state, preference | medium | changes but matters |
+| mention, observation | short | ephemeral |
+`Δt` = time since `last_seen` **or** a new `last_accessed` (bumped when a node is returned by
+`/search` — cheap write, makes retrieval keep memories alive). Re-corroboration (new
+provenance) resets the clock and bumps salience.
+### B2. Born-salience — the cheap partial cure
+Seed `salience₀` from extraction-quality signals already computed (the trap detectors:
+ungrounded, numeric-ID-person, hallucinated-email, `noise_filter` hits). **Junk is born
+low**, so it decays below threshold and self-evicts fast — pollution cleans itself even
+without a fusion match.
+### B3. Eviction (GC)
+Node is evictable when: `salience < min_threshold` **AND** `last_seen`/`last_accessed`
+older than a floor **AND** not referenced by a surviving higher-salience node (an entity
+that's the subject/object of a live fact survives). Eviction = **tombstone** (soft-delete +
+retention window) → hard-delete after grace, cascading to the node's Qdrant points +
+`vector_provenance`. Never evict `disclosure_class='restricted'` without sign-off.
+### B4. Capacity bound (optional)
+Per-arena soft cap; when exceeded, evict lowest-salience first. Backstop against unbounded
+arenas.
+### B5. Cadence + safety
+Background per-arena pass (timer on the engine box), dry-run → `--apply` in a quiet window,
+counts logged, fully arena-scoped. Same operational shape as the distiller autoscaler /
+sparse backfill.
+---
+## Part C — Ordering & how they combine
+Per arena, on schedule: **(1) fusion → (2) decay.** Fusion first so a master node absorbs
+its duplicates' provenance/salience *before* decay judges it (else a real node split across
+two weak dupes could wrongly decay out). Then decay ages + evicts the survivors.
+**This is what finally cures immortal pollution:**
+- 7B polluted node *with* a correct Qwen3.6 counterpart → **fused**, correct one as master,
+  polluted demoted to alias / tombstoned.
+- 7B pure-junk node with *no* correct counterpart (numeric-ID-person, ungrounded) → born-low
+  salience + no corroboration + never accessed → **decays out and is evicted**.
+Together they convert the accrete-only store into a self-healing one. `pip-agents` could
+then self-clean over time instead of requiring a nuke (a nuke is still faster for a one-shot
+reset, but no longer the *only* path).
+---
+## Part D — Schema changes
+- `entities`: `+ salience REAL DEFAULT …`, `+ last_accessed TIMESTAMPTZ`.
+- `facts`: `+ salience REAL`, `+ last_accessed TIMESTAMPTZ` (keep `confidence` as-is =
+  corroboration truth; `asserted_at`/`expires_at` already exist).
+- `relationships`: `+ salience REAL`, `+ last_accessed` (already has `weight`,
+  `first/last_seen`).
+- new `fact_merges` audit (mirror `entity_merges` incl. `rollback_payload`).
+- new `fusion_runs` + `decay_runs` ledgers for observability.
+- `/search` gains a `last_accessed = NOW()` bump on returned nodes (batched).
+## Part E — Rollout (each flag-gated, arena-scoped, dry-run-first, audited)
+1. **Salience scoring only** — add columns, born-salience + decay math, NO eviction.
+   Observe distributions; confirm junk scores low and durable facts stay high.
+2. **Eviction** — dry-run (count what *would* evict) → `--apply` in a quiet window.
+3. **Fusion extension** — scored canonical selection (fix typo-crowning) + cross-run
+   detection + fact fusion, dry-run → apply.
+4. **Online/continuous** — wire fusion+decay to run after distillation per arena.
+## Open questions
+- Half-life constants per category — needs a calibration pass against real arenas.
+- `last_accessed` write amplification on hot search paths — batch/throttle the bump.
+- Directory authority source for canonical anchoring — HubSpot contacts? a curated table?
+- Interaction with the (still-open) source_id supersede mode — fusion partly subsumes it,
+  but explicit supersede is cheaper for known-mutable sources.
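
The decay formula in B1 and the eviction predicate in B3 of the RFC above can be sketched in a few lines of Python. This is an illustration, not the package's implementation: the half-life values, `min_threshold`, `min_age`, and the function names `decayed_salience` and `is_evictable` are all assumptions (the RFC leaves the constants to a calibration pass), and the restricted-row sign-off is modelled simply as "never auto-evict".

```python
import math
from datetime import datetime, timedelta, timezone

# Hypothetical half-lives in days. The RFC only says "very long / inf",
# "medium" and "short"; real values await calibration.
HALF_LIFE_DAYS = {
    "decision": math.inf,
    "commitment": math.inf,
    "state": 180.0,
    "preference": 180.0,
    "mention": 14.0,
    "observation": 14.0,
}


def decayed_salience(salience0: float, category: str,
                     last_touch: datetime, now: datetime) -> float:
    """salience(t) = salience0 * exp(-ln2 * dt / half_life[category])."""
    half_life = HALF_LIFE_DAYS[category]
    if math.isinf(half_life):
        return salience0  # durable categories never decay
    dt_days = max(0.0, (now - last_touch).total_seconds() / 86400.0)
    return salience0 * math.exp(-math.log(2) * dt_days / half_life)


def is_evictable(salience: float, last_touch: datetime, now: datetime,
                 referenced_by_live_node: bool, restricted: bool,
                 min_threshold: float = 0.05,              # assumed value
                 min_age: timedelta = timedelta(days=30),  # assumed floor
                 ) -> bool:
    """B3: low salience AND old enough AND not referenced by a live node.

    Restricted rows are never auto-evicted here; the RFC requires sign-off.
    """
    if restricted or referenced_by_live_node:
        return False
    return salience < min_threshold and (now - last_touch) >= min_age


if __name__ == "__main__":
    now = datetime(2026, 6, 12, tzinfo=timezone.utc)
    # One half-life after the last touch, a "mention" keeps half its salience.
    print(decayed_salience(0.8, "mention", now - timedelta(days=14), now))
```

Here `last_touch` stands for whichever timestamp the decay pass uses as the clock origin (`last_seen` or `last_accessed`); a bump on access or re-corroboration amounts to moving `last_touch` forward.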

package/packages/memory-engine-v2/RFC-fusion-drive.md ADDED

@@ -0,0 +1,193 @@
+# RFC: the Fusion Drive — v2 memory self-healing (cross-run node fusion + decay)
+> **Fusion Drive** = the continuous, arena-scoped background engine that keeps the v2
+> memory graph self-healing: it *fuses* duplicate/near-duplicate nodes from different
+> distillation runs into a single master node (horizontal convergence) and *decays* stale,
+> low-value, and junk nodes out of existence (vertical aging). Named for the drive that
+> does the fusing — the decay pass rides the same engine.
+**Status:** spec + initial implementation (PR #92) — 2026-06-12. Implemented: salience
+scoring + decay, **eviction** (`fusion_drive_decay.py --evict`, reversible via
+`node_evictions`), and **fusion** of exact + cross-run-shared-provenance entity dupes and
+exact-triple fact dupes (`fusion_drive_fuse.py --apply`, reversible via `entity_merges`/
+`fact_merges`), with scored directory-anchored master selection. All arena-scoped,
+dry-run-default, transactional, audited. TODO (later PRs): embedding-band + LLM-adjudicated
+detection (in `entity_resolution_v2.py`), semantic fact fusion, authority-table wiring for
+canonical scoring, continuous scheduling, and a half-life/threshold calibration pass before
+`--evict` runs in prod.
+**Builds on:** `RFC-entity-reconciliation.md`, `scripts/entity_resolution_v2.py` (#82),
+`org-model/migrations/002_entity_merges_audit.sql`.
+**Motivated by:** the v2 store is currently **pure-accretion** — three independent
+properties, all verified in code, mean nothing ever leaves or improves in place:
+1. **No supersede by source_id** — event identity is `sha256(arena:content)`; re-emitting
+   edited content appends a new event, the old persists.
+2. **Accrete-only graph writes** — entity/fact upserts are `ON CONFLICT (id) DO UPDATE`
+   that only merge aliases/provenance and bump confidence; a *corrected* extraction has a
+   different deterministic id, so it lands **beside** the polluted node, never replacing it.
+3. **No decay/eviction** — v2 has no GC; fact confidence only moves up; recency affects
+   search ranking only, never retention.
+Net: improving the extractor/teacher only helps **new** content. Accumulated 7B-era
+pollution (hallucinated emails, numeric-ID-as-person, ungrounded entities) is immortal.
+`pentatonic-team` had to be **nuked** rather than re-distilled because of this; `pip-agents`
+(87k events) still carries all of it.
+This RFC makes the store **self-healing** via two complementary mechanisms:
+**fusion** (horizontal — converge duplicate/near-duplicate nodes from different
+distillation runs into one *master* node) and **decay** (vertical — age out stale and
+low-value nodes). Both are gated, arena-scoped, audited, and reversible.
+---
+## Part A — Fusion: converge near-duplicate nodes into a master
+Extends the existing entity-resolution machinery along four axes.
+### A1. Online + continuous (today it's dry-run batch)
+Run fusion as a scheduled per-arena pass (systemd timer on the engine box, same pattern as
+the distiller autoscaler) **and** opportunistically after a distillation run touches an
+arena's entities. Keep #82's invariants: dry-run default, `--apply` gate, arena scoping,
+`entity_merges` rollback. Add a `fusion_runs` ledger (arena, started_at, candidates,
+merged, mode) for observability.
+### A2. Cross-distillation-run detection (the actual pollution cure)
+The hard case #82 misses: 7B `"1716801984"` (numeric-ID person) and Qwen3.6 `"Katie Cooper"`
+are the same real entity but share **no name similarity**, so name-blocking never compares
+them. New candidate signals beyond name trigrams / embedding-on-name:
+- **Shared-provenance co-reference** — two entities of the same `entity_type` citing the
+  same `event_id` in `provenance_event_ids`, where one is low-quality (numeric / ungrounded
+  / single-token). The shared event's content is the adjudication context ("does this event
+  support these being the same person?").
+- **Context embedding** — embed the *facts/statements about* an entity (not just its name),
+  so name-divergent dupes still cluster. Reuses the bulk-embed lane.
+- **Teacher-version signal** — provenance maps to `distillation_traces.llm_model` /
+  `system_prompt_hash`. Prefer the newer-teacher extraction as master; an entity *only* ever
+  produced by the superseded teacher and never re-confirmed by the new one is both a fusion
+  candidate (likely a worse rendering of a node the new teacher got right) and a decay
+  candidate (stale-teacher orphan — see B).
+### A3. Master-node selection — replace richest-row-wins
+#82 uses "richest-row-wins", which (flagged in review) would crown the typo **"Phil Mossop"**
+over **"Philip Mossop"**. Replace with a **scored** canonical pick:
+| Signal | Effect |
+|---|---|
+| **Directory/authority anchor** (name matches an org-directory / HubSpot contact / Pip `contact_email`+`contact_name`) | dominant + → canonical |
+| Grounding (name appears verbatim in a provenance event's content) | + |
+| Teacher recency (newer `llm_model`) | + |
+| Corroboration (`cardinality(provenance_event_ids)`) | + |
+| Looks-like-ID (digit-ratio > 0.5) / hallucinated-email flag / single-token bare name | − − |
+Master = highest score. Losers' surface forms become **aliases** on the master (so existing
+lookups still resolve), facts/relationships are repointed, losers tombstoned in
+`entity_merges` with `rollback_payload`. Directory-anchored selection is the key fix: an
+authoritative source, when present, beats any heuristic.
+### A4. Fact + relationship fusion (today only entities fuse)
+After entity fusion (so subject/object ids are canonical):
+- **Facts** — exact `(arena, subject, predicate, object)` dupes already collapse via the
+  content-id. **Semantic** dupes (same assertion, different surface — "joined Acme" vs "works
+  at Acme") need statement-embedding similarity + LLM adjudication ("same assertion?").
+  Master fact = max confidence + best-grounded statement; union provenance; tombstone dupes.
+  New `fact_merges` audit mirroring `entity_merges`.
+- **Relationships** — `(from,to,type)` already collapses; a controlled rel-type vocabulary
+  ("works at" ≡ "employed by") is a later optional canonicalization.
+### A5. Audit, reversibility, safety rails
+Reuse `entity_merges`; add `fact_merges`. Every fusion carries `rollback_payload`.
+LLM-adjudicated merges store prompt+verdict. **Disclosure rail:** never send
+`disclosure_class='restricted'` rows to the LLM adjudicator (data-egress; the #82 review
+item). Auto-merge only above a high confidence band; everything else → human-review queue.
+---
+## Part B — Decay: age out stale and low-value nodes
+### B1. Separate `salience` from `confidence` (important)
+Do **not** decay `confidence` — it means "how corroborated/true is this", and decaying it
+would lie about corroboration. Add a separate **`salience`** (retention priority) to
+entities/facts/relationships. Decay acts on salience; eviction keys on salience.
+`salience(t) = salience₀ · exp(−ln2 · Δt / half_life[category])`, bumped on access or
+re-corroboration. Per-category half-life:
+| category | half-life | rationale |
+|---|---|---|
+| decision, commitment | very long / ∞ | durable record |
+| state, preference | medium | changes but matters |
+| mention, observation | short | ephemeral |
+`Δt` = time since `last_seen` **or** a new `last_accessed` (bumped when a node is returned by
+`/search` — cheap write, makes retrieval keep memories alive). Re-corroboration (new
+provenance) resets the clock and bumps salience.
+### B2. Born-salience — the cheap partial cure
+Seed `salience₀` from extraction-quality signals already computed (the trap detectors:
+ungrounded, numeric-ID-person, hallucinated-email, `noise_filter` hits). **Junk is born
+low**, so it decays below threshold and self-evicts fast — pollution cleans itself even
+without a fusion match.
+### B3. Eviction (GC)
+Node is evictable when: `salience < min_threshold` **AND** `last_seen`/`last_accessed`
+older than a floor **AND** not referenced by a surviving higher-salience node (an entity
+that's the subject/object of a live fact survives). Eviction = **tombstone** (soft-delete +
+retention window) → hard-delete after grace, cascading to the node's Qdrant points +
+`vector_provenance`. Never evict `disclosure_class='restricted'` without sign-off.
+### B4. Capacity bound (optional)
+Per-arena soft cap; when exceeded, evict lowest-salience first. Backstop against unbounded
+arenas.
+### B5. Cadence + safety
+Background per-arena pass (timer on the engine box), dry-run → `--apply` in a quiet window,
+counts logged, fully arena-scoped. Same operational shape as the distiller autoscaler /
+sparse backfill.
+---
+## Part C — Ordering & how they combine
+Per arena, on schedule: **(1) fusion → (2) decay.** Fusion first so a master node absorbs
+its duplicates' provenance/salience *before* decay judges it (else a real node split across
+two weak dupes could wrongly decay out). Then decay ages + evicts the survivors.
+**This is what finally cures immortal pollution:**
+- 7B polluted node *with* a correct Qwen3.6 counterpart → **fused**, correct one as master,
+  polluted demoted to alias / tombstoned.
+- 7B pure-junk node with *no* correct counterpart (numeric-ID-person, ungrounded) → born-low
+  salience + no corroboration + never accessed → **decays out and is evicted**.
+Together they convert the accrete-only store into a self-healing one. `pip-agents` could
+then self-clean over time instead of requiring a nuke (a nuke is still faster for a one-shot
+reset, but no longer the *only* path).
+---
+## Part D — Schema changes
+- `entities`: `+ salience REAL DEFAULT …`, `+ last_accessed TIMESTAMPTZ`.
+- `facts`: `+ salience REAL`, `+ last_accessed TIMESTAMPTZ` (keep `confidence` as-is =
+  corroboration truth; `asserted_at`/`expires_at` already exist).
+- `relationships`: `+ salience REAL`, `+ last_accessed` (already has `weight`,
+  `first/last_seen`).
+- new `fact_merges` audit (mirror `entity_merges` incl. `rollback_payload`).
+- new `fusion_runs` + `decay_runs` ledgers for observability.
+- `/search` gains a `last_accessed = NOW()` bump on returned nodes (batched).
+## Part E — Rollout (each flag-gated, arena-scoped, dry-run-first, audited)
+1. **Salience scoring only** — add columns, born-salience + decay math, NO eviction.
+   Observe distributions; confirm junk scores low and durable facts stay high.
+2. **Eviction** — dry-run (count what *would* evict) → `--apply` in a quiet window.
+3. **Fusion extension** — scored canonical selection (fix typo-crowning) + cross-run
+   detection + fact fusion, dry-run → apply.
+4. **Online/continuous** — wire fusion+decay to run after distillation per arena.
+## Open questions
+- Half-life constants per category — needs a calibration pass against real arenas.
+- `last_accessed` write amplification on hot search paths — batch/throttle the bump.
+- Directory authority source for canonical anchoring — HubSpot contacts? a curated table?
+- Interaction with the (still-open) source_id supersede mode — fusion partly subsumes it,
+  but explicit supersede is cheaper for known-mutable sources.
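
The scored master selection in A3 of the RFC above could look roughly like the sketch below. The RFC's table gives only the sign of each signal, so every weight here is a hypothetical value, as are the `Candidate` fields and the names `canonical_score` and `pick_master`. The corroboration term is capped so that a directory match stays dominant, which is one possible reading of "dominant +".

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical weights; the RFC specifies direction only, not magnitude.
W_DIRECTORY = 10.0      # directory/authority anchor: dominant
W_GROUNDED = 1.0        # name appears verbatim in a provenance event
W_NEWER_TEACHER = 1.0   # produced by the newer llm_model
W_PER_SOURCE = 0.25     # per provenance event, capped below
MAX_SOURCES = 8
W_JUNK = -2.0           # looks-like-ID / hallucinated email / single token


@dataclass
class Candidate:
    name: str
    in_directory: bool = False
    grounded: bool = False
    newer_teacher: bool = False
    n_provenance: int = 1
    hallucinated_email: bool = False


def digit_ratio(s: str) -> float:
    """Fraction of non-whitespace characters that are digits."""
    stripped = "".join(s.split())
    if not stripped:
        return 0.0
    return sum(c.isdigit() for c in stripped) / len(stripped)


def canonical_score(c: Candidate) -> float:
    score = 0.0
    if c.in_directory:
        score += W_DIRECTORY
    if c.grounded:
        score += W_GROUNDED
    if c.newer_teacher:
        score += W_NEWER_TEACHER
    score += W_PER_SOURCE * min(c.n_provenance, MAX_SOURCES)
    looks_like_id = digit_ratio(c.name) > 0.5
    single_token = len(c.name.split()) == 1
    if looks_like_id or c.hallucinated_email or single_token:
        score += W_JUNK
    return score


def pick_master(cands: List[Candidate]) -> Tuple[Candidate, List[str]]:
    """Return the highest-scoring candidate and the losers' names as aliases."""
    master = max(cands, key=canonical_score)
    aliases = [c.name for c in cands if c is not master]
    return master, aliases


if __name__ == "__main__":
    phil = Candidate("Phil Mossop", grounded=True, n_provenance=9)
    philip = Candidate("Philip Mossop", in_directory=True,
                       grounded=True, n_provenance=2)
    master, aliases = pick_master([phil, philip])
    print(master.name, aliases)
```

With these assumed weights the richer row "Phil Mossop" no longer wins: the directory anchor outweighs any amount of corroboration, and the typo survives only as an alias on the master.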

package/packages/memory-engine-v2/extractor-async/confidence.py CHANGED

@@ -60,3 +60,40 @@ def corroborated_confidence(n_sources: int) -> float:
     if bumped > _CONF_CAP:
         return _CONF_CAP
     return round(bumped, 2)
+# ── born salience (Fusion Drive) ─────────────────────────────────────
+# Retention priority a node is stamped with at extraction time, SEPARATE
+# from confidence (confidence = corroboration/truth; salience = how long
+# it's worth keeping). Junk — flagged by the extractor's own quality
+# detectors (noise name, numeric-ID-as-person, hallucinated email,
+# ungrounded, etc.) — is born near the floor so the Fusion Drive decay
+# pass evicts it on a short clock instead of the multi-year default.
+#
+# This MUST stay byte-identical to fusion_drive/salience.py:born_salience
+# (the decay side uses the same scale). test_born_salience_parity.py
+# guards the two against drift — same pattern as entity_id.py's parity
+# test across the sync/async build contexts.
+_SAL_BASE = 0.50
+_SAL_CORROB_PER_SOURCE = 0.10
+_SAL_CORROB_CAP = 0.30
+_SAL_FLOOR = 0.01
+_SAL_CEIL = 1.00
+_SAL_PENALTIES = {
+    "noise_name": 0.45,
+    "numeric_id_person": 0.45,
+    "hallucinated_email": 0.40,
+    "ungrounded": 0.35,
+    "subject_undeclared": 0.25,
+    "low_signal": 0.15,
+}
+def born_salience(n_sources: int = 1, quality_flags: list[str] | None = None) -> float:
+    """Salience to stamp on a freshly extracted node. See the module note."""
+    s = _SAL_BASE
+    if n_sources > 1:
+        s += min(_SAL_CORROB_CAP, _SAL_CORROB_PER_SOURCE * (n_sources - 1))
+    for flag in quality_flags or []:
+        s -= _SAL_PENALTIES.get(flag, 0.0)
+    return round(max(_SAL_FLOOR, min(_SAL_CEIL, s)), 4)
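
To make the scale concrete, the snippet below restates `born_salience` and its constants from the hunk above and evaluates a few inputs. The function body and constants are copied from the diff; the only change is that the type hint is spelled with `typing.Optional` so the sketch also runs on older interpreters. The sample calls are illustrative.

```python
from typing import List, Optional

_SAL_BASE = 0.50
_SAL_CORROB_PER_SOURCE = 0.10
_SAL_CORROB_CAP = 0.30
_SAL_FLOOR = 0.01
_SAL_CEIL = 1.00
_SAL_PENALTIES = {
    "noise_name": 0.45,
    "numeric_id_person": 0.45,
    "hallucinated_email": 0.40,
    "ungrounded": 0.35,
    "subject_undeclared": 0.25,
    "low_signal": 0.15,
}


def born_salience(n_sources: int = 1,
                  quality_flags: Optional[List[str]] = None) -> float:
    s = _SAL_BASE
    if n_sources > 1:
        # Each extra source adds 0.10, capped at +0.30 in total.
        s += min(_SAL_CORROB_CAP, _SAL_CORROB_PER_SOURCE * (n_sources - 1))
    for flag in quality_flags or []:
        # Unknown flags carry no penalty.
        s -= _SAL_PENALTIES.get(flag, 0.0)
    return round(max(_SAL_FLOOR, min(_SAL_CEIL, s)), 4)


if __name__ == "__main__":
    print(born_salience())                                     # 0.5
    print(born_salience(3))                                    # 0.7
    print(born_salience(100))                                  # 0.8
    print(born_salience(1, ["numeric_id_person"]))             # 0.05
    print(born_salience(1, ["subject_undeclared", "low_signal"]))  # 0.1
    print(born_salience(1, ["noise_name"] * 5))                # 0.01
```

Two properties follow from the arithmetic: corroboration alone tops out at 0.8, so the ceiling of 1.0 is never reached at birth, and repeated or stacked penalties clamp at the 0.01 floor rather than going negative.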

package/packages/memory-engine-v2/extractor-async/test_born_salience_parity.py ADDED

@@ -0,0 +1,35 @@
+"""Parity guard: confidence.born_salience (worker, copied into the container)
+must stay byte-equivalent to fusion_drive/salience.born_salience (the decay
+side). Same pattern as test_entity_id_parity.py — the two live across a Docker
+build-context boundary and would silently drift otherwise."""
+from __future__ import annotations
+import os
+import sys
+import confidence as worker
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "fusion_drive"))
+import salience as drive  # noqa: E402
+def test_constants_match():
+    assert worker._SAL_BASE == drive.BASE_SALIENCE
+    assert worker._SAL_CORROB_PER_SOURCE == drive.CORROB_PER_SOURCE
+    assert worker._SAL_CORROB_CAP == drive.CORROB_CAP
+    assert worker._SAL_FLOOR == drive.SALIENCE_FLOOR
+    assert worker._SAL_CEIL == drive.SALIENCE_CEIL
+    assert worker._SAL_PENALTIES == drive.QUALITY_PENALTIES
+def test_output_matches_across_input_matrix():
+    flagsets = [
+        None, [], ["noise_name"], ["numeric_id_person"], ["hallucinated_email"],
+        ["ungrounded"], ["subject_undeclared"], ["low_signal"],
+        ["numeric_id_person", "hallucinated_email", "ungrounded"],
+        ["noise_name"] * 5,
+    ]
+    for n in (1, 2, 3, 5, 100):
+        for flags in flagsets:
+            assert worker.born_salience(n, flags) == drive.born_salience(n_sources=n, quality_flags=flags), (n, flags)

package/packages/memory-engine-v2/extractor-async/worker.py CHANGED

@@ -39,7 +39,7 @@ import httpx
 import psycopg
 import psycopg.rows
-from confidence import corroborated_confidence
+from confidence import born_salience, corroborated_confidence
 from entity_id import entity_id, normalize_surface_form
 from extraction_schema import (
     ALLOWED_ENT_TYPES,
@@ -782,6 +782,15 @@ def _content_id(*parts: str) -> str:
     return hashlib.sha256("\x1f".join(parts).encode()).hexdigest()[:32]
+def _digit_ratio(s: str) -> float:
+    """Fraction of non-whitespace chars that are digits. Used to flag
+    numeric-ID-as-person junk for Fusion Drive born-salience."""
+    stripped = "".join(s.split())
+    if not stripped:
+        return 0.0
+    return sum(c.isdigit() for c in stripped) / len(stripped)
 def upsert_entities(
     conn: psycopg.Connection,
     arena: str,
@@ -883,12 +892,20 @@ def upsert_entities(
             else:
                 # 3b. No match — insert new.
                 eid = entity_id(arena, etype, name)
+                # Fusion Drive born-salience: a numeric-ID-as-person (classic
+                # 7B junk that slips past noise_filter, e.g. "1716801984") is
+                # born near the floor so the decay pass can evict it on a short
+                # clock instead of the multi-year entity default.
+                _qflags = []
+                if etype == "person" and _digit_ratio(name) > 0.5:
+                    _qflags.append("numeric_id_person")
+                _sal = born_salience(1, _qflags)
                 cur.execute(
                     """
                     INSERT INTO entities (
                       id, arena, entity_type, canonical_name, aliases,
-                      provenance_event_ids, participant_set, disclosure_class
-                    ) VALUES (%s, %s, %s, %s, %s, %s, %s, %s::disclosure_class)
+                      provenance_event_ids, participant_set, disclosure_class, salience
+                    ) VALUES (%s, %s, %s, %s, %s, %s, %s, %s::disclosure_class, %s)
                     ON CONFLICT (id) DO UPDATE SET
                       aliases = (
                         SELECT ARRAY(SELECT DISTINCT UNNEST(entities.aliases || EXCLUDED.aliases))
@@ -896,11 +913,13 @@ def upsert_entities(
                       provenance_event_ids = (
                         SELECT ARRAY(SELECT DISTINCT UNNEST(entities.provenance_event_ids || EXCLUDED.provenance_event_ids))
                       ),
+                      -- re-corroboration can only RAISE salience, never lower it
+                      salience = GREATEST(entities.salience, EXCLUDED.salience),
                       last_seen = NOW()
                     """,
                     (
                         eid, arena, etype, name, aliases,
-                        [event_id], participant_set, disclosure_class,
+                        [event_id], participant_set, disclosure_class, _sal,
                     ),
                 )
             name_to_id[name] = eid
@@ -942,15 +961,24 @@ def upsert_facts(
                 continue
             subj_name = f.get("subject")
             obj_name = f.get("object")
+            # Fusion Drive born-salience: a fact whose subject isn't among the
+            # event's declared entities (ungrounded subject) or that's barely
+            # a sentence is born low so decay can clear it. n_sources=1 here.
+            _fflags = []
+            if subj_name and not name_to_id.get(subj_name):
+                _fflags.append("subject_undeclared")
+            if len(stmt) < 60:
+                _fflags.append("low_signal")
+            _fsal = born_salience(1, _fflags)
             cur.execute(
                 """
                 INSERT INTO facts (
                   id, arena, category, subject_entity_id, predicate,
                   object_entity_id, statement, provenance_event_ids,
-                  stage, confidence, participant_set, disclosure_class
+                  stage, confidence, participant_set, disclosure_class, salience
                 ) VALUES (
                   %s, %s, %s, %s, %s, %s, %s, %s,
-                  'provisional'::extraction_stage, %s, %s, %s::disclosure_class
+                  'provisional'::extraction_stage, %s, %s, %s::disclosure_class, %s
                 )
                 ON CONFLICT (id) DO UPDATE SET
                   provenance_event_ids = (
@@ -958,6 +986,7 @@ def upsert_facts(
                       facts.provenance_event_ids || EXCLUDED.provenance_event_ids
                     ))
                   ),
+                  salience = GREATEST(facts.salience, EXCLUDED.salience),
                   -- Confidence bumps with each additional independent
                   -- source. The cardinality of the merged provenance
                   -- array IS the corroboration count, so the formula
@@ -990,6 +1019,7 @@ def upsert_facts(
                     float(f.get("confidence") or corroborated_confidence(1)),
                     participant_set,
                     disclosure_class,
+                    _fsal,
                 ),
             )
             inserted += 1
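
The quality-flag logic that the worker.py hunks above add inline can be pulled out into two small functions to see which inputs trip which flag. `digit_ratio` mirrors `_digit_ratio` from the diff; `entity_flags` and `fact_flags` are hypothetical wrapper names introduced only for this sketch, and the sample inputs (including the `"organization"` entity type) are assumptions.

```python
from typing import Dict, List, Optional


def digit_ratio(s: str) -> float:
    """Fraction of non-whitespace chars that are digits (as in the diff)."""
    stripped = "".join(s.split())
    if not stripped:
        return 0.0
    return sum(c.isdigit() for c in stripped) / len(stripped)


def entity_flags(etype: str, name: str) -> List[str]:
    """Flags for a newly inserted entity: only persons are checked."""
    flags = []
    if etype == "person" and digit_ratio(name) > 0.5:
        flags.append("numeric_id_person")
    return flags


def fact_flags(subject: Optional[str], statement: str,
               name_to_id: Dict[str, str]) -> List[str]:
    """Flags for a fact: undeclared subject and/or very short statement."""
    flags = []
    if subject and not name_to_id.get(subject):
        flags.append("subject_undeclared")
    if len(statement) < 60:
        flags.append("low_signal")
    return flags


if __name__ == "__main__":
    print(entity_flags("person", "1716801984"))    # ['numeric_id_person']
    print(entity_flags("person", "Katie Cooper"))  # []
    print(digit_ratio("Agent 007"))                # 0.375
    print(fact_flags("Katie Cooper", "Katie joined Acme.", {}))
```

Because both upserts use `GREATEST(existing.salience, EXCLUDED.salience)` on conflict, a flagged re-extraction of an existing row cannot pull its salience down; the flags only matter for the value a row is born with, or when a later extraction scores higher.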

package/packages/memory-engine-v2/fusion_drive/__init__.py ADDED

File without changes