omnius 1.0.41 → 1.0.43

package/README.md CHANGED
@@ -103,6 +103,7 @@ An autonomous multi-turn tool-calling agent that reads your code, makes changes,
103
103
  - [Zettelkasten Linking (A-MEM)](#zettelkasten-linking-a-mem)
104
104
  - [PPR Retrieval (HippoRAG)](#ppr-retrieval-hipporag)
105
105
  - [Cross-Modal Binding](#cross-modal-binding)
106
+ - [Scoped Visual Identity Recall](#scoped-visual-identity-recall)
106
107
  - [Gist Compression](#gist-compression)
107
108
  - [Near-Critical Cognitive Architecture](#near-critical-cognitive-architecture)
108
109
  - [Cross‑Modality Identity & Association (CLIP + Voice)](#crossmodality-identity--association-clip--voice)
@@ -240,7 +241,7 @@ An LLM is a high-bandwidth associative generative core — closer to a cortex-li
240
241
  |---|---|---|
241
242
  | Associative core | Cortex | LLM weights (any size) |
242
243
  | Current workspace | Global workspace / attention | `assembleContext()` — structured context assembly |
243
- | Episodic memory | Hippocampus | `.omnius/memory/` — write, search, retrieve across sessions |
244
+ | Episodic memory | Hippocampus | `.omnius/episodes.db` + `.omnius/knowledge.db` — write, search, retrieve, and link across sessions |
244
245
  | Cognitive map | Hippocampal spatial maps | `semantic-map.ts` + `repo-map.ts` (PageRank) |
245
246
  | Action gating | Basal ganglia | Tool selection policy (task-aware filtering) |
246
247
  | Temporal hierarchy | Prefrontal executive | Task decomposition, sub-agent delegation |
@@ -303,7 +304,8 @@ Omnius includes background workers that compute and associate embeddings across
303
304
  - Visual embeddings: CLIP ViT-B/32 (OpenCLIP) image embeddings for episodes with `modality: "visual"`.
304
305
  - Audio embeddings: speaker embeddings (ECAPA) when available; automatic fallback to normalized log‑mel in constrained environments.
305
306
  - Transcription: Whisper runs automatically for audio ingests; transcripts are stored as text episodes and embedded for retrieval.
306
- - Associations: `appears_in` for visual presence, `said_by` for transcripts, and `alias_of` for alternate labels (e.g., username + display name). Workers also link visual episodes to nearby transcripts via a time-window co‑occurrence pass.
307
+ - Associations: `appears_in` for visual presence, `said_by` for transcripts, `depicts` / `named_as` / `same_person_candidate` for identity evidence, and `alias_of` for alternate labels (e.g., username + display name). Workers also link visual episodes to nearby transcripts via a time-window co‑occurrence pass.
308
+ - Scoped visual identity recall: image ingress in TUI, GUI, Telegram private chats, and Telegram groups runs structured face identification against prior explicit enrollments. If a known face matches, Omnius injects a same-scope recall block and commits graph evidence; if a face is unknown, it nudges the agent to ask who it is instead of guessing.
307
309
 
308
310
  Config (env vars):
309
311
 
@@ -350,7 +352,7 @@ The daemon auto-installs Python dependencies (OpenCLIP, torchaudio + soundfile,
350
352
  - **Mid-task steering** — type while the agent works to add context without interrupting
351
353
  - **Smart compaction** — 6 context compaction strategies (default, aggressive, decisions, errors, summary, structured) with ARC-inspired active context revision ([arXiv:2601.12030](https://arxiv.org/abs/2601.12030)) that preserves structural file content through compaction, preventing small-model repetitive loops at the root cause. Success signals and content previews survive compaction so models never lose evidence that tools succeeded
352
354
  - **Memex experience archive** — large tool outputs archived during compaction with hash-based retrieval
353
- - **Persistent memory** — learned patterns stored in `.omnius/memory/` across sessions
355
+ - **Persistent memory** — learned patterns, episodes, and temporal graph evidence are stored under `.omnius/` across sessions (`episodes.db`, `knowledge.db`, and specialized `.omnius/memory/` stores for procedural and subsystem memory)
354
356
  - **Structured procedural memory (SQLite)** — replaces flat JSON with a full relational database: CRUD with soft-delete, revision tracking, embedding storage (float32 BLOB), bidirectional memory linking with confidence scores. Inspired by [ExpeL](https://arxiv.org/abs/2308.10144) (contrastive extraction) and [TIMG](https://arxiv.org/abs/2603.10600) (structured procedural format). 79 unit tests
355
357
  - **Semantic memory search** — vector embeddings via [Ollama /api/embed](https://ollama.com) (nomic-embed-text, 768-dim) with cosine similarity search over stored memories. Auto-generates embeddings on memory creation. Auto-links related memories when similarity > 0.6. Graceful fallback to text search when Ollama unavailable
356
358
  - **LLM-based memory extraction** — post-task, the LLM itself extracts structured procedural memories (CATEGORY/TRIGGER/LESSON/STEPS) instead of copying raw error text verbatim. Based on [ExpeL](https://arxiv.org/abs/2308.10144) and [AWM](https://arxiv.org/abs/2409.07429) patterns
@@ -979,8 +981,11 @@ Also cleans up the Docker container if the job was spawned with `"sandbox":"cont
979
981
  | GET | `/v1/memory` | read | Memory backends summary |
980
982
  | POST | `/v1/memory/search` | read | Vector + keyword search |
981
983
  | POST | `/v1/memory/write` | run | Write a memory entry |
984
+ | POST | `/v1/memory/ingest` | run | Structured multimodal ingest for visual/audio/text media. Writes episodes + temporal graph atoms and returns scoped visual identity recall metadata when a known face matches. |
985
+ | GET | `/v1/memory/entities` | read | List temporal graph entities, including stored `person:` identity nodes |
982
986
  | GET | `/v1/memory/episodes` | read | Paginated episode list |
983
987
  | GET | `/v1/memory/failures` | read | Paginated failure list |
988
+ | POST | `/v1/chat/attachments` | run | Browser chat attachment upload. Saves media under `.omnius/gui-attachments/`, ingests it with GUI scope, and returns a context block for the next chat turn. |
984
989
  | GET | `/v1/skills` | read | List AIWG + custom skills (paginated) |
985
990
  | GET | `/v1/skills/:name` | read | Skill content |
986
991
  | GET | `/v1/mcps` | read | List MCP servers |
@@ -1325,8 +1330,30 @@ curl -s 'http://127.0.0.1:11435/v1/memory/episodes?limit=10'
1325
1330
 
1326
1331
  # Paginated failure store (anti-patterns)
1327
1332
  curl -s 'http://127.0.0.1:11435/v1/memory/failures?limit=10'
1333
+
1334
+ # Structured multimodal ingest (visual/audio/text)
1335
+ curl -s -X POST http://127.0.0.1:11435/v1/memory/ingest \
1336
+ -d '{"sourceSurface":"api","scope":{"kind":"gui","id":"demo"},"modality":"visual","media_path":"/abs/path/person.jpg","media_type":"photo"}'
1337
+
1338
+ # Stored graph identity/entity nodes
1339
+ curl -s 'http://127.0.0.1:11435/v1/memory/entities?type=person&limit=25'
1340
+ ```
1341
+
1342
+ `/v1/memory/ingest` writes through the same `MultimodalIdentityService` used by Telegram, TUI, and GUI attachments. Visual media is stored as an episode and linked into the temporal graph with explicit `scope`, `sender`, `message`, `replyTo`, and `media` atoms. When `visual_memory identify` returns a structured match against a previously enrolled face, the response includes:
1343
+
1344
+ ```json
1345
+ {
1346
+ "visualIdentity": {
1347
+ "matches": [{"name": "Cole", "confidence": 0.91}],
1348
+ "recalledEpisodes": [{"content": "Alice named this person as Cole."}],
1349
+ "committedEpisodeIds": ["..."],
1350
+ "contextBlock": "## Scoped Visual Identity Recall\n..."
1351
+ }
1352
+ }
1328
1353
  ```
1329
1354
 
1355
+ No identity is guessed from captions. New person names are stored only when the agent explicitly calls `identity_memory` from user intent, or when a previously staged next-image identity assertion is consumed in the same scope.
1356
+
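A client for the ingest call above might build the request body and read the `visualIdentity` block roughly as follows. This is a minimal sketch, not the package's own client: the helper names `build_ingest_payload` and `confident_names` and the 0.8 confidence threshold are assumptions made for illustration, and no network request is sent.

```python
import json
from typing import Any


def build_ingest_payload(media_path: str, scope_kind: str, scope_id: str,
                         source_surface: str = "api") -> dict[str, Any]:
    """Build a request body shaped like the curl example for /v1/memory/ingest."""
    return {
        "sourceSurface": source_surface,
        "scope": {"kind": scope_kind, "id": scope_id},
        "modality": "visual",
        "media_path": media_path,
        "media_type": "photo",
    }


def confident_names(response: dict[str, Any], threshold: float = 0.8) -> list[str]:
    """Return matched names at or above a caller-chosen threshold (hypothetical policy)."""
    identity = response.get("visualIdentity") or {}
    return [m["name"] for m in identity.get("matches", [])
            if m.get("confidence", 0.0) >= threshold]


# Serialize the body exactly as it would be sent with `curl -d`.
payload = build_ingest_payload("/abs/path/person.jpg", "gui", "demo")
body = json.dumps(payload)

# A response shaped like the JSON example in the README text.
sample_response = {
    "visualIdentity": {
        "matches": [{"name": "Cole", "confidence": 0.91}],
        "recalledEpisodes": [{"content": "Alice named this person as Cole."}],
    }
}
names = confident_names(sample_response)
```

Treating a missing `visualIdentity` key as "no match" keeps the caller from guessing an identity when the recognizer returned nothing.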
1330
1357
  **Example search response** — search returns real episode records with timestamps, content, importance scores, and retrieval counts:
1331
1358
 
1332
1359
  ```json
@@ -1713,6 +1740,7 @@ Open `http://localhost:11435/` in a browser when `omnius serve` is running. Zero
1713
1740
  - Model picker populated from `/v1/models`
1714
1741
  - API key support (stored in localStorage)
1715
1742
  - System prompt (collapsible textarea)
1743
+ - Chat attachment upload through `/v1/chat/attachments`; images are saved under `.omnius/gui-attachments/`, ingested with GUI session scope, and can return scoped visual identity recall context before the next agent turn
1716
1744
  - Markdown rendering with code block copy buttons
1717
1745
  - Docker sandbox toggle (native vs container execution)
1718
1746
  - Workspace sidebar (toggleable file tree)
@@ -2113,6 +2141,7 @@ On startup and `/model` switch, Omnius detects your RAM/VRAM and creates an opti
2113
2141
  | `memory_read` | Read from persistent memory store by topic and key |
2114
2142
  | `memory_write` | Store facts/patterns in persistent memory with provenance tracking |
2115
2143
  | `memory_search` | Semantic search across all memory entries by query |
2144
+ | `identity_memory` | Scoped multimodal identity memory. Explicitly assert current-media identity, stage a name for the next same-scope image, identify enrolled faces, and recall graph evidence without regex name guessing |
2116
2145
  | `memex_retrieve` | Recover full tool output archived during context compaction by hash ID |
2117
2146
  | **Git & Diagnostics** | |
2118
2147
  | `diagnostic` | Lint/typecheck/test/build validation pipeline in one call |
@@ -2161,7 +2190,7 @@ On startup and `/model` switch, Omnius detects your RAM/VRAM and creates an opti
2161
2190
  | `audio_analyze` | Audio scene analysis — YAMNet 521-class classification (AudioSet taxonomy), Silero VAD voice activity detection, FFT spectrum analysis with peak frequency detection |
2162
2191
  | `asr_listen` | Record from microphone and transcribe speech to text — combines audio capture + Whisper ASR in one call. Uses PipeWire (bluetooth/USB) → faster-whisper → openai-whisper backends |
2163
2192
  | **Visual Intelligence** | |
2164
- | `visual_memory` | Face recognition + object memory — InsightFace ArcFace 512d face enrollment/identification, CLIP ViT-B/32 object teaching/recognition. Persistent face+object databases in `.omnius/visual-memory/` |
2193
+ | `visual_memory` | Face recognition + object memory — InsightFace ArcFace 512d face enrollment/identification, CLIP ViT-B/32 object teaching/recognition. `detect`, `identify`, and `recognize` support `format=json` for machine-readable memory plumbing. Persistent face+object databases in `~/.omnius/visual-memory/` |
2165
2194
  | `multimodal_memory` | Cross-modal episode binding — captures face + voice + text + location into unified episodes. Actions: capture (photo+audio), meet (register person with name+face+voice), recall (associative retrieval), timeline (chronological query) |
2166
2195
  | **Associative Memory** | |
2167
2196
  | `episode_store` | SQLite episode store with triple-factor scoring (recency x importance x relevance), 4-class temporal decay (session/daily/procedural/permanent), Ebbinghaus strengthening on retrieval |
@@ -2228,6 +2257,9 @@ The agent can access physical hardware — cameras, microphones, and speakers
2228
2257
  | Transcribe audio file | `asr_listen` action=transcribe file="rec.wav" | Whisper transcription |
2229
2258
  | Enroll a face | `visual_memory` action=enroll name="Alice" image="photo.jpg" | Face database entry |
2230
2259
  | Identify faces | `visual_memory` action=identify image="photo.jpg" | Known face matches |
2260
+ | Remember current image identity | `identity_memory` action=assert_identity name="Alice" media="latest" | Scoped graph evidence + face enrollment attempt |
2261
+ | Name the next image | `identity_memory` action=stage_identity name="Alice" | Pending same-scope assertion consumed by later image ingress |
2262
+ | Ask who is in an image | `identity_memory` action=identify media="reply" | Prior enrolled face match + scoped recall context |
2231
2263
  | Teach an object | `visual_memory` action=teach label="coffee_mug" image="obj.jpg" | CLIP object memory |
2232
2264
  | Meet a person | `multimodal_memory` action=meet name="Bob" | Photo+voice+text episode |
2233
2265
  | Recall a person | `multimodal_memory` action=recall query="Bob" | Associative memory search |
@@ -2245,7 +2277,7 @@ The agent can access physical hardware — cameras, microphones, and speakers
2245
2277
 
2246
2278
  **Mesh/GPS/SDR**: Auto-installs dependencies when hardware is detected. Meshtastic creates a Python venv with the CLI. GPS auto-probes NMEA at multiple baud rates. RTL-SDR auto-blacklists kernel modules and installs udev rules via pkexec.
2247
2279
 
2248
- **Visual Intelligence**: `visual_memory` provides persistent face recognition (InsightFace ArcFace 512d) and object memory (CLIP ViT-B/32). `multimodal_memory` binds all modalities into cross-session episodes with associative recall.
2280
+ **Visual Intelligence**: `visual_memory` provides persistent face recognition (InsightFace ArcFace 512d) and object memory (CLIP ViT-B/32). `identity_memory` is the agent-facing scoped layer that records explicit user-provided names, stages "next image is X" chronology, asks who unknown people are when identity matters, and recalls same-scope graph evidence. `multimodal_memory` binds all modalities into cross-session episodes with associative recall.
2249
2281
 
2250
2282
 
2251
2283
  ## Model Context Protocol (MCP)
@@ -3567,7 +3599,7 @@ While the sub-agent is working, users see:
3567
3599
 
3568
3600
  ### Public User Isolation
3569
3601
 
3570
- Public users get **per-chat isolated memory** — each chat has its own scoped memory namespace (`telegram-{chatId}-{topic}`) so public users can store and retrieve facts about their conversation without accessing or polluting global agent memory. Public tools include: `memory_read`, `memory_write` (scoped), `memory_search`, `web_search`, `web_fetch`, and scoped minimal reminders via `reminder`/`remind`.
3602
+ Public users get **per-chat isolated memory** — each chat is stored with explicit multimodal scope (`scope.kind = "group"|"private"`, `scope.id = chatId`) so public users can store and retrieve facts about their conversation without accessing or polluting unrelated chat memory. Public tools include: `memory_read`, `memory_write` (scoped), `memory_search`, `identity_memory` (scoped explicit identity evidence), `web_search`, `web_fetch`, and scoped minimal reminders via `reminder`/`remind`.
3571
3603
 
3572
3604
  The bridge also maintains a per-chat conversation state file with recent history, participants, relationship signals, and lightweight Zettelkasten memory cards. Each Telegram group or private chat gets its own scoped personality document under `.omnius/scoped-personality/telegram-chat/`; that profile is updated as people talk and injected into future Telegram context so tone, pacing, names, and relationships stay available turn to turn.
3573
3605
 
@@ -3626,14 +3658,16 @@ The bridge distinguishes between **private DMs** and **group/supergroup chats**,
3626
3658
 
3627
3659
  Photos, audio, voice messages, video, video notes, and documents sent via Telegram are automatically downloaded and processed:
3628
3660
 
3629
- 1. **Download** — files are fetched via the Telegram `getFile` API and cached to `.omnius/media-cache/`
3661
+ 1. **Download** — files are fetched via the Telegram `getFile` API and cached to `.omnius/telegram-media-cache/`
3630
3662
  2. **Processing** — routed to the appropriate pipeline:
3631
- - Images → `vision` / `image_read` / `ocr` tools
3663
+ - Images → vision ingress (`vision` / OCR context), multimodal memory ingest, and scoped visual identity association
3632
3664
  - Audio/voice → `transcribe_file` tool
3633
3665
  - Video/video notes → `transcribe_file` (audio track extraction)
3634
3666
  - Documents → `pdf_to_text` / `ocr_pdf` for PDFs, `file_read` for text
3635
- 3. **Context injection** — processing results are prepended to the user's message as additional context for the sub-agent
3636
- 4. **Cache cleanup** — media files are cached for 30 minutes, then automatically deleted. Only metadata (filename, type, chat ID, timestamp, processing result summary) is persisted long-term per chat
3667
+ 3. **Structured memory ingest** — media is posted to `/v1/memory/ingest` with `sourceSurface`, `scope`, `sender`, `message`, `replyTo`, `media`, transcript or extracted visual context, and Telegram chat/message IDs. If the daemon is unavailable, the bridge falls back to local scoped identity association.
3668
+ 4. **Identity recall** — images run `visual_memory identify` with `format=json`. Prior enrolled face matches inject a `Scoped Visual Identity Recall` block and commit `same_person_candidate` / `depicts` graph evidence. Pending same-scope `identity_memory action="stage_identity"` assertions are consumed by the next image and enrolled. Unknown faces inject a prompt for the agent to ask who the person is when relevant.
3669
+ 5. **Context injection** — processing results, reply relationship data, and identity recall blocks are prepended to the user's message as additional context for the sub-agent
3670
+ 6. **Cache cleanup** — media files are cached for 30 minutes, then automatically deleted. Only scoped metadata (filename, type, chat ID, message ID, sender, processing summary, identity graph evidence) is persisted long-term per chat
3637
3671
 
3638
3672
  ### Rate Limit Handling
3639
3673
 
@@ -3922,15 +3956,17 @@ Omnius implements a full associative memory system inspired by hippocampal episo
3922
3956
  ┌─────────────────────────────────────────────────────────────────┐
3923
3957
  │ Associative Memory Pipeline │
3924
3958
  │ │
3925
- │ Tool Call → Episode Store → Temporal KG → Zettelkasten Links
3926
- │ │
3927
- Triple-Factor Entity Edges Neighbor Discovery
3928
- Scoring (Graphiti) (A-MEM cosine)
3929
- │ │
3930
- └───── PPR Retrieval ───────────┘
3931
- (HippoRAG)
3932
-
3933
- Context Injection (every 3 turns)
3959
+ │ Tool Call / Media Ingest → Episode Store → Temporal KG
3960
+ │ │
3961
+ Triple-Factor Entity/Scope Zettelkasten Links
3962
+ Scoring Edges (Graphiti) (A-MEM cosine)
3963
+ │ │
3964
+ ├──── Multimodal Identity Service ────┐
3965
+ (sender/message/media/person)
3966
+ └───── PPR Retrieval ─────────────────┘
3967
+ │ (HippoRAG)
3968
+ │ │ │
3969
+ │ Scoped Context Injection + Recall │
3934
3970
  └─────────────────────────────────────────────────────────────────┘
3935
3971
  ```
3936
3972
 
@@ -3944,6 +3980,7 @@ Every tool call generates an episode stored in SQLite with WAL journal mode:
3944
3980
  | `importance` | 0-10 scale (errors=8, file edits=6, reads=3) |
3945
3981
  | `decay_class` | session (1h), daily (1d), procedural (30d), permanent (∞) |
3946
3982
  | `embedding` | 384d vector for semantic similarity |
3983
+ | `clip_embedding` | OpenCLIP-compatible image/text vector for cross-modal retrieval when available |
3947
3984
  | `strength` | Ebbinghaus curve — increases on each retrieval |
3948
3985
 
3949
3986
  **Scoring**: `score = recency_weight × importance × relevance` — the triple-factor model from [Generative Agents (Park et al., 2023)](https://arxiv.org/abs/2304.03442).
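The scoring formula can be illustrated with a small sketch. The README text gives only the product `recency_weight × importance × relevance` and the decay-class labels; the exponential form below, and the reading of each label (1h, 1d, 30d) as a half-life, are assumptions made for illustration rather than the package's actual decay function.

```python
import math

# Assumed half-lives in seconds, one per decay class from the table above.
HALF_LIFE_S = {
    "session": 3600.0,
    "daily": 86400.0,
    "procedural": 30 * 86400.0,
    "permanent": math.inf,
}


def recency_weight(age_s: float, decay_class: str) -> float:
    """Exponential decay: the weight halves every half-life; permanent never decays."""
    half_life = HALF_LIFE_S[decay_class]
    if math.isinf(half_life):
        return 1.0
    return 0.5 ** (age_s / half_life)


def score(age_s: float, decay_class: str, importance: float, relevance: float) -> float:
    """Triple-factor score: recency weight times importance times relevance."""
    return recency_weight(age_s, decay_class) * importance * relevance


# An error episode (importance 8) that is one hour old, with relevance 0.5.
one_hour_old_error = score(3600, "session", importance=8, relevance=0.5)
```

Under these assumptions a one-hour-old `session` episode keeps half of its weight, while a `permanent` episode of any age keeps all of it.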
@@ -3952,8 +3989,8 @@ Every tool call generates an episode stored in SQLite with WAL journal mode:
3952
3989
 
3953
3990
  Entities extracted from tool results form a temporal KG with [Graphiti](https://arxiv.org/abs/2501.13956)-style edges:
3954
3991
 
3955
- - **Nodes**: files, functions, errors, people, concepts — with `mention_count` and `last_seen`
3956
- - **Edges**: causal relationships (`modifies`, `calls`, `causes_error`, `met_person`) with `valid_from`/`valid_until` temporal bounds
3992
+ - **Nodes**: files, functions, errors, people, scopes, messages, media assets, concepts — with `mention_count` and `last_seen`
3993
+ - **Edges**: causal and identity relationships (`contains`, `authored_by`, `uploaded_by`, `replied_to`, `depicts`, `named_as`, `same_person_candidate`, `voice_sample_of`) with `valid_from`/`valid_until` temporal bounds
3957
3994
  - **Temporal queries**: "What was the state at time T?" via validity filtering
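Validity filtering for a "state at time T" query can be sketched as below. The `Edge` dataclass, the `edges_at` helper, the half-open interval convention, and the sample node IDs are illustrative assumptions; only the relation names and the `valid_from` / `valid_until` fields come from the list above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Edge:
    source: str
    relation: str
    target: str
    valid_from: float
    valid_until: Optional[float] = None  # None means the edge is still valid


def edges_at(edges: list[Edge], t: float) -> list[Edge]:
    """Keep edges whose validity interval [valid_from, valid_until) contains t."""
    return [
        e for e in edges
        if e.valid_from <= t and (e.valid_until is None or t < e.valid_until)
    ]


# Made-up sample data: a label that is superseded at t=200.
edges = [
    Edge("media:img1", "depicts", "person:cole", valid_from=100.0),
    Edge("person:cole", "named_as", "label:Cole", valid_from=100.0, valid_until=200.0),
    Edge("person:cole", "named_as", "label:Coleman", valid_from=200.0),
]

state_at_150 = edges_at(edges, 150.0)
```

Closing an edge by setting `valid_until` instead of deleting it is what lets the older label still answer queries about the past.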
3958
3995
 
3959
3996
  ### Zettelkasten Linking (A-MEM)
@@ -3972,6 +4009,25 @@ Retrieval uses [Personalized PageRank over the temporal KG](https://arxiv.org/ab
3972
4009
 
3973
4010
  This enables multi-hop retrieval: asking about "the auth bug" can surface episodes about the specific file, the test that caught it, and the person who reported it — even if those episodes don't share keywords.
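The multi-hop behaviour described here can be illustrated with a generic Personalized PageRank power iteration. This is a textbook sketch over a toy adjacency map, not the package's retrieval code; the graph, the node names, the damping factor of 0.85, and the iteration count are all assumptions.

```python
def personalized_pagerank(graph: dict[str, list[str]], seeds: set[str],
                          damping: float = 0.85, iterations: int = 50) -> dict[str, float]:
    """Power iteration where random restarts jump only to the seed nodes.

    Every node, including edge targets, must appear as a key in `graph`.
    """
    nodes = list(graph)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            out = graph[n]
            if out:
                share = damping * rank[n] / len(out)
                for m in out:
                    nxt[m] += share
            else:
                # Dangling node: hand its mass back to the seeds.
                for m in nodes:
                    nxt[m] += damping * rank[n] * restart[m]
        rank = nxt
    return rank


# Toy graph: the auth bug links to a file, a test, and a reporter;
# "readme" and "license" form an unrelated component.
graph = {
    "auth_bug": ["auth.ts", "auth.test.ts", "alice"],
    "auth.ts": ["auth_bug"],
    "auth.test.ts": ["auth_bug"],
    "alice": ["auth_bug"],
    "readme": ["license"],
    "license": ["readme"],
}
rank = personalized_pagerank(graph, seeds={"auth_bug"})
```

Nodes reachable from the seed receive rank even without shared keywords, while the disconnected component stays at zero.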
3974
4011
 
4012
+ ### Scoped Visual Identity Recall
4013
+
4014
+ Visual identity memory is deliberately split into two layers:
4015
+
4016
+ | Layer | Role | Storage |
4017
+ |-------|------|---------|
4018
+ | `visual_memory` | Local face/object recognizer. Enrolls and identifies faces with InsightFace ArcFace, teaches and recognizes objects with CLIP. Structured callers use `format=json` instead of parsing display text. | `~/.omnius/visual-memory/` |
4019
+ | `identity_memory` | Agent-facing scoped evidence layer. Records explicit user assertions, stages names for future images, identifies enrolled faces, and recalls graph evidence. | `.omnius/episodes.db` + `.omnius/knowledge.db` |
4020
+ | `MultimodalIdentityService` | Central graph writer for source surface, scope, sender, message, reply, media, identity assertions, embeddings, and transcript links. | `.omnius/episodes.db` + `.omnius/knowledge.db` |
4021
+
4022
+ Supported natural chronologies:
4023
+
4024
+ 1. **Image then name** — user sends an image, then says "this is Cole" or replies to the image with the name. The agent calls `identity_memory action="assert_identity" name="Cole" media="latest|reply"`, storing `named_as` / `depicts` graph evidence and attempting face enrollment.
4025
+ 2. **Name then image** — user says "the next image is Cole" before sending media. The agent calls `identity_memory action="stage_identity" name="Cole"`. The next same-scope image consumes that pending assertion, enrolls the face, and commits `depicts` evidence only after enrollment succeeds.
4026
+ 3. **Later image** — TUI clipboard/drop, GUI attachment upload, Telegram private chats, Telegram groups, and `/v1/memory/ingest` all run structured `visual_memory identify`. If an enrolled face matches, Omnius injects a `Scoped Visual Identity Recall` block with same-scope memories and commits `same_person_candidate` / `depicts` evidence for the new image.
4027
+ 4. **Unknown face** — if face detection sees a face but no enrolled identity matches, image ingress injects an `Unknown Visual Identity Candidate` block. The model is steered to ask who the person is only when identity matters to the user's task, and never to guess a real identity.
4028
+
4029
+ Scope is part of every write and recall. A Telegram group, Telegram DM, TUI terminal session, GUI chat session, and API caller each get their own `scope.kind` / `scope.id` boundary. The recognizer may know that a face matches "Cole", but related memory recall is filtered to the current scope/session unless a tool or policy explicitly broadens access.
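The "name then image" chronology and the scope boundary can be sketched with a small in-memory store. The class and method names are hypothetical, and one detail is an assumption: the text says `depicts` evidence is committed only after enrollment succeeds, but not what happens to the pending name when enrollment fails, so this sketch simply drops it.

```python
from dataclasses import dataclass, field
from typing import Optional

Scope = tuple[str, str]  # (scope.kind, scope.id)


@dataclass
class StagedIdentityStore:
    pending: dict[Scope, str] = field(default_factory=dict)
    evidence: list[tuple[Scope, str, str, str]] = field(default_factory=list)

    def stage_identity(self, scope: Scope, name: str) -> None:
        """Remember a name for the next image that arrives in this scope."""
        self.pending[scope] = name

    def on_image(self, scope: Scope, media_id: str, enroll_ok: bool) -> Optional[str]:
        """Consume a same-scope pending name; commit evidence only if enrollment worked."""
        name = self.pending.pop(scope, None)
        if name is None or not enroll_ok:
            return None
        self.evidence.append((scope, media_id, "depicts", name))
        return name


store = StagedIdentityStore()
store.stage_identity(("group", "42"), "Cole")

# An image in a different scope must not consume the staged name.
other_scope = store.on_image(("private", "7"), "img1", enroll_ok=True)
# The next image in the staging scope consumes it.
same_scope = store.on_image(("group", "42"), "img2", enroll_ok=True)
# A later image finds nothing pending.
later = store.on_image(("group", "42"), "img3", enroll_ok=True)
```

Keying every read and write by the `(kind, id)` pair is what keeps a name staged in one chat from attaching to an image sent in another.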
4030
+
3975
4031
  ### Cross-Modal Binding
3976
4032
 
3977
4033
  The `multimodal_memory` tool binds face, voice, text, and location into unified episodes:
@@ -3997,13 +4053,14 @@ Post-task, the [ReadAgent](https://arxiv.org/abs/2402.09727) gist compressor cre
3997
4053
 
3998
4054
  ### Cross‑Modality Identity & Association (CLIP + Voice)
3999
4055
 
4000
- Omnius binds entities across image, audio, and text using joint‑embedding models:
4056
+ Omnius binds entities across image, audio, and text using explicit evidence plus local embedding models:
4001
4057
 
4002
- - CLIP‑based visual ID: person/object embeddings extracted from frames are matched to persistent entity nodes; cosine similarity > τ promotes to identity with temporal smoothing. Supports multi‑appearance tracking and re‑identification across sessions.
4003
- - Voiceprint linkage: speaker embeddings (x‑vector/ECAPA) are associated with entities when co‑occurring in time with a visual track and a transcribed utterance; robust to background noise via median pooling across windows.
4004
- - Text label fusion: natural‑language labels (names, roles, tags) are bound to the same entity when co‑referents appear in proximate context windows (heuristics + clustering).
4005
- - Association graph: cross‑modal edges (image↔voice↔text) consolidate into a unified entity node with provenance (model, score, timestamp) and decay‑based confidence.
4006
- - Privacy & safety: raw media never leaves the machine; embeddings are stored locally under `.omnius/memory/`. Redaction controls can drop embeddings by label or recency.
4058
+ - Face identity: InsightFace ArcFace embeddings in `visual_memory` perform enrolled-face matching. Matches become graph evidence only through structured JSON results, never by parsing pretty tool output.
4059
+ - Object and scene association: CLIP/OpenCLIP vectors are stored as `clip_embedding` for visual/text retrieval and for taught object recognition through `visual_memory teach/recognize`.
4060
+ - Voice linkage: speaker embeddings and transcripts attach audio episodes to sender/speaker candidates when available; transcripts are stored as text episodes for retrieval.
4061
+ - Text labels: person names are stored from explicit agent-decided `identity_memory` calls (`assert_identity` for current media, `stage_identity` for next media), not regex shortcuts over captions.
4062
+ - Association graph: cross-modal edges (`depicts`, `named_as`, `same_person_candidate`, `voice_sample_of`, `said_by`, `replied_to`) consolidate into scoped entity neighborhoods with provenance, confidence, timestamp, and source episode IDs.
4063
+ - Privacy & safety: raw media and embeddings remain local. Episode and graph evidence live under `.omnius/`; the persistent visual face/object database lives under `~/.omnius/visual-memory/`.
4007
4064
 
4008
4065
  This enables queries like: “Find where Alex spoke about deployment,” “Show files edited after the person in the red sweater approved the PR,” or “Summarize conversations where Speaker‑B and Alice appear together.”
4009
4066