npm - omnius - Versions diffs - 1.0.41 → 1.0.43 - Mend

omnius 1.0.41 → 1.0.43

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (4)

package/README.md CHANGED

@@ -103,6 +103,7 @@ An autonomous multi-turn tool-calling agent that reads your code, makes changes,
   - [Zettelkasten Linking (A-MEM)](#zettelkasten-linking-a-mem)
   - [PPR Retrieval (HippoRAG)](#ppr-retrieval-hipporag)
   - [Cross-Modal Binding](#cross-modal-binding)
+  - [Scoped Visual Identity Recall](#scoped-visual-identity-recall)
   - [Gist Compression](#gist-compression)
   - [Near-Critical Cognitive Architecture](#near-critical-cognitive-architecture)
   - [Cross‑Modality Identity & Association (CLIP + Voice)](#crossmodality-identity--association-clip--voice)
@@ -240,7 +241,7 @@ An LLM is a high-bandwidth associative generative core — closer to a cortex-li
 |---|---|---|
 | Associative core | Cortex | LLM weights (any size) |
 | Current workspace | Global workspace / attention | `assembleContext()` — structured context assembly |
-| Episodic memory | Hippocampus | `.omnius/memory/` — write, search, retrieve across sessions |
+| Episodic memory | Hippocampus | `.omnius/episodes.db` + `.omnius/knowledge.db` — write, search, retrieve, and link across sessions |
 | Cognitive map | Hippocampal spatial maps | `semantic-map.ts` + `repo-map.ts` (PageRank) |
 | Action gating | Basal ganglia | Tool selection policy (task-aware filtering) |
 | Temporal hierarchy | Prefrontal executive | Task decomposition, sub-agent delegation |
@@ -303,7 +304,8 @@ Omnius includes background workers that compute and associate embeddings across
 - Visual embeddings: CLIP ViT-B/32 (OpenCLIP) image embeddings for episodes with `modality: "visual"`.
 - Audio embeddings: speaker embeddings (ECAPA) when available; automatic fallback to normalized log‑mel in constrained environments.
 - Transcription: Whisper runs automatically for audio ingests; transcripts are stored as text episodes and embedded for retrieval.
-- Associations: `appears_in` for visual presence, `said_by` for transcripts, and `alias_of` for alternate labels (e.g., username + display name). Workers also link visual episodes to nearby transcripts via a time-window co‑occurrence pass.
+- Associations: `appears_in` for visual presence, `said_by` for transcripts, `depicts` / `named_as` / `same_person_candidate` for identity evidence, and `alias_of` for alternate labels (e.g., username + display name). Workers also link visual episodes to nearby transcripts via a time-window co‑occurrence pass.
+- Scoped visual identity recall: image ingress in TUI, GUI, Telegram private chats, and Telegram groups runs structured face identification against prior explicit enrollments. If a known face matches, Omnius injects a same-scope recall block and commits graph evidence; if a face is unknown, it nudges the agent to ask who it is instead of guessing.
 Config (env vars):
@@ -350,7 +352,7 @@ The daemon auto-installs Python dependencies (OpenCLIP, torchaudio + soundfile,
 - **Mid-task steering** — type while the agent works to add context without interrupting
 - **Smart compaction** — 6 context compaction strategies (default, aggressive, decisions, errors, summary, structured) with ARC-inspired active context revision ([arXiv:2601.12030](https://arxiv.org/abs/2601.12030)) that preserves structural file content through compaction, preventing small-model repetitive loops at the root cause. Success signals and content previews survive compaction so models never lose evidence that tools succeeded
 - **Memex experience archive** — large tool outputs archived during compaction with hash-based retrieval
-- **Persistent memory** — learned patterns stored in `.omnius/memory/` across sessions
+- **Persistent memory** — learned patterns, episodes, and temporal graph evidence are stored under `.omnius/` across sessions (`episodes.db`, `knowledge.db`, and specialized `.omnius/memory/` stores for procedural and subsystem memory)
 - **Structured procedural memory (SQLite)** — replaces flat JSON with a full relational database: CRUD with soft-delete, revision tracking, embedding storage (float32 BLOB), bidirectional memory linking with confidence scores. Inspired by [ExpeL](https://arxiv.org/abs/2308.10144) (contrastive extraction) and [TIMG](https://arxiv.org/abs/2603.10600) (structured procedural format). 79 unit tests
 - **Semantic memory search** — vector embeddings via [Ollama /api/embed](https://ollama.com) (nomic-embed-text, 768-dim) with cosine similarity search over stored memories. Auto-generates embeddings on memory creation. Auto-links related memories when similarity > 0.6. Graceful fallback to text search when Ollama unavailable
 - **LLM-based memory extraction** — post-task, the LLM itself extracts structured procedural memories (CATEGORY/TRIGGER/LESSON/STEPS) instead of copying raw error text verbatim. Based on [ExpeL](https://arxiv.org/abs/2308.10144) and [AWM](https://arxiv.org/abs/2409.07429) patterns
@@ -979,8 +981,11 @@ Also cleans up the Docker container if the job was spawned with `"sandbox":"cont
 | GET | `/v1/memory` | read | Memory backends summary |
 | POST | `/v1/memory/search` | read | Vector + keyword search |
 | POST | `/v1/memory/write` | run | Write a memory entry |
+| POST | `/v1/memory/ingest` | run | Structured multimodal ingest for visual/audio/text media. Writes episodes + temporal graph atoms and returns scoped visual identity recall metadata when a known face matches. |
+| GET | `/v1/memory/entities` | read | List temporal graph entities, including stored `person:` identity nodes |
 | GET | `/v1/memory/episodes` | read | Paginated episode list |
 | GET | `/v1/memory/failures` | read | Paginated failure list |
+| POST | `/v1/chat/attachments` | run | Browser chat attachment upload. Saves media under `.omnius/gui-attachments/`, ingests it with GUI scope, and returns a context block for the next chat turn. |
 | GET | `/v1/skills` | read | List AIWG + custom skills (paginated) |
 | GET | `/v1/skills/:name` | read | Skill content |
 | GET | `/v1/mcps` | read | List MCP servers |
@@ -1325,8 +1330,30 @@ curl -s 'http://127.0.0.1:11435/v1/memory/episodes?limit=10'
 # Paginated failure store (anti-patterns)
 curl -s 'http://127.0.0.1:11435/v1/memory/failures?limit=10'
+# Structured multimodal ingest (visual/audio/text)
+curl -s -X POST http://127.0.0.1:11435/v1/memory/ingest \
+  -d '{"sourceSurface":"api","scope":{"kind":"gui","id":"demo"},"modality":"visual","media_path":"/abs/path/person.jpg","media_type":"photo"}'
+# Stored graph identity/entity nodes
+curl -s 'http://127.0.0.1:11435/v1/memory/entities?type=person&limit=25'
+```
+`/v1/memory/ingest` writes through the same `MultimodalIdentityService` used by Telegram, TUI, and GUI attachments. Visual media is stored as an episode, linked into the temporal graph with explicit `scope`, `sender`, `message`, `replyTo`, and `media` atoms, and, when `visual_memory identify` returns a structured prior-enrolled face match, the response includes:
+```json
+{
+  "visualIdentity": {
+    "matches": [{"name": "Cole", "confidence": 0.91}],
+    "recalledEpisodes": [{"content": "Alice named this person as Cole."}],
+    "committedEpisodeIds": ["..."],
+    "contextBlock": "## Scoped Visual Identity Recall\n..."
+  }
+}
 ```
+No identity is guessed from captions. New person names are stored only when the agent explicitly calls `identity_memory` from user intent, or when a previously staged next-image identity assertion is consumed in the same scope.
 **Example search response** — search returns real episode records with timestamps, content, importance scores, and retrieval counts:
 ```json
@@ -1713,6 +1740,7 @@ Open `http://localhost:11435/` in a browser when `omnius serve` is running. Zero
 - Model picker populated from `/v1/models`
 - API key support (stored in localStorage)
 - System prompt (collapsible textarea)
+- Chat attachment upload through `/v1/chat/attachments`; images are saved under `.omnius/gui-attachments/`, ingested with GUI session scope, and can return scoped visual identity recall context before the next agent turn
 - Markdown rendering with code block copy buttons
 - Docker sandbox toggle (native vs container execution)
 - Workspace sidebar (toggleable file tree)
@@ -2113,6 +2141,7 @@ On startup and `/model` switch, Omnius detects your RAM/VRAM and creates an opti
 | `memory_read` | Read from persistent memory store by topic and key |
 | `memory_write` | Store facts/patterns in persistent memory with provenance tracking |
 | `memory_search` | Semantic search across all memory entries by query |
+| `identity_memory` | Scoped multimodal identity memory. Explicitly assert current-media identity, stage a name for the next same-scope image, identify enrolled faces, and recall graph evidence without regex name guessing |
 | `memex_retrieve` | Recover full tool output archived during context compaction by hash ID |
 | **Git & Diagnostics** | |
 | `diagnostic` | Lint/typecheck/test/build validation pipeline in one call |
@@ -2161,7 +2190,7 @@ On startup and `/model` switch, Omnius detects your RAM/VRAM and creates an opti
 | `audio_analyze` | Audio scene analysis — YAMNet 521-class classification (AudioSet taxonomy), Silero VAD voice activity detection, FFT spectrum analysis with peak frequency detection |
 | `asr_listen` | Record from microphone and transcribe speech to text — combines audio capture + Whisper ASR in one call. Uses PipeWire (bluetooth/USB) → faster-whisper → openai-whisper backends |
 | **Visual Intelligence** | |
-| `visual_memory` | Face recognition + object memory — InsightFace ArcFace 512d face enrollment/identification, CLIP ViT-B/32 object teaching/recognition. Persistent face+object databases in `.omnius/visual-memory/` |
+| `visual_memory` | Face recognition + object memory — InsightFace ArcFace 512d face enrollment/identification, CLIP ViT-B/32 object teaching/recognition. `detect`, `identify`, and `recognize` support `format=json` for machine-readable memory plumbing. Persistent face+object databases in `~/.omnius/visual-memory/` |
 | `multimodal_memory` | Cross-modal episode binding — captures face + voice + text + location into unified episodes. Actions: capture (photo+audio), meet (register person with name+face+voice), recall (associative retrieval), timeline (chronological query) |
 | **Associative Memory** | |
 | `episode_store` | SQLite episode store with triple-factor scoring (recency x importance x relevance), 4-class temporal decay (session/daily/procedural/permanent), Ebbinghaus strengthening on retrieval |
@@ -2228,6 +2257,9 @@ The agent can access physical hardware — cameras, microphones, and speakers
 | Transcribe audio file | `asr_listen` action=transcribe file="rec.wav" | Whisper transcription |
 | Enroll a face | `visual_memory` action=enroll name="Alice" image="photo.jpg" | Face database entry |
 | Identify faces | `visual_memory` action=identify image="photo.jpg" | Known face matches |
+| Remember current image identity | `identity_memory` action=assert_identity name="Alice" media="latest" | Scoped graph evidence + face enrollment attempt |
+| Name the next image | `identity_memory` action=stage_identity name="Alice" | Pending same-scope assertion consumed by later image ingress |
+| Ask who is in an image | `identity_memory` action=identify media="reply" | Prior enrolled face match + scoped recall context |
 | Teach an object | `visual_memory` action=teach label="coffee_mug" image="obj.jpg" | CLIP object memory |
 | Meet a person | `multimodal_memory` action=meet name="Bob" | Photo+voice+text episode |
 | Recall a person | `multimodal_memory` action=recall query="Bob" | Associative memory search |
@@ -2245,7 +2277,7 @@ The agent can access physical hardware — cameras, microphones, and speakers
 **Mesh/GPS/SDR**: Auto-installs dependencies when hardware is detected. Meshtastic creates a Python venv with the CLI. GPS auto-probes NMEA at multiple baud rates. RTL-SDR auto-blacklists kernel modules and installs udev rules via pkexec.
-**Visual Intelligence**: `visual_memory` provides persistent face recognition (InsightFace ArcFace 512d) and object memory (CLIP ViT-B/32). `multimodal_memory` binds all modalities into cross-session episodes with associative recall.
+**Visual Intelligence**: `visual_memory` provides persistent face recognition (InsightFace ArcFace 512d) and object memory (CLIP ViT-B/32). `identity_memory` is the agent-facing scoped layer that records explicit user-provided names, stages "next image is X" chronology, asks who unknown people are when identity matters, and recalls same-scope graph evidence. `multimodal_memory` binds all modalities into cross-session episodes with associative recall.
 ## Model Context Protocol (MCP)
@@ -3567,7 +3599,7 @@ While the sub-agent is working, users see:
 ### Public User Isolation
-Public users get **per-chat isolated memory** — each chat has its own scoped memory namespace (`telegram-{chatId}-{topic}`) so public users can store and retrieve facts about their conversation without accessing or polluting global agent memory. Public tools include: `memory_read`, `memory_write` (scoped), `memory_search`, `web_search`, `web_fetch`, and scoped minimal reminders via `reminder`/`remind`.
+Public users get **per-chat isolated memory** — each chat is stored with explicit multimodal scope (`scope.kind = "group"|"private"`, `scope.id = chatId`) so public users can store and retrieve facts about their conversation without accessing or polluting unrelated chat memory. Public tools include: `memory_read`, `memory_write` (scoped), `memory_search`, `identity_memory` (scoped explicit identity evidence), `web_search`, `web_fetch`, and scoped minimal reminders via `reminder`/`remind`.
 The bridge also maintains a per-chat conversation state file with recent history, participants, relationship signals, and lightweight Zettelkasten memory cards. Each Telegram group or private chat gets its own scoped personality document under `.omnius/scoped-personality/telegram-chat/`; that profile is updated as people talk and injected into future Telegram context so tone, pacing, names, and relationships stay available turn to turn.
@@ -3626,14 +3658,16 @@ The bridge distinguishes between **private DMs** and **group/supergroup chats**,
 Photos, audio, voice messages, video, video notes, and documents sent via Telegram are automatically downloaded and processed:
-1. **Download** — files are fetched via the Telegram `getFile` API and cached to `.omnius/media-cache/`
+1. **Download** — files are fetched via the Telegram `getFile` API and cached to `.omnius/telegram-media-cache/`
 2. **Processing** — routed to the appropriate pipeline:
-   - Images → `vision` / `image_read` / `ocr` tools
+   - Images → vision ingress (`vision` / OCR context), multimodal memory ingest, and scoped visual identity association
    - Audio/voice → `transcribe_file` tool
    - Video/video notes → `transcribe_file` (audio track extraction)
    - Documents → `pdf_to_text` / `ocr_pdf` for PDFs, `file_read` for text
-3. **Context injection** — processing results are prepended to the user's message as additional context for the sub-agent
-4. **Cache cleanup** — media files are cached for 30 minutes, then automatically deleted. Only metadata (filename, type, chat ID, timestamp, processing result summary) is persisted long-term per chat
+3. **Structured memory ingest** — media is posted to `/v1/memory/ingest` with `sourceSurface`, `scope`, `sender`, `message`, `replyTo`, `media`, transcript or extracted visual context, and Telegram chat/message IDs. If the daemon is unavailable, the bridge falls back to local scoped identity association.
+4. **Identity recall** — images run `visual_memory identify` with `format=json`. Prior enrolled face matches inject a `Scoped Visual Identity Recall` block and commit `same_person_candidate` / `depicts` graph evidence. Pending same-scope `identity_memory action="stage_identity"` assertions are consumed by the next image and enrolled. Unknown faces inject a prompt for the agent to ask who the person is when relevant.
+5. **Context injection** — processing results, reply relationship data, and identity recall blocks are prepended to the user's message as additional context for the sub-agent
+6. **Cache cleanup** — media files are cached for 30 minutes, then automatically deleted. Only scoped metadata (filename, type, chat ID, message ID, sender, processing summary, identity graph evidence) is persisted long-term per chat
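The "stage a name, then consume it on the next same-scope image" behaviour in step 4 can be pictured as a small keyed store. This is a minimal sketch under stated assumptions: `PendingIdentityStore`, `Scope`, and the `kind:id` key format are hypothetical names for illustration, not Omnius's actual implementation, and the face enrollment that the bridge performs on consumption is left out.

```typescript
// Hypothetical in-memory model of staged identity assertions; all names are assumptions.
interface Scope {
  kind: string; // e.g. "group" or "private"
  id: string;   // e.g. the chat ID
}

class PendingIdentityStore {
  private pending = new Map<string, string>();

  private key(scope: Scope): string {
    return `${scope.kind}:${scope.id}`;
  }

  // Mirrors identity_memory action="stage_identity": remember a name for the next image.
  stage(scope: Scope, name: string): void {
    this.pending.set(this.key(scope), name);
  }

  // Called on image ingress: returns the staged name once, then clears it.
  consume(scope: Scope): string | undefined {
    const k = this.key(scope);
    const name = this.pending.get(k);
    this.pending.delete(k);
    return name;
  }
}

const store = new PendingIdentityStore();
store.stage({ kind: "group", id: "42" }, "Cole");

// An image arriving in a different scope does not see the staged name.
const otherScope = store.consume({ kind: "private", id: "7" });
// The next image in the staging scope consumes it exactly once.
const firstImage = store.consume({ kind: "group", id: "42" });
const secondImage = store.consume({ kind: "group", id: "42" });
```

Keying by scope keeps a name staged in one chat from being attached to an image that arrives in another, which is the isolation property the list above describes.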
 ### Rate Limit Handling
@@ -3922,15 +3956,17 @@ Omnius implements a full associative memory system inspired by hippocampal episo
 ┌─────────────────────────────────────────────────────────────────┐
 │                    Associative Memory Pipeline                   │
 │                                                                  │
-│  Tool Call → Episode Store → Temporal KG → Zettelkasten Links   │
-│                  │                │              │                │
-│            Triple-Factor    Entity Edges    Neighbor Discovery   │
-│            Scoring          (Graphiti)      (A-MEM cosine)      │
-│                  │                │              │                │
-│                  └───── PPR Retrieval ───────────┘                │
-│                         (HippoRAG)                               │
-│                              │                                   │
-│                    Context Injection (every 3 turns)             │
+│  Tool Call / Media Ingest → Episode Store → Temporal KG          │
+│             │                    │              │                 │
+│       Triple-Factor        Entity/Scope     Zettelkasten Links   │
+│       Scoring              Edges (Graphiti) (A-MEM cosine)       │
+│             │                    │              │                 │
+│             ├──── Multimodal Identity Service ────┐              │
+│             │      (sender/message/media/person)  │              │
+│             └───── PPR Retrieval ─────────────────┘              │
+│                    (HippoRAG)                                    │
+│                         │                                        │
+│              Scoped Context Injection + Recall                   │
 └─────────────────────────────────────────────────────────────────┘
 ```
@@ -3944,6 +3980,7 @@ Every tool call generates an episode stored in SQLite with WAL journal mode:
 | `importance` | 0-10 scale (errors=8, file edits=6, reads=3) |
 | `decay_class` | session (1h), daily (1d), procedural (30d), permanent (∞) |
 | `embedding` | 384d vector for semantic similarity |
+| `clip_embedding` | OpenCLIP-compatible image/text vector for cross-modal retrieval when available |
 | `strength` | Ebbinghaus curve — increases on each retrieval |
 **Scoring**: `score = recency_weight × importance × relevance` — the triple-factor model from [Generative Agents (Park et al., 2023)](https://arxiv.org/abs/2304.03442).
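As a rough illustration of the scoring formula above, the sketch below multiplies a recency weight by importance and relevance. Only the decay windows (1h, 1d, 30d, ∞) and the 0-10 importance scale come from the table; the exponential decay form and the names `Episode`, `recencyWeight`, and `scoreEpisode` are assumptions made for illustration.

```typescript
// Hypothetical sketch of triple-factor scoring; names and decay form are assumptions.
type DecayClass = "session" | "daily" | "procedural" | "permanent";

interface Episode {
  importance: number;   // 0-10 scale, as in the episode table
  decayClass: DecayClass;
  createdAtMs: number;  // creation time in epoch milliseconds
}

const HOUR_MS = 60 * 60 * 1000;

// Windows taken from the decay_class row; permanent never decays.
const DECAY_WINDOW_MS: Record<DecayClass, number> = {
  session: HOUR_MS,
  daily: 24 * HOUR_MS,
  procedural: 30 * 24 * HOUR_MS,
  permanent: Infinity,
};

// Assumed exponential decay: weight is 1 at age 0 and 1/e after one window.
function recencyWeight(ep: Episode, nowMs: number): number {
  const ageMs = Math.max(0, nowMs - ep.createdAtMs);
  return Math.exp(-ageMs / DECAY_WINDOW_MS[ep.decayClass]);
}

// score = recency_weight × importance × relevance, with relevance assumed in [0, 1].
function scoreEpisode(ep: Episode, relevance: number, nowMs: number): number {
  return recencyWeight(ep, nowMs) * ep.importance * relevance;
}
```

Under these assumptions, a one-hour-old `session` episode keeps about 37% of its weight, while a `permanent` episode of any age keeps all of it.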
@@ -3952,8 +3989,8 @@ Every tool call generates an episode stored in SQLite with WAL journal mode:
 Entities extracted from tool results form a temporal KG with [Graphiti](https://arxiv.org/abs/2501.13956)-style edges:
-- **Nodes**: files, functions, errors, people, concepts — with `mention_count` and `last_seen`
-- **Edges**: causal relationships (`modifies`, `calls`, `causes_error`, `met_person`) with `valid_from`/`valid_until` temporal bounds
+- **Nodes**: files, functions, errors, people, scopes, messages, media assets, concepts — with `mention_count` and `last_seen`
+- **Edges**: causal and identity relationships (`contains`, `authored_by`, `uploaded_by`, `replied_to`, `depicts`, `named_as`, `same_person_candidate`, `voice_sample_of`) with `valid_from`/`valid_until` temporal bounds
 - **Temporal queries**: "What was the state at time T?" via validity filtering
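The validity filtering behind a "state at time T" query can be sketched as a filter over edge intervals. The `TemporalEdge` shape, the `edgesAt` helper, the half-open interval convention, and the use of `null` for a still-valid edge are all assumptions for illustration; only the `valid_from` / `valid_until` bounds come from the text.

```typescript
// Hypothetical edge shape; field names mirror valid_from / valid_until from the text.
interface TemporalEdge {
  source: string;
  relation: string;
  target: string;
  validFrom: number;          // epoch ms when the fact became true
  validUntil: number | null;  // epoch ms when it stopped being true; null = still valid
}

// Return edges whose validity interval [validFrom, validUntil) contains time t.
function edgesAt(edges: TemporalEdge[], t: number): TemporalEdge[] {
  return edges.filter(
    (e) => e.validFrom <= t && (e.validUntil === null || t < e.validUntil),
  );
}

const sampleEdges: TemporalEdge[] = [
  { source: "media:1", relation: "depicts", target: "person:cole", validFrom: 100, validUntil: null },
  { source: "msg:2", relation: "replied_to", target: "msg:1", validFrom: 50, validUntil: 150 },
];

const stateAt120 = edgesAt(sampleEdges, 120); // both edges are valid
const stateAt200 = edgesAt(sampleEdges, 200); // only the open-ended depicts edge remains
```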
 ### Zettelkasten Linking (A-MEM)
@@ -3972,6 +4009,25 @@ Retrieval uses [Personalized PageRank over the temporal KG](https://arxiv.org/ab
 This enables multi-hop retrieval: asking about "the auth bug" can surface episodes about the specific file, the test that caught it, and the person who reported it — even if those episodes don't share keywords.
+### Scoped Visual Identity Recall
+Visual identity memory is deliberately split into two layers:
+| Layer | Role | Storage |
+|-------|------|---------|
+| `visual_memory` | Local face/object recognizer. Enrolls and identifies faces with InsightFace ArcFace, teaches and recognizes objects with CLIP. Structured callers use `format=json` instead of parsing display text. | `~/.omnius/visual-memory/` |
+| `identity_memory` | Agent-facing scoped evidence layer. Records explicit user assertions, stages names for future images, identifies enrolled faces, and recalls graph evidence. | `.omnius/episodes.db` + `.omnius/knowledge.db` |
+| `MultimodalIdentityService` | Central graph writer for source surface, scope, sender, message, reply, media, identity assertions, embeddings, and transcript links. | `.omnius/episodes.db` + `.omnius/knowledge.db` |
+Supported natural chronologies:
+1. **Image then name** — user sends an image, then says "this is Cole" or replies to the image with the name. The agent calls `identity_memory action="assert_identity" name="Cole" media="latest|reply"`, storing `named_as` / `depicts` graph evidence and attempting face enrollment.
+2. **Name then image** — user says "the next image is Cole" before sending media. The agent calls `identity_memory action="stage_identity" name="Cole"`. The next same-scope image consumes that pending assertion, enrolls the face, and commits `depicts` evidence only after enrollment succeeds.
+3. **Later image** — TUI clipboard/drop, GUI attachment upload, Telegram private chats, Telegram groups, and `/v1/memory/ingest` all run structured `visual_memory identify`. If an enrolled face matches, Omnius injects a `Scoped Visual Identity Recall` block with same-scope memories and commits `same_person_candidate` / `depicts` evidence for the new image.
+4. **Unknown face** — if face detection sees a face but no enrolled identity matches, image ingress injects an `Unknown Visual Identity Candidate` block. The model is steered to ask who the person is only when identity matters to the user's task, and never to guess a real identity.
+Scope is part of every write and recall. A Telegram group, Telegram DM, TUI terminal session, GUI chat session, and API caller each get their own `scope.kind` / `scope.id` boundary. The recognizer may know that a face matches "Cole", but related memory recall is filtered to the current scope/session unless a tool or policy explicitly broadens access.
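The scope boundary described above amounts to filtering recalled memories by the current `scope.kind` / `scope.id` even when the recognizer returns a match. A minimal sketch, assuming hypothetical `ScopedMemory` records and a `recallForMatch` helper that are not part of Omnius's documented API:

```typescript
// Hypothetical record shapes; only scope.kind / scope.id come from the text.
interface Scope {
  kind: string;
  id: string;
}

interface ScopedMemory {
  scope: Scope;
  person: string;
  content: string;
}

// Keep only memories about the matched person that were written in the current scope.
function recallForMatch(
  memories: ScopedMemory[],
  person: string,
  current: Scope,
): ScopedMemory[] {
  return memories.filter(
    (m) =>
      m.person === person &&
      m.scope.kind === current.kind &&
      m.scope.id === current.id,
  );
}

const memories: ScopedMemory[] = [
  { scope: { kind: "group", id: "42" }, person: "Cole", content: "Alice named this person as Cole." },
  { scope: { kind: "private", id: "7" }, person: "Cole", content: "A memory from an unrelated chat." },
];

// The face matches "Cole" everywhere, but recall in group 42 returns only its own memory.
const recalled = recallForMatch(memories, "Cole", { kind: "group", id: "42" });
```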
 ### Cross-Modal Binding
 The `multimodal_memory` tool binds face, voice, text, and location into unified episodes:
@@ -3997,13 +4053,14 @@ Post-task, the [ReadAgent](https://arxiv.org/abs/2402.09727) gist compressor cre
 ### Cross‑Modality Identity & Association (CLIP + Voice)
-Omnius binds entities across image, audio, and text using joint‑embedding models:
+Omnius binds entities across image, audio, and text using explicit evidence plus local embedding models:
-- CLIP‑based visual ID: person/object embeddings extracted from frames are matched to persistent entity nodes; cosine similarity > τ promotes to identity with temporal smoothing. Supports multi‑appearance tracking and re‑identification across sessions.
-- Voiceprint linkage: speaker embeddings (x‑vector/ECAPA) are associated with entities when co‑occurring in time with a visual track and a transcribed utterance; robust to background noise via median pooling across windows.
-- Text label fusion: natural‑language labels (names, roles, tags) are bound to the same entity when co‑referents appear in proximate context windows (heuristics + clustering).
-- Association graph: cross‑modal edges (image↔voice↔text) consolidate into a unified entity node with provenance (model, score, timestamp) and decay‑based confidence.
-- Privacy & safety: raw media never leaves the machine; embeddings are stored locally under `.omnius/memory/`. Redaction controls can drop embeddings by label or recency.
+- Face identity: InsightFace ArcFace embeddings in `visual_memory` perform enrolled-face matching. Matches become graph evidence only through structured JSON results, never by parsing pretty tool output.
+- Object and scene association: CLIP/OpenCLIP vectors are stored as `clip_embedding` for visual/text retrieval and for taught object recognition through `visual_memory teach/recognize`.
+- Voice linkage: speaker embeddings and transcripts attach audio episodes to sender/speaker candidates when available; transcripts are stored as text episodes for retrieval.
+- Text labels: person names are stored from explicit agent-decided `identity_memory` calls (`assert_identity` for current media, `stage_identity` for next media), not regex shortcuts over captions.
+- Association graph: cross-modal edges (`depicts`, `named_as`, `same_person_candidate`, `voice_sample_of`, `said_by`, `replied_to`) consolidate into scoped entity neighborhoods with provenance, confidence, timestamp, and source episode IDs.
+- Privacy & safety: raw media and embeddings remain local. Episode and graph evidence live under `.omnius/`; the persistent visual face/object database lives under `~/.omnius/visual-memory/`.
 This enables queries like: “Find where Alex spoke about deployment,” “Show files edited after the person in the red sweater approved the PR,” or “Summarize conversations where Speaker‑B and Alice appear together.”