
open-agents-ai 0.187.172 → 0.187.174

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (3)

package/README.md CHANGED

@@ -40,7 +40,8 @@ An autonomous multi-turn tool-calling agent that reads your code, makes changes,
 - [Model-Tier Awareness](#model-tier-awareness)
 - [Live Code Knowledge Graph](#live-code-knowledge-graph)
 - [Auto-Expanding Context Window](#auto-expanding-context-window)
-- [Tools (68+)](#tools-68)
+- [Tools (85+)](#tools-85)
+- [Associative Memory & Cross-Modal Binding](#associative-memory--cross-modal-binding)
 - [Ralph Loop — Iteration-First Design](#ralph-loop--iteration-first-design)
 - [Task Control](#task-control)
 - [COHERE Cognitive Framework](#cohere-cognitive-framework)
@@ -923,7 +924,7 @@ On startup and `/model` switch, Open Agents detects your RAM/VRAM and creates an
-## Tools (68)
+## Tools (85+)
 <div align="right"><a href="#top">back to top</a></div>
@@ -1010,11 +1011,24 @@ On startup and `/model` switch, Open Agents detects your RAM/VRAM and creates an
 | **Hardware Access** | |
 | `camera_capture` | Access system cameras — list devices, capture JPEG frames, query capabilities. Uses ffmpeg + v4l2. Supports USB, CSI, and 360 cameras (QooCam, RealSense). Captured images can be piped to vision tools |
 | `audio_capture` | Record from microphone — list input devices, record WAV/MP3 (configurable duration/rate/channels), check real-time mic level (RMS dBFS). Uses arecord + ffmpeg backends |
-| `audio_playback` | Speaker control and TTS — play audio files (WAV/MP3/OGG), text-to-speech via espeak-ng (multi-language), get/set system volume. Uses aplay/ffplay/amixer backends |
+| `audio_playback` | Speaker control and TTS — play audio files (WAV/MP3/OGG), text-to-speech via LuxTTS voice clone (persistent GPU daemon, ~2s synthesis), get/set system volume. Uses aplay/ffplay/amixer backends |
 | `wifi_control` | WiFi network scanning and management — scan nearby networks (SSID, signal, channel, security), list WiFi adapters (built-in + USB dongles), connect/disconnect, check connection status, toggle monitor mode. Auto-detects AC600/RTL8811AU and other USB adapters |
 | `bluetooth_scan` | Bluetooth device discovery — scan for Classic and BLE devices, list HCI adapters, get device info. Uses hcitool/bluetoothctl backends |
 | `sdr_scan` | Software-defined radio scanning — frequency sweeps, ADS-B aircraft tracking (1090 MHz), FM radio capture. Auto-installs rtl-sdr tools when RTL-SDR hardware detected. Uses rtl_power/rtl_fm/dump1090 |
 | `flipper_zero` | Flipper Zero multi-tool control — Sub-GHz scanning (315/433/868/915 MHz), NFC tag reading, 125kHz RFID reading, IR capture, GPIO pin reading, storage browsing. Serial CLI via /dev/ttyACM* |
+| `meshtastic` | Mesh network communication via LoRa — send/receive messages, list nodes, get device info, configure channels. Auto-installs meshtastic CLI in venv, auto-fixes serial permissions via pkexec |
+| `gps_location` | GPS positioning from 45+ USB receivers — auto-detects device, probes NMEA at multiple baud rates. Uses pyserial+pynmea2 for reliable parsing. Returns lat/lon/alt/speed/heading |
+| `audio_analyze` | Audio scene analysis — YAMNet 521-class classification (AudioSet taxonomy), Silero VAD voice activity detection, FFT spectrum analysis with peak frequency detection |
+| `asr_listen` | Record from microphone and transcribe speech to text — combines audio capture + Whisper ASR in one call. Uses PipeWire (bluetooth/USB) → faster-whisper → openai-whisper backends |
+| **Visual Intelligence** | |
+| `visual_memory` | Face recognition + object memory — InsightFace ArcFace 512d face enrollment/identification, CLIP ViT-B/32 object teaching/recognition. Persistent face+object databases in `.open-agents/visual-memory/` |
+| `multimodal_memory` | Cross-modal episode binding — captures face + voice + text + location into unified episodes. Actions: capture (photo+audio), meet (register person with name+face+voice), recall (associative retrieval), timeline (chronological query) |
+| **Associative Memory** | |
+| `episode_store` | SQLite episode store with triple-factor scoring (recency x importance x relevance), 4-class temporal decay (session/daily/procedural/permanent), Ebbinghaus strengthening on retrieval |
+| `temporal_graph` | Temporal knowledge graph with Graphiti-style valid_from/valid_until edges, entity upsert with mention counting, temporal queries, neighbor traversal for context building |
+| `zettelkasten` | A-MEM Zettelkasten note linking — retroactive context evolution, top-3 neighbor discovery via cosine similarity, bidirectional linking |
+| `ppr_retrieval` | HippoRAG Personalized PageRank retrieval — entity extraction, seed node mapping, multi-hop associative traversal over temporal KG, episode scoring |
+| `gist_compressor` | ReadAgent-style trajectory compression — deterministic gist extraction from multi-turn interactions, no LLM needed |
 Read-only tools execute concurrently when called in the same turn. Mutating tools run sequentially.
@@ -1049,7 +1063,7 @@ The agent can access physical hardware — cameras, microphones, and speakers
 | List cameras | `camera_capture` action=list | Discover `/dev/video*` devices |
 | Record audio | `audio_capture` action=record duration=10 | Record 10s WAV from default mic |
 | Check if mic works | `audio_capture` action=level | RMS level in dBFS |
-| Speak aloud | `audio_playback` action=speak text="Hello" | TTS via espeak-ng |
+| Speak aloud | `audio_playback` action=speak text="Hello" | TTS via LuxTTS voice clone |
 | Play a sound file | `audio_playback` action=play file=alert.wav | Play WAV/MP3/OGG |
 | Check volume | `audio_playback` action=volume | Get current volume % |
 | Set volume | `audio_playback` action=volume volume=50 | Set to 50% |
@@ -1067,12 +1081,33 @@ The agent can access physical hardware — cameras, microphones, and speakers
 | Sub-GHz scan | `flipper_zero` action=subghz_scan frequency=433920000 | RF signals |
 | Read NFC tag | `flipper_zero` action=nfc_read | Tag UID, type |
 | Read RFID tag | `flipper_zero` action=rfid_read | 125kHz tag ID |
+| Send mesh message | `meshtastic` action=send message="Hello mesh" | LoRa broadcast |
+| List mesh nodes | `meshtastic` action=nodes | All nodes + signal info |
+| Get GPS location | `gps_location` action=locate | Lat/lon/alt/speed |
+| Analyze audio scene | `audio_analyze` action=classify file="rec.wav" | Top AudioSet classes |
+| Detect voice activity | `audio_analyze` action=vad file="rec.wav" | Speech segments |
+| Listen + transcribe | `asr_listen` action=listen duration=8 | Record + Whisper ASR |
+| Transcribe audio file | `asr_listen` action=transcribe file="rec.wav" | Whisper transcription |
+| Enroll a face | `visual_memory` action=enroll name="Alice" image="photo.jpg" | Face database entry |
+| Identify faces | `visual_memory` action=identify image="photo.jpg" | Known face matches |
+| Teach an object | `visual_memory` action=teach label="coffee_mug" image="obj.jpg" | CLIP object memory |
+| Meet a person | `multimodal_memory` action=meet name="Bob" | Photo+voice+text episode |
+| Recall a person | `multimodal_memory` action=recall query="Bob" | Associative memory search |
+| Event timeline | `multimodal_memory` action=timeline | Chronological episodes |
-**Prerequisites**: `ffmpeg`, `arecord`, `aplay`, `amixer` (ALSA utils), `espeak-ng`, `bluez` (Bluetooth). Install: `sudo apt install ffmpeg alsa-utils espeak-ng bluez`
+**Prerequisites**: `ffmpeg`, `arecord`, `aplay`, `amixer` (ALSA utils), `bluez` (Bluetooth). Install: `sudo apt install ffmpeg alsa-utils bluez`
-**Camera support**: USB cameras (UVC), Intel RealSense (via UVC), 360 cameras (QooCam, Ricoh Theta — raw fisheye via v4l2loopback + ffmpeg crop). The captured frame is returned as base64 JPEG that can be fed directly to the `vision` tool for analysis.
+**Camera support**: USB cameras (UVC), Intel RealSense (via UVC), QooCam 8K 360 via WiFi OSC protocol (auto-discovers hotspot, connects, switches modes, captures frames). Captured frames are returned as base64 JPEG for direct piping to the `vision` or `visual_memory` tools.
-**Audio workflow**: Record → transcribe → analyze: `audio_capture action=record` → `transcribe_file` → process transcript. The tools handle device enumeration and graceful degradation when hardware is unavailable.
+**Audio workflow**: Record → transcribe → analyze → remember:
+1. `audio_capture action=record` → WAV recording
+2. `asr_listen action=listen` → record + Whisper transcription in one call
+3. `audio_analyze action=classify` → YAMNet scene classification (521 AudioSet classes)
+4. `multimodal_memory action=meet` → bind face + voice + text into persistent episode
+**Mesh/GPS/SDR**: Auto-installs dependencies when hardware is detected. Meshtastic creates a Python venv with the CLI. GPS auto-probes NMEA at multiple baud rates. RTL-SDR auto-blacklists kernel modules and installs udev rules via pkexec.
+**Visual Intelligence**: `visual_memory` provides persistent face recognition (InsightFace ArcFace 512d) and object memory (CLIP ViT-B/32). `multimodal_memory` binds all modalities into cross-session episodes with associative recall.
 ## Ralph Loop — Iteration-First Design
@@ -1561,7 +1596,7 @@ The emotion system is informed by peer-reviewed and preprint research:
 /voice clone overwatch  # Generate clone ref from Overwatch → LuxTTS
 ```
-Auto-downloads the ONNX voice model (~50MB) on first use. Install `espeak-ng` for best quality (`apt install espeak-ng` / `brew install espeak-ng`).
+Auto-downloads the ONNX voice model (~50MB) on first use. LuxTTS is the primary TTS engine with a persistent GPU daemon that keeps the model warm in VRAM for ~2s synthesis latency.
 ### LuxTTS Voice Cloning
@@ -1583,6 +1618,8 @@ Auto-downloads the ONNX voice model (~50MB) on first use. Install `espeak-ng` fo
 - **Pitch** → post-synthesis resampling via `resamplePitch()` (valence+arousal tanh curve)
 - **Volume** → WAV sample scaling (dominance-driven)
+**Persistent GPU daemon**: The `audio_playback` tool runs a persistent LuxTTS daemon process that keeps the ZipVoice model warm in GPU memory (~19GB VRAM). The first call starts the daemon (~7s model load); subsequent calls synthesize in ~2s. The daemon communicates via a JSON-over-stdin/stdout protocol and caches encoded voice prompts for instant reuse. It falls back to standalone synthesis (~10s) if the daemon stalls.
 Output: 48kHz WAV, compatible with Telegram voice messages and WebSocket streaming.
 ### Narration Engine Architecture
@@ -2477,6 +2514,98 @@ Every completed task is logged to `.oa/trajectories/trajectories.jsonl` with ful
 | **Skill extraction** | Post-task via `/skillify` | Converts corrections into reusable SKILL.md |
+## Associative Memory & Cross-Modal Binding
+<div align="right"><a href="#top">back to top</a></div>
+Open Agents implements a full associative memory system inspired by hippocampal episodic memory research. Every tool call, observation, and interaction is captured as a richly-linked episode that can be retrieved through multi-hop associative traversal — not just keyword search.
+### Architecture
+```
+┌─────────────────────────────────────────────────────────────────┐
+│                    Associative Memory Pipeline                   │
+│                                                                  │
+│  Tool Call → Episode Store → Temporal KG → Zettelkasten Links   │
+│                  │                │              │                │
+│            Triple-Factor    Entity Edges    Neighbor Discovery   │
+│            Scoring          (Graphiti)      (A-MEM cosine)      │
+│                  │                │              │                │
+│                  └───── PPR Retrieval ───────────┘                │
+│                         (HippoRAG)                               │
+│                              │                                   │
+│                    Context Injection (every 3 turns)             │
+└─────────────────────────────────────────────────────────────────┘
+```
+### Episode Store (SQLite)
+Every tool call generates an episode stored in SQLite with WAL journal mode:
+| Field | Description |
+|-------|-------------|
+| `content` | Tool name + args + result summary |
+| `importance` | 0-10 scale (errors=8, file edits=6, reads=3) |
+| `decay_class` | session (1h), daily (1d), procedural (30d), permanent (∞) |
+| `embedding` | 384d vector for semantic similarity |
+| `strength` | Ebbinghaus curve — increases on each retrieval |
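The fields in the table above could map onto a SQLite table roughly as follows. This is a minimal sketch, not the package's actual schema: the table name `episodes`, the helper names `open_store`, `add_episode`, and `retrieve`, and the additive `boost` used to model strengthening on retrieval are all assumptions made for illustration.

```python
import sqlite3


def open_store(path):
    """Open (or create) an episode database with WAL journaling enabled."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS episodes (
            id          INTEGER PRIMARY KEY,
            content     TEXT NOT NULL,
            importance  REAL NOT NULL CHECK (importance BETWEEN 0 AND 10),
            decay_class TEXT NOT NULL CHECK (
                decay_class IN ('session', 'daily', 'procedural', 'permanent')
            ),
            embedding   BLOB,
            strength    REAL NOT NULL DEFAULT 1.0
        )
        """
    )
    return conn


def add_episode(conn, content, importance, decay_class):
    """Insert one episode and return its row id."""
    cur = conn.execute(
        "INSERT INTO episodes (content, importance, decay_class) VALUES (?, ?, ?)",
        (content, importance, decay_class),
    )
    conn.commit()
    return cur.lastrowid


def retrieve(conn, episode_id, boost=0.5):
    """Fetch an episode and strengthen it (hypothetical additive boost)."""
    conn.execute(
        "UPDATE episodes SET strength = strength + ? WHERE id = ?",
        (boost, episode_id),
    )
    conn.commit()
    return conn.execute(
        "SELECT content, strength FROM episodes WHERE id = ?", (episode_id,)
    ).fetchone()
```

WAL mode only takes effect for file-backed databases, so the sketch expects a real path rather than `:memory:`.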
+**Scoring**: `score = recency_weight × importance × relevance` — the triple-factor model from [Generative Agents (Park et al., 2023)](https://arxiv.org/abs/2304.03442).
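As an illustration of the scoring formula, the sketch below multiplies a recency weight by importance and relevance. The README does not say how `recency_weight` is computed; treating each decay class duration as the half-life of an exponential is an assumption here, and the names `DECAY_SECONDS`, `recency_weight`, and `episode_score` are hypothetical.

```python
import math

# Durations per decay class, in seconds, as listed in the episode table.
# Using them as exponential half-lives is an assumption of this sketch.
DECAY_SECONDS = {
    "session": 3600.0,
    "daily": 86400.0,
    "procedural": 30 * 86400.0,
    "permanent": math.inf,
}


def recency_weight(age_seconds, decay_class):
    """Return a weight in (0, 1] that halves once per decay period."""
    half_life = DECAY_SECONDS[decay_class]
    if math.isinf(half_life):
        return 1.0  # permanent episodes never fade
    return 0.5 ** (age_seconds / half_life)


def episode_score(age_seconds, decay_class, importance, relevance):
    """score = recency_weight x importance x relevance."""
    return recency_weight(age_seconds, decay_class) * importance * relevance
```

Under these assumptions, a one-hour-old `session` episode with importance 8 and relevance 0.5 scores 0.5 × 8 × 0.5 = 2.0, while a `permanent` episode keeps its full `importance × relevance` regardless of age.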
+### Temporal Knowledge Graph
+Entities extracted from tool results form a temporal KG with [Graphiti](https://arxiv.org/abs/2501.13956)-style edges:
+- **Nodes**: files, functions, errors, people, concepts — with `mention_count` and `last_seen`
+- **Edges**: causal relationships (`modifies`, `calls`, `causes_error`, `met_person`) with `valid_from`/`valid_until` temporal bounds
+- **Temporal queries**: "What was the state at time T?" via validity filtering
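A temporal query of the kind listed above reduces to filtering edges by their validity window. The sketch below assumes numeric timestamps, a half-open interval `[valid_from, valid_until)`, and `None` meaning "still valid"; the `Edge` class and `edges_at` helper are hypothetical names, not the package's API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Edge:
    source: str
    relation: str
    target: str
    valid_from: float
    valid_until: Optional[float] = None  # None means the edge is still valid


def edges_at(edges: List[Edge], t: float) -> List[Edge]:
    """Return the edges whose validity window contains time t."""
    return [
        e
        for e in edges
        if e.valid_from <= t and (e.valid_until is None or t < e.valid_until)
    ]
```

With this shape, superseding a fact means closing the old edge by setting `valid_until` and adding a new edge, so earlier states remain queryable.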
+### Zettelkasten Linking (A-MEM)
+After embedding computation, each episode discovers its top-3 nearest neighbors by cosine similarity and creates bidirectional links — implementing the [A-MEM Zettelkasten pattern (NeurIPS 2025)](https://arxiv.org/abs/2502.12110). Over time, episodes form a densely connected knowledge graph where context evolves retroactively as new episodes link to old ones.
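The neighbor discovery step described above can be sketched in a few lines. This is an illustrative brute-force version under assumed data shapes: `embeddings` is a dict from episode id to a list of floats, `links` is a dict from episode id to a set of linked ids, and the `link_episode` helper is hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def link_episode(new_id, embeddings, links, k=3):
    """Link new_id to its k most similar episodes, in both directions."""
    scored = sorted(
        (
            (cosine(embeddings[new_id], vec), other)
            for other, vec in embeddings.items()
            if other != new_id
        ),
        reverse=True,
    )
    neighbors = [other for _, other in scored[:k]]
    for other in neighbors:
        links.setdefault(new_id, set()).add(other)
        links.setdefault(other, set()).add(new_id)
    return neighbors
```

Because each link is written on both sides, an older episode gains a new neighbor whenever a later episode lands near it, which is one way to read "context evolves retroactively".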
+### PPR Retrieval (HippoRAG)
+Retrieval uses [Personalized PageRank over the temporal KG](https://arxiv.org/abs/2405.14831):
+1. **Entity extraction** from the current query
+2. **Seed node mapping** — find KG nodes matching query entities
+3. **PPR diffusion** — importance flows along edges with damping factor α=0.15
+4. **Episode scoring** — episodes connected to high-PPR nodes are ranked
+5. **Context injection** — top episodes injected every 3 turns as `[ASSOCIATIVE MEMORY]` context
+This enables multi-hop retrieval: asking about "the auth bug" can surface episodes about the specific file, the test that caught it, and the person who reported it — even if those episodes don't share keywords.
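The PPR diffusion step can be sketched with plain power iteration. This is not the package's implementation: it assumes the graph is a dict mapping every node to a list of its out-neighbors, that all seeds are keys in that dict, and that α=0.15 is the restart (teleport) probability, which is one reading of the "damping factor" wording above. Node names in the usage note are made up.

```python
def personalized_pagerank(graph, seeds, alpha=0.15, iterations=50):
    """Power-iteration Personalized PageRank with restarts at the seed nodes."""
    nodes = list(graph)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        # Every step, a fraction alpha of the mass jumps back to the seeds.
        nxt = {n: alpha * restart[n] for n in nodes}
        for n in nodes:
            out = graph[n]
            if not out:
                # Dangling node: hand its remaining mass back to the seeds.
                for s in seeds:
                    nxt[s] += (1.0 - alpha) * rank[n] / len(seeds)
                continue
            share = (1.0 - alpha) * rank[n] / len(out)
            for m in out:
                nxt[m] += share
        rank = nxt
    return rank
```

Seeding with `{"auth_bug"}` on a graph where `auth_bug` links to `login.ts` and `login.ts` links to `test_login` gives `test_login` a non-zero rank even though it is two hops from the seed, while a node with no path from the seed stays at zero. Episodes attached to high-rank nodes would then be ranked first.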
+### Cross-Modal Binding
+The `multimodal_memory` tool binds face, voice, text, and location into unified episodes:
+```
+meet("Cole") → {
+  face: InsightFace ArcFace 512d embedding,
+  voice: Whisper transcription of spoken name,
+  photo: CLIP ViT-B/32 512d scene embedding,
+  text: "My name is Cole",
+  episode_id: shared across all modalities,
+  timestamp: ISO-8601
+}
+```
+**Recall** uses the shared `episode_id` to retrieve all modalities at once. CLIP embeddings enable visual queries ("who was in the photo with the whiteboard?") and face embeddings enable identity queries ("when did I last see Cole?").
+### Gist Compression
+Post-task, the [ReadAgent](https://arxiv.org/abs/2402.09727) gist compressor creates deterministic summaries of multi-turn trajectories (>10 turns), preserving key decisions and outcomes while discarding redundant intermediate steps. No LLM needed — uses extractive heuristics.
+### Near-Critical Cognitive Architecture
+The associative memory integrates with a near-critical cognitive framework inspired by [Beggs & Plenz (2003)](https://doi.org/10.1523/JNEUROSCI.23-35-11167.2003) neuronal avalanche dynamics:
+- **Auto-consolidation**: At task boundaries, the system writes consolidation snapshots to `.oa/consolidations/` with lessons learned and key patterns
+- **Provenance KG**: Every agent action is tracked in `.oa/provenance/` for full action traceability
+- **Homeostasis modulation**: Error rate drives exploration guidance — high error rates inject more careful approaches, low error rates encourage bolder exploration
+- **Error pattern learning**: Recurring error patterns are detected, stored globally in `~/.open-agents/error-patterns.json`, and injected as `[LEARNED FROM EXPERIENCE]` guidance before similar actions in future sessions
 ## Dream Mode — Creative Idle Exploration
@@ -3371,16 +3500,22 @@ The COHERE collective intelligence system, self-play idle loop, identity evoluti
 | Hyperagents: Self-Referential Meta-Improvement | [2603.19461](https://arxiv.org/abs/2603.19461) | Mar 2026 | D6: Recursive meta-improvement |
 | STOP: Self-Taught Optimizer | [2310.02304](https://arxiv.org/abs/2310.02304) | COLM 2024 | D6: Scaffold self-improvement |
-### Memory & Identity
+### Memory, Identity & Associative Retrieval
 | Paper | ArXiv | Venue | Used In |
 |-------|-------|-------|---------|
 | MemoryOS: Memory Operating System | [2506.06326](https://arxiv.org/abs/2506.06326) | EMNLP 2025 Oral | D3: Three-tier consolidation |
-| A-MEM: Agentic Memory (Zettelkasten) | [2502.12110](https://arxiv.org/abs/2502.12110) | NeurIPS 2025 | D3: Retroactive narrative |
+| A-MEM: Agentic Memory (Zettelkasten) | [2502.12110](https://arxiv.org/abs/2502.12110) | NeurIPS 2025 | Zettelkasten linking, retroactive context evolution |
+| HippoRAG: Neurobiological Retrieval | [2405.14831](https://arxiv.org/abs/2405.14831) | NeurIPS 2024 | PPR retrieval over temporal KG |
+| Generative Agents: Interactive Simulacra | [2304.03442](https://arxiv.org/abs/2304.03442) | UIST 2023 | Triple-factor scoring (recency × importance × relevance) |
+| Graphiti: Temporal Knowledge Graphs | [2501.13956](https://arxiv.org/abs/2501.13956) | Jan 2025 | Temporal edges with valid_from/valid_until |
+| ReadAgent: Gist Memories | [2402.09727](https://arxiv.org/abs/2402.09727) | Feb 2024 | Post-task trajectory compression |
+| RGMem: Phase-Transition Memory | — | — | Phase-transition threshold θ_inf=3 |
 | MemRL: Runtime RL on Episodic Memory | [2601.03192](https://arxiv.org/abs/2601.03192) | Jan 2026 | D3: Value-based retrieval |
 | Memory-R1: RL Memory Manager | [2508.19828](https://arxiv.org/abs/2508.19828) | Jan 2026 | D3: ADD/UPDATE/DELETE ops |
 | ExpeL: Experiential Learning | [2308.10144](https://arxiv.org/abs/2308.10144) | AAAI 2024 | D2: Insight extraction |
 | Experiential Reflective Learning | [2603.24639](https://arxiv.org/abs/2603.24639) | Mar 2026 | D2: Heuristics > trajectories |
 | EvoSkill: Automated Skill Discovery | [2603.02766](https://arxiv.org/abs/2603.02766) | Mar 2026 | D2+D4: Pareto + zero-shot transfer |
+| JARVIS-1: Open-World Multi-Modal Agent | [2311.05997](https://arxiv.org/abs/2311.05997) | NeurIPS 2023 | Cross-modal CLIP retrieval pattern |
 ### Collective Identity & Emergence
 | Paper | ArXiv | Venue | Used In |