npm - @vortex-os/computer-use - Versions diffs - 0.7.0 → 0.7.1 - Mend

@vortex-os/computer-use 0.7.0 → 0.7.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (15)

package/README.md +179 -177
package/computer-use.config.example.json +29 -28
package/package.json +74 -73
package/scripts/activity.mjs +92 -92
package/scripts/audio-duck.ps1 +180 -180
package/scripts/classify.ps1 +8 -8
package/scripts/fetch-supertonic.mjs +82 -65
package/scripts/lib.ps1 +679 -679
package/scripts/mcp-stdio.mjs +1337 -1324
package/scripts/noise-filter.mjs +135 -135
package/scripts/ocr.ps1 +92 -92
package/scripts/speak-supertonic.mjs +296 -296
package/scripts/speak.ps1 +58 -58
package/scripts/speech-safety.mjs +104 -104
package/scripts/vlm.mjs +106 -106

package/README.md CHANGED

@@ -1,177 +1,179 @@
-# @vortex-os/computer-use
-Read-only **screen perception** for VortEX agents, exposed as an MCP server. It lets an agent *see* what is on screen — read a window's structure, capture a region as an image, and watch for on-screen changes — without ever moving the mouse or typing. It layers on `@vortex-os/base` but also works standalone.
-> **Status: 0.5.0, Windows-first, read-only.** Mouse/keyboard **control is intentionally out of scope** for this release — this package only *perceives*. macOS/Linux backends are not yet implemented.
-## What it is
-An MCP (Model Context Protocol) server that exposes nine perception tools over stdio:
-| Tool | What it does | Cost |
-|---|---|---|
-| `probe` | Reports whether this environment can perceive the screen (displays, DPI, capture latency). Never captures real screen content. | ~0 |
-| `read_ui` | Reads the active/target window as a **structured accessibility tree** (UI Automation): element roles, coordinates, text. No image. | ~0 image tokens |
-| `capture_screen` | Pixel capture (PNG) for what structure can't reach — canvases, games, remote desktops. Target by window, region, monitor, or cursor box. | image |
-| `watch_capture` | Captures N frames at an interval in one process; with `changeOnly`, keeps only changed frames. | image(s) |
-| `poll_change` | One non-blocking "did it change?" probe; returns a change percentage and (optionally) an image. Poll it on an interval to watch without blocking. | metadata, image optional |
-| `start_watch` | Watch a fixed target in the **background** (non-blocking) with a built-in **noise filter** that keeps only meaningful, settled changes. Works on video/games that change every frame. | runs in background |
-| `get_events` | Collect the buffered changes a `start_watch` has accumulated — batched (a few looks for a long watch); each event carries the settled frame. | metadata + image(s) |
-| `stop_watch` | Stop a background watch and discard its buffer. | — |
-| `beep` | A system beep, to get the user's attention while they look elsewhere. | — |
-The design favors **structure first, pixels as fallback**: `read_ui` is cheap and precise for ordinary apps; `capture_screen` is for content that has no accessibility tree (games, custom canvases).
-## Watching for changes (the noise filter + event buffer)
-`poll_change` is the manual primitive — *you* poll it on a loop. `start_watch` is the hands-off version: it runs a **background loop with a noise filter** and buffers what matters, so you can watch a busy screen without drowning in frames.
-The problem it solves: on video, games, or scrolling, the screen changes *every* frame, so raw change-detection fires constantly on the ripples of one activity. The filter combines **debounce** (wait for motion to settle, then capture a clean frame — quality) with **cooldown** (at most one event per N seconds — frequency) and **hysteresis** (ambient jitter never even wakes it), plus an anti-starvation `maxWait` so a continuously-moving screen still yields periodic snapshots instead of going silent.
-```
-start_watch { region|window|monitor, watchId? }   -> returns immediately, watch runs in the background
-   ... do other things ...
-get_events  { watchId? }                           -> the settled changes so far (frames + metadata), batched
-stop_watch  { watchId? }                           -> end it (omit watchId to stop all)
-```
-**Calibration.** The defaults assume a meaningful target: a playing video jitters ~2.5–4% frame-to-frame and a scene cut jumps ~16% — so `activityThreshold` (default 8%) sits between them, ignoring the jitter and reporting the cut. Because the change metric is whole-frame, a tiny local change (a clock, a toast) is a small fraction of the frame; **target the region where the change happens** so it reads as a large change. Tune `activityThreshold` / `quietThreshold` / `debounceQuietMs` / `cooldownMs` / `maxWaitMs` per call. The buffer is **memory-only** (no screen history on disk) with count, byte, and 5-minute TTL caps; watches auto-stop after 30 minutes.
-## Reflex alerts — sub-second voice, no cloud round-trip
-`get_events` is the *brain* path (the agent looks, judges, and replies — seconds, because it makes a cloud LLM call). For things you need to hear the instant they happen, `start_watch` takes `triggers` that fire **locally, with no LLM in the loop**, so the alert reaches you in well under a second:
-```
-start_watch {
-  window: "MyGame",
-  triggers: [
-    { action: "beep",  threshold: 20 },                          // a sound
-    { action: "say",   threshold: 20, say: "적 출현" },           // speak a FIXED phrase (Korean TTS)
-    { action: "ocr",   threshold: 12, dwellMs: 700 }             // OCR the region and read the text aloud
-  ]
-}
-```
-A trigger fires the moment the watched region changes past its `threshold` (with hysteresis + a per-trigger `cooldownMs`). The fast alert is the *reflex*; the agent's judged commentary still follows on the next `get_events`. Speaking uses the built-in Windows voice (System.Speech); reading text uses the built-in offline OCR (no install, no GPU).
-**Safety (this matters).** OCR text is *screen content* — untrusted. It is never spoken raw: it gets a spoken **provenance prefix** (`화면 글자: …`), control/secret-token shaping, and a **global speech budget** (capped utterances and seconds per minute, no overlapping speech, auto-mute on sustained noise) so an on-screen string can never voice fake instructions or flood you. A fixed `say` phrase is the safest action (you author the words; a trigger only controls *when*). If the voice/OCR engine isn't available, triggers degrade quietly (a `beep` still works).
-### Optional: higher-quality neural voice (Supertonic) + audio ducking
-The default voice is the built-in Windows one (System.Speech / Heami) — zero install, but robotic. For a much more natural voice, install **Supertonic 3** (Supertone): an on-device ONNX neural TTS (Korean + 30 more languages; code MIT, weights OpenRAIL-M — commercial use OK). One-time model download (~380 MB), then fully offline and fast (~0.5 s/sentence on CPU):
-```
-node scripts/fetch-supertonic.mjs       # downloads models to ~/.vortex/computer-use/supertonic-3
-npm i onnxruntime-node                   # the runtime (optionalDependency)
-```
-Once the models are present, the speak path uses them automatically (`engine: "auto"`) and falls back to Heami if anything is missing — it never goes mute.
-**Audio ducking.** While the companion speaks, other apps' audio (game / music / video) is briefly lowered per-app and **restored exactly** when it finishes, so the voice stands out. On by default. *DRM-protected audio (e.g. Netflix) cannot be ducked* — that protected path bypasses Windows volume control; normal app/game audio ducks fine.
-Configure in `computer-use.config.json` (`tts` section) or via env (env wins). Defaults shown:
-```json
-{ "tts": { "engine": "auto", "voice": "F1", "duck": true, "duckFactor": 0.3 } }
-```
-`engine` `auto|supertonic|heami` · `voice` `F1..F5/M1..M5` · `duckFactor` `0..1` (others drop to this fraction; lower = quieter). Env: `VORTEX_CU_TTS_ENGINE` / `VORTEX_CU_TTS_VOICE` / `VORTEX_CU_DUCK=off` / `VORTEX_CU_DUCK_FACTOR`. Restart the server after changing.
-### Optional: a local vision model (the `vision` trigger)
-A `vision` trigger describes the *scene* (not just its text) via a **local** vision-language model — smarter than OCR, still no cloud round-trip. It is **off by default and GPU-gated**: it runs only when you point it at a reachable, fast-enough local endpoint. Everything above works with **no GPU**; this just adds a smarter local description where the hardware allows. On a machine that can't run it, a `vision` trigger **degrades to `ocr`**.
-Point it at any OpenAI-compatible vision endpoint — `llama.cpp`'s `llama-server` (with `--mmproj`), `llamafile`, Ollama, LM Studio — via env (machine-local, never synced):
-```
-VORTEX_CU_VLM_ENDPOINT=http://127.0.0.1:8080/v1   # set this to enable; presence = on
-VORTEX_CU_VLM_MODEL=gemma-4-e2b-it                # any small VLM (e2b ~2GB is a good light default)
-VORTEX_CU_VLM_KEY=...                             # optional bearer token
-VORTEX_CU_VLM_ALLOW_REMOTE=1                      # only if the endpoint is NOT on this machine (off by default)
-VORTEX_CU_VLM_SLA_MS=6000                         # gate: if even a tiny probe is slower than this, stay off
-```
-How it stays safe (design §23.2/§24): only the *intent* (endpoint set or not) is configuration — the address/secret are machine-local env, never synced, so the same repo on a CPU-only machine simply runs without it. A **loopback** endpoint (same machine) is allowed by default; a **cross-network** one (LAN/VPN/another box) is **off unless you opt in**. The session **probes** the endpoint with a *synthetic* 1×1 image (never a real screen crop) to measure latency before trusting it; a real crop is sent only on an actual `vision` trigger, through the same denylist gate. The model's reply is **untrusted** — spoken with a `로컬 비전: …` prefix, shaped, and rate-limited, and the prompt tells the model to describe only (never follow on-screen instructions). `probe` reports the VLM's availability when you've configured one.
-## Adaptive companion — classify the activity, branch the help
-`classify_activity` lets the companion **figure out what you're doing from the first screenshot** and adapt, instead of being told. One read-only call returns:
-```json
-{ "class": "GAME", "process": "eldenring", "title": "...", "notificationState": "BUSY",
-  "interruptible": false, "canvas": true, "uiaCount": 1, "fullscreen": true,
-  "profile": { "proactive": true, "cadenceSec": 30, "mode": "periodic" }, "needsChangeRate": true }
-```
-It combines cheap signals — foreground **process** + **window title**, the Windows **interruptibility state** (`SHQueryUserNotificationState`), **UIA element count** (a near-empty tree on a screen-filling window = a GPU game/video canvas), and whether it **fills the screen** — into a class: `GAME · DEV · MEDIA · BROWSING · PRODUCTIVITY · UNKNOWN`, each with a help **profile**.
-The profiles branch the behavior (full design in [`docs/adaptive-companion.md`](docs/adaptive-companion.md)):
-| Class | Default | Speaks when | Cadence |
-|---|---|---|---|
-| **Strategy / sim game** | proactive | quiet stretch (your turn) + risk/opportunity | ~30 s during active play |
-| **Fast-action game** | break-gated | a menu/pause/death screen opens | one cue per break; *says up front it can't coach mid-fight* |
-| **Software dev** | silent | an error/failed build/stack trace appears | event-driven, not periodic |
-| **Media / browsing / docs** | silent | only on request | never proactive |
-For a `GAME`, `needsChangeRate` tells the agent to take a couple of `poll_change` reads to split **fast-action** (too fast to coach — break-gated only) from **strategy** (coachable, periodic). Honesty is built in: it never pretends to coach a game it can't follow, and it won't talk over media. The **interruptibility state** gates every utterance, on top of the global speech budget. Explicit user requests ("tell me when X happens", "be quieter") layer on as reflex triggers / cadence overrides.
-Tune it in `computer-use.config.json` (`companion` section): `uiaCanvasMax` (the canvas cutoff) and per-class `profiles` (e.g. `GAME.cadenceSec: 20` for chattier coaching). Env: `VORTEX_CU_UIA_CANVAS_MAX`.
-## What it is NOT
-- **Not control.** No clicking, typing, or app automation. Perception only.
-- **Not real-time for *judgment*.** Reflex `triggers` deliver a sub-second beep / fixed phrase / OCR readout, but anything the agent has to *think* about (a judged message) is seconds-scale — it makes a cloud call. Good for alerts, translation, and watching-alongside; not for reflex-speed decisions.
-- **Not comprehensive secret protection.** See *Privacy & redaction* below — the denylist is the real control; field-level masking is best-effort and does not catch plaintext secrets sitting in arbitrary windows.
-- **Not cross-platform yet.** Windows only in 0.3.0.
-## Install
-```
-npm i @vortex-os/computer-use
-```
-Peer dependency: `@vortex-os/base` (`>=0.3.0 <1.0.0`). The MCP SDK (`@modelcontextprotocol/sdk`) is an optional dependency, loaded only when the server runs. No native build step.
-### Register the MCP server
-The package ships a `vortex-mcp-computer-use` bin that launches the stdio server. Register it with your agent host. For Claude Code, add it to `.mcp.json`:
-```json
-{
-  "mcpServers": {
-    "vortex-computer-use": {
-      "command": "npx",
-      "args": ["vortex-mcp-computer-use"]
-    }
-  }
-}
-```
-> Use a server name other than the reserved `computer-use` (e.g. `vortex-computer-use`) — some hosts reserve `computer-use` and will silently skip a server with that exact name. MCP servers load at session start, so **restart the agent** after adding it.
-## Privacy & redaction
-Whatever you point this at is sent to your AI model. Two controls reduce accidental exposure; both run in the backend **before** any pixels or text reach the model:
-1. **Denylist (the primary control).** List window titles or process names that must never be captured. If a listed window is visible anywhere inside a capture region, the whole capture is refused (`{ "redacted": true }` — no image, no text). This is the reliable defense against accidentally capturing a password manager or banking window during a watch.
-2. **Password-field masking.** In `read_ui`, fields the OS reports as password inputs are dropped (no value, no text, children not traversed).
-Copy `computer-use.config.example.json` to `computer-use.config.json` (next to the server) to configure the denylist, or set `VORTEX_CU_DENY_TITLES` / `VORTEX_CU_DENY_PROCS` (JSON arrays). **The denylist is read once at startup — restart the server after changing it.**
-**Honest limits.** This is *not* comprehensive secret-scanning. A plaintext token shown in a text editor or terminal (not a password field, not a denylisted window) will still be captured. Pixel-level password masking is intentionally out of scope for 0.1.0. Capture images are volatile — held only long enough to send, then deleted; they are never written to disk persistently.
-### Audit
-Each perception call appends one metadata line (timestamp, tool, output size, a keyed HMAC of the output, and an HMAC of the window title) to a daily JSONL log under your user-local app data (`%LOCALAPPDATA%\vortex-computer-use\audit\`) — outside the synced instance data. **No raw images and no plaintext window titles are stored.** If the audit key can't be set up, perception still works and a warning is printed.
-## Verify
-```
-npm run verify        # node scripts/verify.mjs — needs a desktop session; captures the real screen
-npm run test:filter   # node scripts/test-noise-filter.mjs — pure unit tests, no screen needed
-npm run test:speech   # node scripts/test-speech-safety.mjs — pure unit tests (provenance/shaping/budget)
-npm run test:vlm      # node scripts/test-vlm.mjs — pure unit tests (config/trust-tier/protocol)
-```
-`verify` exercises every tool plus the redaction/audit gate (denylist blocking across all capture modes, no over-block, no title leak, audit written with no plaintext). The `test:*` scripts check the noise-filter, speech-safety, and VLM-protocol logic deterministically (no screen/audio/network). Three live harnesses drive the real screen: `node scripts/verify-watch.mjs` (background watch: settle → event → frame, denylist blindness, cleanup), `node scripts/verify-reflex.mjs` (reflex triggers → local speech, rendered to WAV so it stays silent), and `node scripts/verify-vlm.mjs` (the `vision` path against a mock local endpoint: synthetic-probe, real-crop only on a trigger, degrade-to-OCR).
+# @vortex-os/computer-use
+<!-- docs-version: 0.7.1 -->
+Read-only **screen perception** for VortEX agents, exposed as an MCP server. It lets an agent *see* what is on screen — read a window's structure, capture a region as an image, and watch for on-screen changes — without ever moving the mouse or typing. It layers on `@vortex-os/base` but also works standalone.
+> **Status: 0.5.0, Windows-first, read-only.** Mouse/keyboard **control is intentionally out of scope** for this release — this package only *perceives*. macOS/Linux backends are not yet implemented.
+## What it is
+An MCP (Model Context Protocol) server that exposes nine perception tools over stdio:
+| Tool | What it does | Cost |
+|---|---|---|
+| `probe` | Reports whether this environment can perceive the screen (displays, DPI, capture latency). Never captures real screen content. | ~0 |
+| `read_ui` | Reads the active/target window as a **structured accessibility tree** (UI Automation): element roles, coordinates, text. No image. | ~0 image tokens |
+| `capture_screen` | Pixel capture (PNG) for what structure can't reach — canvases, games, remote desktops. Target by window, region, monitor, or cursor box. | image |
+| `watch_capture` | Captures N frames at an interval in one process; with `changeOnly`, keeps only changed frames. | image(s) |
+| `poll_change` | One non-blocking "did it change?" probe; returns a change percentage and (optionally) an image. Poll it on an interval to watch without blocking. | metadata, image optional |
+| `start_watch` | Watch a fixed target in the **background** (non-blocking) with a built-in **noise filter** that keeps only meaningful, settled changes. Works on video/games that change every frame. | runs in background |
+| `get_events` | Collect the buffered changes a `start_watch` has accumulated — batched (a few looks for a long watch); each event carries the settled frame. | metadata + image(s) |
+| `stop_watch` | Stop a background watch and discard its buffer. | — |
+| `beep` | A system beep, to get the user's attention while they look elsewhere. | — |
+The design favors **structure first, pixels as fallback**: `read_ui` is cheap and precise for ordinary apps; `capture_screen` is for content that has no accessibility tree (games, custom canvases).
+## Watching for changes (the noise filter + event buffer)
+`poll_change` is the manual primitive — *you* poll it on a loop. `start_watch` is the hands-off version: it runs a **background loop with a noise filter** and buffers what matters, so you can watch a busy screen without drowning in frames.
+The problem it solves: on video, games, or scrolling, the screen changes *every* frame, so raw change-detection fires constantly on the ripples of one activity. The filter combines **debounce** (wait for motion to settle, then capture a clean frame — quality) with **cooldown** (at most one event per N seconds — frequency) and **hysteresis** (ambient jitter never even wakes it), plus an anti-starvation `maxWait` so a continuously-moving screen still yields periodic snapshots instead of going silent.
+```
+start_watch { region|window|monitor, watchId? }   -> returns immediately, watch runs in the background
+   ... do other things ...
+get_events  { watchId? }                           -> the settled changes so far (frames + metadata), batched
+stop_watch  { watchId? }                           -> end it (omit watchId to stop all)
+```
+**Calibration.** The defaults assume a meaningful target: a playing video jitters ~2.5–4% frame-to-frame and a scene cut jumps ~16% — so `activityThreshold` (default 8%) sits between them, ignoring the jitter and reporting the cut. Because the change metric is whole-frame, a tiny local change (a clock, a toast) is a small fraction of the frame; **target the region where the change happens** so it reads as a large change. Tune `activityThreshold` / `quietThreshold` / `debounceQuietMs` / `cooldownMs` / `maxWaitMs` per call. The buffer is **memory-only** (no screen history on disk) with count, byte, and 5-minute TTL caps; watches auto-stop after 30 minutes.
+## Reflex alerts — sub-second voice, no cloud round-trip
+`get_events` is the *brain* path (the agent looks, judges, and replies — seconds, because it makes a cloud LLM call). For things you need to hear the instant they happen, `start_watch` takes `triggers` that fire **locally, with no LLM in the loop**, so the alert reaches you in well under a second:
+```
+start_watch {
+  window: "MyGame",
+  triggers: [
+    { action: "beep",  threshold: 20 },                          // a sound
+    { action: "say",   threshold: 20, say: "적 출현" },           // speak a FIXED phrase (Korean TTS)
+    { action: "ocr",   threshold: 12, dwellMs: 700 }             // OCR the region and read the text aloud
+  ]
+}
+```
+A trigger fires the moment the watched region changes past its `threshold` (with hysteresis + a per-trigger `cooldownMs`). The fast alert is the *reflex*; the agent's judged commentary still follows on the next `get_events`. Speaking uses the built-in Windows voice (System.Speech); reading text uses the built-in offline OCR (no install, no GPU).
+**Safety (this matters).** OCR text is *screen content* — untrusted. It is never spoken raw: it gets a spoken **provenance prefix** (`화면 글자: …`), control/secret-token shaping, and a **global speech budget** (capped utterances and seconds per minute, no overlapping speech, auto-mute on sustained noise) so an on-screen string can never voice fake instructions or flood you. A fixed `say` phrase is the safest action (you author the words; a trigger only controls *when*). If the voice/OCR engine isn't available, triggers degrade quietly (a `beep` still works).
+### Optional: higher-quality neural voice (Supertonic) + audio ducking
+The default voice is the built-in Windows one (System.Speech / Heami) — zero install, but robotic. For a much more natural voice, install **Supertonic 3** (Supertone): an on-device ONNX neural TTS (Korean + 30 more languages; code MIT, weights OpenRAIL-M — commercial use OK). One-time model download (~380 MB), then fully offline and fast (~0.5 s/sentence on CPU):
+```
+node scripts/fetch-supertonic.mjs       # downloads models to ~/.vortex/computer-use/supertonic-3
+npm i onnxruntime-node                   # the runtime (optionalDependency)
+```
+Once the models are present, the speak path uses them automatically (`engine: "auto"`) and falls back to Heami if anything is missing — it never goes mute.
+**Audio ducking.** While the companion speaks, other apps' audio (game / music / video) is briefly lowered per-app and **restored exactly** when it finishes, so the voice stands out. On by default. *DRM-protected audio (e.g. Netflix) cannot be ducked* — that protected path bypasses Windows volume control; normal app/game audio ducks fine.
+Configure in `computer-use.config.json` (`tts` section) or via env (env wins). Defaults shown:
+```json
+{ "tts": { "engine": "auto", "voice": "F1", "speed": 1.05, "duck": true, "duckFactor": 0.3 } }
+```
+`engine` `auto|supertonic|heami` · `voice` `F1..F5/M1..M5` · `speed` rate multiplier (~1.0 = normal, higher = faster; clamped 0.5..2.0, applied to both the neural and built-in voices) · `duckFactor` `0..1` (others drop to this fraction; lower = quieter). Env: `VORTEX_CU_TTS_ENGINE` / `VORTEX_CU_TTS_VOICE` / `VORTEX_CU_TTS_SPEED` / `VORTEX_CU_DUCK=off` / `VORTEX_CU_DUCK_FACTOR`. Restart the server after changing.
+### Optional: a local vision model (the `vision` trigger)
+A `vision` trigger describes the *scene* (not just its text) via a **local** vision-language model — smarter than OCR, still no cloud round-trip. It is **off by default and GPU-gated**: it runs only when you point it at a reachable, fast-enough local endpoint. Everything above works with **no GPU**; this just adds a smarter local description where the hardware allows. On a machine that can't run it, a `vision` trigger **degrades to `ocr`**.
+Point it at any OpenAI-compatible vision endpoint — `llama.cpp`'s `llama-server` (with `--mmproj`), `llamafile`, Ollama, LM Studio — via env (machine-local, never synced):
+```
+VORTEX_CU_VLM_ENDPOINT=http://127.0.0.1:8080/v1   # set this to enable; presence = on
+VORTEX_CU_VLM_MODEL=gemma-4-e2b-it                # any small VLM (e2b ~2GB is a good light default)
+VORTEX_CU_VLM_KEY=...                             # optional bearer token
+VORTEX_CU_VLM_ALLOW_REMOTE=1                      # only if the endpoint is NOT on this machine (off by default)
+VORTEX_CU_VLM_SLA_MS=6000                         # gate: if even a tiny probe is slower than this, stay off
+```
+How it stays safe (design §23.2/§24): only the *intent* (endpoint set or not) is configuration — the address/secret are machine-local env, never synced, so the same repo on a CPU-only machine simply runs without it. A **loopback** endpoint (same machine) is allowed by default; a **cross-network** one (LAN/VPN/another box) is **off unless you opt in**. The session **probes** the endpoint with a *synthetic* 1×1 image (never a real screen crop) to measure latency before trusting it; a real crop is sent only on an actual `vision` trigger, through the same denylist gate. The model's reply is **untrusted** — spoken with a `로컬 비전: …` prefix, shaped, and rate-limited, and the prompt tells the model to describe only (never follow on-screen instructions). `probe` reports the VLM's availability when you've configured one.
+## Adaptive companion — classify the activity, branch the help
+`classify_activity` lets the companion **figure out what you're doing from the first screenshot** and adapt, instead of being told. One read-only call returns:
+```json
+{ "class": "GAME", "process": "eldenring", "title": "...", "notificationState": "BUSY",
+  "interruptible": false, "canvas": true, "uiaCount": 1, "fullscreen": true,
+  "profile": { "proactive": true, "cadenceSec": 30, "mode": "periodic" }, "needsChangeRate": true }
+```
+It combines cheap signals — foreground **process** + **window title**, the Windows **interruptibility state** (`SHQueryUserNotificationState`), **UIA element count** (a near-empty tree on a screen-filling window = a GPU game/video canvas), and whether it **fills the screen** — into a class: `GAME · DEV · MEDIA · BROWSING · PRODUCTIVITY · UNKNOWN`, each with a help **profile**.
+The profiles branch the behavior (full design in [`docs/adaptive-companion.md`](docs/adaptive-companion.md)):
+| Class | Default | Speaks when | Cadence |
+|---|---|---|---|
+| **Strategy / sim game** | proactive | quiet stretch (your turn) + risk/opportunity | ~30 s during active play |
+| **Fast-action game** | break-gated | a menu/pause/death screen opens | one cue per break; *says up front it can't coach mid-fight* |
+| **Software dev** | silent | an error/failed build/stack trace appears | event-driven, not periodic |
+| **Media / browsing / docs** | silent | only on request | never proactive |
+For a `GAME`, `needsChangeRate` tells the agent to take a couple of `poll_change` reads to split **fast-action** (too fast to coach — break-gated only) from **strategy** (coachable, periodic). Honesty is built in: it never pretends to coach a game it can't follow, and it won't talk over media. The **interruptibility state** gates every utterance, on top of the global speech budget. Explicit user requests ("tell me when X happens", "be quieter") layer on as reflex triggers / cadence overrides.
+Tune it in `computer-use.config.json` (`companion` section): `uiaCanvasMax` (the canvas cutoff) and per-class `profiles` (e.g. `GAME.cadenceSec: 20` for chattier coaching). Env: `VORTEX_CU_UIA_CANVAS_MAX`.
+## What it is NOT
+- **Not control.** No clicking, typing, or app automation. Perception only.
+- **Not real-time for *judgment*.** Reflex `triggers` deliver a sub-second beep / fixed phrase / OCR readout, but anything the agent has to *think* about (a judged message) is seconds-scale — it makes a cloud call. Good for alerts, translation, and watching-alongside; not for reflex-speed decisions.
+- **Not comprehensive secret protection.** See *Privacy & redaction* below — the denylist is the real control; field-level masking is best-effort and does not catch plaintext secrets sitting in arbitrary windows.
+- **Not cross-platform yet.** Windows only in 0.3.0.
+## Install
+```
+npm i @vortex-os/computer-use
+```
+Peer dependency: `@vortex-os/base` (`>=0.3.0 <1.0.0`). The MCP SDK (`@modelcontextprotocol/sdk`) is an optional dependency, loaded only when the server runs. No native build step.
+### Register the MCP server
+The package ships a `vortex-mcp-computer-use` bin that launches the stdio server. Register it with your agent host. For Claude Code, add it to `.mcp.json`:
+```json
+{
+  "mcpServers": {
+    "vortex-computer-use": {
+      "command": "npx",
+      "args": ["vortex-mcp-computer-use"]
+    }
+  }
+}
+```
+> Use a server name other than the reserved `computer-use` (e.g. `vortex-computer-use`) — some hosts reserve `computer-use` and will silently skip a server with that exact name. MCP servers load at session start, so **restart the agent** after adding it.
+## Privacy & redaction
+Whatever you point this at is sent to your AI model. Two controls reduce accidental exposure; both run in the backend **before** any pixels or text reach the model:
+1. **Denylist (the primary control).** List window titles or process names that must never be captured. If a listed window is visible anywhere inside a capture region, the whole capture is refused (`{ "redacted": true }` — no image, no text). This is the reliable defense against accidentally capturing a password manager or banking window during a watch.
+2. **Password-field masking.** In `read_ui`, fields the OS reports as password inputs are dropped (no value, no text, children not traversed).
+Copy `computer-use.config.example.json` to `computer-use.config.json` (next to the server) to configure the denylist, or set `VORTEX_CU_DENY_TITLES` / `VORTEX_CU_DENY_PROCS` (JSON arrays). **The denylist is read once at startup — restart the server after changing it.**
+**Honest limits.** This is *not* comprehensive secret-scanning. A plaintext token shown in a text editor or terminal (not a password field, not a denylisted window) will still be captured. Pixel-level password masking is intentionally out of scope for 0.1.0. Capture images are volatile — held only long enough to send, then deleted; they are never written to disk persistently.
+### Audit
+Each perception call appends one metadata line (timestamp, tool, output size, a keyed HMAC of the output, and an HMAC of the window title) to a daily JSONL log under your user-local app data (`%LOCALAPPDATA%\vortex-computer-use\audit\`) — outside the synced instance data. **No raw images and no plaintext window titles are stored.** If the audit key can't be set up, perception still works and a warning is printed.
+## Verify
+```
+npm run verify        # node scripts/verify.mjs — needs a desktop session; captures the real screen
+npm run test:filter   # node scripts/test-noise-filter.mjs — pure unit tests, no screen needed
+npm run test:speech   # node scripts/test-speech-safety.mjs — pure unit tests (provenance/shaping/budget)
+npm run test:vlm      # node scripts/test-vlm.mjs — pure unit tests (config/trust-tier/protocol)
+```
+`verify` exercises every tool plus the redaction/audit gate (denylist blocking across all capture modes, no over-block, no title leak, audit written with no plaintext). The `test:*` scripts check the noise-filter, speech-safety, and VLM-protocol logic deterministically (no screen/audio/network). Three live harnesses drive the real screen: `node scripts/verify-watch.mjs` (background watch: settle → event → frame, denylist blindness, cleanup), `node scripts/verify-reflex.mjs` (reflex triggers → local speech, rendered to WAV so it stays silent), and `node scripts/verify-vlm.mjs` (the `vision` path against a mock local endpoint: synthetic-probe, real-crop only on a trigger, degrade-to-OCR).
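
Reviewer note on the README hunk above: the denylist rule it describes (refuse the whole capture if a listed window falls inside the region, matching by case-insensitive substring as the example config's `_comment` states) can be sketched as follows. This is a minimal illustration, not the package's code. The names `parseDenyList`, `isDenied`, `intersects`, and `gateCapture`, the window shape `{ title, process, rect }`, and the choice to fall back to the file's list when an env value is not a valid JSON array are all assumptions. The sketch also ignores occlusion, so it treats any overlapping window as visible.

```javascript
// Hypothetical sketch of a fail-closed denylist gate; not the package's code.

// Parse a JSON-array env value (e.g. VORTEX_CU_DENY_TITLES). Assumption: fall
// back to the file's list when the variable is unset or not a JSON array.
function parseDenyList(envValue, fallback = []) {
  if (envValue === undefined || envValue === "") return fallback;
  try {
    const parsed = JSON.parse(envValue);
    return Array.isArray(parsed) ? parsed.map(String) : fallback;
  } catch {
    return fallback;
  }
}

// Case-insensitive substring match against window title and process name.
// Empty entries are skipped so that "" cannot match every window.
function isDenied(win, deny) {
  const title = (win.title ?? "").toLowerCase();
  const proc = (win.process ?? "").toLowerCase();
  return (
    deny.titles.some((t) => t && title.includes(t.toLowerCase())) ||
    deny.procs.some((p) => p && proc.includes(p.toLowerCase()))
  );
}

// Two rectangles { x, y, w, h } intersect when they share any area.
function intersects(a, b) {
  return a.x < b.x + b.w && b.x < a.x + a.w && a.y < b.y + b.h && b.y < a.y + a.h;
}

// Refuse the whole capture if any denied window overlaps the region;
// otherwise run the (caller-supplied) capture function.
function gateCapture(region, windows, deny, capture) {
  const blocked = windows.some((w) => isDenied(w, deny) && intersects(w.rect, region));
  return blocked ? { redacted: true } : capture(region);
}
```

The gate runs before `capture` is ever called, which mirrors the README's point that redaction happens in the backend before any pixels or text reach the model.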

package/computer-use.config.example.json CHANGED

@@ -1,28 +1,29 @@
-{
-  "_comment": "Copy to computer-use.config.json to enable redaction. Empty by default — nothing is blocked until you add entries. The denylist is the primary control: any listed window/process that appears inside a capture region makes the whole capture fail-closed (no image, no structured text). Matching is case-insensitive substring. This is NOT comprehensive secret-scanning: plaintext secrets visible in non-listed windows (editors, terminals) are still captured. Env overrides: VORTEX_CU_DENY_TITLES / VORTEX_CU_DENY_PROCS (JSON arrays).",
-  "_restart": "The denylist is read once at server start; RESTART the MCP server (restart the agent / reload its MCP servers) after changing this file or the env vars for the change to take effect.",
-  "redaction": {
-    "denyWindowTitles": [],
-    "denyProcesses": []
-  },
-  "_examples": {
-    "denyWindowTitles": ["Bitwarden", "1Password", "KeePass", "Online Banking"],
-    "denyProcesses": ["Bitwarden", "1Password", "KeePassXC", "keeper"]
-  },
-  "_tts_comment": "Spoken-voice settings. engine 'auto' uses the higher-quality Supertonic neural voice when its models are present (run `node scripts/fetch-supertonic.mjs` once, ~380MB), else falls back to the built-in Windows voice ('heami'). 'duck' lowers OTHER apps' audio while the companion speaks (per-app, restored after); 'duckFactor' is how far they drop (0.3 = to 30%; lower = quieter). NOTE: DRM-protected audio (e.g. Netflix) can't be ducked — that's the app's protection, not a limitation here. Env overrides win over this file: VORTEX_CU_TTS_ENGINE / VORTEX_CU_TTS_VOICE / VORTEX_CU_DUCK(=off) / VORTEX_CU_DUCK_FACTOR. Restart the server after changing.",
-  "tts": {
-    "engine": "auto",
-    "voice": "F1",
-    "duck": true,
-    "duckFactor": 0.3
-  },
-  "_companion_comment": "Adaptive companion (classify_activity). 'uiaCanvasMax' is the UIA element-count cutoff below which a screen-filling window is treated as a GPU canvas (game/video) — raise it if a game with some on-screen widgets is misread as an app. 'profiles' overrides per-class cadence/proactivity, e.g. make game coaching chattier with GAME.cadenceSec=20. Classes: GAME, DEV, MEDIA, BROWSING, PRODUCTIVITY, UNKNOWN. Env override: VORTEX_CU_UIA_CANVAS_MAX. See docs/adaptive-companion.md. Restart the server after changing.",
-  "companion": {
-    "uiaCanvasMax": 5,
-    "profiles": {
-      "GAME": { "cadenceSec": 30, "proactive": true },
-      "DEV": { "proactive": false },
-      "MEDIA": { "proactive": false }
-    }
-  }
-}
+{
+  "_comment": "Copy to computer-use.config.json to enable redaction. Empty by default — nothing is blocked until you add entries. The denylist is the primary control: any listed window/process that appears inside a capture region makes the whole capture fail-closed (no image, no structured text). Matching is case-insensitive substring. This is NOT comprehensive secret-scanning: plaintext secrets visible in non-listed windows (editors, terminals) are still captured. Env overrides: VORTEX_CU_DENY_TITLES / VORTEX_CU_DENY_PROCS (JSON arrays).",
+  "_restart": "The denylist is read once at server start; RESTART the MCP server (restart the agent / reload its MCP servers) after changing this file or the env vars for the change to take effect.",
+  "redaction": {
+    "denyWindowTitles": [],
+    "denyProcesses": []
+  },
+  "_examples": {
+    "denyWindowTitles": ["Bitwarden", "1Password", "KeePass", "Online Banking"],
+    "denyProcesses": ["Bitwarden", "1Password", "KeePassXC", "keeper"]
+  },
+  "_tts_comment": "Spoken-voice settings. engine 'auto' uses the higher-quality Supertonic neural voice when its models are present (run `node scripts/fetch-supertonic.mjs` once, ~380MB), else falls back to the built-in Windows voice ('heami'). 'speed' is the speech-rate multiplier (~1.0 = normal, higher = faster; applied to Supertonic directly and mapped onto the Windows voice's rate; clamped to 0.5..2.0). 'duck' lowers OTHER apps' audio while the companion speaks (per-app, restored after); 'duckFactor' is how far they drop (0.3 = to 30%; lower = quieter). NOTE: DRM-protected audio (e.g. Netflix) can't be ducked — that's the app's protection, not a limitation here. Env overrides win over this file: VORTEX_CU_TTS_ENGINE / VORTEX_CU_TTS_VOICE / VORTEX_CU_TTS_SPEED / VORTEX_CU_DUCK(=off) / VORTEX_CU_DUCK_FACTOR. Restart the server after changing.",
+  "tts": {
+    "engine": "auto",
+    "voice": "F1",
+    "speed": 1.05,
+    "duck": true,
+    "duckFactor": 0.3
+  },
+  "_companion_comment": "Adaptive companion (classify_activity). 'uiaCanvasMax' is the UIA element-count cutoff below which a screen-filling window is treated as a GPU canvas (game/video) — raise it if a game with some on-screen widgets is misread as an app. 'profiles' overrides per-class cadence/proactivity, e.g. make game coaching chattier with GAME.cadenceSec=20. Classes: GAME, DEV, MEDIA, BROWSING, PRODUCTIVITY, UNKNOWN. Env override: VORTEX_CU_UIA_CANVAS_MAX. See docs/adaptive-companion.md. Restart the server after changing.",
+  "companion": {
+    "uiaCanvasMax": 5,
+    "profiles": {
+      "GAME": { "cadenceSec": 30, "proactive": true },
+      "DEV": { "proactive": false },
+      "MEDIA": { "proactive": false }
+    }
+  }
+}
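
Reviewer note on the config hunk above: the new `speed` key and its `VORTEX_CU_TTS_SPEED` override follow the rules stated in `_tts_comment` (env overrides win over the file, speed is clamped to 0.5..2.0, `VORTEX_CU_DUCK=off` disables ducking). A minimal sketch of that resolution is below. It is illustrative only: the function names `clamp` and `resolveTts`, the fallback defaults (taken from the example file, with 1.0 assumed for an unparsable speed), and the handling of non-numeric values are assumptions, not the package's implementation.

```javascript
// Hypothetical sketch of TTS settings resolution; not the package's code.

function clamp(n, lo, hi) {
  return Math.min(hi, Math.max(lo, n));
}

// fileTts: the "tts" object from computer-use.config.json.
// env: an environment map such as process.env. Env values win over the file.
function resolveTts(fileTts = {}, env = {}) {
  const rawSpeed = Number(env.VORTEX_CU_TTS_SPEED ?? fileTts.speed ?? 1.0);
  const rawFactor = Number(env.VORTEX_CU_DUCK_FACTOR ?? fileTts.duckFactor ?? 0.3);
  return {
    engine: env.VORTEX_CU_TTS_ENGINE ?? fileTts.engine ?? "auto",
    voice: env.VORTEX_CU_TTS_VOICE ?? fileTts.voice ?? "F1",
    // Clamp to the documented 0.5..2.0 range; assumed default 1.0 if not a number.
    speed: Number.isFinite(rawSpeed) ? clamp(rawSpeed, 0.5, 2.0) : 1.0,
    // VORTEX_CU_DUCK=off disables ducking regardless of the file setting.
    duck: env.VORTEX_CU_DUCK === "off" ? false : fileTts.duck ?? true,
    duckFactor: Number.isFinite(rawFactor) ? rawFactor : 0.3,
  };
}
```

A caller would typically pass the parsed file and the live environment, e.g. `resolveTts(config.tts, process.env)`, once at startup, which is consistent with the comment's instruction to restart the server after changing either source.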