@vortex-os/computer-use 0.2.1 → 0.7.0

package/README.md CHANGED
@@ -2,11 +2,11 @@
2
2
 
3
3
  Read-only **screen perception** for VortEX agents, exposed as an MCP server. It lets an agent *see* what is on screen — read a window's structure, capture a region as an image, and watch for on-screen changes — without ever moving the mouse or typing. It layers on `@vortex-os/base` but also works standalone.
4
4
 
5
- > **Status: 0.1.0, Windows-first, read-only.** Mouse/keyboard **control is intentionally out of scope** for this release — this package only *perceives*. macOS/Linux backends are not yet implemented.
5
+ > **Status: 0.5.0, Windows-first, read-only.** Mouse/keyboard **control is intentionally out of scope** for this release — this package only *perceives*. macOS/Linux backends are not yet implemented.
6
6
 
7
7
  ## What it is
8
8
 
9
- An MCP (Model Context Protocol) server that exposes six perception tools over stdio:
9
+ An MCP (Model Context Protocol) server that exposes nine perception tools over stdio:
10
10
 
11
11
  | Tool | What it does | Cost |
12
12
  |---|---|---|
@@ -15,15 +15,115 @@ An MCP (Model Context Protocol) server that exposes six perception tools over st
15
15
  | `capture_screen` | Pixel capture (PNG) for what structure can't reach — canvases, games, remote desktops. Target by window, region, monitor, or cursor box. | image |
16
16
  | `watch_capture` | Captures N frames at an interval in one process; with `changeOnly`, keeps only changed frames. | image(s) |
17
17
  | `poll_change` | One non-blocking "did it change?" probe; returns a change percentage and (optionally) an image. Poll it on an interval to watch without blocking. | metadata, image optional |
18
+ | `start_watch` | Watch a fixed target in the **background** (non-blocking) with a built-in **noise filter** that keeps only meaningful, settled changes. Works on video/games that change every frame. | runs in background |
19
+ | `get_events` | Collect the buffered changes a `start_watch` has accumulated — batched (a few looks for a long watch); each event carries the settled frame. | metadata + image(s) |
20
+ | `stop_watch` | Stop a background watch and discard its buffer. | — |
18
21
  | `beep` | A system beep, to get the user's attention while they look elsewhere. | — |
19
22
 
20
23
  The design favors **structure first, pixels as fallback**: `read_ui` is cheap and precise for ordinary apps; `capture_screen` is for content that has no accessibility tree (games, custom canvases).
21
24
 
25
+ ## Watching for changes (the noise filter + event buffer)
26
+
27
+ `poll_change` is the manual primitive — *you* poll it on a loop. `start_watch` is the hands-off version: it runs a **background loop with a noise filter** and buffers what matters, so you can watch a busy screen without drowning in frames.
28
+
29
+ The problem it solves: on video, games, or scrolling, the screen changes *every* frame, so raw change-detection fires constantly on the ripples of one activity. The filter combines **debounce** (wait for motion to settle, then capture a clean frame — quality) with **cooldown** (at most one event per N seconds — frequency) and **hysteresis** (ambient jitter never even wakes it), plus an anti-starvation `maxWait` so a continuously-moving screen still yields periodic snapshots instead of going silent.
30
+
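As a rough illustration of how debounce, cooldown, hysteresis, and `maxWait` can combine, here is a minimal sketch. It is not the package's `noise-filter.mjs`: the name `createNoiseFilter`, the internal state, and the rule ordering are assumptions. The option names mirror the tunables listed in the README, and every default except the 8% `activityThreshold` is a placeholder.

```javascript
// Hypothetical sketch, not the shipped noise-filter.mjs.
// feed(changePct, nowMs) returns true when a buffered event should be emitted.
function createNoiseFilter(opts = {}) {
  const activityThreshold = opts.activityThreshold ?? 8; // % change that wakes the filter (README default)
  const quietThreshold = opts.quietThreshold ?? 3;       // % at or below which a frame counts as quiet (assumed)
  const debounceQuietMs = opts.debounceQuietMs ?? 500;   // assumed placeholder
  const cooldownMs = opts.cooldownMs ?? 5000;            // assumed placeholder
  const maxWaitMs = opts.maxWaitMs ?? 10000;             // assumed placeholder

  let active = false;     // true once a change has woken the filter
  let activeSince = 0;    // start of the current burst (or of the current maxWait window)
  let quietSince = null;  // first quiet frame of the current quiet run
  let lastEmit = -Infinity;

  return function feed(changePct, nowMs) {
    if (!active) {
      // Hysteresis: ambient jitter below activityThreshold never wakes the filter.
      if (changePct >= activityThreshold) {
        active = true;
        activeSince = nowMs;
        quietSince = null;
      }
      return false;
    }
    // Debounce: track how long the screen has stayed quiet.
    if (changePct <= quietThreshold) {
      if (quietSince === null) quietSince = nowMs;
    } else {
      quietSince = null;
    }
    const settled = quietSince !== null && nowMs - quietSince >= debounceQuietMs;
    // Anti-starvation: a screen that never settles still yields a periodic snapshot.
    const starved = nowMs - activeSince >= maxWaitMs;
    if (!settled && !starved) return false;

    if (settled) {
      active = false;
      quietSince = null;
    } else {
      activeSince = nowMs; // restart the maxWait window, stay awake
    }
    // Cooldown: at most one event per cooldownMs.
    if (nowMs - lastEmit < cooldownMs) return false;
    lastEmit = nowMs;
    return true;
  };
}
```

A caller would feed it one change percentage per sampled frame and capture a frame whenever it returns `true`. Keeping the wake threshold above the quiet threshold is what lets 2.5–4% video jitter pass unnoticed while a 16% scene cut wakes the filter.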
31
+ ```
32
+ start_watch { region|window|monitor, watchId? } -> returns immediately, watch runs in the background
33
+ ... do other things ...
34
+ get_events { watchId? } -> the settled changes so far (frames + metadata), batched
35
+ stop_watch { watchId? } -> end it (omit watchId to stop all)
36
+ ```
37
+
38
+ **Calibration.** The defaults assume a meaningful target: a playing video jitters ~2.5–4% frame-to-frame and a scene cut jumps ~16% — so `activityThreshold` (default 8%) sits between them, ignoring the jitter and reporting the cut. Because the change metric is whole-frame, a tiny local change (a clock, a toast) is a small fraction of the frame; **target the region where the change happens** so it reads as a large change. Tune `activityThreshold` / `quietThreshold` / `debounceQuietMs` / `cooldownMs` / `maxWaitMs` per call. The buffer is **memory-only** (no screen history on disk) with count, byte, and 5-minute TTL caps; watches auto-stop after 30 minutes.
39
+
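The advice to target the region is easier to see with numbers. The sketch below assumes the change metric is roughly "changed pixels divided by pixels in the watched target"; the README only says the metric is whole-frame, so the exact formula and the helper name `changePercent` are assumptions.

```javascript
// Assumption: change % is approximately changedPixels / targetPixels * 100.
function changePercent(changedPixels, targetWidth, targetHeight) {
  return (changedPixels / (targetWidth * targetHeight)) * 100;
}

// A 200x50 clock that repaints completely, watched as part of a 1920x1080 monitor:
const wholeMonitor = changePercent(200 * 50, 1920, 1080);
// The same repaint, watched through a region cropped to the clock itself:
const croppedRegion = changePercent(200 * 50, 200, 50);

console.log(wholeMonitor.toFixed(2));  // 0.48 (far below the 8% default activityThreshold)
console.log(croppedRegion.toFixed(2)); // 100.00
```

Under that assumption the clock never reaches the default threshold on a full-monitor watch, yet it registers as a complete change once the region is cropped to it.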
40
+ ## Reflex alerts — sub-second voice, no cloud round-trip
41
+
42
+ `get_events` is the *brain* path (the agent looks, judges, and replies — seconds, because it makes a cloud LLM call). For things you need to hear the instant they happen, `start_watch` takes `triggers` that fire **locally, with no LLM in the loop**, so the alert reaches you in well under a second:
43
+
44
+ ```
45
+ start_watch {
46
+ window: "MyGame",
47
+ triggers: [
48
+ { action: "beep", threshold: 20 }, // a sound
49
+ { action: "say", threshold: 20, say: "적 출현" }, // speak a FIXED phrase (Korean TTS)
50
+ { action: "ocr", threshold: 12, dwellMs: 700 } // OCR the region and read the text aloud
51
+ ]
52
+ }
53
+ ```
54
+
55
+ A trigger fires the moment the watched region changes past its `threshold` (with hysteresis + a per-trigger `cooldownMs`). The fast alert is the *reflex*; the agent's judged commentary still follows on the next `get_events`. Speaking uses the built-in Windows voice (System.Speech); reading text uses the built-in offline OCR (no install, no GPU).
56
+
57
+ **Safety (this matters).** OCR text is *screen content* — untrusted. It is never spoken raw: it gets a spoken **provenance prefix** (`화면 글자: …`), control/secret-token shaping, and a **global speech budget** (capped utterances and seconds per minute, no overlapping speech, auto-mute on sustained noise) so an on-screen string can never voice fake instructions or flood you. A fixed `say` phrase is the safest action (you author the words; a trigger only controls *when*). If the voice/OCR engine isn't available, triggers degrade quietly (a `beep` still works).
58
+
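To make two of the safety measures concrete, here is a minimal sketch of a provenance prefix with control-character shaping, plus a per-minute utterance cap. It is not the package's `speech-safety.mjs`: the function names, the 80-character clamp, and the cap of 6 are illustrative assumptions, and secret-token shaping, the seconds-per-minute cap, overlap prevention, and auto-mute are left out.

```javascript
// Hypothetical sketch, not the shipped speech-safety.mjs.
const PROVENANCE_PREFIX = '화면 글자: '; // the prefix quoted in the README

// Replace control characters, collapse whitespace, clamp the length, then add the prefix,
// so untrusted screen text is always announced as screen text.
function shapeOcrText(text, maxChars = 80) {
  const cleaned = String(text)
    .replace(/[\u0000-\u001f\u007f]/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
  return PROVENANCE_PREFIX + cleaned.slice(0, maxChars);
}

// Sliding one-minute window: allow at most maxUtterances utterances per 60 seconds.
function createSpeechBudget(maxUtterances = 6) {
  const spokenAt = []; // timestamps (ms) of utterances still inside the window
  return function allow(nowMs) {
    while (spokenAt.length && nowMs - spokenAt[0] >= 60000) spokenAt.shift();
    if (spokenAt.length >= maxUtterances) return false;
    spokenAt.push(nowMs);
    return true;
  };
}
```

The budget is global on purpose in this sketch: one `allow` function would be shared by every trigger, so a noisy region cannot multiply the amount of speech by registering more triggers.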
59
+ ### Optional: higher-quality neural voice (Supertonic) + audio ducking
60
+
61
+ The default voice is the built-in Windows one (System.Speech / Heami) — zero install, but robotic. For a much more natural voice, install **Supertonic 3** (Supertone): an on-device ONNX neural TTS (Korean + 30 more languages; code MIT, weights OpenRAIL-M — commercial use OK). One-time model download (~380 MB), then fully offline and fast (~0.5 s/sentence on CPU):
62
+
63
+ ```
64
+ node scripts/fetch-supertonic.mjs # downloads models to ~/.vortex/computer-use/supertonic-3
65
+ npm i onnxruntime-node # the runtime (optionalDependency)
66
+ ```
67
+
68
+ Once the models are present, the speak path uses them automatically (`engine: "auto"`) and falls back to Heami if anything is missing — it never goes mute.
69
+
70
+ **Audio ducking.** While the companion speaks, other apps' audio (game / music / video) is briefly lowered per-app and **restored exactly** when it finishes, so the voice stands out. On by default. *DRM-protected audio (e.g. Netflix) cannot be ducked* — that protected path bypasses Windows volume control; normal app/game audio ducks fine.
71
+
72
+ Configure in `computer-use.config.json` (`tts` section) or via env (env wins). Defaults shown:
73
+
74
+ ```json
75
+ { "tts": { "engine": "auto", "voice": "F1", "duck": true, "duckFactor": 0.3 } }
76
+ ```
77
+
78
+ `engine` `auto|supertonic|heami` · `voice` `F1..F5/M1..M5` · `duckFactor` `0..1` (others drop to this fraction; lower = quieter). Env: `VORTEX_CU_TTS_ENGINE` / `VORTEX_CU_TTS_VOICE` / `VORTEX_CU_DUCK=off` / `VORTEX_CU_DUCK_FACTOR`. Restart the server after changing.
79
+
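The precedence described above (env over file, file over defaults) can be sketched as follows. The environment variable names and default values come from the README; the function name `resolveTtsConfig` and the handling of out-of-range values are assumptions, not the package's actual loader.

```javascript
// Hypothetical sketch of "env wins over file, file wins over defaults" for the tts section.
const TTS_DEFAULTS = { engine: 'auto', voice: 'F1', duck: true, duckFactor: 0.3 };

function resolveTtsConfig(fileConfig = {}, env = {}) {
  // Start from the defaults and layer the file's tts section on top.
  const cfg = { ...TTS_DEFAULTS, ...(fileConfig.tts || {}) };

  // Env overrides are applied last, so they win.
  if (env.VORTEX_CU_TTS_ENGINE) cfg.engine = env.VORTEX_CU_TTS_ENGINE;
  if (env.VORTEX_CU_TTS_VOICE) cfg.voice = env.VORTEX_CU_TTS_VOICE;
  if (env.VORTEX_CU_DUCK === 'off') cfg.duck = false;
  if (env.VORTEX_CU_DUCK_FACTOR) {
    const factor = Number(env.VORTEX_CU_DUCK_FACTOR);
    // Assumption: values outside the documented 0..1 range are ignored rather than clamped.
    if (Number.isFinite(factor) && factor >= 0 && factor <= 1) cfg.duckFactor = factor;
  }
  return cfg;
}
```

In practice the second argument would be `process.env` and the first the parsed `computer-use.config.json`. Because the values are read once, the server needs a restart to pick up changes.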
80
+ ### Optional: a local vision model (the `vision` trigger)
81
+
82
+ A `vision` trigger describes the *scene* (not just its text) via a **local** vision-language model — smarter than OCR, still no cloud round-trip. It is **off by default and GPU-gated**: it runs only when you point it at a reachable, fast-enough local endpoint. Everything above works with **no GPU**; this just adds a smarter local description where the hardware allows. On a machine that can't run it, a `vision` trigger **degrades to `ocr`**.
83
+
84
+ Point it at any OpenAI-compatible vision endpoint — `llama.cpp`'s `llama-server` (with `--mmproj`), `llamafile`, Ollama, LM Studio — via env (machine-local, never synced):
85
+
86
+ ```
87
+ VORTEX_CU_VLM_ENDPOINT=http://127.0.0.1:8080/v1 # set this to enable; presence = on
88
+ VORTEX_CU_VLM_MODEL=gemma-4-e2b-it # any small VLM (e2b ~2GB is a good light default)
89
+ VORTEX_CU_VLM_KEY=... # optional bearer token
90
+ VORTEX_CU_VLM_ALLOW_REMOTE=1 # only if the endpoint is NOT on this machine (off by default)
91
+ VORTEX_CU_VLM_SLA_MS=6000 # gate: if even a tiny probe is slower than this, stay off
92
+ ```
93
+
94
+ How it stays safe (design §23.2/§24): only the *intent* (endpoint set or not) is configuration — the address/secret are machine-local env, never synced, so the same repo on a CPU-only machine simply runs without it. A **loopback** endpoint (same machine) is allowed by default; a **cross-network** one (LAN/VPN/another box) is **off unless you opt in**. The session **probes** the endpoint with a *synthetic* 1×1 image (never a real screen crop) to measure latency before trusting it; a real crop is sent only on an actual `vision` trigger, through the same denylist gate. The model's reply is **untrusted** — spoken with a `로컬 비전: …` prefix, shaped, and rate-limited, and the prompt tells the model to describe only (never follow on-screen instructions). `probe` reports the VLM's availability when you've configured one.
95
+
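The loopback-versus-remote rule can be sketched like this. It is an illustration under assumptions, not the package's `vlm.mjs`: the function names are hypothetical, and the latency probe gated by `VORTEX_CU_VLM_SLA_MS` is not modeled.

```javascript
// Hypothetical sketch of the trust-tier gate; not the shipped vlm.mjs.
function isLoopbackEndpoint(endpoint) {
  let host;
  try {
    host = new URL(endpoint).hostname.toLowerCase();
  } catch {
    return false; // unparsable endpoint: never treat it as local
  }
  // The WHATWG URL parser keeps the brackets on IPv6 literals, hence '[::1]'.
  return host === 'localhost' || host === '[::1]' || /^127(\.\d{1,3}){3}$/.test(host);
}

function vlmEnabled(env = {}) {
  const endpoint = env.VORTEX_CU_VLM_ENDPOINT;
  if (!endpoint) return false;                    // no endpoint set: feature stays off
  if (isLoopbackEndpoint(endpoint)) return true;  // same machine: allowed by default
  return env.VORTEX_CU_VLM_ALLOW_REMOTE === '1';  // cross-network: explicit opt-in only
}
```

Failing closed on a malformed URL matches the README's stance that a remote endpoint should never be reached by accident.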
96
+ ## Adaptive companion — classify the activity, branch the help
97
+
98
+ `classify_activity` lets the companion **figure out what you're doing from the first screenshot** and adapt, instead of being told. One read-only call returns:
99
+
100
+ ```json
101
+ { "class": "GAME", "process": "eldenring", "title": "...", "notificationState": "BUSY",
102
+ "interruptible": false, "canvas": true, "uiaCount": 1, "fullscreen": true,
103
+ "profile": { "proactive": true, "cadenceSec": 30, "mode": "periodic" }, "needsChangeRate": true }
104
+ ```
105
+
106
+ It combines cheap signals — foreground **process** + **window title**, the Windows **interruptibility state** (`SHQueryUserNotificationState`), **UIA element count** (a near-empty tree on a screen-filling window = a GPU game/video canvas), and whether it **fills the screen** — into a class: `GAME · DEV · MEDIA · BROWSING · PRODUCTIVITY · UNKNOWN`, each with a help **profile**.
107
+
108
+ The profiles branch the behavior (full design in [`docs/adaptive-companion.md`](docs/adaptive-companion.md)):
109
+
110
+ | Class | Default | Speaks when | Cadence |
111
+ |---|---|---|---|
112
+ | **Strategy / sim game** | proactive | quiet stretch (your turn) + risk/opportunity | ~30 s during active play |
113
+ | **Fast-action game** | break-gated | a menu/pause/death screen opens | one cue per break; *says up front it can't coach mid-fight* |
114
+ | **Software dev** | silent | an error/failed build/stack trace appears | event-driven, not periodic |
115
+ | **Media / browsing / docs** | silent | only on request | never proactive |
116
+
117
+ For a `GAME`, `needsChangeRate` tells the agent to take a couple of `poll_change` reads to split **fast-action** (too fast to coach — break-gated only) from **strategy** (coachable, periodic). Honesty is built in: it never pretends to coach a game it can't follow, and it won't talk over media. The **interruptibility state** gates every utterance, on top of the global speech budget. Explicit user requests ("tell me when X happens", "be quieter") layer on as reflex triggers / cadence overrides.
118
+
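For a sense of what the agent does with `needsChangeRate`, the sketch below averages a few `poll_change` percentages and compares the mean against a cutoff. The real thresholds live in `docs/adaptive-companion.md` and are not reproduced here, so the cutoff is a required argument; the function name `splitGameKind` and the use of a plain mean are assumptions.

```javascript
// Hypothetical sketch: the cutoff must come from docs/adaptive-companion.md, not from this example.
function splitGameKind(changePercents, fastActionCutoffPct) {
  if (!changePercents.length) return 'unknown'; // no samples yet, no verdict
  const mean = changePercents.reduce((sum, pct) => sum + pct, 0) / changePercents.length;
  // A high sustained change rate suggests fast action (break-gated only);
  // a low one suggests a strategy or sim game that can be coached periodically.
  return mean >= fastActionCutoffPct ? 'fast-action' : 'strategy';
}
```

Returning `'unknown'` for an empty sample keeps the caller from committing to a coaching style before it has looked at the screen at least once.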
119
+ Tune it in `computer-use.config.json` (`companion` section): `uiaCanvasMax` (the canvas cutoff) and per-class `profiles` (e.g. `GAME.cadenceSec: 20` for chattier coaching). Env: `VORTEX_CU_UIA_CANVAS_MAX`.
120
+
22
121
  ## What it is NOT
23
122
 
24
123
  - **Not control.** No clicking, typing, or app automation. Perception only.
124
+ - **Not real-time for *judgment*.** Reflex `triggers` deliver a sub-second beep / fixed phrase / OCR readout, but anything the agent has to *think* about (a judged message) is seconds-scale — it makes a cloud call. Good for alerts, translation, and watching-alongside; not for reflex-speed decisions.
25
125
  - **Not comprehensive secret protection.** See *Privacy & redaction* below — the denylist is the real control; field-level masking is best-effort and does not catch plaintext secrets sitting in arbitrary windows.
26
- - **Not cross-platform yet.** Windows only in 0.1.0.
126
+ - **Not cross-platform yet.** Windows only in 0.3.0.
27
127
 
28
128
  ## Install
29
129
 
@@ -68,7 +168,10 @@ Each perception call appends one metadata line (timestamp, tool, output size, a
68
168
  ## Verify
69
169
 
70
170
  ```
71
- npm run verify # node scripts/verify.mjs — needs a desktop session; captures the real screen
171
+ npm run verify # node scripts/verify.mjs — needs a desktop session; captures the real screen
172
+ npm run test:filter # node scripts/test-noise-filter.mjs — pure unit tests, no screen needed
173
+ npm run test:speech # node scripts/test-speech-safety.mjs — pure unit tests (provenance/shaping/budget)
174
+ npm run test:vlm # node scripts/test-vlm.mjs — pure unit tests (config/trust-tier/protocol)
72
175
  ```
73
176
 
74
- Exercises every tool plus the redaction/audit gate (denylist blocking across all capture modes, no over-block, no title leak, audit written with no plaintext).
177
+ `verify` exercises every tool plus the redaction/audit gate (denylist blocking across all capture modes, no over-block, no title leak, audit written with no plaintext). The `test:*` scripts check the noise-filter, speech-safety, and VLM-protocol logic deterministically (no screen/audio/network). Three live harnesses drive the real screen: `node scripts/verify-watch.mjs` (background watch: settle → event → frame, denylist blindness, cleanup), `node scripts/verify-reflex.mjs` (reflex triggers → local speech, rendered to WAV so it stays silent), and `node scripts/verify-vlm.mjs` (the `vision` path against a mock local endpoint: synthetic-probe, real-crop only on a trigger, degrade-to-OCR).
package/computer-use.config.example.json CHANGED
@@ -8,5 +8,21 @@
8
8
  "_examples": {
9
9
  "denyWindowTitles": ["Bitwarden", "1Password", "KeePass", "Online Banking"],
10
10
  "denyProcesses": ["Bitwarden", "1Password", "KeePassXC", "keeper"]
11
+ },
12
+ "_tts_comment": "Spoken-voice settings. engine 'auto' uses the higher-quality Supertonic neural voice when its models are present (run `node scripts/fetch-supertonic.mjs` once, ~380MB), else falls back to the built-in Windows voice ('heami'). 'duck' lowers OTHER apps' audio while the companion speaks (per-app, restored after); 'duckFactor' is how far they drop (0.3 = to 30%; lower = quieter). NOTE: DRM-protected audio (e.g. Netflix) can't be ducked — that's the app's protection, not a limitation here. Env overrides win over this file: VORTEX_CU_TTS_ENGINE / VORTEX_CU_TTS_VOICE / VORTEX_CU_DUCK(=off) / VORTEX_CU_DUCK_FACTOR. Restart the server after changing.",
13
+ "tts": {
14
+ "engine": "auto",
15
+ "voice": "F1",
16
+ "duck": true,
17
+ "duckFactor": 0.3
18
+ },
19
+ "_companion_comment": "Adaptive companion (classify_activity). 'uiaCanvasMax' is the UIA element-count cutoff below which a screen-filling window is treated as a GPU canvas (game/video) — raise it if a game with some on-screen widgets is misread as an app. 'profiles' overrides per-class cadence/proactivity, e.g. make game coaching chattier with GAME.cadenceSec=20. Classes: GAME, DEV, MEDIA, BROWSING, PRODUCTIVITY, UNKNOWN. Env override: VORTEX_CU_UIA_CANVAS_MAX. See docs/adaptive-companion.md. Restart the server after changing.",
20
+ "companion": {
21
+ "uiaCanvasMax": 5,
22
+ "profiles": {
23
+ "GAME": { "cadenceSec": 30, "proactive": true },
24
+ "DEV": { "proactive": false },
25
+ "MEDIA": { "proactive": false }
26
+ }
11
27
  }
12
28
  }
package/package.json CHANGED
@@ -1,7 +1,7 @@
1
1
  {
2
2
  "name": "@vortex-os/computer-use",
3
- "version": "0.2.1",
4
- "description": "Add-on — read-only screen perception (structured UIA tree + pixel fallback + change watch) exposed as an MCP server, layered on @vortex-os/base. Windows-first. Control (mouse/keyboard) is intentionally out of scope.",
3
+ "version": "0.7.0",
4
+ "description": "Add-on — read-only screen perception (structured UIA tree + pixel fallback + noise-filtered background watch with an event buffer + sub-second reflex alerts: beep / fixed-phrase / OCR or optional local-VLM description spoken locally, optional higher-quality Supertonic neural TTS with per-app audio ducking, adaptive companion that classifies the on-screen activity and branches its help) exposed as an MCP server, layered on @vortex-os/base. Windows-first. Control (mouse/keyboard) is intentionally out of scope.",
5
5
  "license": "MIT",
6
6
  "author": "vortex-os-project",
7
7
  "homepage": "https://github.com/vortex-os-project/vortex#readme",
@@ -13,10 +13,20 @@
13
13
  "type": "module",
14
14
  "files": [
15
15
  "scripts/mcp-stdio.mjs",
16
+ "scripts/noise-filter.mjs",
17
+ "scripts/speech-safety.mjs",
18
+ "scripts/vlm.mjs",
19
+ "scripts/ocr.ps1",
20
+ "scripts/speak.ps1",
21
+ "scripts/speak-supertonic.mjs",
22
+ "scripts/audio-duck.ps1",
23
+ "scripts/fetch-supertonic.mjs",
16
24
  "scripts/worker.ps1",
17
25
  "scripts/lib.ps1",
18
26
  "scripts/probe.ps1",
19
27
  "scripts/read-ui.ps1",
28
+ "scripts/classify.ps1",
29
+ "scripts/activity.mjs",
20
30
  "scripts/point-to-ask.ps1",
21
31
  "computer-use.config.example.json",
22
32
  "README.md"
@@ -33,7 +43,11 @@
33
43
  }
34
44
  },
35
45
  "scripts": {
36
- "verify": "node scripts/verify.mjs"
46
+ "verify": "node scripts/verify.mjs",
47
+ "test:filter": "node scripts/test-noise-filter.mjs",
48
+ "test:speech": "node scripts/test-speech-safety.mjs",
49
+ "test:vlm": "node scripts/test-vlm.mjs",
50
+ "test:activity": "node scripts/test-activity.mjs"
37
51
  },
38
52
  "peerDependencies": {
39
53
  "@vortex-os/base": ">=0.3.0 <1.0.0"
@@ -44,7 +58,8 @@
44
58
  }
45
59
  },
46
60
  "optionalDependencies": {
47
- "@modelcontextprotocol/sdk": "^1.21.0"
61
+ "@modelcontextprotocol/sdk": "^1.21.0",
62
+ "onnxruntime-node": "^1.19.2"
48
63
  },
49
64
  "engines": {
50
65
  "node": ">=22"
package/scripts/activity.mjs ADDED
@@ -0,0 +1,92 @@
1
+ // computer-use — activity classifier (pure, testable). Turns the raw signals from classify.ps1
2
+ // (foreground process/title, notification state, UIA count, fullscreen) into an activity CLASS and a help
3
+ // PROFILE the companion uses to pick its cadence/triggers/style. See docs/adaptive-companion.md.
4
+ //
5
+ // Deliberately conservative: process name is the strongest prior; UIA-emptiness marks a GPU canvas
6
+ // (game/video) only when the process isn't a known browser/player; everything is overridable by config and by
7
+ // explicit user requests. All thresholds are starting points to calibrate, surfaced via opts.
8
+
9
+ // Known-app tables — lowercase process names, no ".exe". Extend via opts.apps.{dev,media,browser,productivity}.
10
+ const DEV = ['code', 'cursor', 'devenv', 'rider64', 'idea64', 'pycharm64', 'webstorm64', 'clion64', 'goland64',
11
+ 'sublime_text', 'notepad++', 'windowsterminal', 'wt', 'powershell', 'pwsh', 'cmd', 'conhost', 'alacritty',
12
+ 'wezterm', 'hx', 'nvim', 'vim', 'emacs'];
13
+ const MEDIA = ['mpv', 'vlc', 'mpc-hc64', 'mpc-be64', 'potplayermini64', 'potplayer', 'wmplayer', 'smplayer',
14
+ 'kodi', 'plex', 'plexmediaplayer', 'netflix'];
15
+ const BROWSER = ['chrome', 'msedge', 'firefox', 'brave', 'opera', 'vivaldi', 'arc', 'librewolf'];
16
+ const PRODUCTIVITY = ['winword', 'excel', 'powerpnt', 'onenote', 'acrobat', 'acrord32', 'sumatrapdf', 'foxitpdfreader',
17
+ 'notepad', 'obsidian', 'notion', 'hwp', 'soffice'];
18
+
19
+ // Window-title hints that a browser tab is actually video (so a browser → MEDIA, not BROWSING).
20
+ const VIDEO_TITLE = /youtube|netflix|twitch|vimeo|prime\s*video|disney\+?|wavve|tving|watcha|\bwatch\b/i;
21
+
22
+ // SHQueryUserNotificationState values.
23
+ const NS_NAME = { 1: 'NOT_PRESENT', 2: 'BUSY', 3: 'D3D_FULLSCREEN', 4: 'PRESENTATION', 5: 'ACCEPTS', 6: 'QUIET_TIME', 7: 'APP' };
24
+
25
+ // Per-class help profile defaults (overridable via config `companion` section).
26
+ export const PROFILES = {
27
+ GAME: { proactive: true, cadenceSec: 30, mode: 'periodic',
28
+ note: 'sample change-rate (poll_change) to split fast-action vs strategy; fast-action = break-gated only, announce the limit once' },
29
+ DEV: { proactive: false, cadenceSec: 0, mode: 'event-error',
30
+ note: 'silent until an error/build failure/stack trace; scaffolding help on request' },
31
+ MEDIA: { proactive: false, cadenceSec: 0, mode: 'silent',
32
+ note: 'on explicit request only; do not talk over it; DRM audio cannot be ducked' },
33
+ BROWSING: { proactive: false, cadenceSec: 0, mode: 'silent',
34
+ note: 'on request: summarize / translate / explain' },
35
+ PRODUCTIVITY: { proactive: false, cadenceSec: 0, mode: 'quiet',
36
+ note: 'on request; optional gentle risk flags' },
37
+ UNKNOWN: { proactive: false, cadenceSec: 0, mode: 'low-intrusion',
38
+ note: 'offer help once, then stay quiet until asked' },
39
+ };
40
+
41
+ const has = (list, p) => list.includes(p);
42
+
43
+ /**
44
+ * Derive activity class + profile from raw signals.
45
+ * @param {object} raw { process, title, notificationState, uiaCount, uiaCapped, fullscreen, ... }
46
+ * @param {object} opts { uiaCanvasMax=5, apps?:{dev,media,browser,productivity}, profiles?:{<CLASS>:{cadenceSec?,proactive?,mode?}} }
47
+ */
48
+ export function classifyActivity(raw = {}, opts = {}) {
49
+ const uiaCanvasMax = opts.uiaCanvasMax ?? 5;
50
+ const apps = opts.apps || {};
51
+ const dev = apps.dev || DEV, media = apps.media || MEDIA, browser = apps.browser || BROWSER, prod = apps.productivity || PRODUCTIVITY;
52
+
53
+ const proc = String(raw.process || '').toLowerCase();
54
+ const title = String(raw.title || '');
55
+ const ns = Number(raw.notificationState || 0);
56
+ const nsName = NS_NAME[ns] || 'UNKNOWN';
57
+ const interruptible = ns === 5; // only ACCEPTS_NOTIFICATIONS is unconditionally clear to speak
58
+ // uiaCount is null when the UIA walk FAILED (≠ "found 0") — so a UIA error never reads as an empty canvas.
59
+ const hasUia = typeof raw.uiaCount === 'number';
60
+ // A capped count means there were MORE elements than the backend cap → never a sparse canvas, even if a large
61
+ // uiaCanvasMax is configured. So canvas requires a real, uncapped, below-threshold count.
62
+ const canvas = hasUia && !raw.uiaCapped && raw.uiaCount < uiaCanvasMax;
63
+
64
+ let cls;
65
+ if (has(dev, proc)) cls = 'DEV';
66
+ else if (has(media, proc)) cls = 'MEDIA';
67
+ else if (has(browser, proc)) cls = VIDEO_TITLE.test(title) ? 'MEDIA' : 'BROWSING';
68
+ else if (has(prod, proc)) cls = 'PRODUCTIVITY';
69
+ else if (canvas && proc) cls = 'GAME'; // screen-canvas app that isn't a known player/browser
70
+ else if (hasUia && raw.uiaCount >= uiaCanvasMax) cls = 'PRODUCTIVITY'; // rich tree, unknown app → reading/productivity
71
+ else cls = 'UNKNOWN';
72
+
73
+ // Profile defaults, with optional per-class overrides from the `companion` config (cadence/proactive/mode).
74
+ const profile = { ...PROFILES[cls], ...((opts.profiles && opts.profiles[cls]) || {}) };
75
+
76
+ return {
77
+ class: cls,
78
+ process: raw.process || '',
79
+ title,
80
+ notificationState: nsName,
81
+ interruptible,
82
+ canvas,
83
+ uiaCount: hasUia ? raw.uiaCount : null,
84
+ fullscreen: !!raw.fullscreen,
85
+ profile,
86
+ // GAME needs a change-rate sample to split fast-action (break-gated) vs strategy (periodic) — the agent
87
+ // takes a couple of poll_change reads and applies the thresholds in docs/adaptive-companion.md.
88
+ needsChangeRate: cls === 'GAME',
89
+ };
90
+ }
91
+
92
+ export default classifyActivity;
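Two details of `classifyActivity` are easy to miss when reading the file: the canvas rule treats a failed or capped UIA walk as "not a canvas", and a per-class profile override replaces only the keys it names. The standalone sketch below restates both so they can be run without the package; `isCanvas` and `GAME_DEFAULTS` are names introduced here for illustration.

```javascript
// Restates the canvas rule from classifyActivity so it runs on its own.
function isCanvas(raw, uiaCanvasMax = 5) {
  const hasUia = typeof raw.uiaCount === 'number'; // null means the UIA walk failed
  return hasUia && !raw.uiaCapped && raw.uiaCount < uiaCanvasMax;
}

console.log(isCanvas({ uiaCount: 1 }));                         // true: near-empty tree
console.log(isCanvas({ uiaCount: null }));                      // false: failed walk is not "empty"
console.log(isCanvas({ uiaCount: 60, uiaCapped: true }, 100));  // false: capped means "more than 60"
console.log(isCanvas({ uiaCount: 5 }));                         // false: cutoff is exclusive

// Profile overrides are merged with object spread, so unnamed keys keep their defaults.
const GAME_DEFAULTS = { proactive: true, cadenceSec: 30, mode: 'periodic' };
const mergedProfile = { ...GAME_DEFAULTS, ...{ cadenceSec: 20 } };
console.log(mergedProfile); // { proactive: true, cadenceSec: 20, mode: 'periodic' }
```

The capped case matters when `uiaCanvasMax` is configured above the backend cap: without the `uiaCapped` check, a rich application whose count was cut off at the cap would be misread as a sparse canvas.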
package/scripts/audio-duck.ps1 ADDED
@@ -0,0 +1,180 @@
1
+ # computer-use — audio ducking helper (pwsh; WASAPI Core Audio via inline C#, NO install).
2
+ #
3
+ # When the companion speaks, briefly lower OTHER apps' audio sessions (game / video / music) so the voice
4
+ # stands out, then restore EXACTLY on completion. Per-app, not master volume — our own voice is excluded.
5
+ #
6
+ # SAFETY: restore is the whole point. Volumes are snapshotted and always restored in a finally block, even if
7
+ # playback throws or the process is asked to stop — a failed restore would leave the user's other apps quiet.
8
+ #
9
+ # Two uses:
10
+ # 1. Dot-source and call [CU.AudioDuck]::Duck(factor, excludePids) -> handle ; [CU.AudioDuck]::Restore(handle).
11
+ # (speak.ps1 / Heami wraps System.Speech this way, excluding its own $PID.)
12
+ # 2. -WavPath <wav>: duck (excluding THIS process), play the WAV via SoundPlayer, restore in finally.
13
+ # (speak-supertonic.mjs spawns this to play its synthesized WAV.)
14
+ #
15
+ # factor is the multiplier applied to other sessions (0.45 = reduce to 45%). 1.0 disables ducking.
16
+ param(
17
+ [string]$WavPath = '',
18
+ [double]$Factor = 0.45,
19
+ [int[]]$ExcludePid = @()
20
+ )
21
+ $ErrorActionPreference = 'Stop'
22
+ try { [Console]::OutputEncoding = [System.Text.UTF8Encoding]::new($false) } catch {}
23
+
24
+ if (-not ('CU.AudioDuck' -as [type])) {
25
+ Add-Type -TypeDefinition @'
26
+ using System;
27
+ using System.Collections.Generic;
28
+ using System.Runtime.InteropServices;
29
+
30
+ namespace CU {
31
+ [ComImport, Guid("BCDE0395-E52F-467C-8E3D-C4579291692E")] class MMDeviceEnumeratorComObject { }
32
+
33
+ enum EDataFlow { eRender = 0, eCapture = 1, eAll = 2 }
34
+ enum ERole { eConsole = 0, eMultimedia = 1, eCommunications = 2 }
35
+
36
+ [ComImport, Guid("A95664D2-9614-4F35-A746-DE8DB63617E6"), InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
37
+ interface IMMDeviceEnumerator {
38
+ [PreserveSig] int EnumAudioEndpoints(EDataFlow dataFlow, int dwStateMask, out IMMDeviceCollection ppDevices);
39
+ [PreserveSig] int GetDefaultAudioEndpoint(EDataFlow dataFlow, ERole role, out IMMDevice ppEndpoint);
40
+ }
41
+
42
+ [ComImport, Guid("0BD7A1BE-7A1A-44DB-8397-CC5392387B5E"), InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
43
+ interface IMMDeviceCollection {
44
+ [PreserveSig] int GetCount(out uint pcDevices);
45
+ [PreserveSig] int Item(uint nDevice, out IMMDevice ppDevice);
46
+ }
47
+
48
+ [ComImport, Guid("D666063F-1587-4E43-81F1-B948E807363F"), InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
49
+ interface IMMDevice {
50
+ [PreserveSig] int Activate(ref Guid iid, int dwClsCtx, IntPtr pActivationParams,
51
+ [MarshalAs(UnmanagedType.IUnknown)] out object ppInterface);
52
+ }
53
+
54
+ [ComImport, Guid("77AA99A0-1BD6-484F-8BC7-2C654C9A9B6F"), InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
55
+ interface IAudioSessionManager2 {
56
+ [PreserveSig] int NotImpl1();
57
+ [PreserveSig] int NotImpl2();
58
+ [PreserveSig] int GetSessionEnumerator(out IAudioSessionEnumerator SessionEnum);
59
+ }
60
+
61
+ [ComImport, Guid("E2F5BB11-0570-40CA-ACDD-3AA01277DEE8"), InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
62
+ interface IAudioSessionEnumerator {
63
+ [PreserveSig] int GetCount(out int SessionCount);
64
+ [PreserveSig] int GetSession(int SessionCount, out IAudioSessionControl Session);
65
+ }
66
+
67
+ [ComImport, Guid("F4B1A599-7266-4319-A8CA-E70ACB11E8CD"), InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
68
+ interface IAudioSessionControl { }
69
+
70
+ [ComImport, Guid("bfb7ff88-7239-4fc9-8fa2-07c950be9c6d"), InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
71
+ interface IAudioSessionControl2 {
72
+ // 9 inherited IAudioSessionControl slots (unused) ...
73
+ [PreserveSig] int N1(); [PreserveSig] int N2(); [PreserveSig] int N3();
74
+ [PreserveSig] int N4(); [PreserveSig] int N5(); [PreserveSig] int N6();
75
+ [PreserveSig] int N7(); [PreserveSig] int N8(); [PreserveSig] int N9();
76
+ // IAudioSessionControl2 own slots:
77
+ [PreserveSig] int GetSessionIdentifier([MarshalAs(UnmanagedType.LPWStr)] out string s);
78
+ [PreserveSig] int GetSessionInstanceIdentifier([MarshalAs(UnmanagedType.LPWStr)] out string s);
79
+ [PreserveSig] int GetProcessId(out uint pid);
80
+ [PreserveSig] int IsSystemSoundsSession();
81
+ }
82
+
83
+ [ComImport, Guid("87CE5498-68D6-44E5-9215-6DA47EF883D8"), InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
84
+ interface ISimpleAudioVolume {
85
+ [PreserveSig] int SetMasterVolume(float level, ref Guid ctx);
86
+ [PreserveSig] int GetMasterVolume(out float level);
87
+ [PreserveSig] int SetMute(bool mute, ref Guid ctx);
88
+ [PreserveSig] int GetMute(out bool mute);
89
+ }
90
+
91
+ public class Handle { internal ISimpleAudioVolume Vol; public float Original; public uint Pid; }
92
+
93
+ public static class AudioDuck {
94
+ static Guid IID_IAudioSessionManager2 = new Guid("77AA99A0-1BD6-484F-8BC7-2C654C9A9B6F");
95
+ static Guid CTX = Guid.Empty;
96
+
97
+ static IAudioSessionEnumerator SessionsForDevice(IMMDevice dev) {
98
+ object o; if (dev.Activate(ref IID_IAudioSessionManager2, 23 /*CLSCTX_ALL*/, IntPtr.Zero, out o) != 0) return null;
99
+ var mgr = (IAudioSessionManager2)o;
100
+ IAudioSessionEnumerator en; if (mgr.GetSessionEnumerator(out en) != 0) return null;
101
+ return en;
102
+ }
103
+
104
+ // Lower every session (except excludePids and system-sounds) to original*factor, across ALL active render
105
+ // devices (not just the default endpoint) so virtual-audio routing (e.g. VoiceMeeter) is still caught.
106
+ // Returns restore handles. DEVICE_STATE_ACTIVE = 0x1.
107
+ public static List<Handle> Duck(double factor, int[] excludePids) {
108
+ var handles = new List<Handle>();
109
+ var deo = (IMMDeviceEnumerator)(new MMDeviceEnumeratorComObject());
110
+ IMMDeviceCollection coll;
111
+ if (deo.EnumAudioEndpoints(EDataFlow.eRender, 0x1, out coll) != 0 || coll == null) return handles;
112
+ uint dcount; if (coll.GetCount(out dcount) != 0) return handles;
113
+ var excl = new HashSet<uint>(); foreach (var p in excludePids) excl.Add((uint)p);
114
+ for (uint di = 0; di < dcount; di++) {
115
+ IMMDevice dev; if (coll.Item(di, out dev) != 0 || dev == null) continue;
116
+ var en = SessionsForDevice(dev); if (en == null) continue;
117
+ int count; if (en.GetCount(out count) != 0) continue;
118
+ for (int i = 0; i < count; i++) {
119
+ IAudioSessionControl ctl; if (en.GetSession(i, out ctl) != 0 || ctl == null) continue;
120
+ try {
121
+ var c2 = (IAudioSessionControl2)ctl;
122
+ if (c2.IsSystemSoundsSession() == 0) continue; // S_OK(0) means IS system sounds -> skip it
123
+ uint pid; if (c2.GetProcessId(out pid) != 0) continue;
124
+ if (excl.Contains(pid)) continue;
125
+ var vol = (ISimpleAudioVolume)ctl;
126
+ float cur; if (vol.GetMasterVolume(out cur) != 0) continue;
127
+ var h = new Handle { Vol = vol, Original = cur, Pid = pid };
128
+ vol.SetMasterVolume((float)(cur * factor), ref CTX);
129
+ handles.Add(h);
130
+ } catch { }
131
+ }
132
+ }
133
+ return handles;
134
+ }
+
+     public static void Restore(List<Handle> handles) {
+       if (handles == null) return;
+       foreach (var h in handles) { try { h.Vol.SetMasterVolume(h.Original, ref CTX); } catch { } }
+     }
+
+     // Inspection helpers (for the isolation test): current live level / snapshotted original per handle.
+     public static float[] Levels(List<Handle> handles) {
+       var a = new float[handles.Count];
+       for (int i = 0; i < handles.Count; i++) { float c = 0; try { handles[i].Vol.GetMasterVolume(out c); } catch { } a[i] = c; }
+       return a;
+     }
+     public static float[] Originals(List<Handle> handles) {
+       var a = new float[handles.Count];
+       for (int i = 0; i < handles.Count; i++) a[i] = handles[i].Original;
+       return a;
+     }
+     public static uint[] Pids(List<Handle> handles) {
+       var a = new uint[handles.Count];
+       for (int i = 0; i < handles.Count; i++) a[i] = handles[i].Pid;
+       return a;
+     }
+   }
+ }
+ '@
+ }
+
+ function Invoke-Duck([double]$factor, [int[]]$exclude) {
+   if ($factor -ge 1.0 -or $factor -lt 0) { return $null }
+   try { return [CU.AudioDuck]::Duck($factor, $exclude) } catch { return $null }
+ }
+ function Restore-Duck($handles) {
+   if ($null -eq $handles) { return }
+   try { [CU.AudioDuck]::Restore($handles) } catch { }
+ }
+
+ # -WavPath mode: duck others (excluding THIS process, which owns the playback), play, restore in finally.
+ if ($WavPath) {
+   $h = Invoke-Duck $Factor @($PID)
+   try {
+     $p = New-Object System.Media.SoundPlayer $WavPath
+     $p.PlaySync()
+   } finally {
+     Restore-Duck $h
+   }
+ }
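The duck, play, restore flow above can be sketched without COM as a minimal JavaScript illustration. Everything here is an assumption made for the example: the `sessions` array is a hypothetical in-memory stand-in for the per-process audio sessions, and `duck`, `restore`, and `withDucking` are illustrative names, not part of the package.

```javascript
// Hypothetical stand-in: each session is { pid, volume, isSystemSounds }.
// The real script reaches sessions through the Windows Core Audio COM interfaces.
function duck(sessions, factor, excludePids) {
  const excluded = new Set(excludePids);
  const handles = [];
  for (const s of sessions) {
    // Skip the system-sounds session and any excluded process, as Duck() does.
    if (s.isSystemSounds || excluded.has(s.pid)) continue;
    // Snapshot the original level before lowering it.
    handles.push({ session: s, original: s.volume, pid: s.pid });
    s.volume = s.volume * factor;
  }
  return handles;
}

function restore(handles) {
  // A null handle list means nothing was ducked.
  for (const h of handles ?? []) h.session.volume = h.original;
}

function withDucking(sessions, factor, ownPid, play) {
  // Mirrors Invoke-Duck: a factor outside [0, 1) means "do not duck".
  const handles = factor >= 1 || factor < 0 ? null : duck(sessions, factor, [ownPid]);
  try {
    return play();
  } finally {
    // Runs even if play() throws, so other apps are never left quiet.
    restore(handles);
  }
}
```

Keeping the restore step in `finally` is the point of the pattern: a failed playback still returns every session to its snapshotted level.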
@@ -0,0 +1,8 @@
+ # computer-use: classify_activity raw-signal adapter over lib.ps1::Get-AxClassifyActivity.
+ # Output = a single JSON blob of raw signals (foreground process/title, notification state, UIA count, fullscreen).
+ # The JS side (activity.mjs) derives the activity class from these. Read-only, no images.
+ param([int]$UiaCap = 60)
+ $ErrorActionPreference = 'Stop'
+ . (Join-Path $PSScriptRoot 'lib.ps1')
+ Initialize-AxEnv
+ Get-AxClassifyActivity -UiaCap $UiaCap | ConvertTo-Json -Depth 4
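The adapter only emits raw signals; the class is derived on the JS side in `activity.mjs`, which is not shown in this part of the diff. As a rough sketch of what such a derivation could look like, assuming the JSON field names emitted by `Get-AxClassifyActivity`: the class names, the rule order, and the `sparseThreshold` parameter below are all hypothetical and are not the package's actual rules.

```javascript
// QUERY_USER_NOTIFICATION_STATE values, as listed in the lib.ps1 comment.
const QUNS = {
  NOT_PRESENT: 1, BUSY: 2, RUNNING_D3D_FULL_SCREEN: 3, PRESENTATION_MODE: 4,
  ACCEPTS_NOTIFICATIONS: 5, QUIET_TIME: 6, APP: 7,
};

// Hypothetical classifier: class names and thresholds are illustrative only.
function classifyActivity(sig, sparseThreshold = 5) {
  // A denylisted foreground window is never classified.
  if (sig.redacted) return 'redacted';
  if (sig.notificationState === QUNS.RUNNING_D3D_FULL_SCREEN) return 'fullscreen-gpu';
  if (sig.notificationState === QUNS.PRESENTATION_MODE) return 'presenting';
  // A failed UIA walk must not be read as an empty canvas.
  if (!sig.uiaOk) return 'unknown';
  // Fills the monitor and exposes almost no controls: likely a game or video.
  if (sig.fullscreen && sig.uiaCount <= sparseThreshold) return 'fullscreen-canvas';
  return 'normal-app';
}
```

Checking `uiaOk` before `uiaCount` follows the distinction the PowerShell side draws between "walked, found N" and "couldn't walk".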
@@ -0,0 +1,65 @@
+ #!/usr/bin/env node
+ // computer-use: Supertonic 3 model downloader (one-time, ~380 MB) for the optional neural TTS engine.
+ //
+ // Downloads Supertone/supertonic-3 (OpenRAIL-M weights, commercial use permitted) into the model cache the
+ // speak path expects. Idempotent: existing non-empty files are skipped, so re-running resumes a partial fetch
+ // file by file.
+ // Usage: node fetch-supertonic.mjs [targetDir] (default: VORTEX_CU_TTS_MODEL_DIR or ~/.vortex/computer-use/supertonic-3)
+ //
+ // The engine code (speak-supertonic.mjs) is adapted from Supertone's MIT Node example; only the weights are
+ // downloaded here, never bundled, keeping the npm package small and the license boundary clean.
+
+ import { createWriteStream, existsSync, statSync, mkdirSync, renameSync, unlinkSync } from 'node:fs';
+ import { Readable } from 'node:stream';
+ import { pipeline } from 'node:stream/promises';
+ import { join, dirname } from 'node:path';
+ import { homedir } from 'node:os';
+
+ const HF = 'https://huggingface.co/Supertone/supertonic-3/resolve/main';
+ const FILES = [
+   'onnx/duration_predictor.onnx',
+   'onnx/text_encoder.onnx',
+   'onnx/vector_estimator.onnx',
+   'onnx/vocoder.onnx',
+   'onnx/tts.json',
+   'onnx/unicode_indexer.json',
+   'config.json',
+   ...['F1', 'F2', 'F3', 'F4', 'F5', 'M1', 'M2', 'M3', 'M4', 'M5'].map((v) => `voice_styles/${v}.json`),
+ ];
+
+ const targetDir = process.argv[2] || process.env.VORTEX_CU_TTS_MODEL_DIR || join(homedir(), '.vortex', 'computer-use', 'supertonic-3');
+
+ async function fetchToFile(url, dest) {
+   const res = await fetch(url, { redirect: 'follow' });
+   if (!res.ok || !res.body) throw new Error(`HTTP ${res.status}`);
+   const tmp = dest + '.part';
+   mkdirSync(dirname(dest), { recursive: true });
+   await pipeline(Readable.fromWeb(res.body), createWriteStream(tmp));
+   renameSync(tmp, dest);
+   return statSync(dest).size;
+ }
+
+ async function main() {
+   console.log(`Supertonic 3 model cache: ${targetDir}`);
+   let downloaded = 0, skipped = 0, bytes = 0;
+   for (const rel of FILES) {
+     const dest = join(targetDir, rel);
+     if (existsSync(dest) && statSync(dest).size > 0) { skipped++; continue; }
+     process.stdout.write(` ↓ ${rel} ... `);
+     try {
+       const t = Date.now();
+       const sz = await fetchToFile(`${HF}/${rel}`, dest);
+       bytes += sz;
+       downloaded++;
+       console.log(`${(sz / 1048576).toFixed(1)} MB (${((Date.now() - t) / 1000).toFixed(1)}s)`);
+     } catch (e) {
+       console.log(`FAILED: ${e.message}`);
+       try { unlinkSync(dest + '.part'); } catch {}
+       console.error(`\nDownload failed for ${rel}. Re-run to resume.`);
+       process.exit(1);
+     }
+   }
+   console.log(`\nDone — ${downloaded} downloaded (${(bytes / 1048576).toFixed(0)} MB), ${skipped} already present.`);
+   console.log('Neural TTS is ready. Set VORTEX_CU_TTS_ENGINE=auto (default) to use it.');
+ }
+
+ main().catch((e) => { console.error(e); process.exit(1); });
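The resume behaviour of the downloader comes from two choices: a file counts as present only when it exists with a non-zero size, and each download is written to a `.part` sibling before being renamed into place. A minimal sketch of the skip decision as a pure function follows. `planDownloads` and its `sizeOf` callback are hypothetical names introduced for illustration; the script itself inlines this check with `existsSync` and `statSync`.

```javascript
const HF = 'https://huggingface.co/Supertone/supertonic-3/resolve/main';

// sizeOf is a hypothetical callback: it returns the on-disk size in bytes of a
// relative path, or 0 when the file is missing.
function planDownloads(files, sizeOf) {
  const todo = [];
  const skipped = [];
  for (const rel of files) {
    if (sizeOf(rel) > 0) {
      // Present and non-empty: treated as already downloaded.
      skipped.push(rel);
    } else {
      // Missing or empty: fetch into "<rel>.part", then rename on success.
      todo.push({ rel, url: `${HF}/${rel}`, tmp: `${rel}.part` });
    }
  }
  return { todo, skipped };
}
```

Because only the final rename creates the destination name, an interrupted transfer never leaves a truncated file that a later run would mistake for a complete one.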
package/scripts/lib.ps1 CHANGED
@@ -26,6 +26,9 @@ public static class AxNative {
    [DllImport("user32.dll")] public static extern int GetWindowThreadProcessId(IntPtr h, out int pid);
    [DllImport("user32.dll")] public static extern int GetWindowTextLength(IntPtr h);
    [DllImport("user32.dll", CharSet=CharSet.Unicode)] public static extern int GetWindowTextW(IntPtr h, StringBuilder s, int max);
+   // "Is it OK to interrupt the user right now?": the global interruptibility gate (1=NOT_PRESENT, 2=BUSY,
+   // 3=RUNNING_D3D_FULL_SCREEN, 4=PRESENTATION_MODE, 5=ACCEPTS_NOTIFICATIONS, 6=QUIET_TIME, 7=APP). Returns HRESULT.
+   [DllImport("shell32.dll")] public static extern int SHQueryUserNotificationState(out int state);
    private delegate bool AxEnumProc(IntPtr h, IntPtr l);
    [DllImport("user32.dll", SetLastError=true)] private static extern bool EnumWindows(AxEnumProc cb, IntPtr l);
    // Enumerate every visible (not minimized, area>0) top-level window as (pid, rect, title), so the denylist checks not just the
@@ -164,6 +167,65 @@ function Get-AxProbe {
  }
 
  # ---------------- capture ----------------
+ # Raw signals for activity classification (the JS side in activity.mjs derives the class). Read-only, fast:
+ # foreground process/title, the interruptibility notification-state, a capped UIA control-view descendant count
+ # (near-empty = GPU canvas i.e. game/video; rich = normal app), and whether the window fills its monitor.
+ function Get-AxClassifyActivity([int]$UiaCap = 60) {
+   $hwnd = [AxNative]::GetForegroundWindow()
+   $procId = 0; [void][AxNative]::GetWindowThreadProcessId($hwnd, [ref]$procId)
+   $title = ''
+   try {
+     $len = [AxNative]::GetWindowTextLength($hwnd)
+     if ($len -gt 0) { $sb = New-Object System.Text.StringBuilder ($len + 2); [void][AxNative]::GetWindowTextW($hwnd, $sb, $sb.Capacity); $title = $sb.ToString() }
+   } catch {}
+   $ns = 0; try { [void][AxNative]::SHQueryUserNotificationState([ref]$ns) } catch {}
+   $fs = $false
+   try {
+     $r = New-Object AxRECT
+     if ([AxNative]::GetWindowRect($hwnd, [ref]$r)) {
+       $scr = [System.Windows.Forms.Screen]::FromHandle($hwnd).Bounds
+       if (($r.Right - $r.Left) -ge $scr.Width -and ($r.Bottom - $r.Top) -ge $scr.Height) { $fs = $true }
+     }
+   } catch {}
+   # Denylist (same control as captures): never leak (or classify) a sensitive foreground window. A non-null
+   # result means denied (incl. fail-closed when a rule is configured but title/pid can't be resolved).
+   $deny = $null
+   try { $deny = Test-AxDenylistElement $title $procId } catch { $deny = @{ reason = 'denylist check failed — fail-closed'; match = '' } }
+   if ($null -ne $deny) {
+     return [ordered]@{
+       redacted = $true; reason = [string]$deny.reason; process = ''; procId = $procId; title = ''
+       hwnd = [int64]$hwnd; notificationState = $ns; uiaCount = $null; uiaOk = $false; uiaCapped = $false; fullscreen = $fs
+     }
+   }
+   $proc = ''
+   try { $proc = (Get-Process -Id $procId -ErrorAction Stop).ProcessName } catch {}
+   # UIA control-view descendant count (capped): near-empty on a GPU canvas (game/video), rich on normal apps.
+   # $uiaOk distinguishes "walked, found N" from "couldn't walk" (so a UIA failure isn't read as an empty canvas).
+   # A single hard $budget bounds BOTH pops and sibling enumeration so a huge/odd tree can't blow up the walk;
+   # the per-call spawnSync timeout (caller) is the backstop against a single hung COM call.
+   $uia = 0; $capped = $false; $uiaOk = $false; $budget = [Math]::Max(16, $UiaCap * 4); $iter = 0
+   try {
+     $root = [System.Windows.Automation.AutomationElement]::FromHandle($hwnd)
+     if ($null -ne $root) {
+       $walker = [System.Windows.Automation.TreeWalker]::ControlViewWalker
+       $stack = New-Object System.Collections.Stack
+       $c = $walker.GetFirstChild($root)
+       while ($null -ne $c -and $iter -lt $budget) { $stack.Push($c); $iter++; $c = $walker.GetNextSibling($c) }
+       while ($stack.Count -gt 0 -and $uia -lt $UiaCap -and $iter -lt $budget) {
+         $el = $stack.Pop(); $uia++
+         $cc = $walker.GetFirstChild($el)
+         while ($null -ne $cc -and $iter -lt $budget) { $stack.Push($cc); $iter++; $cc = $walker.GetNextSibling($cc) }
+       }
+       if ($uia -ge $UiaCap -or $iter -ge $budget) { $capped = $true }
+       $uiaOk = $true
+     }
+   } catch { $uiaOk = $false }
+   return [ordered]@{
+     redacted = $false; process = $proc; procId = $procId; title = $title; hwnd = [int64]$hwnd
+     notificationState = $ns; uiaCount = $(if ($uiaOk) { $uia } else { $null }); uiaOk = $uiaOk; uiaCapped = $capped; fullscreen = $fs
+   }
+ }
+
  function New-AxOutPath([string]$OutDir, $Frame = $null) {
    # Avoid multi-instance / concurrent-capture collisions: guarantee uniqueness with PID + milliseconds + random number.
    $stamp = (Get-Date).ToString('HHmmssfff')