@vortex-os/computer-use 0.2.1 → 0.7.0

@@ -0,0 +1,135 @@
1
+ // @vortex-os/computer-use — noise filter for the watch layer (design §22.1).
2
+ //
3
+ // Problem: video / games / scrolling change every frame, so a raw per-frame change threshold
4
+ // floods the agent with one event per ripple of the SAME activity (each event costs a capture
5
+ // + an LLM look). Measured calibration: a playing video jitters ~2.5–4% frame-to-frame, a real
6
+ // scene cut jumps ~16.8%. We want to ignore the steady jitter and report only meaningful,
7
+ // settled changes — without ever going completely silent on a screen that keeps moving.
8
+ //
9
+ // Design: debounce + cooldown COMBINED, with hysteresis. Neither alone works:
10
+ // - debounce alone ("report once it goes quiet") starves on video/games that never go quiet.
11
+ // - cooldown alone ("report at most once per N s") fires on a half-drawn transition frame (bad quality).
12
+ // So they split roles:
13
+ // - debounce = QUALITY — emit the frame AFTER motion settles (a clean, stable shot).
14
+ // - cooldown = FREQUENCY — never emit more than once per cooldownMs (suppress the ripples).
15
+ // - maxWait = ANTI-STARVATION — once WOKEN, if motion never settles, emit anyway every maxWaitMs, so a
16
+ // screen with sustained ABOVE-threshold motion still yields periodic snapshots (not silence).
17
+ // - hysteresis = two thresholds so ambient jitter never even wakes the filter (no flapping).
18
+ // Note on the silence boundary: a screen that changes only BELOW activityThreshold (e.g. steady video jitter
19
+ // at 2.5-4%) is intentionally treated as "nothing happening" and produces no events — that IS the goal of
20
+ // ignoring the ripple, and it means maxWait only applies once a real (above-threshold) change has woken the
21
+ // filter. Lower activityThreshold if you want fainter sustained motion to count as activity.
22
+ //
23
+ // This module is pure (no I/O, no timers of its own — the caller feeds it samples with an explicit
24
+ // `now`), so the state machine is deterministically unit-testable from a synthetic change sequence.
25
+
26
+ // Calibrated against the measured numbers above. All tunable per start_watch.
27
+ export const FILTER_DEFAULTS = {
28
+ // % frame-to-frame change to WAKE the filter (enter an "active" episode). Set well above the
29
+ // measured video jitter ceiling (~4%) so steady playback never registers as a new event, yet
30
+ // below a scene cut (~16.8%) so real transitions do. The whole-frame metric dilutes tiny local
31
+ // changes (a clock, a cursor), so target a meaningful region/window for small UI events.
32
+ activityThreshold: 8,
33
+ // % below which a frame counts as "still". Hysteresis: must be < activityThreshold so a woken
34
+ // episode keeps tracking motion down to a lower floor before it's declared settled.
35
+ quietThreshold: 5,
36
+ // Motion must stay below quietThreshold this long before we emit a "settled" event (quality gate).
37
+ // Keep it a small multiple of the poll interval so a couple of quiet polls confirm the settle.
38
+ debounceQuietMs: 900,
39
+ // Minimum gap between emitted events (frequency cap). Suppresses the ripples of one activity.
40
+ cooldownMs: 6000,
41
+ // If an episode never settles (sustained motion: video, ongoing battle), emit anyway once it has been
42
+ // active this long. Cooldown then applies before re-waking, so snapshots recur about every cooldownMs + maxWaitMs.
43
+ maxWaitMs: 8000,
44
+ };
45
+
46
+ const clampNum = (v, lo, hi, dflt) => {
47
+ const n = Number(v);
48
+ if (!Number.isFinite(n)) return dflt;
49
+ return Math.min(hi, Math.max(lo, n));
50
+ };
51
+
52
+ // Sanitize caller-supplied options into a coherent, safe config. Enforces the hysteresis invariant
53
+ // (quietThreshold < activityThreshold) so a misconfiguration can't make the filter flap or never wake.
54
+ export function resolveFilterConfig(opts = {}) {
55
+ const d = FILTER_DEFAULTS;
56
+ let activityThreshold = clampNum(opts.activityThreshold, 0.1, 100, d.activityThreshold);
57
+ let quietThreshold = clampNum(opts.quietThreshold, 0, 100, d.quietThreshold);
58
+ // Keep the quiet floor strictly below the wake threshold (hysteresis). If a caller inverts them,
59
+ // pull the quiet floor down to just under the wake threshold rather than silently swapping intent.
60
+ if (quietThreshold >= activityThreshold) quietThreshold = Math.max(0, activityThreshold - 1);
61
+ return {
62
+ activityThreshold,
63
+ quietThreshold,
64
+ debounceQuietMs: clampNum(opts.debounceQuietMs, 0, 60000, d.debounceQuietMs),
65
+ cooldownMs: clampNum(opts.cooldownMs, 0, 600000, d.cooldownMs),
66
+ maxWaitMs: clampNum(opts.maxWaitMs, 100, 600000, d.maxWaitMs),
67
+ };
68
+ }
69
+
70
+ export class NoiseFilter {
71
+ constructor(opts = {}) {
72
+ this.cfg = resolveFilterConfig(opts);
73
+ this.phase = 'idle'; // idle | active | cooldown
74
+ this.activeStart = 0; // ts the current episode woke
75
+ this.lastMotionTs = 0; // last ts change was >= quietThreshold (i.e. "still moving")
76
+ this.lastEmitTs = -Infinity; // ts of the last emitted event
77
+ this.peakPct = 0; // strongest change seen during the current episode
78
+ this.samples = 0; // total samples fed (diagnostics)
79
+ }
80
+
81
+ // Feed one polled sample. Returns an emit descriptor { reason, peakPct, activeMs } when this sample
82
+ // triggers an event, otherwise null. `now` is the caller's clock (ms); `baseline` true means the
83
+ // change metric is not meaningful for this sample (fresh baseline / lost watch state) — reset the episode.
84
+ push({ changePct, now, baseline = false }) {
85
+ this.samples++;
86
+ const c = Number.isFinite(Number(changePct)) ? Number(changePct) : 0;
87
+ const cfg = this.cfg;
88
+
89
+ if (baseline) {
90
+ // A (re)baseline carries no comparable diff: abandon any in-progress episode, but keep a running cooldown so the frequency cap holds.
91
+ if (this.phase === 'active') this.phase = 'idle';
92
+ this.peakPct = 0;
93
+ return null;
94
+ }
95
+
96
+ // Re-arm after cooldown expires, then evaluate this same sample as idle (so a sample that arrives
97
+ // already above the wake threshold starts the next episode immediately, no wasted poll).
98
+ if (this.phase === 'cooldown') {
99
+ if (now - this.lastEmitTs >= cfg.cooldownMs) this.phase = 'idle';
100
+ else return null;
101
+ }
102
+
103
+ if (this.phase === 'idle') {
104
+ if (c >= cfg.activityThreshold) {
105
+ this.phase = 'active';
106
+ this.activeStart = now;
107
+ this.lastMotionTs = now;
108
+ this.peakPct = c;
109
+ }
110
+ return null;
111
+ }
112
+
113
+ // phase === 'active'
114
+ if (c > this.peakPct) this.peakPct = c;
115
+ if (c >= cfg.quietThreshold) this.lastMotionTs = now; // still moving — push the settle clock back
116
+ const quietFor = now - this.lastMotionTs;
117
+ const activeFor = now - this.activeStart;
118
+ if (quietFor >= cfg.debounceQuietMs) return this._emit('settled', now, activeFor);
119
+ if (activeFor >= cfg.maxWaitMs) return this._emit('maxwait', now, activeFor);
120
+ return null;
121
+ }
122
+
123
+ _emit(reason, now, activeMs) {
124
+ const peakPct = this.peakPct;
125
+ this.lastEmitTs = now;
126
+ this.phase = 'cooldown';
127
+ this.peakPct = 0;
128
+ return { reason, peakPct, activeMs };
129
+ }
130
+
131
+ // Snapshot for status reporting / tests.
132
+ get status() {
133
+ return { phase: this.phase, samples: this.samples, peakPct: this.peakPct };
134
+ }
135
+ }
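The interaction of debounce, cooldown and maxWait described in the header comment is easiest to see by replaying a synthetic change sequence. The block below is a condensed restatement of the state machine for illustration only, not the module itself: `createFilter` and `replay` are hypothetical names, baseline handling and option clamping are omitted, and a 500 ms poll interval is assumed.

```javascript
// Illustrative sketch only: a condensed restatement of the debounce + cooldown + maxWait logic.
// createFilter and replay are hypothetical names; defaults mirror FILTER_DEFAULTS.
function createFilter({ activityThreshold = 8, quietThreshold = 5, debounceQuietMs = 900,
  cooldownMs = 6000, maxWaitMs = 8000 } = {}) {
  let phase = 'idle';
  let activeStart = 0;
  let lastMotionTs = 0;
  let lastEmitTs = -Infinity;
  let peakPct = 0;

  const emit = (reason, now) => {
    const ev = { reason, peakPct, activeMs: now - activeStart };
    lastEmitTs = now;
    phase = 'cooldown';
    peakPct = 0;
    return ev;
  };

  return function push(changePct, now) {
    if (phase === 'cooldown') {
      if (now - lastEmitTs < cooldownMs) return null; // frequency cap still running
      phase = 'idle';                                 // re-arm and evaluate this sample as idle
    }
    if (phase === 'idle') {
      if (changePct >= activityThreshold) {           // wake: a real change, not ambient jitter
        phase = 'active';
        activeStart = now;
        lastMotionTs = now;
        peakPct = changePct;
      }
      return null;
    }
    // phase === 'active'
    if (changePct > peakPct) peakPct = changePct;
    if (changePct >= quietThreshold) lastMotionTs = now;               // still moving
    if (now - lastMotionTs >= debounceQuietMs) return emit('settled', now);
    if (now - activeStart >= maxWaitMs) return emit('maxwait', now);
    return null;
  };
}

// Feed one sample per poll and collect the emitted events with their timestamps.
function replay(samples, pollMs = 500, cfg = {}) {
  const push = createFilter(cfg);
  const events = [];
  samples.forEach((pct, i) => {
    const ev = push(pct, i * pollMs);
    if (ev) events.push({ at: i * pollMs, ...ev });
  });
  return events;
}

// Steady video jitter (3%), one scene cut (16.8%), then jitter again: one settled event.
const sceneCut = replay([3, 3, 16.8, 3, 3, 3, 3]);

// Sustained above-threshold motion (12%) for 45 polls: maxwait snapshots only.
const sustained = replay(new Array(45).fill(12));

console.log(sceneCut, sustained.map((e) => e.at));
```

With the default values, the sustained case emits at 8000 ms and again at 22000 ms, a gap of cooldownMs + maxWaitMs, because the filter only re-wakes once the cooldown has expired.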
@@ -0,0 +1,92 @@
1
+ # computer-use — OCR helper (Windows PowerShell 5.1 ONLY; invoked via powershell.exe).
2
+ #
3
+ # Why a separate helper: the resident worker runs under pwsh 7 (.NET Core), which dropped the WinRT
4
+ # projection, so Windows.Media.Ocr (built-in, offline, NO install) cannot be loaded there. pwsh 7
5
+ # delegates the OCR step to this script run via the in-box `powershell.exe` (Windows PowerShell 5.1),
6
+ # where the WinRT types load. This powers the reflex path: read on-screen text WITHOUT a cloud LLM
7
+ # round-trip, so an alert can be spoken in well under a second.
8
+ #
9
+ # Contract: input -ImagePath <png>; output exactly ONE JSON line on stdout: {ok, text, lang, ms, truncated} on
10
+ # success, {ok:false, error} on failure (exit 1). Nothing else is written to stdout (clean to parse).
11
+ # The PARENT must invoke this with an argument array (never a shell string): powershell.exe -NoProfile
12
+ # -NonInteractive -ExecutionPolicy Bypass -File <abs ocr.ps1> -ImagePath <abs temp png>, and must impose
13
+ # its OWN hard spawn timeout + temp-PNG cleanup + non-JSON-stdout handling (codex r1).
14
+ #
15
+ # SECURITY: the recognized text is UNTRUSTED input (it is screen content). It is for narration/alerting
16
+ # ONLY — it must never be turned into an action or an instruction the agent follows, and a speaking
17
+ # caller MUST add a provenance prefix + shape/cap it before TTS (design §14/§16/§24, codex r1 HIGH).
18
+ # Defence-in-depth here: only OCR files under the allowed temp root, bound every WinRT await with a
19
+ # timeout, and pre-shape/cap the emitted text so a hung or hostile crop can't stall or flood the caller.
20
+ param(
21
+ [Parameter(Mandatory = $true)][string]$ImagePath,
22
+ [string]$Lang = '', # optional BCP-47 tag (e.g. 'ko', 'en-US'); else user-profile default
23
+ [int]$TimeoutMs = 4000, # per-await hard cap so a stuck WinRT call can't block forever
24
+ [string]$AllowRoot = '', # only OCR files under this root (default: the user TEMP dir)
25
+ [int]$MaxChars = 600 # cap emitted text (a long spoken utterance is itself a risk)
26
+ )
27
+ $ErrorActionPreference = 'Stop'
28
+ try { [Console]::OutputEncoding = [System.Text.UTF8Encoding]::new($false) } catch {}
29
+ function Emit($o) { [Console]::Out.WriteLine(($o | ConvertTo-Json -Compress)) }
30
+
31
+ try {
32
+ # Path policy: resolve to a full path and require it under the allowed root (worker-owned temp crops only).
33
+ # The real guarantee is caller-side (it passes a denylist-gated, worker-created temp PNG); this is defence-in-depth.
34
+ $root = if ($AllowRoot) { $AllowRoot } else { $env:TEMP }
35
+ $rootFull = [System.IO.Path]::GetFullPath($root).TrimEnd('\') + '\'
36
+ $full = [System.IO.Path]::GetFullPath($ImagePath)
37
+ if (-not $full.StartsWith($rootFull, [System.StringComparison]::OrdinalIgnoreCase)) {
38
+ Emit @{ ok = $false; error = 'image path is outside the allowed temp root' }; exit 1
39
+ }
40
+ if (-not (Test-Path -LiteralPath $full -PathType Leaf)) { Emit @{ ok = $false; error = 'image not found' }; exit 1 }
41
+
42
+ $null = [Windows.Media.Ocr.OcrEngine, Windows.Foundation, ContentType = WindowsRuntime]
43
+ Add-Type -AssemblyName System.Runtime.WindowsRuntime
44
+ # The single-arg generic AsTask projection (IAsyncOperation<T> -> Task<T>) so we can synchronously await WinRT calls.
45
+ $asTask = ([System.WindowsRuntimeSystemExtensions].GetMethods() | Where-Object {
46
+ $_.Name -eq 'AsTask' -and $_.GetParameters().Count -eq 1 -and $_.GetParameters()[0].ParameterType.Name -eq 'IAsyncOperation`1'
47
+ })[0]
48
+ if (-not $asTask) { Emit @{ ok = $false; error = 'WinRT AsTask projection unavailable' }; exit 1 }
49
+ # Bound every await: if a WinRT call hangs, .Wait(timeout) returns false and we fail cleanly instead of blocking.
50
+ function Await($op, $t) {
51
+ $m = $asTask.MakeGenericMethod($t); $task = $m.Invoke($null, @($op))
52
+ if (-not $task.Wait($TimeoutMs)) { throw "OCR step timed out after ${TimeoutMs}ms" }
53
+ $task.Result
54
+ }
55
+
56
+ # Engine: a specific recognizer language if requested and installed, otherwise the user-profile default.
57
+ $eng = $null
58
+ if ($Lang) {
59
+ try {
60
+ $L = New-Object Windows.Globalization.Language $Lang
61
+ if ([Windows.Media.Ocr.OcrEngine]::IsLanguageSupported($L)) { $eng = [Windows.Media.Ocr.OcrEngine]::TryCreateFromLanguage($L) }
62
+ } catch {}
63
+ }
64
+ if (-not $eng) { $eng = [Windows.Media.Ocr.OcrEngine]::TryCreateFromUserProfileLanguages() }
65
+ if (-not $eng) { Emit @{ ok = $false; error = 'no OCR recognizer language installed on this machine' }; exit 1 }
66
+
67
+ $sw = [System.Diagnostics.Stopwatch]::StartNew()
68
+ $file = Await ([Windows.Storage.StorageFile, Windows.Storage, ContentType = WindowsRuntime]::GetFileFromPathAsync($full)) ([Windows.Storage.StorageFile])
69
+ $stream = Await ($file.OpenAsync([Windows.Storage.FileAccessMode]::Read)) ([Windows.Storage.Streams.IRandomAccessStream])
70
+ $decoder = Await ([Windows.Graphics.Imaging.BitmapDecoder, Windows.Graphics, ContentType = WindowsRuntime]::CreateAsync($stream)) ([Windows.Graphics.Imaging.BitmapDecoder])
71
+ $soft = Await ($decoder.GetSoftwareBitmapAsync()) ([Windows.Graphics.Imaging.SoftwareBitmap])
72
+ # Windows OCR is picky about pixel format on some decoders — normalize to Bgra8 for robustness (codex r1 low/med).
73
+ if ($soft.BitmapPixelFormat -ne [Windows.Graphics.Imaging.BitmapPixelFormat]::Bgra8) {
74
+ $soft = [Windows.Graphics.Imaging.SoftwareBitmap]::Convert($soft, [Windows.Graphics.Imaging.BitmapPixelFormat]::Bgra8, [Windows.Graphics.Imaging.BitmapAlphaMode]::Premultiplied)
75
+ }
76
+ $res = Await ($eng.RecognizeAsync($soft)) ([Windows.Media.Ocr.OcrResult])
77
+ $sw.Stop()
78
+
79
+ # Pre-shape the recognized text defensively: strip control/format chars (incl. bidi marks), collapse
80
+ # whitespace, cap length. The SPEAKING caller still adds a provenance prefix + full sanitization/budget.
81
+ $txt = [string]$res.Text
82
+ $txt = ($txt -replace '\p{C}', ' ') -replace '\s+', ' '
83
+ $txt = $txt.Trim()
84
+ $truncated = $false
85
+ if ($txt.Length -gt $MaxChars) { $txt = $txt.Substring(0, $MaxChars); $truncated = $true }
86
+
87
+ Emit @{ ok = $true; text = $txt; lang = $eng.RecognizerLanguage.LanguageTag; ms = [int]$sw.Elapsed.TotalMilliseconds; truncated = $truncated }
88
+ } catch {
89
+ # Keep the reason generic: don't surface raw paths / exception internals into cloud-visible agent context (codex r1 low).
90
+ Emit @{ ok = $false; error = 'ocr failed' }
91
+ exit 1
92
+ }
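The header comment above puts three duties on the parent process: pass an argument array, impose its own spawn timeout, and tolerate non-JSON stdout. A minimal sketch of what that caller side might look like follows. `buildOcrArgs`, `parseHelperLine` and `runOcr` are hypothetical names, the 15000 ms spawn timeout is an assumed value, and temp-PNG creation and cleanup are left to the surrounding code.

```javascript
// Illustrative sketch only: hypothetical caller-side wrapper, not the package's actual code.

// Build the argument ARRAY (never a shell string) in the shape the helper's contract describes.
function buildOcrArgs(scriptPath, imagePath, { lang = '', timeoutMs = 4000, maxChars = 600 } = {}) {
  const args = [
    '-NoProfile', '-NonInteractive', '-ExecutionPolicy', 'Bypass',
    '-File', scriptPath,
    '-ImagePath', imagePath,
    '-TimeoutMs', String(timeoutMs),
    '-MaxChars', String(maxChars),
  ];
  if (lang) args.push('-Lang', lang);
  return args;
}

// Accept exactly one JSON object line with a boolean `ok`; anything else counts as a failure.
function parseHelperLine(stdout) {
  const failure = { ok: false, error: 'unexpected helper output' };
  const lines = String(stdout ?? '').split(/\r?\n/).filter((l) => l.trim());
  if (lines.length !== 1) return failure;
  try {
    const o = JSON.parse(lines[0]);
    if (o && typeof o === 'object' && typeof o.ok === 'boolean') return o;
  } catch {
    // fall through to the generic failure
  }
  return failure;
}

// Spawn the helper with a hard timeout. The promise always resolves to an {ok, ...} object.
async function runOcr(scriptPath, imagePath, opts = {}, spawnTimeoutMs = 15000) {
  const { execFile } = await import('node:child_process');
  return new Promise((resolve) => {
    execFile(
      'powershell.exe',
      buildOcrArgs(scriptPath, imagePath, opts),
      { timeout: spawnTimeoutMs, windowsHide: true, encoding: 'utf8' },
      // On exit 1 the helper still prints its {ok:false,...} line, so stdout is parsed either way.
      (_err, stdout) => resolve(parseHelperLine(stdout)),
    );
  });
}
```

Whatever `runOcr` returns in `text` is still untrusted screen content, so a speaking caller would shape and prefix it before use.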
@@ -0,0 +1,296 @@
1
+ // computer-use — Supertonic TTS speaker helper (Node + onnxruntime-node; SEPARATE INSTALL, Windows-first).
2
+ //
3
+ // Speaks ONE already-finalized utterance and exits — the drop-in higher-quality alternative to speak.ps1
4
+ // (System.Speech/Heami). The CALLER (Node reflex path in mcp-stdio.mjs) owns the security: provenance
5
+ // prefix, sanitization, and the speech budget / no-overlap. This helper just renders audio, as its own
6
+ // short-lived process, so it never blocks the resident worker. Engine selection + Heami fallback live in
7
+ // the caller; this script assumes models are present (the caller probes first).
8
+ //
9
+ // Contract (mirrors speak.ps1): --text <utt> [--voice F1] [--lang ko] [--model-dir <dir>] [--to-wav <path>]
10
+ // [--earcon] [--speed 1.05] [--steps 8] [--max-chars 600]. One JSON line on stdout
11
+ // {ok, voice, chars, ms} (or {ok:false, error}, exit 1). --to-wav renders to a file (silent, for tests).
12
+ //
13
+ // Inference logic is adapted from Supertone's official Node example (supertone-inc/supertonic, nodejs/helper.js,
14
+ // MIT). Model weights (Supertone/supertonic-3) are OpenRAIL-M and are downloaded separately (fetch-supertonic.mjs),
15
+ // never bundled. onnxruntime-node is an optionalDependency.
16
+
17
+ import fs from 'node:fs';
18
+ import os from 'node:os';
19
+ import path from 'node:path';
20
+ import { spawn } from 'node:child_process';
21
+ import { fileURLToPath } from 'node:url';
22
+
23
+ const __dirname = path.dirname(fileURLToPath(import.meta.url));
24
+
25
+ function emit(o) { process.stdout.write(JSON.stringify(o) + '\n'); }
26
+
27
+ // ── arg parse ──
28
+ function parseArgs(argv) {
29
+ const a = { text: '', voice: 'F1', lang: 'ko', modelDir: '', toWav: '', earcon: false, speed: 1.05, steps: 8, maxChars: 600 };
30
+ for (let i = 2; i < argv.length; i++) {
31
+ const k = argv[i];
32
+ if (k === '--text') a.text = argv[++i];
33
+ else if (k === '--voice') a.voice = argv[++i];
34
+ else if (k === '--lang') a.lang = argv[++i];
35
+ else if (k === '--model-dir') a.modelDir = argv[++i];
36
+ else if (k === '--to-wav') a.toWav = argv[++i];
37
+ else if (k === '--earcon') a.earcon = true;
38
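+ else if (k === '--speed') { const v = parseFloat(argv[++i]); if (Number.isFinite(v) && v > 0) a.speed = v; } // keep the default on NaN / non-positive input
39
+ else if (k === '--steps') { const v = parseInt(argv[++i], 10); if (Number.isFinite(v) && v > 0) a.steps = v; }
40
+ else if (k === '--max-chars') { const v = parseInt(argv[++i], 10); if (Number.isFinite(v) && v > 0) a.maxChars = v; }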
41
+ }
42
+ return a;
43
+ }
44
+
45
+ // Default model dir: env override, else a per-user cache the fetch script writes to.
46
+ function resolveModelDir(arg) {
47
+ if (arg) return arg;
48
+ if (process.env.VORTEX_CU_TTS_MODEL_DIR) return process.env.VORTEX_CU_TTS_MODEL_DIR;
49
+ return path.join(os.homedir(), '.vortex', 'computer-use', 'supertonic-3');
50
+ }
51
+
52
+ const AVAILABLE_LANGS = ['en', 'ko', 'ja', 'ar', 'bg', 'cs', 'da', 'de', 'el', 'es', 'et', 'fi', 'fr', 'hi', 'hr', 'hu', 'id', 'it', 'lt', 'lv', 'nl', 'pl', 'pt', 'ro', 'ru', 'sk', 'sl', 'sv', 'tr', 'uk', 'vi', 'na'];
53
+
54
+ // ── text preprocessing (port of UnicodeProcessor) ──
55
+ class UnicodeProcessor {
56
+ constructor(indexerPath) { this.indexer = JSON.parse(fs.readFileSync(indexerPath, 'utf8')); }
57
+ _pre(text, lang) {
58
+ text = text.normalize('NFKD');
59
+ const emoji = /[\u{1F600}-\u{1F64F}\u{1F300}-\u{1F5FF}\u{1F680}-\u{1F6FF}\u{1F700}-\u{1F77F}\u{1F780}-\u{1F7FF}\u{1F800}-\u{1F8FF}\u{1F900}-\u{1F9FF}\u{1FA00}-\u{1FA6F}\u{1FA70}-\u{1FAFF}\u{2600}-\u{26FF}\u{2700}-\u{27BF}\u{1F1E6}-\u{1F1FF}]+/gu;
60
+ text = text.replace(emoji, '');
61
+ const rep = { '–': '-', '‑': '-', '—': '-', '_': ' ', '“': '"', '”': '"', '‘': "'", '’': "'", '´': "'", '`': "'", '[': ' ', ']': ' ', '|': ' ', '/': ' ', '#': ' ', '→': ' ', '←': ' ' };
62
+ for (const [k, v] of Object.entries(rep)) text = text.replaceAll(k, v);
63
+ text = text.replace(/[♥☆♡©\\]/g, '');
64
+ const expr = { '@': ' at ', 'e.g.,': 'for example, ', 'i.e.,': 'that is, ' };
65
+ for (const [k, v] of Object.entries(expr)) text = text.replaceAll(k, v);
66
+ text = text.replace(/ ,/g, ',').replace(/ \./g, '.').replace(/ !/g, '!').replace(/ \?/g, '?').replace(/ ;/g, ';').replace(/ :/g, ':').replace(/ '/g, "'");
67
+ while (text.includes('""')) text = text.replace('""', '"');
68
+ while (text.includes("''")) text = text.replace("''", "'");
69
+ text = text.replace(/\s+/g, ' ').trim();
70
+ if (!/[.!?;:,'"')\]}…。」』】〉》›»]$/.test(text)) text += '.';
71
+ if (!AVAILABLE_LANGS.includes(lang)) throw new Error(`invalid lang: ${lang}`);
72
+ return `<${lang}>` + text + `</${lang}>`;
73
+ }
74
+ call(textList, langList) {
75
+ const processed = textList.map((t, i) => this._pre(t, langList[i]));
76
+ const lengths = processed.map((t) => t.length);
77
+ const maxLen = Math.max(...lengths);
78
+ const textIds = [];
79
+ for (let i = 0; i < processed.length; i++) {
80
+ const row = new Array(maxLen).fill(0);
81
+ const vals = Array.from(processed[i]).map((c) => c.charCodeAt(0));
82
+ for (let j = 0; j < vals.length; j++) row[j] = this.indexer[vals[j]];
83
+ textIds.push(row);
84
+ }
85
+ return { textIds, textMask: lengthToMask(lengths) };
86
+ }
87
+ }
88
+
89
+ function lengthToMask(lengths, maxLen = null) {
90
+ maxLen = maxLen || Math.max(...lengths);
91
+ return lengths.map((len) => [Array.from({ length: maxLen }, (_, j) => (j < len ? 1.0 : 0.0))]);
92
+ }
93
+ function getLatentMask(wavLengths, baseChunkSize, ccf) {
94
+ const sz = baseChunkSize * ccf;
95
+ return lengthToMask(wavLengths.map((len) => Math.floor((len + sz - 1) / sz)));
96
+ }
97
+
98
+ let ort; // lazy so a missing optionalDependency yields a clean JSON error, not an import crash
99
+ function tensorF32(array, dims) { return new ort.Tensor('float32', Float32Array.from(array.flat(Infinity)), dims); }
100
+ function tensorI64(array, dims) { return new ort.Tensor('int64', BigInt64Array.from(array.flat(Infinity).map((x) => BigInt(x))), dims); }
101
+
102
+ class TextToSpeech {
103
+ constructor(cfgs, tp, dp, te, ve, vo) {
104
+ this.cfgs = cfgs; this.tp = tp; this.dp = dp; this.te = te; this.ve = ve; this.vo = vo;
105
+ this.sampleRate = cfgs.ae.sample_rate; this.baseChunk = cfgs.ae.base_chunk_size;
106
+ this.ccf = cfgs.ttl.chunk_compress_factor; this.ldim = cfgs.ttl.latent_dim;
107
+ }
108
+ _sampleNoisy(duration) {
109
+ const wavLenMax = Math.max(...duration) * this.sampleRate;
110
+ const wavLengths = duration.map((d) => Math.floor(d * this.sampleRate));
111
+ const chunkSize = this.baseChunk * this.ccf;
112
+ const latentLen = Math.floor((wavLenMax + chunkSize - 1) / chunkSize);
113
+ const latentDim = this.ldim * this.ccf;
114
+ const noisy = [];
115
+ for (let b = 0; b < duration.length; b++) {
116
+ const batch = [];
117
+ for (let d = 0; d < latentDim; d++) {
118
+ const row = [];
119
+ for (let t = 0; t < latentLen; t++) {
120
+ const u1 = Math.max(1e-10, Math.random()), u2 = Math.random();
121
+ row.push(Math.sqrt(-2.0 * Math.log(u1)) * Math.cos(2.0 * Math.PI * u2));
122
+ }
123
+ batch.push(row);
124
+ }
125
+ noisy.push(batch);
126
+ }
127
+ const mask = getLatentMask(wavLengths, this.baseChunk, this.ccf);
128
+ for (let b = 0; b < noisy.length; b++) for (let d = 0; d < noisy[b].length; d++) for (let t = 0; t < noisy[b][d].length; t++) noisy[b][d][t] *= mask[b][0][t];
129
+ return { noisy, mask };
130
+ }
131
+ async _infer(textList, langList, style, totalStep, speed) {
132
+ const bsz = textList.length;
133
+ const { textIds, textMask } = this.tp.call(textList, langList);
134
+ const idsShape = [bsz, textIds[0].length];
135
+ const maskTensor = tensorF32(textMask, [bsz, 1, textMask[0][0].length]);
136
+ const dpr = await this.dp.run({ text_ids: tensorI64(textIds, idsShape), style_dp: style.dp, text_mask: maskTensor });
137
+ const dur = Array.from(dpr.duration.data).map((d) => d / speed);
138
+ const ter = await this.te.run({ text_ids: tensorI64(textIds, idsShape), style_ttl: style.ttl, text_mask: maskTensor });
139
+ const textEmb = ter.text_emb;
140
+ let { noisy, mask } = this._sampleNoisy(dur);
141
+ const latShape = [bsz, noisy[0].length, noisy[0][0].length];
142
+ const latMaskTensor = tensorF32(mask, [bsz, 1, mask[0][0].length]);
143
+ const stepTensor = tensorF32(new Array(bsz).fill(totalStep), [bsz]);
144
+ for (let step = 0; step < totalStep; step++) {
145
+ const r = await this.ve.run({
146
+ noisy_latent: tensorF32(noisy, latShape), text_emb: textEmb, style_ttl: style.ttl,
147
+ text_mask: maskTensor, latent_mask: latMaskTensor, total_step: stepTensor,
148
+ current_step: tensorF32(new Array(bsz).fill(step), [bsz]),
149
+ });
150
+ const den = Array.from(r.denoised_latent.data);
151
+ let idx = 0;
152
+ for (let b = 0; b < noisy.length; b++) for (let d = 0; d < noisy[b].length; d++) for (let t = 0; t < noisy[b][d].length; t++) noisy[b][d][t] = den[idx++];
153
+ }
154
+ const vr = await this.vo.run({ latent: tensorF32(noisy, latShape) });
155
+ return { wav: Array.from(vr.wav_tts.data), duration: dur };
156
+ }
157
+ async call(text, lang, style, totalStep, speed, silence = 0.3) {
158
+ const maxLen = (lang === 'ko' || lang === 'ja') ? 120 : 300;
159
+ const chunks = chunkText(text, maxLen);
160
+ let wavCat = null, durCat = 0;
161
+ for (const chunk of chunks) {
162
+ const { wav, duration } = await this._infer([chunk], [lang], style, totalStep, speed);
163
+ if (wavCat === null) { wavCat = wav; durCat = duration[0]; }
164
+ else { wavCat = [...wavCat, ...new Array(Math.floor(silence * this.sampleRate)).fill(0), ...wav]; durCat += duration[0] + silence; }
165
+ }
166
+ return { wav: wavCat, duration: [durCat] };
167
+ }
168
+ }
169
+
170
+ function chunkText(text, maxLen) {
171
+ const paras = text.trim().split(/\n\s*\n+/).filter((p) => p.trim());
172
+ const chunks = [];
173
+ for (let para of paras) {
174
+ para = para.trim();
175
+ if (!para) continue;
176
+ const sentences = para.split(/(?<=[.!?])\s+/);
177
+ let cur = '';
178
+ for (const s of sentences) {
179
+ if (cur.length + s.length + 1 <= maxLen) cur += (cur ? ' ' : '') + s;
180
+ else { if (cur) chunks.push(cur.trim()); cur = s; }
181
+ }
182
+ if (cur) chunks.push(cur.trim());
183
+ }
184
+ return chunks.length ? chunks : [text.trim()];
185
+ }
186
+
187
+ function loadVoiceStyle(p) {
188
+ const vs = JSON.parse(fs.readFileSync(p, 'utf8'));
189
+ const td = vs.style_ttl.dims, dd = vs.style_dp.dims;
190
+ const ttl = new ort.Tensor('float32', Float32Array.from(vs.style_ttl.data.flat(Infinity)), [1, td[1], td[2]]);
191
+ const dp = new ort.Tensor('float32', Float32Array.from(vs.style_dp.data.flat(Infinity)), [1, dd[1], dd[2]]);
192
+ return { ttl, dp };
193
+ }
194
+
195
+ async function loadTTS(modelDir) {
196
+ const onnx = path.join(modelDir, 'onnx');
197
+ const cfgs = JSON.parse(fs.readFileSync(path.join(onnx, 'tts.json'), 'utf8'));
198
+ const opts = {};
199
+ const [dp, te, ve, vo] = await Promise.all([
200
+ ort.InferenceSession.create(path.join(onnx, 'duration_predictor.onnx'), opts),
201
+ ort.InferenceSession.create(path.join(onnx, 'text_encoder.onnx'), opts),
202
+ ort.InferenceSession.create(path.join(onnx, 'vector_estimator.onnx'), opts),
203
+ ort.InferenceSession.create(path.join(onnx, 'vocoder.onnx'), opts),
204
+ ]);
205
+ const tp = new UnicodeProcessor(path.join(onnx, 'unicode_indexer.json'));
206
+ return new TextToSpeech(cfgs, tp, dp, te, ve, vo);
207
+ }
208
+
209
+ // ── WAV (16-bit PCM mono) ──
210
+ function floatToWav(samples, sampleRate) {
211
+ const dataSize = samples.length * 2;
212
+ const buf = Buffer.alloc(44 + dataSize);
213
+ buf.write('RIFF', 0); buf.writeUInt32LE(36 + dataSize, 4); buf.write('WAVE', 8);
214
+ buf.write('fmt ', 12); buf.writeUInt32LE(16, 16); buf.writeUInt16LE(1, 20); buf.writeUInt16LE(1, 22);
215
+ buf.writeUInt32LE(sampleRate, 24); buf.writeUInt32LE(sampleRate * 2, 28); buf.writeUInt16LE(2, 32); buf.writeUInt16LE(16, 34);
216
+ buf.write('data', 36); buf.writeUInt32LE(dataSize, 40);
217
+ for (let i = 0; i < samples.length; i++) {
218
+ const s = Math.max(-1, Math.min(1, samples[i]));
219
+ buf.writeInt16LE(Math.floor(s * 32767), 44 + i * 2);
220
+ }
221
+ return buf;
222
+ }
223
+
224
+ // Voice "slightly up": peak-normalize the synthesized voice to a consistent presence (neural TTS output often
225
+ // sits well below full scale). Only boosts (never reduces a loud take), capped so near-silence isn't blown up.
226
+ function normalizePeak(samples, target = 0.9, maxGain = 3.0) {
227
+ let peak = 0;
228
+ for (let i = 0; i < samples.length; i++) { const a = Math.abs(samples[i]); if (a > peak) peak = a; }
229
+ if (peak <= 0) return samples;
230
+ const gain = Math.min(target / peak, maxGain);
231
+ if (gain <= 1.0) return samples;
232
+ for (let i = 0; i < samples.length; i++) samples[i] *= gain;
233
+ return samples;
234
+ }
235
+
236
+ // Provenance chime (two-tone 1175→1568 Hz) matching speak.ps1's earcon, as PCM samples to prepend.
237
+ function earconSamples(sampleRate) {
238
+ const tone = (freq, ms) => {
239
+ const n = Math.floor((ms / 1000) * sampleRate);
240
+ return Array.from({ length: n }, (_, i) => 0.25 * Math.sin((2 * Math.PI * freq * i) / sampleRate));
241
+ };
242
+ return [...tone(1175, 90), ...tone(1568, 110), ...new Array(Math.floor(0.06 * sampleRate)).fill(0)];
243
+ }
244
+
245
+ // Play a WAV synchronously through audio-duck.ps1 (win32): it ducks OTHER apps' sessions while the voice plays
246
+ // and restores them in a finally, excluding its own process so the voice is never ducked. No native deps.
247
+ // If ducking is unavailable it still plays (audio-duck falls back to a plain SoundPlayer). VORTEX_CU_DUCK=off
248
+ // disables ducking (factor 1.0); VORTEX_CU_DUCK_FACTOR tunes how much others drop (0.5 = to 50%).
249
+ function playWav(wavPath) {
250
+ return new Promise((resolve) => {
251
+ if (process.platform !== 'win32') return resolve(false);
252
+ const duckScript = path.join(__dirname, 'audio-duck.ps1');
253
+ const factor = process.env.VORTEX_CU_DUCK === 'off' ? '1.0' : (process.env.VORTEX_CU_DUCK_FACTOR || '0.3');
254
+ const ps = spawn('pwsh', ['-NoProfile', '-NonInteractive', '-File', duckScript, '-WavPath', wavPath, '-Factor', factor], { stdio: 'ignore' });
255
+ const kill = () => { try { ps.kill(); } catch {} };
256
+ process.once('SIGTERM', kill); process.once('SIGINT', kill);
257
+ ps.on('exit', () => resolve(true));
258
+ ps.on('error', () => resolve(false));
259
+ });
260
+ }
261
+
262
+ async function main() {
263
+ const a = parseArgs(process.argv);
264
+ let text = String(a.text || '');
265
+ if (text.length > a.maxChars) text = text.slice(0, a.maxChars);
266
+ if (!text.trim()) { emit({ ok: false, error: 'empty text' }); process.exit(1); }
267
+ const modelDir = resolveModelDir(a.modelDir);
268
+ const t0 = Date.now();
269
+ try {
270
+ ort = (await import('onnxruntime-node')).default ?? (await import('onnxruntime-node'));
271
+ } catch { emit({ ok: false, error: 'onnxruntime-node not installed' }); process.exit(1); }
272
+ try {
273
+ const stylePath = path.join(modelDir, 'voice_styles', `${a.voice}.json`);
274
+ if (!fs.existsSync(stylePath)) { emit({ ok: false, error: `voice not found: ${a.voice}` }); process.exit(1); }
275
+ const tts = await loadTTS(modelDir);
276
+ const style = loadVoiceStyle(stylePath);
277
+ const { wav } = await tts.call(text, a.lang, style, a.steps, a.speed);
278
+ const voice = normalizePeak(wav); // voice "slightly up" — consistent presence
279
+ let samples = a.earcon ? [...earconSamples(tts.sampleRate), ...voice] : voice;
280
+ const wavBuf = floatToWav(samples, tts.sampleRate);
281
+ if (a.toWav) {
282
+ fs.writeFileSync(a.toWav, wavBuf);
283
+ } else {
284
+ const tmp = path.join(os.tmpdir(), `vortex-cu-tts-${process.pid}-${Date.now()}.wav`);
285
+ fs.writeFileSync(tmp, wavBuf);
286
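+ const played = await playWav(tmp).finally(() => { try { fs.unlinkSync(tmp); } catch {} });
287
+ if (!played) { emit({ ok: false, error: 'playback failed' }); process.exit(1); } // non-win32 or spawn failure: let the caller fall back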
288
+ }
289
+ emit({ ok: true, voice: a.voice, chars: text.length, ms: Date.now() - t0 });
290
+ } catch (e) {
291
+ emit({ ok: false, error: 'tts failed' });
292
+ process.exit(1);
293
+ }
294
+ }
295
+
296
+ main();
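Since `--to-wav` exists so that tests can verify output without sound, a test would typically read the 44-byte header back and check it against the expected format. The sketch below assumes the canonical layout that `floatToWav` writes (16-bit PCM, mono). `readWavHeader` and `silentWav` are hypothetical helpers, and the 44100 Hz sample rate is an assumed value, since the real rate comes from the model's `tts.json`.

```javascript
// Illustrative sketch only: read back a canonical 44-byte PCM WAV header (hypothetical helper).
function readWavHeader(buf) {
  if (buf.length < 44) throw new Error('too short for a WAV header');
  const tag = (off) => buf.toString('ascii', off, off + 4);
  if (tag(0) !== 'RIFF' || tag(8) !== 'WAVE' || tag(12) !== 'fmt ' || tag(36) !== 'data') {
    throw new Error('not a canonical 44-byte PCM WAV header');
  }
  const h = {
    riffSize: buf.readUInt32LE(4),       // file size minus the 8-byte RIFF preamble
    audioFormat: buf.readUInt16LE(20),   // 1 = PCM
    channels: buf.readUInt16LE(22),
    sampleRate: buf.readUInt32LE(24),
    byteRate: buf.readUInt32LE(28),      // sampleRate * channels * bytesPerSample
    blockAlign: buf.readUInt16LE(32),
    bitsPerSample: buf.readUInt16LE(34),
    dataSize: buf.readUInt32LE(40),
  };
  h.durationSec = h.dataSize / h.byteRate;
  return h;
}

// Build a silent buffer with the same field layout, just to have something to read back.
function silentWav(sampleRate, seconds) {
  const dataSize = Math.floor(sampleRate * seconds) * 2; // 2 bytes per 16-bit mono sample
  const buf = Buffer.alloc(44 + dataSize);               // zero-filled, so the samples are silence
  buf.write('RIFF', 0); buf.writeUInt32LE(36 + dataSize, 4); buf.write('WAVE', 8);
  buf.write('fmt ', 12); buf.writeUInt32LE(16, 16); buf.writeUInt16LE(1, 20); buf.writeUInt16LE(1, 22);
  buf.writeUInt32LE(sampleRate, 24); buf.writeUInt32LE(sampleRate * 2, 28);
  buf.writeUInt16LE(2, 32); buf.writeUInt16LE(16, 34);
  buf.write('data', 36); buf.writeUInt32LE(dataSize, 40);
  return buf;
}

const header = readWavHeader(silentWav(44100, 0.5)); // 44100 Hz is an assumed rate
console.log(header);
```

In a real test the buffer would come from `fs.readFileSync` on the path given to `--to-wav`, and `durationSec` offers a cheap sanity check that the utterance was rendered at all.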
@@ -0,0 +1,58 @@
1
+ # computer-use — TTS speaker helper (pwsh 7; System.Speech, built-in on Windows, NO install).
2
+ #
3
+ # Speaks ONE already-finalized utterance and exits. The CALLER (Node reflex path) owns the security:
4
+ # provenance prefix, sanitization, and the speech budget / no-overlap (codex r1 HIGH/MED). This helper
5
+ # just renders audio so it never blocks the resident worker (it runs as its own short-lived process).
6
+ # `-ToWav` renders to a file instead of the speakers so tests/verification make no sound.
7
+ #
8
+ # Contract: -Text <utterance>; one JSON line on stdout {ok, voice, chars, ms} (or {ok:false,error}, exit 1).
9
+ param(
10
+ [Parameter(Mandatory = $true)][string]$Text,
11
+ [int]$Rate = 0, # System.Speech rate -10..10 (0 = default)
12
+ [string]$Voice = '', # preferred voice-name substring; else first Korean voice; else default
13
+ [string]$ToWav = '', # render to this WAV path instead of the speakers (tests)
14
+ [string]$Earcon = '', # if set, play a short provenance chime through the speakers BEFORE speaking, marking
15
+ # the utterance as screen-derived (non-verbal provenance for the reflex OCR/vision
16
+ # path; skipped under -ToWav so verification stays silent)
17
+ [int]$MaxChars = 600 # defence-in-depth cap (caller already caps/shapes)
18
+ )
19
+ $ErrorActionPreference = 'Stop'
20
+ try { [Console]::OutputEncoding = [System.Text.UTF8Encoding]::new($false) } catch {}
21
+ function Emit($o) { [Console]::Out.WriteLine(($o | ConvertTo-Json -Compress)) }
22
+ try {
23
+ $t = [string]$Text
24
+ if ($t.Length -gt $MaxChars) { $t = $t.Substring(0, $MaxChars) }
25
+ if ([string]::IsNullOrWhiteSpace($t)) { Emit @{ ok = $false; error = 'empty text' }; exit 1 }
26
+ Add-Type -AssemblyName System.Speech
27
+ $syn = New-Object System.Speech.Synthesis.SpeechSynthesizer
28
+ $voices = @($syn.GetInstalledVoices() | Where-Object { $_.Enabled } | ForEach-Object { $_.VoiceInfo })
29
+ $picked = $null
30
+ if ($Voice) { $picked = @($voices | Where-Object { $_.Name -like "*$Voice*" })[0] }
31
+ if (-not $picked) { $picked = @($voices | Where-Object { $_.Culture.Name -like 'ko*' })[0] }
32
+ if ($picked) { $syn.SelectVoice($picked.Name) }
33
+ $syn.Rate = [Math]::Max(-10, [Math]::Min(10, $Rate))
34
+ if ($ToWav) { $syn.SetOutputToWaveFile($ToWav) } else { $syn.SetOutputToDefaultAudioDevice() }
35
+ # Provenance chime: a short, distinct two-tone played BEFORE screen-derived speech so the listener hears
36
+ # "this next bit is raw screen text, not the assistant" without a verbal prefix. Only on real audio output
37
+ # (never under -ToWav, so verification stays silent). Best-effort — a failed beep must not abort speech.
38
+ # Duck other apps' audio while speaking (per-app WASAPI), restored in finally — excludes THIS process so the
39
+ # voice isn't ducked. Skipped under -ToWav (silent test) or VORTEX_CU_DUCK=off. Best-effort: never blocks speech.
40
+ $duckHandle = $null
41
+ if (-not $ToWav -and $env:VORTEX_CU_DUCK -ne 'off') {
42
+ try {
43
+ . (Join-Path $PSScriptRoot 'audio-duck.ps1')
44
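+ $df = 0.0; [double]::TryParse($env:VORTEX_CU_DUCK_FACTOR, [System.Globalization.NumberStyles]::Float, [System.Globalization.CultureInfo]::InvariantCulture, [ref]$df) | Out-Null; if ($df -le 0 -or $df -gt 1) { $df = 0.3 }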
45
+ $duckHandle = Invoke-Duck $df @($PID)
46
+ } catch {}
47
+ }
48
+ $sw = [System.Diagnostics.Stopwatch]::StartNew()
49
+ try {
50
+ if ($Earcon -and -not $ToWav) { try { [Console]::Beep(1175, 90); [Console]::Beep(1568, 110) } catch {} }
51
+ $syn.Speak($t)
52
+ } finally {
53
+ $sw.Stop()
54
+ if ($duckHandle) { Restore-Duck $duckHandle }
55
+ }
56
+ $syn.SetOutputToNull(); $syn.Dispose()
57
+ Emit @{ ok = $true; voice = $(if ($picked) { $picked.Name } else { 'default' }); chars = $t.Length; ms = [int]$sw.Elapsed.TotalMilliseconds }
58
+ } catch { Emit @{ ok = $false; error = 'tts failed' }; exit 1 }
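Both speaker helpers leave the provenance prefix, sanitization and the speech budget to the caller. As a rough sketch of what that caller-side shaping could look like, the block below strips control and format characters, caps the length, adds a spoken prefix, and gates utterances so they neither overlap nor fire too often. Everything here is an assumption for illustration: `shapeUtterance`, `createSpeechGate`, the `'Screen text: '` prefix and the 5000 ms gap are not taken from the package.

```javascript
// Illustrative sketch only: hypothetical caller-side shaping of untrusted, screen-derived text.
function shapeUtterance(raw, { prefix = 'Screen text: ', maxChars = 600 } = {}) {
  const cleaned = String(raw ?? '')
    .replace(/\p{C}/gu, ' ') // control and format characters, including bidi marks
    .replace(/\s+/g, ' ')
    .trim();
  if (!cleaned) return null; // nothing worth speaking
  const room = Math.max(0, maxChars - prefix.length);
  // Cap by code points so a surrogate pair is never cut in half.
  return prefix + Array.from(cleaned).slice(0, room).join('');
}

// Speech budget: at most one utterance in flight, and a minimum gap between utterance starts.
function createSpeechGate(minGapMs = 5000) {
  let busy = false;
  let lastStart = -Infinity;
  return {
    tryBegin(now) {
      if (busy || now - lastStart < minGapMs) return false;
      busy = true;
      lastStart = now;
      return true;
    },
    end() { busy = false; },
  };
}

const gate = createSpeechGate();
const utterance = shapeUtterance('  hello\u202E\n world ');
if (utterance && gate.tryBegin(Date.now())) {
  // A real caller would spawn the speaker helper here and call gate.end() when it exits.
  console.log(utterance);
  gate.end();
}
```

Taking `now` as an explicit argument keeps the gate deterministic under test, in the same spirit as the noise filter's caller-supplied clock.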