verbalcoding 0.2.12 → 0.2.13

Files changed (169)
  1. package/.env.example +74 -4
  2. package/README.es.md +3 -1
  3. package/README.fr.md +3 -1
  4. package/README.ja.md +3 -1
  5. package/README.ko.md +4 -2
  6. package/README.md +4 -2
  7. package/README.ru.md +3 -1
  8. package/README.zh.md +3 -1
  9. package/app-node/agent_adapters.test.mjs +14 -0
  10. package/app-node/agent_routing.mjs +148 -0
  11. package/app-node/agent_routing.test.mjs +138 -0
  12. package/app-node/agent_turn.mjs +86 -0
  13. package/app-node/agent_turn.test.mjs +109 -0
  14. package/app-node/bridge_context.mjs +73 -0
  15. package/app-node/bridge_context.test.mjs +54 -0
  16. package/app-node/bridge_state.mjs +4 -0
  17. package/app-node/bridge_wireup.test.mjs +462 -0
  18. package/app-node/cli_install.test.mjs +31 -0
  19. package/app-node/cross_agent_routing.test.mjs +78 -0
  20. package/app-node/discord_command_router.mjs +204 -0
  21. package/app-node/discord_command_router.test.mjs +311 -0
  22. package/app-node/discord_voice_setup.mjs +251 -0
  23. package/app-node/discord_voice_setup.test.mjs +86 -0
  24. package/app-node/hermes_profiles.test.mjs +12 -1
  25. package/app-node/install_config.mjs +110 -3
  26. package/app-node/install_config.test.mjs +8 -0
  27. package/app-node/instance_doctor.test.mjs +9 -0
  28. package/app-node/instances.test.mjs +8 -1
  29. package/app-node/main.mjs +488 -1368
  30. package/app-node/mcp_tools.test.mjs +7 -0
  31. package/app-node/notification_handler.mjs +89 -0
  32. package/app-node/notification_handler.test.mjs +187 -0
  33. package/app-node/plan_dispatcher.mjs +215 -0
  34. package/app-node/plan_dispatcher.test.mjs +101 -0
  35. package/app-node/plan_mode.mjs +36 -7
  36. package/app-node/plan_mode.test.mjs +78 -0
  37. package/app-node/progress_handler.mjs +220 -0
  38. package/app-node/progress_handler.test.mjs +193 -0
  39. package/app-node/progress_speech.mjs +54 -32
  40. package/app-node/progress_speech.test.mjs +12 -3
  41. package/app-node/project_sessions.mjs +5 -2
  42. package/app-node/project_sessions.test.mjs +7 -0
  43. package/app-node/research_mode.mjs +282 -0
  44. package/app-node/research_mode.test.mjs +264 -0
  45. package/app-node/restart_notice.mjs +3 -0
  46. package/app-node/restart_notice.test.mjs +11 -0
  47. package/app-node/session_ontology.mjs +271 -0
  48. package/app-node/session_ontology.test.mjs +130 -0
  49. package/app-node/smart_progress.mjs +1 -1
  50. package/app-node/stream_sentencer.mjs +32 -2
  51. package/app-node/stream_sentencer.test.mjs +65 -0
  52. package/app-node/streaming_tts_queue.mjs +5 -1
  53. package/app-node/streaming_tts_queue.test.mjs +7 -1
  54. package/app-node/stt_whisper.mjs +24 -0
  55. package/app-node/stt_whisper.test.mjs +32 -0
  56. package/app-node/text_routing.mjs +4 -2
  57. package/app-node/tts_backends.mjs +537 -3
  58. package/app-node/tts_backends.test.mjs +454 -0
  59. package/app-node/tts_player.mjs +164 -0
  60. package/app-node/tts_player.test.mjs +202 -0
  61. package/app-node/tts_runtime.mjs +134 -0
  62. package/app-node/tts_runtime.test.mjs +89 -0
  63. package/app-node/tts_settings.mjs +150 -3
  64. package/app-node/tts_settings.test.mjs +204 -0
  65. package/app-node/tts_voice_config.mjs +136 -2
  66. package/app-node/tts_voice_config.test.mjs +94 -0
  67. package/app-node/utterance_router.mjs +216 -0
  68. package/app-node/utterance_router.test.mjs +236 -0
  69. package/app-node/voice_autojoin.mjs +37 -0
  70. package/app-node/voice_autojoin.test.mjs +59 -0
  71. package/app-node/voice_io.mjs +272 -0
  72. package/app-node/voice_io.test.mjs +102 -0
  73. package/app-node/voice_turn_runner.mjs +449 -0
  74. package/app-node/voice_turn_runner.test.mjs +289 -0
  75. package/docs/CONFIGURATION.md +12 -2
  76. package/docs/HARNESSES.md +58 -0
  77. package/docs/HARNESS_AIDER.md +50 -0
  78. package/docs/HARNESS_CLAUDE.md +56 -0
  79. package/docs/HARNESS_CODEX.md +56 -0
  80. package/docs/HARNESS_CURSOR.md +45 -0
  81. package/docs/HARNESS_GEMINI.md +45 -0
  82. package/docs/HARNESS_HERMES.md +57 -0
  83. package/docs/HARNESS_OPENCLAW.md +44 -0
  84. package/docs/HARNESS_OPENCODE.md +44 -0
  85. package/docs/README.md +1 -0
  86. package/docs/ROADMAP.md +20 -5
  87. package/docs/TTS_BACKENDS.md +227 -0
  88. package/docs/USAGE.md +22 -0
  89. package/docs/i18n/AGENTS.es.md +34 -0
  90. package/docs/i18n/AGENTS.fr.md +34 -0
  91. package/docs/i18n/AGENTS.ja.md +34 -0
  92. package/docs/i18n/AGENTS.ko.md +34 -0
  93. package/docs/i18n/AGENTS.ru.md +34 -0
  94. package/docs/i18n/AGENTS.zh.md +34 -0
  95. package/docs/i18n/HARNESSES.es.md +58 -0
  96. package/docs/i18n/HARNESSES.fr.md +58 -0
  97. package/docs/i18n/HARNESSES.ja.md +58 -0
  98. package/docs/i18n/HARNESSES.ko.md +58 -0
  99. package/docs/i18n/HARNESSES.ru.md +58 -0
  100. package/docs/i18n/HARNESSES.zh.md +58 -0
  101. package/docs/i18n/HARNESS_AIDER.es.md +48 -0
  102. package/docs/i18n/HARNESS_AIDER.fr.md +48 -0
  103. package/docs/i18n/HARNESS_AIDER.ja.md +50 -0
  104. package/docs/i18n/HARNESS_AIDER.ko.md +50 -0
  105. package/docs/i18n/HARNESS_AIDER.ru.md +48 -0
  106. package/docs/i18n/HARNESS_AIDER.zh.md +48 -0
  107. package/docs/i18n/HARNESS_CLAUDE.es.md +55 -0
  108. package/docs/i18n/HARNESS_CLAUDE.fr.md +55 -0
  109. package/docs/i18n/HARNESS_CLAUDE.ja.md +56 -0
  110. package/docs/i18n/HARNESS_CLAUDE.ko.md +56 -0
  111. package/docs/i18n/HARNESS_CLAUDE.ru.md +55 -0
  112. package/docs/i18n/HARNESS_CLAUDE.zh.md +56 -0
  113. package/docs/i18n/HARNESS_CODEX.es.md +55 -0
  114. package/docs/i18n/HARNESS_CODEX.fr.md +55 -0
  115. package/docs/i18n/HARNESS_CODEX.ja.md +56 -0
  116. package/docs/i18n/HARNESS_CODEX.ko.md +56 -0
  117. package/docs/i18n/HARNESS_CODEX.ru.md +55 -0
  118. package/docs/i18n/HARNESS_CODEX.zh.md +56 -0
  119. package/docs/i18n/HARNESS_CURSOR.es.md +42 -0
  120. package/docs/i18n/HARNESS_CURSOR.fr.md +42 -0
  121. package/docs/i18n/HARNESS_CURSOR.ja.md +45 -0
  122. package/docs/i18n/HARNESS_CURSOR.ko.md +45 -0
  123. package/docs/i18n/HARNESS_CURSOR.ru.md +42 -0
  124. package/docs/i18n/HARNESS_CURSOR.zh.md +42 -0
  125. package/docs/i18n/HARNESS_GEMINI.es.md +44 -0
  126. package/docs/i18n/HARNESS_GEMINI.fr.md +44 -0
  127. package/docs/i18n/HARNESS_GEMINI.ja.md +45 -0
  128. package/docs/i18n/HARNESS_GEMINI.ko.md +45 -0
  129. package/docs/i18n/HARNESS_GEMINI.ru.md +44 -0
  130. package/docs/i18n/HARNESS_GEMINI.zh.md +45 -0
  131. package/docs/i18n/HARNESS_HERMES.es.md +54 -0
  132. package/docs/i18n/HARNESS_HERMES.fr.md +54 -0
  133. package/docs/i18n/HARNESS_HERMES.ja.md +57 -0
  134. package/docs/i18n/HARNESS_HERMES.ko.md +57 -0
  135. package/docs/i18n/HARNESS_HERMES.ru.md +54 -0
  136. package/docs/i18n/HARNESS_HERMES.zh.md +57 -0
  137. package/docs/i18n/HARNESS_OPENCLAW.es.md +41 -0
  138. package/docs/i18n/HARNESS_OPENCLAW.fr.md +41 -0
  139. package/docs/i18n/HARNESS_OPENCLAW.ja.md +44 -0
  140. package/docs/i18n/HARNESS_OPENCLAW.ko.md +44 -0
  141. package/docs/i18n/HARNESS_OPENCLAW.ru.md +41 -0
  142. package/docs/i18n/HARNESS_OPENCLAW.zh.md +42 -0
  143. package/docs/i18n/HARNESS_OPENCODE.es.md +41 -0
  144. package/docs/i18n/HARNESS_OPENCODE.fr.md +41 -0
  145. package/docs/i18n/HARNESS_OPENCODE.ja.md +44 -0
  146. package/docs/i18n/HARNESS_OPENCODE.ko.md +44 -0
  147. package/docs/i18n/HARNESS_OPENCODE.ru.md +41 -0
  148. package/docs/i18n/HARNESS_OPENCODE.zh.md +44 -0
  149. package/docs/superpowers/plans/2026-05-14-cross-agent-voice-transfer.md +625 -0
  150. package/docs/superpowers/plans/2026-05-21-audio-overview-narrated-diffs.md +95 -0
  151. package/docs/superpowers/plans/2026-05-21-autoresearch-ontology.md +83 -0
  152. package/docs/superpowers/plans/2026-05-21-phase11-push-to-talk-wakeword-v2.md +77 -0
  153. package/docs/superpowers/plans/2026-05-21-phase12-multi-user-voice.md +147 -0
  154. package/docs/superpowers/plans/2026-05-21-phase14-verbalbench.md +136 -0
  155. package/docs/superpowers/plans/2026-05-21-phase15-phone-companion.md +72 -0
  156. package/integrations/fireredtts2/mlx_llm.py +183 -0
  157. package/integrations/fireredtts2/synth.py +156 -0
  158. package/integrations/fireredtts2/synth_mlx.py +196 -0
  159. package/integrations/mlxaudio/synth.py +74 -0
  160. package/integrations/neuttsair/synth.py +104 -0
  161. package/integrations/omnivoice/synth.py +110 -0
  162. package/package.json +6 -1
  163. package/scripts/cli.mjs +84 -0
  164. package/scripts/doctor.mjs +104 -4
  165. package/scripts/install.mjs +5 -1
  166. package/scripts/install_fireredtts2.sh +109 -0
  167. package/scripts/install_mlxaudio.sh +34 -0
  168. package/scripts/install_mossttsnano.sh +46 -0
  169. package/scripts/postinstall.mjs +34 -0
@@ -0,0 +1,95 @@
1
+ # Audio Overview — Two-Voice Narrated Diffs Implementation Plan
2
+
3
+ > **For agentic workers:** REQUIRED SUB-SKILL: superpowers:subagent-driven-development or superpowers:executing-plans.
4
+
5
+ **Goal:** Replace the "diff too long to read aloud" dead-end (`app-node/main.mjs:1033-1037`) with a NotebookLM-style two-voice audio overview: a 30–60 s scripted Host A / Host B dialogue summarizing long agent outputs (diffs, test logs, listings) instead of skipping them.
6
+
7
+ **Architecture:** New `audio_overview.mjs` sits between `spokenResultOnly` (`app-node/main.mjs:1031`) and the streaming TTS queue. On trigger it (a) asks the active cheap-LLM adapter to script `[A]/[B]` dialogue, (b) feeds the script to the sentencer (`app-node/stream_sentencer.mjs:18`) which now also splits on speaker markers, (c) the queue (`app-node/streaming_tts_queue.mjs:1`) calls `synth(text, { voice })` with `AUDIO_OVERVIEW_VOICE_A` / `_B`. Any failure falls back to the existing notice.
8
+
9
+ **Tech Stack:** Node 20 ESM, active adapter (`adapterForBackend` at `app-node/main.mjs:730`) for scripting, Edge TTS multi-voice (`createEdgeTtsBackend` at `app-node/tts_backends.mjs:206`) — `voiceProvider` becomes per-call.
10
+
11
+ ---
12
+
13
+ ## Spec
14
+
15
+ ### API
16
+
17
+ ```javascript
18
+ const overview = createAudioOverview({
19
+ thresholdChars: Number(env.AUDIO_OVERVIEW_THRESHOLD || 1500),
20
+ voiceA: env.AUDIO_OVERVIEW_VOICE_A || 'en-US-GuyNeural',
21
+ voiceB: env.AUDIO_OVERVIEW_VOICE_B || 'en-US-JennyNeural',
22
+ scriptModel: () => adapterForBackend(routingStateFor(key).activeRouting?.backend),
23
+ language: settings.voiceLanguage,
24
+ timeoutMs: 4000,
25
+ });
26
+ const result = await overview.maybeScript({ userPrompt, answer });
27
+ // result: { kind: 'overview', segments: [{ voice, text }, ...] } | { kind: 'skip' }
28
+ ```
29
+
30
+ ### Behavior
31
+
32
+ - **Trigger** (`shouldNarrate`): `answer.length > thresholdChars` OR `isPatchLikeOutput(answer)` already at `app-node/main.mjs:1033`. Re-export helper.
33
+ - **Script prompt** (fixed system): _"Script a 30–60 s podcast summary of this coding agent answer. Output 4–6 turns alternating `[A]` (asks) / `[B]` (explains). No markdown, no fences, no paths unless essential. Language: {language}."_ User = first 2 KB of `stripMarkdownNoise(answer)`.
34
+ - **Cache** identical answers 10 min (Map keyed by sha1). **Timeout** 4 s. **Disabled** when `AUDIO_OVERVIEW=0` or toggled off in routing state.
35
+
36
+ ### Integration
37
+
38
+ - `stream_sentencer.mjs`: add `speakerMarkers: ['[A]', '[B]']` option to `createSentencer` (`app-node/stream_sentencer.mjs:18`). When set, `ingest()` splits on each marker; `'sentence'` event payload becomes `{ text, voice }`.
39
+ - `streaming_tts_queue.mjs:40`: `enqueue(text, opts)` stores `{ text, voice }` in the queue (`app-node/streaming_tts_queue.mjs:5`); pump (`app-node/streaming_tts_queue.mjs:11`) passes `voice` to `synth(text, { voice })`.
40
+ - `tts_backends.mjs`: `synthesize(text, { signal, voice })` in `createEdgeTtsBackend` (`app-node/tts_backends.mjs:225`) overrides `currentVoice()` when `voice` is set. Other backends keep default voice (logged once).
41
+ - `main.mjs`: in `spokenResultOnly` (`app-node/main.mjs:1031-1051`), before the "diff too long" return, call `overview.maybeScript()`; on `'overview'` push `segments` through the streaming path; `synth` closure (`app-node/main.mjs:1320`) upgraded to `synth(text, { voice })`.
42
+ - **User control**: at the `voiceCommandFromTranscript` callsite (`app-node/main.mjs:984`) parse `"audio overview off|on"` / `"오디오 오버뷰 꺼|켜"` and toggle `routingStateFor(key).audioOverviewEnabled` (default `true`).
43
+
44
+ ---
45
+
46
+ ## File Structure
47
+
48
+ - Create: `app-node/audio_overview.mjs`, `app-node/audio_overview.test.mjs`.
49
+ - Modify: `app-node/stream_sentencer.mjs` — speaker-marker split.
50
+ - Modify: `app-node/streaming_tts_queue.mjs` — per-chunk voice passthrough.
51
+ - Modify: `app-node/tts_backends.mjs` — `voice` override in `createEdgeTtsBackend`.
52
+ - Modify: `app-node/main.mjs` — wire trigger, voice command, routing flag.
53
+ - Modify: `.env.example` — `AUDIO_OVERVIEW`, `AUDIO_OVERVIEW_THRESHOLD`, `AUDIO_OVERVIEW_VOICE_A`, `AUDIO_OVERVIEW_VOICE_B`.
54
+
55
+ ---
56
+
57
+ ## Tasks
58
+
59
+ ### Task 1: TDD — trigger + parser
60
+
61
+ - [ ] Step 1: `audio_overview.test.mjs`: short answer → `{ kind: 'skip' }`; long answer or `isPatchLikeOutput` short diff → `maybeScript` invoked.
62
+ - [ ] Step 2: Mock adapter returns `"[A] What changed?\n[B] Two files in router."`; expect `segments[0].voice === voiceA`. Malformed output (no `[A]/[B]`) → `{ kind: 'skip' }`.
63
+
64
+ ### Task 2: Implement `audio_overview.mjs`
65
+
66
+ - [ ] Step 1: `createAudioOverview({...})` returning `{ maybeScript }`. Cache `Map<sha1, segments>` 10 min TTL.
67
+ - [ ] Step 2: Script call via injected `scriptModel()` (`.ask(system, user, { signal, timeoutMs })`); strip fences + markdown per segment.
68
+ - [ ] Step 3: Run, expect PASS, commit.
69
+
70
+ ### Task 3: Extend `stream_sentencer.mjs`
71
+
72
+ - [ ] Step 1: Add `speakerMarkers` to `createSentencer` (`app-node/stream_sentencer.mjs:18`). Emit `{ text, voice }`; keep back-compat with `{ text, voice: null }` when option absent.
73
+ - [ ] Step 2: Update listener at `app-node/main.mjs:1333` for the object form.
74
+ - [ ] Step 3: Add `stream_sentencer.test.mjs` covering marker split + fence preservation.
75
+
76
+ ### Task 4: Extend `streaming_tts_queue.mjs`
77
+
78
+ - [ ] Step 1: `enqueue(text, { voice } = {})`; queue stores objects; pump calls `synth(text, { voice })`.
79
+ - [ ] Step 2: Test: two enqueues with distinct voices both reach mock synth with matching arg.
80
+
81
+ ### Task 5: Extend `createEdgeTtsBackend`
82
+
83
+ - [ ] Step 1: `synthesize` at `app-node/tts_backends.mjs:225` accepts `voice` and substitutes into `-v` argv.
84
+ - [ ] Step 2: Non-edge backends log "audio overview voice swap unsupported on {backend}" once.
85
+
86
+ ### Task 6: Wire into `main.mjs`
87
+
88
+ - [ ] Step 1: Instantiate `audioOverview` after `createTtsBackend` (`app-node/main.mjs:253`).
89
+ - [ ] Step 2: In `spokenResultOnly` (`app-node/main.mjs:1031`), when trigger fires and `routingStateFor(key).audioOverviewEnabled !== false`, `await overview.maybeScript(...)`; on `'overview'` enqueue segments; on `'skip'` keep existing notice.
90
+ - [ ] Step 3: Parse `"audio overview on|off"` / Korean in the `handleTtsVoiceCommand` chain (`app-node/main.mjs:983`). Routing state only — no env write.
91
+
92
+ ### Task 7: Document
93
+
94
+ - [ ] Step 1: `.env.example` adds four keys after `STREAMING_TTS` (line 26).
95
+ - [ ] Step 2: Note the two-voice fallback rule in `docs/superpowers/` README. Commit.
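The `[A]`/`[B]` script parsing that Task 1 of the plan above tests for could be sketched roughly as follows. `parseOverviewScript` is a hypothetical helper name and does not come from the package; the return shapes (`{ kind: 'overview', segments }` and `{ kind: 'skip' }`) follow the plan's API block, and the regex-based split is only one assumed way to do it.

```javascript
// Hypothetical helper: turns "[A] ...\n[B] ..." into [{ voice, text }, ...].
// Returns { kind: 'skip' } when no speaker markers are found, matching the
// "malformed output" rule in the plan.
function parseOverviewScript(script, { voiceA, voiceB }) {
  const voices = { A: voiceA, B: voiceB };
  const segments = [];
  // Match a marker, then lazily capture everything up to the next marker
  // or the end of the input.
  const pattern = /\[(A|B)\]\s*([\s\S]*?)(?=\[(?:A|B)\]|$)/g;
  for (const match of String(script).matchAll(pattern)) {
    // Collapse internal whitespace so each segment is one speakable line.
    const text = match[2].replace(/\s+/g, ' ').trim();
    if (text) segments.push({ voice: voices[match[1]], text });
  }
  return segments.length > 0 ? { kind: 'overview', segments } : { kind: 'skip' };
}
```

Any text the model emits before the first marker is ignored in this sketch, so a stray preamble does not become a spoken segment.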
@@ -0,0 +1,83 @@
1
+ # Voice Autoresearch + Session Ontology — Implementation Plan
2
+
3
+ **Status: in flight, 2026-05-21.** This is a 1-hour innovation push driven by 6 parallel research agents that surveyed 2026 autoresearch frameworks, LLM-driven ontology construction, agent memory systems, voice-first UX, agent design patterns, and the VerbalCoding code surface.
4
+
5
+ ## Why now
6
+
7
+ VerbalCoding's cross-agent voice routing landed but the handoff still passes a flat string ("last 4 utterances + last plan decisions") to the routed backend. That string drops surrounding context — *which file was touched 6 turns ago that still matters*, *which tool was used*, *which decisions are now superseded*. Every 2026 production agent-memory system (Graphiti/Zep, Mem0g, Anthropic Memory tool, Letta MemFS, ACE) converged on **typed graph or filesystem persistence with append-only semantics + idempotent dedup**. We can carry the smallest viable shape of that across CLI invocations.
8
+
9
+ Simultaneously, the autoresearch wave (STORM, OpenScholar, GPT-Researcher, Anthropic multi-agent research) shows a stable architecture: **plan → parallel fan-out → outline-as-compression → cited synthesis**. VerbalCoding has every piece needed (plan-mode, streaming TTS, sentencer, per-channel state) — we just don't have the voice command or the orchestrator.
10
+
11
+ Both extensions share infrastructure: the session ontology is where autoresearch results land and where handoff context comes from.
12
+
13
+ ## Module 1 — `app-node/session_ontology.mjs`
14
+
15
+ A per-channel typed graph. Fixed schema (no induction tax). Append-only with `superseded_by`. Serializable to <2KB JSON. Persists to `~/.verbalcoding/memory/<channelId>.json`.
16
+
17
+ **Node types (single char `t`):** `D` Decision, `F` File, `T` Tool, `C` Concept, `A` Agent, `R` Result.
18
+
19
+ **Edge predicates (single char `p`):** `d` decided, `t` touched, `u` used, `p` produced, `r` referenced, `s` superseded_by.
20
+
21
+ **Shape:**
22
+ ```js
23
+ {
24
+ v: 1, // schema version
25
+ channelKey: 'voice/123',
26
+ nodes: [{ id: 'n1', t: 'D', n: 'oauth_provider=github', ts: 1716100000, by: 'claude' }],
27
+ edges: [{ s: 'n1', p: 't', o: 'n2', ts: 1716100000 }],
28
+ meta: { updatedAt: 1716100000, nodeCount: 1, edgeCount: 1 },
29
+ }
30
+ ```
31
+
32
+ **Public API:**
33
+ - `createSessionOntology({ rootDir, channelKey, maxNodes = 40, maxEdges = 80 })` → store handle
34
+ - `store.add({ nodes, edges, supersedes })` — idempotent on `(t, lowercase(n))`
35
+ - `store.serializeForHandoff({ language = 'en', maxBytes = 1500 })` → compact markdown block ready for the cross-agent prompt
36
+ - `store.entitiesFromText(text, opts)` — convenience extractor (regex-only, no LLM, see below)
37
+ - `store.save() / store.load()`
38
+
39
+ **Why regex-only extraction (for v1):** the research summary said the EDC-style LLM extraction is the dominant 2025 pattern, but it costs an LLM call per turn on the voice critical path. v1 uses lightweight regex extraction so it never blocks a turn. The extraction is upgradeable later (slot in an LLM-backed extractor behind the same `entitiesFromText` signature).
40
+
41
+ **Eviction:** LRU on non-`Decision` nodes when over `maxNodes`. `Decision` nodes are sticky.
42
+
43
+ **Conflict resolution:** new fact does not delete old fact; it adds a `superseded_by` edge. `serializeForHandoff` skips superseded nodes by default. Graphiti-pattern, simplified.
44
+
45
+ ## Module 2 — `app-node/research_mode.mjs`
46
+
47
+ Voice command parser + pipeline. Maps STORM's outline-as-compression and GPT-Researcher's plan→executors pattern, but compressed to fit a voice turn.
48
+
49
+ **Public API:**
50
+ - `parseResearchCommand(text, language)` — returns `{type:'research', query, depth, sticky?} | {type:'none'}`. English `"research X"`, `"look up X"`, `"deep research X"`; Korean `"X 리서치", "X 조사해"`.
51
+ - `runResearchTurn({ query, depth, fetchImpl, llmAdapter, signal, language })` — async generator yielding `{phase, payload}` events: `'plan'`, `'fetch'`, `'summarize'`, `'narration'`, `'done'`. The main dispatcher routes `narration` chunks to the existing sentencer / streaming TTS queue.
52
+
53
+ **Backends:** primary search via Tavily (`TAVILY_API_KEY`), fallback Brave (`BRAVE_SEARCH_API_KEY`), then a "search backend not configured" voice notice. Synthesis via the active agent adapter (whichever CLI is current) so we don't add a separate API key requirement.
54
+
55
+ **Output shape:** spoken = 3-bullet outline narrated as one sentence each; text-channel = full markdown with sources and a one-line summary at the top (Perplexity-Voice-style citation-defer).
56
+
57
+ **Ontology integration:** after `done`, write `Concept` nodes for the query and main topics, `Result` nodes for each summary bullet, with `referenced` edges to URLs (stored as `File` nodes with a `t:'F'` and `n` set to the URL hash).
58
+
59
+ ## Module 3 — wire into `main.mjs`
60
+
61
+ 1. Import both new modules near the existing `agent_routing` import block.
62
+ 2. In the per-turn dispatcher (~`main.mjs:1844`, before `parseAgentRoutingCommand`), insert `parseResearchCommand`. If it returns `{type:'research', …}`, run the research turn and short-circuit the rest.
63
+ 3. Add `ontologyStateFor(channelKey)` next to `routingStateFor`. Lazy-create, persist on every mutation.
64
+ 4. Replace the cross-agent prompt's `priorUtterances` + `resolvedDecisions` inputs with `ontology.serializeForHandoff()` when the ontology is non-empty. Fall back to the existing flat string otherwise.
65
+ 5. After every successful agent turn, call `ontology.entitiesFromText(prompt + answer, { by: backend })` non-blocking.
66
+
67
+ ## Tests
68
+
69
+ - `app-node/session_ontology.test.mjs` — add, supersede, dedup, eviction, serializeForHandoff shape, save/load round-trip.
70
+ - `app-node/research_mode.test.mjs` — parser EN/KO, runResearchTurn pipeline with mocked fetch + llm.
71
+ - Extend `cross_agent_routing.test.mjs` — ontology projection replaces flat string when present.
72
+
73
+ ## Out of scope (this push)
74
+
75
+ - LLM-driven entity extraction (regex v1; upgradeable later).
76
+ - Streaming TTS prefix for "Research finding: " (TODO `bet-2.1` — wire via the existing prefix mechanism).
77
+ - Citation-defer Discord embed formatting (TODO `bet-2.2`).
78
+ - Per-agent voice ID + pre-warm (separate workstream).
79
+ - The voice-coding benchmark suggestion from the agent-pattern survey (this is a positioning play, separate from code).
80
+
81
+ ## Commit boundary
82
+
83
+ One commit per module + one for the wiring + one for the docs. Land green or revert.
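The dedup and conflict rules in Module 1 of the plan above (idempotent on `(t, lowercase(n))`, append-only with a `superseded_by` edge) might look like the in-memory sketch below. `createOntologySketch`, `addNode`, `supersede`, and `activeNodes` are illustrative names, not the package's API, and the edge direction (old node as subject `s`, new node as object `o`) is an assumption. Persistence, eviction, and byte limits are left out.

```javascript
// Minimal in-memory sketch of the typed graph; not the package's implementation.
function createOntologySketch() {
  const nodes = [];
  const edges = [];
  const index = new Map(); // "<t>:<lowercased n>" -> node
  let nextId = 1;

  const keyOf = (t, n) => `${t}:${String(n).toLowerCase()}`;

  function addNode({ t, n, by = null, ts = Date.now() }) {
    const key = keyOf(t, n);
    const existing = index.get(key);
    if (existing) return existing; // idempotent: same (t, n) never duplicates
    const node = { id: `n${nextId++}`, t, n, ts, by };
    nodes.push(node);
    index.set(key, node);
    return node;
  }

  // Append-only conflict handling: the old node stays, linked by an 's' edge.
  function supersede(oldNode, newNode, ts = Date.now()) {
    edges.push({ s: oldNode.id, p: 's', o: newNode.id, ts });
  }

  // Nodes that are the subject of an 's' edge are hidden from the handoff view.
  function activeNodes() {
    const superseded = new Set(edges.filter((e) => e.p === 's').map((e) => e.s));
    return nodes.filter((node) => !superseded.has(node.id));
  }

  return { addNode, supersede, activeNodes, nodes, edges };
}
```

Keeping superseded nodes in `nodes` and filtering them only at read time means the history stays available while the handoff view stays small.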
@@ -0,0 +1,77 @@
1
+ # Phase 11 — Push-to-Talk + Wake-Word v2 Implementation Plan
2
+
3
+ > **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development or superpowers:executing-plans to implement this plan task-by-task.
4
+
5
+ **Goal:** Eliminate wake misses and unwanted barge-ins by (a) adding an opt-in **push-to-talk gate** that suppresses whisper-cli unless the user opens a turn, and (b) replacing regex-on-transcript wake detection with a **continuous neural wake-word detector** tapped off the PCM stream — both behind feature flags in one increment.
6
+
7
+ **Motivation:** `subscribeUser()` records every Discord speaking-start to WAV (`main.mjs:2089-2141`), runs whisper-cli, then post-filters via `acceptsWake()` (`main.mjs:1582-1585`, `:1792`). This wastes 300-900 ms per false trigger and leaks nearby-conversation transcripts. Hardware PTT keys aren't visible to bots, so we need a software gate. Wake-word v2 reacts in <200 ms on a user-trained phrase without relying on whisper accuracy.
8
+
9
+ **Architecture:**
10
+
11
+ 1. **PTT gate (`app-node/ptt_gate.mjs`, new).** Pure module: `createPttGate({ mode })` with `isOpen(userId)`, `open(userId, { ttlMs })`, `close(userId)`. `subscribeUser()` consults it before WAV-writer allocation (`main.mjs:2100`); when closed, the opus stream is drained and whisper skipped. **Decision: slash-command pair `/ptt start|stop|status`**, registered via `client.application.commands` on `ready`. Justification:
12
+ - Discord's built-in *Push to Talk* setting only affects the human's outbound audio — bots receive packets either way, so it's invisible server-side.
13
+ - A WebRTC companion needs a sidecar per user, OS keybinding daemons, signalling channel; high install cost.
14
+ - Slash commands work on desktop + mobile, zero install, per-guild ACL'd. Default `ttlMs=20000` auto-closes on forgotten stop.
15
+ - **Future work (out of scope):** WebRTC companion using `node-global-key-listener` + localhost WS toggling the same gate for true hold-to-talk.
16
+
17
+ 2. **Wake-Word v2 (`app-node/wake_detector.mjs`, new).** Streaming detector wrapping **openWakeWord** (Apache-2.0, `onnxruntime-node`, ~30 ms hop, CPU-only). Picovoice rejected (per-user license, closed weights); Kyutai rejected (no shipped wake model). Consumes a tee of the 16 kHz mono PCM from the `prism.opus.Decoder` chain in `subscribeUser()` (`main.mjs:2109-2141`) — insert a `PassThrough` between `decoder` and `writer`, fork a downsampler into the detector. On score >`WAKE_WORD_THRESHOLD` (default 0.55) it calls `pttGate.open(userId, { ttlMs: WAKE_WORD_TURN_MS })`. Custom phrases via openWakeWord's `train_custom_verifier`; models in `models/wake/`.
18
+
19
+ 3. **State plumbing (`app-node/bridge_state.mjs`).** Extend `createBridgeState()` (`bridge_state.mjs:1`) with `setGateMode/getGateMode` so `/ptt`, voice commands ("PTT 켜"/"push to talk on" per `voice_messages.mjs`), and wake hits share one source of truth. `barge_in.mjs` unchanged; gate runs **before** `createLiveBargeInMonitor()` (`main.mjs:2115`), so explicit barge-in still works while gated.
20
+
21
+ **Tech Stack:** Node 20 ESM, `onnxruntime-node`, `discord.js` slash commands, `node --test`. Models: openWakeWord baseline + user-trained `hermes_v1.onnx`.
22
+
23
+ ## Tasks
24
+
25
+ ### Task 1: Failing tests for `ptt_gate`
26
+
27
+ **Files:** Create `app-node/ptt_gate.test.mjs`. Cover: default closed in `mode='ptt'`; always open in `mode='off'` (legacy); `open()` expires after `ttlMs`; `close()` is idempotent; per-user isolation.
28
+
29
+ ### Task 2: Implement `ptt_gate.mjs`
30
+
31
+ Pure module, no Discord deps. Inject `now`/`setTimeout` for testability. Export `createPttGate({ mode, defaultTtlMs, log })`.
32
+
33
+ ### Task 3: Register `/ptt` + wire gate into `subscribeUser`
34
+
35
+ **Files:** `app-node/main.mjs`. On `ClientReady`, register guild command `ptt` (`start|stop|status`). Add `interactionCreate` handler. In `subscribeUser()` insert gate check after the allow-check at `main.mjs:2090` — if closed, drain `receiver.subscribe(...)` and return before line 2109. Short-circuit the whisper post-filter at `main.mjs:1792` to skip `acceptsWake` when the detector confirmed the turn (carry a `viaWakeDetector` flag on the pending utterance).
36
+
37
+ ### Task 4: Failing tests for `wake_detector`
38
+
39
+ **Files:** `app-node/wake_detector.test.mjs`. Inject a fake ORT session; feed canned PCM frames; assert callback fires once per detection with cooldown.
40
+
41
+ ### Task 5: Implement `wake_detector.mjs`
42
+
43
+ Streaming class with 1280-sample (80 ms) frame buffer, 480 ms ring, cooldown 1500 ms, score smoothing over 3 frames. Lazy-load the ONNX model from `WAKE_WORD_MODEL` (default `models/wake/hermes_v1.onnx`). Fallback to legacy `acceptsWake` when model load fails — log once, never crash bridge.
44
+
45
+ ### Task 6: Tap PCM stream + integrate detector
46
+
47
+ **Files:** `app-node/main.mjs:2109-2141`. Replace `opusStream.pipe(decoder).pipe(writer)` with a fork: decoder → PassThrough → [writer, downsampler16k → detector.feed()]. Detector callback flips the gate open and stamps `pending.viaWakeDetector = true` via `bridgeState.setPending` (`bridge_state.mjs:28`).
48
+
49
+ ### Task 7: Settings + voice commands
50
+
51
+ **Files:** `.env.example`, `main.mjs:205-220`, `voice_messages.mjs`. Add `PTT_MODE=off|ptt|wake|wake+ptt` (default `off`), `WAKE_WORD_MODEL`, `WAKE_WORD_THRESHOLD`, `WAKE_WORD_TURN_MS`. Recognize "PTT 켜/꺼", "wake word on/off" per `docs/HARNESSES.md` shared-semantics.
52
+
53
+ ### Task 8: Docs
54
+
55
+ Update `docs/HARNESSES.md` shared-semantics with PTT + wake-v2 entries. Add `docs/PTT_AND_WAKEWORD.md` covering custom-verifier training and latency budget (detection→gate-open <200 ms).
56
+
57
+ ## Verification
58
+
59
+ - `node --test app-node/ptt_gate.test.mjs app-node/wake_detector.test.mjs` → PASS.
60
+ - Full `node --test` suite → no regressions in `barge_in.test.mjs`, `bridge_state.test.mjs`.
61
+ - Manual `PTT_MODE=ptt`: speaking without `/ptt start` → zero whisper invocations (no `pcmBytes` log at `main.mjs:2133`).
62
+ - Manual `PTT_MODE=wake`: trained phrase opens gate <200 ms (new `wakeDetectMs` field in `latency_metrics.mjs`).
63
+ - Manual: barge-in during TTS still aborts (`main.mjs:2115` path unaffected).
64
+
65
+ ## Out of scope
66
+
67
+ - WebRTC/keyboard PTT companion (future task; gate is companion-ready).
68
+ - Multi-wake-word arbitration per user.
69
+ - Server-side VAD replacement (`SUBSCRIBE_AFTER_SILENCE_MS` stays).
70
+ - Whisper streaming partials.
71
+
72
+ ## Self-Review
73
+
74
+ - Gate + detector share one bridge_state field → no split-brain.
75
+ - Default `PTT_MODE=off` keeps current behavior; opt-in only.
76
+ - ONNX load failure degrades to today's regex path, never crashes.
77
+ - All new modules are pure + injected deps → unit-testable without Discord.
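The gate contract from Tasks 1 and 2 of the plan above (closed by default in `mode='ptt'`, always open in `mode='off'`, TTL expiry, idempotent `close()`, per-user isolation) can be sketched as below. This is an assumed shape, not the package's code: it injects only `now` and expires lazily inside `isOpen`, whereas the plan also mentions injecting `setTimeout`; the `wake` and `wake+ptt` modes are treated like `ptt` here.

```javascript
// Sketch of the gate with an injected clock; not the package's implementation.
function createPttGate({ mode = 'off', defaultTtlMs = 20000, now = Date.now } = {}) {
  const openUntil = new Map(); // userId -> expiry timestamp in ms

  return {
    isOpen(userId) {
      if (mode === 'off') return true; // legacy behavior: nothing is gated
      const expiry = openUntil.get(userId);
      if (expiry === undefined) return false;
      if (now() >= expiry) {
        openUntil.delete(userId); // lazily expire a forgotten open gate
        return false;
      }
      return true;
    },
    open(userId, { ttlMs = defaultTtlMs } = {}) {
      openUntil.set(userId, now() + ttlMs);
    },
    close(userId) {
      // Deleting a missing key is a no-op, so close() is idempotent.
      openUntil.delete(userId);
    },
  };
}
```

Lazy expiry avoids owning any timers, which keeps the module pure and makes the TTL behavior testable with a fake clock.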
@@ -0,0 +1,147 @@
1
+ # Phase 12 — Per-Speaker Multi-User Voice State Implementation Plan
2
+
3
+ > **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development or superpowers:executing-plans to implement this plan task-by-task.
4
+
5
+ **Goal:** Scope routing, plan-mode, recent-utterance buffer, and ontology projection to `(channelId, userId)` so two humans in one Discord voice channel get independent agent state, while keeping a shared channel-level ontology layer and serializing agent invocations (TTS audio remains a single broadcast stream).
6
+
7
+ **Architecture:** Extend the channel-key helpers in `app-node/main.mjs` with a `For(userId)` variant; rekey `routingStateByChannel`, `planStates`, and `ontologyByChannel` to `<channelId>:<userId>` with channel-only fallback for unattributed paths and default-agent reads. Add a per-channel agent-invocation queue (FIFO with brief inter-speaker narration) layered on top of the existing `bridgeState` deferred queue. Ontology becomes a two-tier read: per-speaker overlay merged onto a channel-shared base. Audio output stays unicast-to-channel — only **state** is per-user.
8
+
9
+ **Tech Stack:** Node 20 ESM, existing `bridgeState` (`createBridgeState`), `createSessionOntology`, `node --test`.
10
+
11
+ ---
12
+
13
+ ## Spec
14
+
15
+ ### Key shape
16
+
17
+ - New helper `speakerKey(channelId, userId)` → `"<channelId>:<userId>"`. Channel-only key (`"<channelId>"`) remains valid as the fallback / shared layer.
18
+ - `planChannelKeyFor(userId)` companion to `planChannelKey()` at `app-node/main.mjs:399`. `planChannelKey()` retained for non-turn paths (default-agent reads, channel-wide resets, push notifications, voice-clone capture).
19
+ - `routingStateFor(key)` (`main.mjs:639`) accepts either key shape; on miss for `"chan:user"` it **copies** from `"chan"` once (so first-time user inherits channel default), then diverges.
20
+
21
+ ### Plan mode
22
+
23
+ - `planStates` (`main.mjs:397`) rekeyed to `speakerKey`. `dispatchPlanModeUtterance(prompt, signal)` (`main.mjs:422`) takes a `userId` arg and uses `planChannelKeyFor(userId)`. Concurrent plan sessions per user are independent; `"cancel"` / `"approve"` only affect the speaker's plan.
24
+
25
+ ### Ontology projection
26
+
27
+ - `ontologyByChannel` (`main.mjs:663`) becomes two stores per channel: `baseOntology(channelKey)` (shared) + `speakerOntology(channelKey, userId)` (overlay).
28
+ - New `ontologyProjectionFor(channelKey, userId)` returns a read-through view: search/serialize merges base first, overlay second (overlay supersedes on slot collision).
29
+ - `captureOntologyFromTurn` (`main.mjs:675`) writes to overlay; a low-frequency promoter (every N turns or on `!session save`) lifts overlay nodes whose `by` set spans ≥2 users into base. Promotion is pure; unit-testable.
30
+ - Persisted file layout: `.verbalcoding/ontology/<channel>.json` (base) and `.verbalcoding/ontology/<channel>/<user>.json` (overlay). `createSessionOntology` already takes `channelKey`; reuse with a composite key for overlay.
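
A minimal sketch of the two-tier read and the pure promotion step, assuming each store is a `Map` from slot name to a node of the form `{ value, by: Set<userId> }`. That node shape and the function name `projectOntology` are assumptions; only the merge order and the "`by` spans ≥2 users" rule come from the spec above.

```javascript
// Merge order from the spec: base first, overlay second,
// so the overlay wins on slot collision.
function projectOntology(base, overlay) {
  return new Map([...base, ...overlay]);
}

// Pure promotion: returns a new base plus the promoted slot names and
// never mutates its inputs. `overlays` is Map<userId, Map<slot, node>>.
function promoteCrossUserNodes(base, overlays, minUsers = 2) {
  const seen = new Map(); // slot -> { node, by }
  for (const overlay of overlays.values()) {
    for (const [slot, node] of overlay) {
      const entry = seen.get(slot) ?? { node, by: new Set() };
      for (const userId of node.by) entry.by.add(userId);
      entry.node = node; // assumption: the most recently visited overlay wins
      seen.set(slot, entry);
    }
  }

  const nextBase = new Map(base);
  const promoted = [];
  for (const [slot, { node, by }] of seen) {
    if (by.size >= minUsers) {
      nextBase.set(slot, { ...node, by: new Set(by) });
      promoted.push(slot);
    }
  }
  return { base: nextBase, promoted };
}
```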
31
+
32
+ ### Turn taking
33
+
34
+ - New module `app-node/turn_queue.mjs`: per-channel FIFO of `{ userId, runFn, enqueuedAt }`. Drains serially; emits `onWaiting(userId, aheadOf)` when a second speaker arrives mid-run.
35
+ - `handleRecording(userId, ...)` (`main.mjs:1756`) submits work via the queue rather than calling the adapter directly. Existing `bridgeState.enqueueDeferred` path (`main.mjs:1647`) remains for per-user segment accumulation; the queue layers above it.
36
+ - When `onWaiting` fires and the channel has ≥2 distinct waiting users, speak one short notice, e.g. `"User B, queued after User A."` Throttled to once per consecutive wait.
37
+ - `clearTransientRouting(planChannelKey())` (`main.mjs:2064`) is rewritten to `clearTransientRouting(planChannelKeyFor(userId))` so a barge-in by User A does not nuke User B's sticky route.
38
+
39
+ ### Audio (explicit non-goal)
40
+
41
+ - TTS remains a single shared stream per `VoiceConnection`. No per-user audio routing. Document this in `docs/USAGE.md` so users know responses are heard by everyone.
42
+
43
+ ---
44
+
45
+ ## File Structure
46
+
47
+ - Create: `app-node/turn_queue.mjs` — per-channel serial FIFO with `enqueue`, `length`, `onWaiting`.
48
+ - Create: `app-node/turn_queue.test.mjs` — serializes overlapping submissions; emits waiting notice once.
49
+ - Create: `app-node/speaker_key.mjs` — `speakerKey`, `planChannelKeyFor`, `splitSpeakerKey`.
50
+ - Create: `app-node/speaker_key.test.mjs`.
51
+ - Create: `app-node/ontology_projection.mjs` — `ontologyProjectionFor`, `promoteCrossUserNodes`.
52
+ - Create: `app-node/ontology_projection.test.mjs` — overlay supersedes base; promotion threshold; isolation between users.
53
+ - Modify: `app-node/main.mjs` — rekey `routingStateByChannel` (`:640`), `planStates` (`:397`), `ontologyByChannel` (`:663`); add `planChannelKeyFor`; thread `userId` through `dispatchPlanModeUtterance` (`:422`), `captureOntologyFromTurn` (`:675`), `clearTransientRouting` (`:2064`), and the `finally`-block cleanup in `handleRecording` (`:1756`). Wire `turn_queue` around the adapter call inside `handleRecording`.
54
+ - Modify: `app-node/session_ontology.mjs` — accept composite `channelKey` (`session_ontology.mjs:35`) without sanitizing `:` out of the path; verify `safeChannelKey` allows the separator.
55
+ - Modify: `app-node/plan_mode.test.mjs` — add two-user concurrent plan-mode case.
56
+ - Modify: `docs/USAGE.md` — add "Multi-user voice channels" section; note shared audio, per-user state.
57
+ - Modify: `docs/CONFIGURATION.md` — document `MULTI_USER_TURN_NOTICE=on|off`, `MULTI_USER_TURN_NOTICE_THROTTLE_MS` (default 8000).
58
+
59
+ ---
60
+
61
+ ## Tasks
62
+
63
+ ### Task 1: Failing test for `speakerKey` + `planChannelKeyFor`
64
+
65
+ **Files:** Create `app-node/speaker_key.test.mjs`.
66
+
67
+ - [ ] Step 1: Write failing test.
68
+
69
+ ```javascript
70
+ import { test } from 'node:test';
71
+ import assert from 'node:assert/strict';
72
+ import { speakerKey, planChannelKeyFor, splitSpeakerKey } from './speaker_key.mjs';
73
+
74
+ test('speakerKey composes channel and user', () => {
75
+ assert.equal(speakerKey('chan1', 'user9'), 'chan1:user9');
76
+ });
77
+ test('planChannelKeyFor falls back when channel missing', () => {
78
+ assert.equal(planChannelKeyFor(null, 'u1'), 'default:u1');
79
+ });
80
+ test('splitSpeakerKey round-trips and tolerates channel-only', () => {
81
+ assert.deepEqual(splitSpeakerKey('chan1:user9'), { channelId: 'chan1', userId: 'user9' });
82
+ assert.deepEqual(splitSpeakerKey('chan1'), { channelId: 'chan1', userId: null });
83
+ });
84
+ ```
85
+
86
+ - [ ] Step 2: `node --test app-node/speaker_key.test.mjs` → FAIL (module missing).
87
+
88
+ ### Task 2: Implement `speaker_key.mjs` + green test
89
+
90
+ - [ ] Step 1: Implement; `planChannelKeyFor(channelId, userId)` mirrors `planChannelKey()` (`main.mjs:399`) plus `:userId` suffix.
91
+ - [ ] Step 2: Run test → PASS.
92
+ - [ ] Step 3: Commit: `feat(multi-user): per-(channel,user) key helpers`.
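
One way the helpers could be written so that the Task 1 test passes is sketched below. The `'default'` channel fallback is taken from that test; splitting on the first `:` is an assumption that channel IDs never contain the separator. In the real module these functions would be exported from `speaker_key.mjs`.

```javascript
const SEPARATOR = ':';
const DEFAULT_CHANNEL = 'default'; // fallback value implied by the Task 1 test

function speakerKey(channelId, userId) {
  return `${channelId}${SEPARATOR}${userId}`;
}

// Assumed to mirror planChannelKey() by falling back to a default
// channel when none is bound, then appending the user suffix.
function planChannelKeyFor(channelId, userId) {
  return speakerKey(channelId ?? DEFAULT_CHANNEL, userId);
}

// Inverse of speakerKey; a channel-only key yields userId: null.
function splitSpeakerKey(key) {
  const idx = key.indexOf(SEPARATOR);
  if (idx === -1) return { channelId: key, userId: null };
  return { channelId: key.slice(0, idx), userId: key.slice(idx + 1) };
}
```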
93
+
94
+ ### Task 3: Failing test for `turn_queue`
95
+
96
+ - [ ] Step 1: Test (a) serializes two overlapping `enqueue` calls (second runFn starts only after first resolves); (b) `onWaiting` fires exactly once per distinct waiting userId across consecutive waits within throttle window.
97
+ - [ ] Step 2: Run → FAIL.
98
+
99
+ ### Task 4: Implement `turn_queue.mjs` + green test
100
+
101
+ - [ ] Step 1: FIFO per `channelId`. `enqueue({channelId, userId, runFn})` returns a Promise resolving to runFn's result. Internal state: `Map<channelId, { running, queue, lastNoticeAt, lastNoticeUserId }>`.
102
+ - [ ] Step 2: Run → PASS. Commit: `feat(multi-user): per-channel serial turn queue with waiting notice`.
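
The queue described in Task 4 might be sketched as follows. The factory name `createTurnQueue`, the `throttleMs` and `now` options, and the choice to report the job directly ahead as `aheadOf` are assumptions; the per-channel state fields follow Step 1.

```javascript
function createTurnQueue({ onWaiting = () => {}, throttleMs = 8000, now = Date.now } = {}) {
  // channelId -> { running, queue, lastNoticeAt, lastNoticeUserId }
  const channels = new Map();

  function stateFor(channelId) {
    let s = channels.get(channelId);
    if (!s) {
      s = { running: null, queue: [], lastNoticeAt: -Infinity, lastNoticeUserId: null };
      channels.set(channelId, s);
    }
    return s;
  }

  // Runs queued jobs one at a time until the queue is empty.
  async function drain(s) {
    while (s.queue.length > 0) {
      const job = s.queue.shift();
      s.running = job;
      try {
        job.resolve(await job.runFn());
      } catch (err) {
        job.reject(err);
      }
    }
    s.running = null;
  }

  function enqueue({ channelId, userId, runFn }) {
    const s = stateFor(channelId);
    return new Promise((resolve, reject) => {
      const busy = s.running !== null;
      const ahead = s.queue.length > 0 ? s.queue[s.queue.length - 1] : s.running;
      s.queue.push({ userId, runFn, enqueuedAt: now(), resolve, reject });
      if (!busy) {
        drain(s);
        return;
      }
      // Notify only when a different speaker is ahead, and at most once
      // per waiting user inside the throttle window.
      const t = now();
      const throttled = s.lastNoticeUserId === userId && t - s.lastNoticeAt < throttleMs;
      if (ahead.userId !== userId && !throttled) {
        s.lastNoticeAt = t;
        s.lastNoticeUserId = userId;
        onWaiting(userId, ahead.userId);
      }
    });
  }

  return { enqueue, length: (channelId) => stateFor(channelId).queue.length };
}
```

Because `drain` sets `running` before its first `await`, a second `enqueue` issued in the same tick already sees the channel as busy.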
103
+
104
+ ### Task 5: Failing test for ontology projection
105
+
106
+ - [ ] Step 1: Two users in same channel write distinct nodes; `ontologyProjectionFor(chan, userA).serialize()` contains A's nodes + shared base, not B's. `promoteCrossUserNodes(...)` lifts nodes whose `by` covers ≥2 distinct users.
107
+ - [ ] Step 2: Run → FAIL.
108
+
109
+ ### Task 6: Implement `ontology_projection.mjs` + green test
110
+
111
+ - [ ] Step 1: Wrap two `createSessionOntology` instances (base + overlay). Implement merged read; promotion checks node-level `by` set.
112
+ - [ ] Step 2: Run → PASS. Commit: `feat(multi-user): two-tier ontology projection (shared base + per-user overlay)`.
113
+
114
+ ### Task 7: Rekey state maps in `main.mjs`
115
+
116
+ **Files:** `app-node/main.mjs`.
117
+
118
+ - [ ] Step 1: Import `speakerKey`, `planChannelKeyFor` from `./speaker_key.mjs`.
119
+ - [ ] Step 2: Update `routingStateFor` (`:639`) — first-time `chan:user` lookup clones from `chan` if present (inheritance), then stored independently. Keep channel-only `routingStateFor(chan)` working for status commands.
120
+ - [ ] Step 3: Thread `userId` through `dispatchPlanModeUtterance` (`:422`), `recordUtterance` (`:656`), `captureOntologyFromTurn` (`:675`), `resetRoutingState` (`:686`), `clearTransientRouting` (`:2064`). Call sites in `handleRecording` (`:1756`) and the plan-mode dispatcher pass `userId`.
121
+ - [ ] Step 4: Keep one channel-level call site: the `voiceCloneCapture` flow at `main.mjs:1535` reads channel-level state (no rekey needed). Document with a comment.
122
+ - [ ] Step 5: Run full suite: `node --test app-node`. Expected PASS; any new flake means missing `userId` propagation.
123
+ - [ ] Step 6: Commit: `feat(multi-user): scope routing/plan/ontology to (channel,user)`.
124
+
125
+ ### Task 8: Wire `turn_queue` into `handleRecording`
126
+
127
+ - [ ] Step 1: Inside `handleRecording` (`main.mjs:1756`), wrap the adapter dispatch in `turnQueue.enqueue({ channelId: activeVoiceChannelId, userId, runFn: () => actualWork() })`.
128
+ - [ ] Step 2: On `onWaiting`, gated by `MULTI_USER_TURN_NOTICE !== 'off'`, call `speakText(noticeFor(userId, language), null, null)`. Notice strings live in a small map; localized for `en`/`ko`.
129
+ - [ ] Step 3: Existing `bridgeState.deferredSize` drain (`main.mjs:2083`) keeps working — it operates per-user upstream of the queue.
130
+ - [ ] Step 4: Add `app-node/multi_user_turn.test.mjs`: two simulated `handleRecording` calls overlap; verify they run sequentially and a single notice TTS is emitted.
131
+ - [ ] Step 5: Run, commit: `feat(multi-user): serialize per-channel agent turns with waiting notice`.
132
+
133
+ ### Task 9: Docs
134
+
135
+ - [ ] Step 1: `docs/USAGE.md` — add **Multi-user voice channels** section: per-user routing/plan/ontology, **shared TTS audio**, turn-taking notice.
136
+ - [ ] Step 2: `docs/CONFIGURATION.md` — `MULTI_USER_TURN_NOTICE`, `MULTI_USER_TURN_NOTICE_THROTTLE_MS`.
137
+ - [ ] Step 3: Update `docs/HARNESS_*` "Shared semantics" bullet to read "per (channel, user) routing and plan-mode".
138
+ - [ ] Step 4: Commit: `docs(multi-user): document per-speaker state and shared audio`.
139
+
140
+ ---
141
+
142
+ ## Self-Review
143
+
144
+ - Spec coverage: per-user routing ✓ (`:640`), per-user plan ✓ (`:397`), two-tier ontology ✓ (`:663`), turn queue ✓ (new), shared-audio caveat documented ✓.
145
+ - No placeholders. Composite keys backward-compatible via channel-only fallback in `routingStateFor`.
146
+ - Tests precede code in every task. `node --test app-node` is the only runner; no new deps.
147
+ - Cross-cutting risk: `safeChannelKey` in `session_ontology.mjs:35` must permit `:` — Task 6 explicitly verifies.
@@ -0,0 +1,136 @@
1
+ # Phase 14 — VerbalBench: Voice-Coding Agent Benchmark Implementation Plan
2
+
3
+ > **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development or superpowers:executing-plans to implement this plan task-by-task.
4
+
5
+ **Goal:** Ship "VerbalBench" — the first end-to-end benchmark for voice-mediated coding agents. SWE-bench / Terminal-Bench measure text-only batch tool loops; nothing measures speech-to-first-tool-action latency, barge-in resilience, TTS-while-acting throughput, or cross-agent handoff fidelity. Deliver a 50-task suite, Discord-free harness, JSON leaderboard, CI smoke-bench.
6
+
7
+ **Architecture:** Node driver synthesizes user audio (Edge-TTS + samples), feeds it through a mock `receiver` matching the `@discordjs/voice` surface in `main.mjs:2089–2161` (`receiver.subscribe → prism.opus.Decoder → wav.FileWriter` at `main.mjs:2107–2118`). Harness boots the bridge in-process with `MOCK_DISCORD=1`, captures timing via `createLatencyTurn` (`main.mjs:314`, `latency_metrics.mjs:29`) plus a new `metricsTurn.mark('first_tool_action')`, writes JSONL.
8
+
9
+ **Tech Stack:** Node 20 ESM, `node --test`, Edge-TTS, existing `tts_backends.mjs` + `stt_whisper.mjs`, GitHub Actions.
10
+
11
+ ## Spec
12
+
13
+ ### Task taxonomy (10 × 5 = 50)
14
+
15
+ | Category | Seed |
16
+ |---|---|
17
+ | codegen | "Write a Python flatten-nested-list." |
18
+ | refactor | "Extract auth in `routes/user.js` into middleware." |
19
+ | run-tests | "Run the suite; tell me what fails." |
20
+ | debug-failing-test | "Fix the failing test in `test_parser.py`." |
21
+ | web-research-then-code | "Look up OpenAI Realtime audio schema, patch our client." |
22
+ | switch-backend-mid-task | "Use Codex." → mid-turn: "Switch to Claude." |
23
+ | plan-mode-then-execute | "Plan a refactor of `bridge_state.mjs`, then do it." |
24
+ | barge-in-and-redirect | Mid-TTS: "Stop. Tests first." |
25
+ | push-handoff-and-resume | "Push to my desktop; pick up there." |
26
+ | multi-file-edit | "Rename `whisperBin` to `sttBin` project-wide." |
27
+
28
+ Five fixtures per category at `bench/tasks/<category>/NN-<slug>.json` with `{ id, prompt_audio_seed, expected_tool_classes, success_predicate, max_wall_ms }`.
29
+
30
+ ### Mock receiver (`bench/mock_discord.mjs`)
31
+
32
+ Implements only what `main.mjs:2089–2118, 2148–2165` touches:
33
+
34
+ - `joinVoiceChannel` → returns `{ receiver, on('stateChange'), subscribe(player) }`; emits `VoiceConnectionStatus.Ready` synchronously.
35
+ - `receiver.speaking.on('start', cb)` driven by harness `inject(userId, wavPath)`.
36
+ - `receiver.subscribe(userId, opts)` → `Readable` yielding Opus packets re-encoded from `bench/audio/*.wav`; honors `EndBehaviorType.AfterSilence`.
37
+ - `createAudioPlayer/createAudioResource` → captures PCM into `turn.tts_pcm[]` with first-byte timestamp.
38
+
39
+ ### Metrics (`bench/metrics.mjs`, extends `latency_metrics.mjs`)
40
+
41
+ - `speech_to_first_tool_action_ms` — `voice_first_packet` (`main.mjs:1665,1694`) → first adapter tool-event (`main.mjs:~1985`, new `metricsTurn.mark('first_tool_action')`).
42
+ - `first_utterance_of_speech_ms` — `voice_first_packet` → first TTS PCM byte from the mock player.
43
+ - `task_completion` — `success_predicate(workdir)` boolean.
44
+ - `barge_in_success_rate` — `barge-in-and-redirect` only: per-task `barge_in_ms` is measured from second-utterance onset → `metricsTurn.finish({ status: /barge_in_/ })` (`main.mjs:1726,1737`). A task passes at ≤ 600 ms; the rate is the fraction of tasks that pass.
45
+ - `handoff_fidelity` — `switch-backend-mid-task` only: assert routed `promptForAgent` contains `[Session ontology]` block (`main.mjs:1955–1958`) and that turn-N ontology nodes appear in turn-N+1 prompt.
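
As an illustration of how `barge_in_success_rate` could be aggregated from result rows, the sketch below assumes each row carries a numeric `barge_in_ms` (as in the sample row in this plan) and treats the 600 ms limit as a per-task pass threshold. The function name and the `null` result for an empty set are assumptions.

```javascript
// Fraction of barge-in tasks whose barge_in_ms is within the limit.
// Returns null when no barge-in rows are present.
function bargeInSuccessRate(rows, limitMs = 600) {
  const scored = rows.filter(
    (row) =>
      row.task_id.startsWith('barge-in-and-redirect/') &&
      typeof row.barge_in_ms === 'number',
  );
  if (scored.length === 0) return null;
  const passed = scored.filter((row) => row.barge_in_ms <= limitMs).length;
  return passed / scored.length;
}
```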
46
+
47
+ ### Baselines
48
+
49
+ Run across all eight backends in `buildAgentSettings` defaults (`agent_adapters.mjs:217–276`): hermes, claude, codex, gemini, opencode, openclaw, aider, cursor. Publish backend × category matrix.
50
+
51
+ ### JSON result row (sample)
52
+
53
+ ```json
54
+ {
55
+ "bench_version": "0.1.0",
56
+ "run_id": "2026-05-21T14:22:01Z-abc123",
57
+ "backend": "claude",
58
+ "task_id": "barge-in-and-redirect/03-stop-run-tests",
59
+ "speech_to_first_tool_action_ms": 1820,
60
+ "first_utterance_of_speech_ms": 2410,
61
+ "barge_in_ms": 420,
62
+ "task_completion": true,
63
+ "handoff_fidelity": null,
64
+ "wall_ms": 18430,
65
+ "tts_chars": 312,
66
+ "tool_calls": ["read_file", "terminal"],
67
+ "harness_sha": "b76257f",
68
+ "agent_command": "claude -p"
69
+ }
70
+ ```
71
+
72
+ Superset of HF Open Leaderboard rows (`task_id`, `model`→`backend`, numeric metrics) so the same parser ingests both.
73
+
74
+ ### Publishing
75
+
76
+ - `bench/results/` JSONL pushed by CI to `gh-pages`; static leaderboard in `bench/site/` (Vite + sortable table).
77
+ - `npm run bench:smoke` — 10 tasks (1/category, ≤ 3 min) on every PR via `.github/workflows/verbalbench-smoke.yml`.
78
+ - `npm run bench:full` — 50 × 8 nightly; artifact uploaded, PR-commented if `speech_to_first_tool_action_ms` p95 regresses > 15 % vs. baseline.
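
The nightly regression gate could be computed along these lines. This is a sketch under stated assumptions: nearest-rank percentiles, rows shaped like the sample result row, and the helper names `percentile` and `p95Regression` are all illustrative choices, not part of the plan.

```javascript
// Nearest-rank percentile over a list of numbers.
function percentile(values, p) {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}

// Flags a regression when the current p95 exceeds the baseline p95
// by more than `threshold` (15 % in the plan).
function p95Regression(baselineRows, currentRows, threshold = 0.15) {
  const metric = 'speech_to_first_tool_action_ms';
  const baseline = percentile(baselineRows.map((row) => row[metric]), 95);
  const current = percentile(currentRows.map((row) => row[metric]), 95);
  return { baseline, current, regressed: current > baseline * (1 + threshold) };
}
```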
79
+
80
+ ## File Structure
81
+
82
+ - Create: `bench/mock_discord.mjs` — `@discordjs/voice` doubles for `main.mjs:2089–2165`.
83
+ - Create: `bench/driver.mjs`, `bench/metrics.mjs`, `bench/predicates/*.mjs`.
84
+ - Create: `bench/tasks/<10 dirs>/*.json` (50 fixtures), `bench/audio/` (Edge-TTS WAVs, ko+en).
85
+ - Create: `bench/results/baseline-2026-05-21.jsonl`, `bench/site/`.
86
+ - Create: `.github/workflows/verbalbench-{smoke,nightly}.yml`.
87
+ - Modify: `app-node/main.mjs:1985` — `metricsTurn.mark('first_tool_action')` gated by `VERBALBENCH_INSTRUMENT=1`.
88
+ - Modify: `app-node/main.mjs:165` — `MOCK_DISCORD=1` skips real Discord login.
89
+ - Modify: `app-node/agent_adapters.mjs:594` — `ask` accepts optional `plan.onFirstToolEvent`; fired on first parsed tool-call. No-op when absent.
90
+ - Modify: `app-node/latency_metrics.mjs:29` — record `first_tool_action`, `barge_in_ms`.
91
+ - Modify: `package.json` — `bench:smoke`, `bench:full`, `bench:render`.
92
+ - Modify: `README.md` — Benchmarks section.
93
+ - Create: `docs/VERBALBENCH.md` — contributor guide + `npm run bench:adapter -- --adapter ./my-adapter.mjs`.
94
+
95
+ ## Tasks
96
+
97
+ ### Task 1: Failing test for mock receiver
98
+
99
+ **Files:** Create `bench/mock_discord.test.mjs`.
100
+
101
+ - [ ] Step 1: Assert `mockReceiver.subscribe(userId, { end: { behavior: EndBehaviorType.AfterSilence, duration: 2200 } })` emits bytes of `bench/audio/fixtures/hello.wav` then `end`s. Run `node --test bench/mock_discord.test.mjs` — expect FAIL.
102
+
103
+ ### Task 2: Implement `bench/mock_discord.mjs`
104
+
105
+ - [ ] Step 1: Implement Spec surface. Wrap WAV as Opus `Readable` at 48 kHz/2-ch (matches `prism.opus.Decoder` at `main.mjs:2110`).
106
+ - [ ] Step 2: PASS → commit `feat(bench): mock @discordjs/voice receiver`.
107
+
108
+ ### Task 3: Driver + metrics + instrumentation
109
+
110
+ - [ ] Step 1: `bench/driver.mjs` boots bridge with `MOCK_DISCORD=1`, calls `mockReceiver.inject(...)`, awaits `metricsTurn.finish`, writes JSONL.
111
+ - [ ] Step 2: `bench/metrics.mjs` extensions.
112
+ - [ ] Step 3: Wire `main.mjs:1985` and `agent_adapters.mjs:594` per File Structure. Tests for both. Commit.
113
+
114
+ ### Task 4: First 10 fixtures + smoke CI
115
+
116
+ - [ ] Step 1: One task per category with Edge-TTS audio + predicate.
117
+ - [ ] Step 2: `npm run bench:smoke` green locally.
118
+ - [ ] Step 3: `.github/workflows/verbalbench-smoke.yml` (hermes on PR; full matrix on `workflow_dispatch`). Commit.
119
+
120
+ ### Task 5: Remaining 40 fixtures + baseline
121
+
122
+ - [ ] Step 1: Author 4 more per category.
123
+ - [ ] Step 2: Run full matrix on installed backends (`detectInstalledAgents`); commit `bench/results/baseline-2026-05-21.jsonl` + nightly workflow.
124
+
125
+ ### Task 6: Leaderboard site + docs
126
+
127
+ - [ ] Step 1: `bench/site/` (sortable, per-metric filters, JSON download).
128
+ - [ ] Step 2: `docs/VERBALBENCH.md` + `README.md` update. Commit.
129
+
130
+ ## Self-Review
131
+
132
+ - Coverage: taxonomy, mock, metrics, baselines, publishing — present.
133
+ - No placeholders; units ms; sample row included.
134
+ - Mock enumerates every `@discordjs/voice` symbol at `main.mjs:2089–2165`.
135
+ - Prod untouched unless `VERBALBENCH_INSTRUMENT=1` or `MOCK_DISCORD=1`.
136
+ - Row is HF Open Leaderboard superset.
@@ -0,0 +1,72 @@
1
+ # Phase 15 — Phone Companion PWA Implementation Plan
2
+
3
+ > **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development or superpowers:executing-plans.
4
+
5
+ **Goal:** When a Phase 10 push notification fires while no human is in the bound voice channel, tapping it opens a PWA that (a) replays the agent's `spokenAnswer` audio, (b) lets the user re-engage by voice — speech captured in the PWA is delivered to the bot as if uttered in the bound VC, and (c) shows an optional markdown summary plus progress events.
6
+
7
+ **Architecture:** A new `app-node/companion_server.mjs` boots an Express HTTP server in the same Node process as `main.mjs`. It serves a static PWA from `app-node/companion_pwa/` (manifest, service worker, single-page UI). Each push notification carries a one-time-use, HMAC-signed token scoped to `(userId, messageId, ttl=600s)` and a `Click` URL pointing at `https://<COMPANION_BASE>/c/#<token>`. On load the PWA POSTs `/companion/session` to redeem the token, fetches the pre-rendered `spokenAnswer.opus` (cached in-memory by messageId), and renders the markdown body. Re-engage records mic audio via MediaRecorder, uploads to `/companion/utterance`, and main.mjs dispatches it through the existing voice transcript pipeline bound to `activeVoiceChannelId`.
8
+
9
+ **Tech Stack:** Node 20 ESM, `express` (new dep), `crypto.timingSafeEqual` for token verify, existing `ttsBackend` for pre-render, browser `MediaRecorder` (opus) + `Web Audio` for playback, no build step (single hand-written PWA bundle).
10
+
11
+ ## Spec
12
+
13
+ ### Token grammar
14
+ - Payload: `base64url({uid, mid, vcid, gid, exp})` + `.` + `base64url(HMAC-SHA256(payload, COMPANION_SECRET))`.
15
+ - TTL default 600s. Single-use: redeemed tokens recorded in an in-memory `Set<string>` cleared on bot restart.
16
+ - `COMPANION_SECRET` required when `COMPANION_BASE` is set; bot refuses to boot otherwise.
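
A minimal sketch of the token grammar above, using only `node:crypto`. It assumes the HMAC is computed over the base64url-encoded payload string (the plan does not say whether the raw or encoded form is signed), that `exp` is a Unix timestamp in seconds, and that the `reason` strings and the injectable `now` are illustrative.

```javascript
import { createHmac, timingSafeEqual } from 'node:crypto';

const b64url = (data) => Buffer.from(data).toString('base64url');

function sign(payloadPart, secret) {
  return createHmac('sha256', secret).update(payloadPart).digest();
}

function mint({ uid, mid, vcid, gid, ttlSec = 600, secret, now = Date.now() }) {
  const exp = Math.floor(now / 1000) + ttlSec;
  const payloadPart = b64url(JSON.stringify({ uid, mid, vcid, gid, exp }));
  return `${payloadPart}.${b64url(sign(payloadPart, secret))}`;
}

// `redeemed` is the in-memory Set of already-used tokens.
function verify(token, { secret, redeemed, now = Date.now() }) {
  const [payloadPart, sigPart, ...rest] = String(token).split('.');
  if (!payloadPart || !sigPart || rest.length > 0) return { ok: false, reason: 'malformed' };

  const expected = sign(payloadPart, secret);
  const given = Buffer.from(sigPart, 'base64url');
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  if (given.length !== expected.length || !timingSafeEqual(given, expected)) {
    return { ok: false, reason: 'bad_signature' };
  }

  let claims;
  try {
    claims = JSON.parse(Buffer.from(payloadPart, 'base64url').toString('utf8'));
  } catch {
    return { ok: false, reason: 'malformed' };
  }
  if (claims.exp * 1000 <= now) return { ok: false, reason: 'expired' };
  if (redeemed.has(token)) return { ok: false, reason: 'replayed' };

  redeemed.add(token);
  return { ok: true, claims };
}
```

The signature is checked before the payload is parsed, so unauthenticated input never reaches `JSON.parse`.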
17
+
18
+ ### Pre-rendered audio
19
+ - `maybeNotifyTaskComplete` (`app-node/main.mjs:369-394`) gains a sibling helper `prerenderSpokenAnswer(spokenText, lang)` that calls the existing `ttsBackend.synthesize` (same path used at `main.mjs:1495-1496`) and stashes the resulting opus bytes in a bounded LRU keyed by `messageId` (cap 32, 10 MB).
20
+ - `notify.mjs` `send()` is extended with optional `audioUrl` and `summaryUrl` fields; the ntfy provider sets `X-Attach` and `X-Actions` headers, pushover sets `url`/`url_title`. Existing redaction in `notify.mjs:5-10` still applies to title/body.
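
The bounded cache could be built on a plain `Map`, whose insertion order doubles as recency order. This sketch reads "cap 32, 10 MB" as at most 32 entries and at most 10 MB in total, which is an interpretation; the factory name `createAudioCache` is hypothetical.

```javascript
function createAudioCache({ maxEntries = 32, maxBytes = 10 * 1024 * 1024 } = {}) {
  const entries = new Map(); // messageId -> bytes, oldest first
  let totalBytes = 0;

  function evictOldest() {
    const oldestKey = entries.keys().next().value;
    totalBytes -= entries.get(oldestKey).length;
    entries.delete(oldestKey);
  }

  return {
    get(messageId) {
      const bytes = entries.get(messageId);
      if (bytes === undefined) return undefined;
      // Re-insert to mark as most recently used.
      entries.delete(messageId);
      entries.set(messageId, bytes);
      return bytes;
    },
    set(messageId, bytes) {
      if (bytes.length > maxBytes) return false; // can never fit
      if (entries.has(messageId)) {
        totalBytes -= entries.get(messageId).length;
        entries.delete(messageId);
      }
      entries.set(messageId, bytes);
      totalBytes += bytes.length;
      while (entries.size > maxEntries || totalBytes > maxBytes) evictOldest();
      return true;
    },
    get size() { return entries.size; },
    get bytes() { return totalBytes; },
  };
}
```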
21
+
22
+ ### Re-engage path
23
+ - PWA POSTs WebM/opus blob + token to `/companion/utterance`.
24
+ - Server hands the blob to a new `dispatchCompanionUtterance({userId, vcid, blob})` that decodes the audio (existing `prism.opus.Decoder`), transcribes it (Task 4 reuses the existing `transcribeWav`), and feeds the text into the same downstream `processVoiceTranscript(text)` path used by the VC listener (call site near `main.mjs:1488-1497`). Result: the bot answers as if the user spoke in `activeVoiceChannelId`.
25
+ - If `vcid` differs from `activeVoiceChannelId`, the bot rebinds (mirrors `pickOccupiedUserVoiceChannel`) before replying.
26
+
27
+ ### Summary view
28
+ - `/companion/summary/:mid` returns `{markdown, progressEvents}`. Markdown is the existing `fullAnswerText` (`main.mjs:1491`). Progress events come from the existing `summarizeProgressEvents` output already collected per-turn.
29
+
30
+ ### Security
31
+ - All `/companion/*` routes require a valid token in `Authorization: Bearer`.
32
+ - Tokens single-use, 10-min TTL, scoped to `(uid, mid, vcid)`. Replay → 401.
33
+ - Mic upload capped at 1 MB and 30 s.
34
+ - Rate-limit: 10 req/min per token (in-memory bucket).
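
The per-token rate limit could be a simple fixed-window counter, sketched below. The fixed-window strategy, the factory name, and the injectable clock are assumptions; the plan only specifies "10 req/min per token (in-memory bucket)".

```javascript
function createRateLimiter({ limit = 10, windowMs = 60_000, now = Date.now } = {}) {
  const buckets = new Map(); // tokenId -> { windowStart, count }

  // Returns true when the request is allowed, false when over the limit.
  return function allow(tokenId) {
    const t = now();
    const bucket = buckets.get(tokenId);
    if (!bucket || t - bucket.windowStart >= windowMs) {
      buckets.set(tokenId, { windowStart: t, count: 1 });
      return true;
    }
    if (bucket.count >= limit) return false;
    bucket.count += 1;
    return true;
  };
}
```

A request handler would call `allow(token)` before doing any work and answer with HTTP 429 when it returns `false`.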
35
+
36
+ ## File Structure
37
+ - Create: `app-node/companion_server.mjs`, `app-node/companion_tokens.mjs`, `app-node/companion_tokens.test.mjs`, `app-node/companion_pwa/{index.html,app.js,sw.js,manifest.webmanifest,styles.css}`.
38
+ - Modify: `app-node/notify.mjs` — accept `audioUrl`, `summaryUrl`; build companion deep link helper `buildCompanionLink({base, token})`.
39
+ - Modify: `app-node/main.mjs` — pre-render audio inside `maybeNotifyTaskComplete` (line 369), mint token, swap `deepLink` for the companion URL, boot companion server at startup, add `dispatchCompanionUtterance` near the text-agent path (line 1488).
40
+ - Modify: `.env.example` — `COMPANION_BASE`, `COMPANION_SECRET`, `COMPANION_PORT`, `COMPANION_TTL_SEC`.
41
+ - Modify: `package.json` — add `express` dep.
42
+
43
+ ## Tasks
44
+
45
+ ### Task 1: TDD — token mint/verify
46
+ - [ ] Step 1: Write `companion_tokens.test.mjs` covering: roundtrip ok, tampered payload fails, expired fails, single-use replay fails, wrong secret fails.
47
+ - [ ] Step 2: Implement `companion_tokens.mjs` exporting `mint({uid,mid,vcid,gid,ttlSec,secret})` and `verify(token, {secret, redeemed})`.
48
+ - [ ] Step 3: Run `node --test app-node/companion_tokens.test.mjs`, expect PASS. Commit.
49
+
50
+ ### Task 2: Pre-render audio + extend notifier
51
+ - [ ] Step 1: Add bounded LRU `companionAudioCache` in `main.mjs` and `prerenderSpokenAnswer()` reusing `ttsBackend`.
52
+ - [ ] Step 2: Extend `notify.mjs send()` signature with `audioUrl`, `summaryUrl`; map per-provider headers.
53
+ - [ ] Step 3: Add `notify.test.mjs` cases for ntfy `X-Actions` and pushover `url` carrying the companion link.
54
+ - [ ] Step 4: Commit.
55
+
56
+ ### Task 3: companion_server.mjs + PWA
57
+ - [ ] Step 1: Boot Express on `COMPANION_PORT` from `main.mjs` startup; mount `/c/*` static and `/companion/*` API.
58
+ - [ ] Step 2: Routes — `POST /companion/session`, `GET /companion/audio/:mid`, `GET /companion/summary/:mid`, `POST /companion/utterance`.
59
+ - [ ] Step 3: PWA — manifest (standalone), SW (cache-first for static, network-only for API), `app.js` handles token redeem, autoplay (user-gesture fallback), MediaRecorder upload, summary render via lightweight markdown renderer.
60
+ - [ ] Step 4: Add `companion_server.test.mjs` exercising token flow and utterance dispatch with a stub agent adapter.
61
+ - [ ] Step 5: Commit.
62
+
63
+ ### Task 4: Wire re-engage into agent pipeline
64
+ - [ ] Step 1: Factor the text-agent dispatch at `main.mjs:1488-1497` into `runAgentForUtterance({text, vcid, gid, speakResponse})` reused by VC and companion paths.
65
+ - [ ] Step 2: Companion utterance → Whisper STT (existing `transcribeWav`) → `runAgentForUtterance({speakResponse: vcOccupied})`.
66
+ - [ ] Step 3: If VC empty, response is delivered as a second push (re-uses Phase 10), creating a conversational loop entirely over phone.
67
+ - [ ] Step 4: Commit.
68
+
69
+ ### Task 5: Document
70
+ - [ ] Step 1: `.env.example` entries + `docs/USAGE.md` companion section with QR-code suggestion for first-pair UX.
71
+ - [ ] Step 2: Note out-of-scope: native push reliability, multi-device replay.
72
+ - [ ] Step 3: Commit.