verbalcoding 0.2.6 → 0.2.8

Files changed (44)
  1. package/README.md +12 -22
  2. package/app-node/cli_install.test.mjs +15 -0
  3. package/docs/FRESH_INSTALL.md +8 -2
  4. package/docs/assets/figures/verbalcoding-flow.svg +45 -30
  5. package/docs/i18n/CONFIGURATION.es.md +239 -0
  6. package/docs/i18n/CONFIGURATION.fr.md +239 -0
  7. package/docs/i18n/CONFIGURATION.ja.md +239 -0
  8. package/docs/i18n/CONFIGURATION.ko.md +66 -74
  9. package/docs/i18n/CONFIGURATION.ru.md +239 -0
  10. package/docs/i18n/CONFIGURATION.zh.md +239 -0
  11. package/docs/i18n/FRESH_INSTALL.es.md +207 -0
  12. package/docs/i18n/FRESH_INSTALL.fr.md +207 -0
  13. package/docs/i18n/FRESH_INSTALL.ja.md +207 -0
  14. package/docs/i18n/FRESH_INSTALL.ko.md +60 -54
  15. package/docs/i18n/FRESH_INSTALL.ru.md +207 -0
  16. package/docs/i18n/FRESH_INSTALL.zh.md +207 -0
  17. package/docs/i18n/MULTI_INSTANCE.es.md +180 -0
  18. package/docs/i18n/MULTI_INSTANCE.fr.md +180 -0
  19. package/docs/i18n/MULTI_INSTANCE.ja.md +179 -0
  20. package/docs/i18n/MULTI_INSTANCE.ko.md +46 -46
  21. package/docs/i18n/MULTI_INSTANCE.ru.md +179 -0
  22. package/docs/i18n/MULTI_INSTANCE.zh.md +179 -0
  23. package/docs/i18n/README.es.md +83 -55
  24. package/docs/i18n/README.fr.md +85 -57
  25. package/docs/i18n/README.ja.md +83 -55
  26. package/docs/i18n/README.ko.md +47 -56
  27. package/docs/i18n/README.ru.md +86 -58
  28. package/docs/i18n/README.zh.md +83 -56
  29. package/docs/i18n/RELEASE.es.md +74 -0
  30. package/docs/i18n/RELEASE.fr.md +74 -0
  31. package/docs/i18n/RELEASE.ja.md +74 -0
  32. package/docs/i18n/RELEASE.ko.md +38 -36
  33. package/docs/i18n/RELEASE.ru.md +74 -0
  34. package/docs/i18n/RELEASE.zh.md +74 -0
  35. package/docs/i18n/USAGE.es.md +161 -0
  36. package/docs/i18n/USAGE.fr.md +161 -0
  37. package/docs/i18n/USAGE.ja.md +161 -0
  38. package/docs/i18n/USAGE.ko.md +61 -72
  39. package/docs/i18n/USAGE.ru.md +161 -0
  40. package/docs/i18n/USAGE.zh.md +161 -0
  41. package/package.json +1 -1
  42. package/scripts/bootstrap_prereqs.sh +15 -3
  43. package/scripts/cli.mjs +1 -1
  44. package/scripts/doctor.mjs +114 -8
package/README.md CHANGED
@@ -34,7 +34,7 @@ VerbalCoding turns a Discord voice channel into a hands-free control surface for
  | What you get | Why it feels good |
  |---|---|
  | Voice-first agent control | Talk to Hermes Agent, Claude Code, Codex, Gemini CLI, OpenCode, OpenClaw, or any custom CLI harness. |
- | Local-first speech loop | Discord voice capture → `whisper.cpp` STT → agent → chunked TTS playback. |
+ | On-device speech loop | Discord voice capture → local `whisper-cli` transcription → agent → chunked TTS playback. |
  | Shared voice + text context | Voice turns and `!ask` text commands can reuse the same supported agent session. |
  | Barge-in and sensitivity modes | Interrupt playback naturally and switch between normal and conservative/noisy environments. |
  | Multilingual voice presets | Switch STT, progress language, and TTS voice together with `vc language ko/en/auto`. |
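Reviewer note: the "chunked TTS playback" step in the table above can be pictured with a small sketch. This is not the package's implementation; `chunkForTts` is a hypothetical helper, and the 495-character default only mirrors the `TTS_MAX_CHARS="495"` sample value that appears in the configuration docs later in this diff. It assumes chunks are cut at sentence boundaries where possible.

```javascript
// Hypothetical helper (not from the package): split a reply into chunks that
// each fit within a TTS character limit, preferring sentence boundaries.
function chunkForTts(text, maxChars = 495) {
  // A "sentence" is text up to and including closing punctuation, or a
  // trailing fragment with no punctuation at all.
  const sentences = text.match(/[^.!?。]*[.!?。]+\s*|[^.!?。]+$/g) ?? [];
  const chunks = [];
  let current = '';
  for (const sentence of sentences) {
    // Hard-split any single sentence that is longer than the limit.
    for (let i = 0; i < sentence.length; i += maxChars) {
      const piece = sentence.slice(i, i + maxChars);
      if (current && (current + piece).length > maxChars) {
        chunks.push(current.trim());
        current = '';
      }
      current += piece;
    }
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}
```

Each chunk can then be synthesized and queued for playback independently, which is what makes it possible to start speaking before the whole answer has been converted.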
@@ -69,23 +69,10 @@ vc doctor
  ./run.sh
  ```
  
- `vc setup --yes` and `./scripts/install.sh --yes` bootstrap local prerequisites where possible: Node/npm dependencies, `ffmpeg`, `whisper-cli`, the default whisper.cpp model, a local `.venv-tts` Edge TTS helper, and the short `vc` shell command for clone installs. They support macOS/Homebrew plus common Linux package managers (`apt`, `dnf`, `pacman`); rerun with `--no-wizard` for dependency-only setup or `--skip-system` if you want to install OS packages yourself.
+ `vc setup --yes` bootstraps local prerequisites from the npm package. `./scripts/install.sh --yes` does the same for GitHub clone installs. Both cover Node/npm dependencies, `ffmpeg`, `whisper-cli`, the default whisper.cpp model, a local `.venv-tts` Edge TTS helper, and setup wizard configuration where possible. They support macOS/Homebrew plus common Linux package managers (`apt`, `dnf`, `pacman`); rerun with `--no-wizard` for dependency-only setup or `--skip-system` if you want to install OS packages yourself.
  
  Need a clean install walkthrough? Start with [Fresh Install](docs/FRESH_INSTALL.md).
  
- ## How It Works
-
- ```mermaid
- flowchart LR
- A[Discord voice] --> B["@discordjs/voice"]
- B --> C[PCM cleanup + gates]
- C --> D["whisper.cpp STT"]
- D --> E["CLI agent adapter"]
- E --> F["Concise answer"]
- F --> G["Chunked TTS"]
- G --> H["Discord playback"]
- ```
-
  ## Supported Agent Backends
  
  | Backend | Default command | Session support |
@@ -107,7 +94,6 @@ flowchart LR
  | [Configuration](docs/CONFIGURATION.md) | `.env`, agent backends, MCP, TTS backends, operational notes |
  | [Multi-Instance](docs/MULTI_INSTANCE.md) | One permanent Discord voice room per project |
  | [Release Notes](docs/RELEASE.md) | Current capabilities and pre-release checklist |
- | [한국어 문서](docs/i18n/README.ko.md) | npm 설치, 사용법, 설정, 멀티 인스턴스 한국어 가이드 |
  
  ## Tiny Command Map
  
@@ -123,11 +109,15 @@ vc start # start the default bridge
  
  In Discord:
  
- ```text
- !join !ask <prompt> !verbose on/off
- !latency !sensitivity normal !sensitivity conservative
- !session new <name> <workdir> [context] --voice <voice-channel>
- ```
+ | Command | What it does |
+ |---|---|
+ | `!join` | Join your current voice channel. |
+ | `!ask <prompt>` | Send text to the same agent backend. |
+ | `!verbose on\|off` | Show/speak short progress updates. |
+ | `!latency` | Summarize recent voice/STT/agent/TTS latency. |
+ | `!sensitivity normal` | Use normal indoor barge-in sensitivity. |
+ | `!sensitivity conservative` | Use stricter noisy/outdoor sensitivity. |
+ | `!session new <name> <workdir> [context] --voice <voice-channel>` | Bind a project session to a voice room. |
  
  ## Requirements
  
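Reviewer note: the `!session new <name> <workdir> [context] --voice <voice-channel>` row has the most involved syntax in the new command table. The sketch below shows one way such a message could be split into fields. It is illustrative only: `parseSessionNew` and the returned field names are hypothetical, and it assumes plain whitespace tokenization with no quoting, and that everything after `--voice` is the channel name.

```javascript
// Hypothetical parser (not from the package) for the documented command shape:
//   !session new <name> <workdir> [context] --voice <voice-channel>
function parseSessionNew(message) {
  const tokens = message.trim().split(/\s+/);
  if (tokens[0] !== '!session' || tokens[1] !== 'new') return null;

  // Assumption: --voice is required and must be followed by a channel name.
  const voiceIndex = tokens.indexOf('--voice');
  if (voiceIndex === -1 || voiceIndex === tokens.length - 1) return null;

  // <name> and <workdir> are mandatory; any remaining words form [context].
  const positional = tokens.slice(2, voiceIndex);
  if (positional.length < 2) return null;
  const [name, workdir, ...contextWords] = positional;

  return {
    name,
    workdir,
    context: contextWords.join(' ') || null,
    voiceChannel: tokens.slice(voiceIndex + 1).join(' '),
  };
}
```

Returning `null` for anything that does not match keeps the caller simple: it can reply with the usage string from the table instead of guessing at a partial command.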
@@ -135,7 +125,7 @@ In Discord:
  |---|---|
  | Runtime | Node.js 20+, npm; install script can install via Homebrew/apt/dnf/pacman |
  | Audio | `ffmpeg`; install script can install it |
- | STT | `whisper.cpp` / `whisper-cli`; install script uses Homebrew on macOS or local Linux build fallback |
+ | Speech recognition | Local `whisper-cli` from whisper.cpp; install script uses Homebrew on macOS or local Linux build fallback |
  | TTS | Edge TTS CLI; install script creates `.venv-tts` if needed |
  | Discord | Bot token, Message Content intent, voice permissions |
  | Agent | At least one authenticated CLI harness, Hermes Agent by default |
package/app-node/cli_install.test.mjs CHANGED
@@ -63,6 +63,8 @@ test('bootstrap script installs cross-platform prerequisites and local model hel
  
    assert.match(script, /brew install/);
    assert.match(script, /apt-get install/);
+   assert.match(script, /has_cmd node \|\| packages\+\=\(nodejs\)/);
+   assert.match(script, /has_cmd npm \|\| packages\+\=\(npm\)/);
    assert.match(script, /dnf install/);
    assert.match(script, /pacman -Sy/);
    assert.match(script, /git clone --depth 1 https:\/\/github\.com\/ggml-org\/whisper\.cpp\.git/);
@@ -70,6 +72,19 @@ test('bootstrap script installs cross-platform prerequisites and local model hel
    assert.match(script, /\.venv-tts/);
  });
  
+ test('doctor auto-bootstraps fixable prerequisites by default', () => {
+   const doctor = fs.readFileSync(path.join(ROOT, 'scripts', 'doctor.mjs'), 'utf8');
+   const cli = fs.readFileSync(path.join(ROOT, 'scripts', 'cli.mjs'), 'utf8');
+
+   assert.match(doctor, /fixablePrerequisites/);
+   assert.match(doctor, /bootstrap_prereqs\.sh'\), '--yes'/);
+   assert.match(doctor, /VERBALCODING_DOCTOR_AUTO_FIX/);
+   assert.match(doctor, /--no-fix/);
+   assert.match(doctor, /WHISPER_CPP_BIN/);
+   assert.match(doctor, /EDGE_TTS_COMMAND/);
+   assert.match(cli, /doctor\.mjs'\), \.\.\.argv\.slice\(1\)/);
+ });
+
  test('Ubuntu Docker smoke script validates clean install without secrets', () => {
    const script = fs.readFileSync(path.join(ROOT, 'scripts', 'docker_ubuntu_smoke.sh'), 'utf8');
  
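Reviewer note: the new test only checks that certain strings appear in `doctor.mjs` (`fixablePrerequisites`, `bootstrap_prereqs.sh` with `--yes`, `VERBALCODING_DOCTOR_AUTO_FIX`, `--no-fix`); it does not show how they interact. The sketch below is one plausible reading of that behavior, not the contents of `doctor.mjs`. The check names, the shape of the `checks` array, and the rule that `0` or `false` disables auto-fix are all assumptions made for illustration.

```javascript
// Hypothetical sketch of the decision the test names imply; not the real doctor.mjs.

// Assumption: auto-fix is on by default, and either --no-fix or
// VERBALCODING_DOCTOR_AUTO_FIX=0/false turns it off.
function shouldAutoFix(argv, env) {
  if (argv.includes('--no-fix')) return false;
  const flag = String(env.VERBALCODING_DOCTOR_AUTO_FIX ?? '1').toLowerCase();
  return flag !== '0' && flag !== 'false';
}

// Assumption: checks look like { name, ok }, and only local tooling counts as
// fixable. Missing secrets such as a bot token cannot be installed by a script.
function fixablePrerequisites(checks) {
  const fixable = new Set(['ffmpeg', 'whisper-cli', 'whisper-model', 'edge-tts']);
  return checks.filter((c) => !c.ok && fixable.has(c.name)).map((c) => c.name);
}

// Decide whether to rerun the bootstrap script or just report results.
function planDoctorRun(argv, env, checks) {
  const missing = fixablePrerequisites(checks);
  if (missing.length > 0 && shouldAutoFix(argv, env)) {
    return { action: 'bootstrap', command: ['bash', 'scripts/bootstrap_prereqs.sh', '--yes'], missing };
  }
  return { action: 'report', missing };
}
```

Keeping the decision in a pure function like `planDoctorRun` would also allow a behavioral test, which is stronger than matching source text with regular expressions.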
package/docs/FRESH_INSTALL.md CHANGED
@@ -32,7 +32,13 @@ cd VerbalCoding
  
  ## 2. Bootstrap dependencies and run the setup wizard
  
- The npm commands above run the same bootstrapper as the clone install. For a clone, run:
+ For an npm install, do not run `./scripts/install.sh` directly; there is no repository checkout in your current directory. Use the packaged CLI wrapper instead:
+
+ ```bash
+ vc setup --yes
+ ```
+
+ `vc setup` runs the `scripts/install.sh` bundled inside the installed npm package. Only use `./scripts/install.sh --yes` when you are inside a GitHub clone:
  
  ```bash
  ./scripts/install.sh --yes
@@ -104,7 +110,7 @@ The invite includes bot and slash-command scopes plus text/voice permissions use
  vc doctor
  ```
  
- `vc doctor` is redacted: it reports missing tokens/commands/models without printing secret values. Fix every `✗` item, then rerun it.
+ `vc doctor` is redacted: it reports missing tokens/commands/models without printing secret values. When fixable local prerequisites are missing (`ffmpeg`, `whisper-cli`, the default model, or Edge TTS helper), it automatically reruns the packaged bootstrap first. Fix any remaining `✗` items, then rerun it.
  
  Expected success includes:
  
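Reviewer note: "redacted" in the paragraph above means the report says whether a secret is present without echoing it. A minimal sketch of that idea follows. `redactedReport` is a hypothetical function, the `✓` marker and the message wording are assumptions (the docs only show `✗`), and the real `vc doctor` output format may differ.

```javascript
// Hypothetical sketch (not the package's code): report presence of required
// settings without ever including their values in the output.
function redactedReport(env, requiredKeys) {
  return requiredKeys.map((key) => {
    const value = env[key];
    const present = typeof value === 'string' && value.trim() !== '';
    // Only the key name and a fixed status string are emitted.
    return `${present ? '✓' : '✗'} ${key}: ${present ? 'set (value hidden)' : 'missing'}`;
  });
}
```

For example, `redactedReport(process.env, ['DISCORD_BOT_TOKEN'])` yields a line that is safe to paste into a bug report, because the token itself is never interpolated into the string.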
package/docs/assets/figures/verbalcoding-flow.svg CHANGED
@@ -1,6 +1,6 @@
  <svg width="1200" height="520" viewBox="0 0 1200 520" fill="none" xmlns="http://www.w3.org/2000/svg" role="img" aria-labelledby="title desc">
- <title id="title">VerbalCoding voice-to-agent flow</title>
- <desc id="desc">A stylized pipeline from Discord voice to speech recognition, CLI agent, text answer, and TTS playback.</desc>
+ <title id="title">VerbalCoding natural voice loop</title>
+ <desc id="desc">A compact phone-call-like loop: user speaks in Discord, Local STT with whisper-cli transcribes, the CLI agent works, TTS speaks back, and the user can interrupt anytime.</desc>
  <defs>
  <linearGradient id="bg" x1="0" y1="0" x2="1200" y2="520" gradientUnits="userSpaceOnUse">
  <stop stop-color="#0F172A"/>
@@ -19,45 +19,60 @@
  <circle cx="1030" cy="90" r="190" fill="#6366F1" opacity="0.16"/>
  <circle cx="170" cy="430" r="210" fill="#06B6D4" opacity="0.13"/>
  <rect x="70" y="54" width="1060" height="412" rx="32" fill="url(#card)" stroke="#334155" filter="url(#shadow)"/>
+
  <text x="110" y="118" fill="#F8FAFC" font-family="Inter, ui-sans-serif, system-ui" font-size="42" font-weight="800">VerbalCoding</text>
- <text x="110" y="154" fill="#94A3B8" font-family="Inter, ui-sans-serif, system-ui" font-size="20">Discord voice local STT CLI coding agent spoken answer</text>
+ <text x="110" y="154" fill="#94A3B8" font-family="Inter, ui-sans-serif, system-ui" font-size="20">A natural Discord voice loop for coding agents speak, listen, interrupt, continue</text>
  
  <g font-family="Inter, ui-sans-serif, system-ui" font-size="17" font-weight="700">
- <rect x="110" y="220" width="150" height="92" rx="20" fill="#5865F2"/>
- <text x="185" y="258" fill="white" text-anchor="middle">Discord</text>
- <text x="185" y="284" fill="#E0E7FF" text-anchor="middle" font-size="14">voice channel</text>
+ <rect x="105" y="220" width="160" height="92" rx="20" fill="#5865F2"/>
+ <text x="185" y="254" fill="white" text-anchor="middle">Discord</text>
+ <text x="185" y="280" fill="#E0E7FF" text-anchor="middle" font-size="14">phone-call voice</text>
  
- <rect x="305" y="220" width="150" height="92" rx="20" fill="#0891B2"/>
- <text x="380" y="258" fill="white" text-anchor="middle">whisper.cpp</text>
- <text x="380" y="284" fill="#CFFAFE" text-anchor="middle" font-size="14">local STT</text>
+ <rect x="305" y="220" width="165" height="92" rx="20" fill="#0891B2"/>
+ <text x="387.5" y="254" fill="white" text-anchor="middle">Local STT</text>
+ <text x="387.5" y="280" fill="#CFFAFE" text-anchor="middle" font-size="14">whisper-cli</text>
  
- <rect x="500" y="220" width="150" height="92" rx="20" fill="#7C3AED"/>
- <text x="575" y="258" fill="white" text-anchor="middle">Adapter</text>
- <text x="575" y="284" fill="#EDE9FE" text-anchor="middle" font-size="14">Hermes / Claude / Codex</text>
+ <rect x="510" y="220" width="165" height="92" rx="20" fill="#7C3AED"/>
+ <text x="592.5" y="254" fill="white" text-anchor="middle">Adapter</text>
+ <text x="592.5" y="280" fill="#EDE9FE" text-anchor="middle" font-size="14">Hermes / Claude / Codex</text>
  
- <rect x="695" y="220" width="150" height="92" rx="20" fill="#111827" stroke="#475569"/>
- <text x="770" y="258" fill="white" text-anchor="middle">CLI Agent</text>
- <text x="770" y="284" fill="#CBD5E1" text-anchor="middle" font-size="14">does the work</text>
+ <rect x="715" y="220" width="165" height="92" rx="20" fill="#111827" stroke="#475569"/>
+ <text x="797.5" y="254" fill="white" text-anchor="middle">CLI Agent</text>
+ <text x="797.5" y="280" fill="#CBD5E1" text-anchor="middle" font-size="14">does the work</text>
  
- <rect x="890" y="220" width="150" height="92" rx="20" fill="#0EA5E9"/>
- <text x="965" y="258" fill="white" text-anchor="middle">TTS</text>
- <text x="965" y="284" fill="#E0F2FE" text-anchor="middle" font-size="14">chunked playback</text>
+ <rect x="920" y="220" width="165" height="92" rx="20" fill="#0EA5E9"/>
+ <text x="1002.5" y="254" fill="white" text-anchor="middle">TTS</text>
+ <text x="1002.5" y="280" fill="#E0F2FE" text-anchor="middle" font-size="14">spoken reply</text>
  </g>
  
  <g stroke="#94A3B8" stroke-width="4" stroke-linecap="round">
- <path d="M266 266H296"/>
- <path d="M461 266H491"/>
- <path d="M656 266H686"/>
- <path d="M851 266H881"/>
+ <path d="M275 266H295"/>
+ <path d="M480 266H500"/>
+ <path d="M685 266H705"/>
+ <path d="M890 266H910"/>
+ </g>
+ <g fill="#94A3B8" opacity="0.95">
+ <circle cx="285" cy="266" r="4"/>
+ <circle cx="490" cy="266" r="4"/>
+ <circle cx="695" cy="266" r="4"/>
+ <circle cx="900" cy="266" r="4"/>
  </g>
- <g fill="#94A3B8">
- <path d="M296 266l-10-7v14l10-7z"/>
- <path d="M491 266l-10-7v14l10-7z"/>
- <path d="M686 266l-10-7v14l10-7z"/>
- <path d="M881 266l-10-7v14l10-7z"/>
+
+ <path d="M1002 330C1002 405 185 405 185 330" stroke="#67E8F9" stroke-width="4" stroke-linecap="round" stroke-dasharray="13 13"/>
+ <g fill="#67E8F9">
+ <circle cx="1002" cy="330" r="5"/>
+ <circle cx="185" cy="330" r="5"/>
+ </g>
+ <text x="594" y="438" fill="#A5F3FC" text-anchor="middle" font-family="Inter, ui-sans-serif, system-ui" font-size="17" font-weight="700">Conversation loop: hear the answer, speak again, or interrupt anytime</text>
+
+ <path d="M185 210C185 178 1002 178 1002 210" stroke="#FBBF24" stroke-width="3" stroke-linecap="round" stroke-dasharray="8 10" opacity="0.9"/>
+ <g fill="#FBBF24" opacity="0.95">
+ <circle cx="185" cy="210" r="4"/>
+ <circle cx="1002" cy="210" r="4"/>
  </g>
+ <text x="594" y="194" fill="#FDE68A" text-anchor="middle" font-family="Inter, ui-sans-serif, system-ui" font-size="15" font-weight="700">Barge-in stays open while the agent is thinking or speaking</text>
  
- <rect x="150" y="360" width="900" height="54" rx="17" fill="#020617" stroke="#1F2937"/>
- <text x="182" y="394" fill="#A7F3D0" font-family="SFMono-Regular, ui-monospace, monospace" font-size="18">$ vc language ko &amp;&amp; vc instance start my-project</text>
- <text x="1045" y="394" fill="#64748B" text-anchor="end" font-family="Inter, ui-sans-serif, system-ui" font-size="15">hands-free coding loop</text>
+ <rect x="150" y="348" width="900" height="54" rx="17" fill="#020617" stroke="#1F2937"/>
+ <text x="182" y="382" fill="#A7F3D0" font-family="SFMono-Regular, ui-monospace, monospace" font-size="18">$ vc language ko &amp;&amp; vc instance start my-project</text>
+ <text x="1045" y="382" fill="#64748B" text-anchor="end" font-family="Inter, ui-sans-serif, system-ui" font-size="15">hands-free coding call</text>
  </svg>
package/docs/i18n/CONFIGURATION.es.md ADDED
@@ -0,0 +1,239 @@
+ # Configuración de VerbalCoding
+
+ ## Asistente de configuración
+
+ La configuración de la aplicación/bot de Discord no se vuelve a explicar desde cero aquí de forma intencionada. Usa estas guías originales para los pasos del lado de Discord y luego vuelve a la configuración de VerbalCoding:
+
+ - Guía de mensajería Discord de Hermes Agent: <https://hermes-agent.nousresearch.com/docs/user-guide/messaging/discord>
+ - Resumen oficial de bots de Discord: <https://docs.discord.com/developers/bots/overview>
+ - Inicio rápido oficial de Discord: <https://docs.discord.com/developers/quick-start/getting-started>
+
+ ```bash
+ ./scripts/install.sh
+ ```
+
+ El instalador solicita token de Discord, usuarios permitidos, nombres de canales de voz para auto-unión, canal/hilo de transcripción, backend de arnés CLI, idioma de voz predeterminado, ajustes de TTS y comportamiento de palabra de activación. Escribe `.env` con modo `0600`; `.env` está ignorado por git. También enlaza el comando corto de shell `vc`.
+
+ Si solo necesitas el comando de shell después de una instalación manual:
+
+ ```bash
+ npm link
+ ```
+
+ ## Backends de agentes compatibles
+
+ Define `AGENT_BACKEND` en `.env`.
+
+ | Backend | Comando predeterminado | Notas |
+ |---|---|---|
+ | `hermes` | `hermes chat -Q -q` | Predeterminado. Conserva el comportamiento de reanudación de `.verbalcoding-session`. |
+ | `claude-code` / `claude` | `claude -p` | Sobrescribe con `CLAUDE_COMMAND` o `AGENT_COMMAND`. |
+ | `codex` | `codex exec` | Sobrescribe con `CODEX_COMMAND` o `AGENT_COMMAND`. |
+ | `gemini` | `gemini -p` | Sobrescribe con `GEMINI_COMMAND` o `AGENT_COMMAND`. |
+ | `opencode` | `opencode run` | Sobrescribe con `OPENCODE_COMMAND` o `AGENT_COMMAND`. |
+ | `openclaw` | `openclaw run` | Sobrescribe con `OPENCLAW_COMMAND` o `AGENT_COMMAND`. |
+ | `custom` | `AGENT_COMMAND` requerido | El prompt se añade como argumento argv final. |
+
+ Sobrescrituras genéricas:
+
+ ```bash
+ AGENT_BACKEND=custom
+ AGENT_LABEL="My Harness"
+ AGENT_COMMAND="my-harness run --non-interactive"
+ AGENT_TASK_TIMEOUT_MS=0
+ AGENT_CHAT_TIMEOUT_MS=45000
+ AGENT_VERBOSE_PROGRESS=0
+ UTTERANCE_IDLE_MS=4500
+ LATENCY_LOG_PATH=./.logs/latency.jsonl
+ ```
+
+ ## Contrato del adaptador de agente
+
+ El puente de voz habla con cada backend mediante un único contrato de adaptador:
+
+ - `run({ text }, signal, plan)` devuelve estado, texto de respuesta final, etiqueta del backend, tiempo transcurrido y metadatos de sesión opcionales.
+ - `ask(text, signal, plan)` es el atajo de compatibilidad que devuelve solo el texto de la respuesta final.
+ - `capabilities` declara si el backend admite reanudación de sesión, progreso en streaming y cancelación.
+ - Hermes es el adaptador de referencia: reanudación, streaming de progreso detallado, cancelación y recuperación de respuesta final desde archivos de sesión de Hermes.
+
+ Los nuevos backends deberían implementar el mismo contrato y mantener el comportamiento de voz/STT/TTS fuera del adaptador.
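Reviewer note: the adapter contract section above describes `run({ text }, signal, plan)`, `ask(text, signal, plan)`, and `capabilities` in prose only. The sketch below shows what a minimal adapter honoring that description might look like. It is not taken from the package: `createEchoAdapter`, the result field names (`status`, `text`, `label`, `elapsedMs`, `session`), the capability keys, and the assumption that `signal` is a standard `AbortSignal` are all illustrative guesses at the shape.

```javascript
// Hypothetical adapter sketch; field and capability names are assumptions.
function createEchoAdapter(label = 'Echo') {
  const adapter = {
    // Declares what the backend supports so the bridge can adapt its behavior.
    capabilities: { sessionResume: false, streamingProgress: false, cancellation: true },

    // Full result: status, final answer text, backend label, elapsed time,
    // and optional session metadata (none for this toy backend).
    async run({ text }, signal, plan) {
      const startedAt = Date.now();
      if (signal?.aborted) {
        return { status: 'cancelled', text: '', label, elapsedMs: 0, session: null };
      }
      return {
        status: 'ok',
        text: `echo: ${text}`,
        label,
        elapsedMs: Date.now() - startedAt,
        session: null,
      };
    },

    // Compatibility shortcut: only the final answer text.
    async ask(text, signal, plan) {
      const result = await adapter.run({ text }, signal, plan);
      return result.text;
    },
  };
  return adapter;
}
```

Note that nothing here touches audio, STT, or TTS, which matches the section's advice to keep voice behavior outside the adapter.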
+
+ ## Ejemplo de `.env`
+
+ ```bash
+ DISCORD_BOT_TOKEN="***"
+ DISCORD_ALLOWED_USERS="123456789012345678"
+ AUTO_JOIN_VOICE_CHANNELS="일반,General,general"
+ TRANSCRIPT_CHANNEL_ID="123456789012345678"
+
+ AGENT_BACKEND="hermes"
+ STT_ENGINE="whisper_cpp"
+ WHISPER_CPP_BIN="whisper-cli"
+ WHISPER_CPP_MODEL="./models/ggml-small-q5_1.bin"
+
+ TTS_BACKEND="edge"
+ TTS_VOICE_TYPE="korean_female"
+ TTS_VOICE="ko-KR-SunHiNeural"
+ TTS_RATE="+10%"
+ TTS_MAX_CHARS="495"
+ TTS_VOLUME="1.0"
+
+ REQUIRE_WAKE_WORD="0"
+ MIN_UTTERANCE_SECONDS="1.0"
+ UTTERANCE_IDLE_MS="4500"
+ HERMES_TASK_TIMEOUT_MS="0"
+ HERMES_CHAT_TIMEOUT_MS="45000"
+ AGENT_VERBOSE_PROGRESS="0"
+ LATENCY_LOG_PATH="./.logs/latency.jsonl"
+ ```
+
+ ## Selección de voz TTS
+
+ Los preajustes de idioma y la selección de voz están separados:
+
+ - `vc language ko|en|auto` cambia el idioma STT, el idioma de progreso y la voz predeterminada para ese idioma.
+ - Comandos de voz en vivo como “남자 한국어 목소리로 바꿔”, “여자 한국어 목소리로 바꿔”, `change voice to Korean female` y `switch speaker to English` cambian solo el hablante/tipo de voz.
+ - `!voice-test <text>` reproduce una muestra rápida con el backend y la voz actualmente seleccionados.
+
+ La selección de voz se guarda por defecto en `config/tts-voices.json`. Sobrescribe la ruta con `TTS_VOICE_CONFIG`. El puente en ejecución vuelve a leer/aplicar la selección de voz antes de sintetizar, por lo que los comandos de voz surten efecto sin reinicio completo.
+
+ Catálogo Edge predeterminado:
+
+ | `TTS_VOICE_TYPE` | `TTS_VOICE` | Idioma |
+ |---|---|---|
+ | `korean_male` | `ko-KR-InJoonNeural` | Coreano |
+ | `korean_female` | `ko-KR-SunHiNeural` | Coreano |
+ | `korean_multilingual_male` | `ko-KR-HyunsuMultilingualNeural` | Coreano |
+ | `english_male` | `en-US-GuyNeural` | Inglés |
+ | `english_female` | `en-US-AriaNeural` | Inglés |
+
+ Sobrescritura manual persistente:
+
+ ```bash
+ TTS_BACKEND="edge"
+ TTS_VOICE_TYPE="korean_male"
+ TTS_VOICE="ko-KR-InJoonNeural"
+ TTS_VOICE_CONFIG="config/tts-voices.json"
+ ```
+
+ Para OpenVoice, SpeechSwift o Supertonic, mantén los ajustes de voz/referencia específicos del backend en las secciones siguientes; el mismo archivo de catálogo de voces aún puede rastrear el tipo de voz activo.
+
+ Opciones de voz específicas de backend:
+
+ | Backend | Ajustes | Opciones de voz |
+ |---|---|---|
+ | Edge | `TTS_VOICE_TYPE`, `TTS_VOICE` | Tipos integrados anteriores, más cualquier voz devuelta por `edge-tts --list-voices` |
+ | Supertonic | `SUPERTONIC_VOICE`, `SUPERTONIC_LANGUAGE` | `M1`–`M5`, `F1`–`F5`; idioma `ko`, `en`, `es`, `pt`, `fr` |
+ | OpenVoice | `OPENVOICE_REF_AUDIO`, `OPENVOICE_STYLE`, `OPENVOICE_LANGUAGE` | WAV de referencia permitido proporcionado por el usuario; el estilo predeterminado es `default` |
+ | SpeechSwift / CosyVoice | `SPEECHSWIFT_REF_AUDIO`, `SPEECHSWIFT_ENGINE`, `SPEECHSWIFT_SPEAKER`, `SPEECHSWIFT_MODEL_ID` | Voces de muestra de referencia para CosyVoice, o IDs de hablante/modelo admitidos por el backend |
+
+ ## Segmentación de emisiones
+
+ `UTTERANCE_IDLE_MS` controla cuánto espera el puente después de un segmento de habla antes de decidir que el usuario terminó y empezar STT. El valor predeterminado es `4500` ms para conservar instrucciones habladas más largas con pausas naturales. Los valores menores se sienten más rápidos para comandos cortos, pero pueden dividir dictado largo; los valores mayores son más seguros para habla reflexiva.
+
+ ```bash
+ UTTERANCE_IDLE_MS="4500" # balanced default
+ UTTERANCE_IDLE_MS="6000" # safer for long dictation with pauses
+ ```
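Reviewer note: the `UTTERANCE_IDLE_MS` trade-off described above (shorter values split long dictation, longer values wait through pauses) is easier to see with concrete numbers. The sketch below groups already-recorded speech segments into utterances using an idle threshold. It is an offline illustration, not the bridge's code: `groupUtterances` and the `{ startMs, endMs }` segment shape are hypothetical, and a live implementation would more likely use a timer that resets on each new segment.

```javascript
// Hypothetical offline illustration of idle-based utterance segmentation.
// Segments are assumed to be sorted by time and shaped { startMs, endMs }.
function groupUtterances(segments, idleMs = 4500) {
  const utterances = [];
  let current = [];
  for (const segment of segments) {
    const previous = current[current.length - 1];
    // A silence of at least idleMs closes the current utterance.
    if (previous && segment.startMs - previous.endMs >= idleMs) {
      utterances.push(current);
      current = [];
    }
    current.push(segment);
  }
  if (current.length > 0) utterances.push(current);
  return utterances;
}
```

With segments ending at 1000 ms and starting again at 3000 ms, the 2-second pause stays inside one utterance at the default `4500`, but would be split into two if the threshold were lowered to `1500`.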
+
+ ## Servidor MCP
+
+ VerbalCoding incluye un servidor MCP stdio para que Hermes Agent o cualquier cliente MCP pueda controlar el puente mediante herramientas en lugar de depender de skills o comandos de shell de forma libre.
+
+ Ejemplo de configuración de Hermes:
+
+ ```yaml
+ mcp_servers:
+   verbalcoding:
+     command: "node"
+     args: ["/path/to/VerbalCoding/scripts/mcp-server.mjs"]
+     timeout: 120
+     connect_timeout: 30
+ ```
+
+ Herramientas MCP expuestas:
+
+ | Herramienta | Propósito |
+ |---|---|
+ | `status` | Informar estado del puente/configuración sin secretos |
+ | `doctor` | Ejecutar la comprobación doctor con secretos redactados |
+ | `set_auto_restart` | Habilitar/deshabilitar el reinicio automático del bot de voz al hacer commit |
+ | `set_language` | Actualizar juntos STT/progreso/TTS |
+ | `start`, `stop`, `restart` | Controlar el puente de voz de Discord |
+
+ ## TTS OpenVoice opcional
+
+ Edge TTS sigue siendo el valor predeterminado y la alternativa. Para probar clonación de voz local con OpenVoice V2:
+
+ ```bash
+ ./scripts/setup_openvoice.sh
+ # Download checkpoints_v2_0417.zip from OpenVoice docs and extract under vendor/OpenVoice/checkpoints_v2/
+ mkdir -p voice-samples
+ # Put a permitted reference sample at voice-samples/user-reference.wav,
+ # or capture one from Discord with !voice-clone capture.
+ python3 integrations/openvoice/synth.py --openvoice-dir vendor/OpenVoice --ref-audio voice-samples/user-reference.wav --text '안녕하세요. 버벌코딩 목소리 복제 테스트입니다.' --output /tmp/verbalcoding-openvoice-smoke.wav
+ ```
+
+ Luego define:
+
+ ```bash
+ TTS_BACKEND="openvoice"
+ OPENVOICE_REF_AUDIO="./voice-samples/user-reference.wav"
+ OPENVOICE_PROGRESS="0"
+ ```
+
+ Clona solo voces que poseas o tengas permiso para usar. Si OpenVoice falla o agota el tiempo, VerbalCoding vuelve a Edge TTS.
+
+ ## TTS Supertonic opcional
+
+ ```bash
+ ./scripts/setup_supertonic.sh
+ supertonic tts '안녕하세요. 수퍼토닉 테스트입니다.' --lang ko --voice M1 --steps 2 --speed 1.0 -o /tmp/verbalcoding-supertonic.wav
+ ```
+
+ Luego define:
+
+ ```bash
+ TTS_BACKEND="supertonic"
+ SUPERTONIC_COMMAND="./.venv-supertonic/bin/supertonic"
+ SUPERTONIC_VOICE="M1"
+ SUPERTONIC_LANGUAGE="ko"
+ SUPERTONIC_STEPS="2"
+ SUPERTONIC_SPEED="1.0"
+ SUPERTONIC_PROGRESS="0"
+ ```
+
+ Si Supertonic falta, falla o agota el tiempo, VerbalCoding vuelve a Edge TTS.
+
+ ## TTS SpeechSwift / CosyVoice opcional
+
+ En Apple Silicon, `speech-swift` es un backend local para clonación de voz coreana con CosyVoice/Qwen3-TTS nativo de MLX.
+
+ ```bash
+ brew tap soniqo/speech https://github.com/soniqo/speech-swift
+ brew install speech
+ ```
+
+ Entorno recomendado:
+
+ ```bash
+ TTS_BACKEND="speechswift"
+ SPEECHSWIFT_MODE="server"
+ SPEECHSWIFT_ENGINE="cosyvoice"
+ SPEECHSWIFT_LANGUAGE="korean"
+ SPEECHSWIFT_REF_AUDIO="./voice-samples/user-reference.wav"
+ SPEECHSWIFT_SERVER_HOST="127.0.0.1"
+ SPEECHSWIFT_SERVER_PORT="18080"
+ SPEECHSWIFT_SERVER_URL="http://127.0.0.1:18080"
+ SPEECHSWIFT_PROGRESS="0"
+ ```
+
+ Mantén Edge para prompts rápidos de progreso/backchannel.
+
+ ## Notas operativas
+
+ - El bot necesita el intent privilegiado Message Content de Discord habilitado para comandos de texto.
+ - El bot necesita permisos de conectar/hablar en el canal de voz.
+ - Para Hermes Agent, configura/autentica Hermes normalmente (`hermes setup`, `hermes login`, etc.) en tu perfil predeterminado.
+ - Para Claude Code, Codex, Gemini, OpenCode y OpenClaw, instala y autentica esas CLIs por separado.
+ - Si una CLI emite salida de diff/código durante un timeout o fallo de señal, el puente evita leerla en voz alta y envía texto detallado en su lugar.