verbalcoding 0.2.7 → 0.2.8

Files changed (44)
  1. package/README.md +12 -27
  2. package/app-node/cli_install.test.mjs +15 -0
  3. package/docs/FRESH_INSTALL.md +8 -2
  4. package/docs/assets/figures/verbalcoding-flow.svg +45 -30
  5. package/docs/i18n/CONFIGURATION.es.md +138 -49
  6. package/docs/i18n/CONFIGURATION.fr.md +138 -49
  7. package/docs/i18n/CONFIGURATION.ja.md +137 -48
  8. package/docs/i18n/CONFIGURATION.ko.md +137 -48
  9. package/docs/i18n/CONFIGURATION.ru.md +138 -49
  10. package/docs/i18n/CONFIGURATION.zh.md +137 -48
  11. package/docs/i18n/FRESH_INSTALL.es.md +115 -32
  12. package/docs/i18n/FRESH_INSTALL.fr.md +115 -32
  13. package/docs/i18n/FRESH_INSTALL.ja.md +119 -36
  14. package/docs/i18n/FRESH_INSTALL.ko.md +120 -37
  15. package/docs/i18n/FRESH_INSTALL.ru.md +115 -32
  16. package/docs/i18n/FRESH_INSTALL.zh.md +119 -36
  17. package/docs/i18n/MULTI_INSTANCE.es.md +85 -26
  18. package/docs/i18n/MULTI_INSTANCE.fr.md +85 -26
  19. package/docs/i18n/MULTI_INSTANCE.ja.md +87 -29
  20. package/docs/i18n/MULTI_INSTANCE.ko.md +87 -29
  21. package/docs/i18n/MULTI_INSTANCE.ru.md +84 -26
  22. package/docs/i18n/MULTI_INSTANCE.zh.md +87 -29
  23. package/docs/i18n/README.es.md +109 -45
  24. package/docs/i18n/README.fr.md +109 -45
  25. package/docs/i18n/README.ja.md +109 -45
  26. package/docs/i18n/README.ko.md +108 -45
  27. package/docs/i18n/README.ru.md +109 -45
  28. package/docs/i18n/README.zh.md +108 -45
  29. package/docs/i18n/RELEASE.es.md +53 -37
  30. package/docs/i18n/RELEASE.fr.md +53 -37
  31. package/docs/i18n/RELEASE.ja.md +52 -36
  32. package/docs/i18n/RELEASE.ko.md +52 -36
  33. package/docs/i18n/RELEASE.ru.md +53 -37
  34. package/docs/i18n/RELEASE.zh.md +53 -37
  35. package/docs/i18n/USAGE.es.md +91 -64
  36. package/docs/i18n/USAGE.fr.md +91 -64
  37. package/docs/i18n/USAGE.ja.md +90 -63
  38. package/docs/i18n/USAGE.ko.md +90 -63
  39. package/docs/i18n/USAGE.ru.md +91 -64
  40. package/docs/i18n/USAGE.zh.md +90 -63
  41. package/package.json +1 -1
  42. package/scripts/bootstrap_prereqs.sh +15 -3
  43. package/scripts/cli.mjs +1 -1
  44. package/scripts/doctor.mjs +114 -8
package/README.md CHANGED
@@ -34,7 +34,7 @@ VerbalCoding turns a Discord voice channel into a hands-free control surface for
  | What you get | Why it feels good |
  |---|---|
  | Voice-first agent control | Talk to Hermes Agent, Claude Code, Codex, Gemini CLI, OpenCode, OpenClaw, or any custom CLI harness. |
- | Local-first speech loop | Discord voice capture → `whisper.cpp` STT → agent → chunked TTS playback. |
+ | On-device speech loop | Discord voice capture → local `whisper-cli` transcription → agent → chunked TTS playback. |
  | Shared voice + text context | Voice turns and `!ask` text commands can reuse the same supported agent session. |
  | Barge-in and sensitivity modes | Interrupt playback naturally and switch between normal and conservative/noisy environments. |
  | Multilingual voice presets | Switch STT, progress language, and TTS voice together with `vc language ko/en/auto`. |
@@ -69,23 +69,10 @@ vc doctor
  ./run.sh
  ```
 
- `vc setup --yes` and `./scripts/install.sh --yes` bootstrap local prerequisites where possible: Node/npm dependencies, `ffmpeg`, `whisper-cli`, the default whisper.cpp model, a local `.venv-tts` Edge TTS helper, and the short `vc` shell command for clone installs. They support macOS/Homebrew plus common Linux package managers (`apt`, `dnf`, `pacman`); rerun with `--no-wizard` for dependency-only setup or `--skip-system` if you want to install OS packages yourself.
+ `vc setup --yes` bootstraps local prerequisites from the npm package. `./scripts/install.sh --yes` does the same for GitHub clone installs. Both cover Node/npm dependencies, `ffmpeg`, `whisper-cli`, the default whisper.cpp model, a local `.venv-tts` Edge TTS helper, and setup wizard configuration where possible. They support macOS/Homebrew plus common Linux package managers (`apt`, `dnf`, `pacman`); rerun with `--no-wizard` for dependency-only setup or `--skip-system` if you want to install OS packages yourself.
 
  Need a clean install walkthrough? Start with [Fresh Install](docs/FRESH_INSTALL.md).
 
- ## How It Works
-
- ```mermaid
- flowchart LR
- A[Discord voice] --> B["@discordjs/voice"]
- B --> C[PCM cleanup + gates]
- C --> D["whisper.cpp STT"]
- D --> E["CLI agent adapter"]
- E --> F["Concise answer"]
- F --> G["Chunked TTS"]
- G --> H["Discord playback"]
- ```
-
  ## Supported Agent Backends
 
  | Backend | Default command | Session support |
@@ -107,12 +94,6 @@ flowchart LR
  | [Configuration](docs/CONFIGURATION.md) | `.env`, agent backends, MCP, TTS backends, operational notes |
  | [Multi-Instance](docs/MULTI_INSTANCE.md) | One permanent Discord voice room per project |
  | [Release Notes](docs/RELEASE.md) | Current capabilities and pre-release checklist |
- | [한국어 문서](docs/i18n/README.ko.md) | npm 설치, 사용법, 설정, 멀티 인스턴스 한국어 가이드 |
- | [日本語 docs](docs/i18n/README.ja.md) | npm install, usage, configuration, multi-instance guide in Japanese |
- | [中文文档](docs/i18n/README.zh.md) | npm 安装、使用、配置和多实例中文指南 |
- | [Español docs](docs/i18n/README.es.md) | Instalación npm, uso, configuración y multiinstancia en español |
- | [Français docs](docs/i18n/README.fr.md) | Installation npm, utilisation, configuration et multi-instance en français |
- | [Русская документация](docs/i18n/README.ru.md) | npm установка, использование, конфигурация и мульти-инстансы на русском |
 
  ## Tiny Command Map
 
@@ -128,11 +109,15 @@ vc start # start the default bridge
 
  In Discord:
 
- ```text
- !join !ask <prompt> !verbose on/off
- !latency !sensitivity normal !sensitivity conservative
- !session new <name> <workdir> [context] --voice <voice-channel>
- ```
+ | Command | What it does |
+ |---|---|
+ | `!join` | Join your current voice channel. |
+ | `!ask <prompt>` | Send text to the same agent backend. |
+ | `!verbose on\|off` | Show/speak short progress updates. |
+ | `!latency` | Summarize recent voice/STT/agent/TTS latency. |
+ | `!sensitivity normal` | Use normal indoor barge-in sensitivity. |
+ | `!sensitivity conservative` | Use stricter noisy/outdoor sensitivity. |
+ | `!session new <name> <workdir> [context] --voice <voice-channel>` | Bind a project session to a voice room. |
 
  ## Requirements
 
@@ -140,7 +125,7 @@ In Discord:
  |---|---|
  | Runtime | Node.js 20+, npm; install script can install via Homebrew/apt/dnf/pacman |
  | Audio | `ffmpeg`; install script can install it |
- | STT | `whisper.cpp` / `whisper-cli`; install script uses Homebrew on macOS or local Linux build fallback |
+ | Speech recognition | Local `whisper-cli` from whisper.cpp; install script uses Homebrew on macOS or local Linux build fallback |
  | TTS | Edge TTS CLI; install script creates `.venv-tts` if needed |
  | Discord | Bot token, Message Content intent, voice permissions |
  | Agent | At least one authenticated CLI harness, Hermes Agent by default |
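
The command table added in this README diff reads like a small grammar: a `!` prefix, a command name, positional arguments, and one `--voice` option. As a rough illustration only, a parser for that shape might look like the sketch below. `parseCommand` and its return shape are hypothetical; the package's real command handling is not shown in this diff.

```javascript
// Hypothetical parser for the "!command args --voice <channel>" shape.
// This is an illustration, not the package's implementation.
function parseCommand(message) {
  const trimmed = message.trim();
  if (!trimmed.startsWith('!')) return null; // not a bot command

  const tokens = trimmed.slice(1).split(/\s+/);
  const name = tokens.shift();
  const args = [];
  const options = {};

  for (let i = 0; i < tokens.length; i += 1) {
    // Treat "--voice <value>" as an option; everything else is positional.
    if (tokens[i] === '--voice' && i + 1 < tokens.length) {
      options.voice = tokens[i + 1];
      i += 1;
    } else {
      args.push(tokens[i]);
    }
  }
  return { name, args, options };
}

console.log(parseCommand('!session new api ./api --voice dev-room'));
console.log(parseCommand('!sensitivity conservative'));
```

Plain chat messages return `null`, so a caller can ignore anything that does not start with `!`.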
package/app-node/cli_install.test.mjs CHANGED
@@ -63,6 +63,8 @@ test('bootstrap script installs cross-platform prerequisites and local model hel
 
  assert.match(script, /brew install/);
  assert.match(script, /apt-get install/);
+ assert.match(script, /has_cmd node \|\| packages\+\=\(nodejs\)/);
+ assert.match(script, /has_cmd npm \|\| packages\+\=\(npm\)/);
  assert.match(script, /dnf install/);
  assert.match(script, /pacman -Sy/);
  assert.match(script, /git clone --depth 1 https:\/\/github\.com\/ggml-org\/whisper\.cpp\.git/);
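
The two added `has_cmd` assertions escape every shell metacharacter, which makes the intended script line hard to read. The snippet below is a minimal sketch that runs the same regular expressions against sample lines. The sample strings are assumptions inferred from the patterns; the actual contents of `bootstrap_prereqs.sh` are not shown in this hunk.

```javascript
// Hypothetical sample lines inferred from the regexes in the test above;
// they are not copied from bootstrap_prereqs.sh.
const sampleScript = [
  'has_cmd node || packages+=(nodejs)',
  'has_cmd npm || packages+=(npm)',
].join('\n');

// The same patterns the new assertions use.
const nodePattern = /has_cmd node \|\| packages\+\=\(nodejs\)/;
const npmPattern = /has_cmd npm \|\| packages\+\=\(npm\)/;

// A line that queues the packages unconditionally lacks the has_cmd guard.
const unguardedScript = 'packages+=(nodejs npm)';

const guardResults = {
  nodeGuarded: nodePattern.test(sampleScript),
  npmGuarded: npmPattern.test(sampleScript),
  unguardedMatches: nodePattern.test(unguardedScript),
};
console.log(guardResults);
```

Read this way, the assertions appear to check that Node and npm are only queued for installation when the corresponding command is missing.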
@@ -70,6 +72,19 @@ test('bootstrap script installs cross-platform prerequisites and local model hel
  assert.match(script, /\.venv-tts/);
  });
 
+ test('doctor auto-bootstraps fixable prerequisites by default', () => {
+ const doctor = fs.readFileSync(path.join(ROOT, 'scripts', 'doctor.mjs'), 'utf8');
+ const cli = fs.readFileSync(path.join(ROOT, 'scripts', 'cli.mjs'), 'utf8');
+
+ assert.match(doctor, /fixablePrerequisites/);
+ assert.match(doctor, /bootstrap_prereqs\.sh'\), '--yes'/);
+ assert.match(doctor, /VERBALCODING_DOCTOR_AUTO_FIX/);
+ assert.match(doctor, /--no-fix/);
+ assert.match(doctor, /WHISPER_CPP_BIN/);
+ assert.match(doctor, /EDGE_TTS_COMMAND/);
+ assert.match(cli, /doctor\.mjs'\), \.\.\.argv\.slice\(1\)/);
+ });
+
  test('Ubuntu Docker smoke script validates clean install without secrets', () => {
  const script = fs.readFileSync(path.join(ROOT, 'scripts', 'docker_ubuntu_smoke.sh'), 'utf8');
 
package/docs/FRESH_INSTALL.md CHANGED
@@ -32,7 +32,13 @@ cd VerbalCoding
 
  ## 2. Bootstrap dependencies and run the setup wizard
 
- The npm commands above run the same bootstrapper as the clone install. For a clone, run:
+ For an npm install, do not run `./scripts/install.sh` directly; there is no repository checkout in your current directory. Use the packaged CLI wrapper instead:
+
+ ```bash
+ vc setup --yes
+ ```
+
+ `vc setup` runs the `scripts/install.sh` bundled inside the installed npm package. Only use `./scripts/install.sh --yes` when you are inside a GitHub clone:
 
  ```bash
  ./scripts/install.sh --yes
@@ -104,7 +110,7 @@ The invite includes bot and slash-command scopes plus text/voice permissions use
  vc doctor
  ```
 
- `vc doctor` is redacted: it reports missing tokens/commands/models without printing secret values. Fix every `✗` item, then rerun it.
+ `vc doctor` is redacted: it reports missing tokens/commands/models without printing secret values. When fixable local prerequisites are missing (`ffmpeg`, `whisper-cli`, the default model, or Edge TTS helper), it automatically reruns the packaged bootstrap first. Fix any remaining `✗` items, then rerun it.
 
  Expected success includes:
 
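
The FRESH_INSTALL note and the new doctor test together suggest a flow: collect the failed checks, separate the ones the bootstrap script can fix, and rerun the bootstrap unless the user opted out. The sketch below is a guess at that decision logic, since the `doctor.mjs` hunks are not shown here. `planDoctorRun`, the check names, and the rule that `--no-fix` or `VERBALCODING_DOCTOR_AUTO_FIX=0` disables the fix are all assumptions; only the flag and variable names come from the test.

```javascript
// Hypothetical names for the prerequisites the docs call "fixable".
const FIXABLE = new Set(['ffmpeg', 'whisper-cli', 'whisper-model', 'edge-tts']);

// Decide whether a doctor run should trigger the bootstrap script.
// Assumption: "--no-fix" or VERBALCODING_DOCTOR_AUTO_FIX=0 opts out.
function planDoctorRun(checks, argv = [], env = {}) {
  const missing = checks.filter((check) => !check.ok).map((check) => check.name);
  const fixable = missing.filter((name) => FIXABLE.has(name));
  const manual = missing.filter((name) => !FIXABLE.has(name));
  const autoFixEnabled =
    !argv.includes('--no-fix') && env.VERBALCODING_DOCTOR_AUTO_FIX !== '0';
  return { runBootstrap: autoFixEnabled && fixable.length > 0, fixable, manual };
}

const sampleChecks = [
  { name: 'ffmpeg', ok: false },
  { name: 'DISCORD_BOT_TOKEN', ok: false },
  { name: 'node', ok: true },
];
console.log(planDoctorRun(sampleChecks));
console.log(planDoctorRun(sampleChecks, ['--no-fix']));
```

Items that land in `manual`, such as a missing bot token, correspond to the remaining `✗` entries the docs ask the user to fix by hand.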
package/docs/assets/figures/verbalcoding-flow.svg CHANGED
@@ -1,6 +1,6 @@
  <svg width="1200" height="520" viewBox="0 0 1200 520" fill="none" xmlns="http://www.w3.org/2000/svg" role="img" aria-labelledby="title desc">
- <title id="title">VerbalCoding voice-to-agent flow</title>
- <desc id="desc">A stylized pipeline from Discord voice to speech recognition, CLI agent, text answer, and TTS playback.</desc>
+ <title id="title">VerbalCoding natural voice loop</title>
+ <desc id="desc">A compact phone-call-like loop: user speaks in Discord, Local STT with whisper-cli transcribes, the CLI agent works, TTS speaks back, and the user can interrupt anytime.</desc>
  <defs>
  <linearGradient id="bg" x1="0" y1="0" x2="1200" y2="520" gradientUnits="userSpaceOnUse">
  <stop stop-color="#0F172A"/>
@@ -19,45 +19,60 @@
  <circle cx="1030" cy="90" r="190" fill="#6366F1" opacity="0.16"/>
  <circle cx="170" cy="430" r="210" fill="#06B6D4" opacity="0.13"/>
  <rect x="70" y="54" width="1060" height="412" rx="32" fill="url(#card)" stroke="#334155" filter="url(#shadow)"/>
+
  <text x="110" y="118" fill="#F8FAFC" font-family="Inter, ui-sans-serif, system-ui" font-size="42" font-weight="800">VerbalCoding</text>
- <text x="110" y="154" fill="#94A3B8" font-family="Inter, ui-sans-serif, system-ui" font-size="20">Discord voice local STT CLI coding agent spoken answer</text>
+ <text x="110" y="154" fill="#94A3B8" font-family="Inter, ui-sans-serif, system-ui" font-size="20">A natural Discord voice loop for coding agents speak, listen, interrupt, continue</text>
 
  <g font-family="Inter, ui-sans-serif, system-ui" font-size="17" font-weight="700">
- <rect x="110" y="220" width="150" height="92" rx="20" fill="#5865F2"/>
- <text x="185" y="258" fill="white" text-anchor="middle">Discord</text>
- <text x="185" y="284" fill="#E0E7FF" text-anchor="middle" font-size="14">voice channel</text>
+ <rect x="105" y="220" width="160" height="92" rx="20" fill="#5865F2"/>
+ <text x="185" y="254" fill="white" text-anchor="middle">Discord</text>
+ <text x="185" y="280" fill="#E0E7FF" text-anchor="middle" font-size="14">phone-call voice</text>
 
- <rect x="305" y="220" width="150" height="92" rx="20" fill="#0891B2"/>
- <text x="380" y="258" fill="white" text-anchor="middle">whisper.cpp</text>
- <text x="380" y="284" fill="#CFFAFE" text-anchor="middle" font-size="14">local STT</text>
+ <rect x="305" y="220" width="165" height="92" rx="20" fill="#0891B2"/>
+ <text x="387.5" y="254" fill="white" text-anchor="middle">Local STT</text>
+ <text x="387.5" y="280" fill="#CFFAFE" text-anchor="middle" font-size="14">whisper-cli</text>
 
- <rect x="500" y="220" width="150" height="92" rx="20" fill="#7C3AED"/>
- <text x="575" y="258" fill="white" text-anchor="middle">Adapter</text>
- <text x="575" y="284" fill="#EDE9FE" text-anchor="middle" font-size="14">Hermes / Claude / Codex</text>
+ <rect x="510" y="220" width="165" height="92" rx="20" fill="#7C3AED"/>
+ <text x="592.5" y="254" fill="white" text-anchor="middle">Adapter</text>
+ <text x="592.5" y="280" fill="#EDE9FE" text-anchor="middle" font-size="14">Hermes / Claude / Codex</text>
 
- <rect x="695" y="220" width="150" height="92" rx="20" fill="#111827" stroke="#475569"/>
- <text x="770" y="258" fill="white" text-anchor="middle">CLI Agent</text>
- <text x="770" y="284" fill="#CBD5E1" text-anchor="middle" font-size="14">does the work</text>
+ <rect x="715" y="220" width="165" height="92" rx="20" fill="#111827" stroke="#475569"/>
+ <text x="797.5" y="254" fill="white" text-anchor="middle">CLI Agent</text>
+ <text x="797.5" y="280" fill="#CBD5E1" text-anchor="middle" font-size="14">does the work</text>
 
- <rect x="890" y="220" width="150" height="92" rx="20" fill="#0EA5E9"/>
- <text x="965" y="258" fill="white" text-anchor="middle">TTS</text>
- <text x="965" y="284" fill="#E0F2FE" text-anchor="middle" font-size="14">chunked playback</text>
+ <rect x="920" y="220" width="165" height="92" rx="20" fill="#0EA5E9"/>
+ <text x="1002.5" y="254" fill="white" text-anchor="middle">TTS</text>
+ <text x="1002.5" y="280" fill="#E0F2FE" text-anchor="middle" font-size="14">spoken reply</text>
  </g>
 
  <g stroke="#94A3B8" stroke-width="4" stroke-linecap="round">
- <path d="M266 266H296"/>
- <path d="M461 266H491"/>
- <path d="M656 266H686"/>
- <path d="M851 266H881"/>
+ <path d="M275 266H295"/>
+ <path d="M480 266H500"/>
+ <path d="M685 266H705"/>
+ <path d="M890 266H910"/>
+ </g>
+ <g fill="#94A3B8" opacity="0.95">
+ <circle cx="285" cy="266" r="4"/>
+ <circle cx="490" cy="266" r="4"/>
+ <circle cx="695" cy="266" r="4"/>
+ <circle cx="900" cy="266" r="4"/>
  </g>
- <g fill="#94A3B8">
- <path d="M296 266l-10-7v14l10-7z"/>
- <path d="M491 266l-10-7v14l10-7z"/>
- <path d="M686 266l-10-7v14l10-7z"/>
- <path d="M881 266l-10-7v14l10-7z"/>
+
+ <path d="M1002 330C1002 405 185 405 185 330" stroke="#67E8F9" stroke-width="4" stroke-linecap="round" stroke-dasharray="13 13"/>
+ <g fill="#67E8F9">
+ <circle cx="1002" cy="330" r="5"/>
+ <circle cx="185" cy="330" r="5"/>
+ </g>
+ <text x="594" y="438" fill="#A5F3FC" text-anchor="middle" font-family="Inter, ui-sans-serif, system-ui" font-size="17" font-weight="700">Conversation loop: hear the answer, speak again, or interrupt anytime</text>
+
+ <path d="M185 210C185 178 1002 178 1002 210" stroke="#FBBF24" stroke-width="3" stroke-linecap="round" stroke-dasharray="8 10" opacity="0.9"/>
+ <g fill="#FBBF24" opacity="0.95">
+ <circle cx="185" cy="210" r="4"/>
+ <circle cx="1002" cy="210" r="4"/>
  </g>
+ <text x="594" y="194" fill="#FDE68A" text-anchor="middle" font-family="Inter, ui-sans-serif, system-ui" font-size="15" font-weight="700">Barge-in stays open while the agent is thinking or speaking</text>
 
- <rect x="150" y="360" width="900" height="54" rx="17" fill="#020617" stroke="#1F2937"/>
- <text x="182" y="394" fill="#A7F3D0" font-family="SFMono-Regular, ui-monospace, monospace" font-size="18">$ vc language ko &amp;&amp; vc instance start my-project</text>
- <text x="1045" y="394" fill="#64748B" text-anchor="end" font-family="Inter, ui-sans-serif, system-ui" font-size="15">hands-free coding loop</text>
+ <rect x="150" y="348" width="900" height="54" rx="17" fill="#020617" stroke="#1F2937"/>
+ <text x="182" y="382" fill="#A7F3D0" font-family="SFMono-Regular, ui-monospace, monospace" font-size="18">$ vc language ko &amp;&amp; vc instance start my-project</text>
+ <text x="1045" y="382" fill="#64748B" text-anchor="end" font-family="Inter, ui-sans-serif, system-ui" font-size="15">hands-free coding call</text>
  </svg>
package/docs/i18n/CONFIGURATION.es.md CHANGED
@@ -1,36 +1,40 @@
- # VerbalCoding Configuración
+ # Configuración de VerbalCoding
 
- ## Setup Wizard
+ ## Asistente de configuración
 
- Use upstream Discord-side guides first, then return to VerbalCoding:
+ La configuración de la aplicación/bot de Discord no se vuelve a explicar desde cero aquí de forma intencionada. Usa estas guías originales para los pasos del lado de Discord y luego vuelve a la configuración de VerbalCoding:
 
- - Hermes Agent Discord messaging guide: <https://hermes-agent.nousresearch.com/docs/user-guide/messaging/discord>
- - Discord official bot overview: <https://docs.discord.com/developers/bots/overview>
- - Discord official quick start: <https://docs.discord.com/developers/quick-start/getting-started>
+ - Guía de mensajería Discord de Hermes Agent: <https://hermes-agent.nousresearch.com/docs/user-guide/messaging/discord>
+ - Resumen oficial de bots de Discord: <https://docs.discord.com/developers/bots/overview>
+ - Inicio rápido oficial de Discord: <https://docs.discord.com/developers/quick-start/getting-started>
 
  ```bash
- vc setup --yes
- # or from a clone
  ./scripts/install.sh
  ```
 
- The installer asks for the Discord token, allowed users, auto-join voice channel names, transcript channel/thread, CLI harness backend, default voice language, TTS settings, and wake-word behavior. It writes `.env` with mode `0600`.
+ El instalador solicita token de Discord, usuarios permitidos, nombres de canales de voz para auto-unión, canal/hilo de transcripción, backend de arnés CLI, idioma de voz predeterminado, ajustes de TTS y comportamiento de palabra de activación. Escribe `.env` con modo `0600`; `.env` está ignorado por git. También enlaza el comando corto de shell `vc`.
 
- ## Supported Agent Backends
+ Si solo necesitas el comando de shell después de una instalación manual:
 
- Set `AGENT_BACKEND` in `.env`.
+ ```bash
+ npm link
+ ```
+
+ ## Backends de agentes compatibles
 
- | Backend | Default command | Notes |
+ Define `AGENT_BACKEND` en `.env`.
+
+ | Backend | Comando predeterminado | Notas |
  |---|---|---|
- | `hermes` | `hermes chat -Q -q` | Default; supports resume and verbose progress |
- | `claude-code` / `claude` | `claude -p` | Override with `CLAUDE_COMMAND` or `AGENT_COMMAND` |
- | `codex` | `codex exec` | Override with `CODEX_COMMAND` or `AGENT_COMMAND` |
- | `gemini` | `gemini -p` | Override with `GEMINI_COMMAND` or `AGENT_COMMAND` |
- | `opencode` | `opencode run` | Override with `OPENCODE_COMMAND` or `AGENT_COMMAND` |
- | `openclaw` | `openclaw run` | Override with `OPENCLAW_COMMAND` or `AGENT_COMMAND` |
- | `custom` | `AGENT_COMMAND` required | Prompt is appended as final argv |
+ | `hermes` | `hermes chat -Q -q` | Predeterminado. Conserva el comportamiento de reanudación de `.verbalcoding-session`. |
+ | `claude-code` / `claude` | `claude -p` | Sobrescribe con `CLAUDE_COMMAND` o `AGENT_COMMAND`. |
+ | `codex` | `codex exec` | Sobrescribe con `CODEX_COMMAND` o `AGENT_COMMAND`. |
+ | `gemini` | `gemini -p` | Sobrescribe con `GEMINI_COMMAND` o `AGENT_COMMAND`. |
+ | `opencode` | `opencode run` | Sobrescribe con `OPENCODE_COMMAND` o `AGENT_COMMAND`. |
+ | `openclaw` | `openclaw run` | Sobrescribe con `OPENCLAW_COMMAND` o `AGENT_COMMAND`. |
+ | `custom` | `AGENT_COMMAND` requerido | El prompt se añade como argumento argv final. |
 
- Generic overrides:
+ Sobrescrituras genéricas:
 
  ```bash
  AGENT_BACKEND=custom
@@ -43,23 +47,37 @@ UTTERANCE_IDLE_MS=4500
  LATENCY_LOG_PATH=./.logs/latency.jsonl
  ```
 
- ## Example `.env`
+ ## Contrato del adaptador de agente
+
+ El puente de voz habla con cada backend mediante un único contrato de adaptador:
+
+ - `run({ text }, signal, plan)` devuelve estado, texto de respuesta final, etiqueta del backend, tiempo transcurrido y metadatos de sesión opcionales.
+ - `ask(text, signal, plan)` es el atajo de compatibilidad que devuelve solo el texto de la respuesta final.
+ - `capabilities` declara si el backend admite reanudación de sesión, progreso en streaming y cancelación.
+ - Hermes es el adaptador de referencia: reanudación, streaming de progreso detallado, cancelación y recuperación de respuesta final desde archivos de sesión de Hermes.
+
+ Los nuevos backends deberían implementar el mismo contrato y mantener el comportamiento de voz/STT/TTS fuera del adaptador.
+
+ ## Ejemplo de `.env`
 
  ```bash
  DISCORD_BOT_TOKEN="***"
  DISCORD_ALLOWED_USERS="123456789012345678"
  AUTO_JOIN_VOICE_CHANNELS="일반,General,general"
  TRANSCRIPT_CHANNEL_ID="123456789012345678"
+
  AGENT_BACKEND="hermes"
  STT_ENGINE="whisper_cpp"
  WHISPER_CPP_BIN="whisper-cli"
  WHISPER_CPP_MODEL="./models/ggml-small-q5_1.bin"
+
  TTS_BACKEND="edge"
  TTS_VOICE_TYPE="korean_female"
  TTS_VOICE="ko-KR-SunHiNeural"
  TTS_RATE="+10%"
  TTS_MAX_CHARS="495"
  TTS_VOLUME="1.0"
+
  REQUIRE_WAKE_WORD="0"
  MIN_UTTERANCE_SECONDS="1.0"
  UTTERANCE_IDLE_MS="4500"
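
The adapter contract added in this hunk names three members: `run({ text }, signal, plan)`, `ask(text, signal, plan)`, and `capabilities`. The sketch below shows one way a trivial backend could satisfy that shape. It is an illustration under assumptions: `createEchoAdapter`, the result field names (`status`, `text`, `backend`, `elapsedMs`, `session`), and the capability keys are hypothetical, because the docs describe the returned information but not the exact property names.

```javascript
// Hypothetical echo backend; field and capability names are assumptions.
function createEchoAdapter(label = 'echo') {
  const capabilities = { resume: false, streamingProgress: false, cancel: true };

  // run() returns the full result record described by the contract.
  async function run({ text }, signal, plan) {
    const startedAt = Date.now();
    if (signal && signal.aborted) {
      return { status: 'cancelled', text: '', backend: label, elapsedMs: 0, session: null };
    }
    return {
      status: 'ok',
      text: `echo: ${text}`,
      backend: label,
      elapsedMs: Date.now() - startedAt,
      session: null, // optional session metadata; unused by this backend
    };
  }

  // ask() is the compatibility shortcut: only the final reply text.
  async function ask(text, signal, plan) {
    const result = await run({ text }, signal, plan);
    return result.text;
  }

  return { capabilities, run, ask };
}

const echoAdapter = createEchoAdapter();
echoAdapter.ask('list open files').then((reply) => console.log(reply));
```

Keeping `ask` as a thin wrapper over `run` means a backend only has to implement one code path, which matches the docs' request that voice, STT, and TTS behavior stay outside the adapter.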
@@ -69,39 +87,60 @@ AGENT_VERBOSE_PROGRESS="0"
  LATENCY_LOG_PATH="./.logs/latency.jsonl"
  ```
 
- ## TTS Voice Selection
+ ## Selección de voz TTS
+
+ Los preajustes de idioma y la selección de voz están separados:
 
- `vc language ko|en|auto` changes STT language, progress language, and default TTS voice. Live commands such as “남자 한국어 목소리로 바꿔”, “여자 한국어 목소리로 바꿔”, `change voice to Korean female`, and `switch speaker to English` change only the speaker/voice type.
+ - `vc language ko|en|auto` cambia el idioma STT, el idioma de progreso y la voz predeterminada para ese idioma.
+ - Comandos de voz en vivo como “남자 한국어 목소리로 바꿔”, “여자 한국어 목소리로 바꿔”, `change voice to Korean female` y `switch speaker to English` cambian solo el hablante/tipo de voz.
+ - `!voice-test <text>` reproduce una muestra rápida con el backend y la voz actualmente seleccionados.
 
- Default Edge catalog:
+ La selección de voz se guarda por defecto en `config/tts-voices.json`. Sobrescribe la ruta con `TTS_VOICE_CONFIG`. El puente en ejecución vuelve a leer/aplicar la selección de voz antes de sintetizar, por lo que los comandos de voz surten efecto sin reinicio completo.
 
- | `TTS_VOICE_TYPE` | `TTS_VOICE` | Language |
+ Catálogo Edge predeterminado:
+
+ | `TTS_VOICE_TYPE` | `TTS_VOICE` | Idioma |
  |---|---|---|
- | `korean_male` | `ko-KR-InJoonNeural` | Korean |
- | `korean_female` | `ko-KR-SunHiNeural` | Korean |
- | `korean_multilingual_male` | `ko-KR-HyunsuMultilingualNeural` | Korean |
- | `english_male` | `en-US-GuyNeural` | English |
- | `english_female` | `en-US-AriaNeural` | English |
+ | `korean_male` | `ko-KR-InJoonNeural` | Coreano |
+ | `korean_female` | `ko-KR-SunHiNeural` | Coreano |
+ | `korean_multilingual_male` | `ko-KR-HyunsuMultilingualNeural` | Coreano |
+ | `english_male` | `en-US-GuyNeural` | Inglés |
+ | `english_female` | `en-US-AriaNeural` | Inglés |
+
+ Sobrescritura manual persistente:
+
+ ```bash
+ TTS_BACKEND="edge"
+ TTS_VOICE_TYPE="korean_male"
+ TTS_VOICE="ko-KR-InJoonNeural"
+ TTS_VOICE_CONFIG="config/tts-voices.json"
+ ```
 
- Backend-specific voice options:
+ Para OpenVoice, SpeechSwift o Supertonic, mantén los ajustes de voz/referencia específicos del backend en las secciones siguientes; el mismo archivo de catálogo de voces aún puede rastrear el tipo de voz activo.
 
- | Backend | Settings | Voice choices |
+ Opciones de voz específicas de backend:
+
+ | Backend | Ajustes | Opciones de voz |
  |---|---|---|
- | Edge | `TTS_VOICE_TYPE`, `TTS_VOICE` | Built-in types plus any `edge-tts --list-voices` voice |
- | Supertonic | `SUPERTONIC_VOICE`, `SUPERTONIC_LANGUAGE` | `M1`–`M5`, `F1`–`F5`; `ko`, `en`, `es`, `pt`, `fr` |
- | OpenVoice | `OPENVOICE_REF_AUDIO`, `OPENVOICE_STYLE`, `OPENVOICE_LANGUAGE` | User-provided permitted reference WAV |
- | SpeechSwift / CosyVoice | `SPEECHSWIFT_REF_AUDIO`, `SPEECHSWIFT_ENGINE`, `SPEECHSWIFT_SPEAKER`, `SPEECHSWIFT_MODEL_ID` | Reference-sample voice or backend speaker/model ID |
+ | Edge | `TTS_VOICE_TYPE`, `TTS_VOICE` | Tipos integrados anteriores, más cualquier voz devuelta por `edge-tts --list-voices` |
+ | Supertonic | `SUPERTONIC_VOICE`, `SUPERTONIC_LANGUAGE` | `M1`–`M5`, `F1`–`F5`; idioma `ko`, `en`, `es`, `pt`, `fr` |
+ | OpenVoice | `OPENVOICE_REF_AUDIO`, `OPENVOICE_STYLE`, `OPENVOICE_LANGUAGE` | WAV de referencia permitido proporcionado por el usuario; el estilo predeterminado es `default` |
+ | SpeechSwift / CosyVoice | `SPEECHSWIFT_REF_AUDIO`, `SPEECHSWIFT_ENGINE`, `SPEECHSWIFT_SPEAKER`, `SPEECHSWIFT_MODEL_ID` | Voces de muestra de referencia para CosyVoice, o IDs de hablante/modelo admitidos por el backend |
 
- ## Utterance Segmentation
+ ## Segmentación de emisiones
 
- `UTTERANCE_IDLE_MS` controls how long the bridge waits after speech before starting STT. Default is `4500` ms.
+ `UTTERANCE_IDLE_MS` controla cuánto espera el puente después de un segmento de habla antes de decidir que el usuario terminó y empezar STT. El valor predeterminado es `4500` ms para conservar instrucciones habladas más largas con pausas naturales. Los valores menores se sienten más rápidos para comandos cortos, pero pueden dividir dictado largo; los valores mayores son más seguros para habla reflexiva.
 
  ```bash
- UTTERANCE_IDLE_MS="4500"
- UTTERANCE_IDLE_MS="6000"
+ UTTERANCE_IDLE_MS="4500" # balanced default
+ UTTERANCE_IDLE_MS="6000" # safer for long dictation with pauses
  ```
 
- ## MCP Server
+ ## Servidor MCP
+
+ VerbalCoding incluye un servidor MCP stdio para que Hermes Agent o cualquier cliente MCP pueda controlar el puente mediante herramientas en lugar de depender de skills o comandos de shell de forma libre.
+
+ Ejemplo de configuración de Hermes:
 
  ```yaml
  mcp_servers:
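
The `UTTERANCE_IDLE_MS` paragraph in the hunk above describes a trade-off: a short idle window splits long dictation, a long one keeps pauses inside a single utterance. The sketch below illustrates that trade-off on timestamps alone. `groupUtterances` and the segment shape are hypothetical; the bridge's actual audio handling is not part of this diff.

```javascript
// Hypothetical model: each speech segment has start/end times in ms.
// Segments separated by less than idleMs are merged into one utterance.
function groupUtterances(segments, idleMs = 4500) {
  const utterances = [];
  let current = null;
  for (const segment of segments) {
    if (current && segment.startMs - current.endMs < idleMs) {
      current.endMs = segment.endMs; // pause was short: keep listening
      current.segments += 1;
    } else {
      current = { startMs: segment.startMs, endMs: segment.endMs, segments: 1 };
      utterances.push(current); // pause reached idleMs: new utterance
    }
  }
  return utterances;
}

// Pauses of 3000 ms and 5000 ms between three bursts of speech.
const speech = [
  { startMs: 0, endMs: 2000 },
  { startMs: 5000, endMs: 7000 },
  { startMs: 12000, endMs: 13000 },
];
console.log(groupUtterances(speech, 2000).length);
console.log(groupUtterances(speech, 4500).length);
console.log(groupUtterances(speech, 6000).length);
```

With these sample pauses, the 2000 ms window yields three utterances, the 4500 ms default yields two, and 6000 ms keeps the whole dictation together.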
@@ -112,39 +151,89 @@ mcp_servers:
  connect_timeout: 30
  ```
 
- Tools: `status`, `doctor`, `set_auto_restart`, `set_language`, `start`, `stop`, and `restart`.
+ Herramientas MCP expuestas:
+
+ | Herramienta | Propósito |
+ |---|---|
+ | `status` | Informar estado del puente/configuración sin secretos |
+ | `doctor` | Ejecutar la comprobación doctor con secretos redactados |
+ | `set_auto_restart` | Habilitar/deshabilitar el reinicio automático del bot de voz al hacer commit |
+ | `set_language` | Actualizar juntos STT/progreso/TTS |
+ | `start`, `stop`, `restart` | Controlar el puente de voz de Discord |
 
- ## Optional OpenVoice TTS
+ ## TTS OpenVoice opcional
+
+ Edge TTS sigue siendo el valor predeterminado y la alternativa. Para probar clonación de voz local con OpenVoice V2:
 
  ```bash
  ./scripts/setup_openvoice.sh
+ # Download checkpoints_v2_0417.zip from OpenVoice docs and extract under vendor/OpenVoice/checkpoints_v2/
+ mkdir -p voice-samples
+ # Put a permitted reference sample at voice-samples/user-reference.wav,
+ # or capture one from Discord with !voice-clone capture.
  python3 integrations/openvoice/synth.py --openvoice-dir vendor/OpenVoice --ref-audio voice-samples/user-reference.wav --text '안녕하세요. 버벌코딩 목소리 복제 테스트입니다.' --output /tmp/verbalcoding-openvoice-smoke.wav
  ```
 
+ Luego define:
+
  ```bash
  TTS_BACKEND="openvoice"
  OPENVOICE_REF_AUDIO="./voice-samples/user-reference.wav"
  OPENVOICE_PROGRESS="0"
  ```
 
- Only clone voices you own or have permission to use. OpenVoice falls back to Edge on failure.
+ Clona solo voces que poseas o tengas permiso para usar. Si OpenVoice falla o agota el tiempo, VerbalCoding vuelve a Edge TTS.
 
- ## Optional Supertonic TTS
+ ## TTS Supertonic opcional
 
  ```bash
  ./scripts/setup_supertonic.sh
  supertonic tts '안녕하세요. 수퍼토닉 테스트입니다.' --lang ko --voice M1 --steps 2 --speed 1.0 -o /tmp/verbalcoding-supertonic.wav
  ```
 
- ## Optional SpeechSwift / CosyVoice TTS
+ Luego define:
+
+ ```bash
+ TTS_BACKEND="supertonic"
+ SUPERTONIC_COMMAND="./.venv-supertonic/bin/supertonic"
+ SUPERTONIC_VOICE="M1"
+ SUPERTONIC_LANGUAGE="ko"
+ SUPERTONIC_STEPS="2"
+ SUPERTONIC_SPEED="1.0"
+ SUPERTONIC_PROGRESS="0"
+ ```
+
+ Si Supertonic falta, falla o agota el tiempo, VerbalCoding vuelve a Edge TTS.
+
+ ## TTS SpeechSwift / CosyVoice opcional
+
+ En Apple Silicon, `speech-swift` es un backend local para clonación de voz coreana con CosyVoice/Qwen3-TTS nativo de MLX.
 
  ```bash
  brew tap soniqo/speech https://github.com/soniqo/speech-swift
  brew install speech
  ```
 
- Recommended env includes `TTS_BACKEND="speechswift"`, `SPEECHSWIFT_MODE="server"`, `SPEECHSWIFT_ENGINE="cosyvoice"`, `SPEECHSWIFT_REF_AUDIO`, and `SPEECHSWIFT_SERVER_URL`. Keep Edge for quick progress prompts.
+ Entorno recomendado:
+
+ ```bash
+ TTS_BACKEND="speechswift"
+ SPEECHSWIFT_MODE="server"
+ SPEECHSWIFT_ENGINE="cosyvoice"
+ SPEECHSWIFT_LANGUAGE="korean"
+ SPEECHSWIFT_REF_AUDIO="./voice-samples/user-reference.wav"
+ SPEECHSWIFT_SERVER_HOST="127.0.0.1"
+ SPEECHSWIFT_SERVER_PORT="18080"
+ SPEECHSWIFT_SERVER_URL="http://127.0.0.1:18080"
+ SPEECHSWIFT_PROGRESS="0"
+ ```
+
+ Mantén Edge para prompts rápidos de progreso/backchannel.
 
- ## Operational Notes
+ ## Notas operativas
 
- Enable Discord Message Content intent, grant voice connect/speak permissions, authenticate the selected CLI harness separately, and avoid reading diffs/log dumps aloud.
+ - El bot necesita el intent privilegiado Message Content de Discord habilitado para comandos de texto.
+ - El bot necesita permisos de conectar/hablar en el canal de voz.
+ - Para Hermes Agent, configura/autentica Hermes normalmente (`hermes setup`, `hermes login`, etc.) en tu perfil predeterminado.
+ - Para Claude Code, Codex, Gemini, OpenCode y OpenClaw, instala y autentica esas CLIs por separado.
+ - Si una CLI emite salida de diff/código durante un timeout o fallo de señal, el puente evita leerla en voz alta y envía texto detallado en su lugar.
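
The OpenVoice and Supertonic sections in this file both state that VerbalCoding falls back to Edge TTS when the optional backend fails or times out. A minimal sketch of that pattern follows. `synthesizeWithFallback`, `withTimeout`, the backend objects, and the 5000 ms default are hypothetical stand-ins; the package's real TTS code and timeout values are not shown in this diff.

```javascript
// Reject if the wrapped promise does not settle within ms milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Try the primary backend first; on error or timeout use the fallback.
// Backend objects ({ name, synthesize }) are hypothetical stand-ins.
async function synthesizeWithFallback(text, primary, fallback, timeoutMs = 5000) {
  try {
    const audio = await withTimeout(primary.synthesize(text), timeoutMs);
    return { backend: primary.name, audio };
  } catch (error) {
    const audio = await fallback.synthesize(text);
    return { backend: fallback.name, audio, fallbackReason: error.message };
  }
}

// Stub backends for demonstration only.
const brokenOpenVoice = {
  name: 'openvoice',
  synthesize: async () => { throw new Error('checkpoint missing'); },
};
const edgeStub = {
  name: 'edge',
  synthesize: async (text) => `edge-audio(${text})`,
};

synthesizeWithFallback('hello', brokenOpenVoice, edgeStub).then((result) => {
  console.log(result.backend, result.fallbackReason);
});
```

Recording `fallbackReason` alongside the audio lets a caller log why the preferred voice was skipped without interrupting playback.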