@cheeko-ai/esp32-voice 2026.2.21
- package/NPM_PUBLISH_READINESS.md +299 -0
- package/README.md +226 -0
- package/TODO.md +418 -0
- package/index.ts +128 -0
- package/openclaw.plugin.json +9 -0
- package/package.json +62 -0
- package/src/accounts.ts +110 -0
- package/src/channel.ts +270 -0
- package/src/config-schema.ts +37 -0
- package/src/device/device-otp.ts +173 -0
- package/src/http-handler.ts +154 -0
- package/src/monitor.ts +124 -0
- package/src/onboarding.ts +575 -0
- package/src/runtime.ts +14 -0
- package/src/stt/deepgram.ts +215 -0
- package/src/stt/stt-provider.ts +107 -0
- package/src/stt/stt-registry.ts +71 -0
- package/src/tts/elevenlabs.ts +215 -0
- package/src/tts/tts-provider.ts +111 -0
- package/src/tts/tts-registry.ts +71 -0
- package/src/types.ts +136 -0
- package/src/voice/voice-endpoint.ts +296 -0
- package/src/voice/voice-session.ts +1041 -0
package/TODO.md
ADDED
|
@@ -0,0 +1,418 @@
|
|
|
1
|
+
# ESP32-Voice Plugin — Remaining Work TODO
|
|
2
|
+
|
|
3
|
+
> This document tracks everything that needs to be done before the plugin is production-ready
|
|
4
|
+
> and publishable to npm. Work through each section top-to-bottom. Each item is self-contained
|
|
5
|
+
> with enough context so a fresh contributor can pick it up without prior knowledge of the session.
|
|
6
|
+
|
|
7
|
+
---
|
|
8
|
+
|
|
9
|
+
## 🔴 SECTION 1 — Security (Do This Before Anything Else)
|
|
10
|
+
|
|
11
|
+
### 1.1 — Delete or sanitize `SETUP.md`
|
|
12
|
+
|
|
13
|
+
**Why:** `SETUP.md` contains real, live API keys committed to the repository. Anyone who clones
|
|
14
|
+
the repo has these keys.
|
|
15
|
+
|
|
16
|
+
**Steps:**
|
|
17
|
+
1. Open `extensions/esp32-voice/SETUP.md`
|
|
18
|
+
2. Replace every real credential with a placeholder:
|
|
19
|
+
- `GEMINI_API_KEY=AIzaSy...` → `GEMINI_API_KEY=<YOUR_GEMINI_API_KEY>`
|
|
20
|
+
- `ELEVENLABS_API_KEY=sk_...` → `ELEVENLABS_API_KEY=<YOUR_ELEVENLABS_API_KEY>`
|
|
21
|
+
- Any token or secret string → `<YOUR_TOKEN_HERE>`
|
|
22
|
+
3. Add a note at the top: ``> **Note:** Replace all `<PLACEHOLDER>` values with your own credentials.``
|
|
23
|
+
4. Immediately rotate the exposed keys:
|
|
24
|
+
- Go to [ElevenLabs API keys](https://elevenlabs.io/app/settings/api-keys) → delete old key → create new
|
|
25
|
+
- Go to [Google AI Studio](https://aistudio.google.com/apikey) → delete old key → create new
|
|
26
|
+
5. Commit: `"security: remove exposed credentials from SETUP.md"`
|
|
27
|
+
|
|
28
|
+
---
|
|
29
|
+
|
|
30
|
+
### 1.2 — Remove hardcoded fallback token from `ota-server.js`
|
|
31
|
+
|
|
32
|
+
**Why:** The gateway token in `ota-server.js` (line ~59) has a hardcoded default:
|
|
33
|
+
```js
|
|
34
|
+
const GATEWAY_TOKEN = process.env.GATEWAY_TOKEN || "YOUR_GATEWAY_TOKEN_HERE";
|
|
35
|
+
```
|
|
36
|
+
This token is now public. If the env var is not set, the server should exit with a clear error,
|
|
37
|
+
not fall back to a known value.
|
|
38
|
+
|
|
39
|
+
**Steps:**
|
|
40
|
+
1. Open `extensions/esp32-voice/ota-server.js`
|
|
41
|
+
2. Find the `GATEWAY_TOKEN` line
|
|
42
|
+
3. Replace with:
|
|
43
|
+
```js
|
|
44
|
+
const GATEWAY_TOKEN = process.env.GATEWAY_TOKEN;
|
|
45
|
+
if (!GATEWAY_TOKEN) {
|
|
46
|
+
console.error("[ota-server] ERROR: GATEWAY_TOKEN env var is required. Set it in your .env file.");
|
|
47
|
+
process.exit(1);
|
|
48
|
+
}
|
|
49
|
+
```
|
|
50
|
+
4. Update the README/QUICKSTART to mention this env var is required
|
|
51
|
+
5. Commit: `"security: require GATEWAY_TOKEN env var in ota-server, remove hardcoded fallback"`
|
|
52
|
+
|
|
53
|
+
---
|
|
54
|
+
|
|
55
|
+
## 🔴 SECTION 2 — Package.json Cleanup (Required for npm publish)
|
|
56
|
+
|
|
57
|
+
### 2.1 — Remove unused `@discordjs/opus` dependency
|
|
58
|
+
|
|
59
|
+
**Why:** During development, `@discordjs/opus` was replaced with `opusscript` because macOS
|
|
60
|
+
Gatekeeper rejects the `@discordjs/opus` prebuilt native binary. The package is still listed in
|
|
61
|
+
`package.json` but is never imported anywhere in the source code.
|
|
62
|
+
|
|
63
|
+
**Steps:**
|
|
64
|
+
1. Verify it's unused:
|
|
65
|
+
```bash
|
|
66
|
+
grep -r "@discordjs/opus\|require.*opus\|import.*opus" extensions/esp32-voice/src/
|
|
67
|
+
# Should only find opusscript references, not @discordjs/opus
|
|
68
|
+
```
|
|
69
|
+
2. Remove it:
|
|
70
|
+
```bash
|
|
71
|
+
cd extensions/esp32-voice
|
|
72
|
+
npm uninstall @discordjs/opus
|
|
73
|
+
```
|
|
74
|
+
3. Verify `opusscript` is still in `dependencies` in `package.json`
|
|
75
|
+
4. Commit: `"chore: remove unused @discordjs/opus dependency, use opusscript only"`
|
|
76
|
+
|
|
77
|
+
**Note on opusscript:** `opusscript` is pure JavaScript/WebAssembly — it needs NO native
|
|
78
|
+
compilation, NO node-gyp, NO system libraries. It installs cleanly on macOS, Linux and Windows.
|
|
79
|
+
No pre/post-install scripts are needed.
|
|
80
|
+
|
|
81
|
+
---
|
|
82
|
+
|
|
83
|
+
### 2.2 — Update version to CalVer format
|
|
84
|
+
|
|
85
|
+
**Why:** All OpenClaw extensions use CalVer (`YYYY.M.D` e.g. `2026.2.21`). This package
|
|
86
|
+
has `"version": "1.0.0"` which is inconsistent.
|
|
87
|
+
|
|
88
|
+
**Steps:**
|
|
89
|
+
1. Open `extensions/esp32-voice/package.json`
|
|
90
|
+
2. Change `"version": "1.0.0"` to today's date in CalVer format, e.g. `"version": "2026.2.21"`
|
|
91
|
+
3. Commit: `"chore: align version to CalVer format"`
|
|
92
|
+
|
|
93
|
+
---
|
|
94
|
+
|
|
95
|
+
### 2.3 — Add `ota-server.js` to the `files` array
|
|
96
|
+
|
|
97
|
+
**Why:** `ota-server.js` is not listed in the `files` array in `package.json`. When the package
|
|
98
|
+
is published to npm, this file will be excluded and users won't have the OTA server.
|
|
99
|
+
|
|
100
|
+
**Steps:**
|
|
101
|
+
1. Open `extensions/esp32-voice/package.json`
|
|
102
|
+
2. Find the `files` array (currently: `["index.ts", "src/", "openclaw.plugin.json", "README.md"]`)
|
|
103
|
+
3. Add `"ota-server.js"` and `"TODO.md"` to the array:
|
|
104
|
+
```json
|
|
105
|
+
"files": [
|
|
106
|
+
"index.ts",
|
|
107
|
+
"src/",
|
|
108
|
+
"openclaw.plugin.json",
|
|
109
|
+
"README.md",
|
|
110
|
+
"ota-server.js",
|
|
111
|
+
"TODO.md"
|
|
112
|
+
]
|
|
113
|
+
```
|
|
114
|
+
4. Commit: `"chore: add ota-server.js and TODO.md to npm files array"`
|
|
115
|
+
|
|
116
|
+
---
|
|
117
|
+
|
|
118
|
+
## 🟡 SECTION 3 — OTA Server Integration (High Priority UX)
|
|
119
|
+
|
|
120
|
+
### 3.1 — Fix hardcoded timezone (IST) in `ota-server.js`
|
|
121
|
+
|
|
122
|
+
**Why:** The OTA server response hardcodes the timezone offset to IST (UTC+5:30 = 330 minutes).
|
|
123
|
+
Users in other timezones will get wrong device time on their ESP32.
|
|
124
|
+
|
|
125
|
+
**Current code (in `ota-server.js`):**
|
|
126
|
+
```js
|
|
127
|
+
timezone_offset: 330 // hardcoded IST — wrong for everyone else
|
|
128
|
+
```
|
|
129
|
+
|
|
130
|
+
**Steps:**
|
|
131
|
+
1. Open `extensions/esp32-voice/ota-server.js`
|
|
132
|
+
2. Find the `timezone_offset` line
|
|
133
|
+
3. Replace with a dynamic calculation:
|
|
134
|
+
```js
|
|
135
|
+
// Get local UTC offset in minutes (negative for west, positive for east)
|
|
136
|
+
timezone_offset: -new Date().getTimezoneOffset(),
|
|
137
|
+
```
|
|
138
|
+
> `getTimezoneOffset()` returns minutes west of UTC (negative for east), so negate it to get
|
|
139
|
+
> the standard "minutes east of UTC" that XiaoZhi firmware expects.
|
|
140
|
+
|
|
141
|
+
4. Commit: `"fix: use system timezone instead of hardcoded IST in ota-server"`
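
The sign convention in step 3 is easy to get backwards, so here is a minimal sketch of it. `offsetMinutesEastOfUtc` is a hypothetical helper name, not something that exists in `ota-server.js`; the sample inputs are the values `getTimezoneOffset()` reports for IST and for US Eastern standard time.

```typescript
// Hypothetical helper: Date.prototype.getTimezoneOffset() returns
// (UTC - local time) in minutes, so negating it gives "minutes east of UTC".
function offsetMinutesEastOfUtc(getTimezoneOffsetResult: number): number {
  return -getTimezoneOffsetResult;
}

// IST (UTC+5:30): getTimezoneOffset() reports -330, so the payload value is 330.
const ist = offsetMinutesEastOfUtc(-330);

// US Eastern standard time (UTC-5): reports 300, so the payload value is -300.
const est = offsetMinutesEastOfUtc(300);

// Value for whichever machine is running the server right now.
const local = offsetMinutesEastOfUtc(new Date().getTimezoneOffset());
```

Note that this reports the timezone of the host running the server, which matches the device only when both are in the same timezone.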
|
|
142
|
+
|
|
143
|
+
---
|
|
144
|
+
|
|
145
|
+
### 3.2 — Integrate OTA endpoint into the plugin HTTP handler
|
|
146
|
+
|
|
147
|
+
**Why:** Currently users must run `node ota-server.js` as a completely separate process in a
|
|
148
|
+
separate terminal. This is confusing and adds friction. The plugin already registers an HTTP
|
|
149
|
+
handler (`src/http-handler.ts`). Adding the OTA route there means users just start OpenClaw
|
|
150
|
+
normally — no second process.
|
|
151
|
+
|
|
152
|
+
**Steps:**
|
|
153
|
+
1. Open `extensions/esp32-voice/src/http-handler.ts`
|
|
154
|
+
2. Identify where HTTP routes are registered (look for `app.get(...)` or similar)
|
|
155
|
+
3. Add a route for `/xiaozhi/ota/` or `/__openclaw__/esp32-voice/ota/` that returns the same
|
|
156
|
+
JSON payload currently in `ota-server.js`:
|
|
157
|
+
```typescript
|
|
158
|
+
// The payload the XiaoZhi firmware expects
|
|
159
|
+
const otaPayload = {
|
|
160
|
+
"websocket": {
|
|
161
|
+
"url": "ws://<LAN_IP>:<ESP32_VOICE_PORT>/"
|
|
162
|
+
},
|
|
163
|
+
"openclaw": {
|
|
164
|
+
"url": "ws://127.0.0.1:18789",
|
|
165
|
+
"token": "<GATEWAY_TOKEN>"
|
|
166
|
+
},
|
|
167
|
+
"timezone_offset": -new Date().getTimezoneOffset()
|
|
168
|
+
}
|
|
169
|
+
```
|
|
170
|
+
4. Get the LAN IP using the same interface detection logic already in `ota-server.js`
|
|
171
|
+
(prefer `en0`, `en1`, `eth0`, `wlan0`, `wlo1`)
|
|
172
|
+
5. Once integrated, update `README.md` to say "OTA is served automatically at
|
|
173
|
+
`http://<your-ip>:18789/__openclaw__/esp32-voice/ota/`" — no separate server needed
|
|
174
|
+
6. Keep `ota-server.js` as a standalone fallback option for users who run the plugin without
|
|
175
|
+
the full Gateway
|
|
176
|
+
7. Commit: `"feat: integrate OTA endpoint into plugin HTTP handler"`
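
Steps 3 and 4 can be sketched roughly as follows. This is an illustration under assumptions, not the logic that ships in `ota-server.js`: `detectLanIp` and `buildOtaPayload` are hypothetical names, and only the interface preference order and the payload fields come from the steps above.

```typescript
import { networkInterfaces } from "node:os";

// Interface names listed in the TODO, in preference order.
const PREFERRED_INTERFACES = ["en0", "en1", "eth0", "wlan0", "wlo1"];

// Hypothetical helper: returns the first non-internal IPv4 address,
// checking the preferred interface names before any others.
function detectLanIp(): string | undefined {
  const all = networkInterfaces();
  const others = Object.keys(all).filter((name) => PREFERRED_INTERFACES.indexOf(name) === -1);
  for (const name of PREFERRED_INTERFACES.concat(others)) {
    for (const addr of all[name] ?? []) {
      // Some Node releases report the family as the number 4 rather than "IPv4".
      const family = String(addr.family);
      if ((family === "IPv4" || family === "4") && !addr.internal) {
        return addr.address;
      }
    }
  }
  return undefined;
}

// Hypothetical builder for the payload shape shown in step 3.
function buildOtaPayload(lanIp: string, voicePort: number, gatewayUrl: string, token: string) {
  return {
    websocket: { url: `ws://${lanIp}:${voicePort}/` },
    openclaw: { url: gatewayUrl, token },
    timezone_offset: -new Date().getTimezoneOffset(),
  };
}

const lanIp = detectLanIp() ?? "127.0.0.1"; // loopback only as a last resort
const payload = buildOtaPayload(lanIp, 8765, "ws://127.0.0.1:18789", "<GATEWAY_TOKEN>");
```

Inside the plugin, the handler would serialize `payload` with `JSON.stringify` in the same way the existing routes in `index.ts` do.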
|
|
177
|
+
|
|
178
|
+
---
|
|
179
|
+
|
|
180
|
+
## 🟡 SECTION 4 — Developer Experience
|
|
181
|
+
|
|
182
|
+
### 4.1 — Write proper `README.md` with 3-step quick setup
|
|
183
|
+
|
|
184
|
+
**Why:** The current `README.md` is detailed but scattered. A new user needs a clear "from zero
|
|
185
|
+
to voice in 10 minutes" path at the very top, with details below.
|
|
186
|
+
|
|
187
|
+
**Suggested structure:**
|
|
188
|
+
|
|
189
|
+
```
|
|
190
|
+
# @cheeko-ai/esp32-voice
|
|
191
|
+
|
|
192
|
+
[One-line description]
|
|
193
|
+
|
|
194
|
+
## Quick Start (10 minutes)
|
|
195
|
+
|
|
196
|
+
### Step 1 — Get API Keys
|
|
197
|
+
- Deepgram (free): https://console.deepgram.com → API Keys → Create Key
|
|
198
|
+
- ElevenLabs (free tier): https://elevenlabs.io → Profile → API Keys
|
|
199
|
+
|
|
200
|
+
### Step 2 — Configure OpenClaw
|
|
201
|
+
[exact env vars and openclaw.json snippet]
|
|
202
|
+
|
|
203
|
+
### Step 3 — Flash your ESP32
|
|
204
|
+
[exact OTA URL to point firmware at]
|
|
205
|
+
|
|
206
|
+
## Configuration Reference
|
|
207
|
+
[full table of all options]
|
|
208
|
+
|
|
209
|
+
## How It Works
|
|
210
|
+
[architecture diagram]
|
|
211
|
+
|
|
212
|
+
## Troubleshooting
|
|
213
|
+
[common errors and fixes]
|
|
214
|
+
```
|
|
215
|
+
|
|
216
|
+
**Steps:**
|
|
217
|
+
1. Open `extensions/esp32-voice/README.md`
|
|
218
|
+
2. Add the "Quick Start" section as the very first content after the title
|
|
219
|
+
3. Link to the Deepgram free tier and ElevenLabs free tier pages explicitly
|
|
220
|
+
4. Show the minimum `~/.openclaw/openclaw.json` block (copy from QUICKSTART.md)
|
|
221
|
+
5. Commit: `"docs: rewrite README with 3-step quick start at top"`
|
|
222
|
+
|
|
223
|
+
---
|
|
224
|
+
|
|
225
|
+
### 4.2 — Add `.env.example` file
|
|
226
|
+
|
|
227
|
+
**Why:** Users need to know what env vars to set. A `.env.example` file is the standard way
|
|
228
|
+
to document this without committing real credentials.
|
|
229
|
+
|
|
230
|
+
**Create `extensions/esp32-voice/.env.example`:**
|
|
231
|
+
```bash
|
|
232
|
+
# Required — get free API key at https://console.deepgram.com
|
|
233
|
+
DEEPGRAM_API_KEY=<your-deepgram-api-key>
|
|
234
|
+
|
|
235
|
+
# Required — get free API key at https://elevenlabs.io
|
|
236
|
+
ELEVENLABS_API_KEY=<your-elevenlabs-api-key>
|
|
237
|
+
|
|
238
|
+
# Optional — find voice IDs at https://elevenlabs.io/voice-library
|
|
239
|
+
ELEVENLABS_VOICE_ID=21m00Tcm4TlvDq8ikWAM
|
|
240
|
+
|
|
241
|
+
# Optional — override default ElevenLabs model
|
|
242
|
+
ELEVENLABS_MODEL_ID=eleven_turbo_v2_5
|
|
243
|
+
|
|
244
|
+
# Optional — override default Deepgram model
|
|
245
|
+
DEEPGRAM_MODEL=nova-2
|
|
246
|
+
|
|
247
|
+
# Optional — port for ESP32 voice WebSocket server (default: 8765)
|
|
248
|
+
ESP32_VOICE_PORT=8765
|
|
249
|
+
|
|
250
|
+
# Required for OTA server — copy from ~/.openclaw/openclaw.json or your gateway setup
|
|
251
|
+
GATEWAY_TOKEN=<your-openclaw-gateway-token>
|
|
252
|
+
|
|
253
|
+
# Optional — override OpenClaw gateway URL (default: ws://127.0.0.1:18789)
|
|
254
|
+
OPENCLAW_GATEWAY_URL=ws://127.0.0.1:18789
|
|
255
|
+
```
|
|
256
|
+
|
|
257
|
+
**Steps:**
|
|
258
|
+
1. Create the file at `extensions/esp32-voice/.env.example` with the content above
|
|
259
|
+
2. Make sure `.env.example` is in the `files` array in `package.json`
|
|
260
|
+
3. Add a note in README: "Copy `.env.example` to `~/.openclaw/.env` and fill in your keys"
|
|
261
|
+
4. Commit: `"docs: add .env.example with all required and optional env vars"`
|
|
262
|
+
|
|
263
|
+
---
|
|
264
|
+
|
|
265
|
+
## 🟡 SECTION 5 — Reliability Improvements
|
|
266
|
+
|
|
267
|
+
### 5.1 — Persist OTP pairing across restarts
|
|
268
|
+
|
|
269
|
+
**Why:** The OTP pairing system stores approved devices in memory only. If OpenClaw restarts,
|
|
270
|
+
all paired devices need to be re-paired. This is annoying for users with always-on ESP32 devices.
|
|
271
|
+
|
|
272
|
+
**Current code in `src/device/device-otp.ts`:**
|
|
273
|
+
```typescript
|
|
274
|
+
private pairedDevices: Map<string, PairedDevice> = new Map(); // in-memory only
|
|
275
|
+
```
|
|
276
|
+
|
|
277
|
+
**Steps:**
|
|
278
|
+
1. Open `extensions/esp32-voice/src/device/device-otp.ts`
|
|
279
|
+
2. Add persistence using a JSON file at `~/.openclaw/esp32-voice-devices.json`
|
|
280
|
+
3. On `DeviceOtpManager` construction, load the file if it exists
|
|
281
|
+
4. On successful pairing, write the updated map back to the file
|
|
282
|
+
5. On OpenClaw restart, previously paired devices are immediately trusted without re-pairing
|
|
283
|
+
6. Commit: `"feat: persist paired devices to disk so pairing survives gateway restarts"`
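
A minimal sketch of steps 2 to 4, assuming a simplified record shape. `PairedDeviceRecord`, `loadPairedDevices`, and `savePairedDevices` are hypothetical names; the real `PairedDevice` type and `DeviceOtpManager` internals are not shown in this TODO and may differ. The demo writes to a throwaway path so the real `~/.openclaw/esp32-voice-devices.json` is not touched.

```typescript
import { existsSync, mkdirSync, readFileSync, renameSync, writeFileSync } from "node:fs";
import { homedir, tmpdir } from "node:os";
import { dirname, join } from "node:path";

// Assumed shape for illustration only.
interface PairedDeviceRecord {
  deviceId: string;
  pairedAt: number;
}

// Location proposed in step 2.
const DEFAULT_FILE = join(homedir(), ".openclaw", "esp32-voice-devices.json");

// Load the map, returning an empty one if the file is missing or unreadable.
function loadPairedDevices(file: string): Map<string, PairedDeviceRecord> {
  const map = new Map<string, PairedDeviceRecord>();
  if (!existsSync(file)) return map;
  try {
    const parsed = JSON.parse(readFileSync(file, "utf8")) as PairedDeviceRecord[];
    for (const device of parsed) map.set(device.deviceId, device);
  } catch {
    map.clear(); // corrupt file: start empty rather than crash the plugin
  }
  return map;
}

// Write to a temp file first, then rename, so a crash cannot leave a half-written file.
function savePairedDevices(file: string, devices: Map<string, PairedDeviceRecord>): void {
  mkdirSync(dirname(file), { recursive: true });
  const tmp = `${file}.tmp`;
  writeFileSync(tmp, JSON.stringify(Array.from(devices.values()), null, 2), { mode: 0o600 });
  renameSync(tmp, file);
}

// Round-trip demo against a throwaway path.
const demoFile = join(tmpdir(), `esp32-voice-demo-${process.pid}`, "devices.json");
const devices = new Map<string, PairedDeviceRecord>();
devices.set("aa:bb:cc:dd:ee:ff", { deviceId: "aa:bb:cc:dd:ee:ff", pairedAt: Date.now() });
savePairedDevices(demoFile, devices);
const restored = loadPairedDevices(demoFile);
```

In the plugin, the constructor would call `loadPairedDevices(DEFAULT_FILE)` and the pairing path would call `savePairedDevices` after each successful pairing.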
|
|
284
|
+
|
|
285
|
+
---
|
|
286
|
+
|
|
287
|
+
### 5.2 — Add rate limiting to OTP verification
|
|
288
|
+
|
|
289
|
+
**Why:** The OTP is 6 digits (1,000,000 possible values). Without rate limiting, an attacker
|
|
290
|
+
on the same network could brute-force the OTP in minutes. The OTA HTTP endpoint and the
|
|
291
|
+
voice WebSocket both need protection.
|
|
292
|
+
|
|
293
|
+
**Steps:**
|
|
294
|
+
1. Open `extensions/esp32-voice/src/device/device-otp.ts`
|
|
295
|
+
2. Add a failed-attempt counter per source IP
|
|
296
|
+
3. After 5 failed attempts from the same IP, block for 15 minutes
|
|
297
|
+
4. Log blocked attempts at `warn` level
|
|
298
|
+
5. Commit: `"security: add rate limiting to OTP verification (5 attempts then 15min block)"`
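
The policy in steps 2 and 3 could look roughly like this. It is a sketch under assumptions: `createOtpRateLimiter` and its method names are hypothetical, and the optional `now` parameter exists only so the timing can be exercised without waiting 15 minutes.

```typescript
interface AttemptState {
  failures: number;
  blockedUntil: number; // epoch milliseconds; 0 means not blocked
}

// Hypothetical limiter: 5 failed attempts per source IP, then a 15-minute block.
function createOtpRateLimiter(maxFailures = 5, blockMs = 15 * 60 * 1000) {
  const attempts = new Map<string, AttemptState>();
  return {
    // Call before verifying a code; reject the request when this returns true.
    isBlocked(ip: string, now: number = Date.now()): boolean {
      const state = attempts.get(ip);
      return state !== undefined && now < state.blockedUntil;
    },
    // Call after a wrong code.
    recordFailure(ip: string, now: number = Date.now()): void {
      const state = attempts.get(ip) ?? { failures: 0, blockedUntil: 0 };
      state.failures += 1;
      if (state.failures >= maxFailures) {
        state.blockedUntil = now + blockMs;
        state.failures = 0; // counter restarts once the block expires
      }
      attempts.set(ip, state);
    },
    // Call after a correct code so earlier typos are forgotten.
    recordSuccess(ip: string): void {
      attempts.delete(ip);
    },
  };
}

const limiter = createOtpRateLimiter();
const t0 = 1_000;
for (let i = 0; i < 4; i++) limiter.recordFailure("192.168.1.50", t0);
const blockedAfterFour = limiter.isBlocked("192.168.1.50", t0);
limiter.recordFailure("192.168.1.50", t0);
const blockedAfterFive = limiter.isBlocked("192.168.1.50", t0);
```

The `warn`-level log from step 4 would go wherever `isBlocked` returns true.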
|
|
299
|
+
|
|
300
|
+
---
|
|
301
|
+
|
|
302
|
+
### 5.3 — Handle Gateway reconnection gracefully
|
|
303
|
+
|
|
304
|
+
**Why:** Currently if the OpenClaw Gateway drops the WebSocket connection (restart, timeout,
|
|
305
|
+
network hiccup), the plugin's `openclawConnected` flag goes false and stays false until the
|
|
306
|
+
ESP32 session is restarted. The Gateway reconnect should happen automatically in the background.
|
|
307
|
+
|
|
308
|
+
**Steps:**
|
|
309
|
+
1. Open `extensions/esp32-voice/src/voice/voice-session.ts`
|
|
310
|
+
2. In the `openclawWs.on("close", ...)` handler, instead of just setting `openclawConnected = false`,
|
|
311
|
+
schedule a reconnect after 3 seconds:
|
|
312
|
+
```typescript
|
|
313
|
+
this.openclawWs.on("close", () => {
|
|
314
|
+
this.openclawConnected = false;
|
|
315
|
+
this.log("info", "OpenClaw disconnected — reconnecting in 3s...");
|
|
316
|
+
setTimeout(() => {
|
|
317
|
+
this.connectToOpenClaw().catch((err) =>
|
|
318
|
+
this.log("error", `Reconnect failed: ${err}`)
|
|
319
|
+
);
|
|
320
|
+
}, 3000);
|
|
321
|
+
});
|
|
322
|
+
```
|
|
323
|
+
3. Add an exponential backoff: 3s → 6s → 12s → 30s (cap at 30s)
|
|
324
|
+
4. Stop retrying after session `cleanup()` is called
|
|
325
|
+
5. Commit: `"feat: auto-reconnect to OpenClaw Gateway on disconnect"`
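
Steps 3 and 4 can be sketched as below, assuming plain doubling from 3 s with a 30 s cap (which passes through 24 s before reaching the cap). `backoffDelayMs` and `createReconnector` are hypothetical names; `connect` stands in for `connectToOpenClaw()` and `stop()` is what `cleanup()` would call.

```typescript
// Delay before reconnect attempt number `attempt` (0-based): 3s, 6s, 12s, 24s, then 30s.
function backoffDelayMs(attempt: number, baseMs = 3000, capMs = 30000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

// Hypothetical wrapper that retries `connect` until it succeeds or stop() is called.
function createReconnector(connect: () => Promise<void>, log: (message: string) => void) {
  let attempt = 0;
  let stopped = false;
  let timer: ReturnType<typeof setTimeout> | undefined;

  function schedule(): void {
    if (stopped) return;
    const delay = backoffDelayMs(attempt);
    attempt += 1;
    log(`Reconnecting in ${delay / 1000}s...`);
    timer = setTimeout(() => {
      connect().then(
        () => {
          attempt = 0; // success: the next disconnect starts again at 3s
        },
        (err) => {
          log(`Reconnect failed: ${err}`);
          schedule();
        },
      );
    }, delay);
  }

  return {
    schedule, // call from the "close" handler
    stop(): void {
      stopped = true;
      if (timer !== undefined) clearTimeout(timer);
    },
  };
}

const delays = [0, 1, 2, 3, 4, 5].map((n) => backoffDelayMs(n));
```

If a failed `connectToOpenClaw()` also emits `close`, only one of the two paths should call `schedule()`, otherwise retries would be scheduled twice.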
|
|
326
|
+
|
|
327
|
+
---
|
|
328
|
+
|
|
329
|
+
## 🟢 SECTION 6 — Publishing to npm
|
|
330
|
+
|
|
331
|
+
### 6.1 — Final pre-publish checklist
|
|
332
|
+
|
|
333
|
+
Run through this checklist in order before running `npm publish`:
|
|
334
|
+
|
|
335
|
+
- [ ] All items in Section 1 (Security) are done
|
|
336
|
+
- [ ] All items in Section 2 (package.json) are done
|
|
337
|
+
- [ ] `SETUP.md` has no real credentials
|
|
338
|
+
- [ ] `ota-server.js` has no hardcoded tokens
|
|
339
|
+
- [ ] `@discordjs/opus` is removed from `package.json`
|
|
340
|
+
- [ ] Version is CalVer format (e.g. `2026.2.21`)
|
|
341
|
+
- [ ] `ota-server.js` is in the `files` array
|
|
342
|
+
- [ ] `.env.example` is in the `files` array
|
|
343
|
+
- [ ] `README.md` has a clear Quick Start section at the top
|
|
344
|
+
- [ ] Run `npm pack --dry-run` and check the file list — no `node_modules/`, no `.env`, no real credentials
|
|
345
|
+
- [ ] Test install in a clean directory from the packed tarball: run `npm pack`, then `mkdir /tmp/test-install && cd /tmp/test-install && npm install <path-to-generated-.tgz>`
|
|
346
|
+
|
|
347
|
+
### 6.2 — Publish
|
|
348
|
+
|
|
349
|
+
```bash
|
|
350
|
+
cd extensions/esp32-voice
|
|
351
|
+
npm login # login to npm with your account
|
|
352
|
+
npm publish --access public
|
|
353
|
+
```
|
|
354
|
+
|
|
355
|
+
After publishing, verify:
|
|
356
|
+
```bash
|
|
357
|
+
npm info @cheeko-ai/esp32-voice
|
|
358
|
+
```
|
|
359
|
+
|
|
360
|
+
---
|
|
361
|
+
|
|
362
|
+
## 🟢 SECTION 7 — Future Enhancements (Post-Launch)
|
|
363
|
+
|
|
364
|
+
These are not blockers but would significantly improve the plugin:
|
|
365
|
+
|
|
366
|
+
| # | Enhancement | Effort | Impact |
|
|
367
|
+
|---|---|---|---|
|
|
368
|
+
| 7.1 | Add Google STT provider (`src/stt/google.ts`) | Medium | High — alternative to Deepgram |
|
|
369
|
+
| 7.2 | Add OpenAI Whisper STT provider | Medium | High — popular, good accuracy |
|
|
370
|
+
| 7.3 | Add Azure TTS provider | Medium | Medium — enterprise users |
|
|
371
|
+
| 7.4 | Add support for multiple simultaneous ESP32 devices per account | Medium | High |
|
|
372
|
+
| 7.5 | Add WebRTC transport option (lower latency than WebSocket+Opus) | High | Medium |
|
|
373
|
+
| 7.6 | Streaming TTS to ESP32 before full LLM response is ready | High | High — reduces perceived latency |
|
|
374
|
+
| 7.7 | Wake-word detection passthrough from ESP32 | Medium | Medium |
|
|
375
|
+
| 7.8 | Add unit tests for STT/TTS registry, frame pacing, JSON-in-binary detection | Medium | High |
|
|
376
|
+
| 7.9 | CI/CD pipeline for the extension (GitHub Actions) | Low | Medium |
|
|
377
|
+
| 7.10 | Support for Zalo/Line/Telegram as voice backends (not just OpenClaw main session) | High | Medium |
|
|
378
|
+
|
|
379
|
+
---
|
|
380
|
+
|
|
381
|
+
## Quick Reference — Architecture
|
|
382
|
+
|
|
383
|
+
```
|
|
384
|
+
ESP32 (XiaoZhi firmware)
|
|
385
|
+
│
|
|
386
|
+
│ WebSocket ws://<your-ip>:8765/
|
|
387
|
+
▼
|
|
388
|
+
[esp32-voice plugin — port 8765]
|
|
389
|
+
│ STT: Opus frames → Deepgram → transcript
|
|
390
|
+
│ LLM: transcript → OpenClaw Gateway → response text
|
|
391
|
+
│ TTS: response text → ElevenLabs → PCM → Opus frames
|
|
392
|
+
│
|
|
393
|
+
│ WebSocket ws://127.0.0.1:18789
|
|
394
|
+
▼
|
|
395
|
+
[OpenClaw Gateway — port 18789]
|
|
396
|
+
│
|
|
397
|
+
▼
|
|
398
|
+
[AI Model — Gemini / Claude / GPT]
|
|
399
|
+
```
|
|
400
|
+
|
|
401
|
+
**Key files:**
|
|
402
|
+
- `src/voice/voice-session.ts` — main pipeline orchestrator (STT → LLM → TTS)
|
|
403
|
+
- `src/voice/voice-endpoint.ts` — standalone WebSocket server on port 8765
|
|
404
|
+
- `src/stt/deepgram.ts` — Deepgram STT (VAD + streaming)
|
|
405
|
+
- `src/tts/elevenlabs.ts` — ElevenLabs TTS (serialized audio chain)
|
|
406
|
+
- `src/device/device-otp.ts` — OTP pairing system
|
|
407
|
+
- `ota-server.js` — standalone OTA config server for XiaoZhi firmware
|
|
408
|
+
- `index.ts` — OpenClaw plugin entry point
|
|
409
|
+
|
|
410
|
+
**Known quirks solved (do not revert):**
|
|
411
|
+
- XiaoZhi sends ALL WebSocket messages as binary frames — even JSON. Detection: check if binary frame starts with `0x7b` (`{`) before treating as audio.
|
|
412
|
+
- `@discordjs/opus` prebuilt binaries are rejected by macOS Gatekeeper. Use `opusscript` (pure WASM) instead.
|
|
413
|
+
- ElevenLabs `onAudio` callback must be chained (not fire-and-forget) so Opus frame pacing is respected and sentences play sequentially, not simultaneously.
|
|
414
|
+
- Frame pacing anchor: `nextFrameAt` must be set at the moment the **first** frame is sent, not at function entry — otherwise TTS connection time is counted as debt and early frames are sent with no delay.
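
The first quirk above can be sketched as follows. `classifyBinaryFrame` is a hypothetical name and this is not the code in `voice-session.ts`; the fallback to audio on a parse failure is an added assumption, since an audio frame could in principle begin with the same byte.

```typescript
import { Buffer } from "node:buffer";

type ClassifiedFrame =
  | { kind: "json"; message: unknown }
  | { kind: "audio"; frame: Uint8Array };

// Hypothetical classifier for a binary WebSocket frame received from the device.
function classifyBinaryFrame(data: Uint8Array): ClassifiedFrame {
  // 0x7b is "{", the first byte of a JSON object.
  if (data.length > 0 && data[0] === 0x7b) {
    try {
      const text = Buffer.from(data).toString("utf8");
      return { kind: "json", message: JSON.parse(text) };
    } catch {
      // Starts with "{" but does not parse: treat it as audio below.
    }
  }
  return { kind: "audio", frame: data };
}

const control = classifyBinaryFrame(Buffer.from('{"type":"hello"}'));
const audio = classifyBinaryFrame(new Uint8Array([0x78, 0x01, 0x02]));
const lookalike = classifyBinaryFrame(new Uint8Array([0x7b, 0xff, 0x00]));
```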
|
|
415
|
+
|
|
416
|
+
---
|
|
417
|
+
|
|
418
|
+
*Last updated: 2026-02-21 | Plugin version: 1.0.0 (pre-release)*
|
package/index.ts
ADDED
|
@@ -0,0 +1,128 @@
|
|
|
1
|
+
import type { OpenClawPluginApi } from "openclaw/plugin-sdk";
|
|
2
|
+
import { emptyPluginConfigSchema } from "openclaw/plugin-sdk";
|
|
3
|
+
import { esp32VoicePlugin } from "./src/channel.js";
|
|
4
|
+
import { setEsp32VoiceRuntime } from "./src/runtime.js";
|
|
5
|
+
import { startStandaloneVoiceServer } from "./src/voice/voice-endpoint.js";
|
|
6
|
+
|
|
7
|
+
// Import STT/TTS providers to trigger auto-registration with the registries
|
|
8
|
+
import "./src/stt/deepgram.js";
|
|
9
|
+
import "./src/tts/elevenlabs.js";
|
|
10
|
+
|
|
11
|
+
// Fall back to 8765 when the env var is unset, empty, or not a number.
const VOICE_PORT = Number.parseInt(process.env.ESP32_VOICE_PORT ?? "", 10) || 8765;
|
|
12
|
+
|
|
13
|
+
const plugin = {
|
|
14
|
+
id: "esp32-voice",
|
|
15
|
+
name: "ESP32 Voice",
|
|
16
|
+
description:
|
|
17
|
+
"ESP32 Voice device channel — voice-to-text-to-voice with pluggable STT/TTS providers",
|
|
18
|
+
configSchema: emptyPluginConfigSchema(),
|
|
19
|
+
register(api: OpenClawPluginApi) {
|
|
20
|
+
setEsp32VoiceRuntime(api.runtime);
|
|
21
|
+
|
|
22
|
+
// Register the ESP32 Voice channel
|
|
23
|
+
api.registerChannel({ plugin: esp32VoicePlugin });
|
|
24
|
+
|
|
25
|
+
// ── Gateway HTTP routes (non-WS utilities) ────────────────────
|
|
26
|
+
// These are registered now so they work once the gateway starts.
|
|
27
|
+
// The actual port is fixed at VOICE_PORT — routes reference it by closure.
|
|
28
|
+
|
|
29
|
+
// Info route (tells callers this path needs a WS connection on the voice port)
|
|
30
|
+
api.registerHttpRoute({
|
|
31
|
+
path: "/__openclaw__/esp32-voice/stream",
|
|
32
|
+
handler: (_req, res) => {
|
|
33
|
+
res.writeHead(200, { "Content-Type": "application/json" });
|
|
34
|
+
res.end(
|
|
35
|
+
JSON.stringify({
|
|
36
|
+
service: "esp32-voice",
|
|
37
|
+
type: "websocket",
|
|
38
|
+
hint: `Connect your ESP32 via WebSocket to ws://<your-ip>:${VOICE_PORT}/`,
|
|
39
|
+
voicePort: VOICE_PORT,
|
|
40
|
+
}),
|
|
41
|
+
);
|
|
42
|
+
},
|
|
43
|
+
});
|
|
44
|
+
|
|
45
|
+
// Health endpoint (via Gateway port for convenience)
|
|
46
|
+
api.registerHttpRoute({
|
|
47
|
+
path: "/__openclaw__/esp32-voice/health",
|
|
48
|
+
handler: (_req, res) => {
|
|
49
|
+
res.writeHead(200, { "Content-Type": "application/json" });
|
|
50
|
+
res.end(
|
|
51
|
+
JSON.stringify({
|
|
52
|
+
ok: true,
|
|
53
|
+
service: "esp32-voice",
|
|
54
|
+
voicePort: VOICE_PORT,
|
|
55
|
+
voiceWsUrl: `ws://<your-ip>:${VOICE_PORT}/`,
|
|
56
|
+
sttConfigured: Boolean(
|
|
57
|
+
process.env.DEEPGRAM_API_KEY ||
|
|
58
|
+
api.config?.channels?.esp32voice?.sttApiKey,
|
|
59
|
+
),
|
|
60
|
+
ttsConfigured: Boolean(
|
|
61
|
+
process.env.ELEVENLABS_API_KEY ||
|
|
62
|
+
process.env.XI_API_KEY ||
|
|
63
|
+
api.config?.channels?.esp32voice?.ttsApiKey,
|
|
64
|
+
),
|
|
65
|
+
}),
|
|
66
|
+
);
|
|
67
|
+
},
|
|
68
|
+
});
|
|
69
|
+
|
|
70
|
+
// OTP generation endpoint
|
|
71
|
+
api.registerHttpRoute({
|
|
72
|
+
path: "/__openclaw__/esp32-voice/otp",
|
|
73
|
+
handler: async (_req, res) => {
|
|
74
|
+
const { deviceOtpManager } = await import("./src/device/device-otp.js");
|
|
75
|
+
const code = deviceOtpManager.generateOtp();
|
|
76
|
+
res.writeHead(200, { "Content-Type": "application/json" });
|
|
77
|
+
res.end(JSON.stringify({ code, expiresInSeconds: 300 }));
|
|
78
|
+
},
|
|
79
|
+
});
|
|
80
|
+
|
|
81
|
+
// Paired devices listing
|
|
82
|
+
api.registerHttpRoute({
|
|
83
|
+
path: "/__openclaw__/esp32-voice/devices",
|
|
84
|
+
handler: async (_req, res) => {
|
|
85
|
+
const { deviceOtpManager } = await import("./src/device/device-otp.js");
|
|
86
|
+
const devices = deviceOtpManager.listPairedDevices();
|
|
87
|
+
res.writeHead(200, { "Content-Type": "application/json" });
|
|
88
|
+
res.end(JSON.stringify({ devices }));
|
|
89
|
+
},
|
|
90
|
+
});
|
|
91
|
+
|
|
92
|
+
// ── Standalone Voice WebSocket Server (gateway-only service) ──
|
|
93
|
+
//
|
|
94
|
+
// The OpenClaw Gateway plugin API does NOT support WebSocket upgrade
|
|
95
|
+
// registration — registerHttpRoute() only handles regular HTTP requests.
|
|
96
|
+
// When an ESP32 tries to upgrade to WebSocket on the Gateway port (18789),
|
|
97
|
+
// the Gateway's own upgrade handler intercepts it and routes it to the
|
|
98
|
+
// Gateway's internal WS server instead of this plugin.
|
|
99
|
+
//
|
|
100
|
+
// Solution: spin up a dedicated HTTP server on a separate port (8765).
|
|
101
|
+
// The ESP32 firmware connects directly to this server. No core changes needed.
|
|
102
|
+
//
|
|
103
|
+
// Registered as a SERVICE so it only starts when the gateway starts,
|
|
104
|
+
// NOT during CLI commands like `channels add` (which would conflict
|
|
105
|
+
// with any already-running gateway on the same port).
|
|
106
|
+
//
|
|
107
|
+
api.registerService({
|
|
108
|
+
id: "esp32-voice-server",
|
|
109
|
+
start: async () => {
|
|
110
|
+
const { port } = startStandaloneVoiceServer(VOICE_PORT);
|
|
111
|
+
console.log("[esp32voice] Plugin registered successfully");
|
|
112
|
+
console.log(`[esp32voice] Voice WebSocket (standalone): ws://0.0.0.0:${port}/`);
|
|
113
|
+
console.log(`[esp32voice] Point your ESP32 to: ws://<your-ip>:${port}/`);
|
|
114
|
+
console.log(`[esp32voice] Health check (Gateway port): http://<gateway>/__openclaw__/esp32-voice/health`);
|
|
115
|
+
console.log(`[esp32voice] Generate OTP: http://<gateway>/__openclaw__/esp32-voice/otp`);
|
|
116
|
+
},
|
|
117
|
+
});
|
|
118
|
+
},
|
|
119
|
+
};
|
|
120
|
+
|
|
121
|
+
export default plugin;
|
|
122
|
+
|
|
123
|
+
// Exports for consumers / third-party provider plugins
|
|
124
|
+
export { startStandaloneVoiceServer };
|
|
125
|
+
export { sttRegistry } from "./src/stt/stt-registry.js";
|
|
126
|
+
export { ttsRegistry } from "./src/tts/tts-registry.js";
|
|
127
|
+
export type { SttProvider, SttProviderConfig, SttProviderMeta } from "./src/stt/stt-provider.js";
|
|
128
|
+
export type { TtsProvider, TtsProviderConfig, TtsProviderMeta } from "./src/tts/tts-provider.js";
|
package/package.json
ADDED
|
@@ -0,0 +1,62 @@
|
|
|
1
|
+
{
|
|
2
|
+
"name": "@cheeko-ai/esp32-voice",
|
|
3
|
+
"version": "2026.2.21",
|
|
4
|
+
"private": false,
|
|
5
|
+
"description": "OpenClaw ESP32 Voice channel plugin — voice-to-text-to-voice device integration with pluggable STT/TTS providers",
|
|
6
|
+
"type": "module",
|
|
7
|
+
"license": "MIT",
|
|
8
|
+
"main": "index.ts",
|
|
9
|
+
"files": [
|
|
10
|
+
"index.ts",
|
|
11
|
+
"src/",
|
|
12
|
+
"openclaw.plugin.json",
|
|
13
|
+
"README.md",
|
|
14
|
+
"TODO.md",
|
|
15
|
+
"NPM_PUBLISH_READINESS.md"
|
|
16
|
+
],
|
|
17
|
+
"keywords": [
|
|
18
|
+
"openclaw",
|
|
19
|
+
"esp32",
|
|
20
|
+
"voice",
|
|
21
|
+
"stt",
|
|
22
|
+
"tts",
|
|
23
|
+
"deepgram",
|
|
24
|
+
"elevenlabs",
|
|
25
|
+
"iot",
|
|
26
|
+
"speech-to-text",
|
|
27
|
+
"text-to-speech"
|
|
28
|
+
],
|
|
29
|
+
"repository": {
|
|
30
|
+
"type": "git",
|
|
31
|
+
"url": "https://github.com/openclaw/openclaw",
|
|
32
|
+
"directory": "extensions/esp32-voice"
|
|
33
|
+
},
|
|
34
|
+
"dependencies": {
|
|
35
|
+
"opusscript": "^0.0.8",
|
|
36
|
+
"ws": "^8.18.0",
|
|
37
|
+
"zod": "^4.3.6"
|
|
38
|
+
},
|
|
39
|
+
"devDependencies": {
|
|
40
|
+
"@types/ws": "^8.5.12",
|
|
41
|
+
"openclaw": ">=2026.1.0"
|
|
42
|
+
},
|
|
43
|
+
"openclaw": {
|
|
44
|
+
"extensions": [
|
|
45
|
+
"./index.ts"
|
|
46
|
+
],
|
|
47
|
+
"channel": {
|
|
48
|
+
"id": "esp32voice",
|
|
49
|
+
"label": "ESP32 Voice",
|
|
50
|
+
"selectionLabel": "ESP32 Voice (plugin)",
|
|
51
|
+
"docsPath": "/channels/esp32-voice",
|
|
52
|
+
"docsLabel": "esp32-voice",
|
|
53
|
+
"blurb": "ESP32 voice device channel — speech-to-text-to-speech with pluggable STT/TTS providers (Deepgram, ElevenLabs, and more).",
|
|
54
|
+
"order": 90
|
|
55
|
+
},
|
|
56
|
+
"install": {
|
|
57
|
+
"npmSpec": "@cheeko-ai/esp32-voice",
|
|
58
|
+
"localPath": "extensions/esp32-voice",
|
|
59
|
+
"defaultChoice": "npm"
|
|
60
|
+
}
|
|
61
|
+
}
|
|
62
|
+
}
|