@madeinoz67/voice-server 0.1.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (86) hide show
  1. package/.claude/commands/speckit.analyze.md +184 -0
  2. package/.claude/commands/speckit.checklist.md +294 -0
  3. package/.claude/commands/speckit.clarify.md +181 -0
  4. package/.claude/commands/speckit.constitution.md +82 -0
  5. package/.claude/commands/speckit.implement.md +135 -0
  6. package/.claude/commands/speckit.plan.md +89 -0
  7. package/.claude/commands/speckit.specify.md +258 -0
  8. package/.claude/commands/speckit.tasks.md +137 -0
  9. package/.claude/commands/speckit.taskstoissues.md +30 -0
  10. package/.claude/settings.local.json +23 -0
  11. package/.codanna/settings.toml +384 -0
  12. package/.env.development +18 -0
  13. package/.env.example +30 -0
  14. package/.github/codeql/config.yml +13 -0
  15. package/.github/codeql.yml +30 -0
  16. package/.github/dependabot.yml +11 -0
  17. package/.github/workflows/ci.yml +308 -0
  18. package/.specify/memory/constitution.md +223 -0
  19. package/.specify/scripts/bash/check-prerequisites.sh +166 -0
  20. package/.specify/scripts/bash/common.sh +156 -0
  21. package/.specify/scripts/bash/create-new-feature.sh +297 -0
  22. package/.specify/scripts/bash/setup-plan.sh +61 -0
  23. package/.specify/scripts/bash/update-agent-context.sh +799 -0
  24. package/.specify/templates/agent-file-template.md +28 -0
  25. package/.specify/templates/checklist-template.md +40 -0
  26. package/.specify/templates/plan-template.md +106 -0
  27. package/.specify/templates/spec-template.md +115 -0
  28. package/.specify/templates/tasks-template.md +261 -0
  29. package/AGENTPERSONALITIES.md +233 -0
  30. package/ATTRIBUTION.md +70 -0
  31. package/CHANGELOG.md +90 -0
  32. package/CLAUDE.md +50 -0
  33. package/Formula/madeinoz-voice-server.rb +106 -0
  34. package/README.md +451 -0
  35. package/bun.lock +212 -0
  36. package/cliff.toml +67 -0
  37. package/docs/KOKORO_VOICES.md +152 -0
  38. package/docs/MIGRATION.md +267 -0
  39. package/docs/VOICE_EXAMPLES.md +283 -0
  40. package/docs/VOICE_GUIDE.md +227 -0
  41. package/docs/VOICE_QUICK_REF.md +157 -0
  42. package/docs/agent-voices.md +114 -0
  43. package/docs/api.md +336 -0
  44. package/docs/assets/voice-server-architecture.png +0 -0
  45. package/docs/assets/voice-server-header.png +0 -0
  46. package/docs/assets/voice-server-pack-logo.png +0 -0
  47. package/docs/index.md +60 -0
  48. package/eslint.config.js +42 -0
  49. package/mkdocs.yml +55 -0
  50. package/package.json +28 -0
  51. package/reports/MLX_AUDIO_EVALUATION.md +302 -0
  52. package/reports/agent/2026-02-06-20-51-mlx-audio-qwen-tts-investigation.md +613 -0
  53. package/reports/agent/2026-02-06-Qwen3-TTS-API-Specification.md +446 -0
  54. package/reports/agent/2026-02-07-python-backend-removal-plan.md +790 -0
  55. package/scripts/generate-reference.ts +139 -0
  56. package/specs/001-qwen-tts/checklists/requirements.md +50 -0
  57. package/specs/001-qwen-tts/contracts/api.yaml +305 -0
  58. package/specs/001-qwen-tts/data-model.md +197 -0
  59. package/specs/001-qwen-tts/plan.md +236 -0
  60. package/specs/001-qwen-tts/quickstart.md +306 -0
  61. package/specs/001-qwen-tts/research.md +194 -0
  62. package/specs/001-qwen-tts/spec.md +135 -0
  63. package/specs/001-qwen-tts/tasks.md +305 -0
  64. package/src/ts/constants/KOKORO_VOICES.ts +141 -0
  65. package/src/ts/middleware/cors.ts +153 -0
  66. package/src/ts/middleware/rate-limiter.ts +200 -0
  67. package/src/ts/models/health.ts +45 -0
  68. package/src/ts/models/notification.ts +69 -0
  69. package/src/ts/models/pronunciation.ts +39 -0
  70. package/src/ts/models/tts.ts +54 -0
  71. package/src/ts/models/voice-config.ts +82 -0
  72. package/src/ts/server.ts +460 -0
  73. package/src/ts/services/mlx-tts-client.ts +337 -0
  74. package/src/ts/services/pronunciation.ts +209 -0
  75. package/src/ts/services/prosody-translator.ts +130 -0
  76. package/src/ts/services/voice-loader.ts +214 -0
  77. package/src/ts/utils/logger.ts +144 -0
  78. package/src/ts/utils/text-sanitizer.ts +118 -0
  79. package/tests/integration/api.test.ts +210 -0
  80. package/tests/mocks/index.ts +152 -0
  81. package/tests/ts/server.test.ts +11 -0
  82. package/tests/unit/middleware/cors.test.ts +146 -0
  83. package/tests/unit/models/validation.test.ts +332 -0
  84. package/tests/unit/services/pronunciation.test.ts +171 -0
  85. package/tests/unit/services/prosody-translator.test.ts +142 -0
  86. package/tsconfig.json +25 -0
@@ -0,0 +1,302 @@
1
+ # MLX-Audio Evaluation Report
2
+
3
+ **Date:** 2026-02-06
4
+ **Purpose:** Evaluate MLX-audio as alternative to qwen-tts for TTS inference
5
+
6
+ **Python 3.13 Test Results:** ✅ **MAJOR PROGRESS**
7
+
8
+ ## Summary
9
+
10
+ Initial testing revealed Python 3.14 compatibility issues. Testing with Python 3.13 shows significant improvement - MLX-audio imports and loads models successfully, though synthesis testing revealed additional challenges.
11
+
12
+ MLX-audio installation and evaluation revealed several challenges that make direct migration complex at this time.
13
+
14
+ ## Findings
15
+
16
+ ### 1. qwen-tts (Current Implementation) ✓ WORKING
17
+
18
+ **Model:** Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign
19
+ **Device:** Apple Metal (MPS) GPU acceleration
20
+
21
+ | Metric | Value |
22
+ |--------|-------|
23
+ | Model Load Time | 21.8s |
24
+ | Load Memory | 1317 MB |
25
+ | Inference Time | 13-16s |
26
+ | Inference Memory | 1300 MB |
27
+ | Real-Time Factor (RTF) | 3.87x |
28
+ | Sample Rate | 24kHz |
29
+
30
+ **API Method:** `generate_voice_design(text, language, speaker, instruct, stream)`
31
+
32
+ ### 2. MLX-Audio Challenges ✗ ISSUES FOUND
33
+
34
+ #### Challenge A: Python 3.14 Compatibility
35
+ - **Issue:** Pydantic V1 incompatible with Python 3.14
36
+ - **Impact:** Kokoro model fails to load
37
+ - **Error:** `Core Pydantic V1 functionality isn't compatible with Python 3.14 or greater`
38
+ - **Status:** Requires Python 3.13 or older
39
+
40
+ #### Challenge B: Qwen3 Model Loading
41
+ - **Issue:** Complex nested config structure not compatible with standard loading
42
+ - **Components:** speaker_encoder, talker, code_predictor, speech_tokenizer
43
+ - **Error:** `478 parameters not in model` when loading with standard approach
44
+ - **Status:** Requires specialized MLX-audio loader that has compatibility issues
45
+
46
+ #### Challenge C: Transformers Compatibility
47
+ - **Issue:** Transformers doesn't recognize `qwen3_tts` model type
48
+ - **Error:** `KeyError: 'qwen3_tts'` in AutoConfig
49
+ - **Workaround:** Load config.json directly
50
+ - **Status:** Partially resolved
51
+
52
+ ## MLX-Audio Models Tested
53
+
54
+ | Model | Status | Issue |
55
+ |-------|--------|-------|
56
+ | mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16 | ✗ Config incompatibility | Complex nested config, 478 extra parameters |
57
+ | mlx-community/Kokoro-82M-bf16 | ✗ Python 3.14 incompatibility | Pydantic V1 error |
58
+
59
+ ## Recommendation
60
+
61
+ **STAY WITH CURRENT qwen-tts IMPLEMENTATION**
62
+
63
+ **Rationale:**
64
+ 1. qwen-tts with MPS is working reliably
65
+ 2. MLX-audio has compatibility issues with current Python version
66
+ 3. MLX-audio Qwen3 models require specialized loading not available in current mlx-audio release
67
+ 4. Migration complexity outweighs potential performance benefits at this time
68
+
69
+ ## Alternative Approaches
70
+
71
+ ### Option 1: Wait for MLX-audio Updates
72
+ - Monitor MLX-audio for Python 3.14 compatibility fixes
73
+ - Wait for official Qwen3-TTS model support
74
+
75
+ ### Option 2: Python Version Downgrade
76
+ - Use Python 3.13 for MLX-audio
77
+ - Requires separate virtual environment
78
+ - Adds operational complexity
79
+
80
+ ### Option 3: Direct MLX Implementation
81
+ - Bypass mlx-audio package
82
+ - Implement MLX Qwen3 model directly
83
+ - Significant development effort required
84
+
85
+ ### Option 4: Current Implementation Optimization
86
+ - Continue using qwen-tts with MPS
87
+ - Implement request caching
88
+ - Add pre-warming for model
89
+ - Consider model quantization
90
+
91
+ ## Performance Notes
92
+
93
+ The current qwen-tts implementation with MPS:
94
+ - **RTF of 3.87x** means 13-16 seconds for 4 seconds of audio
95
+ - This is primarily model inference time, not I/O
96
+ - **True streaming** would require model architecture changes
97
+ - qwen-tts `stream=True` generates complete audio then splits (not true streaming)
98
+
99
+ ## Performance Benchmark Results (NEW - 2026-02-06 21:50)
100
+
101
+ **Breakthrough:** After installing pip in MLX-audio tool's Python environment, we successfully completed a performance benchmark comparing MLX-audio Kokoro vs qwen-tts.
102
+
103
+ ### Benchmark Results
104
+
105
+ | Metric | MLX-audio Kokoro | qwen-tts (MPS) | Advantage |
106
+ |--------|------------------|-----------------|-----------|
107
+ | **Processing Time** | **0.37s** | 13-16s | **35-43x faster** |
108
+ | **Model Load** | ~3-5s (cached) | 21.8s | ~4-5x faster |
109
+ | **Audio Duration** | 3.45s | ~3.5s | Similar |
110
+ | **Real-Time Factor** | **0.11x** | 3.87x | **35x faster** |
111
+ | **Peak Memory** | 1.30GB | 1.3GB | Similar |
112
+ | **Audio Quality** | Good (American female) | Good (Marrvin) | Comparable |
113
+
114
+ ### Key Findings
115
+
116
+ **🚀 MLX-Audio is 35-43x FASTER** than qwen-tts with MPS GPU acceleration!
117
+
118
+ - **Real-Time Factor 0.11x:** MLX-audio generates 3.45s of audio in just 0.37s (faster than real-time!)
119
+ - **qwen-tts RTF 3.87x:** Takes 13-16 seconds to generate 3.5s of audio
120
+ - **Memory usage is comparable:** Both use ~1.3GB peak memory
121
+ - **Audio quality is similar:** Both produce clear, natural speech
122
+
123
+ ### Audio Files
124
+
125
+ - **MLX-audio:** `/tmp/mlx_benchmark_1770385812/mlx_kokoro_000.wav` (162KB)
126
+ - **qwen-tts:** Previous benchmark at `/tmp/qwen_benchmark_qwen.wav`
127
+
128
+ ### Updated Recommendation
129
+
130
+ **MLX-AUDIO IS NOW THE RECOMMENDED CHOICE** for TTS on Apple Silicon:
131
+
132
+ 1. ✅ **Massive performance improvement** (35-43x faster)
133
+ 2. ✅ **Native Apple Silicon optimization** (no PyTorch overhead)
134
+ 3. ✅ **Same memory footprint** (~1.3GB)
135
+ 4. ✅ **Comparable audio quality**
136
+ 5. ✅ **Faster than real-time synthesis** (RTF 0.11x)
137
+
138
+ **Migration Priority:** HIGH - Consider migrating from qwen-tts to MLX-audio for significant performance gains.
139
+
140
+ ---
141
+
142
+ ## Qwen3-TTS Tests with MLX-audio v0.3.1 (NEW - 2026-02-06 06:00)
143
+
144
+ **BREAKTHROUGH: MLX-audio v0.3.1 fixes Qwen3-TTS config compatibility issues!**
145
+
146
+ After upgrading from v0.2.10 to v0.3.1 (installed directly from GitHub), both Qwen3-TTS models now work correctly:
147
+
148
+ ### Qwen3-TTS Performance Results
149
+
150
+ | Model | Processing Time | Audio Duration | RTF | Peak Memory | Status |
151
+ |-------|-----------------|----------------|-----|-------------|--------|
152
+ | **Qwen3-TTS 0.6B Base** | 4.07s | 3.36s | **0.83x** | 4.82GB | ✅ Working |
153
+ | **Qwen3-TTS 1.7B VoiceDesign** | 1.05s | 0.88s | **0.84x** | 5.30GB | ✅ Working |
154
+ | **Kokoro-82M** | 0.37s | 3.45s | **0.11x** | 1.30GB | ✅ Working |
155
+ | **qwen-tts (MPS)** | 13-16s | ~3.5s | 3.87x | 1.3GB | Current |
156
+
157
+ ### Key Findings
158
+
159
+ **Qwen3-TTS Performance:**
160
+ - **4-5x faster** than qwen-tts with MPS (0.83x RTF vs 3.87x RTF)
161
+ - **Same voice characteristics** as current qwen-tts system
162
+ - **Higher memory usage** (4.8-5.3GB) vs Kokoro/qwen-tts (1.3GB)
163
+ - **VoiceDesign model supports natural language voice descriptions**
164
+
165
+ **Performance Comparison:**
166
+ - **Kokoro-82M**: Fastest (0.11x RTF), lowest memory (1.3GB), American female voice
167
+ - **Qwen3-TTS 0.6B**: Good speed (0.83x RTF), same voices as qwen-tts, higher memory
168
+ - **Qwen3-TTS 1.7B**: Similar speed (0.84x RTF), voice customization, highest memory
169
+
170
+ ### Updated Recommendations
171
+
172
+ **By Use Case:**
173
+
174
+ | Use Case | Recommended Model | Reason |
175
+ |----------|-------------------|--------|
176
+ | **Maximum performance** | Kokoro-82M | 35-43x faster, lowest memory |
177
+ | **Same voices as current** | Qwen3-TTS 0.6B/1.7B | 4-5x faster, compatible voices |
178
+ | **Custom voice design** | Qwen3-TTS 1.7B VoiceDesign | Natural language voice descriptions |
179
+
180
+ **Installation:**
181
+ ```bash
182
+ # Install MLX-audio v0.3.1 from GitHub for Qwen3-TTS support
183
+ uv tool install 'mlx-audio @ git+https://github.com/Blaizzy/mlx-audio@v0.3.1'
184
+ ```
185
+
186
+ ---
187
+
188
+ ## Final Conclusion
189
+
190
+ **MLX-AUDIO IS THE CLEAR WINNER** after upgrading to v0.3.1:
191
+
192
+ - **Qwen3-TTS**: 4-5x faster than qwen-tts with same voices
193
+ - **Kokoro-82M**: 35-43x faster than qwen-tts
194
+ - **Both models** generate faster than real-time (RTF < 1.0x)
195
+
196
+ **Issues resolved:**
197
+ - ✅ Python 3.13 compatibility - FIXED
198
+ - ✅ Missing pip module - FIXED
199
+ - ✅ Qwen3-TTS config structure - FIXED (v0.3.1)
200
+ - ✅ Model loading - WORKS
201
+ - ✅ Audio synthesis - WORKS
202
+
203
+ **Recommendation:** Migrate to MLX-audio (Kokoro for speed, Qwen3-TTS for voice compatibility).
204
+
205
+ ---
206
+
207
+ *Previous evaluation findings (superseded):*
208
+ The evaluation below documents the investigation process that led to discovering the pip issue.
209
+
210
+ ---
211
+
212
+ ## Python 3.13 Test Results (NEW - 2026-02-06 21:25)
213
+
214
+ **Hypothesis:** Python 3.13 would resolve the Pydantic V1 compatibility issues preventing MLX-audio from working.
215
+
216
+ ### Test Setup
217
+ - **Environment:** Python 3.13.12 (via uv venv)
218
+ - **Command:** `uv venv --python 3.13 .venv-py313`
219
+ - **Installation:** `uv pip install mlx-audio` (176 packages, successful)
220
+
221
+ ### Results Summary
222
+
223
+ | Test | Python 3.14 | Python 3.13 |
224
+ |------|-------------|--------------|
225
+ | MLX import | ✅ Works | ✅ Works |
226
+ | MLX-audio import | ✅ Works | ✅ Works |
227
+ | Kokoro model load | ❌ Config error | ✅ **3.46s** |
228
+ | Kokoro synthesis | N/A | ❌ **Hangs** |
229
+ | Qwen3-TTS load | ❌ Config error | Not tested |
230
+
231
+ ### Key Findings
232
+
233
+ **✅ SUCCESS: Python 3.13 Resolves Import Issues**
234
+ - MLX-audio imports successfully without Pydantic V1 errors
235
+ - Kokoro model loads in ~3.5 seconds (vs config errors in 3.14)
236
+ - MLX Metal backend is available
237
+
238
+ **❌ REMAINING ISSUE: Synthesis Hangs**
239
+ - Model loads successfully but `model.generate()` hangs indefinitely
240
+ - Issue appears to be in the synthesis/generation pipeline
241
+ - Not a misaki import issue (misaki imports successfully)
242
+ - Not an MLX compute issue (basic MLX operations work)
243
+
244
+ **Root Cause Analysis:**
245
+ The synthesis hanging is likely due to:
246
+ 1. MLX computation graph issue specific to the Kokoro model
247
+ 2. Missing or incompatible dependency in the synthesis pipeline
248
+ 3. Model-specific issue (not general MLX-audio problem)
249
+
250
+ ### Updated Recommendation
251
+
252
+ **PARTIAL PROGRESS:** Python 3.13 resolves import compatibility, but synthesis issues remain.
253
+
254
+ **Next Steps:**
255
+ 1. **Report issue** to MLX-audio GitHub with reproduction case
256
+ 2. **Try different MLX-audio models** (e.g., Qwen3-TTS instead of Kokoro)
257
+ 3. **Wait for MLX-audio updates** addressing Python 3.13 compatibility
258
+ 4. **Stick with current qwen-tts** implementation for now
259
+
260
+ **Python 3.13 Test Conclusion:**
261
+ The Python 3.13 hypothesis was partially correct - it resolved import issues, but revealed a new synthesis-related problem. Further investigation required.
262
+
263
+ ---
264
+
265
+ ## Server Interference Test (NEW - 2026-02-06 21:44)
266
+
267
+ **User Hypothesis:** MLX-audio issues might be caused by the running qwen-tts server (port 7860) interfering with MLX-audio operations.
268
+
269
+ **Test Method:**
270
+ 1. Identified running voice server: `bun run src/ts/server.ts` (PID 71753)
271
+ 2. Stopped voice server process
272
+ 3. Verified port 7860 was free
273
+ 4. Retested MLX-audio Kokoro in clean environment
274
+
275
+ **Result:** ❌ **MLX-AUDIO STILL HANGS**
276
+
277
+ Even with the voice server completely stopped and port 7860 confirmed free, MLX-audio Kokoro model:
278
+ - Loads successfully (3.46s)
279
+ - Begins synthesis
280
+ - **Hangs at same point** (after displaying text/voice/speed/language parameters)
281
+
282
+ **Conclusion:** The MLX-audio synthesis hanging issue is **NOT caused by server interference**. The issue is intrinsic to MLX-audio's Kokoro model synthesis pipeline, likely related to:
283
+ - misaki text processing library
284
+ - MLX compute graph issues
285
+ - Kokoro model implementation bug
286
+
287
+ **User's hypothesis was valid to test** and represented good scientific method, but the root cause is confirmed to be within MLX-audio itself, not environmental conflicts.
288
+
289
+ ---
290
+
291
+ *Previous findings below...*
292
+
293
+ **Recommendation:** Continue with current qwen-tts implementation and re-evaluate MLX-audio after:
294
+ 1. Python 3.14 compatibility is resolved
295
+ 2. Official Qwen3-TTS support is added
296
+ 3. Documentation and examples are available
297
+
298
+ ---
299
+
300
+ **Generated:** 2026-02-06
301
+ **Evaluator:** DAIV
302
+ **Project:** voice-server