edge-gemma-speak 0.1.0 (tar.gz)

LICENSE
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2024 MimicLab, Sogang University
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
MANIFEST.in
@@ -0,0 +1,7 @@
+ include README.md
+ include requirements.txt
+ recursive-include edge_gemma_speak *.py
+ recursive-include edge_gemma_speak/models *.gguf
+ global-exclude __pycache__
+ global-exclude *.py[co]
+ global-exclude .DS_Store
PKG-INFO
@@ -0,0 +1,376 @@
+ Metadata-Version: 2.4
+ Name: edge_gemma_speak
+ Version: 0.1.0
+ Summary: Edge-based voice assistant using Gemma LLM with STT and TTS capabilities
+ Home-page: https://github.com/yourusername/edge_gemma_speak
+ Author: MimicLab, Sogang University
+ Author-email:
+ License: MIT
+ Project-URL: Homepage, https://github.com/yourusername/edge_gemma_speak
+ Classifier: Development Status :: 3 - Alpha
+ Classifier: Intended Audience :: Developers
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.8
+ Classifier: Programming Language :: Python :: 3.9
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Requires-Python: >=3.8
+ Description-Content-Type: text/markdown
+ License-File: LICENSE
+ Requires-Dist: torch>=2.0.0
+ Requires-Dist: numpy
+ Requires-Dist: SpeechRecognition
+ Requires-Dist: faster-whisper
+ Requires-Dist: llama-cpp-python
+ Requires-Dist: edge-tts
+ Requires-Dist: pygame
+ Requires-Dist: sounddevice
+ Requires-Dist: soundfile
+ Requires-Dist: gradio
+ Requires-Dist: flask
+ Requires-Dist: pyaudio
+ Dynamic: home-page
+ Dynamic: license-file
+ Dynamic: requires-python
+
+ # πŸŽ™οΈ Edge Gemma Speak
+
+ An edge-based voice assistant that combines the Gemma LLM with speech-to-text and text-to-speech.
+
+ ## Key Features
+
+ - **Speech Recognition (STT)**: High-speed speech recognition using Faster Whisper
+ - **Conversational AI (LLM)**: Local LLM inference via llama.cpp (Gemma 3 12B)
+ - **Speech Synthesis (TTS)**: Fast responses with Edge-TTS streaming
+ - **Local-First Operation**: STT and LLM inference run entirely on-device for privacy (note that Edge-TTS itself calls Microsoft's online speech service)
+
+ ## Installation
+
+ ### 1. Install via pip
+
+ ```bash
+ pip install edge-gemma-speak
+ ```
+
+ Or install from source:
+
+ ```bash
+ git clone https://github.com/yourusername/edge_gemma_speak.git
+ cd edge_gemma_speak
+ pip install -e .
+ ```
+
+ #### For NVIDIA CUDA Users
+
+ If you have an NVIDIA GPU and want CUDA acceleration, rebuild llama-cpp-python with CUDA support:
+
+ ```bash
+ # Rebuild llama-cpp-python with CUDA support
+ CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir
+ ```
+
+ This significantly improves LLM inference performance on NVIDIA GPUs.
+
+ ### 2. Download Model
+
+ ```bash
+ # Automatically download the Gemma model (~7GB)
+ edge-gemma-speak --download-model
+ ```
+
+ The model is saved in the `~/.edge_gemma_speak/models/` directory.
+
+ ## Usage
+
+ ### Basic Usage
+
+ ```bash
+ # Start a voice conversation
+ edge-gemma-speak
+ ```
+
+ Speak into your microphone, and the assistant responds aloud.
+
+ ### Voice Selection
+
+ ```bash
+ # List all available voices
+ edge-gemma-speak --list-voices
+
+ # Use preset voices
+ edge-gemma-speak --voice male          # Korean male voice
+ edge-gemma-speak --voice female        # Korean female voice
+ edge-gemma-speak --voice multilingual  # Korean multilingual male (default)
+
+ # Use any Edge-TTS voice directly
+ edge-gemma-speak --voice en-US-JennyNeural
+ edge-gemma-speak --voice ja-JP-NanamiNeural
+ edge-gemma-speak --voice zh-CN-XiaoxiaoNeural
+ ```
+
+ ### Advanced Configuration
+
+ #### STT (Speech Recognition) Parameters
+
+ ```bash
+ # Recognize speech in a different language
+ edge-gemma-speak --stt-language en
+
+ # Increase beam size for more accurate recognition (default: 5)
+ edge-gemma-speak --stt-beam-size 10
+
+ # Adjust VAD sensitivity (default: 0.5)
+ edge-gemma-speak --stt-vad-threshold 0.3
+
+ # Change Whisper model size (tiny, base, small, medium, large)
+ edge-gemma-speak --stt-model small
+ ```
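The VAD threshold is the speech-probability cutoff below which audio frames are treated as silence. A conceptual, stdlib-only sketch of frame gating (illustrative only — faster-whisper's VAD filter uses the Silero VAD model to produce per-frame speech probabilities, not raw energy):

```python
def detect_speech_frames(frame_scores, threshold=0.5):
    """Mark frames whose speech probability meets the threshold.

    frame_scores: per-frame speech probabilities in [0, 1]
    (a real VAD model such as Silero produces these).
    """
    return [score >= threshold for score in frame_scores]


scores = [0.1, 0.2, 0.7, 0.9, 0.4, 0.05]
voiced = detect_speech_frames(scores, threshold=0.5)
# A lower threshold (e.g. 0.3) also admits borderline frames,
# which helps with quiet speakers but passes more noise.
```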
+
+ #### LLM (Language Model) Parameters
+
+ ```bash
+ # Generate longer responses (default: 512)
+ edge-gemma-speak --llm-max-tokens 1024
+
+ # More creative responses (higher temperature; default: 0.7)
+ edge-gemma-speak --llm-temperature 0.9
+
+ # More conservative responses (lower temperature)
+ edge-gemma-speak --llm-temperature 0.3
+
+ # Adjust context size (default: 4096)
+ edge-gemma-speak --llm-context-size 8192
+
+ # Adjust top-p sampling (default: 0.95)
+ edge-gemma-speak --llm-top-p 0.9
+ ```
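Temperature divides the logits before softmax, so values below 1.0 sharpen the token distribution and values above 1.0 flatten it. A minimal stdlib-only sketch of the idea (illustrative, not llama.cpp's internals):

```python
import math


def softmax_with_temperature(logits, temperature=0.7):
    """Rescale logits by temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max before exp to avoid overflow
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Same logits, two temperatures: low T concentrates probability
# on the top token; high T spreads it across alternatives.
p_sharp = softmax_with_temperature([2.0, 1.0, 0.5], temperature=0.3)
p_flat = softmax_with_temperature([2.0, 1.0, 0.5], temperature=1.5)
```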
+
+ #### Device Configuration
+
+ ```bash
+ # Auto-detect best available device (default)
+ edge-gemma-speak
+
+ # Explicitly use CPU
+ edge-gemma-speak --device cpu
+
+ # Explicitly use CUDA GPU
+ edge-gemma-speak --device cuda
+
+ # Explicitly use Apple Silicon MPS
+ edge-gemma-speak --device mps
+ ```
+
+ The system automatically detects the best available device:
+ - NVIDIA GPU with CUDA β†’ `cuda`
+ - Apple Silicon β†’ `mps`
+ - Otherwise β†’ `cpu`
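The detection order above can be sketched with PyTorch's availability checks (an illustrative sketch, not the package's actual code; it falls back to `cpu` when PyTorch is not installed):

```python
def detect_device() -> str:
    """Pick the best available backend: cuda > mps > cpu."""
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch at all: CPU only
    if torch.cuda.is_available():
        return "cuda"  # NVIDIA GPU with a working CUDA runtime
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"  # Apple Silicon
    return "cpu"


print(detect_device())  # "cuda", "mps", or "cpu" depending on the machine
```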
+
+ ### Combined Examples
+
+ ```bash
+ # English female voice + English recognition + longer responses
+ edge-gemma-speak --voice en-US-JennyNeural --stt-language en --llm-max-tokens 1024
+
+ # Japanese voice + high-accuracy STT + creative responses
+ edge-gemma-speak --voice ja-JP-NanamiNeural --stt-beam-size 10 --llm-temperature 0.9
+
+ # Use a custom model path
+ edge-gemma-speak --model /path/to/your/model.gguf
+ ```
+
+ ## Python API Usage
+
+ ```python
+ from edge_gemma_speak import VoiceAssistant, ModelConfig, AudioConfig
+
+ # Configuration
+ model_config = ModelConfig(
+     stt_model="base",
+     llm_temperature=0.7,
+     tts_voice="en-US-JennyNeural"  # English female voice
+ )
+
+ audio_config = AudioConfig()
+
+ # Initialize the voice assistant
+ assistant = VoiceAssistant(model_config, audio_config)
+
+ # Start the conversation loop
+ assistant.run_conversation_loop()
+ ```
+
+ ### Using Individual Modules
+
+ ```python
+ from edge_gemma_speak import STTModule, LLMModule, TTSModule, ModelConfig
+
+ config = ModelConfig()
+
+ # STT (Speech to Text)
+ stt = STTModule(config)
+ text = stt.transcribe("audio.wav")
+
+ # LLM (Generate text response)
+ llm = LLMModule(config)
+ response = llm.generate_response(text)
+
+ # TTS (Text to Speech)
+ tts = TTSModule(config)
+ tts.speak(response)
+ ```
+
+ ## Available Commands During Conversation
+
+ - **"exit"** or **"μ’…λ£Œ"** (exit): Exit the program
+ - **"reset"** or **"μ΄ˆκΈ°ν™”"** (reset): Reset conversation history
+ - **"history"** or **"λŒ€ν™” λ‚΄μ—­"** (conversation history): View conversation history
+
+ ## System Requirements
+
+ - Python 3.8 or higher
+ - macOS (with MPS support), Linux, Windows
+ - Minimum 8GB RAM (16GB recommended)
+ - Approximately 7GB disk space (for model storage)
+
+ ### Required Packages
+
+ - torch >= 2.0.0
+ - faster-whisper
+ - llama-cpp-python
+ - edge-tts
+ - numpy
+ - SpeechRecognition
+ - pygame
+ - sounddevice
+ - soundfile
+ - pyaudio
+
+ ## Project Structure
+
+ ```
+ edge_gemma_speak/
+ β”œβ”€β”€ edge_gemma_speak/      # Package directory
+ β”‚   β”œβ”€β”€ __init__.py        # Package initialization
+ β”‚   β”œβ”€β”€ voice_assistant.py # Main module
+ β”‚   └── cli.py             # CLI interface
+ β”œβ”€β”€ setup.py               # Package setup
+ β”œβ”€β”€ pyproject.toml         # Build configuration
+ β”œβ”€β”€ requirements.txt       # Dependencies
+ β”œβ”€β”€ README.md              # Documentation
+ └── .gitignore             # Git ignore file
+ ```
+
+ ## Troubleshooting
+
+ ### PyAudio Installation Error
+
+ macOS:
+ ```bash
+ brew install portaudio
+ pip install pyaudio
+ ```
+
+ Linux:
+ ```bash
+ sudo apt-get install portaudio19-dev python3-pyaudio
+ pip install pyaudio
+ ```
+
+ Windows:
+ ```bash
+ # Visual Studio Build Tools required
+ pip install pipwin
+ pipwin install pyaudio
+ ```
+
+ ### Out of Memory
+
+ For large LLM models:
+ - Use smaller quantized models
+ - Reduce context size: `--llm-context-size 2048`
+ - Use CPU mode: `--device cpu`
+
+ ### Microphone Recognition Issues
+
+ - Check microphone permissions in system settings
+ - Close other audio applications
+ - Adjust the VAD threshold: `--stt-vad-threshold 0.3`
+
+ ### Model File Not Found
+
+ ```bash
+ # Download model
+ edge-gemma-speak --download-model
+
+ # Or download directly
+ wget https://huggingface.co/tgisaturday/Docsray/resolve/main/gemma-3-12b-it-GGUF/gemma-3-12b-it-Q4_K_M.gguf \
+   -O ~/.edge_gemma_speak/models/gemma-3-12b-it-Q4_K_M.gguf
+ ```
+
+ ## Performance Optimization
+
+ ### Improve Response Speed
+
+ 1. **Use a smaller STT model**: `--stt-model tiny` or `base`
+ 2. **Limit LLM response length**: `--llm-max-tokens 256`
+ 3. **Reduce beam size**: `--stt-beam-size 3`
+
+ ### GPU Acceleration
+
+ - **macOS**: Automatic MPS support (`--device mps`)
+ - **NVIDIA GPU**: CUDA support (`--device cuda`)
+ - **AMD GPU**: Requires PyTorch with ROCm support
+
+ ## Developer Information
+
+ Developed by MimicLab at Sogang University
+
+ ## License
+
+ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
+
+ ### Third-Party Licenses
+
+ This project uses several third-party libraries:
+ - **edge-tts**: LGPL-3.0 License (for TTS functionality)
+ - **faster-whisper**: MIT License (for STT functionality)
+ - **llama-cpp-python**: MIT License (for LLM inference)
+ - **Gemma Model**: Check the model provider's license terms
+
+ For complete third-party license information, see [THIRD_PARTY_LICENSES.md](THIRD_PARTY_LICENSES.md).
+
+ **Note on edge-tts**: The edge-tts library is licensed under LGPL-3.0. This project uses it as a library dependency without modifications. Users are free to replace edge-tts with their own version if desired. The LGPL-3.0 license of edge-tts does not affect the MIT licensing of this project's source code.
+
+ ## Contributing
+
+ Issues and Pull Requests are always welcome!
+
+ ### Development Setup
+
+ ```bash
+ # Clone repository
+ git clone https://github.com/yourusername/edge_gemma_speak.git
+ cd edge_gemma_speak
+
+ # Install in development mode
+ pip install -e .
+
+ # Run tests
+ python -m pytest tests/
+ ```
+
+ ## Multilingual Support
+
+ Edge Gemma Speak supports multiple languages through Edge-TTS. You can use voices in various languages:
+
+ - **English**: en-US, en-GB, en-AU, en-CA, en-IN
+ - **Japanese**: ja-JP
+ - **Chinese**: zh-CN, zh-TW, zh-HK
+ - **Spanish**: es-ES, es-MX
+ - **French**: fr-FR, fr-CA
+ - **German**: de-DE
+ - **Korean**: ko-KR
+ - And many more...
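Edge-TTS voice names embed the locale as a prefix (e.g. `en-US-JennyNeural`), so filtering by language is a string-prefix match. A small sketch over a hypothetical sample list (the real list comes from `edge-gemma-speak --list-voices`; the names below merely follow the Edge-TTS ShortName format):

```python
# Hypothetical sample mirroring Edge-TTS "ShortName" entries.
VOICES = [
    "ko-KR-SunHiNeural", "ko-KR-InJoonNeural",
    "en-US-JennyNeural", "en-GB-SoniaNeural",
    "ja-JP-NanamiNeural", "zh-CN-XiaoxiaoNeural",
]


def voices_for_language(voices, lang: str):
    """Filter voice short names by language code, e.g. 'en' or 'en-US'."""
    prefix = lang if "-" in lang else lang + "-"
    return [v for v in voices if v.startswith(prefix)]
```

For example, `voices_for_language(VOICES, "en")` keeps both the `en-US` and `en-GB` entries, while `"en-US"` narrows to the one US voice.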
+
+ Use `--list-voices` to see all available voices and their language codes.