edge-gemma-speak 0.1.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- edge_gemma_speak-0.1.0/LICENSE +21 -0
- edge_gemma_speak-0.1.0/MANIFEST.in +7 -0
- edge_gemma_speak-0.1.0/PKG-INFO +376 -0
- edge_gemma_speak-0.1.0/README.md +341 -0
- edge_gemma_speak-0.1.0/edge_gemma_speak/__init__.py +26 -0
- edge_gemma_speak-0.1.0/edge_gemma_speak/cli.py +305 -0
- edge_gemma_speak-0.1.0/edge_gemma_speak/voice_assistant.py +661 -0
- edge_gemma_speak-0.1.0/edge_gemma_speak.egg-info/PKG-INFO +376 -0
- edge_gemma_speak-0.1.0/edge_gemma_speak.egg-info/SOURCES.txt +15 -0
- edge_gemma_speak-0.1.0/edge_gemma_speak.egg-info/dependency_links.txt +1 -0
- edge_gemma_speak-0.1.0/edge_gemma_speak.egg-info/entry_points.txt +2 -0
- edge_gemma_speak-0.1.0/edge_gemma_speak.egg-info/requires.txt +12 -0
- edge_gemma_speak-0.1.0/edge_gemma_speak.egg-info/top_level.txt +1 -0
- edge_gemma_speak-0.1.0/pyproject.toml +50 -0
- edge_gemma_speak-0.1.0/requirements.txt +12 -0
- edge_gemma_speak-0.1.0/setup.cfg +4 -0
- edge_gemma_speak-0.1.0/setup.py +51 -0
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2024 MimicLab, Sogang University

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
@@ -0,0 +1,376 @@
Metadata-Version: 2.4
Name: edge_gemma_speak
Version: 0.1.0
Summary: Edge-based voice assistant using Gemma LLM with STT and TTS capabilities
Home-page: https://github.com/yourusername/edge_gemma_speak
Author: MimicLab, Sogang University
Author-email:
License: MIT
Project-URL: Homepage, https://github.com/yourusername/edge_gemma_speak
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.0.0
Requires-Dist: numpy
Requires-Dist: SpeechRecognition
Requires-Dist: faster-whisper
Requires-Dist: llama-cpp-python
Requires-Dist: edge-tts
Requires-Dist: pygame
Requires-Dist: sounddevice
Requires-Dist: soundfile
Requires-Dist: gradio
Requires-Dist: flask
Requires-Dist: pyaudio
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python
# 🎙️ Edge Gemma Speak

Edge-based voice assistant using Gemma LLM with Speech-to-Text and Text-to-Speech capabilities.

## Key Features

- **Speech Recognition (STT)**: High-speed speech recognition using Faster Whisper
- **Conversational AI (LLM)**: Local LLM inference via llama.cpp (Gemma 3 12B)
- **Speech Synthesis (TTS)**: Fast responses with Edge-TTS streaming
- **Local Processing**: STT and LLM inference run locally, ensuring privacy (note that Edge-TTS itself is a network service)
## Installation

### 1. Install via pip

```bash
pip install edge-gemma-speak
```

Or install from source:

```bash
git clone https://github.com/yourusername/edge_gemma_speak.git
cd edge_gemma_speak
pip install -e .
```

#### For NVIDIA CUDA Users

If you have an NVIDIA GPU and want to use CUDA acceleration, you need to rebuild llama-cpp-python with CUDA support:

```bash
# Rebuild llama-cpp-python with CUDA support
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir
```

This will significantly improve LLM inference performance on NVIDIA GPUs.

### 2. Download Model

```bash
# Automatically download the Gemma model (~7GB)
edge-gemma-speak --download-model
```

The model will be saved in the `~/.edge_gemma_speak/models/` directory.
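The default model location can also be resolved programmatically. A minimal sketch, assuming the path layout and model filename stated in this README (the helper functions are hypothetical, not part of the edge_gemma_speak API):

```python
from pathlib import Path

# Default location this README documents for downloaded models
# (hypothetical helpers, not the package's actual API).
MODEL_DIR = Path.home() / ".edge_gemma_speak" / "models"
MODEL_NAME = "gemma-3-12b-it-Q4_K_M.gguf"

def default_model_path() -> Path:
    """Return the expected model path, whether or not the file exists yet."""
    return MODEL_DIR / MODEL_NAME

def model_is_downloaded() -> bool:
    """Check whether the model file is already present on disk."""
    return default_model_path().is_file()
```

This is handy for scripts that want to skip `--download-model` when the file is already in place.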
## Usage

### Basic Usage

```bash
# Start voice conversation
edge-gemma-speak
```

Speak into your microphone and the AI will respond with voice.

### Voice Selection

```bash
# List all available voices
edge-gemma-speak --list-voices

# Use preset voices
edge-gemma-speak --voice male          # Korean male voice
edge-gemma-speak --voice female        # Korean female voice
edge-gemma-speak --voice multilingual  # Korean multilingual male (default)

# Use any Edge-TTS voice directly
edge-gemma-speak --voice en-US-JennyNeural
edge-gemma-speak --voice ja-JP-NanamiNeural
edge-gemma-speak --voice zh-CN-XiaoxiaoNeural
```
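Edge-TTS voice names follow a `<locale>-<Name>Neural` convention, so filtering by locale is straightforward. A sketch with an illustrative sample list (run `edge-gemma-speak --list-voices` for the real catalogue):

```python
# Filter Edge-TTS-style voice names by locale prefix.
# The sample list is illustrative, not the full catalogue.
VOICES = [
    "en-US-JennyNeural",
    "ja-JP-NanamiNeural",
    "zh-CN-XiaoxiaoNeural",
    "ko-KR-SunHiNeural",
]

def voices_for_locale(locale):
    """Return voices whose name starts with the given locale code."""
    return [v for v in VOICES if v.startswith(locale + "-")]

print(voices_for_locale("ja-JP"))  # ['ja-JP-NanamiNeural']
```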
### Advanced Configuration

#### STT (Speech Recognition) Parameters

```bash
# Recognize speech in different languages
edge-gemma-speak --stt-language en

# Increase beam size for more accurate recognition (default: 5)
edge-gemma-speak --stt-beam-size 10

# Adjust VAD sensitivity (default: 0.5)
edge-gemma-speak --stt-vad-threshold 0.3

# Change Whisper model size (tiny, base, small, medium, large)
edge-gemma-speak --stt-model small
```

#### LLM (Language Model) Parameters

```bash
# Generate longer responses (default: 512)
edge-gemma-speak --llm-max-tokens 1024

# More creative responses (higher temperature, default: 0.7)
edge-gemma-speak --llm-temperature 0.9

# More conservative responses (lower temperature)
edge-gemma-speak --llm-temperature 0.3

# Adjust context size (default: 4096)
edge-gemma-speak --llm-context-size 8192

# Adjust top-p sampling (default: 0.95)
edge-gemma-speak --llm-top-p 0.9
```

#### Device Configuration

```bash
# Auto-detect best available device (default)
edge-gemma-speak

# Explicitly use CPU
edge-gemma-speak --device cpu

# Explicitly use CUDA GPU
edge-gemma-speak --device cuda

# Explicitly use Apple Silicon MPS
edge-gemma-speak --device mps
```

The system automatically detects the best available device:
- NVIDIA GPU with CUDA → `cuda`
- Apple Silicon → `mps`
- Otherwise → `cpu`
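That precedence can be expressed as a small pure function. This is a sketch of the documented order, not the package's internal code; a real implementation would feed it `torch.cuda.is_available()` and `torch.backends.mps.is_available()`:

```python
def pick_device(cuda_available, mps_available):
    """Apply the documented precedence: cuda > mps > cpu."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"

print(pick_device(False, True))  # mps
```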
### Combined Examples

```bash
# English female voice + English recognition + longer responses
edge-gemma-speak --voice en-US-JennyNeural --stt-language en --llm-max-tokens 1024

# Japanese voice + high-accuracy STT + creative responses
edge-gemma-speak --voice ja-JP-NanamiNeural --stt-beam-size 10 --llm-temperature 0.9

# Use a custom model path
edge-gemma-speak --model /path/to/your/model.gguf
```
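For reference, the flags combined above could be mirrored with `argparse`. This is a hypothetical sketch, not the package's actual CLI code; the defaults shown are the ones this README states:

```python
import argparse

# Hypothetical argparse mirror of the documented CLI flags;
# defaults are taken from this README, not read from the package.
def build_parser():
    p = argparse.ArgumentParser(prog="edge-gemma-speak")
    p.add_argument("--voice", default="multilingual")
    p.add_argument("--stt-language", default=None)
    p.add_argument("--stt-beam-size", type=int, default=5)
    p.add_argument("--stt-vad-threshold", type=float, default=0.5)
    p.add_argument("--llm-max-tokens", type=int, default=512)
    p.add_argument("--llm-temperature", type=float, default=0.7)
    p.add_argument("--device", choices=["cpu", "cuda", "mps"], default=None)
    return p

args = build_parser().parse_args(
    ["--voice", "en-US-JennyNeural", "--stt-language", "en", "--llm-max-tokens", "1024"]
)
print(args.llm_max_tokens)  # 1024
```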
## Python API Usage

```python
from edge_gemma_speak import VoiceAssistant, ModelConfig, AudioConfig

# Configuration
model_config = ModelConfig(
    stt_model="base",
    llm_temperature=0.7,
    tts_voice="en-US-JennyNeural"  # English female voice
)

audio_config = AudioConfig()

# Initialize voice assistant
assistant = VoiceAssistant(model_config, audio_config)

# Start conversation
assistant.run_conversation_loop()
```
### Using Individual Modules

```python
from edge_gemma_speak import STTModule, LLMModule, TTSModule, ModelConfig

config = ModelConfig()

# STT (speech to text)
stt = STTModule(config)
text = stt.transcribe("audio.wav")

# LLM (generate text response)
llm = LLMModule(config)
response = llm.generate_response(text)

# TTS (text to speech)
tts = TTSModule(config)
tts.speak(response)
```
## Available Commands During Conversation

- **"exit"** or **"종료"**: Exit the program
- **"reset"** or **"초기화"**: Reset conversation history
- **"history"** or **"대화 내역"**: View conversation history
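A command check along the lines described above might look like this (a sketch under the assumption of exact-phrase matching; the package's actual matching logic may differ):

```python
# Map spoken phrases (English and Korean) to assistant commands.
# Hypothetical sketch of the dispatch described above.
COMMANDS = {
    "exit": "exit", "종료": "exit",
    "reset": "reset", "초기화": "reset",
    "history": "history", "대화 내역": "history",
}

def match_command(utterance):
    """Return the command name if the utterance is a command, else None."""
    return COMMANDS.get(utterance.strip().lower())

print(match_command("  EXIT "))  # exit
```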
## System Requirements

- Python 3.8 or higher
- macOS (with MPS support), Linux, or Windows
- Minimum 8GB RAM (16GB recommended)
- Approximately 7GB disk space (for model storage)

### Required Packages

- torch >= 2.0.0
- faster-whisper
- llama-cpp-python
- edge-tts
- numpy
- SpeechRecognition
- pygame
- sounddevice
- soundfile
- pyaudio
## Project Structure

```
edge_gemma_speak/
├── edge_gemma_speak/          # Package directory
│   ├── __init__.py            # Package initialization
│   ├── voice_assistant.py     # Main module
│   └── cli.py                 # CLI interface
├── setup.py                   # Package setup
├── pyproject.toml             # Build configuration
├── requirements.txt           # Dependencies
├── README.md                  # Documentation
└── .gitignore                 # Git ignore file
```
## Troubleshooting

### PyAudio Installation Error

macOS:
```bash
brew install portaudio
pip install pyaudio
```

Linux:
```bash
sudo apt-get install portaudio19-dev python3-pyaudio
pip install pyaudio
```

Windows:
```bash
# Visual Studio Build Tools required
pip install pipwin
pipwin install pyaudio
```
### Out of Memory

For large LLM models:
- Use smaller quantized models
- Reduce the context size: `--llm-context-size 2048`
- Use CPU mode: `--device cpu`

### Microphone Recognition Issues

- Check microphone permissions in system settings
- Close other audio applications
- Adjust the VAD threshold: `--stt-vad-threshold 0.3`
### Model File Not Found

```bash
# Download model
edge-gemma-speak --download-model

# Or download directly
wget https://huggingface.co/tgisaturday/Docsray/resolve/main/gemma-3-12b-it-GGUF/gemma-3-12b-it-Q4_K_M.gguf \
  -O ~/.edge_gemma_speak/models/gemma-3-12b-it-Q4_K_M.gguf
```
## Performance Optimization

### Improve Response Speed

1. **Use a smaller STT model**: `--stt-model tiny` or `base`
2. **Limit LLM response length**: `--llm-max-tokens 256`
3. **Reduce beam size**: `--stt-beam-size 3`

### GPU Acceleration

- **macOS**: Automatic MPS support (`--device mps`)
- **NVIDIA GPU**: CUDA support (`--device cuda`)
- **AMD GPU**: Requires PyTorch with ROCm support
## Developer Information

Developed by MimicLab at Sogang University.

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

### Third-Party Licenses

This project uses several third-party libraries:
- **edge-tts**: LGPL-3.0 License (for TTS functionality)
- **faster-whisper**: MIT License (for STT functionality)
- **llama-cpp-python**: MIT License (for LLM inference)
- **Gemma Model**: Check the model provider's license terms

For complete third-party license information, see [THIRD_PARTY_LICENSES.md](THIRD_PARTY_LICENSES.md).

**Note on edge-tts**: The edge-tts library is licensed under LGPL-3.0. This project uses it as an unmodified library dependency, and users are free to substitute their own version of edge-tts. Its LGPL-3.0 license does not affect the MIT licensing of this project's source code.
## Contributing

Issues and pull requests are always welcome!

### Development Setup

```bash
# Clone the repository
git clone https://github.com/yourusername/edge_gemma_speak.git
cd edge_gemma_speak

# Install in development mode
pip install -e .

# Run tests
python -m pytest tests/
```
## Multilingual Support

Edge Gemma Speak supports multiple languages through Edge-TTS. You can use voices in various languages:

- **English**: en-US, en-GB, en-AU, en-CA, en-IN
- **Japanese**: ja-JP
- **Chinese**: zh-CN, zh-TW, zh-HK
- **Spanish**: es-ES, es-MX
- **French**: fr-FR, fr-CA
- **German**: de-DE
- **Korean**: ko-KR
- And many more...

Use `--list-voices` to see all available voices and their language codes.