churovoice 0.1.0__tar.gz

@@ -0,0 +1,167 @@
+ Metadata-Version: 2.4
+ Name: churovoice
+ Version: 0.1.0
+ Summary: A multimodal voice assistant with web search, vision, and image generation.
+ Author: Lakshya Prajapati
+ Classifier: Development Status :: 3 - Alpha
+ Classifier: Intended Audience :: Developers
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Operating System :: MacOS
+ Classifier: Topic :: Multimedia :: Sound/Audio :: Speech
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+ Requires-Python: >=3.11
+ Description-Content-Type: text/markdown
+ Requires-Dist: torch
+ Requires-Dist: ollama
+ Requires-Dist: SpeechRecognition
+ Requires-Dist: edge-tts
+ Requires-Dist: ddgs
+ Requires-Dist: opencv-python
+ Requires-Dist: rich
+ Requires-Dist: diffusers
+ Requires-Dist: transformers
+ Requires-Dist: accelerate
+ Requires-Dist: safetensors
+ Requires-Dist: Pillow
+ Requires-Dist: term-image
+
+ # V1.8 Speech Agent
+
+ V1.8 is an experimental voice-first AI assistant. It listens to spoken prompts, responds out loud, can launch apps on macOS, can search the web for current context, can inspect images from a webcam, and can generate images when the user asks for a visual result.
+
+ ## What It Does
+
+ This project combines several assistant behaviors into one loop:
+
+ - Speech-to-text using Whisper through `speech_recognition`
+ - Text-to-speech using `edge-tts`
+ - App launching on macOS for commands such as `open Safari`
+ - Web query simplification and search retrieval through DDGS
+ - Webcam-based vision analysis for appearance or environment questions
+ - Image generation with Stable Diffusion
+ - Terminal-friendly output formatting with `rich`
+
+ ## Who This Is For
+
+ This repository is intended for developers and hobbyists who want to explore a local voice assistant workflow. It is especially useful if you are interested in:
+
+ - voice interfaces
+ - multimodal AI interactions
+ - local automation on macOS
+ - image generation pipelines
+ - combining web search, vision, and speech in a single assistant
+
+ ## Requirements
+
+ - macOS
+ - Python 3.11 or newer (the package metadata declares `Requires-Python: >=3.11`)
+ - A microphone with system permission enabled
+ - A camera with system permission enabled if you want vision features
+ - Ollama installed and available on the machine running the script
+ - `chafa` installed if you want terminal previews for generated images
+ - Hardware that can run the configured Stable Diffusion pipeline; the code picks `mps` first, then `cuda`, then `cpu`
+
+ ## Python Dependencies
+
+ The script uses the following Python packages:
+
+ - `torch`
+ - `ollama`
+ - `SpeechRecognition`
+ - `edge-tts`
+ - `ddgs`
+ - `opencv-python`
+ - `rich`
+ - `diffusers`
+ - `term-image`
+
+ ## Installation
+
+ 1. Clone the repository and open the `V1.8` folder.
+
+ 2. Create a virtual environment:
+
+ ```bash
+ python3 -m venv venv
+ source venv/bin/activate
+ ```
+
+ 3. Install the dependencies:
+
+ ```bash
+ pip install torch ollama SpeechRecognition edge-tts ddgs opencv-python rich diffusers transformers accelerate safetensors Pillow term-image
+ ```
+
+ 4. Make sure Ollama can access the models referenced in `churovoice/assistant.py` (the `DEFAULT_*_MODEL` constants, each of which can be overridden through a `CHUROVOICE_*` environment variable).
+
+ ## Usage
+
+ Run the assistant with:
+
+ ```bash
+ python -m churovoice.cli
+ ```
+
+ On startup, the program asks you to choose a voice:
+
+ - `Male` selects `en-US-SteffanNeural`
+ - Any other input selects `en-US-AvaNeural`
+
+ Then the assistant will:
+
+ 1. Prompt you to speak
+ 2. Transcribe your speech
+ 3. Decide whether the request is for app launching, image generation, vision analysis, or a normal answer
+ 4. Speak the response back to you
+ 5. Ask whether you want to continue the conversation
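Step 3 of the list above can be pictured as a small router. The sketch below is illustrative only and is not a transcript of the shipped loop, which runs several of these checks in sequence rather than returning early. `route_request` and the injected `ask_yes_no` callable are hypothetical names; `ask_yes_no` stands in for the model-backed yes/no detectors.

```python
import re
from typing import Callable


def route_request(text: str, ask_yes_no: Callable[[str, str], bool]) -> str:
    """Return 'launch', 'image', 'vision', or 'chat' for one transcription.

    ask_yes_no(kind, text) is a hypothetical stand-in for a model call that
    answers whether the text is an 'image' or a 'vision' request.
    """
    normalized = text.strip().strip(".,!?").lower()
    # An "open <something>" command is checked first because it needs no model call.
    if re.search(r"\bopen\s+\S", normalized):
        return "launch"
    if ask_yes_no("image", text):
        return "image"
    if ask_yes_no("vision", text):
        return "vision"
    return "chat"
```

Injecting the detector as a callable keeps the routing rule testable without a running Ollama server; a real caller would pass a function that wraps the model request.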
+
+ ## How It Works
+
+ ### App Launching
+
+ If the transcription includes `open`, the assistant tries to find a matching application on macOS. If no local app is found, it falls back to opening a website based on the target name.
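The matching rule described above can be isolated from the `mdfind` and `open` subprocess calls. The following is a minimal sketch under that assumption: `pick_launch_target` is a hypothetical helper, and the list of application paths is passed in rather than queried from Spotlight.

```python
def pick_launch_target(target: str, app_paths: list[str]) -> tuple[str, bool]:
    """Return (thing_to_open, is_local_app) for a spoken target name."""
    wanted = target.lower()
    for path in app_paths:
        # Case-insensitive substring match against each application path.
        if wanted in path.lower():
            return path, True
    # No local match: guess a .com address from the target with spaces removed.
    return f"https://{target.replace(' ', '')}.com", False


apps = ["/Applications/Safari.app", "/System/Applications/Notes.app"]
print(pick_launch_target("safari", apps))       # ('/Applications/Safari.app', True)
print(pick_launch_target("hacker news", apps))  # ('https://hackernews.com', False)
```

Because the match is a plain substring test, a short target such as `no` would also match `Notes.app`, which is one reason the README calls the behavior intentionally simple.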
+
+ ### Web Answers
+
+ For general questions, the assistant first simplifies the query and fetches up to three search results through DDGS. The response model can use those results when the request is about news, current events, or recent information.
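One way to hand search results to the response model is to flatten them into numbered lines first. This is a sketch, not the shipped behavior: the packaged code interpolates the raw result list into the prompt. `format_search_context` is a hypothetical helper, and the `title`, `body`, and `href` keys are an assumption about the shape of each result dictionary.

```python
def format_search_context(results: list[dict[str, str]], limit: int = 3) -> str:
    """Turn search results into a compact, numbered block of text."""
    lines = []
    for index, item in enumerate(results[:limit], start=1):
        # .get() keeps the sketch tolerant of results that lack a field.
        title = item.get("title", "").strip()
        body = item.get("body", "").strip()
        href = item.get("href", "").strip()
        lines.append(f"[{index}] {title}: {body} ({href})")
    return "\n".join(lines)
```

Numbered lines are easier for a model to cite than a Python list representation, and the `limit` argument keeps the prompt short when a search returns many hits.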
+
+ ### Vision Mode
+
+ If the prompt seems to require visual context, the assistant captures a frame from the webcam, saves it as `instant_photo.png` in the working directory, and sends it to a vision-capable model for analysis.
+
+ ### Image Generation
+
+ If the prompt is recognized as an image request, the assistant converts it into a short image prompt, generates an image with Stable Diffusion, saves the result as `generated_image.png`, and previews it in the terminal with `chafa` when that tool is installed.
+
+ ## Limitations
+
+ - The current implementation is macOS-focused.
+ - The assistant depends on several external models and services.
+ - The Stable Diffusion pipeline is loaded on the first image request and then cached, so that first request may be slow on lower-powered machines.
+ - The current code stores generated and captured images in the working directory.
+ - The app-launching behavior is intentionally simple and may not match every app name perfectly.
+
+ ## Future Opportunities
+
+ This version leaves room for several improvements:
+
+ - Add cross-platform support beyond macOS
+ - Make device selection configurable through environment variables or a config file, as the model names already are through `CHUROVOICE_*` environment variables
+ - Add a proper command parser for app launching instead of relying on keyword matching
+ - Add a conversation history file or database
+ - Add streaming responses so users hear partial answers sooner
+ - Add a richer UI for desktop or web use
+ - Add safer image handling and cleanup for generated files
+ - Add a setup script or dependency file for easier installation
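For the image handling and cleanup item in the list above, one possible direction is to write captured frames to a temporary path that is removed once it has been used. The sketch below uses only the standard library; `temporary_image_path` is a hypothetical helper and is not part of the package.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def temporary_image_path(suffix: str = ".png") -> Iterator[str]:
    """Yield a temporary file path and delete the file afterwards."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # only the path is needed; the caller writes the file itself
    try:
        yield path
    finally:
        # Remove the file even if the body of the with-block raised.
        if os.path.exists(path):
            os.remove(path)
```

A caller would wrap the capture and analysis steps in `with temporary_image_path() as photo_path:` so that no webcam frame is left behind in the working directory.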
+
+ ## Troubleshooting
+
+ - If microphone input fails, check system permissions and verify `speech_recognition` is installed correctly.
+ - If camera capture fails, check camera permissions and confirm OpenCV can access the device.
+ - If image generation fails, verify that your hardware supports the configured device target or update the pipeline configuration.
+ - If terminal image preview fails, install `chafa` and confirm it is available in your PATH.
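Several of the checks above can be scripted. The following is a rough sketch of such a check, assuming the module and tool names shown in the defaults; `check_environment` is a hypothetical helper, and it only reports whether something is importable or on the PATH, not whether permissions are granted.

```python
import importlib.util
import shutil


def check_environment(
    modules: tuple[str, ...] = ("speech_recognition", "cv2", "edge_tts", "torch", "diffusers"),
    tools: tuple[str, ...] = ("afplay", "chafa"),
) -> dict[str, bool]:
    """Report which Python modules are importable and which CLI tools are on PATH."""
    report: dict[str, bool] = {}
    for name in modules:
        # find_spec locates a top-level module without importing it.
        report[f"module:{name}"] = importlib.util.find_spec(name) is not None
    for name in tools:
        report[f"tool:{name}"] = shutil.which(name) is not None
    return report


if __name__ == "__main__":
    for item, ok in check_environment().items():
        print(f"{'ok     ' if ok else 'MISSING'} {item}")
```

Microphone and camera permissions still have to be verified in the macOS privacy settings, since they cannot be detected reliably this way.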
+
+ ## License
+
+ No license has been added yet. Add one before publishing or distributing the project widely.
@@ -0,0 +1,5 @@
+ """ChuroVoice package."""
+
+ __all__ = ["__version__"]
+
+ __version__ = "0.1.0"
@@ -0,0 +1,317 @@
+ """Core ChuroVoice assistant implementation."""
+
+ from __future__ import annotations
+
+ import argparse
+ import asyncio
+ import os
+ import re
+ import shutil
+ import subprocess
+ import tempfile
+ import time
+ from functools import lru_cache
+
+ import cv2
+ import edge_tts
+ import speech_recognition as sr
+ import torch
+ from ddgs import DDGS
+ from diffusers import StableDiffusionPipeline
+ from ollama import chat
+ from rich.console import Console
+ from rich.text import Text
+
+
+ DEFAULT_CHAT_MODEL = os.getenv("CHUROVOICE_CHAT_MODEL", "gemma4:31b-cloud")
+ DEFAULT_IMAGE_TRIGGER_MODEL = os.getenv("CHUROVOICE_IMAGE_TRIGGER_MODEL", "ministral-3:14b-cloud")
+ DEFAULT_IMAGE_PROMPT_MODEL = os.getenv("CHUROVOICE_IMAGE_PROMPT_MODEL", "ministral-3:3b-cloud")
+ DEFAULT_WEB_MODEL = os.getenv("CHUROVOICE_WEB_MODEL", "ministral-3:3b-cloud")
+ DEFAULT_VISION_MODEL = os.getenv("CHUROVOICE_VISION_MODEL", "ministral-3:14b-cloud")
+ DEFAULT_IMAGE_ANALYSIS_MODEL = os.getenv("CHUROVOICE_IMAGE_ANALYSIS_MODEL", "ministral-3:8b-cloud")
+ DEFAULT_STABLE_DIFFUSION_MODEL = os.getenv("CHUROVOICE_SD_MODEL", "nota-ai/bk-sdm-small")
+
+ ANSI_BOLD = "\033[1m"
+ ANSI_ITALIC = "\033[3m"
+ ANSI_RESET = "\033[0m"
+
+
+ def format_for_terminal(text: str | None) -> str:
+     if text is None:
+         return ""
+     text = re.sub(r"\*\*(.*?)\*\*", f"{ANSI_BOLD}\\1{ANSI_RESET}", text)  # **bold**
+     text = re.sub(r"\*(.*?)\*", f"{ANSI_ITALIC}\\1{ANSI_RESET}", text)  # *italic*
+     return text
+
+
+ def clean_for_speech(text: str | None) -> str:
+     if text is None:
+         return ""
+     return re.sub(r"\*\*|\*", "", text)  # drop Markdown emphasis markers before TTS
+
+
+ def resolve_voice(choice: str) -> str:
+     return "en-US-SteffanNeural" if choice.lower().strip() == "male" else "en-US-AvaNeural"
+
+
+ def resolve_device() -> str:
+     if torch.backends.mps.is_available():
+         return "mps"
+     if torch.cuda.is_available():
+         return "cuda"
+     return "cpu"
+
+
+ @lru_cache(maxsize=1)
+ def load_image_pipeline() -> StableDiffusionPipeline:
+     device = resolve_device()
+     dtype = torch.float16 if device in {"mps", "cuda"} else torch.float32
+     pipe = StableDiffusionPipeline.from_pretrained(DEFAULT_STABLE_DIFFUSION_MODEL, torch_dtype=dtype)
+     return pipe.to(device)
+
+
+ async def speak_async(text: str, voice: str) -> None:
+     clean_text = clean_for_speech(text)
+     with tempfile.NamedTemporaryFile(delete=False, suffix=".mp3") as fp:
+         temp_path = fp.name
+
+     communicate = edge_tts.Communicate(clean_text, voice)
+     await communicate.save(temp_path)
+     subprocess.run(["afplay", temp_path], check=False)  # no shell, so the path needs no quoting
+     os.remove(temp_path)
+
+
+ def speak(text: str, voice: str) -> None:
+     asyncio.run(speak_async(text, voice))
+
+
+ def launch_target(target: str) -> bool:
+     finder = subprocess.run(
+         ["mdfind", 'kMDItemKind == "Application"'],
+         capture_output=True,
+         text=True,
+         check=False,
+     )
+     matches = [line for line in finder.stdout.splitlines() if target.lower() in line.lower()]
+     if matches:
+         subprocess.run(["open", matches[0]], check=False)
+         return True
+
+     subprocess.run(["open", f"https://{target.replace(' ', '')}.com"], check=False)
+     return False
+
+
+ def simplify_query(text: str) -> str:
+     response = chat(
+         model=DEFAULT_WEB_MODEL,
+         messages=[
+             {"role": "user", "content": text},
+             {
+                 "role": "system",
+                 "content": f'''You are a web-search query simplifier. Your job: convert the user's message into ONE concise web search query. User message: "{text}" Rules: keep only important keywords; remove filler words and stop words; no emojis; no explanations; keep the meaning accurate; optimize it for a search engine; output ONLY the final search query. Format: latest details on <simplified topic> as of 2026''',
+             },
+         ],
+     )
+     return response.message.content.strip()
+
+
+ def detect_trigger(text: str, prompt: str, model: str) -> str:
+     response = chat(
+         model=model,
+         messages=[
+             {"role": "user", "content": text},
+             {"role": "system", "content": prompt.format(text=text)},
+         ],
+     )
+     return response.message.content.strip().lower()
+
+
+ def build_image_prompt(text: str) -> str:
+     response = chat(
+         model=DEFAULT_IMAGE_PROMPT_MODEL,
+         messages=[
+             {"role": "user", "content": text},
+             {
+                 "role": "system",
+                 "content": f'''You are an expert at turning user requests into image-generation prompts. Convert "{text}" into a prompt for an image model. Example: for the request "Can you generate an image of a sunset over mountains?", answer "Generate a realistic image of a sunset over mountains". Use at most 10 words; longer prompts will not be tolerated.''',
+             },
+         ],
+     )
+     return response.message.content.strip()
+
+
+ def analyze_image(text: str, photo_path: str) -> str:
+     response = chat(
+         model=DEFAULT_IMAGE_ANALYSIS_MODEL,
+         messages=[
+             {"role": "system", "content": text},
+             {
+                 "role": "user",
+                 "content": f'''Analyze the image based on the user's request. User request: "{text}" Instructions: focus mainly on the requested subject; if no subject is specified, analyze the surroundings; be concise but useful; be truthful and accurate; mention important visible details; do not hallucinate; no emojis; no unnecessary formatting; make the response natural and clear.''',
+                 "images": [photo_path],
+             },
+         ],
+     )
+     return response.message.content.strip()
+
+
+ def answer_with_chat(text: str, memory: list[str], search_results: list[dict[str, str]]) -> str:
+     response = chat(
+         model=DEFAULT_CHAT_MODEL,
+         messages=[
+             {"role": "user", "content": text},
+             {
+                 "role": "system",
+                 "content": f'''You are Churo. Personality: helpful, professional, intelligent, concise, accurate, natural sounding. Rules: use simple language; keep responses concise; no emojis; do not ramble; answer directly; be conversational but efficient. Memory context: {memory}. Use it when the query lacks context. Current user query: {text} Available web search results: {search_results} Use the web results ONLY if the user explicitly asks for the latest news, recent updates, current information, newest details, or a web search. Otherwise answer normally without relying on web results. If image analysis was already provided, do not repeat it; simply continue the conversation naturally. Never say "Analysis provided".''',
+             },
+         ],
+     )
+     return response.message.content.strip()
+
+
+ def print_block(console: Console, text: str, *, style: str = "cornsilk1 on gray15") -> None:
+     console.print(" ")
+     console.print(" ", style=style, justify="left")
+     console.print(Text.from_ansi(text), style=style, justify="left")
+     console.print(" ", style=style, justify="left")
+     console.print()
+
+
+ def run_assistant(voice_choice: str | None = None) -> None:
+     console = Console()
+     terminal_width = shutil.get_terminal_size((100, 20)).columns
+     voice = resolve_voice(voice_choice or input("Choose a voice (Male/Female): "))
+
+     answer_history: list[str] = []
+     recognizer = sr.Recognizer()
+     yes_words = {"y", "yes", "yep", "yeah", "yup", "sure", "ok", "okay", "affirmative", "certainly", "definitely", "absolutely", "indeed", "true", "continue"}
+
+     while True:
+         is_app_open = False
+         is_recognised = False
+
+         with sr.Microphone() as source:
+             recognizer.adjust_for_ambient_noise(source, duration=0.2)
+             ask_anything = "*Ask Me Anything...*"
+             speak(ask_anything, voice)
+             console.print(Text.from_ansi(format_for_terminal(ask_anything)))
+             audio = recognizer.listen(source)
+
+         try:
+             text = recognizer.recognize_whisper(audio, model="small.en")
+             print_block(console, format_for_terminal(text), style="cornsilk1 on gray19")
+         except sr.UnknownValueError:
+             console.print(Text.from_ansi(format_for_terminal("Could not understand audio")))
+             text = ""
+
+         normalized_text = text.strip().strip(".,!?").lower()
+         open_match = re.search(r"\bopen\s+(.+)", normalized_text)
+         if open_match:
+             # Use only the words after "open", minus stray punctuation, as the launch target.
+             target = open_match.group(1).strip(" .,!?")
+             console.print(Text.from_ansi(format_for_terminal(f"**Opening {target}**")))
+             is_app_open = launch_target(target)
+
+         if text == "":
+             console.print(Text.from_ansi(format_for_terminal("No input detected. Please try again.")))
+             continue
+
+         web_query = simplify_query(text)
+         image_trigger = detect_trigger(
+             text,
+             '''You are an image generation trigger detector. Determine whether the user's query requires generating an image. User query: "{text}" Respond ONLY with yes or no. Say yes only if the user explicitly asks for an image, requests a visual representation of something, or the answer requires generating an image. Examples of yes: "Generate an image of a sunset over mountains"; "Create a picture of a futuristic city skyline"; "I want to see a visual representation of a dragon"; "Can you make an illustration of a robot?". Examples of no: news, coding, facts, explanations, web searches, math, history, general questions. Be accurate. Do not guess. Output ONLY yes or no, with no punctuation and no emojis.''',
+             DEFAULT_IMAGE_TRIGGER_MODEL,
+         )
+
+         if "yes" in image_trigger:
+             image_prompt = build_image_prompt(text)
+             image = load_image_pipeline()(image_prompt, num_inference_steps=20).images[0]
+             image_path = os.path.join(os.getcwd(), "generated_image.png")
+             image.save(image_path)
+
+             chafa = shutil.which("chafa")
+             if chafa:
+                 subprocess.run([
+                     chafa,
+                     image_path,
+                     "--symbols",
+                     "block",
+                     "--size=60",
+                 ], check=False)
+             else:
+                 console.print(f"Generated image saved to {image_path}")
+
+         else:
+             vision_trigger = detect_trigger(
+                 text,
+                 '''You are a vision-context detector. Determine whether answering the user's query requires a camera image, surroundings analysis, appearance analysis, object inspection, or environmental context. User query: "{text}" Respond ONLY with yes or no. Say yes only if the user refers to themselves or their surroundings, asks about appearance, asks to inspect something visible, or the answer requires visual context. Examples of yes: "How do I look?"; "What's in front of me?"; "Analyze my room"; "What is this object?"; "Does my hair look good?". Examples of no: news, coding, facts, explanations, web searches, math, history, general questions. Be accurate. Do not guess. Output ONLY yes or no, with no punctuation and no emojis.''',
+                 DEFAULT_VISION_MODEL,
+             )
+
+             if "yes" in vision_trigger:
+                 is_recognised = True
+                 console.print(Text.from_ansi(format_for_terminal("**Capturing photo...**")))
+                 cam = cv2.VideoCapture(0)
+                 ret, frame = cam.read()
+                 if ret:
+                     photo_path = os.path.join(os.getcwd(), "instant_photo.png")
+                     cv2.imwrite(photo_path, frame)
+                     console.print(Text.from_ansi(format_for_terminal("**Photo captured successfully!**")))
+                 else:
+                     console.print(Text.from_ansi(format_for_terminal("**Error: Could not access camera.**")))
+                     photo_path = ""
+                 cam.release()
+
+                 if photo_path:
+                     image_answer = analyze_image(text, photo_path)
+                     console.print(Text.from_ansi(format_for_terminal(image_answer)))
+                     answer_history.append(image_answer)
+                     speak(image_answer, voice)
+
+         if is_recognised and is_app_open:
+             continue
+
+         search_results = list(DDGS().text(web_query, max_results=3))
+         output = answer_with_chat(text, answer_history, search_results)
+         formatted_output = format_for_terminal(output)
+         speech_output = clean_for_speech(output)
+         aligned = formatted_output.rjust(terminal_width)
+         print_block(console, aligned)
+         answer_history.append(output)
+         speak(speech_output, voice)
+
+         voice_recognizer = sr.Recognizer()
+         with sr.Microphone() as source1:
+             voice_recognizer.adjust_for_ambient_noise(source1, duration=0.2)  # calibrate the recognizer that listens below
+             prompt_text = "*Do you want to continue the conversation? Yes or No?*"
+             console.print(Text.from_ansi(format_for_terminal(prompt_text)))
+             speak("Do you want to continue the conversation? Yes or No?", voice)
+             time.sleep(0.01)
+             console.print(Text.from_ansi(format_for_terminal("*Listening for your response...*")))
+             audio1 = voice_recognizer.listen(source1)
+
+         try:
+             voice_continue = voice_recognizer.recognize_whisper(audio1, model="small.en").strip()
+             print_block(console, format_for_terminal(voice_continue), style="cornsilk1 on gray19")
+         except sr.UnknownValueError:
+             console.print(Text.from_ansi(format_for_terminal("Could not understand audio")))
+             voice_continue = ""
+
+         normalized_continue = re.sub(r"[^\w]", "", voice_continue).lower()
+         if normalized_continue == "":
+             console.print(Text.from_ansi(format_for_terminal("**No input detected.**")))
+         elif normalized_continue not in yes_words:
+             speak("Bye! Please Visit Again!", voice)
+             break
+
+
+ def build_parser() -> argparse.ArgumentParser:
+     parser = argparse.ArgumentParser(description="Run the ChuroVoice assistant.")
+     parser.add_argument("--voice", choices=["male", "female"], help="Choose the spoken voice.")
+     return parser
+
+
+ def main(argv: list[str] | None = None) -> None:
+     parser = build_parser()
+     args = parser.parse_args(argv)
+     run_assistant(args.voice)
@@ -0,0 +1,9 @@
+ """Console entry point for ChuroVoice."""
+
+ from __future__ import annotations
+
+ from .assistant import main
+
+
+ if __name__ == "__main__":
+     main()
@@ -0,0 +1,11 @@
+ README.md
+ pyproject.toml
+ churovoice/__init__.py
+ churovoice/assistant.py
+ churovoice/cli.py
+ churovoice.egg-info/PKG-INFO
+ churovoice.egg-info/SOURCES.txt
+ churovoice.egg-info/dependency_links.txt
+ churovoice.egg-info/entry_points.txt
+ churovoice.egg-info/requires.txt
+ churovoice.egg-info/top_level.txt
@@ -0,0 +1,2 @@
+ [console_scripts]
+ churovoice = churovoice.cli:main
@@ -0,0 +1,13 @@
+ torch
+ ollama
+ SpeechRecognition
+ edge-tts
+ ddgs
+ opencv-python
+ rich
+ diffusers
+ transformers
+ accelerate
+ safetensors
+ Pillow
+ term-image
@@ -0,0 +1 @@
+ churovoice
@@ -0,0 +1,49 @@
+ [build-system]
+ requires = ["setuptools>=68", "wheel"]
+ build-backend = "setuptools.build_meta"
+
+ [project]
+ name = "churovoice"
+ version = "0.1.0"
+ description = "A multimodal voice assistant with web search, vision, and image generation."
+ readme = "README.md"
+ requires-python = ">=3.11"
+ dependencies = [
+     "torch",
+     "ollama",
+     "SpeechRecognition",
+     "edge-tts",
+     "ddgs",
+     "opencv-python",
+     "rich",
+     "diffusers",
+     "transformers",
+     "accelerate",
+     "safetensors",
+     "Pillow",
+     "term-image",
+ ]
+
+ authors = [
+     { name = "Lakshya Prajapati" }
+ ]
+
+ classifiers = [
+     "Development Status :: 3 - Alpha",
+     "Intended Audience :: Developers",
+     "Programming Language :: Python :: 3",
+     "Programming Language :: Python :: 3.11",
+     "Operating System :: MacOS",
+     "Topic :: Multimedia :: Sound/Audio :: Speech",
+     "Topic :: Scientific/Engineering :: Artificial Intelligence",
+ ]
+
+ [project.scripts]
+ churovoice = "churovoice.cli:main"
+
+ [tool.setuptools]
+ include-package-data = true
+
+ [tool.setuptools.packages.find]
+ where = ["."]
+ include = ["churovoice*"]
@@ -0,0 +1,4 @@
+ [egg_info]
+ tag_build =
+ tag_date = 0
+