rio-agent 0.9.0__tar.gz

@@ -0,0 +1,516 @@
1
+ Metadata-Version: 2.4
2
+ Name: rio-agent
3
+ Version: 0.9.0
4
+ Summary: Rio Agent: voice-first autonomous assistant with local and cloud runtimes
5
+ Author: Gowshik S
6
+ License-Expression: MIT
7
+ Keywords: rio,agent,gemini,automation,voice
8
+ Classifier: Development Status :: 4 - Beta
9
+ Classifier: Programming Language :: Python :: 3
10
+ Classifier: Programming Language :: Python :: 3.11
11
+ Classifier: Programming Language :: Python :: 3.12
12
+ Classifier: Operating System :: OS Independent
13
+ Requires-Python: >=3.11
14
+ Description-Content-Type: text/markdown
15
+ Requires-Dist: pyyaml>=6.0
16
+ Requires-Dist: structlog>=24.0.0
17
+ Requires-Dist: python-dotenv>=1.0.0
18
+ Requires-Dist: google-genai>=1.0.0
19
+ Provides-Extra: cloud
20
+ Requires-Dist: fastapi>=0.115.0; extra == "cloud"
21
+ Requires-Dist: uvicorn[standard]>=0.34.0; extra == "cloud"
22
+ Requires-Dist: websockets>=14.0; extra == "cloud"
23
+ Provides-Extra: local
24
+ Requires-Dist: websockets>=14.0; extra == "local"
25
+ Requires-Dist: sounddevice>=0.5.0; extra == "local"
26
+ Requires-Dist: numpy>=1.24.0; extra == "local"
27
+ Requires-Dist: scipy>=1.11.0; extra == "local"
28
+ Requires-Dist: torch>=2.0.0; extra == "local"
29
+ Requires-Dist: PyAudio>=0.2.14; extra == "local"
30
+ Requires-Dist: pynput>=1.7.0; extra == "local"
31
+ Requires-Dist: pyautogui>=0.9.54; extra == "local"
32
+ Requires-Dist: pygetwindow>=0.0.9; extra == "local"
33
+ Requires-Dist: pyperclip>=1.8.2; extra == "local"
34
+ Requires-Dist: mss>=9.0.0; extra == "local"
35
+ Requires-Dist: Pillow>=10.0.0; extra == "local"
36
+ Requires-Dist: rapidocr-onnxruntime>=1.2.0; extra == "local"
37
+ Requires-Dist: vosk>=0.3.45; extra == "local"
38
+ Requires-Dist: scikit-learn>=1.3.0; extra == "local"
39
+ Requires-Dist: playwright>=1.40.0; extra == "local"
40
+ Requires-Dist: pywinauto>=0.6.8; extra == "local"
41
+ Requires-Dist: google-api-python-client>=2.0.0; extra == "local"
42
+ Provides-Extra: dev
43
+ Requires-Dist: pytest>=8.0.0; extra == "dev"
44
+ Requires-Dist: pytest-asyncio>=0.23.0; extra == "dev"
45
+ Requires-Dist: build>=1.2.0; extra == "dev"
46
+ Requires-Dist: twine>=5.0.0; extra == "dev"
47
+
48
+ # Rio Agent
49
+
50
+ [![Python](https://img.shields.io/badge/Python-3.11+-3776AB?logo=python&logoColor=white)](https://python.org)
51
+ [![Google Cloud Run](https://img.shields.io/badge/Cloud%20Run-Deployed-4285F4?logo=googlecloud&logoColor=white)](https://cloud.google.com/run)
52
+ [![Gemini Live API](https://img.shields.io/badge/Gemini-Live%20API-FF6D00?logo=google&logoColor=white)](https://ai.google.dev)
53
+ [![ADK](https://img.shields.io/badge/Google%20ADK-Agent%20Pattern-7B1FA2)](https://google.github.io/adk-docs/)
54
+ [![FastAPI](https://img.shields.io/badge/FastAPI-WebSocket%20Relay-009688?logo=fastapi&logoColor=white)](https://fastapi.tiangolo.com)
55
+
56
+ > **One command. Full autonomy.**
57
+ > Rio listens, sees your screen, plans, acts, and reports — without requiring a single click from you.
58
+
59
+ ---
60
+
61
+ ## The Problem
62
+
63
+ Most AI assistants today are still a text box with better autocomplete. You remain the orchestrator: restating context after every turn, approving micro-steps, and manually bridging the gap between what you said and what needs to happen on screen. For real multi-step tasks such as "draft the emails, attach the report, schedule the follow-up", that model breaks down.
64
+
65
+ ## The Solution
66
+
67
+ Rio replaces the turn-by-turn loop with a **continuous multimodal control lane**. The local runtime streams your voice and live screen state to a cloud agent that plans, calls tools, confirms on-screen outcomes via OCR, and speaks the result back — all in one uninterrupted flow. You give one command. Rio closes the task.
68
+
69
+ ---
70
+
71
+ ## Challenge Categories
72
+
73
+ | Track | Status |
74
+ |---|---|
75
+ | ✅ Live Agent | Full — voice I/O, barge-in, persona, live bidirectional streaming |
76
+ | ✅ UI Navigator | Full — OCR-grounded screen understanding, Playwright browser control, post-action verification |
77
+ | 🔄 Storyteller / Creative Agent | Partial — Imagen 3 + Veo 2 generation works; narrative packaging in progress |
78
+
79
+ ---
80
+
81
+ ## Architecture
82
+
83
+ ### System Overview
84
+
85
+ ```mermaid
86
+ flowchart TD
87
+ U1["🎙 You speak\na command"]
88
+ U2["🖥 Your screen\nis captured"]
89
+
90
+ subgraph LOCAL["Your Machine — rio/local/"]
91
+ L1["Silero VAD\nfilters silence"]
92
+ L2["asyncio Orchestrator\nmain.py"]
93
+ L3["Tool Executor\n58 tools"]
94
+ end
95
+
96
+ subgraph CLOUD["Google Cloud Run — rio-agent-45"]
97
+ C1["FastAPI Gateway\nadk_server.py"]
98
+ C2["Gemini 2.5 Flash Native Audio\nlive voice session · barge-in"]
99
+ C3["ToolOrchestrator\nplans & routes tasks\n30 RPM rate limiter"]
100
+ C4["ToolBridge\nproxies tool calls\nto your machine"]
101
+ end
102
+
103
+ subgraph MODELS["Gemini Models — via Vertex AI"]
104
+ VAI["Vertex AI\nGoogle Cloud AI Platform"]
105
+ M1["Gemini 3-Flash\ntask reasoning"]
106
+ M2["Gemini Computer Use Preview\nreads screen → coordinates"]
107
+ M3["Imagen 3 · Veo 2\ncreative generation"]
108
+ VAI --> M1 & M2 & M3
109
+ end
110
+
111
+ U1 --> L1 --> L2
112
+ U2 --> L2
113
+
114
+ L2 -- "0x01 PCM16 audio\nover WebSocket" --> C1
115
+ L2 -- "0x02 JPEG frames\nover WebSocket" --> C1
116
+
117
+ C1 --> C2
118
+ C1 --> C3
119
+ C3 --> C4
120
+ C3 --> VAI
121
+
122
+ C4 -- "tool_call" --> L3
123
+ L3 -- "tool_result" --> C4
124
+
125
+ C3 -- "inject final result" --> C2
126
+ C2 -- "🔊 audio response\nover WebSocket" --> L2
127
+
128
+ L2 --> U1
129
+ ```
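
The diagram above tags audio as `0x01` and JPEG frames as `0x02` on the WebSocket. The following is a minimal sketch of how such tagging could work, assuming each binary message carries the tag as a single leading byte followed by the raw payload. The README does not specify the actual layout, and the helper names `encode_frame` and `decode_frame` are hypothetical.

```python
# Hypothetical framing helpers. Only the 0x01 / 0x02 tag values come from the
# diagram; the one-byte-prefix layout is an assumption for illustration.
AUDIO_PCM16 = 0x01
VIDEO_JPEG = 0x02
_KNOWN_TAGS = {AUDIO_PCM16, VIDEO_JPEG}


def encode_frame(tag: int, payload: bytes) -> bytes:
    """Prefix the payload with its one-byte type tag."""
    if tag not in _KNOWN_TAGS:
        raise ValueError(f"unknown frame tag: {tag:#04x}")
    return bytes([tag]) + payload


def decode_frame(message: bytes) -> tuple[int, bytes]:
    """Split a binary message into (tag, payload)."""
    if not message:
        raise ValueError("empty frame")
    tag, payload = message[0], message[1:]
    if tag not in _KNOWN_TAGS:
        raise ValueError(f"unknown frame tag: {tag:#04x}")
    return tag, payload
```

With a scheme like this, a receiver can route audio and screen frames arriving on the same socket by inspecting one byte, without parsing the payload.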
130
+
131
+ ### Tool Execution Flow
132
+
133
+ ```mermaid
134
+ flowchart TD
135
+ A["You say:\n'Open Gmail and\ndraft a reply'"]
136
+ B["Gemini 3-Flash\nbreaks task into steps"]
137
+ C{"What kind\nof task?"}
138
+ D["Browser tool\nPlaywright opens Gmail"]
139
+ E["Vision tool\nGemini Computer Use Preview\nreads screen → coordinates\npyautogui clicks"]
140
+ F["Workspace tool\nGmail API drafts reply"]
141
+ G["Result returned\nto ToolOrchestrator"]
142
+ H["Rio speaks:\n'Done — draft saved\nin Gmail'"]
143
+
144
+ A --> B --> C
145
+ C -- "browser action" --> D --> G
146
+ C -- "UI interaction" --> E --> G
147
+ C -- "workspace API" --> F --> G
148
+ G --> H
149
+ ```
150
+
151
+ > **Key design decision:** `RIO_LIVE_MODEL_TOOLS=false` by default. All tool execution routes through the text orchestrator, not the native audio model — this prevents unreliable function-calling in live audio sessions while keeping voice I/O seamless.
152
+
153
+ ---
154
+
155
+ ## Multimodal Experience
156
+
157
+ ### Beyond the Text Box
158
+ Rio's live runtime has no chat input field. Interaction is voice-in / voice-out, with screen vision as passive ground truth; a separate text-mode chat page is available in the dashboard.
159
+
160
+ | Criterion | How Rio Satisfies It | Evidence in Code |
161
+ |---|---|---|
162
+ | **Voice + Vision loop** | Mic + screenshot stream run as parallel asyncio loops; neither blocks the other | `local/main.py` — `audio_capture_loop` + `screen_capture_loop` |
163
+ | **Barge-in / interruption** | F2 PTT clears active playback immediately; VAD speech-start also interrupts | `local/push_to_talk.py`, `local/audio_io.py` playback cancel path |
164
+ | **Distinct persona / voice** | Agent name, role, and voice ID are config-driven (`RIO_VOICE`, `RIO_AGENT_NAME`) | `cloud/gemini_session.py` `_build_role_intro()`, `cloud/voice_plugin.py` |
165
+ | **Visual precision** | OCR extracts on-screen text before + after every action; `smart_click` sends the screenshot to **Gemini Computer Use Preview** which returns pixel coordinates — pyautogui then executes the physical click | `local/tools.py` `smart_click()`, `local/ocr.py` |
166
+ | **Live, not turn-based** | Bidirectional WebSocket + background orchestrator task with `inject_context()` keeps the voice session alive while tools execute | `cloud/adk_server.py` `inject_context()`, `cloud/tool_orchestrator.py` |
167
+
168
+ ---
169
+
170
+ ## Technical Implementation
171
+
172
+ ### Vision-Guided UI Control — Gemini Computer Use Preview
173
+
174
+ Rio uses **Gemini Computer Use Preview** (served via **Vertex AI**) as the vision intelligence layer for all UI interactions.
175
+
176
+ The pipeline works in two stages:
177
+
178
+ ```
179
+ Screenshot (JPEG)
180
+
181
+
182
+ Gemini Computer Use Preview (Vertex AI)
183
+ → reads screen context
184
+ → identifies target element
185
+ → returns normalized (x, y) coordinates
186
+
187
+
188
+ pyautogui
189
+ → physically moves mouse to coordinates
190
+ → executes click / drag / scroll
191
+
192
+
193
+ Post-action screenshot + OCR
194
+ → verifies the action had the expected effect
195
+ ```
196
+
197
+ This is what powers `smart_click(target, action)` in `local/tools.py` — you describe the element in plain language ("the Send button", "the search bar"), the Computer Use model locates it on the actual live screen, and pyautogui executes. No hardcoded coordinates, no brittle selectors. Rio sees what a human sees.
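
Because the model returns normalized coordinates, they have to be mapped to real screen pixels before pyautogui can act on them. The sketch below shows that conversion under the assumption that coordinates lie on a 0 to 999 grid; verify the range against the model documentation. The helper name `to_pixels` is hypothetical and is not taken from `local/tools.py`.

```python
def to_pixels(norm_x: float, norm_y: float, screen_w: int, screen_h: int,
              grid: int = 1000) -> tuple[int, int]:
    """Map a point in a grid x grid normalized space to screen pixels.

    The default grid size of 1000 (coordinates 0..999) is an assumption.
    """
    if not (0 <= norm_x < grid and 0 <= norm_y < grid):
        raise ValueError("normalized coordinate out of range")
    px = int(norm_x / grid * screen_w)
    py = int(norm_y / grid * screen_h)
    # Clamp so the result is always a valid pixel index.
    return min(px, screen_w - 1), min(py, screen_h - 1)
```

The resulting pair can then be passed to a call such as `pyautogui.click(x, y)`. Keeping the conversion in one place makes multi-monitor or DPI-scaling adjustments easier to add later.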
198
+
199
+ > **Gemini Computer Use Preview** is purpose-built for agents that interact with UIs — browsers, desktop apps, web applications — by understanding screen context rather than DOM structure. Rio uses it as the grounding layer so UI navigation degrades gracefully even when Playwright selectors can't reach an element.
200
+
201
+ ### Google Cloud & Gemini Integration
202
+
203
+ - **Google GenAI SDK + Vertex AI** used directly: `genai.Client`, `client.aio.live.connect`, `types.LiveConnectConfig`, `types.SpeechConfig`, `types.AutomaticActivityDetection`, `types.FunctionDeclaration.from_callable`
204
+ - **Vertex AI** is the platform backing Gemini Computer Use Preview, Gemini 3-Flash tool orchestration, Imagen 3, and Veo 2. Activated via `GOOGLE_GENAI_USE_VERTEXAI=true` with `GOOGLE_CLOUD_PROJECT` — same SDK client, zero code changes
205
+ - **Google Workspace APIs** (Gmail, Drive, Calendar, Sheets, Docs) integrated via `cloud/workspace_tools.py`
206
+ - **Cloud Run** manifest: `minScale=1`, `maxScale=5`, `sessionAffinity=true`, `timeoutSeconds=3600` — long-lived WebSocket sessions don't get killed mid-task
207
+
208
+ ### ToolBridge Pattern
209
+
210
+ One `ToolBridge` instance is created per WebSocket session. `_make_tools(bridge)` returns **58 async closures** scoped to that session — covering file ops, shell, screen automation, browser (Playwright), window management, clipboard, web search, Google Workspace, Imagen/Veo generation, memory, and skill-specific tools (customer care, tutoring). Results are Pydantic-validated before being fed back to the orchestrator.
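
The per-session closure idea can be sketched as follows. This is not the project's code: the `ToolBridge` class here is a stand-in with a made-up `call` method, `make_tools` only mirrors the role of `_make_tools(bridge)`, and `read_file` is a hypothetical tool name (only `smart_click` appears in this README).

```python
import asyncio
from typing import Any, Awaitable, Callable


class ToolBridge:
    """Stand-in for a per-session bridge; the real class is not shown here."""

    def __init__(self, session_id: str) -> None:
        self.session_id = session_id
        self.calls: list[tuple[str, dict[str, Any]]] = []

    async def call(self, name: str, args: dict[str, Any]) -> dict[str, Any]:
        # A real bridge would forward this over the session's WebSocket.
        self.calls.append((name, args))
        return {"session": self.session_id, "tool": name, "ok": True}


def make_tools(bridge: ToolBridge) -> dict[str, Callable[..., Awaitable[dict[str, Any]]]]:
    """Return async closures that capture `bridge`, one set per session."""

    async def read_file(path: str) -> dict[str, Any]:
        return await bridge.call("read_file", {"path": path})

    async def smart_click(target: str, action: str = "click") -> dict[str, Any]:
        return await bridge.call("smart_click", {"target": target, "action": action})

    return {"read_file": read_file, "smart_click": smart_click}
```

Because each closure captures its own `bridge`, two concurrent sessions cannot route a tool call to the wrong machine, and no session ID has to be threaded through every tool signature.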
211
+
212
+ ### Reliability & Error Handling
213
+
214
+ | Layer | Mechanism |
215
+ |---|---|
216
+ | Rate limiting | 30 RPM token bucket, 4 degradation levels (NORMAL → CAUTION → EMERGENCY → CRITICAL) |
217
+ | Tool safety | Dangerous shell patterns blocklisted; `write_file` creates `.rio.bak` before every edit |
218
+ | Model fallback | `SESSION_MODE` + model env overrides; legacy relay path (`RIO_USE_ADK=0`) as last resort |
219
+ | Tool timeouts | Per-tool and global timeout (`RIO_TOOLBRIDGE_TIMEOUT_SECONDS`); orchestrator caps at 50 iterations |
220
+ | Anti-hallucination | OCR + screenshot provide UI state evidence; tool outputs treated as execution truth, injected as grounding |
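
To illustrate the rate-limiting row, here is a minimal token-bucket sketch with the four named degradation levels. The 30 RPM budget and the level names come from the table; the thresholds that map remaining budget to a level, and the class and method names, are assumptions made for this example.

```python
import time
from enum import Enum


class Level(Enum):
    NORMAL = 0
    CAUTION = 1
    EMERGENCY = 2
    CRITICAL = 3


class TokenBucket:
    """Token bucket refilled continuously at `rpm` tokens per minute."""

    def __init__(self, rpm: int = 30, clock=time.monotonic) -> None:
        self.capacity = float(rpm)
        self.rate = rpm / 60.0  # tokens per second
        self.tokens = float(rpm)
        self.clock = clock
        self.last = clock()

    def _refill(self) -> None:
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_acquire(self) -> bool:
        """Take one token if available; return False when the budget is spent."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def level(self) -> Level:
        """Map the remaining budget to a level (thresholds are assumed)."""
        self._refill()
        remaining = self.tokens / self.capacity
        if remaining > 0.5:
            return Level.NORMAL
        if remaining > 0.2:
            return Level.CAUTION
        if remaining > 0.05:
            return Level.EMERGENCY
        return Level.CRITICAL
```

A caller could check `level()` before each model request and, for example, skip optional screenshot analysis once the bucket leaves `NORMAL`. The injectable `clock` argument exists so the behavior can be tested without sleeping.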
221
+
222
+ ### Config Resolution Priority
223
+
224
+ ```
225
+ ENV variable → .env / config.yaml → code defaults
226
+ ```
227
+ All model choices, timeouts, feature flags, and rate limits are overridable at runtime without code changes.
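
The precedence above can be sketched as a simple first-match lookup. This is an illustration only: the function name `resolve` is hypothetical, `.env` and `config.yaml` are collapsed into one `file_config` mapping, and the values in the usage example are placeholders rather than Rio's real defaults.

```python
import os
from typing import Any, Mapping

_MISSING = object()


def resolve(key: str, file_config: Mapping[str, Any], defaults: Mapping[str, Any],
            environ: Mapping[str, str] = os.environ) -> Any:
    """Return the first value found: environment, then file config, then defaults."""
    for source in (environ, file_config, defaults):
        value = source.get(key, _MISSING)
        if value is not _MISSING:
            return value
    raise KeyError(key)


# Usage with placeholder values (not Rio's actual defaults).
defaults = {"RIO_VOICE": "Puck", "SESSION_MODE": "live", "RIO_AGENT_NAME": "Rio"}
file_cfg = {"SESSION_MODE": "text"}
env = {"RIO_VOICE": "OtherVoice"}
voice = resolve("RIO_VOICE", file_cfg, defaults, env)
mode = resolve("SESSION_MODE", file_cfg, defaults, env)
name = resolve("RIO_AGENT_NAME", file_cfg, defaults, env)
```

Values read from the environment are always strings, so a real implementation would also need to cast numeric settings and flags.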
228
+
229
+ ---
230
+
231
+ ## Demo Scenario
232
+
233
+ > **Command:** *"Rio, open Chrome, find yesterday's unread emails, and draft a reply summary."*
234
+
235
+ | Time | What Happens | Observable Signal |
236
+ |---|---|---|
237
+ | T=0s | Voice command captured via F2 / VAD | Live transcription event in dashboard |
238
+ | T=3s | Rio acknowledges verbally; orchestrator begins tool routing | `tool_call` stream visible in dashboard tool log |
239
+ | T=8s | Browser opens; Gmail navigated via Playwright | Screenshot streamed; OCR extracts email subjects |
240
+ | T=15s | Draft composed; workspace tool writes to Gmail draft | `tool_result` confirms draft ID |
241
+ | T=20s | Rio speaks completion summary | Audio playback; dashboard shows full tool trace |
242
+
243
+ **No clicks. No text typed. One spoken sentence.**
244
+
245
+ ---
246
+
247
+ ## Try Rio Live
248
+
249
+ **[rio.gowshik.in](https://rio.gowshik.in)** — Rio is publicly deployed and accessible right now.
250
+
251
+ | Tier | Access |
252
+ |---|---|
253
+ | Free | Available immediately — try voice interaction, dashboard, and tool execution |
254
+ | Pro | Full autonomous task mode, screen control, and all 58 tools unlocked |
255
+
256
+ > **For judges:** The demo video walkthrough covers the full Pro-tier capability. If you'd like live Pro access during evaluation, reach out at [rio.gowshik.in](https://rio.gowshik.in).
257
+
258
+ ---
259
+
260
+ ## Cloud Deployment
261
+
262
+ | Field | Value |
263
+ |---|---|
264
+ | GCP Project | `rio-agent-45` |
265
+ | Cloud Run Service | `rio-cloud` |
266
+ | Region | `us-central1` |
267
+ | Container | Python 3.11-slim · non-root · healthcheck |
268
+
269
+ **Verify live deployment:**
270
+ ```bash
271
+ curl -s https://rio-landing-979788564023.us-central1.run.app/health | jq
272
+ # Expected: { "status": "ok", "service": "rio-cloud", "backend": "...", "model": "..." }
273
+ ```
274
+
275
+ **GCP services used:** Cloud Run · Gemini Live API · **Vertex AI** (Gemini Computer Use Preview · Gemini 3-Flash · Imagen 3 · Veo 2) · Secret Manager (`gemini-api-key`) · Google Workspace APIs
276
+
277
+ ---
278
+
279
+ ## Running Rio Locally
280
+
281
+ ### Prerequisites
282
+
283
+ | Requirement | Version | Notes |
284
+ |---|---|---|
285
+ | Python | 3.11+ | `python --version` to verify |
286
+ | Git | any | for cloning |
287
+ | Gemini API Key | — | [Get one here](https://aistudio.google.com/app/apikey) |
288
+ | Chrome / Chromium | any | required for browser automation tools |
289
+ | Microphone | — | any system mic works |
290
+
291
+ ### Step 1 — Clone
292
+
293
+ ```bash
294
+ git clone https://github.com/Gowshik-S/Gemini-Live-Agent
295
+ cd Gemini-Live-Agent
296
+ ```
297
+
298
+ ### Step 2 — Install Dependencies
299
+
300
+ ```bash
301
+ cd rio
302
+
303
+ # Create virtual environment
304
+ python -m venv .venv
305
+
306
+ # Activate
307
+ source .venv/bin/activate # Linux / macOS
308
+ .venv\Scripts\activate # Windows
309
+
310
+ # Install
311
+ pip install -r requirements.txt
312
+ ```
313
+
314
+ > Optional: install dev dependencies for running tests
315
+ > ```bash
316
+ > pip install -r requirements-dev.txt
317
+ > ```
318
+
319
+ ### Step 3 — Configure API Key
320
+
321
+ ```bash
322
+ echo "GEMINI_API_KEY=your_key_here" > cloud/.env
323
+ ```
324
+
325
+ That's the only required environment variable to get started. Everything else resolves from `rio/config.yaml` defaults.
326
+
327
+ **Optional overrides** (add to `cloud/.env` as needed):
328
+
329
+ ```bash
330
+ GOOGLE_CLOUD_PROJECT=your-gcp-project-id # required for Vertex AI path
331
+ GOOGLE_GENAI_USE_VERTEXAI=true # activate Vertex AI backend
332
+ SESSION_MODE=live # "live" (audio) or "text"
333
+ RIO_VOICE=Puck # agent voice identity
334
+ RIO_WS_TOKEN=secret # WebSocket auth token
335
+ ```
336
+
337
+ ### Step 4 — Start the Cloud Relay
338
+
339
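+ The cloud relay is the FastAPI server that hosts the Gemini Live session, orchestrator, and tool bridge. In production this runs on Cloud Run; locally it runs on port 8080. The paths below are relative to the repository root, so change back to it if you are still inside `rio/` from Step 2.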
340
+
341
+ ```bash
342
+ # Linux / macOS
343
+ cd rio/setup && ./run-cloud.sh
344
+
345
+ # Windows
346
+ cd rio\setup && run-cloud.bat
347
+
348
+ # Or directly
349
+ cd rio && uvicorn cloud.adk_server:app --host 0.0.0.0 --port 8080
350
+ ```
351
+
352
+ Confirm it's up:
353
+ ```bash
354
+ curl http://localhost:8080/health
355
+ # {"status":"ok","service":"rio-cloud","backend":"...","model":"..."}
356
+ ```
357
+
358
+ Dashboard: `http://localhost:8080/dashboard`
359
+
360
+ ### Step 5 — Start the Local Runtime
361
+
362
+ The local runtime handles mic capture, screen capture, VAD, and local tool execution. It connects to the relay over WebSocket.
363
+
364
+ ```bash
365
+ # Linux / macOS
366
+ cd rio/setup && ./run-local.sh
367
+
368
+ # Windows
369
+ cd rio\setup && run-local.bat
370
+
371
+ # Or directly
372
+ cd rio/local && python main.py
373
+ ```
374
+
375
+ Once both are running, press **F2** and speak a command.
376
+
377
+ ### Step 6 — Verify End-to-End
378
+
379
+ ```bash
380
+ # Run the full diagnostic suite
381
+ python -m rio.cli doctor --test-api
382
+ ```
383
+
384
+ This checks: config loading, API connectivity, rate limiter, model routing, tool imports, dashboard files, and wire protocol constants. Optional dependencies (ChromaDB, Playwright) are reported as skipped, not failed, if not installed.
385
+
386
+ ---
387
+
388
+ ### One-Line Install (Alternative)
389
+
390
+ ```powershell
391
+ powershell -c "irm https://rio.gowshik.in/install.ps1 | iex"
392
+ ```
393
+
394
+ ```bash
395
+ curl -fsSL https://rio.gowshik.in/install.sh | bash
396
+ ```
397
+
398
+ Or install directly from PyPI:
399
+
400
+ ```bash
401
+ pipx install rio-agent
402
+ # or
403
+ python -m pip install --user rio-agent
404
+ ```
405
+
406
+ ---
407
+
408
+ ### Cloud Run Deployment
409
+
410
+ The repo ships a deploy script that builds and pushes the container, then updates the Cloud Run service.
411
+
412
+ **Prerequisites:** `gcloud` CLI authenticated · Cloud Run + Cloud Build + Secret Manager APIs enabled · a secret named `gemini-api-key` in Secret Manager.
413
+
414
+ ```bash
415
+ cd Gemini-Live-Agent/rio
416
+ chmod +x deploy.sh
417
+ ./deploy.sh
418
+ ```
419
+
420
+ The script outputs the HTTP and WebSocket URLs. Paste the WebSocket URL into `rio/config.yaml`:
421
+
422
+ ```yaml
423
+ rio:
424
+ cloud_url: wss://<your-cloud-run-url>/ws/rio/live
425
+ ```
426
+
427
+ Then run the local runtime pointing at the deployed relay — no other changes needed.
428
+
429
+ ---
430
+
431
+ ## Agent & Tool Breakdown
432
+
433
+ ### Agents
434
+
435
+ | Agent | Model | Role |
436
+ |---|---|---|
437
+ | Task Executor | Gemini 3-Flash | Multi-step general task orchestration |
438
+ | Code Agent | Gemini 3-Flash | File editing, shell, git |
439
+ | Computer Use Agent | **Gemini Computer Use Preview** (Vertex AI) | Screen reading, coordinate grounding, GUI automation |
440
+ | Research Agent | Gemini 2.5-Pro | Deep reasoning, analysis |
441
+ | Creative Agent | Imagen 3 + Veo 2 (Vertex AI) | Image and video generation |
442
+
443
+ ### Tool Categories (58 total)
444
+
445
+ `File ops` · `Shell & process` · `Screen capture` · `Screen automation` · `Vision-guided click` · `Window management` · `Browser (Playwright)` · `Web search/fetch` · `Memory & notes` · `Google Workspace` · `Creative (Imagen/Veo)` · `Customer care skill` · `Tutoring skill`
446
+
447
+ ---
448
+
449
+ ## Dashboard
450
+
451
+ Rio ships a real-time operational dashboard served by the cloud relay at `http://localhost:8080/dashboard/` (or your Cloud Run URL in production).
452
+
453
+ | Page | URL | Purpose |
454
+ |---|---|---|
455
+ | Main dashboard | `/dashboard/` | Live transcript, tool log, health gauges, agent status |
456
+ | Chat | `/dashboard/chat.html` | Text-mode interaction with Rio |
457
+ | Setup | `/dashboard/setup.html` | Configure skills, profiles, API keys, agent behavior |
458
+
459
+ ### What the Dashboard Shows
460
+
461
+ **Live Transcript stream** — every utterance and Rio's responses appear in real time via `/ws/dashboard` WebSocket. You can see exactly what Rio heard and what it decided.
462
+
463
+ **Tool log** — every `tool_call` and `tool_result` is correlated and displayed with timing. Judges can watch Rio's execution trace — open Gmail, extract text, draft reply — step by step as it happens.
464
+
465
+ **Health gauges** — RPM usage, degradation level, active session state, and model routing decisions are surfaced live. If the rate limiter kicks in, the gauge shows it.
466
+
467
+ **Schedules** — view and manage any scheduled or trigger-based tasks Rio has queued.
468
+
469
+ **Setup page** — first-run configuration UI. Set your Gemini API key, choose agent skills (Customer Care / Tutor), configure voice, and save profile JSON — all without touching config files.
470
+
471
+ The dashboard is pure static HTML/CSS/JS served directly by FastAPI — no separate frontend server needed. It connects to the relay over WebSocket and polls HTTP config endpoints for state.
472
+
473
+ ---
474
+
475
+ ## Runtime Controls
476
+
477
+ | Key | Action |
478
+ |---|---|
479
+ | F2 | Push-to-talk — hold to speak, release to send; also interrupts active playback |
480
+ | F3 | Mute toggle |
481
+ | F4 | Toggle proactive mode — Rio watches and offers help unprompted |
482
+ | F5 | Screen mode — cycle between on-demand and autonomous capture |
483
+ | F6 | Live mode — continuous monitoring + wake word ("Hey Rio") |
484
+ | F7 | Live translation toggle — real-time bidirectional speech translation |
485
+ | F8 | Current task status — speak or display active task progress |
486
+
487
+ ---
488
+
489
+ ## Known Limitations
490
+
491
+ - **Installer shell PATH propagation**: after a `pipx` or `pip --user` install, some shells require reopening the terminal before the `rio` command is available.
492
+ - **A2A protocol** — no Agent Cards or remote agent discovery yet. `[In Progress]`
493
+ - **pyautogui → Playwright** unification still expanding. `[In Progress]`
494
+ - **React dashboard** planned; current UI is static HTML/JS served by FastAPI. `[In Progress]`
495
+ - Desktop automation on Wayland and in elevated/admin windows is less reliable than on a standard Windows desktop.
496
+
497
+ ---
498
+
499
+ ## Tech Stack
500
+
501
+ | Layer | Technology |
502
+ |---|---|
503
+ | Cloud relay | FastAPI · uvicorn · websockets · structlog |
504
+ | AI / Gemini | `google-genai` SDK · **Vertex AI** · Live API · Gemini 2.5 Flash Native Audio · Gemini 3-Flash · Gemini Computer Use Preview |
505
+ | Audio | sounddevice · Silero VAD · CPU PyTorch |
506
+ | Vision | mss · Pillow · RapidOCR (ONNX) · Gemini Computer Use Preview |
507
+ | Automation | pyautogui · pygetwindow · Playwright |
508
+ | Memory | SQLite · ChromaDB · **Gemini Text Embeddings 2** (`text-embedding-004`) |
509
+ | Deployment | Cloud Run · Docker (Python 3.11-slim) |
510
+
511
+ ---
512
+
513
+ ## License & Contact
514
+
515
+ License: MIT; see `LICENSE` in the repository root.
516
+ GitHub: [Gowshik-S/Gemini-Live-Agent](https://github.com/Gowshik-S/Gemini-Live-Agent) — open an issue for deployment support or collaboration.