pipecat-ai-mcp-server 0.0.4__tar.gz → 0.0.11__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (41) hide show
  1. pipecat_ai_mcp_server-0.0.11/.claude/skills/pipecat/SKILL.md +35 -0
  2. {pipecat_ai_mcp_server-0.0.4 → pipecat_ai_mcp_server-0.0.11}/.github/ISSUE_TEMPLATE/1-bug_report.yml +0 -9
  3. {pipecat_ai_mcp_server-0.0.4 → pipecat_ai_mcp_server-0.0.11}/.github/ISSUE_TEMPLATE/2-question.yml +0 -9
  4. pipecat_ai_mcp_server-0.0.11/CHANGELOG.md +109 -0
  5. {pipecat_ai_mcp_server-0.0.4 → pipecat_ai_mcp_server-0.0.11}/PKG-INFO +35 -57
  6. {pipecat_ai_mcp_server-0.0.4 → pipecat_ai_mcp_server-0.0.11}/README.md +24 -52
  7. {pipecat_ai_mcp_server-0.0.4 → pipecat_ai_mcp_server-0.0.11}/pyproject.toml +10 -2
  8. {pipecat_ai_mcp_server-0.0.4 → pipecat_ai_mcp_server-0.0.11}/src/pipecat_ai_mcp_server.egg-info/PKG-INFO +35 -57
  9. {pipecat_ai_mcp_server-0.0.4 → pipecat_ai_mcp_server-0.0.11}/src/pipecat_ai_mcp_server.egg-info/SOURCES.txt +7 -2
  10. pipecat_ai_mcp_server-0.0.11/src/pipecat_ai_mcp_server.egg-info/requires.txt +22 -0
  11. {pipecat_ai_mcp_server-0.0.4 → pipecat_ai_mcp_server-0.0.11}/src/pipecat_mcp_server/agent.py +72 -33
  12. {pipecat_ai_mcp_server-0.0.4 → pipecat_ai_mcp_server-0.0.11}/src/pipecat_mcp_server/agent_ipc.py +3 -8
  13. {pipecat_ai_mcp_server-0.0.4 → pipecat_ai_mcp_server-0.0.11}/src/pipecat_mcp_server/bot.py +12 -0
  14. pipecat_ai_mcp_server-0.0.11/src/pipecat_mcp_server/processors/kokoro_tts.py +177 -0
  15. pipecat_ai_mcp_server-0.0.11/src/pipecat_mcp_server/processors/screen_capture/__init__.py +11 -0
  16. pipecat_ai_mcp_server-0.0.11/src/pipecat_mcp_server/processors/screen_capture/base_capture_backend.py +86 -0
  17. pipecat_ai_mcp_server-0.0.11/src/pipecat_mcp_server/processors/screen_capture/linux_x11_capture_backend.py +257 -0
  18. pipecat_ai_mcp_server-0.0.11/src/pipecat_mcp_server/processors/screen_capture/macos_capture_backend.py +303 -0
  19. pipecat_ai_mcp_server-0.0.11/src/pipecat_mcp_server/processors/screen_capture/screen_capture_processor.py +147 -0
  20. pipecat_ai_mcp_server-0.0.11/src/pipecat_mcp_server/processors/vision.py +62 -0
  21. {pipecat_ai_mcp_server-0.0.4 → pipecat_ai_mcp_server-0.0.11}/src/pipecat_mcp_server/server.py +46 -0
  22. pipecat_ai_mcp_server-0.0.4/.claude/skills/pipecat/SKILL.md +0 -26
  23. pipecat_ai_mcp_server-0.0.4/CHANGELOG.md +0 -38
  24. pipecat_ai_mcp_server-0.0.4/src/pipecat_ai_mcp_server.egg-info/requires.txt +0 -11
  25. pipecat_ai_mcp_server-0.0.4/src/pipecat_mcp_server/processors/screen_capture.py +0 -240
  26. pipecat_ai_mcp_server-0.0.4/uv.lock +0 -7265
  27. {pipecat_ai_mcp_server-0.0.4 → pipecat_ai_mcp_server-0.0.11}/.github/ISSUE_TEMPLATE/3-feature_request.yml +0 -0
  28. {pipecat_ai_mcp_server-0.0.4 → pipecat_ai_mcp_server-0.0.11}/.github/ISSUE_TEMPLATE/config.yml +0 -0
  29. {pipecat_ai_mcp_server-0.0.4 → pipecat_ai_mcp_server-0.0.11}/.github/PULL_REQUEST_TEMPLATE.md +0 -0
  30. {pipecat_ai_mcp_server-0.0.4 → pipecat_ai_mcp_server-0.0.11}/.github/workflows/build.yaml +0 -0
  31. {pipecat_ai_mcp_server-0.0.4 → pipecat_ai_mcp_server-0.0.11}/.github/workflows/format.yaml +0 -0
  32. {pipecat_ai_mcp_server-0.0.4 → pipecat_ai_mcp_server-0.0.11}/.github/workflows/publish.yaml +0 -0
  33. {pipecat_ai_mcp_server-0.0.4 → pipecat_ai_mcp_server-0.0.11}/.gitignore +0 -0
  34. {pipecat_ai_mcp_server-0.0.4 → pipecat_ai_mcp_server-0.0.11}/CHANGELOG.md.template +0 -0
  35. {pipecat_ai_mcp_server-0.0.4 → pipecat_ai_mcp_server-0.0.11}/LICENSE +0 -0
  36. {pipecat_ai_mcp_server-0.0.4 → pipecat_ai_mcp_server-0.0.11}/pipecat.png +0 -0
  37. {pipecat_ai_mcp_server-0.0.4 → pipecat_ai_mcp_server-0.0.11}/setup.cfg +0 -0
  38. {pipecat_ai_mcp_server-0.0.4 → pipecat_ai_mcp_server-0.0.11}/src/pipecat_ai_mcp_server.egg-info/dependency_links.txt +0 -0
  39. {pipecat_ai_mcp_server-0.0.4 → pipecat_ai_mcp_server-0.0.11}/src/pipecat_ai_mcp_server.egg-info/entry_points.txt +0 -0
  40. {pipecat_ai_mcp_server-0.0.4 → pipecat_ai_mcp_server-0.0.11}/src/pipecat_ai_mcp_server.egg-info/top_level.txt +0 -0
  41. {pipecat_ai_mcp_server-0.0.4 → pipecat_ai_mcp_server-0.0.11}/src/pipecat_mcp_server/processors/__init__.py +0 -0
@@ -0,0 +1,35 @@
1
+ ---
2
+ name: pipecat
3
+ description: Start a voice conversation using the Pipecat MCP server
4
+ ---
5
+
6
+ Start a voice conversation using the Pipecat MCP server.
7
+
8
+ ## Flow
9
+
10
+ 1. Print a nicely formatted message with bullet points in the terminal with the following information:
11
+ - The voice session is starting
12
+ - Once ready, they can connect via the transport of their choice (Pipecat Playground, Daily room, or phone call)
13
+ - Models are downloaded on the first user connection, so the first connection may take a moment
14
+ - If the connection is not established and the user cannot hear any audio, they should check the terminal for errors from the Pipecat MCP server
15
+ 2. Call `start()` to initialize the voice agent
16
+ 3. Greet the user with `speak()`, then call `listen()` to wait for input
17
+ 4. When the user asks you to perform a task:
18
+ - Acknowledge the request with `speak()` (do NOT call `listen()` yet)
19
+ - Perform the work (edit files, run commands, etc.)
20
+ - IMPORTANT: Call `speak()` frequently to give progress updates — after each significant step (e.g., "Reading the file now", "Making the change", "Done with the first file, moving to the next one"). Never let more than a few tool calls go by in silence.
21
+ - Once the task is complete, use `speak()` to report the result
22
+ - Only then call `listen()` to wait for the next user input
23
+ 5. When the user asks a simple question or makes conversation (no task to perform), respond with `speak()` then immediately call `listen()`
24
+ 6. If the user wants to end the conversation, ask for verbal confirmation before stopping. When in doubt, keep listening.
25
+ 7. Once confirmed, say goodbye with `speak()`, then call `stop()`
26
+
27
+ The key principle: `listen()` means "I'm done and ready for the user to talk." Never call it while you still have work to do or updates to communicate.
28
+
29
+ ## Guidelines
30
+
31
+ - Keep all responses and progress updates to 1-2 short sentences. Brevity is critical for voice.
32
+ - When the user asks you to perform a task (e.g., edit a file, create a PR), verbally acknowledge the request first, then start working on it. Do not work in silence.
33
+ - Before any change (files, PRs, issues, etc.), show the proposed change in the terminal, use `speak()` to ask for verbal confirmation, then call `listen()` to get the user's response before proceeding.
34
+ - When using `list_windows()` and `screen_capture()`, if there are multiple windows for the same app or you're unsure which window the user wants, ask for clarification before capturing.
35
+ - Always call `stop()` when the conversation ends.
@@ -50,15 +50,6 @@ body:
50
50
  validations:
51
51
  required: true
52
52
 
53
- - type: input
54
- id: browser
55
- attributes:
56
- label: Browser
57
- description: Which browser are you using?
58
- placeholder: e.g., Chrome 139.0.7258.127
59
- validations:
60
- required: true
61
-
62
53
  - type: textarea
63
54
  id: description
64
55
  attributes:
@@ -50,15 +50,6 @@ body:
50
50
  validations:
51
51
  required: false
52
52
 
53
- - type: input
54
- id: browser
55
- attributes:
56
- label: Browser
57
- description: Which browser are you using?
58
- placeholder: e.g., Chrome 139.0.7258.127
59
- validations:
60
- required: false
61
-
62
53
  - type: textarea
63
54
  id: question
64
55
  attributes:
@@ -0,0 +1,109 @@
1
+ # Changelog
2
+
3
+ All notable changes to **Pipecat MCP Server** will be documented in this file.
4
+
5
+ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
6
+ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
7
+
8
+ ## [0.0.11] - 2026-02-02
9
+
10
+ ### Added
11
+
12
+ - New `capture_screenshot()` MCP tool that captures the current screen frame and
13
+ returns an image path. This allows the agent to visually analyze what's on
14
+ screen and help with debugging, UI feedback, and more.
15
+
16
+ ## [0.0.10] - 2026-02-01
17
+
18
+ ### Added
19
+
20
+ - New `list_windows()` MCP tool to list all open windows with title, app name,
21
+ and window ID.
22
+
23
+ - New `screen_capture(window_id)` MCP tool to start or switch screen capture to
24
+ a specific window or full screen during a voice conversation.
25
+
26
+ ### Changed
27
+
28
+ - Screen capture dependencies are now included by default (no longer an optional
29
+ `[screen]` extra).
30
+
31
+ - Screen capture is no longer configured via environment variables
32
+ (`PIPECAT_MCP_SERVER_SCREEN_CAPTURE`, `PIPECAT_MCP_SERVER_SCREEN_WINDOW`).
33
+ Use the `list_windows()` and `screen_capture()` tools instead.
34
+
35
+ ## [0.0.9] - 2026-01-31
36
+
37
+ ### Changed
38
+
39
+ - Linux X11 screen capture backend using python-xlib.
40
+
41
+ - Native macOS screen capture using ScreenCaptureKit. Supports true window-level
42
+ capture not affected by overlapping windows.
43
+
44
+ ## [0.0.8] - 2026-01-31
45
+
46
+ ### Changed
47
+
48
+ - Updated to Pipecat >= 0.0.101.
49
+
50
+ ## [0.0.7] - 2026-01-31
51
+
52
+ ### Changed
53
+
54
+ - `KokoroTTSService` now uses `kokoro-onnx`.
55
+
56
+ ## [0.0.6] - 2026-01-29
57
+
58
+ ### Added
59
+
60
+ - Added `KokoroTTSService` processor.
61
+
62
+ - Added noise cancellation with `RNNoiseFilter`.
63
+
64
+ - Simplified the `/pipecat` skill instructions.
65
+
66
+ ### Changed
67
+
68
+ - Replaced third-party STT/TTS services (Deepgram, Cartesia) with local models:
69
+ Faster Whisper for speech-to-text and Kokoro for text-to-speech. No API keys
70
+ required.
71
+
72
+ ## [0.0.5] - 2026-01-28
73
+
74
+ ### Fixed
75
+
76
+ - Fixed an issue that would cause an MCP session to crash and would force the
77
+ MCP client to reconnect each time.
78
+
79
+ ## [0.0.4] - 2026-01-26
80
+
81
+ ### Fixed
82
+
83
+ - Fixed an issue where Daily clients couldn't reconnect after disconnecting.
84
+
85
+ ## [0.0.3] - 2026-01-26
86
+
87
+ ### Fixed
88
+
89
+ - Fixed premature exit of the `/pipecat` skill when user responds with phrases
90
+ like "no", "nothing", or "that's it" instead of explicit ending phrases.
91
+
92
+ - Fixed an issue where WebRTC clients couldn't reconnect after disconnecting.
93
+ The agent now properly handles disconnect/reconnect cycles.
94
+
95
+ - Fixed an issue where `pipecat-mcp-server` could hang indefinitely after
96
+ pressing Ctrl-C.
97
+
98
+ ## [0.0.2] - 2026-01-26
99
+
100
+ ### Fixed
101
+
102
+ - Fixed an issue that would cause the Pipecat agent to not load if the optional
103
+ `daily` dependency was not installed.
104
+
105
+ - Added missing support for `telnyx`, `plivo` and `exotel` telephony providers.
106
+
107
+ ## [0.0.1] - 2026-01-26
108
+
109
+ Initial public release.
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: pipecat-ai-mcp-server
3
- Version: 0.0.4
3
+ Version: 0.0.11
4
4
  Summary: Pipecat MCP server for your AI agents
5
5
  License-Expression: BSD-2-Clause
6
6
  Project-URL: Homepage, https://pipecat.ai
@@ -18,15 +18,21 @@ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
18
18
  Requires-Python: >=3.10
19
19
  Description-Content-Type: text/markdown
20
20
  License-File: LICENSE
21
+ Requires-Dist: kokoro-onnx<1,>=0.5.0
21
22
  Requires-Dist: loguru<1,>=0.7.0
22
23
  Requires-Dist: mcp>=1.0.0
23
- Requires-Dist: pipecat-ai[cartesia,deepgram,local-smart-turn-v3,runner,silero,webrtc,websocket]>=0.0.100
24
+ Requires-Dist: pip>=25.3
25
+ Requires-Dist: pipecat-ai[cartesia,deepgram,local-smart-turn-v3,rnnoise,runner,silero,webrtc,websocket]>=0.0.101
26
+ Requires-Dist: pipecat-ai[mlx-whisper]>=0.0.100; sys_platform == "darwin"
27
+ Requires-Dist: pipecat-ai[whisper]>=0.0.100; sys_platform != "darwin"
28
+ Requires-Dist: pyobjc-framework-CoreMedia>=11.0; sys_platform == "darwin"
29
+ Requires-Dist: pyobjc-framework-Quartz>=11.0; sys_platform == "darwin"
30
+ Requires-Dist: pyobjc-framework-ScreenCaptureKit>=11.0; sys_platform == "darwin"
24
31
  Requires-Dist: python-dotenv<2,>=1.0.0
32
+ Requires-Dist: python-xlib>=0.33; sys_platform == "linux"
33
+ Requires-Dist: requests<3,>=2.32.5
25
34
  Provides-Extra: daily
26
35
  Requires-Dist: daily-python~=0.23.0; extra == "daily"
27
- Provides-Extra: screen
28
- Requires-Dist: mss>=10.0.0; extra == "screen"
29
- Requires-Dist: pywinctl>=0.4; extra == "screen"
30
36
  Dynamic: license-file
31
37
 
32
38
  <h1><div align="center">
@@ -39,16 +45,18 @@ Dynamic: license-file
39
45
 
40
46
  Pipecat MCP Server gives your AI agents a voice using [Pipecat](https://github.com/pipecat-ai/pipecat). It should work with any [MCP](https://modelcontextprotocol.io/)-compatible client:
41
47
 
42
- The Pipecat MCP Server exposes **voice-related tools** (`start`, `listen`, `speak`, `stop`) to MCP-compatible clients, but **it does not itself provide microphone or speaker access**.
48
+ The Pipecat MCP Server exposes **voice-related** and **screen capture** tools to MCP-compatible clients, but **it does not itself provide microphone or speaker access**.
43
49
 
44
- Audio input/output is handled by a **separate audio transport**, such as:
50
+ Audio input/output is handled by a **separate audio/video transport**, such as:
45
51
 
46
52
  - **Pipecat Playground** (local browser UI)
47
53
  - **Daily** (WebRTC room)
48
54
  - **Phone providers** (Twilio, Telnyx, etc.)
49
55
 
50
56
  > **MCP clients like Cursor, Claude Code, and Codex control the agent, but they are not audio devices.**
51
- > To hear or speak, you must also connect via one of the audio transports.
57
+ > To hear, speak, or see, you must connect via one of the audio/video transports.
58
+
59
+ <p align="center"><video src="https://github.com/user-attachments/assets/0ad14e37-2de7-46df-870a-167aa667df16" width="500" controls></video></p>
52
60
 
53
61
  ## 🧭 Getting started
54
62
 
@@ -56,9 +64,8 @@ Audio input/output is handled by a **separate audio transport**, such as:
56
64
 
57
65
  - Python 3.10 or later
58
66
  - [uv](https://docs.astral.sh/uv/getting-started/installation/) package manager
59
- - API keys for third-party services (Speech-to-Text, Text-to-Speech, ...)
60
67
 
61
- By default, the voice agent uses [Deepgram](https://deepgram.com) for speech-to-text and [Cartesia](https://cartesia.ai/) for text-to-speech.
68
+ By default, the voice agent uses local models (no API keys required): [Faster Whisper](https://github.com/SYSTRAN/faster-whisper) for speech-to-text and [Kokoro](https://github.com/hexgrad/kokoro) for text-to-speech. The Whisper models are approximately 1.5 GB and are downloaded automatically on the first connection, so the initial startup may take a moment.
62
69
 
63
70
  ### Installation
64
71
 
@@ -82,14 +89,7 @@ uv tool install -e /path/to/repo/pipecat-mcp-server
82
89
 
83
90
  ## Running the server
84
91
 
85
- First, set your API keys as environment variables:
86
-
87
- ```bash
88
- export DEEPGRAM_API_KEY=your-deepgram-key
89
- export CARTESIA_API_KEY=your-cartesia-key
90
- ```
91
-
92
- Then start the server:
92
+ Start the server:
93
93
 
94
94
  ```bash
95
95
  pipecat-mcp-server
@@ -109,6 +109,20 @@ The [Pipecat skill](.claude/skills/pipecat/SKILL.md) provides a better voice con
109
109
 
110
110
  Alternatively, just tell your agent something like `Let's have a voice conversation`. In this case, the agent won't ask for verbal confirmation before making changes.
111
111
 
112
+ ## 🖥️ Screen Capture & Analysis
113
+
114
+ Screen capture lets you stream your screen (or a specific window) to your configured transport, and ask the agent to help with what it sees.
115
+
116
+ For example:
117
+ - *"capture my browser window"* — starts streaming that window
118
+ - *"what's causing this error?"* — the agent analyzes the screen and helps debug
119
+ - *"how does this UI look?"* — get feedback on your design
120
+
121
+ **Supported platforms:**
122
+
123
+ - **macOS** — uses ScreenCaptureKit for true window-level capture (not affected by overlapping windows)
124
+ - **Linux (X11)** — uses Xlib for window and full-screen capture
125
+
112
126
  ## 💻 MCP Client: Claude Code
113
127
 
114
128
  ### Adding the MCP server
@@ -230,11 +244,11 @@ First, install the server with the Daily dependency:
230
244
  uv tool install pipecat-ai-mcp-server[daily]
231
245
  ```
232
246
 
233
- Then, set the `DAILY_API_KEY` environment variable to your Daily API key and `DAILY_SAMPLE_ROOM_URL` to your desired Daily room URL and pass the `-d` argument to `pipecat-mcp-server`.
247
+ Then, set the `DAILY_API_KEY` environment variable to your Daily API key and `DAILY_ROOM_URL` to your desired Daily room URL and pass the `-d` argument to `pipecat-mcp-server`.
234
248
 
235
249
  ```bash
236
250
  export DAILY_API_KEY=your-daily-api-key
237
- export DAILY_SAMPLE_ROOM_URL=your-daily-room
251
+ export DAILY_ROOM_URL=your-daily-room
238
252
 
239
253
  pipecat-mcp-server -d
240
254
  ```
@@ -271,45 +285,9 @@ pipecat-mcp-server -t twilio -x your-proxy.ngrok.app
271
285
 
272
286
  Configure your provider's phone number to point to your ngrok URL, then call your number to connect.
273
287
 
274
- ## 🧪 Screen Capture (Experimental)
275
-
276
- You can enable screen capture to stream your screen (or a specific window) to the Pipecat Playground or Daily room. This lets you see what's happening on your computer remotely while having a voice conversation with the agent.
277
-
278
- First, install the server with the screen capture dependency:
279
-
280
- ```bash
281
- uv tool install "pipecat-ai-mcp-server[screen]"
282
- ```
283
-
284
- Then, define the following environment variables:
285
-
286
- | Variable | Description |
287
- |-------------------------------------|--------------------------------------------------------------------|
288
- | `PIPECAT_MCP_SERVER_SCREEN_CAPTURE` | Set to any value (e.g., `1`) to enable screen capture |
289
- | `PIPECAT_MCP_SERVER_SCREEN_WINDOW` | Optional. Window name to capture (partial match, case-insensitive) |
290
-
291
- For example, to capture your entire primary monitor:
292
-
293
- ```bash
294
- export PIPECAT_MCP_SERVER_SCREEN_CAPTURE=1
295
-
296
- pipecat-mcp-server
297
- ```
298
-
299
- And to capture a specific window:
300
-
301
- ```bash
302
- export PIPECAT_MCP_SERVER_SCREEN_CAPTURE=1
303
- export PIPECAT_MCP_SERVER_SCREEN_WINDOW="claude"
304
-
305
- pipecat-mcp-server
306
- ```
307
-
308
- > ℹ️ **Note:** Window capture is based on window coordinates, not content. If another window overlaps the target, the overlapping content will be captured. The capture region updates dynamically if the window is moved. If the specified window is not found, capture falls back to the full screen.
309
-
310
288
  ## 📚 What's Next?
311
289
 
312
- - **Customize services**: Edit `agent.py` to use different STT/TTS providers (ElevenLabs, OpenAI, etc.)
290
+ - **Customize services**: Edit `agent.py` to use different STT/TTS providers
313
291
  - **Change transport**: Configure for Twilio, WebRTC, or other transports
314
292
  - **Add to your project**: Use this as a template for voice-enabled MCP tools
315
293
  - **Learn more**: Check out [Pipecat's docs](https://docs.pipecat.ai/) for advanced features
@@ -8,16 +8,18 @@
8
8
 
9
9
  Pipecat MCP Server gives your AI agents a voice using [Pipecat](https://github.com/pipecat-ai/pipecat). It should work with any [MCP](https://modelcontextprotocol.io/)-compatible client:
10
10
 
11
- The Pipecat MCP Server exposes **voice-related tools** (`start`, `listen`, `speak`, `stop`) to MCP-compatible clients, but **it does not itself provide microphone or speaker access**.
11
+ The Pipecat MCP Server exposes **voice-related** and **screen capture** tools to MCP-compatible clients, but **it does not itself provide microphone or speaker access**.
12
12
 
13
- Audio input/output is handled by a **separate audio transport**, such as:
13
+ Audio input/output is handled by a **separate audio/video transport**, such as:
14
14
 
15
15
  - **Pipecat Playground** (local browser UI)
16
16
  - **Daily** (WebRTC room)
17
17
  - **Phone providers** (Twilio, Telnyx, etc.)
18
18
 
19
19
  > **MCP clients like Cursor, Claude Code, and Codex control the agent, but they are not audio devices.**
20
- > To hear or speak, you must also connect via one of the audio transports.
20
+ > To hear, speak, or see, you must connect via one of the audio/video transports.
21
+
22
+ <p align="center"><video src="https://github.com/user-attachments/assets/0ad14e37-2de7-46df-870a-167aa667df16" width="500" controls></video></p>
21
23
 
22
24
  ## 🧭 Getting started
23
25
 
@@ -25,9 +27,8 @@ Audio input/output is handled by a **separate audio transport**, such as:
25
27
 
26
28
  - Python 3.10 or later
27
29
  - [uv](https://docs.astral.sh/uv/getting-started/installation/) package manager
28
- - API keys for third-party services (Speech-to-Text, Text-to-Speech, ...)
29
30
 
30
- By default, the voice agent uses [Deepgram](https://deepgram.com) for speech-to-text and [Cartesia](https://cartesia.ai/) for text-to-speech.
31
+ By default, the voice agent uses local models (no API keys required): [Faster Whisper](https://github.com/SYSTRAN/faster-whisper) for speech-to-text and [Kokoro](https://github.com/hexgrad/kokoro) for text-to-speech. The Whisper models are approximately 1.5 GB and are downloaded automatically on the first connection, so the initial startup may take a moment.
31
32
 
32
33
  ### Installation
33
34
 
@@ -51,14 +52,7 @@ uv tool install -e /path/to/repo/pipecat-mcp-server
51
52
 
52
53
  ## Running the server
53
54
 
54
- First, set your API keys as environment variables:
55
-
56
- ```bash
57
- export DEEPGRAM_API_KEY=your-deepgram-key
58
- export CARTESIA_API_KEY=your-cartesia-key
59
- ```
60
-
61
- Then start the server:
55
+ Start the server:
62
56
 
63
57
  ```bash
64
58
  pipecat-mcp-server
@@ -78,6 +72,20 @@ The [Pipecat skill](.claude/skills/pipecat/SKILL.md) provides a better voice con
78
72
 
79
73
  Alternatively, just tell your agent something like `Let's have a voice conversation`. In this case, the agent won't ask for verbal confirmation before making changes.
80
74
 
75
+ ## 🖥️ Screen Capture & Analysis
76
+
77
+ Screen capture lets you stream your screen (or a specific window) to your configured transport, and ask the agent to help with what it sees.
78
+
79
+ For example:
80
+ - *"capture my browser window"* — starts streaming that window
81
+ - *"what's causing this error?"* — the agent analyzes the screen and helps debug
82
+ - *"how does this UI look?"* — get feedback on your design
83
+
84
+ **Supported platforms:**
85
+
86
+ - **macOS** — uses ScreenCaptureKit for true window-level capture (not affected by overlapping windows)
87
+ - **Linux (X11)** — uses Xlib for window and full-screen capture
88
+
81
89
  ## 💻 MCP Client: Claude Code
82
90
 
83
91
  ### Adding the MCP server
@@ -199,11 +207,11 @@ First, install the server with the Daily dependency:
199
207
  uv tool install pipecat-ai-mcp-server[daily]
200
208
  ```
201
209
 
202
- Then, set the `DAILY_API_KEY` environment variable to your Daily API key and `DAILY_SAMPLE_ROOM_URL` to your desired Daily room URL and pass the `-d` argument to `pipecat-mcp-server`.
210
+ Then, set the `DAILY_API_KEY` environment variable to your Daily API key and `DAILY_ROOM_URL` to your desired Daily room URL and pass the `-d` argument to `pipecat-mcp-server`.
203
211
 
204
212
  ```bash
205
213
  export DAILY_API_KEY=your-daily-api-key
206
- export DAILY_SAMPLE_ROOM_URL=your-daily-room
214
+ export DAILY_ROOM_URL=your-daily-room
207
215
 
208
216
  pipecat-mcp-server -d
209
217
  ```
@@ -240,45 +248,9 @@ pipecat-mcp-server -t twilio -x your-proxy.ngrok.app
240
248
 
241
249
  Configure your provider's phone number to point to your ngrok URL, then call your number to connect.
242
250
 
243
- ## 🧪 Screen Capture (Experimental)
244
-
245
- You can enable screen capture to stream your screen (or a specific window) to the Pipecat Playground or Daily room. This lets you see what's happening on your computer remotely while having a voice conversation with the agent.
246
-
247
- First, install the server with the screen capture dependency:
248
-
249
- ```bash
250
- uv tool install "pipecat-ai-mcp-server[screen]"
251
- ```
252
-
253
- Then, define the following environment variables:
254
-
255
- | Variable | Description |
256
- |-------------------------------------|--------------------------------------------------------------------|
257
- | `PIPECAT_MCP_SERVER_SCREEN_CAPTURE` | Set to any value (e.g., `1`) to enable screen capture |
258
- | `PIPECAT_MCP_SERVER_SCREEN_WINDOW` | Optional. Window name to capture (partial match, case-insensitive) |
259
-
260
- For example, to capture your entire primary monitor:
261
-
262
- ```bash
263
- export PIPECAT_MCP_SERVER_SCREEN_CAPTURE=1
264
-
265
- pipecat-mcp-server
266
- ```
267
-
268
- And to capture a specific window:
269
-
270
- ```bash
271
- export PIPECAT_MCP_SERVER_SCREEN_CAPTURE=1
272
- export PIPECAT_MCP_SERVER_SCREEN_WINDOW="claude"
273
-
274
- pipecat-mcp-server
275
- ```
276
-
277
- > ℹ️ **Note:** Window capture is based on window coordinates, not content. If another window overlaps the target, the overlapping content will be captured. The capture region updates dynamically if the window is moved. If the specified window is not found, capture falls back to the full screen.
278
-
279
251
  ## 📚 What's Next?
280
252
 
281
- - **Customize services**: Edit `agent.py` to use different STT/TTS providers (ElevenLabs, OpenAI, etc.)
253
+ - **Customize services**: Edit `agent.py` to use different STT/TTS providers
282
254
  - **Change transport**: Configure for Twilio, WebRTC, or other transports
283
255
  - **Add to your project**: Use this as a template for voice-enabled MCP tools
284
256
  - **Learn more**: Check out [Pipecat's docs](https://docs.pipecat.ai/) for advanced features
@@ -20,10 +20,19 @@ classifiers = [
20
20
  "Topic :: Scientific/Engineering :: Artificial Intelligence"
21
21
  ]
22
22
  dependencies = [
23
+ "kokoro-onnx>=0.5.0,<1",
23
24
  "loguru>=0.7.0,<1",
24
25
  "mcp>=1.0.0",
25
- "pipecat-ai[cartesia,deepgram,local-smart-turn-v3,runner,silero,webrtc,websocket]>=0.0.100",
26
+ "pip>=25.3",
27
+ "pipecat-ai[cartesia,deepgram,local-smart-turn-v3,rnnoise,runner,silero,webrtc,websocket]>=0.0.101",
28
+ "pipecat-ai[mlx-whisper]>=0.0.100; sys_platform == 'darwin'",
29
+ "pipecat-ai[whisper]>=0.0.100; sys_platform != 'darwin'",
30
+ "pyobjc-framework-CoreMedia>=11.0; sys_platform == 'darwin'",
31
+ "pyobjc-framework-Quartz>=11.0; sys_platform == 'darwin'",
32
+ "pyobjc-framework-ScreenCaptureKit>=11.0; sys_platform == 'darwin'",
26
33
  "python-dotenv>=1.0.0,<2",
34
+ "python-xlib>=0.33; sys_platform == 'linux'",
35
+ "requests>=2.32.5,<3",
27
36
  ]
28
37
 
29
38
  [project.urls]
@@ -35,7 +44,6 @@ Changelog = "https://github.com/pipecat-ai/pipecat-mcp-server/blob/main/CHANGELO
35
44
 
36
45
  [project.optional-dependencies]
37
46
  daily = [ "daily-python~=0.23.0" ]
38
- screen = [ "mss>=10.0.0", "pywinctl>=0.4" ]
39
47
 
40
48
  [dependency-groups]
41
49
  dev = [
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: pipecat-ai-mcp-server
3
- Version: 0.0.4
3
+ Version: 0.0.11
4
4
  Summary: Pipecat MCP server for your AI agents
5
5
  License-Expression: BSD-2-Clause
6
6
  Project-URL: Homepage, https://pipecat.ai
@@ -18,15 +18,21 @@ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
18
18
  Requires-Python: >=3.10
19
19
  Description-Content-Type: text/markdown
20
20
  License-File: LICENSE
21
+ Requires-Dist: kokoro-onnx<1,>=0.5.0
21
22
  Requires-Dist: loguru<1,>=0.7.0
22
23
  Requires-Dist: mcp>=1.0.0
23
- Requires-Dist: pipecat-ai[cartesia,deepgram,local-smart-turn-v3,runner,silero,webrtc,websocket]>=0.0.100
24
+ Requires-Dist: pip>=25.3
25
+ Requires-Dist: pipecat-ai[cartesia,deepgram,local-smart-turn-v3,rnnoise,runner,silero,webrtc,websocket]>=0.0.101
26
+ Requires-Dist: pipecat-ai[mlx-whisper]>=0.0.100; sys_platform == "darwin"
27
+ Requires-Dist: pipecat-ai[whisper]>=0.0.100; sys_platform != "darwin"
28
+ Requires-Dist: pyobjc-framework-CoreMedia>=11.0; sys_platform == "darwin"
29
+ Requires-Dist: pyobjc-framework-Quartz>=11.0; sys_platform == "darwin"
30
+ Requires-Dist: pyobjc-framework-ScreenCaptureKit>=11.0; sys_platform == "darwin"
24
31
  Requires-Dist: python-dotenv<2,>=1.0.0
32
+ Requires-Dist: python-xlib>=0.33; sys_platform == "linux"
33
+ Requires-Dist: requests<3,>=2.32.5
25
34
  Provides-Extra: daily
26
35
  Requires-Dist: daily-python~=0.23.0; extra == "daily"
27
- Provides-Extra: screen
28
- Requires-Dist: mss>=10.0.0; extra == "screen"
29
- Requires-Dist: pywinctl>=0.4; extra == "screen"
30
36
  Dynamic: license-file
31
37
 
32
38
  <h1><div align="center">
@@ -39,16 +45,18 @@ Dynamic: license-file
39
45
 
40
46
  Pipecat MCP Server gives your AI agents a voice using [Pipecat](https://github.com/pipecat-ai/pipecat). It should work with any [MCP](https://modelcontextprotocol.io/)-compatible client:
41
47
 
42
- The Pipecat MCP Server exposes **voice-related tools** (`start`, `listen`, `speak`, `stop`) to MCP-compatible clients, but **it does not itself provide microphone or speaker access**.
48
+ The Pipecat MCP Server exposes **voice-related** and **screen capture** tools to MCP-compatible clients, but **it does not itself provide microphone or speaker access**.
43
49
 
44
- Audio input/output is handled by a **separate audio transport**, such as:
50
+ Audio input/output is handled by a **separate audio/video transport**, such as:
45
51
 
46
52
  - **Pipecat Playground** (local browser UI)
47
53
  - **Daily** (WebRTC room)
48
54
  - **Phone providers** (Twilio, Telnyx, etc.)
49
55
 
50
56
  > **MCP clients like Cursor, Claude Code, and Codex control the agent, but they are not audio devices.**
51
- > To hear or speak, you must also connect via one of the audio transports.
57
+ > To hear, speak, or see, you must connect via one of the audio/video transports.
58
+
59
+ <p align="center"><video src="https://github.com/user-attachments/assets/0ad14e37-2de7-46df-870a-167aa667df16" width="500" controls></video></p>
52
60
 
53
61
  ## 🧭 Getting started
54
62
 
@@ -56,9 +64,8 @@ Audio input/output is handled by a **separate audio transport**, such as:
56
64
 
57
65
  - Python 3.10 or later
58
66
  - [uv](https://docs.astral.sh/uv/getting-started/installation/) package manager
59
- - API keys for third-party services (Speech-to-Text, Text-to-Speech, ...)
60
67
 
61
- By default, the voice agent uses [Deepgram](https://deepgram.com) for speech-to-text and [Cartesia](https://cartesia.ai/) for text-to-speech.
68
+ By default, the voice agent uses local models (no API keys required): [Faster Whisper](https://github.com/SYSTRAN/faster-whisper) for speech-to-text and [Kokoro](https://github.com/hexgrad/kokoro) for text-to-speech. The Whisper models are approximately 1.5 GB and are downloaded automatically on the first connection, so the initial startup may take a moment.
62
69
 
63
70
  ### Installation
64
71
 
@@ -82,14 +89,7 @@ uv tool install -e /path/to/repo/pipecat-mcp-server
82
89
 
83
90
  ## Running the server
84
91
 
85
- First, set your API keys as environment variables:
86
-
87
- ```bash
88
- export DEEPGRAM_API_KEY=your-deepgram-key
89
- export CARTESIA_API_KEY=your-cartesia-key
90
- ```
91
-
92
- Then start the server:
92
+ Start the server:
93
93
 
94
94
  ```bash
95
95
  pipecat-mcp-server
@@ -109,6 +109,20 @@ The [Pipecat skill](.claude/skills/pipecat/SKILL.md) provides a better voice con
109
109
 
110
110
  Alternatively, just tell your agent something like `Let's have a voice conversation`. In this case, the agent won't ask for verbal confirmation before making changes.
111
111
 
112
+ ## 🖥️ Screen Capture & Analysis
113
+
114
+ Screen capture lets you stream your screen (or a specific window) to your configured transport, and ask the agent to help with what it sees.
115
+
116
+ For example:
117
+ - *"capture my browser window"* — starts streaming that window
118
+ - *"what's causing this error?"* — the agent analyzes the screen and helps debug
119
+ - *"how does this UI look?"* — get feedback on your design
120
+
121
+ **Supported platforms:**
122
+
123
+ - **macOS** — uses ScreenCaptureKit for true window-level capture (not affected by overlapping windows)
124
+ - **Linux (X11)** — uses Xlib for window and full-screen capture
125
+
112
126
  ## 💻 MCP Client: Claude Code
113
127
 
114
128
  ### Adding the MCP server
@@ -230,11 +244,11 @@ First, install the server with the Daily dependency:
230
244
  uv tool install pipecat-ai-mcp-server[daily]
231
245
  ```
232
246
 
233
- Then, set the `DAILY_API_KEY` environment variable to your Daily API key and `DAILY_SAMPLE_ROOM_URL` to your desired Daily room URL and pass the `-d` argument to `pipecat-mcp-server`.
247
+ Then, set the `DAILY_API_KEY` environment variable to your Daily API key and `DAILY_ROOM_URL` to your desired Daily room URL and pass the `-d` argument to `pipecat-mcp-server`.
234
248
 
235
249
  ```bash
236
250
  export DAILY_API_KEY=your-daily-api-key
237
- export DAILY_SAMPLE_ROOM_URL=your-daily-room
251
+ export DAILY_ROOM_URL=your-daily-room
238
252
 
239
253
  pipecat-mcp-server -d
240
254
  ```
@@ -271,45 +285,9 @@ pipecat-mcp-server -t twilio -x your-proxy.ngrok.app
271
285
 
272
286
  Configure your provider's phone number to point to your ngrok URL, then call your number to connect.
273
287
 
274
- ## 🧪 Screen Capture (Experimental)
275
-
276
- You can enable screen capture to stream your screen (or a specific window) to the Pipecat Playground or Daily room. This lets you see what's happening on your computer remotely while having a voice conversation with the agent.
277
-
278
- First, install the server with the screen capture dependency:
279
-
280
- ```bash
281
- uv tool install "pipecat-ai-mcp-server[screen]"
282
- ```
283
-
284
- Then, define the following environment variables:
285
-
286
- | Variable | Description |
287
- |-------------------------------------|--------------------------------------------------------------------|
288
- | `PIPECAT_MCP_SERVER_SCREEN_CAPTURE` | Set to any value (e.g., `1`) to enable screen capture |
289
- | `PIPECAT_MCP_SERVER_SCREEN_WINDOW` | Optional. Window name to capture (partial match, case-insensitive) |
290
-
291
- For example, to capture your entire primary monitor:
292
-
293
- ```bash
294
- export PIPECAT_MCP_SERVER_SCREEN_CAPTURE=1
295
-
296
- pipecat-mcp-server
297
- ```
298
-
299
- And to capture a specific window:
300
-
301
- ```bash
302
- export PIPECAT_MCP_SERVER_SCREEN_CAPTURE=1
303
- export PIPECAT_MCP_SERVER_SCREEN_WINDOW="claude"
304
-
305
- pipecat-mcp-server
306
- ```
307
-
308
- > ℹ️ **Note:** Window capture is based on window coordinates, not content. If another window overlaps the target, the overlapping content will be captured. The capture region updates dynamically if the window is moved. If the specified window is not found, capture falls back to the full screen.
309
-
310
288
  ## 📚 What's Next?
311
289
 
312
- - **Customize services**: Edit `agent.py` to use different STT/TTS providers (ElevenLabs, OpenAI, etc.)
290
+ - **Customize services**: Edit `agent.py` to use different STT/TTS providers
313
291
  - **Change transport**: Configure for Twilio, WebRTC, or other transports
314
292
  - **Add to your project**: Use this as a template for voice-enabled MCP tools
315
293
  - **Learn more**: Check out [Pipecat's docs](https://docs.pipecat.ai/) for advanced features