@p8n.ai/pi-listens 0.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CHANGELOG.md +27 -0
- package/LICENSE +21 -0
- package/README.md +187 -0
- package/package.json +70 -0
- package/skills/pi-listens/SKILL.md +33 -0
- package/src/audio.ts +361 -0
- package/src/commands.ts +286 -0
- package/src/config.ts +182 -0
- package/src/index.ts +84 -0
- package/src/sarvam.ts +311 -0
- package/src/text.ts +33 -0
- package/src/tools.ts +252 -0
- package/src/voice-ui.ts +350 -0
package/CHANGELOG.md
ADDED
@@ -0,0 +1,27 @@
# Changelog

All notable changes to `@p8n.ai/pi-listens` will be documented in this file.

This project follows [Semantic Versioning](https://semver.org/).

## [Unreleased]

## [0.1.0] - 2026-05-09

### Added

- Initial release of `@p8n.ai/pi-listens` for Pi.
- Sarvam AI speech-to-text tools for microphone input and audio file transcription.
- Sarvam AI text-to-speech tools for spoken output and spoken clarification loops.
- `/listen`, `/speak`, `/voice-on`, and `/voice-status` slash commands.
- Interactive voice panel with listen, auto-listen, read-aloud, and close controls.
- Config support through environment variables, user config, and project config.
- Global config at `~/.pi/pi-listens.json`, with project-level overrides from `<project>/.pi/pi-listens.json`.

### Fixed

- Stop active audio capture/playback subprocesses when voice mode is closed or the Pi session shuts down.
- Clean up generated audio files when spoken playback is interrupted.

[Unreleased]: https://github.com/p8n-ai/pi-listens/compare/v0.1.0...HEAD
[0.1.0]: https://github.com/p8n-ai/pi-listens/releases/tag/v0.1.0
package/LICENSE
ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 Ravindra Barthwal

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
package/README.md
ADDED
@@ -0,0 +1,187 @@
# @p8n.ai/pi-listens

Speech-first Pi package powered by [Sarvam AI](https://www.sarvam.ai/). It gives Pi tools and commands for:

- streaming speech-to-text (STT) with Sarvam Saaras (`saaras:v3`) over WebSockets
- text-to-speech (TTS) with Sarvam Bulbul (`bulbul:v3`)
- voice-first clarification loops where the agent speaks a question, listens, transcribes, and continues
- interactive TUI and headless/RPC usage through Pi extension tools and UI fallback

## Install

```bash
pi install npm:@p8n.ai/pi-listens
export SARVAM_API_KEY="your-sarvam-api-key"
pi
```

For local development from this checkout:

```bash
npm install
npm run typecheck
pi -e /Users/ravindrabarthwal/Projects/pi-listens
```

## System requirements

### Sarvam API key

Set one of:

```bash
export SARVAM_API_KEY="..."
# or
export SARVAM_API_SUBSCRIPTION_KEY="..."
# or
export PI_LISTENS_SARVAM_API_KEY="..."
```

Sarvam's SDK uses the `api-subscription-key` auth model internally; this package uses the official `sarvamai` npm package.
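
The key lookup can be pictured as a small precedence chain. This is an illustrative sketch, not the package's actual code; in particular, the precedence shown (the `PI_LISTENS_`-prefixed variable winning over the generic ones) is an assumption:

```typescript
// Hypothetical sketch of API key resolution. Only the three environment
// variable names come from the docs above; the function and its precedence
// order are illustrative. Pass `process.env` as the argument in practice.
function resolveSarvamApiKey(
  env: Record<string, string | undefined>,
): string | undefined {
  // Assumed precedence: package-specific variable first, then the generic ones.
  return (
    env.PI_LISTENS_SARVAM_API_KEY ??
    env.SARVAM_API_KEY ??
    env.SARVAM_API_SUBSCRIPTION_KEY
  );
}
```

Whichever variable resolves first is what gets handed to the `sarvamai` client as its subscription key.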

### Local microphone recorder and audio player

`pi-listens` records from the local microphone and plays audio locally.

Auto-detected recorders:

1. `rec` from SoX (recommended)
2. `ffmpeg` (`avfoundation` on macOS, `alsa` on Linux)

Auto-detected players:

1. `afplay` on macOS
2. `play` from SoX
3. `ffplay`
4. `aplay`

You can override capture/playback with command templates:

```bash
export PI_LISTENS_RECORD_COMMAND='rec -q -r {sampleRate} -c 1 -b 16 {path} trim 0 {seconds}'
export PI_LISTENS_STREAM_COMMAND='rec -q -r {sampleRate} -c 1 -b 16 -e signed-integer -t raw -'
export PI_LISTENS_PLAY_COMMAND='afplay {path}'
```

Template variables are shell-quoted automatically. Recording templates support `{path}`, `{seconds}`, and `{sampleRate}`. Streaming templates write 16-bit mono PCM to stdout and support `{sampleRate}`.
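
For intuition, such a template expansion might look like the following sketch (hypothetical, not the package's actual implementation; the exact quoting strategy is an assumption):

```typescript
// Hypothetical sketch: expand a {var} command template into a shell string,
// single-quoting each substituted value. Not the package's actual code.
function expandTemplate(
  template: string,
  vars: Record<string, string | number>,
): string {
  return template.replace(/\{(\w+)\}/g, (match, name) => {
    const value = vars[name];
    if (value === undefined) return match; // leave unknown placeholders alone
    // POSIX-safe single quoting: close the quote, emit an escaped quote, reopen.
    return `'${String(value).replace(/'/g, `'\\''`)}'`;
  });
}

const cmd = expandTemplate(
  "rec -q -r {sampleRate} -c 1 -b 16 {path} trim 0 {seconds}",
  { sampleRate: 16000, path: "/tmp/pi listens.wav", seconds: 300 },
);
// → rec -q -r '16000' -c 1 -b 16 '/tmp/pi listens.wav' trim 0 '300'
```

Quoting each substituted value means a path containing spaces (as above) still lands in the recorder as a single argument.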

## Agent tools

The package registers these tools for Pi's agent:

| Tool | Purpose |
| --- | --- |
| `voice_output` | Speak short user-facing text via Sarvam TTS and local playback. |
| `voice_input` | Stream microphone audio over Sarvam WebSocket STT. |
| `voice_ask` | Speak a concise question, then listen and transcribe the user's answer. |
| `voice_transcribe_file` | Transcribe an existing audio file. |
| `voice_setup_check` | Check API key, recorder, player, and model configuration. |

The extension also injects voice guidance into the system prompt:

- use `voice_ask` whenever user input is needed in voice-first sessions
- use `voice_output` for short spoken status or response snippets
- do not speak code blocks, logs, diffs, stack traces, or long explanations
- keep spoken questions concise and answerable in a short response

## Commands

| Command | Purpose |
| --- | --- |
| `/listen [seconds]` | Stream one utterance over Sarvam WebSocket STT, wait for a sustained silence boundary, transcribe, and send the result to Pi as a user message. |
| `/speak <text>` | Speak text with Sarvam TTS. |
| `/voice-on [--speak] [--manual] [--no-listen] [seconds]` | Open the hands-free TUI panel. By default it listens immediately and auto-listens again after each agent response. `--speak` reads short assistant replies aloud. `--manual` keeps the panel active but only listens when you press R. `--no-listen` skips the initial listen. |
| `/voice-on --no-speak` | Open the panel without auto-reading assistant replies. |
| `/voice-status` | Show setup and voice-mode status. |

Voice panel controls in interactive mode:

- R: listen now; press again while listening to stop
- A: toggle auto-listen (listen again after each assistant reply)
- S: toggle read-aloud (speak assistant replies)
- Q: close the panel (stopping any active listen first)
- Click the orb: visual ripple feedback (in terminals with mouse reporting)

## Headless/RPC behavior

Pi extension tools work in both interactive TUI and headless/RPC modes.

- Audio capture and playback still happen on the machine running Pi.
- When speech is not recognized, `voice_input` and `voice_ask` fall back to Pi's extension UI text input if a UI is available.
- In RPC mode that fallback becomes an `extension_ui_request` (`input`) event, so a client can provide textual input.
- In print/JSON modes, UI fallback is unavailable; the tool returns the empty transcription so the agent can recover.

## Configuration

Configuration is resolved in this order, with later entries overriding earlier ones:

1. defaults
2. `~/.pi/agent/pi-listens.json` (legacy global path, still supported)
3. `~/.pi/pi-listens.json` (global user config)
4. `<project>/.pi/pi-listens.json` (project config)
5. environment variables

Project config overrides global config, and environment variables override both.
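
The layering above can be sketched as a left-to-right object merge. This is illustrative only: the shallow-merge behavior (and the helper itself) is an assumption, not the package's actual code.

```typescript
// Hypothetical sketch of layered config resolution. Later layers win.
// A shallow merge is assumed; the real package may merge differently.
type VoiceConfig = Record<string, unknown>;

function resolveConfig(layers: Array<VoiceConfig | undefined>): VoiceConfig {
  // Spread-style shallow merge: defaults first, environment variables last.
  return layers.reduce<VoiceConfig>(
    (merged, layer) => ({ ...merged, ...(layer ?? {}) }),
    {},
  );
}

const resolved = resolveConfig([
  { ttsSpeaker: "shubh", recordSeconds: 300 }, // defaults
  undefined,                                   // legacy global config (absent)
  { ttsSpeaker: "anushka" },                   // ~/.pi/pi-listens.json
  { recordSeconds: 120 },                      // <project>/.pi/pi-listens.json
  {},                                          // environment variables
]);
// resolved: { ttsSpeaker: "anushka", recordSeconds: 120 }
```

A missing layer (here the legacy global file) simply contributes nothing, so each key ends up with the value from the last layer that set it.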

Example config file:

```json
{
  "sttModel": "saaras:v3",
  "sttMode": "transcribe",
  "sttLanguageCode": "unknown",
  "translateInputToEnglish": true,
  "ttsModel": "bulbul:v3",
  "ttsLanguageCode": "en-IN",
  "ttsSpeaker": "shubh",
  "recordSeconds": 300,
  "recordSampleRate": 16000,
  "streamChunkMs": 250,
  "streamMaxSeconds": 300,
  "silenceStartSeconds": 0.2,
  "silenceStopSeconds": 3.5,
  "silenceThreshold": "1%",
  "ttsSampleRate": 24000,
  "ttsOutputCodec": "wav",
  "textFallback": true,
  "autoSpeakAssistant": false,
  "maxAutoSpeakChars": 900
}
```

Supported environment variables:

- `SARVAM_API_KEY` / `SARVAM_API_SUBSCRIPTION_KEY` / `PI_LISTENS_SARVAM_API_KEY`
- `PI_LISTENS_STT_MODEL`
- `PI_LISTENS_STT_MODE` (`transcribe`, `translate`, `verbatim`, `translit`, `codemix`)
- `PI_LISTENS_STT_LANGUAGE` (default `unknown`)
- `PI_LISTENS_TRANSLATE_INPUT_TO_ENGLISH` (default `true`; speak any supported language, send English to the agent)
- `PI_LISTENS_TTS_MODEL`
- `PI_LISTENS_TTS_LANGUAGE` (default `en-IN`)
- `PI_LISTENS_TTS_SPEAKER` (default `shubh`)
- `PI_LISTENS_TTS_PACE`
- `PI_LISTENS_TTS_TEMPERATURE`
- `PI_LISTENS_TTS_SAMPLE_RATE`
- `PI_LISTENS_TTS_OUTPUT_CODEC` (`wav`, `mp3`, `linear16`, `mulaw`, `alaw`, `opus`, `flac`, `aac`)
- `PI_LISTENS_RECORD_SECONDS` (default `300`; maximum listen duration for one streaming utterance)
- `PI_LISTENS_RECORD_SAMPLE_RATE` (default `16000`; Sarvam streaming works best with 16kHz mono PCM)
- `PI_LISTENS_STREAM_CHUNK_MS` (default `250`; outgoing WebSocket audio chunk size)
- `PI_LISTENS_STREAM_MAX_SECONDS` (default `300`; maximum duration for streaming microphone capture)
- `PI_LISTENS_SILENCE_START_SECONDS`
- `PI_LISTENS_SILENCE_STOP_SECONDS`
- `PI_LISTENS_SILENCE_THRESHOLD`
- `PI_LISTENS_RECORD_COMMAND`
- `PI_LISTENS_PLAY_COMMAND`
- `PI_LISTENS_AUDIO_DIR`
- `PI_LISTENS_DELETE_AUDIO`
- `PI_LISTENS_TEXT_FALLBACK`
- `PI_LISTENS_AUTO_SPEAK`
- `PI_LISTENS_MAX_AUTO_SPEAK_CHARS`

`recordSeconds` is the maximum time Pi will keep streaming one utterance. `silenceStopSeconds` is the quiet pause after which it considers the utterance complete, flushes the WebSocket, and submits the transcript. For example, `recordSeconds: 300` and `silenceStopSeconds: 3.5` means “let me speak for up to 5 minutes, but submit after 3.5 seconds of silence.”
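
For intuition, here is how the silence settings could map onto SoX's `silence` effect if you build your own `PI_LISTENS_RECORD_COMMAND`. This is a hedged sketch of one plausible mapping, not necessarily how the package invokes SoX internally:

```typescript
// Hypothetical sketch: build a SoX `rec` argv that starts capturing once
// ~0.2s of sound crosses the 1% amplitude threshold and stops after 3.5s
// below it. Values mirror the silence* config keys; the package's own
// invocation may differ.
const cfg = {
  recordSampleRate: 16000,
  recordSeconds: 300,
  silenceStartSeconds: 0.2,
  silenceStopSeconds: 3.5,
  silenceThreshold: "1%",
};

const recArgs = [
  "-q", "-r", String(cfg.recordSampleRate), "-c", "1", "-b", "16",
  "utterance.wav",
  "trim", "0", String(cfg.recordSeconds), // hard cap on utterance length
  // silence <above-periods> <duration> <threshold> <below-periods> <duration> <threshold>
  "silence", "1", String(cfg.silenceStartSeconds), cfg.silenceThreshold,
  "1", String(cfg.silenceStopSeconds), cfg.silenceThreshold,
];

const recCommand = ["rec", ...recArgs].join(" ");
```

Raising `silenceStopSeconds` gives you longer pauses mid-sentence before the utterance is considered finished; lowering it makes submission snappier.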

## Notes

- Sarvam STT uses the WebSocket streaming API for microphone input, not the 30-second synchronous REST endpoint.
- Streaming input is sent as 16kHz, 16-bit, mono PCM (`pcm_s16le`) with `saaras:v3` by default.
- macOS may ask for microphone permission the first time `rec` or `ffmpeg` records audio.
- Spoken output is intentionally optimized for concise interaction, not for reading code or full agent responses.
package/package.json
ADDED
@@ -0,0 +1,70 @@
{
  "name": "@p8n.ai/pi-listens",
  "version": "0.1.0",
  "description": "Pi package for speech-first interaction using Sarvam AI speech-to-text and text-to-speech.",
  "author": "Ravindra Barthwal",
  "license": "MIT",
  "type": "module",
  "repository": {
    "type": "git",
    "url": "git+https://github.com/p8n-ai/pi-listens.git"
  },
  "homepage": "https://github.com/p8n-ai/pi-listens#readme",
  "bugs": {
    "url": "https://github.com/p8n-ai/pi-listens/issues"
  },
  "keywords": [
    "pi-package",
    "pi",
    "pi-coding-agent",
    "speech-to-text",
    "text-to-speech",
    "voice",
    "sarvam",
    "sarvam-ai",
    "agents"
  ],
  "files": [
    "src/",
    "skills/",
    "README.md",
    "LICENSE",
    "CHANGELOG.md"
  ],
  "scripts": {
    "typecheck": "tsc --noEmit",
    "test": "npm run typecheck"
  },
  "pi": {
    "extensions": [
      "./src/index.ts"
    ],
    "skills": [
      "./skills"
    ]
  },
  "dependencies": {
    "sarvamai": "^1.1.7"
  },
  "peerDependencies": {
    "@earendil-works/pi-ai": "*",
    "@earendil-works/pi-coding-agent": "*",
    "@earendil-works/pi-tui": "*",
    "typebox": "*"
  },
  "devDependencies": {
    "@earendil-works/pi-ai": "^0.74.0",
    "@earendil-works/pi-coding-agent": "^0.74.0",
    "@earendil-works/pi-tui": "^0.74.0",
    "@types/node": "^24.0.0",
    "tsx": "^4.21.0",
    "typebox": "^1.1.38",
    "typescript": "^5.8.0"
  },
  "publishConfig": {
    "access": "public"
  },
  "engines": {
    "node": ">=20"
  }
}
package/skills/pi-listens/SKILL.md
ADDED
@@ -0,0 +1,33 @@
---
name: pi-listens
description: Use when interacting with the user by voice through @p8n.ai/pi-listens, Sarvam AI speech-to-text, or Sarvam AI text-to-speech. Applies when the user says they are speaking, wants voice input/output, asks Pi to listen, or when clarification should be gathered by voice.
---

# Pi Listens Voice Interaction

This Pi package provides voice tools backed by Sarvam AI.

## Tools

- `voice_output`: speak a short message to the user with Sarvam TTS.
- `voice_input`: listen to the microphone and transcribe the user's speech.
- `voice_ask`: speak a concise question, then listen and transcribe the answer.
- `voice_transcribe_file`: transcribe an existing audio file.
- `voice_setup_check`: diagnose API key, recorder, player, and voice settings.

## Usage rules

1. When you need user input, clarification, or confirmation, use `voice_ask` instead of asking only in text.
2. Before using `voice_input`, make sure the user already knows you are listening. If not, use `voice_ask`.
3. Use `voice_output` for concise spoken status updates or spoken summaries that matter to the user.
4. Do not speak code blocks, diffs, stack traces, logs, long tables, or lengthy explanations. Summarize briefly and leave details in text.
5. Treat transcripts returned by `voice_input` or `voice_ask` as user input, while allowing for speech-recognition mistakes. If the transcript is ambiguous, ask a short follow-up with `voice_ask`.
6. If speech is not recognized, rely on the tool's text fallback when available, or ask again with a shorter prompt.

## Good voice question style

- Ask one thing at a time.
- Keep questions under one sentence where possible.
- Offer clear options if the answer space is constrained.
- Prefer: "Which option should I use: A, B, or C?"
- Avoid: long multi-part questions or reading implementation details aloud.