@telnyx/voice-agent-tester 0.2.3 → 0.4.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.agent/workflows/ralph-loop.md +62 -0
- package/.gemini/skills/ralph-loop/SKILL.md +240 -0
- package/CHANGELOG.md +13 -0
- package/README.md +192 -42
- package/{benchmarks/applications → applications}/elevenlabs.yaml +5 -2
- package/applications/livetok.yaml +16 -0
- package/{benchmarks/applications → applications}/telnyx.yaml +1 -1
- package/applications/vapi.yaml +19 -0
- package/assets/appointment_data_with_noise.mp3 +0 -0
- package/assets/hello_make_an_appointment_with_noise.mp3 +0 -0
- package/javascript/audio_input_hooks.js +104 -0
- package/package.json +1 -1
- package/{benchmarks/scenarios → scenarios}/appointment.yaml +0 -2
- package/scenarios/appointment_with_noise.yaml +17 -0
- package/src/index.js +397 -176
- package/src/provider-import.js +61 -22
- package/src/report.js +177 -0
- package/src/voice-agent-tester.js +53 -3
- package/assets/confirmation.mp3 +0 -0
- package/assets/greet_me_angry.mp3 +0 -0
- package/assets/name_lebron_james.mp3 +0 -0
- package/assets/tell_me_joke_laugh.mp3 +0 -0
- package/assets/tell_me_something_funny.mp3 +0 -0
- package/assets/tell_me_something_sad.mp3 +0 -0
- package/benchmarks/applications/vapi.yaml +0 -10
package/.agent/workflows/ralph-loop.md
ADDED
@@ -0,0 +1,62 @@
---
description: Ralph Loop - Iterative AI development with persistent iteration until task completion
---

# Ralph Loop Workflow

This workflow implements the Ralph Loop (Ralph Wiggum) technique for iterative, autonomous coding.

## Usage

Invoke with: `/ralph-loop <task description>`

Or provide detailed options:
```
/ralph-loop "Build feature X" --max-iterations 30 --completion-promise "COMPLETE"
```

## Workflow Steps

1. **Read the Ralph Loop skill instructions**
   - View the skill file at `.gemini/skills/ralph-loop/SKILL.md`
   - Understand the iteration pattern and best practices

2. **Parse the user's task**
   - Identify the main objective
   - Extract success criteria
   - Set max iterations (default: 30)
   - Set completion promise (default: "COMPLETE")

3. **Enter the loop**
   - Execute the task iteratively
   - Self-correct on failures
   - Track progress
   - Continue until success criteria met or max iterations reached

4. **Report completion**
   - Summarize accomplishments
   - Output the completion promise
   - List any remaining issues

## Quick Commands

- **Start a loop**: `/ralph-loop "Your task here"`
- **Cancel loop**: Say "stop", "cancel", or "abort"
- **Check skill docs**: View `.gemini/skills/ralph-loop/SKILL.md`

## Examples

### Feature Implementation
```
/ralph-loop "Implement user authentication with JWT tokens. Requirements: login/logout endpoints, password hashing, token refresh. Tests must pass."
```

### Bug Fix
```
/ralph-loop "Fix the 404 error when importing VAPI assistants. Add retry logic with exponential backoff."
```

### Refactoring
```
/ralph-loop "Refactor the CLI options to be more provider-agnostic. All existing tests must pass."
```
package/.gemini/skills/ralph-loop/SKILL.md
ADDED
@@ -0,0 +1,240 @@
---
name: ralph-loop
description: Ralph Loop - AI Loop Technique for iterative, autonomous coding. Implements persistent iteration until task completion with self-correction patterns.
---

# Ralph Loop - AI Loop Technique

The Ralph Loop (also known as "Ralph Wiggum") is an iterative AI development methodology. It embodies the philosophy of **persistent iteration despite setbacks**.

## Core Philosophy

1. **Iteration > Perfection**: Don't aim for perfect on the first try. Let the loop refine the work.
2. **Failures Are Data**: "Deterministically bad" means failures are predictable and informative.
3. **Operator Skill Matters**: Success depends on writing good prompts, not just having a good model.
4. **Persistence Wins**: Keep trying until success. Handle retry logic automatically.

---

## How to Use This Skill

When the user invokes this skill (e.g., `/ralph-loop` or asks for iterative development), follow these instructions:

### Step 1: Understand the Task

Parse the user's request and identify:
- **The main objective** - What needs to be built/fixed/refactored
- **Success criteria** - How to know when it's complete
- **Max iterations** - Safety limit (default: 30)
- **Completion promise** - The signal word (default: "COMPLETE")

### Step 2: Enter the Ralph Loop

Execute the following loop pattern:

```
ITERATION = 1
MAX_ITERATIONS = [specified or 30]
COMPLETION_PROMISE = [specified or "COMPLETE"]

WHILE (ITERATION <= MAX_ITERATIONS) AND (NOT COMPLETED):
  1. Assess current state
  2. Identify next step toward goal
  3. Execute the step (write code, run tests, fix bugs, etc.)
  4. Evaluate results
  5. If success criteria met → output COMPLETION_PROMISE → EXIT LOOP
  6. If not complete → increment ITERATION → CONTINUE
  7. If blocked → document issue → try alternative approach
END WHILE

IF MAX_ITERATIONS reached without completion:
  - Document what was accomplished
  - List blocking issues
  - Suggest next steps
```
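The loop pattern above maps directly onto a bounded `for` loop. A minimal sketch in plain JavaScript (the `runIteration` callback and all names here are illustrative, not part of this package):

```javascript
// Sketch of the Ralph Loop control flow: run up to maxIterations units of
// work, stop early when the success criteria are met, and emit the
// completion promise only on success.
function ralphLoop(runIteration, { maxIterations = 30, completionPromise = "COMPLETE" } = {}) {
  for (let iteration = 1; iteration <= maxIterations; iteration++) {
    // One unit of work; returns { done: boolean } (self-correction happens inside).
    const result = runIteration(iteration);
    if (result.done) {
      // Success criteria met → output the promise tag and exit the loop.
      return { iterations: iteration, output: `<promise>${completionPromise}</promise>` };
    }
  }
  // Max iterations reached without completion: report instead of promising.
  return { iterations: maxIterations, output: "max iterations reached without completion" };
}
```

The key property of the pattern is that the promise string is only ever emitted on the success path, so a supervising harness can safely treat its presence as a completion signal.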

### Step 3: Self-Correction Pattern

During each iteration, follow this TDD-inspired pattern:

1. **Plan** - Identify what needs to happen next
2. **Execute** - Make the change (code, config, etc.)
3. **Verify** - Run tests, check results, validate
4. **If failing** - Debug and fix in the same iteration if possible
5. **If passing** - Move to next requirement
6. **Refactor** - Clean up if needed before proceeding

### Step 4: Report Progress

After each significant iteration, briefly report:
- Current iteration number
- What was attempted
- Result (success/failure/partial)
- Next step

### Step 5: Completion

When all success criteria are met:
1. Summarize what was accomplished
2. List any tests/validations that passed
3. Output the completion promise: `<promise>COMPLETE</promise>`
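Because the promise is a fixed tag, a supervising harness can detect completion with a single regex over the agent's output. A sketch (the function name is illustrative; only the `<promise>…</promise>` tag format comes from this skill):

```javascript
// Returns the promised word if the text contains a <promise>…</promise> tag,
// or null if the loop has not declared completion yet.
function extractPromise(text) {
  const match = /<promise>([^<]+)<\/promise>/.exec(text);
  return match ? match[1] : null;
}
```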

---

## Prompt Templates

### Feature Implementation

```
Implement [FEATURE_NAME].

Requirements:
- [Requirement 1]
- [Requirement 2]
- [Requirement 3]

Success criteria:
- All requirements implemented
- Tests passing with >80% coverage
- No linter errors
- Documentation updated

Output <promise>COMPLETE</promise> when done.
```

### TDD Development

```
Implement [FEATURE] using TDD.

Process:
1. Write failing test for next requirement
2. Implement minimal code to pass
3. Run tests
4. If failing, fix and retry
5. Refactor if needed
6. Repeat for all requirements

Requirements: [LIST]

Output <promise>DONE</promise> when all tests green.
```

### Bug Fixing

```
Fix bug: [DESCRIPTION]

Steps:
1. Reproduce the bug
2. Identify root cause
3. Implement fix
4. Write regression test
5. Verify fix works
6. Check no new issues introduced

After 15 iterations if not fixed:
- Document blocking issues
- List attempted approaches
- Suggest alternatives

Output <promise>FIXED</promise> when resolved.
```

### Refactoring

```
Refactor [COMPONENT] for [GOAL].

Constraints:
- All existing tests must pass
- No behavior changes
- Incremental commits

Checklist:
- [ ] Tests passing before start
- [ ] Apply refactoring step
- [ ] Tests still passing
- [ ] Repeat until done

Output <promise>REFACTORED</promise> when complete.
```

---

## Advanced Patterns

### Multi-Phase Development

For complex projects, chain multiple loops:

```
Phase 1: Core implementation → <promise>PHASE1_DONE</promise>
Phase 2: API layer → <promise>PHASE2_DONE</promise>
Phase 3: Frontend → <promise>PHASE3_DONE</promise>
```

### Incremental Goals

Break large tasks into phases:

```
Phase 1: User authentication (JWT, tests)
Phase 2: Product catalog (list/search, tests)
Phase 3: Shopping cart (add/remove, tests)

Output <promise>COMPLETE</promise> when all phases done.
```

---

## Best Practices for Writing Prompts

### ❌ Bad Prompt
```
Build a todo API and make it good.
```

### ✅ Good Prompt
```
Build a REST API for todos.

When complete:
- All CRUD endpoints working
- Input validation in place
- Tests passing (coverage > 80%)
- README with API docs

Output: <promise>COMPLETE</promise>
```

---

## When to Use Ralph Loop

### ✅ Good For:
- Feature implementation with clear requirements
- Bug fixing with reproducible issues
- Refactoring with existing test coverage
- TDD-style development
- Tasks that benefit from iteration

### ❌ Not Good For:
- Exploratory research without clear goals
- Tasks requiring human judgment at each step
- Real-time interactive sessions
- Tasks with no verifiable success criteria

---

## Cancellation

The user can cancel the loop at any time by:
- Saying "stop", "cancel", or "abort"
- Providing new instructions that supersede the current task

---

## Attribution

Based on the Ralph Wiggum technique from [Awesome Claude](https://awesomeclaude.ai/ralph-wiggum) and the official Claude plugins marketplace (`ralph-loop@claude-plugins-official`).
package/CHANGELOG.md
CHANGED
@@ -1,5 +1,18 @@
 # Changelog

+## [0.4.0](https://github.com/team-telnyx/voice-agent-tester/compare/v0.3.0...v0.4.0) (2026-01-26)
+
+### Features
+
+* add audio input from URL for benchmark runs ([c347de8](https://github.com/team-telnyx/voice-agent-tester/commit/c347de83b8318827bac098bff4328502908ee981))
+* add background noise benchmark with pre-mixed audio files ([9f64179](https://github.com/team-telnyx/voice-agent-tester/commit/9f6417936514451270c4d1bc929771446c366b08))
+
+## [0.3.0](https://github.com/team-telnyx/voice-agent-tester/compare/v0.2.3...v0.3.0) (2026-01-23)
+
+### Features
+
+* add comparison benchmark mode for provider imports ([a6de0f4](https://github.com/team-telnyx/voice-agent-tester/commit/a6de0f43e8cfd469ddfcd031c0c05a002662e30a))
+
 ## [0.2.3](https://github.com/team-telnyx/voice-agent-tester/compare/v0.2.2...v0.2.3) (2026-01-21)

 ### Features
package/README.md
CHANGED
@@ -3,81 +3,231 @@
 [](https://github.com/team-telnyx/voice-agent-tester/actions/workflows/ci.yml)
 [](https://www.npmjs.com/package/@telnyx/voice-agent-tester)

-A CLI tool for automated benchmarking and testing of voice AI agents. Supports Telnyx, ElevenLabs, and
+A CLI tool for automated benchmarking and testing of voice AI agents. Supports Telnyx, ElevenLabs, Vapi, and Retell.
+
+## Quick Start
+
+Run directly with npx (no installation required):

 ```bash
+npx @telnyx/voice-agent-tester@latest -a applications/telnyx.yaml -s scenarios/appointment.yaml --assistant-id <YOUR_ASSISTANT_ID>
 ```

+Or install globally:

 ```bash
+npm install -g @telnyx/voice-agent-tester
+voice-agent-tester -a applications/telnyx.yaml -s scenarios/appointment.yaml --assistant-id <YOUR_ASSISTANT_ID>
 ```

-The CLI includes bundled application and scenario configs that you can use directly.
-
 ## CLI Options

-| Option | Description |
-| `-a, --applications` | Application config path(s)
-| `-s, --scenarios` | Scenario config path(s)
-| `--assistant-id` | Telnyx assistant ID |
+| Option | Default | Description |
+|--------|---------|-------------|
+| `-a, --applications` | required | Application config path(s) or folder |
+| `-s, --scenarios` | required | Scenario config path(s) or folder |
+| `--assistant-id` | | Telnyx or provider assistant ID |
+| `--api-key` | | Telnyx API key for authentication |
+| `--provider` | | Import from provider (`vapi`, `elevenlabs`, `retell`) |
+| `--provider-api-key` | | External provider API key (required with `--provider`) |
+| `--provider-import-id` | | Provider assistant ID to import (required with `--provider`) |
+| `--compare` | `true` | Run both provider direct and Telnyx import benchmarks |
+| `--no-compare` | | Disable comparison (run only Telnyx import) |
+| `-d, --debug` | `false` | Enable detailed timeout diagnostics |
+| `-v, --verbose` | `false` | Show browser console logs |
+| `--headless` | `true` | Run browser in headless mode |
+| `--repeat` | `1` | Number of repetitions per combination |
+| `-c, --concurrency` | `1` | Number of parallel tests |
+| `-r, --report` | | Generate CSV report to specified file |
+| `-p, --params` | | URL template params (e.g., `key=value,key2=value2`) |
+| `--application-tags` | | Filter applications by comma-separated tags |
+| `--scenario-tags` | | Filter scenarios by comma-separated tags |
+| `--assets-server` | `http://localhost:3333` | Assets server URL |
+| `--audio-url` | | URL to audio file to play as input during the entire benchmark |
+| `--audio-volume` | `1.0` | Volume level for audio input (0.0 to 1.0) |
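The `-p, --params` option takes `key=value,key2=value2` pairs, and the application configs use `{{placeholder}}` URL templates such as `agent_id={{assistantId}}`. A minimal sketch of how such params could be parsed and substituted (illustrative only, not the package's actual implementation):

```javascript
// Parse a "key=value,key2=value2" string into an object.
function parseParams(raw) {
  return Object.fromEntries(
    raw.split(",").filter(Boolean).map((pair) => {
      const [key, ...rest] = pair.split("=");
      return [key.trim(), rest.join("=")]; // tolerate "=" inside the value
    })
  );
}

// Fill {{placeholder}} slots in an application URL template.
function fillTemplate(url, params) {
  return url.replace(/\{\{(\w+)\}\}/g, (_, name) => params[name] ?? "");
}
```

Unknown placeholders resolve to an empty string in this sketch; a real implementation might prefer to fail loudly instead.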

 ## Bundled Configs

+| Application Config | Provider |
+|-------------------|----------|
+| `applications/telnyx.yaml` | Telnyx AI Widget |
+| `applications/elevenlabs.yaml` | ElevenLabs |
+| `applications/vapi.yaml` | Vapi |
+| `applications/retell.yaml` | Retell |
+| `applications/livetok.yaml` | Livetok |
+
+Scenarios:
+- `scenarios/appointment.yaml` - Basic appointment booking test
+- `scenarios/appointment_with_noise.yaml` - Appointment with background noise (pre-mixed audio)
+
+## Background Noise Testing
+
+Test voice agents' performance with ambient noise (e.g., crowd chatter, cafe environment). Background noise is pre-mixed into audio files to simulate real-world conditions where users speak to voice agents in noisy environments.
+
+### Running with Background Noise
+
+```bash
+# Telnyx with background noise
+npx @telnyx/voice-agent-tester@latest \
+  -a applications/telnyx.yaml \
+  -s scenarios/appointment_with_noise.yaml \
+  --assistant-id <YOUR_ASSISTANT_ID>
+
+# Compare with no noise (same assistant)
+npx @telnyx/voice-agent-tester@latest \
+  -a applications/telnyx.yaml \
+  -s scenarios/appointment.yaml \
+  --assistant-id <YOUR_ASSISTANT_ID>
+
+# Generate CSV report with metrics
+npx @telnyx/voice-agent-tester@latest \
+  -a applications/telnyx.yaml \
+  -s scenarios/appointment_with_noise.yaml \
+  --assistant-id <YOUR_ASSISTANT_ID> \
+  -r output/noise_benchmark.csv
+```
+
+### Custom Audio Input from URL
+
+Play any audio file from a URL as input throughout the entire benchmark run. The audio is sent to the voice agent as microphone input.
+
+```bash
+# Use custom audio input from URL
+npx @telnyx/voice-agent-tester@latest \
+  -a applications/telnyx.yaml \
+  -s scenarios/appointment.yaml \
+  --assistant-id <YOUR_ASSISTANT_ID> \
+  --audio-url "https://example.com/test-audio.mp3" \
+  --audio-volume 0.8
+```
+
+This is useful for:
+- Testing with custom audio inputs
+- Using longer audio tracks that play throughout the benchmark
+- A/B testing different audio sources
+
+### Bundled Audio Files
+
+| File | Description |
+|------|-------------|
+| `hello_make_an_appointment.mp3` | Clean appointment request |
+| `hello_make_an_appointment_with_noise.mp3` | Appointment request with crowd noise |
+| `appointment_data.mp3` | Clean appointment details |
+| `appointment_data_with_noise.mp3` | Appointment details with crowd noise |
+
+### Scenario Configuration
+
+The noise scenario uses pre-mixed audio files:
+
+```yaml
+# scenarios/appointment_with_noise.yaml
+tags:
+  - default
+  - noise
+steps:
+  - action: wait_for_voice
+  - action: wait_for_silence
+  - action: sleep
+    time: 1000
+  - action: speak
+    file: hello_make_an_appointment_with_noise.mp3
+  - action: wait_for_voice
+    metrics: elapsed_time
+  - action: wait_for_silence
+  - action: speak
+    file: appointment_data_with_noise.mp3
+  - action: wait_for_voice
+    metrics: elapsed_time
+```
+
+### Metrics and Reports
+
+The benchmark collects response latency metrics at each `wait_for_voice` step with `metrics: elapsed_time`. Generated CSV reports include:
+
+```csv
+app, scenario, repetition, success, duration, step_9_wait_for_voice_elapsed_time, step_12_wait_for_voice_elapsed_time
+telnyx, appointment_with_noise, 0, 1, 29654, 1631, 1225
+```
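Reports in this shape are straightforward to post-process. For example, averaging every `*_elapsed_time` column across repetitions could look like this (a sketch assuming the comma-plus-space separated header shown above; not part of the package):

```javascript
// Average every *_elapsed_time column of a CSV report like the one above.
function averageElapsedTimes(csv) {
  const [header, ...rows] = csv
    .trim()
    .split("\n")
    .map((line) => line.split(",").map((cell) => cell.trim()));
  const averages = {};
  header.forEach((name, col) => {
    if (!name.endsWith("_elapsed_time")) return; // only latency columns
    const values = rows.map((row) => Number(row[col]));
    averages[name] = values.reduce((a, b) => a + b, 0) / values.length;
  });
  return averages;
}
```

Running two reports (noise vs. no noise) through this and diffing the averages gives the per-step latency impact of background noise.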

+Compare results with and without noise to measure how background noise affects your voice agent's:
+- Response latency
+- Speech recognition accuracy
+- Overall conversation flow
+
+## Examples
+
+### Telnyx
+
+```bash
+npx @telnyx/voice-agent-tester@latest \
+  -a applications/telnyx.yaml \
+  -s scenarios/appointment.yaml \
+  --assistant-id <ASSISTANT_ID>
+```

-- `benchmarks/scenarios/appointment.yaml` - Appointment scheduling test
+### ElevenLabs

+```bash
+npx @telnyx/voice-agent-tester@latest \
+  -a applications/elevenlabs.yaml \
+  -s scenarios/appointment.yaml \
+  --assistant-id <AGENT_ID>
+```

-###
+### Vapi

 ```bash
-voice-agent-tester
+npx @telnyx/voice-agent-tester@latest \
+  -a applications/vapi.yaml \
+  -s scenarios/appointment.yaml \
+  --assistant-id <ASSISTANT_ID>
 ```

+## Comparison Mode
+
+When importing from an external provider, the tool automatically runs both benchmarks in sequence and generates a comparison report:
+
+1. **Provider Direct** - Benchmarks the assistant on the original provider's widget
+2. **Telnyx Import** - Benchmarks the same assistant after importing to Telnyx
+
+### Import and Compare (Default)

 ```bash
-voice-agent-tester
+npx @telnyx/voice-agent-tester@latest \
+  -a applications/telnyx.yaml \
+  -s scenarios/appointment.yaml \
+  --provider vapi \
+  --api-key <TELNYX_KEY> \
+  --provider-api-key <VAPI_KEY> \
+  --provider-import-id <VAPI_ASSISTANT_ID>
 ```

+This will:
+- Run Phase 1: VAPI direct benchmark
+- Run Phase 2: Telnyx import benchmark
+- Generate a side-by-side latency comparison report
+
+### Import Only (No Comparison)
+
+To skip the provider direct benchmark and only run the Telnyx import:

 ```bash
-voice-agent-tester
+npx @telnyx/voice-agent-tester@latest \
+  -a applications/telnyx.yaml \
+  -s scenarios/appointment.yaml \
+  --provider vapi \
+  --no-compare \
+  --api-key <TELNYX_KEY> \
+  --provider-api-key <VAPI_KEY> \
+  --provider-import-id <VAPI_ASSISTANT_ID>
 ```

-###
+### Debugging Failures
+
+If benchmarks fail, rerun with `--debug` for detailed diagnostics:

 ```bash
-voice-agent-tester
+voice-agent-tester --provider vapi --debug [other options...]
 ```

 ## License
package/{benchmarks/applications → applications}/elevenlabs.yaml
CHANGED
@@ -1,10 +1,13 @@
 url: "https://elevenlabs.io/app/talk-to?agent_id={{assistantId}}&branch_id={{branchId}}"
+tags:
+  - provider
+  - elevenlabs
 steps:
   - action: wait_for_element
-    selector: "
+    selector: "text=Call AI agent"
   - action: sleep
     time: 3000
   - action: click
-    selector: "
+    selector: "text=Call AI agent"
   - action: sleep
     time: 2000
package/applications/livetok.yaml
ADDED
@@ -0,0 +1,16 @@
url: "https://rti.livetok.io/demo/index.html"
tags:
  - default
  - basic
steps:
  - action: fill
    selector: "input[type='password']"
    text: "GOOGLE_API_KEY HERE"
  # - action: select
  #   selector: "#model"
  #   value: "gemini-2.5-flash-preview-native-audio-dialog"
  # - action: fill
  #   selector: "#tools"
  #   text: "[]"
  - action: click
    selector: "#start"
package/applications/vapi.yaml
ADDED
@@ -0,0 +1,19 @@
url: "https://vapi.ai?demo=true&shareKey={{shareKey}}&assistantId={{assistantId}}"
tags:
  - provider
  - vapi
steps:
  - action: wait_for_element
    selector: "button[aria-label=\"Talk to Vapi\"]"
  - action: sleep
    time: 5000
  - action: click
    selector: "button[aria-label=\"Talk to Vapi\"]"
  - action: sleep
    time: 2000
  - action: speak
    text: "Hello, what can you do?"
  - action: wait_for_voice
    metrics: elapsed_time
  - action: wait_for_silence
    metrics: elapsed_time
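Scenario files like the ones above are ordered lists of `action` steps. A runner for them reduces to dispatching each step to a handler by action name; a minimal sketch of that step model (illustrative only, not the package's actual engine):

```javascript
// Walk a parsed scenario's steps, dispatching each to a handler by action
// name. `handlers` maps action names (fill, click, speak, …) to functions
// that receive the full step object (selector, text, time, metrics, …).
function runScenario(steps, handlers) {
  const executed = [];
  for (const step of steps) {
    const handler = handlers[step.action];
    if (!handler) throw new Error(`unknown action: ${step.action}`);
    handler(step);
    executed.push(step.action); // ordered trace of what ran
  }
  return executed;
}
```

Keeping the dispatch table external to the walker is what lets new actions (like the 0.4.0 audio-input steps) be added without touching the loop itself.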
Binary files changed (audio assets; contents not shown).