@elizaos/plugin-local-inference 2.0.0-beta.1 → 2.0.3-beta.3
This diff shows the contents of publicly released package versions as they appear in their respective public registries. It is provided for informational purposes only.
- package/LICENSE +21 -0
- package/README.md +157 -0
- package/dist/actions/generate-media.d.ts +59 -0
- package/dist/actions/generate-media.d.ts.map +1 -0
- package/dist/actions/identify-speaker.d.ts +23 -0
- package/dist/actions/identify-speaker.d.ts.map +1 -0
- package/dist/actions/transcription-control.d.ts +29 -0
- package/dist/actions/transcription-control.d.ts.map +1 -0
- package/dist/adapters/capacitor-llama/environment.d.ts +12 -0
- package/dist/adapters/capacitor-llama/environment.d.ts.map +1 -0
- package/dist/adapters/capacitor-llama/index.browser.d.ts +9 -0
- package/dist/adapters/capacitor-llama/index.browser.d.ts.map +1 -0
- package/dist/adapters/capacitor-llama/index.d.ts +18 -0
- package/dist/adapters/capacitor-llama/index.d.ts.map +1 -0
- package/dist/adapters/capacitor-llama/loader.d.ts +35 -0
- package/dist/adapters/capacitor-llama/loader.d.ts.map +1 -0
- package/dist/adapters/capacitor-llama/native-voice-capture.d.ts +70 -0
- package/dist/adapters/capacitor-llama/native-voice-capture.d.ts.map +1 -0
- package/dist/adapters/capacitor-llama/structured-output.d.ts +62 -0
- package/dist/adapters/capacitor-llama/structured-output.d.ts.map +1 -0
- package/dist/adapters/capacitor-llama/text-streaming.d.ts +24 -0
- package/dist/adapters/capacitor-llama/text-streaming.d.ts.map +1 -0
- package/dist/adapters/capacitor-llama/types.d.ts +338 -0
- package/dist/adapters/capacitor-llama/types.d.ts.map +1 -0
- package/dist/adapters/capacitor-llama/voice-turn.d.ts +86 -0
- package/dist/adapters/capacitor-llama/voice-turn.d.ts.map +1 -0
- package/dist/backends/apple-foundation.d.ts +56 -0
- package/dist/backends/apple-foundation.d.ts.map +1 -0
- package/dist/index.d.ts +8 -37
- package/dist/index.d.ts.map +1 -0
- package/dist/index.js +38979 -430
- package/dist/index.js.map +217 -0
- package/dist/local-inference-routes.d.ts +47 -0
- package/dist/local-inference-routes.d.ts.map +1 -0
- package/dist/provider.d.ts +21 -0
- package/dist/provider.d.ts.map +1 -0
- package/dist/routes/compat-helpers.d.ts +18 -0
- package/dist/routes/compat-helpers.d.ts.map +1 -0
- package/dist/routes/family-member-route.d.ts +62 -0
- package/dist/routes/family-member-route.d.ts.map +1 -0
- package/dist/routes/index.d.ts +20 -0
- package/dist/routes/index.d.ts.map +1 -0
- package/dist/routes/index.js +42040 -0
- package/dist/routes/index.js.map +236 -0
- package/dist/routes/live-diarization-route.d.ts +33 -0
- package/dist/routes/live-diarization-route.d.ts.map +1 -0
- package/dist/routes/local-inference-asr-route.d.ts +4 -0
- package/dist/routes/local-inference-asr-route.d.ts.map +1 -0
- package/dist/routes/local-inference-asr-transcribe.d.ts +20 -0
- package/dist/routes/local-inference-asr-transcribe.d.ts.map +1 -0
- package/dist/routes/local-inference-compat-routes.d.ts +16 -0
- package/dist/routes/local-inference-compat-routes.d.ts.map +1 -0
- package/dist/routes/local-inference-tts-route.d.ts +7 -0
- package/dist/routes/local-inference-tts-route.d.ts.map +1 -0
- package/dist/routes/native-pcm-turn-route.d.ts +3 -0
- package/dist/routes/native-pcm-turn-route.d.ts.map +1 -0
- package/dist/routes/transcript-audio-store.d.ts +15 -0
- package/dist/routes/transcript-audio-store.d.ts.map +1 -0
- package/dist/routes/transcripts-routes.d.ts +44 -0
- package/dist/routes/transcripts-routes.d.ts.map +1 -0
- package/dist/routes/voice-first-run-routes.d.ts +62 -0
- package/dist/routes/voice-first-run-routes.d.ts.map +1 -0
- package/dist/routes/voice-models-routes.d.ts +62 -0
- package/dist/routes/voice-models-routes.d.ts.map +1 -0
- package/dist/routes/voice-profile-plugin-routes.d.ts +19 -0
- package/dist/routes/voice-profile-plugin-routes.d.ts.map +1 -0
- package/dist/routes/voice-profiles-management-routes.d.ts +52 -0
- package/dist/routes/voice-profiles-management-routes.d.ts.map +1 -0
- package/dist/routes/voice-speaker-profile-routes.d.ts +57 -0
- package/dist/routes/voice-speaker-profile-routes.d.ts.map +1 -0
- package/dist/runtime/embedding-manager-support.d.ts +77 -0
- package/dist/runtime/embedding-manager-support.d.ts.map +1 -0
- package/dist/runtime/embedding-presets.d.ts +16 -0
- package/dist/runtime/embedding-presets.d.ts.map +1 -0
- package/dist/runtime/embedding-warmup-policy.d.ts +14 -0
- package/dist/runtime/embedding-warmup-policy.d.ts.map +1 -0
- package/dist/runtime/ensure-local-inference-handler.d.ts +70 -0
- package/dist/runtime/ensure-local-inference-handler.d.ts.map +1 -0
- package/dist/runtime/index.d.ts +15 -0
- package/dist/runtime/index.d.ts.map +1 -0
- package/dist/runtime/index.js +38768 -0
- package/dist/runtime/index.js.map +217 -0
- package/dist/runtime/mobile-local-inference-gate.d.ts +63 -0
- package/dist/runtime/mobile-local-inference-gate.d.ts.map +1 -0
- package/dist/runtime/voice-entity-binding.d.ts +113 -0
- package/dist/runtime/voice-entity-binding.d.ts.map +1 -0
- package/dist/services/active-model.d.ts +310 -0
- package/dist/services/active-model.d.ts.map +1 -0
- package/dist/services/asr-provenance.d.ts +5 -0
- package/dist/services/asr-provenance.d.ts.map +1 -0
- package/dist/services/assignments.d.ts +84 -0
- package/dist/services/assignments.d.ts.map +1 -0
- package/dist/services/backend-selector.d.ts +55 -0
- package/dist/services/backend-selector.d.ts.map +1 -0
- package/dist/services/backend.d.ts +440 -0
- package/dist/services/backend.d.ts.map +1 -0
- package/dist/services/bionic-host-loader.d.ts +67 -0
- package/dist/services/bionic-host-loader.d.ts.map +1 -0
- package/dist/services/bundled-models.d.ts +34 -0
- package/dist/services/bundled-models.d.ts.map +1 -0
- package/dist/services/cache-bridge.d.ts +206 -0
- package/dist/services/cache-bridge.d.ts.map +1 -0
- package/dist/services/catalog.d.ts +10 -0
- package/dist/services/catalog.d.ts.map +1 -0
- package/dist/services/checkpoint-client.d.ts +109 -0
- package/dist/services/checkpoint-client.d.ts.map +1 -0
- package/dist/services/checkpoint-manager.d.ts +217 -0
- package/dist/services/checkpoint-manager.d.ts.map +1 -0
- package/dist/services/cloud-fallback.d.ts +102 -0
- package/dist/services/cloud-fallback.d.ts.map +1 -0
- package/dist/services/context-fit.d.ts +36 -0
- package/dist/services/context-fit.d.ts.map +1 -0
- package/dist/services/conversation-registry.d.ts +142 -0
- package/dist/services/conversation-registry.d.ts.map +1 -0
- package/dist/services/desktop-fused-ffi-backend-runtime.d.ts +111 -0
- package/dist/services/desktop-fused-ffi-backend-runtime.d.ts.map +1 -0
- package/dist/services/device-bridge.d.ts +188 -0
- package/dist/services/device-bridge.d.ts.map +1 -0
- package/dist/services/device-resource-metrics.d.ts +149 -0
- package/dist/services/device-resource-metrics.d.ts.map +1 -0
- package/dist/services/device-tier.d.ts +133 -0
- package/dist/services/device-tier.d.ts.map +1 -0
- package/dist/services/downloader.d.ts +94 -0
- package/dist/services/downloader.d.ts.map +1 -0
- package/dist/services/engine.d.ts +579 -0
- package/dist/services/engine.d.ts.map +1 -0
- package/dist/services/ensure-local-artifacts.d.ts +82 -0
- package/dist/services/ensure-local-artifacts.d.ts.map +1 -0
- package/dist/services/external-scanner.d.ts +17 -0
- package/dist/services/external-scanner.d.ts.map +1 -0
- package/dist/services/ffi-llm-mock.d.ts +90 -0
- package/dist/services/ffi-llm-mock.d.ts.map +1 -0
- package/dist/services/ffi-llm-streaming-abi.d.ts +318 -0
- package/dist/services/ffi-llm-streaming-abi.d.ts.map +1 -0
- package/dist/services/ffi-streaming-backend.d.ts +201 -0
- package/dist/services/ffi-streaming-backend.d.ts.map +1 -0
- package/dist/services/ffi-streaming-runner.d.ts +146 -0
- package/dist/services/ffi-streaming-runner.d.ts.map +1 -0
- package/dist/services/gpu-autotune.d.ts +150 -0
- package/dist/services/gpu-autotune.d.ts.map +1 -0
- package/dist/services/gpu-detect.d.ts +56 -0
- package/dist/services/gpu-detect.d.ts.map +1 -0
- package/dist/services/handler-registry.d.ts +72 -0
- package/dist/services/handler-registry.d.ts.map +1 -0
- package/dist/services/hardware.d.ts +63 -0
- package/dist/services/hardware.d.ts.map +1 -0
- package/dist/services/image-description-runtime.d.ts +14 -0
- package/dist/services/image-description-runtime.d.ts.map +1 -0
- package/dist/services/imagegen/aosp-unavailable.d.ts +134 -0
- package/dist/services/imagegen/aosp-unavailable.d.ts.map +1 -0
- package/dist/services/imagegen/backend-selector.d.ts +118 -0
- package/dist/services/imagegen/backend-selector.d.ts.map +1 -0
- package/dist/services/imagegen/coreml-unavailable.d.ts +105 -0
- package/dist/services/imagegen/coreml-unavailable.d.ts.map +1 -0
- package/dist/services/imagegen/errors.d.ts +16 -0
- package/dist/services/imagegen/errors.d.ts.map +1 -0
- package/dist/services/imagegen/index.d.ts +58 -0
- package/dist/services/imagegen/index.d.ts.map +1 -0
- package/dist/services/imagegen/mflux.d.ts +74 -0
- package/dist/services/imagegen/mflux.d.ts.map +1 -0
- package/dist/services/imagegen/sd-cpp.d.ts +181 -0
- package/dist/services/imagegen/sd-cpp.d.ts.map +1 -0
- package/dist/services/imagegen/tensorrt-unavailable.d.ts +83 -0
- package/dist/services/imagegen/tensorrt-unavailable.d.ts.map +1 -0
- package/dist/services/imagegen/types.d.ts +181 -0
- package/dist/services/imagegen/types.d.ts.map +1 -0
- package/dist/services/index.d.ts +31 -0
- package/dist/services/index.d.ts.map +1 -0
- package/dist/services/index.js +39453 -0
- package/dist/services/index.js.map +227 -0
- package/dist/services/inference-capabilities.d.ts +132 -0
- package/dist/services/inference-capabilities.d.ts.map +1 -0
- package/dist/services/inference-telemetry.d.ts +59 -0
- package/dist/services/inference-telemetry.d.ts.map +1 -0
- package/dist/services/ios-llama-streaming.d.ts +119 -0
- package/dist/services/ios-llama-streaming.d.ts.map +1 -0
- package/dist/services/kv-spill.d.ts +189 -0
- package/dist/services/kv-spill.d.ts.map +1 -0
- package/dist/services/latency-trace.d.ts +346 -0
- package/dist/services/latency-trace.d.ts.map +1 -0
- package/dist/services/lib-target.d.ts +55 -0
- package/dist/services/lib-target.d.ts.map +1 -0
- package/dist/services/live-signals.d.ts +86 -0
- package/dist/services/live-signals.d.ts.map +1 -0
- package/dist/services/llama-server-metrics.d.ts +114 -0
- package/dist/services/llama-server-metrics.d.ts.map +1 -0
- package/dist/services/llm-streaming-binding.d.ts +96 -0
- package/dist/services/llm-streaming-binding.d.ts.map +1 -0
- package/dist/services/load-args.d.ts +82 -0
- package/dist/services/load-args.d.ts.map +1 -0
- package/dist/services/manifest/index.d.ts +4 -0
- package/dist/services/manifest/index.d.ts.map +1 -0
- package/dist/services/manifest/schema.d.ts +903 -0
- package/dist/services/manifest/schema.d.ts.map +1 -0
- package/dist/services/manifest/types.d.ts +32 -0
- package/dist/services/manifest/types.d.ts.map +1 -0
- package/dist/services/manifest/validator.d.ts +66 -0
- package/dist/services/manifest/validator.d.ts.map +1 -0
- package/dist/services/memory-arbiter.d.ts +348 -0
- package/dist/services/memory-arbiter.d.ts.map +1 -0
- package/dist/services/memory-benchmark.d.ts +76 -0
- package/dist/services/memory-benchmark.d.ts.map +1 -0
- package/dist/services/memory-monitor.d.ts +128 -0
- package/dist/services/memory-monitor.d.ts.map +1 -0
- package/dist/services/memory-pressure.d.ts +130 -0
- package/dist/services/memory-pressure.d.ts.map +1 -0
- package/dist/services/mtp-doctor.d.ts +13 -0
- package/dist/services/mtp-doctor.d.ts.map +1 -0
- package/dist/services/network-policy.d.ts +127 -0
- package/dist/services/network-policy.d.ts.map +1 -0
- package/dist/services/paths.d.ts +6 -0
- package/dist/services/paths.d.ts.map +1 -0
- package/dist/services/planner-skeleton.d.ts +124 -0
- package/dist/services/planner-skeleton.d.ts.map +1 -0
- package/dist/services/providers.d.ts +38 -0
- package/dist/services/providers.d.ts.map +1 -0
- package/dist/services/ram-budget.d.ts +110 -0
- package/dist/services/ram-budget.d.ts.map +1 -0
- package/dist/services/readiness.d.ts +9 -0
- package/dist/services/readiness.d.ts.map +1 -0
- package/dist/services/recommendation.d.ts +111 -0
- package/dist/services/recommendation.d.ts.map +1 -0
- package/dist/services/registry.d.ts +33 -0
- package/dist/services/registry.d.ts.map +1 -0
- package/dist/services/router-handler.d.ts +92 -0
- package/dist/services/router-handler.d.ts.map +1 -0
- package/dist/services/routing-policy.d.ts +92 -0
- package/dist/services/routing-policy.d.ts.map +1 -0
- package/dist/services/routing-preferences.d.ts +8 -0
- package/dist/services/routing-preferences.d.ts.map +1 -0
- package/dist/services/runtime-target.d.ts +98 -0
- package/dist/services/runtime-target.d.ts.map +1 -0
- package/dist/services/service.d.ts +128 -0
- package/dist/services/service.d.ts.map +1 -0
- package/dist/services/session-pool.d.ts +72 -0
- package/dist/services/session-pool.d.ts.map +1 -0
- package/dist/services/structured-output/deterministic-repair.d.ts +23 -0
- package/dist/services/structured-output/deterministic-repair.d.ts.map +1 -0
- package/dist/services/structured-output/index.d.ts +2 -0
- package/dist/services/structured-output/index.d.ts.map +1 -0
- package/dist/services/structured-output.d.ts +311 -0
- package/dist/services/structured-output.d.ts.map +1 -0
- package/dist/services/system-memory.d.ts +33 -0
- package/dist/services/system-memory.d.ts.map +1 -0
- package/dist/services/types.d.ts +19 -0
- package/dist/services/types.d.ts.map +1 -0
- package/dist/services/verify-on-device.d.ts +34 -0
- package/dist/services/verify-on-device.d.ts.map +1 -0
- package/dist/services/verify.d.ts +8 -0
- package/dist/services/verify.d.ts.map +1 -0
- package/dist/services/vision/aosp-unavailable.d.ts +115 -0
- package/dist/services/vision/aosp-unavailable.d.ts.map +1 -0
- package/dist/services/vision/capacitor-llama.d.ts +99 -0
- package/dist/services/vision/capacitor-llama.d.ts.map +1 -0
- package/dist/services/vision/cloud-fallback.d.ts +47 -0
- package/dist/services/vision/cloud-fallback.d.ts.map +1 -0
- package/dist/services/vision/hash.d.ts +71 -0
- package/dist/services/vision/hash.d.ts.map +1 -0
- package/dist/services/vision/index.d.ts +95 -0
- package/dist/services/vision/index.d.ts.map +1 -0
- package/dist/services/vision/llama-server.d.ts +73 -0
- package/dist/services/vision/llama-server.d.ts.map +1 -0
- package/dist/services/vision/types.d.ts +162 -0
- package/dist/services/vision/types.d.ts.map +1 -0
- package/dist/services/vision/vast-fallback.d.ts +18 -0
- package/dist/services/vision/vast-fallback.d.ts.map +1 -0
- package/dist/services/vision-embedding-cache.d.ts +98 -0
- package/dist/services/vision-embedding-cache.d.ts.map +1 -0
- package/dist/services/voice/__test-helpers__/fake-ffi.d.ts +27 -0
- package/dist/services/voice/__test-helpers__/fake-ffi.d.ts.map +1 -0
- package/dist/services/voice/__test-helpers__/synthetic-speech.d.ts +66 -0
- package/dist/services/voice/__test-helpers__/synthetic-speech.d.ts.map +1 -0
- package/dist/services/voice/acoustic-speaker-attribution.d.ts +61 -0
- package/dist/services/voice/acoustic-speaker-attribution.d.ts.map +1 -0
- package/dist/services/voice/audio-frame-consumer.d.ts +294 -0
- package/dist/services/voice/audio-frame-consumer.d.ts.map +1 -0
- package/dist/services/voice/barge-in.d.ts +112 -0
- package/dist/services/voice/barge-in.d.ts.map +1 -0
- package/dist/services/voice/cancellation-coordinator.d.ts +127 -0
- package/dist/services/voice/cancellation-coordinator.d.ts.map +1 -0
- package/dist/services/voice/checkpoint-manager.d.ts +199 -0
- package/dist/services/voice/checkpoint-manager.d.ts.map +1 -0
- package/dist/services/voice/checkpoint-policy.d.ts +178 -0
- package/dist/services/voice/checkpoint-policy.d.ts.map +1 -0
- package/dist/services/voice/corpus-augment.d.ts +111 -0
- package/dist/services/voice/corpus-augment.d.ts.map +1 -0
- package/dist/services/voice/corpus-generator.d.ts +134 -0
- package/dist/services/voice/corpus-generator.d.ts.map +1 -0
- package/dist/services/voice/diarization-error-rate.d.ts +40 -0
- package/dist/services/voice/diarization-error-rate.d.ts.map +1 -0
- package/dist/services/voice/e2e-harness.d.ts +297 -0
- package/dist/services/voice/e2e-harness.d.ts.map +1 -0
- package/dist/services/voice/eager-context-builder.d.ts +170 -0
- package/dist/services/voice/eager-context-builder.d.ts.map +1 -0
- package/dist/services/voice/echo-delay.d.ts +67 -0
- package/dist/services/voice/echo-delay.d.ts.map +1 -0
- package/dist/services/voice/echo-metrics.d.ts +7 -0
- package/dist/services/voice/echo-metrics.d.ts.map +1 -0
- package/dist/services/voice/echo-reference-buffer.d.ts +65 -0
- package/dist/services/voice/echo-reference-buffer.d.ts.map +1 -0
- package/dist/services/voice/eliza1-eot-scorer.d.ts +124 -0
- package/dist/services/voice/eliza1-eot-scorer.d.ts.map +1 -0
- package/dist/services/voice/embedding-server.d.ts +37 -0
- package/dist/services/voice/embedding-server.d.ts.map +1 -0
- package/dist/services/voice/embedding.d.ts +132 -0
- package/dist/services/voice/embedding.d.ts.map +1 -0
- package/dist/services/voice/emotion-attribution.d.ts +68 -0
- package/dist/services/voice/emotion-attribution.d.ts.map +1 -0
- package/dist/services/voice/engine-bridge.d.ts +762 -0
- package/dist/services/voice/engine-bridge.d.ts.map +1 -0
- package/dist/services/voice/eot-classifier-ggml.d.ts +179 -0
- package/dist/services/voice/eot-classifier-ggml.d.ts.map +1 -0
- package/dist/services/voice/eot-classifier.d.ts +211 -0
- package/dist/services/voice/eot-classifier.d.ts.map +1 -0
- package/dist/services/voice/errors.d.ts +20 -0
- package/dist/services/voice/errors.d.ts.map +1 -0
- package/dist/services/voice/expressive-tags.d.ts +158 -0
- package/dist/services/voice/expressive-tags.d.ts.map +1 -0
- package/dist/services/voice/ffi-bindings.d.ts +696 -0
- package/dist/services/voice/ffi-bindings.d.ts.map +1 -0
- package/dist/services/voice/first-line-cache.d.ts +181 -0
- package/dist/services/voice/first-line-cache.d.ts.map +1 -0
- package/dist/services/voice/fused-eot-scorer.d.ts +51 -0
- package/dist/services/voice/fused-eot-scorer.d.ts.map +1 -0
- package/dist/services/voice/index.d.ts +96 -0
- package/dist/services/voice/index.d.ts.map +1 -0
- package/dist/services/voice/kokoro/index.d.ts +24 -0
- package/dist/services/voice/kokoro/index.d.ts.map +1 -0
- package/dist/services/voice/kokoro/kokoro-backend.d.ts +87 -0
- package/dist/services/voice/kokoro/kokoro-backend.d.ts.map +1 -0
- package/dist/services/voice/kokoro/kokoro-engine-discovery.d.ts +58 -0
- package/dist/services/voice/kokoro/kokoro-engine-discovery.d.ts.map +1 -0
- package/dist/services/voice/kokoro/kokoro-ffi-runtime.d.ts +75 -0
- package/dist/services/voice/kokoro/kokoro-ffi-runtime.d.ts.map +1 -0
- package/dist/services/voice/kokoro/kokoro-runtime.d.ts +100 -0
- package/dist/services/voice/kokoro/kokoro-runtime.d.ts.map +1 -0
- package/dist/services/voice/kokoro/phoneme-stream.d.ts +51 -0
- package/dist/services/voice/kokoro/phoneme-stream.d.ts.map +1 -0
- package/dist/services/voice/kokoro/phonemizer.d.ts +50 -0
- package/dist/services/voice/kokoro/phonemizer.d.ts.map +1 -0
- package/dist/services/voice/kokoro/pick-runtime.d.ts +61 -0
- package/dist/services/voice/kokoro/pick-runtime.d.ts.map +1 -0
- package/dist/services/voice/kokoro/runtime-selection.d.ts +31 -0
- package/dist/services/voice/kokoro/runtime-selection.d.ts.map +1 -0
- package/dist/services/voice/kokoro/types.d.ts +82 -0
- package/dist/services/voice/kokoro/types.d.ts.map +1 -0
- package/dist/services/voice/kokoro/voice-presets.d.ts +23 -0
- package/dist/services/voice/kokoro/voice-presets.d.ts.map +1 -0
- package/dist/services/voice/kokoro/voices.d.ts +30 -0
- package/dist/services/voice/kokoro/voices.d.ts.map +1 -0
- package/dist/services/voice/lifecycle.d.ts +135 -0
- package/dist/services/voice/lifecycle.d.ts.map +1 -0
- package/dist/services/voice/live-diarization-session.d.ts +196 -0
- package/dist/services/voice/live-diarization-session.d.ts.map +1 -0
- package/dist/services/voice/metric-math.d.ts +10 -0
- package/dist/services/voice/metric-math.d.ts.map +1 -0
- package/dist/services/voice/mic-source.d.ts +136 -0
- package/dist/services/voice/mic-source.d.ts.map +1 -0
- package/dist/services/voice/nlms-echo-canceller.d.ts +137 -0
- package/dist/services/voice/nlms-echo-canceller.d.ts.map +1 -0
- package/dist/services/voice/optimistic-policy.d.ts +109 -0
- package/dist/services/voice/optimistic-policy.d.ts.map +1 -0
- package/dist/services/voice/optimistic-rollback.d.ts +151 -0
- package/dist/services/voice/optimistic-rollback.d.ts.map +1 -0
- package/dist/services/voice/partial-stabilizer.d.ts +73 -0
- package/dist/services/voice/partial-stabilizer.d.ts.map +1 -0
- package/dist/services/voice/phoneme-tokenizer.d.ts +49 -0
- package/dist/services/voice/phoneme-tokenizer.d.ts.map +1 -0
- package/dist/services/voice/phrase-cache.d.ts +76 -0
- package/dist/services/voice/phrase-cache.d.ts.map +1 -0
- package/dist/services/voice/phrase-chunker.d.ts +62 -0
- package/dist/services/voice/phrase-chunker.d.ts.map +1 -0
- package/dist/services/voice/pipeline-impls.d.ts +151 -0
- package/dist/services/voice/pipeline-impls.d.ts.map +1 -0
- package/dist/services/voice/pipeline.d.ts +216 -0
- package/dist/services/voice/pipeline.d.ts.map +1 -0
- package/dist/services/voice/prefill-client.d.ts +123 -0
- package/dist/services/voice/prefill-client.d.ts.map +1 -0
- package/dist/services/voice/prefix-preserving-queue.d.ts +113 -0
- package/dist/services/voice/prefix-preserving-queue.d.ts.map +1 -0
- package/dist/services/voice/profile-store.d.ts +248 -0
- package/dist/services/voice/profile-store.d.ts.map +1 -0
- package/dist/services/voice/ring-buffer.d.ts +40 -0
- package/dist/services/voice/ring-buffer.d.ts.map +1 -0
- package/dist/services/voice/rollback-queue.d.ts +24 -0
- package/dist/services/voice/rollback-queue.d.ts.map +1 -0
- package/dist/services/voice/samantha-preset-placeholder.d.ts +67 -0
- package/dist/services/voice/samantha-preset-placeholder.d.ts.map +1 -0
- package/dist/services/voice/samantha-preset-regenerator.d.ts +87 -0
- package/dist/services/voice/samantha-preset-regenerator.d.ts.map +1 -0
- package/dist/services/voice/scheduler.d.ts +146 -0
- package/dist/services/voice/scheduler.d.ts.map +1 -0
- package/dist/services/voice/self-voice-imprint.d.ts +33 -0
- package/dist/services/voice/self-voice-imprint.d.ts.map +1 -0
- package/dist/services/voice/shared-resources.d.ts +204 -0
- package/dist/services/voice/shared-resources.d.ts.map +1 -0
- package/dist/services/voice/speaker/attribution-pipeline.d.ts +74 -0
- package/dist/services/voice/speaker/attribution-pipeline.d.ts.map +1 -0
- package/dist/services/voice/speaker/diarizer-fused.d.ts +59 -0
- package/dist/services/voice/speaker/diarizer-fused.d.ts.map +1 -0
- package/dist/services/voice/speaker/diarizer.d.ts +75 -0
- package/dist/services/voice/speaker/diarizer.d.ts.map +1 -0
- package/dist/services/voice/speaker/encoder-fused.d.ts +60 -0
- package/dist/services/voice/speaker/encoder-fused.d.ts.map +1 -0
- package/dist/services/voice/speaker/encoder-ggml.d.ts +33 -0
- package/dist/services/voice/speaker/encoder-ggml.d.ts.map +1 -0
- package/dist/services/voice/speaker/encoder.d.ts +37 -0
- package/dist/services/voice/speaker/encoder.d.ts.map +1 -0
- package/dist/services/voice/speaker-imprint.d.ts +83 -0
- package/dist/services/voice/speaker-imprint.d.ts.map +1 -0
- package/dist/services/voice/speaker-preset-cache.d.ts +77 -0
- package/dist/services/voice/speaker-preset-cache.d.ts.map +1 -0
- package/dist/services/voice/streaming-asr/streaming-pipeline-adapter.d.ts +160 -0
- package/dist/services/voice/streaming-asr/streaming-pipeline-adapter.d.ts.map +1 -0
- package/dist/services/voice/system-audio-sink.d.ts +73 -0
- package/dist/services/voice/system-audio-sink.d.ts.map +1 -0
- package/dist/services/voice/transcriber.d.ts +244 -0
- package/dist/services/voice/transcriber.d.ts.map +1 -0
- package/dist/services/voice/transcript-knowledge.d.ts +37 -0
- package/dist/services/voice/transcript-knowledge.d.ts.map +1 -0
- package/dist/services/voice/transcript-service.d.ts +60 -0
- package/dist/services/voice/transcript-service.d.ts.map +1 -0
- package/dist/services/voice/transcript-store.d.ts +64 -0
- package/dist/services/voice/transcript-store.d.ts.map +1 -0
- package/dist/services/voice/turn-controller.d.ts +183 -0
- package/dist/services/voice/turn-controller.d.ts.map +1 -0
- package/dist/services/voice/types.d.ts +643 -0
- package/dist/services/voice/types.d.ts.map +1 -0
- package/dist/services/voice/vad.d.ts +283 -0
- package/dist/services/voice/vad.d.ts.map +1 -0
- package/dist/services/voice/voice-budget.d.ts +241 -0
- package/dist/services/voice/voice-budget.d.ts.map +1 -0
- package/dist/services/voice/voice-emotion-classifier.d.ts +95 -0
- package/dist/services/voice/voice-emotion-classifier.d.ts.map +1 -0
- package/dist/services/voice/voice-preload-predictor.d.ts +76 -0
- package/dist/services/voice/voice-preload-predictor.d.ts.map +1 -0
- package/dist/services/voice/voice-preset-format.d.ts +158 -0
- package/dist/services/voice/voice-preset-format.d.ts.map +1 -0
- package/dist/services/voice/voice-profile-artifact.d.ts +116 -0
- package/dist/services/voice/voice-profile-artifact.d.ts.map +1 -0
- package/dist/services/voice/voice-profile-routes.d.ts +83 -0
- package/dist/services/voice/voice-profile-routes.d.ts.map +1 -0
- package/dist/services/voice/voice-scenario.d.ts +131 -0
- package/dist/services/voice/voice-scenario.d.ts.map +1 -0
- package/dist/services/voice/voice-state-machine.d.ts +364 -0
- package/dist/services/voice/voice-state-machine.d.ts.map +1 -0
- package/dist/services/voice/voice-workbench-report.d.ts +117 -0
- package/dist/services/voice/voice-workbench-report.d.ts.map +1 -0
- package/dist/services/voice/wake-word-ggml.d.ts +100 -0
- package/dist/services/voice/wake-word-ggml.d.ts.map +1 -0
- package/dist/services/voice/wake-word.d.ts +255 -0
- package/dist/services/voice/wake-word.d.ts.map +1 -0
- package/dist/services/voice/wav-codec.d.ts +11 -0
- package/dist/services/voice/wav-codec.d.ts.map +1 -0
- package/dist/services/voice/workbench-entrypoint.d.ts +42 -0
- package/dist/services/voice/workbench-entrypoint.d.ts.map +1 -0
- package/dist/services/voice/workbench-headless-runner.d.ts +102 -0
- package/dist/services/voice/workbench-headless-runner.d.ts.map +1 -0
- package/dist/services/voice/workbench-logic-services.d.ts +36 -0
- package/dist/services/voice/workbench-logic-services.d.ts.map +1 -0
- package/dist/services/voice/workbench-real-services.d.ts +17 -0
- package/dist/services/voice/workbench-real-services.d.ts.map +1 -0
- package/dist/services/voice/workbench-scenarios.d.ts +24 -0
- package/dist/services/voice/workbench-scenarios.d.ts.map +1 -0
- package/dist/services/voice/wrap-with-first-line-cache.d.ts +70 -0
- package/dist/services/voice/wrap-with-first-line-cache.d.ts.map +1 -0
- package/dist/services/voice-model-updater.d.ts +240 -0
- package/dist/services/voice-model-updater.d.ts.map +1 -0
- package/dist/services/voice-prewarm.d.ts +3 -0
- package/dist/services/voice-prewarm.d.ts.map +1 -0
- package/dist/voice-workbench.d.ts +18 -0
- package/dist/voice-workbench.d.ts.map +1 -0
- package/dist/voice-workbench.js +5259 -0
- package/dist/voice-workbench.js.map +34 -0
- package/package.json +101 -15
- package/registry-entry.json +137 -0
- package/src/actions/generate-media.ts +647 -0
- package/src/actions/identify-speaker.ts +171 -0
- package/src/actions/transcription-control.test.ts +100 -0
- package/src/actions/transcription-control.ts +127 -0
- package/src/adapters/capacitor-llama/__tests__/compat-behavior.test.ts +218 -0
- package/src/adapters/capacitor-llama/__tests__/index.test.ts +68 -0
- package/src/adapters/capacitor-llama/__tests__/structured-output.test.ts +215 -0
- package/src/adapters/capacitor-llama/__tests__/text-streaming.test.ts +174 -0
- package/src/adapters/capacitor-llama/__tests__/voice-turn.test.ts +293 -0
- package/src/adapters/capacitor-llama/environment.ts +71 -0
- package/src/adapters/capacitor-llama/index.browser.ts +83 -0
- package/src/adapters/capacitor-llama/index.ts +831 -0
- package/src/adapters/capacitor-llama/loader.ts +109 -0
- package/src/adapters/capacitor-llama/native-voice-capture.ts +140 -0
- package/src/adapters/capacitor-llama/structured-output.ts +165 -0
- package/src/adapters/capacitor-llama/text-streaming.ts +227 -0
- package/src/adapters/capacitor-llama/types.ts +374 -0
- package/src/adapters/capacitor-llama/voice-turn.ts +178 -0
- package/src/backends/apple-foundation.ts +127 -0
- package/src/index.ts +62 -0
- package/src/local-inference-routes.test.ts +390 -0
- package/src/local-inference-routes.ts +1625 -0
- package/src/provider.ts +1111 -0
- package/src/routes/compat-helpers.ts +275 -0
- package/src/routes/family-member-route.ts +353 -0
- package/src/routes/index.ts +61 -0
- package/src/routes/live-diarization-route.test.ts +347 -0
- package/src/routes/live-diarization-route.ts +198 -0
- package/src/routes/local-inference-asr-route.test.ts +246 -0
- package/src/routes/local-inference-asr-route.ts +166 -0
- package/src/routes/local-inference-asr-transcribe.test.ts +118 -0
- package/src/routes/local-inference-asr-transcribe.ts +97 -0
- package/src/routes/local-inference-compat-routes.test.ts +485 -0
- package/src/routes/local-inference-compat-routes.ts +775 -0
- package/src/routes/local-inference-tts-route.test.ts +179 -0
- package/src/routes/local-inference-tts-route.ts +230 -0
- package/src/routes/native-pcm-turn-route.test.ts +136 -0
- package/src/routes/native-pcm-turn-route.ts +121 -0
- package/src/routes/transcript-audio-store.ts +27 -0
- package/src/routes/transcripts-routes.test.ts +195 -0
- package/src/routes/transcripts-routes.ts +191 -0
- package/src/routes/voice-first-run-routes.ts +524 -0
- package/src/routes/voice-models-routes.ts +554 -0
- package/src/routes/voice-profile-plugin-routes.ts +138 -0
- package/src/routes/voice-profiles-management-routes.ts +476 -0
- package/src/routes/voice-speaker-profile-routes.ts +199 -0
- package/src/runtime/aosp-llama-loader-selection.test.ts +80 -0
- package/src/runtime/bionic-wire-encoding.test.ts +147 -0
- package/src/runtime/capacitor-llama.d.ts +25 -0
- package/src/runtime/embedding-manager-support.ts +497 -0
- package/src/runtime/embedding-presets.ts +81 -0
- package/src/runtime/embedding-warmup-policy.test.ts +53 -0
- package/src/runtime/embedding-warmup-policy.ts +48 -0
- package/src/runtime/ensure-local-inference-handler.test.ts +726 -0
- package/src/runtime/ensure-local-inference-handler.ts +1640 -0
- package/src/runtime/index.ts +36 -0
- package/src/runtime/mobile-local-inference-gate.test.ts +152 -0
- package/src/runtime/mobile-local-inference-gate.ts +99 -0
- package/src/runtime/voice-entity-binding.transcript.test.ts +98 -0
- package/src/runtime/voice-entity-binding.ts +368 -0
- package/src/runtime/voice-speaker-entity-contract.test.ts +149 -0
- package/src/services/README.md +71 -0
- package/src/services/__tests__/backend-selector.precedence.test.ts +333 -0
- package/src/services/__tests__/backend-selector.test.ts +101 -0
- package/src/services/__tests__/checkpoint-manager.test.ts +376 -0
- package/src/services/__tests__/gpu-autotune.test.ts +400 -0
- package/src/services/__tests__/llm-streaming-binding.test.ts +85 -0
- package/src/services/__tests__/planner-grammar.test.ts +372 -0
- package/src/services/__tests__/runtime-target.test.ts +176 -0
- package/src/services/active-model-context-fit.test.ts +125 -0
- package/src/services/active-model-switch-rollback.test.ts +183 -0
- package/src/services/active-model.ts +1416 -0
- package/src/services/asr-provenance.ts +68 -0
- package/src/services/assignment-validation.test.ts +118 -0
- package/src/services/assignments.test.ts +106 -0
- package/src/services/assignments.ts +278 -0
- package/src/services/backend-selector.ts +95 -0
- package/src/services/backend.test.ts +84 -0
- package/src/services/backend.ts +791 -0
- package/src/services/bionic-host-loader.test.ts +226 -0
- package/src/services/bionic-host-loader.ts +252 -0
- package/src/services/bundled-models.ts +129 -0
- package/src/services/cache-bridge.test.ts +516 -0
- package/src/services/cache-bridge.ts +423 -0
- package/src/services/catalog.test.ts +259 -0
- package/src/services/catalog.ts +33 -0
- package/src/services/checkpoint-client.ts +258 -0
- package/src/services/checkpoint-manager.ts +474 -0
- package/src/services/cloud-fallback.ts +230 -0
- package/src/services/context-fit.test.ts +121 -0
- package/src/services/context-fit.ts +113 -0
- package/src/services/conversation-registry.test.ts +235 -0
- package/src/services/conversation-registry.ts +264 -0
- package/src/services/desktop-fused-ffi-backend-runtime.ts +431 -0
- package/src/services/device-bridge.ts +1237 -0
- package/src/services/device-resource-metrics.test.ts +98 -0
- package/src/services/device-resource-metrics.ts +346 -0
- package/src/services/device-tier.test.ts +458 -0
- package/src/services/device-tier.ts +502 -0
- package/src/services/downloader.test.ts +888 -0
- package/src/services/downloader.ts +1039 -0
- package/src/services/engine-direct-bundle.test.ts +90 -0
- package/src/services/engine-streaming.test.ts +80 -0
- package/src/services/engine.ts +2096 -0
- package/src/services/ensure-local-artifacts.integration.test.ts +273 -0
- package/src/services/ensure-local-artifacts.test.ts +368 -0
- package/src/services/ensure-local-artifacts.ts +351 -0
- package/src/services/external-scanner.ts +312 -0
- package/src/services/ffi-llm-mock.ts +354 -0
- package/src/services/ffi-llm-streaming-abi.ts +445 -0
- package/src/services/ffi-streaming-backend.ts +418 -0
- package/src/services/ffi-streaming-runner.test.ts +220 -0
- package/src/services/ffi-streaming-runner.ts +407 -0
- package/src/services/ffi-unload-ordering.test.ts +166 -0
- package/src/services/fused-eliza1-no-regression.test.ts +144 -0
- package/src/services/gpu-autotune.ts +534 -0
- package/src/services/gpu-detect.ts +139 -0
- package/src/services/handler-registry.ts +240 -0
- package/src/services/hardware.test.ts +236 -0
- package/src/services/hardware.ts +438 -0
- package/src/services/image-description-runtime.test.ts +61 -0
- package/src/services/image-description-runtime.ts +118 -0
- package/src/services/imagegen/aosp-unavailable.ts +229 -0
- package/src/services/imagegen/backend-selector.test.ts +190 -0
- package/src/services/imagegen/backend-selector.ts +277 -0
- package/src/services/imagegen/coreml-unavailable.ts +237 -0
- package/src/services/imagegen/errors.ts +40 -0
- package/src/services/imagegen/index.ts +144 -0
- package/src/services/imagegen/mflux.ts +313 -0
- package/src/services/imagegen/sd-cpp.ts +715 -0
- package/src/services/imagegen/tensorrt-unavailable.ts +295 -0
- package/src/services/imagegen/types.ts +193 -0
- package/src/services/index.ts +229 -0
- package/src/services/inference-capabilities.test.ts +75 -0
- package/src/services/inference-capabilities.ts +204 -0
- package/src/services/inference-telemetry.ts +143 -0
- package/src/services/ios-llama-streaming.ts +248 -0
- package/src/services/kv-spill.test.ts +222 -0
- package/src/services/kv-spill.ts +357 -0
- package/src/services/latency-trace.test.ts +266 -0
- package/src/services/latency-trace.ts +844 -0
- package/src/services/lib-target.test.ts +145 -0
- package/src/services/lib-target.ts +102 -0
- package/src/services/live-signals.test.ts +132 -0
- package/src/services/live-signals.ts +177 -0
- package/src/services/llama-server-metrics.test.ts +168 -0
- package/src/services/llama-server-metrics.ts +304 -0
- package/src/services/llm-streaming-binding.ts +136 -0
- package/src/services/load-args.ts +81 -0
- package/src/services/manifest/eliza-1.manifest.v1.json +790 -0
- package/src/services/manifest/index.ts +72 -0
- package/src/services/manifest/manifest.test.ts +791 -0
- package/src/services/manifest/schema.ts +761 -0
- package/src/services/manifest/types.ts +61 -0
- package/src/services/manifest/validator.ts +633 -0
- package/src/services/memory-arbiter.test.ts +558 -0
- package/src/services/memory-arbiter.ts +991 -0
- package/src/services/memory-benchmark.test.ts +91 -0
- package/src/services/memory-benchmark.ts +354 -0
- package/src/services/memory-monitor.test.ts +232 -0
- package/src/services/memory-monitor.ts +309 -0
- package/src/services/memory-pressure.ts +414 -0
- package/src/services/mtp-doctor.ts +86 -0
- package/src/services/network-policy.ts +346 -0
- package/src/services/paths.ts +25 -0
- package/src/services/planner-skeleton.ts +175 -0
- package/src/services/providers.ts +507 -0
- package/src/services/ram-budget-cache.test.ts +164 -0
- package/src/services/ram-budget.ts +309 -0
- package/src/services/readiness.test.ts +87 -0
- package/src/services/readiness.ts +238 -0
- package/src/services/recommendation.test.ts +216 -0
- package/src/services/recommendation.ts +671 -0
- package/src/services/registry.ts +157 -0
- package/src/services/required-kernels-gate.test.ts +64 -0
- package/src/services/router-handler.test.ts +45 -0
- package/src/services/router-handler.ts +426 -0
- package/src/services/routing-policy.test.ts +352 -0
- package/src/services/routing-policy.ts +367 -0
- package/src/services/routing-preferences.ts +17 -0
- package/src/services/runtime-target.ts +154 -0
- package/src/services/service.test.ts +223 -0
- package/src/services/service.ts +750 -0
- package/src/services/session-pool.ts +153 -0
- package/src/services/structured-output/deterministic-repair.test.ts +169 -0
- package/src/services/structured-output/deterministic-repair.ts +443 -0
- package/src/services/structured-output/index.ts +4 -0
- package/src/services/structured-output.test.ts +483 -0
- package/src/services/structured-output.ts +712 -0
- package/src/services/system-memory.test.ts +47 -0
- package/src/services/system-memory.ts +67 -0
- package/src/services/transcription-priority.test.ts +211 -0
- package/src/services/types.ts +59 -0
- package/src/services/verify-on-device.test.ts +87 -0
- package/src/services/verify-on-device.ts +127 -0
- package/src/services/verify.ts +13 -0
- package/src/services/vision/aosp-unavailable.ts +163 -0
- package/src/services/vision/capacitor-llama.ts +255 -0
- package/src/services/vision/cloud-fallback.test.ts +243 -0
- package/src/services/vision/cloud-fallback.ts +268 -0
- package/src/services/vision/fallback-chain.test.ts +86 -0
- package/src/services/vision/hash.ts +157 -0
- package/src/services/vision/index.ts +251 -0
- package/src/services/vision/llama-server.ts +177 -0
- package/src/services/vision/types.ts +163 -0
- package/src/services/vision/vast-fallback.ts +127 -0
- package/src/services/vision-embedding-cache.ts +189 -0
- package/src/services/voice/VOICE_WORKBENCH.md +133 -0
- package/src/services/voice/__fixtures__/voice-workbench-logic-baseline.json +180 -0
- package/src/services/voice/__test-helpers__/fake-ffi.ts +94 -0
- package/src/services/voice/__test-helpers__/synthetic-speech.ts +194 -0
- package/src/services/voice/__tests__/checkpoint-manager.test.ts +241 -0
- package/src/services/voice/__tests__/checkpoint-policy.test.ts +270 -0
- package/src/services/voice/__tests__/eager-context-builder.test.ts +257 -0
- package/src/services/voice/__tests__/eliza1-eot-scorer.test.ts +288 -0
- package/src/services/voice/__tests__/eot-classifier.test.ts +431 -0
- package/src/services/voice/__tests__/optimistic-rollback.test.ts +312 -0
- package/src/services/voice/__tests__/prefill-client.test.ts +266 -0
- package/src/services/voice/__tests__/prefix-preserving-queue.test.ts +208 -0
- package/src/services/voice/__tests__/streaming-asr.test.ts +450 -0
- package/src/services/voice/__tests__/streaming-transcriber.test.ts +339 -0
- package/src/services/voice/__tests__/turn-detector-resolver.test.ts +195 -0
- package/src/services/voice/__tests__/voice-state-machine-prefill.test.ts +275 -0
- package/src/services/voice/__tests__/voice-state-machine.test.ts +354 -0
- package/src/services/voice/acoustic-speaker-attribution.test.ts +165 -0
- package/src/services/voice/acoustic-speaker-attribution.ts +336 -0
- package/src/services/voice/asr-timed.real.test.ts +139 -0
- package/src/services/voice/audio-frame-consumer.test.ts +669 -0
- package/src/services/voice/audio-frame-consumer.ts +651 -0
- package/src/services/voice/barge-in.test.ts +244 -0
- package/src/services/voice/barge-in.ts +335 -0
- package/src/services/voice/cancellation-coordinator.test.ts +196 -0
- package/src/services/voice/cancellation-coordinator.ts +269 -0
- package/src/services/voice/checkpoint-manager.ts +401 -0
- package/src/services/voice/checkpoint-policy.ts +336 -0
- package/src/services/voice/composite-eot-classifier.test.ts +59 -0
- package/src/services/voice/corpus-augment.test.ts +276 -0
- package/src/services/voice/corpus-augment.ts +451 -0
- package/src/services/voice/corpus-generator.test.ts +201 -0
- package/src/services/voice/corpus-generator.ts +413 -0
- package/src/services/voice/diarization-error-rate.greedy.test.ts +140 -0
- package/src/services/voice/diarization-error-rate.test.ts +100 -0
- package/src/services/voice/diarization-error-rate.ts +249 -0
- package/src/services/voice/e2e-harness.der.test.ts +94 -0
- package/src/services/voice/e2e-harness.respond-eot-entity.test.ts +277 -0
- package/src/services/voice/e2e-harness.security-echo.test.ts +103 -0
- package/src/services/voice/e2e-harness.test.ts +182 -0
- package/src/services/voice/e2e-harness.ts +902 -0
- package/src/services/voice/eager-context-builder.ts +262 -0
- package/src/services/voice/echo-delay.test.ts +118 -0
- package/src/services/voice/echo-delay.ts +135 -0
- package/src/services/voice/echo-metrics.test.ts +17 -0
- package/src/services/voice/echo-metrics.ts +20 -0
- package/src/services/voice/echo-reference-buffer.test.ts +86 -0
- package/src/services/voice/echo-reference-buffer.ts +165 -0
- package/src/services/voice/eliza1-eot-scorer.ts +242 -0
- package/src/services/voice/embedding-server.ts +200 -0
- package/src/services/voice/embedding.test.ts +131 -0
- package/src/services/voice/embedding.ts +242 -0
- package/src/services/voice/emotion-attribution.test.ts +129 -0
- package/src/services/voice/emotion-attribution.ts +361 -0
- package/src/services/voice/engine-bridge-cancellation.test.ts +422 -0
- package/src/services/voice/engine-bridge-transcript-join.test.ts +278 -0
- package/src/services/voice/engine-bridge.test.ts +384 -0
- package/src/services/voice/engine-bridge.ts +2343 -0
- package/src/services/voice/eot-classifier-ggml.ts +569 -0
- package/src/services/voice/eot-classifier.test.ts +98 -0
- package/src/services/voice/eot-classifier.ts +422 -0
- package/src/services/voice/errors.ts +34 -0
- package/src/services/voice/expressive-tags.asr.test.ts +77 -0
- package/src/services/voice/expressive-tags.test.ts +102 -0
- package/src/services/voice/expressive-tags.ts +405 -0
- package/src/services/voice/ffi-bindings.test.ts +735 -0
- package/src/services/voice/ffi-bindings.ts +3387 -0
- package/src/services/voice/first-line-cache.ts +725 -0
- package/src/services/voice/fused-eot-scorer.ts +139 -0
- package/src/services/voice/index.ts +502 -0
- package/src/services/voice/kokoro/__tests__/kokoro-backend.test.ts +262 -0
- package/src/services/voice/kokoro/__tests__/kokoro-engine-bridge.real.test.ts +236 -0
- package/src/services/voice/kokoro/__tests__/kokoro-engine-bridge.test.ts +60 -0
- package/src/services/voice/kokoro/__tests__/kokoro-engine-discovery.test.ts +277 -0
- package/src/services/voice/kokoro/__tests__/kokoro-ffi-runtime.test.ts +235 -0
- package/src/services/voice/kokoro/__tests__/kokoro-runtime.test.ts +95 -0
- package/src/services/voice/kokoro/__tests__/phonemizer.test.ts +53 -0
- package/src/services/voice/kokoro/__tests__/runtime-selection.test.ts +67 -0
- package/src/services/voice/kokoro/__tests__/voices.test.ts +57 -0
- package/src/services/voice/kokoro/index.ts +79 -0
- package/src/services/voice/kokoro/kokoro-backend.ts +223 -0
- package/src/services/voice/kokoro/kokoro-engine-discovery.ts +177 -0
- package/src/services/voice/kokoro/kokoro-ffi-runtime.ts +233 -0
- package/src/services/voice/kokoro/kokoro-runtime.ts +170 -0
- package/src/services/voice/kokoro/phoneme-stream.ts +123 -0
- package/src/services/voice/kokoro/phonemizer.ts +344 -0
- package/src/services/voice/kokoro/pick-runtime.test.ts +91 -0
- package/src/services/voice/kokoro/pick-runtime.ts +130 -0
- package/src/services/voice/kokoro/runtime-selection.ts +64 -0
- package/src/services/voice/kokoro/types.ts +95 -0
- package/src/services/voice/kokoro/voice-presets.ts +129 -0
- package/src/services/voice/kokoro/voices.ts +64 -0
- package/src/services/voice/lifecycle.test.ts +315 -0
- package/src/services/voice/lifecycle.ts +301 -0
- package/src/services/voice/live-diarization-session.echo.test.ts +232 -0
- package/src/services/voice/live-diarization-session.ts +622 -0
- package/src/services/voice/metric-math.test.ts +61 -0
- package/src/services/voice/metric-math.ts +25 -0
- package/src/services/voice/mic-source.test.ts +210 -0
- package/src/services/voice/mic-source.ts +503 -0
- package/src/services/voice/nlms-echo-canceller.test.ts +244 -0
- package/src/services/voice/nlms-echo-canceller.ts +317 -0
- package/src/services/voice/optimistic-policy.power-source.test.ts +36 -0
- package/src/services/voice/optimistic-policy.test.ts +101 -0
- package/src/services/voice/optimistic-policy.ts +192 -0
- package/src/services/voice/optimistic-rollback.ts +343 -0
- package/src/services/voice/partial-stabilizer.test.ts +68 -0
- package/src/services/voice/partial-stabilizer.ts +140 -0
- package/src/services/voice/phoneme-tokenizer.ts +158 -0
- package/src/services/voice/phrase-cache.test.ts +242 -0
- package/src/services/voice/phrase-cache.ts +186 -0
- package/src/services/voice/phrase-chunker.test.ts +239 -0
- package/src/services/voice/phrase-chunker.ts +281 -0
- package/src/services/voice/pipeline-impls.l6.test.ts +110 -0
- package/src/services/voice/pipeline-impls.test.ts +292 -0
- package/src/services/voice/pipeline-impls.ts +315 -0
- package/src/services/voice/pipeline.ts +504 -0
- package/src/services/voice/prefill-client.ts +316 -0
- package/src/services/voice/prefix-preserving-queue.ts +162 -0
- package/src/services/voice/profile-store.ts +887 -0
- package/src/services/voice/real-audio-decode.test.ts +148 -0
- package/src/services/voice/research/VOICE_8785_ASSESSMENT.md +141 -0
- package/src/services/voice/research/VOICE_PIPELINE_RESEARCH_2026.md +117 -0
- package/src/services/voice/research/VOICE_VALIDATION_RUNBOOK.md +135 -0
- package/src/services/voice/ring-buffer.test.ts +129 -0
- package/src/services/voice/ring-buffer.ts +123 -0
- package/src/services/voice/rollback-queue.ts +74 -0
- package/src/services/voice/samantha-preset-placeholder.test.ts +97 -0
- package/src/services/voice/samantha-preset-placeholder.ts +148 -0
- package/src/services/voice/samantha-preset-regenerator.ts +393 -0
- package/src/services/voice/samantha-preset-regenerator.wav.test.ts +90 -0
- package/src/services/voice/scheduler.t2.test.ts +141 -0
- package/src/services/voice/scheduler.ts +927 -0
- package/src/services/voice/self-voice-imprint.test.ts +59 -0
- package/src/services/voice/self-voice-imprint.ts +102 -0
- package/src/services/voice/shared-resources.ts +343 -0
- package/src/services/voice/speaker/attribution-pipeline.test.ts +221 -0
- package/src/services/voice/speaker/attribution-pipeline.ts +449 -0
- package/src/services/voice/speaker/diarizer-fused.real.test.ts +100 -0
- package/src/services/voice/speaker/diarizer-fused.ts +154 -0
- package/src/services/voice/speaker/diarizer.ts +218 -0
- package/src/services/voice/speaker/encoder-fused.real.test.ts +113 -0
- package/src/services/voice/speaker/encoder-fused.ts +138 -0
- package/src/services/voice/speaker/encoder-ggml.test.ts +59 -0
- package/src/services/voice/speaker/encoder-ggml.ts +79 -0
- package/src/services/voice/speaker/encoder.ts +105 -0
- package/src/services/voice/speaker-imprint.test.ts +185 -0
- package/src/services/voice/speaker-imprint.ts +312 -0
- package/src/services/voice/speaker-preset-cache.test.ts +154 -0
- package/src/services/voice/speaker-preset-cache.ts +195 -0
- package/src/services/voice/streaming-asr/streaming-pipeline-adapter.ts +292 -0
- package/src/services/voice/system-audio-sink.test.ts +29 -0
- package/src/services/voice/system-audio-sink.ts +366 -0
- package/src/services/voice/transcriber.asr-backend.test.ts +76 -0
- package/src/services/voice/transcriber.test.ts +392 -0
- package/src/services/voice/transcriber.ts +704 -0
- package/src/services/voice/transcript-knowledge.test.ts +68 -0
- package/src/services/voice/transcript-knowledge.ts +75 -0
- package/src/services/voice/transcript-service.test.ts +195 -0
- package/src/services/voice/transcript-service.ts +205 -0
- package/src/services/voice/transcript-store.test.ts +189 -0
- package/src/services/voice/transcript-store.ts +164 -0
- package/src/services/voice/turn-controller.test.ts +575 -0
- package/src/services/voice/turn-controller.ts +596 -0
- package/src/services/voice/types.ts +699 -0
- package/src/services/voice/vad.test.ts +498 -0
- package/src/services/voice/vad.ts +832 -0
- package/src/services/voice/vad.v1-v4.test.ts +222 -0
- package/src/services/voice/voice-budget.test.ts +415 -0
- package/src/services/voice/voice-budget.ts +635 -0
- package/src/services/voice/voice-duet.test.ts +375 -0
- package/src/services/voice/voice-emotion-classifier.test.ts +210 -0
- package/src/services/voice/voice-emotion-classifier.ts +273 -0
- package/src/services/voice/voice-hardening.fuzz.test.ts +116 -0
- package/src/services/voice/voice-preload-predictor.test.ts +130 -0
- package/src/services/voice/voice-preload-predictor.ts +113 -0
- package/src/services/voice/voice-preset-format.fuzz.test.ts +89 -0
- package/src/services/voice/voice-preset-format.test.ts +75 -0
- package/src/services/voice/voice-preset-format.ts +713 -0
- package/src/services/voice/voice-preset-generator.test.ts +89 -0
- package/src/services/voice/voice-profile-artifact.test.ts +138 -0
- package/src/services/voice/voice-profile-artifact.ts +518 -0
- package/src/services/voice/voice-profile-routes.test.ts +429 -0
- package/src/services/voice/voice-profile-routes.ts +425 -0
- package/src/services/voice/voice-scenario.test.ts +159 -0
- package/src/services/voice/voice-scenario.ts +280 -0
- package/src/services/voice/voice-scenario.turn-helpers.test.ts +77 -0
- package/src/services/voice/voice-state-machine.ts +727 -0
- package/src/services/voice/voice-workbench-report.test.ts +168 -0
- package/src/services/voice/voice-workbench-report.ts +367 -0
- package/src/services/voice/voice-workbench.test.ts +158 -0
- package/src/services/voice/voice.test.ts +1070 -0
- package/src/services/voice/wake-word-ggml.ts +319 -0
- package/src/services/voice/wake-word.test.ts +298 -0
- package/src/services/voice/wake-word.ts +554 -0
- package/src/services/voice/wav-codec.fuzz.test.ts +59 -0
- package/src/services/voice/wav-codec.test.ts +32 -0
- package/src/services/voice/wav-codec.ts +101 -0
- package/src/services/voice/workbench-entrypoint.test.ts +55 -0
- package/src/services/voice/workbench-entrypoint.ts +88 -0
- package/src/services/voice/workbench-headless-runner.test.ts +162 -0
- package/src/services/voice/workbench-headless-runner.ts +396 -0
- package/src/services/voice/workbench-logic-services.test.ts +225 -0
- package/src/services/voice/workbench-logic-services.ts +184 -0
- package/src/services/voice/workbench-real-services.ts +629 -0
- package/src/services/voice/workbench-scenarios.ts +407 -0
- package/src/services/voice/wrap-with-first-line-cache.ts +267 -0
- package/src/services/voice-model-updater.ts +724 -0
- package/src/services/voice-prewarm.ts +51 -0
- package/src/voice-workbench.ts +71 -0
@@ -0,0 +1,148 @@
+import { existsSync, readFileSync } from "node:fs";
+import { join } from "node:path";
+import { fileURLToPath } from "node:url";
+import { describe, expect, it } from "vitest";
+import { decodeMonoPcm16Wav, encodeMonoPcm16Wav } from "./engine-bridge";
+
+/**
+ * Real-audio coverage for the front of the single FFI pipe: the WAV → PCM
+ * decoder every transcription path feeds (`decodeMonoPcm16Wav`), exercised
+ * against the committed real WAV files rather than synthetic in-memory buffers.
+ *
+ * Two real corpora:
+ *   1. `native/verify/asr_bench_fixtures/non_publish_structure_5utt/` — five
+ *      committed mono 16 kHz PCM16 WAVs (deterministic tones, NOT speech — see
+ *      the corpus manifest; valid for decode/codec validation, NOT for WER).
+ *   2. `native/omnivoice.cpp/examples/freeman.wav` — a real 22.05 kHz speech
+ *      recording. Lives in a git submodule, so the block is skipped when the
+ *      submodule isn't checked out.
+ */
+
+const FIXTURE_DIR = fileURLToPath(
+  new URL(
+    "../../../native/verify/asr_bench_fixtures/non_publish_structure_5utt/",
+    import.meta.url,
+  ),
+);
+
+interface FixtureManifest {
+  realRecorded: boolean;
+  files: Array<{
+    id: string;
+    reference: string;
+    wav: string;
+    txt: string;
+    sampleRateHz: number;
+  }>;
+}
+
+const manifest = JSON.parse(
+  readFileSync(join(FIXTURE_DIR, "manifest.json"), "utf8"),
+) as FixtureManifest;
+
+/**
+ * Every decoded sample must be a finite mono amplitude in [-1, 1]. Scan in a
+ * plain loop and assert the AGGREGATE once — a per-sample `expect()` over a
+ * multi-second clip (freeman.wav is ~380k samples) is pathologically slow and
+ * trips the test timeout under load.
+ */
+function assertInRangePcm(pcm: Float32Array): void {
+  expect(pcm.length).toBeGreaterThan(0);
+  let allFinite = true;
+  let maxAbs = 0;
+  for (let i = 0; i < pcm.length; i++) {
+    const s = pcm[i] ?? Number.NaN;
+    if (!Number.isFinite(s)) {
+      allFinite = false;
+      break;
+    }
+    const abs = Math.abs(s);
+    if (abs > maxAbs) maxAbs = abs;
+  }
+  expect(allFinite).toBe(true);
+  expect(maxAbs).toBeLessThanOrEqual(1);
+}
+
+describe("decodeMonoPcm16Wav — committed fixture corpus (real WAV files)", () => {
+  it("decodes every fixture to in-range mono PCM at the manifest sample rate", () => {
+    expect(manifest.files.length).toBeGreaterThanOrEqual(5);
+    for (const f of manifest.files) {
+      const bytes = new Uint8Array(readFileSync(join(FIXTURE_DIR, f.wav)));
+      const { pcm, sampleRate } = decodeMonoPcm16Wav(bytes);
+
+      expect(sampleRate).toBe(f.sampleRateHz);
+      assertInRangePcm(pcm);
+
+      const durationMs = (1000 * pcm.length) / sampleRate;
+      expect(durationMs).toBeGreaterThan(0);
+      expect(Number.isFinite(durationMs)).toBe(true);
+
+      // Corpus integrity: the sidecar .txt matches the manifest reference
+      // (these are the references a real-speech replacement corpus must hit).
+      const txt = readFileSync(join(FIXTURE_DIR, f.txt), "utf8").trim();
+      expect(txt).toBe(f.reference);
+    }
+  });
+
+  it("round-trips decode → encode → decode losslessly (PCM16 codec)", () => {
+    const first = manifest.files[0];
+    expect(first).toBeDefined();
+    const bytes = new Uint8Array(
+      readFileSync(
+        join(FIXTURE_DIR, (first as FixtureManifest["files"][0]).wav),
+      ),
+    );
+    const a = decodeMonoPcm16Wav(bytes);
+    const reencoded = encodeMonoPcm16Wav(a.pcm, a.sampleRate);
+    const b = decodeMonoPcm16Wav(reencoded);
+
+    expect(b.sampleRate).toBe(a.sampleRate);
+    expect(b.pcm.length).toBe(a.pcm.length);
+    // PCM16 → float → PCM16 is exact (the float values are k/0x8000).
+    for (let i = 0; i < a.pcm.length; i++) {
+      expect(b.pcm[i]).toBeCloseTo(a.pcm[i] ?? 0, 6);
+    }
+  });
+
+  it("documents that the fixture corpus is non-speech (not WER evidence)", () => {
+    // Guards against anyone treating these tones as ASR ground truth.
+    expect(manifest.realRecorded).toBe(false);
+  });
+});
+
+const FREEMAN_WAV = fileURLToPath(
+  new URL(
+    "../../../native/omnivoice.cpp/examples/freeman.wav",
+    import.meta.url,
+  ),
+);
+const hasFreeman = existsSync(FREEMAN_WAV);
+const describeFreeman = hasFreeman ? describe : describe.skip;
+
+describeFreeman(
+  "decodeMonoPcm16Wav — freeman.wav (real 22.05 kHz speech)",
+  () => {
+    it("decodes to several seconds of bipolar in-range speech PCM", () => {
+      const bytes = new Uint8Array(readFileSync(FREEMAN_WAV));
+      const { pcm, sampleRate } = decodeMonoPcm16Wav(bytes);
+
+      expect(sampleRate).toBe(22_050);
+      assertInRangePcm(pcm);
+
+      // Real speech is bipolar, not silence or a DC tone.
+      let min = Number.POSITIVE_INFINITY;
+      let max = Number.NEGATIVE_INFINITY;
+      for (let i = 0; i < pcm.length; i++) {
+        const s = pcm[i] ?? 0;
+        if (s < min) min = s;
+        if (s > max) max = s;
+      }
+      expect(min).toBeLessThan(0);
+      expect(max).toBeGreaterThan(0);
+
+      const durationSec = pcm.length / sampleRate;
+      expect(durationSec).toBeGreaterThan(1);
+      expect(durationSec).toBeLessThan(60);
+    });
+  },
+);
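
The round-trip test above relies on the comment that PCM16 → float → PCM16 is exact because the float values are k/0x8000. A minimal sketch of why that holds, assuming a decoder that divides by 0x8000 and an encoder that multiplies by 0x8000, rounds, and clamps. `pcm16ToFloat` and `floatToPcm16` are hypothetical helpers written for illustration; they are not the package's `decodeMonoPcm16Wav` / `encodeMonoPcm16Wav`, whose internals are not shown in this diff.

```typescript
// Hypothetical helpers for illustration only (not the package's codec).

// Int16 sample k maps to k / 32768. Dividing a 16-bit integer by a power of
// two is exactly representable in float32, so no precision is lost here.
function pcm16ToFloat(samples: Int16Array): Float32Array {
  const out = new Float32Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    out[i] = (samples[i] ?? 0) / 0x8000;
  }
  return out;
}

// Assumed inverse: scale back up, round, and clamp to the int16 range
// (+1.0 would otherwise overflow to 32768).
function floatToPcm16(pcm: Float32Array): Int16Array {
  const out = new Int16Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) {
    const scaled = Math.round((pcm[i] ?? 0) * 0x8000);
    out[i] = Math.max(-0x8000, Math.min(0x7fff, scaled));
  }
  return out;
}

// Usage: every int16 value survives the float detour unchanged.
const original = Int16Array.from([-32768, -12345, -1, 0, 1, 12345, 32767]);
const roundTripped = floatToPcm16(pcm16ToFloat(original));
const lossless = original.every((k, i) => k === roundTripped[i]);
```

Under these assumptions the only lossy case is a float input of exactly +1.0, which clamps to 32767; values produced by the decoder never reach +1.0, so the decode → encode → decode chain stays exact.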
@@ -0,0 +1,141 @@
+# Voice Workbench (#8785) — Capability Assessment & Evidence Map
+
+This is the human/AI-reviewable map of the elizaOS voice-assistant capability: what exists, what this work added, what is **CI-proven** vs **hardware/credential-gated**, and the recommendations. Pair with [VOICE_PIPELINE_RESEARCH_2026.md](./VOICE_PIPELINE_RESEARCH_2026.md) (the evidence base) and [../VOICE_WORKBENCH.md](../VOICE_WORKBENCH.md) (how to run it).
+
+Legend: **✅ PROVEN** = verified by a CI-runnable test/lane (no models, no network). **🟡 GATED** = real code exists but verification needs hardware/models/credentials. **🔵 DESIGN** = decision logic + tests exist; runtime wiring is a follow-up.
+
+---
+
+## 1. The pipeline already in place (recon findings)
+
+elizaOS ships a deep, real voice pipeline in `plugins/plugin-local-inference/src/services/voice/` (~240 files). The honest baseline:
+
+| Subsystem | Implementation | Status |
+|---|---|---|
+| VAD | Silero v5.1.2 GGML, 2-tier (RMS gate + model), 32 ms hop; configurable onset 0.5 / offset 0.35 / end-hangover 700 ms / pause-hangover 100 ms | 🟡 GATED (native FFI) |
+| EOT | Heuristic (`@elizaos/shared/voice-eot`) + Eliza1 (`<im_end>` prob) + Composite + LiveKit GGUF; early-commit at P≥0.9, tentative at P≥0.6 | partial ✅ (heuristic) / 🟡 (model) |
+| Barge-in | `BargeInController` + `voice-state-machine` (C1 checkpoint), 600 ms words-grace, AbortSignal hard-stop | 🟡 GATED |
+| Optimistic gen | `optimistic-policy` (battery-aware) + `optimistic-rollback` (C7 prefill) | 🟡 GATED |
+| Wake-word | openWakeWord GGML, 80 ms frames, "hey eliza" head (real head v0.3.0 published; placeholder pending bundle ship) | 🟡 GATED |
+| Speaker encoder | WeSpeaker ResNet34-LM INT8, 256-dim, L2-norm, cosine, match threshold **0.78** | 🟡 GATED |
+| Diarization | pyannote-segmentation-3.0 INT8, 5 s window, 7-class powerset | 🟡 GATED |
+| Entity binding | `VOICE_TURN_OBSERVED` → merge engine → `VOICE_ENTITY_BOUND`; `IDENTIFY_SPEAKER` action; `speakerEntityId` on VOICE_DM | ✅ (event seam tested) |
+| Echo/respond gate | word-overlap echo guard (9 s / 70 %) + disfluency filter + bystander suppression + wake-word override | ✅ PROVEN (now consolidated) |
+| Owner enrollment | first-run voice routes (`/api/voice/first-run/*`) write the owner entity | 🟡 GATED |
+| Routing | local/cloud STT+TTS; hybrid (local TTS + cloud STT on mobile) documented + unit-tested | ✅ (selection) / 🟡 (live) |
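
The VAD row above lists an onset of 0.5, an offset of 0.35, and a 700 ms end-hangover at a 32 ms hop. A minimal sketch of how such hysteresis thresholds could segment a stream of per-frame speech probabilities, assuming the hangover means "continuous time below the offset threshold before speech is declared over". `segmentSpeech` and its types are hypothetical names for illustration, not the package's `vad.ts` API, and the separate pause-hangover and RMS gate are left out.

```typescript
// Hypothetical config shape; the values used below are taken from the table row.
interface VadHysteresisConfig {
  onset: number; // probability at or above which speech starts
  offset: number; // probability below which a frame counts as quiet
  endHangoverMs: number; // continuous quiet time required to end speech
  hopMs: number; // duration covered by one probability frame
}

interface VadEvent {
  kind: "speech-start" | "speech-end";
  atMs: number;
}

function segmentSpeech(probs: number[], cfg: VadHysteresisConfig): VadEvent[] {
  const events: VadEvent[] = [];
  let speaking = false;
  let quietMs = 0;
  for (let i = 0; i < probs.length; i++) {
    const p = probs[i] ?? 0;
    const atMs = i * cfg.hopMs;
    if (!speaking) {
      if (p >= cfg.onset) {
        speaking = true;
        quietMs = 0;
        events.push({ kind: "speech-start", atMs });
      }
      continue;
    }
    if (p < cfg.offset) {
      // Quiet frame: accumulate hangover time.
      quietMs += cfg.hopMs;
      if (quietMs >= cfg.endHangoverMs) {
        speaking = false;
        events.push({ kind: "speech-end", atMs });
      }
    } else {
      // Anything at or above the offset threshold keeps the segment alive.
      quietMs = 0;
    }
  }
  return events;
}

// Usage: 2 quiet frames, 5 speech frames, 1 in-between frame, 22 quiet frames.
const cfgFromTable: VadHysteresisConfig = {
  onset: 0.5,
  offset: 0.35,
  endHangoverMs: 700,
  hopMs: 32,
};
const frameProbs: number[] = [
  0.1,
  0.1,
  ...new Array<number>(5).fill(0.9),
  0.4,
  ...new Array<number>(22).fill(0.1),
];
const vadEvents = segmentSpeech(frameProbs, cfgFromTable);
```

The gap between onset and offset is what keeps a frame at 0.4 from ending the segment: it is too low to start speech but not low enough to count toward the hangover.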
+
+**Gaps this work closed:** the `--real` workbench lane was hollow (mock echoed ground truth → circular); there was no robustness corpus, no echo-rejection scorer/scenario, no owner-vs-intruder scenario, and no autonomous owner inference. Acoustic echo cancellation (AEC3-style) is **still MISSING** at the PCM level — see §6.
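
The table above describes the transcript-level echo guard as a word-overlap check with a 9 s window and a 70 % threshold. A minimal sketch of that idea, assuming overlap is measured as the fraction of heard words that also occur in a recent agent utterance. `looksLikeEcho`, `AgentUtterance`, and the tokenizer are hypothetical and written for illustration; the shipped gate in `@elizaos/shared/voice/respond-gate` may tokenize and score differently.

```typescript
// Hypothetical record of something the agent recently spoke via TTS.
interface AgentUtterance {
  text: string;
  endedAtMs: number;
}

// Assumed tokenizer: lowercase ASCII words, digits, and apostrophes.
function tokenizeWords(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9']+/g) ?? [];
}

function looksLikeEcho(
  transcript: string,
  heardAtMs: number,
  recent: AgentUtterance[],
  windowMs = 9000, // "9 s" from the table
  minOverlap = 0.7, // "70 %" from the table
): boolean {
  const words = tokenizeWords(transcript);
  if (words.length === 0) return false;
  return recent.some((utterance) => {
    const ageMs = heardAtMs - utterance.endedAtMs;
    if (ageMs < 0 || ageMs > windowMs) return false;
    const spoken = new Set(tokenizeWords(utterance.text));
    const hits = words.filter((w) => spoken.has(w)).length;
    return hits / words.length >= minOverlap;
  });
}

// Usage: the agent finished speaking at t = 1000 ms.
const agentSaid: AgentUtterance[] = [
  { text: "The weather today is sunny and warm.", endedAtMs: 1000 },
];
const echoed = looksLikeEcho("the weather today is sunny", 3000, agentSaid);
const genuine = looksLikeEcho("what is the weather tomorrow", 3000, agentSaid);
const tooLate = looksLikeEcho("the weather today is sunny", 11000, agentSaid);
```

A guard like this only sees text, so an echo that the recognizer mis-transcribes can slip through; that limitation is independent of the missing PCM-level cancellation noted above.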
+
+---
+
+## 2. What this work added (all ✅ PROVEN, CI-runnable, no models)
+
+1. **Robustness corpus DSP** (`corpus-augment.ts`, 19 tests): seeded, deterministic additive noise (white/pink at a target SNR), Freeverb reverb, far-field attenuation, telephone/low-quality line (band-limit + µ-law), background talkers. Wired into the corpus generator via a per-turn / per-scenario `environment`.
+2. **Real-decision-logic lane** (`workbench-logic-services.ts`, `voice:workbench --logic`): runs the SHIPPED EOT heuristic + respond/echo/bystander/wake-word gate + name extraction over the corpus, instead of echoing ground truth. Genuinely suppresses a bystander, rejects the agent's echoed reply, and holds on a mid-utterance pause — asserted, not assumed.
+3. **Single source of truth** for the respond/echo gate (`@elizaos/shared/voice/respond-gate`): the UI client re-exports it, so the workbench tests exactly what ships. (21 UI tests still green.)
+4. **New scorers + report metrics**: echo-rejection rate, owner-vs-intruder accuracy, impostor-accept rate.
+5. **Owner inference** (`@elizaos/shared/voice/owner-inference`, 6 tests): `resolveOwnerCandidate` proposes the owner from who speaks most/most-confidently — only when evidence is sufficient AND unambiguous, else UNDECIDED. The decision logic an owner-detection provider/evaluator runs when no owner is enrolled. 🔵 wired into the workbench; runtime provider wiring is the follow-up.
+6. **New scenarios**: noisy-room, far-field-reverb, background-talkers, echo-self-trigger, owner-enrollment-inference, owner-vs-intruder.
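
Item 1 in the list above mentions seeded, deterministic additive noise at a target SNR. A minimal sketch of that technique, assuming uniform white noise from a small mulberry32-style generator and a gain chosen so that signal power divided by noise power equals 10^(SNR/10). `addWhiteNoiseAtSnr`, `seededRandom`, and `meanPower` are hypothetical names for illustration and are not taken from `corpus-augment.ts`.

```typescript
// Small seeded PRNG (mulberry32-style) returning values in [0, 1).
function seededRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function meanPower(x: Float32Array): number {
  let sum = 0;
  for (let i = 0; i < x.length; i++) {
    const s = x[i] ?? 0;
    sum += s * s;
  }
  return x.length > 0 ? sum / x.length : 0;
}

// Adds uniform white noise so that 10 * log10(Psignal / Pnoise) == snrDb.
// Note: the result is not clipped to [-1, 1]; a real pipeline may want that.
function addWhiteNoiseAtSnr(
  clean: Float32Array,
  snrDb: number,
  seed: number,
): Float32Array {
  const rand = seededRandom(seed);
  const noise = new Float32Array(clean.length);
  for (let i = 0; i < noise.length; i++) {
    noise[i] = rand() * 2 - 1;
  }
  const noisePower = meanPower(noise);
  const targetRatio = Math.pow(10, snrDb / 10);
  const gain =
    noisePower > 0 ? Math.sqrt(meanPower(clean) / (noisePower * targetRatio)) : 0;
  const out = new Float32Array(clean.length);
  for (let i = 0; i < out.length; i++) {
    out[i] = (clean[i] ?? 0) + gain * (noise[i] ?? 0);
  }
  return out;
}

// Usage: 100 ms of a 440 Hz tone at 16 kHz, degraded to 10 dB SNR.
const cleanTone = new Float32Array(1600);
for (let i = 0; i < cleanTone.length; i++) {
  cleanTone[i] = 0.5 * Math.sin((2 * Math.PI * 440 * i) / 16000);
}
const noisyA = addWhiteNoiseAtSnr(cleanTone, 10, 42);
const noisyB = addWhiteNoiseAtSnr(cleanTone, 10, 42);
```

Seeding the generator is what makes a corpus like this reproducible in CI: the same seed yields the same degraded clip, so scores can be compared against a committed baseline.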
+
+Lanes: `--mock` PASS (plumbing), `--logic` PASS (real decision logic, 12 scenarios), `--real` SKIPPED (honesty contract).
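
Item 5 of the list above describes owner inference as proposing the speaker who talks most and most confidently, but only when the evidence is sufficient and unambiguous. A minimal sketch of that decision shape follows. Everything here is an assumption made for illustration: `proposeOwner`, the evidence fields, the turns × confidence score, and the three thresholds are hypothetical and are not the signature or tuning of `resolveOwnerCandidate`.

```typescript
// Hypothetical per-speaker evidence accumulated over a session.
interface SpeakerEvidence {
  speakerId: string;
  turns: number; // attributed turns
  meanConfidence: number; // mean attribution confidence in [0, 1]
}

type OwnerDecision =
  | { kind: "candidate"; speakerId: string }
  | { kind: "undecided" };

// All default thresholds below are illustrative assumptions.
function proposeOwner(
  evidence: SpeakerEvidence[],
  minTurns = 5,
  minConfidence = 0.8,
  minLeadRatio = 2,
): OwnerDecision {
  const score = (e: SpeakerEvidence): number => e.turns * e.meanConfidence;
  const ranked = [...evidence].sort((a, b) => score(b) - score(a));
  const top = ranked[0];

  // Sufficiency: enough turns, attributed confidently enough.
  if (!top || top.turns < minTurns || top.meanConfidence < minConfidence) {
    return { kind: "undecided" };
  }

  // Unambiguity: the leader must clearly dominate the runner-up.
  const runnerUp = ranked[1];
  if (runnerUp && score(top) < minLeadRatio * score(runnerUp)) {
    return { kind: "undecided" };
  }
  return { kind: "candidate", speakerId: top.speakerId };
}

// Usage: one dominant speaker versus two evenly matched speakers.
const clearOwner = proposeOwner([
  { speakerId: "alice", turns: 12, meanConfidence: 0.9 },
  { speakerId: "bob", turns: 2, meanConfidence: 0.85 },
]);
const ambiguous = proposeOwner([
  { speakerId: "alice", turns: 6, meanConfidence: 0.9 },
  { speakerId: "bob", turns: 5, meanConfidence: 0.9 },
]);
```

Returning an explicit undecided result instead of a best guess keeps the caller from enrolling an owner on thin evidence, which mirrors the "else UNDECIDED" behavior the list describes.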
|
|
41
|
+
|
|
42
|
+
---
|
|
43
|
+
|
|
44
|
+
## 3. #8785 acceptance criteria
|
|
45
|
+
|
|
46
|
+
| AC | Status | Evidence |
|
|
47
|
+
|---|---|---|
|
|
48
|
+
| VoiceScenario schema + labeled corpus (multi-voice, pauses, respond/no, multi-speaker, entity, voice→entity, diarization, EOT, transcription, multi-agent, long-form **+ robustness, echo, owner-security, overlapping**) | ✅ | `voice-scenario.ts`, `workbench-scenarios.ts` (12 scenarios) |
|
|
49
|
+
| All scoring in one shared module; no duplicate WER | ✅ | `e2e-harness.ts` + `@elizaos/shared/voice-wer`; respond/echo now also single-source |
|
|
50
|
+
| Headless runner over real services + scenario-runner `voice` turn kind | ✅ | `workbench-headless-runner.ts`, `packages/scenario-runner/src/voice-turn.ts` |
|
|
51
|
+
| Headful scenario player + per-turn DOM verdict + specs per class | ✅ (mocked) | `VoiceWorkbenchShell`, 10 `voice-workbench-*.spec.ts` |
|
|
52
|
+
| Single `voice:workbench` JSON+MD report with baselines | ✅ | `voice-workbench-report.ts`, `scripts/voice-workbench.ts` |
|
|
53
|
+
| CI: mocked always, real where provisioned, `skipped` (never `pass`) when absent | ✅ | `--mock`/`--logic` run+pass; `--real` skips |
|
|
54
|
+
| Multi-agent room ≥3 participants who-responds | ✅ | `multi-agent-room-address` |
|
|
55
|
+
| README documents the consolidation | ✅ | `VOICE_WORKBENCH.md` |
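
For readers unfamiliar with the WER metric that the table says is scored in one shared module, here is a minimal sketch of the standard definition (word-level edit distance divided by reference length). It is not the code in `@elizaos/shared/voice-wer`; the ASCII-only `tokenize` normalization and the empty-reference convention are assumptions.

```typescript
// Lowercase and split into words. The normalization policy is an assumption.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9'\s]/g, " ")
    .split(/\s+/)
    .filter(Boolean);
}

// WER = (substitutions + deletions + insertions) / reference word count,
// computed with a word-level Levenshtein distance.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  // Assumed convention: an empty reference scores 0 if the hypothesis is empty too, else 1.
  if (ref.length === 0) return hyp.length === 0 ? 0 : 1;
  // prev[j] holds the edit distance between ref[0..i-1) and hyp[0..j).
  let prev: number[] = [];
  for (let j = 0; j <= hyp.length; j++) prev.push(j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}
```

Because insertions count as errors, WER can exceed 1 when the hypothesis is much longer than the reference.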
|
|
56
|
+
|
|
57
|
+
**#8785 is closeable for the workbench scope and for the decision-logic of local + cloud.** The remaining lane — real acoustic models on degraded audio (real WER/DER/EOT-latency, and the live cloud STT/TTS round-trip) — is wired and gated, see §5.
|
|
58
|
+
|
|
59
|
+
---
|
|
60
|
+
|
|
61
|
+
## 4. The user's expanded questions — answered
|
|
62
|
+
|
|
63
|
+
| Question | Answer | Status |
|
|
64
|
+
|---|---|---|
|
|
65
|
+
| Ideal pause lengths? | ~200 ms mean inter-turn gap (mode ≈ 0 ms); with a semantic EOT model use **200 ms** end-hangover, fixed-VAD **500 ms**, max-wait **3000 ms**. The pipeline currently ships a 700 ms hangover with semantic early-commit at P≥0.9, versus the research-recommended P≥0.7. | research §1; tune ticket |
|
|
66
|
+
| Optimistic-but-abortable generation? | `optimistic-policy` + `optimistic-rollback` (C1 checkpoint, C7 prefill, battery-aware) | 🟡 GATED, exists |
|
|
67
|
+
| Reject the agent's own voice from TTS? | Two integrated layers: (1) transcript word-overlap echo gate (9 s/70 %), (2) **acoustic self-voice rejection** — `selfVoiceSimilarity ≥ 0.7` vs the agent's TTS imprint hard-suppresses even past the wake word, catching a mis-transcribed echo the transcript guard misses. Both scored (echo-rejection 1.0; `echo-mistranscribed` scenario). PCM-level AEC3 + the embedding wiring still gated — see §6. | ✅ (decision) / 🟡 (audio-frame) |
|
|
68
|
+
| Reverb / low-quality / near-far / noise / background talkers? | `corpus-augment.ts` models all of them deterministically; scenarios assert the decision still holds | ✅ corpus; 🟡 real-model robustness |
|
|
69
|
+
| Interrupting / overlapping voices? | `overlapping-speech` class + background-talkers mixing; barge-in controller exists | ✅ corpus / 🟡 live |
|
|
70
|
+
| Speaker recognition & continuity to cancel others? | bystander suppression (confidence ≥0.7, not enrolled, no wake word), scored; continuity via WeSpeaker centroids | ✅ gate / 🟡 acoustic |
|
|
71
|
+
| Detect the user's voice? | WeSpeaker 256-d centroid, cosine ≥0.78 match, Welford online update | 🟡 GATED |
|
|
72
|
+
| Diarize multiple people → entities, extract names, merge? | pyannote diarizer + `VOICE_TURN_OBSERVED`→merge-engine→`VOICE_ENTITY_BOUND`; name extraction scored in `--logic` | ✅ seam / 🟡 acoustic |
|
|
73
|
+
| How do we know the owner? provider/evaluator when unsure? | `resolveOwnerCandidate` — exactly this logic, undecided until sufficient+unambiguous | 🔵 logic ✅; provider wiring TODO |
|
|
74
|
+
| Owner vs intruder (security)? | `owner-vs-intruder` scenario: impostor gated out (impostor-accept 0); research: FAR ≤0.1 % + ≥3 s utterance for sensitive actions | ✅ gate / 🟡 verification |
|
|
75
|
+
| Wake word "hey eliza"? | openWakeWord GGML head (real v0.3.0 published; bundle ship pending); `--logic` tests the "hey eliza" phrase override | ✅ phrase / 🟡 acoustic head |
|
|
76
|
+
| Mix local STT/TTS + fast cloud LLM? | documented + unit-tested routing; latency math: **~300–400 ms TTFA from end-of-speech** (local STT + Cerebras LLM + local Kokoro TTS) | research §7; 🟡 live |
|
|
77
|
+
| Qwen / Gemma / CoreML / TPU / eliza-1? | Qwen3-ASR (elizaOS ASR), Kokoro TTS, Gemma 3n audio-in, Apple ANE / Tensor G5 on-device; omni models cloud-only | research §6 |
|
|
78
|
+
| VAD? | Silero v5 2-tier; defaults per research §2 | 🟡 GATED |
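
The transcript word-overlap echo gate from the table ("9 s/70 %") can be illustrated with a small sketch. The interpretation is an assumption: a transcript is treated as the agent's own echo when it arrives within 9 s of the agent's last utterance and at least 70 % of its words appear in that utterance. The names `AgentUtterance`, `words`, and `isLikelyEcho` are hypothetical and do not come from `@elizaos/shared/voice/respond-gate`.

```typescript
interface AgentUtterance {
  text: string;
  endedAtMs: number; // when the agent finished speaking
}

const ECHO_WINDOW_MS = 9_000; // "9 s" from the table row
const ECHO_OVERLAP = 0.7; // "70 %" from the table row

function words(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9']+/).filter(Boolean);
}

// True when the transcript looks like the agent's own reply coming back through the mic.
function isLikelyEcho(
  transcript: string,
  lastAgent: AgentUtterance | null,
  nowMs: number,
): boolean {
  if (!lastAgent || nowMs - lastAgent.endedAtMs > ECHO_WINDOW_MS) return false;
  const heard = words(transcript);
  if (heard.length === 0) return false;
  const spoken = new Set(words(lastAgent.text));
  const overlap = heard.filter((w) => spoken.has(w)).length / heard.length;
  return overlap >= ECHO_OVERLAP;
}
```

A guard like this cannot catch a mis-transcribed echo, because the words no longer overlap; that gap is what the acoustic self-voice layer in the same row is meant to cover.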
|
|
79
|
+
|
|
80
|
+
---
|
|
81
|
+
|
|
82
|
+
## 5. The real lanes — NOW RUN (with the provided keys + staged artifacts)
|
|
83
|
+
|
|
84
|
+
Given a funded ElevenLabs/Cerebras key and the staged fused dylib + GGUF bundle,
|
|
85
|
+
the previously-gated lanes were executed for real on macOS (Metal). Evidence:
|
|
86
|
+
`.github/issue-evidence/8785-voice-real-cloud/`.
|
|
87
|
+
|
|
88
|
+
- ✅ **Real on-device ASR** — `eliza-1-asr` GGUF via the fused `libelizainference.dylib` on the **Metal GPU** transcribes real speech (WER 0). The bundle also ships real Kokoro/OmniVoice TTS, pyannote diarizer, WeSpeaker encoder, turn-detector.
|
|
89
|
+
- ✅ **Live cloud STT/TTS**: ElevenLabs `eleven_turbo_v2_5` TTS + `scribe_v1` STT round-trip, **WER 0** (the cloud `/api/v1/voice/*` routes wrap this; the earlier HTTP 402 came from a free-plan key, and a funded key works).
|
|
90
|
+
- ✅ **Mixed local + cloud**: cloud TTS → LOCAL ASR → Cerebras LLM → cloud TTS, **~770–870 ms** end-to-end (straddling the research <800 ms "good" bar rather than sitting fully inside it).
|
|
91
|
+
- ✅ **Real ASR WER under degradation**: **WER 0** holds across every realistic corpus-DSP condition (noise down to 0 dB SNR, reverb to 0.98, far-field, telephone, harsh); accuracy degrades gracefully past that edge and fails completely only on "destroyed" audio, which confirms the DSP genuinely degrades the signal.
|
|
92
|
+
- ✅ **Real speaker recognition** (WeSpeaker 256-d) — same-speaker cosine ~0.72 vs different-speaker ~0.15: an intruder is far below the 0.78 imprint threshold → rejected. Backs owner-vs-other, "detect the user's voice", and continuity, with real models.
|
|
93
|
+
- ✅ **Real diarization** (pyannote) — ≥2 speakers detected in a 5 s two-speaker window. ✅ **Real VAD** (Silero) — speech 1.000 vs silence 0.009. ✅ **Real on-device TTS** — 3.9 s synthesized in ~3 s.
|
|
94
|
+
|
|
95
|
+
**Still gated (genuinely external):** a **physical iOS device** (the simulator has no Metal, so on-device inference can't run there — needs Apple ID provisioning); and the `.mjs` diarizer/speaker-encoder benchmark harnesses need `-fp32` model variants + a separate classifier lib (the GGUFs are present; the ASR + fused-lib + Metal path is proven). Honesty contract unchanged: a lane reports **`skipped`, never `pass`**, when its artifact is absent.
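
The honesty contract stated above (a lane reports `skipped`, never `pass`, when its artifact is absent) amounts to checking preconditions before the lane body runs. A minimal sketch, assuming hypothetical `LaneResult` and `runLane` names that are not taken from the workbench code:

```typescript
type LaneStatus = "pass" | "fail" | "skipped";

interface LaneResult {
  lane: string;
  status: LaneStatus;
  reason?: string;
}

// `missingArtifacts` lists whatever the lane needs but could not find
// (for example a model file or an API key). `run` is only invoked when nothing is missing.
function runLane(lane: string, missingArtifacts: string[], run: () => boolean): LaneResult {
  if (missingArtifacts.length > 0) {
    return { lane, status: "skipped", reason: `missing: ${missingArtifacts.join(", ")}` };
  }
  return { lane, status: run() ? "pass" : "fail" };
}
```

Keeping `skipped` as a distinct status means a report can never count an unprovisioned lane as green.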
|
|
96
|
+
|
|
97
|
+
**Headful desktop/web — NOW PROVEN (2026-06-22).** The full headful matrix runs
|
|
98
|
+
green with recorded A/V: **`13 passed (5.3m)`** for `voice-*.spec.ts` (Chromium,
|
|
99
|
+
`E2E_RECORD`) — the real-mic round-trip (`getUserMedia` + injected audio → real
|
|
100
|
+
local-ASR → agent → `/api/tts/cloud`, WER 0), the STT→agent→TTS self-test, and
|
|
101
|
+
10 voice-workbench scenario specs. A 14-agent adversarial review confirmed all
|
|
102
|
+
13 are genuine passes, **0 false-green** (real ASR latency, WER-0 transcript
|
|
103
|
+
match, correct negative-path non-responses). Evidence (13 screenshots + 3
|
|
104
|
+
round-trip videos + manifest): `.github/issue-evidence/8785-voice-headful/`.
|
|
105
|
+
*(An earlier run failed when it raced a concurrent, unrelated `AppContext.tsx`
|
|
106
|
+
mid-refactor — "autonomousEvents is not defined", which broke every ui-smoke
|
|
107
|
+
test, not just voice; once the working tree stabilized the voice matrix passed
|
|
108
|
+
unchanged. The voice work here never depended on it.)*
|
|
109
|
+
|
|
110
|
+
---
|
|
111
|
+
|
|
112
|
+
## 6. Open recommendation: PCM-level acoustic echo cancellation
|
|
113
|
+
|
|
114
|
+
Self-echo now has TWO decision layers (both integrated + scored): the transcript
|
|
115
|
+
word-overlap guard, and **acoustic self-voice rejection** (`selfVoiceSimilarity ≥
|
|
116
|
+
AGENT_SELF_VOICE_THRESHOLD` vs the agent's TTS imprint → hard-suppress). What
|
|
117
|
+
remains is the audio-frame plumbing + the cheap half-duplex layer (research §3):
|
|
118
|
+
1. **`agentSpeaking` flag + ~1.5 s post-TTS cooldown with a raised RMS gate** — cheap, robust, no new model. *(Effective half-duplex; ship first.)*
|
|
119
|
+
2. **WebRTC AEC3 with a time-aligned playback reference**, interrupt detection off the linear-filter output — true barge-in.
|
|
120
|
+
3. **Wire `selfVoiceSimilarity`**: imprint the agent's TTS voice (we already have the WeSpeaker encoder) and feed the live cosine into the gate's already-built self-voice branch. **Measured (real, `agentvoice:real`):** the agent's on-device TTS voice scores **higher against its own imprint (~0.37) than human voices do (~0.15 / −0.13)**, a clear, rejectable margin (~0.22 over the closest human score). Within-agent consistency is modest, though: real human voices cluster at ~0.72 against themselves. So imprint the agent from a **centroid over many utterances** (not a single clip) and use a **lower, agent-specific threshold** combined with the `agentSpeaking` timing gate, rather than the 0.78 human-enrollment bar.
|
|
121
|
+
|
|
122
|
+
Track as a follow-up issue; the workbench `echo-rejection` scorer (incl. the
|
|
123
|
+
mis-transcribed case) is ready to gate it.
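
Recommendation 1 above (an `agentSpeaking` flag plus a ~1.5 s post-TTS cooldown with a raised RMS gate) could look roughly like the sketch below. The class name `HalfDuplexGate`, its methods, and both RMS values are assumptions for illustration; only the 1.5 s cooldown comes from the text.

```typescript
const COOLDOWN_MS = 1_500; // "~1.5 s post-TTS cooldown" from the text
const BASE_RMS_GATE = 0.01; // assumed normal energy gate
const RAISED_RMS_GATE = 0.05; // assumed raised gate while guarded

function rms(frame: Float32Array): number {
  if (frame.length === 0) return 0;
  let sum = 0;
  for (let i = 0; i < frame.length; i++) sum += frame[i] * frame[i];
  return Math.sqrt(sum / frame.length);
}

class HalfDuplexGate {
  private agentSpeaking = false;
  private ttsEndedAtMs = Number.NEGATIVE_INFINITY;

  onTtsStart(): void {
    this.agentSpeaking = true;
  }

  onTtsEnd(nowMs: number): void {
    this.agentSpeaking = false;
    this.ttsEndedAtMs = nowMs;
  }

  // True when the mic frame should be forwarded to VAD/ASR.
  accepts(frame: Float32Array, nowMs: number): boolean {
    const guarded = this.agentSpeaking || nowMs - this.ttsEndedAtMs < COOLDOWN_MS;
    return rms(frame) >= (guarded ? RAISED_RMS_GATE : BASE_RMS_GATE);
  }
}
```

Raising the gate instead of muting the mic outright still lets a loud, close talker through during playback, which is why the text calls this effective half-duplex rather than strict half-duplex.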
|
|
124
|
+
|
|
125
|
+
---
|
|
126
|
+
|
|
127
|
+
## 7. How to verify (commands)
|
|
128
|
+
|
|
129
|
+
```bash
|
|
130
|
+
# Real decision logic over the full scenario matrix (no models, no network):
|
|
131
|
+
bun run --cwd plugins/plugin-local-inference voice:workbench -- --logic
|
|
132
|
+
|
|
133
|
+
# Unit suites:
|
|
134
|
+
bun run --cwd plugins/plugin-local-inference test -- src/services/voice/corpus-augment.test.ts \
|
|
135
|
+
src/services/voice/workbench-logic-services.test.ts
|
|
136
|
+
bun run --cwd packages/shared test -- src/voice/owner-inference.test.ts
|
|
137
|
+
bun run --cwd packages/ui test -- src/voice/should-respond.test.ts src/voice/voice-turn-signal.test.ts
|
|
138
|
+
|
|
139
|
+
# Gated real-model lane (skips cleanly without artifacts):
|
|
140
|
+
bun run --cwd plugins/plugin-local-inference voice:workbench -- --real
|
|
141
|
+
```
|
|
@@ -0,0 +1,117 @@
|
|
|
1
|
+
# Voice-Assistant Pipeline Research Brief (2026)
|
|
2
|
+
|
|
3
|
+
Turn-taking, VAD, barge-in, speaker ID, on-device models, latency. Citation-backed engineering reference for the elizaOS voice pipeline (issue #8785). Vendor self-benchmarks are flagged; peer-reviewed and API-documentation numbers are treated as solid.
|
|
4
|
+
|
|
5
|
+
> This brief is the evidence base behind the numeric defaults and gating budgets in the Voice Workbench. Where a recommended default differs from what the pipeline currently ships, that is called out in [VOICE_8785_ASSESSMENT.md](./VOICE_8785_ASSESSMENT.md).
|
|
6
|
+
|
|
7
|
+
## 1. Pause / silence lengths for end-of-turn detection
|
|
8
|
+
|
|
9
|
+
**The linguistics baseline.** The canonical "~200 ms" figure comes from Stivers et al. 2009 (PNAS), a 10-language corpus of question→answer transitions. The actual statistics: cross-linguistic **mode ≈ 0 ms, median ≈ +100 ms, mean ≈ +208 ms** (the mean is pulled right by a long tail) — [PNAS / PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC2705608/). Per-language means span +7 ms (Japanese) to +469 ms (Danish); every language is unimodal with its modal gap between 0 and +200 ms — [PNAS](https://www.pnas.org/doi/10.1073/pnas.0903616106).
|
|
10
|
+
|
|
11
|
+
**The hard problem for VAD.** Minimal human response latency is ~200 ms; acoustic silence below ~120–180 ms isn't reliably perceived as a gap — [Heldner & Edlund 2010](https://www.sciencedirect.com/science/article/pii/S0095447010000628). Crucially, **intra-turn pauses (thinking mid-sentence) and inter-turn gaps (a real handoff) overlap heavily in the 200–500 ms band** — a fixed silence timer cannot tell them apart. Humans hit ~200 ms gaps despite >600 ms production latency, so they must be *predicting* turn ends — [PMC](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4464110/). This is what semantic end-of-turn (EOT) models replicate.
|
|
12
|
+
|
|
13
|
+
**What production systems do** (fixed-VAD silence thresholds cluster around 500 ms):
|
|
14
|
+
- **OpenAI Realtime** `server_vad`: `silence_duration_ms` **500 ms**, `prefix_padding_ms` **300 ms**, `threshold` **0.5**; `semantic_vad` adds an `eagerness` knob — [OpenAI VAD docs](https://developers.openai.com/api/docs/guides/realtime-vad).
|
|
15
|
+
- **LiveKit** turn-detector: `min_endpointing_delay` **500 ms**, `max_endpointing_delay` **3000 ms**; semantic model (Qwen2.5-0.5B) ~50–160 ms on CPU — [LiveKit](https://docs.livekit.io/agents/build/turns/turn-detector/).
|
|
16
|
+
- **Pipecat smart-turn-v2**: upstream VAD `stop_secs` set **short (0.2 s)** because the 94.8M-param model makes the real decision; inference 12 ms (GPU) – 410 ms (CPU) — [smart-turn-v2](https://huggingface.co/pipecat-ai/smart-turn-v2).
|
|
17
|
+
- **Deepgram**: `utterance_end_ms` recommended **≥ 1000 ms** — [Deepgram](https://developers.deepgram.com/docs/endpointing).
|
|
18
|
+
|
|
19
|
+
**The latency↔false-cutoff frontier** (LiveKit self-benchmark, 14 languages): holding premature cutoffs at **10%** costs ~295 ms mean latency; **5%** costs ~543 ms — [LiveKit](https://livekit.com/blog/solving-end-of-turn-detection). Roughly each halving of the cutoff rate costs ~250 ms.
|
|
20
|
+
|
|
21
|
+
**Recommended.** Minimum end-of-utterance silence: **200 ms with a semantic model in front, 500 ms for fixed-VAD only.** Semantic-EOT early-commit at **P(complete) ≥ 0.7**. Max-wait fallback **3000 ms**.
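
One way the recommended numbers might combine into a single commit decision is sketched below. The combination logic is an assumption (the brief gives the values, not the control flow), and `EndpointConfig` and `shouldCommitTurn` are hypothetical names.

```typescript
interface EndpointConfig {
  semanticHangoverMs: number; // minimum trailing silence when a semantic EOT model is in front
  fixedHangoverMs: number; // minimum trailing silence for fixed-VAD only
  earlyCommitP: number; // commit once P(complete) reaches this
  maxWaitMs: number; // commit regardless after this much silence
}

// Values taken from the "Recommended" paragraph above.
const RECOMMENDED: EndpointConfig = {
  semanticHangoverMs: 200,
  fixedHangoverMs: 500,
  earlyCommitP: 0.7,
  maxWaitMs: 3000,
};

// pComplete is null when no semantic EOT model is available.
function shouldCommitTurn(
  silenceMs: number,
  pComplete: number | null,
  cfg: EndpointConfig = RECOMMENDED,
): boolean {
  if (silenceMs >= cfg.maxWaitMs) return true; // max-wait fallback
  if (pComplete === null) return silenceMs >= cfg.fixedHangoverMs; // fixed-VAD path
  return silenceMs >= cfg.semanticHangoverMs && pComplete >= cfg.earlyCommitP;
}
```

With a semantic model present, a mid-utterance pause (low `pComplete`) holds the turn open until the 3000 ms fallback, which is the behaviour a fixed silence timer cannot provide.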
|
|
22
|
+
|
|
23
|
+
## 2. On-device VAD + wake-word
|
|
24
|
+
|
|
25
|
+
**Silero VAD**: ~1–2 MB JIT model, **<1 ms / 30 ms chunk** on one CPU thread, MIT, language-agnostic. Defaults: `threshold` **0.5**, `min_speech_duration_ms` **250**, `min_silence_duration_ms` **100**, `speech_pad_ms` **30**, window **512 samples @16 kHz** — [GitHub](https://github.com/snakers4/silero-vad), [PyTorch Hub](https://pytorch.org/hub/snakers4_silero-vad_vad/).
|
|
26
|
+
|
|
27
|
+
**Wake-word engines** (all accuracy numbers vendor/author-published; no neutral head-to-head exists):
|
|
28
|
+
- **openWakeWord**: frozen Google speech-embedding backbone + tiny per-word DNN head (~200 KB ONNX), trains from synthetic Piper TTS, design target **<0.5 false-accepts/hr, <5% false-reject**. Code Apache-2.0 but **pretrained models CC-BY-NC-SA** — [GitHub](https://github.com/dscripka/openWakeWord).
|
|
29
|
+
- **Porcupine** (Picovoice): ~1 MB, **97.1% detection at 1 FA/10 hr @ 10 dB SNR**, custom phrase trained in-console; proprietary/paid — [FAQ](https://picovoice.ai/docs/faq/porcupine/).
|
|
30
|
+
- **microWakeWord**: fully-Apache code + models, ~26–240 KB int8 TFLite, ~1 FA/hr — [GitHub](https://github.com/kahrendt/microWakeWord).
|
|
31
|
+
|
|
32
|
+
For "hey eliza": all train from synthetic TTS. Target **<0.5 FA/hr, <5% FRR**.
|
|
33
|
+
|
|
34
|
+
## 3. Acoustic echo cancellation + barge-in
|
|
35
|
+
|
|
36
|
+
**WebRTC AEC3** is the production choice: linear partitioned-block frequency-domain adaptive filter (64-sample / 4 ms blocks) + nonlinear residual suppressor. Removes **20–40 dB** of echo; handles 20–200 ms device delay + 100–300 ms reverb tails; double-talk halts adaptation; convergence 1–2 s; ~150 ms filter is the sweet spot — [AEC3 explainer](https://switchboard.audio/hub/how-webrtc-aec3-works/). **Speex** is linear-only and fails if delay exceeds the tail — AEC3 is strictly more robust.
|
|
37
|
+
|
|
38
|
+
**The reference signal is everything.** The #1 barge-in failure is a non-time-aligned playback reference — `getUserMedia({echoCancellation:true})` is blind to PCM played through a custom AudioContext (WebSocket TTS), so "the agent hears itself" — [dev.to case study](https://dev.to/remi_etien/i-built-a-voice-ai-with-sub-500ms-latency-heres-the-echo-cancellation-problem-nobody-talks-about-14la). Wake/interrupt detection must run off the *linear* filter output.
|
|
39
|
+
|
|
40
|
+
**Production strategies, increasing robustness:** half-duplex mic muting → **effective half-duplex via an `agentSpeaking` flag + ~1.5 s post-TTS cooldown with a raised RMS gate** → full-duplex AEC + reference cancellation → speaker-embedding / textual self-voice rejection (Google Textual Echo Cancellation) — [USPTO 11,482,244](https://image-ppubs.uspto.gov/dirsearch-public/print/downloadPdf/11482244).
|
|
41
|
+
|
|
42
|
+
**Framework knobs:** LiveKit `min_interruption_duration` **0.5 s** + adaptive backchannel cooldown (1.0 s / 3.5 s); OpenAI `server_vad` threshold 0.5 / prefix 300 ms / silence 500 ms. **Latency budget:** ICASSP AEC Challenge caps algorithmic latency at **≤40 ms** — [AEC Challenge](https://arxiv.org/pdf/2009.04972).
|
|
43
|
+
|
|
44
|
+
## 4. Speaker diarization + recognition on-device
|
|
45
|
+
|
|
46
|
+
**Embedding models** (EER on VoxCeleb1-O cleaned):
|
|
47
|
+
- **ECAPA-TDNN** (SpeechBrain): 192-dim, **0.80% EER** — [HF](https://huggingface.co/speechbrain/spkrec-ecapa-voxceleb).
|
|
48
|
+
- **WeSpeaker ResNet34-LM**: 256-dim, 6.63M params, **0.72% EER** — [HF](https://huggingface.co/Wespeaker/wespeaker-voxceleb-resnet34-LM). *(This is what elizaOS uses.)*
|
|
49
|
+
- **TitaNet-Large** (NeMo): **0.66% EER**.
|
|
50
|
+
|
|
51
|
+
**pyannote diarization 3.1** = segmentation-3.0 (10 s window, powerset up to 3 overlapping) + WeSpeaker embeddings + agglomerative clustering at cosine ≈ **0.705**. Offline DER: VoxConverse **11.3%**, AMI **18.8%**, DIHARD-3 **21.7%** — [pyannote 3.1](https://github.com/pyannote/hf-speaker-diarization-3.1). **Streaming** (DIART, 5 s buffer / 500 ms hops) costs roughly **+4–9 pp DER** — [DIART](https://github.com/juanmc2005/diart).
|
|
52
|
+
|
|
53
|
+
**Cosine thresholds are model-specific** — NeMo default **0.7**, pyannote clustering ~0.705. **On-device:** ResNet34 (6.6M) and ECAPA fit on phones; 8-bit quantization halves size with +0.07% EER.
|
|
54
|
+
|
|
55
|
+
## 5. Owner enrollment + verification
|
|
56
|
+
|
|
57
|
+
**Enrollment:** enroll once with **multiple utterances of ≥3 s each, pooled** — sub-3 s enrollment destabilizes the centroid — [Aalto](https://speechprocessingbook.aalto.fi/Recognition/Speaker_Recognition_and_Verification.html). **Short utterances kill EER:** clean SOTA ~0.7–0.9%, but at 1 s of test audio EER jumps to **~16.4%** vs ~2.7% at 3 s — [arXiv](https://arxiv.org/pdf/1810.10884).
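
Pooling several enrollment utterances into one centroid can be done incrementally with a running mean, so no clip has to be stored. The sketch below assumes embeddings are L2-normalized before pooling and that clips under 3 s are rejected; the `OwnerCentroid` class and `l2Normalize` helper are hypothetical names, not the pipeline's API.

```typescript
const MIN_ENROLL_SECONDS = 3; // "≥3 s each" from the text

function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return norm === 0 ? v.slice() : v.map((x) => x / norm);
}

class OwnerCentroid {
  private mean: number[] = [];
  private count = 0;

  // Returns false (and ignores the clip) when it is too short to enroll.
  // Assumes every embedding has the same dimension.
  add(embedding: number[], durationSeconds: number): boolean {
    if (durationSeconds < MIN_ENROLL_SECONDS) return false;
    const unit = l2Normalize(embedding);
    if (this.count === 0) this.mean = new Array<number>(unit.length).fill(0);
    this.count += 1;
    // Running mean: mean_n = mean_{n-1} + (x_n - mean_{n-1}) / n
    for (let i = 0; i < unit.length; i++) {
      this.mean[i] += (unit[i] - this.mean[i]) / this.count;
    }
    return true;
  }

  get utterances(): number {
    return this.count;
  }

  get centroid(): number[] {
    return this.mean.slice();
  }
}
```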
|
|
58
|
+
|
|
59
|
+
**Open-set "owner vs stranger"** = argmax cosine over enrolled owner centroids, then **reject as stranger if even the best match < θ**. **Threshold tradeoff:** at EER, impostor false-accept = owner false-reject; banking targets **FAR < 0.01%**, convenience tolerates **FAR ~1%** — [ConversaLabs](https://www.conversailabs.com/blog/secure-voice-authentication-for-banking-applications). For an owner gate, set θ above EER for sensitive actions, near EER for low-friction recognition.
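
The open-set rule above (take the best cosine match over enrolled centroids, then reject as a stranger if even that match is below θ) is short enough to sketch directly. `Enrolled` and `identifyOwner` are illustrative names, and θ is passed in because the brief stresses that thresholds are model-specific.

```typescript
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

interface Enrolled {
  ownerId: string;
  centroid: number[];
}

// Returns the best-matching owner id, or null when the speaker is treated as a stranger.
function identifyOwner(embedding: number[], enrolled: Enrolled[], theta: number): string | null {
  let bestId: string | null = null;
  let bestScore = Number.NEGATIVE_INFINITY;
  for (const e of enrolled) {
    const score = cosine(embedding, e.centroid);
    if (score > bestScore) {
      bestScore = score;
      bestId = e.ownerId;
    }
  }
  return bestId !== null && bestScore >= theta ? bestId : null;
}
```

Raising `theta` trades owner false-rejects for fewer impostor false-accepts, which is the knob the paragraph suggests setting above EER for sensitive actions.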
|
|
60
|
+
|
|
61
|
+
**Anti-spoofing/liveness** (ASVspoof-5 2024): SOTA countermeasures **2.59% EER (open) / 8.61% (closed)** — helps but is not airtight; pair with the verification threshold for high-value actions — [ASVspoof 5](https://arxiv.org/html/2408.08739v1).
|
|
62
|
+
|
|
63
|
+
## 6. On-device model landscape (2025–2026)
|
|
64
|
+
|
|
65
|
+
| Capability | Realistic on-device pick | Footprint / latency |
|
|
66
|
+
|---|---|---|
|
|
67
|
+
| STT baseline | Whisper tiny/base | 39M/74M, 75–142 MB, ~10–15× realtime, non-streaming |
|
|
68
|
+
| STT streaming | Moonshine v2 Tiny | 33.6M, **~50 ms on M3**, 80 ms lookahead, 12% WER |
|
|
69
|
+
| STT native iOS | Apple SpeechTranscriber (iOS 26) | offline, 4.6–6.2× realtime; WhisperKit +1.3–1.8× ANE |
|
|
70
|
+
| TTS | **Kokoro-82M** | 80–170 MB, ~0.7 RTFx on ANE, Apache-2.0 *(elizaOS mobile default)* |
|
|
71
|
+
| Multimodal audio-in | Gemma 3n E2B | ~2.5 GB, USM audio encoder, ~6 tok/s |
|
|
72
|
+
| Qwen3-ASR 0.6B | borderline (Mac / high-end phone) | ~0.6–1.2 GB quantized *(elizaOS ASR)* |
|
|
73
|
+
| Qwen3-Omni 30B-A3B | **cloud only** | 78–107 GB GPU |
|
|
74
|
+
|
|
75
|
+
Accelerators: Pixel **Tensor G5** runs a 3B real-time speech model on-device; Apple ANE runs WhisperKit and Kokoro faster than realtime. **STT → Whisper/Moonshine (or Apple SpeechTranscriber on iOS); TTS → Kokoro-82M; full audio understanding → Gemma 3n E2B; omni models stay cloud.**
|
|
76
|
+
|
|
77
|
+
## 7. Mixing local + cloud — latency math
|
|
78
|
+
|
|
79
|
+
**Targets:** **<800 ms voice-to-voice "good", <500 ms "great", <300 ms "instant", >1.2–1.5 s "broken"** — [LiveKit](https://livekit.com/blog/voice-agent-architecture-stt-llm-tts-pipelines-explained), [Hamming](https://hamming.ai/resources/voice-ai-latency-whats-fast-whats-slow-how-to-fix-it). Real-world shipped P50 is **1.4–1.7 s**, so the budget is aspirational.
|
|
80
|
+
|
|
81
|
+
**Cerebras LLM TTFT** (third-party, Artificial Analysis): **170 ms (Llama 70B) / 240 ms (405B)**, >2,100 tok/s — after first token the rest of the first sentence is effectively free for TTS.
|
|
82
|
+
|
|
83
|
+
**The hybrid math (local STT + Cerebras LLM + local TTS), from end-of-speech detection:**
|
|
84
|
+
|
|
85
|
+
```
|
|
86
|
+
Local STT finalize (last chunk + endpoint) ~50 ms (mostly overlaps live speech)
|
|
87
|
+
Network RTT to Cerebras ~50–100 ms
|
|
88
|
+
Cerebras LLM TTFT ~170 ms
|
|
89
|
+
Local TTS first-audio (Kokoro) ~30–80 ms (no network leg)
|
|
90
|
+
──────────────────────────────────────────────────────
|
|
91
|
+
Time-to-first-audio ≈ 300–400 ms (excluding endpoint silence wait)
|
|
92
|
+
+ ~250 ms typical endpoint wait ≈ 550–650 ms total
|
|
93
|
+
```
|
|
94
|
+
|
|
95
|
+
Structural wins: STT partials stream *during* speech; TTS starts on the first LLM sentence; Cerebras collapses the LLM term; local STT + local TTS each drop a network leg. **The dominant residual cost is the endpoint/turn-detection silence wait (200–800 ms)** — the single biggest tunable knob, and the one absent from every vendor TTFA headline.
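
The budget arithmetic in the block above can be reproduced by summing per-stage ranges. The stage values are the ones listed there; the `LatencyRange` type and `sumRanges` helper are only illustrative.

```typescript
type LatencyRange = [number, number]; // [min ms, max ms]

// Per-stage figures copied from the hybrid latency breakdown above.
const stages: LatencyRange[] = [
  [50, 50], // local STT finalize
  [50, 100], // network RTT to the LLM
  [170, 170], // LLM time-to-first-token
  [30, 80], // local TTS first audio
];

function sumRanges(ranges: LatencyRange[]): LatencyRange {
  let lo = 0;
  let hi = 0;
  for (const [a, b] of ranges) {
    lo += a;
    hi += b;
  }
  return [lo, hi];
}

const ttfa = sumRanges(stages); // [300, 400]
const withEndpointWait = sumRanges([ttfa, [250, 250]]); // [550, 650]
console.log(`TTFA ${ttfa[0]}-${ttfa[1]} ms, with endpoint wait ${withEndpointWait[0]}-${withEndpointWait[1]} ms`);
```

Swapping the assumed 250 ms endpoint wait for the 200–800 ms range quoted above shows how that single term can outweigh every other stage combined.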
|
|
96
|
+
|
|
97
|
+
## Recommended numeric defaults
|
|
98
|
+
|
|
99
|
+
| Parameter | Recommended | Range | Basis |
|
|
100
|
+
|---|---|---|---|
|
|
101
|
+
| VAD speech-onset threshold | **0.5** | 0.5–0.7 (raise in noise) | Silero / OpenAI / Pipecat |
|
|
102
|
+
| VAD onset confirm | **200 ms** | 100–250 ms | Pipecat `start_secs` |
|
|
103
|
+
| VAD offset / end-hangover (with semantic EOT) | **200 ms** | 200–300 ms | Pipecat `stop_secs` |
|
|
104
|
+
| VAD offset / end-hangover (fixed-VAD only) | **500 ms** | 500–700 ms | OpenAI / LiveKit |
|
|
105
|
+
| Semantic-EOT early-commit | **P ≥ 0.7** | 0.5–0.7 | smart-turn / open-set θ |
|
|
106
|
+
| Max-wait fallback | **3000 ms** | 3000–5000 ms | LiveKit / Pipecat |
|
|
107
|
+
| Barge-in min-interruption | **500 ms** | 300–500 ms | LiveKit |
|
|
108
|
+
| Barge-in grace / post-TTS cooldown | **400 ms** (+adaptive) | 300–1500 ms | dev.to / LiveKit |
|
|
109
|
+
| AEC filter (tail) length | **150 ms** | 100–200 ms | AEC3 |
|
|
110
|
+
| AEC algorithmic latency cap | **≤40 ms** | ≤40 ms | ICASSP AEC Challenge |
|
|
111
|
+
| Speaker-verification cosine threshold | **0.7** (recalibrate per model) | 0.65–0.75 | NeMo / pyannote |
|
|
112
|
+
| Owner-accept threshold (sensitive) | **above EER, FAR ≤ 0.1%** | FAR 0.01–1% | banking vs convenience |
|
|
113
|
+
| Min verification utterance | **≥3 s** | 3 s+ | EER 2.7%@3s vs 16.4%@1s |
|
|
114
|
+
| Wake-word false-accept target | **<0.5 FA/hr, <5% FRR** | — | openWakeWord |
|
|
115
|
+
| Time-to-first-audio budget | **≤500 ms great / ≤800 ms good** | 300–800 ms | LiveKit / Hamming |
|
|
116
|
+
|
|
117
|
+
**Caveats.** Cross-engine wake-word accuracy and turn-detection "improvement" percentages are vendor self-benchmarks. TTS latency headlines are inference-only and run 2–4× higher in production. Cosine thresholds are embedding-model-specific — recalibrate on a per-device dev set. The single largest lever on perceived latency is the endpoint silence wait, so invest in a semantic EOT model before optimizing STT/TTS milliseconds.
|
|
@@ -0,0 +1,135 @@
|
|
|
1
|
+
# Voice Validation Runbook (#8785)
|
|
2
|
+
|
|
3
|
+
Turn-key steps to execute the **gated** end-to-end validations once the
|
|
4
|
+
corresponding resource is available. Everything that does NOT need a gated
|
|
5
|
+
resource is already proven in CI (see [VOICE_8785_ASSESSMENT.md](./VOICE_8785_ASSESSMENT.md)
|
|
6
|
+
§2–4). This runbook covers the remaining lanes: headful A/V capture (desktop /
|
|
7
|
+
web / simulator / iOS), live cloud STT/TTS, and the real on-device model lane.
|
|
8
|
+
|
|
9
|
+
Each section states: **precondition → command → expected artifact → pass bar.**
|
|
10
|
+
|
|
11
|
+
---
|
|
12
|
+
|
|
13
|
+
## 0. Always-runnable baseline (no resource needed) — run first
|
|
14
|
+
|
|
15
|
+
```bash
|
|
16
|
+
# Decision logic over the full scenario matrix + regression gate (no models):
|
|
17
|
+
bun run --cwd plugins/plugin-local-inference voice:workbench --logic \
|
|
18
|
+
--baseline src/services/voice/__fixtures__/voice-workbench-logic-baseline.json
|
|
19
|
+
|
|
20
|
+
# The labeled audio-sample corpus (listen to the degraded edge cases):
|
|
21
|
+
bun run --cwd plugins/plugin-local-inference corpus:generate --out /tmp/voice-corpus
|
|
22
|
+
```
|
|
23
|
+
Pass bar: `[voice:workbench] no regressions … PASS`; 14 scenarios under
|
|
24
|
+
`/tmp/voice-corpus/<id>/audio.wav` + `ground-truth.json`.
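
The degraded samples in that corpus rely on seeded, deterministic DSP such as additive noise at a target SNR. As a rough illustration of the idea (not the code in `corpus-augment.ts`), the sketch below scales seeded white noise so the mix hits a requested SNR. The `mulberry32`-style generator, `meanPower`, and `addWhiteNoise` are assumptions chosen for the example.

```typescript
// Small seeded PRNG (mulberry32 variant) so the same seed always yields the same noise.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function meanPower(x: Float32Array): number {
  if (x.length === 0) return 0;
  let sum = 0;
  for (let i = 0; i < x.length; i++) sum += x[i] * x[i];
  return sum / x.length;
}

// Adds uniform white noise scaled so that 10*log10(signalPower / noisePower) equals snrDb.
function addWhiteNoise(signal: Float32Array, snrDb: number, seed: number): Float32Array {
  const rand = mulberry32(seed);
  const noise = new Float32Array(signal.length);
  for (let i = 0; i < noise.length; i++) noise[i] = rand() * 2 - 1;
  const noisePower = meanPower(noise);
  const targetNoisePower = meanPower(signal) / Math.pow(10, snrDb / 10);
  const gain = noisePower > 0 ? Math.sqrt(targetNoisePower / noisePower) : 0;
  const out = new Float32Array(signal.length);
  for (let i = 0; i < out.length; i++) out[i] = signal[i] + gain * noise[i];
  return out;
}
```

Seeding the generator is what makes a corpus regenerable bit-for-bit, so a regression in a downstream scorer cannot be blamed on different noise.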
|
|
25
|
+
|
|
26
|
+
---
|
|
27
|
+
|
|
28
|
+
## 1. Headful A/V capture — desktop + web (Playwright) ✅ DONE
|
|
29
|
+
|
|
30
|
+
**Status (2026-06-22): PASSING + recorded + adversarially verified.** `13 passed`;
|
|
31
|
+
evidence under `.github/issue-evidence/8785-voice-headful/`. (Precondition: the
|
|
32
|
+
app shell mounts — `typecheck` is 0. An earlier run failed against a transient
|
|
33
|
+
concurrent `AppContext.tsx` mid-refactor; once stabilized the matrix passed.)
|
|
34
|
+
|
|
35
|
+
```bash
|
|
36
|
+
# Full voice headful matrix WITH A/V recording (video+trace+screenshot per spec):
|
|
37
|
+
cd packages/app
|
|
38
|
+
E2E_RECORD=1 node scripts/run-ui-playwright.mjs \
|
|
39
|
+
--config playwright.ui-smoke.config.ts voice-
|
|
40
|
+
```
|
|
41
|
+
**Artifacts:** `e2e-recordings/app/test-results/<spec>/{video.webm,trace.zip,test-finished-1.png}`
|
|
42
|
+
(open a trace: `npx playwright show-trace …/trace.zip`); per-turn DOM verdicts at
|
|
43
|
+
`[data-testid="voice-workbench-turn-<i>"]` / `…-overall`.
|
|
44
|
+
**Pass bar:** every `voice-*.spec.ts` green; `voice-workbench-overall` reads
|
|
45
|
+
`pass`; the real-mic round-trip (`voice-realaudio`) transcribes the injected
|
|
46
|
+
phrase at WER 0. (Backends are mocked — this proves the real client pipeline +
|
|
47
|
+
player + respond/EOT/diarization decisions, not acoustic-model accuracy.)
|
|
48
|
+
|
|
49
|
+
> The Playwright recording pipeline itself is verified working — a run on the
|
|
50
|
+
> broken branch already produced `video.webm` + a screenshot of the error
|
|
51
|
+
> boundary; it just needs the shell to mount.
|
|
52
|
+
|
|
53
|
+
## 2. Headful A/V — iOS simulator + connected device
|
|
54
|
+
|
|
55
|
+
**Precondition:** Xcode + a booted simulator (and, for device, an Apple ID
|
|
56
|
+
provisioning profile). The on-device agent build must embed the Bun engine
|
|
57
|
+
(`ELIZA_IOS_FULL_BUN_ENGINE=1`) for local inference.
|
|
58
|
+
|
|
59
|
+
```bash
|
|
60
|
+
# On-device real round-trip (Pixel pattern mirrors this for Android):
|
|
61
|
+
bun run --cwd packages/app test:e2e:android:webview # Android device
|
|
62
|
+
# iOS: drive the booted sim/device via the app's ui-packaged config + cliclick
|
|
63
|
+
# recipe (activate Simulator first; floating composer → send).
|
|
64
|
+
```
|
|
65
|
+
**Artifacts:** screen recording (simulator: `xcrun simctl io booted recordVideo`),
|
|
66
|
+
device-resource metrics via `/api/dev/device-resource-metrics`, and the agent's
|
|
67
|
+
trajectory jsonl. **Pass bar:** the STT→agent→TTS round-trip completes on-device;
|
|
68
|
+
TTFA within the research budget (≤800 ms good).
|
|
69
|
+
|
|
70
|
+
## 3. Live cloud STT/TTS (end-to-end)
|
|
71
|
+
|
|
72
|
+
**Precondition:** an authenticated Eliza Cloud session **with billing credits**
|
|
73
|
+
(today the test account returns HTTP 402 — a billing state, not a code bug).
|
|
74
|
+
|
|
75
|
+
```bash
|
|
76
|
+
# Cloud STT → POST /api/v1/voice/stt (ElevenLabs-backed)
|
|
77
|
+
# Cloud TTS → POST /api/v1/voice/tts
|
|
78
|
+
# Mixed hybrid (local STT + cloud LLM + local TTS) is the default mobile-local
|
|
79
|
+
# routing — verify the chosen route per slot:
|
|
80
|
+
bun run --cwd packages/ui test -- src/voice/voice-provider-defaults.test.ts
|
|
81
|
+
```
|
|
82
|
+
**Pass bar:** a real STT call returns a transcript and a real TTS call returns
|
|
83
|
+
audio (200, non-empty body); the hybrid latency lands within the research TTFA
|
|
84
|
+
budget. Capture the structured `[ClassName] …` backend logs + the network trace.
|
|
85
|
+
|
|
86
|
+
## 4. Real on-device model lane (real WER / DER / EOT latency)
|
|
87
|
+
|
|
88
|
+
**Precondition:** the native fused `libelizainference` built for the host
|
|
89
|
+
platform + the Eliza-1 GGUF bundle (text + Qwen3-ASR + WeSpeaker + pyannote +
|
|
90
|
+
Silero + openWakeWord + Kokoro) staged under the models dir.
|
|
91
|
+
|
|
92
|
+
```bash
|
|
93
|
+
# Build the fused lib (macOS example), then run the real lane:
|
|
94
|
+
bun run --cwd plugins/plugin-local-inference voice:workbench --real \
|
|
95
|
+
--baseline src/services/voice/__fixtures__/voice-workbench-logic-baseline.json \
|
|
96
|
+
--out /tmp/voice-workbench-real
|
|
97
|
+
|
|
98
|
+
# Real ASR smoke (runs OUTSIDE `bun test` — coverage=true EMFILEs the GGUF mmap):
|
|
99
|
+
bun run --cwd plugins/plugin-local-inference test:asr:real
|
|
100
|
+
```
|
|
101
|
+
**Artifacts:** `/tmp/voice-workbench-real/report.{json,md}` with REAL WER (on the
|
|
102
|
+
degraded robustness corpus), diarization DER, EOT latency p50/p95, first-audio
|
|
103
|
+
latency. **Pass bar:** WER/DER under the per-scenario ceilings; no regression vs
|
|
104
|
+
the baseline. The corpus from §0 (with reverb/noise/far-field) is the input —
|
|
105
|
+
this is where robustness is actually measured.
|
|
106
|
+
|
|
107
|
+
## 5. Wake word "hey eliza"
|
|
108
|
+
|
|
109
|
+
**Precondition:** the trained head shipped in the tier bundle
|
|
110
|
+
(`voice/wakeword/hey-eliza.*.gguf` — published to `elizaos/eliza-1` v0.3.0;
|
|
111
|
+
placeholder until bundled everywhere). Verified ~98% true-accept / 4–7%
|
|
112
|
+
false-accept at training. Local-mode only; inert in cloud mode.
|
|
113
|
+
|
|
114
|
+
---
|
|
115
|
+
|
|
116
|
+
## Evidence checklist for closing #8785 (local + cloud)
|
|
117
|
+
|
|
118
|
+
- [x] Decision logic (EOT / respond / echo×2 / bystander / wake / owner) — CI `--logic` + regression gate
|
|
119
|
+
- [x] Robustness corpus (noise/reverb/far-field/low-quality/babble/overlap) — DSP tests + corpus:generate
|
|
120
|
+
- [x] Research (pause lengths, VAD, AEC, diarization, owner verification, model landscape, hybrid latency)
|
|
121
|
+
- [x] Headful A/V — desktop + web *(13/13 specs passed + recorded + adversarially verified; `.github/issue-evidence/8785-voice-headful/`)*
|
|
122
|
+
- [~] iOS **simulator**: app boots + UI renders, recorded (`.github/issue-evidence/8785-voice-ios-sim/`); voice *inference* on the sim is Metal-gated (no GPU on the sim), so local inference needs a physical device and cloud inference needs cloud credits.
|
|
123
|
+
- [~] iOS **physical device**: **the signed app + embedded full-Bun engine is INSTALLED on Shaw's iPhone 15 Pro** (`ios-device-installed.md`). Signing is resolved: the correct team is **25877RY2EH** (not the cert's CN `UT5K5Q5EVF`); automatic signing with the cached team profiles (which cover the device) and the generic "Apple Development" identity needs no Xcode account. Aligning the 2 DeviceActivity extensions' entitlements to their profiles gave BUILD SUCCEEDED → `devicectl device install` → app on device. The **only** remaining step is a physical action: **unlock the iPhone** (it was locked, so iOS refused to launch the dev app with FBSOpenApplicationErrorDomain error 7 "Locked") and trust the developer (Settings → General → VPN & Device Management), then open Milady → "This device". The embedded engine then runs on the iPhone's real Metal GPU, using the same fused engine + GGUF that already runs real voice models on this Mac's Apple Silicon.
|
|
124
|
+
```bash
|
|
125
|
+
# after the user unlocks + trusts the dev cert:
|
|
126
|
+
xcrun devicectl device process launch --device 00008130-001955E91EF8001C ai.milady.milady
|
|
127
|
+
idevicesyslog | grep -iE "eliza|bun|llama|metal|asr|inference" # on-device engine logs
|
|
128
|
+
```
|
|
129
|
+
- [x] **Live cloud STT/TTS E2E** — ElevenLabs `eleven_turbo_v2_5` + `scribe_v1` round-trip, WER 0 (`.github/issue-evidence/8785-voice-real-cloud/`).
|
|
130
|
+
- [x] **Real on-device ASR + WER on the degraded corpus** — eliza-1-asr via the fused dylib + Metal; WER 0 across every realistic degradation (noise to 0 dB, reverb to 0.98, far-field, telephone, harsh), graceful past the edge.
|
|
131
|
+
- [x] **Mixed local STT + cloud LLM + cloud TTS** — `roundtrip:real`: ~770–870 ms hybrid (local STT ~200 ms + Cerebras ~270 ms + cloud TTS ~270 ms).
|
|
132
|
+
- [x] **Real speaker recognition + diarization + VAD + local TTS** — `voicestack:real`: WeSpeaker same-speaker ~0.72 vs different ~0.15 (owner-vs-intruder), pyannote ≥2 speakers, Silero speech 1.0 / silence 0.009, on-device TTS 3.9 s. (`.github/issue-evidence/8785-voice-real-cloud/`)
|
|
133
|
+
- [~] EOT turn-detector model — GGUFs present (en/intl); the heuristic EOT is validated in `--logic`; the model path (`eotScore`) needs the text model + tokenizer loaded to drive via FFI.
|
|
134
|
+
- [ ] Wake-word "hey eliza" model — a real head is published (v0.3.0, ~98% true-accept) but not staged in this bundle; the wake-word *decision* (phrase detection + override) is validated in `--logic`.
|
|
135
|
+
- [ ] iOS **physical device** — needs Apple ID provisioning
|