livekit-plugins-dtln 0.1.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- livekit_plugins_dtln-0.1.0/LICENSE +21 -0
- livekit_plugins_dtln-0.1.0/PKG-INFO +228 -0
- livekit_plugins_dtln-0.1.0/README.md +177 -0
- livekit_plugins_dtln-0.1.0/pyproject.toml +43 -0
- livekit_plugins_dtln-0.1.0/setup.cfg +4 -0
- livekit_plugins_dtln-0.1.0/src/livekit/plugins/dtln/__init__.py +33 -0
- livekit_plugins_dtln-0.1.0/src/livekit/plugins/dtln/models/model_1.onnx +0 -0
- livekit_plugins_dtln-0.1.0/src/livekit/plugins/dtln/models/model_2.onnx +0 -0
- livekit_plugins_dtln-0.1.0/src/livekit/plugins/dtln/noise_suppressor.py +287 -0
- livekit_plugins_dtln-0.1.0/src/livekit_plugins_dtln.egg-info/PKG-INFO +228 -0
- livekit_plugins_dtln-0.1.0/src/livekit_plugins_dtln.egg-info/SOURCES.txt +12 -0
- livekit_plugins_dtln-0.1.0/src/livekit_plugins_dtln.egg-info/dependency_links.txt +1 -0
- livekit_plugins_dtln-0.1.0/src/livekit_plugins_dtln.egg-info/requires.txt +4 -0
- livekit_plugins_dtln-0.1.0/src/livekit_plugins_dtln.egg-info/top_level.txt +1 -0
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2024 Aloware, Inc.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
@@ -0,0 +1,228 @@
Metadata-Version: 2.4
Name: livekit-plugins-dtln
Version: 0.1.0
Summary: DTLN noise suppression plugin for LiveKit Agents — self-hosted, in-process, no cloud API
Author-email: "Aloware, Inc." <dev@aloware.com>
License: MIT License

Copyright (c) 2024 Aloware, Inc.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Project-URL: Homepage, https://github.com/aloware/livekit-plugins-dtln
Project-URL: Repository, https://github.com/aloware/livekit-plugins-dtln
Project-URL: Bug Tracker, https://github.com/aloware/livekit-plugins-dtln/issues
Project-URL: Demo, https://aloware.github.io/livekit-plugins-dtln/
Keywords: livekit,noise-suppression,noise-cancellation,dtln,onnx,speech,audio,voip
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Multimedia :: Sound/Audio :: Speech
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: livekit>=1.1.0
Requires-Dist: livekit-agents>=1.4.4
Requires-Dist: onnxruntime>=1.17.0
Requires-Dist: numpy>=1.26.0
Dynamic: license-file

# livekit-plugins-dtln

Python [LiveKit](https://livekit.io) plugin for **DTLN** (Dual-Signal Transformation LSTM Network) noise suppression — a fully self-hosted, open-source alternative to cloud-based noise cancellation services like Krisp or AI-coustics.

Runs entirely **in-process** using [ONNX Runtime](https://onnxruntime.ai). No cloud API, no per-minute fees, no proprietary binaries. Works with **self-hosted LiveKit** servers.

> Based on [Westhausen & Meyer, "Noise Reduction with DTLN", Interspeech 2020](https://www.isca-archive.org/interspeech_2020/westhausen20_interspeech.html)
> Original implementation: [github.com/breizhn/DTLN](https://github.com/breizhn/DTLN)

**[Live audio comparison demo →](https://aloware.github.io/livekit-plugins-dtln/)**

---

## Why DTLN?

| | DTLN (this plugin) | Krisp / AI-coustics |
|---|---|---|
| **Hosting** | Self-hosted, in-process | Cloud API required |
| **Cost** | Free (open weights) | Per-minute billing |
| **LiveKit** | Works with self-hosted | Requires LiveKit Cloud |
| **Latency** | ~8 ms (one block shift) | Network round-trip |
| **Privacy** | Audio never leaves your server | Audio sent to third party |
| **Real-time factor** | ~0.05× (20× faster than real-time) | Varies |

---

## Installation

**pip:**

```bash
pip install livekit-plugins-dtln
```

**requirements.txt:**

```
livekit-plugins-dtln
```

**From source:**

```bash
git clone https://github.com/aloware/livekit-plugins-dtln.git
pip install -e ./livekit-plugins-dtln
```

> The pretrained ONNX model weights (~4 MB) are bundled in the PyPI wheel — no separate download step needed.

---

## Usage

### Session pipeline (recommended)

```python
from livekit.agents import room_io
from livekit.plugins import dtln

await session.start(
    # ...,
    room_options=room_io.RoomOptions(
        audio_input=room_io.AudioInputOptions(
            noise_cancellation=dtln.noise_suppression(),
        ),
    ),
)
```

### Custom AudioStream

```python
from livekit import rtc
from livekit.plugins import dtln

stream = rtc.AudioStream.from_track(
    track=track,
    noise_cancellation=dtln.noise_suppression(),
)
```

> **Note:** Create one `dtln.noise_suppression()` instance **per session**. Each instance holds stateful LSTM hidden states that must be scoped to a single call.

> **Note:** DTLN is trained on raw microphone audio. Do not chain it with another noise cancellation model — applying two models in series produces unexpected results.

### Custom model paths

```python
dtln.noise_suppression(
    model_1_path="/path/to/model_1.onnx",
    model_2_path="/path/to/model_2.onnx",
)
```

---

## Requirements

- Python >= 3.10
- livekit >= 1.1.0
- livekit-agents >= 1.4.4
- onnxruntime >= 1.17.0
- numpy >= 1.26.0

---

## How It Works

DTLN uses two sequential LSTM-based models:

1. **Model 1 — Spectral masking**: Computes the magnitude spectrum of a 32 ms window, runs it through an LSTM to produce a spectral mask, applies the mask in the frequency domain (preserving phase), and reconstructs the time-domain signal via IFFT.
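The masking step can be sketched in a few lines of NumPy. This is an illustrative stand-in, not the plugin's code: the real mask comes from Model 1's LSTM, while here it is simply passed in as an argument:

```python
import numpy as np

BLOCK_LEN = 512              # 32 ms at 16 kHz
N_BINS = BLOCK_LEN // 2 + 1  # rfft of 512 samples -> 257 bins

def apply_spectral_mask(block: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Scale the magnitude spectrum by `mask`, keeping the noisy phase."""
    spec = np.fft.rfft(block)                          # complex spectrum
    mag = np.abs(spec)                                 # what the LSTM would see
    masked = mask * mag * np.exp(1j * np.angle(spec))  # phase preserved
    return np.fft.irfft(masked, n=BLOCK_LEN)

block = np.random.default_rng(1).standard_normal(BLOCK_LEN).astype(np.float32)
# An all-ones mask is the identity: the block survives the FFT round trip.
assert np.allclose(apply_spectral_mask(block, np.ones(N_BINS)), block, atol=1e-5)
```

Because only magnitudes are scaled, a mask of all ones reconstructs the input exactly; the trained LSTM instead pushes noise-dominated bins toward zero.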

2. **Model 2 — Time-domain refinement**: Refines the output of Model 1 with a second LSTM that operates directly on the waveform, capturing residual artifacts that spectral processing misses.

The two models are chained: Model 1's output feeds Model 2. Both LSTMs are stateful — their hidden states persist across audio frames, giving the network temporal context across the full duration of a call.

**Signal flow:**

```
Input frame (any sample rate, any channels)
  → downsample to 16 kHz mono
  → overlap-add loop (512-sample window, 128-sample shift)
  → FFT → magnitude → Model 1 (spectral mask) → masked IFFT
  → Model 2 (time-domain refinement)
  → upsample back to original sample rate
  → restore original channel count
  → Denoised output frame
```

The overlap-add synthesis uses 75% overlap (512-sample window, 128-sample shift), identical to the original DTLN paper. This gives ~8 ms of algorithmic latency at 16 kHz.
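The overlap-add bookkeeping can be sketched in plain NumPy. This is a hypothetical stand-in (an identity `denoise_block` replaces the two ONNX models; the names are illustrative, not the plugin's API):

```python
import numpy as np

BLOCK_LEN, BLOCK_SHIFT = 512, 128  # 32 ms window, 8 ms shift at 16 kHz

def denoise_block(block: np.ndarray) -> np.ndarray:
    # Stand-in for the Model 1 + Model 2 inference; identity here.
    return block

def overlap_add(x: np.ndarray) -> np.ndarray:
    in_buf = np.zeros(BLOCK_LEN, dtype=np.float32)   # sliding analysis window
    out_buf = np.zeros(BLOCK_LEN, dtype=np.float32)  # synthesis accumulator
    chunks = []
    for i in range(0, len(x) - BLOCK_SHIFT + 1, BLOCK_SHIFT):
        # Slide 128 new samples into the 512-sample analysis buffer
        in_buf[:-BLOCK_SHIFT] = in_buf[BLOCK_SHIFT:]
        in_buf[-BLOCK_SHIFT:] = x[i:i + BLOCK_SHIFT]
        processed = denoise_block(in_buf)
        # Shift the accumulator, clear the tail, add the processed block
        out_buf[:-BLOCK_SHIFT] = out_buf[BLOCK_SHIFT:]
        out_buf[-BLOCK_SHIFT:] = 0.0
        out_buf += processed
        # Emit the oldest 128 samples, now fully accumulated
        chunks.append(out_buf[:BLOCK_SHIFT].copy())
    return np.concatenate(chunks)

x = np.random.default_rng(0).standard_normal(128 * 8).astype(np.float32)
y = overlap_add(x)
# With the identity stand-in every sample lands in 4 overlapping blocks,
# so it re-emerges scaled by 4 and shifted by 512 - 128 = 384 samples.
assert np.allclose(y[384:], 4 * x[:len(y) - 384], atol=1e-4)
```

With the trained models in place the network's output absorbs that overlap gain; the 384-sample fill corresponds to roughly 24 ms of startup buffering at 16 kHz.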

---

## Performance

Benchmarked on Apple M3 Pro, processing 16 kHz mono audio:

| Metric | Value |
|---|---|
| Steady-state latency per block | ~0.7 ms |
| Real-time factor | ~0.05× |
| Headroom vs real-time | ~20× |
| Cold-start (first inference) | ~500 ms (amortized by warmup in `__init__`) |

The `__init__` method runs a dummy forward pass to trigger ONNX Runtime's JIT compilation before the first real audio frame arrives, eliminating the cold-start stall.

---

## Models

Pretrained weights are the official DTLN models published by the original authors:

| File | Source |
|---|---|
| `model_1.onnx` | [breizhn/DTLN · pretrained_model/](https://github.com/breizhn/DTLN/tree/master/pretrained_model) |
| `model_2.onnx` | [breizhn/DTLN · pretrained_model/](https://github.com/breizhn/DTLN/tree/master/pretrained_model) |

The models are not bundled in this repository (to keep it lightweight). They are downloaded automatically by `python agent.py download-files` or by calling `download_models()` directly.

---

## References

- **Original DTLN paper**: [Westhausen & Meyer, "Noise Reduction with DTLN", Interspeech 2020](https://www.isca-archive.org/interspeech_2020/westhausen20_interspeech.html)
- **Original DTLN implementation & pretrained models**: [github.com/breizhn/DTLN](https://github.com/breizhn/DTLN)
- **DataDog engineering article** — the inspiration for this plugin: [Building a Real-Time Noise Suppression Library](https://www.datadoghq.com/blog/engineering/noise-suppression-library/)
- **LiveKit noise cancellation overview**: [docs.livekit.io — Noise Cancellation](https://docs.livekit.io/transport/media/noise-cancellation/)
- **LiveKit Agents SDK**: [github.com/livekit/agents](https://github.com/livekit/agents)
- **ONNX Runtime**: [onnxruntime.ai](https://onnxruntime.ai)

---

## License

The plugin code in this repository is released under the **MIT License**.

The pretrained DTLN model weights are published by the original authors under the **MIT License** — see [breizhn/DTLN](https://github.com/breizhn/DTLN/blob/master/LICENSE).
@@ -0,0 +1,177 @@
(README.md — identical to the package description embedded in PKG-INFO above)
@@ -0,0 +1,43 @@
[build-system]
requires = ["setuptools>=70", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "livekit-plugins-dtln"
version = "0.1.0"
description = "DTLN noise suppression plugin for LiveKit Agents — self-hosted, in-process, no cloud API"
readme = "README.md"
license = { file = "LICENSE" }
authors = [{ name = "Aloware, Inc.", email = "dev@aloware.com" }]
requires-python = ">=3.10"
keywords = ["livekit", "noise-suppression", "noise-cancellation", "dtln", "onnx", "speech", "audio", "voip"]
classifiers = [
    "Development Status :: 4 - Beta",
    "Intended Audience :: Developers",
    "License :: OSI Approved :: MIT License",
    "Programming Language :: Python :: 3",
    "Programming Language :: Python :: 3.10",
    "Programming Language :: Python :: 3.11",
    "Programming Language :: Python :: 3.12",
    "Programming Language :: Python :: 3.13",
    "Topic :: Multimedia :: Sound/Audio :: Speech",
    "Topic :: Scientific/Engineering :: Artificial Intelligence",
]
dependencies = [
    "livekit>=1.1.0",
    "livekit-agents>=1.4.4",
    "onnxruntime>=1.17.0",
    "numpy>=1.26.0",
]

[project.urls]
Homepage = "https://github.com/aloware/livekit-plugins-dtln"
Repository = "https://github.com/aloware/livekit-plugins-dtln"
"Bug Tracker" = "https://github.com/aloware/livekit-plugins-dtln/issues"
Demo = "https://aloware.github.io/livekit-plugins-dtln/"

[tool.setuptools.packages.find]
where = ["src"]

[tool.setuptools.package-data]
"livekit.plugins.dtln" = ["models/*.onnx"]
@@ -0,0 +1,33 @@
import logging

from livekit.agents import Plugin

from .noise_suppressor import DTLNNoiseSuppressor

logger = logging.getLogger(__name__)


class DTLNPlugin(Plugin):
    def __init__(self):
        super().__init__(
            title="DTLN",
            version="0.1.0",
            package="livekit-plugins-dtln",
            logger=logger,
        )

    def download_files(self):
        from .noise_suppressor import download_models

        download_models()


def noise_suppression(**kwargs) -> DTLNNoiseSuppressor:
    """Create a DTLNNoiseSuppressor instance.

    Pass to AudioInputOptions(noise_cancellation=dtln.noise_suppression()).
    """
    return DTLNNoiseSuppressor(**kwargs)


Plugin.register_plugin(DTLNPlugin())

__all__ = ["DTLNNoiseSuppressor", "noise_suppression"]
Binary file
Binary file
@@ -0,0 +1,287 @@
"""DTLN noise suppression as a LiveKit FrameProcessor.

Implements the Dual-Signal Transformation LSTM Network (DTLN) from:
    Noise Reduction with DTLN (Westhausen & Meyer, Interspeech 2020)
    https://github.com/breizhn/DTLN

Uses ONNX Runtime for in-process inference — no cloud API required.
Drop-in replacement for livekit-plugins-noise-cancellation / ai-coustics.
"""

import logging
import os
import urllib.request

import numpy as np
import onnxruntime as ort
from livekit import rtc

logger = logging.getLogger(__name__)

# DTLN requires 16 kHz mono audio (fixed by the pretrained model)
_SAMPLE_RATE = 16_000
# 32 ms window, 8 ms shift (75% overlap)
_BLOCK_LEN = 512
_BLOCK_SHIFT = 128
# rfft of 512-sample block gives 257 unique frequency bins
_N_BINS = _BLOCK_LEN // 2 + 1

_DEFAULT_MODEL_DIR = os.path.join(os.path.dirname(__file__), "models")

_DTLN_BASE_URL = "https://github.com/breizhn/DTLN/raw/master/pretrained_model"
_MODEL_FILES = ["model_1.onnx", "model_2.onnx"]


def download_models(models_dir: str = _DEFAULT_MODEL_DIR) -> None:
    """Download pretrained DTLN ONNX models (skips files already present)."""
    os.makedirs(models_dir, exist_ok=True)
    for filename in _MODEL_FILES:
        dest = os.path.join(models_dir, filename)
        if os.path.exists(dest):
            logger.info("Model already present: %s", dest)
            continue
        url = f"{_DTLN_BASE_URL}/{filename}"
        logger.info("Downloading %s ...", url)
        urllib.request.urlretrieve(url, dest)
        logger.info("Saved to %s", dest)


class DTLNNoiseSuppressor(rtc.FrameProcessor[rtc.AudioFrame]):
    """In-process DTLN noise suppressor that works with self-hosted LiveKit.

    Pass to AudioInputOptions(noise_cancellation=DTLNNoiseSuppressor(...)).
    Each instance is stateful (LSTM state persists across frames) — create
    one instance per call session.

    Args:
        model_1_path: Path to model_1.onnx (spectral masking stage).
        model_2_path: Path to model_2.onnx (time-domain refinement stage).
    """

    def __init__(
        self,
        model_1_path: str = os.path.join(_DEFAULT_MODEL_DIR, "model_1.onnx"),
        model_2_path: str = os.path.join(_DEFAULT_MODEL_DIR, "model_2.onnx"),
    ) -> None:
        self._sess1 = ort.InferenceSession(model_1_path)
        self._sess2 = ort.InferenceSession(model_2_path)

        # Input/output tensor names (read from model to avoid hardcoding)
        self._m1_in_mag = self._sess1.get_inputs()[0].name
        self._m1_in_state = self._sess1.get_inputs()[1].name
        self._m2_in_time = self._sess2.get_inputs()[0].name
        self._m2_in_state = self._sess2.get_inputs()[1].name

        # LSTM states — shape read from model (e.g. [1, 2, 128, 2])
        state1_shape = self._sess1.get_inputs()[1].shape
        state2_shape = self._sess2.get_inputs()[1].shape
        self._state1 = np.zeros(state1_shape, dtype=np.float32)
        self._state2 = np.zeros(state2_shape, dtype=np.float32)

        # Rolling 512-sample buffers at 16 kHz (float32)
        self._in_buf = np.zeros(_BLOCK_LEN, dtype=np.float32)
        self._out_buf = np.zeros(_BLOCK_LEN, dtype=np.float32)

        # Queues for handling arbitrary incoming frame sizes
        self._input_queue = np.zeros(0, dtype=np.float32)
        self._output_queue = np.zeros(0, dtype=np.float32)

        # Resamplers — created lazily on the first frame
        self._downsampler: rtc.AudioResampler | None = None
        self._upsampler: rtc.AudioResampler | None = None
        self._native_rate: int = 0

        self._enabled = True

        # Pre-warm ONNX Runtime's JIT compiler so the first real frame
        # doesn't stall the audio pipeline (~500ms cold-start otherwise).
        self._warmup()

    def _warmup(self) -> None:
        dummy_mag = np.zeros((1, 1, _N_BINS), dtype=np.float32)
        dummy_time = np.zeros((1, 1, _BLOCK_LEN), dtype=np.float32)
        self._sess1.run(None, {self._m1_in_mag: dummy_mag, self._m1_in_state: self._state1})
        self._sess2.run(None, {self._m2_in_time: dummy_time, self._m2_in_state: self._state2})
        # Reset states — warmup outputs shouldn't carry over into real audio
        state1_shape = self._sess1.get_inputs()[1].shape
        state2_shape = self._sess2.get_inputs()[1].shape
        self._state1 = np.zeros(state1_shape, dtype=np.float32)
        self._state2 = np.zeros(state2_shape, dtype=np.float32)

    # ------------------------------------------------------------------ #
    # FrameProcessor interface                                           #
    # ------------------------------------------------------------------ #

    @property
    def enabled(self) -> bool:
        return self._enabled

    @enabled.setter
    def enabled(self, value: bool) -> None:
        self._enabled = value

    def _process(self, frame: rtc.AudioFrame) -> rtc.AudioFrame:
        if not self._enabled:
            return frame

        # Lazily create resamplers when we learn the incoming sample rate.
        if frame.sample_rate != self._native_rate:
            self._native_rate = frame.sample_rate
            if frame.sample_rate != _SAMPLE_RATE:
                self._downsampler = rtc.AudioResampler(
                    input_rate=frame.sample_rate,
                    output_rate=_SAMPLE_RATE,
                    num_channels=1,
                    quality=rtc.AudioResamplerQuality.MEDIUM,
                )
                self._upsampler = rtc.AudioResampler(
                    input_rate=_SAMPLE_RATE,
                    output_rate=frame.sample_rate,
                    num_channels=1,
                    quality=rtc.AudioResamplerQuality.MEDIUM,
                )
            else:
                self._downsampler = None
                self._upsampler = None

        # Convert int16 → float32 mono
        samples = np.frombuffer(frame.data, dtype=np.int16).astype(np.float32) / 32768.0
        if frame.num_channels > 1:
            samples = samples.reshape(-1, frame.num_channels).mean(axis=1)

        # Build a mono AudioFrame at the native rate for the resampler
        mono_int16 = (np.clip(samples, -1.0, 1.0) * 32767.0).astype(np.int16)
        mono_frame = rtc.AudioFrame(
            data=mono_int16.tobytes(),
            sample_rate=frame.sample_rate,
            num_channels=1,
            samples_per_channel=len(mono_int16),
        )

        # Downsample to 16 kHz
        if self._downsampler is not None:
            frames_16k = self._downsampler.push(mono_frame)
        else:
            frames_16k = [mono_frame]

        if not frames_16k:
            return frame  # resampler buffering startup, pass through

        samples_16k = np.concatenate([
            np.frombuffer(f.data, dtype=np.int16).astype(np.float32) / 32768.0
            for f in frames_16k
        ])

        self._input_queue = np.concatenate([self._input_queue, samples_16k])

        # Process in BLOCK_SHIFT (128-sample) steps; count steps taken.
        n_steps = 0
        while len(self._input_queue) >= _BLOCK_SHIFT:
            new = self._input_queue[:_BLOCK_SHIFT]
            self._input_queue = self._input_queue[_BLOCK_SHIFT:]

            self._in_buf[:-_BLOCK_SHIFT] = self._in_buf[_BLOCK_SHIFT:]
            self._in_buf[-_BLOCK_SHIFT:] = new

            denoised = self._infer_block(self._in_buf)

            self._out_buf[:-_BLOCK_SHIFT] = self._out_buf[_BLOCK_SHIFT:]
            self._out_buf[-_BLOCK_SHIFT:] = 0.0
            self._out_buf += denoised

            self._output_queue = np.concatenate([
                self._output_queue, self._out_buf[:_BLOCK_SHIFT]
            ])
            n_steps += 1

        # Drain exactly what we produced this step — no more, no less.
        # This eliminates the periodic silence that occurs when draining by
        # n_16k (downsampler output) instead of n_steps * _BLOCK_SHIFT.
        # During startup the output_queue fills with pipeline latency (~24ms);
        # in steady state it stays constant.
        n_produced = n_steps * _BLOCK_SHIFT
        if n_produced == 0:
            return frame

        out_16k = self._output_queue[:n_produced]
        self._output_queue = self._output_queue[n_produced:]

        # Build 16 kHz AudioFrame and upsample back to native rate
        out_int16_16k = (np.clip(out_16k, -1.0, 1.0) * 32767.0).astype(np.int16)
|
|
211
|
+
out_frame_16k = rtc.AudioFrame(
|
|
212
|
+
data=out_int16_16k.tobytes(),
|
|
213
|
+
sample_rate=_SAMPLE_RATE,
|
|
214
|
+
num_channels=1,
|
|
215
|
+
samples_per_channel=len(out_int16_16k),
|
|
216
|
+
)
|
|
217
|
+
|
|
218
|
+
if self._upsampler is not None:
|
|
219
|
+
out_frames = self._upsampler.push(out_frame_16k)
|
|
220
|
+
else:
|
|
221
|
+
out_frames = [out_frame_16k]
|
|
222
|
+
|
|
223
|
+
if not out_frames:
|
|
224
|
+
return frame
|
|
225
|
+
|
|
226
|
+
out_samples = np.concatenate([
|
|
227
|
+
np.frombuffer(f.data, dtype=np.int16).astype(np.float32) / 32768.0
|
|
228
|
+
for f in out_frames
|
|
229
|
+
])
|
|
230
|
+
|
|
231
|
+
# Trim or pad to exactly match the input frame length
|
|
232
|
+
target = frame.samples_per_channel
|
|
233
|
+
if len(out_samples) > target:
|
|
234
|
+
out_samples = out_samples[:target]
|
|
235
|
+
elif len(out_samples) < target:
|
|
236
|
+
out_samples = np.pad(out_samples, (0, target - len(out_samples)))
|
|
237
|
+
|
|
238
|
+
# Restore original channel count (duplicate mono → stereo if needed)
|
|
239
|
+
if frame.num_channels > 1:
|
|
240
|
+
out_samples = np.repeat(out_samples, frame.num_channels)
|
|
241
|
+
|
|
242
|
+
out_int16 = (np.clip(out_samples, -1.0, 1.0) * 32767.0).astype(np.int16)
|
|
243
|
+
return rtc.AudioFrame(
|
|
244
|
+
data=out_int16.tobytes(),
|
|
245
|
+
sample_rate=frame.sample_rate,
|
|
246
|
+
num_channels=frame.num_channels,
|
|
247
|
+
samples_per_channel=frame.samples_per_channel,
|
|
248
|
+
)
|
|
249
|
+
|
|
250
|
+
def _close(self) -> None:
|
|
251
|
+
self._sess1 = None # type: ignore[assignment]
|
|
252
|
+
self._sess2 = None # type: ignore[assignment]
|
|
253
|
+
|
|
254
|
+
# ------------------------------------------------------------------ #
|
|
255
|
+
# Internal inference #
|
|
256
|
+
# ------------------------------------------------------------------ #
|
|
257
|
+
|
|
258
|
+
def _infer_block(self, block: np.ndarray) -> np.ndarray:
|
|
259
|
+
"""Run one 512-sample block through both DTLN models.
|
|
260
|
+
|
|
261
|
+
Returns a 512-sample denoised block (float32, range ~[-1, 1]).
|
|
262
|
+
"""
|
|
263
|
+
# --- Model 1: spectral masking ---
|
|
264
|
+
spec = np.fft.rfft(block) # complex128, shape (257,)
|
|
265
|
+
mag = np.abs(spec).reshape(1, 1, _N_BINS).astype(np.float32)
|
|
266
|
+
|
|
267
|
+
out1 = self._sess1.run(None, {
|
|
268
|
+
self._m1_in_mag: mag,
|
|
269
|
+
self._m1_in_state: self._state1,
|
|
270
|
+
})
|
|
271
|
+
mask = out1[0].reshape(_N_BINS) # (257,)
|
|
272
|
+
self._state1 = out1[1]
|
|
273
|
+
|
|
274
|
+
# Apply mask in frequency domain, reconstruct time domain
|
|
275
|
+
enhanced_spec = mask * spec # element-wise, preserves phase
|
|
276
|
+
enhanced_time = np.fft.irfft(enhanced_spec, n=_BLOCK_LEN).astype(np.float32)
|
|
277
|
+
|
|
278
|
+
# --- Model 2: time-domain refinement ---
|
|
279
|
+
time_in = enhanced_time.reshape(1, 1, _BLOCK_LEN).astype(np.float32)
|
|
280
|
+
out2 = self._sess2.run(None, {
|
|
281
|
+
self._m2_in_time: time_in,
|
|
282
|
+
self._m2_in_state: self._state2,
|
|
283
|
+
})
|
|
284
|
+
denoised = out2[0].reshape(_BLOCK_LEN) # (512,)
|
|
285
|
+
self._state2 = out2[1]
|
|
286
|
+
|
|
287
|
+
return denoised
|
|
@@ -0,0 +1,228 @@
Metadata-Version: 2.4
Name: livekit-plugins-dtln
Version: 0.1.0
Summary: DTLN noise suppression plugin for LiveKit Agents — self-hosted, in-process, no cloud API
Author-email: "Aloware, Inc." <dev@aloware.com>
License: MIT License

Copyright (c) 2024 Aloware, Inc.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Project-URL: Homepage, https://github.com/aloware/livekit-plugins-dtln
Project-URL: Repository, https://github.com/aloware/livekit-plugins-dtln
Project-URL: Bug Tracker, https://github.com/aloware/livekit-plugins-dtln/issues
Project-URL: Demo, https://aloware.github.io/livekit-plugins-dtln/
Keywords: livekit,noise-suppression,noise-cancellation,dtln,onnx,speech,audio,voip
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Multimedia :: Sound/Audio :: Speech
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: livekit>=1.1.0
Requires-Dist: livekit-agents>=1.4.4
Requires-Dist: onnxruntime>=1.17.0
Requires-Dist: numpy>=1.26.0
Dynamic: license-file

# livekit-plugins-dtln

Python [LiveKit](https://livekit.io) plugin for **DTLN** (Dual-Signal Transformation LSTM Network) noise suppression — a fully self-hosted, open-source alternative to cloud-based noise cancellation services such as Krisp or AI-coustics.

Runs entirely **in-process** using [ONNX Runtime](https://onnxruntime.ai). No cloud API, no per-minute fees, no proprietary binaries. Works with **self-hosted LiveKit** servers.

> Based on [Westhausen & Meyer, "Noise Reduction with DTLN", Interspeech 2020](https://www.isca-archive.org/interspeech_2020/westhausen20_interspeech.html)
> Original implementation: [github.com/breizhn/DTLN](https://github.com/breizhn/DTLN)

**[Live audio comparison demo →](https://aloware.github.io/livekit-plugins-dtln/)**

---

## Why DTLN?

| | DTLN (this plugin) | Krisp / AI-coustics |
|---|---|---|
| **Hosting** | Self-hosted, in-process | Cloud API required |
| **Cost** | Free (open weights) | Per-minute billing |
| **LiveKit** | Works with self-hosted | Requires LiveKit Cloud |
| **Latency** | ~8 ms (one block shift) | Network round-trip |
| **Privacy** | Audio never leaves your server | Audio sent to a third party |
| **Real-time factor** | ~0.05× (20× faster than real-time) | Varies |

---

## Installation

**pip:**

```bash
pip install livekit-plugins-dtln
```

**requirements.txt:**

```
livekit-plugins-dtln
```

**From source:**

```bash
git clone https://github.com/aloware/livekit-plugins-dtln.git
pip install -e ./livekit-plugins-dtln
```

> The pretrained ONNX model weights (~4 MB) are bundled in the PyPI wheel — no separate download step is needed.

---

## Usage

### Session pipeline (recommended)

```python
from livekit.agents import room_io
from livekit.plugins import dtln

await session.start(
    # ...,
    room_options=room_io.RoomOptions(
        audio_input=room_io.AudioInputOptions(
            noise_cancellation=dtln.noise_suppression(),
        ),
    ),
)
```

### Custom AudioStream

```python
from livekit import rtc
from livekit.plugins import dtln

stream = rtc.AudioStream.from_track(
    track=track,
    noise_cancellation=dtln.noise_suppression(),
)
```

> **Note:** Create one `dtln.noise_suppression()` instance **per session**. Each instance holds stateful LSTM hidden states that must be scoped to a single call.

> **Note:** DTLN is trained on raw microphone audio. Do not chain it with another noise cancellation model — applying two models in series can distort speech and degrade suppression quality.

### Custom model paths

```python
dtln.noise_suppression(
    model_1_path="/path/to/model_1.onnx",
    model_2_path="/path/to/model_2.onnx",
)
```

---

## Requirements

- Python >= 3.10
- livekit >= 1.1.0
- livekit-agents >= 1.4.4
- onnxruntime >= 1.17.0
- numpy >= 1.26.0

---

## How It Works

DTLN uses two sequential LSTM-based models:

1. **Model 1 — Spectral masking**: Computes the magnitude spectrum of a 32 ms window, runs it through an LSTM to produce a spectral mask, applies the mask in the frequency domain (preserving phase), and reconstructs the time-domain signal via IFFT.

2. **Model 2 — Time-domain refinement**: Refines the output of Model 1 with a second LSTM that operates directly on the waveform, capturing residual artifacts that spectral processing misses.

The two models are chained: Model 1's output feeds Model 2. Both LSTMs are stateful — their hidden states persist across audio frames, giving the network temporal context across the full duration of a call.
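The spectral-masking step can be demonstrated numerically. In the sketch below, a hand-made attenuation mask stands in for the LSTM output (the real mask comes from `model_1.onnx`); because the mask is real-valued, multiplying the spectrum by it scales magnitudes but leaves every bin's phase untouched:

```python
import numpy as np

# One 512-sample block of a 440 Hz tone at 16 kHz (32 ms window)
block = np.sin(2 * np.pi * 440 * np.arange(512) / 16000).astype(np.float32)

spec = np.fft.rfft(block)              # 257 complex bins for a 512-sample block
mask = np.ones(257)
mask[100:] = 0.1                       # pretend the LSTM attenuates high bins

enhanced_spec = mask * spec            # real-valued mask: magnitude only
enhanced = np.fft.irfft(enhanced_spec, n=512)

# Phase of every bin is unchanged; only magnitudes differ.
nz = np.abs(spec) > 1e-9
print(np.allclose(np.angle(enhanced_spec[nz]), np.angle(spec[nz])))
```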

**Signal flow:**

```
Input frame (any sample rate, any channels)
  → downsample to 16 kHz mono
  → overlap-add loop (512-sample window, 128-sample shift)
  → FFT → magnitude → Model 1 (spectral mask) → masked IFFT
  → Model 2 (time-domain refinement)
  → upsample back to original sample rate
  → restore original channel count
→ Denoised output frame
```

The overlap-add synthesis uses 75% overlap (512-sample window, 128-sample shift), identical to the original DTLN paper. This gives ~8 ms of algorithmic latency at 16 kHz.
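The overlap-add loop can be sketched stand-alone. This is a simplified version with an identity "denoiser" in place of the ONNX models (`overlap_add` and `denoise` are illustrative names, not part of the plugin's API); with 4× overlap and a flat window, each sample is summed four times, so an identity model scaled by 1/4 reconstructs the input after three hops of delay:

```python
import numpy as np

BLOCK_LEN, BLOCK_SHIFT = 512, 128

def overlap_add(samples: np.ndarray, denoise) -> np.ndarray:
    """Process samples in 128-sample hops with a 512-sample sliding window."""
    in_buf = np.zeros(BLOCK_LEN, dtype=np.float32)
    out_buf = np.zeros(BLOCK_LEN, dtype=np.float32)
    out = []
    for i in range(0, len(samples) - BLOCK_SHIFT + 1, BLOCK_SHIFT):
        # Slide 128 new samples into the 512-sample analysis window
        in_buf[:-BLOCK_SHIFT] = in_buf[BLOCK_SHIFT:]
        in_buf[-BLOCK_SHIFT:] = samples[i:i + BLOCK_SHIFT]

        denoised = denoise(in_buf)  # 512-sample block in, 512 out

        # Shift the synthesis buffer, clear the tail, accumulate the block
        out_buf[:-BLOCK_SHIFT] = out_buf[BLOCK_SHIFT:]
        out_buf[-BLOCK_SHIFT:] = 0.0
        out_buf += denoised

        # The first 128 samples of the buffer are now fully summed
        out.append(out_buf[:BLOCK_SHIFT].copy())
    return np.concatenate(out)

x = np.random.default_rng(0).standard_normal(1280).astype(np.float32)
y = overlap_add(x, lambda b: b / 4.0)
# Input reappears after 3 hops (384 samples) of algorithmic delay
print(np.allclose(y[384:], x[:896], atol=1e-4))
```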

---

## Performance

Benchmarked on an Apple M3 Pro, processing 16 kHz mono audio:

| Metric | Value |
|---|---|
| Steady-state latency per block | ~0.7 ms |
| Real-time factor | ~0.05× |
| Headroom vs real-time | ~20× |
| Cold-start (first inference) | ~500 ms (amortized by warmup in `__init__`) |

The `__init__` method runs a dummy forward pass to trigger ONNX Runtime's one-time graph optimization and kernel initialization before the first real audio frame arrives, eliminating the cold-start stall.
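The real-time factor above is per-block processing time divided by per-block audio time (128 samples = 8 ms at 16 kHz). A rough way to measure it, with a trivial FFT round-trip standing in for the actual ONNX inference (real numbers depend on the models and hardware):

```python
import time
import numpy as np

BLOCK_SHIFT = 128                             # samples advanced per hop
SAMPLE_RATE = 16_000                          # Hz
block_duration = BLOCK_SHIFT / SAMPLE_RATE    # 8 ms of audio per hop

def fake_infer(block: np.ndarray) -> np.ndarray:
    # Stand-in for the two ONNX sessions; real inference dominates this cost.
    return np.fft.irfft(np.fft.rfft(block), n=len(block)).astype(np.float32)

block = np.zeros(512, dtype=np.float32)
n = 1000
start = time.perf_counter()
for _ in range(n):
    fake_infer(block)
elapsed = time.perf_counter() - start

rtf = (elapsed / n) / block_duration   # processing time / audio time
print(f"real-time factor: {rtf:.4f}")  # < 1.0 means faster than real time
```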

---

## Models

The pretrained weights are the official DTLN models published by the original authors:

| File | Source |
|---|---|
| `model_1.onnx` | [breizhn/DTLN · pretrained_model/](https://github.com/breizhn/DTLN/tree/master/pretrained_model) |
| `model_2.onnx` | [breizhn/DTLN · pretrained_model/](https://github.com/breizhn/DTLN/tree/master/pretrained_model) |

The models are not committed to this repository (to keep it lightweight). They are downloaded automatically by `python agent.py download-files` or by calling `download_models()` directly.

---

## References

- **Original DTLN paper**: [Westhausen & Meyer, "Noise Reduction with DTLN", Interspeech 2020](https://www.isca-archive.org/interspeech_2020/westhausen20_interspeech.html)
- **Original DTLN implementation & pretrained models**: [github.com/breizhn/DTLN](https://github.com/breizhn/DTLN)
- **Datadog engineering article** — the inspiration for this plugin: [Building a Real-Time Noise Suppression Library](https://www.datadoghq.com/blog/engineering/noise-suppression-library/)
- **LiveKit noise cancellation overview**: [docs.livekit.io — Noise Cancellation](https://docs.livekit.io/transport/media/noise-cancellation/)
- **LiveKit Agents SDK**: [github.com/livekit/agents](https://github.com/livekit/agents)
- **ONNX Runtime**: [onnxruntime.ai](https://onnxruntime.ai)

---

## License

The plugin code in this repository is released under the **MIT License**.

The pretrained DTLN model weights are published by the original authors under the **MIT License** — see [breizhn/DTLN](https://github.com/breizhn/DTLN/blob/master/LICENSE).
@@ -0,0 +1,12 @@
LICENSE
README.md
pyproject.toml
src/livekit/plugins/dtln/__init__.py
src/livekit/plugins/dtln/noise_suppressor.py
src/livekit/plugins/dtln/models/model_1.onnx
src/livekit/plugins/dtln/models/model_2.onnx
src/livekit_plugins_dtln.egg-info/PKG-INFO
src/livekit_plugins_dtln.egg-info/SOURCES.txt
src/livekit_plugins_dtln.egg-info/dependency_links.txt
src/livekit_plugins_dtln.egg-info/requires.txt
src/livekit_plugins_dtln.egg-info/top_level.txt

@@ -0,0 +1 @@

@@ -0,0 +1 @@
livekit