ensemble-pitch-extractor 0.1.0__tar.gz
- ensemble_pitch_extractor-0.1.0/LICENSE +21 -0
- ensemble_pitch_extractor-0.1.0/MANIFEST.in +1 -0
- ensemble_pitch_extractor-0.1.0/PKG-INFO +245 -0
- ensemble_pitch_extractor-0.1.0/README.md +215 -0
- ensemble_pitch_extractor-0.1.0/README.zh-CN.md +215 -0
- ensemble_pitch_extractor-0.1.0/ensemble_pitch_extractor/__init__.py +26 -0
- ensemble_pitch_extractor-0.1.0/ensemble_pitch_extractor/api.py +391 -0
- ensemble_pitch_extractor-0.1.0/ensemble_pitch_extractor/cli.py +189 -0
- ensemble_pitch_extractor-0.1.0/ensemble_pitch_extractor/core.py +521 -0
- ensemble_pitch_extractor-0.1.0/ensemble_pitch_extractor/plotting.py +292 -0
- ensemble_pitch_extractor-0.1.0/ensemble_pitch_extractor.egg-info/PKG-INFO +245 -0
- ensemble_pitch_extractor-0.1.0/ensemble_pitch_extractor.egg-info/SOURCES.txt +16 -0
- ensemble_pitch_extractor-0.1.0/ensemble_pitch_extractor.egg-info/dependency_links.txt +1 -0
- ensemble_pitch_extractor-0.1.0/ensemble_pitch_extractor.egg-info/entry_points.txt +2 -0
- ensemble_pitch_extractor-0.1.0/ensemble_pitch_extractor.egg-info/requires.txt +5 -0
- ensemble_pitch_extractor-0.1.0/ensemble_pitch_extractor.egg-info/top_level.txt +1 -0
- ensemble_pitch_extractor-0.1.0/pyproject.toml +57 -0
- ensemble_pitch_extractor-0.1.0/setup.cfg +4 -0
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 ensemble-pitch-extractor contributors
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
@@ -0,0 +1 @@
+include README.zh-CN.md
@@ -0,0 +1,245 @@
+Metadata-Version: 2.4
+Name: ensemble-pitch-extractor
+Version: 0.1.0
+Summary: FCPE TTA and pYIN ensemble pitch extraction for singing voice.
+Author: ensemble-pitch-extractor contributors
+License: MIT
+Project-URL: Homepage, https://github.com/qiuqiao/ensemble-pitch-extractor
+Project-URL: Repository, https://github.com/qiuqiao/ensemble-pitch-extractor
+Project-URL: Issues, https://github.com/qiuqiao/ensemble-pitch-extractor/issues
+Keywords: pitch-extraction,f0,fundamental-frequency,singing-voice,fcpe,pyin,test-time-augmentation,dynamic-programming,viterbi,audio,music-information-retrieval
+Classifier: Development Status :: 3 - Alpha
+Classifier: Intended Audience :: Developers
+Classifier: Intended Audience :: Science/Research
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Topic :: Multimedia :: Sound/Audio :: Analysis
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Requires-Python: >=3.12
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: librosa>=0.11.0
+Requires-Dist: matplotlib>=3.10.9
+Requires-Dist: numpy>=1.26
+Requires-Dist: torch>=2.1
+Requires-Dist: torchfcpe>=0.0.4
+Dynamic: license-file
+
+[中文](README.zh-CN.md)|English
+
+# Ensemble Pitch Extractor
+
+[](https://opensource.org/licenses/MIT)
+[](https://www.python.org/downloads/)
+
+Ensemble Pitch Extractor is a singing-voice F0 extractor that combines FCPE test-time augmentation with a pYIN high-frequency candidate in a dynamic programming decoder. It is designed for ordinary singing, high notes, and whistle-register material where a single extractor may fail.
+
+The package provides:
+
+- a Python API for extracting F0 from waveforms or audio files;
+- a command-line interface that saves `.npy` F0 tracks and optional `.png` diagnostic plots;
+- an ensemble decoder that selects a smooth candidate path instead of averaging incompatible F0 estimates.
+
+## Demonstration
+
+*Audio samples sourced from the internet.*
+
+<table>
+<tr>
+<td><img src="assets/胸转哨.png" alt="Chest-to-Whistle"></td>
+<td><img src="assets/大颤音.png" alt="Large Vibrato"></td>
+</tr>
+<tr>
+<td><img src="assets/带噪声高音.png" alt="Noisy High Notes"></td>
+<td><img src="assets/低音.png" alt="Low Notes"></td>
+</tr>
+</table>
+
+## Installation
+
+```bash
+pip install ensemble-pitch-extractor
+```
+
+For local development:
+
+```bash
+uv sync
+uv run ensemble-pitch-extractor --help
+```
+
+Python 3.12 or newer is required.
+
+## Command Line
+
+Extract F0 from one audio file:
+
+```bash
+ensemble-pitch-extractor input.wav -o f0_out
+```
+
+This writes:
+
+```text
+f0_out/input.npy
+```
+
+The `.npy` file is a one-dimensional `float32` array in Hz. Unvoiced frames are stored as `0`.
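A saved track can be inspected with plain NumPy. The following is a minimal sketch that relies only on the convention stated above (Hz values, unvoiced frames stored as `0`); `summarize_f0` is a hypothetical helper and not part of the package.

```python
import numpy as np


def summarize_f0(f0: np.ndarray) -> dict:
    """Summarize an F0 track in Hz where unvoiced frames are stored as 0."""
    f0 = np.asarray(f0, dtype=np.float32)
    voiced = f0 > 0  # unvoiced frames are exactly 0
    return {
        "frames": int(f0.size),
        "voiced_ratio": float(voiced.mean()) if f0.size else 0.0,
        "median_hz": float(np.median(f0[voiced])) if voiced.any() else float("nan"),
    }


# In practice the array would come from np.load("f0_out/input.npy");
# a small synthetic track keeps this example self-contained.
demo = np.array([0.0, 220.0, 440.0, 0.0], dtype=np.float32)
stats = summarize_f0(demo)
print(stats)
```

Masking with `f0 > 0` before computing statistics keeps the unvoiced zeros from dragging the median toward 0 Hz.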
+
+Save a plot of F0 overlaid on a mel spectrogram:
+
+```bash
+ensemble-pitch-extractor input.wav -o f0_out --plot
+```
+
+Process all supported audio files in a directory:
+
+```bash
+ensemble-pitch-extractor audio_dir -o f0_out --plot
+```
+
+CUDA is auto-detected by default. To force a specific device:
+
+```bash
+ensemble-pitch-extractor audio_dir -o f0_out --device cpu
+```
+
+Control GPU memory with `--max-batch-length` (default 480000 samples ≈ 30s):
+
+```bash
+ensemble-pitch-extractor audio_dir -o f0_out --max-batch-length 200000
+```
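The "480000 samples ≈ 30s" equivalence corresponds to a 16 kHz sample rate (480000 / 16000 = 30). The sketch below converts a batch length to seconds under that assumption; the 16 kHz figure is inferred from the stated equivalence rather than quoted from the package, and `batch_seconds` is a hypothetical helper.

```python
def batch_seconds(max_batch_length: int, sample_rate: int = 16000) -> float:
    """Convert a batch length in samples to seconds.

    The 16 kHz default is an assumption inferred from "480000 samples ≈ 30s".
    """
    return max_batch_length / sample_rate


default_s = batch_seconds(480000)  # the documented default
reduced_s = batch_seconds(200000)  # the value used in the command above
print(default_s, reduced_s)
```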
+
+Disable pYIN and use FCPE TTA only:
+
+```bash
+ensemble-pitch-extractor input.wav -o f0_out --no-pyin
+```
+
+Useful options:
+
+```text
+--f0-min 80
+--f0-max 4000
+--max-batch-length 480000
+--device auto
+--pyin-priority-min-f0 1300
+--pyin-fcpe-close-semitones 1.0
+--interp-uv
+--recursive
+```
+
+## Python API
+
+```python
+from ensemble_pitch_extractor import extract_f0_from_file, load_model
+
+model = load_model()  # auto-detects CUDA, or pass device="cpu"
+result = extract_f0_from_file(
+    "input.wav",
+    model=model,
+    save_npy="f0_out/input.npy",
+    save_plot="f0_out/input.png",
+    f0_min=80,
+    f0_max=4000,
+)
+
+f0 = result.f0  # Hz, shape: (frames,)
+times = result.times  # seconds, shape: (frames,)
+```
+
+For audio already in memory:
+
+```python
+import librosa
+from ensemble_pitch_extractor import extract_f0, load_model
+
+model = load_model()
+sr = model.get_model_sr()
+audio, _ = librosa.load("input.wav", sr=sr, mono=True)
+f0 = extract_f0(audio, sr, model, f0_min=80, f0_max=4000)
+```
+
+For torch tensor input (padded batch or concatenated, supports GPU):
+
+```python
+import torch
+from ensemble_pitch_extractor import extract_f0_from_tensor, load_model
+
+model = load_model("cuda")
+
+# padded batch: (batch=4, samples) with fixed-length clips
+wav = torch.randn(4, 16000, device="cuda")
+f0 = extract_f0_from_tensor(wav, sr=16000, model=model)  # (4, frames)
+
+# concatenated: clips of different lengths, no padding waste
+wavs = [torch.randn(8000, device="cuda"), torch.randn(12000, device="cuda")]
+lengths = [len(w) for w in wavs]
+concat = torch.cat(wavs)
+f0 = extract_f0_from_tensor(concat, sr=16000, model=model, lengths=lengths,
+                            max_batch_length=20000)  # (2, max_frames), NaN padded
+```
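Because the concatenated mode returns a NaN-padded `(clips, max_frames)` array, per-clip tracks have to be recovered by dropping the padding. A minimal sketch, assuming NaN appears only as padding (unvoiced frames remain `0`) and that the result has already been moved to NumPy (for a torch tensor, `.cpu().numpy()` would do this); `strip_nan_padding` is a hypothetical helper, not part of the package.

```python
import numpy as np


def strip_nan_padding(f0_batch: np.ndarray) -> list:
    """Split a NaN-padded (clips, max_frames) array into per-clip 1-D arrays."""
    rows = []
    for row in np.asarray(f0_batch, dtype=np.float32):
        rows.append(row[~np.isnan(row)])  # keep real frames, drop padding
    return rows


# Synthetic stand-in for a (2, max_frames) result: clip 0 has 2 frames, clip 1 has 3.
padded = np.array([[100.0, 0.0, np.nan],
                   [200.0, 210.0, 220.0]], dtype=np.float32)
clips = strip_nan_padding(padded)
print([c.shape for c in clips])
```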
+
+## Method Overview
+
+The decoder treats each extractor output as a candidate trajectory. Current candidates are:
+
+```text
+FCPE key shift = 0
+FCPE key shift = -12
+FCPE key shift = +12
+pYIN
+```
+
+For an FCPE candidate with key shift $s$, the model output is mapped back to the original pitch space before fusion:
+
+$$
+\hat f_{t,s} = \frac{f_{t,s}}{2^{s/12}} .
+$$
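The mapping above is a single division per frame. A minimal NumPy sketch of it follows; `unshift_f0` is a hypothetical name, and the snippet illustrates the formula rather than reproducing the package's code.

```python
import numpy as np


def unshift_f0(f0_shifted, key_shift: float) -> np.ndarray:
    """Map F0 estimated on key-shifted audio back to the original pitch space.

    key_shift is in semitones; 0 Hz (unvoiced) frames stay at 0.
    """
    f0 = np.asarray(f0_shifted, dtype=np.float64)
    return f0 / (2.0 ** (key_shift / 12.0))


up = unshift_f0([880.0, 0.0], key_shift=12)  # one octave up -> divide by 2
down = unshift_f0([220.0], key_shift=-12)    # one octave down -> multiply by 2
print(up, down)
```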
+
+pYIN is included as an ultra-high frequency candidate. By default it only searches 1300–4000 Hz, and frames below 1300 Hz are discarded. This prevents pYIN from replacing FCPE in normal ranges where FCPE usually captures finer detail.
+
+All candidates are converted to MIDI note space:
+
+$$
+n_{t,k}=69+12\log_2\frac{f_{t,k}}{440}.
+$$
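The conversion is undefined at 0 Hz, so unvoiced frames need separate handling. The sketch below marks them as NaN, which is an assumption made for illustration; the package may represent unvoiced candidates differently, and `hz_to_midi` is a hypothetical helper.

```python
import numpy as np


def hz_to_midi(f0_hz) -> np.ndarray:
    """Convert Hz to fractional MIDI note numbers; unvoiced (<= 0) becomes NaN."""
    f0 = np.asarray(f0_hz, dtype=np.float64)
    midi = np.full(f0.shape, np.nan)
    voiced = f0 > 0
    midi[voiced] = 69.0 + 12.0 * np.log2(f0[voiced] / 440.0)
    return midi


midi = hz_to_midi([440.0, 880.0, 0.0])
print(midi)
```

In this space one octave is always 12 units, whether it spans 110 to 220 Hz or 1760 to 3520 Hz.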
+
+The final path is selected by dynamic programming:
+
+$$
+\pi^*=\arg\min_\pi \sum_t U_t(\pi_t)+\sum_{t=1}^{T-1} C_t(\pi_{t-1},\pi_t).
+$$
+
+Here $U_t(k)$ is a per-frame candidate prior and $C_t(i,k)$ is a transition cost. This formulation avoids averaging octave errors, half-frequency errors, and algorithm-specific mistakes into spurious intermediate pitches.
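An objective of this form can be minimized exactly with a Viterbi-style recursion. The following is a minimal sketch, assuming the unary costs are supplied as a `(T, K)` array and the transition costs as a `(T-1, K, K)` array; `decode_path` and this array layout are illustrative choices, not the package's API or its actual cost functions.

```python
import numpy as np


def decode_path(unary, pairwise) -> np.ndarray:
    """Return the minimum-cost candidate index per frame.

    unary:    (T, K) array, unary[t, k] = U_t(k)
    pairwise: (T-1, K, K) array, pairwise[t-1, i, k] = C_t(i, k)
    """
    unary = np.asarray(unary, dtype=np.float64)
    pairwise = np.asarray(pairwise, dtype=np.float64)
    T, K = unary.shape
    cost = unary[0].copy()                   # best cost of paths ending in each k
    back = np.zeros((T, K), dtype=np.int64)  # best predecessor for each (t, k)
    for t in range(1, T):
        total = cost[:, None] + pairwise[t - 1]  # rows: previous i, cols: current k
        back[t] = np.argmin(total, axis=0)
        cost = total[back[t], np.arange(K)] + unary[t]
    path = np.empty(T, dtype=np.int64)
    path[-1] = int(np.argmin(cost))
    for t in range(T - 1, 0, -1):            # follow back-pointers
        path[t - 1] = back[t, path[t]]
    return path


# Toy problem: 3 frames, 2 candidates; frame 1 locally prefers candidate 1.
u = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
sticky = np.tile(np.array([[0.0, 5.0], [5.0, 0.0]]), (2, 1, 1))  # switching costs 5
free = np.zeros((2, 2, 2))                                       # switching is free
print(decode_path(u, sticky), decode_path(u, free))
```

With a high switching cost the decoder stays on one candidate despite the local preference at frame 1; with free switching it follows the per-frame minimum.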
+
+## Heuristics as Priors
+
+The implementation uses the following structured priors:
+
+- MIDI-space costs make equal musical intervals comparable across frequency ranges.
+- UV penalty discourages fragmented voiced/unvoiced paths.
+- Octave-aware jump cost allows one-, two-, and three-octave transitions, which are important for chest-to-whistle jumps.
+- FCPE `+12` receives a low-pitch prior below E2.
+- FCPE `-12` receives a high-pitch prior above D5.
+- pYIN receives a high-frequency prior only when it is above 1300 Hz and more than one semitone away from every FCPE candidate.
+- RMS energy gating removes false voiced output during silence after decoding.
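The thresholds in the list above mix note names and Hz. To compare them on one scale, the sketch below converts each, assuming scientific pitch notation (A4 = MIDI 69 = 440 Hz, so E2 = MIDI 40 and D5 = MIDI 74); the helper names are hypothetical.

```python
import math


def midi_to_hz(note: float) -> float:
    """Convert a MIDI note number to Hz (A4 = MIDI 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((note - 69.0) / 12.0)


def hz_to_midi_scalar(freq_hz: float) -> float:
    """Convert a positive frequency in Hz to a fractional MIDI note number."""
    return 69.0 + 12.0 * math.log2(freq_hz / 440.0)


e2_hz = midi_to_hz(40)                      # low-pitch prior boundary
d5_hz = midi_to_hz(74)                      # high-pitch prior boundary
pyin_floor_midi = hz_to_midi_scalar(1300.0)  # pYIN priority floor
print(round(e2_hz, 2), round(d5_hz, 2), round(pyin_floor_midi, 2))
```

Under this convention E2 is about 82.41 Hz, D5 is about 587.33 Hz, and 1300 Hz sits just below E6 (MIDI 88).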
+
+The default candidate order is `FCPE 0`, `FCPE -12`, `FCPE +12`, `pYIN`, so that exact ties prefer FCPE over pYIN.
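Candidate order matters when ties are resolved by a first-minimum rule. As an illustration, and assuming the decoder uses something equivalent to `np.argmin` (which returns the first index among equal minima), a tie between `FCPE 0` and `pYIN` goes to `FCPE 0`:

```python
import numpy as np

# Candidate order as documented; the costs are made up to force an exact tie.
CANDIDATES = ["FCPE 0", "FCPE -12", "FCPE +12", "pYIN"]
costs = np.array([1.0, 2.0, 2.0, 1.0])  # FCPE 0 and pYIN tie at the minimum

winner = CANDIDATES[int(np.argmin(costs))]  # argmin keeps the first minimum
print(winner)
```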
+
+## Build and Publish
+
+```bash
+uv lock --python 3.12
+uv build
+uv publish
+```
+
+With a PyPI token:
+
+```bash
+uv publish --token "pypi-..."
+```
@@ -0,0 +1,215 @@
+[中文](README.zh-CN.md)|English
+
+# Ensemble Pitch Extractor
+
+[](https://opensource.org/licenses/MIT)
+[](https://www.python.org/downloads/)
+
+Ensemble Pitch Extractor is a singing-voice F0 extractor that combines FCPE test-time augmentation with a pYIN high-frequency candidate in a dynamic programming decoder. It is designed for ordinary singing, high notes, and whistle-register material where a single extractor may fail.
+
+The package provides:
+
+- a Python API for extracting F0 from waveforms or audio files;
+- a command-line interface that saves `.npy` F0 tracks and optional `.png` diagnostic plots;
+- an ensemble decoder that selects a smooth candidate path instead of averaging incompatible F0 estimates.
+
+## Demonstration
+
+*Audio samples sourced from the internet.*
+
+<table>
+<tr>
+<td><img src="assets/胸转哨.png" alt="Chest-to-Whistle"></td>
+<td><img src="assets/大颤音.png" alt="Large Vibrato"></td>
+</tr>
+<tr>
+<td><img src="assets/带噪声高音.png" alt="Noisy High Notes"></td>
+<td><img src="assets/低音.png" alt="Low Notes"></td>
+</tr>
+</table>
+
+## Installation
+
+```bash
+pip install ensemble-pitch-extractor
+```
+
+For local development:
+
+```bash
+uv sync
+uv run ensemble-pitch-extractor --help
+```
+
+Python 3.12 or newer is required.
+
+## Command Line
+
+Extract F0 from one audio file:
+
+```bash
+ensemble-pitch-extractor input.wav -o f0_out
+```
+
+This writes:
+
+```text
+f0_out/input.npy
+```
+
+The `.npy` file is a one-dimensional `float32` array in Hz. Unvoiced frames are stored as `0`.
+
+Save a plot of F0 overlaid on a mel spectrogram:
+
+```bash
+ensemble-pitch-extractor input.wav -o f0_out --plot
+```
+
+Process all supported audio files in a directory:
+
+```bash
+ensemble-pitch-extractor audio_dir -o f0_out --plot
+```
+
+CUDA is auto-detected by default. To force a specific device:
+
+```bash
+ensemble-pitch-extractor audio_dir -o f0_out --device cpu
+```
+
+Control GPU memory with `--max-batch-length` (default 480000 samples ≈ 30s):
+
+```bash
+ensemble-pitch-extractor audio_dir -o f0_out --max-batch-length 200000
+```
+
+Disable pYIN and use FCPE TTA only:
+
+```bash
+ensemble-pitch-extractor input.wav -o f0_out --no-pyin
+```
+
+Useful options:
+
+```text
+--f0-min 80
+--f0-max 4000
+--max-batch-length 480000
+--device auto
+--pyin-priority-min-f0 1300
+--pyin-fcpe-close-semitones 1.0
+--interp-uv
+--recursive
+```
+
+## Python API
+
+```python
+from ensemble_pitch_extractor import extract_f0_from_file, load_model
+
+model = load_model()  # auto-detects CUDA, or pass device="cpu"
+result = extract_f0_from_file(
+    "input.wav",
+    model=model,
+    save_npy="f0_out/input.npy",
+    save_plot="f0_out/input.png",
+    f0_min=80,
+    f0_max=4000,
+)
+
+f0 = result.f0  # Hz, shape: (frames,)
+times = result.times  # seconds, shape: (frames,)
+```
+
+For audio already in memory:
+
+```python
+import librosa
+from ensemble_pitch_extractor import extract_f0, load_model
+
+model = load_model()
+sr = model.get_model_sr()
+audio, _ = librosa.load("input.wav", sr=sr, mono=True)
+f0 = extract_f0(audio, sr, model, f0_min=80, f0_max=4000)
+```
+
+For torch tensor input (padded batch or concatenated, supports GPU):
+
+```python
+import torch
+from ensemble_pitch_extractor import extract_f0_from_tensor, load_model
+
+model = load_model("cuda")
+
+# padded batch: (batch=4, samples) with fixed-length clips
+wav = torch.randn(4, 16000, device="cuda")
+f0 = extract_f0_from_tensor(wav, sr=16000, model=model)  # (4, frames)
+
+# concatenated: clips of different lengths, no padding waste
+wavs = [torch.randn(8000, device="cuda"), torch.randn(12000, device="cuda")]
+lengths = [len(w) for w in wavs]
+concat = torch.cat(wavs)
+f0 = extract_f0_from_tensor(concat, sr=16000, model=model, lengths=lengths,
+                            max_batch_length=20000)  # (2, max_frames), NaN padded
+```
+
+## Method Overview
+
+The decoder treats each extractor output as a candidate trajectory. Current candidates are:
+
+```text
+FCPE key shift = 0
+FCPE key shift = -12
+FCPE key shift = +12
+pYIN
+```
+
+For an FCPE candidate with key shift $s$, the model output is mapped back to the original pitch space before fusion:
+
+$$
+\hat f_{t,s} = \frac{f_{t,s}}{2^{s/12}} .
+$$
+
+pYIN is included as an ultra-high frequency candidate. By default it only searches 1300–4000 Hz, and frames below 1300 Hz are discarded. This prevents pYIN from replacing FCPE in normal ranges where FCPE usually captures finer detail.
+
+All candidates are converted to MIDI note space:
+
+$$
+n_{t,k}=69+12\log_2\frac{f_{t,k}}{440}.
+$$
+
+The final path is selected by dynamic programming:
+
+$$
+\pi^*=\arg\min_\pi \sum_t U_t(\pi_t)+\sum_{t=1}^{T-1} C_t(\pi_{t-1},\pi_t).
+$$
+
+Here $U_t(k)$ is a per-frame candidate prior and $C_t(i,k)$ is a transition cost. This formulation avoids averaging octave errors, half-frequency errors, and algorithm-specific mistakes into spurious intermediate pitches.
+
+## Heuristics as Priors
+
+The implementation uses the following structured priors:
+
+- MIDI-space costs make equal musical intervals comparable across frequency ranges.
+- UV penalty discourages fragmented voiced/unvoiced paths.
+- Octave-aware jump cost allows one-, two-, and three-octave transitions, which are important for chest-to-whistle jumps.
+- FCPE `+12` receives a low-pitch prior below E2.
+- FCPE `-12` receives a high-pitch prior above D5.
+- pYIN receives a high-frequency prior only when it is above 1300 Hz and more than one semitone away from every FCPE candidate.
+- RMS energy gating removes false voiced output during silence after decoding.
+
+The default candidate order is `FCPE 0`, `FCPE -12`, `FCPE +12`, `pYIN`, so that exact ties prefer FCPE over pYIN.
+
+## Build and Publish
+
+```bash
+uv lock --python 3.12
+uv build
+uv publish
+```
+
+With a PyPI token:
+
+```bash
+uv publish --token "pypi-..."
+```
@@ -0,0 +1,215 @@
+Chinese|[English](README.md)
+
+# Ensemble Pitch Extractor (Chinese Documentation)
+
+Ensemble Pitch Extractor is an F0 toolkit for singing-voice pitch extraction. It feeds the FCPE test-time augmentation results and a pYIN ultra-high-pitch candidate into a single dynamic programming decoder, and is suited to ordinary singing, high notes, and whistle-register material.
+
+This package provides:
+
+- a clear Python API;
+- a command-line tool that writes `.npy` files and, optionally, `.png` diagnostic plots;
+- a decoder that, rather than averaging frame by frame, selects the F0 path with the lowest total cost among multiple candidate trajectories.
+
+## Demonstration
+
+*Audio samples sourced from the internet.*
+
+<table>
+<tr>
+<td><img src="assets/胸转哨.png" alt="Chest-to-Whistle"></td>
+<td><img src="assets/大颤音.png" alt="Large Vibrato"></td>
+</tr>
+<tr>
+<td><img src="assets/带噪声高音.png" alt="Noisy High Notes"></td>
+<td><img src="assets/低音.png" alt="Low Notes"></td>
+</tr>
+</table>
+
+## Installation
+
+```bash
+pip install ensemble-pitch-extractor
+```
+
+For local development:
+
+```bash
+uv sync
+uv run ensemble-pitch-extractor --help
+```
+
+Python 3.12 or newer is required.
+
+## Command-Line Usage
+
+Extract F0 from a single audio file:
+
+```bash
+ensemble-pitch-extractor input.wav -o f0_out
+```
+
+Output:
+
+```text
+f0_out/input.npy
+```
+
+The `.npy` file is a one-dimensional `float32` array in Hz; unvoiced frames are `0`.
+
+Also write a plot of F0 over a mel spectrogram:
+
+```bash
+ensemble-pitch-extractor input.wav -o f0_out --plot
+```
+
+Process all audio files in a directory:
+
+```bash
+ensemble-pitch-extractor audio_dir -o f0_out --plot
+```
+
+CUDA is auto-detected by default. To specify a device manually:
+
+```bash
+ensemble-pitch-extractor audio_dir -o f0_out --device cpu
+```
+
+Control GPU memory with `--max-batch-length` (default 480000 samples ≈ 30s):
+
+```bash
+ensemble-pitch-extractor audio_dir -o f0_out --max-batch-length 200000
+```
+
+Use FCPE TTA only, without pYIN:
+
+```bash
+ensemble-pitch-extractor input.wav -o f0_out --no-pyin
+```
+
+Common options:
+
+```text
+--f0-min 80
+--f0-max 4000
+--max-batch-length 480000
+--device auto
+--pyin-priority-min-f0 1300
+--pyin-fcpe-close-semitones 1.0
+--interp-uv
+--recursive
+```
+
+## Python API
+
+```python
+from ensemble_pitch_extractor import extract_f0_from_file, load_model
+
+model = load_model()  # auto-detects CUDA; device="cpu" can also be specified
+result = extract_f0_from_file(
+    "input.wav",
+    model=model,
+    save_npy="f0_out/input.npy",
+    save_plot="f0_out/input.png",
+    f0_min=80,
+    f0_max=4000,
+)
+
+f0 = result.f0  # Hz, shape: (frames,)
+times = result.times  # seconds, shape: (frames,)
+```
+
+Audio already in memory can also be processed directly:
+
+```python
+import librosa
+from ensemble_pitch_extractor import extract_f0, load_model
+
+model = load_model()
+sr = model.get_model_sr()
+audio, _ = librosa.load("input.wav", sr=sr, mono=True)
+f0 = extract_f0(audio, sr, model, f0_min=80, f0_max=4000)
+```
+
+Torch tensors can be passed in directly (batched or concatenated, GPU supported):
+
+```python
+import torch
+from ensemble_pitch_extractor import extract_f0_from_tensor, load_model
+
+model = load_model("cuda")
+
+# batch mode: padded batch of equal-length clips
+wav = torch.randn(4, 16000, device="cuda")  # (batch=4, samples)
+f0 = extract_f0_from_tensor(wav, sr=16000, model=model)  # (4, frames)
+
+# concatenated mode: clips of different lengths joined together, no padding waste
+wavs = [torch.randn(8000, device="cuda"), torch.randn(12000, device="cuda")]
+lengths = [len(w) for w in wavs]
+concat = torch.cat(wavs)
+f0 = extract_f0_from_tensor(concat, sr=16000, model=model, lengths=lengths,
+                            max_batch_length=20000)  # (2, max_frames), padded positions are NaN
+```
+
+## Method Overview
+
+Current candidates are:
+
+```text
+FCPE key shift = 0
+FCPE key shift = -12
+FCPE key shift = +12
+pYIN
+```
+
+For an FCPE candidate with key shift \(s\), the output is shifted back to the original pitch space before fusion:
+
+\[
+\hat f_{t,s} = \frac{f_{t,s}}{2^{s/12}} .
+\]
+
+pYIN is added as an ultra-high-pitch rescue candidate. By default it searches only 1300–4000 Hz, and results below 1300 Hz are discarded, so that it does not override FCPE's more detailed trajectory in the normal range.
+
+All candidates are converted to MIDI note numbers:
+
+\[
+n_{t,k}=69+12\log_2\frac{f_{t,k}}{440}.
+\]
+
+The path with the lowest total cost is then selected by dynamic programming:
+
+\[
+\pi^\*=\arg\min_\pi
+\sum_t U_t(\pi_t)
++
+\sum_{t=1}^{T-1} C_t(\pi_{t-1},\pi_t).
+\]
+
+Here \(U_t(k)\) is the candidate prior and \(C_t(i,k)\) is the transition cost between adjacent frames. This avoids averaging octave errors, half-frequency errors, or the mistakes of different algorithms into intermediate pitches that do not exist.
+
+## Heuristic Strategies
+
+These strategies can all be understood as structured priors in the dynamic program:
+
+- MIDI-space costs put the same musical interval on the same scale across frequency ranges.
+- The UV penalty reduces frequent voiced/unvoiced breaks.
+- Octave-aware jumps allow genuine jumps of about one, two, or three octaves, which suits chest-to-whistle transitions.
+- FCPE `+12` receives a low-register prior below E2.
+- FCPE `-12` receives a mid-to-high-register prior above D5.
+- pYIN receives a rescue prior only when it is above 1300 Hz and more than 1 semitone away from every FCPE candidate.
+- After decoding, RMS energy gating removes spurious voiced output in silent segments.
+
+The candidate order is `FCPE 0`, `FCPE -12`, `FCPE +12`, `pYIN`, so exact cost ties prefer FCPE.
+
+## Build and Publish
+
+```bash
+uv lock --python 3.12
+uv build
+uv publish
+```
+
+With a PyPI token:
+
+```bash
+uv publish --token "pypi-..."
+```