ensemble-pitch-extractor 0.1.0__tar.gz

@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 ensemble-pitch-extractor contributors
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1 @@
1
+ include README.zh-CN.md
@@ -0,0 +1,245 @@
1
+ Metadata-Version: 2.4
2
+ Name: ensemble-pitch-extractor
3
+ Version: 0.1.0
4
+ Summary: FCPE TTA and pYIN ensemble pitch extraction for singing voice.
5
+ Author: ensemble-pitch-extractor contributors
6
+ License: MIT
7
+ Project-URL: Homepage, https://github.com/qiuqiao/ensemble-pitch-extractor
8
+ Project-URL: Repository, https://github.com/qiuqiao/ensemble-pitch-extractor
9
+ Project-URL: Issues, https://github.com/qiuqiao/ensemble-pitch-extractor/issues
10
+ Keywords: pitch-extraction,f0,fundamental-frequency,singing-voice,fcpe,pyin,test-time-augmentation,dynamic-programming,viterbi,audio,music-information-retrieval
11
+ Classifier: Development Status :: 3 - Alpha
12
+ Classifier: Intended Audience :: Developers
13
+ Classifier: Intended Audience :: Science/Research
14
+ Classifier: License :: OSI Approved :: MIT License
15
+ Classifier: Operating System :: OS Independent
16
+ Classifier: Programming Language :: Python :: 3
17
+ Classifier: Programming Language :: Python :: 3.12
18
+ Classifier: Programming Language :: Python :: 3.13
19
+ Classifier: Topic :: Multimedia :: Sound/Audio :: Analysis
20
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
21
+ Requires-Python: >=3.12
22
+ Description-Content-Type: text/markdown
23
+ License-File: LICENSE
24
+ Requires-Dist: librosa>=0.11.0
25
+ Requires-Dist: matplotlib>=3.10.9
26
+ Requires-Dist: numpy>=1.26
27
+ Requires-Dist: torch>=2.1
28
+ Requires-Dist: torchfcpe>=0.0.4
29
+ Dynamic: license-file
30
+
31
+ [中文](README.zh-CN.md)|English
32
+
33
+ # Ensemble Pitch Extractor
34
+
35
+ [![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)
36
+ [![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)
37
+
38
+ Ensemble Pitch Extractor is a singing-voice F0 extractor that combines FCPE test-time augmentation with a pYIN high-frequency candidate in a dynamic programming decoder. It is designed for ordinary singing, high notes, and whistle-register material where a single extractor may fail.
39
+
40
+ The package provides:
41
+
42
+ - a Python API for extracting F0 from waveforms or audio files;
43
+ - a command-line interface that saves `.npy` F0 tracks and optional `.png` diagnostic plots;
44
+ - an ensemble decoder that selects a smooth candidate path instead of averaging incompatible F0 estimates.
45
+
46
+ ## Demonstration
47
+
48
+ *Audio samples sourced from the internet.*
49
+
50
+ <table>
51
+ <tr>
52
+ <td><img src="assets/胸转哨.png" alt="Chest-to-Whistle"></td>
53
+ <td><img src="assets/大颤音.png" alt="Large Vibrato"></td>
54
+ </tr>
55
+ <tr>
56
+ <td><img src="assets/带噪声高音.png" alt="Noisy High Notes"></td>
57
+ <td><img src="assets/低音.png" alt="Low Notes"></td>
58
+ </tr>
59
+ </table>
60
+
61
+ ## Installation
62
+
63
+ ```bash
64
+ pip install ensemble-pitch-extractor
65
+ ```
66
+
67
+ For local development:
68
+
69
+ ```bash
70
+ uv sync
71
+ uv run ensemble-pitch-extractor --help
72
+ ```
73
+
74
+ Python 3.12 or newer is required.
75
+
76
+ ## Command Line
77
+
78
+ Extract F0 from one audio file:
79
+
80
+ ```bash
81
+ ensemble-pitch-extractor input.wav -o f0_out
82
+ ```
83
+
84
+ This writes:
85
+
86
+ ```text
87
+ f0_out/input.npy
88
+ ```
89
+
90
+ The `.npy` file is a one-dimensional `float32` array in Hz. Unvoiced frames are stored as `0`.
91
+
92
+ Save a plot of F0 overlaid on a mel spectrogram:
93
+
94
+ ```bash
95
+ ensemble-pitch-extractor input.wav -o f0_out --plot
96
+ ```
97
+
98
+ Process all supported audio files in a directory:
99
+
100
+ ```bash
101
+ ensemble-pitch-extractor audio_dir -o f0_out --plot
102
+ ```
103
+
104
+ CUDA is auto-detected by default. To force a specific device:
105
+
106
+ ```bash
107
+ ensemble-pitch-extractor audio_dir -o f0_out --device cpu
108
+ ```
109
+
110
+ Limit GPU memory use with `--max-batch-length` (default 480000 samples, about 30 s at a 16 kHz sample rate):
111
+
112
+ ```bash
113
+ ensemble-pitch-extractor audio_dir -o f0_out --max-batch-length 200000
114
+ ```
115
+
116
+ Disable pYIN and use FCPE TTA only:
117
+
118
+ ```bash
119
+ ensemble-pitch-extractor input.wav -o f0_out --no-pyin
120
+ ```
121
+
122
+ Useful options:
123
+
124
+ ```text
125
+ --f0-min 80
126
+ --f0-max 4000
127
+ --max-batch-length 480000
128
+ --device auto
129
+ --pyin-priority-min-f0 1300
130
+ --pyin-fcpe-close-semitones 1.0
131
+ --interp-uv
132
+ --recursive
133
+ ```
134
+
135
+ ## Python API
136
+
137
+ ```python
138
+ from ensemble_pitch_extractor import extract_f0_from_file, load_model
139
+
140
+ model = load_model() # auto-detects CUDA, or pass device="cpu"
141
+ result = extract_f0_from_file(
142
+ "input.wav",
143
+ model=model,
144
+ save_npy="f0_out/input.npy",
145
+ save_plot="f0_out/input.png",
146
+ f0_min=80,
147
+ f0_max=4000,
148
+ )
149
+
150
+ f0 = result.f0 # Hz, shape: (frames,)
151
+ times = result.times # seconds, shape: (frames,)
152
+ ```
153
+
154
+ For audio already in memory:
155
+
156
+ ```python
157
+ import librosa
158
+ from ensemble_pitch_extractor import extract_f0, load_model
159
+
160
+ model = load_model()
161
+ sr = model.get_model_sr()
162
+ audio, _ = librosa.load("input.wav", sr=sr, mono=True)
163
+ f0 = extract_f0(audio, sr, model, f0_min=80, f0_max=4000)
164
+ ```
165
+
166
+ For torch tensor input (padded batch or concatenated, supports GPU):
167
+
168
+ ```python
169
+ import torch
170
+ from ensemble_pitch_extractor import extract_f0_from_tensor, load_model
171
+
172
+ model = load_model("cuda")
173
+
174
+ # padded batch: (batch=4, samples) with fixed-length clips
175
+ wav = torch.randn(4, 16000, device="cuda")
176
+ f0 = extract_f0_from_tensor(wav, sr=16000, model=model) # (4, frames)
177
+
178
+ # concatenated: clips of different lengths, no padding waste
179
+ wavs = [torch.randn(8000, device="cuda"), torch.randn(12000, device="cuda")]
180
+ lengths = [len(w) for w in wavs]
181
+ concat = torch.cat(wavs)
182
+ f0 = extract_f0_from_tensor(concat, sr=16000, model=model, lengths=lengths,
183
+ max_batch_length=20000) # (2, max_frames), NaN padded
184
+ ```
185
+
186
+ ## Method Overview
187
+
188
+ The decoder treats each extractor output as a candidate trajectory. Current candidates are:
189
+
190
+ ```text
191
+ FCPE key shift = 0
192
+ FCPE key shift = -12
193
+ FCPE key shift = +12
194
+ pYIN
195
+ ```
196
+
197
+ For an FCPE candidate with key shift $s$, the model output is mapped back to the original pitch space before fusion:
198
+
199
+ $$
200
+ \hat f_{t,s} = \frac{f_{t,s}}{2^{s/12}} .
201
+ $$
202
+
203
+ pYIN is included as an ultra-high-frequency candidate. By default it searches only 1300–4000 Hz, and frames below 1300 Hz are discarded. This prevents pYIN from replacing FCPE in normal ranges, where FCPE usually captures finer detail.
204
+
205
+ All candidates are converted to MIDI note space:
206
+
207
+ $$
208
+ n_{t,k}=69+12\log_2\frac{f_{t,k}}{440}.
209
+ $$
210
+
211
+ The final path is selected by dynamic programming:
212
+
213
+ $$
214
+ \pi^*=\arg\min_\pi \sum_t U_t(\pi_t)+\sum_{t=1}^{T-1} C_t(\pi_{t-1},\pi_t).
215
+ $$
216
+
217
+ Here $U_t(k)$ is a per-frame candidate prior and $C_t(i,k)$ is a transition cost. This formulation avoids averaging octave errors, half-frequency errors, and algorithm-specific mistakes into spurious intermediate pitches.
218
+
219
+ ## Heuristics as Priors
220
+
221
+ The implementation uses the following structured priors:
222
+
223
+ - MIDI-space costs make equal musical intervals comparable across frequency ranges.
224
+ - A voiced/unvoiced (UV) penalty discourages fragmented voiced/unvoiced paths.
225
+ - Octave-aware jump cost allows one-, two-, and three-octave transitions, which are important for chest-to-whistle jumps.
226
+ - FCPE `+12` receives a low-pitch prior below E2.
227
+ - FCPE `-12` receives a high-pitch prior above D5.
228
+ - pYIN receives a high-frequency prior only when it is above 1300 Hz and more than one semitone away from every FCPE candidate.
229
+ - After decoding, RMS energy gating removes false voiced output in silent segments.
230
+
231
+ The default candidate order is `FCPE 0`, `FCPE -12`, `FCPE +12`, `pYIN`, so that exact ties prefer FCPE over pYIN.
232
+
233
+ ## Build and Publish
234
+
235
+ ```bash
236
+ uv lock --python 3.12
237
+ uv build
238
+ uv publish
239
+ ```
240
+
241
+ With a PyPI token:
242
+
243
+ ```bash
244
+ uv publish --token "pypi-..."
245
+ ```
@@ -0,0 +1,215 @@
1
+ [中文](README.zh-CN.md)|English
2
+
3
+ # Ensemble Pitch Extractor
4
+
5
+ [![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)
6
+ [![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)
7
+
8
+ Ensemble Pitch Extractor is a singing-voice F0 extractor that combines FCPE test-time augmentation with a pYIN high-frequency candidate in a dynamic programming decoder. It is designed for ordinary singing, high notes, and whistle-register material where a single extractor may fail.
9
+
10
+ The package provides:
11
+
12
+ - a Python API for extracting F0 from waveforms or audio files;
13
+ - a command-line interface that saves `.npy` F0 tracks and optional `.png` diagnostic plots;
14
+ - an ensemble decoder that selects a smooth candidate path instead of averaging incompatible F0 estimates.
15
+
16
+ ## Demonstration
17
+
18
+ *Audio samples sourced from the internet.*
19
+
20
+ <table>
21
+ <tr>
22
+ <td><img src="assets/胸转哨.png" alt="Chest-to-Whistle"></td>
23
+ <td><img src="assets/大颤音.png" alt="Large Vibrato"></td>
24
+ </tr>
25
+ <tr>
26
+ <td><img src="assets/带噪声高音.png" alt="Noisy High Notes"></td>
27
+ <td><img src="assets/低音.png" alt="Low Notes"></td>
28
+ </tr>
29
+ </table>
30
+
31
+ ## Installation
32
+
33
+ ```bash
34
+ pip install ensemble-pitch-extractor
35
+ ```
36
+
37
+ For local development:
38
+
39
+ ```bash
40
+ uv sync
41
+ uv run ensemble-pitch-extractor --help
42
+ ```
43
+
44
+ Python 3.12 or newer is required.
45
+
46
+ ## Command Line
47
+
48
+ Extract F0 from one audio file:
49
+
50
+ ```bash
51
+ ensemble-pitch-extractor input.wav -o f0_out
52
+ ```
53
+
54
+ This writes:
55
+
56
+ ```text
57
+ f0_out/input.npy
58
+ ```
59
+
60
+ The `.npy` file is a one-dimensional `float32` array in Hz. Unvoiced frames are stored as `0`.
61
+
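As a quick illustration of the output format described above, the following sketch loads an F0 track and separates voiced from unvoiced frames. It writes a small synthetic array to a temporary file in place of a real `f0_out/input.npy`, so the file path and the values are assumptions made for the example only.

```python
import tempfile
from pathlib import Path

import numpy as np

# Synthetic stand-in for a track written by the CLI: Hz values, 0 = unvoiced.
f0_demo = np.array([0.0, 220.0, 221.5, 0.0, 440.0], dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "input.npy"  # hypothetical path, mirrors f0_out/input.npy
    np.save(path, f0_demo)
    f0 = np.load(path)

voiced = f0 > 0                           # boolean mask of voiced frames
voiced_ratio = float(voiced.mean())       # fraction of frames that are voiced
median_hz = float(np.median(f0[voiced]))  # median pitch over voiced frames only

print(f"{voiced_ratio:.0%} voiced, median {median_hz:.1f} Hz")
```

Masking with `f0 > 0` before taking statistics keeps the `0` placeholders for unvoiced frames from dragging the median or mean toward zero.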
62
+ Save a plot of F0 overlaid on a mel spectrogram:
63
+
64
+ ```bash
65
+ ensemble-pitch-extractor input.wav -o f0_out --plot
66
+ ```
67
+
68
+ Process all supported audio files in a directory:
69
+
70
+ ```bash
71
+ ensemble-pitch-extractor audio_dir -o f0_out --plot
72
+ ```
73
+
74
+ CUDA is auto-detected by default. To force a specific device:
75
+
76
+ ```bash
77
+ ensemble-pitch-extractor audio_dir -o f0_out --device cpu
78
+ ```
79
+
80
+ Limit GPU memory use with `--max-batch-length` (default 480000 samples, about 30 s at a 16 kHz sample rate):
81
+
82
+ ```bash
83
+ ensemble-pitch-extractor audio_dir -o f0_out --max-batch-length 200000
84
+ ```
85
+
86
+ Disable pYIN and use FCPE TTA only:
87
+
88
+ ```bash
89
+ ensemble-pitch-extractor input.wav -o f0_out --no-pyin
90
+ ```
91
+
92
+ Useful options:
93
+
94
+ ```text
95
+ --f0-min 80
96
+ --f0-max 4000
97
+ --max-batch-length 480000
98
+ --device auto
99
+ --pyin-priority-min-f0 1300
100
+ --pyin-fcpe-close-semitones 1.0
101
+ --interp-uv
102
+ --recursive
103
+ ```
104
+
105
+ ## Python API
106
+
107
+ ```python
108
+ from ensemble_pitch_extractor import extract_f0_from_file, load_model
109
+
110
+ model = load_model() # auto-detects CUDA, or pass device="cpu"
111
+ result = extract_f0_from_file(
112
+ "input.wav",
113
+ model=model,
114
+ save_npy="f0_out/input.npy",
115
+ save_plot="f0_out/input.png",
116
+ f0_min=80,
117
+ f0_max=4000,
118
+ )
119
+
120
+ f0 = result.f0 # Hz, shape: (frames,)
121
+ times = result.times # seconds, shape: (frames,)
122
+ ```
123
+
124
+ For audio already in memory:
125
+
126
+ ```python
127
+ import librosa
128
+ from ensemble_pitch_extractor import extract_f0, load_model
129
+
130
+ model = load_model()
131
+ sr = model.get_model_sr()
132
+ audio, _ = librosa.load("input.wav", sr=sr, mono=True)
133
+ f0 = extract_f0(audio, sr, model, f0_min=80, f0_max=4000)
134
+ ```
135
+
136
+ For torch tensor input (padded batch or concatenated, supports GPU):
137
+
138
+ ```python
139
+ import torch
140
+ from ensemble_pitch_extractor import extract_f0_from_tensor, load_model
141
+
142
+ model = load_model("cuda")
143
+
144
+ # padded batch: (batch=4, samples) with fixed-length clips
145
+ wav = torch.randn(4, 16000, device="cuda")
146
+ f0 = extract_f0_from_tensor(wav, sr=16000, model=model) # (4, frames)
147
+
148
+ # concatenated: clips of different lengths, no padding waste
149
+ wavs = [torch.randn(8000, device="cuda"), torch.randn(12000, device="cuda")]
150
+ lengths = [len(w) for w in wavs]
151
+ concat = torch.cat(wavs)
152
+ f0 = extract_f0_from_tensor(concat, sr=16000, model=model, lengths=lengths,
153
+ max_batch_length=20000) # (2, max_frames), NaN padded
154
+ ```
155
+
156
+ ## Method Overview
157
+
158
+ The decoder treats each extractor output as a candidate trajectory. Current candidates are:
159
+
160
+ ```text
161
+ FCPE key shift = 0
162
+ FCPE key shift = -12
163
+ FCPE key shift = +12
164
+ pYIN
165
+ ```
166
+
167
+ For an FCPE candidate with key shift $s$, the model output is mapped back to the original pitch space before fusion:
168
+
169
+ $$
170
+ \hat f_{t,s} = \frac{f_{t,s}}{2^{s/12}} .
171
+ $$
172
+
173
+ pYIN is included as an ultra-high-frequency candidate. By default it searches only 1300–4000 Hz, and frames below 1300 Hz are discarded. This prevents pYIN from replacing FCPE in normal ranges, where FCPE usually captures finer detail.
174
+
175
+ All candidates are converted to MIDI note space:
176
+
177
+ $$
178
+ n_{t,k}=69+12\log_2\frac{f_{t,k}}{440}.
179
+ $$
180
+
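The two formulas above (undoing the key shift and converting to MIDI note numbers) can be sketched with NumPy as follows. The function names `undo_key_shift` and `hz_to_midi` are hypothetical and are not taken from the package; mapping unvoiced frames to NaN is likewise an assumption made for this example.

```python
import numpy as np

def undo_key_shift(f_shifted_hz, key_shift):
    """Map F0 estimated on key-shifted input back to the original pitch space."""
    f = np.asarray(f_shifted_hz, dtype=np.float64)
    return f / 2.0 ** (key_shift / 12.0)

def hz_to_midi(f_hz):
    """Convert Hz to MIDI note numbers; frames at or below 0 Hz become NaN."""
    f_hz = np.asarray(f_hz, dtype=np.float64)
    midi = np.full(f_hz.shape, np.nan)
    voiced = f_hz > 0
    midi[voiced] = 69.0 + 12.0 * np.log2(f_hz[voiced] / 440.0)
    return midi

# Output of a +12 key-shift pass: one octave above the true pitch.
shifted = np.array([880.0, 0.0, 440.0])
restored = undo_key_shift(shifted, key_shift=12)  # 440 Hz, 0 Hz, 220 Hz
midi = hz_to_midi(restored)                       # 69, NaN, 57
```

Working in MIDI note numbers means that a one-semitone difference has the same size at 100 Hz as at 2000 Hz, which is what allows one cost scale to cover the whole range.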
181
+ The final path is selected by dynamic programming:
182
+
183
+ $$
184
+ \pi^*=\arg\min_\pi \sum_t U_t(\pi_t)+\sum_{t=1}^{T-1} C_t(\pi_{t-1},\pi_t).
185
+ $$
186
+
187
+ Here $U_t(k)$ is a per-frame candidate prior and $C_t(i,k)$ is a transition cost. This formulation avoids averaging octave errors, half-frequency errors, and algorithm-specific mistakes into spurious intermediate pitches.
188
+
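The objective above is a standard minimum-cost path problem, so it can be solved with a Viterbi-style recursion. The sketch below is a generic version under simplifying assumptions: all frames are voiced, the unary cost is zero, and the transition cost is the absolute MIDI difference. The function name `decode_min_cost_path` and these cost choices are illustrative and do not reproduce the package's actual costs.

```python
import numpy as np

def decode_min_cost_path(unary, transition):
    """Return the candidate index per frame that minimises the total cost.

    unary:      (T, K) array, unary[t, k] = U_t(k)
    transition: (T, K, K) array, transition[t, i, k] = C_t(i, k);
                transition[0] is unused.
    """
    T, K = unary.shape
    cost = np.empty((T, K))
    back = np.zeros((T, K), dtype=np.int64)
    cost[0] = unary[0]
    for t in range(1, T):
        # total[i, k]: best cost of reaching candidate k at frame t via i at t-1
        total = cost[t - 1][:, None] + transition[t]
        back[t] = np.argmin(total, axis=0)  # first minimum wins on exact ties
        cost[t] = total[back[t], np.arange(K)] + unary[t]
    # Trace the best path backwards from the cheapest final candidate.
    path = np.empty(T, dtype=np.int64)
    path[-1] = np.argmin(cost[-1])
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

# Two candidates in MIDI space; candidate 0 has an octave error at frame 2.
midi = np.array([[60.0, 72.0], [60.2, 72.6], [72.1, 60.1], [60.3, 72.2]])
T, K = midi.shape
unary = np.zeros((T, K))
transition = np.zeros((T, K, K))
transition[1:] = np.abs(midi[1:, None, :] - midi[:-1, :, None])

path = decode_min_cost_path(unary, transition)
decoded = midi[np.arange(T), path]
```

In this toy input the decoder switches to candidate 1 for the single frame where candidate 0 jumps an octave, so the decoded track stays near MIDI 60 throughout. Averaging the two candidates at that frame would instead give a pitch near MIDI 66 that neither extractor reported.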
189
+ ## Heuristics as Priors
190
+
191
+ The implementation uses the following structured priors:
192
+
193
+ - MIDI-space costs make equal musical intervals comparable across frequency ranges.
194
+ - A voiced/unvoiced (UV) penalty discourages fragmented voiced/unvoiced paths.
195
+ - Octave-aware jump cost allows one-, two-, and three-octave transitions, which are important for chest-to-whistle jumps.
196
+ - FCPE `+12` receives a low-pitch prior below E2.
197
+ - FCPE `-12` receives a high-pitch prior above D5.
198
+ - pYIN receives a high-frequency prior only when it is above 1300 Hz and more than one semitone away from every FCPE candidate.
199
+ - After decoding, RMS energy gating removes false voiced output in silent segments.
200
+
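To make the idea of an octave-aware jump cost concrete, here is one possible form. It is a hypothetical sketch, not the package's formula: the cost is the distance in semitones to the nearest allowed octave multiple (0 to 3 octaves), plus a flat penalty whenever a non-zero octave multiple is used. The function name and the `octave_penalty` value are assumptions.

```python
import numpy as np

def octave_aware_jump_cost(delta_semitones, octave_penalty=2.0, max_octaves=3):
    """Cost of a pitch jump that forgives jumps close to whole octaves.

    Hypothetical form: distance to the nearest allowed octave multiple
    (0..max_octaves), plus a flat penalty when an octave jump is used.
    """
    d = np.abs(np.asarray(delta_semitones, dtype=np.float64))
    octaves = np.arange(max_octaves + 1)                  # 0, 1, 2, 3
    residual = np.abs(d[..., None] - 12.0 * octaves)      # distance to each multiple
    penalty = np.where(octaves > 0, octave_penalty, 0.0)  # flat cost for octave jumps
    return np.min(residual + penalty, axis=-1)

# Small step, exact octave, near two octaves, tritone.
costs = octave_aware_jump_cost([0.5, 12.0, 24.3, 6.0])
```

With these settings an exact octave jump costs 2.0, while a six-semitone jump costs 6.0, so a genuine chest-to-whistle leap is cheaper than an arbitrary large jump of similar size.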
201
+ The default candidate order is `FCPE 0`, `FCPE -12`, `FCPE +12`, `pYIN`, so that exact ties prefer FCPE over pYIN.
202
+
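The tie-breaking effect of candidate order can be illustrated with `numpy.argmin`, which returns the first index of the minimum. Whether the package uses `argmin` for this step is an assumption; the snippet only shows why placing pYIN last makes FCPE win exact ties.

```python
import numpy as np

CANDIDATES = ["FCPE 0", "FCPE -12", "FCPE +12", "pYIN"]

# Exact tie between FCPE -12 (index 1) and pYIN (index 3).
costs = np.array([3.0, 1.0, 2.0, 1.0])

# argmin returns the first minimal index, so the earlier candidate is chosen.
winner = CANDIDATES[int(np.argmin(costs))]
```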
203
+ ## Build and Publish
204
+
205
+ ```bash
206
+ uv lock --python 3.12
207
+ uv build
208
+ uv publish
209
+ ```
210
+
211
+ With a PyPI token:
212
+
213
+ ```bash
214
+ uv publish --token "pypi-..."
215
+ ```
@@ -0,0 +1,215 @@
1
+ 中文|[English](README.md)
2
+
3
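+ # Ensemble Pitch Extractor: Chinese Documentation
4
+
5
+ Ensemble Pitch Extractor is an F0 toolkit for singing-voice pitch extraction. It feeds FCPE test-time augmentation results and a pYIN ultra-high-pitch candidate into a single dynamic programming decoder, and suits ordinary singing, high notes, and whistle-register material.
6
+
7
+ The package provides:
8
+
9
+ - a clear Python API;
10
+ - a command-line tool that writes `.npy` output and, optionally, `.png` diagnostic plots;
11
+ - a decoder that does not average frame by frame, but selects the F0 path with the lowest total cost among multiple candidate trajectories.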
12
+
13
+ ## Demonstration
14
+
15
+ *Audio samples sourced from the internet.*
16
+
17
+ <table>
18
+ <tr>
19
+ <td><img src="assets/胸转哨.png" alt="Chest-to-Whistle"></td>
20
+ <td><img src="assets/大颤音.png" alt="Large Vibrato"></td>
21
+ </tr>
22
+ <tr>
23
+ <td><img src="assets/带噪声高音.png" alt="Noisy High Notes"></td>
24
+ <td><img src="assets/低音.png" alt="Low Notes"></td>
25
+ </tr>
26
+ </table>
27
+
28
+ ## Installation
29
+
30
+ ```bash
31
+ pip install ensemble-pitch-extractor
32
+ ```
33
+
34
+ For local development:
35
+
36
+ ```bash
37
+ uv sync
38
+ uv run ensemble-pitch-extractor --help
39
+ ```
40
+
41
+ Python 3.12 or newer is required.
42
+
43
+ ## Command Line Usage
44
+
45
+ Extract F0 from a single audio file:
46
+
47
+ ```bash
48
+ ensemble-pitch-extractor input.wav -o f0_out
49
+ ```
50
+
51
+ Output:
52
+
53
+ ```text
54
+ f0_out/input.npy
55
+ ```
56
+
57
+ The `.npy` file is a one-dimensional `float32` array in Hz; unvoiced frames are `0`.
58
+
59
+ Also save a plot of F0 over a mel spectrogram:
60
+
61
+ ```bash
62
+ ensemble-pitch-extractor input.wav -o f0_out --plot
63
+ ```
64
+
65
+ Process all audio files in a directory:
66
+
67
+ ```bash
68
+ ensemble-pitch-extractor audio_dir -o f0_out --plot
69
+ ```
70
+
71
+ CUDA is auto-detected by default. To specify a device manually:
72
+
73
+ ```bash
74
+ ensemble-pitch-extractor audio_dir -o f0_out --device cpu
75
+ ```
76
+
77
+ Limit GPU memory use with `--max-batch-length` (default 480000 samples ≈ 30 s):
78
+
79
+ ```bash
80
+ ensemble-pitch-extractor audio_dir -o f0_out --max-batch-length 200000
81
+ ```
82
+
83
+ Use FCPE TTA only, without pYIN:
84
+
85
+ ```bash
86
+ ensemble-pitch-extractor input.wav -o f0_out --no-pyin
87
+ ```
88
+
89
+ Common options:
90
+
91
+ ```text
92
+ --f0-min 80
93
+ --f0-max 4000
94
+ --max-batch-length 480000
95
+ --device auto
96
+ --pyin-priority-min-f0 1300
97
+ --pyin-fcpe-close-semitones 1.0
98
+ --interp-uv
99
+ --recursive
100
+ ```
101
+
102
+ ## Python API
103
+
104
+ ```python
105
+ from ensemble_pitch_extractor import extract_f0_from_file, load_model
106
+
107
+ model = load_model() # auto-detects CUDA, or pass device="cpu"
108
+ result = extract_f0_from_file(
109
+ "input.wav",
110
+ model=model,
111
+ save_npy="f0_out/input.npy",
112
+ save_plot="f0_out/input.png",
113
+ f0_min=80,
114
+ f0_max=4000,
115
+ )
116
+
117
+ f0 = result.f0 # Hz, shape: (frames,)
118
+ times = result.times # seconds, shape: (frames,)
119
+ ```
120
+
121
+ Audio already in memory can also be processed directly:
122
+
123
+ ```python
124
+ import librosa
125
+ from ensemble_pitch_extractor import extract_f0, load_model
126
+
127
+ model = load_model()
128
+ sr = model.get_model_sr()
129
+ audio, _ = librosa.load("input.wav", sr=sr, mono=True)
130
+ f0 = extract_f0(audio, sr, model, f0_min=80, f0_max=4000)
131
+ ```
132
+
133
+ Torch tensors can be passed directly (batched or concatenated, GPU supported):
134
+
135
+ ```python
136
+ import torch
137
+ from ensemble_pitch_extractor import extract_f0_from_tensor, load_model
138
+
139
+ model = load_model("cuda")
140
+
141
+ # batch mode: padded batch of equal-length clips
142
+ wav = torch.randn(4, 16000, device="cuda") # (batch=4, samples)
143
+ f0 = extract_f0_from_tensor(wav, sr=16000, model=model) # (4, frames)
144
+
145
+ # concatenated mode: clips of different lengths joined together, no padding waste
146
+ wavs = [torch.randn(8000, device="cuda"), torch.randn(12000, device="cuda")]
147
+ lengths = [len(w) for w in wavs]
148
+ concat = torch.cat(wavs)
149
+ f0 = extract_f0_from_tensor(concat, sr=16000, model=model, lengths=lengths,
150
+ max_batch_length=20000) # (2, max_frames), padded positions are NaN
151
+ ```
152
+
153
+ ## Method Overview
154
+
155
+ The current candidates are:
156
+
157
+ ```text
158
+ FCPE key shift = 0
159
+ FCPE key shift = -12
160
+ FCPE key shift = +12
161
+ pYIN
162
+ ```
163
+
164
+ For an FCPE candidate with key shift $s$, the output is shifted back to the original pitch space before fusion:
165
+
166
+ $$
167
+ \hat f_{t,s} = \frac{f_{t,s}}{2^{s/12}} .
168
+ $$
169
+
170
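+ pYIN is added as an ultra-high-pitch rescue candidate. By default it searches only 1300–4000 Hz, and results below 1300 Hz are discarded, so that it does not override FCPE's finer trajectory in the normal range.
171
+
172
+ All candidates are converted to MIDI notes:
173
+
174
+ $$
175
+ n_{t,k}=69+12\log_2\frac{f_{t,k}}{440}.
176
+ $$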
177
+
178
+ The path with the lowest total cost is then selected by dynamic programming:
179
+
180
+ $$
181
+ \pi^*=\arg\min_\pi
182
+ \sum_t U_t(\pi_t)
183
+ +
184
+ \sum_{t=1}^{T-1} C_t(\pi_{t-1},\pi_t).
185
+ $$
186
+
187
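+ Here $U_t(k)$ is the candidate prior and $C_t(i,k)$ is the transition cost between adjacent frames. This avoids averaging octave errors, half-frequency errors, or mistakes from different algorithms into intermediate pitches that do not exist.
188
+
189
+ ## Heuristics
190
+
191
+ These strategies can all be understood as structured priors in the dynamic program: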
192
+
193
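+ - MIDI-space costs give equal musical intervals the same scale across frequency ranges.
194
+ - A voiced/unvoiced (UV) penalty reduces frequent voiced/unvoiced breaks.
195
+ - Octave-aware jumps allow genuine jumps of about one, two, or three octaves, which suits chest-to-whistle transitions.
196
+ - FCPE `+12` receives a low-register prior below E2.
197
+ - FCPE `-12` receives a mid-to-high-register prior above D5.
198
+ - pYIN receives a rescue prior only when it is above 1300 Hz and more than 1 semitone away from every FCPE candidate.
199
+ - After decoding, RMS energy gating removes spurious voiced output in silent segments.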
200
+
201
+ The candidate order is `FCPE 0`, `FCPE -12`, `FCPE +12`, `pYIN`, so FCPE is preferred when costs are exactly equal.
202
+
203
+ ## Build and Publish
204
+
205
+ ```bash
206
+ uv lock --python 3.12
207
+ uv build
208
+ uv publish
209
+ ```
210
+
211
+ With a PyPI token:
212
+
213
+ ```bash
214
+ uv publish --token "pypi-..."
215
+ ```