vg-hubert 1.0.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- vg_hubert-1.0.0/.gitignore +31 -0
- vg_hubert-1.0.0/LICENSE +29 -0
- vg_hubert-1.0.0/MANIFEST.in +7 -0
- vg_hubert-1.0.0/PKG-INFO +375 -0
- vg_hubert-1.0.0/README.md +332 -0
- vg_hubert-1.0.0/configs/places.yaml +50 -0
- vg_hubert-1.0.0/configs/spokencoco.yaml +50 -0
- vg_hubert-1.0.0/demo.ipynb +469 -0
- vg_hubert-1.0.0/examples/basic_usage.py +109 -0
- vg_hubert-1.0.0/examples/batch_processing.py +131 -0
- vg_hubert-1.0.0/publish_to_hub.py +406 -0
- vg_hubert-1.0.0/pyproject.toml +58 -0
- vg_hubert-1.0.0/requirements.txt +26 -0
- vg_hubert-1.0.0/setup.cfg +4 -0
- vg_hubert-1.0.0/setup.py +75 -0
- vg_hubert-1.0.0/tests/test_mincutmerge.py +151 -0
- vg_hubert-1.0.0/tests/test_validation.py +256 -0
- vg_hubert-1.0.0/train.py +122 -0
- vg_hubert-1.0.0/vg_hubert/__init__.py +24 -0
- vg_hubert-1.0.0/vg_hubert/datasets/__init__.py +20 -0
- vg_hubert-1.0.0/vg_hubert/datasets/places_dataset.py +112 -0
- vg_hubert-1.0.0/vg_hubert/datasets/sampler.py +36 -0
- vg_hubert-1.0.0/vg_hubert/datasets/spokencoco_dataset.py +108 -0
- vg_hubert-1.0.0/vg_hubert/mincut.py +400 -0
- vg_hubert-1.0.0/vg_hubert/model/__init__.py +36 -0
- vg_hubert-1.0.0/vg_hubert/model/audio_encoder.py +985 -0
- vg_hubert-1.0.0/vg_hubert/model/dual_encoder.py +157 -0
- vg_hubert-1.0.0/vg_hubert/model/utils.py +558 -0
- vg_hubert-1.0.0/vg_hubert/model/vision_transformer.py +304 -0
- vg_hubert-1.0.0/vg_hubert/model/vit_utils.py +623 -0
- vg_hubert-1.0.0/vg_hubert/segmenter.py +493 -0
- vg_hubert-1.0.0/vg_hubert/tests/test_better_params.py +79 -0
- vg_hubert-1.0.0/vg_hubert/tests/test_hf_upload.py +224 -0
- vg_hubert-1.0.0/vg_hubert/tests/test_mincut_comparison.py +255 -0
- vg_hubert-1.0.0/vg_hubert/tests/test_mincut_quality.py +268 -0
- vg_hubert-1.0.0/vg_hubert/training/__init__.py +17 -0
- vg_hubert-1.0.0/vg_hubert/training/bert_adam.py +179 -0
- vg_hubert-1.0.0/vg_hubert/training/trainer.py +403 -0
- vg_hubert-1.0.0/vg_hubert/training/trainer_utils.py +113 -0
- vg_hubert-1.0.0/vg_hubert/training/utils.py +81 -0
- vg_hubert-1.0.0/vg_hubert.egg-info/PKG-INFO +375 -0
- vg_hubert-1.0.0/vg_hubert.egg-info/SOURCES.txt +43 -0
- vg_hubert-1.0.0/vg_hubert.egg-info/dependency_links.txt +1 -0
- vg_hubert-1.0.0/vg_hubert.egg-info/requires.txt +12 -0
- vg_hubert-1.0.0/vg_hubert.egg-info/top_level.txt +1 -0
@@ -0,0 +1,31 @@
__pycache__/
*.py[cod]
.DS_Store
.vscode/
.ipynb_checkpoints/
*.png
*.gif
*.pdf
test*
*.log
vg-hubert_3/
*.tar

# Original README and cleanup scripts (keep local only)
README_ORIGINAL.md
# Training artifacts
checkpoints/
logs/
runs/
wandb/
*.ckpt
*.pth.tar

# Data directories (user-specific)
data/
datasets_cache/

# Build artifacts
build/
dist/
*.egg-info/
vg_hubert-1.0.0/LICENSE
ADDED
@@ -0,0 +1,29 @@
BSD 3-Clause License

Copyright (c) 2022, Puyuan Peng
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this
   list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice,
   this list of conditions and the following disclaimer in the documentation
   and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its
   contributors may be used to endorse or promote products derived from
   this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
vg_hubert-1.0.0/PKG-INFO
ADDED
@@ -0,0 +1,375 @@
Metadata-Version: 2.4
Name: vg-hubert
Version: 1.0.0
Summary: VG-HuBERT: Simplified interface for speech segmentation with HuggingFace Hub integration
Home-page: https://github.com/human-ai-lab/VG-HuBERT
Author: Puyuan Peng, David Harwath
Author-email: Puyuan Peng <harwath@utexas.edu>, David Harwath <harwath@utexas.edu>
License: BSD-3-Clause
Project-URL: Homepage, https://github.com/human-ai-lab/VG-HuBERT
Project-URL: Original Paper (Words), https://arxiv.org/abs/2203.15081
Project-URL: Original Paper (Syllables), https://www.isca-speech.org/archive/interspeech_2023/peng23_interspeech.html
Project-URL: HuggingFace Model, https://huggingface.co/hjvm/VG-HuBERT
Project-URL: Bug Tracker, https://github.com/human-ai-lab/VG-HuBERT/issues
Keywords: speech,audio,segmentation,syllables,self-supervised,hubert,vg-hubert
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: BSD License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Multimedia :: Sound/Audio :: Speech
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.0.0
Requires-Dist: transformers>=4.20.0
Requires-Dist: huggingface-hub>=0.10.0
Requires-Dist: numpy>=1.20.0
Requires-Dist: soundfile>=0.10.0
Requires-Dist: scipy>=1.6.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: black>=22.0.0; extra == "dev"
Requires-Dist: isort>=5.10.0; extra == "dev"
Requires-Dist: flake8>=4.0.0; extra == "dev"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# VG-HuBERT: Speech Segmentation with Simplified Interface

Unsupervised syllable and word segmentation using visually grounded HuBERT (VG-HuBERT). This fork provides a simplified interface with HuggingFace Hub integration, an updated PyTorch version that removes the need to patch PyTorch's `multi_head_attention_forward`, an optimized MinCut algorithm (~40x speedup), and PyPI package distribution.

## Quick Start

```python
from vg_hubert import Segmenter

# Syllable segmentation (RECOMMENDED: includes MinCutMerge post-processing)
segmenter = Segmenter(mode="syllable", merge_threshold=0.3)
outputs = segmenter("audio.wav")

# Word segmentation
word_segmenter = Segmenter(mode="word")
word_outputs = word_segmenter("audio.wav")
```

## Installation

```bash
# From source
pip install git+https://github.com/hjvm/VG-HuBERT.git

# Or PyPI (after publishing)
pip install vg-hubert
```

**Requirements**: Python ≥3.8, PyTorch ≥2.0, transformers, scipy, soundfile

## Features

✨ **New in this fork:**
- 🚀 **40x faster MinCut**: Optimized algorithm from [SyllableLM](https://github.com/AlanBaade/SyllableLM) (Baade et al., 2024)
- 🔧 **MinCutMerge post-processing**: Prevents over-segmentation (matches original paper)
- 🤗 **HuggingFace integration**: Auto-download models from Hub
- 🍎 **Apple Silicon support**: Native MPS acceleration
- 📦 **PyPI distribution**: Simple `pip install`
- 🧹 **No fairseq for inference**: Removed complex dependency

## Usage

### Basic Example

```python
from vg_hubert import Segmenter
import soundfile as sf

# Load and segment
segmenter = Segmenter(
    model_ckpt="hjvm/VG-HuBERT",  # HuggingFace Hub or local path
    mode="syllable",
    device="cuda",  # or "mps" or "cpu" (auto-detects best available)
    merge_threshold=0.3  # Enable MinCutMerge (recommended)
)

outputs = segmenter("audio.wav")

# Access results
for start, end in outputs['segments']:
    print(f"Segment: {start:.2f}s - {end:.2f}s")

# Access features
segment_features = outputs['segment_features']  # [num_segments, 768]
frame_features = outputs['hidden_states']  # [num_frames, 768]
```
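
The shape comments above imply one feature row per frame and one per segment. As a minimal sketch of how the two could relate, the snippet below maps segment times to frame indices and mean-pools the frames inside each segment. It assumes a 20 ms frame period (the standard HuBERT Base stride at 16 kHz) and mean pooling; whether `Segmenter` pools this way internally is not stated in this README, and `segment_to_frames` and `mean_pool` are hypothetical helpers, not part of the package.

```python
FRAME_SEC = 0.02  # assumed HuBERT Base frame period (20 ms at 16 kHz)


def segment_to_frames(start, end, num_frames, frame_sec=FRAME_SEC):
    """Map a (start, end) span in seconds to a half-open frame range."""
    first = max(0, min(num_frames - 1, int(round(start / frame_sec))))
    last = max(first + 1, min(num_frames, int(round(end / frame_sec))))
    return first, last


def mean_pool(frames, first, last):
    """Average the feature vectors of frames[first:last], one dimension at a time."""
    span = frames[first:last]
    dim = len(span[0])
    return [sum(vec[d] for vec in span) / len(span) for d in range(dim)]


# Toy data: 10 frames of 2-dimensional features (real features are 768-dimensional).
frames = [[float(i), float(2 * i)] for i in range(10)]
segments = [(0.0, 0.08), (0.08, 0.2)]  # hypothetical (start, end) times in seconds

pooled = []
for start, end in segments:
    first, last = segment_to_frames(start, end, len(frames))
    pooled.append(mean_pool(frames, first, last))

print(pooled)
```

On real outputs, `frames` would correspond to the rows of `outputs['hidden_states']` and `segments` to `outputs['segments']`.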

### MinCut Configuration

The package supports multiple MinCut configurations for different use cases:

```python
# Configuration 1: RECOMMENDED (matches original paper)
# - Fast algorithm + MinCutMerge post-processing
# - Prevents over-segmentation
segmenter = Segmenter(
    mode="syllable",
    merge_threshold=0.3,  # Original paper value
    min_segment_frames=2  # Filter very short segments
)

# Configuration 2: Plain MinCut (no merging)
# - Useful for analysis or more granular segmentation
segmenter = Segmenter(
    mode="syllable",
    merge_threshold=None  # Disable MinCutMerge
)

# Configuration 3: Custom merge threshold
# - Tune for your specific needs
# - Higher = more merging = fewer segments
# - Lower = less merging = more segments
segmenter = Segmenter(
    mode="syllable",
    merge_threshold=0.5  # More aggressive merging
)
```

See [examples/mincut_comparison.py](examples/mincut_comparison.py) for detailed comparison.

### Low-Level API

For advanced users who need full control:

```python
from vg_hubert.mincut import segment_with_mincut
import numpy as np

# Extract features (see examples/ for full code)
features = ...  # Shape: (num_frames, 768)

# Apply MinCut with full control
boundaries, ssm = segment_with_mincut(
    features=features,
    K=10,  # Number of boundaries
    merge_threshold=0.3,  # Set to None for plain MinCut
    min_segment_frames=2,
    min_hop=3,  # Minimum segment length
    max_hop=50  # Maximum segment length
)
```

### Parameters

- **mode**: `"syllable"` (MinCut + feature similarity) or `"word"` (CLS attention)
- **layer**: HuBERT layer to use (default: 8 for syllables, 9 for words)
- **device**: `"cuda"`, `"mps"`, or `"cpu"` (defaults to CUDA if available, falls back to MPS on Apple Silicon, then CPU)
- **sec_per_syllable**: Target syllable duration for MinCut (default: 0.2)
- **merge_threshold**: Cosine similarity threshold for merging adjacent segments (default: 0.3, set to `None` to disable)
- **min_segment_frames**: Filter segments with ≤ this many frames (default: 2)
- **attn_threshold**: Attention threshold for word boundaries (default: 0.25)
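
To make `sec_per_syllable` concrete: a target syllable duration implies a segment count for a given utterance. The sketch below assumes the count is the utterance duration divided by `sec_per_syllable`, rounded up. That rule is an assumption for illustration and is not confirmed by this README; `estimate_num_segments` is a hypothetical helper, not a package function.

```python
import math


def estimate_num_segments(num_samples, sample_rate=16000, sec_per_syllable=0.2):
    """Estimate how many segments to request for one utterance.

    Assumed rule: ceil(duration / sec_per_syllable), with at least one segment.
    """
    duration_sec = num_samples / sample_rate
    return max(1, math.ceil(duration_sec / sec_per_syllable))


# A 1.5 s utterance at 16 kHz with the default 0.2 s target.
print(estimate_num_segments(24000))
```

A smaller `sec_per_syllable` would request more segments for the same audio under this rule.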

See [examples/](examples/) for more usage patterns.

## Model Details

### Checkpoints

Two pre-trained models optimized for different tasks:

| Checkpoint | Task | Layer | Algorithm | Size |
|------------|------|-------|-----------|------|
| `vg-hubert-syllable.pth` | Syllable | 8 | MinCut + MinCutMerge | 474 MB |
| `vg-hubert-word.pth` | Word | 9 | CLS Attention | 361 MB |

### Algorithm Details

**MinCut Segmentation (Syllables):**
1. Extract HuBERT features from layer 8
2. Compute self-similarity matrix (SSM)
3. Apply efficient MinCut algorithm (Baade et al., 2024)
   - ~40x faster than original O(N²K) implementation
   - Uses cumulative sums for O(1) range queries
4. **Optional**: Apply MinCutMerge post-processing (Peng et al., 2023)
   - Iteratively merge adjacent segments with cosine similarity ≥ threshold
   - Prevents over-segmentation
   - Recommended for production use
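
The merge step in item 4 can be sketched as follows. This is an illustration under stated assumptions, not the code in `vg_hubert/mincut.py`: each segment is represented by a mean feature vector, the most similar adjacent pair is merged first, the merged vector is the frame-count-weighted mean of the two, and merging stops once no adjacent pair reaches the threshold. `cosine` and `merge_adjacent` are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def merge_adjacent(segments, threshold):
    """Merge adjacent segments whose cosine similarity is >= threshold.

    segments: list of (start_frame, end_frame, feature_vector), in time order.
    Returns the merged (start_frame, end_frame) pairs.
    """
    segs = [(s, e, list(f)) for s, e, f in segments]
    while len(segs) > 1:
        sims = [cosine(segs[i][2], segs[i + 1][2]) for i in range(len(segs) - 1)]
        best = max(range(len(sims)), key=sims.__getitem__)
        if sims[best] < threshold:
            break  # no adjacent pair is similar enough
        (s1, e1, f1), (_, e2, f2) = segs[best], segs[best + 1]
        n1, n2 = e1 - s1, e2 - e1
        # Assumed update rule: frame-count-weighted mean of the two vectors.
        merged = [(n1 * x + n2 * y) / (n1 + n2) for x, y in zip(f1, f2)]
        segs[best:best + 2] = [(s1, e2, merged)]
    return [(s, e) for s, e, _ in segs]


# Toy input: the first two segments point in nearly the same direction.
toy = [(0, 5, [1.0, 0.0]), (5, 9, [0.9, 0.1]), (9, 15, [0.0, 1.0])]
print(merge_adjacent(toy, threshold=0.3))
```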

**Performance Comparison:**

| Configuration | F1 (LibriSpeech) | Speed (ms/utt) | Speedup |
|---------------|------------------|----------------|---------|
| Original MinCut | 0.501 | 7524 | 1.0x |
| New MinCut | 0.501 | 169 | 44.5x |
| New + MinCutMerge-0.3 ⭐ | TBD | 171 | 44.0x |

*Note: LibriSpeech results shown; original paper reports F1=0.603 on SpokenCOCO*

### Performance (SpokenCOCO - Original Paper)

**Syllable Segmentation:**
- Boundary F1: 0.603
- Boundary Precision: 0.574
- Boundary Recall: 0.636

**Word Discovery:**
- Token F1: 0.195
- Type F1: 0.174
- NED: 0.748
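
For readers unfamiliar with the boundary metrics quoted above, the sketch below shows one common way to compute boundary precision, recall and F1: a predicted boundary counts as a hit if it lies within a tolerance of a reference boundary that has not been matched yet. The 50 ms tolerance and the greedy one-to-one matching are assumptions for illustration and are not necessarily the evaluation protocol behind the reported numbers; `boundary_prf` is a hypothetical helper.

```python
def boundary_prf(predicted, reference, tolerance=0.05):
    """Boundary precision, recall and F1 with greedy one-to-one matching.

    predicted, reference: boundary times in seconds.
    tolerance: maximum distance (seconds) for a prediction to count as a hit.
    """
    unmatched = sorted(reference)
    hits = 0
    for p in sorted(predicted):
        # Closest reference boundary that has not been used yet.
        best = min(unmatched, key=lambda r: abs(r - p), default=None)
        if best is not None and abs(best - p) <= tolerance:
            hits += 1
            unmatched.remove(best)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Two of three predictions fall within 50 ms of a reference boundary.
precision, recall, f1 = boundary_prf([0.21, 0.50, 0.60], [0.20, 0.48, 0.70, 0.93])
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```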

## Training

VG-HuBERT uses **visually-grounded contrastive learning** to learn speech representations. The model jointly trains on speech and images using datasets like SpokenCOCO or Places.

### Training Setup

1. **Install training dependencies**:
   ```bash
   pip install -r requirements.txt  # Includes fairseq, apex, Pillow, etc.
   ```

2. **Download datasets**:
   - **SpokenCOCO**: [Spoken captions](https://data.csail.mit.edu/placesaudio/SpokenCOCO.tar.gz) + [MSCOCO images](http://cocodataset.org/#download)
   - **Places**: [Spoken descriptions](https://data.csail.mit.edu/placesaudio/) + [Places365 images](http://places2.csail.mit.edu/)

3. **Download pre-trained models** for initialization:
   - [HuBERT Base](https://dl.fbaipublicfiles.com/hubert/hubert_base_ls960.pt) (pretrained on LibriSpeech 960h)
   - [DINO ViT](https://dl.fbaipublicfiles.com/dino/dino_vitsmall8_pretrain/dino_vitsmall8_pretrain_full_checkpoint.pth) (vision encoder)

4. **Configure training**:
   ```yaml
   # configs/spokencoco.yaml
   train_audio_dataset_json_file: "/path/to/SpokenCOCO_train.json"
   val_audio_dataset_json_file: "/path/to/SpokenCOCO_val.json"
   load_hubert_weights: "/path/to/hubert_base_ls960.pt"
   load_pretrained_vit: "/path/to/dino_vitsmall8_pretrain.pth"
   batch_size: 32
   n_epochs: 30
   gpus: "0,1,2,3"
   ```

5. **Train**:
   ```bash
   python train.py --config configs/spokencoco.yaml
   ```

### Training Outputs

- Checkpoints saved to `exp_dir/` (default: `./checkpoints/`)
- TensorBoard logs in experiment directory
- Config saved as `config.yaml` in experiment directory

### Architecture

**Dual-encoder with cross-modal transformer**:
- **Audio encoder**: HuBERT Base (12 layers, 768-dim)
- **Vision encoder**: ViT Small/Base (DINO pretrained)
- **Cross-modal layers**: 5 transformer layers for audio-image interaction
- **Loss**: Margin InfoNCE (contrastive learning in common embedding space)

The trained audio encoder can then be used for segmentation without the vision components.
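
As a rough illustration of the margin InfoNCE loss named in the architecture list, the sketch below uses one common formulation: for a batch similarity matrix with matched audio-image pairs on the diagonal, a margin is subtracted from each matched score before a softmax cross-entropy, and the audio-to-image and image-to-audio directions are summed. The margin value, the absence of a temperature, and the way the two directions are combined are assumptions; the loss in `vg_hubert/training/` may differ. `margin_infonce` is a hypothetical name.

```python
import math


def margin_infonce(sim, margin=1.0):
    """Symmetric margin InfoNCE over a square similarity matrix.

    sim[i][j]: similarity between audio i and image j; matched pairs sit on
    the diagonal. The margin is subtracted from matched scores only.
    """
    n = len(sim)

    def one_direction(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s - margin if j == i else s for j, s in enumerate(row)]
            peak = max(logits)  # subtract the max for numerical stability
            log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
            total += log_norm - logits[i]  # negative log-probability of the match
        return total / n

    transposed = [list(col) for col in zip(*sim)]
    return one_direction(sim) + one_direction(transposed)


# Matched pairs score high in `aligned` and low in `shuffled`.
aligned = [[5.0, 0.0], [0.0, 5.0]]
shuffled = [[0.0, 5.0], [5.0, 0.0]]
print(margin_infonce(aligned), margin_infonce(shuffled))
```

The margin makes the objective stricter: a matched pair has to beat the mismatched pairs by at least the margin before its loss term becomes small.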

### Training from Scratch

The package includes all training code:
- `vg_hubert/model/`: Dual encoder, audio/vision transformers
- `vg_hubert/training/`: Trainer, optimizers, utilities
- `vg_hubert/datasets/`: SpokenCOCO and Places data loaders

See [configs/](configs/) for complete training examples.

## What's Different in This Fork

1. **No PyTorch patching**: Uses native `attn_implementation='eager'` (PyTorch 2.0+)
2. **Simplified interface**: Single `Segmenter` class for all use cases
3. **HuggingFace Hub**: Automatic model downloading
4. **Complete package**: Both training and inference (like Sylber)
5. **PyPI distribution**: Easy installation via pip
6. **Apple Silicon support**: Automatic MPS (Metal Performance Shaders) GPU acceleration
7. **Optimized MinCut**: **~20-50x faster** syllable segmentation using efficient algorithm from [SyllableLM](https://github.com/AlanBaade/SyllableLM) (Baade et al., 2024) with no quality degradation

## Implementation Details

For inference, this package uses HuggingFace's `transformers.HubertModel` instead of the original fairseq implementation. This is possible because VG-HuBERT's audio encoder architecture is identical to the standard HuBERT model. The visual grounding training adds a vision encoder and cross-modal transformer layers, but these components are only used during training to learn better speech representations. At inference time, only the audio encoder weights are needed, which are fully compatible with the HuggingFace HuBERT architecture. This simplifies deployment and eliminates the fairseq dependency for inference.

## Citations

### VG-HuBERT Original Work

**Syllable Segmentation:**
```bibtex
@inproceedings{peng2023syllable,
  title={Syllable Segmentation and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model},
  author={Peng, Puyuan and Li, Shang-Wen and Räsänen, Okko and Mohamed, Abdelrahman and Harwath, David},
  booktitle={Interspeech},
  year={2023}
}
```

**Word Discovery:**
```bibtex
@inproceedings{peng2022word,
  title={Word Discovery in Visually Grounded, Self-Supervised Speech Models},
  author={Peng, Puyuan and Harwath, David},
  booktitle={Interspeech},
  year={2022}
}
```

### Interface Design

This package follows the interface design of Sylber:
```bibtex
@article{cho2024sylber,
  title={Sylber: Syllabic Embedding Representation of Speech from Raw Audio},
  author={Cho, Cheol Jun and Lee, Nicholas and Gupta, Akshat and Agarwal, Dhruv and Chen, Ethan and Black, Alan W and Anumanchipalli, Gopala K},
  journal={arXiv preprint arXiv:2410.07168},
  year={2024}
}
```

### Optimized MinCut Algorithm

The MinCut algorithm used for syllable segmentation has been updated to use the efficient implementation from SyllableLM (Baade et al., 2024), which provides **~20-50x speedup** over the original with no statistically significant quality difference:

```bibtex
@misc{baade2024syllablelmlearningcoarsesemantic,
  title={SyllableLM: Learning Coarse Semantic Units for Speech Language Models},
  author={Alan Baade and Puyuan Peng and David Harwath},
  year={2024},
  eprint={2410.04029},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2410.04029},
}
```

**Performance Comparison (LibriSpeech test-clean, 50 utterances):**
- Speed: 6961ms → 133ms per utterance (**52x faster**)
- Quality: F1=0.377 → 0.372 (p=0.22, not significant)
- 82% of utterances produce identical segmentations

Key optimizations:
- Cumulative sum preprocessing for O(1) range queries
- Segment length constraints (min_hop=3, max_hop=50 frames)
- 5-component cost calculation
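
The first optimization listed above can be illustrated with a two-dimensional prefix-sum table over the self-similarity matrix: after one O(N²) pass, the sum of any rectangular block is available in O(1), so the similarity between a candidate segment and the rest of the utterance no longer has to be re-summed for every candidate. This is a generic sketch of the technique, not the code in `vg_hubert/mincut.py`; `cut_weight` is a simple illustrative quantity and not the 5-component cost mentioned above, and all function names are hypothetical.

```python
def prefix_sums(ssm):
    """Build an (N+1) x (N+1) table where table[i][j] = sum of ssm[:i][:j]."""
    n = len(ssm)
    table = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(n):
            table[i + 1][j + 1] = (
                ssm[i][j] + table[i][j + 1] + table[i + 1][j] - table[i][j]
            )
    return table


def block_sum(table, r0, r1, c0, c1):
    """Sum of ssm[r][c] for r0 <= r < r1 and c0 <= c < c1, in O(1)."""
    return table[r1][c1] - table[r0][c1] - table[r1][c0] + table[r0][c0]


def cut_weight(table, start, end):
    """Similarity between frames [start, end) and all frames outside that span."""
    n = len(table) - 1
    within = block_sum(table, start, end, start, end)
    total = block_sum(table, start, end, 0, n)
    return total - within


# Toy 4-frame SSM with two obvious groups: frames 0-1 and frames 2-3.
ssm = [
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.9],
    [0.0, 0.1, 0.9, 1.0],
]
table = prefix_sums(ssm)
print(cut_weight(table, 0, 2))  # small value: cutting between the groups is cheap
```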

See [vg_hubert/tests/mincut_validation.ipynb](vg_hubert/tests/mincut_validation.ipynb) for full validation results.

## Related Repositories

- **Original implementations**: [word-discovery](https://github.com/jasonppy/word-discovery), [syllable-discovery](https://github.com/jasonppy/syllable-discovery)
- **Fork parent**: [human-ai-lab/VG-HuBERT](https://github.com/human-ai-lab/VG-HuBERT)
- **Interface inspiration**: [Sylber](https://github.com/Berkeley-Speech-Group/sylber)

## License

BSD-3-Clause License (same as original repositories)

## Contributing

Issues and pull requests welcome. Please ensure changes maintain compatibility with original model weights and include proper attribution.