vitosa-speech-II 0.0.1 (tar.gz)
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- vitosa_speech_ii-0.0.1/PKG-INFO +146 -0
- vitosa_speech_ii-0.0.1/README.md +105 -0
- vitosa_speech_ii-0.0.1/setup.cfg +4 -0
- vitosa_speech_ii-0.0.1/setup.py +65 -0
- vitosa_speech_ii-0.0.1/vitosa_speech_II/__init__.py +5 -0
- vitosa_speech_ii-0.0.1/vitosa_speech_II/audio.py +75 -0
- vitosa_speech_ii-0.0.1/vitosa_speech_II/inference.py +59 -0
- vitosa_speech_ii-0.0.1/vitosa_speech_II/model.py +139 -0
- vitosa_speech_ii-0.0.1/vitosa_speech_II/utils.py +44 -0
- vitosa_speech_ii-0.0.1/vitosa_speech_II.egg-info/PKG-INFO +146 -0
- vitosa_speech_ii-0.0.1/vitosa_speech_II.egg-info/SOURCES.txt +12 -0
- vitosa_speech_ii-0.0.1/vitosa_speech_II.egg-info/dependency_links.txt +1 -0
- vitosa_speech_ii-0.0.1/vitosa_speech_II.egg-info/requires.txt +8 -0
- vitosa_speech_ii-0.0.1/vitosa_speech_II.egg-info/top_level.txt +1 -0
vitosa_speech_ii-0.0.1/PKG-INFO
@@ -0,0 +1,146 @@
+Metadata-Version: 2.4
+Name: vitosa-speech-II
+Version: 0.0.1
+Summary: A library for Robust Vietnamese Audio-Based Toxic Span Detection and Censoring
+Author: Vy Le-Phuong Huynh, Huy Ba Do and Luan Thanh Nguyen
+Author-email: luannt@uit.edu.vn
+Project-URL: Model (Hugging Face), https://huggingface.co/UIT-ViToSA/PhoWhisper-BiLSTM-CRF
+Keywords: audio-processing,toxic-span-detection,vietnamese,asr,speech-recognition,censoring,phowhisper
+Classifier: Development Status :: 4 - Beta
+Classifier: Intended Audience :: Developers
+Classifier: Intended Audience :: Science/Research
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.8
+Classifier: Programming Language :: Python :: 3.9
+Classifier: Programming Language :: Python :: 3.10
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Operating System :: OS Independent
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Classifier: Topic :: Multimedia :: Sound/Audio :: Speech
+Classifier: Natural Language :: Vietnamese
+Requires-Python: >=3.7
+Description-Content-Type: text/markdown
+Requires-Dist: torch>=1.13.0
+Requires-Dist: transformers>=4.28.0
+Requires-Dist: librosa
+Requires-Dist: pydub
+Requires-Dist: huggingface_hub
+Requires-Dist: pytorch-crf
+Requires-Dist: numpy
+Requires-Dist: tqdm
+Dynamic: author
+Dynamic: author-email
+Dynamic: classifier
+Dynamic: description
+Dynamic: description-content-type
+Dynamic: keywords
+Dynamic: project-url
+Dynamic: requires-dist
+Dynamic: requires-python
+Dynamic: summary
+
+# ViToSA 2.0: A MULTI-TASK APPROACH TOWARDS ROBUST VIETNAMESE AUDIO-BASED TOXIC SPAN DETECTION | ICASSP 2026
+
+**Official implementation** of the paper:
+**“A Multi-Task Approach Towards Robust Vietnamese Audio-Based Toxic Span Detection”** (ICASSP 2026).
+
+This package provides an end-to-end pipeline for **Vietnamese speech-based toxic span detection**, combining **ASR and toxic span detection** in a unified model. It also supports **automatic audio censoring**, replacing toxic spans with beep sounds in the output waveform.
+
+---
+
+## Key Features
+
+* **Automated Audio Censoring**: Takes an input audio file containing toxic language and outputs a **clean `.wav` file** where profanity is masked with a beep.
+* **Unified Multi-Task Architecture**: Integrates ASR and Toxic Span Detection (TSD) into a single model for fast inference.
+* **SOTA Performance**: Achieves **F1-macro 0.9212** on the ViToSA-v2 dataset using **PhoWhisper + BiLSTM-CRF + Knowledge Distillation**.
+* **High Efficiency**: Reduces inference latency by over **56%** compared to traditional pipelines.
+
+## Installation
+
+```bash
+pip install vitosa-speech-II
+```
+
+### System requirements
+
+This package relies on `pydub` for audio processing, which requires **ffmpeg** to be installed.
+
+- **Ubuntu / Debian**
+```bash
+sudo apt-get install ffmpeg
+```
+
+- **macOS (Homebrew)**
+```bash
+brew install ffmpeg
+```
+
+- **Windows**
+Download ffmpeg from https://ffmpeg.org and add it to your system `PATH`.
+
+---
+
+
+## Quick Start
+This library takes a raw audio file as input and produces a censored audio file as output.
+
+### 1. Load the Model
+```python
+# The model is pre-trained on the ViToSA-v2 dataset.
+
+import torch
+from vitosa_speech_II import load_my_model
+# Automatically detect device (CUDA/CPU)
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+# Load the pre-trained model
+model, processor = load_my_model(device)
+```
+
+### 2. Run Inference (Detect & Censor)
+```python
+from vitosa_speech_II import return_labels, censor_audio_with_beep
+from IPython.display import Audio, display  # Optional: to play in a notebook
+
+# Path to your input file
+input_audio = "samples/toxic_speech.wav"
+
+# Step 1: Detect toxic spans
+words_with_labels = return_labels(input_audio, model, processor, device)
+
+# Step 2: Generate censored audio
+# This function creates a new audio file with beeps over toxic words
+output_audio_path = censor_audio_with_beep(
+    audio_path=input_audio,
+    model=model,
+    processor=processor,
+    words_with_labels=words_with_labels,
+    device=device
+)
+
+# Result
+print(f"✅ Censored audio saved to: {output_audio_path}")
+
+# Optional: Play the result (if in Jupyter/Colab)
+# display(Audio(output_audio_path))
+```
+
+## Methodology
+Our system works in two steps:
+
+1. Detection: The multi-task model (PhoWhisper + BiLSTM-CRF) processes the audio to identify the exact start and end timestamps of toxic words.
+2. Censoring: We reconstruct the audio by keeping the safe segments and overlaying a generated sine-wave beep exactly where the toxic tokens occur, so the rest of the sentence remains intelligible.
+
+<!-- ## Citation
+If you use this tool or our findings, please cite:
+```bibtex
+@inproceedings{huynh2026multitask,
+  title={A Multi-Task Approach Towards Robust Vietnamese Audio-Based Toxic Span Detection},
+  author={Huynh, Vy Le-Phuong and Do, Huy Ba and Nguyen, Luan Thanh},
+  booktitle={Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
+  year={2026}
+}
+``` -->
+
+## Contact
+For more information, contact: luannt@uit.edu.vn
+
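Note on the System requirements above: `pydub` shells out to ffmpeg, so a missing binary only surfaces at runtime. A minimal sanity check before running the pipeline, using pydub's bundled `which` helper (the error message is illustrative):

```python
# Verify that pydub can locate ffmpeg on PATH before running the pipeline.
from pydub.utils import which

if which("ffmpeg") is None:
    raise RuntimeError("ffmpeg not found on PATH; see the Installation section above")
```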
vitosa_speech_ii-0.0.1/README.md
@@ -0,0 +1,105 @@
+# ViToSA 2.0: A MULTI-TASK APPROACH TOWARDS ROBUST VIETNAMESE AUDIO-BASED TOXIC SPAN DETECTION | ICASSP 2026
+
+**Official implementation** of the paper:
+**“A Multi-Task Approach Towards Robust Vietnamese Audio-Based Toxic Span Detection”** (ICASSP 2026).
+
+This package provides an end-to-end pipeline for **Vietnamese speech-based toxic span detection**, combining **ASR and toxic span detection** in a unified model. It also supports **automatic audio censoring**, replacing toxic spans with beep sounds in the output waveform.
+
+---
+
+## Key Features
+
+* **Automated Audio Censoring**: Takes an input audio file containing toxic language and outputs a **clean `.wav` file** where profanity is masked with a beep.
+* **Unified Multi-Task Architecture**: Integrates ASR and Toxic Span Detection (TSD) into a single model for fast inference.
+* **SOTA Performance**: Achieves **F1-macro 0.9212** on the ViToSA-v2 dataset using **PhoWhisper + BiLSTM-CRF + Knowledge Distillation**.
+* **High Efficiency**: Reduces inference latency by over **56%** compared to traditional pipelines.
+
+## Installation
+
+```bash
+pip install vitosa-speech-II
+```
+
+### System requirements
+
+This package relies on `pydub` for audio processing, which requires **ffmpeg** to be installed.
+
+- **Ubuntu / Debian**
+```bash
+sudo apt-get install ffmpeg
+```
+
+- **macOS (Homebrew)**
+```bash
+brew install ffmpeg
+```
+
+- **Windows**
+Download ffmpeg from https://ffmpeg.org and add it to your system `PATH`.
+
+---
+
+
+## Quick Start
+This library takes a raw audio file as input and produces a censored audio file as output.
+
+### 1. Load the Model
+```python
+# The model is pre-trained on the ViToSA-v2 dataset.
+
+import torch
+from vitosa_speech_II import load_my_model
+# Automatically detect device (CUDA/CPU)
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+# Load the pre-trained model
+model, processor = load_my_model(device)
+```
+
+### 2. Run Inference (Detect & Censor)
+```python
+from vitosa_speech_II import return_labels, censor_audio_with_beep
+from IPython.display import Audio, display  # Optional: to play in a notebook
+
+# Path to your input file
+input_audio = "samples/toxic_speech.wav"
+
+# Step 1: Detect toxic spans
+words_with_labels = return_labels(input_audio, model, processor, device)
+
+# Step 2: Generate censored audio
+# This function creates a new audio file with beeps over toxic words
+output_audio_path = censor_audio_with_beep(
+    audio_path=input_audio,
+    model=model,
+    processor=processor,
+    words_with_labels=words_with_labels,
+    device=device
+)
+
+# Result
+print(f"✅ Censored audio saved to: {output_audio_path}")
+
+# Optional: Play the result (if in Jupyter/Colab)
+# display(Audio(output_audio_path))
+```
+
+## Methodology
+Our system works in two steps:
+
+1. Detection: The multi-task model (PhoWhisper + BiLSTM-CRF) processes the audio to identify the exact start and end timestamps of toxic words.
+2. Censoring: We reconstruct the audio by keeping the safe segments and overlaying a generated sine-wave beep exactly where the toxic tokens occur, so the rest of the sentence remains intelligible.
+
+<!-- ## Citation
+If you use this tool or our findings, please cite:
+```bibtex
+@inproceedings{huynh2026multitask,
+  title={A Multi-Task Approach Towards Robust Vietnamese Audio-Based Toxic Span Detection},
+  author={Huynh, Vy Le-Phuong and Do, Huy Ba and Nguyen, Luan Thanh},
+  booktitle={Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
+  year={2026}
+}
+``` -->
+
+## Contact
+For more information, contact: luannt@uit.edu.vn
+
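The Methodology section describes the censoring step abstractly; the sketch below shows the same primitive on a single span, using `pydub` as the package itself does. The file names and span boundaries are placeholders:

```python
from pydub import AudioSegment
from pydub.generators import Sine

audio = AudioSegment.from_wav("input.wav")  # hypothetical input file
start_ms, end_ms = 1200, 1650               # hypothetical toxic span, in milliseconds

# 1 kHz sine beep matching the span's duration, attenuated by 5 dB
beep = Sine(1000).to_audio_segment(duration=end_ms - start_ms).apply_gain(-5)

# Keep the audio before and after the span and splice the beep in between
censored = audio[:start_ms] + beep + audio[end_ms:]
censored.export("output.wav", format="wav")
```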
vitosa_speech_ii-0.0.1/setup.py
@@ -0,0 +1,65 @@
+from setuptools import setup, find_packages
+
+with open("README.md", "r", encoding="utf-8") as fh:
+    long_description = fh.read()
+
+setup(
+    name="vitosa-speech-II",
+    version="0.0.1",
+
+    author="Vy Le-Phuong Huynh, Huy Ba Do and Luan Thanh Nguyen",
+    author_email="luannt@uit.edu.vn",
+
+    description="A library for Robust Vietnamese Audio-Based Toxic Span Detection and Censoring",
+    long_description=long_description,
+    long_description_content_type="text/markdown",
+
+    # Official GitHub link (important)
+    # url="https://github.com/ViToSAResearch/PhoWhisper-BiLSTM-CRF",
+
+    # Extra links shown in the left-hand column of the PyPI page
+    project_urls={
+        # "Bug Tracker": "https://github.com/ViToSAResearch/PhoWhisper-BiLSTM-CRF/issues",
+        "Model (Hugging Face)": "https://huggingface.co/UIT-ViToSA/PhoWhisper-BiLSTM-CRF"
+    },
+
+    packages=find_packages(exclude=("tests", "docs")),
+
+    install_requires=[
+        "torch>=1.13.0",
+        "transformers>=4.28.0",
+        "librosa",
+        "pydub",
+        "huggingface_hub",
+        "pytorch-crf",
+        "numpy",
+        "tqdm"
+    ],
+
+    keywords=[
+        "audio-processing",
+        "toxic-span-detection",
+        "vietnamese",
+        "asr",
+        "speech-recognition",
+        "censoring",
+        "phowhisper"
+    ],
+
+    classifiers=[
+        "Development Status :: 4 - Beta",
+        "Intended Audience :: Developers",
+        "Intended Audience :: Science/Research",
+        "Programming Language :: Python :: 3",
+        "Programming Language :: Python :: 3.8",
+        "Programming Language :: Python :: 3.9",
+        "Programming Language :: Python :: 3.10",
+        "License :: OSI Approved :: MIT License",
+        "Operating System :: OS Independent",
+        "Topic :: Scientific/Engineering :: Artificial Intelligence",
+        "Topic :: Multimedia :: Sound/Audio :: Speech",
+        "Natural Language :: Vietnamese",
+    ],
+
+    python_requires='>=3.7',
+)
vitosa_speech_ii-0.0.1/vitosa_speech_II/audio.py
@@ -0,0 +1,75 @@
+import librosa
+from pydub import AudioSegment
+from pydub.generators import Sine
+from transformers import pipeline
+
+def censor_audio_with_beep(audio_path, model, processor, words_with_labels, device):
+
+    print("-" * 10 + "Calculating word timestamps (alignment)" + "-" * 10)
+    pipe = pipeline(
+        "automatic-speech-recognition",
+        model=model.student,
+        tokenizer=processor.tokenizer,
+        feature_extractor=processor.feature_extractor,
+        device=device,
+        return_timestamps="word"
+    )
+
+    result = pipe(audio_path)
+    chunks = result['chunks']
+
+    toxic_timestamps = []
+    chunk_idx = 0
+    num_chunks = len(chunks)
+
+    # Align each (word, label) pair positionally with the ASR word chunks
+    for word, label in words_with_labels:
+        clean_word = word.replace('<|startoftranscript|>', '').replace('<|transcribe|>', '').strip()
+        if not clean_word: continue
+
+        if chunk_idx < num_chunks:
+            chunk = chunks[chunk_idx]
+            timestamp = chunk['timestamp']
+
+            if label == 1:
+                toxic_timestamps.append(timestamp)
+
+            chunk_idx += 1
+
+    print('\nDone ✓\n')
+    print("-" * 10 + "Cutting and merging audio" + "-" * 10)
+
+    try:
+        original_audio = AudioSegment.from_wav(audio_path)
+    except Exception:
+        print("Error: failed to load audio")
+        return None
+
+    final_audio = AudioSegment.empty()
+
+    current_pos_ms = 0
+
+    toxic_timestamps.sort(key=lambda x: x[0])
+
+    for start, end in toxic_timestamps:
+        start_ms = start * 1000
+        end_ms = end * 1000
+
+        # Keep the clean segment before this toxic span
+        if start_ms > current_pos_ms:
+            clean_segment = original_audio[current_pos_ms:start_ms]
+            final_audio += clean_segment
+
+        # Replace the toxic span with a 1 kHz beep of equal duration
+        duration_ms = end_ms - start_ms
+        if duration_ms > 0:
+            beep = Sine(1000).to_audio_segment(duration=duration_ms).apply_gain(-5)
+            final_audio += beep
+
+        current_pos_ms = max(current_pos_ms, end_ms)
+
+    if current_pos_ms < len(original_audio):
+        remaining_audio = original_audio[current_pos_ms:]
+        final_audio += remaining_audio
+
+    output_path = "censored_audio_clean.wav"
+    final_audio.export(output_path, format="wav")
+    print(f"\nDone ✓\nCensored audio saved to: {output_path}")
+    return output_path
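For reference, the alignment loop in `censor_audio_with_beep` walks the word-level output of the `transformers` ASR pipeline (`return_timestamps="word"`). An illustrative instance with hypothetical values:

```python
# Shape of the pipeline result consumed above: each chunk carries a word and
# a (start_seconds, end_seconds) timestamp pair.
result = {
    "text": " xin chào",
    "chunks": [
        {"text": " xin", "timestamp": (0.0, 0.4)},
        {"text": " chào", "timestamp": (0.4, 0.9)},
    ],
}
# Spans flagged toxic are later converted to milliseconds for pydub slicing.
spans_ms = [(s * 1000, e * 1000) for s, e in (c["timestamp"] for c in result["chunks"])]
print(spans_ms)  # [(0.0, 400.0), (400.0, 900.0)]
```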
vitosa_speech_ii-0.0.1/vitosa_speech_II/inference.py
@@ -0,0 +1,59 @@
+import torch
+import time
+import librosa
+from huggingface_hub import hf_hub_download
+from transformers import WhisperProcessor
+from .model import WhisperToxicSpansKDModel
+from .utils import group_tokens_into_words_corrected
+
+def load_my_model(device, repo_id="ViToSAResearch/PhoWhisper-BiLSTM-CRF", model_filename="model.pth"):
+    print(f"Loading model from {repo_id}...")
+    checkpoint_path = hf_hub_download(repo_id=repo_id, filename=model_filename)
+
+    model = WhisperToxicSpansKDModel(use_crf=True, kd_layers=[4, 8, 12])
+    processor = WhisperProcessor.from_pretrained("Huydb/phowhisper-toxic", language="vietnamese", task="transcribe")
+
+    state_dict = torch.load(checkpoint_path, map_location=device)
+    new_state_dict = {k: v for k, v in state_dict.items() if 'teacher.' not in k}  # drop teacher weights; only the student is needed at inference time
+    model.load_state_dict(new_state_dict, strict=False)
+    model.to(device)
+    model.eval()
+    return model, processor
+
+def toxic_span_asr_inference(audio_path: str, model, whisper_processor, device):
+    timings = {}
+    print("-" * 10 + f"Processing audio file: {audio_path} " + "-" * 10)
+
+    start_time = time.perf_counter()
+    speech_array, sampling_rate = librosa.load(audio_path, sr=16000)
+    input_features = whisper_processor(speech_array, sampling_rate=sampling_rate, return_tensors="pt").input_features
+    timings["1_load_and_preprocess"] = time.perf_counter() - start_time  # audio load + feature extraction
+    with torch.no_grad():
+        start_time = time.perf_counter()
+        predicted_ids = model.student.generate(input_features.to(device))[0]
+        if device.type == 'cuda':
+            torch.cuda.synchronize()
+        timings["2_asr_inference"] = time.perf_counter() - start_time
+
+    transcribed_text = whisper_processor.tokenizer.decode(predicted_ids, skip_special_tokens=True)
+
+    input_ids_for_toxic = predicted_ids.unsqueeze(0).to(device)
+    attention_mask_for_toxic = torch.ones_like(input_ids_for_toxic)
+
+    with torch.no_grad():
+        start_time = time.perf_counter()
+        toxic_labels_list = model.predict(input_ids_for_toxic, attention_mask_for_toxic)
+        if device.type == 'cuda':
+            torch.cuda.synchronize()
+        timings["3_toxic_span_inference"] = time.perf_counter() - start_time
+
+    toxic_labels = toxic_labels_list[0] if isinstance(toxic_labels_list, list) else toxic_labels_list
+
+    return transcribed_text, predicted_ids, toxic_labels, timings
+
+
+def return_labels(audio_file, model, processor, device):
+    text, pred_ids, labels, execution_times = toxic_span_asr_inference(audio_file, model, processor, device)
+
+    words_with_labels = group_tokens_into_words_corrected(pred_ids, labels, processor.tokenizer)
+    return words_with_labels
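`toxic_span_asr_inference` also returns a per-stage timing dictionary, which is handy for reproducing the latency comparison claimed in the README. A small usage sketch (the audio path is a placeholder):

```python
import torch
from vitosa_speech_II import load_my_model
from vitosa_speech_II.inference import toxic_span_asr_inference

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model, processor = load_my_model(device)

text, pred_ids, labels, timings = toxic_span_asr_inference(
    "samples/toxic_speech.wav", model, processor, device
)
for stage, seconds in sorted(timings.items()):
    print(f"{stage}: {seconds * 1000:.1f} ms")
```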
vitosa_speech_ii-0.0.1/vitosa_speech_II/model.py
@@ -0,0 +1,139 @@
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from transformers import WhisperConfig, WhisperTokenizerFast, WhisperForConditionalGeneration
+from torchcrf import CRF
+
+class WhisperToxicSpansKDModel(nn.Module):
+    def __init__(
+        self,
+        whisper_name: str = "Huydb/phowhisper-toxic",
+        use_crf: bool = True,
+        use_bidirectional: bool = True,
+        lstm_hidden: int = 512,
+        dropout: float = 0.4,
+        alpha_span: float = 0.5,
+        alpha_kd: float = 0.5,
+        kd_temp: float = 2.0,
+        kd_layers: list = None,
+        teacher_model=None
+    ):
+        super().__init__()
+        # -- Student setup
+        self.tokenizer = WhisperTokenizerFast.from_pretrained(
+            whisper_name, language="vietnamese", task="transcribe", use_fast=True
+        )
+        config = WhisperConfig.from_pretrained(whisper_name)
+        config.output_hidden_states = True  # enable hidden states
+        self.student = WhisperForConditionalGeneration.from_pretrained(
+            whisper_name, config=config
+        )
+
+        # -- KD teacher (CPU)
+        self.teacher = teacher_model
+
+        # -- Heads
+        d_model = self.student.config.d_model
+        self.dropout = nn.Dropout(dropout)
+        self.use_crf = use_crf
+        self.use_bidirectional = use_bidirectional
+        if use_crf:
+            if use_bidirectional:
+                self.bilstm = nn.LSTM(
+                    d_model, lstm_hidden // 2,
+                    num_layers=1, batch_first=True, bidirectional=True
+                )
+                classifier_in_dim = lstm_hidden
+            else:
+                self.bilstm = nn.LSTM(d_model, lstm_hidden,
+                                      num_layers=1, batch_first=True, bidirectional=False)
+                classifier_in_dim = lstm_hidden
+
+            self.classifier = nn.Linear(classifier_in_dim, 2)
+            self.crf = CRF(2, batch_first=True)
+        else:
+            self.classifier = nn.Linear(d_model, 2)
+
+        # -- KD hyperparams
+        self.alpha_span = alpha_span
+        self.alpha_kd = alpha_kd
+        self.temperature = kd_temp
+        self.kd_layers = kd_layers or [4, 8, 12]
+
+    def forward(
+        self,
+        input_ids,
+        attention_mask,
+        labels=None,
+        teacher_input_ids=None,
+        teacher_attention_mask=None
+    ):
+        device = next(self.student.parameters()).device
+        ids = input_ids.to(device)
+        mask = attention_mask.to(device)
+        lab = labels.to(device) if labels is not None else None
+
+        # 1) Student forward to get hidden states
+        student_outputs = self.student.model.decoder(
+            input_ids=ids,
+            attention_mask=mask,
+            output_hidden_states=True,
+            return_dict=True
+        )
+        student_hiddens = student_outputs.hidden_states  # tuple of layer outputs
+
+        # use top hidden for classification pipeline
+        top_hidden = student_hiddens[-1]
+        h = self.dropout(top_hidden)
+        if self.use_crf:
+            h, _ = self.bilstm(h)
+        logits = self.classifier(h)
+
+        loss = None
+        if lab is not None:
+            # span loss
+            if self.use_crf:
+                m = mask.clone().bool()
+                m[:, 0] = True
+                tags = lab.clone()
+                tags[tags == -100] = 0
+                span_loss = -self.crf(logits, tags, mask=m, reduction='mean')
+            else:
+                span_loss = nn.CrossEntropyLoss(ignore_index=-100)(
+                    logits.view(-1, 2), lab.view(-1)
+                )
+
+            # KD loss: multi-depth
+            kd_loss = 0.0
+            if teacher_input_ids is not None and teacher_attention_mask is not None:
+                # Teacher on CPU
+                with torch.no_grad():
+                    tch_out = self.teacher(
+                        input_ids=teacher_input_ids,
+                        attention_mask=teacher_attention_mask,
+                        output_hidden_states=True,
+                        return_dict=True
+                    )
+                teacher_hiddens = tch_out.hidden_states
+                # compute layer-wise MSE
+                for i in self.kd_layers:
+                    s_feat = student_hiddens[i]
+                    t_feat = teacher_hiddens[i]
+                    # interpolate or project to same size if needed
+                    kd_loss += F.mse_loss(s_feat, t_feat.to(device))
+                kd_loss = kd_loss / len(self.kd_layers)
+                loss = self.alpha_span * span_loss + self.alpha_kd * kd_loss
+            else:
+                loss = span_loss
+
+        return {'loss': loss, 'logits': logits}
+
+    def predict(self, input_ids, attention_mask):
+        self.eval()
+        with torch.no_grad():
+            out = self.forward(input_ids, attention_mask)
+            logits = out['logits']
+            if self.use_crf:
+                m = attention_mask.bool().to(next(self.student.parameters()).device)
+                return self.crf.decode(F.log_softmax(logits, dim=-1), mask=m)
+            return logits.argmax(dim=-1)
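For clarity, the training objective assembled at the end of `forward` combines the span loss with a layer-wise feature-distillation term. In the notation of the constructor arguments, with `h_i^(s)` and `h_i^(t)` the student and teacher hidden states at layer `i`:

```latex
\mathcal{L} \;=\; \alpha_{\mathrm{span}}\,\mathcal{L}_{\mathrm{span}}
\;+\; \alpha_{\mathrm{kd}} \cdot \frac{1}{|K|}\sum_{i \in K}
\mathrm{MSE}\!\left(h^{(s)}_i,\; h^{(t)}_i\right),
\qquad K = \texttt{kd\_layers} = [4, 8, 12]
```

where the span loss is the negative CRF log-likelihood when `use_crf=True` and a token-level cross-entropy (ignoring `-100` positions) otherwise; when no teacher inputs are supplied, the loss reduces to the span loss alone.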
vitosa_speech_ii-0.0.1/vitosa_speech_II/utils.py
@@ -0,0 +1,44 @@
+import torch
+
+def group_tokens_into_words_corrected(predicted_ids, toxic_labels, tokenizer):
+    """
+    Group sub-word token IDs into whole words; each word gets the maximum toxic label of its tokens.
+    """
+    words_with_labels = []
+
+    if isinstance(predicted_ids, torch.Tensor):
+        predicted_ids = predicted_ids.tolist()
+    if isinstance(toxic_labels, torch.Tensor):
+        toxic_labels = toxic_labels.tolist()
+
+    raw_tokens = tokenizer.convert_ids_to_tokens(predicted_ids)
+
+    current_word_ids = []
+    current_label = -1
+
+    min_len = min(len(predicted_ids), len(toxic_labels))
+
+    for i in range(min_len):
+        token_id = predicted_ids[i]
+        label = toxic_labels[i]
+        raw_token = raw_tokens[i]
+
+        # 'Ġ' is the BPE word-boundary marker: it starts a new word
+        if raw_token.startswith('Ġ') or i == 0:
+            # Flush the previous word, if any
+            if current_word_ids:
+                decoded_word = tokenizer.decode(current_word_ids).strip()
+                if decoded_word:
+                    words_with_labels.append((decoded_word, current_label))
+
+            current_word_ids = [token_id]
+            current_label = label
+        else:
+            current_word_ids.append(token_id)
+            current_label = max(current_label, label)
+
+    if current_word_ids:
+        decoded_word = tokenizer.decode(current_word_ids).strip()
+        if decoded_word:
+            words_with_labels.append((decoded_word, current_label))
+
+    return words_with_labels
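To illustrate the grouping rule above (a token starting with the BPE word-boundary marker 'Ġ' opens a new word, and a word inherits the maximum label of its pieces), here is a self-contained sketch; the stub class is a hypothetical stand-in for `WhisperTokenizerFast`:

```python
from vitosa_speech_II.utils import group_tokens_into_words_corrected

class StubTokenizer:
    """Hypothetical minimal tokenizer, for illustration only."""
    vocab = {1: "Ġxin", 2: "Ġchào", 3: "Ġđ", 4: "ồ"}

    def convert_ids_to_tokens(self, ids):
        return [self.vocab[i] for i in ids]

    def decode(self, ids):
        return "".join(self.vocab[i] for i in ids).replace("Ġ", " ")

pairs = group_tokens_into_words_corrected([1, 2, 3, 4], [0, 0, 1, 1], StubTokenizer())
print(pairs)  # [('xin', 0), ('chào', 0), ('đồ', 1)]
```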
vitosa_speech_ii-0.0.1/vitosa_speech_II.egg-info/PKG-INFO
@@ -0,0 +1,146 @@
+Metadata-Version: 2.4
+Name: vitosa-speech-II
+Version: 0.0.1
+Summary: A library for Robust Vietnamese Audio-Based Toxic Span Detection and Censoring
+Author: Vy Le-Phuong Huynh, Huy Ba Do and Luan Thanh Nguyen
+Author-email: luannt@uit.edu.vn
+Project-URL: Model (Hugging Face), https://huggingface.co/UIT-ViToSA/PhoWhisper-BiLSTM-CRF
+Keywords: audio-processing,toxic-span-detection,vietnamese,asr,speech-recognition,censoring,phowhisper
+Classifier: Development Status :: 4 - Beta
+Classifier: Intended Audience :: Developers
+Classifier: Intended Audience :: Science/Research
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.8
+Classifier: Programming Language :: Python :: 3.9
+Classifier: Programming Language :: Python :: 3.10
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Operating System :: OS Independent
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Classifier: Topic :: Multimedia :: Sound/Audio :: Speech
+Classifier: Natural Language :: Vietnamese
+Requires-Python: >=3.7
+Description-Content-Type: text/markdown
+Requires-Dist: torch>=1.13.0
+Requires-Dist: transformers>=4.28.0
+Requires-Dist: librosa
+Requires-Dist: pydub
+Requires-Dist: huggingface_hub
+Requires-Dist: pytorch-crf
+Requires-Dist: numpy
+Requires-Dist: tqdm
+Dynamic: author
+Dynamic: author-email
+Dynamic: classifier
+Dynamic: description
+Dynamic: description-content-type
+Dynamic: keywords
+Dynamic: project-url
+Dynamic: requires-dist
+Dynamic: requires-python
+Dynamic: summary
+
+# ViToSA 2.0: A MULTI-TASK APPROACH TOWARDS ROBUST VIETNAMESE AUDIO-BASED TOXIC SPAN DETECTION | ICASSP 2026
+
+**Official implementation** of the paper:
+**“A Multi-Task Approach Towards Robust Vietnamese Audio-Based Toxic Span Detection”** (ICASSP 2026).
+
+This package provides an end-to-end pipeline for **Vietnamese speech-based toxic span detection**, combining **ASR and toxic span detection** in a unified model. It also supports **automatic audio censoring**, replacing toxic spans with beep sounds in the output waveform.
+
+---
+
+## Key Features
+
+* **Automated Audio Censoring**: Takes an input audio file containing toxic language and outputs a **clean `.wav` file** where profanity is masked with a beep.
+* **Unified Multi-Task Architecture**: Integrates ASR and Toxic Span Detection (TSD) into a single model for fast inference.
+* **SOTA Performance**: Achieves **F1-macro 0.9212** on the ViToSA-v2 dataset using **PhoWhisper + BiLSTM-CRF + Knowledge Distillation**.
+* **High Efficiency**: Reduces inference latency by over **56%** compared to traditional pipelines.
+
+## Installation
+
+```bash
+pip install vitosa-speech-II
+```
+
+### System requirements
+
+This package relies on `pydub` for audio processing, which requires **ffmpeg** to be installed.
+
+- **Ubuntu / Debian**
+```bash
+sudo apt-get install ffmpeg
+```
+
+- **macOS (Homebrew)**
+```bash
+brew install ffmpeg
+```
+
+- **Windows**
+Download ffmpeg from https://ffmpeg.org and add it to your system `PATH`.
+
+---
+
+
+## Quick Start
+This library takes a raw audio file as input and produces a censored audio file as output.
+
+### 1. Load the Model
+```python
+# The model is pre-trained on the ViToSA-v2 dataset.
+
+import torch
+from vitosa_speech_II import load_my_model
+# Automatically detect device (CUDA/CPU)
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+# Load the pre-trained model
+model, processor = load_my_model(device)
+```
+
+### 2. Run Inference (Detect & Censor)
+```python
+from vitosa_speech_II import return_labels, censor_audio_with_beep
+from IPython.display import Audio, display  # Optional: to play in a notebook
+
+# Path to your input file
+input_audio = "samples/toxic_speech.wav"
+
+# Step 1: Detect toxic spans
+words_with_labels = return_labels(input_audio, model, processor, device)
+
+# Step 2: Generate censored audio
+# This function creates a new audio file with beeps over toxic words
+output_audio_path = censor_audio_with_beep(
+    audio_path=input_audio,
+    model=model,
+    processor=processor,
+    words_with_labels=words_with_labels,
+    device=device
+)
+
+# Result
+print(f"✅ Censored audio saved to: {output_audio_path}")
+
+# Optional: Play the result (if in Jupyter/Colab)
+# display(Audio(output_audio_path))
+```
+
+## Methodology
+Our system works in two steps:
+
+1. Detection: The multi-task model (PhoWhisper + BiLSTM-CRF) processes the audio to identify the exact start and end timestamps of toxic words.
+2. Censoring: We reconstruct the audio by keeping the safe segments and overlaying a generated sine-wave beep exactly where the toxic tokens occur, so the rest of the sentence remains intelligible.
+
+<!-- ## Citation
+If you use this tool or our findings, please cite:
+```bibtex
+@inproceedings{huynh2026multitask,
+  title={A Multi-Task Approach Towards Robust Vietnamese Audio-Based Toxic Span Detection},
+  author={Huynh, Vy Le-Phuong and Do, Huy Ba and Nguyen, Luan Thanh},
+  booktitle={Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
+  year={2026}
+}
+``` -->
+
+## Contact
+For more information, contact: luannt@uit.edu.vn
+
vitosa_speech_ii-0.0.1/vitosa_speech_II.egg-info/SOURCES.txt
@@ -0,0 +1,12 @@
+README.md
+setup.py
+vitosa_speech_II/__init__.py
+vitosa_speech_II/audio.py
+vitosa_speech_II/inference.py
+vitosa_speech_II/model.py
+vitosa_speech_II/utils.py
+vitosa_speech_II.egg-info/PKG-INFO
+vitosa_speech_II.egg-info/SOURCES.txt
+vitosa_speech_II.egg-info/dependency_links.txt
+vitosa_speech_II.egg-info/requires.txt
+vitosa_speech_II.egg-info/top_level.txt
vitosa_speech_ii-0.0.1/vitosa_speech_II.egg-info/dependency_links.txt
@@ -0,0 +1 @@
+
vitosa_speech_ii-0.0.1/vitosa_speech_II.egg-info/top_level.txt
@@ -0,0 +1 @@
+vitosa_speech_II