vitosa-speech-II 0.0.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,146 @@
Metadata-Version: 2.4
Name: vitosa-speech-II
Version: 0.0.1
Summary: A library for Robust Vietnamese Audio-Based Toxic Span Detection and Censoring
Author: Vy Le-Phuong Huynh, Huy Ba Do and Luan Thanh Nguyen
Author-email: luannt@uit.edu.vn
Project-URL: Model (Hugging Face), https://huggingface.co/UIT-ViToSA/PhoWhisper-BiLSTM-CRF
Keywords: audio-processing,toxic-span-detection,vietnamese,asr,speech-recognition,censoring,phowhisper
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Multimedia :: Sound/Audio :: Speech
Classifier: Natural Language :: Vietnamese
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Requires-Dist: torch>=1.13.0
Requires-Dist: transformers>=4.28.0
Requires-Dist: librosa
Requires-Dist: pydub
Requires-Dist: huggingface_hub
Requires-Dist: pytorch-crf
Requires-Dist: numpy
Requires-Dist: tqdm
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: keywords
Dynamic: project-url
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# ViToSA 2.0: A MULTI-TASK APPROACH TOWARDS ROBUST VIETNAMESE AUDIO-BASED TOXIC SPAN DETECTION | ICASSP 2026

**Official implementation** of the paper:
**“A Multi-Task Approach Towards Robust Vietnamese Audio-Based Toxic Span Detection”** (ICASSP 2026).

This package provides an end-to-end pipeline for **Vietnamese speech-based toxic span detection**, combining **ASR and toxic span detection** in a unified model. It also supports **automatic audio censoring**, replacing toxic spans with beep sounds in the output waveform.

---

## Key Features

* **Automated Audio Censoring**: Takes an input audio file containing toxic language and outputs a **clean `.wav` file** in which profanity is masked with a beep.
* **Unified Multi-Task Architecture**: Integrates ASR and Toxic Span Detection (TSD) into a single model for faster inference.
* **SOTA Performance**: Achieves **F1-macro 0.9212** on the ViToSA-v2 dataset using **PhoWhisper + BiLSTM-CRF + Knowledge Distillation**.
* **High Efficiency**: Reduces inference latency by over **56%** compared to traditional pipelines.

## Installation

```bash
pip install vitosa-speech-II
```

### System requirements

This package relies on `pydub` for audio processing, which requires **ffmpeg** to be installed.

- **Ubuntu / Debian**
  ```bash
  sudo apt-get install ffmpeg
  ```

- **macOS (Homebrew)**
  ```bash
  brew install ffmpeg
  ```

- **Windows**
  Download ffmpeg from https://ffmpeg.org and add it to your system `PATH`.

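To verify that `pydub` can find the required binaries, here is a quick check using only the Python standard library (not part of this package):

```python
import shutil

# pydub shells out to the ffmpeg/ffprobe binaries, so they must be on PATH.
for tool in ("ffmpeg", "ffprobe"):
    path = shutil.which(tool)
    print(f"{tool}: {path if path else 'NOT FOUND - install it first'}")
```
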
---

## Quick Start
This library takes a raw audio file as input and produces a censored audio file as output.

### 1. Load the Model
```python
# The model is pre-trained on the ViToSA-v2 dataset

import torch
from vitosa_speech_II import load_my_model

# Automatically detect device (CUDA/CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the pre-trained model
model, processor = load_my_model(device)
```

### 2. Run Inference (Detect & Censor)
```python
from vitosa_speech_II import return_labels, censor_audio_with_beep
from IPython.display import Audio, display  # Optional: to play in a notebook

# Path to your input file
input_audio = "samples/toxic_speech.wav"

# Step 1: Detect toxic spans
words_with_labels = return_labels(input_audio, model, processor, device)

# Step 2: Generate censored audio
# This function creates a new audio file with beeps over toxic words
output_audio_path = censor_audio_with_beep(
    audio_path=input_audio,
    model=model,
    processor=processor,
    words_with_labels=words_with_labels,
    device=device
)

# Result
print(f"✅ Censored audio saved to: {output_audio_path}")

# Optional: Play the result (if in Jupyter/Colab)
# display(Audio(output_audio_path))
```

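`return_labels` yields a list of `(word, label)` tuples, where label `1` marks a toxic word and `0` a safe one. A small inspection sketch (the example words are made up):

```python
# words_with_labels might look like: [("xin", 0), ("chào", 0), ("<toxic-word>", 1)]
for word, label in words_with_labels:
    if label == 1:
        print(f"Toxic span detected: {word}")
```
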
## Methodology
Our system works in two steps:

1. Detection: The multi-task model (PhoWhisper + BiLSTM-CRF) processes the audio to identify the exact start and end timestamps of toxic words.
2. Censoring: We reconstruct the audio by keeping the safe segments and generating a sine wave (beep) to overlay exactly where the toxic tokens occur, ensuring the rest of the sentence remains intelligible (see the sketch below).

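A minimal sketch of the censoring step, assuming a WAV input and a list of `(start, end)` timestamps in seconds (the file name and spans here are illustrative, not produced by the package):

```python
from pydub import AudioSegment
from pydub.generators import Sine

audio = AudioSegment.from_wav("input.wav")
spans = [(1.2, 1.8), (3.0, 3.4)]  # hypothetical toxic spans, in seconds

out, pos = AudioSegment.empty(), 0
for start, end in spans:
    start_ms, end_ms = int(start * 1000), int(end * 1000)
    out += audio[pos:start_ms]  # keep the safe segment before the span
    out += Sine(1000).to_audio_segment(duration=end_ms - start_ms).apply_gain(-5)  # beep
    pos = end_ms
out += audio[pos:]  # keep the tail after the last span
out.export("censored.wav", format="wav")
```
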
<!-- ## Citation
If you use this tool or our findings, please cite:
```bibtex
@inproceedings{huynh2026multitask,
  title={A Multi-Task Approach Towards Robust Vietnamese Audio-Based Toxic Span Detection},
  author={Huynh, Vy Le-Phuong and Do, Huy Ba and Nguyen, Luan Thanh},
  booktitle={Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2026}
}
``` -->

## Contact
For more information: luannt@uit.edu.vn

@@ -0,0 +1,105 @@
# ViToSA 2.0: A MULTI-TASK APPROACH TOWARDS ROBUST VIETNAMESE AUDIO-BASED TOXIC SPAN DETECTION | ICASSP 2026

**Official implementation** of the paper:
**“A Multi-Task Approach Towards Robust Vietnamese Audio-Based Toxic Span Detection”** (ICASSP 2026).

This package provides an end-to-end pipeline for **Vietnamese speech-based toxic span detection**, combining **ASR and toxic span detection** in a unified model. It also supports **automatic audio censoring**, replacing toxic spans with beep sounds in the output waveform.

---

## Key Features

* **Automated Audio Censoring**: Takes an input audio file containing toxic language and outputs a **clean `.wav` file** in which profanity is masked with a beep.
* **Unified Multi-Task Architecture**: Integrates ASR and Toxic Span Detection (TSD) into a single model for faster inference.
* **SOTA Performance**: Achieves **F1-macro 0.9212** on the ViToSA-v2 dataset using **PhoWhisper + BiLSTM-CRF + Knowledge Distillation**.
* **High Efficiency**: Reduces inference latency by over **56%** compared to traditional pipelines.

## Installation

```bash
pip install vitosa-speech-II
```

### System requirements

This package relies on `pydub` for audio processing, which requires **ffmpeg** to be installed.

- **Ubuntu / Debian**
  ```bash
  sudo apt-get install ffmpeg
  ```

- **macOS (Homebrew)**
  ```bash
  brew install ffmpeg
  ```

- **Windows**
  Download ffmpeg from https://ffmpeg.org and add it to your system `PATH`.

---

## Quick Start
This library takes a raw audio file as input and produces a censored audio file as output.

### 1. Load the Model
```python
# The model is pre-trained on the ViToSA-v2 dataset

import torch
from vitosa_speech_II import load_my_model

# Automatically detect device (CUDA/CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the pre-trained model
model, processor = load_my_model(device)
```

### 2. Run Inference (Detect & Censor)
```python
from vitosa_speech_II import return_labels, censor_audio_with_beep
from IPython.display import Audio, display  # Optional: to play in a notebook

# Path to your input file
input_audio = "samples/toxic_speech.wav"

# Step 1: Detect toxic spans
words_with_labels = return_labels(input_audio, model, processor, device)

# Step 2: Generate censored audio
# This function creates a new audio file with beeps over toxic words
output_audio_path = censor_audio_with_beep(
    audio_path=input_audio,
    model=model,
    processor=processor,
    words_with_labels=words_with_labels,
    device=device
)

# Result
print(f"✅ Censored audio saved to: {output_audio_path}")

# Optional: Play the result (if in Jupyter/Colab)
# display(Audio(output_audio_path))
```

## Methodology
Our system works in two steps:

1. Detection: The multi-task model (PhoWhisper + BiLSTM-CRF) processes the audio to identify the exact start and end timestamps of toxic words.
2. Censoring: We reconstruct the audio by keeping the safe segments and generating a sine wave (beep) to overlay exactly where the toxic tokens occur, ensuring the rest of the sentence remains intelligible.

<!-- ## Citation
If you use this tool or our findings, please cite:
```bibtex
@inproceedings{huynh2026multitask,
  title={A Multi-Task Approach Towards Robust Vietnamese Audio-Based Toxic Span Detection},
  author={Huynh, Vy Le-Phuong and Do, Huy Ba and Nguyen, Luan Thanh},
  booktitle={Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2026}
}
``` -->

## Contact
For more information: luannt@uit.edu.vn

@@ -0,0 +1,4 @@
[egg_info]
tag_build =
tag_date = 0

@@ -0,0 +1,65 @@
from setuptools import setup, find_packages

with open("README.md", "r", encoding="utf-8") as fh:
    long_description = fh.read()

setup(
    name="vitosa-speech-II",
    version="0.0.1",

    author="Vy Le-Phuong Huynh, Huy Ba Do and Luan Thanh Nguyen",
    author_email="luannt@uit.edu.vn",

    description="A library for Robust Vietnamese Audio-Based Toxic Span Detection and Censoring",
    long_description=long_description,
    long_description_content_type="text/markdown",

    # Official GitHub link (important)
    # url="https://github.com/ViToSAResearch/PhoWhisper-BiLSTM-CRF",

    # Additional links shown in the left-hand column of the PyPI page
    project_urls={
        # "Bug Tracker": "https://github.com/ViToSAResearch/PhoWhisper-BiLSTM-CRF/issues",
        "Model (Hugging Face)": "https://huggingface.co/UIT-ViToSA/PhoWhisper-BiLSTM-CRF"
    },

    packages=find_packages(exclude=("tests", "docs")),

    install_requires=[
        "torch>=1.13.0",
        "transformers>=4.28.0",
        "librosa",
        "pydub",
        "huggingface_hub",
        "pytorch-crf",
        "numpy",
        "tqdm"
    ],

    keywords=[
        "audio-processing",
        "toxic-span-detection",
        "vietnamese",
        "asr",
        "speech-recognition",
        "censoring",
        "phowhisper"
    ],

    classifiers=[
        "Development Status :: 4 - Beta",
        "Intended Audience :: Developers",
        "Intended Audience :: Science/Research",
        "Programming Language :: Python :: 3",
        "Programming Language :: Python :: 3.8",
        "Programming Language :: Python :: 3.9",
        "Programming Language :: Python :: 3.10",
        "License :: OSI Approved :: MIT License",
        "Operating System :: OS Independent",
        "Topic :: Scientific/Engineering :: Artificial Intelligence",
        "Topic :: Multimedia :: Sound/Audio :: Speech",
        "Natural Language :: Vietnamese",
    ],

    python_requires='>=3.7',
)
@@ -0,0 +1,5 @@
from .inference import load_my_model, return_labels, toxic_span_asr_inference
from .audio import censor_audio_with_beep
from .model import WhisperToxicSpansKDModel

__version__ = "0.0.1"
@@ -0,0 +1,75 @@
from pydub import AudioSegment
from pydub.generators import Sine
from transformers import pipeline

def censor_audio_with_beep(audio_path, model, processor, words_with_labels, device):

    print("-" * 10 + " Calculating timestamps (alignment) " + "-" * 10)
    # Re-run the student ASR model with word-level timestamps so the
    # word/label pairs can be aligned to positions in the waveform.
    pipe = pipeline(
        "automatic-speech-recognition",
        model=model.student,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
        device=device,
        return_timestamps="word"
    )

    result = pipe(audio_path)
    chunks = result['chunks']

    toxic_timestamps = []
    chunk_idx = 0
    num_chunks = len(chunks)

    # Walk the labelled words and the timestamped chunks in parallel,
    # collecting the (start, end) spans of words labelled toxic (label == 1).
    for word, label in words_with_labels:
        clean_word = word.replace('<|startoftranscript|>', '').replace('<|transcribe|>', '').strip()
        if not clean_word:
            continue

        if chunk_idx < num_chunks:
            chunk = chunks[chunk_idx]
            timestamp = chunk['timestamp']

            if label == 1:
                toxic_timestamps.append(timestamp)

            chunk_idx += 1

    print('\nDone ✓\n')
    print("-" * 10 + " Cutting and merging audio " + "-" * 10)

    try:
        original_audio = AudioSegment.from_wav(audio_path)
    except Exception as e:
        print(f"Error: failed to load audio ({e})")
        return None

    final_audio = AudioSegment.empty()

    current_pos_ms = 0

    toxic_timestamps.sort(key=lambda x: x[0])

    for start, end in toxic_timestamps:
        # Whisper occasionally returns None for the final word's end time.
        if start is None or end is None:
            continue
        start_ms = int(start * 1000)
        end_ms = int(end * 1000)

        # Copy the clean audio between the previous span and this one.
        if start_ms > current_pos_ms:
            clean_segment = original_audio[current_pos_ms:start_ms]
            final_audio += clean_segment

        # Replace the toxic span with a 1 kHz sine beep of the same duration.
        duration_ms = end_ms - start_ms
        if duration_ms > 0:
            beep = Sine(1000).to_audio_segment(duration=duration_ms).apply_gain(-5)
            final_audio += beep

        current_pos_ms = max(current_pos_ms, end_ms)

    # Append whatever clean audio remains after the last toxic span.
    if current_pos_ms < len(original_audio):
        remaining_audio = original_audio[current_pos_ms:]
        final_audio += remaining_audio

    output_path = "censored_audio_clean.wav"
    final_audio.export(output_path, format="wav")
    print(f"\nDone ✓\nCensored file saved to: {output_path}")
    return output_path
@@ -0,0 +1,59 @@
import torch
import time
import librosa
from huggingface_hub import hf_hub_download
from transformers import WhisperProcessor
from .model import WhisperToxicSpansKDModel
from .utils import group_tokens_into_words_corrected

def load_my_model(device, repo_id="ViToSAResearch/PhoWhisper-BiLSTM-CRF", model_filename="model.pth"):
    print(f"Loading model from {repo_id}...")
    checkpoint_path = hf_hub_download(repo_id=repo_id, filename=model_filename)

    model = WhisperToxicSpansKDModel(use_crf=True, kd_layers=[4, 8, 12])
    processor = WhisperProcessor.from_pretrained("Huydb/phowhisper-toxic", language="vietnamese", task="transcribe")

    # Drop the distillation teacher's weights: only the student is needed at inference time.
    state_dict = torch.load(checkpoint_path, map_location=device)
    new_state_dict = {k: v for k, v in state_dict.items() if 'teacher.' not in k}
    model.load_state_dict(new_state_dict, strict=False)
    model.to(device)
    model.eval()
    return model, processor

def toxic_span_asr_inference(audio_path: str, model, whisper_processor, device):
    timings = {}
    print("-" * 10 + f" Processing audio file: {audio_path} " + "-" * 10)

    start_time = time.perf_counter()
    speech_array, sampling_rate = librosa.load(audio_path, sr=16000)
    input_features = whisper_processor(speech_array, sampling_rate=sampling_rate, return_tensors="pt").input_features
    timings["1_feature_extraction"] = time.perf_counter() - start_time  # step 1: audio loading + feature extraction

    # Step 2: ASR decoding with the student model.
    with torch.no_grad():
        start_time = time.perf_counter()
        predicted_ids = model.student.generate(input_features.to(device))[0]
        if device.type == 'cuda':
            torch.cuda.synchronize()
        timings["2_asr_inference"] = time.perf_counter() - start_time

    transcribed_text = whisper_processor.tokenizer.decode(predicted_ids, skip_special_tokens=True)

    # Step 3: token-level toxic span tagging over the decoded token IDs.
    input_ids_for_toxic = predicted_ids.unsqueeze(0).to(device)
    attention_mask_for_toxic = torch.ones_like(input_ids_for_toxic)

    with torch.no_grad():
        start_time = time.perf_counter()
        toxic_labels_list = model.predict(input_ids_for_toxic, attention_mask_for_toxic)
        if device.type == 'cuda':
            torch.cuda.synchronize()
        timings["3_toxic_span_inference"] = time.perf_counter() - start_time

    toxic_labels = toxic_labels_list[0] if isinstance(toxic_labels_list, list) else toxic_labels_list

    return transcribed_text, predicted_ids, toxic_labels, timings


def return_labels(audio_file, model, processor, device):
    text, pred_ids, labels, execution_times = toxic_span_asr_inference(audio_file, model, processor, device)

    words_with_labels = group_tokens_into_words_corrected(pred_ids, labels, processor.tokenizer)
    return words_with_labels
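

# Illustrative usage sketch (editor's addition, not shipped package code);
# assumes a WAV file exists at the hypothetical path below.
if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model, processor = load_my_model(device)
    text, pred_ids, labels, timings = toxic_span_asr_inference(
        "samples/toxic_speech.wav", model, processor, device
    )
    print(f"Transcript: {text}")
    for stage, seconds in timings.items():
        print(f"{stage}: {seconds:.3f}s")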
@@ -0,0 +1,139 @@
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import WhisperConfig, WhisperTokenizerFast, WhisperForConditionalGeneration
from torchcrf import CRF

class WhisperToxicSpansKDModel(nn.Module):
    def __init__(
        self,
        whisper_name: str = "Huydb/phowhisper-toxic",
        use_crf: bool = True,
        use_bidirectional: bool = True,
        lstm_hidden: int = 512,
        dropout: float = 0.4,
        alpha_span: float = 0.5,
        alpha_kd: float = 0.5,
        kd_temp: float = 2.0,
        kd_layers: list = None,
        teacher_model=None
    ):
        super().__init__()
        # -- Student setup
        self.tokenizer = WhisperTokenizerFast.from_pretrained(
            whisper_name, language="vietnamese", task="transcribe", use_fast=True
        )
        config = WhisperConfig.from_pretrained(whisper_name)
        config.output_hidden_states = True  # enable hidden states
        self.student = WhisperForConditionalGeneration.from_pretrained(
            whisper_name, config=config
        )

        # -- KD teacher (CPU)
        self.teacher = teacher_model

        # -- Heads
        d_model = self.student.config.d_model
        self.dropout = nn.Dropout(dropout)
        self.use_crf = use_crf
        self.use_bidirectional = use_bidirectional
        if use_crf:
            if use_bidirectional:
                self.bilstm = nn.LSTM(
                    d_model, lstm_hidden // 2,
                    num_layers=1, batch_first=True, bidirectional=True
                )
                classifier_in_dim = lstm_hidden
            else:
                self.bilstm = nn.LSTM(d_model, lstm_hidden,
                                      num_layers=1, batch_first=True, bidirectional=False)
                classifier_in_dim = lstm_hidden

            self.classifier = nn.Linear(classifier_in_dim, 2)
            self.crf = CRF(2, batch_first=True)
        else:
            self.classifier = nn.Linear(d_model, 2)

        # -- KD hyperparams
        self.alpha_span = alpha_span
        self.alpha_kd = alpha_kd
        self.temperature = kd_temp
        self.kd_layers = kd_layers or [4, 8, 12]

    def forward(
        self,
        input_ids,
        attention_mask,
        labels=None,
        teacher_input_ids=None,
        teacher_attention_mask=None
    ):
        device = next(self.student.parameters()).device
        ids = input_ids.to(device)
        mask = attention_mask.to(device)
        lab = labels.to(device) if labels is not None else None

        # 1) Student forward pass to get decoder hidden states
        student_outputs = self.student.model.decoder(
            input_ids=ids,
            attention_mask=mask,
            output_hidden_states=True,
            return_dict=True
        )
        student_hiddens = student_outputs.hidden_states  # tuple of per-layer outputs

        # Use the top hidden layer for the classification pipeline
        top_hidden = student_hiddens[-1]
        h = self.dropout(top_hidden)
        if self.use_crf:
            h, _ = self.bilstm(h)
        logits = self.classifier(h)

        loss = None
        if lab is not None:
            # Span loss
            if self.use_crf:
                m = mask.clone().bool()
                m[:, 0] = True
                tags = lab.clone()
                tags[tags == -100] = 0
                span_loss = -self.crf(logits, tags, mask=m, reduction='mean')
            else:
                span_loss = nn.CrossEntropyLoss(ignore_index=-100)(
                    logits.view(-1, 2), lab.view(-1)
                )

            # KD loss: multi-depth feature matching
            kd_loss = 0.0
            if teacher_input_ids is not None and teacher_attention_mask is not None:
                # Teacher on CPU
                with torch.no_grad():
                    tch_out = self.teacher(
                        input_ids=teacher_input_ids,
                        attention_mask=teacher_attention_mask,
                        output_hidden_states=True,
                        return_dict=True
                    )
                teacher_hiddens = tch_out.hidden_states
                # Compute layer-wise MSE
                for i in self.kd_layers:
                    s_feat = student_hiddens[i]
                    t_feat = teacher_hiddens[i]
                    # interpolate or project to the same size if needed
                    kd_loss += F.mse_loss(s_feat, t_feat.to(device))
                kd_loss = kd_loss / len(self.kd_layers)
                loss = self.alpha_span * span_loss + self.alpha_kd * kd_loss
            else:
                loss = span_loss

        return {'loss': loss, 'logits': logits}

    def predict(self, input_ids, attention_mask):
        self.eval()
        with torch.no_grad():
            out = self.forward(input_ids, attention_mask)
            logits = out['logits']
            if self.use_crf:
                m = attention_mask.bool().to(next(self.student.parameters()).device)
                return self.crf.decode(F.log_softmax(logits, dim=-1), mask=m)
            return logits.argmax(dim=-1)
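

# Illustrative sketch (editor's addition, not shipped package code): tag a short
# decoded token sequence. Note this downloads the "Huydb/phowhisper-toxic"
# checkpoint from Hugging Face on first run.
if __name__ == "__main__":
    model = WhisperToxicSpansKDModel(use_crf=True)
    ids = model.tokenizer("xin chào", return_tensors="pt").input_ids
    mask = torch.ones_like(ids)
    print(model.predict(ids, mask))  # one 0/1 label per token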
@@ -0,0 +1,44 @@
import torch

def group_tokens_into_words_corrected(predicted_ids, toxic_labels, tokenizer):
    """
    Group token IDs into words and assign each word a label.
    A word is labelled toxic if any of its sub-tokens is labelled toxic.
    """
    words_with_labels = []

    if isinstance(predicted_ids, torch.Tensor):
        predicted_ids = predicted_ids.tolist()
    if isinstance(toxic_labels, torch.Tensor):
        toxic_labels = toxic_labels.tolist()

    raw_tokens = tokenizer.convert_ids_to_tokens(predicted_ids)

    current_word_ids = []
    current_label = -1

    min_len = min(len(predicted_ids), len(toxic_labels))

    for i in range(min_len):
        token_id = predicted_ids[i]
        label = toxic_labels[i]
        raw_token = raw_tokens[i]

        # 'Ġ' is the BPE marker for a token that starts a new word.
        if raw_token.startswith('Ġ') or i == 0:
            # Flush the previous word before starting a new one.
            if current_word_ids:
                decoded_word = tokenizer.decode(current_word_ids).strip()
                if decoded_word:
                    words_with_labels.append((decoded_word, current_label))

            current_word_ids = [token_id]
            current_label = label
        else:
            current_word_ids.append(token_id)
            current_label = max(current_label, label)

    # Flush the final word.
    if current_word_ids:
        decoded_word = tokenizer.decode(current_word_ids).strip()
        if decoded_word:
            words_with_labels.append((decoded_word, current_label))

    return words_with_labels
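

# Illustrative sketch (editor's addition, not shipped package code): the grouping
# relies on the BPE word-boundary marker 'Ġ'; a toy tokenizer stub shows the mechanics.
if __name__ == "__main__":
    class _ToyTokenizer:
        _tokens = {1: 'Ġxin', 2: 'Ġch', 3: 'ào'}

        def convert_ids_to_tokens(self, ids):
            return [self._tokens[i] for i in ids]

        def decode(self, ids):
            return ''.join(self._tokens[i] for i in ids).replace('Ġ', ' ')

    print(group_tokens_into_words_corrected([1, 2, 3], [0, 1, 0], _ToyTokenizer()))
    # -> [('xin', 0), ('chào', 1)]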
@@ -0,0 +1,12 @@
README.md
setup.py
vitosa_speech_II/__init__.py
vitosa_speech_II/audio.py
vitosa_speech_II/inference.py
vitosa_speech_II/model.py
vitosa_speech_II/utils.py
vitosa_speech_II.egg-info/PKG-INFO
vitosa_speech_II.egg-info/SOURCES.txt
vitosa_speech_II.egg-info/dependency_links.txt
vitosa_speech_II.egg-info/requires.txt
vitosa_speech_II.egg-info/top_level.txt
1
+ torch>=1.13.0
2
+ transformers>=4.28.0
3
+ librosa
4
+ pydub
5
+ huggingface_hub
6
+ pytorch-crf
7
+ numpy
8
+ tqdm
@@ -0,0 +1 @@
vitosa_speech_II