PyPI - sopro - Versions diffs - 1.0.1__tar.gz → 1.5.0__tar.gz - Mend

sopro 1.0.1tar.gz → 1.5.0tar.gz


Files changed (33)

{sopro-1.0.1 → sopro-1.5.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: sopro
-Version: 1.0.1
+Version: 1.5.0
 Summary: A lightweight text-to-speech model with zero-shot voice cloning.
 Author-email: Samuel Vitorino <samvitorino@gmail.com>
 License: Apache 2.0
@@ -27,14 +27,18 @@ https://github.com/user-attachments/assets/40254391-248f-45ff-b9a4-107d64fbb95f
 [![Alt Text](https://img.shields.io/badge/HuggingFace-Model-orange?logo=huggingface)](https://huggingface.co/samuel-vitorino/sopro)
+### 📰 News
+**2026.02.04 – SoproTTS v1.5 is out: more stable, faster, and smaller. Trained for just $100, it reaches 250 ms TTFA streaming and 0.05 RTF (~20× realtime) on CPU.**
 Sopro (from the Portuguese word for “breath/blow”) is a lightweight English text-to-speech model I trained as a side project. Sopro is composed of dilated convs (à la WaveNet) and lightweight cross-attention layers, instead of the common Transformer architecture. Even though Sopro is not SOTA across most voices and situations, I still think it’s a cool project made with a very low budget (trained on a single L40S GPU), and it can be improved with better data.
 Some of the main features are:
-- **169M parameters**
+- **147M parameters**
 - **Streaming**
 - **Zero-shot voice cloning**
-- **0.25 RTF on CPU** (measured on an M3 base model), meaning it generates 30 seconds of audio in 7.5 seconds
+- **0.05 RTF on CPU** (measured on an M3 base model), meaning it generates 32 seconds of audio in 1.77 seconds
 - **3-12 seconds of reference audio** for voice cloning
 ---
@@ -53,7 +57,7 @@ conda activate soprotts
 ### From PyPI
 ```bash
-pip install sopro
+pip install -U sopro
 ```
 ### From the repo
@@ -79,9 +83,7 @@ soprotts \
 You have the expected `temperature` and `top_p` parameters, alongside:
-- `--style_strength` (controls the FiLM strength; increasing it can improve or reduce voice similarity; default `1.0`)
-- `--no_stop_head` to disable early stopping
-- `--stop_threshold` and `--stop_patience` (number of consecutive frames that must be classified as final before **stopping**). For short sentences, the stop head may fail to trigger, in which case you can lower these values. Likewise, if the model stops before producing the full text, adjusting these parameters up can help.
+- `--style_strength` (controls the FiLM strength; increasing it can improve or reduce voice similarity; default `1.2`)
 ### Python
@@ -119,6 +121,27 @@ wav = torch.cat(chunks, dim=-1)
 tts.save_wav("out_stream.wav", wav)
 ```
+You can also precalculate the reference to reduce TTFA:
+```python
+import torch
+from sopro import SoproTTS
+tts = SoproTTS.from_pretrained("samuel-vitorino/sopro", device="cpu")
+ref = tts.prepare_reference(ref_audio_path="ref.mp3")
+chunks = []
+for chunk in tts.stream(
+    "Hello! This is a streaming Sopro TTS example.",
+    ref=ref,
+):
+    chunks.append(chunk.cpu())
+wav = torch.cat(chunks, dim=-1)
+tts.save_wav("out_stream.wav", wav)
+```
 ---
 ## Interactive streaming demo

{sopro-1.0.1 → sopro-1.5.0}/README.md RENAMED

@@ -4,14 +4,18 @@ https://github.com/user-attachments/assets/40254391-248f-45ff-b9a4-107d64fbb95f
 [![Alt Text](https://img.shields.io/badge/HuggingFace-Model-orange?logo=huggingface)](https://huggingface.co/samuel-vitorino/sopro)
+### 📰 News
+**2026.02.04 – SoproTTS v1.5 is out: more stable, faster, and smaller. Trained for just $100, it reaches 250 ms TTFA streaming and 0.05 RTF (~20× realtime) on CPU.**
 Sopro (from the Portuguese word for “breath/blow”) is a lightweight English text-to-speech model I trained as a side project. Sopro is composed of dilated convs (à la WaveNet) and lightweight cross-attention layers, instead of the common Transformer architecture. Even though Sopro is not SOTA across most voices and situations, I still think it’s a cool project made with a very low budget (trained on a single L40S GPU), and it can be improved with better data.
 Some of the main features are:
-- **169M parameters**
+- **147M parameters**
 - **Streaming**
 - **Zero-shot voice cloning**
-- **0.25 RTF on CPU** (measured on an M3 base model), meaning it generates 30 seconds of audio in 7.5 seconds
+- **0.05 RTF on CPU** (measured on an M3 base model), meaning it generates 32 seconds of audio in 1.77 seconds
 - **3-12 seconds of reference audio** for voice cloning
 ---
@@ -30,7 +34,7 @@ conda activate soprotts
 ### From PyPI
 ```bash
-pip install sopro
+pip install -U sopro
 ```
 ### From the repo
@@ -56,9 +60,7 @@ soprotts \
 You have the expected `temperature` and `top_p` parameters, alongside:
-- `--style_strength` (controls the FiLM strength; increasing it can improve or reduce voice similarity; default `1.0`)
-- `--no_stop_head` to disable early stopping
-- `--stop_threshold` and `--stop_patience` (number of consecutive frames that must be classified as final before **stopping**). For short sentences, the stop head may fail to trigger, in which case you can lower these values. Likewise, if the model stops before producing the full text, adjusting these parameters up can help.
+- `--style_strength` (controls the FiLM strength; increasing it can improve or reduce voice similarity; default `1.2`)
 ### Python
@@ -96,6 +98,27 @@ wav = torch.cat(chunks, dim=-1)
 tts.save_wav("out_stream.wav", wav)
 ```
+You can also precalculate the reference to reduce TTFA:
+```python
+import torch
+from sopro import SoproTTS
+tts = SoproTTS.from_pretrained("samuel-vitorino/sopro", device="cpu")
+ref = tts.prepare_reference(ref_audio_path="ref.mp3")
+chunks = []
+for chunk in tts.stream(
+    "Hello! This is a streaming Sopro TTS example.",
+    ref=ref,
+):
+    chunks.append(chunk.cpu())
+wav = torch.cat(chunks, dim=-1)
+tts.save_wav("out_stream.wav", wav)
+```
 ---
 ## Interactive streaming demo
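
Note on the RTF figures in the README hunks above: real-time factor is conventionally synthesis time divided by audio duration. The short check below recomputes it from the two timings quoted in the README. The helper name `real_time_factor` is made up for this note and is not part of the `sopro` package.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Timings quoted in the README for v1.0.1 and v1.5.0 (M3 base model, CPU).
old_rtf = real_time_factor(7.5, 30.0)
new_rtf = real_time_factor(1.77, 32.0)

print(f"v1.0.1: RTF={old_rtf:.3f}, {1 / old_rtf:.1f}x realtime")
print(f"v1.5.0: RTF={new_rtf:.3f}, {1 / new_rtf:.1f}x realtime")
```

Taken literally, 1.77 s for 32 s of audio gives an RTF of about 0.055 (roughly 18× realtime), so the README's "0.05 RTF (~20× realtime)" appears to be a rounded figure.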

{sopro-1.0.1 → sopro-1.5.0}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "sopro"
-version = "1.0.1"
+version = "1.5.0"
 description = "A lightweight text-to-speech model with zero-shot voice cloning."
 readme = "README.md"
 requires-python = ">=3.9"

{sopro-1.0.1 → sopro-1.5.0}/src/sopro/__init__.py RENAMED

@@ -3,4 +3,4 @@ from __future__ import annotations
 from .model import SoproTTS
 __all__ = ["SoproTTS"]
-__version__ = "1.0.1"
+__version__ = "1.5.0"

{sopro-1.0.1 → sopro-1.5.0}/src/sopro/cli.py RENAMED

@@ -32,8 +32,6 @@ def main() -> None:
     ap.add_argument("--temperature", type=float, default=1.05)
     ap.add_argument("--no_anti_loop", action="store_true")
-    ap.add_argument("--no_prefix", action="store_true")
-    ap.add_argument("--prefix_sec", type=float, default=None)
     ap.add_argument("--style_strength", type=float, default=None)
     ap.add_argument("--ref_seconds", type=float, default=None)
@@ -77,6 +75,7 @@ def main() -> None:
             torch.cuda.manual_seed_all(args.seed)
     t0 = time.perf_counter()
     tts = SoproTTS.from_pretrained(
         args.repo_id,
         revision=args.revision,
@@ -84,6 +83,7 @@ def main() -> None:
         token=args.hf_token,
         device=device,
     )
     t1 = time.perf_counter()
     if not args.quiet:
         print(f"[Load] {t1 - t0:.2f}s")
@@ -97,74 +97,59 @@ def main() -> None:
         arr = np.load(args.ref_tokens)
         ref_tokens_tq = torch.from_numpy(arr).long()
-    text_ids = tts.encode_text(args.text)
-    ref = tts.encode_reference(
-        ref_audio_path=args.ref_audio,
-        ref_tokens_tq=ref_tokens_tq,
-        ref_seconds=args.ref_seconds,
-    )
+    with torch.inference_mode():
+        text_ids = tts.encode_text(args.text)
+        ref = tts.prepare_reference(
+            ref_audio_path=args.ref_audio,
+            ref_tokens_tq=ref_tokens_tq,
+            ref_seconds=args.ref_seconds,
+        )
-    prep = tts.model.prepare_conditioning(
-        text_ids,
-        ref,
-        max_frames=args.max_frames,
-        device=tts.device,
-        style_strength=float(
-            args.style_strength
-            if args.style_strength is not None
-            else cfg.style_strength
-        ),
-    )
+        prep = tts.model.prepare_conditioning(
+            text_ids,
+            ref,
+            max_frames=args.max_frames,
+            device=tts.device,
+            style_strength=float(
+                args.style_strength
+                if args.style_strength is not None
+                else cfg.style_strength
+            ),
+        )
-    t_start = time.perf_counter()
+        t_start = time.perf_counter()
     hist_A: list[int] = []
     pbar = tqdm(
-        total=args.max_frames,
-        desc="AR sampling",
-        unit="frame",
-        disable=args.quiet,
+        total=args.max_frames + 1, desc="AR sampling", unit="step", disable=args.quiet
     )
-    for _t, rvq1, p_stop in tts.model.ar_stream(
+    for _t, tok, is_eos in tts.model.ar_stream(
         prep,
         max_frames=args.max_frames,
         top_p=args.top_p,
         temperature=args.temperature,
         anti_loop=(not args.no_anti_loop),
-        use_prefix=(not args.no_prefix),
-        prefix_sec_fixed=args.prefix_sec,
-        use_stop_head=(False if args.no_stop_head else None),
-        stop_patience=args.stop_patience,
-        stop_threshold=args.stop_threshold,
     ):
-        hist_A.append(int(rvq1))
+        if is_eos:
+            pbar.set_postfix(eos="yes")
+            pbar.update(1)
+            break
+        hist_A.append(int(tok))
         pbar.update(1)
-        if p_stop is None:
-            pbar.set_postfix(p_stop="off")
-        else:
-            pbar.set_postfix(p_stop=f"{float(p_stop):.2f}")
+    t_after_sampling = time.perf_counter()
     pbar.n = len(hist_A)
     pbar.close()
-    t_after_sampling = time.perf_counter()
     T = len(hist_A)
     if T == 0:
         save_audio(args.out, torch.zeros(1, 0), sr=TARGET_SR)
-        t_end = time.perf_counter()
-        if not args.quiet:
-            print(
-                f"[Timing] sampling={t_after_sampling - t_start:.2f}s, "
-                f"postproc+decode+save={t_end - t_after_sampling:.2f}s, "
-                f"total={t_end - t_start:.2f}s"
-            )
-            print(f"[Done] Wrote {args.out}")
         return
     tokens_A = torch.tensor(hist_A, device=tts.device, dtype=torch.long).unsqueeze(0)
-    cond_seq = prep["cond_all"][:, :T, :]
+    cond_seq = prep["cond_ar"][:, :T, :]
     tokens_1xTQ = tts.model.nar_refine(cond_seq, tokens_A)
     tokens_tq = tokens_1xTQ.squeeze(0)
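
Note on the sampling-loop change in this hunk: the stop-head probability (`p_stop`) is replaced by an `is_eos` flag, and the loop now breaks on the EOS step without recording its token. The sketch below reproduces only that control flow. `fake_ar_stream` is a hypothetical stand-in for `tts.model.ar_stream`, and the assumption that an EOS step is always yielded last is mine, not something the diff confirms.

```python
from typing import Iterator, List, Tuple


def fake_ar_stream(tokens: List[int], max_frames: int) -> Iterator[Tuple[int, int, bool]]:
    """Hypothetical stand-in for tts.model.ar_stream: yields (step, token, is_eos)."""
    emitted = tokens[:max_frames]
    for t, tok in enumerate(emitted):
        yield t, tok, False
    # Assumption: a final EOS step follows the frames, carrying a dummy token.
    yield len(emitted), -1, True


def collect_frames(stream: Iterator[Tuple[int, int, bool]]) -> List[int]:
    """Mirror the new CLI loop: stop at EOS and never store the EOS token."""
    hist: List[int] = []
    for _t, tok, is_eos in stream:
        if is_eos:
            break
        hist.append(int(tok))
    return hist


frames = collect_frames(fake_ar_stream([5, 9, 2], max_frames=10))
print(frames)
```

This would also be consistent with the progress bar total changing to `args.max_frames + 1`: up to `max_frames` frame steps plus one EOS step.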

{sopro-1.0.1 → sopro-1.5.0}/src/sopro/config.py RENAMED

@@ -13,36 +13,31 @@ class SoproTTSConfig:
     audio_sr: int = TARGET_SR
     d_model: int = 384
-    n_layers_text: int = 4
-    n_layers_ar: int = 6
-    n_layers_nar: int = 6
+    n_layers_text: int = 2
     dropout: float = 0.05
     pos_emb_max: int = 4096
     max_text_len: int = 2048
-    nar_head_dim: int = 256
-    use_stop_head: bool = True
-    stop_threshold: float = 0.8
-    stop_patience: int = 5
+    n_layers_ar: int = 6
+    ar_kernel: int = 13
+    ar_dilation_cycle: Tuple[int, ...] = (1, 2, 4, 1)
+    ar_text_attn_freq: int = 2
     min_gen_frames: int = 12
+    n_layers_nar: int = 6
+    nar_head_dim: int = 256
+    nar_kernel_size: int = 11
+    nar_dilation_cycle: Tuple[int, ...] = (1, 2, 4, 8)
     stage_B: Tuple[int, int] = (2, 4)
     stage_C: Tuple[int, int] = (5, 8)
     stage_D: Tuple[int, int] = (9, 16)
     stage_E: Tuple[int, int] = (17, 32)
-    ar_lookback: int = 4
-    ar_kernel: int = 13
-    ar_dilation_cycle: Tuple[int, ...] = (1, 2, 4, 1)
-    ar_text_attn_freq: int = 2
-    ref_attn_heads: int = 2
-    ref_seconds_max: float = 12.0
-    preprompt_sec_max: float = 4.0
     sv_student_dim: int = 192
     style_strength: float = 1.0
+    ref_enc_layers: int = 2
+    ref_xattn_heads: int = 2
+    ref_xattn_layers: int = 3
+    ref_xattn_gmax: float = 0.35
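
Note on the kernel and dilation fields in this config: for a stack of stride-1 dilated convolutions, each layer widens the receptive field by `(kernel - 1) * dilation` frames. The sketch below applies that standard formula to the AR and NAR values shown in the hunk. It assumes one convolution per layer with dilations taken cyclically from the tuple; the actual block layout in `sopro` is not visible in this diff and may differ.

```python
from typing import Tuple


def receptive_field(n_layers: int, kernel: int, dilation_cycle: Tuple[int, ...]) -> int:
    """Receptive field, in frames, of a stack of stride-1 dilated convolutions.

    Assumes one convolution per layer, with dilations cycling through the tuple.
    """
    rf = 1
    for i in range(n_layers):
        dilation = dilation_cycle[i % len(dilation_cycle)]
        rf += (kernel - 1) * dilation
    return rf


# Values taken from the v1.5.0 SoproTTSConfig defaults in the hunk above.
ar_rf = receptive_field(n_layers=6, kernel=13, dilation_cycle=(1, 2, 4, 1))
nar_rf = receptive_field(n_layers=6, kernel=11, dilation_cycle=(1, 2, 4, 8))

print(f"AR stack:  {ar_rf} frames")
print(f"NAR stack: {nar_rf} frames")
```

Under these assumptions the AR stack sees 133 frames and the NAR stack 181 frames. A causal AR stack would look only backwards over that span.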

{sopro-1.0.1 → sopro-1.5.0}/src/sopro/hub.py RENAMED

@@ -1,6 +1,7 @@
 from __future__ import annotations
 import json
+import os
 import struct
 from typing import Any, Dict, Optional
@@ -44,9 +45,7 @@ def load_cfg_from_safetensors(path: str) -> SoproTTSConfig:
     for k in SoproTTSConfig.__annotations__.keys():
         if k in cfg_dict:
             init[k] = cfg_dict[k]
-    cfg = SoproTTSConfig(**init)
-    return cfg
+    return SoproTTSConfig(**init)
 def load_state_dict_from_safetensors(path: str) -> Dict[str, torch.Tensor]:
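
Note on `load_cfg_from_safetensors`: the context lines show the config being built only from keys that appear in `SoproTTSConfig.__annotations__`. A side effect of that pattern is that fields dropped in v1.5.0 (for example `stop_patience`) would be ignored if they were still present in stored metadata, while missing fields fall back to their defaults. The sketch below shows the pattern with a hypothetical `DemoConfig` dataclass in place of the real config class.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class DemoConfig:
    """Hypothetical stand-in for SoproTTSConfig with two of its fields."""

    d_model: int = 384
    n_layers_text: int = 2


def cfg_from_dict(cfg_dict: Dict[str, Any]) -> DemoConfig:
    # Keep only keys the dataclass declares; unknown keys are dropped silently.
    init = {k: cfg_dict[k] for k in DemoConfig.__annotations__ if k in cfg_dict}
    return DemoConfig(**init)


# "stop_patience" is not a DemoConfig field, so it is ignored rather than raising.
cfg = cfg_from_dict({"d_model": 256, "stop_patience": 5})
print(cfg)
```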
