PyPI - videopython - Versions diffs - 0.28.2__tar.gz → 0.29.0__tar.gz - Mend

Files changed (62)

{videopython-0.28.2 → videopython-0.29.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: videopython
-Version: 0.28.2
+Version: 0.29.0
 Summary: Minimal video generation and processing library.
 Project-URL: Homepage, https://videopython.com
 Project-URL: Repository, https://github.com/bartwojtowicz/videopython/
@@ -27,14 +27,15 @@ Requires-Dist: accelerate>=0.29.2; extra == 'ai'
 Requires-Dist: chatterbox-tts>=0.1.7; extra == 'ai'
 Requires-Dist: demucs>=4.0.0; extra == 'ai'
 Requires-Dist: diffusers>=0.30.0; extra == 'ai'
-Requires-Dist: easyocr>=1.7.0; extra == 'ai'
 Requires-Dist: hf-transfer>=0.1.9; extra == 'ai'
+Requires-Dist: imagehash>=4.3; extra == 'ai'
 Requires-Dist: llama-cpp-python>=0.3.0; extra == 'ai'
 Requires-Dist: numba>=0.61.0; extra == 'ai'
 Requires-Dist: ollama>=0.4.5; extra == 'ai'
 Requires-Dist: openai-whisper>=20240930; extra == 'ai'
 Requires-Dist: pyannote-audio>=4.0.0; extra == 'ai'
 Requires-Dist: pyloudnorm>=0.1.1; extra == 'ai'
+Requires-Dist: qwen-vl-utils>=0.0.10; extra == 'ai'
 Requires-Dist: scikit-learn>=1.3.0; extra == 'ai'
 Requires-Dist: scipy>=1.10.0; extra == 'ai'
 Requires-Dist: sentencepiece>=0.1.99; extra == 'ai'
@@ -56,6 +57,8 @@ Minimal, LLM-friendly Python library for programmatic video editing, processing,
 Full documentation: [videopython.com](https://videopython.com)
+> **Disclaimer:** This project started as a hand-written hobby project, but most of the code is now produced by LLM agents. Humans still drive direction, approve changes, and own design decisions.
 ## Installation
 ### 1. Install FFmpeg
@@ -193,10 +196,10 @@ API docs: [Core](https://videopython.com/api/index/) | [Video](https://videopyth
 | Area | Highlights |
 |---|---|
 | **Generation** | `TextToVideo`, `ImageToVideo`, `TextToImage`, `TextToSpeech`, `TextToMusic` |
-| **Understanding** | `AudioToText` (transcription), `AudioClassifier`, `SceneVLM` (visual scene description), `ActionRecognizer` |
+| **Understanding** | `AudioToText` (transcription), `AudioClassifier`, `SceneVLM` (structured visual scene description), `FaceTracker` (per-shot face tracks) |
 | **Scene detection** | `SemanticSceneDetector` (neural scene boundaries) |
 | **Video analysis** | `VideoAnalyzer` - full-pipeline analysis combining multiple AI capabilities |
-| **Transforms** | `FaceTracker`, `FaceTrackingCrop`, `SplitScreenComposite` |
+| **Transforms** | `FaceTrackingCrop`, `SplitScreenComposite` |
 | **Dubbing** | `VideoDubber` - voice cloning and revoicing with timing sync |
 | **Object swapping** | `ObjectSwapper` - detect, segment, and inpaint objects in video |

{videopython-0.28.2 → videopython-0.29.0}/README.md RENAMED

@@ -8,6 +8,8 @@ Minimal, LLM-friendly Python library for programmatic video editing, processing,
 Full documentation: [videopython.com](https://videopython.com)
+> **Disclaimer:** This project started as a hand-written hobby project, but most of the code is now produced by LLM agents. Humans still drive direction, approve changes, and own design decisions.
 ## Installation
 ### 1. Install FFmpeg
@@ -145,10 +147,10 @@ API docs: [Core](https://videopython.com/api/index/) | [Video](https://videopyth
 | Area | Highlights |
 |---|---|
 | **Generation** | `TextToVideo`, `ImageToVideo`, `TextToImage`, `TextToSpeech`, `TextToMusic` |
-| **Understanding** | `AudioToText` (transcription), `AudioClassifier`, `SceneVLM` (visual scene description), `ActionRecognizer` |
+| **Understanding** | `AudioToText` (transcription), `AudioClassifier`, `SceneVLM` (structured visual scene description), `FaceTracker` (per-shot face tracks) |
 | **Scene detection** | `SemanticSceneDetector` (neural scene boundaries) |
 | **Video analysis** | `VideoAnalyzer` - full-pipeline analysis combining multiple AI capabilities |
-| **Transforms** | `FaceTracker`, `FaceTrackingCrop`, `SplitScreenComposite` |
+| **Transforms** | `FaceTrackingCrop`, `SplitScreenComposite` |
 | **Dubbing** | `VideoDubber` - voice cloning and revoicing with timing sync |
 | **Object swapping** | `ObjectSwapper` - detect, segment, and inpaint objects in video |

{videopython-0.28.2 → videopython-0.29.0}/pyproject.toml RENAMED

@@ -1,6 +1,6 @@
 [project]
 name = "videopython"
-version = "0.28.2"
+version = "0.29.0"
 description = "Minimal video generation and processing library."
 authors = [
     { name = "Bartosz Wójtowicz", email = "bartoszwojtowicz@outlook.com" },
@@ -70,7 +70,6 @@ ai = [
     "scikit-learn>=1.3.0",
     # Detection backends
     "ultralytics>=8.0.0",
-    "easyocr>=1.7.0",
     # Audio classification (AST via transformers - no separate dep needed)
     # Scene detection
     "transnetv2-pytorch>=1.0.5",
@@ -84,6 +83,11 @@ ai = [
     "llama-cpp-python>=0.3.0",
     # Loudness measurement (BS.1770) for dub-vs-source loudness matching (M3)
     "pyloudnorm>=0.1.1",
+    # Vision-language preprocessing for Qwen3.5 (M5) - documented prerequisite
+    # for AutoModelForImageTextToText with image/video chat templates.
+    "qwen-vl-utils>=0.0.10",
+    # Perceptual hashing for SceneVLM frame dedup (M5)
+    "imagehash>=4.3",
 ]
 # Required for pip install videopython[ai] - pip uses optional-dependencies, not dependency-groups
@@ -105,7 +109,6 @@ ai = [
     "scikit-learn>=1.3.0",
     # Detection backends
     "ultralytics>=8.0.0",
-    "easyocr>=1.7.0",
     # Audio classification (AST via transformers - no separate dep needed)
     # Scene detection
     "transnetv2-pytorch>=1.0.5",
@@ -119,6 +122,11 @@ ai = [
     "llama-cpp-python>=0.3.0",
     # Loudness measurement (BS.1770) for dub-vs-source loudness matching (M3)
     "pyloudnorm>=0.1.1",
+    # Vision-language preprocessing for Qwen3.5 (M5) - documented prerequisite
+    # for AutoModelForImageTextToText with image/video chat templates.
+    "qwen-vl-utils>=0.0.10",
+    # Perceptual hashing for SceneVLM frame dedup (M5)
+    "imagehash>=4.3",
 ]
 [project.urls]
@@ -135,7 +143,6 @@ module = [
     "diffusers", "diffusers.*",
     "ollama", "ollama.*",
     "ultralytics", "ultralytics.*",
-    "easyocr", "easyocr.*",
     "transformers", "transformers.*",
     "transnetv2_pytorch", "transnetv2_pytorch.*",
     "chatterbox", "chatterbox.*",
@@ -146,6 +153,8 @@ module = [
     "cv2", "cv2.*",
     "llama_cpp", "llama_cpp.*",
     "pyloudnorm", "pyloudnorm.*",
+    "qwen_vl_utils", "qwen_vl_utils.*",
+    "imagehash", "imagehash.*",
 ]
 ignore_missing_imports = true

{videopython-0.28.2 → videopython-0.29.0}/src/videopython/ai/__init__.py RENAMED

@@ -2,11 +2,11 @@ from videopython.ai import registry as _ai_registry  # noqa: F401
 from .generation import ImageToVideo, TextToImage, TextToMusic, TextToSpeech, TextToVideo
 from .swapping import ObjectSwapper
-from .transforms import FaceTracker, FaceTrackingCrop, SplitScreenComposite
+from .transforms import FaceTrackingCrop, SplitScreenComposite
 from .understanding import (
-    ActionRecognizer,
     AudioClassifier,
     AudioToText,
+    FaceTracker,
     SceneVLM,
     SemanticSceneDetector,
 )
@@ -22,12 +22,10 @@ __all__ = [
     # Understanding
     "AudioToText",
     "AudioClassifier",
+    "FaceTracker",
     "SceneVLM",
-    # Temporal
-    "ActionRecognizer",
     "SemanticSceneDetector",
     # Transforms (AI-powered)
-    "FaceTracker",
     "FaceTrackingCrop",
     "SplitScreenComposite",
     # Swapping

{videopython-0.28.2 → videopython-0.29.0}/src/videopython/ai/dubbing/__init__.py RENAMED

@@ -2,7 +2,13 @@
 from videopython.ai.dubbing.cache import DubCache, dub_cache_clear
 from videopython.ai.dubbing.dubber import VideoDubber
-from videopython.ai.dubbing.models import DubbingResult, RevoiceResult, SeparatedAudio, TranslatedSegment
+from videopython.ai.dubbing.models import (
+    DubbingResult,
+    Expressiveness,
+    RevoiceResult,
+    SeparatedAudio,
+    TranslatedSegment,
+)
 from videopython.ai.dubbing.pipeline import LocalDubbingPipeline
 from videopython.ai.dubbing.quality import GarbageTranscriptError, TranscriptQuality, assess_transcript
 from videopython.ai.dubbing.timing import TimingSynchronizer
@@ -22,4 +28,5 @@ __all__ = [
     "UnsupportedLanguageError",
     "DubCache",
     "dub_cache_clear",
+    "Expressiveness",
 ]

{videopython-0.28.2 → videopython-0.29.0}/src/videopython/ai/dubbing/cache.py RENAMED

@@ -157,14 +157,27 @@ class DubCache:
         translated_text: str,
         voice_sample_bytes: bytes | None,
         language: str,
+        exaggeration: float | None = None,
+        cfg_weight: float | None = None,
+        temperature: float | None = None,
     ) -> str:
-        """Per-segment key over text + voice sample + language."""
+        """Per-segment key over text + voice sample + language + expressiveness.
+        ``exaggeration`` / ``cfg_weight`` / ``temperature`` are the M4
+        Chatterbox knobs. Defaulting to ``None`` keeps pre-M4 callers that
+        omit them hashing the same way (no-knob profile collides with
+        absent kwargs), so cache invalidation is driven by *passing
+        non-None values*, not by the M4 code path being present.
+        """
         h = hashlib.sha256()
         h.update(translated_text.encode("utf-8"))
         h.update(b"\x00")
         h.update(voice_sample_bytes or b"")
         h.update(b"\x00")
         h.update(language.encode("utf-8"))
+        for knob in (exaggeration, cfg_weight, temperature):
+            h.update(b"\x00")
+            h.update(repr(knob).encode("utf-8"))
         return h.hexdigest()[:16]
     # ----- path resolution -------------------------------------------------

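The key derivation in the `cache.py` hunk above can be tried in isolation. The sketch below copies the hashing steps from the diff into two free functions; the names `tts_key` and `legacy_tts_key` as module-level functions are assumptions made for illustration (in the package the logic lives on `DubCache`). Read literally, the loop appends `repr(None)` for every unset knob, so an omitted knob and an explicit `None` hash identically, while the resulting key appears to differ from the one the 0.28.2 code would have produced for the same inputs.

```python
import hashlib


def tts_key(translated_text, voice_sample_bytes, language,
            exaggeration=None, cfg_weight=None, temperature=None):
    """Standalone copy of the 0.29.0 hashing steps shown in the diff."""
    h = hashlib.sha256()
    h.update(translated_text.encode("utf-8"))
    h.update(b"\x00")
    h.update(voice_sample_bytes or b"")
    h.update(b"\x00")
    h.update(language.encode("utf-8"))
    # Each knob is appended as its repr, so None contributes b"None".
    for knob in (exaggeration, cfg_weight, temperature):
        h.update(b"\x00")
        h.update(repr(knob).encode("utf-8"))
    return h.hexdigest()[:16]


def legacy_tts_key(translated_text, voice_sample_bytes, language):
    """The 0.28.2 steps: same three fields, no knob suffix."""
    h = hashlib.sha256()
    h.update(translated_text.encode("utf-8"))
    h.update(b"\x00")
    h.update(voice_sample_bytes or b"")
    h.update(b"\x00")
    h.update(language.encode("utf-8"))
    return h.hexdigest()[:16]


omitted = tts_key("Hello", b"voice", "pl")
explicit_none = tts_key("Hello", b"voice", "pl", exaggeration=None)
calm = tts_key("Hello", b"voice", "pl", exaggeration=0.3, cfg_weight=0.7)
old = legacy_tts_key("Hello", b"voice", "pl")
```

Here `omitted` and `explicit_none` are equal, `calm` differs from both, and `old` differs from `omitted` because the new digest always includes the three knob suffixes.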
{videopython-0.28.2 → videopython-0.29.0}/src/videopython/ai/dubbing/models.py RENAMED

@@ -19,6 +19,42 @@ if TYPE_CHECKING:
 CLEAN_SPEED_TOLERANCE = 0.01
+@dataclass(frozen=True)
+class Expressiveness:
+    """Chatterbox ``generate()`` knobs derived from source-segment prosody.
+    ``None`` on any field means "let Chatterbox use its own default" —
+    avoids pinning the dub against future Chatterbox default changes.
+    Attributes:
+        exaggeration: Emotional intensity. Chatterbox default ``0.5``;
+            ``0.7+`` produces dramatic output.
+        cfg_weight: Classifier-free guidance weight. Chatterbox default
+            ``0.5``; lower values (~``0.3``) slow pacing.
+        temperature: Sampling temperature. Chatterbox default ``0.8``.
+    """
+    exaggeration: float | None = None
+    cfg_weight: float | None = None
+    temperature: float | None = None
+    def as_kwargs(self) -> dict[str, float]:
+        """Knobs as a dict, dropping ``None`` entries.
+        Suitable for ``**``-expansion into Chatterbox or
+        :meth:`DubCache.tts_key`.
+        """
+        return {
+            name: value
+            for name, value in (
+                ("exaggeration", self.exaggeration),
+                ("cfg_weight", self.cfg_weight),
+                ("temperature", self.temperature),
+            )
+            if value is not None
+        }
 @dataclass
 class TranslatedSegment:
     """A segment of translated text with timing information.

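To see why `as_kwargs()` drops `None` entries, the sketch below reproduces the `Expressiveness` dataclass from the hunk above and expands it into a stand-in function. `fake_generate` is hypothetical and is not the Chatterbox API; its defaults only mirror the values quoted in the docstring (`0.5`, `0.5`, `0.8`).

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Expressiveness:
    """Same three optional knobs as in the diff."""
    exaggeration: Optional[float] = None
    cfg_weight: Optional[float] = None
    temperature: Optional[float] = None

    def as_kwargs(self) -> Dict[str, float]:
        """Return only the knobs that were explicitly set."""
        pairs = (
            ("exaggeration", self.exaggeration),
            ("cfg_weight", self.cfg_weight),
            ("temperature", self.temperature),
        )
        return {name: value for name, value in pairs if value is not None}


def fake_generate(text, exaggeration=0.5, cfg_weight=0.5, temperature=0.8):
    """Hypothetical stand-in for a TTS call; reports the knobs it received."""
    return {
        "exaggeration": exaggeration,
        "cfg_weight": cfg_weight,
        "temperature": temperature,
    }


no_knobs = fake_generate("hi", **Expressiveness().as_kwargs())
dramatic = fake_generate("hi", **Expressiveness(exaggeration=0.85).as_kwargs())
```

With the empty profile the callee's own defaults apply untouched; with one knob set, only that knob is overridden. This is the forward-compatibility property the docstring describes: an unset knob is never pinned to a value copied from today's defaults.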
{videopython-0.28.2 → videopython-0.29.0}/src/videopython/ai/dubbing/pipeline.py RENAMED

@@ -11,7 +11,7 @@ import numpy as np
 from videopython.ai._device import select_device
 from videopython.ai.dubbing.cache import DubCache
-from videopython.ai.dubbing.models import DubbingResult, RevoiceResult, SeparatedAudio, TimingSummary
+from videopython.ai.dubbing.models import DubbingResult, Expressiveness, RevoiceResult, SeparatedAudio, TimingSummary
 from videopython.ai.dubbing.quality import GarbageTranscriptError, assess_transcript
 from videopython.ai.dubbing.timing import TimingSynchronizer
 from videopython.ai.generation.qwen3 import Qwen3Translator
@@ -118,6 +118,40 @@ PEAK_CLIP_THRESHOLD = 0.99
 MIN_VOCAL_BG_RMS_RATIO = 1.5
 VOICE_SAMPLE_TARGET_DURATION = 6.0
+# Prosody-conditioning thresholds. Source-segment RMS / whole-vocals RMS
+# below CALM lands in the calm bucket; above DRAMATIC in the dramatic
+# bucket; in between gets Chatterbox's defaults. Knob values picked
+# by-ear on cam1_1min.mp4 — see RELEASE_NOTES 0.29.0.
+CALM_RATIO_THRESHOLD = 0.7
+DRAMATIC_RATIO_THRESHOLD = 1.3
+_CALM = Expressiveness(exaggeration=0.3, cfg_weight=0.7)
+_DRAMATIC = Expressiveness(exaggeration=0.85, cfg_weight=0.35)
+def _rms(data: np.ndarray) -> float:
+    """RMS over samples; ``0.0`` for empty input. float64 reduction so a
+    long slice can't overflow the squared accumulator."""
+    if data.size == 0:
+        return 0.0
+    return float(np.sqrt(np.mean(np.square(data, dtype=np.float64))))
+def _expressiveness_for(source_slice: Audio, baseline_rms: float) -> Expressiveness:
+    """Map a source vocals slice to a Chatterbox expressiveness profile
+    by RMS ratio. Falls back to the no-knobs default for empty or silent
+    inputs."""
+    if baseline_rms <= 0.0:
+        return Expressiveness()
+    segment_rms = _rms(source_slice.data)
+    if segment_rms <= 0.0:
+        return Expressiveness()
+    ratio = segment_rms / baseline_rms
+    if ratio < CALM_RATIO_THRESHOLD:
+        return _CALM
+    if ratio > DRAMATIC_RATIO_THRESHOLD:
+        return _DRAMATIC
+    return Expressiveness()
 class LocalDubbingPipeline:
     """Local pipeline for video dubbing.
@@ -236,6 +270,7 @@ class LocalDubbingPipeline:
         voice_samples: dict[str, Audio],
         speaker_wav_paths: dict[str, Path],
         src_hash_for_tts: str,
+        expressiveness: Expressiveness = Expressiveness(),
     ) -> Audio | None:
         """Produce the TTS audio for a single segment, with cache-around-the-call.
@@ -244,6 +279,11 @@ class LocalDubbingPipeline:
         TTS model is lazy-initialized and the per-speaker temp WAV is
         materialized before generation; on cache hit none of that runs,
         so a fully-cached run never loads Chatterbox.
+        ``expressiveness`` carries the M4 Chatterbox knobs derived from
+        the source segment's prosody. Default is the no-knobs profile —
+        lets Chatterbox use its own defaults — so callers that don't yet
+        derive prosody (e.g. ``revoice``) keep pre-M4 behaviour.
         """
         from videopython.base.audio import Audio as _Audio
@@ -253,6 +293,7 @@ class LocalDubbingPipeline:
                 translated_text=segment.translated_text,
                 voice_sample_bytes=speaker_bytes,
                 language=target_lang,
+                **expressiveness.as_kwargs(),
             )
             cached_path = self._cache.get_tts_path(src_hash_for_tts, tts_cache_key)
             if cached_path is not None:
@@ -270,10 +311,11 @@ class LocalDubbingPipeline:
         wav_path = speaker_wav_paths.get(speaker) if voice_clone else None
         try:
-            if wav_path is not None:
-                dubbed_audio = self._tts.generate_audio(segment.translated_text, voice_sample_path=wav_path)
-            else:
-                dubbed_audio = self._tts.generate_audio(segment.translated_text)
+            dubbed_audio = self._tts.generate_audio(
+                segment.translated_text,
+                voice_sample_path=wav_path,
+                **expressiveness.as_kwargs(),
+            )
         except Exception as exc:
             # Chatterbox occasionally crashes on short translated text
             # (alignment_stream_analyzer indexing on tensors with <=5
@@ -748,16 +790,32 @@ class LocalDubbingPipeline:
             report_progress("Extracting voice samples", 0.25)
             voice_samples = self._extract_voice_samples(vocal_audio, background_audio, transcription)
-        # vocals is no longer needed; voice_samples are independent copies.
-        # In low_memory mode this is the only ref keeping the buffer alive
-        # (separated_audio was dropped above), so dropping the local frees it.
-        del vocal_audio
         report_progress("Translating text", 0.35)
         translated_segments, translation_failures = self._translate_with_cache(
             transcription, source_audio, detected_lang, target_lang, report_progress
         )
+        # Per-segment expressiveness derived from source vocals RMS.
+        # Computed before vocal_audio is released so the TTS loop doesn't
+        # hold the buffer. Segment ends are clamped to the vocals duration
+        # — transcription timestamps can drift past the buffer tail
+        # (especially on synthetic test audio) and Audio.slice rejects
+        # out-of-range ends past a 0.1s tolerance.
+        baseline_rms = _rms(vocal_audio.data)
+        vocal_duration = vocal_audio.metadata.duration_seconds
+        expressiveness_per_segment = [
+            _expressiveness_for(
+                vocal_audio.slice(min(s.start, vocal_duration), min(s.end, vocal_duration)),
+                baseline_rms,
+            )
+            for s in translated_segments
+        ]
+        # vocals is no longer needed; voice_samples are independent copies.
+        # In low_memory mode this is the only ref keeping the buffer alive
+        # (separated_audio was dropped above), so dropping the local frees it.
+        del vocal_audio
         report_progress("Generating dubbed speech", 0.50)
         # Per-speaker voice-sample bytes for TTS cache key. Empty when
@@ -800,6 +858,7 @@ class LocalDubbingPipeline:
                     voice_samples=voice_samples,
                     speaker_wav_paths=speaker_wav_paths,
                     src_hash_for_tts=src_hash_for_tts,
+                    expressiveness=expressiveness_per_segment[i],
                 )
                 if dubbed_audio is None:
                     continue

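The prosody bucketing added in `pipeline.py` can be approximated without the package. The sketch below is a pure-Python rendering under stated assumptions: plain lists stand in for `Audio.data`, bucket names are returned as strings instead of `Expressiveness` profiles, and `rms`, `bucket_for`, and `clamp_span` are illustrative names (the package itself uses NumPy and `Audio.slice`). The two thresholds are the ones in the diff.

```python
import math
from typing import Sequence, Tuple

CALM_RATIO_THRESHOLD = 0.7
DRAMATIC_RATIO_THRESHOLD = 1.3


def rms(samples: Sequence[float]) -> float:
    """Root mean square of the samples; 0.0 for empty input."""
    if not samples:
        return 0.0
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def bucket_for(segment: Sequence[float], baseline_rms: float) -> str:
    """Classify a segment by its RMS relative to the whole-track RMS."""
    if baseline_rms <= 0.0:
        return "default"
    segment_rms = rms(segment)
    if segment_rms <= 0.0:
        return "default"
    ratio = segment_rms / baseline_rms
    if ratio < CALM_RATIO_THRESHOLD:
        return "calm"
    if ratio > DRAMATIC_RATIO_THRESHOLD:
        return "dramatic"
    return "default"


def clamp_span(start: float, end: float, duration: float) -> Tuple[float, float]:
    """Clamp both segment bounds to the track duration before slicing."""
    return min(start, duration), min(end, duration)


# Synthetic "vocals": a quiet, a medium, and a loud stretch of equal length.
quiet, medium, loud = [0.1] * 100, [0.5] * 100, [1.0] * 100
baseline = rms(quiet + medium + loud)
buckets = [bucket_for(part, baseline) for part in (quiet, medium, loud)]
```

On this synthetic input the three stretches land in the calm, default, and dramatic buckets respectively. A silent or empty segment falls back to the default bucket, matching the early returns in `_expressiveness_for`.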
{videopython-0.28.2 → videopython-0.29.0}/src/videopython/ai/generation/audio.py RENAMED

@@ -51,6 +51,9 @@ class TextToSpeech:
         text: str,
         voice_sample: Audio | None = None,
         voice_sample_path: str | Path | None = None,
+        exaggeration: float | None = None,
+        cfg_weight: float | None = None,
+        temperature: float | None = None,
     ) -> Audio:
         """Generate speech audio from text.
@@ -64,6 +67,15 @@ class TextToSpeech:
                 precedence over ``voice_sample`` and ``self.voice``. Used by
                 the dubbing pipeline to encode each speaker's sample once and
                 reuse it across all of that speaker's segments.
+            exaggeration: Chatterbox emotional-intensity knob (default
+                ``0.5``). ``None`` (default) means do not pass the kwarg —
+                Chatterbox uses its own default and we stay forward-compatible
+                with changes to it. ``0.7+`` produces dramatic output.
+            cfg_weight: Chatterbox classifier-free-guidance weight (default
+                ``0.5``). ``None`` means do not pass. Lower values (~``0.3``)
+                slow pacing.
+            temperature: Chatterbox sampling temperature (default ``0.8``).
+                ``None`` means do not pass.
         """
         import tempfile
         from pathlib import Path
@@ -86,11 +98,23 @@ class TextToSpeech:
                     speaker_wav_path = Path(f.name)
                     cleanup_path = True
+        # Only forward knobs the caller explicitly set. Passing nothing
+        # for a knob lets Chatterbox use its own default — important so a
+        # future Chatterbox default change doesn't get pinned by us.
+        knobs: dict[str, float] = {}
+        if exaggeration is not None:
+            knobs["exaggeration"] = exaggeration
+        if cfg_weight is not None:
+            knobs["cfg_weight"] = cfg_weight
+        if temperature is not None:
+            knobs["temperature"] = temperature
         try:
             wav = self._model.generate(
                 text=text,
                 language_id=self.language,
                 audio_prompt_path=str(speaker_wav_path) if speaker_wav_path else None,
+                **knobs,
             )
             audio_data = wav.cpu().float().numpy().squeeze()
