PyPI - videopython - Versions diffs - 0.31.3__tar.gz → 0.33.0__tar.gz - Mend

videopython 0.31.3__tar.gz → 0.33.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (62)

{videopython-0.31.3 → videopython-0.33.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: videopython
-Version: 0.31.3
+Version: 0.33.0
 Summary: Minimal video generation and processing library.
 Project-URL: Homepage, https://videopython.com
 Project-URL: Repository, https://github.com/bartwojtowicz/videopython/
@@ -91,7 +91,8 @@ Every editing primitive is an `Operation` subclass — a Pydantic model
 whose fields ARE the JSON wire format. Apply one to a `Video`:
 ```python
-from videopython.base import Video, CutSeconds, Resize, Fade
+from videopython.base import Video
+from videopython.editing import CutSeconds, Resize, Fade
 video = Video.from_path("raw.mp4")
 video = CutSeconds(start=10, end=25).apply(video)
@@ -141,7 +142,7 @@ instead if you want the result back in memory as a `Video`.
 ```python
 from videopython.ai import TextToImage, ImageToVideo, TextToSpeech
-from videopython.base import Resize
+from videopython.editing import Resize
 image = TextToImage().generate_image("A cinematic mountain sunrise")
 video = ImageToVideo().generate_video(image=image)
@@ -182,7 +183,7 @@ Every registered op exposes its own Pydantic schema, so an agent can
 introspect what's available without hardcoded lists:
 ```python
-from videopython.base import Operation, OpCategory
+from videopython.editing import Operation, OpCategory
 for op_id, cls in Operation.registry().items():
     print(f"{op_id}: {(cls.__doc__ or '').splitlines()[0]}")
@@ -205,18 +206,30 @@ Docs: [Editing Plans](https://videopython.com/api/editing/) | [Operations](https
 ## Features
-### `videopython.base` - core editing (no AI dependencies)
+### `videopython.base` - data containers + I/O (no AI dependencies)
 | Area | Highlights |
 |---|---|
 | **Video I/O** | `Video`, `VideoMetadata`, `FrameIterator` - load, save, inspect |
+| **Text rendering** | `ImageText` - generic PIL text-on-image primitive |
+| **Transcription** | `Transcription`, `TranscriptionSegment`, `TranscriptionWord` - data classes returned by transcription backends |
+| **Result types** | `BoundingBox`, `DetectedFace`, `FaceTrack`, `SceneBoundary`, `AudioEvent`, `MotionInfo`, ... - shared by editing and AI |
+### `videopython.audio` - audio data container
+| Area | Highlights |
+|---|---|
+| **Audio** | `Audio`, `AudioMetadata` - load/save, overlay, concat, normalize, time-stretch, silence detection, segment classification |
+### `videopython.editing` - editing primitives + plan runner
+| Area | Highlights |
+|---|---|
 | **Operation foundation** | `Operation`, `Effect`, `TimeRange`, `OpCategory` - Pydantic base + auto-registry + discriminated-union schema |
 | **Editing plans** | `VideoEdit`, `SegmentConfig` - JSON/LLM-friendly multi-segment plans with JSON Schema generation, dry-run validation, and streaming `run_to_file` |
 | **Transforms** | Cut (time/frame), resize, crop, FPS resampling, speed change, reverse, freeze frame, silence removal |
 | **Effects** | Blur, zoom, color grading, vignette, Ken Burns, image overlay, fade, text overlay, volume adjust |
-| **Audio** | Load/save, overlay, concat, normalize, time-stretch, silence detection, segment classification |
-| **Text** | Transcription data classes, `TranscriptionOverlay` for subtitle rendering |
-| **Scene detection** | Histogram-based scene boundaries (`detect`, `detect_streaming`, `detect_parallel`) |
+| **Subtitles** | `TranscriptionOverlay` - animated word-by-word subtitle rendering |
 API docs: [Core](https://videopython.com/api/index/) | [Video](https://videopython.com/api/core/video/) | [Audio](https://videopython.com/api/core/audio/) | [Editing Plans](https://videopython.com/api/editing/) | [Operations](https://videopython.com/api/operations/) | [Transforms](https://videopython.com/api/transforms/) | [Effects](https://videopython.com/api/effects/) | [Text](https://videopython.com/api/text/)

{videopython-0.31.3 → videopython-0.33.0}/README.md RENAMED

@@ -42,7 +42,8 @@ Every editing primitive is an `Operation` subclass — a Pydantic model
 whose fields ARE the JSON wire format. Apply one to a `Video`:
 ```python
-from videopython.base import Video, CutSeconds, Resize, Fade
+from videopython.base import Video
+from videopython.editing import CutSeconds, Resize, Fade
 video = Video.from_path("raw.mp4")
 video = CutSeconds(start=10, end=25).apply(video)
@@ -92,7 +93,7 @@ instead if you want the result back in memory as a `Video`.
 ```python
 from videopython.ai import TextToImage, ImageToVideo, TextToSpeech
-from videopython.base import Resize
+from videopython.editing import Resize
 image = TextToImage().generate_image("A cinematic mountain sunrise")
 video = ImageToVideo().generate_video(image=image)
@@ -133,7 +134,7 @@ Every registered op exposes its own Pydantic schema, so an agent can
 introspect what's available without hardcoded lists:
 ```python
-from videopython.base import Operation, OpCategory
+from videopython.editing import Operation, OpCategory
 for op_id, cls in Operation.registry().items():
     print(f"{op_id}: {(cls.__doc__ or '').splitlines()[0]}")
@@ -156,18 +157,30 @@ Docs: [Editing Plans](https://videopython.com/api/editing/) | [Operations](https
 ## Features
-### `videopython.base` - core editing (no AI dependencies)
+### `videopython.base` - data containers + I/O (no AI dependencies)
 | Area | Highlights |
 |---|---|
 | **Video I/O** | `Video`, `VideoMetadata`, `FrameIterator` - load, save, inspect |
+| **Text rendering** | `ImageText` - generic PIL text-on-image primitive |
+| **Transcription** | `Transcription`, `TranscriptionSegment`, `TranscriptionWord` - data classes returned by transcription backends |
+| **Result types** | `BoundingBox`, `DetectedFace`, `FaceTrack`, `SceneBoundary`, `AudioEvent`, `MotionInfo`, ... - shared by editing and AI |
+### `videopython.audio` - audio data container
+| Area | Highlights |
+|---|---|
+| **Audio** | `Audio`, `AudioMetadata` - load/save, overlay, concat, normalize, time-stretch, silence detection, segment classification |
+### `videopython.editing` - editing primitives + plan runner
+| Area | Highlights |
+|---|---|
 | **Operation foundation** | `Operation`, `Effect`, `TimeRange`, `OpCategory` - Pydantic base + auto-registry + discriminated-union schema |
 | **Editing plans** | `VideoEdit`, `SegmentConfig` - JSON/LLM-friendly multi-segment plans with JSON Schema generation, dry-run validation, and streaming `run_to_file` |
 | **Transforms** | Cut (time/frame), resize, crop, FPS resampling, speed change, reverse, freeze frame, silence removal |
 | **Effects** | Blur, zoom, color grading, vignette, Ken Burns, image overlay, fade, text overlay, volume adjust |
-| **Audio** | Load/save, overlay, concat, normalize, time-stretch, silence detection, segment classification |
-| **Text** | Transcription data classes, `TranscriptionOverlay` for subtitle rendering |
-| **Scene detection** | Histogram-based scene boundaries (`detect`, `detect_streaming`, `detect_parallel`) |
+| **Subtitles** | `TranscriptionOverlay` - animated word-by-word subtitle rendering |
 API docs: [Core](https://videopython.com/api/index/) | [Video](https://videopython.com/api/core/video/) | [Audio](https://videopython.com/api/core/audio/) | [Editing Plans](https://videopython.com/api/editing/) | [Operations](https://videopython.com/api/operations/) | [Transforms](https://videopython.com/api/transforms/) | [Effects](https://videopython.com/api/effects/) | [Text](https://videopython.com/api/text/)
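
The README hunks above move `CutSeconds`, `Resize`, `Fade`, `Operation`, and `OpCategory` from `videopython.base` to `videopython.editing`. For downstream code that has to run against both layouts, a small fallback helper is one way to bridge the move. This is a minimal sketch, not part of videopython: `import_first` is a hypothetical name, and the module list in the commented usage line only mirrors the old and new paths shown in the diff.

```python
from importlib import import_module
from typing import Any, Sequence


def import_first(name: str, modules: Sequence[str]) -> Any:
    """Return attribute `name` from the first module in `modules` that provides it."""
    for module_name in modules:
        try:
            module = import_module(module_name)
        except ImportError:
            continue  # module missing in the installed version; try the next one
        if hasattr(module, name):
            return getattr(module, name)
    raise ImportError(f"{name!r} not found in any of: {', '.join(modules)}")


# Hypothetical usage: prefer the 0.33.0 location, fall back to the 0.31.x one.
# CutSeconds = import_first("CutSeconds", ["videopython.editing", "videopython.base"])
```

Listing the new module first keeps the fallback path cold once the upgrade is complete, so the helper can later be replaced by a plain import.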

{videopython-0.31.3 → videopython-0.33.0}/pyproject.toml RENAMED

@@ -1,6 +1,6 @@
 [project]
 name = "videopython"
-version = "0.31.3"
+version = "0.33.0"
 description = "Minimal video generation and processing library."
 authors = [
     { name = "Bartosz Wójtowicz", email = "bartoszwojtowicz@outlook.com" },

{videopython-0.31.3 → videopython-0.33.0}/src/videopython/ai/dubbing/__init__.py RENAMED

@@ -1,5 +1,6 @@
 """Local video dubbing functionality."""
+from videopython.ai.dubbing.config import DubbingConfig
 from videopython.ai.dubbing.dubber import VideoDubber
 from videopython.ai.dubbing.models import (
     DubbingResult,
@@ -15,6 +16,7 @@ from videopython.ai.generation.translation import UnsupportedLanguageError
 __all__ = [
     "VideoDubber",
+    "DubbingConfig",
     "DubbingResult",
     "RevoiceResult",
     "TranslatedSegment",

videopython-0.33.0/src/videopython/ai/dubbing/config.py ADDED

@@ -0,0 +1,80 @@
+"""Configuration model for the dubbing pipeline."""
+from __future__ import annotations
+from typing import Literal
+from pydantic import BaseModel, ConfigDict
+TranslatorChoice = Literal["auto", "marian", "qwen3"]
+WhisperModel = Literal["tiny", "base", "small", "medium", "large", "turbo"]
+class DubbingConfig(BaseModel):
+    """Knobs shared by :class:`VideoDubber` and :class:`LocalDubbingPipeline`.
+    Accepted as either ``config=DubbingConfig(...)`` or flat kwargs on the
+    two constructors; the flat path builds a ``DubbingConfig`` internally.
+    Attributes:
+        device: Execution device (``cpu``, ``cuda``, ``mps``, or ``None`` for auto).
+        low_memory: When True, each pipeline stage (Whisper, Demucs, MarianMT,
+            Chatterbox TTS) is unloaded from memory after it runs, so only one
+            model is resident at a time. Trades per-run latency (~10-30s of
+            extra model loads) for a much lower memory ceiling. Recommended
+            for GPUs with <=12GB VRAM or hosts with <32GB RAM. Default False.
+        whisper_model: Whisper model size used for transcription. Larger
+            models give better accuracy at the cost of VRAM and latency. One
+            of ``tiny``, ``base``, ``small``, ``medium``, ``large``, ``turbo``.
+            Default ``turbo``.
+        condition_on_previous_text: Forwarded to ``AudioToText``. Defaults to
+            ``False`` (Whisper's own default is ``True``). With conditioning
+            on, a single hallucinated filler phrase cascades through the rest
+            of the file. See ``AudioToText`` for the full rationale.
+        no_speech_threshold: Forwarded to ``AudioToText``. Whisper's
+            no-speech gate; raise to drop more low-confidence windows.
+        logprob_threshold: Forwarded to ``AudioToText``. Whisper's average
+            log-probability gate.
+        vocabulary: Forwarded to ``AudioToText``. Optional list of brand
+            names, product names, or proper nouns to bias Whisper's
+            first-window decoder via ``initial_prompt``. Recovers
+            near-mishears (e.g. Klarna -> "carna") on brand-monitoring
+            inputs without new model deps.
+        strict_quality: When True, the pipeline raises
+            :class:`GarbageTranscriptError` before Demucs/translation/TTS
+            run if the transcript-quality heuristic returns ``"reject"``.
+            When False (default), low-quality transcripts are logged at
+            WARNING but processing continues. Either way the
+            :class:`TranscriptQuality` is exposed on ``DubbingResult`` for
+            inspection.
+        translator: Translation backend to use. ``"auto"`` (default) picks
+            Qwen3 on GPU, MarianMT on CPU; ``"marian"`` and ``"qwen3"`` force
+            the named backend regardless of device. See
+            :class:`videopython.ai.generation.qwen3.Qwen3Translator` for
+            tradeoffs (Qwen3 is slower on CPU but produces context-aware,
+            length-budgeted output).
+    """
+    model_config = ConfigDict(frozen=True)
+    device: str | None = None
+    low_memory: bool = False
+    whisper_model: WhisperModel = "turbo"
+    condition_on_previous_text: bool = False
+    no_speech_threshold: float = 0.6
+    logprob_threshold: float | None = -1.0
+    vocabulary: list[str] | None = None
+    strict_quality: bool = False
+    translator: TranslatorChoice = "auto"
+    def init_log_fields(self) -> dict[str, object]:
+        """Subset of fields surfaced in the init-log line.
+        Hand-picked so log noise stays bounded as the config grows.
+        """
+        return {
+            "device": self.device.lower() if isinstance(self.device, str) else "auto",
+            "low_memory": self.low_memory,
+            "whisper_model": self.whisper_model,
+            "translator": self.translator,
+        }
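
To illustrate what a frozen config with a hand-picked log subset looks like in use, here is a stdlib-only sketch. It is an assumption-laden stand-in: `ConfigSketch` is a hypothetical frozen dataclass, not the Pydantic `DubbingConfig` added above, it mirrors only four of the knobs, and the real model would raise its own error type on assignment.

```python
from __future__ import annotations

from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class ConfigSketch:
    """Stdlib stand-in for a frozen config; only four knobs are mirrored."""

    device: str | None = None
    low_memory: bool = False
    whisper_model: str = "turbo"
    translator: str = "auto"

    def init_log_fields(self) -> dict[str, object]:
        # Same normalization as in the diff: lowercase a given device,
        # report "auto" when no device was requested.
        return {
            "device": self.device.lower() if isinstance(self.device, str) else "auto",
            "low_memory": self.low_memory,
            "whisper_model": self.whisper_model,
            "translator": self.translator,
        }


cfg = ConfigSketch(device="CUDA", low_memory=True)
log_line = " ".join(f"{k}={v}" for k, v in cfg.init_log_fields().items())

try:
    cfg.device = "cpu"  # frozen: assignment after construction is rejected
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Because the object cannot change after construction, it can be handed from one component to another without either side needing a defensive copy.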

{videopython-0.31.3 → videopython-0.33.0}/src/videopython/ai/dubbing/dubber.py RENAMED

@@ -6,8 +6,8 @@ import logging
 from pathlib import Path
 from typing import TYPE_CHECKING, Any, Callable
+from videopython.ai.dubbing.config import DubbingConfig
 from videopython.ai.dubbing.models import DubbingResult, RevoiceResult
-from videopython.ai.dubbing.pipeline import TranslatorChoice, WhisperModel
 if TYPE_CHECKING:
     from videopython.base.video import Video
@@ -18,90 +18,26 @@ logger = logging.getLogger(__name__)
 class VideoDubber:
     """Dubs videos into different languages using the local pipeline.
-    Args:
-        device: Execution device (``cpu``, ``cuda``, ``mps``, or ``None`` for auto).
-        low_memory: When True, each pipeline stage (Whisper, Demucs, MarianMT,
-            Chatterbox TTS) is unloaded from memory after it runs, so only one
-            model is resident at a time. Trades per-run latency (~10-30s of
-            extra model loads) for a much lower memory ceiling. Recommended for
-            GPUs with <=12GB VRAM or hosts with <32GB RAM. Default False.
-        whisper_model: Whisper model size used for transcription. Larger models
-            give better accuracy at the cost of VRAM and latency. One of
-            ``tiny``, ``base``, ``small``, ``medium``, ``large``, ``turbo``.
-            Default ``turbo``.
-        condition_on_previous_text: Forwarded to ``AudioToText``. Defaults to
-            ``False`` (Whisper's own default is ``True``). With conditioning on,
-            a single hallucinated filler phrase cascades through the rest of
-            the file. See ``AudioToText`` for the full rationale.
-        no_speech_threshold: Forwarded to ``AudioToText``. Whisper's no-speech
-            gate; raise to drop more low-confidence windows.
-        logprob_threshold: Forwarded to ``AudioToText``. Whisper's average
-            log-probability gate.
-        vocabulary: Forwarded to ``AudioToText``. Optional list of brand
-            names, product names, or proper nouns to bias Whisper's first-
-            window decoder via ``initial_prompt``. Recovers near-mishears
-            (e.g. Klarna → "carna") on brand-monitoring inputs without new
-            model deps.
-        strict_quality: When True, the pipeline raises
-            :class:`GarbageTranscriptError` before Demucs/translation/TTS run
-            if the transcript-quality heuristic returns ``"reject"``. When
-            False (default), low-quality transcripts are logged at WARNING
-            but processing continues. Either way the
-            :class:`TranscriptQuality` is exposed on ``DubbingResult`` for
-            inspection.
-        translator: Translation backend to use. ``"auto"`` (default)
-            picks Qwen3 on GPU, MarianMT on CPU; ``"marian"`` and
-            ``"qwen3"`` force the named backend regardless of device.
-            See :class:`videopython.ai.generation.qwen3.Qwen3Translator`
-            for tradeoffs (Qwen3 is slower on CPU but produces
-            context-aware, length-budgeted output).
+    Accepts either a :class:`DubbingConfig` or the same knobs as flat kwargs
+    (``device``, ``low_memory``, ``whisper_model``, ``translator``, etc.) --
+    the flat path builds a ``DubbingConfig`` internally. See
+    :class:`DubbingConfig` for the full knob list and defaults.
     """
-    def __init__(
-        self,
-        device: str | None = None,
-        low_memory: bool = False,
-        whisper_model: WhisperModel = "turbo",
-        condition_on_previous_text: bool = False,
-        no_speech_threshold: float = 0.6,
-        logprob_threshold: float | None = -1.0,
-        vocabulary: list[str] | None = None,
-        strict_quality: bool = False,
-        translator: TranslatorChoice = "auto",
-    ):
-        self.device = device
-        self.low_memory = low_memory
-        self.whisper_model = whisper_model
-        self.condition_on_previous_text = condition_on_previous_text
-        self.no_speech_threshold = no_speech_threshold
-        self.logprob_threshold = logprob_threshold
-        self.vocabulary = vocabulary
-        self.strict_quality = strict_quality
-        self.translator = translator
+    def __init__(self, config: DubbingConfig | None = None, **kwargs: Any):
+        if config is not None and kwargs:
+            raise TypeError("Pass either `config=` or knob kwargs, not both")
+        self.config = config or DubbingConfig(**kwargs)
         self._local_pipeline: Any = None
-        requested = device.lower() if isinstance(device, str) else "auto"
         logger.info(
-            "VideoDubber initialized with device=%s low_memory=%s whisper_model=%s translator=%s",
-            requested,
-            low_memory,
-            whisper_model,
-            translator,
+            "VideoDubber initialized with %s",
+            " ".join(f"{k}={v}" for k, v in self.config.init_log_fields().items()),
         )
     def _init_local_pipeline(self) -> None:
         from videopython.ai.dubbing.pipeline import LocalDubbingPipeline
-        self._local_pipeline = LocalDubbingPipeline(
-            device=self.device,
-            low_memory=self.low_memory,
-            whisper_model=self.whisper_model,
-            condition_on_previous_text=self.condition_on_previous_text,
-            no_speech_threshold=self.no_speech_threshold,
-            logprob_threshold=self.logprob_threshold,
-            vocabulary=self.vocabulary,
-            strict_quality=self.strict_quality,
-            translator=self.translator,
-        )
+        self._local_pipeline = LocalDubbingPipeline(config=self.config)
     def dub(
         self,
@@ -218,7 +154,7 @@ class VideoDubber:
             source transcription. The output video is written to ``output_path``.
         """
         from videopython.ai.dubbing.remux import replace_audio_stream_from_audio
-        from videopython.base.audio import Audio
+        from videopython.audio import Audio
         input_path = Path(input_path)
         output_path = Path(output_path)
@@ -292,7 +228,7 @@ class VideoDubber:
         video_duration = video.total_seconds
         if video_duration > speech_duration:
-            from videopython.base.transforms import CutSeconds
+            from videopython.editing.transforms import CutSeconds
             output_video = CutSeconds(start=0, end=speech_duration).apply(video)
         else:

videopython-0.33.0/src/videopython/ai/dubbing/expressiveness.py ADDED

@@ -0,0 +1,47 @@
+"""Source-prosody-driven expressiveness knobs for Chatterbox TTS."""
+from __future__ import annotations
+from typing import TYPE_CHECKING
+import numpy as np
+from videopython.ai.dubbing.models import Expressiveness
+if TYPE_CHECKING:
+    from videopython.audio import Audio
+# Prosody-conditioning thresholds. Source-segment RMS / whole-vocals RMS
+# below CALM lands in the calm bucket; above DRAMATIC in the dramatic
+# bucket; in between gets Chatterbox's defaults. Knob values picked
+# by-ear on cam1_1min.mp4 -- see RELEASE_NOTES 0.29.0.
+CALM_RATIO_THRESHOLD = 0.7
+DRAMATIC_RATIO_THRESHOLD = 1.3
+_CALM = Expressiveness(exaggeration=0.3, cfg_weight=0.7)
+_DRAMATIC = Expressiveness(exaggeration=0.85, cfg_weight=0.35)
+def rms(data: np.ndarray) -> float:
+    """RMS over samples; ``0.0`` for empty input. float64 reduction so a
+    long slice can't overflow the squared accumulator."""
+    if data.size == 0:
+        return 0.0
+    return float(np.sqrt(np.mean(np.square(data, dtype=np.float64))))
+def expressiveness_for(source_slice: Audio, baseline_rms: float) -> Expressiveness:
+    """Map a source vocals slice to a Chatterbox expressiveness profile
+    by RMS ratio. Falls back to the no-knobs default for empty or silent
+    inputs."""
+    if baseline_rms <= 0.0:
+        return Expressiveness()
+    segment_rms = rms(source_slice.data)
+    if segment_rms <= 0.0:
+        return Expressiveness()
+    ratio = segment_rms / baseline_rms
+    if ratio < CALM_RATIO_THRESHOLD:
+        return _CALM
+    if ratio > DRAMATIC_RATIO_THRESHOLD:
+        return _DRAMATIC
+    return Expressiveness()
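
The bucketing rule in `expressiveness_for` reduces to a ratio test against the two thresholds. The sketch below reproduces that rule with plain Python lists so it runs without NumPy or videopython. `bucket_for` is a hypothetical name, and it returns string labels where the module returns `Expressiveness` objects.

```python
import math
from typing import Sequence

# Thresholds copied from the diff above.
CALM_RATIO_THRESHOLD = 0.7
DRAMATIC_RATIO_THRESHOLD = 1.3


def rms(samples: Sequence[float]) -> float:
    """Root mean square of the samples; 0.0 for empty input."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def bucket_for(segment: Sequence[float], baseline_rms: float) -> str:
    """Label a segment by its RMS relative to the whole-track baseline."""
    if baseline_rms <= 0.0:
        return "default"  # silent baseline: the ratio is undefined
    segment_rms = rms(segment)
    if segment_rms <= 0.0:
        return "default"  # empty or silent segment
    ratio = segment_rms / baseline_rms
    if ratio < CALM_RATIO_THRESHOLD:
        return "calm"
    if ratio > DRAMATIC_RATIO_THRESHOLD:
        return "dramatic"
    return "default"
```

With a baseline RMS of 0.2, a segment at half that level lands in the calm bucket and one at 1.5 times that level lands in the dramatic bucket; everything in between keeps the defaults.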

videopython-0.33.0/src/videopython/ai/dubbing/loudness.py ADDED

@@ -0,0 +1,86 @@
+"""LUFS / peak loudness matching for dubbed audio."""
+from __future__ import annotations
+from typing import TYPE_CHECKING
+import numpy as np
+if TYPE_CHECKING:
+    from videopython.audio import Audio
+# BS.1770 integrated-loudness measurement requires at least 400 ms of audio
+# (one gating block). Below this, fall back to peak match -- pyloudnorm
+# returns -inf or warns, neither of which gives a usable gain.
+_LUFS_MIN_DURATION_SECONDS = 0.4
+def peak_match(target: Audio, reference: Audio) -> Audio:
+    """Scale ``target`` so its peak amplitude matches ``reference``.
+    Used as the fallback when LUFS measurement isn't viable (clip < 0.4s
+    or silent input). The new ``Audio`` shares no buffer with ``target``.
+    """
+    from videopython.audio import Audio as _Audio
+    target_peak = float(np.max(np.abs(target.data))) if target.data.size else 0.0
+    reference_peak = float(np.max(np.abs(reference.data))) if reference.data.size else 0.0
+    if target_peak <= 0.0 or reference_peak <= 0.0:
+        return target
+    scale = reference_peak / target_peak
+    if abs(scale - 1.0) < 1e-3:
+        return target
+    return _Audio(target.data * scale, target.metadata)
+def loudness_match(target: Audio, reference: Audio) -> Audio:
+    """Scale ``target`` so its integrated loudness (BS.1770 / LUFS) matches ``reference``.
+    Demucs background normalization and the timing-assembler peak guard
+    each clamp at 1.0 instead of restoring perceived loudness, so a
+    dubbed mix lands perceptually "thinner" than the source even after
+    peak match. LUFS captures the ear-weighted envelope that peak ratio
+    misses on dialogue-heavy material.
+    Falls back to :func:`peak_match` when either clip is shorter than
+    the BS.1770 gating block (400 ms) or when measurement returns -inf
+    (silent or near-silent gated content). After gain is applied, peaks
+    are clamped to 0.99 -- BS.1770 has no peak ceiling and a sufficiently
+    quiet source can demand gain that would otherwise clip.
+    """
+    from videopython.audio import Audio as _Audio
+    target_dur = target.metadata.duration_seconds
+    ref_dur = reference.metadata.duration_seconds
+    if target_dur < _LUFS_MIN_DURATION_SECONDS or ref_dur < _LUFS_MIN_DURATION_SECONDS:
+        return peak_match(target, reference)
+    if not target.data.size or not reference.data.size:
+        return target
+    import pyloudnorm
+    target_lufs = pyloudnorm.Meter(target.metadata.sample_rate).integrated_loudness(target.data)
+    reference_lufs = pyloudnorm.Meter(reference.metadata.sample_rate).integrated_loudness(reference.data)
+    # Either clip's gated content was below -70 LUFS (effectively silent
+    # under BS.1770). Gain would be undefined -- fall back to peak match,
+    # which has its own silent-input no-op.
+    if not np.isfinite(target_lufs) or not np.isfinite(reference_lufs):
+        return peak_match(target, reference)
+    gain_db = reference_lufs - target_lufs
+    if abs(gain_db) < 0.1:
+        return target
+    scale = float(10 ** (gain_db / 20.0))
+    scaled = target.data * scale
+    peak = float(np.max(np.abs(scaled)))
+    if peak > 0.99:
+        scaled = scaled * (0.99 / peak)
+    return _Audio(scaled, target.metadata)
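
The arithmetic at the end of `loudness_match` is a dB-to-linear conversion followed by a peak ceiling. A minimal sketch of just that step, assuming the two LUFS readings are already available (the measurement itself is not reproduced here), with hypothetical helper names `gain_scale` and `apply_gain_with_ceiling`:

```python
from typing import List, Sequence


def gain_scale(target_lufs: float, reference_lufs: float) -> float:
    """Linear amplitude factor that moves target_lufs to reference_lufs."""
    gain_db = reference_lufs - target_lufs
    return 10 ** (gain_db / 20.0)  # 20 dB per decade of amplitude


def apply_gain_with_ceiling(
    samples: Sequence[float], scale: float, ceiling: float = 0.99
) -> List[float]:
    """Scale samples, then shrink uniformly if the peak exceeds the ceiling."""
    scaled = [s * scale for s in samples]
    peak = max((abs(s) for s in scaled), default=0.0)
    if peak > ceiling:
        # Uniform rescale keeps the waveform shape; only the level changes.
        scaled = [s * (ceiling / peak) for s in scaled]
    return scaled
```

A 6 dB gap (for example a target at -23 LUFS and a reference at -17 LUFS) gives a scale of roughly 2. If that doubling pushes a sample past 0.99, the whole clip is scaled back down, so the result ends up slightly quieter than the reference rather than clipped.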
