PyPI - videopython - Versions diffs - 0.28.3__tar.gz → 0.29.1__tar.gz - Mend

videopython 0.28.3__tar.gz → 0.29.1__tar.gz


Files changed (62)

{videopython-0.28.3 → videopython-0.29.1}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: videopython
-Version: 0.28.3
+Version: 0.29.1
 Summary: Minimal video generation and processing library.
 Project-URL: Homepage, https://videopython.com
 Project-URL: Repository, https://github.com/bartwojtowicz/videopython/
@@ -27,14 +27,15 @@ Requires-Dist: accelerate>=0.29.2; extra == 'ai'
 Requires-Dist: chatterbox-tts>=0.1.7; extra == 'ai'
 Requires-Dist: demucs>=4.0.0; extra == 'ai'
 Requires-Dist: diffusers>=0.30.0; extra == 'ai'
-Requires-Dist: easyocr>=1.7.0; extra == 'ai'
 Requires-Dist: hf-transfer>=0.1.9; extra == 'ai'
+Requires-Dist: imagehash>=4.3; extra == 'ai'
 Requires-Dist: llama-cpp-python>=0.3.0; extra == 'ai'
 Requires-Dist: numba>=0.61.0; extra == 'ai'
 Requires-Dist: ollama>=0.4.5; extra == 'ai'
 Requires-Dist: openai-whisper>=20240930; extra == 'ai'
 Requires-Dist: pyannote-audio>=4.0.0; extra == 'ai'
 Requires-Dist: pyloudnorm>=0.1.1; extra == 'ai'
+Requires-Dist: qwen-vl-utils>=0.0.10; extra == 'ai'
 Requires-Dist: scikit-learn>=1.3.0; extra == 'ai'
 Requires-Dist: scipy>=1.10.0; extra == 'ai'
 Requires-Dist: sentencepiece>=0.1.99; extra == 'ai'
@@ -56,6 +57,8 @@ Minimal, LLM-friendly Python library for programmatic video editing, processing,
 Full documentation: [videopython.com](https://videopython.com)
+> **Disclaimer:** This project started as a hand-written hobby project, but most of the code is now produced by LLM agents. Humans still drive direction, approve changes, and own design decisions.
 ## Installation
 ### 1. Install FFmpeg
@@ -193,10 +196,10 @@ API docs: [Core](https://videopython.com/api/index/) | [Video](https://videopyth
 | Area | Highlights |
 |---|---|
 | **Generation** | `TextToVideo`, `ImageToVideo`, `TextToImage`, `TextToSpeech`, `TextToMusic` |
-| **Understanding** | `AudioToText` (transcription), `AudioClassifier`, `SceneVLM` (visual scene description), `ActionRecognizer` |
+| **Understanding** | `AudioToText` (transcription), `AudioClassifier`, `SceneVLM` (structured visual scene description), `FaceTracker` (per-shot face tracks) |
 | **Scene detection** | `SemanticSceneDetector` (neural scene boundaries) |
 | **Video analysis** | `VideoAnalyzer` - full-pipeline analysis combining multiple AI capabilities |
-| **Transforms** | `FaceTracker`, `FaceTrackingCrop`, `SplitScreenComposite` |
+| **Transforms** | `FaceTrackingCrop`, `SplitScreenComposite` |
 | **Dubbing** | `VideoDubber` - voice cloning and revoicing with timing sync |
 | **Object swapping** | `ObjectSwapper` - detect, segment, and inpaint objects in video |

{videopython-0.28.3 → videopython-0.29.1}/README.md RENAMED

@@ -8,6 +8,8 @@ Minimal, LLM-friendly Python library for programmatic video editing, processing,
 Full documentation: [videopython.com](https://videopython.com)
+> **Disclaimer:** This project started as a hand-written hobby project, but most of the code is now produced by LLM agents. Humans still drive direction, approve changes, and own design decisions.
 ## Installation
 ### 1. Install FFmpeg
@@ -145,10 +147,10 @@ API docs: [Core](https://videopython.com/api/index/) | [Video](https://videopyth
 | Area | Highlights |
 |---|---|
 | **Generation** | `TextToVideo`, `ImageToVideo`, `TextToImage`, `TextToSpeech`, `TextToMusic` |
-| **Understanding** | `AudioToText` (transcription), `AudioClassifier`, `SceneVLM` (visual scene description), `ActionRecognizer` |
+| **Understanding** | `AudioToText` (transcription), `AudioClassifier`, `SceneVLM` (structured visual scene description), `FaceTracker` (per-shot face tracks) |
 | **Scene detection** | `SemanticSceneDetector` (neural scene boundaries) |
 | **Video analysis** | `VideoAnalyzer` - full-pipeline analysis combining multiple AI capabilities |
-| **Transforms** | `FaceTracker`, `FaceTrackingCrop`, `SplitScreenComposite` |
+| **Transforms** | `FaceTrackingCrop`, `SplitScreenComposite` |
 | **Dubbing** | `VideoDubber` - voice cloning and revoicing with timing sync |
 | **Object swapping** | `ObjectSwapper` - detect, segment, and inpaint objects in video |

{videopython-0.28.3 → videopython-0.29.1}/pyproject.toml RENAMED

@@ -1,6 +1,6 @@
 [project]
 name = "videopython"
-version = "0.28.3"
+version = "0.29.1"
 description = "Minimal video generation and processing library."
 authors = [
     { name = "Bartosz Wójtowicz", email = "bartoszwojtowicz@outlook.com" },
@@ -70,7 +70,6 @@ ai = [
     "scikit-learn>=1.3.0",
     # Detection backends
     "ultralytics>=8.0.0",
-    "easyocr>=1.7.0",
     # Audio classification (AST via transformers - no separate dep needed)
     # Scene detection
     "transnetv2-pytorch>=1.0.5",
@@ -84,6 +83,11 @@ ai = [
     "llama-cpp-python>=0.3.0",
     # Loudness measurement (BS.1770) for dub-vs-source loudness matching (M3)
     "pyloudnorm>=0.1.1",
+    # Vision-language preprocessing for Qwen3.5 (M5) - documented prerequisite
+    # for AutoModelForImageTextToText with image/video chat templates.
+    "qwen-vl-utils>=0.0.10",
+    # Perceptual hashing for SceneVLM frame dedup (M5)
+    "imagehash>=4.3",
 ]
 # Required for pip install videopython[ai] - pip uses optional-dependencies, not dependency-groups
@@ -105,7 +109,6 @@ ai = [
     "scikit-learn>=1.3.0",
     # Detection backends
     "ultralytics>=8.0.0",
-    "easyocr>=1.7.0",
     # Audio classification (AST via transformers - no separate dep needed)
     # Scene detection
     "transnetv2-pytorch>=1.0.5",
@@ -119,6 +122,11 @@ ai = [
     "llama-cpp-python>=0.3.0",
     # Loudness measurement (BS.1770) for dub-vs-source loudness matching (M3)
     "pyloudnorm>=0.1.1",
+    # Vision-language preprocessing for Qwen3.5 (M5) - documented prerequisite
+    # for AutoModelForImageTextToText with image/video chat templates.
+    "qwen-vl-utils>=0.0.10",
+    # Perceptual hashing for SceneVLM frame dedup (M5)
+    "imagehash>=4.3",
 ]
 [project.urls]
@@ -135,7 +143,6 @@ module = [
     "diffusers", "diffusers.*",
     "ollama", "ollama.*",
     "ultralytics", "ultralytics.*",
-    "easyocr", "easyocr.*",
     "transformers", "transformers.*",
     "transnetv2_pytorch", "transnetv2_pytorch.*",
     "chatterbox", "chatterbox.*",
@@ -146,6 +153,8 @@ module = [
     "cv2", "cv2.*",
     "llama_cpp", "llama_cpp.*",
     "pyloudnorm", "pyloudnorm.*",
+    "qwen_vl_utils", "qwen_vl_utils.*",
+    "imagehash", "imagehash.*",
 ]
 ignore_missing_imports = true
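
The `imagehash` dependency is added for "SceneVLM frame dedup", but the diff does not show how it is used. As a rough, dependency-free illustration of the idea only, the sketch below uses a hand-rolled average hash (not the `imagehash` API) on tiny grayscale frames stored as nested lists, and drops any frame whose hash is within a Hamming-distance threshold of the last kept frame. The names `average_hash`, `hamming`, `dedup_frames`, and `max_distance` are hypothetical and do not come from videopython.

```python
from __future__ import annotations

from typing import Sequence

Frame = Sequence[Sequence[int]]  # grayscale pixels, rows of 0-255 values


def average_hash(frame: Frame) -> int:
    """Pack one bit per pixel: 1 if the pixel is brighter than the frame mean."""
    pixels = [p for row in frame for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def dedup_frames(frames: Sequence[Frame], max_distance: int = 1) -> list[int]:
    """Return indices of frames kept after dropping near-duplicates of the last kept frame."""
    kept: list[int] = []
    last_hash: int | None = None
    for i, frame in enumerate(frames):
        h = average_hash(frame)
        if last_hash is None or hamming(h, last_hash) > max_distance:
            kept.append(i)
            last_hash = h
    return kept


# Three 4x4 frames: a bright-right frame, a slightly noisy copy, and its mirror.
bright_right = [[0, 0, 255, 255]] * 4
bright_right_noisy = [[5, 3, 250, 252]] * 4
bright_left = [[255, 255, 0, 0]] * 4
kept = dedup_frames([bright_right, bright_right_noisy, bright_left])
```

The noisy copy hashes identically to the first frame and is dropped, while the mirrored frame differs in every bit and is kept. A real pipeline would more likely downscale frames and use a perceptual hash, which tolerates compression noise better than this toy version.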

{videopython-0.28.3 → videopython-0.29.1}/src/videopython/ai/__init__.py RENAMED

@@ -2,11 +2,11 @@ from videopython.ai import registry as _ai_registry  # noqa: F401
 from .generation import ImageToVideo, TextToImage, TextToMusic, TextToSpeech, TextToVideo
 from .swapping import ObjectSwapper
-from .transforms import FaceTracker, FaceTrackingCrop, SplitScreenComposite
+from .transforms import FaceTrackingCrop, SplitScreenComposite
 from .understanding import (
-    ActionRecognizer,
     AudioClassifier,
     AudioToText,
+    FaceTracker,
     SceneVLM,
     SemanticSceneDetector,
 )
@@ -22,12 +22,10 @@ __all__ = [
     # Understanding
     "AudioToText",
     "AudioClassifier",
+    "FaceTracker",
     "SceneVLM",
-    # Temporal
-    "ActionRecognizer",
     "SemanticSceneDetector",
     # Transforms (AI-powered)
-    "FaceTracker",
     "FaceTrackingCrop",
     "SplitScreenComposite",
     # Swapping

{videopython-0.28.3 → videopython-0.29.1}/src/videopython/ai/dubbing/cache.py RENAMED

@@ -27,6 +27,8 @@ from dataclasses import dataclass
 from pathlib import Path
 from typing import TYPE_CHECKING, Any
+from videopython.ai.understanding.audio import _normalize_vocabulary
 if TYPE_CHECKING:
     from videopython.base.audio import Audio
     from videopython.base.text.transcription import Transcription
@@ -37,7 +39,12 @@ logger = logging.getLogger(__name__)
 # Cache schema version. Bump on incompatible changes to any artifact's
 # on-disk format (e.g. TranscriptionSegment field changes that break
 # from_dict). Mismatched cache entries are treated as a miss.
-SCHEMA_VERSION = 1
+#
+# v2 (0.29.1): vocabulary added to transcription_kwargs_hash for M1
+# vocabulary biasing. Pre-v2 transcription artifacts miss on first hit
+# and re-run; translation/TTS artifacts are unaffected (hashed
+# independently and survive).
+SCHEMA_VERSION = 2
 # Reserved for M4.3 per-speaker voice library. M3.2 does not write here;
 # documented so future code knows the path is taken.
@@ -126,13 +133,22 @@ class DubCache:
         condition_on_previous_text: bool,
         no_speech_threshold: float,
         logprob_threshold: float | None,
+        vocabulary: list[str] | None = None,
     ) -> str:
+        """Hash captures the kwargs that affect Whisper's output.
+        ``vocabulary`` is normalized (case-insensitive dedup, casing
+        preserved) before hashing so trivial reordering/casing
+        differences don't thrash the cache. Defaults to ``None`` so
+        pre-M1 callers keep hashing the same value as before.
+        """
         return _stable_hash(
             whisper_model,
             enable_diarization,
             condition_on_previous_text,
             no_speech_threshold,
             logprob_threshold,
+            *_normalize_vocabulary(vocabulary),
         )
     @staticmethod
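
The hunk above imports `_normalize_vocabulary` and calls `_stable_hash`, but neither body appears in this diff. The sketch below shows one way such helpers could behave, based only on the docstring: case-insensitive dedup with casing preserved, and a `None` vocabulary contributing no extra hash parts, so callers that pass nothing hash to the same value as before. Both function bodies are assumptions, as are the sorting step used to make ordering irrelevant and the SHA-256/JSON encoding. The names deliberately drop the leading underscore to mark them as stand-ins.

```python
from __future__ import annotations

import hashlib
import json


def normalize_vocabulary(vocabulary: list[str] | None) -> list[str]:
    """Strip blanks, dedup case-insensitively (first casing wins), sort for order stability."""
    if not vocabulary:
        return []
    seen: dict[str, str] = {}
    for term in vocabulary:
        cleaned = term.strip()
        if cleaned:
            # Key on the casefolded form, but keep the first spelling we saw.
            seen.setdefault(cleaned.casefold(), cleaned)
    return [seen[key] for key in sorted(seen)]


def stable_hash(*parts: object) -> str:
    """Hash a JSON encoding of the parts so the digest is stable across processes."""
    encoded = json.dumps(parts, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()[:16]


# Stand-ins for (whisper_model, enable_diarization, condition_on_previous_text,
# no_speech_threshold, logprob_threshold).
base = ("small", False, False, 0.6, -1.0)

no_vocab = stable_hash(*base, *normalize_vocabulary(None))
with_vocab = stable_hash(*base, *normalize_vocabulary(["Klarna", "Spotify"]))
reordered = stable_hash(*base, *normalize_vocabulary(["Spotify", "Klarna", "klarna"]))
```

Because the normalized list is unpacked with `*`, an empty vocabulary adds nothing to the hashed tuple. That is the property the docstring relies on when it says pre-M1 callers keep hashing the same value.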

{videopython-0.28.3 → videopython-0.29.1}/src/videopython/ai/dubbing/dubber.py RENAMED

@@ -37,6 +37,11 @@ class VideoDubber:
             gate; raise to drop more low-confidence windows.
         logprob_threshold: Forwarded to ``AudioToText``. Whisper's average
             log-probability gate.
+        vocabulary: Forwarded to ``AudioToText``. Optional list of brand
+            names, product names, or proper nouns to bias Whisper's first-
+            window decoder via ``initial_prompt``. Recovers near-mishears
+            (e.g. Klarna → "carna") on brand-monitoring inputs without new
+            model deps.
         strict_quality: When True, the pipeline raises
             :class:`GarbageTranscriptError` before Demucs/translation/TTS run
             if the transcript-quality heuristic returns ``"reject"``. When
@@ -67,6 +72,7 @@ class VideoDubber:
         condition_on_previous_text: bool = False,
         no_speech_threshold: float = 0.6,
         logprob_threshold: float | None = -1.0,
+        vocabulary: list[str] | None = None,
         strict_quality: bool = False,
         translator: TranslatorChoice = "auto",
         cache_dir: str | Path | None = None,
@@ -77,6 +83,7 @@ class VideoDubber:
         self.condition_on_previous_text = condition_on_previous_text
         self.no_speech_threshold = no_speech_threshold
         self.logprob_threshold = logprob_threshold
+        self.vocabulary = vocabulary
         self.strict_quality = strict_quality
         self.translator = translator
         self.cache_dir = cache_dir
@@ -101,6 +108,7 @@ class VideoDubber:
             condition_on_previous_text=self.condition_on_previous_text,
             no_speech_threshold=self.no_speech_threshold,
             logprob_threshold=self.logprob_threshold,
+            vocabulary=self.vocabulary,
             strict_quality=self.strict_quality,
             translator=self.translator,
             cache_dir=self.cache_dir,
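
The new `vocabulary` docstring says the terms bias Whisper through `initial_prompt`, but the code that builds the prompt is not part of this diff. The sketch below shows one plausible shape for that step: a hypothetical `build_initial_prompt` helper that joins the terms into a short comma-separated string and returns `None` when there is nothing to bias. The helper name, the separator, and the `max_chars` cap are all assumptions, not videopython's implementation.

```python
from __future__ import annotations


def build_initial_prompt(vocabulary: list[str] | None, max_chars: int = 200) -> str | None:
    """Join vocabulary terms into a short prompt; return None when there is nothing to bias."""
    if not vocabulary:
        return None
    terms: list[str] = []
    length = 0
    for term in vocabulary:
        term = term.strip()
        if not term:
            continue
        added = len(term) + (2 if terms else 0)  # account for the ", " separator
        if length + added > max_chars:
            break  # keep the prompt short; later terms are dropped
        terms.append(term)
        length += added
    return ", ".join(terms) if terms else None


prompt = build_initial_prompt(["Klarna", "Spotify", "  "])
# With openai-whisper installed, the prompt could be passed along the lines of:
#   whisper.load_model("small").transcribe("clip.wav", initial_prompt=prompt)
```

Returning `None` for an empty vocabulary keeps the no-vocabulary path identical to calling Whisper without a prompt, which matches the `vocabulary: list[str] | None = None` default in the signatures above.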

{videopython-0.28.3 → videopython-0.29.1}/src/videopython/ai/dubbing/pipeline.py RENAMED

@@ -170,6 +170,7 @@ class LocalDubbingPipeline:
         condition_on_previous_text: bool = False,
         no_speech_threshold: float = 0.6,
         logprob_threshold: float | None = -1.0,
+        vocabulary: list[str] | None = None,
         strict_quality: bool = False,
         translator: TranslatorChoice = "auto",
         cache_dir: str | Path | None = None,
@@ -180,6 +181,7 @@ class LocalDubbingPipeline:
         self.condition_on_previous_text = condition_on_previous_text
         self.no_speech_threshold = no_speech_threshold
         self.logprob_threshold = logprob_threshold
+        self.vocabulary = vocabulary
         self.strict_quality = strict_quality
         self.translator = translator
         self.cache_dir = Path(cache_dir) if cache_dir is not None else None
@@ -256,6 +258,7 @@ class LocalDubbingPipeline:
                     "condition_on_previous_text": self.condition_on_previous_text,
                     "no_speech_threshold": self.no_speech_threshold,
                     "logprob_threshold": self.logprob_threshold,
+                    "vocabulary": self.vocabulary,
                 },
             )
         return transcription
@@ -406,6 +409,7 @@ class LocalDubbingPipeline:
             condition_on_previous_text=self.condition_on_previous_text,
             no_speech_threshold=self.no_speech_threshold,
             logprob_threshold=self.logprob_threshold,
+            vocabulary=self.vocabulary,
         )
         return src_hash, kwargs_hash
@@ -420,6 +424,7 @@ class LocalDubbingPipeline:
             condition_on_previous_text=self.condition_on_previous_text,
             no_speech_threshold=self.no_speech_threshold,
             logprob_threshold=self.logprob_threshold,
+            vocabulary=self.vocabulary,
         )
     def _init_translator(self, source_lang: str, target_lang: str) -> None:
