PyPI - mkv-episode-matcher - Versions diffs - 0.2.0__tar.gz → 0.3.0__tar.gz - Mend

mkv-episode-matcher 0.2.0__tar.gz → 0.3.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

This version of mkv-episode-matcher might be problematic.

Files changed (65)

mkv_episode_matcher-0.3.0/.python-version ADDED

@@ -0,0 +1 @@
+3.9

{mkv_episode_matcher-0.2.0 → mkv_episode_matcher-0.3.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: mkv-episode-matcher
-Version: 0.2.0
+Version: 0.3.0
 Summary: The MKV Episode Matcher is a tool for identifying TV series episodes from MKV files and renaming the files accordingly.
 Home-page: https://github.com/Jsakkos/mkv-episode-matcher
 Author: Jonathan Sakkos
@@ -14,16 +14,18 @@ Classifier: Programming Language :: Python
 Classifier: Programming Language :: Python :: 3.12
 Classifier: Programming Language :: Python :: Implementation :: CPython
 Classifier: Programming Language :: Python :: Implementation :: PyPy
-Requires-Python: >=3.10
+Requires-Python: >=3.9
 Description-Content-Type: text/markdown
 Requires-Dist: configparser>=7.1.0
 Requires-Dist: ffmpeg>=1.4
 Requires-Dist: loguru>=0.7.2
-Requires-Dist: numpy>=2.1.3
+Requires-Dist: openai-whisper>=20240930
 Requires-Dist: opensubtitlescom>=0.1.5
 Requires-Dist: pytesseract>=0.3.13
+Requires-Dist: rapidfuzz>=3.10.1
 Requires-Dist: requests>=2.32.3
 Requires-Dist: tmdb-client>=0.0.1
+Requires-Dist: wave>=0.0.2
 # MKV Episode Matcher

mkv_episode_matcher-0.3.0/docs/installation.md ADDED

@@ -0,0 +1,81 @@
+# Installation Guide
+## Basic Installation
+Install MKV Episode Matcher using pip:
+```bash
+pip install mkv-episode-matcher
+```
+## Installation Options
+### GPU Support
+For GPU acceleration (recommended if you have a CUDA-capable GPU):
+```bash
+pip install "mkv-episode-matcher"
+```
+Find the appropriate CUDA version and upgrade Torch (e.g., for CUDA 12.4):
+```bash
+pip install -U torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
+```
+### Development Installation
+For contributing or development:
+```bash
+# Clone the repository
+git clone https://github.com/Jsakkos/mkv-episode-matcher.git
+cd mkv-episode-matcher
+# Install UV
+pip install uv
+# Install with development dependencies
+uv venv
+uv pip install -e ".[dev]"
+```
+## API Keys Setup
+1. **TMDb API Key**
+    - Create an account at [TMDb](https://www.themoviedb.org/)
+    - Go to your account settings
+    - Request an API key
+2. **OpenSubtitles (Optional)**
+    - Register at [OpenSubtitles](https://www.opensubtitles.com/)
+    - Get your API key from the dashboard
+## System Requirements
+### For GPU Support
+- CUDA-capable NVIDIA GPU
+- CUDA Toolkit 12.1 or compatible version
+- At least 4GB GPU memory recommended for Whisper speech recognition
+### For CPU-Only
+- No special requirements beyond Python 3.9+
+## Verification
+Verify your installation:
+```bash
+mkv-match --version
+# Check GPU availability (if installed with GPU support)
+python -c "import torch; print(f'GPU available: {torch.cuda.is_available()}')"
+```
+## Troubleshooting
+If you encounter any issues:
+1. Ensure you have the latest pip: `pip install --upgrade pip`
+2. For GPU installations, verify CUDA is properly installed
+3. Check the [compatibility matrix](https://pytorch.org/get-started/locally/) for PyTorch and CUDA versions
+4. If you encounter any other issues, please [open an issue](https://github.com/Jsakkos/mkv-episode-matcher/issues) on GitHub
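
The verification steps in the added installation guide can also be run from Python. The following is a minimal sketch, assuming only the requirements stated in the guide (Python 3.9+ and an optional CUDA-enabled PyTorch); the helper name `check_environment` is hypothetical and not part of the package.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 9)):
    """Report whether the interpreter and optional GPU stack look usable.

    `check_environment` is a hypothetical helper, not part of mkv-episode-matcher.
    """
    report = {
        # The guide states Python 3.9+ as the only CPU-only requirement.
        "python_ok": sys.version_info >= min_python,
        # find_spec avoids importing torch just to learn whether it is installed.
        "torch_installed": importlib.util.find_spec("torch") is not None,
        # Stays None when torch is missing or fails to import.
        "cuda_available": None,
    }
    if report["torch_installed"]:
        try:
            import torch

            report["cuda_available"] = bool(torch.cuda.is_available())
        except ImportError:
            pass
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

A `cuda_available` value of `False` with `torch_installed` set to `True` usually points to a CPU-only PyTorch wheel or a CUDA/driver mismatch, which is the case troubleshooting items 2 and 3 address.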

mkv_episode_matcher-0.3.0/mkv_episode_matcher/episode_identification.py ADDED

@@ -0,0 +1,208 @@
+# mkv_episode_matcher/episode_identification.py
+import os
+import glob
+from pathlib import Path
+from rapidfuzz import fuzz
+from collections import defaultdict
+import re
+from loguru import logger
+import json
+import shutil
+class EpisodeMatcher:
+    def __init__(self, cache_dir, min_confidence=0.6):
+        self.cache_dir = Path(cache_dir)
+        self.min_confidence = min_confidence
+        self.whisper_segments = None
+        self.series_name = None
+    def clean_text(self, text):
+        """Clean text by removing stage directions and normalizing repeated words."""
+        # Remove stage directions like [groans] and <i>SHIP:</i>
+        text = re.sub(r'\[.*?\]|\<.*?\>', '', text)
+        # Remove repeated words with dashes (e.g., "Y-y-you" -> "you")
+        text = re.sub(r'([A-Za-z])-\1+', r'\1', text)
+        # Remove multiple spaces
+        text = ' '.join(text.split())
+        return text.lower()
+    def chunk_score(self, whisper_chunk, ref_chunk):
+        """Calculate fuzzy match score between two chunks of text."""
+        whisper_clean = self.clean_text(whisper_chunk)
+        ref_clean = self.clean_text(ref_chunk)
+        # Use token sort ratio to handle word order differences
+        token_sort = fuzz.token_sort_ratio(whisper_clean, ref_clean)
+        # Use partial ratio to catch substring matches
+        partial = fuzz.partial_ratio(whisper_clean, ref_clean)
+        # Weight token sort more heavily but consider partial matches
+        return (token_sort * 0.7 + partial * 0.3) / 100.0
+    def identify_episode(self, video_file, temp_dir):
+        """Identify which episode matches this video file."""
+        # Get series name from parent directory
+        self.series_name = Path(video_file).parent.parent.name
+        # Load whisper transcript if not already processed
+        segments_file = Path(temp_dir) / f"{Path(video_file).stem}.segments.json"
+        if not segments_file.exists():
+            logger.error(f"No transcript found for {video_file}. Run speech recognition first.")
+            return None
+        with open(segments_file) as f:
+            self.whisper_segments = json.load(f)
+        # Get reference directory for this series
+        reference_dir = self.cache_dir / "data" / self.series_name
+        if not reference_dir.exists():
+            logger.error(f"No reference files found for {self.series_name}")
+            return None
+        # Match against reference files
+        match = self.match_all_references(reference_dir)
+        if match and match['confidence'] >= self.min_confidence:
+            # Extract season and episode from filename
+            match_file = Path(match['file'])
+            season_ep = re.search(r'S(\d+)E(\d+)', match_file.stem)
+            if season_ep:
+                season, episode = map(int, season_ep.groups())
+                return {
+                    'season': season,
+                    'episode': episode,
+                    'confidence': match['confidence'],
+                    'reference_file': str(match_file),
+                    'chunk_scores': match['chunk_scores']
+                }
+        return None
+    def match_all_references(self, reference_dir):
+        """Process all reference files and track matching scores."""
+        results = defaultdict(list)
+        best_match = None
+        best_confidence = 0
+        def process_chunks(ref_segments, filename):
+            nonlocal best_match, best_confidence
+            chunk_size = 300  # 5 minute chunks
+            whisper_chunks = defaultdict(list)
+            ref_chunks = defaultdict(list)
+            # Group segments into time chunks
+            for seg in self.whisper_segments:
+                chunk_idx = int(float(seg['start']) // chunk_size)
+                whisper_chunks[chunk_idx].append(seg['text'])
+            for seg in ref_segments:
+                chunk_idx = int(seg['start'] // chunk_size)
+                ref_chunks[chunk_idx].append(seg['text'])
+            # Score each chunk
+            for chunk_idx in whisper_chunks:
+                whisper_text = ' '.join(whisper_chunks[chunk_idx])
+                # Look for matching reference chunk and adjacent chunks
+                scores = []
+                for ref_idx in range(max(0, chunk_idx-1), chunk_idx+2):
+                    if ref_idx in ref_chunks:
+                        ref_text = ' '.join(ref_chunks[ref_idx])
+                        score = self.chunk_score(whisper_text, ref_text)
+                        scores.append(score)
+                if scores:
+                    chunk_confidence = max(scores)
+                    logger.info(f"File: {filename}, "
+                              f"Time: {chunk_idx*chunk_size}-{(chunk_idx+1)*chunk_size}s, "
+                              f"Confidence: {chunk_confidence:.2f}")
+                    results[filename].append({
+                        'chunk_idx': chunk_idx,
+                        'confidence': chunk_confidence
+                    })
+                    # Early exit if we find a very good match
+                    if chunk_confidence > self.min_confidence:
+                        chunk_scores = results[filename]
+                        confidence = sum(c['confidence'] * (0.9 ** c['chunk_idx'])
+                                      for c in chunk_scores) / len(chunk_scores)
+                        if confidence > best_confidence:
+                            best_confidence = confidence
+                            best_match = {
+                                'file': filename,
+                                'confidence': confidence,
+                                'chunk_scores': chunk_scores
+                            }
+                        return True
+            return False
+        # Process each reference file
+        for ref_file in glob.glob(os.path.join(reference_dir, "*.srt")):
+            ref_segments = self.parse_srt_to_segments(ref_file)
+            filename = os.path.basename(ref_file)
+            if process_chunks(ref_segments, filename):
+                break
+        # If no early match found, find best overall match
+        if not best_match:
+            for filename, chunks in results.items():
+                # Weight earlier chunks more heavily
+                confidence = sum(c['confidence'] * (0.9 ** c['chunk_idx'])
+                               for c in chunks) / len(chunks)
+                if confidence > best_confidence:
+                    best_confidence = confidence
+                    best_match = {
+                        'file': filename,
+                        'confidence': confidence,
+                        'chunk_scores': chunks
+                    }
+        return best_match
+    def parse_srt_to_segments(self, srt_file):
+        """Parse SRT file into list of segments with start/end times and text."""
+        segments = []
+        current_segment = {}
+        with open(srt_file, 'r', encoding='utf-8') as f:
+            lines = f.readlines()
+        i = 0
+        while i < len(lines):
+            line = lines[i].strip()
+            if line.isdigit():  # Index
+                if current_segment:
+                    segments.append(current_segment)
+                current_segment = {}
+            elif '-->' in line:  # Timestamp
+                start, end = line.split(' --> ')
+                current_segment['start'] = self.timestr_to_seconds(start)
+                current_segment['end'] = self.timestr_to_seconds(end)
+            elif line:  # Text
+                if 'text' in current_segment:
+                    current_segment['text'] += ' ' + line
+                else:
+                    current_segment['text'] = line
+            i += 1
+        if current_segment:
+            segments.append(current_segment)
+        return segments
+    def timestr_to_seconds(self, timestr):
+        """Convert SRT timestamp to seconds."""
+        h, m, s = timestr.replace(',','.').split(':')
+        return float(h) * 3600 + float(m) * 60 + float(s)
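
Two details of the added `EpisodeMatcher` can be checked in isolation with the standard library alone. The sketch below copies the stutter regex from `clean_text` and the decay-weighted average from `match_all_references`. The function names are hypothetical, and the alternative regex is a suggestion for comparison, not part of the release.

```python
import re


def collapse_stutter_original(text):
    """Stutter regex exactly as it appears in clean_text in the diff."""
    return re.sub(r'([A-Za-z])-\1+', r'\1', text).lower()


def collapse_stutter_alt(text):
    """Hypothetical alternative: case-insensitive, collapses every repeat
    of a word-initial letter (assumption, not the released behavior)."""
    return re.sub(r'\b([A-Za-z])(?:-\1)+', r'\1', text, flags=re.IGNORECASE).lower()


def weighted_confidence(chunks, decay=0.9):
    """Mean of per-chunk confidences, each scaled by decay ** chunk_idx,
    mirroring the expression used in match_all_references."""
    return sum(c['confidence'] * decay ** c['chunk_idx'] for c in chunks) / len(chunks)


print(collapse_stutter_original("Y-y-you"))  # y-you
print(collapse_stutter_alt("Y-y-you"))       # you

uniform = [{'chunk_idx': i, 'confidence': 0.8} for i in range(3)]
print(round(weighted_confidence(uniform), 4))  # 0.7227

late = [{'chunk_idx': 5, 'confidence': 0.65}]
print(round(weighted_confidence(late), 4))     # 0.3838
```

Read this way, the released regex does not reduce "Y-y-you" to "you" as its comment suggests, because the backreference is case-sensitive and the pattern collapses only one repeat. The decay also means a chunk late in the file can clear the per-chunk `min_confidence` check and trigger the early exit while its aggregate still falls below the same threshold in `identify_episode`; whether that is intended is not stated in the release.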

mkv_episode_matcher-0.3.0/mkv_episode_matcher/episode_matcher.py ADDED

@@ -0,0 +1,117 @@
+# mkv_episode_matcher/episode_matcher.py
+from pathlib import Path
+import shutil
+import glob
+import os
+from loguru import logger
+from mkv_episode_matcher.__main__ import CONFIG_FILE, CACHE_DIR
+from mkv_episode_matcher.config import get_config
+from mkv_episode_matcher.mkv_to_srt import convert_mkv_to_srt
+from mkv_episode_matcher.tmdb_client import fetch_show_id
+from mkv_episode_matcher.utils import (
+    check_filename,
+    clean_text,
+    cleanup_ocr_files,
+    get_subtitles,
+    process_reference_srt_files,
+    process_srt_files,
+    compare_and_rename_files,get_valid_seasons
+)
+from mkv_episode_matcher.speech_to_text import process_speech_to_text
+from mkv_episode_matcher.episode_identification import EpisodeMatcher
+def process_show(season=None, dry_run=False, get_subs=False):
+    """Process the show using both speech recognition and OCR fallback."""
+    config = get_config(CONFIG_FILE)
+    show_dir = config.get("show_dir")
+    # Initialize episode matcher
+    matcher = EpisodeMatcher(CACHE_DIR)
+    # Get valid season directories
+    season_paths = get_valid_seasons(show_dir)
+    if not season_paths:
+        logger.warning(f"No seasons with .mkv files found")
+        return
+    if season is not None:
+        season_path = os.path.join(show_dir, f"Season {season}")
+        if season_path not in season_paths:
+            logger.warning(f"Season {season} has no .mkv files to process")
+            return
+        season_paths = [season_path]
+    # Process each season
+    for season_path in season_paths:
+        # Get MKV files that haven't been processed
+        mkv_files = [f for f in glob.glob(os.path.join(season_path, "*.mkv"))
+                    if not check_filename(f)]
+        if not mkv_files:
+            logger.info(f"No new files to process in {season_path}")
+            continue
+        # Create temp directories
+        temp_dir = Path(season_path) / "temp"
+        ocr_dir = Path(season_path) / "ocr"
+        temp_dir.mkdir(exist_ok=True)
+        ocr_dir.mkdir(exist_ok=True)
+        try:
+            unmatched_files = []
+            # First pass: Try speech recognition matching
+            for mkv_file in mkv_files:
+                logger.info(f"Attempting speech recognition match for {mkv_file}")
+                # Extract audio and run speech recognition
+                process_speech_to_text(mkv_file, str(temp_dir))
+                match = matcher.identify_episode(mkv_file, temp_dir)
+                if match and match['confidence'] >= matcher.min_confidence:
+                    # Rename the file
+                    new_name = f"{matcher.series_name} - S{match['season']:02d}E{match['episode']:02d}.mkv"
+                    new_path = os.path.join(season_path, new_name)
+                    logger.info(f"Speech matched {os.path.basename(mkv_file)} to {new_name} "
+                              f"(confidence: {match['confidence']:.2f})")
+                    if not dry_run:
+                        os.rename(mkv_file, new_path)
+                else:
+                    logger.info(f"Speech recognition match failed for {mkv_file}, will try OCR")
+                    unmatched_files.append(mkv_file)
+            # Second pass: Try OCR for unmatched files
+            if unmatched_files:
+                logger.info(f"Attempting OCR matching for {len(unmatched_files)} unmatched files")
+                # Convert files to SRT using OCR
+                convert_mkv_to_srt(season_path, unmatched_files)
+                # Process OCR results
+                reference_text_dict = process_reference_srt_files(matcher.series_name)
+                srt_text_dict = process_srt_files(str(ocr_dir))
+                # Compare and rename
+                compare_and_rename_files(
+                    srt_text_dict,
+                    reference_text_dict,
+                    dry_run=dry_run,
+                    min_confidence=0.1  # Lower threshold for OCR
+                )
+            # Download subtitles if requested
+            if get_subs:
+                show_id = fetch_show_id(matcher.series_name)
+                if show_id:
+                    seasons = {int(os.path.basename(p).split()[-1]) for p in season_paths}
+                    get_subtitles(show_id, seasons=seasons)
+        finally:
+            # Cleanup
+            if not dry_run:
+                shutil.rmtree(temp_dir)
+                cleanup_ocr_files(show_dir)
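
The rename step formats names with `S{season:02d}E{episode:02d}`, and `identify_episode` parses the same pattern back out of reference filenames with `S(\d+)E(\d+)`. A minimal sketch of that round trip follows, along with a hypothetical `safe_rename` helper (not in the release) that skips the rename when the target already exists. The guard is there because `os.rename` replaces an existing file on POSIX systems and raises an error on Windows.

```python
import re
from pathlib import Path


def episode_filename(series_name, season, episode):
    """Build the target name using the same format string as process_show."""
    return f"{series_name} - S{season:02d}E{episode:02d}.mkv"


def parse_season_episode(filename):
    """Recover (season, episode) with the regex used in identify_episode."""
    match = re.search(r'S(\d+)E(\d+)', Path(filename).stem)
    return tuple(map(int, match.groups())) if match else None


def safe_rename(src, dst, dry_run=False):
    """Hypothetical helper: rename src to dst unless dst already exists.

    Returns True when the rename was (or, in a dry run, would be) performed.
    """
    src, dst = Path(src), Path(dst)
    if dst.exists():
        return False
    if not dry_run:
        src.rename(dst)
    return True


name = episode_filename("Example Show", 1, 5)
print(name)                        # Example Show - S01E05.mkv
print(parse_season_episode(name))  # (1, 5)
```

Two files in one season that match the same episode would otherwise map to the same `new_path`, so a guard of this kind may be worth considering before the `os.rename` call.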

mkv_episode_matcher-0.3.0/mkv_episode_matcher/speech_to_text.py ADDED

@@ -0,0 +1,90 @@
+# mkv_episode_matcher/speech_to_text.py
+import os
+import subprocess
+from pathlib import Path
+import whisper
+import torch
+from loguru import logger
+def process_speech_to_text(mkv_file, output_dir):
+    """
+    Convert MKV file to transcript using Whisper.
+    Args:
+        mkv_file (str): Path to MKV file
+        output_dir (str): Directory to save transcript files
+    """
+    # Extract audio if not already done
+    wav_file = extract_audio(mkv_file, output_dir)
+    if not wav_file:
+        return None
+    # Load model
+    device = "cuda" if torch.cuda.is_available() else "cpu"
+    if device == "cuda":
+        logger.info(f"CUDA is available. Using GPU: {torch.cuda.get_device_name(0)}")
+    else:
+        logger.info("CUDA not available. Using CPU.")
+    model = whisper.load_model("base", device=device)
+    # Generate transcript
+    segments_file = os.path.join(output_dir, f"{Path(mkv_file).stem}.segments.json")
+    if not os.path.exists(segments_file):
+        try:
+            result = model.transcribe(
+                wav_file,
+                task="transcribe",
+                language="en",
+            )
+            # Save segments
+            import json
+            with open(segments_file, 'w', encoding='utf-8') as f:
+                json.dump(result["segments"], f, indent=2)
+            logger.info(f"Transcript saved to {segments_file}")
+        except Exception as e:
+            logger.error(f"Error during transcription: {e}")
+            return None
+    else:
+        logger.info(f"Using existing transcript: {segments_file}")
+    return segments_file
+def extract_audio(mkv_file, output_dir):
+    """
+    Extract audio from MKV file using FFmpeg.
+    Args:
+        mkv_file (str): Path to MKV file
+        output_dir (str): Directory to save WAV file
+    Returns:
+        str: Path to extracted WAV file
+    """
+    wav_file = os.path.join(output_dir, f"{Path(mkv_file).stem}.wav")
+    if not os.path.exists(wav_file):
+        logger.info(f"Extracting audio from {mkv_file}")
+        try:
+            cmd = [
+                'ffmpeg',
+                '-i', mkv_file,
+                '-vn',  # Disable video
+                '-acodec', 'pcm_s16le',  # Convert to PCM format
+                '-ar', '16000',  # Set sample rate to 16kHz
+                '-ac', '1',  # Convert to mono
+                wav_file
+            ]
+            subprocess.run(cmd, check=True, capture_output=True)
+            logger.info(f"Audio extracted to {wav_file}")
+        except subprocess.CalledProcessError as e:
+            logger.error(f"Error extracting audio: {e}")
+            return None
+    else:
+        logger.info(f"Audio file {wav_file} already exists, skipping extraction")
+    return wav_file
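
In `extract_audio`, only `subprocess.CalledProcessError` is caught, so a missing `ffmpeg` binary would surface as an uncaught `FileNotFoundError` from `subprocess.run`. The sketch below shows one way to guard against that, assuming the same FFmpeg arguments as the diff; `build_ffmpeg_cmd` and `extract_audio_checked` are hypothetical names, not functions in the package.

```python
import shutil
import subprocess
from pathlib import Path


def build_ffmpeg_cmd(mkv_file, wav_file, sample_rate=16000):
    """Return the FFmpeg argument list used in the diff: no video,
    16-bit PCM, mono, resampled to sample_rate Hz."""
    return [
        'ffmpeg',
        '-i', str(mkv_file),
        '-vn',
        '-acodec', 'pcm_s16le',
        '-ar', str(sample_rate),
        '-ac', '1',
        str(wav_file),
    ]


def extract_audio_checked(mkv_file, output_dir):
    """Hypothetical variant of extract_audio that returns None when
    ffmpeg is not on PATH instead of raising FileNotFoundError."""
    if shutil.which('ffmpeg') is None:
        return None
    wav_file = Path(output_dir) / f"{Path(mkv_file).stem}.wav"
    if not wav_file.exists():
        try:
            subprocess.run(build_ffmpeg_cmd(mkv_file, wav_file),
                           check=True, capture_output=True)
        except subprocess.CalledProcessError:
            return None
    return str(wav_file)


print(' '.join(build_ffmpeg_cmd('episode.mkv', 'episode.wav')))
```

Keeping the argument list in its own function makes the command easy to inspect or log without running FFmpeg.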
