PyPI - pdf-file-renamer - Versions diffs - 0.6.2__tar.gz → 0.6.3__tar.gz - Mend

Files changed (45)

{pdf_file_renamer-0.6.2 → pdf_file_renamer-0.6.3}/CHANGELOG.md RENAMED

@@ -5,6 +5,24 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [0.6.3] - 2025-10-14
+### Fixed
+- Fixed critical bug where pdf2doi extracted DOIs from citations instead of the paper's own DOI
+- Added DOI validation to verify metadata matches PDF content before accepting DOI
+- Prevents incorrect naming when papers don't have their own DOI but cite other papers
+### Added
+- DOI metadata validation against PDF first page content
+- Title similarity checking using SequenceMatcher
+- Configurable validation thresholds for DOI matching
+- Fallback to LLM-based naming when DOI validation fails
+### Changed
+- DOI extraction now validates that extracted metadata matches the actual PDF content
+- Improved accuracy by rejecting citation DOIs that don't match the paper's title
+- DOI validation checks title area (first ~300 characters) instead of full document
 ## [0.6.2] - 2025-10-14
 ### Added
@@ -125,6 +143,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Confidence scoring for suggestions
 - Support for custom output directories
+[0.6.3]: https://github.com/nostoslabs/pdf-renamer/compare/v0.6.2...v0.6.3
 [0.6.2]: https://github.com/nostoslabs/pdf-renamer/compare/v0.6.1...v0.6.2
 [0.6.1]: https://github.com/nostoslabs/pdf-renamer/compare/v0.6.0...v0.6.1
 [0.6.0]: https://github.com/nostoslabs/pdf-renamer/compare/v0.5.0...v0.6.0
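The title check described in the 0.6.3 entries can be sketched roughly as follows. This is a minimal illustration, not the package's code: the function name `title_matches`, its parameters, and its defaults are assumptions chosen to mirror the changelog wording (a similarity threshold and a title area of about 300 characters).

```python
from difflib import SequenceMatcher


def title_matches(doi_title: str, first_page_text: str,
                  threshold: float = 0.3, title_area_chars: int = 300) -> bool:
    """Return True if the DOI's title plausibly belongs to this PDF.

    Hypothetical helper: name, parameters, and defaults are illustrative.
    """
    title = doi_title.lower()
    page = first_page_text.lower()

    # Cheapest check first: the exact title appears on the first page.
    if title in page:
        return True

    # Otherwise compare against the title area only, so that titles listed
    # further down the page (for example in references) do not count.
    title_area = page[:title_area_chars]
    return SequenceMatcher(None, title, title_area).ratio() >= threshold


own_page = "Deep Learning for Protein Folding\nJane Doe\nAbstract ..."
print(title_matches("Deep Learning for Protein Folding", own_page))  # True
print(title_matches("xyz", "abc"))  # False: no characters in common
```

`SequenceMatcher.ratio()` is computed as 2·M/T, where M is the number of matched characters and T is the combined length of both strings. Against a full 300-character title area, a title of n characters can therefore score at most 2n/(n+300), so a title shorter than about 53 characters cannot reach 0.3 through this comparison alone and has to pass through the exact-substring check or another fallback.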

{pdf_file_renamer-0.6.2 → pdf_file_renamer-0.6.3}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: pdf-file-renamer
-Version: 0.6.2
+Version: 0.6.3
 Summary: Intelligent PDF renaming using LLMs with DOI-based naming and interactive workflow
 Project-URL: Homepage, https://github.com/nostoslabs/pdf-renamer
 Project-URL: Repository, https://github.com/nostoslabs/pdf-renamer
@@ -285,7 +285,8 @@ The tool uses a multi-strategy approach to generate accurate filenames:
 1. **DOI Detection** (for academic papers)
    - Searches PDF for DOI identifiers using [pdf2doi](https://github.com/MicheleCotrufo/pdf2doi)
-   - If found, queries authoritative metadata (title, authors, year, journal)
+   - **Validates DOI metadata** against PDF content to prevent citation DOI mismatches
+   - If found and validated, queries authoritative metadata (title, authors, year, journal)
    - Generates filename with **very high confidence** from validated metadata
    - **Saves API costs** - no LLM call needed for papers with DOIs

{pdf_file_renamer-0.6.2 → pdf_file_renamer-0.6.3}/README.md RENAMED

@@ -237,7 +237,8 @@ The tool uses a multi-strategy approach to generate accurate filenames:
 1. **DOI Detection** (for academic papers)
    - Searches PDF for DOI identifiers using [pdf2doi](https://github.com/MicheleCotrufo/pdf2doi)
-   - If found, queries authoritative metadata (title, authors, year, journal)
+   - **Validates DOI metadata** against PDF content to prevent citation DOI mismatches
+   - If found and validated, queries authoritative metadata (title, authors, year, journal)
    - Generates filename with **very high confidence** from validated metadata
    - **Saves API costs** - no LLM call needed for papers with DOIs

{pdf_file_renamer-0.6.2 → pdf_file_renamer-0.6.3}/coverage.xml RENAMED

@@ -1,5 +1,5 @@
 <?xml version="1.0" ?>
-<coverage version="7.10.7" timestamp="1760490556851" lines-valid="728" lines-covered="142" line-rate="0.1951" branches-covered="0" branches-valid="0" branch-rate="0" complexity="0">
+<coverage version="7.10.7" timestamp="1760499723997" lines-valid="769" lines-covered="142" line-rate="0.1847" branches-covered="0" branches-valid="0" branch-rate="0" complexity="0">
 	<!-- Generated by coverage.py: https://coverage.readthedocs.io/en/7.10.7 -->
 	<!-- Based on https://raw.githubusercontent.com/cobertura/web/master/htdocs/xml/coverage-04.dtd -->
 	<sources>
@@ -329,44 +329,40 @@
 						<line number="4" hits="0"/>
 						<line number="5" hits="0"/>
 						<line number="6" hits="0"/>
-						<line number="8" hits="0"/>
-						<line number="10" hits="0"/>
+						<line number="7" hits="0"/>
+						<line number="9" hits="0"/>
 						<line number="11" hits="0"/>
-						<line number="14" hits="0"/>
-						<line number="17" hits="0"/>
-						<line number="20" hits="0"/>
-						<line number="22" hits="0"/>
-						<line number="32" hits="0"/>
-						<line number="34" hits="0"/>
-						<line number="35" hits="0"/>
-						<line number="38" hits="0"/>
-						<line number="39" hits="0"/>
-						<line number="42" hits="0"/>
+						<line number="12" hits="0"/>
+						<line number="15" hits="0"/>
+						<line number="18" hits="0"/>
+						<line number="27" hits="0"/>
+						<line number="28" hits="0"/>
+						<line number="29" hits="0"/>
+						<line number="31" hits="0"/>
+						<line number="41" hits="0"/>
 						<line number="43" hits="0"/>
 						<line number="44" hits="0"/>
-						<line number="46" hits="0"/>
 						<line number="47" hits="0"/>
 						<line number="48" hits="0"/>
 						<line number="51" hits="0"/>
-						<line number="54" hits="0"/>
+						<line number="52" hits="0"/>
+						<line number="53" hits="0"/>
+						<line number="55" hits="0"/>
 						<line number="56" hits="0"/>
 						<line number="57" hits="0"/>
-						<line number="58" hits="0"/>
-						<line number="59" hits="0"/>
-						<line number="62" hits="0"/>
+						<line number="60" hits="0"/>
+						<line number="63" hits="0"/>
 						<line number="65" hits="0"/>
 						<line number="66" hits="0"/>
 						<line number="67" hits="0"/>
 						<line number="68" hits="0"/>
-						<line number="69" hits="0"/>
-						<line number="70" hits="0"/>
 						<line number="71" hits="0"/>
-						<line number="72" hits="0"/>
-						<line number="73" hits="0"/>
 						<line number="74" hits="0"/>
 						<line number="75" hits="0"/>
 						<line number="76" hits="0"/>
 						<line number="77" hits="0"/>
+						<line number="78" hits="0"/>
+						<line number="79" hits="0"/>
 						<line number="80" hits="0"/>
 						<line number="81" hits="0"/>
 						<line number="82" hits="0"/>
@@ -375,28 +371,73 @@
 						<line number="85" hits="0"/>
 						<line number="86" hits="0"/>
 						<line number="89" hits="0"/>
+						<line number="90" hits="0"/>
+						<line number="91" hits="0"/>
 						<line number="92" hits="0"/>
+						<line number="93" hits="0"/>
 						<line number="94" hits="0"/>
-						<line number="104" hits="0"/>
-						<line number="106" hits="0"/>
-						<line number="108" hits="0"/>
+						<line number="95" hits="0"/>
+						<line number="98" hits="0"/>
+						<line number="101" hits="0"/>
+						<line number="103" hits="0"/>
+						<line number="114" hits="0"/>
+						<line number="116" hits="0"/>
+						<line number="117" hits="0"/>
 						<line number="119" hits="0"/>
-						<line number="120" hits="0"/>
+						<line number="121" hits="0"/>
 						<line number="123" hits="0"/>
-						<line number="124" hits="0"/>
-						<line number="126" hits="0"/>
+						<line number="125" hits="0"/>
 						<line number="127" hits="0"/>
-						<line number="129" hits="0"/>
-						<line number="131" hits="0"/>
-						<line number="141" hits="0"/>
+						<line number="138" hits="0"/>
+						<line number="139" hits="0"/>
 						<line number="142" hits="0"/>
+						<line number="143" hits="0"/>
 						<line number="145" hits="0"/>
 						<line number="146" hits="0"/>
 						<line number="148" hits="0"/>
-						<line number="149" hits="0"/>
-						<line number="151" hits="0"/>
-						<line number="154" hits="0"/>
+						<line number="150" hits="0"/>
 						<line number="160" hits="0"/>
+						<line number="161" hits="0"/>
+						<line number="164" hits="0"/>
+						<line number="165" hits="0"/>
+						<line number="167" hits="0"/>
+						<line number="168" hits="0"/>
+						<line number="170" hits="0"/>
+						<line number="173" hits="0"/>
+						<line number="179" hits="0"/>
+						<line number="181" hits="0"/>
+						<line number="191" hits="0"/>
+						<line number="192" hits="0"/>
+						<line number="194" hits="0"/>
+						<line number="196" hits="0"/>
+						<line number="197" hits="0"/>
+						<line number="198" hits="0"/>
+						<line number="199" hits="0"/>
+						<line number="200" hits="0"/>
+						<line number="202" hits="0"/>
+						<line number="203" hits="0"/>
+						<line number="204" hits="0"/>
+						<line number="206" hits="0"/>
+						<line number="220" hits="0"/>
+						<line number="222" hits="0"/>
+						<line number="225" hits="0"/>
+						<line number="226" hits="0"/>
+						<line number="229" hits="0"/>
+						<line number="230" hits="0"/>
+						<line number="234" hits="0"/>
+						<line number="235" hits="0"/>
+						<line number="237" hits="0"/>
+						<line number="238" hits="0"/>
+						<line number="242" hits="0"/>
+						<line number="243" hits="0"/>
+						<line number="244" hits="0"/>
+						<line number="247" hits="0"/>
+						<line number="248" hits="0"/>
+						<line number="250" hits="0"/>
+						<line number="252" hits="0"/>
+						<line number="263" hits="0"/>
+						<line number="293" hits="0"/>
+						<line number="296" hits="0"/>
 					</lines>
 				</class>
 			</classes>

{pdf_file_renamer-0.6.2 → pdf_file_renamer-0.6.3}/pyproject.toml RENAMED

@@ -1,6 +1,6 @@
 [project]
 name = "pdf-file-renamer"
-version = "0.6.2"
+version = "0.6.3"
 description = "Intelligent PDF renaming using LLMs with DOI-based naming and interactive workflow"
 readme = "README.md"
 requires-python = ">=3.11"

{pdf_file_renamer-0.6.2 → pdf_file_renamer-0.6.3}/src/pdf_file_renamer/infrastructure/doi/pdf2doi_extractor.py RENAMED

@@ -3,6 +3,7 @@
 import asyncio
 import contextlib
 import re
+from difflib import SequenceMatcher
 from pathlib import Path
 import pdf2doi
@@ -14,10 +15,18 @@ from pdf_file_renamer.domain.ports import DOIExtractor
 class PDF2DOIExtractor(DOIExtractor):
     """Extract DOI from PDF files using pdf2doi library."""
-    def __init__(self) -> None:
-        """Initialize the PDF2DOI extractor."""
+    def __init__(self, validate_match: bool = True, similarity_threshold: float = 0.3) -> None:
+        """
+        Initialize the PDF2DOI extractor.
+        Args:
+            validate_match: Whether to validate that DOI metadata matches PDF content
+            similarity_threshold: Minimum similarity score (0-1) for title validation
+        """
         # Suppress pdf2doi verbose output
         pdf2doi.config.set("verbose", False)
+        self.validate_match = validate_match
+        self.similarity_threshold = similarity_threshold
     async def extract_doi(self, pdf_path: Path) -> DOIMetadata | None:
         """
@@ -91,7 +100,7 @@ class PDF2DOIExtractor(DOIExtractor):
             # Extract publisher
             publisher = metadata.get("publisher")
-            return DOIMetadata(
+            doi_metadata = DOIMetadata(
                 doi=identifier,
                 title=title,
                 authors=authors,
@@ -101,6 +110,16 @@ class PDF2DOIExtractor(DOIExtractor):
                 raw_bibtex=validation_info if validation_info else None,
             )
+            # Validate that the DOI metadata matches the PDF content
+            if self.validate_match:
+                # Extract first page text from PDF to check for title match
+                pdf_text = await self._extract_pdf_first_page(pdf_path)
+                if not self._validate_doi_matches_pdf(doi_metadata, pdf_text):
+                    # DOI doesn't match - likely a citation DOI, not the paper's DOI
+                    return None
+            return doi_metadata
         except Exception:
             # Silently fail - DOI extraction is opportunistic
             return None
@@ -158,3 +177,120 @@ class PDF2DOIExtractor(DOIExtractor):
         ]
         return authors if authors else None
+    async def _extract_pdf_first_page(self, pdf_path: Path) -> str:
+        """
+        Extract text from the first page of a PDF.
+        Args:
+            pdf_path: Path to PDF file
+        Returns:
+            Text from first page (empty string if extraction fails)
+        """
+        try:
+            import fitz  # PyMuPDF
+            loop = asyncio.get_event_loop()
+            def extract() -> str:
+                with fitz.open(pdf_path) as doc:
+                    if len(doc) > 0:
+                        return doc[0].get_text()
+                return ""
+            return await loop.run_in_executor(None, extract)
+        except Exception:
+            return ""
+    def _validate_doi_matches_pdf(self, doi_metadata: DOIMetadata, pdf_text: str) -> bool:
+        """
+        Validate that DOI metadata matches the PDF content.
+        This checks if the title from the DOI metadata appears in the PDF text
+        (particularly the first page, where the title should be).
+        Args:
+            doi_metadata: DOI metadata to validate
+            pdf_text: Text from PDF first page (not full document!)
+        Returns:
+            True if metadata appears to match PDF, False otherwise
+        """
+        if not doi_metadata.title or not pdf_text:
+            # If we can't validate, assume it's valid (fail open)
+            return True
+        # Normalize text for comparison
+        pdf_text_lower = pdf_text.lower()
+        title_lower = doi_metadata.title.lower()
+        # Check if the full title appears in the PDF text
+        if title_lower in pdf_text_lower:
+            return True
+        # Check similarity using SequenceMatcher on first ~300 chars (title area)
+        # Most paper titles appear in the first few hundred characters
+        title_area = pdf_text_lower[:300]
+        similarity = SequenceMatcher(None, title_lower, title_area).ratio()
+        if similarity >= self.similarity_threshold:
+            return True
+        # Check if significant words from title appear in the title area ONLY
+        # This prevents matching citation DOIs from the references section
+        title_words = self._extract_significant_words(title_lower)
+        if not title_words:
+            return True  # Can't validate, fail open
+        # Require at least 70% of significant words to appear in the title area
+        matches = sum(1 for word in title_words if word in title_area)
+        match_ratio = matches / len(title_words)
+        return match_ratio >= 0.7
+    def _extract_significant_words(self, text: str) -> list[str]:
+        """
+        Extract significant words from text (removing common words).
+        Args:
+            text: Input text
+        Returns:
+            List of significant words
+        """
+        # Common words to skip
+        stop_words = {
+            "a",
+            "an",
+            "the",
+            "and",
+            "or",
+            "but",
+            "in",
+            "on",
+            "at",
+            "to",
+            "for",
+            "of",
+            "with",
+            "by",
+            "from",
+            "as",
+            "is",
+            "was",
+            "are",
+            "were",
+            "been",
+            "be",
+            "this",
+            "that",
+            "these",
+            "those",
+        }
+        # Extract words (alphanumeric only)
+        words = re.findall(r"\b\w+\b", text.lower())
+        # Filter stop words and short words
+        return [w for w in words if w not in stop_words and len(w) > 3]
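The word-overlap fallback at the end of `_validate_doi_matches_pdf` can be exercised in isolation. The sketch below is a standalone approximation, not the shipped code: `significant_words` and `word_match_ratio` are hypothetical names, and the stop-word set is abbreviated. It keeps two behaviours visible in the diff: words of three characters or fewer are dropped, and each remaining word is tested with substring containment against the title area.

```python
import re

# Abbreviated stop-word set for illustration; the diff lists more entries.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "for", "in", "on", "to", "with"}


def significant_words(text: str) -> list[str]:
    """Lowercase words that are not stop words and are longer than 3 chars."""
    words = re.findall(r"\b\w+\b", text.lower())
    return [w for w in words if w not in STOP_WORDS and len(w) > 3]


def word_match_ratio(title: str, title_area: str) -> float:
    """Fraction of the title's significant words found in the title area."""
    words = significant_words(title)
    if not words:
        return 1.0  # nothing to check; mirrors the fail-open choice in the diff
    area = title_area.lower()
    # Substring containment, as in the diff: "model" also matches "remodeling".
    return sum(1 for w in words if w in area) / len(words)


area = "A Survey of Graph Databases\nJane Doe"
print(word_match_ratio("A Survey of Graph Databases", area))  # 1.0
print(word_match_ratio("Attention Is All You Need", area))    # 0.0
print(word_match_ratio("Graph Model", "Remodeling Graphs"))   # 1.0
```

With the 0.7 cutoff used in the diff, the first title is accepted and the second is rejected. The third call shows that substring matching can accept words that only appear inside longer words; matching against tokenized words would be stricter, at the cost of missing plurals and hyphenated forms.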