PyPI - pdf-file-renamer - Versions diffs - 0.6.1__tar.gz → 0.6.3__tar.gz - Mend

pdf-file-renamer 0.6.1tar.gz → 0.6.3tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (48) hide show

pdf_file_renamer-0.6.3/CHANGELOG.md ADDED Viewed

@@ -0,0 +1,156 @@
+# Changelog
+All notable changes to this project will be documented in this file.
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [0.6.3] - 2025-10-14
+### Fixed
+- Fixed critical bug where pdf2doi extracted DOIs from citations instead of the paper's own DOI
+- Added DOI validation to verify metadata matches PDF content before accepting DOI
+- Prevents incorrect naming when papers don't have their own DOI but cite other papers
+### Added
+- DOI metadata validation against PDF first page content
+- Title similarity checking using SequenceMatcher
+- Configurable validation thresholds for DOI matching
+- Fallback to LLM-based naming when DOI validation fails
+### Changed
+- DOI extraction now validates that extracted metadata matches the actual PDF content
+- Improved accuracy by rejecting citation DOIs that don't match the paper's title
+- DOI validation checks title area (first ~300 characters) instead of full document
+## [0.6.2] - 2025-10-14
+### Added
+- Demo GIF showing pdf-renamer in action with live TUI
+- VHS recording infrastructure (demo.tape)
+- Automated demo creation scripts (create_demo_gif.py, record_demo.sh)
+- Comprehensive PyPI metadata and classifiers
+- Table of contents in README for better navigation
+- Architecture, Development, and Contributing sections in README
+- Project URLs for homepage, repository, issues, and changelog
+### Changed
+- Enhanced README with animated demo, better badges, and emoji icons
+- Improved PyPI discoverability with keywords and proper categorization
+- Updated description to highlight DOI-first approach and interactive mode
+## [0.6.1] - 2025-10-14
+### Fixed
+- Fixed DOI extractor incorrectly expecting list instead of dict from pdf2doi
+- Fixed JSON parsing for CrossRef API metadata instead of incorrect bibtex parsing
+- Fixed confidence enum handling causing AttributeError in workflow and formatters
+- Fixed linting errors (SIM105 - use contextlib.suppress)
+- Fixed mypy type checking errors in author extraction
+- Fixed code formatting issues
+### Changed
+- Improved DOI metadata extraction from CrossRef JSON structure
+- Enhanced type safety with explicit type annotations
+## [0.6.0] - 2025-10-14
+### Added
+- DOI-based naming feature using pdf2doi library
+- Automatic DOI extraction from academic papers
+- CrossRef API integration for rich metadata (title, authors, year, journal, publisher)
+- Hybrid naming strategy: DOI-first with LLM fallback
+- DOI metadata display in interactive prompts
+- Enhanced status display showing "DOI found" during processing
+### Changed
+- Improved filename generation with VERY_HIGH confidence for DOI-based names
+- Updated workflow to prioritize DOI extraction before LLM analysis
+- Enhanced reasoning messages to indicate DOI-based naming
+## [0.5.0] - 2025-10-12
+### Changed
+- Reorganized project to src layout structure (src/pdf_file_renamer)
+- Improved package organization following Python best practices
+- Updated all imports and references to new structure
+## [0.4.2] - 2025-10-12
+### Changed
+- Renamed package from `pdf_renamer` to `pdf-file-renamer` for PyPI
+- Updated package name across all configurations
+- Improved PyPI package metadata
+## [0.4.1] - 2025-10-12
+### Added
+- Initial PyPI publishing workflow
+- Automated releases via GitHub Actions
+## [0.4.0] - 2025-10-12
+### Added
+- Complete refactoring to Clean Architecture
+- Comprehensive unit tests with pytest
+- Type checking with mypy (strict mode)
+- Code quality with ruff linting and formatting
+- GitHub Actions CI/CD pipeline
+- Code coverage reporting with pytest-cov
+- Domain-driven design with clear separation of concerns
+- Port and adapter pattern for external dependencies
+### Changed
+- Reorganized codebase into domain, application, infrastructure, and presentation layers
+- Improved testability and maintainability
+- Enhanced documentation with architecture notes
+## [0.3.0] - 2025-10-11
+### Added
+- Enhanced interactive mode with retry, edit, and skip options
+- Multi-pass analysis for better accuracy
+- Focused metadata extraction for improved LLM context
+- Better error handling and recovery
+### Changed
+- Improved LLM prompting strategy
+- Enhanced user experience with clearer prompts
+- Better handling of edge cases
+## [0.2.0] - 2025-10-10
+### Added
+- Interactive mode for rename confirmation
+- Rich terminal UI with tables and colored output
+- Batch processing with progress tracking
+- Live status updates during processing
+### Changed
+- Simplified CLI by removing subcommand requirement
+- Improved PDF processing pipeline
+- Enhanced error messages
+## [0.1.0] - 2025-10-09
+### Added
+- Initial release
+- PDF text extraction using PyMuPDF
+- LLM-based filename generation (OpenAI and Ollama support)
+- Dry-run mode for safe testing
+- Basic CLI interface
+- Configuration via environment variables
+- Confidence scoring for suggestions
+- Support for custom output directories
+[0.6.3]: https://github.com/nostoslabs/pdf-renamer/compare/v0.6.2...v0.6.3
+[0.6.2]: https://github.com/nostoslabs/pdf-renamer/compare/v0.6.1...v0.6.2
+[0.6.1]: https://github.com/nostoslabs/pdf-renamer/compare/v0.6.0...v0.6.1
+[0.6.0]: https://github.com/nostoslabs/pdf-renamer/compare/v0.5.0...v0.6.0
+[0.5.0]: https://github.com/nostoslabs/pdf-renamer/compare/v0.4.2...v0.5.0
+[0.4.2]: https://github.com/nostoslabs/pdf-renamer/compare/v0.4.1...v0.4.2
+[0.4.1]: https://github.com/nostoslabs/pdf-renamer/compare/v0.4.0...v0.4.1
+[0.4.0]: https://github.com/nostoslabs/pdf-renamer/compare/v0.3.0...v0.4.0
+[0.3.0]: https://github.com/nostoslabs/pdf-renamer/compare/v0.2.0...v0.3.0
+[0.2.0]: https://github.com/nostoslabs/pdf-renamer/compare/v0.1.0...v0.2.0
+[0.1.0]: https://github.com/nostoslabs/pdf-renamer/releases/tag/v0.1.0

pdf_file_renamer-0.6.3/PKG-INFO ADDED Viewed

@@ -0,0 +1,444 @@
+Metadata-Version: 2.4
+Name: pdf-file-renamer
+Version: 0.6.3
+Summary: Intelligent PDF renaming using LLMs with DOI-based naming and interactive workflow
+Project-URL: Homepage, https://github.com/nostoslabs/pdf-renamer
+Project-URL: Repository, https://github.com/nostoslabs/pdf-renamer
+Project-URL: Issues, https://github.com/nostoslabs/pdf-renamer/issues
+Project-URL: Changelog, https://github.com/nostoslabs/pdf-renamer/blob/main/CHANGELOG.md
+Author-email: Nostos Labs <info@nostoslabs.com>
+License: MIT
+License-File: LICENSE
+Keywords: academic-papers,ai,automation,document-management,doi,file-organization,llm,pdf,rename
+Classifier: Development Status :: 4 - Beta
+Classifier: Environment :: Console
+Classifier: Intended Audience :: Education
+Classifier: Intended Audience :: End Users/Desktop
+Classifier: Intended Audience :: Science/Research
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Topic :: Office/Business :: Office Suites
+Classifier: Topic :: Scientific/Engineering
+Classifier: Topic :: Text Processing :: General
+Classifier: Topic :: Utilities
+Classifier: Typing :: Typed
+Requires-Python: >=3.11
+Requires-Dist: docling-core>=2.0.0
+Requires-Dist: docling-parse>=2.0.0
+Requires-Dist: pdf2doi>=1.7
+Requires-Dist: pydantic-ai>=1.0.17
+Requires-Dist: pydantic-settings>=2.7.1
+Requires-Dist: pydantic>=2.10.6
+Requires-Dist: pymupdf>=1.26.5
+Requires-Dist: python-dotenv>=1.1.1
+Requires-Dist: rich>=14.2.0
+Requires-Dist: tenacity>=9.0.0
+Requires-Dist: typer>=0.19.2
+Provides-Extra: dev
+Requires-Dist: mypy>=1.14.1; extra == 'dev'
+Requires-Dist: pytest-asyncio>=0.25.2; extra == 'dev'
+Requires-Dist: pytest-cov>=6.0.0; extra == 'dev'
+Requires-Dist: pytest-mock>=3.14.0; extra == 'dev'
+Requires-Dist: pytest>=8.3.4; extra == 'dev'
+Requires-Dist: ruff>=0.9.1; extra == 'dev'
+Description-Content-Type: text/markdown
+# PDF Renamer
+[![PyPI version](https://img.shields.io/pypi/v/pdf-file-renamer.svg)](https://pypi.org/project/pdf-file-renamer/)
+[![PyPI downloads](https://img.shields.io/pypi/dm/pdf-file-renamer.svg)](https://pypi.org/project/pdf-file-renamer/)
+[![Python](https://img.shields.io/pypi/pyversions/pdf-file-renamer.svg)](https://pypi.org/project/pdf-file-renamer/)
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+[![CI](https://github.com/nostoslabs/pdf-renamer/workflows/CI/badge.svg)](https://github.com/nostoslabs/pdf-renamer/actions)
+[![Code style: ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
+[![Type checked: mypy](https://img.shields.io/badge/type%20checked-mypy-blue.svg)](http://mypy-lang.org/)
+**Intelligent PDF file renaming using LLMs and DOI metadata.** Automatically generate clean, descriptive filenames for your PDF library.
+> 🚀 Works with **OpenAI**, **Ollama**, **LM Studio**, and any OpenAI-compatible API
+> 📚 **DOI-first** approach for academic papers - no API costs!
+> 🎯 **Interactive mode** with retry, edit, and skip options
+## Table of Contents
+- [Quick Example](#quick-example)
+- [Features](#features)
+- [Installation](#installation)
+- [Configuration](#configuration)
+- [Usage](#usage)
+- [Interactive Mode](#interactive-mode)
+- [How It Works](#how-it-works)
+- [Cost Considerations](#cost-considerations)
+- [Architecture](#architecture)
+- [Development](#development)
+- [Contributing](#contributing)
+- [License](#license)
+## Quick Example
+![Demo](demo.gif)
+Transform messy filenames into clean, organized ones:
+```
+Before:                               After:
+📄 paper_final_v3.pdf          →     Leroux-Analog-In-memory-Computing-2025.pdf
+📄 download (2).pdf            →     Ruiz-Why-Don-Trace-Requirements-2023.pdf
+📄 document.pdf                →     Raspail-Camp_of_the_Saints.pdf
+```
+**Live Progress Display:**
+```
+Processing 3 PDFs with max 3 concurrent API calls and 10 concurrent extractions
+╭─────────────────────────── 📊 Progress ───────────────────────────╮
+│ Total: 3 | Pending: 0 | Extracting: 0 | Analyzing: 0 | Complete: 3 │
+╰───────────────────────────────────────────────────────────────────╯
+╭───────────────────────────────────────────────────────────────────╮
+│ [██████████████████████████████████████████████] 100.0%           │
+╰───────────────────────────────────────────────────────────────────╯
+                          Processing Status
+┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
+┃ File               ┃ Stage ┃ Status   ┃ Details             ┃
+┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
+│ paper_final_v3.pdf │   ✓   │ Complete │ very_high           │
+│ download (2).pdf   │   ✓   │ Complete │ very_high (DOI)     │
+│ document.pdf       │   ✓   │ Complete │ high                │
+└────────────────────┴───────┴──────────┴─────────────────────┘
+```
+## Features
+- **🎓 DOI-based naming** - Automatically extracts DOI and fetches authoritative metadata for academic papers
+- **🧠 Advanced PDF parsing** using docling-parse for better structure-aware extraction
+- **👁️ OCR fallback** for scanned PDFs with low text content
+- **🎯 Smart LLM prompting** with multi-pass analysis for improved accuracy
+- **⚡ Hybrid approach** - Uses DOI metadata when available, falls back to LLM analysis for other documents
+- **📝 Standardized format** - Generates filenames like `Author-Topic-Year.pdf`
+- **🔍 Dry-run mode** to preview changes before applying
+- **💬 Enhanced interactive mode** with options to accept, manually edit, retry, or skip each file
+- **📊 Live progress display** with concurrent processing for speed
+- **⚙️ Configurable concurrency** limits for API calls and PDF extraction
+- **📦 Batch processing** of multiple PDFs with optional output directory
+## Installation
+### Quick Start (No Installation Required)
+```bash
+# Run directly with uvx
+uvx pdf-renamer --dry-run /path/to/pdfs
+```
+### Install from PyPI
+```bash
+# Using pip
+pip install pdf-file-renamer
+# Using uv
+uv pip install pdf-file-renamer
+```
+### Install from Source
+```bash
+# Clone and install
+git clone https://github.com/nostoslabs/pdf-renamer.git
+cd pdf-renamer
+uv sync
+```
+## Configuration
+Configure your LLM provider:
+**Option A: OpenAI (Cloud)**
+```bash
+cp .env.example .env
+# Edit .env and add your OPENAI_API_KEY
+```
+**Option B: Ollama or other local models**
+```bash
+# No API key needed for local models
+# Either set LLM_BASE_URL in .env or use --url flag
+echo "LLM_BASE_URL=http://patmos:11434/v1" > .env
+```
+## Usage
+### Quick Start
+```bash
+# Preview renames (dry-run mode)
+pdf-renamer --dry-run /path/to/pdf/directory
+# Actually rename files
+pdf-renamer --no-dry-run /path/to/pdf/directory
+# Interactive mode - review each file
+pdf-renamer --interactive --no-dry-run /path/to/pdf/directory
+```
+### Using uvx (No Installation)
+```bash
+# Run directly without installing
+uvx pdf-renamer --dry-run /path/to/pdfs
+# Run from GitHub
+uvx https://github.com/nostoslabs/pdf-renamer --dry-run /path/to/pdfs
+```
+### Options
+- `--dry-run/--no-dry-run`: Show suggestions without renaming (default: True)
+- `--interactive, -i`: Interactive mode with rich options:
+  - **Accept** - Use the suggested filename
+  - **Edit** - Manually modify the filename
+  - **Retry** - Ask the LLM to generate a new suggestion
+  - **Skip** - Skip this file and move to the next
+- `--model`: Model to use (default: llama3.2, works with any OpenAI-compatible API)
+- `--url`: Custom base URL for OpenAI-compatible APIs (default: http://localhost:11434/v1)
+- `--pattern`: Glob pattern for files (default: *.pdf)
+- `--output-dir, -o`: Move renamed files to a different directory
+- `--max-concurrent-api`: Maximum concurrent API calls (default: 3)
+- `--max-concurrent-pdf`: Maximum concurrent PDF extractions (default: 10)
+### Examples
+**Using OpenAI:**
+```bash
+# Preview all PDFs in current directory
+uvx pdf-renamer --dry-run .
+# Rename PDFs in specific directory
+uvx pdf-renamer --no-dry-run ~/Documents/Papers
+# Use a different OpenAI model
+uvx pdf-renamer --model gpt-4o --dry-run .
+```
+**Using Ollama (or other local models):**
+```bash
+# Using Ollama on patmos server with gemma model
+uvx pdf-renamer --url http://patmos:11434/v1 --model gemma3:latest --dry-run .
+# Using local Ollama with qwen model
+uvx pdf-renamer --url http://localhost:11434/v1 --model qwen2.5 --dry-run .
+# Set URL in environment and just use model flag
+export LLM_BASE_URL=http://patmos:11434/v1
+uvx pdf-renamer --model gemma3:latest --dry-run .
+```
+**Other examples:**
+```bash
+# Process only specific files
+uvx pdf-renamer --pattern "*2020*.pdf" --dry-run .
+# Interactive mode with local model
+uvx pdf-renamer --url http://patmos:11434/v1 --model gemma3:latest --interactive --no-dry-run .
+# Run directly from GitHub
+uvx https://github.com/nostoslabs/pdf-renamer --no-dry-run ~/Documents/Papers
+```
+## Interactive Mode
+When using `--interactive` mode, you'll be presented with each file one at a time with detailed options:
+```
+================================================================================
+Original: 2024-research-paper.pdf
+Suggested: Smith-Machine-Learning-Applications-2024.pdf
+Confidence: high
+Reasoning: Clear author and topic identified from abstract
+================================================================================
+Options:
+  y / yes / Enter - Accept suggested name
+  e / edit - Manually edit the filename
+  r / retry - Ask LLM to generate a new suggestion
+  n / no / skip - Skip this file
+What would you like to do? [y]:
+```
+This mode is perfect for:
+- **Reviewing suggestions** before applying them
+- **Fine-tuning filenames** that are close but not quite right
+- **Retrying** when the LLM suggestion isn't good enough
+- **Building confidence** in the tool before batch processing
+You can use interactive mode with `--dry-run` to preview without actually renaming files, or with `--no-dry-run` to apply changes immediately after confirmation.
+## How It Works
+### Intelligent Hybrid Approach
+The tool uses a multi-strategy approach to generate accurate filenames:
+1. **DOI Detection** (for academic papers)
+   - Searches PDF for DOI identifiers using [pdf2doi](https://github.com/MicheleCotrufo/pdf2doi)
+   - **Validates DOI metadata** against PDF content to prevent citation DOI mismatches
+   - If found and validated, queries authoritative metadata (title, authors, year, journal)
+   - Generates filename with **very high confidence** from validated metadata
+   - **Saves API costs** - no LLM call needed for papers with DOIs
+2. **LLM Analysis** (fallback for non-academic PDFs)
+   - **Extract**: Uses docling-parse to read first 5 pages with structure-aware parsing, falls back to PyMuPDF if needed
+   - **OCR**: Automatically applies OCR for scanned PDFs with minimal text
+   - **Metadata Enhancement**: Extracts focused hints (years, emails, author sections) to supplement unreliable PDF metadata
+   - **Analyze**: Sends full content excerpt to LLM with enhanced metadata and detailed extraction instructions
+   - **Multi-pass Review**: Low-confidence results trigger a second analysis pass with focused prompts
+   - **Suggest**: LLM returns filename in `Author-Topic-Year` format with confidence level and reasoning
+3. **Interactive Review** (optional): User can accept, edit, retry, or skip each suggestion
+4. **Rename**: Applies suggestions (if not in dry-run mode)
+### Benefits of DOI Integration
+- **Accuracy**: DOI metadata is canonical and verified
+- **Speed**: Instant lookup vs. LLM processing time
+- **Cost**: Free DOI lookups save on API costs for academic papers
+- **Reliability**: Works even when PDF text extraction is poor
+## Cost Considerations
+**DOI-based Naming (Academic Papers):**
+- **Completely free** - No API costs
+- **No LLM needed** - Direct metadata lookup
+- Works for most academic papers with embedded DOIs
+**OpenAI (Fallback):**
+- Uses `gpt-4o-mini` by default (very cost-effective)
+- Only called when DOI not found
+- Processes first ~4500 characters per PDF
+- Typical cost: ~$0.001-0.003 per PDF
+**Ollama/Local Models:**
+- Completely free (runs on your hardware)
+- Works with any Ollama model (llama3, qwen2.5, mistral, etc.)
+- Also compatible with LM Studio, vLLM, and other OpenAI-compatible endpoints
+## Filename Format
+The tool generates filenames in this format:
+- `Smith-Kalman-Filtering-Applications-2020.pdf`
+- `Adamy-Electronic-Warfare-Modeling-Techniques.pdf`
+- `Blair-Monopulse-Processing-Unresolved-Targets.pdf`
+Guidelines:
+- First author's last name
+- 3-6 word topic description (prioritizes clarity over brevity)
+- Year (if identifiable)
+- Hyphens between words
+- Target ~80 characters (can be longer if needed for clarity)
+## Architecture
+This project follows **Clean Architecture** principles with clear separation of concerns:
+```
+src/pdf_file_renamer/
+├── domain/          # Core business logic (models, ports)
+├── application/     # Use cases and workflows
+├── infrastructure/  # External integrations (PDF, LLM, DOI)
+└── presentation/    # CLI and UI components
+```
+**Key Design Patterns:**
+- **Ports and Adapters** - Clean interfaces for external dependencies
+- **Dependency Injection** - Flexible component composition
+- **Single Responsibility** - Each module has one clear purpose
+- **Type Safety** - Full mypy strict mode compliance
+See [REFACTORING_SUMMARY.md](REFACTORING_SUMMARY.md) for detailed architecture notes.
+## Development
+### Setup
+```bash
+# Clone repository
+git clone https://github.com/nostoslabs/pdf-renamer.git
+cd pdf-renamer
+# Install dependencies with uv
+uv sync
+# Run tests
+uv run pytest
+# Run linting
+uv run ruff check src/ tests/
+# Run type checking
+uv run mypy src/
+```
+### Code Quality
+- **Tests**: pytest with async support and coverage reporting
+- **Linting**: ruff for fast, comprehensive linting
+- **Formatting**: ruff format for consistent code style
+- **Type Checking**: mypy in strict mode
+- **CI/CD**: GitHub Actions for automated testing and releases
+### Running Locally
+```bash
+# Run with local changes
+uv run pdf-file-renamer --dry-run /path/to/pdfs
+# Run specific module
+uv run python -m pdf_file_renamer.main --help
+```
+## Contributing
+Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
+### Development Workflow
+1. Fork the repository
+2. Create a feature branch (`git checkout -b feature/amazing-feature`)
+3. Make your changes
+4. Run tests and linting (`uv run pytest && uv run ruff check src/`)
+5. Commit your changes (`git commit -m 'Add amazing feature'`)
+6. Push to the branch (`git push origin feature/amazing-feature`)
+7. Open a Pull Request
+### Code Style
+- Follow PEP 8 (enforced by ruff)
+- Use type hints for all functions
+- Write tests for new features
+- Update documentation as needed
+## License
+This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
+## Acknowledgments
+- [pdf2doi](https://github.com/MicheleCotrufo/pdf2doi) for DOI extraction
+- [pydantic-ai](https://ai.pydantic.dev/) for LLM integration
+- [docling-parse](https://github.com/DS4SD/docling-parse) for advanced PDF parsing
+- [PyMuPDF](https://pymupdf.readthedocs.io/) for PDF text extraction
+- [rich](https://rich.readthedocs.io/) for beautiful terminal UI
+## Support
+- **Issues**: [GitHub Issues](https://github.com/nostoslabs/pdf-renamer/issues)
+- **Discussions**: [GitHub Discussions](https://github.com/nostoslabs/pdf-renamer/discussions)
+- **Changelog**: [CHANGELOG.md](CHANGELOG.md)
+---
+**Made with ❤️ by [Nostos Labs](https://github.com/nostoslabs)**

pdf-file-renamer 0.6.1__tar.gz → 0.6.3__tar.gz

pdf-file-renamer 0.6.1tar.gz → 0.6.3tar.gz