pdf-file-renamer 0.5.0__tar.gz → 0.6.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (41)
  1. {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/PKG-INFO +38 -10
  2. {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/README.md +36 -9
  3. {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/coverage.xml +273 -118
  4. {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/pyproject.toml +3 -1
  5. {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/src/pdf_file_renamer/__init__.py +1 -1
  6. pdf_file_renamer-0.6.1/src/pdf_file_renamer/application/filename_service.py +172 -0
  7. {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/src/pdf_file_renamer/application/pdf_rename_workflow.py +35 -4
  8. {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/src/pdf_file_renamer/domain/models.py +29 -0
  9. {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/src/pdf_file_renamer/domain/ports.py +18 -1
  10. pdf_file_renamer-0.6.1/src/pdf_file_renamer/infrastructure/doi/__init__.py +5 -0
  11. pdf_file_renamer-0.6.1/src/pdf_file_renamer/infrastructure/doi/pdf2doi_extractor.py +163 -0
  12. {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/src/pdf_file_renamer/presentation/cli.py +5 -0
  13. {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/src/pdf_file_renamer/presentation/formatters.py +15 -3
  14. pdf_file_renamer-0.5.0/src/pdf_file_renamer/application/filename_service.py +0 -70
  15. {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/.env.example +0 -0
  16. {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/.github/workflows/ci.yml +0 -0
  17. {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/.github/workflows/release.yml +0 -0
  18. {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/.gitignore +0 -0
  19. {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/.python-version +0 -0
  20. {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/LICENSE +0 -0
  21. {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/REFACTORING_SUMMARY.md +0 -0
  22. {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/src/pdf_file_renamer/application/__init__.py +0 -0
  23. {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/src/pdf_file_renamer/application/rename_service.py +0 -0
  24. {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/src/pdf_file_renamer/domain/__init__.py +0 -0
  25. {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/src/pdf_file_renamer/infrastructure/__init__.py +0 -0
  26. {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/src/pdf_file_renamer/infrastructure/config.py +0 -0
  27. {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/src/pdf_file_renamer/infrastructure/llm/__init__.py +0 -0
  28. {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/src/pdf_file_renamer/infrastructure/llm/pydantic_ai_provider.py +0 -0
  29. {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/src/pdf_file_renamer/infrastructure/pdf/__init__.py +0 -0
  30. {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/src/pdf_file_renamer/infrastructure/pdf/composite.py +0 -0
  31. {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/src/pdf_file_renamer/infrastructure/pdf/docling_extractor.py +0 -0
  32. {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/src/pdf_file_renamer/infrastructure/pdf/pymupdf_extractor.py +0 -0
  33. {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/src/pdf_file_renamer/main.py +0 -0
  34. {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/src/pdf_file_renamer/presentation/__init__.py +0 -0
  35. {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/tests/__init__.py +0 -0
  36. {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/tests/data/2025-dennis-managing-complexity.pdf +0 -0
  37. {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/tests/data/Camp_of_the_Saints.pdf +0 -0
  38. {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/tests/data/s43588-025-00854-1.pdf +0 -0
  39. {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/tests/test_domain_models.py +0 -0
  40. {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/tests/test_filename_service.py +0 -0
  41. {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/tests/test_rename_service.py +0 -0
@@ -1,11 +1,12 @@
  Metadata-Version: 2.4
  Name: pdf-file-renamer
- Version: 0.5.0
+ Version: 0.6.1
  Summary: Intelligent PDF renaming using LLMs
  License-File: LICENSE
  Requires-Python: >=3.11
  Requires-Dist: docling-core>=2.0.0
  Requires-Dist: docling-parse>=2.0.0
+ Requires-Dist: pdf2doi>=1.7
  Requires-Dist: pydantic-ai>=1.0.17
  Requires-Dist: pydantic-settings>=2.7.1
  Requires-Dist: pydantic>=2.10.6
@@ -43,9 +44,11 @@ Intelligent PDF file renaming using LLMs. This tool analyzes PDF content and met
 
  ## Features
 
+ - **DOI-based naming** - Automatically extracts DOI and fetches authoritative metadata for academic papers
  - **Advanced PDF parsing** using docling-parse for better structure-aware extraction
  - **OCR fallback** for scanned PDFs with low text content
  - **Smart LLM prompting** with multi-pass analysis for improved accuracy
+ - **Hybrid approach** - Uses DOI metadata when available, falls back to LLM analysis for other documents
  - Suggests filenames in format: `Author-Topic-Year.pdf`
  - Dry-run mode to preview changes before applying
  - **Enhanced interactive mode** with options to accept, manually edit, retry, or skip each file
@@ -208,19 +211,44 @@ You can use interactive mode with `--dry-run` to preview without actually renami
 
  ## How It Works
 
- 1. **Extract**: Uses docling-parse to read first 5 pages with structure-aware parsing, falls back to PyMuPDF if needed
- 2. **OCR**: Automatically applies OCR for scanned PDFs with minimal text
- 3. **Metadata Enhancement**: Extracts focused hints (years, emails, author sections) to supplement unreliable PDF metadata
- 4. **Analyze**: Sends full content excerpt to LLM with enhanced metadata and detailed extraction instructions
- 5. **Multi-pass Review**: Low-confidence results trigger a second analysis pass with focused prompts
- 6. **Suggest**: LLM returns filename in `Author-Topic-Year` format with confidence level and reasoning
- 7. **Interactive Review** (optional): User can accept, edit, retry, or skip each suggestion
- 8. **Rename**: Applies suggestions (if not in dry-run mode)
+ ### Intelligent Hybrid Approach
+
+ The tool uses a multi-strategy approach to generate accurate filenames:
+
+ 1. **DOI Detection** (for academic papers)
+ - Searches PDF for DOI identifiers using [pdf2doi](https://github.com/MicheleCotrufo/pdf2doi)
+ - If found, queries authoritative metadata (title, authors, year, journal)
+ - Generates filename with **very high confidence** from validated metadata
+ - **Saves API costs** - no LLM call needed for papers with DOIs
+
+ 2. **LLM Analysis** (fallback for non-academic PDFs)
+ - **Extract**: Uses docling-parse to read first 5 pages with structure-aware parsing, falls back to PyMuPDF if needed
+ - **OCR**: Automatically applies OCR for scanned PDFs with minimal text
+ - **Metadata Enhancement**: Extracts focused hints (years, emails, author sections) to supplement unreliable PDF metadata
+ - **Analyze**: Sends full content excerpt to LLM with enhanced metadata and detailed extraction instructions
+ - **Multi-pass Review**: Low-confidence results trigger a second analysis pass with focused prompts
+ - **Suggest**: LLM returns filename in `Author-Topic-Year` format with confidence level and reasoning
+
+ 3. **Interactive Review** (optional): User can accept, edit, retry, or skip each suggestion
+ 4. **Rename**: Applies suggestions (if not in dry-run mode)
+
+ ### Benefits of DOI Integration
+
+ - **Accuracy**: DOI metadata is canonical and verified
+ - **Speed**: Instant lookup vs. LLM processing time
+ - **Cost**: Free DOI lookups save on API costs for academic papers
+ - **Reliability**: Works even when PDF text extraction is poor
 
  ## Cost Considerations
 
- **OpenAI:**
+ **DOI-based Naming (Academic Papers):**
+ - **Completely free** - No API costs
+ - **No LLM needed** - Direct metadata lookup
+ - Works for most academic papers with embedded DOIs
+
+ **OpenAI (Fallback):**
  - Uses `gpt-4o-mini` by default (very cost-effective)
+ - Only called when DOI not found
  - Processes first ~4500 characters per PDF
  - Typical cost: ~$0.001-0.003 per PDF
 
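Reviewer note: the new "DOI Detection" step in the README text says the tool searches each PDF for a DOI identifier via pdf2doi. As a rough illustration of what "searching text for a DOI" can look like, here is a minimal sketch using a commonly cited Crossref-style pattern for modern DOIs. The function name `find_doi` and the sample DOI are hypothetical, and this is not the matching logic pdf2doi or `pdf2doi_extractor.py` actually uses.

```python
import re
from typing import Optional

# Assumption: a Crossref-style pattern for modern DOIs, used here only for
# illustration. It is not taken from pdf2doi or from this package.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def find_doi(text: str) -> Optional[str]:
    """Return the first DOI-like string in `text`, or None if there is none."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Heuristic: trailing punctuation is usually sentence residue,
    # not part of the identifier.
    return match.group(0).rstrip(".,;)")


if __name__ == "__main__":
    # Hypothetical DOI, not a real registered identifier.
    print(find_doi("See https://doi.org/10.1234/example.2024.001."))
    print(find_doi("no identifier here"))
```

A plain regex like this only finds candidates; confirming that a candidate resolves to real metadata would still need a lookup against a DOI registry, which the README attributes to the pdf2doi-based extractor.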
@@ -18,9 +18,11 @@ Intelligent PDF file renaming using LLMs. This tool analyzes PDF content and met
 
  ## Features
 
+ - **DOI-based naming** - Automatically extracts DOI and fetches authoritative metadata for academic papers
  - **Advanced PDF parsing** using docling-parse for better structure-aware extraction
  - **OCR fallback** for scanned PDFs with low text content
  - **Smart LLM prompting** with multi-pass analysis for improved accuracy
+ - **Hybrid approach** - Uses DOI metadata when available, falls back to LLM analysis for other documents
  - Suggests filenames in format: `Author-Topic-Year.pdf`
  - Dry-run mode to preview changes before applying
  - **Enhanced interactive mode** with options to accept, manually edit, retry, or skip each file
@@ -183,19 +185,44 @@ You can use interactive mode with `--dry-run` to preview without actually renami
 
  ## How It Works
 
- 1. **Extract**: Uses docling-parse to read first 5 pages with structure-aware parsing, falls back to PyMuPDF if needed
- 2. **OCR**: Automatically applies OCR for scanned PDFs with minimal text
- 3. **Metadata Enhancement**: Extracts focused hints (years, emails, author sections) to supplement unreliable PDF metadata
- 4. **Analyze**: Sends full content excerpt to LLM with enhanced metadata and detailed extraction instructions
- 5. **Multi-pass Review**: Low-confidence results trigger a second analysis pass with focused prompts
- 6. **Suggest**: LLM returns filename in `Author-Topic-Year` format with confidence level and reasoning
- 7. **Interactive Review** (optional): User can accept, edit, retry, or skip each suggestion
- 8. **Rename**: Applies suggestions (if not in dry-run mode)
+ ### Intelligent Hybrid Approach
+
+ The tool uses a multi-strategy approach to generate accurate filenames:
+
+ 1. **DOI Detection** (for academic papers)
+ - Searches PDF for DOI identifiers using [pdf2doi](https://github.com/MicheleCotrufo/pdf2doi)
+ - If found, queries authoritative metadata (title, authors, year, journal)
+ - Generates filename with **very high confidence** from validated metadata
+ - **Saves API costs** - no LLM call needed for papers with DOIs
+
+ 2. **LLM Analysis** (fallback for non-academic PDFs)
+ - **Extract**: Uses docling-parse to read first 5 pages with structure-aware parsing, falls back to PyMuPDF if needed
+ - **OCR**: Automatically applies OCR for scanned PDFs with minimal text
+ - **Metadata Enhancement**: Extracts focused hints (years, emails, author sections) to supplement unreliable PDF metadata
+ - **Analyze**: Sends full content excerpt to LLM with enhanced metadata and detailed extraction instructions
+ - **Multi-pass Review**: Low-confidence results trigger a second analysis pass with focused prompts
+ - **Suggest**: LLM returns filename in `Author-Topic-Year` format with confidence level and reasoning
+
+ 3. **Interactive Review** (optional): User can accept, edit, retry, or skip each suggestion
+ 4. **Rename**: Applies suggestions (if not in dry-run mode)
+
+ ### Benefits of DOI Integration
+
+ - **Accuracy**: DOI metadata is canonical and verified
+ - **Speed**: Instant lookup vs. LLM processing time
+ - **Cost**: Free DOI lookups save on API costs for academic papers
+ - **Reliability**: Works even when PDF text extraction is poor
 
  ## Cost Considerations
 
- **OpenAI:**
+ **DOI-based Naming (Academic Papers):**
+ - **Completely free** - No API costs
+ - **No LLM needed** - Direct metadata lookup
+ - Works for most academic papers with embedded DOIs
+
+ **OpenAI (Fallback):**
  - Uses `gpt-4o-mini` by default (very cost-effective)
+ - Only called when DOI not found
  - Processes first ~4500 characters per PDF
  - Typical cost: ~$0.001-0.003 per PDF
 
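Reviewer note: the README changes describe a DOI-first strategy in which the LLM is "only called when DOI not found". The control flow can be sketched roughly as below. This is an illustration under stated assumptions, not the code in `pdf_rename_workflow.py`: the names `Suggestion`, `suggest_filename`, `doi_lookup`, and `llm_suggest`, and the metadata keys `author`, `topic`, and `year`, are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Suggestion:
    filename: str
    source: str  # "doi" or "llm"


def suggest_filename(
    pdf_path: str,
    doi_lookup: Callable[[str], Optional[dict]],  # hypothetical: metadata dict or None
    llm_suggest: Callable[[str], str],            # hypothetical: returns a filename
) -> Suggestion:
    """Try DOI metadata first; call the LLM only when no DOI metadata is found."""
    metadata = doi_lookup(pdf_path)
    if metadata is not None:
        # Follows the README's `Author-Topic-Year.pdf` format.
        name = f"{metadata['author']}-{metadata['topic']}-{metadata['year']}.pdf"
        return Suggestion(name, "doi")
    # Fallback path: this is the only place the (paid) LLM call happens.
    return Suggestion(llm_suggest(pdf_path), "llm")


if __name__ == "__main__":
    # Stub lookups standing in for the real DOI and LLM services.
    with_doi = suggest_filename(
        "paper.pdf",
        lambda path: {"author": "Smith", "topic": "Graphs", "year": 2024},
        lambda path: "unused.pdf",
    )
    without_doi = suggest_filename(
        "report.pdf",
        lambda path: None,
        lambda path: "Unknown-Report-2020.pdf",
    )
    print(with_doi, without_doi)
```

Passing the two lookups in as callables keeps the ordering logic testable without network access, which is one plausible reason to separate the DOI extractor behind a port as the changed `domain/ports.py` suggests.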