pdf-file-renamer 0.5.0__tar.gz → 0.6.1__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/PKG-INFO +38 -10
- {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/README.md +36 -9
- {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/coverage.xml +273 -118
- {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/pyproject.toml +3 -1
- {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/src/pdf_file_renamer/__init__.py +1 -1
- pdf_file_renamer-0.6.1/src/pdf_file_renamer/application/filename_service.py +172 -0
- {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/src/pdf_file_renamer/application/pdf_rename_workflow.py +35 -4
- {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/src/pdf_file_renamer/domain/models.py +29 -0
- {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/src/pdf_file_renamer/domain/ports.py +18 -1
- pdf_file_renamer-0.6.1/src/pdf_file_renamer/infrastructure/doi/__init__.py +5 -0
- pdf_file_renamer-0.6.1/src/pdf_file_renamer/infrastructure/doi/pdf2doi_extractor.py +163 -0
- {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/src/pdf_file_renamer/presentation/cli.py +5 -0
- {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/src/pdf_file_renamer/presentation/formatters.py +15 -3
- pdf_file_renamer-0.5.0/src/pdf_file_renamer/application/filename_service.py +0 -70
- {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/.env.example +0 -0
- {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/.github/workflows/ci.yml +0 -0
- {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/.github/workflows/release.yml +0 -0
- {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/.gitignore +0 -0
- {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/.python-version +0 -0
- {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/LICENSE +0 -0
- {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/REFACTORING_SUMMARY.md +0 -0
- {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/src/pdf_file_renamer/application/__init__.py +0 -0
- {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/src/pdf_file_renamer/application/rename_service.py +0 -0
- {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/src/pdf_file_renamer/domain/__init__.py +0 -0
- {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/src/pdf_file_renamer/infrastructure/__init__.py +0 -0
- {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/src/pdf_file_renamer/infrastructure/config.py +0 -0
- {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/src/pdf_file_renamer/infrastructure/llm/__init__.py +0 -0
- {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/src/pdf_file_renamer/infrastructure/llm/pydantic_ai_provider.py +0 -0
- {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/src/pdf_file_renamer/infrastructure/pdf/__init__.py +0 -0
- {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/src/pdf_file_renamer/infrastructure/pdf/composite.py +0 -0
- {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/src/pdf_file_renamer/infrastructure/pdf/docling_extractor.py +0 -0
- {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/src/pdf_file_renamer/infrastructure/pdf/pymupdf_extractor.py +0 -0
- {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/src/pdf_file_renamer/main.py +0 -0
- {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/src/pdf_file_renamer/presentation/__init__.py +0 -0
- {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/tests/__init__.py +0 -0
- {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/tests/data/2025-dennis-managing-complexity.pdf +0 -0
- {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/tests/data/Camp_of_the_Saints.pdf +0 -0
- {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/tests/data/s43588-025-00854-1.pdf +0 -0
- {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/tests/test_domain_models.py +0 -0
- {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/tests/test_filename_service.py +0 -0
- {pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/tests/test_rename_service.py +0 -0
{pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/PKG-INFO
@@ -1,11 +1,12 @@
 Metadata-Version: 2.4
 Name: pdf-file-renamer
-Version: 0.
+Version: 0.6.1
 Summary: Intelligent PDF renaming using LLMs
 License-File: LICENSE
 Requires-Python: >=3.11
 Requires-Dist: docling-core>=2.0.0
 Requires-Dist: docling-parse>=2.0.0
+Requires-Dist: pdf2doi>=1.7
 Requires-Dist: pydantic-ai>=1.0.17
 Requires-Dist: pydantic-settings>=2.7.1
 Requires-Dist: pydantic>=2.10.6
@@ -43,9 +44,11 @@ Intelligent PDF file renaming using LLMs. This tool analyzes PDF content and met
 
 ## Features
 
+- **DOI-based naming** - Automatically extracts DOI and fetches authoritative metadata for academic papers
 - **Advanced PDF parsing** using docling-parse for better structure-aware extraction
 - **OCR fallback** for scanned PDFs with low text content
 - **Smart LLM prompting** with multi-pass analysis for improved accuracy
+- **Hybrid approach** - Uses DOI metadata when available, falls back to LLM analysis for other documents
 - Suggests filenames in format: `Author-Topic-Year.pdf`
 - Dry-run mode to preview changes before applying
 - **Enhanced interactive mode** with options to accept, manually edit, retry, or skip each file
@@ -208,19 +211,44 @@ You can use interactive mode with `--dry-run` to preview without actually renami
 
 ## How It Works
 
-
-
-
-
-
-
-
-
+### Intelligent Hybrid Approach
+
+The tool uses a multi-strategy approach to generate accurate filenames:
+
+1. **DOI Detection** (for academic papers)
+- Searches PDF for DOI identifiers using [pdf2doi](https://github.com/MicheleCotrufo/pdf2doi)
+- If found, queries authoritative metadata (title, authors, year, journal)
+- Generates filename with **very high confidence** from validated metadata
+- **Saves API costs** - no LLM call needed for papers with DOIs
+
+2. **LLM Analysis** (fallback for non-academic PDFs)
+- **Extract**: Uses docling-parse to read first 5 pages with structure-aware parsing, falls back to PyMuPDF if needed
+- **OCR**: Automatically applies OCR for scanned PDFs with minimal text
+- **Metadata Enhancement**: Extracts focused hints (years, emails, author sections) to supplement unreliable PDF metadata
+- **Analyze**: Sends full content excerpt to LLM with enhanced metadata and detailed extraction instructions
+- **Multi-pass Review**: Low-confidence results trigger a second analysis pass with focused prompts
+- **Suggest**: LLM returns filename in `Author-Topic-Year` format with confidence level and reasoning
+
+3. **Interactive Review** (optional): User can accept, edit, retry, or skip each suggestion
+4. **Rename**: Applies suggestions (if not in dry-run mode)
+
+### Benefits of DOI Integration
+
+- **Accuracy**: DOI metadata is canonical and verified
+- **Speed**: Instant lookup vs. LLM processing time
+- **Cost**: Free DOI lookups save on API costs for academic papers
+- **Reliability**: Works even when PDF text extraction is poor
 
 ## Cost Considerations
 
-**
+**DOI-based Naming (Academic Papers):**
+- **Completely free** - No API costs
+- **No LLM needed** - Direct metadata lookup
+- Works for most academic papers with embedded DOIs
+
+**OpenAI (Fallback):**
 - Uses `gpt-4o-mini` by default (very cost-effective)
+- Only called when DOI not found
 - Processes first ~4500 characters per PDF
 - Typical cost: ~$0.001-0.003 per PDF
 
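The "DOI first, LLM as fallback" flow that the new README text describes can be illustrated with a small sketch. This is not the code added in `pdf2doi_extractor.py` or `filename_service.py`, whose contents are not shown in this diff. Every name below (`PaperMetadata`, `find_doi`, `suggest_filename`, the regex, and the `lookup` / `llm_fallback` callables) is hypothetical, and the regex is only a simplified stand-in for what pdf2doi does.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class PaperMetadata:
    """Hypothetical container for metadata returned by a DOI lookup."""
    first_author: str
    title: str
    year: int


# Simplified DOI pattern (assumption): "10.", a 4-9 digit registrant code,
# a slash, then any run of non-space characters.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def find_doi(text: str) -> Optional[str]:
    """Return the first DOI-like string in the text, or None."""
    match = DOI_PATTERN.search(text)
    # Strip trailing punctuation that usually belongs to the sentence.
    return match.group(0).rstrip(".,;)") if match else None


def slug(value: str, max_words: int = 4) -> str:
    """Turn free text into a hyphen-separated, capitalized slug."""
    words = re.findall(r"[A-Za-z0-9]+", value)
    return "-".join(w.capitalize() for w in words[:max_words])


def filename_from_metadata(meta: PaperMetadata) -> str:
    """Build an Author-Topic-Year.pdf name from looked-up metadata."""
    return f"{slug(meta.first_author, 1)}-{slug(meta.title)}-{meta.year}.pdf"


def suggest_filename(
    text: str,
    lookup: Callable[[str], Optional[PaperMetadata]],
    llm_fallback: Callable[[str], str],
) -> Tuple[str, str]:
    """Try the DOI path first; call the LLM fallback only if it fails."""
    doi = find_doi(text)
    if doi is not None:
        meta = lookup(doi)
        if meta is not None:
            return filename_from_metadata(meta), "doi"
    return llm_fallback(text), "llm"
```

The second element of the returned tuple records which strategy produced the name, so a caller could report that no LLM call was made for a given file.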
{pdf_file_renamer-0.5.0 → pdf_file_renamer-0.6.1}/README.md
@@ -18,9 +18,11 @@ Intelligent PDF file renaming using LLMs. This tool analyzes PDF content and met
 
 ## Features
 
+- **DOI-based naming** - Automatically extracts DOI and fetches authoritative metadata for academic papers
 - **Advanced PDF parsing** using docling-parse for better structure-aware extraction
 - **OCR fallback** for scanned PDFs with low text content
 - **Smart LLM prompting** with multi-pass analysis for improved accuracy
+- **Hybrid approach** - Uses DOI metadata when available, falls back to LLM analysis for other documents
 - Suggests filenames in format: `Author-Topic-Year.pdf`
 - Dry-run mode to preview changes before applying
 - **Enhanced interactive mode** with options to accept, manually edit, retry, or skip each file
@@ -183,19 +185,44 @@ You can use interactive mode with `--dry-run` to preview without actually renami
 
 ## How It Works
 
-
-
-
-
-
-
-
-
+### Intelligent Hybrid Approach
+
+The tool uses a multi-strategy approach to generate accurate filenames:
+
+1. **DOI Detection** (for academic papers)
+- Searches PDF for DOI identifiers using [pdf2doi](https://github.com/MicheleCotrufo/pdf2doi)
+- If found, queries authoritative metadata (title, authors, year, journal)
+- Generates filename with **very high confidence** from validated metadata
+- **Saves API costs** - no LLM call needed for papers with DOIs
+
+2. **LLM Analysis** (fallback for non-academic PDFs)
+- **Extract**: Uses docling-parse to read first 5 pages with structure-aware parsing, falls back to PyMuPDF if needed
+- **OCR**: Automatically applies OCR for scanned PDFs with minimal text
+- **Metadata Enhancement**: Extracts focused hints (years, emails, author sections) to supplement unreliable PDF metadata
+- **Analyze**: Sends full content excerpt to LLM with enhanced metadata and detailed extraction instructions
+- **Multi-pass Review**: Low-confidence results trigger a second analysis pass with focused prompts
+- **Suggest**: LLM returns filename in `Author-Topic-Year` format with confidence level and reasoning
+
+3. **Interactive Review** (optional): User can accept, edit, retry, or skip each suggestion
+4. **Rename**: Applies suggestions (if not in dry-run mode)
+
+### Benefits of DOI Integration
+
+- **Accuracy**: DOI metadata is canonical and verified
+- **Speed**: Instant lookup vs. LLM processing time
+- **Cost**: Free DOI lookups save on API costs for academic papers
+- **Reliability**: Works even when PDF text extraction is poor
 
 ## Cost Considerations
 
-**
+**DOI-based Naming (Academic Papers):**
+- **Completely free** - No API costs
+- **No LLM needed** - Direct metadata lookup
+- Works for most academic papers with embedded DOIs
+
+**OpenAI (Fallback):**
 - Uses `gpt-4o-mini` by default (very cost-effective)
+- Only called when DOI not found
 - Processes first ~4500 characters per PDF
 - Typical cost: ~$0.001-0.003 per PDF
 
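The cost figures in the new "Cost Considerations" text lend themselves to a quick back-of-the-envelope estimate. The sketch below assumes the README's per-PDF range (~$0.001-0.003) applies to every file that falls through to the LLM and that DOI lookups cost nothing. The function name and its parameters are hypothetical and not part of the package.

```python
from typing import Tuple


def estimate_llm_cost(
    total_pdfs: int,
    doi_hits: int,
    cost_low: float = 0.001,   # lower bound per PDF, taken from the README
    cost_high: float = 0.003,  # upper bound per PDF, taken from the README
) -> Tuple[float, float]:
    """Return a (low, high) cost range in dollars for one batch.

    Only PDFs without a resolvable DOI are assumed to reach the LLM.
    """
    if not 0 <= doi_hits <= total_pdfs:
        raise ValueError("doi_hits must be between 0 and total_pdfs")
    llm_calls = total_pdfs - doi_hits
    return llm_calls * cost_low, llm_calls * cost_high


low, high = estimate_llm_cost(total_pdfs=1000, doi_hits=600)
print(f"Estimated LLM cost: ${low:.2f} to ${high:.2f}")
```

For a batch of 1,000 PDFs where 600 resolve through a DOI, only 400 files reach the LLM, giving roughly $0.40 to $1.20 instead of $1.00 to $3.00 for the whole batch.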