paddleocr-skills 1.0.0

Files changed (29)
  1. package/README.md +220 -0
  2. package/bin/paddleocr-skills.js +20 -0
  3. package/lib/copy.js +39 -0
  4. package/lib/installer.js +70 -0
  5. package/lib/prompts.js +67 -0
  6. package/lib/python.js +75 -0
  7. package/lib/verify.js +121 -0
  8. package/package.json +42 -0
  9. package/templates/.env.example +12 -0
  10. package/templates/paddleocr-vl/references/paddleocr-vl/layout_schema.md +64 -0
  11. package/templates/paddleocr-vl/references/paddleocr-vl/output_format.md +154 -0
  12. package/templates/paddleocr-vl/references/paddleocr-vl/vl_model_spec.md +157 -0
  13. package/templates/paddleocr-vl/scripts/paddleocr-vl/_lib.py +780 -0
  14. package/templates/paddleocr-vl/scripts/paddleocr-vl/configure.py +270 -0
  15. package/templates/paddleocr-vl/scripts/paddleocr-vl/optimize_file.py +226 -0
  16. package/templates/paddleocr-vl/scripts/paddleocr-vl/requirements-optimize.txt +8 -0
  17. package/templates/paddleocr-vl/scripts/paddleocr-vl/requirements.txt +7 -0
  18. package/templates/paddleocr-vl/scripts/paddleocr-vl/smoke_test.py +199 -0
  19. package/templates/paddleocr-vl/scripts/paddleocr-vl/vl_caller.py +232 -0
  20. package/templates/paddleocr-vl/skills/paddleocr-vl/SKILL.md +481 -0
  21. package/templates/ppocrv5/references/ppocrv5/agent_policy.md +258 -0
  22. package/templates/ppocrv5/references/ppocrv5/normalized_schema.md +257 -0
  23. package/templates/ppocrv5/references/ppocrv5/provider_api.md +140 -0
  24. package/templates/ppocrv5/scripts/ppocrv5/_lib.py +635 -0
  25. package/templates/ppocrv5/scripts/ppocrv5/configure.py +346 -0
  26. package/templates/ppocrv5/scripts/ppocrv5/ocr_caller.py +684 -0
  27. package/templates/ppocrv5/scripts/ppocrv5/requirements.txt +4 -0
  28. package/templates/ppocrv5/scripts/ppocrv5/smoke_test.py +139 -0
  29. package/templates/ppocrv5/skills/ppocrv5/SKILL.md +272 -0
@@ -0,0 +1,481 @@
1
+ ---
+ name: paddleocr-vl
+ description: >
+   Advanced document parsing with PaddleOCR-VL vision-language model. Returns complete document
+   structure including text, tables, formulas, charts, and layout information. Claude extracts
+   relevant content based on user needs.
+ ---
8
+
9
+ # PaddleOCR-VL Document Parsing Skill
10
+
11
+ ## When to Use This Skill
12
+
13
+ ✅ **Use PaddleOCR-VL for**:
14
+ - Documents with tables (invoices, financial reports, spreadsheets)
15
+ - Documents with mathematical formulas (academic papers, scientific documents)
16
+ - Documents with charts and diagrams
17
+ - Multi-column layouts (newspapers, magazines, brochures)
18
+ - Complex document structures requiring layout analysis
19
+ - Any document requiring structured understanding
20
+
21
+ ❌ **Use PP-OCRv5 instead for**:
22
+ - Simple text-only extraction
23
+ - Quick OCR tasks where speed is critical
24
+ - Screenshots or simple images with clear text
25
+
26
+ ## How to Use This Skill
27
+
28
+ **⛔ MANDATORY RESTRICTIONS - DO NOT VIOLATE ⛔**
29
+
30
+ 1. **ONLY use PaddleOCR-VL API** - Execute the script `python scripts/paddleocr-vl/vl_caller.py`
31
+ 2. **NEVER use Claude's built-in vision** - Do NOT parse documents yourself
32
+ 3. **NEVER offer alternatives** - Do NOT suggest "I can try to analyze it" or similar
33
+ 4. **IF API fails** - Display the error message and STOP immediately
34
+ 5. **NO fallback methods** - Do NOT attempt document parsing any other way
35
+
36
+ If the script execution fails (API not configured, network error, etc.):
37
+ - Show the error message to the user
38
+ - Do NOT offer to help using your vision capabilities
39
+ - Do NOT ask "Would you like me to try parsing it?"
40
+ - Simply stop and wait for the user to fix the configuration
41
+
42
+ ### Basic Workflow
43
+
44
+ 1. **Execute document parsing**:
45
+ ```bash
46
+ python scripts/paddleocr-vl/vl_caller.py --file-url "URL provided by user"
47
+ ```
48
+ Or for local files:
49
+ ```bash
50
+ python scripts/paddleocr-vl/vl_caller.py --file-path "file path"
51
+ ```
52
+
53
+ **Save result to file** (recommended):
54
+ ```bash
55
+ python scripts/paddleocr-vl/vl_caller.py --file-url "URL" --output result.json --pretty
56
+ ```
57
+ - The script will display: `Result saved to: /absolute/path/to/result.json`
58
+ - This message is written to stderr; the JSON itself is saved to the file
59
+ - **Tell the user the file path** shown in the message
60
+
61
+ 2. **The script returns COMPLETE JSON** with all document content:
62
+ - Headers, footers, page numbers
63
+ - Main text content
64
+ - Tables with structure
65
+ - Formulas (with LaTeX)
66
+ - Figures and charts
67
+ - Footnotes and references
68
+ - Layout and reading order
69
+
70
+ 3. **Extract what the user needs** from the complete data based on their request.
71
+
72
+ ### IMPORTANT: Complete Content Display
73
+
74
+ **CRITICAL**: You must display the COMPLETE extracted content to the user based on their needs.
75
+
76
+ - The script returns ALL document content in a structured format
77
+ - **Display the full content requested by the user**; do NOT truncate or summarize
78
+ - If user asks for "all text", show the entire `result.full_text`
79
+ - If user asks for "tables", show ALL tables in the document
80
+ - If user asks for "main content", filter out headers/footers but show ALL body text
81
+
82
+ **What this means**:
83
+ - ✅ **DO**: Display complete text, all tables, all formulas as requested
84
+ - ✅ **DO**: Present content in reading order using `reading_order` array
85
+ - ❌ **DON'T**: Truncate with "..." unless content is excessively long (>10,000 chars)
86
+ - ❌ **DON'T**: Summarize or provide excerpts when user asks for full content
87
+ - ❌ **DON'T**: Say "Here's a preview" when user expects complete output
88
+
89
+ **Example - Correct**:
90
+ ```
91
+ User: "Extract all the text from this document"
92
+ Claude: I've parsed the complete document. Here's all the extracted text:
93
+
94
+ [Display entire result.full_text or concatenated regions in reading order]
95
+
96
+ Document Statistics:
97
+ - Total regions: 25
98
+ - Text blocks: 15
99
+ - Tables: 3
100
+ - Formulas: 2
101
+ Quality: Excellent (confidence: 0.92)
102
+ ```
103
+
104
+ **Example - Incorrect** ❌:
105
+ ```
106
+ User: "Extract all the text"
107
+ Claude: "I found a document with multiple sections. Here's the beginning:
108
+ 'Introduction...' (content truncated for brevity)"
109
+ ```
110
+
111
+ ### Understanding the JSON Response
112
+
113
+ The script returns a complete JSON structure:
114
+
115
+ ```json
+ {
+   "ok": true,
+   "result": {
+     "full_text": "Complete text with all content including headers, footers, etc.",
+     "layout": {
+       "regions": [
+         {
+           "id": 0,
+           "type": "header",
+           "content": "Chapter 3: Methods",
+           "bbox": [100, 50, 500, 100]
+         },
+         {
+           "id": 1,
+           "type": "text",
+           "content": "Main body text content...",
+           "bbox": [100, 150, 500, 300]
+         },
+         {
+           "id": 2,
+           "type": "table",
+           "content": {
+             "rows": 3,
+             "cols": 2,
+             "cells": [["Header1", "Header2"], ["Data1", "Data2"]]
+           },
+           "bbox": [100, 350, 500, 550]
+         },
+         {
+           "id": 3,
+           "type": "formula",
+           "content": "E = mc^2",
+           "latex": "$E = mc^2$",
+           "bbox": [200, 600, 400, 630]
+         },
+         {
+           "id": 4,
+           "type": "footnote",
+           "content": "[1] Reference citation",
+           "bbox": [100, 650, 500, 680]
+         },
+         {
+           "id": 5,
+           "type": "footer",
+           "content": "University Name 2024",
+           "bbox": [100, 750, 500, 800]
+         },
+         {
+           "id": 6,
+           "type": "page_number",
+           "content": "25",
+           "bbox": [250, 770, 280, 790]
+         }
+       ],
+       "reading_order": [0, 1, 2, 3, 4, 5, 6]
+     }
+   },
+   "metadata": {
+     "processing_time_ms": 3500,
+     "total_pages": 1,
+     "model_version": "paddleocr-vl-0.9b"
+   }
+ }
+ ```
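As a rough illustration of how a response in this shape might be consumed, the sketch below checks the `ok` flag and tallies regions by `type`, which is enough to produce the "Document Statistics" lines shown earlier. Only the field names (`ok`, `result.layout.regions`, `type`) come from the example above; the function name `summarize_regions` and the trimmed sample are hypothetical and not part of the package.

```python
from collections import Counter

def summarize_regions(response: dict) -> dict:
    """Count regions per type in a response shaped like the example above."""
    if not response.get("ok"):
        # Assumption: a failed call carries ok=false; stop rather than guess.
        raise ValueError("response is not ok")
    regions = response["result"]["layout"]["regions"]
    counts = Counter(region["type"] for region in regions)
    return {"total": len(regions), **counts}

# Hypothetical, trimmed-down sample in the documented shape.
sample_response = {
    "ok": True,
    "result": {
        "full_text": "...",
        "layout": {
            "regions": [
                {"id": 0, "type": "text", "content": "Main body text content..."},
                {"id": 1, "type": "table", "content": {"cells": [["Header1", "Header2"]]}},
                {"id": 2, "type": "text", "content": "More body text..."},
            ],
            "reading_order": [0, 1, 2],
        },
    },
}

summary = summarize_regions(sample_response)
print(summary)
```

Raising on `ok` being false mirrors the skill's rule of stopping on failure instead of working with partial data.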
180
+
181
+ ### Region Types
182
+
183
+ The `type` field indicates the element category:
184
+
185
+ | Type | Description | Typically Include? |
186
+ |------|-------------|-------------------|
187
+ | `header` | Page headers (chapter/section titles) | Exclude (usually repetitive) |
188
+ | `text` | Main body text | **Include** |
189
+ | `table` | Tables with structured data | **Include** |
190
+ | `formula` | Mathematical formulas | **Include** |
191
+ | `figure` | Images, charts, diagrams | **Include** |
192
+ | `footnote` | Footnotes and references | **Include** (often important) |
193
+ | `footer` | Page footers (author/institution) | Exclude (usually repetitive) |
194
+ | `page_number` | Page numbers | Exclude (not content) |
195
+ | `margin_note` | Margin annotations | Context-dependent |
196
+
197
+ ### Content Extraction Guidelines
198
+
199
+ **Based on user intent, filter the regions**:
200
+
201
+ | User Says | What to Extract | How |
202
+ |-----------|-----------------|-----|
203
+ | "Extract main content" | text, table, formula, figure, footnote | Skip header, footer, page_number |
204
+ | "Get all tables" | table only | Filter by type="table" |
205
+ | "Extract formulas" | formula only | Filter by type="formula" |
206
+ | "Complete document" | Everything | Use all regions or full_text |
207
+ | "Without headers/footers" | Core content | Skip header, footer types |
208
+ | "Include page numbers" | Core + page_number | Keep page_number type |
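The table above amounts to a lookup from user intent to a set of region types. A minimal sketch of that idea follows, assuming the region shape from the JSON example. The intent keys and the `select_regions` name are made up for illustration and are not part of the shipped scripts.

```python
CORE_TYPES = {"text", "table", "formula", "figure", "footnote"}

# Hypothetical intent keys mirroring rows of the table above.
INTENT_TYPES = {
    "main_content": CORE_TYPES,
    "tables": {"table"},
    "formulas": {"formula"},
    "with_page_numbers": CORE_TYPES | {"page_number"},
    "complete": None,  # None means "keep every region"
}

def select_regions(regions: list, intent: str) -> list:
    """Return the regions whose type matches the given intent."""
    wanted = INTENT_TYPES[intent]
    if wanted is None:
        return list(regions)
    return [r for r in regions if r["type"] in wanted]

demo_regions = [
    {"id": 0, "type": "header", "content": "Chapter 3: Methods"},
    {"id": 1, "type": "text", "content": "Main body text content..."},
    {"id": 2, "type": "table", "content": {"cells": [["Header1", "Header2"]]}},
    {"id": 3, "type": "page_number", "content": "25"},
]

print([r["id"] for r in select_regions(demo_regions, "main_content")])
```

Keeping the mapping in one dictionary makes it easy to see which region types each kind of request keeps or drops.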
209
+
210
+ ### Usage Examples
211
+
212
+ **Example 1: Extract Main Content** (default behavior)
213
+ ```bash
214
+ python scripts/paddleocr-vl/vl_caller.py \
215
+ --file-url "https://example.com/paper.pdf" \
216
+ --pretty
217
+ ```
218
+
219
+ Then filter JSON to extract core content:
220
+ - Include: text, table, formula, figure, footnote
221
+ - Exclude: header, footer, page_number
222
+
223
+ **Example 2: Extract Tables Only**
224
+ ```bash
225
+ python scripts/paddleocr-vl/vl_caller.py \
226
+ --file-path "./financial_report.pdf" \
227
+ --pretty
228
+ ```
229
+
230
+ Then filter JSON:
231
+ - Only keep regions where type="table"
232
+ - Present table content in markdown format
233
+
234
+ **Example 3: Complete Document with Everything**
235
+ ```bash
236
+ python scripts/paddleocr-vl/vl_caller.py \
237
+ --file-url "URL" \
238
+ --pretty
239
+ ```
240
+
241
+ Then use `result.full_text` or present all regions in reading_order.
242
+
243
+ ### First-Time Configuration
244
+
245
+ **When API is not configured**:
246
+
247
+ The error will show:
248
+ ```
249
+ Configuration error: API not configured. Get your API at: https://aistudio.baidu.com/paddleocr/task
250
+ ```
251
+
252
+ **Auto-configuration workflow**:
253
+
254
+ 1. **Show the exact error message** to user (including the URL)
255
+
256
+ 2. **Tell user to provide credentials**:
257
+ ```
258
+ Please visit the URL above to get your VL_API_URL and VL_TOKEN.
259
+ Once you have them, send them to me and I'll configure it automatically.
260
+ ```
261
+
262
+ 3. **When user provides credentials** (accept any format):
263
+ - `VL_API_URL=https://xxx.com/v1, VL_TOKEN=abc123...`
264
+ - `Here's my API: https://xxx and token: abc123`
265
+ - Copy-pasted code format
266
+ - Any other reasonable format
267
+
268
+ 4. **Parse credentials from user's message**:
269
+ - Extract VL_API_URL value (look for URLs)
270
+ - Extract VL_TOKEN value (long alphanumeric string, usually 40+ chars)
271
+
272
+ 5. **Configure automatically**:
273
+ ```bash
274
+ python scripts/paddleocr-vl/configure.py --api-url "PARSED_URL" --token "PARSED_TOKEN"
275
+ ```
276
+
277
+ 6. **If configuration succeeds**:
278
+ - Inform user: "Configuration complete! Parsing document now..."
279
+ - Retry the original parsing task
280
+
281
+ 7. **If configuration fails**:
282
+ - Show the error
283
+ - Ask user to verify the credentials
284
+
285
+ **IMPORTANT**: The error message format is STRICT and must be shown exactly as provided by the script. Do not modify or paraphrase it.
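Step 4 of the workflow above leaves the parsing approach open. One way it could be done is with a few regular expressions, sketched below. The patterns and the `parse_credentials` name are assumptions for illustration only; real tokens may contain characters these patterns do not cover.

```python
import re

URL_RE = re.compile(r"https?://[^\s,'\"]+")
# Labeled form first ("VL_TOKEN=..." or "token: ...").
LABELED_TOKEN_RE = re.compile(
    r"(?:VL_TOKEN|token)\s*[:=]\s*([A-Za-z0-9_\-]+)", re.IGNORECASE
)
# Fallback: any long alphanumeric run (the text above says usually 40+ chars).
BARE_TOKEN_RE = re.compile(r"\b[A-Za-z0-9]{40,}\b")

def parse_credentials(message: str):
    """Return (api_url, token); either may be None if not found."""
    url_match = URL_RE.search(message)
    api_url = url_match.group(0) if url_match else None

    token_match = LABELED_TOKEN_RE.search(message)
    if token_match:
        token = token_match.group(1)
    else:
        # Remove URLs first so a long path segment is not mistaken for a token.
        bare = BARE_TOKEN_RE.search(URL_RE.sub(" ", message))
        token = bare.group(0) if bare else None
    return api_url, token

print(parse_credentials("Here's my API: https://xxx and token: abc123"))
```

If either value comes back as `None`, asking the user to resend the credentials is safer than guessing.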
286
+
287
+ ### Handling Large Files (>20MB)
288
+
289
+ **Problem**: Local files larger than 20MB are rejected by default.
290
+
291
+ **Solutions** (Choose based on your situation):
292
+
293
+ #### Solution 1: Use URL Upload (Recommended) ⭐
294
+ Upload your file to a web server and use `--file-url`:
295
+ ```bash
296
+ # Instead of local file
297
+ python scripts/paddleocr-vl/vl_caller.py --file-path "large_file.pdf" # ❌ May fail
298
+
299
+ # Use URL instead
300
+ python scripts/paddleocr-vl/vl_caller.py --file-url "https://your-server.com/large_file.pdf" # ✅ No size limit
301
+ ```
302
+
303
+ Benefits:
304
+ - No local file size limit
305
+ - No upload from the local machine (the API server fetches the file directly)
306
+ - Suitable for very large files (>100MB)
307
+
308
+ #### Solution 2: Increase Size Limit
309
+ Adjust the limit in `.env` file:
310
+ ```bash
311
+ # .env
312
+ VL_MAX_FILE_SIZE_MB=50 # Increase from 20MB to 50MB
313
+ ```
314
+
315
+ Then process as usual:
316
+ ```bash
317
+ python scripts/paddleocr-vl/vl_caller.py --file-path "large_file.pdf"
318
+ ```
319
+
320
+ **Note**: Your API provider may still have upload limits. Check with your VL API service.
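To make the size check concrete, the sketch below shows one way a pre-flight check could read `VL_MAX_FILE_SIZE_MB` and compare it with a file's size. It assumes the 20 MB default stated above and that one megabyte means 1024 * 1024 bytes. It is not taken from `vl_caller.py`, whose actual logic may differ, and `exceeds_limit` is a hypothetical name.

```python
import os
import tempfile

DEFAULT_MAX_MB = 20  # default limit stated in this section

def exceeds_limit(path: str, max_mb=None) -> bool:
    """True if the file at `path` is larger than the limit in megabytes."""
    if max_mb is None:
        # Assumption: the limit is read from the environment as a number of MB.
        max_mb = float(os.environ.get("VL_MAX_FILE_SIZE_MB", DEFAULT_MAX_MB))
    # Assumption: 1 MB is counted as 1024 * 1024 bytes.
    return os.path.getsize(path) > max_mb * 1024 * 1024

# Demo with a small temporary file of 2048 bytes.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 2048)
demo_path = tmp.name

under_default = exceeds_limit(demo_path, max_mb=20)    # 2 KiB against 20 MB
over_tiny = exceeds_limit(demo_path, max_mb=0.001)     # 2 KiB against about 1 KiB
os.remove(demo_path)

print(under_default, over_tiny)
```

Running such a check before the call lets the user pick one of the other solutions without waiting for a rejected request.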
321
+
322
+ #### Solution 3: Compress/Optimize File
323
+ Use the built-in optimizer to reduce file size:
324
+
325
+ ```bash
326
+ # Install optimization dependencies first
327
+ pip install -r scripts/paddleocr-vl/requirements-optimize.txt
328
+
329
+ # Optimize image (reduce quality)
330
+ python scripts/paddleocr-vl/optimize_file.py input.png output.png --quality 70
331
+
332
+ # Optimize PDF (compress images within)
333
+ python scripts/paddleocr-vl/optimize_file.py input.pdf output.pdf --target-size 15
334
+
335
+ # Then process optimized file
336
+ python scripts/paddleocr-vl/vl_caller.py --file-path "output.pdf" --pretty
337
+ ```
338
+
339
+ The optimizer will:
340
+ - Compress images (adjust JPEG quality)
341
+ - Resize images if needed (maintain aspect ratio)
342
+ - Compress PDF images (reduce DPI to 150)
343
+ - Show before/after size comparison
344
+
345
+ #### Solution 4: Process Specific Pages (PDF Only)
346
+ If you only need certain pages from a large PDF, extract them first:
347
+
348
+ ```bash
349
+ # Using PyMuPDF (requires: pip install PyMuPDF)
350
+ python -c "
351
+ import fitz
352
+ doc = fitz.open('large.pdf')
353
+ writer = fitz.open()
354
+ writer.insert_pdf(doc, from_page=0, to_page=4) # Pages 1-5
355
+ writer.save('pages_1_5.pdf')
356
+ "
357
+
358
+ # Then process the smaller file
359
+ python scripts/paddleocr-vl/vl_caller.py --file-path "pages_1_5.pdf"
360
+ ```
361
+
362
+ #### Solution 5: Upload to Cloud Storage
363
+ For extremely large files (>100MB), use cloud storage with public URLs:
364
+
365
+ ```bash
366
+ # Example: Upload to AWS S3, Google Drive, or similar
367
+ # Get public URL: https://storage.example.com/my-document.pdf
368
+
369
+ # Process via URL
370
+ python scripts/paddleocr-vl/vl_caller.py --file-url "https://storage.example.com/my-document.pdf"
371
+ ```
372
+
373
+ **Comparison Table**:
374
+
375
+ | Solution | Max Size | Speed | Complexity | Best For |
376
+ |----------|----------|-------|------------|----------|
377
+ | URL Upload | Unlimited | Fast | Low | Any large file |
378
+ | Increase Limit | Configurable | Medium | Very Low | Slightly over limit |
379
+ | Compress | ~70% reduction | Slow | Medium | Images/PDFs with images |
380
+ | Extract Pages | As needed | Fast | Medium | Multi-page PDFs |
381
+ | Cloud Storage | Unlimited | Fast | High | Very large files (>100MB) |
382
+
383
+ ### Error Handling
384
+
385
+ **Authentication failed (401/403)**:
386
+ ```
387
+ error: Authentication failed
388
+ ```
389
+ → Token is invalid; reconfigure with correct credentials
390
+
391
+ **API quota exceeded (429)**:
392
+ ```
393
+ error: API quota exceeded
394
+ ```
395
+ → Daily API quota exhausted; tell the user to wait or upgrade
396
+
397
+ **Unsupported format**:
398
+ ```
399
+ error: Unsupported file format
400
+ ```
401
+ → File format not supported; convert to PDF/PNG/JPG
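The three cases above can be read as a small mapping from error to next step. A hedged sketch follows. The status codes and error strings are the ones listed in this section, while `advise` and the fallback wording are invented for illustration. The unsupported-format case has no status code in the text, so it is matched on the message instead.

```python
# Next steps keyed by the error strings listed above.
NEXT_STEP = {
    "Authentication failed": "Reconfigure with correct credentials.",
    "API quota exceeded": "Wait for the quota to reset or upgrade.",
    "Unsupported file format": "Convert the file to PDF, PNG, or JPG.",
}

# HTTP status codes named in this section.
STATUS_TO_ERROR = {
    401: "Authentication failed",
    403: "Authentication failed",
    429: "API quota exceeded",
}

def advise(status=None, message=None) -> str:
    """Pick a next step from an HTTP status or an error message."""
    error = STATUS_TO_ERROR.get(status) or message or ""
    for known, step in NEXT_STEP.items():
        if known in error:
            return step
    # Hypothetical fallback: show the raw error and stop, per the skill's rules.
    return "Show the error to the user and stop."

print(advise(status=429))
```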
402
+
403
+ ### Pseudo-Code: Content Extraction
404
+
405
+ **Extract main content** (most common):
406
+ ```python
+ def extract_main_content(json_response):
+     layout = json_response['result']['layout']
+     regions = layout['regions']
+
+     # Keep only core content types
+     core_types = {'text', 'table', 'formula', 'figure', 'footnote'}
+     main_regions = [r for r in regions if r['type'] in core_types]
+
+     # Sort by each region's position in reading_order
+     # (regions missing from reading_order go last)
+     position = {rid: i for i, rid in enumerate(layout['reading_order'])}
+     sorted_regions = sorted(main_regions, key=lambda r: position.get(r['id'], len(position)))
+
+     # Present to user
+     for region in sorted_regions:
+         if region['type'] == 'text':
+             print(region['content'])
+         elif region['type'] == 'table':
+             # print_as_markdown_table is a placeholder helper, not defined here
+             print_as_markdown_table(region['content'])
+         elif region['type'] == 'formula':
+             print(f"Formula: {region['latex']}")
+         else:
+             # figure, footnote: show the raw content
+             print(region['content'])
+ ```
427
+
428
+ **Extract tables only**:
429
+ ```python
+ def extract_tables(json_response):
+     regions = json_response['result']['layout']['regions']
+     tables = [r for r in regions if r['type'] == 'table']
+
+     for i, table in enumerate(tables):
+         print(f"Table {i+1}:")
+         print_as_markdown_table(table['content'])
+ ```
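The `print_as_markdown_table` helper used in the pseudo-code is not defined in this file. A minimal sketch of what it could look like follows, assuming the table `content` carries a `cells` list of rows as in the JSON example. Treating the first row as the header is also an assumption, and `table_to_markdown` is a hypothetical name.

```python
def table_to_markdown(table: dict) -> str:
    """Render a table region's `cells` (list of rows) as a Markdown table."""
    rows = table["cells"]
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row):
        # Pad short rows and escape pipes so cells cannot break the layout.
        cells = [str(c).replace("|", "\\|") for c in row]
        cells += [""] * (width - len(row))
        return "| " + " | ".join(cells) + " |"

    # Assumption: the first row holds the column headers.
    header, *body = rows
    lines = [fmt(header), "| " + " | ".join(["---"] * width) + " |"]
    lines += [fmt(row) for row in body]
    return "\n".join(lines)

def print_as_markdown_table(table: dict) -> None:
    print(table_to_markdown(table))

sample_table = {"rows": 2, "cols": 2, "cells": [["Header1", "Header2"], ["Data1", "Data2"]]}
print_as_markdown_table(sample_table)
```

Splitting rendering from printing keeps the string available for a saved report as well as for the chat output.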
438
+
439
+ **Extract everything**:
440
+ ```python
+ def extract_complete(json_response):
+     # Simply use the full_text which includes everything
+     print(json_response['result']['full_text'])
+
+     # Or present all regions in order
+     regions = json_response['result']['layout']['regions']
+     reading_order = json_response['result']['layout']['reading_order']
+
+     # Look regions up by id rather than by list position
+     by_id = {r['id']: r for r in regions}
+     for region_id in reading_order:
+         region = by_id[region_id]
+         print(f"[{region['type']}] {region['content']}")
+ ```
453
+
454
+ ## Important Notes
455
+
456
+ - **The script NEVER filters content** - It always returns complete data
457
+ - **Claude decides what to present** - Based on user's specific request
458
+ - **All data is always available** - Can be re-interpreted for different needs
459
+ - **No information is lost** - Complete document structure preserved
460
+
461
+ ## Reference Documentation
462
+
463
+ For in-depth understanding of the PaddleOCR-VL system, refer to:
464
+ - `references/paddleocr-vl/vl_model_spec.md` - VL model specifications
465
+ - `references/paddleocr-vl/layout_schema.md` - Layout detection schema
466
+ - `references/paddleocr-vl/output_format.md` - Complete output format specification
467
+
468
+ Load these reference documents into context when:
469
+ - Debugging complex parsing issues
470
+ - Understanding layout detection algorithm
471
+ - Working with special document types
472
+ - Customizing content extraction logic
473
+
474
+ ## Testing the Skill
475
+
476
+ To verify the skill is working properly:
477
+ ```bash
478
+ python scripts/paddleocr-vl/smoke_test.py
479
+ ```
480
+
481
+ This tests configuration and optionally API connectivity.