paper-parser-skill 0.1.1__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- paper_parser_skill-0.1.1/LICENSE +21 -0
- paper_parser_skill-0.1.1/PKG-INFO +116 -0
- paper_parser_skill-0.1.1/README.md +101 -0
- paper_parser_skill-0.1.1/paper_parser/__init__.py +2 -0
- paper_parser_skill-0.1.1/paper_parser/arxiv_client.py +65 -0
- paper_parser_skill-0.1.1/paper_parser/cli.py +179 -0
- paper_parser_skill-0.1.1/paper_parser/config.py +52 -0
- paper_parser_skill-0.1.1/paper_parser/mineru_client.py +202 -0
- paper_parser_skill-0.1.1/paper_parser/utils.py +25 -0
- paper_parser_skill-0.1.1/paper_parser_skill.egg-info/PKG-INFO +116 -0
- paper_parser_skill-0.1.1/paper_parser_skill.egg-info/SOURCES.txt +15 -0
- paper_parser_skill-0.1.1/paper_parser_skill.egg-info/dependency_links.txt +1 -0
- paper_parser_skill-0.1.1/paper_parser_skill.egg-info/entry_points.txt +3 -0
- paper_parser_skill-0.1.1/paper_parser_skill.egg-info/requires.txt +5 -0
- paper_parser_skill-0.1.1/paper_parser_skill.egg-info/top_level.txt +1 -0
- paper_parser_skill-0.1.1/pyproject.toml +28 -0
- paper_parser_skill-0.1.1/setup.cfg +4 -0
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 KaiHangYang
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
@@ -0,0 +1,116 @@
+Metadata-Version: 2.4
+Name: paper-parser-skill
+Version: 0.1.1
+Summary: A CLI tool for searching, downloading, parsing, and summarizing academic papers.
+Author-email: kaihang <kaihang.noir@gmail.com>
+Requires-Python: >=3.8
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: requests
+Requires-Dist: click
+Requires-Dist: PyYAML
+Requires-Dist: arxiv
+Requires-Dist: rapidfuzz
+Dynamic: license-file
+
+# Paper Parser
+
+**Efficient arXiv Search, Download, and AI-Friendly Markdown Parsing.**
+
+`paper-parser` is a CLI tool designed to streamline the academic research workflow. It handles everything from finding a paper on arXiv to converting it into a clean, structured Markdown format that is optimized for LLMs and AI agents.
+
+## Why Use Paper Parser?
+
+Standard PDF-to-text tools often produce one massive block of text, which leads to two major problems when working with AI:
+1. **Context Overflow**: Large papers can exceed an LLM's context window.
+2. **Token Waste**: Paying for the entire paper's context when you only need to analyze the "Methodology" or "Conclusion" is expensive and slow.
+
+**The Solution:** `paper-parser` uses the **MinerU V4 API** to extract high-quality Markdown and then **automatically splits the paper into chapters**. This allows AI agents to read the paper **section-by-section**, enabling:
+- **Granular Context Management**: Only read what matters.
+- **Significant Token Savings**: Drastically reduce your API bills.
+- **Higher Accuracy**: Focus the model's attention on specific sections.
+
+---
+
+## Key Features
+
+- **Intelligent Search**: Typos? No problem. Fuzzy-searches arXiv with relevance ranking.
+- **Smart Download**: Downloads PDFs into organized, ID-based directories.
+- **Section Splitting**: Automatically splits papers into `01_Introduction.md`, `02_Methodology.md`, etc.
+- **Incremental Processing**: Remembers what you've already downloaded and parsed, with no redundant API calls.
+- **Image Extraction**: Extracts images and maintains correct relative links within the Markdown chapters.
+- **Note Templates**: Automatically generates `title.md` and `summary.md` for your research notes.
+
+---
+
+## Installation
+
+### From PyPI (Recommended)
+```bash
+pip install paper-parser-skill
+```
+
+### From Source
+```bash
+# Clone the repository
+git clone https://github.com/KaiHangYang/paper-parser-skill.git
+cd paper-parser-skill
+
+# Install in editable mode
+pip install -e .
+```
+
+## Configuration
+
+The first time you run `pp`, it will create a configuration file at `~/.paper-parser/config.yaml`.
+
+```yaml
+MINERU_API_TOKEN: "your_token_from_mineru.net"
+PAPER_WORKSPACE: "~/paper-parser-workspace"
+MINERU_API_TIMEOUT: 600
+```
+> [!IMPORTANT]
+> You need an API token from [MinerU](https://mineru.net/) to use the parsing features.
+
+---
+
+## Usage Guide
+
+```bash
+# 1. Search for a paper
+pp search "LLaMA 3"
+
+# 2. Complete workflow: Search -> Download -> Parse -> Meta
+pp all "2303.17564"
+
+# 3. Parse a local PDF file
+pp parse ./my_local_paper.pdf
+
+# 4. Find where a paper is stored
+pp path "LLaMA"
+```
+
+## Output Structure
+
+```text
+PAPER_WORKSPACE/
+└── 2303.17564/            # ArXiv ID
+    ├── paper.pdf          # Original PDF
+    ├── title.md           # Paper metadata
+    ├── summary.md         # Note-taking template
+    └── markdowns/         # AI-Ready Content
+        ├── 01_Introduction.md
+        ├── 02_Methods.md
+        ├── ...
+        └── images/        # Extracted figures & tables
+```
+
+## Acknowledgments
+
+- [arXiv](https://arxiv.org/) for the academic paper API.
+- [RapidFuzz](https://github.com/rapidfuzz/RapidFuzz) for fast fuzzy string matching.
+- [MinerU](https://github.com/opendatalab/MinerU) ([mineru.net](https://mineru.net/)) for high-quality PDF-to-Markdown parsing.
+
+## License
+
+[MIT](LICENSE)
@@ -0,0 +1,101 @@
+# Paper Parser
+
+**Efficient arXiv Search, Download, and AI-Friendly Markdown Parsing.**
+
+`paper-parser` is a CLI tool designed to streamline the academic research workflow. It handles everything from finding a paper on arXiv to converting it into a clean, structured Markdown format that is optimized for LLMs and AI agents.
+
+## Why Use Paper Parser?
+
+Standard PDF-to-text tools often produce one massive block of text, which leads to two major problems when working with AI:
+1. **Context Overflow**: Large papers can exceed an LLM's context window.
+2. **Token Waste**: Paying for the entire paper's context when you only need to analyze the "Methodology" or "Conclusion" is expensive and slow.
+
+**The Solution:** `paper-parser` uses the **MinerU V4 API** to extract high-quality Markdown and then **automatically splits the paper into chapters**. This allows AI agents to read the paper **section-by-section**, enabling:
+- **Granular Context Management**: Only read what matters.
+- **Significant Token Savings**: Drastically reduce your API bills.
+- **Higher Accuracy**: Focus the model's attention on specific sections.
+
+---
+
+## Key Features
+
+- **Intelligent Search**: Typos? No problem. Fuzzy-searches arXiv with relevance ranking.
+- **Smart Download**: Downloads PDFs into organized, ID-based directories.
+- **Section Splitting**: Automatically splits papers into `01_Introduction.md`, `02_Methodology.md`, etc.
+- **Incremental Processing**: Remembers what you've already downloaded and parsed, with no redundant API calls.
+- **Image Extraction**: Extracts images and maintains correct relative links within the Markdown chapters.
+- **Note Templates**: Automatically generates `title.md` and `summary.md` for your research notes.
+
+---
+
+## Installation
+
+### From PyPI (Recommended)
+```bash
+pip install paper-parser-skill
+```
+
+### From Source
+```bash
+# Clone the repository
+git clone https://github.com/KaiHangYang/paper-parser-skill.git
+cd paper-parser-skill
+
+# Install in editable mode
+pip install -e .
+```
+
+## Configuration
+
+The first time you run `pp`, it will create a configuration file at `~/.paper-parser/config.yaml`.
+
+```yaml
+MINERU_API_TOKEN: "your_token_from_mineru.net"
+PAPER_WORKSPACE: "~/paper-parser-workspace"
+MINERU_API_TIMEOUT: 600
+```
+> [!IMPORTANT]
+> You need an API token from [MinerU](https://mineru.net/) to use the parsing features.
+
+---
+
+## Usage Guide
+
+```bash
+# 1. Search for a paper
+pp search "LLaMA 3"
+
+# 2. Complete workflow: Search -> Download -> Parse -> Meta
+pp all "2303.17564"
+
+# 3. Parse a local PDF file
+pp parse ./my_local_paper.pdf
+
+# 4. Find where a paper is stored
+pp path "LLaMA"
+```
+
+## Output Structure
+
+```text
+PAPER_WORKSPACE/
+└── 2303.17564/            # ArXiv ID
+    ├── paper.pdf          # Original PDF
+    ├── title.md           # Paper metadata
+    ├── summary.md         # Note-taking template
+    └── markdowns/         # AI-Ready Content
+        ├── 01_Introduction.md
+        ├── 02_Methods.md
+        ├── ...
+        └── images/        # Extracted figures & tables
+```
+
+## Acknowledgments
+
+- [arXiv](https://arxiv.org/) for the academic paper API.
+- [RapidFuzz](https://github.com/rapidfuzz/RapidFuzz) for fast fuzzy string matching.
+- [MinerU](https://github.com/opendatalab/MinerU) ([mineru.net](https://mineru.net/)) for high-quality PDF-to-Markdown parsing.
+
+## License
+
+[MIT](LICENSE)
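The chapter-per-file layout in `markdowns/` is what makes selective reading cheap. As a minimal sketch of how a consumer might exploit it (the `read_section` helper and its keyword matching are hypothetical, not part of the package), an agent can load a single section instead of the whole paper:

```python
from pathlib import Path
import tempfile

def read_section(paper_dir, keyword):
    # Hypothetical helper: return the first chapter file whose name
    # matches the keyword, instead of loading the entire paper.
    for md in sorted(Path(paper_dir, "markdowns").glob("*.md")):
        if keyword.lower() in md.name.lower():
            return md.read_text(encoding="utf-8")
    return None

# Demo against a throwaway workspace mimicking the layout above.
root = Path(tempfile.mkdtemp())
(root / "markdowns").mkdir()
(root / "markdowns" / "01_Introduction.md").write_text("# Introduction\n...", encoding="utf-8")
(root / "markdowns" / "02_Methods.md").write_text("# Methods\nwe do X", encoding="utf-8")
print(read_section(root, "methods").splitlines()[0])
```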
@@ -0,0 +1,65 @@
+import arxiv
+from rapidfuzz import fuzz
+import requests
+from pathlib import Path
+
+def search_arxiv(query, max_results=1):
+    """
+    Search for papers on arXiv with fuzzy matching support.
+    Broadens the search and ranks results by similarity to query.
+    """
+    client = arxiv.Client()
+
+    # 1. Expand the search results pool for better fuzzy matching coverage
+    try:
+        search = arxiv.Search(
+            query=query,
+            max_results=max_results * 5,
+            sort_by=arxiv.SortCriterion.Relevance
+        )
+
+        results = []
+        for r in client.results(search):
+            paper_id = r.entry_id.split('/')[-1]
+            title = r.title
+
+            # 2. Calculate fuzzy similarity score between query and title
+            score = fuzz.partial_ratio(query.lower(), title.lower())
+
+            results.append({
+                'id': paper_id,
+                'title': title,
+                'pdf_url': r.pdf_url,
+                'score': score
+            })
+
+        # 3. Sort by fuzzy score descending
+        results.sort(key=lambda x: x['score'], reverse=True)
+
+        # 4. Limit to the requested number of results
+        return results[:max_results]
+
+    except Exception as e:
+        print(f"arXiv search failed: {e}")
+        return []
+
+def download_pdf(pdf_url, output_path):
+    """Download PDF from URL to local path."""
+    print(f"Downloading: {pdf_url}")
+    try:
+        response = requests.get(pdf_url, stream=True, timeout=60)
+        response.raise_for_status()
+
+        # Ensure parent directory exists
+        Path(output_path).parent.mkdir(parents=True, exist_ok=True)
+
+        with open(output_path, 'wb') as f:
+            for chunk in response.iter_content(chunk_size=8192):
+                if chunk:
+                    f.write(chunk)
+
+        print(f"PDF saved: {output_path}")
+        return True
+    except Exception as e:
+        print(f"Download failed: {e}")
+        return False
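The ranking step in `search_arxiv` relies on rapidfuzz's `partial_ratio`. A rough standard-library approximation (sliding `SequenceMatcher` windows, scaled 0..1; an illustration, not the package's code) shows why a short query can still surface the right title:

```python
from difflib import SequenceMatcher

def partial_score(query, text):
    # Best similarity between the query and any same-length window of
    # the text, roughly what fuzz.partial_ratio computes.
    q, t = query.lower(), text.lower()
    if len(q) > len(t):
        q, t = t, q
    best = 0.0
    for i in range(len(t) - len(q) + 1):
        best = max(best, SequenceMatcher(None, q, t[i:i + len(q)]).ratio())
    return best

titles = [
    "BloombergGPT: A Large Language Model for Finance",
    "LLaMA: Open and Efficient Foundation Language Models",
]
ranked = sorted(titles, key=lambda t: partial_score("llama", t), reverse=True)
print(ranked[0])
```

Because the query matches one window of the target title exactly, it scores 1.0 there and much lower against the other title, so sorting descending puts the right paper first.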
@@ -0,0 +1,179 @@
+import click
+import os
+from pathlib import Path
+from . import arxiv_client, mineru_client, utils
+from .config import DEFAULT_CONFIG_PATH
+
+@click.group(help=f"""
+Paper Parser CLI - Search, Download, and Parse academic papers.
+
+Configuration is stored in: {DEFAULT_CONFIG_PATH}
+""")
+def cli():
+    pass
+
+@cli.command()
+@click.argument('query')
+@click.option('--limit', default=1, help='Number of results to show.')
+def search(query, limit):
+    """Search for papers on arXiv."""
+    results = arxiv_client.search_arxiv(query, max_results=limit)
+    if not results:
+        click.echo("No papers found.")
+        return
+
+    for i, res in enumerate(results, 1):
+        click.echo(f"{i}. Id: {res['id']}")
+        click.echo(f"   Title: {res['title']}")
+        click.echo(f"   Link: {res['pdf_url']}")
+
+@cli.command()
+@click.argument('query_or_id')
+@click.option('--force', is_flag=True, help='Force re-download even if PDF exists.')
+def download(query_or_id, force):
+    """Download a paper PDF by arXiv ID or query."""
+    click.echo(f"Finding paper: {query_or_id}")
+
+    # 1. Resolve paper
+    is_id = arxiv_client.search_arxiv(f"id:{query_or_id}") if "." in query_or_id else []
+    results = is_id if is_id else arxiv_client.search_arxiv(query_or_id, max_results=1)
+
+    if not results:
+        click.echo("Paper not found.")
+        return
+
+    paper = results[0]
+    click.echo(f"Found: {paper['title']}")
+
+    # 2. Setup directory and metadata
+    paper_dir = utils.get_paper_dir(paper['id'])
+    pdf_path = paper_dir / "paper.pdf"
+    title_path = paper_dir / "title.md"
+    title_path.write_text(f"# {paper['title']}\n", encoding='utf-8')
+
+    summary_path = paper_dir / "summary.md"
+    if not summary_path.exists():
+        summary_path.write_text(f"# Summary: {paper['title']}\n\n## Key Takeaways\n\n- \n", encoding='utf-8')
+
+    # 3. Download (with cache check)
+    if not force and pdf_path.exists():
+        click.echo(f"Skipping download: PDF already exists in {paper_dir}")
+        return
+
+    if arxiv_client.download_pdf(paper['pdf_url'], pdf_path):
+        click.echo(f"Paper downloaded to: {paper_dir}")
+
+@cli.command()
+@click.argument('target')
+@click.option('--output-dir', help='Force output directory')
+@click.option('--force', is_flag=True, help='Force re-parsing even if results exist.')
+def parse(target, output_dir, force):
+    """Parse a PDF using MinerU API. TARGET can be a local PDF path or an arXiv ID."""
+    pdf_path = None
+    final_output_dir = None
+
+    # Case 1: Local file
+    if os.path.isfile(target):
+        pdf_path = Path(target)
+        final_output_dir = Path(output_dir) if output_dir else pdf_path.parent / "paper"
+        click.echo(f"Local PDF detected: {pdf_path}")
+
+    # Case 2: arXiv ID (or something else)
+    else:
+        # Resolve ID to folder
+        paper_dir = utils.get_paper_dir(target)
+        pdf_path = paper_dir / "paper.pdf"
+        final_output_dir = Path(output_dir) if output_dir else paper_dir
+
+        if not pdf_path.exists():
+            click.echo(f"Error: {target} is not a file, and no paper.pdf found in workspace for ID [{target}].")
+            click.echo(f"  Try 'pp download {target}' first.")
+            return
+
+        click.echo(f"arXiv paper detected: [{target}]")
+
+    # Incremental check
+    if not force and (final_output_dir / "markdowns").exists():
+        click.echo(f"Skipping: Results already exist in {final_output_dir}")
+        return
+
+    click.echo(f"Parsing {pdf_path}...")
+    try:
+        mineru_client.parse_paper(str(pdf_path), str(final_output_dir))
+        click.echo(f"Parsing complete. Results in {final_output_dir}")
+    except ValueError as e:
+        click.echo(str(e))
+    except Exception as e:
+        click.echo(f"Error during parsing: {e}")
+
+@cli.command()
+@click.argument('query_or_id')
+def path(query_or_id):
+    """Find the local path of a processed paper."""
+    click.echo(f"Locating paper: {query_or_id}")
+
+    is_id = arxiv_client.search_arxiv(f"id:{query_or_id}") if "." in query_or_id or "/" in query_or_id else []
+    results = is_id if is_id else arxiv_client.search_arxiv(query_or_id, max_results=1)
+
+    if not results:
+        click.echo("Paper not found on arXiv.")
+        return
+
+    paper = results[0]
+    paper_dir = utils.get_paper_dir(paper['id'])
+
+    if paper_dir.exists() and any(paper_dir.iterdir()):
+        click.echo(f"Local path: {paper_dir.absolute()}")
+    else:
+        click.echo(f"Paper found on arXiv ([{paper['id']}]), but not processed locally yet.")
+        click.echo(f"  Use 'pp all \"{paper['id']}\"' to download and parse it.")
+
+@cli.command()
+@click.argument('query_or_id')
+@click.option('--force', is_flag=True, help='Force re-parsing.')
+def all(query_or_id, force):
+    """Run full workflow: Search -> Download -> Parse."""
+    click.echo(f"Starting full workflow for: {query_or_id}")
+
+    is_id = arxiv_client.search_arxiv(f"id:{query_or_id}") if "." in query_or_id else []
+    results = is_id if is_id else arxiv_client.search_arxiv(query_or_id, max_results=1)
+
+    if not results:
+        click.echo("Paper not found.")
+        return
+
+    paper = results[0]
+    click.echo(f"Found: {paper['title']}")
+
+    paper_dir = utils.get_paper_dir(paper['id'])
+    pdf_path = paper_dir / "paper.pdf"
+    title_path = paper_dir / "title.md"
+    title_path.write_text(f"# {paper['title']}\n", encoding='utf-8')
+
+    summary_path = paper_dir / "summary.md"
+    if not summary_path.exists():
+        summary_path.write_text(f"# Summary: {paper['title']}\n\n## Key Takeaways\n\n- \n", encoding='utf-8')
+
+    # Download (with cache check)
+    if not force and pdf_path.exists():
+        click.echo(f"Skipping download: PDF already exists in {paper_dir}")
+    else:
+        if not arxiv_client.download_pdf(paper['pdf_url'], pdf_path):
+            return
+
+    # Reuse parse logic via ctx or just repeat here for simplicity
+    if not force and (paper_dir / "markdowns").exists():
+        click.echo(f"Skipping parsing: Results already exist in {paper_dir}")
+    else:
+        click.echo("Parsing with MinerU API...")
+        try:
+            mineru_client.parse_paper(str(pdf_path), str(paper_dir))
+            click.echo("\nWorkflow complete!")
+            click.echo(f"Folder: {paper_dir}")
+        except ValueError as e:
+            click.echo(str(e))
+        except Exception as e:
+            click.echo(f"Error during parsing: {e}")
+
+if __name__ == '__main__':
+    cli()
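The `download`, `path`, and `all` commands above decide between "arXiv ID" and "free-text query" with a simple `"." in query_or_id` check. A stricter pattern (a hypothetical refinement, not what `cli.py` ships) would match only the modern numeric ID form:

```python
import re

# Modern arXiv IDs are YYMM.NNNNN with an optional version suffix.
ARXIV_ID = re.compile(r'^\d{4}\.\d{4,5}(v\d+)?$')

def looks_like_arxiv_id(s):
    # Returns True only for strings shaped like "2303.17564" or
    # "2303.17564v2"; free-text queries fall through to search.
    return bool(ARXIV_ID.match(s))

print(looks_like_arxiv_id("2303.17564"), looks_like_arxiv_id("LLaMA 3"))
```

The shipped heuristic is looser on purpose (it also lets `pp path` treat old-style IDs containing `/` as IDs), so this is a trade-off rather than a bug fix.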
@@ -0,0 +1,52 @@
+import os
+import yaml
+from pathlib import Path
+
+DEFAULT_CONFIG_DIR = Path.home() / ".paper-parser"
+DEFAULT_CONFIG_PATH = DEFAULT_CONFIG_DIR / "config.yaml"
+
+DEFAULT_CONFIG = {
+    "PAPER_WORKSPACE": str(Path.home() / "paper-parser-workspace"),
+    "MINERU_API_TOKEN": "",
+    "MINERU_API_BASE_URL": "https://mineru.net/api/v4",
+    "MINERU_API_TIMEOUT": 600
+}
+
+class Config:
+    def __init__(self, config_path=DEFAULT_CONFIG_PATH):
+        self.config_path = Path(config_path)
+        self.data = {}
+        self.load()
+
+    def load(self):
+        """Load config from YAML file, creating it if it doesn't exist."""
+        if not self.config_path.exists():
+            self.create_default_config()
+
+        try:
+            with open(self.config_path, 'r', encoding='utf-8') as f:
+                self.data = yaml.safe_load(f) or {}
+        except Exception as e:
+            print(f"Error loading config from {self.config_path}: {e}")
+            self.data = {}
+
+        # Merge with defaults for missing keys
+        for key, value in DEFAULT_CONFIG.items():
+            if key not in self.data:
+                self.data[key] = value
+
+    def create_default_config(self):
+        """Create the config directory and a default config file."""
+        self.config_path.parent.mkdir(parents=True, exist_ok=True)
+        try:
+            with open(self.config_path, 'w', encoding='utf-8') as f:
+                yaml.dump(DEFAULT_CONFIG, f, default_flow_style=False, allow_unicode=True)
+            print(f"Created default config at {self.config_path}")
+        except Exception as e:
+            print(f"Error creating default config: {e}")
+
+    def get(self, key, default=None):
+        return self.data.get(key, default)
+
+# Global config instance
+config = Config()
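`Config.load` fills in any key the user's YAML omits while letting user values win. The merge itself is just a dict fill, sketched here without the YAML I/O (the `merge_defaults` name is illustrative, not from the package):

```python
def merge_defaults(user_cfg, defaults):
    # Same semantics as Config.load's merge loop: user values win,
    # missing keys fall back to defaults; neither input is mutated.
    merged = dict(user_cfg)
    for key, value in defaults.items():
        merged.setdefault(key, value)
    return merged

defaults = {"MINERU_API_TIMEOUT": 600, "MINERU_API_TOKEN": ""}
user = {"MINERU_API_TOKEN": "abc123"}
print(merge_defaults(user, defaults))
```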
@@ -0,0 +1,202 @@
|
|
|
1
|
+
import os
|
|
2
|
+
import time
|
|
3
|
+
import requests
|
|
4
|
+
import zipfile
|
|
5
|
+
import io
|
|
6
|
+
import re
|
|
7
|
+
import shutil
|
|
8
|
+
from pathlib import Path
|
|
9
|
+
from .config import config
|
|
10
|
+
|
|
11
|
+
class MinerUClient:
|
|
12
|
+
def __init__(self, token=None, base_url=None):
|
|
13
|
+
self.token = token or config.get("MINERU_API_TOKEN")
|
|
14
|
+
self.base_url = base_url or config.get("MINERU_API_BASE_URL")
|
|
15
|
+
|
|
16
|
+
def _validate_token(self):
|
|
17
|
+
if not self.token:
|
|
18
|
+
raise ValueError(
|
|
19
|
+
"β MinerU API Token is missing!\n"
|
|
20
|
+
"Please configure it in ~/.paper-parser/config.yaml\n"
|
|
21
|
+
"Set the 'MINERU_API_TOKEN' field."
|
|
22
|
+
)
|
|
23
|
+
|
|
24
|
+
def _get_headers(self):
|
|
25
|
+
return {"Authorization": f"Bearer {self.token}"}
|
|
26
|
+
|
|
27
|
+
def upload_pdf(self, pdf_path):
|
|
28
|
+
"""Step 1 & 2: Get upload URL and upload PDF."""
|
|
29
|
+
self._validate_token()
|
|
30
|
+
|
|
31
|
+
url = f"{self.base_url}/file-urls/batch"
|
|
32
|
+
file_name = os.path.basename(pdf_path)
|
|
33
|
+
data = {
|
|
34
|
+
"files": [
|
|
35
|
+
{"name": file_name, "data_id": str(int(time.time()))}
|
|
36
|
+
],
|
|
37
|
+
"model_version": "vlm"
|
|
38
|
+
}
|
|
39
|
+
|
|
40
|
+
print(f"[*] Requesting upload URL for {file_name}...")
|
|
41
|
+
response = requests.post(url, headers=self._get_headers(), json=data)
|
|
42
|
+
response.raise_for_status()
|
|
43
|
+
|
|
44
|
+
result = response.json()
|
|
45
|
+
if result.get("code") != 0:
|
|
46
|
+
raise Exception(f"API Error: {result.get('msg')}")
|
|
47
|
+
|
|
48
|
+
upload_url = result["data"]["file_urls"][0]
|
|
49
|
+
batch_id = result["data"]["batch_id"]
|
|
50
|
+
|
|
51
|
+
print(f"[*] Uploading {pdf_path}...")
|
|
52
|
+
with open(pdf_path, "rb") as f:
|
|
53
|
+
resp = requests.put(upload_url, data=f)
|
|
54
|
+
resp.raise_for_status()
|
|
55
|
+
print("[+] Upload successful.")
|
|
56
|
+
|
|
57
|
+
return batch_id
|
|
58
|
+
|
|
59
|
+
def poll_status(self, batch_id):
|
|
60
|
+
"""Step 3: Poll for completion."""
|
|
61
|
+
url = f"{self.base_url}/extract-results/batch/{batch_id}"
|
|
62
|
+
timeout = config.get("MINERU_API_TIMEOUT", 600)
|
|
63
|
+
start_time = time.time()
|
|
64
|
+
|
|
65
|
+
print(f"[*] Waiting for conversion to complete (Timeout: {timeout}s)...")
|
|
66
|
+
while True:
|
|
67
|
+
elapsed = time.time() - start_time
|
|
68
|
+
if elapsed > timeout:
|
|
69
|
+
raise TimeoutError(f"β MinerU conversion timed out after {timeout} seconds.")
|
|
70
|
+
|
|
71
|
+
response = requests.get(url, headers=self._get_headers())
|
|
72
|
+
response.raise_for_status()
|
|
73
|
+
|
|
74
|
+
result = response.json()
|
|
75
|
+
if result.get("code") != 0:
|
|
76
|
+
raise Exception(f"Status Check Error: {result.get('msg')}")
|
|
77
|
+
|
|
78
|
+
extract_results = result["data"].get("extract_result", [])
|
|
79
|
+
if not extract_results:
|
|
80
|
+
print(f"[.] Still processing (queueing)... {int(elapsed)}s elapsed")
|
|
81
|
+
time.sleep(5)
|
|
82
|
+
continue
|
|
83
|
+
|
|
84
|
+
file_state = extract_results[0]
|
|
85
|
+
state = file_state.get("state")
|
|
86
|
+
|
|
87
|
+
if state == "done":
|
|
88
|
+
print("[+] Conversion finished!")
|
|
89
|
+
return file_state.get("full_zip_url")
|
|
90
|
+
elif state == "failed":
|
|
91
|
+
raise Exception("Conversion failed on server side.")
|
|
92
|
+
else:
|
|
93
|
+
print(f"[.] Current state: {state}. Polling in 5s... ({int(elapsed)}s elapsed)")
|
|
94
|
+
time.sleep(5)
|
|
95
|
+
|
|
96
|
+
def process_results(self, zip_url, output_dir):
|
|
97
|
+
"""Step 4: Download and process result ZIP."""
|
|
98
|
+
print(f"[*] Downloading results...")
|
|
99
|
+
response = requests.get(zip_url)
|
|
100
|
+
response.raise_for_status()
|
|
101
|
+
|
|
102
|
+
paper_dir = Path(output_dir)
|
|
103
|
+
paper_dir.mkdir(parents=True, exist_ok=True)
|
|
104
|
+
markdowns_dir = paper_dir / "markdowns"
|
|
105
|
+
markdowns_dir.mkdir(exist_ok=True)
|
|
106
|
+
images_dir = markdowns_dir / "images"
|
|
107
|
+
images_dir.mkdir(exist_ok=True)
|
|
108
|
+
|
|
109
|
+
# Temp directory for extraction
|
|
110
|
+
temp_dir = paper_dir / "_temp_mineru"
|
|
111
|
+
if temp_dir.exists():
|
|
112
|
+
shutil.rmtree(temp_dir)
|
|
113
|
+
temp_dir.mkdir()
|
|
114
|
+
temp_images_dir = temp_dir / "images"
|
|
115
|
+
temp_images_dir.mkdir()
|
|
116
|
+
|
|
117
|
+
md_content = ""
|
|
118
|
+
with zipfile.ZipFile(io.BytesIO(response.content)) as z:
|
|
119
|
+
for name in z.namelist():
|
|
120
|
+
if name.endswith("full.md"):
|
|
121
|
+
md_content = z.read(name).decode("utf-8")
|
|
122
|
+
elif "images/" in name and not name.endswith("/"):
|
|
123
|
+
filename = os.path.basename(name)
|
|
124
|
+
if filename:
|
|
125
|
+
with z.open(name) as source, open(temp_images_dir / filename, "wb") as target:
|
|
126
|
+
shutil.copyfileobj(source, target)
|
|
127
|
+
|
|
128
|
+
if not md_content:
|
|
129
|
+
shutil.rmtree(temp_dir)
|
|
130
|
+
raise Exception("Error: Could not find 'full.md' in the result ZIP.")
|
|
131
|
+
|
|
132
|
+
# Process images and references
|
|
133
|
+
processed_md = self._process_images(md_content, temp_images_dir, images_dir)
|
|
134
|
+
|
|
135
|
+
# Split into chapters
|
|
136
|
+
chapter_count = self._split_chapters(processed_md, markdowns_dir)
|
|
137
|
+
|
|
138
|
+
# Cleanup
|
|
139
|
+
shutil.rmtree(temp_dir)
|
|
140
|
+
print(f"[+] Done! {chapter_count} chapters saved to 'markdowns/'. Images saved to 'images/'.")
|
|
141
|
+
return chapter_count
|
|
142
|
+
|
|
143
|
+
def _process_images(self, md_content, temp_images_dir, final_images_dir):
|
|
144
|
+
md_img_pattern = re.compile(r'!\[([^\]]*)\]\(images/([^)]+)\)')
|
|
145
|
+
        html_img_pattern = re.compile(r'<img[^>]+src="images/([^"]+)"[^>]*>')

        unique_refs = []
        for match in md_img_pattern.finditer(md_content):
            if match.group(2) not in unique_refs:
                unique_refs.append(match.group(2))
        for match in html_img_pattern.finditer(md_content):
            if match.group(1) not in unique_refs:
                unique_refs.append(match.group(1))

        mapping = {}
        for i, old_name in enumerate(unique_refs, 1):
            new_name = f"{i}.jpg"
            old_path = temp_images_dir / old_name
            new_path = final_images_dir / new_name

            if old_path.exists():
                shutil.move(str(old_path), str(new_path))
                mapping[old_name] = new_name

        def md_replace(match):
            alt, name = match.groups()
            return f"![{alt}](images/{mapping.get(name, name)})"

        def html_replace(match):
            name = match.group(1)
            return f'<img src="images/{mapping.get(name, name)}" />'

        md_content = md_img_pattern.sub(md_replace, md_content)
        md_content = html_img_pattern.sub(html_replace, md_content)
        return md_content

    def _split_chapters(self, md_content, output_folder):
        header_pattern = re.compile(r'^(#{1,6})\s+(.+)$', re.MULTILINE)
        headers = [(m.start(), len(m.group(1)), m.group(2).strip())
                   for m in header_pattern.finditer(md_content)]

        if not headers:
            (output_folder / "00_Complete.md").write_text(md_content.strip() + '\n', encoding='utf-8')
            return 1

        for i, (start, level, title) in enumerate(headers):
            end = headers[i + 1][0] if i + 1 < len(headers) else len(md_content)
            content = md_content[start:end].strip()
            safe_title = re.sub(r'[<>:"/\\|?*]', '_', title)
            safe_title = re.sub(r'\s+', '_', safe_title)[:80]
            filename = f"{i+1:02d}_{safe_title}.md"
            (output_folder / filename).write_text(content + '\n', encoding='utf-8')

        return len(headers)

def parse_paper(pdf_path, output_dir):
    """Convenience function to run the full MinerU workflow."""
    client = MinerUClient()
    batch_id = client.upload_pdf(pdf_path)
    zip_url = client.poll_status(batch_id)
    if zip_url:
        return client.process_results(zip_url, output_dir)
    return 0
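The header-based chapter splitting used by `_split_chapters` can be exercised in isolation. Below is a minimal, self-contained sketch of the same logic; the standalone function name and the sample input are illustrative, not part of the package API:

```python
import re
import tempfile
from pathlib import Path

def split_chapters(md_content, output_folder):
    # One output file per Markdown header, numbered in document order.
    header_pattern = re.compile(r'^(#{1,6})\s+(.+)$', re.MULTILINE)
    headers = [(m.start(), m.group(2).strip())
               for m in header_pattern.finditer(md_content)]
    if not headers:
        (output_folder / "00_Complete.md").write_text(
            md_content.strip() + '\n', encoding='utf-8')
        return 1
    for i, (start, title) in enumerate(headers):
        end = headers[i + 1][0] if i + 1 < len(headers) else len(md_content)
        # Sanitize the title so it is a safe filename on all platforms.
        safe_title = re.sub(r'[<>:"/\\|?*]', '_', title)
        safe_title = re.sub(r'\s+', '_', safe_title)[:80]
        (output_folder / f"{i+1:02d}_{safe_title}.md").write_text(
            md_content[start:end].strip() + '\n', encoding='utf-8')
    return len(headers)

sample = "# Intro\ntext\n## Methods: A/B\nmore"
with tempfile.TemporaryDirectory() as d:
    n = split_chapters(sample, Path(d))
    names = sorted(p.name for p in Path(d).iterdir())
print(n, names)  # 2 ['01_Intro.md', '02_Methods__A_B.md']
```

Note how filesystem-unsafe characters in the header (`:` and `/`) are folded into underscores before the chapter filename is built.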
@@ -0,0 +1,25 @@
import os
import re
from pathlib import Path
from .config import config

def sanitize_id(paper_id):
    """Sanitize arXiv ID to be used as a directory name."""
    # Replace slashes in older IDs (e.g., hep-th/9901001) with underscores
    return re.sub(r'[\\/:*?"<>|]', '_', paper_id)

def get_work_dir():
    """Get the base workspace directory from config."""
    workspace = config.get("PAPER_WORKSPACE")
    # Resolve ~ if present
    workspace_path = Path(os.path.expanduser(workspace))
    workspace_path.mkdir(parents=True, exist_ok=True)
    return workspace_path

def get_paper_dir(paper_id):
    """Get the directory for a specific paper using its ID."""
    base_dir = get_work_dir()
    safe_id = sanitize_id(paper_id)
    paper_dir = base_dir / safe_id
    paper_dir.mkdir(parents=True, exist_ok=True)
    return paper_dir
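The effect of `sanitize_id` is easiest to see on an old-style arXiv identifier, which contains a slash. A small self-contained sketch (the substitution is copied verbatim from the function above):

```python
import re

def sanitize_id(paper_id):
    # Same character class as utils.sanitize_id: strip filesystem-unsafe characters.
    return re.sub(r'[\\/:*?"<>|]', '_', paper_id)

# Old-style IDs contain a slash, which would otherwise create a nested
# directory instead of a single per-paper folder.
print(sanitize_id("hep-th/9901001"))  # hep-th_9901001
print(sanitize_id("2303.17564"))      # 2303.17564 (modern IDs pass through)
```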
@@ -0,0 +1,116 @@
Metadata-Version: 2.4
Name: paper-parser-skill
Version: 0.1.1
Summary: A CLI tool for searching, downloading, parsing, and summarizing academic papers.
Author-email: kaihang <kaihang.noir@gmail.com>
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests
Requires-Dist: click
Requires-Dist: PyYAML
Requires-Dist: arxiv
Requires-Dist: rapidfuzz
Dynamic: license-file

# Paper Parser 🛠️

**Efficient arXiv Search, Download, and AI-Friendly Markdown Parsing.**

`paper-parser` is a CLI tool designed to streamline the academic research workflow. It handles everything from finding a paper on arXiv to converting it into a clean, structured Markdown format that is optimized for LLMs and AI agents.

## 🚀 Why Use Paper Parser?

Standard PDF-to-text tools often produce one massive block of text, which leads to two major problems when working with AI:
1. **Context Overflow**: Large papers can exceed an LLM's context window.
2. **Token Waste**: Paying for the entire paper's context when you only need to analyze the "Methodology" or "Conclusion" is expensive and slow.

**The Solution:** `paper-parser` uses the **MinerU V4 API** to extract high-quality Markdown and then **automatically splits the paper into chapters**. This allows AI agents to read the paper **section-by-section**, enabling:
- ✅ **Granular Context Management**: Only read what matters.
- ✅ **Significant Token Savings**: Drastically reduce your API bills.
- ✅ **Higher Accuracy**: Focus the model's attention on specific sections.

---

## ✨ Key Features

- **🔍 Intelligent Search**: Typos? No problem. Fuzzy-searches arXiv with relevance ranking.
- **📥 Smart Download**: Downloads PDFs into organized, ID-based directories.
- **🧩 Section Splitting**: Automatically splits papers into `01_Introduction.md`, `02_Methodology.md`, etc.
- **♻️ Incremental Processing**: Remembers what you've already downloaded and parsed; no redundant API calls.
- **🖼️ Image Extraction**: Extracts images and maintains correct relative links within the Markdown chapters.
- **📝 Note Templates**: Automatically generates `title.md` and `summary.md` for your research notes.

---

## 🛠️ Installation

### From PyPI (Recommended)
```bash
pip install paper-parser-skill
```

### From Source
```bash
# Clone the repository
git clone https://github.com/KaiHangYang/paper-parser-skill.git
cd paper-parser-skill

# Install in editable mode
pip install -e .
```

## ⚙️ Configuration

The first time you run `pp`, it will create a configuration file at `~/.paper-parser/config.yaml`.

```yaml
MINERU_API_TOKEN: "your_token_from_mineru.net"
PAPER_WORKSPACE: "~/paper-parser-workspace"
MINERU_API_TIMEOUT: 600
```
> [!IMPORTANT]
> You need an API token from [MinerU](https://mineru.net/) to use the parsing features.

---

## 📖 Usage Guide

```bash
# 1. Search for a paper
pp search "LLaMA 3"

# 2. Complete workflow: Search -> Download -> Parse -> Meta
pp all "2303.17564"

# 3. Parse a local PDF file
pp parse ./my_local_paper.pdf

# 4. Find where a paper is stored
pp path "LLaMA"
```

## 📂 Output Structure

```text
PAPER_WORKSPACE/
└── 2303.17564/                # ArXiv ID
    ├── paper.pdf              # Original PDF
    ├── title.md               # Paper metadata
    ├── summary.md             # Note-taking template
    └── markdowns/             # AI-Ready Content
        ├── 01_Introduction.md
        ├── 02_Methods.md
        ├── ...
        └── images/            # Extracted figures & tables
```
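Because each chapter is its own file, an agent can load a single section instead of the whole paper. A minimal sketch of that pattern; the helper name and keyword matching are illustrative, not part of the CLI:

```python
import tempfile
from pathlib import Path

def load_section(paper_dir, keyword):
    # Return the text of the first chapter whose filename matches the keyword.
    for md in sorted(Path(paper_dir, "markdowns").glob("*.md")):
        if keyword.lower() in md.name.lower():
            return md.read_text(encoding="utf-8")
    return None

# Demo against a throwaway workspace shaped like the tree above.
with tempfile.TemporaryDirectory() as d:
    mds = Path(d, "markdowns")
    mds.mkdir()
    (mds / "01_Introduction.md").write_text("# Intro\n", encoding="utf-8")
    (mds / "02_Methods.md").write_text("# Methods\n", encoding="utf-8")
    section = load_section(d, "methods")
print(section, end="")  # prints "# Methods"
```

Reading one chapter at a time is what delivers the token savings described above: only the matched file enters the model's context.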

## 🤝 Acknowledgments

- [arXiv](https://arxiv.org/) for the academic paper API.
- [RapidFuzz](https://github.com/rapidfuzz/RapidFuzz) for fast fuzzy string matching.
- [MinerU](https://github.com/opendatalab/MinerU) ([mineru.net](https://mineru.net/)) for high-quality PDF-to-Markdown parsing.

## 📄 License

[MIT](LICENSE)
@@ -0,0 +1,15 @@
LICENSE
README.md
pyproject.toml
paper_parser/__init__.py
paper_parser/arxiv_client.py
paper_parser/cli.py
paper_parser/config.py
paper_parser/mineru_client.py
paper_parser/utils.py
paper_parser_skill.egg-info/PKG-INFO
paper_parser_skill.egg-info/SOURCES.txt
paper_parser_skill.egg-info/dependency_links.txt
paper_parser_skill.egg-info/entry_points.txt
paper_parser_skill.egg-info/requires.txt
paper_parser_skill.egg-info/top_level.txt
@@ -0,0 +1 @@

@@ -0,0 +1 @@
paper_parser

@@ -0,0 +1,28 @@
[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"

[project]
name = "paper-parser-skill"
version = "0.1.1"
description = "A CLI tool for searching, downloading, parsing, and summarizing academic papers."
readme = "README.md"
authors = [
    { name = "kaihang", email = "kaihang.noir@gmail.com" }
]
dependencies = [
    "requests",
    "click",
    "PyYAML",
    "arxiv",
    "rapidfuzz",
]
requires-python = ">=3.8"

[project.scripts]
paper-parser = "paper_parser.cli:cli"
pp = "paper_parser.cli:cli"

[tool.setuptools.packages.find]
where = ["."]
include = ["paper_parser*"]