paper-parser-skill 0.1.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 KaiHangYang
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,116 @@
1
+ Metadata-Version: 2.4
2
+ Name: paper-parser-skill
3
+ Version: 0.1.1
4
+ Summary: A CLI tool for searching, downloading, parsing, and summarizing academic papers.
5
+ Author-email: kaihang <kaihang.noir@gmail.com>
6
+ Requires-Python: >=3.8
7
+ Description-Content-Type: text/markdown
8
+ License-File: LICENSE
9
+ Requires-Dist: requests
10
+ Requires-Dist: click
11
+ Requires-Dist: PyYAML
12
+ Requires-Dist: arxiv
13
+ Requires-Dist: rapidfuzz
14
+ Dynamic: license-file
15
+
16
+ # Paper Parser πŸ› οΈ
17
+
18
+ **Efficient arXiv Search, Download, and AI-Friendly Markdown Parsing.**
19
+
20
+ `paper-parser` is a CLI tool designed to streamline the academic research workflow. It handles everything from finding a paper on arXiv to converting it into a clean, structured Markdown format that is optimized for LLMs and AI agents.
21
+
22
+ ## πŸš€ Why Use Paper Parser?
23
+
24
+ Standard PDF-to-text tools often produce one massive block of text, which leads to two major problems when working with AI:
25
+ 1. **Context Overflow**: Large papers can exceed an LLM's context window.
26
+ 2. **Token Waste**: Paying for the entire paper's context when you only need to analyze the "Methodology" or "Conclusion" is expensive and slow.
27
+
28
+ **The Solution:** `paper-parser` uses the **MinerU V4 API** to extract high-quality Markdown and then **automatically splits the paper into chapters**. This allows AI agents to read the paper **section-by-section**, enabling:
29
+ - βœ… **Granular Context Management**: Only read what matters.
30
+ - βœ… **Significant Token Savings**: Drastically reduce your API bills.
31
+ - βœ… **Higher Accuracy**: Focus the model's attention on specific sections.
32
+
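+ The section-by-section idea can be sketched in a few lines of Python (the directory layout and file names here are illustrative, not part of the package's API):
+
```python
from pathlib import Path

def read_section(markdowns_dir, keyword):
    """Return the text of the first chapter file whose name matches keyword."""
    for md in sorted(Path(markdowns_dir).glob("*.md")):
        if keyword.lower() in md.stem.lower():
            return md.read_text(encoding="utf-8")
    raise FileNotFoundError(f"No chapter matching {keyword!r}")
```
+ An agent can then load only `02_Methods.md` instead of the full paper, which is where the token savings come from.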
33
+ ---
34
+
35
+ ## ✨ Key Features
36
+
37
+ - **πŸ” Intelligent Search**: Typos? No problem. Fuzzy-searches arXiv with relevance ranking.
38
+ - **πŸ“₯ Smart Download**: Downloads PDFs into organized, ID-based directories.
39
+ - **🧩 Section Splitting**: Automatically splits papers into `01_Introduction.md`, `02_Methodology.md`, etc.
40
+ - **πŸ“¦ Incremental Processing**: Remembers what you've already downloaded and parsedβ€”no redundant API calls.
41
+ - **πŸ–ΌοΈ Image Extraction**: Extracts images and maintains correct relative links within the Markdown chapters.
42
+ - **πŸ“ Note Templates**: Automatically generates `title.md` and `summary.md` for your research notes.
43
+
44
+ ---
45
+
46
+ ## πŸ› οΈ Installation
47
+
48
+ ### From PyPI (Recommended)
49
+ ```bash
50
+ pip install paper-parser-skill
51
+ ```
52
+
53
+ ### From Source
54
+ ```bash
55
+ # Clone the repository
56
+ git clone https://github.com/KaiHangYang/paper-parser-skill.git
57
+ cd paper-parser-skill
58
+
59
+ # Install in editable mode
60
+ pip install -e .
61
+ ```
62
+
63
+ ## βš™οΈ Configuration
64
+
65
+ The first time you run `pp`, it will create a configuration file at `~/.paper-parser/config.yaml`.
66
+
67
+ ```yaml
68
+ MINERU_API_TOKEN: "your_token_from_mineru.net"
69
+ PAPER_WORKSPACE: "~/paper-parser-workspace"
70
+ MINERU_API_TIMEOUT: 600
71
+ ```
72
+ > [!IMPORTANT]
73
+ > You need an API token from [MinerU](https://mineru.net/) to use the parsing features.
74
+
75
+ ---
76
+
77
+ ## πŸ“– Usage Guide
78
+
79
+ ```bash
80
+ # 1. Search for a paper
81
+ pp search "LLaMA 3"
82
+
83
+ # 2. Complete workflow: Search -> Download -> Parse -> Meta
84
+ pp all "2303.17564"
85
+
86
+ # 3. Parse a local PDF file
87
+ pp parse ./my_local_paper.pdf
88
+
89
+ # 4. Find where a paper is stored
90
+ pp path "LLaMA"
91
+ ```
92
+
93
+ ## πŸ“‚ Output Structure
94
+
95
+ ```text
96
+ PAPER_WORKSPACE/
97
+ └── 2303.17564/ # ArXiv ID
98
+ β”œβ”€β”€ paper.pdf # Original PDF
99
+ β”œβ”€β”€ title.md # Paper metadata
100
+ β”œβ”€β”€ summary.md # Note-taking template
101
+ └── markdowns/ # AI-Ready Content
102
+ β”œβ”€β”€ 01_Introduction.md
103
+ β”œβ”€β”€ 02_Methods.md
104
+ β”œβ”€β”€ ...
105
+ └── images/ # Extracted figures & tables
106
+ ```
107
+
108
+ ## 🀝 Acknowledgments
109
+
110
+ - [arXiv](https://arxiv.org/) for the academic paper API.
111
+ - [RapidFuzz](https://github.com/rapidfuzz/RapidFuzz) for fast fuzzy string matching.
112
+ - [MinerU](https://github.com/opendatalab/MinerU) ([mineru.net](https://mineru.net/)) for high-quality PDF-to-Markdown parsing.
113
+
114
+ ## πŸ“œ License
115
+
116
+ [MIT](LICENSE)
@@ -0,0 +1,101 @@
1
+ # Paper Parser πŸ› οΈ
2
+
3
+ **Efficient arXiv Search, Download, and AI-Friendly Markdown Parsing.**
4
+
5
+ `paper-parser` is a CLI tool designed to streamline the academic research workflow. It handles everything from finding a paper on arXiv to converting it into a clean, structured Markdown format that is optimized for LLMs and AI agents.
6
+
7
+ ## πŸš€ Why Use Paper Parser?
8
+
9
+ Standard PDF-to-text tools often produce one massive block of text, which leads to two major problems when working with AI:
10
+ 1. **Context Overflow**: Large papers can exceed an LLM's context window.
11
+ 2. **Token Waste**: Paying for the entire paper's context when you only need to analyze the "Methodology" or "Conclusion" is expensive and slow.
12
+
13
+ **The Solution:** `paper-parser` uses the **MinerU V4 API** to extract high-quality Markdown and then **automatically splits the paper into chapters**. This allows AI agents to read the paper **section-by-section**, enabling:
14
+ - βœ… **Granular Context Management**: Only read what matters.
15
+ - βœ… **Significant Token Savings**: Drastically reduce your API bills.
16
+ - βœ… **Higher Accuracy**: Focus the model's attention on specific sections.
17
+
18
+ ---
19
+
20
+ ## ✨ Key Features
21
+
22
+ - **πŸ” Intelligent Search**: Typos? No problem. Fuzzy-searches arXiv with relevance ranking.
23
+ - **πŸ“₯ Smart Download**: Downloads PDFs into organized, ID-based directories.
24
+ - **🧩 Section Splitting**: Automatically splits papers into `01_Introduction.md`, `02_Methodology.md`, etc.
25
+ - **πŸ“¦ Incremental Processing**: Remembers what you've already downloaded and parsedβ€”no redundant API calls.
26
+ - **πŸ–ΌοΈ Image Extraction**: Extracts images and maintains correct relative links within the Markdown chapters.
27
+ - **πŸ“ Note Templates**: Automatically generates `title.md` and `summary.md` for your research notes.
28
+
29
+ ---
30
+
31
+ ## πŸ› οΈ Installation
32
+
33
+ ### From PyPI (Recommended)
34
+ ```bash
35
+ pip install paper-parser-skill
36
+ ```
37
+
38
+ ### From Source
39
+ ```bash
40
+ # Clone the repository
41
+ git clone https://github.com/KaiHangYang/paper-parser-skill.git
42
+ cd paper-parser-skill
43
+
44
+ # Install in editable mode
45
+ pip install -e .
46
+ ```
47
+
48
+ ## βš™οΈ Configuration
49
+
50
+ The first time you run `pp`, it will create a configuration file at `~/.paper-parser/config.yaml`.
51
+
52
+ ```yaml
53
+ MINERU_API_TOKEN: "your_token_from_mineru.net"
54
+ PAPER_WORKSPACE: "~/paper-parser-workspace"
55
+ MINERU_API_TIMEOUT: 600
56
+ ```
57
+ > [!IMPORTANT]
58
+ > You need an API token from [MinerU](https://mineru.net/) to use the parsing features.
59
+
60
+ ---
61
+
62
+ ## πŸ“– Usage Guide
63
+
64
+ ```bash
65
+ # 1. Search for a paper
66
+ pp search "LLaMA 3"
67
+
68
+ # 2. Complete workflow: Search -> Download -> Parse -> Meta
69
+ pp all "2303.17564"
70
+
71
+ # 3. Parse a local PDF file
72
+ pp parse ./my_local_paper.pdf
73
+
74
+ # 4. Find where a paper is stored
75
+ pp path "LLaMA"
76
+ ```
77
+
78
+ ## πŸ“‚ Output Structure
79
+
80
+ ```text
81
+ PAPER_WORKSPACE/
82
+ └── 2303.17564/ # ArXiv ID
83
+ β”œβ”€β”€ paper.pdf # Original PDF
84
+ β”œβ”€β”€ title.md # Paper metadata
85
+ β”œβ”€β”€ summary.md # Note-taking template
86
+ └── markdowns/ # AI-Ready Content
87
+ β”œβ”€β”€ 01_Introduction.md
88
+ β”œβ”€β”€ 02_Methods.md
89
+ β”œβ”€β”€ ...
90
+ └── images/ # Extracted figures & tables
91
+ ```
92
+
93
+ ## 🀝 Acknowledgments
94
+
95
+ - [arXiv](https://arxiv.org/) for the academic paper API.
96
+ - [RapidFuzz](https://github.com/rapidfuzz/RapidFuzz) for fast fuzzy string matching.
97
+ - [MinerU](https://github.com/opendatalab/MinerU) ([mineru.net](https://mineru.net/)) for high-quality PDF-to-Markdown parsing.
98
+
99
+ ## πŸ“œ License
100
+
101
+ [MIT](LICENSE)
@@ -0,0 +1,2 @@
1
+ # paper_reader package initialization
2
+ from .config import config
@@ -0,0 +1,65 @@
1
+ import arxiv
2
+ from rapidfuzz import fuzz
3
+ import requests
4
+ from pathlib import Path
5
+
6
+ def search_arxiv(query, max_results=1):
7
+ """
8
+ Search for papers on arXiv with fuzzy matching support.
9
+ Broadens the search and ranks results by similarity to query.
10
+ """
11
+ client = arxiv.Client()
12
+
13
+ # 1. Expand the search results pool for better fuzzy matching coverage
14
+ try:
15
+ search = arxiv.Search(
16
+ query=query,
17
+ max_results=max_results * 5,
18
+ sort_by=arxiv.SortCriterion.Relevance
19
+ )
20
+
21
+ results = []
22
+ for r in client.results(search):
23
+ paper_id = r.entry_id.split('/')[-1]
24
+ title = r.title
25
+
26
+ # 2. Calculate fuzzy similarity score between query and title
27
+ score = fuzz.partial_ratio(query.lower(), title.lower())
28
+
29
+ results.append({
30
+ 'id': paper_id,
31
+ 'title': title,
32
+ 'pdf_url': r.pdf_url,
33
+ 'score': score
34
+ })
35
+
36
+ # 3. Sort by fuzzy score descending
37
+ results.sort(key=lambda x: x['score'], reverse=True)
38
+
39
+ # 4. Limit to the requested number of results
40
+ return results[:max_results]
41
+
42
+ except Exception as e:
43
+ print(f"arXiv search failed: {e}")
44
+ return []
45
+
46
+ def download_pdf(pdf_url, output_path):
47
+ """Download PDF from URL to local path."""
48
+ print(f"πŸ“₯ Downloading: {pdf_url}")
49
+ try:
50
+ response = requests.get(pdf_url, stream=True, timeout=60)
51
+ response.raise_for_status()
52
+
53
+ # Ensure parent directory exists
54
+ Path(output_path).parent.mkdir(parents=True, exist_ok=True)
55
+
56
+ with open(output_path, 'wb') as f:
57
+ for chunk in response.iter_content(chunk_size=8192):
58
+ if chunk:
59
+ f.write(chunk)
60
+
61
+ print(f"βœ… PDF saved: {output_path}")
62
+ return True
63
+ except Exception as e:
64
+ print(f"❌ Download failed: {e}")
65
+ return False
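The score-then-sort step in `search_arxiv` can be shown in isolation. This is a minimal sketch: the `similarity` helper below is a stdlib stand-in for `rapidfuzz.fuzz.partial_ratio`, not the package's actual scorer.

```python
import difflib

def similarity(a, b):
    # Stand-in for rapidfuzz.fuzz.partial_ratio, scaled to 0-100
    return 100 * difflib.SequenceMatcher(None, a, b).ratio()

def rank_results(results, query, scorer=similarity):
    """Score each title against the query and sort best-first, as search_arxiv does."""
    for r in results:
        r["score"] = scorer(query.lower(), r["title"].lower())
    return sorted(results, key=lambda r: r["score"], reverse=True)
```

Over-fetching (`max_results * 5`) and then truncating the ranked list is what lets a fuzzy query still surface the intended paper at position one.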
@@ -0,0 +1,179 @@
1
+ import click
2
+ import os
3
+ from pathlib import Path
4
+ from . import arxiv_client, mineru_client, utils
5
+ from .config import DEFAULT_CONFIG_PATH
6
+
7
+ @click.group(help=f"""
8
+ Paper Parser CLI - Search, Download, and Parse academic papers.
9
+
10
+ Configuration is stored in: {DEFAULT_CONFIG_PATH}
11
+ """)
12
+ def cli():
13
+ pass
14
+
15
+ @cli.command()
16
+ @click.argument('query')
17
+ @click.option('--limit', default=1, help='Number of results to show.')
18
+ def search(query, limit):
19
+ """Search for papers on arXiv."""
20
+ results = arxiv_client.search_arxiv(query, max_results=limit)
21
+ if not results:
22
+ click.echo("No papers found.")
23
+ return
24
+
25
+ for i, res in enumerate(results, 1):
26
+ click.echo(f"{i}. Id: {res['id']}")
27
+ click.echo(f" Title: {res['title']}")
28
+ click.echo(f" Link: {res['pdf_url']}")
29
+
30
+ @cli.command()
31
+ @click.argument('query_or_id')
32
+ @click.option('--force', is_flag=True, help='Force re-download even if PDF exists.')
33
+ def download(query_or_id, force):
34
+ """Download a paper PDF by arXiv ID or query."""
35
+ click.echo(f"πŸ” Finding paper: {query_or_id}")
36
+
37
+ # 1. Resolve paper
38
+ is_id = arxiv_client.search_arxiv(f"id:{query_or_id}") if "." in query_or_id or "/" in query_or_id else []
39
+ results = is_id if is_id else arxiv_client.search_arxiv(query_or_id, max_results=1)
40
+
41
+ if not results:
42
+ click.echo("❌ Paper not found.")
43
+ return
44
+
45
+ paper = results[0]
46
+ click.echo(f"πŸ“„ Found: {paper['title']}")
47
+
48
+ # 2. Setup directory and metadata
49
+ paper_dir = utils.get_paper_dir(paper['id'])
50
+ pdf_path = paper_dir / "paper.pdf"
51
+ title_path = paper_dir / "title.md"
52
+ title_path.write_text(f"# {paper['title']}\n", encoding='utf-8')
53
+
54
+ summary_path = paper_dir / "summary.md"
55
+ if not summary_path.exists():
56
+ summary_path.write_text(f"# Summary: {paper['title']}\n\n## Key Takeaways\n\n- \n", encoding='utf-8')
57
+
58
+ # 3. Download (with cache check)
59
+ if not force and pdf_path.exists():
60
+ click.echo(f"⏭️ Skipping Download: PDF already exists in {paper_dir}")
61
+ return
62
+
63
+ if arxiv_client.download_pdf(paper['pdf_url'], pdf_path):
64
+ click.echo(f"βœ… Paper downloaded to: {paper_dir}")
65
+
66
+ @cli.command()
67
+ @click.argument('target')
68
+ @click.option('--output-dir', help='Force output directory')
69
+ @click.option('--force', is_flag=True, help='Force re-parsing even if results exist.')
70
+ def parse(target, output_dir, force):
71
+ """Parse a PDF using MinerU API. TARGET can be a local PDF path or an arXiv ID."""
72
+ pdf_path = None
73
+ final_output_dir = None
74
+
75
+ # Case 1: Local File
76
+ if os.path.isfile(target):
77
+ pdf_path = Path(target)
78
+ final_output_dir = Path(output_dir) if output_dir else pdf_path.parent / "paper"
79
+ click.echo(f"πŸ“„ Local PDF detected: {pdf_path}")
80
+
81
+ # Case 2: arXiv ID (or something else)
82
+ else:
83
+ # Resolve ID to folder
84
+ paper_dir = utils.get_paper_dir(target)
85
+ pdf_path = paper_dir / "paper.pdf"
86
+ final_output_dir = Path(output_dir) if output_dir else paper_dir
87
+
88
+ if not pdf_path.exists():
89
+ click.echo(f"❌ Error: {target} is not a file, and no paper.pdf found in workspace for ID [{target}].")
90
+ click.echo(f" Try 'pp download {target}' first.")
91
+ return
92
+
93
+ click.echo(f"πŸ“š arXiv paper detected: [{target}]")
94
+
95
+ # Incremental check
96
+ if not force and (final_output_dir / "markdowns").exists():
97
+ click.echo(f"⏭️ Skipping: Results already exist in {final_output_dir}")
98
+ return
99
+
100
+ click.echo(f"πŸ”„ Parsing {pdf_path}...")
101
+ try:
102
+ mineru_client.parse_paper(str(pdf_path), str(final_output_dir))
103
+ click.echo(f"βœ… Parsing complete. Results in {final_output_dir}")
104
+ except ValueError as e:
105
+ click.echo(str(e))
106
+ except Exception as e:
107
+ click.echo(f"❌ Error during parsing: {e}")
108
+
109
+ @cli.command()
110
+ @click.argument('query_or_id')
111
+ def path(query_or_id):
112
+ """Find the local path of a processed paper."""
113
+ click.echo(f"πŸ” Locating paper: {query_or_id}")
114
+
115
+ is_id = arxiv_client.search_arxiv(f"id:{query_or_id}") if "." in query_or_id or "/" in query_or_id else []
116
+ results = is_id if is_id else arxiv_client.search_arxiv(query_or_id, max_results=1)
117
+
118
+ if not results:
119
+ click.echo("❌ Paper not found on arXiv.")
120
+ return
121
+
122
+ paper = results[0]
123
+ paper_dir = utils.get_paper_dir(paper['id'])
124
+
125
+ if paper_dir.exists() and any(paper_dir.iterdir()):
126
+ click.echo(f"πŸ“ Local path: {paper_dir.absolute()}")
127
+ else:
128
+ click.echo(f"❓ Paper found on arXiv ([{paper['id']}]), but not processed locally yet.")
129
+ click.echo(f" Use 'pp all \"{paper['id']}\"' to download and parse it.")
130
+
131
+ @cli.command()
132
+ @click.argument('query_or_id')
133
+ @click.option('--force', is_flag=True, help='Force re-parsing.')
134
+ def all(query_or_id, force):
135
+ """Run full workflow: Search -> Download -> Parse."""
136
+ click.echo(f"πŸš€ Starting full workflow for: {query_or_id}")
137
+
138
+ is_id = arxiv_client.search_arxiv(f"id:{query_or_id}") if "." in query_or_id or "/" in query_or_id else []
139
+ results = is_id if is_id else arxiv_client.search_arxiv(query_or_id, max_results=1)
140
+
141
+ if not results:
142
+ click.echo("❌ Paper not found.")
143
+ return
144
+
145
+ paper = results[0]
146
+ click.echo(f"πŸ“„ Found: {paper['title']}")
147
+
148
+ paper_dir = utils.get_paper_dir(paper['id'])
149
+ pdf_path = paper_dir / "paper.pdf"
150
+ title_path = paper_dir / "title.md"
151
+ title_path.write_text(f"# {paper['title']}\n", encoding='utf-8')
152
+
153
+ summary_path = paper_dir / "summary.md"
154
+ if not summary_path.exists():
155
+ summary_path.write_text(f"# Summary: {paper['title']}\n\n## Key Takeaways\n\n- \n", encoding='utf-8')
156
+
157
+ # 3. Download (with cache check)
158
+ if not force and pdf_path.exists():
159
+ click.echo(f"⏭️ Skipping Download: PDF already exists in {paper_dir}")
160
+ else:
161
+ if not arxiv_client.download_pdf(paper['pdf_url'], pdf_path):
162
+ return
163
+
164
+ # Parse step (duplicated from 'parse' for simplicity rather than invoked via ctx)
165
+ if not force and (paper_dir / "markdowns").exists():
166
+ click.echo(f"⏭️ Skipping Parsing: Results already exist in {paper_dir}")
167
+ else:
168
+ click.echo("πŸ”„ Parsing with MinerU API...")
169
+ try:
170
+ mineru_client.parse_paper(str(pdf_path), str(paper_dir))
171
+ click.echo("\nπŸŽ‰ Workflow complete!")
172
+ click.echo(f"πŸ“‚ Folder: {paper_dir}")
173
+ except ValueError as e:
174
+ click.echo(str(e))
175
+ except Exception as e:
176
+ click.echo(f"❌ Error during parsing: {e}")
177
+
178
+ if __name__ == '__main__':
179
+ cli()
@@ -0,0 +1,52 @@
1
+ import os
2
+ import yaml
3
+ from pathlib import Path
4
+
5
+ DEFAULT_CONFIG_DIR = Path.home() / ".paper-parser"
6
+ DEFAULT_CONFIG_PATH = DEFAULT_CONFIG_DIR / "config.yaml"
7
+
8
+ DEFAULT_CONFIG = {
9
+ "PAPER_WORKSPACE": str(Path.home() / "paper-parser-workspace"),
10
+ "MINERU_API_TOKEN": "",
11
+ "MINERU_API_BASE_URL": "https://mineru.net/api/v4",
12
+ "MINERU_API_TIMEOUT": 600
13
+ }
14
+
15
+ class Config:
16
+ def __init__(self, config_path=DEFAULT_CONFIG_PATH):
17
+ self.config_path = Path(config_path)
18
+ self.data = {}
19
+ self.load()
20
+
21
+ def load(self):
22
+ """Load config from YAML file, creating it if it doesn't exist."""
23
+ if not self.config_path.exists():
24
+ self.create_default_config()
25
+
26
+ try:
27
+ with open(self.config_path, 'r', encoding='utf-8') as f:
28
+ self.data = yaml.safe_load(f) or {}
29
+ except Exception as e:
30
+ print(f"Error loading config from {self.config_path}: {e}")
31
+ self.data = {}
32
+
33
+ # Merge with defaults for missing keys
34
+ for key, value in DEFAULT_CONFIG.items():
35
+ if key not in self.data:
36
+ self.data[key] = value
37
+
38
+ def create_default_config(self):
39
+ """Create the config directory and a default config file."""
40
+ self.config_path.parent.mkdir(parents=True, exist_ok=True)
41
+ try:
42
+ with open(self.config_path, 'w', encoding='utf-8') as f:
43
+ yaml.dump(DEFAULT_CONFIG, f, default_flow_style=False, allow_unicode=True)
44
+ print(f"Created default config at {self.config_path}")
45
+ except Exception as e:
46
+ print(f"Error creating default config: {e}")
47
+
48
+ def get(self, key, default=None):
49
+ return self.data.get(key, default)
50
+
51
+ # Global config instance
52
+ config = Config()
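The merge-with-defaults behaviour of `Config.load` boils down to a dict union where user-supplied keys win. A minimal sketch (values here are illustrative, not the package's real defaults):

```python
DEFAULTS = {
    "PAPER_WORKSPACE": "~/papers",   # illustrative default
    "MINERU_API_TOKEN": "",
    "MINERU_API_TIMEOUT": 600,
}

def merge_config(user):
    """User values override defaults; any missing key falls back to DEFAULTS."""
    return {**DEFAULTS, **user}
```

This is why a partially filled `config.yaml` (say, only `MINERU_API_TOKEN`) still yields a complete configuration at runtime.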
@@ -0,0 +1,202 @@
1
+ import os
2
+ import time
3
+ import requests
4
+ import zipfile
5
+ import io
6
+ import re
7
+ import shutil
8
+ from pathlib import Path
9
+ from .config import config
10
+
11
+ class MinerUClient:
12
+ def __init__(self, token=None, base_url=None):
13
+ self.token = token or config.get("MINERU_API_TOKEN")
14
+ self.base_url = base_url or config.get("MINERU_API_BASE_URL")
15
+
16
+ def _validate_token(self):
17
+ if not self.token:
18
+ raise ValueError(
19
+ "❌ MinerU API Token is missing!\n"
20
+ "Please configure it in ~/.paper-parser/config.yaml\n"
21
+ "Set the 'MINERU_API_TOKEN' field."
22
+ )
23
+
24
+ def _get_headers(self):
25
+ return {"Authorization": f"Bearer {self.token}"}
26
+
27
+ def upload_pdf(self, pdf_path):
28
+ """Step 1 & 2: Get upload URL and upload PDF."""
29
+ self._validate_token()
30
+
31
+ url = f"{self.base_url}/file-urls/batch"
32
+ file_name = os.path.basename(pdf_path)
33
+ data = {
34
+ "files": [
35
+ {"name": file_name, "data_id": str(int(time.time()))}
36
+ ],
37
+ "model_version": "vlm"
38
+ }
39
+
40
+ print(f"[*] Requesting upload URL for {file_name}...")
41
+ response = requests.post(url, headers=self._get_headers(), json=data)
42
+ response.raise_for_status()
43
+
44
+ result = response.json()
45
+ if result.get("code") != 0:
46
+ raise Exception(f"API Error: {result.get('msg')}")
47
+
48
+ upload_url = result["data"]["file_urls"][0]
49
+ batch_id = result["data"]["batch_id"]
50
+
51
+ print(f"[*] Uploading {pdf_path}...")
52
+ with open(pdf_path, "rb") as f:
53
+ resp = requests.put(upload_url, data=f)
54
+ resp.raise_for_status()
55
+ print("[+] Upload successful.")
56
+
57
+ return batch_id
58
+
59
+ def poll_status(self, batch_id):
60
+ """Step 3: Poll for completion."""
61
+ url = f"{self.base_url}/extract-results/batch/{batch_id}"
62
+ timeout = config.get("MINERU_API_TIMEOUT", 600)
63
+ start_time = time.time()
64
+
65
+ print(f"[*] Waiting for conversion to complete (Timeout: {timeout}s)...")
66
+ while True:
67
+ elapsed = time.time() - start_time
68
+ if elapsed > timeout:
69
+ raise TimeoutError(f"❌ MinerU conversion timed out after {timeout} seconds.")
70
+
71
+ response = requests.get(url, headers=self._get_headers())
72
+ response.raise_for_status()
73
+
74
+ result = response.json()
75
+ if result.get("code") != 0:
76
+ raise Exception(f"Status Check Error: {result.get('msg')}")
77
+
78
+ extract_results = result["data"].get("extract_result", [])
79
+ if not extract_results:
80
+ print(f"[.] Still processing (queueing)... {int(elapsed)}s elapsed")
81
+ time.sleep(5)
82
+ continue
83
+
84
+ file_state = extract_results[0]
85
+ state = file_state.get("state")
86
+
87
+ if state == "done":
88
+ print("[+] Conversion finished!")
89
+ return file_state.get("full_zip_url")
90
+ elif state == "failed":
91
+ raise Exception("Conversion failed on server side.")
92
+ else:
93
+ print(f"[.] Current state: {state}. Polling in 5s... ({int(elapsed)}s elapsed)")
94
+ time.sleep(5)
95
+
96
+ def process_results(self, zip_url, output_dir):
97
+ """Step 4: Download and process result ZIP."""
98
+ print("[*] Downloading results...")
99
+ response = requests.get(zip_url)
100
+ response.raise_for_status()
101
+
102
+ paper_dir = Path(output_dir)
103
+ paper_dir.mkdir(parents=True, exist_ok=True)
104
+ markdowns_dir = paper_dir / "markdowns"
105
+ markdowns_dir.mkdir(exist_ok=True)
106
+ images_dir = markdowns_dir / "images"
107
+ images_dir.mkdir(exist_ok=True)
108
+
109
+ # Temp directory for extraction
110
+ temp_dir = paper_dir / "_temp_mineru"
111
+ if temp_dir.exists():
112
+ shutil.rmtree(temp_dir)
113
+ temp_dir.mkdir()
114
+ temp_images_dir = temp_dir / "images"
115
+ temp_images_dir.mkdir()
116
+
117
+ md_content = ""
118
+ with zipfile.ZipFile(io.BytesIO(response.content)) as z:
119
+ for name in z.namelist():
120
+ if name.endswith("full.md"):
121
+ md_content = z.read(name).decode("utf-8")
122
+ elif "images/" in name and not name.endswith("/"):
123
+ filename = os.path.basename(name)
124
+ if filename:
125
+ with z.open(name) as source, open(temp_images_dir / filename, "wb") as target:
126
+ shutil.copyfileobj(source, target)
127
+
128
+ if not md_content:
129
+ shutil.rmtree(temp_dir)
130
+ raise Exception("Error: Could not find 'full.md' in the result ZIP.")
131
+
132
+ # Process images and references
133
+ processed_md = self._process_images(md_content, temp_images_dir, images_dir)
134
+
135
+ # Split into chapters
136
+ chapter_count = self._split_chapters(processed_md, markdowns_dir)
137
+
138
+ # Cleanup
139
+ shutil.rmtree(temp_dir)
140
+ print(f"[+] Done! {chapter_count} chapters saved to 'markdowns/'. Images saved to 'images/'.")
141
+ return chapter_count
142
+
143
+ def _process_images(self, md_content, temp_images_dir, final_images_dir):
144
+ md_img_pattern = re.compile(r'!\[([^\]]*)\]\(images/([^)]+)\)')
145
+ html_img_pattern = re.compile(r'<img[^>]+src="images/([^"]+)"[^>]*>')
146
+
147
+ unique_refs = []
148
+ for match in md_img_pattern.finditer(md_content):
149
+ if match.group(2) not in unique_refs:
150
+ unique_refs.append(match.group(2))
151
+ for match in html_img_pattern.finditer(md_content):
152
+ if match.group(1) not in unique_refs:
153
+ unique_refs.append(match.group(1))
154
+
155
+ mapping = {}
156
+ for i, old_name in enumerate(unique_refs, 1):
157
+ # Keep the original extension so non-JPEG images are not mislabeled
+ new_name = f"{i}{Path(old_name).suffix or '.jpg'}"
158
+ old_path = temp_images_dir / old_name
159
+ new_path = final_images_dir / new_name
160
+
161
+ if old_path.exists():
162
+ shutil.move(str(old_path), str(new_path))
163
+ mapping[old_name] = new_name
164
+
165
+ def md_replace(match):
166
+ alt, name = match.groups()
167
+ return f"![{alt}](images/{mapping.get(name, name)})"
168
+
169
+ def html_replace(match):
170
+ name = match.group(1)
171
+ return f'<img src="images/{mapping.get(name, name)}" />'
172
+
173
+ md_content = md_img_pattern.sub(md_replace, md_content)
174
+ md_content = html_img_pattern.sub(html_replace, md_content)
175
+ return md_content
176
+
177
+ def _split_chapters(self, md_content, output_folder):
178
+ header_pattern = re.compile(r'^(#{1,6})\s+(.+)$', re.MULTILINE)
179
+ headers = [(m.start(), len(m.group(1)), m.group(2).strip()) for m in header_pattern.finditer(md_content)]
180
+
181
+ if not headers:
182
+ (output_folder / "00_Complete.md").write_text(md_content.strip() + '\n', encoding='utf-8')
183
+ return 1
184
+
185
+ for i, (start, level, title) in enumerate(headers):
186
+ end = headers[i + 1][0] if i + 1 < len(headers) else len(md_content)
187
+ content = md_content[start:end].strip()
188
+ safe_title = re.sub(r'[<>:"/\\|?*]', '_', title)
189
+ safe_title = re.sub(r'\s+', '_', safe_title)[:80]
190
+ filename = f"{i+1:02d}_{safe_title}.md"
191
+ (output_folder / filename).write_text(content + '\n', encoding='utf-8')
192
+
193
+ return len(headers)
194
+
195
+ def parse_paper(pdf_path, output_dir):
196
+ """Convenience function to run the full MinerU workflow."""
197
+ client = MinerUClient()
198
+ batch_id = client.upload_pdf(pdf_path)
199
+ zip_url = client.poll_status(batch_id)
200
+ if zip_url:
201
+ return client.process_results(zip_url, output_dir)
202
+ return 0
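The chapter-splitting logic in `_split_chapters` is easy to exercise standalone. This sketch mirrors the same header regex and slicing, returning `(title, body)` pairs instead of writing files:

```python
import re

HEADER = re.compile(r'^(#{1,6})\s+(.+)$', re.MULTILINE)

def split_chapters(md):
    """Split markdown into (title, body) pairs at every header, as _split_chapters does."""
    headers = [(m.start(), m.group(2).strip()) for m in HEADER.finditer(md)]
    if not headers:
        return [("Complete", md.strip())]
    chapters = []
    for i, (start, title) in enumerate(headers):
        end = headers[i + 1][0] if i + 1 < len(headers) else len(md)
        chapters.append((title, md[start:end].strip()))
    return chapters
```

Each chapter runs from one header to the start of the next (or end of document), so every heading level gets its own file.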
@@ -0,0 +1,25 @@
1
+ import os
2
+ import re
3
+ from pathlib import Path
4
+ from .config import config
5
+
6
+ def sanitize_id(paper_id):
7
+ """Sanitize arXiv ID to be used as a directory name."""
8
+ # Replace slashes in older IDs (e.g., hep-th/9901001) with underscores
9
+ return re.sub(r'[\\/:*?"<>|]', '_', paper_id)
10
+
11
+ def get_work_dir():
12
+ """Get the base workspace directory from config."""
13
+ workspace = config.get("PAPER_WORKSPACE")
14
+ # Resolve ~ if present
15
+ workspace_path = Path(os.path.expanduser(workspace))
16
+ workspace_path.mkdir(parents=True, exist_ok=True)
17
+ return workspace_path
18
+
19
+ def get_paper_dir(paper_id):
20
+ """Get the directory for a specific paper using its ID."""
21
+ base_dir = get_work_dir()
22
+ safe_id = sanitize_id(paper_id)
23
+ paper_dir = base_dir / safe_id
24
+ paper_dir.mkdir(parents=True, exist_ok=True)
25
+ return paper_dir
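A quick check of `sanitize_id` on both ID styles shows why the substitution matters: old-style arXiv IDs contain a slash that would otherwise become a nested directory.

```python
import re

def sanitize_id(paper_id):
    # Same regex as utils.sanitize_id: replace filesystem-hostile characters with '_'
    return re.sub(r'[\\/:*?"<>|]', '_', paper_id)
```

New-style IDs such as `2303.17564` pass through unchanged, so workspace folder names stay predictable.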
@@ -0,0 +1,116 @@
1
+ Metadata-Version: 2.4
2
+ Name: paper-parser-skill
3
+ Version: 0.1.1
4
+ Summary: A CLI tool for searching, downloading, parsing, and summarizing academic papers.
5
+ Author-email: kaihang <kaihang.noir@gmail.com>
6
+ Requires-Python: >=3.8
7
+ Description-Content-Type: text/markdown
8
+ License-File: LICENSE
9
+ Requires-Dist: requests
10
+ Requires-Dist: click
11
+ Requires-Dist: PyYAML
12
+ Requires-Dist: arxiv
13
+ Requires-Dist: rapidfuzz
14
+ Dynamic: license-file
15
+
16
+ # Paper Parser πŸ› οΈ
17
+
18
+ **Efficient arXiv Search, Download, and AI-Friendly Markdown Parsing.**
19
+
20
+ `paper-parser` is a CLI tool designed to streamline the academic research workflow. It handles everything from finding a paper on arXiv to converting it into a clean, structured Markdown format that is optimized for LLMs and AI agents.
21
+
22
+ ## πŸš€ Why Use Paper Parser?
23
+
24
+ Standard PDF-to-text tools often produce one massive block of text, which leads to two major problems when working with AI:
25
+ 1. **Context Overflow**: Large papers can exceed an LLM's context window.
26
+ 2. **Token Waste**: Paying for the entire paper's context when you only need to analyze the "Methodology" or "Conclusion" is expensive and slow.
27
+
28
+ **The Solution:** `paper-parser` uses the **MinerU V4 API** to extract high-quality Markdown and then **automatically splits the paper into chapters**. This allows AI agents to read the paper **section-by-section**, enabling:
29
+ - βœ… **Granular Context Management**: Only read what matters.
30
+ - βœ… **Significant Token Savings**: Drastically reduce your API bills.
31
+ - βœ… **Higher Accuracy**: Focus the model's attention on specific sections.
32
+
33
+ ---
34
+
35
+ ## ✨ Key Features
36
+
37
+ - **πŸ” Intelligent Search**: Typos? No problem. Fuzzy-searches arXiv with relevance ranking.
38
+ - **πŸ“₯ Smart Download**: Downloads PDFs into organized, ID-based directories.
39
+ - **🧩 Section Splitting**: Automatically splits papers into `01_Introduction.md`, `02_Methodology.md`, etc.
40
+ - **πŸ“¦ Incremental Processing**: Remembers what you've already downloaded and parsedβ€”no redundant API calls.
41
+ - **πŸ–ΌοΈ Image Extraction**: Extracts images and maintains correct relative links within the Markdown chapters.
42
+ - **πŸ“ Note Templates**: Automatically generates `title.md` and `summary.md` for your research notes.
43
+
44
+ ---
45
+
46
+ ## πŸ› οΈ Installation
47
+
48
+ ### From PyPI (Recommended)
49
+ ```bash
50
+ pip install paper-parser-skill
51
+ ```
52
+
53
+ ### From Source
54
+ ```bash
55
+ # Clone the repository
56
+ git clone https://github.com/KaiHangYang/paper-parser-skill.git
57
+ cd paper-parser-skill
58
+
59
+ # Install in editable mode
60
+ pip install -e .
61
+ ```
62
+
63
+ ## βš™οΈ Configuration
64
+
65
+ The first time you run `pp`, it will create a configuration file at `~/.paper-parser/config.yaml`.
66
+
67
+ ```yaml
68
+ MINERU_API_TOKEN: "your_token_from_mineru.net"
69
+ PAPER_WORKSPACE: "~/paper-parser-workspace"
70
+ MINERU_API_TIMEOUT: 600
71
+ ```
72
+ > [!IMPORTANT]
73
+ > You need an API token from [MinerU](https://mineru.net/) to use the parsing features.
74
+
75
+ ---
76
+
77
+ ## πŸ“– Usage Guide
78
+
79
+ ```bash
80
+ # 1. Search for a paper
81
+ pp search "LLaMA 3"
82
+
83
+ # 2. Complete workflow: Search -> Download -> Parse -> Meta
84
+ pp all "2303.17564"
85
+
86
+ # 3. Parse a local PDF file
87
+ pp parse ./my_local_paper.pdf
88
+
89
+ # 4. Find where a paper is stored
90
+ pp path "LLaMA"
91
+ ```
92
+
93
+ ## πŸ“‚ Output Structure
94
+
95
+ ```text
96
+ PAPER_WORKSPACE/
97
+ └── 2303.17564/ # ArXiv ID
98
+ β”œβ”€β”€ paper.pdf # Original PDF
99
+ β”œβ”€β”€ title.md # Paper metadata
100
+ β”œβ”€β”€ summary.md # Note-taking template
101
+ └── markdowns/ # AI-Ready Content
102
+ β”œβ”€β”€ 01_Introduction.md
103
+ β”œβ”€β”€ 02_Methods.md
104
+ β”œβ”€β”€ ...
105
+ └── images/ # Extracted figures & tables
106
+ ```
107
+
108
+ ## 🀝 Acknowledgments
109
+
110
+ - [arXiv](https://arxiv.org/) for the academic paper API.
111
+ - [RapidFuzz](https://github.com/rapidfuzz/RapidFuzz) for fast fuzzy string matching.
112
+ - [MinerU](https://github.com/opendatalab/MinerU) ([mineru.net](https://mineru.net/)) for high-quality PDF-to-Markdown parsing.
113
+
114
+ ## πŸ“œ License
115
+
116
+ [MIT](LICENSE)
@@ -0,0 +1,15 @@
1
+ LICENSE
2
+ README.md
3
+ pyproject.toml
4
+ paper_parser/__init__.py
5
+ paper_parser/arxiv_client.py
6
+ paper_parser/cli.py
7
+ paper_parser/config.py
8
+ paper_parser/mineru_client.py
9
+ paper_parser/utils.py
10
+ paper_parser_skill.egg-info/PKG-INFO
11
+ paper_parser_skill.egg-info/SOURCES.txt
12
+ paper_parser_skill.egg-info/dependency_links.txt
13
+ paper_parser_skill.egg-info/entry_points.txt
14
+ paper_parser_skill.egg-info/requires.txt
15
+ paper_parser_skill.egg-info/top_level.txt
@@ -0,0 +1,3 @@
1
+ [console_scripts]
2
+ paper-parser = paper_parser.cli:cli
3
+ pp = paper_parser.cli:cli
@@ -0,0 +1,5 @@
1
+ requests
2
+ click
3
+ PyYAML
4
+ arxiv
5
+ rapidfuzz
@@ -0,0 +1,28 @@
1
+ [build-system]
2
+ requires = ["setuptools>=61.0"]
3
+ build-backend = "setuptools.build_meta"
4
+
5
+ [project]
6
+ name = "paper-parser-skill"
7
+ version = "0.1.1"
8
+ description = "A CLI tool for searching, downloading, parsing, and summarizing academic papers."
9
+ readme = "README.md"
10
+ authors = [
11
+ { name = "kaihang", email = "kaihang.noir@gmail.com" }
12
+ ]
13
+ dependencies = [
14
+ "requests",
15
+ "click",
16
+ "PyYAML",
17
+ "arxiv",
18
+ "rapidfuzz",
19
+ ]
20
+ requires-python = ">=3.8"
21
+
22
+ [project.scripts]
23
+ paper-parser = "paper_parser.cli:cli"
24
+ pp = "paper_parser.cli:cli"
25
+
26
+ [tool.setuptools.packages.find]
27
+ where = ["."]
28
+ include = ["paper_parser*"]
@@ -0,0 +1,4 @@
1
+ [egg_info]
2
+ tag_build =
3
+ tag_date = 0
4
+