deepresearch_flow-0.2.1-py3-none-any.whl → deepresearch_flow-0.3.0-py3-none-any.whl

@@ -0,0 +1,306 @@
1
+ Metadata-Version: 2.4
2
+ Name: deepresearch-flow
3
+ Version: 0.3.0
4
+ Summary: Workflow tools for paper extraction, review, and research automation.
5
+ Author-email: DengQi <dengqi935@gmail.com>
6
+ License: MIT License
7
+
8
+ Copyright (c) 2025 DengQi
9
+
10
+ Permission is hereby granted, free of charge, to any person obtaining a copy
11
+ of this software and associated documentation files (the "Software"), to deal
12
+ in the Software without restriction, including without limitation the rights
13
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
14
+ copies of the Software, and to permit persons to whom the Software is
15
+ furnished to do so, subject to the following conditions:
16
+
17
+ The above copyright notice and this permission notice shall be included in all
18
+ copies or substantial portions of the Software.
19
+
20
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
21
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
22
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
23
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
24
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
25
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
26
+ SOFTWARE.
27
+
28
+ Project-URL: Homepage, https://github.com/nerdneilsfield/ai-deepresearch-flow
29
+ Project-URL: Repository, https://github.com/nerdneilsfield/ai-deepresearch-flow
30
+ Project-URL: Issues, https://github.com/nerdneilsfield/ai-deepresearch-flow/issues
31
+ Keywords: research,papers,pdf,ocr,llm,workflow
32
+ Classifier: Development Status :: 3 - Alpha
33
+ Classifier: Intended Audience :: Science/Research
34
+ Classifier: License :: OSI Approved :: MIT License
35
+ Classifier: Programming Language :: Python :: 3
36
+ Classifier: Programming Language :: Python :: 3 :: Only
37
+ Classifier: Topic :: Scientific/Engineering :: Information Analysis
38
+ Requires-Python: >=3.12
39
+ Description-Content-Type: text/markdown
40
+ License-File: LICENSE
41
+ Requires-Dist: anthropic>=0.28.0
42
+ Requires-Dist: click>=8.1.7
43
+ Requires-Dist: coloredlogs>=15.0.1
44
+ Requires-Dist: dashscope>=1.20.0
45
+ Requires-Dist: google-auth>=2.0.0
46
+ Requires-Dist: google-genai>=0.5.0
47
+ Requires-Dist: httpx>=0.27.0
48
+ Requires-Dist: jinja2>=3.1.3
49
+ Requires-Dist: json-repair>=0.31.0
50
+ Requires-Dist: jsonschema>=4.21.1
51
+ Requires-Dist: markdown-it-py>=3.0.0
52
+ Requires-Dist: mdit-py-plugins>=0.4.0
53
+ Requires-Dist: pypdf>=3.0.0
54
+ Requires-Dist: pybtex>=0.24.0
55
+ Requires-Dist: rich>=13.7.1
56
+ Requires-Dist: rumdl>=0.0.214
57
+ Requires-Dist: starlette>=0.37.2
58
+ Requires-Dist: tqdm>=4.66.4
59
+ Requires-Dist: uvicorn>=0.27.1
60
+ Dynamic: license-file
61
+
62
+ <p align="center">
63
+ <img src=".github/assets/logo.png" width="140" alt="ai-deepresearch-flow logo" />
64
+ </p>
65
+
66
+ <h3 align="center">ai-deepresearch-flow</h3>
67
+
68
+ <p align="center">
69
+ <em>From documents to deep research insight — automatically.</em>
70
+ </p>
71
+
72
+ <p align="center">
73
+ <a href="README.md">English</a> | <a href="README_ZH.md">中文</a>
74
+ </p>
75
+
76
+ <p align="center">
77
+ <a href="https://github.com/nerdneilsfield/ai-deepresearch-flow/actions">
78
+ <img src="https://img.shields.io/github/actions/workflow/status/nerdneilsfield/ai-deepresearch-flow/push-to-pypi.yml?style=flat-square" />
79
+ </a>
80
+ <a href="https://pypi.org/project/deepresearch-flow/">
81
+ <img src="https://img.shields.io/pypi/v/deepresearch-flow?style=flat-square" />
82
+ </a>
83
+ <a href="https://pypi.org/project/deepresearch-flow/">
84
+ <img src="https://img.shields.io/pypi/pyversions/deepresearch-flow?style=flat-square" />
85
+ </a>
86
+ <a href="https://hub.docker.com/r/nerdneils/deepresearch-flow">
87
+ <img src="https://img.shields.io/docker/v/nerdneils/deepresearch-flow?style=flat-square" />
88
+ </a>
89
+ <a href="https://ghcr.io/nerdneilsfield/deepresearch-flow">
90
+ <img src="https://img.shields.io/badge/ghcr.io-nerdneilsfield%2Fdeepresearch-flow-0f172a?style=flat-square" />
91
+ </a>
92
+ <a href="https://github.com/nerdneilsfield/ai-deepresearch-flow/blob/main/LICENSE">
93
+ <img src="https://img.shields.io/github/license/nerdneilsfield/ai-deepresearch-flow?style=flat-square" />
94
+ </a>
95
+ <a href="https://github.com/nerdneilsfield/ai-deepresearch-flow/stargazers">
96
+ <img src="https://img.shields.io/github/stars/nerdneilsfield/ai-deepresearch-flow?style=flat-square" />
97
+ </a>
98
+ <a href="https://pypi.org/project/deepresearch-flow">
99
+ <img alt="PyPI - Version" src="https://img.shields.io/pypi/v/deepresearch-flow">
100
+ </a>
101
+ <a href="https://github.com/nerdneilsfield/ai-deepresearch-flow/issues">
102
+ <img src="https://img.shields.io/github/issues/nerdneilsfield/ai-deepresearch-flow?style=flat-square" />
103
+ </a>
104
+ </p>
105
+
106
+ ---
107
+
108
+ ## The Core Pain Points
109
+
110
+ - **OCR Chaos**: Raw Markdown from OCR tools is often broken: tables drift, formulas break, and references are not clickable.
111
+ - **Translation Nightmares**: Translating technical papers often destroys code blocks, LaTeX formulas, and table structures.
112
+ - **Information Overload**: Manually extracting structured insights (authors, venues, summaries) from hundreds of PDFs does not scale.
113
+ - **Context Switching**: Managing PDFs, summaries, and translations in different windows kills focus.
114
+
115
+ ## The Solution
116
+
117
+ DeepResearch Flow provides a unified pipeline to **Repair**, **Translate**, **Extract**, and **Serve** your research library.
118
+
119
+ ## Key Features
120
+
121
+ - **Smart Extraction**: Turn unstructured Markdown into schema-enforced JSON (summaries, metadata, Q&A) using LLMs (OpenAI, Claude, Gemini, etc.).
122
+ - **Precision Translation**: Translate OCR Markdown to Chinese/Japanese (`.zh.md`, `.ja.md`) while **freezing** formulas, code, tables, and references. No more broken layout.
123
+ - **Local Knowledge DB**: A high-performance local Web UI to browse papers with **Split View** (Source vs. Translated vs. Summary), full-text search, and multi-dimensional filtering.
124
+ - **OCR Post-Processing**: Automatically fix broken references (`[1]` -> `[^1]`), merge split paragraphs, and standardize layouts.
125
+
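The reference repair named in the last bullet (`[1]` -> `[^1]`) can be pictured as a single regex rewrite. The sketch below is illustrative only and is not the package's implementation: `to_footnotes` is a hypothetical name, and it assumes citations are purely numeric and that Markdown links (`[1](url)`), reference definitions (`[1]:`) and existing footnotes should be left untouched.

```python
import re

# Hypothetical helper, not part of deepresearch-flow's API.
# Matches "[12]" but skips "[^12]" (already a footnote), "[12](...)" (a link)
# and "[12]:" (a reference definition).
_CITATION = re.compile(r"\[(\d+)\](?![(:])")


def to_footnotes(text: str) -> str:
    """Rewrite numeric citations such as [1] as Markdown footnote refs [^1]."""
    return _CITATION.sub(r"[^\1]", text)


print(to_footnotes("As shown in [1] and [23], see [5](url) and [^4]."))
```

A real fixer would also need to emit the matching footnote definitions; this sketch covers only the inline markers.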
126
+ ---
127
+
128
+ ## Quick Start
129
+
130
+ ### 1) Installation
131
+
132
+ ```bash
133
+ # Recommended: using uv for speed
134
+ uv pip install deepresearch-flow
135
+
136
+ # Or standard pip
137
+ pip install deepresearch-flow
138
+ ```
139
+
140
+ ### 2) Configuration
141
+
142
+ Set up your LLM providers. We support OpenAI, Claude, Gemini, Ollama, and more.
143
+
144
+ ```bash
145
+ cp config.example.toml config.toml
146
+ # Edit config.toml to add your API keys (e.g., env:OPENAI_API_KEY)
147
+ ```
148
+
149
+ ### 3) The "Zero to Hero" Workflow
150
+
151
+ #### Step 1: Extract Insights
152
+
153
+ Scan a folder of markdown files and extract structured summaries.
154
+
155
+ ```bash
156
+ uv run deepresearch-flow paper extract \
157
+ --input ./docs \
158
+ --model openai/gpt-4o-mini \
159
+ --prompt-template deep_read
160
+ ```
161
+
162
+ #### Step 2: Translate Safely
163
+
164
+ Translate papers to Chinese, protecting LaTeX and tables.
165
+
166
+ ```bash
167
+ uv run deepresearch-flow translator translate \
168
+ --input ./docs \
169
+ --target-lang zh \
170
+ --model openai/gpt-4o-mini \
171
+ --fix-level moderate
172
+ ```
173
+
174
+ #### Step 3: Serve Your Database
175
+
176
+ Launch a local UI to read and manage your papers.
177
+
178
+ ```bash
179
+ uv run deepresearch-flow paper db serve \
180
+ --input paper_infos.json \
181
+ --md-root ./docs \
182
+ --md-translated-root ./docs \
183
+ --host 127.0.0.1
184
+ ```
185
+
186
+ ---
187
+
188
+ ## Comprehensive Guide
189
+
190
+ <details>
191
+ <summary><strong>1. Translator: OCR-Safe Translation</strong></summary>
192
+
193
+ The translator module is built for scientific documents. It uses a node-based architecture to ensure stability.
194
+
195
+ - Structure Protection: automatically detects and "freezes" code blocks, LaTeX (`$$...$$`), HTML tables, and images before sending text to the LLM.
196
+ - OCR Repair: use `--fix-level` to merge broken paragraphs and convert text references (`[1]`) to clickable Markdown footnotes (`[^1]`).
197
+ - Context-Aware: supports retries for failed chunks and falls back gracefully.
198
+
199
+ ```bash
200
+ # Translate with structure protection and OCR repairs
201
+ uv run deepresearch-flow translator translate \
202
+ --input ./paper.md \
203
+ --target-lang ja \
204
+ --fix-level aggressive \
205
+ --model claude/claude-3-5-sonnet-20240620
206
+ ```
207
+
208
+ </details>
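To make the "freeze" idea above concrete, here is a minimal placeholder round trip. It is a sketch under stated assumptions, not the code in `deepresearch_flow/translator/protector.py`: the function names, the regex patterns and the `@@FROZEN_n@@` placeholder format are all invented for illustration.

```python
import re

# Assumption: these patterns only approximate the protected span types.
_FENCE = "`" * 3  # built indirectly so the example nests cleanly in Markdown
_PROTECTED = re.compile(
    "|".join(
        [
            _FENCE + r".*?" + _FENCE,   # fenced code blocks
            r"\$\$.*?\$\$",             # display math
            r"<table\b.*?</table>",     # HTML tables
            r"!\[[^\]]*\]\([^)]*\)",    # Markdown images
        ]
    ),
    re.DOTALL,
)


def freeze(text: str) -> tuple[str, dict[str, str]]:
    """Swap protected spans for opaque placeholders before translation."""
    store: dict[str, str] = {}

    def _stash(match: re.Match[str]) -> str:
        key = f"@@FROZEN_{len(store)}@@"  # hypothetical placeholder format
        store[key] = match.group(0)
        return key

    return _PROTECTED.sub(_stash, text), store


def thaw(text: str, store: dict[str, str]) -> str:
    """Put the original spans back once the translated text returns."""
    for key, original in store.items():
        text = text.replace(key, original)
    return text


source = "Energy is $$E = mc^2$$ and see ![fig](a.png)."
frozen, store = freeze(source)
print(frozen)
print(thaw(frozen, store) == source)
```

The design point is that the LLM only ever sees the placeholder tokens, so it cannot reflow a table or alter a formula. A production version would also have to detect placeholders that the model dropped or mangled.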
209
+
210
+ <details>
211
+ <summary><strong>2. Paper Extract: Structured Knowledge</strong></summary>
212
+
213
+ Turn loose markdown files into a queryable database.
214
+
215
+ - Templates: built-in prompts like `simple`, `eight_questions`, and `deep_read` guide the LLM to extract specific insights.
216
+ - Async and throttled: precise control over concurrency (`--max-concurrency`) and rate limits (`--sleep-every`).
217
+ - Incremental: skips already processed files; resumes from where you left off.
218
+
219
+ ```bash
220
+ uv run deepresearch-flow paper extract \
221
+ --input ./library \
222
+ --output paper_data.json \
223
+ --template-dir ./my-custom-prompts \
224
+ --max-concurrency 10
225
+ ```
226
+
227
+ </details>
228
+
229
+ <details>
230
+ <summary><strong>3. Database and UI: Your Personal ArXiv</strong></summary>
231
+
232
+ The `paper db serve` command starts a local web UI for browsing your paper library.
233
+
234
+ - Split View: read the original PDF/Markdown on the left and the Summary/Translation on the right.
235
+ - Full Text Search: search by title, author, year, or content tags (`tag:fpga year:2023..2024`).
236
+ - Stats: visualize publication trends and keyword frequencies.
237
+ - PDF Viewer: built-in PDF.js viewer prevents cross-origin issues with local files.
238
+
239
+ ```bash
240
+ uv run deepresearch-flow paper db serve \
241
+ --input paper_infos.json \
242
+ --pdf-root ./pdfs \
243
+ --cache-dir .cache/db
244
+ ```
245
+
246
+ </details>
247
+
248
+ <details>
249
+ <summary><strong>4. Recognize: OCR Post-Processing</strong></summary>
250
+
251
+ Tools to clean up raw outputs from OCR engines like MinerU.
252
+
253
+ - Embed Images: convert local image links to Base64 for a portable single-file Markdown.
254
+ - Unpack Images: extract Base64 images back to files.
255
+ - Organize: flatten nested OCR output directories.
256
+ - Fix: apply OCR fixes and rumdl formatting during organize, or as a standalone step.
257
+
258
+ ```bash
259
+ uv run deepresearch-flow recognize md embed --input ./raw_ocr --output ./clean_md
260
+ ```
261
+
262
+ ```bash
263
+ # Organize MinerU output and apply OCR fixes
264
+ uv run deepresearch-flow recognize organize \
265
+ --input ./mineru_outputs \
266
+ --output-simple ./ocr_md \
267
+ --fix
268
+
269
+ # Fix and format existing markdown outputs
270
+ uv run deepresearch-flow recognize fix \
271
+ --input ./ocr_md \
272
+ --output ./ocr_md_fixed
273
+
274
+ # Fix in place
275
+ uv run deepresearch-flow recognize fix \
276
+ --input ./ocr_md \
277
+ --in-place
278
+ ```
279
+
280
+ </details>
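The "Embed Images" step above amounts to replacing local image links with `data:` URIs. The sketch below shows that transformation with the standard library only. It is not the code behind `recognize md embed`; `embed_images` is a hypothetical name, and the choice to skip remote URLs and missing files is an assumption.

```python
import base64
import mimetypes
import re
from pathlib import Path

# ![alt](target) with no spaces in the target; link titles are not handled.
_IMAGE = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def embed_images(markdown: str, base_dir: Path) -> str:
    """Inline local images as data URIs; leave remote or missing ones alone."""

    def _inline(match: re.Match[str]) -> str:
        alt, target = match.group(1), match.group(2)
        path = base_dir / target
        if target.startswith(("http://", "https://", "data:")) or not path.is_file():
            return match.group(0)
        mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        payload = base64.b64encode(path.read_bytes()).decode("ascii")
        return f"![{alt}](data:{mime};base64,{payload})"

    return _IMAGE.sub(_inline, markdown)
```

Calling `embed_images(text, Path("./raw_ocr"))` returns a single self-contained Markdown string. Unpacking is the inverse: decode each `data:` payload and write it back to a file.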
281
+
282
+ ---
283
+
284
+ ## Docker Support
285
+
286
+ Don't want to manage a Python environment? Run the CLI from the container image:
287
+
288
+ ```bash
289
+ docker run --rm -v $(pwd):/app -it ghcr.io/nerdneilsfield/deepresearch-flow --help
290
+ ```
291
+
292
+ ## Configuration
293
+
294
+ `config.toml` is your control center. It supports:
295
+
296
+ - Multiple Providers: mix and match OpenAI, DeepSeek (DashScope), Gemini, Claude, and Ollama.
297
+ - Model Routing: explicit routing to specific models (`--model provider/model_name`).
298
+ - Environment Variables: keep secrets safe using `env:VAR_NAME` syntax.
299
+
300
+ See `config.example.toml` for a full reference.
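Two of the conventions listed above, `env:VAR_NAME` secrets and `provider/model_name` routing, are easy to picture in code. The helpers below are a sketch under assumptions, not the logic in `deepresearch_flow/paper/config.py`: both function names are hypothetical, and splitting the model spec at the first slash is a guess.

```python
import os


def resolve_secret(value: str) -> str:
    """Expand "env:VAR_NAME" to that variable's value; pass other strings through."""
    prefix = "env:"
    if not value.startswith(prefix):
        return value
    name = value[len(prefix):]
    try:
        return os.environ[name]
    except KeyError:
        raise KeyError(f"environment variable {name!r} is not set") from None


def split_model(spec: str) -> tuple[str, str]:
    """Split "provider/model_name" at the first slash (assumed convention)."""
    provider, sep, model = spec.partition("/")
    if not sep or not provider or not model:
        raise ValueError(f"expected provider/model_name, got {spec!r}")
    return provider, model


os.environ.setdefault("DEMO_API_KEY", "sk-demo")  # demo value only
print(resolve_secret("env:DEMO_API_KEY"))
print(split_model("openai/gpt-4o-mini"))
```

Keeping only the `env:` reference in the file means `config.toml` can be committed or shared without leaking keys.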
301
+
302
+ ---
303
+
304
+ <p align="center">
305
+ Built with love for the Open Science community.
306
+ </p>
@@ -1,12 +1,12 @@
1
1
  deepresearch_flow/__init__.py,sha256=rjP9ES4zJCfEN_MCDYAYPL1mNJZGjojdmbRwnZ9FlEk,83
2
2
  deepresearch_flow/__main__.py,sha256=Ceo0rMTOhHhwFPD-HyDDagenNsmWEzPmsdYLI7kwKVA,115
3
- deepresearch_flow/cli.py,sha256=WhPhs-Cg4kHow0h0KTVaGTjQVXCPrlNvMyvgxCD8qgI,371
3
+ deepresearch_flow/cli.py,sha256=t4oowCNWldL0DrVJ4d0UlRkuGU2qHej_G0mAc_quteQ,455
4
4
  deepresearch_flow/paper/__init__.py,sha256=sunaOkcgAJBrfmcaJTumcWbPGVUSGWvOv2a2Yidzy0A,43
5
5
  deepresearch_flow/paper/cli.py,sha256=4UY3KHi6BUGztL1vB4w0cCMiIAo9KNxrfQn1GBHt6fA,11153
6
- deepresearch_flow/paper/config.py,sha256=5uGTWfAfzpv4w_JxC0w6GF2teaxF5b3rD8LaDqPVshU,8611
7
- deepresearch_flow/paper/db.py,sha256=uX-gblqh-ltoMO6mv0KPAm-sgNaRz46jaN0kxtzvP8s,33242
6
+ deepresearch_flow/paper/config.py,sha256=totVBGzouh0KS6mhRNPneXZYPuuw0SHiOGdO3r6HSfc,9289
7
+ deepresearch_flow/paper/db.py,sha256=ymVLzSEXDksdhLNSdvNA2IWLzT5lQOG1CpJlPU9CSQ8,33586
8
8
  deepresearch_flow/paper/extract.py,sha256=ID1dd2r6LTB0kRF4qBSH6bGtBGv0znw--g_mXYBcoeU,32314
9
- deepresearch_flow/paper/llm.py,sha256=R4rmFoYnGq_JiQODr4Jzk5j8U-j2NSYUXex6eR-WHXg,3929
9
+ deepresearch_flow/paper/llm.py,sha256=mHfs5IkT3Q6BOh46MDlfUmgVTX24WRf0IKKoOnN8nV8,4007
10
10
  deepresearch_flow/paper/prompts.py,sha256=mV7cEXw8pwukBUE4Trah0SjEPSSDgg5-RGaNaUdo4EU,519
11
11
  deepresearch_flow/paper/render.py,sha256=KeccrRGf1_sxoaiT6SUDkFRj9sStReoEwNvlw1ir7qw,2181
12
12
  deepresearch_flow/paper/schema.py,sha256=tQEVbj4R8NqNGBW6VYwW-xf5QJgV9qthrbZB-EmZTKA,1931
@@ -40,7 +40,7 @@ deepresearch_flow/paper/templates/default_paper.md.j2,sha256=3azu48534QtLtHrCwI1
40
40
  deepresearch_flow/paper/templates/eight_questions.md.j2,sha256=Ecz4CD3nd7jZ4Dg8himZkTwF4WDkk0ILWk8V728uOPI,3038
41
41
  deepresearch_flow/paper/templates/three_pass.md.j2,sha256=ZRj-NkpZePnqp0gSE8OT1dN5Lr5RW4vdOYdeVejYJW0,1576
42
42
  deepresearch_flow/paper/web/__init__.py,sha256=eQBtBjvOYsNEdivHTI0aO286SCG2c86xI02tf-0jz5I,39
43
- deepresearch_flow/paper/web/app.py,sha256=OB0iHU5pa7zJmP4IQAHPg4S-ucfcWBRBfHfaSDNJDTE,118325
43
+ deepresearch_flow/paper/web/app.py,sha256=nb4uzsDJ2R5dz_WA69NKwTgVgMqAyZv5OZ88GxFTWLQ,133311
44
44
  deepresearch_flow/paper/web/query.py,sha256=vTegfm5zGVkYCd6_K3yNrXJEmKMccUUFKG9DePPcKMw,1938
45
45
  deepresearch_flow/paper/web/pdfjs/LICENSE,sha256=DVQuDIgE45qn836wDaWnYhSdxoLXgpRRKH4RuTjpRZQ,10174
46
46
  deepresearch_flow/paper/web/pdfjs/build/pdf.js,sha256=2Ddm8gpMMfvOWinZh4nN--94GxR0QdpFvh0Qeejg-Bw,568294
@@ -413,12 +413,21 @@ deepresearch_flow/paper/web/pdfjs/web/standard_fonts/LiberationSans-BoldItalic.t
413
413
  deepresearch_flow/paper/web/pdfjs/web/standard_fonts/LiberationSans-Italic.ttf,sha256=gytEBtvvI2KIANOqrSEEhTSshNfjrZVb6DuBcu2O9RI,162036
414
414
  deepresearch_flow/paper/web/pdfjs/web/standard_fonts/LiberationSans-Regular.ttf,sha256=-Kzh-JKyvZ3BeSun8Jf6dYj4T-1IMhSA4E3lOQgoIh8,139512
415
415
  deepresearch_flow/recognize/__init__.py,sha256=yMAqbdCzpdRSiwFhq9j7yx9ZWxqz_Zq3vfYlTLFCWek,33
416
- deepresearch_flow/recognize/cli.py,sha256=zhJi6f0Ha6UvX-Q4mdPdM9uz0SoBuCEnRwzDslMN2Eg,16276
416
+ deepresearch_flow/recognize/cli.py,sha256=zWUsqvou2h6c5zR_myGaySvK6cG9ItJp9cJFtqqJk7Y,21597
417
417
  deepresearch_flow/recognize/markdown.py,sha256=y-PMJbGqrfWCNBVGanXK1M4OuMP9e1eqh7HDYye5a7Q,8757
418
- deepresearch_flow/recognize/organize.py,sha256=GSLmo037rpARSecaPxNCuIlLBbbilx8msWFJDqYJ4hc,3561
419
- deepresearch_flow-0.2.1.dist-info/licenses/LICENSE,sha256=hT8F2Py1pe6flxq3Ufdm2UKFk0B8CBm0aAQfsLXfvjw,1063
420
- deepresearch_flow-0.2.1.dist-info/METADATA,sha256=n7JgsLFBDM80UkhTHCtKn2biC-_mAA-sGBilxnIPzso,15250
421
- deepresearch_flow-0.2.1.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
422
- deepresearch_flow-0.2.1.dist-info/entry_points.txt,sha256=1uIKscs0YRMg_mFsg9NjsaTt4CvQqQ_-zGERUKhhL_Y,65
423
- deepresearch_flow-0.2.1.dist-info/top_level.txt,sha256=qBl4RvPJNJUbL8CFfMNWxY0HpQLx5RlF_ko-z_aKpm0,18
424
- deepresearch_flow-0.2.1.dist-info/RECORD,,
418
+ deepresearch_flow/recognize/organize.py,sha256=-KVzuwNjiT2bLwqwLwcguEMQYxnGiZXjLNlov_oXSTo,5237
419
+ deepresearch_flow/translator/__init__.py,sha256=iaAkufvEELVKNbcs08Nh7bkTO4JlkT3rT_JIBP9jGfc,26
420
+ deepresearch_flow/translator/cli.py,sha256=BceOZhQuN9s5kqhpvLJuwpbB5J0MY1ucWUKw0jXWUPc,16872
421
+ deepresearch_flow/translator/config.py,sha256=0JI4VBLIzT039YscfEb5hqtCWCu8P2bJIgnAfIAhFmU,502
422
+ deepresearch_flow/translator/engine.py,sha256=dLKKUjmptkLXhIs5ZsIUonmKI9bS8Se4tOnp7fADIYU,36800
423
+ deepresearch_flow/translator/fixers.py,sha256=6lRfeic8aGhsFDAWVb2p-QCXPzNkNNWMWXbQev-qyTw,15199
424
+ deepresearch_flow/translator/placeholder.py,sha256=mEgqA-dPdOsIhno0h_hzfpXpY2asb4A7UQEYV3tcnP8,2097
425
+ deepresearch_flow/translator/prompts.py,sha256=kl_9O2YvmtXC1w6WLnsLuVZKz4mcOtUF887SiTaOvc0,4754
426
+ deepresearch_flow/translator/protector.py,sha256=sXwNJ1Y8tyPm7dgm8-7S8HkcPe23TGsBdwRxH6mKL70,11291
427
+ deepresearch_flow/translator/segment.py,sha256=rBFMCLTrvm2GrPc_hNFymi-8Ih2DAtUQlZHCRE9nLaM,5146
428
+ deepresearch_flow-0.3.0.dist-info/licenses/LICENSE,sha256=hT8F2Py1pe6flxq3Ufdm2UKFk0B8CBm0aAQfsLXfvjw,1063
429
+ deepresearch_flow-0.3.0.dist-info/METADATA,sha256=AJ4RfKW-V9BPhrrlFSP8stAoXG4SwpF-AvZH5HEtWyw,10831
430
+ deepresearch_flow-0.3.0.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
431
+ deepresearch_flow-0.3.0.dist-info/entry_points.txt,sha256=1uIKscs0YRMg_mFsg9NjsaTt4CvQqQ_-zGERUKhhL_Y,65
432
+ deepresearch_flow-0.3.0.dist-info/top_level.txt,sha256=qBl4RvPJNJUbL8CFfMNWxY0HpQLx5RlF_ko-z_aKpm0,18
433
+ deepresearch_flow-0.3.0.dist-info/RECORD,,
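For readers checking the RECORD hunks above: each entry has the form `path,sha256=<digest>,<size>`, where the digest is the URL-safe base64 encoding of the SHA-256 hash with the trailing `=` padding removed, and the size is in bytes. The helper names below are hypothetical, and the sketch ignores CSV quoting of unusual paths.

```python
import base64
import hashlib


def record_digest(data: bytes) -> str:
    """Encode a sha256 digest the way RECORD does: urlsafe base64, no '=' padding."""
    digest = hashlib.sha256(data).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def parse_record_line(line: str) -> tuple[str, str, int]:
    """Split 'path,sha256=<digest>,<size>' into (path, digest, size)."""
    path, hash_field, size = line.rsplit(",", 2)
    algorithm, _, digest = hash_field.partition("=")
    if algorithm != "sha256":
        raise ValueError(f"unsupported hash algorithm: {algorithm!r}")
    return path, digest, int(size)


def verify(line: str, data: bytes) -> bool:
    """Check file contents against a RECORD entry (both size and digest)."""
    _, digest, size = parse_record_line(line)
    return len(data) == size and record_digest(data) == digest


entry = "deepresearch_flow/translator/config.py,sha256=0JI4VBLIzT039YscfEb5hqtCWCu8P2bJIgnAfIAhFmU,502"
print(parse_record_line(entry))
```

The final `RECORD,,` entry carries no hash or size because the file cannot contain its own digest, so `parse_record_line` rejects it by design.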