deepresearch-flow 0.2.1__py3-none-any.whl → 0.4.0__py3-none-any.whl

Files changed (41)
  1. deepresearch_flow/cli.py +2 -0
  2. deepresearch_flow/paper/config.py +15 -0
  3. deepresearch_flow/paper/db.py +193 -0
  4. deepresearch_flow/paper/db_ops.py +1939 -0
  5. deepresearch_flow/paper/llm.py +2 -0
  6. deepresearch_flow/paper/web/app.py +46 -3320
  7. deepresearch_flow/paper/web/constants.py +23 -0
  8. deepresearch_flow/paper/web/filters.py +255 -0
  9. deepresearch_flow/paper/web/handlers/__init__.py +14 -0
  10. deepresearch_flow/paper/web/handlers/api.py +217 -0
  11. deepresearch_flow/paper/web/handlers/pages.py +334 -0
  12. deepresearch_flow/paper/web/markdown.py +549 -0
  13. deepresearch_flow/paper/web/static/css/main.css +857 -0
  14. deepresearch_flow/paper/web/static/js/detail.js +406 -0
  15. deepresearch_flow/paper/web/static/js/index.js +266 -0
  16. deepresearch_flow/paper/web/static/js/outline.js +58 -0
  17. deepresearch_flow/paper/web/static/js/stats.js +39 -0
  18. deepresearch_flow/paper/web/templates/base.html +43 -0
  19. deepresearch_flow/paper/web/templates/detail.html +332 -0
  20. deepresearch_flow/paper/web/templates/index.html +114 -0
  21. deepresearch_flow/paper/web/templates/stats.html +29 -0
  22. deepresearch_flow/paper/web/templates.py +85 -0
  23. deepresearch_flow/paper/web/text.py +68 -0
  24. deepresearch_flow/recognize/cli.py +157 -3
  25. deepresearch_flow/recognize/organize.py +58 -0
  26. deepresearch_flow/translator/__init__.py +1 -0
  27. deepresearch_flow/translator/cli.py +451 -0
  28. deepresearch_flow/translator/config.py +19 -0
  29. deepresearch_flow/translator/engine.py +959 -0
  30. deepresearch_flow/translator/fixers.py +451 -0
  31. deepresearch_flow/translator/placeholder.py +62 -0
  32. deepresearch_flow/translator/prompts.py +116 -0
  33. deepresearch_flow/translator/protector.py +291 -0
  34. deepresearch_flow/translator/segment.py +180 -0
  35. deepresearch_flow-0.4.0.dist-info/METADATA +327 -0
  36. {deepresearch_flow-0.2.1.dist-info → deepresearch_flow-0.4.0.dist-info}/RECORD +40 -13
  37. deepresearch_flow-0.2.1.dist-info/METADATA +0 -424
  38. {deepresearch_flow-0.2.1.dist-info → deepresearch_flow-0.4.0.dist-info}/WHEEL +0 -0
  39. {deepresearch_flow-0.2.1.dist-info → deepresearch_flow-0.4.0.dist-info}/entry_points.txt +0 -0
  40. {deepresearch_flow-0.2.1.dist-info → deepresearch_flow-0.4.0.dist-info}/licenses/LICENSE +0 -0
  41. {deepresearch_flow-0.2.1.dist-info → deepresearch_flow-0.4.0.dist-info}/top_level.txt +0 -0
@@ -1,424 +0,0 @@
1
- Metadata-Version: 2.4
2
- Name: deepresearch-flow
3
- Version: 0.2.1
4
- Summary: Workflow tools for paper extraction, review, and research automation.
5
- Author-email: DengQi <dengqi935@gmail.com>
6
- License: MIT License
7
-
8
- Copyright (c) 2025 DengQi
9
-
10
- Permission is hereby granted, free of charge, to any person obtaining a copy
11
- of this software and associated documentation files (the "Software"), to deal
12
- in the Software without restriction, including without limitation the rights
13
- to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
14
- copies of the Software, and to permit persons to whom the Software is
15
- furnished to do so, subject to the following conditions:
16
-
17
- The above copyright notice and this permission notice shall be included in all
18
- copies or substantial portions of the Software.
19
-
20
- THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
21
- IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
22
- FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
23
- AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
24
- LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
25
- OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
26
- SOFTWARE.
27
-
28
- Project-URL: Homepage, https://github.com/nerdneilsfield/ai-deepresearch-flow
29
- Project-URL: Repository, https://github.com/nerdneilsfield/ai-deepresearch-flow
30
- Project-URL: Issues, https://github.com/nerdneilsfield/ai-deepresearch-flow/issues
31
- Keywords: research,papers,pdf,ocr,llm,workflow
32
- Classifier: Development Status :: 3 - Alpha
33
- Classifier: Intended Audience :: Science/Research
34
- Classifier: License :: OSI Approved :: MIT License
35
- Classifier: Programming Language :: Python :: 3
36
- Classifier: Programming Language :: Python :: 3 :: Only
37
- Classifier: Topic :: Scientific/Engineering :: Information Analysis
38
- Requires-Python: >=3.12
39
- Description-Content-Type: text/markdown
40
- License-File: LICENSE
41
- Requires-Dist: anthropic>=0.28.0
42
- Requires-Dist: click>=8.1.7
43
- Requires-Dist: coloredlogs>=15.0.1
44
- Requires-Dist: dashscope>=1.20.0
45
- Requires-Dist: google-auth>=2.0.0
46
- Requires-Dist: google-genai>=0.5.0
47
- Requires-Dist: httpx>=0.27.0
48
- Requires-Dist: jinja2>=3.1.3
49
- Requires-Dist: json-repair>=0.31.0
50
- Requires-Dist: jsonschema>=4.21.1
51
- Requires-Dist: markdown-it-py>=3.0.0
52
- Requires-Dist: pypdf>=3.0.0
53
- Requires-Dist: pybtex>=0.24.0
54
- Requires-Dist: rich>=13.7.1
55
- Requires-Dist: starlette>=0.37.2
56
- Requires-Dist: tqdm>=4.66.4
57
- Requires-Dist: uvicorn>=0.27.1
58
- Dynamic: license-file
59
-
60
- # deepresearch-flow
61
-
62
- DeepResearch Flow command-line tools for document extraction, OCR post-processing, and paper database operations.
63
-
64
- ## Quick Start
65
-
66
- ```bash
67
- pip install deepresearch-flow
68
- # or
69
- uv pip install deepresearch-flow
70
-
71
- # Development install
72
- pip install -e .
73
-
74
- cp config.example.toml config.toml
75
-
76
- # Extract from a docs folder
77
- uv run deepresearch-flow paper extract \
78
- --input ./docs \
79
- --model openai/gpt-4o-mini
80
-
81
- # Serve a local UI
82
- uv run deepresearch-flow paper db serve \
83
- --input ./paper_infos_simple.json \
84
- --host 127.0.0.1 \
85
- --port 8000
86
- ```
87
-
88
- Docker images:
89
-
90
- ```bash
91
- docker run --rm -it nerdneils/deepresearch-flow --help
92
- # or
93
- docker run --rm -it ghcr.io/nerdneilsfield/deepresearch-flow --help
94
- ```
95
-
96
- ## Commands
97
-
98
- `deepresearch-flow` is the top-level CLI. Workflows live under `paper` and `recognize`.
99
- Use `deepresearch-flow --help`, `deepresearch-flow paper --help`, and `deepresearch-flow recognize --help` to explore flags.
100
-
101
- <details>
102
- <summary>Configuration details</summary>
103
-
104
- Copy `config.example.toml` to `config.toml` and edit providers.
105
-
106
- - Providers are configured under `[[providers]]`.
107
- - Use `api_keys = ["env:OPENAI_API_KEY"]` to read from environment variables.
108
- - `model_list` is required for each provider and controls allowed `provider/model` values.
109
- - Explicit model routing is required: `--model provider/model`.
110
- - Supported provider types: `ollama`, `openai_compatible`, `dashscope`, `gemini_ai_studio`, `gemini_vertex`, `azure_openai`, `claude`.
111
- - Provider-specific fields: `azure_openai` requires `endpoint`, `api_version`, `deployment`; `gemini_vertex` requires `project_id`, `location`; `claude` requires `anthropic_version`.
112
- - Built-in prompt templates for extraction: `simple`, `deep_read`, `eight_questions`, `three_pass`.
113
- - Template rename: `seven_questions` is now `eight_questions`.
114
- - Render templates use `paper db render-md --template-name` with the same names.
115
- - `--language` defaults to `en`; extraction stores it as `output_language` and render uses that field.
116
- - When `output_language` is `zh`, render headings include both Chinese and English.
117
- - Complex templates (`deep_read`, `eight_questions`, `three_pass`) run multi-stage extraction and persist per-document stage files under `paper_stage_outputs/`.
118
- - Custom templates: use `--prompt-system`/`--prompt-user` with `--schema-json`, or `--template-dir` containing `system.j2`, `user.j2`, `schema.json`, `render.j2`.
119
- - Custom templates run in single-stage extraction mode.
120
- - Built-in schemas require `publication_date` and `publication_venue`.
121
- - The `simple` template requires `abstract`, `keywords`, and a single-paragraph `summary` that covers the eight-question aspects.
122
- - Extraction tolerates minor JSON formatting errors and ignores extra top-level fields when required keys validate.
123
-
124
- </details>
125
-
126
- <details>
127
- <summary>paper extract — structured extraction from markdown</summary>
128
-
129
- Extract structured JSON from markdown files using configured providers and prompt templates.
130
-
131
- Key options:
132
-
133
- - `--input` (repeatable): file or directory input.
134
- - `--glob`: filter when scanning directories.
135
- - `--prompt-template` / `--language`: select built-in prompts and output language.
136
- - `--prompt-system` / `--prompt-user` / `--schema-json`: custom prompt + schema.
137
- - `--template-dir`: use a directory containing `system.j2`, `user.j2`, `schema.json`, `render.j2`.
138
- - `--sleep-every` / `--sleep-time`: throttle request initiation.
139
- - `--max-concurrency`: override concurrency.
140
- - `--render-md`: render markdown output as part of extraction.
141
- - `--dry-run`: scan inputs and show summary metrics without calling providers.
142
-
143
- Outputs:
144
-
145
- - Aggregated JSON: `paper_infos.json`
146
- - Errors: `paper_errors.json`
147
- - Optional rendered Markdown: `rendered_md/` by default
148
-
149
- Incremental behavior:
150
-
151
- - Reuses existing entries when `source_path` and `source_hash` match.
152
- - Use `--force` to re-extract everything.
153
- - Use `--retry-failed` to retry only failed documents listed in `paper_errors.json`.
154
- - Use `--verbose` for detailed logs alongside progress bars.
155
- - Extract-time rendering defaults to the same built-in template as `--prompt-template`.
156
- - Output JSON is written as `{"template_tag": "...", "papers": [...]}`.
157
- - A summary table prints input/prompt/output character totals, token estimates, and throughput after each run.
158
- - Progress bars include a live prompt/completion/total token ticker.
159
-
160
- Examples:
161
-
162
- ```bash
163
- # Scan a directory recursively (default: *.md)
164
- deepresearch-flow paper extract \
165
- --input ./docs \
166
- --model openai/gpt-4o-mini
167
-
168
- # Multiple inputs + custom output
169
- deepresearch-flow paper extract \
170
- --input ./docs \
171
- --input ./more-docs \
172
- --output ./out/papers.json \
173
- --model openai/gpt-4o-mini
174
-
175
- # Built-in template with output language
176
- deepresearch-flow paper extract \
177
- --input ./docs \
178
- --prompt-template deep_read \
179
- --language zh \
180
- --model openai/gpt-4o-mini
181
-
182
- # Custom template directory
183
- deepresearch-flow paper extract \
184
- --input ./docs \
185
- --template-dir ./prompts \
186
- --model openai/gpt-4o-mini
187
-
188
- # Extract + render in one run
189
- deepresearch-flow paper extract \
190
- --input ./docs \
191
- --prompt-template eight_questions \
192
- --render-md \
193
- --model openai/gpt-4o-mini
194
-
195
- # Throttle request initiation
196
- deepresearch-flow paper extract \
197
- --input ./docs \
198
- --sleep-every 10 \
199
- --sleep-time 60 \
200
- --model openai/gpt-4o-mini
201
- ```
202
-
203
- </details>
204
-
205
- <details>
206
- <summary>paper db — render, analyze, and serve extracted data</summary>
207
-
208
- Render outputs, compute stats, and serve a local web UI over paper JSON.
209
-
210
- JSON input formats:
211
-
212
- - For `db render-md`, `db statistics`, `db filter`, and `db generate-tags`, the input can be either an aggregated JSON list or `{"template_tag": "...", "papers": [...]}` (the commands operate on `papers`).
213
- - For `db serve`, each input JSON must be an object: `{"template_tag": "simple", "papers": [...]}`.
214
- When `template_tag` is missing, the server attempts to infer it as a fallback (legacy list-only inputs are rejected).
215
-
216
- Web UI highlights:
217
-
218
- - Summary/Source/PDF/PDF Viewer views with tab navigation.
219
- - Split view: choose left/right panes independently (summary/source/pdf/pdf viewer) via URL params.
220
- - Summary/Source views include a collapsible outline panel (top-left) and a back-to-top control (bottom-left).
221
- - Summary template dropdown shows only available templates per paper.
222
- - Homepage filters: PDF/Source/Summary availability and template tags, plus a filter syntax input (`tmpl:...`, `has:pdf`, `no:source`).
223
- - Homepage stats: total and filtered counts for PDF/Source/Summary plus per-template totals.
224
- - Stats page includes keyword frequency charts.
225
- - Source view renders Markdown and supports embedded HTML tables plus `data:image/...;base64` `<img>` tags (images are constrained to the content width).
226
- - PDF Viewer is served locally (PDF.js viewer assets) to avoid cross-origin issues with local PDFs.
227
- - PDF-only entries are surfaced for unmatched PDFs under `--pdf-root` (metadata title if available, otherwise filename), with badges and detail warnings.
228
- - PDF-only entries are excluded from stats counts.
229
- - Merge behavior for multi-input serve: title similarity (>= 0.95), preferring `bibtex.fields.title` and falling back to `paper_title`.
230
- - Cache merged inputs with `--cache-dir`; bypass with `--no-cache`.
231
-
232
- Examples:
233
-
234
- ```bash
235
- # Render Markdown from JSON
236
- deepresearch-flow paper db render-md --input paper_infos.json
237
-
238
- # Render with a built-in template and language fallback
239
- deepresearch-flow paper db render-md \
240
- --input paper_infos.json \
241
- --template-name deep_read \
242
- --language zh
243
-
244
- # Generate tags
245
- deepresearch-flow paper db generate-tags \
246
- --input paper_infos.json \
247
- --output paper_infos_with_tags.json \
248
- --model openai/gpt-4o-mini
249
-
250
- # Filter papers
251
- deepresearch-flow paper db filter \
252
- --input paper_infos.json \
253
- --output filtered.json \
254
- --tags hardware_acceleration,fpga
255
-
256
- # Statistics (rich tables)
257
- deepresearch-flow paper db statistics \
258
- --input paper_infos.json \
259
- --top-n 20
260
- # Statistics also include keyword frequency (normalized to lowercase)
261
-
262
- # Serve a local read-only web UI (loads charts/libs via CDN)
263
- deepresearch-flow paper db serve \
264
- --input paper_infos_simple.json \
265
- --input paper_infos_deep_read.json \
266
- --cache-dir .cache/db-serve \
267
- --host 127.0.0.1 \
268
- --port 8000
269
-
270
- # Serve with optional BibTeX enrichment and source roots
271
- deepresearch-flow paper db serve \
272
- --input paper_infos_simple.json \
273
- --input paper_infos_deep_read.json \
274
- --bibtex ./refs/library.bib \
275
- --md-root ./docs \
276
- --md-root ./more_docs \
277
- --pdf-root ./pdfs \
278
- --cache-dir .cache/db-serve \
279
- --host 127.0.0.1 \
280
- --port 8000
281
- ```
282
-
283
- Web search syntax (Scholar-style):
284
-
285
- - Default is AND: `fpga kNN`
286
- - Quoted phrases: `title:"nearest neighbor"`
287
- - OR: `fpga OR asic`
288
- - Negation: `-survey` or `-tag:survey`
289
- - Fields: `title:`, `author:`, `tag:`, `venue:`, `year:`, `month:` (content tags only)
290
- - Year range: `year:2020..2024`
291
-
292
- Other database helpers:
293
-
294
- - `append-bibtex`
295
- - `sort-papers`
296
- - `split-by-tag`
297
- - `split-database`
298
- - `statistics`
299
- - `merge`
300
-
301
- </details>
302
-
303
- <details>
304
- <summary>recognize md — embed or unpack markdown images</summary>
305
-
306
- `recognize md embed` replaces local image links in markdown with `data:image/...;base64,` URLs.
307
- `recognize md unpack` extracts embedded images into `images/` and updates markdown links.
308
-
309
- Key options:
310
-
311
- - `--input` (repeatable): file or directory input.
312
- - `--recursive`: recurse into directories.
313
- - `--output`: output directory (flattened outputs).
314
- - `--enable-http`: allow embedding HTTP(S) images (embed only).
315
- - `--workers`: concurrent workers (default: 4).
316
- - `--dry-run`: report planned outputs without writing files.
317
- - `--verbose`: enable detailed logs for image resolution/HTTP fetches.
318
-
319
- Notes:
320
-
321
- - Progress bars report completion; a rich summary table lists counts, image totals, duration, and output locations.
322
- - Summary paths are shown relative to the current working directory when possible.
323
- - If the output directory is not empty, the command logs a warning before writing files.
324
-
325
- Examples:
326
-
327
- ```bash
328
- # Embed local images (flatten outputs)
329
- deepresearch-flow recognize md embed \
330
- --input ./docs \
331
- --recursive \
332
- --output ./out_md
333
-
334
- # Embed HTTP images (with browser User-Agent)
335
- deepresearch-flow recognize md embed \
336
- --input ./docs \
337
- --enable-http \
338
- --output ./out_md
339
-
340
- # Unpack embedded images into output/images/
341
- deepresearch-flow recognize md unpack \
342
- --input ./docs \
343
- --recursive \
344
- --output ./out_md
345
- ```
346
-
347
- </details>
348
-
349
- <details>
350
- <summary>recognize organize — flatten OCR outputs</summary>
351
-
352
- Organize OCR outputs (layout: `mineru`) into flat markdown files, with optional image embedding.
353
-
354
- Key options:
355
-
356
- - `--layout`: OCR layout type (currently `mineru`).
357
- - `--input` (repeatable): directories containing `full.md` + `images/`.
358
- - `--recursive`: search for layout folders (required when inputs contain nested result directories).
359
- - `--output-simple`: copy markdown + images to output (shared `images/`).
360
- - `--output-base64`: embed images into markdown.
361
- - `--workers`: concurrent workers (default: 4).
362
- - `--dry-run`: report planned outputs without writing files.
363
- - `--verbose`: enable detailed logs for layout discovery and file copying.
364
-
365
- Notes:
366
-
367
- - Use `--recursive` when the input directory contains nested layout folders (otherwise no layouts are discovered).
368
- - If output directories are not empty, the command logs a warning before writing files.
369
- - A summary table lists counts, image totals, duration, and output locations after completion.
370
- - Summary paths are shown relative to the current working directory when possible.
371
-
372
- Examples:
373
-
374
- ```bash
375
- # Copy markdown + images into a flat output directory
376
- deepresearch-flow recognize organize \
377
- --layout mineru \
378
- --input ./ocr_results \
379
- --recursive \
380
- --output-simple ./out_simple
381
-
382
- # Embed images into markdown
383
- deepresearch-flow recognize organize \
384
- --layout mineru \
385
- --input ./ocr_results \
386
- --output-base64 ./out_base64
387
- ```
388
-
389
- </details>
390
-
391
- <details>
392
- <summary>Data formats (examples)</summary>
393
-
394
- Aggregated extraction output is a JSON list:
395
-
396
- ```json
397
- [
398
- {
399
- "paper_title": "Example Paper",
400
- "paper_authors": ["Author A", "Author B"],
401
- "publication_date": "2024-01-01",
402
- "publication_venue": "ExampleConf",
403
- "source_path": "/abs/path/to/doc.md"
404
- }
405
- ]
406
- ```
407
-
408
- `db serve` expects each input to be an object with a `template_tag` and a `papers` list:
409
-
410
- ```json
411
- {
412
- "template_tag": "simple",
413
- "papers": [
414
- {
415
- "paper_title": "Example Paper",
416
- "paper_authors": ["Author A"],
417
- "publication_date": "2024-01-01",
418
- "publication_venue": "ExampleConf"
419
- }
420
- ]
421
- }
422
- ```
423
-
424
- </details>