PyPI - deepresearch-flow - Versions diffs - 0.2.1__py3-none-any.whl → 0.3.0__py3-none-any.whl - Mend

deepresearch-flow 0.2.1__py3-none-any.whl → 0.3.0__py3-none-any.whl


Files changed (23)

deepresearch_flow/cli.py +2 -0
deepresearch_flow/paper/config.py +15 -0
deepresearch_flow/paper/db.py +9 -0
deepresearch_flow/paper/llm.py +2 -0
deepresearch_flow/paper/web/app.py +413 -20
deepresearch_flow/recognize/cli.py +157 -3
deepresearch_flow/recognize/organize.py +58 -0
deepresearch_flow/translator/__init__.py +1 -0
deepresearch_flow/translator/cli.py +451 -0
deepresearch_flow/translator/config.py +19 -0
deepresearch_flow/translator/engine.py +959 -0
deepresearch_flow/translator/fixers.py +451 -0
deepresearch_flow/translator/placeholder.py +62 -0
deepresearch_flow/translator/prompts.py +116 -0
deepresearch_flow/translator/protector.py +291 -0
deepresearch_flow/translator/segment.py +180 -0
deepresearch_flow-0.3.0.dist-info/METADATA +306 -0
{deepresearch_flow-0.2.1.dist-info → deepresearch_flow-0.3.0.dist-info}/RECORD +22 -13
deepresearch_flow-0.2.1.dist-info/METADATA +0 -424
{deepresearch_flow-0.2.1.dist-info → deepresearch_flow-0.3.0.dist-info}/WHEEL +0 -0
{deepresearch_flow-0.2.1.dist-info → deepresearch_flow-0.3.0.dist-info}/entry_points.txt +0 -0
{deepresearch_flow-0.2.1.dist-info → deepresearch_flow-0.3.0.dist-info}/licenses/LICENSE +0 -0
{deepresearch_flow-0.2.1.dist-info → deepresearch_flow-0.3.0.dist-info}/top_level.txt +0 -0

deepresearch_flow-0.2.1.dist-info/METADATA DELETED

@@ -1,424 +0,0 @@
-Metadata-Version: 2.4
-Name: deepresearch-flow
-Version: 0.2.1
-Summary: Workflow tools for paper extraction, review, and research automation.
-Author-email: DengQi <dengqi935@gmail.com>
-License: MIT License
-        Copyright (c) 2025 DengQi
-        Permission is hereby granted, free of charge, to any person obtaining a copy
-        of this software and associated documentation files (the "Software"), to deal
-        in the Software without restriction, including without limitation the rights
-        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-        copies of the Software, and to permit persons to whom the Software is
-        furnished to do so, subject to the following conditions:
-        The above copyright notice and this permission notice shall be included in all
-        copies or substantial portions of the Software.
-        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-        SOFTWARE.
-Project-URL: Homepage, https://github.com/nerdneilsfield/ai-deepresearch-flow
-Project-URL: Repository, https://github.com/nerdneilsfield/ai-deepresearch-flow
-Project-URL: Issues, https://github.com/nerdneilsfield/ai-deepresearch-flow/issues
-Keywords: research,papers,pdf,ocr,llm,workflow
-Classifier: Development Status :: 3 - Alpha
-Classifier: Intended Audience :: Science/Research
-Classifier: License :: OSI Approved :: MIT License
-Classifier: Programming Language :: Python :: 3
-Classifier: Programming Language :: Python :: 3 :: Only
-Classifier: Topic :: Scientific/Engineering :: Information Analysis
-Requires-Python: >=3.12
-Description-Content-Type: text/markdown
-License-File: LICENSE
-Requires-Dist: anthropic>=0.28.0
-Requires-Dist: click>=8.1.7
-Requires-Dist: coloredlogs>=15.0.1
-Requires-Dist: dashscope>=1.20.0
-Requires-Dist: google-auth>=2.0.0
-Requires-Dist: google-genai>=0.5.0
-Requires-Dist: httpx>=0.27.0
-Requires-Dist: jinja2>=3.1.3
-Requires-Dist: json-repair>=0.31.0
-Requires-Dist: jsonschema>=4.21.1
-Requires-Dist: markdown-it-py>=3.0.0
-Requires-Dist: pypdf>=3.0.0
-Requires-Dist: pybtex>=0.24.0
-Requires-Dist: rich>=13.7.1
-Requires-Dist: starlette>=0.37.2
-Requires-Dist: tqdm>=4.66.4
-Requires-Dist: uvicorn>=0.27.1
-Dynamic: license-file
-# deepresearch-flow
-DeepResearch Flow command-line tools for document extraction, OCR post-processing, and paper database operations.
-## Quick Start
-```bash
-pip install deepresearch-flow
-# or
-uv pip install deepresearch-flow
-# Development install
-pip install -e .
-cp config.example.toml config.toml
-# Extract from a docs folder
-uv run deepresearch-flow paper extract \
-  --input ./docs \
-  --model openai/gpt-4o-mini
-# Serve a local UI
-uv run deepresearch-flow paper db serve \
-  --input ./paper_infos_simple.json \
-  --host 127.0.0.1 \
-  --port 8000
-```
-Docker images:
-```bash
-docker run --rm -it nerdneils/deepresearch-flow --help
-# or
-docker run --rm -it ghcr.io/nerdneilsfield/deepresearch-flow --help
-```
-## Commands
-`deepresearch-flow` is the top-level CLI. Workflows live under `paper` and `recognize`.
-Use `deepresearch-flow --help`, `deepresearch-flow paper --help`, and `deepresearch-flow recognize --help` to explore flags.
-<details>
-<summary>Configuration details</summary>
-Copy `config.example.toml` to `config.toml` and edit providers.
-- Providers are configured under `[[providers]]`.
-- Use `api_keys = ["env:OPENAI_API_KEY"]` to read from environment variables.
-- `model_list` is required for each provider and controls allowed `provider/model` values.
-- Explicit model routing is required: `--model provider/model`.
-- Supported provider types: `ollama`, `openai_compatible`, `dashscope`, `gemini_ai_studio`, `gemini_vertex`, `azure_openai`, `claude`.
-- Provider-specific fields: `azure_openai` requires `endpoint`, `api_version`, `deployment`; `gemini_vertex` requires `project_id`, `location`; `claude` requires `anthropic_version`.
-- Built-in prompt templates for extraction: `simple`, `deep_read`, `eight_questions`, `three_pass`.
-- Template rename: `seven_questions` is now `eight_questions`.
-- Render templates use `paper db render-md --template-name` with the same names.
-- `--language` defaults to `en`; extraction stores it as `output_language` and render uses that field.
-- When `output_language` is `zh`, render headings include both Chinese and English.
-- Complex templates (`deep_read`, `eight_questions`, `three_pass`) run multi-stage extraction and persist per-document stage files under `paper_stage_outputs/`.
-- Custom templates: use `--prompt-system`/`--prompt-user` with `--schema-json`, or `--template-dir` containing `system.j2`, `user.j2`, `schema.json`, `render.j2`.
-- Custom templates run in single-stage extraction mode.
-- Built-in schemas require `publication_date` and `publication_venue`.
-- The `simple` template requires `abstract`, `keywords`, and a single-paragraph `summary` that covers the eight-question aspects.
-- Extraction tolerates minor JSON formatting errors and ignores extra top-level fields when required keys validate.
-</details>
-<details>
-<summary>paper extract — structured extraction from markdown</summary>
-Extract structured JSON from markdown files using configured providers and prompt templates.
-Key options:
-- `--input` (repeatable): file or directory input.
-- `--glob`: filter when scanning directories.
-- `--prompt-template` / `--language`: select built-in prompts and output language.
-- `--prompt-system` / `--prompt-user` / `--schema-json`: custom prompt + schema.
-- `--template-dir`: use a directory containing `system.j2`, `user.j2`, `schema.json`, `render.j2`.
-- `--sleep-every` / `--sleep-time`: throttle request initiation.
-- `--max-concurrency`: override concurrency.
-- `--render-md`: render markdown output as part of extraction.
-- `--dry-run`: scan inputs and show summary metrics without calling providers.
-Outputs:
-- Aggregated JSON: `paper_infos.json`
-- Errors: `paper_errors.json`
-- Optional rendered Markdown: `rendered_md/` by default
-Incremental behavior:
-- Reuses existing entries when `source_path` and `source_hash` match.
-- Use `--force` to re-extract everything.
-- Use `--retry-failed` to retry only failed documents listed in `paper_errors.json`.
-- Use `--verbose` for detailed logs alongside progress bars.
-- Extract-time rendering defaults to the same built-in template as `--prompt-template`.
-- Output JSON is written as `{"template_tag": "...", "papers": [...]}`.
-- A summary table prints input/prompt/output character totals, token estimates, and throughput after each run.
-- Progress bars include a live prompt/completion/total token ticker.
-Examples:
-```bash
-# Scan a directory recursively (default: *.md)
-deepresearch-flow paper extract \
-  --input ./docs \
-  --model openai/gpt-4o-mini
-# Multiple inputs + custom output
-deepresearch-flow paper extract \
-  --input ./docs \
-  --input ./more-docs \
-  --output ./out/papers.json \
-  --model openai/gpt-4o-mini
-# Built-in template with output language
-deepresearch-flow paper extract \
-  --input ./docs \
-  --prompt-template deep_read \
-  --language zh \
-  --model openai/gpt-4o-mini
-# Custom template directory
-deepresearch-flow paper extract \
-  --input ./docs \
-  --template-dir ./prompts \
-  --model openai/gpt-4o-mini
-# Extract + render in one run
-deepresearch-flow paper extract \
-  --input ./docs \
-  --prompt-template eight_questions \
-  --render-md \
-  --model openai/gpt-4o-mini
-# Throttle request initiation
-deepresearch-flow paper extract \
-  --input ./docs \
-  --sleep-every 10 \
-  --sleep-time 60 \
-  --model openai/gpt-4o-mini
-```
-</details>
-<details>
-<summary>paper db — render, analyze, and serve extracted data</summary>
-Render outputs, compute stats, and serve a local web UI over paper JSON.
-JSON input formats:
-- For `db render-md`, `db statistics`, `db filter`, and `db generate-tags`, the input can be either an aggregated JSON list or `{"template_tag": "...", "papers": [...]}` (the commands operate on `papers`).
-- For `db serve`, each input JSON must be an object: `{"template_tag": "simple", "papers": [...]}`.
-  When `template_tag` is missing, the server attempts to infer it as a fallback (legacy list-only inputs are rejected).
-Web UI highlights:
-- Summary/Source/PDF/PDF Viewer views with tab navigation.
-- Split view: choose left/right panes independently (summary/source/pdf/pdf viewer) via URL params.
-- Summary/Source views include a collapsible outline panel (top-left) and a back-to-top control (bottom-left).
-- Summary template dropdown shows only available templates per paper.
-- Homepage filters: PDF/Source/Summary availability and template tags, plus a filter syntax input (`tmpl:...`, `has:pdf`, `no:source`).
-- Homepage stats: total and filtered counts for PDF/Source/Summary plus per-template totals.
-- Stats page includes keyword frequency charts.
-- Source view renders Markdown and supports embedded HTML tables plus `data:image/...;base64` `<img>` tags (images are constrained to the content width).
-- PDF Viewer is served locally (PDF.js viewer assets) to avoid cross-origin issues with local PDFs.
-- PDF-only entries are surfaced for unmatched PDFs under `--pdf-root` (metadata title if available, otherwise filename), with badges and detail warnings.
-- PDF-only entries are excluded from stats counts.
-- Merge behavior for multi-input serve: title similarity (>= 0.95), preferring `bibtex.fields.title` and falling back to `paper_title`.
-- Cache merged inputs with `--cache-dir`; bypass with `--no-cache`.
-Examples:
-```bash
-# Render Markdown from JSON
-deepresearch-flow paper db render-md --input paper_infos.json
-# Render with a built-in template and language fallback
-deepresearch-flow paper db render-md \
-  --input paper_infos.json \
-  --template-name deep_read \
-  --language zh
-# Generate tags
-deepresearch-flow paper db generate-tags \
-  --input paper_infos.json \
-  --output paper_infos_with_tags.json \
-  --model openai/gpt-4o-mini
-# Filter papers
-deepresearch-flow paper db filter \
-  --input paper_infos.json \
-  --output filtered.json \
-  --tags hardware_acceleration,fpga
-# Statistics (rich tables)
-deepresearch-flow paper db statistics \
-  --input paper_infos.json \
-  --top-n 20
-# Statistics also include keyword frequency (normalized to lowercase)
-# Serve a local read-only web UI (loads charts/libs via CDN)
-deepresearch-flow paper db serve \
-  --input paper_infos_simple.json \
-  --input paper_infos_deep_read.json \
-  --cache-dir .cache/db-serve \
-  --host 127.0.0.1 \
-  --port 8000
-# Serve with optional BibTeX enrichment and source roots
-deepresearch-flow paper db serve \
-  --input paper_infos_simple.json \
-  --input paper_infos_deep_read.json \
-  --bibtex ./refs/library.bib \
-  --md-root ./docs \
-  --md-root ./more_docs \
-  --pdf-root ./pdfs \
-  --cache-dir .cache/db-serve \
-  --host 127.0.0.1 \
-  --port 8000
-```
-Web search syntax (Scholar-style):
-- Default is AND: `fpga kNN`
-- Quoted phrases: `title:"nearest neighbor"`
-- OR: `fpga OR asic`
-- Negation: `-survey` or `-tag:survey`
-- Fields: `title:`, `author:`, `tag:`, `venue:`, `year:`, `month:` (content tags only)
-- Year range: `year:2020..2024`
-Other database helpers:
-- `append-bibtex`
-- `sort-papers`
-- `split-by-tag`
-- `split-database`
-- `statistics`
-- `merge`
-</details>
-<details>
-<summary>recognize md — embed or unpack markdown images</summary>
-`recognize md embed` replaces local image links in markdown with `data:image/...;base64,` URLs.
-`recognize md unpack` extracts embedded images into `images/` and updates markdown links.
-Key options:
-- `--input` (repeatable): file or directory input.
-- `--recursive`: recurse into directories.
-- `--output`: output directory (flattened outputs).
-- `--enable-http`: allow embedding HTTP(S) images (embed only).
-- `--workers`: concurrent workers (default: 4).
-- `--dry-run`: report planned outputs without writing files.
-- `--verbose`: enable detailed logs for image resolution/HTTP fetches.
- Notes:
-- Progress bars report completion; a rich summary table lists counts, image totals, duration, and output locations.
-- Summary paths are shown relative to the current working directory when possible.
-- If the output directory is not empty, the command logs a warning before writing files.
-Examples:
-```bash
-# Embed local images (flatten outputs)
-deepresearch-flow recognize md embed \
-  --input ./docs \
-  --recursive \
-  --output ./out_md
-# Embed HTTP images (with browser User-Agent)
-deepresearch-flow recognize md embed \
-  --input ./docs \
-  --enable-http \
-  --output ./out_md
-# Unpack embedded images into output/images/
-deepresearch-flow recognize md unpack \
-  --input ./docs \
-  --recursive \
-  --output ./out_md
-```
-</details>
-<details>
-<summary>recognize organize — flatten OCR outputs</summary>
-Organize OCR outputs (layout: `mineru`) into flat markdown files, with optional image embedding.
-Key options:
-- `--layout`: OCR layout type (currently `mineru`).
-- `--input` (repeatable): directories containing `full.md` + `images/`.
-- `--recursive`: search for layout folders (required when inputs contain nested result directories).
-- `--output-simple`: copy markdown + images to output (shared `images/`).
-- `--output-base64`: embed images into markdown.
-- `--workers`: concurrent workers (default: 4).
-- `--dry-run`: report planned outputs without writing files.
-- `--verbose`: enable detailed logs for layout discovery and file copying.
- Notes:
-- Use `--recursive` when the input directory contains nested layout folders (otherwise no layouts are discovered).
-- If output directories are not empty, the command logs a warning before writing files.
-- A summary table lists counts, image totals, duration, and output locations after completion.
-- Summary paths are shown relative to the current working directory when possible.
-Examples:
-```bash
-# Copy markdown + images into a flat output directory
-deepresearch-flow recognize organize \
-  --layout mineru \
-  --input ./ocr_results \
-  --recursive \
-  --output-simple ./out_simple
-# Embed images into markdown
-deepresearch-flow recognize organize \
-  --layout mineru \
-  --input ./ocr_results \
-  --output-base64 ./out_base64
-```
-</details>
-<details>
-<summary>Data formats (examples)</summary>
-Aggregated extraction output is a JSON list:
-```json
-[
-  {
-    "paper_title": "Example Paper",
-    "paper_authors": ["Author A", "Author B"],
-    "publication_date": "2024-01-01",
-    "publication_venue": "ExampleConf",
-    "source_path": "/abs/path/to/doc.md"
-  }
-]
-```
-`db serve` expects each input to be an object with a `template_tag` and a `papers` list:
-```json
-{
-  "template_tag": "simple",
-  "papers": [
-    {
-      "paper_title": "Example Paper",
-      "paper_authors": ["Author A"],
-      "publication_date": "2024-01-01",
-      "publication_venue": "ExampleConf"
-    }
-  ]
-}
-```
-</details>
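
The removed README describes two JSON shapes: a plain list of papers, and an object with `template_tag` and `papers`, which `db serve` requires (it says list-only inputs are rejected). The following is a minimal sketch, not the package's own code, of wrapping a list-shaped file into the object form before serving. The function name `to_serve_payload` and the `default_tag` parameter are hypothetical and exist only for illustration.

```python
import json
from typing import Any


def to_serve_payload(data: Any, default_tag: str = "simple") -> dict[str, Any]:
    """Return data in the {"template_tag": ..., "papers": [...]} object form.

    Hypothetical helper: the tag is supplied by the caller, not inferred.
    """
    if isinstance(data, list):
        # List-only input: wrap it and attach the caller-supplied tag.
        return {"template_tag": default_tag, "papers": data}
    if isinstance(data, dict) and isinstance(data.get("papers"), list):
        # Already an object: keep existing fields, fill in a missing tag.
        return {**data, "template_tag": data.get("template_tag", default_tag)}
    raise ValueError("expected a list of papers or an object with a 'papers' list")


legacy = json.loads(
    '[{"paper_title": "Example Paper", "paper_authors": ["Author A"]}]'
)
payload = to_serve_payload(legacy)
print(json.dumps(payload, indent=2))
```

Choosing the tag explicitly is a design choice of this sketch: the README says the server only infers a missing `template_tag` as a fallback, so a converted file is more predictable when it carries the tag itself.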

{deepresearch_flow-0.2.1.dist-info → deepresearch_flow-0.3.0.dist-info}/WHEEL RENAMED

File without changes

{deepresearch_flow-0.2.1.dist-info → deepresearch_flow-0.3.0.dist-info}/entry_points.txt RENAMED

File without changes

{deepresearch_flow-0.2.1.dist-info → deepresearch_flow-0.3.0.dist-info}/licenses/LICENSE RENAMED

File without changes

{deepresearch_flow-0.2.1.dist-info → deepresearch_flow-0.3.0.dist-info}/top_level.txt RENAMED

File without changes
