PyPI - ddg-deep-research - Versions diffs - 0.2.0__tar.gz - Mend

ddg-deep-research 0.2.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (8)

ddg_deep_research-0.2.0/.gitignore +20 -0
ddg_deep_research-0.2.0/LICENSE +21 -0
ddg_deep_research-0.2.0/PKG-INFO +142 -0
ddg_deep_research-0.2.0/README.md +115 -0
ddg_deep_research-0.2.0/pyproject.toml +51 -0
ddg_deep_research-0.2.0/src/ddg_deep_research/__init__.py +8 -0
ddg_deep_research-0.2.0/src/ddg_deep_research/ddg_mcp.py +136 -0
ddg_deep_research-0.2.0/src/ddg_deep_research/research_pipeline.py +431 -0

ddg_deep_research-0.2.0/.gitignore ADDED

@@ -0,0 +1,20 @@
+# Python
+__pycache__/
+*.py[cod]
+*.egg-info/
+dist/
+build/
+.venv/
+# OS
+.DS_Store
+Thumbs.db
+# IDE
+.vscode/
+.idea/
+# Logs
+*.log
+# Research outputs
+outputs/

ddg_deep_research-0.2.0/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2025 crftsmnd
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

ddg_deep_research-0.2.0/PKG-INFO ADDED

@@ -0,0 +1,142 @@
+Metadata-Version: 2.4
+Name: ddg-deep-research
+Version: 0.2.0
+Summary: 5-stage deep research pipeline using DuckDuckGo MCP — free, no API key, no rate limits
+Project-URL: Homepage, https://github.com/crftsmnd/ddg-deep-research
+Project-URL: Repository, https://github.com/crftsmnd/ddg-deep-research
+Project-URL: Bug Tracker, https://github.com/crftsmnd/ddg-deep-research/issues
+Author: crftsmnd
+License: MIT
+License-File: LICENSE
+Keywords: agent,deep-research,duckduckgo,llm,mcp,research,web-search
+Classifier: Development Status :: 4 - Beta
+Classifier: Intended Audience :: Developers
+Classifier: Intended Audience :: Science/Research
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Classifier: Topic :: Software Development :: Libraries :: Python Modules
+Requires-Python: >=3.10
+Requires-Dist: mcp>=1.0
+Description-Content-Type: text/markdown
+# 🧠 ddg-deep-research
+**5-stage deep research pipeline** using DuckDuckGo MCP — **free, no API key, no rate limits**.
+```
+pip install ddg-deep-research
+ddg-deep-research ddg_search --query "your query" --output results.json
+```
+## Why this exists
+Every other deep research agent requires OpenAI / Anthropic / Google API keys — or runs expensive local models. This one uses **DuckDuckGo MCP** for both search and content extraction. Completely free. Zero API keys. Zero rate limits.
+## Pipeline
+```
+Stage 1: Decompose  ──→ 3-6 sub-questions + search strategy
+Stage 2: Gather     ──→ DuckDuckGo MCP search (parallel, unlimited)
+Stage 3: Deep Read  ──→ DuckDuckGo MCP fetch_content + browser_use
+Stage 4: Verify     ──→ Cross-reference claims, flag contradictions
+Stage 5: Synthesize ──→ Cited brief.md + .provenance.md sidecar
+```
+## Quick Start
+```bash
+# Search the web (free, no API key!)
+ddg-deep-research ddg_search --query "latest advances in AI reasoning" --output results.json
+# Fetch a webpage
+ddg-deep-research ddg_fetch --url "https://example.com/article" --output-dir extracted/
+```
+### Full Pipeline
+```bash
+# Step 1: Break the question into sub-questions (writes a plan template)
+ddg-deep-research decompose --question "How is RAG evolving?" --output plan.json
+# Step 2: Search
+ddg-deep-research ddg_search --query "RAG architectures 2026" --output results.json
+# Step 3: Fetch content
+ddg-deep-research ddg_fetch --url "https://..." --output-dir extracted/
+# Step 4: Merge results
+ddg-deep-research merge --input-dir raw/ --output merged.json
+# Step 5: Clean & verify
+ddg-deep-research clean --input extracted/ --output cleaned/
+ddg-deep-research verify --input cleaned/cleaned.json --output verified.json
+# Step 6: Generate final brief
+ddg-deep-research synthesize --verified verified.json --question "..." --output-dir outputs/ --today $(date +%Y-%m-%d)
+```
+### Parallel DAG Execution
+```bash
+ddg-deep-research dag --plan workflow.json --verbose
+```
+Input JSON:
+```json
+{
+  "tasks": [
+    {"id": "search_1", "depends_on": [], "command": "ddg_search --query ..."},
+    {"id": "fetch_1", "depends_on": ["search_1"], "command": "ddg_fetch --url ..."},
+    {"id": "synthesize", "depends_on": ["search_1", "fetch_1"], "command": "synthesize ..."}
+  ]
+}
+```
+## Python API
+```python
+import asyncio
+from ddg_deep_research.ddg_mcp import search_web, fetch_content
+async def main():
+    results = await search_web("your query", max_results=10)
+    for r in results:
+        print(f"{r['title']}: {r['url']}")
+    content = await fetch_content("https://example.com")
+    print(content[:500])
+asyncio.run(main())
+```
+## Requirements
+- Python 3.10+
+- `uv` installed (for duckduckgo-mcp-server): `curl -LsSf https://astral.sh/uv/install.sh | sh`
+- No API keys. No subscriptions. Nothing.
+## How it works
+This package wraps [duckduckgo-mcp-server](https://github.com/nicholasgriffintn/duckduckgo-mcp-server) via Python's MCP stdio transport. All search and content extraction goes through DuckDuckGo's free anonymous API. The 5-stage pipeline is modeled after production research agents but without the API costs.
+## Comparison
+| Feature | OpenAI Deep Research | LangChain Deep Research | **ddg-deep-research** |
+|---|---|---|---|
+| API key needed | ✅ $200/mo | ✅ OpenAI key | **❌ Free** |
+| Search engine | Bing/Browser | Custom | **DuckDuckGo** |
+| Content extraction | Built-in | Built-in | **DuckDuckGo MCP** |
+| Provenance tracking | ✅ | ✅ | **✅ .provenance.md** |
+| DAG orchestration | ❌ | ❌ | **✅ Built-in** |
+| Open source | ❌ | ✅ | **✅ MIT** |
+| `pip install` | ❌ | ❌ | **✅ pip install** |
+## License
+MIT

ddg_deep_research-0.2.0/README.md ADDED

@@ -0,0 +1,115 @@
+# 🧠 ddg-deep-research
+**5-stage deep research pipeline** using DuckDuckGo MCP — **free, no API key, no rate limits**.
+```
+pip install ddg-deep-research
+ddg-deep-research ddg_search --query "your query" --output results.json
+```
+## Why this exists
+Every other deep research agent requires OpenAI / Anthropic / Google API keys — or runs expensive local models. This one uses **DuckDuckGo MCP** for both search and content extraction. Completely free. Zero API keys. Zero rate limits.
+## Pipeline
+```
+Stage 1: Decompose  ──→ 3-6 sub-questions + search strategy
+Stage 2: Gather     ──→ DuckDuckGo MCP search (parallel, unlimited)
+Stage 3: Deep Read  ──→ DuckDuckGo MCP fetch_content + browser_use
+Stage 4: Verify     ──→ Cross-reference claims, flag contradictions
+Stage 5: Synthesize ──→ Cited brief.md + .provenance.md sidecar
+```
+## Quick Start
+```bash
+# Search the web (free, no API key!)
+ddg-deep-research ddg_search --query "latest advances in AI reasoning" --output results.json
+# Fetch a webpage
+ddg-deep-research ddg_fetch --url "https://example.com/article" --output-dir extracted/
+```
+### Full Pipeline
+```bash
+# Step 1: Break the question into sub-questions (writes a plan template)
+ddg-deep-research decompose --question "How is RAG evolving?" --output plan.json
+# Step 2: Search
+ddg-deep-research ddg_search --query "RAG architectures 2026" --output results.json
+# Step 3: Fetch content
+ddg-deep-research ddg_fetch --url "https://..." --output-dir extracted/
+# Step 4: Merge results
+ddg-deep-research merge --input-dir raw/ --output merged.json
+# Step 5: Clean & verify
+ddg-deep-research clean --input extracted/ --output cleaned/
+ddg-deep-research verify --input cleaned/cleaned.json --output verified.json
+# Step 6: Generate final brief
+ddg-deep-research synthesize --verified verified.json --question "..." --output-dir outputs/ --today $(date +%Y-%m-%d)
+```
+### Parallel DAG Execution
+```bash
+ddg-deep-research dag --plan workflow.json --verbose
+```
+Input JSON:
+```json
+{
+  "tasks": [
+    {"id": "search_1", "depends_on": [], "command": "ddg_search --query ..."},
+    {"id": "fetch_1", "depends_on": ["search_1"], "command": "ddg_fetch --url ..."},
+    {"id": "synthesize", "depends_on": ["search_1", "fetch_1"], "command": "synthesize ..."}
+  ]
+}
+```
+## Python API
+```python
+import asyncio
+from ddg_deep_research.ddg_mcp import search_web, fetch_content
+async def main():
+    results = await search_web("your query", max_results=10)
+    for r in results:
+        print(f"{r['title']}: {r['url']}")
+    content = await fetch_content("https://example.com")
+    print(content[:500])
+asyncio.run(main())
+```
+## Requirements
+- Python 3.10+
+- `uv` installed (for duckduckgo-mcp-server): `curl -LsSf https://astral.sh/uv/install.sh | sh`
+- No API keys. No subscriptions. Nothing.
+## How it works
+This package wraps [duckduckgo-mcp-server](https://github.com/nicholasgriffintn/duckduckgo-mcp-server) via Python's MCP stdio transport. All search and content extraction goes through DuckDuckGo's free anonymous API. The 5-stage pipeline is modeled after production research agents but without the API costs.
+## Comparison
+| Feature | OpenAI Deep Research | LangChain Deep Research | **ddg-deep-research** |
+|---|---|---|---|
+| API key needed | ✅ $200/mo | ✅ OpenAI key | **❌ Free** |
+| Search engine | Bing/Browser | Custom | **DuckDuckGo** |
+| Content extraction | Built-in | Built-in | **DuckDuckGo MCP** |
+| Provenance tracking | ✅ | ✅ | **✅ .provenance.md** |
+| DAG orchestration | ❌ | ❌ | **✅ Built-in** |
+| Open source | ❌ | ✅ | **✅ MIT** |
+| `pip install` | ❌ | ❌ | **✅ pip install** |
+## License
+MIT
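
The task plan from the README's "Parallel DAG Execution" section can be sanity-checked before it is handed to the `dag` subcommand. The following is a minimal sketch, not part of the package: `validate_plan` is a hypothetical helper, and it only assumes the plan shape shown in the README (`tasks`, each with `id`, `depends_on`, `command`).

```python
def validate_plan(plan: dict) -> list[str]:
    """Return a list of problems found in a DAG plan (empty if none).

    Hypothetical helper: checks only for duplicate ids and for
    dependencies that name a task missing from the plan.
    """
    problems = []
    tasks = plan.get("tasks", [])
    ids = [t["id"] for t in tasks]
    known = set(ids)
    if len(ids) != len(known):
        problems.append("duplicate task ids")
    for task in tasks:
        for dep in task.get("depends_on", []):
            if dep not in known:
                problems.append(f"{task['id']} depends on unknown task {dep}")
    return problems


# Plan copied from the README; the "..." in each command is elided there too.
plan = {
    "tasks": [
        {"id": "search_1", "depends_on": [], "command": "ddg_search --query ..."},
        {"id": "fetch_1", "depends_on": ["search_1"], "command": "ddg_fetch --url ..."},
        {"id": "synthesize", "depends_on": ["search_1", "fetch_1"], "command": "synthesize ..."},
    ]
}

for problem in validate_plan(plan):
    print(problem)
```

A check like this reports a misspelled dependency up front, rather than leaving it to surface as the "Dependency cycle detected or unsatisfied" error at run time.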

ddg_deep_research-0.2.0/pyproject.toml ADDED

@@ -0,0 +1,51 @@
+[project]
+name = "ddg-deep-research"
+version = "0.2.0"
+description = "5-stage deep research pipeline using DuckDuckGo MCP — free, no API key, no rate limits"
+readme = "README.md"
+license = {text = "MIT"}
+requires-python = ">=3.10"
+authors = [
+    {name = "crftsmnd"},
+]
+classifiers = [
+    "Development Status :: 4 - Beta",
+    "Intended Audience :: Developers",
+    "Intended Audience :: Science/Research",
+    "License :: OSI Approved :: MIT License",
+    "Programming Language :: Python :: 3",
+    "Programming Language :: Python :: 3.10",
+    "Programming Language :: Python :: 3.11",
+    "Programming Language :: Python :: 3.12",
+    "Programming Language :: Python :: 3.13",
+    "Topic :: Internet :: WWW/HTTP :: Indexing/Search",
+    "Topic :: Scientific/Engineering :: Artificial Intelligence",
+    "Topic :: Software Development :: Libraries :: Python Modules",
+]
+keywords = ["deep-research", "duckduckgo", "research", "agent", "mcp", "web-search", "llm"]
+dependencies = [
+    "mcp>=1.0",
+]
+[project.urls]
+Homepage = "https://github.com/crftsmnd/ddg-deep-research"
+Repository = "https://github.com/crftsmnd/ddg-deep-research"
+"Bug Tracker" = "https://github.com/crftsmnd/ddg-deep-research/issues"
+[project.scripts]
+ddg-deep-research = "ddg_deep_research.research_pipeline:main"
+[build-system]
+requires = ["hatchling"]
+build-backend = "hatchling.build"
+[tool.hatch.build.targets.wheel]
+packages = ["src/ddg_deep_research"]
+[tool.hatch.build.targets.sdist]
+include = [
+    "src/ddg_deep_research/*.py",
+    "README.md",
+    "LICENSE",
+    "pyproject.toml",
+]

ddg_deep_research-0.2.0/src/ddg_deep_research/__init__.py ADDED

@@ -0,0 +1,8 @@
+"""
+ddg-deep-research — 5-stage deep research pipeline using DuckDuckGo MCP (free, no API key).
+Zero API keys required. Zero rate limits. Zero paywalls.
+"""
+__version__ = "0.2.0"
+__license__ = "MIT"

ddg_deep_research-0.2.0/src/ddg_deep_research/ddg_mcp.py ADDED

@@ -0,0 +1,136 @@
+#!/usr/bin/env python3
+"""
+DuckDuckGo MCP Bridge — free, no-API-key search & content extraction.
+Connects to duckduckgo-mcp-server via stdio MCP transport and exposes
+two commands: search (web search) and fetch (page content extraction).
+Usage:
+  uv run ddg_mcp.py search "query" [--max-results 10] [--region us-en]
+  uv run ddg_mcp.py fetch <url> [--max-length 8000]
+"""
+import asyncio
+import json
+import re
+import sys
+from mcp import ClientSession, StdioServerParameters
+from mcp.client.stdio import stdio_client
+# ── Regex to parse "N. Title\n   URL: ...\n   Summary: ..." format ────────
+RESULT_BLOCK_RE = re.compile(
+    r"^\d+\.\s+(?P<title>.+?)\n\s+URL:\s+(?P<url>\S+?)\s*\n\s+Summary:\s+(?P<snippet>.*?)$",
+    re.MULTILINE,
+)
+def parse_search_results(text: str) -> list[dict]:
+    """Parse the human-readable search result text into structured JSON."""
+    results = []
+    for match in RESULT_BLOCK_RE.finditer(text):
+        results.append(
+            {
+                "title": match.group("title").strip(),
+                "url": match.group("url").rstrip("/"),
+                "snippet": match.group("snippet").strip(),
+                "source": "duckduckgo",
+            }
+        )
+    return results
+async def search_web(
+    query: str, max_results: int = 10, region: str = "wt-wt"
+) -> list[dict]:
+    """Search using DuckDuckGo MCP and return structured results."""
+    server_params = StdioServerParameters(
+        command="uvx",
+        args=["duckduckgo-mcp-server"],
+    )
+    async with stdio_client(server_params) as (read, write):
+        async with ClientSession(read, write) as session:
+            await session.initialize()
+            result = await session.call_tool(
+                "search",
+                {"query": query, "max_results": max_results, "region": region},
+            )
+            # result.content is list[TextContent|...]
+            text = ""
+            for item in result.content:
+                if hasattr(item, "text"):
+                    text += item.text
+            return parse_search_results(text)
+async def fetch_content(url: str, max_length: int = 8000) -> str:
+    """Fetch and extract clean text from a webpage."""
+    server_params = StdioServerParameters(
+        command="uvx",
+        args=["duckduckgo-mcp-server"],
+    )
+    async with stdio_client(server_params) as (read, write):
+        async with ClientSession(read, write) as session:
+            await session.initialize()
+            result = await session.call_tool(
+                "fetch_content",
+                {"url": url, "max_length": max_length},
+            )
+            text = ""
+            for item in result.content:
+                if hasattr(item, "text"):
+                    text += item.text
+            return text
+# ── CLI ──────────────────────────────────────────────────────────────────
+def main():
+    if len(sys.argv) < 2:
+        print(__doc__, file=sys.stderr)
+        sys.exit(1)
+    cmd = sys.argv[1]
+    if cmd == "search":
+        if len(sys.argv) < 3:
+            print("Usage: ddg_mcp.py search <query> [--max-results N] [--region X]", file=sys.stderr)
+            sys.exit(1)
+        query = sys.argv[2]
+        max_results = 10
+        region = "wt-wt"
+        # Parse optional flags
+        args_iter = iter(sys.argv[3:])
+        for arg in args_iter:
+            if arg == "--max-results":
+                max_results = int(next(args_iter))
+            elif arg == "--region":
+                region = next(args_iter)
+        results = asyncio.run(search_web(query, max_results, region))
+        print(json.dumps(results, indent=2, ensure_ascii=False))
+    elif cmd == "fetch":
+        if len(sys.argv) < 3:
+            print("Usage: ddg_mcp.py fetch <url> [--max-length N]", file=sys.stderr)
+            sys.exit(1)
+        url = sys.argv[2]
+        max_length = 8000
+        args_iter = iter(sys.argv[3:])
+        for arg in args_iter:
+            if arg == "--max-length":
+                max_length = int(next(args_iter))
+        content = asyncio.run(fetch_content(url, max_length))
+        # Output has a leading prefix like "Content from https://...\n\n"
+        # Strip it for cleaner output
+        clean = re.sub(r"^Content from .+?\n\n", "", content, count=1)
+        print(clean)
+    else:
+        print(f"Unknown command: {cmd}", file=sys.stderr)
+        print(__doc__, file=sys.stderr)
+        sys.exit(1)
+if __name__ == "__main__":
+    main()
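
The parsing step in `ddg_mcp.py` can be tried without starting the MCP server. The sketch below copies `RESULT_BLOCK_RE` and `parse_search_results` from the file above; the sample text is an assumption, written to follow the "N. Title / URL: / Summary:" layout described in the source comment, and is not captured server output.

```python
import re

# Same pattern as RESULT_BLOCK_RE in ddg_mcp.py.
RESULT_BLOCK_RE = re.compile(
    r"^\d+\.\s+(?P<title>.+?)\n\s+URL:\s+(?P<url>\S+?)\s*\n\s+Summary:\s+(?P<snippet>.*?)$",
    re.MULTILINE,
)


def parse_search_results(text: str) -> list[dict]:
    """Turn the numbered, human-readable result text into a list of dicts."""
    results = []
    for match in RESULT_BLOCK_RE.finditer(text):
        results.append(
            {
                "title": match.group("title").strip(),
                "url": match.group("url").rstrip("/"),  # trailing slash dropped
                "snippet": match.group("snippet").strip(),
                "source": "duckduckgo",
            }
        )
    return results


# Hypothetical sample in the expected layout (not real server output).
SAMPLE = (
    "Found 2 search results:\n\n"
    "1. Example Domain\n"
    "   URL: https://example.com/\n"
    "   Summary: An illustrative page.\n\n"
    "2. Python\n"
    "   URL: https://www.python.org\n"
    "   Summary: The official site.\n"
)

results = parse_search_results(SAMPLE)
for r in results:
    print(r["title"], r["url"])
```

Because the pattern is anchored on the numbered line and the indented `URL:` and `Summary:` labels, any change in the server's text layout would make the parser return an empty list and not raise an error.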

ddg_deep_research-0.2.0/src/ddg_deep_research/research_pipeline.py ADDED

@@ -0,0 +1,431 @@
+#!/usr/bin/env python3
+"""
+Deep Research Pipeline — orchestration helpers for the 5-stage deep research workflow.
+Subcommands:
+  ddg_search   — search web via DuckDuckGo MCP (free, no API key)
+  ddg_fetch    — fetch page content via DuckDuckGo MCP
+  decompose    — break a research question into sub-questions + search strategy
+  merge        — merge & deduplicate search results from parallel subagents
+  clean        — normalize extracted text from browser_use calls
+  verify       — cross-reference claims across sources
+  synthesize   — generate cited brief.md + .provenance.md
+Usage: uv run research_pipeline.py <subcommand> [options]
+"""
+import argparse
+import asyncio
+import json
+import os
+import re
+import sys
+from datetime import datetime
+from pathlib import Path
+from urllib.parse import urlparse
+# ── Helpers ──────────────────────────────────────────────────────────────
+TOPIC_SLUG_RE = re.compile(r"[^a-z0-9-]+")
+def slugify(text: str) -> str:
+    return TOPIC_SLUG_RE.sub("-", text.lower()).strip("-")
+def ensure_dir(path: str) -> str:
+    Path(path).mkdir(parents=True, exist_ok=True)
+    return path
+def load_json(path: str):
+    with open(path) as f:
+        return json.load(f)
+def save_json(obj, path: str):
+    with open(path, "w") as f:
+        json.dump(obj, f, indent=2, ensure_ascii=False)
+def domain(url: str) -> str:
+    try:
+        return urlparse(url).netloc
+    except Exception:
+        return url
+# ── Subcommand: dag — Directed Acyclic Graph orchestration ────────────────
+def cmd_dag(args):
+    """
+    Execute sub-tasks with dependency graph support.
+    Input: JSON file with:
+    {
+      "tasks": [
+        {"id": "search_1", "depends_on": [], "command": "ddg_search --query ... --output ..."},
+        {"id": "fetch_1", "depends_on": ["search_1"], "command": "ddg_fetch ..."},
+        {"id": "synthesize", "depends_on": ["search_1", "fetch_1"], "command": "synthesize ..."}
+      ]
+    }
+    Output: Executes tasks in dependency order; tasks whose dependencies are all complete form one batch.
+    """
+    import subprocess
+    plan = load_json(args.plan)
+    tasks = {t["id"]: t for t in plan["tasks"]}
+    completed = set()
+    results = {}
+    max_iterations = 100
+    iteration = 0
+    while len(completed) < len(tasks) and iteration < max_iterations:
+        iteration += 1
+        ready = [
+            t for t in plan["tasks"]
+            if t["id"] not in completed
+            and all(dep in completed for dep in t.get("depends_on", []))
+        ]
+        if not ready:
+            blocked = [t["id"] for t in plan["tasks"] if t["id"] not in completed]
+            raise RuntimeError(f"Dependency cycle detected or unsatisfied: {blocked}")
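+        # Run the ready tasks one after another; tasks in the same batch do not depend on each other
+        for task in ready:
+            cmd_parts = task["command"].split()  # whitespace split: quoted arguments containing spaces are not preserved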
+            wrapped_cmd = ["uv", "run", "python3", __file__] + cmd_parts
+            try:
+                result = subprocess.run(wrapped_cmd, capture_output=True, text=True, timeout=args.task_timeout)
+                results[task["id"]] = {
+                    "status": "success" if result.returncode == 0 else "error",
+                    "stdout": result.stdout[-500:],
+                    "stderr": result.stderr[-500:],
+                }
+            except subprocess.TimeoutExpired:
+                results[task["id"]] = {"status": "timeout", "stdout": "", "stderr": f"Timed out after {args.task_timeout}s"}
+            completed.add(task["id"])
+        if args.verbose:
+            print(f"[DAG] Iteration {iteration}: completed {len(completed)}/{len(tasks)} tasks")
+    # Save execution report
+    output = {
+        "plan_file": args.plan,
+        "tasks_total": len(tasks),
+        "tasks_completed": len(completed),
+        "results": results,
+    }
+    if args.output:
+        save_json(output, args.output)
+        print(f"DAG execution report → {args.output}")
+    else:
+        print(json.dumps(output, indent=2, ensure_ascii=False))
+# ── Subcommand: ddg_search ────────────────────────────────────────────────
+def cmd_ddg_search(args):
+    """Search using DuckDuckGo MCP (free, no API key)."""
+    from ddg_deep_research.ddg_mcp import search_web
+    results = asyncio.run(search_web(args.query, args.max_results, args.region))
+    save_json(results, args.output)
+    print(f"DuckDuckGo search returned {len(results)} results → {args.output}")
+# ── Subcommand: ddg_fetch ─────────────────────────────────────────────────
+def cmd_ddg_fetch(args):
+    """Fetch webpage content using DuckDuckGo MCP."""
+    from ddg_deep_research.ddg_mcp import fetch_content
+    content = asyncio.run(fetch_content(args.url, args.max_length))
+    out_dir = ensure_dir(args.output_dir)
+    slug = slugify(args.url.replace("https://", "").replace("http://", "").replace("/", "-"))
+    out_path = os.path.join(out_dir, f"{slug}.txt")
+    with open(out_path, "w") as f:
+        f.write(content)
+    print(f"Fetched {len(content)} chars → {out_path}")
+# ── Subcommand: decompose ─────────────────────────────────────────────────
+def cmd_decompose(args):
+    """Generate a research plan with sub-questions and search strategies."""
+    output_dir = Path(args.output).parent
+    ensure_dir(str(output_dir))
+    # This is called by the agent AFTER the agent has decomposed the question.
+    # The agent writes the plan, and this script just saves it in a standard format.
+    plan = {
+        "original_question": args.question,
+        "generated_at": datetime.now().isoformat(),
+        "slug": slugify(args.question),
+        "status": "plan_ready",
+        "sub_questions": [],
+        "notes": (
+            "The agent should populate sub_questions with 3-6 items. "
+            "Each item: {question, engine, priority, max_results, search_strategy}"
+        ),
+    }
+    save_json(plan, args.output)
+    print(f"Plan template saved to {args.output}")
+    print(f"Agent must now populate the sub_questions array and re-run.")
+# ── Subcommand: merge ────────────────────────────────────────────────────
+def cmd_merge(args):
+    """Merge and deduplicate search results from multiple subagent outputs."""
+    input_dir = Path(args.input_dir)
+    results = []
+    seen_urls = set()
+    for fpath in sorted(input_dir.glob("*.json")):
+        data = load_json(str(fpath))
+        items = data if isinstance(data, list) else data.get("results", data.get("items", []))
+        for item in items:
+            url = item.get("url", "") or item.get("link", "") or item.get("href", "")
+            if not url:
+                continue
+            norm_url = url.rstrip("/").lower()
+            if norm_url in seen_urls:
+                continue
+            seen_urls.add(norm_url)
+            results.append({
+                "title": item.get("title", "") or item.get("name", ""),
+                "url": url,
+                "snippet": item.get("snippet", "") or item.get("description", "") or "",
+                "source": item.get("source", fpath.stem),
+                "domain": domain(url),
+                "sub_question": item.get("sub_question", ""),
+            })
+    output = {
+        "total_urls": len(results),
+        "deduped_count": len(results),
+        "results": sorted(results, key=lambda r: r["url"]),
+        "generated_at": datetime.now().isoformat(),
+    }
+    save_json(output, args.output)
+    print(f"Merged {len(results)} unique results → {args.output}")
+# ── Subcommand: clean ────────────────────────────────────────────────────
+def cmd_clean(args):
+    """Normalize extracted text files into structured JSON records."""
+    input_dir = Path(args.input)  # the clean subparser defines --input, not --input-dir
+    ensure_dir(args.output)
+    records = []
+    for fpath in sorted(input_dir.rglob("*")):
+        if fpath.is_dir():
+            continue
+        text = fpath.read_text(encoding="utf-8", errors="replace")
+        # Strip excessive whitespace
+        lines = [l.strip() for l in text.split("\n")]
+        text = "\n".join(l for l in lines if l)
+        records.append({
+            "source_file": fpath.name,
+            "url": fpath.stem,  # Agent should write files named by URL-slug
+            "length_chars": len(text),
+            "length_words": len(text.split()),
+            "content": text,
+            "cleaned_at": datetime.now().isoformat(),
+        })
+    outpath = os.path.join(args.output, "cleaned.json")
+    save_json(records, outpath)
+    print(f"Cleaned {len(records)} extracts → {outpath}")
+# ── Subcommand: verify ───────────────────────────────────────────────────
+def cmd_verify(args):
+    """Cross-reference claims across cleaned extracts (structure for agent use)."""
+    cleaned_data = load_json(args.input)
+    # This prepares a verification skeleton — the agent (LLM) does the actual
+    # claim extraction and cross-referencing, then writes back.
+    sources = []
+    for rec in cleaned_data:
+        sources.append({
+            "url": rec.get("url", "unknown"),
+            "length_chars": rec.get("length_chars", 0),
+            "verified": False,
+            "claims_extracted": 0,
+        })
+    output = {
+        "total_sources": len(sources),
+        "sources": sources,
+        "verified_at": datetime.now().isoformat(),
+        "status": "ready_for_agent_verification",
+        "note": "Agent should read each source, extract claims, and annotate verified_at/confidence per claim.",
+    }
+    save_json(output, args.output)
+    print(f"Verification skeleton ({len(sources)} sources) → {args.output}")
+# ── Subcommand: synthesize ───────────────────────────────────────────────
+def cmd_synthesize(args):
+    """Generate the final brief.md + .provenance.md files."""
+    verified = load_json(args.verified)
+    output_dir = ensure_dir(args.output_dir)
+    slug = slugify(args.question)
+    # ── .provenance.md ───────────────────────────────────────────────────
+    prov_lines = [
+        f"# Provenance: {args.question}",
+        "",
+        f"**Research date:** {args.today}",
+        f"**Slug:** {slug}",
+        f"**Sources consulted:** {len(verified.get('sources', []))}",
+        "",
+        "## Sources",
+        "",
+    ]
+    for i, src in enumerate(verified.get("sources", []), 1):
+        prov_lines.append(f"{i}. **{src.get('url', 'unknown')}**")
+        prov_lines.append(f"   - Length: {src.get('length_chars', 0)} chars")
+        prov_lines.append(f"   - Claims extracted: {src.get('claims_extracted', 0)}")
+        if src.get("verified"):
+            prov_lines.append(f"   - ✓ Verified at: {src.get('verified_at', 'N/A')}")
+        else:
+            prov_lines.append(f"   - ○ Not yet verified")
+        prov_lines.append("")
+    prov_lines.extend([
+        "",
+        "## Methodology",
+        "",
+        "- **Stage 1 (Decompose):** Research question broken into focused sub-questions",
+        "- **Stage 2 (Gather):** Parallel search via DuckDuckGo MCP (free, no API key)",
+        "- **Stage 3 (Read):** Top URLs extracted via DuckDuckGo MCP fetch_content + browser_use",
+        "- **Stage 4 (Verify):** Claims cross-referenced across independent sources",
+        "- **Stage 5 (Synthesize):** This brief generated with full provenance tracking",
+        "",
+        f"*Generated by Omnibot Deep Research pipeline on {args.today}*",
+    ])
+    provenance_path = os.path.join(output_dir, ".provenance.md")
+    with open(provenance_path, "w") as f:
+        f.write("\n".join(prov_lines))
+    print(f"Provenance sidecar → {provenance_path}")
+    # ── brief.md ─────────────────────────────────────────────────────────
+    brief_lines = [
+        f"# Research Brief: {args.question}",
+        "",
+        f"**Date:** {args.today}  **Sources:** {len(verified.get('sources', []))}",
+        "",
+        "## Executive Summary",
+        "",
+        "*Agent: populate executive summary from verified claims here.*",
+        "",
+        "## Key Findings",
+        "",
+        "*Agent: organize findings by sub-question, each with [source: url] citations.*",
+        "",
+        "## Contradictions & Open Questions",
+        "",
+        "*Agent: list any conflicting claims across sources here.*",
+        "",
+        "## Gaps",
+        "",
+        "*Agent: note what wasn't found or needs more investigation.*",
+        "",
+        "## Follow-up Questions",
+        "",
+        "*Agent: suggest 2-3 questions the user could explore next.*",
+        "",
+        f"---",
+        f"*Generated by Omnibot Deep Research pipeline on {args.today}*",
+        f"*See [.provenance.md](.provenance.md) for full source list*",
+    ]
+    brief_path = os.path.join(output_dir, "brief.md")
+    with open(brief_path, "w") as f:
+        f.write("\n".join(brief_lines))
+    print(f"Cited brief → {brief_path}")
+    print(f"\nDone. Output in {output_dir}/")
+    print(f"  brief.md — the cited research brief (agent to fill findings)")
+    print(f"  .provenance.md — full source tracking")
+# ── CLI ──────────────────────────────────────────────────────────────────
+def main():
+    parser = argparse.ArgumentParser(description="Deep Research Pipeline")
+    sub = parser.add_subparsers(dest="command", required=True)
+    # decompose
+    p = sub.add_parser("decompose", help="Create research plan template")
+    p.add_argument("--question", required=True)
+    p.add_argument("--output", required=True)
+    # merge
+    p = sub.add_parser("merge", help="Merge & deduplicate search results")
+    p.add_argument("--input-dir", required=True)
+    p.add_argument("--output", required=True)
+    # clean
+    p = sub.add_parser("clean", help="Normalize extracted text")
+    p.add_argument("--input", required=True)
+    p.add_argument("--output", required=True)
+    # verify
+    p = sub.add_parser("verify", help="Prepare verification skeleton")
+    p.add_argument("--input", required=True)
+    p.add_argument("--output", required=True)
+    # synthesize
+    p = sub.add_parser("synthesize", help="Generate brief + provenance")
+    p.add_argument("--verified", required=True)
+    p.add_argument("--question", required=True)
+    p.add_argument("--output-dir", required=True)
+    p.add_argument("--today", default=datetime.now().strftime("%Y-%m-%d"))
+    # dag — Directed Acyclic Graph orchestration
+    p = sub.add_parser("dag", help="Execute sub-tasks with dependency graph")
+    p.add_argument("--plan", required=True, help="JSON file with tasks + dependencies")
+    p.add_argument("--output", default=None, help="Output execution report path")
+    p.add_argument("--task-timeout", type=int, default=120, help="Per-task timeout in seconds")
+    p.add_argument("--verbose", action="store_true", help="Print progress")
+    # ddg_search — DuckDuckGo MCP (free, no API key)
+    p = sub.add_parser("ddg_search", help="Search via DuckDuckGo MCP (free)")
+    p.add_argument("--query", required=True)
+    p.add_argument("--max-results", type=int, default=10)
+    p.add_argument("--region", default="wt-wt")
+    p.add_argument("--output", required=True)
+    # ddg_fetch — fetch page content via DuckDuckGo MCP
+    p = sub.add_parser("ddg_fetch", help="Fetch page content via DuckDuckGo MCP")
+    p.add_argument("--url", required=True)
+    p.add_argument("--max-length", type=int, default=8000)
+    p.add_argument("--output-dir", required=True)
+    args = parser.parse_args()
+    commands = {
+        "decompose": cmd_decompose,
+        "merge": cmd_merge,
+        "clean": cmd_clean,
+        "verify": cmd_verify,
+        "synthesize": cmd_synthesize,
+        "ddg_search": cmd_ddg_search,
+        "ddg_fetch": cmd_ddg_fetch,
+        "dag": cmd_dag,
+    }
+    commands[args.command](args)
+if __name__ == "__main__":
+    main()
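
The loop in `cmd_dag` handles each batch of ready tasks one task at a time. The following sketch shows one way a batch could be run concurrently, assuming the standard-library `ThreadPoolExecutor`. It is not the package's implementation: `run_dag` and `run_task` are hypothetical names, and a plain function stands in for the `subprocess.run` call so the example runs on its own.

```python
from concurrent.futures import ThreadPoolExecutor


def run_dag(tasks, run_task, max_workers=4):
    """Run tasks in dependency order; each batch of ready tasks runs concurrently.

    tasks:    list of dicts with "id" and optional "depends_on"
    run_task: hypothetical callable that takes one task dict and returns a result
    Returns (results by task id, list of batches in execution order).
    """
    completed = set()
    results = {}
    batches = []
    while len(completed) < len(tasks):
        # A task is ready once every dependency has completed.
        ready = [
            t for t in tasks
            if t["id"] not in completed
            and all(dep in completed for dep in t.get("depends_on", []))
        ]
        if not ready:
            blocked = [t["id"] for t in tasks if t["id"] not in completed]
            raise RuntimeError(f"Dependency cycle detected or unsatisfied: {blocked}")
        # pool.map preserves input order, so outputs line up with `ready`.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            outputs = list(pool.map(run_task, ready))
        for task, out in zip(ready, outputs):
            results[task["id"]] = out
            completed.add(task["id"])
        batches.append([t["id"] for t in ready])
    return results, batches


plan = [
    {"id": "search_1", "depends_on": []},
    {"id": "search_2", "depends_on": []},
    {"id": "fetch_1", "depends_on": ["search_1"]},
    {"id": "synthesize", "depends_on": ["search_1", "search_2", "fetch_1"]},
]

results, batches = run_dag(plan, lambda t: f"ran {t['id']}")
print(batches)
```

Threads are a reasonable fit here because each real task would spend its time waiting on a child process; swapping the lambda for a function that wraps `subprocess.run` would keep the batching logic unchanged.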