local-deep-research 0.1.0__py3-none-any.whl → 0.1.12__py3-none-any.whl
- local_deep_research/defaults/main.toml +5 -0
- local_deep_research/search_system.py +98 -38
- local_deep_research/web/app.py +721 -169
- local_deep_research/web/static/css/styles.css +270 -5
- local_deep_research/web/static/js/app.js +2247 -562
- local_deep_research/web/templates/index.html +37 -1
- local_deep_research/web_search_engines/engines/search_engine_searxng.py +454 -0
- local_deep_research/web_search_engines/search_engine_factory.py +20 -1
- {local_deep_research-0.1.0.dist-info → local_deep_research-0.1.12.dist-info}/METADATA +24 -6
- {local_deep_research-0.1.0.dist-info → local_deep_research-0.1.12.dist-info}/RECORD +14 -13
- {local_deep_research-0.1.0.dist-info → local_deep_research-0.1.12.dist-info}/WHEEL +1 -1
- {local_deep_research-0.1.0.dist-info → local_deep_research-0.1.12.dist-info}/entry_points.txt +0 -0
- {local_deep_research-0.1.0.dist-info → local_deep_research-0.1.12.dist-info/licenses}/LICENSE +0 -0
- {local_deep_research-0.1.0.dist-info → local_deep_research-0.1.12.dist-info}/top_level.txt +0 -0
local_deep_research/web/templates/index.html

@@ -8,7 +8,6 @@
     <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.0.0-beta3/css/all.min.css">
     <!-- Change to CDN version that works in browsers -->
     <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.7.0/styles/github-dark.min.css">
-    <link rel="icon" type="image/png" href="{{ url_for('static', filename='favicon.ico') }}">
 </head>
 <body>
     <div class="app-container">

@@ -119,6 +118,9 @@
                     <i class="fas fa-stop-circle"></i> Terminate Research
                 </button>
                 <div id="error-message" class="error-message" style="display: none;"></div>
+                <button id="try-again-btn" class="btn btn-primary" style="display: none; margin-top: 15px;">
+                    <i class="fas fa-redo"></i> Try Again
+                </button>
             </div>
         </div>
     </div>

@@ -214,6 +216,31 @@
                 </div>
             </div>
         </div>
+
+        <!-- Collapsible Log Panel -->
+        <div class="collapsible-log-panel">
+            <div class="log-panel-header" id="log-panel-toggle">
+                <i class="fas fa-chevron-down toggle-icon"></i>
+                <span>Research Logs</span>
+                <span class="log-indicator" id="log-indicator">0</span>
+            </div>
+            <div class="log-panel-content" id="log-panel-content">
+                <div class="log-controls">
+                    <div class="log-filter">
+                        <div class="filter-buttons">
+                            <button class="small-btn selected" onclick="window.filterLogsByType('all')">All</button>
+                            <button class="small-btn" onclick="window.filterLogsByType('milestone')">Milestones</button>
+                            <button class="small-btn" onclick="window.filterLogsByType('info')">Info</button>
+                            <button class="small-btn" onclick="window.filterLogsByType('error')">Errors</button>
+                        </div>
+                    </div>
+                </div>
+                <div class="console-log" id="console-log-container">
+                    <!-- Logs will be added here dynamically -->
+                    <div class="empty-log-message">No logs yet. Research logs will appear here as they occur.</div>
+                </div>
+            </div>
+        </div>
     </main>
 </div>

@@ -308,5 +335,14 @@
         window.html2canvas_noSandbox = true;
     }
 </script>
+
+<!-- Add a template for console log entries -->
+<template id="console-log-entry-template">
+    <div class="console-log-entry">
+        <span class="log-timestamp"></span>
+        <span class="log-badge"></span>
+        <span class="log-message"></span>
+    </div>
+</template>
 </body>
 </html>
local_deep_research/web_search_engines/engines/search_engine_searxng.py (new file)

@@ -0,0 +1,454 @@
+import requests
+import logging
+import os
+from typing import Dict, List, Any, Optional
+from langchain_core.language_models import BaseLLM
+import time
+import json
+
+from web_search_engines.search_engine_base import BaseSearchEngine
+from web_search_engines.engines.full_search import FullSearchResults
+import config
+
+# Setup logging
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger(__name__)
+
+class SearXNGSearchEngine(BaseSearchEngine):
+    """
+    SearXNG search engine implementation that requires an instance URL provided via
+    environment variable or configuration. Designed for ethical usage with proper
+    rate limiting and single-instance approach.
+    """
+
+    def __init__(self,
+                 max_results: int = 15,
+                 instance_url: Optional[str] = None,  # Can be None if using env var
+                 categories: Optional[List[str]] = None,
+                 engines: Optional[List[str]] = None,
+                 language: str = "en",
+                 safe_search: int = 1,
+                 time_range: Optional[str] = None,
+                 delay_between_requests: float = 2.0,
+                 llm: Optional[BaseLLM] = None,
+                 max_filtered_results: Optional[int] = None,
+                 include_full_content: bool = True,
+                 api_key: Optional[str] = None):  # API key is actually the instance URL
+        """
+        Initialize the SearXNG search engine with ethical usage patterns.
+
+        Args:
+            max_results: Maximum number of search results
+            instance_url: URL of your SearXNG instance (preferably self-hosted)
+            categories: List of SearXNG categories to search in (general, images, videos, news, etc.)
+            engines: List of engines to use (google, bing, duckduckgo, etc.)
+            language: Language code for search results
+            safe_search: Safe search level (0=off, 1=moderate, 2=strict)
+            time_range: Time range for results (day, week, month, year)
+            delay_between_requests: Seconds to wait between requests
+            llm: Language model for relevance filtering
+            max_filtered_results: Maximum number of results to keep after filtering
+            include_full_content: Whether to include full webpage content in results
+            api_key: Alternative way to provide instance URL (takes precedence over instance_url)
+        """
+        # Initialize the BaseSearchEngine with the LLM and max_filtered_results
+        super().__init__(llm=llm, max_filtered_results=max_filtered_results)
+
+        # Get instance URL from various sources in priority order:
+        # 1. api_key parameter (which is actually the instance URL)
+        # 2. SEARXNG_INSTANCE environment variable
+        # 3. instance_url parameter
+        # 4. Default to None, which will disable the engine
+        self.instance_url = api_key or os.getenv("SEARXNG_INSTANCE") or instance_url
+
+        # Add debug logging for instance URL
+        logger.info(f"SearXNG init - Instance URL sources: api_key={api_key}, env={os.getenv('SEARXNG_INSTANCE')}, param={instance_url}")
+
+        # Validate and normalize the instance URL if provided
+        if self.instance_url:
+            self.instance_url = self.instance_url.rstrip('/')
+            self.is_available = True
+            logger.info(f"SearXNG initialized with instance URL: {self.instance_url}")
+        else:
+            self.is_available = False
+            logger.error("No SearXNG instance URL provided. The engine is disabled. "
+                         "Set SEARXNG_INSTANCE environment variable or provide instance_url parameter.")
+
+        # Add debug logging for all parameters
+        logger.info(f"SearXNG init params: max_results={max_results}, language={language}, "
+                    f"max_filtered_results={max_filtered_results}, is_available={self.is_available}")
+
+        self.max_results = max_results
+        self.categories = categories or ["general"]
+        self.engines = engines
+        self.language = language
+        self.safe_search = safe_search
+        self.time_range = time_range
+
+        self.delay_between_requests = float(os.getenv("SEARXNG_DELAY", delay_between_requests))
+
+        self.include_full_content = include_full_content
+
+        if self.is_available:
+            self.search_url = f"{self.instance_url}/search"
+            logger.info(f"SearXNG engine initialized with instance: {self.instance_url}")
+            logger.info(f"Rate limiting set to {self.delay_between_requests} seconds between requests")
+
+            self.full_search = FullSearchResults(
+                llm=llm,
+                web_search=self,
+                language=language,
+                max_results=max_results,
+                region="wt-wt",
+                time="y",
+                safesearch="Moderate" if safe_search == 1 else "Off" if safe_search == 0 else "Strict"
+            )
+
+        self.last_request_time = 0
+
+    def _respect_rate_limit(self):
+        """Apply self-imposed rate limiting between requests"""
+        current_time = time.time()
+        time_since_last_request = current_time - self.last_request_time
+
+
+        if time_since_last_request < self.delay_between_requests:
+            wait_time = self.delay_between_requests - time_since_last_request
+            logger.info(f"Rate limiting: waiting {wait_time:.2f} seconds")
+            time.sleep(wait_time)
+
+        self.last_request_time = time.time()
+
+    def _get_search_results(self, query: str) -> List[Dict[str, Any]]:
+        """
+        Get search results from SearXNG with ethical rate limiting.
+
+        Args:
+            query: The search query
+
+        Returns:
+            List of search results from SearXNG
+        """
+        if not self.is_available:
+            logger.error("SearXNG engine is disabled (no instance URL provided) - cannot run search")
+            return []
+
+        logger.info(f"SearXNG running search for query: {query}")
+
+        try:
+            self._respect_rate_limit()
+
+            initial_headers = {
+                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
+                "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
+                "Accept-Language": "en-US,en;q=0.9"
+            }
+
+            try:
+                initial_response = requests.get(self.instance_url, headers=initial_headers, timeout=10)
+                cookies = initial_response.cookies
+            except Exception as e:
+                logger.warning(f"Failed to get initial cookies: {e}")
+                cookies = None
+
+            params = {
+                "q": query,
+                "categories": ",".join(self.categories),
+                "language": self.language,
+                "format": "html",  # Use HTML format instead of JSON
+                "pageno": 1,
+                "safesearch": self.safe_search,
+                "count": self.max_results
+            }
+
+            if self.engines:
+                params["engines"] = ",".join(self.engines)
+
+            if self.time_range:
+                params["time_range"] = self.time_range
+
+            # Browser-like headers
+            headers = {
+                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
+                "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
+                "Accept-Language": "en-US,en;q=0.9",
+                "Referer": self.instance_url + "/",
+                "Connection": "keep-alive",
+                "Upgrade-Insecure-Requests": "1"
+            }
+
+            logger.info(f"Sending request to SearXNG instance at {self.instance_url}")
+            response = requests.get(
+                self.search_url,
+                params=params,
+                headers=headers,
+                cookies=cookies,
+                timeout=15
+            )
+
+            if response.status_code == 200:
+                try:
+                    from bs4 import BeautifulSoup
+
+                    soup = BeautifulSoup(response.text, 'html.parser')
+                    results = []
+
+                    result_elements = soup.select('.result-item')
+
+                    if not result_elements:
+                        result_elements = soup.select('.result')
+
+                    if not result_elements:
+                        result_elements = soup.select('article')
+
+                    if not result_elements:
+                        logger.debug(f"Classes found in HTML: {[c['class'] for c in soup.select('[class]') if 'class' in c.attrs][:10]}")
+                        result_elements = soup.select('div[id^="result"]')
+
+                    logger.info(f"Found {len(result_elements)} search result elements")
+
+                    for idx, result_element in enumerate(result_elements):
+                        if idx >= self.max_results:
+                            break
+
+                        title_element = (
+                            result_element.select_one('.result-title') or
+                            result_element.select_one('.title') or
+                            result_element.select_one('h3') or
+                            result_element.select_one('a[href]')
+                        )
+
+                        url_element = (
+                            result_element.select_one('.result-url') or
+                            result_element.select_one('.url') or
+                            result_element.select_one('a[href]')
+                        )
+
+                        content_element = (
+                            result_element.select_one('.result-content') or
+                            result_element.select_one('.content') or
+                            result_element.select_one('.snippet') or
+                            result_element.select_one('p')
+                        )
+
+                        title = title_element.get_text(strip=True) if title_element else ""
+
+                        url = ""
+                        if url_element and url_element.has_attr('href'):
+                            url = url_element['href']
+                        elif url_element:
+                            url = url_element.get_text(strip=True)
+
+                        content = content_element.get_text(strip=True) if content_element else ""
+
+                        if not url and title_element and title_element.has_attr('href'):
+                            url = title_element['href']
+
+                        logger.debug(f"Extracted result {idx}: title={title[:30]}..., url={url[:30]}..., content={content[:30]}...")
+
+                        # Add to results if we have at least a title or URL
+                        if title or url:
+                            results.append({
+                                "title": title,
+                                "url": url,
+                                "content": content,
+                                "engine": "searxng",
+                                "category": "general"
+                            })
+
+                    logger.info(f"SearXNG returned {len(results)} results from HTML parsing")
+                    return results
+
+                except ImportError:
+                    logger.error("BeautifulSoup not available for HTML parsing")
+                    return []
+                except Exception as e:
+                    logger.error(f"Error parsing HTML results: {str(e)}")
+                    return []
+            else:
+                logger.error(f"SearXNG returned status code {response.status_code}")
+                return []
+
+        except Exception as e:
+            logger.error(f"Error getting SearXNG results: {e}")
+            return []
+
+    def _get_previews(self, query: str) -> List[Dict[str, Any]]:
+        """
+        Get preview information for SearXNG search results.
+
+        Args:
+            query: The search query
+
+        Returns:
+            List of preview dictionaries
+        """
+        if not self.is_available:
+            logger.warning("SearXNG engine is disabled (no instance URL provided)")
+            return []
+
+        logger.info(f"Getting SearXNG previews for query: {query}")
+
+        results = self._get_search_results(query)
+
+        if not results:
+            logger.warning(f"No SearXNG results found for query: {query}")
+            return []
+
+        previews = []
+        for i, result in enumerate(results):
+            title = result.get("title", "")
+            url = result.get("url", "")
+            content = result.get("content", "")
+
+            preview = {
+                "id": url or f"searxng-result-{i}",
+                "title": title,
+                "link": url,
+                "snippet": content,
+                "engine": result.get("engine", ""),
+                "category": result.get("category", "")
+            }
+
+            previews.append(preview)
+
+        return previews
+
+    def _get_full_content(self, relevant_items: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
+        """
+        Get full content for the relevant search results.
+
+        Args:
+            relevant_items: List of relevant preview dictionaries
+
+        Returns:
+            List of result dictionaries with full content
+        """
+        if not self.is_available:
+            return relevant_items
+
+        if hasattr(config, 'SEARCH_SNIPPETS_ONLY') and config.SEARCH_SNIPPETS_ONLY:
+            logger.info("Snippet-only mode, skipping full content retrieval")
+            return relevant_items
+
+        logger.info("Retrieving full webpage content")
+
+        try:
+            results_with_content = self.full_search._get_full_content(relevant_items)
+            return results_with_content
+
+        except Exception as e:
+            logger.error(f"Error retrieving full content: {e}")
+            return relevant_items
+
+    def invoke(self, query: str) -> List[Dict[str, Any]]:
+        """Compatibility method for LangChain tools"""
+        return self.run(query)
+
+    def results(self, query: str, max_results: Optional[int] = None) -> List[Dict[str, Any]]:
+        """
+        Get search results in a format compatible with other search engines.
+
+        Args:
+            query: The search query
+            max_results: Optional override for maximum results
+
+        Returns:
+            List of search result dictionaries
+        """
+        if not self.is_available:
+            return []
+
+        original_max_results = self.max_results
+
+        try:
+            if max_results is not None:
+                self.max_results = max_results
+
+            results = self._get_search_results(query)
+
+            formatted_results = []
+            for result in results:
+                formatted_results.append({
+                    "title": result.get("title", ""),
+                    "link": result.get("url", ""),
+                    "snippet": result.get("content", "")
+                })
+
+            return formatted_results
+
+        finally:
+            self.max_results = original_max_results
+
+    @staticmethod
+    def get_self_hosting_instructions() -> str:
+        """
+        Get instructions for self-hosting a SearXNG instance.
+
+        Returns:
+            String with installation instructions
+        """
+        return """
+# SearXNG Self-Hosting Instructions
+
+The most ethical way to use SearXNG is to host your own instance. Here's how:
+
+## Using Docker (easiest method)
+
+1. Install Docker if you don't have it already
+2. Run these commands:
+
+```bash
+# Pull the SearXNG Docker image
+docker pull searxng/searxng
+
+# Run SearXNG (will be available at http://localhost:8080)
+docker run -d -p 8080:8080 --name searxng searxng/searxng
+```
+
+## Using Docker Compose (recommended for production)
+
+1. Create a file named `docker-compose.yml` with the following content:
+
+```yaml
+version: '3'
+services:
+  searxng:
+    container_name: searxng
+    image: searxng/searxng
+    ports:
+      - "8080:8080"
+    volumes:
+      - ./searxng:/etc/searxng
+    environment:
+      - SEARXNG_BASE_URL=http://localhost:8080/
+    restart: unless-stopped
+```
+
+2. Run with Docker Compose:
+
+```bash
+docker-compose up -d
+```
+
+For more detailed instructions and configuration options, visit:
+https://searxng.github.io/searxng/admin/installation.html
+"""
+
+    def run(self, query: str) -> List[Dict[str, Any]]:
+        """
+        Override BaseSearchEngine run method to add SearXNG-specific error handling.
+        """
+        if not self.is_available:
+            logger.error("SearXNG run method called but engine is not available (missing instance URL)")
+            return []
+
+        logger.info(f"SearXNG run method called with query: {query}")
+
+        try:
+            # Call the parent class's run method
+            return super().run(query)
+        except Exception as e:
+            logger.error(f"Error in SearXNG run method: {str(e)}")
+            # Return empty results on error
+            return []
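The new module makes the engine usable only when an instance URL arrives via the `api_key` parameter, the `SEARXNG_INSTANCE` environment variable, or `instance_url`; otherwise `is_available` stays false and every method returns an empty list. A minimal usage sketch, assuming a self-hosted instance at http://localhost:8080 and the import layout used in the diff:

```python
# Sketch only: assumes a SearXNG instance is running at http://localhost:8080
# and that the import paths match those shown in the diff above.
from web_search_engines.engines.search_engine_searxng import SearXNGSearchEngine

engine = SearXNGSearchEngine(
    instance_url="http://localhost:8080",  # or export SEARXNG_INSTANCE instead
    max_results=5,
)

# results() returns dicts with "title", "link", and "snippet" keys.
for result in engine.results("open source metasearch engines"):
    print(result["title"], "->", result["link"])
```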
local_deep_research/web_search_engines/search_engine_factory.py

@@ -230,4 +230,23 @@ def get_search(search_tool: str, llm_instance,
         params["time_period"] = time_period
 
     # Create and return the search engine
-    return create_search_engine(search_tool, **params)
+    logger.info(f"Creating search engine for tool: {search_tool} with params: {params.keys()}")
+    engine = create_search_engine(search_tool, **params)
+
+    # Add debugging to check if engine is None
+    if engine is None:
+        logger.error(f"Failed to create search engine for {search_tool} - returned None")
+    else:
+        engine_type = type(engine).__name__
+        logger.info(f"Successfully created search engine of type: {engine_type}")
+        # Check if the engine has run method
+        if hasattr(engine, 'run'):
+            logger.info(f"Engine has 'run' method: {getattr(engine, 'run')}")
+        else:
+            logger.error(f"Engine does NOT have 'run' method!")
+
+        # For SearxNG, check availability flag
+        if hasattr(engine, 'is_available'):
+            logger.info(f"Engine availability flag: {engine.is_available}")
+
+    return engine
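Because `create_search_engine` may return `None`, callers should treat the factory output as optional, which is exactly what the new logging makes visible. A hedged sketch of the calling pattern; `get_search`'s full signature is truncated in the hunk header, so the `llm_instance=None` argument here is an assumption:

```python
# Sketch only: get_search's remaining parameters are not shown in this diff.
from web_search_engines.search_engine_factory import get_search

engine = get_search("searxng", llm_instance=None)
if engine is not None and getattr(engine, "is_available", False):
    for result in engine.run("open source metasearch engines"):
        print(result)
else:
    print("Engine could not be created or is disabled")
```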
local_deep_research-0.1.12.dist-info/METADATA

@@ -1,6 +1,6 @@
-Metadata-Version: 2.
+Metadata-Version: 2.4
 Name: local-deep-research
-Version: 0.1.0
+Version: 0.1.12
 Summary: AI-powered research assistant with deep, iterative analysis using LLMs and web searches
 Author-email: LearningCircuit <185559241+LearningCircuit@users.noreply.github.com>, HashedViking <6432677+HashedViking@users.noreply.github.com>
 License: MIT License

@@ -51,7 +51,7 @@ Requires-Dist: flask-socketio>=5.1.1
 Requires-Dist: sqlalchemy>=1.4.23
 Requires-Dist: wikipedia
 Requires-Dist: arxiv>=1.4.3
-Requires-Dist: 
+Requires-Dist: pypdf
 Requires-Dist: sentence-transformers
 Requires-Dist: faiss-cpu
 Requires-Dist: pydantic>=2.0.0

@@ -59,6 +59,7 @@ Requires-Dist: pydantic-settings>=2.0.0
 Requires-Dist: toml>=0.10.2
 Requires-Dist: platformdirs>=3.0.0
 Requires-Dist: dynaconf
+Dynamic: license-file
 
 # Local Deep Research
 

@@ -91,12 +92,13 @@ A powerful AI-powered research assistant that performs deep, iterative analysis
 
 - 🌐 **Enhanced Search Integration**
   - **Auto-selection of search sources**: The "auto" search engine intelligently analyzes your query and selects the most appropriate search engine based on the query content
+  - **SearXNG** integration for local web-search engine, great for privacy, no API key required (requires a searxng server)
   - Wikipedia integration for factual knowledge
   - arXiv integration for scientific papers and academic research
   - PubMed integration for biomedical literature and medical research
   - DuckDuckGo integration for web searches (may experience rate limiting)
   - SerpAPI integration for Google search results (requires API key)
-  - 
+  - Google Programmable Search Engine integration for custom search experiences (requires API key)
   - The Guardian integration for news articles and journalism (requires API key)
   - **Local RAG search for private documents** - search your own documents with vector embeddings
   - Full webpage content retrieval

@@ -127,10 +129,10 @@ This example showcases the system's ability to perform multiple research iterat
 
 1. Clone the repository:
 ```bash
-git clone https://github.com/
+git clone https://github.com/LearningCircuit/local-deep-research.git
 cd local-deep-research
 ```
-
+(experimental pip install with new features (but not so well tested yet): **pip install local-deep-research** )
 2. Install dependencies:
 ```bash
 pip install -r requirements.txt

@@ -147,6 +149,20 @@ ollama pull mistral # Default model - many work really well choose best for you
 ```bash
 # Copy the template
 cp .env.template .env
+```
+
+## Experimental install
+```bash
+#experimental pip install with new features (but not so well tested yet):
+pip install local-deep-research
+playwright install
+ollama pull mistral
+```
+## Community & Support
+
+We've just launched our [Discord server](https://discord.gg/2E6gYU2Z) for this project!
+
+Our Discord server can help to exchange ideas about research approaches, discuss advanced usage patterns, and share other ideas.
 
 # Edit .env with your API keys (if using cloud LLMs)
 ANTHROPIC_API_KEY=your-api-key-here # For Claude

@@ -276,6 +292,7 @@ You can use local search in several ways:
 The system supports multiple search engines that can be selected by changing the `search_tool` variable in `config.py`:
 
 - **Auto** (`auto`): Intelligent search engine selector that analyzes your query and chooses the most appropriate source (Wikipedia, arXiv, local collections, etc.)
+- **SearXNG** (`searxng`): Local web-search engine, great for privacy, no API key required (requires a searxng server)
 - **Wikipedia** (`wiki`): Best for general knowledge, facts, and overview information
 - **arXiv** (`arxiv`): Great for scientific and academic research, accessing preprints and papers
 - **PubMed** (`pubmed`): Excellent for biomedical literature, medical research, and health information

@@ -307,6 +324,7 @@ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file
 - [DuckDuckGo](https://duckduckgo.com) for web search
 - [The Guardian](https://www.theguardian.com/) for quality journalism
 - [SerpAPI](https://serpapi.com) for Google search results (requires API key)
+- [SearXNG](https://searxng.org/) for local web-search engine
 - Built on [LangChain](https://github.com/hwchase17/langchain) framework
 - Uses [justext](https://github.com/miso-belica/justext) for content extraction
 - [Playwright](https://playwright.dev) for web content retrieval