PyPI - bits-bie - Versions diffs - 1.2.1__tar.gz → 1.2.2__tar.gz - Mend

bits-bie 1.2.1__tar.gz → 1.2.2__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (52)

{bits_bie-1.2.1 → bits_bie-1.2.2}/PKG-INFO RENAMED

@@ -1,7 +1,7 @@
 Metadata-Version: 2.4
 Name: bits-bie
-Version: 1.2.1
-Summary: *BitS* BitSearch Intelligence Engine — real-time, citation-backed web search & extraction for AI apps. Built on Bitscrape.
+Version: 1.2.2
+Summary: BitSearch Intelligence Engine — real-time, citation-backed web search & extraction for AI apps. Built on Bitscrape.
 Project-URL: Homepage, https://github.com/Sudharsansm/BIE
 Project-URL: Repository, https://github.com/Sudharsansm/BIE
 Project-URL: Issues, https://github.com/Sudharsansm/BIE/issues
@@ -31,6 +31,7 @@ Provides-Extra: all
 Requires-Dist: fastapi>=0.110; extra == 'all'
 Requires-Dist: langchain-core>=0.2; extra == 'all'
 Requires-Dist: mcp>=1.0; extra == 'all'
+Requires-Dist: nest-asyncio>=1.5; extra == 'all'
 Requires-Dist: playwright>=1.40; extra == 'all'
 Requires-Dist: sentence-transformers>=2.2; extra == 'all'
 Requires-Dist: uvicorn[standard]>=0.27; extra == 'all'
@@ -44,6 +45,8 @@ Provides-Extra: langchain
 Requires-Dist: langchain-core>=0.2; extra == 'langchain'
 Provides-Extra: mcp
 Requires-Dist: mcp>=1.0; extra == 'mcp'
+Provides-Extra: notebook
+Requires-Dist: nest-asyncio>=1.5; extra == 'notebook'
 Provides-Extra: render
 Requires-Dist: playwright>=1.40; extra == 'render'
 Provides-Extra: server
@@ -143,9 +146,17 @@ pip install "bits-bie[server]"      # FastAPI + Uvicorn REST server
 pip install "bits-bie[mcp]"         # Model Context Protocol server
 pip install "bits-bie[render]"      # JS rendering for extract() via Playwright
 pip install "bits-bie[langchain]"   # LangChain tool adapters
+pip install "bits-bie[notebook]"    # smoother Jupyter/Colab support (nest_asyncio)
 pip install "bits-bie[all]"         # everything
 ```
+> **Using BIE in Jupyter / Google Colab?** All sync entry points
+> (`engine.crawl(...)`, `bie.websearch(...)`, `bie.extract(..., render_js=True)`)
+> work inside notebooks out of the box — BIE detects the notebook's
+> already-running event loop and handles it automatically. Installing
+> `bits-bie[notebook]` (adds `nest_asyncio`) makes this slightly more
+> efficient, but is not required.
 > BIE depends on [`bitscrape`](https://pypi.org/project/bitscrape/), our
 > proprietary async crawling & extraction framework, which is installed
 > automatically.
@@ -397,6 +408,45 @@ engine = BIE(BIESettings(
 | `chunk_size` | `BIE_CHUNK_SIZE` | `800` | Chars per chunk |
 | `bm25_weight` / `vector_weight` | `BIE_BM25_WEIGHT` / `BIE_VECTOR_WEIGHT` | `0.5` / `0.5` | Fusion weights |
 | `api_key` | `BIE_API_KEY` | `None` | If set, requires `Authorization: Bearer <key>` |
+| — | `BIE_DISCOVERY_BACKENDS` | `ddg_html,ddg_lite,bing_html` | Comma-separated list and order of `websearch()` discovery backends. Known names: `ddg_html`, `ddg_lite`, `bing_html`, `searxng`. |
+| — | `BIE_SEARXNG_URL` | `None` | Base URL of a self-hosted [SearXNG](https://docs.searxng.org/) instance, used when `searxng` is included in `BIE_DISCOVERY_BACKENDS`. |
+### Discovery backends & troubleshooting empty `websearch()` results
+`websearch()` discovers candidate URLs by scraping public search-engine
+result pages (DuckDuckGo HTML, DuckDuckGo Lite, Bing HTML, in that order
+by default). This is inherently fragile — these are not official APIs,
+and shared/cloud IPs (CI runners, some notebook hosts, restrictive
+sandboxes) can be rate-limited or blocked entirely.
+If `websearch()` returns `[]`, BIE logs a `WARNING` that distinguishes
+two failure categories:
+- **"network blocked"** — every backend failed at the connection level
+  (timeouts, connection refused, or a sandbox/proxy denial). This means
+  the environment itself can't reach these hosts — re-run in an
+  environment with normal internet access (a local machine, server, or
+  Colab) rather than a locked-down sandbox.
+- **"reachable but no results"** — connections succeeded but responses
+  were empty, a CAPTCHA/consent page, or rate-limited (HTTP 403/429).
+  This means the IP is likely being rate-limited; try again later, reduce
+  request frequency, or switch to a self-hosted backend (below).
+For a durable fix to rate-limiting, run a self-hosted
+[SearXNG](https://docs.searxng.org/) instance and point BIE at it:
+```bash
+export BIE_DISCOVERY_BACKENDS=searxng
+export BIE_SEARXNG_URL=http://localhost:8080
+```
+You can also combine backends and reorder them, e.g. to prefer your
+SearXNG instance but fall back to DuckDuckGo:
+```bash
+export BIE_DISCOVERY_BACKENDS=searxng,ddg_html,ddg_lite
+export BIE_SEARXNG_URL=http://localhost:8080
+```
 ---
@@ -433,10 +483,10 @@ for Elasticsearch/Milvus-backed implementations behind the same
 ---
-## Built on BitS
+## Built on Bitscrape
 BIE's crawling and extraction layer is powered by
-[**BitS**](https://github.com/Sudharsansm/Bitscrape)
+[**Bitscrape**](https://github.com/Sudharsansm/Bitscrape)
 (`pip install bitscrape`), our async, robots.txt-aware web scraping
 framework — giving BIE high-performance, polite crawling out of the box.
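
Note on the `BIE_DISCOVERY_BACKENDS` / `BIE_SEARXNG_URL` rows added above: the diff documents the variable names, the default order and the known backend names, but not how BIE parses them. The following is a minimal sketch of how such a comma-separated, ordered setting could be resolved. `resolve_backends` is a hypothetical helper, not part of BIE, and the validation behaviour (rejecting unknown names, requiring a URL for `searxng`) is an assumption.

```python
import os

# Names and default order are taken from the README table in this diff.
KNOWN_BACKENDS = ("ddg_html", "ddg_lite", "bing_html", "searxng")
DEFAULT_BACKENDS = "ddg_html,ddg_lite,bing_html"


def resolve_backends(env=None):
    """Return backend names in the order they should be tried.

    Hypothetical helper for illustration; BIE's real parsing may differ.
    """
    env = os.environ if env is None else env
    raw = env.get("BIE_DISCOVERY_BACKENDS") or DEFAULT_BACKENDS
    names = []
    for part in raw.split(","):
        name = part.strip().lower()
        if not name or name in names:
            continue  # skip empty entries and duplicates, keep first position
        if name not in KNOWN_BACKENDS:
            raise ValueError(f"unknown discovery backend: {name!r}")
        names.append(name)
    # Assumption: selecting searxng without a base URL is treated as an error.
    if "searxng" in names and not env.get("BIE_SEARXNG_URL"):
        raise ValueError("searxng selected but BIE_SEARXNG_URL is not set")
    return names


default_order = resolve_backends({})
custom_order = resolve_backends({
    "BIE_DISCOVERY_BACKENDS": "searxng, ddg_html,ddg_lite",
    "BIE_SEARXNG_URL": "http://localhost:8080",
})
```

Passing a plain dict instead of `os.environ` keeps the sketch testable without touching the real environment.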

{bits_bie-1.2.1 → bits_bie-1.2.2}/README.md RENAMED

@@ -90,9 +90,17 @@ pip install "bits-bie[server]"      # FastAPI + Uvicorn REST server
 pip install "bits-bie[mcp]"         # Model Context Protocol server
 pip install "bits-bie[render]"      # JS rendering for extract() via Playwright
 pip install "bits-bie[langchain]"   # LangChain tool adapters
+pip install "bits-bie[notebook]"    # smoother Jupyter/Colab support (nest_asyncio)
 pip install "bits-bie[all]"         # everything
 ```
+> **Using BIE in Jupyter / Google Colab?** All sync entry points
+> (`engine.crawl(...)`, `bie.websearch(...)`, `bie.extract(..., render_js=True)`)
+> work inside notebooks out of the box — BIE detects the notebook's
+> already-running event loop and handles it automatically. Installing
+> `bits-bie[notebook]` (adds `nest_asyncio`) makes this slightly more
+> efficient, but is not required.
 > BIE depends on [`bitscrape`](https://pypi.org/project/bitscrape/), our
 > proprietary async crawling & extraction framework, which is installed
 > automatically.
@@ -344,6 +352,45 @@ engine = BIE(BIESettings(
 | `chunk_size` | `BIE_CHUNK_SIZE` | `800` | Chars per chunk |
 | `bm25_weight` / `vector_weight` | `BIE_BM25_WEIGHT` / `BIE_VECTOR_WEIGHT` | `0.5` / `0.5` | Fusion weights |
 | `api_key` | `BIE_API_KEY` | `None` | If set, requires `Authorization: Bearer <key>` |
+| — | `BIE_DISCOVERY_BACKENDS` | `ddg_html,ddg_lite,bing_html` | Comma-separated list and order of `websearch()` discovery backends. Known names: `ddg_html`, `ddg_lite`, `bing_html`, `searxng`. |
+| — | `BIE_SEARXNG_URL` | `None` | Base URL of a self-hosted [SearXNG](https://docs.searxng.org/) instance, used when `searxng` is included in `BIE_DISCOVERY_BACKENDS`. |
+### Discovery backends & troubleshooting empty `websearch()` results
+`websearch()` discovers candidate URLs by scraping public search-engine
+result pages (DuckDuckGo HTML, DuckDuckGo Lite, Bing HTML, in that order
+by default). This is inherently fragile — these are not official APIs,
+and shared/cloud IPs (CI runners, some notebook hosts, restrictive
+sandboxes) can be rate-limited or blocked entirely.
+If `websearch()` returns `[]`, BIE logs a `WARNING` that distinguishes
+two failure categories:
+- **"network blocked"** — every backend failed at the connection level
+  (timeouts, connection refused, or a sandbox/proxy denial). This means
+  the environment itself can't reach these hosts — re-run in an
+  environment with normal internet access (a local machine, server, or
+  Colab) rather than a locked-down sandbox.
+- **"reachable but no results"** — connections succeeded but responses
+  were empty, a CAPTCHA/consent page, or rate-limited (HTTP 403/429).
+  This means the IP is likely being rate-limited; try again later, reduce
+  request frequency, or switch to a self-hosted backend (below).
+For a durable fix to rate-limiting, run a self-hosted
+[SearXNG](https://docs.searxng.org/) instance and point BIE at it:
+```bash
+export BIE_DISCOVERY_BACKENDS=searxng
+export BIE_SEARXNG_URL=http://localhost:8080
+```
+You can also combine backends and reorder them, e.g. to prefer your
+SearXNG instance but fall back to DuckDuckGo:
+```bash
+export BIE_DISCOVERY_BACKENDS=searxng,ddg_html,ddg_lite
+export BIE_SEARXNG_URL=http://localhost:8080
+```
 ---
@@ -380,10 +427,10 @@ for Elasticsearch/Milvus-backed implementations behind the same
 ---
-## Built on BitS
+## Built on Bitscrape
 BIE's crawling and extraction layer is powered by
-[**BitS**](https://github.com/Sudharsansm/Bitscrape)
+[**Bitscrape**](https://github.com/Sudharsansm/Bitscrape)
 (`pip install bitscrape`), our async, robots.txt-aware web scraping
 framework — giving BIE high-performance, polite crawling out of the box.
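
Note on the two failure categories described in the new troubleshooting section: the README text says a `WARNING` separates "network blocked" from "reachable but no results", but the logging code itself is not part of this excerpt. The sketch below shows one way the distinction could be drawn from per-backend outcomes. `BackendOutcome` and `classify_empty_search` are hypothetical names introduced here for illustration only.

```python
from dataclasses import dataclass


@dataclass
class BackendOutcome:
    """Result of trying one discovery backend (hypothetical structure)."""
    name: str
    connected: bool        # False for timeouts, refused connections, proxy denials
    result_count: int = 0  # URLs parsed from the response, if any


def classify_empty_search(outcomes):
    """Return None if any backend produced results, else a failure category."""
    if not outcomes:
        raise ValueError("no backends were tried")
    if any(o.result_count > 0 for o in outcomes):
        return None
    if all(not o.connected for o in outcomes):
        # Every backend failed at the connection level.
        return "network blocked"
    # At least one connection succeeded, but nothing usable came back
    # (empty page, CAPTCHA/consent page, or HTTP 403/429).
    return "reachable but no results"


blocked = classify_empty_search([
    BackendOutcome("ddg_html", connected=False),
    BackendOutcome("bing_html", connected=False),
])
throttled = classify_empty_search([
    BackendOutcome("ddg_html", connected=True),
    BackendOutcome("bing_html", connected=False),
])
```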

{bits_bie-1.2.1 → bits_bie-1.2.2}/bie/__init__.py RENAMED

@@ -54,6 +54,8 @@ Run as an MCP tool (for Claude Desktop, Claude Code, etc.)::
 from __future__ import annotations
+from importlib import metadata as _metadata
 from bie.config import BIESettings
 from bie.engine import BIE
 from bie.extract import ExtractError, ExtractResult, extract
@@ -63,7 +65,12 @@ from bie.security import SecurityFinding, SecurityReport, scan_for_prompt_inject
 from bie.sitecrawl import crawl_site
 from bie.sitemap import SiteMap, map_site
-__version__ = "1.2.1"
+try:
+    # Reflects the version actually installed (matches PyPI/pyproject.toml).
+    __version__ = _metadata.version("bits-bie")
+except _metadata.PackageNotFoundError:
+    # Editable/source checkout without installed metadata.
+    __version__ = "1.2.2"
 __all__ = [
     "BIE",
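
Note on the `__version__` change above: the hard-coded string is replaced by a lookup of the installed distribution's metadata, with a literal fallback for source checkouts. The same pattern in isolation looks roughly like this. `installed_version` is a hypothetical wrapper written for this note; `importlib.metadata.version` and `PackageNotFoundError` are the standard-library calls the diff itself uses.

```python
from importlib import metadata


def installed_version(dist_name, fallback):
    """Return the installed version of dist_name, or fallback if not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        # No installed metadata, e.g. running from an unpacked source tree.
        return fallback


# A distribution name that is assumed not to exist, to exercise the fallback.
version = installed_version("definitely-not-an-installed-dist-xyz", "0.0.0")
```

One consequence of this design is that the fallback literal still has to be bumped by hand on each release, since it is only used when the metadata lookup fails.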

bits_bie-1.2.2/bie/_async_utils.py ADDED

@@ -0,0 +1,93 @@
+"""
+Internal helpers for running async code from synchronous entry points,
+safely whether or not the caller is already inside an event loop.
+This module exists because BIE's public sync API (``Crawler.crawl``,
+``BIE.crawl``, etc.) wraps async crawl logic with ``asyncio.run()`` —
+which works fine in plain scripts, CLI commands, and server request
+handlers, but **raises** ``RuntimeError: asyncio.run() cannot be called
+from a running event loop`` when called from Jupyter/Colab notebooks
+(which run their own persistent event loop).
+:func:`run_sync` detects this and falls back automatically:
+1. **No running loop** (plain script/CLI/server) — use ``asyncio.run()``
+   directly. This is the common case and has zero overhead.
+2. **Running loop + nest_asyncio installed** — patch the running loop
+   with `nest_asyncio <https://pypi.org/project/nest_asyncio/>`_ so
+   ``asyncio.run()`` can be called from within it. Cheap, same-thread.
+3. **Running loop, no nest_asyncio** — run the coroutine to completion in
+   a fresh event loop on a separate worker thread, and block until it
+   finishes. Always works, no extra dependencies required, slightly more
+   overhead (one thread per call).
+Callers (``Crawler.crawl``, ``BIE.crawl``, etc.) don't need to know which
+path was taken — :func:`run_sync` always returns the coroutine's result
+or raises its exception, as if it were called from a script with no
+running loop.
+"""
+from __future__ import annotations
+import asyncio
+import concurrent.futures
+import logging
+from typing import Coroutine, TypeVar
+_T = TypeVar("_T")
+logger = logging.getLogger("bie.async_utils")
+_nest_asyncio_applied = False
+def run_sync(coro: Coroutine[None, None, _T]) -> _T:
+    """Run ``coro`` to completion and return its result, working correctly
+    whether or not the calling thread already has a running event loop.
+    See module docstring for the fallback strategy.
+    """
+    try:
+        asyncio.get_running_loop()
+    except RuntimeError:
+        # No running loop in this thread — the normal case for scripts,
+        # CLI commands, and most server request handlers.
+        return asyncio.run(coro)
+    # We're inside a running event loop (e.g. a Jupyter/Colab cell).
+    # First choice: nest_asyncio, if available — patches the loop so
+    # asyncio.run() works from within it. Cheapest option, same thread.
+    if _try_apply_nest_asyncio():
+        return asyncio.run(coro)
+    # Fallback: run the coroutine in a brand-new event loop on a separate
+    # thread, and block the calling (notebook) thread until it's done.
+    # This always works and requires no extra dependencies.
+    logger.debug(
+        "Running coroutine in a separate thread (already inside an event "
+        "loop and nest_asyncio is not installed). Install nest_asyncio for "
+        "lower overhead: pip install nest_asyncio"
+    )
+    return _run_in_new_thread(coro)
+def _try_apply_nest_asyncio() -> bool:
+    global _nest_asyncio_applied
+    if _nest_asyncio_applied:
+        return True
+    try:
+        import nest_asyncio
+    except ImportError:
+        return False
+    nest_asyncio.apply()
+    _nest_asyncio_applied = True
+    return True
+def _run_in_new_thread(coro: Coroutine[None, None, _T]) -> _T:
+    def _runner() -> _T:
+        return asyncio.run(coro)
+    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
+        future = pool.submit(_runner)
+        return future.result()
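
Note on `run_sync`: the module docstring describes why a plain `asyncio.run()` fails inside a notebook and how the thread fallback avoids that. The sketch below reproduces both behaviours with the standard library only. It does not import BIE; `answer`, `run_in_new_thread` and `notebook_like_cell` are stand-in names, and the already-running loop of a Jupyter cell is simulated by calling from inside a coroutine.

```python
import asyncio
import concurrent.futures


async def answer():
    await asyncio.sleep(0)
    return 42


def run_in_new_thread(coro):
    # Same idea as the diff's _run_in_new_thread: run the coroutine in a
    # fresh event loop on a worker thread and block until it finishes.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()


async def notebook_like_cell():
    # Inside a coroutine an event loop is already running in this thread,
    # which is the situation a Jupyter/Colab cell is in.
    probe = answer()
    try:
        asyncio.run(probe)
        direct = "ok"
    except RuntimeError:
        direct = "RuntimeError"
    finally:
        probe.close()  # the probe never ran; close it to avoid a warning
    return direct, run_in_new_thread(answer())


outcome = asyncio.run(notebook_like_cell())
```

The worker-thread call blocks the thread that owns the outer loop until the coroutine finishes, which matches the behaviour the docstring describes for path 3.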

{bits_bie-1.2.1 → bits_bie-1.2.2}/bie/crawler.py RENAMED

@@ -9,7 +9,6 @@ objects, ready for chunking + indexing.
 from __future__ import annotations
-import asyncio
 import logging
 from typing import Any
 from urllib.parse import urlparse
@@ -17,6 +16,7 @@ from urllib.parse import urlparse
 import bitscrape
 from bitscrape.pipeline.pipelines import BasePipeline
+from bie._async_utils import run_sync
 from bie.config import BIESettings
 from bie.models import Document
 from bie.spiders.generic import BIESpider
@@ -24,6 +24,39 @@ from bie.spiders.generic import BIESpider
 logger = logging.getLogger("bie.crawler")
+def _patch_request_ordering() -> None:
+    """Make ``bitscrape.Request`` orderable for its priority-queue
+    tie-breaks.
+    Bitscrape's scheduler stores requests in an ``asyncio.PriorityQueue``
+    as ``(priority.value, request)`` tuples. When two requests share the
+    same priority, ``heapq`` falls back to comparing the ``Request``
+    objects directly with ``<`` — but ``Request`` (a pydantic
+    ``BaseModel``) doesn't define ``__lt__``, so this raises::
+        TypeError: '<' not supported between instances of 'Request' and 'Request'
+    This patches in an arbitrary-but-stable ``__lt__`` (by ``id()``) so
+    same-priority requests can be ordered without error. The patch is a
+    no-op if a future Bitscrape version already defines ``__lt__`` on
+    ``Request``.
+    """
+    request_cls = bitscrape.Request
+    current = getattr(request_cls, "__lt__", None)
+    if current is not None and current is not object.__lt__:
+        # Already defines real ordering (future Bitscrape fix) — no-op.
+        return
+    def _lt(self: Any, other: Any) -> bool:
+        return id(self) < id(other)
+    request_cls.__lt__ = _lt
+    logger.debug("Patched bitscrape.Request.__lt__ for priority-queue tie-breaks")
+_patch_request_ordering()
 class _CollectorPipeline(BasePipeline):
     """Collects every scraped item into an in-memory list."""
@@ -44,8 +77,13 @@ class Crawler:
     def crawl(
         self, urls: list[str], allowed_domains: list[str] | None = None, instruction: str = ""
     ) -> list[Document]:
-        """Synchronous convenience wrapper around :meth:`acrawl`."""
-        return asyncio.run(self.acrawl(urls, allowed_domains, instruction))
+        """Synchronous convenience wrapper around :meth:`acrawl`.
+        Safe to call from plain scripts, CLI commands, server request
+        handlers, *and* Jupyter/Colab notebooks (which already run an
+        event loop) — see :func:`bie._async_utils.run_sync`.
+        """
+        return run_sync(self.acrawl(urls, allowed_domains, instruction))
     async def acrawl(
         self,
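
Note on `_patch_request_ordering`: the docstring explains that `(priority, request)` tuples with equal priorities make `heapq` compare the request objects themselves. The sketch below reproduces that failure and the `id()`-based fix with a stand-in class. `FakeRequest` is hypothetical and only mimics an object without `__lt__`; it is not `bitscrape.Request`.

```python
import heapq


class FakeRequest:
    """Stand-in for a request type that defines no ordering."""

    def __init__(self, url):
        self.url = url


def push_and_drain(cls):
    heap = []
    # Equal priorities force the tuple comparison to fall through to the
    # second element, i.e. the request objects.
    heapq.heappush(heap, (1, cls("https://example.com/a")))
    heapq.heappush(heap, (1, cls("https://example.com/b")))
    return [heapq.heappop(heap)[1].url for _ in range(2)]


try:
    push_and_drain(FakeRequest)
    failed_before_patch = False
except TypeError:
    failed_before_patch = True

# Same approach as the diff: an arbitrary ordering by id().
FakeRequest.__lt__ = lambda self, other: id(self) < id(other)
urls_after_patch = push_and_drain(FakeRequest)
```

The resulting order among equal-priority items depends on memory addresses, so it is not insertion order. Where first-in-first-out behaviour matters, a common alternative is to add a monotonically increasing counter as a middle tuple element, though that would require changing how the scheduler builds its entries.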
