PyPI - bits-bie - Versions diffs - 1.2.2__tar.gz → 1.2.4__tar.gz - Mend

bits-bie 1.2.2__tar.gz → 1.2.4__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (56)

{bits_bie-1.2.2 → bits_bie-1.2.4}/.github/workflows/ci.yml RENAMED

@@ -24,4 +24,4 @@ jobs:
       - name: Run tests
         run: pytest -v
       - name: Lint
-        run: ruff check bie tests
+        run: ruff check bie tests

{bits_bie-1.2.2 → bits_bie-1.2.4}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: bits-bie
-Version: 1.2.2
+Version: 1.2.4
 Summary: BitSearch Intelligence Engine — real-time, citation-backed web search & extraction for AI apps. Built on Bitscrape.
 Project-URL: Homepage, https://github.com/Sudharsansm/BIE
 Project-URL: Repository, https://github.com/Sudharsansm/BIE
@@ -66,7 +66,7 @@ API keys, no subscriptions, no third-party search services.**
 BIE gives any LLM, RAG pipeline, or AI agent five core primitives —
 **search, extract, map, crawl, and a hybrid index** — all running locally
-on top of [**Bitscrape**](https://pypi.org/project/bitscrape/), our
+on top of [**BitS **](https://pypi.org/project/bitscrape/), our
 async crawling framework. Use it as a Python library, REST API, CLI, or
 [MCP](https://modelcontextprotocol.io) server.
@@ -146,17 +146,10 @@ pip install "bits-bie[server]"      # FastAPI + Uvicorn REST server
 pip install "bits-bie[mcp]"         # Model Context Protocol server
 pip install "bits-bie[render]"      # JS rendering for extract() via Playwright
 pip install "bits-bie[langchain]"   # LangChain tool adapters
-pip install "bits-bie[notebook]"    # smoother Jupyter/Colab support (nest_asyncio)
+pip install "bits-bie[notebook]"    # smoother async behaviour in Jupyter/Colab
 pip install "bits-bie[all]"         # everything
 ```
-> **Using BIE in Jupyter / Google Colab?** All sync entry points
-> (`engine.crawl(...)`, `bie.websearch(...)`, `bie.extract(..., render_js=True)`)
-> work inside notebooks out of the box — BIE detects the notebook's
-> already-running event loop and handles it automatically. Installing
-> `bits-bie[notebook]` (adds `nest_asyncio`) makes this slightly more
-> efficient, but is not required.
 > BIE depends on [`bitscrape`](https://pypi.org/project/bitscrape/), our
 > proprietary async crawling & extraction framework, which is installed
 > automatically.
@@ -407,44 +400,62 @@ engine = BIE(BIESettings(
 | `use_embeddings` | `BIE_USE_EMBEDDINGS` | `true` | Enable semantic search |
 | `chunk_size` | `BIE_CHUNK_SIZE` | `800` | Chars per chunk |
 | `bm25_weight` / `vector_weight` | `BIE_BM25_WEIGHT` / `BIE_VECTOR_WEIGHT` | `0.5` / `0.5` | Fusion weights |
+| `discovery_backends` | `BIE_DISCOVERY_BACKENDS` | `ddg_html,ddg_lite,bing_html` | Ordered, comma-separated discovery backends for `websearch()`. Add `searxng` for a self-hosted instance. |
+| `searxng_url` | `BIE_SEARXNG_URL` | `None` | Base URL of a self-hosted SearXNG instance, used by the `searxng` discovery backend |
 | `api_key` | `BIE_API_KEY` | `None` | If set, requires `Authorization: Bearer <key>` |
-| — | `BIE_DISCOVERY_BACKENDS` | `ddg_html,ddg_lite,bing_html` | Comma-separated list and order of `websearch()` discovery backends. Known names: `ddg_html`, `ddg_lite`, `bing_html`, `searxng`. |
-| — | `BIE_SEARXNG_URL` | `None` | Base URL of a self-hosted [SearXNG](https://docs.searxng.org/) instance, used when `searxng` is included in `BIE_DISCOVERY_BACKENDS`. |
-### Discovery backends & troubleshooting empty `websearch()` results
+---
-`websearch()` discovers candidate URLs by scraping public search-engine
-result pages (DuckDuckGo HTML, DuckDuckGo Lite, Bing HTML, in that order
-by default). This is inherently fragile — these are not official APIs,
-and shared/cloud IPs (CI runners, some notebook hosts, restrictive
-sandboxes) can be rate-limited or blocked entirely.
+## Troubleshooting
+**`TypeError: '<' not supported between instances of 'Request' and 'Request'`**
+during a crawl — this was a Bitscrape scheduler bug (its priority queue
+compared `Request` objects directly when two requests shared the same
+priority). BIE patches `bitscrape.Request` to be orderable at import
+time, so this no longer occurs. If you still see it, you're likely on an
+older `bits-bie` version — upgrade.
+**`RuntimeError: asyncio.run() cannot be called from a running event
+loop`** — Jupyter/Colab/IPython already run an event loop, which used to
+break `engine.crawl(urls)` / `bie.websearch(...)`. Both now detect a
+running loop automatically and either use
+[`nest_asyncio`](https://pypi.org/project/nest_asyncio/) (install via
+`pip install "bits-bie[notebook]"`) or fall back to running the crawl on
+a background thread — no code changes needed. If you're already inside
+an `async def`, you can also call `await engine.acrawl(urls)` directly.
+**`bie.websearch(...)` returns `[]` / all discovery backends fail** —
+discovery scrapes DuckDuckGo/Bing's public HTML result pages, which can
+be blocked or rate-limited. Call
+`bie.discovery.get_last_discovery_diagnostics()` right after to see why:
-If `websearch()` returns `[]`, BIE logs a `WARNING` that distinguishes
-two failure categories:
+```python
+import bie
+from bie.discovery import get_last_discovery_diagnostics
-- **"network blocked"** — every backend failed at the connection level
-  (timeouts, connection refused, or a sandbox/proxy denial). This means
-  the environment itself can't reach these hosts — re-run in an
-  environment with normal internet access (a local machine, server, or
-  Colab) rather than a locked-down sandbox.
-- **"reachable but no results"** — connections succeeded but responses
-  were empty, a CAPTCHA/consent page, or rate-limited (HTTP 403/429).
-  This means the IP is likely being rate-limited; try again later, reduce
-  request frequency, or switch to a self-hosted backend (below).
+results = bie.websearch("...")
+if not results:
+    print(get_last_discovery_diagnostics().summary())
+```
-For a durable fix to rate-limiting, run a self-hosted
-[SearXNG](https://docs.searxng.org/) instance and point BIE at it:
+This distinguishes three cases:
-```bash
-export BIE_DISCOVERY_BACKENDS=searxng
-export BIE_SEARXNG_URL=http://localhost:8080
-```
+- **Network blocked** — every backend failed at the connection level
+  (or an egress proxy returned `x-deny-reason: host_not_allowed`). This
+  environment can't reach these hosts at all — check its outbound
+  network/proxy/firewall config. Common in sandboxed code-execution
+  environments; Colab and most servers have unrestricted outbound access.
+- **Blocked / rate-limited** — backends responded with `403`/`429`/etc.,
+  typically from bot-detection on a shared IP. Retry later, reduce
+  request volume, or configure a `searxng` backend (below).
+- **Empty response** — got `200 OK` but no parseable results (often a
+  CAPTCHA/consent page).
-You can also combine backends and reorder them, e.g. to prefer your
-SearXNG instance but fall back to DuckDuckGo:
+For the most reliable no-API-key discovery, self-host
+[SearXNG](https://github.com/searxng/searxng) and add it as a backend:
 ```bash
-export BIE_DISCOVERY_BACKENDS=searxng,ddg_html,ddg_lite
+export BIE_DISCOVERY_BACKENDS=searxng,ddg_html,ddg_lite,bing_html
 export BIE_SEARXNG_URL=http://localhost:8080
 ```
@@ -486,7 +497,7 @@ for Elasticsearch/Milvus-backed implementations behind the same
 ## Built on Bitscrape
 BIE's crawling and extraction layer is powered by
-[**Bitscrape**](https://github.com/Sudharsansm/Bitscrape)
+[**BitS**](https://github.com/Sudharsansm/Bitscrape)
 (`pip install bitscrape`), our async, robots.txt-aware web scraping
 framework — giving BIE high-performance, polite crawling out of the box.

{bits_bie-1.2.2 → bits_bie-1.2.4}/README.md RENAMED

@@ -10,7 +10,7 @@ API keys, no subscriptions, no third-party search services.**
 BIE gives any LLM, RAG pipeline, or AI agent five core primitives —
 **search, extract, map, crawl, and a hybrid index** — all running locally
-on top of [**Bitscrape**](https://pypi.org/project/bitscrape/), our
+on top of [**BitS **](https://pypi.org/project/bitscrape/), our
 async crawling framework. Use it as a Python library, REST API, CLI, or
 [MCP](https://modelcontextprotocol.io) server.
@@ -90,17 +90,10 @@ pip install "bits-bie[server]"      # FastAPI + Uvicorn REST server
 pip install "bits-bie[mcp]"         # Model Context Protocol server
 pip install "bits-bie[render]"      # JS rendering for extract() via Playwright
 pip install "bits-bie[langchain]"   # LangChain tool adapters
-pip install "bits-bie[notebook]"    # smoother Jupyter/Colab support (nest_asyncio)
+pip install "bits-bie[notebook]"    # smoother async behaviour in Jupyter/Colab
 pip install "bits-bie[all]"         # everything
 ```
-> **Using BIE in Jupyter / Google Colab?** All sync entry points
-> (`engine.crawl(...)`, `bie.websearch(...)`, `bie.extract(..., render_js=True)`)
-> work inside notebooks out of the box — BIE detects the notebook's
-> already-running event loop and handles it automatically. Installing
-> `bits-bie[notebook]` (adds `nest_asyncio`) makes this slightly more
-> efficient, but is not required.
 > BIE depends on [`bitscrape`](https://pypi.org/project/bitscrape/), our
 > proprietary async crawling & extraction framework, which is installed
 > automatically.
@@ -351,44 +344,62 @@ engine = BIE(BIESettings(
 | `use_embeddings` | `BIE_USE_EMBEDDINGS` | `true` | Enable semantic search |
 | `chunk_size` | `BIE_CHUNK_SIZE` | `800` | Chars per chunk |
 | `bm25_weight` / `vector_weight` | `BIE_BM25_WEIGHT` / `BIE_VECTOR_WEIGHT` | `0.5` / `0.5` | Fusion weights |
+| `discovery_backends` | `BIE_DISCOVERY_BACKENDS` | `ddg_html,ddg_lite,bing_html` | Ordered, comma-separated discovery backends for `websearch()`. Add `searxng` for a self-hosted instance. |
+| `searxng_url` | `BIE_SEARXNG_URL` | `None` | Base URL of a self-hosted SearXNG instance, used by the `searxng` discovery backend |
 | `api_key` | `BIE_API_KEY` | `None` | If set, requires `Authorization: Bearer <key>` |
-| — | `BIE_DISCOVERY_BACKENDS` | `ddg_html,ddg_lite,bing_html` | Comma-separated list and order of `websearch()` discovery backends. Known names: `ddg_html`, `ddg_lite`, `bing_html`, `searxng`. |
-| — | `BIE_SEARXNG_URL` | `None` | Base URL of a self-hosted [SearXNG](https://docs.searxng.org/) instance, used when `searxng` is included in `BIE_DISCOVERY_BACKENDS`. |
-### Discovery backends & troubleshooting empty `websearch()` results
+---
-`websearch()` discovers candidate URLs by scraping public search-engine
-result pages (DuckDuckGo HTML, DuckDuckGo Lite, Bing HTML, in that order
-by default). This is inherently fragile — these are not official APIs,
-and shared/cloud IPs (CI runners, some notebook hosts, restrictive
-sandboxes) can be rate-limited or blocked entirely.
+## Troubleshooting
+**`TypeError: '<' not supported between instances of 'Request' and 'Request'`**
+during a crawl — this was a Bitscrape scheduler bug (its priority queue
+compared `Request` objects directly when two requests shared the same
+priority). BIE patches `bitscrape.Request` to be orderable at import
+time, so this no longer occurs. If you still see it, you're likely on an
+older `bits-bie` version — upgrade.
+**`RuntimeError: asyncio.run() cannot be called from a running event
+loop`** — Jupyter/Colab/IPython already run an event loop, which used to
+break `engine.crawl(urls)` / `bie.websearch(...)`. Both now detect a
+running loop automatically and either use
+[`nest_asyncio`](https://pypi.org/project/nest_asyncio/) (install via
+`pip install "bits-bie[notebook]"`) or fall back to running the crawl on
+a background thread — no code changes needed. If you're already inside
+an `async def`, you can also call `await engine.acrawl(urls)` directly.
+**`bie.websearch(...)` returns `[]` / all discovery backends fail** —
+discovery scrapes DuckDuckGo/Bing's public HTML result pages, which can
+be blocked or rate-limited. Call
+`bie.discovery.get_last_discovery_diagnostics()` right after to see why:
-If `websearch()` returns `[]`, BIE logs a `WARNING` that distinguishes
-two failure categories:
+```python
+import bie
+from bie.discovery import get_last_discovery_diagnostics
-- **"network blocked"** — every backend failed at the connection level
-  (timeouts, connection refused, or a sandbox/proxy denial). This means
-  the environment itself can't reach these hosts — re-run in an
-  environment with normal internet access (a local machine, server, or
-  Colab) rather than a locked-down sandbox.
-- **"reachable but no results"** — connections succeeded but responses
-  were empty, a CAPTCHA/consent page, or rate-limited (HTTP 403/429).
-  This means the IP is likely being rate-limited; try again later, reduce
-  request frequency, or switch to a self-hosted backend (below).
+results = bie.websearch("...")
+if not results:
+    print(get_last_discovery_diagnostics().summary())
+```
-For a durable fix to rate-limiting, run a self-hosted
-[SearXNG](https://docs.searxng.org/) instance and point BIE at it:
+This distinguishes three cases:
-```bash
-export BIE_DISCOVERY_BACKENDS=searxng
-export BIE_SEARXNG_URL=http://localhost:8080
-```
+- **Network blocked** — every backend failed at the connection level
+  (or an egress proxy returned `x-deny-reason: host_not_allowed`). This
+  environment can't reach these hosts at all — check its outbound
+  network/proxy/firewall config. Common in sandboxed code-execution
+  environments; Colab and most servers have unrestricted outbound access.
+- **Blocked / rate-limited** — backends responded with `403`/`429`/etc.,
+  typically from bot-detection on a shared IP. Retry later, reduce
+  request volume, or configure a `searxng` backend (below).
+- **Empty response** — got `200 OK` but no parseable results (often a
+  CAPTCHA/consent page).
-You can also combine backends and reorder them, e.g. to prefer your
-SearXNG instance but fall back to DuckDuckGo:
+For the most reliable no-API-key discovery, self-host
+[SearXNG](https://github.com/searxng/searxng) and add it as a backend:
 ```bash
-export BIE_DISCOVERY_BACKENDS=searxng,ddg_html,ddg_lite
+export BIE_DISCOVERY_BACKENDS=searxng,ddg_html,ddg_lite,bing_html
 export BIE_SEARXNG_URL=http://localhost:8080
 ```
@@ -430,7 +441,7 @@ for Elasticsearch/Milvus-backed implementations behind the same
 ## Built on Bitscrape
 BIE's crawling and extraction layer is powered by
-[**Bitscrape**](https://github.com/Sudharsansm/Bitscrape)
+[**BitS**](https://github.com/Sudharsansm/Bitscrape)
 (`pip install bitscrape`), our async, robots.txt-aware web scraping
 framework — giving BIE high-performance, polite crawling out of the box.

{bits_bie-1.2.2 → bits_bie-1.2.4}/bie/__init__.py RENAMED

@@ -70,7 +70,7 @@ try:
     __version__ = _metadata.version("bits-bie")
 except _metadata.PackageNotFoundError:
     # Editable/source checkout without installed metadata.
-    __version__ = "1.2.2"
+    __version__ = "1.2.4"
 __all__ = [
     "BIE",

bits_bie-1.2.4/bie/_asyncutil.py ADDED

@@ -0,0 +1,84 @@
+"""
+Internal helper for calling async BIE internals from synchronous code.
+Plain scripts have no running event loop, so ``asyncio.run()`` works fine.
+Jupyter/Colab/IPython kernels, however, *already* run an event loop, and
+``asyncio.run()`` raises::
+    RuntimeError: asyncio.run() cannot be called from a running event loop
+:func:`run_sync` detects this and transparently falls back to:
+1. ``nest_asyncio`` (if installed) — patches the running loop so it can be
+   re-entered, then runs the coroutine on it directly.
+2. A dedicated background thread with its own fresh event loop — works
+   everywhere, with zero extra dependencies, at the cost of a thread
+   spin-up per call.
+This means the same sync call (e.g. ``engine.crawl(urls)``) works
+unchanged in plain scripts, notebooks, and servers.
+"""
+from __future__ import annotations
+import asyncio
+import threading
+from typing import Any, Coroutine, TypeVar
+T = TypeVar("T")
+def run_sync(coro: Coroutine[Any, Any, T]) -> T:
+    """Run ``coro`` to completion and return its result, regardless of
+    whether a thread already has an asyncio event loop running.
+    Args:
+        coro: An awaitable coroutine object (not yet awaited/started).
+    Returns:
+        The coroutine's return value.
+    Raises:
+        Whatever exception the coroutine itself raises.
+    """
+    try:
+        asyncio.get_running_loop()
+    except RuntimeError:
+        # No loop running in this thread — the normal case for scripts,
+        # CLI commands, and server request handlers.
+        return asyncio.run(coro)
+    # A loop is already running in this thread (e.g. Jupyter/Colab/IPython,
+    # or an async framework that called into sync BIE code).
+    try:
+        import nest_asyncio  # type: ignore[import-not-found]
+    except ImportError:
+        return _run_in_new_thread(coro)
+    nest_asyncio.apply()
+    loop = asyncio.get_event_loop()
+    return loop.run_until_complete(coro)
+def _run_in_new_thread(coro: Coroutine[Any, Any, T]) -> T:
+    """Run ``coro`` to completion on a fresh event loop in a new thread.
+    Used as the dependency-free fallback when a loop is already running in
+    the calling thread and ``nest_asyncio`` isn't installed.
+    """
+    result: dict[str, Any] = {}
+    error: dict[str, BaseException] = {}
+    def _runner() -> None:
+        try:
+            result["value"] = asyncio.run(coro)
+        except BaseException as exc:  # noqa: BLE001 - re-raised on the caller's thread
+            error["value"] = exc
+    thread = threading.Thread(target=_runner, name="bie-async-runner", daemon=True)
+    thread.start()
+    thread.join()
+    if "value" in error:
+        raise error["value"]
+    return result["value"]  # type: ignore[return-value]
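
The dependency-free fallback path in `run_sync` can be exercised outside a notebook with a small standalone sketch. The code below is not the packaged module: `run_sync_sketch`, `run_in_new_thread`, `add`, and `caller` are illustrative names, and the `nest_asyncio` branch is left out so that only the "no loop" and "background thread" cases are shown.

```python
import asyncio
import threading


def run_in_new_thread(coro):
    """Run coro on a fresh event loop in a separate thread and return its result."""
    box = {}

    def runner():
        try:
            box["value"] = asyncio.run(coro)
        except BaseException as exc:  # re-raised on the caller's thread below
            box["error"] = exc

    thread = threading.Thread(target=runner, daemon=True)
    thread.start()
    thread.join()  # blocks the calling thread until the coroutine finishes
    if "error" in box:
        raise box["error"]
    return box["value"]


def run_sync_sketch(coro):
    """Simplified stand-in for run_sync (hypothetical; no nest_asyncio branch)."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread: a plain asyncio.run() is enough.
        return asyncio.run(coro)
    # A loop is already running here, so hand the coroutine to another thread.
    return run_in_new_thread(coro)


async def add(a, b):
    await asyncio.sleep(0)
    return a + b


async def caller():
    # Simulates notebook-style usage: a sync call made while a loop is running.
    return run_sync_sketch(add(2, 3))


plain_result = run_sync_sketch(add(1, 2))   # no running loop
nested_result = asyncio.run(caller())       # called from inside a running loop
```

Because `thread.join()` blocks, the loop that made the call is paused until the coroutine completes. That is acceptable for a notebook cell, but code already inside an `async def` is usually better served by awaiting the async variant directly.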

{bits_bie-1.2.2 → bits_bie-1.2.4}/bie/config.py RENAMED

@@ -39,6 +39,25 @@ class BIESettings(BaseSettings):
     index_dir: str = Field(".bie_index", description="Directory for persisted index")
     persist: bool = Field(False, description="Persist index to disk between runs")
+    # --- Discovery (no-API-key web search) ----------------------------------
+    discovery_backends: str = Field(
+        "ddg_html,ddg_lite,bing_html",
+        description="Comma-separated, ordered list of discovery backends to "
+        "try for bie.websearch()/discover_urls(). Built-in backends: "
+        "'ddg_html', 'ddg_lite', 'bing_html', 'searxng'. The 'searxng' "
+        "backend requires `searxng_url` to also be set. Unknown names are "
+        "skipped with a warning. Override with the BIE_DISCOVERY_BACKENDS "
+        "env var, e.g. BIE_DISCOVERY_BACKENDS=searxng,ddg_html,ddg_lite,bing_html",
+    )
+    searxng_url: str | None = Field(
+        default=None,
+        description="Base URL of a self-hosted SearXNG instance (e.g. "
+        "'http://localhost:8080'), used by the 'searxng' discovery backend. "
+        "Self-hosting SearXNG is the most reliable no-API-key discovery "
+        "option since it isn't subject to the rate limits / layout changes "
+        "that affect scraping DDG/Bing HTML directly.",
+    )
     # --- Server --------------------------------------------------------------
     host: str = "0.0.0.0"
     port: int = 8000
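
The `discovery_backends` field above is a single comma-separated string, so something has to turn it into an ordered list. This diff does not show that parsing code, so the following is only a sketch of what the field description implies. `resolve_backends` and `KNOWN_BACKENDS` are hypothetical names, and skipping `searxng` when no URL is configured is an assumption rather than confirmed behaviour.

```python
import logging

logger = logging.getLogger("discovery_sketch")

# Backend names listed in the field description; the tuple itself is illustrative.
KNOWN_BACKENDS = ("ddg_html", "ddg_lite", "bing_html", "searxng")


def resolve_backends(raw, searxng_url=None):
    """Turn a comma-separated setting into an ordered list of usable backends."""
    resolved = []
    for name in (part.strip() for part in raw.split(",")):
        if not name:
            continue  # tolerate stray commas and surrounding whitespace
        if name not in KNOWN_BACKENDS:
            logger.warning("Unknown discovery backend %r, skipping", name)
            continue
        if name == "searxng" and not searxng_url:
            # Assumption: without a base URL the searxng backend cannot be used.
            logger.warning("'searxng' listed but no searxng_url set, skipping")
            continue
        if name not in resolved:
            resolved.append(name)  # keep first occurrence, preserve order
    return resolved


with_url = resolve_backends(
    "searxng, ddg_html,bogus,,ddg_lite", searxng_url="http://localhost:8080"
)
without_url = resolve_backends("searxng,ddg_html,ddg_lite")
```

Order matters here because the description calls the list "ordered": backends earlier in the string are tried first.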

{bits_bie-1.2.2 → bits_bie-1.2.4}/bie/crawler.py RENAMED

@@ -16,7 +16,7 @@ from urllib.parse import urlparse
 import bitscrape
 from bitscrape.pipeline.pipelines import BasePipeline
-from bie._async_utils import run_sync
+from bie._asyncutil import run_sync
 from bie.config import BIESettings
 from bie.models import Document
 from bie.spiders.generic import BIESpider
@@ -25,33 +25,39 @@ logger = logging.getLogger("bie.crawler")
 def _patch_request_ordering() -> None:
-    """Make ``bitscrape.Request`` orderable for its priority-queue
-    tie-breaks.
-    Bitscrape's scheduler stores requests in an ``asyncio.PriorityQueue``
-    as ``(priority.value, request)`` tuples. When two requests share the
-    same priority, ``heapq`` falls back to comparing the ``Request``
-    objects directly with ``<`` — but ``Request`` (a pydantic
-    ``BaseModel``) doesn't define ``__lt__``, so this raises::
+    """Work around a Bitscrape bug where its scheduler's
+    ``asyncio.PriorityQueue[tuple[int, Request]]`` compares ``Request``
+    objects directly whenever two requests share the same priority
+    (the common case -- most requests are ``RequestPriority.NORMAL``),
+    raising::
         TypeError: '<' not supported between instances of 'Request' and 'Request'
-    This patches in an arbitrary-but-stable ``__lt__`` (by ``id()``) so
-    same-priority requests can be ordered without error. The patch is a
-    no-op if a future Bitscrape version already defines ``__lt__`` on
-    ``Request``.
+    ``Request`` is a pydantic model with no ``__lt__``/etc., so tuple
+    comparison falls through to comparing the ``Request`` instances
+    themselves once priorities tie.
+    This patches ``bitscrape.Request`` (a pydantic ``BaseModel``) with an
+    identity-based ordering at import time, so equal-priority ties are
+    broken deterministically instead of crashing. This does not change
+    crawl semantics -- priority still determines order; only the
+    previously-crashing tie-break becomes well-defined.
+    The patch is idempotent and a no-op if a future Bitscrape release
+    already defines ``__lt__`` on ``Request``.
     """
-    request_cls = bitscrape.Request
-    current = getattr(request_cls, "__lt__", None)
-    if current is not None and current is not object.__lt__:
-        # Already defines real ordering (future Bitscrape fix) — no-op.
+    request_cls = getattr(bitscrape, "Request", None)
+    if request_cls is None:
+        logger.debug("bitscrape.Request not found -- skipping ordering patch")
         return
+    if "__lt__" in request_cls.__dict__:
+        return  # already orderable (newer bitscrape version fixed it upstream)
-    def _lt(self: Any, other: Any) -> bool:
-        return id(self) < id(other)
-    request_cls.__lt__ = _lt
-    logger.debug("Patched bitscrape.Request.__lt__ for priority-queue tie-breaks")
+    request_cls.__lt__ = lambda self, other: id(self) < id(other)
+    request_cls.__le__ = lambda self, other: id(self) <= id(other)
+    request_cls.__gt__ = lambda self, other: id(self) > id(other)
+    request_cls.__ge__ = lambda self, other: id(self) >= id(other)
+    logger.debug("Patched bitscrape.Request with identity-based ordering")
 _patch_request_ordering()
@@ -80,8 +86,10 @@ class Crawler:
         """Synchronous convenience wrapper around :meth:`acrawl`.
         Safe to call from plain scripts, CLI commands, server request
-        handlers, *and* Jupyter/Colab notebooks (which already run an
-        event loop) — see :func:`bie._async_utils.run_sync`.
+        handlers, *and* Jupyter/Colab/IPython notebooks (which already run
+        an event loop, where a plain ``asyncio.run()`` would raise
+        ``RuntimeError: asyncio.run() cannot be called from a running
+        event loop``). See :func:`bie._asyncutil.run_sync`.
         """
         return run_sync(self.acrawl(urls, allowed_domains, instruction))
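
The tie-break failure that `_patch_request_ordering` works around can be reproduced without Bitscrape. In the sketch below, `Request` is a plain stand-in class (not the pydantic model from `bitscrape`), and `heapq` is used directly because it is the heap that `asyncio.PriorityQueue` is built on.

```python
import heapq


class Request:
    """Stand-in for an unorderable request object (hypothetical, not bitscrape's)."""

    def __init__(self, url):
        self.url = url


# Two entries with the same priority: tuple comparison falls through to the
# Request objects, which define no ordering, so the second push raises.
heap = []
heapq.heappush(heap, (1, Request("https://example.com/a")))
try:
    heapq.heappush(heap, (1, Request("https://example.com/b")))
    tie_break_failed = False
except TypeError:
    tie_break_failed = True

# Identity-based ordering, in the spirit of the patch in this diff.
if "__lt__" not in Request.__dict__:
    Request.__lt__ = lambda self, other: id(self) < id(other)

# With the patch in place, equal priorities no longer crash, and priority
# still decides the pop order.
patched_heap = []
for priority, url in [(1, "a"), (1, "b"), (0, "c")]:
    heapq.heappush(patched_heap, (priority, Request(url)))
popped_priorities = [heapq.heappop(patched_heap)[0] for _ in range(3)]
```

An alternative that avoids monkey-patching is to queue `(priority, counter, request)` tuples with a monotonically increasing counter, but that requires changing the scheduler itself, which here lives in Bitscrape rather than in BIE.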
