PyPI - bits-bie - Versions diffs - 1.2.2__tar.gz → 1.2.5__tar.gz - Mend

bits-bie 1.2.2tar.gz → 1.2.5tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (58)

{bits_bie-1.2.2 → bits_bie-1.2.5}/.github/workflows/ci.yml RENAMED

@@ -24,4 +24,4 @@ jobs:
       - name: Run tests
         run: pytest -v
       - name: Lint
-        run: ruff check bie tests
+        run: ruff check bie tests

{bits_bie-1.2.2 → bits_bie-1.2.5}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: bits-bie
-Version: 1.2.2
+Version: 1.2.5
 Summary: BitSearch Intelligence Engine — real-time, citation-backed web search & extraction for AI apps. Built on Bitscrape.
 Project-URL: Homepage, https://github.com/Sudharsansm/BIE
 Project-URL: Repository, https://github.com/Sudharsansm/BIE
@@ -66,7 +66,7 @@ API keys, no subscriptions, no third-party search services.**
 BIE gives any LLM, RAG pipeline, or AI agent five core primitives —
 **search, extract, map, crawl, and a hybrid index** — all running locally
-on top of [**Bitscrape**](https://pypi.org/project/bitscrape/), our
+on top of [**BitS**](https://pypi.org/project/bitscrape/), our
 async crawling framework. Use it as a Python library, REST API, CLI, or
 [MCP](https://modelcontextprotocol.io) server.
@@ -112,13 +112,46 @@ state-of-the-art ranking, a commercial search API may still be the right
 choice for that piece. BIE is for teams that want a capable, free,
 self-hosted starting point — and full control over the code.
+### How this compares to ChatGPT Search / Tavily
+`bie.websearch_response()` is shaped like those tools' "web search" tool
+responses on purpose: ranked, cited `results` with snippets, an
+`answer` field, and `.to_context()` for dropping straight into a prompt.
+Two things are genuinely different, and worth being precise about:
+- **`answer` is extractive, not generated.** ChatGPT Search and Tavily's
+  `include_answer` run an LLM server-side to *write* a summary answer.
+  BIE doesn't run an LLM — `answer` is the single best-matching passage
+  found (verbatim from a live page). It's a strong starting point for
+  *your* LLM/agent to read and synthesize from, not a finished answer on
+  its own.
+- **Discovery is "best-effort free", not a dedicated index.** ChatGPT
+  Search/Tavily run their own crawl infrastructure and indexes. BIE's
+  default discovery scrapes DuckDuckGo/Bing's public result pages, which
+  can be rate-limited or served a CAPTCHA — `degraded`/`diagnostics` tell
+  you when this happens for a given query, so your agent can react (retry,
+  fall back to general knowledge, etc.) instead of silently getting a
+  bad answer.
+**SearXNG closes most of that second gap.** Self-hosting
+[SearXNG](https://github.com/searxng/searxng) and adding it as a
+discovery backend (`BIE_DISCOVERY_BACKENDS=searxng,...` +
+`BIE_SEARXNG_URL=...`) gives BIE a stable JSON API that itself aggregates
+Google/Bing/Brave/etc. server-side — far less prone to the "200 OK but
+0 results" failure mode of scraping DDG/Bing HTML directly. It's the
+single highest-leverage change for making `websearch()`'s *discovery*
+step behave consistently. It doesn't change the `answer` field's
+extractive (vs. LLM-generated) nature — that's a property of BIE not
+running an LLM, independent of which discovery backend is used.
 ---
 ## Core primitives
 | Function | What it does |
 |---|---|
-| `bie.websearch(query)` | Search the live internet — no URLs needed. Free discovery (DuckDuckGo + Bing fallback) with query fan-out, crawled and ranked by BIE's hybrid index. |
+| `bie.websearch(query)` | Search the live internet — no URLs needed. Free discovery (DuckDuckGo + Bing fallback, optional SearXNG) with query fan-out, crawled and ranked by BIE's hybrid index. |
+| `bie.websearch_response(query)` | Like `websearch`, but returns the full Tavily/ChatGPT-Search-shaped response: ranked `results`, an extractive `answer`, `degraded`/`diagnostics`, and `.to_context()` for an LLM-prompt-ready citation block. |
 | `bie.extract(url)` | Fetch a URL and return clean Markdown, with nav/ads/scripts stripped. Optional JS rendering via Playwright. |
 | `bie.map_site(url)` | Discover a site's sitemap(s) and the URLs they list, before crawling. |
 | `bie.crawl_site(urls, instruction=...)` | Crawl a site, prioritizing links by keyword-relevance to your instruction. Returns an index + ranked results. |
@@ -146,17 +179,10 @@ pip install "bits-bie[server]"      # FastAPI + Uvicorn REST server
 pip install "bits-bie[mcp]"         # Model Context Protocol server
 pip install "bits-bie[render]"      # JS rendering for extract() via Playwright
 pip install "bits-bie[langchain]"   # LangChain tool adapters
-pip install "bits-bie[notebook]"    # smoother Jupyter/Colab support (nest_asyncio)
+pip install "bits-bie[notebook]"    # smoother async behaviour in Jupyter/Colab
 pip install "bits-bie[all]"         # everything
 ```
-> **Using BIE in Jupyter / Google Colab?** All sync entry points
-> (`engine.crawl(...)`, `bie.websearch(...)`, `bie.extract(..., render_js=True)`)
-> work inside notebooks out of the box — BIE detects the notebook's
-> already-running event loop and handles it automatically. Installing
-> `bits-bie[notebook]` (adds `nest_asyncio`) makes this slightly more
-> efficient, but is not required.
 > BIE depends on [`bitscrape`](https://pypi.org/project/bitscrape/), our
 > proprietary async crawling & extraction framework, which is installed
 > automatically.
@@ -176,6 +202,20 @@ for r in results:
     print(r.snippet)
 ```
+For the full, Tavily/ChatGPT-Search-shaped response — extractive
+`answer`, timing, and `degraded`/`diagnostics` for when live discovery
+doesn't fully succeed:
+```python
+response = bie.websearch_response("who won the latest F1 race")
+print(response.answer)         # best-matching passage (not LLM-written)
+print(response.to_context())    # numbered sources block, ready for a prompt
+if response.degraded:
+    print("live data degraded:", response.diagnostics)
+```
 `websearch` pipeline:
 1. **Discovery** — free, public, no-key search endpoints (DuckDuckGo,
@@ -407,44 +447,62 @@ engine = BIE(BIESettings(
 | `use_embeddings` | `BIE_USE_EMBEDDINGS` | `true` | Enable semantic search |
 | `chunk_size` | `BIE_CHUNK_SIZE` | `800` | Chars per chunk |
 | `bm25_weight` / `vector_weight` | `BIE_BM25_WEIGHT` / `BIE_VECTOR_WEIGHT` | `0.5` / `0.5` | Fusion weights |
+| `discovery_backends` | `BIE_DISCOVERY_BACKENDS` | `ddg_html,ddg_lite,bing_html` | Ordered, comma-separated discovery backends for `websearch()`. Add `searxng` for a self-hosted instance. |
+| `searxng_url` | `BIE_SEARXNG_URL` | `None` | Base URL of a self-hosted SearXNG instance, used by the `searxng` discovery backend |
 | `api_key` | `BIE_API_KEY` | `None` | If set, requires `Authorization: Bearer <key>` |
-| — | `BIE_DISCOVERY_BACKENDS` | `ddg_html,ddg_lite,bing_html` | Comma-separated list and order of `websearch()` discovery backends. Known names: `ddg_html`, `ddg_lite`, `bing_html`, `searxng`. |
-| — | `BIE_SEARXNG_URL` | `None` | Base URL of a self-hosted [SearXNG](https://docs.searxng.org/) instance, used when `searxng` is included in `BIE_DISCOVERY_BACKENDS`. |
-### Discovery backends & troubleshooting empty `websearch()` results
+---
-`websearch()` discovers candidate URLs by scraping public search-engine
-result pages (DuckDuckGo HTML, DuckDuckGo Lite, Bing HTML, in that order
-by default). This is inherently fragile — these are not official APIs,
-and shared/cloud IPs (CI runners, some notebook hosts, restrictive
-sandboxes) can be rate-limited or blocked entirely.
+## Troubleshooting
+**`TypeError: '<' not supported between instances of 'Request' and 'Request'`**
+during a crawl — this was a Bitscrape scheduler bug (its priority queue
+compared `Request` objects directly when two requests shared the same
+priority). BIE patches `bitscrape.Request` to be orderable at import
+time, so this no longer occurs. If you still see it, you're likely on an
+older `bits-bie` version — upgrade.
+**`RuntimeError: asyncio.run() cannot be called from a running event
+loop`** — Jupyter/Colab/IPython already run an event loop, which used to
+break `engine.crawl(urls)` / `bie.websearch(...)`. Both now detect a
+running loop automatically and either use
+[`nest_asyncio`](https://pypi.org/project/nest_asyncio/) (install via
+`pip install "bits-bie[notebook]"`) or fall back to running the crawl on
+a background thread — no code changes needed. If you're already inside
+an `async def`, you can also call `await engine.acrawl(urls)` directly.
+**`bie.websearch(...)` returns `[]` / all discovery backends fail** —
+discovery scrapes DuckDuckGo/Bing's public HTML result pages, which can
+be blocked or rate-limited. Call
+`bie.discovery.get_last_discovery_diagnostics()` right after to see why:
-If `websearch()` returns `[]`, BIE logs a `WARNING` that distinguishes
-two failure categories:
+```python
+import bie
+from bie.discovery import get_last_discovery_diagnostics
-- **"network blocked"** — every backend failed at the connection level
-  (timeouts, connection refused, or a sandbox/proxy denial). This means
-  the environment itself can't reach these hosts — re-run in an
-  environment with normal internet access (a local machine, server, or
-  Colab) rather than a locked-down sandbox.
-- **"reachable but no results"** — connections succeeded but responses
-  were empty, a CAPTCHA/consent page, or rate-limited (HTTP 403/429).
-  This means the IP is likely being rate-limited; try again later, reduce
-  request frequency, or switch to a self-hosted backend (below).
+results = bie.websearch("...")
+if not results:
+    print(get_last_discovery_diagnostics().summary())
+```
-For a durable fix to rate-limiting, run a self-hosted
-[SearXNG](https://docs.searxng.org/) instance and point BIE at it:
+This distinguishes three cases:
-```bash
-export BIE_DISCOVERY_BACKENDS=searxng
-export BIE_SEARXNG_URL=http://localhost:8080
-```
+- **Network blocked** — every backend failed at the connection level
+  (or an egress proxy returned `x-deny-reason: host_not_allowed`). This
+  environment can't reach these hosts at all — check its outbound
+  network/proxy/firewall config. Common in sandboxed code-execution
+  environments; Colab and most servers have unrestricted outbound access.
+- **Blocked / rate-limited** — backends responded with `403`/`429`/etc.,
+  typically from bot-detection on a shared IP. Retry later, reduce
+  request volume, or configure a `searxng` backend (below).
+- **Empty response** — got `200 OK` but no parseable results (often a
+  CAPTCHA/consent page).
-You can also combine backends and reorder them, e.g. to prefer your
-SearXNG instance but fall back to DuckDuckGo:
+For the most reliable no-API-key discovery, self-host
+[SearXNG](https://github.com/searxng/searxng) and add it as a backend:
 ```bash
-export BIE_DISCOVERY_BACKENDS=searxng,ddg_html,ddg_lite
+export BIE_DISCOVERY_BACKENDS=searxng,ddg_html,ddg_lite,bing_html
 export BIE_SEARXNG_URL=http://localhost:8080
 ```
@@ -483,10 +541,10 @@ for Elasticsearch/Milvus-backed implementations behind the same
 ---
-## Built on Bitscrape
+## Built on BitS
 BIE's crawling and extraction layer is powered by
-[**Bitscrape**](https://github.com/Sudharsansm/Bitscrape)
+[**BitS**](https://github.com/Sudharsansm/Bitscrape)
 (`pip install bitscrape`), our async, robots.txt-aware web scraping
 framework — giving BIE high-performance, polite crawling out of the box.
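
The three failure cases listed in the new Troubleshooting section can be sketched as a small classifier. This is an illustration only: the `Attempt` record, the `classify` function, and the returned labels are hypothetical names, not part of the `bie.discovery` API, and the precedence between cases when backends fail in different ways is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Attempt:
    """Hypothetical record of one discovery backend attempt."""
    backend: str
    status: Optional[int]  # HTTP status, or None for a connection-level failure
    results: int = 0       # number of parsed result URLs


def classify(attempts: List[Attempt]) -> str:
    """Map a list of backend attempts onto the three documented cases."""
    if any(a.results > 0 for a in attempts):
        return "ok"
    # Every backend failed before an HTTP response arrived.
    if all(a.status is None for a in attempts):
        return "network_blocked"
    # At least one backend answered with an error status such as 403 or 429.
    if any(a.status is not None and a.status >= 400 for a in attempts):
        return "blocked_or_rate_limited"
    # Responses arrived with a success status but nothing could be parsed.
    return "empty_response"


no_network = classify([Attempt("ddg_html", None), Attempt("bing_html", None)])
rate_limited = classify([Attempt("ddg_html", 429), Attempt("bing_html", None)])
captcha_page = classify([Attempt("ddg_html", 200), Attempt("bing_html", 200)])
```

An agent could branch on such a label to decide between retrying later, switching to a `searxng` backend, or reporting that the environment has no outbound access.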

{bits_bie-1.2.2 → bits_bie-1.2.5}/README.md RENAMED

@@ -10,7 +10,7 @@ API keys, no subscriptions, no third-party search services.**
 BIE gives any LLM, RAG pipeline, or AI agent five core primitives —
 **search, extract, map, crawl, and a hybrid index** — all running locally
-on top of [**Bitscrape**](https://pypi.org/project/bitscrape/), our
+on top of [**BitS**](https://pypi.org/project/bitscrape/), our
 async crawling framework. Use it as a Python library, REST API, CLI, or
 [MCP](https://modelcontextprotocol.io) server.
@@ -56,13 +56,46 @@ state-of-the-art ranking, a commercial search API may still be the right
 choice for that piece. BIE is for teams that want a capable, free,
 self-hosted starting point — and full control over the code.
+### How this compares to ChatGPT Search / Tavily
+`bie.websearch_response()` is shaped like those tools' "web search" tool
+responses on purpose: ranked, cited `results` with snippets, an
+`answer` field, and `.to_context()` for dropping straight into a prompt.
+Two things are genuinely different, and worth being precise about:
+- **`answer` is extractive, not generated.** ChatGPT Search and Tavily's
+  `include_answer` run an LLM server-side to *write* a summary answer.
+  BIE doesn't run an LLM — `answer` is the single best-matching passage
+  found (verbatim from a live page). It's a strong starting point for
+  *your* LLM/agent to read and synthesize from, not a finished answer on
+  its own.
+- **Discovery is "best-effort free", not a dedicated index.** ChatGPT
+  Search/Tavily run their own crawl infrastructure and indexes. BIE's
+  default discovery scrapes DuckDuckGo/Bing's public result pages, which
+  can be rate-limited or served a CAPTCHA — `degraded`/`diagnostics` tell
+  you when this happens for a given query, so your agent can react (retry,
+  fall back to general knowledge, etc.) instead of silently getting a
+  bad answer.
+**SearXNG closes most of that second gap.** Self-hosting
+[SearXNG](https://github.com/searxng/searxng) and adding it as a
+discovery backend (`BIE_DISCOVERY_BACKENDS=searxng,...` +
+`BIE_SEARXNG_URL=...`) gives BIE a stable JSON API that itself aggregates
+Google/Bing/Brave/etc. server-side — far less prone to the "200 OK but
+0 results" failure mode of scraping DDG/Bing HTML directly. It's the
+single highest-leverage change for making `websearch()`'s *discovery*
+step behave consistently. It doesn't change the `answer` field's
+extractive (vs. LLM-generated) nature — that's a property of BIE not
+running an LLM, independent of which discovery backend is used.
 ---
 ## Core primitives
 | Function | What it does |
 |---|---|
-| `bie.websearch(query)` | Search the live internet — no URLs needed. Free discovery (DuckDuckGo + Bing fallback) with query fan-out, crawled and ranked by BIE's hybrid index. |
+| `bie.websearch(query)` | Search the live internet — no URLs needed. Free discovery (DuckDuckGo + Bing fallback, optional SearXNG) with query fan-out, crawled and ranked by BIE's hybrid index. |
+| `bie.websearch_response(query)` | Like `websearch`, but returns the full Tavily/ChatGPT-Search-shaped response: ranked `results`, an extractive `answer`, `degraded`/`diagnostics`, and `.to_context()` for an LLM-prompt-ready citation block. |
 | `bie.extract(url)` | Fetch a URL and return clean Markdown, with nav/ads/scripts stripped. Optional JS rendering via Playwright. |
 | `bie.map_site(url)` | Discover a site's sitemap(s) and the URLs they list, before crawling. |
 | `bie.crawl_site(urls, instruction=...)` | Crawl a site, prioritizing links by keyword-relevance to your instruction. Returns an index + ranked results. |
@@ -90,17 +123,10 @@ pip install "bits-bie[server]"      # FastAPI + Uvicorn REST server
 pip install "bits-bie[mcp]"         # Model Context Protocol server
 pip install "bits-bie[render]"      # JS rendering for extract() via Playwright
 pip install "bits-bie[langchain]"   # LangChain tool adapters
-pip install "bits-bie[notebook]"    # smoother Jupyter/Colab support (nest_asyncio)
+pip install "bits-bie[notebook]"    # smoother async behaviour in Jupyter/Colab
 pip install "bits-bie[all]"         # everything
 ```
-> **Using BIE in Jupyter / Google Colab?** All sync entry points
-> (`engine.crawl(...)`, `bie.websearch(...)`, `bie.extract(..., render_js=True)`)
-> work inside notebooks out of the box — BIE detects the notebook's
-> already-running event loop and handles it automatically. Installing
-> `bits-bie[notebook]` (adds `nest_asyncio`) makes this slightly more
-> efficient, but is not required.
 > BIE depends on [`bitscrape`](https://pypi.org/project/bitscrape/), our
 > proprietary async crawling & extraction framework, which is installed
 > automatically.
@@ -120,6 +146,20 @@ for r in results:
     print(r.snippet)
 ```
+For the full, Tavily/ChatGPT-Search-shaped response — extractive
+`answer`, timing, and `degraded`/`diagnostics` for when live discovery
+doesn't fully succeed:
+```python
+response = bie.websearch_response("who won the latest F1 race")
+print(response.answer)         # best-matching passage (not LLM-written)
+print(response.to_context())    # numbered sources block, ready for a prompt
+if response.degraded:
+    print("live data degraded:", response.diagnostics)
+```
 `websearch` pipeline:
 1. **Discovery** — free, public, no-key search endpoints (DuckDuckGo,
@@ -351,44 +391,62 @@ engine = BIE(BIESettings(
 | `use_embeddings` | `BIE_USE_EMBEDDINGS` | `true` | Enable semantic search |
 | `chunk_size` | `BIE_CHUNK_SIZE` | `800` | Chars per chunk |
 | `bm25_weight` / `vector_weight` | `BIE_BM25_WEIGHT` / `BIE_VECTOR_WEIGHT` | `0.5` / `0.5` | Fusion weights |
+| `discovery_backends` | `BIE_DISCOVERY_BACKENDS` | `ddg_html,ddg_lite,bing_html` | Ordered, comma-separated discovery backends for `websearch()`. Add `searxng` for a self-hosted instance. |
+| `searxng_url` | `BIE_SEARXNG_URL` | `None` | Base URL of a self-hosted SearXNG instance, used by the `searxng` discovery backend |
 | `api_key` | `BIE_API_KEY` | `None` | If set, requires `Authorization: Bearer <key>` |
-| — | `BIE_DISCOVERY_BACKENDS` | `ddg_html,ddg_lite,bing_html` | Comma-separated list and order of `websearch()` discovery backends. Known names: `ddg_html`, `ddg_lite`, `bing_html`, `searxng`. |
-| — | `BIE_SEARXNG_URL` | `None` | Base URL of a self-hosted [SearXNG](https://docs.searxng.org/) instance, used when `searxng` is included in `BIE_DISCOVERY_BACKENDS`. |
-### Discovery backends & troubleshooting empty `websearch()` results
+---
-`websearch()` discovers candidate URLs by scraping public search-engine
-result pages (DuckDuckGo HTML, DuckDuckGo Lite, Bing HTML, in that order
-by default). This is inherently fragile — these are not official APIs,
-and shared/cloud IPs (CI runners, some notebook hosts, restrictive
-sandboxes) can be rate-limited or blocked entirely.
+## Troubleshooting
+**`TypeError: '<' not supported between instances of 'Request' and 'Request'`**
+during a crawl — this was a Bitscrape scheduler bug (its priority queue
+compared `Request` objects directly when two requests shared the same
+priority). BIE patches `bitscrape.Request` to be orderable at import
+time, so this no longer occurs. If you still see it, you're likely on an
+older `bits-bie` version — upgrade.
+**`RuntimeError: asyncio.run() cannot be called from a running event
+loop`** — Jupyter/Colab/IPython already run an event loop, which used to
+break `engine.crawl(urls)` / `bie.websearch(...)`. Both now detect a
+running loop automatically and either use
+[`nest_asyncio`](https://pypi.org/project/nest_asyncio/) (install via
+`pip install "bits-bie[notebook]"`) or fall back to running the crawl on
+a background thread — no code changes needed. If you're already inside
+an `async def`, you can also call `await engine.acrawl(urls)` directly.
+**`bie.websearch(...)` returns `[]` / all discovery backends fail** —
+discovery scrapes DuckDuckGo/Bing's public HTML result pages, which can
+be blocked or rate-limited. Call
+`bie.discovery.get_last_discovery_diagnostics()` right after to see why:
-If `websearch()` returns `[]`, BIE logs a `WARNING` that distinguishes
-two failure categories:
+```python
+import bie
+from bie.discovery import get_last_discovery_diagnostics
-- **"network blocked"** — every backend failed at the connection level
-  (timeouts, connection refused, or a sandbox/proxy denial). This means
-  the environment itself can't reach these hosts — re-run in an
-  environment with normal internet access (a local machine, server, or
-  Colab) rather than a locked-down sandbox.
-- **"reachable but no results"** — connections succeeded but responses
-  were empty, a CAPTCHA/consent page, or rate-limited (HTTP 403/429).
-  This means the IP is likely being rate-limited; try again later, reduce
-  request frequency, or switch to a self-hosted backend (below).
+results = bie.websearch("...")
+if not results:
+    print(get_last_discovery_diagnostics().summary())
+```
-For a durable fix to rate-limiting, run a self-hosted
-[SearXNG](https://docs.searxng.org/) instance and point BIE at it:
+This distinguishes three cases:
-```bash
-export BIE_DISCOVERY_BACKENDS=searxng
-export BIE_SEARXNG_URL=http://localhost:8080
-```
+- **Network blocked** — every backend failed at the connection level
+  (or an egress proxy returned `x-deny-reason: host_not_allowed`). This
+  environment can't reach these hosts at all — check its outbound
+  network/proxy/firewall config. Common in sandboxed code-execution
+  environments; Colab and most servers have unrestricted outbound access.
+- **Blocked / rate-limited** — backends responded with `403`/`429`/etc.,
+  typically from bot-detection on a shared IP. Retry later, reduce
+  request volume, or configure a `searxng` backend (below).
+- **Empty response** — got `200 OK` but no parseable results (often a
+  CAPTCHA/consent page).
-You can also combine backends and reorder them, e.g. to prefer your
-SearXNG instance but fall back to DuckDuckGo:
+For the most reliable no-API-key discovery, self-host
+[SearXNG](https://github.com/searxng/searxng) and add it as a backend:
 ```bash
-export BIE_DISCOVERY_BACKENDS=searxng,ddg_html,ddg_lite
+export BIE_DISCOVERY_BACKENDS=searxng,ddg_html,ddg_lite,bing_html
 export BIE_SEARXNG_URL=http://localhost:8080
 ```
@@ -427,10 +485,10 @@ for Elasticsearch/Milvus-backed implementations behind the same
 ---
-## Built on Bitscrape
+## Built on BitS
 BIE's crawling and extraction layer is powered by
-[**Bitscrape**](https://github.com/Sudharsansm/Bitscrape)
+[**BitS**](https://github.com/Sudharsansm/Bitscrape)
 (`pip install bitscrape`), our async, robots.txt-aware web scraping
 framework — giving BIE high-performance, polite crawling out of the box.
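
The `TypeError` described in the Troubleshooting section is the standard failure of a heap-based priority queue when two entries tie on priority and the payloads cannot be compared. The sketch below reproduces it with a hypothetical stand-in `Request` class (not Bitscrape's real class) and shows one way to make such a class orderable. Ordering by URL is an assumption chosen for the demo; the diff does not show which ordering BIE's patch actually uses.

```python
import heapq


class Request:
    """Hypothetical stand-in for a crawl request; not bitscrape.Request."""

    def __init__(self, url: str) -> None:
        self.url = url


# Two entries with the same priority force Python to compare the payloads.
heap = []
heapq.heappush(heap, (1, Request("https://example.com/a")))
try:
    heapq.heappush(heap, (1, Request("https://example.com/b")))
    tie_error = None
except TypeError as exc:
    tie_error = str(exc)

# Making the class orderable removes the failure. Any consistent ordering
# works as a tie-breaker; comparing URLs is just an assumption for this demo.
Request.__lt__ = lambda self, other: self.url < other.url

patched_heap = []
for priority, url in [(1, "b"), (1, "a"), (0, "c")]:
    heapq.heappush(patched_heap, (priority, Request(url)))

pop_order = [heapq.heappop(patched_heap)[1].url for _ in range(3)]
```

A common alternative is to push `(priority, counter, request)` tuples with a monotonically increasing counter, which avoids comparing the payload at all.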

{bits_bie-1.2.2 → bits_bie-1.2.5}/bie/__init__.py RENAMED

@@ -10,6 +10,10 @@ Core primitives
 ----------------
 - :func:`websearch` — search the live internet for a query (no URLs needed)
+- :func:`websearch_response` — like ``websearch``, but returns a full
+  :class:`SearchResponse` (extractive ``answer``, ``took_ms``,
+  ``degraded``/``diagnostics``, and ``.to_context()`` for LLM prompts) —
+  the Tavily/ChatGPT-Search-style "web search tool" shape
 - :func:`search` — crawl + rank specific URLs against a query
 - :func:`extract` — get clean Markdown from a single URL
 - :func:`map_site` — discover a site's sitemap before crawling
@@ -29,6 +33,11 @@ Quick start
         print(r.title, r.url)
         print(r.snippet)
+    # Or get the full response — extractive answer + LLM-ready context
+    response = bie.websearch_response("who won the latest F1 race")
+    print(response.answer)        # best-matching passage (not LLM-written)
+    print(response.to_context())  # numbered sources block, ready for a prompt
     # Get clean markdown from a specific page
     page = bie.extract("https://example.com/article")
     print(page.markdown)
@@ -59,8 +68,8 @@ from importlib import metadata as _metadata
 from bie.config import BIESettings
 from bie.engine import BIE
 from bie.extract import ExtractError, ExtractResult, extract
-from bie.models import Document, SearchResult
-from bie.quicksearch import search, websearch
+from bie.models import Document, SearchResponse, SearchResult
+from bie.quicksearch import search, websearch, websearch_response
 from bie.security import SecurityFinding, SecurityReport, scan_for_prompt_injection
 from bie.sitecrawl import crawl_site
 from bie.sitemap import SiteMap, map_site
@@ -70,15 +79,17 @@ try:
     __version__ = _metadata.version("bits-bie")
 except _metadata.PackageNotFoundError:
     # Editable/source checkout without installed metadata.
-    __version__ = "1.2.2"
+    __version__ = "1.2.5"
 __all__ = [
     "BIE",
     "BIESettings",
     "Document",
     "SearchResult",
+    "SearchResponse",
     "search",
     "websearch",
+    "websearch_response",
     "extract",
     "ExtractResult",
     "ExtractError",

bits_bie-1.2.5/bie/_asyncutil.py ADDED

@@ -0,0 +1,84 @@
+"""
+Internal helper for calling async BIE internals from synchronous code.
+Plain scripts have no running event loop, so ``asyncio.run()`` works fine.
+Jupyter/Colab/IPython kernels, however, *already* run an event loop, and
+``asyncio.run()`` raises::
+    RuntimeError: asyncio.run() cannot be called from a running event loop
+:func:`run_sync` detects this and transparently falls back to:
+1. ``nest_asyncio`` (if installed) — patches the running loop so it can be
+   re-entered, then runs the coroutine on it directly.
+2. A dedicated background thread with its own fresh event loop — works
+   everywhere, with zero extra dependencies, at the cost of a thread
+   spin-up per call.
+This means the same sync call (e.g. ``engine.crawl(urls)``) works
+unchanged in plain scripts, notebooks, and servers.
+"""
+from __future__ import annotations
+import asyncio
+import threading
+from typing import Any, Coroutine, TypeVar
+T = TypeVar("T")
+def run_sync(coro: Coroutine[Any, Any, T]) -> T:
+    """Run ``coro`` to completion and return its result, regardless of
+    whether a thread already has an asyncio event loop running.
+    Args:
+        coro: An awaitable coroutine object (not yet awaited/started).
+    Returns:
+        The coroutine's return value.
+    Raises:
+        Whatever exception the coroutine itself raises.
+    """
+    try:
+        asyncio.get_running_loop()
+    except RuntimeError:
+        # No loop running in this thread — the normal case for scripts,
+        # CLI commands, and server request handlers.
+        return asyncio.run(coro)
+    # A loop is already running in this thread (e.g. Jupyter/Colab/IPython,
+    # or an async framework that called into sync BIE code).
+    try:
+        import nest_asyncio  # type: ignore[import-not-found]
+    except ImportError:
+        return _run_in_new_thread(coro)
+    nest_asyncio.apply()
+    loop = asyncio.get_event_loop()
+    return loop.run_until_complete(coro)
+def _run_in_new_thread(coro: Coroutine[Any, Any, T]) -> T:
+    """Run ``coro`` to completion on a fresh event loop in a new thread.
+    Used as the dependency-free fallback when a loop is already running in
+    the calling thread and ``nest_asyncio`` isn't installed.
+    """
+    result: dict[str, Any] = {}
+    error: dict[str, BaseException] = {}
+    def _runner() -> None:
+        try:
+            result["value"] = asyncio.run(coro)
+        except BaseException as exc:  # noqa: BLE001 - re-raised on the caller's thread
+            error["value"] = exc
+    thread = threading.Thread(target=_runner, name="bie-async-runner", daemon=True)
+    thread.start()
+    thread.join()
+    if "value" in error:
+        raise error["value"]
+    return result["value"]  # type: ignore[return-value]
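
To see the fallback in action, the following condensed sketch keeps only the dependency-free path of the module above (it omits the `nest_asyncio` branch) and calls it once from plain synchronous code and once from inside a running event loop. The helper coroutines `add` and `caller` are hypothetical and exist only for the demonstration.

```python
import asyncio
import threading


def run_in_new_thread(coro):
    """Run coro on a fresh event loop in a separate thread and return its result."""
    box = {}

    def runner():
        try:
            box["value"] = asyncio.run(coro)
        except BaseException as exc:  # re-raised on the caller's thread
            box["error"] = exc

    thread = threading.Thread(target=runner, daemon=True)
    thread.start()
    thread.join()
    if "error" in box:
        raise box["error"]
    return box["value"]


def run_sync(coro):
    """Condensed variant of run_sync: thread fallback only, no nest_asyncio."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return asyncio.run(coro)  # no loop in this thread: the plain-script case
    return run_in_new_thread(coro)  # a loop is running: the notebook-like case


async def add(a, b):
    await asyncio.sleep(0)
    return a + b


async def caller():
    # Simulates a notebook cell: sync code invoked while a loop is running.
    return run_sync(add(2, 3))


plain = run_sync(add(1, 2))
nested = asyncio.run(caller())
```

Note that the thread fallback blocks the calling thread until the coroutine finishes, so inside an `async def` the outer loop is stalled for that duration; awaiting the async API directly avoids this.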

{bits_bie-1.2.2 → bits_bie-1.2.5}/bie/cli.py RENAMED

@@ -70,6 +70,12 @@ def search(query: str, urls: tuple[str, ...], top_k: int, max_pages: int, no_emb
 @click.option("--no-deep", is_flag=True, help="Skip crawling; return raw discovery order without snippets")
 @click.option("--no-embeddings", is_flag=True, help="Disable semantic/vector re-ranking (BM25 only)")
 @click.option("--json", "as_json", is_flag=True, help="Output raw JSON")
+@click.option(
+    "--context",
+    "as_context",
+    is_flag=True,
+    help="Output a numbered, citation-ready text block for an LLM prompt (response.to_context())",
+)
 def search_live(
     query: str,
     top_k: int,
@@ -77,16 +83,18 @@ def search_live(
     no_deep: bool,
     no_embeddings: bool,
     as_json: bool,
+    as_context: bool,
 ) -> None:
     """Search the live internet for QUERY — no seed URLs, no API key, no subscription.
     Discovers relevant URLs via free public search endpoints (DuckDuckGo,
-    with a Bing fallback), crawls them with Bitscrape, and ranks the
+    Bing, and optionally a self-hosted SearXNG instance — see
+    BIE_DISCOVERY_BACKENDS), crawls them with Bitscrape, and ranks the
     extracted content against QUERY using BIE's hybrid BM25+vector index.
     """
     import bie
-    results = bie.websearch(
+    response = bie.websearch_response(
         query,
         top_k=top_k,
         discovery_results=discovery_results,
@@ -95,23 +103,40 @@ def search_live(
     )
     if as_json:
-        click.echo(json.dumps([r.model_dump() for r in results], indent=2))
+        click.echo(response.model_dump_json(indent=2))
+        return
+    if as_context:
+        click.echo(response.to_context())
         return
-    if not results:
+    if not response.results:
         click.echo(
             "No results found. The free search backends may be temporarily "
-            "rate-limiting — try again in a moment."
+            "rate-limiting — try again in a moment.\n"
         )
+        if response.diagnostics:
+            click.echo(f"Diagnosis: {response.diagnostics}")
         return
-    for i, r in enumerate(results, 1):
+    if response.answer:
+        click.echo(f"Answer: {response.answer}\n")
+    if response.degraded:
+        click.echo("⚠ Live discovery/crawling was degraded for this query.")
+        if response.diagnostics:
+            click.echo(f"  {response.diagnostics}")
+        click.echo()
+    for i, r in enumerate(response.results, 1):
         click.echo(f"\n{i}. {r.title}")
         click.echo(f"   {r.url}")
         click.echo(f"   score={r.score:.4f}")
         if r.snippet:
             click.echo(f"   {r.snippet}")
+    click.echo(f"\n({response.took_ms:.0f}ms)")
 @cli.command()
 @click.argument("urls", nargs=-1, required=True)
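
The output precedence introduced in `search_live` (JSON first, then `--context`, then the human-readable listing) can be sketched as a pure function without `click`. `FakeResponse`, its fields, its `to_context()` format, and `render` are hypothetical stand-ins for illustration; the real `SearchResponse` model is not shown in this diff.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class FakeResponse:
    """Hypothetical stand-in for the response object used by the CLI."""
    results: List[str] = field(default_factory=list)
    answer: Optional[str] = None
    degraded: bool = False
    diagnostics: Optional[str] = None

    def to_context(self) -> str:
        # Assumed format: one numbered line per source.
        return "\n".join(f"[{i}] {r}" for i, r in enumerate(self.results, 1))


def render(resp: FakeResponse, as_json: bool = False, as_context: bool = False) -> str:
    """Mirror the branch order of the CLI command: JSON, context, then text."""
    if as_json:
        return json.dumps(asdict(resp), indent=2)
    if as_context:
        return resp.to_context()
    if not resp.results:
        text = "No results found."
        if resp.diagnostics:
            text += f"\nDiagnosis: {resp.diagnostics}"
        return text
    lines = []
    if resp.answer:
        lines.append(f"Answer: {resp.answer}")
    if resp.degraded:
        lines.append("Live discovery/crawling was degraded for this query.")
    lines.extend(f"{i}. {r}" for i, r in enumerate(resp.results, 1))
    return "\n".join(lines)


empty = render(FakeResponse(diagnostics="all backends returned 429"))
both_flags = render(FakeResponse(results=["https://example.com"]), as_json=True, as_context=True)
listing = render(FakeResponse(results=["https://example.com"], answer="42"))
```

When both flags are passed, the JSON branch wins because it is checked first, which matches the early `return` in the diff.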

{bits_bie-1.2.2 → bits_bie-1.2.5}/bie/config.py RENAMED

@@ -39,6 +39,25 @@ class BIESettings(BaseSettings):
     index_dir: str = Field(".bie_index", description="Directory for persisted index")
     persist: bool = Field(False, description="Persist index to disk between runs")
+    # --- Discovery (no-API-key web search) ----------------------------------
+    discovery_backends: str = Field(
+        "ddg_html,ddg_lite,bing_html",
+        description="Comma-separated, ordered list of discovery backends to "
+        "try for bie.websearch()/discover_urls(). Built-in backends: "
+        "'ddg_html', 'ddg_lite', 'bing_html', 'searxng'. The 'searxng' "
+        "backend requires `searxng_url` to also be set. Unknown names are "
+        "skipped with a warning. Override with the BIE_DISCOVERY_BACKENDS "
+        "env var, e.g. BIE_DISCOVERY_BACKENDS=searxng,ddg_html,ddg_lite,bing_html",
+    )
+    searxng_url: str | None = Field(
+        default=None,
+        description="Base URL of a self-hosted SearXNG instance (e.g. "
+        "'http://localhost:8080'), used by the 'searxng' discovery backend. "
+        "Self-hosting SearXNG is the most reliable no-API-key discovery "
+        "option since it isn't subject to the rate limits / layout changes "
+        "that affect scraping DDG/Bing HTML directly.",
+    )
     # --- Server --------------------------------------------------------------
     host: str = "0.0.0.0"
     port: int = 8000
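
The field descriptions above say that `discovery_backends` is an ordered, comma-separated list, that unknown names are skipped with a warning, and that `searxng` needs `searxng_url`. A minimal sketch of how such a string might be parsed follows. The function `parse_backends` is hypothetical, and dropping `searxng` when no URL is set is an assumption; the diff does not show how BIE handles that case.

```python
import logging
from typing import List, Optional

logger = logging.getLogger("bie.sketch")

# Built-in backend names as listed in the field description.
KNOWN_BACKENDS = ("ddg_html", "ddg_lite", "bing_html", "searxng")


def parse_backends(raw: str, searxng_url: Optional[str] = None) -> List[str]:
    """Split a comma-separated backend string into an ordered list of usable names."""
    selected: List[str] = []
    for name in (part.strip() for part in raw.split(",")):
        if not name:
            continue  # tolerate stray commas and surrounding whitespace
        if name not in KNOWN_BACKENDS:
            logger.warning("unknown discovery backend %r skipped", name)
            continue
        if name == "searxng" and not searxng_url:
            # Assumption: without a base URL the backend cannot be used.
            logger.warning("searxng listed but no searxng_url set; skipped")
            continue
        selected.append(name)
    return selected


default_order = parse_backends("ddg_html,ddg_lite,bing_html")
with_searxng = parse_backends("searxng, ddg_html,bogus", "http://localhost:8080")
without_url = parse_backends("searxng,ddg_html")
```

Order is preserved, so placing `searxng` first makes it the preferred backend and leaves the scraped backends as fallbacks.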
