websearch-kit 0.1.0__tar.gz

Files changed (165)

websearch_kit-0.1.0/.github/workflows/ci.yml +59 -0
websearch_kit-0.1.0/.github/workflows/license-audit.yml +60 -0
websearch_kit-0.1.0/.github/workflows/live.yml +24 -0
websearch_kit-0.1.0/.github/workflows/publish.yml +45 -0
websearch_kit-0.1.0/.gitignore +29 -0
websearch_kit-0.1.0/BACKLOG.md +39 -0
websearch_kit-0.1.0/CHANGELOG.md +81 -0
websearch_kit-0.1.0/CONTRIBUTING.md +38 -0
websearch_kit-0.1.0/LICENSE +21 -0
websearch_kit-0.1.0/PKG-INFO +190 -0
websearch_kit-0.1.0/README.md +146 -0
websearch_kit-0.1.0/SECURITY.md +27 -0
websearch_kit-0.1.0/SPEC.md +416 -0
websearch_kit-0.1.0/VERSIONING.md +43 -0
websearch_kit-0.1.0/adapters/owui/websearch_kit_filter.py +123 -0
websearch_kit-0.1.0/adapters/owui/websearch_kit_tool.py +110 -0
websearch_kit-0.1.0/docs/PROGRESS.md +193 -0
websearch_kit-0.1.0/docs/adr/0001-one-engine-three-surfaces.md +53 -0
websearch_kit-0.1.0/docs/adr/0002-no-fail-silent-degradation-model.md +59 -0
websearch_kit-0.1.0/docs/adr/0003-ssrf-guard-default-on-with-ip-pinning.md +61 -0
websearch_kit-0.1.0/docs/adr/0004-browser-profile-default-fetching.md +57 -0
websearch_kit-0.1.0/docs/adr/0005-gap-filler-oversampling.md +58 -0
websearch_kit-0.1.0/docs/adr/0006-bm25-adaptive-budget-reference-parity.md +57 -0
websearch_kit-0.1.0/docs/adr/0007-provider-registry-and-fallback-chain.md +64 -0
websearch_kit-0.1.0/docs/adr/0008-mcp-official-sdk-no-sampling.md +58 -0
websearch_kit-0.1.0/docs/adr/README.md +29 -0
websearch_kit-0.1.0/docs/architecture.md +188 -0
websearch_kit-0.1.0/docs/deployment/mcp.md +186 -0
websearch_kit-0.1.0/docs/deployment/owui.md +197 -0
websearch_kit-0.1.0/docs/deployment/sdk.md +180 -0
websearch_kit-0.1.0/docs/domains/caching.md +106 -0
websearch_kit-0.1.0/docs/domains/config.md +150 -0
websearch_kit-0.1.0/docs/domains/errors.md +153 -0
websearch_kit-0.1.0/docs/domains/extraction.md +117 -0
websearch_kit-0.1.0/docs/domains/fetching.md +108 -0
websearch_kit-0.1.0/docs/domains/observability.md +133 -0
websearch_kit-0.1.0/docs/domains/providers.md +111 -0
websearch_kit-0.1.0/docs/domains/query-expansion.md +166 -0
websearch_kit-0.1.0/docs/domains/ranking.md +178 -0
websearch_kit-0.1.0/docs/domains/resilience.md +108 -0
websearch_kit-0.1.0/docs/domains/security.md +94 -0
websearch_kit-0.1.0/examples/bare_sdk.py +52 -0
websearch_kit-0.1.0/examples/mcp_config_examples.md +92 -0
websearch_kit-0.1.0/examples/multi_provider.py +59 -0
websearch_kit-0.1.0/pyproject.toml +134 -0
websearch_kit-0.1.0/src/websearch_kit/__init__.py +100 -0
websearch_kit-0.1.0/src/websearch_kit/_version.py +3 -0
websearch_kit-0.1.0/src/websearch_kit/assembly/__init__.py +33 -0
websearch_kit-0.1.0/src/websearch_kit/assembly/citations.py +70 -0
websearch_kit-0.1.0/src/websearch_kit/assembly/context_builder.py +90 -0
websearch_kit-0.1.0/src/websearch_kit/caching/__init__.py +89 -0
websearch_kit-0.1.0/src/websearch_kit/caching/keys.py +93 -0
websearch_kit-0.1.0/src/websearch_kit/caching/memory.py +121 -0
websearch_kit-0.1.0/src/websearch_kit/caching/sqlite_cache.py +244 -0
websearch_kit-0.1.0/src/websearch_kit/config.py +220 -0
websearch_kit-0.1.0/src/websearch_kit/errors.py +250 -0
websearch_kit-0.1.0/src/websearch_kit/expansion/__init__.py +104 -0
websearch_kit-0.1.0/src/websearch_kit/expansion/callback.py +89 -0
websearch_kit-0.1.0/src/websearch_kit/expansion/llm.py +153 -0
websearch_kit-0.1.0/src/websearch_kit/expansion/noop.py +45 -0
websearch_kit-0.1.0/src/websearch_kit/expansion/parsing.py +120 -0
websearch_kit-0.1.0/src/websearch_kit/extraction/__init__.py +24 -0
websearch_kit-0.1.0/src/websearch_kit/extraction/chain.py +179 -0
websearch_kit-0.1.0/src/websearch_kit/extraction/quality.py +184 -0
websearch_kit-0.1.0/src/websearch_kit/extraction/readability_extractor.py +66 -0
websearch_kit-0.1.0/src/websearch_kit/extraction/sanitize_text.py +146 -0
websearch_kit-0.1.0/src/websearch_kit/extraction/trafilatura_extractor.py +181 -0
websearch_kit-0.1.0/src/websearch_kit/extraction/types.py +58 -0
websearch_kit-0.1.0/src/websearch_kit/fetching/__init__.py +17 -0
websearch_kit-0.1.0/src/websearch_kit/fetching/fetcher.py +595 -0
websearch_kit-0.1.0/src/websearch_kit/fetching/policy.py +122 -0
websearch_kit-0.1.0/src/websearch_kit/fetching/robots.py +165 -0
websearch_kit-0.1.0/src/websearch_kit/fetching/user_agents.py +78 -0
websearch_kit-0.1.0/src/websearch_kit/grammar.py +178 -0
websearch_kit-0.1.0/src/websearch_kit/kit.py +487 -0
websearch_kit-0.1.0/src/websearch_kit/mcp/__init__.py +39 -0
websearch_kit-0.1.0/src/websearch_kit/mcp/__main__.py +61 -0
websearch_kit-0.1.0/src/websearch_kit/mcp/config_cli.py +150 -0
websearch_kit-0.1.0/src/websearch_kit/mcp/progress.py +123 -0
websearch_kit-0.1.0/src/websearch_kit/mcp/server.py +264 -0
websearch_kit-0.1.0/src/websearch_kit/mcp/tools.py +376 -0
websearch_kit-0.1.0/src/websearch_kit/models.py +291 -0
websearch_kit-0.1.0/src/websearch_kit/observability/__init__.py +21 -0
websearch_kit-0.1.0/src/websearch_kit/observability/events.py +113 -0
websearch_kit-0.1.0/src/websearch_kit/observability/logging.py +97 -0
websearch_kit-0.1.0/src/websearch_kit/owui/__init__.py +21 -0
websearch_kit-0.1.0/src/websearch_kit/owui/_compat.py +275 -0
websearch_kit-0.1.0/src/websearch_kit/owui/filter_adapter.py +604 -0
websearch_kit-0.1.0/src/websearch_kit/pipeline.py +985 -0
websearch_kit-0.1.0/src/websearch_kit/prompts.py +158 -0
websearch_kit-0.1.0/src/websearch_kit/protocols.py +116 -0
websearch_kit-0.1.0/src/websearch_kit/providers/__init__.py +149 -0
websearch_kit-0.1.0/src/websearch_kit/providers/base.py +252 -0
websearch_kit-0.1.0/src/websearch_kit/providers/brave.py +171 -0
websearch_kit-0.1.0/src/websearch_kit/providers/ddgs.py +153 -0
websearch_kit-0.1.0/src/websearch_kit/providers/exa.py +156 -0
websearch_kit-0.1.0/src/websearch_kit/providers/owui.py +183 -0
websearch_kit-0.1.0/src/websearch_kit/providers/searxng.py +167 -0
websearch_kit-0.1.0/src/websearch_kit/providers/serper.py +140 -0
websearch_kit-0.1.0/src/websearch_kit/providers/tavily.py +141 -0
websearch_kit-0.1.0/src/websearch_kit/py.typed +0 -0
websearch_kit-0.1.0/src/websearch_kit/ranking/__init__.py +28 -0
websearch_kit-0.1.0/src/websearch_kit/ranking/bm25.py +109 -0
websearch_kit-0.1.0/src/websearch_kit/ranking/budget.py +140 -0
websearch_kit-0.1.0/src/websearch_kit/resilience/__init__.py +24 -0
websearch_kit-0.1.0/src/websearch_kit/resilience/circuit.py +166 -0
websearch_kit-0.1.0/src/websearch_kit/resilience/deadline.py +88 -0
websearch_kit-0.1.0/src/websearch_kit/resilience/fallback.py +265 -0
websearch_kit-0.1.0/src/websearch_kit/resilience/health.py +141 -0
websearch_kit-0.1.0/src/websearch_kit/resilience/retry.py +108 -0
websearch_kit-0.1.0/src/websearch_kit/run.py +236 -0
websearch_kit-0.1.0/src/websearch_kit/security/__init__.py +15 -0
websearch_kit-0.1.0/src/websearch_kit/security/ranges.py +129 -0
websearch_kit-0.1.0/src/websearch_kit/security/sanitize.py +97 -0
websearch_kit-0.1.0/src/websearch_kit/security/url_guard.py +296 -0
websearch_kit-0.1.0/tests/__init__.py +0 -0
websearch_kit-0.1.0/tests/http/fixtures/pages/article.html +33 -0
websearch_kit-0.1.0/tests/http/fixtures/pages/forum.html +32 -0
websearch_kit-0.1.0/tests/http/fixtures/pages/listing.html +29 -0
websearch_kit-0.1.0/tests/http/fixtures/pages/malformed.html +12 -0
websearch_kit-0.1.0/tests/http/fixtures/providers/brave_422.json +19 -0
websearch_kit-0.1.0/tests/http/fixtures/providers/brave_ok.json +20 -0
websearch_kit-0.1.0/tests/http/fixtures/providers/exa_ok.json +20 -0
websearch_kit-0.1.0/tests/http/fixtures/providers/searxng_ok.json +22 -0
websearch_kit-0.1.0/tests/http/fixtures/providers/serper_ok.json +22 -0
websearch_kit-0.1.0/tests/http/fixtures/providers/tavily_ok.json +20 -0
websearch_kit-0.1.0/tests/http/test_extraction_chain.py +207 -0
websearch_kit-0.1.0/tests/http/test_fetcher.py +674 -0
websearch_kit-0.1.0/tests/http/test_llm_expander.py +124 -0
websearch_kit-0.1.0/tests/http/test_policy.py +128 -0
websearch_kit-0.1.0/tests/http/test_providers.py +575 -0
websearch_kit-0.1.0/tests/http/test_resilience.py +470 -0
websearch_kit-0.1.0/tests/http/test_robots.py +173 -0
websearch_kit-0.1.0/tests/mcp/__init__.py +0 -0
websearch_kit-0.1.0/tests/mcp/test_config_cli.py +102 -0
websearch_kit-0.1.0/tests/mcp/test_mcp_server.py +395 -0
websearch_kit-0.1.0/tests/owui/__init__.py +0 -0
websearch_kit-0.1.0/tests/owui/conftest.py +160 -0
websearch_kit-0.1.0/tests/owui/test_compat.py +145 -0
websearch_kit-0.1.0/tests/owui/test_filter_adapter.py +410 -0
websearch_kit-0.1.0/tests/owui/test_single_files.py +145 -0
websearch_kit-0.1.0/tests/security/__init__.py +0 -0
websearch_kit-0.1.0/tests/security/test_ssrf_ranges.py +95 -0
websearch_kit-0.1.0/tests/security/test_url_guard.py +363 -0
websearch_kit-0.1.0/tests/unit/__init__.py +0 -0
websearch_kit-0.1.0/tests/unit/pipeline_stubs.py +192 -0
websearch_kit-0.1.0/tests/unit/test_assembly.py +128 -0
websearch_kit-0.1.0/tests/unit/test_bm25_golden.py +158 -0
websearch_kit-0.1.0/tests/unit/test_budget_golden.py +135 -0
websearch_kit-0.1.0/tests/unit/test_caching.py +363 -0
websearch_kit-0.1.0/tests/unit/test_circuit.py +192 -0
websearch_kit-0.1.0/tests/unit/test_config_precedence.py +134 -0
websearch_kit-0.1.0/tests/unit/test_contracts.py +147 -0
websearch_kit-0.1.0/tests/unit/test_deadline.py +98 -0
websearch_kit-0.1.0/tests/unit/test_expansion.py +222 -0
websearch_kit-0.1.0/tests/unit/test_grammar.py +214 -0
websearch_kit-0.1.0/tests/unit/test_kit.py +299 -0
websearch_kit-0.1.0/tests/unit/test_observability.py +157 -0
websearch_kit-0.1.0/tests/unit/test_pipeline.py +521 -0
websearch_kit-0.1.0/tests/unit/test_prompts.py +147 -0
websearch_kit-0.1.0/tests/unit/test_retry.py +250 -0
websearch_kit-0.1.0/tests/unit/test_run_context.py +236 -0
websearch_kit-0.1.0/tests/unit/test_sanitize_text.py +148 -0
websearch_kit-0.1.0/tests/unit/test_sanitize_url.py +123 -0
websearch_kit-0.1.0/uv.lock +3099 -0

websearch_kit-0.1.0/.github/workflows/ci.yml ADDED

@@ -0,0 +1,59 @@
+name: CI
+on:
+  push:
+    branches: [main]
+  pull_request:
+jobs:
+  lint:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: astral-sh/setup-uv@v5
+      - run: uv sync --all-extras
+      - run: uv run ruff check src tests
+      - run: uv run ruff format --check src tests
+      - run: uv run pyright src
+  test:
+    runs-on: ${{ matrix.os }}
+    strategy:
+      fail-fast: false
+      matrix:
+        os: [ubuntu-latest, macos-latest]
+        python-version: ["3.10", "3.11", "3.12", "3.13"]
+    steps:
+      - uses: actions/checkout@v4
+      - uses: astral-sh/setup-uv@v5
+      - run: uv sync --all-extras --python ${{ matrix.python-version }}
+      - run: uv run pytest --cov --cov-report=xml -q
+      - name: Coverage gate (critical modules)
+        run: |
+          uv run python - <<'EOF'
+          import xml.etree.ElementTree as ET
+          import sys
+          CRITICAL = ("config", "errors", "ranking/", "security/")
+          THRESHOLD = 0.90
+          tree = ET.parse("coverage.xml")
+          failed = False
+          for cls in tree.iter("class"):
+              fn = cls.get("filename", "")
+              if not any(c in fn for c in CRITICAL):
+                  continue
+              rate = float(cls.get("line-rate", "1"))
+              if rate < THRESHOLD:
+                  print(f"FAIL {fn}: {rate:.0%} < {THRESHOLD:.0%}")
+                  failed = True
+          sys.exit(1 if failed else 0)
+          EOF
+  base-install:
+    # The base package (no extras) must import cleanly — optional deps must stay optional.
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: astral-sh/setup-uv@v5
+      - run: uv venv && uv pip install .
+      - run: uv run python -c "import websearch_kit; print(websearch_kit.__version__)"

websearch_kit-0.1.0/.github/workflows/license-audit.yml ADDED

@@ -0,0 +1,60 @@
+name: License audit
+# Machine-enforces the MIT promise:
+#  - no GPL/AGPL anywhere in the dependency tree
+#  - trafilatura must be >=1.8.0 (earlier releases are GPLv3+)
+#  - PyMuPDF (AGPL) must never appear
+on:
+  push:
+    branches: [main]
+  pull_request:
+    paths: ["pyproject.toml", "uv.lock", ".github/workflows/license-audit.yml"]
+jobs:
+  audit:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: astral-sh/setup-uv@v5
+      - run: uv sync --all-extras
+      - name: Forbid copyleft licenses
+        run: |
+          uv run pip-licenses --format=json > licenses.json
+          uv run python - <<'EOF'
+          import json
+          import sys
+          BANNED = ("GNU Affero", "AGPL", "GNU General Public License", "GPLv3", "GPLv2")
+          # LGPL is reviewed case-by-case; hard-ban the strong-copyleft families.
+          pkgs = json.load(open("licenses.json"))
+          bad = [
+              p for p in pkgs
+              if any(b.lower() in p["License"].lower() for b in BANNED)
+              and "LGPL" not in p["License"]
+          ]
+          for p in bad:
+              print(f"BANNED LICENSE: {p['Name']} {p['Version']} — {p['License']}")
+          sys.exit(1 if bad else 0)
+          EOF
+      - name: Assert trafilatura >= 1.8.0 and no PyMuPDF
+        run: |
+          uv run python - <<'EOF'
+          import sys
+          from importlib import metadata
+          v = metadata.version("trafilatura")
+          parts = tuple(int(x) for x in v.split(".")[:2])
+          assert parts >= (1, 8), f"trafilatura {v} predates the Apache-2.0 relicense"
+          try:
+              metadata.version("PyMuPDF")
+              sys.exit("PyMuPDF (AGPL) found in the dependency tree — banned")
+          except metadata.PackageNotFoundError:
+              pass
+          try:
+              metadata.version("fitz")
+              sys.exit("fitz/PyMuPDF (AGPL) found in the dependency tree — banned")
+          except metadata.PackageNotFoundError:
+              pass
+          print(f"OK: trafilatura {v}, no PyMuPDF")
+          EOF
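
Review note: the banned-license rule in the audit step above is easier to reason about, and to unit-test, as a pure function. The sketch below is a minimal restatement of that rule, assuming the `Name` / `Version` / `License` keys that the workflow itself reads from the `pip-licenses` JSON. The function name `find_banned` and the sample packages are hypothetical and not part of the package.

```python
BANNED = ("GNU Affero", "AGPL", "GNU General Public License", "GPLv3", "GPLv2")


def find_banned(pkgs: list[dict[str, str]]) -> list[dict[str, str]]:
    """Return packages whose license string matches a banned family.

    Same rule as the workflow step: case-insensitive substring match, with
    any license string that mentions LGPL exempted for case-by-case review.
    """
    return [
        p
        for p in pkgs
        if any(b.lower() in p["License"].lower() for b in BANNED)
        and "LGPL" not in p["License"]
    ]


# Hypothetical sample data in the pip-licenses JSON shape.
sample = [
    {"Name": "a", "Version": "1.0", "License": "MIT License"},
    {"Name": "b", "Version": "2.0", "License": "GNU Affero General Public License v3"},
    {"Name": "c", "Version": "3.0", "License": "GNU Lesser General Public License v3 (LGPLv3)"},
]
print([p["Name"] for p in find_banned(sample)])
```

The third sample shows why the explicit LGPL exemption matters: the string `LGPLv3` contains `GPLv3`, so the substring match alone would flag it.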

websearch_kit-0.1.0/.github/workflows/live.yml ADDED

@@ -0,0 +1,24 @@
+name: Live tests (nightly)
+# Opt-in tier hitting real providers/transports — catches upstream drift
+# (provider schema changes, rate-limit behavior). Never blocks PRs.
+on:
+  schedule:
+    - cron: "23 4 * * *"
+  workflow_dispatch:
+jobs:
+  live:
+    runs-on: ubuntu-latest
+    if: ${{ github.repository_owner != '' }}
+    steps:
+      - uses: actions/checkout@v4
+      - uses: astral-sh/setup-uv@v5
+      - run: uv sync --all-extras
+      - run: uv run pytest -m live -q
+        env:
+          WSK_TAVILY_API_KEY: ${{ secrets.WSK_TAVILY_API_KEY }}
+          WSK_BRAVE_API_KEY: ${{ secrets.WSK_BRAVE_API_KEY }}
+          WSK_SERPER_API_KEY: ${{ secrets.WSK_SERPER_API_KEY }}
+          WSK_EXA_API_KEY: ${{ secrets.WSK_EXA_API_KEY }}

websearch_kit-0.1.0/.github/workflows/publish.yml ADDED

@@ -0,0 +1,45 @@
+name: Publish to PyPI
+on:
+  push:
+    tags: ["v*"]
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: astral-sh/setup-uv@v5
+      - run: uv build
+      - name: Check tag matches built version
+        # version is dynamic (hatch reads _version.py), so check the artifact itself
+        run: |
+          WHEEL=$(ls dist/*.whl)
+          PKG_VERSION=$(basename "$WHEEL" | cut -d- -f2)
+          TAG_VERSION="${GITHUB_REF_NAME#v}"
+          if [ "$PKG_VERSION" != "$TAG_VERSION" ]; then
+            echo "Tag $GITHUB_REF_NAME does not match built version $PKG_VERSION"
+            exit 1
+          fi
+      - run: uv run --with twine twine check dist/*.tar.gz dist/*.whl
+      - uses: actions/upload-artifact@v4
+        with:
+          name: dist
+          path: |
+            dist/*.tar.gz
+            dist/*.whl
+  publish:
+    needs: build
+    runs-on: ubuntu-latest
+    environment:
+      name: pypi
+      url: https://pypi.org/p/websearch-kit
+    permissions:
+      id-token: write # OIDC for PyPI trusted publishing — no API token
+    steps:
+      - uses: actions/download-artifact@v4
+        with:
+          name: dist
+          path: dist/
+      - uses: pypa/gh-action-pypi-publish@release/v1
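
Review note: the tag check in the build job relies on the wheel filename convention, where the version is the second dash-separated field, and on stripping a single leading `v` from the tag. A rough Python equivalent of that shell logic is sketched below. The helper names `wheel_version` and `tag_matches` are hypothetical, introduced only for illustration.

```python
from pathlib import PurePosixPath


def wheel_version(wheel_path: str) -> str:
    """Extract the version field from a wheel filename.

    Wheel names follow {distribution}-{version}-...-{platform}.whl, and the
    individual fields never contain a dash, so field 2 is the version.
    """
    return PurePosixPath(wheel_path).name.split("-")[1]


def tag_matches(wheel_path: str, tag: str) -> bool:
    """Compare the built version with a git tag such as 'v0.1.0'."""
    return wheel_version(wheel_path) == tag.removeprefix("v")


print(tag_matches("dist/websearch_kit-0.1.0-py3-none-any.whl", "v0.1.0"))
```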

websearch_kit-0.1.0/.gitignore ADDED

@@ -0,0 +1,29 @@
+# Reference clone (read-only research material, not part of this repo)
+reference/
+# Python
+__pycache__/
+*.py[cod]
+*.egg-info/
+dist/
+build/
+.venv/
+.python-version
+# Tooling caches
+.pytest_cache/
+.ruff_cache/
+.coverage
+coverage.xml
+htmlcov/
+.pyright/
+# OS / editor
+.DS_Store
+.idea/
+.vscode/
+# Local secrets / env
+.env
+.env.*
+.claude-workflows/

websearch_kit-0.1.0/BACKLOG.md ADDED

@@ -0,0 +1,39 @@
+# Backlog
+Follow-ups identified after 0.1.0. Not scheduled; ordered roughly by value.
+## Recency boost in the ranker
+**Problem.** BM25 ranks purely on lexical relevance. On freshness-sensitive
+queries ("latest stable Python version"), topically-rich but stale pages
+outscore the one source with the current answer — observed in a live run on
+2026-06-05 where 2024/2025 news roundups ranked above the page that actually
+answered the question.
+**Direction.** An opt-in recency boost in the rank stage: when a result carries
+a usable date signal (`published_date` from the provider, or a date extracted
+by trafilatura), multiply/add a decay-weighted bonus so newer pages win ties
+and near-ties. Must be a pure, golden-testable function alongside the BM25
+math, off by default to preserve 0.1.x ranking parity, and surfaced in config
+(e.g. `ranking.recency_boost: float = 0.0` + half-life). Results without dates
+are never penalized — boost only, no decay-to-zero.
+**Touch points.** `ranking/bm25.py` (or a sibling `ranking/recency.py`),
+pipeline rank stage, config schema, SPEC.md ranking math section, golden tests.
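
Review note: the recency boost described in this backlog item could take roughly the following shape. This is only a sketch under stated assumptions, not the package's implementation: the backlog names `ranking.recency_boost` and a half-life, while the function name `recency_multiplier`, the `half_life_days` parameter, and the exponential-decay form are hypothetical choices made here. It is a pure function, returns exactly 1.0 when the boost is off or no date is known, and never drops below 1.0, so undated results are not penalized.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone


def recency_multiplier(
    published: datetime | None,
    now: datetime,
    boost: float = 0.0,
    half_life_days: float = 30.0,
) -> float:
    """Score multiplier in [1.0, 1.0 + boost]; boost only, never a penalty."""
    if boost <= 0.0 or published is None:
        return 1.0  # feature off, or no usable date signal
    # Clamp future-dated pages to age 0 so they get the maximum bonus.
    age_days = max((now - published).total_seconds() / 86400.0, 0.0)
    # The bonus halves every half_life_days and approaches 0 for old pages.
    return 1.0 + boost * 0.5 ** (age_days / half_life_days)


now = datetime(2026, 6, 5, tzinfo=timezone.utc)
print(recency_multiplier(now - timedelta(days=30), now, boost=0.5))
```

Multiplying the BM25 score by this factor keeps zero-score results at zero, so a boost of this form cannot resurrect a page the ranker would otherwise drop.
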
+## Configurable default for `max_context_chars`
+**Problem.** The MCP `research` tool's context cap is a per-call parameter, but
+its default (8000) is hardcoded in the tool signature
+(`src/websearch_kit/mcp/tools.py`). Deployments that consistently want larger
+or smaller context blocks must rely on every client passing the parameter on
+every call.
+**Direction.** Make the default server-configurable — CLI flag / env var /
+config file (e.g. `mcp.default_max_context_chars`), with the per-call parameter
+still overriding it. Keep the existing `ge=500, le=200_000` bounds; validate
+the configured default against them at startup. Mirror the same default in the
+SDK config so the two surfaces stay in lockstep (one engine, one config).
+**Touch points.** `mcp/tools.py`, MCP server CLI/config plumbing, SDK config
+schema, `docs/deployment/mcp.md`, `examples/mcp_config_examples.md`.
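
Review note: the resolution order this item proposes (per-call parameter over configured default over the built-in 8000, with the configured default validated against the existing bounds at startup) can be sketched as two small functions. The names `validate_default` and `resolve_max_context_chars` are hypothetical. Only the bounds (500 to 200_000) and the built-in default come from the backlog text.

```python
from __future__ import annotations

MIN_CHARS = 500          # existing lower bound (ge=500)
MAX_CHARS = 200_000      # existing upper bound (le=200_000)
BUILTIN_DEFAULT = 8000   # currently hardcoded in the tool signature


def validate_default(value: int) -> int:
    """Startup check: reject a configured default outside the tool's bounds."""
    if not MIN_CHARS <= value <= MAX_CHARS:
        raise ValueError(
            f"default max_context_chars {value} outside [{MIN_CHARS}, {MAX_CHARS}]"
        )
    return value


def resolve_max_context_chars(per_call: int | None, configured: int | None) -> int:
    """Per-call value wins, then the configured default, then the built-in."""
    if per_call is not None:
        return per_call
    if configured is not None:
        return configured
    return BUILTIN_DEFAULT


configured = validate_default(20_000)  # e.g. read once from CLI/env/config file
print(resolve_max_context_chars(None, configured))
```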

websearch_kit-0.1.0/CHANGELOG.md ADDED

@@ -0,0 +1,81 @@
+# Changelog
+All notable changes to this project are documented in this file.
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html)
+(see [VERSIONING.md](VERSIONING.md) for the pre-1.0 rules).
+## [0.1.0] - 2026-06-05
+First release: one engine, three surfaces, with golden-tested ranking math
+and a no-fail-silent degradation contract.
+### Added
+**Core SDK**
+- `SearchKit` async engine (`research` / `search` / `fetch` / `build_prompt` /
+  `health`) and `SyncSearchKit` blocking facade; stateless per call with
+  per-run `RunContext` (deadline, degradation log, stats, progress)
+- Core contracts: `WebSearchConfig` (one canonical config; `WSK_*` env,
+  kwargs > env > defaults), shared pydantic models (`SearchResult`,
+  `PageContent`, `Source`, `Degradation`, `RunStats`, `ResearchReport`,
+  `SystemHealth`, `ProgressEvent`), typed error taxonomy with stable codes,
+  extension protocols (`SearchProvider`, `QueryExpander`, `Cache`,
+  `ProgressSink`)
+- Pipeline: query expansion → oversampled multi-query search → sanitized-URL
+  dedup + binary pre-filter → concurrent fetch with **Gap-Filler** round-2
+  backfill (failures demote to the snippet pool, everything recorded) →
+  quality-gated extraction with snippet fallback → BM25 rerank vs the original
+  query with zero-score drops → adaptive per-source character budgeting →
+  numbered `[N]` context assembly with a relevance-filtered snippet pool and
+  1:1 citations
+- Fetch layer: browser profile (40-UA rotation) / polite profile (honest bot
+  UA + robots.txt), SSRF guard on by default with **pinned-IP connect**
+  (Host/SNI preserved) and full per-hop redirect gating (SSRF + robots +
+  binary), streaming byte caps, whole-run deadline
+- Extraction chain: trafilatura (precision→recall) → readability-lxml → plain
+  text, quality-gated; `sanitize_text` cleaning; snippet-fallback heuristics
+- Providers: ddgs (zero-config default), SearXNG, Tavily, Brave, Serper, Exa,
+  plus the OWUI normalization shim; name→factory registry with
+  capabilities-as-data and `register_provider` extension hook
+- Resilience: full-jitter retry (counted in `stats.retry_count`), per-provider
+  circuit breakers, ordered fallback chain (auth-error cooldown latch with
+  re-probe), active health probes and system aggregation
+- Query expansion: NoOp (default) / OpenAI-compatible LLM / callback
+  expanders; reasoning-block stripping and JSON/line-split parsing; failures
+  degrade to the raw query
+- Caching: memory TTL-LRU and sqlite backends behind a never-raising
+  `CacheGuard` (with teardown `close()`); three deterministic keyspaces
+  (search / extracted content / expansion) with tracking-param-stripped
+  content keys
+**MCP server** (`[mcp]` extra)
+- `websearch-kit-mcp` console script; stdio and streamable-HTTP transports
+- Tools `web_search`, `fetch_page` (cursor pagination, raw mode, refusals as
+  explained outcomes), `research` (structured report + context text + one
+  resource link per `[N]` citation), `health` (live probes + config checks);
+  all typed structured output with read-only/open-world annotations
+- `health://status` resource; background health-probe loop in HTTP mode;
+  per-provider kit pool with shared cache; per-request warning collection
+**Open WebUI adapter** (`adapters/owui/`, `[owui]` extra)
+- Single-file Filter: toggle pill, configurable `??` prefix with zero-syntax
+  or require-prefix modes, readable `--flags` grammar, bare-trigger context
+  extraction, history rewrite with web_search/retrieval feature lockdown,
+  pinned 0.9.6 event shapes (status / queries / `source` citations), skipped-
+  URL summaries, debug dump; fail-soft (failures are visible status lines,
+  the chat turn always survives)
+- Single-file Tool variant exposing `web_search`/`research` for model-invoked
+  use
+- `websearch_kit.owui._compat`: every OWUI-internal touchpoint feature-
+  detected in one shim (SearchForm `queries=`/`query=`, sync/async user
+  lookup, ShadowRequest config proxy, web-search-enabled gate)
+**Project**
+- SPEC.md (normative contract), 11 domain documents, 8 ADRs, architecture and
+  per-surface deployment docs, runnable examples
+- CI: lint/type/test matrix, permissive-license audit, nightly live tier;
+  688 offline tests, pyright strict
+[0.1.0]: https://github.com/rmarnold/websearch-kit/releases/tag/v0.1.0
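
Review note: the changelog lists "full-jitter retry" under Resilience. In the usual meaning of that term, each retry sleeps for a random duration drawn uniformly between zero and a capped exponential backoff. The sketch below illustrates that scheme only. The function name and the `base` and `cap` values are assumptions made here, since this diff excerpt does not show the package's actual retry parameters.

```python
from __future__ import annotations

import random


def full_jitter_delay(
    attempt: int,
    base: float = 0.5,
    cap: float = 30.0,
    rng: random.Random | None = None,
) -> float:
    """Delay in seconds before retry number `attempt` (0-based).

    The upper bound doubles with each attempt until it reaches `cap`;
    the actual delay is uniform in [0, upper bound].
    """
    if rng is None:
        rng = random.Random()
    upper = min(cap, base * 2**attempt)
    return rng.uniform(0.0, upper)


rng = random.Random(0)  # seeded only to make the demo repeatable
for attempt in range(4):
    print(attempt, round(full_jitter_delay(attempt, rng=rng), 3))
```

Drawing from the whole interval, rather than adding a small jitter on top of a fixed backoff, spreads concurrent retries apart so they do not hit a recovering provider at the same moment.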

websearch_kit-0.1.0/CONTRIBUTING.md ADDED

@@ -0,0 +1,38 @@
+# Contributing
+## Setup
+```bash
+uv sync --all-extras      # creates .venv with every optional extra + dev tools
+uv run pytest -q          # offline test tiers (unit, security, http, mcp, owui)
+uv run ruff check src tests && uv run ruff format src tests
+uv run pyright src        # strict mode
+```
+## Test tiers
+| Tier | Marker/dir | Network | When it runs |
+|---|---|---|---|
+| unit / golden | `tests/unit` | none | every commit |
+| security (SSRF) | `tests/security` | none (stubbed DNS) | every commit |
+| http (providers/fetch) | `tests/http` | none (respx fixtures) | every commit |
+| mcp | `tests/mcp` | none (in-memory transport) | every commit |
+| owui compat | `tests/owui` | none (stubbed open_webui) | every commit |
+| live | `@pytest.mark.live` | real | nightly CI / opt-in (`pytest -m live`) |
+Recorded provider fixtures live in `tests/http/fixtures/`. When a provider's API
+drifts, re-record the fixture from a real response (strip any credentials) and
+update the mapper + changelog together.
+## Rules of the road
+- **No fail-silent.** Never swallow an exception: classify it into the error
+  taxonomy (`errors.py`) and either raise it or record a `Degradation`.
+  `except: pass` will be rejected in review (and flagged by ruff S110).
+- **License discipline.** Permissive deps only (MIT/BSD/Apache-2.0).
+  `trafilatura>=1.8.0` is load-bearing; PyMuPDF is banned. CI enforces this.
+- **Public API changes** require a VERSIONING.md-conscious changelog entry;
+  load-bearing design changes require an ADR (`docs/adr/`, MADR format).
+- **Domain docs.** Behavior changes in a subsystem must update its standard
+  document under `docs/domains/`.
+- Conventional, descriptive commit messages; one logical change per commit.
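
Review note: the "No fail-silent" rule above can be illustrated with a small sketch of the record-or-raise pattern. Everything here is a hypothetical stand-in: `DegradationNote`, `RunLog`, and `fetch_or_record` are not the package's `Degradation` model or API, whose fields are not shown in this excerpt. A recoverable failure is classified and recorded, and anything unclassified propagates instead of being swallowed.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DegradationNote:
    """Hypothetical record of one thing the run degraded on."""
    stage: str
    code: str
    detail: str


@dataclass
class RunLog:
    """Hypothetical per-run collection of degradations."""
    degradations: list[DegradationNote] = field(default_factory=list)


def fetch_or_record(url: str, fetch: Callable[[str], str], log: RunLog) -> str | None:
    """Return the page text, or record a degradation and return None."""
    try:
        return fetch(url)
    except TimeoutError as exc:
        # Classified and recorded: the caller can see exactly what was lost.
        log.degradations.append(DegradationNote("fetch", "timeout", f"{url}: {exc}"))
        return None
    # Any other exception type is deliberately not caught here.


def slow_fetch(url: str) -> str:
    raise TimeoutError("deadline exceeded")


log = RunLog()
print(fetch_or_record("https://example.com", slow_fetch, log), log.degradations)
```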

websearch_kit-0.1.0/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2026 Robert Arnold
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

websearch_kit-0.1.0/PKG-INFO ADDED

@@ -0,0 +1,190 @@
+Metadata-Version: 2.4
+Name: websearch-kit
+Version: 0.1.0
+Summary: Web search, fetch, and research pipeline for LLMs — usable as a Python SDK, a standalone MCP server, and an Open WebUI plugin.
+Project-URL: Homepage, https://github.com/rmarnold/websearch-kit
+Project-URL: Changelog, https://github.com/rmarnold/websearch-kit/blob/main/CHANGELOG.md
+Author: Robert Arnold
+License-Expression: MIT
+License-File: LICENSE
+Keywords: llm,mcp,model-context-protocol,open-webui,rag,scraping,search,web-search
+Classifier: Development Status :: 4 - Beta
+Classifier: Intended Audience :: Developers
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
+Classifier: Topic :: Software Development :: Libraries :: Python Modules
+Classifier: Typing :: Typed
+Requires-Python: >=3.10
+Requires-Dist: httpx[brotli,http2,zstd]>=0.27
+Requires-Dist: protego>=0.3
+Requires-Dist: pydantic-settings>=2.2
+Requires-Dist: pydantic>=2.7
+Requires-Dist: readability-lxml>=0.8.1
+Requires-Dist: trafilatura>=1.8.0
+Provides-Extra: all
+Requires-Dist: ddgs>=9.0; extra == 'all'
+Requires-Dist: fastembed>=0.5; extra == 'all'
+Requires-Dist: mcp<2,>=1.12; extra == 'all'
+Requires-Dist: pypdf>=5.0; extra == 'all'
+Provides-Extra: ddgs
+Requires-Dist: ddgs>=9.0; extra == 'ddgs'
+Provides-Extra: mcp
+Requires-Dist: mcp<2,>=1.12; extra == 'mcp'
+Provides-Extra: owui
+Provides-Extra: pdf
+Requires-Dist: pypdf>=5.0; extra == 'pdf'
+Provides-Extra: rerank
+Requires-Dist: fastembed>=0.5; extra == 'rerank'
+Description-Content-Type: text/markdown
+# websearch-kit
+> Web search, fetch, and research pipeline for LLMs — one engine, three surfaces:
+> a Python **SDK**, a standalone **MCP server**, and an **Open WebUI** plugin.
+Query expansion → multi-provider search → SSRF-guarded concurrent fetching
+(40-UA browser profile, pinned-IP connect) → trafilatura extraction → BM25
+rerank with adaptive context budgeting → numbered, citable context for your LLM.
+**No fail-silent:** a call either raises a typed error or returns a result where
+every dropped, blocked, truncated, or substituted item is enumerated as a
+structured `Degradation`. On the live web that looks like:
+```text
+ok        : True   partial: True
+sources   : 10                       # 5 fetched pages + 5 relevance-filtered snippets
+warnings  :
+  - [fetch] https://cloud.google.com/...: response exceeded byte cap (1054971 > 1048576 bytes)
+stats     : 10 raw -> 10 unique, 5 fetched, context 23471 chars,
+            timings {'search': 854, 'fetch': 1662, 'extract': 878, 'rank': 3}
+```
+**Status: 0.1.0.** See [SPEC.md](SPEC.md), [CHANGELOG.md](CHANGELOG.md).
+## Features
+- **One engine, three surfaces** — the SDK core is the only pipeline; the MCP
+  server and Open WebUI adapters are thin translators over it, so behavior and
+  config semantics never drift between surfaces.
+- **Full research pipeline** — `research()` runs search → fetch → extract →
+  rank → assemble in one call and returns a numbered `[N]` context block with
+  1:1 source citations, ready to drop into a prompt.
+- **Multi-provider search** — zero-key `ddgs` out of the box; keyed
+  Tavily / Brave / Serper / Exa and self-hosted SearXNG via config; ordered
+  fallback chains with per-provider circuit breakers.
+- **Hardened fetching** — SSRF guard (private / reserved / metadata IP ranges
+  blocked at connect time with pinned-IP enforcement), per-response byte caps,
+  rotating 40-UA browser profile, concurrent fetches with deadline budgeting.
+- **Quality extraction & ranking** — trafilatura article extraction, BM25
+  reranking (golden-tested math), adaptive context budgeting: the most relevant
+  pages get more of the character budget, marginal ones shrink, noise is dropped.
+- **No-fail-silent contract** — every degradation (blocked URL, truncated page,
+  provider fallback, budget cut) is a typed, enumerable warning; nothing
+  disappears without a trace.
+- **LLM query expansion (optional)** — expand a question into multiple search
+  queries via any OpenAI-compatible endpoint or an injected callback.
+- **Caching** — in-memory by default, sqlite persistence a config flag away.
+- **Typed throughout** — pyright-strict clean, structured results on every
+  surface (Pydantic models in the SDK, JSON structured output over MCP).
+- **688+ tests** including live-web smoke suites and hand-computed golden tests.
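
Review note: the "Hardened fetching" bullet describes blocking private, reserved, and metadata IP ranges at connect time. The range-check half of that idea can be sketched with the standard-library `ipaddress` module as below. This is an illustration only: the package's actual ruleset is defined in its SPEC.md and `security/ranges.py`, which are not shown in this excerpt, and the function name `is_blocked_ip` is hypothetical.

```python
import ipaddress


def is_blocked_ip(ip_text: str) -> bool:
    """True if the address falls in a range an SSRF guard would typically refuse."""
    ip = ipaddress.ip_address(ip_text)
    # Unwrap IPv4-mapped IPv6 (::ffff:a.b.c.d) so the IPv4 rules apply to it.
    if isinstance(ip, ipaddress.IPv6Address) and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local      # includes 169.254.169.254 (cloud metadata)
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )


for candidate in ("8.8.8.8", "10.0.0.5", "169.254.169.254", "::1"):
    print(candidate, is_blocked_ip(candidate))
```

A check like this only helps if it runs on the address that is actually connected to. That is the point of the pinned-IP connect mentioned above: resolve once, vet that result, and connect to the vetted IP rather than resolving the hostname a second time.
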
+## How to use
+### Python SDK
+```bash
+pip install "websearch-kit[ddgs]"   # ddgs = the zero-API-key search provider
+```
+```python
+import asyncio
+from websearch_kit import SearchKit
+async def main():
+    async with SearchKit() as kit:          # zero-config: ddgs, no keys, no LLM
+        report = await kit.research("RISC-V vs ARM datacenter adoption")
+        print(report.context)               # numbered [N] block for your LLM
+        for s in report.sources:
+            print(f"[{s.n}] {s.title} — {s.url}")
+        print(report.warnings)              # everything the run degraded on
+asyncio.run(main())
+```
+Beyond `research()`, the kit exposes the pipeline stages individually:
+```python
+results = await kit.search("python 3.14 free threading", count=5)     # snippets only
+pages   = await kit.fetch(["https://docs.python.org/3.14/whatsnew/"])  # URLs, extracted
+status  = await kit.health()                                           # provider probe
+```
+Prefer blocking code? `SyncSearchKit` mirrors the async API 1:1. Keyed
+providers, fallback chains, sqlite caching, and LLM query expansion are all
+config away — see [docs/deployment/sdk.md](docs/deployment/sdk.md) and
+[examples/](examples/).
+### MCP server
+Add to your MCP client config (Claude Code, Claude Desktop, or any MCP client):
+```json
+{
+  "mcpServers": {
+    "websearch-kit": {
+      "command": "uvx",
+      "args": ["--from", "websearch-kit[mcp,ddgs]", "websearch-kit-mcp"]
+    }
+  }
+}
+```
+Four read-only tools with typed structured output, over stdio or streamable HTTP:
+| Tool | What it does |
+|------|--------------|
+| `web_search` | Snippet-level results — context-economical |
+| `fetch_page` | Read one URL as markdown, cursor pagination for long pages |
+| `research` | Full pipeline → `[N]` context block + one resource link per citation |
+| `health` | Provider latency, circuit-breaker state, config checks |
+For HTTP transport, scaling, and hardening flags see
+[docs/deployment/mcp.md](docs/deployment/mcp.md) and
+[examples/mcp_config_examples.md](examples/mcp_config_examples.md).
+### Open WebUI
+Install the single-file filter from
+[`adapters/owui/websearch_kit_filter.py`](adapters/owui/websearch_kit_filter.py)
+(Admin Panel → Functions → Import). It pip-installs this SDK automatically via
+its frontmatter `requirements:` line and uses your instance's configured web
+search out of the box.
+Toggle the pill to research every message, or trigger one-off:
+```
+?? quantum routers --count 12 --lang en --reply de --fresh week
+```
+A Tool variant for model-invoked (agentic) use ships alongside it. See
+[docs/deployment/owui.md](docs/deployment/owui.md).
+## Documentation
+- [SPEC.md](SPEC.md) — normative behavioral contract (pipeline semantics,
+  ranking math, degradation codes, SSRF ruleset)
+- [docs/architecture.md](docs/architecture.md) — how the engine is layered
+- [docs/domains/](docs/domains/) — one standard document per domain
+- [docs/adr/](docs/adr/) — the eight load-bearing decisions
+- [VERSIONING.md](VERSIONING.md) — SemVer policy and public-API definition
+- [SECURITY.md](SECURITY.md) — SSRF posture and threat model
+## License
+MIT — with a CI-enforced permissive-only dependency policy (no GPL/AGPL;
+`trafilatura>=1.8.0` pinned for its Apache-2.0 relicense).