websearch-kit 0.1.0__tar.gz

Files changed (165)
  1. websearch_kit-0.1.0/.github/workflows/ci.yml +59 -0
  2. websearch_kit-0.1.0/.github/workflows/license-audit.yml +60 -0
  3. websearch_kit-0.1.0/.github/workflows/live.yml +24 -0
  4. websearch_kit-0.1.0/.github/workflows/publish.yml +45 -0
  5. websearch_kit-0.1.0/.gitignore +29 -0
  6. websearch_kit-0.1.0/BACKLOG.md +39 -0
  7. websearch_kit-0.1.0/CHANGELOG.md +81 -0
  8. websearch_kit-0.1.0/CONTRIBUTING.md +38 -0
  9. websearch_kit-0.1.0/LICENSE +21 -0
  10. websearch_kit-0.1.0/PKG-INFO +190 -0
  11. websearch_kit-0.1.0/README.md +146 -0
  12. websearch_kit-0.1.0/SECURITY.md +27 -0
  13. websearch_kit-0.1.0/SPEC.md +416 -0
  14. websearch_kit-0.1.0/VERSIONING.md +43 -0
  15. websearch_kit-0.1.0/adapters/owui/websearch_kit_filter.py +123 -0
  16. websearch_kit-0.1.0/adapters/owui/websearch_kit_tool.py +110 -0
  17. websearch_kit-0.1.0/docs/PROGRESS.md +193 -0
  18. websearch_kit-0.1.0/docs/adr/0001-one-engine-three-surfaces.md +53 -0
  19. websearch_kit-0.1.0/docs/adr/0002-no-fail-silent-degradation-model.md +59 -0
  20. websearch_kit-0.1.0/docs/adr/0003-ssrf-guard-default-on-with-ip-pinning.md +61 -0
  21. websearch_kit-0.1.0/docs/adr/0004-browser-profile-default-fetching.md +57 -0
  22. websearch_kit-0.1.0/docs/adr/0005-gap-filler-oversampling.md +58 -0
  23. websearch_kit-0.1.0/docs/adr/0006-bm25-adaptive-budget-reference-parity.md +57 -0
  24. websearch_kit-0.1.0/docs/adr/0007-provider-registry-and-fallback-chain.md +64 -0
  25. websearch_kit-0.1.0/docs/adr/0008-mcp-official-sdk-no-sampling.md +58 -0
  26. websearch_kit-0.1.0/docs/adr/README.md +29 -0
  27. websearch_kit-0.1.0/docs/architecture.md +188 -0
  28. websearch_kit-0.1.0/docs/deployment/mcp.md +186 -0
  29. websearch_kit-0.1.0/docs/deployment/owui.md +197 -0
  30. websearch_kit-0.1.0/docs/deployment/sdk.md +180 -0
  31. websearch_kit-0.1.0/docs/domains/caching.md +106 -0
  32. websearch_kit-0.1.0/docs/domains/config.md +150 -0
  33. websearch_kit-0.1.0/docs/domains/errors.md +153 -0
  34. websearch_kit-0.1.0/docs/domains/extraction.md +117 -0
  35. websearch_kit-0.1.0/docs/domains/fetching.md +108 -0
  36. websearch_kit-0.1.0/docs/domains/observability.md +133 -0
  37. websearch_kit-0.1.0/docs/domains/providers.md +111 -0
  38. websearch_kit-0.1.0/docs/domains/query-expansion.md +166 -0
  39. websearch_kit-0.1.0/docs/domains/ranking.md +178 -0
  40. websearch_kit-0.1.0/docs/domains/resilience.md +108 -0
  41. websearch_kit-0.1.0/docs/domains/security.md +94 -0
  42. websearch_kit-0.1.0/examples/bare_sdk.py +52 -0
  43. websearch_kit-0.1.0/examples/mcp_config_examples.md +92 -0
  44. websearch_kit-0.1.0/examples/multi_provider.py +59 -0
  45. websearch_kit-0.1.0/pyproject.toml +134 -0
  46. websearch_kit-0.1.0/src/websearch_kit/__init__.py +100 -0
  47. websearch_kit-0.1.0/src/websearch_kit/_version.py +3 -0
  48. websearch_kit-0.1.0/src/websearch_kit/assembly/__init__.py +33 -0
  49. websearch_kit-0.1.0/src/websearch_kit/assembly/citations.py +70 -0
  50. websearch_kit-0.1.0/src/websearch_kit/assembly/context_builder.py +90 -0
  51. websearch_kit-0.1.0/src/websearch_kit/caching/__init__.py +89 -0
  52. websearch_kit-0.1.0/src/websearch_kit/caching/keys.py +93 -0
  53. websearch_kit-0.1.0/src/websearch_kit/caching/memory.py +121 -0
  54. websearch_kit-0.1.0/src/websearch_kit/caching/sqlite_cache.py +244 -0
  55. websearch_kit-0.1.0/src/websearch_kit/config.py +220 -0
  56. websearch_kit-0.1.0/src/websearch_kit/errors.py +250 -0
  57. websearch_kit-0.1.0/src/websearch_kit/expansion/__init__.py +104 -0
  58. websearch_kit-0.1.0/src/websearch_kit/expansion/callback.py +89 -0
  59. websearch_kit-0.1.0/src/websearch_kit/expansion/llm.py +153 -0
  60. websearch_kit-0.1.0/src/websearch_kit/expansion/noop.py +45 -0
  61. websearch_kit-0.1.0/src/websearch_kit/expansion/parsing.py +120 -0
  62. websearch_kit-0.1.0/src/websearch_kit/extraction/__init__.py +24 -0
  63. websearch_kit-0.1.0/src/websearch_kit/extraction/chain.py +179 -0
  64. websearch_kit-0.1.0/src/websearch_kit/extraction/quality.py +184 -0
  65. websearch_kit-0.1.0/src/websearch_kit/extraction/readability_extractor.py +66 -0
  66. websearch_kit-0.1.0/src/websearch_kit/extraction/sanitize_text.py +146 -0
  67. websearch_kit-0.1.0/src/websearch_kit/extraction/trafilatura_extractor.py +181 -0
  68. websearch_kit-0.1.0/src/websearch_kit/extraction/types.py +58 -0
  69. websearch_kit-0.1.0/src/websearch_kit/fetching/__init__.py +17 -0
  70. websearch_kit-0.1.0/src/websearch_kit/fetching/fetcher.py +595 -0
  71. websearch_kit-0.1.0/src/websearch_kit/fetching/policy.py +122 -0
  72. websearch_kit-0.1.0/src/websearch_kit/fetching/robots.py +165 -0
  73. websearch_kit-0.1.0/src/websearch_kit/fetching/user_agents.py +78 -0
  74. websearch_kit-0.1.0/src/websearch_kit/grammar.py +178 -0
  75. websearch_kit-0.1.0/src/websearch_kit/kit.py +487 -0
  76. websearch_kit-0.1.0/src/websearch_kit/mcp/__init__.py +39 -0
  77. websearch_kit-0.1.0/src/websearch_kit/mcp/__main__.py +61 -0
  78. websearch_kit-0.1.0/src/websearch_kit/mcp/config_cli.py +150 -0
  79. websearch_kit-0.1.0/src/websearch_kit/mcp/progress.py +123 -0
  80. websearch_kit-0.1.0/src/websearch_kit/mcp/server.py +264 -0
  81. websearch_kit-0.1.0/src/websearch_kit/mcp/tools.py +376 -0
  82. websearch_kit-0.1.0/src/websearch_kit/models.py +291 -0
  83. websearch_kit-0.1.0/src/websearch_kit/observability/__init__.py +21 -0
  84. websearch_kit-0.1.0/src/websearch_kit/observability/events.py +113 -0
  85. websearch_kit-0.1.0/src/websearch_kit/observability/logging.py +97 -0
  86. websearch_kit-0.1.0/src/websearch_kit/owui/__init__.py +21 -0
  87. websearch_kit-0.1.0/src/websearch_kit/owui/_compat.py +275 -0
  88. websearch_kit-0.1.0/src/websearch_kit/owui/filter_adapter.py +604 -0
  89. websearch_kit-0.1.0/src/websearch_kit/pipeline.py +985 -0
  90. websearch_kit-0.1.0/src/websearch_kit/prompts.py +158 -0
  91. websearch_kit-0.1.0/src/websearch_kit/protocols.py +116 -0
  92. websearch_kit-0.1.0/src/websearch_kit/providers/__init__.py +149 -0
  93. websearch_kit-0.1.0/src/websearch_kit/providers/base.py +252 -0
  94. websearch_kit-0.1.0/src/websearch_kit/providers/brave.py +171 -0
  95. websearch_kit-0.1.0/src/websearch_kit/providers/ddgs.py +153 -0
  96. websearch_kit-0.1.0/src/websearch_kit/providers/exa.py +156 -0
  97. websearch_kit-0.1.0/src/websearch_kit/providers/owui.py +183 -0
  98. websearch_kit-0.1.0/src/websearch_kit/providers/searxng.py +167 -0
  99. websearch_kit-0.1.0/src/websearch_kit/providers/serper.py +140 -0
  100. websearch_kit-0.1.0/src/websearch_kit/providers/tavily.py +141 -0
  101. websearch_kit-0.1.0/src/websearch_kit/py.typed +0 -0
  102. websearch_kit-0.1.0/src/websearch_kit/ranking/__init__.py +28 -0
  103. websearch_kit-0.1.0/src/websearch_kit/ranking/bm25.py +109 -0
  104. websearch_kit-0.1.0/src/websearch_kit/ranking/budget.py +140 -0
  105. websearch_kit-0.1.0/src/websearch_kit/resilience/__init__.py +24 -0
  106. websearch_kit-0.1.0/src/websearch_kit/resilience/circuit.py +166 -0
  107. websearch_kit-0.1.0/src/websearch_kit/resilience/deadline.py +88 -0
  108. websearch_kit-0.1.0/src/websearch_kit/resilience/fallback.py +265 -0
  109. websearch_kit-0.1.0/src/websearch_kit/resilience/health.py +141 -0
  110. websearch_kit-0.1.0/src/websearch_kit/resilience/retry.py +108 -0
  111. websearch_kit-0.1.0/src/websearch_kit/run.py +236 -0
  112. websearch_kit-0.1.0/src/websearch_kit/security/__init__.py +15 -0
  113. websearch_kit-0.1.0/src/websearch_kit/security/ranges.py +129 -0
  114. websearch_kit-0.1.0/src/websearch_kit/security/sanitize.py +97 -0
  115. websearch_kit-0.1.0/src/websearch_kit/security/url_guard.py +296 -0
  116. websearch_kit-0.1.0/tests/__init__.py +0 -0
  117. websearch_kit-0.1.0/tests/http/fixtures/pages/article.html +33 -0
  118. websearch_kit-0.1.0/tests/http/fixtures/pages/forum.html +32 -0
  119. websearch_kit-0.1.0/tests/http/fixtures/pages/listing.html +29 -0
  120. websearch_kit-0.1.0/tests/http/fixtures/pages/malformed.html +12 -0
  121. websearch_kit-0.1.0/tests/http/fixtures/providers/brave_422.json +19 -0
  122. websearch_kit-0.1.0/tests/http/fixtures/providers/brave_ok.json +20 -0
  123. websearch_kit-0.1.0/tests/http/fixtures/providers/exa_ok.json +20 -0
  124. websearch_kit-0.1.0/tests/http/fixtures/providers/searxng_ok.json +22 -0
  125. websearch_kit-0.1.0/tests/http/fixtures/providers/serper_ok.json +22 -0
  126. websearch_kit-0.1.0/tests/http/fixtures/providers/tavily_ok.json +20 -0
  127. websearch_kit-0.1.0/tests/http/test_extraction_chain.py +207 -0
  128. websearch_kit-0.1.0/tests/http/test_fetcher.py +674 -0
  129. websearch_kit-0.1.0/tests/http/test_llm_expander.py +124 -0
  130. websearch_kit-0.1.0/tests/http/test_policy.py +128 -0
  131. websearch_kit-0.1.0/tests/http/test_providers.py +575 -0
  132. websearch_kit-0.1.0/tests/http/test_resilience.py +470 -0
  133. websearch_kit-0.1.0/tests/http/test_robots.py +173 -0
  134. websearch_kit-0.1.0/tests/mcp/__init__.py +0 -0
  135. websearch_kit-0.1.0/tests/mcp/test_config_cli.py +102 -0
  136. websearch_kit-0.1.0/tests/mcp/test_mcp_server.py +395 -0
  137. websearch_kit-0.1.0/tests/owui/__init__.py +0 -0
  138. websearch_kit-0.1.0/tests/owui/conftest.py +160 -0
  139. websearch_kit-0.1.0/tests/owui/test_compat.py +145 -0
  140. websearch_kit-0.1.0/tests/owui/test_filter_adapter.py +410 -0
  141. websearch_kit-0.1.0/tests/owui/test_single_files.py +145 -0
  142. websearch_kit-0.1.0/tests/security/__init__.py +0 -0
  143. websearch_kit-0.1.0/tests/security/test_ssrf_ranges.py +95 -0
  144. websearch_kit-0.1.0/tests/security/test_url_guard.py +363 -0
  145. websearch_kit-0.1.0/tests/unit/__init__.py +0 -0
  146. websearch_kit-0.1.0/tests/unit/pipeline_stubs.py +192 -0
  147. websearch_kit-0.1.0/tests/unit/test_assembly.py +128 -0
  148. websearch_kit-0.1.0/tests/unit/test_bm25_golden.py +158 -0
  149. websearch_kit-0.1.0/tests/unit/test_budget_golden.py +135 -0
  150. websearch_kit-0.1.0/tests/unit/test_caching.py +363 -0
  151. websearch_kit-0.1.0/tests/unit/test_circuit.py +192 -0
  152. websearch_kit-0.1.0/tests/unit/test_config_precedence.py +134 -0
  153. websearch_kit-0.1.0/tests/unit/test_contracts.py +147 -0
  154. websearch_kit-0.1.0/tests/unit/test_deadline.py +98 -0
  155. websearch_kit-0.1.0/tests/unit/test_expansion.py +222 -0
  156. websearch_kit-0.1.0/tests/unit/test_grammar.py +214 -0
  157. websearch_kit-0.1.0/tests/unit/test_kit.py +299 -0
  158. websearch_kit-0.1.0/tests/unit/test_observability.py +157 -0
  159. websearch_kit-0.1.0/tests/unit/test_pipeline.py +521 -0
  160. websearch_kit-0.1.0/tests/unit/test_prompts.py +147 -0
  161. websearch_kit-0.1.0/tests/unit/test_retry.py +250 -0
  162. websearch_kit-0.1.0/tests/unit/test_run_context.py +236 -0
  163. websearch_kit-0.1.0/tests/unit/test_sanitize_text.py +148 -0
  164. websearch_kit-0.1.0/tests/unit/test_sanitize_url.py +123 -0
  165. websearch_kit-0.1.0/uv.lock +3099 -0
@@ -0,0 +1,59 @@
+ name: CI
+
+ on:
+   push:
+     branches: [main]
+   pull_request:
+
+ jobs:
+   lint:
+     runs-on: ubuntu-latest
+     steps:
+       - uses: actions/checkout@v4
+       - uses: astral-sh/setup-uv@v5
+       - run: uv sync --all-extras
+       - run: uv run ruff check src tests
+       - run: uv run ruff format --check src tests
+       - run: uv run pyright src
+
+   test:
+     runs-on: ${{ matrix.os }}
+     strategy:
+       fail-fast: false
+       matrix:
+         os: [ubuntu-latest, macos-latest]
+         python-version: ["3.10", "3.11", "3.12", "3.13"]
+     steps:
+       - uses: actions/checkout@v4
+       - uses: astral-sh/setup-uv@v5
+       - run: uv sync --all-extras --python ${{ matrix.python-version }}
+       - run: uv run pytest --cov --cov-report=xml -q
+       - name: Coverage gate (critical modules)
+         run: |
+           uv run python - <<'EOF'
+           import xml.etree.ElementTree as ET
+           import sys
+
+           CRITICAL = ("config", "errors", "ranking/", "security/")
+           THRESHOLD = 0.90
+           tree = ET.parse("coverage.xml")
+           failed = False
+           for cls in tree.iter("class"):
+               fn = cls.get("filename", "")
+               if not any(c in fn for c in CRITICAL):
+                   continue
+               rate = float(cls.get("line-rate", "1"))
+               if rate < THRESHOLD:
+                   print(f"FAIL {fn}: {rate:.0%} < {THRESHOLD:.0%}")
+                   failed = True
+           sys.exit(1 if failed else 0)
+           EOF
+
+   base-install:
+     # The base package (no extras) must import cleanly — optional deps must stay optional.
+     runs-on: ubuntu-latest
+     steps:
+       - uses: actions/checkout@v4
+       - uses: astral-sh/setup-uv@v5
+       - run: uv venv && uv pip install .
+       - run: uv run python -c "import websearch_kit; print(websearch_kit.__version__)"
@@ -0,0 +1,60 @@
+ name: License audit
+
+ # Machine-enforces the MIT promise:
+ # - no GPL/AGPL anywhere in the dependency tree
+ # - trafilatura must be >=1.8.0 (earlier releases are GPLv3+)
+ # - PyMuPDF (AGPL) must never appear
+
+ on:
+   push:
+     branches: [main]
+   pull_request:
+     paths: ["pyproject.toml", "uv.lock", ".github/workflows/license-audit.yml"]
+
+ jobs:
+   audit:
+     runs-on: ubuntu-latest
+     steps:
+       - uses: actions/checkout@v4
+       - uses: astral-sh/setup-uv@v5
+       - run: uv sync --all-extras
+       - name: Forbid copyleft licenses
+         run: |
+           uv run pip-licenses --format=json > licenses.json
+           uv run python - <<'EOF'
+           import json
+           import sys
+
+           BANNED = ("GNU Affero", "AGPL", "GNU General Public License", "GPLv3", "GPLv2")
+           # LGPL is reviewed case-by-case; hard-ban the strong-copyleft families.
+           pkgs = json.load(open("licenses.json"))
+           bad = [
+               p for p in pkgs
+               if any(b.lower() in p["License"].lower() for b in BANNED)
+               and "LGPL" not in p["License"]
+           ]
+           for p in bad:
+               print(f"BANNED LICENSE: {p['Name']} {p['Version']} — {p['License']}")
+           sys.exit(1 if bad else 0)
+           EOF
+       - name: Assert trafilatura >= 1.8.0 and no PyMuPDF
+         run: |
+           uv run python - <<'EOF'
+           import sys
+           from importlib import metadata
+
+           v = metadata.version("trafilatura")
+           parts = tuple(int(x) for x in v.split(".")[:2])
+           assert parts >= (1, 8), f"trafilatura {v} predates the Apache-2.0 relicense"
+           try:
+               metadata.version("PyMuPDF")
+               sys.exit("PyMuPDF (AGPL) found in the dependency tree — banned")
+           except metadata.PackageNotFoundError:
+               pass
+           try:
+               metadata.version("fitz")
+               sys.exit("fitz/PyMuPDF (AGPL) found in the dependency tree — banned")
+           except metadata.PackageNotFoundError:
+               pass
+           print(f"OK: trafilatura {v}, no PyMuPDF")
+           EOF
@@ -0,0 +1,24 @@
+ name: Live tests (nightly)
+
+ # Opt-in tier hitting real providers/transports — catches upstream drift
+ # (provider schema changes, rate-limit behavior). Never blocks PRs.
+
+ on:
+   schedule:
+     - cron: "23 4 * * *"
+   workflow_dispatch:
+
+ jobs:
+   live:
+     runs-on: ubuntu-latest
+     if: ${{ github.repository_owner != '' }}
+     steps:
+       - uses: actions/checkout@v4
+       - uses: astral-sh/setup-uv@v5
+       - run: uv sync --all-extras
+       - run: uv run pytest -m live -q
+         env:
+           WSK_TAVILY_API_KEY: ${{ secrets.WSK_TAVILY_API_KEY }}
+           WSK_BRAVE_API_KEY: ${{ secrets.WSK_BRAVE_API_KEY }}
+           WSK_SERPER_API_KEY: ${{ secrets.WSK_SERPER_API_KEY }}
+           WSK_EXA_API_KEY: ${{ secrets.WSK_EXA_API_KEY }}
@@ -0,0 +1,45 @@
+ name: Publish to PyPI
+
+ on:
+   push:
+     tags: ["v*"]
+
+ jobs:
+   build:
+     runs-on: ubuntu-latest
+     steps:
+       - uses: actions/checkout@v4
+       - uses: astral-sh/setup-uv@v5
+       - run: uv build
+       - name: Check tag matches built version
+         # version is dynamic (hatch reads _version.py), so check the artifact itself
+         run: |
+           WHEEL=$(ls dist/*.whl)
+           PKG_VERSION=$(basename "$WHEEL" | cut -d- -f2)
+           TAG_VERSION="${GITHUB_REF_NAME#v}"
+           if [ "$PKG_VERSION" != "$TAG_VERSION" ]; then
+             echo "Tag $GITHUB_REF_NAME does not match built version $PKG_VERSION"
+             exit 1
+           fi
+       - run: uv run --with twine twine check dist/*.tar.gz dist/*.whl
+       - uses: actions/upload-artifact@v4
+         with:
+           name: dist
+           path: |
+             dist/*.tar.gz
+             dist/*.whl
+
+   publish:
+     needs: build
+     runs-on: ubuntu-latest
+     environment:
+       name: pypi
+       url: https://pypi.org/p/websearch-kit
+     permissions:
+       id-token: write # OIDC for PyPI trusted publishing — no API token
+     steps:
+       - uses: actions/download-artifact@v4
+         with:
+           name: dist
+           path: dist/
+       - uses: pypa/gh-action-pypi-publish@release/v1
@@ -0,0 +1,29 @@
+ # Reference clone (read-only research material, not part of this repo)
+ reference/
+
+ # Python
+ __pycache__/
+ *.py[cod]
+ *.egg-info/
+ dist/
+ build/
+ .venv/
+ .python-version
+
+ # Tooling caches
+ .pytest_cache/
+ .ruff_cache/
+ .coverage
+ coverage.xml
+ htmlcov/
+ .pyright/
+
+ # OS / editor
+ .DS_Store
+ .idea/
+ .vscode/
+
+ # Local secrets / env
+ .env
+ .env.*
+ .claude-workflows/
@@ -0,0 +1,39 @@
+ # Backlog
+
+ Follow-ups identified after 0.1.0. Not scheduled; ordered roughly by value.
+
+ ## Recency boost in the ranker
+
+ **Problem.** BM25 ranks purely on lexical relevance. On freshness-sensitive
+ queries ("latest stable Python version"), topically-rich but stale pages
+ outscore the one source with the current answer — observed in a live run on
+ 2026-06-05 where 2024/2025 news roundups ranked above the page that actually
+ answered the question.
+
+ **Direction.** An opt-in recency boost in the rank stage: when a result carries
+ a usable date signal (`published_date` from the provider, or a date extracted
+ by trafilatura), multiply/add a decay-weighted bonus so newer pages win ties
+ and near-ties. Must be a pure, golden-testable function alongside the BM25
+ math, off by default to preserve 0.1.x ranking parity, and surfaced in config
+ (e.g. `ranking.recency_boost: float = 0.0` + half-life). Results without dates
+ are never penalized — boost only, no decay-to-zero.
+
+ **Touch points.** `ranking/bm25.py` (or a sibling `ranking/recency.py`),
+ pipeline rank stage, config schema, SPEC.md ranking math section, golden tests.
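Review note: the recency boost proposed in this backlog entry could look roughly like the sketch below. It assumes an exponential half-life decay applied as a multiplicative bonus. The function name `apply_recency_boost` and the `half_life_days` default are hypothetical and not part of websearch-kit 0.1.0; only the `recency_boost = 0.0` default and the boost-only rule for undated results come from the backlog text.

```python
from datetime import date, timedelta
from typing import Optional


def apply_recency_boost(
    score: float,
    published: Optional[date],
    today: date,
    recency_boost: float = 0.0,     # 0.0 keeps plain BM25 ordering (off by default)
    half_life_days: float = 180.0,  # hypothetical default, not from the package
) -> float:
    """Return the BM25 score with an optional decay-weighted recency bonus."""
    if recency_boost <= 0.0 or published is None:
        return score  # undated results are never penalized
    age_days = max((today - published).days, 0)
    decay = 0.5 ** (age_days / half_life_days)  # 1.0 today, 0.5 after one half-life
    return score * (1.0 + recency_boost * decay)


if __name__ == "__main__":
    today = date(2026, 6, 5)
    fresh = apply_recency_boost(2.0, today, today, recency_boost=0.5)
    stale = apply_recency_boost(2.0, today - timedelta(days=180), today, recency_boost=0.5)
    print(fresh, stale)
```

Because the bonus only ever multiplies by a factor of at least 1.0, a dated page can overtake an undated one on a near-tie, but no page drops below its raw BM25 score.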
+
+ ## Configurable default for `max_context_chars`
+
+ **Problem.** The MCP `research` tool's context cap is a per-call parameter, but
+ its default (8000) is hardcoded in the tool signature
+ (`src/websearch_kit/mcp/tools.py`). Deployments that consistently want larger
+ or smaller context blocks must rely on every client passing the parameter on
+ every call.
+
+ **Direction.** Make the default server-configurable — CLI flag / env var /
+ config file (e.g. `mcp.default_max_context_chars`), with the per-call parameter
+ still overriding it. Keep the existing `ge=500, le=200_000` bounds; validate
+ the configured default against them at startup. Mirror the same default in the
+ SDK config so the two surfaces stay in lockstep (one engine, one config).
+
+ **Touch points.** `mcp/tools.py`, MCP server CLI/config plumbing, SDK config
+ schema, `docs/deployment/mcp.md`, `examples/mcp_config_examples.md`.
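Review note: the precedence this backlog entry asks for (per-call parameter, then server-configured default, then the current hardcoded 8000) can be sketched as below. The helper names `validate_max_context_chars` and `resolve_max_context_chars` are hypothetical; only the 8000 default and the 500 to 200_000 bounds are taken from the backlog text.

```python
from typing import Optional

HARDCODED_DEFAULT = 8000             # current tool-signature default per the backlog
MIN_CHARS, MAX_CHARS = 500, 200_000  # existing ge/le bounds per the backlog


def validate_max_context_chars(value: int) -> int:
    """Reject values outside the existing bounds (run once at startup for the default)."""
    if not MIN_CHARS <= value <= MAX_CHARS:
        raise ValueError(
            f"max_context_chars must be in [{MIN_CHARS}, {MAX_CHARS}], got {value}"
        )
    return value


def resolve_max_context_chars(
    per_call: Optional[int], configured_default: Optional[int]
) -> int:
    """Per-call value wins, then the configured default, then the hardcoded 8000."""
    if per_call is not None:
        return validate_max_context_chars(per_call)
    if configured_default is not None:
        return validate_max_context_chars(configured_default)
    return HARDCODED_DEFAULT
```

Keeping the resolution in one pure function would let the MCP surface and the SDK share it, which is the lockstep property the entry calls for.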
@@ -0,0 +1,81 @@
+ # Changelog
+
+ All notable changes to this project are documented in this file.
+
+ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
+ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html)
+ (see [VERSIONING.md](VERSIONING.md) for the pre-1.0 rules).
+
+ ## [0.1.0] - 2026-06-05
+
+ First release: one engine, three surfaces, with golden-tested ranking math
+ and a no-fail-silent degradation contract.
+
+ ### Added
+
+ **Core SDK**
+ - `SearchKit` async engine (`research` / `search` / `fetch` / `build_prompt` /
+   `health`) and `SyncSearchKit` blocking facade; stateless per call with
+   per-run `RunContext` (deadline, degradation log, stats, progress)
+ - Core contracts: `WebSearchConfig` (one canonical config; `WSK_*` env,
+   kwargs > env > defaults), shared pydantic models (`SearchResult`,
+   `PageContent`, `Source`, `Degradation`, `RunStats`, `ResearchReport`,
+   `SystemHealth`, `ProgressEvent`), typed error taxonomy with stable codes,
+   extension protocols (`SearchProvider`, `QueryExpander`, `Cache`,
+   `ProgressSink`)
+ - Pipeline: query expansion → oversampled multi-query search → sanitized-URL
+   dedup + binary pre-filter → concurrent fetch with **Gap-Filler** round-2
+   backfill (failures demote to the snippet pool, everything recorded) →
+   quality-gated extraction with snippet fallback → BM25 rerank vs the original
+   query with zero-score drops → adaptive per-source character budgeting →
+   numbered `[N]` context assembly with a relevance-filtered snippet pool and
+   1:1 citations
+ - Fetch layer: browser profile (40-UA rotation) / polite profile (honest bot
+   UA + robots.txt), SSRF guard on by default with **pinned-IP connect**
+   (Host/SNI preserved) and full per-hop redirect gating (SSRF + robots +
+   binary), streaming byte caps, whole-run deadline
+ - Extraction chain: trafilatura (precision→recall) → readability-lxml → plain
+   text, quality-gated; `sanitize_text` cleaning; snippet-fallback heuristics
+ - Providers: ddgs (zero-config default), SearXNG, Tavily, Brave, Serper, Exa,
+   plus the OWUI normalization shim; name→factory registry with
+   capabilities-as-data and `register_provider` extension hook
+ - Resilience: full-jitter retry (counted in `stats.retry_count`), per-provider
+   circuit breakers, ordered fallback chain (auth-error cooldown latch with
+   re-probe), active health probes and system aggregation
+ - Query expansion: NoOp (default) / OpenAI-compatible LLM / callback
+   expanders; reasoning-block stripping and JSON/line-split parsing; failures
+   degrade to the raw query
+ - Caching: memory TTL-LRU and sqlite backends behind a never-raising
+   `CacheGuard` (with teardown `close()`); three deterministic keyspaces
+   (search / extracted content / expansion) with tracking-param-stripped
+   content keys
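Review note: "full-jitter retry" in the Resilience entry usually means sleeping a uniformly random time between zero and a capped exponential backoff. The sketch below shows that general technique under stated assumptions; `full_jitter_delay`, `retry_async`, the choice of `ConnectionError` as the retryable error, and all defaults are hypothetical and are not the API of `resilience/retry.py`.

```python
import asyncio
import random
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


def full_jitter_delay(
    attempt: int,
    base: float = 0.25,
    cap: float = 8.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Delay before retry `attempt` (0-based): uniform in [0, min(cap, base * 2**attempt)]."""
    ceiling = min(cap, base * (2 ** attempt))
    return (rng or random).uniform(0.0, ceiling)


async def retry_async(
    op: Callable[[], Awaitable[T]],
    max_attempts: int = 3,
    base: float = 0.25,
    cap: float = 8.0,
) -> "tuple[T, int]":
    """Run `op`, retrying on ConnectionError; return (result, retry_count)."""
    retries = 0
    while True:
        try:
            return await op(), retries
        except ConnectionError:
            if retries + 1 >= max_attempts:
                raise  # attempts exhausted: surface the error, never swallow it
            await asyncio.sleep(full_jitter_delay(retries, base=base, cap=cap))
            retries += 1
```

Returning the retry count alongside the result is one way a caller could feed a counter such as `stats.retry_count`.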
+
+ **MCP server** (`[mcp]` extra)
+ - `websearch-kit-mcp` console script; stdio and streamable-HTTP transports
+ - Tools `web_search`, `fetch_page` (cursor pagination, raw mode, refusals as
+   explained outcomes), `research` (structured report + context text + one
+   resource link per `[N]` citation), `health` (live probes + config checks);
+   all typed structured output with read-only/open-world annotations
+ - `health://status` resource; background health-probe loop in HTTP mode;
+   per-provider kit pool with shared cache; per-request warning collection
+
+ **Open WebUI adapter** (`adapters/owui/`, `[owui]` extra)
+ - Single-file Filter: toggle pill, configurable `??` prefix with zero-syntax
+   or require-prefix modes, readable `--flags` grammar, bare-trigger context
+   extraction, history rewrite with web_search/retrieval feature lockdown,
+   pinned 0.9.6 event shapes (status / queries / `source` citations), skipped-
+   URL summaries, debug dump; fail-soft (failures are visible status lines,
+   the chat turn always survives)
+ - Single-file Tool variant exposing `web_search`/`research` for model-invoked
+   use
+ - `websearch_kit.owui._compat`: every OWUI-internal touchpoint feature-
+   detected in one shim (SearchForm `queries=`/`query=`, sync/async user
+   lookup, ShadowRequest config proxy, web-search-enabled gate)
+
+ **Project**
+ - SPEC.md (normative contract), 11 domain documents, 8 ADRs, architecture and
+   per-surface deployment docs, runnable examples
+ - CI: lint/type/test matrix, permissive-license audit, nightly live tier;
+   688 offline tests, pyright strict
+
+ [0.1.0]: https://github.com/rmarnold/websearch-kit/releases/tag/v0.1.0
@@ -0,0 +1,38 @@
+ # Contributing
+
+ ## Setup
+
+ ```bash
+ uv sync --all-extras # creates .venv with every optional extra + dev tools
+ uv run pytest -q # offline test tiers (unit, security, http, mcp, owui)
+ uv run ruff check src tests && uv run ruff format src tests
+ uv run pyright src # strict mode
+ ```
+
+ ## Test tiers
+
+ | Tier | Marker/dir | Network | When it runs |
+ |---|---|---|---|
+ | unit / golden | `tests/unit` | none | every commit |
+ | security (SSRF) | `tests/security` | none (stubbed DNS) | every commit |
+ | http (providers/fetch) | `tests/http` | none (respx fixtures) | every commit |
+ | mcp | `tests/mcp` | none (in-memory transport) | every commit |
+ | owui compat | `tests/owui` | none (stubbed open_webui) | every commit |
+ | live | `@pytest.mark.live` | real | nightly CI / opt-in (`pytest -m live`) |
+
+ Recorded provider fixtures live in `tests/http/fixtures/`. When a provider's API
+ drifts, re-record the fixture from a real response (strip any credentials) and
+ update the mapper + changelog together.
+
+ ## Rules of the road
+
+ - **No fail-silent.** Never swallow an exception: classify it into the error
+   taxonomy (`errors.py`) and either raise it or record a `Degradation`.
+   `except: pass` will be rejected in review (and flagged by ruff S110).
+ - **License discipline.** Permissive deps only (MIT/BSD/Apache-2.0).
+   `trafilatura>=1.8.0` is load-bearing; PyMuPDF is banned. CI enforces this.
+ - **Public API changes** require a VERSIONING.md-conscious changelog entry;
+   load-bearing design changes require an ADR (`docs/adr/`, MADR format).
+ - **Domain docs.** Behavior changes in a subsystem must update its standard
+   document under `docs/domains/`.
+ - Conventional, descriptive commit messages; one logical change per commit.
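Review note: the "No fail-silent" rule above (classify, then either raise or record a `Degradation`) can be illustrated with the sketch below. Every name in it is a hypothetical stand-in: the real `Degradation` is a pydantic model and the real taxonomy lives in `errors.py`, neither of which is shown in this part of the diff.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Degradation:  # hypothetical stand-in for the package's model
    stage: str
    code: str
    detail: str


class KitError(Exception):  # hypothetical base of the error taxonomy
    code = "internal_error"
    fatal = True


class FetchTimeout(KitError):
    code = "fetch_timeout"
    fatal = False  # recoverable: the run continues without this page


def classify(exc: Exception) -> KitError:
    """Map a raw exception onto the taxonomy instead of swallowing it."""
    if isinstance(exc, KitError):
        return exc
    if isinstance(exc, TimeoutError):
        return FetchTimeout(str(exc))
    return KitError(str(exc))


def fetch_all(
    urls: Iterable[str],
    fetch_one: Callable[[str], str],
    degradations: List[Degradation],
) -> Dict[str, str]:
    """Fetch every URL; each failure is either raised or recorded, never dropped."""
    pages: Dict[str, str] = {}
    for url in urls:
        try:
            pages[url] = fetch_one(url)
        except Exception as exc:
            err = classify(exc)
            if err.fatal:
                if err is exc:
                    raise
                raise err from exc
            degradations.append(Degradation("fetch", err.code, f"{url}: {exc}"))
    return pages
```

The point of the pattern is that the `except` branch has exactly two exits, a raise or an appended record, so a dropped page always leaves a trace.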
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2026 Robert Arnold
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
@@ -0,0 +1,190 @@
+ Metadata-Version: 2.4
+ Name: websearch-kit
+ Version: 0.1.0
+ Summary: Web search, fetch, and research pipeline for LLMs — usable as a Python SDK, a standalone MCP server, and an Open WebUI plugin.
+ Project-URL: Homepage, https://github.com/rmarnold/websearch-kit
+ Project-URL: Changelog, https://github.com/rmarnold/websearch-kit/blob/main/CHANGELOG.md
+ Author: Robert Arnold
+ License-Expression: MIT
+ License-File: LICENSE
+ Keywords: llm,mcp,model-context-protocol,open-webui,rag,scraping,search,web-search
+ Classifier: Development Status :: 4 - Beta
+ Classifier: Intended Audience :: Developers
+ Classifier: Operating System :: OS Independent
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Classifier: Programming Language :: Python :: 3.13
+ Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
+ Classifier: Typing :: Typed
+ Requires-Python: >=3.10
+ Requires-Dist: httpx[brotli,http2,zstd]>=0.27
+ Requires-Dist: protego>=0.3
+ Requires-Dist: pydantic-settings>=2.2
+ Requires-Dist: pydantic>=2.7
+ Requires-Dist: readability-lxml>=0.8.1
+ Requires-Dist: trafilatura>=1.8.0
+ Provides-Extra: all
+ Requires-Dist: ddgs>=9.0; extra == 'all'
+ Requires-Dist: fastembed>=0.5; extra == 'all'
+ Requires-Dist: mcp<2,>=1.12; extra == 'all'
+ Requires-Dist: pypdf>=5.0; extra == 'all'
+ Provides-Extra: ddgs
+ Requires-Dist: ddgs>=9.0; extra == 'ddgs'
+ Provides-Extra: mcp
+ Requires-Dist: mcp<2,>=1.12; extra == 'mcp'
+ Provides-Extra: owui
+ Provides-Extra: pdf
+ Requires-Dist: pypdf>=5.0; extra == 'pdf'
+ Provides-Extra: rerank
+ Requires-Dist: fastembed>=0.5; extra == 'rerank'
+ Description-Content-Type: text/markdown
+
+ # websearch-kit
+
+ > Web search, fetch, and research pipeline for LLMs — one engine, three surfaces:
+ > a Python **SDK**, a standalone **MCP server**, and an **Open WebUI** plugin.
+
+ Query expansion → multi-provider search → SSRF-guarded concurrent fetching
+ (40-UA browser profile, pinned-IP connect) → trafilatura extraction → BM25
+ rerank with adaptive context budgeting → numbered, citable context for your LLM.
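Review note: for readers unfamiliar with the BM25 rerank step named in this pipeline summary, the sketch below shows textbook Okapi BM25 with zero-score results dropped. It is an illustration under stated assumptions, not the package's `ranking/bm25.py`: the whitespace tokenizer, `k1 = 1.5`, `b = 0.75`, and the names `bm25_scores` and `rerank` are all hypothetical choices.

```python
import math
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenizer (an assumption; real tokenizers differ)."""
    return text.lower().split()


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n_docs if n_docs else 0.0
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1.0 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores


def rerank(query: str, docs: List[str]) -> List[Tuple[str, float]]:
    """Sort by descending score and drop documents that share no query term."""
    scored = sorted(zip(bm25_scores(query, docs), docs), key=lambda p: p[0], reverse=True)
    return [(doc, s) for s, doc in scored if s > 0.0]
```

Scoring against the original query, rather than the expanded ones, keeps the ranking anchored to what the user actually asked.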
53
+
54
+ **No fail-silent:** a call either raises a typed error or returns a result where
55
+ every dropped, blocked, truncated, or substituted item is enumerated as a
56
+ structured `Degradation`. On the live web that looks like:
57
+
58
+ ```text
59
+ ok : True partial: True
60
+ sources : 10 # 5 fetched pages + 5 relevance-filtered snippets
61
+ warnings :
62
+ - [fetch] https://cloud.google.com/...: response exceeded byte cap (1054971 > 1048576 bytes)
63
+ stats : 10 raw -> 10 unique, 5 fetched, context 23471 chars,
64
+ timings {'search': 854, 'fetch': 1662, 'extract': 878, 'rank': 3}
65
+ ```
66
+
67
+ **Status: 0.1.0.** See [SPEC.md](SPEC.md), [CHANGELOG.md](CHANGELOG.md).
68
+
69
+ ## Features
70
+
71
+ - **One engine, three surfaces** — the SDK core is the only pipeline; the MCP
+   server and Open WebUI adapters are thin translators over it, so behavior and
+   config semantics never drift between surfaces.
+ - **Full research pipeline** — `research()` runs search → fetch → extract →
+   rank → assemble in one call and returns a numbered `[N]` context block with
+   1:1 source citations, ready to drop into a prompt.
+ - **Multi-provider search** — zero-key `ddgs` out of the box; keyed
+   Tavily / Brave / Serper / Exa and self-hosted SearXNG via config; ordered
+   fallback chains with per-provider circuit breakers.
+ - **Hardened fetching** — SSRF guard (private / reserved / metadata IP ranges
+   blocked at connect time with pinned-IP enforcement), per-response byte caps,
+   rotating 40-UA browser profile, concurrent fetches with deadline budgeting.
+ - **Quality extraction & ranking** — trafilatura article extraction, BM25
+   reranking (golden-tested math), adaptive context budgeting: the most relevant
+   pages get more of the character budget, marginal ones shrink, noise is dropped.
+ - **No-fail-silent contract** — every degradation (blocked URL, truncated page,
+   provider fallback, budget cut) is a typed, enumerable warning; nothing
+   disappears without a trace.
+ - **LLM query expansion (optional)** — expand a question into multiple search
+   queries via any OpenAI-compatible endpoint or an injected callback.
+ - **Caching** — in-memory by default, sqlite persistence a config flag away.
+ - **Typed throughout** — pyright-strict clean, structured results on every
+   surface (Pydantic models in the SDK, JSON structured output over MCP).
+ - **688+ tests** including live-web smoke suites and hand-computed golden tests.
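
To make the connect-time check from the hardened-fetching bullet concrete, here is a minimal standard-library sketch. It is not the package's implementation: `is_blocked_ip` and `resolve_and_pin` are hypothetical names, and the authoritative ruleset is whatever SPEC.md defines. The idea is to resolve the host once, reject any non-public address, and then connect to the vetted IP instead of resolving the hostname a second time.

```python
import ipaddress
import socket


def is_blocked_ip(addr: str) -> bool:
    """Return True for addresses a fetcher should refuse to connect to."""
    ip = ipaddress.ip_address(addr)
    # Unwrap IPv4-mapped IPv6 (::ffff:a.b.c.d) so it cannot bypass the IPv4 checks.
    if isinstance(ip, ipaddress.IPv6Address) and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local  # covers 169.254.169.254, the usual cloud metadata address
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )


def resolve_and_pin(host: str, port: int = 443) -> str:
    """Resolve `host` once and return a vetted IP to connect to (hypothetical helper)."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    addrs = [info[4][0] for info in infos]
    if not addrs:
        raise OSError(f"no addresses resolved for {host}")
    for addr in addrs:
        # Reject the host if *any* resolved address is non-public.
        if is_blocked_ip(addr):
            raise PermissionError(f"blocked address for {host}: {addr}")
    return addrs[0]
```

Pinning matters because a hostname that is resolved again between the check and the connection can be repointed at an internal address in the meantime (DNS rebinding). Connecting to the returned IP closes that gap.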
+
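The ranking and budgeting bullet can be illustrated with a short sketch. The scoring function below is textbook Okapi BM25 with the common defaults `k1=1.5` and `b=0.75`. The proportional budget split, the `min_share` cutoff, and both function names are assumptions made for illustration; the package's golden-tested math is defined in SPEC.md and may differ.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with textbook Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs if n_docs else 0.0
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(query.lower().split())
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def allocate_budget(scores: list[float], total_chars: int, min_share: float = 0.05) -> list[int]:
    """Split a character budget in proportion to score (assumed policy, not the package's)."""
    total = sum(scores)
    if total <= 0:
        return [0] * len(scores)
    # Pages whose share of the total score is below `min_share` are dropped as noise.
    kept = [s if s / total >= min_share else 0.0 for s in scores]
    kept_total = sum(kept)
    if kept_total == 0:  # every page fell under the cutoff: keep them all
        kept, kept_total = scores, total
    return [int(total_chars * (s / kept_total)) for s in kept]
```

Under this scheme a page with three times the score of another receives three times the characters, and a page with no matching terms receives none.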
+ ## How to use
+
+ ### Python SDK
+
+ ```bash
+ pip install "websearch-kit[ddgs]"  # ddgs = the zero-API-key search provider
+ ```
+
+ ```python
+ import asyncio
+ from websearch_kit import SearchKit
+
+ async def main():
+     async with SearchKit() as kit:  # zero-config: ddgs, no keys, no LLM
+         report = await kit.research("RISC-V vs ARM datacenter adoption")
+         print(report.context)  # numbered [N] block for your LLM
+         for s in report.sources:
+             print(f"[{s.n}] {s.title} — {s.url}")
+         print(report.warnings)  # everything the run degraded on
+
+ asyncio.run(main())
+ ```
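
For readers wondering what a numbered `[N]` block with 1:1 citations might look like under the hood, the following is a rough sketch and not the package's code. The `n`, `title`, and `url` fields mirror the attributes used in the example above. The `text` field, the `build_context` function, and the simple head-truncation policy are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Source:
    n: int      # citation number, as in `s.n` above
    title: str
    url: str
    text: str   # extracted page text actually included (hypothetical field)


def build_context(pages: list[tuple[str, str, str]], budget_chars: int) -> tuple[str, list[Source]]:
    """Number pages from 1 and emit one `[N]` section per source, within the budget."""
    sources: list[Source] = []
    parts: list[str] = []
    remaining = budget_chars
    for title, url, text in pages:
        if remaining <= 0:
            break
        body = text[:remaining]  # truncate the page to what is left of the budget
        n = len(sources) + 1
        sources.append(Source(n=n, title=title, url=url, text=body))
        parts.append(f"[{n}] {title}\n{url}\n{body}")
        remaining -= len(body)
    return "\n\n".join(parts), sources
```

Because each section number is derived from the position in `sources`, every `[N]` in the context string maps to exactly one entry in the source list. A pipeline following the no-fail-silent contract would also record each truncation or skipped page as a warning instead of cutting quietly as this sketch does.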
+
+ Beyond `research()`, the kit exposes the pipeline stages individually:
+
+ ```python
+ # run inside an `async with SearchKit() as kit:` block
+ results = await kit.search("python 3.14 free threading", count=5)    # snippets only
+ pages = await kit.fetch(["https://docs.python.org/3.14/whatsnew/"])  # URLs, extracted
+ status = await kit.health()                                          # provider probe
+ ```
+
+ Prefer blocking code? `SyncSearchKit` mirrors the async API 1:1. Keyed
+ providers, fallback chains, sqlite caching, and LLM query expansion are all
+ a config change away — see [docs/deployment/sdk.md](docs/deployment/sdk.md) and
+ [examples/](examples/).
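
As an illustration of what an ordered fallback chain with per-provider circuit breakers does, here is a small self-contained sketch. None of these names come from the package: `CircuitBreaker`, `search_with_fallback`, the failure threshold, and the cooldown are all hypothetical, and the real behavior is configured through the files linked above.

```python
import time
from typing import Callable, Optional

Provider = Callable[[str], list[str]]


class CircuitBreaker:
    """Open after `max_failures` consecutive errors; allow a retry after `cooldown` seconds."""

    def __init__(self, max_failures: int = 3, cooldown: float = 30.0) -> None:
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        return time.monotonic() - self.opened_at >= self.cooldown

    def record(self, ok: bool) -> None:
        if ok:
            self.failures, self.opened_at = 0, None
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()


def search_with_fallback(
    query: str,
    chain: list[tuple[str, Provider]],
    breakers: dict[str, CircuitBreaker],
) -> tuple[list[str], list[str]]:
    """Try providers in order; return (results, warnings) so no fallback is silent."""
    warnings: list[str] = []
    for name, provider in chain:
        breaker = breakers.setdefault(name, CircuitBreaker())
        if not breaker.allow():
            warnings.append(f"[search] {name}: skipped, circuit open")
            continue
        try:
            results = provider(query)
        except Exception as exc:
            breaker.record(ok=False)
            warnings.append(f"[search] {name}: failed ({exc}), falling back")
            continue
        breaker.record(ok=True)
        return results, warnings
    return [], warnings
```

The breaker keeps a repeatedly failing provider from adding latency to every request, while the returned warnings keep each skip and fallback visible to the caller.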
+
+ ### MCP server
+
+ Add to your MCP client config (Claude Code, Claude Desktop, or any MCP client):
+
+ ```json
+ {
+   "mcpServers": {
+     "websearch-kit": {
+       "command": "uvx",
+       "args": ["--from", "websearch-kit[mcp,ddgs]", "websearch-kit-mcp"]
+     }
+   }
+ }
+ ```
+
+ Four read-only tools with typed structured output, over stdio or streamable HTTP:
+
+ | Tool | What it does |
+ |------|--------------|
+ | `web_search` | Snippet-level results — context-economical |
+ | `fetch_page` | Read one URL as markdown, cursor pagination for long pages |
+ | `research` | Full pipeline → `[N]` context block + one resource link per citation |
+ | `health` | Provider latency, circuit-breaker state, config checks |
+
+ For HTTP transport, scaling, and hardening flags see
+ [docs/deployment/mcp.md](docs/deployment/mcp.md) and
+ [examples/mcp_config_examples.md](examples/mcp_config_examples.md).
+
+ ### Open WebUI
+
+ Install the single-file filter from
+ [`adapters/owui/websearch_kit_filter.py`](adapters/owui/websearch_kit_filter.py)
+ (Admin Panel → Functions → Import). It pip-installs this SDK automatically via
+ its frontmatter `requirements:` line and uses your instance's configured web
+ search out of the box.
+
+ Toggle the pill to research every message, or trigger a one-off search:
+
+ ```
+ ?? quantum routers --count 12 --lang en --reply de --fresh week
+ ```
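
To show how a one-off trigger like the one above could be split into a query and options, here is a minimal sketch. The flag names are taken from the example line. Their types, the `parse_trigger` function, and the rule that unknown flags fall through into the query are assumptions, and the filter itself may parse messages differently.

```python
from typing import Optional

# Flags seen in the example above; the value types are assumed.
KNOWN_FLAGS = {"count": int, "lang": str, "reply": str, "fresh": str}


def parse_trigger(message: str) -> Optional[dict]:
    """Return {'query': ..., <options>} if `message` starts with `??`, else None."""
    if not message.startswith("??"):
        return None
    tokens = message[2:].split()
    query_words: list[str] = []
    options: dict = {}
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        name = tok[2:]
        if tok.startswith("--") and name in KNOWN_FLAGS and i + 1 < len(tokens):
            # Convert the value with the flag's type; a bad --count raises ValueError.
            options[name] = KNOWN_FLAGS[name](tokens[i + 1])
            i += 2
        else:
            query_words.append(tok)
            i += 1
    return {"query": " ".join(query_words), **options}
```

Messages without the `??` prefix return `None`, which a caller can treat as "no research requested for this message".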
+
+ A Tool variant for model-invoked (agentic) use ships alongside it. See
+ [docs/deployment/owui.md](docs/deployment/owui.md).
+
+ ## Documentation
+
+ - [SPEC.md](SPEC.md) — normative behavioral contract (pipeline semantics,
+   ranking math, degradation codes, SSRF ruleset)
+ - [docs/architecture.md](docs/architecture.md) — how the engine is layered
+ - [docs/domains/](docs/domains/) — one standard document per domain
+ - [docs/adr/](docs/adr/) — the eight load-bearing decisions
+ - [VERSIONING.md](VERSIONING.md) — SemVer policy and public-API definition
+ - [SECURITY.md](SECURITY.md) — SSRF posture and threat model
+
+ ## License
+
+ MIT — with a CI-enforced permissive-only dependency policy (no GPL/AGPL;
+ `trafilatura>=1.8.0` pinned for its Apache-2.0 relicense).