thealgorithms-mcp 0.1.0__tar.gz

@@ -0,0 +1,15 @@
1
+ # Python
2
+ __pycache__/
3
+ *.py[cod]
4
+ *.egg-info/
5
+ .eggs/
6
+
7
+ # Build artifacts
8
+ /dist/
9
+ /build/
10
+
11
+ # uv
12
+ .venv/
13
+
14
+ # OS
15
+ .DS_Store
@@ -0,0 +1,146 @@
1
+ # TheAlgorithms MCP — Design Spec
2
+
3
+ **Status:** reviewed + concepts validated — ready to build · **Date:** 2026-05-30 · **Owner:** mcande21
4
+
5
+ An MCP server that lets an LLM efficiently query [TheAlgorithms/Python](https://github.com/TheAlgorithms/Python)
6
+ for algorithm implementations and their built-in usage examples (doctests).
7
+
8
+ ---
9
+
10
+ ## 1. Goal & scope
11
+
12
+ - **Goal:** "query the system → get the implementation + examples" cheaply, mid-task, without the
13
+ model burning tokens browsing GitHub.
14
+ - **Scope (v1):** Python repo only (**1,160 algorithms, 44 categories** — measured, not estimated;
15
+ richest doctests in the org).
16
+ - **Storage model:** **Hybrid** — cache the small `DIRECTORY.md` index locally for instant fuzzy
17
+ search; fetch file *contents* on-demand from raw GitHub. Tiny footprint, near-zero staleness.
18
+ - **Out of scope (v1):** other languages, semantic/embedding search, write access, running code.
19
+
20
+ ## 2. Why this maps cleanly
21
+
22
+ - `DIRECTORY.md` is a free, pre-built index: `## Category` headers + `* [Name](path)` rows.
23
+ - Subcategory hierarchy lives in the path (`data_structures/binary_tree/avl_tree.py`) — **derive
24
+ category from `path.split('/')[0]`**; ignore header depth. One regex parses the whole file.
25
+ - Every file is self-contained with a module docstring and **doctests (`>>>`)** — the usage
26
+ examples ship *inside the source*, so we never synthesize examples.
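The index format described above can be sketched in a few lines. This is a minimal illustration, assuming the `* [Name](path)` row shape the spec describes; `SAMPLE` is made-up input, not an excerpt of the live `DIRECTORY.md`, and `parse_directory` is a hypothetical name.

```python
import re

# Same pattern shape as the spec's index regex.
ENTRY_RE = re.compile(r"^\s*\* \[(?P<name>.+?)\]\((?P<path>.+?\.py)\)\s*$")

# Made-up sample in the DIRECTORY.md style (headers, nested bullets, link rows).
SAMPLE = """\
## Sorts
  * [Merge Sort](sorts/merge_sort.py)
  * [Quick Sort](sorts/quick_sort.py)

## Data Structures
  * Binary Tree
    * [Avl Tree](data_structures/binary_tree/avl_tree.py)
"""


def parse_directory(text: str) -> list[dict]:
    """Return one entry per link row; category comes from the path, not the header."""
    entries = []
    for line in text.splitlines():
        m = ENTRY_RE.match(line)
        if m:
            path = m.group("path")
            entries.append(
                {"name": m.group("name"), "path": path, "category": path.split("/")[0]}
            )
    return entries


entries = parse_directory(SAMPLE)
for e in entries:
    print(e["category"], e["name"], e["path"])
```

Headers and plain (unlinked) bullets are skipped because they never match the pattern, which is why header depth can be ignored.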
27
+
28
+ ## 3. Data sources
29
+
30
+ | What | URL |
31
+ |------|-----|
32
+ | Index | `https://raw.githubusercontent.com/TheAlgorithms/Python/master/DIRECTORY.md` |
33
+ | File | `https://raw.githubusercontent.com/TheAlgorithms/Python/master/<path>` |
34
+
35
+ - Branch: `master`. Anonymous `raw.githubusercontent.com` (no token, no API rate limit).
36
+ - Responses carry an **ETag** → conditional `If-None-Match` GET returns `304` when unchanged.
37
+ Cache validation costs ~0 bytes. **Verified empirically** (probe got a real ETag + a `304` on
38
+ revalidation) — this overrides the review's claim that raw GitHub omits ETags. If GitHub ever
39
+ drops the header, `fetch.py` falls back to the TTL timer (already specified), so we're safe either way.
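The revalidation flow above (TTL window, then conditional GET, then `304` reuse) can be sketched without touching the network. This is an illustration under stated assumptions: `refresh` and `fake_fetch` are hypothetical names, and the fetcher is a stand-in for a real HTTP call that would send `If-None-Match`.

```python
import time
from typing import Callable, Optional

TTL_SECONDS = 24 * 3600  # same default the spec uses for the index

# A fetcher takes an optional ETag and returns (status, body, etag).
Fetcher = Callable[[Optional[str]], tuple[int, Optional[str], Optional[str]]]


def refresh(cache: Optional[dict], fetch: Fetcher, now: Optional[float] = None) -> dict:
    """Return a fresh cache record, reusing the cached body on a 304."""
    now = time.time() if now is None else now
    if cache and now - cache["fetched_at"] < TTL_SECONDS:
        return cache  # still inside the TTL window: no request at all
    status, body, etag = fetch(cache.get("etag") if cache else None)
    if status == 304 and cache:
        return {**cache, "fetched_at": now}  # unchanged upstream: keep body, bump timestamp
    if status != 200:
        raise RuntimeError(f"unexpected status {status}")
    return {"body": body, "etag": etag, "fetched_at": now}


def fake_fetch(if_none_match: Optional[str]):
    """Hypothetical stand-in for an HTTP GET; the server's current ETag is "v1"."""
    if if_none_match == '"v1"':
        return 304, None, None
    return 200, "index text", '"v1"'


first = refresh(None, fake_fetch, now=0.0)
second = refresh(first, fake_fetch, now=float(TTL_SECONDS + 1))
print(first["etag"], second["fetched_at"])
```

If the server ever stopped sending an ETag, the stored `etag` would be `None`, the fetcher would receive no validator, and each TTL expiry would simply cost one full `200` download.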
40
+
41
+ ## 4. Tools (the MCP surface)
42
+
43
+ The token-efficiency core: **search returns only paths + one-liners (cheap); full source is pulled
44
+ only on demand, and `get_algorithm` takes `include_source` so the model can grab just the docstring and examples.**
45
+
46
+ ### `list_categories() -> [{category, count}]`
47
+ Top-level categories with entry counts. Served from cached index.
48
+
49
+ ### `search_algorithms(query: str, category?: str, limit: int = 10) -> [{name, category, path, score}]`
50
+ Fuzzy lexical rank over `name + category + path` tokens (rapidfuzz). No file fetches — index only.
51
+ Returns paths the model feeds to `get_algorithm`.
52
+
53
+ ### `get_algorithm(path: str, include_source: bool = True) -> {...}`
54
+ On-demand fetch of one file (cached by path). Always returns the docstring + doctests together —
55
+ that's the common case (search → read implementation + its examples in one call). The review flagged
56
+ the original 3-mode enum as wrong-altitude: the model can't know whether a file has doctests before
57
+ fetching, so forcing an upfront `mode` choice causes wasted round-trips. Dropped.
58
+ - Returns `{path, github_url, description, doctests[], line_count, source?}`.
59
+ - `include_source=False` is the only knob — a cheap peek (docstring + examples, no body) for
60
+ "is this the right file" disambiguation. Defaults to full.
61
+
62
+ ### `get_category(category: str) -> [{name, path}]` *(core, phase 1)*
63
+ All entries in one category — for "show me every sort." One-liner over the cached index
64
+ (`filter path.split('/')[0] == category`). Promoted from optional: without it the model can only
65
+ enumerate a category via a search query, which is awkward for "list everything in X."
66
+
67
+ ## 5. Internals
68
+
69
+ ```
70
+ src/thealgorithms_mcp/
71
+ server.py # FastMCP app, tool registration, stdio transport
72
+ index.py # fetch + parse + cache DIRECTORY.md (ETag, TTL)
73
+ fetch.py # raw file fetch, cached by path (+commit/etag)
74
+ search.py # rapidfuzz ranking over the parsed index
75
+ parse.py # extract module docstring + doctest blocks from source
76
+ ```
77
+
78
+ - **Index parse:** regex `^\s*\* \[(?P<name>.+?)\]\((?P<path>.+?\.py)\)$`; `category = path.split('/')[0]`.
79
+ Measured **100% match rate** on the live file (1,160/1,160 lines, all `.py`, zero root-level paths).
80
+ **Drift guard:** count matched vs. total `* [..](..)` lines; if match rate < 95% on a refresh, log
81
+ loudly and keep the prior cached index rather than silently dropping entries.
82
+ - **Index cache:** `~/.cache/thealgorithms-mcp/directory.json` = `{entries, etag, fetched_at}`.
83
+ Refresh on TTL miss (default 24h) via conditional GET; `304` → bump `fetched_at`, keep entries.
84
+ - **File cache:** `~/.cache/thealgorithms-mcp/files/<path>` keyed by path; revalidate via ETag.
85
+ - **Doctest extraction:** scan the **whole source** for `>>>` / `...` continuation lines + following
86
+ expected-output lines, grouped into blocks. Verified this catches *every* doctest regardless of
87
+ whether it lives in the module, a class, or a function docstring (probe found all 3 blocks in
88
+ `merge_sort.py` via a flat source scan). This is simpler and more complete than walking the AST for
89
+ per-node docstrings, which would find nothing the flat scan misses. Separately use `ast.get_docstring(module)` for
90
+ the human-readable **description** field. (Review flagged module-only extraction as a bug — the
91
+ whole-source scan is the fix, and it's less machinery than the AST-walk alternative proposed.)
92
+ - **Offline:** any fetch failure falls back to cache; tools degrade, never hard-crash.
93
+ - **Errors:** unknown `path` → error that suggests calling `search_algorithms` first.
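The whole-source doctest scan described above could look roughly like this. `parse.py` is not part of this diff, so this is a guess at the technique rather than the shipped code; `extract_doctests` and `SAMPLE` are hypothetical.

```python
import ast


def extract_doctests(source: str) -> list[str]:
    """Group `>>>` lines, `...` continuations, and expected output into blocks."""
    blocks: list[str] = []
    current: list[str] = []
    for raw in source.splitlines():
        line = raw.strip()
        if line.startswith(">>>"):
            current.append(line)
        elif current and line and not line.startswith(('"""', "'''")):
            current.append(line)  # continuation ("...") or expected output
        elif current:
            # A blank line or closing docstring quotes ends the current block.
            blocks.append("\n".join(current))
            current = []
    if current:
        blocks.append("\n".join(current))
    return blocks


# Made-up module with one module-level and one function-level docstring.
SAMPLE = '''
"""Toy module.

>>> add(1, 2)
3
"""

def add(a, b):
    """
    >>> add(2, 2)
    4
    >>> add(
    ...     1, 1)
    2
    """
    return a + b
'''

blocks = extract_doctests(SAMPLE)
description = ast.get_docstring(ast.parse(SAMPLE))
print(len(blocks), "blocks;", description.splitlines()[0])
```

One limitation of a flat scan like this: prose that follows an expected-output line with no blank line in between is absorbed into the block, so a real implementation may want a stricter end-of-example rule.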
94
+
95
+ ## 6. Dependencies & runtime
96
+
97
+ - `mcp` (FastMCP), `httpx`, `rapidfuzz`, `platformdirs`. Stdlib `ast`/`re` for parsing.
98
+ - Python 3.11+. **stdio** transport (local).
99
+ - Register in `~/.normandy-generic/mcp.json`:
100
+ ```json
101
+ { "thealgorithms": { "command": "uvx", "args": ["thealgorithms-mcp"] } }
102
+ ```
103
+
104
+ ## 7. Build phases
105
+
106
+ 1. **Index + search** — `index.py` (fetch/parse/cache + drift guard) + `search.py` +
107
+ `list_categories` / `search_algorithms` / `get_category`. Verify ranking on real queries.
108
+ 2. **Retrieval** — `fetch.py` + `parse.py` + `get_algorithm` (`include_source` toggle). Verify
109
+ whole-source doctest extraction on a function-heavy file (not just `merge_sort`).
110
+ 3. **Polish** — ETag revalidation (TTL fallback), offline degradation, README, `pyproject.toml`
111
+ `uvx` entry point.
112
+
113
+ ## 8. Open questions / future
114
+
115
+ - **v2 multi-language:** parameterize repo (`Python`→`Java`/`Rust`/…); add `compare(name)` for
116
+ same-algorithm-across-languages. Index key becomes `(lang, path)`.
117
+ - **Semantic search:** only if lexical fuzzy proves insufficient (would shift toward the "full local
118
+ clone + embeddings" model we deferred).
119
+ - **Normandy packaging:** could also wrap as a `/biking`-style skill instead of/alongside the MCP if
120
+ we want it inside the framework's skill surface — decide after v1.
121
+
122
+ ---
123
+
124
+ ## Appendix A — Concept validation (probe, 2026-05-30)
125
+
126
+ Stdlib-only probe (`urllib` + `ast` + `difflib`) against the live repo. difflib stands in for
127
+ production rapidfuzz; urllib for httpx. All four load-bearing concepts passed:
128
+
129
+ | Concept | Result |
130
+ |---------|--------|
131
+ | **Parse `DIRECTORY.md`** | 200 OK, **1,160 entries / 44 categories** in 0.1s. 100% regex match, all `.py`, zero root-level paths. |
132
+ | **ETag conditional GET** | Real ETag returned; revalidation → **`304`**. Cache validation is free. |
133
+ | **Fetch + extract** | `sorts/merge_sort.py`: docstring present, 3 doctest blocks found via flat source scan. |
134
+ | **Fuzzy ranking** | Correct top hit for all of: binary search, dijkstra, knapsack, merge sort, lru cache. |
135
+
136
+ ## Appendix B — Review dispositions (adversarial pass)
137
+
138
+ | # | Finding | Disposition |
139
+ |---|---------|-------------|
140
+ | 1 | "raw GitHub omits ETags; conditional GET is fiction" | **Rejected** — probe proves ETag + `304` work. TTL fallback kept as belt-and-suspenders. |
141
+ | 2 | DIRECTORY.md format edge cases (non-`.py`, root-level, nested parens) | **Downgraded** — none exist in live file (100% match). Kept the drift guard as cheap insurance. |
142
+ | 3 | `mode` enum is wrong-altitude for an LLM | **Accepted** — replaced with `include_source: bool`. |
143
+ | 4 | Doctest extraction misses function/class docstrings | **Accepted (simplified)** — whole-source `>>>` scan; proven complete. |
144
+ | 5 | `get_category` shouldn't be optional | **Accepted** — promoted to phase-1 core tool. |
@@ -0,0 +1,72 @@
1
+ Metadata-Version: 2.4
2
+ Name: thealgorithms-mcp
3
+ Version: 0.1.0
4
+ Summary: MCP server for querying TheAlgorithms/Python — search algorithms and fetch implementations with their doctests as examples.
5
+ Requires-Python: >=3.11
6
+ Requires-Dist: httpx>=0.27
7
+ Requires-Dist: mcp>=1.2.0
8
+ Requires-Dist: platformdirs>=4.0
9
+ Requires-Dist: rapidfuzz>=3.9
10
+ Description-Content-Type: text/markdown
11
+
12
+ # thealgorithms-mcp
13
+
14
+ mcp-name: io.github.mcande21/thealgorithms-mcp
15
+
16
+ An [MCP](https://modelcontextprotocol.io) server for querying
17
+ [TheAlgorithms/Python](https://github.com/TheAlgorithms/Python) — search ~1,160 algorithm
18
+ implementations and fetch any one with its **doctests as usage examples**.
19
+
20
+ Hybrid design: the small `DIRECTORY.md` index is cached locally (ETag + 24h TTL) for instant
21
+ fuzzy search; file contents are fetched on demand from `raw.githubusercontent.com`. No API token,
22
+ no API rate limit, tiny footprint. See [`DESIGN.md`](DESIGN.md).
23
+
24
+ ## Tools
25
+
26
+ | Tool | Purpose |
27
+ |------|---------|
28
+ | `list_categories()` | Categories (sorts, graphs, dynamic_programming, …) with counts |
29
+ | `search_algorithms(query, category?, limit=10)` | Ranked `{name, category, path, score}` |
30
+ | `get_category(category)` | Every algorithm in a category |
31
+ | `get_algorithm(path, include_source=True)` | Source + extracted doctests for one file |
32
+
33
+ Typical flow: `search_algorithms("dijkstra")` → `get_algorithm("graphs/dijkstra.py")`.
34
+
35
+ ## Install
36
+
37
+ **From PyPI (recommended):**
38
+
39
+ ```json
40
+ { "thealgorithms": { "command": "uvx", "args": ["thealgorithms-mcp"] } }
41
+ ```
42
+
43
+ **From GitHub (no PyPI needed):**
44
+
45
+ ```json
46
+ { "thealgorithms": {
47
+ "command": "uvx",
48
+ "args": ["--from", "git+https://github.com/mcande21/thealgorithms-mcp", "thealgorithms-mcp"] } }
49
+ ```
50
+
51
+ **From a local checkout (development):**
52
+
53
+ ```bash
54
+ uv sync
55
+ uv run thealgorithms-mcp # serves over stdio
56
+ ```
57
+
58
+ ```json
59
+ { "thealgorithms": {
60
+ "command": "uv",
61
+ "args": ["run", "--directory", "/path/to/thealgorithms-mcp", "thealgorithms-mcp"] } }
62
+ ```
63
+
64
+ Add any of the above to `~/.normandy-generic/mcp.json` (or your MCP client config).
65
+
66
+ ## Verify
67
+
68
+ ```bash
69
+ uv run python scripts/verify_stdio.py
70
+ ```
71
+
72
+ Spawns the server over stdio and asserts every tool against the live repo.
@@ -0,0 +1,61 @@
1
+ # thealgorithms-mcp
2
+
3
+ mcp-name: io.github.mcande21/thealgorithms-mcp
4
+
5
+ An [MCP](https://modelcontextprotocol.io) server for querying
6
+ [TheAlgorithms/Python](https://github.com/TheAlgorithms/Python) — search ~1,160 algorithm
7
+ implementations and fetch any one with its **doctests as usage examples**.
8
+
9
+ Hybrid design: the small `DIRECTORY.md` index is cached locally (ETag + 24h TTL) for instant
10
+ fuzzy search; file contents are fetched on demand from `raw.githubusercontent.com`. No API token,
11
+ no API rate limit, tiny footprint. See [`DESIGN.md`](DESIGN.md).
12
+
13
+ ## Tools
14
+
15
+ | Tool | Purpose |
16
+ |------|---------|
17
+ | `list_categories()` | Categories (sorts, graphs, dynamic_programming, …) with counts |
18
+ | `search_algorithms(query, category?, limit=10)` | Ranked `{name, category, path, score}` |
19
+ | `get_category(category)` | Every algorithm in a category |
20
+ | `get_algorithm(path, include_source=True)` | Source + extracted doctests for one file |
21
+
22
+ Typical flow: `search_algorithms("dijkstra")` → `get_algorithm("graphs/dijkstra.py")`.
23
+
24
+ ## Install
25
+
26
+ **From PyPI (recommended):**
27
+
28
+ ```json
29
+ { "thealgorithms": { "command": "uvx", "args": ["thealgorithms-mcp"] } }
30
+ ```
31
+
32
+ **From GitHub (no PyPI needed):**
33
+
34
+ ```json
35
+ { "thealgorithms": {
36
+ "command": "uvx",
37
+ "args": ["--from", "git+https://github.com/mcande21/thealgorithms-mcp", "thealgorithms-mcp"] } }
38
+ ```
39
+
40
+ **From a local checkout (development):**
41
+
42
+ ```bash
43
+ uv sync
44
+ uv run thealgorithms-mcp # serves over stdio
45
+ ```
46
+
47
+ ```json
48
+ { "thealgorithms": {
49
+ "command": "uv",
50
+ "args": ["run", "--directory", "/path/to/thealgorithms-mcp", "thealgorithms-mcp"] } }
51
+ ```
52
+
53
+ Add any of the above to `~/.normandy-generic/mcp.json` (or your MCP client config).
54
+
55
+ ## Verify
56
+
57
+ ```bash
58
+ uv run python scripts/verify_stdio.py
59
+ ```
60
+
61
+ Spawns the server over stdio and asserts every tool against the live repo.
@@ -0,0 +1,22 @@
1
+ [project]
2
+ name = "thealgorithms-mcp"
3
+ version = "0.1.0"
4
+ description = "MCP server for querying TheAlgorithms/Python — search algorithms and fetch implementations with their doctests as examples."
5
+ readme = "README.md"
6
+ requires-python = ">=3.11"
7
+ dependencies = [
8
+ "mcp>=1.2.0",
9
+ "httpx>=0.27",
10
+ "rapidfuzz>=3.9",
11
+ "platformdirs>=4.0",
12
+ ]
13
+
14
+ [project.scripts]
15
+ thealgorithms-mcp = "thealgorithms_mcp.server:main"
16
+
17
+ [build-system]
18
+ requires = ["hatchling"]
19
+ build-backend = "hatchling.build"
20
+
21
+ [tool.hatch.build.targets.wheel]
22
+ packages = ["src/thealgorithms_mcp"]
@@ -0,0 +1,125 @@
+ """End-to-end verification: spawn the server over stdio, call every tool, assert correctness
+ against the LIVE TheAlgorithms/Python repo. Exits non-zero on any failure.
+
+ Run with: uv run python scripts/verify_stdio.py
+ """
+ from __future__ import annotations
+
+ import asyncio
+ import json
+ import sys
+
+ from mcp import ClientSession, StdioServerParameters
+ from mcp.client.stdio import stdio_client
+
+ PASS, FAIL = "\033[32mPASS\033[0m", "\033[31mFAIL\033[0m"
+ failures: list[str] = []
+
+
+ def check(name: str, cond: bool, detail: str = "") -> None:
+     print(f" [{PASS if cond else FAIL}] {name}" + (f" — {detail}" if detail else ""))
+     if not cond:
+         failures.append(name)
+
+
+ def payload(res):
+     """Extract a tool's return value.
+
+     FastMCP puts the structured value in `structuredContent`, wrapping list/scalar returns
+     as {"result": ...}. Fall back to concatenating JSON text blocks.
+     """
+     sc = res.structuredContent
+     if isinstance(sc, dict) and set(sc.keys()) == {"result"}:
+         return sc["result"]
+     if sc is not None:
+         return sc
+     texts = [c.text for c in res.content if getattr(c, "type", None) == "text"]
+     if len(texts) == 1:
+         return json.loads(texts[0])
+     return [json.loads(t) for t in texts]
+
+
+ async def main() -> int:
+     # Default: run the local package. Override by passing a full command as argv,
+     # e.g. `verify_stdio.py uvx --from git+https://github.com/mcande21/thealgorithms-mcp thealgorithms-mcp`
+     if len(sys.argv) > 1:
+         command, args = sys.argv[1], sys.argv[2:]
+     else:
+         command, args = sys.executable, ["-m", "thealgorithms_mcp.server"]
+     params = StdioServerParameters(command=command, args=args)
+     print(f"Spawning server over stdio: {command} {' '.join(args)}")
+     async with stdio_client(params) as (read, write):
+         async with ClientSession(read, write) as session:
+             await session.initialize()
+             print("Server initialized over stdio.\n")
+
+             # --- tool discovery ---
+             tools = {t.name for t in (await session.list_tools()).tools}
+             expected = {"list_categories", "search_algorithms", "get_category", "get_algorithm"}
+             check("4 tools registered", tools == expected, f"{sorted(tools)}")
+
+             # --- list_categories ---
+             cats = payload(await session.call_tool("list_categories", {}))
+             names = {c["category"] for c in cats}
+             check("list_categories returns categories", len(cats) >= 40, f"{len(cats)} categories")
+             check("categories include sorts/graphs/maths", {"sorts", "graphs", "maths"} <= names)
+
+             # --- search_algorithms: top hit must be the canonical file ---
+             cases = {
+                 "binary search": "searches/binary_search.py",
+                 "dijkstra": "graphs/dijkstra.py",
+                 "merge sort": "sorts/merge_sort.py",
+                 "knapsack": "dynamic_programming/knapsack.py",
+             }
+             for q, want in cases.items():
+                 res = payload(await session.call_tool("search_algorithms", {"query": q}))
+                 top = res[0]["path"] if res else "<none>"
+                 check(f"search {q!r} -> {want}", top == want, f"got {top}")
+
+             # category-constrained search
+             res = payload(
+                 await session.call_tool(
+                     "search_algorithms", {"query": "quick", "category": "sorts"}
+                 )
+             )
+             check("scoped search stays in category", all(r["category"] == "sorts" for r in res))
+
+             # --- get_category ---
+             sorts = payload(await session.call_tool("get_category", {"category": "sorts"}))
+             sort_paths = {e["path"] for e in sorts}
+             check("get_category('sorts') lists sorts", "sorts/merge_sort.py" in sort_paths,
+                   f"{len(sorts)} entries")
+
+             # --- get_algorithm: source + doctests on a simple file ---
+             ms = payload(await session.call_tool("get_algorithm", {"path": "sorts/merge_sort.py"}))
+             check("get_algorithm returns source", "def merge_sort" in ms.get("source", ""))
+             check("get_algorithm returns doctests", len(ms.get("doctests", [])) >= 1,
+                   f"{len(ms.get('doctests', []))} examples")
+             check("get_algorithm returns description", bool(ms.get("description")))
+             check("get_algorithm has github_url", ms.get("github_url", "").startswith("https://github.com/"))
+
+             # --- doctest extraction on a FUNCTION-HEAVY file (the flagged risk) ---
+             bs = payload(await session.call_tool("get_algorithm", {"path": "searches/binary_search.py"}))
+             check("function-level doctests extracted", len(bs.get("doctests", [])) >= 3,
+                   f"{len(bs.get('doctests', []))} examples across functions")
+
+             # --- include_source=False peek ---
+             peek = payload(await session.call_tool(
+                 "get_algorithm", {"path": "sorts/merge_sort.py", "include_source": False}))
+             check("peek omits source but keeps examples",
+                   "source" not in peek and len(peek.get("doctests", [])) >= 1)
+
+             # --- bad path is handled gracefully ---
+             bad = payload(await session.call_tool("get_algorithm", {"path": "nope/not_real.py"}))
+             check("bad path returns guidance, not crash", "error" in bad)
+
+     print()
+     if failures:
+         print(f"\033[31m{len(failures)} FAILED:\033[0m {failures}")
+         return 1
+     print("\033[32mALL CHECKS PASSED\033[0m")
+     return 0
+
+
+ if __name__ == "__main__":
+     sys.exit(asyncio.run(main()))
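To make the three branches of `payload` concrete, the helper can be exercised against stand-in result objects. The `SimpleNamespace` objects below are hypothetical and carry only the two attributes the helper reads (`structuredContent` and `content`); they are not real MCP SDK result types, and the values in them are invented.

```python
import json
from types import SimpleNamespace


def payload(res):
    """Same unwrapping logic as the verification script's helper."""
    sc = res.structuredContent
    if isinstance(sc, dict) and set(sc.keys()) == {"result"}:
        return sc["result"]  # list/scalar returns arrive wrapped as {"result": ...}
    if sc is not None:
        return sc  # dict returns arrive unwrapped
    texts = [c.text for c in res.content if getattr(c, "type", None) == "text"]
    if len(texts) == 1:
        return json.loads(texts[0])
    return [json.loads(t) for t in texts]


# Hypothetical results covering each branch; the data is illustrative only.
wrapped = SimpleNamespace(
    structuredContent={"result": [{"category": "sorts", "count": 3}]}, content=[]
)
plain = SimpleNamespace(structuredContent={"path": "sorts/merge_sort.py"}, content=[])
text_only = SimpleNamespace(
    structuredContent=None,
    content=[SimpleNamespace(type="text", text='{"error": "unknown path"}')],
)

print(payload(wrapped))
print(payload(plain))
print(payload(text_only))
```

One consequence of the first branch: a tool that legitimately returns a dict whose only key is `result` would also be unwrapped, which is harmless for the four tools checked here.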
@@ -0,0 +1,20 @@
1
+ {
2
+ "$schema": "https://static.modelcontextprotocol.io/schemas/2025-12-11/server.schema.json",
3
+ "name": "io.github.mcande21/thealgorithms-mcp",
4
+ "description": "Query TheAlgorithms/Python — search algorithm implementations and fetch them with their doctests as examples.",
5
+ "repository": {
6
+ "url": "https://github.com/mcande21/thealgorithms-mcp",
7
+ "source": "github"
8
+ },
9
+ "version": "0.1.0",
10
+ "packages": [
11
+ {
12
+ "registryType": "pypi",
13
+ "identifier": "thealgorithms-mcp",
14
+ "version": "0.1.0",
15
+ "transport": {
16
+ "type": "stdio"
17
+ }
18
+ }
19
+ ]
20
+ }
@@ -0,0 +1,3 @@
1
+ """TheAlgorithms MCP — query TheAlgorithms/Python for implementations + doctests."""
2
+
3
+ __version__ = "0.1.0"
@@ -0,0 +1,56 @@
+ """On-demand fetch of a single algorithm file's source, cached by path + ETag."""
+ from __future__ import annotations
+
+ import json
+ from pathlib import Path
+
+ import httpx
+
+ from .index import CACHE_DIR, RAW_BASE
+
+ FILE_CACHE_DIR = CACHE_DIR / "files"
+
+
+ def _cache_paths(path: str) -> tuple[Path, Path]:
+     safe = path.replace("/", "__")
+     return FILE_CACHE_DIR / safe, FILE_CACHE_DIR / (safe + ".meta")
+
+
+ def get_file(path: str) -> str:
+     """Return raw source for a repo-relative path.
+
+     Conditional GET via ETag; 304 reuses the cached body. Network failures fall back to
+     cache when present, else raise. Raises FileNotFoundError on a 404 (bad path).
+     """
+     body_file, meta_file = _cache_paths(path)
+     etag = None
+     cached_body = None
+     if body_file.exists():
+         cached_body = body_file.read_text()
+         # Only trust a stored ETag when the body it validates is also on disk;
+         # otherwise a 304 would leave nothing to return.
+         if meta_file.exists():
+             try:
+                 etag = json.loads(meta_file.read_text()).get("etag")
+             except (json.JSONDecodeError, OSError):
+                 etag = None
+
+     headers = {"User-Agent": "thealgorithms-mcp"}
+     if etag:
+         headers["If-None-Match"] = etag
+
+     try:
+         resp = httpx.get(RAW_BASE + path, headers=headers, timeout=30, follow_redirects=True)
+     except httpx.HTTPError:
+         if cached_body is not None:
+             return cached_body
+         raise
+
+     if resp.status_code == 304 and cached_body is not None:
+         return cached_body
+     if resp.status_code == 404:
+         raise FileNotFoundError(path)
+     resp.raise_for_status()
+
+     FILE_CACHE_DIR.mkdir(parents=True, exist_ok=True)
+     body_file.write_text(resp.text)
+     meta_file.write_text(json.dumps({"etag": resp.headers.get("ETag")}))
+     return resp.text
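A small sketch of the on-disk layout that `_cache_paths` produces, run against a temporary directory instead of the real cache location. The file contents and the ETag value are invented; `cache_paths` is a standalone copy of the helper that takes the cache directory as an argument.

```python
import json
import tempfile
from pathlib import Path


def cache_paths(cache_dir: Path, path: str) -> tuple[Path, Path]:
    """Flatten a repo-relative path into one body file plus a sidecar .meta file."""
    safe = path.replace("/", "__")
    return cache_dir / safe, cache_dir / (safe + ".meta")


with tempfile.TemporaryDirectory() as tmp:
    body_file, meta_file = cache_paths(Path(tmp), "sorts/merge_sort.py")
    body_file.write_text("def merge_sort(): ...")  # placeholder body
    meta_file.write_text(json.dumps({"etag": '"abc123"'}))  # invented ETag
    names = sorted(p.name for p in Path(tmp).iterdir())
    stored_etag = json.loads(meta_file.read_text())["etag"]

print(names, stored_etag)
```

Flattening keeps every cached file in a single directory. The trade-off is that two repo paths differing only in `/` versus `__` (for example `a/b__c.py` and `a__b/c.py`) would map to the same cache file; whether that matters depends on the names present in the upstream repo.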
@@ -0,0 +1,141 @@
+ """Fetch, parse, and cache TheAlgorithms/Python DIRECTORY.md.
+
+ Hybrid model: the index is small (~1,160 entries) so we cache it whole, validated by
+ ETag with a 24h TTL fallback. File *contents* are fetched on demand (see fetch.py).
+ """
+ from __future__ import annotations
+
+ import json
+ import re
+ import time
+ from pathlib import Path
+
+ import httpx
+ from platformdirs import user_cache_dir
+
+ REPO = "TheAlgorithms/Python"
+ BRANCH = "master"
+ RAW_BASE = f"https://raw.githubusercontent.com/{REPO}/{BRANCH}/"
+ DIRECTORY_URL = RAW_BASE + "DIRECTORY.md"
+ TTL_SECONDS = 24 * 3600
+ DRIFT_THRESHOLD = 0.95  # keep prior cache if a refresh matches fewer than this fraction of links
+
+ CACHE_DIR = Path(user_cache_dir("thealgorithms-mcp"))
+ INDEX_FILE = CACHE_DIR / "directory.json"
+
+ ENTRY_RE = re.compile(r"^\s*\* \[(?P<name>.+?)\]\((?P<path>.+?\.py)\)\s*$")
+ LINK_RE = re.compile(r"^\s*\* \[.+?\]\(.+?\)\s*$")
+
+ # Process-lifetime memo so repeated tool calls don't re-read disk.
+ _memo: dict | None = None
+
+
+ def github_url(path: str) -> str:
+     """Human-facing blob URL for a repo-relative path."""
+     return f"https://github.com/{REPO}/blob/{BRANCH}/{path}"
+
+
+ def _parse(text: str) -> tuple[list[dict], float]:
+     """Parse DIRECTORY.md into entries; return (entries, match_rate vs all link lines)."""
+     link_lines = 0
+     entries: list[dict] = []
+     for line in text.splitlines():
+         if LINK_RE.match(line):
+             link_lines += 1
+         m = ENTRY_RE.match(line)
+         if m:
+             path = m.group("path")
+             entries.append(
+                 {"name": m.group("name"), "path": path, "category": path.split("/")[0]}
+             )
+     match_rate = (len(entries) / link_lines) if link_lines else 1.0
+     return entries, match_rate
+
+
+ def _read_cache() -> dict | None:
+     if not INDEX_FILE.exists():
+         return None
+     try:
+         return json.loads(INDEX_FILE.read_text())
+     except (json.JSONDecodeError, OSError):
+         return None
+
+
+ def _write_cache(data: dict) -> None:
+     CACHE_DIR.mkdir(parents=True, exist_ok=True)
+     INDEX_FILE.write_text(json.dumps(data))
+
+
+ def load_index(force: bool = False) -> list[dict]:
+     """Return the parsed index, refreshing from GitHub when stale.
+
+     Order of operations:
+       1. Serve the process memo if present and fresh (and not forced).
+       2. Serve the disk cache if fresh.
+       3. Conditional GET (If-None-Match). 304 -> reuse cached entries, bump timestamp.
+          200 -> parse, apply drift guard, persist.
+       4. Any network failure -> fall back to cached entries (offline degradation).
+     """
+     global _memo
+     now = time.time()
+     cache = _read_cache()
+
+     if not force and _memo and (now - _memo["fetched_at"] < TTL_SECONDS):
+         return _memo["entries"]
+     if not force and cache and (now - cache.get("fetched_at", 0) < TTL_SECONDS):
+         _memo = cache
+         return cache["entries"]
+
+     headers = {"User-Agent": "thealgorithms-mcp"}
+     if cache and cache.get("etag"):
+         headers["If-None-Match"] = cache["etag"]
+
+     try:
+         resp = httpx.get(DIRECTORY_URL, headers=headers, timeout=30, follow_redirects=True)
+     except httpx.HTTPError:
+         if cache:
+             _memo = cache
+             return cache["entries"]  # offline: stale is better than dead
+         raise
+
+     if resp.status_code == 304 and cache:
+         cache["fetched_at"] = now
+         _write_cache(cache)
+         _memo = cache
+         return cache["entries"]
+
+     resp.raise_for_status()
+     entries, match_rate = _parse(resp.text)
+
+     # Drift guard: a sudden drop in match rate means the format changed under us.
+     if match_rate < DRIFT_THRESHOLD and cache and cache.get("entries"):
+         # Keep the known-good index rather than silently shipping a broken one.
+         cache["fetched_at"] = now
+         _write_cache(cache)
+         _memo = cache
+         return cache["entries"]
+
+     data = {
+         "entries": entries,
+         "etag": resp.headers.get("ETag"),
+         "fetched_at": now,
+         "match_rate": match_rate,
+     }
+     _write_cache(data)
+     _memo = data
+     return entries
+
+
+ def list_categories(entries: list[dict]) -> list[dict]:
+     counts: dict[str, int] = {}
+     for e in entries:
+         counts[e["category"]] = counts.get(e["category"], 0) + 1
+     return [{"category": c, "count": n} for c, n in sorted(counts.items())]
+
+
+ def category_entries(entries: list[dict], category: str) -> list[dict]:
+     return [
+         {"name": e["name"], "path": e["path"]}
+         for e in entries
+         if e["category"] == category
+     ]
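The drift-guard decision inside `load_index` can be isolated as a pure function, which makes the threshold behaviour easy to see. This is a sketch: `match_rate` and `accept_refresh` are hypothetical helpers, and the two sample texts are invented rather than taken from the live index.

```python
import re

ENTRY_RE = re.compile(r"^\s*\* \[(?P<name>.+?)\]\((?P<path>.+?\.py)\)\s*$")
LINK_RE = re.compile(r"^\s*\* \[.+?\]\(.+?\)\s*$")
DRIFT_THRESHOLD = 0.95


def match_rate(text: str) -> float:
    """Fraction of `* [..](..)` link rows that parse as .py entries."""
    links = [line for line in text.splitlines() if LINK_RE.match(line)]
    if not links:
        return 1.0
    return sum(1 for line in links if ENTRY_RE.match(line)) / len(links)


def accept_refresh(text: str, have_prior: bool) -> bool:
    """Accept a refreshed index unless it drifted AND a known-good copy exists."""
    return match_rate(text) >= DRIFT_THRESHOLD or not have_prior


# Invented samples: one well-formed, one where half the rows stopped being .py links.
GOOD = "* [Merge Sort](sorts/merge_sort.py)\n* [Quick Sort](sorts/quick_sort.py)\n"
DRIFTED = "* [Merge Sort](sorts/merge_sort.py)\n* [Notes](docs/notes.md)\n"

print(match_rate(GOOD), match_rate(DRIFTED))
print(accept_refresh(DRIFTED, have_prior=True), accept_refresh(DRIFTED, have_prior=False))
```

With no prior cache there is nothing to fall back to, so a drifted index is still accepted; that mirrors the `cache and cache.get("entries")` condition in `load_index`.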