code_memory-0.1.0-py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- .github/workflows/ci.yml +71 -0
- .github/workflows/publish.yml +33 -0
- .gitignore +40 -0
- .python-version +1 -0
- CHANGELOG.md +43 -0
- CONTRIBUTING.md +133 -0
- LICENSE +21 -0
- Makefile +33 -0
- PKG-INFO +275 -0
- README.md +233 -0
- code_memory-0.1.0.dist-info/METADATA +275 -0
- code_memory-0.1.0.dist-info/RECORD +37 -0
- code_memory-0.1.0.dist-info/WHEEL +4 -0
- code_memory-0.1.0.dist-info/entry_points.txt +2 -0
- code_memory-0.1.0.dist-info/licenses/LICENSE +21 -0
- db.py +403 -0
- doc_parser.py +494 -0
- errors.py +115 -0
- git_search.py +313 -0
- logging_config.py +191 -0
- parser.py +392 -0
- prompts/milestone_1.xml +62 -0
- prompts/milestone_2.xml +246 -0
- prompts/milestone_3.xml +214 -0
- prompts/milestone_4.xml +453 -0
- prompts/milestone_5.xml +599 -0
- pyproject.toml +92 -0
- queries.py +446 -0
- server.py +299 -0
- tests/__init__.py +1 -0
- tests/conftest.py +192 -0
- tests/test_errors.py +112 -0
- tests/test_logging.py +169 -0
- tests/test_tools.py +114 -0
- tests/test_validation.py +216 -0
- uv.lock +1921 -0
- validation.py +316 -0
prompts/milestone_2.xml
ADDED
<system_prompt>
<role_and_objective>
You are an expert developer specializing in information retrieval and database design. Your objective is to implement the `search_code` tool for the `code-memory` MCP server.

This tool must support **hybrid retrieval** — combining BM25 keyword search with dense vector semantic search — backed by SQLite with the `sqlite-vec` extension. Source code is parsed using **tree-sitter** for language-agnostic structural extraction, then each symbol is indexed for both lexical and semantic retrieval.

You are working inside an existing, functional MCP server scaffold. Do NOT re-create the server; extend it.
</role_and_objective>

<context>
<existing_codebase>
The project was scaffolded in Milestone 1. The current entry point is `server.py`, which:
- Initializes a FastMCP server: `mcp = FastMCP("code-memory")`
- Registers three tools: `search_code`, `search_docs`, `search_history`
- All three tools currently return mock dictionaries
- The project uses `uv` for dependency management

The `search_code` tool currently has this signature:
```python
@mcp.tool()
def search_code(
    query: str,
    search_type: Literal["definition", "references", "file_structure"],
) -> dict:
```
</existing_codebase>

<design_principles>
1. **Hybrid retrieval**: Every query runs through BOTH a BM25 keyword scorer and a dense vector similarity scorer. Results are fused using Reciprocal Rank Fusion (RRF) to produce a single ranked list.
2. **Offline-first**: All data — FTS index, vector embeddings, and structural metadata — lives in a local SQLite database. No external API calls.
3. **Incremental indexing**: The indexer must be idempotent — re-indexing a file updates its records without duplicating data. Compare file `last_modified` timestamps to skip unchanged files.
4. **Separation of concerns**: Parsing logic (`parser.py`), database + indexing logic (`db.py`), query/retrieval logic (`queries.py`), and MCP tool wiring (`server.py`) must live in separate modules.
5. **Embedding model**: Use a lightweight, local embedding model via `sentence-transformers` (e.g., `all-MiniLM-L6-v2`). The model runs in-process — no external inference server.
6. **Language-agnostic**: The parser must support multiple programming languages using **tree-sitter**, not just Python. Supported languages include Python, JavaScript/TypeScript, Java, Kotlin, Go, Rust, C/C++, and Ruby. Unsupported file types should fall back to whole-file indexing so they are still searchable.
</design_principles>

<technology_stack>
- **BM25 / keyword search**: SQLite FTS5 (built-in full-text search)
- **Dense vector storage + similarity**: `sqlite-vec` extension (installable via `pip install sqlite-vec`)
- **Embeddings**: `sentence-transformers` library with a small local model
- **AST parsing**: `tree-sitter` with per-language grammar packages (`tree-sitter-python`, `tree-sitter-javascript`, `tree-sitter-typescript`, `tree-sitter-java`, `tree-sitter-kotlin`, `tree-sitter-go`, `tree-sitter-rust`, `tree-sitter-c`, `tree-sitter-cpp`, `tree-sitter-ruby`)
</technology_stack>
</context>

<instructions>
Before writing any code for each step, use a <thinking> block to reason about your design decisions, trade-offs, and how the components connect.

<step_1_dependencies>
Install the required new dependencies using `uv`:
```bash
uv add sqlite-vec sentence-transformers tree-sitter \
    tree-sitter-python tree-sitter-javascript tree-sitter-typescript \
    tree-sitter-java tree-sitter-kotlin tree-sitter-go tree-sitter-rust \
    tree-sitter-c tree-sitter-cpp tree-sitter-ruby
```
Verify that `sqlite-vec` and `tree-sitter` can be loaded in Python:
```python
import sqlite_vec
import tree_sitter
```
</step_1_dependencies>

<step_2_database_schema>
Create a new file `db.py` that manages the SQLite database with three storage layers (structural metadata, FTS index, vector embeddings).

Design and implement the schema:

**Table 1: `files`** — Tracks indexed source files.
- `id` INTEGER PRIMARY KEY
- `path` TEXT UNIQUE — absolute file path
- `last_modified` REAL — file mtime for incremental indexing
- `file_hash` TEXT — SHA-256 of file contents for integrity

**Table 2: `symbols`** — Stores parsed AST symbols with their source text.
- `id` INTEGER PRIMARY KEY
- `name` TEXT — symbol name (e.g., "MyClass", "processData")
- `kind` TEXT — one of: function, class, method, variable, file
- `file_id` INTEGER — foreign key to `files`
- `line_start` INTEGER
- `line_end` INTEGER
- `parent_symbol_id` INTEGER — nullable, for nesting (methods inside classes)
- `source_text` TEXT — the raw source code of the symbol
- UNIQUE constraint on (`file_id`, `name`, `kind`, `line_start`)

**Table 3: `symbols_fts`** — FTS5 virtual table for BM25 keyword search.
- A content-synced FTS5 table over `symbols` that indexes `name` and `source_text`.
- Include INSERT/UPDATE/DELETE triggers to keep FTS5 in sync.

**Table 4: `symbol_embeddings`** — Virtual table via `sqlite-vec` for dense vector search.
- Stores the embedding vector for each symbol, keyed by `symbol_id`.
- Vector dimension must match the chosen embedding model (384 for `all-MiniLM-L6-v2`).

**Table 5: `references_`** — Cross-reference tracking.
- `id` INTEGER PRIMARY KEY
- `symbol_name` TEXT — the name being referenced
- `file_id` INTEGER — the file containing the reference
- `line_number` INTEGER
- UNIQUE constraint on (`symbol_name`, `file_id`, `line_number`)

Include these functions:
- `get_db(db_path: str = "code_memory.db") -> sqlite3.Connection` — initializes DB, loads `sqlite-vec`, creates all tables.
- `upsert_file(db, path, last_modified, file_hash) -> int` — returns file_id.
- `upsert_symbol(db, name, kind, file_id, line_start, line_end, parent_symbol_id, source_text) -> int` — returns symbol_id.
- `upsert_reference(db, symbol_name, file_id, line_number)`.
- `upsert_embedding(db, symbol_id, embedding: list[float])`.
- `delete_file_data(db, file_id)` — removes all symbols, embeddings, and references for a file before re-indexing.

CRITICAL RULE: Use `INSERT ... ON CONFLICT ... DO UPDATE` for all upserts so re-indexing is safe. When a file is re-indexed, first DELETE all its old symbols, references, and embeddings before inserting fresh data.
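The upsert-plus-delete pattern above can be sketched with the standard-library `sqlite3` module alone; the `sqlite-vec` and FTS5 pieces are omitted, and the table definitions here are trimmed to the columns the sketch touches.

```python
import sqlite3


def upsert_file(db: sqlite3.Connection, path: str,
                last_modified: float, file_hash: str) -> int:
    """Insert or update a file row; the row id stays stable across re-indexing."""
    db.execute(
        """
        INSERT INTO files (path, last_modified, file_hash)
        VALUES (?, ?, ?)
        ON CONFLICT(path) DO UPDATE SET
            last_modified = excluded.last_modified,
            file_hash = excluded.file_hash
        """,
        (path, last_modified, file_hash),
    )
    return db.execute("SELECT id FROM files WHERE path = ?", (path,)).fetchone()[0]


def delete_file_data(db: sqlite3.Connection, file_id: int) -> None:
    """Clear a file's old symbols and references before inserting fresh data."""
    db.execute("DELETE FROM symbols WHERE file_id = ?", (file_id,))
    db.execute("DELETE FROM references_ WHERE file_id = ?", (file_id,))


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE files (id INTEGER PRIMARY KEY, path TEXT UNIQUE,"
           " last_modified REAL, file_hash TEXT)")
db.execute("CREATE TABLE symbols (id INTEGER PRIMARY KEY, file_id INTEGER, name TEXT)")
db.execute("CREATE TABLE references_ (id INTEGER PRIMARY KEY, file_id INTEGER,"
           " symbol_name TEXT)")

# Indexing the same path twice yields one row with the same id, updated in place.
fid1 = upsert_file(db, "/src/app.py", 1.0, "hash-a")
delete_file_data(db, fid1)
fid2 = upsert_file(db, "/src/app.py", 2.0, "hash-b")
assert fid1 == fid2
```

Note `ON CONFLICT(path)` relies on the UNIQUE constraint on `path`; without it SQLite rejects the clause.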
</step_2_database_schema>

<step_3_embedding_manager>
Create an embedding helper in `db.py` (or a separate `embeddings.py` if you prefer):

```python
def get_embedding_model():
    """Lazy-load and cache the sentence-transformers model."""
    ...

def embed_text(text: str) -> list[float]:
    """Generate a dense vector embedding for the given text."""
    ...
```

The embedding input for a symbol should be a concatenation of its structural context:
`"{kind} {name}: {source_text}"` — e.g., `"method authenticate: fun authenticate(token: String): Boolean { ... }"`.

This gives the embedding model both semantic and structural signal.
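The input convention can be captured in a small pure-Python helper; the model call itself is left out, and the truncation limit is an illustrative assumption, not part of the spec above.

```python
def build_embedding_input(kind: str, name: str, source_text: str,
                          max_chars: int = 2000) -> str:
    """Concatenate structural context with source text for embedding.

    Truncating long symbols (max_chars is an assumed limit, tune as needed)
    keeps the input within typical sentence-transformers sequence lengths.
    """
    return f"{kind} {name}: {source_text[:max_chars]}"


result = build_embedding_input("method", "authenticate", "fun authenticate() {}")
assert result == "method authenticate: fun authenticate() {}"
```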
</step_3_embedding_manager>

<step_4_tree_sitter_parser>
Create a new file `parser.py` that handles **language-agnostic** AST parsing using tree-sitter.

**Language registry:**
- Map file extensions to tree-sitter grammar packages (e.g., `.py` → `tree_sitter_python`, `.kt`/`.kts` → `tree_sitter_kotlin`).
- Lazy-load grammars on first use.
- For files with no matching grammar, fall back to indexing the whole file as a single "file" symbol.

**Node-type mapping:**
- Map tree-sitter node types to normalised symbol kinds (`function`, `class`, `method`, `variable`).
- Cover at minimum: Python, JS/TS, Java, Kotlin, Go, Rust, C/C++, Ruby.
- Promote `function` → `method` when nested inside a class container.

**Symbol extraction:**
- Walk the tree-sitter AST to extract symbols and their source text.
- Extract identifier references for cross-reference tracking.

Implement `index_file(filepath: str, db: sqlite3.Connection) -> dict`:
1. Read the source file.
2. Check `last_modified` against the `files` table — skip if unchanged.
3. Determine language from file extension, load tree-sitter grammar.
4. Parse the file and walk the AST to extract symbols and references.
5. For each symbol, extract its source text from the byte range.
6. Upsert all extracted data into the database.
7. Generate and store embeddings for each symbol.
8. If no grammar is available, index the whole file as a single symbol.
9. Return: `{"file": filepath, "symbols_indexed": N, "references_indexed": M}`.

Implement `index_directory(dirpath: str, db: sqlite3.Connection) -> list[dict]`:
- Recursively index all source files (not just `.py`), skipping unchanged ones.
- Skip directories like `.venv`, `__pycache__`, `.git`, `node_modules`, `build`, `target`.
- Accept files with common source-code extensions.
</step_4_tree_sitter_parser>

<step_5_query_layer>
Create a new file `queries.py` that provides hybrid retrieval functions.

**Core retrieval function — `hybrid_search(query, db, top_k=10) -> list[dict]`:**
1. **BM25 leg**: Run the query against `symbols_fts` using FTS5 `MATCH`. Retrieve ranked results with `bm25()` scores.
2. **Vector leg**: Embed the query text, then query `symbol_embeddings` for nearest neighbors using `sqlite-vec` MATCH.
3. **Fusion**: Merge both ranked lists using Reciprocal Rank Fusion (RRF):
   `rrf_score(d) = Σ 1 / (k + rank(d))` where `k = 60` (standard constant).
4. Return the top-k results, each as a dict: `{name, kind, file_path, line_start, line_end, source_text, score}`.
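The fusion step is independent of the two retrieval legs and can be sketched on plain lists of symbol ids:

```python
def rrf_fuse(bm25_ids: list[int], vector_ids: list[int],
             k: int = 60, top_k: int = 10) -> list[tuple[int, float]]:
    """Fuse two ranked id lists with Reciprocal Rank Fusion.

    rrf_score(d) = sum over lists of 1 / (k + rank(d)), ranks starting at 1.
    A document appearing in only one list still gets a score from that list.
    """
    scores: dict[int, float] = {}
    for ranking in (bm25_ids, vector_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]


# A symbol ranked near the top of both lists outscores one that only
# appears in a single list.
fused = rrf_fuse([1, 2, 3], [1, 3, 4])
assert fused[0][0] == 1
assert fused[1][0] == 3
```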

**Tool-facing query functions:**

1. **`find_definition(symbol_name: str, db) -> list[dict]`**
   - Run `hybrid_search` with the symbol name.
   - Post-filter: only return results where `name` exactly matches `symbol_name` (case-sensitive).
   - Fallback: if exact match yields nothing, return the top hybrid results as "best guesses".

2. **`find_references(symbol_name: str, db) -> list[dict]`**
   - Query the `references_` table for exact matches on `symbol_name`.
   - Each result: `{symbol_name, file_path, line_number}`.

3. **`get_file_structure(file_path: str, db) -> list[dict]`**
   - Query `symbols` table for all symbols in the given file, ordered by `line_start`.
   - Each result: `{name, kind, line_start, line_end, parent}`.
</step_5_query_layer>

<step_6_wire_into_server>
Modify `server.py` to:
1. Import `db`, `parser`, and `queries` modules.
2. Replace the mock `search_code` with real logic that:
   - Initializes the database via `get_db()`.
   - Routes to the correct query function based on `search_type`.
3. Add a NEW tool `index_codebase`:
   - **Docstring**: "Indexes or re-indexes source files in the given directory. Run this before using search_code to ensure the database is up to date. Uses tree-sitter for language-agnostic structural extraction and generates embeddings for semantic search. Supports Python, JavaScript/TypeScript, Java, Kotlin, Go, Rust, C/C++, Ruby, and more."
   - **Parameters**:
     - `directory` (str): The root directory to index.
   - **Returns**: Summary of indexing results.

CRITICAL RULE: `search_docs` and `search_history` must remain unchanged (still returning mocks). Do NOT modify their signatures or behavior.
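The routing in point 2 can be sketched as a dispatch table; the handler bodies here are placeholders standing in for the real `queries.py` functions, not their actual signatures.

```python
# Hypothetical stand-ins for the real query functions.
def find_definition(query: str) -> dict:
    return {"mode": "definition", "query": query}

def find_references(query: str) -> dict:
    return {"mode": "references", "query": query}

def get_file_structure(query: str) -> dict:
    return {"mode": "file_structure", "query": query}


# One handler per allowed search_type; unknown values become a structured
# error instead of an exception crossing the MCP boundary.
ROUTES = {
    "definition": find_definition,
    "references": find_references,
    "file_structure": get_file_structure,
}


def search_code(query: str, search_type: str) -> dict:
    handler = ROUTES.get(search_type)
    if handler is None:
        return {"error": f"unknown search_type: {search_type}"}
    return handler(query)


assert search_code("FastMCP", "references")["mode"] == "references"
assert "error" in search_code("x", "bogus")
```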
</step_6_wire_into_server>

<step_7_verification>
Verify the implementation end-to-end:

1. Start the server: `uv run mcp run server.py` — confirm no import errors.
2. Using MCP Inspector (`uv run mcp dev server.py`):
   a. Call `index_codebase(directory=".")` to index the project itself.
   b. Call `search_code(query="search_code", search_type="definition")` — expect to find the function in `server.py`.
   c. Call `search_code(query="FastMCP", search_type="references")` — expect references in `server.py`.
   d. Call `search_code(query="server.py", search_type="file_structure")` — expect all symbols listed.
   e. Call `search_code(query="parse source files", search_type="definition")` — this is a semantic query; expect the hybrid retriever to surface `index_file` or `index_directory` via vector similarity even though the exact words don't match.
3. Confirm `search_docs` and `search_history` still return mocked responses.
</step_7_verification>

</instructions>

<output_formatting>
- Wrap your internal planning process inside `<thinking>` tags before writing code for each step.
- Output each new Python file (`db.py`, `parser.py`, `queries.py`) in a separate, clearly labelled `python` code block.
- For `server.py`, show ONLY the modified/added sections with `# ... existing code unchanged ...` markers.
- After all code, provide verification commands in a `bash` code block.
</output_formatting>

<quality_checklist>
Before finishing, verify your output against this checklist:
- [ ] `db.py` loads `sqlite-vec` via `sqlite_vec.load(db)`.
- [ ] `db.py` creates an FTS5 virtual table (`symbols_fts`) content-synced to `symbols`.
- [ ] `db.py` creates a `sqlite-vec` virtual table for embeddings with correct dimensions.
- [ ] All upserts use `ON CONFLICT ... DO UPDATE` or delete-then-insert for idempotency.
- [ ] `parser.py` uses tree-sitter (not Python `ast`) for language-agnostic parsing.
- [ ] `parser.py` supports Python, JS/TS, Java, Kotlin, Go, Rust, C/C++, Ruby.
- [ ] `parser.py` falls back to whole-file indexing for unsupported languages.
- [ ] `parser.py` skips unchanged files by comparing `last_modified`.
- [ ] `parser.py` generates embeddings for each symbol and stores them.
- [ ] `parser.py` skips `.venv`, `__pycache__`, `.git`, `node_modules`, `build`, `target` directories.
- [ ] `queries.py` implements Reciprocal Rank Fusion across BM25 + vector results.
- [ ] `queries.py` returns structured dicts, not raw tuples.
- [ ] `server.py` only modifies `search_code` and adds `index_codebase`.
- [ ] `search_docs` and `search_history` remain mocked and untouched.
- [ ] All functions have type hints and docstrings.
- [ ] No external API calls — embedding model runs locally in-process.
</quality_checklist>
</system_prompt>
prompts/milestone_3.xml
ADDED
<system_prompt>
<role_and_objective>
You are an expert developer specializing in Git internals and version control systems. Your objective is to implement the `search_history` tool for the `code-memory` MCP server.

This tool must provide **structured Git history search** — querying commits, diffs, and blame data — so an LLM can answer "Who changed this?", "Why was this changed?", and "When did this break?" questions. All data is extracted locally from the `.git` directory using `gitpython`.

You are working inside an existing, functional MCP server. Do NOT re-create the server or modify any existing tools except `search_history`. Extend it.
</role_and_objective>

<context>
<existing_codebase>
The project was scaffolded in Milestone 1 and extended in Milestone 2. The current codebase includes:
- `server.py` — FastMCP server with four tools: `search_code` (functional), `index_codebase` (functional), `search_docs` (mocked), `search_history` (mocked)
- `db.py` — SQLite database layer with sqlite-vec for hybrid search
- `parser.py` — Tree-sitter-based language-agnostic AST parser and indexer
- `queries.py` — Hybrid retrieval (BM25 + vector + RRF) query layer
- The project uses `uv` for dependency management

The `search_history` tool currently has this signature and returns a mock:
```python
@mcp.tool()
def search_history(query: str, target_file: str | None = None) -> dict:
    """Use this tool to debug regressions, understand developer intent,
    or find out WHY a specific change was made by searching Git history
    and commit messages."""
    return {
        "status": "mocked",
        "tool": "search_history",
        "query": query,
        "target_file": target_file,
    }
```
</existing_codebase>

<design_principles>
1. **Git-native**: All data comes directly from the local `.git` directory — no external APIs (no GitHub/GitLab API calls).
2. **Structured output**: Return well-structured dicts that an LLM can reason over, not raw `git log` text dumps.
3. **Separation of concerns**: Git logic lives in a new `git_search.py` module, not in `server.py`.
4. **Defensive coding**: Gracefully handle repos with no commits, files outside the repo, detached HEAD, shallow clones, and missing `.git` directories.
5. **Performance-aware**: Limit results by default (e.g., max 20 commits). Use `gitpython`'s lazy iteration to avoid loading entire histories into memory.
6. **Rich context**: For each commit, include the commit message, author, date, and optionally the diff hunks for the target file — this gives the LLM enough context to answer "why" questions.
</design_principles>

<technology_stack>
- **Git access**: `gitpython` library (`pip install gitpython`)
- **Date handling**: Python `datetime` (standard library)
- **Path resolution**: Python `pathlib` (standard library)
</technology_stack>
</context>

<instructions>
Before writing any code for each step, use a <thinking> block to reason about your design decisions, trade-offs, and how the components connect.

<step_1_dependencies>
Install the required new dependency using `uv`:
```bash
uv add gitpython
```
Verify that `gitpython` can be loaded in Python:
```python
import git
```
</step_1_dependencies>

<step_2_git_search_module>
Create a new file `git_search.py` that encapsulates all Git querying logic.

**Core functions:**

1. **`get_repo(path: str = ".") -> git.Repo`**
   - Resolve the Git repository from the given path.
   - Search upward for the `.git` directory (support running from subdirectories).
   - Raise a clear error if no Git repo is found.

2. **`search_commits(repo, query: str, target_file: str | None = None, max_results: int = 20) -> list[dict]`**
   - Search commit messages for the query string (case-insensitive substring match).
   - If `target_file` is provided, restrict to commits that touched that file.
   - For each matching commit, return:
   ```python
   {
       "hash": str,           # short hash (7 chars)
       "full_hash": str,      # full SHA
       "message": str,        # full commit message
       "author": str,         # author name
       "author_email": str,   # author email
       "date": str,           # ISO 8601 format
       "files_changed": int,  # number of files in the commit
   }
   ```
   - Sort by most recent first.

3. **`get_commit_detail(repo, commit_hash: str, target_file: str | None = None) -> dict`**
   - Retrieve detailed information about a specific commit.
   - Include the full commit metadata plus diff stats.
   - If `target_file` is provided, include the actual diff hunks for that file only.
   - Return:
   ```python
   {
       "hash": str,
       "full_hash": str,
       "message": str,
       "author": str,
       "author_email": str,
       "date": str,
       "parent_hashes": list[str],
       "files_changed": list[dict],  # [{path, insertions, deletions}]
       "diff": str | None,           # diff text for target_file, if specified
   }
   ```

4. **`get_file_history(repo, file_path: str, max_results: int = 20) -> list[dict]`**
   - Get the commit history for a specific file (equivalent to `git log --follow <file>`).
   - Use `--follow` to track renames.
   - Return the same commit dict structure as `search_commits`.

5. **`get_blame(repo, file_path: str, line_start: int | None = None, line_end: int | None = None) -> list[dict]`**
   - Run `git blame` on a file, optionally restricted to a line range.
   - Return a list of blame entries:
   ```python
   {
       "line_number": int,
       "commit_hash": str,     # short hash
       "author": str,
       "date": str,            # ISO 8601
       "line_content": str,
       "commit_message": str,  # first line of commit message
   }
   ```
   - Group consecutive lines from the same commit to reduce output size.
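The grouping rule in point 5 is pure list processing and can be sketched without gitpython; the field names follow the blame-entry shape above, with `line_start`/`line_end` added to each output group.

```python
def group_blame(entries: list[dict]) -> list[dict]:
    """Collapse consecutive blame lines from the same commit into ranges."""
    groups: list[dict] = []
    for entry in entries:
        last = groups[-1] if groups else None
        # Extend the current group only when the commit matches AND the
        # line numbers are contiguous; otherwise start a new group.
        if last and last["commit_hash"] == entry["commit_hash"] \
                and last["line_end"] == entry["line_number"] - 1:
            last["line_end"] = entry["line_number"]
        else:
            groups.append({
                "commit_hash": entry["commit_hash"],
                "line_start": entry["line_number"],
                "line_end": entry["line_number"],
            })
    return groups


blame = [
    {"line_number": 1, "commit_hash": "abc1234"},
    {"line_number": 2, "commit_hash": "abc1234"},
    {"line_number": 3, "commit_hash": "def5678"},
]
assert group_blame(blame) == [
    {"commit_hash": "abc1234", "line_start": 1, "line_end": 2},
    {"commit_hash": "def5678", "line_start": 3, "line_end": 3},
]
```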

**Error handling:**
- All functions should catch `git.exc.InvalidGitRepositoryError`, `git.exc.NoSuchPathError`, and `ValueError` gracefully.
- Return error dicts like `{"error": "message"}` instead of raising exceptions to the MCP layer.

CRITICAL RULE: Do NOT shell out to `git` CLI commands. Use `gitpython`'s Python API exclusively for testability and cross-platform compatibility.
</step_2_git_search_module>

<step_3_update_search_type>
The current `search_history` tool has a simple `query` + `target_file` signature. Extend it with a `search_type` parameter to support multiple retrieval modes:

Update the `search_history` signature to:
```python
@mcp.tool()
def search_history(
    query: str,
    search_type: Literal["commits", "file_history", "blame", "commit_detail"] = "commits",
    target_file: str | None = None,
    line_start: int | None = None,
    line_end: int | None = None,
) -> dict:
```

**Routing logic:**
- `commits` — Calls `search_commits(repo, query, target_file)`. The query is matched against commit messages.
- `file_history` — Calls `get_file_history(repo, target_file)`. The `target_file` is required; `query` is ignored for retrieval but included in the response for context.
- `blame` — Calls `get_blame(repo, target_file, line_start, line_end)`. The `target_file` is required.
- `commit_detail` — Calls `get_commit_detail(repo, query, target_file)`. The `query` should be a commit hash.

Update the docstring to clearly explain each search type and when to use it.
</step_3_update_search_type>

<step_4_wire_into_server>
Modify `server.py` to:
1. Import the `git_search` module.
2. Replace the mock `search_history` with the real implementation that routes to `git_search` functions.

CRITICAL RULES:
- `search_code`, `index_codebase`, and `search_docs` must remain COMPLETELY unchanged. Do NOT modify their signatures, behavior, or imports.
- The `search_docs` tool must still return a mock response.
</step_4_wire_into_server>

<step_5_verification>
Verify the implementation end-to-end:

1. Start the server: `uv run mcp run server.py` — confirm no import errors.
2. Using MCP Inspector (`uv run mcp dev server.py`):
   a. Call `search_history(query="initial", search_type="commits")` — expect to find the initial commit(s).
   b. Call `search_history(query="server.py", search_type="file_history", target_file="server.py")` — expect the commit history for server.py.
   c. Call `search_history(query="server.py", search_type="blame", target_file="server.py", line_start=1, line_end=10)` — expect blame data for the first 10 lines.
   d. Pick a commit hash from step (a) and call `search_history(query="<hash>", search_type="commit_detail")` — expect full commit details.
   e. Call `search_history(query="nonexistent-query-xyz", search_type="commits")` — expect an empty results list, not an error.
3. Confirm `search_code`, `index_codebase`, and `search_docs` still work correctly.
</step_5_verification>

</instructions>

<output_formatting>
- Wrap your internal planning process inside `<thinking>` tags before writing code for each step.
- Output the new Python file (`git_search.py`) in a clearly labelled `python` code block.
- For `server.py`, show ONLY the modified/added sections with `# ... existing code unchanged ...` markers.
- After all code, provide verification commands in a `bash` code block.
</output_formatting>

<quality_checklist>
Before finishing, verify your output against this checklist:
- [ ] `git_search.py` uses `gitpython` (not subprocess/shell commands) for all Git operations.
- [ ] `git_search.py` resolves the repo path by searching upward for `.git`.
- [ ] `search_commits` supports filtering by commit message and optionally by file.
- [ ] `get_file_history` uses `--follow` to track renames.
- [ ] `get_blame` supports optional line range filtering.
- [ ] `get_blame` groups consecutive lines from the same commit.
- [ ] `get_commit_detail` includes diff hunks when `target_file` is specified.
- [ ] All functions return structured dicts, not raw text.
- [ ] All functions handle errors gracefully (no repo, invalid paths, etc.).
- [ ] `server.py` only modifies `search_history` — all other tools untouched.
- [ ] `search_docs` still returns a mock response.
- [ ] `search_code` and `index_codebase` remain fully functional.
- [ ] All dates are in ISO 8601 format.
- [ ] Results are capped with sensible defaults (max 20).
- [ ] All functions have type hints and docstrings.
- [ ] No external API calls — everything reads from local `.git`.
</quality_checklist>
</system_prompt>