dowse-context 0.2.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- dowse_context-0.2.0/LICENSE +21 -0
- dowse_context-0.2.0/PKG-INFO +415 -0
- dowse_context-0.2.0/README.md +335 -0
- dowse_context-0.2.0/dowse/__init__.py +3 -0
- dowse_context-0.2.0/dowse/_dist.py +13 -0
- dowse_context-0.2.0/dowse/cli.py +267 -0
- dowse_context-0.2.0/dowse/cursor_hooks.py +115 -0
- dowse_context-0.2.0/dowse/definitions.py +302 -0
- dowse_context-0.2.0/dowse/embed.py +53 -0
- dowse_context-0.2.0/dowse/extract.py +351 -0
- dowse_context-0.2.0/dowse/models.py +23 -0
- dowse_context-0.2.0/dowse/server.py +127 -0
- dowse_context-0.2.0/dowse/server_lock.py +139 -0
- dowse_context-0.2.0/dowse/service.py +697 -0
- dowse_context-0.2.0/dowse/store.py +261 -0
- dowse_context-0.2.0/dowse_context.egg-info/PKG-INFO +415 -0
- dowse_context-0.2.0/dowse_context.egg-info/SOURCES.txt +33 -0
- dowse_context-0.2.0/dowse_context.egg-info/dependency_links.txt +1 -0
- dowse_context-0.2.0/dowse_context.egg-info/entry_points.txt +2 -0
- dowse_context-0.2.0/dowse_context.egg-info/requires.txt +39 -0
- dowse_context-0.2.0/dowse_context.egg-info/top_level.txt +1 -0
- dowse_context-0.2.0/pyproject.toml +77 -0
- dowse_context-0.2.0/setup.cfg +4 -0
- dowse_context-0.2.0/tests/test_ci_smoke.py +17 -0
- dowse_context-0.2.0/tests/test_dist.py +6 -0
- dowse_context-0.2.0/tests/test_doctor.py +96 -0
- dowse_context-0.2.0/tests/test_embed.py +41 -0
- dowse_context-0.2.0/tests/test_hooks.py +137 -0
- dowse_context-0.2.0/tests/test_init.py +228 -0
- dowse_context-0.2.0/tests/test_languages.py +214 -0
- dowse_context-0.2.0/tests/test_mcp.py +89 -0
- dowse_context-0.2.0/tests/test_packaging.py +73 -0
- dowse_context-0.2.0/tests/test_pipeline.py +576 -0
- dowse_context-0.2.0/tests/test_release.py +73 -0
- dowse_context-0.2.0/tests/test_status.py +133 -0
|
@@ -0,0 +1,21 @@
|
|
|
1
|
+
MIT License
|
|
2
|
+
|
|
3
|
+
Copyright (c) 2026 David Perez
|
|
4
|
+
|
|
5
|
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
|
6
|
+
of this software and associated documentation files (the "Software"), to deal
|
|
7
|
+
in the Software without restriction, including without limitation the rights
|
|
8
|
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
|
9
|
+
copies of the Software, and to permit persons to whom the Software is
|
|
10
|
+
furnished to do so, subject to the following conditions:
|
|
11
|
+
|
|
12
|
+
The above copyright notice and this permission notice shall be included in all
|
|
13
|
+
copies or substantial portions of the Software.
|
|
14
|
+
|
|
15
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
|
16
|
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
|
17
|
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
|
18
|
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
|
19
|
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
|
20
|
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
|
21
|
+
SOFTWARE.
|
|
@@ -0,0 +1,415 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: dowse-context
|
|
3
|
+
Version: 0.2.0
|
|
4
|
+
Summary: Fluff-free local code Context Engine: tree-sitter extraction + zvec hybrid retrieval.
|
|
5
|
+
Author: David Perez
|
|
6
|
+
Maintainer: David Perez
|
|
7
|
+
License: MIT License
|
|
8
|
+
|
|
9
|
+
Copyright (c) 2026 David Perez
|
|
10
|
+
|
|
11
|
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
|
12
|
+
of this software and associated documentation files (the "Software"), to deal
|
|
13
|
+
in the Software without restriction, including without limitation the rights
|
|
14
|
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
|
15
|
+
copies of the Software, and to permit persons to whom the Software is
|
|
16
|
+
furnished to do so, subject to the following conditions:
|
|
17
|
+
|
|
18
|
+
The above copyright notice and this permission notice shall be included in all
|
|
19
|
+
copies or substantial portions of the Software.
|
|
20
|
+
|
|
21
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
|
22
|
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
|
23
|
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
|
24
|
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
|
25
|
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
|
26
|
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
|
27
|
+
SOFTWARE.
|
|
28
|
+
|
|
29
|
+
Project-URL: Homepage, https://github.com/perezdap/dowse
|
|
30
|
+
Project-URL: Repository, https://github.com/perezdap/dowse
|
|
31
|
+
Project-URL: Issues, https://github.com/perezdap/dowse/issues
|
|
32
|
+
Keywords: code-search,semantic-search,mcp,tree-sitter,vector-database
|
|
33
|
+
Classifier: Development Status :: 3 - Alpha
|
|
34
|
+
Classifier: Environment :: Console
|
|
35
|
+
Classifier: Intended Audience :: Developers
|
|
36
|
+
Classifier: License :: OSI Approved :: MIT License
|
|
37
|
+
Classifier: Operating System :: OS Independent
|
|
38
|
+
Classifier: Programming Language :: Python :: 3
|
|
39
|
+
Classifier: Programming Language :: Python :: 3 :: Only
|
|
40
|
+
Classifier: Programming Language :: Python :: 3.10
|
|
41
|
+
Classifier: Programming Language :: Python :: 3.11
|
|
42
|
+
Classifier: Programming Language :: Python :: 3.12
|
|
43
|
+
Classifier: Topic :: Software Development
|
|
44
|
+
Classifier: Topic :: Software Development :: Libraries :: Python Modules
|
|
45
|
+
Requires-Python: >=3.10
|
|
46
|
+
Description-Content-Type: text/markdown
|
|
47
|
+
License-File: LICENSE
|
|
48
|
+
Requires-Dist: zvec>=0.5.0
|
|
49
|
+
Requires-Dist: sentence-transformers>=2.7
|
|
50
|
+
Requires-Dist: typer>=0.12
|
|
51
|
+
Requires-Dist: tree-sitter<0.26,>=0.24
|
|
52
|
+
Requires-Dist: tree-sitter-python>=0.23
|
|
53
|
+
Requires-Dist: tree-sitter-powershell>=0.26
|
|
54
|
+
Requires-Dist: tree-sitter-c-sharp>=0.23
|
|
55
|
+
Provides-Extra: dev
|
|
56
|
+
Requires-Dist: pytest>=8.0; extra == "dev"
|
|
57
|
+
Requires-Dist: tree-sitter-rust>=0.24; extra == "dev"
|
|
58
|
+
Requires-Dist: tree-sitter-bash>=0.25; extra == "dev"
|
|
59
|
+
Requires-Dist: tree-sitter-typescript>=0.23; extra == "dev"
|
|
60
|
+
Requires-Dist: ruff>=0.6; extra == "dev"
|
|
61
|
+
Provides-Extra: mcp
|
|
62
|
+
Requires-Dist: mcp>=1.27; extra == "mcp"
|
|
63
|
+
Provides-Extra: javascript
|
|
64
|
+
Requires-Dist: tree-sitter-javascript>=0.23; extra == "javascript"
|
|
65
|
+
Provides-Extra: typescript
|
|
66
|
+
Requires-Dist: tree-sitter-typescript>=0.23; extra == "typescript"
|
|
67
|
+
Provides-Extra: go
|
|
68
|
+
Requires-Dist: tree-sitter-go>=0.25; extra == "go"
|
|
69
|
+
Provides-Extra: rust
|
|
70
|
+
Requires-Dist: tree-sitter-rust>=0.24; extra == "rust"
|
|
71
|
+
Provides-Extra: bash
|
|
72
|
+
Requires-Dist: tree-sitter-bash>=0.25; extra == "bash"
|
|
73
|
+
Provides-Extra: all-langs
|
|
74
|
+
Requires-Dist: tree-sitter-javascript>=0.23; extra == "all-langs"
|
|
75
|
+
Requires-Dist: tree-sitter-typescript>=0.23; extra == "all-langs"
|
|
76
|
+
Requires-Dist: tree-sitter-go>=0.25; extra == "all-langs"
|
|
77
|
+
Requires-Dist: tree-sitter-rust>=0.24; extra == "all-langs"
|
|
78
|
+
Requires-Dist: tree-sitter-bash>=0.25; extra == "all-langs"
|
|
79
|
+
Dynamic: license-file
|
|
80
|
+
|
|
81
|
+
# dowse
|
|
82
|
+
|
|
83
|
+
A small, fluff-free CLI that turns a code tree into a queryable **Context Engine**. It parses files with tree-sitter, keeps only function/class definitions (not whole files), embeds them with a local `sentence-transformers` model, and stores them in **zvec** (Alibaba's embedded vector DB). Querying returns a clean JSON payload of the top-N most relevant snippets — designed to be piped into `jq`, `grep`, or straight into a prompt file.
|
|
84
|
+
|
|
85
|
+
No TUI, no dashboards, no progress bars. `stdout` is JSON only; all human/progress output goes to `stderr`, so pipelines stay clean.
|
|
86
|
+
|
|
87
|
+
## Layout
|
|
88
|
+
|
|
89
|
+
```
|
|
90
|
+
dowse/
|
|
91
|
+
models.py # the Symbol record shared by all extractors
|
|
92
|
+
extract.py # tree-sitter -> flattened function/class symbols
|
|
93
|
+
definitions.py # YAML/Markdown/.NET project definitions -> sections
|
|
94
|
+
embed.py # sentence-transformers wrapper (lazy-loaded)
|
|
95
|
+
store.py # zvec schema, idempotent indexing, hybrid query
|
|
96
|
+
service.py # core index/query logic (one impl, shared)
|
|
97
|
+
cli.py # Typer CLI: `index`, `query`, `status`, `doctor`, `init`, `hook`, `serve`
|
|
98
|
+
server.py # MCP (FastMCP) stdio server wrapping the same logic
|
|
99
|
+
requirements.txt
|
|
100
|
+
pyproject.toml # PyPI `dowse-context`; CLI entrypoint `dowse`; extras: [mcp], [go], ...
|
|
101
|
+
```
|
|
102
|
+
|
|
103
|
+
## Install
|
|
104
|
+
|
|
105
|
+
### End-user install
|
|
106
|
+
|
|
107
|
+
Install the **`dowse-context`** package (PyPI name; CLI command **`dowse`**) into an existing Python environment when you want to index or query code without a development checkout:
|
|
108
|
+
|
|
109
|
+
```bash
|
|
110
|
+
pip install dowse-context
|
|
111
|
+
pip install "dowse-context[mcp]" # add the MCP server dependencies
|
|
112
|
+
pip install "dowse-context[mcp,all-langs]" # add MCP + every optional grammar
|
|
113
|
+
```
|
|
114
|
+
|
|
115
|
+
For a **global** `dowse` on your PATH (one install, any repo), use **pipx** or **uv tool** instead of a project venv:
|
|
116
|
+
|
|
117
|
+
```bash
|
|
118
|
+
pipx install dowse-context
|
|
119
|
+
pipx install "dowse-context[mcp]"
|
|
120
|
+
pipx install "dowse-context[mcp,all-langs]"
|
|
121
|
+
|
|
122
|
+
uv tool install dowse-context
|
|
123
|
+
uv tool install "dowse-context[mcp]"
|
|
124
|
+
uv tool install "dowse-context[mcp,all-langs]"
|
|
125
|
+
```
|
|
126
|
+
|
|
127
|
+
**Languages in the base install vs optional extras:** the default `dowse-context` install ships grammars for **Python**, **PowerShell**, and **C#** (see [Language support](#language-support) for extensions and wheels). **JavaScript**, **TypeScript**, **Go**, **Rust**, and **Bash** are optional — install per-language extras (`dowse-context[go]`, etc.) or the **`all-langs`** bundle. `dowse status` / `dowse init` report missing grammars with `pip install` hints when a repo uses files you have not installed yet.
|
|
128
|
+
|
|
129
|
+
### Development
|
|
130
|
+
|
|
131
|
+
Use an editable install when you are working on dowse itself:
|
|
132
|
+
|
|
133
|
+
```bash
|
|
134
|
+
python -m venv .venv && . .venv/bin/activate # Windows: .venv\Scripts\activate
|
|
135
|
+
pip install -e ".[dev]"
|
|
136
|
+
pip install -e ".[dev,mcp]" # if you want to exercise `dowse serve`
|
|
137
|
+
```
|
|
138
|
+
|
|
139
|
+
This was built and tested against `zvec 0.5.0`, `tree-sitter 0.25.2`, `tree-sitter-python 0.25.0`, `tree-sitter-powershell 0.26.4`, `tree-sitter-c-sharp 0.23.5`, `typer 0.26`, on CPython 3.12. zvec ships prebuilt wheels for Linux (x86_64/ARM64), macOS (ARM64), and **Windows x86-64** (added in zvec 0.3.0) — so on Windows just use 64-bit Python 3.12 and `pip install` works with no compiler. The first `index`/`query` downloads the ~80 MB MiniLM model once, then runs fully offline.
|
|
140
|
+
|
|
141
|
+
## The zvec schema (the "flattened AST" record)
|
|
142
|
+
|
|
143
|
+
Each symbol becomes one zvec document: a single dense vector plus scalar fields. Schema (in `store.py`):
|
|
144
|
+
|
|
145
|
+
| field | type | purpose |
|---------------------------|----------------|----------------------------------------------|
| `embedding` | `VECTOR_FP32` | dense vector, HNSW + **cosine** |
| `file_path` | `STRING` (idx) | POSIX path relative to the indexed root |
| `symbol_name` | `STRING` | qualified name, e.g. `SessionManager.revoke` |
| `kind` | `STRING` (idx) | `function` or `class` |
| `language` | `STRING` (idx) | e.g. `python` |
| `start_line` / `end_line` | `INT32` | location for jumping to source |
| `code_content` | `STRING` | the exact snippet text |
|
|
154
|
+
|
|
155
|
+
`(idx)` fields carry an inverted index so SQL filters on them are fast. The embedding dimension is taken from the model at index time (MiniLM → 384), so the schema always matches the model.
|
|
156
|
+
|
|
157
|
+
A couple of zvec specifics worth knowing, since they're easy to get wrong:
|
|
158
|
+
|
|
159
|
+
- For the cosine metric, the `score` returned by `query()` is a **distance** (0 = identical). The tool converts it to a similarity as `1 - score`.
|
|
160
|
+
- `query()` returns the scalar `fields` inline, so retrieval needs no second fetch.
|
|
161
|
+
- Filters are SQL-style: `kind = 'function'`, `code_content LIKE '%retry%'`, `AND`/`OR`/`NOT`/`IN`. (`==` is a syntax error.)
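
As a minimal sketch of the two conventions above (not the code in `store.py`), the helpers below convert a cosine distance into a similarity and build a `kind` filter with a single `=`. The function names and the set of allowed kinds are assumptions made for illustration.

```python
# Hypothetical helpers; names and the allowed-kind set are illustrative only.
ALLOWED_KINDS = {"function", "class", "section"}


def cosine_distance_to_similarity(distance: float) -> float:
    """Convert a cosine distance (0 = identical) into a similarity."""
    return 1.0 - distance


def kind_filter(kind: str) -> str:
    """Build a SQL-style filter string; note the single '=' rather than '=='."""
    if kind not in ALLOWED_KINDS:
        raise ValueError(f"unsupported kind: {kind!r}")
    return f"kind = '{kind}'"
```

Validating `kind` against a fixed set sidesteps quoting questions entirely, which seems safer than guessing at the filter language's escaping rules.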
|
|
162
|
+
|
|
163
|
+
## The indexing loop
|
|
164
|
+
|
|
165
|
+
`dowse index` walks the directory (skipping `.git`, `node_modules`, `__pycache__`, virtualenvs, build dirs — but only *below* the root, so a project living under a path like `.../build/...` still indexes), and for each supported file:
|
|
166
|
+
|
|
167
|
+
1. Parse once with tree-sitter; collect every `function_definition` / `class_definition` node. Names are qualified by walking enclosing definitions (so a method reads as `Class.method`).
|
|
168
|
+
2. Embed each symbol as `"{kind} {qualified_name}\n{body}"` (body capped at ~2k chars).
|
|
169
|
+
3. Reconcile the file in zvec **idempotently**.
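
The walk and the embedding text can be sketched roughly as follows. This is an illustration under stated assumptions, not the tool's implementation: the skip list, the exact 2000-character cap, and both function names are hypothetical.

```python
import os
from pathlib import Path
from typing import Iterator

# Hypothetical skip list; the real tool's list may differ.
SKIP_DIRS = {".git", "node_modules", "__pycache__", ".venv", "venv", "build", "dist"}
MAX_BODY_CHARS = 2000  # assumed value for the "~2k chars" cap


def iter_source_files(root: Path, extensions: set[str]) -> Iterator[Path]:
    """Yield supported files under root, pruning skipped directories below it."""
    for current, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into skipped directories.
        # Only names *below* root are checked, so root itself may sit under
        # a directory called "build" and still be indexed.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in sorted(filenames):
            path = Path(current) / name
            if path.suffix in extensions:
                yield path


def embedding_text(kind: str, qualified_name: str, body: str) -> str:
    """Format one symbol as '{kind} {qualified_name}\\n{body}' with a capped body."""
    return f"{kind} {qualified_name}\n{body[:MAX_BODY_CHARS]}"
```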
|
|
170
|
+
|
|
171
|
+
The reconcile step is deliberate. zvec's `insert` ignores ids that already exist, and re-inserting a *deleted* id is tombstoned — so a naive "delete then re-insert" loses data. Instead each document id is `sha1(file_path::symbol_name::kind)` (stable across line moves), and per file the tool `upsert`s the current symbols, then deletes only the ids that have disappeared. Result: re-running `index` on an unchanged tree is a no-op; editing a file updates changed symbols, adds new ones, and removes deleted ones — without ever duplicating rows. After all files, one `optimize()` builds the vector index.
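
A rough sketch of that reconcile step, using a plain `dict` as a stand-in for the zvec collection (an assumption made so the example runs without zvec). Only the id recipe `sha1(file_path::symbol_name::kind)` comes from the description above; the function names and record shape are hypothetical.

```python
import hashlib


def symbol_id(file_path: str, symbol_name: str, kind: str) -> str:
    """Stable document id: independent of line numbers, so moves keep the id."""
    key = f"{file_path}::{symbol_name}::{kind}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


def reconcile_file(store: dict[str, dict], file_path: str, symbols: list[dict]) -> None:
    """Upsert the file's current symbols, then delete only the ids that vanished.

    `store` is a plain dict standing in for the vector collection.
    """
    current = {
        symbol_id(file_path, s["symbol_name"], s["kind"]): {**s, "file_path": file_path}
        for s in symbols
    }
    previous = {doc_id for doc_id, doc in store.items() if doc["file_path"] == file_path}
    store.update(current)                      # upsert: insert new, overwrite changed
    for doc_id in previous - set(current):     # remove symbols no longer in the file
        del store[doc_id]
```

Running `reconcile_file` twice with the same symbols leaves the store unchanged, which is the idempotence property the paragraph describes.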
|
|
172
|
+
|
|
173
|
+
```bash
|
|
174
|
+
dowse index ./my_project --db ./.dowse_index # incremental, idempotent
|
|
175
|
+
dowse index ./my_project --db ./.dowse_index --reset # clean rebuild
|
|
176
|
+
```
|
|
177
|
+
|
|
178
|
+
`index` prints a JSON summary to stdout:
|
|
179
|
+
|
|
180
|
+
```json
|
|
181
|
+
{ "status": "ok", "indexed_files": 42, "indexed_symbols": 311, "dimension": 384, "db": "./.dowse_index", "elapsed_seconds": 8.4 }
|
|
182
|
+
```
|
|
183
|
+
|
|
184
|
+
## Checking index health
|
|
185
|
+
|
|
186
|
+
`dowse status` reports whether an index exists, how big it is, which languages it covers, and whether it has gone stale — so an agent (or you) can decide whether to index before querying instead of guessing. With `--root` set, `--db` defaults to `<root>/.dowse_index`, and two extra signals light up: `stale` (a source file newer than the index) and `missing_grammars` (files on disk whose grammar wheel isn't installed, each with an actionable `install_hint`).
|
|
187
|
+
|
|
188
|
+
```bash
|
|
189
|
+
dowse status --root ./my_project # db defaults to ./my_project/.dowse_index
|
|
190
|
+
dowse status --db ./.dowse_index # no root, so no stale / missing_grammars signals
|
|
191
|
+
```
|
|
192
|
+
|
|
193
|
+
```json
|
|
194
|
+
{
|
|
195
|
+
"exists": true, "db_path": "./.dowse_index",
|
|
196
|
+
"indexed_files": 42, "indexed_symbols": 311, "dimension": 384,
|
|
197
|
+
"languages": ["python", "rust"],
|
|
198
|
+
"last_indexed_at": 1781460324.23, "stale": false,
|
|
199
|
+
"missing_grammars": [
|
|
200
|
+
{ "language": "go", "extensions": [".go"], "file_count": 12,
|
|
201
|
+
"install_hint": "pip install \"dowse-context[go]\"" }
|
|
202
|
+
]
|
|
203
|
+
}
|
|
204
|
+
```
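
One plausible way to derive the `stale` flag, shown here only as a sketch: compare each source file's modification time with `last_indexed_at`. That the tool relies on file mtimes is an assumption, and `is_stale` is a hypothetical name.

```python
from pathlib import Path
from typing import Iterable


def is_stale(source_files: Iterable[Path], last_indexed_at: float) -> bool:
    """True if any source file was modified after the index timestamp.

    Assumes last_indexed_at is a Unix timestamp, like the value in the
    status payload above.
    """
    return any(path.stat().st_mtime > last_indexed_at for path in source_files)
```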
|
|
205
|
+
|
|
206
|
+
`dowse doctor` bundles install facts (Python version, dowse module path, MCP SDK),
|
|
207
|
+
index health (same fields as `status`), serve/index lock probes, and whether
|
|
208
|
+
`.mcp.json` / `.cursor/mcp.json` reference a dowse MCP server — one JSON blob for
|
|
209
|
+
agents debugging setup.
|
|
210
|
+
|
|
211
|
+
```bash
|
|
212
|
+
dowse doctor --root ./my_project
|
|
213
|
+
```
|
|
214
|
+
|
|
215
|
+
## One-command bootstrap
|
|
216
|
+
|
|
217
|
+
`dowse init` wires a repo for agent use in one step: it writes or merges
|
|
218
|
+
`.mcp.json` with a `dowse` server entry, adds `.dowse_index/` to `.gitignore`,
|
|
219
|
+
reports any missing grammar extras, and runs the initial index.
|
|
220
|
+
|
|
221
|
+
```bash
|
|
222
|
+
dowse init ./my_project # full bootstrap with initial index
|
|
223
|
+
dowse init ./my_project --skip-index # config + gitignore only, no index
|
|
224
|
+
dowse init ./my_project --db ./my_project/.dowse_index # explicit db path
|
|
225
|
+
dowse init ./my_project --harness pi # Pi preset for pi-mcp-adapter
|
|
226
|
+
dowse init ./my_project --auto-index # also install Cursor sessionStart hook (once per machine)
|
|
227
|
+
dowse hook install # same hook installer without re-running init
|
|
228
|
+
```
|
|
229
|
+
|
|
230
|
+
**Opt-in Cursor auto-index:** `dowse hook install` (or `init --auto-index`) adds a
|
|
231
|
+
user-level `sessionStart` hook in `~/.cursor/hooks.json` that runs
|
|
232
|
+
`dowse hook session-start`. Dowse only indexes workspaces that already opted in via
|
|
233
|
+
`dowse init` (`.dowse_index/` present). Hook failures are **fail-open** — they never
|
|
234
|
+
block Cursor. A concurrent `dowse serve` or another indexer may hold the zvec lock;
|
|
235
|
+
the hook logs to stderr and exits successfully. For Pi / Claude Code, prefer MCP
|
|
236
|
+
`index_status` → `index_codebase` on a long-lived `dowse serve` instead of hooks.
|
|
237
|
+
|
|
238
|
+
The generated `.mcp.json` uses the global `dowse` command (not a dev venv path)
|
|
239
|
+
and runs `serve --db .dowse_index` relative to the repo root. Re-running `init`
|
|
240
|
+
is idempotent: no duplicate `.gitignore` lines, no clobbered MCP servers, no
|
|
241
|
+
duplicate `dowse` entries.
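
The idempotent merge can be pictured with the sketch below. It is not the code behind `dowse init`: the function names and return values are hypothetical, and only the `mcpServers` key, the `dowse` command, and the `serve --db .dowse_index` arguments come from this README.

```python
import json
from pathlib import Path


def merge_mcp_config(path: Path) -> bool:
    """Add a dowse server entry if absent; return True when the file changed."""
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    servers = config.setdefault("mcpServers", {})
    if "dowse" in servers:
        return False  # already wired; leave every existing entry untouched
    servers["dowse"] = {"command": "dowse", "args": ["serve", "--db", ".dowse_index"]}
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return True


def ensure_gitignore_line(path: Path, line: str = ".dowse_index/") -> bool:
    """Append the ignore line once; return True when the file changed."""
    existing = path.read_text(encoding="utf-8").splitlines() if path.exists() else []
    if line in existing:
        return False
    existing.append(line)
    path.write_text("\n".join(existing) + "\n", encoding="utf-8")
    return True
```

Both helpers check before writing, so a second run finds nothing to do and leaves other MCP servers and ignore rules as they were.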
|
|
242
|
+
|
|
243
|
+
The `--harness pi` preset keeps the same MCP server shape and adds
|
|
244
|
+
`"directTools": true` for `pi-mcp-adapter`, so Dowse's MCP tools can appear as
|
|
245
|
+
first-class Pi tools when the adapter supports direct tool promotion. **Pi core does not include MCP**; install Pi itself separately and install the adapter with
|
|
246
|
+
`pi install npm:pi-mcp-adapter`. `dowse init --harness pi` only detects whether
|
|
247
|
+
Pi and `pi-mcp-adapter` appear installed and reports guidance — it does not run
|
|
248
|
+
`npm install` or `pi install` for you.
|
|
249
|
+
|
|
250
|
+
```json
|
|
251
|
+
{
|
|
252
|
+
"status": "ok",
|
|
253
|
+
"workspace": {"root": "/path/to/my_project", "db_path": "/path/to/my_project/.dowse_index"},
|
|
254
|
+
"mcp_config": {"created": true, "merged": false},
|
|
255
|
+
"gitignore": {"path": "/path/to/my_project/.gitignore"},
|
|
256
|
+
"missing_grammars": [
|
|
257
|
+
{"language": "go", "extensions": [".go"], "file_count": 12,
|
|
258
|
+
"install_hint": "pip install \"dowse-context[go]\""}
|
|
259
|
+
],
|
|
260
|
+
"index": {"status": "ok", "indexed_files": 42, "indexed_symbols": 311, "dimension": 384,
|
|
261
|
+
"db": "/path/to/my_project/.dowse_index", "elapsed_seconds": 8.4}
|
|
262
|
+
}
|
|
263
|
+
```
|
|
264
|
+
|
|
265
|
+
## Querying (hybrid search)
|
|
266
|
+
|
|
267
|
+
`dowse query` embeds your text, pulls a pool of dense candidates from zvec, then re-ranks them by combining semantic similarity with a cheap lexical overlap score (`final = 0.7·dense + 0.3·lexical`). The lexical pass is what makes pasting a raw error message work well — error text usually names the exact symbol, and the symbol-name match floats it to the top even if the embedding alone wouldn't. You can also push a native scalar filter down into zvec.
|
|
268
|
+
|
|
269
|
+
```bash
|
|
270
|
+
# Natural language
|
|
271
|
+
dowse query "how are auth tokens generated" --db ./.dowse_index
|
|
272
|
+
|
|
273
|
+
# Paste an error; restrict to functions; take top 5
|
|
274
|
+
dowse query "RuntimeError: connection pool exhausted" --db ./.dowse_index --kind function -n 5
|
|
275
|
+
|
|
276
|
+
# Pipe straight into jq — get just file:line for each hit
|
|
277
|
+
dowse query "retry with backoff" --db ./.dowse_index \
|
|
278
|
+
| jq -r '.results[] | "\(.file_path):\(.start_line) \(.symbol_name)"'
|
|
279
|
+
|
|
280
|
+
# Build a prompt-context file of just the snippets
|
|
281
|
+
dowse query "where do we validate JWT claims" --db ./.dowse_index \
|
|
282
|
+
| jq -r '.results[].code_content' > context.txt
|
|
283
|
+
|
|
284
|
+
# Raw zvec filter for anything the shortcuts don't cover
|
|
285
|
+
dowse query "db connection" --filter "language = 'python' AND file_path LIKE 'pkg/%'"
|
|
286
|
+
|
|
287
|
+
# Estimate prompt-token savings versus the full files containing the returned snippets
|
|
288
|
+
dowse query "retry with backoff" --tokens --root ./my_project --db ./.dowse_index
|
|
289
|
+
```
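
For callers that would rather stay in Python than pipe through `jq`, the first `jq` one-liner above could be mirrored like this. `format_hits` is a hypothetical helper; it relies only on the `results`, `file_path`, `start_line`, and `symbol_name` fields used in that command.

```python
import json


def format_hits(payload: str) -> list[str]:
    """Turn the JSON printed by `dowse query` into 'file:line symbol' strings."""
    data = json.loads(payload)
    return [
        f"{hit['file_path']}:{hit['start_line']} {hit['symbol_name']}"
        for hit in data.get("results", [])
    ]
```

Because `stdout` carries JSON only, the payload can come straight from `subprocess.run([...], capture_output=True, text=True).stdout` without any filtering.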
|
|
290
|
+
|
|
291
|
+
Query output shape:
|
|
292
|
+
|
|
293
|
+
```json
|
|
294
|
+
{
|
|
295
|
+
"query": "...",
|
|
296
|
+
"filter": "kind = 'function'",
|
|
297
|
+
"results": [
|
|
298
|
+
{
|
|
299
|
+
"rank": 1, "score": 0.72, "dense_similarity": 0.82, "lexical_score": 0.48,
|
|
300
|
+
"file_path": "pkg/db.py", "symbol_name": "Connection.query", "kind": "function",
|
|
301
|
+
"language": "python", "start_line": 6, "end_line": 7,
|
|
302
|
+
"code_content": "def query(self, sql): ..."
|
|
303
|
+
}
|
|
304
|
+
]
|
|
305
|
+
}
|
|
306
|
+
```
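
To make the relationship between `score`, `dense_similarity`, and `lexical_score` concrete, here is a small sketch of a weighted re-rank. Only the formula `0.7·dense + 0.3·lexical` comes from this README; the token-overlap measure and all function names are assumptions, and the tool's real lexical scoring may differ.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lower-cased word-like tokens (letters, digits, underscores)."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def lexical_overlap(query: str, text: str) -> float:
    """Fraction of query tokens that also occur in text (hypothetical measure)."""
    query_tokens = _tokens(query)
    if not query_tokens:
        return 0.0
    return len(query_tokens & _tokens(text)) / len(query_tokens)


def hybrid_score(dense: float, lexical: float,
                 w_dense: float = 0.7, w_lexical: float = 0.3) -> float:
    """Weighted blend of semantic and lexical evidence."""
    return w_dense * dense + w_lexical * lexical


def rerank(query: str, candidates: list[dict], top_n: int = 5) -> list[dict]:
    """Score each dense candidate, sort by blended score, and attach ranks."""
    scored = []
    for cand in candidates:
        lexical = lexical_overlap(query, f"{cand['symbol_name']} {cand['code_content']}")
        score = hybrid_score(cand["dense_similarity"], lexical)
        scored.append({**cand, "lexical_score": lexical, "score": score})
    scored.sort(key=lambda c: c["score"], reverse=True)
    return [{"rank": i, **c} for i, c in enumerate(scored[:top_n], start=1)]
```

With the sample values above, `hybrid_score(0.82, 0.48)` is about 0.718, which rounds to the `0.72` shown in the payload.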
|
|
307
|
+
|
|
308
|
+
With `--tokens`, the same JSON payload includes a `token_savings` report:
|
|
309
|
+
|
|
310
|
+
```json
|
|
311
|
+
{
|
|
312
|
+
"token_savings": {
|
|
313
|
+
"estimator": "regex-v1",
|
|
314
|
+
"snippet_tokens": 120,
|
|
315
|
+
"full_file_tokens": 980,
|
|
316
|
+
"saved_tokens": 860,
|
|
317
|
+
"reduction_percent": 87.76,
|
|
318
|
+
"results": [
|
|
319
|
+
{"rank": 1, "file_path": "pkg/db.py", "symbol_name": "Connection.query", "snippet_tokens": 42}
|
|
320
|
+
],
|
|
321
|
+
"files": [
|
|
322
|
+
{"file_path": "pkg/db.py", "full_file_tokens": 230}
|
|
323
|
+
]
|
|
324
|
+
}
|
|
325
|
+
}
|
|
326
|
+
```
|
|
327
|
+
|
|
328
|
+
The token report uses a lightweight deterministic approximation (`regex-v1`) that counts code-like words, numbers, and punctuation. It is not model-tokenizer exact, but it is stable, dependency-free, and good enough to show relative savings. Full-file comparison counts each of the full files containing the returned snippets once, so multiple snippets from one file do not double-count the baseline.
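
A regex estimator of this general kind might look like the sketch below. The pattern is a guess at what "code-like words, numbers, and punctuation" could mean, not the actual `regex-v1` definition, and both function names are hypothetical.

```python
import re

# Hypothetical pattern: identifiers/words, numbers, or single punctuation marks.
TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|\d+(?:\.\d+)?|[^\sA-Za-z0-9_]")


def estimate_tokens(text: str) -> int:
    """Deterministic, dependency-free token estimate."""
    return len(TOKEN_RE.findall(text))


def token_savings(snippet_tokens: int, full_file_tokens: int) -> dict:
    """Compare returned snippets against the full files that contain them."""
    saved = full_file_tokens - snippet_tokens
    percent = round(100.0 * saved / full_file_tokens, 2) if full_file_tokens else 0.0
    return {
        "snippet_tokens": snippet_tokens,
        "full_file_tokens": full_file_tokens,
        "saved_tokens": saved,
        "reduction_percent": percent,
    }
```

Plugging in the sample report's totals, `token_savings(120, 980)` gives 860 saved tokens and a reduction of 87.76 percent.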
|
|
329
|
+
|
|
330
|
+
Tuning knobs: `--top/-n`, `--candidates` (dense pool size before re-rank), `--w-dense` / `--w-lexical`. Use the same `--model` for `query` as you used for `index`.
|
|
331
|
+
|
|
332
|
+
## Using it from a coding harness (MCP)
|
|
333
|
+
|
|
334
|
+
The CLI is already harness-usable as-is: any agent that can run a shell command can call `dowse query "..."` and read the JSON. But for harnesses that speak MCP (Claude Code, Claude Desktop, Cursor, Copilot CLI), `dowse serve` exposes the same logic as three native tools over stdio:
|
|
335
|
+
|
|
336
|
+
```bash
|
|
337
|
+
pip install "dowse-context[mcp]" # adds the official mcp SDK
|
|
338
|
+
dowse serve --db ./.dowse_index # speaks MCP on stdio
|
|
339
|
+
```
|
|
340
|
+
|
|
341
|
+
- **`query_context`** — semantic code lookup. Returns the same ranked snippet list as `dowse query`. Its description tells the agent to use it for *meaning-based recall* (describe behaviour, paste an error) as a complement to `grep`/`glob`, which stay better when you know the literal string.
|
|
342
|
+
- **`index_codebase`** — build/refresh the index (idempotent; `definitions` and `reset` flags exposed).
|
|
343
|
+
- **`index_status`** — self-diagnosis. Call before indexing/querying to learn whether an index exists, which languages it covers, whether it's gone stale, and which grammars are missing (with install hints). Never throws on a missing index — it reports state so the agent can choose its next step.
|
|
344
|
+
|
|
345
|
+
Prefer one long-lived MCP server per repo over competing server/index processes. `dowse query` and `dowse status` open the collection read-only, so multiple independent agents can query the same `.dowse_index` concurrently. Indexing still needs write access, and zvec does not allow readers and writers at the same time; those conflicts are reported as a concise stderr error instead of a traceback. `dowse serve` serializes in-process tool calls for the same index, holds a dedicated `<db>.serve.lock` for its lifetime so a second server for the same index refuses to start, and performs an active-writer preflight before startup.
|
|
346
|
+
|
|
347
|
+
For parallel agents in separate git worktrees, prefer a per-worktree relative `--db ./.dowse_index`: each worktree gets its own collection and `.serve.lock`, so agents can index/query/serve independently and the index matches that worktree's code. Use a shared absolute `--db` only when agents intentionally share one checkout/index.
|
|
348
|
+
|
|
349
|
+
Register it with a harness by pointing at the command. For Claude Code / Claude Desktop (`claude_desktop_config.json` on Windows lives at `%APPDATA%\Claude\`):
|
|
350
|
+
|
|
351
|
+
```json
|
|
352
|
+
{
|
|
353
|
+
"mcpServers": {
|
|
354
|
+
"dowse": {
|
|
355
|
+
"command": "dowse",
|
|
356
|
+
"args": ["serve", "--db", "C:\\path\\to\\.dowse_index"]
|
|
357
|
+
}
|
|
358
|
+
}
|
|
359
|
+
}
|
|
360
|
+
```
|
|
361
|
+
|
|
362
|
+
This deliberately uses the FastMCP class bundled with the official `mcp` SDK rather than the standalone `fastmcp` package. The latter's v3 line rebuilt its architecture and auth model in early 2026, and for a small local stdio server the bundled class is the stable, lower-churn choice.
|
|
363
|
+
|
|
364
|
+
## Definition files (YAML, Markdown, .NET/MSBuild)
|
|
365
|
+
|
|
366
|
+
Declarative definition files aren't code, so the function/class model doesn't fit them — but they're often exactly what you want to search ("what's the uninstall command for 7zip", "which target framework does this project use", "where is this build target defined"). Pass `--definitions` (`-D`) to additionally index them as **sections**:
|
|
367
|
+
|
|
368
|
+
```bash
|
|
369
|
+
dowse index ./packages --db ./.dowse_index --definitions
|
|
370
|
+
dowse query "silent uninstall command" --db ./.dowse_index --lang yaml
|
|
371
|
+
|
|
372
|
+
dowse index ./dotnet-repo --db ./.dowse_index --definitions
|
|
373
|
+
dowse query "target framework and nullable settings" --db ./.dowse_index --lang msbuild
|
|
374
|
+
dowse query "custom GenerateVersion build target" --db ./.dowse_index --kind section --lang msbuild
|
|
375
|
+
```
|
|
376
|
+
|
|
377
|
+
- **YAML profiles** (Payload-style): each top-level key becomes a section, qualified by the package name if the file has a `name:`/`id:`/`packageName:` field — e.g. `7zip.install`, `7zip.uninstall`, `7zip.detection`.
|
|
378
|
+
- **Markdown definitions** (PowerPacker-style): each ATX heading becomes a section qualified by its heading ancestry — e.g. `Google Chrome.Install`, `Google Chrome.Install.Pre-Install`. Headings inside fenced code blocks are ignored.
|
|
379
|
+
- **.NET/MSBuild XML** (`.csproj`, `.props`, `.targets`): `PropertyGroup`, `ItemGroup`, `ItemDefinitionGroup`, and `Target` blocks become sections, qualified by the file name and useful child names — e.g. `App.PropertyGroup.TargetFramework.Nullable`, `App.ItemGroup.PackageReference.Microsoft Extensions Logging.ProjectReference.Shared`, `Custom.Target.GenerateVersion.Message.WriteLinesToFile`.
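
As an illustration of the heading-ancestry idea for Markdown, the sketch below names each ATX heading by joining it to its ancestors and skips anything inside a fenced block. It is a simplified stand-in for the extractor in `definitions.py`, with hypothetical names and deliberately naive fence handling.

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE_RE = re.compile(r"^\s*(`{3,}|~{3,})")


def markdown_section_names(text: str) -> list[str]:
    """Return dotted section names built from heading ancestry."""
    names: list[str] = []
    stack: list[tuple[int, str]] = []  # (heading level, title) of open ancestors
    in_fence = False
    for line in text.splitlines():
        if FENCE_RE.match(line):
            in_fence = not in_fence    # naive toggle: any fence line flips state
            continue
        if in_fence:
            continue                   # headings inside code fences are ignored
        match = HEADING_RE.match(line)
        if not match:
            continue
        level, title = len(match.group(1)), match.group(2)
        while stack and stack[-1][0] >= level:
            stack.pop()                # close siblings and deeper sections
        stack.append((level, title))
        names.append(".".join(t for _, t in stack))
    return names
```

A stack keyed on heading level keeps the ancestry correct when the document jumps back from a deep heading to a shallower one.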
|
|
380
|
+
|
|
381
|
+
These extractors are pure-stdlib (no PyYAML, no Markdown parser, no MSBuild SDK): they scan regular structure and use Python's built-in XML parser where useful, which is more forgiving of half-finished files than a strict project-system dependency. The flag is **opt-in** so a normal code index doesn't slurp every `README.md`, CI YAML, or project metadata file in the repo. The sections land in the same collection with `kind` set to `section` and `language` set to `yaml`, `markdown`, or `msbuild`, so you can filter them with `--lang msbuild` or `--kind section`. To add other declarative formats, drop an extractor into `definitions.py` and register its extension.
|
|
382
|
+
|
|
383
|
+
## Language support
|
|
384
|
+
|
|
385
|
+
`extract.py` has a small registry mapping extensions to a grammar loader and the node types that count as definitions. A language activates automatically if its grammar wheel is installed; uninstalled grammars are skipped rather than erroring.
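
The registry idea can be sketched as below. This is not the structure in `extract.py`: `LanguageSpec`, `REGISTRY`, and `active_extensions` are hypothetical names, the two entries are illustrative, and probing for the wheel with `importlib.util.find_spec` is an assumed mechanism.

```python
import importlib.util
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class LanguageSpec:
    name: str
    module: str                       # importable module shipped by the grammar wheel
    extensions: tuple[str, ...]
    definition_types: tuple[str, ...]  # node types treated as definitions


# Illustrative entries only; the real registry covers more languages.
REGISTRY = (
    LanguageSpec("python", "tree_sitter_python", (".py", ".pyi"),
                 ("function_definition", "class_definition")),
    LanguageSpec("go", "tree_sitter_go", (".go",),
                 ("function_declaration", "method_declaration", "type_spec")),
)


def _installed(module: str) -> bool:
    """True if the grammar module can be imported in this environment."""
    return importlib.util.find_spec(module) is not None


def active_extensions(
    registry: tuple[LanguageSpec, ...] = REGISTRY,
    is_installed: Callable[[str], bool] = _installed,
) -> dict[str, LanguageSpec]:
    """Map each extension to its spec, skipping grammars that are not installed."""
    return {
        ext: spec
        for spec in registry
        if is_installed(spec.module)
        for ext in spec.extensions
    }
```

Files whose extension is absent from the returned mapping are simply skipped, which matches the "skipped rather than erroring" behaviour described above.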
|
|
386
|
+
|
|
387
|
+
**Verified end-to-end** (load offline from a self-contained wheel, no compiler, correct symbol + qualified-name extraction):
|
|
388
|
+
|
|
389
|
+
| Language | Extensions | Wheel | Notes |
|------------|----------------|--------------------------|--------------------------------------------------------|
| Python | `.py` `.pyi` | `tree-sitter-python` | reference grammar |
| PowerShell | `.ps1` `.psm1` | `tree-sitter-powershell` | `function`/`filter`/`class` + methods; `param()` aware |
| C# | `.cs` | `tree-sitter-c-sharp` | class/interface/struct/record + methods + constructors |
|
|
394
|
+
|
|
395
|
+
PowerShell needs no special-casing even though its grammar has no `name` field: identifiers live in `function_name`/`simple_name` children, which the registry resolves via `name_child_types`.
|
|
396
|
+
|
|
397
|
+
**Optional grammars** (verified node names): JavaScript, TypeScript, Go, Rust, Bash. Install them via the extras, e.g. `pip install "dowse-context[go,rust]"`, or grab them all with `pip install "dowse-context[all-langs]"`.
|
|
398
|
+
|
|
399
|
+
| Language | Extensions | Extra | Wheel | Notes |
|------------|----------------------------|--------------|--------------------------|-----------------------------------------------------------------|
| JavaScript | `.js` `.jsx` `.mjs` `.cjs` | `javascript` | `tree-sitter-javascript` | function/method/class |
| TypeScript | `.ts` `.tsx` | `typescript` | `tree-sitter-typescript` | function/method/class |
| Go | `.go` | `go` | `tree-sitter-go` | `type_spec` modelled as `kind=class` (known compromise) |
| Rust | `.rs` | `rust` | `tree-sitter-rust` | fn/struct/enum/trait; trait methods qualified by trait |
| Bash | `.sh` `.bash` | `bash` | `tree-sitter-bash` | `function_definition` (both `name()` and `function name` forms) |
|
|
406
|
+
|
|
407
|
+
When a grammar is missing, `dowse index` reports it, e.g. `skipped 12 .go files (go) - pip install "dowse-context[go]"`, so polyglot repos never fail silently.
|
|
408
|
+
|
|
409
|
+
**Deliberately not auto-handled:** most declarative/data formats (Bicep, `.psd1`, arbitrary XML/JSON) don't have a function/class shape, so the symbol model doesn't fit them. The definition extractors above are explicit opt-ins for formats with a stable section shape; other formats should get similarly small custom extractors rather than being forced through a code grammar.
|
|
410
|
+
|
|
411
|
+
> Avoid `tree-sitter-language-pack` for this tool. Despite advertising bundled wheels, version 1.8.1 fetches grammars from GitHub releases on first use — it fails the moment the network is blocked, which defeats the offline/locked-down goal. The per-language wheels above are genuinely self-contained.
|
|
412
|
+
|
|
413
|
+
## What was verified
|
|
414
|
+
|
|
415
|
+
Exercised end-to-end in a sandbox: tree-sitter extraction for Python, PowerShell, and C# (loaded offline from self-contained wheels, with correct qualified-name resolution); the full zvec lifecycle (schema, upsert, filtered queries, cosine distance→similarity, idempotent reconcile on edits); the YAML/Markdown/.NET definition extractors (package-name and heading-ancestry qualification, fence-aware Markdown, MSBuild property/item/target sections, malformed XML fallback); the CLI; and the MCP server (both tools register with correct schemas, and an in-process `query_context` call returns ranked results). The one piece run only through its standard, stable API — not against a downloaded model in the sandbox — is the `sentence-transformers` `encode()` call in `embed.py`; the first real `index` will download MiniLM and exercise it. Likewise the MCP server was verified in-process via the SDK's own client API rather than over a live stdio pipe to an external harness.
|