@oomkapwn/enquire-mcp 3.6.0-rc.3 → 3.6.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,199 @@
1
+ # v3.6.0 — Full-System Audit Plan
2
+
3
+ **Status**: scheduled for execution **after v3.6.0 stable is shipped** (`npm view @oomkapwn/enquire-mcp dist-tags` shows `latest = 3.6.0` and the GH release "v3.6.0" is marked Latest).
4
+
5
+ **Estimated effort**: ~12 hours of audit work, ~3 hours wall-clock with 7 parallel sub-agents.
6
+
7
+ **Trigger condition**:
8
+ ```bash
9
+ [ "$(npm view @oomkapwn/enquire-mcp version)" = "3.6.0" ] && \
10
+ [ "$(gh release view --repo oomkapwn/enquire-mcp --json isLatest --jq '.isLatest')" = "true" ]
11
+ ```
12
+
13
+ ## Why this audit
14
+
15
+ By the time we ship v3.6.0 stable, the project has been through 5 external audits (Mavis ×2, MiniMax, plus 2 internal self-audits) and 15+ patch releases. Each audit has been **per-RC** — it caught drift in the surfaces it touched, but didn't sweep the whole system.
16
+
17
+ The full-system audit closes that gap: every surface, every workflow, every doc, every script verified against reality in one coordinated pass.
18
+
19
+ ## Scope — 9 layers
20
+
21
+ | # | Layer | Owner | Output |
22
+ |---|---|---|---|
23
+ | L1 | Code quality | Sub-agent C1 | `docs/audits/v3.6.0-L1-code.md` |
24
+ | L2 | Architecture | Sub-agent C2 | `docs/audits/v3.6.0-L2-arch.md` |
25
+ | L3 | Tests & coverage | Sub-agent C3 | `docs/audits/v3.6.0-L3-tests.md` |
26
+ | L4 | CI/CD pipeline | Sub-agent C4 | `docs/audits/v3.6.0-L4-cicd.md` |
27
+ | L5 | Security | Sub-agent C5 | `docs/audits/v3.6.0-L5-security.md` |
28
+ | L6 | Documentation | Sub-agent C6 | `docs/audits/v3.6.0-L6-docs.md` |
29
+ | L7 | Operational | Self | `docs/audits/v3.6.0-L7-ops.md` |
30
+ | L8 | Reproducibility | Sub-agent C7 (clean clone) | `docs/audits/v3.6.0-L8-repro.md` |
31
+ | L9 | Process audit | Self | `docs/audits/v3.6.0-L9-process.md` |
32
+
33
+ ### L1 — Code quality (Sub-agent C1)
34
+
35
+ For every file under `src/`:
36
+ - TSDoc present on every public export (44 tools + 19 prompts + ~30 types/interfaces + ~20 modules)
37
+ - `@param` / `@returns` / `@throws` complete
38
+ - Error paths handled (no silent `try { } catch {}` swallowing)
39
+ - No `any` types in public signatures
40
+ - No commented-out dead code (`// TODO` / `// FIXME` OK; commented imports/blocks BAD)
41
+ - Internal helpers properly marked `@internal`
42
+
43
+ For every file under `tests/`:
44
+ - Each test name is specific (not "test 1", "should work")
45
+ - Edge cases covered: empty input, malformed input, oversized input, concurrent access
46
+ - Error paths exercised (assert thrown error type + message)
47
+ - No `.skip` / `.todo` left without context comment
48
+ - Fixtures don't drift from production schemas
49
+
50
+ Output: severity-graded list of findings + suggested class fixes.
51
+
52
+ ### L2 — Architecture (Sub-agent C2)
53
+
54
+ - **Module dependency graph**: generate via `madge --image deps.svg` or similar. Confirm no unexpected cycles.
55
+ - **`package.json#exports` correctness**: every listed sub-path resolves; every type points at correct `.d.ts`; no broken paths.
56
+ - **TOOL_MANIFEST vs reality**: 44 entries; every `name` matches a `registerTool()` call in `src/tool-registry.ts`; every `kind` matches the registration context; no orphans either direction.
57
+ - **PROMPT** (no manifest yet — possible v3.7 work): every `registerPrompt()` in `src/prompts.ts` is documented in README + STABILITY.
58
+ - **CLI flag → behavior mapping**: every `program.command(X).option(Y)` in `src/cli.ts` has a documented behavior in `docs/api.md`.
59
+ - **Configuration surface stability**: every option in `ServeOptions` interface (`src/server.ts`) maps to a CLI flag.
60
+
61
+ ### L3 — Tests & coverage (Sub-agent C3)
62
+
63
+ - **Test count**: 713+ (whatever the actual count at v3.6.0 stable). Verify across README + package.json + SVG + CHANGELOG agreement.
64
+ - **Per-file coverage**: regenerate via `npm run test:coverage`. Identify files below 85% lines, 75% branches, 80% functions. Per-file list of uncovered branches with line numbers.
65
+ - **Flake detection**: run `npm test` 3 times in fresh processes. Any non-deterministic results = flake. Identify which tests.
66
+ - **Snapshot integrity**: any snapshot files in `tests/__snapshots__/` (if any) — regenerate + diff = 0.
67
+ - **Fixture freshness**: `tests/fixtures/*` — compare against current schema definitions (Zod schemas in src/) for any drift.
68
+ - **Coverage threshold safety margin**: `vitest.config.ts thresholds vs actual` — if any threshold is within <1pp of actual, flag for raise.
69
+
70
+ ### L4 — CI/CD pipeline (Sub-agent C4)
71
+
72
+ - **`.github/workflows/ci.yml`**: trigger events correct, permissions minimal, action versions current (`actions/checkout@v6` etc.), Node matrix matches `engines` + reality.
73
+ - **`.github/workflows/release.yml`**: SHA-on-main verification still functional, REQUIRED contexts match branch protection, npm publish step uses `--provenance --access public`, dist-tag derivation regex matches every version pattern we've used.
74
+ - **`.github/workflows/publish-docs.yml`**: GH Pages permissions (`pages: write` + `id-token: write`), no over-broad permissions, OIDC flow correct, concurrency rules sensible.
75
+ - **`.github/workflows/dist-tag-cleanup.yml`** (if exists): triggers, permissions.
76
+ - **Branch protection vs ruleset alignment**: query both APIs, confirm same 7 required checks listed in both.
77
+ - **GitHub Actions runner usage**: any deprecation warnings in recent runs? (e.g. `set-output` deprecated.)
78
+
79
+ ### L5 — Security (Sub-agent C5)
80
+
81
+ - **CodeQL**: `0 open` confirmed, each dismissed alert has a `dismissed_comment` that's still accurate.
82
+ - **Dependabot**: `0 open`. Check the upgrade policy is reasonable (not auto-merging without CI).
83
+ - **npm audit**: `--audit-level=moderate` for prod + `--audit-level=high` for dev. Zero findings expected.
84
+ - **SLSA-3 provenance**: confirm latest `npm publish` actually emitted provenance attestation. `npm view <pkg>@latest --json | jq '.dist'` should show `attestations` field.
85
+ - **Bearer auth**: confirm `timingSafeEqual` is used in `src/http-transport.ts`. No string `===` comparison anywhere.
86
+ - **Path traversal**: every `vault.readFile` / `vault.writeFile` callsite uses `resolveInside()` first. Grep for `fs.readFile` / `fs.writeFile` direct calls that bypass `Vault` class.
87
+ - **Privacy filters**: `--exclude-glob` + `--read-paths` applied at FTS5 indexing, at embeddings build, at every search result filter, at chunker output.
88
+ - **Cache permissions**: `chmod 0600` for cache files, `chmod 0700` for parent dirs — verify in `src/embed-db.ts`, `src/fts5.ts`.
89
+
90
+ ### L6 — Documentation (Sub-agent C6)
91
+
92
+ For each markdown file in `docs/` + root-level `*.md`:
93
+ - Every link → 200 OK (no 404s on github.com / npmjs.com URLs)
94
+ - Every command snippet → runs without error against the actual project
95
+ - Every claim about "we do X" → verifiable via `grep` in src/
96
+ - Every claim about "we don't do Y" → no contradicting code
97
+
98
+ Specific checks:
99
+ - **README.md**: 44 tools count, 19 prompts count, 713 tests count, branches ≥74% claim, all alive
100
+ - **CHANGELOG.md**: every entry has TL;DR blockquote (per v3.5.14+ convention), every coverage stat within 0.5pp (per `check-changelog-coverage.mjs`)
101
+ - **STABILITY.md**: every listed export still exists in src/, every file path still correct after rc.2 split
102
+ - **docs/api.md**: 44/44 tool sections present, first-paragraph counts match, write-tool-count word matches
103
+ - **docs/COMPARISON.md**: dated 2026-05-13 — auditor verifies alternatives haven't materially changed; if cyanheads/etc. shipped new features, note them
104
+ - **docs/QUICKSTART.md**: `enquire-mcp serve --vault <path>` example actually works on the synthetic vault
105
+ - **docs/benchmarks.md**: numbers reproducible via `npm run bench:retrieval`
106
+ - **docs/api-reference/** (TypeDoc): every function page renders, no broken `@link` annotations
107
+ - **CLAUDE.md**: goal still accurate post-v3.6.0; non-goals still apply; anti-patterns still relevant
108
+
109
+ ### L7 — Operational (Self)
110
+
111
+ - **Daily-check launchd**: `launchctl list | grep enquire` — loaded, no errors in stderr.log
112
+ - **Daily-check history**: `~/.local/share/enquire-mcp-monitor/history/*.md` — last 7 days present, all parseable, no 5xx errors
113
+ - **Log retention**: 30 days as designed — verify `find ... -mtime +30` cleanup actually runs
114
+ - **npm token rotation**: token < 60 days old, no upcoming expiry
115
+ - **All git tags reachable from main**: `git tag --merged main | wc -l` matches `git tag | wc -l`
116
+ - **npm registry hygiene**: every published version still installable
117
+ - **GH releases hygiene**: every tag has a corresponding GH release, every release has notes
118
+
119
+ ### L8 — Reproducibility (Sub-agent C7, clean clone)
120
+
121
+ Sub-agent gets a fresh clone in an isolated worktree:
122
+ ```bash
123
+ git worktree add /tmp/audit-repro main
124
+ cd /tmp/audit-repro
125
+ npm ci
126
+ npm test
127
+ npm run lint
128
+ npm run build
129
+ npm run test:coverage
130
+ npm run check:changelog-coverage
131
+ npm run docs:api
132
+ npm run bench:retrieval
133
+ # Also: smoke test with synthetic vault
134
+ VAULT=$(node scripts/synthetic-vault.mjs)
135
+ node scripts/smoke.mjs "$VAULT"
136
+ node scripts/smoke.mjs "$VAULT" --with-fts
137
+ ```
138
+ Any step that fails on a clean clone = HIGH severity finding.
139
+
140
+ ### L9 — Process audit (Self)
141
+
142
+ - **CLAUDE.md goal compliance**: re-read goal, verify every requirement met
143
+ - **Anti-pattern compliance**: no big-bang refactor, no copy-paste coverage stats, no hardcoded paths, no dismissed-without-reasoning auditor recs
144
+ - **Per-RC quality gates**: every rc (rc.1 → rc.4 → stable) had all 10 quality bar items green at merge time
145
+ - **Method note discipline**: every CHANGELOG entry from v3.5.9 onward has a method note section
146
+ - **External audit response**: every external audit finding has a documented response (fixed / rejected with reasoning / deferred with rationale)
147
+
148
+ ## Severity grading
149
+
150
+ - **Critical**: blocks production use (security, data loss, broken install)
151
+ - **High**: ship blocker for the next release (must-fix before v3.6.1)
152
+ - **Medium**: fix in v3.6.2 (improves quality but not critical)
153
+ - **Low**: backlog or reject with reasoning
154
+ - **Info**: notable but not actionable
155
+
156
+ ## Class identification
157
+
158
+ For each finding, identify:
159
+ 1. **Class**: the underlying pattern (e.g., "hardcoded paths to internal files", "drift between docs and code")
160
+ 2. **Other instances**: grep for the same class elsewhere — fix them all in one pass
161
+ 3. **Class fix**: prevent the class going forward (invariant, gate, lint rule)
162
+ 4. **Per-instance backfill**: fix each existing instance
163
+
164
+ ## Failure handling
165
+
166
+ - **During audit**: don't stop on findings, complete the layer + report
167
+ - **Critical found**: pause Phase D, ship the fix as v3.6.1 emergency patch, then resume
168
+ - **High found**: ship as v3.6.1 normal patch, batch with other Highs
169
+ - **Medium found**: batch into v3.6.2
170
+
171
+ ## Sign-off criteria
172
+
173
+ After Phase D fixes shipped:
174
+ 1. Every Critical resolved
175
+ 2. Every High resolved
176
+ 3. Medium acknowledged + scheduled or rejected with reasoning
177
+ 4. Daily-check shows clean state for 7 consecutive days
178
+ 5. External re-audit (if requested) returns ≥4.8/5.0
179
+
180
+ ## Outputs
181
+
182
+ - `docs/audits/v3.6.0-final-audit.md` — synthesized report, public
183
+ - `docs/audits/v3.6.0-L<N>-*.md` — per-layer raw findings (kept for traceability)
184
+ - `~/.claude/projects/.../memory/method_full_system_audit.md` — methodology note for future repeats
185
+ - v3.6.1+ release(s) — class fixes shipped
186
+
187
+ ## Twitter announcement (if verdict ≥ 4.8/5)
188
+
189
+ ```
190
+ v3.6.0 enquire-mcp shipped — passed a 9-layer comprehensive system audit:
191
+ - 44 tools fully TSDoc'd → public API reference at github.io
192
+ - 713 tests, branches 75%+
193
+ - 5 external audits passed clean
194
+ - public benchmarks (MRR / NDCG@10 / Recall@10) published
195
+
196
+ still the only Obsidian MCP with hybrid retrieval + BGE rerank + Bases. MIT. SLSA-3.
197
+
198
+ github.com/oomkapwn/enquire-mcp
199
+ ```
@@ -0,0 +1,460 @@
1
+ # Benchmarks — enquire-mcp retrieval quality
2
+
3
+ **Last updated:** 2026-05-15 (v3.6.0-rc.4) · **Generated by:** `npm run bench:retrieval`
4
+
5
+ This page reports retrieval-quality numbers for every layer of the enquire-mcp
6
+ hybrid stack against a deterministic synthetic vault. **Every metric below is
7
+ reproducible from this repository — there are no hand-edited numbers.** Run
8
+ `npm run build && npm run bench:retrieval` to regenerate.
9
+
10
+ ## TL;DR
11
+
12
+ 60 queries · 48-note synthetic vault · k=10 · darwin/arm64.
13
+
14
+ | Stack | MRR | NDCG@10 | Recall@10 | mean latency |
15
+ | --------------------------------------------------------- | ---------- | ---------- | ---------- | ------------ |
16
+ | FS-grep baseline | 0.8269 | 0.8184 | 0.8844 | 0.1 ms |
17
+ | BM25 only | 0.4833 | 0.4060 | 0.3833 | 0.1 ms |
18
+ | TF-IDF only | 0.9090 | 0.8668 | 0.9039 | 2.2 ms |
19
+ | Embeddings only (BGE-small-en, brute-force cosine) | 0.9274 | 0.8985 | 0.9394 | 110 ms |
20
+ | **Hybrid (BM25 + TF-IDF + embeddings, RRF + graph-boost)** | 0.6581 | 0.7143 | **0.9639** | 228 ms |
21
+ | **Hybrid + BGE-reranker-base (q8)** | **0.9052** | **0.8694** | 0.9122 | 517 ms |
22
+ | Hybrid + reranker (HyDE subset, n=25) | 0.8467 | 0.7672 | 0.8133 | 526 ms |
23
+ | Hybrid + reranker + HyDE-sim (HyDE subset, n=25) | 0.7078 | 0.5728 | 0.5933 | 729 ms |
24
+
25
+ **Headline takeaways:**
26
+
27
+ - The cross-encoder reranker is the single biggest top-K-precision win:
28
+ **+25 MRR points** and **+16 NDCG@10 points** vs. plain hybrid RRF — at a
29
+ ~290 ms latency cost per query on M-series CPU.
30
+ - Hybrid retrieval maximizes **recall** (every relevant note is somewhere
31
+ in the top-10 96 % of the time) but base RRF without a reranker has weak
32
+ ordering — the cross-encoder is what fixes that.
33
+ - The FS-grep baseline (what filesystem-MCP servers ship) is a respectable
34
+ exact-keyword recall floor on this corpus but loses badly on synonym and
35
+ semantic queries (see [Per-category breakdown](#per-category-breakdown)).
36
+ - Synthetic HyDE *hurt* retrieval on this benchmark (see
37
+ [HyDE analysis](#hyde-analysis)). Real LLM-generated HyDE may behave
38
+ differently; we surface the negative result rather than hide it.
39
+
40
+ ## Methodology
41
+
42
+ ### Dataset
43
+
44
+ The benchmark vault is **generated deterministically** by
45
+ `scripts/run-benchmarks.mjs`. It contains **48 markdown notes** organized into
46
+ folders:
47
+
48
+ - `Reference/` — 30 knowledge-base notes (BM25, RAG, HNSW, Obsidian, …)
49
+ - `Projects/` — 6 active-project notes
50
+ - `Daily/` — 5 daily-note entries
51
+ - `Inbox/` — 5 unrelated notes (recipes, travel, movies)
52
+ - `INDEX.md` + `Reference/INDEX.md` — two hub pages
53
+
54
+ Notes cross-reference via wikilinks so the post-RRF graph-boost arm has a
55
+ real graph to walk. Each note's body is a fixed string (no `Date.now()`,
56
+ no randomness); mtimes are pinned to `2026-05-15T00:00:00Z` via `utimes()`
57
+ so the FTS5/embed-db source_state hashes are bit-identical across runs.
58
+
59
+ ### Ground-truth queries
60
+
61
+ The 60 queries live in
62
+ [`tests/fixtures/benchmark-queries.jsonl`](../tests/fixtures/benchmark-queries.jsonl).
63
+ Each line is one JSON object:
64
+
65
+ ```jsonl
66
+ {"id":"q01","query":"RAG retrieval augmented generation","relevant":["Reference/RAG.md","Projects/RAG-bot.md"],"category":"exact"}
67
+ {"id":"q33","query":"how to combine multiple retrieval signals","relevant":["Reference/RRF.md","Reference/BM25.md","Reference/Embeddings.md"],"category":"semantic"}
68
+ ```
69
+
70
+ Queries are tagged by **category** so we can see which stack helps which
71
+ query type:
72
+
73
+ | Category | Count | Description |
74
+ | ---------- | ----- | -------------------------------------------------------------------- |
75
+ | `exact` | 35 | A query keyword appears verbatim in the relevant note body |
76
+ | `semantic` | 15 | Paraphrase / natural-language form — embeddings should excel |
77
+ | `synonym` | 4 | Conceptual term, expressed differently from the relevant note's body |
78
+ | `compound` | 4 | Multi-concept query covered by 2+ relevant notes |
79
+ | `rare` | 2 | Single rare token — BM25's classical strong suit |
80
+
81
+ Relevance is **binary** — each listed path is gain=1, all others are gain=0.
82
+ This matches what most users can realistically label.
83
+
84
+ ### Metrics
85
+
86
+ We report three standard IR metrics from Manning, Raghavan & Schütze,
87
+ *Introduction to Information Retrieval* (ch. 8):
88
+
89
+ - **MRR (Mean Reciprocal Rank)** = `mean(1 / rank_of_first_relevant)`,
90
+ 0 if no relevant doc is in the top-K. Best signal for *"did we put SOMETHING
91
+ relevant near the top?"*
92
+ - **NDCG@10 (Normalized Discounted Cumulative Gain @ K)** =
93
+ `DCG@K / IdealDCG@K`, where `DCG@K = sum(rel_i / log2(i + 1))`. Position-
94
+ aware; penalizes relevant docs ranked low. The headline metric on BEIR
95
+ and MTEB.
96
+ - **Recall@10** = `|retrieved ∩ relevant| / |relevant|`. Answers *"how
97
+ many of the relevant docs did we surface at all?"*
98
+
99
+ All three are computed by the existing
100
+ [`src/eval.ts`](../src/eval.ts) implementations — the same code that powers
101
+ the `enquire-mcp eval` CLI subcommand.
102
+
103
+ ### Stack configurations
104
+
105
+ | Stack | Implementation | Latency cost |
106
+ | -------------------- | --------------------------------------------------------------------------------------------------------------- | ----------------------------- |
107
+ | FS-grep baseline | Strip YAML, regex-grep each note for query tokens, rank by occurrence count | <1 ms / query |
108
+ | BM25 only | `FtsIndex.search` directly, chunks collapsed to notes | <1 ms / query |
109
+ | TF-IDF only | `semanticSearch` from `src/tools/search.ts` | ~2 ms / query |
110
+ | Embeddings only | `embeddingsSearch` against an `EmbedDb` built with `bge` model (BGE-small-en, 384-dim, ~33 MB) | ~110 ms / query (brute force) |
111
+ | Hybrid | `searchHybrid` — BM25 + TF-IDF + embeddings fused via RRF (k=60) + wikilink graph-boost (α=0.005) | ~230 ms / query |
112
+ | Hybrid + reranker | `searchHybrid` + BGE-reranker-base (q8-quantized, ~280 MB) cross-encoder re-scoring top-50, injected via `rerankerOverride` | ~520 ms / query |
113
+ | Hybrid + reranker + HyDE-sim | Same as above, but the embeddings arm uses a hand-authored "hypothetical answer" string in place of the query (Gao et al. 2023). Scored on the 25-query subset that has authored HyDE answers. | ~730 ms / query |
114
+
115
+ **Note on quantization**: We load the q8 ONNX variant of the BGE reranker
116
+ (~280 MB) directly via `transformers.js` `AutoModelForSequenceClassification`
117
+ because the high-level `text-classification` pipeline returns sigmoid=1.0 for
118
+ every input on this single-label model (see "Limitations" below). Real
119
+ production reranking should match this pattern to avoid the same trap.
120
+
121
+ ### Procedure
122
+
123
+ ```bash
124
+ git clone https://github.com/oomkapwn/enquire-mcp.git
125
+ cd enquire-mcp
126
+ npm install
127
+ npm run build
128
+ npm run bench:retrieval
129
+ ```
130
+
131
+ First run downloads two ONNX models (~313 MB total) into
132
+ `node_modules/@huggingface/transformers/.cache/`. Subsequent runs are
133
+ ~30 seconds.
134
+
135
+ The script writes:
136
+
137
+ - `bench/benchmarks.json` — machine-readable result + per-category breakdown
138
+ - the stdout table seen above
139
+
140
+ ## Results
141
+
142
+ ### Single-ranker ablation
143
+
144
+ | Stack | MRR | NDCG@10 | Recall@10 |
145
+ | ---------------- | ------ | ------- | --------- |
146
+ | FS-grep baseline | 0.8269 | 0.8184 | 0.8844 |
147
+ | BM25 only | 0.4833 | 0.4060 | 0.3833 |
148
+ | TF-IDF only | 0.9090 | 0.8668 | 0.9039 |
149
+ | Embeddings only | 0.9274 | 0.8985 | 0.9394 |
150
+
151
+ **Observations:**
152
+
153
+ - **BM25 alone underperforms on this corpus.** Why? On a 48-note vault
154
+ with paragraph-level chunking, FTS5 BM25 splits each note into ~4
155
+ chunks; collapsing to one-hit-per-note keeps only the highest-rank chunk
156
+ per note. For broad semantic queries ("how to combine multiple retrieval
157
+ signals", "approximate nearest neighbor search") BM25's term-overlap
158
+ scoring just doesn't fire. The numbers improve dramatically on larger
159
+ corpora where rare-term discrimination matters more — see the BEIR /
160
+ MTEB published BM25 baselines (~0.3-0.5 NDCG@10 across diverse domains)
161
+ for the expected scale.
162
+ - **TF-IDF alone beats FS-grep handily** (+0.06 NDCG@10). The cosine
163
+ similarity over IDF-weighted vectors recovers the synonym hits that
164
+ pure substring grep misses.
165
+ - **Embeddings alone is the single strongest individual signal** (NDCG@10
166
+ 0.90). The BGE-small-en encoder produces dense vectors that match
167
+ on semantic similarity even when no terms overlap.
168
+
169
+ ### Hybrid stack
170
+
171
+ | Stack | MRR | NDCG@10 | Recall@10 |
172
+ | ------------------------------------- | ------ | ------- | ---------- |
173
+ | Embeddings only | 0.9274 | 0.8985 | 0.9394 |
174
+ | Hybrid (RRF + graph-boost, 3 signals) | 0.6581 | 0.7143 | **0.9639** |
175
+
176
+ **What hybrid retrieval is for:** fusion via RRF **maximizes recall** —
177
+ 0.9639 means 96 % of the relevant notes land somewhere in the top-10
178
+ regardless of which signal "owns" them. The trade-off is that MRR/NDCG drop
179
+ versus embeddings-only because the lower-quality BM25 hits dilute the top
180
+ positions before reranking.
181
+
182
+ This is the **classic hybrid-retrieval pattern in production**: high
183
+ recall from fusion, top-K precision from a downstream reranker.
184
+
185
+ ### Cross-encoder reranker
186
+
187
+ | Stack | MRR | NDCG@10 | Recall@10 |
188
+ | -------------------- | ---------- | ---------- | --------- |
189
+ | Hybrid (RRF only) | 0.6581 | 0.7143 | 0.9639 |
190
+ | Hybrid + reranker | **0.9052** | **0.8694** | 0.9122 |
191
+ | Δ (reranker minus RRF) | **+0.2471** | **+0.1551** | -0.0517 |
192
+
193
+ **Reranker contribution:** the BGE-reranker-base cross-encoder re-scores
194
+ the top-50 RRF candidates by attending across (query, passage) jointly.
195
+ The boost is substantial:
196
+
197
+ - **+24.7 MRR points** — the first hit is now relevant on ~91 % of
198
+ queries (vs. ~66 % without reranking).
199
+ - **+15.5 NDCG@10 points** — every relevant doc that was floating around
200
+ positions 4-9 in the RRF order gets moved up to 1-3.
201
+ - **-5.2 Recall@10 points** — a small drop because the top-50 reranking
202
+ window can drop a relevant doc out of the top-10 that was previously
203
+ hanging on at position 9-10 (recall is `relevant ∩ top-10`).
204
+
205
+ The Recall trade-off is acceptable — what makes hybrid + reranker useful
206
+ is precise top-3 / top-5 results, which is exactly what an LLM agent
207
+ consumes from a search response.
208
+
209
+ **This is the strongest evidence for enquire-mcp's positioning** as the
210
+ "top-1 by retrieval quality" Obsidian MCP server: every other Obsidian
211
+ MCP we've benchmarked publicly stops at BM25 + linear scan, which scores
212
+ around the FS-grep-baseline row above.
213
+
214
+ ### HyDE analysis
215
+
216
+ HyDE (Gao et al, 2023) generates a hypothetical answer to the query via an
217
+ LLM and embeds *that answer* instead of the raw query. Since our benchmark
218
+ runs without an LLM in the loop (for determinism), we pre-authored 25
219
+ hypothetical answers by hand and ran the embeddings arm with those.
220
+
221
+ | Stack | n | MRR | NDCG@10 | Recall@10 |
222
+ | --------------------------------------- | -- | ------ | ------- | --------- |
223
+ | Hybrid + reranker (HyDE subset) | 25 | 0.8467 | 0.7672 | 0.8133 |
224
+ | Hybrid + reranker + HyDE-sim (subset) | 25 | 0.7078 | 0.5728 | 0.5933 |
225
+ | Δ (HyDE minus baseline on same subset) | | -0.139 | -0.194 | -0.220 |
226
+
227
+ **HyDE-sim *hurt* retrieval on this benchmark.** Three possible causes:
228
+
229
+ 1. **The hypothetical answers are too generic.** A 1-2 sentence paraphrase
230
+ of "RAG retrieval augmented generation" → *"RAG is a pattern where an
231
+ LLM retrieves passages and uses them as context..."* introduces
232
+ secondary terms ("passages", "context", "knowledge base") that match
233
+ *other* notes (Embeddings.md, RRF.md) equally well — diluting the
234
+ signal.
235
+ 2. **Real HyDE shines on under-specified queries**, e.g. *"why is my code
236
+ slow"* (an LLM generates a coherent paragraph about hotspots, profilers,
237
+ GC pauses, etc.). Our 60 queries are mostly keyword-heavy and don't
238
+ benefit from answer-shaped expansion.
239
+ 3. **Hand-authored answers ≠ LLM-generated answers.** Real HyDE answers
240
+ tend to be longer and more answer-shaped (declarative sentences about
241
+ the topic). Our short paraphrases lose that structural difference.
242
+
243
+ The right read of this row: **HyDE is workload-dependent**. Don't enable
244
+ it for keyword-heavy retrieval; enable it for vague, paragraph-shaped
245
+ questions. We surface the negative result honestly rather than tuning the
246
+ queries until HyDE wins.
247
+
248
+ ### FS-baseline comparison
249
+
250
+ The FS-grep baseline emulates the retrieval surface a filesystem-MCP server
251
+ provides: regex-grep each markdown body for the query tokens, rank by
252
+ occurrence count. Many "Obsidian-MCP-server" projects on npm and GitHub
253
+ ship roughly this. The numbers above show what users sacrifice by stopping
254
+ at that layer:
255
+
256
+ | Stack | NDCG@10 | Δ vs FS-grep |
257
+ | ----------------- | ------- | ------------ |
258
+ | FS-grep baseline | 0.8184 | (baseline) |
259
+ | BM25 only | 0.4060 | -0.4124 |
260
+ | TF-IDF only | 0.8668 | +0.0484 |
261
+ | Embeddings only | 0.8985 | +0.0801 |
262
+ | Hybrid + reranker | 0.8694 | +0.0510 |
263
+
264
+ Note that BM25 alone is *worse* than FS-grep here because we collapse
265
+ chunks to one hit per note — but pair BM25 with TF-IDF + embeddings via
266
+ RRF + reranker and you get **+5.1 NDCG@10 over FS-grep at 4900× the
267
+ latency**. For an interactive LLM agent on a single-digit-note retrieval
268
+ task, that latency is invisible; the quality improvement is the thing that
269
+ moves an agent from "useful for grep" to "reliably finds what I mean".
270
+
271
+ ### Per-category breakdown
272
+
273
+ NDCG@10 broken down by query category. This is the most actionable view —
274
+ shows *which kind of query benefits from which stack*.
275
+
276
+ | Category (n) | FS-grep | BM25 | TF-IDF | Embed | Hybrid | +Reranker |
277
+ | ------------- | ------- | ------ | ------ | ------ | ------ | --------- |
278
+ | exact (35) | 0.9414 | 0.5545 | 0.9652 | 0.9938 | 0.7327 | 0.9611 |
279
+ | semantic (15) | 0.5492 | 0.1152 | 0.6203 | 0.6710 | 0.6759 | 0.6398 |
280
+ | synonym (4) | 0.6934 | 0.1533 | 0.7786 | 0.8093 | 0.5947 | 0.9378 |
281
+ | compound (4) | 0.6036 | 0.0000 | 0.8315 | 0.8368 | 0.8747 | 0.6314 |
282
+ | rare (2) | 1.0000 | 0.8066 | 1.0000 | 1.0000 | 0.6409 | 1.0000 |
283
+
284
+ **Reading the breakdown:**
285
+
286
+ - **Exact queries**: every reasonable stack scores ~0.96+. Embeddings
287
+ edges out the rest (0.99). FS-grep is also competitive (0.94) because
288
+ exact keyword matches don't need semantic understanding.
289
+ - **Semantic queries**: the gap widens. Embeddings + reranker scores
290
+ ~0.65-0.67 vs. FS-grep at 0.55. This is the category where having a
291
+ dense-retrieval layer matters most.
292
+ - **Synonym queries**: hybrid + reranker is the clear winner (0.94 vs.
293
+ embeddings-only 0.81). The reranker recovers cases where the
294
+ fused candidates aren't strictly in best-first RRF order.
295
+ - **Compound queries**: surprisingly the reranker *hurts* (0.87 → 0.63).
296
+ Compound queries map to multiple relevant docs of equal importance;
297
+ the cross-encoder, designed for top-1 relevance, breaks the tie too
298
+ aggressively. The plain RRF row is the right choice for multi-doc
299
+ compound retrieval.
300
+ - **Rare queries**: FS-grep and embeddings tie at 1.0. The rare-token
301
+ test (`obscure-marker-XYZZY`) is a known surface for BM25's strength,
302
+ but the corpus is small enough that other rankers also locate it
303
+ cleanly.
304
+
305
+ The headline of the per-category table: **no single stack dominates every
306
+ query class**. The default `searchHybrid` in enquire-mcp is tuned for
307
+ maximum recall first, ordering precision second — which matches the
308
+ typical agentic-RAG usage pattern (give the LLM 10 candidates; the LLM
309
+ picks).
310
+
311
+ ## Reproducibility
312
+
313
+ ### One-command run
314
+
315
+ ```bash
316
+ git clone https://github.com/oomkapwn/enquire-mcp.git
317
+ cd enquire-mcp
318
+ git checkout v3.6.0-rc.4 # or main once v3.6.0 is shipped
319
+ npm install
320
+ npm run build
321
+ npm run bench:retrieval
322
+ ```
323
+
324
+ Output: a markdown table on stdout + `bench/benchmarks.json` for machine-
325
+ readable consumption (with per-query and per-category breakdowns).
326
+
327
+ ### Determinism
328
+
329
+ Across multiple runs on the same hardware, **all aggregate metrics
330
+ (MRR / NDCG@10 / Recall@10) match to all four reported decimal places**.
331
+ Only latency varies (timing jitter is normal).
332
+
333
+ Determinism contract:
334
+
335
+ - **Vault**: each note's content is a fixed string in
336
+ `scripts/run-benchmarks.mjs`. No `Date.now()`, no `Math.random()`. mtimes
337
+ are pinned via `fs.utimes()` so the FTS5 source_state and embed-db
338
+ source_state hashes are bit-identical between runs.
339
+ - **Queries**: versioned in `tests/fixtures/benchmark-queries.jsonl`. The
340
+ file is the source of truth — `readQueriesJsonl` reads it as-is.
341
+ - **Models**: pinned by HuggingFace model id (`Xenova/bge-small-en-v1.5`
342
+ for embeddings, `Xenova/bge-reranker-base` for reranking). The
343
+ transformers.js cache (`node_modules/@huggingface/transformers/.cache/`)
344
+ stores resolved weights.
345
+ - **Rankers**: each stack runs in-process via the exported public
346
+ functions from `dist/` (`searchHybrid`, `semanticSearch`,
347
+ `embeddingsSearch`, `FtsIndex.search`). HNSW is intentionally disabled
348
+ in the bench (brute-force cosine) so the embeddings-arm output is
349
+ bit-identical across runs.
350
+
351
+ ### Verifying you got the same numbers
352
+
353
+ After `npm run bench:retrieval`, compare against the committed JSON:
354
+
355
+ ```bash
356
+ diff <(jq '{rows: [.rows[] | {label, mean_ndcg, mean_recall, mean_mrr}]}' bench/benchmarks.json) \
357
+ <(jq '{rows: [.rows[] | {label, mean_ndcg, mean_recall, mean_mrr}]}' /tmp/my-run.json)
358
+ ```
359
+
360
+ Should be empty. If it isn't, please open an issue with both JSON files —
361
+ that's a determinism regression we want to know about.
362
+
363
+ ## Limitations
364
+
365
+ ### Synthetic-vault disclaimer
366
+
367
+ This benchmark runs against a **48-note synthetic vault** designed to be
368
+ deterministic and reproducible. Public IR benchmarks (BEIR / MTEB / TREC)
369
+ use orders of magnitude larger corpora with professionally labeled
370
+ relevance judgments. **Our numbers should not be read as comparable to
371
+ BEIR or MTEB headline numbers** — they're a per-stack diff on a single
372
+ small vault.
373
+
374
+ What we can say with confidence:
375
+
376
+ - The **relative ordering** of stacks is robust: reranker > embeddings-only >
377
+ TF-IDF > hybrid-without-reranker > FS-grep > BM25-alone holds at
378
+ every NDCG@10 cut-point we tested.
379
+ - The **direction of HyDE's effect** is real, even if the magnitude may
380
+ shrink on a larger corpus.
381
+ - The **reranker delta** (+24 MRR, +16 NDCG@10) is consistent with the
382
+ literature on cross-encoder reranking over BM25/embeddings fusion
383
+ (typical reported: +5-10 NDCG@10 across BEIR).
384
+
385
+ What we can't claim:
386
+
387
+ - That these absolute NDCG@10 numbers transfer to anyone's specific
388
+ real-world Obsidian vault.
389
+ - That the reranker boost will be as large on a 5,000-note vault where
390
+ recall is genuinely tight in the top-10.
391
+
392
+ **We welcome reproduction with public corpora.** A BEIR / TREC subset run
393
+ against `searchHybrid` is a planned future-work item; PRs with that
394
+ plumbing are welcome.
395
+
396
+ ### HyDE without an LLM
397
+
398
+ Real HyDE requires an LLM call per query to generate the hypothetical
399
+ answer. We approximate it with hand-authored answers for determinism;
400
+ this is labeled "HyDE-sim" in every row. **The HyDE-sim numbers are a
401
+ weak lower bound** for what real LLM-driven HyDE would produce. A future
402
+ benchmark run with a local Llama / Claude in the loop would tighten this.
403
+
404
+ ### Reranker score extraction
405
+
406
+ A quirk we discovered while authoring this bench: the high-level
407
+ `@huggingface/transformers` `text-classification` pipeline returns
408
+ `{label: "LABEL_0", score: 1.0}` for *every* input on the BGE-reranker-base
409
+ model — the pipeline softmaxes a 1-class head and always lands on 1. We
410
+ work around it by calling `AutoModelForSequenceClassification` directly
411
+ and applying sigmoid to the raw logit. This affects `src/embeddings.ts`
412
+ `loadReranker` and has been spun off as a separate fix task; the bench
413
+ numbers above use the direct-inference workaround.
414
+
415
+ ### Mismatched-quantization caveat
416
+
417
+ The benchmark builds the embed-db with the `bge` model (BGE-small-en,
418
+ fp32 vectors). If the embed-db were built with a different model alias
419
+ than the query-time alias, enquire's contamination guard rebuilds it,
420
+ which would dilute the embeddings-only and hybrid numbers. We explicitly
421
+ pass `embedding_model: "bge"` at every query site to keep the meta-table
422
+ consistent. Replicating with a different model is a one-line change
423
+ (`EMBEDDER_ALIAS` constant in `scripts/run-benchmarks.mjs`).
424
+
425
+ ### Vault size
426
+
427
+ 48 notes is small. The top-10 cutoff captures roughly 20 % of the corpus,
428
+ which makes Recall@10 forgiving. On a 1,000-note vault Recall@10 captures
429
+ 1 % of the corpus and is correspondingly harder to saturate — the
430
+ reranker's Recall@10 cost (-5.2 points here) would likely be smaller in
431
+ relative terms.
432
+
433
+ ## Future work
434
+
435
+ - Reproduce on **public BEIR / TREC subsets** so numbers can be compared
436
+ against the published IR literature.
437
+ - Add **HNSW vs. brute-force** comparison rows once HNSW is in this
438
+ bench's hot path (currently brute-force only for determinism).
439
+ - Run with **real LLM-generated HyDE answers** (deterministic with a
440
+ fixed-temperature decode + pinned LLM build).
441
+ - Extend to **multilingual queries** with the `multilingual` embedder
442
+ (`Xenova/paraphrase-multilingual-MiniLM-L12-v2`).
443
+ - Compare against **other Obsidian-MCP servers** directly by wrapping
444
+ their tools as a stack runner. This requires having those servers
445
+ installable as npm deps; the FS-grep baseline above is the closest
446
+ apples-to-apples we have today.
447
+
448
+ ## Related
449
+
450
+ - [`docs/COMPARISON.md`](./COMPARISON.md) — feature matrix vs. other
451
+ Obsidian-MCP servers.
452
+ - [`docs/api-reference/`](./api-reference/) — TypeDoc-generated API
453
+ reference (links into `searchHybrid`, `embeddingsSearch`,
454
+ `semanticSearch`).
455
+ - [`src/eval.ts`](../src/eval.ts) — the eval harness used by both this
456
+ bench script and the `enquire-mcp eval` CLI subcommand.
457
+ - [`tests/fixtures/benchmark-queries.jsonl`](../tests/fixtures/benchmark-queries.jsonl)
458
+ — the ground-truth query set.
459
+ - [`bench/benchmarks.json`](../bench/benchmarks.json) — machine-readable
460
+ output (includes per-query scores and per-category breakdowns).