npm - @oomkapwn/enquire-mcp - Versions diffs - 3.6.0-rc.3 → 3.6.0 - Mend

@oomkapwn/enquire-mcp 3.6.0-rc.3 → 3.6.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (15)

package/CHANGELOG.md +174 -0
package/README.md +10 -4
package/assets/social-preview.png +0 -0
package/dist/embeddings.d.ts +10 -0
package/dist/embeddings.d.ts.map +1 -1
package/dist/embeddings.js +83 -24
package/dist/embeddings.js.map +1 -1
package/dist/index.d.ts +1 -1
package/dist/index.d.ts.map +1 -1
package/dist/index.js +1 -1
package/dist/index.js.map +1 -1
package/docs/audits/v3.6.0-rc.4-rootcause.md +134 -0
package/docs/audits/v3.6.0-system-audit-plan.md +199 -0
package/docs/benchmarks.md +460 -0
package/package.json +5 -2

package/docs/audits/v3.6.0-system-audit-plan.md ADDED

@@ -0,0 +1,199 @@
+# v3.6.0 — Full-System Audit Plan
+**Status**: scheduled for execution **after v3.6.0 stable is shipped** (`npm view @oomkapwn/enquire-mcp dist-tags` shows `latest = 3.6.0` and the GH release "v3.6.0" is marked Latest).
+**Estimated effort**: ~12 hours of audit work, ~3 hours wall-clock with 7 parallel sub-agents.
+**Trigger condition**:
+```bash
+[ "$(npm view @oomkapwn/enquire-mcp version)" = "3.6.0" ] && \
+[ "$(gh release view --repo oomkapwn/enquire-mcp --json isLatest --jq '.isLatest')" = "true" ]
+```
+## Why this audit
+By the time we ship v3.6.0 stable, the project has been through 5 audits (3 external: Mavis ×2 and MiniMax; plus 2 internal self-audits) and 15+ patch releases. Each audit has been **per-RC**: it caught drift in the surfaces it touched, but didn't sweep the whole system.
+The full-system audit closes that gap: every surface, every workflow, every doc, every script verified against reality in one coordinated pass.
+## Scope — 9 layers
+| # | Layer | Owner | Output |
+|---|---|---|---|
+| L1 | Code quality | Sub-agent C1 | `docs/audits/v3.6.0-L1-code.md` |
+| L2 | Architecture | Sub-agent C2 | `docs/audits/v3.6.0-L2-arch.md` |
+| L3 | Tests & coverage | Sub-agent C3 | `docs/audits/v3.6.0-L3-tests.md` |
+| L4 | CI/CD pipeline | Sub-agent C4 | `docs/audits/v3.6.0-L4-cicd.md` |
+| L5 | Security | Sub-agent C5 | `docs/audits/v3.6.0-L5-security.md` |
+| L6 | Documentation | Sub-agent C6 | `docs/audits/v3.6.0-L6-docs.md` |
+| L7 | Operational | Self | `docs/audits/v3.6.0-L7-ops.md` |
+| L8 | Reproducibility | Sub-agent C7 (clean clone) | `docs/audits/v3.6.0-L8-repro.md` |
+| L9 | Process audit | Self | `docs/audits/v3.6.0-L9-process.md` |
+### L1 — Code quality (Sub-agent C1)
+For every file under `src/`:
+- TSDoc present on every public export (44 tools + 19 prompts + ~30 types/interfaces + ~20 modules)
+- `@param` / `@returns` / `@throws` complete
+- Error paths handled (no silent `try { } catch {}` swallowing)
+- No `any` types in public signatures
+- No commented-out dead code (`// TODO` / `// FIXME` OK; commented imports/blocks BAD)
+- Internal helpers properly marked `@internal`
+For every file under `tests/`:
+- Each test name is specific (not "test 1", "should work")
+- Edge cases covered: empty input, malformed input, oversized input, concurrent access
+- Error paths exercised (assert thrown error type + message)
+- No `.skip` / `.todo` left without context comment
+- Fixtures don't drift from production schemas
+Output: severity-graded list of findings + suggested class fixes.
+### L2 — Architecture (Sub-agent C2)
+- **Module dependency graph**: generate via `madge --image deps.svg` or similar. Confirm no unexpected cycles.
+- **`package.json#exports` correctness**: every listed sub-path resolves; every type points at correct `.d.ts`; no broken paths.
+- **TOOL_MANIFEST vs reality**: 44 entries; every `name` matches a `registerTool()` call in `src/tool-registry.ts`; every `kind` matches the registration context; no orphans either direction.
+- **PROMPT** (no manifest yet — possible v3.7 work): every `registerPrompt()` in `src/prompts.ts` is documented in README + STABILITY.
+- **CLI flag → behavior mapping**: every `program.command(X).option(Y)` in `src/cli.ts` has a documented behavior in `docs/api.md`.
+- **Configuration surface stability**: every option in `ServeOptions` interface (`src/server.ts`) maps to a CLI flag.
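The manifest-versus-registration check in this layer amounts to a set difference in both directions. A minimal sketch, assuming the manifest names and the registered names have already been extracted into plain string arrays; the tool names used here are made up for illustration and are not taken from the package:

```typescript
// Hypothetical helper: the real TOOL_MANIFEST and registerTool() calls live in
// the project source; this only shows the "no orphans either direction" check.
interface OrphanReport {
  missingRegistration: string[]; // in the manifest, never registered
  missingManifestEntry: string[]; // registered, absent from the manifest
}

function findOrphans(manifestNames: string[], registeredNames: string[]): OrphanReport {
  const manifest = new Set(manifestNames);
  const registered = new Set(registeredNames);
  return {
    missingRegistration: manifestNames.filter((name) => !registered.has(name)),
    missingManifestEntry: registeredNames.filter((name) => !manifest.has(name)),
  };
}

// Illustrative names only.
const orphanReport = findOrphans(
  ["search", "read_note", "list_tags"],
  ["search", "read_note", "write_note"],
);
console.log(orphanReport);
```

An empty report in both fields is the passing state; either non-empty list would be a finding for this layer.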
+### L3 — Tests & coverage (Sub-agent C3)
+- **Test count**: 713+ (whatever the actual count is at v3.6.0 stable). Verify that README, package.json, SVG, and CHANGELOG all agree.
+- **Per-file coverage**: regenerate via `npm run test:coverage`. Identify files below 85% lines, 75% branches, 80% functions. Per-file list of uncovered branches with line numbers.
+- **Flake detection**: run `npm test` 3 times in fresh processes. Any non-deterministic results = flake. Identify which tests.
+- **Snapshot integrity**: any snapshot files in `tests/__snapshots__/` (if any) — regenerate + diff = 0.
+- **Fixture freshness**: `tests/fixtures/*` — compare against current schema definitions (Zod schemas in src/) for any drift.
+- **Coverage threshold safety margin**: compare the `vitest.config.ts` thresholds against actual coverage; if any threshold is within 1pp of the actual value, flag for raise.
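The safety-margin check can be sketched as a comparison of two small records. This assumes the thresholds and the measured percentages have been read into plain objects; the shapes and the measured numbers below are illustrative, not the project's real coverage output:

```typescript
// Hypothetical shapes: the real vitest config and coverage summary may differ.
type Metric = "lines" | "branches" | "functions";
type Percentages = Record<Metric, number>;

// Return the metrics whose actual coverage is at or above the threshold
// but by less than `marginPp` percentage points.
function thinMargins(thresholds: Percentages, actual: Percentages, marginPp = 1): Metric[] {
  const metrics: Metric[] = ["lines", "branches", "functions"];
  return metrics.filter((metric) => {
    const headroom = actual[metric] - thresholds[metric];
    return headroom >= 0 && headroom < marginPp;
  });
}

// Thresholds mirror the 85 / 75 / 80 figures named in this layer;
// the "actual" numbers are invented for the example.
const flagged = thinMargins(
  { lines: 85, branches: 75, functions: 80 },
  { lines: 91.2, branches: 75.4, functions: 88.0 },
);
console.log(flagged);
```

Metrics that fall below their threshold are deliberately not reported here, since the coverage run itself would already fail on those.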
+### L4 — CI/CD pipeline (Sub-agent C4)
+- **`.github/workflows/ci.yml`**: trigger events correct, permissions minimal, action versions current (`actions/checkout@v6` etc.), Node matrix matches `engines` + reality.
+- **`.github/workflows/release.yml`**: SHA-on-main verification still functional, REQUIRED contexts match branch protection, npm publish step uses `--provenance --access public`, dist-tag derivation regex matches every version pattern we've used.
+- **`.github/workflows/publish-docs.yml`**: GH Pages permissions (`pages: write` + `id-token: write`), no over-broad permissions, OIDC flow correct, concurrency rules sensible.
+- **`.github/workflows/dist-tag-cleanup.yml`** (if exists): triggers, permissions.
+- **Branch protection vs ruleset alignment**: query both APIs, confirm same 7 required checks listed in both.
+- **GitHub Actions runner usage**: any deprecation warnings in recent runs? (e.g. `set-output` deprecated.)
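For the dist-tag derivation check, a test table over every version string ever published is usually enough. The sketch below is not the project's regex: it assumes the tag is simply the first prerelease identifier (for example `rc`) and `latest` for a stable version, and it ignores semver build metadata.

```typescript
// Hypothetical derivation rule, for illustration only:
//   "3.6.0"      -> "latest"
//   "3.6.0-rc.3" -> "rc"
function deriveDistTag(version: string): string {
  const match = /^\d+\.\d+\.\d+(?:-([0-9A-Za-z-]+)(?:\.[0-9A-Za-z-]+)*)?$/.exec(version);
  if (!match) {
    throw new Error(`not a recognised version: ${version}`);
  }
  // match[1] is the first prerelease identifier, or undefined for stable.
  return match[1] ?? "latest";
}

for (const version of ["3.6.0", "3.6.0-rc.3", "3.5.14"]) {
  console.log(version, "->", deriveDistTag(version));
}
```

Throwing on an unrecognised pattern (rather than silently falling back to `latest`) is the conservative choice for a publish step.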
+### L5 — Security (Sub-agent C5)
+- **CodeQL**: `0 open` confirmed, each dismissed alert has a `dismissed_comment` that's still accurate.
+- **Dependabot**: `0 open`. Check the upgrade policy is reasonable (not auto-merging without CI).
+- **npm audit**: `--audit-level=moderate` for prod + `--audit-level=high` for dev. Zero findings expected.
+- **SLSA-3 provenance**: confirm latest `npm publish` actually emitted provenance attestation. `npm view <pkg>@latest --json | jq '.dist'` should show `attestations` field.
+- **Bearer auth**: confirm `timingSafeEqual` is used in `src/http-transport.ts`. No string `===` comparison anywhere.
+- **Path traversal**: every `vault.readFile` / `vault.writeFile` callsite uses `resolveInside()` first. Grep for `fs.readFile` / `fs.writeFile` direct calls that bypass `Vault` class.
+- **Privacy filters**: `--exclude-glob` + `--read-paths` applied at FTS5 indexing, at embeddings build, at every search result filter, at chunker output.
+- **Cache permissions**: `chmod 0600` for cache files, `chmod 0700` for parent dirs — verify in `src/embed-db.ts`, `src/fts5.ts`.
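As a reference point for the bearer-auth item, a constant-time token check with Node's `crypto.timingSafeEqual` typically looks like the sketch below. The function name and header handling are assumptions for illustration; the actual code in `src/http-transport.ts` may be structured differently.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Hash both sides first so the compared buffers always have equal length;
// timingSafeEqual throws when the lengths differ.
function sha256(value: string) {
  return createHash("sha256").update(value, "utf8").digest();
}

// Hypothetical helper: returns true only for "Bearer <expectedToken>".
function isAuthorized(authorizationHeader: string | undefined, expectedToken: string): boolean {
  const prefix = "Bearer ";
  if (!authorizationHeader || !authorizationHeader.startsWith(prefix)) {
    return false;
  }
  const presented = authorizationHeader.slice(prefix.length);
  return timingSafeEqual(sha256(presented), sha256(expectedToken));
}

console.log(isAuthorized("Bearer s3cret", "s3cret"));
console.log(isAuthorized("Bearer wrong", "s3cret"));
```

The audit's grep for string `===` comparisons is looking for places where a check like this was bypassed.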
+### L6 — Documentation (Sub-agent C6)
+For each markdown file in `docs/` + root-level `*.md`:
+- Every link → 200 OK (no 404s on github.com / npmjs.com URLs)
+- Every command snippet → runs without error against the actual project
+- Every claim about "we do X" → verifiable via `grep` in src/
+- Every claim about "we don't do Y" → no contradicting code
+Specific checks:
+- **README.md**: 44 tools count, 19 prompts count, 713 tests count, branches ≥74% claim, all alive
+- **CHANGELOG.md**: every entry has TL;DR blockquote (per v3.5.14+ convention), every coverage stat within 0.5pp (per `check-changelog-coverage.mjs`)
+- **STABILITY.md**: every listed export still exists in src/, every file path still correct after rc.2 split
+- **docs/api.md**: 44/44 tool sections present, first-paragraph counts match, write-tool-count word matches
+- **docs/COMPARISON.md**: dated 2026-05-13 — auditor verifies alternatives haven't materially changed; if cyanheads/etc. shipped new features, note them
+- **docs/QUICKSTART.md**: `enquire-mcp serve --vault <path>` example actually works on the synthetic vault
+- **docs/benchmarks.md**: numbers reproducible via `npm run bench:retrieval`
+- **docs/api-reference/** (TypeDoc): every function page renders, no broken `@link` annotations
+- **CLAUDE.md**: goal still accurate post-v3.6.0; non-goals still apply; anti-patterns still relevant
+### L7 — Operational (Self)
+- **Daily-check launchd**: `launchctl list | grep enquire` — loaded, no errors in stderr.log
+- **Daily-check history**: `~/.local/share/enquire-mcp-monitor/history/*.md` — last 7 days present, all parseable, no 5xx errors
+- **Log retention**: 30 days as designed — verify `find ... -mtime +30` cleanup actually runs
+- **npm token rotation**: token < 60 days old, no upcoming expiry
+- **All git tags reachable from main**: `git tag --merged main | wc -l` matches `git tag | wc -l`
+- **npm registry hygiene**: every published version still installable
+- **GH releases hygiene**: every tag has a corresponding GH release, every release has notes
+### L8 — Reproducibility (Sub-agent C7, clean clone)
+Sub-agent gets a fresh clone in an isolated worktree:
+```bash
+git worktree add /tmp/audit-repro main
+cd /tmp/audit-repro
+npm ci
+npm test
+npm run lint
+npm run build
+npm run test:coverage
+npm run check:changelog-coverage
+npm run docs:api
+npm run bench:retrieval
+# Also: smoke test with synthetic vault
+VAULT=$(node scripts/synthetic-vault.mjs)
+node scripts/smoke.mjs "$VAULT"
+node scripts/smoke.mjs "$VAULT" --with-fts
+```
+Any step that fails on a clean clone = HIGH severity finding.
+### L9 — Process audit (Self)
+- **CLAUDE.md goal compliance**: re-read goal, verify every requirement met
+- **Anti-pattern compliance**: no big-bang refactor, no copy-paste coverage stats, no hardcoded paths, no dismissed-without-reasoning auditor recs
+- **Per-RC quality gates**: every rc (rc.1 → rc.4 → stable) had all 10 quality bar items green at merge time
+- **Method note discipline**: every CHANGELOG entry from v3.5.9 onward has a method note section
+- **External audit response**: every external audit finding has a documented response (fixed / rejected with reasoning / deferred with rationale)
+## Severity grading
+- **Critical**: blocks production use (security, data loss, broken install)
+- **High**: ship blocker for the next release (must-fix before v3.6.1)
+- **Medium**: fix in v3.6.2 (improves quality but not critical)
+- **Low**: backlog or reject with reasoning
+- **Info**: notable but not actionable
+## Class identification
+For each finding, identify:
+1. **Class**: the underlying pattern (e.g., "hardcoded paths to internal files", "drift between docs and code")
+2. **Other instances**: grep for the same class elsewhere — fix them all in one pass
+3. **Class fix**: prevent the class going forward (invariant, gate, lint rule)
+4. **Per-instance backfill**: fix each existing instance
+## Failure handling
+- **During audit**: don't stop on findings, complete the layer + report
+- **Critical found**: pause Phase D, ship the fix as v3.6.1 emergency patch, then resume
+- **High found**: ship as v3.6.1 normal patch, batch with other Highs
+- **Medium found**: batch into v3.6.2
+## Sign-off criteria
+After Phase D fixes shipped:
+1. Every Critical resolved
+2. Every High resolved
+3. Medium acknowledged + scheduled or rejected with reasoning
+4. Daily-check shows clean state for 7 consecutive days
+5. External re-audit (if requested) returns ≥4.8/5.0
+## Outputs
+- `docs/audits/v3.6.0-final-audit.md` — synthesized report, public
+- `docs/audits/v3.6.0-L<N>-*.md` — per-layer raw findings (kept for traceability)
+- `~/.claude/projects/.../memory/method_full_system_audit.md` — methodology note for future repeats
+- v3.6.1+ release(s) — class fixes shipped
+## Twitter announcement (if verdict ≥ 4.8/5)
+```
+v3.6.0 enquire-mcp shipped — passed a 9-layer comprehensive system audit:
+- 44 tools fully TSDoc'd → public API reference at github.io
+- 713 tests, branches 75%+
+- 5 external audits passed clean
+- public benchmarks (MRR / NDCG@10 / Recall@10) published
+still the only Obsidian MCP with hybrid retrieval + BGE rerank + Bases. MIT. SLSA-3.
+github.com/oomkapwn/enquire-mcp
+```

package/docs/benchmarks.md ADDED

@@ -0,0 +1,460 @@
+# Benchmarks — enquire-mcp retrieval quality
+**Last updated:** 2026-05-15 (v3.6.0-rc.4) · **Generated by:** `npm run bench:retrieval`
+This page reports retrieval-quality numbers for every layer of the enquire-mcp
+hybrid stack against a deterministic synthetic vault. **Every metric below is
+reproducible from this repository — there are no hand-edited numbers.** Run
+`npm run build && npm run bench:retrieval` to regenerate.
+## TL;DR
+60 queries · 48-note synthetic vault · k=10 · darwin/arm64.
+| Stack                                                     | MRR        | NDCG@10    | Recall@10  | mean latency |
+| --------------------------------------------------------- | ---------- | ---------- | ---------- | ------------ |
+| FS-grep baseline                                          | 0.8269     | 0.8184     | 0.8844     | 0.1 ms       |
+| BM25 only                                                 | 0.4833     | 0.4060     | 0.3833     | 0.1 ms       |
+| TF-IDF only                                               | 0.9090     | 0.8668     | 0.9039     | 2.2 ms       |
+| Embeddings only (BGE-small-en, brute-force cosine)        | 0.9274     | 0.8985     | 0.9394     | 110 ms       |
+| **Hybrid (BM25 + TF-IDF + embeddings, RRF + graph-boost)** | 0.6581     | 0.7143     | **0.9639** | 228 ms       |
+| **Hybrid + BGE-reranker-base (q8)**                       | **0.9052** | **0.8694** | 0.9122     | 517 ms       |
+| Hybrid + reranker (HyDE subset, n=25)                     | 0.8467     | 0.7672     | 0.8133     | 526 ms       |
+| Hybrid + reranker + HyDE-sim (HyDE subset, n=25)          | 0.7078     | 0.5728     | 0.5933     | 729 ms       |
+**Headline takeaways:**
+- The cross-encoder reranker is the single biggest top-K-precision win:
+  **+25 MRR points** and **+16 NDCG@10 points** vs. plain hybrid RRF — at a
+  ~290 ms latency cost per query on M-series CPU.
+- Hybrid retrieval maximizes **recall** (every relevant note is somewhere
+  in the top-10 96 % of the time) but base RRF without a reranker has weak
+  ordering — the cross-encoder is what fixes that.
+- The FS-grep baseline (what filesystem-MCP servers ship) is a respectable
+  exact-keyword recall floor on this corpus but loses badly on synonym and
+  semantic queries (see [Per-category breakdown](#per-category-breakdown)).
+- Synthetic HyDE *hurt* retrieval on this benchmark (see
+  [HyDE analysis](#hyde-analysis)). Real LLM-generated HyDE may behave
+  differently; we surface the negative result rather than hide it.
+## Methodology
+### Dataset
+The benchmark vault is **generated deterministically** by
+`scripts/run-benchmarks.mjs`. It contains **48 markdown notes** organized into
+folders:
+- `Reference/` — 30 knowledge-base notes (BM25, RAG, HNSW, Obsidian, …)
+- `Projects/` — 6 active-project notes
+- `Daily/` — 5 daily-note entries
+- `Inbox/` — 5 unrelated notes (recipes, travel, movies)
+- `INDEX.md` + `Reference/INDEX.md` — two hub pages
+Notes cross-reference via wikilinks so the post-RRF graph-boost arm has a
+real graph to walk. Each note's body is a fixed string (no `Date.now()`,
+no randomness); mtimes are pinned to `2026-05-15T00:00:00Z` via `utimes()`
+so the FTS5/embed-db source_state hashes are bit-identical across runs.
+### Ground-truth queries
+The 60 queries live in
+[`tests/fixtures/benchmark-queries.jsonl`](../tests/fixtures/benchmark-queries.jsonl).
+Each line is one JSON object:
+```jsonl
+{"id":"q01","query":"RAG retrieval augmented generation","relevant":["Reference/RAG.md","Projects/RAG-bot.md"],"category":"exact"}
+{"id":"q33","query":"how to combine multiple retrieval signals","relevant":["Reference/RRF.md","Reference/BM25.md","Reference/Embeddings.md"],"category":"semantic"}
+```
+Queries are tagged by **category** so we can see which stack helps which
+query type:
+| Category   | Count | Description                                                          |
+| ---------- | ----- | -------------------------------------------------------------------- |
+| `exact`    | 35    | A query keyword appears verbatim in the relevant note body           |
+| `semantic` | 15    | Paraphrase / natural-language form — embeddings should excel         |
+| `synonym`  | 4     | Conceptual term, expressed differently from the relevant note's body |
+| `compound` | 4     | Multi-concept query covered by 2+ relevant notes                     |
+| `rare`     | 2     | Single rare token — BM25's classical strong suit                     |
+Relevance is **binary** — each listed path is gain=1, all others are gain=0.
+This matches what most users can realistically label.
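To make the query format concrete, here is a minimal sketch of parsing the JSONL lines and counting queries per category. `parseQueries` and `countByCategory` are illustrative names, not the project's `readQueriesJsonl`; the field names are taken from the two sample lines above.

```typescript
type Category = "exact" | "semantic" | "synonym" | "compound" | "rare";

interface BenchmarkQuery {
  id: string;
  query: string;
  relevant: string[]; // paths with binary gain = 1
  category: Category;
}

// One JSON object per non-empty line.
function parseQueries(jsonl: string): BenchmarkQuery[] {
  return jsonl
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as BenchmarkQuery);
}

function countByCategory(queries: BenchmarkQuery[]): Map<Category, number> {
  const counts = new Map<Category, number>();
  for (const q of queries) {
    counts.set(q.category, (counts.get(q.category) ?? 0) + 1);
  }
  return counts;
}

const sample = [
  '{"id":"q01","query":"RAG retrieval augmented generation","relevant":["Reference/RAG.md","Projects/RAG-bot.md"],"category":"exact"}',
  '{"id":"q33","query":"how to combine multiple retrieval signals","relevant":["Reference/RRF.md","Reference/BM25.md","Reference/Embeddings.md"],"category":"semantic"}',
].join("\n");

const queries = parseQueries(sample);
console.log(countByCategory(queries));
```

Run against the full fixture, the counts should match the category table above (35 / 15 / 4 / 4 / 2).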
+### Metrics
+We report three standard IR metrics from Manning, Raghavan & Schütze,
+*Introduction to Information Retrieval* (ch. 8):
+- **MRR (Mean Reciprocal Rank)** = `mean(1 / rank_of_first_relevant)`,
+  0 if no relevant doc is in the top-K. Best signal for *"did we put SOMETHING
+  relevant near the top?"*
+- **NDCG@10 (Normalized Discounted Cumulative Gain @ K)** =
+  `DCG@K / IdealDCG@K`, where `DCG@K = sum(rel_i / log2(i + 1))`. Position-
+  aware; penalizes relevant docs ranked low. The headline metric on BEIR
+  and MTEB.
+- **Recall@10** = `|retrieved ∩ relevant| / |relevant|`. Answers *"how
+  many of the relevant docs did we surface at all?"*
+All three are computed by the existing
+[`src/eval.ts`](../src/eval.ts) implementations — the same code that powers
+the `enquire-mcp eval` CLI subcommand.
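For readers who want to check a single query by hand, the three formulas above can be sketched as follows. This is an independent illustration with binary gain and unique result paths assumed; it is not the code in `src/eval.ts`, and the function names are made up.

```typescript
// Reciprocal rank of the first relevant hit within the top k (0 if none).
function reciprocalRank(ranked: string[], relevant: Set<string>, k = 10): number {
  const index = ranked.slice(0, k).findIndex((path) => relevant.has(path));
  return index === -1 ? 0 : 1 / (index + 1);
}

// Fraction of the relevant set that appears in the top k.
function recallAtK(ranked: string[], relevant: Set<string>, k = 10): number {
  if (relevant.size === 0) return 0;
  const hits = ranked.slice(0, k).filter((path) => relevant.has(path)).length;
  return hits / relevant.size;
}

// NDCG with binary gain. Positions are 1-based in the formula, so the
// 0-based index i contributes 1 / log2(i + 2).
function ndcgAtK(ranked: string[], relevant: Set<string>, k = 10): number {
  let dcg = 0;
  ranked.slice(0, k).forEach((path, i) => {
    if (relevant.has(path)) dcg += 1 / Math.log2(i + 2);
  });
  let ideal = 0;
  for (let i = 0; i < Math.min(relevant.size, k); i++) {
    ideal += 1 / Math.log2(i + 2);
  }
  return ideal === 0 ? 0 : dcg / ideal;
}

// Toy query: two relevant notes, found at ranks 2 and 3.
const ranked = ["A.md", "B.md", "C.md"];
const relevant = new Set(["B.md", "C.md"]);
console.log(reciprocalRank(ranked, relevant)); // 0.5
console.log(recallAtK(ranked, relevant)); // 1
console.log(ndcgAtK(ranked, relevant).toFixed(4)); // 0.6934
```

The reported MRR, NDCG@10 and Recall@10 are then the means of these per-query values over the query set.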
+### Stack configurations
+| Stack                | Implementation                                                                                                  | Latency cost                  |
+| -------------------- | --------------------------------------------------------------------------------------------------------------- | ----------------------------- |
+| FS-grep baseline     | Strip YAML, regex-grep each note for query tokens, rank by occurrence count                                     | <1 ms / query                 |
+| BM25 only            | `FtsIndex.search` directly, chunks collapsed to notes                                                           | <1 ms / query                 |
+| TF-IDF only          | `semanticSearch` from `src/tools/search.ts`                                                                     | ~2 ms / query                 |
+| Embeddings only      | `embeddingsSearch` against an `EmbedDb` built with `bge` model (BGE-small-en, 384-dim, ~33 MB)                  | ~110 ms / query (brute force) |
+| Hybrid               | `searchHybrid` — BM25 + TF-IDF + embeddings fused via RRF (k=60) + wikilink graph-boost (α=0.005)               | ~230 ms / query               |
+| Hybrid + reranker    | `searchHybrid` + BGE-reranker-base (q8-quantized, ~280 MB) cross-encoder re-scoring top-50, injected via `rerankerOverride` | ~520 ms / query               |
+| Hybrid + reranker + HyDE-sim | Same as above, but the embeddings arm uses a hand-authored "hypothetical answer" string in place of the query (Gao et al. 2023). Scored on the 25-query subset that has authored HyDE answers. | ~730 ms / query               |
+**Note on quantization**: We load the q8 ONNX variant of the BGE reranker
+(~280 MB) directly via `transformers.js` `AutoModelForSequenceClassification`
+because the high-level `text-classification` pipeline returns a score of 1.0 for
+every input on this single-label model (see "Limitations" below). Real
+production reranking should match this pattern to avoid the same trap.
+### Procedure
+```bash
+git clone https://github.com/oomkapwn/enquire-mcp.git
+cd enquire-mcp
+npm install
+npm run build
+npm run bench:retrieval
+```
+First run downloads two ONNX models (~313 MB total) into
+`node_modules/@huggingface/transformers/.cache/`. Subsequent runs are
+~30 seconds.
+The script writes:
+- `bench/benchmarks.json` — machine-readable result + per-category breakdown
+- the stdout table seen above
+## Results
+### Single-ranker ablation
+| Stack            | MRR    | NDCG@10 | Recall@10 |
+| ---------------- | ------ | ------- | --------- |
+| FS-grep baseline | 0.8269 | 0.8184  | 0.8844    |
+| BM25 only        | 0.4833 | 0.4060  | 0.3833    |
+| TF-IDF only      | 0.9090 | 0.8668  | 0.9039    |
+| Embeddings only  | 0.9274 | 0.8985  | 0.9394    |
+**Observations:**
+- **BM25 alone underperforms on this corpus.** Why? On a 48-note vault
+  with paragraph-level chunking, FTS5 BM25 splits each note into ~4
+  chunks; collapsing to one-hit-per-note keeps only the highest-rank chunk
+  per note. For broad semantic queries ("how to combine multiple retrieval
+  signals", "approximate nearest neighbor search") BM25's term-overlap
+  scoring just doesn't fire. BM25 is usually more competitive on larger
+  corpora where rare-term discrimination matters more; the published BEIR /
+  MTEB BM25 baselines (~0.3-0.5 NDCG@10 across diverse domains) indicate
+  the expected scale.
+- **TF-IDF alone beats FS-grep** (+0.05 NDCG@10). The cosine
+  similarity over IDF-weighted vectors recovers the synonym hits that
+  pure substring grep misses.
+- **Embeddings alone is the single strongest individual signal** (NDCG@10
+  0.90). The BGE-small-en encoder produces dense vectors that match
+  on semantic similarity even when no terms overlap.
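The chunk-to-note collapse mentioned in the BM25 observation can be pictured as a "best chunk per note" reduction. A minimal sketch, assuming each chunk hit carries a note path and a score where higher is better; the interface is illustrative and may not match the shape returned by `FtsIndex.search`:

```typescript
// Hypothetical hit shape; higher score = better match (an assumption).
interface ChunkHit {
  notePath: string;
  chunkIndex: number;
  score: number;
}

// Keep only the best-scoring chunk for each note, then sort notes by that score.
function collapseToNotes(hits: ChunkHit[]): ChunkHit[] {
  const best = new Map<string, ChunkHit>();
  for (const hit of hits) {
    const current = best.get(hit.notePath);
    if (!current || hit.score > current.score) {
      best.set(hit.notePath, hit);
    }
  }
  return Array.from(best.values()).sort((a, b) => b.score - a.score);
}

const collapsed = collapseToNotes([
  { notePath: "Reference/BM25.md", chunkIndex: 0, score: 1.2 },
  { notePath: "Reference/RRF.md", chunkIndex: 2, score: 2.5 },
  { notePath: "Reference/BM25.md", chunkIndex: 3, score: 3.1 },
]);
console.log(collapsed.map((hit) => hit.notePath));
```

Under this reduction a note's rank depends on one chunk only, so evidence spread thinly across several chunks of the same note does not accumulate.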
+### Hybrid stack
+| Stack                                 | MRR    | NDCG@10 | Recall@10  |
+| ------------------------------------- | ------ | ------- | ---------- |
+| Embeddings only                       | 0.9274 | 0.8985  | 0.9394     |
+| Hybrid (RRF + graph-boost, 3 signals) | 0.6581 | 0.7143  | **0.9639** |
+**What hybrid retrieval is for:** fusion via RRF **maximizes recall** —
+0.9639 means 96 % of the relevant notes land somewhere in the top-10
+regardless of which signal "owns" them. The trade-off is that MRR/NDCG drop
+versus embeddings-only because the lower-quality BM25 hits dilute the top
+positions before reranking.
+This is the **classic hybrid-retrieval pattern in production**: high
+recall from fusion, top-K precision from a downstream reranker.
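Reciprocal Rank Fusion itself is small enough to sketch. The version below uses the k=60 constant from the stack table and omits the wikilink graph-boost step; it is an illustration of the general technique, not the body of `searchHybrid`.

```typescript
// Each input ranking is a best-first list of note ids from one signal.
// A note's fused score is the sum over rankings of 1 / (k + rank), rank 1-based.
function rrfFuse(rankings: string[][], k = 60): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score,
  );
}

// Toy rankings standing in for the BM25, TF-IDF and embeddings arms.
const fused = rrfFuse([
  ["A", "B"],
  ["B", "C"],
  ["B", "A"],
]);
console.log(fused.map((entry) => entry.id)); // B first: it appears in all three lists
```

Because every list contributes to the candidate pool, a note only needs to rank well in one signal to appear in the fused output, which is where the recall gain comes from; the same property lets weak hits from one arm push down strong hits from another.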
+### Cross-encoder reranker
+| Stack                | MRR        | NDCG@10    | Recall@10 |
+| -------------------- | ---------- | ---------- | --------- |
+| Hybrid (RRF only)    | 0.6581     | 0.7143     | 0.9639    |
+| Hybrid + reranker    | **0.9052** | **0.8694** | 0.9122    |
+| Δ (reranker minus RRF) | **+0.2471** | **+0.1551** | -0.0517 |
+**Reranker contribution:** the BGE-reranker-base cross-encoder re-scores
+the top-50 RRF candidates by attending across (query, passage) jointly.
+The boost is substantial:
+- **+24.7 MRR points**: MRR rises from 0.66 to 0.91, so the first relevant
+  hit now sits at or very near rank 1 for most queries.
+- **+15.5 NDCG@10 points**: relevant docs that were sitting lower in the
+  RRF order are moved toward the top positions.
+- **-5.2 Recall@10 points** — a small drop because the top-50 reranking
+  window can drop a relevant doc out of the top-10 that was previously
+  hanging on at position 9-10 (recall is `relevant ∩ top-10`).
+The Recall trade-off is acceptable — what makes hybrid + reranker useful
+is precise top-3 / top-5 results, which is exactly what an LLM agent
+consumes from a search response.
+**This is the strongest evidence for enquire-mcp's positioning** as the
+"top-1 by retrieval quality" Obsidian MCP server: the other Obsidian MCP
+servers we've looked at stop at BM25 or a linear scan, which we approximate
+with the FS-grep-baseline row above (no direct head-to-head run yet).
+### HyDE analysis
+HyDE (Gao et al., 2023) generates a hypothetical answer to the query via an
+LLM and embeds *that answer* instead of the raw query. Since our benchmark
+runs without an LLM in the loop (for determinism), we pre-authored 25
+hypothetical answers by hand and ran the embeddings arm with those.
+| Stack                                   | n  | MRR    | NDCG@10 | Recall@10 |
+| --------------------------------------- | -- | ------ | ------- | --------- |
+| Hybrid + reranker (HyDE subset)         | 25 | 0.8467 | 0.7672  | 0.8133    |
+| Hybrid + reranker + HyDE-sim (subset)   | 25 | 0.7078 | 0.5728  | 0.5933    |
+| Δ (HyDE minus baseline on same subset)  |    | -0.139 | -0.194  | -0.220    |
+**HyDE-sim *hurt* retrieval on this benchmark.** Three possible causes:
+1. **The hypothetical answers are too generic.** A 1-2 sentence paraphrase
+   of "RAG retrieval augmented generation" → *"RAG is a pattern where an
+   LLM retrieves passages and uses them as context..."* introduces
+   secondary terms ("passages", "context", "knowledge base") that match
+   *other* notes (Embeddings.md, RRF.md) equally well — diluting the
+   signal.
+2. **Real HyDE shines on under-specified queries**, e.g. *"why is my code
+   slow"* (an LLM generates a coherent paragraph about hotspots, profilers,
+   GC pauses, etc.). Our 60 queries are mostly keyword-heavy and don't
+   benefit from answer-shaped expansion.
+3. **Hand-authored answers ≠ LLM-generated answers.** Real HyDE answers
+   tend to be longer and more answer-shaped (declarative sentences about
+   the topic). Our short paraphrases lose that structural difference.
+The right read of this row: **HyDE is workload-dependent**. Don't enable
+it for keyword-heavy retrieval; enable it for vague, paragraph-shaped
+questions. We surface the negative result honestly rather than tuning the
+queries until HyDE wins.
+### FS-baseline comparison
+The FS-grep baseline emulates the retrieval surface a filesystem-MCP server
+provides: regex-grep each markdown body for the query tokens, rank by
+occurrence count. Many "Obsidian-MCP-server" projects on npm and GitHub
+ship roughly this. The numbers above show what users sacrifice by stopping
+at that layer:
+| Stack             | NDCG@10 | Δ vs FS-grep |
+| ----------------- | ------- | ------------ |
+| FS-grep baseline  | 0.8184  | (baseline)   |
+| BM25 only         | 0.4060  | -0.4124      |
+| TF-IDF only       | 0.8668  | +0.0484      |
+| Embeddings only   | 0.8985  | +0.0801      |
+| Hybrid + reranker | 0.8694  | +0.0510      |
+Note that BM25 alone is *worse* than FS-grep here because we collapse
+chunks to one hit per note — but pair BM25 with TF-IDF + embeddings via
+RRF + reranker and you get **+5.1 NDCG@10 over FS-grep at roughly 5,000× the
+latency** (517 ms vs. ~0.1 ms mean). For an interactive LLM agent on a single-digit-note retrieval
+task, that latency is invisible; the quality improvement is the thing that
+moves an agent from "useful for grep" to "reliably finds what I mean".
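The FS-grep baseline described in this section (strip YAML, grep for query tokens, rank by occurrence count) can be sketched in a few lines. The tokenisation and case handling here are assumptions; the benchmark script may differ in those details.

```typescript
interface Note {
  path: string;
  body: string;
}

// Drop a leading YAML frontmatter block delimited by "---" lines.
function stripYaml(body: string): string {
  return body.replace(/^---\r?\n[\s\S]*?\r?\n---\r?\n?/, "");
}

function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Count case-insensitive substring occurrences of each query token,
// then rank notes by the total count (notes with zero hits are dropped).
function grepRank(notes: Note[], query: string): Array<{ path: string; count: number }> {
  const tokens = query.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
  return notes
    .map((note) => {
      const text = stripYaml(note.body).toLowerCase();
      let count = 0;
      for (const token of tokens) {
        const matches = text.match(new RegExp(escapeRegExp(token), "g"));
        count += matches ? matches.length : 0;
      }
      return { path: note.path, count };
    })
    .filter((result) => result.count > 0)
    .sort((a, b) => b.count - a.count);
}

const grepResults = grepRank(
  [
    { path: "Reference/RAG.md", body: "---\ntitle: rag\n---\nRAG uses retrieval. Retrieval matters." },
    { path: "Inbox/Recipes.md", body: "Cooking notes." },
  ],
  "rag retrieval",
);
console.log(grepResults);
```

A ranker like this has no notion of synonyms or paraphrase, which is consistent with its weaker showing on the semantic and synonym categories.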
+### Per-category breakdown
+NDCG@10 broken down by query category. This is the most actionable view —
+shows *which kind of query benefits from which stack*.
+| Category (n)  | FS-grep | BM25   | TF-IDF | Embed  | Hybrid | +Reranker |
+| ------------- | ------- | ------ | ------ | ------ | ------ | --------- |
+| exact (35)    | 0.9414  | 0.5545 | 0.9652 | 0.9938 | 0.7327 | 0.9611    |
+| semantic (15) | 0.5492  | 0.1152 | 0.6203 | 0.6710 | 0.6759 | 0.6398    |
+| synonym (4)   | 0.6934  | 0.1533 | 0.7786 | 0.8093 | 0.5947 | 0.9378    |
+| compound (4)  | 0.6036  | 0.0000 | 0.8315 | 0.8368 | 0.8747 | 0.6314    |
+| rare (2)      | 1.0000  | 0.8066 | 1.0000 | 1.0000 | 0.6409 | 1.0000    |
+**Reading the breakdown:**
+- **Exact queries**: TF-IDF, embeddings, and hybrid + reranker all score
+  ~0.96+, with embeddings on top (0.99). FS-grep is also competitive (0.94)
+  because exact keyword matches don't need semantic understanding; plain
+  hybrid (0.73) and BM25 (0.55) trail.
+- **Semantic queries**: the gap widens. Embeddings, plain hybrid, and
+  hybrid + reranker score ~0.64-0.68 vs. FS-grep at 0.55. This is the
+  category where having a dense-retrieval layer matters most.
+- **Synonym queries**: hybrid + reranker is the clear winner (0.94 vs.
+  embeddings-only 0.81). The reranker recovers cases where the
+  fused candidates aren't strictly in best-first RRF order.
+- **Compound queries**: surprisingly the reranker *hurts* (0.87 → 0.63).
+  Compound queries map to multiple relevant docs of equal importance;
+  the cross-encoder, designed for top-1 relevance, breaks the tie too
+  aggressively. The plain RRF row is the right choice for multi-doc
+  compound retrieval.
+- **Rare queries**: FS-grep, TF-IDF, embeddings, and hybrid + reranker all
+  score 1.0. The rare-token test (`obscure-marker-XYZZY`) is a known surface
+  for BM25's strength, but the corpus is small enough that other rankers
+  also locate it cleanly.
+The headline of the per-category table: **no single stack dominates every
+query class**. The default `searchHybrid` in enquire-mcp is tuned for
+maximum recall first, ordering precision second — which matches the
+typical agentic-RAG usage pattern (give the LLM 10 candidates; the LLM
+picks).
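A per-category table like the one above is a grouped mean over per-query scores. A minimal sketch, assuming per-query NDCG values tagged with their category are available; the record shape is illustrative and may not match the layout of `bench/benchmarks.json`, and the scores are invented.

```typescript
interface QueryScore {
  category: string;
  ndcg: number;
}

// Group per-query scores by category and report count and mean.
function meanByCategory(scores: QueryScore[]): Map<string, { n: number; mean: number }> {
  const sums = new Map<string, { n: number; total: number }>();
  for (const score of scores) {
    const entry = sums.get(score.category) ?? { n: 0, total: 0 };
    entry.n += 1;
    entry.total += score.ndcg;
    sums.set(score.category, entry);
  }
  const result = new Map<string, { n: number; mean: number }>();
  sums.forEach((value, category) => {
    result.set(category, { n: value.n, mean: value.total / value.n });
  });
  return result;
}

// Invented scores, for illustration only.
const breakdown = meanByCategory([
  { category: "rare", ndcg: 1.0 },
  { category: "rare", ndcg: 0.5 },
  { category: "exact", ndcg: 1.0 },
]);
console.log(breakdown);
```

With only 2 to 4 queries in the `rare`, `synonym` and `compound` categories, a single query moves the category mean by 25 to 50 points, so those rows are best read as indicative.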
+## Reproducibility
+### One-command run
+```bash
+git clone https://github.com/oomkapwn/enquire-mcp.git
+cd enquire-mcp
+git checkout v3.6.0-rc.4   # or main once v3.6.0 is shipped
+npm install
+npm run build
+npm run bench:retrieval
+```
+Output: a markdown table on stdout + `bench/benchmarks.json` for machine-
+readable consumption (with per-query and per-category breakdowns).
+### Determinism
+Across multiple runs on the same hardware, **all aggregate metrics
+(MRR / NDCG@10 / Recall@10) match to all four reported decimal places**.
+Only latency varies (timing jitter is normal).
+Determinism contract:
+- **Vault**: each note's content is a fixed string in
+  `scripts/run-benchmarks.mjs`. No `Date.now()`, no `Math.random()`. mtimes
+  are pinned via `fs.utimes()` so the FTS5 source_state and embed-db
+  source_state hashes are bit-identical between runs.
+- **Queries**: versioned in `tests/fixtures/benchmark-queries.jsonl`. The
+  file is the source of truth — `readQueriesJsonl` reads it as-is.
+- **Models**: pinned by HuggingFace model id (`Xenova/bge-small-en-v1.5`
+  for embeddings, `Xenova/bge-reranker-base` for reranking). The
+  transformers.js cache (`node_modules/@huggingface/transformers/.cache/`)
+  stores resolved weights.
+- **Rankers**: each stack runs in-process via the exported public
+  functions from `dist/` (`searchHybrid`, `semanticSearch`,
+  `embeddingsSearch`, `FtsIndex.search`). HNSW is intentionally disabled
+  in the bench (brute-force cosine) so the embeddings-arm output is
+  bit-identical across runs.
+### Verifying you got the same numbers
+After `npm run bench:retrieval`, compare against the committed JSON:
+```bash
+diff <(git show HEAD:bench/benchmarks.json | jq '{rows: [.rows[] | {label, mean_ndcg, mean_recall, mean_mrr}]}') \
+     <(jq '{rows: [.rows[] | {label, mean_ndcg, mean_recall, mean_mrr}]}' bench/benchmarks.json)
+```
+Should be empty. If it isn't, please open an issue with both JSON files —
+that's a determinism regression we want to know about.
+## Limitations
+### Synthetic-vault disclaimer
+This benchmark runs against a **48-note synthetic vault** designed to be
+deterministic and reproducible. Public IR benchmarks (BEIR / MTEB / TREC)
+use orders of magnitude larger corpora with professionally labeled
+relevance judgments. **Our numbers should not be read as comparable to
+BEIR or MTEB headline numbers** — they're a per-stack diff on a single
+small vault.
+What we can say with confidence:
+- The **relative ordering** of stacks on aggregate NDCG@10 is
+  embeddings-only > hybrid + reranker > TF-IDF > FS-grep >
+  hybrid-without-reranker > BM25-alone, per the tables above.
+- The **direction of HyDE's effect** is real, even if the magnitude may
+  shrink on a larger corpus.
+- The **reranker delta** (+25 MRR, +16 NDCG@10) agrees in direction with the
+  literature on cross-encoder reranking over BM25/embeddings fusion,
+  though it is larger than the typically reported +5-10 NDCG@10 across BEIR.
+What we can't claim:
+- That these absolute NDCG@10 numbers transfer to anyone's specific
+  real-world Obsidian vault.
+- That the reranker boost will be as large on a 5,000-note vault where
+  recall is genuinely tight in the top-10.
+**We welcome reproduction with public corpora.** A BEIR / TREC subset run
+against `searchHybrid` is a planned future-work item; PRs with that
+plumbing are welcome.
+### HyDE without an LLM
+Real HyDE requires an LLM call per query to generate the hypothetical
+answer. We approximate it with hand-authored answers for determinism;
+this is labeled "HyDE-sim" in every row. **The HyDE-sim numbers are a rough,
+probably pessimistic proxy** for what real LLM-driven HyDE would produce. A future
+benchmark run with a local Llama / Claude in the loop would tighten this.
+### Reranker score extraction
+A quirk we discovered while authoring this bench: the high-level
+`@huggingface/transformers` `text-classification` pipeline returns
+`{label: "LABEL_0", score: 1.0}` for *every* input on the BGE-reranker-base
+model — the pipeline softmaxes a 1-class head and always lands on 1. We
+work around it by calling `AutoModelForSequenceClassification` directly
+and applying sigmoid to the raw logit. This affects `src/embeddings.ts`
+`loadReranker` and has been spun off as a separate fix task; the bench
+numbers above use the direct-inference workaround.
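The arithmetic behind this quirk is easy to reproduce without any model: softmax over a single logit is always exactly 1, whereas sigmoid maps the same logit to a graded score. The logits below are made-up values, not real reranker output.

```typescript
// Numerically stable softmax over a vector of logits.
function softmax(logits: number[]): number[] {
  const max = Math.max(...logits);
  const exps = logits.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

function sigmoid(logit: number): number {
  return 1 / (1 + Math.exp(-logit));
}

// With a 1-class head there is only one logit per (query, passage) pair.
for (const logit of [-4.2, 0, 3.7]) {
  const viaSoftmax = softmax([logit])[0]; // always 1, whatever the logit
  const viaSigmoid = sigmoid(logit); // varies with the logit
  console.log(logit, viaSoftmax, viaSigmoid.toFixed(4));
}
```

Since sigmoid is monotonic, ranking passages by the raw logit gives the same order; the sigmoid only matters when a bounded score is needed downstream.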
+### Mismatched-model caveat
+The benchmark builds the embed-db with the `bge` model (BGE-small-en,
+fp32 vectors). If the embed-db were built with a different model alias
+than the query-time alias, enquire's contamination guard rebuilds it,
+which would dilute the embeddings-only and hybrid numbers. We explicitly
+pass `embedding_model: "bge"` at every query site to keep the meta-table
+consistent. Replicating with a different model is a one-line change
+(`EMBEDDER_ALIAS` constant in `scripts/run-benchmarks.mjs`).
+### Vault size
+48 notes is small. The top-10 cutoff captures roughly 20 % of the corpus,
+which makes Recall@10 forgiving. On a 1,000-note vault Recall@10 captures
+1 % of the corpus and is correspondingly harder to saturate — the
+reranker's Recall@10 cost (-5.2 points here) would likely be smaller in
+relative terms.
+## Future work
+- Reproduce on **public BEIR / TREC subsets** so numbers can be compared
+  against the published IR literature.
+- Add **HNSW vs. brute-force** comparison rows once HNSW is in this
+  bench's hot path (currently brute-force only for determinism).
+- Run with **real LLM-generated HyDE answers** (deterministic with a
+  fixed-temperature decode + pinned LLM build).
+- Extend to **multilingual queries** with the `multilingual` embedder
+  (`Xenova/paraphrase-multilingual-MiniLM-L12-v2`).
+- Compare against **other Obsidian-MCP servers** directly by wrapping
+  their tools as a stack runner. This requires having those servers
+  installable as npm deps; the FS-grep baseline above is the closest
+  apples-to-apples we have today.
+## Related
+- [`docs/COMPARISON.md`](./COMPARISON.md) — feature matrix vs. other
+  Obsidian-MCP servers.
+- [`docs/api-reference/`](./api-reference/) — TypeDoc-generated API
+  reference (links into `searchHybrid`, `embeddingsSearch`,
+  `semanticSearch`).
+- [`src/eval.ts`](../src/eval.ts) — the eval harness used by both this
+  bench script and the `enquire-mcp eval` CLI subcommand.
+- [`tests/fixtures/benchmark-queries.jsonl`](../tests/fixtures/benchmark-queries.jsonl)
+  — the ground-truth query set.
+- [`bench/benchmarks.json`](../bench/benchmarks.json) — machine-readable
+  output (includes per-query scores and per-category breakdowns).