mdcontext 0.0.1 → 0.2.0

Files changed (337)
  1. package/.changeset/README.md +28 -0
  2. package/.changeset/config.json +11 -0
  3. package/.claude/settings.local.json +25 -0
  4. package/.github/workflows/ci.yml +83 -0
  5. package/.github/workflows/claude-code-review.yml +44 -0
  6. package/.github/workflows/claude.yml +85 -0
  7. package/.github/workflows/release.yml +113 -0
  8. package/.tldrignore +112 -0
  9. package/BACKLOG.md +338 -0
  10. package/CONTRIBUTING.md +186 -0
  11. package/NOTES/NOTES +44 -0
  12. package/README.md +434 -11
  13. package/biome.json +36 -0
  14. package/cspell.config.yaml +14 -0
  15. package/dist/chunk-23UPXDNL.js +3044 -0
  16. package/dist/chunk-2W7MO2DL.js +1366 -0
  17. package/dist/chunk-3NUAZGMA.js +1689 -0
  18. package/dist/chunk-7TOWB2XB.js +366 -0
  19. package/dist/chunk-7XOTOADQ.js +3065 -0
  20. package/dist/chunk-AH2PDM2K.js +3042 -0
  21. package/dist/chunk-BNXWSZ63.js +3742 -0
  22. package/dist/chunk-BTL5DJVU.js +3222 -0
  23. package/dist/chunk-HDHYG7E4.js +104 -0
  24. package/dist/chunk-HLR4KZBP.js +3234 -0
  25. package/dist/chunk-IP3FRFEB.js +1045 -0
  26. package/dist/chunk-KHU56VDO.js +3042 -0
  27. package/dist/chunk-KRYIFLQR.js +88 -0
  28. package/dist/chunk-LBSDNLEM.js +287 -0
  29. package/dist/chunk-MNTQ7HCP.js +2643 -0
  30. package/dist/chunk-MUJELQQ6.js +1387 -0
  31. package/dist/chunk-MXJGMSLV.js +2199 -0
  32. package/dist/chunk-N6QJGC3Z.js +2636 -0
  33. package/dist/chunk-OBELGBPM.js +1713 -0
  34. package/dist/chunk-OT7R5XTA.js +3192 -0
  35. package/dist/chunk-P7X4RA2T.js +106 -0
  36. package/dist/chunk-PIDUQNC2.js +3185 -0
  37. package/dist/chunk-POGCDIH4.js +3187 -0
  38. package/dist/chunk-PSIEOQGZ.js +3043 -0
  39. package/dist/chunk-PVRT3IHA.js +3238 -0
  40. package/dist/chunk-QNN4TT23.js +1430 -0
  41. package/dist/chunk-RE3R45RJ.js +3042 -0
  42. package/dist/chunk-S7E6TFX6.js +803 -0
  43. package/dist/chunk-SG6GLU4U.js +1378 -0
  44. package/dist/chunk-SJCDV2ST.js +274 -0
  45. package/dist/chunk-SYE5XLF3.js +104 -0
  46. package/dist/chunk-T5VLYBZD.js +103 -0
  47. package/dist/chunk-TOQB7VWU.js +3238 -0
  48. package/dist/chunk-VFNMZ4ZQ.js +3228 -0
  49. package/dist/chunk-VVTGZNBT.js +1629 -0
  50. package/dist/chunk-W7Q4RFEV.js +104 -0
  51. package/dist/chunk-XTYYVRLO.js +3190 -0
  52. package/dist/chunk-Y6MDYVJD.js +3063 -0
  53. package/dist/cli/main.d.ts +1 -0
  54. package/dist/cli/main.js +5458 -0
  55. package/dist/index.d.ts +653 -0
  56. package/dist/index.js +79 -0
  57. package/dist/mcp/server.d.ts +1 -0
  58. package/dist/mcp/server.js +472 -0
  59. package/dist/schema-BAWSG7KY.js +22 -0
  60. package/dist/schema-E3QUPL26.js +20 -0
  61. package/dist/schema-EHL7WUT6.js +20 -0
  62. package/docs/019-USAGE.md +625 -0
  63. package/docs/020-current-implementation.md +364 -0
  64. package/docs/021-DOGFOODING-FINDINGS.md +175 -0
  65. package/docs/BACKLOG.md +80 -0
  66. package/docs/CONFIG.md +1123 -0
  67. package/docs/DESIGN.md +439 -0
  68. package/docs/ERRORS.md +383 -0
  69. package/docs/PROJECT.md +88 -0
  70. package/docs/ROADMAP.md +407 -0
  71. package/docs/summarization.md +320 -0
  72. package/docs/test-links.md +9 -0
  73. package/justfile +40 -0
  74. package/package.json +74 -9
  75. package/pnpm-workspace.yaml +5 -0
  76. package/research/INDEX.md +315 -0
  77. package/research/code-review/README.md +90 -0
  78. package/research/code-review/cli-error-handling-review.md +979 -0
  79. package/research/code-review/code-review-validation-report.md +464 -0
  80. package/research/code-review/main-ts-review.md +1128 -0
  81. package/research/config-analysis/01-current-implementation.md +470 -0
  82. package/research/config-analysis/02-strategy-recommendation.md +428 -0
  83. package/research/config-analysis/03-task-candidates.md +715 -0
  84. package/research/config-analysis/033-research-configuration-management.md +828 -0
  85. package/research/config-analysis/034-research-effect-cli-config.md +1504 -0
  86. package/research/config-analysis/04-consolidated-task-candidates.md +277 -0
  87. package/research/config-docs/SUMMARY.md +357 -0
  88. package/research/config-docs/TEST-RESULTS.md +776 -0
  89. package/research/config-docs/TODO.md +542 -0
  90. package/research/config-docs/analysis.md +744 -0
  91. package/research/config-docs/fix-validation.md +502 -0
  92. package/research/config-docs/help-audit.md +264 -0
  93. package/research/config-docs/help-system-analysis.md +890 -0
  94. package/research/dogfood/consolidated-tool-evaluation.md +373 -0
  95. package/research/dogfood/strategy-a/a-synthesis.md +184 -0
  96. package/research/dogfood/strategy-a/a1-docs.md +226 -0
  97. package/research/dogfood/strategy-a/a2-amorphic.md +156 -0
  98. package/research/dogfood/strategy-a/a3-llm.md +164 -0
  99. package/research/dogfood/strategy-b/b-synthesis.md +228 -0
  100. package/research/dogfood/strategy-b/b1-architecture.md +207 -0
  101. package/research/dogfood/strategy-b/b2-gaps.md +258 -0
  102. package/research/dogfood/strategy-b/b3-workflows.md +250 -0
  103. package/research/dogfood/strategy-c/c-synthesis.md +451 -0
  104. package/research/dogfood/strategy-c/c1-explorer.md +192 -0
  105. package/research/dogfood/strategy-c/c2-diver-memory.md +145 -0
  106. package/research/dogfood/strategy-c/c3-diver-control.md +148 -0
  107. package/research/dogfood/strategy-c/c4-diver-failure.md +151 -0
  108. package/research/dogfood/strategy-c/c5-diver-execution.md +221 -0
  109. package/research/dogfood/strategy-c/c6-diver-org.md +221 -0
  110. package/research/effect-cli-error-handling.md +845 -0
  111. package/research/effect-errors-as-values.md +943 -0
  112. package/research/errors-task-analysis/00-consolidated-tasks.md +207 -0
  113. package/research/errors-task-analysis/cli-commands-analysis.md +909 -0
  114. package/research/errors-task-analysis/embeddings-analysis.md +709 -0
  115. package/research/errors-task-analysis/index-search-analysis.md +812 -0
  116. package/research/frontmatter/COMMENTS-ARE-SKIPPED.md +149 -0
  117. package/research/frontmatter/LLM-CODE-NAVIGATION.md +276 -0
  118. package/research/issue-review.md +603 -0
  119. package/research/llm-summarization/agent-cli-tools-2026.md +1082 -0
  120. package/research/llm-summarization/alternative-providers-2026.md +1428 -0
  121. package/research/llm-summarization/anthropic-2026.md +367 -0
  122. package/research/llm-summarization/claude-cli-integration.md +1706 -0
  123. package/research/llm-summarization/cli-integration-patterns.md +3155 -0
  124. package/research/llm-summarization/openai-2026.md +473 -0
  125. package/research/llm-summarization/openai-compatible-providers-2026.md +1022 -0
  126. package/research/llm-summarization/opencode-cli-integration.md +1552 -0
  127. package/research/llm-summarization/prompt-engineering-2026.md +1426 -0
  128. package/research/llm-summarization/prototype-results.md +56 -0
  129. package/research/llm-summarization/provider-switching-patterns-2026.md +2153 -0
  130. package/research/llm-summarization/typescript-llm-libraries-2026.md +2436 -0
  131. package/research/mdcontext-error-analysis.md +521 -0
  132. package/research/mdcontext-pudding/00-EXECUTIVE-SUMMARY.md +282 -0
  133. package/research/mdcontext-pudding/01-index-embed.md +956 -0
  134. package/research/mdcontext-pudding/02-search-COMMANDS.md +142 -0
  135. package/research/mdcontext-pudding/02-search-SUMMARY.md +146 -0
  136. package/research/mdcontext-pudding/02-search.md +970 -0
  137. package/research/mdcontext-pudding/03-context.md +779 -0
  138. package/research/mdcontext-pudding/04-navigation-and-analytics.md +803 -0
  139. package/research/mdcontext-pudding/04-tree.md +704 -0
  140. package/research/mdcontext-pudding/05-config.md +1038 -0
  141. package/research/mdcontext-pudding/06-links-summary.txt +87 -0
  142. package/research/mdcontext-pudding/06-links.md +679 -0
  143. package/research/mdcontext-pudding/07-stats.md +693 -0
  144. package/research/mdcontext-pudding/BUG-FIX-PLAN.md +388 -0
  145. package/research/mdcontext-pudding/P0-BUG-VALIDATION.md +167 -0
  146. package/research/mdcontext-pudding/README.md +168 -0
  147. package/research/mdcontext-pudding/TESTING-SUMMARY.md +128 -0
  148. package/research/npm_publish/011-npm-workflow-research-agent2.md +792 -0
  149. package/research/npm_publish/012-npm-workflow-research-agent1.md +530 -0
  150. package/research/npm_publish/013-npm-workflow-research-agent3.md +722 -0
  151. package/research/npm_publish/014-npm-workflow-synthesis.md +556 -0
  152. package/research/npm_publish/031-npm-workflow-task-analysis.md +134 -0
  153. package/research/research-quality-review.md +834 -0
  154. package/research/semantic-search/002-research-embedding-models.md +490 -0
  155. package/research/semantic-search/003-research-rag-alternatives.md +523 -0
  156. package/research/semantic-search/004-research-vector-search.md +841 -0
  157. package/research/semantic-search/032-research-semantic-search.md +427 -0
  158. package/research/semantic-search/embedding-text-analysis.md +156 -0
  159. package/research/semantic-search/multi-word-failure-reproduction.md +171 -0
  160. package/research/semantic-search/query-processing-analysis.md +207 -0
  161. package/research/semantic-search/root-cause-and-solution.md +114 -0
  162. package/research/semantic-search/threshold-validation-report.md +69 -0
  163. package/research/semantic-search/vector-search-analysis.md +63 -0
  164. package/research/task-management-2026/00-synthesis-recommendations.md +295 -0
  165. package/research/task-management-2026/01-ai-workflow-tools.md +416 -0
  166. package/research/task-management-2026/02-agent-framework-patterns.md +476 -0
  167. package/research/task-management-2026/03-lightweight-file-based.md +567 -0
  168. package/research/task-management-2026/04-established-tools-ai-features.md +541 -0
  169. package/research/task-management-2026/linear/01-core-features-workflow.md +771 -0
  170. package/research/task-management-2026/linear/02-api-integrations.md +930 -0
  171. package/research/task-management-2026/linear/03-ai-features.md +368 -0
  172. package/research/task-management-2026/linear/04-pricing-setup.md +205 -0
  173. package/research/task-management-2026/linear/05-usage-patterns-best-practices.md +605 -0
  174. package/research/test-path-issues.md +276 -0
  175. package/review/ALP-76/1-error-type-design.md +962 -0
  176. package/review/ALP-76/2-error-handling-patterns.md +906 -0
  177. package/review/ALP-76/3-error-presentation.md +624 -0
  178. package/review/ALP-76/4-test-coverage.md +625 -0
  179. package/review/ALP-76/5-migration-completeness.md +440 -0
  180. package/review/ALP-76/6-effect-best-practices.md +755 -0
  181. package/scripts/apply-branch-protection.sh +47 -0
  182. package/scripts/branch-protection-templates.json +79 -0
  183. package/scripts/prototype-summarization.ts +346 -0
  184. package/scripts/rebuild-hnswlib.js +58 -0
  185. package/scripts/setup-branch-protection.sh +64 -0
  186. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/active-provider.json +7 -0
  187. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/bm25.json +541 -0
  188. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/bm25.meta.json +5 -0
  189. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/config.json +8 -0
  190. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/embeddings/openai_text-embedding-3-small_512/vectors.bin +0 -0
  191. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/embeddings/openai_text-embedding-3-small_512/vectors.meta.bin +0 -0
  192. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/indexes/documents.json +60 -0
  193. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/indexes/links.json +13 -0
  194. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/indexes/sections.json +1197 -0
  195. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/configuration-management.md +99 -0
  196. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/distributed-systems.md +92 -0
  197. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/error-handling.md +78 -0
  198. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/failure-automation.md +55 -0
  199. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/job-context.md +69 -0
  200. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/process-orchestration.md +99 -0
  201. package/src/cli/argv-preprocessor.test.ts +210 -0
  202. package/src/cli/argv-preprocessor.ts +202 -0
  203. package/src/cli/cli.test.ts +627 -0
  204. package/src/cli/commands/backlinks.ts +54 -0
  205. package/src/cli/commands/config-cmd.ts +642 -0
  206. package/src/cli/commands/context.ts +285 -0
  207. package/src/cli/commands/duplicates.ts +122 -0
  208. package/src/cli/commands/embeddings.ts +529 -0
  209. package/src/cli/commands/index-cmd.ts +480 -0
  210. package/src/cli/commands/index.ts +16 -0
  211. package/src/cli/commands/links.ts +52 -0
  212. package/src/cli/commands/search.ts +1281 -0
  213. package/src/cli/commands/stats.ts +149 -0
  214. package/src/cli/commands/tree.ts +128 -0
  215. package/src/cli/config-layer.ts +176 -0
  216. package/src/cli/error-handler.test.ts +235 -0
  217. package/src/cli/error-handler.ts +655 -0
  218. package/src/cli/flag-schemas.ts +341 -0
  219. package/src/cli/help.ts +588 -0
  220. package/src/cli/index.ts +9 -0
  221. package/src/cli/main.ts +435 -0
  222. package/src/cli/options.ts +41 -0
  223. package/src/cli/shared-error-handling.ts +199 -0
  224. package/src/cli/typo-suggester.test.ts +105 -0
  225. package/src/cli/typo-suggester.ts +130 -0
  226. package/src/cli/utils.ts +259 -0
  227. package/src/config/file-provider.test.ts +320 -0
  228. package/src/config/file-provider.ts +273 -0
  229. package/src/config/index.ts +72 -0
  230. package/src/config/integration.test.ts +667 -0
  231. package/src/config/precedence.test.ts +277 -0
  232. package/src/config/precedence.ts +451 -0
  233. package/src/config/schema.test.ts +414 -0
  234. package/src/config/schema.ts +603 -0
  235. package/src/config/service.test.ts +320 -0
  236. package/src/config/service.ts +243 -0
  237. package/src/config/testing.test.ts +264 -0
  238. package/src/config/testing.ts +110 -0
  239. package/src/core/index.ts +1 -0
  240. package/src/core/types.ts +113 -0
  241. package/src/duplicates/detector.test.ts +183 -0
  242. package/src/duplicates/detector.ts +414 -0
  243. package/src/duplicates/index.ts +18 -0
  244. package/src/embeddings/embedding-namespace.test.ts +300 -0
  245. package/src/embeddings/embedding-namespace.ts +947 -0
  246. package/src/embeddings/heading-boost.test.ts +222 -0
  247. package/src/embeddings/hnsw-build-options.test.ts +198 -0
  248. package/src/embeddings/hyde.test.ts +272 -0
  249. package/src/embeddings/hyde.ts +264 -0
  250. package/src/embeddings/index.ts +10 -0
  251. package/src/embeddings/openai-provider.ts +414 -0
  252. package/src/embeddings/pricing.json +22 -0
  253. package/src/embeddings/provider-constants.ts +204 -0
  254. package/src/embeddings/provider-errors.test.ts +967 -0
  255. package/src/embeddings/provider-errors.ts +565 -0
  256. package/src/embeddings/provider-factory.test.ts +240 -0
  257. package/src/embeddings/provider-factory.ts +225 -0
  258. package/src/embeddings/provider-integration.test.ts +788 -0
  259. package/src/embeddings/query-preprocessing.test.ts +187 -0
  260. package/src/embeddings/semantic-search-threshold.test.ts +508 -0
  261. package/src/embeddings/semantic-search.ts +1270 -0
  262. package/src/embeddings/types.ts +359 -0
  263. package/src/embeddings/vector-store.ts +708 -0
  264. package/src/embeddings/voyage-provider.ts +313 -0
  265. package/src/errors/errors.test.ts +845 -0
  266. package/src/errors/index.ts +533 -0
  267. package/src/index/ignore-patterns.test.ts +354 -0
  268. package/src/index/ignore-patterns.ts +305 -0
  269. package/src/index/index.ts +4 -0
  270. package/src/index/indexer.ts +684 -0
  271. package/src/index/storage.ts +260 -0
  272. package/src/index/types.ts +147 -0
  273. package/src/index/watcher.ts +189 -0
  274. package/src/index.ts +30 -0
  275. package/src/integration/search-keyword.test.ts +678 -0
  276. package/src/mcp/server.ts +612 -0
  277. package/src/parser/index.ts +1 -0
  278. package/src/parser/parser.test.ts +291 -0
  279. package/src/parser/parser.ts +394 -0
  280. package/src/parser/section-filter.test.ts +277 -0
  281. package/src/parser/section-filter.ts +392 -0
  282. package/src/search/__tests__/hybrid-search.test.ts +650 -0
  283. package/src/search/bm25-store.ts +366 -0
  284. package/src/search/cross-encoder.test.ts +253 -0
  285. package/src/search/cross-encoder.ts +406 -0
  286. package/src/search/fuzzy-search.test.ts +419 -0
  287. package/src/search/fuzzy-search.ts +273 -0
  288. package/src/search/hybrid-search.ts +448 -0
  289. package/src/search/path-matcher.test.ts +276 -0
  290. package/src/search/path-matcher.ts +33 -0
  291. package/src/search/query-parser.test.ts +260 -0
  292. package/src/search/query-parser.ts +319 -0
  293. package/src/search/searcher.test.ts +280 -0
  294. package/src/search/searcher.ts +724 -0
  295. package/src/search/wink-bm25.d.ts +30 -0
  296. package/src/summarization/cli-providers/claude.ts +202 -0
  297. package/src/summarization/cli-providers/detection.test.ts +273 -0
  298. package/src/summarization/cli-providers/detection.ts +118 -0
  299. package/src/summarization/cli-providers/index.ts +8 -0
  300. package/src/summarization/cost.test.ts +139 -0
  301. package/src/summarization/cost.ts +102 -0
  302. package/src/summarization/error-handler.test.ts +127 -0
  303. package/src/summarization/error-handler.ts +111 -0
  304. package/src/summarization/index.ts +102 -0
  305. package/src/summarization/pipeline.test.ts +498 -0
  306. package/src/summarization/pipeline.ts +231 -0
  307. package/src/summarization/prompts.test.ts +269 -0
  308. package/src/summarization/prompts.ts +133 -0
  309. package/src/summarization/provider-factory.test.ts +396 -0
  310. package/src/summarization/provider-factory.ts +178 -0
  311. package/src/summarization/types.ts +184 -0
  312. package/src/summarize/budget-bugs.test.ts +620 -0
  313. package/src/summarize/formatters.ts +419 -0
  314. package/src/summarize/index.ts +20 -0
  315. package/src/summarize/summarizer.test.ts +275 -0
  316. package/src/summarize/summarizer.ts +597 -0
  317. package/src/summarize/verify-bugs.test.ts +238 -0
  318. package/src/types/huggingface-transformers.d.ts +66 -0
  319. package/src/utils/index.ts +1 -0
  320. package/src/utils/tokens.test.ts +142 -0
  321. package/src/utils/tokens.ts +186 -0
  322. package/tests/fixtures/cli/.mdcontext/active-provider.json +7 -0
  323. package/tests/fixtures/cli/.mdcontext/config.json +8 -0
  324. package/tests/fixtures/cli/.mdcontext/embeddings/openai_text-embedding-3-small_512/vectors.bin +0 -0
  325. package/tests/fixtures/cli/.mdcontext/embeddings/openai_text-embedding-3-small_512/vectors.meta.bin +0 -0
  326. package/tests/fixtures/cli/.mdcontext/indexes/documents.json +33 -0
  327. package/tests/fixtures/cli/.mdcontext/indexes/links.json +12 -0
  328. package/tests/fixtures/cli/.mdcontext/indexes/sections.json +247 -0
  329. package/tests/fixtures/cli/README.md +9 -0
  330. package/tests/fixtures/cli/api-reference.md +11 -0
  331. package/tests/fixtures/cli/getting-started.md +11 -0
  332. package/tests/integration/embed-index.test.ts +712 -0
  333. package/tests/integration/search-context.test.ts +469 -0
  334. package/tests/integration/search-semantic.test.ts +522 -0
  335. package/tsconfig.json +26 -0
  336. package/vitest.config.ts +16 -0
  337. package/vitest.setup.ts +12 -0
@@ -0,0 +1,427 @@
1
+ # Research Task Analysis: Embedding Models, RAG Alternatives, and Vector Search
2
+
3
+ Analysis of research documents against current mdcontext implementation.
4
+
5
+ Date: January 2026
6
+
7
+ ---
8
+
9
+ ## Documents Analyzed
10
+
11
+ 1. `002-research-embedding-models.md` - Embedding model comparison and recommendations
12
+ 2. `003-research-rag-alternatives.md` - RAG alternatives for improving semantic search
13
+ 3. `004-research-vector-search.md` - Vector search patterns and techniques
14
+
15
+ ---
16
+
17
+ ## Implemented (No Action Needed)
18
+
19
+ ### 1. Dimension Reduction (512 dimensions)
20
+
21
+ **Research Recommendation:** Reduce OpenAI embeddings from 1536 to 512 dimensions for 67% storage reduction with minimal quality loss.
22
+
23
+ **Current Implementation:** Already implemented in `src/embeddings/openai-provider.ts`:
24
+
25
+ ```typescript
26
+ const response = await this.client.embeddings.create({
27
+   model: this.model,
28
+   input: batch,
29
+   dimensions: 512, // Already using reduced dimensions
30
+ });
31
+ ```
32
+
33
+ **Status:** Implemented
34
+
35
+ ---
36
+
37
+ ### 2. HNSW Vector Index
38
+
39
+ **Research Recommendation:** Stay with HNSW for documentation corpora (<100K sections).
40
+
41
+ **Current Implementation:** Using `hnswlib-node` with cosine similarity in `src/embeddings/vector-store.ts`:
42
+
43
+ ```typescript
44
+ this.index = new HierarchicalNSW.HierarchicalNSW("cosine", this.dimensions);
45
+ this.index.initIndex(10000, 16, 200, 100); // M=16, efConstruction=200
46
+ ```
47
+
48
+ **Status:** Implemented
49
+
50
+ ---
51
+
52
+ ### 3. EmbeddingProvider Interface
53
+
54
+ **Research Recommendation:** Create provider abstraction for embedding models.
55
+
56
+ **Current Implementation:** `src/embeddings/types.ts` defines a clean provider interface:
57
+
58
+ ```typescript
59
+ export interface EmbeddingProvider {
60
+   readonly name: string;
61
+   readonly dimensions: number;
62
+   embed(texts: string[]): Promise<EmbeddingResult>;
63
+ }
64
+ ```
65
+
66
+ **Status:** Implemented (foundation ready for additional providers)
67
+
68
+ ---
69
+
70
+ ### 4. Path Pattern Filtering (Post-filter)
71
+
72
+ **Research Recommendation:** Implement metadata filtering for search results.
73
+
74
+ **Current Implementation:** `pathPattern` option implemented in `semanticSearch()` as post-filtering.
75
+
76
+ **Status:** Implemented (basic)
77
+
78
+ ---
79
+
80
+ ### 5. Document Context in Embeddings
81
+
82
+ **Research Recommendation:** Include document title and parent section in embedding text.
83
+
84
+ **Current Implementation:** Already in `src/embeddings/semantic-search.ts`:
85
+
86
+ ```typescript
87
+ const generateEmbeddingText = (
88
+   section,
89
+   content,
90
+   documentTitle,
91
+   parentHeading,
92
+ ) => {
93
+   parts.push(`# ${section.heading}`);
94
+   if (parentHeading) parts.push(`Parent section: ${parentHeading}`);
95
+   parts.push(`Document: ${documentTitle}`);
96
+   parts.push(content);
97
+   // ...
98
+ };
99
+ ```
100
+
101
+ **Status:** Implemented
102
+
103
+ ---
104
+
105
+ ## Task Candidates
106
+
107
+ ### 1. Add Hybrid Search (BM25 + Semantic)
108
+
109
+ **Priority:** High
110
+
111
+ **Description:**
112
+ Implement hybrid search combining BM25 keyword search with semantic search, using Reciprocal Rank Fusion (RRF) to merge results.
113
+
114
+ **Why It Matters:**
115
+
116
+ - The analyzed research reports a 15-30% recall improvement for hybrid retrieval over single-method retrieval
117
+ - Handles exact term matching (API names, error codes, identifiers) that pure semantic search misses
118
+ - Current keyword search exists separately but isn't integrated with semantic search
119
+
120
+ **Implementation Notes:**
121
+
122
+ - Add `wink-bm25-text-search` dependency
123
+ - Build BM25 index alongside vector index during `mdcontext embed`
124
+ - Add `--mode hybrid` option to search command
125
+ - Implement RRF fusion (~50 lines of code)
126
+
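The RRF step mentioned in the notes above can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name `reciprocalRankFusion`, the section IDs, and the constant `k = 60` (a commonly used default) are assumptions made for the example.

```typescript
// Reciprocal Rank Fusion: score(d) = sum over result lists of 1 / (k + rank(d)),
// where rank is 1-based and k dampens the advantage of the very top ranks.
interface FusedResult {
  id: string;
  score: number;
}

function reciprocalRankFusion(
  rankedLists: string[][], // each list is ordered best-first
  k = 60, // assumed default; tune against real queries
): FusedResult[] {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Hypothetical section IDs returned by the keyword and semantic searches
const bm25Hits = ["auth.md#tokens", "errors.md#codes", "setup.md#install"];
const semanticHits = ["errors.md#codes", "auth.md#oauth", "auth.md#tokens"];
const fused = reciprocalRankFusion([bm25Hits, semanticHits]);
console.log(fused.map((r) => r.id));
```

Because RRF only uses ranks, BM25 scores and cosine similarities never need to be normalized onto a shared scale, which is one reason the fusion code can stay small.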
127
+ **Current Gap:**
128
+
129
+ - Keyword search (`src/search/searcher.ts`) and semantic search (`src/embeddings/semantic-search.ts`) are separate codepaths
130
+ - No fusion mechanism exists
131
+
132
+ **Estimated Effort:** 2-3 days
133
+
134
+ ---
135
+
136
+ ### 2. Add Local Embedding Provider (Ollama)
137
+
138
+ **Priority:** High
139
+
140
+ **Description:**
141
+ Implement an Ollama-based embedding provider using `nomic-embed-text-v1.5` for offline semantic search.
142
+
143
+ **Why It Matters:**
144
+
145
+ - Enables offline semantic search (major feature gap)
146
+ - Zero ongoing API costs
147
+ - Quality reported in the research to match or exceed OpenAI text-embedding-3-small
148
+ - Privacy-sensitive use cases
149
+
150
+ **Implementation Notes:**
151
+
152
+ - Create `src/embeddings/ollama-provider.ts` implementing `EmbeddingProvider`
153
+ - Add provider selection via config or `--provider` CLI flag
154
+ - Default to OpenAI for backward compatibility
155
+ - nomic-embed-text supports Matryoshka (dimension flexibility)
156
+
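A provider along these lines might look like the sketch below. Several things here are assumptions rather than facts from the source: the `EmbeddingResult` shape (the real type lives in `src/embeddings/types.ts`), the factory name `createOllamaProvider`, the injected `PostJson` helper, and the use of Ollama's `/api/embed` endpoint on the default local port, which should be checked against the installed Ollama version.

```typescript
// Assumed result shape; the real EmbeddingResult is defined in src/embeddings/types.ts.
interface EmbeddingResult {
  embeddings: number[][];
}

interface EmbeddingProvider {
  readonly name: string;
  readonly dimensions: number;
  embed(texts: string[]): Promise<EmbeddingResult>;
}

// Hypothetical HTTP helper, injected so the provider can be exercised without a server.
// In real use it would wrap an HTTP client and return the parsed JSON body.
type PostJson = (url: string, body: unknown) => Promise<unknown>;

function createOllamaProvider(
  postJson: PostJson,
  model = "nomic-embed-text",
  dimensions = 768, // native output size of nomic-embed-text
  baseUrl = "http://localhost:11434", // Ollama's default local address
): EmbeddingProvider {
  return {
    name: "ollama",
    dimensions,
    async embed(texts: string[]): Promise<EmbeddingResult> {
      // Assumed request/response shape of Ollama's batch embedding endpoint
      const raw = (await postJson(`${baseUrl}/api/embed`, {
        model,
        input: texts,
      })) as { embeddings?: number[][] };
      const embeddings = raw.embeddings ?? [];
      if (embeddings.length !== texts.length) {
        throw new Error(
          `Expected ${texts.length} embeddings, received ${embeddings.length}`,
        );
      }
      return { embeddings };
    },
  };
}
```

Injecting the HTTP helper keeps the provider free of network code, so unit tests can pass a stub while the CLI passes a real client.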
157
+ **Models to Support:**
158
+
159
+ 1. `nomic-embed-text` - Best overall fit (fast, 8K context, Matryoshka)
160
+ 2. `mxbai-embed-large` - Higher quality option
161
+ 3. `bge-m3` - Multilingual option
162
+
163
+ **Estimated Effort:** 2-3 days
164
+
165
+ ---
166
+
167
+ ### 3. Add Cross-Encoder Re-ranking
168
+
169
+ **Priority:** Medium
170
+
171
+ **Description:**
172
+ Add optional re-ranking of top-N semantic search results using a cross-encoder model.
173
+
174
+ **Why It Matters:**
175
+
176
+ - 20-35% accuracy improvement in retrieval precision
177
+ - Cross-encoders capture fine-grained relevance that bi-encoders miss
178
+ - Can be opt-in to avoid latency when not needed
179
+
180
+ **Implementation Notes:**
181
+
182
+ - Add `@xenova/transformers` dependency for Transformers.js
183
+ - Use `ms-marco-MiniLM-L-6-v2` model (22.7M params, 2-5ms/pair)
184
+ - Re-rank top-20 candidates to top-10
185
+ - Add `--rerank` flag to search command
186
+
187
+ **Alternative:** Cohere Rerank API for simpler integration (adds cost)
188
+
189
+ **Estimated Effort:** 2-3 days
190
+
191
+ ---
192
+
193
+ ### 4. Add Dynamic efSearch (Quality Modes)
194
+
195
+ **Priority:** Medium
196
+
197
+ **Description:**
198
+ Allow users to control search quality/speed tradeoff via HNSW efSearch parameter at query time.
199
+
200
+ **Why It Matters:**
201
+
202
+ - Zero dependency changes
203
+ - Immediate quality/speed improvements
204
+ - Low risk
205
+
206
+ **Implementation Notes:**
207
+
208
+ - Add `--quality` flag: `fast` (64), `balanced` (100), `thorough` (256)
209
+ - efSearch is already configurable at query time in hnswlib-node
210
+ - Update search functions to accept quality parameter
211
+
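The quality modes listed above could be wired up roughly as follows. This is a sketch under assumptions: `SearchQuality`, `EF_SEARCH_BY_QUALITY`, and `searchWithQuality` are illustrative names, and `KnnIndex` is a narrowed stand-in for the `hnswlib-node` index that declares only the two methods used here (`setEf` and `searchKnn`).

```typescript
type SearchQuality = "fast" | "balanced" | "thorough";

// efSearch values taken from the proposed --quality flag
const EF_SEARCH_BY_QUALITY: Record<SearchQuality, number> = {
  fast: 64,
  balanced: 100,
  thorough: 256,
};

// Narrowed view of the index: setEf() sets the query-time candidate list size,
// searchKnn() runs the nearest-neighbour query.
interface KnnIndex {
  setEf(ef: number): void;
  searchKnn(
    query: number[],
    k: number,
  ): { distances: number[]; neighbors: number[] };
}

function searchWithQuality(
  index: KnnIndex,
  query: number[],
  k: number,
  quality: SearchQuality = "balanced",
): { distances: number[]; neighbors: number[] } {
  // Clamp defensively so the candidate list is never smaller than k
  index.setEf(Math.max(EF_SEARCH_BY_QUALITY[quality], k));
  return index.searchKnn(query, k);
}
```

Setting `ef` immediately before each query keeps the choice per-request, so one loaded index can serve both fast and thorough searches.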
212
+ **Current State:**
213
+
214
+ ```typescript
215
+ this.index.initIndex(10000, 16, 200, 100); // 4th argument is randomSeed in hnswlib-node; efSearch is not set here (it is set via setEf())
216
+ ```
217
+
218
+ **Estimated Effort:** 0.5 days
219
+
220
+ ---
221
+
222
+ ### 5. Add Configurable HNSW Parameters
223
+
224
+ **Priority:** Low
225
+
226
+ **Description:**
227
+ Expose HNSW build parameters (M, efConstruction) via configuration for users with specific needs.
228
+
229
+ **Why It Matters:**
230
+
231
+ - Users with large corpora may want to tune for speed
232
+ - Users needing maximum recall can increase parameters
233
+ - Enables benchmarking different configurations
234
+
235
+ **Current Hardcoded Values:**
236
+
237
+ ```typescript
238
+ M: 16; // Max connections per node
239
+ efConstruction: 200; // Construction-time search width
240
+ ```
241
+
242
+ **Recommended Configurations:**
243
+
244
+ - Quality-focused: M=24, efConstruction=256
245
+ - Speed-focused: M=12, efConstruction=128
246
+
247
+ **Estimated Effort:** 1 day
248
+
249
+ ---
250
+
251
+ ### 6. Add Query Preprocessing
252
+
253
+ **Priority:** Low
254
+
255
+ **Description:**
256
+ Add basic query preprocessing before embedding to reduce noise.
257
+
258
+ **Why It Matters:**
259
+
260
+ - 2-5% precision improvement
261
+ - Simple implementation
262
+
263
+ **Implementation:**
264
+
265
+ ```typescript
266
+ function preprocessQuery(query: string): string {
267
+   return query
268
+     .toLowerCase()
269
+     .replace(/[^\w\s]/g, " ")
270
+     .replace(/\s+/g, " ")
271
+     .trim();
272
+ }
273
+ ```
274
+
275
+ **Estimated Effort:** 1-2 hours
276
+
277
+ ---
278
+
279
+ ### 7. Add Heading Match Boost
280
+
281
+ **Priority:** Low
282
+
283
+ **Description:**
284
+ Boost search results when query terms appear in section headings.
285
+
286
+ **Why It Matters:**
287
+
288
+ - Significant for navigation queries ("installation guide", "API reference")
289
+ - Simple scoring adjustment
290
+
291
+ **Implementation:**
292
+
293
+ ```typescript
294
+ function adjustScore(result: { heading: string; similarity: number }, query: string): number {
295
+   const queryTerms = query.toLowerCase().split(/\s+/).filter(Boolean); // drop empty terms
296
+   const headingLower = result.heading.toLowerCase();
297
+   const headingMatches = queryTerms.filter((t) =>
298
+     headingLower.includes(t),
299
+   ).length;
300
+   return result.similarity + headingMatches * 0.05;
301
+ }
302
+ ```
303
+
304
+ **Estimated Effort:** 2-4 hours
305
+
306
+ ---
307
+
308
+ ### 8. Add HyDE Query Expansion (Optional)
309
+
310
+ **Priority:** Low
311
+
312
+ **Description:**
313
+ Implement Hypothetical Document Embeddings for complex queries - generate a hypothetical answer with LLM, then search using that embedding.
314
+
315
+ **Why It Matters:**
316
+
317
+ - 10-30% retrieval improvement on ambiguous queries
318
+ - Bridges semantic gap between short questions and detailed documents
319
+
320
+ **Considerations:**
321
+
322
+ - Adds LLM call (cost, latency)
323
+ - Should be opt-in for complex queries only
324
+ - Works poorly if LLM lacks domain knowledge
325
+
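The HyDE flow described above can be sketched with its three dependencies injected. All names here (`HydeDeps`, `hydeSearch`, `SearchHit`, and the three callbacks) are hypothetical and stand in for whatever LLM client, embedding provider, and vector search the project wires up. The fallback to the raw query on an empty LLM answer is an added safeguard, not something the research prescribes.

```typescript
interface SearchHit {
  sectionId: string;
  similarity: number;
}

// Hypothetical dependencies, injected so the flow can be tested without network calls.
interface HydeDeps {
  generateAnswer(query: string): Promise<string>; // LLM call: adds cost and latency
  embed(text: string): Promise<number[]>;
  searchByVector(vector: number[], limit: number): Promise<SearchHit[]>;
}

async function hydeSearch(
  query: string,
  deps: HydeDeps,
  limit = 10,
): Promise<SearchHit[]> {
  // 1. Ask the LLM for a hypothetical answer to the query
  const hypothetical = (await deps.generateAnswer(query)).trim();
  // 2. Embed the hypothetical answer; fall back to the query if the LLM returned nothing
  const vector = await deps.embed(hypothetical.length > 0 ? hypothetical : query);
  // 3. Search with that embedding instead of the short query's embedding
  return deps.searchByVector(vector, limit);
}
```

Keeping the LLM call behind an interface also makes the opt-in behavior straightforward: a caller that does not want HyDE simply uses the regular query-embedding path.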
326
+ **Estimated Effort:** 1-2 days
327
+
328
+ ---
329
+
330
+ ### 9. Fix Dimension Mismatch in Provider
331
+
332
+ **Priority:** Medium
333
+
334
+ **Description:**
335
+ The OpenAI provider reports incorrect dimensions (1536/3072) while actually using 512.
336
+
337
+ **Current Issue:**
338
+
339
+ ```typescript
340
+ // In openai-provider.ts
341
+ this.dimensions = this.model === "text-embedding-3-large" ? 3072 : 1536;
342
+ // But actual API call uses:
343
+ dimensions: 512;
344
+ ```
345
+
346
+ This mismatch could cause issues if other code relies on `provider.dimensions`.
347
+
348
+ **Fix:** Update dimension reporting to match actual API parameter.
349
+
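One way to keep the two values from drifting apart is to derive both from a single constant. The sketch below is illustrative only: `EMBEDDING_DIMENSIONS`, `buildEmbeddingRequest`, and `providerInfo` are hypothetical names, not identifiers from `openai-provider.ts`.

```typescript
// Single source of truth for the embedding size (512 is the value quoted above)
const EMBEDDING_DIMENSIONS = 512;

interface EmbeddingRequest {
  model: string;
  input: string[];
  dimensions: number;
}

// Hypothetical helper: builds the request body sent to the embeddings API
function buildEmbeddingRequest(model: string, input: string[]): EmbeddingRequest {
  return { model, input, dimensions: EMBEDDING_DIMENSIONS };
}

// Hypothetical provider metadata: reports the same constant the request uses
const providerInfo = {
  name: "openai",
  dimensions: EMBEDDING_DIMENSIONS,
};

const request = buildEmbeddingRequest("text-embedding-3-small", ["hello"]);
console.log(request.dimensions === providerInfo.dimensions);
```

With this arrangement, any code that sizes a vector index from `provider.dimensions` sees the same number the API was asked to return.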
350
+ **Estimated Effort:** 0.5 hours
351
+
352
+ ---
353
+
354
+ ### 10. Add Alternative API Provider (Voyage AI)
355
+
356
+ **Priority:** Low
357
+
358
+ **Description:**
359
+ Add Voyage AI as an alternative embedding provider for users wanting better quality at similar cost.
360
+
361
+ **Why It Matters:**
362
+
363
+ - voyage-3.5-lite: Same price as OpenAI ($0.02/1M), but better quality
364
+ - 32K token context (4x OpenAI)
365
+ - Free 200M tokens for testing
366
+
367
+ **Estimated Effort:** 1 day
368
+
369
+ ---
370
+
371
+ ## Skip (Not Applicable)
372
+
373
+ | Recommendation | Reason to Skip |
374
+ | ------------------------ | --------------------------------------------------------------------- |
375
+ | ColBERT Late Interaction | Overkill for documentation corpus sizes; requires Python service |
376
+ | SPLADE Sparse Retrieval | BM25 + semantic hybrid likely sufficient; adds complexity |
377
+ | GraphRAG | Overkill for documentation search |
378
+ | Fine-tuned Embeddings | Requires training infrastructure; general models work well |
379
+ | IVF/DiskANN Indexes | HNSW sufficient for typical documentation sizes (<100K sections) |
380
+ | LLM-based Re-ranking | Cross-encoders provide similar quality without LLM cost/latency |
381
+ | Self-RAG | Beyond current scope; more relevant for RAG pipelines with generation |
382
+
383
+ ---
384
+
385
+ ## Summary
386
+
387
+ | Category | Count |
388
+ | --------------- | ----------------- |
389
+ | Implemented | 5 items |
390
+ | Task Candidates | 10 items |
391
+ | Skipped | 7 recommendations |
392
+
393
+ ### Priority Matrix
394
+
395
+ | Priority | Tasks |
396
+ | ---------- | ------------------------------------------------------------------------- |
397
+ | **High** | Hybrid Search (BM25+RRF), Local Embedding Provider (Ollama) |
398
+ | **Medium** | Cross-Encoder Re-ranking, Dynamic efSearch, Fix Dimension Mismatch |
399
+ | **Low** | HNSW Config, Query Preprocessing, Heading Boost, HyDE, Voyage AI Provider |
400
+
401
+ ### Recommended Implementation Order
402
+
403
+ **Phase 1: Quick Wins (1 week)**
404
+
405
+ 1. Fix dimension mismatch in provider
406
+ 2. Add dynamic efSearch (quality modes)
407
+ 3. Add query preprocessing
408
+ 4. Add heading match boost
409
+
410
+ **Phase 2: Hybrid Search (1-2 weeks)**
411
+
412
+ 1. Integrate BM25 library
413
+ 2. Build BM25 index during embed
414
+ 3. Implement RRF fusion
415
+ 4. Add `--mode hybrid` to CLI
416
+
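The RRF fusion step in Phase 2 can be sketched as follows. This is a minimal illustration of Reciprocal Rank Fusion, assuming each retriever returns an ordered list of section IDs; the function name `reciprocalRankFusion` and the constant `k = 60` (a commonly used default) are assumptions here, not values taken from the package.

```typescript
// Reciprocal Rank Fusion: score(id) = sum over rankings of 1 / (k + rank),
// where rank is the 1-based position of the id in each ranking.
function reciprocalRankFusion(
  rankings: string[][],
  k: number = 60, // assumed smoothing constant
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Hypothetical result lists from a keyword (BM25) and a semantic search.
const bm25Results = ["a", "b", "c"];
const semanticResults = ["b", "c", "a"];
const fused = reciprocalRankFusion([bm25Results, semanticResults]);
console.log(fused.map((r) => r.id)); // ["b", "a", "c"]
```

Because RRF uses only rank positions, it does not require BM25 scores and cosine similarities to be on a comparable scale.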
417
+ **Phase 3: Local/Offline (1-2 weeks)**
418
+
419
+ 1. Implement Ollama provider
420
+ 2. Add provider selection CLI
421
+ 3. Test with nomic-embed-text
422
+
423
+ **Phase 4: Advanced (2 weeks)**
424
+
425
+ 1. Cross-encoder re-ranking (Transformers.js)
426
+ 2. HyDE query expansion (optional)
427
+ 3. Alternative API providers (Voyage AI)
@@ -0,0 +1,156 @@
1
+ # Embedding Text Analysis
2
+
3
+ ## Executive Summary
4
+
5
+ The current embedding text generation format is **appropriate and follows best practices**. The format enriches content with contextual metadata (heading, parent section, document title) which helps embedding models understand the semantic context.
6
+
7
+ The similarity score issues identified in ALP-203 are **NOT caused by the embedding text format**, but rather by:
8
+ 1. Inherent properties of embedding models with short queries
9
+ 2. The 0.5 default threshold being too high for certain query types
10
+
11
+ ## Current Implementation
12
+
13
+ ### generateEmbeddingText Function
14
+
15
+ Location: `src/embeddings/semantic-search.ts:46-63`
16
+
17
+ ```typescript
18
+ const generateEmbeddingText = (
19
+ section: SectionEntry,
20
+ content: string,
21
+ documentTitle: string,
22
+ parentHeading?: string | undefined,
23
+ ): string => {
24
+ const parts: string[] = []
25
+
26
+ parts.push(`# ${section.heading}`)
27
+ if (parentHeading) {
28
+ parts.push(`Parent section: ${parentHeading}`)
29
+ }
30
+ parts.push(`Document: ${documentTitle}`)
31
+ parts.push('')
32
+ parts.push(content)
33
+
34
+ return parts.join('\n')
35
+ }
36
+ ```
37
+
38
+ ### Generated Text Format
39
+
40
+ **For a top-level section (e.g., "Overview" in "Failure Automation"):**
41
+ ```
42
+ # Overview
43
+ Document: Failure Automation
44
+
45
+ Failure automation is the practice of automatically detecting,
46
+ reporting, and responding to system failures without human
47
+ intervention. This approach is essential for maintaining high
48
+ availability in modern distributed systems.
49
+ ```
50
+
51
+ **For a nested section (e.g., "Automated Failure Detection"):**
52
+ ```
53
+ # Automated Failure Detection
54
+ Parent section: Core Concepts
55
+ Document: Failure Automation
56
+
57
+ Systems use health checks, heartbeats, and monitoring to detect
58
+ when components fail. Failure detection must be fast and accurate
59
+ to minimize downtime.
60
+ ```
61
+
62
+ ## Analysis
63
+
64
+ ### What the Current Format Does Well
65
+
66
+ 1. **Heading as Title**: Using `# {heading}` format is standard markdown that embedding models are trained on. The heading provides semantic context about the section topic.
67
+
68
+ 2. **Hierarchical Context**: Including `Parent section: {parentHeading}` helps the model understand the section's place in the document structure. This is especially valuable for nested sections with generic headings like "Overview" or "Best Practices".
69
+
70
+ 3. **Document Context**: Including `Document: {documentTitle}` helps disambiguate content that might otherwise be too generic.
71
+
72
+ 4. **Content Preservation**: The full section content is included, providing rich semantic signal.
73
+
74
+ ### Potential Concerns Investigated
75
+
76
+ | Concern | Finding |
77
+ |---------|---------|
78
+ | Does `# {heading}` confuse the model? | No - embedding models are trained on markdown and understand heading syntax |
79
+ | Is metadata adding noise? | No - metadata provides helpful context, especially for short sections |
80
+ | Is content truncated? | No - full section content is included |
81
+ | Are important keywords lost? | No - nothing is removed from original content |
82
+
83
+ ### Comparison with Best Practices
84
+
85
+ **OpenAI Recommendations:**
86
+ - `text-embedding-3-small` embeds queries and documents with the same model
87
+ - No special prefixes or asymmetric handling needed
88
+ - Cosine similarity is recommended for comparison
89
+ - The model captures semantic meaning, not just keyword overlap
90
+
91
+ **Industry Patterns:**
92
+ - Many RAG systems include metadata like titles and hierarchical context
93
+ - Including document/section titles is a common best practice
94
+ - Enriching content with context improves retrieval quality
95
+
96
+ ## Token Count Analysis
97
+
98
+ Sample embedded texts from test corpus:
99
+
100
+ | Section | Content Tokens | Metadata Overhead | Total | Overhead % |
101
+ |---------|----------------|-------------------|-------|------------|
102
+ | Overview | ~50 | ~15 | ~65 | 23% |
103
+ | Automated Failure Detection | ~40 | ~20 | ~60 | 33% |
104
+ | Best Practices | ~100 | ~15 | ~115 | 13% |
105
+
106
+ The metadata overhead is reasonable (13-33%) and provides valuable semantic context.
107
+
108
+ ## Root Cause of Similarity Score Issues
109
+
110
+ The similarity score issues are **NOT caused by embedding text generation**. Based on ALP-203 findings:
111
+
112
+ ### Why Short Queries Have Low Scores
113
+
114
+ 1. **Vector Space Properties**: A single word like "failure" produces an embedding that represents the general concept. Content sections contain many concepts, making the cosine similarity lower.
115
+
116
+ 2. **Context Asymmetry**: A query "failure" is matched against embeddings like:
117
+ ```
118
+ # Failure Isolation
119
+ Parent section: Core Concepts
120
+ Document: Failure Automation
121
+
122
+ Automated systems can isolate failures to prevent cascading effects...
123
+ ```
124
+ The query is a subset of the embedded content, not a full match.
125
+
126
+ 3. **Embedding Model Behavior**: `text-embedding-3-small` produces normalized (unit-length) vectors. Short inputs carry little context, so their vectors are less distinctive: they have less semantic "mass".
127
+
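The effect described above can be illustrated with cosine similarity on toy vectors. This is a sketch only: the three-dimensional vectors are made up for illustration and are not real embeddings, and `cosineSimilarity` is a hypothetical helper rather than a function from the package.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|).
// For unit-length vectors this reduces to the dot product.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    throw new Error("Cosine similarity is undefined for zero vectors");
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy vectors (NOT real embeddings): each axis stands for one "concept".
const query = [1, 0, 0];            // a single-concept query
const focusedSection = [1, 0.2, 0]; // section mostly about that concept
const broadSection = [1, 1, 1];     // section mixing several concepts

console.log(cosineSimilarity(query, focusedSection)); // about 0.98
console.log(cosineSimilarity(query, broadSection));   // about 0.577
```

The section that mixes several concepts scores lower against the single-concept query even though it fully contains that concept, which mirrors the behavior seen with one-word queries.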
128
+ ### Why Multi-Word Domain Queries Work Better
129
+
130
+ Queries like "failure automation" provide:
131
+ - Multiple semantic signals
132
+ - Domain-specific terminology
133
+ - Closer match to document/heading names
134
+ - More distinctive embedding vectors
135
+
136
+ ## Recommendations
137
+
138
+ ### No Changes Needed to Embedding Text Format
139
+
140
+ The current format is sound. The issues are in threshold calibration and query handling.
141
+
142
+ ### Potential Improvements (for ALP-207)
143
+
144
+ 1. **Query Enhancement**: Consider expanding short queries with context before embedding
145
+ 2. **Threshold Tuning**: Use adaptive thresholds based on query length
146
+ 3. **Hybrid Search Default**: Leverage keyword search to boost short query results
147
+
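The adaptive threshold idea could look something like the sketch below. Only the 0.5 default comes from the analysis above; the function name `adaptiveThreshold` and the lower cut-offs for one- and two-word queries are hypothetical placeholders that would need calibration against real queries.

```typescript
// Hypothetical sketch: lower the similarity threshold for short queries.
const DEFAULT_THRESHOLD = 0.5; // current default, per the analysis above

// Placeholder values, NOT calibrated: shorter queries get a lower bar.
const SHORT_QUERY_THRESHOLDS: Record<number, number> = {
  1: 0.3,
  2: 0.4,
};

function adaptiveThreshold(query: string): number {
  // Count whitespace-separated words, ignoring empty fragments.
  const wordCount = query.trim().split(/\s+/).filter(Boolean).length;
  return SHORT_QUERY_THRESHOLDS[wordCount] ?? DEFAULT_THRESHOLD;
}

console.log(adaptiveThreshold("failure"));                          // 0.3
console.log(adaptiveThreshold("failure automation"));               // 0.4
console.log(adaptiveThreshold("how to isolate cascading failures")); // 0.5
```

A lookup table keeps the cut-offs easy to tune as threshold calibration data becomes available.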
148
+ ## Related Research
149
+
150
+ - [OpenAI Embeddings Guide](https://platform.openai.com/docs/guides/embeddings)
151
+ - [Text-embedding-3-small Model](https://platform.openai.com/docs/models/text-embedding-3-small)
152
+ - [Zilliz: Guide to text-embedding-3-small](https://zilliz.com/ai-models/text-embedding-3-small)
153
+
154
+ ## Conclusion
155
+
156
+ The embedding text generation implementation is correct and follows best practices. The similarity score issues identified in ALP-203 should be addressed through threshold calibration and query processing improvements, not by modifying how content is embedded.