mdcontext 0.1.0 → 0.2.0

Files changed (251)
  1. package/.changeset/config.json +9 -9
  2. package/.claude/settings.local.json +25 -0
  3. package/.github/workflows/claude-code-review.yml +44 -0
  4. package/.github/workflows/claude.yml +85 -0
  5. package/CONTRIBUTING.md +186 -0
  6. package/NOTES/NOTES +44 -0
  7. package/README.md +206 -3
  8. package/biome.json +1 -1
  9. package/dist/chunk-23UPXDNL.js +3044 -0
  10. package/dist/chunk-2W7MO2DL.js +1366 -0
  11. package/dist/chunk-3NUAZGMA.js +1689 -0
  12. package/dist/chunk-7TOWB2XB.js +366 -0
  13. package/dist/chunk-7XOTOADQ.js +3065 -0
  14. package/dist/chunk-AH2PDM2K.js +3042 -0
  15. package/dist/chunk-BNXWSZ63.js +3742 -0
  16. package/dist/chunk-BTL5DJVU.js +3222 -0
  17. package/dist/chunk-HDHYG7E4.js +104 -0
  18. package/dist/chunk-HLR4KZBP.js +3234 -0
  19. package/dist/chunk-IP3FRFEB.js +1045 -0
  20. package/dist/chunk-KHU56VDO.js +3042 -0
  21. package/dist/chunk-KRYIFLQR.js +85 -89
  22. package/dist/chunk-LBSDNLEM.js +287 -0
  23. package/dist/chunk-MNTQ7HCP.js +2643 -0
  24. package/dist/chunk-MUJELQQ6.js +1387 -0
  25. package/dist/chunk-MXJGMSLV.js +2199 -0
  26. package/dist/chunk-N6QJGC3Z.js +2636 -0
  27. package/dist/chunk-OBELGBPM.js +1713 -0
  28. package/dist/chunk-OT7R5XTA.js +3192 -0
  29. package/dist/chunk-P7X4RA2T.js +106 -0
  30. package/dist/chunk-PIDUQNC2.js +3185 -0
  31. package/dist/chunk-POGCDIH4.js +3187 -0
  32. package/dist/chunk-PSIEOQGZ.js +3043 -0
  33. package/dist/chunk-PVRT3IHA.js +3238 -0
  34. package/dist/chunk-QNN4TT23.js +1430 -0
  35. package/dist/chunk-RE3R45RJ.js +3042 -0
  36. package/dist/chunk-S7E6TFX6.js +718 -657
  37. package/dist/chunk-SG6GLU4U.js +1378 -0
  38. package/dist/chunk-SJCDV2ST.js +274 -0
  39. package/dist/chunk-SYE5XLF3.js +104 -0
  40. package/dist/chunk-T5VLYBZD.js +103 -0
  41. package/dist/chunk-TOQB7VWU.js +3238 -0
  42. package/dist/chunk-VFNMZ4ZQ.js +3228 -0
  43. package/dist/chunk-VVTGZNBT.js +1533 -1423
  44. package/dist/chunk-W7Q4RFEV.js +104 -0
  45. package/dist/chunk-XTYYVRLO.js +3190 -0
  46. package/dist/chunk-Y6MDYVJD.js +3063 -0
  47. package/dist/cli/main.js +4072 -629
  48. package/dist/index.d.ts +420 -33
  49. package/dist/index.js +8 -15
  50. package/dist/mcp/server.js +103 -7
  51. package/dist/schema-BAWSG7KY.js +22 -0
  52. package/dist/schema-E3QUPL26.js +20 -0
  53. package/dist/schema-EHL7WUT6.js +20 -0
  54. package/docs/019-USAGE.md +44 -5
  55. package/docs/020-current-implementation.md +8 -8
  56. package/docs/021-DOGFOODING-FINDINGS.md +1 -1
  57. package/docs/CONFIG.md +1123 -0
  58. package/docs/ERRORS.md +383 -0
  59. package/docs/summarization.md +320 -0
  60. package/justfile +40 -0
  61. package/package.json +39 -33
  62. package/research/INDEX.md +315 -0
  63. package/research/code-review/README.md +90 -0
  64. package/research/code-review/cli-error-handling-review.md +979 -0
  65. package/research/code-review/code-review-validation-report.md +464 -0
  66. package/research/code-review/main-ts-review.md +1128 -0
  67. package/research/config-docs/SUMMARY.md +357 -0
  68. package/research/config-docs/TEST-RESULTS.md +776 -0
  69. package/research/config-docs/TODO.md +542 -0
  70. package/research/config-docs/analysis.md +744 -0
  71. package/research/config-docs/fix-validation.md +502 -0
  72. package/research/config-docs/help-audit.md +264 -0
  73. package/research/config-docs/help-system-analysis.md +890 -0
  74. package/research/frontmatter/COMMENTS-ARE-SKIPPED.md +149 -0
  75. package/research/frontmatter/LLM-CODE-NAVIGATION.md +276 -0
  76. package/research/issue-review.md +603 -0
  77. package/research/llm-summarization/agent-cli-tools-2026.md +1082 -0
  78. package/research/llm-summarization/alternative-providers-2026.md +1428 -0
  79. package/research/llm-summarization/anthropic-2026.md +367 -0
  80. package/research/llm-summarization/claude-cli-integration.md +1706 -0
  81. package/research/llm-summarization/cli-integration-patterns.md +3155 -0
  82. package/research/llm-summarization/openai-2026.md +473 -0
  83. package/research/llm-summarization/openai-compatible-providers-2026.md +1022 -0
  84. package/research/llm-summarization/opencode-cli-integration.md +1552 -0
  85. package/research/llm-summarization/prompt-engineering-2026.md +1426 -0
  86. package/research/llm-summarization/prototype-results.md +56 -0
  87. package/research/llm-summarization/provider-switching-patterns-2026.md +2153 -0
  88. package/research/llm-summarization/typescript-llm-libraries-2026.md +2436 -0
  89. package/research/mdcontext-pudding/00-EXECUTIVE-SUMMARY.md +282 -0
  90. package/research/mdcontext-pudding/01-index-embed.md +956 -0
  91. package/research/mdcontext-pudding/02-search-COMMANDS.md +142 -0
  92. package/research/mdcontext-pudding/02-search-SUMMARY.md +146 -0
  93. package/research/mdcontext-pudding/02-search.md +970 -0
  94. package/research/mdcontext-pudding/03-context.md +779 -0
  95. package/research/mdcontext-pudding/04-navigation-and-analytics.md +803 -0
  96. package/research/mdcontext-pudding/04-tree.md +704 -0
  97. package/research/mdcontext-pudding/05-config.md +1038 -0
  98. package/research/mdcontext-pudding/06-links-summary.txt +87 -0
  99. package/research/mdcontext-pudding/06-links.md +679 -0
  100. package/research/mdcontext-pudding/07-stats.md +693 -0
  101. package/research/mdcontext-pudding/BUG-FIX-PLAN.md +388 -0
  102. package/research/mdcontext-pudding/P0-BUG-VALIDATION.md +167 -0
  103. package/research/mdcontext-pudding/README.md +168 -0
  104. package/research/mdcontext-pudding/TESTING-SUMMARY.md +128 -0
  105. package/research/research-quality-review.md +834 -0
  106. package/research/semantic-search/embedding-text-analysis.md +156 -0
  107. package/research/semantic-search/multi-word-failure-reproduction.md +171 -0
  108. package/research/semantic-search/query-processing-analysis.md +207 -0
  109. package/research/semantic-search/root-cause-and-solution.md +114 -0
  110. package/research/semantic-search/threshold-validation-report.md +69 -0
  111. package/research/semantic-search/vector-search-analysis.md +63 -0
  112. package/research/test-path-issues.md +276 -0
  113. package/review/ALP-76/1-error-type-design.md +962 -0
  114. package/review/ALP-76/2-error-handling-patterns.md +906 -0
  115. package/review/ALP-76/3-error-presentation.md +624 -0
  116. package/review/ALP-76/4-test-coverage.md +625 -0
  117. package/review/ALP-76/5-migration-completeness.md +440 -0
  118. package/review/ALP-76/6-effect-best-practices.md +755 -0
  119. package/scripts/apply-branch-protection.sh +47 -0
  120. package/scripts/branch-protection-templates.json +79 -0
  121. package/scripts/prototype-summarization.ts +346 -0
  122. package/scripts/rebuild-hnswlib.js +32 -37
  123. package/scripts/setup-branch-protection.sh +64 -0
  124. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/active-provider.json +7 -0
  125. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/bm25.json +541 -0
  126. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/bm25.meta.json +5 -0
  127. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/config.json +8 -0
  128. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/embeddings/openai_text-embedding-3-small_512/vectors.bin +0 -0
  129. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/embeddings/openai_text-embedding-3-small_512/vectors.meta.bin +0 -0
  130. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/indexes/documents.json +60 -0
  131. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/indexes/links.json +13 -0
  132. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/indexes/sections.json +1197 -0
  133. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/configuration-management.md +99 -0
  134. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/distributed-systems.md +92 -0
  135. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/error-handling.md +78 -0
  136. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/failure-automation.md +55 -0
  137. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/job-context.md +69 -0
  138. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/process-orchestration.md +99 -0
  139. package/src/cli/argv-preprocessor.test.ts +2 -2
  140. package/src/cli/cli.test.ts +230 -33
  141. package/src/cli/commands/config-cmd.ts +642 -0
  142. package/src/cli/commands/context.ts +97 -9
  143. package/src/cli/commands/duplicates.ts +122 -0
  144. package/src/cli/commands/embeddings.ts +529 -0
  145. package/src/cli/commands/index-cmd.ts +210 -30
  146. package/src/cli/commands/index.ts +3 -0
  147. package/src/cli/commands/search.ts +894 -64
  148. package/src/cli/commands/stats.ts +3 -0
  149. package/src/cli/commands/tree.ts +26 -5
  150. package/src/cli/config-layer.ts +176 -0
  151. package/src/cli/error-handler.test.ts +235 -0
  152. package/src/cli/error-handler.ts +655 -0
  153. package/src/cli/flag-schemas.ts +66 -0
  154. package/src/cli/help.ts +209 -7
  155. package/src/cli/main.ts +348 -58
  156. package/src/cli/options.ts +10 -0
  157. package/src/cli/shared-error-handling.ts +199 -0
  158. package/src/cli/utils.ts +150 -17
  159. package/src/config/file-provider.test.ts +320 -0
  160. package/src/config/file-provider.ts +273 -0
  161. package/src/config/index.ts +72 -0
  162. package/src/config/integration.test.ts +667 -0
  163. package/src/config/precedence.test.ts +277 -0
  164. package/src/config/precedence.ts +451 -0
  165. package/src/config/schema.test.ts +414 -0
  166. package/src/config/schema.ts +603 -0
  167. package/src/config/service.test.ts +320 -0
  168. package/src/config/service.ts +243 -0
  169. package/src/config/testing.test.ts +264 -0
  170. package/src/config/testing.ts +110 -0
  171. package/src/core/types.ts +6 -33
  172. package/src/duplicates/detector.test.ts +183 -0
  173. package/src/duplicates/detector.ts +414 -0
  174. package/src/duplicates/index.ts +18 -0
  175. package/src/embeddings/embedding-namespace.test.ts +300 -0
  176. package/src/embeddings/embedding-namespace.ts +947 -0
  177. package/src/embeddings/heading-boost.test.ts +222 -0
  178. package/src/embeddings/hnsw-build-options.test.ts +198 -0
  179. package/src/embeddings/hyde.test.ts +272 -0
  180. package/src/embeddings/hyde.ts +264 -0
  181. package/src/embeddings/index.ts +2 -0
  182. package/src/embeddings/openai-provider.ts +332 -83
  183. package/src/embeddings/pricing.json +22 -0
  184. package/src/embeddings/provider-constants.ts +204 -0
  185. package/src/embeddings/provider-errors.test.ts +967 -0
  186. package/src/embeddings/provider-errors.ts +565 -0
  187. package/src/embeddings/provider-factory.test.ts +240 -0
  188. package/src/embeddings/provider-factory.ts +225 -0
  189. package/src/embeddings/provider-integration.test.ts +788 -0
  190. package/src/embeddings/query-preprocessing.test.ts +187 -0
  191. package/src/embeddings/semantic-search-threshold.test.ts +508 -0
  192. package/src/embeddings/semantic-search.ts +780 -93
  193. package/src/embeddings/types.ts +293 -16
  194. package/src/embeddings/vector-store.ts +486 -77
  195. package/src/embeddings/voyage-provider.ts +313 -0
  196. package/src/errors/errors.test.ts +845 -0
  197. package/src/errors/index.ts +533 -0
  198. package/src/index/ignore-patterns.test.ts +354 -0
  199. package/src/index/ignore-patterns.ts +305 -0
  200. package/src/index/indexer.ts +286 -48
  201. package/src/index/storage.ts +94 -30
  202. package/src/index/types.ts +40 -2
  203. package/src/index/watcher.ts +67 -9
  204. package/src/index.ts +22 -0
  205. package/src/integration/search-keyword.test.ts +678 -0
  206. package/src/mcp/server.ts +135 -6
  207. package/src/parser/parser.ts +18 -19
  208. package/src/parser/section-filter.test.ts +277 -0
  209. package/src/parser/section-filter.ts +125 -3
  210. package/src/search/__tests__/hybrid-search.test.ts +650 -0
  211. package/src/search/bm25-store.ts +366 -0
  212. package/src/search/cross-encoder.test.ts +253 -0
  213. package/src/search/cross-encoder.ts +406 -0
  214. package/src/search/fuzzy-search.test.ts +419 -0
  215. package/src/search/fuzzy-search.ts +273 -0
  216. package/src/search/hybrid-search.ts +448 -0
  217. package/src/search/path-matcher.test.ts +276 -0
  218. package/src/search/path-matcher.ts +33 -0
  219. package/src/search/searcher.test.ts +99 -1
  220. package/src/search/searcher.ts +189 -67
  221. package/src/search/wink-bm25.d.ts +30 -0
  222. package/src/summarization/cli-providers/claude.ts +202 -0
  223. package/src/summarization/cli-providers/detection.test.ts +273 -0
  224. package/src/summarization/cli-providers/detection.ts +118 -0
  225. package/src/summarization/cli-providers/index.ts +8 -0
  226. package/src/summarization/cost.test.ts +139 -0
  227. package/src/summarization/cost.ts +102 -0
  228. package/src/summarization/error-handler.test.ts +127 -0
  229. package/src/summarization/error-handler.ts +111 -0
  230. package/src/summarization/index.ts +102 -0
  231. package/src/summarization/pipeline.test.ts +498 -0
  232. package/src/summarization/pipeline.ts +231 -0
  233. package/src/summarization/prompts.test.ts +269 -0
  234. package/src/summarization/prompts.ts +133 -0
  235. package/src/summarization/provider-factory.test.ts +396 -0
  236. package/src/summarization/provider-factory.ts +178 -0
  237. package/src/summarization/types.ts +184 -0
  238. package/src/summarize/summarizer.ts +104 -35
  239. package/src/types/huggingface-transformers.d.ts +66 -0
  240. package/tests/fixtures/cli/.mdcontext/active-provider.json +7 -0
  241. package/tests/fixtures/cli/.mdcontext/embeddings/openai_text-embedding-3-small_512/vectors.bin +0 -0
  242. package/tests/fixtures/cli/.mdcontext/embeddings/openai_text-embedding-3-small_512/vectors.meta.bin +0 -0
  243. package/tests/fixtures/cli/.mdcontext/indexes/documents.json +4 -4
  244. package/tests/fixtures/cli/.mdcontext/indexes/sections.json +14 -0
  245. package/tests/integration/embed-index.test.ts +712 -0
  246. package/tests/integration/search-context.test.ts +469 -0
  247. package/tests/integration/search-semantic.test.ts +522 -0
  248. package/vitest.config.ts +1 -6
  249. package/AGENTS.md +0 -46
  250. package/tests/fixtures/cli/.mdcontext/vectors.bin +0 -0
  251. package/tests/fixtures/cli/.mdcontext/vectors.meta.json +0 -1264
@@ -0,0 +1,156 @@
1
+ # Embedding Text Analysis
2
+
3
+ ## Executive Summary
4
+
5
+ The current embedding text generation format is **appropriate and follows best practices**. The format enriches content with contextual metadata (heading, parent section, document title), which helps embedding models understand the semantic context.
6
+
7
+ The similarity score issues identified in ALP-203 are **NOT caused by the embedding text format**, but rather by:
8
+ 1. Inherent properties of embedding models with short queries
9
+ 2. The 0.5 default threshold being too high for certain query types
10
+
11
+ ## Current Implementation
12
+
13
+ ### generateEmbeddingText Function
14
+
15
+ Location: `src/embeddings/semantic-search.ts:46-63`
16
+
17
+ ```typescript
18
+ const generateEmbeddingText = (
19
+   section: SectionEntry,
20
+   content: string,
21
+   documentTitle: string,
22
+   parentHeading?: string | undefined,
23
+ ): string => {
24
+   const parts: string[] = []
25
+
26
+   parts.push(`# ${section.heading}`)
27
+   if (parentHeading) {
28
+     parts.push(`Parent section: ${parentHeading}`)
29
+   }
30
+   parts.push(`Document: ${documentTitle}`)
31
+   parts.push('')
32
+   parts.push(content)
33
+
34
+   return parts.join('\n')
35
+ }
36
+ ```
37
+
38
+ ### Generated Text Format
39
+
40
+ **For a top-level section (e.g., "Overview" in "Failure Automation"):**
41
+ ```
42
+ # Overview
43
+ Document: Failure Automation
44
+
45
+ Failure automation is the practice of automatically detecting,
46
+ reporting, and responding to system failures without human
47
+ intervention. This approach is essential for maintaining high
48
+ availability in modern distributed systems.
49
+ ```
50
+
51
+ **For a nested section (e.g., "Automated Failure Detection"):**
52
+ ```
53
+ # Automated Failure Detection
54
+ Parent section: Core Concepts
55
+ Document: Failure Automation
56
+
57
+ Systems use health checks, heartbeats, and monitoring to detect
58
+ when components fail. Failure detection must be fast and accurate
59
+ to minimize downtime.
60
+ ```
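
The two formats above can be reproduced with a small self-contained sketch. `SectionLike` is a hypothetical stand-in for `SectionEntry` (the real type presumably carries more fields), `buildEmbeddingText` is an illustrative name, and the section bodies are shortened placeholders.

```typescript
// Hypothetical minimal stand-in for SectionEntry; only `heading` is used here.
interface SectionLike {
  heading: string
}

// Same line layout as generateEmbeddingText: heading, optional parent,
// document title, blank line, then the section content.
const buildEmbeddingText = (
  section: SectionLike,
  content: string,
  documentTitle: string,
  parentHeading?: string,
): string => {
  const parts: string[] = [`# ${section.heading}`]
  if (parentHeading) {
    parts.push(`Parent section: ${parentHeading}`)
  }
  parts.push(`Document: ${documentTitle}`, '', content)
  return parts.join('\n')
}

// Top-level section: no "Parent section" line is emitted.
const topLevelText = buildEmbeddingText(
  { heading: 'Overview' },
  'Failure automation is the practice of ...',
  'Failure Automation',
)

// Nested section: the parent heading appears on the second line.
const nestedText = buildEmbeddingText(
  { heading: 'Automated Failure Detection' },
  'Systems use health checks ...',
  'Failure Automation',
  'Core Concepts',
)
```

The only structural difference between the two outputs is the optional `Parent section:` line.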
61
+
62
+ ## Analysis
63
+
64
+ ### What the Current Format Does Well
65
+
66
+ 1. **Heading as Title**: Using `# {heading}` format is standard markdown that embedding models are trained on. The heading provides semantic context about the section topic.
67
+
68
+ 2. **Hierarchical Context**: Including `Parent section: {parentHeading}` helps the model understand the section's place in the document structure. This is especially valuable for nested sections with generic headings like "Overview" or "Best Practices".
69
+
70
+ 3. **Document Context**: Including `Document: {documentTitle}` helps disambiguate content that might otherwise be too generic.
71
+
72
+ 4. **Content Preservation**: The full section content is included, providing rich semantic signal.
73
+
74
+ ### Potential Concerns Investigated
75
+
76
+ | Concern | Finding |
77
+ |---------|---------|
78
+ | Does `# {heading}` confuse the model? | No - embedding models are trained on markdown and understand heading syntax |
79
+ | Is metadata adding noise? | No - metadata provides helpful context, especially for short sections |
80
+ | Is content truncated? | No - full section content is included |
81
+ | Are important keywords lost? | No - nothing is removed from original content |
82
+
83
+ ### Comparison with Best Practices
84
+
85
+ **OpenAI Recommendations:**
86
+ - `text-embedding-3-small` embeds queries and documents with the same model
87
+ - No special prefixes or asymmetric handling needed
88
+ - Cosine similarity is recommended for comparison
89
+ - The model captures semantic meaning, not just keyword overlap
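
As a reference point for the scores quoted in this analysis, cosine similarity can be sketched as follows. This is a plain illustration, not the vector store's actual code path; the name `cosineSimilarity` is an assumption. For unit-length vectors the result equals the dot product.

```typescript
// Cosine similarity between two equal-length vectors.
// Returns 0 when either vector has zero magnitude.
const cosineSimilarity = (a: readonly number[], b: readonly number[]): number => {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same dimension')
  }
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0
    const y = b[i] ?? 0
    dot += x * y
    normA += x * x
    normB += y * y
  }
  if (normA === 0 || normB === 0) return 0
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}
```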
90
+
91
+ **Industry Patterns:**
92
+ - Many RAG systems include metadata like titles and hierarchical context
93
+ - Including document/section titles is a common best practice
94
+ - Enriching content with context improves retrieval quality
95
+
96
+ ## Token Count Analysis
97
+
98
+ Sample embedded texts from test corpus:
99
+
100
+ | Section | Content Tokens | Metadata Overhead | Total | Overhead % |
101
+ |---------|----------------|-------------------|-------|------------|
102
+ | Overview | ~50 | ~15 | ~65 | 23% |
103
+ | Automated Failure Detection | ~40 | ~20 | ~60 | 33% |
104
+ | Best Practices | ~100 | ~15 | ~115 | 13% |
105
+
106
+ The metadata overhead is reasonable (13-33%) and provides valuable semantic context.
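
The overhead column is metadata tokens divided by total tokens. A quick check of the arithmetic, using the approximate counts from the table (`overheadPercent` is an illustrative helper name):

```typescript
// Share of the embedded text that is metadata, as a rounded percentage.
const overheadPercent = (contentTokens: number, metadataTokens: number): number =>
  Math.round((metadataTokens / (contentTokens + metadataTokens)) * 100)

// Approximate token counts taken from the table above.
const tokenSamples = [
  { section: 'Overview', content: 50, metadata: 15 },
  { section: 'Automated Failure Detection', content: 40, metadata: 20 },
  { section: 'Best Practices', content: 100, metadata: 15 },
]

const overheads = tokenSamples.map((s) => overheadPercent(s.content, s.metadata))
// overheads is [23, 33, 13]
```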
107
+
108
+ ## Root Cause of Similarity Score Issues
109
+
110
+ The similarity score issues are **NOT caused by embedding text generation**. Based on ALP-203 findings:
111
+
112
+ ### Why Short Queries Have Low Scores
113
+
114
+ 1. **Vector Space Properties**: A single word like "failure" produces an embedding that represents the general concept. Content sections contain many concepts, making the cosine similarity lower.
115
+
116
+ 2. **Context Asymmetry**: A query "failure" is matched against embeddings like:
117
+ ```
118
+ # Failure Isolation
119
+ Parent section: Core Concepts
120
+ Document: Failure Automation
121
+
122
+ Automated systems can isolate failures to prevent cascading effects...
123
+ ```
124
+ The query is a subset of the embedded content, not a full match.
125
+
126
+ 3. **Embedding Model Behavior**: `text-embedding-3-small` produces normalized (unit-length) vectors, so similarity depends only on vector direction. Short inputs produce less distinctive vectors because they carry less semantic context.
127
+
128
+ ### Why Multi-Word Domain Queries Work Better
129
+
130
+ Queries like "failure automation" provide:
131
+ - Multiple semantic signals
132
+ - Domain-specific terminology
133
+ - Closer match to document/heading names
134
+ - More distinctive embedding vectors
135
+
136
+ ## Recommendations
137
+
138
+ ### No Changes Needed to Embedding Text Format
139
+
140
+ The current format is sound. The issues are in threshold calibration and query handling.
141
+
142
+ ### Potential Improvements (for ALP-207)
143
+
144
+ 1. **Query Enhancement**: Consider expanding short queries with context before embedding
145
+ 2. **Threshold Tuning**: Use adaptive thresholds based on query length
146
+ 3. **Hybrid Search Default**: Leverage keyword search to boost short query results
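
A minimal sketch of the adaptive-threshold idea, assuming word count is the only signal. The cut-offs (0.3 for one word, 0.4 for two) are illustrative assumptions loosely based on the observed score ranges, not values taken from the codebase, and `adaptiveThreshold` is a hypothetical name.

```typescript
const DEFAULT_THRESHOLD = 0.5

// Lower the similarity threshold for short queries.
// The 0.3 / 0.4 cut-offs are assumptions for illustration only.
const adaptiveThreshold = (query: string, base: number = DEFAULT_THRESHOLD): number => {
  const wordCount = query
    .trim()
    .split(/\s+/)
    .filter((word) => word.length > 0).length
  if (wordCount <= 1) return Math.min(base, 0.3)
  if (wordCount === 2) return Math.min(base, 0.4)
  return base
}
```

Using `Math.min` means the function never raises a threshold the caller has already set lower.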
147
+
148
+ ## Related Research
149
+
150
+ - [OpenAI Embeddings Guide](https://platform.openai.com/docs/guides/embeddings)
151
+ - [Text-embedding-3-small Model](https://platform.openai.com/docs/models/text-embedding-3-small)
152
+ - [Zilliz: Guide to text-embedding-3-small](https://zilliz.com/ai-models/text-embedding-3-small)
153
+
154
+ ## Conclusion
155
+
156
+ The embedding text generation implementation is correct and follows best practices. The similarity score issues identified in ALP-203 should be addressed through threshold calibration and query processing improvements, not by modifying how content is embedded.
@@ -0,0 +1,171 @@
1
+ # Multi-Word Semantic Search Failure Reproduction
2
+
3
+ ## Executive Summary
4
+
5
+ Systematic testing shows that the reported "multi-word semantic search failure" is **NOT a failure of semantic search itself**, but rather a **threshold calibration issue**. The root causes are:
6
+
7
+ 1. **Single-word queries have low similarity scores** (30-49%) while multi-word queries have higher scores (50-70%)
8
+ 2. **Default threshold of 0.5** filters out both single-word AND semantically-distant multi-word queries
9
+ 3. **Queries with abstract/non-domain-specific terms** (e.g., "gaps missing omissions", "issue challenge gap") score below threshold
10
+ 4. **Domain-specific multi-word queries work well** (e.g., "failure automation" = 61%, "process orchestration" = 68%)
11
+
12
+ ## Test Methodology
13
+
14
+ ### Test Corpus
15
+
16
+ Created a controlled test corpus in `src/__tests__/fixtures/semantic-search/multi-word-corpus/` with 6 markdown files covering:
17
+ - failure-automation.md - Failure detection and automated recovery
18
+ - job-context.md - Job execution context and metadata
19
+ - error-handling.md - Error handling patterns
20
+ - configuration-management.md - Config management practices
21
+ - distributed-systems.md - Distributed systems architecture
22
+ - process-orchestration.md - Workflow orchestration patterns
23
+
24
+ **Corpus Statistics:**
25
+ - 6 documents
26
+ - 67 sections
27
+ - 52 embedded vectors
28
+ - ~4,725 tokens
29
+
30
+ ### Index Command
31
+
32
+ ```bash
33
+ node dist/cli/main.js index src/__tests__/fixtures/semantic-search/multi-word-corpus --embed --force
34
+ ```
35
+
36
+ ## Test Results
37
+
38
+ ### Multi-Word Domain-Specific Queries (DEFAULT THRESHOLD 0.5)
39
+
40
+ | Query | Results | Top Match | Top Score |
41
+ |-------|---------|-----------|-----------|
42
+ | "failure automation" | 7 | failure-automation.md: Best Practices | 61.6% |
43
+ | "job context" | 4 | job-context.md: What is Job Context? | 60.4% |
44
+ | "error handling" | 7 | error-handling.md: Introduction | 63.7% |
45
+ | "configuration management" | 8 | configuration-management.md: Overview | 69.5% |
46
+ | "distributed systems" | 4 | distributed-systems.md: What Are... | 61.0% |
47
+ | "process orchestration" | 8 | process-orchestration.md: Introduction | 68.0% |
48
+
49
+ **Finding:** Multi-word queries with domain-specific terms **WORK WELL** with default threshold.
50
+
51
+ ### Single-Word Queries (DEFAULT THRESHOLD 0.5)
52
+
53
+ | Query | Results | Notes |
54
+ |-------|---------|-------|
55
+ | "failure" | 0 | Below 0.5 threshold |
56
+ | "automation" | 0 | Below 0.5 threshold |
57
+ | "context" | 0 | Below 0.5 threshold |
58
+ | "error" | 0 | Below 0.5 threshold |
59
+
60
+ ### Single-Word Queries (THRESHOLD 0.3)
61
+
62
+ | Query | Results | Top Match | Top Score |
63
+ |-------|---------|-----------|-----------|
64
+ | "failure" | 10 | failure-automation.md: Failure Isolation | 39.1% |
65
+ | "automation" | 10 | (similar) | ~35% |
66
+ | "error" | 10 | error-handling.md: Programming Errors | 49.1% |
67
+
68
+ **Finding:** Single-word queries have inherently **LOW similarity scores** (30-49%) due to:
69
+ 1. Short query embeddings lack semantic context
70
+ 2. Embedding model produces less distinctive vectors for single words
71
+ 3. Cosine similarity between the embedding of a short query and that of a much longer passage is compressed (the vectors themselves have the same dimension)
72
+
73
+ ### Abstract/Generic Multi-Word Queries (DEFAULT THRESHOLD 0.5)
74
+
75
+ | Query | Results | Notes |
76
+ |-------|---------|-------|
77
+ | "issue challenge gap" | 0 | Abstract terms, no domain match |
78
+ | "gaps missing omissions" | 0 | Meta-language about content, not content itself |
79
+
80
+ ### Abstract Queries (THRESHOLD 0.3)
81
+
82
+ | Query | Results | Top Match | Top Score |
83
+ |-------|---------|-----------|-----------|
84
+ | "issue challenge gap" | 10 | distributed-systems.md: Consistency vs Availability | 40.8% |
85
+ | "gaps missing omissions" | 3 | error-handling.md: Programming Errors | 35.0% |
86
+
87
+ **Finding:** Abstract/meta-language queries score **35-41%** - below the default threshold but findable with a lower threshold.
88
+
89
+ ### Hybrid Search Results
90
+
91
+ | Query | Hybrid Results | Primary Source |
92
+ |-------|---------------|----------------|
93
+ | "failure automation" | 7 | Semantic (RRF ~1.6) |
94
+ | "job context" | 4 | Semantic (RRF ~1.6) |
95
+
96
+ **Finding:** Hybrid search successfully combines semantic and keyword results, but the semantic component still uses the threshold filter.
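
For context, textbook reciprocal rank fusion (RRF) can be sketched as below. This is the standard formula with the commonly used constant `k = 60`, not necessarily what mdcontext implements. The raw scores it produces are far smaller than the ~1.6 shown in the table, so the tool presumably scales or weights them differently. The section ids are made up for the example.

```typescript
// Standard RRF: each ranked list contributes 1 / (k + rank) per item,
// where rank starts at 1. k = 60 is a conventional default (assumption).
const reciprocalRankFusion = (
  rankings: readonly (readonly string[])[],
  k = 60,
): Map<string, number> => {
  const scores = new Map<string, number>()
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank))
    })
  }
  return scores
}

// Ids ordered by fused score, highest first.
const fuseRankings = (rankings: readonly (readonly string[])[], k = 60): string[] =>
  [...reciprocalRankFusion(rankings, k).entries()]
    .sort((left, right) => right[1] - left[1])
    .map(([id]) => id)

// Hypothetical ids: 'b' appears in both lists, so it ranks first.
const semanticRanking = ['a', 'b', 'c']
const keywordRanking = ['b', 'd']
const fusedOrder = fuseRankings([semanticRanking, keywordRanking])
```

An item that the semantic threshold filters out never enters `semanticRanking`, so it can only surface through the keyword list.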
97
+
98
+ ## Pattern Analysis
99
+
100
+ ### What Works (>50% similarity)
101
+ - Multi-word queries with **domain-specific terms** directly present in content
102
+ - Queries that form **coherent concepts** (e.g., "process orchestration")
103
+ - Queries that match **document titles or major headings**
104
+
105
+ ### What Fails at Default Threshold
106
+ - **Single words** - all score 30-49%
107
+ - **Abstract meta-language** - "gaps", "issues", "challenges" without domain context
108
+ - **Non-domain queries** searching indexed domain content
109
+ - **Very short queries** (1-2 generic words)
110
+
111
+ ### Similarity Score Distribution
112
+
113
+ ```
114
+ 70%+ : Document title/heading exact concept matches
115
+ 60-70%: Multi-word domain queries matching content topics
116
+ 50-60%: Multi-word queries with partial concept overlap
117
+ 40-50%: Single words or abstract queries with some relevance
118
+ 30-40%: Tangentially related content
119
+ <30% : Unrelated content (correctly filtered)
120
+ ```
121
+
122
+ ## Dogfooding Context
123
+
124
+ The dogfooding agents reported semantic search as "unreliable for multi-word conceptual queries". Re-analysis shows:
125
+
126
+ 1. **No embeddings were built** during dogfooding (only keyword index existed)
127
+ 2. Semantic search was **unavailable** - falling back to keyword search
128
+ 3. Multi-word **keyword** searches like "failure automation" worked
129
+ 4. Multi-word keyword searches as **quoted phrases** returned 0 (expecting exact text)
130
+ 5. Abstract queries like "gaps missing omissions" correctly returned 0 (phrase not in content)
131
+
132
+ The actual issue was:
133
+ - **Semantic search unavailable** (no embeddings)
134
+ - **Keyword phrase search** misunderstood (quoted = exact match)
135
+ - **Abstract conceptual queries** don't match concrete content via keyword
136
+
137
+ ## Recommendations
138
+
139
+ ### For ALP-204 (Embedding Text Analysis)
140
+ - Analyze how `generateEmbeddingText()` combines section context
141
+ - Check if heading + parent + content provides enough semantic signal for short queries
142
+
143
+ ### For ALP-205 (Query Processing)
144
+ - Query text is passed directly to embedding - no preprocessing
145
+ - Consider query expansion for short queries
146
+
147
+ ### For ALP-206 (Vector Search Parameters)
148
+ - Default threshold of 0.5 is **too high** for single-word queries
149
+ - Consider adaptive thresholds based on query length
150
+ - Consider returning top-K results regardless of threshold, then filtering
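
The "top-K first, then filter" option could look roughly like this. It is a sketch under assumed types (`ScoredHit` and `TopKResult` are hypothetical), not the vector store's real interface.

```typescript
// Hypothetical result shape for illustration.
interface ScoredHit {
  id: string
  similarity: number
}

interface TopKResult {
  hits: ScoredHit[] // at or above the threshold
  belowThreshold: ScoredHit[] // in the top K but under the threshold
}

// Take the K best candidates first, then split them by threshold so the
// caller can still see (or report) near misses.
const topKThenFilter = (
  candidates: readonly ScoredHit[],
  k: number,
  threshold: number,
): TopKResult => {
  const topK = [...candidates]
    .sort((left, right) => right.similarity - left.similarity)
    .slice(0, k)
  return {
    hits: topK.filter((hit) => hit.similarity >= threshold),
    belowThreshold: topK.filter((hit) => hit.similarity < threshold),
  }
}
```

Keeping the below-threshold hits separate leaves the decision (hide, show with a warning, or fall back) to the CLI layer.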
151
+
152
+ ### For ALP-207 (Solution Design)
153
+ Key solutions to consider:
154
+ 1. **Adaptive threshold** - lower for short queries
155
+ 2. **Query expansion** - augment short queries with context
156
+ 3. **Better user feedback** - show "X results below threshold" message
157
+ 4. **Threshold documentation** - educate users on --threshold flag
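
The "X results below threshold" feedback might be produced by a helper like the one below. The function name and message wording are hypothetical; only the `--threshold` flag comes from the notes above.

```typescript
// Build a hint when nothing passed the threshold but some candidates came close.
// Returns undefined when there are results, or when there is nothing to report.
const belowThresholdHint = (
  aboveCount: number,
  belowScores: readonly number[],
  threshold: number,
): string | undefined => {
  if (aboveCount > 0 || belowScores.length === 0) return undefined
  const best = Math.max(...belowScores)
  const noun = belowScores.length === 1 ? 'result' : 'results'
  return (
    `No results at threshold ${threshold}. ` +
    `${belowScores.length} ${noun} below threshold (best match: ${(best * 100).toFixed(1)}%). ` +
    `Try lowering --threshold.`
  )
}
```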
158
+
159
+ ## Conclusion
160
+
161
+ Multi-word semantic search **is working correctly** for domain-specific queries. The perceived "failure" is a combination of:
162
+ 1. No embeddings in dogfooding environment
163
+ 2. Threshold too high for short/abstract queries
164
+ 3. Confusion between keyword phrase search and semantic search
165
+ 4. Users expecting semantic search to understand meta-language about content
166
+
167
+ The fix is NOT to change the semantic search algorithm, but to:
168
+ 1. Calibrate default threshold appropriately
169
+ 2. Add query-length-aware threshold adjustment
170
+ 3. Improve error messages when no results are found
171
+ 4. Consider hybrid search as default mode
@@ -0,0 +1,207 @@
1
+ # Query Processing Analysis
2
+
3
+ ## Executive Summary
4
+
5
+ Query processing is **minimal and appropriate**. The query text is passed directly to the embedding API without modification. This is correct behavior for OpenAI's text-embedding-3-small model, which handles text normalization internally.
6
+
7
+ The asymmetry between query format (plain text) and document format (text with metadata) does NOT cause issues - embedding models are designed for this asymmetric retrieval pattern.
8
+
9
+ ## Query Flow
10
+
11
+ ```
12
+ User Input
13
+ │
14
+ ▼
15
+ CLI Parser (search.ts)
16
+ │ query string unchanged
17
+ ▼
18
+ semanticSearch(rootPath, query, options)
19
+ │ query string unchanged
20
+ ▼
21
+ provider.embed([query])
22
+ │ passed directly to API
23
+ ▼
24
+ OpenAI Embeddings API
25
+ │ returns 512-dimensional vector
26
+ ▼
27
+ Vector Store search()
28
+ │ cosine similarity comparison
29
+ ▼
30
+ Results filtered by threshold
31
+ ```
32
+
33
+ ## Code Trace
34
+
35
+ ### Entry Point: CLI
36
+
37
+ ```typescript
38
+ // src/cli/commands/search.ts:53-55
39
+ query: Args.text({ name: 'query' }).pipe(
40
+   Args.withDescription('Search query (natural language or regex pattern)'),
41
+ ),
42
+ ```
43
+
44
+ The query enters as a raw text string with no preprocessing.
45
+
46
+ ### Search Mode Detection
47
+
48
+ ```typescript
49
+ // src/cli/commands/search.ts:201-206
50
+ } else if (isAdvancedQuery(query)) {
51
+   effectiveMode = 'keyword'
52
+   modeReason = 'boolean/phrase pattern detected'
53
+ } else if (isRegexPattern(query)) {
54
+   effectiveMode = 'keyword'
55
+   modeReason = 'regex pattern detected'
56
+ }
57
+ ```
58
+
59
+ Queries with boolean operators (AND, OR, NOT) or quoted phrases are routed to keyword search. Plain multi-word queries go to semantic search.
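
The routing rule can be illustrated with a self-contained sketch. The heuristics below are assumptions: the real `isAdvancedQuery` and `isRegexPattern` implementations are not shown in this document and may differ, and the helper names and the `'plain text query'` reason string are hypothetical.

```typescript
type SearchMode = 'semantic' | 'keyword'

// Assumed heuristic: uppercase boolean operators or a double-quoted phrase.
const hasBooleanOrPhrase = (query: string): boolean =>
  /\b(AND|OR|NOT)\b/.test(query) || /"[^"]+"/.test(query)

// Assumed heuristic: any common regex metacharacter
// ('?' and '.' are left out so ordinary questions stay semantic).
const looksLikeRegex = (query: string): boolean => /[\\^$*+()[\]{}|]/.test(query)

const detectMode = (query: string): { mode: SearchMode; reason: string } => {
  if (hasBooleanOrPhrase(query)) {
    return { mode: 'keyword', reason: 'boolean/phrase pattern detected' }
  }
  if (looksLikeRegex(query)) {
    return { mode: 'keyword', reason: 'regex pattern detected' }
  }
  return { mode: 'semantic', reason: 'plain text query' }
}
```

Under these heuristics, `failure automation` is routed to semantic search, while `"failure automation"` in quotes is treated as an exact phrase and goes to keyword search.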
60
+
61
+ ### Semantic Search Function
62
+
63
+ ```typescript
64
+ // src/embeddings/semantic-search.ts:558-559
65
+ // Embed the query
66
+ const queryResult = yield* wrapEmbedding(provider.embed([query]))
67
+ ```
68
+
69
+ **No preprocessing** - query is embedded exactly as received.
70
+
71
+ ### Embedding API Call
72
+
73
+ ```typescript
74
+ // src/embeddings/openai-provider.ts:175-179
75
+ const response = await this.client.embeddings.create({
76
+ model: this.model,
77
+ input: batch, // query text passed directly
78
+ dimensions: 512,
79
+ })
80
+ ```
81
+
82
+ Query text goes directly to the OpenAI API without modification.
83
+
84
+ ## Query vs Document Format Asymmetry
85
+
86
+ ### Document Embedding Format (from ALP-204)
87
+
88
+ ```
89
+ # {heading}
90
+ Parent section: {parentHeading}
91
+ Document: {documentTitle}
92
+
93
+ {content}
94
+ ```
95
+
96
+ ### Query Format
97
+
98
+ ```
99
+ {raw query text}
100
+ ```
101
+
102
+ ### Analysis
103
+
104
+ This asymmetry is **intentional and correct** for semantic search:
105
+
106
+ 1. **Embedding models handle asymmetry**: OpenAI's text-embedding models are trained on diverse text formats. They produce semantically meaningful vectors regardless of format.
107
+
108
+ 2. **Query expansion is not needed**: The embedding model understands "failure automation" conceptually - it doesn't need to see `# Failure Automation` format.
109
+
110
+ 3. **Document context helps disambiguation**: The heading/document metadata in indexed content helps distinguish between sections with similar content but different contexts.
111
+
112
+ 4. **Industry standard practice**: Most RAG systems use plain queries against enriched documents.
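To make the asymmetry concrete, the two formats can be written as a pair of small builders. This is a sketch under assumptions: `SectionInput`, `buildEmbeddingText`, and `buildQueryText` are hypothetical names, and omitting the parent line for top-level sections is a guess rather than confirmed behavior.

```typescript
// Hypothetical types and helpers mirroring the two templates above.
interface SectionInput {
  heading: string
  parentHeading?: string
  documentTitle: string
  content: string
}

// Document side: heading and metadata lines, a blank line, then the content.
function buildEmbeddingText(s: SectionInput): string {
  const lines = [`# ${s.heading}`]
  // Assumption: the parent line is skipped when a section has no parent.
  if (s.parentHeading) lines.push(`Parent section: ${s.parentHeading}`)
  lines.push(`Document: ${s.documentTitle}`)
  return `${lines.join('\n')}\n\n${s.content}`
}

// Query side: no template at all, the raw text is embedded as typed.
function buildQueryText(query: string): string {
  return query
}
```

Keeping the query builder as an identity function makes the "what you type is what gets embedded" property explicit and gives a single place to experiment with query-side templates later.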
113
+
114
+ ## Query Variation Tests
115
+
116
+ All variations score above the 0.5 threshold (61.6-70.3%), though the top result alternates between two sections:
117
+
118
+ | Query | Top Result | Similarity |
119
+ |-------|------------|------------|
120
+ | "failure automation" | Best Practices | 61.6% |
121
+ | "failure-automation" | Overview | 68.8% |
122
+ | "Failure Automation" | Best Practices | 65.6% |
123
+ | "automation for failures" | Overview | 70.3% |
124
+ | "how to automate failure handling" | Best Practices | 66.4% |
125
+
126
+ **Findings:**
127
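+ - Casing doesn't change the top result ("failure automation" and "Failure Automation" both return Best Practices)
128
+ - Hyphenation changes the top result (Overview instead of Best Practices)
129
+ - Rewording changes the top result and score but doesn't break search
130
+ - Natural language queries work well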
131
+
132
+ ## Threshold Analysis
133
+
134
+ ### Default Threshold Flow
135
+
136
+ ```
137
+ CLI default: 0.45
138
+
139
+ ▼ (if CLI uses default)
140
+ Config default: 0.5
141
+
142
+
143
+ Effective threshold: 0.5
144
+ ```
145
+
146
+ When the user doesn't specify `--threshold`, the CLI's 0.45 default is not applied; the effective value is the config default of 0.5.
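The resolution order described here can be sketched as a single fallback. This is only an illustration: `resolveThreshold` and `CONFIG_DEFAULT_THRESHOLD` are hypothetical names, and the sketch assumes the CLI layer reports an unspecified flag as `undefined`.

```typescript
// Hypothetical helper: name and signature are assumptions for illustration.
const CONFIG_DEFAULT_THRESHOLD = 0.5 // config default described in this section

function resolveThreshold(
  cliThreshold: number | undefined,
  configDefault: number = CONFIG_DEFAULT_THRESHOLD,
): number {
  // An explicit --threshold wins; otherwise fall back to the config default.
  // `??` (not `||`) keeps an explicit threshold of 0 intact.
  return cliThreshold ?? configDefault
}
```

For example, `resolveThreshold(undefined)` yields the config default, while `resolveThreshold(0.3)` yields the value the user passed.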
147
+
148
+ ### Threshold Impact
149
+
150
+ | Threshold | Single-word "failure" | Multi-word "failure automation" |
151
+ |-----------|----------------------|--------------------------------|
152
+ | 0.5 | 0 results | 7 results |
153
+ | 0.3 | 10 results | 7+ results |
154
+ | 0.1 | 10 results | 7+ results |
155
+
156
+ The 0.5 threshold filters out low-similarity single-word matches while allowing relevant multi-word matches through.
157
+
158
+ ## Potential Query Enhancements (for ALP-207)
159
+
160
+ While current processing is correct, potential improvements could include:
161
+
162
+ ### 1. Query Expansion for Short Queries
163
+
164
+ ```typescript
165
+ // Hypothetical enhancement
166
+ const enhancedQuery = query.trim().split(/\s+/).length <= 2
167
+ ? `Find content about: ${query}`
168
+ : query
169
+ ```
170
+
171
+ ### 2. Adaptive Threshold
172
+
173
+ ```typescript
174
+ // Lower threshold for single-word queries, unless the user passed --threshold
175
+ const wordCount = query.trim().split(/\s+/).length
176
+ const adaptiveThreshold =
177
+ options.threshold ?? (wordCount <= 1 ? 0.3 : 0.5)
178
+ ```
179
+
180
+ ### 3. Hybrid by Default
181
+
182
+ Short queries might benefit from hybrid mode being the default, leveraging both keyword and semantic signals.
183
+
184
+ ## Recommendations
185
+
186
+ ### No Changes Needed to Query Processing
187
+
188
+ The current implementation is correct. The query flow is:
189
+ - Clean (no unnecessary transformations)
190
+ - Transparent (what you type is what gets embedded)
191
+ - Flexible (users can adjust with --threshold)
192
+
193
+ ### Focus Areas for ALP-207
194
+
195
+ 1. **Threshold tuning** - Consider lowering default to 0.4 or making it adaptive
196
+ 2. **Better feedback** - Show "X results below threshold" when 0 results
197
+ 3. **Documentation** - Explain threshold behavior in help text
198
+ 4. **Hybrid default** - Consider hybrid mode as default for better coverage
199
+
200
+ ## Conclusion
201
+
202
+ Query processing is implemented correctly. The perceived "multi-word query failures" are actually threshold calibration issues, not query processing bugs. The search correctly:
203
+
204
+ 1. Passes queries unchanged to embedding API (correct)
205
+ 2. Uses asymmetric retrieval (query vs enriched documents) (correct)
206
+ 3. Handles query variations semantically (working)
207
+ 4. Applies configurable threshold (working, but may need tuning)
@@ -0,0 +1,114 @@
1
+ # Root Cause Analysis and Solution Design
2
+
3
+ ## Executive Summary
4
+
5
+ **Root Cause**: The "multi-word semantic search failure" is a **threshold calibration issue**, not a search algorithm bug.
6
+
7
+ **Key Findings**:
8
+ 1. Multi-word domain queries WORK correctly (60-70% similarity)
9
+ 2. Single-word queries score lower (31-49% in the ALP-203 runs) due to embedding model properties
10
+ 3. Default 0.5 threshold filters out short/abstract queries
11
+ 4. The dogfooding had no embeddings built - agents fell back to keyword search
12
+ 5. Embedding text format, query processing, and HNSW config are all correct
13
+
14
+ **Solution**: Lower default threshold + improve user feedback for edge cases.
15
+
16
+ ## Synthesis of Diagnostic Findings
17
+
18
+ ### ALP-203: Reproduction Results
19
+
20
+ | Query Type | Works at 0.5? | Score Range |
21
+ |------------|---------------|-------------|
22
+ | "failure automation" | YES | 54-62% |
23
+ | "error handling" | YES | 53-64% |
24
+ | "failure" (single) | NO | 31-39% |
25
+ | "error" (single) | NO | 32-49% |
26
+ | "gaps missing omissions" | NO | 30-35% |
27
+
28
+ **Conclusion**: Multi-word domain queries work. Short/abstract queries fail threshold.
29
+
30
+ ### ALP-204: Embedding Text Analysis
31
+
32
+ - Format is correct: `# heading\nParent section: X\nDocument: Y\n\ncontent`
33
+ - Follows industry best practices
34
+ - No issues identified
35
+
36
+ ### ALP-205: Query Processing Analysis
37
+
38
+ - Query passed unchanged to embedding API (correct)
39
+ - Asymmetric retrieval (plain query vs enriched docs) is normal
40
+ - Query variations all work correctly
41
+
42
+ ### ALP-206: Vector Search Analysis
43
+
44
+ - HNSW parameters (M=16, efConstruction=200, efSearch=100) are optimal
45
+ - Cosine distance correct for text embeddings
46
+ - Threshold filtering is the only issue
47
+
48
+ ## Root Cause
49
+
50
+ **Primary Cause**: The default similarity threshold (0.5) is too high for:
51
+ 1. Single-word queries (max ~49% similarity due to embedding model properties)
52
+ 2. Abstract/meta-language queries
53
+ 3. Non-domain-specific queries
54
+
55
+ **NOT the cause**:
56
+ - Embedding text format (correct)
57
+ - Query processing (correct)
58
+ - HNSW parameters (optimal)
59
+ - Embedding model (working as expected)
60
+
61
+ **Contributing Factor**: Dogfooding lacked embeddings, causing confusion about what was failing.
62
+
63
+ ## Solution Design
64
+
65
+ ### Recommended Approach: Threshold Tuning + UX Improvements
66
+
67
+ #### 1. Lower Default Threshold to 0.35
68
+
69
+ ```typescript
70
+ // src/config/schema.ts
71
+ minSimilarity: Config.number('minSimilarity').pipe(Config.withDefault(0.35))
72
+ ```
73
+
74
+ **Rationale**:
75
+ - Captures single-word results scoring 35% or higher (observed single-word range: 31-49%)
76
+ - Still filters lower-similarity content (<35%)
77
+ - Low risk - users can adjust with --threshold
78
+
79
+ #### 2. Add "Below Threshold" Feedback
80
+
81
+ When a search returns 0 results, show a hint about lower-scored results:
82
+
83
+ ```
84
+ Results: 0
85
+
86
+ Note: 10 results found below 0.35 threshold (highest: 0.34)
87
+ Tip: Use --threshold 0.3 to see more results
88
+ ```
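One way the hint could be produced is sketched below. `belowThresholdHint` is a hypothetical helper, its wording is modeled on the sample output above, and rounding the suggested threshold down to a 0.05 step is an assumption rather than a documented rule.

```typescript
// Hypothetical helper: name, signature, and rounding rule are assumptions.
function belowThresholdHint(
  similarities: number[],
  threshold: number,
): string | null {
  const above = similarities.filter((s) => s >= threshold)
  const below = similarities.filter((s) => s < threshold)
  // Only hint when nothing passed the threshold but something was found.
  if (above.length > 0 || below.length === 0) return null
  const highest = Math.max(...below)
  // Suggest a threshold at or under the best score, floored to a 0.05 step.
  const suggested = Math.floor(highest * 20) / 20
  return [
    `Note: ${below.length} results found below ${threshold} threshold (highest: ${highest.toFixed(2)})`,
    `Tip: Use --threshold ${suggested} to see more results`,
  ].join('\n')
}
```

Returning `null` when any result passes keeps the hint out of normal output, so it appears only in the 0-result case this section targets.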
89
+
90
+ #### 3. Consider Hybrid Search as Default
91
+
92
+ For queries without boolean operators, hybrid mode provides better coverage by combining semantic and keyword signals.
93
+
94
+ ## Implementation Plan for Phase 2
95
+
96
+ 1. **Lower default threshold** - Change config default from 0.5 to 0.35
97
+ 2. **Add below-threshold feedback** - Show hint when 0 results
98
+ 3. **Document threshold behavior** - Update README/help
99
+ 4. **Validate changes** - Re-run test corpus
100
+
101
+ ## Expected Outcomes
102
+
103
+ | Metric | Before | After |
104
+ |--------|--------|-------|
105
+ | Single-word results at default | 0 | 10+ |
106
+ | Multi-word results | 7+ | 7+ (unchanged) |
107
+
108
+ ## Conclusion
109
+
110
+ The "multi-word semantic search failure" was misidentified. Multi-word queries work correctly. The issue is threshold calibration affecting single-word and abstract queries.
111
+
112
+ **Recommended Solution**: Lower threshold to 0.35, add user feedback, improve documentation.
113
+
114
+ **No algorithmic changes needed** to embedding generation, query processing, or vector search.