mdcontext 0.1.0 → 0.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (251)
  1. package/.changeset/config.json +9 -9
  2. package/.claude/settings.local.json +25 -0
  3. package/.github/workflows/claude-code-review.yml +44 -0
  4. package/.github/workflows/claude.yml +85 -0
  5. package/CONTRIBUTING.md +186 -0
  6. package/NOTES/NOTES +44 -0
  7. package/README.md +206 -3
  8. package/biome.json +1 -1
  9. package/dist/chunk-23UPXDNL.js +3044 -0
  10. package/dist/chunk-2W7MO2DL.js +1366 -0
  11. package/dist/chunk-3NUAZGMA.js +1689 -0
  12. package/dist/chunk-7TOWB2XB.js +366 -0
  13. package/dist/chunk-7XOTOADQ.js +3065 -0
  14. package/dist/chunk-AH2PDM2K.js +3042 -0
  15. package/dist/chunk-BNXWSZ63.js +3742 -0
  16. package/dist/chunk-BTL5DJVU.js +3222 -0
  17. package/dist/chunk-HDHYG7E4.js +104 -0
  18. package/dist/chunk-HLR4KZBP.js +3234 -0
  19. package/dist/chunk-IP3FRFEB.js +1045 -0
  20. package/dist/chunk-KHU56VDO.js +3042 -0
  21. package/dist/chunk-KRYIFLQR.js +85 -89
  22. package/dist/chunk-LBSDNLEM.js +287 -0
  23. package/dist/chunk-MNTQ7HCP.js +2643 -0
  24. package/dist/chunk-MUJELQQ6.js +1387 -0
  25. package/dist/chunk-MXJGMSLV.js +2199 -0
  26. package/dist/chunk-N6QJGC3Z.js +2636 -0
  27. package/dist/chunk-OBELGBPM.js +1713 -0
  28. package/dist/chunk-OT7R5XTA.js +3192 -0
  29. package/dist/chunk-P7X4RA2T.js +106 -0
  30. package/dist/chunk-PIDUQNC2.js +3185 -0
  31. package/dist/chunk-POGCDIH4.js +3187 -0
  32. package/dist/chunk-PSIEOQGZ.js +3043 -0
  33. package/dist/chunk-PVRT3IHA.js +3238 -0
  34. package/dist/chunk-QNN4TT23.js +1430 -0
  35. package/dist/chunk-RE3R45RJ.js +3042 -0
  36. package/dist/chunk-S7E6TFX6.js +718 -657
  37. package/dist/chunk-SG6GLU4U.js +1378 -0
  38. package/dist/chunk-SJCDV2ST.js +274 -0
  39. package/dist/chunk-SYE5XLF3.js +104 -0
  40. package/dist/chunk-T5VLYBZD.js +103 -0
  41. package/dist/chunk-TOQB7VWU.js +3238 -0
  42. package/dist/chunk-VFNMZ4ZQ.js +3228 -0
  43. package/dist/chunk-VVTGZNBT.js +1533 -1423
  44. package/dist/chunk-W7Q4RFEV.js +104 -0
  45. package/dist/chunk-XTYYVRLO.js +3190 -0
  46. package/dist/chunk-Y6MDYVJD.js +3063 -0
  47. package/dist/cli/main.js +4072 -629
  48. package/dist/index.d.ts +420 -33
  49. package/dist/index.js +8 -15
  50. package/dist/mcp/server.js +103 -7
  51. package/dist/schema-BAWSG7KY.js +22 -0
  52. package/dist/schema-E3QUPL26.js +20 -0
  53. package/dist/schema-EHL7WUT6.js +20 -0
  54. package/docs/019-USAGE.md +44 -5
  55. package/docs/020-current-implementation.md +8 -8
  56. package/docs/021-DOGFOODING-FINDINGS.md +1 -1
  57. package/docs/CONFIG.md +1123 -0
  58. package/docs/ERRORS.md +383 -0
  59. package/docs/summarization.md +320 -0
  60. package/justfile +40 -0
  61. package/package.json +39 -33
  62. package/research/INDEX.md +315 -0
  63. package/research/code-review/README.md +90 -0
  64. package/research/code-review/cli-error-handling-review.md +979 -0
  65. package/research/code-review/code-review-validation-report.md +464 -0
  66. package/research/code-review/main-ts-review.md +1128 -0
  67. package/research/config-docs/SUMMARY.md +357 -0
  68. package/research/config-docs/TEST-RESULTS.md +776 -0
  69. package/research/config-docs/TODO.md +542 -0
  70. package/research/config-docs/analysis.md +744 -0
  71. package/research/config-docs/fix-validation.md +502 -0
  72. package/research/config-docs/help-audit.md +264 -0
  73. package/research/config-docs/help-system-analysis.md +890 -0
  74. package/research/frontmatter/COMMENTS-ARE-SKIPPED.md +149 -0
  75. package/research/frontmatter/LLM-CODE-NAVIGATION.md +276 -0
  76. package/research/issue-review.md +603 -0
  77. package/research/llm-summarization/agent-cli-tools-2026.md +1082 -0
  78. package/research/llm-summarization/alternative-providers-2026.md +1428 -0
  79. package/research/llm-summarization/anthropic-2026.md +367 -0
  80. package/research/llm-summarization/claude-cli-integration.md +1706 -0
  81. package/research/llm-summarization/cli-integration-patterns.md +3155 -0
  82. package/research/llm-summarization/openai-2026.md +473 -0
  83. package/research/llm-summarization/openai-compatible-providers-2026.md +1022 -0
  84. package/research/llm-summarization/opencode-cli-integration.md +1552 -0
  85. package/research/llm-summarization/prompt-engineering-2026.md +1426 -0
  86. package/research/llm-summarization/prototype-results.md +56 -0
  87. package/research/llm-summarization/provider-switching-patterns-2026.md +2153 -0
  88. package/research/llm-summarization/typescript-llm-libraries-2026.md +2436 -0
  89. package/research/mdcontext-pudding/00-EXECUTIVE-SUMMARY.md +282 -0
  90. package/research/mdcontext-pudding/01-index-embed.md +956 -0
  91. package/research/mdcontext-pudding/02-search-COMMANDS.md +142 -0
  92. package/research/mdcontext-pudding/02-search-SUMMARY.md +146 -0
  93. package/research/mdcontext-pudding/02-search.md +970 -0
  94. package/research/mdcontext-pudding/03-context.md +779 -0
  95. package/research/mdcontext-pudding/04-navigation-and-analytics.md +803 -0
  96. package/research/mdcontext-pudding/04-tree.md +704 -0
  97. package/research/mdcontext-pudding/05-config.md +1038 -0
  98. package/research/mdcontext-pudding/06-links-summary.txt +87 -0
  99. package/research/mdcontext-pudding/06-links.md +679 -0
  100. package/research/mdcontext-pudding/07-stats.md +693 -0
  101. package/research/mdcontext-pudding/BUG-FIX-PLAN.md +388 -0
  102. package/research/mdcontext-pudding/P0-BUG-VALIDATION.md +167 -0
  103. package/research/mdcontext-pudding/README.md +168 -0
  104. package/research/mdcontext-pudding/TESTING-SUMMARY.md +128 -0
  105. package/research/research-quality-review.md +834 -0
  106. package/research/semantic-search/embedding-text-analysis.md +156 -0
  107. package/research/semantic-search/multi-word-failure-reproduction.md +171 -0
  108. package/research/semantic-search/query-processing-analysis.md +207 -0
  109. package/research/semantic-search/root-cause-and-solution.md +114 -0
  110. package/research/semantic-search/threshold-validation-report.md +69 -0
  111. package/research/semantic-search/vector-search-analysis.md +63 -0
  112. package/research/test-path-issues.md +276 -0
  113. package/review/ALP-76/1-error-type-design.md +962 -0
  114. package/review/ALP-76/2-error-handling-patterns.md +906 -0
  115. package/review/ALP-76/3-error-presentation.md +624 -0
  116. package/review/ALP-76/4-test-coverage.md +625 -0
  117. package/review/ALP-76/5-migration-completeness.md +440 -0
  118. package/review/ALP-76/6-effect-best-practices.md +755 -0
  119. package/scripts/apply-branch-protection.sh +47 -0
  120. package/scripts/branch-protection-templates.json +79 -0
  121. package/scripts/prototype-summarization.ts +346 -0
  122. package/scripts/rebuild-hnswlib.js +32 -37
  123. package/scripts/setup-branch-protection.sh +64 -0
  124. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/active-provider.json +7 -0
  125. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/bm25.json +541 -0
  126. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/bm25.meta.json +5 -0
  127. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/config.json +8 -0
  128. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/embeddings/openai_text-embedding-3-small_512/vectors.bin +0 -0
  129. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/embeddings/openai_text-embedding-3-small_512/vectors.meta.bin +0 -0
  130. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/indexes/documents.json +60 -0
  131. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/indexes/links.json +13 -0
  132. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/indexes/sections.json +1197 -0
  133. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/configuration-management.md +99 -0
  134. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/distributed-systems.md +92 -0
  135. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/error-handling.md +78 -0
  136. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/failure-automation.md +55 -0
  137. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/job-context.md +69 -0
  138. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/process-orchestration.md +99 -0
  139. package/src/cli/argv-preprocessor.test.ts +2 -2
  140. package/src/cli/cli.test.ts +230 -33
  141. package/src/cli/commands/config-cmd.ts +642 -0
  142. package/src/cli/commands/context.ts +97 -9
  143. package/src/cli/commands/duplicates.ts +122 -0
  144. package/src/cli/commands/embeddings.ts +529 -0
  145. package/src/cli/commands/index-cmd.ts +210 -30
  146. package/src/cli/commands/index.ts +3 -0
  147. package/src/cli/commands/search.ts +894 -64
  148. package/src/cli/commands/stats.ts +3 -0
  149. package/src/cli/commands/tree.ts +26 -5
  150. package/src/cli/config-layer.ts +176 -0
  151. package/src/cli/error-handler.test.ts +235 -0
  152. package/src/cli/error-handler.ts +655 -0
  153. package/src/cli/flag-schemas.ts +66 -0
  154. package/src/cli/help.ts +209 -7
  155. package/src/cli/main.ts +348 -58
  156. package/src/cli/options.ts +10 -0
  157. package/src/cli/shared-error-handling.ts +199 -0
  158. package/src/cli/utils.ts +150 -17
  159. package/src/config/file-provider.test.ts +320 -0
  160. package/src/config/file-provider.ts +273 -0
  161. package/src/config/index.ts +72 -0
  162. package/src/config/integration.test.ts +667 -0
  163. package/src/config/precedence.test.ts +277 -0
  164. package/src/config/precedence.ts +451 -0
  165. package/src/config/schema.test.ts +414 -0
  166. package/src/config/schema.ts +603 -0
  167. package/src/config/service.test.ts +320 -0
  168. package/src/config/service.ts +243 -0
  169. package/src/config/testing.test.ts +264 -0
  170. package/src/config/testing.ts +110 -0
  171. package/src/core/types.ts +6 -33
  172. package/src/duplicates/detector.test.ts +183 -0
  173. package/src/duplicates/detector.ts +414 -0
  174. package/src/duplicates/index.ts +18 -0
  175. package/src/embeddings/embedding-namespace.test.ts +300 -0
  176. package/src/embeddings/embedding-namespace.ts +947 -0
  177. package/src/embeddings/heading-boost.test.ts +222 -0
  178. package/src/embeddings/hnsw-build-options.test.ts +198 -0
  179. package/src/embeddings/hyde.test.ts +272 -0
  180. package/src/embeddings/hyde.ts +264 -0
  181. package/src/embeddings/index.ts +2 -0
  182. package/src/embeddings/openai-provider.ts +332 -83
  183. package/src/embeddings/pricing.json +22 -0
  184. package/src/embeddings/provider-constants.ts +204 -0
  185. package/src/embeddings/provider-errors.test.ts +967 -0
  186. package/src/embeddings/provider-errors.ts +565 -0
  187. package/src/embeddings/provider-factory.test.ts +240 -0
  188. package/src/embeddings/provider-factory.ts +225 -0
  189. package/src/embeddings/provider-integration.test.ts +788 -0
  190. package/src/embeddings/query-preprocessing.test.ts +187 -0
  191. package/src/embeddings/semantic-search-threshold.test.ts +508 -0
  192. package/src/embeddings/semantic-search.ts +780 -93
  193. package/src/embeddings/types.ts +293 -16
  194. package/src/embeddings/vector-store.ts +486 -77
  195. package/src/embeddings/voyage-provider.ts +313 -0
  196. package/src/errors/errors.test.ts +845 -0
  197. package/src/errors/index.ts +533 -0
  198. package/src/index/ignore-patterns.test.ts +354 -0
  199. package/src/index/ignore-patterns.ts +305 -0
  200. package/src/index/indexer.ts +286 -48
  201. package/src/index/storage.ts +94 -30
  202. package/src/index/types.ts +40 -2
  203. package/src/index/watcher.ts +67 -9
  204. package/src/index.ts +22 -0
  205. package/src/integration/search-keyword.test.ts +678 -0
  206. package/src/mcp/server.ts +135 -6
  207. package/src/parser/parser.ts +18 -19
  208. package/src/parser/section-filter.test.ts +277 -0
  209. package/src/parser/section-filter.ts +125 -3
  210. package/src/search/__tests__/hybrid-search.test.ts +650 -0
  211. package/src/search/bm25-store.ts +366 -0
  212. package/src/search/cross-encoder.test.ts +253 -0
  213. package/src/search/cross-encoder.ts +406 -0
  214. package/src/search/fuzzy-search.test.ts +419 -0
  215. package/src/search/fuzzy-search.ts +273 -0
  216. package/src/search/hybrid-search.ts +448 -0
  217. package/src/search/path-matcher.test.ts +276 -0
  218. package/src/search/path-matcher.ts +33 -0
  219. package/src/search/searcher.test.ts +99 -1
  220. package/src/search/searcher.ts +189 -67
  221. package/src/search/wink-bm25.d.ts +30 -0
  222. package/src/summarization/cli-providers/claude.ts +202 -0
  223. package/src/summarization/cli-providers/detection.test.ts +273 -0
  224. package/src/summarization/cli-providers/detection.ts +118 -0
  225. package/src/summarization/cli-providers/index.ts +8 -0
  226. package/src/summarization/cost.test.ts +139 -0
  227. package/src/summarization/cost.ts +102 -0
  228. package/src/summarization/error-handler.test.ts +127 -0
  229. package/src/summarization/error-handler.ts +111 -0
  230. package/src/summarization/index.ts +102 -0
  231. package/src/summarization/pipeline.test.ts +498 -0
  232. package/src/summarization/pipeline.ts +231 -0
  233. package/src/summarization/prompts.test.ts +269 -0
  234. package/src/summarization/prompts.ts +133 -0
  235. package/src/summarization/provider-factory.test.ts +396 -0
  236. package/src/summarization/provider-factory.ts +178 -0
  237. package/src/summarization/types.ts +184 -0
  238. package/src/summarize/summarizer.ts +104 -35
  239. package/src/types/huggingface-transformers.d.ts +66 -0
  240. package/tests/fixtures/cli/.mdcontext/active-provider.json +7 -0
  241. package/tests/fixtures/cli/.mdcontext/embeddings/openai_text-embedding-3-small_512/vectors.bin +0 -0
  242. package/tests/fixtures/cli/.mdcontext/embeddings/openai_text-embedding-3-small_512/vectors.meta.bin +0 -0
  243. package/tests/fixtures/cli/.mdcontext/indexes/documents.json +4 -4
  244. package/tests/fixtures/cli/.mdcontext/indexes/sections.json +14 -0
  245. package/tests/integration/embed-index.test.ts +712 -0
  246. package/tests/integration/search-context.test.ts +469 -0
  247. package/tests/integration/search-semantic.test.ts +522 -0
  248. package/vitest.config.ts +1 -6
  249. package/AGENTS.md +0 -46
  250. package/tests/fixtures/cli/.mdcontext/vectors.bin +0 -0
  251. package/tests/fixtures/cli/.mdcontext/vectors.meta.json +0 -1264
@@ -0,0 +1,956 @@
1
+ # mdcontext Index and Embedding Testing Report
2
+
3
+ **Date**: 2026-01-27
4
+ **Version**: mdcontext 0.1.0
5
+ **Tester**: Claude Sonnet 4.5
6
+ **Status**: 🔴 **CRITICAL BUG FOUND** - Blocks large-scale embedding use
7
+
8
+ ## Quick Reference
9
+
10
+ ### What Works ✅
11
+ - Basic indexing (any size): **PRODUCTION READY**
12
+ - Small corpus embeddings (<200 docs): **WORKS**
13
+ - Incremental updates: **EXCELLENT**
14
+ - JSON output: **PERFECT**
15
+ - Force rebuild: **WORKS**
16
+
17
+ ### What's Broken 🔴
18
+ - **Large corpus embeddings (>1500 docs): BLOCKED**
19
+ - All providers affected (OpenRouter, Ollama, likely OpenAI)
20
+ - Bug: Vector metadata save (JSON size limit)
21
+ - Impact: Cannot use semantic search on real codebases
22
+
23
+ ### Immediate Action Required
24
+ Fix vector metadata serialization (switch to binary format)
25
+ **Priority**: P0 - Critical
26
+ **ETA**: 4-8 hours
27
+
28
+ ---
29
+
30
+ ## Executive Summary
31
+
32
+ Comprehensive testing of mdcontext indexing and embedding functionality against two repositories:
33
+ - **mdcontext** (120 docs, ~564k tokens) - Small reference corpus ✅
34
+ - **agentic-flow** (1561 docs, ~9M tokens) - Large production codebase ❌
35
+
36
+ ### Key Findings
37
+
38
+ 1. **Basic indexing works flawlessly** - Fast, reliable, incremental updates ✅
39
+ 2. **OpenAI embeddings: WORKS ON SMALL** - Completed successfully on 120 doc corpus ✅
40
+ 3. **OpenRouter embeddings: BUG CONFIRMED** - Vector metadata save fails on large corpus ❌
41
+ 4. **Ollama embeddings: SAME BUG** - Generates embeddings but cannot save metadata ❌
42
+ 5. **CLI features tested successfully** - JSON output, --force, incremental updates ✅
43
+ 6. **Critical bug affects ALL providers** - Root cause in mdcontext, not providers 🔴
44
+
45
+ ---
46
+
47
+ ## Test Environment
48
+
49
+ ```bash
50
+ mdcontext version: 0.1.0
51
+ Node version: 22.16.0
52
+ Test date: 2026-01-27
53
+ OS: Darwin 24.5.0 (macOS)
54
+ ```
55
+
56
+ ### API Keys Available
57
+ - ✅ OPENAI_API_KEY
58
+ - ✅ OPENROUTER_API_KEY
59
+ - ✅ ANTHROPIC_API_BASE / OPENAI_API_BASE
60
+ - ✅ Ollama running locally (http://localhost:11434)
61
+
62
+ ---
63
+
64
+ ## Test 1: Basic Indexing (No Embeddings)
65
+
66
+ ### Command
67
+ ```bash
68
+ node dist/cli/main.js index /Users/alphab/Dev/LLM/DEV/agentic-flow
69
+ ```
70
+
71
+ ### Results - agentic-flow
72
+ ```
73
+ Indexed 1561 documents
74
+ Sections: 52714
75
+ Links: 3460
76
+ Duration: 14439ms (~14.4s)
77
+ Skipped: 21 hidden, 6 excluded
78
+ ```
79
+
80
+ ### Storage Analysis
81
+ ```
82
+ Total: 28M
83
+ ├── config.json 4.0K
84
+ ├── indexes/
85
+ │ ├── documents.json 516K
86
+ │ ├── links.json 600K
87
+ │ └── sections.json 27M (largest file)
88
+ └── cache/ (empty)
89
+ ```
90
+
91
+ ### Performance
92
+ - **Speed**: ~108 docs/second
93
+ - **Storage efficiency**: 28MB for 1561 docs (18KB per doc average)
94
+ - **Incremental**: ✅ Only reindexes changed files
95
+
96
+ ### Observations
97
+ - Very fast indexing without embeddings
98
+ - sections.json dominates storage (96% of total)
99
+ - Cost estimate shown: ~$0.1795 for embeddings
100
+ - Interactive prompt for semantic search (can bypass with --no-embed)
101
+
102
+ ---
103
+
104
+ ## Test 2: OpenRouter Embeddings
105
+
106
+ ### Command
107
+ ```bash
108
+ # In agentic-flow directory with config:
109
+ cat > mdcontext.config.js << 'EOF'
110
+ export default {
111
+ embeddings: {
112
+ provider: 'openrouter',
113
+ model: 'openai/text-embedding-3-small',
114
+ dimensions: 512,
115
+ }
116
+ }
117
+ EOF
118
+
119
+ node /path/to/mdcontext/dist/cli/main.js index . --embed
120
+ ```
121
+
122
+ ### Results
123
+ **Status**: ❌ FAILED - BUG DISCOVERED
124
+
125
+ ```
126
+ Error: VectorStoreError
127
+ operation: 'save'
128
+ cause: RangeError: Invalid string length
129
+ at JSON.stringify (<anonymous>)
130
+ ```
131
+
132
+ ### Analysis
133
+ - Embeddings were generated successfully (vectors.bin created: 101MB)
134
+ - Index files created normally (28MB)
135
+ - **Failure occurred during metadata save** (vectors.meta.json)
136
+ - Total storage before crash: 140M
137
+
138
+ ### Bug Details
139
+ - **Root cause**: JSON.stringify fails on very large metadata object
140
+ - **Impact**: Cannot complete indexing on large codebases with OpenRouter
141
+ - **File location**: dist/chunk-KHU56VDO.js:1880:61
142
+ - **Workaround**: Use smaller corpora (switching providers does not help; Test 4 hits the same error with Ollama)
143
+
144
+ ### Recommendation
145
+ This is a critical bug that needs fixing for large-scale deployments. The metadata serialization should use one of:
146
+ 1. Streaming JSON serialization
147
+ 2. Binary format instead of JSON
148
+ 3. Chunked metadata files
149
+ 4. Or reduce metadata size stored per vector
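
As an illustration of options 1 and 3 above, here is a minimal sketch of line-by-line (NDJSON) serialization, in which no single string ever holds the whole metadata set. It is not the fix mdcontext actually shipped. The `VectorMeta` shape and both function names are hypothetical, chosen only for this example.

```typescript
// Hypothetical shape of one metadata record; the real mdcontext type may differ.
interface VectorMeta {
  id: string;
  documentPath: string;
  heading: string;
}

// Serialize one record per line and hand each line to `write`,
// so JSON.stringify only ever sees a single small object.
function writeMetadataLines(
  entries: VectorMeta[],
  write: (line: string) => void
): number {
  let written = 0;
  for (let i = 0; i < entries.length; i++) {
    write(JSON.stringify(entries[i]) + "\n");
    written++;
  }
  return written;
}

// Inverse operation: parse NDJSON text back into records.
function parseMetadataLines(text: string): VectorMeta[] {
  return text
    .split("\n")
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as VectorMeta);
}

// Usage: lines are collected in memory here; a real writer would pass
// a file stream's write function instead of pushing to an array.
const sampleEntries: VectorMeta[] = [
  { id: "a", documentPath: "docs/a.md", heading: "Intro" },
  { id: "b", documentPath: "docs/b.md", heading: "Usage" },
];
const metaChunks: string[] = [];
const lineCount = writeMetadataLines(sampleEntries, (line) => {
  metaChunks.push(line);
});
const roundTripped = parseMetadataLines(metaChunks.join(""));
```

The trade-off is that readers must parse the file line by line instead of with a single `JSON.parse` call, which also keeps load-time memory bounded.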
150
+
151
+ ---
152
+
153
+ ## Test 3: OpenAI Embeddings (mdcontext corpus)
154
+
155
+ ### Command
156
+ ```bash
157
+ # In mdcontext repo with config:
158
+ cat > mdcontext.config.js << 'EOF'
159
+ export default {
160
+ embeddings: {
161
+ provider: 'openai',
162
+ model: 'text-embedding-3-small',
163
+ dimensions: 512,
164
+ }
165
+ }
166
+ EOF
167
+
168
+ node dist/cli/main.js index /Users/alphab/Dev/LLM/DEV/mdcontext --embed
169
+ ```
170
+
171
+ ### Results
172
+ **Status**: ✅ SUCCESS
173
+
174
+ ```
175
+ Indexed 120 documents
176
+ Sections: 4234
177
+ Links: 261
178
+ Duration: 1537ms (indexing)
179
+
180
+ Embedding phase:
181
+ Files: 120
182
+ Sections: 3903 (embedded)
183
+ Tokens: 564,253
184
+ Cost: $0.011285
185
+ Duration: 64.7s
186
+ Total time: 66.8s
187
+ ```
188
+
189
+ ### Storage Analysis
190
+ ```
191
+ Total: 69M
192
+ ├── config.json 268B
193
+ ├── indexes/
194
+ │ ├── documents.json 40K
195
+ │ ├── links.json 40K
196
+ │ └── sections.json 2.2M
197
+ ├── vectors.bin 8.2M
198
+ └── vectors.meta.json 58M (!!)
199
+ ```
200
+
201
+ ### Performance Metrics
202
+ - **Embedding speed**: 1.85 files/sec, 60.3 sections/sec
203
+ - **API cost**: $0.011285 (~$0.02 per 1M tokens)
204
+ - **Storage overhead**: 66.2MB for embeddings (8.2M vectors + 58M metadata)
205
+ - **Storage efficiency**: 575KB per file on average with embeddings
206
+
207
+ ### Observations
208
+ - OpenAI provider works reliably
209
+ - **metadata file is 7x larger than binary vectors** - optimization opportunity
210
+ - Cost is very reasonable for small-medium corpora
211
+ - Pricing warning: "513 days old. May not reflect current rates."
212
+
213
+ ### Cost Projection for agentic-flow
214
+ ```
215
+ agentic-flow: ~726 seconds estimated embedding time
216
+ mdcontext: 564,253 tokens = $0.011285
217
+
218
+ Estimated agentic-flow cost:
219
+ - If similar token density: ~$0.1795 (as shown in prompt)
220
+ - Very affordable for one-time indexing
221
+ ```
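
The projection above is plain rate arithmetic, which the sketch below reproduces. The rate of $0.02 per 1M tokens is the one implied by the mdcontext run. The function names are made up for this example, and the agentic-flow token count is back-computed from the quoted cost, not measured.

```typescript
// Rate implied by the mdcontext run: about $0.02 per 1M tokens.
const USD_PER_MILLION_TOKENS = 0.02;

// Cost of embedding a given number of tokens at a flat per-token rate.
function embeddingCostUsd(tokens: number, usdPerMillion: number): number {
  return (tokens / 1000000) * usdPerMillion;
}

// Inverse: how many tokens a quoted cost corresponds to at that rate.
function tokensForCost(costUsd: number, usdPerMillion: number): number {
  return (costUsd / usdPerMillion) * 1000000;
}

// 564,253 tokens comes out at roughly $0.011285, matching the run above.
const mdcontextCostUsd = embeddingCostUsd(564253, USD_PER_MILLION_TOKENS);

// The $0.1795 estimate corresponds to about 8.975M tokens,
// consistent with the "~9M tokens" figure for agentic-flow.
const agenticFlowTokens = tokensForCost(0.1795, USD_PER_MILLION_TOKENS);
```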
222
+
223
+ ---
224
+
225
+ ## Test 4: Ollama Embeddings (agentic-flow)
226
+
227
+ ### Command
228
+ ```bash
229
+ # In agentic-flow directory with config:
230
+ cat > mdcontext.config.js << 'EOF'
231
+ export default {
232
+ embeddings: {
233
+ provider: 'ollama',
234
+ model: 'nomic-embed-text',
235
+ dimensions: 768, // nomic-embed-text native dimension
236
+ }
237
+ }
238
+ EOF
239
+
240
+ node /path/to/mdcontext/dist/cli/main.js index . --embed
241
+ ```
242
+
243
+ ### Results
244
+ **Status**: ❌ FAILED - SAME BUG AS OPENROUTER
245
+
246
+ ```
247
+ VectorStoreError: Failed to write metadata: Invalid string length
248
+ cause: RangeError: Invalid string length
249
+ ```
250
+
251
+ ### What Happened
252
+ - Indexing phase completed successfully: 1561 docs in ~15s
253
+ - Embedding phase processed all files (estimated ~12 minutes)
254
+ - **vectors.bin created successfully: 101MB** (embeddings generated!)
255
+ - **Failure during metadata save** (same as OpenRouter)
256
+ - Total processed: 1558 files, ~9M tokens
257
+
258
+ ### Storage State (Before Crash)
259
+ ```
260
+ Total: 129M
261
+ ├── config.json 271B
262
+ ├── indexes/
263
+ │ ├── documents.json 516K
264
+ │ ├── links.json 600K
265
+ │ └── sections.json 27M
266
+ └── vectors.bin 101M (successfully created!)
267
+ vectors.meta.json FAILED (would have been ~700MB+ based on ratio)
268
+ ```
269
+
270
+ ### Critical Finding
271
+ **The bug affects ALL providers on large corpora, not just OpenRouter.**
272
+
273
+ The issue is in the vector store's metadata serialization layer, which:
274
+ 1. Successfully generates embeddings (all providers work)
275
+ 2. Successfully writes binary vectors (vectors.bin)
276
+ 3. **Fails when serializing metadata to JSON** (hits V8 string size limit)
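
A rough worked estimate shows why step 3 fails at this corpus size. It scales the metadata-to-vector ratio from the small corpus up to the 101M vectors file. Two things are assumptions: the listing sizes are treated as MiB, and the string limit is taken to be 2^29 - 24 code units, the value recent 64-bit Node builds are understood to report. `buffer.constants.MAX_STRING_LENGTH` gives the real value on a given runtime.

```typescript
// Sizes reported in the storage listings, treated here as MiB (assumption).
const SMALL_VECTORS_MIB = 8.2; // mdcontext vectors.bin
const SMALL_META_MIB = 58; // mdcontext vectors.meta.json
const LARGE_VECTORS_MIB = 101; // agentic-flow vectors.bin

// Assumed limit in UTF-16 code units; verify against
// buffer.constants.MAX_STRING_LENGTH on the actual runtime.
const ASSUMED_MAX_STRING_LENGTH = Math.pow(2, 29) - 24;

// Project the metadata JSON length, assuming mostly ASCII content
// so that one byte corresponds to about one code unit.
function projectMetaChars(vectorsMiB: number, metaPerVectorRatio: number): number {
  return Math.round(vectorsMiB * metaPerVectorRatio * 1024 * 1024);
}

const metaRatio = SMALL_META_MIB / SMALL_VECTORS_MIB; // about 7.07
const projectedChars = projectMetaChars(LARGE_VECTORS_MIB, metaRatio);
const exceedsAssumedLimit = projectedChars > ASSUMED_MAX_STRING_LENGTH;
```

Under these assumptions the projected JSON is roughly 714 MiB of text, well past the assumed limit, which is consistent with the `RangeError: Invalid string length` seen in both failed runs.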
277
+
278
+ ### Analysis
279
+ - Ollama successfully generated 101MB of embeddings
280
+ - Processing was FREE (local)
281
+ - Bug prevented completion
282
+ - Same root cause as OpenRouter (JSON.stringify limit)
283
+ - **Affects any corpus that would produce >500MB metadata JSON**
284
+
285
+ ### Expected Benefits (if bug were fixed)
286
+ - ✅ Free (local processing)
287
+ - ✅ No API rate limits
288
+ - ✅ Privacy (no data leaves machine)
289
+ - ✅ Works with agentic-flow sized corpus (embeddings generated)
290
+ - ⚠️ Slower than cloud providers (~12 min vs ~1-2 min estimated)
291
+ - ⚠️ Requires local Ollama installation
292
+
293
+ ---
294
+
295
+ ## Test 5: JSON Output Formats
296
+
297
+ ### Basic JSON
298
+ ```bash
299
+ node dist/cli/main.js index . --json
300
+ ```
301
+
302
+ **Output**: Single-line JSON
303
+ ```json
304
+ {"documentsIndexed":0,"sectionsIndexed":0,"linksIndexed":0,"totalDocuments":120,"totalSections":4234,"totalLinks":261,"duration":49,"errors":[],"skipped":{"unchanged":120,"excluded":2,"hidden":10,"total":132}}
305
+ ```
306
+
307
+ ### Pretty JSON
308
+ ```bash
309
+ node dist/cli/main.js index . --json --pretty
310
+ ```
311
+
312
+ **Output**: Formatted JSON
313
+ ```json
314
+ {
315
+ "documentsIndexed": 0,
316
+ "sectionsIndexed": 0,
317
+ "linksIndexed": 0,
318
+ "totalDocuments": 120,
319
+ "totalSections": 4234,
320
+ "totalLinks": 261,
321
+ "duration": 47,
322
+ "errors": [],
323
+ "skipped": {
324
+ "unchanged": 120,
325
+ "excluded": 2,
326
+ "hidden": 10,
327
+ "total": 132
328
+ }
329
+ }
330
+ ```
331
+
332
+ ### Observations
333
+ - ✅ Clean, parseable JSON output
334
+ - ✅ Useful for CI/CD pipelines
335
+ - ✅ Shows incremental update metrics (unchanged count)
336
+ - ✅ Duration in milliseconds
337
+ - ✅ Error array (empty in successful runs)
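
For the CI/CD use case, a pipeline step could parse this output and fail the build when `errors` is non-empty. The sketch below is one possible way to do that. The field names come from the output shown above, while `IndexResult` and `summarizeIndexRun` are hypothetical names that are not part of mdcontext.

```typescript
// Fields as they appear in the `--json` output shown above.
interface IndexResult {
  documentsIndexed: number;
  sectionsIndexed: number;
  linksIndexed: number;
  totalDocuments: number;
  totalSections: number;
  totalLinks: number;
  duration: number; // milliseconds
  errors: unknown[];
  skipped: { unchanged: number; excluded: number; hidden: number; total: number };
}

// Turn the raw JSON into a pass/fail flag plus a one-line summary.
function summarizeIndexRun(rawJson: string): { ok: boolean; message: string } {
  const result = JSON.parse(rawJson) as IndexResult;
  const ok = result.errors.length === 0;
  const message =
    result.documentsIndexed + " reindexed, " +
    result.skipped.unchanged + " unchanged, " +
    result.duration + "ms";
  return { ok: ok, message: message };
}

// Sample taken verbatim from the single-line output above.
const sampleOutput =
  '{"documentsIndexed":0,"sectionsIndexed":0,"linksIndexed":0,' +
  '"totalDocuments":120,"totalSections":4234,"totalLinks":261,' +
  '"duration":49,"errors":[],' +
  '"skipped":{"unchanged":120,"excluded":2,"hidden":10,"total":132}}';
const runSummary = summarizeIndexRun(sampleOutput);
```

A CI script would typically exit with a non-zero status when `runSummary.ok` is false.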
338
+
339
+ ---
340
+
341
+ ## Test 6: Force Rebuild Flag
342
+
343
+ ### Command
344
+ ```bash
345
+ node dist/cli/main.js index . --force --json --pretty
346
+ ```
347
+
348
+ ### Results
349
+ ```json
350
+ {
351
+ "documentsIndexed": 120, # All documents reindexed
352
+ "sectionsIndexed": 4465,
353
+ "linksIndexed": 261,
354
+ "totalDocuments": 120,
355
+ "totalSections": 4234,
356
+ "totalLinks": 261,
357
+ "duration": 1524,
358
+ "errors": [],
359
+ "skipped": {
360
+ "unchanged": 0, # None skipped
361
+ "excluded": 2,
362
+ "hidden": 10,
363
+ "total": 12
364
+ }
365
+ }
366
+ ```
367
+
368
+ ### Observations
369
+ - ✅ Forces complete rebuild
370
+ - ✅ Ignores file modification timestamps
371
+ - ✅ Useful for: config changes, corruption recovery, debugging
372
+ - Duration: 1524ms vs 47ms (incremental) = 32x slower
373
+
374
+ ---
375
+
376
+ ## Test 7: Incremental Updates
377
+
378
+ ### Test Setup
379
+ ```bash
380
+ # 1. Full index
381
+ node dist/cli/main.js index . --json --pretty
382
+
383
+ # 2. Modify one file
384
+ echo -e "\n## Test Section\n\nTest content." >> test-config.md
385
+
386
+ # 3. Re-index
387
+ node dist/cli/main.js index . --json --pretty
388
+ ```
389
+
390
+ ### Results
391
+
392
+ **Before modification:**
393
+ ```json
394
+ {
395
+ "documentsIndexed": 0,
396
+ "sectionsIndexed": 0,
397
+ "skipped": { "unchanged": 119, ... }
398
+ }
399
+ ```
400
+
401
+ **After modification:**
402
+ ```json
403
+ {
404
+ "documentsIndexed": 1, # Only modified file
405
+ "sectionsIndexed": 4, # Only sections in that file
406
+ "skipped": { "unchanged": 119, ... },
407
+ "duration": 54
408
+ }
409
+ ```
410
+
411
+ ### Performance
412
+ - Modified 1 file → Only 1 file reindexed
413
+ - 119 files skipped as unchanged
414
+ - Duration: 54ms (vs 1524ms for full rebuild)
415
+ - **28x faster** than full rebuild
416
+
417
+ ### Observations
418
+ - ✅ Perfect incremental detection
419
+ - ✅ Fast re-index after edits
420
+ - ✅ Ideal for watch mode (`--watch` flag)
421
+ - ✅ Efficient for large codebases
422
+
423
+ ---
424
+
425
+ ## Storage Analysis Deep Dive
426
+
427
+ ### Directory Structure
428
+
429
+ ```
430
+ .mdcontext/
431
+ ├── config.json # Index configuration snapshot
432
+ ├── cache/ # (currently unused)
433
+ ├── indexes/ # Core index files
434
+ │ ├── documents.json # Document metadata
435
+ │ ├── links.json # Link graph
436
+ │ └── sections.json # Section content (LARGEST)
437
+ ├── vectors.bin # Binary vector embeddings (if --embed)
438
+ └── vectors.meta.json # Vector metadata (if --embed)
439
+ ```
440
+
441
+ ### Size Comparison
442
+
443
+ | Corpus | Basic Index | With Embeddings | Overhead |
444
+ |--------|-------------|-----------------|----------|
445
+ | mdcontext (120 docs) | 2.2M | 69M | 31x |
446
+ | agentic-flow (1561 docs) | 28M | ~140M+ | 5x |
447
+
448
+ ### Key Insights
449
+
450
+ 1. **sections.json dominates basic index**
451
+ - 96% of storage in basic mode
452
+ - Contains full section text + metadata
453
+ - Direct correlation with corpus size
454
+
455
+ 2. **vectors.meta.json is surprisingly large**
456
+ - 58M for 120 files (mdcontext)
457
+ - 7x larger than binary vectors (8.2M)
458
+ - **Optimization opportunity**: Store less metadata or use binary format
459
+
460
+ 3. **Embeddings add significant storage**
461
+ - 30-50x increase for small corpora
462
+ - 5-10x increase for large corpora
463
+ - Trade-off: semantic search capability vs disk space
464
+
465
+ ---
466
+
467
+ ## Performance Benchmarks
468
+
469
+ ### Indexing Speed (without embeddings)
470
+
471
+ | Corpus | Files | Sections | Duration | Files/sec | Sections/sec |
472
+ |--------|-------|----------|----------|-----------|--------------|
473
+ | mdcontext | 120 | 4,234 | 1.5s | 80 | 2,823 |
474
+ | agentic-flow | 1,561 | 52,714 | 14.4s | 108 | 3,661 |
475
+
476
+ **Conclusion**: Indexing scales roughly linearly across the two corpora tested, at ~100 docs/sec and ~3000 sections/sec
477
+
478
+ ### Embedding Speed (OpenAI)
479
+
480
+ | Corpus | Files | Sections | Tokens | Duration | Cost | Tokens/sec |
481
+ |--------|-------|----------|--------|----------|------|------------|
482
+ | mdcontext | 120 | 3,903 | 564,253 | 64.7s | $0.011 | 8,719 |
483
+
484
+ **Estimated for agentic-flow**: ~726s (~12 min), ~$0.18
485
+
486
+ ### Incremental Update Speed
487
+
488
+ | Operation | Files Changed | Duration | Speedup |
489
+ |-----------|---------------|----------|---------|
490
+ | Full rebuild | 120 | 1,524ms | 1x |
491
+ | Incremental (1 file) | 1 | 54ms | 28x |
492
+
493
+ **Conclusion**: Incremental updates ran roughly 28-32x faster than a full rebuild in these tests
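
The throughput and speedup figures in this section can be rechecked from the raw durations reported in the tests. The helper names below are illustrative only.

```typescript
// Items processed per second, given a duration in milliseconds.
function perSecond(count: number, durationMs: number): number {
  return count / (durationMs / 1000);
}

// How many times faster `fasterMs` is than `baselineMs`.
function speedup(baselineMs: number, fasterMs: number): number {
  return baselineMs / fasterMs;
}

// agentic-flow: 1561 docs in 14439ms, about 108 docs/sec.
const agenticDocsPerSec = perSecond(1561, 14439);

// mdcontext: 120 docs in 1537ms, about 78 docs/sec
// (the table shows 80 because it rounds the duration to 1.5s).
const mdcontextDocsPerSec = perSecond(120, 1537);

// Full rebuild (1524ms) against a one-file incremental run (54ms): about 28x.
const oneFileSpeedup = speedup(1524, 54);

// Full rebuild against a no-change run (47ms): about 32x.
const noChangeSpeedup = speedup(1524, 47);
```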
494
+
495
+ ---
496
+
497
+ ## Provider Comparison
498
+
499
+ ### OpenAI
500
+ - **Status**: ✅ Production Ready (small-medium corpora)
501
+ - **Tested**: ✅ mdcontext (120 docs) - SUCCESS
502
+ - **Pros**:
503
+ - Fast API responses
504
+ - Reliable at scale (up to tested size)
505
+ - Well-documented pricing
506
+ - High-quality embeddings
507
+ - Only provider confirmed working with embeddings
508
+ - **Cons**:
509
+ - Requires API key
510
+ - Costs money (though cheap: $0.011 for 120 docs)
511
+ - Data sent to cloud
512
+ - **UNTESTED on large corpora** (likely hits the same bug above ~1500 docs)
513
+ - **Best for**: Small-medium projects (<1000 docs), production deployments, CI/CD
514
+
515
+ ### OpenRouter
516
+ - **Status**: ❌ Bug Confirmed (large corpora)
517
+ - **Tested**: ❌ agentic-flow (1561 docs) - FAILED at metadata save
518
+ - **Pros**:
519
+ - Access to multiple models
520
+ - Competitive pricing
521
+ - Good API compatibility
522
+ - Embeddings generated successfully
523
+ - **Cons**:
524
+ - **BUG**: Vector metadata save fails on large corpora (>1500 docs)
525
+ - Cannot complete indexing for agentic-flow size codebases
526
+ - Same underlying issue as all providers
527
+ - **Best for**: Small corpora (<500 docs) only, until bug fixed
528
+
529
+ ### Ollama (Local)
530
+ - **Status**: ❌ Bug Confirmed (large corpora)
531
+ - **Tested**: ❌ agentic-flow (1561 docs) - FAILED at metadata save
532
+ - **Pros**:
533
+ - Free (no API costs)
534
+ - Private (local processing)
535
+ - No rate limits
536
+ - No internet required
537
+ - Successfully generated embeddings (101MB)
538
+ - Slower but works for small corpora
539
+ - **Cons**:
540
+ - **BUG**: Same metadata save error on large corpora
541
+ - Slower than cloud providers (~12 min for 1561 docs)
542
+ - Requires local installation (Ollama + model download)
543
+ - Uses local compute resources
544
+ - Same underlying issue as all providers
545
+ - **Best for**: Privacy-sensitive small projects, offline work, after bug fix
546
+
547
+ ### Anthropic
548
+ - **Status**: ⏸️ Not tested
549
+ - **Note**: Anthropic doesn't offer embedding models (as of 2026)
550
+ - **Voyage AI**: Mentioned in config but not tested
551
+ - May not be applicable for embedding use case
552
+
553
+ ### Summary Table
554
+
555
+ | Provider | Small (<200) | Medium (200-1000) | Large (>1500) | Cost | Privacy |
556
+ |----------|--------------|-------------------|---------------|------|---------|
557
+ | OpenAI | ✅ Confirmed | ⚠️ Likely works | ❌ Likely fails | $ | Cloud |
558
+ | OpenRouter | ✅ Should work | ⚠️ Untested | ❌ Confirmed fail | $ | Cloud |
559
+ | Ollama | ✅ Should work | ⚠️ Untested | ❌ Confirmed fail | Free | Local |
560
+
561
+ **Key Insight**: The bug is in mdcontext's vector store, not the providers. All providers successfully generate embeddings, but all fail when mdcontext tries to save metadata for large corpora.
562
+
563
+ ---
564
+
565
+ ## CLI Features Summary
566
+
567
+ ### Working Features ✅
568
+
569
+ 1. **Basic indexing**: `mdcontext index <path>`
570
+ - Fast, reliable, incremental
571
+
572
+ 2. **Embeddings**: `mdcontext index <path> --embed`
573
+ - Requires provider configuration
574
+ - Interactive cost estimate
575
+
576
+ 3. **Force rebuild**: `mdcontext index <path> --force`
577
+ - Ignores incremental detection
578
+ - Rebuilds everything
579
+
580
+ 4. **JSON output**: `mdcontext index <path> --json [--pretty]`
581
+ - Machine-readable results
582
+ - Perfect for CI/CD
583
+
584
+ 5. **No-embed flag**: `mdcontext index <path> --no-embed`
585
+ - Skip semantic search prompt
586
+ - Useful for automation
587
+
588
+ ### Configuration
589
+
590
+ Providers are configured via `mdcontext.config.js`:
591
+
592
+ ```javascript
593
+ export default {
594
+ embeddings: {
595
+ provider: 'openai', // or 'openrouter', 'ollama'
596
+ model: 'text-embedding-3-small',
597
+ dimensions: 512,
598
+ }
599
+ }
600
+ ```
601
+
602
+ **Note**: The provider is NOT a CLI flag; it must be set in the config file or the environment.
603
+
604
+ ---
605
+
606
+ ## Issues Found
607
+
608
+ ### 🐛 Critical: Vector Metadata Save Error (ALL PROVIDERS)
609
+
610
+ **Severity**: CRITICAL - BLOCKING
611
+ **Impact**: Cannot index large codebases (>1500 docs) with embeddings
612
+ **Affected**: ALL embedding providers (OpenRouter, Ollama, likely OpenAI too on larger corpora)
613
+ **Component**: Vector store metadata serialization (`vectors.meta.json`)
614
+
615
+ **Error**:
616
+ ```
617
+ VectorStoreError: Failed to write metadata: Invalid string length
618
+ cause: RangeError: Invalid string length
619
+ at JSON.stringify (<anonymous>)
620
+ ```
621
+
622
+ **Exact Location**:
623
+ ```typescript
624
+ // src/embeddings/vector-store.ts:401
625
+ yield* Effect.tryPromise({
626
+ try: () =>
627
+ fs.writeFile(this.getMetaPath(), JSON.stringify(meta, null, 2)),
628
+ catch: (e) =>
629
+ new VectorStoreError({
630
+ operation: 'save',
631
+ // ...
632
+ })
633
+ })
634
+ ```
635
+
636
+ The `JSON.stringify(meta, null, 2)` call fails when the `meta` object serializes to a string larger than V8's maximum string length (roughly 512MB).
637
+
638
+ **What Works**:
639
+ - ✅ Indexing without embeddings (any size)
640
+ - ✅ Embedding generation (all providers)
641
+ - ✅ Binary vector storage (vectors.bin writes successfully)
642
+ - ✅ Small corpora (<200 docs) with embeddings
643
+
644
+ **What Fails**:
645
+ - ❌ Metadata serialization for large corpora (>1500 docs)
646
+ - ❌ OpenRouter on agentic-flow: vectors.meta.json save
647
+ - ❌ Ollama on agentic-flow: vectors.meta.json save
648
+ - ⚠️ OpenAI likely to fail on corpora >2000 docs (untested)
649
+
650
+ **Root Cause Analysis**:
651
+ 1. `vectors.meta.json` stores metadata for every embedded section
652
+ 2. For agentic-flow: 52,714 sections × metadata per section = huge JSON
653
+ 3. V8 caps string length at roughly 512MB, so `JSON.stringify` cannot return anything larger
654
+ 4. Estimated metadata size: ~785MB (extrapolated from 58MB for 3903 sections)
655
+ 5. **Calculation**: 58MB / 3903 sections = 14.9KB per section
656
+ 6. **agentic-flow**: 52,714 sections × 14.9KB = ~785MB JSON string
657
+ 7. This exceeds V8's string size limit → crash
658
+
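The extrapolation in the root cause analysis can be reproduced with a few lines of arithmetic. This is a minimal sketch, not code from mdcontext: the `Sample` type and `estimateMetadataBytes` helper are hypothetical, the limit constant is an assumption about 64-bit V8 builds (the real value depends on the engine version), and the comparison treats one byte as one character, which holds only for mostly ASCII JSON.

```typescript
// Assumption: V8's maximum string length on 64-bit builds is 2^29 - 24 characters.
// The actual value depends on the engine version.
const ASSUMED_MAX_STRING_LENGTH = 2 ** 29 - 24;

// Hypothetical shape for one measured test run.
interface Sample {
  sections: number;      // sections embedded in the measured corpus
  metadataBytes: number; // size of the resulting vectors.meta.json
}

// Linear extrapolation: assumes metadata grows proportionally with section count.
function estimateMetadataBytes(sample: Sample, targetSections: number): number {
  const bytesPerSection = sample.metadataBytes / sample.sections;
  return Math.round(bytesPerSection * targetSections);
}

// Figures from the test runs in this report: mdcontext (3903 sections, ~58MB).
const measured: Sample = { sections: 3903, metadataBytes: 58_000_000 };
const agenticFlowEstimate = estimateMetadataBytes(measured, 52_714);

console.log(`estimated metadata: ${(agenticFlowEstimate / 1e6).toFixed(0)}MB`);
console.log(`exceeds assumed limit: ${agenticFlowEstimate > ASSUMED_MAX_STRING_LENGTH}`);
```

Without rounding the per-section size to 14.9KB first, the estimate lands a little under the ~785MB quoted above, still well past the assumed limit.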
659
+ **Why This Is Critical**:
660
+ - Embeddings are SUCCESSFULLY generated (costly operation completes)
661
+ - Only the FREE metadata save fails
662
+ - Users waste time/money generating embeddings that can't be saved
663
+ - No graceful degradation or progress saving
664
+
665
+ **Recommendations** (Priority Order):
666
+
667
+ 1. **IMMEDIATE: Add Size Validation** (1 hour)
668
+ ```typescript
669
+ // Before JSON.stringify, estimate size
670
+ const estimatedSize = sections.length * BYTES_PER_SECTION;
671
+ if (estimatedSize > MAX_SAFE_JSON_SIZE) {
672
+ // Use alternative format or fail early with clear message
673
+ }
674
+ ```
675
+
676
+ 2. **SHORT-TERM: Binary Metadata Format** (4 hours)
677
+ - Replace `vectors.meta.json` with `vectors.meta.bin`
678
+ - Use MessagePack, CBOR, or custom binary format
679
+ - No string size limits, smaller file size, faster I/O
680
+
681
+ 3. **SHORT-TERM: Chunked Metadata** (4 hours)
682
+ - Split into multiple files: `vectors.meta.0.json`, `vectors.meta.1.json`, etc.
683
+ - Load on-demand by vector ID range
684
+ - Each chunk stays under size limit
685
+
686
+ 4. **MEDIUM-TERM: Reduce Metadata** (8 hours)
687
+ - Audit what's stored per vector
688
+ - Move redundant data to indexes/sections.json
689
+ - Store only: vector ID, document ID, section ID, (optional) hash
690
+
691
+ 5. **LONG-TERM: SQLite Storage** (16 hours)
692
+ - Replace JSON files with SQLite database
693
+ - Better for large datasets, built-in indexing, ACID guarantees
694
+ - Industry standard for local data
695
+
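Recommendations 1 and 3 above can be sketched together: serialize entries one at a time, fail early if a single entry cannot fit, and start a new chunk before the running length crosses a budget. This is an illustrative sketch only. `MetaEntry` and `serializeInChunks` are hypothetical names, the real metadata fields in mdcontext may differ, and writing each returned string to `vectors.meta.0.json`, `vectors.meta.1.json`, and so on is left out.

```typescript
// Hypothetical entry shape; the real metadata fields in mdcontext may differ.
interface MetaEntry {
  vectorId: number;
  sectionId: string;
}

// Serialize entries into several JSON array strings, each at most maxChars long,
// so no single JSON.stringify call has to produce one giant string.
function serializeInChunks(entries: MetaEntry[], maxChars: number): string[] {
  const chunks: string[] = [];
  let current: string[] = [];
  let currentLen = 2; // accounts for the surrounding "[" and "]"

  for (const entry of entries) {
    const json = JSON.stringify(entry);
    if (json.length + 2 > maxChars) {
      // Fail early with a clear message instead of a late RangeError.
      throw new Error(`entry ${entry.vectorId} exceeds the ${maxChars}-char chunk budget`);
    }
    const added = json.length + (current.length > 0 ? 1 : 0); // +1 for the comma
    if (currentLen + added > maxChars) {
      // Close the current chunk and start a fresh one.
      chunks.push(`[${current.join(",")}]`);
      current = [];
      currentLen = 2;
    }
    currentLen += json.length + (current.length > 0 ? 1 : 0);
    current.push(json);
  }
  if (current.length > 0) chunks.push(`[${current.join(",")}]`);
  return chunks;
}
```

Each returned string is a valid JSON array on its own, so a reader can load chunks on demand and concatenate the parsed arrays to recover the full list.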
696
+ **Workaround for Users**:
697
+ ```bash
698
+ # Option 1: Index subdirectories separately
699
+ mdcontext index ./docs --embed
700
+ mdcontext index ./src --embed
701
+
702
+ # Option 2: Skip embeddings for now
703
+ mdcontext index . --no-embed
704
+
705
+ # Option 3: Use OpenAI on small corpus (confirmed working <200 docs)
706
+ mdcontext index ./docs --embed # if docs/ is small
707
+
708
+ # Option 4: Wait for bug fix (ETA: 1-2 days for binary format)
709
+ ```
710
+
711
+ **Test Data**:
712
+ - mdcontext (120 docs, 3903 sections): ✅ Works (58MB metadata)
713
+ - agentic-flow (1561 docs, 52,714 sections): ❌ Fails (estimated 785MB metadata)
714
+
715
+ ---
716
+
717
+ ## Best Practices
718
+
719
+ ### For Small Projects (<200 docs)
720
+ ```bash
721
+ # Quick setup with OpenAI
722
+ export OPENAI_API_KEY=sk-...
723
+ mdcontext index . --embed
724
+ # Cost: <$0.05, Duration: <2 min
725
+ ```
726
+
727
+ ### For Medium Projects (200-1000 docs)
728
+ ```bash
729
+ # Use OpenAI with config file
730
+ cat > mdcontext.config.js << 'EOF'
731
+ export default {
732
+ embeddings: {
733
+ provider: 'openai',
734
+ model: 'text-embedding-3-small',
735
+ dimensions: 512,
736
+ batchSize: 100,
737
+ }
738
+ }
739
+ EOF
740
+
741
+ mdcontext index . --embed --json
742
+ # Cost: $0.05-$0.20, Duration: 5-10 min
743
+ ```
744
+
745
+ ### For Large Projects (>1000 docs)
746
+ ```bash
747
+ # CURRENT STATUS: Not supported due to metadata save bug
748
+ #
749
+ # WORKAROUND 1: Index subdirectories
750
+ mdcontext index ./docs --embed
751
+ mdcontext index ./src --embed
752
+ # Each subdirectory must be <500 docs
753
+
754
+ # WORKAROUND 2: Skip embeddings for now
755
+ mdcontext index . --no-embed
756
+ # Basic indexing works fine, add embeddings after bug fix
757
+
758
+ # WORKAROUND 3: Wait for bug fix (recommended)
759
+ # ETA: 1-2 days for binary metadata format
760
+ # Then:
761
+ # mdcontext index . --embed
762
+ # # Will work with any provider
763
+ ```
764
+
765
+ ### For CI/CD Pipelines
766
+ ```bash
767
+ # Fast incremental updates with JSON output
768
+ mdcontext index . --json > index-results.json
769
+
770
+ # Parse results
771
+ if jq -e '.errors | length > 0' index-results.json; then
772
+ echo "Indexing failed"
773
+ exit 1
774
+ fi
775
+
776
+ # Only rebuild if significant changes
777
+ DOCS_CHANGED=$(jq -r '.documentsIndexed // 0' index-results.json)
778
+ if [ "$DOCS_CHANGED" -gt 10 ]; then
779
+ mdcontext index . --embed --json
780
+ fi
781
+ ```
782
+
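The same gating logic as the shell pipeline above can be expressed as a small function for pipelines that already run Node. This is a sketch under assumptions: the `errors` and `documentsIndexed` field names are taken from the `jq` filters above, the rest of the `--json` output schema is not confirmed here, and `decide` is a hypothetical helper name.

```typescript
// Field names mirror the jq filters in the shell example; the full schema
// of the `mdcontext index --json` output is not confirmed here.
interface IndexResult {
  errors?: unknown[];
  documentsIndexed?: number;
}

type Decision = "fail" | "re-embed" | "skip";

// Decide the next pipeline step from the raw JSON text of an index run.
function decide(rawJson: string, threshold = 10): Decision {
  const result = JSON.parse(rawJson) as IndexResult;
  // Any reported error fails the pipeline.
  if ((result.errors?.length ?? 0) > 0) return "fail";
  // Missing counts are treated as zero, so no re-embed is triggered.
  return (result.documentsIndexed ?? 0) > threshold ? "re-embed" : "skip";
}
```

Treating a missing `documentsIndexed` as zero keeps the pipeline from re-embedding (and spending money) on malformed output.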
783
+ ### Watch Mode for Development
784
+ ```bash
785
+ # Real-time indexing during development
786
+ mdcontext index . --watch
787
+
788
+ # Or with embeddings (slower, higher cost)
789
+ mdcontext index . --watch --embed
790
+ ```
791
+
792
+ ---
793
+
794
+ ## Recommendations
795
+
796
+ ### Immediate Actions
797
+
798
+ 1. **Fix Vector Metadata Save Bug**
799
+ - High priority for production use
800
+ - Blocks large codebase indexing
801
+ - See issue details above
802
+
803
+ 2. **Optimize Vector Metadata Storage**
804
+ - vectors.meta.json is 7x larger than vectors.bin
805
+ - Consider binary format
806
+ - Or reduce metadata stored
807
+
808
+ 3. **Update Pricing Data**
809
+ - Current warning: "513 days old"
810
+ - Fetch latest OpenAI pricing
811
+ - Add date to pricing estimates
812
+
813
+ ### Future Enhancements
814
+
815
+ 1. **Progress Bars**
816
+ - Show embedding progress (currently just file list)
817
+ - ETA for large corpora
818
+ - Bytes/tokens processed
819
+
820
+ 2. **Dry Run Mode**
821
+ - Estimate cost before running
822
+ - `mdcontext index . --embed --dry-run`
823
+
824
+ 3. **Partial Embedding**
825
+ - Allow embedding subset of docs
826
+ - `--embed-include "docs/**"`
827
+ - Useful for large repos (only embed docs/)
828
+
829
+ 4. **Compression**
830
+ - Gzip/zstd for .mdcontext files
831
+ - Could save 50-70% disk space
832
+
833
+ 5. **Provider Auto-Detection**
834
+ - Try providers in order: ollama → openai → openrouter
835
+ - Fall back gracefully
836
+ - Reduce configuration burden
837
+
838
+ ---
839
+
840
+ ## Conclusion
841
+
842
+ mdcontext indexing and embedding functionality has a **critical bug blocking large-scale use**:
843
+
844
+ ### ✅ Strengths
845
+ - Fast, reliable basic indexing (any size)
846
+ - Excellent incremental update detection
847
+ - Clean JSON output for automation
848
+ - All embedding providers work (generate embeddings successfully)
849
+ - Reasonable costs for semantic search (small corpora)
850
+ - Binary vector storage works perfectly
851
+
852
+ ### 🐛 Critical Issue
853
+ **BLOCKING BUG**: Vector metadata save fails on large corpora (>1500 docs)
854
+ - Affects: ALL providers (OpenRouter, Ollama, likely OpenAI)
855
+ - Root cause: JSON.stringify size limit in V8
856
+ - Impact: Cannot use embeddings on production-sized codebases
857
+ - Status: Needs immediate fix (binary format recommended)
858
+
859
+ ### ✅ What Works Right Now
860
+ - ✅ Basic indexing without embeddings (any size)
861
+ - ✅ Embeddings on small corpora (<200 docs)
862
+ - ✅ Incremental updates
863
+ - ✅ JSON output for automation
864
+ - ✅ Force rebuild
865
+ - ✅ All tested CLI features
866
+
867
+ ### 🎯 Ready For (Today)
868
+ - Small projects (<200 docs) with embeddings: OpenAI
869
+ - Large projects without embeddings: Any size
870
+ - CI/CD integration: JSON output + incremental
871
+ - Development workflows: Watch mode (without embeddings)
872
+
873
+ ### 🚫 Not Ready For (Blocked by Bug)
874
+ - Medium projects (200-1000 docs) with embeddings: UNVERIFIED (untested, may hit the bug)
875
+ - Large projects (>1000 docs) with embeddings: BLOCKED
876
+ - Production semantic search on real codebases: BLOCKED
877
+
878
+ ### 🔧 Fix Required
879
+ **Priority**: CRITICAL - P0
880
+ **Estimated Fix Time**: 4-8 hours (binary format implementation)
881
+ **User Impact**: Cannot use primary feature (semantic search) on real codebases
882
+ **Recommendation**: Implement binary metadata storage (MessagePack/CBOR)
883
+
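To illustrate why a binary layout sidesteps the string limit, here is a minimal sketch of a fixed-width record format. It is a stand-in for the recommended MessagePack/CBOR approach, not an implementation of either, and it assumes metadata can be reduced to three numeric IDs per vector (the `CompactMeta` shape is hypothetical and does not reflect what mdcontext currently stores).

```typescript
// Hypothetical minimal record: assumes every ID fits in an unsigned 32-bit integer.
interface CompactMeta {
  vectorId: number;
  documentId: number;
  sectionId: number;
}

const RECORD_BYTES = 12; // three little-endian uint32 fields

// Pack entries into one contiguous byte array; no intermediate string is built.
function encodeMeta(entries: CompactMeta[]): Uint8Array {
  const buf = new ArrayBuffer(entries.length * RECORD_BYTES);
  const view = new DataView(buf);
  entries.forEach((e, i) => {
    const off = i * RECORD_BYTES;
    view.setUint32(off, e.vectorId, true);
    view.setUint32(off + 4, e.documentId, true);
    view.setUint32(off + 8, e.sectionId, true);
  });
  return new Uint8Array(buf);
}

// Read the records back; trailing bytes that do not form a full record are ignored.
function decodeMeta(bytes: Uint8Array): CompactMeta[] {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const out: CompactMeta[] = [];
  for (let off = 0; off + RECORD_BYTES <= bytes.byteLength; off += RECORD_BYTES) {
    out.push({
      vectorId: view.getUint32(off, true),
      documentId: view.getUint32(off + 4, true),
      sectionId: view.getUint32(off + 8, true),
    });
  }
  return out;
}
```

At 12 bytes per record, the 52,714 sections of agentic-flow would need 632,568 bytes under this layout. Real metadata would likely need variable-length fields such as paths or hashes, which is where MessagePack or CBOR becomes the better fit.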
884
+ ### Next Steps
885
+ 1. **URGENT**: Fix vector metadata save bug (binary format)
886
+ 2. Add size validation to fail early with clear message
887
+ 3. Test OpenAI on larger corpus (500-1000 docs) after fix
888
+ 4. Benchmark search performance with embeddings
889
+ 5. Test context assembly with embeddings
890
+ 6. Document maximum supported corpus sizes
891
+
892
+ ---
893
+
894
+ ## Quick Start Commands (What Works Today)
895
+
896
+ ### Basic Indexing (Any Size) ✅
897
+ ```bash
898
+ # Simple
899
+ mdcontext index /path/to/repo
900
+
901
+ # With JSON output
902
+ mdcontext index /path/to/repo --json --pretty
903
+
904
+ # Force rebuild
905
+ mdcontext index /path/to/repo --force
906
+ ```
907
+
908
+ ### Small Corpus with Embeddings ✅
909
+ ```bash
910
+ # Only for <200 docs, otherwise hits bug
911
+ export OPENAI_API_KEY=sk-...
912
+
913
+ cat > mdcontext.config.js << 'EOF'
914
+ export default {
915
+ embeddings: {
916
+ provider: 'openai',
917
+ model: 'text-embedding-3-small',
918
+ dimensions: 512,
919
+ }
920
+ }
921
+ EOF
922
+
923
+ mdcontext index /path/to/small/docs --embed
924
+ ```
925
+
926
+ ### What to Avoid (Until Bug Fixed) 🚫
927
+ ```bash
928
+ # Don't do this on large repos (>1500 docs)
929
+ mdcontext index /path/to/large/repo --embed
930
+ # Will waste time/money generating embeddings, then crash
931
+
932
+ # Instead:
933
+ mdcontext index /path/to/large/repo --no-embed
934
+ # Or wait for bug fix
935
+ ```
936
+
937
+ ---
938
+
939
+ ## Test Data Files
940
+
941
+ All test output saved to:
942
+ - `/tmp/test1-basic.log` - Basic indexing (agentic-flow)
943
+ - `/tmp/test2-openrouter.log` - OpenRouter failure logs
944
+ - `/tmp/test3-openai.log` - OpenAI success logs (mdcontext)
945
+ - `/tmp/test-agentic-ollama.log` - Ollama failure logs
946
+ - `/tmp/test-mdcontext-openai.log` - OpenAI success (small corpus)
947
+
948
+ ---
949
+
950
+ **Report Author**: Claude (Sonnet 4.5)
951
+ **Test Date**: 2026-01-27
952
+ **mdcontext Version**: 0.1.0
953
+ **Test Duration**: ~90 minutes
954
+ **Commands Executed**: 15+
955
+ **Bugs Found**: 1 critical (affects all providers)
956
+ **Production Readiness**: Partial (basic indexing ready, embeddings blocked on large corpora)