mdcontext 0.1.0 → 0.2.0

Files changed (251)
  1. package/.changeset/config.json +9 -9
  2. package/.claude/settings.local.json +25 -0
  3. package/.github/workflows/claude-code-review.yml +44 -0
  4. package/.github/workflows/claude.yml +85 -0
  5. package/CONTRIBUTING.md +186 -0
  6. package/NOTES/NOTES +44 -0
  7. package/README.md +206 -3
  8. package/biome.json +1 -1
  9. package/dist/chunk-23UPXDNL.js +3044 -0
  10. package/dist/chunk-2W7MO2DL.js +1366 -0
  11. package/dist/chunk-3NUAZGMA.js +1689 -0
  12. package/dist/chunk-7TOWB2XB.js +366 -0
  13. package/dist/chunk-7XOTOADQ.js +3065 -0
  14. package/dist/chunk-AH2PDM2K.js +3042 -0
  15. package/dist/chunk-BNXWSZ63.js +3742 -0
  16. package/dist/chunk-BTL5DJVU.js +3222 -0
  17. package/dist/chunk-HDHYG7E4.js +104 -0
  18. package/dist/chunk-HLR4KZBP.js +3234 -0
  19. package/dist/chunk-IP3FRFEB.js +1045 -0
  20. package/dist/chunk-KHU56VDO.js +3042 -0
  21. package/dist/chunk-KRYIFLQR.js +85 -89
  22. package/dist/chunk-LBSDNLEM.js +287 -0
  23. package/dist/chunk-MNTQ7HCP.js +2643 -0
  24. package/dist/chunk-MUJELQQ6.js +1387 -0
  25. package/dist/chunk-MXJGMSLV.js +2199 -0
  26. package/dist/chunk-N6QJGC3Z.js +2636 -0
  27. package/dist/chunk-OBELGBPM.js +1713 -0
  28. package/dist/chunk-OT7R5XTA.js +3192 -0
  29. package/dist/chunk-P7X4RA2T.js +106 -0
  30. package/dist/chunk-PIDUQNC2.js +3185 -0
  31. package/dist/chunk-POGCDIH4.js +3187 -0
  32. package/dist/chunk-PSIEOQGZ.js +3043 -0
  33. package/dist/chunk-PVRT3IHA.js +3238 -0
  34. package/dist/chunk-QNN4TT23.js +1430 -0
  35. package/dist/chunk-RE3R45RJ.js +3042 -0
  36. package/dist/chunk-S7E6TFX6.js +718 -657
  37. package/dist/chunk-SG6GLU4U.js +1378 -0
  38. package/dist/chunk-SJCDV2ST.js +274 -0
  39. package/dist/chunk-SYE5XLF3.js +104 -0
  40. package/dist/chunk-T5VLYBZD.js +103 -0
  41. package/dist/chunk-TOQB7VWU.js +3238 -0
  42. package/dist/chunk-VFNMZ4ZQ.js +3228 -0
  43. package/dist/chunk-VVTGZNBT.js +1533 -1423
  44. package/dist/chunk-W7Q4RFEV.js +104 -0
  45. package/dist/chunk-XTYYVRLO.js +3190 -0
  46. package/dist/chunk-Y6MDYVJD.js +3063 -0
  47. package/dist/cli/main.js +4072 -629
  48. package/dist/index.d.ts +420 -33
  49. package/dist/index.js +8 -15
  50. package/dist/mcp/server.js +103 -7
  51. package/dist/schema-BAWSG7KY.js +22 -0
  52. package/dist/schema-E3QUPL26.js +20 -0
  53. package/dist/schema-EHL7WUT6.js +20 -0
  54. package/docs/019-USAGE.md +44 -5
  55. package/docs/020-current-implementation.md +8 -8
  56. package/docs/021-DOGFOODING-FINDINGS.md +1 -1
  57. package/docs/CONFIG.md +1123 -0
  58. package/docs/ERRORS.md +383 -0
  59. package/docs/summarization.md +320 -0
  60. package/justfile +40 -0
  61. package/package.json +39 -33
  62. package/research/INDEX.md +315 -0
  63. package/research/code-review/README.md +90 -0
  64. package/research/code-review/cli-error-handling-review.md +979 -0
  65. package/research/code-review/code-review-validation-report.md +464 -0
  66. package/research/code-review/main-ts-review.md +1128 -0
  67. package/research/config-docs/SUMMARY.md +357 -0
  68. package/research/config-docs/TEST-RESULTS.md +776 -0
  69. package/research/config-docs/TODO.md +542 -0
  70. package/research/config-docs/analysis.md +744 -0
  71. package/research/config-docs/fix-validation.md +502 -0
  72. package/research/config-docs/help-audit.md +264 -0
  73. package/research/config-docs/help-system-analysis.md +890 -0
  74. package/research/frontmatter/COMMENTS-ARE-SKIPPED.md +149 -0
  75. package/research/frontmatter/LLM-CODE-NAVIGATION.md +276 -0
  76. package/research/issue-review.md +603 -0
  77. package/research/llm-summarization/agent-cli-tools-2026.md +1082 -0
  78. package/research/llm-summarization/alternative-providers-2026.md +1428 -0
  79. package/research/llm-summarization/anthropic-2026.md +367 -0
  80. package/research/llm-summarization/claude-cli-integration.md +1706 -0
  81. package/research/llm-summarization/cli-integration-patterns.md +3155 -0
  82. package/research/llm-summarization/openai-2026.md +473 -0
  83. package/research/llm-summarization/openai-compatible-providers-2026.md +1022 -0
  84. package/research/llm-summarization/opencode-cli-integration.md +1552 -0
  85. package/research/llm-summarization/prompt-engineering-2026.md +1426 -0
  86. package/research/llm-summarization/prototype-results.md +56 -0
  87. package/research/llm-summarization/provider-switching-patterns-2026.md +2153 -0
  88. package/research/llm-summarization/typescript-llm-libraries-2026.md +2436 -0
  89. package/research/mdcontext-pudding/00-EXECUTIVE-SUMMARY.md +282 -0
  90. package/research/mdcontext-pudding/01-index-embed.md +956 -0
  91. package/research/mdcontext-pudding/02-search-COMMANDS.md +142 -0
  92. package/research/mdcontext-pudding/02-search-SUMMARY.md +146 -0
  93. package/research/mdcontext-pudding/02-search.md +970 -0
  94. package/research/mdcontext-pudding/03-context.md +779 -0
  95. package/research/mdcontext-pudding/04-navigation-and-analytics.md +803 -0
  96. package/research/mdcontext-pudding/04-tree.md +704 -0
  97. package/research/mdcontext-pudding/05-config.md +1038 -0
  98. package/research/mdcontext-pudding/06-links-summary.txt +87 -0
  99. package/research/mdcontext-pudding/06-links.md +679 -0
  100. package/research/mdcontext-pudding/07-stats.md +693 -0
  101. package/research/mdcontext-pudding/BUG-FIX-PLAN.md +388 -0
  102. package/research/mdcontext-pudding/P0-BUG-VALIDATION.md +167 -0
  103. package/research/mdcontext-pudding/README.md +168 -0
  104. package/research/mdcontext-pudding/TESTING-SUMMARY.md +128 -0
  105. package/research/research-quality-review.md +834 -0
  106. package/research/semantic-search/embedding-text-analysis.md +156 -0
  107. package/research/semantic-search/multi-word-failure-reproduction.md +171 -0
  108. package/research/semantic-search/query-processing-analysis.md +207 -0
  109. package/research/semantic-search/root-cause-and-solution.md +114 -0
  110. package/research/semantic-search/threshold-validation-report.md +69 -0
  111. package/research/semantic-search/vector-search-analysis.md +63 -0
  112. package/research/test-path-issues.md +276 -0
  113. package/review/ALP-76/1-error-type-design.md +962 -0
  114. package/review/ALP-76/2-error-handling-patterns.md +906 -0
  115. package/review/ALP-76/3-error-presentation.md +624 -0
  116. package/review/ALP-76/4-test-coverage.md +625 -0
  117. package/review/ALP-76/5-migration-completeness.md +440 -0
  118. package/review/ALP-76/6-effect-best-practices.md +755 -0
  119. package/scripts/apply-branch-protection.sh +47 -0
  120. package/scripts/branch-protection-templates.json +79 -0
  121. package/scripts/prototype-summarization.ts +346 -0
  122. package/scripts/rebuild-hnswlib.js +32 -37
  123. package/scripts/setup-branch-protection.sh +64 -0
  124. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/active-provider.json +7 -0
  125. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/bm25.json +541 -0
  126. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/bm25.meta.json +5 -0
  127. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/config.json +8 -0
  128. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/embeddings/openai_text-embedding-3-small_512/vectors.bin +0 -0
  129. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/embeddings/openai_text-embedding-3-small_512/vectors.meta.bin +0 -0
  130. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/indexes/documents.json +60 -0
  131. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/indexes/links.json +13 -0
  132. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/indexes/sections.json +1197 -0
  133. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/configuration-management.md +99 -0
  134. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/distributed-systems.md +92 -0
  135. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/error-handling.md +78 -0
  136. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/failure-automation.md +55 -0
  137. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/job-context.md +69 -0
  138. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/process-orchestration.md +99 -0
  139. package/src/cli/argv-preprocessor.test.ts +2 -2
  140. package/src/cli/cli.test.ts +230 -33
  141. package/src/cli/commands/config-cmd.ts +642 -0
  142. package/src/cli/commands/context.ts +97 -9
  143. package/src/cli/commands/duplicates.ts +122 -0
  144. package/src/cli/commands/embeddings.ts +529 -0
  145. package/src/cli/commands/index-cmd.ts +210 -30
  146. package/src/cli/commands/index.ts +3 -0
  147. package/src/cli/commands/search.ts +894 -64
  148. package/src/cli/commands/stats.ts +3 -0
  149. package/src/cli/commands/tree.ts +26 -5
  150. package/src/cli/config-layer.ts +176 -0
  151. package/src/cli/error-handler.test.ts +235 -0
  152. package/src/cli/error-handler.ts +655 -0
  153. package/src/cli/flag-schemas.ts +66 -0
  154. package/src/cli/help.ts +209 -7
  155. package/src/cli/main.ts +348 -58
  156. package/src/cli/options.ts +10 -0
  157. package/src/cli/shared-error-handling.ts +199 -0
  158. package/src/cli/utils.ts +150 -17
  159. package/src/config/file-provider.test.ts +320 -0
  160. package/src/config/file-provider.ts +273 -0
  161. package/src/config/index.ts +72 -0
  162. package/src/config/integration.test.ts +667 -0
  163. package/src/config/precedence.test.ts +277 -0
  164. package/src/config/precedence.ts +451 -0
  165. package/src/config/schema.test.ts +414 -0
  166. package/src/config/schema.ts +603 -0
  167. package/src/config/service.test.ts +320 -0
  168. package/src/config/service.ts +243 -0
  169. package/src/config/testing.test.ts +264 -0
  170. package/src/config/testing.ts +110 -0
  171. package/src/core/types.ts +6 -33
  172. package/src/duplicates/detector.test.ts +183 -0
  173. package/src/duplicates/detector.ts +414 -0
  174. package/src/duplicates/index.ts +18 -0
  175. package/src/embeddings/embedding-namespace.test.ts +300 -0
  176. package/src/embeddings/embedding-namespace.ts +947 -0
  177. package/src/embeddings/heading-boost.test.ts +222 -0
  178. package/src/embeddings/hnsw-build-options.test.ts +198 -0
  179. package/src/embeddings/hyde.test.ts +272 -0
  180. package/src/embeddings/hyde.ts +264 -0
  181. package/src/embeddings/index.ts +2 -0
  182. package/src/embeddings/openai-provider.ts +332 -83
  183. package/src/embeddings/pricing.json +22 -0
  184. package/src/embeddings/provider-constants.ts +204 -0
  185. package/src/embeddings/provider-errors.test.ts +967 -0
  186. package/src/embeddings/provider-errors.ts +565 -0
  187. package/src/embeddings/provider-factory.test.ts +240 -0
  188. package/src/embeddings/provider-factory.ts +225 -0
  189. package/src/embeddings/provider-integration.test.ts +788 -0
  190. package/src/embeddings/query-preprocessing.test.ts +187 -0
  191. package/src/embeddings/semantic-search-threshold.test.ts +508 -0
  192. package/src/embeddings/semantic-search.ts +780 -93
  193. package/src/embeddings/types.ts +293 -16
  194. package/src/embeddings/vector-store.ts +486 -77
  195. package/src/embeddings/voyage-provider.ts +313 -0
  196. package/src/errors/errors.test.ts +845 -0
  197. package/src/errors/index.ts +533 -0
  198. package/src/index/ignore-patterns.test.ts +354 -0
  199. package/src/index/ignore-patterns.ts +305 -0
  200. package/src/index/indexer.ts +286 -48
  201. package/src/index/storage.ts +94 -30
  202. package/src/index/types.ts +40 -2
  203. package/src/index/watcher.ts +67 -9
  204. package/src/index.ts +22 -0
  205. package/src/integration/search-keyword.test.ts +678 -0
  206. package/src/mcp/server.ts +135 -6
  207. package/src/parser/parser.ts +18 -19
  208. package/src/parser/section-filter.test.ts +277 -0
  209. package/src/parser/section-filter.ts +125 -3
  210. package/src/search/__tests__/hybrid-search.test.ts +650 -0
  211. package/src/search/bm25-store.ts +366 -0
  212. package/src/search/cross-encoder.test.ts +253 -0
  213. package/src/search/cross-encoder.ts +406 -0
  214. package/src/search/fuzzy-search.test.ts +419 -0
  215. package/src/search/fuzzy-search.ts +273 -0
  216. package/src/search/hybrid-search.ts +448 -0
  217. package/src/search/path-matcher.test.ts +276 -0
  218. package/src/search/path-matcher.ts +33 -0
  219. package/src/search/searcher.test.ts +99 -1
  220. package/src/search/searcher.ts +189 -67
  221. package/src/search/wink-bm25.d.ts +30 -0
  222. package/src/summarization/cli-providers/claude.ts +202 -0
  223. package/src/summarization/cli-providers/detection.test.ts +273 -0
  224. package/src/summarization/cli-providers/detection.ts +118 -0
  225. package/src/summarization/cli-providers/index.ts +8 -0
  226. package/src/summarization/cost.test.ts +139 -0
  227. package/src/summarization/cost.ts +102 -0
  228. package/src/summarization/error-handler.test.ts +127 -0
  229. package/src/summarization/error-handler.ts +111 -0
  230. package/src/summarization/index.ts +102 -0
  231. package/src/summarization/pipeline.test.ts +498 -0
  232. package/src/summarization/pipeline.ts +231 -0
  233. package/src/summarization/prompts.test.ts +269 -0
  234. package/src/summarization/prompts.ts +133 -0
  235. package/src/summarization/provider-factory.test.ts +396 -0
  236. package/src/summarization/provider-factory.ts +178 -0
  237. package/src/summarization/types.ts +184 -0
  238. package/src/summarize/summarizer.ts +104 -35
  239. package/src/types/huggingface-transformers.d.ts +66 -0
  240. package/tests/fixtures/cli/.mdcontext/active-provider.json +7 -0
  241. package/tests/fixtures/cli/.mdcontext/embeddings/openai_text-embedding-3-small_512/vectors.bin +0 -0
  242. package/tests/fixtures/cli/.mdcontext/embeddings/openai_text-embedding-3-small_512/vectors.meta.bin +0 -0
  243. package/tests/fixtures/cli/.mdcontext/indexes/documents.json +4 -4
  244. package/tests/fixtures/cli/.mdcontext/indexes/sections.json +14 -0
  245. package/tests/integration/embed-index.test.ts +712 -0
  246. package/tests/integration/search-context.test.ts +469 -0
  247. package/tests/integration/search-semantic.test.ts +522 -0
  248. package/vitest.config.ts +1 -6
  249. package/AGENTS.md +0 -46
  250. package/tests/fixtures/cli/.mdcontext/vectors.bin +0 -0
  251. package/tests/fixtures/cli/.mdcontext/vectors.meta.json +0 -1264
@@ -0,0 +1,388 @@
1
+ # Bug Fix Plan: Vector Metadata Save Error
2
+
3
+ **Bug ID**: Critical Vector Store Metadata Serialization Failure
4
+ **Severity**: P0 - Critical (Blocks production use)
5
+ **Impact**: Cannot embed large corpora (>1500 docs) with any provider
6
+ **Location**: `src/embeddings/vector-store.ts:401`
7
+ **Validation**: ✅ 100% reproducible - All providers (OpenAI, OpenRouter, Ollama) fail identically on 1,558-doc agentic-flow corpus
8
+
9
+ ---
10
+
11
+ ## Problem Statement
12
+
13
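+ The vector store saves metadata as JSON using `JSON.stringify()`, which must build the entire output as a single string and is therefore bounded by V8's maximum string length (~512MB). For large corpora (>1500 docs), the metadata serializes to more than 512MB, so the save throws `RangeError: Invalid string length` after the embeddings have already been generated.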
14
+
15
+ ### Current Code
16
+ ```typescript
17
+ // src/embeddings/vector-store.ts:401
18
+ yield* Effect.tryPromise({
19
+ try: () =>
20
+ fs.writeFile(this.getMetaPath(), JSON.stringify(meta, null, 2)),
21
+ catch: (e) =>
22
+ new VectorStoreError({
23
+ operation: 'save',
24
+ // ...
25
+ })
26
+ })
27
+ ```
28
+
29
+ ### Evidence
30
+ - **mdcontext (120 docs, 3903 sections)**: 58MB metadata → Works ✅
31
+ - **agentic-flow (1,558 docs, 52,714 sections)**: ~785MB metadata → Fails ❌
32
+ - **Calculation**: 14.9KB per section × 52,714 sections = 785MB
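The arithmetic above can be turned into a rough pre-flight estimate. This is a minimal sketch, not code from the package: the helper names are hypothetical, the 14,900 bytes-per-section figure is the average taken from the evidence above, and the limit constant assumes V8's 64-bit maximum string length of 2^29 - 24 characters, treating one byte as one character.

```typescript
// Assumption: V8's maximum string length on 64-bit builds (2^29 - 24 characters).
const V8_MAX_STRING_LENGTH = 2 ** 29 - 24;

// Assumption: average serialized size per section, taken from the evidence above.
const BYTES_PER_SECTION = 14900;

/** Estimated size of the serialized JSON metadata, in bytes. */
function estimateMetadataBytes(
  sectionCount: number,
  bytesPerSection: number = BYTES_PER_SECTION
): number {
  return sectionCount * bytesPerSection;
}

/** True when the estimate would not fit in a single string (1 byte treated as 1 character). */
function exceedsStringLimit(
  sectionCount: number,
  bytesPerSection: number = BYTES_PER_SECTION
): boolean {
  return estimateMetadataBytes(sectionCount, bytesPerSection) > V8_MAX_STRING_LENGTH;
}

/** Largest section count whose estimate still fits under the limit. */
function maxSectionsUnderLimit(bytesPerSection: number = BYTES_PER_SECTION): number {
  return Math.floor(V8_MAX_STRING_LENGTH / bytesPerSection);
}
```

Under these assumptions the threshold is a section count rather than a document count, so a corpus with fewer but longer documents could hit the limit as well.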
33
+
34
+ ---
35
+
36
+ ## Solution Options
37
+
38
+ ### Option 1: Binary Format (RECOMMENDED)
39
+ **Effort**: 4-6 hours
40
+ **Impact**: Solves problem permanently, reduces file size
41
+
42
+ Replace JSON with MessagePack or CBOR:
43
+ ```typescript
44
+ import * as msgpack from '@msgpack/msgpack';
45
+
46
+ // Save
47
+ const encoded = msgpack.encode(meta);
48
+ await fs.writeFile(this.getMetaPath(), encoded);
49
+
50
+ // Load
51
+ const buffer = await fs.readFile(this.getMetaPath());
52
+ const meta = msgpack.decode(buffer);
53
+ ```
54
+
55
+ **Benefits**:
56
+ - Not subject to V8's string-length limit (still bounded by Buffer size and available memory)
57
+ - 30-50% smaller files
58
+ - Faster I/O
59
+ - Backward compatible (can auto-migrate)
60
+
61
+ **Dependencies**:
62
+ ```bash
63
+ npm install @msgpack/msgpack
64
+ ```
65
+
66
+ ---
67
+
68
+ ### Option 2: Chunked JSON
69
+ **Effort**: 6-8 hours
70
+ **Impact**: Solves problem, maintains JSON format
71
+
72
+ Split metadata into chunks:
73
+ ```typescript
74
+ const CHUNK_SIZE = 1000; // sections per file
75
+ const chunks = [];
76
+
77
+ for (let i = 0; i < meta.length; i += CHUNK_SIZE) {
78
+ const chunk = meta.slice(i, i + CHUNK_SIZE);
79
+ await fs.writeFile(
80
+ `${this.getMetaPath()}.${i}.json`,
81
+ JSON.stringify(chunk, null, 2)
82
+ );
83
+ chunks.push(i);
84
+ }
85
+
86
+ // Save index
87
+ await fs.writeFile(
88
+ `${this.getMetaPath()}.index.json`,
89
+ JSON.stringify({ chunks, chunkSize: CHUNK_SIZE })
90
+ );
91
+ ```
92
+
93
+ **Benefits**:
94
+ - Maintains JSON format (easier debugging)
95
+ - Lazy loading possible
96
+ - Each chunk stays under limit
97
+
98
+ **Drawbacks**:
99
+ - Multiple files to manage
100
+ - More complex loading logic
101
+ - Slower than binary
102
+
103
+ ---
104
+
105
+ ### Option 3: Reduce Metadata (SHORT-TERM)
106
+ **Effort**: 2-3 hours
107
+ **Impact**: Delays problem, doesn't solve it
108
+
109
+ Audit what's stored per vector and remove redundancy:
110
+ ```typescript
111
+ // Current (example)
112
+ {
113
+ sectionId: "abc123",
114
+ documentId: "doc456",
115
+ path: "/path/to/file.md",
116
+ title: "Section Title",
117
+ content: "Full section content...", // REMOVE THIS
118
+ tokens: 150,
119
+ hash: "sha256...",
120
+ metadata: { ... } // Audit this
121
+ }
122
+
123
+ // Optimized
124
+ {
125
+ sectionId: "abc123",
126
+ documentId: "doc456",
127
+ // Remove content (already in sections.json)
128
+ // Remove redundant metadata
129
+ }
130
+ ```
131
+
132
+ **Benefits**:
133
+ - Quick to implement
134
+ - Reduces file size 5-10x
135
+
136
+ **Drawbacks**:
137
+ - Still hits limit on very large corpora
138
+ - Requires matching changes in load logic
139
+
140
+ ---
141
+
142
+ ### Option 4: SQLite Storage (LONG-TERM)
143
+ **Effort**: 16-20 hours
144
+ **Impact**: Best long-term solution
145
+
146
+ Replace JSON files with SQLite:
147
+ ```typescript
148
+ import Database from 'better-sqlite3';
149
+ import * as msgpack from '@msgpack/msgpack'; // needed for msgpack.encode below
150
+ const db = new Database('.mdcontext/vectors.db');
151
+
152
+ // Create tables
153
+ db.exec(`
154
+ CREATE TABLE IF NOT EXISTS vector_meta (
155
+ section_id TEXT PRIMARY KEY,
156
+ document_id TEXT,
157
+ data BLOB
158
+ );
159
+ CREATE INDEX IF NOT EXISTS idx_doc ON vector_meta(document_id);
160
+ `);
161
+
162
+ // Save
163
+ const stmt = db.prepare('INSERT INTO vector_meta VALUES (?, ?, ?)');
164
+ for (const item of meta) {
165
+ stmt.run(item.sectionId, item.documentId, msgpack.encode(item));
166
+ }
167
+ ```
168
+
169
+ **Benefits**:
170
+ - No single-string size limit (rows are written individually)
171
+ - Built-in indexing
172
+ - ACID guarantees
173
+ - Standard format
174
+ - Query capabilities
175
+
176
+ **Drawbacks**:
177
+ - Major refactor
178
+ - Additional dependency
179
+ - Migration complexity
180
+
181
+ ---
182
+
183
+ ## Recommended Approach
184
+
185
+ ### Phase 1: Immediate (4-6 hours)
186
+ **Implement Option 1: Binary Format (MessagePack)**
187
+
188
+ 1. Add dependency: `@msgpack/msgpack`
189
+ 2. Update `saveMeta()` to use msgpack
190
+ 3. Update `loadMeta()` to use msgpack
191
+ 4. Add migration for existing JSON files
192
+ 5. Update file extension: `.meta.json` → `.meta.bin`
193
+ 6. Add early size validation (warn if estimated >100MB)
194
+
195
+ ### Phase 2: Short-term (2-3 hours)
196
+ **Implement Option 3: Reduce Metadata**
197
+
198
+ 1. Audit metadata per section
199
+ 2. Remove redundant content field
200
+ 3. Optimize nested metadata objects
201
+ 4. Document what's stored and why
202
+
203
+ ### Phase 3: Long-term (16-20 hours)
204
+ **Consider Option 4: SQLite Storage**
205
+
206
+ 1. Spike: Prototype SQLite implementation
207
+ 2. Benchmark: Compare performance vs binary files
208
+ 3. Decide: If benefits justify effort
209
+ 4. Implement: If approved
210
+
211
+ ---
212
+
213
+ ## Implementation Details
214
+
215
+ ### Priority 1: Binary Format
216
+
217
+ **Files to Modify**:
218
+ - `src/embeddings/vector-store.ts` (save/load methods)
219
+ - `package.json` (add dependency)
220
+ - `src/embeddings/types.ts` (update type docs)
221
+
222
+ **New Code**:
223
+ ```typescript
224
+ import * as msgpack from '@msgpack/msgpack';
225
+
226
+ export class VectorStore {
227
+ private getMetaPath(): string {
228
+ return path.join(this.indexDir, 'vectors.meta.bin'); // Changed extension
229
+ }
230
+
231
+ private saveMeta(meta: VectorMetadata[]): Effect.Effect<void, VectorStoreError> {
232
+ return Effect.tryPromise({
233
+ try: async () => {
234
+ // Validate size before encoding
235
+ const estimatedSize = meta.length * 15000; // 15KB per section
236
+ if (estimatedSize > 100_000_000) {
237
+ console.warn(
238
+ `Large metadata detected: ~${(estimatedSize / 1e6).toFixed(0)}MB. ` +
239
+ `Consider indexing subdirectories separately.`
240
+ );
241
+ }
242
+
243
+ // Encode with MessagePack
244
+ const encoded = msgpack.encode(meta);
245
+ await fs.writeFile(this.getMetaPath(), encoded);
246
+ },
247
+ catch: (e) =>
248
+ new VectorStoreError({
249
+ operation: 'save',
250
+ message: `Failed to write metadata: ${e instanceof Error ? e.message : String(e)}`,
251
+ cause: e,
252
+ })
253
+ });
254
+ }
255
+
256
+ private loadMeta(): Effect.Effect<VectorMetadata[], VectorStoreError> {
257
+ return Effect.tryPromise({
258
+ try: async () => {
259
+ const metaPath = this.getMetaPath();
260
+
261
+ // Try binary format first (new); fs/promises has no exists(), so probe with access()
262
+ if (await fs.access(metaPath).then(() => true, () => false)) {
263
+ const buffer = await fs.readFile(metaPath);
264
+ return msgpack.decode(buffer) as VectorMetadata[];
265
+ }
266
+
267
+ // Fall back to JSON for migration (old)
268
+ const jsonPath = metaPath.replace('.bin', '.json');
269
+ if (await fs.access(jsonPath).then(() => true, () => false)) {
270
+ const json = await fs.readFile(jsonPath, 'utf-8');
271
+ const meta = JSON.parse(json) as VectorMetadata[];
272
+
273
+ // Auto-migrate to binary format
274
+ await Effect.runPromise(this.saveMeta(meta));
275
+ await fs.unlink(jsonPath); // Remove old JSON
276
+
277
+ return meta;
278
+ }
279
+
280
+ return [];
281
+ },
282
+ catch: (e) =>
283
+ new VectorStoreError({
284
+ operation: 'load',
285
+ message: `Failed to read metadata: ${e instanceof Error ? e.message : String(e)}`,
286
+ cause: e,
287
+ })
288
+ });
289
+ }
290
+ }
291
+ ```
292
+
293
+ **Testing Plan**:
294
+ 1. Unit tests: Encode/decode various sizes
295
+ 2. Integration: Test on mdcontext corpus (120 docs) - should still work
296
+ 3. Integration: Test on agentic-flow corpus (1,558 docs) - should now work
297
+ 4. Migration: Test auto-migration from JSON to binary
298
+ 5. Performance: Benchmark binary vs JSON (expect 2-5x faster)
299
+
300
+ **Rollout**:
301
+ 1. Implement in feature branch
302
+ 2. Test on both corpora
303
+ 3. Merge to main
304
+ 4. Document in CHANGELOG.md
305
+ 5. Notify users to re-index if they have existing embeddings
306
+
307
+ ---
308
+
309
+ ## Validation Criteria
310
+
311
+ ### Success Metrics
312
+ - ✅ agentic-flow (1,558 docs) completes successfully
313
+ - ✅ Metadata file size reduced by 30-50%
314
+ - ✅ Load/save times improved by 2-5x
315
+ - ✅ Auto-migration from JSON works
316
+ - ✅ No regressions on small corpora
317
+
318
+ ### Edge Cases to Test
319
+ 1. Empty corpus (0 docs)
320
+ 2. Single doc corpus
321
+ 3. Small corpus (120 docs) - existing test
322
+ 4. Medium corpus (500-1000 docs) - new test needed
323
+ 5. Large corpus (1500+ docs) - agentic-flow
324
+ 6. Very large corpus (5000+ docs) - future test
325
+
326
+ ---
327
+
328
+ ## Risk Assessment
329
+
330
+ ### Low Risk
331
+ - Binary format is well-tested (MessagePack)
332
+ - Backward compatible (auto-migration)
333
+ - Easy rollback (keep JSON generation as fallback)
334
+
335
+ ### Medium Risk
336
+ - Dependencies (new npm package)
337
+ - File format change (could affect external tools)
338
+ - Migration complexity (testing required)
339
+
340
+ ### Mitigation
341
+ 1. Thorough testing on both small and large corpora
342
+ 2. Keep JSON as fallback/export option
343
+ 3. Clear migration documentation
344
+ 4. Version metadata format (add version field)
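The fourth mitigation, a version field, could look roughly like the sketch below. It is an assumption-laden illustration: the envelope shape, the `META_FORMAT_VERSION` constant, and the `wrapMeta`/`unwrapMeta` helpers are hypothetical, and a bare array is treated as the legacy unversioned format.

```typescript
// Hypothetical numbering: a bare array is the legacy format, 2 is the first versioned envelope.
const META_FORMAT_VERSION = 2;

interface MetaEnvelope<T> {
  version: number;
  entries: T[];
}

/** Wraps entries in a versioned envelope before encoding. */
function wrapMeta<T>(entries: T[]): MetaEnvelope<T> {
  return { version: META_FORMAT_VERSION, entries };
}

/** Accepts either a legacy bare array or a versioned envelope; rejects anything else. */
function unwrapMeta<T>(decoded: unknown): T[] {
  if (Array.isArray(decoded)) {
    return decoded as T[]; // legacy: no version field
  }
  if (
    typeof decoded === "object" &&
    decoded !== null &&
    "version" in decoded &&
    "entries" in decoded
  ) {
    const envelope = decoded as MetaEnvelope<T>;
    if (envelope.version === META_FORMAT_VERSION && Array.isArray(envelope.entries)) {
      return envelope.entries;
    }
    throw new Error(`Unsupported metadata format version: ${String(envelope.version)}`);
  }
  throw new Error("Unrecognized metadata format");
}
```

Failing loudly on an unknown version lets an older build refuse a newer file instead of misreading it.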
345
+
346
+ ---
347
+
348
+ ## Timeline
349
+
350
+ | Task | Effort | Dependencies |
351
+ |------|--------|--------------|
352
+ | Add MessagePack dependency | 15 min | None |
353
+ | Implement saveMeta binary | 1 hour | Dependency |
354
+ | Implement loadMeta binary | 1 hour | saveMeta |
355
+ | Add migration logic | 1 hour | loadMeta |
356
+ | Add size validation | 30 min | None |
357
+ | Write unit tests | 1 hour | Implementation |
358
+ | Test on mdcontext | 15 min | Tests |
359
+ | Test on agentic-flow | 30 min | Tests |
360
+ | Documentation | 30 min | Testing |
361
+ | Code review & merge | 30 min | Documentation |
362
+
363
+ **Total**: ~6 hours
364
+
365
+ ---
366
+
367
+ ## Next Steps
368
+
369
+ 1. ✅ **Completed**: Test and document current behavior
370
+ 2. ⏭️ **Next**: Implement binary format (Option 1)
371
+ 3. 🔜 **After**: Reduce metadata size (Option 3)
372
+ 4. 🔮 **Future**: Consider SQLite (Option 4)
373
+
374
+ ---
375
+
376
+ ## References
377
+
378
+ - MessagePack: https://msgpack.org/
379
+ - CBOR: https://cbor.io/
380
+ - V8 String Limits: https://v8.dev/blog/string-length
381
+ - Test Results: `01-index-embed.md` (in this directory)
382
+
383
+ ---
384
+
385
+ **Created**: 2026-01-27
386
+ **Author**: Claude Sonnet 4.5
387
+ **Status**: Ready for implementation
388
+ **Priority**: P0 - Critical
@@ -0,0 +1,167 @@
1
+ # P0 Bug Validation Results
2
+
3
+ **Test Date:** 2026-01-27
4
+ **Test Corpus:** agentic-flow (1,558 documents, 52,714 sections, ~9M tokens)
5
+ **Bug:** VectorStoreError - Invalid string length at JSON.stringify
6
+
7
+ ---
8
+
9
+ ## Test Matrix
10
+
11
+ | Provider | Model | Duration | Result | Error Location |
12
+ |----------|-------|----------|--------|----------------|
13
+ | OpenAI | text-embedding-3-small | 12m 48s | ❌ FAILED | JSON.stringify at save |
14
+ | OpenRouter | (default embed model) | 12m 51s | ❌ FAILED | JSON.stringify at save |
15
+ | Ollama | nomic-embed-text | 12m 06s | ❌ FAILED | JSON.stringify at save |
16
+
17
+ **Success Rate:** 0/3 (0%)
18
+ **Reproducibility:** 100% - All providers fail identically
19
+
20
+ ---
21
+
22
+ ## Error Output (Identical Across All Providers)
23
+
24
+ ```
25
+ VectorStoreError: Failed to write metadata: Invalid string length
26
+ at catch (file:///Users/alphab/Dev/LLM/DEV/mdcontext/dist/chunk-KHU56VDO.js:1881:25)
27
+ ...
28
+ operation: 'save',
29
+ _tag: 'VectorStoreError',
30
+ [cause]: RangeError: Invalid string length
31
+ at JSON.stringify (<anonymous>)
32
+ at try (file:///Users/alphab/Dev/LLM/DEV/mdcontext/dist/chunk-KHU56VDO.js:1880:61)
33
+ ```
34
+
35
+ ---
36
+
37
+ ## What This Proves
38
+
39
+ ### ✅ Bug is Real
40
+ All three embedding providers (commercial and local) fail at the exact same point with the exact same error.
41
+
42
+ ### ✅ Not Provider-Specific
43
+ The bug occurs in mdcontext's vector store layer, not in any provider's embedding generation. All providers successfully generate embeddings but fail when mdcontext tries to save the metadata.
44
+
45
+ ### ✅ Scale-Dependent
46
+ The bug only manifests on large corpora:
47
+ - Small corpus (120 docs, 58MB metadata): ✅ Works
48
+ - Large corpus (1,558 docs, ~785MB metadata): ❌ Fails
49
+
50
+ ### ✅ Root Cause Confirmed
51
+ The error `RangeError: Invalid string length at JSON.stringify` occurs when JavaScript tries to stringify metadata that exceeds V8's string length limit (~512MB).
52
+
53
+ **Calculation:**
54
+ - 1,558 documents containing 52,714 sections in total (~34 per document) = massive metadata object
55
+ - Each section has embedding vector (1536 dimensions) + metadata
56
+ - JSON.stringify converts entire object to string
57
+ - String exceeds V8 limit → RangeError
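Until the storage format changes, the failure can at least be reported more clearly. The following is a minimal sketch, not the package's error handling: `tryStringify` is a hypothetical helper that converts the `RangeError` into a result value, and the serializer is injectable so the failure path can be exercised without building a 512MB string.

```typescript
type StringifyResult =
  | { ok: true; json: string }
  | { ok: false; reason: string };

/** Serializes a value, turning a string-length RangeError into a descriptive result. */
function tryStringify(
  value: unknown,
  stringify: (v: unknown) => string = JSON.stringify
): StringifyResult {
  try {
    return { ok: true, json: stringify(value) };
  } catch (e) {
    if (e instanceof RangeError) {
      return {
        ok: false,
        reason: `Serialized metadata exceeds the engine's string limit: ${e.message}`,
      };
    }
    throw e; // unrelated errors (for example, circular references) propagate unchanged
  }
}
```

This does not avoid the crash cause; it only makes the outcome explicit so a caller could fall back to another format or report the corpus size.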
58
+
59
+ ---
60
+
61
+ ## Timeline
62
+
63
+ Each test followed this pattern:
64
+
65
+ 1. **Index phase** (~14s): Successfully index all 1,558 markdown files
66
+ 2. **Embedding generation** (~12 minutes): Successfully create embeddings for all sections
67
+ 3. **Verification phase**: Successfully verify embeddings exist
68
+ 4. **Save metadata** ❌ **CRASH**: JSON.stringify fails with string length error
69
+
70
+ All providers complete 95% of the work successfully, then crash at the final save step.
71
+
72
+ ---
73
+
74
+ ## Provider-Specific Notes
75
+
76
+ ### OpenAI (text-embedding-3-small)
77
+ - Embedding generation: ✅ Success
78
+ - API costs: ~$0.18 (estimated)
79
+ - Metadata save: ❌ Failed at JSON.stringify
80
+
81
+ ### OpenRouter (default model)
82
+ - Embedding generation: ✅ Success
83
+ - API costs: ~$0.18 (estimated)
84
+ - Metadata save: ❌ Failed at JSON.stringify
85
+
86
+ ### Ollama (nomic-embed-text, local)
87
+ - Embedding generation: ✅ Success
88
+ - API costs: $0 (local)
89
+ - Metadata save: ❌ Failed at JSON.stringify
90
+ - Slightly faster (12m 06s vs ~12m 50s) due to local execution
91
+
92
+ ---
93
+
94
+ ## Implications
95
+
96
+ ### For Users
97
+ 1. **Cannot use semantic search on large repos** (>1,500 docs)
98
+ 2. **Wasted API costs** - Providers charge for embeddings that can't be saved
99
+ 3. **Time wasted** - 12+ minutes to discover the failure
100
+ 4. **No provider-side workaround exists** - Switching providers does not help; the bug blocks ALL providers equally
101
+
102
+ ### For Development
103
+ 1. **P0 Priority** - Blocks core feature (semantic search)
104
+ 2. **Architecture issue** - Not a simple bug fix, requires storage format change
105
+ 3. **Well-understood** - Root cause clear, fix path identified
106
+ 4. **Testable** - Reproducible 100% of the time with agentic-flow corpus
107
+
108
+ ---
109
+
110
+ ## Recommended Fix
111
+
112
+ **Replace JSON with MessagePack binary format** - See `BUG-FIX-PLAN.md` for:
113
+ - Complete implementation code
114
+ - Migration strategy
115
+ - Testing plan
116
+ - 6-hour effort estimate
117
+
118
+ **Priority:** Critical - Blocks large-scale production use
119
+
120
+ ---
121
+
122
+ ## Test Commands Used
123
+
124
+ ### OpenAI Test
125
+ ```bash
126
+ cd /Users/alphab/Dev/LLM/DEV/agentic-flow
127
+ mdcontext index --embed --provider openai 2>&1 | tee /tmp/test-openai.log
128
+ ```
129
+
130
+ ### OpenRouter Test
131
+ ```bash
132
+ cd /Users/alphab/Dev/LLM/DEV/agentic-flow
133
+ mdcontext index --embed --provider openrouter 2>&1 | tee /tmp/test-openrouter.log
134
+ ```
135
+
136
+ ### Ollama Test
137
+ ```bash
138
+ cd /Users/alphab/Dev/LLM/DEV/agentic-flow
139
+ mdcontext index --embed --provider ollama --provider-model nomic-embed-text 2>&1 | tee /tmp/test-ollama.log
140
+ ```
141
+
142
+ ---
143
+
144
+ ## Conclusion
145
+
146
+ The P0 bug is **validated and fully understood**:
147
+ - ✅ 100% reproducible
148
+ - ✅ Affects all providers equally
149
+ - ✅ Root cause identified (JSON.stringify string limit)
150
+ - ✅ Fix available (MessagePack binary format)
151
+ - ✅ Timeline understood (6-hour implementation)
152
+
153
+ **This is a critical blocker for production use of semantic search on large documentation repositories.**
154
+
155
+ The bug should be prioritized immediately to unblock:
156
+ - Large-scale semantic search
157
+ - Production embedding workflows
158
+ - Cost-effective local embedding (Ollama)
159
+ - Knowledge base indexing at scale
160
+
161
+ ---
162
+
163
+ **Next Steps:**
164
+ 1. Implement MessagePack fix per `BUG-FIX-PLAN.md`
165
+ 2. Add tests for large-corpus scenarios
166
+ 3. Update docs with corpus size limitations (until fixed)
167
+ 4. Consider chunked saves as interim workaround