mdcontext 0.1.0 → 0.2.0
- package/.changeset/config.json +9 -9
- package/.claude/settings.local.json +25 -0
- package/.github/workflows/claude-code-review.yml +44 -0
- package/.github/workflows/claude.yml +85 -0
- package/CONTRIBUTING.md +186 -0
- package/NOTES/NOTES +44 -0
- package/README.md +206 -3
- package/biome.json +1 -1
- package/dist/chunk-23UPXDNL.js +3044 -0
- package/dist/chunk-2W7MO2DL.js +1366 -0
- package/dist/chunk-3NUAZGMA.js +1689 -0
- package/dist/chunk-7TOWB2XB.js +366 -0
- package/dist/chunk-7XOTOADQ.js +3065 -0
- package/dist/chunk-AH2PDM2K.js +3042 -0
- package/dist/chunk-BNXWSZ63.js +3742 -0
- package/dist/chunk-BTL5DJVU.js +3222 -0
- package/dist/chunk-HDHYG7E4.js +104 -0
- package/dist/chunk-HLR4KZBP.js +3234 -0
- package/dist/chunk-IP3FRFEB.js +1045 -0
- package/dist/chunk-KHU56VDO.js +3042 -0
- package/dist/chunk-KRYIFLQR.js +85 -89
- package/dist/chunk-LBSDNLEM.js +287 -0
- package/dist/chunk-MNTQ7HCP.js +2643 -0
- package/dist/chunk-MUJELQQ6.js +1387 -0
- package/dist/chunk-MXJGMSLV.js +2199 -0
- package/dist/chunk-N6QJGC3Z.js +2636 -0
- package/dist/chunk-OBELGBPM.js +1713 -0
- package/dist/chunk-OT7R5XTA.js +3192 -0
- package/dist/chunk-P7X4RA2T.js +106 -0
- package/dist/chunk-PIDUQNC2.js +3185 -0
- package/dist/chunk-POGCDIH4.js +3187 -0
- package/dist/chunk-PSIEOQGZ.js +3043 -0
- package/dist/chunk-PVRT3IHA.js +3238 -0
- package/dist/chunk-QNN4TT23.js +1430 -0
- package/dist/chunk-RE3R45RJ.js +3042 -0
- package/dist/chunk-S7E6TFX6.js +718 -657
- package/dist/chunk-SG6GLU4U.js +1378 -0
- package/dist/chunk-SJCDV2ST.js +274 -0
- package/dist/chunk-SYE5XLF3.js +104 -0
- package/dist/chunk-T5VLYBZD.js +103 -0
- package/dist/chunk-TOQB7VWU.js +3238 -0
- package/dist/chunk-VFNMZ4ZQ.js +3228 -0
- package/dist/chunk-VVTGZNBT.js +1533 -1423
- package/dist/chunk-W7Q4RFEV.js +104 -0
- package/dist/chunk-XTYYVRLO.js +3190 -0
- package/dist/chunk-Y6MDYVJD.js +3063 -0
- package/dist/cli/main.js +4072 -629
- package/dist/index.d.ts +420 -33
- package/dist/index.js +8 -15
- package/dist/mcp/server.js +103 -7
- package/dist/schema-BAWSG7KY.js +22 -0
- package/dist/schema-E3QUPL26.js +20 -0
- package/dist/schema-EHL7WUT6.js +20 -0
- package/docs/019-USAGE.md +44 -5
- package/docs/020-current-implementation.md +8 -8
- package/docs/021-DOGFOODING-FINDINGS.md +1 -1
- package/docs/CONFIG.md +1123 -0
- package/docs/ERRORS.md +383 -0
- package/docs/summarization.md +320 -0
- package/justfile +40 -0
- package/package.json +39 -33
- package/research/INDEX.md +315 -0
- package/research/code-review/README.md +90 -0
- package/research/code-review/cli-error-handling-review.md +979 -0
- package/research/code-review/code-review-validation-report.md +464 -0
- package/research/code-review/main-ts-review.md +1128 -0
- package/research/config-docs/SUMMARY.md +357 -0
- package/research/config-docs/TEST-RESULTS.md +776 -0
- package/research/config-docs/TODO.md +542 -0
- package/research/config-docs/analysis.md +744 -0
- package/research/config-docs/fix-validation.md +502 -0
- package/research/config-docs/help-audit.md +264 -0
- package/research/config-docs/help-system-analysis.md +890 -0
- package/research/frontmatter/COMMENTS-ARE-SKIPPED.md +149 -0
- package/research/frontmatter/LLM-CODE-NAVIGATION.md +276 -0
- package/research/issue-review.md +603 -0
- package/research/llm-summarization/agent-cli-tools-2026.md +1082 -0
- package/research/llm-summarization/alternative-providers-2026.md +1428 -0
- package/research/llm-summarization/anthropic-2026.md +367 -0
- package/research/llm-summarization/claude-cli-integration.md +1706 -0
- package/research/llm-summarization/cli-integration-patterns.md +3155 -0
- package/research/llm-summarization/openai-2026.md +473 -0
- package/research/llm-summarization/openai-compatible-providers-2026.md +1022 -0
- package/research/llm-summarization/opencode-cli-integration.md +1552 -0
- package/research/llm-summarization/prompt-engineering-2026.md +1426 -0
- package/research/llm-summarization/prototype-results.md +56 -0
- package/research/llm-summarization/provider-switching-patterns-2026.md +2153 -0
- package/research/llm-summarization/typescript-llm-libraries-2026.md +2436 -0
- package/research/mdcontext-pudding/00-EXECUTIVE-SUMMARY.md +282 -0
- package/research/mdcontext-pudding/01-index-embed.md +956 -0
- package/research/mdcontext-pudding/02-search-COMMANDS.md +142 -0
- package/research/mdcontext-pudding/02-search-SUMMARY.md +146 -0
- package/research/mdcontext-pudding/02-search.md +970 -0
- package/research/mdcontext-pudding/03-context.md +779 -0
- package/research/mdcontext-pudding/04-navigation-and-analytics.md +803 -0
- package/research/mdcontext-pudding/04-tree.md +704 -0
- package/research/mdcontext-pudding/05-config.md +1038 -0
- package/research/mdcontext-pudding/06-links-summary.txt +87 -0
- package/research/mdcontext-pudding/06-links.md +679 -0
- package/research/mdcontext-pudding/07-stats.md +693 -0
- package/research/mdcontext-pudding/BUG-FIX-PLAN.md +388 -0
- package/research/mdcontext-pudding/P0-BUG-VALIDATION.md +167 -0
- package/research/mdcontext-pudding/README.md +168 -0
- package/research/mdcontext-pudding/TESTING-SUMMARY.md +128 -0
- package/research/research-quality-review.md +834 -0
- package/research/semantic-search/embedding-text-analysis.md +156 -0
- package/research/semantic-search/multi-word-failure-reproduction.md +171 -0
- package/research/semantic-search/query-processing-analysis.md +207 -0
- package/research/semantic-search/root-cause-and-solution.md +114 -0
- package/research/semantic-search/threshold-validation-report.md +69 -0
- package/research/semantic-search/vector-search-analysis.md +63 -0
- package/research/test-path-issues.md +276 -0
- package/review/ALP-76/1-error-type-design.md +962 -0
- package/review/ALP-76/2-error-handling-patterns.md +906 -0
- package/review/ALP-76/3-error-presentation.md +624 -0
- package/review/ALP-76/4-test-coverage.md +625 -0
- package/review/ALP-76/5-migration-completeness.md +440 -0
- package/review/ALP-76/6-effect-best-practices.md +755 -0
- package/scripts/apply-branch-protection.sh +47 -0
- package/scripts/branch-protection-templates.json +79 -0
- package/scripts/prototype-summarization.ts +346 -0
- package/scripts/rebuild-hnswlib.js +32 -37
- package/scripts/setup-branch-protection.sh +64 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/active-provider.json +7 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/bm25.json +541 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/bm25.meta.json +5 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/config.json +8 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/embeddings/openai_text-embedding-3-small_512/vectors.bin +0 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/embeddings/openai_text-embedding-3-small_512/vectors.meta.bin +0 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/indexes/documents.json +60 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/indexes/links.json +13 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/indexes/sections.json +1197 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/configuration-management.md +99 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/distributed-systems.md +92 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/error-handling.md +78 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/failure-automation.md +55 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/job-context.md +69 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/process-orchestration.md +99 -0
- package/src/cli/argv-preprocessor.test.ts +2 -2
- package/src/cli/cli.test.ts +230 -33
- package/src/cli/commands/config-cmd.ts +642 -0
- package/src/cli/commands/context.ts +97 -9
- package/src/cli/commands/duplicates.ts +122 -0
- package/src/cli/commands/embeddings.ts +529 -0
- package/src/cli/commands/index-cmd.ts +210 -30
- package/src/cli/commands/index.ts +3 -0
- package/src/cli/commands/search.ts +894 -64
- package/src/cli/commands/stats.ts +3 -0
- package/src/cli/commands/tree.ts +26 -5
- package/src/cli/config-layer.ts +176 -0
- package/src/cli/error-handler.test.ts +235 -0
- package/src/cli/error-handler.ts +655 -0
- package/src/cli/flag-schemas.ts +66 -0
- package/src/cli/help.ts +209 -7
- package/src/cli/main.ts +348 -58
- package/src/cli/options.ts +10 -0
- package/src/cli/shared-error-handling.ts +199 -0
- package/src/cli/utils.ts +150 -17
- package/src/config/file-provider.test.ts +320 -0
- package/src/config/file-provider.ts +273 -0
- package/src/config/index.ts +72 -0
- package/src/config/integration.test.ts +667 -0
- package/src/config/precedence.test.ts +277 -0
- package/src/config/precedence.ts +451 -0
- package/src/config/schema.test.ts +414 -0
- package/src/config/schema.ts +603 -0
- package/src/config/service.test.ts +320 -0
- package/src/config/service.ts +243 -0
- package/src/config/testing.test.ts +264 -0
- package/src/config/testing.ts +110 -0
- package/src/core/types.ts +6 -33
- package/src/duplicates/detector.test.ts +183 -0
- package/src/duplicates/detector.ts +414 -0
- package/src/duplicates/index.ts +18 -0
- package/src/embeddings/embedding-namespace.test.ts +300 -0
- package/src/embeddings/embedding-namespace.ts +947 -0
- package/src/embeddings/heading-boost.test.ts +222 -0
- package/src/embeddings/hnsw-build-options.test.ts +198 -0
- package/src/embeddings/hyde.test.ts +272 -0
- package/src/embeddings/hyde.ts +264 -0
- package/src/embeddings/index.ts +2 -0
- package/src/embeddings/openai-provider.ts +332 -83
- package/src/embeddings/pricing.json +22 -0
- package/src/embeddings/provider-constants.ts +204 -0
- package/src/embeddings/provider-errors.test.ts +967 -0
- package/src/embeddings/provider-errors.ts +565 -0
- package/src/embeddings/provider-factory.test.ts +240 -0
- package/src/embeddings/provider-factory.ts +225 -0
- package/src/embeddings/provider-integration.test.ts +788 -0
- package/src/embeddings/query-preprocessing.test.ts +187 -0
- package/src/embeddings/semantic-search-threshold.test.ts +508 -0
- package/src/embeddings/semantic-search.ts +780 -93
- package/src/embeddings/types.ts +293 -16
- package/src/embeddings/vector-store.ts +486 -77
- package/src/embeddings/voyage-provider.ts +313 -0
- package/src/errors/errors.test.ts +845 -0
- package/src/errors/index.ts +533 -0
- package/src/index/ignore-patterns.test.ts +354 -0
- package/src/index/ignore-patterns.ts +305 -0
- package/src/index/indexer.ts +286 -48
- package/src/index/storage.ts +94 -30
- package/src/index/types.ts +40 -2
- package/src/index/watcher.ts +67 -9
- package/src/index.ts +22 -0
- package/src/integration/search-keyword.test.ts +678 -0
- package/src/mcp/server.ts +135 -6
- package/src/parser/parser.ts +18 -19
- package/src/parser/section-filter.test.ts +277 -0
- package/src/parser/section-filter.ts +125 -3
- package/src/search/__tests__/hybrid-search.test.ts +650 -0
- package/src/search/bm25-store.ts +366 -0
- package/src/search/cross-encoder.test.ts +253 -0
- package/src/search/cross-encoder.ts +406 -0
- package/src/search/fuzzy-search.test.ts +419 -0
- package/src/search/fuzzy-search.ts +273 -0
- package/src/search/hybrid-search.ts +448 -0
- package/src/search/path-matcher.test.ts +276 -0
- package/src/search/path-matcher.ts +33 -0
- package/src/search/searcher.test.ts +99 -1
- package/src/search/searcher.ts +189 -67
- package/src/search/wink-bm25.d.ts +30 -0
- package/src/summarization/cli-providers/claude.ts +202 -0
- package/src/summarization/cli-providers/detection.test.ts +273 -0
- package/src/summarization/cli-providers/detection.ts +118 -0
- package/src/summarization/cli-providers/index.ts +8 -0
- package/src/summarization/cost.test.ts +139 -0
- package/src/summarization/cost.ts +102 -0
- package/src/summarization/error-handler.test.ts +127 -0
- package/src/summarization/error-handler.ts +111 -0
- package/src/summarization/index.ts +102 -0
- package/src/summarization/pipeline.test.ts +498 -0
- package/src/summarization/pipeline.ts +231 -0
- package/src/summarization/prompts.test.ts +269 -0
- package/src/summarization/prompts.ts +133 -0
- package/src/summarization/provider-factory.test.ts +396 -0
- package/src/summarization/provider-factory.ts +178 -0
- package/src/summarization/types.ts +184 -0
- package/src/summarize/summarizer.ts +104 -35
- package/src/types/huggingface-transformers.d.ts +66 -0
- package/tests/fixtures/cli/.mdcontext/active-provider.json +7 -0
- package/tests/fixtures/cli/.mdcontext/embeddings/openai_text-embedding-3-small_512/vectors.bin +0 -0
- package/tests/fixtures/cli/.mdcontext/embeddings/openai_text-embedding-3-small_512/vectors.meta.bin +0 -0
- package/tests/fixtures/cli/.mdcontext/indexes/documents.json +4 -4
- package/tests/fixtures/cli/.mdcontext/indexes/sections.json +14 -0
- package/tests/integration/embed-index.test.ts +712 -0
- package/tests/integration/search-context.test.ts +469 -0
- package/tests/integration/search-semantic.test.ts +522 -0
- package/vitest.config.ts +1 -6
- package/AGENTS.md +0 -46
- package/tests/fixtures/cli/.mdcontext/vectors.bin +0 -0
- package/tests/fixtures/cli/.mdcontext/vectors.meta.json +0 -1264
@@ -0,0 +1,388 @@

# Bug Fix Plan: Vector Metadata Save Error

**Bug ID**: Critical Vector Store Metadata Serialization Failure
**Severity**: P0 - Critical (Blocks production use)
**Impact**: Cannot embed large corpora (>1500 docs) with any provider
**Location**: `src/embeddings/vector-store.ts:401`
**Validation**: ✅ 100% reproducible - All providers (OpenAI, OpenRouter, Ollama) fail identically on 1,558-doc agentic-flow corpus

---

## Problem Statement

The vector store saves metadata as JSON using `JSON.stringify()`, which cannot produce a string longer than V8's maximum string length (~512MB). For large corpora (>1500 docs), the metadata object serializes to more than that limit, so the save crashes after the embeddings have already been generated.

### Current Code
```typescript
// src/embeddings/vector-store.ts:401
yield* Effect.tryPromise({
  try: () =>
    fs.writeFile(this.getMetaPath(), JSON.stringify(meta, null, 2)),
  catch: (e) =>
    new VectorStoreError({
      operation: 'save',
      // ...
    })
})
```

### Evidence
- **mdcontext (120 docs, 3903 sections)**: 58MB metadata → Works ✅
- **agentic-flow (1561 docs, 52,714 sections)**: ~785MB metadata → Fails ❌
- **Calculation**: 14.9KB per section × 52,714 sections = 785MB
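
The calculation above can be reproduced with a short sketch. The per-section size and section counts come from the evidence list; the limit constant (2^29 − 24 UTF-16 code units) is an assumption about 64-bit V8 builds, and Node.js reports the actual value as `buffer.constants.MAX_STRING_LENGTH`. The function and constant names are illustrative and are not part of mdcontext.

```typescript
// Assumed V8 string-length limit on 64-bit builds, in UTF-16 code units.
// For mostly-ASCII JSON this is roughly 512MB of text.
const V8_MAX_STRING_LENGTH = 2 ** 29 - 24;

// Average serialized size per section, taken from the evidence above.
const BYTES_PER_SECTION = 14_900;

interface SizeEstimate {
  estimatedBytes: number;
  estimatedMB: number;
  exceedsLimit: boolean;
}

// Hypothetical helper: estimate the JSON metadata size for a corpus.
function estimateMetadataSize(
  sectionCount: number,
  bytesPerSection: number = BYTES_PER_SECTION,
): SizeEstimate {
  const estimatedBytes = sectionCount * bytesPerSection;
  return {
    estimatedBytes,
    estimatedMB: Math.round(estimatedBytes / 1e6),
    // For mostly-ASCII JSON, string length is close to byte length.
    exceedsLimit: estimatedBytes > V8_MAX_STRING_LENGTH,
  };
}

const small = estimateMetadataSize(3_903); // mdcontext corpus
const large = estimateMetadataSize(52_714); // agentic-flow corpus
```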

---

## Solution Options

### Option 1: Binary Format (RECOMMENDED)
**Effort**: 4-6 hours
**Impact**: Solves problem permanently, reduces file size

Replace JSON with MessagePack or CBOR:
```typescript
import * as msgpack from '@msgpack/msgpack';

// Save
const encoded = msgpack.encode(meta);
await fs.writeFile(this.getMetaPath(), encoded);

// Load
const buffer = await fs.readFile(this.getMetaPath());
const loaded = msgpack.decode(buffer) as VectorMetadata[];
```

**Benefits**:
- No V8 string-length limit (the output is binary, not a string)
- 30-50% smaller files
- Faster I/O
- Backward compatible (can auto-migrate)

**Dependencies**:
```bash
npm install @msgpack/msgpack
```

---

### Option 2: Chunked JSON
**Effort**: 6-8 hours
**Impact**: Solves problem, maintains JSON format

Split metadata into chunks:
```typescript
const CHUNK_SIZE = 1000; // sections per file
const chunks: number[] = []; // starting offset of each chunk file

for (let i = 0; i < meta.length; i += CHUNK_SIZE) {
  const chunk = meta.slice(i, i + CHUNK_SIZE);
  await fs.writeFile(
    `${this.getMetaPath()}.${i}.json`,
    JSON.stringify(chunk, null, 2)
  );
  chunks.push(i);
}

// Save index
await fs.writeFile(
  `${this.getMetaPath()}.index.json`,
  JSON.stringify({ chunks, chunkSize: CHUNK_SIZE })
);
```

**Benefits**:
- Maintains JSON format (easier debugging)
- Lazy loading possible
- Each chunk stays under limit

**Drawbacks**:
- Multiple files to manage
- More complex loading logic
- Slower than binary
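
To give a sense of the extra loading logic, here is a minimal in-memory sketch of the split and reassemble steps, with file I/O left out. The `Entry` type, the `ChunkFile` shape, and both function names are hypothetical; only the `<path>.<offset>.json` naming follows the snippet above.

```typescript
// Generic stand-in for a metadata record; the real shape is not assumed here.
type Entry = Record<string, unknown>;

interface ChunkFile {
  fileName: string; // <basePath>.<offset>.json
  offset: number; // index of the first entry held by this chunk
  json: string; // serialized chunk contents
}

// Serialize each chunk separately so that no single JSON string has to hold
// the whole corpus.
function toChunkFiles(entries: Entry[], basePath: string, chunkSize: number): ChunkFile[] {
  if (chunkSize < 1) throw new RangeError('chunkSize must be at least 1');
  const files: ChunkFile[] = [];
  for (let offset = 0; offset < entries.length; offset += chunkSize) {
    files.push({
      fileName: `${basePath}.${offset}.json`,
      offset,
      json: JSON.stringify(entries.slice(offset, offset + chunkSize)),
    });
  }
  return files;
}

// Reassemble chunks in offset order, as a loader would after reading the index.
function fromChunkFiles(files: ChunkFile[]): Entry[] {
  const ordered = [...files].sort((a, b) => a.offset - b.offset);
  const entries: Entry[] = [];
  for (const file of ordered) {
    entries.push(...(JSON.parse(file.json) as Entry[]));
  }
  return entries;
}
```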

---

### Option 3: Reduce Metadata (SHORT-TERM)
**Effort**: 2-3 hours
**Impact**: Delays problem, doesn't solve it

Audit what's stored per vector and remove redundancy:
```typescript
// Current (example)
const current = {
  sectionId: "abc123",
  documentId: "doc456",
  path: "/path/to/file.md",
  title: "Section Title",
  content: "Full section content...", // REMOVE THIS
  tokens: 150,
  hash: "sha256...",
  metadata: { /* ... */ } // Audit this
};

// Optimized
const optimized = {
  sectionId: "abc123",
  documentId: "doc456",
  // Remove content (already in sections.json)
  // Remove redundant metadata
};
```

**Benefits**:
- Quick to implement
- Reduces file size 5-10x

**Drawbacks**:
- Still hits limit on very large corpora
- Requires matching changes in load logic
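
A minimal sketch of the projection step, assuming the record shape from the example above (the real `VectorMetadata` fields in mdcontext may differ). `toSlimEntry` and `reductionFactor` are hypothetical helpers, and the measured ratio depends entirely on how long `content` is in a given corpus.

```typescript
// Field names follow the "Current (example)" record above; this is an
// illustrative shape, not the project's actual type.
interface FullEntry {
  sectionId: string;
  documentId: string;
  path: string;
  title: string;
  content: string;
  tokens: number;
  hash: string;
}

type SlimEntry = Pick<FullEntry, 'sectionId' | 'documentId'>;

// Keep only the identifiers; everything else can be looked up elsewhere.
function toSlimEntry(entry: FullEntry): SlimEntry {
  return { sectionId: entry.sectionId, documentId: entry.documentId };
}

// Ratio of serialized sizes for one record, to gauge the saving on real data.
function reductionFactor(entry: FullEntry): number {
  return JSON.stringify(entry).length / JSON.stringify(toSlimEntry(entry)).length;
}
```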

---

### Option 4: SQLite Storage (LONG-TERM)
**Effort**: 16-20 hours
**Impact**: Best long-term solution

Replace JSON files with SQLite:
```typescript
import Database from 'better-sqlite3';
import * as msgpack from '@msgpack/msgpack';

const db = new Database('.mdcontext/vectors.db');

// Create tables (idempotent, so this can run on every start)
db.exec(`
  CREATE TABLE IF NOT EXISTS vector_meta (
    section_id TEXT PRIMARY KEY,
    document_id TEXT,
    data BLOB
  );
  CREATE INDEX IF NOT EXISTS idx_doc ON vector_meta(document_id);
`);

// Save in one transaction; OR REPLACE lets a re-index overwrite existing rows
const stmt = db.prepare('INSERT OR REPLACE INTO vector_meta VALUES (?, ?, ?)');
const saveAll = db.transaction((items: VectorMetadata[]) => {
  for (const item of items) {
    // Bind the encoded bytes as a Buffer so they are stored as a BLOB
    stmt.run(item.sectionId, item.documentId, Buffer.from(msgpack.encode(item)));
  }
});
saveAll(meta); // `meta` is the VectorMetadata[] to persist
```

**Benefits**:
- No size limits
- Built-in indexing
- ACID guarantees
- Standard format
- Query capabilities

**Drawbacks**:
- Major refactor
- Additional dependency
- Migration complexity

---

## Recommended Approach

### Phase 1: Immediate (4-6 hours)
**Implement Option 1: Binary Format (MessagePack)**

1. Add dependency: `@msgpack/msgpack`
2. Update `saveMeta()` to use msgpack
3. Update `loadMeta()` to use msgpack
4. Add migration for existing JSON files
5. Update file extension: `.meta.json` → `.meta.bin`
6. Add early size validation (warn if estimated >100MB)

### Phase 2: Short-term (2-3 hours)
**Implement Option 3: Reduce Metadata**

1. Audit metadata per section
2. Remove redundant content field
3. Optimize nested metadata objects
4. Document what's stored and why

### Phase 3: Long-term (16-20 hours)
**Consider Option 4: SQLite Storage**

1. Spike: Prototype SQLite implementation
2. Benchmark: Compare performance vs binary files
3. Decide: If benefits justify effort
4. Implement: If approved

---

## Implementation Details

### Priority 1: Binary Format

**Files to Modify**:
- `src/embeddings/vector-store.ts` (save/load methods)
- `package.json` (add dependency)
- `src/embeddings/types.ts` (update type docs)

**New Code**:
```typescript
import * as fs from 'node:fs/promises';
import * as path from 'node:path';
import * as msgpack from '@msgpack/msgpack';
import { Effect } from 'effect';
// VectorStoreError and VectorMetadata are the project's existing types
// (their imports are omitted here).

const errorMessage = (e: unknown): string =>
  e instanceof Error ? e.message : String(e);

// fs/promises has no exists(); probe with access() instead
const fileExists = (p: string): Promise<boolean> =>
  fs.access(p).then(() => true, () => false);

export class VectorStore {
  // Existing fields and methods (e.g. indexDir) are omitted.

  private getMetaPath(): string {
    return path.join(this.indexDir, 'vectors.meta.bin'); // Changed extension
  }

  // Plain async helper shared by saveMeta() and the JSON migration
  private async writeMeta(meta: VectorMetadata[]): Promise<void> {
    // Validate size before encoding
    const estimatedSize = meta.length * 15000; // 15KB per section
    if (estimatedSize > 100_000_000) {
      console.warn(
        `Large metadata detected: ~${(estimatedSize / 1e6).toFixed(0)}MB. ` +
          `Consider indexing subdirectories separately.`
      );
    }

    // Encode with MessagePack
    const encoded = msgpack.encode(meta);
    await fs.writeFile(this.getMetaPath(), encoded);
  }

  // Returns an Effect that fails with VectorStoreError
  private saveMeta(meta: VectorMetadata[]) {
    return Effect.tryPromise({
      try: () => this.writeMeta(meta),
      catch: (e) =>
        new VectorStoreError({
          operation: 'save',
          message: `Failed to write metadata: ${errorMessage(e)}`,
          cause: e,
        }),
    });
  }

  // Returns an Effect that yields the metadata or fails with VectorStoreError
  private loadMeta() {
    return Effect.tryPromise({
      try: async (): Promise<VectorMetadata[]> => {
        const metaPath = this.getMetaPath();

        // Try binary format first (new)
        if (await fileExists(metaPath)) {
          const buffer = await fs.readFile(metaPath);
          return msgpack.decode(buffer) as VectorMetadata[];
        }

        // Fall back to JSON for migration (old); swap only the trailing extension
        const jsonPath = metaPath.replace(/\.bin$/, '.json');
        if (await fileExists(jsonPath)) {
          const json = await fs.readFile(jsonPath, 'utf-8');
          const meta = JSON.parse(json) as VectorMetadata[];

          // Auto-migrate to binary format
          await this.writeMeta(meta);
          await fs.unlink(jsonPath); // Remove old JSON

          return meta;
        }

        return [];
      },
      catch: (e) =>
        new VectorStoreError({
          operation: 'load',
          message: `Failed to read metadata: ${errorMessage(e)}`,
          cause: e,
        }),
    });
  }
}
```

**Testing Plan**:
1. Unit tests: Encode/decode various sizes
2. Integration: Test on mdcontext corpus (120 docs) - should still work
3. Integration: Test on agentic-flow corpus (1561 docs) - should now work
4. Migration: Test auto-migration from JSON to binary
5. Performance: Benchmark binary vs JSON (expect 2-5x faster)

**Rollout**:
1. Implement in feature branch
2. Test on both corpora
3. Merge to main
4. Document in CHANGELOG.md
5. Notify users to re-index if they have existing embeddings

---

## Validation Criteria

### Success Metrics
- ✅ agentic-flow (1561 docs) completes successfully
- ✅ Metadata file size reduced by 30-50%
- ✅ Load/save times improved by 2-5x
- ✅ Auto-migration from JSON works
- ✅ No regressions on small corpora

### Edge Cases to Test
1. Empty corpus (0 docs)
2. Single doc corpus
3. Small corpus (120 docs) - existing test
4. Medium corpus (500-1000 docs) - new test needed
5. Large corpus (1500+ docs) - agentic-flow
6. Very large corpus (5000+ docs) - future test
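
The size-based cases above could share one round-trip check that is independent of the codec. The sketch below is an assumption about how such a test helper might look: `Codec`, `makeEntries`, and `roundTripsAt` are hypothetical names, and a JSON-over-UTF-8 codec stands in for MessagePack so that the example needs no third-party package. The figure of 34 sections per document is only a rough ratio for synthetic data.

```typescript
// A codec is any encode/decode pair; a MessagePack codec would plug in the same way.
interface Codec<T> {
  encode(value: T): Uint8Array;
  decode(bytes: Uint8Array): T;
}

interface Entry {
  sectionId: string;
  documentId: string;
}

// JSON-over-UTF-8 stand-in so the sketch has no third-party dependency.
const jsonCodec: Codec<Entry[]> = {
  encode: (value) => new TextEncoder().encode(JSON.stringify(value)),
  decode: (bytes) => JSON.parse(new TextDecoder().decode(bytes)) as Entry[],
};

// Build a synthetic corpus with the requested number of sections.
function makeEntries(count: number): Entry[] {
  return Array.from({ length: count }, (_, i) => ({
    sectionId: `section-${i}`,
    documentId: `doc-${Math.floor(i / 34)}`,
  }));
}

// True when decode(encode(x)) reproduces x for every requested size.
function roundTripsAt(codec: Codec<Entry[]>, sizes: number[]): boolean {
  return sizes.every((size) => {
    const entries = makeEntries(size);
    const decoded = codec.decode(codec.encode(entries));
    return JSON.stringify(decoded) === JSON.stringify(entries);
  });
}
```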

---

## Risk Assessment

### Low Risk
- Binary format is well-tested (MessagePack)
- Backward compatible (auto-migration)
- Easy rollback (keep JSON generation as fallback)

### Medium Risk
- Dependencies (new npm package)
- File format change (could affect external tools)
- Migration complexity (testing required)

### Mitigation
1. Thorough testing on both small and large corpora
2. Keep JSON as fallback/export option
3. Clear migration documentation
4. Version metadata format (add version field)
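
One way to approach the version field is to wrap the entries in a small envelope and reject anything with an unknown version when loading. This is a sketch under stated assumptions: the envelope shape, the `META_FORMAT_VERSION` constant, and both function names are hypothetical and are not taken from mdcontext.

```typescript
// Hypothetical format version; bump it whenever the stored shape changes.
const META_FORMAT_VERSION = 1;

interface MetaEnvelope<T> {
  version: number;
  entries: T[];
}

// Wrap entries before encoding them with whichever codec is in use.
function wrapMeta<T>(entries: T[]): MetaEnvelope<T> {
  return { version: META_FORMAT_VERSION, entries };
}

// Accept a decoded value only if it is an envelope with a known version.
function unwrapMeta<T>(decoded: unknown): T[] {
  if (typeof decoded !== 'object' || decoded === null) {
    throw new Error('Metadata is not an object');
  }
  const { version, entries } = decoded as Partial<MetaEnvelope<T>>;
  if (version !== META_FORMAT_VERSION) {
    throw new Error(`Unsupported metadata format version: ${String(version)}`);
  }
  if (!Array.isArray(entries)) {
    throw new Error('Metadata entries are missing');
  }
  return entries;
}
```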

---

## Timeline

| Task | Effort | Dependencies |
|------|--------|--------------|
| Add MessagePack dependency | 15 min | None |
| Implement saveMeta binary | 1 hour | Dependency |
| Implement loadMeta binary | 1 hour | saveMeta |
| Add migration logic | 1 hour | loadMeta |
| Add size validation | 30 min | None |
| Write unit tests | 1 hour | Implementation |
| Test on mdcontext | 15 min | Tests |
| Test on agentic-flow | 30 min | Tests |
| Documentation | 30 min | Testing |
| Code review & merge | 30 min | Documentation |

**Total**: ~6.5 hours (390 minutes across the tasks above)

---

## Next Steps

1. ✅ **Completed**: Test and document current behavior
2. ⏭️ **Next**: Implement binary format (Option 1)
3. 🔜 **After**: Reduce metadata size (Option 3)
4. 🔮 **Future**: Consider SQLite (Option 4)

---

## References

- MessagePack: https://msgpack.org/
- CBOR: https://cbor.io/
- V8 String Limits: https://v8.dev/blog/string-length
- Test Results: `/Users/alphab/Dev/LLM/DEV/mdcontext/research/mdcontext-pudding/01-index-embed.md`

---

**Created**: 2026-01-27
**Author**: Claude Sonnet 4.5
**Status**: Ready for implementation
**Priority**: P0 - Critical

@@ -0,0 +1,167 @@

# P0 Bug Validation Results

**Test Date:** 2026-01-27
**Test Corpus:** agentic-flow (1,558 documents, 52,714 sections, ~9M tokens)
**Bug:** VectorStoreError - Invalid string length at JSON.stringify

---

## Test Matrix

| Provider | Model | Duration | Result | Error Location |
|----------|-------|----------|--------|----------------|
| OpenAI | text-embedding-3-small | 12m 48s | ❌ FAILED | JSON.stringify at save |
| OpenRouter | (default embed model) | 12m 51s | ❌ FAILED | JSON.stringify at save |
| Ollama | nomic-embed-text | 12m 06s | ❌ FAILED | JSON.stringify at save |

**Success Rate:** 0/3 (0%)
**Reproducibility:** 100% - All providers fail identically

---

## Error Output (Identical Across All Providers)

```
VectorStoreError: Failed to write metadata: Invalid string length
    at catch (file:///Users/alphab/Dev/LLM/DEV/mdcontext/dist/chunk-KHU56VDO.js:1881:25)
    ...
  operation: 'save',
  _tag: 'VectorStoreError',
  [cause]: RangeError: Invalid string length
      at JSON.stringify (<anonymous>)
      at try (file:///Users/alphab/Dev/LLM/DEV/mdcontext/dist/chunk-KHU56VDO.js:1880:61)
```

---

## What This Proves

### ✅ Bug is Real
All three embedding providers (commercial and local) fail at the exact same point with the exact same error.

### ✅ Not Provider-Specific
The bug occurs in mdcontext's vector store layer, not in any provider's embedding generation. All providers successfully generate embeddings but fail when mdcontext tries to save the metadata.

### ✅ Scale-Dependent
The bug only manifests on large corpora:
- Small corpus (120 docs, 58MB metadata): ✅ Works
- Large corpus (1,558 docs, ~785MB metadata): ❌ Fails

### ✅ Root Cause Confirmed
The error `RangeError: Invalid string length at JSON.stringify` occurs when JavaScript tries to stringify metadata that exceeds V8's string length limit (~512MB).

**Calculation:**
- 1,558 documents with 52,714 sections in total (about 34 per document) = massive metadata object
- Each section has embedding vector (1536 dimensions) + metadata
- JSON.stringify converts entire object to string
- String exceeds V8 limit → RangeError
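
Because the failure has a recognizable signature, a caller could detect it and report something more actionable than the raw message. The sketch below checks for a `RangeError` with the message shown in the error output, either directly or one level down as the wrapper's `cause`; both function names are hypothetical and this is not mdcontext's actual error handling.

```typescript
// Matches the engine error shown in the trace: RangeError: Invalid string length
function hasStringLengthMessage(value: unknown): boolean {
  return value instanceof RangeError && value.message.includes('Invalid string length');
}

// True for the RangeError itself, or for a wrapper that carries it as `cause`
// (the trace shows the RangeError attached to VectorStoreError this way).
function isStringLengthError(error: unknown): boolean {
  if (hasStringLengthMessage(error)) return true;
  if (typeof error !== 'object' || error === null) return false;
  return hasStringLengthMessage((error as { cause?: unknown }).cause);
}
```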

---

## Timeline

Each test followed this pattern:

1. **Index phase** (~14s): Successfully index all 1,558 markdown files
2. **Embedding generation** (~12 minutes): Successfully create embeddings for all sections
3. **Verification phase**: Successfully verify embeddings exist
4. **Save metadata** ❌ **CRASH**: JSON.stringify fails with string length error

All providers complete indexing, embedding generation, and verification successfully, then crash at the final save step.

---

## Provider-Specific Notes

### OpenAI (text-embedding-3-small)
- Embedding generation: ✅ Success
- API costs: ~$0.18 (estimated)
- Metadata save: ❌ Failed at JSON.stringify

### OpenRouter (default model)
- Embedding generation: ✅ Success
- API costs: ~$0.18 (estimated)
- Metadata save: ❌ Failed at JSON.stringify

### Ollama (nomic-embed-text, local)
- Embedding generation: ✅ Success
- API costs: $0 (local)
- Metadata save: ❌ Failed at JSON.stringify
- Slightly faster (12m 06s vs ~12m 50s) due to local execution

---

## Implications

### For Users
1. **Cannot use semantic search on large repos** (>1,500 docs)
2. **Wasted API costs** - Providers charge for embeddings that can't be saved
3. **Time wasted** - 12+ minutes to discover the failure
4. **No workaround exists** - Bug blocks ALL providers equally

### For Development
1. **P0 Priority** - Blocks core feature (semantic search)
2. **Architecture issue** - Not a simple bug fix, requires storage format change
3. **Well-understood** - Root cause clear, fix path identified
4. **Testable** - Reproducible 100% of the time with agentic-flow corpus

---

## Recommended Fix

**Replace JSON with MessagePack binary format** - See `BUG-FIX-PLAN.md` for:
- Complete implementation code
- Migration strategy
- Testing plan
- 6-hour effort estimate

**Priority:** Critical - Blocks large-scale production use

---

## Test Commands Used

### OpenAI Test
```bash
cd /Users/alphab/Dev/LLM/DEV/agentic-flow
mdcontext index --embed --provider openai 2>&1 | tee /tmp/test-openai.log
```

### OpenRouter Test
```bash
cd /Users/alphab/Dev/LLM/DEV/agentic-flow
mdcontext index --embed --provider openrouter 2>&1 | tee /tmp/test-openrouter.log
```

### Ollama Test
```bash
cd /Users/alphab/Dev/LLM/DEV/agentic-flow
mdcontext index --embed --provider ollama --provider-model nomic-embed-text 2>&1 | tee /tmp/test-ollama.log
```

---

## Conclusion

The P0 bug is **validated and fully understood**:
- ✅ 100% reproducible
- ✅ Affects all providers equally
- ✅ Root cause identified (JSON.stringify string limit)
- ✅ Fix available (MessagePack binary format)
- ✅ Timeline understood (6-hour implementation)

**This is a critical blocker for production use of semantic search on large documentation repositories.**

The bug should be prioritized immediately to unblock:
- Large-scale semantic search
- Production embedding workflows
- Cost-effective local embedding (Ollama)
- Knowledge base indexing at scale

---

**Next Steps:**
1. Implement MessagePack fix per `BUG-FIX-PLAN.md`
2. Add tests for large-corpus scenarios
3. Update docs with corpus size limitations (until fixed)
4. Consider chunked saves as interim workaround