mdcontext 0.0.1 → 0.2.0
- package/.changeset/README.md +28 -0
- package/.changeset/config.json +11 -0
- package/.claude/settings.local.json +25 -0
- package/.github/workflows/ci.yml +83 -0
- package/.github/workflows/claude-code-review.yml +44 -0
- package/.github/workflows/claude.yml +85 -0
- package/.github/workflows/release.yml +113 -0
- package/.tldrignore +112 -0
- package/BACKLOG.md +338 -0
- package/CONTRIBUTING.md +186 -0
- package/NOTES/NOTES +44 -0
- package/README.md +434 -11
- package/biome.json +36 -0
- package/cspell.config.yaml +14 -0
- package/dist/chunk-23UPXDNL.js +3044 -0
- package/dist/chunk-2W7MO2DL.js +1366 -0
- package/dist/chunk-3NUAZGMA.js +1689 -0
- package/dist/chunk-7TOWB2XB.js +366 -0
- package/dist/chunk-7XOTOADQ.js +3065 -0
- package/dist/chunk-AH2PDM2K.js +3042 -0
- package/dist/chunk-BNXWSZ63.js +3742 -0
- package/dist/chunk-BTL5DJVU.js +3222 -0
- package/dist/chunk-HDHYG7E4.js +104 -0
- package/dist/chunk-HLR4KZBP.js +3234 -0
- package/dist/chunk-IP3FRFEB.js +1045 -0
- package/dist/chunk-KHU56VDO.js +3042 -0
- package/dist/chunk-KRYIFLQR.js +88 -0
- package/dist/chunk-LBSDNLEM.js +287 -0
- package/dist/chunk-MNTQ7HCP.js +2643 -0
- package/dist/chunk-MUJELQQ6.js +1387 -0
- package/dist/chunk-MXJGMSLV.js +2199 -0
- package/dist/chunk-N6QJGC3Z.js +2636 -0
- package/dist/chunk-OBELGBPM.js +1713 -0
- package/dist/chunk-OT7R5XTA.js +3192 -0
- package/dist/chunk-P7X4RA2T.js +106 -0
- package/dist/chunk-PIDUQNC2.js +3185 -0
- package/dist/chunk-POGCDIH4.js +3187 -0
- package/dist/chunk-PSIEOQGZ.js +3043 -0
- package/dist/chunk-PVRT3IHA.js +3238 -0
- package/dist/chunk-QNN4TT23.js +1430 -0
- package/dist/chunk-RE3R45RJ.js +3042 -0
- package/dist/chunk-S7E6TFX6.js +803 -0
- package/dist/chunk-SG6GLU4U.js +1378 -0
- package/dist/chunk-SJCDV2ST.js +274 -0
- package/dist/chunk-SYE5XLF3.js +104 -0
- package/dist/chunk-T5VLYBZD.js +103 -0
- package/dist/chunk-TOQB7VWU.js +3238 -0
- package/dist/chunk-VFNMZ4ZQ.js +3228 -0
- package/dist/chunk-VVTGZNBT.js +1629 -0
- package/dist/chunk-W7Q4RFEV.js +104 -0
- package/dist/chunk-XTYYVRLO.js +3190 -0
- package/dist/chunk-Y6MDYVJD.js +3063 -0
- package/dist/cli/main.d.ts +1 -0
- package/dist/cli/main.js +5458 -0
- package/dist/index.d.ts +653 -0
- package/dist/index.js +79 -0
- package/dist/mcp/server.d.ts +1 -0
- package/dist/mcp/server.js +472 -0
- package/dist/schema-BAWSG7KY.js +22 -0
- package/dist/schema-E3QUPL26.js +20 -0
- package/dist/schema-EHL7WUT6.js +20 -0
- package/docs/019-USAGE.md +625 -0
- package/docs/020-current-implementation.md +364 -0
- package/docs/021-DOGFOODING-FINDINGS.md +175 -0
- package/docs/BACKLOG.md +80 -0
- package/docs/CONFIG.md +1123 -0
- package/docs/DESIGN.md +439 -0
- package/docs/ERRORS.md +383 -0
- package/docs/PROJECT.md +88 -0
- package/docs/ROADMAP.md +407 -0
- package/docs/summarization.md +320 -0
- package/docs/test-links.md +9 -0
- package/justfile +40 -0
- package/package.json +74 -9
- package/pnpm-workspace.yaml +5 -0
- package/research/INDEX.md +315 -0
- package/research/code-review/README.md +90 -0
- package/research/code-review/cli-error-handling-review.md +979 -0
- package/research/code-review/code-review-validation-report.md +464 -0
- package/research/code-review/main-ts-review.md +1128 -0
- package/research/config-analysis/01-current-implementation.md +470 -0
- package/research/config-analysis/02-strategy-recommendation.md +428 -0
- package/research/config-analysis/03-task-candidates.md +715 -0
- package/research/config-analysis/033-research-configuration-management.md +828 -0
- package/research/config-analysis/034-research-effect-cli-config.md +1504 -0
- package/research/config-analysis/04-consolidated-task-candidates.md +277 -0
- package/research/config-docs/SUMMARY.md +357 -0
- package/research/config-docs/TEST-RESULTS.md +776 -0
- package/research/config-docs/TODO.md +542 -0
- package/research/config-docs/analysis.md +744 -0
- package/research/config-docs/fix-validation.md +502 -0
- package/research/config-docs/help-audit.md +264 -0
- package/research/config-docs/help-system-analysis.md +890 -0
- package/research/dogfood/consolidated-tool-evaluation.md +373 -0
- package/research/dogfood/strategy-a/a-synthesis.md +184 -0
- package/research/dogfood/strategy-a/a1-docs.md +226 -0
- package/research/dogfood/strategy-a/a2-amorphic.md +156 -0
- package/research/dogfood/strategy-a/a3-llm.md +164 -0
- package/research/dogfood/strategy-b/b-synthesis.md +228 -0
- package/research/dogfood/strategy-b/b1-architecture.md +207 -0
- package/research/dogfood/strategy-b/b2-gaps.md +258 -0
- package/research/dogfood/strategy-b/b3-workflows.md +250 -0
- package/research/dogfood/strategy-c/c-synthesis.md +451 -0
- package/research/dogfood/strategy-c/c1-explorer.md +192 -0
- package/research/dogfood/strategy-c/c2-diver-memory.md +145 -0
- package/research/dogfood/strategy-c/c3-diver-control.md +148 -0
- package/research/dogfood/strategy-c/c4-diver-failure.md +151 -0
- package/research/dogfood/strategy-c/c5-diver-execution.md +221 -0
- package/research/dogfood/strategy-c/c6-diver-org.md +221 -0
- package/research/effect-cli-error-handling.md +845 -0
- package/research/effect-errors-as-values.md +943 -0
- package/research/errors-task-analysis/00-consolidated-tasks.md +207 -0
- package/research/errors-task-analysis/cli-commands-analysis.md +909 -0
- package/research/errors-task-analysis/embeddings-analysis.md +709 -0
- package/research/errors-task-analysis/index-search-analysis.md +812 -0
- package/research/frontmatter/COMMENTS-ARE-SKIPPED.md +149 -0
- package/research/frontmatter/LLM-CODE-NAVIGATION.md +276 -0
- package/research/issue-review.md +603 -0
- package/research/llm-summarization/agent-cli-tools-2026.md +1082 -0
- package/research/llm-summarization/alternative-providers-2026.md +1428 -0
- package/research/llm-summarization/anthropic-2026.md +367 -0
- package/research/llm-summarization/claude-cli-integration.md +1706 -0
- package/research/llm-summarization/cli-integration-patterns.md +3155 -0
- package/research/llm-summarization/openai-2026.md +473 -0
- package/research/llm-summarization/openai-compatible-providers-2026.md +1022 -0
- package/research/llm-summarization/opencode-cli-integration.md +1552 -0
- package/research/llm-summarization/prompt-engineering-2026.md +1426 -0
- package/research/llm-summarization/prototype-results.md +56 -0
- package/research/llm-summarization/provider-switching-patterns-2026.md +2153 -0
- package/research/llm-summarization/typescript-llm-libraries-2026.md +2436 -0
- package/research/mdcontext-error-analysis.md +521 -0
- package/research/mdcontext-pudding/00-EXECUTIVE-SUMMARY.md +282 -0
- package/research/mdcontext-pudding/01-index-embed.md +956 -0
- package/research/mdcontext-pudding/02-search-COMMANDS.md +142 -0
- package/research/mdcontext-pudding/02-search-SUMMARY.md +146 -0
- package/research/mdcontext-pudding/02-search.md +970 -0
- package/research/mdcontext-pudding/03-context.md +779 -0
- package/research/mdcontext-pudding/04-navigation-and-analytics.md +803 -0
- package/research/mdcontext-pudding/04-tree.md +704 -0
- package/research/mdcontext-pudding/05-config.md +1038 -0
- package/research/mdcontext-pudding/06-links-summary.txt +87 -0
- package/research/mdcontext-pudding/06-links.md +679 -0
- package/research/mdcontext-pudding/07-stats.md +693 -0
- package/research/mdcontext-pudding/BUG-FIX-PLAN.md +388 -0
- package/research/mdcontext-pudding/P0-BUG-VALIDATION.md +167 -0
- package/research/mdcontext-pudding/README.md +168 -0
- package/research/mdcontext-pudding/TESTING-SUMMARY.md +128 -0
- package/research/npm_publish/011-npm-workflow-research-agent2.md +792 -0
- package/research/npm_publish/012-npm-workflow-research-agent1.md +530 -0
- package/research/npm_publish/013-npm-workflow-research-agent3.md +722 -0
- package/research/npm_publish/014-npm-workflow-synthesis.md +556 -0
- package/research/npm_publish/031-npm-workflow-task-analysis.md +134 -0
- package/research/research-quality-review.md +834 -0
- package/research/semantic-search/002-research-embedding-models.md +490 -0
- package/research/semantic-search/003-research-rag-alternatives.md +523 -0
- package/research/semantic-search/004-research-vector-search.md +841 -0
- package/research/semantic-search/032-research-semantic-search.md +427 -0
- package/research/semantic-search/embedding-text-analysis.md +156 -0
- package/research/semantic-search/multi-word-failure-reproduction.md +171 -0
- package/research/semantic-search/query-processing-analysis.md +207 -0
- package/research/semantic-search/root-cause-and-solution.md +114 -0
- package/research/semantic-search/threshold-validation-report.md +69 -0
- package/research/semantic-search/vector-search-analysis.md +63 -0
- package/research/task-management-2026/00-synthesis-recommendations.md +295 -0
- package/research/task-management-2026/01-ai-workflow-tools.md +416 -0
- package/research/task-management-2026/02-agent-framework-patterns.md +476 -0
- package/research/task-management-2026/03-lightweight-file-based.md +567 -0
- package/research/task-management-2026/04-established-tools-ai-features.md +541 -0
- package/research/task-management-2026/linear/01-core-features-workflow.md +771 -0
- package/research/task-management-2026/linear/02-api-integrations.md +930 -0
- package/research/task-management-2026/linear/03-ai-features.md +368 -0
- package/research/task-management-2026/linear/04-pricing-setup.md +205 -0
- package/research/task-management-2026/linear/05-usage-patterns-best-practices.md +605 -0
- package/research/test-path-issues.md +276 -0
- package/review/ALP-76/1-error-type-design.md +962 -0
- package/review/ALP-76/2-error-handling-patterns.md +906 -0
- package/review/ALP-76/3-error-presentation.md +624 -0
- package/review/ALP-76/4-test-coverage.md +625 -0
- package/review/ALP-76/5-migration-completeness.md +440 -0
- package/review/ALP-76/6-effect-best-practices.md +755 -0
- package/scripts/apply-branch-protection.sh +47 -0
- package/scripts/branch-protection-templates.json +79 -0
- package/scripts/prototype-summarization.ts +346 -0
- package/scripts/rebuild-hnswlib.js +58 -0
- package/scripts/setup-branch-protection.sh +64 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/active-provider.json +7 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/bm25.json +541 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/bm25.meta.json +5 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/config.json +8 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/embeddings/openai_text-embedding-3-small_512/vectors.bin +0 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/embeddings/openai_text-embedding-3-small_512/vectors.meta.bin +0 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/indexes/documents.json +60 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/indexes/links.json +13 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/indexes/sections.json +1197 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/configuration-management.md +99 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/distributed-systems.md +92 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/error-handling.md +78 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/failure-automation.md +55 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/job-context.md +69 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/process-orchestration.md +99 -0
- package/src/cli/argv-preprocessor.test.ts +210 -0
- package/src/cli/argv-preprocessor.ts +202 -0
- package/src/cli/cli.test.ts +627 -0
- package/src/cli/commands/backlinks.ts +54 -0
- package/src/cli/commands/config-cmd.ts +642 -0
- package/src/cli/commands/context.ts +285 -0
- package/src/cli/commands/duplicates.ts +122 -0
- package/src/cli/commands/embeddings.ts +529 -0
- package/src/cli/commands/index-cmd.ts +480 -0
- package/src/cli/commands/index.ts +16 -0
- package/src/cli/commands/links.ts +52 -0
- package/src/cli/commands/search.ts +1281 -0
- package/src/cli/commands/stats.ts +149 -0
- package/src/cli/commands/tree.ts +128 -0
- package/src/cli/config-layer.ts +176 -0
- package/src/cli/error-handler.test.ts +235 -0
- package/src/cli/error-handler.ts +655 -0
- package/src/cli/flag-schemas.ts +341 -0
- package/src/cli/help.ts +588 -0
- package/src/cli/index.ts +9 -0
- package/src/cli/main.ts +435 -0
- package/src/cli/options.ts +41 -0
- package/src/cli/shared-error-handling.ts +199 -0
- package/src/cli/typo-suggester.test.ts +105 -0
- package/src/cli/typo-suggester.ts +130 -0
- package/src/cli/utils.ts +259 -0
- package/src/config/file-provider.test.ts +320 -0
- package/src/config/file-provider.ts +273 -0
- package/src/config/index.ts +72 -0
- package/src/config/integration.test.ts +667 -0
- package/src/config/precedence.test.ts +277 -0
- package/src/config/precedence.ts +451 -0
- package/src/config/schema.test.ts +414 -0
- package/src/config/schema.ts +603 -0
- package/src/config/service.test.ts +320 -0
- package/src/config/service.ts +243 -0
- package/src/config/testing.test.ts +264 -0
- package/src/config/testing.ts +110 -0
- package/src/core/index.ts +1 -0
- package/src/core/types.ts +113 -0
- package/src/duplicates/detector.test.ts +183 -0
- package/src/duplicates/detector.ts +414 -0
- package/src/duplicates/index.ts +18 -0
- package/src/embeddings/embedding-namespace.test.ts +300 -0
- package/src/embeddings/embedding-namespace.ts +947 -0
- package/src/embeddings/heading-boost.test.ts +222 -0
- package/src/embeddings/hnsw-build-options.test.ts +198 -0
- package/src/embeddings/hyde.test.ts +272 -0
- package/src/embeddings/hyde.ts +264 -0
- package/src/embeddings/index.ts +10 -0
- package/src/embeddings/openai-provider.ts +414 -0
- package/src/embeddings/pricing.json +22 -0
- package/src/embeddings/provider-constants.ts +204 -0
- package/src/embeddings/provider-errors.test.ts +967 -0
- package/src/embeddings/provider-errors.ts +565 -0
- package/src/embeddings/provider-factory.test.ts +240 -0
- package/src/embeddings/provider-factory.ts +225 -0
- package/src/embeddings/provider-integration.test.ts +788 -0
- package/src/embeddings/query-preprocessing.test.ts +187 -0
- package/src/embeddings/semantic-search-threshold.test.ts +508 -0
- package/src/embeddings/semantic-search.ts +1270 -0
- package/src/embeddings/types.ts +359 -0
- package/src/embeddings/vector-store.ts +708 -0
- package/src/embeddings/voyage-provider.ts +313 -0
- package/src/errors/errors.test.ts +845 -0
- package/src/errors/index.ts +533 -0
- package/src/index/ignore-patterns.test.ts +354 -0
- package/src/index/ignore-patterns.ts +305 -0
- package/src/index/index.ts +4 -0
- package/src/index/indexer.ts +684 -0
- package/src/index/storage.ts +260 -0
- package/src/index/types.ts +147 -0
- package/src/index/watcher.ts +189 -0
- package/src/index.ts +30 -0
- package/src/integration/search-keyword.test.ts +678 -0
- package/src/mcp/server.ts +612 -0
- package/src/parser/index.ts +1 -0
- package/src/parser/parser.test.ts +291 -0
- package/src/parser/parser.ts +394 -0
- package/src/parser/section-filter.test.ts +277 -0
- package/src/parser/section-filter.ts +392 -0
- package/src/search/__tests__/hybrid-search.test.ts +650 -0
- package/src/search/bm25-store.ts +366 -0
- package/src/search/cross-encoder.test.ts +253 -0
- package/src/search/cross-encoder.ts +406 -0
- package/src/search/fuzzy-search.test.ts +419 -0
- package/src/search/fuzzy-search.ts +273 -0
- package/src/search/hybrid-search.ts +448 -0
- package/src/search/path-matcher.test.ts +276 -0
- package/src/search/path-matcher.ts +33 -0
- package/src/search/query-parser.test.ts +260 -0
- package/src/search/query-parser.ts +319 -0
- package/src/search/searcher.test.ts +280 -0
- package/src/search/searcher.ts +724 -0
- package/src/search/wink-bm25.d.ts +30 -0
- package/src/summarization/cli-providers/claude.ts +202 -0
- package/src/summarization/cli-providers/detection.test.ts +273 -0
- package/src/summarization/cli-providers/detection.ts +118 -0
- package/src/summarization/cli-providers/index.ts +8 -0
- package/src/summarization/cost.test.ts +139 -0
- package/src/summarization/cost.ts +102 -0
- package/src/summarization/error-handler.test.ts +127 -0
- package/src/summarization/error-handler.ts +111 -0
- package/src/summarization/index.ts +102 -0
- package/src/summarization/pipeline.test.ts +498 -0
- package/src/summarization/pipeline.ts +231 -0
- package/src/summarization/prompts.test.ts +269 -0
- package/src/summarization/prompts.ts +133 -0
- package/src/summarization/provider-factory.test.ts +396 -0
- package/src/summarization/provider-factory.ts +178 -0
- package/src/summarization/types.ts +184 -0
- package/src/summarize/budget-bugs.test.ts +620 -0
- package/src/summarize/formatters.ts +419 -0
- package/src/summarize/index.ts +20 -0
- package/src/summarize/summarizer.test.ts +275 -0
- package/src/summarize/summarizer.ts +597 -0
- package/src/summarize/verify-bugs.test.ts +238 -0
- package/src/types/huggingface-transformers.d.ts +66 -0
- package/src/utils/index.ts +1 -0
- package/src/utils/tokens.test.ts +142 -0
- package/src/utils/tokens.ts +186 -0
- package/tests/fixtures/cli/.mdcontext/active-provider.json +7 -0
- package/tests/fixtures/cli/.mdcontext/config.json +8 -0
- package/tests/fixtures/cli/.mdcontext/embeddings/openai_text-embedding-3-small_512/vectors.bin +0 -0
- package/tests/fixtures/cli/.mdcontext/embeddings/openai_text-embedding-3-small_512/vectors.meta.bin +0 -0
- package/tests/fixtures/cli/.mdcontext/indexes/documents.json +33 -0
- package/tests/fixtures/cli/.mdcontext/indexes/links.json +12 -0
- package/tests/fixtures/cli/.mdcontext/indexes/sections.json +247 -0
- package/tests/fixtures/cli/README.md +9 -0
- package/tests/fixtures/cli/api-reference.md +11 -0
- package/tests/fixtures/cli/getting-started.md +11 -0
- package/tests/integration/embed-index.test.ts +712 -0
- package/tests/integration/search-context.test.ts +469 -0
- package/tests/integration/search-semantic.test.ts +522 -0
- package/tsconfig.json +26 -0
- package/vitest.config.ts +16 -0
- package/vitest.setup.ts +12 -0
|
@@ -0,0 +1,427 @@
|
|
|
1
|
+
# Research Task Analysis: Embedding Models, RAG Alternatives, and Vector Search
|
|
2
|
+
|
|
3
|
+
Analysis of research documents against current mdcontext implementation.
|
|
4
|
+
|
|
5
|
+
Date: January 2026
|
|
6
|
+
|
|
7
|
+
---
|
|
8
|
+
|
|
9
|
+
## Documents Analyzed
|
|
10
|
+
|
|
11
|
+
1. `002-research-embedding-models.md` - Embedding model comparison and recommendations
|
|
12
|
+
2. `003-research-rag-alternatives.md` - RAG alternatives for improving semantic search
|
|
13
|
+
3. `004-research-vector-search.md` - Vector search patterns and techniques
|
|
14
|
+
|
|
15
|
+
---
|
|
16
|
+
|
|
17
|
+
## Implemented (No Action Needed)
|
|
18
|
+
|
|
19
|
+
### 1. Dimension Reduction (512 dimensions)
|
|
20
|
+
|
|
21
|
+
**Research Recommendation:** Reduce OpenAI embeddings from 1536 to 512 dimensions for 67% storage reduction with minimal quality loss.
|
|
22
|
+
|
|
23
|
+
**Current Implementation:** Already implemented in `src/embeddings/openai-provider.ts`:
|
|
24
|
+
|
|
25
|
+
```typescript
|
|
26
|
+
const response = await this.client.embeddings.create({
|
|
27
|
+
  model: this.model,
|
|
28
|
+
  input: batch,
|
|
29
|
+
  dimensions: 512, // Already using reduced dimensions
|
|
30
|
+
});
|
|
31
|
+
```
|
|
32
|
+
|
|
33
|
+
**Status:** Implemented
|
|
34
|
+
|
|
35
|
+
---
|
|
36
|
+
|
|
37
|
+
### 2. HNSW Vector Index
|
|
38
|
+
|
|
39
|
+
**Research Recommendation:** Stay with HNSW for documentation corpora (<100K sections).
|
|
40
|
+
|
|
41
|
+
**Current Implementation:** Using `hnswlib-node` with cosine similarity in `src/embeddings/vector-store.ts`:
|
|
42
|
+
|
|
43
|
+
```typescript
|
|
44
|
+
this.index = new HierarchicalNSW.HierarchicalNSW("cosine", this.dimensions);
|
|
45
|
+
this.index.initIndex(10000, 16, 200, 100); // M=16, efConstruction=200
|
|
46
|
+
```
|
|
47
|
+
|
|
48
|
+
**Status:** Implemented
|
|
49
|
+
|
|
50
|
+
---
|
|
51
|
+
|
|
52
|
+
### 3. EmbeddingProvider Interface
|
|
53
|
+
|
|
54
|
+
**Research Recommendation:** Create provider abstraction for embedding models.
|
|
55
|
+
|
|
56
|
+
**Current Implementation:** `src/embeddings/types.ts` defines a clean provider interface:
|
|
57
|
+
|
|
58
|
+
```typescript
|
|
59
|
+
export interface EmbeddingProvider {
|
|
60
|
+
  readonly name: string;
|
|
61
|
+
  readonly dimensions: number;
|
|
62
|
+
  embed(texts: string[]): Promise<EmbeddingResult>;
|
|
63
|
+
}
|
|
64
|
+
```
|
|
65
|
+
|
|
66
|
+
**Status:** Implemented (foundation ready for additional providers)
|
|
67
|
+
|
|
68
|
+
---
|
|
69
|
+
|
|
70
|
+
### 4. Path Pattern Filtering (Post-filter)
|
|
71
|
+
|
|
72
|
+
**Research Recommendation:** Implement metadata filtering for search results.
|
|
73
|
+
|
|
74
|
+
**Current Implementation:** `pathPattern` option implemented in `semanticSearch()` as post-filtering.
|
|
75
|
+
|
|
76
|
+
**Status:** Implemented (basic)
|
|
77
|
+
|
|
78
|
+
---
|
|
79
|
+
|
|
80
|
+
### 5. Document Context in Embeddings
|
|
81
|
+
|
|
82
|
+
**Research Recommendation:** Include document title and parent section in embedding text.
|
|
83
|
+
|
|
84
|
+
**Current Implementation:** Already in `src/embeddings/semantic-search.ts`:
|
|
85
|
+
|
|
86
|
+
```typescript
|
|
87
|
+
const generateEmbeddingText = (
|
|
88
|
+
  section,
|
|
89
|
+
  content,
|
|
90
|
+
  documentTitle,
|
|
91
|
+
  parentHeading,
|
|
92
|
+
) => {
|
|
93
|
+
  parts.push(`# ${section.heading}`);
|
|
94
|
+
  if (parentHeading) parts.push(`Parent section: ${parentHeading}`);
|
|
95
|
+
  parts.push(`Document: ${documentTitle}`);
|
|
96
|
+
  parts.push(content);
|
|
97
|
+
  // ...
|
|
98
|
+
};
|
|
99
|
+
```
|
|
100
|
+
|
|
101
|
+
**Status:** Implemented
|
|
102
|
+
|
|
103
|
+
---
|
|
104
|
+
|
|
105
|
+
## Task Candidates
|
|
106
|
+
|
|
107
|
+
### 1. Add Hybrid Search (BM25 + Semantic)
|
|
108
|
+
|
|
109
|
+
**Priority:** High
|
|
110
|
+
|
|
111
|
+
**Description:**
|
|
112
|
+
Implement hybrid search combining BM25 keyword search with semantic search, using Reciprocal Rank Fusion (RRF) to merge results.
|
|
113
|
+
|
|
114
|
+
**Why It Matters:**
|
|
115
|
+
|
|
116
|
+
- The analyzed research documents report a 15-30% recall improvement over single-method retrieval
|
|
117
|
+
- Handles exact term matching (API names, error codes, identifiers) that pure semantic search misses
|
|
118
|
+
- Current keyword search exists separately but isn't integrated with semantic search
|
|
119
|
+
|
|
120
|
+
**Implementation Notes:**
|
|
121
|
+
|
|
122
|
+
- Add `wink-bm25-text-search` dependency
|
|
123
|
+
- Build BM25 index alongside vector index during `mdcontext embed`
|
|
124
|
+
- Add `--mode hybrid` option to search command
|
|
125
|
+
- Implement RRF fusion (~50 lines of code)
|
|
126
|
+
|
|
127
|
+
**Current Gap:**
|
|
128
|
+
|
|
129
|
+
- Keyword search (`src/search/searcher.ts`) and semantic search (`src/embeddings/semantic-search.ts`) are separate codepaths
|
|
130
|
+
- No fusion mechanism exists
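
The fusion step named in the implementation notes can be sketched as follows. This is a minimal illustration of Reciprocal Rank Fusion, not mdcontext's actual code: the function name `reciprocalRankFusion`, the `RankedList` shape (section IDs ordered best-first, each ID at most once per list), and the constant `k = 60` (a commonly used default) are all assumptions made for the example.

```typescript
// Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank),
// where rank is the 1-based position of d in each list that contains it.
// Hypothetical shape: section IDs, best match first, no duplicates per list.
type RankedList = readonly string[];

function reciprocalRankFusion(
  lists: readonly RankedList[],
  k = 60, // smoothing constant; assumed default, not an mdcontext setting
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Illustrative inputs: one list from BM25, one from vector search.
const bm25Hits = ["auth-errors", "retry-policy", "config-schema"];
const semanticHits = ["retry-policy", "backoff-guide", "auth-errors"];
const fused = reciprocalRankFusion([bm25Hits, semanticHits]);
console.log(fused.map((r) => r.id));
```

Because RRF only looks at ranks, the BM25 scores and cosine similarities never need to be normalized onto a common scale, which is the main reason it suits merging two otherwise separate codepaths.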
|
|
131
|
+
|
|
132
|
+
**Estimated Effort:** 2-3 days
|
|
133
|
+
|
|
134
|
+
---
|
|
135
|
+
|
|
136
|
+
### 2. Add Local Embedding Provider (Ollama)
|
|
137
|
+
|
|
138
|
+
**Priority:** High
|
|
139
|
+
|
|
140
|
+
**Description:**
|
|
141
|
+
Implement an Ollama-based embedding provider using `nomic-embed-text-v1.5` for offline semantic search.
|
|
142
|
+
|
|
143
|
+
**Why It Matters:**
|
|
144
|
+
|
|
145
|
+
- Enables offline semantic search (major feature gap)
|
|
146
|
+
- Zero ongoing API costs
|
|
147
|
+
- Quality matches or exceeds OpenAI text-embedding-3-small
|
|
148
|
+
- Privacy-sensitive use cases
|
|
149
|
+
|
|
150
|
+
**Implementation Notes:**
|
|
151
|
+
|
|
152
|
+
- Create `src/embeddings/ollama-provider.ts` implementing `EmbeddingProvider`
|
|
153
|
+
- Add provider selection via config or `--provider` CLI flag
|
|
154
|
+
- Default to OpenAI for backward compatibility
|
|
155
|
+
- nomic-embed-text supports Matryoshka (dimension flexibility)
|
|
156
|
+
|
|
157
|
+
**Models to Support:**
|
|
158
|
+
|
|
159
|
+
1. `nomic-embed-text` - Best overall fit (fast, 8K context, Matryoshka)
|
|
160
|
+
2. `mxbai-embed-large` - Higher quality option
|
|
161
|
+
3. `bge-m3` - Multilingual option
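
A provider along these lines might look like the sketch below. Several things here are assumptions rather than facts from the codebase: the `EmbeddingResultSketch` shape (the real `EmbeddingResult` type is not shown in this document), the class and option names, the injected `postJson` helper, and the use of Ollama's `/api/embed` endpoint with an `input` array and an `embeddings` response field. The 768-dimension default reflects the native output size of `nomic-embed-text`.

```typescript
// Assumed result shape; the real EmbeddingResult in src/embeddings/types.ts may differ.
interface EmbeddingResultSketch {
  embeddings: number[][];
}

interface EmbeddingProviderSketch {
  readonly name: string;
  readonly dimensions: number;
  embed(texts: string[]): Promise<EmbeddingResultSketch>;
}

// Hypothetical transport hook so the sketch does not depend on a global fetch.
type PostJson = (url: string, body: unknown) => Promise<unknown>;

interface OllamaOptions {
  model?: string;
  dimensions?: number;
  baseUrl?: string;
}

class OllamaProviderSketch implements EmbeddingProviderSketch {
  readonly name = "ollama";
  readonly dimensions: number;
  private readonly postJson: PostJson;
  private readonly model: string;
  private readonly baseUrl: string;

  constructor(postJson: PostJson, options: OllamaOptions = {}) {
    this.postJson = postJson;
    this.model = options.model ?? "nomic-embed-text";
    this.dimensions = options.dimensions ?? 768;
    this.baseUrl = options.baseUrl ?? "http://localhost:11434";
  }

  async embed(texts: string[]): Promise<EmbeddingResultSketch> {
    if (texts.length === 0) return { embeddings: [] };
    const raw = await this.postJson(`${this.baseUrl}/api/embed`, {
      model: this.model,
      input: texts,
    });
    const embeddings = (raw as { embeddings?: number[][] }).embeddings;
    if (!Array.isArray(embeddings) || embeddings.length !== texts.length) {
      throw new Error("Unexpected embedding response from Ollama");
    }
    // Fail early if the model's output size disagrees with what we report.
    for (const vector of embeddings) {
      if (vector.length !== this.dimensions) {
        throw new Error(
          `Expected ${this.dimensions} dimensions, got ${vector.length}`,
        );
      }
    }
    return { embeddings };
  }
}
```

In a real integration `postJson` could wrap `fetch` with a JSON body and parse the JSON response; injecting it keeps the provider easy to test without a running Ollama server.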
|
|
162
|
+
|
|
163
|
+
**Estimated Effort:** 2-3 days
|
|
164
|
+
|
|
165
|
+
---
|
|
166
|
+
|
|
167
|
+
### 3. Add Cross-Encoder Re-ranking
|
|
168
|
+
|
|
169
|
+
**Priority:** Medium
|
|
170
|
+
|
|
171
|
+
**Description:**
|
|
172
|
+
Add optional re-ranking of top-N semantic search results using a cross-encoder model.
|
|
173
|
+
|
|
174
|
+
**Why It Matters:**
|
|
175
|
+
|
|
176
|
+
- The analyzed research reports a 20-35% improvement in retrieval precision
|
|
177
|
+
- Cross-encoders capture fine-grained relevance that bi-encoders miss
|
|
178
|
+
- Can be opt-in to avoid latency when not needed
|
|
179
|
+
|
|
180
|
+
**Implementation Notes:**
|
|
181
|
+
|
|
182
|
+
- Add `@xenova/transformers` dependency for Transformers.js
|
|
183
|
+
- Use `ms-marco-MiniLM-L-6-v2` model (22.7M params, 2-5ms/pair)
|
|
184
|
+
- Re-rank top-20 candidates to top-10
|
|
185
|
+
- Add `--rerank` flag to search command
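
The re-ranking flow described above (score the top 20 candidates, keep the best 10) can be sketched independently of any model library. In this illustration the `PairScorer` type stands in for the cross-encoder and is hypothetical, as are the `Candidate` shape and the `rerank` function name; a real model call would most likely be asynchronous, and the word-overlap scorer exists only to make the example runnable.

```typescript
interface Candidate {
  id: string;
  text: string;
  similarity: number; // bi-encoder (vector search) score
}

// Hypothetical stand-in for a cross-encoder: higher means more relevant.
type PairScorer = (query: string, passage: string) => number;

function rerank(
  query: string,
  candidates: readonly Candidate[],
  scorePair: PairScorer,
  poolSize = 20, // how many vector-search hits to re-score
  topK = 10, // how many to return after re-scoring
): Array<Candidate & { rerankScore: number }> {
  // Only the best poolSize hits are re-scored, which bounds the added latency.
  const pool = [...candidates]
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, poolSize);
  return pool
    .map((c) => ({ ...c, rerankScore: scorePair(query, c.text) }))
    .sort((a, b) => b.rerankScore - a.rerankScore)
    .slice(0, topK);
}

// Toy scorer for demonstration only: counts passage words found in the query.
const overlapScorer: PairScorer = (query, passage) => {
  const terms = new Set(query.toLowerCase().split(/\s+/).filter(Boolean));
  return passage
    .toLowerCase()
    .split(/\s+/)
    .filter((word) => terms.has(word)).length;
};

const reranked = rerank(
  "install the cli",
  [
    { id: "a", text: "Release notes for version two", similarity: 0.82 },
    { id: "b", text: "How to install the cli globally", similarity: 0.78 },
    { id: "c", text: "Uninstall steps", similarity: 0.4 },
  ],
  overlapScorer,
  2,
  1,
);
console.log(reranked.map((r) => r.id));
```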
|
|
186
|
+
|
|
187
|
+
**Alternative:** Cohere Rerank API for simpler integration (adds cost)
|
|
188
|
+
|
|
189
|
+
**Estimated Effort:** 2-3 days
|
|
190
|
+
|
|
191
|
+
---
|
|
192
|
+
|
|
193
|
+
### 4. Add Dynamic efSearch (Quality Modes)
|
|
194
|
+
|
|
195
|
+
**Priority:** Medium
|
|
196
|
+
|
|
197
|
+
**Description:**
|
|
198
|
+
Allow users to control the search quality/speed tradeoff via the HNSW `efSearch` parameter at query time.
|
|
199
|
+
|
|
200
|
+
**Why It Matters:**
|
|
201
|
+
|
|
202
|
+
- Zero dependency changes
|
|
203
|
+
- Lets users trade recall for latency per query without rebuilding the index
|
|
204
|
+
- Low risk
|
|
205
|
+
|
|
206
|
+
**Implementation Notes:**
|
|
207
|
+
|
|
208
|
+
- Add `--quality` flag: `fast` (64), `balanced` (100), `thorough` (256)
|
|
209
|
+
- efSearch is already configurable at query time in hnswlib-node
|
|
210
|
+
- Update search functions to accept quality parameter
|
|
211
|
+
|
|
212
|
+
**Current State:**
|
|
213
|
+
|
|
214
|
+
```typescript
|
|
215
|
+
this.index.initIndex(10000, 16, 200, 100); // 4th argument is randomSeed in hnswlib-node; efSearch is set separately via setEf()
|
|
216
|
+
```
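
One way the `--quality` flag could map onto efSearch is sketched below. The `EfTunableIndex` interface is a hypothetical structural view of the index; it assumes the underlying library exposes `setEf` and `searchKnn` the way hnswlib-node does. The function and constant names are illustrative only.

```typescript
type SearchQuality = "fast" | "balanced" | "thorough";

// Values taken from the proposed --quality flag above.
const EF_SEARCH_BY_QUALITY: Record<SearchQuality, number> = {
  fast: 64,
  balanced: 100,
  thorough: 256,
};

// Hypothetical minimal view of the vector index used by this sketch.
interface EfTunableIndex {
  setEf(ef: number): void;
  searchKnn(
    query: number[],
    k: number,
  ): { distances: number[]; neighbors: number[] };
}

function searchWithQuality(
  index: EfTunableIndex,
  query: number[],
  k: number,
  quality: SearchQuality = "balanced",
): { distances: number[]; neighbors: number[] } {
  // HNSW explores ef candidates per query, so ef should not be smaller than k.
  const ef = Math.max(EF_SEARCH_BY_QUALITY[quality], k);
  index.setEf(ef);
  return index.searchKnn(query, k);
}
```

Setting efSearch immediately before each query keeps the choice local to that search, so one loaded index can serve `fast` and `thorough` requests alike.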
|
|
217
|
+
|
|
218
|
+
**Estimated Effort:** 0.5 days
|
|
219
|
+
|
|
220
|
+
---
|
|
221
|
+
|
|
222
|
+
### 5. Add Configurable HNSW Parameters
|
|
223
|
+
|
|
224
|
+
**Priority:** Low
|
|
225
|
+
|
|
226
|
+
**Description:**
|
|
227
|
+
Expose HNSW build parameters (M, efConstruction) via configuration for users with specific needs.
|
|
228
|
+
|
|
229
|
+
**Why It Matters:**
|
|
230
|
+
|
|
231
|
+
- Users with large corpora may want to tune for speed
|
|
232
|
+
- Users needing maximum recall can increase parameters
|
|
233
|
+
- Enables benchmarking different configurations
|
|
234
|
+
|
|
235
|
+
**Current Hardcoded Values:**
|
|
236
|
+
|
|
237
|
+
```typescript
|
|
238
|
+
M: 16; // Max connections per node
|
|
239
|
+
efConstruction: 200; // Construction-time search width
|
|
240
|
+
```
|
|
241
|
+
|
|
242
|
+
**Recommended Configurations:**
|
|
243
|
+
|
|
244
|
+
- Quality-focused: M=24, efConstruction=256
|
|
245
|
+
- Speed-focused: M=12, efConstruction=128
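
A configuration resolver for these build parameters could look roughly like this. The option names (`m`, `efConstruction`, `preset`) and the function `resolveHnswOptions` are hypothetical and not part of mdcontext's config schema; the numeric values are the current hardcoded defaults and the two recommended configurations listed above.

```typescript
interface HnswBuildOptions {
  m: number; // max connections per node
  efConstruction: number; // construction-time search width
}

const DEFAULT_HNSW: HnswBuildOptions = { m: 16, efConstruction: 200 };

const HNSW_PRESETS = {
  quality: { m: 24, efConstruction: 256 },
  speed: { m: 12, efConstruction: 128 },
} as const;

type HnswPreset = keyof typeof HNSW_PRESETS;

interface HnswConfigInput extends Partial<HnswBuildOptions> {
  preset?: HnswPreset;
}

function resolveHnswOptions(config: HnswConfigInput = {}): HnswBuildOptions {
  // Explicit values win over the preset, which wins over the defaults.
  const base = config.preset ? HNSW_PRESETS[config.preset] : DEFAULT_HNSW;
  const resolved: HnswBuildOptions = {
    m: config.m ?? base.m,
    efConstruction: config.efConstruction ?? base.efConstruction,
  };
  for (const [key, value] of Object.entries(resolved)) {
    if (!Number.isInteger(value) || value < 1) {
      throw new RangeError(`${key} must be a positive integer, got ${value}`);
    }
  }
  return resolved;
}
```

Since M and efConstruction are fixed when the index is built, changing them would presumably require re-running the embed step rather than taking effect on an existing index.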
|
|
246
|
+
|
|
247
|
+
**Estimated Effort:** 1 day
|
|
248
|
+
|
|
249
|
+
---
|
|
250
|
+
|
|
251
|
+
### 6. Add Query Preprocessing
|
|
252
|
+
|
|
253
|
+
**Priority:** Low
|
|
254
|
+
|
|
255
|
+
**Description:**
|
|
256
|
+
Add basic query preprocessing before embedding to reduce noise.
|
|
257
|
+
|
|
258
|
+
**Why It Matters:**
|
|
259
|
+
|
|
260
|
+
- 2-5% precision improvement
|
|
261
|
+
- Simple implementation
|
|
262
|
+
|
|
263
|
+
**Implementation:**
|
|
264
|
+
|
|
265
|
+
```typescript
|
|
266
|
+
function preprocessQuery(query: string): string {
|
|
267
|
+
  return query
|
|
268
|
+
    .toLowerCase()
|
|
269
|
+
    .replace(/[^\w\s]/g, " ")
|
|
270
|
+
    .replace(/\s+/g, " ")
|
|
271
|
+
    .trim();
|
|
272
|
+
}
|
|
273
|
+
```
|
|
274
|
+
|
|
275
|
+
**Estimated Effort:** 1-2 hours
|
|
276
|
+
|
|
277
|
+
---
|
|
278
|
+
|
|
279
|
+
### 7. Add Heading Match Boost
|
|
280
|
+
|
|
281
|
+
**Priority:** Low
|
|
282
|
+
|
|
283
|
+
**Description:**
|
|
284
|
+
Boost search results when query terms appear in section headings.
|
|
285
|
+
|
|
286
|
+
**Why It Matters:**
|
|
287
|
+
|
|
288
|
+
- Significant for navigation queries ("installation guide", "API reference")
|
|
289
|
+
- Simple scoring adjustment
|
|
290
|
+
|
|
291
|
+
**Implementation:**
|
|
292
|
+
|
|
293
|
+
```typescript
|
|
294
|
+
function adjustScore(result: { heading: string; similarity: number }, query: string): number {
|
|
295
|
+
  const queryTerms = query.toLowerCase().split(/\s+/).filter(Boolean); // drop empty terms so a blank query cannot match every heading
|
|
296
|
+
  const headingLower = result.heading.toLowerCase();
|
|
297
|
+
  const headingMatches = queryTerms.filter((t) =>
|
|
298
|
+
    headingLower.includes(t),
|
|
299
|
+
  ).length;
|
|
300
|
+
  return result.similarity + headingMatches * 0.05;
|
|
301
|
+
}
|
|
302
|
+
```
|
|
303
|
+
|
|
304
|
+
**Estimated Effort:** 2-4 hours
|
|
305
|
+
|
|
306
|
+
---
|
|
307
|
+
|
|
308
|
+
### 8. Add HyDE Query Expansion (Optional)
|
|
309
|
+
|
|
310
|
+
**Priority:** Low
|
|
311
|
+
|
|
312
|
+
**Description:**
|
|
313
|
+
Implement Hypothetical Document Embeddings (HyDE) for complex queries: generate a hypothetical answer with an LLM, embed that answer, and search with the resulting vector instead of the embedding of the raw query.
|
|
314
|
+
|
|
315
|
+
**Why It Matters:**
|
|
316
|
+
|
|
317
|
+
- 10-30% retrieval improvement on ambiguous queries
|
|
318
|
+
- Bridges semantic gap between short questions and detailed documents
|
|
319
|
+
|
|
320
|
+
**Considerations:**
|
|
321
|
+
|
|
322
|
+
- Adds LLM call (cost, latency)
|
|
323
|
+
- Should be opt-in for complex queries only
|
|
324
|
+
- Works poorly if LLM lacks domain knowledge
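
The HyDE flow can be sketched with the LLM and the embedder injected as plain functions. Everything named here (`GenerateText`, `EmbedText`, `hydeQueryVector`, the prompt wording, and the fallback to the raw query when the LLM returns nothing) is an assumption for illustration, not the project's implementation.

```typescript
// Hypothetical hooks: any LLM client and any embedding provider could back these.
type GenerateText = (prompt: string) => Promise<string>;
type EmbedText = (text: string) => Promise<number[]>;

async function hydeQueryVector(
  query: string,
  generate: GenerateText,
  embed: EmbedText,
): Promise<{ vector: number[]; embeddedText: string }> {
  const prompt =
    "Write a short documentation passage that answers the question below.\n\n" +
    `Question: ${query}`;
  const draft = (await generate(prompt)).trim();
  // Fall back to the raw query if the LLM produced no usable text.
  const embeddedText = draft.length > 0 ? draft : query;
  // The hypothetical passage, not the question, is what gets embedded.
  return { vector: await embed(embeddedText), embeddedText };
}
```

The returned vector would then be passed to the existing vector search in place of the query embedding, which is why the technique adds exactly one LLM call per search.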
|
|
325
|
+
|
|
326
|
+
**Estimated Effort:** 1-2 days
|
|
327
|
+
|
|
328
|
+
---
|
|
329
|
+
|
|
330
|
+
### 9. Fix Dimension Mismatch in Provider
|
|
331
|
+
|
|
332
|
+
**Priority:** Medium
|
|
333
|
+
|
|
334
|
+
**Description:**
|
|
335
|
+
The OpenAI provider reports incorrect dimensions (1536/3072) while actually using 512.
|
|
336
|
+
|
|
337
|
+
**Current Issue:**
|
|
338
|
+
|
|
339
|
+
```typescript
|
|
340
|
+
// In openai-provider.ts
|
|
341
|
+
this.dimensions = this.model === "text-embedding-3-large" ? 3072 : 1536;
|
|
342
|
+
// But actual API call uses:
|
|
343
|
+
dimensions: 512;
|
|
344
|
+
```
|
|
345
|
+
|
|
346
|
+
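This mismatch could cause issues if other code relies on `provider.dimensions`; for example, a vector index sized from the reported value would expect 1536-dimensional vectors while the API returns 512-dimensional ones.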
|
|
347
|
+
|
|
348
|
+
**Fix:** Update dimension reporting to match actual API parameter.
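
One possible shape for the fix is to keep a single source of truth for the dimension count and use it both for the reported property and for the API request. The class below is a hypothetical sketch, not the real `openai-provider.ts`: the names `OpenAIProviderSketch`, `EmbeddingRequest`, and `buildRequest` are invented for the example, and the request is only constructed, not sent.

```typescript
interface EmbeddingRequest {
  model: string;
  input: string[];
  dimensions: number;
}

class OpenAIProviderSketch {
  readonly name = "openai";
  readonly dimensions: number;
  private readonly model: string;

  constructor(model = "text-embedding-3-small", dimensions = 512) {
    this.model = model;
    // Reported dimensions and requested dimensions share one value.
    this.dimensions = dimensions;
  }

  // Builds the request body that would be passed to the embeddings API.
  buildRequest(batch: string[]): EmbeddingRequest {
    return {
      model: this.model,
      input: batch,
      dimensions: this.dimensions,
    };
  }
}

const sketchProvider = new OpenAIProviderSketch();
const sketchRequest = sketchProvider.buildRequest(["hello world"]);
console.log(sketchProvider.dimensions === sketchRequest.dimensions);
```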
|
|
349
|
+
|
|
350
|
+
**Estimated Effort:** 0.5 hours
|
|
351
|
+
|
|
352
|
+
---
|
|
353
|
+
|
|
354
|
+
### 10. Add Alternative API Provider (Voyage AI)
|
|
355
|
+
|
|
356
|
+
**Priority:** Low
|
|
357
|
+
|
|
358
|
+
**Description:**
|
|
359
|
+
Add Voyage AI as an alternative embedding provider for users wanting better quality at similar cost.
|
|
360
|
+
|
|
361
|
+
**Why It Matters:**
|
|
362
|
+
|
|
363
|
+
- voyage-3.5-lite: Same price as OpenAI ($0.02/1M), but better quality
|
|
364
|
+
- 32K token context (4x OpenAI)
|
|
365
|
+
- Free 200M tokens for testing
|
|
366
|
+
|
|
367
|
+
**Estimated Effort:** 1 day
|
|
368
|
+
|
|
369
|
+
---
|
|
370
|
+
|
|
371
|
+
## Skip (Not Applicable)
|
|
372
|
+
|
|
373
|
+
| Recommendation | Reason to Skip |
|
|
374
|
+
| ------------------------ | --------------------------------------------------------------------- |
|
|
375
|
+
| ColBERT Late Interaction | Overkill for documentation corpus sizes; requires Python service |
|
|
376
|
+
| SPLADE Sparse Retrieval | BM25 + semantic hybrid likely sufficient; adds complexity |
|
|
377
|
+
| GraphRAG | Overkill for documentation search |
|
|
378
|
+
| Fine-tuned Embeddings | Requires training infrastructure; general models work well |
|
|
379
|
+
| IVF/DiskANN Indexes | HNSW sufficient for typical documentation sizes (<100K sections) |
|
|
380
|
+
| LLM-based Re-ranking | Cross-encoders provide similar quality without LLM cost/latency |
|
|
381
|
+
| Self-RAG | Beyond current scope; more relevant for RAG pipelines with generation |
|
|
382
|
+
|
|
383
|
+
---
|
|
384
|
+
|
|
385
|
+
## Summary
|
|
386
|
+
|
|
387
|
+
| Category | Count |
|
|
388
|
+
| --------------- | ----------------- |
|
|
389
|
+
| Implemented | 5 items |
|
|
390
|
+
| Task Candidates | 10 items |
|
|
391
|
+
| Skipped | 7 recommendations |
|
|
392
|
+
|
|
393
|
+
### Priority Matrix
|
|
394
|
+
|
|
395
|
+
| Priority | Tasks |
|
|
396
|
+
| ---------- | ------------------------------------------------------------------------- |
|
|
397
|
+
| **High** | Hybrid Search (BM25+RRF), Local Embedding Provider (Ollama) |
|
|
398
|
+
| **Medium** | Cross-Encoder Re-ranking, Dynamic efSearch, Fix Dimension Mismatch |
|
|
399
|
+
| **Low** | HNSW Config, Query Preprocessing, Heading Boost, HyDE, Voyage AI Provider |
|
|
400
|
+
|
|
401
|
+
### Recommended Implementation Order
|
|
402
|
+
|
|
403
|
+
**Phase 1: Quick Wins (1 week)**
|
|
404
|
+
|
|
405
|
+
1. Fix dimension mismatch in provider
|
|
406
|
+
2. Add dynamic efSearch (quality modes)
|
|
407
|
+
3. Add query preprocessing
|
|
408
|
+
4. Add heading match boost
|
|
409
|
+
|
|
410
|
+
**Phase 2: Hybrid Search (1-2 weeks)**
|
|
411
|
+
|
|
412
|
+
1. Integrate BM25 library
|
|
413
|
+
2. Build BM25 index during embed
|
|
414
|
+
3. Implement RRF fusion
|
|
415
|
+
4. Add `--mode hybrid` to CLI
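
Step 3 of this phase, RRF fusion, can be sketched as follows. This is a minimal illustration of Reciprocal Rank Fusion, not the project's implementation: the function name, the sample section ids, and the constant `k = 60` (a commonly used default) are assumptions.

```typescript
// Reciprocal Rank Fusion: score(d) = sum over result lists of 1 / (k + rank),
// where rank starts at 1 for the best match in each list.
const reciprocalRankFusion = (
  rankings: readonly (readonly string[])[],
  k = 60, // commonly used constant; treat as a tunable assumption
): Array<{ id: string; score: number }> => {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  // Highest fused score first.
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
};

// Hypothetical section ids returned by each retriever, best match first.
const bm25Ranking = ["install", "config", "faq"];
const semanticRanking = ["config", "overview", "install"];
const fused = reciprocalRankFusion([bm25Ranking, semanticRanking]);
```

Because RRF uses only ranks, BM25 scores and cosine similarities never need to be normalized onto a common scale before they are combined.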
|
|
416
|
+
|
|
417
|
+
**Phase 3: Local/Offline (1-2 weeks)**
|
|
418
|
+
|
|
419
|
+
1. Implement Ollama provider
|
|
420
|
+
2. Add provider selection CLI
|
|
421
|
+
3. Test with nomic-embed-text
|
|
422
|
+
|
|
423
|
+
**Phase 4: Advanced (2 weeks)**
|
|
424
|
+
|
|
425
|
+
1. Cross-encoder re-ranking (Transformers.js)
|
|
426
|
+
2. HyDE query expansion (optional)
|
|
427
|
+
3. Alternative API providers (Voyage AI)
|
|
@@ -0,0 +1,156 @@
|
|
|
1
|
+
# Embedding Text Analysis
|
|
2
|
+
|
|
3
|
+
## Executive Summary
|
|
4
|
+
|
|
5
|
+
The current embedding text generation format is **appropriate and follows best practices**. The format enriches content with contextual metadata (heading, parent section, document title) which helps embedding models understand the semantic context.
|
|
6
|
+
|
|
7
|
+
The similarity score issues identified in ALP-203 are **NOT caused by the embedding text format**, but rather by:
|
|
8
|
+
1. Inherent properties of embedding models with short queries
|
|
9
|
+
2. The 0.5 default threshold being too high for certain query types
|
|
10
|
+
|
|
11
|
+
## Current Implementation
|
|
12
|
+
|
|
13
|
+
### generateEmbeddingText Function
|
|
14
|
+
|
|
15
|
+
Location: `src/embeddings/semantic-search.ts:46-63`
|
|
16
|
+
|
|
17
|
+
```typescript
|
|
18
|
+
const generateEmbeddingText = (
|
|
19
|
+
section: SectionEntry,
|
|
20
|
+
content: string,
|
|
21
|
+
documentTitle: string,
|
|
22
|
+
parentHeading?: string | undefined,
|
|
23
|
+
): string => {
|
|
24
|
+
const parts: string[] = []
|
|
25
|
+
|
|
26
|
+
parts.push(`# ${section.heading}`)
|
|
27
|
+
if (parentHeading) {
|
|
28
|
+
parts.push(`Parent section: ${parentHeading}`)
|
|
29
|
+
}
|
|
30
|
+
parts.push(`Document: ${documentTitle}`)
|
|
31
|
+
parts.push('')
|
|
32
|
+
parts.push(content)
|
|
33
|
+
|
|
34
|
+
return parts.join('\n')
|
|
35
|
+
}
|
|
36
|
+
```
|
|
37
|
+
|
|
38
|
+
### Generated Text Format
|
|
39
|
+
|
|
40
|
+
**For a top-level section (e.g., "Overview" in "Failure Automation"):**
|
|
41
|
+
```
|
|
42
|
+
# Overview
|
|
43
|
+
Document: Failure Automation
|
|
44
|
+
|
|
45
|
+
Failure automation is the practice of automatically detecting,
|
|
46
|
+
reporting, and responding to system failures without human
|
|
47
|
+
intervention. This approach is essential for maintaining high
|
|
48
|
+
availability in modern distributed systems.
|
|
49
|
+
```
|
|
50
|
+
|
|
51
|
+
**For a nested section (e.g., "Automated Failure Detection"):**
|
|
52
|
+
```
|
|
53
|
+
# Automated Failure Detection
|
|
54
|
+
Parent section: Core Concepts
|
|
55
|
+
Document: Failure Automation
|
|
56
|
+
|
|
57
|
+
Systems use health checks, heartbeats, and monitoring to detect
|
|
58
|
+
when components fail. Failure detection must be fast and accurate
|
|
59
|
+
to minimize downtime.
|
|
60
|
+
```
|
|
61
|
+
|
|
62
|
+
## Analysis
|
|
63
|
+
|
|
64
|
+
### What the Current Format Does Well
|
|
65
|
+
|
|
66
|
+
1. **Heading as Title**: Using `# {heading}` format is standard markdown that embedding models are trained on. The heading provides semantic context about the section topic.
|
|
67
|
+
|
|
68
|
+
2. **Hierarchical Context**: Including `Parent section: {parentHeading}` helps the model understand the section's place in the document structure. This is especially valuable for nested sections with generic headings like "Overview" or "Best Practices".
|
|
69
|
+
|
|
70
|
+
3. **Document Context**: Including `Document: {documentTitle}` helps disambiguate content that might otherwise be too generic.
|
|
71
|
+
|
|
72
|
+
4. **Content Preservation**: The full section content is included, providing rich semantic signal.
|
|
73
|
+
|
|
74
|
+
### Potential Concerns Investigated
|
|
75
|
+
|
|
76
|
+
| Concern | Finding |
|
|
77
|
+
|---------|---------|
|
|
78
|
+
| Does `# {heading}` confuse the model? | No - embedding models are trained on markdown and understand heading syntax |
|
|
79
|
+
| Is metadata adding noise? | No - metadata provides helpful context, especially for short sections |
|
|
80
|
+
| Is content truncated? | No - full section content is included |
|
|
81
|
+
| Are important keywords lost? | No - nothing is removed from original content |
|
|
82
|
+
|
|
83
|
+
### Comparison with Best Practices
|
|
84
|
+
|
|
85
|
+
**OpenAI Recommendations:**
|
|
86
|
+
- `text-embedding-3-small` embeds both queries and documents with the same model
|
|
87
|
+
- No special prefixes or asymmetric handling needed
|
|
88
|
+
- Cosine similarity is recommended for comparison
|
|
89
|
+
- The model captures semantic meaning, not just keyword overlap
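
For reference, the cosine similarity used to compare a query embedding with a section embedding can be written as below. This is a generic sketch; the function name is illustrative and does not come from the codebase.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|), ranging from -1 to 1.
const cosineSimilarity = (a: readonly number[], b: readonly number[]): number => {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  // Define similarity with a zero vector as 0 rather than dividing by zero.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
};
```

For vectors that are already unit length, the denominator is 1 and the result reduces to the plain dot product.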
|
|
90
|
+
|
|
91
|
+
**Industry Patterns:**
|
|
92
|
+
- Many RAG systems include metadata like titles and hierarchical context
|
|
93
|
+
- Including document/section titles is a common best practice
|
|
94
|
+
- Enriching content with context improves retrieval quality
|
|
95
|
+
|
|
96
|
+
## Token Count Analysis
|
|
97
|
+
|
|
98
|
+
Sample embedded texts from test corpus:
|
|
99
|
+
|
|
100
|
+
| Section | Content Tokens | Metadata Overhead | Total | Overhead % |
|
|
101
|
+
|---------|----------------|-------------------|-------|------------|
|
|
102
|
+
| Overview | ~50 | ~15 | ~65 | 23% |
|
|
103
|
+
| Automated Failure Detection | ~40 | ~20 | ~60 | 33% |
|
|
104
|
+
| Best Practices | ~100 | ~15 | ~115 | 13% |
|
|
105
|
+
|
|
106
|
+
The metadata overhead is reasonable (13-33%) and provides valuable semantic context.
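
The overhead column is metadata tokens divided by total tokens. A small sketch that reproduces the table's percentages from its approximate token counts (the helper name is illustrative):

```typescript
// Overhead % = metadata tokens / (content tokens + metadata tokens), rounded to a whole percent.
const overheadPercent = (contentTokens: number, metadataTokens: number): number =>
  Math.round((metadataTokens / (contentTokens + metadataTokens)) * 100);

// Approximate counts taken from the table above.
const overview = overheadPercent(50, 15);
const failureDetection = overheadPercent(40, 20);
const bestPractices = overheadPercent(100, 15);
```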
|
|
107
|
+
|
|
108
|
+
## Root Cause of Similarity Score Issues
|
|
109
|
+
|
|
110
|
+
The similarity score issues are **NOT caused by embedding text generation**. Based on ALP-203 findings:
|
|
111
|
+
|
|
112
|
+
### Why Short Queries Have Low Scores
|
|
113
|
+
|
|
114
|
+
1. **Vector Space Properties**: A single word like "failure" produces an embedding that represents the general concept. Content sections contain many concepts, making the cosine similarity lower.
|
|
115
|
+
|
|
116
|
+
2. **Context Asymmetry**: A query "failure" is matched against embeddings like:
|
|
117
|
+
```
|
|
118
|
+
# Failure Isolation
|
|
119
|
+
Parent section: Core Concepts
|
|
120
|
+
Document: Failure Automation
|
|
121
|
+
|
|
122
|
+
Automated systems can isolate failures to prevent cascading effects...
|
|
123
|
+
```
|
|
124
|
+
The query is a subset of the embedded content, not a full match.
|
|
125
|
+
|
|
126
|
+
3. **Embedding Model Behavior**: `text-embedding-3-small` produces unit-length (normalized) vectors, so cosine similarity depends only on direction. Short inputs carry less context, so their vectors are less distinctive.
|
|
127
|
+
|
|
128
|
+
### Why Multi-Word Domain Queries Work Better
|
|
129
|
+
|
|
130
|
+
Queries like "failure automation" provide:
|
|
131
|
+
- Multiple semantic signals
|
|
132
|
+
- Domain-specific terminology
|
|
133
|
+
- Closer match to document/heading names
|
|
134
|
+
- More distinctive embedding vectors
|
|
135
|
+
|
|
136
|
+
## Recommendations
|
|
137
|
+
|
|
138
|
+
### No Changes Needed to Embedding Text Format
|
|
139
|
+
|
|
140
|
+
The current format is sound. The issues are in threshold calibration and query handling.
|
|
141
|
+
|
|
142
|
+
### Potential Improvements (for ALP-207)
|
|
143
|
+
|
|
144
|
+
1. **Query Enhancement**: Consider expanding short queries with context before embedding
|
|
145
|
+
2. **Threshold Tuning**: Use adaptive thresholds based on query length
|
|
146
|
+
3. **Hybrid Search Default**: Leverage keyword search to boost short query results
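
As an illustration of item 2, a length-based threshold could look like the sketch below. Only the 0.5 default comes from this analysis; the function name and the 0.3 and 0.4 values are placeholder assumptions that would need calibration against real queries.

```typescript
const DEFAULT_THRESHOLD = 0.5; // default threshold cited in this analysis

// Hypothetical cutoffs: shorter queries get a lower similarity threshold.
const adaptiveThreshold = (query: string): number => {
  const wordCount = query.trim().split(/\s+/).filter(Boolean).length;
  if (wordCount <= 1) return 0.3; // placeholder value for single-word queries
  if (wordCount === 2) return 0.4; // placeholder value for two-word queries
  return DEFAULT_THRESHOLD;
};
```

A query such as "failure" would then be filtered at 0.3 instead of 0.5, while longer queries keep the existing default.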
|
|
147
|
+
|
|
148
|
+
## Related Research
|
|
149
|
+
|
|
150
|
+
- [OpenAI Embeddings Guide](https://platform.openai.com/docs/guides/embeddings)
|
|
151
|
+
- [Text-embedding-3-small Model](https://platform.openai.com/docs/models/text-embedding-3-small)
|
|
152
|
+
- [Zilliz: Guide to text-embedding-3-small](https://zilliz.com/ai-models/text-embedding-3-small)
|
|
153
|
+
|
|
154
|
+
## Conclusion
|
|
155
|
+
|
|
156
|
+
The embedding text generation implementation is correct and follows best practices. The similarity score issues identified in ALP-203 should be addressed through threshold calibration and query processing improvements, not by modifying how content is embedded.
|