npm - mdcontext - Versions diffs - 0.1.0 → 0.2.0 - Mend

mdcontext 0.1.0 → 0.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (251)

package/.changeset/config.json +9 -9
package/.claude/settings.local.json +25 -0
package/.github/workflows/claude-code-review.yml +44 -0
package/.github/workflows/claude.yml +85 -0
package/CONTRIBUTING.md +186 -0
package/NOTES/NOTES +44 -0
package/README.md +206 -3
package/biome.json +1 -1
package/dist/chunk-23UPXDNL.js +3044 -0
package/dist/chunk-2W7MO2DL.js +1366 -0
package/dist/chunk-3NUAZGMA.js +1689 -0
package/dist/chunk-7TOWB2XB.js +366 -0
package/dist/chunk-7XOTOADQ.js +3065 -0
package/dist/chunk-AH2PDM2K.js +3042 -0
package/dist/chunk-BNXWSZ63.js +3742 -0
package/dist/chunk-BTL5DJVU.js +3222 -0
package/dist/chunk-HDHYG7E4.js +104 -0
package/dist/chunk-HLR4KZBP.js +3234 -0
package/dist/chunk-IP3FRFEB.js +1045 -0
package/dist/chunk-KHU56VDO.js +3042 -0
package/dist/chunk-KRYIFLQR.js +85 -89
package/dist/chunk-LBSDNLEM.js +287 -0
package/dist/chunk-MNTQ7HCP.js +2643 -0
package/dist/chunk-MUJELQQ6.js +1387 -0
package/dist/chunk-MXJGMSLV.js +2199 -0
package/dist/chunk-N6QJGC3Z.js +2636 -0
package/dist/chunk-OBELGBPM.js +1713 -0
package/dist/chunk-OT7R5XTA.js +3192 -0
package/dist/chunk-P7X4RA2T.js +106 -0
package/dist/chunk-PIDUQNC2.js +3185 -0
package/dist/chunk-POGCDIH4.js +3187 -0
package/dist/chunk-PSIEOQGZ.js +3043 -0
package/dist/chunk-PVRT3IHA.js +3238 -0
package/dist/chunk-QNN4TT23.js +1430 -0
package/dist/chunk-RE3R45RJ.js +3042 -0
package/dist/chunk-S7E6TFX6.js +718 -657
package/dist/chunk-SG6GLU4U.js +1378 -0
package/dist/chunk-SJCDV2ST.js +274 -0
package/dist/chunk-SYE5XLF3.js +104 -0
package/dist/chunk-T5VLYBZD.js +103 -0
package/dist/chunk-TOQB7VWU.js +3238 -0
package/dist/chunk-VFNMZ4ZQ.js +3228 -0
package/dist/chunk-VVTGZNBT.js +1533 -1423
package/dist/chunk-W7Q4RFEV.js +104 -0
package/dist/chunk-XTYYVRLO.js +3190 -0
package/dist/chunk-Y6MDYVJD.js +3063 -0
package/dist/cli/main.js +4072 -629
package/dist/index.d.ts +420 -33
package/dist/index.js +8 -15
package/dist/mcp/server.js +103 -7
package/dist/schema-BAWSG7KY.js +22 -0
package/dist/schema-E3QUPL26.js +20 -0
package/dist/schema-EHL7WUT6.js +20 -0
package/docs/019-USAGE.md +44 -5
package/docs/020-current-implementation.md +8 -8
package/docs/021-DOGFOODING-FINDINGS.md +1 -1
package/docs/CONFIG.md +1123 -0
package/docs/ERRORS.md +383 -0
package/docs/summarization.md +320 -0
package/justfile +40 -0
package/package.json +39 -33
package/research/INDEX.md +315 -0
package/research/code-review/README.md +90 -0
package/research/code-review/cli-error-handling-review.md +979 -0
package/research/code-review/code-review-validation-report.md +464 -0
package/research/code-review/main-ts-review.md +1128 -0
package/research/config-docs/SUMMARY.md +357 -0
package/research/config-docs/TEST-RESULTS.md +776 -0
package/research/config-docs/TODO.md +542 -0
package/research/config-docs/analysis.md +744 -0
package/research/config-docs/fix-validation.md +502 -0
package/research/config-docs/help-audit.md +264 -0
package/research/config-docs/help-system-analysis.md +890 -0
package/research/frontmatter/COMMENTS-ARE-SKIPPED.md +149 -0
package/research/frontmatter/LLM-CODE-NAVIGATION.md +276 -0
package/research/issue-review.md +603 -0
package/research/llm-summarization/agent-cli-tools-2026.md +1082 -0
package/research/llm-summarization/alternative-providers-2026.md +1428 -0
package/research/llm-summarization/anthropic-2026.md +367 -0
package/research/llm-summarization/claude-cli-integration.md +1706 -0
package/research/llm-summarization/cli-integration-patterns.md +3155 -0
package/research/llm-summarization/openai-2026.md +473 -0
package/research/llm-summarization/openai-compatible-providers-2026.md +1022 -0
package/research/llm-summarization/opencode-cli-integration.md +1552 -0
package/research/llm-summarization/prompt-engineering-2026.md +1426 -0
package/research/llm-summarization/prototype-results.md +56 -0
package/research/llm-summarization/provider-switching-patterns-2026.md +2153 -0
package/research/llm-summarization/typescript-llm-libraries-2026.md +2436 -0
package/research/mdcontext-pudding/00-EXECUTIVE-SUMMARY.md +282 -0
package/research/mdcontext-pudding/01-index-embed.md +956 -0
package/research/mdcontext-pudding/02-search-COMMANDS.md +142 -0
package/research/mdcontext-pudding/02-search-SUMMARY.md +146 -0
package/research/mdcontext-pudding/02-search.md +970 -0
package/research/mdcontext-pudding/03-context.md +779 -0
package/research/mdcontext-pudding/04-navigation-and-analytics.md +803 -0
package/research/mdcontext-pudding/04-tree.md +704 -0
package/research/mdcontext-pudding/05-config.md +1038 -0
package/research/mdcontext-pudding/06-links-summary.txt +87 -0
package/research/mdcontext-pudding/06-links.md +679 -0
package/research/mdcontext-pudding/07-stats.md +693 -0
package/research/mdcontext-pudding/BUG-FIX-PLAN.md +388 -0
package/research/mdcontext-pudding/P0-BUG-VALIDATION.md +167 -0
package/research/mdcontext-pudding/README.md +168 -0
package/research/mdcontext-pudding/TESTING-SUMMARY.md +128 -0
package/research/research-quality-review.md +834 -0
package/research/semantic-search/embedding-text-analysis.md +156 -0
package/research/semantic-search/multi-word-failure-reproduction.md +171 -0
package/research/semantic-search/query-processing-analysis.md +207 -0
package/research/semantic-search/root-cause-and-solution.md +114 -0
package/research/semantic-search/threshold-validation-report.md +69 -0
package/research/semantic-search/vector-search-analysis.md +63 -0
package/research/test-path-issues.md +276 -0
package/review/ALP-76/1-error-type-design.md +962 -0
package/review/ALP-76/2-error-handling-patterns.md +906 -0
package/review/ALP-76/3-error-presentation.md +624 -0
package/review/ALP-76/4-test-coverage.md +625 -0
package/review/ALP-76/5-migration-completeness.md +440 -0
package/review/ALP-76/6-effect-best-practices.md +755 -0
package/scripts/apply-branch-protection.sh +47 -0
package/scripts/branch-protection-templates.json +79 -0
package/scripts/prototype-summarization.ts +346 -0
package/scripts/rebuild-hnswlib.js +32 -37
package/scripts/setup-branch-protection.sh +64 -0
package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/active-provider.json +7 -0
package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/bm25.json +541 -0
package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/bm25.meta.json +5 -0
package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/config.json +8 -0
package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/embeddings/openai_text-embedding-3-small_512/vectors.bin +0 -0
package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/embeddings/openai_text-embedding-3-small_512/vectors.meta.bin +0 -0
package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/indexes/documents.json +60 -0
package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/indexes/links.json +13 -0
package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/indexes/sections.json +1197 -0
package/src/__tests__/fixtures/semantic-search/multi-word-corpus/configuration-management.md +99 -0
package/src/__tests__/fixtures/semantic-search/multi-word-corpus/distributed-systems.md +92 -0
package/src/__tests__/fixtures/semantic-search/multi-word-corpus/error-handling.md +78 -0
package/src/__tests__/fixtures/semantic-search/multi-word-corpus/failure-automation.md +55 -0
package/src/__tests__/fixtures/semantic-search/multi-word-corpus/job-context.md +69 -0
package/src/__tests__/fixtures/semantic-search/multi-word-corpus/process-orchestration.md +99 -0
package/src/cli/argv-preprocessor.test.ts +2 -2
package/src/cli/cli.test.ts +230 -33
package/src/cli/commands/config-cmd.ts +642 -0
package/src/cli/commands/context.ts +97 -9
package/src/cli/commands/duplicates.ts +122 -0
package/src/cli/commands/embeddings.ts +529 -0
package/src/cli/commands/index-cmd.ts +210 -30
package/src/cli/commands/index.ts +3 -0
package/src/cli/commands/search.ts +894 -64
package/src/cli/commands/stats.ts +3 -0
package/src/cli/commands/tree.ts +26 -5
package/src/cli/config-layer.ts +176 -0
package/src/cli/error-handler.test.ts +235 -0
package/src/cli/error-handler.ts +655 -0
package/src/cli/flag-schemas.ts +66 -0
package/src/cli/help.ts +209 -7
package/src/cli/main.ts +348 -58
package/src/cli/options.ts +10 -0
package/src/cli/shared-error-handling.ts +199 -0
package/src/cli/utils.ts +150 -17
package/src/config/file-provider.test.ts +320 -0
package/src/config/file-provider.ts +273 -0
package/src/config/index.ts +72 -0
package/src/config/integration.test.ts +667 -0
package/src/config/precedence.test.ts +277 -0
package/src/config/precedence.ts +451 -0
package/src/config/schema.test.ts +414 -0
package/src/config/schema.ts +603 -0
package/src/config/service.test.ts +320 -0
package/src/config/service.ts +243 -0
package/src/config/testing.test.ts +264 -0
package/src/config/testing.ts +110 -0
package/src/core/types.ts +6 -33
package/src/duplicates/detector.test.ts +183 -0
package/src/duplicates/detector.ts +414 -0
package/src/duplicates/index.ts +18 -0
package/src/embeddings/embedding-namespace.test.ts +300 -0
package/src/embeddings/embedding-namespace.ts +947 -0
package/src/embeddings/heading-boost.test.ts +222 -0
package/src/embeddings/hnsw-build-options.test.ts +198 -0
package/src/embeddings/hyde.test.ts +272 -0
package/src/embeddings/hyde.ts +264 -0
package/src/embeddings/index.ts +2 -0
package/src/embeddings/openai-provider.ts +332 -83
package/src/embeddings/pricing.json +22 -0
package/src/embeddings/provider-constants.ts +204 -0
package/src/embeddings/provider-errors.test.ts +967 -0
package/src/embeddings/provider-errors.ts +565 -0
package/src/embeddings/provider-factory.test.ts +240 -0
package/src/embeddings/provider-factory.ts +225 -0
package/src/embeddings/provider-integration.test.ts +788 -0
package/src/embeddings/query-preprocessing.test.ts +187 -0
package/src/embeddings/semantic-search-threshold.test.ts +508 -0
package/src/embeddings/semantic-search.ts +780 -93
package/src/embeddings/types.ts +293 -16
package/src/embeddings/vector-store.ts +486 -77
package/src/embeddings/voyage-provider.ts +313 -0
package/src/errors/errors.test.ts +845 -0
package/src/errors/index.ts +533 -0
package/src/index/ignore-patterns.test.ts +354 -0
package/src/index/ignore-patterns.ts +305 -0
package/src/index/indexer.ts +286 -48
package/src/index/storage.ts +94 -30
package/src/index/types.ts +40 -2
package/src/index/watcher.ts +67 -9
package/src/index.ts +22 -0
package/src/integration/search-keyword.test.ts +678 -0
package/src/mcp/server.ts +135 -6
package/src/parser/parser.ts +18 -19
package/src/parser/section-filter.test.ts +277 -0
package/src/parser/section-filter.ts +125 -3
package/src/search/__tests__/hybrid-search.test.ts +650 -0
package/src/search/bm25-store.ts +366 -0
package/src/search/cross-encoder.test.ts +253 -0
package/src/search/cross-encoder.ts +406 -0
package/src/search/fuzzy-search.test.ts +419 -0
package/src/search/fuzzy-search.ts +273 -0
package/src/search/hybrid-search.ts +448 -0
package/src/search/path-matcher.test.ts +276 -0
package/src/search/path-matcher.ts +33 -0
package/src/search/searcher.test.ts +99 -1
package/src/search/searcher.ts +189 -67
package/src/search/wink-bm25.d.ts +30 -0
package/src/summarization/cli-providers/claude.ts +202 -0
package/src/summarization/cli-providers/detection.test.ts +273 -0
package/src/summarization/cli-providers/detection.ts +118 -0
package/src/summarization/cli-providers/index.ts +8 -0
package/src/summarization/cost.test.ts +139 -0
package/src/summarization/cost.ts +102 -0
package/src/summarization/error-handler.test.ts +127 -0
package/src/summarization/error-handler.ts +111 -0
package/src/summarization/index.ts +102 -0
package/src/summarization/pipeline.test.ts +498 -0
package/src/summarization/pipeline.ts +231 -0
package/src/summarization/prompts.test.ts +269 -0
package/src/summarization/prompts.ts +133 -0
package/src/summarization/provider-factory.test.ts +396 -0
package/src/summarization/provider-factory.ts +178 -0
package/src/summarization/types.ts +184 -0
package/src/summarize/summarizer.ts +104 -35
package/src/types/huggingface-transformers.d.ts +66 -0
package/tests/fixtures/cli/.mdcontext/active-provider.json +7 -0
package/tests/fixtures/cli/.mdcontext/embeddings/openai_text-embedding-3-small_512/vectors.bin +0 -0
package/tests/fixtures/cli/.mdcontext/embeddings/openai_text-embedding-3-small_512/vectors.meta.bin +0 -0
package/tests/fixtures/cli/.mdcontext/indexes/documents.json +4 -4
package/tests/fixtures/cli/.mdcontext/indexes/sections.json +14 -0
package/tests/integration/embed-index.test.ts +712 -0
package/tests/integration/search-context.test.ts +469 -0
package/tests/integration/search-semantic.test.ts +522 -0
package/vitest.config.ts +1 -6
package/AGENTS.md +0 -46
package/tests/fixtures/cli/.mdcontext/vectors.bin +0 -0
package/tests/fixtures/cli/.mdcontext/vectors.meta.json +0 -1264

package/research/semantic-search/embedding-text-analysis.md ADDED

@@ -0,0 +1,156 @@
+# Embedding Text Analysis
+## Executive Summary
+The current embedding text generation format is **appropriate and follows best practices**. The format enriches content with contextual metadata (heading, parent section, document title) which helps embedding models understand the semantic context.
+The similarity score issues identified in ALP-203 are **NOT caused by the embedding text format**, but rather by:
+1. Inherent properties of embedding models with short queries
+2. The 0.5 default threshold being too high for certain query types
+## Current Implementation
+### generateEmbeddingText Function
+Location: `src/embeddings/semantic-search.ts:46-63`
+```typescript
+const generateEmbeddingText = (
+  section: SectionEntry,
+  content: string,
+  documentTitle: string,
+  parentHeading?: string | undefined,
+): string => {
+  const parts: string[] = []
+  parts.push(`# ${section.heading}`)
+  if (parentHeading) {
+    parts.push(`Parent section: ${parentHeading}`)
+  }
+  parts.push(`Document: ${documentTitle}`)
+  parts.push('')
+  parts.push(content)
+  return parts.join('\n')
+}
+```
+### Generated Text Format
+**For a top-level section (e.g., "Overview" in "Failure Automation"):**
+```
+# Overview
+Document: Failure Automation
+
+Failure automation is the practice of automatically detecting,
+reporting, and responding to system failures without human
+intervention. This approach is essential for maintaining high
+availability in modern distributed systems.
+```
+**For a nested section (e.g., "Automated Failure Detection"):**
+```
+# Automated Failure Detection
+Parent section: Core Concepts
+Document: Failure Automation
+
+Systems use health checks, heartbeats, and monitoring to detect
+when components fail. Failure detection must be fast and accurate
+to minimize downtime.
+```
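To make the two formats above concrete, here is a minimal sketch that re-declares the function from the diff and runs it for a nested and a top-level section. The pared-down `SectionEntry` (only a `heading` field) is an assumption for illustration; the package's real type likely carries more fields. The sample content strings are shortened.

```typescript
// Pared-down stand-in for the package's SectionEntry type (assumption:
// only `heading` matters for this illustration).
interface SectionEntry {
  heading: string
}

// Same logic as the function shown in the diff: heading line, optional
// parent line, document line, a blank line, then the section content.
const generateEmbeddingText = (
  section: SectionEntry,
  content: string,
  documentTitle: string,
  parentHeading?: string,
): string => {
  const parts: string[] = [`# ${section.heading}`]
  if (parentHeading) {
    parts.push(`Parent section: ${parentHeading}`)
  }
  parts.push(`Document: ${documentTitle}`, '', content)
  return parts.join('\n')
}

// Nested section: the parent heading is included.
const nestedText = generateEmbeddingText(
  { heading: 'Automated Failure Detection' },
  'Systems use health checks, heartbeats, and monitoring to detect when components fail.',
  'Failure Automation',
  'Core Concepts',
)

// Top-level section: no parent heading, so that line is omitted.
const topLevelText = generateEmbeddingText(
  { heading: 'Overview' },
  'Failure automation is the practice of automatically detecting failures.',
  'Failure Automation',
)

console.log(nestedText)
console.log(topLevelText)
```

The blank line pushed before the content separates the metadata from the body in the embedded text.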
+## Analysis
+### What the Current Format Does Well
+1. **Heading as Title**: Using `# {heading}` format is standard markdown that embedding models are trained on. The heading provides semantic context about the section topic.
+2. **Hierarchical Context**: Including `Parent section: {parentHeading}` helps the model understand the section's place in the document structure. This is especially valuable for nested sections with generic headings like "Overview" or "Best Practices".
+3. **Document Context**: Including `Document: {documentTitle}` helps disambiguate content that might otherwise be too generic.
+4. **Content Preservation**: The full section content is included, providing rich semantic signal.
+### Potential Concerns Investigated
+| Concern | Finding |
+|---------|---------|
+| Does `# {heading}` confuse the model? | No - embedding models are trained on markdown and understand heading syntax |
+| Is metadata adding noise? | No - metadata provides helpful context, especially for short sections |
+| Is content truncated? | No - full section content is included |
+| Are important keywords lost? | No - nothing is removed from original content |
+### Comparison with Best Practices
+**OpenAI Recommendations:**
+- The same text-embedding-3-small model embeds both queries and documents
+- No special prefixes or asymmetric handling needed
+- Cosine similarity is recommended for comparison
+- The model captures semantic meaning, not just keyword overlap
+**Industry Patterns:**
+- Many RAG systems include metadata like titles and hierarchical context
+- Including document/section titles is a common best practice
+- Enriching content with context improves retrieval quality
+## Token Count Analysis
+Sample embedded texts from test corpus:
+| Section | Content Tokens | Metadata Overhead | Total | Overhead % |
+|---------|----------------|-------------------|-------|------------|
+| Overview | ~50 | ~15 | ~65 | 23% |
+| Automated Failure Detection | ~40 | ~20 | ~60 | 33% |
+| Best Practices | ~100 | ~15 | ~115 | 13% |
+The metadata overhead is reasonable (13-33%) and provides valuable semantic context.
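The overhead column is the metadata share of the total embedded text. A small sketch of that arithmetic, using the approximate token counts from the table (the function and variable names are illustrative, not from the package):

```typescript
// Approximate token counts copied from the table above.
const samples = [
  { section: 'Overview', contentTokens: 50, metadataTokens: 15 },
  { section: 'Automated Failure Detection', contentTokens: 40, metadataTokens: 20 },
  { section: 'Best Practices', contentTokens: 100, metadataTokens: 15 },
]

// Overhead = metadata tokens / (content tokens + metadata tokens), as a rounded percentage.
const overheadPercent = (contentTokens: number, metadataTokens: number): number =>
  Math.round((metadataTokens / (contentTokens + metadataTokens)) * 100)

const overheads = samples.map((s) => overheadPercent(s.contentTokens, s.metadataTokens))

console.log(overheads) // [23, 33, 13]
```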
+## Root Cause of Similarity Score Issues
+The similarity score issues are **NOT caused by embedding text generation**. Based on ALP-203 findings:
+### Why Short Queries Have Low Scores
+1. **Vector Space Properties**: A single word like "failure" produces an embedding that represents the general concept. Content sections contain many concepts, making the cosine similarity lower.
+2. **Context Asymmetry**: A query "failure" is matched against embeddings like:
+   ```
+   # Failure Isolation
+   Parent section: Core Concepts
+   Document: Failure Automation
+   Automated systems can isolate failures to prevent cascading effects...
+   ```
+   The query is a subset of the embedded content, not a full match.
+3. **Embedding Model Behavior**: Text-embedding-3-small produces normalized vectors. Short inputs produce vectors that are less distinctive because they have less semantic "mass".
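For reference, the score being thresholded is cosine similarity, which for normalized vectors reduces to a dot product. The following is a generic sketch, not mdcontext's vector-store code; the three-dimensional toy vectors stand in for the 512-dimensional embeddings, and all names are illustrative.

```typescript
// Dot product of two equal-length vectors.
const dot = (a: number[], b: number[]): number =>
  a.reduce((sum, x, i) => sum + x * (b[i] ?? 0), 0)

// Cosine similarity; assumes both vectors are non-zero and the same dimension.
const cosineSimilarity = (a: number[], b: number[]): number => {
  if (a.length !== b.length) {
    throw new Error('vectors must have the same dimension')
  }
  return dot(a, b) / Math.sqrt(dot(a, a) * dot(b, b))
}

// Toy unit vectors: the query points along one axis, while the section
// vector spreads its weight over two axes (it "contains many concepts").
const queryVector = [1, 0, 0]
const sectionVector = [0.6, 0.8, 0]

const score = cosineSimilarity(queryVector, sectionVector)
const passesDefaultThreshold = score >= 0.5

console.log(score.toFixed(2), passesDefaultThreshold)
```

The more of a section's vector that lies along directions the query does not cover, the lower the score, even when the query concept is fully present in the section.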
+### Why Multi-Word Domain Queries Work Better
+Queries like "failure automation" provide:
+- Multiple semantic signals
+- Domain-specific terminology
+- Closer match to document/heading names
+- More distinctive embedding vectors
+## Recommendations
+### No Changes Needed to Embedding Text Format
+The current format is sound. The issues are in threshold calibration and query handling.
+### Potential Improvements (for ALP-207)
+1. **Query Enhancement**: Consider expanding short queries with context before embedding
+2. **Threshold Tuning**: Use adaptive thresholds based on query length
+3. **Hybrid Search Default**: Leverage keyword search to boost short query results
+## Related Research
+- [OpenAI Embeddings Guide](https://platform.openai.com/docs/guides/embeddings)
+- [Text-embedding-3-small Model](https://platform.openai.com/docs/models/text-embedding-3-small)
+- [Zilliz: Guide to text-embedding-3-small](https://zilliz.com/ai-models/text-embedding-3-small)
+## Conclusion
+The embedding text generation implementation is correct and follows best practices. The similarity score issues identified in ALP-203 should be addressed through threshold calibration and query processing improvements, not by modifying how content is embedded.

package/research/semantic-search/multi-word-failure-reproduction.md ADDED

@@ -0,0 +1,171 @@
+# Multi-Word Semantic Search Failure Reproduction
+## Executive Summary
+After systematic testing, the reported "multi-word semantic search failure" is **NOT a failure of semantic search itself**, but rather a **threshold calibration issue**. The root causes are:
+1. **Single-word queries have low similarity scores** (30-49%) while multi-word queries have higher scores (50-70%)
+2. **Default threshold of 0.5** filters out both single-word AND semantically-distant multi-word queries
+3. **Queries with abstract/non-domain-specific terms** (e.g., "gaps missing omissions", "issue challenge gap") score below threshold
+4. **Domain-specific multi-word queries work well** (e.g., "failure automation" = 61%, "process orchestration" = 68%)
+## Test Methodology
+### Test Corpus
+Created a controlled test corpus in `src/__tests__/fixtures/semantic-search/multi-word-corpus/` with 6 markdown files covering:
+- failure-automation.md - Failure detection and automated recovery
+- job-context.md - Job execution context and metadata
+- error-handling.md - Error handling patterns
+- configuration-management.md - Config management practices
+- distributed-systems.md - Distributed systems architecture
+- process-orchestration.md - Workflow orchestration patterns
+**Corpus Statistics:**
+- 6 documents
+- 67 sections
+- 52 embedded vectors
+- ~4,725 tokens
+### Index Command
+```bash
+node dist/cli/main.js index src/__tests__/fixtures/semantic-search/multi-word-corpus --embed --force
+```
+## Test Results
+### Multi-Word Domain-Specific Queries (DEFAULT THRESHOLD 0.5)
+| Query | Results | Top Match | Top Score |
+|-------|---------|-----------|-----------|
+| "failure automation" | 7 | failure-automation.md: Best Practices | 61.6% |
+| "job context" | 4 | job-context.md: What is Job Context? | 60.4% |
+| "error handling" | 7 | error-handling.md: Introduction | 63.7% |
+| "configuration management" | 8 | configuration-management.md: Overview | 69.5% |
+| "distributed systems" | 4 | distributed-systems.md: What Are... | 61.0% |
+| "process orchestration" | 8 | process-orchestration.md: Introduction | 68.0% |
+**Finding:** Multi-word queries with domain-specific terms **WORK WELL** with default threshold.
+### Single-Word Queries (DEFAULT THRESHOLD 0.5)
+| Query | Results | Notes |
+|-------|---------|-------|
+| "failure" | 0 | Below 0.5 threshold |
+| "automation" | 0 | Below 0.5 threshold |
+| "context" | 0 | Below 0.5 threshold |
+| "error" | 0 | Below 0.5 threshold |
+### Single-Word Queries (THRESHOLD 0.3)
+| Query | Results | Top Match | Top Score |
+|-------|---------|-----------|-----------|
+| "failure" | 10 | failure-automation.md: Failure Isolation | 39.1% |
+| "automation" | 10 | (similar) | ~35% |
+| "error" | 10 | error-handling.md: Programming Errors | 49.1% |
+**Finding:** Single-word queries have inherently **LOW similarity scores** (30-49%) due to:
+1. Short query embeddings lack semantic context
+2. Embedding model produces less distinctive vectors for single words
+3. Cosine similarity between the embedding of a short query and the embedding of a long section is compressed
+### Abstract/Generic Multi-Word Queries (DEFAULT THRESHOLD 0.5)
+| Query | Results | Notes |
+|-------|---------|-------|
+| "issue challenge gap" | 0 | Abstract terms, no domain match |
+| "gaps missing omissions" | 0 | Meta-language about content, not content itself |
+### Abstract Queries (THRESHOLD 0.3)
+| Query | Results | Top Match | Top Score |
+|-------|---------|-----------|-----------|
+| "issue challenge gap" | 10 | distributed-systems.md: Consistency vs Availability | 40.8% |
+| "gaps missing omissions" | 3 | error-handling.md: Programming Errors | 35.0% |
+**Finding:** Abstract/meta-language queries score **30-40%** - below default threshold but findable with lower threshold.
+### Hybrid Search Results
+| Query | Hybrid Results | Primary Source |
+|-------|---------------|----------------|
+| "failure automation" | 7 | Semantic (RRF ~1.6) |
+| "job context" | 4 | Semantic (RRF ~1.6) |
+**Finding:** Hybrid search successfully combines semantic and keyword results, but the semantic component still uses the threshold filter.
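The "RRF" in the table refers to reciprocal rank fusion. Below is a generic sketch of the technique, not mdcontext's implementation: the constant `k = 60`, the function name, and the section ids are assumptions, and how the package scales or weights its RRF score is not shown in this excerpt.

```typescript
// Generic reciprocal rank fusion: every ranked list contributes
// 1 / (k + rank) to the score of each id it contains (rank starts at 1).
const reciprocalRankFusion = (
  rankings: string[][],
  k = 60, // commonly used constant; an assumption here
): Array<{ id: string; score: number }> => {
  const scores = new Map<string, number>()
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank))
    })
  }
  // Highest fused score first.
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((x, y) => y.score - x.score)
}

// Hypothetical ranked section ids from the two retrievers.
const semanticRanking = ['best-practices', 'overview', 'failure-isolation']
const keywordRanking = ['overview', 'recovery', 'best-practices']

const fused = reciprocalRankFusion([semanticRanking, keywordRanking])

console.log(fused.map((r) => r.id))
```

Because only ranks are fused, a section that the semantic side drops at the threshold never receives a semantic rank, which is why the threshold still matters in hybrid mode.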
+## Pattern Analysis
+### What Works (>50% similarity)
+- Multi-word queries with **domain-specific terms** directly present in content
+- Queries that form **coherent concepts** (e.g., "process orchestration")
+- Queries that match **document titles or major headings**
+### What Fails at Default Threshold
+- **Single words** - all score 30-49%
+- **Abstract meta-language** - "gaps", "issues", "challenges" without domain context
+- **Non-domain queries** searching indexed domain content
+- **Very short queries** (1-2 generic words)
+### Similarity Score Distribution
+```
+70%+ : Document title/heading exact concept matches
+60-70%: Multi-word domain queries matching content topics
+50-60%: Multi-word queries with partial concept overlap
+40-50%: Single words or abstract queries with some relevance
+30-40%: Tangentially related content
+<30% : Unrelated content (correctly filtered)
+```
+## Dogfooding Context
+The dogfooding agents reported semantic search as "unreliable for multi-word conceptual queries". Re-analysis shows:
+1. **No embeddings were built** during dogfooding (only keyword index existed)
+2. Semantic search was **unavailable** - falling back to keyword search
+3. Multi-word **keyword** searches like "failure automation" worked
+4. Multi-word keyword searches as **quoted phrases** returned 0 (expecting exact text)
+5. Abstract queries like "gaps missing omissions" correctly returned 0 (phrase not in content)
+The actual issue was:
+- **Semantic search unavailable** (no embeddings)
+- **Keyword phrase search** misunderstood (quoted = exact match)
+- **Abstract conceptual queries** don't match concrete content via keyword
+## Recommendations
+### For ALP-204 (Embedding Text Analysis)
+- Analyze how `generateEmbeddingText()` combines section context
+- Check if heading + parent + content provides enough semantic signal for short queries
+### For ALP-205 (Query Processing)
+- Query text is passed directly to embedding - no preprocessing
+- Consider query expansion for short queries
+### For ALP-206 (Vector Search Parameters)
+- Default threshold of 0.5 is **too high** for single-word queries
+- Consider adaptive thresholds based on query length
+- Consider returning top-K results regardless of threshold, then filtering
+### For ALP-207 (Solution Design)
+Key solutions to consider:
+1. **Adaptive threshold** - lower for short queries
+2. **Query expansion** - augment short queries with context
+3. **Better user feedback** - show "X results below threshold" message
+4. **Threshold documentation** - educate users on --threshold flag
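The "top-K, then filter" and "X results below threshold" ideas can be sketched together. This is an illustrative sketch under assumed names (`ScoredResult`, `partitionByThreshold`) and made-up candidate scores, not code from the package:

```typescript
interface ScoredResult {
  sectionId: string
  similarity: number // cosine similarity, 0 to 1
}

// Take the top-K candidates regardless of threshold, then split them into
// the ones shown to the user and a count of the ones the threshold hides.
const partitionByThreshold = (
  results: ScoredResult[],
  threshold: number,
  topK = 10,
): { shown: ScoredResult[]; belowThreshold: number } => {
  const top = results
    .slice()
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, topK)
  const shown = top.filter((r) => r.similarity >= threshold)
  return { shown, belowThreshold: top.length - shown.length }
}

// Illustrative candidates for a single-word query.
const candidates: ScoredResult[] = [
  { sectionId: 'failure-isolation', similarity: 0.391 },
  { sectionId: 'overview', similarity: 0.35 },
]

const outcome = partitionByThreshold(candidates, 0.5)

// Feedback message when everything was filtered out.
const message =
  outcome.shown.length === 0 && outcome.belowThreshold > 0
    ? `0 results; ${outcome.belowThreshold} below threshold 0.5 (try --threshold 0.3)`
    : `${outcome.shown.length} results`

console.log(message)
```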
+## Conclusion
+Multi-word semantic search **is working correctly** for domain-specific queries. The perceived "failure" is a combination of:
+1. No embeddings in dogfooding environment
+2. Threshold too high for short/abstract queries
+3. Confusion between keyword phrase search and semantic search
+4. Users expecting semantic search to understand meta-language about content
+The fix is NOT to change the semantic search algorithm, but to:
+1. Calibrate default threshold appropriately
+2. Add query-length-aware threshold adjustment
+3. Improve error messages when no results are found
+4. Consider hybrid search as default mode

package/research/semantic-search/query-processing-analysis.md ADDED

@@ -0,0 +1,207 @@
+# Query Processing Analysis
+## Executive Summary
+Query processing is **minimal and appropriate**. The query text is passed directly to the embedding API without modification. This is correct behavior for OpenAI's text-embedding-3-small model, which handles text normalization internally.
+The asymmetry between query format (plain text) and document format (text with metadata) does NOT cause issues - embedding models are designed for this asymmetric retrieval pattern.
+## Query Flow
+```
+User Input
+    │
+    ▼
+CLI Parser (search.ts)
+    │ query string unchanged
+    ▼
+semanticSearch(rootPath, query, options)
+    │ query string unchanged
+    ▼
+provider.embed([query])
+    │ passed directly to API
+    ▼
+OpenAI Embeddings API
+    │ returns 512-dimensional vector
+    ▼
+Vector Store search()
+    │ cosine similarity comparison
+    ▼
+Results filtered by threshold
+```
+## Code Trace
+### Entry Point: CLI
+```typescript
+// src/cli/commands/search.ts:53-55
+query: Args.text({ name: 'query' }).pipe(
+  Args.withDescription('Search query (natural language or regex pattern)'),
+),
+```
+The query enters as a raw text string, no preprocessing.
+### Search Mode Detection
+```typescript
+// src/cli/commands/search.ts:201-206
+} else if (isAdvancedQuery(query)) {
+  effectiveMode = 'keyword'
+  modeReason = 'boolean/phrase pattern detected'
+} else if (isRegexPattern(query)) {
+  effectiveMode = 'keyword'
+  modeReason = 'regex pattern detected'
+}
+```
+Queries with boolean operators (AND, OR, NOT) or quoted phrases are routed to keyword search. Plain multi-word queries go to semantic search.
+### Semantic Search Function
+```typescript
+// src/embeddings/semantic-search.ts:558-559
+// Embed the query
+const queryResult = yield* wrapEmbedding(provider.embed([query]))
+```
+**No preprocessing** - query is embedded exactly as received.
+### Embedding API Call
+```typescript
+// src/embeddings/openai-provider.ts:175-179
+const response = await this.client.embeddings.create({
+  model: this.model,
+  input: batch,  // query text passed directly
+  dimensions: 512,
+})
+```
+Query text goes directly to OpenAI API without modification.
+## Query vs Document Format Asymmetry
+### Document Embedding Format (from ALP-204)
+```
+# {heading}
+Parent section: {parentHeading}
+Document: {documentTitle}
+
+{content}
+```
+### Query Format
+```
+{raw query text}
+```
+### Analysis
+This asymmetry is **intentional and correct** for semantic search:
+1. **Embedding models handle asymmetry**: OpenAI's text-embedding models are trained on diverse text formats. They produce semantically meaningful vectors regardless of format.
+2. **Query expansion is not needed**: The embedding model understands "failure automation" conceptually - it doesn't need to see `# Failure Automation` format.
+3. **Document context helps disambiguation**: The heading/document metadata in indexed content helps distinguish between sections with similar content but different contexts.
+4. **Industry standard practice**: Most RAG systems use plain queries against enriched documents.
+## Query Variation Tests
+All variations produce semantically similar results:
+| Query | Top Result | Similarity |
+|-------|------------|------------|
+| "failure automation" | Best Practices | 61.6% |
+| "failure-automation" | Overview | 68.8% |
+| "Failure Automation" | Best Practices | 65.6% |
+| "automation for failures" | Overview | 70.3% |
+| "how to automate failure handling" | Best Practices | 66.4% |
+**Findings:**
+- Casing doesn't significantly affect results
+- Hyphenation produces slightly different top result
+- Word order matters but doesn't break search
+- Natural language queries work well
+## Threshold Analysis
+### Default Threshold Flow
+```
+CLI default: 0.45
+    │
+    ▼ (if CLI uses default)
+Config default: 0.5
+    │
+    ▼
+Effective threshold: 0.5
+```
+When the user doesn't specify `--threshold`, the effective value is 0.5 from config.
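One way a flow like this could arise is sketched below. The mechanism (treating a CLI value equal to the CLI default as "not specified" and falling back to config) is a guess for illustration, and `resolveThreshold` and the constant names are hypothetical; only the 0.45 and 0.5 values come from the text above.

```typescript
const CLI_DEFAULT_THRESHOLD = 0.45 // CLI default from the flow above
const CONFIG_DEFAULT_THRESHOLD = 0.5 // config default from the flow above

// Hypothetical resolver: an absent CLI value, or one equal to the CLI
// default, is treated as "not specified" and the config value wins.
const resolveThreshold = (
  cliValue: number | undefined,
  configValue: number = CONFIG_DEFAULT_THRESHOLD,
): number => {
  if (cliValue === undefined || cliValue === CLI_DEFAULT_THRESHOLD) {
    return configValue
  }
  return cliValue
}

console.log(resolveThreshold(undefined)) // falls back to config
console.log(resolveThreshold(0.3)) // explicit user value wins
```

A side effect of this kind of default-comparison is that a user who explicitly passes the CLI default value cannot be told apart from one who passed nothing.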
+### Threshold Impact
+| Threshold | Single-word "failure" | Multi-word "failure automation" |
+|-----------|----------------------|--------------------------------|
+| 0.5 | 0 results | 7 results |
+| 0.3 | 10 results | 7+ results |
+| 0.1 | 10 results | 7+ results |
+The 0.5 threshold filters out low-similarity single-word matches while allowing relevant multi-word matches through.
+## Potential Query Enhancements (for ALP-207)
+While current processing is correct, potential improvements could include:
+### 1. Query Expansion for Short Queries
+```typescript
+// Hypothetical enhancement
+const enhancedQuery = query.split(' ').length <= 2
+  ? `Find content about: ${query}`
+  : query
+```
+### 2. Adaptive Threshold
+```typescript
+// Lower threshold for shorter queries
+const adaptiveThreshold = query.split(' ').length <= 1
+  ? 0.3
+  : options.threshold ?? 0.5
+```
+### 3. Hybrid by Default
+Short queries might benefit from hybrid mode being the default, leveraging both keyword and semantic signals.
+## Recommendations
+### No Changes Needed to Query Processing
+The current implementation is correct. The query flow is:
+- Clean (no unnecessary transformations)
+- Transparent (what you type is what gets embedded)
+- Flexible (users can adjust with `--threshold`)
+### Focus Areas for ALP-207
+1. **Threshold tuning** - Consider lowering default to 0.4 or making it adaptive
+2. **Better feedback** - Show "X results below threshold" when a search returns 0 results
+3. **Documentation** - Explain threshold behavior in help text
+4. **Hybrid default** - Consider hybrid mode as default for better coverage
+## Conclusion
+Query processing is implemented correctly. The perceived "multi-word query failures" are actually threshold calibration issues, not query processing bugs. The search correctly:
+1. Passes queries unchanged to embedding API (correct)
+2. Uses asymmetric retrieval (query vs enriched documents) (correct)
+3. Handles query variations semantically (working)
+4. Applies configurable threshold (working, but may need tuning)

package/research/semantic-search/root-cause-and-solution.md ADDED

@@ -0,0 +1,114 @@
+# Root Cause Analysis and Solution Design
+## Executive Summary
+**Root Cause**: The "multi-word semantic search failure" is a **threshold calibration issue**, not a search algorithm bug.
+**Key Findings**:
+1. Multi-word domain queries WORK correctly (60-70% similarity)
+2. Single-word queries score lower (30-40%) due to embedding model properties
+3. Default 0.5 threshold filters out short/abstract queries
+4. The dogfooding run had no embeddings built, so agents fell back to keyword search
+5. Embedding text format, query processing, and HNSW config are all correct
+**Solution**: Lower default threshold + improve user feedback for edge cases.
+## Synthesis of Diagnostic Findings
+### ALP-203: Reproduction Results
+| Query Type | Works at 0.5? | Score Range |
+|------------|---------------|-------------|
+| "failure automation" | YES | 54-62% |
+| "error handling" | YES | 53-64% |
+| "failure" (single) | NO | 31-39% |
+| "error" (single) | NO | 32-49% |
+| "gaps missing omissions" | NO | 30-35% |
+**Conclusion**: Multi-word domain queries work. Short/abstract queries fail threshold.
+### ALP-204: Embedding Text Analysis
+- Format is correct: `# heading\nParent: X\nDocument: Y\n\ncontent`
+- Follows industry best practices
+- No issues identified
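The enriched text format quoted above can be illustrated with a small builder. This is a sketch under assumptions: `SectionInput` and `buildEmbeddingText` are hypothetical names, and the sample values in the usage lines are made up for illustration. Only the line layout comes from the format string in the notes.

```typescript
// Sketch only: SectionInput and buildEmbeddingText are hypothetical names.
interface SectionInput {
  heading: string
  parent: string
  document: string
  content: string
}

// Assembles the enriched text in the layout quoted above:
// "# heading\nParent: X\nDocument: Y\n\ncontent"
function buildEmbeddingText(section: SectionInput): string {
  return [
    `# ${section.heading}`,
    `Parent: ${section.parent}`,
    `Document: ${section.document}`,
    '',
    section.content,
  ].join('\n')
}

// Illustrative values only.
const sample = buildEmbeddingText({
  heading: 'Best Practices',
  parent: 'Failure Automation',
  document: 'guide.md',
  content: 'Retry transient errors.',
})
console.log(sample)
```

The query itself is embedded without this wrapper, which is the asymmetric setup the notes describe.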
+### ALP-205: Query Processing Analysis
+- Query passed unchanged to embedding API (correct)
+- Asymmetric retrieval (plain query vs enriched docs) is normal
+- Query variations all work correctly
+### ALP-206: Vector Search Analysis
+- HNSW parameters (M=16, efConstruction=200, efSearch=100) are optimal
+- Cosine distance correct for text embeddings
+- Threshold filtering is the only issue
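To make the scoring and filtering step concrete, here is a minimal sketch of cosine similarity followed by a threshold filter. The names `cosineSimilarity`, `ScoredResult`, and `filterByThreshold` are hypothetical, and the brute-force computation stands in for what an HNSW index approximates; it is not the package's implementation.

```typescript
// Sketch only: all names here are hypothetical.
interface ScoredResult {
  id: string
  similarity: number
}

// Cosine similarity: dot product divided by the product of the vector norms.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('vectors must have the same length')
  }
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB)
  // Treat a zero vector as having no similarity to anything.
  return denom === 0 ? 0 : dot / denom
}

// Keeps only results whose similarity meets the threshold.
function filterByThreshold(results: ScoredResult[], threshold: number): ScoredResult[] {
  return results.filter((r) => r.similarity >= threshold)
}

const kept = filterByThreshold(
  [
    { id: 'multi-word match', similarity: 0.62 },
    { id: 'single-word match', similarity: 0.39 },
  ],
  0.5,
)
console.log(kept.map((r) => r.id))
```

With the scores reported in these notes, a 0.62 match passes a 0.5 threshold while a 0.39 match is dropped, which is the behavior the analysis attributes to threshold filtering.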
+## Root Cause
+**Primary Cause**: The default similarity threshold (0.5) is too high for:
+1. Single-word queries (max ~49% similarity due to embedding model properties)
+2. Abstract/meta-language queries
+3. Non-domain-specific queries
+**NOT the cause**:
+- Embedding text format (correct)
+- Query processing (correct)
+- HNSW parameters (optimal)
+- Embedding model (working as expected)
+**Contributing Factor**: The dogfooding run lacked embeddings, which caused confusion about what was failing.
+## Solution Design
+### Recommended Approach: Threshold Tuning + UX Improvements
+#### 1. Lower Default Threshold to 0.35
+```typescript
+// src/config/schema.ts
+minSimilarity: Config.number('minSimilarity').pipe(Config.withDefault(0.35))
+```
+**Rationale**:
+- Captures single-word results that score at or above 0.35 (single-word scores fall mostly in the 30-40% range)
+- Still filters irrelevant content (<30%)
+- Low risk - users can adjust with `--threshold`
+#### 2. Add "Below Threshold" Feedback
+When a search returns 0 results, show a hint about lower-scored results:
+```
+Results: 0
+Note: 10 results found below 0.35 threshold (highest: 0.34)
+Tip: Use --threshold 0.3 to see more results
+```
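One way to produce a hint like the one above is sketched here. `formatSearchFeedback` is a hypothetical helper, and the rule for the suggested value (the highest score rounded down to one decimal place) is an assumption chosen to match the sample output, not something the notes specify.

```typescript
// Sketch only: formatSearchFeedback is a hypothetical helper, not the CLI's code.
function formatSearchFeedback(scores: number[], threshold: number): string[] {
  const passing = scores.filter((s) => s >= threshold)
  const below = scores.filter((s) => s < threshold)
  const lines = [`Results: ${passing.length}`]

  // Only add the hint when nothing passed but lower-scored matches exist.
  if (passing.length === 0 && below.length > 0) {
    const highest = Math.max(...below)
    // Assumption: suggest the highest score rounded down to one decimal place.
    const suggested = Math.floor(highest * 10) / 10
    lines.push(
      `Note: ${below.length} results found below ${threshold} threshold (highest: ${highest.toFixed(2)})`,
    )
    lines.push(`Tip: Use --threshold ${suggested} to see more results`)
  }
  return lines
}

const exampleScores = [0.34, 0.33, 0.32, 0.31, 0.3, 0.29, 0.28, 0.27, 0.26, 0.25]
console.log(formatSearchFeedback(exampleScores, 0.35).join('\n'))
```

Rounding the suggestion down rather than to the nearest tenth ensures the suggested threshold actually admits the highest-scoring result.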
+#### 3. Consider Hybrid Search as Default
+For queries without boolean operators, hybrid mode provides better coverage by combining semantic and keyword signals.
+## Implementation Plan for Phase 2
+1. **Lower default threshold** - Change config default from 0.5 to 0.35
+2. **Add below-threshold feedback** - Show a hint when a search returns 0 results
+3. **Document threshold behavior** - Update README/help
+4. **Validate changes** - Re-run test corpus
+## Expected Outcomes
+| Metric | Before | After |
+|--------|--------|-------|
+| Single-word results at default | 0 | 10+ |
+| Multi-word results | 7+ | 7+ (unchanged) |
+## Conclusion
+The "multi-word semantic search failure" was misidentified. Multi-word queries work correctly. The issue is threshold calibration affecting single-word and abstract queries.
+**Recommended Solution**: Lower threshold to 0.35, add user feedback, improve documentation.
+**No algorithmic changes needed** to embedding generation, query processing, or vector search.