@ngao/search 0.1.0 → 0.1.2
- package/.claude/settings.local.json +10 -0
- package/.env.example +7 -0
- package/.eslintrc.json +20 -0
- package/.github/workflows/build.yml +39 -0
- package/.github/workflows/release.yml +34 -0
- package/.github/workflows/test.yml +35 -0
- package/.mcp-config.json +14 -0
- package/.prettierrc.json +10 -0
- package/LICENSE +17 -0
- package/Makefile +26 -0
- package/README.md +57 -172
- package/config.example.json +8 -0
- package/dist/backend/api/search-engine.d.ts +40 -0
- package/dist/backend/api/search-engine.d.ts.map +1 -0
- package/dist/backend/api/search-engine.js +227 -0
- package/dist/backend/api/search-engine.js.map +1 -0
- package/dist/backend/core/block-impl.d.ts +32 -0
- package/dist/backend/core/block-impl.d.ts.map +1 -0
- package/dist/backend/core/block-impl.js +33 -0
- package/dist/backend/core/block-impl.js.map +1 -0
- package/dist/backend/core/config-loader.d.ts +68 -0
- package/dist/backend/core/config-loader.d.ts.map +1 -0
- package/dist/backend/core/config-loader.js +234 -0
- package/dist/backend/core/config-loader.js.map +1 -0
- package/dist/backend/core/constants.d.ts +39 -0
- package/dist/backend/core/constants.d.ts.map +1 -0
- package/dist/backend/core/constants.js +57 -0
- package/dist/backend/core/constants.js.map +1 -0
- package/dist/backend/core/enums.d.ts +54 -0
- package/dist/backend/core/enums.d.ts.map +1 -0
- package/dist/backend/core/enums.js +61 -0
- package/dist/backend/core/enums.js.map +1 -0
- package/dist/backend/core/errors.d.ts +83 -0
- package/dist/backend/core/errors.d.ts.map +1 -0
- package/dist/backend/core/errors.js +151 -0
- package/dist/backend/core/errors.js.map +1 -0
- package/dist/backend/core/logger.d.ts +68 -0
- package/dist/backend/core/logger.d.ts.map +1 -0
- package/dist/backend/core/logger.js +151 -0
- package/dist/backend/core/logger.js.map +1 -0
- package/dist/backend/core/models.d.ts +332 -0
- package/dist/backend/core/models.d.ts.map +1 -0
- package/dist/backend/core/models.js +6 -0
- package/dist/backend/core/models.js.map +1 -0
- package/dist/backend/core/service-types.d.ts +184 -0
- package/dist/backend/core/service-types.d.ts.map +1 -0
- package/dist/backend/core/service-types.js +7 -0
- package/dist/backend/core/service-types.js.map +1 -0
- package/dist/backend/core/types.d.ts +219 -0
- package/dist/backend/core/types.d.ts.map +1 -0
- package/dist/backend/core/types.js +109 -0
- package/dist/backend/core/types.js.map +1 -0
- package/dist/backend/index.d.ts +5 -0
- package/dist/backend/index.d.ts.map +1 -0
- package/dist/backend/index.js +13 -0
- package/dist/backend/index.js.map +1 -0
- package/dist/backend/indexing/block-extractor.d.ts +22 -0
- package/dist/backend/indexing/block-extractor.d.ts.map +1 -0
- package/dist/backend/indexing/block-extractor.js +52 -0
- package/dist/backend/indexing/block-extractor.js.map +1 -0
- package/dist/backend/indexing/index-builder.d.ts +26 -0
- package/dist/backend/indexing/index-builder.d.ts.map +1 -0
- package/dist/backend/indexing/index-builder.js +71 -0
- package/dist/backend/indexing/index-builder.js.map +1 -0
- package/dist/backend/parsers/base-file-parser.d.ts +134 -0
- package/dist/backend/parsers/base-file-parser.d.ts.map +1 -0
- package/dist/backend/parsers/base-file-parser.js +149 -0
- package/dist/backend/parsers/base-file-parser.js.map +1 -0
- package/dist/backend/parsers/javascript-parser.d.ts +36 -0
- package/dist/backend/parsers/javascript-parser.d.ts.map +1 -0
- package/dist/backend/parsers/javascript-parser.js +194 -0
- package/dist/backend/parsers/javascript-parser.js.map +1 -0
- package/dist/backend/parsers/json-parser.d.ts +15 -0
- package/dist/backend/parsers/json-parser.d.ts.map +1 -0
- package/dist/backend/parsers/json-parser.js +75 -0
- package/dist/backend/parsers/json-parser.js.map +1 -0
- package/dist/backend/parsers/markdown-parser.d.ts +17 -0
- package/dist/backend/parsers/markdown-parser.d.ts.map +1 -0
- package/dist/backend/parsers/markdown-parser.js +94 -0
- package/dist/backend/parsers/markdown-parser.js.map +1 -0
- package/dist/backend/parsers/parser-factory.d.ts +43 -0
- package/dist/backend/parsers/parser-factory.d.ts.map +1 -0
- package/dist/backend/parsers/parser-factory.js +149 -0
- package/dist/backend/parsers/parser-factory.js.map +1 -0
- package/dist/backend/parsers/python-parser.d.ts +21 -0
- package/dist/backend/parsers/python-parser.d.ts.map +1 -0
- package/dist/backend/parsers/python-parser.js +185 -0
- package/dist/backend/parsers/python-parser.js.map +1 -0
- package/dist/backend/parsers/yaml-parser.d.ts +16 -0
- package/dist/backend/parsers/yaml-parser.d.ts.map +1 -0
- package/dist/backend/parsers/yaml-parser.js +81 -0
- package/dist/backend/parsers/yaml-parser.js.map +1 -0
- package/dist/backend/repositories/implementations/lancedb-block-repository.d.ts +125 -0
- package/dist/backend/repositories/implementations/lancedb-block-repository.d.ts.map +1 -0
- package/dist/backend/repositories/implementations/lancedb-block-repository.js +505 -0
- package/dist/backend/repositories/implementations/lancedb-block-repository.js.map +1 -0
- package/dist/backend/repositories/implementations/lancedb-metadata-repository.d.ts +107 -0
- package/dist/backend/repositories/implementations/lancedb-metadata-repository.d.ts.map +1 -0
- package/dist/backend/repositories/implementations/lancedb-metadata-repository.js +275 -0
- package/dist/backend/repositories/implementations/lancedb-metadata-repository.js.map +1 -0
- package/dist/backend/repositories/implementations/memory-cache.d.ts +18 -0
- package/dist/backend/repositories/implementations/memory-cache.d.ts.map +1 -0
- package/dist/backend/repositories/implementations/memory-cache.js +53 -0
- package/dist/backend/repositories/implementations/memory-cache.js.map +1 -0
- package/dist/backend/repositories/repository.interface.d.ts +334 -0
- package/dist/backend/repositories/repository.interface.d.ts.map +1 -0
- package/dist/backend/repositories/repository.interface.js +7 -0
- package/dist/backend/repositories/repository.interface.js.map +1 -0
- package/dist/backend/search/context-extractor.d.ts +29 -0
- package/dist/backend/search/context-extractor.d.ts.map +1 -0
- package/dist/backend/search/context-extractor.js +106 -0
- package/dist/backend/search/context-extractor.js.map +1 -0
- package/dist/backend/search/multi-index-searcher.d.ts +28 -0
- package/dist/backend/search/multi-index-searcher.d.ts.map +1 -0
- package/dist/backend/search/multi-index-searcher.js +81 -0
- package/dist/backend/search/multi-index-searcher.js.map +1 -0
- package/dist/backend/search/query-parser.d.ts +37 -0
- package/dist/backend/search/query-parser.d.ts.map +1 -0
- package/dist/backend/search/query-parser.js +145 -0
- package/dist/backend/search/query-parser.js.map +1 -0
- package/dist/backend/search/ranking-engine.d.ts +31 -0
- package/dist/backend/search/ranking-engine.d.ts.map +1 -0
- package/dist/backend/search/ranking-engine.js +165 -0
- package/dist/backend/search/ranking-engine.js.map +1 -0
- package/dist/backend/search/result-formatter.d.ts +29 -0
- package/dist/backend/search/result-formatter.d.ts.map +1 -0
- package/dist/backend/search/result-formatter.js +70 -0
- package/dist/backend/search/result-formatter.js.map +1 -0
- package/dist/backend/service-types.d.ts +184 -0
- package/dist/backend/service-types.d.ts.map +1 -0
- package/dist/backend/service-types.js +7 -0
- package/dist/backend/service-types.js.map +1 -0
- package/dist/backend/services/embedding-service.d.ts +75 -0
- package/dist/backend/services/embedding-service.d.ts.map +1 -0
- package/dist/backend/services/embedding-service.js +298 -0
- package/dist/backend/services/embedding-service.js.map +1 -0
- package/dist/backend/services/file-watcher.d.ts +17 -0
- package/dist/backend/services/file-watcher.d.ts.map +1 -0
- package/dist/backend/services/file-watcher.js +92 -0
- package/dist/backend/services/file-watcher.js.map +1 -0
- package/dist/backend/services/index-information-service.d.ts +114 -0
- package/dist/backend/services/index-information-service.d.ts.map +1 -0
- package/dist/backend/services/index-information-service.js +104 -0
- package/dist/backend/services/index-information-service.js.map +1 -0
- package/dist/backend/services/ngao-search-service.d.ts +107 -0
- package/dist/backend/services/ngao-search-service.d.ts.map +1 -0
- package/dist/backend/services/ngao-search-service.js +384 -0
- package/dist/backend/services/ngao-search-service.js.map +1 -0
- package/dist/backend/services/quantization-service.d.ts +53 -0
- package/dist/backend/services/quantization-service.d.ts.map +1 -0
- package/dist/backend/services/quantization-service.js +84 -0
- package/dist/backend/services/quantization-service.js.map +1 -0
- package/dist/backend/services/reindex-manager.d.ts +25 -0
- package/dist/backend/services/reindex-manager.d.ts.map +1 -0
- package/dist/backend/services/reindex-manager.js +78 -0
- package/dist/backend/services/reindex-manager.js.map +1 -0
- package/dist/backend/services/session-manager.d.ts +115 -0
- package/dist/backend/services/session-manager.d.ts.map +1 -0
- package/dist/backend/services/session-manager.js +150 -0
- package/dist/backend/services/session-manager.js.map +1 -0
- package/dist/backend/services/vector-search-service.d.ts +81 -0
- package/dist/backend/services/vector-search-service.d.ts.map +1 -0
- package/dist/backend/services/vector-search-service.js +143 -0
- package/dist/backend/services/vector-search-service.js.map +1 -0
- package/dist/backend/utils/file-utils.d.ts +92 -0
- package/dist/backend/utils/file-utils.d.ts.map +1 -0
- package/dist/backend/utils/file-utils.js +247 -0
- package/dist/backend/utils/file-utils.js.map +1 -0
- package/dist/cli/setup.d.ts +4 -0
- package/dist/cli/setup.d.ts.map +1 -0
- package/dist/cli/setup.js +138 -0
- package/dist/cli/setup.js.map +1 -0
- package/dist/index.d.ts +6 -0
- package/dist/index.d.ts.map +1 -0
- package/dist/index.js +22 -0
- package/dist/index.js.map +1 -0
- package/dist/main.d.ts +14 -0
- package/dist/main.d.ts.map +1 -0
- package/dist/main.js +7 -67075
- package/dist/main.js.map +1 -0
- package/dist/mcp/tool-schemas.d.ts +205 -0
- package/dist/mcp/tool-schemas.d.ts.map +1 -0
- package/dist/mcp/tool-schemas.js +391 -0
- package/dist/mcp/tool-schemas.js.map +1 -0
- package/dist/server/logger.d.ts +50 -0
- package/dist/server/logger.d.ts.map +1 -0
- package/dist/server/logger.js +77 -0
- package/dist/server/logger.js.map +1 -0
- package/dist/server/tool-registry.d.ts +64 -0
- package/dist/server/tool-registry.d.ts.map +1 -0
- package/dist/server/tool-registry.js +93 -0
- package/dist/server/tool-registry.js.map +1 -0
- package/dist/server/transports/mcp-transport.d.ts +31 -0
- package/dist/server/transports/mcp-transport.d.ts.map +1 -0
- package/dist/server/transports/mcp-transport.js +331 -0
- package/dist/server/transports/mcp-transport.js.map +1 -0
- package/dist/server/transports/rest-transport.d.ts +36 -0
- package/dist/server/transports/rest-transport.d.ts.map +1 -0
- package/dist/server/transports/rest-transport.js +250 -0
- package/dist/server/transports/rest-transport.js.map +1 -0
- package/docs/API.md +116 -0
- package/docs/ARCHITECTURE.md +101 -0
- package/docs/FILE_WATCHING.md +120 -0
- package/docs/INSTALLATION.md +87 -0
- package/docs/MCP_INTEGRATION.md +108 -0
- package/docs/README.md +288 -0
- package/docs/USAGE.md +123 -0
- package/docs/architecture-design-standards/01_ARCHITECTURE.md +863 -0
- package/docs/architecture-design-standards/02_SEARCH_ENGINE_DESIGN.md +958 -0
- package/docs/architecture-design-standards/03_DATAFLOW.md +1000 -0
- package/docs/architecture-design-standards/04_VISUAL_GUIDE.md +922 -0
- package/docs/architecture-design-standards/05_REPOSITORY_PATTERN_GUIDE.md +503 -0
- package/docs/architecture-design-standards/06_IMPLEMENTATION_PATTERNS.md +1026 -0
- package/docs/architecture-design-standards/07_TYPESCRIPT_GUIDE.md +1027 -0
- package/docs/architecture-design-standards/08_CODING_STANDARDS.md +1274 -0
- package/docs/reference/01_START_HERE.md +108 -0
- package/docs/reference/02_QUICK_REFERENCE.md +363 -0
- package/docs/reference/03_DOCUMENTATION_INDEX.md +293 -0
- package/docs/reference/04_DELIVERY_SUMMARY.md +463 -0
- package/docs/reference/05_IMPLEMENTATION_OVERVIEW.md +319 -0
- package/docs/reference/06_RESEARCH_SUMMARY.md +519 -0
- package/docs/tracking/03_IMPLEMENTATION_ROADMAP.md +788 -0
- package/jest.config.json +12 -0
- package/package.json +46 -53
- package/prepend-shebang.js +18 -0
- package/scripts/setup-mcp.sh +66 -0
- package/src/backend/index.ts +5 -0
- package/src/backend/service-types.ts +219 -0
- package/src/backend/services/file-watcher.ts +79 -0
- package/src/backend/services/ngao-search-service.ts +430 -0
- package/src/backend/services/reindex-manager.ts +90 -0
- package/src/backend/services/session-manager.ts +214 -0
- package/src/cli/setup.ts +122 -0
- package/src/index.ts +6 -0
- package/src/main.ts +225 -0
- package/src/mcp/tool-schemas.ts +439 -0
- package/src/server/logger.ts +88 -0
- package/src/server/tool-registry.ts +117 -0
- package/src/server/transports/mcp-transport.ts +374 -0
- package/src/server/transports/rest-transport.ts +258 -0
- package/tests/unit/agent-tools.test.ts +454 -0
- package/tests/unit/file-watcher.test.d.ts +2 -0
- package/tests/unit/file-watcher.test.d.ts.map +1 -0
- package/tests/unit/file-watcher.test.js +9 -0
- package/tests/unit/file-watcher.test.js.map +1 -0
- package/tests/unit/file-watcher.test.ts +7 -0
- package/tests/unit/search-integration.test.ts +256 -0
- package/tests/unit/services.test.d.ts +2 -0
- package/tests/unit/services.test.d.ts.map +1 -0
- package/tests/unit/services.test.js +9 -0
- package/tests/unit/services.test.js.map +1 -0
- package/tests/unit/services.test.ts +7 -0
- package/tsconfig.json +23 -0
- package/webpack.backend.config.js +60 -0
- package/webpack.config.js +34 -0
- package/models/Xenova/all-MiniLM-L6-v2/config.json +0 -25
- package/models/Xenova/all-MiniLM-L6-v2/onnx/model_quantized.onnx +0 -0
- package/models/Xenova/all-MiniLM-L6-v2/tokenizer.json +0 -30686
- package/models/Xenova/all-MiniLM-L6-v2/tokenizer_config.json +0 -15
|
@@ -0,0 +1,519 @@
|
|
|
1
|
+
# NGAO Search: Research Summary & Key Insights
|
|
2
|
+
|
|
3
|
+
## Executive Summary
|
|
4
|
+
|
|
5
|
+
NGAO Search is a **local document/code search system** designed to help developers find, understand, and analyze code and documentation efficiently. The system is optimized for **LLM consumption**, meaning results are structured and formatted specifically for AI models to process, enabling advanced use cases like automated documentation generation, code analysis, and onboarding assistance.
|
|
6
|
+
|
|
7
|
+
### Key Differentiators
|
|
8
|
+
|
|
9
|
+
1. **Format-Aware Indexing**: Different file types (Python, Markdown, JS) return different result structures matching their semantics
|
|
10
|
+
2. **LLM-Optimized Output**: All results follow a structured JSON schema designed for AI consumption
|
|
11
|
+
3. **Multi-Level Ranking**: Relevance scoring considers keyword matching, position, scope specificity, recency, and frequency
|
|
12
|
+
4. **Context-Aware Results**: Returns snippets with surrounding context and scope hierarchy
|
|
13
|
+
5. **Semantic Structure**: Extracts and indexes meaningful blocks (functions, classes, sections) with scope hierarchy
|
|
14
|
+
|
|
15
|
+
---
|
|
16
|
+
|
|
17
|
+
## Core Architecture
|
|
18
|
+
|
|
19
|
+
### Three Main Pipelines
|
|
20
|
+
|
|
21
|
+
```
|
|
22
|
+
INDEXING PIPELINE:
|
|
23
|
+
Files → Parse → Extract Blocks → Multi-Index Ingestion → Persistent Storage
|
|
24
|
+
|
|
25
|
+
QUERY PIPELINE:
|
|
26
|
+
Query → Parse → Multi-Index Search → Aggregate → Extract Context → Rank → Format
|
|
27
|
+
|
|
28
|
+
RESULT PIPELINE:
|
|
29
|
+
Ranking → LLM-Optimized Formatting → Output (JSON, CLI, Web API)
|
|
30
|
+
```
|
|
31
|
+
|
|
32
|
+
### Key Components
|
|
33
|
+
|
|
34
|
+
| Component | Purpose | Input | Output |
|
|
35
|
+
|-----------|---------|-------|--------|
|
|
36
|
+
| **FileRouter** | Route files to appropriate parser | File path | Parser instance |
|
|
37
|
+
| **Parsers** | Extract structured blocks | Raw file content | List of Block objects |
|
|
38
|
+
| **BlockExtractor** | Normalize blocks and compute metadata | Parsed AST | Normalized Block objects |
|
|
39
|
+
| **IndexBuilder** | Build multiple specialized indexes | Normalized blocks | Inverted, scope, registry indexes |
|
|
40
|
+
| **MultiIndexSearcher** | Search across all indexes | Query terms + filters | Matching blocks with scores |
|
|
41
|
+
| **RankingEngine** | Compute relevance scores | Blocks + query metadata | Ranked results |
|
|
42
|
+
| **ContextExtractor** | Extract surrounding context | Block + source file | Enriched snippet |
|
|
43
|
+
| **ResultFormatter** | Format results for output | Enriched results | LLM-friendly JSON |
|
|
44
|
+
|
|
45
|
+
---
|
|
46
|
+
|
|
47
|
+
## Data Flow Step-by-Step
|
|
48
|
+
|
|
49
|
+
### Indexing Dataflow
|
|
50
|
+
|
|
51
|
+
```
|
|
52
|
+
1. DISCOVERY: File watcher detects new/modified file
|
|
53
|
+
2. CHANGE DETECTION: Compute hash, compare with stored hash
|
|
54
|
+
3. PARSING: Route to format-specific parser
|
|
55
|
+
4. BLOCK EXTRACTION: Extract functions/classes/sections with scope
|
|
56
|
+
5. INDEXING: Add to inverted, scope, and block registry indexes
|
|
57
|
+
6. PERSISTENCE: Save indexes to SQLite/JSON storage
|
|
58
|
+
```
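Step 2 of the flow above (change detection) can be sketched as follows. This is a minimal illustration rather than the package's implementation: the SHA-256 digest, the `needs_reindex` name, and the in-memory `stored_hashes` dict are all assumptions made for the example.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return a hex digest of the file content (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).hexdigest()


def needs_reindex(path: str, data: bytes, stored_hashes: dict[str, str]) -> bool:
    """Compare the current hash with the stored one; record the new hash on change."""
    digest = content_hash(data)
    if stored_hashes.get(path) == digest:
        return False  # unchanged: skip parsing and indexing
    stored_hashes[path] = digest
    return True


# Hypothetical usage: same content twice, then a modified version.
stored: dict[str, str] = {}
first = needs_reindex("src/auth/handler.py", b"def handle(): ...", stored)
second = needs_reindex("src/auth/handler.py", b"def handle(): ...", stored)
third = needs_reindex("src/auth/handler.py", b"def handle(): pass", stored)
```

Only files for which `needs_reindex` returns `True` would continue to the parsing step.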
|
|
59
|
+
|
|
60
|
+
### Query Dataflow
|
|
61
|
+
|
|
62
|
+
```
|
|
63
|
+
1. INPUT: User enters "authentication handler"
|
|
64
|
+
2. PARSING: Tokenize → ["authentication", "handler"], extract filters
|
|
65
|
+
3. SEARCH: Query inverted index for each term
|
|
66
|
+
4. AGGREGATION: Combine results, deduplicate
|
|
67
|
+
5. CONTEXT: Load source files, extract snippets ±context lines
|
|
68
|
+
6. RANKING: Compute multi-factor relevance score
|
|
69
|
+
7. FORMATTING: Transform to LLM-friendly JSON schema
|
|
70
|
+
8. OUTPUT: Return ranked, formatted results
|
|
71
|
+
```
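The parsing step of the query flow might look like the sketch below. It assumes filters are written as `key:value` tokens (as in the `type:reference` query later in this document) and that terms are simply lowercased; `parse_query` is a hypothetical name, not a confirmed API.

```python
import re


def parse_query(query: str) -> tuple[list[str], dict[str, str]]:
    """Split a query into lowercase search terms and key:value filters."""
    terms: list[str] = []
    filters: dict[str, str] = {}
    for token in query.split():
        key, sep, value = token.partition(":")
        if sep and key and value:
            # Assumed filter syntax: key:value, e.g. type:method
            filters[key.lower()] = value.lower()
        else:
            terms.extend(re.findall(r"[a-z0-9_]+", token.lower()))
    return terms, filters


terms, filters = parse_query("Authentication handler type:method")
```

The resulting terms feed the inverted-index lookup, while the filters would be applied to the matching blocks.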
|
|
72
|
+
|
|
73
|
+
---
|
|
74
|
+
|
|
75
|
+
## Index Structure
|
|
76
|
+
|
|
77
|
+
The system maintains **4 specialized indexes** (three core indexes plus one optional semantic index):
|
|
78
|
+
|
|
79
|
+
### 1. Inverted Index (70% of storage)
|
|
80
|
+
Maps keywords to source locations:
|
|
81
|
+
```json
|
|
82
|
+
{
|
|
83
|
+
"authentication": {
|
|
84
|
+
"files": ["src/auth/handler.py", "docs/auth.md"],
|
|
85
|
+
"positions": [
|
|
86
|
+
{
|
|
87
|
+
"file": "src/auth/handler.py",
|
|
88
|
+
"block_id": "func_42",
|
|
89
|
+
"occurrences": [{"line": 45}, {"line": 50}]
|
|
90
|
+
}
|
|
91
|
+
]
|
|
92
|
+
}
|
|
93
|
+
}
|
|
94
|
+
```
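As a rough sketch of how such a keyword-to-location mapping could be built, the function below indexes a list of blocks by word. The block shape (`id`, `start_line`, `text`) and the simplified output layout (keyword → block id → line numbers) are assumptions for illustration and are flatter than the JSON shown above.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z_][a-z0-9_]*")


def build_inverted_index(blocks: list[dict]) -> dict[str, dict[str, list[int]]]:
    """Map each keyword to {block_id: [line numbers where it occurs]}."""
    index: defaultdict = defaultdict(lambda: defaultdict(list))
    for block in blocks:
        for offset, line in enumerate(block["text"].splitlines()):
            # set() records a line once even if the word repeats on it
            for word in set(WORD.findall(line.lower())):
                index[word][block["id"]].append(block["start_line"] + offset)
    return {word: dict(postings) for word, postings in index.items()}


# Hypothetical block, loosely modelled on the func_42 example.
blocks = [{
    "id": "func_42",
    "start_line": 45,
    "text": "def handle(self, request):\n"
            "    # Handle authentication\n"
            "    return authenticate(request)",
}]
index = build_inverted_index(blocks)
```

Looking up `index["authentication"]` then yields the blocks and lines to load during context extraction.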
|
|
95
|
+
|
|
96
|
+
### 2. Scope Hierarchy Index (20% of storage)
|
|
97
|
+
Maps file structure for quick navigation:
|
|
98
|
+
```json
|
|
99
|
+
{
|
|
100
|
+
"src/auth/handler.py": {
|
|
101
|
+
"module": { "children": ["class:AuthHandler"] },
|
|
102
|
+
"class:AuthHandler": { "children": ["method:handle", "method:validate"] }
|
|
103
|
+
}
|
|
104
|
+
}
|
|
105
|
+
```
|
|
106
|
+
|
|
107
|
+
### 3. Block Registry (10% of storage)
|
|
108
|
+
Quick lookup for block metadata:
|
|
109
|
+
```json
|
|
110
|
+
{
|
|
111
|
+
"func_42": {
|
|
112
|
+
"file": "src/auth/handler.py",
|
|
113
|
+
"type": "method",
|
|
114
|
+
"location": {"start": 45, "end": 78},
|
|
115
|
+
"scope": ["module", "class:AuthHandler"]
|
|
116
|
+
}
|
|
117
|
+
}
|
|
118
|
+
```
|
|
119
|
+
|
|
120
|
+
### 4. Semantic Index (Optional, 10-20% additional)
|
|
121
|
+
Dense vector embeddings for similarity search.
|
|
122
|
+
|
|
123
|
+
---
|
|
124
|
+
|
|
125
|
+
## Format-Specific Parsing
|
|
126
|
+
|
|
127
|
+
### Python Files
|
|
128
|
+
Extracts via AST traversal:
|
|
129
|
+
- **Blocks**: Functions, classes, methods, variables
|
|
130
|
+
- **Metadata**: Decorators, type hints, docstrings, signatures
|
|
131
|
+
- **Scope**: Module → Class → Method hierarchy
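A minimal sketch of this kind of AST traversal, using only Python's standard `ast` module, is shown below. The output field names mirror the block registry example but are assumptions, and `extract_blocks` is a hypothetical helper, not the package's parser.

```python
import ast


def extract_blocks(source: str) -> list[dict]:
    """Collect classes and functions with their scope chain and line span."""
    blocks: list[dict] = []

    def make_block(node, kind: str, scope: list[str]) -> dict:
        return {
            "name": node.name,
            "type": kind,
            "location": {"start": node.lineno, "end": node.end_lineno},
            "scope": list(scope),
            "docstring": ast.get_docstring(node),
        }

    def visit(node: ast.AST, scope: list[str]) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.ClassDef):
                blocks.append(make_block(child, "class", scope))
                visit(child, scope + [f"class:{child.name}"])
            elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                # A function directly inside a class is treated as a method
                kind = "method" if scope[-1].startswith("class:") else "function"
                blocks.append(make_block(child, kind, scope))
                visit(child, scope + [f"{kind}:{child.name}"])
            else:
                visit(child, scope)  # descend into if/try/with bodies etc.

    visit(ast.parse(source), ["module"])
    return blocks


SOURCE = (
    "class AuthHandler:\n"
    "    def handle(self, request):\n"
    '        """Handle authentication for requests"""\n'
    "        return request\n"
)
blocks = extract_blocks(SOURCE)
```

Decorators, type hints, and signatures could be read from the same nodes (`decorator_list`, `args`, `returns`) in a fuller version.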
|
|
132
|
+
|
|
133
|
+
### Markdown Files
|
|
134
|
+
Extracts via heading hierarchy:
|
|
135
|
+
- **Blocks**: Heading levels (H1-H6), sections, paragraphs, code blocks
|
|
136
|
+
- **Metadata**: Heading hierarchy, language for code blocks
|
|
137
|
+
- **Scope**: Parent heading chain
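The parent heading chain can be derived with a simple stack, as in the sketch below. It handles only ATX-style (`#`) headings and skips fenced code, which is a simplification of real Markdown parsing; `heading_scopes` and its output fields are hypothetical.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # fenced-code delimiter, built this way to keep the example readable


def heading_scopes(markdown: str) -> list[dict]:
    """Return each heading with its 1-based line number and parent heading chain."""
    stack: list[tuple[int, str]] = []  # (level, title) of currently open headings
    sections: list[dict] = []
    in_fence = False
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # ignore '#' lines inside fenced code
            continue
        match = None if in_fence else HEADING.match(line)
        if not match:
            continue
        level, title = len(match.group(1)), match.group(2)
        while stack and stack[-1][0] >= level:
            stack.pop()  # close sibling and deeper headings
        sections.append({"title": title, "level": level, "line": lineno,
                         "scope": [t for _, t in stack]})
        stack.append((level, title))
    return sections


DOC = "\n".join(["# Guide", "## Install", FENCE, "# not a heading", FENCE,
                 "### Linux", "## Usage"])
sections = heading_scopes(DOC)
```

Each section's `scope` list plays the same role for Markdown that the class/method chain plays for code.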
|
|
138
|
+
|
|
139
|
+
### JavaScript/TypeScript
|
|
140
|
+
Extracts via Babel/tree-sitter parser:
|
|
141
|
+
- **Blocks**: Functions, classes, components, hooks, exports
|
|
142
|
+
- **Metadata**: JSDoc, type definitions, export types
|
|
143
|
+
- **Scope**: Module → Class → Method hierarchy
|
|
144
|
+
|
|
145
|
+
### JSON/YAML
|
|
146
|
+
Extracts via flattened key paths:
|
|
147
|
+
- **Blocks**: Key-value pairs, objects, arrays
|
|
148
|
+
- **Metadata**: Nesting level, value types
|
|
149
|
+
- **Scope**: Key path hierarchy (e.g., database.connection.pool_size)
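Flattening to dotted key paths can be illustrated with a short recursive function. The dotted-path format follows the `database.connection.pool_size` example above; indexing list elements by position is an assumption of this sketch.

```python
import json


def flatten(value, prefix: str = "") -> dict[str, object]:
    """Flatten nested dicts/lists into dotted key paths mapped to leaf values."""
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value)  # assumed convention: list index as path segment
    else:
        return {prefix: value}
    flat: dict[str, object] = {}
    for key, child in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(child, path))
    return flat


config = json.loads(
    '{"database": {"connection": {"pool_size": 10, "hosts": ["a", "b"]}}}'
)
paths = flatten(config)
```

The same function would work on YAML once it is loaded into dicts and lists by a YAML parser.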
|
|
150
|
+
|
|
151
|
+
---
|
|
152
|
+
|
|
153
|
+
## Ranking Algorithm
|
|
154
|
+
|
|
155
|
+
Results are ranked by a weighted combination of 5 factors:
|
|
156
|
+
|
|
157
|
+
```
|
|
158
|
+
Final Score = (
|
|
159
|
+
0.35 × Keyword Match Score + # How well query terms match
|
|
160
|
+
0.25 × Position Score + # Where in the block the match was found (name vs. body)
|
|
161
|
+
0.15 × Scope Specificity + # Nested scopes score higher
|
|
162
|
+
0.15 × Recency Score + # Recently modified files score higher
|
|
163
|
+
0.10 × Frequency Score # How many times keywords appear
|
|
164
|
+
)
|
|
165
|
+
```
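The formula above translates directly into a weighted sum. The sketch below assumes each factor score is already normalized to the range 0 to 1; the dictionary keys and the `final_score` name are illustrative choices.

```python
# Weights taken from the formula above; they sum to 1.0.
WEIGHTS = {
    "keyword": 0.35,
    "position": 0.25,
    "scope": 0.15,
    "recency": 0.15,
    "frequency": 0.10,
}


def final_score(factors: dict[str, float]) -> float:
    """Weighted sum of the five factor scores, each expected in [0, 1]."""
    missing = WEIGHTS.keys() - factors.keys()
    if missing:
        raise ValueError(f"missing factor scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * factors[name] for name in WEIGHTS)


# Hypothetical match: all terms found, in the name of a method,
# in a file modified within the last month.
score = final_score(
    {"keyword": 0.8, "position": 1.0, "scope": 0.8, "recency": 0.6, "frequency": 1.0}
)
```

For these inputs the sum is 0.28 + 0.25 + 0.12 + 0.09 + 0.10 = 0.84.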
|
|
166
|
+
|
|
167
|
+
### Scoring Details
|
|
168
|
+
|
|
169
|
+
**Keyword Match (0.35)**
|
|
170
|
+
- Exact phrase: 1.0
|
|
171
|
+
- All terms match: 0.8
|
|
172
|
+
- Partial (X/Y terms): 0.6 × (matched/total)
|
|
173
|
+
- Fuzzy match: 0.2
|
|
174
|
+
|
|
175
|
+
**Position (0.25)**
|
|
176
|
+
- In identifier/name: 1.0
|
|
177
|
+
- In docstring/comment: 0.9
|
|
178
|
+
- In code body: 0.5
|
|
179
|
+
- In nested scope: 0.3
|
|
180
|
+
|
|
181
|
+
**Scope Specificity (0.15)**
|
|
182
|
+
- Method/deep nested: 0.8
|
|
183
|
+
- Class-level: 0.7
|
|
184
|
+
- Module-level: 0.5
|
|
185
|
+
|
|
186
|
+
**Recency (0.15)**
|
|
187
|
+
- Last 24 hours: 1.0
|
|
188
|
+
- Last week: 0.8
|
|
189
|
+
- Last month: 0.6
|
|
190
|
+
- Older: 0.4
|
|
191
|
+
|
|
192
|
+
**Frequency (0.10)**
|
|
193
|
+
- First occurrence: 1.0
|
|
194
|
+
- Each additional mention: +0.1 (bonus capped at +0.3)
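Two of the factor tables above can be sketched as plain functions. Substring matching, omitting the fuzzy tier, and reading "last month" as 30 days are all simplifying assumptions; the function names are hypothetical.

```python
from datetime import datetime, timedelta


def keyword_match_score(query_terms: list[str], block_text: str) -> float:
    """Score keyword match using the exact / all-terms / partial tiers."""
    text = block_text.lower()
    terms = [t.lower() for t in query_terms]
    if not terms:
        return 0.0
    if " ".join(terms) in text:
        return 1.0  # exact phrase
    matched = sum(1 for t in terms if t in text)
    if matched == len(terms):
        return 0.8  # all terms present, but not as a phrase
    return 0.6 * matched / len(terms)  # partial match (fuzzy tier omitted)


def recency_score(modified: datetime, now: datetime) -> float:
    """Score recency from the file's last-modified time."""
    age = now - modified
    if age <= timedelta(days=1):
        return 1.0
    if age <= timedelta(weeks=1):
        return 0.8
    if age <= timedelta(days=30):  # assumed meaning of "last month"
        return 0.6
    return 0.4


now = datetime(2024, 1, 31)
query = ["authentication", "handler"]
exact = keyword_match_score(query, "Authentication handler for requests")
all_terms = keyword_match_score(query, "handler that performs authentication")
partial = keyword_match_score(query, "request handler")
```

Passing `now` explicitly keeps the recency function deterministic and easy to test.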
|
|
195
|
+
|
|
196
|
+
---
|
|
197
|
+
|
|
198
|
+
## LLM-Optimized Output Format
|
|
199
|
+
|
|
200
|
+
Every result follows this universal schema:
|
|
201
|
+
|
|
202
|
+
```json
|
|
203
|
+
{
|
|
204
|
+
"rank": 1,
|
|
205
|
+
"relevance_score": 0.92,
|
|
206
|
+
|
|
207
|
+
"file": {
|
|
208
|
+
"path": "src/auth/handler.py",
|
|
209
|
+
"type": "python"
|
|
210
|
+
},
|
|
211
|
+
|
|
212
|
+
"location": {
|
|
213
|
+
"start_line": 45,
|
|
214
|
+
"end_line": 78,
|
|
215
|
+
"file_tree": ["src/", "auth/", "handler.py"],
|
|
216
|
+
"scope_hierarchy": ["module", "class:AuthHandler", "method:handle"]
|
|
217
|
+
},
|
|
218
|
+
|
|
219
|
+
"match": {
|
|
220
|
+
"type": "method",
|
|
221
|
+
"name": "handle",
|
|
222
|
+
"signature": "def handle(self, request: Request) -> Response:",
|
|
223
|
+
"matched_terms": ["handle", "authentication"],
|
|
224
|
+
"match_positions": [
|
|
225
|
+
{"line": 45, "column": 8, "text": "def handle"},
|
|
226
|
+
{"line": 50, "column": 20, "text": "# Handle authentication"}
|
|
227
|
+
]
|
|
228
|
+
},
|
|
229
|
+
|
|
230
|
+
"content": {
|
|
231
|
+
"snippet": "[code with context and line numbers]",
|
|
232
|
+
"context_lines": 15
|
|
233
|
+
},
|
|
234
|
+
|
|
235
|
+
"metadata": {
|
|
236
|
+
"docstring": "Handle authentication for requests",
|
|
237
|
+
"is_public": true,
|
|
238
|
+
"parameters": ["self", "request"],
|
|
239
|
+
"return_type": "Response",
|
|
240
|
+
"decorators": ["@validate_request"]
|
|
241
|
+
},
|
|
242
|
+
|
|
243
|
+
"tags": ["authentication", "security", "middleware"]
|
|
244
|
+
}
|
|
245
|
+
```
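Because the schema is predictable, a consumer can turn a result into a compact prompt line with plain dictionary access. The sketch below uses only fields shown in the schema above; `summarize_result` and the output layout are hypothetical.

```python
import json


def summarize_result(result: dict) -> str:
    """Render one result as a single-line header suitable for an LLM prompt."""
    loc = result["location"]
    scope = " > ".join(loc["scope_hierarchy"])
    return (
        f"#{result['rank']} [{result['relevance_score']:.2f}] "
        f"{result['file']['path']}:{loc['start_line']}-{loc['end_line']} "
        f"({scope}) {result['match']['signature']}"
    )


# A trimmed-down result using values from the schema example.
RESULT_JSON = """
{
  "rank": 1,
  "relevance_score": 0.92,
  "file": {"path": "src/auth/handler.py", "type": "python"},
  "location": {
    "start_line": 45,
    "end_line": 78,
    "scope_hierarchy": ["module", "class:AuthHandler", "method:handle"]
  },
  "match": {"signature": "def handle(self, request: Request) -> Response:"}
}
"""
summary = summarize_result(json.loads(RESULT_JSON))
```

A caller could prepend such a header to each snippet before sending the results to a model.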
|
|
246
|
+
|
|
247
|
+
Benefits for LLM processing:
|
|
248
|
+
- ✅ Predictable structure (easy JSON parsing)
|
|
249
|
+
- ✅ All relevant metadata included (full context)
|
|
250
|
+
- ✅ Clear scope hierarchy (understand code location)
|
|
251
|
+
- ✅ Snippet with context (understand purpose)
|
|
252
|
+
- ✅ Relevance scoring (prioritize important results)
|
|
253
|
+
- ✅ Matched terms highlighted (understand what matched)
|
|
254
|
+
|
|
255
|
+
---
|
|
256
|
+
|
|
257
|
+
## Performance Characteristics
|
|
258
|
+
|
|
259
|
+
### Indexing
|
|
260
|
+
| Scenario | Time |
|
|
261
|
+
|----------|------|
|
|
262
|
+
| Small project (100 files) | <5 sec |
|
|
263
|
+
| Medium project (500 files) | 20-30 sec |
|
|
264
|
+
| Large project (2000 files) | 2-3 min |
|
|
265
|
+
| Incremental update (1 file) | <200 ms |
|
|
266
|
+
|
|
267
|
+
### Querying
|
|
268
|
+
| Operation | Time |
|
|
269
|
+
|-----------|------|
|
|
270
|
+
| Simple keyword query | <50 ms |
|
|
271
|
+
| Multi-term query | <200 ms |
|
|
272
|
+
| With context extraction | <500 ms |
|
|
273
|
+
| Full result set (50 results) | <1 sec |
|
|
274
|
+
|
|
275
|
+
### Storage
|
|
276
|
+
- **Typical project**: 50-100 MB
|
|
277
|
+
- **With compression**: 20-40 MB
|
|
278
|
+
- **Peak memory during search**: <200 MB
|
|
279
|
+
|
|
280
|
+
---
|
|
281
|
+
|
|
282
|
+
## Implementation Phases
|
|
283
|
+
|
|
284
|
+
```
|
|
285
|
+
PHASE 1 (Weeks 1-2): Foundation
|
|
286
|
+
├─ File discovery, type detection, routing
|
|
287
|
+
├─ Basic text indexing, simple JSON storage
|
|
288
|
+
├─ Python AST parser
|
|
289
|
+
└─ Outcome: Searchable Python codebase
|
|
290
|
+
|
|
291
|
+
PHASE 2 (Weeks 3-4): Markdown Support
|
|
292
|
+
├─ Markdown parser (heading hierarchy)
|
|
293
|
+
├─ Context extraction for markdown
|
|
294
|
+
└─ Outcome: Both Python and Markdown searchable
|
|
295
|
+
|
|
296
|
+
PHASE 3 (Weeks 5-6): Advanced Ranking
|
|
297
|
+
├─ Multi-factor relevance scoring
|
|
298
|
+
├─ Configuration and tuning
|
|
299
|
+
└─ Outcome: Smart result ordering
|
|
300
|
+
|
|
301
|
+
PHASE 4 (Weeks 7-8): Additional Formats
|
|
302
|
+
├─ JavaScript/TypeScript support
|
|
303
|
+
├─ JSON/YAML configuration files
|
|
304
|
+
└─ Outcome: Multi-language support
|
|
305
|
+
|
|
306
|
+
PHASE 5 (Weeks 9-10): Optimization
|
|
307
|
+
├─ Incremental indexing (10x speedup)
|
|
308
|
+
├─ Query caching
|
|
309
|
+
├─ Parallel processing
|
|
310
|
+
└─ Outcome: Production-ready performance
|
|
311
|
+
|
|
312
|
+
PHASE 6+ (Optional): Advanced Features
|
|
313
|
+
├─ Semantic search with embeddings
|
|
314
|
+
├─ IDE integrations
|
|
315
|
+
├─ Distributed indexing
|
|
316
|
+
└─ Outcome: Enterprise-grade system
|
|
317
|
+
```
|
|
318
|
+
|
|
319
|
+
---
|
|
320
|
+
|
|
321
|
+
## Use Case Examples
|
|
322
|
+
|
|
323
|
+
### 1. Code Review Context
|
|
324
|
+
**Query**: "authentication flow from request to token"
|
|
325
|
+
**Output**: Full execution path with code snippets, signatures, and docstrings
|
|
326
|
+
**Benefit**: 5 minutes instead of 30 minutes to understand the code
|
|
327
|
+
|
|
328
|
+
### 2. Onboarding New Team Member
|
|
329
|
+
**Query**: "@llm get_all_public_functions with_docstrings in auth"
|
|
330
|
+
**Output**: All public auth functions formatted for LLM
|
|
331
|
+
**Benefit**: LLM generates architecture overview and examples
|
|
332
|
+
|
|
333
|
+
### 3. Bug Investigation
|
|
334
|
+
**Query**: "session_timeout references AND modifications"
|
|
335
|
+
**Output**: All uses of session_timeout with call chains
|
|
336
|
+
**Benefit**: LLM analyzes data flow to find bug root cause
|
|
337
|
+
|
|
338
|
+
### 4. Safe Refactoring
|
|
339
|
+
**Query**: "old_function_name type:reference"
|
|
340
|
+
**Output**: Function definition + all call sites with context
|
|
341
|
+
**Benefit**: Confident, complete refactoring
|
|
342
|
+
|
|
343
|
+
### 5. Documentation Generation
|
|
344
|
+
**Query**: "@doc get_all_exported_functions with_docstrings"
|
|
345
|
+
**Output**: All exported functions with documentation
|
|
346
|
+
**Benefit**: Auto-generated API docs always in sync
|
|
347
|
+
|
|
348
|
+
---
|
|
349
|
+
|
|
350
|
+
## Key Advantages
|
|
351
|
+
|
|
352
|
+
### For Developers
|
|
353
|
+
- 🔍 Find code quickly without manual searching
|
|
354
|
+
- 📚 Understand context and scope automatically
|
|
355
|
+
- 🎯 See related code patterns
|
|
356
|
+
- 📖 Generate documentation on-the-fly
|
|
357
|
+
|
|
358
|
+
### For Teams
|
|
359
|
+
- 🏃 Faster onboarding of new members
|
|
360
|
+
- 🐛 Quicker bug resolution
|
|
361
|
+
- 🔄 Consistent refactoring
|
|
362
|
+
- 📊 Automated architecture analysis
|
|
363
|
+
|
|
364
|
+
### For LLMs
|
|
365
|
+
- 📋 Structured, predictable output format
|
|
366
|
+
- 🎯 Relevant context included
|
|
367
|
+
- 🏗️ Clear code structure (scope hierarchy)
|
|
368
|
+
- 🔗 Linked references (scope + line numbers)
|
|
369
|
+
|
|
370
|
+
---
|
|
371
|
+
|
|
372
|
+
## Technical Highlights
|
|
373
|
+
|
|
374
|
+
### 1. Incremental Indexing
|
|
375
|
+
- **File hash detection**: Only reindex changed files
|
|
376
|
+
- **AST caching**: Parse once, use many times
|
|
377
|
+
- **Speed improvement**: 10x faster re-indexing
|
|
378
|
+
|
|
379
|
+
### 2. Multi-Index Strategy
|
|
380
|
+
- **Keyword-based**: Fast for simple searches
|
|
381
|
+
- **Scope-based**: Understand code structure
|
|
382
|
+
- **Semantic** (optional): Find similar patterns
|
|
383
|
+
|
|
384
|
+
### 3. Context-Aware Ranking
|
|
385
|
+
- 5-factor scoring considers keyword quality, position, scope depth, recency, and frequency
|
|
386
|
+
- Weights configurable per use case
|
|
387
|
+
- Optimized for typical developer workflows
|
|
388
|
+
|
|
389
|
+
### 4. Format Flexibility
|
|
390
|
+
- Parse 5+ file types with specialized handling
|
|
391
|
+
- Each returns appropriate metadata
|
|
392
|
+
- Unified output schema for LLM consistency
|
|
393
|
+
|
|
394
|
+
---
|
|
395
|
+
|
|
396
|
+
## Comparison with Alternatives
|
|
397
|
+
|
|
398
|
+
| Feature | NGAO | grep | IDE | Semantic Search |
|
|
399
|
+
|---------|------|------|-----|-----------------|
|
|
400
|
+
| Structured results | ✅ | ❌ | ✅ (IDE specific) | ✅ |
|
|
401
|
+
| LLM-friendly format | ✅ | ❌ | ❌ | Partial |
|
|
402
|
+
| Scope hierarchy | ✅ | ❌ | ✅ | ❌ |
|
|
403
|
+
| Multi-format support | ✅ | ✅ | Limited | Limited |
|
|
404
|
+
| Local indexing | ✅ | ✅ | ✅ | Partial |
|
|
405
|
+
| Configurable ranking | ✅ | ❌ | ❌ | Limited |
|
|
406
|
+
| CLI + API | ✅ | ✅ | ❌ | ✅ |
|
|
407
|
+
| Context extraction | ✅ | ❌ | ✅ | Limited |
|
|
408
|
+
|
|
409
|
+
---
|
|
410
|
+
|
|
411
|
+
## Recommended Implementation Technology
|
|
412
|
+
|
|
413
|
+
### For Python Implementation (Recommended for MVP)
|
|
414
|
+
**Rationale**: Faster development, rich ecosystem for code parsing
|
|
415
|
+
|
|
416
|
+
- **Core**: Python 3.10+, Poetry for dependency management
|
|
417
|
+
- **Parsing**: `ast` module (Python), `markdown-it-py` (Markdown), `tree-sitter` (JS)
|
|
418
|
+
- **Indexing**: SQLite for metadata, JSON for custom indexes
|
|
419
|
+
- **Search**: Native dict-based inverted index, `rapidfuzz` for fuzzy matching
|
|
420
|
+
- **CLI**: Click or Typer for command-line interface
|
|
421
|
+
- **LLM**: Pydantic for structured output, JSON serialization
|
|
422
|
+
|
|
423
|
+
### For Node.js Implementation (Alternative)
|
|
424
|
+
- **Parsing**: `@babel/parser`, `remark`, `web-tree-sitter`
|
|
425
|
+
- **Indexing**: `better-sqlite3`, `lowdb`
|
|
426
|
+
- **CLI**: `Commander.js`, `Chalk`
|
|
427
|
+
|
|
428
|
+
---
|
|
429
|
+
|
|
430
|
+
## Success Criteria
|
|
431
|
+
|
|
432
|
+
### Phase 1 (MVP)
|
|
433
|
+
- ✅ Index Python and Markdown files
|
|
434
|
+
- ✅ Keyword search returns relevant results
|
|
435
|
+
- ✅ LLM-friendly JSON output format
|
|
436
|
+
- ✅ CLI interface working
|
|
437
|
+
- ✅ >80% test coverage
|
|
438
|
+
|
|
439
|
+
### Phase 2+ (Production)
|
|
440
|
+
- ✅ <500 ms query time (regardless of codebase size)
|
|
441
|
+
- ✅ <100 MB storage for typical project
|
|
442
|
+
- ✅ Incremental indexing working
|
|
443
|
+
- ✅ 5+ file types supported
|
|
444
|
+
- ✅ Relevance scoring tuned
|
|
445
|
+
- ✅ >85% precision in top results
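
The precision target above could be checked with a precision-at-k measurement. The sketch below assumes that "top results" means the first `k` ranked results and that a hand-labelled set of relevant block ids exists for each test query; neither assumption is specified in the criteria, and the function name `precision_at_k` is illustrative.

```python
def precision_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-k ranked results that are labelled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_ids[:k]
    hits = sum(1 for block_id in top_k if block_id in relevant_ids)
    # Dividing by k (not len(top_k)) penalises queries that return too few results.
    return hits / k


# Hypothetical query: five results returned, four of them labelled relevant.
ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "b", "d", "e", "z"}
score = precision_at_k(ranked, relevant, k=5)
print(score)          # 0.8
print(score > 0.85)   # False: this query would miss the >85% target
```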
|
|
446
|
+
|
|
447
|
+
---
|
|
448
|
+
|
|
449
|
+
## Quick Start (Conceptual)
|
|
450
|
+
|
|
451
|
+
```bash
|
|
452
|
+
# 1. Install
|
|
453
|
+
pip install ngao-search
|
|
454
|
+
|
|
455
|
+
# 2. Initialize index
|
|
456
|
+
ngao-search init /path/to/codebase
|
|
457
|
+
|
|
458
|
+
# 3. Search
|
|
459
|
+
ngao-search query "authentication handler" --format json
|
|
460
|
+
|
|
461
|
+
# 4. Use with LLM
|
|
462
|
+
results=$(ngao-search query "database query patterns" --format json)
|
|
463
|
+
printf '%s\n' "$results" | llm prompt "Explain these database patterns and suggest optimizations"
|
|
464
|
+
```
|
|
465
|
+
|
|
466
|
+
---
|
|
467
|
+
|
|
468
|
+
## Next Steps
|
|
469
|
+
|
|
470
|
+
1. **Validate**: Get feedback on architecture from potential users
|
|
471
|
+
2. **Prototype**: Build Phase 1 implementation (Python + Markdown + basic ranking)
|
|
472
|
+
3. **Test**: Create comprehensive test suite with sample queries
|
|
473
|
+
4. **Optimize**: Profile and optimize based on real usage patterns
|
|
474
|
+
5. **Expand**: Add additional file types based on user feedback
|
|
475
|
+
6. **Integrate**: Connect with IDE plugins and LLM workflows
|
|
476
|
+
|
|
477
|
+
---
|
|
478
|
+
|
|
479
|
+
## Documentation Files
|
|
480
|
+
|
|
481
|
+
This research consists of the following documents:
|
|
482
|
+
|
|
483
|
+
1. **[START_HERE.md](START_HERE.md)** - Entry point and navigation
|
|
484
|
+
2. **[RESEARCH_SUMMARY.md](RESEARCH_SUMMARY.md)** - This executive overview
|
|
485
|
+
3. **[IMPLEMENTATION_OVERVIEW.md](IMPLEMENTATION_OVERVIEW.md)** - TypeScript/Node.js guide
|
|
486
|
+
4. **[QUICK_REFERENCE.md](QUICK_REFERENCE.md)** - Formulas and quick answers
|
|
487
|
+
5. **[DOCUMENTATION_INDEX.md](DOCUMENTATION_INDEX.md)** - Complete navigation
|
|
488
|
+
6. **[DELIVERY_SUMMARY.md](DELIVERY_SUMMARY.md)** - Summary of content
|
|
489
|
+
|
|
490
|
+
All documents in the reference folder are complementary and should be read together for complete understanding.
|
|
491
|
+
|
|
492
|
+
---
|
|
493
|
+
|
|
494
|
+
## Questions Answered
|
|
495
|
+
|
|
496
|
+
✅ **How to index different file types?** → Format-specific parsers with AST analysis
|
|
497
|
+
✅ **How to return LLM-friendly format?** → Universal JSON schema with all metadata
|
|
498
|
+
✅ **How to rank results relevantly?** → 5-factor scoring algorithm
|
|
499
|
+
✅ **How to provide context?** → Context extraction with configurable window sizes
|
|
500
|
+
✅ **How to scale to large codebases?** → Incremental indexing + parallel processing
|
|
501
|
+
✅ **How to handle different scopes?** → Scope hierarchy indexing
|
|
502
|
+
✅ **How to balance performance and accuracy?** → Multiple optimization layers
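
To make "context extraction with configurable window sizes" more concrete, here is a minimal sketch that returns a matched line together with a configurable number of lines before and after it. The function name `extract_context`, its parameters, and the returned dictionary keys are assumptions for illustration, not the NGAO schema.

```python
def extract_context(lines: list[str], match_line: int, before: int = 2, after: int = 2) -> dict:
    """Return the matched line plus up to `before`/`after` surrounding lines.

    `match_line` is a 0-based index into `lines`; the reported line numbers
    are 1-based and inclusive, as editors usually display them.
    """
    if not 0 <= match_line < len(lines):
        raise IndexError("match_line out of range")
    start = max(0, match_line - before)               # clamp at start of file
    end = min(len(lines), match_line + after + 1)     # clamp at end of file
    return {
        "start_line": start + 1,
        "end_line": end,
        "match": lines[match_line],
        "context": lines[start:end],
    }


source = [
    "import os",
    "",
    "def handler(req):",
    "    token = req.headers['auth']",
    "    return check(token)",
    "",
    "def check(token):",
]
result = extract_context(source, match_line=3, before=1, after=1)
print(result["start_line"], result["end_line"])  # 3 5
```

A line-based window is the simplest option; a scope-aware variant could instead expand to the enclosing function or heading.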
|
|
503
|
+
|
|
504
|
+
---
|
|
505
|
+
|
|
506
|
+
## Key Insights
|
|
507
|
+
|
|
508
|
+
1. **Multi-Index is Essential**: Keyword search alone is insufficient; scope hierarchy indexing provides structural understanding
|
|
509
|
+
2. **Format Matters**: Each file type has unique structure; generic parsing loses important information
|
|
510
|
+
3. **Context is Critical**: Search results without context are nearly useless; must include surrounding code/docs
|
|
511
|
+
4. **LLM Optimization is Novel**: Formatting results specifically for AI models enables new use cases
|
|
512
|
+
5. **Ranking is Complex**: Simple keyword relevance is insufficient; multi-factor scoring is needed
|
|
513
|
+
6. **Incremental Indexing is Crucial**: Full re-indexing on every change is too slow; change detection is mandatory
|
|
514
|
+
7. **Caching Layers are Essential**: Multiple caching levels (query, block, AST) are needed for sub-second performance
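
The change detection behind incremental indexing (insight 6) can be sketched with content hashes: keep a snapshot of file digests from the previous run and re-index only files whose digest differs. This is one possible approach under stated assumptions, not the documented NGAO mechanism; the names `file_digest` and `detect_changes`, the choice of SHA-256, and the `*.py` default pattern are all illustrative.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 hex digest of the file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def detect_changes(root: Path, previous: dict[str, str], pattern: str = "*.py"):
    """Compare current digests against a previous snapshot.

    Returns (current_snapshot, added, modified, removed), where the last
    three are sorted lists of paths relative to `root`.
    """
    current = {
        str(p.relative_to(root)): file_digest(p)
        for p in root.rglob(pattern)
        if p.is_file()
    }
    added = sorted(set(current) - set(previous))
    removed = sorted(set(previous) - set(current))
    modified = sorted(
        name for name in current.keys() & previous.keys()
        if current[name] != previous[name]
    )
    return current, added, modified, removed


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.py").write_text("x = 1\n")
    snapshot, added, modified, removed = detect_changes(root, {})

    (root / "a.py").write_text("x = 2\n")   # edit an existing file
    (root / "b.py").write_text("y = 1\n")   # create a new file
    snapshot2, added2, modified2, removed2 = detect_changes(root, snapshot)

print(added, added2, modified2, removed2)  # ['a.py'] ['b.py'] ['a.py'] []
```

Only the files reported as added or modified would need re-parsing, and removed files would be dropped from the index. Comparing modification times first and hashing only on mismatch would be a cheaper variant.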
|
|
515
|
+
|
|
516
|
+
---
|
|
517
|
+
|
|
518
|
+
This comprehensive research provides a solid foundation for implementing NGAO Search as a powerful, LLM-friendly code search system optimized for modern development workflows.
|
|
519
|
+
|