@ngao/search 0.1.0 → 0.1.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (259)
  1. package/.claude/settings.local.json +10 -0
  2. package/.env.example +7 -0
  3. package/.eslintrc.json +20 -0
  4. package/.github/workflows/build.yml +39 -0
  5. package/.github/workflows/release.yml +34 -0
  6. package/.github/workflows/test.yml +35 -0
  7. package/.mcp-config.json +14 -0
  8. package/.prettierrc.json +10 -0
  9. package/LICENSE +17 -0
  10. package/Makefile +26 -0
  11. package/README.md +57 -172
  12. package/config.example.json +8 -0
  13. package/dist/backend/api/search-engine.d.ts +40 -0
  14. package/dist/backend/api/search-engine.d.ts.map +1 -0
  15. package/dist/backend/api/search-engine.js +227 -0
  16. package/dist/backend/api/search-engine.js.map +1 -0
  17. package/dist/backend/core/block-impl.d.ts +32 -0
  18. package/dist/backend/core/block-impl.d.ts.map +1 -0
  19. package/dist/backend/core/block-impl.js +33 -0
  20. package/dist/backend/core/block-impl.js.map +1 -0
  21. package/dist/backend/core/config-loader.d.ts +68 -0
  22. package/dist/backend/core/config-loader.d.ts.map +1 -0
  23. package/dist/backend/core/config-loader.js +234 -0
  24. package/dist/backend/core/config-loader.js.map +1 -0
  25. package/dist/backend/core/constants.d.ts +39 -0
  26. package/dist/backend/core/constants.d.ts.map +1 -0
  27. package/dist/backend/core/constants.js +57 -0
  28. package/dist/backend/core/constants.js.map +1 -0
  29. package/dist/backend/core/enums.d.ts +54 -0
  30. package/dist/backend/core/enums.d.ts.map +1 -0
  31. package/dist/backend/core/enums.js +61 -0
  32. package/dist/backend/core/enums.js.map +1 -0
  33. package/dist/backend/core/errors.d.ts +83 -0
  34. package/dist/backend/core/errors.d.ts.map +1 -0
  35. package/dist/backend/core/errors.js +151 -0
  36. package/dist/backend/core/errors.js.map +1 -0
  37. package/dist/backend/core/logger.d.ts +68 -0
  38. package/dist/backend/core/logger.d.ts.map +1 -0
  39. package/dist/backend/core/logger.js +151 -0
  40. package/dist/backend/core/logger.js.map +1 -0
  41. package/dist/backend/core/models.d.ts +332 -0
  42. package/dist/backend/core/models.d.ts.map +1 -0
  43. package/dist/backend/core/models.js +6 -0
  44. package/dist/backend/core/models.js.map +1 -0
  45. package/dist/backend/core/service-types.d.ts +184 -0
  46. package/dist/backend/core/service-types.d.ts.map +1 -0
  47. package/dist/backend/core/service-types.js +7 -0
  48. package/dist/backend/core/service-types.js.map +1 -0
  49. package/dist/backend/core/types.d.ts +219 -0
  50. package/dist/backend/core/types.d.ts.map +1 -0
  51. package/dist/backend/core/types.js +109 -0
  52. package/dist/backend/core/types.js.map +1 -0
  53. package/dist/backend/index.d.ts +5 -0
  54. package/dist/backend/index.d.ts.map +1 -0
  55. package/dist/backend/index.js +13 -0
  56. package/dist/backend/index.js.map +1 -0
  57. package/dist/backend/indexing/block-extractor.d.ts +22 -0
  58. package/dist/backend/indexing/block-extractor.d.ts.map +1 -0
  59. package/dist/backend/indexing/block-extractor.js +52 -0
  60. package/dist/backend/indexing/block-extractor.js.map +1 -0
  61. package/dist/backend/indexing/index-builder.d.ts +26 -0
  62. package/dist/backend/indexing/index-builder.d.ts.map +1 -0
  63. package/dist/backend/indexing/index-builder.js +71 -0
  64. package/dist/backend/indexing/index-builder.js.map +1 -0
  65. package/dist/backend/parsers/base-file-parser.d.ts +134 -0
  66. package/dist/backend/parsers/base-file-parser.d.ts.map +1 -0
  67. package/dist/backend/parsers/base-file-parser.js +149 -0
  68. package/dist/backend/parsers/base-file-parser.js.map +1 -0
  69. package/dist/backend/parsers/javascript-parser.d.ts +36 -0
  70. package/dist/backend/parsers/javascript-parser.d.ts.map +1 -0
  71. package/dist/backend/parsers/javascript-parser.js +194 -0
  72. package/dist/backend/parsers/javascript-parser.js.map +1 -0
  73. package/dist/backend/parsers/json-parser.d.ts +15 -0
  74. package/dist/backend/parsers/json-parser.d.ts.map +1 -0
  75. package/dist/backend/parsers/json-parser.js +75 -0
  76. package/dist/backend/parsers/json-parser.js.map +1 -0
  77. package/dist/backend/parsers/markdown-parser.d.ts +17 -0
  78. package/dist/backend/parsers/markdown-parser.d.ts.map +1 -0
  79. package/dist/backend/parsers/markdown-parser.js +94 -0
  80. package/dist/backend/parsers/markdown-parser.js.map +1 -0
  81. package/dist/backend/parsers/parser-factory.d.ts +43 -0
  82. package/dist/backend/parsers/parser-factory.d.ts.map +1 -0
  83. package/dist/backend/parsers/parser-factory.js +149 -0
  84. package/dist/backend/parsers/parser-factory.js.map +1 -0
  85. package/dist/backend/parsers/python-parser.d.ts +21 -0
  86. package/dist/backend/parsers/python-parser.d.ts.map +1 -0
  87. package/dist/backend/parsers/python-parser.js +185 -0
  88. package/dist/backend/parsers/python-parser.js.map +1 -0
  89. package/dist/backend/parsers/yaml-parser.d.ts +16 -0
  90. package/dist/backend/parsers/yaml-parser.d.ts.map +1 -0
  91. package/dist/backend/parsers/yaml-parser.js +81 -0
  92. package/dist/backend/parsers/yaml-parser.js.map +1 -0
  93. package/dist/backend/repositories/implementations/lancedb-block-repository.d.ts +125 -0
  94. package/dist/backend/repositories/implementations/lancedb-block-repository.d.ts.map +1 -0
  95. package/dist/backend/repositories/implementations/lancedb-block-repository.js +505 -0
  96. package/dist/backend/repositories/implementations/lancedb-block-repository.js.map +1 -0
  97. package/dist/backend/repositories/implementations/lancedb-metadata-repository.d.ts +107 -0
  98. package/dist/backend/repositories/implementations/lancedb-metadata-repository.d.ts.map +1 -0
  99. package/dist/backend/repositories/implementations/lancedb-metadata-repository.js +275 -0
  100. package/dist/backend/repositories/implementations/lancedb-metadata-repository.js.map +1 -0
  101. package/dist/backend/repositories/implementations/memory-cache.d.ts +18 -0
  102. package/dist/backend/repositories/implementations/memory-cache.d.ts.map +1 -0
  103. package/dist/backend/repositories/implementations/memory-cache.js +53 -0
  104. package/dist/backend/repositories/implementations/memory-cache.js.map +1 -0
  105. package/dist/backend/repositories/repository.interface.d.ts +334 -0
  106. package/dist/backend/repositories/repository.interface.d.ts.map +1 -0
  107. package/dist/backend/repositories/repository.interface.js +7 -0
  108. package/dist/backend/repositories/repository.interface.js.map +1 -0
  109. package/dist/backend/search/context-extractor.d.ts +29 -0
  110. package/dist/backend/search/context-extractor.d.ts.map +1 -0
  111. package/dist/backend/search/context-extractor.js +106 -0
  112. package/dist/backend/search/context-extractor.js.map +1 -0
  113. package/dist/backend/search/multi-index-searcher.d.ts +28 -0
  114. package/dist/backend/search/multi-index-searcher.d.ts.map +1 -0
  115. package/dist/backend/search/multi-index-searcher.js +81 -0
  116. package/dist/backend/search/multi-index-searcher.js.map +1 -0
  117. package/dist/backend/search/query-parser.d.ts +37 -0
  118. package/dist/backend/search/query-parser.d.ts.map +1 -0
  119. package/dist/backend/search/query-parser.js +145 -0
  120. package/dist/backend/search/query-parser.js.map +1 -0
  121. package/dist/backend/search/ranking-engine.d.ts +31 -0
  122. package/dist/backend/search/ranking-engine.d.ts.map +1 -0
  123. package/dist/backend/search/ranking-engine.js +165 -0
  124. package/dist/backend/search/ranking-engine.js.map +1 -0
  125. package/dist/backend/search/result-formatter.d.ts +29 -0
  126. package/dist/backend/search/result-formatter.d.ts.map +1 -0
  127. package/dist/backend/search/result-formatter.js +70 -0
  128. package/dist/backend/search/result-formatter.js.map +1 -0
  129. package/dist/backend/service-types.d.ts +184 -0
  130. package/dist/backend/service-types.d.ts.map +1 -0
  131. package/dist/backend/service-types.js +7 -0
  132. package/dist/backend/service-types.js.map +1 -0
  133. package/dist/backend/services/embedding-service.d.ts +75 -0
  134. package/dist/backend/services/embedding-service.d.ts.map +1 -0
  135. package/dist/backend/services/embedding-service.js +298 -0
  136. package/dist/backend/services/embedding-service.js.map +1 -0
  137. package/dist/backend/services/file-watcher.d.ts +17 -0
  138. package/dist/backend/services/file-watcher.d.ts.map +1 -0
  139. package/dist/backend/services/file-watcher.js +92 -0
  140. package/dist/backend/services/file-watcher.js.map +1 -0
  141. package/dist/backend/services/index-information-service.d.ts +114 -0
  142. package/dist/backend/services/index-information-service.d.ts.map +1 -0
  143. package/dist/backend/services/index-information-service.js +104 -0
  144. package/dist/backend/services/index-information-service.js.map +1 -0
  145. package/dist/backend/services/ngao-search-service.d.ts +107 -0
  146. package/dist/backend/services/ngao-search-service.d.ts.map +1 -0
  147. package/dist/backend/services/ngao-search-service.js +384 -0
  148. package/dist/backend/services/ngao-search-service.js.map +1 -0
  149. package/dist/backend/services/quantization-service.d.ts +53 -0
  150. package/dist/backend/services/quantization-service.d.ts.map +1 -0
  151. package/dist/backend/services/quantization-service.js +84 -0
  152. package/dist/backend/services/quantization-service.js.map +1 -0
  153. package/dist/backend/services/reindex-manager.d.ts +25 -0
  154. package/dist/backend/services/reindex-manager.d.ts.map +1 -0
  155. package/dist/backend/services/reindex-manager.js +78 -0
  156. package/dist/backend/services/reindex-manager.js.map +1 -0
  157. package/dist/backend/services/session-manager.d.ts +115 -0
  158. package/dist/backend/services/session-manager.d.ts.map +1 -0
  159. package/dist/backend/services/session-manager.js +150 -0
  160. package/dist/backend/services/session-manager.js.map +1 -0
  161. package/dist/backend/services/vector-search-service.d.ts +81 -0
  162. package/dist/backend/services/vector-search-service.d.ts.map +1 -0
  163. package/dist/backend/services/vector-search-service.js +143 -0
  164. package/dist/backend/services/vector-search-service.js.map +1 -0
  165. package/dist/backend/utils/file-utils.d.ts +92 -0
  166. package/dist/backend/utils/file-utils.d.ts.map +1 -0
  167. package/dist/backend/utils/file-utils.js +247 -0
  168. package/dist/backend/utils/file-utils.js.map +1 -0
  169. package/dist/cli/setup.d.ts +4 -0
  170. package/dist/cli/setup.d.ts.map +1 -0
  171. package/dist/cli/setup.js +138 -0
  172. package/dist/cli/setup.js.map +1 -0
  173. package/dist/index.d.ts +6 -0
  174. package/dist/index.d.ts.map +1 -0
  175. package/dist/index.js +22 -0
  176. package/dist/index.js.map +1 -0
  177. package/dist/main.d.ts +14 -0
  178. package/dist/main.d.ts.map +1 -0
  179. package/dist/main.js +7 -67075
  180. package/dist/main.js.map +1 -0
  181. package/dist/mcp/tool-schemas.d.ts +205 -0
  182. package/dist/mcp/tool-schemas.d.ts.map +1 -0
  183. package/dist/mcp/tool-schemas.js +391 -0
  184. package/dist/mcp/tool-schemas.js.map +1 -0
  185. package/dist/server/logger.d.ts +50 -0
  186. package/dist/server/logger.d.ts.map +1 -0
  187. package/dist/server/logger.js +77 -0
  188. package/dist/server/logger.js.map +1 -0
  189. package/dist/server/tool-registry.d.ts +64 -0
  190. package/dist/server/tool-registry.d.ts.map +1 -0
  191. package/dist/server/tool-registry.js +93 -0
  192. package/dist/server/tool-registry.js.map +1 -0
  193. package/dist/server/transports/mcp-transport.d.ts +31 -0
  194. package/dist/server/transports/mcp-transport.d.ts.map +1 -0
  195. package/dist/server/transports/mcp-transport.js +331 -0
  196. package/dist/server/transports/mcp-transport.js.map +1 -0
  197. package/dist/server/transports/rest-transport.d.ts +36 -0
  198. package/dist/server/transports/rest-transport.d.ts.map +1 -0
  199. package/dist/server/transports/rest-transport.js +250 -0
  200. package/dist/server/transports/rest-transport.js.map +1 -0
  201. package/docs/API.md +116 -0
  202. package/docs/ARCHITECTURE.md +101 -0
  203. package/docs/FILE_WATCHING.md +120 -0
  204. package/docs/INSTALLATION.md +87 -0
  205. package/docs/MCP_INTEGRATION.md +108 -0
  206. package/docs/README.md +288 -0
  207. package/docs/USAGE.md +123 -0
  208. package/docs/architecture-design-standards/01_ARCHITECTURE.md +863 -0
  209. package/docs/architecture-design-standards/02_SEARCH_ENGINE_DESIGN.md +958 -0
  210. package/docs/architecture-design-standards/03_DATAFLOW.md +1000 -0
  211. package/docs/architecture-design-standards/04_VISUAL_GUIDE.md +922 -0
  212. package/docs/architecture-design-standards/05_REPOSITORY_PATTERN_GUIDE.md +503 -0
  213. package/docs/architecture-design-standards/06_IMPLEMENTATION_PATTERNS.md +1026 -0
  214. package/docs/architecture-design-standards/07_TYPESCRIPT_GUIDE.md +1027 -0
  215. package/docs/architecture-design-standards/08_CODING_STANDARDS.md +1274 -0
  216. package/docs/reference/01_START_HERE.md +108 -0
  217. package/docs/reference/02_QUICK_REFERENCE.md +363 -0
  218. package/docs/reference/03_DOCUMENTATION_INDEX.md +293 -0
  219. package/docs/reference/04_DELIVERY_SUMMARY.md +463 -0
  220. package/docs/reference/05_IMPLEMENTATION_OVERVIEW.md +319 -0
  221. package/docs/reference/06_RESEARCH_SUMMARY.md +519 -0
  222. package/docs/tracking/03_IMPLEMENTATION_ROADMAP.md +788 -0
  223. package/jest.config.json +12 -0
  224. package/package.json +46 -53
  225. package/prepend-shebang.js +18 -0
  226. package/scripts/setup-mcp.sh +66 -0
  227. package/src/backend/index.ts +5 -0
  228. package/src/backend/service-types.ts +219 -0
  229. package/src/backend/services/file-watcher.ts +79 -0
  230. package/src/backend/services/ngao-search-service.ts +430 -0
  231. package/src/backend/services/reindex-manager.ts +90 -0
  232. package/src/backend/services/session-manager.ts +214 -0
  233. package/src/cli/setup.ts +122 -0
  234. package/src/index.ts +6 -0
  235. package/src/main.ts +225 -0
  236. package/src/mcp/tool-schemas.ts +439 -0
  237. package/src/server/logger.ts +88 -0
  238. package/src/server/tool-registry.ts +117 -0
  239. package/src/server/transports/mcp-transport.ts +374 -0
  240. package/src/server/transports/rest-transport.ts +258 -0
  241. package/tests/unit/agent-tools.test.ts +454 -0
  242. package/tests/unit/file-watcher.test.d.ts +2 -0
  243. package/tests/unit/file-watcher.test.d.ts.map +1 -0
  244. package/tests/unit/file-watcher.test.js +9 -0
  245. package/tests/unit/file-watcher.test.js.map +1 -0
  246. package/tests/unit/file-watcher.test.ts +7 -0
  247. package/tests/unit/search-integration.test.ts +256 -0
  248. package/tests/unit/services.test.d.ts +2 -0
  249. package/tests/unit/services.test.d.ts.map +1 -0
  250. package/tests/unit/services.test.js +9 -0
  251. package/tests/unit/services.test.js.map +1 -0
  252. package/tests/unit/services.test.ts +7 -0
  253. package/tsconfig.json +23 -0
  254. package/webpack.backend.config.js +60 -0
  255. package/webpack.config.js +34 -0
  256. package/models/Xenova/all-MiniLM-L6-v2/config.json +0 -25
  257. package/models/Xenova/all-MiniLM-L6-v2/onnx/model_quantized.onnx +0 -0
  258. package/models/Xenova/all-MiniLM-L6-v2/tokenizer.json +0 -30686
  259. package/models/Xenova/all-MiniLM-L6-v2/tokenizer_config.json +0 -15
@@ -0,0 +1,519 @@
1
+ # NGAO Search: Research Summary & Key Insights
2
+
3
+ ## Executive Summary
4
+
5
+ NGAO Search is a **local document/code search system** designed to help developers find, understand, and analyze code and documentation efficiently. The system is optimized for **LLM consumption**, meaning results are structured and formatted specifically for AI models to process, enabling advanced use cases like automated documentation generation, code analysis, and onboarding assistance.
6
+
7
+ ### Key Differentiators
8
+
9
+ 1. **Format-Aware Indexing**: Different file types (Python, Markdown, JS) return different result structures matching their semantics
10
+ 2. **LLM-Optimized Output**: All results follow a structured JSON schema designed for AI consumption
11
+ 3. **Multi-Level Ranking**: Relevance scoring considers keyword matching, position, scope specificity, recency, and frequency
12
+ 4. **Context-Aware Results**: Returns snippets with surrounding context and scope hierarchy
13
+ 5. **Semantic Structure**: Extracts and indexes meaningful blocks (functions, classes, sections) with scope hierarchy
14
+
15
+ ---
16
+
17
+ ## Core Architecture
18
+
19
+ ### Three Main Pipelines
20
+
21
+ ```
22
+ INDEXING PIPELINE:
23
+ Files → Parse → Extract Blocks → Multi-Index Ingestion → Persistent Storage
24
+
25
+ QUERY PIPELINE:
26
+ Query → Parse → Multi-Index Search → Aggregate → Extract Context → Rank → Format
27
+
28
+ RESULT PIPELINE:
29
+ Ranking → LLM-Optimized Formatting → Output (JSON, CLI, Web API)
30
+ ```
31
+
32
+ ### Key Components
33
+
34
+ | Component | Purpose | Input | Output |
35
+ |-----------|---------|-------|--------|
36
+ | **FileRouter** | Route files to appropriate parser | File path | Parser instance |
37
+ | **Parsers** | Extract structured blocks | Raw file content | List of Block objects |
38
+ | **BlockExtractor** | Normalize blocks and compute metadata | Parsed AST | Normalized Block objects |
39
+ | **IndexBuilder** | Build multiple specialized indexes | Normalized blocks | Inverted, scope, registry indexes |
40
+ | **MultiIndexSearcher** | Search across all indexes | Query terms + filters | Matching blocks with scores |
41
+ | **RankingEngine** | Compute relevance scores | Blocks + query metadata | Ranked results |
42
+ | **ContextExtractor** | Extract surrounding context | Block + source file | Enriched snippet |
43
+ | **ResultFormatter** | Format results for output | Enriched results | LLM-friendly JSON |
44
+
45
+ ---
46
+
47
+ ## Data Flow Step-by-Step
48
+
49
+ ### Indexing Dataflow
50
+
51
+ ```
52
+ 1. DISCOVERY: File watcher detects new/modified file
53
+ 2. CHANGE DETECTION: Compute hash, compare with stored hash
54
+ 3. PARSING: Route to format-specific parser
55
+ 4. BLOCK EXTRACTION: Extract functions/classes/sections with scope
56
+ 5. INDEXING: Add to inverted, scope, and block registry indexes
57
+ 6. PERSISTENCE: Save indexes to SQLite/JSON storage
58
+ ```
59
+
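Step 2 of the indexing dataflow (change detection) could look roughly like the sketch below. It is illustrative only and not taken from the package: the SHA-256 digest, the in-memory `stored_hashes` dict, and the names `file_digest` and `needs_reindex` are all assumptions.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of the file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def needs_reindex(path: Path, stored_hashes: dict[str, str]) -> bool:
    """Compare the current digest with the stored one.

    Returns False when the file is unchanged. Otherwise records the
    new digest (hypothetical in-memory store) and returns True.
    """
    digest = file_digest(path)
    key = str(path)
    if stored_hashes.get(key) == digest:
        return False  # unchanged: skip parsing and indexing
    stored_hashes[key] = digest  # new or modified file
    return True
```

A persistent implementation would presumably keep the digests alongside the indexes rather than in a dict, so that unchanged files are skipped across restarts.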
60
+ ### Query Dataflow
61
+
62
+ ```
63
+ 1. INPUT: User enters "authentication handler"
64
+ 2. PARSING: Tokenize → ["authentication", "handler"], extract filters
65
+ 3. SEARCH: Query inverted index for each term
66
+ 4. AGGREGATION: Combine results, deduplicate
67
+ 5. CONTEXT: Load source files, extract snippets ±context lines
68
+ 6. RANKING: Compute multi-factor relevance score
69
+ 7. FORMATTING: Transform to LLM-friendly JSON schema
70
+ 8. OUTPUT: Return ranked, formatted results
71
+ ```
72
+
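Step 5 of the query dataflow (context extraction) can be sketched as a small helper that returns a block's lines plus a window of surrounding lines, each prefixed with its line number. The function name `extract_snippet`, the default window of 3 lines, and the `NNNN: text` layout are assumptions made for this example.

```python
def extract_snippet(source_lines, start_line, end_line, context=3):
    """Return numbered lines for [start_line, end_line] plus `context`
    lines on each side. Line numbers are 1-indexed and the window is
    clamped to the bounds of the file."""
    first = max(1, start_line - context)
    last = min(len(source_lines), end_line + context)
    return [f"{n:>4}: {source_lines[n - 1]}" for n in range(first, last + 1)]
```

Clamping matters for blocks near the top or bottom of a file, where a fixed window would otherwise run past the first or last line.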
73
+ ---
74
+
75
+ ## Index Structure
76
+
77
+ The system maintains **4 specialized indexes**:
78
+
79
+ ### 1. Inverted Index (70% of storage)
80
+ Maps keywords to source locations:
81
+ ```json
82
+ {
83
+ "authentication": {
84
+ "files": ["src/auth/handler.py", "docs/auth.md"],
85
+ "positions": [
86
+ {
87
+ "file": "src/auth/handler.py",
88
+ "block_id": "func_42",
89
+ "occurrences": [{"line": 45}, {"line": 50}]
90
+ }
91
+ ]
92
+ }
93
+ }
94
+ ```
95
+
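As a rough illustration of how such an inverted index could be built and queried, the sketch below uses plain dictionaries. The block shape (`file`, `block_id`, `start_line`, `lines`), the tokenizer regex, and the function names are assumptions for this example, not the package's data model.

```python
import re
from collections import defaultdict


def build_inverted_index(blocks):
    """Map each lowercase token to {(file, block_id): [line numbers]}.

    A token that occurs twice on one line is recorded twice.
    """
    index = defaultdict(lambda: defaultdict(list))
    for block in blocks:
        for offset, text in enumerate(block["lines"]):
            line_no = block["start_line"] + offset
            for token in re.findall(r"[a-z0-9_]+", text.lower()):
                index[token][(block["file"], block["block_id"])].append(line_no)
    return index


def lookup(index, term):
    """Return postings shaped like the `positions` entries in the JSON example."""
    postings = index.get(term.lower(), {})
    return [
        {"file": f, "block_id": b, "occurrences": [{"line": n} for n in lines]}
        for (f, b), lines in postings.items()
    ]
```

Usage, with a single hypothetical block starting at line 45:

```python
blocks = [{
    "file": "src/auth/handler.py",
    "block_id": "func_42",
    "start_line": 45,
    "lines": ["def handle(self, request):", "    # Handle authentication"],
}]
index = build_inverted_index(blocks)
print(lookup(index, "authentication"))
```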
96
+ ### 2. Scope Hierarchy Index (20% of storage)
97
+ Maps file structure for quick navigation:
98
+ ```json
99
+ {
100
+ "src/auth/handler.py": {
101
+ "module": { "children": ["class:AuthHandler"] },
102
+ "class:AuthHandler": { "children": ["method:handle", "method:validate"] }
103
+ }
104
+ }
105
+ ```
106
+
107
+ ### 3. Block Registry (10% of storage)
108
+ Quick lookup for block metadata:
109
+ ```json
110
+ {
111
+ "func_42": {
112
+ "file": "src/auth/handler.py",
113
+ "type": "method",
114
+ "location": {"start": 45, "end": 78},
115
+ "scope": ["module", "class:AuthHandler"]
116
+ }
117
+ }
118
+ ```
119
+
120
+ ### 4. Semantic Index (Optional, 10-20% additional)
121
+ Dense vector embeddings for similarity search.
122
+
123
+ ---
124
+
125
+ ## Format-Specific Parsing
126
+
127
+ ### Python Files
128
+ Extracts via AST traversal:
129
+ - **Blocks**: Functions, classes, methods, variables
130
+ - **Metadata**: Decorators, type hints, docstrings, signatures
131
+ - **Scope**: Module → Class → Method hierarchy
132
+
133
+ ### Markdown Files
134
+ Extracts via heading hierarchy:
135
+ - **Blocks**: Heading levels (H1-H6), sections, paragraphs, code blocks
136
+ - **Metadata**: Heading hierarchy, language for code blocks
137
+ - **Scope**: Parent heading chain
138
+
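The parent heading chain can be computed with a simple stack, as in the sketch below. It handles only ATX headings (`#` to `######`) and skips fenced code blocks. The function name and the tuple layout are assumptions, and a real parser library would cover more Markdown syntax.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # fenced code block marker


def heading_scopes(markdown_text):
    """Return (line_number, level, title, parent_chain) for each ATX heading."""
    stack = []  # (level, title) of the headings currently open
    results = []
    in_fence = False
    for line_no, line in enumerate(markdown_text.splitlines(), start=1):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # '#' lines inside code are not headings
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if not match:
            continue
        level, title = len(match.group(1)), match.group(2)
        # Close headings at the same or a deeper level.
        while stack and stack[-1][0] >= level:
            stack.pop()
        results.append((line_no, level, title, [t for _, t in stack]))
        stack.append((level, title))
    return results
```

For a document with headings `# A`, `## B`, `### C`, `## D`, the chain for `C` is `["A", "B"]` and the chain for `D` is `["A"]`.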
139
+ ### JavaScript/TypeScript
140
+ Extracts via Babel/tree-sitter parser:
141
+ - **Blocks**: Functions, classes, components, hooks, exports
142
+ - **Metadata**: JSDoc, type definitions, export types
143
+ - **Scope**: Module → Class → Method hierarchy
144
+
145
+ ### JSON/YAML
146
+ Extracts via flattened key paths:
147
+ - **Blocks**: Key-value pairs, objects, arrays
148
+ - **Metadata**: Nesting level, value types
149
+ - **Scope**: Key path hierarchy (e.g., database.connection.pool_size)
150
+
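Flattening a parsed JSON or YAML document into key paths can be sketched with a short recursive generator. The dotted-path format follows the `database.connection.pool_size` example above. The `[i]` notation for array elements and the function name `flatten` are assumptions.

```python
def flatten(value, prefix=""):
    """Yield (key_path, leaf_value) pairs for nested dicts and lists.

    Empty dicts and lists yield nothing.
    """
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            yield from flatten(child, path)
    elif isinstance(value, list):
        for i, child in enumerate(value):
            yield from flatten(child, f"{prefix}[{i}]")
    else:
        yield prefix, value
```

Usage:

```python
config = {"database": {"connection": {"pool_size": 10}}, "hosts": ["a", "b"]}
print(dict(flatten(config)))
```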
151
+ ---
152
+
153
+ ## Ranking Algorithm
154
+
155
+ Results are ranked by a weighted combination of five factors:
156
+
157
+ ```
158
+ Final Score = (
159
+ 0.35 × Keyword Match Score + # How well query terms match
160
+ 0.25 × Position Score + # Where in block was found (name vs body)
161
+ 0.15 × Scope Specificity + # Nested scopes score higher
162
+ 0.15 × Recency Score + # Recently modified files score higher
163
+ 0.10 × Frequency Score # How many times keywords appear
164
+ )
165
+ ```
166
+
167
+ ### Scoring Details
168
+
169
+ **Keyword Match (0.35)**
170
+ - Exact phrase: 1.0
171
+ - All terms match: 0.8
172
+ - Partial (X/Y terms): 0.6 × (matched/total)
173
+ - Fuzzy match: 0.2
174
+
175
+ **Position (0.25)**
176
+ - In identifier/name: 1.0
177
+ - In docstring/comment: 0.9
178
+ - In code body: 0.5
179
+ - In nested scope: 0.3
180
+
181
+ **Scope Specificity (0.15)**
182
+ - Method/deep nested: 0.8
183
+ - Class-level: 0.7
184
+ - Module-level: 0.5
185
+
186
+ **Recency (0.15)**
187
+ - Last 24 hours: 1.0
188
+ - Last week: 0.8
189
+ - Last month: 0.6
190
+ - Older: 0.4
191
+
192
+ **Frequency (0.10)**
193
+ - First occurrence: 1.0
194
+ - Multiple mentions: +0.1 per additional mention (bonus capped at 0.3)
195
+
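The weighted combination and two of the lookup tables above can be sketched as follows. This is one possible reading of the tables, not the package's ranking code. The function names are hypothetical, "last month" is taken as 30 days, fuzzy matching is not modelled, and the position, scope, and frequency scores are passed in as precomputed values.

```python
from datetime import datetime, timedelta

# Factor weights from the formula above; they sum to 1.0.
WEIGHTS = {
    "keyword": 0.35,
    "position": 0.25,
    "scope": 0.15,
    "recency": 0.15,
    "frequency": 0.10,
}


def keyword_match_score(matched, total, exact_phrase=False):
    """Keyword Match table: exact phrase, all terms, or partial match."""
    if exact_phrase:
        return 1.0
    if total == 0:
        return 0.0
    if matched == total:
        return 0.8
    return 0.6 * (matched / total)


def recency_score(modified, now):
    """Recency table, with 'last month' assumed to mean 30 days."""
    age = now - modified
    if age <= timedelta(days=1):
        return 1.0
    if age <= timedelta(days=7):
        return 0.8
    if age <= timedelta(days=30):
        return 0.6
    return 0.4


def final_score(factors):
    """Weighted sum of the five factor scores, each expected in [0, 1]."""
    return sum(WEIGHTS[name] * factors[name] for name in WEIGHTS)
```

For example, a hit with all query terms matched (0.8) in an identifier (1.0), at method level (0.8), modified three days ago (0.8), with a frequency score of 1.0 gives 0.35 × 0.8 + 0.25 × 1.0 + 0.15 × 0.8 + 0.15 × 0.8 + 0.10 × 1.0 = 0.87.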
196
+ ---
197
+
198
+ ## LLM-Optimized Output Format
199
+
200
+ Every result follows this universal schema:
201
+
202
+ ```json
203
+ {
204
+ "rank": 1,
205
+ "relevance_score": 0.92,
206
+
207
+ "file": {
208
+ "path": "src/auth/handler.py",
209
+ "type": "python"
210
+ },
211
+
212
+ "location": {
213
+ "start_line": 45,
214
+ "end_line": 78,
215
+ "file_tree": ["src/", "auth/", "handler.py"],
216
+ "scope_hierarchy": ["module", "class:AuthHandler", "method:handle"]
217
+ },
218
+
219
+ "match": {
220
+ "type": "method",
221
+ "name": "handle",
222
+ "signature": "def handle(self, request: Request) -> Response:",
223
+ "matched_terms": ["handle", "authentication"],
224
+ "match_positions": [
225
+ {"line": 45, "column": 8, "text": "def handle"},
226
+ {"line": 50, "column": 20, "text": "# Handle authentication"}
227
+ ]
228
+ },
229
+
230
+ "content": {
231
+ "snippet": "[code with context and line numbers]",
232
+ "context_lines": 15
233
+ },
234
+
235
+ "metadata": {
236
+ "docstring": "Handle authentication for requests",
237
+ "is_public": true,
238
+ "parameters": ["self", "request"],
239
+ "return_type": "Response",
240
+ "decorators": ["@validate_request"]
241
+ },
242
+
243
+ "tags": ["authentication", "security", "middleware"]
244
+ }
245
+ ```
246
+
247
+ Benefits for LLM processing:
248
+ - ✅ Predictable structure (easy JSON parsing)
249
+ - ✅ All relevant metadata included (full context)
250
+ - ✅ Clear scope hierarchy (understand code location)
251
+ - ✅ Snippet with context (understand purpose)
252
+ - ✅ Relevance scoring (prioritize important results)
253
+ - ✅ Matched terms highlighted (understand what matched)
254
+
255
+ ---
256
+
257
+ ## Performance Characteristics
258
+
259
+ ### Indexing
260
+ | Scenario | Time |
261
+ |----------|------|
262
+ | Small project (100 files) | <5 sec |
263
+ | Medium project (500 files) | 20-30 sec |
264
+ | Large project (2000 files) | 2-3 min |
265
+ | Incremental update (1 file) | <200 ms |
266
+
267
+ ### Querying
268
+ | Operation | Time |
269
+ |-----------|------|
270
+ | Simple keyword query | <50 ms |
271
+ | Multi-term query | <200 ms |
272
+ | With context extraction | <500 ms |
273
+ | Full result set (50 results) | <1 sec |
274
+
275
+ ### Storage
276
+ - **Typical project**: 50-100 MB
277
+ - **With compression**: 20-40 MB
278
+ - **Peak memory during search**: <200 MB
279
+
280
+ ---
281
+
282
+ ## Implementation Phases
283
+
284
+ ```
285
+ PHASE 1 (Weeks 1-2): Foundation
286
+ ├─ File discovery, type detection, routing
287
+ ├─ Basic text indexing, simple JSON storage
288
+ ├─ Python AST parser
289
+ └─ Outcome: Searchable Python codebase
290
+
291
+ PHASE 2 (Weeks 3-4): Markdown Support
292
+ ├─ Markdown parser (heading hierarchy)
293
+ ├─ Context extraction for markdown
294
+ └─ Outcome: Both Python and Markdown searchable
295
+
296
+ PHASE 3 (Weeks 5-6): Advanced Ranking
297
+ ├─ Multi-factor relevance scoring
298
+ ├─ Configuration and tuning
299
+ └─ Outcome: Smart result ordering
300
+
301
+ PHASE 4 (Weeks 7-8): Additional Formats
302
+ ├─ JavaScript/TypeScript support
303
+ ├─ JSON/YAML configuration files
304
+ └─ Outcome: Multi-language support
305
+
306
+ PHASE 5 (Weeks 9-10): Optimization
307
+ ├─ Incremental indexing (10x speedup)
308
+ ├─ Query caching
309
+ ├─ Parallel processing
310
+ └─ Outcome: Production-ready performance
311
+
312
+ PHASE 6+ (Optional): Advanced Features
313
+ ├─ Semantic search with embeddings
314
+ ├─ IDE integrations
315
+ ├─ Distributed indexing
316
+ └─ Outcome: Enterprise-grade system
317
+ ```
318
+
319
+ ---
320
+
321
+ ## Use Case Examples
322
+
323
+ ### 1. Code Review Context
324
+ **Query**: "authentication flow from request to token"
325
+ **Output**: Full execution path with code snippets, signatures, and docstrings
326
+ **Benefit**: About 5 minutes instead of 30 minutes to understand the code
327
+
328
+ ### 2. Onboarding New Team Member
329
+ **Query**: "@llm get_all_public_functions with_docstrings in auth"
330
+ **Output**: All public auth functions formatted for LLM
331
+ **Benefit**: LLM generates architecture overview and examples
332
+
333
+ ### 3. Bug Investigation
334
+ **Query**: "session_timeout references AND modifications"
335
+ **Output**: All uses of session_timeout with call chains
336
+ **Benefit**: LLM analyzes data flow to find bug root cause
337
+
338
+ ### 4. Safe Refactoring
339
+ **Query**: "old_function_name type:reference"
340
+ **Output**: Function definition + all call sites with context
341
+ **Benefit**: Confident, complete refactoring
342
+
343
+ ### 5. Documentation Generation
344
+ **Query**: "@doc get_all_exported_functions with_docstrings"
345
+ **Output**: All exported functions with documentation
346
+ **Benefit**: Auto-generated API docs always in sync
347
+
348
+ ---
349
+
350
+ ## Key Advantages
351
+
352
+ ### For Developers
353
+ - 🔍 Find code quickly without manual searching
354
+ - 📚 Understand context and scope automatically
355
+ - 🎯 See related code patterns
356
+ - 📖 Generate documentation on-the-fly
357
+
358
+ ### For Teams
359
+ - 🏃 Faster onboarding of new members
360
+ - 🐛 Quicker bug resolution
361
+ - 🔄 Consistent refactoring
362
+ - 📊 Automated architecture analysis
363
+
364
+ ### For LLMs
365
+ - 📋 Structured, predictable output format
366
+ - 🎯 Relevant context included
367
+ - 🏗️ Clear code structure (scope hierarchy)
368
+ - 🔗 Linked references (scope + line numbers)
369
+
370
+ ---
371
+
372
+ ## Technical Highlights
373
+
374
+ ### 1. Incremental Indexing
375
+ - **File hash detection**: Only reindex changed files
376
+ - **AST caching**: Parse once, use many times
377
+ - **Speed improvement**: 10x faster re-indexing
378
+
379
+ ### 2. Multi-Index Strategy
380
+ - **Keyword-based**: Fast for simple searches
381
+ - **Scope-based**: Understand code structure
382
+ - **Semantic** (optional): Find similar patterns
383
+
384
+ ### 3. Context-Aware Ranking
385
+ - 5-factor scoring considers keyword quality, position, scope depth, recency, and frequency
386
+ - Weights configurable per use case
387
+ - Optimized for typical developer workflows
388
+
389
+ ### 4. Format Flexibility
390
+ - Parse 5+ file types with specialized handling
391
+ - Each returns appropriate metadata
392
+ - Unified output schema for LLM consistency
393
+
394
+ ---
395
+
396
+ ## Comparison with Alternatives
397
+
398
+ | Feature | NGAO | grep | IDE | Semantic Search |
399
+ |---------|------|------|-----|-----------------|
400
+ | Structured results | ✅ | ❌ | ✅ (IDE specific) | ✅ |
401
+ | LLM-friendly format | ✅ | ❌ | ❌ | Partial |
402
+ | Scope hierarchy | ✅ | ❌ | ✅ | ❌ |
403
+ | Multi-format support | ✅ | ✅ | Limited | Limited |
404
+ | Local indexing | ✅ | ✅ | ✅ | Partial |
405
+ | Configurable ranking | ✅ | ❌ | ❌ | Limited |
406
+ | CLI + API | ✅ | ✅ | ❌ | ✅ |
407
+ | Context extraction | ✅ | ❌ | ✅ | Limited |
408
+
409
+ ---
410
+
411
+ ## Recommended Implementation Technology
412
+
413
+ ### For Python Implementation (Recommended for MVP)
414
+ **Rationale**: Faster development, rich ecosystem for code parsing
415
+
416
+ - **Core**: Python 3.10+, Poetry for dependency management
417
+ - **Parsing**: `ast` module (Python), `markdown-it-py` (Markdown), `tree-sitter` (JS)
418
+ - **Indexing**: SQLite for metadata, JSON for custom indexes
419
+ - **Search**: Native dict-based inverted index, `rapidfuzz` for fuzzy matching
420
+ - **CLI**: Click or Typer for command-line interface
421
+ - **LLM**: Pydantic for structured output, JSON serialization
422
+
423
+ ### For Node.js Implementation (Alternative)
424
+ - **Parsing**: `@babel/parser`, `remark`, `web-tree-sitter`
425
+ - **Indexing**: `better-sqlite3`, `lowdb`
426
+ - **CLI**: `Commander.js`, `Chalk`
427
+
428
+ ---
429
+
430
+ ## Success Criteria
431
+
432
+ ### Phase 1 (MVP)
433
+ - ✅ Index Python and Markdown files
434
+ - ✅ Keyword search returns relevant results
435
+ - ✅ LLM-friendly JSON output format
436
+ - ✅ CLI interface working
437
+ - ✅ >80% test coverage
438
+
439
+ ### Phase 2+ (Production)
440
+ - ✅ <500ms query time (any size codebase)
441
+ - ✅ <100 MB storage for typical project
442
+ - ✅ Incremental indexing working
443
+ - ✅ 5+ file types supported
444
+ - ✅ Relevance scoring tuned
445
+ - ✅ >85% precision in top results
446
+
447
+ ---
448
+
449
+ ## Quick Start (Conceptual)
450
+
451
+ ```bash
452
+ # 1. Install
453
+ pip install ngao-search
454
+
455
+ # 2. Initialize index
456
+ ngao-search init /path/to/codebase
457
+
458
+ # 3. Search
459
+ ngao-search query "authentication handler" --format json
460
+
461
+ # 4. Use with LLM
462
+ results=$(ngao-search query "database query patterns" --format json)
463
+ printf '%s\n' "$results" | llm prompt "Explain these database patterns and suggest optimizations"
464
+ ```
465
+
466
+ ---
467
+
468
+ ## Next Steps
469
+
470
+ 1. **Validate**: Get feedback on architecture from potential users
471
+ 2. **Prototype**: Build Phase 1 implementation (Python + Markdown + basic ranking)
472
+ 3. **Test**: Create comprehensive test suite with sample queries
473
+ 4. **Optimize**: Profile and optimize based on real usage patterns
474
+ 5. **Expand**: Add additional file types based on user feedback
475
+ 6. **Integrate**: Connect with IDE plugins and LLM workflows
476
+
477
+ ---
478
+
479
+ ## Documentation Files
480
+
481
+ This research comprises the following documents:
482
+
483
+ 1. **[01_START_HERE.md](01_START_HERE.md)** - Entry point and navigation
483
+ 2. **[06_RESEARCH_SUMMARY.md](06_RESEARCH_SUMMARY.md)** - This executive overview
484
+ 3. **[05_IMPLEMENTATION_OVERVIEW.md](05_IMPLEMENTATION_OVERVIEW.md)** - TypeScript/Node.js guide
485
+ 4. **[02_QUICK_REFERENCE.md](02_QUICK_REFERENCE.md)** - Formulas and quick answers
486
+ 5. **[03_DOCUMENTATION_INDEX.md](03_DOCUMENTATION_INDEX.md)** - Complete navigation
487
+ 6. **[04_DELIVERY_SUMMARY.md](04_DELIVERY_SUMMARY.md)** - Summary of content
489
+
490
+ All documents in the reference folder are complementary and should be read together for complete understanding.
491
+
492
+ ---
493
+
494
+ ## Questions Answered
495
+
496
+ ✅ **How to index different file types?** → Format-specific parsers with AST analysis
497
+ ✅ **How to return LLM-friendly format?** → Universal JSON schema with all metadata
498
+ ✅ **How to rank results relevantly?** → 5-factor scoring algorithm
499
+ ✅ **How to provide context?** → Context extraction with configurable window sizes
500
+ ✅ **How to scale to large codebases?** → Incremental indexing + parallel processing
501
+ ✅ **How to handle different scopes?** → Scope hierarchy indexing
502
+ ✅ **How to balance performance and accuracy?** → Multiple optimization layers
503
+
504
+ ---
505
+
506
+ ## Key Insights
507
+
508
+ 1. **Multi-Index is Essential**: Keyword search alone is insufficient; scope hierarchy indexing adds an understanding of code structure
509
+ 2. **Format Matters**: Each file type has unique structure; generic parsing loses important information
510
+ 3. **Context is Critical**: Search results without context are nearly useless; must include surrounding code/docs
511
+ 4. **LLM Optimization is Novel**: Formatting results specifically for AI models enables new use cases
512
+ 5. **Ranking is Complex**: Simple keyword relevance is insufficient; multi-factor scoring is needed
513
+ 6. **Incremental Indexing is Crucial**: Full re-indexing on every change is too slow; change detection is mandatory
514
+ 7. **Caching Layers are Essential**: Multiple caching levels (query, block, AST) are needed for sub-second performance
515
+
516
+ ---
517
+
518
+ This comprehensive research provides a solid foundation for implementing NGAO Search as a powerful, LLM-friendly code search system optimized for modern development workflows.
519
+