vortexa 0.1.0__tar.gz

vortexa-0.1.0/PKG-INFO ADDED
@@ -0,0 +1,507 @@
1
+ Metadata-Version: 2.4
2
+ Name: vortexa
3
+ Version: 0.1.0
4
+ Summary: Codebase indexing and semantic search engine
5
+ Author-email: VortexAI <koulabhay25@gmail.com>
6
+ License-Expression: Apache-2.0
7
+ Project-URL: Homepage, https://github.com/OEvortex/vortexa
8
+ Project-URL: Repository, https://github.com/OEvortex/vortexa
9
+ Project-URL: Issues, https://github.com/OEvortex/vortexa/issues
10
+ Keywords: codebase,indexing,search,embedding,semantic-search
11
+ Classifier: Development Status :: 4 - Beta
12
+ Classifier: Intended Audience :: Developers
13
+ Classifier: Programming Language :: Python :: 3
14
+ Classifier: Programming Language :: Python :: 3.10
15
+ Classifier: Programming Language :: Python :: 3.11
16
+ Classifier: Programming Language :: Python :: 3.12
17
+ Classifier: Programming Language :: Python :: 3.13
18
+ Classifier: Programming Language :: Python :: 3.14
19
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
20
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
21
+ Requires-Python: >=3.10
22
+ Description-Content-Type: text/markdown
23
+ Requires-Dist: numpy>=1.24.0
24
+ Requires-Dist: lmdb>=1.4.0
25
+ Requires-Dist: pathspec>=0.12.0
26
+ Requires-Dist: huggingface-hub>=0.20.0
27
+ Requires-Dist: tokenizers>=0.19.0
28
+ Requires-Dist: safetensors>=0.4.0
29
+ Provides-Extra: full
30
+ Requires-Dist: model2vec>=0.3.0; extra == "full"
31
+ Requires-Dist: sentence-transformers>=2.2.0; extra == "full"
32
+ Requires-Dist: tree-sitter-language-pack>=0.1.0; extra == "full"
33
+ Provides-Extra: mcp
34
+ Requires-Dist: fastmcp>=2.0.0; extra == "mcp"
35
+
36
+ <div align="center">
37
+
38
+ # vortexa &nbsp; 🧠
39
+
40
+ **Codebase indexing and semantic search engine**
41
+
42
+ _Dense + sparse hybrid retrieval · AST-aware chunking · LMDB persistence · MCP server_
43
+
44
+ [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)
45
+ [![Python](https://img.shields.io/badge/python-3.10+-brightgreen)](#)
46
+ [![PyPI version](https://img.shields.io/badge/pypi-v0.1.0-orange)](#)
47
+
48
+ </div>
49
+
50
+ ---
51
+
52
+ ## Table of Contents
53
+
54
+ - [Overview](#overview)
55
+ - [Features](#features)
56
+ - [Quick Start](#quick-start)
57
+ - [Python API](#python-api)
58
+ - [Indexing](#indexing)
59
+ - [Searching](#searching)
60
+ - [Watch Mode](#watch-mode)
61
+ - [Management](#management)
62
+ - [MCP Server](#mcp-server)
63
+ - [Usage with Claude Code / Cursor](#usage-with-claude-code--cursor)
64
+ - [Architecture](#architecture)
65
+ - [Dependencies](#dependencies)
66
+ - [License](#license)
67
+
68
+ ---
69
+
70
+ <div align="center">
71
+
72
+ ## Overview
73
+
74
+ </div>
75
+
76
+ vortexa is a standalone **codebase indexing and semantic search engine** designed for AI agents and developers. It builds a persistent, hybrid search index over source code using:
77
+
78
+ - **Dense retrieval** via static or learned embeddings (Model2Vec / SentenceTransformers)
79
+ - **Sparse retrieval** via BM25 keyword scoring
80
+ - **AST-aware chunking** that respects function and class boundaries via tree-sitter
81
+ - **LMDB-backed storage** for fast, persistent vector and chunk storage
82
+
83
+ The result: natural language code search that **understands intent**, not just keywords.
84
+
85
+ ```python
86
+ results = indexer.search("authentication middleware that validates JWT tokens", top_k=5)
87
+ # → Finds the right files even if they use "auth", "verify", or "token" instead of "authentication"
88
+ ```
89
+
90
+ vortexa can run as a **standalone Python library**, be embedded into any agent, or serve as an **MCP server** for LLM tools.
91
+
92
+ ---
93
+
94
+ <div align="center">
95
+
96
+ ## Features
97
+
98
+ </div>
99
+
100
+ <table>
101
+ <tr>
102
+ <td><strong>Semantic search</strong></td>
103
+ <td>Find code by describing what it does in natural language; no exact-string matching needed.</td>
104
+ </tr>
105
+ <tr>
106
+ <td><strong>Hybrid retrieval</strong></td>
107
+ <td>Combines dense embeddings (semantic meaning) with BM25 (keyword precision) using adaptive alpha weighting.</td>
108
+ </tr>
109
+ <tr>
110
+ <td><strong>AST-aware chunking</strong></td>
111
+ <td>Splits source code at function/class/block boundaries using tree-sitter when available, and falls back to line-based splitting otherwise.</td>
112
+ </tr>
113
+ <tr>
114
+ <td><strong>Incremental indexing</strong></td>
115
+ <td>Content-hash memoization means only changed files are re-indexed. Full re-index avoids redundant embedding computations.</td>
116
+ </tr>
117
+ <tr>
118
+ <td><strong>Persistent storage</strong></td>
119
+ <td>LMDB-backed vector store survives restarts. Embedding cache avoids recomputing identical content.</td>
120
+ </tr>
121
+ <tr>
122
+ <td><strong>Live watch mode</strong></td>
123
+ <td>Background thread polls for file changes and auto-re-indexes with configurable debounce.</td>
124
+ </tr>
125
+ <tr>
126
+ <td><strong>MCP server</strong></td>
127
+ <td>Exposes the index as a single <code>search</code> tool for any MCP-compatible agent (Claude Code, Cursor, etc.).</td>
128
+ </tr>
129
+ <tr>
130
+ <td><strong>Zero mandatory heavy deps</strong></td>
131
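+ <td>Core requires only <code>numpy</code>, <code>lmdb</code>, <code>pathspec</code>, and the loaders for the default embedding model (<code>huggingface-hub</code>, <code>tokenizers</code>, <code>safetensors</code>). Model2Vec, SentenceTransformers, and tree-sitter are optional extras.</td>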
132
+ </tr>
133
+ </table>
134
+
135
+ ---
136
+
137
+ <div align="center">
138
+
139
+ ## Quick Start
140
+
141
+ </div>
142
+
143
+ ### Installation
144
+
145
+ ```bash
146
+ # Core (BM25 + line-based chunking)
147
+ pip install vortexa
148
+
149
+ # Full (Model2Vec embeddings + tree-sitter AST chunking)
150
+ pip install "vortexa[full]"
151
+
152
+ # With MCP server support
153
+ pip install "vortexa[full,mcp]"
154
+ ```
155
+
156
+ ### Index a codebase
157
+
158
+ ```python
159
+ from vortexa.core.indexer import CodebaseIndexer
160
+
161
+ indexer = CodebaseIndexer(root=".")
162
+ stats = indexer.index()
163
+
164
+ print(f"Indexed {stats.indexed_files} files, {stats.total_chunks} chunks")
165
+ print(f"Languages detected: {stats.languages}")
166
+ ```
167
+
168
+ ### Search with natural language
169
+
170
+ ```python
171
+ results = indexer.search("CSV parser implementation", top_k=5)
172
+
173
+ for r in results:
174
+     print(f"{r.chunk.file_path}:{r.chunk.start_line} score={r.score:.3f}")
175
+     print(f" {r.chunk.content[:150].strip()}")
176
+     print()
177
+ ```
178
+
179
+ Output:
180
+ ```
181
+ src/parsers/csv_parser.py:42 score=0.892
182
+ def parse_csv(filepath: str, delimiter: str = ",") -> list[dict]:
183
+ """Parse a CSV file into a list of dictionaries."""
184
+ with open(filepath, "r") as f:
185
+
186
+ tests/test_csv_parser.py:15 score=0.756
187
+ def test_parse_csv_with_header():
188
+ result = parse_csv("test.csv")
189
+ assert len(result) == 3
190
+ ```
191
+
192
+ ---
193
+
194
+ <div align="center">
195
+
196
+ ## Python API
197
+
198
+ </div>
199
+
200
+ ### Indexing
201
+
202
+ ```python
203
+ from vortexa.core.indexer import CodebaseIndexer
204
+ from vortexa.core.types import ChunkConfig
205
+
206
+ # Default chunking (aim for 50-line chunks, 5-line overlap)
207
+ indexer = CodebaseIndexer(root="/path/to/project")
208
+ stats = indexer.index()
209
+ # → IndexStats(indexed_files=127, total_chunks=843, languages={"python": 45, "typescript": 32, ...})
210
+
211
+ # Custom chunk configuration
212
+ indexer = CodebaseIndexer(
213
+ root=".",
214
+ chunk_config=ChunkConfig(chunk_size=100, chunk_overlap=10),
215
+ )
216
+ stats = indexer.index(force=False, include_text_files=True)
217
+
218
+ # Force full re-index
219
+ stats = indexer.index(force=True)
220
+ ```
221
+
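The incremental behaviour behind `force=False` can be pictured with a small sketch. This is not vortexa's implementation: `content_hash`, `files_to_reindex`, and the in-memory `memo` dict are hypothetical names, and SHA-256 is an assumed choice of hash. It only illustrates the idea of skipping files whose content hash has not changed.

```python
import hashlib


def content_hash(text: str) -> str:
    """Return a stable hex digest of a file's contents (SHA-256 is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def files_to_reindex(files: dict[str, str], memo: dict[str, str]) -> list[str]:
    """Return the paths whose hash differs from the memo, and update the memo."""
    changed = []
    for path, text in files.items():
        digest = content_hash(text)
        if memo.get(path) != digest:
            changed.append(path)
            memo[path] = digest
    return changed


# Hypothetical in-memory memo; a real index would persist this between runs.
memo: dict[str, str] = {}
first = files_to_reindex({"a.py": "x = 1\n", "b.py": "y = 2\n"}, memo)
second = files_to_reindex({"a.py": "x = 1\n", "b.py": "y = 3\n"}, memo)
print(first)   # ['a.py', 'b.py']
print(second)  # ['b.py']
```

On the second pass only `b.py` is returned, because the stored digest for `a.py` still matches its content.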
222
+ ### Searching
223
+
224
+ ```python
225
+ # Hybrid search (auto-weighted semantic + BM25)
226
+ results = indexer.search("error handling", top_k=10)
227
+
228
+ # Pure semantic search
229
+ results = indexer.search("database connection pool", top_k=5, alpha=1.0)
230
+
231
+ # Pure BM25 keyword search
232
+ results = indexer.search("parse csv", top_k=5, alpha=0.0)
233
+
234
+ # Symbol lookup (find definitions by name)
235
+ results = indexer.find_symbol("ConnectionPool", top_k=5)
236
+
237
+ # Related chunks (find chunks similar to a given chunk index)
238
+ results = indexer.find_related(chunk_idx=3, top_k=5)
239
+ ```
240
+
241
+ Each result is a `SearchResult` with:
242
+
243
+ | Field | Type | Description |
244
+ |-------|------|-------------|
245
+ | `chunk.file_path` | `str` | Relative file path |
246
+ | `chunk.start_line` | `int` | Start line number |
247
+ | `chunk.end_line` | `int` | End line number |
248
+ | `chunk.content` | `str` | Code snippet (up to 500 chars) |
249
+ | `chunk.language` | `str` | Detected programming language |
250
+ | `chunk.lineage` | `Lineage` | Source path + byte offsets |
251
+ | `chunk.chunk_hash` | `str` | Content hash for memoization |
252
+ | `score` | `float` | Relevance score (0–1) |
253
+ | `source` | `str` | `"semantic"`, `"bm25"`, or `"hybrid"` |
254
+
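The role of `alpha` can be illustrated with a minimal sketch of weighted score fusion. It assumes min-max normalization and a fixed `alpha`, whereas vortexa describes its weighting as adaptive. The `min_max` and `fuse` helpers are hypothetical and not part of the library.

```python
def min_max(scores: dict[int, float]) -> dict[int, float]:
    """Rescale scores to [0, 1]; a constant score set maps to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def fuse(dense: dict[int, float], sparse: dict[int, float],
         alpha: float = 0.5, top_k: int = 5) -> list[tuple[int, float]]:
    """Blend scores: alpha=1.0 is dense only, alpha=0.0 is sparse (BM25) only."""
    d, s = min_max(dense), min_max(sparse)
    ids = set(d) | set(s)
    fused = {i: alpha * d.get(i, 0.0) + (1 - alpha) * s.get(i, 0.0) for i in ids}
    # Sort by descending score, breaking ties by chunk id for determinism.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Made-up scores keyed by chunk index.
dense = {0: 0.9, 1: 0.2, 2: 0.5}
sparse = {1: 7.0, 2: 3.0, 3: 1.0}
print([i for i, _ in fuse(dense, sparse, alpha=1.0)])  # [0, 2, 1, 3]
print([i for i, _ in fuse(dense, sparse, alpha=0.0)])  # [1, 2, 0, 3]
```

Normalizing first matters because cosine similarities and BM25 scores live on different scales; without it, one signal would dominate regardless of `alpha`.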
255
+ ### Watch Mode
256
+
257
+ ```python
258
+ from vortexa.interfaces.watcher import IndexWatcher
259
+
260
+ watcher = IndexWatcher(indexer, poll_interval=3.0)
261
+ watcher.start() # Background thread, polls every 3s, debounces 2s
262
+ # ... files change on disk, auto-re-index happens ...
263
+ watcher.stop()
264
+ ```
265
+
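For readers curious how polling with a debounce can work, here is a rough sketch. It is not the `IndexWatcher` source: `snapshot`, `DebouncedPoller`, and `tick` are hypothetical names, and comparing modification times is an assumed change-detection strategy.

```python
import os
import tempfile


def snapshot(root: str) -> dict[str, float]:
    """Map each file under root to its last-modified time."""
    mtimes = {}
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtimes[path] = os.stat(path).st_mtime
            except OSError:
                pass  # file vanished between listing and stat
    return mtimes


class DebouncedPoller:
    """Fire on_change once the tree has been quiet for `debounce` seconds."""

    def __init__(self, root, on_change, debounce=2.0):
        self.root = root
        self.on_change = on_change
        self.debounce = debounce
        self._last = snapshot(root)
        self._dirty_since = None

    def tick(self, now: float) -> bool:
        """Run one poll at time `now`; return True if on_change fired."""
        current = snapshot(self.root)
        if current != self._last:
            self._last = current
            self._dirty_since = now  # a new change restarts the quiet period
            return False
        if self._dirty_since is not None and now - self._dirty_since >= self.debounce:
            self._dirty_since = None
            self.on_change()
            return True
        return False


calls = []
with tempfile.TemporaryDirectory() as root:
    poller = DebouncedPoller(root, on_change=lambda: calls.append("reindex"))
    with open(os.path.join(root, "a.py"), "w") as f:
        f.write("x = 1\n")
    fired = [poller.tick(0.0), poller.tick(1.0), poller.tick(3.0)]
print(fired, calls)  # [False, False, True] ['reindex']
```

In a real watcher a background thread would call `tick(time.monotonic())` once per poll interval; passing the clock in explicitly keeps the sketch testable without sleeping.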
266
+ ### Management
267
+
268
+ ```python
269
+ # Index statistics
270
+ stats = indexer.stats()
271
+ # → {indexed_files: 127, total_chunks: 843, languages: {...}, memo_hits: 42, memo_misses: 15}
272
+
273
+ # Reset
274
+ indexer.clear() # Delete the persistent index
275
+ ```
276
+
277
+ ---
278
+
279
+ <div align="center">
280
+
281
+ ## MCP Server
282
+
283
+ </div>
284
+
285
+ vortexa ships with a built-in **MCP (Model Context Protocol) server** that exposes codebase search as a single `search` tool. Start it with:
286
+
287
+ ```bash
288
+ # Auto-indexes current directory, serves on stdio
289
+ python -m vortexa.interfaces.mcp_server
290
+
291
+ # Or via the installed entry point
292
+ vortexa-mcp
293
+ ```
294
+
295
+ On startup it indexes the current working directory and prints stats to stderr:
296
+ ```
297
+ [vortexa] Indexing C:\projects\my-app ...
298
+ [vortexa] Ready: 127 files, 843 chunks
299
+ [vortexa] Auto-reindex watcher started (polling every 3s)
300
+ ```
301
+
302
+ The server exposes one tool:
303
+
304
+ | Tool | Description | Arguments |
305
+ |------|-------------|-----------|
306
+ | `search` | Semantic + BM25 hybrid code search | `query` (str), `top_k` (int, default 10) |
307
+
308
+ ### Usage with Claude Code / Cursor
309
+
310
+ Add to your MCP configuration file (`~/.cursor/mcp.json` or Claude Code's `mcp_servers` config):
311
+
312
+ ```json
313
+ {
314
+ "mcpServers": {
315
+ "vortexa": {
316
+ "command": "python",
317
+ "args": ["-m", "vortexa.interfaces.mcp_server"],
318
+ "cwd": "/path/to/your/project"
319
+ }
320
+ }
321
+ }
322
+ ```
323
+
324
+ The agent will now have access to semantic code search: it can find functions, classes, and patterns by describing them in natural language. For exploratory queries this is often more effective than `grep` or `rg`, which match only literal text or patterns.
325
+
326
+ ---
327
+
328
+ <div align="center">
329
+
330
+ ## Architecture
331
+
332
+ </div>
333
+
334
+ ### Directory Layout
335
+
336
+ ```
337
+ vortexa/
338
+ ├── core/
339
+ │   ├── indexer.py        # CodebaseIndexer: main orchestrator
340
+ │   ├── chunking.py       # AST-aware (tree-sitter) + line-based chunking
341
+ │   ├── embedding.py      # Embedding models (Model2Vec, SentenceTransformers)
342
+ │   ├── language.py       # Language detection & file extension mapping
343
+ │   └── types.py          # Shared types (Chunk, ChunkConfig, IndexStats, SearchResult, ...)
344
+ ├── storage/
345
+ │   ├── vector_store.py   # LMDB-backed persistent vector store
346
+ │   ├── bm25.py           # BM25 keyword index with persistent storage
347
+ │   └── walker.py         # File system walker with .gitignore support
348
+ ├── search/
349
+ │   ├── search.py         # Hybrid search orchestrator (dense + sparse)
350
+ │   ├── ranking.py        # Result ranking & symbol query detection
351
+ │   └── tokens.py         # Identifier tokenization (camelCase, snake_case)
352
+ └── interfaces/
353
+     ├── mcp_server.py     # MCP server (stdio transport)
354
+     └── watcher.py        # Live file poller with debounced auto-reindex
355
+ ```
356
+
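The layout lists `search/tokens.py` as handling camelCase and snake_case identifiers. A minimal sketch of that kind of splitting follows; the regular expression and the `split_identifier` name are assumptions for illustration, not the module's actual code.

```python
import re

# Acronym runs (HTTP), capitalized or lowercase words (Server, pool), digit runs.
_WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def split_identifier(name: str) -> list[str]:
    """Split an identifier into lowercase word tokens."""
    return [m.group(0).lower() for m in _WORD.finditer(name)]


print(split_identifier("ConnectionPool"))  # ['connection', 'pool']
print(split_identifier("parse_csv"))       # ['parse', 'csv']
print(split_identifier("HTTPServer2"))     # ['http', 'server', '2']
```

Splitting identifiers this way lets a keyword query such as "connection pool" match a symbol named `ConnectionPool` in a BM25 index.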
357
+ ### Data Flow
358
+
359
+ ```mermaid
360
+ sequenceDiagram
361
+ participant User as User Code
362
+ participant Indexer as CodebaseIndexer
363
+ participant Walker as File Walker
364
+ participant Chunker as Chunking Engine
365
+ participant Embedder as Embedding Model
366
+ participant Store as LMDB Vector Store
367
+ participant BM25 as BM25 Index
368
+ participant Search as Search Engine
369
+
370
+ User->>Indexer: index()
371
+ Indexer->>Walker: walk_files(root, extensions)
372
+ Walker-->>Indexer: file_paths
373
+ loop Each file
374
+ Indexer->>Chunker: chunk_source(source, language)
375
+ Chunker-->>Indexer: list[Chunk]
376
+ Indexer->>Embedder: embed(chunks)
377
+ Embedder-->>Indexer: vectors
378
+ Indexer->>Store: store(vectors, chunks)
379
+ Indexer->>BM25: index(chunks)
380
+ end
381
+ Indexer-->>User: IndexStats
382
+
383
+ User->>Search: search(query)
384
+ Search->>Store: query(vector)
385
+ Search->>BM25: query(tokens)
386
+ Search->>Search: hybrid_fusion(results)
387
+ Search-->>User: list[SearchResult]
388
+ ```
389
+
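The `BM25: query(tokens)` step in the diagram can be sketched with the standard Okapi BM25 formula. The function below is a simplified in-memory version with the common defaults `k1=1.5` and `b=0.75`; those values and the `bm25_scores` name are assumptions, and vortexa's persistent index may differ.

```python
import math
from collections import Counter


def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each tokenized document against the query with Okapi BM25."""
    if not docs:
        return []
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [["parse", "csv", "file"], ["parse", "json"], ["open", "socket"]]
scores = bm25_scores(["parse", "csv"], docs)
print(scores[0] > scores[1] > scores[2])  # True
```

The first document matches both query terms and ranks highest; the third matches none and scores zero.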
390
+ ### Indexing Pipeline
391
+
392
+ ```mermaid
393
+ graph LR
394
+ A[Source Files] --> B[File Walker<br/>.gitignore aware]
395
+ B --> C[Language Detector]
396
+ C --> D{AST Available?}
397
+ D -->|Yes| E[Tree-sitter Parser<br/>Function/class boundaries]
398
+ D -->|No| F[Line-based Splitter<br/>Configurable size/overlap]
399
+ E --> G[Chunk Set]
400
+ F --> G
401
+ G --> H[Embedding Model<br/>Model2Vec / SentenceTransformer]
402
+ G --> I[BM25 Tokenizer]
403
+ H --> J[(LMDB Vector Store)]
404
+ I --> K[(BM25 Index)]
405
+ J --> L[Content Hash Memo]
406
+ K --> L
407
+ L --> M[Skip unchanged files]
408
+ ```
409
+
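The line-based fallback branch of the pipeline can be sketched as a sliding window over lines. The defaults mirror the 50-line chunks and 5-line overlap mentioned in the Indexing section; `chunk_lines` and its return shape are hypothetical, not the signature of `chunk_source()`.

```python
def chunk_lines(source: str, chunk_size: int = 50,
                chunk_overlap: int = 5) -> list[tuple[int, int, str]]:
    """Split source into (start_line, end_line, text) windows with overlap.

    Line numbers are 1-based and inclusive.
    """
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    lines = source.splitlines()
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(lines):
        end = min(start + chunk_size, len(lines))
        chunks.append((start + 1, end, "\n".join(lines[start:end])))
        if end == len(lines):
            break  # the final window reached the end of the file
        start += step
    return chunks


source = "\n".join(f"line {i}" for i in range(1, 121))
spans = [(s, e) for s, e, _ in chunk_lines(source)]
print(spans)  # [(1, 50), (46, 95), (91, 120)]
```

Each window starts 45 lines after the previous one, so consecutive chunks share 5 lines of context.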
410
+ ### Module Dependencies
411
+
412
+ ```mermaid
413
+ graph TD
414
+ subgraph "Public API"
415
+ Indexer["core.indexer<br/>CodebaseIndexer"]
416
+ Search["search.search<br/>search_hybrid()"]
417
+ end
418
+
419
+ subgraph "Core"
420
+ Chunking["core.chunking<br/>chunk_source()"]
421
+ Embedding["core.embedding<br/>Embedder"]
422
+ Language["core.language<br/>detect_language()"]
423
+ Types["core.types<br/>Chunk, ChunkConfig, ..."]
424
+ end
425
+
426
+ subgraph "Storage"
427
+ VectorStore["storage.vector_store<br/>LMDB Vector Store"]
428
+ BM25["storage.bm25<br/>BM25 Index"]
429
+ Walker["storage.walker<br/>walk_files()"]
430
+ end
431
+
432
+ subgraph "Interfaces"
433
+ MCP["interfaces.mcp_server<br/>FastMCP server"]
434
+ Watcher["interfaces.watcher<br/>IndexWatcher"]
435
+ end
436
+
437
+ Indexer --> Chunking
438
+ Indexer --> Embedding
439
+ Indexer --> Language
440
+ Indexer --> Types
441
+ Indexer --> VectorStore
442
+ Indexer --> BM25
443
+ Indexer --> Walker
444
+ Indexer --> Search
445
+
446
+ Search --> Embedding
447
+ Search --> VectorStore
448
+ Search --> BM25
449
+ Search --> Types
450
+
451
+ MCP --> Indexer
452
+ MCP --> Watcher
453
+ Watcher --> Walker
454
+ ```
455
+
456
+ ---
457
+
458
+ <div align="center">
459
+
460
+ ## Dependencies
461
+
462
+ </div>
463
+
464
+ | Package | Required | Used For |
465
+ |---------|----------|----------|
466
+ | `numpy` | Yes | Vector operations, embedding inference |
467
+ | `lmdb` | Yes | Persistent vector and chunk metadata storage |
468
+ | `pathspec` | Yes | `.gitignore` pattern matching in file walker |
469
+ | `model2vec` | Optional | Alternative static embeddings |
470
+ | `huggingface-hub` | Yes (default model) | Loading `VTXAI/Vortex-Embed-4.7M` |
471
+ | `tokenizers` | Yes (default model) | HF tokenizer for embedding model |
472
+ | `safetensors` | Yes (default model) | Safe tensor loading for 4-bit weights |
473
+ | `sentence-transformers` | Optional | Transformer-based dense embeddings |
475
+ | `tree-sitter-language-pack` | Optional | AST-aware code chunking |
476
+ | `fastmcp` | Optional | MCP server for LLM tool integration |
477
+
478
+ Install optional groups:
479
+
480
+ ```bash
481
+ pip install "vortexa[full]" # model2vec + sentence-transformers + tree-sitter
482
+ pip install "vortexa[full,mcp]" # everything including MCP server
483
+ ```
484
+
485
+ ---
486
+
487
+ <div align="center">
488
+
489
+ ## License
490
+
491
+ </div>
492
+
493
+ ```
494
+ Copyright 2025 VortexAI
495
+
496
+ Licensed under the Apache License, Version 2.0 (the "License");
497
+ you may not use this file except in compliance with the License.
498
+ You may obtain a copy of the License at
499
+
500
+ http://www.apache.org/licenses/LICENSE-2.0
501
+
502
+ Unless required by applicable law or agreed to in writing, software
503
+ distributed under the License is distributed on an "AS IS" BASIS,
504
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
505
+ See the License for the specific language governing permissions and
506
+ limitations under the License.
507
+ ```