codebaxing 0.1.0

Files changed (52)
  1. package/LICENSE +21 -0
  2. package/README.md +402 -0
  3. package/README.vi.md +402 -0
  4. package/dist/core/exceptions.d.ts +25 -0
  5. package/dist/core/exceptions.d.ts.map +1 -0
  6. package/dist/core/exceptions.js +46 -0
  7. package/dist/core/exceptions.js.map +1 -0
  8. package/dist/core/interfaces.d.ts +13 -0
  9. package/dist/core/interfaces.d.ts.map +1 -0
  10. package/dist/core/interfaces.js +5 -0
  11. package/dist/core/interfaces.js.map +1 -0
  12. package/dist/core/models.d.ts +132 -0
  13. package/dist/core/models.d.ts.map +1 -0
  14. package/dist/core/models.js +303 -0
  15. package/dist/core/models.js.map +1 -0
  16. package/dist/index.d.ts +17 -0
  17. package/dist/index.d.ts.map +1 -0
  18. package/dist/index.js +20 -0
  19. package/dist/index.js.map +1 -0
  20. package/dist/indexing/embedding-service.d.ts +66 -0
  21. package/dist/indexing/embedding-service.d.ts.map +1 -0
  22. package/dist/indexing/embedding-service.js +271 -0
  23. package/dist/indexing/embedding-service.js.map +1 -0
  24. package/dist/indexing/memory-retriever.d.ts +58 -0
  25. package/dist/indexing/memory-retriever.d.ts.map +1 -0
  26. package/dist/indexing/memory-retriever.js +327 -0
  27. package/dist/indexing/memory-retriever.js.map +1 -0
  28. package/dist/indexing/parallel-indexer.d.ts +36 -0
  29. package/dist/indexing/parallel-indexer.d.ts.map +1 -0
  30. package/dist/indexing/parallel-indexer.js +67 -0
  31. package/dist/indexing/parallel-indexer.js.map +1 -0
  32. package/dist/indexing/source-retriever.d.ts +66 -0
  33. package/dist/indexing/source-retriever.d.ts.map +1 -0
  34. package/dist/indexing/source-retriever.js +420 -0
  35. package/dist/indexing/source-retriever.js.map +1 -0
  36. package/dist/mcp/server.d.ts +16 -0
  37. package/dist/mcp/server.d.ts.map +1 -0
  38. package/dist/mcp/server.js +370 -0
  39. package/dist/mcp/server.js.map +1 -0
  40. package/dist/mcp/state.d.ts +26 -0
  41. package/dist/mcp/state.d.ts.map +1 -0
  42. package/dist/mcp/state.js +154 -0
  43. package/dist/mcp/state.js.map +1 -0
  44. package/dist/parsers/language-configs.d.ts +26 -0
  45. package/dist/parsers/language-configs.d.ts.map +1 -0
  46. package/dist/parsers/language-configs.js +422 -0
  47. package/dist/parsers/language-configs.js.map +1 -0
  48. package/dist/parsers/treesitter-parser.d.ts +37 -0
  49. package/dist/parsers/treesitter-parser.d.ts.map +1 -0
  50. package/dist/parsers/treesitter-parser.js +602 -0
  51. package/dist/parsers/treesitter-parser.js.map +1 -0
  52. package/package.json +91 -0
package/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2024 Street Devs
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
package/README.md ADDED
@@ -0,0 +1,402 @@
1
+ # Codebaxing
2
+
3
+ MCP server for **semantic code search**. Index your codebase and search using natural language queries.
4
+
5
+ ## The Idea
6
+
7
+ Traditional code search (grep, ripgrep) matches exact text. But developers think in concepts:
8
+
9
+ - *"Where is the authentication logic?"* - not `grep "authentication"`
10
+ - *"Find database connection code"* - not `grep "database"`
11
+ - *"How does error handling work?"* - not `grep "error"`
12
+
13
+ **Codebaxing** bridges this gap using **semantic search**:
14
+
15
+ ```
16
+ Query: "user authentication"
17
+
18
+ Finds: login(), validateCredentials(), checkPassword(), authMiddleware()
19
+ (even if they don't contain the word "authentication")
20
+ ```
21
+
22
+ ## How It Works
23
+
24
+ ### Architecture Overview
25
+
26
+ ```
27
+ ┌─────────────────────────────────────────────────────────────────┐
28
+ │ INDEXING │
29
+ ├─────────────────────────────────────────────────────────────────┤
30
+ │ │
31
+ │ Source Files (.py, .ts, .js, .go, .rs, ...) │
32
+ │ │ │
33
+ │ ▼ │
34
+ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
35
+ │ │ Tree-sitter │───▶│ Symbols │───▶│ Embedding │ │
36
+ │ │ Parser │ │ Extraction │ │ Model │ │
37
+ │ └──────────────┘ └──────────────┘ └──────────────┘ │
38
+ │ │ │ │ │
39
+ │ Parse AST Functions, Text → Vector │
40
+ │ Classes, etc. (384 dimensions) │
41
+ │ │ │
42
+ │ ▼ │
43
+ │ ┌──────────────┐ │
44
+ │ │ ChromaDB │ │
45
+ │ │ (vectors) │ │
46
+ │ └──────────────┘ │
47
+ │ │
48
+ └─────────────────────────────────────────────────────────────────┘
49
+
50
+ ┌─────────────────────────────────────────────────────────────────┐
51
+ │ SEARCH │
52
+ ├─────────────────────────────────────────────────────────────────┤
53
+ │ │
54
+ │ "find auth code" │
55
+ │ │ │
56
+ │ ▼ │
57
+ │ ┌──────────────┐ ┌──────────────┐ │
58
+ │ │ Embedding │────────▶│ ChromaDB │ │
59
+ │ │ Model │ query │ Query │ │
60
+ │ └──────────────┘ vector └──────────────┘ │
61
+ │ │ │
62
+ │ ▼ │
63
+ │ Cosine Similarity │
64
+ │ │ │
65
+ │ ▼ │
66
+ │ Top-k Results │
67
+ │ (login.py, auth.ts, ...) │
68
+ │ │
69
+ └─────────────────────────────────────────────────────────────────┘
70
+ ```
71
+
72
+ ### Step-by-Step Process
73
+
74
+ #### 1. Parsing (Tree-sitter)
75
+
76
+ Tree-sitter parses source code into an Abstract Syntax Tree (AST), extracting meaningful symbols:
77
+
78
+ ```python
79
+ # Input: auth.py
80
+ def login(username, password):
81
+ """Authenticate user credentials"""
82
+ if validate(username, password):
83
+ return create_session(username)
84
+ raise AuthError("Invalid credentials")
85
+ ```
86
+
87
+ ```
88
+ # Output: Symbol
89
+ {
90
+ name: "login",
91
+ type: "function",
92
+ signature: "def login(username, password)",
93
+ code: "def login(username, password):...",
94
+ filepath: "auth.py",
95
+ lineStart: 1,
96
+ lineEnd: 6
97
+ }
98
+ ```
99
+
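As a rough illustration of the parsing output above, the symbol record could be typed and flattened into embedding input as follows. This is a sketch, not the package's code: `CodeSymbol` and `toEmbeddingText` are hypothetical names, and only the field names come from the example output.

```typescript
// Hypothetical shape mirroring the fields of the example symbol output.
interface CodeSymbol {
  name: string;
  type: string;
  signature: string;
  code: string;
  filepath: string;
  lineStart: number;
  lineEnd: number;
}

// Assumed helper: joins the parts of a symbol into one string
// that could be handed to an embedding model.
function toEmbeddingText(symbol: CodeSymbol): string {
  return [symbol.type + " " + symbol.name, symbol.signature, symbol.code].join("\n");
}

const loginSymbol: CodeSymbol = {
  name: "login",
  type: "function",
  signature: "def login(username, password)",
  code: 'def login(username, password):\n    """Authenticate user credentials"""',
  filepath: "auth.py",
  lineStart: 1,
  lineEnd: 6,
};

const embeddingText = toEmbeddingText(loginSymbol);
```

Putting the symbol type, name, and signature ahead of the body is a design choice assumed here so that short queries such as "login function" have matching text near the start of the chunk.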
100
+ #### 2. Embedding (all-MiniLM-L6-v2)
101
+
102
+ Each code chunk is converted to a 384-dimensional vector using a neural network:
103
+
104
+ ```
105
+ "def login(username, password): authenticate user..."
106
+
107
+ Embedding Model (runs locally, ONNX)
108
+
109
+ [0.12, -0.34, 0.56, 0.08, ..., -0.22] (384 numbers)
110
+ ```
111
+
112
+ The model understands semantic relationships:
113
+ - `"authentication"` ≈ `"login"` ≈ `"credentials"` (vectors are close)
114
+ - `"database"` ≈ `"query"` ≈ `"SQL"` (vectors are close)
115
+ - `"authentication"` ≠ `"database"` (vectors are far apart)
116
+
117
+ #### 3. Storage (ChromaDB)
118
+
119
+ Vectors are stored in ChromaDB, a vector database optimized for similarity search:
120
+
121
+ ```
122
+ ChromaDB Collection:
123
+ ┌─────────────────────────────────────────────────────┐
124
+ │ ID │ Vector (384d) │ Metadata │
125
+ ├─────────────────────────────────────────────────────┤
126
+ │ chunk_001 │ [0.12, -0.34, ...] │ {file: auth.py} │
127
+ │ chunk_002 │ [0.45, 0.23, ...] │ {file: db.py} │
128
+ │ chunk_003 │ [-0.11, 0.67, ...] │ {file: api.ts} │
129
+ │ ... │ ... │ ... │
130
+ └─────────────────────────────────────────────────────┘
131
+ ```
132
+
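To make the collection layout above concrete, here is a minimal in-memory stand-in. It is an illustration only and does not use ChromaDB's client API: `InMemoryCollection`, `VectorRecord`, and `deleteByFile` are hypothetical names, and the vectors are shortened to 3 dimensions for readability.

```typescript
// Hypothetical record shape: one row of the collection table.
interface VectorRecord {
  id: string;
  vector: number[];
  metadata: Record<string, string>;
}

// Illustrative in-memory stand-in for a vector collection.
class InMemoryCollection {
  private readonly records = new Map<string, VectorRecord>();

  constructor(private readonly dimensions: number) {}

  // Insert or overwrite a record, rejecting vectors of the wrong length.
  upsert(record: VectorRecord): void {
    if (record.vector.length !== this.dimensions) {
      throw new Error(
        "expected " + this.dimensions + " dimensions, got " + record.vector.length
      );
    }
    this.records.set(record.id, record);
  }

  get(id: string): VectorRecord | undefined {
    return this.records.get(id);
  }

  // Remove every record that came from the given file, e.g. before re-indexing it.
  deleteByFile(file: string): number {
    let removed = 0;
    this.records.forEach((record, id) => {
      if (record.metadata.file === file) {
        this.records.delete(id);
        removed++;
      }
    });
    return removed;
  }

  get size(): number {
    return this.records.size;
  }
}

const collection = new InMemoryCollection(3); // the README's model emits 384 dimensions
collection.upsert({ id: "chunk_001", vector: [0.12, -0.34, 0.56], metadata: { file: "auth.py" } });
collection.upsert({ id: "chunk_002", vector: [0.45, 0.23, -0.1], metadata: { file: "db.py" } });
```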
133
+ #### 4. Search (Cosine Similarity)
134
+
135
+ When you search, your query is embedded and compared against all stored vectors:
136
+
137
+ ```
138
+ Query: "user authentication"
139
+
140
+ Query Vector: [0.15, -0.31, 0.52, ...]
141
+
142
+ Compare with all vectors using cosine similarity:
143
+ - chunk_001 (login): similarity = 0.89 ← HIGH
144
+ - chunk_002 (db): similarity = 0.23 ← LOW
145
+ - chunk_003 (auth): similarity = 0.85 ← HIGH
146
+
147
+ Return top-k most similar chunks
148
+ ```
149
+
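The comparison step above can be sketched in a few lines. This is a brute-force illustration under stated assumptions: `cosineSimilarity` and `topK` are hypothetical helpers, the vectors are toy 3-dimensional values, and a real vector database would normally use an index rather than scanning every vector.

```typescript
interface StoredChunk {
  id: string;
  vector: number[];
}

// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 for zero-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same number of dimensions");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every stored chunk against the query and keep the k best.
function topK(query: number[], chunks: StoredChunk[], k: number): { id: string; score: number }[] {
  return chunks
    .map((chunk) => ({ id: chunk.id, score: cosineSimilarity(query, chunk.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

const storedChunks: StoredChunk[] = [
  { id: "chunk_001", vector: [0.12, -0.34, 0.56] },
  { id: "chunk_002", vector: [0.45, 0.23, -0.1] },
  { id: "chunk_003", vector: [0.2, -0.25, 0.4] },
];

const queryVector = [0.15, -0.31, 0.52];
const best = topK(queryVector, storedChunks, 2);
```

Cosine similarity ignores vector length and compares direction only, which is why it is a common choice for comparing text embeddings.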
150
+ ### Why Semantic Search Works
151
+
152
+ The embedding model was trained on millions of text pairs and learned associations such as:
153
+
154
+ | Concept A | ≈ Similar To | Distance |
155
+ |-----------|--------------|----------|
156
+ | authentication | login, credentials, auth, signin | Close |
157
+ | database | query, SQL, connection, ORM | Close |
158
+ | error | exception, failure, catch, throw | Close |
159
+ | parse | tokenize, lexer, AST, syntax | Close |
160
+
161
+ This allows finding code by **meaning**, not just keywords.
162
+
163
+ ## Features
164
+
165
+ - **Semantic Code Search**: Find code by describing what you're looking for
166
+ - **24+ Languages**: Python, TypeScript, JavaScript, Go, Rust, Java, C/C++, and more
167
+ - **Memory Layer**: Store and recall project context across sessions
168
+ - **Incremental Indexing**: Only re-index changed files
169
+ - **100% Local**: No API calls, no cloud, works offline
170
+ - **GPU Acceleration**: Optional WebGPU/CUDA support
171
+
172
+ ## Requirements
173
+
174
+ - Node.js >= 20.0.0
175
+ - ~500MB disk space for the embedding model (downloaded on first run)
176
+
177
+ ## Installation
178
+
179
+ ### Option 1: Via npx (Recommended)
180
+
181
+ No installation needed: `npx` downloads and runs the package on demand, so you only need to configure Claude Desktop.
182
+
183
+ ### Option 2: Via npm (Global install)
184
+
185
+ ```bash
186
+ npm install -g codebaxing
187
+ ```
188
+
189
+ ### Option 3: Clone from source
190
+
191
+ ```bash
192
+ git clone https://github.com/duysolo/codebaxing.git
193
+ cd codebaxing
194
+ npm install
195
+ npm run build
196
+ ```
197
+
198
+ ### (Optional) Set up persistent storage
199
+
200
+ By default, the index is stored in memory and lost when the server restarts.
201
+
202
+ For persistent storage, run ChromaDB:
203
+
204
+ ```bash
205
+ # Using Docker (recommended)
206
+ docker run -d -p 8000:8000 chromadb/chroma
207
+
208
+ # Set environment variable
209
+ export CHROMADB_URL=http://localhost:8000
210
+ ```
211
+
212
+ ### Configure Claude Desktop
213
+
214
+ Add to your Claude Desktop config file:
215
+
216
+ **macOS**: `~/Library/Application Support/Claude/claude_desktop_config.json`
217
+ **Windows**: `%APPDATA%\Claude\claude_desktop_config.json`
218
+
219
+ #### Via npx (no install needed):
220
+
221
+ ```json
222
+ {
223
+ "mcpServers": {
224
+ "codebaxing": {
225
+ "command": "npx",
226
+ "args": ["-y", "codebaxing"]
227
+ }
228
+ }
229
+ }
230
+ ```
231
+
232
+ #### Via global install:
233
+
234
+ ```bash
235
+ npm install -g codebaxing
236
+ ```
237
+
238
+ ```json
239
+ {
240
+ "mcpServers": {
241
+ "codebaxing": {
242
+ "command": "codebaxing"
243
+ }
244
+ }
245
+ }
246
+ ```
247
+
248
+ #### With persistent storage (ChromaDB):
249
+
250
+ ```json
251
+ {
252
+ "mcpServers": {
253
+ "codebaxing": {
254
+ "command": "npx",
255
+ "args": ["-y", "codebaxing"],
256
+ "env": {
257
+ "CHROMADB_URL": "http://localhost:8000"
258
+ }
259
+ }
260
+ }
261
+ }
262
+ ```
263
+
264
+ #### From source (development):
265
+
266
+ ```json
267
+ {
268
+ "mcpServers": {
269
+ "codebaxing": {
270
+ "command": "node",
271
+ "args": ["/path/to/codebaxing/dist/mcp/server.js"]
272
+ }
273
+ }
274
+ }
275
+ ```
276
+
277
+ ### Restart Claude Desktop
278
+
279
+ The Codebaxing tools will now be available in Claude.
280
+
281
+ ## Usage
282
+
283
+ ### MCP Tools
284
+
285
+ | Tool | Description |
286
+ |------|-------------|
287
+ | `index` | Index a codebase. Modes: `auto` (incremental), `full`, `load-only` |
288
+ | `search` | Semantic search. Returns ranked code chunks |
289
+ | `stats` | Index statistics (files, symbols, chunks) |
290
+ | `languages` | List supported file extensions |
291
+ | `remember` | Store memories (conversation, status, decision, preference, doc, note) |
292
+ | `recall` | Semantic search over memories |
293
+ | `forget` | Delete memories by ID, type, tags, or age |
294
+ | `memory-stats` | Memory statistics by type |
295
+
296
+ ### Example Workflow
297
+
298
+ 1. **Index your codebase:**
299
+ ```
300
+ index(path="/path/to/your/project")
301
+ ```
302
+
303
+ 2. **Search for code:**
304
+ ```
305
+ search(question="authentication middleware")
306
+ search(question="database connection", language="typescript")
307
+ search(question="error handling", symbol_type="function")
308
+ ```
309
+
310
+ 3. **Store context:**
311
+ ```
312
+ remember(content="Using PostgreSQL with Prisma ORM", memory_type="decision")
313
+ remember(content="Auth uses JWT tokens", memory_type="doc", tags=["auth", "security"])
314
+ ```
315
+
316
+ 4. **Recall context:**
317
+ ```
318
+ recall(query="database setup")
319
+ recall(query="authentication", memory_type="decision")
320
+ ```
321
+
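The workflow calls above are written in shorthand. An MCP client sends each one to the server as a JSON-RPC 2.0 `tools/call` request. The sketch below shows roughly what such a message could look like for `search`. `buildToolCall` is a hypothetical helper, and the argument names (`question`, `language`) are taken from the examples above, not from the server's schema.

```typescript
// Shape of an MCP tool invocation as a JSON-RPC 2.0 request.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Hypothetical helper that wraps a tool name and its arguments in a request.
function buildToolCall(
  id: number,
  toolName: string,
  args: Record<string, unknown>
): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: toolName, arguments: args },
  };
}

const searchRequest = buildToolCall(1, "search", {
  question: "database connection",
  language: "typescript",
});

// Serialized form that a client would write to the server's transport.
const wireMessage = JSON.stringify(searchRequest);
```

In normal use Claude Desktop builds these messages itself, so the sketch is mainly useful for debugging or for writing a custom client.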
322
+ ## Supported Languages
323
+
324
+ Python, JavaScript, TypeScript, C, C++, Bash, Go, Java, Kotlin, Rust, Ruby, C#, PHP, Scala, Swift, Lua, Dart, Elixir, Haskell, OCaml, Zig, Perl, CSS, HTML, Vue, JSON, YAML, TOML, Makefile
325
+
326
+ ## Configuration
327
+
328
+ ### Environment Variables
329
+
330
+ | Variable | Description | Default |
331
+ |----------|-------------|---------|
332
+ | `CHROMADB_URL` | ChromaDB server URL for persistent storage | (in-memory) |
333
+ | `CODEBAXING_DEVICE` | Compute device for embeddings | `cpu` |
334
+
335
+ ### GPU Acceleration
336
+
337
+ Enable GPU for faster embedding generation:
338
+
339
+ ```bash
340
+ # WebGPU (experimental, uses Metal on macOS)
341
+ export CODEBAXING_DEVICE=webgpu
342
+
343
+ # Auto-detect best device
344
+ export CODEBAXING_DEVICE=auto
345
+
346
+ # NVIDIA GPU (Linux/Windows only, requires CUDA)
347
+ export CODEBAXING_DEVICE=cuda
348
+ ```
349
+
350
+ Default is `cpu` which works everywhere.
351
+
352
+ **Note:** macOS does not support CUDA (no NVIDIA drivers). Use `webgpu` for GPU acceleration on Mac.
353
+
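The two environment variables documented above could be resolved into a typed configuration along these lines. This is a sketch, not the package's code: `resolveConfig` and `RuntimeConfig` are hypothetical names, and falling back to `cpu` for an unrecognized device value is an assumption.

```typescript
type Device = "cpu" | "webgpu" | "cuda" | "auto";

// Device values listed in the README.
const KNOWN_DEVICES: readonly Device[] = ["cpu", "webgpu", "cuda", "auto"];

interface RuntimeConfig {
  chromaUrl: string | null; // null means: keep the index in memory
  device: Device;
}

// Takes the environment as a plain object so the function is easy to test.
function resolveConfig(env: Record<string, string | undefined>): RuntimeConfig {
  const rawDevice = env.CODEBAXING_DEVICE || "cpu";
  // Assumption: unknown values fall back to the documented default, "cpu".
  const device: Device =
    KNOWN_DEVICES.indexOf(rawDevice as Device) >= 0 ? (rawDevice as Device) : "cpu";

  const rawUrl = (env.CHROMADB_URL || "").trim();
  return { chromaUrl: rawUrl.length > 0 ? rawUrl : null, device };
}

const defaults = resolveConfig({});
const persistentGpu = resolveConfig({
  CHROMADB_URL: "http://localhost:8000",
  CODEBAXING_DEVICE: "webgpu",
});
```

In a Node.js process the function would typically be called as `resolveConfig(process.env)`.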
354
+ ### Storage
355
+
356
+ Metadata is stored in the `.codebaxing/` folder within your project:
357
+ - `metadata.json` - Index metadata and file timestamps
358
+
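Stored file timestamps are what make incremental indexing possible. The sketch below shows one way to decide which files need re-indexing by comparing a saved snapshot with the current state. The layout of `metadata.json` is not shown in the README, so the `Timestamps` shape and `planIncrementalIndex` are assumptions made for illustration.

```typescript
// Hypothetical snapshot: file path -> last-modified time in milliseconds.
type Timestamps = Record<string, number>;

interface IndexPlan {
  changed: string[]; // new or modified files to (re-)index
  removed: string[]; // files whose chunks should be dropped
}

function planIncrementalIndex(previous: Timestamps, current: Timestamps): IndexPlan {
  // A file is "changed" if it is new (no previous entry) or its timestamp differs.
  const changed = Object.keys(current)
    .filter((file) => previous[file] !== current[file])
    .sort();
  // A file is "removed" if it was indexed before but no longer exists.
  const removed = Object.keys(previous)
    .filter((file) => !(file in current))
    .sort();
  return { changed, removed };
}

const previousRun: Timestamps = { "auth.py": 1000, "db.py": 2000, "old.py": 3000 };
const currentRun: Timestamps = { "auth.py": 1000, "db.py": 2500, "api.ts": 4000 };

const plan = planIncrementalIndex(previousRun, currentRun);
// plan.changed is ["api.ts", "db.py"]; plan.removed is ["old.py"]
```

Comparing timestamps is cheap but can miss edits that preserve the modification time; hashing file contents is a common, slower alternative.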
359
+ ## Development
360
+
361
+ ```bash
362
+ npm run dev # Run with tsx (no build needed)
363
+ npm run build # Compile TypeScript
364
+ npm start # Run compiled version
365
+ npm test # Run tests
366
+ npm run typecheck # Type check without emitting
367
+ ```
368
+
369
+ ### Testing
370
+
371
+ ```bash
372
+ # Run unit tests
373
+ npm test
374
+
375
+ # Test indexing manually
376
+ CHROMADB_URL=http://localhost:8000 npx tsx test-indexing.ts
377
+ ```
378
+
379
+ ## Comparison: Grep vs Semantic Search
380
+
381
+ | Aspect | Grep | Semantic Search |
382
+ |--------|------|-----------------|
383
+ | Query | Exact text match | Natural language |
384
+ | "authentication" | Only finds "authentication" | Finds login, auth, credentials, etc. |
385
+ | Understands context | No | Yes |
386
+ | Finds synonyms | No | Yes |
387
+ | Speed | Very fast | Fast (after indexing) |
388
+ | Setup | None | Requires indexing |
389
+
390
+ ## Technical Details
391
+
392
+ | Component | Technology |
393
+ |-----------|------------|
394
+ | Embedding Model | `all-MiniLM-L6-v2` (384 dimensions) |
395
+ | Model Runtime | `@huggingface/transformers` (ONNX) |
396
+ | Vector Database | ChromaDB |
397
+ | Code Parser | Tree-sitter |
398
+ | MCP SDK | `@modelcontextprotocol/sdk` |
399
+
400
+ ## License
401
+
402
+ MIT