brainbank 0.1.1 → 0.1.3
- package/README.md +123 -4
- package/dist/chunk-RAEBYV75.js +709 -0
- package/dist/chunk-RAEBYV75.js.map +1 -0
- package/dist/{chunk-YGSEUWLV.js → chunk-TW5NTYYZ.js} +43 -30
- package/dist/chunk-TW5NTYYZ.js.map +1 -0
- package/dist/cli.js +32 -2
- package/dist/cli.js.map +1 -1
- package/dist/code.js +1 -1
- package/dist/index.d.ts +43 -16
- package/dist/index.js +2 -2
- package/package.json +22 -2
- package/dist/chunk-EDKSKLX4.js +0 -490
- package/dist/chunk-EDKSKLX4.js.map +0 -1
- package/dist/chunk-YGSEUWLV.js.map +0 -1
package/README.md
CHANGED
|
@@ -27,7 +27,7 @@ Most AI memory solutions (mem0, Zep, LangMem) require cloud services, external d
|
|
|
27
27
|
|---|:---:|:---:|:---:|:---:|
|
|
28
28
|
| Infrastructure | **SQLite file** | Vector DB + cloud | Neo4j + cloud | LangGraph Platform |
|
|
29
29
|
| LLM required to write | **No**¹ | Yes | Yes | Yes |
|
|
30
|
-
| Code-aware | **
|
|
30
|
+
| Code-aware | **19 AST-parsed languages (tree-sitter), git, co-edits** | ✗ | ✗ | ✗ |
|
|
31
31
|
| Custom indexers | **`.use()` plugin system** | ✗ | ✗ | ✗ |
|
|
32
32
|
| Search | **Vector + BM25 + RRF** | Vector only | Vector + graph | Vector only |
|
|
33
33
|
| Framework lock-in | **None** | Optional | Zep cloud | LangChain |
|
|
@@ -68,6 +68,9 @@ Most AI memory solutions (mem0, Zep, LangMem) require cloud services, external d
|
|
|
68
68
|
- [Re-embedding](#re-embedding)
|
|
69
69
|
- [Architecture](#architecture)
|
|
70
70
|
- [Search Pipeline](#search-pipeline)
|
|
71
|
+
- [Benchmarks](#benchmarks)
|
|
72
|
+
- [Search Quality: AST vs Sliding Window](#search-quality-ast-vs-sliding-window)
|
|
73
|
+
- [Grammar Support](#grammar-support)
|
|
71
74
|
|
|
72
75
|
---
|
|
73
76
|
|
|
@@ -148,12 +151,15 @@ BrainBank can be used entirely from the command line — no config file needed.
|
|
|
148
151
|
|
|
149
152
|
### Indexing
|
|
150
153
|
|
|
151
|
-
`index` processes **code files + git history**
|
|
154
|
+
`index` processes **code files + git history** by default. Use `--only` to select specific modules, and `--docs` to include document collections.
|
|
152
155
|
|
|
153
156
|
```bash
|
|
154
157
|
brainbank index [path] # Index code + git history
|
|
155
158
|
brainbank index [path] --force # Force re-index everything
|
|
156
159
|
brainbank index [path] --depth 200 # Limit git commit depth
|
|
160
|
+
brainbank index [path] --only code # Index only code (skip git)
|
|
161
|
+
brainbank index [path] --only git # Index only git history
|
|
162
|
+
brainbank index [path] --docs ~/docs # Include a docs folder
|
|
157
163
|
brainbank docs [--collection <name>] # Index document collections
|
|
158
164
|
```
|
|
159
165
|
|
|
@@ -232,7 +238,7 @@ BrainBank uses pluggable indexers. Register only what you need with `.use()`:
|
|
|
232
238
|
|
|
233
239
|
| Indexer | Import | Description |
|
|
234
240
|
|---------|--------|-------------|
|
|
235
|
-
| `code` | `brainbank/code` |
|
|
241
|
+
| `code` | `brainbank/code` | AST-aware code chunking via tree-sitter (19 languages) |
|
|
236
242
|
| `git` | `brainbank/git` | Git commit history, diffs, co-edit relationships |
|
|
237
243
|
| `docs` | `brainbank/docs` | Document collections (markdown, wikis) |
|
|
238
244
|
|
|
@@ -899,6 +905,24 @@ Instances are cached in memory after first initialization, so subsequent queries
|
|
|
899
905
|
|
|
900
906
|
## Indexing
|
|
901
907
|
|
|
908
|
+
### Code Chunking (tree-sitter)
|
|
909
|
+
|
|
910
|
+
BrainBank uses **native tree-sitter** to parse source code into ASTs and extract semantic blocks (functions, classes, methods, interfaces) as individual chunks. Each embedding then covers one coherent unit of code; in the benchmark below, this raised top-1 relevance from 3/10 to 8/10 queries compared with an 80-line sliding window.
|
|
911
|
+
|
|
912
|
+
**Supported languages (AST-parsed):**
|
|
913
|
+
|
|
914
|
+
| Category | Languages |
|
|
915
|
+
|----------|-----------|
|
|
916
|
+
| Web | TypeScript, JavaScript, HTML, CSS |
|
|
917
|
+
| Systems | Go, Rust, C, C++, Swift |
|
|
918
|
+
| JVM | Java, Kotlin, Scala |
|
|
919
|
+
| Scripting | Python, Ruby, PHP, Lua, Bash, Elixir |
|
|
920
|
+
| .NET | C# |
|
|
921
|
+
|
|
922
|
+
For large classes (>80 lines), the chunker descends into the class body and extracts each method as a separate chunk. For unsupported languages, it falls back to a sliding window with overlap.
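The large-class rule described above can be sketched roughly as follows. This is a minimal illustration, not BrainBank's actual chunker: `SketchNode`, `Chunk`, and `chunkTopLevel` are hypothetical names, and the node shape is a simplified stand-in for what a tree-sitter parse would provide. Only the 80-line threshold comes from the README. Whether the real chunker also keeps a chunk for the class itself is not stated, so this sketch emits methods only.

```typescript
// Hypothetical, simplified syntax node; real tree-sitter nodes expose more fields.
interface SketchNode {
  type: string;        // e.g. "class", "method", "function"
  name: string;
  startLine: number;   // 1-based, inclusive
  endLine: number;     // 1-based, inclusive
  children: SketchNode[];
}

interface Chunk {
  type: string;
  name: string;
  startLine: number;
  endLine: number;
}

// Threshold quoted in the README: classes longer than this are split by method.
const MAX_CLASS_LINES = 80;

function lineCount(node: SketchNode): number {
  return node.endLine - node.startLine + 1;
}

function toChunk(node: SketchNode, prefix: string = ""): Chunk {
  return {
    type: node.type,
    name: prefix + node.name,
    startLine: node.startLine,
    endLine: node.endLine,
  };
}

// Emit one chunk per top-level node, except large classes, which are
// replaced by one chunk per method.
function chunkTopLevel(nodes: SketchNode[]): Chunk[] {
  const chunks: Chunk[] = [];
  for (const node of nodes) {
    const methods = node.children.filter((c) => c.type === "method");
    const isLargeClass = node.type === "class" && lineCount(node) > MAX_CLASS_LINES;
    if (isLargeClass && methods.length > 0) {
      for (const method of methods) {
        chunks.push(toChunk(method, node.name + "."));
      }
    } else {
      chunks.push(toChunk(node));
    }
  }
  return chunks;
}
```

Prefixing method names with the class name is a design choice of this sketch; it keeps chunk names unique when two classes define methods with the same name.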
|
|
923
|
+
|
|
924
|
+
> Tree-sitter grammars are **optional dependencies**. If a grammar isn't installed, that language falls back to the generic sliding window. Install only the grammars you need: `npm install tree-sitter-ruby tree-sitter-go` etc.
|
|
925
|
+
|
|
902
926
|
### Incremental Indexing
|
|
903
927
|
|
|
904
928
|
All indexing is **incremental by default** — only new or changed content is processed:
|
|
@@ -971,6 +995,101 @@ brainbank reembed
|
|
|
971
995
|
|
|
972
996
|
---
|
|
973
997
|
|
|
998
|
+
## Benchmarks
|
|
999
|
+
|
|
1000
|
+
BrainBank includes benchmark scripts to validate chunking quality and search relevance. Run them against your own codebase to see the impact.
|
|
1001
|
+
|
|
1002
|
+
### Search Quality: AST vs Sliding Window
|
|
1003
|
+
|
|
1004
|
+
We compared BrainBank's **tree-sitter AST chunker** against the traditional **sliding window** (80-line blocks) on a production NestJS backend (3,753 lines across 8 service files). Both strategies chunk the same files; all chunks are embedded and searched with the same 10 domain-specific queries.
|
|
1005
|
+
|
|
1006
|
+
#### How It Works
|
|
1007
|
+
|
|
1008
|
+
```
|
|
1009
|
+
Sliding Window Tree-Sitter AST
|
|
1010
|
+
┌────────────────────┐ ┌────────────────────┐
|
|
1011
|
+
│ import { ... } │ │ ✓ constructor() │ → named chunk
|
|
1012
|
+
│ @Injectable() │ → L1-80 block │ ✓ findAll() │ → named chunk
|
|
1013
|
+
│ class JobsService {│ │ ✓ createJob() │ → named chunk
|
|
1014
|
+
│ constructor() │ │ ✓ cancelJob() │ → named chunk
|
|
1015
|
+
│ findAll() { ... }│ │ ✓ updateStatus() │ → named chunk
|
|
1016
|
+
│ createJob() │ └────────────────────┘
|
|
1017
|
+
│ ... │
|
|
1018
|
+
│ ────────────────── │ overlaps ↕
|
|
1019
|
+
│ cancelJob() │ → L76-155 block
|
|
1020
|
+
│ updateStatus() │
|
|
1021
|
+
│ ... │
|
|
1022
|
+
└────────────────────┘
|
|
1023
|
+
```
|
|
1024
|
+
|
|
1025
|
+
**Sliding window** mixes imports, constructors, and multiple methods into one embedding. Search for "cancel a job" and you get a generic block.
|
|
1026
|
+
**AST chunking** gives each method its own embedding. Search for "cancel a job" → direct hit on `cancelJob()`.
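For comparison, a sliding-window chunker of the kind described above might look like the sketch below. The function name `slidingWindow` and the `LineBlock` type are hypothetical. The 80-line window comes from the README; the 5-line overlap is an assumption inferred from block ranges such as `L76-155` and `L451-458` in the results table, not a documented setting.

```typescript
interface LineBlock {
  startLine: number; // 1-based, inclusive
  endLine: number;   // 1-based, inclusive
  text: string;
}

// Split source into fixed-size line blocks; consecutive blocks share `overlap` lines.
function slidingWindow(source: string, windowSize: number = 80, overlap: number = 5): LineBlock[] {
  if (overlap < 0 || overlap >= windowSize) {
    throw new Error("overlap must be non-negative and smaller than windowSize");
  }
  const lines = source.split("\n");
  const step = windowSize - overlap;
  const blocks: LineBlock[] = [];
  for (let start = 0; start < lines.length; start += step) {
    const end = Math.min(start + windowSize, lines.length);
    blocks.push({
      startLine: start + 1,
      endLine: end,
      text: lines.slice(start, end).join("\n"),
    });
    // Stop once a block reaches the last line, so no block is fully
    // contained in the previous one.
    if (end === lines.length) break;
  }
  return blocks;
}
```

Because block boundaries depend only on line counts, a method that straddles a boundary ends up split across two embeddings, which is the weakness the AST chunker is meant to avoid.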
|
|
1027
|
+
|
|
1028
|
+
#### Results (Production NestJS Backend — 3,753 lines)
|
|
1029
|
+
|
|
1030
|
+
Tested with 10 domain-specific queries on 8 service files (`orders.service.ts`, `bookings.service.ts`, `notifications.service.ts`, etc.):
|
|
1031
|
+
|
|
1032
|
+
| Metric | Sliding Window | Tree-Sitter AST |
|
|
1033
|
+
|--------|:-:|:-:|
|
|
1034
|
+
| **Query Wins** | 0/10 | **8/10** (2 ties) |
|
|
1035
|
+
| **Top-1 Relevant** | 3/10 | **8/10** |
|
|
1036
|
+
| **Avg Precision@3** | 1.1/3 | **1.7/3** |
|
|
1037
|
+
| **Avg Score Delta** | — | **+0.035** |
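The table reports Precision@3 as a count out of 3 (for example `1.7/3`). One plausible way to compute such a figure is sketched below; the benchmark script's actual method is not shown in this diff, so `relevantInTopK`, `averageRelevantAtK`, and the `QueryRun` shape are hypothetical.

```typescript
interface QueryRun {
  ranked: string[];      // chunk ids in the order the search returned them
  relevant: Set<string>; // chunk ids judged relevant for this query
}

// Number of relevant results among the first k returned.
function relevantInTopK(ranked: string[], relevant: Set<string>, k: number): number {
  return ranked.slice(0, k).filter((id) => relevant.has(id)).length;
}

// Mean number of relevant results in the top k across all queries.
// Divide by k to get precision as a fraction instead of a count.
function averageRelevantAtK(runs: QueryRun[], k: number): number {
  if (runs.length === 0) return 0;
  const total = runs.reduce((sum, run) => sum + relevantInTopK(run.ranked, run.relevant, k), 0);
  return total / runs.length;
}
```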
|
|
1038
|
+
|
|
1039
|
+
#### Per-Query Breakdown
|
|
1040
|
+
|
|
1041
|
+
| Query | SW Top Result | AST Top Result | Δ Score |
|
|
1042
|
+
|-------|:---:|:---:|:---:|
|
|
1043
|
+
| cancel an order | generic `L451-458` | **`updateOrderStatus`** | +0.005 |
|
|
1044
|
+
| create a booking | generic `L451-458` | **`createInstantBooking`** | +0.068 |
|
|
1045
|
+
| confirm booking | generic `L451-458` | **`confirm`** | +0.034 |
|
|
1046
|
+
| send notification | generic `L226-305` | **`publishNotificationEvent`** | +0.034 |
|
|
1047
|
+
| authenticate JWT | generic `L1-80` | **`AuthModule`** | +0.032 |
|
|
1048
|
+
| tenant DB connection | `L76-155` | **`onModuleDestroy`** | +0.037 |
|
|
1049
|
+
| list orders paginated | `L76-155` | **`findAllActive`** | +0.045 |
|
|
1050
|
+
| reject booking | generic `L451-458` | **`reject`** | +0.090 |
|
|
1051
|
+
|
|
1052
|
+
> Notice how the sliding window returns the **same generic block `L451-458`** for 4 different queries. The AST chunker returns a different, correctly named method each time.
|
|
1053
|
+
|
|
1054
|
+
#### Chunk Quality Comparison
|
|
1055
|
+
|
|
1056
|
+
| | Sliding Window | Tree-Sitter AST |
|
|
1057
|
+
|---|:-:|:-:|
|
|
1058
|
+
| Total chunks | 53 | **83** |
|
|
1059
|
+
| Avg lines/chunk | 75 | **39** |
|
|
1060
|
+
| Named chunks | 0 | **83** (100%) |
|
|
1061
|
+
| Chunk types | `block` | `method`, `interface`, `class` |
|
|
1062
|
+
|
|
1063
|
+
### Grammar Support
|
|
1064
|
+
|
|
1065
|
+
All 9 core grammars verified, each parsing in **≤0.05ms**:
|
|
1066
|
+
|
|
1067
|
+
| Language | AST Nodes Extracted | Parse Time |
|
|
1068
|
+
|----------|:---:|:---:|
|
|
1069
|
+
| TypeScript | `export_statement`, `interface_declaration` | 0.04ms |
|
|
1070
|
+
| JavaScript | `function_declaration` × 3 | 0.04ms |
|
|
1071
|
+
| Python | `class_definition`, `function_definition` × 2 | 0.03ms |
|
|
1072
|
+
| Go | `function_declaration`, `method_declaration` × 3 | 0.04ms |
|
|
1073
|
+
| Rust | `struct_item`, `impl_item`, `function_item` | 0.03ms |
|
|
1074
|
+
| Ruby | `class`, `method` | 0.03ms |
|
|
1075
|
+
| Java | `class_declaration` | 0.02ms |
|
|
1076
|
+
| C | `function_definition` × 3 | 0.05ms |
|
|
1077
|
+
| PHP | `class_declaration` | 0.03ms |
|
|
1078
|
+
|
|
1079
|
+
> Additional grammars available: C++, Swift, C#, Kotlin, Scala, Lua, Elixir, Bash, HTML, CSS
|
|
1080
|
+
|
|
1081
|
+
### Running Benchmarks
|
|
1082
|
+
|
|
1083
|
+
```bash
|
|
1084
|
+
# Grammar support (9 languages, parse speed)
|
|
1085
|
+
node test/benchmarks/grammar-support.mjs
|
|
1086
|
+
|
|
1087
|
+
# Search quality A/B (uses BrainBank's own source files)
|
|
1088
|
+
node test/benchmarks/search-quality.mjs
|
|
1089
|
+
```
|
|
1090
|
+
|
|
1091
|
+
---
|
|
1092
|
+
|
|
974
1093
|
## Architecture
|
|
975
1094
|
|
|
976
1095
|
<details>
|
|
@@ -1035,7 +1154,7 @@ Final results (sorted by blended score)
|
|
|
1035
1154
|
|
|
1036
1155
|
### Data Flow
|
|
1037
1156
|
|
|
1038
|
-
1. **Index** — Indexers parse files into chunks
|
|
1157
|
+
1. **Index** — Indexers parse files into chunks (tree-sitter AST for code, heading-based for docs)
|
|
1039
1158
|
2. **Embed** — Each chunk gets a vector (local WASM or OpenAI)
|
|
1040
1159
|
3. **Store** — Chunks + vectors → SQLite, vectors → HNSW index
|
|
1041
1160
|
4. **Search** — Query → HNSW k-NN + BM25 keyword → RRF fusion → optional reranker
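The RRF fusion in step 4 can be illustrated with a generic reciprocal rank fusion sketch. This is the standard formula (each list contributes `1 / (k + rank)` per document), not BrainBank's own code: the function name `reciprocalRankFusion` is hypothetical, and `k = 60` is the conventional default rather than a value confirmed by this README.

```typescript
interface FusedResult {
  id: string;
  score: number;
}

// Fuse several ranked lists of ids (best first) into one ranking.
// A document's fused score is the sum of 1 / (k + rank) over every list it appears in.
function reciprocalRankFusion(rankings: string[][], k: number = 60): FusedResult[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      const previous = scores.get(id) || 0;
      scores.set(id, previous + 1 / (k + rank));
    });
  }
  const fused: FusedResult[] = [];
  scores.forEach((score, id) => fused.push({ id, score }));
  // Highest fused score first.
  return fused.sort((a, b) => b.score - a.score);
}

// Usage: fuse a vector (k-NN) ranking with a BM25 keyword ranking.
const vectorHits = ["a", "b", "c"];
const keywordHits = ["b", "d", "a"];
const fusedHits = reciprocalRankFusion([vectorHits, keywordHits]);
```

RRF works on ranks only, so it needs no normalization between cosine similarities and BM25 scores, which are on different scales.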
|