brainbank 0.1.0 → 0.1.3

package/README.md CHANGED
@@ -27,7 +27,7 @@ Most AI memory solutions (mem0, Zep, LangMem) require cloud services, external d
  |---|:---:|:---:|:---:|:---:|
  | Infrastructure | **SQLite file** | Vector DB + cloud | Neo4j + cloud | LangGraph Platform |
  | LLM required to write | **No**¹ | Yes | Yes | Yes |
- | Code-aware | **30+ languages, git, co-edits** | ✗ | ✗ | ✗ |
+ | Code-aware | **19 AST-parsed languages (tree-sitter), git, co-edits** | ✗ | ✗ | ✗ |
  | Custom indexers | **`.use()` plugin system** | ✗ | ✗ | ✗ |
  | Search | **Vector + BM25 + RRF** | Vector only | Vector + graph | Vector only |
  | Framework lock-in | **None** | Optional | Zep cloud | LangChain |
@@ -68,6 +68,9 @@ Most AI memory solutions (mem0, Zep, LangMem) require cloud services, external d
  - [Re-embedding](#re-embedding)
  - [Architecture](#architecture)
  - [Search Pipeline](#search-pipeline)
+ - [Benchmarks](#benchmarks)
+ - [Search Quality: AST vs Sliding Window](#search-quality-ast-vs-sliding-window)
+ - [Grammar Support](#grammar-support)
 
  ---
 
@@ -148,12 +151,15 @@ BrainBank can be used entirely from the command line — no config file needed.
 
  ### Indexing
 
- `index` processes **code files + git history** only. Document collections are indexed separately with `docs`.
+ `index` processes **code files + git history** by default. Use `--only` to select specific modules, and `--docs` to include document collections.
 
  ```bash
  brainbank index [path] # Index code + git history
  brainbank index [path] --force # Force re-index everything
  brainbank index [path] --depth 200 # Limit git commit depth
+ brainbank index [path] --only code # Index only code (skip git)
+ brainbank index [path] --only git # Index only git history
+ brainbank index [path] --docs ~/docs # Include a docs folder
  brainbank docs [--collection <name>] # Index document collections
  ```
 
@@ -232,7 +238,7 @@ BrainBank uses pluggable indexers. Register only what you need with `.use()`:
 
  | Indexer | Import | Description |
  |---------|--------|-------------|
- | `code` | `brainbank/code` | Language-aware code chunking (30+ languages) |
+ | `code` | `brainbank/code` | AST-aware code chunking via tree-sitter (19 languages) |
  | `git` | `brainbank/git` | Git commit history, diffs, co-edit relationships |
  | `docs` | `brainbank/docs` | Document collections (markdown, wikis) |
 
@@ -899,6 +905,24 @@ Instances are cached in memory after first initialization, so subsequent queries
 
  ## Indexing
 
+ ### Code Chunking (tree-sitter)
+
+ BrainBank uses **native tree-sitter** to parse source code into ASTs and extract semantic blocks — functions, classes, methods, interfaces — as individual chunks. This produces dramatically better embeddings than naive line-based splitting.
+
+ **Supported languages (AST-parsed):**
+
+ | Category | Languages |
+ |----------|-----------|
+ | Web | TypeScript, JavaScript, HTML, CSS |
+ | Systems | Go, Rust, C, C++, Swift |
+ | JVM | Java, Kotlin, Scala |
+ | Scripting | Python, Ruby, PHP, Lua, Bash, Elixir |
+ | .NET | C# |
+
+ For large classes (>80 lines), the chunker descends into the class body and extracts each method as a separate chunk. For unsupported languages, it falls back to a sliding window with overlap.
+
+ > Tree-sitter grammars are **optional dependencies**. If a grammar isn't installed, that language falls back to the generic sliding window. Install only the grammars you need: `npm install tree-sitter-ruby tree-sitter-go` etc.
+
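Reviewer sketch (not part of the package diff): the sliding-window fallback described in the added section can be illustrated as below. This is a minimal sketch under stated assumptions, not BrainBank's actual chunker. The function name `slidingWindowChunks`, the chunk object shape, and the 5-line overlap are all hypothetical. The 80-line window is taken from the benchmark description, and a 5-line overlap is one setting that is consistent with the `L1-80`, `L76-155`, `L226-305`, and `L451-458` ranges reported in the per-query table.

```javascript
// Hypothetical helper, NOT BrainBank's real implementation.
// Splits source text into fixed-size line windows that overlap, so code
// near a window boundary also appears at the start of the next chunk.
function slidingWindowChunks(source, windowSize = 80, overlap = 5) {
  if (overlap < 0 || overlap >= windowSize) {
    throw new RangeError("overlap must be >= 0 and smaller than windowSize");
  }
  const lines = source.split("\n");
  const step = windowSize - overlap; // distance between window starts
  const chunks = [];
  for (let start = 0; start < lines.length; start += step) {
    const end = Math.min(start + windowSize, lines.length);
    chunks.push({
      type: "block",        // anonymous chunk: no function or class name
      startLine: start + 1, // 1-based, inclusive
      endLine: end,         // 1-based, inclusive
      content: lines.slice(start, end).join("\n"),
    });
    if (end === lines.length) break; // this window reached the end of the file
  }
  return chunks;
}
```

Every chunk produced this way has type `block` and no name, which is why the sliding-window column in a comparison can only report line ranges.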
  ### Incremental Indexing
 
  All indexing is **incremental by default** — only new or changed content is processed:
@@ -971,6 +995,101 @@ brainbank reembed
 
  ---
 
+ ## Benchmarks
+
+ BrainBank includes benchmark scripts to validate chunking quality and search relevance. Run them against your own codebase to see the impact.
+
+ ### Search Quality: AST vs Sliding Window
+
+ We compared BrainBank's **tree-sitter AST chunker** against the traditional **sliding window** (80-line blocks) on a production NestJS backend (3,753 lines across 8 service files). Both strategies chunk the same files; all chunks are embedded and searched with the same 10 domain-specific queries.
+
+ #### How It Works
+
+ ```
+ Sliding Window                         Tree-Sitter AST
+ ┌────────────────────┐                 ┌────────────────────┐
+ │ import { ... }     │                 │ ✓ constructor()    │ → named chunk
+ │ @Injectable()      │ → L1-80 block   │ ✓ findAll()        │ → named chunk
+ │ class JobsService {│                 │ ✓ createJob()      │ → named chunk
+ │   constructor()    │                 │ ✓ cancelJob()      │ → named chunk
+ │   findAll() { ... }│                 │ ✓ updateStatus()   │ → named chunk
+ │   createJob()      │                 └────────────────────┘
+ │   ...              │
+ │ ────────────────── │ overlaps ↕
+ │   cancelJob()      │ → L75-155 block
+ │   updateStatus()   │
+ │   ...              │
+ └────────────────────┘
+ ```
+
+ **Sliding window** mixes imports, constructors, and multiple methods into one embedding. Search for "cancel a job" and you get a generic block.
+ **AST chunking** gives each method its own embedding. Search for "cancel a job" → direct hit on `cancelJob()`.
+
+ #### Results (Production NestJS Backend — 3,753 lines)
+
+ Tested with 10 domain-specific queries on 8 service files (`orders.service.ts`, `bookings.service.ts`, `notifications.service.ts`, etc.):
+
+ | Metric | Sliding Window | Tree-Sitter AST |
+ |--------|:-:|:-:|
+ | **Query Wins** | 0/10 | **8/10** (2 ties) |
+ | **Top-1 Relevant** | 3/10 | **8/10** |
+ | **Avg Precision@3** | 1.1/3 | **1.7/3** |
+ | **Avg Score Delta** | — | **+0.035** |
+
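Reviewer sketch (not part of the package diff): the "Top-1 Relevant" and "Avg Precision@3" rows can be computed along the following lines. The diff does not say how the benchmark script judges relevance, so the input shape here (a ranked list of chunk ids plus a set of ids marked relevant per query) and the function names `hitsAtK` and `summarize` are assumptions made only for illustration.

```javascript
// Hypothetical scoring helpers, NOT taken from BrainBank's benchmark script.

// Number of relevant results among the top k (reported as "hits/k").
function hitsAtK(rankedIds, relevantIds, k = 3) {
  return rankedIds.slice(0, k).filter((id) => relevantIds.has(id)).length;
}

// Aggregates per-query runs into the two summary metrics.
// Each run: { ranked: string[], relevant: Set<string> }
function summarize(runs, k = 3) {
  let top1 = 0;
  let totalHits = 0;
  for (const { ranked, relevant } of runs) {
    if (ranked.length > 0 && relevant.has(ranked[0])) top1 += 1;
    totalHits += hitsAtK(ranked, relevant, k);
  }
  return {
    top1Relevant: top1,                                     // e.g. 8 means 8/10
    avgHitsAtK: runs.length ? totalHits / runs.length : 0,  // e.g. 1.7 means 1.7/3
  };
}
```

Under this reading, an "Avg Precision@3" of 1.7/3 means that, averaged over the queries, 1.7 of the top three results were judged relevant.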
+ #### Per-Query Breakdown
+
+ | Query | SW Top Result | AST Top Result | Δ Score |
+ |-------|:---:|:---:|:---:|
+ | cancel an order | generic `L451-458` | **`updateOrderStatus`** | +0.005 |
+ | create a booking | generic `L451-458` | **`createInstantBooking`** | +0.068 |
+ | confirm booking | generic `L451-458` | **`confirm`** | +0.034 |
+ | send notification | generic `L226-305` | **`publishNotificationEvent`** | +0.034 |
+ | authenticate JWT | generic `L1-80` | **`AuthModule`** | +0.032 |
+ | tenant DB connection | `L76-155` | **`onModuleDestroy`** | +0.037 |
+ | list orders paginated | `L76-155` | **`findAllActive`** | +0.045 |
+ | reject booking | generic `L451-458` | **`reject`** | +0.090 |
+
+ > Notice how the sliding window returns the **same generic block `L451-458`** for 4 different queries. The AST chunker returns a different, correctly named method each time.
+
+ #### Chunk Quality Comparison
+
+ | | Sliding Window | Tree-Sitter AST |
+ |---|:-:|:-:|
+ | Total chunks | 53 | **83** |
+ | Avg lines/chunk | 75 | **39** |
+ | Named chunks | 0 | **83** (100%) |
+ | Chunk types | `block` | `method`, `interface`, `class` |
+
+ ### Grammar Support
+
+ All 9 core grammars verified, each parsing in **≤0.05ms**:
+
+ | Language | AST Nodes Extracted | Parse Time |
+ |----------|:---:|:---:|
+ | TypeScript | `export_statement`, `interface_declaration` | 0.04ms |
+ | JavaScript | `function_declaration` × 3 | 0.04ms |
+ | Python | `class_definition`, `function_definition` × 2 | 0.03ms |
+ | Go | `function_declaration`, `method_declaration` × 3 | 0.04ms |
+ | Rust | `struct_item`, `impl_item`, `function_item` | 0.03ms |
+ | Ruby | `class`, `method` | 0.03ms |
+ | Java | `class_declaration` | 0.02ms |
+ | C | `function_definition` × 3 | 0.05ms |
+ | PHP | `class_declaration` | 0.03ms |
+
+ > Additional grammars available: C++, Swift, C#, Kotlin, Scala, Lua, Elixir, Bash, HTML, CSS
+
+ ### Running Benchmarks
+
+ ```bash
+ # Grammar support (9 languages, parse speed)
+ node test/benchmarks/grammar-support.mjs
+
+ # Search quality A/B (uses BrainBank's own source files)
+ node test/benchmarks/search-quality.mjs
+ ```
+
+ ---
+
  ## Architecture
 
  <details>
@@ -1035,7 +1154,7 @@ Final results (sorted by blended score)
 
  ### Data Flow
 
- 1. **Index** — Indexers parse files into chunks
+ 1. **Index** — Indexers parse files into chunks (tree-sitter AST for code, heading-based for docs)
  2. **Embed** — Each chunk gets a vector (local WASM or OpenAI)
  3. **Store** — Chunks + vectors → SQLite, vectors → HNSW index
  4. **Search** — Query → HNSW k-NN + BM25 keyword → RRF fusion → optional reranker
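
Reviewer sketch (not part of the package diff): the "RRF fusion" in the Search step refers to Reciprocal Rank Fusion, which merges several ranked lists by summing `1 / (k + rank)` for each item across the lists. The sketch below shows the general technique only. The function name `reciprocalRankFusion` and the result shape are hypothetical, and `k = 60` is the value commonly used for RRF, not a constant confirmed for BrainBank.

```javascript
// Generic Reciprocal Rank Fusion, NOT BrainBank's actual code.
// rankedLists: array of arrays of ids, each ordered best-first
// (for example one list from vector k-NN and one from BM25).
function reciprocalRankFusion(rankedLists, k = 60) {
  const scores = new Map();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id, score]) => ({ id, score }));
}

// Usage: fuse a vector ranking with a keyword ranking.
const fused = reciprocalRankFusion([
  ["a", "b", "c"], // hypothetical vector search order
  ["b", "d", "a"], // hypothetical BM25 order
]);
console.log(fused.map((r) => r.id).join(", "));
```

Because only ranks are used, the raw vector similarities and BM25 scores never need to be normalized against each other, which is the usual reason for choosing RRF over a weighted score sum.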