purecontext-mcp 1.1.0 → 1.1.2
- package/AGENT_INSTRUCTIONS.md +509 -0
- package/AGENT_INSTRUCTIONS_SHORT.md +97 -0
- package/CHANGELOG.md +212 -0
- package/docs/01-introduction.md +69 -0
- package/docs/02-installation.md +267 -0
- package/docs/03-quick-start.md +135 -0
- package/docs/04-configuration.md +214 -0
- package/docs/05-cli-reference.md +130 -0
- package/docs/06-tools-reference.md +499 -0
- package/docs/07-language-support.md +88 -0
- package/docs/08-framework-adapters.md +324 -0
- package/docs/09-dependency-graph.md +182 -0
- package/docs/10-semantic-search.md +153 -0
- package/docs/11-search-quality.md +110 -0
- package/docs/12-ai-summarization.md +106 -0
- package/docs/13-token-savings.md +110 -0
- package/docs/14-transport-modes.md +167 -0
- package/docs/15-team-setup.md +251 -0
- package/docs/16-docker.md +186 -0
- package/docs/17-web-ui.md +157 -0
- package/docs/18-git-history.md +157 -0
- package/docs/19-cross-repo.md +177 -0
- package/docs/20-architecture-analysis.md +228 -0
- package/docs/21-ecosystem-tools.md +189 -0
- package/docs/22-distribution.md +240 -0
- package/docs/23-performance.md +121 -0
- package/docs/24-security.md +144 -0
- package/docs/25-architecture-overview.md +240 -0
- package/docs/26-troubleshooting.md +234 -0
- package/docs/27-api-stability.md +114 -0
- package/docs/README.md +71 -0
- package/guide/README.md +57 -0
- package/guide/ai-summaries.md +127 -0
- package/guide/code-health.md +190 -0
- package/guide/code-history.md +149 -0
- package/guide/finding-code.md +157 -0
- package/guide/navigating-new-code.md +121 -0
- package/guide/safe-changes.md +156 -0
- package/guide/team-setup.md +191 -0
- package/guide/web-ui.md +154 -0
- package/guide/why-purecontext.md +73 -0
- package/guide/workflow-onboarding.md +114 -0
- package/guide/workflow-pr-review.md +199 -0
- package/guide/workflow-refactoring.md +172 -0
- package/package.json +9 -2
|
@@ -0,0 +1,189 @@
|
|
|
1
|
+
# Ecosystem & Data Tools
|
|
2
|
+
|
|
3
|
+
|
|
4
|
+
Ecosystem tools extend PureContext to data-centric codebases: dbt projects, SQL schemas, OpenAPI specifications, and a context provider framework for domain-specific integrations.
|
|
5
|
+
|
|
6
|
+
---
|
|
7
|
+
|
|
8
|
+
## Context provider framework
|
|
9
|
+
|
|
10
|
+
A context provider is a plugin that adds domain-specific enrichment to symbol metadata and search results. Providers are loaded automatically when their target framework is detected.
|
|
11
|
+
|
|
12
|
+
**Built-in providers:**
|
|
13
|
+
- **dbt provider** — enriches dbt model symbols with column lineage and upstream/downstream dependencies
|
|
14
|
+
- **OpenAPI provider** — enriches endpoint symbols with request/response schema details
|
|
15
|
+
- **SQL provider** — enriches table symbols with column definitions and foreign key relationships
|
|
16
|
+
|
|
17
|
+
**Writing a custom provider:**
|
|
18
|
+
|
|
19
|
+
```typescript
interface ContextProvider {
  name: string;
  detect(projectRoot: string): Promise<boolean>;
  enrich(symbol: SymbolRecord): Promise<EnrichedSymbol>;
}
```
|
|
26
|
+
|
|
27
|
+
Register in `config.json`:
|
|
28
|
+
|
|
29
|
+
```json
{
  "contextProviders": ["my-custom-provider"]
}
```
|
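A minimal sketch of a provider that satisfies the `ContextProvider` interface. The `SymbolRecord` and `EnrichedSymbol` types are not defined in this document, so the shapes below are placeholders, and the marker file name and the `/generated/` rule are invented purely for illustration.

```typescript
import { access } from "node:fs/promises";
import { join } from "node:path";

// Hypothetical shapes: the real SymbolRecord / EnrichedSymbol are not shown in the docs.
interface SymbolRecord {
  name: string;
  filePath: string;
}
interface EnrichedSymbol extends SymbolRecord {
  frameworkMeta: Record<string, unknown>;
}

// Same interface as documented for custom providers.
interface ContextProvider {
  name: string;
  detect(projectRoot: string): Promise<boolean>;
  enrich(symbol: SymbolRecord): Promise<EnrichedSymbol>;
}

const myCustomProvider: ContextProvider = {
  name: "my-custom-provider",

  // Activate only when a marker file exists in the project root (illustrative rule).
  async detect(projectRoot) {
    try {
      await access(join(projectRoot, "my-framework.config.json"));
      return true;
    } catch {
      return false;
    }
  },

  // Return a new object with extra metadata; the input record is not mutated.
  async enrich(symbol) {
    const isGenerated = symbol.filePath.includes("/generated/");
    return { ...symbol, frameworkMeta: { provider: "my-custom-provider", isGenerated } };
  },
};
```

How a provider module is resolved from the name listed in `contextProviders` is not described here, so the sketch stops at the object itself.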
|
34
|
+
|
|
35
|
+
---
|
|
36
|
+
|
|
37
|
+
## dbt integration
|
|
38
|
+
|
|
39
|
+
**Auto-detected by:** `dbt_project.yml` in project root.
|
|
40
|
+
|
|
41
|
+
### What is indexed
|
|
42
|
+
|
|
43
|
+
| dbt artifact | Symbol kind | Notes |
|-------------|-------------|-------|
| Model (`.sql`) | `function` | SQL logic as source, dbt Jinja expanded |
| Source | `const` | External data source reference |
| Seed (`.csv`) | `const` | Static data table |
| Macro | `function` | Jinja macro definition |
| Exposure | `const` | Dashboard/downstream consumer |
|
|
50
|
+
|
|
51
|
+
Column definitions from `schema.yml` are stored in `frameworkMeta.columns`.
|
|
52
|
+
|
|
53
|
+
### dbt Jinja expansion
|
|
54
|
+
|
|
55
|
+
Before parsing, dbt SQL files are pre-processed to expand Jinja templating:
|
|
56
|
+
- `{{ ref('orders') }}` → resolved model name
|
|
57
|
+
- `{{ source('raw', 'events') }}` → source reference
|
|
58
|
+
- `{{ config(...) }}` → stripped
|
|
59
|
+
|
|
60
|
+
This allows the SQL handler to parse the underlying SQL accurately.
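The three substitutions listed above can be pictured with a small regex-based sketch. This is not PureContext's implementation: real dbt resolves `ref` and `source` through its manifest, and the function name `expandDbtJinja` and the `schema.table` output form for sources are assumptions made for the example.

```typescript
// Illustrative only: handles the three constructs from the list above and nothing else
// (no nested parentheses in config(...), no macros, no control blocks).
function expandDbtJinja(sql: string): string {
  return sql
    // {{ config(...) }} is stripped
    .replace(/\{\{\s*config\([^)]*\)\s*\}\}/g, "")
    // {{ ref('orders') }} becomes the model name
    .replace(/\{\{\s*ref\(\s*['"]([^'"]+)['"]\s*\)\s*\}\}/g, "$1")
    // {{ source('raw', 'events') }} becomes raw.events
    .replace(
      /\{\{\s*source\(\s*['"]([^'"]+)['"]\s*,\s*['"]([^'"]+)['"]\s*\)\s*\}\}/g,
      "$1.$2",
    )
    .trim();
}

const dbtModel =
  "{{ config(materialized='table') }}\n" +
  "select * from {{ ref('orders') }} o join {{ source('raw', 'events') }} e on o.id = e.order_id";

// Plain SQL that a standard SQL parser can handle.
const expandedSql = expandDbtJinja(dbtModel);
```

After expansion the text contains no Jinja delimiters, which is what lets a plain SQL parser process it.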
|
|
61
|
+
|
|
62
|
+
### Configuration
|
|
63
|
+
|
|
64
|
+
```json
{
  "dbt": {
    "manifestPath": "target/manifest.json",
    "profilesPath": "~/.dbt/profiles.yml"
  }
}
```
|
|
72
|
+
|
|
73
|
+
Run `dbt compile` or `dbt run` before indexing to ensure `target/manifest.json` is current.
|
|
74
|
+
|
|
75
|
+
---
|
|
76
|
+
|
|
77
|
+
## `search_columns`
|
|
78
|
+
|
|
79
|
+
Search column definitions across dbt models and SQL tables.
|
|
80
|
+
|
|
81
|
+
**Parameters:**
|
|
82
|
+
|
|
83
|
+
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `repoId` | `string` | required | Target repository |
| `query` | `string` | required | Column name fragment |
| `modelName` | `string` | none | Restrict to a specific model |
|
|
88
|
+
|
|
89
|
+
**Response:**
|
|
90
|
+
|
|
91
|
+
```json
{
  "columns": [
    {
      "name": "user_id",
      "model": "fct_orders",
      "dataType": "bigint",
      "description": "Foreign key to dim_users",
      "nullable": false,
      "lineage": {
        "upstream": ["stg_orders.user_id"],
        "downstream": ["rpt_user_activity.user_id", "fct_revenue.user_id"]
      }
    }
  ]
}
```
|
|
108
|
+
|
|
109
|
+
**Use cases:**
|
|
110
|
+
- "Find all columns named `user_id` across my dbt project"
|
|
111
|
+
- "What models produce the `revenue` column?"
|
|
112
|
+
- "What is the lineage of `order_status`?"
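To show how the response can be consumed, here is a hedged sketch that types the sample response and extracts the downstream models. The interface names and the helper `downstreamModels` are made up for this example, and the field list is inferred only from the sample above, so the real schema may differ.

```typescript
// Types inferred from the sample response; not an official schema.
interface ColumnLineage {
  upstream: string[];
  downstream: string[];
}
interface ColumnResult {
  name: string;
  model: string;
  dataType: string;
  description: string;
  nullable: boolean;
  lineage: ColumnLineage;
}
interface SearchColumnsResponse {
  columns: ColumnResult[];
}

// Collect the distinct models that consume any of the matched columns.
function downstreamModels(response: SearchColumnsResponse): string[] {
  const models = new Set<string>();
  for (const column of response.columns) {
    for (const ref of column.lineage.downstream) {
      // Lineage entries look like "model.column"; keep the model part.
      const dot = ref.lastIndexOf(".");
      models.add(dot === -1 ? ref : ref.slice(0, dot));
    }
  }
  return Array.from(models).sort();
}

// The sample response from this section.
const sampleResponse: SearchColumnsResponse = {
  columns: [
    {
      name: "user_id",
      model: "fct_orders",
      dataType: "bigint",
      description: "Foreign key to dim_users",
      nullable: false,
      lineage: {
        upstream: ["stg_orders.user_id"],
        downstream: ["rpt_user_activity.user_id", "fct_revenue.user_id"],
      },
    },
  ],
};

const consumers = downstreamModels(sampleResponse);
```

For the sample data, `consumers` holds the two models that read `fct_orders.user_id`.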
|
|
113
|
+
|
|
114
|
+
---
|
|
115
|
+
|
|
116
|
+
## OpenAPI / Swagger handler
|
|
117
|
+
|
|
118
|
+
**Auto-detected by:** `openapi.yaml`, `openapi.json`, `swagger.yaml`, or `swagger.json` in the project root, or files with `openapi: 3.x.x` content.
|
|
119
|
+
|
|
120
|
+
### What is indexed
|
|
121
|
+
|
|
122
|
+
| OpenAPI artifact | Symbol kind | Notes |
|-----------------|-------------|-------|
| Endpoint (`GET /users`) | `route` | Path + method as name |
| Schema object | `type` | Request/response schema |
| Parameter | `const` | Query/path/header parameter |
|
|
127
|
+
|
|
128
|
+
### Using OpenAPI symbols
|
|
129
|
+
|
|
130
|
+
```
search_symbols(query: "users", kind: "route")
→ "GET /users", "POST /users", "GET /users/{id}"

get_symbol_source(symbolId: "GET /users/{id}")
→ Full endpoint definition including parameters, request body, response schemas
```
|
|
137
|
+
|
|
138
|
+
---
|
|
139
|
+
|
|
140
|
+
## SQL handler
|
|
141
|
+
|
|
142
|
+
**Extensions:** `.sql` files.
|
|
143
|
+
|
|
144
|
+
**Detected separately from dbt** — the SQL handler processes raw SQL files without dbt Jinja.
|
|
145
|
+
|
|
146
|
+
### What is indexed
|
|
147
|
+
|
|
148
|
+
| SQL statement | Symbol kind |
|--------------|-------------|
| `CREATE TABLE` | `class` |
| `CREATE VIEW` | `function` |
| `CREATE FUNCTION` | `function` |
| `CREATE PROCEDURE` | `function` |
| `CREATE INDEX` | `const` |
|
|
155
|
+
|
|
156
|
+
For dbt projects, the SQL handler works alongside the dbt provider — the provider handles Jinja expansion and column lineage, the handler handles AST parsing.
|
|
157
|
+
|
|
158
|
+
### Example
|
|
159
|
+
|
|
160
|
+
```
search_symbols(query: "orders", kind: "class")
→ "orders" table (CREATE TABLE orders ...)

get_symbol_source(symbolId: "orders-table-id")
→ Full CREATE TABLE statement with all column definitions
```
|
|
167
|
+
|
|
168
|
+
---
|
|
169
|
+
|
|
170
|
+
## Combining data tools
|
|
171
|
+
|
|
172
|
+
A typical data platform exploration workflow:
|
|
173
|
+
|
|
174
|
+
```
1. search_columns(query: "revenue")
   → Find all columns named 'revenue' and their models

2. get_symbol_source(symbolId: "fct_revenue-model-id")
   → See the SQL logic that produces the revenue column

3. get_context_bundle(symbolId: "fct_revenue-model-id")
   → Traverse upstream to understand the full lineage

4. search_symbols(query: "revenue", kind: "route")
   → Find the API endpoints that expose revenue data

5. get_blast_radius(symbolId: "fct_revenue-model-id")
   → See which dashboards and downstream models depend on this
```
|
|
@@ -0,0 +1,240 @@
|
|
|
1
|
+
# Distribution & Platform
|
|
2
|
+
|
|
3
|
+
|
|
4
|
+
PureContext supports distribution and automation through index export/import, a public registry of pre-built indexes, webhooks for auto-reindex, GitHub Actions integration, and a VS Code extension.
|
|
5
|
+
|
|
6
|
+
---
|
|
7
|
+
|
|
8
|
+
## Index export and import
|
|
9
|
+
|
|
10
|
+
Share pre-built indexes without requiring everyone to re-index from scratch.
|
|
11
|
+
|
|
12
|
+
### Export
|
|
13
|
+
|
|
14
|
+
```bash
npx purecontext-mcp export --repo <repoId> --out index.pctx.tar.gz
```
|
|
17
|
+
|
|
18
|
+
Or by path:
|
|
19
|
+
|
|
20
|
+
```bash
npx purecontext-mcp export --path /path/to/project --out index.pctx.tar.gz
```
|
|
23
|
+
|
|
24
|
+
The archive contains: compressed SQLite database, HNSW index (if present), and a metadata JSON file.
|
|
25
|
+
|
|
26
|
+
### Import
|
|
27
|
+
|
|
28
|
+
```bash
npx purecontext-mcp import --file index.pctx.tar.gz
```
|
|
31
|
+
|
|
32
|
+
After import, the repo is immediately searchable — no re-indexing required.
|
|
33
|
+
|
|
34
|
+
### Use cases
|
|
35
|
+
|
|
36
|
+
- **Team onboarding**: export the index after CI, share as an artifact — new developers get a pre-built index on day one
|
|
37
|
+
- **CI pipeline**: cache the index between runs (see GitHub Actions below)
|
|
38
|
+
- **Server migration**: move indexes from one server to another without re-indexing
|
|
39
|
+
|
|
40
|
+
---
|
|
41
|
+
|
|
42
|
+
## Public registry
|
|
43
|
+
|
|
44
|
+
Pre-built indexes for popular open-source projects are hosted on a CDN.
|
|
45
|
+
|
|
46
|
+
### Pulling a registry index
|
|
47
|
+
|
|
48
|
+
```bash
npx purecontext-mcp pull react@18
npx purecontext-mcp pull typescript@5
npx purecontext-mcp pull django@4.2
```
|
|
53
|
+
|
|
54
|
+
The index is downloaded and imported automatically. Use `list_repos` to confirm it's available.
|
|
55
|
+
|
|
56
|
+
### Available packages
|
|
57
|
+
|
|
58
|
+
```bash
npx purecontext-mcp registry list
# Lists all available packages with versions and index sizes
```
|
|
62
|
+
|
|
63
|
+
### Requesting a new package
|
|
64
|
+
|
|
65
|
+
Open an issue on GitHub with the package name and version. Registry indexes are built automatically from GitHub releases using the GitHub Actions integration.
|
|
66
|
+
|
|
67
|
+
---
|
|
68
|
+
|
|
69
|
+
## Webhooks for auto-reindex
|
|
70
|
+
|
|
71
|
+
Configure a webhook endpoint to trigger re-indexing automatically when code is pushed to your repository.
|
|
72
|
+
|
|
73
|
+
### Setup
|
|
74
|
+
|
|
75
|
+
1. In your PureContext server config:
|
|
76
|
+
|
|
77
|
+
```json
{
  "webhooks": {
    "enabled": true,
    "secret": "${WEBHOOK_SECRET}",
    "branches": ["main", "develop"]
  }
}
```
|
|
86
|
+
|
|
87
|
+
2. In your GitHub repository settings:
   - Go to Settings → Webhooks → Add webhook
   - Payload URL: `https://your-server/webhook/github`
   - Content type: `application/json`
   - Secret: same value as `WEBHOOK_SECRET`
   - Events: "Just the push event"
|
|
93
|
+
|
|
94
|
+
### How it works
|
|
95
|
+
|
|
96
|
+
When a push is received:
|
|
97
|
+
1. PureContext verifies the webhook signature (HMAC-SHA256)
|
|
98
|
+
2. Checks if the pushed branch is in `webhooks.branches`
|
|
99
|
+
3. Triggers an incremental re-index of the affected repo
|
|
100
|
+
4. New symbols are available within seconds
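Steps 1 and 2 can be sketched as follows. GitHub sends the HMAC-SHA256 signature in the `X-Hub-Signature-256` header as `sha256=<hex digest>` and identifies the branch as a full ref such as `refs/heads/main`; the function names below are hypothetical and this is not PureContext's own handler.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Recompute the signature over the raw request body and compare it in constant time.
function verifyGithubSignature(rawBody: string, signatureHeader: string, secret: string): boolean {
  const expected = "sha256=" + createHmac("sha256", secret).update(rawBody).digest("hex");
  const a = Buffer.from(expected);
  const b = Buffer.from(signatureHeader);
  // timingSafeEqual throws when lengths differ, so check the length first.
  return a.length === b.length && timingSafeEqual(a, b);
}

// Push payloads carry the branch as "refs/heads/<name>"; tags and other refs are ignored.
function shouldReindex(ref: string, branches: string[]): boolean {
  const prefix = "refs/heads/";
  return ref.startsWith(prefix) && branches.includes(ref.slice(prefix.length));
}
```

The signature must be computed over the exact bytes received, before any JSON parsing, because re-serializing the payload can change whitespace and invalidate the digest.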
|
|
101
|
+
|
|
102
|
+
### GitLab and others
|
|
103
|
+
|
|
104
|
+
Custom webhook formats are supported by mapping them to PureContext's internal format:
|
|
105
|
+
|
|
106
|
+
```json
{
  "webhooks": {
    "enabled": true,
    "format": "gitlab",
    "secret": "${WEBHOOK_SECRET}"
  }
}
```
|
|
115
|
+
|
|
116
|
+
---
|
|
117
|
+
|
|
118
|
+
## GitHub Actions integration
|
|
119
|
+
|
|
120
|
+
The official `purecontext/index-action` automates index building in CI.
|
|
121
|
+
|
|
122
|
+
### Basic usage
|
|
123
|
+
|
|
124
|
+
```yaml
# .github/workflows/index.yml
name: Index with PureContext
on:
  push:
    branches: [main]

jobs:
  index:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Index repository
        uses: purecontext/index-action@v1
        with:
          server-url: ${{ vars.PCTX_SERVER_URL }}
          api-key: ${{ secrets.PCTX_API_KEY }}
```
|
|
143
|
+
|
|
144
|
+
### Caching the index in CI
|
|
145
|
+
|
|
146
|
+
```yaml
- name: Cache PureContext index
  uses: actions/cache@v4
  with:
    path: ~/.purecontext/indexes
    key: purecontext-${{ github.sha }}
    restore-keys: purecontext-

- name: Index repository
  uses: purecontext/index-action@v1
  with:
    path: ${{ github.workspace }}
```
|
|
159
|
+
|
|
160
|
+
With caching, only changed files are re-parsed on each run — CI index time drops to seconds after the first run.
|
|
161
|
+
|
|
162
|
+
### Publishing to the registry
|
|
163
|
+
|
|
164
|
+
```yaml
- name: Publish index to registry
  uses: purecontext/index-action@v1
  with:
    action: publish
    package-name: my-org/my-library
    api-key: ${{ secrets.PCTX_REGISTRY_KEY }}
```
|
|
172
|
+
|
|
173
|
+
See the full `action.yml` in the project root for all available inputs.
|
|
174
|
+
|
|
175
|
+
---
|
|
176
|
+
|
|
177
|
+
## VS Code extension
|
|
178
|
+
|
|
179
|
+
The PureContext VS Code extension integrates symbol search and navigation directly into the editor.
|
|
180
|
+
|
|
181
|
+
### Installation
|
|
182
|
+
|
|
183
|
+
Search "PureContext" in the VS Code Extensions panel, or:
|
|
184
|
+
|
|
185
|
+
```bash
code --install-extension purecontext.purecontext-vscode
```
|
|
188
|
+
|
|
189
|
+
The source is in `vscode-extension/` in the project repo.
|
|
190
|
+
|
|
191
|
+
### Features
|
|
192
|
+
|
|
193
|
+
| Feature | Description |
|---------|-------------|
| Symbol search | `Ctrl+Shift+P` → "PureContext: Search Symbols" |
| Hover summary | Hover over any identifier to see its PureContext summary |
| Go to definition | Uses PureContext index for faster lookup in large repos |
| Dependency graph | `Ctrl+Shift+P` → "PureContext: Show Dependencies" (opens graph panel) |
| Blast radius | Right-click a symbol → "Show Blast Radius" |
| Quick outline | `Ctrl+Shift+O` with PureContext (shows AI-enriched summaries) |
|
|
201
|
+
|
|
202
|
+
### Configuration
|
|
203
|
+
|
|
204
|
+
Extension settings in VS Code match the `config.json` fields:
|
|
205
|
+
|
|
206
|
+
```jsonc
// .vscode/settings.json
{
  "purecontext.serverUrl": "http://localhost:3000",
  "purecontext.apiKey": "pctx_...",
  "purecontext.enabled": true
}
```
|
|
214
|
+
|
|
215
|
+
---
|
|
216
|
+
|
|
217
|
+
## Programmatic API
|
|
218
|
+
|
|
219
|
+
For building custom integrations, the `@purecontext/client` npm package provides a typed TypeScript client:
|
|
220
|
+
|
|
221
|
+
```typescript
import { PureContextClient } from '@purecontext/client';

const client = new PureContextClient({
  serverUrl: 'http://localhost:3000',
  apiKey: 'pctx_...'
});

const symbols = await client.searchSymbols({
  repoId: 'a1b2c3d4',
  query: 'authenticate',
  kind: 'function'
});
```
|
|
235
|
+
|
|
236
|
+
All tool inputs and outputs are fully typed. Install:
|
|
237
|
+
|
|
238
|
+
```bash
npm install @purecontext/client
```
|
|
@@ -0,0 +1,121 @@
|
|
|
1
|
+
# Performance & Scalability
|
|
2
|
+
|
|
3
|
+
|
|
4
|
+
PureContext is designed to handle enterprise-scale repos (10k–50k files) using a worker thread pool for parallel tree-sitter parsing.
|
|
5
|
+
|
|
6
|
+
---
|
|
7
|
+
|
|
8
|
+
## Indexing speed
|
|
9
|
+
|
|
10
|
+
Typical performance on a 4-core machine:
|
|
11
|
+
|
|
12
|
+
| Repo size | First index | Incremental re-index |
|-----------|-------------|----------------------|
| 500 files | ~2 seconds | < 100ms |
| 5,000 files | ~15 seconds | < 1 second |
| 20,000 files | ~60 seconds | 1–3 seconds |
| 50,000 files | ~3 minutes | 2–10 seconds |
|
|
18
|
+
|
|
19
|
+
These numbers assume no AI summarization or semantic indexing. Both add API round-trip time.
|
|
20
|
+
|
|
21
|
+
---
|
|
22
|
+
|
|
23
|
+
## Worker thread pool
|
|
24
|
+
|
|
25
|
+
The bottleneck in sequential indexing is tree-sitter WASM parsing — each WASM instance is single-threaded. The worker thread pool parallelizes parsing across CPU cores.
|
|
26
|
+
|
|
27
|
+
```
                  Main thread
                       │
       ┌───────────────┼───────────────┐
       ▼               ▼               ▼
   Worker 1        Worker 2        Worker 3
 (TypeScript)      (Python)          (Go)
parse + extract parse + extract parse + extract
       │               │               │
       └───────────────┴───────────────┘
                       │
                  Main thread
                (SQLite writes)
```
|
|
41
|
+
|
|
42
|
+
Each worker loads its own WASM grammar instances. File batches are distributed across workers by the main thread. SQLite writes are serialized on the main thread (better-sqlite3 is synchronous).
|
|
43
|
+
|
|
44
|
+
### Configuring worker threads
|
|
45
|
+
|
|
46
|
+
```json
{
  "workerThreads": 4
}
```

The default is `os.cpus().length - 1`, with a minimum of 1. Increase the value for CPU-bound workloads on machines with many cores, but do not exceed `os.cpus().length - 1`: leave one core for the main thread and the OS.
|
|
53
|
+
|
|
54
|
+
---
|
|
55
|
+
|
|
56
|
+
## Memory usage
|
|
57
|
+
|
|
58
|
+
| Component | Memory |
|-----------|--------|
| WASM grammars (per worker) | ~20–30 MB per grammar loaded |
| In-memory symbol cache (during indexing) | ~100 MB for 10k symbols |
| SQLite WAL mode (at rest) | ~50 MB |
| HNSW vector index (if enabled) | ~100 bytes per embedding dimension per symbol |
|
|
64
|
+
|
|
65
|
+
**Typical peak during indexing:** 200–500 MB for a 10k-file repo. Returns to ~50 MB at rest.
|
|
66
|
+
|
|
67
|
+
Workers are spawned once and reused for the lifetime of the server — no spawn/teardown overhead per index run.
|
|
68
|
+
|
|
69
|
+
---
|
|
70
|
+
|
|
71
|
+
## Incremental re-indexing
|
|
72
|
+
|
|
73
|
+
The content hash cache makes re-indexing very fast:
|
|
74
|
+
|
|
75
|
+
1. Each file's SHA-256 hash is stored in the `files` table after indexing
|
|
76
|
+
2. On re-index, the hash is recomputed and compared
|
|
77
|
+
3. Only files with a changed hash are re-parsed
|
|
78
|
+
4. Symbols for unchanged files are retained as-is
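The comparison in steps 2 and 3 amounts to a hash lookup per file. In the sketch below an in-memory `Map` stands in for the `files` table, and the function names are illustrative rather than taken from the PureContext codebase.

```typescript
import { createHash } from "node:crypto";

// Hex-encoded SHA-256 of a file's content.
function sha256(content: string): string {
  return createHash("sha256").update(content).digest("hex");
}

// storedHashes: path → hash recorded at the previous index run (stand-in for the `files` table).
// currentFiles: path → current file content.
// Returns the paths whose hash changed, including files that were never indexed before.
function filesToReparse(
  storedHashes: Map<string, string>,
  currentFiles: Map<string, string>,
): string[] {
  const changed: string[] = [];
  currentFiles.forEach((content, filePath) => {
    if (storedHashes.get(filePath) !== sha256(content)) changed.push(filePath);
  });
  return changed;
}
```

Every file is still read and hashed on each run, so the saving comes from skipping the parse step, not from skipping I/O.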
|
|
79
|
+
|
|
80
|
+
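A typical `git pull` touches 10–50 files, so only those files are re-parsed; total re-index time then depends mostly on repo size, as shown in the "Incremental re-index" column of the indexing speed table.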
|
|
81
|
+
|
|
82
|
+
To force a full re-index (bypass the hash cache):
|
|
83
|
+
|
|
84
|
+
```
Use invalidate_cache tool, then index_folder again.
```
|
|
87
|
+
|
|
88
|
+
Or call `index_folder` with `force: true`.
|
|
89
|
+
|
|
90
|
+
---
|
|
91
|
+
|
|
92
|
+
## Large repo tuning
|
|
93
|
+
|
|
94
|
+
For repos with > 10,000 files:
|
|
95
|
+
|
|
96
|
+
| Setting | Recommendation |
|---------|---------------|
| `workerThreads` | Set to `os.cpus().length - 1` |
| `watchDebounceMs` | Increase to `5000` if many files change at once (e.g., code generation) |
| `excludePatterns` | Add patterns for generated files, test fixtures with large data files |
| `maxFileSizeBytes` | Keep at 1 MB or lower; parsing multi-MB files is slow and rarely useful |
| `fileLimit` | Set to `0` (unlimited) if you need the full repo indexed |
|
|
103
|
+
|
|
104
|
+
---
|
|
105
|
+
|
|
106
|
+
## SQLite performance
|
|
107
|
+
|
|
108
|
+
SQLite in **WAL (Write-Ahead Logging) mode** provides:
|
|
109
|
+
- Concurrent reads without blocking writes
|
|
110
|
+
- Fast writes (commits append to the WAL file; with `synchronous=NORMAL`, SQLite does not fsync on every commit)
|
|
111
|
+
- Crash safety (WAL journal ensures atomicity)
|
|
112
|
+
|
|
113
|
+
Query performance:
|
|
114
|
+
- `search_symbols` with FTS5: < 5ms for 100k symbols
|
|
115
|
+
- `get_symbol_source`: < 1ms (single row lookup by primary key)
|
|
116
|
+
- `get_blast_radius` (depth 5): 5–20ms depending on graph density
|
|
117
|
+
- `get_context_bundle` (depth 3): 3–15ms
|
|
118
|
+
|
|
119
|
+
No tuning is needed for the SQLite layer up to ~500k symbols. At very large scale, consider periodic `VACUUM` to reclaim space from deleted symbols.
|
|
120
|
+
|
|
121
|
+
|
|
@@ -0,0 +1,144 @@
|
|
|
1
|
+
# Security
|
|
2
|
+
|
|
3
|
+
|
|
4
|
+
---
|
|
5
|
+
|
|
6
|
+
## Threat model
|
|
7
|
+
|
|
8
|
+
PureContext stores and serves source code metadata. Security measures focus on:
|
|
9
|
+
|
|
10
|
+
**Protected:**
|
|
11
|
+
- Symbol names, signatures, summaries — stored in SQLite
|
|
12
|
+
- Raw source returned by `get_symbol_source` / `get_file_content`
|
|
13
|
+
- Admin API (workspace/key management)
|
|
14
|
+
|
|
15
|
+
**Not in scope:**
|
|
16
|
+
- The source repository itself — PureContext only reads it during indexing
|
|
17
|
+
- Network transport — handle TLS at a reverse proxy
|
|
18
|
+
- Host OS security — standard server hardening applies
|
|
19
|
+
|
|
20
|
+
---
|
|
21
|
+
|
|
22
|
+
## Path traversal prevention
|
|
23
|
+
|
|
24
|
+
All file paths are validated before any read:
|
|
25
|
+
|
|
26
|
+
1. Resolved to an absolute path
|
|
27
|
+
2. Verified to start within the project root (the indexed directory)
|
|
28
|
+
3. Symlinks that resolve outside the root are blocked unless `allowSymlinks: true`
|
|
29
|
+
|
|
30
|
+
This prevents tools like `get_file_content` from being used to read arbitrary files on the server.
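Steps 1 and 2 can be sketched with Node's `path` module. The helper name `resolveInsideRoot` is hypothetical, and the symlink check from step 3 is left out: it would additionally need `fs.realpath` on the resolved path.

```typescript
import * as path from "node:path";

// Returns the absolute path if it stays inside projectRoot, otherwise null.
function resolveInsideRoot(projectRoot: string, requested: string): string | null {
  const root = path.resolve(projectRoot);
  const resolved = path.resolve(root, requested);
  // A relative path that starts with ".." (or is absolute) points outside the root.
  const rel = path.relative(root, resolved);
  const escapes = rel === ".." || rel.startsWith(".." + path.sep) || path.isAbsolute(rel);
  return escapes ? null : resolved;
}
```

Comparing via `path.relative` instead of a plain string-prefix test avoids accepting sibling directories such as `/srv/project-evil` when the root is `/srv/project`.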
|
|
31
|
+
|
|
32
|
+
---
|
|
33
|
+
|
|
34
|
+
## Secret file exclusion
|
|
35
|
+
|
|
36
|
+
The following files are automatically excluded from indexing (never stored in the index):
|
|
37
|
+
|
|
38
|
+
- `.env`, `.env.*`, `.env.local`, `.env.production`
|
|
39
|
+
- `*.pem`, `*.key`, `*.p12`, `*.pfx`, `*.crt`, `*.cer`
|
|
40
|
+
- `id_rsa`, `id_ed25519`, `id_ecdsa`, `id_dsa`
|
|
41
|
+
- `credentials.json`, `credentials.yaml`, `secrets.json`
|
|
42
|
+
- `serviceAccountKey*.json`, `*-service-account.json`
|
|
43
|
+
- `*.token`, `*.secret`
|
|
44
|
+
|
|
45
|
+
These patterns are built into the file discovery layer and always apply, regardless of what `excludePatterns` contains.
|
|
46
|
+
|
|
47
|
+
---
|
|
48
|
+
|
|
49
|
+
## Binary file detection
|
|
50
|
+
|
|
51
|
+
Files are scanned for null bytes in the first 8 KB. Files with null bytes are treated as binary and skipped — preventing large binary files (which may contain embedded secrets) from entering the index.
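A minimal sketch of that null-byte check, assuming the file's leading bytes have already been read into memory. The names `SNIFF_BYTES` and `looksBinary` are illustrative.

```typescript
// Only the first 8 KB are inspected, matching the description above.
const SNIFF_BYTES = 8 * 1024;

// True if a null byte appears within the first SNIFF_BYTES bytes.
function looksBinary(content: Uint8Array): boolean {
  const limit = Math.min(content.length, SNIFF_BYTES);
  for (let i = 0; i < limit; i++) {
    if (content[i] === 0) return true;
  }
  return false;
}
```

The heuristic is cheap but not exact: UTF-16 text contains null bytes and would be skipped, while a binary file whose first 8 KB happen to contain none would pass.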
|
|
52
|
+
|
|
53
|
+
---
|
|
54
|
+
|
|
55
|
+
## API key security
|
|
56
|
+
|
|
57
|
+
Keys are stored as **bcrypt hashes** in the auth database — plaintext is never persisted after the key is generated.
|
|
58
|
+
|
|
59
|
+
- Keys are shown once on creation — store in a password manager or CI secrets
|
|
60
|
+
- Key format: `pctx_<workspaceId>_<24-char-random>_<checksum>`
|
|
61
|
+
- The checksum allows fast format validation without a database lookup
|
|
62
|
+
- Rotate keys by revoking the old one and creating a new one
|
|
63
|
+
- Use `read` permission for agents that only query, not `write` or `admin`
|
|
64
|
+
|
|
65
|
+
---
|
|
66
|
+
|
|
67
|
+
## Workspace isolation
|
|
68
|
+
|
|
69
|
+
Every query is scoped to the workspace of the API key used:
|
|
70
|
+
|
|
71
|
+
- A key from workspace A cannot query repos in workspace B
|
|
72
|
+
- Workspace scoping is enforced in all SQL queries via `workspace_id` column
|
|
73
|
+
- The admin key (`PCTX_ADMIN_KEY`) bypasses workspace isolation — protect it like a root password
|
|
74
|
+
|
|
75
|
+
---
|
|
76
|
+
|
|
77
|
+
## Rate limiting
|
|
78
|
+
|
|
79
|
+
Per-key rate limits (token bucket algorithm):
|
|
80
|
+
|
|
81
|
+
- `rateLimit.maxTokens` — bucket capacity (default: 100)
|
|
82
|
+
- `rateLimit.refillRate` — tokens/second refill rate (default: 10)
|
|
83
|
+
- Heavy tools (e.g., `index_folder`) cost more tokens per call
|
|
84
|
+
|
|
85
|
+
When exceeded: `429 Too Many Requests` with `Retry-After` header.
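For readers unfamiliar with the token bucket algorithm, here is a small sketch using the documented defaults (capacity 100, refill 10 tokens/second). The class, its method names, and the injectable clock are assumptions for illustration, not PureContext's internals.

```typescript
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly maxTokens: number;
  private readonly refillRate: number; // tokens per second
  private readonly now: () => number;

  constructor(maxTokens = 100, refillRate = 10, now: () => number = () => Date.now()) {
    this.maxTokens = maxTokens;
    this.refillRate = refillRate;
    this.now = now;
    this.tokens = maxTokens; // start full
    this.lastRefillMs = now();
  }

  // Heavier tools pass a larger cost. On refusal, retryAfterSeconds can feed a Retry-After header.
  tryConsume(cost = 1): { allowed: boolean; retryAfterSeconds: number } {
    const nowMs = this.now();
    const elapsedSeconds = (nowMs - this.lastRefillMs) / 1000;
    // Refill in proportion to elapsed time, capped at the bucket capacity.
    this.tokens = Math.min(this.maxTokens, this.tokens + elapsedSeconds * this.refillRate);
    this.lastRefillMs = nowMs;

    if (this.tokens >= cost) {
      this.tokens -= cost;
      return { allowed: true, retryAfterSeconds: 0 };
    }
    const wait = Math.ceil((cost - this.tokens) / this.refillRate);
    return { allowed: false, retryAfterSeconds: wait };
  }
}
```

The capacity bounds how large a burst a key can send at once, while the refill rate sets the sustained request rate.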
|
|
86
|
+
|
|
87
|
+
---
|
|
88
|
+
|
|
89
|
+
## HTTP security
|
|
90
|
+
|
|
91
|
+
- **Default host: `127.0.0.1`** — loopback only, not exposed on the network
|
|
92
|
+
- A warning is logged at startup if `host` is not loopback and `auth.enabled` is false
|
|
93
|
+
- **Timing-safe comparison** — `crypto.timingSafeEqual()` used for token comparison (prevents timing attacks)
|
|
94
|
+
- **Request body limit** — 1 MB maximum
|
|
95
|
+
- **CORS** — whitelist-controlled via `http.corsOrigins`
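The timing-safe comparison bullet can be illustrated as below. `crypto.timingSafeEqual` throws when its inputs differ in length, so this sketch hashes both sides first; that hashing step and the name `tokensMatch` are choices made for the example, not necessarily what PureContext does.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Compare two secrets without leaking, through timing, how many leading bytes match.
function tokensMatch(presented: string, expected: string): boolean {
  // SHA-256 digests are always 32 bytes, so the length precondition always holds.
  const a = createHash("sha256").update(presented).digest();
  const b = createHash("sha256").update(expected).digest();
  return timingSafeEqual(a, b);
}
```

A plain `===` on strings can return as soon as the first differing character is found, which is the timing signal this comparison avoids.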
|
|
96
|
+
|
|
97
|
+
---
|
|
98
|
+
|
|
99
|
+
## Remote repository cloning
|
|
100
|
+
|
|
101
|
+
When using `index_repo`:
|
|
102
|
+
|
|
103
|
+
- Only `https://` and `http://` URLs and `git@` SSH-style addresses are accepted
|
|
104
|
+
- Clone tokens (`token` parameter) are never logged
|
|
105
|
+
- Clones are isolated under `~/.purecontext/clones/`
|
|
106
|
+
|
|
107
|
+
---
|
|
108
|
+
|
|
109
|
+
## Self-hosting hardening checklist
|
|
110
|
+
|
|
111
|
+
- [ ] Run behind a TLS-terminating reverse proxy (nginx, Caddy)
|
|
112
|
+
- [ ] Set `PCTX_ADMIN_KEY` via environment variable, never in `config.json`
|
|
113
|
+
- [ ] Restrict developer API keys to `read` permission where possible
|
|
114
|
+
- [ ] Restrict server bind address to internal network if not public-facing
|
|
115
|
+
- [ ] Use firewall rules to limit access to port 3000
|
|
116
|
+
- [ ] Monitor `/health` endpoint and set up uptime alerts
|
|
117
|
+
- [ ] Rotate API keys regularly
|
|
118
|
+
- [ ] Back up the `/data` volume (contains indexes and auth database)
|
|
119
|
+
|
|
120
|
+
---
|
|
121
|
+
|
|
122
|
+
## Data at rest
|
|
123
|
+
|
|
124
|
+
SQLite files are stored in `indexDir` (`~/.purecontext/indexes/` by default). No encryption at rest is applied by PureContext itself.
|
|
125
|
+
|
|
126
|
+
For sensitive codebases, use OS-level disk encryption:
|
|
127
|
+
- macOS: FileVault
|
|
128
|
+
- Windows: BitLocker
|
|
129
|
+
- Linux: LUKS
|
|
130
|
+
|
|
131
|
+
Docker: use encrypted volumes if the host is shared.
|
|
132
|
+
|
|
133
|
+
---
|
|
134
|
+
|
|
135
|
+
## Audit logging
|
|
136
|
+
|
|
137
|
+
The HTTP server logs every MCP tool call with:
|
|
138
|
+
- Timestamp
|
|
139
|
+
- API key label (not the key itself)
|
|
140
|
+
- Tool name
|
|
141
|
+
- `repoId`
|
|
142
|
+
- Response status and duration
|
|
143
|
+
|
|
144
|
+
At `debug` level, full request/response bodies are included. Pipe logs to your SIEM or log aggregator for audit trails.
|