membot 0.0.1 → 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (73)
  1. package/package.json +81 -24
  2. package/patches/@huggingface%2Ftransformers@4.2.0.patch +137 -0
  3. package/scripts/apply-transformers-patch.sh +35 -0
  4. package/src/cli.ts +70 -0
  5. package/src/commands/check-update.ts +69 -0
  6. package/src/commands/mcpx.ts +112 -0
  7. package/src/commands/reindex.ts +53 -0
  8. package/src/commands/serve.ts +58 -0
  9. package/src/commands/upgrade.ts +220 -0
  10. package/src/config/loader.ts +100 -0
  11. package/src/config/schemas.ts +39 -0
  12. package/src/constants.ts +42 -0
  13. package/src/context.ts +80 -0
  14. package/src/db/blobs.ts +53 -0
  15. package/src/db/chunks.ts +176 -0
  16. package/src/db/connection.ts +173 -0
  17. package/src/db/files.ts +325 -0
  18. package/src/db/migrations/001-init.ts +63 -0
  19. package/src/db/migrations/002-fts.ts +12 -0
  20. package/src/db/migrations.ts +45 -0
  21. package/src/errors.ts +87 -0
  22. package/src/ingest/chunker.ts +117 -0
  23. package/src/ingest/converter/docx.ts +15 -0
  24. package/src/ingest/converter/html.ts +20 -0
  25. package/src/ingest/converter/image.ts +71 -0
  26. package/src/ingest/converter/index.ts +119 -0
  27. package/src/ingest/converter/llm.ts +66 -0
  28. package/src/ingest/converter/ocr.ts +51 -0
  29. package/src/ingest/converter/pdf.ts +38 -0
  30. package/src/ingest/converter/text.ts +8 -0
  31. package/src/ingest/describer.ts +72 -0
  32. package/src/ingest/embedder.ts +83 -0
  33. package/src/ingest/fetcher.ts +280 -0
  34. package/src/ingest/ingest.ts +444 -0
  35. package/src/ingest/local-reader.ts +64 -0
  36. package/src/ingest/search-text.ts +18 -0
  37. package/src/ingest/source-resolver.ts +186 -0
  38. package/src/mcp/instructions.ts +34 -0
  39. package/src/mcp/server.ts +101 -0
  40. package/src/mount/commander.ts +174 -0
  41. package/src/mount/mcp.ts +111 -0
  42. package/src/mount/zod-to-cli.ts +158 -0
  43. package/src/operations/add.ts +69 -0
  44. package/src/operations/diff.ts +105 -0
  45. package/src/operations/index.ts +38 -0
  46. package/src/operations/info.ts +95 -0
  47. package/src/operations/list.ts +87 -0
  48. package/src/operations/move.ts +83 -0
  49. package/src/operations/prune.ts +80 -0
  50. package/src/operations/read.ts +102 -0
  51. package/src/operations/refresh.ts +72 -0
  52. package/src/operations/remove.ts +35 -0
  53. package/src/operations/search.ts +72 -0
  54. package/src/operations/tree.ts +103 -0
  55. package/src/operations/types.ts +81 -0
  56. package/src/operations/versions.ts +78 -0
  57. package/src/operations/write.ts +77 -0
  58. package/src/output/formatter.ts +68 -0
  59. package/src/output/logger.ts +114 -0
  60. package/src/output/progress.ts +78 -0
  61. package/src/output/tty.ts +91 -0
  62. package/src/refresh/runner.ts +296 -0
  63. package/src/refresh/scheduler.ts +54 -0
  64. package/src/sdk.ts +27 -0
  65. package/src/search/hybrid.ts +100 -0
  66. package/src/search/keyword.ts +62 -0
  67. package/src/search/semantic.ts +56 -0
  68. package/src/update/background.ts +73 -0
  69. package/src/update/cache.ts +40 -0
  70. package/src/update/checker.ts +117 -0
  71. package/.claude/settings.local.json +0 -7
  72. package/CLAUDE.md +0 -139
  73. package/docs/plan.md +0 -905
package/docs/plan.md DELETED
@@ -1,905 +0,0 @@
1
- # `ctx` — Standalone AI-Agent Context Store
2
-
3
- ## Context
4
-
5
- `ctx` is a new standalone Bun project at `/Users/evan/workspace/ctx` that extracts and reshapes the context system currently embedded in `botholomew` (`/Users/evan/workspace/botholomew/src/context/`, `src/tools/`, `src/db/`). Distribution and CLI shape mirror `mcpx` (`/Users/evan/workspace/mcpx`).
6
-
7
- Goals (from user):
8
-
9
- - Files are **stored only in the database** — not on disk as a tree of `.md` files. Logical paths are virtual.
10
- - Hybrid search (vector + BM25) over chunked content.
11
- - Tree exploration synthesised from logical paths.
12
- - `ctx add <source>` works for local paths AND remote URLs, with **mcpx-driven mini-agents** fetching remote content (Firecrawl, Google Docs, GitHub, raw HTTP). The exact mcpx invocation (server + tool + args) is stored on the row so refresh can re-invoke it directly — no agent/routing re-run.
13
- - Everything is converted to **markdown**: PDF, DOCX, HTML, plain-text, etc. **Native libs first, LLM fallback** for messy/scanned content.
14
- - Each row tracks `source_path`, `source_sha256`, `refreshed_at`, `refresh_frequency_sec`. `ctx refresh <path>` re-reads the original source, re-hashes, and re-converts/re-embeds only if the SHA changed. Local files compared by content hash; remote URLs re-fetched via the same fetcher.
15
- - Both **on-demand** (`ctx refresh`) and **daemon** (`ctx serve --watch`) refresh modes.
16
- - Bun-compiled standalone executables (darwin/linux/windows × arm64/x64), like mcpx.
17
- - Stdio + HTTP MCP server exposing read/write/add/search/tree/refresh tools.
18
- - System-wide config + data dir at `~/.membot/` (override via `--config` or `MEMBOT_HOME`).
19
- - **Embeddings are LOCAL only** — `@huggingface/transformers` WASM with `Xenova/bge-small-en-v1.5` (384-dim). No cloud embedding APIs. (See memory: `feedback_local_embeddings_only.md`.)
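The refresh goal above (re-hash the source, re-convert only when the SHA changed) can be sketched as a small decision function. This is a minimal illustration, not the package's implementation: `SourceRow`, `sha256Hex`, and `decideRefresh` are hypothetical names, and only the `source_sha256` column comes from the plan.

```typescript
import { createHash } from "node:crypto";

// Hypothetical row shape: only the column the refresh decision needs.
interface SourceRow {
  source_sha256: string | null;
}

type RefreshDecision = "unchanged" | "reconvert";

// Hex-encoded SHA-256 of the raw source bytes.
function sha256Hex(bytes: Uint8Array): string {
  return createHash("sha256").update(bytes).digest("hex");
}

// Re-convert and re-embed only when the freshly read bytes hash differently
// from the hash stored on the row.
function decideRefresh(row: SourceRow, freshBytes: Uint8Array): RefreshDecision {
  return sha256Hex(freshBytes) === row.source_sha256 ? "unchanged" : "reconvert";
}
```

A row with a `NULL` hash never compares equal to a fresh hash, so in this sketch it is always re-converted.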
20
-
21
- ---
22
-
23
- ## Architecture Snapshot
24
-
25
- ```
26
- ~/.membot/
27
- config.json # user config
28
- index.duckdb # all content, chunks, embeddings, FTS
29
- models/ # cached @huggingface/transformers WASM weights
30
- logs/ # daemon logs (when --watch)
31
- ```
32
-
33
- DuckDB is the only persistent store. There is **no** `~/.membot/context/` filesystem tree — the agent's "files" are rows.
34
-
35
- ---
36
-
37
- ## Presentation & Errors
38
-
39
- ### Two presentation modes
40
-
41
- The CLI auto-detects its environment and renders appropriately. There is **one** code path for output — the logger and formatter inspect the environment once at startup and degrade gracefully.
42
-
43
- | Condition | Mode | Behavior |
44
- | -------------------------------------------------------- | ---------------- | ---------------------------------------------------------------------------------------------- |
45
- | stdout is a TTY AND stderr is a TTY AND `--json` not set | **interactive** | ANSI colors, `nanospinner` spinners during work, progress bars for multi-entry ops, aligned tables |
46
- | stdout is piped, redirected, or `--json` is set | **non-interactive** | No spinners, no progress bars, no colors. JSON to stdout, structured logs to stderr. Stable, parseable. |
47
- | `CI=true` env var set | non-interactive (forced) | Same as above; never accidentally emit ANSI/spinners in CI logs |
48
- | `--no-color` flag or `NO_COLOR` env var | non-interactive (colors only) | Spinners stay if TTY, but no ANSI color codes (FORCE_COLOR overrides) |
49
-
50
- Implementation lives in `src/output/`:
51
-
52
- - `tty.ts` — single source of truth for `isInteractive()`, `useColor()`, `useSpinner()`. Reads `process.stdout.isTTY`, `process.stderr.isTTY`, `process.env.CI`, `NO_COLOR`, `FORCE_COLOR`, and the `--json` / `--no-interactive` flags.
53
- - `logger.ts` — spinner-aware (port from `mcpx/src/output/logger.ts`); `info/warn/error/debug/writeRaw` route to stderr in non-interactive mode and don't break parseable stdout.
54
- - `progress.ts` — wraps `nanospinner` + a multi-entry progress bar (used by directory/glob ingest); in non-interactive mode emits one `info` line per entry instead.
55
- - `formatter.ts` — final-result rendering: aligned tables / markdown when interactive, single JSON object when not.
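The mode table above can be expressed as three pure functions. This is a sketch under stated assumptions: the `OutputEnv` shape is hypothetical (the plan reads `process.stdout.isTTY` and friends directly), `FORCE_COLOR` is treated as "set to anything other than empty or `0`", and non-interactive mode is assumed never to emit color even when `FORCE_COLOR` is set.

```typescript
// Hypothetical input bundle, passed explicitly so the rules stay testable.
interface OutputEnv {
  stdoutIsTTY: boolean;
  stderrIsTTY: boolean;
  json: boolean; // --json flag
  noColorFlag: boolean; // --no-color flag
  env: Record<string, string | undefined>;
}

// Interactive only when both streams are TTYs, --json is off, and CI is not set.
function isInteractive(e: OutputEnv): boolean {
  if (e.env.CI === "true") return false;
  return e.stdoutIsTTY && e.stderrIsTTY && !e.json;
}

// Spinners follow interactivity; NO_COLOR does not disable them.
function useSpinner(e: OutputEnv): boolean {
  return isInteractive(e);
}

// Colors need an interactive session; FORCE_COLOR beats NO_COLOR / --no-color.
function useColor(e: OutputEnv): boolean {
  if (!isInteractive(e)) return false;
  const force = e.env.FORCE_COLOR;
  if (force !== undefined && force !== "" && force !== "0") return true;
  const noColorEnv = e.env.NO_COLOR !== undefined && e.env.NO_COLOR !== "";
  return !(e.noColorFlag || noColorEnv);
}
```

Taking the environment as a parameter, rather than reading `process` globals inside each function, keeps the four table rows checkable in unit tests.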
56
-
57
- The mount adapter in `src/mount/commander.ts` is responsible for opening a spinner before the handler runs and closing it (success or failure) after — operations themselves call `ctx.progress.tick()` to update progress, but they never know whether they're being rendered interactively. The same handler runs unchanged when invoked via MCP.
58
-
59
- ### `HelpfulError` — the only error class
60
-
61
- **Rule:** every error raised inside the application must be (or be wrapped into) a `HelpfulError`. A bare `throw new Error(...)` is a bug. The mount adapters (`mountAsCommanderCommand`, `mountAsMcpTool`) refuse to render anything else — they catch unknown errors and convert them, but linting / tests should fail when a non-`HelpfulError` reaches the surface.
62
-
63
- ```ts
64
- // src/errors.ts
65
- export type ErrorKind =
66
- | 'input_error' // bad input from the user/LLM — not retryable as-is
67
- | 'not_found' // requested resource doesn't exist
68
- | 'conflict' // path/version already exists where it shouldn't
69
- | 'auth_error' // upstream auth failed (mcpx fetcher, anthropic key, etc.)
70
- | 'network_error' // transient network failure — retryable
71
- | 'unsupported_mime' // converter doesn't know how to handle this type
72
- | 'partial_failure' // multi-entry op (dir/glob ingest) had per-entry failures
73
- | 'internal_error'; // bug — should never reach the user
74
-
75
- export class HelpfulError extends Error {
76
- readonly kind: ErrorKind;
77
- readonly hint: string; // REQUIRED. The actionable next step. Shown to humans AND LLMs.
78
- readonly details?: unknown; // optional structured payload (per-entry failures, etc.)
79
- readonly cause?: unknown; // original error if wrapped
80
-
81
- constructor(args: {
82
- kind: ErrorKind;
83
- message: string;
84
- hint: string; // ← non-optional by type
85
- details?: unknown;
86
- cause?: unknown;
87
- }) {
88
- super(args.message);
89
- if (!args.hint || !args.hint.trim()) {
90
- throw new Error('HelpfulError requires a non-empty hint');
91
- }
92
- this.name = 'HelpfulError';
93
- this.kind = args.kind;
94
- this.hint = args.hint;
95
- this.details = args.details;
96
- this.cause = args.cause;
97
- }
98
- }
99
-
100
- // Helper: wrap an unknown error so callers can `try { ... } catch (e) { throw asHelpful(e, 'while reading PDF', 'Try re-running with --force, or check that the file is readable.') }`
101
- export function asHelpful(
102
- cause: unknown,
103
- context: string,
104
- hint: string,
105
- kind: ErrorKind = 'internal_error',
106
- ): HelpfulError;
107
- ```
108
-
109
- The constructor's `hint` parameter is statically required (object-arg pattern) AND validated at runtime — there is no path to construct a hint-less error. PRs that catch a `HelpfulError` and re-throw with a less specific hint should be rejected in review.
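The plan declares `asHelpful` by signature only. One possible body is sketched below, with a compact copy of the class so the block stands alone. Two choices here are assumptions rather than anything the plan states: the wrapped message is formatted as `<context>: <reason>`, and an error that is already a `HelpfulError` is returned untouched so that a specific hint is never replaced by a vaguer one.

```typescript
type ErrorKind =
  | "input_error" | "not_found" | "conflict" | "auth_error"
  | "network_error" | "unsupported_mime" | "partial_failure" | "internal_error";

class HelpfulError extends Error {
  readonly kind: ErrorKind;
  readonly hint: string;
  readonly details?: unknown;
  // Declared for typing only; the value is assigned in the constructor.
  declare readonly cause?: unknown;

  constructor(args: { kind: ErrorKind; message: string; hint: string; details?: unknown; cause?: unknown }) {
    super(args.message);
    if (!args.hint || !args.hint.trim()) {
      throw new Error("HelpfulError requires a non-empty hint");
    }
    this.name = "HelpfulError";
    this.kind = args.kind;
    this.hint = args.hint;
    this.details = args.details;
    this.cause = args.cause;
  }
}

// Sketch of the wrapper: message format and pass-through rule are assumptions.
function asHelpful(
  cause: unknown,
  context: string,
  hint: string,
  kind: ErrorKind = "internal_error",
): HelpfulError {
  // Already helpful: keep the original, more specific hint.
  if (cause instanceof HelpfulError) return cause;
  const reason = cause instanceof Error ? cause.message : String(cause);
  return new HelpfulError({ kind, message: `${context}: ${reason}`, hint, cause });
}
```

The bare `Error` thrown for an empty hint mirrors the plan's own constructor; it signals a programming mistake rather than a user-facing failure.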
110
-
111
- #### Hint quality bar
112
-
113
- A good hint names the next action concretely. Examples:
114
-
115
- | Bad hint | Good hint |
116
- | ----------------------------------------- | ----------------------------------------------------------------------------------------------- |
117
- | `"Check your config."` | `"Run \`ctx config show\` to see the active config, or set ANTHROPIC_API_KEY to enable LLM fallback."` |
118
- | `"File not found."` | `"No file at logical_path 'docs/auth.md'. Run \`ctx ls docs/\` to see what's there."` |
119
- | `"Auth failed."` | `"mcpx returned 401 from server 'firecrawl'. Run \`mcpx auth firecrawl\` and retry."` |
120
- | `"Glob matched no files."` | `"Glob './*.md' matched 0 files. Try a broader pattern (e.g. './**/*.md') or relax --exclude."` |
121
- | `"Unsupported file type: image/heic."` | `"image/heic isn't supported by the native pipeline. Convert to PNG/JPEG first, or pass --force-llm to use the vision fallback."` |
122
-
123
- #### Rendering
124
-
125
- `mountAsCommanderCommand` wraps every handler. On `HelpfulError`:
126
-
127
- ```
128
- Interactive (TTY):
129
- ✗ ctx add: <message in red>
130
- hint: <hint in dim/yellow>
131
- [details: pretty-printed when present]
132
- exit code = mapKindToExit(kind) // input_error=2, not_found=3, conflict=4, auth_error=5, network_error=6, unsupported_mime=7, partial_failure=8, internal_error=1
133
-
134
- Non-interactive (--json or piped):
135
- stdout: <empty or partial result up to the point of failure>
136
- stderr: {"ok": false, "error": {"kind": "...", "message": "...", "hint": "...", "details": ...}}\n
137
- exit code = same as above
138
- ```
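The exit-code mapping and the non-interactive stderr payload described above can be sketched as follows. The codes are taken from the rendering block; `renderJsonError` is a hypothetical helper name, and `mapKindToExit` is implemented here as a plain table lookup, which is only one way to write it.

```typescript
type ErrorKind =
  | "input_error" | "not_found" | "conflict" | "auth_error"
  | "network_error" | "unsupported_mime" | "partial_failure" | "internal_error";

// Exit codes as listed in the rendering block.
const EXIT_CODES: Record<ErrorKind, number> = {
  internal_error: 1,
  input_error: 2,
  not_found: 3,
  conflict: 4,
  auth_error: 5,
  network_error: 6,
  unsupported_mime: 7,
  partial_failure: 8,
};

function mapKindToExit(kind: ErrorKind): number {
  return EXIT_CODES[kind];
}

// Hypothetical helper: one newline-terminated JSON line for stderr.
// JSON.stringify omits `details` when it is undefined.
function renderJsonError(err: { kind: ErrorKind; message: string; hint: string; details?: unknown }): string {
  const payload = {
    ok: false,
    error: { kind: err.kind, message: err.message, hint: err.hint, details: err.details },
  };
  return JSON.stringify(payload) + "\n";
}
```

Typing the table as `Record<ErrorKind, number>` makes the compiler reject a new `ErrorKind` member that has no exit code.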
139
-
140
- `mountAsMcpTool` wraps every handler. On `HelpfulError`:
141
-
142
- ```
143
- MCP tool result (returned, not thrown):
144
- isError: true
145
- content: [{ type: "text", text: "<message>\n\nhint: <hint>" }]
146
- structuredContent: { error: { kind, message, hint, details? } }
147
- ```
148
-
149
- The `hint` always lands in front of both the human reading the terminal and the LLM consuming the MCP response — verbatim, same string. No translation layer.
150
-
151
- ### Logging vs. errors
152
-
153
- - **Logger lines** (info/warn/debug) go to stderr and are advisory. They never become errors.
154
- - **Errors** are thrown, caught at the mount boundary, rendered once. Operations should never log-and-rethrow — that double-renders.
155
- - **Spinners** describe the *current* operation; the spinner's failure path on a thrown `HelpfulError` is to fail with the error's `message` as the failed-state label, then the renderer prints the hint underneath.
156
-
157
- ---
158
-
159
- ## Database Schema (DuckDB)
160
-
161
- `src/db/migrations/001-init.sql`:
162
-
163
- ### Versioning model
164
-
165
- `files` is **append-only**. Every successful ingest or content-changing refresh inserts a new row for that `logical_path` with a fresh `version_id` (a millisecond-precision TIMESTAMP). The "current" version of a path is the row with `MAX(version_id)` for that path, provided that row is not a tombstone; a path whose latest row is a tombstone has no current version. All MCP tools default to operating on the current version; every read-shaped tool accepts an optional `version` parameter to address an older snapshot.
166
-
167
- - Deletes are tombstones — they insert a new row with `tombstone=TRUE` and `content=''` rather than removing data.
168
- - `chunks` are scoped to `(logical_path, version_id)` so historical search would be possible later. By default the FTS + semantic queries filter to current versions only via the `current_files` view.
169
-
170
- ```sql
171
- -- Content-addressed binary store. Originals of every ingested artifact live
172
- -- here, deduped by sha256. Many `files` rows can share one blob.
173
- CREATE TABLE blobs (
174
- sha256 TEXT PRIMARY KEY,
175
- mime_type TEXT NOT NULL,
176
- size_bytes BIGINT NOT NULL,
177
- bytes BLOB NOT NULL,
178
- created_at TIMESTAMP NOT NULL DEFAULT now()
179
- );
180
-
181
- CREATE TABLE files (
182
- logical_path TEXT NOT NULL, -- "docs/api/auth.md" — what agents see
183
- version_id TIMESTAMP NOT NULL DEFAULT now(), -- doubles as version label; ms precision
184
- tombstone BOOLEAN NOT NULL DEFAULT FALSE,
185
- source_type TEXT NOT NULL, -- 'local' | 'remote' | 'inline'
186
- source_path TEXT, -- abs filesystem path or URL (NULL for inline writes)
187
- source_mtime_ms BIGINT, -- last seen mtime (local files only)
188
- source_sha256 TEXT, -- sha256 of original raw bytes (NULL on tombstone). Equals blob_sha256 for non-inline rows.
189
- blob_sha256 TEXT REFERENCES blobs(sha256), -- pointer to the original bytes (NULL when source_type='inline' or tombstoned)
190
- content_sha256 TEXT, -- sha256 of converted markdown surrogate
191
- content TEXT, -- converted markdown surrogate
192
- description TEXT, -- ALWAYS-PRESENT one-paragraph summary (LLM-generated; covers text and binary alike). Prepended to every chunk's embedded text.
193
- mime_type TEXT,
194
- size_bytes BIGINT,
195
- fetcher TEXT, -- 'http' | 'mcpx' | 'local' | 'inline'
196
- fetcher_server TEXT, -- mcpx server name (e.g. 'firecrawl', 'google-docs', 'github') — NULL unless fetcher='mcpx'
197
- fetcher_tool TEXT, -- mcpx tool name (e.g. 'scrape', 'get_doc') — NULL unless fetcher='mcpx'
198
- fetcher_args JSON, -- full args object passed to the mcpx tool — replayable as-is on refresh
199
- refresh_frequency_sec INTEGER, -- NULL = never auto-refresh
200
- refreshed_at TIMESTAMP,
201
- last_refresh_status TEXT, -- 'ok' | 'unchanged' | 'failed:<reason>'
202
- change_note TEXT, -- optional human/agent annotation: "manual edit", "refresh: source updated", etc.
203
- created_at TIMESTAMP NOT NULL DEFAULT now(),
204
- PRIMARY KEY (logical_path, version_id)
205
- );
206
-
207
- -- Latest non-tombstoned version per logical_path. All MCP/CLI defaults filter through this view.
208
- CREATE VIEW current_files AS
209
- SELECT f.* FROM files f
210
- WHERE (f.logical_path, f.version_id) IN (
211
- SELECT logical_path, MAX(version_id) FROM files GROUP BY logical_path
212
- )
213
- AND f.tombstone = FALSE;
214
-
215
- CREATE TABLE chunks (
216
- logical_path TEXT NOT NULL,
217
- version_id TIMESTAMP NOT NULL,
218
- chunk_index INTEGER NOT NULL,
219
- chunk_content TEXT NOT NULL, -- raw markdown segment (what membot_read returns when slicing)
220
- search_text TEXT NOT NULL, -- "<logical_path>\n<description>\n\n<chunk_content>" — the exact string that was embedded and is FTS-indexed
221
- embedding FLOAT[384] NOT NULL, -- vector of search_text
222
- PRIMARY KEY (logical_path, version_id, chunk_index),
223
- FOREIGN KEY (logical_path, version_id) REFERENCES files(logical_path, version_id)
224
- );
225
-
226
- -- Chunks belonging to current versions only. Search joins through this.
227
- CREATE VIEW current_chunks AS
228
- SELECT c.* FROM chunks c
229
- JOIN current_files cf USING (logical_path, version_id);
230
- ```
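The `current_files` view can be mirrored in memory to make its semantics concrete: take the newest row per path first, then drop paths whose newest row is a tombstone. This is an illustrative sketch only. `FileVersion` is a hypothetical trimmed row shape and `version_id` is modeled as epoch milliseconds rather than a DuckDB TIMESTAMP.

```typescript
// Hypothetical trimmed row: just the columns the view logic depends on.
interface FileVersion {
  logical_path: string;
  version_id: number; // epoch milliseconds, standing in for TIMESTAMP
  tombstone: boolean;
}

// Mirrors the view: MAX(version_id) per logical_path, then tombstone = FALSE.
function currentFiles(rows: FileVersion[]): FileVersion[] {
  const latest = new Map<string, FileVersion>();
  for (const row of rows) {
    const seen = latest.get(row.logical_path);
    if (!seen || row.version_id > seen.version_id) {
      latest.set(row.logical_path, row);
    }
  }
  // A deleted path stays deleted: its older versions do not become current again.
  return [...latest.values()].filter((row) => !row.tombstone);
}
```

The order of the two steps matters. Filtering tombstones before taking the maximum would resurrect the last live version of a deleted path.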
231
-
232
- `src/db/migrations/002-fts.sql`: `PRAGMA create_fts_index('current_chunks', 'rowid', 'search_text', stemmer='porter')` — indexes the prepended search_text (filename + description + chunk content) so keyword hits surface even when the matching term is in the path or description, not the body. Rebuilt by `ctx reindex` whenever versions are added/tombstoned.
233
-
234
- Tree exploration is `SELECT logical_path FROM current_files`, grouped client-side by `/` prefix (synthesised — there are no real directories).
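The client-side grouping by `/` prefix could look like the sketch below. `TreeNode`, `buildTree`, and `renderTree` are hypothetical names, and the two-space indented rendering is an assumption about output style, not the format `ctx tree` actually prints.

```typescript
interface TreeNode {
  name: string;
  children: Map<string, TreeNode>;
}

// Build a synthetic directory tree from flat logical paths.
function buildTree(paths: string[]): TreeNode {
  const root: TreeNode = { name: "", children: new Map() };
  for (const path of paths) {
    let node = root;
    for (const part of path.split("/").filter((p) => p.length > 0)) {
      let child = node.children.get(part);
      if (!child) {
        child = { name: part, children: new Map() };
        node.children.set(part, child);
      }
      node = child;
    }
  }
  return root;
}

// Render one line per node; nodes with children get a trailing slash.
function renderTree(node: TreeNode, indent = ""): string[] {
  const lines: string[] = [];
  for (const name of [...node.children.keys()].sort()) {
    const child = node.children.get(name)!;
    lines.push(`${indent}${name}${child.children.size > 0 ? "/" : ""}`);
    lines.push(...renderTree(child, indent + "  "));
  }
  return lines;
}
```

Because directories exist only as shared prefixes, removing the last file under a prefix makes the "directory" disappear with no extra bookkeeping.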
235
-
236
- ### Pruning history
237
-
238
- Versions accumulate forever by default. `ctx prune --before <duration>` and the matching `membot_prune` MCP tool drop non-current versions older than the cutoff. Tombstones are kept until at least one newer version exists, so reachability stays simple. `ctx prune` also garbage-collects orphan rows in `blobs` (sha256 not referenced by any remaining `files` row).
239
-
240
- ### Binary content & the textual-surrogate rule
241
-
242
- Some sources don't have a useful textual form: images, audio, video, executables, fonts, etc. The store handles these uniformly with one rule:
243
-
244
- > **Every ingested artifact produces a markdown surrogate.** The surrogate flows through chunking, embedding, and FTS like any other markdown. The original bytes are kept in the `blobs` table and addressed via `files.blob_sha256` for agents that can consume the native form.
245
-
246
- This means the search/embed pipeline has zero special cases for binary content — the surrogate IS the content as far as retrieval is concerned. Concretely:
247
-
248
- | Source type | Surrogate (`files.content`) | Blob kept? |
249
- | ---------------------- | ---------------------------------------------------------------------- | ---------- |
250
- | markdown / text | passthrough | yes |
251
- | HTML | turndown output | yes |
252
- | PDF (text layer) | unpdf extraction | yes |
253
- | PDF (scanned, no text) | Tesseract WASM OCR → markdown | yes |
254
- | DOCX | mammoth output | yes |
255
- | image (PNG/JPEG/etc.) | Claude vision caption + Tesseract WASM OCR for any embedded text | yes |
256
- | audio | (deferred — surrogate would be a transcript when we add Whisper WASM) | yes |
257
- | anything else | LLM caption from a base64 sample, or `"(unknown binary)"` if no key | yes |
258
-
259
- The `blob_sha256` foreign key gives content-addressed dedupe automatically — re-ingesting the same image under a different logical_path stores zero new bytes.
260
-
261
- ### Always-on description (`files.description`)
262
-
263
- Every file gets an LLM-written one-paragraph description, regardless of type — including plain markdown. The description column is **prepended to every chunk's embedded text** (along with the logical path), so:
264
-
265
- - Searches like `"the OAuth diagram"` hit a PNG even though the chunk body is empty markdown.
266
- - Searches like `"meeting notes from last quarter's planning"` hit a markdown file whose body never says that phrase.
267
- - Filename signals ("auth.md", "diagrams/oauth-flow.png") are part of the embedded text, lifting recall without hurting precision because the text-prefix is short and consistent.
268
-
269
- The exact embedded string per chunk is:
270
-
271
- ```
272
- <logical_path>
273
- <description>
274
-
275
- <chunk_content>
276
- ```
277
-
278
- …stored verbatim as `chunks.search_text`. FTS is built on `search_text`, the embedding is the vector of `search_text`. Keeping `chunk_content` as a separate column means `membot_read` and the `snippet` field on search hits return the clean body without the prefix bleed-through.
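A minimal sketch of the string assembly follows. The function name `buildSearchText` and its three arguments come from the plan's project layout; the body is an assumption that simply reproduces the template above.

```typescript
// Path and description on their own lines, a blank line, then the chunk body.
function buildSearchText(logicalPath: string, description: string, chunkContent: string): string {
  return `${logicalPath}\n${description}\n\n${chunkContent}`;
}

const searchText = buildSearchText(
  "diagrams/oauth-flow.png",
  "Sequence diagram of the OAuth authorization-code flow.",
  "",
);
console.log(JSON.stringify(searchText));
```

With an empty chunk body, as for an image with no OCR text, the string still carries the path and the description, which is what lets such files match a query.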
279
-
280
- When `ANTHROPIC_API_KEY` is missing, `description` falls back to a deterministic heuristic (e.g. first heading + first 200 chars for markdown; `"<mime_type> · <size>"` for binaries) so the pipeline still works offline — just with weaker recall.
281
-
282
- ### Tesseract WASM (OCR)
283
-
284
- OCR runs as part of the converter dispatch, only on filetypes where it's likely useful:
285
-
286
- - All `image/*` types: PNG, JPEG, WebP, BMP, TIFF.
287
- - PDFs whose unpdf extraction returned an empty / very-low-text-ratio result (likely scanned).
288
-
289
- OCR output is folded into the same surrogate that the LLM caption produces — one chunked markdown body per file, with a fenced section `## Text detected via OCR` when OCR ran. No separate row, no separate index.
290
-
291
- ---
292
-
293
- ## Operations: one definition, two surfaces
294
-
295
- Each user-facing capability is defined ONCE as an **Operation** and mounted twice — as an MCP tool and as a commander CLI command. The zod input schema, output schema, description string, and handler are all single-source-of-truth. Adding a new operation means writing one file in `src/operations/` and exporting it from the registry; both the CLI and the MCP server pick it up automatically.
296
-
297
- ### `Operation<I, O>` shape (`src/operations/types.ts`)
298
-
299
- ```ts
300
- export interface Operation<I extends z.ZodObject, O extends z.ZodTypeAny> {
301
- // Tool name as agents see it (also used for the MCP tool registration).
302
- name: string; // e.g. "membot_add"
303
-
304
- // CLI subcommand name. Defaults to name with "membot_" stripped and "_" → "-".
305
- cliName?: string; // e.g. "add"
306
-
307
- // Verbatim description string. Used as BOTH the MCP tool description
308
- // and the commander .description() text. Follows the bash-prefix →
309
- // purpose → when-to-use → recovery-hint shape (see §MCP Tool Surface).
310
- description: string;
311
-
312
- // Single source of truth for the input contract.
313
- inputSchema: I;
314
- outputSchema: O;
315
-
316
- // CLI-only metadata: which input fields are positional CLI args, and
317
- // any short-flag aliases. Fields not listed in `positional` become
318
- // `--flags`; booleans become `--flag` / `--no-flag`; defaults from
319
- // .default() in the schema are honored.
320
- cli?: {
321
- positional?: (keyof z.infer<I>)[];
322
- aliases?: Partial<Record<keyof z.infer<I>, string>>; // e.g. logical_path: "-p"
323
- stdinField?: keyof z.infer<I>; // read this field from stdin if not provided
324
- };
325
-
326
- // The work itself. AppContext gives access to db, embedder, mcpx, logger, config.
327
- handler: (input: z.infer<I>, ctx: AppContext) => Promise<z.infer<O>>;
328
- }
329
- ```
330
-
331
- Field-level help comes from `.describe()` on the zod schema — used as both the MCP parameter description and the commander option description. Example:
332
-
333
- ```ts
334
- inputSchema: z.object({
335
- source: z.string().describe('Local path, URL, or `inline:<text>` literal'),
336
- logical_path: z.string().optional().describe('Logical path under the store (defaults derived from source)'),
337
- refresh_frequency: z.string().optional().describe('Refresh cadence: 5m | 1h | 24h | 7d. Omit for no auto-refresh.'),
338
- fetcher_hint: z.enum(['firecrawl','github','gdocs','http']).optional().describe('Force a specific mcpx fetcher'),
339
- }),
340
- cli: {
341
- positional: ['source'],
342
- aliases: { logical_path: '-p', refresh_frequency: '-r' },
343
- }
344
- ```
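The `refresh_frequency` string has to become the integer stored in `refresh_frequency_sec`. A minimal parser for the documented forms (`5m`, `1h`, `24h`, `7d`) is sketched below. `parseRefreshFrequency` is a hypothetical name, and only the `m`, `h`, and `d` units shown in the schema description are accepted.

```typescript
type Unit = "m" | "h" | "d";

const UNIT_SECONDS: Record<Unit, number> = { m: 60, h: 3600, d: 86400 };

// Convert "5m" | "1h" | "24h" | "7d" into seconds.
function parseRefreshFrequency(value: string): number {
  const match = /^(\d+)([mhd])$/.exec(value.trim());
  if (!match) {
    // Under the plan's rules this would be a HelpfulError of kind 'input_error';
    // a plain Error keeps the sketch self-contained.
    throw new Error(`Invalid refresh frequency '${value}'. Use forms like 5m, 1h, 24h, 7d.`);
  }
  const amount = Number(match[1]);
  const unit = match[2] as Unit;
  return amount * UNIT_SECONDS[unit];
}
```

For example, `parseRefreshFrequency("24h")` yields 86400, the number of seconds in a day.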
345
-
346
- ### Mount adapters
347
-
348
- `src/mount/mcp.ts` — `mountAsMcpTool(server, op)`:
349
- - Registers the tool with `op.name` and `op.description`.
350
- - Converts `op.inputSchema` to JSON-Schema (via `zod-to-json-schema`) for the MCP `inputSchema` field.
351
- - Wraps `op.handler` with input validation (`op.inputSchema.parse`) + output validation (`op.outputSchema.parse`) + error normalization into the `HelpfulError` result shape (`{ error: { kind, message, hint, details? } }`).
352
-
353
- `src/mount/commander.ts` — `mountAsCommanderCommand(program, op)`:
354
- - Adds a subcommand named `op.cliName ?? op.name.replace(/^membot_/, '').replaceAll('_','-')`.
355
- - Sets `.description(op.description)`. The same string the LLM sees is what `ctx --help` shows.
356
- - Walks `op.inputSchema.shape`. For each field:
357
- - If listed in `op.cli.positional` → `.argument(required ? '<name>' : '[name]', describe)`.
358
- - Else if `ZodBoolean` → `.option('--flag-name [<bool>]', describe)`, with `--no-flag-name` synthesised.
359
- - Else → `.option('--flag-name <value>', describe, defaultValue?)`. Short alias prepended if `op.cli.aliases[field]` is set.
360
- - `ZodEnum` → option with `.choices(...)`.
361
- - `ZodArray` of strings → repeatable `.option('--tag <value>', ..., collect)`.
362
- - On invocation, builds a single object from positional args + options, runs `op.inputSchema.parse(...)`, calls `op.handler`, and renders the result via `output/formatter.ts` (JSON if `--json`, otherwise human-readable per output schema).
363
-
364
- Result: the description an agent reads in `tools/list` is byte-identical to what a human reads in `ctx <cmd> --help`. Drift is impossible by construction.
365
-
366
- ### Operation registry (`src/operations/index.ts`)
367
-
368
- A single array of operations exported in the order they should appear in `--help`. `cli.ts` and `mcp/server.ts` both iterate this list and call the appropriate mount adapter. Adding a new tool means: write one file, append it here, done.
369
-
370
- ---
371
-
372
- ## Project Layout (mirrors mcpx)
373
-
374
- ```
375
- ctx/
376
- src/
377
- cli.ts # commander entry; loops operations + mountAsCommanderCommand. Plus a couple of CLI-only commands (serve, reindex).
378
- sdk.ts # exported API for embedding ctx in other apps
379
- context.ts # AppContext (config, db, embedder, mcpx client, logger)
380
- constants.ts # MEMBOT_HOME, DEFAULTS, EMBEDDING_DIMENSION=384
381
- operations/ # ★ single source of truth for every tool/command
382
- types.ts # Operation<I,O>, defineOperation()
383
- index.ts # ordered registry of all operations
384
- add.ts list.ts tree.ts read.ts write.ts search.ts remove.ts
385
- move.ts refresh.ts info.ts versions.ts diff.ts prune.ts
386
- mount/
387
- mcp.ts # mountAsMcpTool: zod → JSON-Schema, validate I/O, catch HelpfulError → MCP isError result with hint surfaced in both content[].text and structuredContent.error
388
- commander.ts # mountAsCommanderCommand: zod → .argument()/.option(), parse → validate → spinner.start → handler → spinner.success/fail → format. Catches HelpfulError, renders message+hint+exit-code; wraps unknown throws via asHelpful()
389
- zod-to-cli.ts # the field-walking logic; covers ZodString/Number/Boolean/Enum/Array/Optional/Default
390
- commands/ # CLI-only commands that don't have an MCP equivalent
391
- serve.ts # ctx serve [--http <port>] [--watch]
392
- reindex.ts # ctx reindex
393
- config/
394
- loader.ts # reads ~/.membot/config.json + env overrides
395
- schemas.ts # CtxConfig zod schema
396
- db/
397
- connection.ts # DuckDB pool, migration runner
398
- migrations/ # 001-init.sql, 002-fts.sql
399
- files.ts # files-table CRUD: insertVersion, getCurrent, getVersion, listVersions, tombstone, prune
400
- chunks.ts # chunks CRUD + searchSemantic + searchKeyword (against current_chunks view by default)
401
- blobs.ts # blobs-table CRUD: upsertBySha (no-op on existing sha), readBlob, gcOrphans
402
- views.sql # current_files, current_chunks views
403
- ingest/
404
- source-resolver.ts # expands a source arg: file | dir-walk (symlinks followed, realpath dedupe) | glob (picomatch) | URL | inline:; honors include/exclude
405
- fetcher.ts # PORT from botholomew/src/context/fetcher.ts — mcpx-driven; returns {bytes, mime, fetcher, fetcher_server, fetcher_tool, fetcher_args} so the chosen invocation can be persisted and replayed on refresh
406
- local-reader.ts # read+hash local file, detect mtime change
407
- converter/
408
- index.ts # dispatch by mime
409
- pdf.ts # unpdf (Bun-friendly PDF text extract); falls through to ocr.ts when extraction is empty/low-ratio
410
- docx.ts # mammoth
411
- html.ts # turndown
412
- image.ts # Claude vision caption + OCR fold-in
413
- text.ts # passthrough
414
- ocr.ts # Tesseract WASM (tesseract.js) — used by image.ts and pdf.ts fallback
415
- llm.ts # Claude markdown fallback (PORT botholomew/src/context/markdown-converter.ts)
416
- describer.ts # always-on one-paragraph LLM description (with deterministic offline fallback)
417
- chunker.ts # PORT botholomew/src/context/chunker.ts (deterministic + LLM modes)
418
- embedder.ts # PORT botholomew/src/context/embedder-impl.ts (WASM transformers); embeds the prepended search_text
419
- search-text.ts # buildSearchText(logical_path, description, chunk_content) — single source of truth for the embedded/FTS string
420
- ingest.ts # orchestrator: resolve → for each entry: read → blob.upsert → convert → describe → chunk → embed → insert version
421
- search/
422
- hybrid.ts # PORT botholomew/src/tools/search/fuse.ts (RRF)
423
- semantic.ts # cosine via DuckDB array_cosine_distance
424
- keyword.ts # BM25 via DuckDB FTS match_bm25()
425
- refresh/
426
- runner.ts # refreshFile(id|path) — core logic
427
- scheduler.ts # daemon tick loop for --watch
428
- mcp/
429
- server.ts # @modelcontextprotocol/sdk: stdio + streamable-http; loops operations + mountAsMcpTool
430
- instructions.ts # server-level `instructions` string (see plan §MCP)
431
- output/
432
- tty.ts # isInteractive() / useColor() / useSpinner() — single source for TTY/CI/--json/NO_COLOR detection
433
- logger.ts # spinner-aware (port from mcpx/src/output/logger.ts); routes to stderr in non-interactive mode
434
- progress.ts # nanospinner wrapper + multi-entry progress bar (used by dir/glob ingest); degrades to one info-line-per-entry when non-interactive
435
- formatter.ts # final-result rendering: aligned tables/markdown when interactive, JSON when not
436
- errors.ts # HelpfulError class + asHelpful() wrapper + ErrorKind union + mapKindToExit()
437
- scripts/
438
- apply-transformers-patch.sh # copy verbatim from mcpx/scripts
439
- test/
440
- _preload.ts # transformers patch hook
441
- ingest/ db/ search/ refresh/ mcp/
442
- patches/ # @huggingface/transformers patch (copy from mcpx)
443
- install.sh install.ps1 # copy+adapt from mcpx
444
- package.json tsconfig.json biome.json bunfig.toml
445
- README.md CLAUDE.md
446
- ```
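The layout lists `search/hybrid.ts` as a port of an RRF (reciprocal rank fusion) routine. The plan does not show that code, so the sketch below is the textbook form of the technique and not the ported implementation. `reciprocalRankFusion` is a hypothetical name, and the constant `k = 60` is a conventional default, not a value confirmed by the source.

```typescript
interface FusedHit {
  id: string;
  score: number;
}

// Each ranking is a list of ids, best first. An id's fused score is the sum
// of 1 / (k + rank) over every ranking it appears in, with rank starting at 1.
function reciprocalRankFusion(rankings: string[][], k = 60): FusedHit[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

const semanticRanking = ["a", "b", "c"]; // e.g. ordered by cosine distance
const keywordRanking = ["b", "d", "a"]; // e.g. ordered by BM25
console.log(reciprocalRankFusion([semanticRanking, keywordRanking]).map((hit) => hit.id));
```

RRF uses only ranks, so cosine distances and BM25 scores never need to be put on a common scale before fusing.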
447
-
448
- ---
449
-
450
- ## Critical Files to Port
451
-
452
- Direct ports (light edits — drop Botholomew-specific deps, swap `projectDir/context/` filesystem for DuckDB rows):
453
-
454
- | New file | Source |
455
- | ---------------------------------- | ----------------------------------------------------------------------- |
456
- | `src/ingest/embedder.ts` | `botholomew/src/context/embedder-impl.ts` |
457
- | `src/ingest/chunker.ts` | `botholomew/src/context/chunker.ts` |
458
- | `src/ingest/fetcher.ts` | `botholomew/src/context/fetcher.ts` + `fetcher-errors.ts` |
459
- | `src/ingest/converter/llm.ts` | `botholomew/src/context/markdown-converter.ts` |
460
- | `src/search/semantic.ts` | `botholomew/src/tools/search/semantic.ts` + `src/db/embeddings.ts` |
461
- | `src/search/hybrid.ts` | `botholomew/src/tools/search/fuse.ts` |
462
- | `src/search/keyword.ts` | `botholomew/src/tools/search/regexp.ts` (replace regex with FTS BM25) |
463
- | `scripts/apply-transformers-patch.sh`, `patches/` | `mcpx/scripts/...`, `mcpx/patches/` |
464
- | `src/output/logger.ts` | `mcpx/src/output/logger.ts` |
465
- | `src/cli.ts` skeleton | `mcpx/src/cli.ts` |
466
- | `install.sh`, `install.ps1` | `mcpx/install.sh`, `mcpx/install.ps1` |
467
-
468
- New code:
469
-
470
- - `src/db/*` — DuckDB schema/CRUD (replaces botholomew's `context/store.ts` filesystem layer).
471
- - `src/ingest/converter/{pdf,docx,html,text}.ts` — native conversion path before LLM fallback.
472
- - `src/ingest/local-reader.ts` — read + sha256 + mtime for local sources.
473
- - `src/refresh/{runner,scheduler}.ts` — refresh per-row + daemon tick.
474
- - `src/mcp/server.ts` and `src/mcp/tools/*` — MCP exposure (botholomew's tools assume FS sandboxing; we rewrite against DB).
475
-
476
- ---
477
-
478
- ## Data Flow — Ingest
479
-
480
- ```
481
- ctx add <source> [--path <logical>] [--refresh 24h] [--include <glob>] [--exclude <glob>]
482
-
483
- expand-source: (only for local sources)
484
- file → [file]
485
- directory → walk(symlinks_followed=true) filtered by include/exclude globs
486
- glob → picomatch over realpath()-ed entries
487
- ↓ for each resolved entry:
488
- local-reader.read() OR fetcher.fetchUrl() ← raw bytes + mime + sha256
489
- (fetchUrl also returns chosen mcpx server/tool/args
490
- → persisted on the row for fast replay-on-refresh)
491
-
492
- blobs.upsert(sha256, bytes, mime) ← content-addressed, deduped
493
-
494
- converter/index.ts dispatch(mime):
495
- pdf → unpdf.extractText() → if empty/low-ratio → ocr.tesseract()
496
- docx → mammoth.convertToMarkdown()
497
- html → turndown.turndown()
498
- image/* → vision.describeImage() + ocr.tesseract() (folded together)
499
- text / md → passthrough
500
- other → llm.convertWithClaude() (or "(unknown binary)" if no API key)
501
-
502
- describe(markdown_or_caption, mime, logical_path) ← always runs; one-paragraph LLM summary
503
- (or deterministic fallback when no API key)
504
-
505
- chunker.chunk(markdown) ← deterministic by default; LLM opt-in
506
-
507
- buildSearchText(logical_path, description, chunk) ← prepended for embedding + FTS
508
-
509
- embedder.embedBatch(search_texts) ← WASM transformers, 384-dim
510
-
511
- db.files.insertVersion + db.chunks.insertForVersion + FTS rebuild
512
- (every successful ingest produces a NEW version_id; nothing is overwritten;
513
- directory/glob ingest is one transaction per matched entry, not all-or-nothing)
514
- ```
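The `buildSearchText` step in the flow above can be sketched as follows. This is a minimal illustration, assuming the three parts are trimmed, empty parts are dropped, and the rest are joined with blank lines; the actual separator and ordering used by `src/ingest/search-text.ts` are not shown in this plan, so treat those details as assumptions.

```typescript
// Hypothetical sketch of buildSearchText: the join format is an assumption.
// The idea from the plan: prefix every chunk with its logical path and the
// file-level description so that both embedding and FTS see that context.
function buildSearchText(
  logicalPath: string,
  description: string | null,
  chunkContent: string,
): string {
  return [logicalPath, description ?? "", chunkContent]
    .map((part) => part.trim())
    .filter((part) => part.length > 0)
    .join("\n\n");
}

// Usage: a short image caption still carries its filename and description.
const text = buildSearchText(
  "docs/diagram.png",
  "Architecture diagram of the ingest pipeline.",
  "Boxes: fetcher, converter, chunker, embedder.",
);
console.log(text);
```

Keeping this in one function means the string that is embedded and the string that is indexed for BM25 cannot drift apart.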
515
-
516
- ## Data Flow — Refresh
517
-
518
- `ctx refresh <path>` (or daemon tick, or no-arg = "all due"):
519
-
520
- 1. Load `files` row.
521
- 2. If `source_type='local'`: `stat()`; if `mtime_ms == source_mtime_ms`, skip. Otherwise re-read + sha256.
522
- 3. If `source_type='remote'`:
523
- - If `fetcher='mcpx'`: directly invoke `mcpx exec <fetcher_server> <fetcher_tool> <fetcher_args>` — no agent re-routing.
524
- - If `fetcher='http'`: plain `fetch(source_path)`.
525
- - sha256 the resulting bytes.
526
- 4. Compare `new_source_sha256 == files.source_sha256`. If equal → set `refreshed_at=now()`, `last_refresh_status='unchanged'`, done.
527
- 5. If different → re-run convert → chunk → embed → **insert a new `files` row** with `version_id=now()`, new `content`/`source_sha256`/`content_sha256`, `change_note='refresh: source updated'`. Insert the new chunks under that version. Old version remains in history.
528
- 6. On failure (network, fetcher error, conversion error): leave existing version untouched, write `last_refresh_status='failed:<reason>'` onto the most recent row in place (status fields are mutable; content fields are not).
529
-
530
- `ctx refresh` with no arg = all rows where `refresh_frequency_sec IS NOT NULL AND now() > refreshed_at + (refresh_frequency_sec * INTERVAL '1 second')`.
531
-
532
- Daemon (`ctx serve --watch`): `setInterval(tick_interval_sec)` runs the no-arg refresh; default 60s. Same code path.
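The refresh steps above reduce to two small decisions: is a row due, and does the source differ from what is stored. A rough sketch follows, assuming hypothetical `FileRow` and `SourceProbe` shapes (the field names are camel-cased stand-ins for the columns named above, not the package's real types) and a lazy `hash()` so the mtime shortcut can avoid reading the file at all.

```typescript
// Hypothetical row and probe shapes, for illustration only.
interface FileRow {
  sourceType: "local" | "remote";
  sourceMtimeMs: number | null;
  sourceSha256: string;
  refreshFrequencySec: number | null;
  refreshedAtMs: number;
}

interface SourceProbe {
  mtimeMs: number | null; // null for remote sources
  hash: () => string; // reads or fetches the source and returns its sha256
}

type Decision = "skip_mtime" | "unchanged" | "new_version";

// Steps 2, 4 and 5: cheap mtime check first, then compare content hashes.
function decideRefresh(row: FileRow, probe: SourceProbe): Decision {
  if (
    row.sourceType === "local" &&
    probe.mtimeMs !== null &&
    probe.mtimeMs === row.sourceMtimeMs
  ) {
    return "skip_mtime";
  }
  return probe.hash() === row.sourceSha256 ? "unchanged" : "new_version";
}

// The no-arg "all due" filter, expressed in milliseconds.
function isDue(row: FileRow, nowMs: number): boolean {
  if (row.refreshFrequencySec === null) return false;
  return nowMs > row.refreshedAtMs + row.refreshFrequencySec * 1000;
}
```

Failure handling (step 6) is left out on purpose: it only mutates status fields on the latest row and never reaches this decision.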
533
-
534
- ---
535
-
536
- ## CLI Surface
537
-
538
- ```
539
- ctx add <source> [--path <logical>] [--include <glob>] [--exclude <glob>] [--no-follow-symlinks] [--refresh <dur>] [--fetcher <name>]
540
- ctx ls [<prefix>] [--json]
541
- ctx tree [<prefix>] [--depth <n>]
542
- ctx read <path> [--version <ts>] [--meta]
543
- ctx write <path> [--refresh <dur>] [--note <msg>] < stdin
544
- ctx search <query> [--limit 10] [--mode hybrid|semantic|keyword]
545
- ctx info <path> [--version <ts>]
546
- ctx mv <old> <new>
547
- ctx rm <path> # tombstone (history kept)
548
- ctx refresh [<path>] [--force]
549
- ctx versions <path> # list every version_id with change_note
550
- ctx diff <path> <a-version> [<b-version>] # markdown diff between two versions (defaults b=current)
551
- ctx prune [--before <dur>] # drop non-current versions older than cutoff
552
- ctx reindex
553
- ctx serve [--http <port>] [--watch] [--tick <sec>]
554
- ```
555
-
556
- Global flags (mirror mcpx): `-c/--config`, `-j/--json`, `-F/--format`, `-v/--verbose`, `--no-interactive`.
557
-
558
- `--refresh` and `--before` accept duration strings: `5m`, `1h`, `24h`, `7d`.
559
- `--version` accepts an ISO-8601 timestamp or millis-since-epoch — exact match against `files.version_id`.
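A duration parser for the strings listed above could look like the sketch below. The function name `parseDurationSec` and the error wording are assumptions; the `s` unit is included because the verification steps later use `--before 0s`.

```typescript
// Seconds per supported unit. "s" is included because `0s` appears in the
// verification steps; the other units come from the examples above.
const UNIT_SECONDS: Record<string, number> = { s: 1, m: 60, h: 3600, d: 86400 };

// Hypothetical helper: converts "5m" | "1h" | "24h" | "7d" to seconds.
function parseDurationSec(input: string): number {
  const match = /^(\d+)([smhd])$/.exec(input.trim());
  if (match === null) {
    throw new Error(`Invalid duration "${input}": expected forms like 5m, 1h, 24h, 7d`);
  }
  const [, digits = "", unit = ""] = match;
  const factor = UNIT_SECONDS[unit];
  if (factor === undefined) {
    throw new Error(`Unsupported duration unit in "${input}"`);
  }
  return Number(digits) * factor;
}

console.log(parseDurationSec("24h")); // 86400
```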
560
-
561
- ---
562
-
563
- ## MCP Tool Surface
564
-
565
- Stdio (default) and streamable-HTTP, both via `@modelcontextprotocol/sdk`. Each tool is **defined as an `Operation` in `src/operations/`** and mounted via `mountAsMcpTool`. The same `Operation` is mounted as a commander subcommand by `mountAsCommanderCommand` — descriptions, schemas, and validation are identical across the two surfaces. Errors are always `HelpfulError` instances (see §Presentation & Errors); the mount adapter renders `kind`, `message`, and the required `hint` into the MCP response so the LLM gets the same actionable guidance a human would.
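The required-hint rule can be illustrated with a small sketch. The `ErrorKind` members below are limited to kinds mentioned elsewhere in this plan plus a hypothetical `not_found`, and `toMcpResult` is an assumed helper name; only the `isError`, `content`, and `structuredContent` fields are taken from the MCP tool-result shape.

```typescript
// Kinds named in this plan, plus a hypothetical "not_found".
type ErrorKind = "auth_error" | "unsupported_mime" | "nothing_matched" | "not_found";

// Sketch of an error type that cannot be constructed without a hint.
class HelpfulError extends Error {
  readonly kind: ErrorKind;
  readonly hint: string;

  constructor(kind: ErrorKind, message: string, hint: string) {
    super(message);
    if (hint.trim().length === 0) {
      throw new Error("HelpfulError requires a non-empty hint");
    }
    this.name = "HelpfulError";
    this.kind = kind;
    this.hint = hint;
  }
}

interface McpErrorResult {
  isError: boolean;
  content: Array<{ type: "text"; text: string }>;
  structuredContent: { kind: ErrorKind; message: string; hint: string };
}

// Assumed adapter: renders kind, message and hint for the MCP surface.
function toMcpResult(err: HelpfulError): McpErrorResult {
  return {
    isError: true,
    content: [{ type: "text", text: `${err.message}\nHint: ${err.hint}` }],
    structuredContent: { kind: err.kind, message: err.message, hint: err.hint },
  };
}
```

A CLI adapter would read the same three fields, which is what keeps the guidance identical on both surfaces.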
566
-
567
- ### Worked example: `membot_add`
568
-
569
- ```ts
570
- // src/operations/add.ts
571
- import { z } from 'zod';
572
- import { defineOperation } from './types';
573
- import { ingest } from '../ingest/ingest';
574
-
575
- export const add = defineOperation({
576
- name: 'membot_add',
577
- cliName: 'add',
578
- description: `[[ bash equivalent: ingest a source ]] Ingest a new source into
579
- the store: a local file path OR a URL OR an inline:<text> literal. URLs are
580
- fetched via mcpx (the chosen server + tool + args are stored so refresh
581
- replays the exact invocation). PDF/DOCX/HTML are converted to markdown —
582
- native libs first, LLM fallback for messy/scanned input. Setting
583
- refresh_frequency enables automatic refresh from the daemon. Always creates
584
- a NEW version; existing versions stay queryable via membot_versions.`,
585
- inputSchema: z.object({
586
- source: z.string().describe('Local path, URL, or `inline:<text>` literal'),
587
- logical_path: z.string().optional().describe('Logical path under the store (defaults derived from source)'),
588
- refresh_frequency: z.string().optional().describe('Refresh cadence: 5m | 1h | 24h | 7d. Omit for no auto-refresh.'),
589
- fetcher_hint: z.enum(['firecrawl','github','gdocs','http']).optional().describe('Force a specific mcpx fetcher'),
590
- change_note: z.string().optional().describe('Free-text note attached to the new version'),
591
- }),
592
- outputSchema: z.object({
593
- logical_path: z.string(),
594
- version_id: z.string(),
595
- mime_type: z.string().nullable(),
596
- size_bytes: z.number(),
597
- fetcher: z.string(),
598
- source_sha256: z.string(),
599
- }),
600
- cli: {
601
- positional: ['source'],
602
- aliases: { logical_path: '-p', refresh_frequency: '-r', change_note: '-m' },
603
- },
604
- handler: async (input, ctx) => ingest(input, ctx),
605
- });
606
- ```
607
-
608
- This single definition produces:
609
-
610
- - **MCP tool** `membot_add` with the description above, JSON-Schema input derived from zod, and validated output.
611
- - **CLI command** `ctx add <source> [-p <path>] [-r <dur>] [--fetcher-hint <name>] [-m <note>]` whose `--help` text is byte-identical to the description above.
612
-
613
- ### Server-level instructions
614
-
615
- These are sent as the MCP server's top-level `instructions` field — the LLM sees them once when the server is connected. They frame how the tool surface should be used:
616
-
617
- ```
618
- You have a persistent context store. Files live as versioned markdown rows
619
- addressed by logical path (e.g. "research/threat-models/llm.md"). The store
620
- is a hybrid search index: every file is chunked, embedded locally, and
621
- indexed with BM25 — so prefer membot_search to membot_read+grep for discovery.
622
-
623
- Workflow:
624
- 1. membot_tree or membot_search to find what already exists before adding new content.
625
- 2. membot_add to ingest a local file, a URL, or a remote document. URLs are
626
- fetched via mcpx (Firecrawl/Google-Docs/GitHub/HTTP); the chosen
627
- invocation is stored so refresh is fast and deterministic.
628
- 3. membot_read or membot_search hits to consume content.
629
- 4. membot_write to record agent-authored notes (source_type='inline').
630
-
631
- Versioning:
632
- - Every ingest, refresh, or write that changes content creates a NEW
633
- version_id (a timestamp). Older versions stay queryable via the
634
- `version` parameter on membot_read / membot_info / membot_versions / membot_diff.
635
- - All other tools default to the current (latest, non-tombstoned) version.
636
- - membot_delete is a tombstone — history is preserved unless membot_prune runs.
637
-
638
- Refresh:
639
- - Each row has source metadata. membot_refresh re-reads the source, hashes
640
- it, and only re-embeds when bytes changed. Safe to call often.
641
- - If a file has refresh_frequency_sec set, the daemon refreshes it
642
- automatically — you do not need to schedule it yourself.
643
-
644
- When in doubt: search before you read, read before you write, and prefer
645
- adding the source URL once (with a refresh interval) over copy-pasting
646
- content that will go stale.
647
- ```
648
-
649
- ### Tool catalog
650
-
651
- Description text below is the verbatim string sent to the LLM. Style: **bash-equivalent prefix → one-line purpose → when-to-use → constraints/recovery hints**, modeled on botholomew + Arcade's tool-description pattern.
652
-
653
- #### `membot_search`
654
-
655
- ```
656
- [[ bash equivalent: grep -r + semantic-search ]] Hybrid search over the context
657
- store. Pass `query` (natural language → semantic) and/or `pattern` (regex over
658
- chunk text); pass both for the strongest signal — hits matched by both float
659
- to the top via reciprocal rank fusion. Searches the CURRENT version of every
660
- file by default; set `include_history=true` to also search older versions.
661
- This is the primary discovery tool — prefer it over membot_read+scan.
662
- ```
663
-
664
- Inputs: `query?`, `pattern?`, `mode?` (`hybrid`|`semantic`|`keyword`, default `hybrid`), `path_prefix?`, `limit?` (default 10), `include_history?` (default false), `ignore_case?`.
665
- Output: `[{logical_path, version_id, chunk_index, snippet, score, semantic_score, keyword_score}]`.
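Reciprocal rank fusion, as mentioned in the description above, scores each hit by summing `1 / (k + rank)` over every ranked list it appears in. The sketch below is a generic version of that formula, not the ported `fuse.ts`; the constant `k = 60` is the conventional default and is an assumption here.

```typescript
interface FusedHit {
  id: string;
  score: number;
}

// Generic RRF over any number of ranked id lists (best hit first).
// k = 60 is a common default, assumed here rather than taken from the package.
function reciprocalRankFusion(rankedLists: string[][], k = 60): FusedHit[] {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score || a.id.localeCompare(b.id));
}

// Usage: "b" appears in both lists, so it outranks the semantic-only top hit.
const fused = reciprocalRankFusion([
  ["a", "b", "c"], // semantic ranking
  ["b", "d"], // keyword ranking
]);
console.log(fused.map((hit) => hit.id));
```

Because only ranks are used, the cosine distances and BM25 scores never need to be normalised onto a common scale.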
666
-
667
- #### `membot_tree`
668
-
669
- ```
670
- [[ bash equivalent: tree ]] Render the logical-path tree of the current store.
671
- Tree is synthesised from "/" segments in logical_path — there are no real
672
- directories. Tombstoned and historical versions are hidden. Use this before
673
- membot_add to pick a sensible logical path.
674
- ```
675
-
676
- Inputs: `prefix?`, `max_depth?` (default 4).
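Since there are no real directories, the tree has to be rebuilt from path segments on every call. A minimal sketch, assuming a hypothetical `TreeNode` shape, alphabetical ordering, and two-space indentation (none of which are specified in this plan):

```typescript
// Hypothetical in-memory node; the store itself only holds flat logical paths.
interface TreeNode {
  name: string;
  children: Map<string, TreeNode>;
}

// Split each logical path on "/" and merge shared prefixes into one tree.
function buildTree(logicalPaths: string[]): TreeNode {
  const root: TreeNode = { name: "", children: new Map() };
  for (const path of logicalPaths) {
    let node = root;
    for (const segment of path.split("/").filter((s) => s.length > 0)) {
      let child = node.children.get(segment);
      if (child === undefined) {
        child = { name: segment, children: new Map() };
        node.children.set(segment, child);
      }
      node = child;
    }
  }
  return root;
}

// Render as indented lines, stopping at maxDepth levels (default 4, as above).
function renderTree(node: TreeNode, maxDepth = 4, depth = 0): string[] {
  if (depth >= maxDepth) return [];
  const lines: string[] = [];
  for (const name of [...node.children.keys()].sort()) {
    const child = node.children.get(name);
    if (child === undefined) continue;
    const suffix = child.children.size > 0 ? "/" : "";
    lines.push(`${"  ".repeat(depth)}${name}${suffix}`);
    lines.push(...renderTree(child, maxDepth, depth + 1));
  }
  return lines;
}

console.log(renderTree(buildTree(["docs/readme.md", "docs/api/search.md", "notes.md"])).join("\n"));
```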
677
-
678
- #### `membot_list`
679
-
680
- ```
681
- [[ bash equivalent: ls ]] List current files under an optional prefix, with
682
- size, mime type, refresh frequency, and last refresh status. Returns one row
683
- per logical_path (current version only). Pair with membot_tree for discovery,
684
- membot_search for content-based discovery.
685
- ```
686
-
687
- Inputs: `prefix?`, `limit?`, `cursor?` (paginated).
688
-
689
- #### `membot_read`
690
-
691
- ```
692
- [[ bash equivalent: cat ]] Read a stored file. By default returns the
693
- markdown surrogate (the converted/captioned text body). Pass bytes=true
694
- to instead return the original raw bytes (base64-encoded for JSON, or as
695
- an image content block when the path is an image and the MCP client
696
- supports it). Defaults to the current version; pass `version` (timestamp)
697
- to read a historical snapshot — use membot_versions to enumerate available
698
- versions. For finding content across many files, use membot_search instead
699
- of repeated membot_read calls.
700
- ```
701
-
702
- Inputs: `logical_path`, `version?`, `bytes?` (default `false`), `offset?` (line, 1-based; ignored when bytes=true), `limit?` (lines; ignored when bytes=true).
703
- Output (text mode): `{logical_path, version_id, content, description, mime_type, size_bytes, blob_available, version_is_current}`.
704
- Output (bytes mode): `{logical_path, version_id, mime_type, size_bytes, bytes_base64}` — or for image mimes, an MCP image content block.
705
-
706
- #### `membot_info`
707
-
708
- ```
709
- Inspect metadata for a file: source (local path or URL), fetcher used,
710
- refresh schedule, last refresh status, all sha256 digests, and whether
711
- the requested version is the current one. Does NOT return file content —
712
- use membot_read for that. Use this to decide whether a refresh is worth
713
- forcing or whether to trust a cached row.
714
- ```
715
-
716
- Inputs: `logical_path`, `version?`.
717
-
718
- #### `membot_versions`
719
-
720
- ```
721
- List every version of a file (newest first) with version_id, content_sha256,
722
- size, change_note, and refresh status. Use this to find the version_id you
723
- want to pass to membot_read or membot_diff. Tombstoned versions are included and
724
- flagged.
725
- ```
726
-
727
- Inputs: `logical_path`.
728
-
729
- #### `membot_diff`
730
-
731
- ```
732
- Return a unified-diff between two versions of a file. `a` is required; `b`
733
- defaults to the current version. Both `a` and `b` are version_id timestamps
734
- from membot_versions. Use to understand what a refresh actually changed before
735
- deciding to act on the new content.
736
- ```
737
-
738
- Inputs: `logical_path`, `a` (version_id), `b?` (version_id, default current).
739
-
740
- #### `membot_add`
741
-
742
- ```
743
- Ingest one or many sources. `source` accepts:
744
- - a local file path → ingests one file
745
- - a local directory path → walks recursively (symlinks followed,
746
- cycles broken by realpath cache),
747
- filtered by include/exclude globs
748
- - a glob pattern (e.g. "docs/**/*.md")→ expands relative to cwd; symlinks
749
- followed
750
- - a URL → fetched via mcpx (the chosen server
751
- + tool + args are stored so refresh
752
- replays the exact invocation)
753
- - "inline:<text>" → stores the literal as a new file
754
- PDF, DOCX, HTML, images, and other binaries are converted to markdown —
755
- native libraries first, vision/OCR for images, LLM fallback for messy or
756
- scanned input. Original bytes are kept in the blobs table; membot_read with
757
- bytes=true returns them. Setting `refresh_frequency` enables automatic
758
- refresh of every ingested file from the daemon. Each ingested file becomes
759
- a NEW version under its own logical_path; existing versions stay queryable
760
- via membot_versions. Directory/glob ingests stream one file at a time —
761
- partial failures don't abort the rest; the response lists per-entry status.
762
- ```
763
-
764
- Inputs: `source` (path | dir | glob | URL | `inline:` literal), `logical_path?` (single-source only — defaults derived from filename / URL / relative path; for dir/glob ingests this is interpreted as a *prefix* under which entries are placed using their relative path), `include?` (glob; comma-separated allowed; default `**/*`), `exclude?` (glob; default excludes `node_modules`, `.git`, `.DS_Store`, dotfiles), `follow_symlinks?` (default `true`), `refresh_frequency?` (e.g. `1h`, `24h`), `fetcher_hint?` (e.g. `firecrawl`, `github`), `change_note?`.
765
- Output: `{ingested: [{source_path, logical_path, version_id, status, error?, mime_type, size_bytes, fetcher, source_sha256}], total, ok, failed}`.
766
- Error hints: on `auth_error` from a fetcher, hint `Run: mcpx auth <server>`; on `unsupported_mime`, list supported types; on `nothing_matched` for a glob, suggest broadening `include` or removing `exclude`.
767
-
768
- #### `membot_write`
769
-
770
- ```
771
- [[ bash equivalent: tee ]] Write inline agent-authored markdown. Creates a
772
- new version (source_type='inline') under the given logical_path. Use this
773
- to persist agent notes, summaries, or synthesised context that should
774
- survive across conversations. For mirroring an external document, use
775
- membot_add with a source URL instead — that gets you refresh-on-source-change
776
- for free.
777
- ```
778
-
779
- Inputs: `logical_path`, `content` (markdown), `change_note?`, `refresh_frequency?` (rarely useful for inline).
780
-
781
- #### `membot_move`
782
-
783
- ```
784
- [[ bash equivalent: mv ]] Rename a logical_path. Creates one new version
785
- under the new path with full content carried over and tombstones the old
786
- path. History remains queryable under both names via membot_versions.
787
- ```
788
-
789
- Inputs: `from_logical_path`, `to_logical_path`.
790
-
791
- #### `membot_delete`
792
-
793
- ```
794
- [[ bash equivalent: rm ]] Tombstone a logical_path so it no longer appears
795
- in membot_list / membot_tree / membot_search. Old versions remain queryable via
796
- membot_versions and membot_read with an explicit version. Use membot_prune to
797
- permanently drop history.
798
- ```
799
-
800
- Inputs: `logical_path`.
801
-
802
- #### `membot_refresh`
803
-
804
- ```
805
- Re-read a file's source and create a new version only if the source bytes
806
- changed. Pass `logical_path` to refresh one file, or omit it to refresh
807
- every file whose refresh_frequency_sec has elapsed. Local files are
808
- detected via mtime+sha; remote files are re-fetched via the same mcpx
809
- invocation that was originally used. On auth or network failure the prior
810
- version stays current — check `last_refresh_status`.
811
- ```
812
-
813
- Inputs: `logical_path?`, `force?` (re-embed even if sha unchanged).
814
- Output: `{processed: [{logical_path, status, new_version_id?}]}`.
815
-
816
- #### `membot_prune`
817
-
818
- ```
819
- Permanently drop non-current versions older than the cutoff. Current
820
- versions and tombstones-with-no-newer-version are preserved. Use sparingly
821
- — pruned versions cannot be recovered.
822
- ```
823
-
824
- Inputs: `before` (duration like `30d`, or absolute timestamp), `dry_run?` (default true).
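One way to read the prune rule is "never touch the newest row of any logical path, and drop older rows only when they fall before the cutoff". The sketch below encodes that reading; the `VersionRow` shape and millisecond timestamps are assumptions, and a tombstone with no newer version is simply the newest row for its path.

```typescript
// Hypothetical projection of a `files` row, for illustration only.
interface VersionRow {
  logicalPath: string;
  versionMs: number; // version_id as millis-since-epoch
  tombstone: boolean;
}

// Returns the rows a prune would delete; everything else is preserved.
function selectPrunable(rows: VersionRow[], cutoffMs: number): VersionRow[] {
  const newestByPath = new Map<string, number>();
  for (const row of rows) {
    const seen = newestByPath.get(row.logicalPath);
    if (seen === undefined || row.versionMs > seen) {
      newestByPath.set(row.logicalPath, row.versionMs);
    }
  }
  return rows.filter(
    (row) =>
      row.versionMs !== newestByPath.get(row.logicalPath) && row.versionMs < cutoffMs,
  );
}
```

With `dry_run` left at its default, a caller would report this list instead of deleting it.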
825
-
826
- ---
827
-
828
- ## Config (`~/.membot/config.json`)
829
-
830
- ```jsonc
831
- {
832
- "data_dir": "~/.membot", // override MEMBOT_HOME
833
- "embedding_model": "Xenova/bge-small-en-v1.5",
834
- "embedding_dimension": 384,
835
- "chunker": { "mode": "deterministic", "target_chars": 4000, "max_chars": 15000 },
836
- "llm": {
837
- "anthropic_api_key": "", // env: ANTHROPIC_API_KEY
838
- "converter_model": "claude-haiku-4-5-20251001",
839
- "chunker_model": "claude-haiku-4-5-20251001"
840
- },
841
- "mcpx": { "config_path": "" }, // for remote fetchers
842
- "daemon": { "tick_interval_sec": 60 },
843
- "default_refresh_frequency_sec": null
844
- }
845
- ```
846
-
847
- LLM fallback is optional: if `ANTHROPIC_API_KEY` is not set, the converter dispatcher never calls the API and instead falls through to passthrough or an error.
848
-
849
- ---
850
-
851
- ## Key Dependencies (`package.json`)
852
-
853
- Runtime:
854
-
855
- - `@modelcontextprotocol/sdk` — MCP server (stdio + HTTP)
856
- - `commander` — CLI parsing (CLI subcommands generated from operations via `mountAsCommanderCommand`)
857
- - `zod-to-json-schema` — convert zod input schemas to MCP tool JSON-Schema
858
- - `@duckdb/node-api` ^1.5.x — index + FTS + cosine
859
- - `@huggingface/transformers` ^4.x + patched `onnxruntime-web` — local embeddings (port mcpx patch)
860
- - `@evantahler/mcpx` — remote fetcher orchestration
861
- - `@anthropic-ai/sdk` — LLM fallback only
862
- - `unpdf` — bun-friendly PDF → text
863
- - `mammoth` — DOCX → HTML/markdown
864
- - `turndown` — HTML → markdown
865
- - `tesseract.js` — Tesseract WASM (OCR for images and scanned PDFs)
866
- - `picomatch` — glob expansion for `ctx add` (mirror mcpx)
867
- - `gray-matter` — frontmatter parse on inbound .md files
868
- - `zod` ^4 — schemas
869
- - `ansis`, `nanospinner` — output (mirror mcpx)
870
-
871
- Dev: `@biomejs/biome`, `bun-types`.
872
-
873
- Build: `bun build --compile --minify --sourcemap ./src/cli.ts --outfile dist/ctx`. Pre-build script applies the transformers WASM patch (copy from mcpx).
874
-
875
- ---
876
-
877
- ## Verification
878
-
879
- End-to-end smoke (run after implementation):
880
-
881
- 1. `bun install && bun run build`
882
- 2. `./dist/ctx add ./README.md --path docs/readme.md` → verify row in `index.duckdb` with `source_type='local'`, `source_sha256` set, `chunks` populated.
883
- 3. `./dist/ctx add https://example.com --refresh 1h` → verify `fetcher` column set (e.g. `firecrawl` or `http`), markdown content stored, `blob_sha256` set.
884
- 4. `./dist/ctx add ./test/fixtures/sample.pdf` → verify converted markdown is non-empty (native unpdf path); `blobs` row holds the original PDF.
885
- 4a. `./dist/ctx add ./test/fixtures/scanned.pdf` → unpdf returns empty → OCR fallback runs → markdown contains OCR'd text.
886
- 4b. `./dist/ctx add ./test/fixtures/diagram.png` → vision caption + OCR folded; `description` column populated; `blobs` row holds PNG bytes.
887
- 4c. `./dist/ctx read diagram.png --bytes --format raw > out.png && diff out.png ./test/fixtures/diagram.png` → byte-exact round-trip.
888
- 4d. `./dist/ctx add ./docs --include "**/*.md" --include "**/*.txt" --exclude "**/node_modules/**"` → directory walk; result lists every `.md`/`.txt` ingested with its logical path; symlinks within `./docs` were followed without infinite loop.
889
- 4e. `./dist/ctx add "./src/**/*.ts"` → glob pattern expanded; only matching files ingested.
890
- 5. `./dist/ctx search "<query from sample>"` → returns hybrid hits with `score` and `semantic_score`.
891
- 5a. `./dist/ctx search "diagram"` → the PNG ingested in 4b is in the top hits even though its raw markdown body is short — proves the description+filename prefix lifted recall.
892
- 6. `./dist/ctx tree` → synthesised tree from logical paths.
893
- 7. Edit `README.md`, run `./dist/ctx refresh docs/readme.md` → verify a NEW `version_id` row appears (older row preserved), `source_sha256` and `content_sha256` differ from prior version.
894
- 8. `./dist/ctx refresh docs/readme.md` again (no edit) → `last_refresh_status='unchanged'`, no new version row created.
895
- 8a. `./dist/ctx versions docs/readme.md` → both versions listed, newest first; `--version <old-ts>` on `ctx read` returns the prior content; default `ctx read` returns latest.
896
- 8b. `./dist/ctx diff docs/readme.md <old-ts>` → unified diff between old and current version.
897
- 8c. `./dist/ctx rm docs/readme.md` → tombstone created; `ctx ls` and `ctx search` no longer surface it; `ctx versions` still lists history.
898
- 8d. `./dist/ctx prune --before 0s --dry-run=false` → non-current versions dropped; current version + tombstone remain.
899
- 9. `./dist/ctx serve` → connect with `mcpx exec` (or any MCP client) and call `membot_search` over stdio.
900
- 10. `./dist/ctx serve --watch --tick 5` → modify a tracked local file; within ~5s the daemon refreshes it.
901
- 11. `bun test` — unit tests covering: converter dispatch, chunker determinism, embedder dimension, refresh sha-stable skip path, hybrid search RRF, **HelpfulError-required-hint invariant** (constructing one without a hint throws), **mount adapter error rendering** (HelpfulError → MCP isError + structuredContent; HelpfulError → CLI stderr JSON in non-interactive, colorized text in interactive), **TTY detection** (CI=true forces non-interactive; --json forces non-interactive even on TTY).
902
- 12. Spot-check interactive vs non-interactive: `./dist/ctx add ./docs --include "**/*.md"` shows a progress bar in a terminal; `./dist/ctx add ./docs --include "**/*.md" | cat` emits one JSON-friendly line per entry to stderr and a single result JSON to stdout, no ANSI bytes leaking through.
903
- 13. Spot-check error UX: `./dist/ctx read missing.md` exits non-zero with a one-line message + concrete hint (e.g. `"Run \`ctx ls\` to see available paths."`); the same call as `membot_read` over MCP returns `isError: true` with `hint` set to the same string.
904
-
905
- Done when all 13 steps pass and the binary launches on darwin-arm64 without `bun` installed.
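The TTY rules exercised in steps 11 and 12 can be written as pure functions over their inputs, which makes them easy to unit test. This is a sketch under assumptions: the `TtyInputs` shape is hypothetical, and the precedence (flags first, then `CI`, then the stream's own TTY status) is inferred from the test descriptions above rather than taken from `src/output/tty.ts`.

```typescript
// Hypothetical input bundle so the logic does not read process globals directly.
interface TtyInputs {
  isTTY: boolean;
  env: Record<string, string | undefined>;
  json?: boolean; // --json
  noInteractive?: boolean; // --no-interactive
}

// --json and --no-interactive win even on a real TTY; CI=true also forces
// non-interactive mode. Otherwise follow the stream's TTY status.
function isInteractive(inputs: TtyInputs): boolean {
  if (inputs.json === true || inputs.noInteractive === true) return false;
  if (inputs.env.CI === "true") return false;
  return inputs.isTTY;
}

// NO_COLOR convention: any non-empty value disables colour.
function useColor(inputs: TtyInputs): boolean {
  const noColor = inputs.env.NO_COLOR;
  if (noColor !== undefined && noColor !== "") return false;
  return isInteractive(inputs);
}

console.log(isInteractive({ isTTY: true, env: { CI: "true" } })); // false
```

In real use the inputs would be filled from `process.stdout.isTTY`, `process.env`, and the parsed global flags.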