membot 0.0.1
- package/.claude/settings.local.json +7 -0
- package/CLAUDE.md +139 -0
- package/docs/plan.md +905 -0
- package/package.json +26 -0
package/CLAUDE.md
ADDED
|
@@ -0,0 +1,139 @@
|
|
|
1
|
+
# CLAUDE.md — `ctx`
|
|
2
|
+
|
|
3
|
+
Guidance for Claude Code when working in this repo. Pair with `docs/plan.md` (the source-of-truth design doc).
|
|
4
|
+
|
|
5
|
+
## What this project is
|
|
6
|
+
|
|
7
|
+
`ctx` is a standalone Bun CLI + MCP server that gives AI agents a persistent, versioned, searchable context store. Files (markdown, PDF, DOCX, HTML, URLs) are ingested, converted to markdown, chunked, embedded locally with `@huggingface/transformers` (WASM, 384-dim `Xenova/bge-small-en-v1.5`), and indexed in DuckDB with hybrid search (vector + BM25). Every agent-visible artifact is a row in `files`, addressed by a virtual `logical_path` — there is **no** on-disk tree of stored content.
|
|
8
|
+
|
|
9
|
+
Reference projects (read these to understand the conventions before changing anything):
|
|
10
|
+
|
|
11
|
+
- `/Users/evan/workspace/botholomew` — origin of the context system. The chunker, embedder, fetcher, markdown-converter, and hybrid search live in `src/context/` and `src/tools/search/`.
|
|
12
|
+
- `/Users/evan/workspace/mcpx` — the project this one mirrors for layout, build, distribution, logger, and CLI shape.
|
|
13
|
+
|
|
14
|
+
## Hard constraints
|
|
15
|
+
|
|
16
|
+
- **Bun-only.** No Node-only deps. `bun build --compile` produces standalone binaries; the runtime must not require Bun installed.
|
|
17
|
+
- **Local embeddings only.** `@huggingface/transformers` WASM, `Xenova/bge-small-en-v1.5`, 384-dim. Never reach for cloud embedding APIs (OpenAI/Voyage/Cohere/Anthropic embeddings) even if a reference project uses them.
|
|
18
|
+
- **DuckDB is the only store.** Content AND original bytes live in rows (`files.content`, `blobs.bytes`), not in a filesystem tree. `~/.ctx/index.duckdb` holds everything except cached model weights. The DB will get large — that's accepted.
|
|
19
|
+
- **Append-only versioning.** Every ingest, refresh that finds new bytes, write, or rename creates a new `(logical_path, version_id)` row. `version_id` is a `TIMESTAMP` (ms precision). Default queries flow through `current_files` / `current_chunks` views. Delete = tombstone, not a row removal.
|
|
20
|
+
- **MCP defaults to current.** Every MCP tool acts on the latest non-tombstoned version unless `version` is passed explicitly.
|
|
21
|
+
- **Mcpx invocations are persisted.** When `ctx_add` fetches a remote URL via mcpx, store `fetcher_server`, `fetcher_tool`, and `fetcher_args` on the row so refresh re-invokes the exact same tool — never re-route through the agent.
|
|
22
|
+
- **Native conversion first, LLM fallback for messy/binary input.** `unpdf`, `mammoth`, `turndown` handle the common cases. Tesseract WASM (`tesseract.js`) does OCR for `image/*` and for PDFs whose text extraction came back empty. Claude vision captions images; Claude markdown-converter is the last-resort fallback. Missing `ANTHROPIC_API_KEY` is not a hard error — the pipeline degrades to deterministic surrogates.
|
|
23
|
+
- **Textual surrogate is the universal interface.** Every artifact (markdown, PDF, image, audio, anything) produces a markdown body that flows through chunking + embedding + FTS. Original bytes live in `blobs` and are reachable via `ctx_read bytes=true`. Search has zero special cases for binary content.
|
|
24
|
+
- **Always describe.** `files.description` is generated for every ingested file, including plain markdown. The string `<logical_path>\n<description>\n\n<chunk_content>` is what gets embedded and FTS-indexed (stored as `chunks.search_text`); `chunks.chunk_content` keeps the raw body for clean snippet rendering.
|
|
25
|
+
- **`ctx_add` accepts directories and globs.** Single arg, polymorphic: file path, directory (recursive walk, symlinks followed via realpath dedupe), glob (`docs/**/*.md`), URL, or `inline:<text>`. Each matched entry becomes its own version under its own logical_path; partial failures are reported per-entry, not all-or-nothing.
|
|
26
|
+
- **CLI auto-renders for the environment.** TTY → spinners, progress bars, ANSI colors. Piped/`--json`/`CI=true` → JSON to stdout, structured logs to stderr, no ANSI bytes. `NO_COLOR` or `--no-color` strips ANSI colors only; spinners stay on a TTY (see the mode table in `docs/plan.md`). One code path; `src/output/tty.ts` is the single source of truth for which mode is active.
|
|
27
|
+
- **All errors are `HelpfulError`.** Bare `throw new Error(...)` is forbidden. `HelpfulError` requires a non-empty `hint` (statically and at runtime); the hint must name the next action concretely. The same hint string lands in front of both humans (CLI stderr) and LLMs (MCP `structuredContent.error.hint` and the rendered text content).
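
The polymorphic `ctx_add` argument from the constraints above could be classified roughly as follows. This is a minimal sketch, not the project's resolver: `classifySource`, `SourceKind`, and the glob heuristic are hypothetical names and choices, and telling a file from a directory would need a filesystem check that is left out here.

```typescript
// Hypothetical classifier for the single polymorphic `ctx_add` argument.
// All names are illustrative; file vs. directory needs a stat call, omitted here.
type SourceKind = "inline" | "url" | "glob" | "path";

interface ClassifiedSource {
  kind: SourceKind;
  value: string; // inline text, URL, glob pattern, or filesystem path
}

function classifySource(arg: string): ClassifiedSource {
  if (arg.startsWith("inline:")) {
    return { kind: "inline", value: arg.slice("inline:".length) };
  }
  if (/^https?:\/\//i.test(arg)) {
    return { kind: "url", value: arg };
  }
  // Assumption: any common glob metacharacter marks the argument as a glob.
  if (/[*?[\]{}]/.test(arg)) {
    return { kind: "glob", value: arg };
  }
  return { kind: "path", value: arg };
}
```

A resolver built this way would expand `glob` and directory `path` results into individual entries, so each entry can succeed or fail on its own.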
|
|
28
|
+
|
|
29
|
+
## Architecture at a glance
|
|
30
|
+
|
|
31
|
+
```
ctx_add ──► local-reader OR fetcher (mcpx) ──► converter (mime dispatch)
                                                      │
                                                      ▼
                                          chunker ──► embedder (WASM)
                                                      │
                                                      ▼
                          db.files.insertVersion + db.chunks.insertForVersion
                                                      │
                                                      ▼
                                     FTS index rebuild (current_chunks)

ctx_refresh ──► re-read source ──► sha256 compare
                                          │
                             unchanged ◄──┴──► changed ──► same pipeline as ctx_add
                           (status only)       (creates new version_id)
```
|
|
48
|
+
|
|
49
|
+
Daemon mode (`ctx serve --watch`) ticks every `tick_interval_sec` and runs the no-arg refresh path against rows whose `refresh_frequency_sec` has elapsed.
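
The "has elapsed" check that each daemon tick performs might look like the sketch below. It assumes a reduced row shape with only `refresh_frequency_sec` and `refreshed_at`; `isDue` and `selectDue` are hypothetical helper names, and treating a never-refreshed row as immediately due is an assumption.

```typescript
// Hypothetical row shape: only the two columns the daemon check needs.
interface RefreshRow {
  logical_path: string;
  refresh_frequency_sec: number | null; // null = never auto-refresh
  refreshed_at: Date | null; // null = not refreshed yet
}

// A row is due when it opts into auto-refresh and its interval has elapsed.
function isDue(row: RefreshRow, now: Date): boolean {
  if (row.refresh_frequency_sec === null) return false;
  // Assumption: rows that were never refreshed are due right away.
  if (row.refreshed_at === null) return true;
  const elapsedMs = now.getTime() - row.refreshed_at.getTime();
  return elapsedMs >= row.refresh_frequency_sec * 1000;
}

// Returns the logical paths a single tick would hand to the refresh runner.
function selectDue(rows: RefreshRow[], now: Date): string[] {
  return rows.filter((row) => isDue(row, now)).map((row) => row.logical_path);
}
```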
|
|
50
|
+
|
|
51
|
+
## Layout
|
|
52
|
+
|
|
53
|
+
```
src/
  cli.ts              # commander entry; iterates operations registry
  sdk.ts              # programmatic API for embedding ctx
  context.ts          # AppContext: config + db + embedder + mcpx + logger
  constants.ts        # CTX_HOME, EMBEDDING_DIMENSION=384, defaults
  operations/         # ★ one file per user-facing capability; single source of truth
    types.ts          # Operation<I,O>, defineOperation()
    index.ts          # ordered registry; cli + mcp both iterate this
    add.ts list.ts tree.ts read.ts write.ts search.ts remove.ts
    move.ts refresh.ts info.ts versions.ts diff.ts prune.ts
  mount/
    mcp.ts            # mountAsMcpTool — registers an Operation as an MCP tool
    commander.ts      # mountAsCommanderCommand — registers an Operation as a CLI subcommand
    zod-to-cli.ts     # introspects zod schema → commander .argument()/.option() calls
  commands/           # CLI-only commands with no MCP equivalent (serve, reindex)
  config/             # zod schema + loader (~/.ctx/config.json)
  db/                 # DuckDB connection, migrations, files.ts, chunks.ts
  ingest/             # source-resolver (file/dir/glob/url/inline), local-reader, fetcher, chunker, embedder, describer, search-text, converter/ (pdf/docx/html/image/text/ocr/llm)
  search/             # semantic.ts, keyword.ts, hybrid.ts (RRF)
  refresh/            # runner.ts (per-row), scheduler.ts (daemon)
  mcp/                # server.ts, instructions.ts
  output/             # tty.ts (mode detection), logger.ts (spinner-aware), progress.ts (multi-entry bar), formatter.ts (table/markdown/json)
  errors.ts           # HelpfulError class — the only error type allowed in handlers
test/                 # bun test, _preload.ts applies transformers patch
patches/              # @huggingface/transformers WASM patch (copy from mcpx)
scripts/              # apply-transformers-patch.sh (pre-build hook)
docs/plan.md          # source-of-truth design
```
|
|
82
|
+
|
|
83
|
+
## Coding conventions
|
|
84
|
+
|
|
85
|
+
- **One Operation, two surfaces.** Every user-facing capability is a single `Operation` in `src/operations/` with a zod input schema, zod output schema, description string, and handler. The MCP server and the commander CLI both consume this — never write a tool description twice, never define an input shape twice. The description string is the LLM-facing docstring AND the `--help` text. Field-level help comes from `.describe()` on each zod field.
|
|
86
|
+
- **Zod everywhere.** Operation I/O schemas, config schema, fetcher response shapes. Use `.describe()` on every field — that text is what the agent and human both read.
|
|
87
|
+
- **Errors are `HelpfulError` only.** See `src/errors.ts`. Required fields: `kind`, `message`, `hint`. The constructor refuses an empty hint at runtime, and the type system refuses to omit it at compile time. The mount adapters render `kind` + `message` + `hint` for both surfaces — humans see colorized output on a TTY, LLMs get the same fields back as MCP `structuredContent.error`. Hint quality bar: name a concrete next action (a command to run, a flag to set, a path to check). Vague hints like "Check your config" should fail review.
|
|
88
|
+
- **No log-and-rethrow.** Errors propagate to the mount boundary, are rendered there exactly once, then exit. Logging the error before throwing produces double-output and breaks JSON-mode parseability.
|
|
89
|
+
- **Spinners & progress are advisory.** Operations call `ctx.progress.tick(...)` and `ctx.logger.info(...)` without checking whether they're rendered. The renderer in `src/output/` decides; non-interactive mode coerces both into stderr lines or no-ops.
|
|
90
|
+
- **No duplicated handlers.** If you find yourself writing logic in `src/commands/*.ts` that an MCP tool would also want, it belongs in `src/operations/` instead. The only legitimate `src/commands/*.ts` files are CLI-only behaviors with no agent-facing meaning (`serve`, `reindex`).
|
|
91
|
+
- **Logger, not console.** Use `src/output/logger.ts` (spinner-aware, JSON/TTY-aware). `console.log` in production code is a bug.
|
|
92
|
+
- **Colors via `ansis`, spinners via `nanospinner`.** Same as mcpx.
|
|
93
|
+
- **No premature abstractions.** Three similar lines beat a generic helper. Don't build for hypothetical fetchers, hypothetical embedders, or hypothetical storage backends.
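
The "one Operation, two surfaces" convention can be illustrated without zod. In this hedged sketch the `Operation` shape, the `fields` map (standing in for zod `.describe()` text), and the `toCliHelp` / `toMcpTool` renderers are all assumptions made for illustration; the real types live in `src/operations/types.ts` and may differ.

```typescript
// Hypothetical, zod-free stand-in: one definition feeds both renderers.
interface Operation<I, O> {
  name: string;
  description: string; // LLM-facing docstring AND --help text
  fields: Record<string, string>; // field name -> help text (stands in for .describe())
  handler: (input: I) => O;
}

function defineOperation<I, O>(op: Operation<I, O>): Operation<I, O> {
  return op;
}

// CLI surface: help text is derived from the shared definition.
function toCliHelp<I, O>(op: Operation<I, O>): string {
  const options = Object.entries(op.fields).map(([name, help]) => `  --${name}  ${help}`);
  return [`ctx ${op.name}`, "", op.description, "", ...options].join("\n");
}

// MCP surface: the tool descriptor reuses the same strings verbatim.
function toMcpTool<I, O>(op: Operation<I, O>): {
  name: string;
  description: string;
  fields: Record<string, string>;
} {
  return { name: `ctx_${op.name}`, description: op.description, fields: op.fields };
}

const readOp = defineOperation({
  name: "read",
  description: "[[ bash equivalent: cat ]] Read a stored file by logical_path.",
  fields: { path: "logical_path of the file to read" },
  handler: (input: { path: string }) => ({ path: input.path }),
});
```

Because both renderers read `readOp.description`, there is no second copy of the text to drift out of sync.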
|
|
94
|
+
|
|
95
|
+
## Tool / command descriptions
|
|
96
|
+
|
|
97
|
+
Operation descriptions are the user interface — for the LLM AND for the human running `ctx <cmd> --help`. The same string is shown in both places. Every operation description follows this shape:
|
|
98
|
+
|
|
99
|
+
1. Bash-equivalent prefix where applicable: `[[ bash equivalent: cat ]]`.
|
|
100
|
+
2. One-line purpose.
|
|
101
|
+
3. When-to-use guidance — what to call before/after, what tool to prefer instead in adjacent cases.
|
|
102
|
+
4. Constraints, recovery hints, and links to other operations by name.
|
|
103
|
+
|
|
104
|
+
Server-level `instructions` (the string handed to the MCP client when it connects) is defined in `src/mcp/instructions.ts`. It frames the discovery → ingest → consume → write workflow and explicitly tells the agent how versioning, refresh, and the `version` parameter behave. CLI users get the same framing through `ctx --help` (commander's top-level help). Update both that file and `docs/plan.md` together if you change the operation surface.
|
|
105
|
+
|
|
106
|
+
## Testing
|
|
107
|
+
|
|
108
|
+
- `bun test`. Test preload at `test/_preload.ts` applies the transformers WASM patch.
|
|
109
|
+
- Use a real ephemeral DuckDB file per test (don't mock the DB).
|
|
110
|
+
- Real fixtures for converters (`test/fixtures/sample.pdf`, `sample.docx`, `sample.html`).
|
|
111
|
+
- Mock the network only for fetcher tests; everything else hits the real local pipeline.
|
|
112
|
+
- Versioning paths to cover: insert creates v1, refresh-unchanged creates no new version, refresh-changed creates v2, `current_files` returns v2, explicit `version=v1` returns v1, tombstone hides from `current_files` but `versions` still lists it, `prune --before` drops non-current rows.
|
|
113
|
+
|
|
114
|
+
## Build & distribution
|
|
115
|
+
|
|
116
|
+
- Pre-build: `scripts/apply-transformers-patch.sh` (copy verbatim from mcpx).
|
|
117
|
+
- Build: `bun build --compile --minify --sourcemap ./src/cli.ts --outfile dist/ctx`.
|
|
118
|
+
- Targets: darwin-arm64, darwin-x64, linux-arm64, linux-x64, windows-x64, windows-arm64.
|
|
119
|
+
- Distribution: `install.sh` / `install.ps1` mirror mcpx; published to NPM as well.
|
|
120
|
+
|
|
121
|
+
## Things to avoid
|
|
122
|
+
|
|
123
|
+
- Re-introducing a filesystem store under `~/.ctx/context/`. The store is rows.
|
|
124
|
+
- Cloud embeddings. Local WASM only.
|
|
125
|
+
- Mutating an existing version's `content` / `content_sha256` / `chunks`. Those fields are immutable once the row is written — corrections are new versions.
|
|
126
|
+
- Re-routing a remote refresh through the LLM/agent. Replay the stored `fetcher_*` columns directly via mcpx.
|
|
127
|
+
- Tools that return content blobs without a `version_id` — every read-shaped response must echo which version it served.
|
|
128
|
+
- A separate `ctx_read_blob` tool. Bytes are reachable via `ctx_read` with `bytes=true`. One read tool, one mental model.
|
|
129
|
+
- Embedding `chunk_content` raw. Always embed `search_text` (the prepended `<path>\n<description>\n\n<body>`) — that's what `chunks.search_text` holds and what FTS is built on.
|
|
130
|
+
- Aborting a directory/glob ingest because one entry failed. Stream per-entry results; report failures alongside successes.
|
|
131
|
+
- Throwing `new Error(...)` anywhere in `src/operations/`, `src/ingest/`, `src/db/`, `src/refresh/`, or `src/mcp/`. Always `HelpfulError`. Wrap external errors with `asHelpful(cause, context, hint, kind)`.
|
|
132
|
+
- Writing colorized output unconditionally. Always go through `src/output/` so non-interactive callers get clean JSON.
|
|
133
|
+
- A `HelpfulError` whose hint just paraphrases the message ("File not found. Hint: file is missing."). Hint must name a concrete next step — a command, a flag, a path to inspect.
|
|
134
|
+
- **Defining a tool description in two places.** If you catch yourself writing copy in `src/mcp/...` that also exists in `src/commands/...`, stop — make it an `Operation`.
|
|
135
|
+
- Hand-rolling a JSON Schema for an MCP tool. Always derive it from the zod input schema via the mount adapter.
|
|
136
|
+
|
|
137
|
+
## When in doubt
|
|
138
|
+
|
|
139
|
+
Read `docs/plan.md`. If the plan and code disagree, the plan wins until a deliberate update lands in both.
|
package/docs/plan.md
ADDED
|
@@ -0,0 +1,905 @@
|
|
|
1
|
+
# `ctx` — Standalone AI-Agent Context Store
|
|
2
|
+
|
|
3
|
+
## Context
|
|
4
|
+
|
|
5
|
+
`ctx` is a new standalone Bun project at `/Users/evan/workspace/ctx` that extracts and reshapes the context system currently embedded in `botholomew` (`/Users/evan/workspace/botholomew/src/context/`, `src/tools/`, `src/db/`). Distribution and CLI shape mirror `mcpx` (`/Users/evan/workspace/mcpx`).
|
|
6
|
+
|
|
7
|
+
Goals (from user):
|
|
8
|
+
|
|
9
|
+
- Files are **stored only in the database** — not on disk as a tree of `.md` files. Logical paths are virtual.
|
|
10
|
+
- Hybrid search (vector + BM25) over chunked content.
|
|
11
|
+
- Tree exploration synthesised from logical paths.
|
|
12
|
+
- `ctx add <source>` works for local paths AND remote URLs, with **mcpx-driven mini-agents** fetching remote content (Firecrawl, Google Docs, GitHub, raw HTTP). The exact mcpx invocation (server + tool + args) is stored on the row so refresh can re-invoke it directly — no agent/routing re-run.
|
|
13
|
+
- Everything is converted to **markdown**: PDF, DOCX, HTML, plain-text, etc. **Native libs first, LLM fallback** for messy/scanned content.
|
|
14
|
+
- Each row tracks `source_path`, `source_sha256`, `refreshed_at`, `refresh_frequency_sec`. `ctx refresh <path>` re-reads the original source, re-hashes, and re-converts/re-embeds only if the SHA changed. Local files compared by content hash; remote URLs re-fetched via the same fetcher.
|
|
15
|
+
- Both **on-demand** (`ctx refresh`) and **daemon** (`ctx serve --watch`) refresh modes.
|
|
16
|
+
- Bun-compiled standalone executables (darwin/linux/windows × arm64/x64), like mcpx.
|
|
17
|
+
- Stdio + HTTP MCP server exposing read/write/add/search/tree/refresh tools.
|
|
18
|
+
- System-wide config + data dir at `~/.membot/` (override via `--config` or `MEMBOT_HOME`).
|
|
19
|
+
- **Embeddings are LOCAL only** — `@huggingface/transformers` WASM with `Xenova/bge-small-en-v1.5` (384-dim). No cloud embedding APIs. (See memory: `feedback_local_embeddings_only.md`.)
|
|
20
|
+
|
|
21
|
+
---
|
|
22
|
+
|
|
23
|
+
## Architecture Snapshot
|
|
24
|
+
|
|
25
|
+
```
~/.membot/
  config.json     # user config
  index.duckdb    # all content, chunks, embeddings, FTS
  models/         # cached @huggingface/transformers WASM weights
  logs/           # daemon logs (when --watch)
```
|
|
32
|
+
|
|
33
|
+
DuckDB is the only persistent store. There is **no** `~/.membot/context/` filesystem tree — the agent's "files" are rows.
|
|
34
|
+
|
|
35
|
+
---
|
|
36
|
+
|
|
37
|
+
## Presentation & Errors
|
|
38
|
+
|
|
39
|
+
### Two presentation modes
|
|
40
|
+
|
|
41
|
+
The CLI auto-detects its environment and renders appropriately. There is **one** code path for output — the logger and formatter inspect the environment once at startup and degrade gracefully.
|
|
42
|
+
|
|
43
|
+
| Condition | Mode | Behavior |
| -------------------------------------------------------- | ---------------- | ---------------------------------------------------------------------------------------------- |
| stdout is a TTY AND stderr is a TTY AND `--json` not set | **interactive** | ANSI colors, `nanospinner` spinners during work, progress bars for multi-entry ops, aligned tables |
| stdout is piped, redirected, or `--json` is set | **non-interactive** | No spinners, no progress bars, no colors. JSON to stdout, structured logs to stderr. Stable, parseable. |
| `CI=true` env var set | non-interactive (forced) | Same as above; never accidentally emit ANSI/spinners in CI logs |
| `--no-color` flag or `NO_COLOR` env var | non-interactive (colors only) | Spinners stay if TTY, but no ANSI color codes (FORCE_COLOR overrides) |
|
|
49
|
+
|
|
50
|
+
Implementation lives in `src/output/`:
|
|
51
|
+
|
|
52
|
+
- `tty.ts` — single source of truth for `isInteractive()`, `useColor()`, `useSpinner()`. Reads `process.stdout.isTTY`, `process.stderr.isTTY`, `process.env.CI`, `NO_COLOR`, `FORCE_COLOR`, and the `--json` / `--no-interactive` flags.
|
|
53
|
+
- `logger.ts` — spinner-aware (port from `mcpx/src/output/logger.ts`); `info/warn/error/debug/writeRaw` route to stderr in non-interactive mode and don't break parseable stdout.
|
|
54
|
+
- `progress.ts` — wraps `nanospinner` + a multi-entry progress bar (used by directory/glob ingest); in non-interactive mode emits one `info` line per entry instead.
|
|
55
|
+
- `formatter.ts` — final-result rendering: aligned tables / markdown when interactive, single JSON object when not.
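
The mode table can be read as a small pure function. The sketch below is one possible reading, not the contents of `tty.ts`: `ModeInputs`, `OutputMode`, and `detectMode` are hypothetical names, the `--no-interactive` flag is left out, and two details are assumptions because the table does not pin them down: `FORCE_COLOR=0` counts as unset, and `FORCE_COLOR` only overrides `NO_COLOR` while the session is interactive.

```typescript
// Inputs are passed in explicitly so the sketch stays pure; a real tty.ts would read
// process.stdout.isTTY, process.stderr.isTTY, process.env, and the parsed flags.
interface ModeInputs {
  stdoutIsTTY: boolean;
  stderrIsTTY: boolean;
  json: boolean; // --json flag
  noColorFlag: boolean; // --no-color flag
  env: Record<string, string | undefined>;
}

interface OutputMode {
  interactive: boolean;
  color: boolean;
  spinner: boolean;
}

function detectMode(inputs: ModeInputs): OutputMode {
  const ci = inputs.env["CI"] === "true"; // CI forces non-interactive
  const interactive = inputs.stdoutIsTTY && inputs.stderrIsTTY && !inputs.json && !ci;
  const forceColorValue = inputs.env["FORCE_COLOR"];
  // Assumption: FORCE_COLOR=0 is treated the same as unset.
  const forceColor = forceColorValue !== undefined && forceColorValue !== "0";
  const noColor = inputs.noColorFlag || inputs.env["NO_COLOR"] !== undefined;
  // NO_COLOR turns colors off but keeps spinners; FORCE_COLOR overrides NO_COLOR.
  const color = interactive && (forceColor || !noColor);
  return { interactive, color, spinner: interactive };
}
```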
|
|
56
|
+
|
|
57
|
+
The mount adapter in `src/mount/commander.ts` is responsible for opening a spinner before the handler runs and closing it (success or failure) after — operations themselves call `ctx.progress.tick()` to update progress, but they never know whether they're being rendered interactively. The same handler runs unchanged when invoked via MCP.
|
|
58
|
+
|
|
59
|
+
### `HelpfulError` — the only error class
|
|
60
|
+
|
|
61
|
+
**Rule:** every error raised inside the application must be (or be wrapped into) a `HelpfulError`. A bare `throw new Error(...)` is a bug. The mount adapters (`mountAsCommanderCommand`, `mountAsMcpTool`) refuse to render anything else — they catch unknown errors and convert them, but linting / tests should fail when a non-`HelpfulError` reaches the surface.
|
|
62
|
+
|
|
63
|
+
```ts
// src/errors.ts
export type ErrorKind =
  | 'input_error'       // bad input from the user/LLM — not retryable as-is
  | 'not_found'         // requested resource doesn't exist
  | 'conflict'          // path/version already exists where it shouldn't
  | 'auth_error'        // upstream auth failed (mcpx fetcher, anthropic key, etc.)
  | 'network_error'     // transient network failure — retryable
  | 'unsupported_mime'  // converter doesn't know how to handle this type
  | 'partial_failure'   // multi-entry op (dir/glob ingest) had per-entry failures
  | 'internal_error';   // bug — should never reach the user

export class HelpfulError extends Error {
  readonly kind: ErrorKind;
  readonly hint: string;       // REQUIRED. The actionable next step. Shown to humans AND LLMs.
  readonly details?: unknown;  // optional structured payload (per-entry failures, etc.)
  readonly cause?: unknown;    // original error if wrapped

  constructor(args: {
    kind: ErrorKind;
    message: string;
    hint: string;              // ← non-optional by type
    details?: unknown;
    cause?: unknown;
  }) {
    super(args.message);
    if (!args.hint || !args.hint.trim()) {
      throw new Error('HelpfulError requires a non-empty hint');
    }
    this.name = 'HelpfulError';
    this.kind = args.kind;
    this.hint = args.hint;
    this.details = args.details;
    this.cause = args.cause;
  }
}

// Helper: wrap an unknown error so callers can `try { ... } catch (e) { throw asHelpful(e, 'while reading PDF', 'Try re-running with --force, or check that the file is readable.') }`
export function asHelpful(
  cause: unknown,
  context: string,
  hint: string,
  kind: ErrorKind = 'internal_error',
): HelpfulError;
```
|
|
108
|
+
|
|
109
|
+
The constructor's `hint` parameter is statically required (object-arg pattern) AND validated at runtime — there is no path to construct a hint-less error. PRs that catch a `HelpfulError` and re-throw with a less specific hint should be rejected in review.
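
The plan gives only the signature of `asHelpful`. One way its body could look is sketched below, with a condensed copy of the class so the block stands alone. The `"<context>: <reason>"` message format and the pass-through of an existing `HelpfulError` are assumptions, chosen so a wrapper never replaces a specific hint with a vaguer one.

```typescript
type ErrorKind =
  | "input_error" | "not_found" | "conflict" | "auth_error"
  | "network_error" | "unsupported_mime" | "partial_failure" | "internal_error";

// Condensed copy of the HelpfulError class so the sketch is self-contained.
class HelpfulError extends Error {
  readonly kind: ErrorKind;
  readonly hint: string;
  readonly cause?: unknown;

  constructor(args: { kind: ErrorKind; message: string; hint: string; cause?: unknown }) {
    super(args.message);
    if (!args.hint || !args.hint.trim()) {
      throw new Error("HelpfulError requires a non-empty hint");
    }
    this.name = "HelpfulError";
    this.kind = args.kind;
    this.hint = args.hint;
    this.cause = args.cause;
  }
}

// Assumed implementation: the plan specifies the signature, not the body.
function asHelpful(
  cause: unknown,
  context: string,
  hint: string,
  kind: ErrorKind = "internal_error",
): HelpfulError {
  // Already helpful: pass it through instead of re-wrapping with another hint.
  if (cause instanceof HelpfulError) return cause;
  const reason = cause instanceof Error ? cause.message : String(cause);
  return new HelpfulError({ kind, message: `${context}: ${reason}`, hint, cause });
}
```

A converter would then write `catch (e) { throw asHelpful(e, "while reading PDF", "<concrete next step>", "input_error"); }` and let the mount boundary do the rendering.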
|
|
110
|
+
|
|
111
|
+
#### Hint quality bar
|
|
112
|
+
|
|
113
|
+
A good hint names the next action concretely. Examples:
|
|
114
|
+
|
|
115
|
+
| Bad hint | Good hint |
| ----------------------------------------- | ----------------------------------------------------------------------------------------------- |
| `"Check your config."` | `"Run \`ctx config show\` to see the active config, or set ANTHROPIC_API_KEY to enable LLM fallback."` |
| `"File not found."` | `"No file at logical_path 'docs/auth.md'. Run \`ctx ls docs/\` to see what's there."` |
| `"Auth failed."` | `"mcpx returned 401 from server 'firecrawl'. Run \`mcpx auth firecrawl\` and retry."` |
| `"Glob matched no files."` | `"Glob './*.md' matched 0 files. Try a broader pattern (e.g. './**/*.md') or relax --exclude."` |
| `"Unsupported file type: image/heic."` | `"image/heic isn't supported by the native pipeline. Convert to PNG/JPEG first, or pass --force-llm to use the vision fallback."` |
|
|
122
|
+
|
|
123
|
+
#### Rendering
|
|
124
|
+
|
|
125
|
+
`mountAsCommanderCommand` wraps every handler. On `HelpfulError`:
|
|
126
|
+
|
|
127
|
+
```
Interactive (TTY):
  ✗ ctx add: <message in red>
    hint: <hint in dim/yellow>
    [details: pretty-printed when present]
  exit code = mapKindToExit(kind)   // input_error=2, not_found=3, conflict=4, auth_error=5, network_error=6, unsupported_mime=7, partial_failure=8, internal_error=1

Non-interactive (--json or piped):
  stdout: <empty or partial result up to the point of failure>
  stderr: {"ok": false, "error": {"kind": "...", "message": "...", "hint": "...", "details": ...}}\n
  exit code = same as above
```
|
|
139
|
+
|
|
140
|
+
`mountAsMcpTool` wraps every handler. On `HelpfulError`:
|
|
141
|
+
|
|
142
|
+
```
MCP tool result (returned, not thrown):
  isError: true
  content: [{ type: "text", text: "<message>\n\nhint: <hint>" }]
  structuredContent: { error: { kind, message, hint, details? } }
```
|
|
148
|
+
|
|
149
|
+
The `hint` always lands in front of both the human reading the terminal and the LLM consuming the MCP response — verbatim, same string. No translation layer.
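
The two renderings share one error payload, which a sketch makes concrete. The exit codes and result fields below are taken from the blocks above; the type names, `toCliJsonLine`, and `toMcpErrorResult` are hypothetical helpers, not the mount adapters themselves.

```typescript
type ErrorKind =
  | "input_error" | "not_found" | "conflict" | "auth_error"
  | "network_error" | "unsupported_mime" | "partial_failure" | "internal_error";

interface ErrorPayload {
  kind: ErrorKind;
  message: string;
  hint: string;
  details?: unknown;
}

// Exit codes as listed in the CLI rendering block.
const EXIT_CODES: Record<ErrorKind, number> = {
  internal_error: 1,
  input_error: 2,
  not_found: 3,
  conflict: 4,
  auth_error: 5,
  network_error: 6,
  unsupported_mime: 7,
  partial_failure: 8,
};

function mapKindToExit(kind: ErrorKind): number {
  return EXIT_CODES[kind];
}

// Non-interactive CLI: one JSON line destined for stderr.
function toCliJsonLine(err: ErrorPayload): string {
  return JSON.stringify({ ok: false, error: err }) + "\n";
}

interface McpErrorResult {
  isError: true;
  content: Array<{ type: "text"; text: string }>;
  structuredContent: { error: ErrorPayload };
}

// MCP: the same message and hint, returned as a tool result rather than thrown.
function toMcpErrorResult(err: ErrorPayload): McpErrorResult {
  return {
    isError: true,
    content: [{ type: "text", text: `${err.message}\n\nhint: ${err.hint}` }],
    structuredContent: { error: err },
  };
}
```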
|
|
150
|
+
|
|
151
|
+
### Logging vs. errors
|
|
152
|
+
|
|
153
|
+
- **Logger lines** (info/warn/debug) go to stderr and are advisory. They never become errors.
|
|
154
|
+
- **Errors** are thrown, caught at the mount boundary, rendered once. Operations should never log-and-rethrow — that double-renders.
|
|
155
|
+
- **Spinners** describe the *current* operation; the spinner's failure path on a thrown `HelpfulError` is to fail with the error's `message` as the failed-state label, then the renderer prints the hint underneath.
|
|
156
|
+
|
|
157
|
+
---
|
|
158
|
+
|
|
159
|
+
## Database Schema (DuckDB)
|
|
160
|
+
|
|
161
|
+
`src/db/migrations/001-init.sql`:
|
|
162
|
+
|
|
163
|
+
### Versioning model
|
|
164
|
+
|
|
165
|
+
`files` is **append-only**. Every successful ingest or content-changing refresh inserts a new row for that `logical_path` with a fresh `version_id` (a millisecond TIMESTAMP). The "current" version of a path is the row with `MAX(version_id)` for that path, provided that row is not a tombstone; a path whose latest row is a tombstone has no current version. All MCP tools default to operating on the current version; every read-shaped tool accepts an optional `version` parameter to address an older snapshot.
|
|
166
|
+
|
|
167
|
+
- Deletes are tombstones — they insert a new row with `tombstone=TRUE` and `content=''` rather than removing data.
|
|
168
|
+
- `chunks` are scoped to `(logical_path, version_id)` so historical search would be possible later. By default the FTS + semantic queries filter to current versions only via the `current_files` view.
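
The resolution rule can be mirrored in memory to make the tombstone behaviour explicit. This is a sketch under assumptions: `version_id` is modelled as an epoch-millisecond number rather than a DuckDB TIMESTAMP, and `currentFiles` / `resolveVersion` are hypothetical names.

```typescript
// Reduced row shape; version_id is epoch milliseconds for the purposes of the sketch.
interface FileRow {
  logical_path: string;
  version_id: number;
  tombstone: boolean;
}

// Latest row per path, dropped when that latest row is a tombstone.
function currentFiles(rows: FileRow[]): FileRow[] {
  const latest = new Map<string, FileRow>();
  for (const row of rows) {
    const seen = latest.get(row.logical_path);
    if (seen === undefined || row.version_id > seen.version_id) {
      latest.set(row.logical_path, row);
    }
  }
  return Array.from(latest.values()).filter((row) => !row.tombstone);
}

// Default to the current version; an explicit version addresses an older snapshot.
function resolveVersion(rows: FileRow[], path: string, version?: number): FileRow | undefined {
  if (version !== undefined) {
    return rows.find((row) => row.logical_path === path && row.version_id === version);
  }
  return currentFiles(rows).find((row) => row.logical_path === path);
}
```

Note that a deleted path does not fall back to its previous version: once the newest row is a tombstone, only an explicit `version` reaches the older rows.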
|
|
169
|
+
|
|
170
|
+
```sql
-- Content-addressed binary store. Originals of every ingested artifact live
-- here, deduped by sha256. Many `files` rows can share one blob.
CREATE TABLE blobs (
  sha256     TEXT PRIMARY KEY,
  mime_type  TEXT NOT NULL,
  size_bytes BIGINT NOT NULL,
  bytes      BLOB NOT NULL,
  created_at TIMESTAMP NOT NULL DEFAULT now()
);

CREATE TABLE files (
  logical_path          TEXT NOT NULL, -- "docs/api/auth.md" — what agents see
  version_id            TIMESTAMP NOT NULL DEFAULT now(), -- doubles as version label; ms precision
  tombstone             BOOLEAN NOT NULL DEFAULT FALSE,
  source_type           TEXT NOT NULL, -- 'local' | 'remote' | 'inline'
  source_path           TEXT, -- abs filesystem path or URL (NULL for inline writes)
  source_mtime_ms       BIGINT, -- last seen mtime (local files only)
  source_sha256         TEXT, -- sha256 of original raw bytes (NULL on tombstone). Equals blob_sha256 for non-inline rows.
  blob_sha256           TEXT REFERENCES blobs(sha256), -- pointer to the original bytes (NULL when source_type='inline' or tombstoned)
  content_sha256        TEXT, -- sha256 of converted markdown surrogate
  content               TEXT, -- converted markdown surrogate
  description           TEXT, -- ALWAYS-PRESENT one-paragraph summary (LLM-generated; covers text and binary alike). Prepended to every chunk's embedded text.
  mime_type             TEXT,
  size_bytes            BIGINT,
  fetcher               TEXT, -- 'http' | 'mcpx' | 'local' | 'inline'
  fetcher_server        TEXT, -- mcpx server name (e.g. 'firecrawl', 'google-docs', 'github') — NULL unless fetcher='mcpx'
  fetcher_tool          TEXT, -- mcpx tool name (e.g. 'scrape', 'get_doc') — NULL unless fetcher='mcpx'
  fetcher_args          JSON, -- full args object passed to the mcpx tool — replayable as-is on refresh
  refresh_frequency_sec INTEGER, -- NULL = never auto-refresh
  refreshed_at          TIMESTAMP,
  last_refresh_status   TEXT, -- 'ok' | 'unchanged' | 'failed:<reason>'
  change_note           TEXT, -- optional human/agent annotation: "manual edit", "refresh: source updated", etc.
  created_at            TIMESTAMP NOT NULL DEFAULT now(),
  PRIMARY KEY (logical_path, version_id)
);

-- Latest non-tombstoned version per logical_path. All MCP/CLI defaults filter through this view.
CREATE VIEW current_files AS
SELECT f.* FROM files f
WHERE (f.logical_path, f.version_id) IN (
  SELECT logical_path, MAX(version_id) FROM files GROUP BY logical_path
)
AND f.tombstone = FALSE;

CREATE TABLE chunks (
  logical_path  TEXT NOT NULL,
  version_id    TIMESTAMP NOT NULL,
  chunk_index   INTEGER NOT NULL,
  chunk_content TEXT NOT NULL, -- raw markdown segment (what membot_read returns when slicing)
  search_text   TEXT NOT NULL, -- "<logical_path>\n<description>\n\n<chunk_content>" — the exact string that was embedded and is FTS-indexed
  embedding     FLOAT[384] NOT NULL, -- vector of search_text
  PRIMARY KEY (logical_path, version_id, chunk_index),
  FOREIGN KEY (logical_path, version_id) REFERENCES files(logical_path, version_id)
);

-- Chunks belonging to current versions only. Search joins through this.
CREATE VIEW current_chunks AS
SELECT c.* FROM chunks c
JOIN current_files cf USING (logical_path, version_id);
```
|
|
231
|
+
|
|
232
|
+
`src/db/migrations/002-fts.sql`: `PRAGMA create_fts_index('current_chunks', 'rowid', 'search_text', stemmer='porter')` — indexes the prepended search_text (filename + description + chunk content) so keyword hits surface even when the matching term is in the path or description, not the body. Rebuilt by `ctx reindex` whenever versions are added/tombstoned.
|
|
233
|
+
|
|
234
|
+
Tree exploration is `SELECT logical_path FROM current_files`, grouped client-side by `/` prefix (synthesised — there are no real directories).
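
Client-side grouping by `/` prefix might be sketched as below. `TreeNode`, `buildTree`, and `renderTree` are hypothetical names, and the alphabetical ordering and two-space indentation are presentation choices made for the example.

```typescript
interface TreeNode {
  name: string;
  children: Map<string, TreeNode>;
  isFile: boolean;
}

function makeNode(name: string): TreeNode {
  return { name, children: new Map<string, TreeNode>(), isFile: false };
}

// Builds a synthetic directory tree from flat logical paths.
function buildTree(paths: string[]): TreeNode {
  const root = makeNode("");
  for (const path of paths) {
    const parts = path.split("/").filter((part) => part.length > 0);
    let node = root;
    for (let i = 0; i < parts.length; i++) {
      const part = parts[i] as string;
      let child = node.children.get(part);
      if (child === undefined) {
        child = makeNode(part);
        node.children.set(part, child);
      }
      if (i === parts.length - 1) child.isFile = true; // last segment is the stored file
      node = child;
    }
  }
  return root;
}

// Renders one line per node; nodes with children get a trailing slash.
function renderTree(node: TreeNode, indent: string = ""): string[] {
  const lines: string[] = [];
  const names = Array.from(node.children.keys()).sort();
  for (const name of names) {
    const child = node.children.get(name);
    if (child === undefined) continue;
    lines.push(`${indent}${name}${child.children.size > 0 ? "/" : ""}`);
    lines.push(...renderTree(child, indent + "  "));
  }
  return lines;
}
```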
|
|
235
|
+
|
|
236
|
+
### Pruning history
|
|
237
|
+
|
|
238
|
+
Versions accumulate forever by default. `ctx prune --before <duration>` and the matching `membot_prune` MCP tool drop non-current versions older than the cutoff. Tombstones are kept until at least one newer version exists, so reachability stays simple. `ctx prune` also garbage-collects orphan rows in `blobs` (sha256 not referenced by any remaining `files` row).
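
The pruning rule can be expressed as a planning step that decides what to drop before anything is deleted. The sketch assumes epoch-millisecond `version_id` values and a cutoff already resolved from `--before`; `planPrune` and its return shape are hypothetical.

```typescript
interface VersionRow {
  logical_path: string;
  version_id: number; // epoch milliseconds in this sketch
  tombstone: boolean;
  blob_sha256: string | null;
}

interface PrunePlan {
  drop: VersionRow[]; // non-current versions older than the cutoff
  orphanBlobs: string[]; // blob hashes no remaining row references
}

function planPrune(rows: VersionRow[], blobShas: string[], cutoffMs: number): PrunePlan {
  // The newest row per path is always kept, even when it is a tombstone.
  const latest = new Map<string, number>();
  for (const row of rows) {
    const seen = latest.get(row.logical_path) ?? -Infinity;
    latest.set(row.logical_path, Math.max(seen, row.version_id));
  }
  const isPrunable = (row: VersionRow): boolean =>
    row.version_id !== latest.get(row.logical_path) && row.version_id < cutoffMs;

  const drop = rows.filter(isPrunable);
  const referenced = new Set<string>();
  for (const row of rows) {
    if (!isPrunable(row) && row.blob_sha256 !== null) referenced.add(row.blob_sha256);
  }
  const orphanBlobs = blobShas.filter((sha) => !referenced.has(sha));
  return { drop, orphanBlobs };
}
```

Keeping the newest row unconditionally is what preserves a tombstone until a newer version of the same path exists.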
|
|
239
|
+
|
|
240
|
+
### Binary content & the textual-surrogate rule
|
|
241
|
+
|
|
242
|
+
Some sources don't have a useful textual form: images, audio, video, executables, fonts, etc. The store handles these uniformly with one rule:
|
|
243
|
+
|
|
244
|
+
> **Every ingested artifact produces a markdown surrogate.** The surrogate flows through chunking, embedding, and FTS like any other markdown. The original bytes are kept in the `blobs` table and addressed via `files.blob_sha256` for agents that can consume the native form.
|
|
245
|
+
|
|
246
|
+
This means the search/embed pipeline has zero special cases for binary content — the surrogate IS the content as far as retrieval is concerned. Concretely:
|
|
247
|
+
|
|
248
|
+
| Source type | Surrogate (`files.content`) | Blob kept? |
| ---------------------- | ---------------------------------------------------------------------- | ---------- |
| markdown / text | passthrough | yes |
| HTML | turndown output | yes |
| PDF (text layer) | unpdf extraction | yes |
| PDF (scanned, no text) | Tesseract WASM OCR → markdown | yes |
| DOCX | mammoth output | yes |
| image (PNG/JPEG/etc.) | Claude vision caption + Tesseract WASM OCR for any embedded text | yes |
| audio | (deferred — surrogate would be a transcript when we add Whisper WASM) | yes |
| anything else | LLM caption from a base64 sample, or `"(unknown binary)"` if no key | yes |
|
|
258
|
+
|
|
259
|
+
The `blob_sha256` foreign key gives content-addressed dedupe automatically — re-ingesting the same image under a different logical_path stores zero new bytes.
|
|
260
|
+
|
|
261
|
+
### Always-on description (`files.description`)
|
|
262
|
+
|
|
263
|
+
Every file gets an LLM-written one-paragraph description, regardless of type — including plain markdown. The description column is **prepended to every chunk's embedded text** (along with the logical path), so:
|
|
264
|
+
|
|
265
|
+
- Searches like `"the OAuth diagram"` hit a PNG even though the chunk body is empty markdown.
|
|
266
|
+
- Searches like `"meeting notes from last quarter's planning"` hit a markdown file whose body never says that phrase.
|
|
267
|
+
- Filename signals ("auth.md", "diagrams/oauth-flow.png") are part of the embedded text, lifting recall without hurting precision because the text-prefix is short and consistent.
|
|
268
|
+
|
|
269
|
+
The exact embedded string per chunk is:
|
|
270
|
+
|
|
271
|
+
```
|
|
272
|
+
<logical_path>
|
|
273
|
+
<description>
|
|
274
|
+
|
|
275
|
+
<chunk_content>
|
|
276
|
+
```
|
|
277
|
+
|
|
278
|
+
…stored verbatim as `chunks.search_text`. FTS is built on `search_text`, and the embedding is the vector of that same string. Keeping `chunk_content` as a separate column means `membot_read` and the `snippet` field on search hits return the clean body without the prefix bleeding through.
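A minimal sketch of the string layout described above. The function name `buildSearchText` appears in the project layout; the exact newline placement is read off the template and should be treated as an assumption, as should the `toChunkRow` helper.

```ts
// Prefix each chunk with its logical path and file description.
// Layout assumed from the template: path, description, blank line, chunk body.
function buildSearchText(logicalPath: string, description: string, chunkContent: string): string {
  return `${logicalPath}\n${description}\n\n${chunkContent}`;
}

// Hypothetical helper: keep the clean body next to the prefixed string,
// so reads and snippets never show the prefix.
function toChunkRow(logicalPath: string, description: string, chunkContent: string) {
  return {
    chunk_content: chunkContent,
    search_text: buildSearchText(logicalPath, description, chunkContent),
  };
}
```

Routing every caller through one function keeps the FTS input and the embedding input identical.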
|
|
279
|
+
|
|
280
|
+
When `ANTHROPIC_API_KEY` is missing, `description` falls back to a deterministic heuristic (e.g. first heading + first 200 chars for markdown; `"<mime_type> · <size>"` for binaries) so the pipeline still works offline — just with weaker recall.
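The offline fallback might look like the sketch below. `fallbackDescription` and `formatSize` are hypothetical names, and the size formatting and the `text/` mime test are assumptions; only the two output shapes (heading plus first 200 characters, and `<mime_type> · <size>`) come from the text above.

```ts
// Human-readable size; the unit thresholds are an assumption.
function formatSize(bytes: number): string {
  if (bytes < 1024) return `${bytes} B`;
  if (bytes < 1024 * 1024) return `${(bytes / 1024).toFixed(1)} KB`;
  return `${(bytes / (1024 * 1024)).toFixed(1)} MB`;
}

// Deterministic description used when no LLM is available.
function fallbackDescription(input: { mimeType: string; sizeBytes: number; markdown?: string }): string {
  const { mimeType, sizeBytes, markdown } = input;
  if (mimeType.startsWith("text/") && markdown !== undefined) {
    const lines = markdown.split("\n");
    // First ATX heading, if any, becomes the title.
    const headingIndex = lines.findIndex((l) => /^#{1,6}\s+\S/.test(l.trim()));
    const title = headingIndex >= 0 ? (lines[headingIndex] ?? "").trim().replace(/^#{1,6}\s+/, "") : "";
    // Remaining text, whitespace-collapsed and capped at 200 characters.
    const rest = lines.filter((_, i) => i !== headingIndex).join(" ");
    const body = rest.replace(/\s+/g, " ").trim().slice(0, 200);
    return [title, body].filter((s) => s.length > 0).join(": ");
  }
  return `${mimeType} · ${formatSize(sizeBytes)}`;
}
```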
|
|
281
|
+
|
|
282
|
+
### Tesseract WASM (OCR)
|
|
283
|
+
|
|
284
|
+
OCR runs as part of the converter dispatch, only on filetypes where it's likely useful:
|
|
285
|
+
|
|
286
|
+
- All `image/*` types: PNG, JPEG, WebP, BMP, TIFF.
|
|
287
|
+
- PDFs whose unpdf extraction returned an empty / very-low-text-ratio result (likely scanned).
|
|
288
|
+
|
|
289
|
+
OCR output is folded into the same surrogate that the LLM caption produces: one chunked markdown body per file, with a section headed `## Text detected via OCR` when OCR ran. No separate row, no separate index.
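A sketch of the two decisions in this section, under stated assumptions: the 50-characters-per-page threshold and both function names are hypothetical, and the real converter may use a different low-text heuristic.

```ts
// Heuristic for "likely scanned": very little extracted text per page.
// The threshold is an illustrative assumption, not a documented value.
function looksScanned(extractedText: string, pageCount: number, minCharsPerPage = 50): boolean {
  const chars = extractedText.replace(/\s+/g, "").length;
  return pageCount > 0 && chars / pageCount < minCharsPerPage;
}

// Merge caption and OCR text into one markdown surrogate.
function foldOcrIntoSurrogate(caption: string, ocrText: string | null): string {
  const cleaned = (ocrText ?? "").trim();
  if (cleaned.length === 0) return caption.trim();
  return `${caption.trim()}\n\n## Text detected via OCR\n\n${cleaned}`;
}
```

Since the result is a single markdown body, the chunker and embedder need no OCR-specific handling.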
|
|
290
|
+
|
|
291
|
+
---
|
|
292
|
+
|
|
293
|
+
## Operations: one definition, two surfaces
|
|
294
|
+
|
|
295
|
+
Each user-facing capability is defined ONCE as an **Operation** and mounted twice — as an MCP tool and as a commander CLI command. The zod input schema, output schema, description string, and handler are all single-source-of-truth. Adding a new operation means writing one file in `src/operations/` and exporting it from the registry; both the CLI and the MCP server pick it up automatically.
|
|
296
|
+
|
|
297
|
+
### `Operation<I, O>` shape (`src/operations/types.ts`)
|
|
298
|
+
|
|
299
|
+
```ts
|
|
300
|
+
export interface Operation<I extends z.ZodObject, O extends z.ZodTypeAny> {
|
|
301
|
+
// Tool name as agents see it (also used for the MCP tool registration).
|
|
302
|
+
name: string; // e.g. "membot_add"
|
|
303
|
+
|
|
304
|
+
// CLI subcommand name. Defaults to name with "membot_" stripped and "_" → "-".
|
|
305
|
+
cliName?: string; // e.g. "add"
|
|
306
|
+
|
|
307
|
+
// Verbatim description string. Used as BOTH the MCP tool description
|
|
308
|
+
// and the commander .description() text. Follows the bash-prefix →
|
|
309
|
+
// purpose → when-to-use → recovery-hint shape (see §MCP Tool Surface).
|
|
310
|
+
description: string;
|
|
311
|
+
|
|
312
|
+
// Single source of truth for the input contract.
|
|
313
|
+
inputSchema: I;
|
|
314
|
+
outputSchema: O;
|
|
315
|
+
|
|
316
|
+
// CLI-only metadata: which input fields are positional CLI args, and
|
|
317
|
+
// any short-flag aliases. Fields not listed in `positional` become
|
|
318
|
+
// `--flags`; booleans become `--flag` / `--no-flag`; defaults from
|
|
319
|
+
// .default() in the schema are honored.
|
|
320
|
+
cli?: {
|
|
321
|
+
positional?: (keyof z.infer<I>)[];
|
|
322
|
+
aliases?: Partial<Record<keyof z.infer<I>, string>>; // e.g. logical_path: "-p"
|
|
323
|
+
stdinField?: keyof z.infer<I>; // read this field from stdin if not provided
|
|
324
|
+
};
|
|
325
|
+
|
|
326
|
+
// The work itself. AppContext gives access to db, embedder, mcpx, logger, config.
|
|
327
|
+
handler: (input: z.infer<I>, ctx: AppContext) => Promise<z.infer<O>>;
|
|
328
|
+
}
|
|
329
|
+
```
|
|
330
|
+
|
|
331
|
+
Field-level help comes from `.describe()` on the zod schema — used as both the MCP parameter description and the commander option description. Example:
|
|
332
|
+
|
|
333
|
+
```ts
|
|
334
|
+
inputSchema: z.object({
|
|
335
|
+
source: z.string().describe('Local path, URL, or `inline:<text>` literal'),
|
|
336
|
+
logical_path: z.string().optional().describe('Logical path under the store (defaults derived from source)'),
|
|
337
|
+
refresh_frequency: z.string().optional().describe('Refresh cadence: 5m | 1h | 24h | 7d. Omit for no auto-refresh.'),
|
|
338
|
+
fetcher_hint: z.enum(['firecrawl','github','gdocs','http']).optional().describe('Force a specific mcpx fetcher'),
|
|
339
|
+
}),
|
|
340
|
+
cli: {
|
|
341
|
+
positional: ['source'],
|
|
342
|
+
aliases: { logical_path: '-p', refresh_frequency: '-r' },
|
|
343
|
+
}
|
|
344
|
+
```
|
|
345
|
+
|
|
346
|
+
### Mount adapters
|
|
347
|
+
|
|
348
|
+
`src/mount/mcp.ts` — `mountAsMcpTool(server, op)`:
|
|
349
|
+
- Registers the tool with `op.name` and `op.description`.
|
|
350
|
+
- Converts `op.inputSchema` to JSON-Schema (via `zod-to-json-schema`) for the MCP `inputSchema` field.
|
|
351
|
+
- Wraps `op.handler` with input validation (`op.inputSchema.parse`) + output validation (`op.outputSchema.parse`) + error normalization (`{error_kind, message, next_action_hint}`).
|
|
352
|
+
|
|
353
|
+
`src/mount/commander.ts` — `mountAsCommanderCommand(program, op)`:
|
|
354
|
+
- Adds a subcommand named `op.cliName ?? op.name.replace(/^membot_/, '').replaceAll('_','-')`.
|
|
355
|
+
- Sets `.description(op.description)`. The same string the LLM sees is what `ctx --help` shows.
|
|
356
|
+
- Walks `op.inputSchema.shape`. For each field:
|
|
357
|
+
- If listed in `op.cli.positional` → `.argument(required ? '<name>' : '[name]', describe)`.
|
|
358
|
+
- Else if `ZodBoolean` → `.option('--flag-name [<bool>]', describe)`, with `--no-flag-name` synthesised.
|
|
359
|
+
- Else → `.option('--flag-name <value>', describe, defaultValue?)`. Short alias prepended if `op.cli.aliases[field]` is set.
|
|
360
|
+
- `ZodEnum` → option with `.choices(...)`.
|
|
361
|
+
- `ZodArray` of strings → repeatable `.option('--tag <value>', ..., collect)`.
|
|
362
|
+
- On invocation, builds a single object from positional args + options, runs `op.inputSchema.parse(...)`, calls `op.handler`, and renders the result via `output/formatter.ts` (JSON if `--json`, otherwise human-readable per output schema).
|
|
363
|
+
|
|
364
|
+
Result: the description an agent reads in `tools/list` is byte-identical to what a human reads in `ctx <cmd> --help`. Drift is impossible by construction.
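The naming rules used by the commander adapter can be sketched in isolation. `defaultCliName` and `flagFor` are hypothetical helper names; the `-p, --logical-path` flag string follows commander's usual "short, long" convention.

```ts
// Derive the CLI subcommand from the tool name: strip "membot_", then "_" -> "-".
function defaultCliName(toolName: string): string {
  return toolName.replace(/^membot_/, "").replace(/_/g, "-");
}

// Derive the option flag for a schema field, with an optional short alias prepended.
function flagFor(field: string, alias?: string): string {
  const long = `--${field.replace(/_/g, "-")}`;
  return alias ? `${alias}, ${long}` : long;
}
```

For example, `defaultCliName("membot_add")` yields `add`, and `flagFor("logical_path", "-p")` yields `-p, --logical-path`.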
|
|
365
|
+
|
|
366
|
+
### Operation registry (`src/operations/index.ts`)
|
|
367
|
+
|
|
368
|
+
A single array of operations exported in the order they should appear in `--help`. `cli.ts` and `mcp/server.ts` both iterate this list and call the appropriate mount adapter. Adding a new tool means: write one file, append it here, done.
|
|
369
|
+
|
|
370
|
+
---
|
|
371
|
+
|
|
372
|
+
## Project Layout (mirrors mcpx)
|
|
373
|
+
|
|
374
|
+
```
|
|
375
|
+
ctx/
|
|
376
|
+
src/
|
|
377
|
+
cli.ts # commander entry; loops operations + mountAsCommanderCommand. Plus a couple of CLI-only commands (serve, reindex).
|
|
378
|
+
sdk.ts # exported API for embedding ctx in other apps
|
|
379
|
+
context.ts # AppContext (config, db, embedder, mcpx client, logger)
|
|
380
|
+
constants.ts # MEMBOT_HOME, DEFAULTS, EMBEDDING_DIMENSION=384
|
|
381
|
+
operations/ # ★ single source of truth for every tool/command
|
|
382
|
+
types.ts # Operation<I,O>, defineOperation()
|
|
383
|
+
index.ts # ordered registry of all operations
|
|
384
|
+
add.ts list.ts tree.ts read.ts write.ts search.ts remove.ts
|
|
385
|
+
move.ts refresh.ts info.ts versions.ts diff.ts prune.ts
|
|
386
|
+
mount/
|
|
387
|
+
mcp.ts # mountAsMcpTool: zod → JSON-Schema, validate I/O, catch HelpfulError → MCP isError result with hint surfaced in both content[].text and structuredContent.error
|
|
388
|
+
commander.ts # mountAsCommanderCommand: zod → .argument()/.option(), parse → validate → spinner.start → handler → spinner.success/fail → format. Catches HelpfulError, renders message+hint+exit-code; wraps unknown throws via asHelpful()
|
|
389
|
+
zod-to-cli.ts # the field-walking logic; covers ZodString/Number/Boolean/Enum/Array/Optional/Default
|
|
390
|
+
commands/ # CLI-only commands that don't have an MCP equivalent
|
|
391
|
+
serve.ts # ctx serve [--http <port>] [--watch]
|
|
392
|
+
reindex.ts # ctx reindex
|
|
393
|
+
config/
|
|
394
|
+
loader.ts # reads ~/.membot/config.json + env overrides
|
|
395
|
+
schemas.ts # CtxConfig zod schema
|
|
396
|
+
db/
|
|
397
|
+
connection.ts # DuckDB pool, migration runner
|
|
398
|
+
migrations/ # 001-init.sql, 002-fts.sql
|
|
399
|
+
files.ts # files-table CRUD: insertVersion, getCurrent, getVersion, listVersions, tombstone, prune
|
|
400
|
+
chunks.ts # chunks CRUD + searchSemantic + searchKeyword (against current_chunks view by default)
|
|
401
|
+
blobs.ts # blobs-table CRUD: upsertBySha (no-op on existing sha), readBlob, gcOrphans
|
|
402
|
+
views.sql # current_files, current_chunks views
|
|
403
|
+
ingest/
|
|
404
|
+
source-resolver.ts # expands a source arg: file | dir-walk (symlinks followed, realpath dedupe) | glob (picomatch) | URL | inline:; honors include/exclude
|
|
405
|
+
fetcher.ts # PORT from botholomew/src/context/fetcher.ts — mcpx-driven; returns {bytes, mime, fetcher, fetcher_server, fetcher_tool, fetcher_args} so the chosen invocation can be persisted and replayed on refresh
|
|
406
|
+
local-reader.ts # read+hash local file, detect mtime change
|
|
407
|
+
converter/
|
|
408
|
+
index.ts # dispatch by mime
|
|
409
|
+
pdf.ts # unpdf (Bun-friendly PDF text extract); falls through to ocr.ts when extraction is empty/low-ratio
|
|
410
|
+
docx.ts # mammoth
|
|
411
|
+
html.ts # turndown
|
|
412
|
+
image.ts # Claude vision caption + OCR fold-in
|
|
413
|
+
text.ts # passthrough
|
|
414
|
+
ocr.ts # Tesseract WASM (tesseract.js) — used by image.ts and pdf.ts fallback
|
|
415
|
+
llm.ts # Claude markdown fallback (PORT botholomew/src/context/markdown-converter.ts)
|
|
416
|
+
describer.ts # always-on one-paragraph LLM description (with deterministic offline fallback)
|
|
417
|
+
chunker.ts # PORT botholomew/src/context/chunker.ts (deterministic + LLM modes)
|
|
418
|
+
embedder.ts # PORT botholomew/src/context/embedder-impl.ts (WASM transformers); embeds the prepended search_text
|
|
419
|
+
search-text.ts # buildSearchText(logical_path, description, chunk_content) — single source of truth for the embedded/FTS string
|
|
420
|
+
ingest.ts # orchestrator: resolve → for each entry: read → blob.upsert → convert → describe → chunk → embed → insert version
|
|
421
|
+
search/
|
|
422
|
+
hybrid.ts # PORT botholomew/src/tools/search/fuse.ts (RRF)
|
|
423
|
+
semantic.ts # cosine via DuckDB array_cosine_distance
|
|
424
|
+
keyword.ts # BM25 via DuckDB FTS match_bm25()
|
|
425
|
+
refresh/
|
|
426
|
+
runner.ts # refreshFile(id|path) — core logic
|
|
427
|
+
scheduler.ts # daemon tick loop for --watch
|
|
428
|
+
mcp/
|
|
429
|
+
server.ts # @modelcontextprotocol/sdk: stdio + streamable-http; loops operations + mountAsMcpTool
|
|
430
|
+
instructions.ts # server-level `instructions` string (see plan §MCP)
|
|
431
|
+
output/
|
|
432
|
+
tty.ts # isInteractive() / useColor() / useSpinner() — single source for TTY/CI/--json/NO_COLOR detection
|
|
433
|
+
logger.ts # spinner-aware (port from mcpx/src/output/logger.ts); routes to stderr in non-interactive mode
|
|
434
|
+
progress.ts # nanospinner wrapper + multi-entry progress bar (used by dir/glob ingest); degrades to one info-line-per-entry when non-interactive
|
|
435
|
+
formatter.ts # final-result rendering: aligned tables/markdown when interactive, JSON when not
|
|
436
|
+
errors.ts # HelpfulError class + asHelpful() wrapper + ErrorKind union + mapKindToExit()
|
|
437
|
+
scripts/
|
|
438
|
+
apply-transformers-patch.sh # copy verbatim from mcpx/scripts
|
|
439
|
+
test/
|
|
440
|
+
_preload.ts # transformers patch hook
|
|
441
|
+
ingest/ db/ search/ refresh/ mcp/
|
|
442
|
+
patches/ # @huggingface/transformers patch (copy from mcpx)
|
|
443
|
+
install.sh install.ps1 # copy+adapt from mcpx
|
|
444
|
+
package.json tsconfig.json biome.json bunfig.toml
|
|
445
|
+
README.md CLAUDE.md
|
|
446
|
+
```
|
|
447
|
+
|
|
448
|
+
---
|
|
449
|
+
|
|
450
|
+
## Critical Files to Port
|
|
451
|
+
|
|
452
|
+
Direct ports (light edits — drop Botholomew-specific deps, swap `projectDir/context/` filesystem for DuckDB rows):
|
|
453
|
+
|
|
454
|
+
| New file | Source |
|
|
455
|
+
| ---------------------------------- | ----------------------------------------------------------------------- |
|
|
456
|
+
| `src/ingest/embedder.ts` | `botholomew/src/context/embedder-impl.ts` |
|
|
457
|
+
| `src/ingest/chunker.ts` | `botholomew/src/context/chunker.ts` |
|
|
458
|
+
| `src/ingest/fetcher.ts` | `botholomew/src/context/fetcher.ts` + `fetcher-errors.ts` |
|
|
459
|
+
| `src/ingest/converter/llm.ts` | `botholomew/src/context/markdown-converter.ts` |
|
|
460
|
+
| `src/search/semantic.ts` | `botholomew/src/tools/search/semantic.ts` + `src/db/embeddings.ts` |
|
|
461
|
+
| `src/search/hybrid.ts` | `botholomew/src/tools/search/fuse.ts` |
|
|
462
|
+
| `src/search/keyword.ts` | `botholomew/src/tools/search/regexp.ts` (replace regex with FTS BM25) |
|
|
463
|
+
| `scripts/apply-transformers-patch.sh`, `patches/` | `mcpx/scripts/...`, `mcpx/patches/` |
|
|
464
|
+
| `src/output/logger.ts` | `mcpx/src/output/logger.ts` |
|
|
465
|
+
| `src/cli.ts` skeleton | `mcpx/src/cli.ts` |
|
|
466
|
+
| `install.sh`, `install.ps1` | `mcpx/install.sh`, `mcpx/install.ps1` |
|
|
467
|
+
|
|
468
|
+
New code:
|
|
469
|
+
|
|
470
|
+
- `src/db/*` — DuckDB schema/CRUD (replaces botholomew's `context/store.ts` filesystem layer).
|
|
471
|
+
- `src/ingest/converter/{pdf,docx,html,text}.ts` — native conversion path before LLM fallback.
|
|
472
|
+
- `src/ingest/local-reader.ts` — read + sha256 + mtime for local sources.
|
|
473
|
+
- `src/refresh/{runner,scheduler}.ts` — refresh per-row + daemon tick.
|
|
474
|
+
- `src/mcp/server.ts`, `src/operations/*`, and `src/mount/*`: MCP exposure (botholomew's tools assume FS sandboxing; we rewrite against DB).
|
|
475
|
+
|
|
476
|
+
---
|
|
477
|
+
|
|
478
|
+
## Data Flow — Ingest
|
|
479
|
+
|
|
480
|
+
```
|
|
481
|
+
ctx add <source> [--path <logical>] [--refresh 24h] [--include <glob>] [--exclude <glob>]
|
|
482
|
+
↓
|
|
483
|
+
expand-source: (only for local sources)
|
|
484
|
+
file → [file]
|
|
485
|
+
directory → walk(symlinks_followed=true) filtered by include/exclude globs
|
|
486
|
+
glob → picomatch over realpath()-ed entries
|
|
487
|
+
↓ for each resolved entry:
|
|
488
|
+
local-reader.read() OR fetcher.fetchUrl() ← raw bytes + mime + sha256
|
|
489
|
+
(fetchUrl also returns chosen mcpx server/tool/args
|
|
490
|
+
→ persisted on the row for fast replay-on-refresh)
|
|
491
|
+
↓
|
|
492
|
+
blobs.upsert(sha256, bytes, mime) ← content-addressed, deduped
|
|
493
|
+
↓
|
|
494
|
+
converter/index.ts dispatch(mime):
|
|
495
|
+
pdf → unpdf.extractText() → if empty/low-ratio → ocr.tesseract()
|
|
496
|
+
docx → mammoth.convertToMarkdown()
|
|
497
|
+
html → turndown.turndown()
|
|
498
|
+
image/* → vision.describeImage() + ocr.tesseract() (folded together)
|
|
499
|
+
text / md → passthrough
|
|
500
|
+
other → llm.convertWithClaude() (or "(unknown binary)" if no API key)
|
|
501
|
+
↓
|
|
502
|
+
describe(markdown_or_caption, mime, logical_path) ← always runs; one-paragraph LLM summary
|
|
503
|
+
(or deterministic fallback when no API key)
|
|
504
|
+
↓
|
|
505
|
+
chunker.chunk(markdown) ← deterministic by default; LLM opt-in
|
|
506
|
+
↓
|
|
507
|
+
buildSearchText(logical_path, description, chunk) ← prepended for embedding + FTS
|
|
508
|
+
↓
|
|
509
|
+
embedder.embedBatch(search_texts) ← WASM transformers, 384-dim
|
|
510
|
+
↓
|
|
511
|
+
db.files.insertVersion + db.chunks.insertForVersion + FTS rebuild
|
|
512
|
+
(every successful ingest produces a NEW version_id; nothing is overwritten;
|
|
513
|
+
directory/glob ingest is one transaction per matched entry, not all-or-nothing)
|
|
514
|
+
```
|
|
515
|
+
|
|
516
|
+
## Data Flow — Refresh
|
|
517
|
+
|
|
518
|
+
`ctx refresh <path>` (or daemon tick, or no-arg = "all due"):
|
|
519
|
+
|
|
520
|
+
1. Load `files` row.
|
|
521
|
+
2. If `source_type='local'`: `stat()`; if `mtime_ms == source_mtime_ms`, skip. Otherwise re-read + sha256.
|
|
522
|
+
3. If `source_type='remote'`:
|
|
523
|
+
- If `fetcher='mcpx'`: directly invoke `mcpx exec <fetcher_server> <fetcher_tool> <fetcher_args>` — no agent re-routing.
|
|
524
|
+
- If `fetcher='http'`: plain `fetch(source_path)`.
|
|
525
|
+
- sha256 the resulting bytes.
|
|
526
|
+
4. Compare `new_source_sha256 == files.source_sha256`. If equal → set `refreshed_at=now()`, `last_refresh_status='unchanged'`, done.
|
|
527
|
+
5. If different → re-run convert → chunk → embed → **insert a new `files` row** with `version_id=now()`, new `content`/`source_sha256`/`content_sha256`, `change_note='refresh: source updated'`. Insert the new chunks under that version. Old version remains in history.
|
|
528
|
+
6. On failure (network, fetcher error, conversion error): leave existing version untouched, write `last_refresh_status='failed:<reason>'` onto the most recent row in place (status fields are mutable; content fields are not).
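Steps 4 to 6 reduce to a three-way decision, sketched here. The type and function names are hypothetical, and `version_id` is modelled as epoch milliseconds for simplicity; the status strings and the change note are taken from the steps above.

```ts
// Result of re-reading the source (steps 2 and 3), simplified for illustration.
type FetchResult =
  | { status: "ok"; sha256: string }
  | { status: "error"; reason: string };

type RefreshOutcome =
  | { kind: "unchanged"; refreshedAt: number; lastRefreshStatus: "unchanged" }
  | { kind: "new_version"; versionId: number; sourceSha256: string; changeNote: string }
  | { kind: "failed"; lastRefreshStatus: string };

function decideRefresh(storedSha256: string, fetched: FetchResult, nowMs: number): RefreshOutcome {
  // Step 6: failures only touch mutable status fields.
  if (fetched.status === "error") {
    return { kind: "failed", lastRefreshStatus: `failed:${fetched.reason}` };
  }
  // Step 4: identical bytes, so no re-conversion or re-embedding.
  if (fetched.sha256 === storedSha256) {
    return { kind: "unchanged", refreshedAt: nowMs, lastRefreshStatus: "unchanged" };
  }
  // Step 5: changed bytes produce a brand-new version row.
  return {
    kind: "new_version",
    versionId: nowMs,
    sourceSha256: fetched.sha256,
    changeNote: "refresh: source updated",
  };
}
```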
|
|
529
|
+
|
|
530
|
+
`ctx refresh` with no arg = all rows where `refresh_frequency_sec IS NOT NULL AND now() > refreshed_at + (refresh_frequency_sec * INTERVAL '1 second')`.
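The same predicate, expressed in application code for illustration. It assumes `refreshed_at` is held as epoch milliseconds; `RefreshRow`, `isDue`, and `dueRows` are hypothetical names.

```ts
interface RefreshRow {
  logical_path: string;
  refreshed_at: number; // assumed: epoch milliseconds
  refresh_frequency_sec: number | null;
}

// Mirrors: refresh_frequency_sec IS NOT NULL AND now() > refreshed_at + interval.
function isDue(row: RefreshRow, nowMs: number): boolean {
  return row.refresh_frequency_sec !== null && nowMs > row.refreshed_at + row.refresh_frequency_sec * 1000;
}

function dueRows(rows: RefreshRow[], nowMs: number): RefreshRow[] {
  return rows.filter((row) => isDue(row, nowMs));
}
```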
|
|
531
|
+
|
|
532
|
+
Daemon (`ctx serve --watch`): `setInterval(tick_interval_sec)` runs the no-arg refresh; default 60s. Same code path.
|
|
533
|
+
|
|
534
|
+
---
|
|
535
|
+
|
|
536
|
+
## CLI Surface
|
|
537
|
+
|
|
538
|
+
```
|
|
539
|
+
ctx add <source> [--path <logical>] [--include <glob>] [--exclude <glob>] [--no-follow-symlinks] [--refresh <dur>] [--fetcher <name>]
|
|
540
|
+
ctx ls [<prefix>] [--json]
|
|
541
|
+
ctx tree [<prefix>] [--depth <n>]
|
|
542
|
+
ctx read <path> [--version <ts>] [--meta]
|
|
543
|
+
ctx write <path> [--refresh <dur>] [--note <msg>] < stdin
|
|
544
|
+
ctx search <query> [--limit 10] [--mode hybrid|semantic|keyword]
|
|
545
|
+
ctx info <path> [--version <ts>]
|
|
546
|
+
ctx mv <old> <new>
|
|
547
|
+
ctx rm <path> # tombstone (history kept)
|
|
548
|
+
ctx refresh [<path>] [--force]
|
|
549
|
+
ctx versions <path> # list every version_id with change_note
|
|
550
|
+
ctx diff <path> <a-version> [<b-version>] # markdown diff between two versions (defaults b=current)
|
|
551
|
+
ctx prune [--before <dur>] # drop non-current versions older than cutoff
|
|
552
|
+
ctx reindex
|
|
553
|
+
ctx serve [--http <port>] [--watch] [--tick <sec>]
|
|
554
|
+
```
|
|
555
|
+
|
|
556
|
+
Global flags (mirror mcpx): `-c/--config`, `-j/--json`, `-F/--format`, `-v/--verbose`, `--no-interactive`.
|
|
557
|
+
|
|
558
|
+
`--refresh` and `--before` accept duration strings: `5m`, `1h`, `24h`, `7d`.
|
|
559
|
+
`--version` accepts an ISO-8601 timestamp or millis-since-epoch — exact match against `files.version_id`.
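Parsing for these two argument types could look like the sketch below. Both function names are hypothetical, and the accepted units are limited to the `m`, `h`, and `d` forms listed above; whether other units are supported is not stated.

```ts
const UNIT_SECONDS: Record<string, number> = { m: 60, h: 3600, d: 86400 };

// "5m" -> 300, "1h" -> 3600, "24h" -> 86400, "7d" -> 604800.
function parseDurationSec(input: string): number {
  const match = /^(\d+)([mhd])$/.exec(input.trim());
  if (!match) {
    throw new Error(`Invalid duration "${input}": expected forms like 5m, 1h, 24h, 7d`);
  }
  const [, amount = "", unit = ""] = match;
  const perUnit = UNIT_SECONDS[unit];
  if (perUnit === undefined) throw new Error(`Unknown duration unit "${unit}"`);
  return Number(amount) * perUnit;
}

// Accepts millis-since-epoch or an ISO-8601 timestamp; returns epoch milliseconds.
function parseVersionArg(input: string): number {
  const trimmed = input.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed);
  const parsed = Date.parse(trimmed);
  if (Number.isNaN(parsed)) {
    throw new Error(`Invalid version "${input}": expected ISO-8601 or millis-since-epoch`);
  }
  return parsed;
}
```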
|
|
560
|
+
|
|
561
|
+
---
|
|
562
|
+
|
|
563
|
+
## MCP Tool Surface
|
|
564
|
+
|
|
565
|
+
Stdio (default) and streamable-HTTP, both via `@modelcontextprotocol/sdk`. Each tool is **defined as an `Operation` in `src/operations/`** and mounted via `mountAsMcpTool`. The same `Operation` is mounted as a commander subcommand by `mountAsCommanderCommand` — descriptions, schemas, and validation are identical across the two surfaces. Errors are always `HelpfulError` instances (see §Presentation & Errors); the mount adapter renders `kind`, `message`, and the required `hint` into the MCP response so the LLM gets the same actionable guidance a human would.
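The error path can be illustrated with a small sketch. The names `HelpfulError`, `asHelpful`, and `ErrorKind` appear in the project layout, but the field layout, the `internal` kind, the fallback hint wording, and `toMcpErrorResult` are assumptions made for this example.

```ts
// Kinds named in this document, plus an assumed catch-all.
type ErrorKind = "auth_error" | "unsupported_mime" | "nothing_matched" | "internal";

class HelpfulError extends Error {
  readonly kind: ErrorKind;
  readonly hint: string;
  constructor(kind: ErrorKind, message: string, hint: string) {
    super(message);
    this.name = "HelpfulError";
    this.kind = kind;
    this.hint = hint;
  }
}

// Wrap unknown throws so every error carries a hint.
function asHelpful(err: unknown): HelpfulError {
  if (err instanceof HelpfulError) return err;
  const message = err instanceof Error ? err.message : String(err);
  return new HelpfulError("internal", message, "Re-run with --verbose and report the failure if it persists.");
}

// Render into an MCP tool result: the hint appears in both the text and structured forms.
function toMcpErrorResult(err: HelpfulError) {
  return {
    isError: true,
    content: [{ type: "text" as const, text: `${err.kind}: ${err.message}\nHint: ${err.hint}` }],
    structuredContent: {
      error: { error_kind: err.kind, message: err.message, next_action_hint: err.hint },
    },
  };
}
```

The CLI adapter would print the same three fields and map `kind` to an exit code.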
|
|
566
|
+
|
|
567
|
+
### Worked example: `membot_add`
|
|
568
|
+
|
|
569
|
+
```ts
|
|
570
|
+
// src/operations/add.ts
|
|
571
|
+
import { z } from 'zod';
|
|
572
|
+
import { defineOperation } from './types';
|
|
573
|
+
import { ingest } from '../ingest/ingest';
|
|
574
|
+
|
|
575
|
+
export const add = defineOperation({
|
|
576
|
+
name: 'membot_add',
|
|
577
|
+
cliName: 'add',
|
|
578
|
+
description: `[[ bash equivalent: ingest a source ]] Ingest a new source into
|
|
579
|
+
the store: a local file path OR a URL OR an inline:<text> literal. URLs are
|
|
580
|
+
fetched via mcpx (the chosen server + tool + args are stored so refresh
|
|
581
|
+
replays the exact invocation). PDF/DOCX/HTML are converted to markdown —
|
|
582
|
+
native libs first, LLM fallback for messy/scanned input. Setting
|
|
583
|
+
refresh_frequency enables automatic refresh from the daemon. Always creates
|
|
584
|
+
a NEW version; existing versions stay queryable via membot_versions.`,
|
|
585
|
+
inputSchema: z.object({
|
|
586
|
+
source: z.string().describe('Local path, URL, or `inline:<text>` literal'),
|
|
587
|
+
logical_path: z.string().optional().describe('Logical path under the store (defaults derived from source)'),
|
|
588
|
+
refresh_frequency: z.string().optional().describe('Refresh cadence: 5m | 1h | 24h | 7d. Omit for no auto-refresh.'),
|
|
589
|
+
fetcher_hint: z.enum(['firecrawl','github','gdocs','http']).optional().describe('Force a specific mcpx fetcher'),
|
|
590
|
+
change_note: z.string().optional().describe('Free-text note attached to the new version'),
|
|
591
|
+
}),
|
|
592
|
+
outputSchema: z.object({
|
|
593
|
+
logical_path: z.string(),
|
|
594
|
+
version_id: z.string(),
|
|
595
|
+
mime_type: z.string().nullable(),
|
|
596
|
+
size_bytes: z.number(),
|
|
597
|
+
fetcher: z.string(),
|
|
598
|
+
source_sha256: z.string(),
|
|
599
|
+
}),
|
|
600
|
+
cli: {
|
|
601
|
+
positional: ['source'],
|
|
602
|
+
aliases: { logical_path: '-p', refresh_frequency: '-r', change_note: '-m' },
|
|
603
|
+
},
|
|
604
|
+
handler: async (input, ctx) => ingest(input, ctx),
|
|
605
|
+
});
|
|
606
|
+
```
|
|
607
|
+
|
|
608
|
+
This single definition produces:
|
|
609
|
+
|
|
610
|
+
- **MCP tool** `membot_add` with the description above, JSON-Schema input derived from zod, and validated output.
|
|
611
|
+
- **CLI command** `ctx add <source> [-p <path>] [-r <dur>] [--fetcher-hint <name>] [-m <note>]` whose `--help` text is byte-identical to the description above.
|
|
612
|
+
|
|
613
|
+
### Server-level instructions
|
|
614
|
+
|
|
615
|
+
These are sent as the MCP server's top-level `instructions` field — the LLM sees them once when the server is connected. They frame how the tool surface should be used:
|
|
616
|
+
|
|
617
|
+
```
|
|
618
|
+
You have a persistent context store. Files live as versioned markdown rows
|
|
619
|
+
addressed by logical path (e.g. "research/threat-models/llm.md"). The store
|
|
620
|
+
is a hybrid search index: every file is chunked, embedded locally, and
|
|
621
|
+
indexed with BM25 — so prefer membot_search to membot_read+grep for discovery.
|
|
622
|
+
|
|
623
|
+
Workflow:
|
|
624
|
+
1. membot_tree or membot_search to find what already exists before adding new content.
|
|
625
|
+
2. membot_add to ingest a local file, a URL, or a remote document. URLs are
|
|
626
|
+
fetched via mcpx (Firecrawl/Google-Docs/GitHub/HTTP); the chosen
|
|
627
|
+
invocation is stored so refresh is fast and deterministic.
|
|
628
|
+
3. membot_read or membot_search hits to consume content.
|
|
629
|
+
4. membot_write to record agent-authored notes (source_type='inline').
|
|
630
|
+
|
|
631
|
+
Versioning:
|
|
632
|
+
- Every ingest, refresh, or write that changes content creates a NEW
|
|
633
|
+
version_id (a timestamp). Older versions stay queryable via the
|
|
634
|
+
`version` parameter on membot_read / membot_info, or via membot_versions / membot_diff.
|
|
635
|
+
- All other tools default to the current (latest, non-tombstoned) version.
|
|
636
|
+
- membot_delete is a tombstone — history is preserved unless membot_prune runs.
|
|
637
|
+
|
|
638
|
+
Refresh:
|
|
639
|
+
- Each row has source metadata. membot_refresh re-reads the source, hashes
|
|
640
|
+
it, and only re-embeds when bytes changed. Safe to call often.
|
|
641
|
+
- If a file has refresh_frequency_sec set, the daemon refreshes it
|
|
642
|
+
automatically — you do not need to schedule it yourself.
|
|
643
|
+
|
|
644
|
+
When in doubt: search before you read, read before you write, and prefer
|
|
645
|
+
adding the source URL once (with a refresh interval) over copy-pasting
|
|
646
|
+
content that will go stale.
|
|
647
|
+
```
|
|
648
|
+
|
|
649
|
+
### Tool catalog
|
|
650
|
+
|
|
651
|
+
Description text below is the verbatim string sent to the LLM. Style: **bash-equivalent prefix → one-line purpose → when-to-use → constraints/recovery hints**, modeled on botholomew + Arcade's tool-description pattern.
|
|
652
|
+
|
|
653
|
+
#### `membot_search`
|
|
654
|
+
|
|
655
|
+
```
|
|
656
|
+
[[ bash equivalent: grep -r + semantic-search ]] Hybrid search over the context
|
|
657
|
+
store. Pass `query` (natural language → semantic) and/or `pattern` (regex over
|
|
658
|
+
chunk text); pass both for the strongest signal — hits matched by both float
|
|
659
|
+
to the top via reciprocal rank fusion. Searches the CURRENT version of every
|
|
660
|
+
file by default; set `include_history=true` to also search older versions.
|
|
661
|
+
This is the primary discovery tool — prefer it over membot_read+scan.
|
|
662
|
+
```
|
|
663
|
+
|
|
664
|
+
Inputs: `query?`, `pattern?`, `mode?` (`hybrid`|`semantic`|`keyword`, default `hybrid`), `path_prefix?`, `limit?` (default 10), `include_history?` (default false), `ignore_case?`.
|
|
665
|
+
Output: `[{logical_path, version_id, chunk_index, snippet, score, semantic_score, keyword_score}]`.
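Reciprocal rank fusion itself is compact enough to sketch. This is the textbook formulation, not necessarily what the ported `fuse.ts` does: `k = 60` is the conventional constant, and the function name and the use of plain string ids for hits are assumptions.

```ts
// Each ranking is a list of hit ids, best first. A hit's fused score is the
// sum over rankings of 1 / (k + rank), with rank starting at 1.
function reciprocalRankFusion(rankings: string[][], k = 60): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first; ties broken by id for deterministic output.
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score || a.id.localeCompare(b.id),
  );
}
```

A hit that appears in both the semantic and keyword rankings accumulates two terms, which is why doubly-matched chunks rise to the top.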
|
|
666
|
+
|
|
667
|
+
#### `membot_tree`
|
|
668
|
+
|
|
669
|
+
```
|
|
670
|
+
[[ bash equivalent: tree ]] Render the logical-path tree of the current store.
|
|
671
|
+
Tree is synthesised from "/" segments in logical_path — there are no real
|
|
672
|
+
directories. Tombstoned and historical versions are hidden. Use this before
|
|
673
|
+
membot_add to pick a sensible logical path.
|
|
674
|
+
```
|
|
675
|
+
|
|
676
|
+
Inputs: `prefix?`, `max_depth?` (default 4).
|
|
677
|
+
|
|
678
|
+
#### `membot_list`
|
|
679
|
+
|
|
680
|
+
```
|
|
681
|
+
[[ bash equivalent: ls ]] List current files under an optional prefix, with
|
|
682
|
+
size, mime type, refresh frequency, and last refresh status. Returns one row
|
|
683
|
+
per logical_path (current version only). Pair with membot_tree for discovery,
|
|
684
|
+
membot_search for content-based discovery.
|
|
685
|
+
```
|
|
686
|
+
|
|
687
|
+
Inputs: `prefix?`, `limit?`, `cursor?` (paginated).
|
|
688
|
+
|
|
689
|
+
#### `membot_read`
|
|
690
|
+
|
|
691
|
+
```
|
|
692
|
+
[[ bash equivalent: cat ]] Read a stored file. By default returns the
|
|
693
|
+
markdown surrogate (the converted/captioned text body). Pass bytes=true
|
|
694
|
+
to instead return the original raw bytes (base64-encoded for JSON, or as
|
|
695
|
+
an image content block when the path is an image and the MCP client
|
|
696
|
+
supports it). Defaults to the current version; pass `version` (timestamp)
|
|
697
|
+
to read a historical snapshot — use membot_versions to enumerate available
|
|
698
|
+
versions. For finding content across many files, use membot_search instead
|
|
699
|
+
of repeated membot_read calls.
|
|
700
|
+
```
|
|
701
|
+
|
|
702
|
+
Inputs: `logical_path`, `version?`, `bytes?` (default `false`), `offset?` (line, 1-based; ignored when bytes=true), `limit?` (lines; ignored when bytes=true).
|
|
703
|
+
Output (text mode): `{logical_path, version_id, content, description, mime_type, size_bytes, blob_available, version_is_current}`.
|
|
704
|
+
Output (bytes mode): `{logical_path, version_id, mime_type, size_bytes, bytes_base64}` — or for image mimes, an MCP image content block.
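The `offset`/`limit` semantics for text mode might be implemented along these lines. `sliceLines` is a hypothetical name, and clamping out-of-range values rather than rejecting them is an assumption.

```ts
// offset is a 1-based line number; limit is a line count. Both are optional.
function sliceLines(content: string, offset = 1, limit?: number): string {
  const lines = content.split("\n");
  const start = Math.max(offset, 1) - 1; // convert to a 0-based index
  const end = limit === undefined ? lines.length : start + Math.max(limit, 0);
  return lines.slice(start, end).join("\n");
}
```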
|
|
705
|
+
|
|
706
|
+
#### `membot_info`
|
|
707
|
+
|
|
708
|
+
```
|
|
709
|
+
Inspect metadata for a file: source (local path or URL), fetcher used,
|
|
710
|
+
refresh schedule, last refresh status, all sha256 digests, and whether
|
|
711
|
+
the requested version is the current one. Does NOT return file content —
|
|
712
|
+
use membot_read for that. Use this to decide whether a refresh is worth
|
|
713
|
+
forcing or whether to trust a cached row.
|
|
714
|
+
```
|
|
715
|
+
|
|
716
|
+
Inputs: `logical_path`, `version?`.
|
|
717
|
+
|
|
718
|
+
#### `membot_versions`
|
|
719
|
+
|
|
720
|
+
```
|
|
721
|
+
List every version of a file (newest first) with version_id, content_sha256,
|
|
722
|
+
size, change_note, and refresh status. Use this to find the version_id you
|
|
723
|
+
want to pass to membot_read or membot_diff. Tombstoned versions are included and
|
|
724
|
+
flagged.
|
|
725
|
+
```
|
|
726
|
+
|
|
727
|
+
Inputs: `logical_path`.
|
|
728
|
+
|
|
729
|
+
#### `membot_diff`
|
|
730
|
+
|
|
731
|
+
```
|
|
732
|
+
Return a unified diff between two versions of a file. `a` is required; `b`
|
|
733
|
+
defaults to the current version. Both `a` and `b` are version_id timestamps
|
|
734
|
+
from membot_versions. Use to understand what a refresh actually changed before
|
|
735
|
+
deciding to act on the new content.
|
|
736
|
+
```
|
|
737
|
+
|
|
738
|
+
Inputs: `logical_path`, `a` (version_id), `b?` (version_id, default current).
|
|
739
|
+
|
|
740
|
+
#### `membot_add`
|
|
741
|
+
|
|
742
|
+
```
|
|
743
|
+
Ingest one or many sources. `source` accepts:
|
|
744
|
+
- a local file path → ingests one file
|
|
745
|
+
- a local directory path → walks recursively (symlinks followed,
|
|
746
|
+
cycles broken by realpath cache),
|
|
747
|
+
filtered by include/exclude globs
|
|
748
|
+
- a glob pattern (e.g. "docs/**/*.md")→ expands relative to cwd; symlinks
|
|
749
|
+
followed
|
|
750
|
+
- a URL → fetched via mcpx (the chosen server
|
|
751
|
+
+ tool + args are stored so refresh
|
|
752
|
+
replays the exact invocation)
|
|
753
|
+
- "inline:<text>" → stores the literal as a new file
|
|
754
|
+
PDF, DOCX, HTML, images, and other binaries are converted to markdown —
|
|
755
|
+
native libraries first, vision/OCR for images, LLM fallback for messy or
|
|
756
|
+
scanned input. Original bytes are kept in the blobs table; membot_read with
|
|
757
|
+
bytes=true returns them. Setting `refresh_frequency` enables automatic
|
|
758
|
+
refresh of every ingested file from the daemon. Each ingested file becomes
|
|
759
|
+
a NEW version under its own logical_path; existing versions stay queryable
|
|
760
|
+
via membot_versions. Directory/glob ingests stream one file at a time —
|
|
761
|
+
partial failures don't abort the rest; the response lists per-entry status.
|
|
762
|
+
```

Inputs: `source` (path | dir | glob | URL | `inline:` literal), `logical_path?` (for a single source, defaults to a path derived from the filename / URL / relative path; for dir/glob ingests it is interpreted as a *prefix* under which entries are placed using their relative path), `include?` (glob; comma-separated allowed; default `**/*`), `exclude?` (glob; default excludes `node_modules`, `.git`, `.DS_Store`, dotfiles), `follow_symlinks?` (default `true`), `refresh_frequency?` (e.g. `1h`, `24h`), `fetcher_hint?` (e.g. `firecrawl`, `github`), `change_note?`.
Output: `{ingested: [{source_path, logical_path, version_id, status, error?, mime_type, size_bytes, fetcher, source_sha256}], total, ok, failed}`.
Error hints: on `auth_error` from a fetcher, hint `Run: mcpx auth <server>`; on `unsupported_mime`, list supported types; on `nothing_matched` for a glob, suggest broadening `include` or removing `exclude`.
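
The `source` dispatch described for `membot_add` could be sketched roughly as below. This is a minimal illustration, not the project's implementation: `classifySource`, `SourceKind`, and the glob-metacharacter heuristic are hypothetical, and a real version would also stat the path to tell a file from a directory.

```typescript
// Hypothetical sketch: classify a membot_add `source` string by shape alone.
// Distinguishing a file from a directory needs a filesystem stat, omitted here.
type SourceKind = "inline" | "url" | "glob" | "path";

interface ClassifiedSource {
  kind: SourceKind;
  value: string; // literal text for inline sources, otherwise the source unchanged
}

const INLINE_PREFIX = "inline:";

function classifySource(source: string): ClassifiedSource {
  // "inline:<text>" stores the literal text as a new file.
  if (source.startsWith(INLINE_PREFIX)) {
    return { kind: "inline", value: source.slice(INLINE_PREFIX.length) };
  }
  // Assumption: only http(s) URLs are treated as remote sources.
  if (/^https?:\/\//i.test(source)) {
    return { kind: "url", value: source };
  }
  // Assumption: any glob metacharacter means "expand relative to cwd".
  if (/[*?[\]{}]/.test(source)) {
    return { kind: "glob", value: source };
  }
  return { kind: "path", value: source };
}
```

Checking the `inline:` prefix first keeps literal text that happens to contain `*` or `https://` from being misrouted.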

#### `membot_write`

```
[[ bash equivalent: tee ]] Write inline agent-authored markdown. Creates a
new version (source_type='inline') under the given logical_path. Use this
to persist agent notes, summaries, or synthesised context that should
survive across conversations. For mirroring an external document, use
membot_add with a source URL instead; that gets you refresh-on-source-change
for free.
```

Inputs: `logical_path`, `content` (markdown), `change_note?`, `refresh_frequency?` (rarely useful for inline).

#### `membot_move`

```
[[ bash equivalent: mv ]] Rename a logical_path. Creates one new version
under the new path with full content carried over and tombstones the old
path. History remains queryable under both names via membot_versions.
```

Inputs: `from_logical_path`, `to_logical_path`.

#### `membot_delete`

```
[[ bash equivalent: rm ]] Tombstone a logical_path so it no longer appears
in membot_list / membot_tree / membot_search. Old versions remain queryable via
membot_versions and membot_read with an explicit version. Use membot_prune to
permanently drop history.
```

Inputs: `logical_path`.
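
The tombstone semantics of `membot_move` and `membot_delete` can be illustrated with a small in-memory model. Everything here is hypothetical (`VersionLog`, the numeric `versionId`, `content: null` as the tombstone marker); the plan stores versions as rows in DuckDB, which this sketch does not attempt to reproduce.

```typescript
// Hypothetical in-memory model of append-only versions with tombstones.
interface Version {
  logicalPath: string;
  versionId: number; // increasing counter standing in for a real version id
  content: string | null; // null marks a tombstone
}

class VersionLog {
  private rows: Version[] = [];
  private nextId = 1;

  write(logicalPath: string, content: string): Version {
    return this.append(logicalPath, content);
  }

  // Deleting appends a tombstone; earlier versions are left untouched.
  delete(logicalPath: string): Version {
    return this.append(logicalPath, null);
  }

  // Moving = one new version under the new path + a tombstone on the old path.
  move(from: string, to: string): void {
    const current = this.current(from);
    if (current === undefined || current.content === null) {
      throw new Error(`no current version at ${from}`);
    }
    this.write(to, current.content);
    this.delete(from);
  }

  // Full history for a path, newest first (tombstones included).
  versions(logicalPath: string): Version[] {
    return this.rows
      .filter((row) => row.logicalPath === logicalPath)
      .sort((a, b) => b.versionId - a.versionId);
  }

  current(logicalPath: string): Version | undefined {
    return this.versions(logicalPath)[0];
  }

  // Paths whose newest version is not a tombstone.
  list(): string[] {
    const paths = Array.from(new Set(this.rows.map((row) => row.logicalPath)));
    return paths
      .filter((path) => {
        const head = this.current(path);
        return head !== undefined && head.content !== null;
      })
      .sort();
  }

  private append(logicalPath: string, content: string | null): Version {
    const row: Version = { logicalPath, versionId: this.nextId++, content };
    this.rows.push(row);
    return row;
  }
}
```

Because nothing is ever overwritten, listing hides a tombstoned path while its history stays readable, which matches the split between `membot_list` and `membot_versions` described above.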

#### `membot_refresh`

```
Re-read a file's source and create a new version only if the source bytes
changed. Pass `logical_path` to refresh one file, or omit it to refresh
every file whose refresh_frequency_sec has elapsed. Local files are
detected via mtime+sha; remote files are re-fetched via the same mcpx
invocation that was originally used. On auth or network failure the prior
version stays current; check `last_refresh_status`.
```

Inputs: `logical_path?`, `force?` (re-embed even if sha unchanged).
Output: `{processed: [{logical_path, status, new_version_id?}]}`.
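
A minimal sketch of the two decisions `membot_refresh` makes: whether a file is due, and whether a re-read should produce a new version. The `TrackedFile` shape, the `new_version` outcome name, and the treatment of `force` as bypassing the hash check are assumptions for illustration; only `refresh_frequency_sec` and the `unchanged` status come from the plan.

```typescript
// Hypothetical shapes; the real columns live in the DuckDB `files` table.
interface TrackedFile {
  sourceSha256: string;
  refreshFrequencySec: number | null; // null = never auto-refresh
  lastRefreshedAtMs: number;
}

type RefreshOutcome = "unchanged" | "new_version";

// A file is due once its refresh interval has fully elapsed.
function isDue(file: TrackedFile, nowMs: number): boolean {
  if (file.refreshFrequencySec === null) return false;
  return nowMs - file.lastRefreshedAtMs >= file.refreshFrequencySec * 1000;
}

// Compare the freshly computed source hash with the stored one.
// Assumption: `force` skips the comparison and always yields a new version.
function decideRefresh(
  file: TrackedFile,
  freshSha256: string,
  force = false,
): RefreshOutcome {
  if (force) return "new_version";
  return freshSha256 === file.sourceSha256 ? "unchanged" : "new_version";
}
```

Keeping the hash comparison separate from the due check lets a manual refresh of a single `logical_path` reuse `decideRefresh` without consulting the schedule.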

#### `membot_prune`

```
Permanently drop non-current versions older than the cutoff. Current
versions and tombstones-with-no-newer-version are preserved. Use sparingly:
pruned versions cannot be recovered.
```

Inputs: `before` (duration like `30d`, or absolute timestamp), `dry_run?` (default true).
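
The `before` input accepts either a duration or an absolute timestamp, which could be resolved to a cutoff along these lines. `parseCutoff` and the accepted unit set are hypothetical: `s`, `h`, and `d` appear in the plan's examples, while `m` (minutes) is an added assumption.

```typescript
// Milliseconds per unit. `m` is an assumption; the plan only shows s, h, and d.
const UNIT_MS: Record<string, number> = {
  s: 1_000,
  m: 60_000,
  h: 3_600_000,
  d: 86_400_000,
};

// Resolve `before` to an absolute cutoff. Durations are measured back from `now`.
function parseCutoff(before: string, now: Date = new Date()): Date {
  const match = /^(\d+)([smhd])$/.exec(before.trim());
  if (match) {
    const amount = Number(match[1]);
    const unitMs = UNIT_MS[match[2]];
    return new Date(now.getTime() - amount * unitMs);
  }
  // Otherwise treat the input as an absolute timestamp (e.g. ISO 8601).
  const parsed = new Date(before);
  if (Number.isNaN(parsed.getTime())) {
    throw new Error(`Unrecognised cutoff: ${before}`);
  }
  return parsed;
}
```

With this reading, `0s` resolves to the current instant, so every non-current version falls before the cutoff.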

---

## Config (`~/.membot/config.json`)

```jsonc
{
  "data_dir": "~/.membot",  // override MEMBOT_HOME
  "embedding_model": "Xenova/bge-small-en-v1.5",
  "embedding_dimension": 384,
  "chunker": { "mode": "deterministic", "target_chars": 4000, "max_chars": 15000 },
  "llm": {
    "anthropic_api_key": "",  // env: ANTHROPIC_API_KEY
    "converter_model": "claude-haiku-4-5-20251001",
    "chunker_model": "claude-haiku-4-5-20251001"
  },
  "mcpx": { "config_path": "" },  // for remote fetchers
  "daemon": { "tick_interval_sec": 60 },
  "default_refresh_frequency_sec": null
}
```

LLM fallback is opt-in: it runs only when an API key is available. If `ANTHROPIC_API_KEY` is missing, the converter dispatcher falls through to passthrough/error rather than calling the API.
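
The key check behind the LLM fallback might look like the sketch below. The field names mirror the `llm` block of the sample config, but `resolveApiKey`, `pickFallbackRoute`, and the precedence of the environment variable over the config file are assumptions, not confirmed behaviour.

```typescript
// Field names mirror the `llm` block of the sample config.
interface LlmConfig {
  anthropic_api_key: string;
  converter_model: string;
  chunker_model: string;
}

type Env = Record<string, string | undefined>;

// Assumption: the environment variable wins over the config file value.
function resolveApiKey(llm: LlmConfig, env: Env): string | null {
  const key = (env["ANTHROPIC_API_KEY"] ?? "").trim() || llm.anthropic_api_key.trim();
  return key === "" ? null : key;
}

type FallbackRoute =
  | { kind: "llm"; model: string; apiKey: string }
  | { kind: "passthrough" };

// With no key available, never call the API: fall through to passthrough.
function pickFallbackRoute(llm: LlmConfig, env: Env): FallbackRoute {
  const apiKey = resolveApiKey(llm, env);
  if (apiKey === null) return { kind: "passthrough" };
  return { kind: "llm", model: llm.converter_model, apiKey };
}
```

Passing `env` explicitly (for example `process.env`) keeps the function easy to test without mutating global state.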

---

## Key Dependencies (`package.json`)

Runtime:

- `@modelcontextprotocol/sdk`: MCP server (stdio + HTTP)
- `commander`: CLI parsing (CLI subcommands generated from operations via `mountAsCommanderCommand`)
- `zod-to-json-schema`: convert zod input schemas to MCP tool JSON-Schema
- `@duckdb/node-api` ^1.5.x: index + FTS + cosine
- `@huggingface/transformers` ^4.x + patched `onnxruntime-web`: local embeddings (port mcpx patch)
- `@evantahler/mcpx`: remote fetcher orchestration
- `@anthropic-ai/sdk`: LLM fallback only
- `unpdf`: Bun-friendly PDF → text
- `mammoth`: DOCX → HTML/markdown
- `turndown`: HTML → markdown
- `tesseract.js`: Tesseract WASM (OCR for images and scanned PDFs)
- `picomatch`: glob expansion for `ctx add` (mirror mcpx)
- `gray-matter`: frontmatter parse on inbound .md files
- `zod` ^4: schemas
- `ansis`, `nanospinner`: output (mirror mcpx)

Dev: `@biomejs/biome`, `bun-types`.

Build: `bun build --compile --minify --sourcemap ./src/cli.ts --outfile dist/ctx`. Pre-build script applies the transformers WASM patch (copy from mcpx).

---

## Verification

End-to-end smoke (run after implementation):

1. `bun install && bun run build`
2. `./dist/ctx add ./README.md --path docs/readme.md` → verify row in `index.duckdb` with `source_type='local'`, `source_sha256` set, `chunks` populated.
3. `./dist/ctx add https://example.com --refresh 1h` → verify `fetcher` column set (e.g. `firecrawl` or `http`), markdown content stored, `blob_sha256` set.
4. `./dist/ctx add ./test/fixtures/sample.pdf` → verify converted markdown is non-empty (native unpdf path); `blobs` row holds the original PDF.
4a. `./dist/ctx add ./test/fixtures/scanned.pdf` → unpdf returns empty → OCR fallback runs → markdown contains OCR'd text.
4b. `./dist/ctx add ./test/fixtures/diagram.png` → vision caption + OCR folded; `description` column populated; `blobs` row holds PNG bytes.
4c. `./dist/ctx read diagram.png --bytes --format raw > out.png && diff out.png ./test/fixtures/diagram.png` → byte-exact round-trip.
4d. `./dist/ctx add ./docs --include "**/*.md" --include "**/*.txt" --exclude "**/node_modules/**"` → directory walk; result lists every `.md`/`.txt` ingested with its logical path; symlinks within `./docs` were followed without infinite loop.
4e. `./dist/ctx add "./src/**/*.ts"` → glob pattern expanded; only matching files ingested.
5. `./dist/ctx search "<query from sample>"` → returns hybrid hits with `score` and `semantic_score`.
5a. `./dist/ctx search "diagram"` → the PNG ingested in 4b is in the top hits even though its raw markdown body is short, which shows that the description+filename prefix lifted recall.
6. `./dist/ctx tree` → synthesised tree from logical paths.
7. Edit `README.md`, run `./dist/ctx refresh docs/readme.md` → verify a NEW `version_id` row appears (older row preserved), `source_sha256` and `content_sha256` differ from prior version.
8. `./dist/ctx refresh docs/readme.md` again (no edit) → `last_refresh_status='unchanged'`, no new version row created.
8a. `./dist/ctx versions docs/readme.md` → both versions listed, newest first; `--version <old-ts>` on `ctx read` returns the prior content; default `ctx read` returns latest.
8b. `./dist/ctx diff docs/readme.md <old-ts>` → unified diff between old and current version.
8c. `./dist/ctx rm docs/readme.md` → tombstone created; `ctx ls` and `ctx search` no longer surface it; `ctx versions` still lists history.
8d. `./dist/ctx prune --before 0s --dry-run=false` → non-current versions dropped; current version + tombstone remain.
9. `./dist/ctx serve` → connect with `mcpx exec` (or any MCP client) and call `membot_search` over stdio.
10. `./dist/ctx serve --watch --tick 5` → modify a tracked local file; within ~5s the daemon refreshes it.
11. `bun test` runs unit tests covering: converter dispatch, chunker determinism, embedder dimension, refresh sha-stable skip path, hybrid search RRF, **HelpfulError-required-hint invariant** (constructing one without a hint throws), **mount adapter error rendering** (HelpfulError → MCP isError + structuredContent; HelpfulError → CLI stderr JSON in non-interactive, colorized text in interactive), **TTY detection** (CI=true forces non-interactive; --json forces non-interactive even on TTY).
12. Spot-check interactive vs non-interactive: `./dist/ctx add ./docs --include "**/*.md"` shows a progress bar in a terminal; `./dist/ctx add ./docs --include "**/*.md" | cat` emits one JSON-friendly line per entry to stderr and a single result JSON to stdout, no ANSI bytes leaking through.
13. Spot-check error UX: `./dist/ctx read missing.md` exits non-zero with a one-line message + concrete hint (e.g. `"Run \`ctx ls\` to see available paths."`); the same call as `membot_read` over MCP returns `isError: true` with `hint` set to the same string.

Done when all of the steps above pass and the binary launches on darwin-arm64 without `bun` installed.
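
Three of the invariants exercised by the unit tests in step 11 can be sketched as follows. The class and function shapes are hypothetical: the plan names `HelpfulError`, `isError`, `structuredContent`, and the `CI=true` / `--json` rules, but the constructor signature, the payload layout, and `isInteractive` are assumptions made for illustration.

```typescript
// Hypothetical shape: an error that cannot be constructed without a hint.
class HelpfulError extends Error {
  readonly hint: string;

  constructor(message: string, hint: string) {
    super(message);
    if (typeof hint !== "string" || hint.trim() === "") {
      throw new TypeError("HelpfulError requires a non-empty hint");
    }
    this.name = "HelpfulError";
    this.hint = hint;
  }
}

// Render the error as an MCP tool result. The layout of `structuredContent`
// is an assumption; the plan only says the hint must be carried over.
function toMcpToolResult(err: HelpfulError) {
  return {
    isError: true as const,
    content: [{ type: "text" as const, text: `${err.message}\nHint: ${err.hint}` }],
    structuredContent: { message: err.message, hint: err.hint },
  };
}

interface TtyInputs {
  stdoutIsTTY: boolean;
  env: Record<string, string | undefined>;
  jsonFlag: boolean; // true when --json was passed
}

// --json and CI=true both force non-interactive output, even on a TTY.
function isInteractive({ stdoutIsTTY, env, jsonFlag }: TtyInputs): boolean {
  if (jsonFlag) return false;
  if (env["CI"] === "true") return false;
  return stdoutIsTTY;
}
```

Enforcing the hint in the constructor means both the CLI and MCP adapters can rely on it being present rather than checking at render time.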
|
package/package.json
ADDED
|
@@ -0,0 +1,26 @@
{
  "name": "membot",
  "version": "0.0.1",
  "description": "Versioned context store with hybrid search for AI agents. Stdio + HTTP MCP server and CLI.",
  "keywords": [
    "mcp",
    "model-context-protocol",
    "context",
    "memory",
    "agent",
    "rag",
    "embeddings",
    "duckdb",
    "bun"
  ],
  "license": "MIT",
  "author": "Evan Tahler <evan@arcade.dev>",
  "repository": {
    "type": "git",
    "url": "https://github.com/evantahler/membot.git"
  },
  "homepage": "https://github.com/evantahler/membot",
  "bugs": {
    "url": "https://github.com/evantahler/membot/issues"
  }
}