membot 0.1.0 → 0.1.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.claude/skills/membot.md +137 -0
- package/.cursor/rules/membot.mdc +137 -0
- package/README.md +126 -0
- package/package.json +5 -3
- package/patches/@evantahler%2Fmcpx@0.21.4.patch +44 -0
- package/scripts/apply-patches.sh +49 -0
- package/src/cli.ts +2 -0
- package/src/commands/skill.ts +131 -0
- package/src/ingest/embedder.ts +18 -3
- package/src/ingest/ingest.ts +34 -5
- package/src/ingest/source-resolver.ts +34 -7
- package/src/operations/add.ts +21 -8
- package/src/types/text-modules.d.ts +9 -0
- package/scripts/apply-transformers-patch.sh +0 -35
package/.claude/skills/membot.md
ADDED

@@ -0,0 +1,137 @@
+---
+name: membot
+description: Persistent, versioned context store for AI agents — ingest, search, read, and write knowledge via the membot CLI or MCP server
+trigger: when the user wants to remember, recall, or search project knowledge, ingest documents into a long-lived store, or surface relevant context for a task
+---
+
+# membot — Persistent Context for Agents
+
+You have access to a long-lived context store via `membot`. Files (markdown, PDFs, DOCX, HTML, URLs, agent notes) are ingested, converted to markdown, chunked, embedded locally, and indexed in DuckDB with hybrid search (semantic + BM25). Every artifact is addressed by a virtual `logical_path`. Every change creates a new immutable version — nothing is overwritten in place.
+
+Use this workflow:
+
+## 1. Discover what's already there
+
+Before ingesting, check whether the knowledge already exists.
+
+```bash
+membot tree                  # synthesised directory tree of logical_paths
+membot ls                    # one row per current file (size, mime, refresh status)
+membot ls docs/              # filter by prefix
+membot search "<question>"   # hybrid search (semantic + keyword)
+```
+
+`search` is the primary discovery tool — prefer it over scanning files.
+
+## 2. Ingest
+
+```bash
+membot add ./README.md                          # single file
+membot add ./docs                               # recursive directory walk
+membot add "docs/**/*.md"                       # glob
+membot add https://example.com/spec.pdf         # URL (auto-converted to markdown)
+membot add "inline:Decision: use X because Y"   # literal text
+membot add ./docs --refresh-frequency 24h       # auto-refresh every day
+```
+
+Each entry becomes a new version under its own `logical_path`. PDFs/DOCX/HTML are converted to markdown; images get vision captions; original bytes are kept and reachable via `membot read --bytes`.
+
+## 3. Read
+
+```bash
+membot read <logical_path>                       # current markdown surrogate
+membot read <logical_path> --bytes               # original bytes (base64) — PDF/DOCX/image as ingested
+membot read <logical_path> --version <ts>        # historical snapshot
+membot info <logical_path>                       # metadata only (no content)
+membot versions <logical_path>                   # every version, newest first
+membot diff <logical_path> --a <ts> [--b <ts>]   # unified diff between versions
+```
+
+Defaults to the current (non-tombstoned) version. Pass `--version` only when you need history.
+
+## 4. Write your own notes
+
+Persist agent-authored summaries, decisions, or synthesised context so they survive across conversations:
+
+```bash
+membot write notes/decision-2026-05.md --content "Decided to ..."
+```
+
+Inline writes create a new `(logical_path, version_id)` row just like file ingests — `membot versions` lists them, `membot diff` compares them. To mirror an external doc that should re-fetch over time, use `membot add <url> --refresh-frequency` instead.
+
+## 5. Refresh, rename, delete, prune
+
+```bash
+membot refresh <logical_path>    # re-read source; new version only if bytes changed
+membot refresh                   # refresh all rows whose schedule has elapsed
+membot mv old/path new/path      # rename (history preserved under both)
+membot rm <logical_path>         # tombstone (history still queryable)
+membot prune --before <iso-ts>   # drop non-current versions older than cutoff (irreversible)
+```
+
+Tombstones hide a path from `ls` / `tree` / `search`, but `versions` and `read --version <ts>` still work. Pruning is the only way to actually remove data.
+
+## Versioning rules
+
+- Defaults always operate on the current, non-tombstoned version.
+- Pass an explicit `--version <timestamp>` (from `membot versions`) to read or diff history.
+- `membot_add` (when source bytes have changed), refresh-with-changes, `write`, and `mv` each create a new version; the previous version is preserved. Re-running `membot_add` against an unchanged source is a no-op (status `unchanged`, same `version_id`); pass `force=true` to force a new version.
+- Mutating an existing version is not possible — corrections are new versions.
+
+## When to use this skill
+
+- The user asks to remember, recall, save, or look up something across conversations.
+- You need project-specific context (specs, decisions, transcripts, rendered docs) that's too large to fit in the prompt.
+- You need to ingest a document (PDF, DOCX, HTML, URL) and reason over it.
+- You're producing a summary or decision that should survive past this conversation.
+
+## When NOT to use this skill
+
+- Reading a file the user just pointed at — use the regular file-read tool unless they want it persisted.
+- Storing secrets, credentials, or anything that shouldn't sit in `~/.membot/index.duckdb`.
+- Quick scratch state for the current turn — keep that in the conversation.
+
+## MCP server
+
+`membot serve` exposes the same operations as MCP tools (`membot_add`, `membot_search`, etc.) over stdio (default) or HTTP (`--http <port>`). When connected, prefer the MCP tools over shelling out — they return structured `outputSchema` data with `version_id` echoed on every read.
+
+## Available commands
+
+| Command                               | Purpose                                                                           |
+| ------------------------------------- | --------------------------------------------------------------------------------- |
+| `membot add <source>`                 | Ingest file, directory, glob, URL, or `inline:<text>`. Skips unchanged sources; pass `--force` to re-ingest |
+| `membot ls [prefix]`                  | List current files (size, mime, refresh status)                                   |
+| `membot tree [prefix]`                | Render the synthesised logical-path tree                                          |
+| `membot read <path>`                  | Read current markdown surrogate (or `--bytes` for original)                       |
+| `membot write <path> --content <txt>` | Write inline agent-authored markdown as a new version                             |
+| `membot search <query>`               | Hybrid search (semantic + BM25); add `--include-history` to search older versions |
+| `membot info <path>`                  | Inspect metadata (source, fetcher, refresh schedule, digests) without content     |
+| `membot versions <path>`              | List every version newest-first with version_id and change notes                  |
+| `membot diff <path> --a <ts>`         | Unified diff between two versions                                                 |
+| `membot mv <old> <new>`               | Rename a logical_path (history preserved)                                         |
+| `membot rm <path>`                    | Tombstone a logical_path (history still queryable)                                |
+| `membot refresh [path]`               | Re-read source; create new version only if bytes changed                          |
+| `membot prune --before <ts>`          | Permanently drop non-current versions older than cutoff (irreversible)            |
+| `membot serve`                        | Start MCP server (stdio default, `--http <port>` for HTTP)                        |
+| `membot reindex`                      | Rebuild the FTS keyword index over current chunks                                 |
+
+## Output formats
+
+- TTY → spinners, colors, tables. `--no-color` disables ANSI.
+- Piped, `--json`, `CI=true`, or `NO_COLOR` → JSON to stdout, structured logs to stderr, no ANSI bytes.
+- Use `--json` when parsing output programmatically (it's automatic when piped, but explicit is safer).
+- Use `--verbose` if a command fails unexpectedly.
+
+## Troubleshooting
+
+- **"ingest failed: unsupported mime"** → Add a converter or pass `--bytes` to keep the original; the LLM fallback only runs when `ANTHROPIC_API_KEY` is set.
+- **"refresh failed: auth"** → The original fetch used an authenticated mcpx tool; re-auth via `mcpx auth <server>`.
+- **Search returns nothing** → Confirm the file ingested with `membot info <path>`; if needed, run `membot reindex` to rebuild the FTS keyword index.
+- **Stale results after manual DB edits** → `membot reindex`.
+- **Two paths point at the same content** → `membot mv` doesn't merge; tombstone one with `membot rm`.
+
+## Configuration
+
+- Data lives in `~/.membot/index.duckdb` (override via `MEMBOT_HOME`).
+- Optional `ANTHROPIC_API_KEY` enables LLM fallback for messy/binary input. Without it, conversion degrades to deterministic native output.
+- Config file: `~/.membot/config.json` (see `membot --help` for the global flags).
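The skill file above tells the agent that piped output is JSON. As a concrete sketch of consuming it from the shell, here is how search hits might be filtered down to their paths; the payload in `results` is a hypothetical example, not membot's documented schema:

```shell
# Hypothetical `membot search --json` payload; the field names below are
# assumed for illustration and may not match membot's real schema.
results='[{"logical_path":"docs/refresh.md","score":0.91},
          {"logical_path":"notes/old.md","score":0.42}]'
# Keep only strong hits and print their logical_paths.
printf '%s\n' "$results" | jq -r '.[] | select(.score > 0.5) | .logical_path'
```

In a real pipeline, `results` would come from `membot search "<question>" --json` rather than a literal.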
package/.cursor/rules/membot.mdc
ADDED

@@ -0,0 +1,137 @@
+---
+description: Persistent, versioned context store for AI agents — ingest, search, read, and write knowledge via the membot CLI or MCP server
+globs:
+alwaysApply: true
+---
+
+# membot — Persistent Context for Agents
+
+You have access to a long-lived context store via `membot`. Files (markdown, PDFs, DOCX, HTML, URLs, agent notes) are ingested, converted to markdown, chunked, embedded locally, and indexed in DuckDB with hybrid search (semantic + BM25). Every artifact is addressed by a virtual `logical_path`. Every change creates a new immutable version — nothing is overwritten in place.
+
+Use this workflow:
+
+## 1. Discover what's already there
+
+Before ingesting, check whether the knowledge already exists.
+
+```bash
+membot tree                  # synthesised directory tree of logical_paths
+membot ls                    # one row per current file (size, mime, refresh status)
+membot ls docs/              # filter by prefix
+membot search "<question>"   # hybrid search (semantic + keyword)
+```
+
+`search` is the primary discovery tool — prefer it over scanning files.
+
+## 2. Ingest
+
+```bash
+membot add ./README.md                          # single file
+membot add ./docs                               # recursive directory walk
+membot add "docs/**/*.md"                       # glob
+membot add https://example.com/spec.pdf         # URL (auto-converted to markdown)
+membot add "inline:Decision: use X because Y"   # literal text
+membot add ./docs --refresh-frequency 24h       # auto-refresh every day
+```
+
+Each entry becomes a new version under its own `logical_path`. PDFs/DOCX/HTML are converted to markdown; images get vision captions; original bytes are kept and reachable via `membot read --bytes`.
+
+## 3. Read
+
+```bash
+membot read <logical_path>                       # current markdown surrogate
+membot read <logical_path> --bytes               # original bytes (base64) — PDF/DOCX/image as ingested
+membot read <logical_path> --version <ts>        # historical snapshot
+membot info <logical_path>                       # metadata only (no content)
+membot versions <logical_path>                   # every version, newest first
+membot diff <logical_path> --a <ts> [--b <ts>]   # unified diff between versions
+```
+
+Defaults to the current (non-tombstoned) version. Pass `--version` only when you need history.
+
+## 4. Write your own notes
+
+Persist agent-authored summaries, decisions, or synthesised context so they survive across conversations:
+
+```bash
+membot write notes/decision-2026-05.md --content "Decided to ..."
+```
+
+Inline writes create a new `(logical_path, version_id)` row just like file ingests — `membot versions` lists them, `membot diff` compares them. To mirror an external doc that should re-fetch over time, use `membot add <url> --refresh-frequency` instead.
+
+## 5. Refresh, rename, delete, prune
+
+```bash
+membot refresh <logical_path>    # re-read source; new version only if bytes changed
+membot refresh                   # refresh all rows whose schedule has elapsed
+membot mv old/path new/path      # rename (history preserved under both)
+membot rm <logical_path>         # tombstone (history still queryable)
+membot prune --before <iso-ts>   # drop non-current versions older than cutoff (irreversible)
+```
+
+Tombstones hide a path from `ls` / `tree` / `search`, but `versions` and `read --version <ts>` still work. Pruning is the only way to actually remove data.
+
+## Versioning rules
+
+- Defaults always operate on the current, non-tombstoned version.
+- Pass an explicit `--version <timestamp>` (from `membot versions`) to read or diff history.
+- `membot_add` (when source bytes have changed), refresh-with-changes, `write`, and `mv` each create a new version; the previous version is preserved. Re-running `membot_add` against an unchanged source is a no-op (status `unchanged`, same `version_id`); pass `force=true` to force a new version.
+- Mutating an existing version is not possible — corrections are new versions.
+
+## When to use this rule
+
+- The user asks to remember, recall, save, or look up something across conversations.
+- You need project-specific context (specs, decisions, transcripts, rendered docs) that's too large to fit in the prompt.
+- You need to ingest a document (PDF, DOCX, HTML, URL) and reason over it.
+- You're producing a summary or decision that should survive past this conversation.
+
+## When NOT to use this rule
+
+- Reading a file the user just pointed at — use the regular file-read tool unless they want it persisted.
+- Storing secrets, credentials, or anything that shouldn't sit in `~/.membot/index.duckdb`.
+- Quick scratch state for the current turn — keep that in the conversation.
+
+## MCP server
+
+`membot serve` exposes the same operations as MCP tools (`membot_add`, `membot_search`, etc.) over stdio (default) or HTTP (`--http <port>`). When connected, prefer the MCP tools over shelling out — they return structured `outputSchema` data with `version_id` echoed on every read.
+
+## Available commands
+
+| Command                               | Purpose                                                                           |
+| ------------------------------------- | --------------------------------------------------------------------------------- |
+| `membot add <source>`                 | Ingest file, directory, glob, URL, or `inline:<text>`. Skips unchanged sources; pass `--force` to re-ingest |
+| `membot ls [prefix]`                  | List current files (size, mime, refresh status)                                   |
+| `membot tree [prefix]`                | Render the synthesised logical-path tree                                          |
+| `membot read <path>`                  | Read current markdown surrogate (or `--bytes` for original)                       |
+| `membot write <path> --content <txt>` | Write inline agent-authored markdown as a new version                             |
+| `membot search <query>`               | Hybrid search (semantic + BM25); add `--include-history` to search older versions |
+| `membot info <path>`                  | Inspect metadata (source, fetcher, refresh schedule, digests) without content     |
+| `membot versions <path>`              | List every version newest-first with version_id and change notes                  |
+| `membot diff <path> --a <ts>`         | Unified diff between two versions                                                 |
+| `membot mv <old> <new>`               | Rename a logical_path (history preserved)                                         |
+| `membot rm <path>`                    | Tombstone a logical_path (history still queryable)                                |
+| `membot refresh [path]`               | Re-read source; create new version only if bytes changed                          |
+| `membot prune --before <ts>`          | Permanently drop non-current versions older than cutoff (irreversible)            |
+| `membot serve`                        | Start MCP server (stdio default, `--http <port>` for HTTP)                        |
+| `membot reindex`                      | Rebuild the FTS keyword index over current chunks                                 |
+
+## Output formats
+
+- TTY → spinners, colors, tables. `--no-color` disables ANSI.
+- Piped, `--json`, `CI=true`, or `NO_COLOR` → JSON to stdout, structured logs to stderr, no ANSI bytes.
+- Use `--json` when parsing output programmatically (it's automatic when piped, but explicit is safer).
+- Use `--verbose` if a command fails unexpectedly.
+
+## Troubleshooting
+
+- **"ingest failed: unsupported mime"** → Add a converter or pass `--bytes` to keep the original; the LLM fallback only runs when `ANTHROPIC_API_KEY` is set.
+- **"refresh failed: auth"** → The original fetch used an authenticated mcpx tool; re-auth via `mcpx auth <server>`.
+- **Search returns nothing** → Confirm the file ingested with `membot info <path>`; if needed, run `membot reindex` to rebuild the FTS keyword index.
+- **Stale results after manual DB edits** → `membot reindex`.
+- **Two paths point at the same content** → `membot mv` doesn't merge; tombstone one with `membot rm`.
+
+## Configuration
+
+- Data lives in `~/.membot/index.duckdb` (override via `MEMBOT_HOME`).
+- Optional `ANTHROPIC_API_KEY` enables LLM fallback for messy/binary input. Without it, conversion degrades to deterministic native output.
+- Config file: `~/.membot/config.json` (see `membot --help` for the global flags).
package/README.md
ADDED
@@ -0,0 +1,126 @@
+# membot
+
+> Versioned context store with hybrid search for AI agents. Stdio + HTTP MCP server and CLI.
+
+[](https://www.npmjs.com/package/membot)
+[](./LICENSE)
+
+`membot` is a single-binary CLI and MCP server that gives AI agents a persistent, versioned, searchable context store. Files (markdown, PDFs, DOCX, HTML, URLs, agent-authored notes) are ingested, converted to markdown, chunked, embedded **locally** with `@huggingface/transformers` (WASM, no cloud calls), and indexed in DuckDB with hybrid search (semantic vector + BM25). Every change creates a new version — nothing is overwritten in place.
+
+- **Local everything** — embeddings run on your machine; data lives in `~/.membot/index.duckdb`.
+- **One mental model** — every artifact (markdown, PDF, image, audio) becomes a markdown surrogate that flows through the same chunk → embed → search pipeline.
+- **Append-only versioning** — every ingest, refresh, or write creates a new `(logical_path, version_id)` row. History is queryable; nothing is mutated.
+- **Two surfaces, one source of truth** — every operation is exposed identically as a CLI subcommand and an MCP tool. The agent sees `membot_search`; you see `membot search`.
+
+## Install
+
+```bash
+bun install -g membot
+# or
+npm install -g membot
+```
+
+This pulls in DuckDB's per-platform native bindings alongside membot. The build externalizes `@duckdb/*` (those `.node` bindings can't be embedded by `bun build --compile`), so a global npm/bun install is the supported path.
+
+## Quick start
+
+```bash
+membot add ./docs                         # ingest a directory recursively
+membot add https://example.com/spec.pdf   # ingest a URL (auto-converted to markdown)
+membot ls                                 # list current files
+membot search "how does refresh work?"    # hybrid search
+membot read docs/refresh.md               # read the markdown surrogate
+membot serve                              # expose the same operations as MCP tools (stdio)
+```
+
+## Use with Claude Code or Cursor
+
+`membot skill install` drops the agent skill into the right place so Claude Code or Cursor knows **when** to call `membot`.
+
+```bash
+membot skill install --claude               # writes ./.claude/skills/membot.md (project)
+membot skill install --cursor               # writes ./.cursor/rules/membot.mdc (project)
+membot skill install --claude --global      # writes ~/.claude/skills/membot.md
+membot skill install --claude --cursor -f   # both, overwrite if present
+```
+
+The skill files describe the discover → ingest → search → read → write workflow and the versioning rules. You can re-run with `--force` to refresh them after upgrading membot.
+
+## Commands
+
+| Command                      | Description                                                                  |
+| ---------------------------- | ---------------------------------------------------------------------------- |
+| `membot add <source>`        | Ingest a file, directory, glob, URL, or `inline:<text>`. Skips on unchanged source bytes; pass `--force` to re-ingest |
+| `membot ls [prefix]`         | List current files (size, mime, refresh status)                              |
+| `membot tree [prefix]`       | Render the synthesised logical-path tree                                     |
+| `membot read <path>`         | Read the markdown surrogate (or `--bytes` for original bytes, base64)        |
+| `membot search <query>`      | Hybrid search (semantic + BM25); `--include-history` searches older versions |
+| `membot info <path>`         | Inspect metadata (source, fetcher, schedule, digests) without content        |
+| `membot versions <path>`     | List every version newest-first                                              |
+| `membot diff <path> <a> [b]` | Unified diff between two versions                                            |
+| `membot write <path>`        | Write inline agent-authored markdown as a new version                        |
+| `membot mv <from> <to>`      | Rename a logical_path (history preserved under both)                         |
+| `membot rm <path>`           | Tombstone a logical_path (history still queryable)                           |
+| `membot refresh [path]`      | Re-read source; new version only if bytes changed                            |
+| `membot prune --before <ts>` | Permanently drop non-current versions older than cutoff (irreversible)       |
+| `membot serve`               | Run the MCP server (stdio default; `--http <port>` for HTTP)                 |
+| `membot reindex`             | Rebuild the FTS keyword index over current chunks                            |
+| `membot mcpx <subcommand>`   | Forward to the bundled `mcpx` CLI for managing remote MCP servers            |
+| `membot skill install`       | Install the Claude Code / Cursor agent skill                                 |
+
+Run `membot <command> --help` for full flags and arguments. Every command produces JSON when piped, when `--json` is set, or when `CI=true`.
+
+## MCP server
+
+`membot serve` exposes every operation as an MCP tool. Stdio is the default; pass `--http <port>` for streamable HTTP.
+
+**Claude Desktop** (`~/Library/Application Support/Claude/claude_desktop_config.json` on macOS):
+
+```json
+{
+  "mcpServers": {
+    "membot": {
+      "command": "membot",
+      "args": ["serve"]
+    }
+  }
+}
+```
+
+**Streamable HTTP** (any MCP client that speaks HTTP):
+
+```bash
+membot serve --http 3000
+# tool endpoint: http://localhost:3000/mcp
+```
+
+Add `--watch` (and optional `--tick <sec>`) to also run the refresh daemon, which re-reads any file whose `refresh_frequency` has elapsed.
+
+## Configuration
+
+- **Data directory:** `~/.membot/` (override with `MEMBOT_HOME=/path` or `--config <path>`).
+  - `~/.membot/index.duckdb` — all content, blobs, chunks, embeddings, and metadata.
+  - `~/.membot/models/` — cached embedding model weights (`Xenova/bge-small-en-v1.5`, 384-dim).
+  - `~/.membot/logs/` — daemon logs when running `serve --watch`.
+- **Config file:** `~/.membot/config.json` (optional; defaults are sane).
+- **Environment variables:**
+  - `ANTHROPIC_API_KEY` — optional. Enables LLM fallback for messy / scanned input (vision captions for images, last-resort markdown conversion). Without it, the pipeline degrades to deterministic native conversion.
+  - `MEMBOT_HOME` — override the data directory.
+  - `NO_COLOR`, `CI`, `FORCE_COLOR` — standard output controls.
+
+## Development
+
+```bash
+bun install
+bun run dev <args>   # run from source
+bun test             # full test suite (real ephemeral DuckDB per test)
+bun run lint         # biome + tsc
+bun run format       # biome --write
+bun run build        # compile a standalone binary into dist/membot
+```
+
+Architecture, design constraints, and reference projects are documented in [`docs/plan.md`](./docs/plan.md) and [`CLAUDE.md`](./CLAUDE.md).
+
+## License
+
+MIT © Evan Tahler
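The README's `--watch` daemon re-reads a source once its `refresh_frequency` has elapsed. A minimal sketch of that kind of schedule check, assuming it is a plain elapsed-time comparison (membot's actual implementation may differ):

```shell
# Assumed scheduling logic, not membot's real code: a source is due when
# (now - last_fetch) >= refresh_frequency.
frequency=$(( 24 * 3600 ))      # --refresh-frequency 24h, in seconds
now=$(date +%s)
last_fetch=$(( now - 90000 ))   # pretend the last fetch was 25h ago
if [ $(( now - last_fetch )) -ge "$frequency" ]; then
  echo "refresh due"
else
  echo "still fresh"
fi
```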
package/package.json
CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "membot",
-  "version": "0.1.
+  "version": "0.1.2",
   "description": "Versioned context store with hybrid search for AI agents. Stdio + HTTP MCP server and CLI.",
   "type": "module",
   "exports": {
@@ -16,6 +16,8 @@
     "src",
     "patches",
     "scripts",
+    ".claude",
+    ".cursor",
     "README.md",
     "LICENSE"
   ],
@@ -24,7 +26,7 @@
     "test": "bun test",
     "lint": "biome ci . && tsc --noEmit",
     "format": "biome check --write .",
-    "prebuild": "bash scripts/apply-
+    "prebuild": "bash scripts/apply-patches.sh",
     "build": "bun build --compile --minify --sourcemap --external '@duckdb/*' ./src/cli.ts --outfile dist/membot"
   },
   "keywords": [
@@ -39,7 +41,7 @@
     "bun"
   ],
   "license": "MIT",
-  "author": "Evan Tahler <evan@
+  "author": "Evan Tahler <evan@evantahler.com>",
   "repository": {
     "type": "git",
     "url": "https://github.com/evantahler/membot.git"
package/patches/@evantahler%2Fmcpx@0.21.4.patch
ADDED

@@ -0,0 +1,44 @@
+diff --git a/src/search/onnx-wasm-paths.ts b/src/search/onnx-wasm-paths.ts
+--- a/src/search/onnx-wasm-paths.ts
++++ b/src/search/onnx-wasm-paths.ts
+@@ -1,31 +1,9 @@
+-// Embed the onnxruntime-web WASM runtime files into the compiled binary
+-// (`bun build --compile`) so they survive in a single-binary distribution
+-// where the user has no node_modules.
+-//
+-// This file is loaded **dynamically** by semantic.ts. The relative paths
+-// only resolve in the local repo / compiled binary; for npm/bun-installed
+-// mcpx the parent directory layout is different (deps are hoisted), the
+-// dynamic import throws, and we fall back to letting transformers.js
+-// load WASM via its default mechanism — which works fine because in
+-// that environment node_modules exists and onnxruntime-web is reachable
+-// through normal module resolution.
+-
+-// The relative `../../node_modules/...` paths only resolve from the local repo
+-// layout (and inside `bun build --compile`). When this file is shipped via npm,
+-// deps are hoisted, so consumer `tsc` runs hit TS2307. The `ts-ignore` directive
+-// below silences that for consumers; we avoid the stricter `expect-error` form
+-// because in the local repo the path resolves fine and there would be no error
+-// to expect. At runtime the dynamic import in semantic.ts is wrapped in
+-// try/catch and falls back to transformers.js's default WASM loader (issue #85).
+-// biome-ignore lint/suspicious/noTsIgnore: must stay as ts-ignore per comment above
+-// @ts-ignore - dynamic-only import
+-import wasmMjsPath from "../../node_modules/onnxruntime-web/dist/ort-wasm-simd-threaded.asyncify.mjs" with {
+-  type: "file",
+-};
+-// biome-ignore lint/suspicious/noTsIgnore: must stay as ts-ignore per comment above
+-// @ts-ignore - dynamic-only import
+-import wasmBinPath from "../../node_modules/onnxruntime-web/dist/ort-wasm-simd-threaded.asyncify.wasm" with {
+-  type: "file",
+-};
+-
+-export { wasmBinPath, wasmMjsPath };
++// PATCHED (membot): upstream mcpx ships static `with { type: "file" }` imports
++// of onnxruntime-web WASM assets via `../../node_modules/...`, which only
++// resolves when mcpx is built standalone. When consumed as an npm dep those
++// paths are unreachable and `bun build --compile` fails at build time. membot
++// never invokes mcpx's semantic search (only `mcpx.exec()` for URL fetching),
++// so we stub the exports — semantic.ts wraps the dynamic import in try/catch
++// and falls back to transformers.js's default WASM loader.
++export const wasmMjsPath = "";
++export const wasmBinPath = "";
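This patch is applied by `scripts/apply-patches.sh` with `git apply --directory=<target>`, which prefixes every path in the patch so a patch written against a package root lands inside `node_modules/<pkg>`. A self-contained illustration of that flag using throwaway files (none of these paths come from membot):

```shell
# Throwaway demo of `git apply --directory`: the patch addresses file.txt
# at its own root, but --directory=pkg applies it under pkg/.
work=$(mktemp -d)
cd "$work"
mkdir -p pkg
printf 'old\n' > pkg/file.txt
cat > fix.patch <<'EOF'
--- a/file.txt
+++ b/file.txt
@@ -1 +1 @@
-old
+new
EOF
git apply --directory=pkg fix.patch
cat pkg/file.txt
```

`git apply` works like this even outside a git repository, which is why the script can run against an installed `node_modules` tree.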
package/scripts/apply-patches.sh
ADDED
@@ -0,0 +1,49 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+# Apply node_modules patches imperatively. We don't use package.json's
+# `patchedDependencies` field because that field, when present in a published
+# package, breaks `bun install` from a tarball.
+#
+# Each patch is gated by a marker file inside its target so reruns are no-ops.
+
+apply_patch() {
+  local patch="$1" target="$2" marker_name="$3"
+  local marker="$target/$marker_name"
+
+  if [ ! -d "$target" ]; then
+    echo "error: $target not found — run \`bun install\` first" >&2
+    exit 1
+  fi
+  if [ ! -f "$patch" ]; then
+    echo "error: $patch not found" >&2
+    exit 1
+  fi
+  if [ -f "$marker" ]; then
+    echo "patch $patch already applied — skipping"
+    return 0
+  fi
+
+  echo "Applying $patch to $target..."
+  git apply --directory="$target" "$patch"
+  touch "$marker"
+}
+
+# @huggingface/transformers — replace static `import 'onnxruntime-node'` with a
+# stub so `bun build --compile` produces a binary using the WASM backend
+# (onnxruntime-web) instead of onnxruntime-node, whose native bindings can't be
+# bundled into a single-binary distribution.
+apply_patch \
+  "patches/@huggingface%2Ftransformers@4.2.0.patch" \
+  "node_modules/@huggingface/transformers" \
+  ".membot-transformers-patch-applied"
+
+# @evantahler/mcpx — stub `src/search/onnx-wasm-paths.ts` whose static
+# `with { type: "file" }` imports use a relative path that only resolves in
+# mcpx's own repo layout. When mcpx is consumed as an npm dep those paths are
+# unreachable and `bun build --compile` fails at build time. membot never
+# invokes mcpx's semantic search, so the stubbed exports are safe.
+apply_patch \
+  "patches/@evantahler%2Fmcpx@0.21.4.patch" \
+  "node_modules/@evantahler/mcpx" \
+  ".membot-mcpx-patch-applied"
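The marker-file gating in `apply_patch` is what makes the script safe to rerun. A minimal sketch of the same pattern in TypeScript (the temp directory, function name, and return strings are invented for the demo):

```typescript
import { existsSync, mkdtempSync, rmSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// A throwaway stand-in for a node_modules patch target.
const target = mkdtempSync(join(tmpdir(), "membot-patch-demo-"));

// Gate on a marker file inside the target, as apply_patch() does with
// e.g. `.membot-transformers-patch-applied`.
function applyOnce(): string {
  const marker = join(target, ".demo-patch-applied");
  if (existsSync(marker)) return "already applied";
  writeFileSync(marker, "");
  return "applied";
}

const first = applyOnce();  // "applied"
const second = applyOnce(); // "already applied"
rmSync(target, { recursive: true, force: true });
console.log(first, second);
```

Keeping the marker inside the target (rather than beside the script) means a fresh `bun install`, which wipes `node_modules`, also wipes the marker and the patch re-applies.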
package/src/cli.ts
CHANGED
@@ -7,6 +7,7 @@ import { registerCheckUpdateCommand } from "./commands/check-update.ts";
 import { registerMcpxCommand } from "./commands/mcpx.ts";
 import { registerReindexCommand } from "./commands/reindex.ts";
 import { registerServeCommand } from "./commands/serve.ts";
+import { registerSkillCommand } from "./commands/skill.ts";
 import { registerUpgradeCommand } from "./commands/upgrade.ts";
 import type { BuildContextOptions } from "./context.ts";
 import { mountAsCommanderCommand } from "./mount/commander.ts";
@@ -57,6 +58,7 @@ for (const op of OPERATIONS) {
 registerServeCommand(program);
 registerReindexCommand(program);
 registerMcpxCommand(program);
+registerSkillCommand(program);
 registerCheckUpdateCommand(program);
 registerUpgradeCommand(program);
package/src/commands/skill.ts
ADDED
@@ -0,0 +1,131 @@
+import { existsSync, mkdirSync, writeFileSync } from "node:fs";
+import { homedir } from "node:os";
+import { join, resolve } from "node:path";
+import type { Command } from "commander";
+import claudeSkill from "../../.claude/skills/membot.md" with { type: "text" };
+import cursorRule from "../../.cursor/rules/membot.mdc" with { type: "text" };
+import { HelpfulError, isHelpfulError, mapKindToExit } from "../errors.ts";
+import { renderCliError } from "../mount/commander.ts";
+import { logger } from "../output/logger.ts";
+import { detectMode, setMode } from "../output/tty.ts";
+
+interface SkillTarget {
+  agentLabel: string;
+  scopeLabel: string;
+  dir: string;
+  filename: string;
+  content: string;
+}
+
+interface SkillInstallOptions {
+  claude?: boolean;
+  cursor?: boolean;
+  global?: boolean;
+  project?: boolean;
+  force?: boolean;
+}
+
+/**
+ * `membot skill install [--claude] [--cursor] [--global|--project] [-f]`
+ *
+ * Drop the membot agent skill into the right location for Claude Code
+ * (`.claude/skills/membot.md`) or Cursor (`.cursor/rules/membot.mdc`),
+ * either in the current project (default) or in the user's home directory
+ * (`--global`). Both flags can be combined to install for both targets at
+ * once. The skill files are bundled into the binary via Bun text imports
+ * so this works in the compiled distribution as well as in `bun run`.
+ */
+export function registerSkillCommand(program: Command): void {
+  const skill = program.command("skill").description("Install agent skills (Claude Code, Cursor)");
+
+  skill
+    .command("install")
+    .description(
+      "Install the membot skill into Claude Code (.claude/skills/membot.md) and/or Cursor (.cursor/rules/membot.mdc)",
+    )
+    .option("--claude", "install for Claude Code")
+    .option("--cursor", "install for Cursor")
+    .option("--global", "install to the user's home directory (default: project)")
+    .option("--project", "install to the current working directory (default)")
+    .option("-f, --force", "overwrite if the skill file already exists")
+    .action((opts: SkillInstallOptions) => {
+      const globalOpts = program.optsWithGlobals<{ json?: boolean; verbose?: boolean; color?: boolean }>();
+      setMode(
+        detectMode({
+          json: globalOpts.json,
+          verbose: globalOpts.verbose,
+          noColor: globalOpts.color === false,
+        }),
+      );
+      try {
+        install(opts);
+      } catch (err) {
+        renderCliError(err);
+        process.exit(isHelpfulError(err) ? mapKindToExit(err.kind) : 1);
+      }
+    });
+}
+
+/**
+ * Resolve and write every requested skill file. Throws `HelpfulError` on
+ * any input or conflict failure so the mount-style error renderer can
+ * surface a uniform JSON / colorized message.
+ */
+function install(opts: SkillInstallOptions): void {
+  if (!opts.claude && !opts.cursor) {
+    throw new HelpfulError({
+      kind: "input_error",
+      message: "no agent target specified",
+      hint: "Pass --claude, --cursor, or both — e.g. `membot skill install --claude`",
+    });
+  }
+
+  const targets = computeTargets(opts);
+  for (const target of targets) {
+    const dest = join(target.dir, target.filename);
+    if (existsSync(dest) && !opts.force) {
+      throw new HelpfulError({
+        kind: "conflict",
+        message: `${dest} already exists`,
+        hint: "Re-run with --force to overwrite",
+      });
+    }
+    mkdirSync(target.dir, { recursive: true });
+    writeFileSync(dest, target.content, "utf-8");
+    logger.info(`installed ${target.agentLabel} skill (${target.scopeLabel}): ${dest}`);
+  }
+}
+
+/**
+ * Materialise the (agent × scope) cartesian product of install targets the
+ * user asked for. Default scope is project when neither --global nor
+ * --project is passed; passing both installs to both locations.
+ */
+function computeTargets(opts: SkillInstallOptions): SkillTarget[] {
+  const scopes: { label: string; resolveDir: (rel: string) => string }[] = [];
+  if (opts.global) scopes.push({ label: "global", resolveDir: (rel) => join(homedir(), rel) });
+  if (opts.project || !opts.global) scopes.push({ label: "project", resolveDir: (rel) => resolve(rel) });
+
+  const targets: SkillTarget[] = [];
+  for (const scope of scopes) {
+    if (opts.claude) {
+      targets.push({
+        agentLabel: "Claude Code",
+        scopeLabel: scope.label,
+        dir: scope.resolveDir(".claude/skills"),
+        filename: "membot.md",
+        content: claudeSkill,
+      });
+    }
+    if (opts.cursor) {
+      targets.push({
+        agentLabel: "Cursor",
+        scopeLabel: scope.label,
+        dir: scope.resolveDir(".cursor/rules"),
+        filename: "membot.mdc",
+        content: cursorRule,
+      });
+    }
+  }
+  return targets;
+}
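The (agent × scope) expansion in `computeTargets` can be re-derived as a standalone sketch; `targetDirs` and the hard-coded `"~"`/`"."` prefixes below are illustrative, not membot's API:

```typescript
// Hypothetical re-derivation of computeTargets' (agent × scope) product.
interface Opts {
  claude?: boolean;
  cursor?: boolean;
  global?: boolean;
  project?: boolean;
}

function targetDirs(opts: Opts): string[] {
  const scopes: string[] = [];
  if (opts.global) scopes.push("~");                  // home-directory scope
  if (opts.project || !opts.global) scopes.push("."); // project scope is the default
  const files: string[] = [];
  for (const scope of scopes) {
    if (opts.claude) files.push(`${scope}/.claude/skills/membot.md`);
    if (opts.cursor) files.push(`${scope}/.cursor/rules/membot.mdc`);
  }
  return files;
}

console.log(targetDirs({ claude: true }));
// → [ './.claude/skills/membot.md' ]
```

Passing all four flags yields four files (two agents × two scopes), which matches the docblock's "both flags can be combined" behavior.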
package/src/ingest/embedder.ts
CHANGED
@@ -31,6 +31,15 @@ function isModelCached(model: string): boolean {
  * Lazily load (and cache) the feature-extraction pipeline for a model. Loading
  * is expensive (downloads weights on first run, ~100s of ms to instantiate
  * ONNX), so we hold one promise per model name for the life of the process.
+ *
+ * Try `wasm` first, fall back to `cpu` on "Unsupported device". The transformers
+ * patch (applied for `bun build --compile` and via `bun run prebuild` for local
+ * dev) registers `wasm` as a supported device backed by onnxruntime-web — that's
+ * mandatory for the single-binary build because native bindings can't be
+ * bundled. When the package is unpatched (npm-installed membot, or `bun dev`
+ * before `prebuild`), `wasm` is rejected and we fall back to the default `cpu`
+ * device, which uses the onnxruntime-node native bindings that ship with the
+ * unpatched package.
  */
 async function getPipeline(model: string): Promise<FeatureExtractionPipeline> {
   let p = pipelinePromises.get(model);
@@ -40,9 +49,15 @@ async function getPipeline(model: string): Promise<FeatureExtractionPipeline> {
     } else {
       logger.info(`embedder: loading model ${model} (first run, downloading weights)`);
     }
-
-
-
+    p = (async () => {
+      try {
+        return (await pipeline("feature-extraction", model, { device: "wasm" })) as FeatureExtractionPipeline;
+      } catch (err) {
+        if (!String((err as Error)?.message ?? "").includes("Unsupported device")) throw err;
+        logger.debug("embedder: wasm backend unavailable, falling back to cpu (onnxruntime-node)");
+        return (await pipeline("feature-extraction", model, { device: "cpu" })) as FeatureExtractionPipeline;
+      }
+    })();
     pipelinePromises.set(model, p);
   }
   return p;
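The wasm → cpu fallback can be exercised in isolation with a stub loader standing in for transformers.js's `pipeline` (the stub and its error text are invented for the demo; only the control flow mirrors `getPipeline`):

```typescript
type Device = "wasm" | "cpu";

async function loadWithFallback(load: (device: Device) => Promise<string>): Promise<string> {
  try {
    return await load("wasm");
  } catch (err) {
    // Only an "Unsupported device" rejection triggers the cpu fallback;
    // anything else (bad model id, network failure) still propagates.
    if (!String((err as Error)?.message ?? "").includes("Unsupported device")) throw err;
    return await load("cpu");
  }
}

// Simulate the unpatched package: wasm is rejected, cpu succeeds.
const stub = async (device: Device): Promise<string> => {
  if (device === "wasm") throw new Error("Unsupported device: wasm");
  return `pipeline(${device})`;
};

loadWithFallback(stub).then(console.log); // → pipeline(cpu)
```

Matching on the error message is brittle by design here: any broader catch would silently mask real loading failures behind a second (equally failing) load attempt.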
package/src/ingest/ingest.ts
CHANGED
@@ -21,13 +21,14 @@ export interface IngestInput {
   refresh_frequency?: string;
   fetcher_hint?: string;
   change_note?: string;
+  force?: boolean;
 }
 
 export interface IngestEntryResult {
   source_path: string;
   logical_path: string;
   version_id: string | null;
-  status: "ok" | "failed";
+  status: "ok" | "unchanged" | "failed";
   error?: string;
   mime_type: string | null;
   size_bytes: number;
@@ -39,6 +40,7 @@ export interface IngestResult {
   ingested: IngestEntryResult[];
   total: number;
   ok: number;
+  unchanged: number;
   failed: number;
 }
 
@@ -57,14 +59,15 @@ export async function ingest(input: IngestInput, ctx: AppContext): Promise<IngestResult> {
   });
 
   const refreshSec = parseDuration(input.refresh_frequency);
+  const force = input.force === true;
 
   if (resolved.kind === "inline") {
     return ingestInline(resolved.text, input, ctx, refreshSec);
   }
   if (resolved.kind === "url") {
-    return ingestUrl(resolved.url, input, ctx, refreshSec);
+    return ingestUrl(resolved.url, input, ctx, refreshSec, force);
   }
-  return ingestLocalFiles(resolved, input, ctx, refreshSec);
+  return ingestLocalFiles(resolved, input, ctx, refreshSec, force);
 }
 
 /** Ingest a single inline blob (source_type='inline'). */
@@ -119,6 +122,7 @@ async function ingestUrl(
   input: IngestInput,
   ctx: AppContext,
   refreshSec: number | null,
+  force: boolean,
 ): Promise<IngestResult> {
   const mcpxAdapter = ctx.mcpx
     ? {
@@ -151,6 +155,15 @@ async function ingestUrl(
   result.fetcher = fetched.fetcher;
   result.source_sha256 = fetched.sha256;
 
+  if (!force) {
+    const cur = await getCurrent(ctx.db, logicalPath);
+    if (cur && cur.source_sha256 === fetched.sha256) {
+      result.status = "unchanged";
+      result.version_id = cur.version_id;
+      return summarize([result]);
+    }
+  }
+
   const versionId = await pipelineForBytes(ctx, {
     logicalPath,
     bytes: fetched.bytes,
@@ -181,6 +194,7 @@ async function ingestLocalFiles(
   input: IngestInput,
   ctx: AppContext,
   refreshSec: number | null,
+  force: boolean,
 ): Promise<IngestResult> {
   if (resolved.entries.length === 0) {
     throw new HelpfulError({
@@ -213,6 +227,16 async function ingestLocalFiles(
     result.size_bytes = local.sizeBytes;
     result.source_sha256 = local.sha256;
 
+    if (!force) {
+      const cur = await getCurrent(ctx.db, logicalPath);
+      if (cur && cur.source_sha256 === local.sha256) {
+        result.status = "unchanged";
+        result.version_id = cur.version_id;
+        results.push(result);
+        continue;
+      }
+    }
+
     const versionId = await pipelineForBytes(ctx, {
       logicalPath,
       bytes: local.bytes,
@@ -236,7 +260,10 @@ async function ingestLocalFiles(
     }
     results.push(result);
   }
-
+  const okCount = results.filter((r) => r.status === "ok").length;
+  const unchangedCount = results.filter((r) => r.status === "unchanged").length;
+  const suffix = unchangedCount > 0 ? ` (${unchangedCount} unchanged)` : "";
+  ctx.progress.done(`ingested ${okCount}/${results.length}${suffix}`);
 
   return summarize(results);
 }
@@ -428,12 +455,14 @@ export function parseDuration(input: string | null | undefined): number | null {
 /** Roll a list of per-entry results into the top-level summary shape. */
 function summarize(entries: IngestEntryResult[]): IngestResult {
   let ok = 0;
+  let unchanged = 0;
   let failed = 0;
   for (const e of entries) {
     if (e.status === "ok") ok += 1;
+    else if (e.status === "unchanged") unchanged += 1;
     else failed += 1;
   }
-  return { ingested: entries, total: entries.length, ok, failed };
+  return { ingested: entries, total: entries.length, ok, unchanged, failed };
 }
 
 function errorMessage(err: unknown): string {
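The extended `summarize` simply folds the three-state status into top-level counts; restated standalone (with a pared-down `Entry` type in place of `IngestEntryResult`):

```typescript
type Status = "ok" | "unchanged" | "failed";
interface Entry {
  status: Status;
}

// Same fold as summarize() in the diff above, minus the passthrough fields.
function summarize(entries: Entry[]) {
  let ok = 0;
  let unchanged = 0;
  let failed = 0;
  for (const e of entries) {
    if (e.status === "ok") ok += 1;
    else if (e.status === "unchanged") unchanged += 1;
    else failed += 1;
  }
  return { total: entries.length, ok, unchanged, failed };
}

console.log(summarize([{ status: "ok" }, { status: "unchanged" }, { status: "unchanged" }]));
// → { total: 3, ok: 1, unchanged: 2, failed: 0 }
```

Because `unchanged` is its own bucket, a rerun over an already-ingested directory reports zero failures instead of silently inflating the `ok` count with no-op versions.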
package/src/ingest/source-resolver.ts
CHANGED
@@ -43,10 +43,12 @@ export async function resolveSource(source: string, options: ResolveOptions = {}
   }
 
   const followSymlinks = options.followSymlinks !== false;
-  const
-  .
-
-
+  const userIncludes = options.include
+    ? options.include
+        .split(",")
+        .map((g) => g.trim())
+        .filter(Boolean)
+    : [];
   const excludeMatchers = [
     ...DEFAULT_EXCLUDES,
     ...(options.exclude ?? "")
@@ -57,9 +59,14 @@ export async function resolveSource(source: string, options: ResolveOptions = {}
 
   if (isGlob(source)) {
     const base = globBase(source);
+    const remainder = globRemainder(source);
     try {
       const realBase = await realpath(base);
-
+      // Source glob acts as a hard filter; user includes (if any) further
+      // narrow the result via AND. Pass them as a separate matcher so the
+      // two sets aren't picomatch-OR'd together.
+      const extraIncludes = userIncludes.length > 0 ? [userIncludes] : [];
+      return walk(realBase, [remainder], excludeMatchers, followSymlinks, extraIncludes);
     } catch (err) {
       throw asHelpful(
         err,
@@ -93,7 +100,8 @@ export async function resolveSource(source: string, options: ResolveOptions = {}
 
   if (st.isDirectory()) {
     const realBase = await realpath(abs);
-
+    const dirIncludes = userIncludes.length > 0 ? userIncludes : ["**/*"];
+    return walk(realBase, dirIncludes, excludeMatchers, followSymlinks);
   }
 
   throw new HelpfulError({
@@ -120,22 +128,40 @@ export function globBase(glob: string): string {
   return base.length === 0 || !isAbsolute(base) ? resolve(base || ".") : base;
 }
 
+/**
+ * Take the wildcard portion of a glob — everything from the first segment
+ * containing a wildcard onward. We strip the static prefix so the matcher
+ * runs against entry paths relative to `globBase`. Without this, a glob like
+ * `docs/star-star/star.md` never matches anything under base=`docs/`, since
+ * walk() exposes `sub/file.md` to picomatch, not `docs/sub/file.md`.
+ */
+export function globRemainder(glob: string): string {
+  const parts = glob.split(sep);
+  const wildcardIdx = parts.findIndex((p) => /[*?[\]{}!]/.test(p));
+  if (wildcardIdx === -1) return glob;
+  return parts.slice(wildcardIdx).join(sep);
+}
+
 /**
  * Recursively walk `base`, returning files matched by `includes` and not
  * matched by `excludes`. Both globsets match against the entry's path
  * relative to `base`. Symlinks are followed when `followSymlinks` is true,
- * with cycles detected via a realpath cache.
+ * with cycles detected via a realpath cache. `extraIncludeSets` is a list
+ * of additional include groups, each ANDed onto the primary `includes` —
+ * use it when two filters must both match (e.g. source glob + --include).
  */
 async function walk(
   base: string,
   includes: string[],
   excludes: string[],
   followSymlinks: boolean,
+  extraIncludeSets: string[][] = [],
 ): Promise<ResolvedSource> {
   const seen = new Set<string>();
   const entries: ResolvedLocalEntry[] = [];
 
   const isInclude = picomatch(includes, { dot: false, nocase: false });
+  const extraMatchers = extraIncludeSets.map((set) => picomatch(set, { dot: false, nocase: false }));
   const isExclude = excludes.length ? picomatch(excludes, { dot: false }) : null;
 
   const queue: string[] = [base];
@@ -174,6 +200,7 @@ async function walk(
     const relForMatch = rel.length === 0 ? (cur.split(sep).pop() ?? cur) : rel;
     if (isExclude?.(relForMatch)) continue;
     if (!isInclude(relForMatch)) continue;
+    if (extraMatchers.some((m) => !m(relForMatch))) continue;
     entries.push({ absPath: real, relPath: relForMatch });
   }
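What `globRemainder` buys is easy to check with the separator hard-coded to `"/"` (the real code uses `sep` from node:path); this is the same body, re-sketched standalone:

```typescript
// Strip the static prefix so the pattern matches paths relative to globBase.
function globRemainder(glob: string): string {
  const parts = glob.split("/");
  const wildcardIdx = parts.findIndex((p) => /[*?[\]{}!]/.test(p));
  if (wildcardIdx === -1) return glob;
  return parts.slice(wildcardIdx).join("/");
}

console.log(globRemainder("docs/guides/**/*.md")); // → **/*.md
console.log(globRemainder("docs/notes.md"));       // → docs/notes.md (no wildcard, passthrough)
```

With base resolved to `docs/guides/`, walk() hands picomatch entries like `intro/setup.md`, which `**/*.md` matches but the original `docs/guides/**/*.md` never would.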
package/src/operations/add.ts
CHANGED
@@ -14,11 +14,16 @@ export const addOperation = defineOperation({
 - a glob pattern (e.g. "docs/**/*.md")
 - a URL (fetched via mcpx if configured, otherwise plain HTTP)
 - "inline:<text>" literal
-PDF, DOCX, HTML, images, and other binaries are converted to markdown — native libraries first, vision/OCR for images, LLM fallback for messy or scanned input. Original bytes are kept in the blobs table; \`membot_read bytes=true\` returns them. Setting \`refresh_frequency\` enables automatic refresh from the daemon. Each ingested file becomes a
+PDF, DOCX, HTML, images, and other binaries are converted to markdown — native libraries first, vision/OCR for images, LLM fallback for messy or scanned input. Original bytes are kept in the blobs table; \`membot_read bytes=true\` returns them. Setting \`refresh_frequency\` enables automatic refresh from the daemon. By default, re-ingesting an unchanged source (same source_sha256 as the current version) is a no-op and reports \`status: "unchanged"\`; pass \`force=true\` to always create a new version. Each newly-ingested file becomes a new version under its own logical_path; existing versions stay queryable via membot_versions. Directory/glob ingests stream one file at a time — partial failures do not abort the rest; the response lists per-entry status.`,
   inputSchema: z.object({
     source: z.string().describe("Local path, directory, glob, URL, or `inline:<text>` literal"),
     logical_path: z.string().optional().describe("Destination logical_path (single source) or prefix (directory/glob)"),
-    include: z
+    include: z
+      .string()
+      .optional()
+      .describe(
+        "Glob include filter (comma-separated for multiple). Defaults to `**/*` for directory sources, or the source pattern itself when source is a glob.",
+      ),
     exclude: z.string().optional().describe("Glob exclude filter (comma-separated for multiple)"),
     follow_symlinks: z
       .boolean()
@@ -30,6 +35,10 @@ PDF, DOCX, HTML, images, and other binaries are converted to markdown — native
       .optional()
       .describe("Free-form hint passed to mcpx tool search (e.g. 'firecrawl', 'github', 'google docs', 'http')"),
     change_note: z.string().optional().describe("Free-text note attached to the new version"),
+    force: z
+      .boolean()
+      .optional()
+      .describe("Re-ingest even when source bytes are unchanged. Default skips and reports `unchanged`."),
   }),
   outputSchema: z.object({
     ingested: z.array(
@@ -37,7 +46,7 @@ PDF, DOCX, HTML, images, and other binaries are converted to markdown — native
       source_path: z.string(),
       logical_path: z.string(),
       version_id: z.string().nullable(),
-      status: z.enum(["ok", "failed"]),
+      status: z.enum(["ok", "unchanged", "failed"]),
       error: z.string().optional(),
       mime_type: z.string().nullable(),
       size_bytes: z.number(),
@@ -47,23 +56,27 @@ PDF, DOCX, HTML, images, and other binaries are converted to markdown — native
     ),
     total: z.number(),
     ok: z.number(),
+    unchanged: z.number(),
     failed: z.number(),
   }),
   cli: {
     positional: ["source"],
-    aliases: { logical_path: "-p", refresh_frequency: "-r", change_note: "-m" },
+    aliases: { logical_path: "-p", refresh_frequency: "-r", change_note: "-m", force: "-f" },
   },
   console_formatter: (result) => {
     const lines = result.ingested.map((e) => {
       if (e.status === "ok") {
         return `${colors.green("✓")} ${colors.cyan(e.logical_path)} ${colors.dim(`(${e.fetcher}, ${e.size_bytes}B)`)}`;
       }
+      if (e.status === "unchanged") {
+        return `${colors.dim("≡")} ${colors.cyan(e.logical_path)} ${colors.dim("(unchanged)")}`;
+      }
       return `${colors.red("✗")} ${e.source_path} ${colors.dim(e.error ?? "")}`;
     });
-    const
-
-
-    return `${lines.join("\n")}\n${
+    const parts: string[] = [colors.green(`added ${result.ok}`)];
+    if (result.unchanged > 0) parts.push(colors.dim(`unchanged ${result.unchanged}`));
+    if (result.failed > 0) parts.push(colors.red(`failed ${result.failed}`));
+    return `${lines.join("\n")}\n${parts.join(", ")}`;
  },
  handler: async (input, ctx) => ingest(input, ctx),
});
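The formatter's final summary line only mentions buckets that are non-empty; here it is isolated, with the `colors.*` wrappers dropped so the string is inspectable:

```typescript
interface Counts {
  ok: number;
  unchanged: number;
  failed: number;
}

// Same assembly as the console_formatter tail, minus ANSI coloring.
function summaryLine(result: Counts): string {
  const parts: string[] = [`added ${result.ok}`];
  if (result.unchanged > 0) parts.push(`unchanged ${result.unchanged}`);
  if (result.failed > 0) parts.push(`failed ${result.failed}`);
  return parts.join(", ");
}

console.log(summaryLine({ ok: 3, unchanged: 2, failed: 0 })); // → added 3, unchanged 2
```

Suppressing zero buckets keeps the common case (`added N`) as terse as before this change.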
package/scripts/apply-transformers-patch.sh
REMOVED
@@ -1,35 +0,0 @@
-#!/usr/bin/env bash
-set -euo pipefail
-
-# Apply the @huggingface/transformers patch to node_modules so that
-# `bun build --compile` produces a binary using the WASM backend
-# (onnxruntime-web) instead of onnxruntime-node, whose native bindings
-# can't be bundled into a single-binary distribution.
-#
-# We apply the patch imperatively (rather than via package.json
-# `patchedDependencies`) because that field, when present in a
-# published package, breaks `bun install` from a tarball.
-
-PATCH="patches/@huggingface%2Ftransformers@4.2.0.patch"
-TARGET="node_modules/@huggingface/transformers"
-MARKER="$TARGET/.membot-transformers-patch-applied"
-
-if [ ! -d "$TARGET" ]; then
-  echo "error: $TARGET not found — run \`bun install\` first" >&2
-  exit 1
-fi
-
-if [ ! -f "$PATCH" ]; then
-  echo "error: $PATCH not found" >&2
-  exit 1
-fi
-
-if [ -f "$MARKER" ]; then
-  echo "transformers patch already applied — skipping"
-  exit 0
-fi
-
-echo "Applying transformers patch ($PATCH) to $TARGET..."
-git apply --directory="$TARGET" "$PATCH"
-touch "$MARKER"
-echo "Patch applied."