llm-wiki-compiler 0.9.0 → 0.10.0

package/README.md CHANGED
@@ -1,816 +1,297 @@
1
1
  # llmwiki
2
2
 
3
- Compile raw sources into an interlinked markdown wiki.
4
-
5
- Inspired by Karpathy's [LLM Wiki](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f) pattern: instead of re-discovering knowledge at query time, compile it once into a persistent, browsable artifact that compounds over time.
3
+ <div align="center">
4
+ <div style="border: 2px solid #4F46E5; border-radius: 18px; padding: 20px 24px; margin: 18px 0; background: #EEF2FF; color: #111827; max-width: 900px;">
5
+ <h2 style="color: #312E81;">Breaking News: llmwiki 0.10.0 supports Open Knowledge Format</h2>
6
+ <p style="color: #1F2937;">
7
+ llmwiki is now an <strong>Open Knowledge Format (OKF)</strong> producer and consumer,
8
+ aligning compiled agent knowledge with Google Cloud's emerging standard for portable knowledge sharing.
9
+ </p>
10
+ <p style="color: #1F2937;">
11
+ Export compiled wikis with <code>llmwiki export --target okf</code>,
12
+ import external bundles with <code>llmwiki import --okf</code>,
13
+ and stage untrusted knowledge through review before it becomes live agent context.
14
+ </p>
15
+ </div>
16
+ </div>
17
+
18
+ Compile raw sources into an interlinked, citation-traceable markdown wiki that agents and humans can browse, query, lint, export, and reuse.
19
+
20
+ llmwiki implements the [LLM Wiki](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f) pattern: instead of re-discovering knowledge from raw files at query time, compile it once into durable pages that accumulate structure, provenance, review state, and retrieval metadata over time.
6
21
 
7
22
  ![llmwiki demo](docs/images/demo.gif)
8
23
 
9
- ## What you get
24
+ ## When to use this repo
25
+
26
+ Use llmwiki when you need a persistent knowledge base from raw material:
27
+
28
+ - Compile papers, notes, READMEs, transcripts, PDFs, images, or web pages into typed wiki pages.
29
+ - Give agents a stable, citation-aware context pack instead of a pile of loose files.
30
+ - Keep generated knowledge auditable with source citations, review queues, freshness checks, and quality gates.
31
+ - Browse the result locally, query it from the CLI, expose it over MCP, or embed it through the SDK.
32
+ - Exchange compiled knowledge with other tools using Open Knowledge Format (OKF), JSON, JSON-LD, GraphML, Marp, and `llms.txt`.
10
33
 
11
- - **Compiled wiki, not chunks.** A two-phase LLM pipeline turns raw sources into typed pages (`concept`, `entity`, `comparison`, `overview`) with paragraph- and claim-level citations back to source line ranges.
12
- - **Hybrid retrieval.** Semantic chunk embeddings (incremental, content-hash-aware) narrow hundreds of pages to a small top-K; BM25 reranking and wikilink-graph expansion build the final evidence pack.
13
- - **Local web viewer.** `llmwiki view` opens a read-only browser UI with sidebar navigation, search, a force-directed graph, and provenance/citation chips per page.
14
- - **Eval harness.** `llmwiki eval` measures health score (0–100), citation coverage and precision, optional LLM-as-judge support scoring, regression deltas, and CI-gateable thresholds.
15
- - **Source freshness.** `llmwiki lint` flags pages whose sources changed (`stale`) or were deleted (`orphaned`) since the last compile — surfaced across the viewer, MCP, context packs, the JSON export, and `llmwiki next` — and `llmwiki refresh --stale` repairs them with a targeted recompile.
16
- - **MCP server.** `llmwiki serve` exposes the full pipeline to Claude Desktop, Cursor, Claude Code, and any MCP-compatible agent — including `get_context_pack` for budgeted, citation-aware evidence packs.
17
- - **In-process SDK.** `createWiki({ root })` drives the whole pipeline programmatically — ingest, compile, query, status/freshness, context packs, export, eval — for embedding llmwiki in your own tooling without shelling out.
18
- - **Activity journal.** Every ingest, compile, and query appends a timestamped, machine-parseable entry to `log.md` — a human- and agent-readable audit trail of how the wiki was built, carrying page wikilinks and counts.
19
- - **Bridge to runtime memory.** `llmwiki export --target json --project-id <id>` produces a typed envelope that [`@atomicmemory/llmwiki`](https://github.com/atomicstrata/atomicmemory/tree/main/packages/llmwiki) imports as one verbatim Atomic Memory record per page, preserving all advisory metadata.
20
- - **Provider-portable.** Anthropic, Claude Agent SDK (local Claude Code login, no API key), OpenAI-compatible (incl. local llama.cpp / vLLM), Ollama, GitHub Copilot.
34
+ Do not use llmwiki as a general static-site generator, a heavy ontology database, or a replacement for ad-hoc search over fast-changing raw logs. It is strongest when source knowledge is worth compiling, reviewing, and reusing.
21
35
 
22
- ## Who this is for
36
+ ## What you get
23
37
 
24
- - **AI researchers and engineers** building durable knowledge from papers, docs, and notes
25
- - **Technical writers** compiling scattered sources into a structured, interlinked reference
26
- - **Open-source maintainers** turning READMEs, ADRs, and design docs into a navigable knowledge base
27
- - **Anyone with too many bookmarks** who wants a wiki instead of a graveyard of tabs
38
+ - **Compiled wiki, not chunks.** A two-phase LLM pipeline extracts concepts, then generates typed pages: `concept`, `entity`, `comparison`, and `overview`.
39
+ - **Citation-traceable output.** Paragraphs and claims cite source files and line ranges, and `llmwiki lint` validates the links.
40
+ - **Hybrid retrieval.** Semantic chunk search, BM25 reranking, and wikilink graph expansion build compact evidence packs for queries and agents.
41
+ - **Local viewer.** `llmwiki view` opens a read-only browser UI with search, page metadata, graph exploration, source-freshness badges, and citation chips.
42
+ - **Review policy.** Generated pages can be auto-held for review when confidence, contradiction, schema, or provenance rules trip.
43
+ - **Freshness repair.** `llmwiki lint` and `llmwiki next` surface stale/orphaned pages; `llmwiki refresh --stale` repairs changed knowledge without compiling unrelated new sources.
44
+ - **Eval harness.** `llmwiki eval` reports health score, citation coverage/precision, corpus stats, regression deltas, and optional judge-model citation support.
45
+ - **MCP server.** `llmwiki serve` exposes ingest, compile, query, lint, read, status, eval, and context-pack tools to MCP-compatible agents.
46
+ - **SDK.** `createWiki({ root })` drives ingest, compile, query, context, status, export, and eval from TypeScript without shelling out.
47
+ - **Open Knowledge Format exchange.** Export and import OKF bundles for portable, markdown-native knowledge exchange. External OKF imports are staged through the review queue by default; trusted bundles can be written live explicitly.
48
+ - **Other portable exports.** Export JSON, JSON-LD, GraphML, Marp slides, and `llms.txt` for downstream systems.
49
+ - **Provider portable.** Anthropic, Claude Agent SDK (local Claude Code login), OpenAI-compatible servers (including local runtimes), Ollama, and GitHub Copilot.
50
+
51
+ ## Karpathy's LLM Wiki pattern
52
+
53
+ Andrej Karpathy described the LLM Wiki pattern as a way to turn raw material into compiled knowledge that future agents can reuse. llmwiki is a concrete compiler for that pattern.
54
+
55
+ The key shift is moving work from query time to compile time. Traditional RAG repeatedly retrieves raw chunks and asks the model to reconstruct relationships for each question. llmwiki first turns sources into typed, interlinked pages with citations, metadata, and review state. Queries, context packs, exports, and MCP tools then operate over that compiled artifact.
56
+
57
+ That makes llmwiki useful when knowledge should compound: concepts shared across sources become one page, saved answers become future context, stale pages can be detected and repaired, and agents can consume a stable evidence pack instead of re-reading the same raw files from scratch.
58
+
59
+ See [`docs/concepts/karpathy-pattern.mdx`](docs/concepts/karpathy-pattern.mdx) for the deeper explanation.
60
+
61
+ ## Agent decision guide
62
+
63
+ If an agent is scanning this README, these are the high-signal entry points:
64
+
65
+ | Goal | Use |
66
+ |---|---|
67
+ | Create a wiki from one source and inspect it | `llmwiki quickstart <source>` |
68
+ | Add more files or URLs | `llmwiki ingest <url-or-file>` |
69
+ | Compile or recompile changed sources | `llmwiki compile` |
70
+ | Hold generated pages for human approval | `llmwiki compile --review` or review policy config |
71
+ | Ask grounded questions | `llmwiki query "question"` |
72
+ | Save an answer back into the wiki | `llmwiki query "question" --save` |
73
+ | Build an evidence pack for another agent | `llmwiki context "<task>" --json` or MCP `get_context_pack` |
74
+ | Inspect the compiled knowledge base | `llmwiki view --open` |
75
+ | Check broken links, citations, confidence, freshness, and quality | `llmwiki lint` and `llmwiki eval` |
76
+ | Repair stale compiled pages | `llmwiki refresh --stale --dry-run`, then `llmwiki refresh --stale` |
77
+ | Drive llmwiki from an agent | `llmwiki serve --root <project>` |
78
+ | Drive llmwiki from TypeScript | `createWiki({ root })` |
79
+ | Export for another system | `llmwiki export --target <format>` |
80
+ | Export an Open Knowledge Format bundle | `llmwiki export --target okf --out <dir>` |
81
+ | Import an Open Knowledge Format bundle | `llmwiki import --okf <dir> --dry-run`, then review/approve |
28
82
 
29
83
  ## Quick start
30
84
 
31
85
  ```bash
32
86
  npm install -g llm-wiki-compiler
87
+
33
88
  export ANTHROPIC_API_KEY=sk-...
34
- # Or use ANTHROPIC_AUTH_TOKEN if your Anthropic-compatible gateway expects it.
35
- # Or use a different provider:
89
+ # or choose another provider:
36
90
  # export LLMWIKI_PROVIDER=openai
37
91
  # export OPENAI_API_KEY=sk-...
38
92
 
39
93
  llmwiki quickstart ./notes.md
40
- llmwiki query "what is X?"
94
+ llmwiki query "what are the key ideas?"
41
95
  llmwiki view --open
42
96
  ```
43
97
 
44
- `llmwiki quickstart ./notes.md` ingests one supported source, compiles the wiki, and opens the local viewer when pages are ready. Use `--no-open` to stop after compile, `--review` to queue candidates instead of writing pages, or `--json` for an agent-friendly envelope. If you're inside an existing project and unsure what to do next, run `llmwiki next`.
45
-
46
-
47
- <br>
48
-
49
- ---
50
-
51
- <br>
52
-
53
-
54
- <details>
55
- <summary><span style="font-size: 1.4em;"><strong>Configuration — click to expand</strong></span></summary>
56
-
57
-
58
- llmwiki configures providers via environment variables. The default provider is Anthropic.
59
-
60
- Configuration precedence for Anthropic values:
98
+ `quickstart` ingests one source, compiles pages, and opens the viewer. Inside an existing project, run `llmwiki next` when you want the safest next action.
61
99
 
62
- 1. Shell env / local `.env`
63
- 2. Claude Code settings fallback (`~/.claude/settings.json` → `env` block)
64
- 3. Built-in provider defaults (where applicable)
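The precedence list above can be read as a layered lookup. The sketch below is only an illustration of that reading: `resolveSetting` and the `Settings` type are hypothetical names, not llmwiki's code, and treating an empty string as "unset" is an assumption.

```typescript
// Hypothetical helper illustrating the documented precedence; not llmwiki's implementation.
type Settings = { [name: string]: string | undefined };

function resolveSetting(
  name: string,
  shellEnv: Settings,          // 1. shell env / local .env
  claudeSettingsEnv: Settings, // 2. `env` block of ~/.claude/settings.json
  defaults: Settings           // 3. built-in provider defaults
): string | undefined {
  const layers: Settings[] = [shellEnv, claudeSettingsEnv, defaults];
  for (const layer of layers) {
    const value = layer[name];
    // Assumption: an empty string counts as unset, so the next layer is consulted.
    if (value !== undefined && value !== "") return value;
  }
  return undefined;
}

// Example: the shell value wins over the Claude Code settings fallback.
const baseUrl = resolveSetting(
  "ANTHROPIC_BASE_URL",
  { ANTHROPIC_BASE_URL: "https://proxy.example.com" },
  { ANTHROPIC_BASE_URL: "https://fallback.example.com" },
  {}
);
```

In this sketch the first layer that defines a value decides the result, which is why a value exported in the shell hides the same key in `~/.claude/settings.json`.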
65
-
66
- - `LLMWIKI_PROVIDER`: The provider to use (e.g., anthropic, openai).
67
- - `LLMWIKI_MODEL`: The model name to override the provider default.
68
-
69
- ### Anthropic (Default)
70
-
71
- - `ANTHROPIC_API_KEY` or `ANTHROPIC_AUTH_TOKEN`: Required. Either one can satisfy Anthropic authentication.
72
- - `ANTHROPIC_BASE_URL`: Optional. Custom endpoint for proxies. Valid HTTP(S) URLs are accepted, including Claude-style path endpoints such as `https://api.kimi.com/coding/`.
73
-
74
- Example using an Anthropic or cc-switch custom proxy:
75
-
76
- ```bash
77
- export LLMWIKI_PROVIDER=anthropic
78
- export ANTHROPIC_API_KEY=sk-...
79
- export ANTHROPIC_BASE_URL=https://proxy.example.com
80
- ```
81
-
82
- If those values are not set in shell env or `.env`, llmwiki will try Anthropic-compatible values from `~/.claude/settings.json` (`env` block) for:
83
-
84
- - `ANTHROPIC_API_KEY`
85
- - `ANTHROPIC_AUTH_TOKEN`
86
- - `ANTHROPIC_BASE_URL`
87
- - `ANTHROPIC_MODEL`
88
-
89
- Example with zero exports (Claude Code already configured):
90
-
91
- ```bash
92
- llmwiki compile
93
- ```
94
-
95
- ### Claude Agent SDK (local Claude Code login)
96
-
97
- The `claude-agent` provider routes calls through the
98
- [Claude Agent SDK](https://github.com/anthropics/claude-agent-sdk-typescript)
99
- instead of the raw Messages API. It authenticates with your **local Claude Code
100
- login** (OAuth/subscription), so **no `ANTHROPIC_API_KEY` is required** — if you
101
- can run `claude` in your terminal, this provider works.
102
-
103
- > **Terms of use.** This provider drives your Claude Code / Agent SDK session
104
- > programmatically to compile wikis. That is not automatically appropriate for
105
- > every account type, plan, or environment. Before using it, review Anthropic's
106
- > current [Claude Code](https://www.anthropic.com/legal/consumer-terms) and
107
- > [Agent SDK](https://docs.anthropic.com/en/api/agent-sdk/overview) terms and
108
- > usage policies, and make sure your intended use complies with them.
109
-
110
- ```bash
111
- export LLMWIKI_PROVIDER=claude-agent
112
- export LLMWIKI_MODEL=claude-sonnet-4-6 # optional; this is the default
113
- llmwiki compile
114
- ```
115
-
116
- Notes:
117
-
118
- - Generation (`compile`) and structured extraction work off the local login with
119
- no extra credentials.
120
- - Semantic search (`llmwiki query`) still needs embeddings, which Claude does not
121
- provide. Set `VOYAGE_API_KEY` to enable them (same as the `anthropic`
122
- provider); otherwise `query` falls back to lexical ranking.
123
- - To see what the SDK is doing, set `LLMWIKI_DEBUG=1` for a concise one-line trace
124
- per SDK message (`[claude-agent] system:init`, `… assistant`, `… result:success`)
125
- plus any `claude` subprocess errors. Use `LLMWIKI_DEBUG=verbose` to additionally
126
- enable the SDK's full verbose logging.
127
-
128
- ```bash
129
- LLMWIKI_DEBUG=1 LLMWIKI_PROVIDER=claude-agent llmwiki compile
130
- ```
131
-
132
- ### OpenAI-Compatible Local Servers
133
-
134
- Use the OpenAI provider for local OpenAI-compatible servers such as
135
- `llama-server`. `OPENAI_BASE_URL` is used for chat/tool calls, and
136
- `OPENAI_EMBEDDINGS_BASE_URL` is optional. Set it only when embeddings are
137
- served from a different endpoint; when unset, embeddings use the same client
138
- and base URL as chat. Include `/v1` in custom URLs.
139
-
140
- Split endpoint example:
141
-
142
- ```bash
143
- export LLMWIKI_PROVIDER=openai
144
- export LLMWIKI_MODEL=qwen3.6-35b
145
- export LLMWIKI_EMBEDDING_MODEL=text-embedding-model
146
- export OPENAI_API_KEY=sk-local
147
- export OPENAI_BASE_URL=http://host_url:port/v1
148
- export OPENAI_EMBEDDINGS_BASE_URL=http://host_url:port/v1
149
- ```
150
-
151
- `OPENAI_API_KEY` is still required by the CLI and OpenAI SDK. For local
152
- servers that do not check authentication, any dummy value is sufficient.
153
-
154
- ### Ollama
100
+ ## Demo
155
101
 
156
- Ollama uses its OpenAI-compatible endpoint. Set `OLLAMA_HOST` for chat and
157
- optionally set `OLLAMA_EMBEDDINGS_HOST` only when embeddings are served from a
158
- different endpoint. When unset, embeddings use `OLLAMA_HOST`. Include `/v1` in
159
- custom URLs.
102
+ Try it on any article or document:
160
103
 
161
104
  ```bash
162
- export LLMWIKI_PROVIDER=ollama
163
- export LLMWIKI_MODEL=llama3.1
164
- export LLMWIKI_EMBEDDING_MODEL=nomic-embed-text
165
- export OLLAMA_HOST=http://ollama_host:11434/v1
166
- export OLLAMA_EMBEDDINGS_HOST=http://ollama_host:11435/v1
105
+ mkdir my-wiki && cd my-wiki
106
+ llmwiki quickstart https://en.wikipedia.org/wiki/Andrej_Karpathy
107
+ llmwiki query "What terms did Andrej coin?"
167
108
  ```
168
109
 
169
- ### GitHub Copilot
170
-
171
- Uses the GitHub Copilot API (`https://api.githubcopilot.com`), an
172
- OpenAI-compatible endpoint available to Copilot subscribers. Requires a GitHub
173
- OAuth token with the `copilot` scope — **classic PATs are not supported**.
110
+ The [`examples/basic/`](examples/basic/) directory includes a small pre-generated wiki you can inspect without an API key.
174
111
 
175
- First, ensure your `gh` CLI token has the required scope:
112
+ ## Core commands
176
113
 
177
- ```bash
178
- gh auth refresh --scopes copilot
179
- ```
180
-
181
- Then run:
114
+ | Command | What it does |
115
+ |---|---|
116
+ | `llmwiki ingest <url-or-file>` | Fetch a URL or copy a local file into `sources/`. |
117
+ | `llmwiki ingest-session <path>` | Import exported Claude, Codex, or Cursor sessions into `sources/`. |
118
+ | `llmwiki quickstart <source>` | Ingest, compile, and optionally open the viewer in one step. |
119
+ | `llmwiki compile` | Incrementally extract concepts and generate wiki pages. |
120
+ | `llmwiki refresh --stale [--dry-run]` | Recompile the changed sources that own stale pages and clean up orphaned pages. |
121
+ | `llmwiki review list/show/approve/reject` | Inspect and manage held candidates. |
122
+ | `llmwiki query "question" [--save]` | Ask questions against the compiled wiki, optionally saving the answer. |
123
+ | `llmwiki context "<prompt>" --json` | Build a citation-aware evidence pack for agents. |
124
+ | `llmwiki view [--open]` | Start the read-only local browser viewer. |
125
+ | `llmwiki lint` | Validate wiki structure, citations, links, metadata, and freshness. |
126
+ | `llmwiki eval [--suite fast\|full]` | Measure wiki quality and optional citation support. |
127
+ | `llmwiki export --target <format>` | Export the wiki to portable formats, including Open Knowledge Format (`okf`). |
128
+ | `llmwiki import --okf <dir> [--dry-run] [--trusted]` | Import an Open Knowledge Format bundle, staged for review by default. |
129
+ | `llmwiki serve --root <dir>` | Start the MCP server. |
130
+
131
+ Full command docs live in [`docs/cli/`](docs/cli/).
132
+
133
+ ## Open Knowledge Format
134
+
135
+ llmwiki is an Open Knowledge Format (OKF) producer and consumer. OKF is a Google Cloud initiative for sharing compiled knowledge as portable markdown files with structured frontmatter.
182
136
 
183
137
  ```bash
184
- export LLMWIKI_PROVIDER=copilot
185
- export GITHUB_TOKEN=$(gh auth token) # OAuth token required; PATs will not work
186
- export LLMWIKI_MODEL=gpt-4o # optional; gpt-4o is the default
187
- ```
188
-
189
- Available models (names use dots, not dashes): `gpt-4o`, `gpt-4o-mini`,
190
- `claude-sonnet-4.5`, `claude-sonnet-4.6`, `claude-opus-4.5`, `gemini-2.5-pro`,
191
- and others — availability depends on your Copilot plan.
192
-
193
- **Embeddings:** The GitHub Copilot API does not expose an embeddings endpoint.
194
- Semantic search (used by `llmwiki query` with chunked retrieval) will fall back
195
- to full-index selection without embeddings. For embedding-dependent workflows,
196
- switch to the `openai` provider and provide `OPENAI_API_KEY`.
197
-
198
- ### Request timeouts
199
-
200
- The OpenAI SDK defaults to a 10-minute per-request timeout, which can cut off long compile-time completions on slower local models. Override per provider:
201
-
202
- - `LLMWIKI_REQUEST_TIMEOUT_MS` — provider-agnostic timeout in milliseconds. Applies to both the `openai` and `ollama` backends.
203
- - `OLLAMA_TIMEOUT_MS` — Ollama-specific override. Wins over `LLMWIKI_REQUEST_TIMEOUT_MS` when both are set.
204
-
205
- Defaults: 10 minutes for `openai`, 30 minutes for `ollama` (local models commonly need more).
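As a rough sketch of how the timeout rules above combine, the function below applies the Ollama-specific override first, then the provider-agnostic value, then the documented defaults. `resolveTimeoutMs` and `parseMs` are hypothetical names, and ignoring non-numeric or non-positive values is an assumption rather than documented behaviour.

```typescript
// Illustrative only: a hypothetical resolver for the documented timeout precedence.
type Backend = "openai" | "ollama";
type TimeoutEnv = { [name: string]: string | undefined };

const DEFAULT_TIMEOUT_MS: { openai: number; ollama: number } = {
  openai: 10 * 60 * 1000, // 10 minutes
  ollama: 30 * 60 * 1000, // 30 minutes
};

// Assumption: values that are not positive numbers are treated as unset.
function parseMs(raw: string | undefined): number | undefined {
  if (raw === undefined) return undefined;
  const value = Number(raw);
  return isFinite(value) && value > 0 ? Math.floor(value) : undefined;
}

function resolveTimeoutMs(backend: Backend, env: TimeoutEnv): number {
  // OLLAMA_TIMEOUT_MS only applies to the ollama backend and wins when set.
  const ollamaSpecific = backend === "ollama" ? parseMs(env["OLLAMA_TIMEOUT_MS"]) : undefined;
  if (ollamaSpecific !== undefined) return ollamaSpecific;
  const shared = parseMs(env["LLMWIKI_REQUEST_TIMEOUT_MS"]);
  if (shared !== undefined) return shared;
  return DEFAULT_TIMEOUT_MS[backend];
}
```

With both variables set, the `ollama` backend would use `OLLAMA_TIMEOUT_MS` while the `openai` backend would use `LLMWIKI_REQUEST_TIMEOUT_MS`.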
206
-
207
- ### Output language
208
-
209
- Generated wiki content defaults to whatever language the model produces from the source material — typically English. Override with either:
210
-
211
- - `LLMWIKI_OUTPUT_LANG` — e.g. `zh-CN`, `Chinese`, `ja`, `Japanese`. Applies to every prompt the compile and query pipelines make.
212
- - `--lang <code>` on `llmwiki compile` and `llmwiki query` — same effect, scoped to one invocation. Wins over the env var.
213
-
214
- Leaving both unset preserves the prior behaviour byte-for-byte.
215
-
216
- ### Per-concept prompt budget
217
-
218
- When many sources contribute to the same compiled concept, `compile` enforces a per-concept character cap on the combined source content sent to the LLM so popular shared concepts don't blow past the model's context window. Each contributing source gets a fair share when truncation kicks in.
219
-
220
- - `LLMWIKI_PROMPT_BUDGET_CHARS` — character ceiling for the combined per-concept prompt. Defaults to `200000` (~50k tokens), which fits modern context windows with headroom. Raise it for larger-context models, lower it for local small-context models.
221
-
222
- A truncation warning prints to stderr when the cap fires so you know which concept hit the budget.
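The README does not spell out how the "fair share" is computed, so the following is one possible interpretation rather than llmwiki's algorithm: short sources keep all of their text, and the allowance they leave unused is redistributed to longer ones. `fairShareTruncate` and `SourceText` are hypothetical names; the budget parameter stands in for `LLMWIKI_PROMPT_BUDGET_CHARS`.

```typescript
// Hypothetical sketch of a fair-share character budget; the real truncation logic may differ.
interface SourceText {
  name: string;
  text: string;
}

function fairShareTruncate(sources: SourceText[], budgetChars: number): SourceText[] {
  const totalChars = sources.reduce((sum, s) => sum + s.text.length, 0);
  // Under budget: nothing is truncated.
  if (totalChars <= budgetChars) {
    return sources.map((s) => ({ name: s.name, text: s.text }));
  }

  // Visit sources shortest-first so unused allowance flows to the longer ones.
  const order = sources
    .map((_, i) => i)
    .sort((a, b) => sources[a].text.length - sources[b].text.length);

  const allowance: number[] = sources.map(() => 0);
  let remainingBudget = Math.max(0, budgetChars);
  let remainingSources = sources.length;
  for (const index of order) {
    const share = Math.floor(remainingBudget / remainingSources);
    const granted = Math.min(sources[index].text.length, share);
    allowance[index] = granted;
    remainingBudget -= granted;
    remainingSources -= 1;
  }

  // Keep the original order; each source is cut to its granted length.
  return sources.map((s, i) => ({ name: s.name, text: s.text.slice(0, allowance[i]) }));
}
```

For example, with three sources of 10, 100, and 100 characters and a budget of 90, this sketch keeps 10, 40, and 40 characters respectively.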
223
-
224
- </details>
225
-
226
-
227
- <br>
228
-
229
- ---
230
-
231
- <br>
232
-
233
-
234
- ## Why compile, not just retrieve?
235
-
236
- llmwiki uses embeddings — chunk-level, incremental, with BM25 reranking. But the embedding layer sits **below** the compiled wiki, not in front of it.
237
-
238
- **RAG retrieves chunks at query time.** Every question re-discovers the same relationships from scratch. The wiki structure, citation graph, and merged-concept disambiguation never accumulate; they get re-invented per query.
239
-
240
- **llmwiki compiles your sources into a wiki first.** Concepts get their own typed pages. Concepts shared across multiple sources are merged into one page instead of competing as duplicate chunks. Pages link to each other via `[[wikilinks]]`. When you ask a question with `--save`, the answer becomes a new page, and future queries use it as context.
241
-
242
- Then semantic retrieval, BM25 reranking, and graph expansion run over the compiled artifact — narrowing hundreds of pages to a tight, citation-traceable evidence pack.
243
-
244
- ```
245
- RAG: query → search chunks → answer → forget
246
- llmwiki: sources → compile → wiki → embed → query → save → richer wiki → better answers
138
+ llmwiki export --target okf --out ./dist/okf
139
+ llmwiki import --okf ./dist/okf --dry-run
140
+ llmwiki import --okf ./dist/okf
247
141
  ```
248
142
 
249
- llmwiki is complementary to traditional RAG: use RAG for ad-hoc retrieval over noisy or fast-changing corpora; use llmwiki when you want a persistent, structured, citation-traceable artifact that compounds.
143
+ OKF import is intentionally review-first: untrusted bundles become review candidates, not live wiki pages. The importer preserves foreign OKF metadata, stores llmwiki provenance under `x-llmwiki`, and re-exports imported pages honestly after local edits.
250
144
 
251
- ## How it works
145
+ See [`docs/guides/open-knowledge-format.mdx`](docs/guides/open-knowledge-format.mdx), [`docs/cli/export.mdx`](docs/cli/export.mdx), and [`docs/cli/import.mdx`](docs/cli/import.mdx).
252
146
 
253
- ```
254
- sources/ → hash check → LLM concept extraction → page generation → [[wikilink]] resolve
255
- │ ↓
256
- │ chunk embeddings ← wiki/ → index.md
257
- │ ↓
258
- │ semantic search + BM25 rerank + graph expansion
259
- │ ↓
260
- │ llmwiki query / context / MCP
261
-
262
- stale / orphaned pages → llmwiki refresh --stale → recompile changed owners, clean up orphans
263
- ```
147
+ ## What llmwiki creates
264
148
 
265
- **Two-phase compile.** Phase 1 extracts all concepts from every source; Phase 2 generates pages. Splitting the phases eliminates order-dependence, catches extraction failures before anything is written, merges concepts shared across multiple sources into a single page, and marks pages whose sources were all deleted as `orphaned` rather than silently dropping them.
149
+ A project has raw inputs in `sources/`, compiled markdown in `wiki/`, and compiler state under `.llmwiki/`:
266
150
 
267
- **Incremental everywhere.** Hash-based change detection on sources, content-hash-aware embedding updates, and cached citation judgements mean only changed work runs through the LLM. Recompiling after editing one source touches just the pages that source contributed to.
268
-
269
- **Source freshness and repair.** Every page records the sources — and their content hashes — that produced it. On any later command, llmwiki compares those recorded hashes against `sources/` on disk: a page whose sources changed since the last compile is `stale`, and a page whose sources were all deleted is `orphaned`. `llmwiki lint`, `llmwiki status`, the viewer, the JSON export, and the MCP tools surface this without recompiling anything. `llmwiki refresh --stale` then repairs it — recompiling only the changed sources that own stale pages and cleaning up orphaned ones, while deliberately leaving unrelated new sources for a full `llmwiki compile`. `--dry-run` previews the plan with no LLM calls.
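The stale/orphaned distinction described above amounts to comparing two hash maps. The sketch below assumes a per-page record of source paths to content hashes; `classifyPage`, the `fresh` label, and the handling of a page that lost only some of its sources (treated here as `stale`) are assumptions, not documented behaviour.

```typescript
// Hypothetical sketch of the freshness check; names and edge-case handling are assumptions.
type Freshness = "fresh" | "stale" | "orphaned";
type HashRecord = { [sourcePath: string]: string };

function hasKey(record: HashRecord, key: string): boolean {
  return Object.prototype.hasOwnProperty.call(record, key);
}

// recorded: hashes stored for one page at compile time.
// current: hashes of the files under sources/ on disk right now.
function classifyPage(recorded: HashRecord, current: HashRecord): Freshness {
  const owners = Object.keys(recorded);
  const surviving = owners.filter((path) => hasKey(current, path));
  // Every owning source was deleted.
  if (owners.length > 0 && surviving.length === 0) return "orphaned";
  // Any owning source is missing or has a different hash.
  const changed = owners.some(
    (path) => !hasKey(current, path) || current[path] !== recorded[path]
  );
  return changed ? "stale" : "fresh";
}
```

Because the check only reads recorded and current hashes, it needs no LLM calls, which matches the description of freshness being surfaced without recompiling.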
270
-
271
- **Hybrid retrieval.** `.llmwiki/embeddings.json` v2 carries page- and chunk-level vectors. `llmwiki query` and `llmwiki context` narrow hundreds of pages down to a chunk-level top-K via cosine similarity, then rerank with BM25 and expand along the wikilink graph for the final evidence pack.
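To make the retrieval stages more concrete, here is a minimal sketch of the cosine top-K step followed by a one-hop wikilink expansion. It omits the BM25 rerank entirely, and the `Chunk` shape, `topKPages`, and `expandOneHop` are hypothetical; the actual layout of `.llmwiki/embeddings.json` is not shown in this README.

```typescript
// Illustrative sketch only: cosine top-K over chunk vectors, then one-hop link expansion.
interface Chunk {
  page: string;     // slug of the page the chunk belongs to (assumed shape)
  vector: number[]; // embedding of the chunk
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank chunks by similarity, keep the top k chunks, and return their distinct pages.
function topKPages(query: number[], chunks: Chunk[], k: number): string[] {
  const ranked = chunks
    .map((chunk) => ({ page: chunk.page, score: cosineSimilarity(query, chunk.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
  const pages: string[] = [];
  for (const hit of ranked) {
    if (pages.indexOf(hit.page) === -1) pages.push(hit.page);
  }
  return pages;
}

// Add pages directly linked from the selected pages (one hop along the wikilink graph).
function expandOneHop(pages: string[], links: { [page: string]: string[] }): string[] {
  const result = pages.slice();
  for (const page of pages) {
    for (const neighbor of links[page] || []) {
      if (result.indexOf(neighbor) === -1) result.push(neighbor);
    }
  }
  return result;
}
```

A real pipeline would rerank the candidates lexically between these two steps; the sketch skips that to keep the vector and graph stages easy to follow.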
272
-
273
- **Citation-traceable.** Paragraphs carry `^[source.md]` markers; specific claims pin to `^[source.md:42-58]` line ranges. `llmwiki lint` validates that every citation resolves to a real file and line range; `llmwiki eval` measures citation precision and (optionally) LLM-judged claim support.
274
-
275
- **Compounding queries.** `llmwiki query --save` writes the answer as a wiki page and immediately rebuilds the index. Saved answers show up in future queries as context.
276
-
277
- ### What it produces
278
-
279
- A raw source like a Wikipedia article on knowledge compilation becomes a structured wiki page:
280
-
281
- ```yaml
282
- ---
283
- title: Knowledge Compilation
284
- summary: Techniques for converting knowledge representations into forms that support efficient reasoning.
285
- kind: concept
286
- sources:
287
- - knowledge-compilation.md
288
- createdAt: "2026-04-05T12:00:00Z"
289
- updatedAt: "2026-04-05T12:00:00Z"
290
- ---
291
- ```
292
-
293
- ```markdown
294
- Knowledge compilation refers to a family of techniques for pre-processing
295
- a knowledge base into a target language that supports efficient queries.
296
-
297
- Related concepts: [[Propositional Logic]], [[Model Counting]]
298
- ```
299
-
300
- Pages include source attribution in frontmatter. Paragraphs are annotated with `^[filename.md]` markers pointing back to the source file that contributed the content; specific claims can use line ranges like `^[filename.md:42-58]` or `^[filename.md#L42-L58]`.
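The three marker forms mentioned above can be recognised with a single regular expression. This is a sketch under assumptions, not the linter's parser: `parseCitations` and `citationResolves` are hypothetical names, file names are assumed not to contain `:`, `#`, or `]`, and line numbers are assumed to be 1-indexed and inclusive.

```typescript
// Hypothetical parser for ^[file], ^[file:42-58], and ^[file#L42-L58] markers.
interface Citation {
  file: string;
  startLine?: number;
  endLine?: number;
}

function parseCitations(text: string): Citation[] {
  const pattern = /\^\[([^\]:#]+)(?::(\d+)-(\d+)|#L(\d+)-L(\d+))?\]/g;
  const citations: Citation[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    // Groups 2/3 hold the ":42-58" form, groups 4/5 the "#L42-L58" form.
    const start = match[2] !== undefined ? match[2] : match[4];
    const end = match[3] !== undefined ? match[3] : match[5];
    const citation: Citation = { file: match[1] };
    if (start !== undefined && end !== undefined) {
      citation.startLine = parseInt(start, 10);
      citation.endLine = parseInt(end, 10);
    }
    citations.push(citation);
  }
  return citations;
}

// Check that a citation points at a known file and, if ranged, at lines that exist.
function citationResolves(c: Citation, lineCounts: { [file: string]: number }): boolean {
  if (!Object.prototype.hasOwnProperty.call(lineCounts, c.file)) return false;
  if (c.startLine === undefined || c.endLine === undefined) return true;
  return c.startLine >= 1 && c.startLine <= c.endLine && c.endLine <= lineCounts[c.file];
}
```

A check along these lines is the kind of validation the README attributes to `llmwiki lint`: every marker must name a real source file, and any line range must fall inside it.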
301
-
302
-
303
- <br>
304
-
305
- ---
306
-
307
- <br>
308
-
309
-
310
- <details>
311
- <summary><span style="font-size: 1.4em;"><strong>CLI and wiki model — click to expand</strong></span></summary>
312
-
313
-
314
- ## Commands
315
-
316
- | Command | What it does |
317
- |---------|-------------|
318
- | `llmwiki ingest <url\|file>` | Fetch a URL or copy a local file into `sources/` |
319
- | `llmwiki ingest-session <path>` | Import a Claude/Codex/Cursor session export (single file or whole directory) into `sources/` |
320
- | `llmwiki quickstart <source>` | Ingest a source and compile a wiki in one step; supports `--review`, `--no-open`, `--provider`, `--lang`, and `--json` |
321
- | `llmwiki compile` | Incremental compile: extract concepts, generate wiki pages |
322
- | `llmwiki compile --review` | Write candidate pages to `.llmwiki/candidates/` instead of `wiki/` so you can review before they land |
323
- | `llmwiki compile --lang <code>` | Generate wiki content in the given language (e.g. `Chinese`, `ja`, `zh-CN`); also works on `query` |
324
- | `llmwiki refresh --stale [--dry-run]` | Repair stale/orphaned pages: recompile the sources that own stale pages and clean up deleted owners, skipping unrelated new sources. `--dry-run` previews the plan with no LLM calls or writes |
325
- | `llmwiki review list` | List pending candidate pages |
326
- | `llmwiki review show <id>` | Print a candidate's title, summary, and body |
327
- | `llmwiki review approve <id>` | Promote a candidate into `wiki/` and refresh index/MOC/embeddings |
328
- | `llmwiki review reject <id>` | Archive a candidate without touching `wiki/` |
329
- | `llmwiki rules extract` | Extract machine-actionable rule candidates from changed sources into `.llmwiki/rule-candidates/` |
330
- | `llmwiki rules list` | List pending rule candidates |
331
- | `llmwiki rules approve <id>` / `reject <id>` | Approve or reject a rule candidate |
332
- | `llmwiki rules export` | Emit approved rule candidates as a JSON array for a downstream rule importer |
333
- | `llmwiki schema init` | Write a starter `.llmwiki/schema.json` file |
334
- | `llmwiki schema show` | Print the resolved schema for the current project |
335
- | `llmwiki query "question"` | Ask questions against your compiled wiki |
336
- | `llmwiki query "question" --save` | Answer and save the result as a wiki page |
337
- | `llmwiki export [--target <name>] [--project-id <id>]` | Export the wiki to portable formats — `llms.txt`, `llms-full.txt`, JSON, JSON-LD, GraphML, Marp slides. `--project-id` pins a stable identifier inside the JSON envelope so downstream importers (e.g. [`@atomicmemory/llmwiki`](https://github.com/atomicstrata/atomicmemory/tree/main/packages/llmwiki)) can derive deterministic external IDs |
338
- | `llmwiki view [--open]` | Start a read-only local web viewer for browsing, searching, and inspecting the compiled wiki |
339
- | `llmwiki next [--json]` | Show the recommended next action for this project (read-only); `--json` emits a stable envelope for agents |
340
- | `llmwiki context "<prompt>" [--json]` | Build an agent-ready evidence pack (primary pages, citations, neighbors, suggested actions) — same v1 envelope as MCP `get_context_pack` |
341
- | `llmwiki lint` | Check wiki quality (broken links, orphans, empty pages, low confidence, contradictions, stale pages whose sources changed, etc.) |
342
- | `llmwiki eval [--suite fast\|full]` | Measure wiki quality: health score (0–100), citation coverage, corpus stats. `--suite full` adds LLM-as-judge citation support scoring |
343
- | `llmwiki eval cache show` | Print score distribution and top-cited pages from the citation judgement cache |
344
- | `llmwiki eval cache clear` | Remove the citation judgement cache |
345
- | `llmwiki eval report` | Print the most recent eval report |
346
- | `llmwiki eval history [--n N]` | Show a trend table of past eval runs from `history.jsonl` |
347
- | `llmwiki eval judgements [--score 0\|1\|2] [--page slug]` | Inspect individual citation judgements with optional score or page filters |
348
- | `llmwiki watch` | Auto-recompile when `sources/` changes |
349
- | `llmwiki serve [--root <dir>]` | Start an MCP server exposing wiki tools to AI agents |
350
-
351
- `llmwiki context --include-sources` and MCP `get_context_pack` with `includeSources: true` are opt-in because they can return raw snippets from files under `sources/`. Path confinement prevents reads outside `sources/`, but only enable source windows for agents you trust with the ingested source text.
352
-
353
- ## Output
354
-
355
- ```
356
- log.md append-only activity journal (ingests, compiles, queries)
151
+ ```text
152
+ sources/
153
+ raw source files
357
154
  wiki/
358
- concepts/ one .md file per concept, with YAML frontmatter
359
- queries/ saved query answers, included in index and retrieval
360
- index.md auto-generated table of contents
155
+ concepts/ compiled pages
156
+ queries/ saved answers
157
+ index.md generated TOC
361
158
  .llmwiki/
362
- schema.json optional page-kind and cross-link policy
363
- candidates/ pending review candidates from `compile --review`
364
- candidates/archive/ rejected candidates kept for audit
365
- ```
366
-
367
- Obsidian-compatible. `[[wikilinks]]` resolve to concept titles — or to any page that declares the term in its `aliases` frontmatter, so links survive renames and synonyms.
368
-
369
- `log.md` records what happened and when. Each entry is a heading with a fixed
370
- prefix — `## [YYYY-MM-DDThh:mm:ssZ] operation | description` (an ISO 8601 UTC
371
- timestamp) — followed by a short bullet body carrying page wikilinks and counts:
372
-
373
- ```markdown
374
- ## [2026-06-05T09:14:02Z] ingest | Attention Is All You Need
375
- - Source: https://arxiv.org/abs/1706.03762
376
- - Saved: sources/attention-is-all-you-need.md
377
- - Chars: 38,214
378
-
379
- ## [2026-06-05T09:15:30Z] compile | 1 source(s) → 6 page(s)
380
- - Sources: attention-is-all-you-need.md
381
- - Created: [[self-attention]], [[multi-head-attention]], [[transformer]]
382
- - Updated: [[positional-encoding]]
383
-
384
- ## [2026-06-05T09:16:11Z] query | What is multi-head attention?
385
- - Pages: [[multi-head-attention]], [[self-attention]]
159
+ config.json review policy
160
+ schema.json page-kind/cross-link policy
161
+ state.json source hashes and ownership
162
+ candidates/ held review candidates
163
+ eval/ quality history and thresholds
164
+ log.md activity journal
386
165
  ```
387
166
 
388
- Only headings start with `## [`, so the gist's recipe still works even with the
389
- bodies: `grep "^## \[" log.md | tail -5` shows the five most recent operations.
390
- Where `index.md` organizes content for discovery, `log.md` tracks temporal
391
- progression.
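The heading convention described above can also be read programmatically. The following is a minimal sketch, assuming only the `## [timestamp] operation | description` shape shown in the example; `LogEntry`, `parseLogHeadings`, and `lastOperations` are hypothetical helper names, not part of llmwiki.

```typescript
// Sketch only: these helpers are illustrative and are not llmwiki APIs.
// They rely on the documented rule that only headings start with "## [".

interface LogEntry {
  timestamp: string;   // ISO 8601 UTC timestamp, e.g. 2026-06-05T09:14:02Z
  operation: string;   // e.g. "ingest", "compile", "query"
  description: string; // free text after the " | " separator
}

const HEADING = /^## \[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z)\] (\S+) \| (.*)$/;

// Collect every log heading in document order, skipping bullet bodies.
function parseLogHeadings(markdown: string): LogEntry[] {
  const entries: LogEntry[] = [];
  for (const line of markdown.split(/\r?\n/)) {
    const match = HEADING.exec(line);
    if (match) {
      entries.push({
        timestamp: match[1],
        operation: match[2],
        description: match[3],
      });
    }
  }
  return entries;
}

// Equivalent in spirit to: grep "^## \[" log.md | tail -<count>
function lastOperations(markdown: string, count: number): LogEntry[] {
  return parseLogHeadings(markdown).slice(-count);
}
```

Because the log is append-only, taking the tail of the parsed headings yields the most recent operations without sorting by timestamp.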
392
-
393
- ## Local web viewer
394
-
395
- Run `llmwiki view` from a project root to browse the compiled wiki in a local browser without Obsidian. The viewer is read-only: it renders `wiki/`, exposes sidebar navigation, search, page metadata, health counts, and provenance/citation chips, but does not mutate sources or generated pages.
167
+ Compiled pages are plain markdown with YAML frontmatter, plus enough metadata for agents to reason about citations, freshness, confidence, contradictions, and review state. See [`docs/concepts/wiki-model.mdx`](docs/concepts/wiki-model.mdx).
396
168
 
397
- ```bash
398
- llmwiki view # prints Viewer ready at http://127.0.0.1:<port>
399
- llmwiki view --open # also opens the URL in your default browser
400
- ```
169
+ ## Agent integration
401
170
 
402
- The server is private by default. It binds to `127.0.0.1` unless you explicitly provide both `--host <host>` and `--allow-lan`; wildcard hosts are rejected. Viewer responses use a strict local-asset CSP and path-confinement checks so the UI can safely render local markdown content.
171
+ ### MCP
403
172
 
404
- ## Review queue
405
-
406
- By default, `compile` writes pages directly to `wiki/`. Add `--review` to write candidate JSON records to `.llmwiki/candidates/` instead, so you can inspect each generated page before it lands.
173
+ Run:
407
174
 
408
175
  ```bash
409
- llmwiki compile --review # produces candidates, leaves wiki/ untouched
410
- llmwiki review list # see what's pending
411
- llmwiki review show <id> # inspect a single candidate
412
- llmwiki review approve <id> # write into wiki/ + refresh index/MOC/embeddings
413
- llmwiki review reject <id> # archive to .llmwiki/candidates/archive/
176
+ llmwiki serve --root /path/to/wiki-project
414
177
  ```
415
178
 
416
- A few things to know:
417
-
418
- - **Approve and reject acquire `.llmwiki/lock`** so they serialize cleanly against each other and against any concurrent `compile`.
419
- - **Source state is deferred per-source.** When one source produces multiple candidates, the source isn't marked compiled until the last candidate is approved — so unresolved siblings stay re-detectable on the next `compile --review`.
420
- - **Deletion bookkeeping is deferred.** `compile --review` does not orphan-mark deleted sources; the next non-review `compile` does that. The `--review` help text advertises this.
421
- - MCP `wiki_status` exposes `pendingCandidates` so agents can see the queue depth.
179
+ MCP clients can ingest sources, compile, query, search pages, read pages, lint, run eval, inspect status, and request context packs. Read-only tools work without provider credentials; LLM-backed tools validate provider credentials at call time.
422
180
 
423
- ## Page metadata
181
+ See [`docs/guides/mcp-agent-integration.mdx`](docs/guides/mcp-agent-integration.mdx).
424
182
 
425
- Compiled pages can carry epistemic metadata in frontmatter so consumers know how trustworthy each page is. All fields are optional and existing pages without them continue to work.
183
+ ### SDK
426
184
 
427
- ```yaml
428
- ---
429
- title: Knowledge Compilation
430
- summary: Techniques for converting knowledge representations...
431
- sources:
432
- - knowledge-compilation.md
433
- confidence: 0.82 # 0–1, LLM-reported confidence in the synthesized page
434
- provenanceState: merged # extracted | merged | inferred | ambiguous
435
- contradictedBy:
436
- - slug: probabilistic-reasoning
437
- ---
438
- ```
439
-
440
- When multiple sources merge into one slug, metadata is reconciled: `min` confidence, `provenanceState = 'merged'`, union of `contradictedBy` (deduped by slug).
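The reconciliation rule above can be illustrated with a small sketch. It assumes the frontmatter fields shown in the YAML example; `PageMeta` and `reconcileMeta` are hypothetical names, and ignoring pages that carry no `confidence` value is an assumption of this sketch rather than documented behavior.

```typescript
// Sketch only: PageMeta and reconcileMeta are illustrative, not llmwiki APIs.

interface Contradiction {
  slug: string;
}

interface PageMeta {
  confidence?: number; // 0 to 1
  provenanceState?: "extracted" | "merged" | "inferred" | "ambiguous";
  contradictedBy?: Contradiction[];
}

function reconcileMeta(pages: PageMeta[]): PageMeta {
  // Assumption: pages without a confidence value do not affect the minimum.
  const confidences = pages
    .map((page) => page.confidence)
    .filter((value): value is number => typeof value === "number");

  // Union of contradictedBy entries, keeping the first entry seen per slug.
  const seen = new Set<string>();
  const contradictedBy: Contradiction[] = [];
  for (const page of pages) {
    for (const entry of page.contradictedBy ?? []) {
      if (!seen.has(entry.slug)) {
        seen.add(entry.slug);
        contradictedBy.push(entry);
      }
    }
  }

  const merged: PageMeta = { provenanceState: "merged" };
  if (confidences.length > 0) merged.confidence = Math.min(...confidences);
  if (contradictedBy.length > 0) merged.contradictedBy = contradictedBy;
  return merged;
}
```

Taking the minimum is the conservative choice: a merged page is treated as no more trustworthy than its least confident contributor.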
441
-
442
- `llmwiki lint` adds three rules that surface this metadata:
443
-
444
- - `low-confidence` — flags pages with `confidence` below a threshold
445
- - `contradicted-page` — flags pages with non-empty `contradictedBy`
446
- - `excess-inferred-paragraphs` — flags pages whose body has too many uncited prose paragraphs (counted directly from the rendered text — the body is the single source of truth, no frontmatter field involved)
447
-
448
- ## Claim-level provenance
449
-
450
- Paragraph citations continue to use the original source-marker form:
185
+ ```ts
186
+ import { createWiki } from "llm-wiki-compiler";
451
187
 
452
- ```markdown
453
- This paragraph is grounded in the source. ^[source.md]
188
+ const wiki = createWiki({ root: "/path/to/wiki-project" });
189
+ await wiki.ingest({ source: "./notes.md" });
190
+ await wiki.compile();
191
+ const answer = await wiki.query({ question: "What changed?" });
454
192
  ```
455
193
 
456
- For claims that need tighter verification, pages can pin a statement to a line range in the ingested source:
457
-
458
- ```markdown
459
- The system uses a two-phase compile pipeline. ^[architecture-notes.md:42-58]
460
- The same range can also use GitHub-style anchors. ^[architecture-notes.md#L42-L58]
461
- ```
194
+ See [`docs/guides/sdk.mdx`](docs/guides/sdk.mdx).
462
195
 
463
- `llmwiki lint` validates both forms. It reports missing source files, malformed claim citations, impossible ranges like line `0` or `8-3`, and ranges that extend past the end of the source file.
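The range checks described above can be sketched as follows. This is an illustration under stated assumptions, not the linter's actual implementation: `ClaimCitation`, `parseCitation`, and `rangeProblems` are hypothetical names, the input is the text inside `^[...]`, and only the two range forms shown above are recognized (anything else after `:` or `#` is treated as malformed).

```typescript
// Sketch only: illustrative helpers, not the llmwiki lint implementation.

interface ClaimCitation {
  file: string;
  start?: number; // 1-based first line, when a range is present
  end?: number;   // 1-based last line, inclusive
}

// Accepts "file.md", "file.md:42-58", or "file.md#L42-L58".
const CITATION = /^([^:#]+)(?::(\d+)-(\d+)|#L(\d+)-L(\d+))?$/;

function parseCitation(inner: string): ClaimCitation | null {
  const match = CITATION.exec(inner.trim());
  if (!match) return null; // malformed claim citation
  const startText = match[2] ?? match[4];
  const endText = match[3] ?? match[5];
  if (startText === undefined || endText === undefined) {
    return { file: match[1] }; // whole-file citation
  }
  return { file: match[1], start: Number(startText), end: Number(endText) };
}

// Report impossible ranges and ranges that run past the end of the source.
function rangeProblems(citation: ClaimCitation, sourceLineCount: number): string[] {
  const problems: string[] = [];
  if (citation.start === undefined || citation.end === undefined) return problems;
  if (citation.start < 1) problems.push("range starts before line 1");
  if (citation.end < citation.start) problems.push("range end precedes start");
  if (citation.end > sourceLineCount) problems.push("range extends past end of source");
  return problems;
}
```

For example, `rangeProblems(parseCitation("notes.md:8-3")!, 100)` reports the reversed range, while a whole-file citation such as `source.md` produces no range problems.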
196
+ ## Configuration
464
197
 
465
- ## Schema layer
198
+ Minimum requirement: Node.js 24 or newer.
466
199
 
467
- Projects can optionally define `.llmwiki/schema.json` to shape the wiki beyond flat concept pages. Existing projects do not need a schema file; missing or invalid `kind` values fall back to `concept`.
200
+ The default provider is Anthropic:
468
201
 
469
202
  ```bash
470
- llmwiki schema init
471
- llmwiki schema show
203
+ export ANTHROPIC_API_KEY=sk-...
472
204
  ```
473
205
 
474
- The schema supports four page kinds:
206
+ Provider selection is environment-driven:
475
207
 
476
- - `concept`: standalone idea or pattern
477
- - `entity`: specific person, product, organization, or named artifact
478
- - `comparison`: side-by-side analysis across concepts or entities
479
- - `overview`: map page that connects several concepts in a domain
208
+ | Provider | Typical setup |
209
+ |---|---|
210
+ | Anthropic | `ANTHROPIC_API_KEY` or `ANTHROPIC_AUTH_TOKEN` |
211
+ | Claude Agent SDK | Local Claude Code login, `LLMWIKI_PROVIDER=claude-agent` |
212
+ | OpenAI-compatible | `LLMWIKI_PROVIDER=openai`, `OPENAI_API_KEY`, optional `OPENAI_BASE_URL` |
213
+ | Ollama | `LLMWIKI_PROVIDER=ollama`, `OLLAMA_HOST` |
214
+ | GitHub Copilot | `LLMWIKI_PROVIDER=copilot`, `GITHUB_TOKEN=$(gh auth token)` |
480
215
 
481
- Schema rules can set per-kind `minWikilinks` and optional `seedPages`. Compile can materialize seed pages such as overviews, lint enforces page-kind-specific cross-link minimums, and review candidates surface schema violations before approval.
216
+ See [`docs/configuration/providers.mdx`](docs/configuration/providers.mdx) and [`docs/configuration/environment-variables.mdx`](docs/configuration/environment-variables.mdx).
482
217
 
483
- ## Eval / quality measurement
484
-
485
- `llmwiki eval` gives the wiki a quantitative health score and tracks citation quality over time, making it possible to detect regressions after a recompile.
486
-
487
- ```bash
488
- llmwiki eval # fast suite: health score, citation coverage, corpus stats
489
- llmwiki eval --suite full # + LLM-as-judge citation support scoring (requires API)
490
- llmwiki eval report # re-print the most recent report
491
- llmwiki eval history # trend table across past runs
492
- llmwiki eval history --n 10 # limit to last 10 entries
493
- llmwiki eval judgements # all cached citation judgements
494
- llmwiki eval judgements --score 0 # only unsupported citations
495
- llmwiki eval judgements --page some-slug # filter to one page
496
- llmwiki eval cache show # score distribution + top-cited pages
497
- llmwiki eval cache clear # wipe the citation judgement cache
498
- ```
218
+ ## Quality and safety model
499
219
 
500
- **What it measures:**
501
-
502
- - **Health score (0–100)** aggregates all lint rules. Errors (broken citations, broken wikilinks, duplicate concepts) cost more than warnings.
503
- - **Citation coverage** — fraction of prose paragraphs that carry a `^[...]` marker, plus citation precision (fraction of citations pointing to existing source files).
504
- - **Citation support (full suite)** — samples up to N `(claim, source span)` pairs, asks a judge model to score each 0–2 (unsupported → fully supported), and caches results so subsequent runs only re-judge new pairs.
505
- - **Source utilization & citation depth** — the fraction of a page's valid sources that are actually cited (`source_utilization_rate`), and the share of citations pinned to specific line ranges rather than whole files (`claim_level_citation_rate`). Source warnings flag sources excluded from compilation (e.g. out-of-tree symlinks), gateable via `source_warnings_max`.
506
- - **Corpus stats** — page count, source count, total wiki characters, embedding counts, appended to `history.jsonl` for trend tracking.
507
- - **Regression deltas** — current report is diffed against the previous entry in history.
508
-
509
- **CI thresholds:** add `.llmwiki/eval/thresholds.yaml` to configure minimum acceptable scores:
510
-
511
- ```yaml
512
- health_score: 85
513
- citation_coverage_percent: 70
514
- citation_precision_percent: 90
515
- citation_support_mean: 1.4 # only checked when --suite full
516
- source_utilization_rate: 0.9 # min fraction of valid sources cited by a page
517
- source_warnings_max: 0 # max excluded sources (out-of-tree symlinks, etc.)
518
- claim_level_citation_rate: 0.5 # min fraction of citations with line ranges
519
- ```
220
+ llmwiki is designed for auditable generated knowledge:
520
221
 
521
- Threshold violations are listed in the report. Exit code is non-zero when any threshold is breached, suitable for CI gating.
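The gating logic can be pictured with a short sketch. It assumes metric names matching the threshold keys shown above and treats any key ending in `_max` as a ceiling and every other key as a floor; that suffix convention, along with the names `findViolations` and `exitCodeFor`, is an assumption made for illustration and is not taken from the llmwiki source.

```typescript
// Sketch only: illustrative threshold gate, not the llmwiki eval implementation.

type Metrics = Record<string, number>;

function findViolations(metrics: Metrics, thresholds: Metrics): string[] {
  const violations: string[] = [];
  for (const [name, limit] of Object.entries(thresholds)) {
    const value = metrics[name];
    // Skip metrics that were not measured in this run
    // (e.g. citation_support_mean without --suite full).
    if (value === undefined) continue;
    // Assumption: "*_max" keys are upper bounds, all others are lower bounds.
    const isCeiling = name.endsWith("_max");
    const breached = isCeiling ? value > limit : value < limit;
    if (breached) {
      violations.push(`${name}: ${value} (${isCeiling ? "max" : "min"} ${limit})`);
    }
  }
  return violations;
}

// Non-zero when any threshold is breached, so a CI step fails.
function exitCodeFor(violations: string[]): number {
  return violations.length > 0 ? 1 : 0;
}
```

A caller would load the thresholds from YAML with a parser of its choice, pass them in alongside the measured metrics, and use the returned code as the process exit status.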
222
+ - **Review before write.** Use `compile --review` or `.llmwiki/config.json` review policy to hold risky pages as candidates.
223
+ - **Fail-closed config.** Invalid review-policy config aborts compile instead of silently disabling review.
224
+ - **Source confinement.** Source snippets and import/export paths are confined to the project.
225
+ - **Freshness is explicit.** Pages can be fresh, stale, orphaned, or unverified; stale pages are flagged and repairable.
226
+ - **Imported compiled knowledge is staged by default.** External bundles go through the review queue unless explicitly trusted.
227
+ - **CI gates are supported.** `llmwiki lint` and `llmwiki eval` can enforce quality thresholds.
522
228
 
523
- **Artifacts** written under `.llmwiki/eval/`:
229
+ See [`docs/configuration/review-policy.mdx`](docs/configuration/review-policy.mdx), [`docs/troubleshooting/stale-pages.mdx`](docs/troubleshooting/stale-pages.mdx), and [`docs/guides/ci-quality-gates.mdx`](docs/guides/ci-quality-gates.mdx).
524
230
 
525
- ```
526
- .llmwiki/eval/
527
- history.jsonl one JSON line per eval run
528
- citation-cache.jsonl one JSON line per citation judgement
529
- thresholds.yaml optional CI threshold config
530
- ```
231
+ ## Scale and what works
531
232
 
532
- </details>
233
+ llmwiki is still early software, but it is no longer a toy pipeline for a handful of notes.
533
234
 
235
+ - **Incremental compilation** means unchanged sources do not flow back through the LLM.
236
+ - **Chunk-level embeddings** narrow large wikis before BM25 reranking and graph expansion.
237
+ - **Content-hash-aware embedding updates** avoid recomputing vectors for unchanged pages and chunks.
238
+ - **Cached citation judgements** make repeated `eval --suite full` runs cheaper.
239
+ - **Lexical fallback** keeps query/context workflows usable when the active provider has no embedding endpoint.
240
+ - **Prompt budgeting and ingest truncation metadata** make large sources explicit instead of silently pretending they fit.
534
241
 
535
- <br>
242
+ The current sweet spot is a durable project or domain wiki: research folders, codebase docs, team handbooks, standards, design notes, decision logs, or curated source packs. The less ideal fit is a high-churn firehose where raw search is enough and compiled structure would go stale faster than it can be reviewed.
536
243
 
537
- ---
244
+ ## Documentation
538
245
 
539
- <br>
246
+ The full docs site source is in [`docs/`](docs/):
540
247
 
248
+ - Start here: [`docs/introduction.mdx`](docs/introduction.mdx)
249
+ - Quickstart: [`docs/quickstart.mdx`](docs/quickstart.mdx)
250
+ - Installation: [`docs/installation.mdx`](docs/installation.mdx)
251
+ - Karpathy's LLM Wiki pattern: [`docs/concepts/karpathy-pattern.mdx`](docs/concepts/karpathy-pattern.mdx)
252
+ - How the compiler works: [`docs/concepts/how-it-works.mdx`](docs/concepts/how-it-works.mdx)
253
+ - Wiki model: [`docs/concepts/wiki-model.mdx`](docs/concepts/wiki-model.mdx)
254
+ - CLI reference: [`docs/cli/`](docs/cli/)
255
+ - Open Knowledge Format: [`docs/guides/open-knowledge-format.mdx`](docs/guides/open-knowledge-format.mdx)
256
+ - MCP integration: [`docs/guides/mcp-agent-integration.mdx`](docs/guides/mcp-agent-integration.mdx)
257
+ - SDK: [`docs/guides/sdk.mdx`](docs/guides/sdk.mdx)
258
+ - Atomic Memory bridge: [`docs/guides/atomic-memory-bridge.mdx`](docs/guides/atomic-memory-bridge.mdx)
541
259
 
542
- ## Demo
543
-
544
- Try it on any article or document:
260
+ Preview the docs locally with Node 24:
545
261
 
546
262
  ```bash
547
- mkdir my-wiki && cd my-wiki
548
- llmwiki quickstart https://en.wikipedia.org/wiki/Andrej_Karpathy
549
- llmwiki query "What terms did Andrej coin?"
263
+ cd docs
264
+ volta run --node 24 npx mint dev --port 3001
550
265
  ```
551
266
 
552
- See `examples/basic/` in the repo for pre-generated output you can browse without an API key.
553
-
554
-
555
- <br>
556
-
557
- ---
267
+ ## Current release
558
268
 
559
- <br>
269
+ Version `0.10.0` includes review policy, source-freshness repair, Open Knowledge Format import/export/re-export support, the Claude Agent SDK provider, and the Mintlify docs site. See [`CHANGELOG.md`](CHANGELOG.md) for release history.
560
270
 
271
+ ## Companion: Atomic Memory
561
272
 
562
- <details>
563
- <summary><span style="font-size: 1.4em;"><strong>MCP Server — click to expand</strong></span></summary>
564
-
273
+ llmwiki and [Atomic Memory](https://github.com/atomicstrata/atomicmemory) are complementary open context infrastructure:
565
274
 
566
- ## MCP Server
275
+ - **llmwiki** compiles source material into durable, inspectable knowledge.
276
+ - **Atomic Memory** gives agents runtime memory that is searchable, scoped, correctable, and inspectable.
567
277
 
568
- llmwiki ships an MCP (Model Context Protocol) server so AI agents (Claude Desktop, Cursor, Claude Code, etc.) can drive the full pipeline directly: ingest sources, compile, query, search, lint, and read pages — without scraping CLI output.
278
+ Use them independently or together. The [`@atomicmemory/llmwiki`](https://github.com/atomicstrata/atomicmemory/tree/main/packages/llmwiki) bridge imports `llmwiki export --target json --project-id <id>` as durable memory records.
569
279
 
570
- Where [llm-wiki-kit](https://github.com/iamsashank09/llm-wiki-kit) gives agents raw CRUD against wiki pages, llmwiki exposes the **automated pipelines**: agents get intelligent compilation, incremental change detection, and semantic query routing built in.
280
+ ## Contributing
571
281
 
572
- ### Setup
282
+ Good first contributions are usually docs, provider setup improvements, importer/exporter polish, eval fixtures, or focused CLI ergonomics. Larger feature work should start with an issue or design discussion.
573
283
 
574
- Start the server (stdio transport, no API key required at startup):
284
+ Before committing code changes, run:
575
285
 
576
286
  ```bash
577
- llmwiki serve --root /path/to/your/wiki-project
578
- ```
579
-
580
- ### Claude Desktop / Cursor configuration
581
-
582
- Add to your client's MCP config (e.g. `claude_desktop_config.json`):
583
-
584
- ```json
585
- {
586
- "mcpServers": {
587
- "llmwiki": {
588
- "command": "npx",
589
- "args": ["llm-wiki-compiler", "serve", "--root", "/path/to/wiki-project"],
590
- "env": {
591
- "ANTHROPIC_API_KEY": "sk-ant-..."
592
- }
593
- }
594
- }
595
- }
596
- ```
597
-
598
- Tools that need an LLM (`compile_wiki`, `query_wiki`, `search_pages`) check for a configured provider on each call. Read-only tools (`read_page`, `lint_wiki`, `wiki_status`) and `ingest_source` work without any credentials. `get_context_pack` is read-only and provider credentials are optional — when present, semantic retrieval is used; otherwise the tool falls back to lexical ranking and surfaces an `embedding-store-missing` or `query-embedding-unavailable` warning.
599
-
600
- ### Tools
601
-
602
- | Tool | What it does |
603
- |------|--------------|
604
- | `ingest_source` | Fetch a URL or local file into `sources/`. |
605
- | `compile_wiki` | Run the incremental compile pipeline; returns counts, slugs, errors. |
606
- | `query_wiki` | Two-step grounded answer with optional `--save`. |
607
- | `search_pages` | Return full content of pages relevant to a question. |
608
- | `read_page` | Read a single page by slug (concepts/ then queries/). |
609
- | `lint_wiki` | Run quality checks; returns structured diagnostics. |
610
- | `wiki_status` | Page/source counts, stale and orphaned pages, a `stateStatus` field, and pending changes (read-only). |
611
- | `get_context_pack` | Build an agent-ready evidence pack (primary pages, semantic chunks, graph neighbors, citations, per-page freshness, warnings, suggested actions) — same v1 JSON envelope as `llmwiki context --json`. `get_context_pack` **packages evidence**; `query_wiki` **generates answers**. |
612
- | `run_eval` | Score wiki quality (the fast suite needs no API key; the full suite LLM-judges a sample of citations); read-only. |
613
-
614
- ### Resources
615
-
616
- | URI | Returns |
617
- |-----|---------|
618
- | `llmwiki://index` | Full `wiki/index.md` content. |
619
- | `llmwiki://concept/{slug}` | A single concept page (frontmatter + body). |
620
- | `llmwiki://query/{slug}` | A single saved query page. |
621
- | `llmwiki://sources` | List of ingested source files with metadata. |
622
- | `llmwiki://state` | Compilation state (per-source hashes, last compile times). |
623
- | `llmwiki://eval/report` | The most recent eval report. |
624
- | `llmwiki://eval/history` | Trend of past eval runs. |
625
-
626
- </details>
627
-
628
-
629
- <br>
630
-
631
- ---
632
-
633
- <br>
634
-
635
-
636
- <details>
637
- <summary><span style="font-size: 1.4em;"><strong>SDK — programmatic API — click to expand</strong></span></summary>
638
-
639
-
640
- ## SDK — `createWiki()`
641
-
642
- Drive llmwiki in-process instead of shelling out to the CLI. `createWiki({ root })` returns a `Wiki` facade bound to a project directory. Every method runs silently (no console output) and is concurrency-safe — quiet mode is scoped per async call, not via a global flag. `createWiki` is exported from the package entry, so `import { createWiki } from "llm-wiki-compiler"` works for any installed version.
643
-
644
- ```ts
645
- import { createWiki } from "llm-wiki-compiler";
646
-
647
- const wiki = createWiki({ root: "./my-wiki" });
648
-
649
- await wiki.ingestText({ title: "Notes", text: "Raw text to compile…" });
650
- await wiki.compile(); // needs LLM credentials
651
- const { answer } = await wiki.query("What did I note about X?");
652
- const status = await wiki.status(); // no credentials needed
287
+ npx tsc --noEmit
288
+ npm run build
289
+ npm test
290
+ npm run fallow:ci
653
291
  ```
654
292
 
655
- ### Methods
656
-
657
- | Method | What it does | LLM creds |
658
- |--------|--------------|:---------:|
659
- | `ingest({ source })` | Fetch a URL or read a local file into `sources/`. **Trusted input only** — a server-side fetch + local-file-read primitive (SSRF risk); use `ingestText` for untrusted content. | No |
660
- | `ingestText({ title, text })` | Ingest raw text — the safe path for untrusted content (no fetch, no file read). | No |
661
- | `compile(options?)` | Compile pending sources into wiki pages. `options.review` queues candidates instead of writing. **Sends source content to the provider.** | Yes |
662
- | `search(question)` | Retrieve and hydrate the most relevant page records. | Yes |
663
- | `query(question, options?)` | Grounded answer; `options.save` persists it as a page, `options.debug` returns retrieval detail. | Yes |
664
- | `getPage(ref)` / `listPages(options?)` | Read one page / list pages with filters and cursor pagination. | No |
665
- | `listSources(options?)` / `getSource(id)` / `deleteSource(id)` | List, read, or delete ingested sources. `id` is the `IngestResult.filename` (e.g. `"note.md"`); `deleteSource` reconciles the compiled page on the next `compile()`. | No |
666
- | `status()` | Read-only status snapshot — counts, freshness, pending changes. | No |
667
- | `lint()` | Run all lint rules; severity-counted summary. | No |
668
- | `getContextPack({ prompt, budget?, depth?, topPages?, topChunks? })` | Build a v1 context pack — same envelope as MCP `get_context_pack`. Semantic retrieval when embeddings exist, lexical fallback otherwise. | No |
669
- | `exportJson(options?)` | Structured JSON export document (same shape as `llmwiki export --target json`). | No |
670
- | `runEval({ mode, record? })` | Eval harness. `mode: "fast"` is credential-free; `"full"` LLM-judges a sample of citations. | full only |
671
-
672
- **Notes.** Methods that need a provider throw `ProviderUnavailableError` when no credentials are configured; the rest run credential-free. Output is suppressed and there is no progress callback in v1 (`compile`/`runEval` on a large corpus can run for minutes with no feedback). `status()`, `lint()`, and `exportJson()` each hash the full source corpus per call (no cross-call cache) — avoid calling them in a hot loop. All result types (`CompileResult`, `QueryResult`, `WikiStatus`, `ContextPack`, …) are exported from the package for typed consumption.
673
-
674
- </details>
675
-
676
-
677
- <br>
678
-
679
- ---
680
-
681
- <br>
682
-
683
-
684
- ## Companion: Atomic Memory
685
-
686
- llmwiki and [Atomic Memory](https://github.com/atomicstrata/atomicmemory) are complementary layers of open context infrastructure, both maintained by [Atomic Strata](https://github.com/atomicstrata):
687
-
688
- - **llmwiki** gives you a persistent **knowledge base** — durable markdown compiled from your sources, inspectable on disk.
689
- - **Atomic Memory** gives your agents persistent **working memory** — runtime context that's searchable, correctable, scoped, and inspectable over time.
690
-
691
- Use them independently or together. Each remains valuable on its own — llmwiki as a notebook, RAG index, CI-checked knowledge base, or domain pack source; Atomic Memory as a runtime memory layer for any agent or app.
692
-
693
- The [`@atomicmemory/llmwiki`](https://github.com/atomicstrata/atomicmemory/tree/main/packages/llmwiki) bridge ingests `llmwiki export --target json --project-id <id>` envelopes as one verbatim Atomic Memory record per wiki page, preserving all advisory metadata (kind, citations, confidence, provenance state, contradictions, aliases, freshness) under `memory.metadata.llmwiki.*`. See the [bridge cookbook](https://github.com/atomicstrata/atomicmemory/blob/main/packages/llmwiki/docs/cookbook.md) for the full compile → export → import → package workflow.
694
-
695
- ## Scale and what works
696
-
697
- Still early software, but the scale story has matured well past the "few dozen sources" era.
698
-
699
- - **Semantic chunk retrieval** (`.llmwiki/embeddings.json` v2) narrows hundreds of pages down to a small top-K before LLM selection, with BM25 reranking and graph-neighborhood expansion layered on top.
700
- - **Incremental everything.** Hash-based source-change detection, content-hash-aware embedding updates, cached citation judgements. Re-running on an unchanged corpus is a few seconds.
701
- - **Lexical fallback.** Index-based routing kicks in automatically when no embedding store is present or the active provider has no embedding credentials, surfacing a stable warning code rather than hard-failing.
702
-
703
- **Honest about truncation.** Sources that exceed the character limit are truncated on ingest with `truncated: true` and the original character count recorded in frontmatter, so downstream consumers know they're working with partial content. A per-concept prompt budget prevents popular shared concepts from crashing compile.
704
-
705
- **Where it's still early.** No source-freshness watchdog yet (re-ingest detects content changes, but doesn't proactively re-check URLs). No team / multi-writer conflict resolution. The viewer is read-only by design — write operations go through the CLI or MCP.
706
-
707
- ## Karpathy's LLM Wiki pattern vs this compiler
708
-
709
- Karpathy described an abstract pattern for turning raw data into compiled knowledge. Here's how llmwiki maps to it today:
710
-
711
- | Karpathy's concept | llmwiki | Status |
712
- |---|---|---|
713
- | Data ingest | `llmwiki ingest`, `ingest-session` (Claude/Codex/Cursor) | Implemented |
714
- | Compile wiki | `llmwiki compile` (two-phase, incremental) | Implemented |
715
- | Q&A | `llmwiki query` (semantic + BM25 + graph expansion) | Implemented |
716
- | Output filing (save answers back) | `llmwiki query --save` | Implemented |
717
- | Auto-recompile | `llmwiki watch` | Implemented |
718
- | Linting / health-check pass | `llmwiki lint` + `llmwiki eval` (CI-gateable) | Implemented |
719
- | Agent integration | `llmwiki serve` MCP server with `get_context_pack` | Implemented |
720
- | Multimodal ingest | Images, PDFs, transcripts via `llmwiki ingest` | Implemented |
721
- | Marp slides | `llmwiki export --target marp` | Implemented |
722
- | Bridge to runtime memory | `llmwiki export --target json --project-id` → [`@atomicmemory/llmwiki`](https://github.com/atomicstrata/atomicmemory/tree/main/packages/llmwiki) | Implemented |
723
- | Fine-tuning | — | Not yet implemented |
724
-
725
- ## Roadmap
726
-
727
- Shipped in 0.9.0:
728
-
729
- - ✅ Source freshness — `llmwiki lint` flags pages whose sources changed (`stale`) or were all deleted (`orphaned`) since compile, surfaced across MCP (`wiki_status`, `get_context_pack`), context packs, the viewer (badges, a per-axis filter, health counts, a corrupt-state banner), the JSON export, and `llmwiki next`
730
- - ✅ `llmwiki refresh --stale` — repairs stale/orphaned pages with a targeted recompile of their changed owning sources (and deleted-owner cleanup), skipping unrelated new sources; `--dry-run` previews with no LLM calls or writes
731
- - ✅ JSON export bridge contract — `llmwiki export --target json --project-id <id>` adds per-page `path`, `kind`, advisory confidence/provenance, flattened citations, aliases, and freshness so downstream importers (e.g. [`@atomicmemory/llmwiki`](https://github.com/atomicstrata/atomicmemory/tree/main/packages/llmwiki)) can ingest pages as durable memory records
732
- - ✅ In-process SDK — `createWiki()` exposes the compiler in-process, with source-backed write APIs for programmatic callers
733
- - ✅ Eval over MCP + richer metrics — `run_eval` tool and read-only `llmwiki://eval/report`/`llmwiki://eval/history` resources, plus source-utilization and citation-depth metrics with a `source_warnings_max` CI gate
734
- - ✅ Rule-candidate extraction — extract reusable rule candidates from sources with review/approve and a JSON export pipeline
735
- - ✅ Claude Agent provider — authenticates through a local Claude Code login (bundled plan tokens, no separate API key)
736
- - ✅ Alias-aware wikilinks — the viewer resolves a `[[term]]` link to any page that declares `term` in its `aliases` frontmatter, not just an exact slug match
737
-
738
- Shipped in 0.8.0:
739
-
740
- - ✅ Guided project flow — `llmwiki next` recommends the next useful command, and `llmwiki quickstart <source>` ingests, compiles, and opens the viewer in one step
741
- - ✅ Graph/context layer — `llmwiki context` and MCP `get_context_pack` produce token-budgeted evidence packs with primary pages, graph neighbors, citations, optional source windows, warnings, and suggested actions
742
- - ✅ Viewer graph route — `llmwiki view` includes a force-directed `#/graph` route for exploring page relationships
743
- - ✅ Evaluation harness — `llmwiki eval` measures health score, citation coverage/precision, corpus stats, regression deltas, optional LLM-as-judge citation support, and CI thresholds
744
-
745
- Shipped in 0.7.0:
746
-
747
- - ✅ Read-only local web viewer — `llmwiki view` with sidebar navigation, markdown rendering, search, metadata, health counts, and provenance/citation chips
748
- - ✅ GitHub Copilot provider — `LLMWIKI_PROVIDER=copilot` with `GITHUB_TOKEN=$(gh auth token)` for Copilot chat/tool calls
749
- - ✅ Cached lint health summary — `llmwiki lint` writes `.llmwiki/last-lint.json` so viewer health can show the latest lint counts without re-running lint
750
-
751
- Shipped in 0.6.0:
752
-
753
- - ✅ Export bundle (`llms.txt`, JSON, JSON-LD, GraphML, Marp slides)
754
- - ✅ Session-history adapters — `llmwiki ingest-session` for Claude, Codex, and Cursor exports
755
- - ✅ Configurable output language — `--lang <code>` and `LLMWIKI_OUTPUT_LANG`
756
- - ✅ Defensive per-concept prompt budget so popular shared concepts don't crash compile
757
-
758
- Shipped in 0.5.0:
759
-
760
- - ✅ Multimodal ingest (images, PDFs, transcripts)
761
- - ✅ Chunked retrieval with reranking and `--debug` output
762
- - ⚠️ Minimum Node version raised to 24 (was 18)
763
-
764
- Shipped in 0.4.0:
765
-
766
- - ✅ Claim-level provenance with source ranges
767
- - ✅ First-class schema layer with typed page kinds (`concept`, `entity`, `comparison`, `overview`)
768
-
769
- Shipped in 0.3.0:
770
-
771
- - ✅ Candidate review queue (approve compile output before pages are written)
772
- - ✅ Confidence and contradiction metadata on compiled pages
773
-
774
- Shipped in 0.2.0:
775
-
776
- - ✅ Better provenance (paragraph-level source attribution)
777
- - ✅ Linting pass for wiki quality checks
778
- - ✅ Multi-provider support (OpenAI, Ollama, MiniMax)
779
- - ✅ Larger-corpus query strategy (semantic search, embeddings)
780
- - ✅ Deeper Obsidian integration (tags, aliases, Map of Content)
781
- - ✅ MCP server for agent integration
782
-
783
- Next up:
784
-
785
- - **Task and decision ledger** — turn session ingest into durable agent memory: goals, decisions, open questions, outcomes, and next-agent handoffs.
786
- - **Rollback, audit, and source lifecycle** — undo/reverse ingest, compile diff reports, stale-claim checks, freshness reports, and a durable operation log.
787
- - **Domain templates** — schema/prompt packs for research, codebase docs, team handbooks, decision logs, and standards/regulations.
788
- - **Eval extensions** — retrieval recall suites, update-drift benchmarks, and comparisons against serious retrieval baselines.
789
-
790
- Later / open to discussion:
791
-
792
- - Recurring source refresh jobs — re-ingest URLs on a schedule, diff against the prior snapshot, re-compile only what changed
793
- - MCP prompt resources — curated agent prompts such as "review the wiki", "propose new sources", and "draft a comparison page"
794
- - Codex OAuth provider — ChatGPT subscription auth as a dedicated provider, with clear token refresh and embedding-limit behavior
795
- - Team-chat connectors for Slack/Discord/Teams-style institutional memory
796
-
797
- If you like ambitious problems: **task/decision ledger**, **rollback/audit tooling**, and **eval extensions** are the meatiest next contributions. Open an issue to claim one or kick off a design discussion.
798
-
799
- Explicitly not planned (good ideas, just not for this repo): full static-site generator, desktop or mobile apps, fine-tuning, a formal ontology engine, heavy graph database infrastructure.
800
-
801
- ## Requirements
802
-
803
- Node.js >= 24, plus provider credentials (for Anthropic: `ANTHROPIC_API_KEY` or `ANTHROPIC_AUTH_TOKEN`).
804
-
805
- ## About
806
-
807
- llmwiki is maintained by [Atomic Strata](https://github.com/atomicstrata), the team behind [Atomic Memory](https://github.com/atomicstrata/atomicmemory). Atomic Strata builds open context infrastructure: durable compiled knowledge with llmwiki, runtime memory with Atomic Memory.
293
+ See [`CONTRIBUTING.md`](CONTRIBUTING.md).
808
294
 
809
295
  ## License
810
296
 
811
297
  MIT
812
-
813
-
814
- ## Disclaimer
815
-
816
- No LLMs were harmed in the making of this repo.