docshark 0.1.5

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/CHANGELOG.md ADDED
@@ -0,0 +1,21 @@
+ # Changelog
+
+ ## 0.1.5 (2026-03-02)
+
+ **Full Changelog**: https://github.com/Michael-Obele/docshark/compare/v0.1.4...v0.1.5
+
+ ## 0.1.4 (2026-03-02)
+
+ **Full Changelog**: https://github.com/Michael-Obele/docshark/compare/v0.1.3...v0.1.4
+
+ ## 0.1.3 (2026-03-02)
+
+ **Full Changelog**: https://github.com/Michael-Obele/docshark/compare/v0.1.2...v0.1.3
+
+ ## 0.1.2 (2026-03-02)
+
+ **Full Changelog**: https://github.com/Michael-Obele/docshark/compare/v0.1.1...v0.1.2
+
+ ## 0.1.1 (2026-03-02)
+
+ **Full Changelog**: https://github.com/Michael-Obele/docshark/compare/v0.1.0...v0.1.1
package/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2026 Michael-Obele
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
package/README.md ADDED
@@ -0,0 +1,113 @@
+ # 🦈 DocShark
+
+ [![Built with Bun](https://img.shields.io/badge/Bun-%23000000.svg?style=flat&logo=bun&logoColor=white)](https://bun.sh/)
+ [![MCP Compatible](https://img.shields.io/badge/MCP-Ready-0D1117.svg?style=flat&logo=github&logoColor=white)](https://modelcontextprotocol.io/)
+ [![GitHub Release](https://img.shields.io/github/v/release/Michael-Obele/docshark?color=success)](https://github.com/Michael-Obele/docshark/releases)
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+
+ **DocShark** is an MCP (Model Context Protocol) server that scrapes, indexes, and searches any documentation website. It builds a local, highly searchable knowledge base from public documentation pages using SQLite FTS5 (full-text search) with BM25 ranking, so AI assistants can query up-to-date docs effortlessly.
+
+ ---
+
+ ## 🚀 Features
+
+ - **Automated Crawling**: Discovers pages via `sitemap.xml`, with a fallback to BFS link crawling.
+ - **Smart Extraction**: Uses Readability and Turndown to extract the main content and convert it to clean Markdown, filtering out navbars and sidebars.
+ - **Semantic Chunking**: Splits content at headings, preserving contextual headers for better AI understanding.
+ - **High-Performance Search**: Built-in SQLite + FTS5 indexing with BM25 ranking for accurate, fast search results.
+ - **JS-Rendered Site Support**: A tiered fetching strategy detects React/Vue SPAs (empty HTML shells) and upgrades to `puppeteer-core` if you have it installed (zero-config, automatic fallback).
+ - **Polite Crawling**: Respects `robots.txt` and rate-limits requests to avoid overloading documentation servers.
+ - **Standard MCP Tooling**: Works with Claude Desktop, VS Code, Cursor, and any other MCP-compatible client via standard `stdio` or `http`/`sse` transports.
+
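The heading-based chunking feature above can be sketched in a few lines. This is a simplified illustration, not DocShark's actual implementation (the real chunker also tracks token counts and code blocks):

```typescript
// Split markdown at H1–H3 headings, carrying the nearest heading as context.
// Simplified sketch — DocShark's real chunker does more.
interface Chunk {
  content: string;
  headingContext: string;
}

function chunkByHeadings(markdown: string): Chunk[] {
  const chunks: Chunk[] = [];
  let heading = "";
  let buf: string[] = [];
  const flush = () => {
    const content = buf.join("\n").trim();
    if (content) chunks.push({ content, headingContext: heading });
    buf = [];
  };
  for (const line of markdown.split("\n")) {
    const m = /^(#{1,3})\s+(.*)$/.exec(line);
    if (m) {
      flush();         // close the previous section
      heading = m[2];  // the new heading becomes the context
    } else {
      buf.push(line);
    }
  }
  flush();
  return chunks;
}
```

Keeping the heading with each chunk means a search hit can be shown to the AI as "Section: Setup" rather than as an anonymous paragraph.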
+ ## 📦 What We Have Done (Phase 1)
+
+ **Phase 1: Core Engine** is fully implemented and tested.
+ - ✅ Custom SQLite database with FTS5 virtual tables and auto-sync triggers.
+ - ✅ Web scraping engine supporting standard `fetch()` and `puppeteer-core`.
+ - ✅ Markdown processor built on Readability + Turndown.
+ - ✅ Heading-based semantic chunker (500–1200 tokens per chunk).
+ - ✅ Asynchronous job manager and queue system.
+ - ✅ Complete HTTP API (REST endpoints + SSE event streams).
+ - ✅ Six MCP tools: `add_library`, `search_docs`, `list_libraries`, `get_doc_page`, `refresh_library`, and `remove_library`.
+ - ✅ Robust CLI interface (`start`, `add`, `search`, `list`).
+
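The BM25 ranking behind the FTS5 search above can be illustrated with a toy in-memory scorer. This is a sketch of the ranking idea only: SQLite's built-in `bm25()` differs in details such as sign convention and tokenization, and the constants `k1` and `b` here are the textbook defaults, not necessarily DocShark's settings:

```typescript
// Toy BM25: score each document against a query. Higher = more relevant.
// Uses the classic formula: idf(t) * tf*(k1+1) / (tf + k1*(1 - b + b*dl/avgdl))
function bm25Scores(docs: string[], query: string, k1 = 1.2, b = 0.75): number[] {
  const toks = (s: string) => s.toLowerCase().split(/\W+/).filter(Boolean);
  const docToks = docs.map(toks);
  const avgdl = docToks.reduce((a, d) => a + d.length, 0) / docs.length;
  return docToks.map((d) =>
    toks(query).reduce((score, term) => {
      // document frequency: how many docs contain the term
      const df = docToks.filter((x) => x.includes(term)).length;
      const idf = Math.log((docs.length - df + 0.5) / (df + 0.5) + 1);
      // term frequency within this document
      const tf = d.filter((w) => w === term).length;
      return score + (idf * tf * (k1 + 1)) / (tf + k1 * (1 - b + (b * d.length) / avgdl));
    }, 0)
  );
}
```

FTS5 precomputes these statistics in its index, which is why the same ranking stays fast at scale.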
+ ## 🏗️ What We Are Doing
+
+ We are actively polishing the integration between the core engine and external MCP clients (such as VS Code agents and Claude Desktop).
+
+ ## 🔮 What We Plan To Do (Phase 2 & Beyond)
+
+ - **Web Dashboard**: An intuitive SvelteKit dashboard to manage your synced libraries, view crawl progress in real time (via SSE), and test searches manually.
+ - **Incremental Crawling**: Smarter `refresh` jobs that compare `ETag` and `Last-Modified` headers to re-scrape only updated pages.
+ - **Vector Search (RAG)**: Lightweight vector embeddings for semantic similarity search alongside the existing FTS5 keyword search.
+ - **Advanced Scraping Setup**: Custom CSS selectors to define exactly where content lives on non-standard documentation websites.
+
+ ---
+
+ ## 🛠️ Usage
+
+ ### Installing & Running Locally
+
+ Ensure you have [Bun](https://bun.sh/) installed.
+
+ ```bash
+ # Install dependencies
+ bun install
+
+ # (Optional) Enable auto-detection & scraping of JavaScript-rendered React/Vue single-page apps
+ bun add puppeteer-core
+
+ # Start the DocShark MCP server in HTTP mode
+ bun run src/cli.ts start --port 6380
+ ```
+
+ ### Important CLI Commands
+
+ ```bash
+ # Add a documentation library to the index
+ bun run src/cli.ts add https://valibot.dev/guides/ --depth 2
+
+ # Search your indexed docs
+ bun run src/cli.ts search "schema validation"
+
+ # List all crawled libraries
+ bun run src/cli.ts list
+ ```
+
+ ### Using in VS Code (Copilot Agent Mode)
+
+ To use DocShark as an MCP server in VS Code:
+ 1. Enable MCP discovery in your VS Code settings.
+ 2. Create `.vscode/mcp.json` in your workspace:
+ ```json
+ {
+   "servers": {
+     "docshark": {
+       "type": "stdio",
+       "command": "bun",
+       "args": [
+         "run",
+         "/absolute/path/to/docshark/src/cli.ts",
+         "start",
+         "--stdio"
+       ]
+     }
+   }
+ }
+ ```
+ 3. Reload VS Code (or restart the MCP server), and your Copilot agent will have access to the DocShark tools.
+
+ ---
+
+ ## 🔄 Versioning & Changelog
+
+ This project uses [Google's Release Please](https://github.com/googleapis/release-please) to automate versioning and changelog generation.
+ - **Semantic Versioning**: Versions bump automatically (e.g. `0.0.1` -> `0.0.2` or `0.1.0`) based on standard Conventional Commits (`feat:`, `fix:`, `chore:`, etc.).
+ - **Automated**: A release PR is created automatically on `master` as conventional commits are merged, generating a standard `CHANGELOG.md`.
+
+ ## 📜 License
+
+ This project is open-source and available under the [MIT License](LICENSE).
+
+ ---
+ *Built to empower AI agents with the latest knowledge.*
@@ -0,0 +1,16 @@
+ import type { Database } from '../storage/db.js';
+ import type { SearchEngine } from '../storage/search.js';
+ import type { JobManager } from '../jobs/manager.js';
+ import type { LibraryService } from '../services/library.js';
+ import type { EventBus } from '../jobs/events.js';
+ interface ApiDeps {
+     db: Database;
+     searchEngine: SearchEngine;
+     jobManager: JobManager;
+     libraryService: LibraryService;
+     eventBus: EventBus;
+ }
+ export declare function createApiRouter(deps: ApiDeps): {
+     handle(request: Request): Promise<Response>;
+ };
+ export {};
package/dist/cli.d.ts ADDED
@@ -0,0 +1,2 @@
+ #!/usr/bin/env node
+ export {};
package/dist/cli.js ADDED
@@ -0,0 +1,179 @@
+ #!/usr/bin/env node
+ import { createRequire } from "node:module";
+ var __create = Object.create;
+ var __getProtoOf = Object.getPrototypeOf;
+ var __defProp = Object.defineProperty;
+ var __getOwnPropNames = Object.getOwnPropertyNames;
+ var __hasOwnProp = Object.prototype.hasOwnProperty;
+ var __toESM = (mod, isNodeMode, target) => {
+   target = mod != null ? __create(__getProtoOf(mod)) : {};
+   const to = isNodeMode || !mod || !mod.__esModule ? __defProp(target, "default", { value: mod, enumerable: true }) : target;
+   for (let key of __getOwnPropNames(mod))
+     if (!__hasOwnProp.call(to, key))
+       __defProp(to, key, {
+         get: () => mod[key],
+         enumerable: true
+       });
+   return to;
+ };
+ var __require = /* @__PURE__ */ createRequire(import.meta.url);
+
+ // src/cli.ts
+ import { Command } from "commander";
+ import { startHttpServer } from "./http.js";
+ import { StdioTransport } from "@tmcp/transport-stdio";
+ import { server, db, searchEngine, libraryService } from "./server.js";
+ import { VERSION } from "./version.js";
+ var program = new Command().name("docshark").description("\uD83E\uDD88 Documentation MCP Server — scrape, index, and search any doc website").version(VERSION, "-v, --version", "output the current version");
+ program.command("start", { isDefault: true }).description("Start the MCP server").option("-p, --port <port>", "HTTP server port", "6380").option("--stdio", "Run in STDIO mode (for Claude Desktop, Cursor, etc.)").option("--data-dir <path>", "Data directory", "").action(async (opts) => {
+   if (opts.dataDir) {
+     process.env.DOCSHARK_DATA_DIR = opts.dataDir;
+   }
+   db.init();
+   if (opts.stdio) {
+     const stdio = new StdioTransport(server);
+     stdio.listen();
+   } else {
+     await startHttpServer(parseInt(opts.port));
+   }
+ });
+ program.command("add <url>").description("Add a documentation library and start crawling").option("-n, --name <name>", "Library name (auto-generated from URL if omitted)").option("-d, --depth <n>", "Max crawl depth", "3").option("--lib-version <version>", "Library version").action(async (url, opts) => {
+   db.init();
+   try {
+     const lib = await libraryService.add({
+       url,
+       name: opts.name,
+       version: opts.libVersion,
+       maxDepth: parseInt(opts.depth)
+     });
+     console.log(`
+ ✅ Added "${lib.display_name}" — crawling ${lib.url}...`);
+     console.log(` Job ID: ${lib.jobId}`);
+     console.log(` Use "docshark list" to check progress.
+ `);
+     await waitForCrawl(lib.jobId);
+   } catch (err) {
+     console.error(`
+ ❌ ${err.message}
+ `);
+     process.exit(1);
+   }
+ });
+ program.command("search <query>").description("Search indexed documentation").option("-l, --library <name>", "Filter by library").option("--limit <n>", "Max results", "5").action(async (query, opts) => {
+   db.init();
+   const results = searchEngine.search(query, {
+     library: opts.library,
+     limit: parseInt(opts.limit)
+   });
+   if (results.length === 0) {
+     console.log(`
+ No results found for "${query}".
+ `);
+     return;
+   }
+   for (const r of results) {
+     console.log(`
+ --- ${r.page_title} (${r.library_display_name}) ---`);
+     console.log(`Section: ${r.heading_context}`);
+     console.log(r.content.slice(0, 300));
+     console.log(`Source: ${r.page_url}
+ `);
+   }
+ });
+ program.command("list").description("List indexed libraries").action(() => {
+   db.init();
+   const libs = db.listLibraries();
+   if (libs.length === 0) {
+     console.log(`
+ No libraries indexed. Use "docshark add <url>" to add one.
+ `);
+     return;
+   }
+   console.table(libs.map((l) => ({
+     Name: l.name,
+     URL: l.url,
+     Pages: l.page_count,
+     Chunks: l.chunk_count,
+     Status: l.status,
+     "Last Crawled": l.last_crawled_at || "never"
+   })));
+ });
+ program.command("refresh <name>").description("Refresh an existing documentation library").action(async (name) => {
+   db.init();
+   try {
+     const lib = db.getLibraryByName(name);
+     if (!lib)
+       throw new Error(`Library "${name}" not found.`);
+     const { jobManager } = await import("./server.js");
+     const job = jobManager.startCrawl(lib.id, { incremental: true });
+     console.log(`
+ \uD83D\uDD04 Refreshing "${lib.display_name}" — crawling ${lib.url}...`);
+     console.log(` Job ID: ${job.id}`);
+     await waitForCrawl(job.id);
+   } catch (err) {
+     console.error(`
+ ❌ ${err.message}
+ `);
+     process.exit(1);
+   }
+ });
+ program.command("remove <name>").description("Remove a documentation library and its index").action((name) => {
+   db.init();
+   try {
+     const lib = db.getLibraryByName(name);
+     if (!lib)
+       throw new Error(`Library "${name}" not found.`);
+     db.removeLibrary(lib.id);
+     console.log(`
+ \uD83D\uDDD1️ Removed library "${lib.display_name}". Deleted ${lib.page_count} pages.
+ `);
+   } catch (err) {
+     console.error(`
+ ❌ ${err.message}
+ `);
+     process.exit(1);
+   }
+ });
+ program.command("get <url>").description("Get the full markdown content of a specific indexed page").action((url) => {
+   db.init();
+   const page = db.getPage({ url });
+   if (!page) {
+     console.error(`
+ ❌ Page not found in index: ${url}
+ `);
+     process.exit(1);
+   }
+   console.log(`
+ --- ${page.title} ---`);
+   console.log(`Source: ${page.url}
+
+ `);
+   console.log(page.content_markdown);
+   console.log(`
+ `);
+ });
+ program.parse();
+ async function waitForCrawl(jobId) {
+   const { jobManager } = await import("./server.js");
+   return new Promise((resolve) => {
+     const check = () => {
+       const job = jobManager.getJob(jobId);
+       if (!job || job.status === "completed" || job.status === "failed") {
+         if (job?.status === "completed") {
+           console.log(`
+ \uD83E\uDD88 Crawl complete: ${job.pages_crawled} pages, ${job.chunks_created} chunks indexed.`);
+           if (job.pages_failed > 0) {
+             console.log(` ⚠️ ${job.pages_failed} pages failed.`);
+           }
+         } else if (job?.status === "failed") {
+           console.error(`
+ ❌ Crawl failed: ${job.error_message}`);
+         }
+         resolve();
+         return;
+       }
+       setTimeout(check, 1000);
+     };
+     check();
+   });
+ }
package/dist/http.d.ts ADDED
@@ -0,0 +1 @@
+ export declare function startHttpServer(port: number): Promise<import("srvx").Server<import("srvx").ServerHandler>>;
@@ -0,0 +1,4 @@
+ export * from "./server.js";
+ export * from "./types.js";
+ export * from "./version.js";
+ export * from "./http.js";
package/dist/index.js ADDED
@@ -0,0 +1,5 @@
+ // src/index.ts
+ export * from "./server.js";
+ export * from "./types.js";
+ export * from "./version.js";
+ export * from "./http.js";
@@ -0,0 +1,8 @@
+ type Listener = (data: any) => void;
+ export declare class EventBus {
+     private listeners;
+     on(event: string, listener: Listener): void;
+     off(event: string, listener: Listener): void;
+     emit(event: string, data: any): void;
+ }
+ export {};
@@ -0,0 +1,19 @@
+ import type { Database } from '../storage/db.js';
+ import type { EventBus } from './events.js';
+ import type { CrawlJob } from '../types.js';
+ export declare class JobManager {
+     private db;
+     private eventBus;
+     private activeJobs;
+     constructor(db: Database, eventBus: EventBus);
+     /** Start a crawl job for a library */
+     startCrawl(libraryId: string, opts?: {
+         incremental?: boolean;
+     }): CrawlJob;
+     /** Get status of a specific job */
+     getJob(jobId: string): CrawlJob | undefined;
+     /** List all jobs, optionally filtered by library */
+     listJobs(libraryId?: string): CrawlJob[];
+     /** Check if a crawl is currently running for a library */
+     isRunning(libraryId: string): boolean;
+ }
@@ -0,0 +1,8 @@
+ import type { Database } from '../storage/db.js';
+ import type { EventBus } from './events.js';
+ export declare class CrawlWorker {
+     private db;
+     private eventBus;
+     constructor(db: Database, eventBus: EventBus);
+     crawl(libraryId: string, jobId: string): Promise<void>;
+ }
@@ -0,0 +1,10 @@
+ export interface Chunk {
+     content: string;
+     headingContext: string;
+     tokenCount: number;
+     hasCodeBlock: boolean;
+ }
+ export declare function chunkMarkdown(markdown: string, _headings: Array<{
+     level: number;
+     text: string;
+ }>): Chunk[];
@@ -0,0 +1,8 @@
+ export declare function extractAndConvert(html: string, url: string): {
+     markdown: string;
+     title: string;
+     headings: Array<{
+         level: number;
+         text: string;
+     }>;
+ };
@@ -0,0 +1,6 @@
+ import type { CrawlConfig } from '../types.js';
+ /**
+  * Discover all documentation page URLs from a base URL.
+  * Strategy: sitemap.xml → link crawl fallback
+  */
+ export declare function discoverPages(baseUrl: string, config?: CrawlConfig): Promise<string[]>;
@@ -0,0 +1,6 @@
+ import type { FetchResult } from '../types.js';
+ /**
+  * Fetch a page and return its HTML.
+  * Supports auto-detection of JS-rendered sites (falls back to puppeteer-core if installed).
+  */
+ export declare function fetchPage(url: string, renderer?: 'auto' | 'fetch' | 'puppeteer'): Promise<FetchResult>;
@@ -0,0 +1,7 @@
+ export declare class RateLimiter {
+     private delayMs;
+     private lastRequest;
+     constructor(delayMs?: number);
+     wait(): Promise<void>;
+     setDelay(ms: number): void;
+ }
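The published package ships only the declaration above. An implementation matching that shape could look like the following sketch (an illustration of the rate-limiting idea, not the actual DocShark source):

```typescript
// Ensure at least `delayMs` elapses between consecutive requests.
// Sketch matching the declared RateLimiter interface — not the real source.
class RateLimiter {
  private lastRequest = 0;

  constructor(private delayMs = 1000) {}

  async wait(): Promise<void> {
    const elapsed = Date.now() - this.lastRequest;
    if (elapsed < this.delayMs) {
      // sleep for the remainder of the window
      await new Promise((r) => setTimeout(r, this.delayMs - elapsed));
    }
    this.lastRequest = Date.now();
  }

  setDelay(ms: number): void {
    this.delayMs = ms;
  }
}
```

A crawler would simply `await limiter.wait()` before each `fetchPage` call.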
@@ -0,0 +1,5 @@
+ import robotsParser from 'robots-parser';
+ /** Fetch and parse robots.txt for a given base URL */
+ export declare function getRobotsParser(baseUrl: string): Promise<import("robots-parser").Robot | null>;
+ /** Check if a URL is allowed by robots.txt */
+ export declare function isAllowed(robots: ReturnType<typeof robotsParser> | null, url: string): boolean;
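The real code delegates to the `robots-parser` package, but the core of an `isAllowed` check is simple prefix matching. A greatly simplified sketch that only handles `User-agent: *` blocks with plain `Disallow` prefixes (no wildcards, no `Allow` overrides):

```typescript
// Toy robots.txt check: true unless the path matches a Disallow rule
// in the `User-agent: *` group. Assumption: no wildcard/Allow support.
function isAllowedSimple(robotsTxt: string, url: string): boolean {
  const path = new URL(url).pathname;
  let inStarGroup = false;
  for (const raw of robotsTxt.split("\n")) {
    const line = raw.split("#")[0].trim(); // strip comments
    const [key, ...rest] = line.split(":");
    const value = rest.join(":").trim();
    if (/^user-agent$/i.test(key)) {
      inStarGroup = value === "*";
    } else if (inStarGroup && /^disallow$/i.test(key) && value && path.startsWith(value)) {
      return false;
    }
  }
  return true;
}
```

Libraries like `robots-parser` add the parts this sketch omits: per-agent groups, `Allow` precedence, wildcards, and `Crawl-delay`.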
@@ -0,0 +1,13 @@
+ import { McpServer } from 'tmcp';
+ import * as v from 'valibot';
+ import { Database } from './storage/db.js';
+ import { SearchEngine } from './storage/search.js';
+ import { LibraryService } from './services/library.js';
+ import { JobManager } from './jobs/manager.js';
+ import { EventBus } from './jobs/events.js';
+ export declare const db: Database;
+ export declare const eventBus: EventBus;
+ export declare const searchEngine: SearchEngine;
+ export declare const jobManager: JobManager;
+ export declare const libraryService: LibraryService;
+ export declare const server: McpServer<v.GenericSchema, undefined>;
@@ -0,0 +1,17 @@
+ import type { Database } from '../storage/db.js';
+ import type { JobManager } from '../jobs/manager.js';
+ import type { Library } from '../types.js';
+ export declare class LibraryService {
+     private db;
+     private jobManager;
+     constructor(db: Database, jobManager: JobManager);
+     /** Add a new documentation library and start crawling */
+     add(opts: {
+         url: string;
+         name?: string;
+         version?: string;
+         maxDepth?: number;
+     }): Promise<Library & {
+         jobId: string;
+     }>;
+ }
@@ -0,0 +1,57 @@
+ import { Database as BunDatabase } from 'bun:sqlite';
+ import type { Library, Page, CrawlJob } from '../types.js';
+ export declare class Database {
+     private db;
+     init(): void;
+     /** Expose raw DB for search engine direct queries */
+     raw(): BunDatabase;
+     private migrate;
+     addLibrary(lib: {
+         id: string;
+         name: string;
+         displayName: string;
+         url: string;
+         version?: string;
+         crawlConfig?: object;
+     }): import("bun:sqlite").Changes;
+     listLibraries(status?: string): Library[];
+     getLibraryByName(name: string): Library | undefined;
+     getLibraryById(id: string): Library | undefined;
+     removeLibrary(id: string): import("bun:sqlite").Changes;
+     updateLibraryStatus(id: string, status: string): import("bun:sqlite").Changes;
+     updateLibraryStats(id: string, pageCount: number, chunkCount: number): import("bun:sqlite").Changes;
+     upsertPage(page: {
+         id: string;
+         libraryId: string;
+         url: string;
+         path: string;
+         title: string;
+         contentMarkdown: string;
+         contentHash: string;
+         headings: object[];
+     }): string;
+     getPage(opts: {
+         url?: string;
+         library?: string;
+         path?: string;
+     }): Page | undefined;
+     getPagesByLibrary(libraryId: string): Page[];
+     insertChunks(chunks: Array<{
+         id: string;
+         pageId: string;
+         libraryId: string;
+         content: string;
+         headingContext: string;
+         chunkIndex: number;
+         tokenCount: number;
+         hasCodeBlock: boolean;
+     }>): void;
+     deleteChunksByPage(pageId: string): void;
+     createJob(job: {
+         id: string;
+         libraryId: string;
+     }): CrawlJob;
+     getJob(id: string): CrawlJob | undefined;
+     updateJob(id: string, updates: Partial<Pick<CrawlJob, 'status' | 'pages_discovered' | 'pages_crawled' | 'pages_failed' | 'chunks_created' | 'error_message' | 'started_at' | 'completed_at'>>): void;
+     listJobs(libraryId?: string): CrawlJob[];
+ }
@@ -0,0 +1,21 @@
+ import type { Database } from './db.js';
+ export interface SearchResult {
+     content: string;
+     heading_context: string;
+     page_url: string;
+     page_title: string;
+     library_name: string;
+     library_display_name: string;
+     relevance_score: number;
+     has_code_block: boolean;
+     token_count: number;
+ }
+ export declare class SearchEngine {
+     private db;
+     constructor(db: Database);
+     search(query: string, opts?: {
+         library?: string;
+         limit?: number;
+     }): SearchResult[];
+     private sanitizeQuery;
+ }
@@ -0,0 +1,25 @@
+ import * as v from 'valibot';
+ import type { LibraryService } from '../services/library.js';
+ export declare function createAddLibraryTool(libraryService: LibraryService): {
+     definition: {
+         name: "add_library";
+         description: string;
+         schema: v.ObjectSchema<{
+             readonly url: v.SchemaWithPipe<readonly [v.StringSchema<undefined>, v.UrlAction<string, undefined>, v.DescriptionAction<string, "The base URL of the documentation website to crawl.">]>;
+             readonly name: v.OptionalSchema<v.SchemaWithPipe<readonly [v.StringSchema<undefined>, v.DescriptionAction<string, "A short identifier for the library (e.g., \"svelte-5\"). Auto-generated from URL if omitted.">]>, undefined>;
+             readonly version: v.OptionalSchema<v.SchemaWithPipe<readonly [v.StringSchema<undefined>, v.DescriptionAction<string, "Version string (e.g., \"5.0.0\", \"v4\").">]>, undefined>;
+             readonly max_depth: v.OptionalSchema<v.SchemaWithPipe<readonly [v.NumberSchema<undefined>, v.IntegerAction<number, undefined>, v.MinValueAction<number, 1, undefined>, v.MaxValueAction<number, 10, undefined>, v.DescriptionAction<number, "Maximum link depth to crawl. Default: 3.">]>, 3>;
+         }, undefined>;
+     };
+     handler: ({ url, name, version, max_depth, }: {
+         url: string;
+         name?: string;
+         version?: string;
+         max_depth?: number;
+     }) => Promise<{
+         content: {
+             type: "text";
+             text: string;
+         }[];
+     }>;
+ };
@@ -0,0 +1,23 @@
+ import * as v from 'valibot';
+ import type { Database } from '../storage/db.js';
+ export declare function createGetDocPageTool(db: Database): {
+     definition: {
+         name: "get_doc_page";
+         description: string;
+         schema: v.ObjectSchema<{
+             readonly url: v.OptionalSchema<v.SchemaWithPipe<readonly [v.StringSchema<undefined>, v.DescriptionAction<string, "The full URL of the documentation page.">]>, undefined>;
+             readonly library: v.OptionalSchema<v.SchemaWithPipe<readonly [v.StringSchema<undefined>, v.DescriptionAction<string, "Library name to search within.">]>, undefined>;
+             readonly path: v.OptionalSchema<v.SchemaWithPipe<readonly [v.StringSchema<undefined>, v.DescriptionAction<string, "Relative path within the library (e.g., \"/getting-started\").">]>, undefined>;
+         }, undefined>;
+     };
+     handler: ({ url, library, path }: {
+         url?: string;
+         library?: string;
+         path?: string;
+     }) => Promise<{
+         content: {
+             type: "text";
+             text: string;
+         }[];
+     }>;
+ };
@@ -0,0 +1,19 @@
+ import * as v from 'valibot';
+ import type { Database } from '../storage/db.js';
+ export declare function createListLibrariesTool(db: Database): {
+     definition: {
+         name: "list_libraries";
+         description: string;
+         schema: v.ObjectSchema<{
+             readonly status: v.OptionalSchema<v.SchemaWithPipe<readonly [v.PicklistSchema<["indexed", "crawling", "error", "all"], undefined>, v.DescriptionAction<"crawling" | "indexed" | "error" | "all", "Filter by indexing status. Default: \"all\".">]>, "all">;
+         }, undefined>;
+     };
+     handler: ({ status }: {
+         status?: string;
+     }) => Promise<{
+         content: {
+             type: "text";
+             text: string;
+         }[];
+     }>;
+ };
@@ -0,0 +1,20 @@
+ import * as v from 'valibot';
+ import type { JobManager } from '../jobs/manager.js';
+ import type { Database } from '../storage/db.js';
+ export declare function createRefreshLibraryTool(jobManager: JobManager, db: Database): {
+     definition: {
+         name: "refresh_library";
+         description: string;
+         schema: v.ObjectSchema<{
+             readonly library: v.SchemaWithPipe<readonly [v.StringSchema<undefined>, v.DescriptionAction<string, "The library name to refresh (e.g., \"svelte-5\").">]>;
+         }, undefined>;
+     };
+     handler: ({ library }: {
+         library: string;
+     }) => Promise<{
+         content: {
+             type: "text";
+             text: string;
+         }[];
+     }>;
+ };
@@ -0,0 +1,19 @@
+ import * as v from 'valibot';
+ import type { Database } from '../storage/db.js';
+ export declare function createRemoveLibraryTool(db: Database): {
+     definition: {
+         name: "remove_library";
+         description: string;
+         schema: v.ObjectSchema<{
+             readonly library: v.SchemaWithPipe<readonly [v.StringSchema<undefined>, v.DescriptionAction<string, "The library name to remove (e.g., \"svelte-5\").">]>;
+         }, undefined>;
+     };
+     handler: ({ library }: {
+         library: string;
+     }) => Promise<{
+         content: {
+             type: "text";
+             text: string;
+         }[];
+     }>;
+ };
@@ -0,0 +1,23 @@
+ import * as v from 'valibot';
+ import type { SearchEngine } from '../storage/search.js';
+ export declare function createSearchDocsTool(searchEngine: SearchEngine): {
+     definition: {
+         name: "search_docs";
+         description: string;
+         schema: v.ObjectSchema<{
+             readonly query: v.SchemaWithPipe<readonly [v.StringSchema<undefined>, v.DescriptionAction<string, "The search query. Use natural language or specific terms.">]>;
+             readonly library: v.OptionalSchema<v.SchemaWithPipe<readonly [v.StringSchema<undefined>, v.DescriptionAction<string, "Filter results to a specific library name.">]>, undefined>;
+             readonly limit: v.OptionalSchema<v.SchemaWithPipe<readonly [v.NumberSchema<undefined>, v.IntegerAction<number, undefined>, v.MinValueAction<number, 1, undefined>, v.MaxValueAction<number, 20, undefined>, v.DescriptionAction<number, "Max results to return. Default: 5.">]>, 5>;
+         }, undefined>;
+     };
+     handler: ({ query, library, limit }: {
+         query: string;
+         library?: string;
+         limit?: number;
+     }) => Promise<{
+         content: {
+             type: "text";
+             text: string;
+         }[];
+     }>;
+ };
@@ -0,0 +1,71 @@
+ export interface Library {
+     id: string;
+     name: string;
+     display_name: string;
+     url: string;
+     version: string | null;
+     description: string | null;
+     status: 'pending' | 'crawling' | 'indexed' | 'error';
+     page_count: number;
+     chunk_count: number;
+     crawl_config: string | null;
+     last_crawled_at: string | null;
+     created_at: string;
+     updated_at: string;
+ }
+ export interface Page {
+     id: string;
+     library_id: string;
+     url: string;
+     path: string;
+     title: string | null;
+     content_markdown: string | null;
+     content_hash: string | null;
+     headings: string | null;
+     http_status: number | null;
+     last_modified: string | null;
+     etag: string | null;
+     created_at: string;
+     updated_at: string;
+ }
+ export interface ChunkRecord {
+     id: string;
+     page_id: string;
+     library_id: string;
+     content: string;
+     heading_context: string;
+     chunk_index: number;
+     token_count: number;
+     has_code_block: number;
+     created_at: string;
+ }
+ export interface CrawlJob {
+     id: string;
+     library_id: string;
+     status: 'queued' | 'running' | 'completed' | 'failed' | 'cancelled';
+     pages_discovered: number;
+     pages_crawled: number;
+     pages_failed: number;
+     chunks_created: number;
+     error_message: string | null;
+     started_at: string | null;
+     completed_at: string | null;
+     created_at: string;
+ }
+ export interface FetchResult {
+     html: string;
+     renderer: 'fetch' | 'puppeteer';
+     status: number;
+     etag?: string | null;
+     lastModified?: string | null;
+     unchanged?: boolean;
+ }
+ export interface CrawlConfig {
+     renderer?: 'auto' | 'fetch' | 'puppeteer';
+     maxDepth?: number;
+     includePatterns?: string[];
+     excludePatterns?: string[];
+     rateLimit?: number;
+     waitForSelector?: string;
+     waitTimeout?: number;
+ }
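To make the `CrawlConfig` shape concrete, here is a hypothetical configuration literal. Field meanings are inferred from the names in the declaration above, and the values are purely illustrative, not the package's defaults:

```typescript
// Hypothetical crawl configuration for a JS-rendered docs site.
// Values are illustrative; field semantics inferred from the CrawlConfig declaration.
const crawlConfig = {
  renderer: "auto",              // try plain fetch first, upgrade to puppeteer for SPA shells
  maxDepth: 3,                   // follow links up to three hops from the base URL
  includePatterns: ["/docs/*"],  // only crawl documentation paths
  excludePatterns: ["/blog/*"],  // skip non-documentation sections
  rateLimit: 500,                // presumably the delay between requests, in ms
};
```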
@@ -0,0 +1 @@
+ export declare const VERSION = "0.1.5";
package/package.json ADDED
@@ -0,0 +1,65 @@
+ {
+   "name": "docshark",
+   "version": "0.1.5",
+   "description": "🦈 Documentation MCP Server — scrape, index, and search any doc website",
+   "type": "module",
+   "main": "./dist/index.js",
+   "module": "./dist/index.js",
+   "types": "./dist/index.d.ts",
+   "exports": {
+     ".": {
+       "import": "./dist/index.js",
+       "types": "./dist/index.d.ts"
+     }
+   },
+   "bin": {
+     "docshark": "./dist/cli.js"
+   },
+   "files": [
+     "dist",
+     "README.md",
+     "LICENSE",
+     "CHANGELOG.md"
+   ],
+   "scripts": {
+     "start": "bun run src/cli.ts start",
+     "dev": "bun run --watch src/cli.ts start",
+     "cli": "bun run src/cli.ts",
+     "check": "tsc --noEmit",
+     "build": "rm -rf dist && bun build ./src/cli.ts ./src/index.ts --outdir ./dist --target node --external '*' && tsc --emitDeclarationOnly",
+     "prepublishOnly": "bun run build",
+     "test:crawl": "bun run src/cli.ts add https://svelte.dev/docs/svelte/overview"
+   },
+   "keywords": [
+     "tmcp",
+     "mcp",
+     "documentation",
+     "search",
+     "ai",
+     "scraper"
+   ],
+   "dependencies": {
+     "@mozilla/readability": "^0.6.0",
+     "@tmcp/adapter-valibot": "^0.1.5",
+     "@tmcp/transport-http": "^0.8.4",
+     "@tmcp/transport-sse": "^0.5.3",
+     "@tmcp/transport-stdio": "^0.4.1",
+     "cheerio": "^1.2.0",
+     "commander": "^14.0.3",
+     "linkedom": "^0.18.12",
+     "nanoid": "^5.1.6",
+     "puppeteer-core": "^24.37.5",
+     "robots-parser": "^3.0.1",
+     "srvx": "^0.11.8",
+     "tmcp": "^1.19.2",
+     "turndown": "^7.2.2",
+     "turndown-plugin-gfm": "^1.0.2",
+     "valibot": "^1.2.0"
+   },
+   "devDependencies": {
+     "@types/bun": "^1.3.9",
+     "@types/node": "^25.3.3",
+     "@types/turndown": "^5.0.6",
+     "typescript": "^5.9.3"
+   }
+ }