docshark 0.1.12 → 0.1.16
- package/CHANGELOG.md +9 -0
- package/dist/cli-update.d.ts +10 -0
- package/dist/cli-update.js +186 -0
- package/dist/cli.js +254 -42
- package/dist/server.d.ts +7 -7
- package/dist/server.js +123 -113
- package/dist/services/library.d.ts +8 -3
- package/dist/services/library.js +42 -12
- package/dist/storage/db.d.ts +4 -3
- package/dist/storage/db.js +45 -24
- package/dist/tools/list-libraries.d.ts +1 -1
- package/dist/version.d.ts +1 -1
- package/dist/version.js +6 -2
- package/package.json +2 -2
- package/LICENSE +0 -21
- package/README.md +0 -167
package/LICENSE
DELETED
@@ -1,21 +0,0 @@
-MIT License
-
-Copyright (c) 2026 Michael-Obele
-
-Permission is hereby granted, free of charge, to any person obtaining a copy
-of this software and associated documentation files (the "Software"), to deal
-in the Software without restriction, including without limitation the rights
-to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-copies of the Software, and to permit persons to whom the Software is
-furnished to do so, subject to the following conditions:
-
-The above copyright notice and this permission notice shall be included in all
-copies or substantial portions of the Software.
-
-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-SOFTWARE.
package/README.md
DELETED
@@ -1,167 +0,0 @@
-# 🦈 DocShark
-
-[](https://bun.sh/)
-[](https://www.npmjs.com/package/docshark)
-[](https://modelcontextprotocol.io/)
-[](https://github.com/Michael-Obele/docshark/releases)
-[](https://opensource.org/licenses/MIT)
-
-**DocShark** is a powerful MCP (Model Context Protocol) server designed to scrape, index, and search any documentation website. It creates a local, highly-searchable knowledge base from public documentation pages using FTS5 (Full-Text Search) and BM25 ranking, allowing AI assistants to query the latest docs effortlessly.
-
----
-
-## 🚀 Features
-
-- **Automated Crawling**: Discovers pages via `sitemap.xml` with fallback to BFS link crawling.
-- **Smart Extraction**: Uses Readability and Turndown to extract main content and convert it to clean Markdown, filtering out navbars and sidebars.
-- **Semantic Chunking**: Splits content based on headings, preserving contextual headers for better AI understanding.
-- **High-Performance Search**: Built-in SQLite + FTS5 indexing with BM25 ranking for accurate and lightning-fast search results.
-- **JS-Rendered Site Support**: Tiered fetching strategy automatically detects React/Vue SPAs (empty shells) and upgrades to `puppeteer-core` if you have it installed (zero-config, auto-fallback).
-- **Polite Crawling**: Respects `robots.txt` and implements rate limiting to prevent overloading documentation servers.
-- **Standard MCP Tooling**: Connect perfectly with Desktop Claude, VS Code, Cursor, and any other MCP-compatible clients via standard `stdio` or `http`/`sse` transports.
-
-## 📦 What We Have Done (Phase 1)
-
-**Phase 1: Core Engine** is fully implemented and tested.
-
-- ✅ Custom SQLite Database with FTS5 virtual tables and auto-sync triggers.
-- ✅ Web scraping engine supporting standard `fetch()` and `puppeteer-core`.
-- ✅ Markdown processor utilizing Readability + Turndown.
-- ✅ Heading-based semantic chunker (500-1200 tokens per chunk).
-- ✅ Asynchronous job manager and queue system.
-- ✅ Complete HTTP API (REST endpoints + SSE event streams).
-- ✅ Seamless integration of 6 MCP tools: `add_library`, `search_docs`, `list_libraries`, `get_doc_page`, `refresh_library`, and `remove_library`.
-- ✅ Robust CLI interface (`start`, `add`, `search`, `list`).
-
-## 🏗️ What We Are Doing
-
-We are actively polishing the integration between the core engine and external MCP clients (like VS Code Agents and Claude Desktop).
-
-## 🔮 What We Plan To Do (Phase 2 & Beyond)
-
-- **Web Dashboard**: An intuitive SvelteKit dashboard to manage your synced libraries, view crawl progress in real-time (via SSE), and test searches manually.
-- **Incremental Crawling**: Smarter `refresh` jobs that compare `ETag` and `Last-Modified` headers to only re-scrape updated pages.
-- **Vector Search (RAG)**: Integration of lightweight vector embeddings for semantic similarity search alongside the existing FTS5 keyword search.
-- **Advanced Scraping Setup**: Support for custom CSS selectors to define exactly where content lives in non-standard documentation websites.
-
----
-
-## 🛠️ Usage
-
-### Quick Start (from npm)
-
-You can run DocShark directly without installing it globally using `bunx`:
-
-```bash
-# Add a documentation library to the index
-bunx docshark add https://valibot.dev/guides/ --depth 2
-
-# Search your indexed docs
-bunx docshark search "schema validation"
-```
-
-### Installation
-
-To install DocShark globally as a CLI tool:
-
-```bash
-# Using npm
-npm install -g docshark
-
-# Using Bun
-bun add -g docshark
-```
-
-After installation, you can use the `docshark` command:
-
-```bash
-docshark list
-```
-
-## 🔌 MCP Integration
-
-### VS Code (GitHub Copilot / MCP Extension)
-
-Add DocShark to your `.vscode/settings.json` or global MCP configuration:
-
-```json
-{
-  "mcpServers": {
-    "docshark": {
-      "command": "bunx",
-      "args": ["-y", "docshark", "start", "--stdio"]
-    }
-  }
-}
-```
-
-### Cursor
-
-1. Open **Cursor Settings** > **Models** > **MCP**.
-2. Click **+ Add New MCP Server**.
-3. Name: `docshark`
-4. Type: `command`
-5. Command: `bunx -y docshark start --stdio`
-
-### Claude Desktop
-
-Edit your Claude Desktop configuration file:
-
-- **macOS**: `~/Library/Application Support/Claude/claude_desktop_config.json`
-- **Windows**: `%APPDATA%\Claude\claude_desktop_config.json`
-
-```json
-{
-  "mcpServers": {
-    "docshark": {
-      "command": "bunx",
-      "args": ["-y", "docshark", "start", "--stdio"]
-    }
-  }
-}
-```
-
----
-
-## 🛠️ Development
-
-### Local Setup
-
-Ensure you have [Bun](https://bun.sh/) installed.
-
-```bash
-# Clone the repository
-git clone https://github.com/Michael-Obele/docshark.git
-cd docshark
-
-# Install dependencies
-bun install
-
-# (Optional) Enable auto-detection & scraping of Javascript React/Vue single-page apps
-bun add puppeteer-core
-
-# Start the DocShark MCP server in HTTP mode for local testing
-bun run src/cli.ts start --port 6380
-```
-
-### Local CLI Debugging
-
-```bash
-# Run CLI directly while developing
-bun run src/cli.ts list
-```
-
-## 🔄 Versioning & Changelog
-
-This project uses [Google's Release Please](https://github.com/googleapis/release-please) to automate versioning and changelog generation.
-
-- **Semantic Versioning**: Our versions automatically bump (e.g. `0.0.1` -> `0.0.2` or `0.1.0`) based on standard Conventional Commits (`feat:`, `fix:`, `chore:`, etc.).
-- **Automated**: A PR is automatically created on `master` when standard commits are merged, generating a standard `CHANGELOG.md`.
-
-## 📜 License
-
-This project is open-source and available under the [MIT License](LICENSE).
-
----
-
-_Built to empower AI agents with the latest knowledge._
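The deleted README describes a heading-based semantic chunker that keeps contextual headers with each chunk. The package's actual chunker is not part of this diff, so the following is only a minimal sketch of the general technique. The `Chunk` type and the `chunkByHeadings` function are hypothetical names, and the sketch does not enforce the 500-1200 token range the README mentions.

```typescript
// Hypothetical sketch of heading-based chunking; NOT DocShark's implementation.
interface Chunk {
  headingPath: string[]; // contextual headers, outermost first
  text: string;          // body text found under the innermost heading
}

function chunkByHeadings(markdown: string): Chunk[] {
  const chunks: Chunk[] = [];
  const path: string[] = []; // heading stack; index = heading level - 1
  let body: string[] = [];
  let inFence = false;

  // Emit the accumulated body as one chunk under the current heading path.
  const flush = (): void => {
    const text = body.join("\n").trim();
    if (text.length > 0) {
      chunks.push({ headingPath: path.filter(Boolean), text });
    }
    body = [];
  };

  for (const line of markdown.split("\n")) {
    // Track fenced code so "# comment" lines inside fences are not headings.
    if (line.trimStart().startsWith("```")) inFence = !inFence;
    const match = inFence ? null : /^(#{1,6})\s+(.*)$/.exec(line);
    if (match) {
      flush();
      const level = match[1].length;
      path.length = level - 1; // drop headings at the same or deeper level
      path[level - 1] = match[2].trim();
    } else {
      body.push(line);
    }
  }
  flush();
  return chunks;
}
```

Each chunk carries its full heading path, so a search hit under "Installation" can still be shown with its parent "Usage" heading. A real chunker would additionally split or merge chunks to stay within a size budget.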