@arabold/docs-mcp-server 1.6.0 → 1.8.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -4,12 +4,12 @@ A MCP server for fetching and searching 3rd party package documentation.
 
 ## ✨ Key Features
 
- - 🌐 **Scrape & Index:** Fetch documentation from web sources or local files.
- - 🧠 **Smart Processing:** Utilize semantic splitting and OpenAI embeddings for meaningful content chunks.
- - 💾 **Efficient Storage:** Store data in SQLite, leveraging `sqlite-vec` for vector search and FTS5 for full-text search.
- - 🔍 **Hybrid Search:** Combine vector and full-text search for relevant results across different library versions.
- - ⚙️ **Job Management:** Handle scraping tasks asynchronously with a robust job queue and management tools (MCP & CLI).
- - 🐳 **Easy Deployment:** Run the server easily using Docker or npx.
+ - 🌐 **Versatile Scraping:** Fetch documentation from diverse sources like websites, GitHub, npm, PyPI, or local files.
+ - 🧠 **Intelligent Processing:** Automatically split content semantically and generate embeddings using your choice of models (OpenAI, Google Gemini, Azure OpenAI, AWS Bedrock, Ollama, and more).
+ - 💾 **Optimized Storage:** Leverage SQLite with `sqlite-vec` for efficient vector storage and FTS5 for robust full-text search.
+ - 🔍 **Powerful Hybrid Search:** Combine vector similarity and full-text search across different library versions for highly relevant results.
+ - ⚙️ **Asynchronous Job Handling:** Manage scraping and indexing tasks efficiently with a background job queue and MCP/CLI tools.
+ - 🐳 **Simple Deployment:** Get up and running quickly using Docker or npx.
 
 ## Overview
 
@@ -25,17 +25,47 @@ The server exposes MCP tools for:
 - Listing indexed libraries (`list_libraries`).
 - Finding appropriate versions (`find_version`).
 - Removing indexed documents (`remove_docs`).
+ - Fetching a single URL and returning its content as Markdown (`fetch_url`).
 
 ## Configuration
 
- The following environment variables are supported to configure the OpenAI API and embedding behavior:
+ The following environment variables configure the embedding model behavior:
 
- - `OPENAI_API_KEY`: **Required.** Your OpenAI API key for generating embeddings.
- - `OPENAI_ORG_ID`: **Optional.** Your OpenAI Organization ID (handled automatically by LangChain if set).
- - `OPENAI_API_BASE`: **Optional.** Custom base URL for OpenAI API (e.g., for Azure OpenAI or compatible APIs).
- - `DOCS_MCP_EMBEDDING_MODEL`: **Optional.** Embedding model name (defaults to "text-embedding-3-small"). Must produce vectors with ≤1536 dimensions. Smaller dimensions are automatically padded with zeros.
+ ### Embedding Model Configuration
 
- The database schema uses a fixed dimension of 1536 for embedding vectors. Models that produce larger vectors are not supported and will cause an error. Models with smaller vectors (e.g., older embedding models) are automatically padded with zeros to match the required dimension.
+ - `DOCS_MCP_EMBEDDING_MODEL`: **Optional.** Format: `provider:model_name` or just `model_name` (defaults to `text-embedding-3-small`). Supported providers and their required environment variables are listed below, followed by example settings.
+
+   - `openai` (default): Uses OpenAI's embedding models
+
+     - `OPENAI_API_KEY`: **Required.** Your OpenAI API key
+     - `OPENAI_ORG_ID`: **Optional.** Your OpenAI Organization ID
+     - `OPENAI_API_BASE`: **Optional.** Custom base URL for OpenAI-compatible APIs (e.g., Ollama, Azure OpenAI)
+
+   - `vertex`: Uses Google Cloud Vertex AI embeddings
+
+     - `GOOGLE_APPLICATION_CREDENTIALS`: **Required.** Path to service account JSON key file
+
+   - `gemini`: Uses Google Generative AI (Gemini) embeddings
+
+     - `GOOGLE_API_KEY`: **Required.** Your Google API key
+
+   - `aws`: Uses AWS Bedrock embeddings
+
+     - `AWS_ACCESS_KEY_ID`: **Required.** AWS access key
+     - `AWS_SECRET_ACCESS_KEY`: **Required.** AWS secret key
+     - `AWS_REGION` or `BEDROCK_AWS_REGION`: **Required.** AWS region for Bedrock
+
+   - `microsoft`: Uses Azure OpenAI embeddings
+
+     - `AZURE_OPENAI_API_KEY`: **Required.** Azure OpenAI API key
+     - `AZURE_OPENAI_API_INSTANCE_NAME`: **Required.** Azure instance name
+     - `AZURE_OPENAI_API_DEPLOYMENT_NAME`: **Required.** Azure deployment name
+     - `AZURE_OPENAI_API_VERSION`: **Required.** Azure API version
+
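+ For example, to select a provider and model (model names are illustrative):
+
+ ```bash
+ # Default `openai` provider (no prefix needed)
+ export DOCS_MCP_EMBEDDING_MODEL="text-embedding-3-small"
+ # Explicit provider prefix
+ export DOCS_MCP_EMBEDDING_MODEL="vertex:text-embedding-004"
+ export DOCS_MCP_EMBEDDING_MODEL="aws:amazon.titan-embed-text-v1"
+ ```
+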
+ ### Vector Dimensions
+
+ The database schema uses a fixed dimension of 1536 for embedding vectors. Only models that produce vectors with dimension ≤ 1536 are supported, unless the provider (like Gemini) supports reducing the output dimension to fit.
+
+ For OpenAI-compatible APIs (like Ollama), use the `openai` provider with `OPENAI_API_BASE` pointing to your endpoint.
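+
+ For example, a minimal sketch for a local Ollama endpoint (the model name is illustrative and must already be pulled in Ollama):
+
+ ```bash
+ export OPENAI_API_KEY="ollama"  # required by the provider; Ollama itself ignores the value
+ export OPENAI_API_BASE="http://localhost:11434/v1"
+ export DOCS_MCP_EMBEDDING_MODEL="nomic-embed-text"
+ ```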
 
 These variables can be set regardless of how you run the server (Docker, npx, or from source).
 
@@ -92,10 +122,54 @@ This is the recommended approach for most users. It's easy, straightforward, and
 Any of the configuration environment variables (see [Configuration](#configuration) above) can be passed to the container using the `-e` flag. For example:
 
 ```bash
+ # Example 1: Using OpenAI embeddings (default)
 docker run -i --rm \
   -e OPENAI_API_KEY="your-key-here" \
-   -e DOCS_MCP_EMBEDDING_MODEL="text-embedding-3-large" \
-   -e OPENAI_API_BASE="http://your-api-endpoint" \
+   -e DOCS_MCP_EMBEDDING_MODEL="text-embedding-3-small" \
+   -v docs-mcp-data:/data \
+   ghcr.io/arabold/docs-mcp-server:latest
+
+ # Example 2: Using an OpenAI-compatible API (like Ollama)
+ docker run -i --rm \
+   -e OPENAI_API_KEY="your-key-here" \
+   -e OPENAI_API_BASE="http://localhost:11434/v1" \
+   -e DOCS_MCP_EMBEDDING_MODEL="embeddings" \
+   -v docs-mcp-data:/data \
+   ghcr.io/arabold/docs-mcp-server:latest
+
+ # Example 3a: Using Google Cloud Vertex AI embeddings
+ # (OPENAI_API_KEY is kept as a fallback to OpenAI)
+ docker run -i --rm \
+   -e OPENAI_API_KEY="your-openai-key" \
+   -e DOCS_MCP_EMBEDDING_MODEL="vertex:text-embedding-004" \
+   -e GOOGLE_APPLICATION_CREDENTIALS="/app/gcp-key.json" \
+   -v docs-mcp-data:/data \
+   -v /path/to/gcp-key.json:/app/gcp-key.json:ro \
+   ghcr.io/arabold/docs-mcp-server:latest
+
+ # Example 3b: Using Google Generative AI (Gemini) embeddings
+ # (OPENAI_API_KEY is kept as a fallback to OpenAI)
+ docker run -i --rm \
+   -e OPENAI_API_KEY="your-openai-key" \
+   -e DOCS_MCP_EMBEDDING_MODEL="gemini:embedding-001" \
+   -e GOOGLE_API_KEY="your-google-api-key" \
+   -v docs-mcp-data:/data \
+   ghcr.io/arabold/docs-mcp-server:latest
+
+ # Example 4: Using AWS Bedrock embeddings
+ docker run -i --rm \
+   -e AWS_ACCESS_KEY_ID="your-aws-key" \
+   -e AWS_SECRET_ACCESS_KEY="your-aws-secret" \
+   -e AWS_REGION="us-east-1" \
+   -e DOCS_MCP_EMBEDDING_MODEL="aws:amazon.titan-embed-text-v1" \
+   -v docs-mcp-data:/data \
+   ghcr.io/arabold/docs-mcp-server:latest
+
+ # Example 5: Using Azure OpenAI embeddings
+ docker run -i --rm \
+   -e AZURE_OPENAI_API_KEY="your-azure-key" \
+   -e AZURE_OPENAI_API_INSTANCE_NAME="your-instance" \
+   -e AZURE_OPENAI_API_DEPLOYMENT_NAME="your-deployment" \
+   -e AZURE_OPENAI_API_VERSION="2024-02-01" \
+   -e DOCS_MCP_EMBEDDING_MODEL="microsoft:text-embedding-ada-002" \
   -v docs-mcp-data:/data \
   ghcr.io/arabold/docs-mcp-server:latest
 ```
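 
 Note that inside Docker, `localhost` refers to the container itself. To reach a service such as Ollama running on the host machine, a sketch (assumes Docker Desktop's `host.docker.internal`; the `--add-host` flag makes it work on Linux too):
 
 ```bash
 docker run -i --rm \
   --add-host=host.docker.internal:host-gateway \
   -e OPENAI_API_KEY="your-key-here" \
   -e OPENAI_API_BASE="http://host.docker.internal:11434/v1" \
   -e DOCS_MCP_EMBEDDING_MODEL="embeddings" \
   -v docs-mcp-data:/data \
   ghcr.io/arabold/docs-mcp-server:latest
 ```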
@@ -177,11 +251,31 @@ npx -y --package=@arabold/docs-mcp-server docs-cli --help
 ```bash
 docs-cli scrape --help
 docs-cli search --help
+ docs-cli fetch-url --help
 docs-cli find-version --help
 docs-cli remove --help
 docs-cli list --help
 ```
 
+ ### Fetching Single URLs (`fetch-url`)
+
+ Fetches a single URL and converts its content to Markdown. Unlike `scrape`, this command does not crawl links or store the content.
+
+ ```bash
+ docs-cli fetch-url <url> [options]
+ ```
+
+ **Options:**
+
+ - `--no-follow-redirects`: Disable following HTTP redirects (default: follow redirects)
+
+ **Examples:**
+
+ ```bash
+ # Fetch a URL and convert to Markdown
+ docs-cli fetch-url https://example.com/page.html
+ ```
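+
+ A usage sketch combining the option above (URL is illustrative):
+
+ ```bash
+ # Fetch without following HTTP redirects
+ docs-cli fetch-url https://example.com/page.html --no-follow-redirects
+ ```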
+
 ### Scraping Documentation (`scrape`)
 
 Scrapes and indexes documentation from a given URL for a specific library.
@@ -398,13 +492,14 @@ This project uses [semantic-release](https://github.com/semantic-release/semanti
 **How it works:**
 
 1. **Commit Messages:** All commits merged into the `main` branch **must** follow the Conventional Commits specification (see the example after this list).
- 2. **Automation:** The "Release" GitHub Actions workflow automatically runs `semantic-release` on pushes to `main`.
+ 2. **Manual Trigger:** The "Release" GitHub Actions workflow can be triggered manually from the Actions tab when you're ready to create a new release.
 3. **`semantic-release` Actions:** Determines version, updates `CHANGELOG.md` & `package.json`, commits, tags, publishes to npm, and creates a GitHub Release.
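 
 For example, illustrative commit messages and the release type `semantic-release` infers from each under its default rules:
 
 ```bash
 git commit -m "feat: add Gemini embedding provider"   # minor release
 git commit -m "fix: follow redirects in fetch-url"    # patch release
 git commit -m "docs: clarify Docker configuration"    # no release
 ```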
 
 **What you need to do:**
 
 - Use Conventional Commits.
- - Merge to `main`.
+ - Merge changes to `main`.
+ - When you're ready, trigger a release manually from the Actions tab in GitHub.
 
 **Automation handles:** Changelog, version bumps, tags, npm publish, GitHub releases.