@gleanwork/mcp-server-tester 0.12.0 → 1.0.0-beta.0

package/README.md CHANGED
@@ -4,161 +4,50 @@
4
4
  [![npm version](https://img.shields.io/npm/v/@gleanwork/mcp-server-tester)](https://www.npmjs.com/package/@gleanwork/mcp-server-tester)
5
5
  [![CI](https://github.com/gleanwork/mcp-server-tester/actions/workflows/ci.yml/badge.svg)](https://github.com/gleanwork/mcp-server-tester/actions/workflows/ci.yml)
6
6
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
7
- [![Node.js Version](https://img.shields.io/node/v/@gleanwork/mcp-server-tester)](https://nodejs.org)
8
- [![TypeScript](https://img.shields.io/badge/TypeScript-5.7-blue.svg)](https://www.typescriptlang.org/)
9
7
 
10
- > Playwright-based testing framework for MCP servers
8
+ A testing and evaluation framework for [Model Context Protocol (MCP)](https://modelcontextprotocol.io) servers. Write deterministic Playwright tests against your MCP tools, or run data-driven eval datasets — including LLM-based evaluation of tool discoverability.
11
9
 
12
- > [!WARNING]
13
- > **Experimental Project** - This library is in active development. APIs may change, and we welcome contributions, feedback, and collaboration as we evolve the framework. See [CONTRIBUTING.md](./CONTRIBUTING.md) for details.
10
+ ## Playwright Tests
14
11
 
15
- `@gleanwork/mcp-server-tester` is a comprehensive testing and evaluation framework for [Model Context Protocol (MCP)](https://modelcontextprotocol.io) servers. It provides first-class Playwright fixtures, data-driven eval datasets, and optional LLM-as-a-judge scoring.
16
-
17
- ## What's Included
18
-
19
- This framework provides **two complementary approaches** for testing MCP servers:
20
-
21
- ### 1. **Automated Testing** (Playwright Tests)
22
-
23
- Write deterministic, automated tests using standard Playwright patterns with MCP-specific fixtures. Perfect for:
24
-
25
- - Direct tool calls with expected outputs
26
- - Protocol conformance validation
27
- - Integration testing with your MCP server
28
- - CI/CD pipelines
12
+ The `mcp` Playwright fixture connects to your MCP server (stdio or HTTP) and exposes a high-level API for calling tools and asserting responses. Custom matchers keep assertions readable.
29
13
 
30
14
  ```typescript
31
- test('read a file', async ({ mcp }) => {
32
- const result = await mcp.callTool('read_file', { path: '/tmp/test.txt' });
33
- expect(result.content).toContain('Hello');
34
- });
35
- ```
36
-
37
- ### 2. **Evaluation Datasets** (Evals) ⚠️ Experimental
38
-
39
- Run deeper, more subjective analysis using dataset-driven evaluations. Includes:
40
-
41
- - Schema validation (deterministic)
42
- - Text and regex pattern matching (deterministic)
43
- - LLM-as-a-judge scoring (non-deterministic)
44
-
45
- **Note:** Evals, particularly those using LLM-as-a-judge, are highly experimental due to their non-deterministic nature. Results may vary between runs, and prompts may need tuning for your specific use case.
46
-
47
- ```typescript
48
- const result = await runEvalDataset({ dataset, expectations }, { mcp });
49
- expect(result.passed).toBe(result.total);
50
- ```
51
-
52
- ## Features
53
-
54
- - 🎭 **Playwright Integration** - Use MCP servers in Playwright tests with idiomatic fixtures
55
- - 📊 **Matrix Evals** - Run dataset-driven evaluations across multiple transports
56
- - 📸 **Snapshot Testing** - Capture and compare deterministic responses with optional sanitizers for variable data
57
- - 🤖 **LLM-as-a-Judge** - Optional semantic evaluation using Anthropic Claude
58
- - 🔌 **Multiple Transports** - Support for both stdio (local) and HTTP (remote) connections
59
- - ✅ **Protocol Conformance** - Built-in checks for MCP spec compliance
60
-
61
- ## Installation
62
-
63
- ```bash
64
- npm install --save-dev @gleanwork/mcp-server-tester @playwright/test zod
65
- ```
66
-
67
- **Note:** The Anthropic SDK is optional and only needed if you plan to use LLM-as-a-judge semantic evaluation:
68
-
69
- ```bash
70
- npm install --save-dev @anthropic-ai/sdk
71
- ```
72
-
73
- ## Quick Start
74
-
75
- ### Initialize with CLI
76
-
77
- The fastest way to get started:
78
-
79
- ```bash
80
- npx mcp-server-tester init
81
-
82
- # Follow the interactive prompts to create:
83
- # - playwright.config.ts (configured for your MCP server)
84
- # - tests/mcp.spec.ts (example tests)
85
- # - data/example-dataset.json (sample eval dataset)
86
- # - package.json (with all dependencies)
87
- ```
88
-
89
- See the [CLI Guide](./docs/cli.md) for all options.
90
-
91
- ### Example: Testing in Action
92
-
93
- Here's what a complete test suite looks like (following the **layered testing pattern**):
94
-
95
- ```typescript
96
- // tests/mcp.spec.ts
97
15
  import { test, expect } from '@gleanwork/mcp-server-tester/fixtures/mcp';
98
- import {
99
- loadEvalDataset,
100
- runEvalDataset,
101
- createSchemaExpectation,
102
- } from '@gleanwork/mcp-server-tester';
103
- import { z } from 'zod';
104
-
105
- // Layer 1: MCP Protocol Conformance
106
- test.describe('MCP Protocol Conformance', () => {
107
- test('should return valid server info', async ({ mcp }) => {
108
- const info = mcp.getServerInfo();
109
- expect(info).toBeTruthy();
110
- expect(info?.name).toBeTruthy();
111
- expect(info?.version).toBeTruthy();
112
- });
113
-
114
- test('should list available tools', async ({ mcp }) => {
115
- const tools = await mcp.listTools();
116
- expect(Array.isArray(tools)).toBe(true);
117
- expect(tools.length).toBeGreaterThan(0);
118
- });
119
16
 
120
- test('should handle invalid tool gracefully', async ({ mcp }) => {
121
- const result = await mcp.callTool('nonexistent_tool', {});
122
- expect(result.isError).toBe(true);
123
- });
17
+ test('read_file returns file contents', async ({ mcp }) => {
18
+ const result = await mcp.callTool('read_file', { path: '/tmp/test.txt' });
19
+ expect(result).toContainToolText('Hello, world');
20
+ expect(result).not.toBeToolError();
124
21
  });
125
22
 
126
- // Layer 2: Direct Tool Testing
127
- test.describe('File Operations', () => {
128
- test('should read a file', async ({ mcp }) => {
129
- const result = await mcp.callTool('read_file', {
130
- path: '/tmp/test.txt',
131
- });
132
- expect(result.content).toContain('Hello');
133
- });
23
+ test('server exposes required tools', async ({ mcp }) => {
24
+ const tools = await mcp.listTools();
25
+ expect(tools.map((t) => t.name)).toContain('read_file');
134
26
  });
27
+ ```
135
28
 
136
- // Layer 3: Eval Datasets
137
- test('file operations eval', async ({ mcp }) => {
138
- const FileContentSchema = z.object({
139
- content: z.string(),
140
- });
29
+ Playwright tests are fast, deterministic, and designed for CI. Use them for regression testing, schema validation, and protocol conformance. The framework includes built-in conformance checks for the MCP spec.
141
30
 
142
- const dataset = await loadEvalDataset('./data/evals.json', {
143
- schemas: { 'file-content': FileContentSchema },
144
- });
31
+ Available matchers:
145
32
 
146
- const result = await runEvalDataset(
147
- {
148
- dataset,
149
- expectations: {
150
- schema: createSchemaExpectation(dataset),
151
- },
152
- },
153
- { mcp }
154
- );
33
+ | Matcher | Description |
34
+ | ------------------------ | ----------------------------------------------- |
35
+ | `toContainToolText` | Response contains expected substrings |
36
+ | `toMatchToolSchema` | Response validates against a Zod schema |
37
+ | `toMatchToolPattern` | Response matches a regex pattern |
38
+ | `toMatchToolSnapshot` | Response matches a saved baseline |
39
+ | `toBeToolError` | Response is (or is not) an error |
40
+ | `toHaveToolResponseSize` | Response size is within bounds |
41
+ | `toSatisfyToolPredicate` | Response satisfies a custom function |
42
+ | `toHaveToolCalls` | LLM called the expected tools |
43
+ | `toHaveToolCallCount` | LLM made N tool calls |
44
+ | `toPassToolJudge` | LLM evaluates response quality against a rubric |
155
45
 
156
- expect(result.passed).toBe(result.total);
157
- });
158
- ```
46
+ ## Eval Datasets
47
+
48
+ Eval datasets let you define test cases as JSON files and run them with `runEvalDataset()`. Each case specifies a tool call and one or more assertions.
159
49
 
160
50
  ```json
161
- // data/evals.json
162
51
  {
163
52
  "name": "file-ops",
164
53
  "cases": [
@@ -166,256 +55,150 @@ test('file operations eval', async ({ mcp }) => {
166
55
  "id": "read-config",
167
56
  "toolName": "read_file",
168
57
  "args": { "path": "/tmp/config.json" },
169
- "expectedSchemaName": "file-content"
58
+ "expect": {
59
+ "schema": "file-content",
60
+ "containsText": ["version", "name"]
61
+ }
170
62
  },
171
63
  {
172
64
  "id": "read-readme",
173
65
  "toolName": "read_file",
174
66
  "args": { "path": "/tmp/README.md" },
175
- "expectedTextContains": ["# Welcome", "## Installation"]
67
+ "expect": {
68
+ "snapshot": "readme-snapshot"
69
+ }
176
70
  }
177
71
  ]
178
72
  }
179
73
  ```
180
74
 
181
75
  ```typescript
182
- // playwright.config.ts
183
- import { defineConfig } from '@playwright/test';
76
+ import { test, expect } from '@gleanwork/mcp-server-tester/fixtures/mcp';
77
+ import { loadEvalDataset, runEvalDataset } from '@gleanwork/mcp-server-tester';
78
+ import { z } from 'zod';
184
79
 
185
- export default defineConfig({
186
- testDir: './tests',
187
- projects: [
188
- {
189
- name: 'mcp-local',
190
- use: {
191
- mcpConfig: {
192
- transport: 'stdio',
193
- command: 'node',
194
- args: ['path/to/your/server.js'],
195
- },
196
- },
197
- },
198
- ],
80
+ test('file operations eval', async ({ mcp }, testInfo) => {
81
+ const dataset = await loadEvalDataset('./data/evals.json', {
82
+ schemas: { 'file-content': z.object({ content: z.string() }) },
83
+ });
84
+ const result = await runEvalDataset({ dataset }, { mcp, testInfo });
85
+ expect(result.passed).toBe(result.total);
199
86
  });
200
87
  ```
201
88
 
202
- ## Documentation
89
+ Supported assertion types:
203
90
 
204
- - **[Quick Start Guide](./docs/quickstart.md)** - Detailed setup and configuration
205
- - **[Expectations](./docs/expectations.md)** - All validation types (exact, schema, regex, text contains, snapshot, LLM judge)
206
- - **[API Reference](./docs/api-reference.md)** - Complete API documentation
207
- - **[CLI Commands](./docs/cli.md)** - `init`, `generate`, `login`, and `token` command details
208
- - **[UI Reporter](./docs/ui-reporter.md)** - Interactive web UI for test results
209
- - **[Transports](./docs/transports.md)** - Stdio vs HTTP configuration
210
- - **[Development](./docs/development.md)** - Contributing and building
91
+ | Type | Description |
92
+ | ---------------- | ----------------------------------------------- |
93
+ | `containsText` | Response includes expected substrings |
94
+ | `schema` | Response validates against a Zod schema |
95
+ | `regex` | Response matches a pattern |
96
+ | `snapshot` | Response matches a saved baseline |
97
+ | `judge` | LLM evaluates response quality against a rubric |
98
+ | `toolsTriggered` | LLM called the expected tools (LLM host mode) |
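
To make the two simplest assertion types in the table concrete, here is a minimal sketch of what a `containsText` check and a `regex` check amount to when applied to a response's text. The helper names `containsAllText` and `matchesPattern` and the sample response are hypothetical and written for illustration only; they are not the library's implementation, which may normalize or extract the response text differently.

```typescript
// Illustrative only: hypothetical helpers, not part of @gleanwork/mcp-server-tester.

// containsText: every expected substring must appear somewhere in the response text.
function containsAllText(responseText: string, expected: string[]): boolean {
  return expected.every((s) => responseText.includes(s));
}

// regex: the response text must match the given pattern at least once.
function matchesPattern(responseText: string, pattern: string): boolean {
  return new RegExp(pattern).test(responseText);
}

// Sample response text, invented for this sketch.
const responseText = '{ "name": "demo-app", "version": "1.2.3" }';

const hasFields = containsAllText(responseText, ['version', 'name']);
const hasSemver = matchesPattern(responseText, '"version":\\s*"\\d+\\.\\d+\\.\\d+"');

console.log({ hasFields, hasSemver });
```

Substring checks are the least brittle option for prose or markdown output, while a pattern is more useful when a value's shape matters but its exact content does not.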
211
99
 
212
- ## Examples
100
+ ### LLM host mode
213
101
 
214
- The `examples/` directory contains complete working examples:
215
-
216
- **Real MCP Server Tests:**
217
-
218
- - [`filesystem-server/`](./examples/filesystem-server) - Test suite for Anthropic's Filesystem MCP server
219
- - Demonstrates `fixturify-project` for isolated test fixtures
220
- - Zod schema validation for JSON files
221
- - 5 Playwright tests, 11 eval dataset cases
222
-
223
- - [`sqlite-server/`](./examples/sqlite-server) - Test suite for SQLite MCP server
224
- - Demonstrates `better-sqlite3` for database testing
225
- - Custom expectations for record count validation
226
- - 11 Playwright tests, 14 eval dataset cases
227
-
228
- **Basic Patterns:**
229
-
230
- - [`basic-playwright-usage/`](./examples/basic-playwright-usage) - Simple Playwright test patterns
231
-
232
- Each example includes complete test suites, eval datasets, and npm scripts. See [`examples/README.md`](./examples/README.md) for detailed documentation.
233
-
234
- ## Key Concepts
235
-
236
- ### Fixtures
237
-
238
- Access MCP servers in tests via Playwright fixtures:
239
-
240
- - `mcpClient: Client` - Raw MCP SDK client
241
- - `mcp: MCPFixtureApi` - High-level test API with helper methods
242
-
243
- ### Expectations
244
-
245
- Validate tool responses with multiple expectation types:
246
-
247
- - **Exact Match** - Structured JSON equality
248
- - **Schema** - Zod validation
249
- - **Text Contains** - Substring matching (great for markdown)
250
- - **Regex** - Pattern matching
251
- - **LLM Judge** - Semantic evaluation
252
-
253
- See [Expectations Guide](./docs/expectations.md) for details.
254
-
255
- ### Transports
256
-
257
- Connect to MCP servers via:
258
-
259
- - **stdio** - Local server processes
260
- - **HTTP** - Remote servers
261
-
262
- See [Transports Guide](./docs/transports.md) for configuration.
263
-
264
- ### Snapshot Testing
265
-
266
- Snapshot testing captures tool responses and compares them against stored baselines. This works best for **deterministic responses** like help text, configuration, or schema discovery.
267
-
268
- > **Note:** For responses with timestamps, IDs, or live data, use [sanitizers](./docs/expectations.md#snapshot-sanitizers) to normalize variable content, or consider schema validation instead.
269
-
270
- ```bash
271
- # Generate dataset with snapshot expectations
272
- npx mcp-server-tester generate --snapshot -o data/evals.json
273
-
274
- # First run captures snapshots
275
- npx playwright test
276
-
277
- # Update snapshots when server behavior changes
278
- npx playwright test --update-snapshots
279
- ```
280
-
281
- For responses with variable data, use sanitizers:
102
+ In LLM host mode, a real LLM receives your server's tool list and a natural language prompt, then decides which tools to call. This tests whether your tool names, descriptions, and input schemas are clear enough for autonomous use — a different question from whether the tools return correct output.
282
103
 
283
104
  ```json
284
105
  {
285
- "id": "get-user",
286
- "toolName": "get_user",
287
- "args": { "id": "123" },
288
- "expectedSnapshot": "user-profile",
289
- "snapshotSanitizers": ["uuid", "iso-date", { "remove": ["lastLoginAt"] }]
106
+ "id": "find-config",
107
+ "mode": "llm_host",
108
+ "scenario": "Find the application config file and return its contents",
109
+ "llmHostConfig": {
110
+ "provider": "anthropic",
111
+ "model": "claude-opus-4-20250514"
112
+ },
113
+ "expect": {
114
+ "toolsTriggered": {
115
+ "calls": [{ "name": "read_file", "required": true }]
116
+ }
117
+ }
290
118
  }
291
119
  ```
292
120
 
293
- See the [Expectations Guide](./docs/expectations.md#snapshot-testing) for when to use snapshots vs other validation methods.
294
-
295
- ## CLI OAuth Authentication
121
+ LLM host mode makes real API calls and produces non-deterministic results. Use `iterations` to run a case multiple times and measure pass rate rather than expecting 100% on a single run. See the [LLM Host Guide](docs/llm-host.md) for configuration and cost management.
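
The pass-rate idea above can be sketched in a few lines. This is a standalone illustration, assuming each iteration yields a simple pass/fail outcome; the `IterationOutcome` type, the `passRate` and `meetsThreshold` helpers, and the 0.8 threshold are all hypothetical and are not the library's result format.

```typescript
// Illustrative only: hypothetical types and helpers, not the library's API.
type IterationOutcome = { iteration: number; passed: boolean };

// Fraction of iterations that passed; an empty run counts as 0.
function passRate(outcomes: IterationOutcome[]): number {
  if (outcomes.length === 0) return 0;
  const passed = outcomes.filter((o) => o.passed).length;
  return passed / outcomes.length;
}

// Accept a non-deterministic case when it passes often enough.
function meetsThreshold(outcomes: IterationOutcome[], minRate: number): boolean {
  return passRate(outcomes) >= minRate;
}

// Invented outcomes for five iterations of one case: four passes, one failure.
const outcomes: IterationOutcome[] = [
  { iteration: 1, passed: true },
  { iteration: 2, passed: true },
  { iteration: 3, passed: false },
  { iteration: 4, passed: true },
  { iteration: 5, passed: true },
];

console.log(passRate(outcomes), meetsThreshold(outcomes, 0.8));
```

Asserting on a threshold rather than on every run keeps a single unlucky model response from failing the suite, at the cost of more API calls per case.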
296
122
 
297
- For MCP servers that require OAuth authentication, the framework provides a CLI-based OAuth flow:
123
+ ## Installation
298
124
 
299
- ### Interactive Login
125
+ Requires Node.js 22+.
300
126
 
301
127
  ```bash
302
- # Authenticate with an MCP server (opens browser)
303
- npx mcp-server-tester login https://api.example.com/mcp
304
-
305
- # Force re-authentication
306
- npx mcp-server-tester login https://api.example.com/mcp --force
128
+ npm install --save-dev @gleanwork/mcp-server-tester @playwright/test zod
307
129
  ```
308
130
 
309
- ### Token Storage
131
+ The Anthropic SDK is only needed for LLM-as-judge assertions or LLM host mode with the Anthropic provider:
310
132
 
311
- Tokens are cached locally and automatically refreshed when expired.
312
-
313
- **Storage locations:**
133
+ ```bash
134
+ npm install --save-dev @anthropic-ai/sdk
135
+ ```
314
136
 
315
- - **Linux**: `$XDG_STATE_HOME/mcp-tests/<server-key>/` or `~/.local/state/mcp-tests/<server-key>/`
316
- - **macOS**: `~/.local/state/mcp-tests/<server-key>/`
317
- - **Windows**: `%LOCALAPPDATA%\mcp-tests\<server-key>\`
137
+ ## Quick Start
318
138
 
319
- **Security:**
139
+ ```bash
140
+ npx mcp-server-tester init
141
+ ```
320
142
 
321
- - Directory permissions: `0700` (owner only)
322
- - File permissions: `0600` (owner read/write only)
323
- - Files stored: `tokens.json`, `client.json`, `server.json`
143
+ The CLI wizard creates a `playwright.config.ts`, example tests, and a sample eval dataset configured for your server. See the [CLI Guide](./docs/cli.md) for all options.
324
144
 
325
- Use `--state-dir` to override the storage location.
145
+ ## Configuration
326
146
 
327
- ### Programmatic Usage
147
+ Point the framework at your MCP server in `playwright.config.ts`:
328
148
 
329
149
  ```typescript
330
- import { CLIOAuthClient } from '@gleanwork/mcp-server-tester';
150
+ import { defineConfig } from '@playwright/test';
331
151
 
332
- const client = new CLIOAuthClient({
333
- mcpServerUrl: 'https://api.example.com/mcp',
152
+ export default defineConfig({
153
+ testDir: './tests',
154
+ reporter: [['list'], ['@gleanwork/mcp-server-tester/reporters/mcpReporter']],
155
+ projects: [
156
+ {
157
+ name: 'my-server',
158
+ use: {
159
+ mcpConfig: {
160
+ transport: 'stdio',
161
+ command: 'node',
162
+ args: ['server.js'],
163
+ },
164
+ },
165
+ },
166
+ ],
334
167
  });
335
-
336
- // Get a valid access token (cached, refreshed, or new)
337
- const result = await client.getAccessToken();
338
- console.log(`Token: ${result.accessToken}`);
339
- ```
340
-
341
- ### CI/CD Usage (GitHub Actions)
342
-
343
- For automated testing in CI, tokens can be provided via environment variables:
344
-
345
- ```yaml
346
- # .github/workflows/mcp-tests.yml
347
- jobs:
348
- test:
349
- runs-on: ubuntu-latest
350
- env:
351
- MCP_ACCESS_TOKEN: ${{ secrets.MCP_ACCESS_TOKEN }}
352
- MCP_REFRESH_TOKEN: ${{ secrets.MCP_REFRESH_TOKEN }}
353
- steps:
354
- - uses: actions/checkout@v4
355
- - run: npm ci
356
- - run: npm run test:playwright
357
168
  ```
358
169
 
359
- **To set up GitHub Actions secrets:**
360
-
361
- 1. Authenticate locally: `npx mcp-server-tester login <server-url>`
362
- 2. Export tokens for GitHub: `npx mcp-server-tester token <server-url> --format gh`
363
- 3. Run the output `gh secret set` commands (requires [GitHub CLI](https://cli.github.com/))
364
-
365
- The `token` command supports multiple formats:
170
+ For HTTP servers, set `transport: 'http'` and `serverUrl`. For servers that require OAuth, see the [Transports Guide](./docs/transports.md) and [CLI Guide](./docs/cli.md) for authentication setup, including CI/CD token management.
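
As a rough sketch of the HTTP variant, the project entry might look like the object below. Only `transport: 'http'` and `serverUrl` come from the sentence above, and the stdio fields mirror the config shown earlier; the local types, the project name, the URL, and the `describeTarget` helper are assumptions made so the snippet is self-contained. The real `mcpConfig` type may accept further fields (for example for authentication), so treat the Transports Guide as the authority.

```typescript
// Illustrative only: hypothetical local types so this sketch type-checks on its own.
type StdioConfig = { transport: 'stdio'; command: string; args: string[] };
type HttpConfig = { transport: 'http'; serverUrl: string };
type McpConfig = StdioConfig | HttpConfig;

interface ProjectSketch {
  name: string;
  use: { mcpConfig: McpConfig };
}

// A project entry that targets a remote server over HTTP (placeholder URL).
const remoteProject: ProjectSketch = {
  name: 'my-remote-server',
  use: {
    mcpConfig: {
      transport: 'http',
      serverUrl: 'https://api.example.com/mcp',
    },
  },
};

// Hypothetical helper: a human-readable description of what a config connects to.
function describeTarget(config: McpConfig): string {
  return config.transport === 'http'
    ? config.serverUrl
    : [config.command, ...config.args].join(' ');
}

console.log(describeTarget(remoteProject.use.mcpConfig));
```

In an actual `playwright.config.ts`, an entry like `remoteProject` would sit in the `projects` array next to, or instead of, the stdio project.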
366
171
 
367
- - `env` (default) - Shell-compatible `KEY=value` pairs
368
- - `json` - JSON object for scripting
369
- - `gh` - Ready-to-paste GitHub CLI commands
370
-
371
- See the [CLI Guide](./docs/cli.md#token---export-tokens-for-cicd) for details.
372
-
373
- Alternatively, inject tokens programmatically in your test setup:
374
-
375
- ```typescript
376
- import { injectTokens } from '@gleanwork/mcp-server-tester';
377
-
378
- // In globalSetup.ts
379
- await injectTokens('https://api.example.com/mcp', {
380
- accessToken: process.env.MCP_ACCESS_TOKEN!,
381
- tokenType: 'Bearer',
382
- });
383
- ```
172
+ ## Documentation
384
173
 
385
- ## UI Reporter
174
+ - [Quick Start](./docs/quickstart.md) — detailed setup and configuration
175
+ - [Expectations](./docs/expectations.md) — all assertion types including snapshot sanitizers
176
+ - [LLM Host Simulation](docs/llm-host.md) — tool discoverability testing
177
+ - [API Reference](./docs/api-reference.md)
178
+ - [Transports](./docs/transports.md) — stdio and HTTP configuration, OAuth
179
+ - [CLI Commands](./docs/cli.md) — init, generate, login, token
180
+ - [UI Reporter](./docs/ui-reporter.md) — interactive web UI for test results
181
+ - [Development](./docs/development.md) — contributing and building
386
182
 
387
- Interactive web UI for visualizing test results:
183
+ ## Examples
388
184
 
389
- ![MCP Test Reporter UI](./ui.png)
185
+ The `examples/` directory contains complete working examples:
390
186
 
391
- Add to your `playwright.config.ts`:
187
+ - [filesystem-server/](./examples/filesystem-server) Test suite for Anthropic's Filesystem MCP server: 5 Playwright tests, 11 eval dataset cases, Zod schema validation.
188
+ - [sqlite-server/](./examples/sqlite-server) — Test suite for a SQLite MCP server: 11 Playwright tests, 14 eval dataset cases.
189
+ - [basic-playwright-usage/](./examples/basic-playwright-usage) — Minimal Playwright patterns.
392
190
 
393
- ```typescript
394
- export default defineConfig({
395
- reporter: [['list'], ['@gleanwork/mcp-server-tester/reporters/mcpReporter']],
396
- });
397
- ```
191
+ ## Known Limitations
398
192
 
399
- See [UI Reporter Guide](./docs/ui-reporter.md) for features and usage.
193
+ These MCP protocol features are not currently supported. These are deliberate scope decisions, not bugs:
400
194
 
401
- ## Support
195
+ - MCP resources (`listResources`, `readResource`)
196
+ - MCP prompts (`listPrompts`, `getPrompt`)
197
+ - Server-to-client notifications
198
+ - Streaming tool responses (`callTool` waits for the complete response)
402
199
 
403
- - **Documentation**: See [`docs/`](./docs) directory
404
- - **Examples**: See [`examples/`](./examples) directory
405
- - **Issues**: [GitHub Issues](https://github.com/gleanwork/mcp-server-tester/issues)
200
+ If any of these affect your use case, please open an issue.
406
201
 
407
202
  ## License
408
203
 
409
204
  MIT
410
-
411
- ## Contributing
412
-
413
- Contributions welcome! See [Development Guide](./docs/development.md) for setup instructions.
414
-
415
- ## Credits
416
-
417
- Built with:
418
-
419
- - [@modelcontextprotocol/sdk](https://github.com/modelcontextprotocol/typescript-sdk)
420
- - [@playwright/test](https://playwright.dev)
421
- - [Zod](https://zod.dev)