@kevinrabun/judges 1.0.1 → 1.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -2,132 +2,220 @@
 
  An MCP (Model Context Protocol) server that provides a panel of **18 specialized judges** to evaluate AI-generated code — acting as an independent quality gate regardless of which project is being reviewed.
 
- ## The Judge Panel
+ [![CI](https://github.com/KevinRabun/judges/actions/workflows/ci.yml/badge.svg)](https://github.com/KevinRabun/judges/actions/workflows/ci.yml)
+ [![npm](https://img.shields.io/npm/v/@kevinrabun/judges)](https://www.npmjs.com/package/@kevinrabun/judges)
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
 
- | Judge | Domain | Rule Prefix | What It Evaluates |
- |-------|--------|-------------|-------------------|
- | **Judge Data Security** | Data Security & Privacy | `DATA-` | Encryption, PII handling, secrets management, access controls, GDPR/CCPA/HIPAA compliance |
- | **Judge Cybersecurity** | Cybersecurity & Threat Defense | `CYBER-` | Injection attacks, XSS, CSRF, auth flaws, dependency CVEs, OWASP Top 10 |
- | **Judge Cost Effectiveness** | Cost Optimization | `COST-` | Algorithm efficiency, N+1 queries, memory waste, caching strategy, cloud spend |
- | **Judge Scalability** | Scalability & Performance | `SCALE-` | Statelessness, horizontal scaling, concurrency, bottlenecks, rate limiting |
- | **Judge Cloud Readiness** | Cloud-Native & DevOps | `CLOUD-` | 12-Factor compliance, containerization, observability, graceful shutdown, IaC |
- | **Judge Software Practices** | Engineering Best Practices | `SWDEV-` | SOLID principles, type safety, error handling, testing, input validation, clean code |
- | **Judge Accessibility** | Accessibility (a11y) | `A11Y-` | WCAG compliance, screen reader support, keyboard navigation, ARIA, color contrast |
- | **Judge API Design** | API Design & Contracts | `API-` | REST conventions, versioning, pagination, error responses, consistency |
- | **Judge Reliability** | Reliability & Resilience | `REL-` | Error handling, timeouts, retries, circuit breakers, graceful degradation |
- | **Judge Observability** | Observability & Monitoring | `OBS-` | Structured logging, health checks, metrics, tracing, correlation IDs |
- | **Judge Performance** | Performance & Efficiency | `PERF-` | N+1 queries, sync I/O, caching, memory leaks, algorithmic complexity |
- | **Judge Compliance** | Regulatory Compliance | `COMP-` | GDPR/CCPA, PII protection, consent, data retention, audit trails |
- | **Judge Testing** | Testing & Quality Assurance | `TEST-` | Test coverage, assertions, test isolation, naming, external dependencies |
- | **Judge Documentation** | Documentation & Readability | `DOC-` | JSDoc/docstrings, magic numbers, TODOs, code comments, module docs |
- | **Judge Internationalization** | Internationalization (i18n) | `I18N-` | Hardcoded strings, locale handling, currency formatting, RTL support |
- | **Judge Dependency Health** | Dependency Management | `DEPS-` | Version pinning, deprecated packages, supply chain, import hygiene |
- | **Judge Concurrency** | Concurrency & Async Safety | `CONC-` | Race conditions, unbounded parallelism, missing await, resource cleanup |
- | **Judge Ethics & Bias** | Ethics & Bias | `ETHICS-` | Demographic logic, explainability, dark patterns, inclusive language |
+ ---
 
- ## How It Works
+ ## Quick Start
 
- The tribunal operates in two modes:
+ ### 1. Install and Build
 
- 1. **Pattern-Based Analysis (Tools)** — The `evaluate_code` and `evaluate_code_single_judge` tools perform heuristic analysis using pattern matching to catch common anti-patterns. This works entirely offline with zero external API calls.
-
- 2. **LLM-Powered Deep Analysis (Prompts)** — The server exposes MCP prompts (`judge-data-security`, `judge-cybersecurity`, etc., and `full-tribunal`) that provide each judge's expert persona as a system prompt. When used by an LLM-based client, this enables much deeper, context-aware analysis.
-
- ## MCP Tools
+ ```bash
+ git clone https://github.com/KevinRabun/judges.git
+ cd judges
+ npm install
+ npm run build
+ ```
 
- ### `get_judges`
- List all available judges with their domains and descriptions.
+ ### 2. Try the Demo
 
- ### `evaluate_code`
- Submit code to the **full judges panel**. All 18 judges evaluate independently and return a combined verdict.
+ Run the included demo to see all 18 judges evaluate a purposely flawed API server:
 
- **Parameters:**
- - `code` (string, required) — The source code to evaluate
- - `language` (string, required) — Programming language (e.g., "typescript", "python")
- - `context` (string, optional) — Additional context about the code
+ ```bash
+ npm run demo
+ ```
 
- **Returns:** Combined verdict with overall score, per-judge scores, all findings, and recommendations.
+ This evaluates [`examples/sample-vulnerable-api.ts`](examples/sample-vulnerable-api.ts) — a file intentionally packed with security holes, performance anti-patterns, and code quality issues — and prints a full verdict with per-judge scores and findings.
 
- ### `evaluate_code_single_judge`
- Submit code to a **specific judge** for targeted review.
+ **What you'll see:**
 
- **Parameters:**
- - `code` (string, required) — The source code to evaluate
- - `language` (string, required) Programming language
- - `judgeId` (string, required) — One of: `data-security`, `cybersecurity`, `cost-effectiveness`, `scalability`, `cloud-readiness`, `software-practices`, `accessibility`, `api-design`, `reliability`, `observability`, `performance`, `compliance`, `testing`, `documentation`, `internationalization`, `dependency-health`, `concurrency`, `ethics-bias`
- - `context` (string, optional) — Additional context
-
- ## MCP Prompts
+ ```
+ ╔══════════════════════════════════════════════════════════════╗
+ ║               Judges Panel Full Tribunal Demo                ║
+ ╚══════════════════════════════════════════════════════════════╝
+
+ Overall Verdict : FAIL
+ Overall Score   : 61/100
+ Critical Issues : 15
+ High Issues     : 17
+ Total Findings  : 81
+ Judges Run      : 18
+
+ Per-Judge Breakdown:
+ ────────────────────────────────────────────────────────────────
+ ❌ Judge Data Security             0/100   7 finding(s)
+ ❌ Judge Cybersecurity            24/100   6 finding(s)
+ ⚠️ Judge Cost Effectiveness       70/100   5 finding(s)
+ ⚠️ Judge Scalability              79/100   4 finding(s)
+ ❌ Judge Cloud Readiness          77/100   4 finding(s)
+ ⚠️ Judge Software Practices       73/100   5 finding(s)
+ ❌ Judge Accessibility            28/100   8 finding(s)
+ ❌ Judge API Design               35/100   9 finding(s)
+ ⚠️ Judge Reliability              70/100   3 finding(s)
+ ❌ Judge Observability            65/100   5 finding(s)
+ ❌ Judge Performance              53/100   5 finding(s)
+ ❌ Judge Compliance               34/100   4 finding(s)
+ ✅ Judge Testing                  94/100   1 finding(s)
+ ✅ Judge Documentation            82/100   4 finding(s)
+ ✅ Judge Internationalization     79/100   4 finding(s)
+ ✅ Judge Dependency Health        94/100   1 finding(s)
+ ⚠️ Judge Concurrency              64/100   4 finding(s)
+ ❌ Judge Ethics & Bias            77/100   2 finding(s)
+ ```
 
- - `judge-data-security` Deep data security review via LLM
- - `judge-cybersecurity` — Deep cybersecurity review via LLM
- - `judge-cost-effectiveness` — Deep cost optimization review via LLM
- - `judge-scalability` — Deep scalability review via LLM
- - `judge-cloud-readiness` — Deep cloud readiness review via LLM
- - `judge-software-practices` — Deep software practices review via LLM
- - `judge-accessibility` — Deep accessibility/WCAG review via LLM
- - `judge-api-design` — Deep API design review via LLM
- - `judge-reliability` — Deep reliability & resilience review via LLM
- - `judge-observability` — Deep observability & monitoring review via LLM
- - `judge-performance` — Deep performance optimization review via LLM
- - `judge-compliance` — Deep regulatory compliance review via LLM
- - `judge-testing` — Deep testing quality review via LLM
- - `judge-documentation` — Deep documentation quality review via LLM
- - `judge-internationalization` — Deep i18n review via LLM
- - `judge-dependency-health` — Deep dependency health review via LLM
- - `judge-concurrency` — Deep concurrency & async safety review via LLM
- - `judge-ethics-bias` — Deep ethics & bias review via LLM
- - `full-tribunal` — All 18 judges via LLM in a single prompt
-
- ## Setup
-
- ### Build
+ ### 3. Run the Tests
 
  ```bash
- npm install
- npm run build
+ npm test
  ```
 
- ### Configure in VS Code (GitHub Copilot / Claude Desktop)
+ Runs 184 automated tests covering all 18 judges, markdown formatters, and edge cases.
+
+ ### 4. Connect to Your Editor
+
+ Add the Judges Panel as an MCP server so your AI coding assistant can use it automatically.
 
- Add to your MCP settings (`.vscode/mcp.json`, `claude_desktop_config.json`, etc.):
+ **VS Code**: create `.vscode/mcp.json` in your project:
 
  ```json
  {
-   "mcpServers": {
+   "servers": {
      "judges": {
        "command": "node",
-       "args": ["<path-to>/judges/dist/index.js"]
+       "args": ["/absolute/path/to/judges/dist/index.js"]
      }
    }
  }
  ```
 
- ### Configure in VS Code Settings (settings.json)
+ **Claude Desktop**: add to `claude_desktop_config.json`:
 
  ```json
  {
-   "mcp": {
-     "servers": {
-       "judges": {
-         "command": "node",
-         "args": ["<path-to>/judges/dist/index.js"]
-       }
+   "mcpServers": {
+     "judges": {
+       "command": "node",
+       "args": ["/absolute/path/to/judges/dist/index.js"]
      }
    }
  }
  ```
 
+ **Or install from npm** instead of cloning:
+
+ ```bash
+ npm install -g @kevinrabun/judges
+ ```
+
+ Then use `judges` as the command in your MCP config (no `args` needed).
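For example, after the global install a minimal VS Code entry could look like the following (an illustrative sketch based on the two configs above, not a file shipped with the package):

```json
{
  "servers": {
    "judges": {
      "command": "judges"
    }
  }
}
```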
+
+ ---
+
+ ## The Judge Panel
+
+ | Judge | Domain | Rule Prefix | What It Evaluates |
+ |-------|--------|-------------|-------------------|
+ | **Data Security** | Data Security & Privacy | `DATA-` | Encryption, PII handling, secrets management, access controls |
+ | **Cybersecurity** | Cybersecurity & Threat Defense | `CYBER-` | Injection attacks, XSS, CSRF, auth flaws, OWASP Top 10 |
+ | **Cost Effectiveness** | Cost Optimization | `COST-` | Algorithm efficiency, N+1 queries, memory waste, caching strategy |
+ | **Scalability** | Scalability & Performance | `SCALE-` | Statelessness, horizontal scaling, concurrency, bottlenecks |
+ | **Cloud Readiness** | Cloud-Native & DevOps | `CLOUD-` | 12-Factor compliance, containerization, graceful shutdown, IaC |
+ | **Software Practices** | Engineering Best Practices | `SWDEV-` | SOLID principles, type safety, error handling, input validation |
+ | **Accessibility** | Accessibility (a11y) | `A11Y-` | WCAG compliance, screen reader support, keyboard navigation, ARIA |
+ | **API Design** | API Design & Contracts | `API-` | REST conventions, versioning, pagination, error responses |
+ | **Reliability** | Reliability & Resilience | `REL-` | Error handling, timeouts, retries, circuit breakers |
+ | **Observability** | Observability & Monitoring | `OBS-` | Structured logging, health checks, metrics, tracing |
+ | **Performance** | Performance & Efficiency | `PERF-` | N+1 queries, sync I/O, caching, memory leaks |
+ | **Compliance** | Regulatory Compliance | `COMP-` | GDPR/CCPA, PII protection, consent, data retention, audit trails |
+ | **Testing** | Testing & Quality Assurance | `TEST-` | Test coverage, assertions, test isolation, naming |
+ | **Documentation** | Documentation & Readability | `DOC-` | JSDoc/docstrings, magic numbers, TODOs, code comments |
+ | **Internationalization** | Internationalization (i18n) | `I18N-` | Hardcoded strings, locale handling, currency formatting |
+ | **Dependency Health** | Dependency Management | `DEPS-` | Version pinning, deprecated packages, supply chain |
+ | **Concurrency** | Concurrency & Async Safety | `CONC-` | Race conditions, unbounded parallelism, missing await |
+ | **Ethics & Bias** | Ethics & Bias | `ETHICS-` | Demographic logic, dark patterns, inclusive language |
+
+ ---
+
+ ## How It Works
+
+ The tribunal operates in two modes:
+
+ 1. **Pattern-Based Analysis (Tools)** — The `evaluate_code` and `evaluate_code_single_judge` tools perform heuristic analysis using pattern matching to catch common anti-patterns. This works entirely offline with zero external API calls.
+
+ 2. **LLM-Powered Deep Analysis (Prompts)** — The server exposes MCP prompts (e.g., `judge-data-security`, `full-tribunal`) that provide each judge's expert persona as a system prompt. When used by an LLM-based client, this enables deeper, context-aware analysis beyond what pattern matching can detect.
+
+ ---
+
+ ## MCP Tools
+
+ ### `get_judges`
+ List all available judges with their domains and descriptions.
+
+ ### `evaluate_code`
+ Submit code to the **full judges panel**. All 18 judges evaluate independently and return a combined verdict.
+
+ | Parameter | Type | Required | Description |
+ |-----------|------|----------|-------------|
+ | `code` | string | yes | The source code to evaluate |
+ | `language` | string | yes | Programming language (e.g., `typescript`, `python`) |
+ | `context` | string | no | Additional context about the code |
+
+ ### `evaluate_code_single_judge`
+ Submit code to a **specific judge** for targeted review.
+
+ | Parameter | Type | Required | Description |
+ |-----------|------|----------|-------------|
+ | `code` | string | yes | The source code to evaluate |
+ | `language` | string | yes | Programming language |
+ | `judgeId` | string | yes | See [judge IDs](#judge-ids) below |
+ | `context` | string | no | Additional context |
+
+ #### Judge IDs
+
+ `data-security` · `cybersecurity` · `cost-effectiveness` · `scalability` · `cloud-readiness` · `software-practices` · `accessibility` · `api-design` · `reliability` · `observability` · `performance` · `compliance` · `testing` · `documentation` · `internationalization` · `dependency-health` · `concurrency` · `ethics-bias`
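To make the parameter table concrete, here is a hypothetical argument payload for `evaluate_code_single_judge` (the code sample and context string are invented for illustration):

```typescript
// Hypothetical arguments for the evaluate_code_single_judge tool.
// Field names follow the parameter table above; the snippet under review is made up.
const args = {
  code: 'const apiToken = "abc123"; // hardcoded secret',
  language: "typescript",
  judgeId: "data-security", // must be one of the 18 judge IDs listed above
  context: "Auth middleware for an internal API", // optional
};

console.log(args.judgeId); // → data-security
```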
+
+ ---
+
+ ## MCP Prompts
+
+ Each judge has a corresponding prompt for LLM-powered deep analysis:
+
+ | Prompt | Description |
+ |--------|-------------|
+ | `judge-data-security` | Deep data security review |
+ | `judge-cybersecurity` | Deep cybersecurity review |
+ | `judge-cost-effectiveness` | Deep cost optimization review |
+ | `judge-scalability` | Deep scalability review |
+ | `judge-cloud-readiness` | Deep cloud readiness review |
+ | `judge-software-practices` | Deep software practices review |
+ | `judge-accessibility` | Deep accessibility/WCAG review |
+ | `judge-api-design` | Deep API design review |
+ | `judge-reliability` | Deep reliability & resilience review |
+ | `judge-observability` | Deep observability & monitoring review |
+ | `judge-performance` | Deep performance optimization review |
+ | `judge-compliance` | Deep regulatory compliance review |
+ | `judge-testing` | Deep testing quality review |
+ | `judge-documentation` | Deep documentation quality review |
+ | `judge-internationalization` | Deep i18n review |
+ | `judge-dependency-health` | Deep dependency health review |
+ | `judge-concurrency` | Deep concurrency & async safety review |
+ | `judge-ethics-bias` | Deep ethics & bias review |
+ | `full-tribunal` | All 18 judges in a single prompt |
+
+ ---
+
  ## Scoring
 
  Each judge scores the code from **0 to 100**:
 
  | Severity | Score Deduction |
  |----------|----------------|
- | Critical | -20 points |
- | High | -12 points |
- | Medium | -6 points |
- | Low | -3 points |
+ | Critical | 20 points |
+ | High | 12 points |
+ | Medium | 6 points |
+ | Low | 3 points |
  | Info | 0 points |
 
  **Verdict logic:**
@@ -137,38 +225,48 @@ Each judge scores the code from **0 to 100**:
 
  The **overall tribunal score** is the average of all 18 judges. The overall verdict fails if **any** judge fails.
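Read as an algorithm, the deduction table suggests each judge starts at 100 and subtracts a fixed amount per finding, floored at 0. A minimal sketch (illustrative only; the package's actual evaluator may differ in details such as clamping):

```typescript
// Illustrative scoring: start at 100, subtract the table's deduction for each
// finding's severity, and never drop below 0. The weights are taken from the
// severity table above; everything else here is an assumption.
type Severity = "critical" | "high" | "medium" | "low" | "info";

const DEDUCTION: Record<Severity, number> = {
  critical: 20,
  high: 12,
  medium: 6,
  low: 3,
  info: 0,
};

function judgeScore(findings: Severity[]): number {
  const total = findings.reduce((sum, s) => sum + DEDUCTION[s], 0);
  return Math.max(0, 100 - total);
}

// One critical + two medium findings: 100 - 20 - 6 - 6 = 68
console.log(judgeScore(["critical", "medium", "medium"])); // → 68
```

Under this reading, the overall tribunal score is the plain average of the 18 per-judge scores, and the overall verdict fails as soon as any single judge fails.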
 
- ## Example Output
-
- ```
- # Judges Panel — Verdict
-
- **Overall Verdict: WARNING** | **Score: 68/100**
- Total critical findings: 1 | Total high findings: 3
-
- ## Individual Judge Results
-
- ❌ **Judge Data Security** (FAIL, 60/100) — 3 finding(s)
- ⚠️ **Judge Cybersecurity** (WARNING, 76/100) — 2 finding(s)
- ✅ **Judge Cost Effectiveness** (PASS, 88/100) — 1 finding(s)
- ⚠️ **Judge Scalability** (WARNING, 70/100) — 2 finding(s)
- ✅ **Judge Cloud Readiness** (PASS, 82/100) — 1 finding(s)
- ⚠️ **Judge Software Practices** (WARNING, 72/100) — 3 finding(s)
- ```
+ ---
 
  ## Project Structure
 
  ```
  judges/
  ├── src/
- │   ├── index.ts              # MCP server entry point — tools, prompts, transport
- │   ├── types.ts              # TypeScript interfaces for judges, findings, verdicts
- │   ├── judges.ts             # Judge definitions with expert system prompts
- │   └── evaluator.ts          # Pattern-based analysis engine + scoring
+ │   ├── index.ts                  # MCP server entry point — tools, prompts, transport
+ │   ├── types.ts                  # TypeScript interfaces (Finding, JudgeEvaluation, etc.)
+ │   ├── evaluators/               # Pattern-based analysis engine for each judge
+ │   │   ├── index.ts              # evaluateWithJudge(), evaluateWithTribunal()
+ │   │   ├── shared.ts             # Scoring, verdict logic, markdown formatters
+ │   │   └── *.ts                  # One analyzer per judge (18 files)
+ │   └── judges/                   # Judge definitions (id, name, domain, system prompt)
+ │       ├── index.ts              # JUDGES array, getJudge(), getJudgeSummaries()
+ │       └── *.ts                  # One definition per judge (18 files)
+ ├── examples/
+ │   ├── sample-vulnerable-api.ts  # Intentionally flawed code (triggers all 18 judges)
+ │   └── demo.ts                   # Run: npm run demo
+ ├── tests/
+ │   └── judges.test.ts            # Run: npm test (184 tests)
+ ├── server.json                   # MCP Registry manifest
  ├── package.json
  ├── tsconfig.json
  └── README.md
  ```
 
+ ---
+
+ ## Scripts
+
+ | Command | Description |
+ |---------|-------------|
+ | `npm run build` | Compile TypeScript to `dist/` |
+ | `npm run dev` | Watch mode — recompile on save |
+ | `npm test` | Run the full test suite (184 tests) |
+ | `npm run demo` | Run the sample tribunal demo |
+ | `npm start` | Start the MCP server |
+ | `npm run clean` | Remove `dist/` |
+
+ ---
+
  ## License
 
  MIT
package/dist/index.js CHANGED
@@ -20,7 +20,7 @@ import { evaluateWithJudge, evaluateWithTribunal, formatVerdictAsMarkdown, forma
  // ─── Create MCP Server ──────────────────────────────────────────────────────
  const server = new McpServer({
    name: "judges",
-   version: "1.0.0",
+   version: "1.1.0",
  });
  // ─── Tool: get_judges ────────────────────────────────────────────────────────
  server.tool("get_judges", "List all available judges on the Agent Tribunal panel, including their areas of expertise and what they evaluate.", {}, async () => {
package/package.json CHANGED
@@ -1,8 +1,8 @@
  {
    "name": "@kevinrabun/judges",
-   "version": "1.0.1",
+   "version": "1.1.0",
    "description": "18 specialized judges that evaluate AI-generated code for security, cost, and quality.",
-   "mcpName": "io.github.kevinrabun/judges",
+   "mcpName": "io.github.KevinRabun/judges",
    "type": "module",
    "main": "dist/index.js",
    "bin": {
@@ -19,6 +19,8 @@
      "start": "node dist/index.js",
      "dev": "tsc --watch",
      "clean": "rimraf dist",
+     "test": "npx tsx --test tests/judges.test.ts",
+     "demo": "npx tsx examples/demo.ts",
      "prepublishOnly": "npm run build"
    },
    "keywords": [
@@ -48,6 +50,7 @@
    },
    "devDependencies": {
      "@types/node": "^25.3.0",
+     "tsx": "^4.19.4",
      "typescript": "^5.9.3"
    }
  }
package/server.json CHANGED
@@ -1,18 +1,18 @@
  {
    "$schema": "https://static.modelcontextprotocol.io/schemas/2025-12-11/server.schema.json",
-   "name": "io.github.kevinrabun/judges",
+   "name": "io.github.KevinRabun/judges",
    "title": "Judges Panel",
    "description": "18 specialized judges that evaluate AI-generated code for security, cost, and quality.",
    "repository": {
      "url": "https://github.com/kevinrabun/judges",
      "source": "github"
    },
-   "version": "1.0.1",
+   "version": "1.1.0",
    "packages": [
      {
        "registryType": "npm",
        "identifier": "@kevinrabun/judges",
-       "version": "1.0.1",
+       "version": "1.1.0",
        "transport": {
          "type": "stdio"
        }