@kevinrabun/judges 3.127.0 → 3.127.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +47 -3
- package/dist/commands/llm-benchmark-optimizer.js +28 -25
- package/dist/commands/llm-benchmark.js +1 -1
- package/dist/evaluators/index.js +1 -1
- package/dist/probabilistic/llm-response-validator.js +4 -1
- package/dist/reports/public-repo-report.js +3 -7
- package/dist/skill-loader.js +1 -1
- package/package.json +1 -1
- package/packages/judges-cli/README.md +36 -4
- package/server.json +2 -2
- package/src/skill-loader.ts +1 -1
package/README.md
CHANGED
|
@@ -15,7 +15,7 @@ An MCP (Model Context Protocol) server that provides a panel of **45 specialized
|
|
|
15
15
|
[](https://www.npmjs.com/package/@kevinrabun/judges)
|
|
16
16
|
[](https://www.npmjs.com/package/@kevinrabun/judges)
|
|
17
17
|
[](https://opensource.org/licenses/MIT)
|
|
18
|
-
[](https://github.com/KevinRabun/judges/actions)
|
|
19
19
|
|
|
20
20
|
> **Packages**
|
|
21
21
|
> - **CLI**: `@kevinrabun/judges-cli` – binary `judges` (use `npx @kevinrabun/judges-cli eval --file app.ts`).
|
|
@@ -843,7 +843,7 @@ The tribunal operates in three layers:
|
|
|
843
843
|
|
|
844
844
|
Judges Panel is a **dual-layer** review system: instant **deterministic tools** (offline, no API keys) for pattern and AST analysis, plus **45 expert-persona MCP prompts** that unlock LLM-powered deep analysis when connected to an AI client. It does not try to be a CVE scanner or a linter. Those capabilities belong in dedicated MCP servers that an AI agent can orchestrate alongside Judges.
|
|
845
845
|
|
|
846
|
-
### Built-in AST Analysis
|
|
846
|
+
### Built-in AST Analysis
|
|
847
847
|
|
|
848
848
|
Unlike earlier versions that recommended a separate AST MCP server, Judges Panel now includes **real AST-based structural analysis** out of the box:
|
|
849
849
|
|
|
@@ -1236,7 +1236,9 @@ Create a `.judgesrc.json` (or `.judgesrc`) file in your project root to customiz
|
|
|
1236
1236
|
"languages": ["typescript", "python"],
|
|
1237
1237
|
"format": "text",
|
|
1238
1238
|
"failOnFindings": false,
|
|
1239
|
-
"baseline": ""
|
|
1239
|
+
"baseline": "",
|
|
1240
|
+
"regulatoryScope": ["GDPR", "PCI-DSS", "SOC2"],
|
|
1241
|
+
"consensusThreshold": 0.7
|
|
1240
1242
|
}
|
|
1241
1243
|
```
|
|
1242
1244
|
|
|
@@ -1252,6 +1254,14 @@ Create a `.judgesrc.json` (or `.judgesrc`) file in your project root to customiz
|
|
|
1252
1254
|
| `format` | `string` | `"text"` | Default output format: `text` · `json` · `sarif` · `markdown` · `html` · `pdf` · `junit` · `codeclimate` · `github-actions` |
|
|
1253
1255
|
| `failOnFindings` | `boolean` | `false` | Exit code 1 when verdict is `fail` – useful for CI gates |
|
|
1254
1256
|
| `baseline` | `string` | `""` | Path to a baseline JSON file – matching findings are suppressed |
|
|
1257
|
+
| `plugins` | `string[]` | `[]` | Plugin module specifiers (npm packages or relative paths) that export custom judges |
|
|
1258
|
+
| `judgeWeights` | `object` | `{}` | Weighted importance per judge for aggregated scoring (e.g. `{ "cybersecurity": 2.0 }`) |
|
|
1259
|
+
| `failOnScoreBelow` | `number` | – | Minimum score (0–100) for the run to pass; complements `failOnFindings` |
|
|
1260
|
+
| `regulatoryScope` | `string[]` | – | Regulatory frameworks in scope (e.g. `["GDPR", "PCI-DSS"]`). Findings citing ONLY out-of-scope frameworks are suppressed. Run `judges list --frameworks` for supported values. |
|
|
1261
|
+
| `consensusThreshold` | `number` | – | Consensus suppression (0–1). If this fraction of judges report zero findings, minority findings are suppressed. Recommended: `0.7` for CI. |
|
|
1262
|
+
| `escalationThreshold` | `number` | – | Confidence threshold (0–1) below which findings are flagged for human review |
|
|
1263
|
+
| `overrides` | `array` | `[]` | Path-scoped config overrides (e.g. `[{ "files": "**/*.test.ts", "disabledJudges": ["documentation"] }]`) |
|
|
1264
|
+
| `customRules` | `array` | `[]` | User-defined regex-based rules for business logic validation |
|
|
1255
1265
|
|
|
1256
1266
|
All evaluation tools (CLI and MCP) accept the same configuration fields via `--config <path>` or inline `config` parameter.
|
|
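For illustration, a `.judgesrc.json` that combines several of the options above might look like this (field names come from the table; the specific values shown are hypothetical):

```json
{
  "languages": ["typescript", "python"],
  "format": "sarif",
  "failOnFindings": true,
  "failOnScoreBelow": 75,
  "regulatoryScope": ["GDPR", "PCI-DSS"],
  "consensusThreshold": 0.7,
  "escalationThreshold": 0.5,
  "judgeWeights": { "cybersecurity": 2.0, "documentation": 0.5 },
  "overrides": [
    { "files": "**/*.test.ts", "disabledJudges": ["documentation"] }
  ]
}
```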
1257
1267
|
|
|
@@ -1288,6 +1298,38 @@ Patches include `oldText`, `newText`, `startLine`, and `endLine` for automated a
|
|
|
1288
1298
|
|
|
1289
1299
|
When multiple judges flag the same issue (e.g., both Data Security and Cybersecurity detect SQL injection on line 15), findings are automatically deduplicated. The highest-severity finding wins, and the description is annotated with cross-references (e.g., *"Also identified by: CYBER-003"*).
|
|
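That dedup rule can be pictured as a minimal sketch, assuming a hypothetical finding shape `{ line, issue, severity, ruleId }` (the real engine's data model is richer):

```javascript
const SEVERITY_RANK = { critical: 4, high: 3, medium: 2, low: 1, info: 0 };

// Group findings by (line, issue), keep the highest-severity one, and
// annotate the survivor with the rule IDs of the duplicates it absorbed.
function dedupeFindings(findings) {
  const byKey = new Map();
  for (const f of findings) {
    const key = `${f.line}:${f.issue}`;
    const existing = byKey.get(key);
    if (!existing) {
      byKey.set(key, { ...f, alsoIdentifiedBy: [] });
    } else if (SEVERITY_RANK[f.severity] > SEVERITY_RANK[existing.severity]) {
      byKey.set(key, {
        ...f,
        alsoIdentifiedBy: [...existing.alsoIdentifiedBy, existing.ruleId],
      });
    } else {
      existing.alsoIdentifiedBy.push(f.ruleId);
    }
  }
  return [...byKey.values()];
}
```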
1290
1300
|
|
|
1301
|
+
### Human Focus Guide
|
|
1302
|
+
|
|
1303
|
+
Every tribunal evaluation includes a `humanFocusGuide` that categorizes findings into three buckets for human reviewers:
|
|
1304
|
+
|
|
1305
|
+
| Bucket | Description | When to use |
|
|
1306
|
+
|--------|-------------|-------------|
|
|
1307
|
+
| **✅ Trust** | High-confidence (≥80%), evidence-backed findings with AST/taint confirmation | Act directly – these have strong automated evidence |
|
|
1308
|
+
| **🔍 Verify** | Lower-confidence or absence-based findings | Use your judgment – the issue may exist elsewhere in the project |
|
|
1309
|
+
| **🟦 Blind Spots** | Areas automated analysis cannot evaluate | Focus your manual review time here |
|
|
1310
|
+
|
|
1311
|
+
Blind spots are detected from code characteristics: complex branching logic, external service calls, financial calculations, PII handling, state machines, and complex regex. The guide appears in CLI text/markdown output, JSON/SARIF output, and GitHub Action step summaries.
|
|
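The bucketing logic sketches out roughly as follows (field names here are hypothetical; the actual guide synthesis also inspects AST and taint evidence):

```javascript
// Sort findings into the Trust/Verify buckets described above:
// high-confidence findings with automated evidence are trusted,
// everything else goes to the reviewer's judgment.
function bucketFindings(findings) {
  const guide = { trust: [], verify: [] };
  for (const f of findings) {
    if (f.confidence >= 0.8 && f.hasAutomatedEvidence) guide.trust.push(f);
    else guide.verify.push(f);
  }
  return guide;
}
```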
1312
|
+
|
|
1313
|
+
### Regulatory Scope
|
|
1314
|
+
|
|
1315
|
+
Configure which regulatory frameworks apply to your project in `.judgesrc`:
|
|
1316
|
+
|
|
1317
|
+
```json
|
|
1318
|
+
{ "regulatoryScope": ["GDPR", "PCI-DSS", "SOC2"] }
|
|
1319
|
+
```
|
|
1320
|
+
|
|
1321
|
+
Findings that cite ONLY out-of-scope frameworks are suppressed. Findings with no regulatory reference (general code quality) are always kept. Run `judges list --frameworks` to see all 17 supported frameworks (GDPR, CCPA, HIPAA, PCI-DSS, SOC2, SOX, COPPA, FedRAMP, NIST, ISO27001, ePrivacy, DORA, NIS2, EU-AI-Act, and more).
|
|
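A sketch of that suppression rule, assuming a hypothetical `frameworks` array on each finding:

```javascript
// Keep a finding if it cites no framework at all (general code quality),
// or if at least one cited framework is in scope.
function applyRegulatoryScope(findings, scope) {
  const inScope = new Set(scope);
  return findings.filter((f) => {
    const frameworks = f.frameworks ?? [];
    if (frameworks.length === 0) return true; // no regulatory reference: always kept
    return frameworks.some((fw) => inScope.has(fw));
  });
}
```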
1322
|
+
|
|
1323
|
+
### Self-Teaching Amendments
|
|
1324
|
+
|
|
1325
|
+
The LLM benchmark system auto-generates precision amendments for judges with high false-positive rates. Amendments are data-driven corrections injected into prompts that improve accuracy over successive benchmark runs.
|
|
1326
|
+
|
|
1327
|
+
The self-teaching loop:
|
|
1328
|
+
1. Run benchmark → analyzer identifies judges below 70% precision
|
|
1329
|
+
2. Generates targeted amendments (e.g., "Judge ERR: do not flag clean Express code with framework error middleware")
|
|
1330
|
+
3. Next benchmark run loads amendments → precision improves
|
|
1331
|
+
4. Run `judges codify-amendments` to bake amendments permanently into the distributed package
|
|
1332
|
+
|
|
1291
1333
|
### Taint Flow Analysis
|
|
1292
1334
|
|
|
1293
1335
|
The engine performs inter-procedural taint tracking to trace data from user-controlled sources (e.g., `req.body`, `process.env`) through transformations to security-sensitive sinks (e.g., `eval()`, `exec()`, SQL queries). Taint flows are used to boost confidence on true-positive findings and suppress false positives where sanitization is detected.
|
|
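In miniature, the idea looks like this: a toy intra-procedural sketch over a hypothetical statement list, whereas the real engine is inter-procedural and sanitization-aware:

```javascript
// Propagate taint from source expressions through simple assignments,
// then report any sink call whose argument is tainted.
function findTaintFlows(statements, sources, sinks) {
  const tainted = new Set();
  const flows = [];
  for (const s of statements) {
    if (s.kind === "assign") {
      if (sources.includes(s.from) || tainted.has(s.from)) tainted.add(s.to);
    } else if (s.kind === "call" && sinks.includes(s.fn) && tainted.has(s.arg)) {
      flows.push({ sink: s.fn, via: s.arg });
    }
  }
  return flows;
}
```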
@@ -1475,6 +1517,8 @@ judges/
|
|
|
1475
1517
|
| `judges config import <src>` | Import a shared configuration |
|
|
1476
1518
|
| `judges compare` | Compare judges against other code review tools |
|
|
1477
1519
|
| `judges list` | List all 45 judges with domains and descriptions |
|
|
1520
|
+
| `judges list --frameworks` | List supported regulatory frameworks and `.judgesrc` usage |
|
|
1521
|
+
| `judges codify-amendments` | Bake self-teaching amendments into judge source files |
|
|
1478
1522
|
|
|
1479
1523
|
---
|
|
1480
1524
|
|
|
@@ -109,6 +109,7 @@ function generateAmendment(prefix, precision, fpCount, total, snapshot) {
|
|
|
109
109
|
const domain = judge?.domain ?? "its domain";
|
|
110
110
|
// Analyze what the FPs look like – which categories get falsely flagged
|
|
111
111
|
const fpCategories = new Map();
|
|
112
|
+
const tpCategories = new Map();
|
|
112
113
|
// Collect specific FP case IDs for pattern extraction
|
|
113
114
|
const fpCaseExamples = [];
|
|
114
115
|
for (const c of snapshot.cases) {
|
|
@@ -120,35 +121,36 @@ function generateAmendment(prefix, precision, fpCount, total, snapshot) {
|
|
|
120
121
|
}
|
|
121
122
|
}
|
|
122
123
|
}
|
|
124
|
+
// Also track where this judge produces TRUE positives
|
|
125
|
+
for (const det of c.detectedRuleIds) {
|
|
126
|
+
if (det.startsWith(prefix + "-") && !c.falsePositiveRuleIds.includes(det)) {
|
|
127
|
+
tpCategories.set(c.category, (tpCategories.get(c.category) ?? 0) + 1);
|
|
128
|
+
}
|
|
129
|
+
}
|
|
123
130
|
}
|
|
124
|
-
|
|
131
|
+
// Identify categories that are FP-only (no TPs) – safe to suppress
|
|
132
|
+
const fpOnlyCategories = [...fpCategories.entries()]
|
|
133
|
+
.filter(([cat]) => !tpCategories.has(cat))
|
|
125
134
|
.sort((a, b) => b[1] - a[1])
|
|
126
135
|
.slice(0, 5)
|
|
127
136
|
.map(([cat]) => cat);
|
|
128
|
-
// Build
|
|
129
|
-
const categoryBlocklist = topFpCategories.length > 0
|
|
130
|
-
? `\nDo NOT report ${prefix}- findings on code in these categories: ${topFpCategories.join(", ")}. ` +
|
|
131
|
-
`These categories fall outside ${domain} and historically produce false positives.`
|
|
132
|
-
: "";
|
|
133
|
-
// Extract specific FP patterns for concrete guidance
|
|
134
|
-
const fpRuleIds = new Set(fpCaseExamples.map((e) => e.ruleId));
|
|
135
|
-
const specificRules = [...fpRuleIds].slice(0, 5).join(", ");
|
|
136
|
-
const ruleWarning = specificRules
|
|
137
|
-
? `\nSpecific rule IDs with high FP rates: ${specificRules}. Require >=80% confidence with exact line citations before reporting these.`
|
|
138
|
-
: "";
|
|
139
|
-
// Identify if clean cases are a problem for this judge
|
|
137
|
+
// Build targeted anti-FP instructions – only suppress on clean/FP-only categories
|
|
140
138
|
const cleanFPs = fpCaseExamples.filter((e) => e.category === "clean" || e.category.startsWith("ai-negative")).length;
|
|
139
|
+
const nonCleanFPOnlyWarning = fpOnlyCategories.length > 0
|
|
140
|
+
? `\nHistorically produces false positives on: ${fpOnlyCategories.join(", ")}. Apply extra scrutiny on these categories – require concrete evidence before reporting.`
|
|
141
|
+
: "";
|
|
141
142
|
const cleanWarning = cleanFPs > 0
|
|
142
|
-
? `\nThis judge produced ${cleanFPs} false
|
|
143
|
+
? `\nThis judge produced ${cleanFPs} false positive(s) on CLEAN code. If code uses standard patterns correctly (proper error handling, established libraries, framework conventions), report ZERO ${prefix}- findings. Clean, well-written code exists – do not manufacture findings.`
|
|
143
144
|
: "";
|
|
144
|
-
|
|
145
|
-
|
|
146
|
-
|
|
147
|
-
`
|
|
148
|
-
|
|
149
|
-
|
|
145
|
+
// IMPORTANT: Do NOT restrict the judge from detecting real issues in vulnerable code.
|
|
146
|
+
// Only add caution for clean-code patterns, not a blanket confidence floor.
|
|
147
|
+
const amendment = `PRECISION CALIBRATION for ${judgeName} (${prefix}-): ` +
|
|
148
|
+
`Empirical precision: ${pct(precision)} in recent benchmarks. ` +
|
|
149
|
+
`IMPORTANT: Continue detecting genuine ${domain} issues in vulnerable code – do NOT reduce sensitivity to real problems. ` +
|
|
150
|
+
`CALIBRATION: The false positives come from flagging well-written code that correctly uses established patterns. ` +
|
|
151
|
+
`Before reporting ${prefix}- findings, verify the code actually has a deficiency – not just a theoretical improvement opportunity.` +
|
|
150
152
|
cleanWarning +
|
|
151
|
-
|
|
153
|
+
nonCleanFPOnlyWarning;
|
|
152
154
|
return {
|
|
153
155
|
judgePrefix: prefix,
|
|
154
156
|
amendment,
|
|
@@ -167,11 +169,12 @@ export function formatAmendmentSection(amendments) {
|
|
|
167
169
|
if (amendments.length === 0)
|
|
168
170
|
return "";
|
|
169
171
|
const lines = [
|
|
170
|
-
"## Precision
|
|
172
|
+
"## Precision Calibration – Based on Empirical Benchmark Data",
|
|
171
173
|
"",
|
|
172
|
-
"The following judges have
|
|
173
|
-
"Apply
|
|
174
|
-
"
|
|
174
|
+
"The following judges have historically produced false positives on clean code. " +
|
|
175
|
+
"Apply the calibration guidance below to avoid repeating these errors. " +
|
|
176
|
+
"IMPORTANT: These calibrations target CLEAN CODE false positives only – " +
|
|
177
|
+
"continue detecting genuine issues in vulnerable code with full sensitivity.",
|
|
175
178
|
"",
|
|
176
179
|
];
|
|
177
180
|
for (const a of amendments) {
|
|
@@ -153,7 +153,7 @@ export function parseLlmRuleIds(response) {
|
|
|
153
153
|
// IDs mentioned in rationale text or findings tables of "clean" judge sections
|
|
154
154
|
// from being counted as detections.
|
|
155
155
|
const sections = response.split(/(?:^|\n)---\s*\n|(?=^## )/m);
|
|
156
|
-
const zeroFindingsPattern =
|
|
156
|
+
const zeroFindingsPattern = /(?:ZERO|zero|0|no) findings?|findings?[:\s]*(?:none|0|zero)|no (?:issues|findings|problems|concerns) (?:found|detected|identified|reported)|reporting? zero|Score[|: ]*100/i;
|
|
157
157
|
for (const section of sections) {
|
|
158
158
|
// If this section explicitly declares zero/no findings or a perfect score,
|
|
159
159
|
// skip rule ID extraction – any rule IDs are explanatory references
|
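The new pattern can be exercised directly; the regex below is copied from the diff, while the sample strings are illustrative:

```javascript
// Detects judge sections that explicitly declare a clean verdict
// (zero/no findings, or a perfect score).
const zeroFindingsPattern = /(?:ZERO|zero|0|no) findings?|findings?[:\s]*(?:none|0|zero)|no (?:issues|findings|problems|concerns) (?:found|detected|identified|reported)|reporting? zero|Score[|: ]*100/i;

// A clean-verdict declaration matches; an actual finding line does not.
const cleanSection = zeroFindingsPattern.test("This judge reports ZERO findings.");
const findingLine = zeroFindingsPattern.test("Finding SEC-001: SQL injection at line 15.");
```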
package/dist/evaluators/index.js
CHANGED
|
@@ -504,7 +504,7 @@ function synthesizeHumanFocusGuide(findings, code, language) {
|
|
|
504
504
|
});
|
|
505
505
|
}
|
|
506
506
|
// State machines / workflow
|
|
507
|
-
const hasStateMachine = /state\s*[=:]\s*['"][^'"]+['"]|status\s*===?\s*['"]|transition|workflow|step
|
|
507
|
+
const hasStateMachine = /state\s*[=:]\s*['"][^'"]+['"]|status\s*===?\s*['"]|transition|workflow|step[\w\s]{0,20}next/i.test(code);
|
|
508
508
|
if (hasStateMachine) {
|
|
509
509
|
blindSpots.push({
|
|
510
510
|
area: "State Management / Workflow Logic",
|
|
@@ -5,7 +5,10 @@ const SEVERITY_SET = new Set(["critical", "high", "medium", "low", "info"]);
|
|
|
5
5
|
* Attempt to parse a JSON payload embedded in LLM output. Supports fenced code blocks and raw JSON.
|
|
6
6
|
*/
|
|
7
7
|
function parseJsonBlock(text) {
|
|
8
|
-
|
|
8
|
+
// Extract JSON from fenced code blocks – limit search to first 50KB to prevent ReDoS on large input
|
|
9
|
+
const searchText = text.length > 50_000 ? text.slice(0, 50_000) : text;
|
|
10
|
+
const fenceMatch = searchText.match(/```(?:json)?\s*\n([\s\S]{0,20000}?)\n\s*```/i) ??
|
|
11
|
+
searchText.match(/```(?:json)?\s*([\s\S]{0,20000}?)```/i);
|
|
9
12
|
if (fenceMatch) {
|
|
10
13
|
try {
|
|
11
14
|
return JSON.parse(fenceMatch[1]);
|
|
@@ -216,13 +216,9 @@ function compileExcludeRegexes(patterns) {
|
|
|
216
216
|
if (!patterns || patterns.length === 0)
|
|
217
217
|
return [];
|
|
218
218
|
return patterns.map((pattern) => {
|
|
219
|
-
|
|
220
|
-
|
|
221
|
-
|
|
222
|
-
catch {
|
|
223
|
-
// Invalid regex from user input – treat as literal string match
|
|
224
|
-
return new RegExp(pattern.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"), "i");
|
|
225
|
-
}
|
|
219
|
+
// Always escape user input to prevent regex injection, then compile
|
|
220
|
+
const escaped = pattern.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
|
|
221
|
+
return new RegExp(escaped, "i");
|
|
226
222
|
});
|
|
227
223
|
}
|
|
228
224
|
function isLikelyNonProductionPath(path) {
|
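The escaping step can be seen in isolation; the replacement expression is the one from the diff, while the wrapper function name is hypothetical:

```javascript
// Escape regex metacharacters so a user-supplied exclude pattern
// matches literally (case-insensitive), preventing regex injection.
function toLiteralRegex(pattern) {
  return new RegExp(pattern.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"), "i");
}
```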
package/dist/skill-loader.js
CHANGED
package/package.json
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
# @kevinrabun/judges-cli
|
|
2
2
|
|
|
3
|
-
Standalone CLI package for Judges.
|
|
3
|
+
Standalone CLI package for the [Judges Panel](https://github.com/KevinRabun/judges) – 45 specialized judges that evaluate code for security, quality, compliance, and 40 more dimensions.
|
|
4
4
|
|
|
5
5
|
## Install
|
|
6
6
|
|
|
@@ -11,14 +11,46 @@ npm install -g @kevinrabun/judges-cli
|
|
|
11
11
|
## Usage
|
|
12
12
|
|
|
13
13
|
```bash
|
|
14
|
+
# Evaluate code
|
|
14
15
|
judges eval src/app.ts
|
|
16
|
+
judges eval src/ --format sarif --output report.sarif
|
|
17
|
+
judges eval src/app.ts --judge cybersecurity
|
|
18
|
+
judges eval src/app.ts --preset strict --fail-on-findings
|
|
19
|
+
|
|
20
|
+
# List judges and regulatory frameworks
|
|
15
21
|
judges list
|
|
16
|
-
judges
|
|
22
|
+
judges list --frameworks
|
|
23
|
+
|
|
24
|
+
# Auto-fix findings
|
|
25
|
+
judges fix src/app.ts --apply
|
|
17
26
|
|
|
18
27
|
# Agentic skills
|
|
19
28
|
judges skill ai-code-review --file src/app.ts
|
|
20
29
|
judges skill security-review --file src/api.ts --format json
|
|
21
|
-
judges skills
|
|
30
|
+
judges skills
|
|
31
|
+
|
|
32
|
+
# Self-teaching
|
|
33
|
+
judges codify-amendments # bake benchmark amendments into judge files
|
|
34
|
+
judges codify-amendments --dry-run
|
|
22
35
|
```
|
|
23
36
|
|
|
24
|
-
|
|
37
|
+
## Configuration
|
|
38
|
+
|
|
39
|
+
Create a `.judgesrc.json` in your project root:
|
|
40
|
+
|
|
41
|
+
```json
|
|
42
|
+
{
|
|
43
|
+
"preset": "strict",
|
|
44
|
+
"regulatoryScope": ["GDPR", "PCI-DSS"],
|
|
45
|
+
"disabledJudges": ["accessibility"],
|
|
46
|
+
"failOnFindings": true
|
|
47
|
+
}
|
|
48
|
+
```
|
|
49
|
+
|
|
50
|
+
See the [full configuration reference](https://github.com/KevinRabun/judges#configuration) for all options.
|
|
51
|
+
|
|
52
|
+
## Packages
|
|
53
|
+
|
|
54
|
+
- **`@kevinrabun/judges-cli`** – This package. Binary `judges` for CI/CD pipelines.
|
|
55
|
+
- **`@kevinrabun/judges`** – Programmatic API + MCP server.
|
|
56
|
+
- **VS Code extension** – [`kevinrabun.judges-panel`](https://marketplace.visualstudio.com/items?itemName=kevinrabun.judges-panel).
|
package/server.json
CHANGED
|
@@ -16,12 +16,12 @@
|
|
|
16
16
|
"mimeType": "image/png"
|
|
17
17
|
}
|
|
18
18
|
],
|
|
19
|
-
"version": "3.127.
|
|
19
|
+
"version": "3.127.2",
|
|
20
20
|
"packages": [
|
|
21
21
|
{
|
|
22
22
|
"registryType": "npm",
|
|
23
23
|
"identifier": "@kevinrabun/judges",
|
|
24
|
-
"version": "3.127.
|
|
24
|
+
"version": "3.127.2",
|
|
25
25
|
"transport": {
|
|
26
26
|
"type": "stdio"
|
|
27
27
|
}
|
package/src/skill-loader.ts
CHANGED
|
@@ -44,7 +44,7 @@ export function parseSkillFrontmatter(raw: string): { meta: SkillMeta; body: str
|
|
|
44
44
|
i++;
|
|
45
45
|
continue;
|
|
46
46
|
}
|
|
47
|
-
const kv = line.match(/^([a-zA-Z_][a-zA-Z0-9_-]*)[ \t]*:[ \t]*(
|
|
47
|
+
const kv = line.match(/^([a-zA-Z_][a-zA-Z0-9_-]*)[ \t]*:[ \t]*(.*?)$/s);
|
|
48
48
|
if (!kv) {
|
|
49
49
|
i++;
|
|
50
50
|
continue;
|
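The corrected key/value regex behaves like this on typical frontmatter lines (the regex is the one from the diff; the wrapper is a hypothetical helper for illustration):

```javascript
// Match a frontmatter "key: value" line; keys start with a letter or
// underscore, values run to end of line.
const kvPattern = /^([a-zA-Z_][a-zA-Z0-9_-]*)[ \t]*:[ \t]*(.*?)$/s;

function parseKv(line) {
  const m = line.match(kvPattern);
  return m ? { key: m[1], value: m[2] } : null;
}
```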