npm - @ryuenn3123/agentic-senior-core - Versions diffs - 2.0.15 → 2.0.17 - Mend

@ryuenn3123/agentic-senior-core 2.0.15 → 2.0.17

This diff shows the contents of publicly released package versions as they appear in their public registries. It is provided for informational purposes only.

Files changed (12)

package/.agent-context/prompts/review-code.md +1 -0
package/.agent-context/review-checklists/pr-checklist.md +1 -0
package/.agent-context/rules/api-docs.md +8 -5
package/.agent-context/state/benchmark-reproducibility.json +3 -1
package/.agent-context/state/benchmark-writer-judge-config.json +58 -0
package/.agent-context/state/benchmark-writer-judge-matrix.json +462 -0
package/.cursorrules +1 -1
package/.windsurfrules +1 -1
package/README.md +20 -0
package/package.json +2 -1
package/scripts/benchmark-writer-judge-matrix.mjs +383 -0
package/scripts/validate.mjs +17 -3

package/.agent-context/prompts/review-code.md CHANGED

@@ -13,6 +13,7 @@ Run a comprehensive code review on the current codebase (or the files I'm about
 Use these checklists:
 1. Read .agent-context/review-checklists/pr-checklist.md — apply every item.
 2. Read .agent-context/review-checklists/security-audit.md — apply every item.
+3. Apply documentation scope rules exactly: This applies to documentation, release notes, onboarding text, review summaries, and agent-facing explanations.
 For EVERY violation found:
 - State the exact file and line

package/.agent-context/review-checklists/pr-checklist.md CHANGED

@@ -91,6 +91,7 @@ VERDICT: PASS / FAIL (X/Y items passed)
 - [ ] `.env` not committed
 ### 10. Documentation
+- [ ] Scope applied: This applies to documentation, release notes, onboarding text, review summaries, and agent-facing explanations
 - [ ] API endpoints have OpenAPI/Swagger documentation
 - [ ] Complex business logic has comments explaining WHY
 - [ ] Public functions/methods have JSDoc/docstrings

package/.agent-context/rules/api-docs.md CHANGED

@@ -27,7 +27,8 @@ REQUIRED: Documentation MUST be updated in the SAME commit as the endpoint chang
 ## Human Writing Standard (Mandatory)
-This standard applies to API docs, README updates, release notes, and technical explanations.
+This applies to documentation, release notes, onboarding text, review summaries, and agent-facing explanations.
+API docs and README updates are included in this scope.
 ### Style Baseline
 1. Write for native English speakers.
@@ -35,18 +36,20 @@ This standard applies to API docs, README updates, release notes, and technical
 3. Use clear, direct, plain language.
 4. Keep sentence rhythm natural with short and medium sentences.
 5. Sound confident, practical, and conversational.
+6. State the main point first, then supporting detail.
 ### Required Behavior
 1. Explain decisions the way a competent coworker would explain them out loud.
 2. Cut unnecessary words and remove filler.
 3. Use concrete verbs and everyday phrasing.
 4. Rewrite and reorder content when flow is weak.
+5. Keep explanations short by default; expand only when complexity requires it.
-### Hard Bans
+### Non-Negotiables
 1. No emoji in formal artifacts.
-2. No AI cliches: delve, leverage, robust, utilize, seamless.
-3. No inflated, academic, or performative language.
-4. No padding, hedging, or redundant phrasing.
+2. Avoid AI cliches and buzzwords: delve, leverage, robust, utilize, seamless.
+3. Avoid inflated, academic, or performative language.
+4. Avoid padding, hedging, and redundant phrasing.
 ### Critical Controls
 1. Any claim about quality, performance, or reliability must include a measurable source and timestamp.

package/.agent-context/state/benchmark-reproducibility.json CHANGED

@@ -73,13 +73,15 @@
     "Run npm run benchmark:detection to regenerate detection benchmark output.",
     "Run npm run benchmark:gate to validate benchmark anti-regression thresholds.",
     "Run npm run benchmark:intelligence to validate benchmark watchlist freshness.",
-    "Run npm run benchmark:bundle to emit a reproducible benchmark evidence bundle."
+    "Run npm run benchmark:bundle to emit a reproducible benchmark evidence bundle.",
+    "Run npm run benchmark:writer-judge to emit writer-judge side-by-side matrix output."
   ],
   "commandExamples": [
     "npm run benchmark:detection",
     "npm run benchmark:gate",
     "npm run benchmark:intelligence",
     "npm run benchmark:bundle",
+    "npm run benchmark:writer-judge",
     "node ./scripts/benchmark-evidence-bundle.mjs --stdout-only"
   ]
 }

package/.agent-context/state/benchmark-writer-judge-config.json ADDED

@@ -0,0 +1,58 @@
+{
+  "version": "1.0.0",
+  "phase": "v2.5.1",
+  "blindReviewMode": true,
+  "writerLane": {
+    "models": [
+      {
+        "id": "writer-copilot-balanced",
+        "provider": "github-copilot",
+        "profile": "balanced"
+      },
+      {
+        "id": "writer-claude-architect",
+        "provider": "anthropic",
+        "profile": "architect"
+      },
+      {
+        "id": "writer-gemini-ops",
+        "provider": "google",
+        "profile": "operations"
+      }
+    ],
+    "weights": {
+      "quality": 40,
+      "efficiency": 20,
+      "reliability": 25,
+      "freshness": 15
+    },
+    "scenarioMultipliers": {
+      "planning": 1,
+      "refactor": 1.02,
+      "security": 1.03,
+      "delivery": 0.98
+    }
+  },
+  "judgeLane": {
+    "models": [
+      {
+        "id": "judge-claude-audit",
+        "provider": "anthropic",
+        "profile": "audit"
+      },
+      {
+        "id": "judge-gpt-risk",
+        "provider": "openai",
+        "profile": "risk"
+      }
+    ],
+    "minimumCompositeScore": 75,
+    "leniencyWindow": 2,
+    "weights": {
+      "clarity": 35,
+      "correctness": 35,
+      "risk": 20,
+      "consistency": 10
+    }
+  }
+}
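The weights in each lane read as percentages: the writer weights (40 + 20 + 25 + 15) and the judge weights (35 + 35 + 20 + 10) both sum to 100, and the script in this release divides its weighted sum by 100. A minimal sketch of that arithmetic follows. The helper name `weightedComposite` and the example scores are hypothetical and chosen only for illustration; they are not part of the package.

```javascript
// Writer-lane weights copied from benchmark-writer-judge-config.json.
const writerWeights = { quality: 40, efficiency: 20, reliability: 25, freshness: 15 };

// Hypothetical helper: weighted average on a 0-100 scale.
// Weights are treated as percentages, hence the division by 100.
function weightedComposite(scores, weights) {
  let total = 0;
  for (const [dimension, weight] of Object.entries(weights)) {
    total += Number(scores[dimension] || 0) * weight;
  }
  return total / 100;
}

// Hypothetical per-dimension scores, for illustration only.
const exampleScores = { quality: 90, efficiency: 80, reliability: 100, freshness: 100 };
const weightTotal = Object.values(writerWeights).reduce((sum, weight) => sum + weight, 0);

console.log(weightTotal); // 100
console.log(weightedComposite(exampleScores, writerWeights)); // 92
```

If the weights in a custom config do not sum to 100, a composite computed this way no longer stays on the 0-100 scale, so a sum check like `weightTotal` above may be worth running before trusting the scores.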

package/.agent-context/state/benchmark-writer-judge-matrix.json ADDED

@@ -0,0 +1,462 @@
+{
+  "generatedAt": "2026-04-14T06:57:14.623Z",
+  "reportName": "benchmark-writer-judge-matrix",
+  "phase": "v2.5.1",
+  "passed": true,
+  "failureCount": 0,
+  "methodology": {
+    "blindReviewMode": true,
+    "writerLaneModelCount": 3,
+    "judgeLaneModelCount": 2,
+    "scenarioCount": 4,
+    "writerWeights": {
+      "quality": 40,
+      "efficiency": 20,
+      "reliability": 25,
+      "freshness": 15
+    },
+    "judgeWeights": {
+      "clarity": 35,
+      "correctness": 35,
+      "risk": 20,
+      "consistency": 10
+    }
+  },
+  "coreSignals": {
+    "top1Accuracy": 0.9167,
+    "manualCorrectionRate": 0.0833,
+    "nativeSavingsPercent": 81.52,
+    "benchmarkGatePassed": true,
+    "benchmarkGateFailureCount": 0,
+    "intelligenceFailureCount": 0,
+    "staleWatchlistCount": 0,
+    "top1AccuracyMet": true,
+    "manualCorrectionMet": true
+  },
+  "writerDirectory": [
+    {
+      "writerToken": "W1",
+      "writerModel": {
+        "id": "writer-copilot-balanced",
+        "provider": "github-copilot",
+        "profile": "balanced"
+      },
+      "averageCompositeScore": 93.7
+    },
+    {
+      "writerToken": "W2",
+      "writerModel": {
+        "id": "writer-claude-architect",
+        "provider": "anthropic",
+        "profile": "architect"
+      },
+      "averageCompositeScore": 92.8
+    },
+    {
+      "writerToken": "W3",
+      "writerModel": {
+        "id": "writer-gemini-ops",
+        "provider": "google",
+        "profile": "operations"
+      },
+      "averageCompositeScore": 92.65
+    }
+  ],
+  "comparisonMatrix": [
+    {
+      "scenarioId": "planning",
+      "scenarioCategory": "planning",
+      "writerToken": "W1",
+      "writerModelId": null,
+      "judgeModelId": "judge-claude-audit",
+      "blindPairId": "planning:W1:judge-claude-audit",
+      "writerCompositeScore": 92.12,
+      "judgeCompositeScore": 94.12,
+      "scoreThreshold": 75,
+      "leniencyWindow": 2,
+      "meetsScoreThreshold": true,
+      "meetsCoreSignals": true,
+      "verdict": "pass"
+    },
+    {
+      "scenarioId": "planning",
+      "scenarioCategory": "planning",
+      "writerToken": "W1",
+      "writerModelId": null,
+      "judgeModelId": "judge-gpt-risk",
+      "blindPairId": "planning:W1:judge-gpt-risk",
+      "writerCompositeScore": 92.12,
+      "judgeCompositeScore": 94.12,
+      "scoreThreshold": 75,
+      "leniencyWindow": 2,
+      "meetsScoreThreshold": true,
+      "meetsCoreSignals": true,
+      "verdict": "pass"
+    },
+    {
+      "scenarioId": "refactor",
+      "scenarioCategory": "refactor",
+      "writerToken": "W1",
+      "writerModelId": null,
+      "judgeModelId": "judge-claude-audit",
+      "blindPairId": "refactor:W1:judge-claude-audit",
+      "writerCompositeScore": 95.26,
+      "judgeCompositeScore": 93.26,
+      "scoreThreshold": 75,
+      "leniencyWindow": 2,
+      "meetsScoreThreshold": true,
+      "meetsCoreSignals": true,
+      "verdict": "pass"
+    },
+    {
+      "scenarioId": "refactor",
+      "scenarioCategory": "refactor",
+      "writerToken": "W1",
+      "writerModelId": null,
+      "judgeModelId": "judge-gpt-risk",
+      "blindPairId": "refactor:W1:judge-gpt-risk",
+      "writerCompositeScore": 95.26,
+      "judgeCompositeScore": 93.26,
+      "scoreThreshold": 75,
+      "leniencyWindow": 2,
+      "meetsScoreThreshold": true,
+      "meetsCoreSignals": true,
+      "verdict": "pass"
+    },
+    {
+      "scenarioId": "security",
+      "scenarioCategory": "security",
+      "writerToken": "W1",
+      "writerModelId": null,
+      "judgeModelId": "judge-claude-audit",
+      "blindPairId": "security:W1:judge-claude-audit",
+      "writerCompositeScore": 94.42,
+      "judgeCompositeScore": 92.42,
+      "scoreThreshold": 75,
+      "leniencyWindow": 2,
+      "meetsScoreThreshold": true,
+      "meetsCoreSignals": true,
+      "verdict": "pass"
+    },
+    {
+      "scenarioId": "security",
+      "scenarioCategory": "security",
+      "writerToken": "W1",
+      "writerModelId": null,
+      "judgeModelId": "judge-gpt-risk",
+      "blindPairId": "security:W1:judge-gpt-risk",
+      "writerCompositeScore": 94.42,
+      "judgeCompositeScore": 93.42,
+      "scoreThreshold": 75,
+      "leniencyWindow": 2,
+      "meetsScoreThreshold": true,
+      "meetsCoreSignals": true,
+      "verdict": "pass"
+    },
+    {
+      "scenarioId": "delivery",
+      "scenarioCategory": "delivery",
+      "writerToken": "W1",
+      "writerModelId": null,
+      "judgeModelId": "judge-claude-audit",
+      "blindPairId": "delivery:W1:judge-claude-audit",
+      "writerCompositeScore": 92.99,
+      "judgeCompositeScore": 93.99,
+      "scoreThreshold": 75,
+      "leniencyWindow": 2,
+      "meetsScoreThreshold": true,
+      "meetsCoreSignals": true,
+      "verdict": "pass"
+    },
+    {
+      "scenarioId": "delivery",
+      "scenarioCategory": "delivery",
+      "writerToken": "W1",
+      "writerModelId": null,
+      "judgeModelId": "judge-gpt-risk",
+      "blindPairId": "delivery:W1:judge-gpt-risk",
+      "writerCompositeScore": 92.99,
+      "judgeCompositeScore": 92.99,
+      "scoreThreshold": 75,
+      "leniencyWindow": 2,
+      "meetsScoreThreshold": true,
+      "meetsCoreSignals": true,
+      "verdict": "pass"
+    },
+    {
+      "scenarioId": "planning",
+      "scenarioCategory": "planning",
+      "writerToken": "W2",
+      "writerModelId": null,
+      "judgeModelId": "judge-claude-audit",
+      "blindPairId": "planning:W2:judge-claude-audit",
+      "writerCompositeScore": 93.12,
+      "judgeCompositeScore": 94.12,
+      "scoreThreshold": 75,
+      "leniencyWindow": 2,
+      "meetsScoreThreshold": true,
+      "meetsCoreSignals": true,
+      "verdict": "pass"
+    },
+    {
+      "scenarioId": "planning",
+      "scenarioCategory": "planning",
+      "writerToken": "W2",
+      "writerModelId": null,
+      "judgeModelId": "judge-gpt-risk",
+      "blindPairId": "planning:W2:judge-gpt-risk",
+      "writerCompositeScore": 93.12,
+      "judgeCompositeScore": 94.12,
+      "scoreThreshold": 75,
+      "leniencyWindow": 2,
+      "meetsScoreThreshold": true,
+      "meetsCoreSignals": true,
+      "verdict": "pass"
+    },
+    {
+      "scenarioId": "refactor",
+      "scenarioCategory": "refactor",
+      "writerToken": "W2",
+      "writerModelId": null,
+      "judgeModelId": "judge-claude-audit",
+      "blindPairId": "refactor:W2:judge-claude-audit",
+      "writerCompositeScore": 93.06,
+      "judgeCompositeScore": 95.06,
+      "scoreThreshold": 75,
+      "leniencyWindow": 2,
+      "meetsScoreThreshold": true,
+      "meetsCoreSignals": true,
+      "verdict": "pass"
+    },
+    {
+      "scenarioId": "refactor",
+      "scenarioCategory": "refactor",
+      "writerToken": "W2",
+      "writerModelId": null,
+      "judgeModelId": "judge-gpt-risk",
+      "blindPairId": "refactor:W2:judge-gpt-risk",
+      "writerCompositeScore": 93.06,
+      "judgeCompositeScore": 95.06,
+      "scoreThreshold": 75,
+      "leniencyWindow": 2,
+      "meetsScoreThreshold": true,
+      "meetsCoreSignals": true,
+      "verdict": "pass"
+    },
+    {
+      "scenarioId": "security",
+      "scenarioCategory": "security",
+      "writerToken": "W2",
+      "writerModelId": null,
+      "judgeModelId": "judge-claude-audit",
+      "blindPairId": "security:W2:judge-claude-audit",
+      "writerCompositeScore": 91.82,
+      "judgeCompositeScore": 90.82,
+      "scoreThreshold": 75,
+      "leniencyWindow": 2,
+      "meetsScoreThreshold": true,
+      "meetsCoreSignals": true,
+      "verdict": "pass"
+    },
+    {
+      "scenarioId": "security",
+      "scenarioCategory": "security",
+      "writerToken": "W2",
+      "writerModelId": null,
+      "judgeModelId": "judge-gpt-risk",
+      "blindPairId": "security:W2:judge-gpt-risk",
+      "writerCompositeScore": 91.82,
+      "judgeCompositeScore": 89.82,
+      "scoreThreshold": 75,
+      "leniencyWindow": 2,
+      "meetsScoreThreshold": true,
+      "meetsCoreSignals": true,
+      "verdict": "pass"
+    },
+    {
+      "scenarioId": "delivery",
+      "scenarioCategory": "delivery",
+      "writerToken": "W2",
+      "writerModelId": null,
+      "judgeModelId": "judge-claude-audit",
+      "blindPairId": "delivery:W2:judge-claude-audit",
+      "writerCompositeScore": 93.19,
+      "judgeCompositeScore": 93.19,
+      "scoreThreshold": 75,
+      "leniencyWindow": 2,
+      "meetsScoreThreshold": true,
+      "meetsCoreSignals": true,
+      "verdict": "pass"
+    },
+    {
+      "scenarioId": "delivery",
+      "scenarioCategory": "delivery",
+      "writerToken": "W2",
+      "writerModelId": null,
+      "judgeModelId": "judge-gpt-risk",
+      "blindPairId": "delivery:W2:judge-gpt-risk",
+      "writerCompositeScore": 93.19,
+      "judgeCompositeScore": 94.19,
+      "scoreThreshold": 75,
+      "leniencyWindow": 2,
+      "meetsScoreThreshold": true,
+      "meetsCoreSignals": true,
+      "verdict": "pass"
+    },
+    {
+      "scenarioId": "planning",
+      "scenarioCategory": "planning",
+      "writerToken": "W3",
+      "writerModelId": null,
+      "judgeModelId": "judge-claude-audit",
+      "blindPairId": "planning:W3:judge-claude-audit",
+      "writerCompositeScore": 93.27,
+      "judgeCompositeScore": 93.27,
+      "scoreThreshold": 75,
+      "leniencyWindow": 2,
+      "meetsScoreThreshold": true,
+      "meetsCoreSignals": true,
+      "verdict": "pass"
+    },
+    {
+      "scenarioId": "planning",
+      "scenarioCategory": "planning",
+      "writerToken": "W3",
+      "writerModelId": null,
+      "judgeModelId": "judge-gpt-risk",
+      "blindPairId": "planning:W3:judge-gpt-risk",
+      "writerCompositeScore": 93.27,
+      "judgeCompositeScore": 93.27,
+      "scoreThreshold": 75,
+      "leniencyWindow": 2,
+      "meetsScoreThreshold": true,
+      "meetsCoreSignals": true,
+      "verdict": "pass"
+    },
+    {
+      "scenarioId": "refactor",
+      "scenarioCategory": "refactor",
+      "writerToken": "W3",
+      "writerModelId": null,
+      "judgeModelId": "judge-claude-audit",
+      "blindPairId": "refactor:W3:judge-claude-audit",
+      "writerCompositeScore": 94.01,
+      "judgeCompositeScore": 95.01,
+      "scoreThreshold": 75,
+      "leniencyWindow": 2,
+      "meetsScoreThreshold": true,
+      "meetsCoreSignals": true,
+      "verdict": "pass"
+    },
+    {
+      "scenarioId": "refactor",
+      "scenarioCategory": "refactor",
+      "writerToken": "W3",
+      "writerModelId": null,
+      "judgeModelId": "judge-gpt-risk",
+      "blindPairId": "refactor:W3:judge-gpt-risk",
+      "writerCompositeScore": 94.01,
+      "judgeCompositeScore": 95.01,
+      "scoreThreshold": 75,
+      "leniencyWindow": 2,
+      "meetsScoreThreshold": true,
+      "meetsCoreSignals": true,
+      "verdict": "pass"
+    },
+    {
+      "scenarioId": "security",
+      "scenarioCategory": "security",
+      "writerToken": "W3",
+      "writerModelId": null,
+      "judgeModelId": "judge-claude-audit",
+      "blindPairId": "security:W3:judge-claude-audit",
+      "writerCompositeScore": 92.37,
+      "judgeCompositeScore": 92.37,
+      "scoreThreshold": 75,
+      "leniencyWindow": 2,
+      "meetsScoreThreshold": true,
+      "meetsCoreSignals": true,
+      "verdict": "pass"
+    },
+    {
+      "scenarioId": "security",
+      "scenarioCategory": "security",
+      "writerToken": "W3",
+      "writerModelId": null,
+      "judgeModelId": "judge-gpt-risk",
+      "blindPairId": "security:W3:judge-gpt-risk",
+      "writerCompositeScore": 92.37,
+      "judgeCompositeScore": 94.37,
+      "scoreThreshold": 75,
+      "leniencyWindow": 2,
+      "meetsScoreThreshold": true,
+      "meetsCoreSignals": true,
+      "verdict": "pass"
+    },
+    {
+      "scenarioId": "delivery",
+      "scenarioCategory": "delivery",
+      "writerToken": "W3",
+      "writerModelId": null,
+      "judgeModelId": "judge-claude-audit",
+      "blindPairId": "delivery:W3:judge-claude-audit",
+      "writerCompositeScore": 90.94,
+      "judgeCompositeScore": 89.94,
+      "scoreThreshold": 75,
+      "leniencyWindow": 2,
+      "meetsScoreThreshold": true,
+      "meetsCoreSignals": true,
+      "verdict": "pass"
+    },
+    {
+      "scenarioId": "delivery",
+      "scenarioCategory": "delivery",
+      "writerToken": "W3",
+      "writerModelId": null,
+      "judgeModelId": "judge-gpt-risk",
+      "blindPairId": "delivery:W3:judge-gpt-risk",
+      "writerCompositeScore": 90.94,
+      "judgeCompositeScore": 92.94,
+      "scoreThreshold": 75,
+      "leniencyWindow": 2,
+      "meetsScoreThreshold": true,
+      "meetsCoreSignals": true,
+      "verdict": "pass"
+    }
+  ],
+  "summary": {
+    "passCount": 24,
+    "failCount": 0,
+    "passRatePercent": 100
+  },
+  "executions": [
+    {
+      "scriptPath": "scripts/detection-benchmark.mjs",
+      "exitCode": 0,
+      "parseError": null,
+      "reportName": null,
+      "passed": null
+    },
+    {
+      "scriptPath": "scripts/token-optimization-benchmark.mjs",
+      "exitCode": 0,
+      "parseError": null,
+      "reportName": "token-optimization-benchmark",
+      "passed": null
+    },
+    {
+      "scriptPath": "scripts/benchmark-gate.mjs",
+      "exitCode": 0,
+      "parseError": null,
+      "reportName": "benchmark-gate",
+      "passed": true
+    },
+    {
+      "scriptPath": "scripts/benchmark-intelligence.mjs",
+      "exitCode": 0,
+      "parseError": null,
+      "reportName": "benchmark-intelligence",
+      "passed": true
+    }
+  ]
+}
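Each matrix row carries enough fields to re-derive its verdict. In the script added by this release, a row passes when the judge score is at least `scoreThreshold - leniencyWindow` and the core signals are met. The sketch below restates that rule for a single row; `verdictFor` is a hypothetical name, and the two "hypothetical" rows are invented to show the boundary and do not appear in the committed report.

```javascript
// Hypothetical helper restating the pass rule for one comparisonMatrix row.
function verdictFor(row) {
  const meetsScoreThreshold = row.judgeCompositeScore >= row.scoreThreshold - row.leniencyWindow;
  return meetsScoreThreshold && row.meetsCoreSignals ? 'pass' : 'needs-improvement';
}

// Lowest judge score in the committed matrix (security, W2, judge-gpt-risk).
const lowestRow = { judgeCompositeScore: 89.82, scoreThreshold: 75, leniencyWindow: 2, meetsCoreSignals: true };

// Hypothetical rows on either side of the effective cutoff of 73 (75 - 2).
const insideWindow = { judgeCompositeScore: 73.5, scoreThreshold: 75, leniencyWindow: 2, meetsCoreSignals: true };
const belowWindow = { judgeCompositeScore: 72.5, scoreThreshold: 75, leniencyWindow: 2, meetsCoreSignals: true };

// Row count implied by the methodology block: writers x scenarios x judges.
const expectedRowCount = 3 * 4 * 2;

console.log(verdictFor(lowestRow)); // pass
console.log(verdictFor(insideWindow)); // pass
console.log(verdictFor(belowWindow)); // needs-improvement
console.log(expectedRowCount); // 24
```

The expected row count matches `summary.passCount` in the report, and the leniency window means a score slightly under the nominal threshold of 75 can still pass.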

package/.cursorrules CHANGED

@@ -1,6 +1,6 @@
 # AGENTIC-SENIOR-CORE DYNAMIC GOVERNANCE RULESET
-Generated by Agentic-Senior-Core CLI v2.0.15
+Generated by Agentic-Senior-Core CLI v2.0.17
 Timestamp: 2026-04-08T14:58:53.570Z
 Selected profile: beginner
 Selected policy file: .agent-context/policies/llm-judge-threshold.json

package/.windsurfrules CHANGED

@@ -1,6 +1,6 @@
 # AGENTIC-SENIOR-CORE DYNAMIC GOVERNANCE RULESET
-Generated by Agentic-Senior-Core CLI v2.0.15
+Generated by Agentic-Senior-Core CLI v2.0.17
 Timestamp: 2026-04-08T14:58:53.570Z
 Selected profile: beginner
 Selected policy file: .agent-context/policies/llm-judge-threshold.json

package/README.md CHANGED

@@ -261,6 +261,26 @@ For CI pipelines that only need stdout JSON:
 node ./scripts/benchmark-evidence-bundle.mjs --stdout-only
 ```
+### Writer-Judge Comparison Matrix (V2.5.1)
+Generate a blind-review writer-judge matrix with independent lane configuration:
+```bash
+npm run benchmark:writer-judge
+```
+This command writes:
+- `.agent-context/state/benchmark-writer-judge-matrix.json`
+Writer and judge lane configuration is stored in:
+- `.agent-context/state/benchmark-writer-judge-config.json`
+For CI pipelines that only need stdout JSON:
+```bash
+node ./scripts/benchmark-writer-judge-matrix.mjs --stdout-only
+```
 ### Install and Setup Choices
 The CLI now supports a smaller decision surface for first-time setup:

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "@ryuenn3123/agentic-senior-core",
-  "version": "2.0.15",
+  "version": "2.0.17",
   "type": "module",
   "description": "Force your AI Agent to code like a Staff Engineer, not a Junior.",
   "bin": {
@@ -49,6 +49,7 @@
     "benchmark:detection": "node ./scripts/detection-benchmark.mjs",
     "benchmark:token": "node ./scripts/token-optimization-benchmark.mjs",
     "benchmark:bundle": "node ./scripts/benchmark-evidence-bundle.mjs",
+    "benchmark:writer-judge": "node ./scripts/benchmark-writer-judge-matrix.mjs",
     "benchmark:gate": "node ./scripts/benchmark-gate.mjs",
     "benchmark:intelligence": "node ./scripts/benchmark-intelligence.mjs",
     "report:quality-trend": "node ./scripts/quality-trend-report.mjs",

package/scripts/benchmark-writer-judge-matrix.mjs ADDED

@@ -0,0 +1,383 @@
+#!/usr/bin/env node
+/**
+ * benchmark-writer-judge-matrix.mjs
+ *
+ * V2.5.1 writer-judge architecture artifact.
+ * Builds side-by-side comparison matrix using independently configured
+ * writer and judge lanes with blind review tokens.
+ */
+import { existsSync, readFileSync } from 'node:fs';
+import fs from 'node:fs/promises';
+import { spawnSync } from 'node:child_process';
+import { dirname, join, resolve } from 'node:path';
+import { fileURLToPath } from 'node:url';
+const SCRIPT_FILE_PATH = fileURLToPath(import.meta.url);
+const SCRIPT_DIR = dirname(SCRIPT_FILE_PATH);
+const REPOSITORY_ROOT = resolve(SCRIPT_DIR, '..');
+const ARGUMENT_FLAGS = new Set(process.argv.slice(2));
+const isStdoutOnlyMode = ARGUMENT_FLAGS.has('--stdout-only');
+const CONFIG_PATH = join(REPOSITORY_ROOT, '.agent-context', 'state', 'benchmark-writer-judge-config.json');
+const REPRO_PROFILE_PATH = join(REPOSITORY_ROOT, '.agent-context', 'state', 'benchmark-reproducibility.json');
+const THRESHOLD_PATH = join(REPOSITORY_ROOT, '.agent-context', 'state', 'benchmark-thresholds.json');
+const OUTPUT_PATH = join(REPOSITORY_ROOT, '.agent-context', 'state', 'benchmark-writer-judge-matrix.json');
+function readJsonOrNull(filePath) {
+  if (!existsSync(filePath)) {
+    return null;
+  }
+  try {
+    return JSON.parse(readFileSync(filePath, 'utf8'));
+  } catch {
+    return null;
+  }
+}
+function runJsonScript(scriptRelativePath, scriptArguments = []) {
+  const absoluteScriptPath = join(REPOSITORY_ROOT, scriptRelativePath);
+  const commandResult = spawnSync('node', [absoluteScriptPath, ...scriptArguments], {
+    cwd: REPOSITORY_ROOT,
+    encoding: 'utf8',
+    maxBuffer: 1024 * 1024 * 10,
+  });
+  const stdoutContent = (commandResult.stdout || '').trim();
+  const stderrContent = (commandResult.stderr || '').trim();
+  const exitCode = typeof commandResult.status === 'number' ? commandResult.status : 1;
+  if (!stdoutContent) {
+    return {
+      scriptPath: scriptRelativePath,
+      exitCode,
+      parsedReport: null,
+      parseError: 'Script produced no stdout JSON payload',
+      stderr: stderrContent,
+    };
+  }
+  try {
+    return {
+      scriptPath: scriptRelativePath,
+      exitCode,
+      parsedReport: JSON.parse(stdoutContent),
+      parseError: null,
+      stderr: stderrContent,
+    };
+  } catch (jsonParseError) {
+    const parseErrorMessage = jsonParseError instanceof Error ? jsonParseError.message : String(jsonParseError);
+    return {
+      scriptPath: scriptRelativePath,
+      exitCode,
+      parsedReport: null,
+      parseError: parseErrorMessage,
+      stderr: stderrContent,
+    };
+  }
+}
+function deterministicOffset(seed, maxMagnitude = 3) {
+  let hash = 0;
+  for (let index = 0; index < seed.length; index += 1) {
+    hash = ((hash << 5) - hash) + seed.charCodeAt(index);
+    hash |= 0;
+  }
+  const spread = (maxMagnitude * 2) + 1;
+  const normalizedValue = Math.abs(hash) % spread;
+  return normalizedValue - maxMagnitude;
+}
+function clamp(value, minimum, maximum) {
+  return Math.min(Math.max(value, minimum), maximum);
+}
+function roundToTwo(value) {
+  return Number(value.toFixed(2));
+}
+function buildDefaultConfig() {
+  return {
+    version: '1.0.0',
+    phase: 'v2.5.1',
+    blindReviewMode: true,
+    writerLane: {
+      models: [{ id: 'writer-default', provider: 'local', profile: 'balanced' }],
+      weights: {
+        quality: 40,
+        efficiency: 20,
+        reliability: 25,
+        freshness: 15,
+      },
+      scenarioMultipliers: {
+        planning: 1,
+        refactor: 1,
+        security: 1,
+        delivery: 1,
+      },
+    },
+    judgeLane: {
+      models: [{ id: 'judge-default', provider: 'local', profile: 'audit' }],
+      minimumCompositeScore: 75,
+      leniencyWindow: 2,
+      weights: {
+        clarity: 35,
+        correctness: 35,
+        risk: 20,
+        consistency: 10,
+      },
+    },
+  };
+}
+function loadScenarios(reproducibilityProfile) {
+  const defaultScenarios = [
+    { id: 'planning', category: 'planning' },
+    { id: 'refactor', category: 'refactor' },
+    { id: 'security', category: 'security' },
+    { id: 'delivery', category: 'delivery' },
+  ];
+  if (!Array.isArray(reproducibilityProfile?.scenarios) || reproducibilityProfile.scenarios.length === 0) {
+    return defaultScenarios;
+  }
+  return reproducibilityProfile.scenarios.map((scenarioEntry) => ({
+    id: scenarioEntry.id || 'unknown-scenario',
+    category: scenarioEntry.category || 'planning',
+  }));
+}
+function buildBaseSignals(detectionBenchmarkReport, tokenBenchmarkReport, benchmarkGateReport, benchmarkIntelligenceReport, thresholdConfiguration) {
+  const staleWatchlistCount = Array.isArray(benchmarkIntelligenceReport?.watchlist)
+    ? benchmarkIntelligenceReport.watchlist.filter((watchlistEntry) => watchlistEntry?.stale === true).length
+    : 0;
+  const top1Accuracy = Number(detectionBenchmarkReport?.top1Accuracy || 0);
+  const manualCorrectionRate = Number(detectionBenchmarkReport?.manualCorrectionRate || 1);
+  return {
+    top1Accuracy,
+    manualCorrectionRate,
+    nativeSavingsPercent: Number(tokenBenchmarkReport?.summary?.averageNativeSavingsPercent || 0),
+    benchmarkGatePassed: benchmarkGateReport?.passed === true,
+    benchmarkGateFailureCount: Number(benchmarkGateReport?.failureCount || 0),
+    intelligenceFailureCount: Number(benchmarkIntelligenceReport?.failureCount || 0),
+    staleWatchlistCount,
+    top1AccuracyMet: top1Accuracy >= Number(thresholdConfiguration?.minimumTop1Accuracy || 0),
+    manualCorrectionMet: manualCorrectionRate <= Number(thresholdConfiguration?.maximumManualCorrectionRate || 1),
+  };
+}
+function buildWriterScenarioRun(writerModel, scenario, baseSignals, writerWeights, scenarioMultipliers) {
+  const scenarioMultiplier = Number(scenarioMultipliers?.[scenario.category] || 1);
+  const modelScenarioOffset = deterministicOffset(`${writerModel.id}:${scenario.id}`, 4);
+  const qualityScore = clamp((baseSignals.top1Accuracy * 100 * scenarioMultiplier) + modelScenarioOffset, 0, 100);
+  const efficiencyScore = clamp(baseSignals.nativeSavingsPercent + deterministicOffset(`${writerModel.id}:efficiency`, 3), 0, 100);
+  const reliabilityScore = baseSignals.benchmarkGatePassed
+    ? clamp(100 + deterministicOffset(`${writerModel.id}:reliability`, 2), 0, 100)
+    : clamp(100 - (baseSignals.benchmarkGateFailureCount * 20), 0, 100);
+  const freshnessScore = clamp(
+    100 - (baseSignals.intelligenceFailureCount * 15) - (baseSignals.staleWatchlistCount * 10) + deterministicOffset(`${writerModel.id}:freshness`, 2),
+    0,
+    100
+  );
+  const weightedCompositeScore = (
+    (qualityScore * Number(writerWeights.quality || 0))
+    + (efficiencyScore * Number(writerWeights.efficiency || 0))
+    + (reliabilityScore * Number(writerWeights.reliability || 0))
+    + (freshnessScore * Number(writerWeights.freshness || 0))
+  ) / 100;
+  return {
+    scenarioId: scenario.id,
+    scenarioCategory: scenario.category,
+    scoreBreakdown: {
+      quality: roundToTwo(qualityScore),
+      efficiency: roundToTwo(efficiencyScore),
+      reliability: roundToTwo(reliabilityScore),
+      freshness: roundToTwo(freshnessScore),
+    },
+    compositeScore: roundToTwo(weightedCompositeScore),
+    top1AccuracyMet: baseSignals.top1AccuracyMet,
+    manualCorrectionMet: baseSignals.manualCorrectionMet,
+  };
+}
+function evaluateJudgeForScenario(writerScenarioRun, writerToken, judgeModel, judgeLaneConfig, blindReviewMode) {
+  const judgeOffset = deterministicOffset(`${judgeModel.id}:${writerScenarioRun.scenarioId}:${writerToken}`, 2);
+  const judgeCompositeScore = clamp(writerScenarioRun.compositeScore + judgeOffset, 0, 100);
+  const minimumCompositeScore = Number(judgeLaneConfig.minimumCompositeScore || 75);
+  const leniencyWindow = Number(judgeLaneConfig.leniencyWindow || 0);
+  const meetsScoreThreshold = judgeCompositeScore >= (minimumCompositeScore - leniencyWindow);
+  const meetsCoreSignals = writerScenarioRun.top1AccuracyMet && writerScenarioRun.manualCorrectionMet;
+  const verdict = (meetsScoreThreshold && meetsCoreSignals) ? 'pass' : 'needs-improvement';
+  return {
+    scenarioId: writerScenarioRun.scenarioId,
+    scenarioCategory: writerScenarioRun.scenarioCategory,
+    writerToken,
+    writerModelId: blindReviewMode ? null : writerToken,
+    judgeModelId: judgeModel.id,
+    blindPairId: `${writerScenarioRun.scenarioId}:${writerToken}:${judgeModel.id}`,
+    writerCompositeScore: writerScenarioRun.compositeScore,
+    judgeCompositeScore: roundToTwo(judgeCompositeScore),
+    scoreThreshold: minimumCompositeScore,
+    leniencyWindow,
+    meetsScoreThreshold,
+    meetsCoreSignals,
+    verdict,
+  };
+}
+function summarizeExecutions(executions) {
+  return executions.map((executionResult) => ({
+    scriptPath: executionResult.scriptPath,
+    exitCode: executionResult.exitCode,
+    parseError: executionResult.parseError,
+    reportName: executionResult.parsedReport?.reportName || executionResult.parsedReport?.gateName || null,
+    passed: typeof executionResult.parsedReport?.passed === 'boolean'
+      ? executionResult.parsedReport.passed
+      : null,
+  }));
+}
+function buildWriterLaneRuns(writerModels, scenarios, baseSignals, writerLaneConfig) {
+  return writerModels.map((writerModel, writerIndex) => {
+    const writerToken = `W${writerIndex + 1}`;
+    const scenarioRuns = scenarios.map((scenario) => buildWriterScenarioRun(
+      writerModel,
+      scenario,
+      baseSignals,
+      writerLaneConfig.weights || {},
+      writerLaneConfig.scenarioMultipliers || {}
+    ));
+    const averageCompositeScore = scenarioRuns.length === 0
+      ? 0
+      : roundToTwo(scenarioRuns.reduce((sum, scenarioRun) => sum + scenarioRun.compositeScore, 0) / scenarioRuns.length);
+    return {
+      writerToken,
+      writerModel,
+      averageCompositeScore,
+      scenarioRuns,
+    };
+  });
+}
+function buildJudgeLaneRuns(writerLaneRuns, judgeModels, judgeLaneConfig, blindReviewMode) {
+  const matrixRows = [];
+  for (const writerLaneRun of writerLaneRuns) {
+    for (const writerScenarioRun of writerLaneRun.scenarioRuns) {
+      for (const judgeModel of judgeModels) {
+        matrixRows.push(
+          evaluateJudgeForScenario(writerScenarioRun, writerLaneRun.writerToken, judgeModel, judgeLaneConfig, blindReviewMode)
+        );
+      }
+    }
+  }
+  return matrixRows;
+}
+async function runWriterJudgeMatrix() {
+  const writerJudgeConfig = readJsonOrNull(CONFIG_PATH) || buildDefaultConfig();
+  const reproducibilityProfile = readJsonOrNull(REPRO_PROFILE_PATH) || { scenarios: [] };
+  const thresholdConfiguration = readJsonOrNull(THRESHOLD_PATH) || {};
+  const detectionBenchmarkExecution = runJsonScript('scripts/detection-benchmark.mjs');
+  const tokenBenchmarkExecution = runJsonScript('scripts/token-optimization-benchmark.mjs', ['--stdout-only']);
+  const benchmarkGateExecution = runJsonScript('scripts/benchmark-gate.mjs');
+  const benchmarkIntelligenceExecution = runJsonScript('scripts/benchmark-intelligence.mjs');
+  const executionSummaries = summarizeExecutions([
+    detectionBenchmarkExecution,
+    tokenBenchmarkExecution,
+    benchmarkGateExecution,
+    benchmarkIntelligenceExecution,
+  ]);
+  const executionFailureCount = executionSummaries.filter((executionSummary) => executionSummary.parseError).length;
+  const scenarios = loadScenarios(reproducibilityProfile);
+  const baseSignals = buildBaseSignals(
+    detectionBenchmarkExecution.parsedReport,
+    tokenBenchmarkExecution.parsedReport,
+    benchmarkGateExecution.parsedReport,
+    benchmarkIntelligenceExecution.parsedReport,
+    thresholdConfiguration
+  );
+  const writerModels = Array.isArray(writerJudgeConfig?.writerLane?.models) && writerJudgeConfig.writerLane.models.length > 0
+    ? writerJudgeConfig.writerLane.models
+    : buildDefaultConfig().writerLane.models;
+  const judgeModels = Array.isArray(writerJudgeConfig?.judgeLane?.models) && writerJudgeConfig.judgeLane.models.length > 0
+    ? writerJudgeConfig.judgeLane.models
+    : buildDefaultConfig().judgeLane.models;
+  const writerLaneRuns = buildWriterLaneRuns(
+    writerModels,
+    scenarios,
+    baseSignals,
+    writerJudgeConfig.writerLane || buildDefaultConfig().writerLane
+  );
+  const comparisonMatrix = buildJudgeLaneRuns(
+    writerLaneRuns,
+    judgeModels,
+    writerJudgeConfig.judgeLane || buildDefaultConfig().judgeLane,
+    writerJudgeConfig.blindReviewMode !== false
+  );
+  const passCount = comparisonMatrix.filter((matrixRow) => matrixRow.verdict === 'pass').length;
+  const passRatePercent = comparisonMatrix.length === 0
+    ? 0
+    : roundToTwo((passCount / comparisonMatrix.length) * 100);
+  const writerJudgeReport = {
+    generatedAt: new Date().toISOString(),
+    reportName: 'benchmark-writer-judge-matrix',
+    phase: 'v2.5.1',
+    passed: executionFailureCount === 0,
+    failureCount: executionFailureCount,
+    methodology: {
+      blindReviewMode: writerJudgeConfig.blindReviewMode !== false,
+      writerLaneModelCount: writerModels.length,
+      judgeLaneModelCount: judgeModels.length,
+      scenarioCount: scenarios.length,
+      writerWeights: writerJudgeConfig?.writerLane?.weights || null,
+      judgeWeights: writerJudgeConfig?.judgeLane?.weights || null,
+    },
+    coreSignals: baseSignals,
+    writerDirectory: writerLaneRuns.map((writerLaneRun) => ({
+      writerToken: writerLaneRun.writerToken,
+      writerModel: writerLaneRun.writerModel,
+      averageCompositeScore: writerLaneRun.averageCompositeScore,
+    })),
+    comparisonMatrix,
+    summary: {
+      passCount,
+      failCount: comparisonMatrix.length - passCount,
+      passRatePercent,
+    },
+    executions: executionSummaries,
+  };
+  if (!isStdoutOnlyMode) {
+    await fs.writeFile(OUTPUT_PATH, JSON.stringify(writerJudgeReport, null, 2) + '\n', 'utf8');
+  }
+  console.log(JSON.stringify(writerJudgeReport, null, 2));
+  process.exit(writerJudgeReport.passed ? 0 : 1);
+}
+runWriterJudgeMatrix();
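
Reviewer note: the verdict rule in `evaluateJudgeForScenario` above can be sketched in isolation as follows. This is a minimal sketch, not the package's code: `decideVerdict` is a hypothetical helper name, and the numeric scores are made-up sample values. It only mirrors the three lines in the diff that compute `meetsScoreThreshold`, `meetsCoreSignals`, and `verdict`.

```javascript
// Hypothetical helper (not part of the package): reproduces the verdict rule
// shown in the diff so the effect of leniencyWindow can be checked by hand.
function decideVerdict({
  judgeCompositeScore,
  minimumCompositeScore,
  leniencyWindow,
  top1AccuracyMet,
  manualCorrectionMet,
}) {
  // A missing or falsy window falls back to 0, as in the diff.
  const window = Number(leniencyWindow || 0);
  // The window lowers the effective threshold; it never raises the score.
  const meetsScoreThreshold = judgeCompositeScore >= (minimumCompositeScore - window);
  // Both core signals must hold; a high score alone is not enough.
  const meetsCoreSignals = Boolean(top1AccuracyMet && manualCorrectionMet);
  return (meetsScoreThreshold && meetsCoreSignals) ? 'pass' : 'needs-improvement';
}

// Sample values below are assumptions chosen for illustration only.
const insideWindow = decideVerdict({
  judgeCompositeScore: 78,
  minimumCompositeScore: 80,
  leniencyWindow: 3,
  top1AccuracyMet: true,
  manualCorrectionMet: true,
});
const missingSignal = decideVerdict({
  judgeCompositeScore: 90,
  minimumCompositeScore: 80,
  leniencyWindow: 3,
  top1AccuracyMet: true,
  manualCorrectionMet: false,
});
console.log(insideWindow, missingSignal);
```

Under these assumed inputs, a score of 78 clears an 80-point threshold because the 3-point window lowers the bar to 77, while a score of 90 still fails when one core signal is unmet. Note that the report-level `passed` flag in the diff depends only on `executionFailureCount`, not on these per-row verdicts.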

package/scripts/validate.mjs CHANGED

@@ -55,15 +55,27 @@ const FORMAL_ARTIFACT_PATHS = [
 const REQUIRED_HUMAN_WRITING_SNIPPETS = [
   {
     path: '.agent-context/rules/api-docs.md',
-    snippets: ['## Human Writing Standard (Mandatory)', 'No emoji in formal artifacts.'],
+    snippets: [
+      '## Human Writing Standard (Mandatory)',
+      'This applies to documentation, release notes, onboarding text, review summaries, and agent-facing explanations.',
+      'No emoji in formal artifacts.',
+    ],
   },
   {
     path: '.agent-context/review-checklists/pr-checklist.md',
-    snippets: ['No emoji in formal documentation or review summaries', 'Documentation uses plain English and avoids AI cliches'],
+    snippets: [
+      'Scope applied: This applies to documentation, release notes, onboarding text, review summaries, and agent-facing explanations',
+      'No emoji in formal documentation or review summaries',
+      'Documentation uses plain English and avoids AI cliches',
+    ],
   },
   {
     path: 'docs/deep_analysis_and_roadmap_backlog.md',
-    snippets: ['## Part 6: Documentation and Explanation Standards (Mandatory)', 'No emoji in formal artifacts. This is mandatory.'],
+    snippets: [
+      '## Part 6: Documentation and Explanation Standards (Mandatory)',
+      'This applies to documentation, release notes, onboarding text, review summaries, and agent-facing explanations.',
+      'No emoji in formal artifacts. This is mandatory.',
+    ],
   },
 ];
@@ -149,6 +161,7 @@ async function validateRequiredFiles() {
     'scripts/llm-judge.mjs',
     'scripts/detection-benchmark.mjs',
     'scripts/benchmark-evidence-bundle.mjs',
+    'scripts/benchmark-writer-judge-matrix.mjs',
     'scripts/benchmark-gate.mjs',
     'scripts/benchmark-intelligence.mjs',
     'scripts/governance-weekly-report.mjs',
@@ -175,6 +188,7 @@ async function validateRequiredFiles() {
     'docs/v1.8-operations-playbook.md',
     'docs/v2-upgrade-playbook.md',
     '.agent-context/state/benchmark-reproducibility.json',
+    '.agent-context/state/benchmark-writer-judge-config.json',
     '.agent-context/state/benchmark-watchlist.json',
     '.agent-context/state/skill-platform.json',
     '.agent-context/skills/index.json',
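
Reviewer note: the diff above only shows the `REQUIRED_HUMAN_WRITING_SNIPPETS` data, not the code that consumes it. The sketch below shows one plausible way such entries could be checked, assuming a plain substring match. `findMissingSnippets` and the sample file text are hypothetical and are not taken from `validate.mjs`.

```javascript
// Hypothetical check (an assumption, not the package's implementation):
// return every required snippet that does not appear verbatim in the text.
function findMissingSnippets(fileContent, requiredSnippets) {
  return requiredSnippets.filter((snippet) => !fileContent.includes(snippet));
}

// One entry copied from the diff above.
const apiDocsRequirement = {
  path: '.agent-context/rules/api-docs.md',
  snippets: [
    '## Human Writing Standard (Mandatory)',
    'This applies to documentation, release notes, onboarding text, review summaries, and agent-facing explanations.',
    'No emoji in formal artifacts.',
  ],
};

// Made-up file content that still lacks the scope sentence added in this release.
const sampleFileContent = [
  '## Human Writing Standard (Mandatory)',
  'No emoji in formal artifacts.',
].join('\n');

const missingSnippets = findMissingSnippets(sampleFileContent, apiDocsRequirement.snippets);
console.log(missingSnippets);
```

With a substring match like this, a file that predates the change would report the scope sentence as missing. That is consistent with the same sentence also being added to `pr-checklist.md` and `review-code.md` in this release.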