universal-dev-standards 5.4.0 → 5.5.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/bundled/ai/standards/adversarial-test.ai.yaml +277 -0
- package/bundled/ai/standards/audit-trail.ai.yaml +113 -0
- package/bundled/ai/standards/chaos-injection-tests.ai.yaml +91 -0
- package/bundled/ai/standards/container-image-standards.ai.yaml +88 -0
- package/bundled/ai/standards/container-security.ai.yaml +331 -0
- package/bundled/ai/standards/cost-budget-test.ai.yaml +96 -0
- package/bundled/ai/standards/data-contract.ai.yaml +110 -0
- package/bundled/ai/standards/data-migration-testing.ai.yaml +96 -0
- package/bundled/ai/standards/data-pipeline.ai.yaml +113 -0
- package/bundled/ai/standards/disaster-recovery-drill.ai.yaml +89 -0
- package/bundled/ai/standards/flaky-test-management.ai.yaml +89 -0
- package/bundled/ai/standards/flow-based-testing.ai.yaml +240 -0
- package/bundled/ai/standards/iac-design-principles.ai.yaml +83 -0
- package/bundled/ai/standards/incident-response.ai.yaml +107 -0
- package/bundled/ai/standards/license-compliance.ai.yaml +106 -0
- package/bundled/ai/standards/llm-output-validation.ai.yaml +269 -0
- package/bundled/ai/standards/mock-boundary.ai.yaml +250 -0
- package/bundled/ai/standards/mutation-testing.ai.yaml +192 -0
- package/bundled/ai/standards/pii-classification.ai.yaml +109 -0
- package/bundled/ai/standards/policy-as-code-testing.ai.yaml +227 -0
- package/bundled/ai/standards/prd-standards.ai.yaml +88 -0
- package/bundled/ai/standards/product-metrics-standards.ai.yaml +111 -0
- package/bundled/ai/standards/prompt-regression.ai.yaml +94 -0
- package/bundled/ai/standards/property-based-testing.ai.yaml +105 -0
- package/bundled/ai/standards/release-quality-manifest.ai.yaml +135 -0
- package/bundled/ai/standards/replay-test.ai.yaml +111 -0
- package/bundled/ai/standards/runbook.ai.yaml +104 -0
- package/bundled/ai/standards/sast-advanced.ai.yaml +135 -0
- package/bundled/ai/standards/schema-evolution.ai.yaml +111 -0
- package/bundled/ai/standards/secret-management-standards.ai.yaml +105 -0
- package/bundled/ai/standards/secure-op.ai.yaml +365 -0
- package/bundled/ai/standards/security-testing.ai.yaml +171 -0
- package/bundled/ai/standards/server-ops-security.ai.yaml +274 -0
- package/bundled/ai/standards/slo-sli.ai.yaml +97 -0
- package/bundled/ai/standards/smoke-test.ai.yaml +87 -0
- package/bundled/ai/standards/supply-chain-attestation.ai.yaml +109 -0
- package/bundled/ai/standards/test-completeness-dimensions.ai.yaml +52 -5
- package/bundled/ai/standards/user-story-mapping.ai.yaml +108 -0
- package/bundled/core/adversarial-test.md +212 -0
- package/bundled/core/chaos-injection-tests.md +116 -0
- package/bundled/core/container-security.md +521 -0
- package/bundled/core/cost-budget-test.md +69 -0
- package/bundled/core/data-migration-testing.md +110 -0
- package/bundled/core/disaster-recovery-drill.md +73 -0
- package/bundled/core/flaky-test-management.md +73 -0
- package/bundled/core/flow-based-testing.md +142 -0
- package/bundled/core/llm-output-validation.md +178 -0
- package/bundled/core/mock-boundary.md +100 -0
- package/bundled/core/mutation-testing.md +97 -0
- package/bundled/core/policy-as-code-testing.md +188 -0
- package/bundled/core/prompt-regression.md +72 -0
- package/bundled/core/property-based-testing.md +73 -0
- package/bundled/core/release-quality-manifest.md +147 -0
- package/bundled/core/replay-test.md +86 -0
- package/bundled/core/sast-advanced.md +300 -0
- package/bundled/core/secure-op.md +314 -0
- package/bundled/core/security-testing.md +87 -0
- package/bundled/core/server-ops-security.md +493 -0
- package/bundled/core/smoke-test.md +65 -0
- package/bundled/core/supply-chain-attestation.md +117 -0
- package/bundled/locales/zh-CN/CHANGELOG.md +3 -3
- package/bundled/locales/zh-CN/README.md +1 -1
- package/bundled/locales/zh-CN/skills/ai-instruction-standards/SKILL.md +5 -5
- package/bundled/locales/zh-TW/CHANGELOG.md +3 -3
- package/bundled/locales/zh-TW/README.md +1 -1
- package/bundled/locales/zh-TW/skills/ai-instruction-standards/SKILL.md +183 -79
- package/bundled/skills/README.md +4 -3
- package/bundled/skills/SKILL_NAMING.md +94 -0
- package/bundled/skills/ai-instruction-standards/SKILL.md +181 -88
- package/bundled/skills/atdd-assistant/SKILL.md +8 -0
- package/bundled/skills/bdd-assistant/SKILL.md +7 -0
- package/bundled/skills/checkin-assistant/SKILL.md +8 -0
- package/bundled/skills/code-review-assistant/SKILL.md +7 -0
- package/bundled/skills/journey-test-assistant/SKILL.md +203 -0
- package/bundled/skills/orchestrate/SKILL.md +167 -0
- package/bundled/skills/plan/SKILL.md +234 -0
- package/bundled/skills/pr-automation-assistant/SKILL.md +8 -0
- package/bundled/skills/push/SKILL.md +49 -2
- package/bundled/skills/{process-automation → skill-builder}/SKILL.md +1 -1
- package/bundled/skills/{forward-derivation → spec-derivation}/SKILL.md +1 -1
- package/bundled/skills/spec-driven-dev/SKILL.md +7 -0
- package/bundled/skills/sweep/SKILL.md +145 -0
- package/bundled/skills/tdd-assistant/SKILL.md +7 -0
- package/package.json +1 -1
- package/src/commands/flow.js +8 -0
- package/src/commands/start.js +14 -0
- package/src/commands/sweep.js +8 -0
- package/src/commands/workflow.js +8 -0
- package/standards-registry.json +426 -4
- package/bundled/locales/zh-CN/skills/ac-coverage-assistant/SKILL.md +0 -190
- package/bundled/locales/zh-CN/skills/forward-derivation/SKILL.md +0 -71
- package/bundled/locales/zh-CN/skills/forward-derivation/guide.md +0 -130
- package/bundled/locales/zh-CN/skills/methodology-system/SKILL.md +0 -88
- package/bundled/locales/zh-CN/skills/methodology-system/create-methodology.md +0 -350
- package/bundled/locales/zh-CN/skills/methodology-system/guide.md +0 -131
- package/bundled/locales/zh-CN/skills/methodology-system/runtime.md +0 -279
- package/bundled/locales/zh-CN/skills/process-automation/SKILL.md +0 -143
- package/bundled/locales/zh-TW/skills/ac-coverage-assistant/SKILL.md +0 -195
- package/bundled/locales/zh-TW/skills/deploy-assistant/SKILL.md +0 -178
- package/bundled/locales/zh-TW/skills/forward-derivation/SKILL.md +0 -69
- package/bundled/locales/zh-TW/skills/forward-derivation/guide.md +0 -415
- package/bundled/locales/zh-TW/skills/methodology-system/SKILL.md +0 -86
- package/bundled/locales/zh-TW/skills/methodology-system/create-methodology.md +0 -350
- package/bundled/locales/zh-TW/skills/methodology-system/guide.md +0 -131
- package/bundled/locales/zh-TW/skills/methodology-system/runtime.md +0 -279
- package/bundled/locales/zh-TW/skills/process-automation/SKILL.md +0 -144
- /package/bundled/skills/{ac-coverage-assistant → ac-coverage}/SKILL.md +0 -0
- /package/bundled/skills/{methodology-system → dev-methodology}/SKILL.md +0 -0
- /package/bundled/skills/{methodology-system → dev-methodology}/create-methodology.md +0 -0
- /package/bundled/skills/{methodology-system → dev-methodology}/guide.md +0 -0
- /package/bundled/skills/{methodology-system → dev-methodology}/integrated-flow.md +0 -0
- /package/bundled/skills/{methodology-system → dev-methodology}/prerequisite-check.md +0 -0
- /package/bundled/skills/{methodology-system → dev-methodology}/runtime.md +0 -0
- /package/bundled/skills/{forward-derivation → spec-derivation}/guide.md +0 -0
@@ -0,0 +1,107 @@
+# Incident Response Standards - AI Optimized
+# Source: XSPEC-063 Wave 3 SRE Pack
+
+id: incident-response
+title: Incident Response Standards
+version: "1.0.0"
+status: Active
+tags: [sre, incident, oncall, postmortem, severity]
+summary: |
+  Defines the end-to-end incident response lifecycle: severity classification,
+  response time SLAs, roles and responsibilities, communication protocols,
+  escalation paths, and blameless postmortem requirements. Designed to reduce
+  MTTR, ensure consistent stakeholder communication during incidents, and
+  drive systemic reliability improvements through structured postmortems.
+
+requirements:
+  - id: REQ-001
+    title: Severity Classification
+    description: |
+      Every incident MUST be classified at declaration using a 4-level
+      severity scale. SEV-1 (Critical): complete service outage or data
+      breach affecting all users, response within 15 minutes, C-suite
+      notification. SEV-2 (High): major feature unavailable or significant
+      performance degradation >25% users, response within 30 minutes.
+      SEV-3 (Medium): minor feature unavailable or degradation <25% users,
+      response within 4 hours. SEV-4 (Low): cosmetic issue or very minor
+      impact, response within 24 hours.
+    level: MUST
+    examples:
+      - "SEV-1: Payment processing completely down → IC declared, bridge opened in <5min"
+      - "SEV-2: Checkout latency >10s for 30% of users → on-call paged, IC assigned"
+      - "SEV-3: Search autocomplete broken → ticket created, fix in next sprint"
+
+  - id: REQ-002
+    title: Incident Commander Role
+    description: |
+      Every SEV-1 and SEV-2 incident MUST have a designated Incident
+      Commander (IC) assigned within 10 minutes of declaration. The IC is
+      responsible for: coordinating the response bridge, assigning roles
+      (scribe, comms lead, technical leads), driving timeline, making
+      go/no-go decisions on fixes, and initiating postmortem. The IC does
+      NOT directly troubleshoot — their sole focus is coordination.
+    level: MUST
+    examples:
+      - "IC assignment: 'I'm taking IC for this SEV-2, @alice please scribe, @bob comms'"
+      - "IC checklist: bridge open, roles assigned, status page updated, stakeholders notified"
+      - "IC does not debug code; delegates to technical leads who report findings"
+
+  - id: REQ-003
+    title: Stakeholder Communication Protocol
+    description: |
+      During SEV-1/SEV-2 incidents, stakeholder updates MUST be sent on a
+      defined cadence: initial notification within 15 minutes of declaration,
+      updates every 30 minutes until resolution, immediate update on any
+      severity change or major development. Updates MUST include: current
+      status, known impact, what is being done, next update time. Status
+      page MUST be updated simultaneously with internal communications.
+    level: MUST
+    examples:
+      - "T+15min: 'We are investigating elevated error rates on checkout. ~500 users affected. Next update in 30min.'"
+      - "T+45min: 'Root cause identified as database connection pool exhaustion. Fix in progress. Next update in 30min.'"
+      - "Resolution: 'Service restored at 14:32 UTC. Postmortem scheduled for 2026-05-02.'"
+
+  - id: REQ-004
+    title: Blameless Postmortem Requirements
+    description: |
+      Every SEV-1 incident MUST have a blameless postmortem completed within
+      5 business days. SEV-2 incidents MUST have a postmortem within 10
+      business days. Postmortems MUST be blameless — focusing on systemic
+      causes, not individual mistakes. Required sections: timeline, impact,
+      root cause (5 Whys), contributing factors, action items with owners
+      and due dates, lessons learned. Action items MUST be tracked to
+      completion.
+    level: MUST
+    examples:
+      - "Timeline: 14:00 Alert fired → 14:05 IC assigned → 14:23 RCA found → 14:32 resolved"
+      - "Root cause: 5 Whys tracing alert → connection pool exhaustion → missing timeout config → no review gate"
+      - "Action item: INFRA-2341 Add connection pool monitoring, owner: @dave, due: 2026-05-15"
+
+  - id: REQ-005
+    title: On-Call Rotation and Handoff
+    description: |
+      Every production service MUST have a documented on-call rotation with
+      at least 2 engineers. Rotation schedules MUST be published at least
+      2 weeks in advance. On-call handoff MUST include a written summary of:
+      active incidents, recent incidents still under investigation,
+      known flaky alerts, upcoming planned maintenance, and any service
+      health concerns. Handoff MUST be acknowledged by incoming on-call.
+    level: MUST
+    examples:
+      - "Handoff doc: 3 active flaky alerts (JIRA links), 1 ongoing SEV-3, maintenance window Friday 02:00 UTC"
+      - "Rotation: 1-week shifts, 2 primary + 1 secondary, published monthly in PagerDuty"
+      - "Handoff acknowledgment: incoming engineer replies 'Handoff received, I have the pager'"
+
+  - id: REQ-006
+    title: Incident Retrospective Metrics
+    description: |
+      Teams SHOULD track and review incident metrics monthly: MTTD (Mean
+      Time To Detect), MTTR (Mean Time To Resolve), incident frequency by
+      severity, repeat incidents (same root cause within 90 days), and
+      postmortem action item completion rate. These metrics SHOULD be
+      reviewed in monthly reliability reviews with engineering leadership.
+    level: SHOULD
+    examples:
+      - "Monthly metrics: MTTD 8min, MTTR 42min, 3 SEV-2s, 0 SEV-1s, 0 repeat incidents"
+      - "Repeat incident flag: same DB timeout root cause in March and April → escalate to reliability sprint"
+      - "Action item completion rate: 85% of previous month's items closed on time"
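The REQ-001 severity scale above maps directly onto a small lookup table. As a minimal illustrative sketch only (the `SEVERITY_SLA_MINUTES` table and `ackWithinSla` helper are hypothetical names, not part of the package):

```typescript
// Hypothetical encoding of the REQ-001 severity scale: each severity
// maps to its maximum first-response time in minutes.
type Severity = "SEV-1" | "SEV-2" | "SEV-3" | "SEV-4";

const SEVERITY_SLA_MINUTES: Record<Severity, number> = {
  "SEV-1": 15,      // complete outage / data breach, C-suite notification
  "SEV-2": 30,      // major feature down or >25% of users degraded
  "SEV-3": 4 * 60,  // minor feature down or <25% of users degraded
  "SEV-4": 24 * 60, // cosmetic or very minor impact
};

// Did the on-call acknowledge within the SLA for this severity?
function ackWithinSla(sev: Severity, declaredAt: Date, ackedAt: Date): boolean {
  const elapsedMinutes = (ackedAt.getTime() - declaredAt.getTime()) / 60_000;
  return elapsedMinutes <= SEVERITY_SLA_MINUTES[sev];
}
```

A check like this could back an alerting rule that pages the secondary on-call when an acknowledgment is about to breach its SLA.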
@@ -0,0 +1,106 @@
+# License Compliance Standards - AI Optimized
+# Source: XSPEC-066 Wave 3 Compliance Pack
+
+id: license-compliance
+title: License Compliance Standards
+version: "1.0.0"
+status: Active
+tags: [compliance, licensing, open-source, legal, supply-chain]
+summary: |
+  Defines how teams identify, track, and manage open-source and third-party
+  software licenses throughout the software development lifecycle. Covers
+  license classification (permissive vs. copyleft), prohibited licenses,
+  SBOM generation, license scanning in CI/CD, and remediation processes
+  for license violations. Designed to prevent legal exposure from
+  incompatible license combinations and ensure supply-chain transparency.
+
+requirements:
+  - id: REQ-001
+    title: License Classification and Allowlist
+    description: |
+      Every project MUST maintain a documented license policy classifying
+      licenses into three tiers. APPROVED: MIT, Apache 2.0, BSD-2-Clause,
+      BSD-3-Clause, ISC, CC0 — can be used without review. REVIEW-REQUIRED:
+      LGPL-2.1, LGPL-3.0, MPL-2.0, CDDL — requires legal/architecture
+      review before adoption. PROHIBITED: GPL-2.0 (without exception),
+      GPL-3.0, AGPL-3.0, SSPL, Commons Clause additions — MUST NOT be
+      included in proprietary or commercial products.
+    level: MUST
+    examples:
+      - "MIT dependency lodash → auto-approved, no review needed"
+      - "LGPL-3.0 dependency → open GitHub issue tagged 'license-review' before merging"
+      - "GPL-3.0 dependency found in PR → PR blocked by CI until dependency replaced"
+
+  - id: REQ-002
+    title: Automated License Scanning in CI
+    description: |
+      Every repository MUST run automated license scanning on every pull
+      request and on a daily scheduled basis. The scan MUST fail the build
+      if any PROHIBITED license is detected in direct or transitive
+      dependencies. REVIEW-REQUIRED licenses MUST generate a warning that
+      blocks merge until manually approved and documented. Scan results
+      MUST be stored as build artifacts for 1 year.
+    level: MUST
+    examples:
+      - "CI step: `npx license-checker --onlyAllow 'MIT;Apache-2.0;BSD-2-Clause;BSD-3-Clause;ISC'`"
+      - "Daily scheduled scan via GitHub Actions: runs license-checker across all repos"
+      - "Build artifact: license-report-2026-04-30.json retained for 1 year"
+
+  - id: REQ-003
+    title: Software Bill of Materials (SBOM) Generation
+    description: |
+      Every production release MUST include a Software Bill of Materials
+      (SBOM) in SPDX 2.3 or CycloneDX 1.5 format. The SBOM MUST list all
+      direct and transitive dependencies with: package name, version,
+      license identifier (SPDX), and supplier. SBOMs MUST be generated
+      automatically as part of the release pipeline and stored alongside
+      release artifacts.
+    level: MUST
+    examples:
+      - "SBOM generated via: `syft . -o spdx-json > sbom-v1.2.0.spdx.json`"
+      - "SBOM uploaded to release: GitHub Release v1.2.0 includes sbom-v1.2.0.spdx.json"
+      - "SBOM SPDX entry: {name: 'lodash', version: '4.17.21', licenseConcluded: 'MIT'}"
+
+  - id: REQ-004
+    title: License Attribution and Notices
+    description: |
+      Products distributed to external users MUST include a NOTICES or
+      LICENSE-THIRD-PARTY file listing all open-source components and their
+      license texts. This file MUST be auto-generated from the SBOM data.
+      For web applications, attribution MUST be accessible via a dedicated
+      page or endpoint (e.g., /licenses or /legal/notices).
+    level: MUST
+    examples:
+      - "NOTICES file auto-generated: `npx generate-license-file --input package.json --output NOTICES`"
+      - "Web app: GET /legal/notices returns HTML page with all dependency licenses"
+      - "Mobile app: 'Open Source Licenses' screen in Settings menu"
+
+  - id: REQ-005
+    title: License Violation Remediation Process
+    description: |
+      When a license violation is detected, teams MUST follow the remediation
+      process: (1) immediately block merge/release if PROHIBITED license;
+      (2) create a tracking issue within 1 business day; (3) remediate
+      within 5 business days for PROHIBITED, 30 days for REVIEW-REQUIRED;
+      (4) document decision and alternative chosen; (5) update SBOM and
+      re-scan to verify clean. Unresolved violations MUST be reported to
+      the engineering lead monthly.
+    level: MUST
+    examples:
+      - "Violation: transitive dep uses GPL-3.0 → PR blocked, issue COMP-112 created same day"
+      - "Remediation: replaced dep with MIT-licensed alternative within 3 days"
+      - "Monthly report: 0 open violations as of 2026-04-30"
+
+  - id: REQ-006
+    title: License Review for New Technology Adoption
+    description: |
+      Before adopting any new framework, library, or tool, engineers SHOULD
+      verify its license tier as part of the technology evaluation. License
+      status SHOULD be documented in the relevant ADR or technical spike
+      document. For REVIEW-REQUIRED licenses, architecture review board
+      approval MUST be obtained before adoption.
+    level: SHOULD
+    examples:
+      - "ADR-042 notes: 'Library X uses Apache 2.0 — approved tier, no legal review needed'"
+      - "ADR-043 notes: 'Library Y uses LGPL-3.0 — review-required, legal approved 2026-03-10'"
+      - "Technology radar entry includes license classification for each evaluated tool"
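The REQ-001 three-tier policy and the REQ-002 CI gate above can be expressed as a small classifier plus a filter. A minimal sketch, assuming SPDX-normalized identifiers (e.g. `CC0-1.0`, `GPL-3.0-only`); the `classifyLicense` and `gate` names are hypothetical, not APIs of this package:

```typescript
// Hypothetical encoding of the REQ-001 license tiers, keyed by SPDX id.
type Tier = "APPROVED" | "REVIEW_REQUIRED" | "PROHIBITED" | "UNKNOWN";

const APPROVED = new Set([
  "MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause", "ISC", "CC0-1.0",
]);
const REVIEW_REQUIRED = new Set([
  "LGPL-2.1-only", "LGPL-3.0-only", "MPL-2.0", "CDDL-1.0",
]);
const PROHIBITED = new Set([
  "GPL-2.0-only", "GPL-3.0-only", "AGPL-3.0-only", "SSPL-1.0",
]);

function classifyLicense(spdxId: string): Tier {
  if (APPROVED.has(spdxId)) return "APPROVED";
  if (REVIEW_REQUIRED.has(spdxId)) return "REVIEW_REQUIRED";
  if (PROHIBITED.has(spdxId)) return "PROHIBITED";
  return "UNKNOWN"; // unrecognized ids fall through to manual review
}

// REQ-002-style gate over a dependency license list: prohibited licenses
// block the build; review-required and unknown ones raise warnings.
function gate(licenses: string[]): { blocked: string[]; warnings: string[] } {
  const blocked = licenses.filter((l) => classifyLicense(l) === "PROHIBITED");
  const warnings = licenses.filter((l) => {
    const tier = classifyLicense(l);
    return tier === "REVIEW_REQUIRED" || tier === "UNKNOWN";
  });
  return { blocked, warnings };
}
```

In a real pipeline the input list would come from the SBOM (REQ-003) rather than being hard-coded, so the scan and the attribution file stay in sync.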
@@ -0,0 +1,269 @@
+# LLM Output Validation Standards - AI Optimized
+# Source: core/llm-output-validation.md
+
+id: llm-output-validation
+meta:
+  version: "1.0.0"
+  updated: "2026-05-05"
+  source: core/llm-output-validation.md
+  description: >
+    Standards for validating LLM and AI agent outputs to ensure schema
+    conformance, detect hallucinations, and assess response groundedness.
+    Covers structural validation, semantic validation, and test strategies.
+
+# ─────────────────────────────────────────────────────────
+# Core Concepts
+# ─────────────────────────────────────────────────────────
+core_concepts:
+  definition: >
+    LLM outputs are non-deterministic and may violate expected schema,
+    introduce hallucinated facts, or fail to stay grounded in provided context.
+    LLM output validation is the practice of systematically testing that
+    AI agent outputs meet structural and semantic quality standards.
+
+  validation_dimensions:
+    - dimension: Schema Conformance
+      definition: >
+        The output matches the declared JSON / type schema exactly —
+        required fields present, types correct, enums respected.
+      test_type: Contract Test
+      automated: true
+
+    - dimension: Hallucination Detection
+      definition: >
+        Factual claims in the output can be traced back to source material
+        (grounded) or are fabricated (hallucinated).
+      test_type: Semantic / NLI probe
+      automated: partial
+
+    - dimension: Grounded Answer Rate
+      definition: >
+        Ratio of responses that cite or derive from provided context
+        vs. responses that introduce unsupported facts.
+      metric: "grounded_answers / total_answers"
+      target: "≥ 0.95 for RAG-based agents"
+      automated: partial
+
+    - dimension: Refusal Evaluation
+      definition: >
+        Agent correctly refuses (or escalates) out-of-scope, harmful,
+        or ambiguous requests rather than hallucinating a plausible answer.
+      test_type: Adversarial probe
+      automated: true (via test corpus)
+
+    - dimension: Determinism / Stability
+      definition: >
+        With temperature=0, same prompt yields same output schema structure
+        (if not identical content).
+      test_type: Replay / Golden file
+      automated: true
+
+# ─────────────────────────────────────────────────────────
+# Contract Testing (Schema Conformance)
+# ─────────────────────────────────────────────────────────
+contract_testing:
+  definition: >
+    Contract tests verify that LLM/agent outputs satisfy a declared schema
+    (JSON Schema, Zod, Pydantic, etc.) before being consumed downstream.
+    They are the cheapest and most automated form of LLM output validation.
+
+  patterns:
+    - name: Schema-driven contract test
+      description: >
+        Maintain an output-schema.json per agent. Run fixture outputs through
+        a schema validator (Ajv / Pydantic / jsonschema) in unit tests.
+      example: |
+        // TypeScript with ajv
+        import Ajv from "ajv"
+        import schema from "./output-schema.json"
+        const ajv = new Ajv()
+        const validate = ajv.compile(schema)
+        const result = validate(agentOutput)
+        expect(result).toBe(true)
+
+    - name: Golden file test
+      description: >
+        Capture a real LLM response as a golden fixture (JSON).
+        Re-validate against the schema on every schema change.
+        Diff the fixture if schema is unchanged (regression).
+
+    - name: Negative contract test
+      description: >
+        Provide outputs that violate the schema (missing required fields,
+        wrong types). Assert the validator correctly rejects them.
+        Proves the schema is strict enough to catch real failures.
+
+  per_agent_structure:
+    - "agents/<agent-name>/output-schema.json — JSON Schema definition"
+    - "agents/<agent-name>/__tests__/contract.test.ts — contract test suite"
+    - "agents/<agent-name>/__fixtures__/valid.json — valid output fixture"
+    - "agents/<agent-name>/__fixtures__/invalid-*.json — invalid fixtures"
+
+# ─────────────────────────────────────────────────────────
+# Hallucination Detection
+# ─────────────────────────────────────────────────────────
+hallucination_detection:
+  definition: >
+    Hallucination occurs when an LLM generates plausible-sounding but
+    factually incorrect or unsupported content.
+
+  detection_strategies:
+    - strategy: Grounding check
+      description: >
+        For RAG-based agents: compare named entities / key facts in response
+        against retrieved documents using NLI or embedding similarity.
+      tools: [DeepEval, Ragas, custom NLI probe]
+      complexity: high
+
+    - strategy: Schema field validation
+      description: >
+        Structured outputs (JSON) partially prevent hallucination by requiring
+        specific enumerations and reference IDs. If the LLM hallucinates a
+        field value, it either violates the schema (caught by contract test)
+        or produces a real-looking but wrong value (requires semantic check).
+      complexity: low
+
+    - strategy: Confidence calibration
+      description: >
+        Agent includes a confidence score or explicitly marks uncertain claims.
+        Downstream consumers reject or escalate low-confidence outputs.
+      requires: Prompt design to elicit uncertainty markers
+      complexity: medium
+
+  rate_targets:
+    structured_output_agents:
+      schema_conformance: "≥ 99%"
+      factual_hallucination_rate: "≤ 5%"
+    rag_agents:
+      grounded_answer_rate: "≥ 95%"
+    chat_agents:
+      refusal_on_out_of_scope: "≥ 90%"
+
+# ─────────────────────────────────────────────────────────
+# Prompt Regression & Stability
+# ─────────────────────────────────────────────────────────
+prompt_regression:
+  definition: >
+    When a prompt is changed, verify that the output schema structure
+    is preserved (even if content differs). A prompt change that silently
+    breaks downstream field expectations is a regression.
+
+  test_pattern: |
+    # 1. Before prompt change: capture golden output structure
+    # 2. Change prompt
+    # 3. Re-run with temperature=0 on same input fixture
+    # 4. Validate output against schema (contract test)
+    # 5. Diff field presence / enum values against golden
+
+  required_on:
+    - Any change to agents/*/prompt.md
+    - Model version upgrade (same prompt, new model)
+    - New required output field added to schema
+
+# ─────────────────────────────────────────────────────────
+# Tools & Libraries
+# ─────────────────────────────────────────────────────────
+tools:
+  schema_validation:
+    - name: Ajv (TypeScript/JavaScript)
+      package: "ajv"
+      strengths: [JSON Schema draft-07 support, fast, minimal]
+      use_case: Contract tests for agents with JSON output-schema.json
+
+    - name: Zod (TypeScript)
+      package: "zod"
+      strengths: [Type-safe, composable, coercion support]
+      use_case: Runtime validation when TypeScript types are available
+
+    - name: Pydantic (Python)
+      package: "pydantic"
+      strengths: [Automatic validation from type annotations]
+      use_case: Python-based agent contract tests
+
+  hallucination_evaluation:
+    - name: DeepEval
+      package: "deepeval"
+      features: [Hallucination metric, Contextual Precision, Faithfulness]
+      note: Requires LLM judge (adds cost)
+
+    - name: Ragas
+      package: "ragas"
+      features: [Answer Grounding, Context Recall, Faithfulness]
+      note: Designed for RAG pipelines
+
+# ─────────────────────────────────────────────────────────
+# Quality Gates
+# ─────────────────────────────────────────────────────────
+quality_gates:
+  - gate: Schema conformance (CI)
+    threshold: "100% of valid fixtures pass schema validation"
+    enforcement: Block merge
+    automated: true
+    note: Run in every PR via contract.test.ts
+
+  - gate: Negative schema rejection (CI)
+    threshold: "100% of invalid fixtures are rejected by schema validator"
+    enforcement: Block merge
+    automated: true
+    note: Proves schema is strict enough
+
+  - gate: Hallucination rate (pre-release)
+    threshold: "≤ 5% hallucinated facts on benchmark dataset"
+    enforcement: Advisory (escalate to human review if exceeded)
+    automated: partial
+    note: Run monthly or on model upgrade
+
+  - gate: Prompt regression (on prompt change)
+    threshold: "Schema conformance maintained after prompt change"
+    enforcement: Block merge
+    automated: true (via contract test re-run)
+
+# ─────────────────────────────────────────────────────────
+# Rules
+# ─────────────────────────────────────────────────────────
+rules:
+  - id: agent-must-have-output-schema
+    trigger: defining or modifying an AI agent
+    instruction: >
+      Every agent MUST have a declared output-schema.json.
+      Agents without a schema cannot be safely composed into pipelines.
+    priority: required
+
+  - id: contract-test-per-agent
+    trigger: defining or modifying an AI agent
+    instruction: >
+      Every agent MUST have a contract.test.ts with at least:
+      (1) one valid fixture that passes schema validation
+      (2) one invalid fixture that fails schema validation
+    priority: required
+
+  - id: prompt-change-triggers-contract-rerun
+    trigger: modifying agents/*/prompt.md
+    instruction: >
+      Re-run the agent's contract tests with temperature=0 after any
+      prompt change. A schema violation in the golden fixture signals regression.
+    priority: required
+
+  - id: model-upgrade-triggers-contract-rerun
+    trigger: upgrading the LLM model version used by an agent
+    instruction: >
+      Run all agent contract tests against the new model.
+      Compare output structure, field presence, and enum values against golden.
+    priority: required
+
+anti_patterns:
+  - Defining agents without output-schema.json
+  - Only validating JSON.parse (syntax) without schema (semantic) validation
+  - Writing contract tests with only valid fixtures (no negative cases)
+  - Accepting model upgrades without re-running contract tests
+  - Using high temperature (> 0.3) in contract test fixtures (non-deterministic)
+
+quick_reference:
+  contract_test_checklist: |
+    □ agents/<name>/output-schema.json exists
+    □ agents/<name>/__tests__/contract.test.ts exists
+    □ Valid fixture: all required fields, correct types, enum values
+    □ Invalid fixtures: missing required field, wrong type, invalid enum
+    □ Schema validator: Ajv (JSON Schema) or Zod (TypeScript)
+    □ CI: contract tests run on every PR
+    □ Prompt change: re-run contract test, compare against golden