universal-dev-standards 5.4.0 → 5.5.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (114)
  1. package/bundled/ai/standards/adversarial-test.ai.yaml +277 -0
  2. package/bundled/ai/standards/audit-trail.ai.yaml +113 -0
  3. package/bundled/ai/standards/chaos-injection-tests.ai.yaml +91 -0
  4. package/bundled/ai/standards/container-image-standards.ai.yaml +88 -0
  5. package/bundled/ai/standards/container-security.ai.yaml +331 -0
  6. package/bundled/ai/standards/cost-budget-test.ai.yaml +96 -0
  7. package/bundled/ai/standards/data-contract.ai.yaml +110 -0
  8. package/bundled/ai/standards/data-migration-testing.ai.yaml +96 -0
  9. package/bundled/ai/standards/data-pipeline.ai.yaml +113 -0
  10. package/bundled/ai/standards/disaster-recovery-drill.ai.yaml +89 -0
  11. package/bundled/ai/standards/flaky-test-management.ai.yaml +89 -0
  12. package/bundled/ai/standards/flow-based-testing.ai.yaml +240 -0
  13. package/bundled/ai/standards/iac-design-principles.ai.yaml +83 -0
  14. package/bundled/ai/standards/incident-response.ai.yaml +107 -0
  15. package/bundled/ai/standards/license-compliance.ai.yaml +106 -0
  16. package/bundled/ai/standards/llm-output-validation.ai.yaml +269 -0
  17. package/bundled/ai/standards/mock-boundary.ai.yaml +250 -0
  18. package/bundled/ai/standards/mutation-testing.ai.yaml +192 -0
  19. package/bundled/ai/standards/pii-classification.ai.yaml +109 -0
  20. package/bundled/ai/standards/policy-as-code-testing.ai.yaml +227 -0
  21. package/bundled/ai/standards/prd-standards.ai.yaml +88 -0
  22. package/bundled/ai/standards/product-metrics-standards.ai.yaml +111 -0
  23. package/bundled/ai/standards/prompt-regression.ai.yaml +94 -0
  24. package/bundled/ai/standards/property-based-testing.ai.yaml +105 -0
  25. package/bundled/ai/standards/release-quality-manifest.ai.yaml +135 -0
  26. package/bundled/ai/standards/replay-test.ai.yaml +111 -0
  27. package/bundled/ai/standards/runbook.ai.yaml +104 -0
  28. package/bundled/ai/standards/sast-advanced.ai.yaml +135 -0
  29. package/bundled/ai/standards/schema-evolution.ai.yaml +111 -0
  30. package/bundled/ai/standards/secret-management-standards.ai.yaml +105 -0
  31. package/bundled/ai/standards/secure-op.ai.yaml +365 -0
  32. package/bundled/ai/standards/security-testing.ai.yaml +171 -0
  33. package/bundled/ai/standards/server-ops-security.ai.yaml +274 -0
  34. package/bundled/ai/standards/slo-sli.ai.yaml +97 -0
  35. package/bundled/ai/standards/smoke-test.ai.yaml +87 -0
  36. package/bundled/ai/standards/supply-chain-attestation.ai.yaml +109 -0
  37. package/bundled/ai/standards/test-completeness-dimensions.ai.yaml +52 -5
  38. package/bundled/ai/standards/user-story-mapping.ai.yaml +108 -0
  39. package/bundled/core/adversarial-test.md +212 -0
  40. package/bundled/core/chaos-injection-tests.md +116 -0
  41. package/bundled/core/container-security.md +521 -0
  42. package/bundled/core/cost-budget-test.md +69 -0
  43. package/bundled/core/data-migration-testing.md +110 -0
  44. package/bundled/core/disaster-recovery-drill.md +73 -0
  45. package/bundled/core/flaky-test-management.md +73 -0
  46. package/bundled/core/flow-based-testing.md +142 -0
  47. package/bundled/core/llm-output-validation.md +178 -0
  48. package/bundled/core/mock-boundary.md +100 -0
  49. package/bundled/core/mutation-testing.md +97 -0
  50. package/bundled/core/policy-as-code-testing.md +188 -0
  51. package/bundled/core/prompt-regression.md +72 -0
  52. package/bundled/core/property-based-testing.md +73 -0
  53. package/bundled/core/release-quality-manifest.md +147 -0
  54. package/bundled/core/replay-test.md +86 -0
  55. package/bundled/core/sast-advanced.md +300 -0
  56. package/bundled/core/secure-op.md +314 -0
  57. package/bundled/core/security-testing.md +87 -0
  58. package/bundled/core/server-ops-security.md +493 -0
  59. package/bundled/core/smoke-test.md +65 -0
  60. package/bundled/core/supply-chain-attestation.md +117 -0
  61. package/bundled/locales/zh-CN/CHANGELOG.md +3 -3
  62. package/bundled/locales/zh-CN/README.md +1 -1
  63. package/bundled/locales/zh-CN/skills/ai-instruction-standards/SKILL.md +5 -5
  64. package/bundled/locales/zh-TW/CHANGELOG.md +3 -3
  65. package/bundled/locales/zh-TW/README.md +1 -1
  66. package/bundled/locales/zh-TW/skills/ai-instruction-standards/SKILL.md +183 -79
  67. package/bundled/skills/README.md +4 -3
  68. package/bundled/skills/SKILL_NAMING.md +94 -0
  69. package/bundled/skills/ai-instruction-standards/SKILL.md +181 -88
  70. package/bundled/skills/atdd-assistant/SKILL.md +8 -0
  71. package/bundled/skills/bdd-assistant/SKILL.md +7 -0
  72. package/bundled/skills/checkin-assistant/SKILL.md +8 -0
  73. package/bundled/skills/code-review-assistant/SKILL.md +7 -0
  74. package/bundled/skills/journey-test-assistant/SKILL.md +203 -0
  75. package/bundled/skills/orchestrate/SKILL.md +167 -0
  76. package/bundled/skills/plan/SKILL.md +234 -0
  77. package/bundled/skills/pr-automation-assistant/SKILL.md +8 -0
  78. package/bundled/skills/push/SKILL.md +49 -2
  79. package/bundled/skills/{process-automation → skill-builder}/SKILL.md +1 -1
  80. package/bundled/skills/{forward-derivation → spec-derivation}/SKILL.md +1 -1
  81. package/bundled/skills/spec-driven-dev/SKILL.md +7 -0
  82. package/bundled/skills/sweep/SKILL.md +145 -0
  83. package/bundled/skills/tdd-assistant/SKILL.md +7 -0
  84. package/package.json +1 -1
  85. package/src/commands/flow.js +8 -0
  86. package/src/commands/start.js +14 -0
  87. package/src/commands/sweep.js +8 -0
  88. package/src/commands/workflow.js +8 -0
  89. package/standards-registry.json +426 -4
  90. package/bundled/locales/zh-CN/skills/ac-coverage-assistant/SKILL.md +0 -190
  91. package/bundled/locales/zh-CN/skills/forward-derivation/SKILL.md +0 -71
  92. package/bundled/locales/zh-CN/skills/forward-derivation/guide.md +0 -130
  93. package/bundled/locales/zh-CN/skills/methodology-system/SKILL.md +0 -88
  94. package/bundled/locales/zh-CN/skills/methodology-system/create-methodology.md +0 -350
  95. package/bundled/locales/zh-CN/skills/methodology-system/guide.md +0 -131
  96. package/bundled/locales/zh-CN/skills/methodology-system/runtime.md +0 -279
  97. package/bundled/locales/zh-CN/skills/process-automation/SKILL.md +0 -143
  98. package/bundled/locales/zh-TW/skills/ac-coverage-assistant/SKILL.md +0 -195
  99. package/bundled/locales/zh-TW/skills/deploy-assistant/SKILL.md +0 -178
  100. package/bundled/locales/zh-TW/skills/forward-derivation/SKILL.md +0 -69
  101. package/bundled/locales/zh-TW/skills/forward-derivation/guide.md +0 -415
  102. package/bundled/locales/zh-TW/skills/methodology-system/SKILL.md +0 -86
  103. package/bundled/locales/zh-TW/skills/methodology-system/create-methodology.md +0 -350
  104. package/bundled/locales/zh-TW/skills/methodology-system/guide.md +0 -131
  105. package/bundled/locales/zh-TW/skills/methodology-system/runtime.md +0 -279
  106. package/bundled/locales/zh-TW/skills/process-automation/SKILL.md +0 -144
  107. package/bundled/skills/{ac-coverage-assistant → ac-coverage}/SKILL.md +0 -0
  108. package/bundled/skills/{methodology-system → dev-methodology}/SKILL.md +0 -0
  109. package/bundled/skills/{methodology-system → dev-methodology}/create-methodology.md +0 -0
  110. package/bundled/skills/{methodology-system → dev-methodology}/guide.md +0 -0
  111. package/bundled/skills/{methodology-system → dev-methodology}/integrated-flow.md +0 -0
  112. package/bundled/skills/{methodology-system → dev-methodology}/prerequisite-check.md +0 -0
  113. package/bundled/skills/{methodology-system → dev-methodology}/runtime.md +0 -0
  114. package/bundled/skills/{forward-derivation → spec-derivation}/guide.md +0 -0
@@ -0,0 +1,107 @@
+ # Incident Response Standards - AI Optimized
+ # Source: XSPEC-063 Wave 3 SRE Pack
+
+ id: incident-response
+ title: Incident Response Standards
+ version: "1.0.0"
+ status: Active
+ tags: [sre, incident, oncall, postmortem, severity]
+ summary: |
+   Defines the end-to-end incident response lifecycle: severity classification,
+   response time SLAs, roles and responsibilities, communication protocols,
+   escalation paths, and blameless postmortem requirements. Designed to reduce
+   MTTR, ensure consistent stakeholder communication during incidents, and
+   drive systemic reliability improvements through structured postmortems.
+
+ requirements:
+   - id: REQ-001
+     title: Severity Classification
+     description: |
+       Every incident MUST be classified at declaration using a 4-level
+       severity scale. SEV-1 (Critical): complete service outage or data
+       breach affecting all users, response within 15 minutes, C-suite
+       notification. SEV-2 (High): major feature unavailable or significant
+       performance degradation >25% users, response within 30 minutes.
+       SEV-3 (Medium): minor feature unavailable or degradation <25% users,
+       response within 4 hours. SEV-4 (Low): cosmetic issue or very minor
+       impact, response within 24 hours.
+     level: MUST
+     examples:
+       - "SEV-1: Payment processing completely down → IC declared, bridge opened in <5min"
+       - "SEV-2: Checkout latency >10s for 30% of users → on-call paged, IC assigned"
+       - "SEV-3: Search autocomplete broken → ticket created, fix in next sprint"
+
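The tier-to-SLA mapping in REQ-001 is mechanical enough to encode as data so tooling can enforce it uniformly. A minimal TypeScript sketch; the names `Severity`, `RESPONSE_SLA_MINUTES`, and `responseDeadline` are illustrative, not part of the bundled standard:

```typescript
// Response-time SLAs per severity tier, as defined in REQ-001.
type Severity = "SEV-1" | "SEV-2" | "SEV-3" | "SEV-4";

const RESPONSE_SLA_MINUTES: Record<Severity, number> = {
  "SEV-1": 15,      // complete outage / data breach, C-suite notified
  "SEV-2": 30,      // major feature down or >25% of users degraded
  "SEV-3": 4 * 60,  // minor feature down, <25% of users
  "SEV-4": 24 * 60, // cosmetic or very minor impact
};

// Deadline by which the first responder must be engaged.
function responseDeadline(sev: Severity, declaredAt: Date): Date {
  return new Date(declaredAt.getTime() + RESPONSE_SLA_MINUTES[sev] * 60_000);
}
```

A pager integration could compare `responseDeadline` against the first responder acknowledgment time to flag SLA breaches automatically.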
+   - id: REQ-002
+     title: Incident Commander Role
+     description: |
+       Every SEV-1 and SEV-2 incident MUST have a designated Incident
+       Commander (IC) assigned within 10 minutes of declaration. The IC is
+       responsible for: coordinating the response bridge, assigning roles
+       (scribe, comms lead, technical leads), driving timeline, making
+       go/no-go decisions on fixes, and initiating postmortem. The IC does
+       NOT directly troubleshoot — their sole focus is coordination.
+     level: MUST
+     examples:
+       - "IC assignment: 'I'm taking IC for this SEV-2, @alice please scribe, @bob comms'"
+       - "IC checklist: bridge open, roles assigned, status page updated, stakeholders notified"
+       - "IC does not debug code; delegates to technical leads who report findings"
+
+   - id: REQ-003
+     title: Stakeholder Communication Protocol
+     description: |
+       During SEV-1/SEV-2 incidents, stakeholder updates MUST be sent on a
+       defined cadence: initial notification within 15 minutes of declaration,
+       updates every 30 minutes until resolution, immediate update on any
+       severity change or major development. Updates MUST include: current
+       status, known impact, what is being done, next update time. Status
+       page MUST be updated simultaneously with internal communications.
+     level: MUST
+     examples:
+       - "T+15min: 'We are investigating elevated error rates on checkout. ~500 users affected. Next update in 30min.'"
+       - "T+45min: 'Root cause identified as database connection pool exhaustion. Fix in progress. Next update in 30min.'"
+       - "Resolution: 'Service restored at 14:32 UTC. Postmortem scheduled for 2026-05-02.'"
+
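The REQ-003 cadence (first notice at T+15min, then every 30min until resolution) can be generated rather than remembered under pressure. A hypothetical sketch:

```typescript
// Stakeholder update schedule per REQ-003: initial notification at
// T+15min, then an update every 30min until resolution.
function updateSchedule(declaredAt: Date, resolvedAt: Date): Date[] {
  const times: Date[] = [];
  let t = declaredAt.getTime() + 15 * 60_000; // initial notification
  while (t <= resolvedAt.getTime()) {
    times.push(new Date(t));
    t += 30 * 60_000; // subsequent updates every 30 minutes
  }
  return times;
}
```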
+   - id: REQ-004
+     title: Blameless Postmortem Requirements
+     description: |
+       Every SEV-1 incident MUST have a blameless postmortem completed within
+       5 business days. SEV-2 incidents MUST have a postmortem within 10
+       business days. Postmortems MUST be blameless — focusing on systemic
+       causes, not individual mistakes. Required sections: timeline, impact,
+       root cause (5 Whys), contributing factors, action items with owners
+       and due dates, lessons learned. Action items MUST be tracked to
+       completion.
+     level: MUST
+     examples:
+       - "Timeline: 14:00 Alert fired → 14:05 IC assigned → 14:23 RCA found → 14:32 resolved"
+       - "Root cause: 5 Whys tracing alert → connection pool exhaustion → missing timeout config → no review gate"
+       - "Action item: INFRA-2341 Add connection pool monitoring, owner: @dave, due: 2026-05-15"
+
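REQ-004's deadlines are business days, not calendar days, so tooling needs weekend-aware arithmetic. A sketch under the assumption that only Saturdays and Sundays are skipped (holiday calendars are out of scope here):

```typescript
// Postmortem deadline per REQ-004: 5 business days for SEV-1,
// 10 business days for SEV-2. Skips weekends only.
function postmortemDue(sev: "SEV-1" | "SEV-2", resolved: Date): Date {
  const businessDays = sev === "SEV-1" ? 5 : 10;
  const d = new Date(resolved);
  let remaining = businessDays;
  while (remaining > 0) {
    d.setUTCDate(d.getUTCDate() + 1);
    const day = d.getUTCDay();
    if (day !== 0 && day !== 6) remaining--; // count Mon-Fri only
  }
  return d;
}
```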
+   - id: REQ-005
+     title: On-Call Rotation and Handoff
+     description: |
+       Every production service MUST have a documented on-call rotation with
+       at least 2 engineers. Rotation schedules MUST be published at least
+       2 weeks in advance. On-call handoff MUST include a written summary of:
+       active incidents, recent incidents still under investigation,
+       known flaky alerts, upcoming planned maintenance, and any service
+       health concerns. Handoff MUST be acknowledged by incoming on-call.
+     level: MUST
+     examples:
+       - "Handoff doc: 3 active flaky alerts (JIRA links), 1 ongoing SEV-3, maintenance window Friday 02:00 UTC"
+       - "Rotation: 1-week shifts, 2 primary + 1 secondary, published monthly in PagerDuty"
+       - "Handoff acknowledgment: incoming engineer replies 'Handoff received, I have the pager'"
+
+   - id: REQ-006
+     title: Incident Retrospective Metrics
+     description: |
+       Teams SHOULD track and review incident metrics monthly: MTTD (Mean
+       Time To Detect), MTTR (Mean Time To Resolve), incident frequency by
+       severity, repeat incidents (same root cause within 90 days), and
+       postmortem action item completion rate. These metrics SHOULD be
+       reviewed in monthly reliability reviews with engineering leadership.
+     level: SHOULD
+     examples:
+       - "Monthly metrics: MTTD 8min, MTTR 42min, 3 SEV-2s, 0 SEV-1s, 0 repeat incidents"
+       - "Repeat incident flag: same DB timeout root cause in March and April → escalate to reliability sprint"
+       - "Action item completion rate: 85% of previous month's items closed on time"
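The REQ-006 metrics reduce to simple interval means over incident records. A sketch assuming MTTD is measured from fault start to detection and MTTR from detection to resolution (exact definitions vary by team; the `Incident` shape is hypothetical):

```typescript
// Incident metric sketch for REQ-006 (MTTD / MTTR in minutes).
interface Incident {
  startedAt: number;  // ms epoch: when the fault began
  detectedAt: number; // ms epoch: when an alert or human noticed
  resolvedAt: number; // ms epoch: when service was restored
}

// Mean interval between two timestamps across incidents, in minutes.
function meanMinutes(
  incidents: Incident[],
  from: keyof Incident,
  to: keyof Incident,
): number {
  if (incidents.length === 0) return 0;
  const totalMs = incidents.reduce((sum, i) => sum + (i[to] - i[from]), 0);
  return totalMs / incidents.length / 60_000;
}

// MTTD = meanMinutes(incidents, "startedAt", "detectedAt")
// MTTR = meanMinutes(incidents, "detectedAt", "resolvedAt")
```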
@@ -0,0 +1,106 @@
+ # License Compliance Standards - AI Optimized
+ # Source: XSPEC-066 Wave 3 Compliance Pack
+
+ id: license-compliance
+ title: License Compliance Standards
+ version: "1.0.0"
+ status: Active
+ tags: [compliance, licensing, open-source, legal, supply-chain]
+ summary: |
+   Defines how teams identify, track, and manage open-source and third-party
+   software licenses throughout the software development lifecycle. Covers
+   license classification (permissive vs. copyleft), prohibited licenses,
+   SBOM generation, license scanning in CI/CD, and remediation processes
+   for license violations. Designed to prevent legal exposure from
+   incompatible license combinations and ensure supply-chain transparency.
+
+ requirements:
+   - id: REQ-001
+     title: License Classification and Allowlist
+     description: |
+       Every project MUST maintain a documented license policy classifying
+       licenses into three tiers. APPROVED: MIT, Apache 2.0, BSD-2-Clause,
+       BSD-3-Clause, ISC, CC0 — can be used without review. REVIEW-REQUIRED:
+       LGPL-2.1, LGPL-3.0, MPL-2.0, CDDL — requires legal/architecture
+       review before adoption. PROHIBITED: GPL-2.0 (without exception),
+       GPL-3.0, AGPL-3.0, SSPL, Commons Clause additions — MUST NOT be
+       included in proprietary or commercial products.
+     level: MUST
+     examples:
+       - "MIT dependency lodash → auto-approved, no review needed"
+       - "LGPL-3.0 dependency → open GitHub issue tagged 'license-review' before merging"
+       - "GPL-3.0 dependency found in PR → PR blocked by CI until dependency replaced"
+
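The REQ-001 tier policy is easiest to enforce when expressed as code. A sketch using SPDX identifiers; the sets below contain only the licenses this standard names and are not exhaustive (`CC0-1.0`, `CDDL-1.0`, and `SSPL-1.0` are assumed SPDX spellings):

```typescript
// License tiers from REQ-001, keyed by SPDX identifier.
type Tier = "APPROVED" | "REVIEW-REQUIRED" | "PROHIBITED";

const APPROVED = new Set(["MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause", "ISC", "CC0-1.0"]);
const REVIEW_REQUIRED = new Set(["LGPL-2.1", "LGPL-3.0", "MPL-2.0", "CDDL-1.0"]);
const PROHIBITED = new Set(["GPL-2.0", "GPL-3.0", "AGPL-3.0", "SSPL-1.0"]);

function classifyLicense(spdxId: string): Tier | "UNKNOWN" {
  if (APPROVED.has(spdxId)) return "APPROVED";
  if (REVIEW_REQUIRED.has(spdxId)) return "REVIEW-REQUIRED";
  if (PROHIBITED.has(spdxId)) return "PROHIBITED";
  return "UNKNOWN"; // fail closed: treat unknown licenses as needing review
}
```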
+   - id: REQ-002
+     title: Automated License Scanning in CI
+     description: |
+       Every repository MUST run automated license scanning on every pull
+       request and on a daily scheduled basis. The scan MUST fail the build
+       if any PROHIBITED license is detected in direct or transitive
+       dependencies. REVIEW-REQUIRED licenses MUST generate a warning that
+       blocks merge until manually approved and documented. Scan results
+       MUST be stored as build artifacts for 1 year.
+     level: MUST
+     examples:
+       - "CI step: `npx license-checker --onlyAllow 'MIT;Apache-2.0;BSD-2-Clause;BSD-3-Clause;ISC'`"
+       - "Daily scheduled scan via GitHub Actions: runs license-checker across all repos"
+       - "Build artifact: license-report-2026-04-30.json retained for 1 year"
+
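Given a scanner's package-to-license map, the REQ-002 gate logic is a straightforward partition: hard failures for prohibited licenses, blocking warnings for review-required ones. A dependency-free sketch (real pipelines would invoke license-checker or an SCA tool and parse its report; `licenseGate` is a hypothetical name):

```typescript
// CI gate per REQ-002. `deps` maps package name to its SPDX license id,
// as a license scanner would report it.
const PROHIBITED = new Set(["GPL-2.0", "GPL-3.0", "AGPL-3.0", "SSPL-1.0"]);
const REVIEW_REQUIRED = new Set(["LGPL-2.1", "LGPL-3.0", "MPL-2.0", "CDDL-1.0"]);

interface GateResult {
  failed: boolean;      // true → fail the build
  violations: string[]; // PROHIBITED licenses found
  warnings: string[];   // REVIEW-REQUIRED licenses needing manual approval
}

function licenseGate(deps: Record<string, string>): GateResult {
  const violations: string[] = [];
  const warnings: string[] = [];
  for (const [pkg, license] of Object.entries(deps)) {
    if (PROHIBITED.has(license)) violations.push(`${pkg}: ${license}`);
    else if (REVIEW_REQUIRED.has(license)) warnings.push(`${pkg}: ${license}`);
  }
  return { failed: violations.length > 0, violations, warnings };
}
```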
+   - id: REQ-003
+     title: Software Bill of Materials (SBOM) Generation
+     description: |
+       Every production release MUST include a Software Bill of Materials
+       (SBOM) in SPDX 2.3 or CycloneDX 1.5 format. The SBOM MUST list all
+       direct and transitive dependencies with: package name, version,
+       license identifier (SPDX), and supplier. SBOMs MUST be generated
+       automatically as part of the release pipeline and stored alongside
+       release artifacts.
+     level: MUST
+     examples:
+       - "SBOM generated via: `syft . -o spdx-json > sbom-v1.2.0.spdx.json`"
+       - "SBOM uploaded to release: GitHub Release v1.2.0 includes sbom-v1.2.0.spdx.json"
+       - "SBOM SPDX entry: {name: 'lodash', version: '4.17.21', licenseConcluded: 'MIT'}"
+
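A release pipeline normally produces the REQ-003 SBOM with a tool such as syft, but the shape of the required per-package fields can be sketched. This is not a complete SPDX 2.3 document, only the fields the requirement names (`SbomPackage` and `buildSbom` are illustrative):

```typescript
// Minimal sketch of the per-package SBOM fields REQ-003 requires:
// name, version, SPDX license identifier, supplier.
interface SbomPackage {
  name: string;
  versionInfo: string;
  licenseConcluded: string; // SPDX identifier
  supplier: string;
}

function buildSbom(release: string, packages: SbomPackage[]) {
  return {
    spdxVersion: "SPDX-2.3",
    name: `sbom-${release}`,
    packages,
  };
}
```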
+   - id: REQ-004
+     title: License Attribution and Notices
+     description: |
+       Products distributed to external users MUST include a NOTICES or
+       LICENSE-THIRD-PARTY file listing all open-source components and their
+       license texts. This file MUST be auto-generated from the SBOM data.
+       For web applications, attribution MUST be accessible via a dedicated
+       page or endpoint (e.g., /licenses or /legal/notices).
+     level: MUST
+     examples:
+       - "NOTICES file auto-generated: `npx generate-license-file --input package.json --output NOTICES`"
+       - "Web app: GET /legal/notices returns HTML page with all dependency licenses"
+       - "Mobile app: 'Open Source Licenses' screen in Settings menu"
+
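The REQ-004 NOTICES file is a pure function of the SBOM data. A minimal rendering sketch (the `Component` shape and `renderNotices` name are hypothetical):

```typescript
// Auto-generating NOTICES content from SBOM-derived component data.
interface Component {
  name: string;
  version: string;
  license: string;     // SPDX identifier
  licenseText: string; // full license text to reproduce
}

function renderNotices(components: Component[]): string {
  return components
    .map((c) => `${c.name}@${c.version} (${c.license})\n${c.licenseText}`)
    .join("\n---\n");
}
```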
+   - id: REQ-005
+     title: License Violation Remediation Process
+     description: |
+       When a license violation is detected, teams MUST follow the remediation
+       process: (1) immediately block merge/release if PROHIBITED license;
+       (2) create a tracking issue within 1 business day; (3) remediate
+       within 5 business days for PROHIBITED, 30 days for REVIEW-REQUIRED;
+       (4) document decision and alternative chosen; (5) update SBOM and
+       re-scan to verify clean. Unresolved violations MUST be reported to
+       the engineering lead monthly.
+     level: MUST
+     examples:
+       - "Violation: transitive dep uses GPL-3.0 → PR blocked, issue COMP-112 created same day"
+       - "Remediation: replaced dep with MIT-licensed alternative within 3 days"
+       - "Monthly report: 0 open violations as of 2026-04-30"
+
+   - id: REQ-006
+     title: License Review for New Technology Adoption
+     description: |
+       Before adopting any new framework, library, or tool, engineers SHOULD
+       verify its license tier as part of the technology evaluation. License
+       status SHOULD be documented in the relevant ADR or technical spike
+       document. For REVIEW-REQUIRED licenses, architecture review board
+       approval MUST be obtained before adoption.
+     level: SHOULD
+     examples:
+       - "ADR-042 notes: 'Library X uses Apache 2.0 — approved tier, no legal review needed'"
+       - "ADR-043 notes: 'Library Y uses LGPL-3.0 — review-required, legal approved 2026-03-10'"
+       - "Technology radar entry includes license classification for each evaluated tool"
@@ -0,0 +1,269 @@
+ # LLM Output Validation Standards - AI Optimized
+ # Source: core/llm-output-validation.md
+
+ id: llm-output-validation
+ meta:
+   version: "1.0.0"
+   updated: "2026-05-05"
+   source: core/llm-output-validation.md
+   description: >
+     Standards for validating LLM and AI agent outputs to ensure schema
+     conformance, detect hallucinations, and assess response groundedness.
+     Covers structural validation, semantic validation, and test strategies.
+
+ # ─────────────────────────────────────────────────────────
+ # Core Concepts
+ # ─────────────────────────────────────────────────────────
+ core_concepts:
+   definition: >
+     LLM outputs are non-deterministic and may violate expected schema,
+     introduce hallucinated facts, or fail to stay grounded in provided context.
+     LLM output validation is the practice of systematically testing that
+     AI agent outputs meet structural and semantic quality standards.
+
+   validation_dimensions:
+     - dimension: Schema Conformance
+       definition: >
+         The output matches the declared JSON / type schema exactly —
+         required fields present, types correct, enums respected.
+       test_type: Contract Test
+       automated: true
+
+     - dimension: Hallucination Detection
+       definition: >
+         Factual claims in the output can be traced back to source material
+         (grounded) or are fabricated (hallucinated).
+       test_type: Semantic / NLI probe
+       automated: partial
+
+     - dimension: Grounded Answer Rate
+       definition: >
+         Ratio of responses that cite or derive from provided context
+         vs. responses that introduce unsupported facts.
+       metric: "grounded_answers / total_answers"
+       target: "≥ 0.95 for RAG-based agents"
+       automated: partial
+
+     - dimension: Refusal Evaluation
+       definition: >
+         Agent correctly refuses (or escalates) out-of-scope, harmful,
+         or ambiguous requests rather than hallucinating a plausible answer.
+       test_type: Adversarial probe
+       automated: true (via test corpus)
+
+     - dimension: Determinism / Stability
+       definition: >
+         With temperature=0, same prompt yields same output schema structure
+         (if not identical content).
+       test_type: Replay / Golden file
+       automated: true
+
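The Grounded Answer Rate dimension above is a simple ratio. A sketch of how it might be computed over labeled evaluation results, assuming the `grounded` flag comes from a human rater or an NLI judge (function and constant names are illustrative):

```typescript
// grounded_answers / total_answers, per the core_concepts metric.
function groundedAnswerRate(results: { grounded: boolean }[]): number {
  if (results.length === 0) return 1; // vacuously grounded
  const grounded = results.filter((r) => r.grounded).length;
  return grounded / results.length;
}

// Target from the standard: ≥ 0.95 for RAG-based agents.
const RAG_TARGET = 0.95;
```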
62
+ # Contract Testing (Schema Conformance)
63
+ # ─────────────────────────────────────────────────────────
64
+ contract_testing:
65
+ definition: >
66
+ Contract tests verify that LLM/agent outputs satisfy a declared schema
67
+ (JSON Schema, Zod, Pydantic, etc.) before being consumed downstream.
68
+ They are the cheapest and most automated form of LLM output validation.
69
+
70
+ patterns:
71
+ - name: Schema-driven contract test
72
+ description: >
73
+ Maintain an output-schema.json per agent. Run fixture outputs through
74
+ a schema validator (Ajv / Pydantic / jsonschema) in unit tests.
75
+ example: |
76
+ // TypeScript with ajv
77
+ import Ajv from "ajv"
78
+ import schema from "./output-schema.json"
79
+ const ajv = new Ajv()
80
+ const validate = ajv.compile(schema)
81
+ const result = validate(agentOutput)
82
+ expect(result).toBe(true)
83
+
84
+ - name: Golden file test
85
+ description: >
86
+ Capture a real LLM response as a golden fixture (JSON).
87
+ Re-validate against the schema on every schema change.
88
+ Diff the fixture if schema is unchanged (regression).
89
+
90
+ - name: Negative contract test
91
+ description: >
92
+ Provide outputs that violate the schema (missing required fields,
93
+ wrong types). Assert the validator correctly rejects them.
94
+ Proves the schema is strict enough to catch real failures.
95
+
96
+ per_agent_structure:
97
+ - "agents/<agent-name>/output-schema.json — JSON Schema definition"
98
+ - "agents/<agent-name>/__tests__/contract.test.ts — contract test suite"
99
+ - "agents/<agent-name>/__fixtures__/valid.json — valid output fixture"
100
+ - "agents/<agent-name>/__fixtures__/invalid-*.json — invalid fixtures"
101
+
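The Ajv pattern above depends on an external validator, but the valid-plus-negative-fixture discipline itself can be shown dependency-free. The toy `conforms` checker below is not a JSON Schema implementation (real suites should use Ajv, Zod, or Pydantic); it only illustrates the fixture pattern:

```typescript
// Toy validator illustrating the valid + negative fixture pattern.
interface FieldSpec {
  type: "string" | "number";
  enum?: string[];
}

function conforms(
  output: Record<string, unknown>,
  spec: Record<string, FieldSpec>,
): boolean {
  for (const [field, rule] of Object.entries(spec)) {
    const value = output[field];
    if (typeof value !== rule.type) return false; // missing field or wrong type
    if (rule.enum && !rule.enum.includes(value as string)) return false; // enum violation
  }
  return true;
}

// Declared "schema" plus one valid and two invalid fixtures.
const schema: Record<string, FieldSpec> = {
  verdict: { type: "string", enum: ["pass", "fail"] },
  score: { type: "number" },
};
const valid = { verdict: "pass", score: 0.9 };
const invalidMissing = { verdict: "pass" };           // missing required field
const invalidEnum = { verdict: "maybe", score: 0.5 }; // value outside enum
```

A contract suite asserts `conforms(valid, schema)` is true and that both invalid fixtures are rejected, which is exactly the negative-test requirement in the rules below.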
+ # ─────────────────────────────────────────────────────────
+ # Hallucination Detection
+ # ─────────────────────────────────────────────────────────
+ hallucination_detection:
+   definition: >
+     Hallucination occurs when an LLM generates plausible-sounding but
+     factually incorrect or unsupported content.
+
+   detection_strategies:
+     - strategy: Grounding check
+       description: >
+         For RAG-based agents: compare named entities / key facts in response
+         against retrieved documents using NLI or embedding similarity.
+       tools: [DeepEval, Ragas, custom NLI probe]
+       complexity: high
+
+     - strategy: Schema field validation
+       description: >
+         Structured outputs (JSON) partially prevent hallucination by requiring
+         specific enumerations and reference IDs. If the LLM hallucinates a
+         field value, it either violates the schema (caught by contract test)
+         or produces a real-looking but wrong value (requires semantic check).
+       complexity: low
+
+     - strategy: Confidence calibration
+       description: >
+         Agent includes a confidence score or explicitly marks uncertain claims.
+         Downstream consumers reject or escalate low-confidence outputs.
+       requires: Prompt design to elicit uncertainty markers
+       complexity: medium
+
+   rate_targets:
+     structured_output_agents:
+       schema_conformance: "≥ 99%"
+       factual_hallucination_rate: "≤ 5%"
+     rag_agents:
+       grounded_answer_rate: "≥ 95%"
+     chat_agents:
+       refusal_on_out_of_scope: "≥ 90%"
+
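A very crude version of the grounding check can be sketched without NLI or embeddings: treat capitalized tokens as a stand-in for named entities and require each one to appear in the retrieved context. Real systems use DeepEval/Ragas-style judges; this is only an illustration of the comparison, and the heuristic will misfire on many real inputs:

```typescript
// Naive grounding probe: returns capitalized tokens from the answer
// that never appear in the retrieved context (case-insensitive).
function ungroundedEntities(answer: string, context: string): string[] {
  const entities = answer.match(/\b[A-Z][a-zA-Z]+\b/g) ?? [];
  const ctx = context.toLowerCase();
  return [...new Set(entities)].filter((e) => !ctx.includes(e.toLowerCase()));
}
```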
+ # ─────────────────────────────────────────────────────────
+ # Prompt Regression & Stability
+ # ─────────────────────────────────────────────────────────
+ prompt_regression:
+   definition: >
+     When a prompt is changed, verify that the output schema structure
+     is preserved (even if content differs). A prompt change that silently
+     breaks downstream field expectations is a regression.
+
+   test_pattern: |
+     # 1. Before prompt change: capture golden output structure
+     # 2. Change prompt
+     # 3. Re-run with temperature=0 on same input fixture
+     # 4. Validate output against schema (contract test)
+     # 5. Diff field presence / enum values against golden
+
+   required_on:
+     - Any change to agents/*/prompt.md
+     - Model version upgrade (same prompt, new model)
+     - New required output field added to schema
+
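Step 5 of the test_pattern (diff field presence and types against the golden output) might look like this; `structureDiff` is a hypothetical helper, not part of the bundled package:

```typescript
// Structural diff between a golden output and a fresh temperature=0 run.
// Content may differ; field presence and types must not.
function structureDiff(
  golden: Record<string, unknown>,
  fresh: Record<string, unknown>,
): string[] {
  const issues: string[] = [];
  for (const key of Object.keys(golden)) {
    if (!(key in fresh)) issues.push(`missing field: ${key}`);
    else if (typeof fresh[key] !== typeof golden[key]) issues.push(`type changed: ${key}`);
  }
  for (const key of Object.keys(fresh)) {
    if (!(key in golden)) issues.push(`new field: ${key}`);
  }
  return issues;
}
```

An empty result means the prompt change preserved the output structure; any entry signals the regression the standard describes.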
+ # ─────────────────────────────────────────────────────────
+ # Tools & Libraries
+ # ─────────────────────────────────────────────────────────
+ tools:
+   schema_validation:
+     - name: Ajv (TypeScript/JavaScript)
+       package: "ajv"
+       strengths: [JSON Schema draft-07 support, fast, minimal]
+       use_case: Contract tests for agents with JSON output-schema.json
+
+     - name: Zod (TypeScript)
+       package: "zod"
+       strengths: [Type-safe, composable, coercion support]
+       use_case: Runtime validation when TypeScript types are available
+
+     - name: Pydantic (Python)
+       package: "pydantic"
+       strengths: [Automatic validation from type annotations]
+       use_case: Python-based agent contract tests
+
+   hallucination_evaluation:
+     - name: DeepEval
+       package: "deepeval"
+       features: [Hallucination metric, Contextual Precision, Faithfulness]
+       note: Requires LLM judge (adds cost)
+
+     - name: Ragas
+       package: "ragas"
+       features: [Answer Grounding, Context Recall, Faithfulness]
+       note: Designed for RAG pipelines
+
+ # ─────────────────────────────────────────────────────────
+ # Quality Gates
+ # ─────────────────────────────────────────────────────────
+ quality_gates:
+   - gate: Schema conformance (CI)
+     threshold: "100% of valid fixtures pass schema validation"
+     enforcement: Block merge
+     automated: true
+     note: Run in every PR via contract.test.ts
+
+   - gate: Negative schema rejection (CI)
+     threshold: "100% of invalid fixtures are rejected by schema validator"
+     enforcement: Block merge
+     automated: true
+     note: Proves schema is strict enough
+
+   - gate: Hallucination rate (pre-release)
+     threshold: "≤ 5% hallucinated facts on benchmark dataset"
+     enforcement: Advisory (escalate to human review if exceeded)
+     automated: partial
+     note: Run monthly or on model upgrade
+
+   - gate: Prompt regression (on prompt change)
+     threshold: "Schema conformance maintained after prompt change"
+     enforcement: Block merge
+     automated: true (via contract test re-run)
+
+ # ─────────────────────────────────────────────────────────
+ # Rules
+ # ─────────────────────────────────────────────────────────
+ rules:
+   - id: agent-must-have-output-schema
+     trigger: defining or modifying an AI agent
+     instruction: >
+       Every agent MUST have a declared output-schema.json.
+       Agents without a schema cannot be safely composed into pipelines.
+     priority: required
+
+   - id: contract-test-per-agent
+     trigger: defining or modifying an AI agent
+     instruction: >
+       Every agent MUST have a contract.test.ts with at least:
+       (1) one valid fixture that passes schema validation
+       (2) one invalid fixture that fails schema validation
+     priority: required
+
+   - id: prompt-change-triggers-contract-rerun
+     trigger: modifying agents/*/prompt.md
+     instruction: >
+       Re-run the agent's contract tests with temperature=0 after any
+       prompt change. A schema violation in the golden fixture signals regression.
+     priority: required
+
+   - id: model-upgrade-triggers-contract-rerun
+     trigger: upgrading the LLM model version used by an agent
+     instruction: >
+       Run all agent contract tests against the new model.
+       Compare output structure, field presence, and enum values against golden.
+     priority: required
+
+ anti_patterns:
+   - Defining agents without output-schema.json
+   - Only validating JSON.parse (syntax) without schema (semantic) validation
+   - Writing contract tests with only valid fixtures (no negative cases)
+   - Accepting model upgrades without re-running contract tests
+   - Using high temperature (> 0.3) in contract test fixtures (non-deterministic)
+
+ quick_reference:
+   contract_test_checklist: |
+     □ agents/<name>/output-schema.json exists
+     □ agents/<name>/__tests__/contract.test.ts exists
+     □ Valid fixture: all required fields, correct types, enum values
+     □ Invalid fixtures: missing required field, wrong type, invalid enum
+     □ Schema validator: Ajv (JSON Schema) or Zod (TypeScript)
+     □ CI: contract tests run on every PR
+     □ Prompt change: re-run contract test, compare against golden