@trohde/earos 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (135)
  1. package/README.md +156 -0
  2. package/assets/init/.agents/skills/earos-artifact-gen/SKILL.md +106 -0
  3. package/assets/init/.agents/skills/earos-artifact-gen/references/interview-guide.md +313 -0
  4. package/assets/init/.agents/skills/earos-artifact-gen/references/output-guide.md +367 -0
  5. package/assets/init/.agents/skills/earos-assess/SKILL.md +212 -0
  6. package/assets/init/.agents/skills/earos-assess/references/calibration-benchmarks.md +160 -0
  7. package/assets/init/.agents/skills/earos-assess/references/output-templates.md +311 -0
  8. package/assets/init/.agents/skills/earos-assess/references/scoring-protocol.md +281 -0
  9. package/assets/init/.agents/skills/earos-calibrate/SKILL.md +153 -0
  10. package/assets/init/.agents/skills/earos-calibrate/references/agreement-metrics.md +188 -0
  11. package/assets/init/.agents/skills/earos-calibrate/references/calibration-protocol.md +263 -0
  12. package/assets/init/.agents/skills/earos-create/SKILL.md +257 -0
  13. package/assets/init/.agents/skills/earos-create/references/criterion-writing-guide.md +268 -0
  14. package/assets/init/.agents/skills/earos-create/references/dependency-rules.md +193 -0
  15. package/assets/init/.agents/skills/earos-create/references/rubric-interview-guide.md +123 -0
  16. package/assets/init/.agents/skills/earos-create/references/validation-checklist.md +238 -0
  17. package/assets/init/.agents/skills/earos-profile-author/SKILL.md +251 -0
  18. package/assets/init/.agents/skills/earos-profile-author/references/criterion-writing-guide.md +280 -0
  19. package/assets/init/.agents/skills/earos-profile-author/references/design-methods.md +158 -0
  20. package/assets/init/.agents/skills/earos-profile-author/references/profile-checklist.md +173 -0
  21. package/assets/init/.agents/skills/earos-remediate/SKILL.md +118 -0
  22. package/assets/init/.agents/skills/earos-remediate/references/output-template.md +199 -0
  23. package/assets/init/.agents/skills/earos-remediate/references/remediation-patterns.md +330 -0
  24. package/assets/init/.agents/skills/earos-report/SKILL.md +85 -0
  25. package/assets/init/.agents/skills/earos-report/references/portfolio-template.md +181 -0
  26. package/assets/init/.agents/skills/earos-report/references/single-artifact-template.md +168 -0
  27. package/assets/init/.agents/skills/earos-review/SKILL.md +130 -0
  28. package/assets/init/.agents/skills/earos-review/references/challenge-patterns.md +163 -0
  29. package/assets/init/.agents/skills/earos-review/references/output-template.md +180 -0
  30. package/assets/init/.agents/skills/earos-template-fill/SKILL.md +177 -0
  31. package/assets/init/.agents/skills/earos-template-fill/references/evidence-writing-guide.md +186 -0
  32. package/assets/init/.agents/skills/earos-template-fill/references/section-rubric-mapping.md +200 -0
  33. package/assets/init/.agents/skills/earos-validate/SKILL.md +113 -0
  34. package/assets/init/.agents/skills/earos-validate/references/fix-patterns.md +281 -0
  35. package/assets/init/.agents/skills/earos-validate/references/validation-checks.md +287 -0
  36. package/assets/init/.claude/CLAUDE.md +4 -0
  37. package/assets/init/AGENTS.md +293 -0
  38. package/assets/init/CLAUDE.md +635 -0
  39. package/assets/init/README.md +507 -0
  40. package/assets/init/calibration/gold-set/.gitkeep +0 -0
  41. package/assets/init/calibration/results/.gitkeep +0 -0
  42. package/assets/init/core/core-meta-rubric.yaml +643 -0
  43. package/assets/init/docs/consistency-report.md +325 -0
  44. package/assets/init/docs/getting-started.md +194 -0
  45. package/assets/init/docs/profile-authoring-guide.md +51 -0
  46. package/assets/init/docs/terminology.md +126 -0
  47. package/assets/init/earos.manifest.yaml +104 -0
  48. package/assets/init/evaluations/.gitkeep +0 -0
  49. package/assets/init/examples/aws-event-driven-order-processing/artifact.yaml +2056 -0
  50. package/assets/init/examples/aws-event-driven-order-processing/evaluation.yaml +973 -0
  51. package/assets/init/examples/aws-event-driven-order-processing/report.md +244 -0
  52. package/assets/init/examples/example-solution-architecture.evaluation.yaml +136 -0
  53. package/assets/init/examples/multi-cloud-data-analytics/artifact.yaml +715 -0
  54. package/assets/init/overlays/data-governance.yaml +94 -0
  55. package/assets/init/overlays/regulatory.yaml +154 -0
  56. package/assets/init/overlays/security.yaml +92 -0
  57. package/assets/init/profiles/adr.yaml +225 -0
  58. package/assets/init/profiles/capability-map.yaml +223 -0
  59. package/assets/init/profiles/reference-architecture.yaml +426 -0
  60. package/assets/init/profiles/roadmap.yaml +205 -0
  61. package/assets/init/profiles/solution-architecture.yaml +227 -0
  62. package/assets/init/research/architecture-assessment-rubrics-research.docx +0 -0
  63. package/assets/init/research/architecture-assessment-rubrics-research.md +566 -0
  64. package/assets/init/research/reference-architecture-research.md +751 -0
  65. package/assets/init/standard/EAROS.md +1426 -0
  66. package/assets/init/standard/schemas/artifact.schema.json +1295 -0
  67. package/assets/init/standard/schemas/artifact.uischema.json +65 -0
  68. package/assets/init/standard/schemas/evaluation.schema.json +284 -0
  69. package/assets/init/standard/schemas/rubric.schema.json +383 -0
  70. package/assets/init/templates/evaluation-record.template.yaml +58 -0
  71. package/assets/init/templates/new-profile.template.yaml +65 -0
  72. package/bin.js +188 -0
  73. package/dist/assets/_basePickBy-BVu6YmSW.js +1 -0
  74. package/dist/assets/_baseUniq-CWRzQDz_.js +1 -0
  75. package/dist/assets/arc-CyDBhtDM.js +1 -0
  76. package/dist/assets/architectureDiagram-2XIMDMQ5-BH6O4dvN.js +36 -0
  77. package/dist/assets/blockDiagram-WCTKOSBZ-2xmwdjpg.js +132 -0
  78. package/dist/assets/c4Diagram-IC4MRINW-BNmPRFJF.js +10 -0
  79. package/dist/assets/channel-CiySTNoJ.js +1 -0
  80. package/dist/assets/chunk-4BX2VUAB-DGQTvirp.js +1 -0
  81. package/dist/assets/chunk-55IACEB6-DNMAQAC_.js +1 -0
  82. package/dist/assets/chunk-FMBD7UC4-BJbVTQ5o.js +15 -0
  83. package/dist/assets/chunk-JSJVCQXG-BCxUL74A.js +1 -0
  84. package/dist/assets/chunk-KX2RTZJC-H7wWZOfz.js +1 -0
  85. package/dist/assets/chunk-NQ4KR5QH-BK4RlTQF.js +220 -0
  86. package/dist/assets/chunk-QZHKN3VN-0chxDV5g.js +1 -0
  87. package/dist/assets/chunk-WL4C6EOR-DexfQ-AV.js +189 -0
  88. package/dist/assets/classDiagram-VBA2DB6C-D7luWJQn.js +1 -0
  89. package/dist/assets/classDiagram-v2-RAHNMMFH-D7luWJQn.js +1 -0
  90. package/dist/assets/clone-ylgRbd3D.js +1 -0
  91. package/dist/assets/cose-bilkent-S5V4N54A-DS2IOCfZ.js +1 -0
  92. package/dist/assets/cytoscape.esm-CyJtwmzi.js +331 -0
  93. package/dist/assets/dagre-KLK3FWXG-BbSoTTa3.js +4 -0
  94. package/dist/assets/defaultLocale-DX6XiGOO.js +1 -0
  95. package/dist/assets/diagram-E7M64L7V-C9TvYgv0.js +24 -0
  96. package/dist/assets/diagram-IFDJBPK2-DowUMWrg.js +43 -0
  97. package/dist/assets/diagram-P4PSJMXO-BL6nrnQF.js +24 -0
  98. package/dist/assets/erDiagram-INFDFZHY-rXPRl8VM.js +70 -0
  99. package/dist/assets/flowDiagram-PKNHOUZH-DBRM99-W.js +162 -0
  100. package/dist/assets/ganttDiagram-A5KZAMGK-INcWFsBT.js +292 -0
  101. package/dist/assets/gitGraphDiagram-K3NZZRJ6-DMwpfE91.js +65 -0
  102. package/dist/assets/graph-DLQn37b-.js +1 -0
  103. package/dist/assets/index-BFFITMT8.js +650 -0
  104. package/dist/assets/index-H7f6VTz1.css +1 -0
  105. package/dist/assets/infoDiagram-LFFYTUFH-B0f4TWRM.js +2 -0
  106. package/dist/assets/init-Gi6I4Gst.js +1 -0
  107. package/dist/assets/ishikawaDiagram-PHBUUO56-CsU6XimZ.js +70 -0
  108. package/dist/assets/journeyDiagram-4ABVD52K-CQ7ibNib.js +139 -0
  109. package/dist/assets/kanban-definition-K7BYSVSG-DzEN7THt.js +89 -0
  110. package/dist/assets/katex-B1X10hvy.js +261 -0
  111. package/dist/assets/layout-C0dvb42R.js +1 -0
  112. package/dist/assets/linear-j4a8mGj7.js +1 -0
  113. package/dist/assets/mindmap-definition-YRQLILUH-DP8iEuCf.js +68 -0
  114. package/dist/assets/ordinal-Cboi1Yqb.js +1 -0
  115. package/dist/assets/pieDiagram-SKSYHLDU-BpIAXgAm.js +30 -0
  116. package/dist/assets/quadrantDiagram-337W2JSQ-DrpXn5Eg.js +7 -0
  117. package/dist/assets/requirementDiagram-Z7DCOOCP-Bg7EwHlG.js +73 -0
  118. package/dist/assets/sankeyDiagram-WA2Y5GQK-BWagRs1F.js +10 -0
  119. package/dist/assets/sequenceDiagram-2WXFIKYE-q5jwhivG.js +145 -0
  120. package/dist/assets/stateDiagram-RAJIS63D-B_J9pE-2.js +1 -0
  121. package/dist/assets/stateDiagram-v2-FVOUBMTO-Q_1GcybB.js +1 -0
  122. package/dist/assets/timeline-definition-YZTLITO2-dv0jgQ0z.js +61 -0
  123. package/dist/assets/treemap-KZPCXAKY-Dt1dkIE7.js +162 -0
  124. package/dist/assets/vennDiagram-LZ73GAT5-BdO5RgRZ.js +34 -0
  125. package/dist/assets/xychartDiagram-JWTSCODW-CpDVe-8v.js +7 -0
  126. package/dist/index.html +23 -0
  127. package/export-docx.js +1583 -0
  128. package/init.js +353 -0
  129. package/manifest-cli.mjs +207 -0
  130. package/package.json +83 -0
  131. package/schemas/artifact.schema.json +1295 -0
  132. package/schemas/artifact.uischema.json +65 -0
  133. package/schemas/evaluation.schema.json +284 -0
  134. package/schemas/rubric.schema.json +383 -0
  135. package/serve.js +238 -0
@@ -0,0 +1,566 @@
# Rubrics for AI-Agent Assessment of IT/Enterprise Architecture Artifacts

## Comprehensive Research Report

**Date:** March 2026
**Scope:** Existing assessment frameworks, AI-agent-based evaluation, artifact format requirements, and rubric design principles for automated architecture quality assessment.

---

## Table of Contents

1. [Existing Rubrics and Frameworks](#1-existing-rubrics-and-frameworks)
2. [AI-Agent-Based Assessment of Architecture Quality](#2-ai-agent-based-assessment-of-architecture-quality)
3. [Artifact Structure and Format Requirements for AI Assessability](#3-artifact-structure-and-format-requirements-for-ai-assessability)
4. [Rubric Design Principles for AI Grading](#4-rubric-design-principles-for-ai-grading)
5. [Synthesis: A Proposed Approach](#5-synthesis-a-proposed-approach)
6. [Sources](#6-sources)

---

## 1. Existing Rubrics and Frameworks

### 1.1 TOGAF Architecture Compliance Review

TOGAF provides the most established framework for architecture artifact assessment. The Architecture Compliance Review process (Phase G of the ADM) includes structured checklists organized by discipline:

**Checklist domains:**
- System Engineering / Overall Architecture
- Hardware and Operating System
- Software Services and Middleware
- Applications
- Information Management
- Security
- System Management

Each checklist contains questions that reviewers selectively apply to a given project. TOGAF does not prescribe a numerical scoring system per se; instead, compliance is assessed qualitatively against principles, standards, and guidelines. The review process yields findings classified by severity and type of non-compliance.

**Quality Attribute Scenarios:** TOGAF recommends defining quality attributes using a six-part scenario: source, stimulus, artifact, environment, response, and response measure. This structured decomposition is particularly useful for AI-based assessment because it creates discrete, evaluable units.

**Architecture Maturity Models:** The Architecture Capability Maturity Model (ACMM), developed by the US Department of Commerce, scores nine architecture elements on a maturity scale (typically 1-5): Architecture Process, Business Linkage, Senior Management Involvement, Operating Unit Participation, Architecture Communication, IT Security, Architecture Governance, IT Investment and Acquisition Strategy, and Architecture Scope.

### 1.2 OMB Enterprise Architecture Assessment Framework (EAAF)

The US Office of Management and Budget's EAAF v3.1 is one of the few frameworks that provides explicit scoring criteria. It evaluates EA across three capability areas:

- **Completion** — Are the architectural artifacts complete and up to date?
- **Use** — Is the architecture actually being used to drive decisions?
- **Results** — Is the architecture producing measurable outcomes?

Each capability area uses criteria spanning five maturity levels (1-5). This framework is notable for its emphasis on measurable outcomes rather than just document completeness, making it a useful reference for designing AI-assessable rubrics.

### 1.3 Zachman Framework Assessment

The Zachman Framework uses its 6×6 matrix (interrogatives × perspectives) as an implicit assessment structure. Quality criteria include:

- **Completeness** — Are all relevant cells of the matrix populated? Unpopulated cells represent implicit (undefined) assumptions that increase risk.
- **Explicitness** — Are artifacts in each cell clearly and formally defined?
- **Consistency** — Do artifacts across cells align with each other?
- **Naming conventions and versioning** — Are governance standards established?

The framework supports gap analysis by identifying missing artifacts or misalignments within the architecture. Quality assessment evaluates the completeness and currency of existing representations, identifying missing perspectives and incomplete coverage.

### 1.4 Cloud Well-Architected Frameworks

AWS, Azure, and GitHub have published well-architected frameworks that represent the most operationalized architecture assessment rubrics available today.

**AWS Well-Architected Framework** evaluates across six pillars: Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, and Sustainability. The AWS Well-Architected Tool provides automated assessment through structured questionnaires. Notably, AWS has built a generative AI solution using Amazon Bedrock that automatically analyzes architecture documents against the framework's pillars, producing solution summaries, pillar-by-pillar evaluations, best-practice adherence analysis, improvement recommendations, and risk assessments.

**Azure Well-Architected Framework** uses five pillars: Reliability, Cost Optimization, Operational Excellence, Performance Efficiency, and Security. The Azure Well-Architected Review Tool poses approximately 60 questions and generates a score per pillar and an overall workload score. Azure Advisor Score provides automated scoring based on resource configuration and usage telemetry.

**GitHub Well-Architected** covers Architecture (scalability, resiliency, modularity), Application Security, and Productivity with structured checklists. Assessment criteria include repository organization, modularity and code design, API management, disaster recovery and resilience, and documentation quality.

These cloud frameworks are significant because they demonstrate that architecture assessment can be decomposed into pillar-based scoring with automated tooling — a pattern directly applicable to AI-agent assessment.

### 1.5 C4 Model Review Checklist

The C4 model provides a concrete review checklist for architecture diagrams with three assessment dimensions:

- **General criteria:** Title, diagram type identification, scope definition, key/legend
- **Elements:** Naming, abstraction level clarity, functional descriptions, technology specifications, acronym handling, visual conventions
- **Relationships:** Labeling, directional accuracy, communication technology specs, visual conventions

This checklist is notable for its specificity and atomicity — each criterion is a discrete yes/no question, making it highly suitable for automated assessment.
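
Because every C4 checklist item is a discrete yes/no question, applying it programmatically is straightforward. The sketch below is illustrative only: the criterion names and the shape of the `diagram` dictionary are assumptions, standing in for whatever machine-readable element catalog accompanies the diagram.

```python
# Sketch: applying C4-style binary review criteria to a diagram's
# machine-readable element catalog. All field names are illustrative.

def check_c4_diagram(diagram: dict) -> dict[str, bool]:
    """Return a pass/fail result per binary criterion."""
    elements = diagram.get("elements", [])
    relationships = diagram.get("relationships", [])
    return {
        "has_title": bool(diagram.get("title")),
        "has_key_or_legend": bool(diagram.get("legend")),
        "every_element_named": all(e.get("name") for e in elements),
        "every_element_described": all(e.get("description") for e in elements),
        "every_relationship_labeled": all(r.get("label") for r in relationships),
    }

diagram = {
    "title": "Payment Gateway - Container Diagram",
    "legend": "boxes = containers, arrows = sync calls",
    "elements": [{"name": "API", "description": "Routes payment requests"}],
    "relationships": [{"label": "sends card data to"}],
}
results = check_c4_diagram(diagram)
print(results)  # every criterion passes for this toy input
```

Each returned boolean maps one-to-one onto a checklist question, so the output is directly auditable.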

### 1.6 Architecture Review Board (ARB) Checklists

Industry ARB checklists typically organize assessment into seven dimensions:

1. **Strategic Alignment** — Does the architecture support organizational objectives?
2. **Compliance and Security** — Does it meet regulatory and security standards?
3. **Cost Management** — Are costs analyzed and optimization opportunities identified?
4. **Technical Feasibility and Risk** — Are risks assessed and mitigated?
5. **Performance and Scalability** — Can the system handle load and growth?
6. **Interoperability and Integration** — Does it work with existing systems?
7. **Documentation and Best Practices** — Is the architecture well-documented?

### 1.7 Harvard EA Design Review Checklist

Harvard's Enterprise Architecture group publishes an Application Architecture Checklist that categorizes applications by deployment model (Developed Solutions, Purchased/Private Cloud, Purchased/Hosted, SaaS) and evaluates against principles, standards, and resources for each layer of the architecture stack, including cross-cutting security requirements.

### 1.8 Academic Literature

A 2024 systematic literature review published in ACM Computing Surveys analyzed 109 articles (selected from 3,644 candidate papers) on EA evaluation methods published since 2005. The key finding: the critical factor for widespread adoption of EA evaluation is automation of the assessment and architecture modeling processes, particularly data collection. The review identified 204 active researchers and 61 publication venues dedicated to this topic, confirming it as a mature research area with a clear gap in automated assessment tooling.

A separate study analyzing software architecture documentation models evaluated frameworks against criteria including: registration of system architecture knowledge, support for incremental documentation, audience-specific descriptions, separation of current and future architectures, support for decision analysis, and modularity support.

---

## 2. AI-Agent-Based Assessment of Architecture Quality

### 2.1 Current State of AI in Architecture Assessment

Despite significant advances in AI-assisted code review and generation, the explicit application of AI to software and enterprise architecture assessment remains relatively under-explored. A 2025 survey on AI for Software Architecture confirmed that the practice of designing and evaluating architecture is still largely manual, though emerging capabilities are promising.

Key developments:

- **AWS Bedrock-based architecture review:** AWS has demonstrated a generative AI solution that analyzes architecture documents against Well-Architected Framework pillars automatically, representing the most production-ready example of AI-agent architecture assessment.

- **LLM-based code review:** Tools like GitHub Copilot, CodeRabbit, and others can detect bugs and code smells and provide improvement suggestions. Research is extending these to detect architecture-level technical debt including design debt, documentation debt, and infrastructure debt.

- **Reasoning-driven LLMs:** Models like OpenAI o1 and DeepSeek-R1 provide step-by-step justifications for decisions, making AI-driven architectural insights more transparent and actionable.

- **Graph-based AI for architecture:** Graph neural networks can provide structural representations of software architecture, potentially enabling AI to reason about architectural properties like coupling, cohesion, and dependency patterns.

- **C4 model automation:** Research on collaborative LLM agents for C4 software architecture design automation demonstrates AI agents working together to generate and evaluate architecture documentation.

### 2.2 What Works

Based on current evidence, AI-agent architecture assessment works best when:

1. **Criteria are concrete and binary.** Questions like "Does every element have a name?" or "Is the API versioned?" can be reliably assessed by LLMs.

2. **Reference standards exist.** When the assessment involves checking compliance against a well-defined standard (e.g., OpenAPI spec validation, ArchiMate schema conformance), AI can achieve high accuracy.

3. **Structure is consistent.** When artifacts follow a standard template, AI can reliably extract and evaluate specific sections.

4. **Domain context is provided.** RAG (Retrieval-Augmented Generation) or fine-tuning with domain-specific knowledge significantly improves assessment quality.

5. **Scoring is decomposed.** Breaking holistic quality into atomic, independently scorable dimensions dramatically improves reliability (see Section 4).

### 2.3 What Doesn't Work (Yet)

1. **Holistic quality judgments.** Asking an AI to assign a single overall quality score to an architecture document produces inconsistent results. The same document may receive different scores on different runs.

2. **Context-dependent trade-off evaluation.** Architecture decisions often involve trade-offs that only make sense in a specific business context (e.g., choosing eventual consistency for a particular use case). Without deep contextual understanding, AI may misjudge these.

3. **Cross-artifact consistency checking.** Verifying that a data model, API specification, sequence diagram, and deployment diagram are all internally consistent requires reasoning across multiple artifact types simultaneously — a challenge for current LLMs.

4. **Evaluating innovation and creativity.** Novel architectural patterns or unconventional solutions may be penalized by AI systems trained on conventional practices.

5. **Stakeholder-specific quality.** The same architecture may be good for one stakeholder group and poor for another. AI struggles with this multi-perspective evaluation.

### 2.4 Key Challenges

**Ambiguity:** Architecture artifacts frequently contain ambiguous language ("the system should be scalable," "appropriate security measures"). AI agents need disambiguation strategies, either through rubric clarification or by flagging ambiguity for human review.

**Subjectivity:** Many quality dimensions are inherently subjective (elegance, simplicity, appropriateness). Research shows LLMs can achieve reasonable inter-rater reliability with human experts when given detailed rubrics, but performance degrades significantly without them.

**Context-dependency:** Architecture quality is deeply contextual. A microservices architecture may be excellent for one scenario and terrible for another. AI agents need access to business context, constraints, and non-functional requirements to make meaningful assessments.

**Hallucination risk:** LLMs may invent technical details or make confident but incorrect claims about architecture quality. Evidence-anchored scoring (requiring the AI to cite specific evidence from the artifact) mitigates this.

**Prompt sensitivity:** Minor changes in how evaluation criteria are phrased can significantly alter AI assessments. The RULERS framework (see Section 4.4) addresses this through "locked rubrics" that eliminate interpretation drift.

---

## 3. Artifact Structure and Format Requirements for AI Assessability

### 3.1 Machine-Readable Architecture Formats

To enable AI-agent assessment, architecture artifacts should leverage machine-readable formats wherever possible.

**ArchiMate Model Exchange File Format:**
The Open Group's ArchiMate exchange format is an XML-based standard backed by validating XSD schemas. It supports three schema types: model exchange, view exchange, and diagram exchange. Models can include Dublin Core metadata. This format has been mandatory for certified ArchiMate tools since June 2018. The structured, schema-validated nature makes it highly amenable to automated assessment — an AI agent can parse the XML, validate against the schema, and evaluate properties like completeness, naming conventions, and relationship integrity.
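
A naming-convention check of the kind described above can be sketched with the standard library alone. The namespace URI below is assumed to be the ArchiMate 3.x exchange namespace, and the embedded sample model is illustrative; a real checker would parse the exported file and validate against the XSD first.

```python
# Sketch: a structural completeness check over an ArchiMate exchange model —
# flag every element that lacks a non-empty name. Namespace and sample XML
# are assumptions, not taken from a real export.
import xml.etree.ElementTree as ET

NS = {"a": "http://www.opengroup.org/xsd/archimate/3.0/"}

sample = """<model xmlns="http://www.opengroup.org/xsd/archimate/3.0/">
  <elements>
    <element identifier="e1"><name xml:lang="en">Order Service</name></element>
    <element identifier="e2"><name xml:lang="en"></name></element>
  </elements>
</model>"""

root = ET.fromstring(sample)
unnamed = [
    el.get("identifier")
    for el in root.findall(".//a:element", NS)
    if not (el.findtext("a:name", default="", namespaces=NS) or "").strip()
]
print("elements without a name:", unnamed)  # -> ['e2']
```

The same traversal pattern extends to relationship integrity (e.g., every relationship's source and target identifiers resolve to declared elements).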

**OpenAPI Specification:**
For API architecture artifacts, the OpenAPI Specification (OAS) defines a machine-readable standard for HTTP API descriptions. Tools like Spectral provide automated linting and validation. APIMatic's "Score My OpenAPI" tool demonstrates that API specifications can be programmatically scored for standard compliance and quality. This is a mature model for how architecture artifacts can be made automatically assessable.

**Architecture Decision Records (ADRs):**
Traditional ADRs use Markdown templates (e.g., MADR — Markdown Architectural Decision Records). The emerging Decision Reasoning Format (DRF) provides a vendor-neutral, machine-readable YAML/JSON format for representing decisions with explicit reasoning, assumptions, cognitive state, and trade-offs. DRF is particularly valuable for AI assessment because it structures the rationale in a parseable way.

**Infrastructure as Code (IaC):**
Terraform, CloudFormation, Pulumi, and similar IaC formats provide machine-readable architecture descriptions that can be directly analyzed by AI agents. Cloud well-architected tools already leverage this for automated assessment.

### 3.2 Structured Templates and Mandatory Sections

For architecture documents to be reliably assessed by AI agents, they should follow structured templates with mandatory sections. Recommended approaches:

**Arc42 Template:**
Arc42 is a widely used, process-independent template for software architecture documentation. Its 12 sections provide a consistent structure that AI agents can reliably parse and evaluate:

1. Introduction and Goals
2. Constraints
3. Context and Scope
4. Solution Strategy
5. Building Block View
6. Runtime View
7. Deployment View
8. Cross-cutting Concepts
9. Architecture Decisions
10. Quality Requirements
11. Risks and Technical Debt
12. Glossary

**Recommended mandatory sections for AI-assessable architecture documents:**

- **Metadata block** (YAML frontmatter): document type, version, author, date, status, related artifacts, classification
- **Context and scope:** Problem statement, business drivers, constraints, stakeholders
- **Architecture decisions:** Structured ADRs with decision, rationale, alternatives considered, trade-offs
- **Quality attributes:** Explicitly stated with measurable acceptance criteria using TOGAF quality attribute scenarios
- **Component/service catalog:** Structured listing with responsibilities, interfaces, dependencies
- **Risk register:** Identified risks with likelihood, impact, mitigation strategies
- **Compliance mapping:** Explicit mapping to applicable standards, principles, regulations

### 3.3 Metadata and Tagging Conventions

**YAML frontmatter example for architecture documents:**

```yaml
---
document_type: solution_architecture
version: 2.1
status: draft | review | approved | deprecated
author: jane.doe@company.com
last_modified: 2026-03-15
classification: internal
domain: payments
systems:
  - payment-gateway
  - fraud-detection
standards_compliance:
  - PCI-DSS-4.0
  - company-api-standards-v3
quality_attributes:
  - availability: 99.99%
  - latency_p99: 200ms
  - throughput: 10000_tps
related_artifacts:
  - type: data_model
    ref: /artifacts/payments/data-model-v2.yaml
  - type: api_spec
    ref: /artifacts/payments/openapi-v3.yaml
  - type: adr
    refs:
      - /decisions/ADR-042-event-sourcing.md
      - /decisions/ADR-043-cqrs-pattern.md
assessment_rubric: enterprise-architecture-rubric-v2
---
```
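
A pre-assessment gate can verify that this frontmatter declares the keys a rubric depends on. The sketch below uses a deliberately naive line scan so it stays dependency-free; a real implementation would parse the frontmatter with a YAML library, and the required-key set is an illustrative assumption.

```python
# Sketch: check that a document's YAML frontmatter declares the keys an
# assessment rubric needs. Naive line scan for illustration only — a real
# implementation would use a YAML parser. REQUIRED_KEYS is an assumption.

REQUIRED_KEYS = {"document_type", "version", "status", "assessment_rubric"}

def frontmatter_gaps(markdown: str) -> set[str]:
    """Return the required top-level keys missing from the frontmatter."""
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return set(REQUIRED_KEYS)  # no frontmatter block at all
    found = set()
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter
        if line.startswith((" ", "\t", "-")):
            continue  # nested values and list items are not top-level keys
        key, sep, _ = line.partition(":")
        if sep:
            found.add(key.strip())
    return REQUIRED_KEYS - found

doc = "---\ndocument_type: solution_architecture\nversion: 2.1\n---\n# Title\n"
print(frontmatter_gaps(doc))  # missing: status and assessment_rubric
```

Failing this gate maps naturally onto a "Cannot Assess" outcome rather than a low score: the artifact is not yet in assessable shape.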

**Tagging conventions for assessability:**

- Use semantic section markers (e.g., `<!-- SECTION:quality_attributes -->`) to enable precise extraction
- Tag architecture decisions with categories: `[DECISION:technology]`, `[DECISION:pattern]`, `[DECISION:trade-off]`
- Mark assumptions explicitly: `[ASSUMPTION]`, `[CONSTRAINT]`, `[RISK]`
- Use structured tables for traceability matrices
- Embed quality attribute scenarios in a parseable format (YAML code blocks within Markdown)

### 3.4 Diagram Assessability

Architecture diagrams present a particular challenge for AI assessment. Approaches to improve assessability:

- **Model-based diagrams:** Use ArchiMate, UML, or C4 models stored in machine-readable formats (XML, JSON) rather than image-only diagrams
- **Diagram-as-code:** Tools like Structurizr (C4), PlantUML, Mermaid, and D2 generate diagrams from text descriptions that AI can parse
- **Dual representation:** Store both the machine-readable model and the rendered diagram, with the model as the authoritative source
- **Embedded metadata:** Include element catalogs, relationship tables, and technology tags alongside diagrams

---

## 4. Rubric Design Principles for AI Grading

### 4.1 Core Design Principles

Research from Snorkel AI, Microsoft, and recent academic work converges on several key principles for designing rubrics that AI agents can reliably apply:

**Principle 1: Decompose into atomic criteria.**
Break holistic quality assessments into the smallest possible independent dimensions. Each criterion should evaluate exactly one aspect. Research shows LLM-as-judge alignment improved from 37.3% to 93.95% when rubrics were provided, and further improvements come from atomic decomposition.

**Principle 2: Use analytic rather than holistic rubrics.**
Analytic rubrics (separate scores per dimension) consistently outperform holistic rubrics (single overall score) for AI-based evaluation. Education research confirms sharper inter-rater agreement when analytic rubrics replace holistic judgments.

**Principle 3: Define precise level descriptors.**
Every point on every scale needs a written definition. For example, instead of "1=Poor, 5=Excellent," each level should have a concrete description of what that level looks like for that specific criterion. Without this, LLMs exhibit significant scoring drift.

**Principle 4: Include positive and negative examples.**
Anchor criteria with exemplars of good and bad performance. Research on few-shot examples in LLM grading shows they reduce ambiguity and improve consistency, particularly for subjective criteria.

**Principle 5: Require evidence citation.**
The AI should be required to cite specific evidence from the artifact for every score. The RULERS framework calls this "evidence-anchored scoring" and demonstrates it produces mechanically auditable assessments.

**Principle 6: Use deterministic protocols, not generative prompts.**
Frame rubric evaluation as executing a deterministic protocol rather than generating a creative response. This reduces prompt sensitivity and scoring variability.

**Principle 7: Separate assessment dimensions from aggregation.**
Score each dimension independently, then combine scores using a defined aggregation function. The multi-criteria approach "decomposes relevance into interpretable subcriteria, with each criterion independently graded via dedicated prompts, with an additional prompt aggregating these scores."
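
Principles 5 and 7 combine naturally in a scoring record: each dimension carries its own score plus the verbatim evidence behind it, and a separate, purely deterministic step performs the aggregation. The dimension names, weights, and evidence strings below are illustrative.

```python
# Sketch: evidence-anchored per-dimension scores, aggregated by a defined
# function rather than by the judge itself. Dimensions and weights are
# illustrative assumptions.
from dataclasses import dataclass

@dataclass
class DimensionScore:
    dimension: str
    score: int       # on the 0-3 ordinal scale
    evidence: str    # verbatim citation from the artifact

def aggregate(scores: list[DimensionScore], weights: dict[str, float]) -> float:
    """Weighted mean, computed only after every dimension is scored."""
    total = sum(weights[s.dimension] * s.score for s in scores)
    return total / sum(weights[s.dimension] for s in scores)

scores = [
    DimensionScore("completeness", 2, "Sections 1-9 present; no risk register"),
    DimensionScore("traceability", 3, "Every requirement maps to a component"),
]
overall = aggregate(scores, {"completeness": 0.6, "traceability": 0.4})
print(round(overall, 2))  # -> 2.4
```

Keeping aggregation in code (not in a prompt) makes the overall score reproducible and the evidence trail mechanically auditable.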
296
+
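Principles 1 and 7 together can be sketched as a list of atomic criteria plus a separate aggregation step, with one judge call per criterion rather than one holistic call. Names below are illustrative, not from any cited framework:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Criterion:
    """One atomic dimension of an otherwise holistic judgment."""
    name: str
    question: str  # the single aspect this criterion evaluates

# A holistic prompt like "rate this architecture document" decomposed
# into independent criteria, each graded via its own dedicated prompt.
CRITERIA = [
    Criterion("completeness", "Are all required sections present?"),
    Criterion("measurability", "Are quality attributes measurable?"),
    Criterion("rationale", "Are decisions recorded with rationale?"),
]

def aggregate(scores: dict[str, int]) -> float:
    """Separate aggregation step: combine independently produced scores."""
    return sum(scores.values()) / len(scores)

# Scores come from one judge call per criterion, never one holistic call.
print(aggregate({"completeness": 3, "measurability": 1, "rationale": 2}))  # → 2.0
```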
297
### 4.2 Scoring Scales

Research and practice suggest the following for AI-applied rubrics:

- **Preferred scale: 0-3 or 1-4.** Smaller scales reduce ambiguity and improve inter-rater reliability. A 1-5 scale is common in industry but introduces a muddled middle.
- **Include "Not Applicable" and "Cannot Assess."** Architecture artifacts may not always contain information relevant to every criterion. The AI should be able to indicate when a criterion doesn't apply or when the artifact lacks sufficient information for assessment.
- **Binary where possible.** For structural/compliance criteria (e.g., "Does the document include a quality attributes section?"), use Yes/No rather than graded scales.
- **Ordinal with clear boundaries.** For qualitative criteria, define each level so that the boundary between levels is unambiguous:
  - 0 = Absent or fundamentally inadequate
  - 1 = Present but significantly incomplete or flawed
  - 2 = Adequate with minor gaps
  - 3 = Comprehensive and well-executed

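The scale above, including the two non-numeric outcomes, can be modeled so that "Not Applicable" and "Cannot Assess" are excluded from aggregation rather than silently treated as zeros. A sketch; names are illustrative:

```python
from enum import Enum

class Score(Enum):
    """0-3 ordinal scale plus the two non-numeric outcomes."""
    ABSENT = 0         # absent or fundamentally inadequate
    INCOMPLETE = 1     # present but significantly incomplete or flawed
    ADEQUATE = 2       # adequate with minor gaps
    COMPREHENSIVE = 3  # comprehensive and well-executed
    NOT_APPLICABLE = "n/a"     # criterion does not apply to this artifact
    CANNOT_ASSESS = "unknown"  # artifact lacks the information to judge

def numeric(scores: list[Score]) -> list[int]:
    """Keep only scorable results; N/A and Cannot-Assess are dropped
    from aggregation instead of being counted as zeros."""
    return [s.value for s in scores if isinstance(s.value, int)]

print(numeric([Score.ADEQUATE, Score.NOT_APPLICABLE, Score.COMPREHENSIVE]))
# → [2, 3]
```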
310
### 4.3 Disambiguation Strategies

Specific strategies for reducing ambiguity in AI-applied rubrics:

- **Decision trees:** Structure criteria as if/then decision trees rather than open-ended questions. For example: "IF the document includes quality attributes, THEN check: are they measurable? Are acceptance criteria defined? Is a test strategy included?"
- **DAG-based evaluation:** Structure evaluation as a Directed Acyclic Graph where each node is an LLM judge handling a specific decision. This reduces ambiguity by breaking interactions into fine-grained, atomic units.
- **Confusion-aware optimization (CARO):** Recent research (2026) proposes isolating specific misclassification patterns and generating targeted rule patches that resolve ambiguities without degrading global performance.
- **Negative criteria:** Explicitly state what does NOT qualify for each score level. This is often more informative than positive descriptions alone.
- **Boundary cases:** Provide examples of artifacts that fall on the boundary between score levels, with explanations of why they receive the higher or lower score.

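The decision-tree strategy can be sketched as ordinary branching code over a hypothetical parsed artifact; the `doc` structure and field names below are assumptions for illustration:

```python
def check_quality_attributes(doc: dict) -> dict:
    """One criterion as an IF/THEN decision tree rather than an
    open-ended question. `doc` is a hypothetical parsed-artifact
    dict, not a real API."""
    attrs = doc.get("quality_attributes")
    if not attrs:  # IF no quality attributes section THEN score 0
        return {"score": 0, "reason": "no quality attributes section"}
    checks = {    # ELSE run the atomic sub-checks from the example above
        "measurable": all(a.get("measurable") for a in attrs),
        "acceptance_criteria": all(a.get("acceptance") for a in attrs),
        "test_strategy": all(a.get("test_strategy") for a in attrs),
    }
    passed = sum(checks.values())
    # 3 = all sub-checks pass, 2 = some pass, 1 = present but none pass
    score = 3 if passed == len(checks) else (2 if passed else 1)
    return {"score": score, "checks": checks}

doc = {"quality_attributes": [
    {"measurable": True, "acceptance": True, "test_strategy": False},
]}
print(check_quality_attributes(doc)["score"])  # → 2
```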
320
### 4.4 Key Frameworks for AI-Based Rubric Evaluation

**RULERS (Rubric Unification, Locking, and Evidence-anchored Robust Scoring):**
Published in January 2026, RULERS addresses three failure modes of LLM judges: instability under prompt variations, unverifiable reasoning, and distributional misalignment. It implements a three-phase pipeline:

- Phase I: Compile rubrics into immutable, versioned specifications to eliminate runtime interpretation drift
- Phase II: Enforce an evidence-anchored protocol requiring mechanically auditable citations
- Phase III: Apply lightweight Wasserstein-based post-hoc calibration to map outputs to human distributions

RULERS operates without updating model parameters, making it practical for enterprise deployment.

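Phase III's idea of mapping judge outputs onto the human score distribution can be illustrated with one-dimensional quantile mapping, the sorted-sample coupling that the 1-D Wasserstein distance induces. This is a loose sketch of the concept under stated assumptions, not the paper's exact procedure:

```python
from bisect import bisect_left

def fit_quantile_map(judge: list[float], human: list[float]):
    """Send each judge score to the human score at the same empirical
    quantile. For sorted 1-D samples this is the optimal-transport
    (Wasserstein) coupling; purely illustrative of post-hoc calibration."""
    j, h = sorted(judge), sorted(human)

    def calibrate(x: float) -> float:
        # Empirical quantile of x in the judge distribution, clamped to [0, 1]
        q = min(bisect_left(j, x) / max(len(j) - 1, 1), 1.0)
        # Human score at that quantile
        return h[round(q * (len(h) - 1))]

    return calibrate

# A judge that systematically scores ~1 point low on a 0-3 scale:
cal = fit_quantile_map(judge=[0, 1, 1, 2], human=[1, 2, 2, 3])
print(cal(1))  # → 2
```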
**LLM-Rubric (Microsoft Research, ACL 2024):**
A multidimensional, calibrated approach where a manually constructed rubric describes how to assess multiple dimensions. For each dimension, an LLM produces a distribution over potential responses. A small feed-forward neural network combines these distributions (with judge-specific and judge-independent parameters) to predict human judge annotations. Demonstrated a 2x improvement in RMS error over uncalibrated baselines for dialogue system evaluation.

**AutoRubric (2026):**
A unified framework for rubric-based LLM evaluation that aims to standardize how rubrics are constructed and applied across different evaluation domains.

**Snorkel AI's Rubric Framework:**
Treats rubrics as models to maximize goal alignment and inter-rater agreement. Key practices:

- Co-develop rubrics with domain experts
- Validate and calibrate by aligning interpretations between experts and LLM-as-Judge until inter-rater reliability stabilizes
- Use the same inter-annotator agreement statistics (Cohen's kappa, QWK, ICC) to measure LLM-human alignment
- Require LLMs to explain ratings, which consistently improves correlation with human labels

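The agreement statistics named above are straightforward to compute directly; a minimal unweighted Cohen's kappa sketch over paired human and LLM scores (the sample ratings are invented for illustration):

```python
from collections import Counter

def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Unweighted Cohen's kappa: observed agreement corrected for the
    agreement expected by chance given each rater's marginal distribution."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    ma, mb = Counter(rater_a), Counter(rater_b)
    expected = sum(ma[c] * mb[c] for c in set(rater_a) | set(rater_b)) / n**2
    return (observed - expected) / (1 - expected)

# Illustrative 0-3 ratings of eight artifacts by a human and an LLM judge:
human = [3, 2, 2, 1, 0, 3, 2, 1]
llm   = [3, 2, 1, 1, 0, 3, 2, 2]
print(round(cohens_kappa(human, llm), 3))  # → 0.652
```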
343
### 4.5 Practical Rubric Structure for Architecture Assessment

Based on synthesizing all research, a rubric for AI-agent architecture assessment should be structured as follows:

```
RUBRIC: [Name and Version]
ARTIFACT_TYPE: [e.g., solution_architecture_document]
APPLICABLE_STANDARDS: [list]

DIMENSION: [e.g., Completeness]
CRITERION: [e.g., Quality Attributes Defined]
QUESTION: "Does the document explicitly state quality attributes
  with measurable acceptance criteria?"
SCALE: 0-3
LEVEL_0: "No quality attributes section exists."
LEVEL_1: "Quality attributes are mentioned but vague
  (e.g., 'the system should be fast')."
LEVEL_2: "Quality attributes are stated with some measurable criteria
  but coverage is incomplete."
LEVEL_3: "All relevant quality attributes are explicitly stated with
  measurable acceptance criteria and test strategies."
EVIDENCE_REQUIRED: true
EXAMPLE_GOOD: "Availability: 99.99% uptime measured monthly,
  excluding planned maintenance windows.
  Monitored via synthetic probes every 30 seconds."
EXAMPLE_BAD: "The system should be highly available."
WEIGHT: 0.8
PREREQUISITE: null

CRITERION: [next criterion...]
...

DIMENSION: [next dimension...]
...

AGGREGATION:
  METHOD: weighted_average
  THRESHOLDS:
    pass: >= 2.0
    conditional_pass: >= 1.5
    fail: < 1.5
  MANDATORY_CRITERIA: [list of criteria that must score >= 2]
```

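The AGGREGATION block above can be sketched as a small scoring function. The function name and the exact gating rule for mandatory criteria (any mandatory criterion below 2 forces a fail) are assumptions for illustration:

```python
def aggregate(scores: dict[str, int], weights: dict[str, float],
              mandatory: list[str]) -> dict:
    """Weighted average with pass/conditional_pass/fail thresholds and
    mandatory-criterion gating, mirroring the AGGREGATION block above."""
    total_w = sum(weights[c] for c in scores)
    avg = sum(scores[c] * weights[c] for c in scores) / total_w
    if any(scores[c] < 2 for c in mandatory):
        verdict = "fail"  # assumed rule: mandatory criterion below 2 fails
    elif avg >= 2.0:
        verdict = "pass"
    elif avg >= 1.5:
        verdict = "conditional_pass"
    else:
        verdict = "fail"
    return {"score": round(avg, 2), "verdict": verdict}

result = aggregate(
    scores={"quality_attributes": 3, "risk_register": 1},
    weights={"quality_attributes": 0.8, "risk_register": 0.2},
    mandatory=["quality_attributes"],
)
print(result)  # → {'score': 2.6, 'verdict': 'pass'}
```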
387
### 4.6 Inter-Rater Reliability Targets

Based on the research, the following reliability targets are reasonable for AI-agent architecture assessment:

- **Binary criteria** (present/absent): Target > 95% agreement with human reviewers
- **Ordinal criteria** (0-3 scale): Target Cohen's kappa > 0.7 (substantial agreement) for well-defined criteria; > 0.5 (moderate agreement) for more subjective criteria
- **Overall assessment:** Target correlation (Spearman's rho) > 0.8 with expert human reviewers

These targets should be validated through calibration exercises comparing AI assessments against human expert panels, using the same methodology described in the Snorkel AI and RULERS frameworks.

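The Spearman's rho target can be checked without external dependencies. A minimal rank-correlation sketch (Pearson correlation of ranks, with average ranks for ties; the sample scores are invented for illustration):

```python
def spearman_rho(x: list[float], y: list[float]) -> float:
    """Spearman's rho: Pearson correlation of the ranks, assigning
    average ranks to tied values."""
    def ranks(v: list[float]) -> list[float]:
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # average rank for the tie group
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Expert vs. AI overall scores for six artifacts (illustrative data):
print(round(spearman_rho([2.1, 1.4, 2.8, 0.9, 1.9, 2.5],
                         [2.0, 1.2, 2.9, 1.1, 2.6, 2.4]), 3))  # → 0.829
```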
---

## 5. Synthesis: A Proposed Approach

### 5.1 Architecture for an AI-Agent Assessment System

Based on the research, an effective AI-agent architecture assessment system would combine:

1. **Structured artifact ingestion:** Parse artifacts using format-specific handlers (OpenAPI validator, ArchiMate schema validator, Markdown section extractor, YAML frontmatter parser)

2. **Multi-dimensional rubric engine:** Apply atomic criteria independently, organized into dimensions, with evidence-anchored scoring

3. **DAG-based evaluation flow:** Structure the assessment as a directed acyclic graph where:
   - Structural/compliance checks run first (binary criteria)
   - Content quality checks run second (ordinal criteria)
   - Cross-reference checks run third (consistency between artifact sections)
   - Contextual quality checks run last (requiring business context)

4. **Calibrated aggregation:** Use the LLM-Rubric approach to calibrate AI scores against human expert baselines, with lightweight feed-forward networks for score combination

5. **Evidence-anchored reporting:** Every score accompanied by specific citations from the artifact, following the RULERS protocol

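The four-stage evaluation flow can be sketched as a small gated graph, where a failed or skipped prerequisite short-circuits everything downstream. Node names and the gating rule are illustrative assumptions:

```python
# Each node lists its prerequisites; dict order here is already a valid
# topological order for this linear chain. Names are illustrative.
CHECKS = {
    "structural": {"after": [], "kind": "binary"},
    "content": {"after": ["structural"], "kind": "ordinal"},
    "cross_reference": {"after": ["content"], "kind": "ordinal"},
    "contextual": {"after": ["cross_reference"], "kind": "ordinal"},
}

def run(checks: dict, results: dict) -> dict:
    """Walk the graph in topological order; skip a node when any
    prerequisite did not pass."""
    done: dict[str, str] = {}
    for name, node in checks.items():
        if any(done.get(p) != "pass" for p in node["after"]):
            done[name] = "skipped"
        else:
            done[name] = results.get(name, "pass")
    return done

# A structural failure short-circuits all downstream checks:
print(run(CHECKS, {"structural": "fail"}))
# → {'structural': 'fail', 'content': 'skipped',
#    'cross_reference': 'skipped', 'contextual': 'skipped'}
```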
### 5.2 Recommended Rubric Dimensions for Architecture Artifacts

Based on the frameworks reviewed, architecture assessment rubrics should cover:

| Dimension | Description | Assessment Type |
|-----------|-------------|-----------------|
| **Structural Completeness** | Are all required sections present? | Binary checklist |
| **Content Adequacy** | Is each section substantively populated? | Ordinal (0-3) |
| **Quality Attribute Specification** | Are NFRs explicit and measurable? | Ordinal (0-3) |
| **Decision Documentation** | Are architectural decisions recorded with rationale? | Ordinal (0-3) |
| **Standards Compliance** | Does the artifact conform to applicable standards? | Binary + ordinal |
| **Internal Consistency** | Are components, relationships, and decisions consistent? | Ordinal (0-3) |
| **Risk Identification** | Are risks identified, assessed, and mitigated? | Ordinal (0-3) |
| **Stakeholder Appropriateness** | Is the artifact appropriate for its target audience? | Ordinal (0-3) |
| **Traceability** | Can decisions be traced to requirements and constraints? | Ordinal (0-3) |
| **Clarity and Communication** | Is the artifact clear, unambiguous, and well-organized? | Ordinal (0-3) |

### 5.3 Implementation Roadmap

**Phase 1: Structural assessment (highest confidence)**
- Template compliance (mandatory sections present)
- Metadata completeness
- Format validation (schema compliance for machine-readable artifacts)
- Naming convention adherence
- Cross-reference integrity

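Phase 1's template-compliance check is straightforward to automate. A minimal sketch for a Markdown artifact, assuming an illustrative list of mandatory sections rather than any specific standard's template:

```python
import re

# Illustrative template; real mandatory sections would come from the rubric.
MANDATORY_SECTIONS = ["Context", "Quality Attributes", "Decisions", "Risks"]

def missing_sections(markdown: str) -> list[str]:
    """Extract Markdown headings and report mandatory sections
    that are absent from the artifact."""
    headings = {m.group(1).strip()
                for m in re.finditer(r"^#{1,6}\s+(.+)$", markdown, re.M)}
    return [s for s in MANDATORY_SECTIONS if s not in headings]

doc = "# SAD\n\n## Context\n...\n\n## Decisions\n...\n"
print(missing_sections(doc))  # → ['Quality Attributes', 'Risks']
```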
**Phase 2: Content quality assessment (moderate confidence)**
- Quality attribute specification quality
- Decision documentation completeness
- Risk assessment coverage
- Stakeholder coverage

**Phase 3: Contextual assessment (requires human-in-the-loop)**
- Appropriateness of architectural decisions for the business context
- Trade-off evaluation
- Innovation and creativity
- Long-term maintainability and evolvability

### 5.4 Key Success Factors

1. **Start with machine-readable formats.** Enforce structured templates and machine-parseable formats for all architecture artifacts before attempting AI assessment.

2. **Begin with binary criteria.** Build confidence and calibrate the system on structural/compliance checks before moving to qualitative assessment.

3. **Invest in rubric engineering.** Treat rubric development as a first-class engineering activity, with version control, calibration testing, and iterative refinement.

4. **Maintain human-in-the-loop for subjective dimensions.** Use AI to handle objective and semi-objective criteria, flag issues for human review, and produce draft assessments that humans validate.

5. **Build calibration datasets.** Collect expert-annotated architecture artifacts to serve as calibration benchmarks for the AI assessment system.

6. **Version and lock rubrics.** Following the RULERS approach, compile rubrics into immutable specifications to prevent interpretation drift across assessments.

---

## 6. Sources

### Frameworks and Standards

- [TOGAF Standard v9.2 — Architecture Compliance](https://pubs.opengroup.org/architecture/togaf9-doc/arch/chap42.html)
- [TOGAF Standard v9.2 — Architecture Deliverables](https://pubs.opengroup.org/architecture/togaf9-doc/arch/chap32.html)
- [TOGAF Standard v9.2 — Architecture Maturity Models](https://pubs.opengroup.org/architecture/togaf9-doc/arch/chap45.html)
- [TOGAF Architecture Compliance Review Process](https://pubs.opengroup.org/architecture/togaf8-doc/arch/chap24.html)
- [TOGAF Architecture Review Checklists (System Engineering)](https://www.opengroup.org/architecture/togaf7-doc/arch/p4/comp/clists/syseng.htm)
- [TOGAF Standard — Architecture Compliance Introduction](https://pubs.opengroup.org/togaf-standard/ea-capability-and-governance/chap06.html)
- [OMB Enterprise Architecture Assessment Framework (EAAF)](https://obamawhitehouse.archives.gov/omb/E-Gov/eaaf)
- [EAAF Guide — CIO Portal](https://cioindex.com/reference/enterprise-architecture-assessment-framework-eaaf-guide/)
- [Zachman Framework — Wikipedia](https://en.wikipedia.org/wiki/Zachman_Framework)
- [Zachman Framework — LeanIX](https://www.leanix.net/en/wiki/ea/zachman-framework)

### Cloud Well-Architected Frameworks

- [AWS Well-Architected Framework](https://aws.amazon.com/architecture/well-architected/)
- [AWS Well-Architected Tool](https://aws.amazon.com/well-architected-tool/)
- [Accelerate AWS Well-Architected Reviews with Generative AI](https://aws.amazon.com/blogs/machine-learning/accelerate-aws-well-architected-reviews-with-generative-ai/)
- [Azure Well-Architected Framework](https://learn.microsoft.com/en-us/azure/well-architected/)
- [Azure Well-Architected Framework Pillars](https://learn.microsoft.com/en-us/azure/well-architected/pillars)
- [GitHub Well-Architected — Architecture Checklist](https://wellarchitected.github.com/library/architecture/checklist/)

### C4 Model and Architecture Documentation

- [C4 Model Review Checklist](https://c4model.com/diagrams/checklist)
- [C4 Model Introduction](https://c4model.com/introduction)
- [Collaborative LLM Agents for C4 Software Architecture Design Automation (arXiv)](https://arxiv.org/pdf/2510.22787)
- [Arc42 — Software Architecture Documentation](https://www.workingsoftware.dev/software-architecture-documentation-the-ultimate-guide/)

### Architecture Review Checklists

- [Harvard Application Architecture Checklist](https://enterprisearchitecture.harvard.edu/application-architecture-checklist)
- [Harvard EA Design Review Checklist (Excel)](https://enterprisearchitecture.harvard.edu/files/enterprise/files/ea_design_review_checklist_2_2.xlsx)
- [Architecture Review Board Checklist — Hava.io](https://www.hava.io/blog/architecture-review-board-checklist)
- [Info-Tech EA Assessment Checklist Template](https://www.infotech.com/research/ea-assessment-checklist-template)

### Machine-Readable Formats

- [ArchiMate Model Exchange File Format](https://www.opengroup.org/open-group-archimate-model-exchange-file-format)
- [ArchiMate Exchange File Format Guide](https://pubs.opengroup.org/architecture/archimate31-exchange-file-format-guide/)
- [ArchiMate XSD Schemas](https://www.opengroup.org/xsd/archimate/)
- [OpenAPI Specification v3.1](https://swagger.io/specification/)
- [Score My OpenAPI — APIMatic](https://www.apimatic.io/solution/score-my-openapi)
- [Architecture Decision Records](https://adr.github.io/)
- [MADR — Markdown Architectural Decision Records](https://adr.github.io/madr/)
- [ADR Templates](https://adr.github.io/adr-templates/)
- [Azure ADR Guidance](https://learn.microsoft.com/en-us/azure/well-architected/architect-role/architecture-decision-record)
- [AWS ADR Process](https://docs.aws.amazon.com/prescriptive-guidance/latest/architectural-decision-records/adr-process.html)
- [Google Cloud ADR Overview](https://cloud.google.com/architecture/architecture-decision-records)

### AI-Agent Assessment and LLM-as-Judge

- [RULERS: Locked Rubrics and Evidence-Anchored Scoring for Robust LLM Evaluation (arXiv 2601.08654)](https://arxiv.org/abs/2601.08654)
- [LLM-Rubric: A Multidimensional, Calibrated Approach — ACL 2024](https://aclanthology.org/2024.acl-long.745/)
- [LLM-Rubric — Microsoft Research](https://www.microsoft.com/en-us/research/publication/llm-rubric-a-multidimensional-calibrated-approach-to-automated-evaluation-of-natural-language-texts/)
- [LLM-Rubric — GitHub Repository](https://github.com/microsoft/LLM-Rubric)
- [The Science of Rubric Design — Snorkel AI](https://snorkel.ai/blog/the-science-of-rubric-design/)
- [Data Quality and Rubrics — Snorkel AI](https://snorkel.ai/blog/data-quality-and-rubrics-how-to-build-trust-in-your-models/)
- [Scaling Trust: Rubrics in Snorkel's Quality Process](https://snorkel.ai/blog/scaling-trust-rubrics-in-snorkels-quality-process/)
- [LLM-as-Judge for Enterprises — Snorkel AI](https://snorkel.ai/llm-as-judge-for-enterprises/)
- [Demystifying Evals for AI Agents — Anthropic](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents)
- [LLM-as-a-Judge Guide — Evidently AI](https://www.evidentlyai.com/llm-guide/llm-as-a-judge)
- [LLM-as-Judge Best Practices — Monte Carlo Data](https://www.montecarlodata.com/blog-llm-as-judge/)
- [AutoRubric: A Unified Framework for Rubric-Based LLM Evaluation (arXiv)](https://arxiv.org/html/2603.00077v1)
- [Confusion-Aware Rubric Optimization for LLM-based Grading (arXiv)](https://arxiv.org/html/2603.00451)

### Rubric Design and AI Grading

- [Designing Transparent Rubrics for AI-Based Evaluation — The Case HQ](https://thecasehq.com/designing-transparent-rubrics-for-ai-based-evaluation-a-practical-guide-for-educators/)
- [AI in Grading Rubrics — The Case HQ](https://thecasehq.com/how-ai-in-grading-rubrics-ensures-unmatched-consistency-and-fairness-in-education/)
- [Rubric Development for AI-Enabled Scoring — Frontiers in Education](https://www.frontiersin.org/journals/education/articles/10.3389/feduc.2022.983055/full)
- [Implementation Considerations for Automated AI Grading (arXiv)](https://arxiv.org/html/2506.07955v2)
- [LLM-as-Judge Done Right: Calibrating, Guarding & Debiasing — Kinde](https://www.kinde.com/learn/ai-for-software-engineering/best-practice/llm-as-a-judge-done-right-calibrating-guarding-debiasing-your-evaluators/)

### AI for Software Architecture

- [AI for Software Architecture: Literature Review and Road Ahead (arXiv 2504.04334)](https://arxiv.org/html/2504.04334v1)
- [Evaluation and Benchmarking of LLM Agents: A Survey (arXiv)](https://arxiv.org/html/2507.21504v1)
- [AI-powered Code Review with LLMs: Early Results (arXiv)](https://arxiv.org/html/2404.18496v2)
- [vFunction — AI-Driven Architecture Analysis](https://vfunction.com/blog/software-architecture-tools/)

### EA Evaluation Academic Literature

- [A Systematic Literature Review of EA Evaluation Methods — ACM Computing Surveys 2024](https://dl.acm.org/doi/full/10.1145/3706582)
- [EA Framework Evaluation Criteria — Springer](https://dl.acm.org/doi/abs/10.1007/s11761-020-00294-x)
- [Systematic Literature Review on EA Evaluation Models — BEEI](https://beei.org/index.php/EEI/article/view/6943)
- [Analyzing Software Architecture Documentation Models (SCITEPRESS 2023)](https://www.scitepress.org/Papers/2023/118523/118523.pdf)

### Enterprise Architecture Maturity

- [EA Maturity Model — Ardoq](https://www.ardoq.com/knowledge-hub/enterprise-architecture-maturity-model)
- [EA Maturity Stages — LeanIX](https://www.leanix.net/en/wiki/ea/enterprise-architecture-maturity-stages-and-assessment)