onto-mcp 0.3.0 → 0.3.2
- package/.onto/authority/core-lexicon.yaml +12 -0
- package/.onto/domains/software-engineering/competency_qs.md +192 -63
- package/.onto/domains/software-engineering/concepts.md +67 -5
- package/.onto/domains/software-engineering/conciseness_rules.md +22 -2
- package/.onto/domains/software-engineering/dependency_rules.md +78 -8
- package/.onto/domains/software-engineering/domain_scope.md +181 -150
- package/.onto/domains/software-engineering/extension_cases.md +318 -542
- package/.onto/domains/software-engineering/logic_rules.md +75 -3
- package/.onto/domains/software-engineering/problem_framing_profile.md +29 -2
- package/.onto/domains/software-engineering/prompt_interface.md +122 -0
- package/.onto/domains/software-engineering/structure_spec.md +53 -4
- package/.onto/principles/llm-native-development-guideline.md +20 -0
- package/.onto/principles/productization-charter.md +6 -0
- package/.onto/processes/evolve/material-kind-adapter-contract.md +6 -0
- package/.onto/processes/reconstruct/reconstruct-boundary-contract.md +468 -81
- package/.onto/processes/reconstruct/reconstruct-execution-ux-contract.md +177 -0
- package/.onto/processes/reconstruct/source-profile-contract.md +39 -6
- package/.onto/processes/reconstruct/top-level-concept-discovery-contract.md +387 -0
- package/.onto/processes/review/binding-contract.md +8 -0
- package/.onto/processes/review/lens-registry.md +16 -0
- package/.onto/processes/review/pre-dispatch-contracts.md +34 -13
- package/.onto/processes/review/productized-live-path.md +3 -1
- package/.onto/processes/shared/pipeline-execution-ledger-contract.md +185 -0
- package/.onto/processes/shared/target-material-kind-contract.md +24 -2
- package/.onto/roles/axiology.md +7 -2
- package/AGENTS.md +4 -2
- package/README.md +52 -29
- package/dist/core-api/reconstruct-api.js +92 -5
- package/dist/core-api/review-api.js +1744 -371
- package/dist/core-runtime/cli/mock-review-unit-executor.js +17 -0
- package/dist/core-runtime/cli/render-review-final-output.js +9 -0
- package/dist/core-runtime/cli/review-invoke.js +387 -55
- package/dist/core-runtime/cli/run-review-prompt-execution.js +361 -90
- package/dist/core-runtime/path-boundary.js +58 -0
- package/dist/core-runtime/pipeline-execution-ledger.js +100 -0
- package/dist/core-runtime/reconstruct/artifact-types.js +33 -1
- package/dist/core-runtime/reconstruct/materialize-preparation.js +54 -4
- package/dist/core-runtime/reconstruct/pipeline-execution-ledger.js +342 -0
- package/dist/core-runtime/reconstruct/post-seed-validation.js +630 -0
- package/dist/core-runtime/reconstruct/record.js +105 -1
- package/dist/core-runtime/reconstruct/run.js +1594 -38
- package/dist/core-runtime/reconstruct/seed-candidate-validation.js +29 -0
- package/dist/core-runtime/review/continuation-plan.js +160 -0
- package/dist/core-runtime/review/execution-plan-boundary.js +123 -0
- package/dist/core-runtime/review/materializers.js +8 -3
- package/dist/core-runtime/review/pipeline-execution-ledger.js +250 -0
- package/dist/core-runtime/review/review-artifact-utils.js +15 -2
- package/dist/core-runtime/review/review-invocation-runner.js +604 -0
- package/dist/core-runtime/target-material-kind.js +43 -5
- package/dist/mcp/server.js +289 -59
- package/dist/mcp/tool-schemas.js +28 -2
- package/package.json +4 -2
- package/.onto/domains/llm-native-development/competency_qs.md +0 -430
- package/.onto/domains/llm-native-development/concepts.md +0 -242
- package/.onto/domains/llm-native-development/conciseness_rules.md +0 -163
- package/.onto/domains/llm-native-development/dependency_rules.md +0 -216
- package/.onto/domains/llm-native-development/domain_scope.md +0 -197
- package/.onto/domains/llm-native-development/extension_cases.md +0 -474
- package/.onto/domains/llm-native-development/logic_rules.md +0 -123
- package/.onto/domains/llm-native-development/prompt_interface.md +0 -49
- package/.onto/domains/llm-native-development/structure_spec.md +0 -245
@@ -1,430 +0,0 @@
----
-version: 2
-last_updated: "2026-05-27"
-source: manual
-status: established
----
-
-# LLM-Native Development Domain — Competency Questions
-
-A list of core questions that this domain's system must be able to answer.
-The pragmatics agent verifies the actual reasoning path for each question.
-
-Classification axis: **system construction concern** — classified by the sub-area of the LLM-powered system each question addresses, matching the 8 sub-areas defined in domain_scope.md.
-
-Question priority principles: **Core design decisions (model integration, prompt design, retrieval, agent architecture) are the highest priority.** These concerns govern the majority of LLM system quality. Evaluation, safety, operations, and adaptation are secondary concerns applied on top of the core design foundation.
-
-Priority levels:
-- **P1** — Must be answerable for any LLM-powered system review. Failure indicates a fundamental design defect.
-- **P2** — Should be answerable for production LLM systems. Failure indicates a quality gap.
-- **P3** — Recommended for mature LLM systems. Failure indicates a refinement opportunity.
-
----
-
-## 1. Model Integration (CQ-M)
-
-Verifies that the system's connection to LLMs is reliable, replaceable, and explicitly designed. Without correct model integration, no other aspect of the system can function.
-
-- **CQ-M01** [P1] Can the system switch between model providers or model versions without code changes beyond configuration?
-- Inference path: logic_rules.md 'Model Integration Logic' → model version pinning, model selection justification → model identity must be configurable
-- Verification criteria: PASS if model identifier is in configuration/env, not hardcoded. FAIL if changing models requires application code changes
-- Scope: Covers model identifier configurability. Does not cover prompt adjustments needed for different models (→CQ-P)
-
-- **CQ-M02** [P1] Is the model version pinned to a specific release in production?
-- Inference path: logic_rules.md 'Model Integration Logic' → unpinned versions cause non-deterministic behavior; dependency_rules.md 'External Dependency Management' → Model API Dependencies → version must be pinned
-- Verification criteria: PASS if production uses a specific version (e.g., `claude-sonnet-4-20250514`). FAIL if production uses an alias (e.g., `claude-sonnet-4-latest`)
-
-- **CQ-M03** [P1] Are fallback routes defined for every model routing path?
-- Inference path: logic_rules.md 'Model Integration Logic' → if model routing is used, fallback paths must be defined for every route; a route without fallback is a single point of failure
-- Verification criteria: PASS if every routing configuration entry includes a fallback model or degradation behavior (cached response, simpler model, error return). FAIL if any routing path has no defined fallback
-- Scope: Covers model-level routing. Does not cover application-level error handling for non-model failures
-
-- **CQ-M04** [P1] Are model capability requirements documented per task type?
-- Inference path: logic_rules.md 'Model Integration Logic' → capability requirements must include task type, required capabilities, quality level, cost constraints; structure_spec.md 'Golden Relationships' → Model capability ↔ Prompt complexity
-- Verification criteria: PASS if documentation lists each task type, required model capabilities, and assigned model with justification. FAIL if model-task assignments exist without capability documentation
-
-- **CQ-M05** [P2] Is model selection justified against cost and quality trade-offs?
-- Inference path: logic_rules.md 'Model Integration Logic' → overpowered model = cost waste, underpowered model = quality risk; logic_rules.md 'Constraint Conflict Checking' → Cost vs. Quality
-- Verification criteria: PASS if each model selection has documented cost/quality rationale. FAIL if selection lacks trade-off analysis
-
-- **CQ-M06** [P2] When inference optimization is applied, is the impact on output quality measured?
-- Inference path: logic_rules.md 'Model Integration Logic' → optimization impact must be measured, not assumed
-- Verification criteria: PASS if evaluation results exist comparing quality before/after optimization. FAIL if optimization is applied without quality measurement
-
-- **CQ-M07** [P2] Does the system handle model provider API errors gracefully?
-- Inference path: logic_rules.md 'Operations Logic' → incident response for LLM-specific failures; dependency_rules.md 'External Dependency Management' → Model API Dependencies
-- Verification criteria: PASS if retry logic, meaningful error messages, and graceful degradation exist. FAIL if API errors cause crashes or silent failures
-
-- **CQ-M08** [P3] When multiple models are orchestrated, is the orchestration execution profile documented?
-- Inference path: structure_spec.md 'Agent Architecture Structure' → Multi-Agent Execution profile; domain_scope.md 'Model Integration' → multi-model orchestration
-- Verification criteria: PASS if multi-model pattern is documented with roles, routing logic, data flow. FAIL if undocumented
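
Illustrative sketch, not part of the removed file: the pinning check of CQ-M02 and the fallback check of CQ-M03 could be automated roughly as follows. The routing configuration shape (`model`, `fallback`, `degradation` keys), the route names, and the rule that a pinned identifier ends in an 8-digit date stamp are assumptions made for this example; only the two model identifiers come from the text above.

```python
import re

# Assumption: a pinned identifier ends in an 8-digit date stamp, as in the
# CQ-M02 example `claude-sonnet-4-20250514`. Other providers may differ.
PINNED_SUFFIX = re.compile(r"-\d{8}$")


def is_pinned(model_id: str) -> bool:
    """Return True if the identifier looks like a specific dated release."""
    return bool(PINNED_SUFFIX.search(model_id))


def routes_without_fallback(routes: dict) -> list:
    """List routes that define neither a fallback model nor a degradation behavior."""
    return sorted(
        name
        for name, route in routes.items()
        if not route.get("fallback") and not route.get("degradation")
    )


# Hypothetical routing configuration, for illustration only.
ROUTES = {
    "summarize": {"model": "claude-sonnet-4-20250514", "fallback": "cached_response"},
    "classify": {"model": "claude-sonnet-4-latest"},
}

unpinned = sorted(name for name, route in ROUTES.items() if not is_pinned(route["model"]))
missing_fallback = routes_without_fallback(ROUTES)
```

Under these assumptions a review would FAIL CQ-M02 if `unpinned` is non-empty and FAIL CQ-M03 if `missing_fallback` is non-empty.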
-
----
-
-## 2. Prompt & Context Design (CQ-P)
-
-Verifies that model inputs are structured, constrained, and managed to produce reliable outputs. Without correct prompt design, even the best model integration produces unpredictable results.
-
-- **CQ-P01** [P1] Is the instruction hierarchy enforced (system prompt > tool definitions > user prompt)?
-- Inference path: logic_rules.md 'Prompt Design Logic' → instruction hierarchy: system prompt > tool definitions > user prompt; when levels conflict, higher-priority wins
-- Verification criteria: PASS if prompts have clear hierarchy and conflict resolution is documented (e.g., "system prompt overrides user instructions on output format"). FAIL if no hierarchy exists or user inputs can override system-level instructions
-- Scope: Covers instruction precedence design. Does not cover content quality of individual instructions
-
-- **CQ-P02** [P1] Is the token budget calculated before API calls to prevent truncation?
-- Inference path: logic_rules.md 'Prompt Design Logic' → system prompt + context + user input ≤ context window - expected output; violation causes truncation (silent data loss) or API errors
-- Verification criteria: PASS if token counts for all prompt components are computed before the API call and oversized prompts are handled (graceful truncation with notification or rejection). FAIL if prompts may exceed the context window without pre-validation
-
-- **CQ-P03** [P1] Is structured output validated before consumption?
-- Inference path: logic_rules.md 'Prompt Design Logic' → output schema must be validated; structure_spec.md 'LLM System Architecture Structure' → Output handling is required
-- Verification criteria: PASS if every code path expecting structured output includes schema validation. FAIL if model output is consumed without validation
-
-- **CQ-P04** [P1] Are prompt templates versioned alongside code?
-- Inference path: logic_rules.md 'Prompt Design Logic' → prompt change = code change; unversioned prompts prevent reproduction
-- Verification criteria: PASS if prompt templates are in version control with change tracking. FAIL if prompts are unversioned
-
-- **CQ-P05** [P2] Is context rot mitigated for long conversations?
-- Inference path: logic_rules.md 'Prompt Design Logic' → earliest context loses influence as conversation grows; critical info must be re-injected
-- Verification criteria: PASS if at least one mitigation strategy is implemented (re-injection, summarization, sliding window). FAIL if long conversations rely solely on initial context
-
-- **CQ-P06** [P2] Are few-shot examples representative of the target distribution?
-- Inference path: logic_rules.md 'Prompt Design Logic' → biased examples cause biased outputs
-- Verification criteria: PASS if examples cover major target categories and are updated when distribution changes. FAIL if examples are ad-hoc or narrow
-
-- **CQ-P07** [P2] Does the prompt template length stay within 25% of the context window?
-- Inference path: structure_spec.md 'Quantitative Thresholds' → prompt template ≤ 25% of context window
-- Verification criteria: PASS if static prompt ≤ 25% of target model's context window. FAIL if exceeded without justification
-
-- **CQ-P08** [P2] When chain-of-thought prompting is used, is the reasoning trace verifiable?
-- Inference path: domain_scope.md 'Prompt & Context Design' → chain-of-thought patterns; logic_rules.md 'Prompt Design Logic' → structured output validation
-- Verification criteria: PASS if reasoning traces are captured and reviewable. FAIL if traces are generated but discarded
-
-- **CQ-P09** [P3] Are prompt injection risks from user-supplied content addressed at the prompt level?
-- Inference path: logic_rules.md 'Safety Logic' → injection defense; logic_rules.md 'Prompt Design Logic' → user input is lowest priority
-- Verification criteria: PASS if user content is delimited or sanitized before inclusion in prompts. FAIL if user content is concatenated without boundary markers
-
-- **CQ-P10** [P2] Is execution context free of non-current compatibility and deprecation material?
-- Inference path: conciseness_rules.md 'Navigation and History Mixing' → execution context should contain current behavior, contracts, authority, and failure handling; historical material belongs in isolated paths
-- Verification criteria: PASS if documents loaded for execution contain current behavior only and link to isolated history only when needed. FAIL if backward-compatibility notes, deprecated behavior, migration rationale, or historical alternatives are loaded by default
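
Illustrative sketch, not part of the removed file: the budget inequality in CQ-P02 and the 25% threshold in CQ-P07 reduce to two comparisons once token counts are known. The function names are hypothetical, and the counts are taken as inputs because how tokens are counted depends on the target model's tokenizer.

```python
def fits_budget(
    system_tokens: int,
    context_tokens: int,
    user_tokens: int,
    context_window: int,
    expected_output_tokens: int,
) -> bool:
    """CQ-P02: system prompt + context + user input <= context window - expected output."""
    prompt_tokens = system_tokens + context_tokens + user_tokens
    return prompt_tokens <= context_window - expected_output_tokens


def template_within_threshold(
    template_tokens: int, context_window: int, max_fraction: float = 0.25
) -> bool:
    """CQ-P07: the static prompt template stays within a fraction of the context window."""
    return template_tokens <= max_fraction * context_window


# Example with made-up numbers: an 8000-token window reserving 1000 tokens for output.
ok = fits_budget(1000, 5000, 500, context_window=8000, expected_output_tokens=1000)
too_large = fits_budget(1000, 6000, 500, context_window=8000, expected_output_tokens=1000)
```

A caller would run `fits_budget` before every API call and either truncate with notification or reject the request when it returns `False`, which is the handling CQ-P02 asks for.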
-
----
-
-## 3. Retrieval & Knowledge Systems (CQ-R)
-
-Verifies that external information fed to the model is retrieved correctly, structured for LLM consumption, and maintains integrity. Encompasses both RAG pipeline design and LLM-favored knowledge structure.
-
-- **CQ-R01** [P1] Can retrieval quality be evaluated independently of generation quality?
-- Inference path: logic_rules.md 'Retrieval Logic' → retrieval relevance must be verified independently; root cause may be irrelevant retrieval, incorrect generation, or both; structure_spec.md 'RAG Pipeline Structure' → Stage Boundary Rules → each stage independently testable
-- Verification criteria: PASS if the system can produce retrieval results with relevance scores without running the generation step. FAIL if retrieval and generation are monolithic (testing retrieval requires the full pipeline)
-- Scope: Covers architectural separability. Does not cover evaluation methodology (→CQ-E)
-
-- **CQ-R02** [P1] Do chunks preserve semantic boundaries?
-- Inference path: logic_rules.md 'Retrieval Logic' → chunking must preserve semantic boundaries
-- Verification criteria: PASS if chunks align with document structure (paragraphs, sections). FAIL if fixed-size splitting ignores content structure
-
-- **CQ-R03** [P1] Is the LLM-favored structure (File=Concept, YAML frontmatter) consistently applied?
-- Inference path: logic_rules.md 'File–Concept Correspondence Logic' → one concept per file; logic_rules.md 'Frontmatter Conformance Logic'; structure_spec.md 'Frontmatter Specification'
-- Verification criteria: PASS if documents follow File=Concept paradigm with required frontmatter. FAIL if concepts are mixed across files or frontmatter is absent
-- Scope: Applies to document-based knowledge bases. N/A for database-only retrieval
-
-- **CQ-R04** [P1] Is every concept file reachable from the entry point?
-- Inference path: logic_rules.md 'Navigation Path Logic' → unreachable file = isolated document; structure_spec.md 'Isolated Element Prohibition'
-- Verification criteria: PASS if traversal from entry point reaches every concept file. FAIL if any file is unreachable
-
-- **CQ-R05** [P1] Are navigation indices (INDEX.md) up to date and consistent?
-- Inference path: structure_spec.md 'Project Required Files' → INDEX.md is source of truth for file existence; dependency_rules.md 'Referential Integrity'
-- Verification criteria: PASS if INDEX.md lists all directory files and all listed files exist. FAIL if stale or incomplete
-
-- **CQ-R06** [P2] Does retrieved context carry provenance metadata?
-- Inference path: logic_rules.md 'Retrieval Logic' → provenance required for debugging; structure_spec.md 'RAG Pipeline Structure' → stage outputs carry metadata
-- Verification criteria: PASS if each chunk includes source ID, chunk ID, and relevance score. FAIL if no provenance
-
-- **CQ-R07** [P2] When hybrid search is used, is the combination strategy explicit?
-- Inference path: logic_rules.md 'Retrieval Logic' → unspecified combination = non-reproducible results
-- Verification criteria: PASS if combination method and parameters are documented. FAIL if undocumented
-
-- **CQ-R08** [P2] Does the embedding model match the content domain?
-- Inference path: logic_rules.md 'Retrieval Logic' → general-purpose embeddings may underperform on domain-specific content
-- Verification criteria: PASS if embedding model selection is justified with domain appropriateness and retrieval metrics. FAIL if unjustified
-
-- **CQ-R09** [P2] When the embedding model changes, is re-indexing performed?
-- Inference path: dependency_rules.md 'External Dependency Management' → Embedding Model Dependencies → different models produce incompatible vectors
-- Verification criteria: PASS if documented migration includes complete re-indexing. FAIL if model changes can produce mixed embeddings
-
-- **CQ-R10** [P2] Are RAG pipeline stages explicitly defined with input/output contracts?
-- Inference path: structure_spec.md 'RAG Pipeline Structure' → Required Stages (ingestion, processing, storage, retrieval, generation) with contracts
-- Verification criteria: PASS if each stage has documented inputs, outputs, and design decisions. FAIL if pipeline is monolithic
-
-- **CQ-R11** [P3] Does single file size stay within the 500-line limit?
-- Inference path: structure_spec.md 'File Structure Required Elements' → 500 lines recommended; 'Quantitative Thresholds'
-- Verification criteria: PASS if ≤ 500 lines. WARNING if exceeded with justification. FAIL if exceeded without justification
-
-- **CQ-R12** [P3] Does directory depth stay within the 3-level limit?
-- Inference path: structure_spec.md 'Directory Structure Rules' → max 3 levels
-- Verification criteria: PASS if ≤ 3 levels. FAIL if exceeded without justification
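
Illustrative sketch, not part of the removed file: the reachability requirement in CQ-R04 is a graph traversal from the entry point. The example below uses breadth-first search over a link map. The `links` structure (file name mapped to the files it links to) and the file names in `LINKS` are assumptions for this example; extracting links from real markdown is out of scope here.

```python
from collections import deque


def unreachable_files(links: dict, entry_point: str) -> list:
    """Return every known file that cannot be reached by following links from entry_point."""
    seen = {entry_point}
    queue = deque([entry_point])
    while queue:
        current = queue.popleft()
        for target in links.get(current, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    # Known files are all link sources plus all link targets.
    all_files = set(links) | {t for targets in links.values() for t in targets}
    return sorted(all_files - seen)


# Hypothetical link map, for illustration only.
LINKS = {
    "INDEX.md": ["concepts.md", "logic_rules.md"],
    "concepts.md": ["structure_spec.md"],
    "logic_rules.md": [],
    "structure_spec.md": [],
    "orphan.md": ["concepts.md"],  # links out, but nothing links to it
}

isolated = unreachable_files(LINKS, "INDEX.md")
```

Here `isolated` contains `orphan.md`, so this knowledge base would FAIL CQ-R04 until the file is linked from a reachable document or removed.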
-
----
-
-## 4. Agentic Systems (CQ-A)
-
-Verifies that agent architecture, tool integration, and multi-agent coordination are reliable, terminable, and debuggable.
-
-- **CQ-A01** [P1] Are tool definitions non-overlapping in their described capabilities?
-- Inference path: logic_rules.md 'Agentic Systems Logic' → if two tools can accomplish the same task, agent choice is non-deterministic; either remove overlap or add explicit routing instructions
-- Verification criteria: PASS if no two tools overlap in capability description, or overlapping tools have explicit disambiguation in agent instructions. FAIL if overlap exists without disambiguation
-
-- **CQ-A02** [P1] Do all agent loops have explicit termination conditions?
-- Inference path: logic_rules.md 'Agentic Systems Logic' → without explicit termination (max iterations, convergence criteria, timeout), agent loops can run indefinitely, consuming resources
-- Verification criteria: PASS if every iterative agent workflow defines at least one termination condition. FAIL if any loop can run indefinitely without a termination mechanism
-
-- **CQ-A03** [P1] Is agent state explicitly managed rather than relying on conversation history?
-- Inference path: logic_rules.md 'Agentic Systems Logic' → implicit state via conversation is fragile; structure_spec.md 'Agent Architecture Structure' → state management required
-- Verification criteria: PASS if critical state is in structured storage (DB, file, state object). FAIL if solely in conversation history
-- Scope: Covers critical state. Single-turn conversational context is acceptable as implicit
-
-- **CQ-A04** [P1] Are MCP tool schemas self-describing?
-- Inference path: logic_rules.md 'Agentic Systems Logic' → schema must suffice without external docs; structure_spec.md 'Agent Architecture Structure' → Tool Definition Structure
-- Verification criteria: PASS if every tool has name, description, typed parameters, and return type. FAIL if external docs needed to understand usage
-
-- **CQ-A05** [P2] Do agent instructions list available tools with usage guidance?
-- Inference path: logic_rules.md 'Agentic Systems Logic' → instructions must list tools and when to use each
-- Verification criteria: PASS if system prompt lists all tools with guidance. FAIL if tools are available but unmentioned
-
-- **CQ-A06** [P2] For long-running tasks, is structured progress tracking maintained?
-- Inference path: logic_rules.md 'Agentic Systems Logic' → agent must resume from checkpoint without re-execution
-- Verification criteria: PASS if progress record tracks completed/current/remaining steps. FAIL if interruption requires full restart
-
-- **CQ-A07** [P2] When multiple agents are used, is the coordination execution profile documented?
-- Inference path: structure_spec.md 'Agent Architecture Structure' → Multi-Agent Execution profile → must be documented with rationale
-- Verification criteria: PASS if execution profile, roles, communication patterns documented. FAIL if undocumented
-
-- **CQ-A08** [P2] Is the agent tool count within the recommended limit (< 20)?
-- Inference path: structure_spec.md 'Quantitative Thresholds' → < 20 tools per agent
-- Verification criteria: PASS if < 20. WARNING if 20-30 with justification. FAIL if 30+ without evidence of unaffected selection quality
-
-- **CQ-A09** [P3] Are tool call results validated before further reasoning?
-- Inference path: logic_rules.md 'Prompt Design Logic' → validate before consumption; 'Agentic Systems Logic' → tool results are part of agent state
-- Verification criteria: PASS if tool results undergo validation (error check, schema, sanity). FAIL if consumed without validation
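
Illustrative sketch, not part of the removed file: CQ-A02 names three termination mechanisms (max iterations, convergence criteria, timeout). A loop that combines all three might look like the following. The `run_agent_loop` helper and the convention that `step()` returns `True` on convergence are assumptions for this example, not an interface defined by the package.

```python
import time
from typing import Callable, Tuple


def run_agent_loop(
    step: Callable[[], bool],
    max_iterations: int = 10,
    timeout_seconds: float = 30.0,
) -> Tuple[str, int]:
    """Run step() until it reports convergence, the iteration cap is hit, or time runs out.

    Returns (reason, iterations_completed).
    """
    deadline = time.monotonic() + timeout_seconds
    for iteration in range(1, max_iterations + 1):
        # The timeout is checked between steps; a single long step is not interrupted.
        if time.monotonic() > deadline:
            return ("timeout", iteration - 1)
        if step():
            return ("converged", iteration)
    return ("max_iterations", max_iterations)


# A step that converges on its third call.
outcomes = iter([False, False, True])
converged = run_agent_loop(lambda: next(outcomes))

# A step that never converges is still stopped by the iteration cap.
capped = run_agent_loop(lambda: False, max_iterations=5)
```

Returning the reason alongside the iteration count lets the caller log why the loop stopped, which also supports the progress tracking asked for in CQ-A06.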
-
----
-
-## 5. Evaluation & Testing (CQ-E)
-
-Verifies that system output quality is measured by defined criteria, representative data, and reproducible methods.
-
-- **CQ-E01** [P1] Are evaluation criteria defined before system development begins (spec-first)?
-- Inference path: logic_rules.md 'Evaluation Logic' → criteria must be defined before development; defining after seeing output introduces confirmation bias
-- Verification criteria: PASS if evaluation criteria (metrics, thresholds, targets) are documented before or concurrent with the first development commit. FAIL if criteria are defined after the system is built
-- Scope: Covers existence and timing of criteria definition. Does not cover whether the criteria themselves are appropriate
-
-- **CQ-E02** [P1] Does the golden set cover the target distribution?
-- Inference path: logic_rules.md 'Evaluation Logic'; structure_spec.md 'Evaluation Structure' → Golden Set Requirements → all input categories
-- Verification criteria: PASS if golden set covers all expected categories with documented diversity. FAIL if partial or undocumented coverage
-
-- **CQ-E03** [P1] Is AI-as-judge bias disclosed?
-- Inference path: logic_rules.md 'Evaluation Logic' → must disclose judge model, prompt, and known biases
-- Verification criteria: PASS if all three disclosed. FAIL if any element missing
-
-- **CQ-E04** [P2] Are evaluation pipelines automated and repeatable?
-- Inference path: logic_rules.md 'Evaluation Logic' → manual does not scale; structure_spec.md 'Evaluation Structure' → automated for production
-- Verification criteria: PASS if pipeline runs without manual intervention with consistent results. FAIL if manual steps required in production
-
-- **CQ-E05** [P2] Is the golden set separated from training data?
-- Inference path: logic_rules.md 'Evaluation Logic' → contamination inflates scores; structure_spec.md 'Evaluation Structure' → Separation
-- Verification criteria: PASS if separate storage with contamination prevention. FAIL if shared without controls
-
-- **CQ-E06** [P2] For A/B testing, are statistical significance thresholds predefined?
-- Inference path: logic_rules.md 'Evaluation Logic' → undefined thresholds enable peeking bias
-- Verification criteria: PASS if thresholds, sample sizes, stopping criteria predefined. FAIL if absent
-
-- **CQ-E07** [P2] Does hallucination detection methodology exist?
-- Inference path: domain_scope.md 'Evaluation & Testing'; logic_rules.md 'Retrieval Logic' → independent retrieval/generation evaluation enables hallucination tracing
-- Verification criteria: PASS if documented detection method exists and is applied. FAIL if no method defined
-
-- **CQ-E08** [P3] Are human evaluation protocols defined for aspects resisting automation?
-- Inference path: structure_spec.md 'Evaluation Structure' → manual acceptable for semantic quality; domain_scope.md 'Evaluation & Testing' → HITL
-- Verification criteria: PASS if protocols exist with inter-rater reliability measures. FAIL if human evaluation is ad-hoc
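
Illustrative sketch, not part of the removed file: two of the golden-set checks above, category coverage (CQ-E02) and separation from training data (CQ-E05), can be expressed as set operations. The record shape (`category` and `input` keys) and the exact-match notion of contamination are assumptions for this example; a real contamination check would likely need fuzzier matching.

```python
def missing_categories(golden_set: list, expected_categories: list) -> list:
    """CQ-E02: expected input categories that no golden-set example covers."""
    covered = {example["category"] for example in golden_set}
    return sorted(set(expected_categories) - covered)


def contaminated_inputs(golden_set: list, training_inputs: list) -> list:
    """CQ-E05: golden-set inputs that also appear verbatim in the training data."""
    training = set(training_inputs)
    return sorted(ex["input"] for ex in golden_set if ex["input"] in training)


# Hypothetical data, for illustration only.
GOLDEN = [
    {"category": "billing", "input": "Why was I charged twice?"},
    {"category": "login", "input": "I cannot reset my password."},
]

gaps = missing_categories(GOLDEN, ["billing", "login", "shipping"])
leaks = contaminated_inputs(GOLDEN, ["Why was I charged twice?", "Where is my order?"])
```

With this data, `gaps` reports the uncovered `shipping` category and `leaks` reports the one example shared with the training inputs.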
-
----
-
-## 6. Safety & Alignment (CQ-S)
-
-Verifies that harmful outputs are prevented, adversarial inputs are defended against, and safety mechanisms are testable and layered.
-
-- **CQ-S01** [P1] Is defense-in-depth implemented (input filtering + output filtering + monitoring)?
-- Inference path: logic_rules.md 'Safety Logic' → no single safety mechanism is sufficient; minimum layers: input filtering + output filtering + monitoring; structure_spec.md 'Safety Architecture Structure' → Required Pipeline (input guardrails → model → output guardrails)
-- Verification criteria: PASS if the system implements at least input filtering, output filtering, and monitoring as separate layers. FAIL if only one safety layer exists or any of the three is absent
-- Scope: Covers layer presence. Effectiveness of individual layers checked in CQ-S02, CQ-S03
-
-- **CQ-S02** [P1] Are content policy rules testable with positive and negative cases?
-- Inference path: logic_rules.md 'Safety Logic' → untestable rules cannot be verified
-- Verification criteria: PASS if every rule has ≥1 positive and ≥1 negative test case. FAIL if any rule lacks tests
-
-- **CQ-S03** [P1] Is prompt injection defense present without unacceptable UX degradation?
-- Inference path: logic_rules.md 'Safety Logic' → false positive rate matters; structure_spec.md 'Quantitative Thresholds' → < 1% false positive rate
-- Verification criteria: PASS if defense exists with documented false positive rate < 1%. FAIL if no defense or unmeasured/high false positive rate
-
-- **CQ-S04** [P1] Are safety constraints applied at the system level, not just model-level?
-- Inference path: logic_rules.md 'Safety Logic' → model safety is not configurable or auditable
-- Verification criteria: PASS if system-level guardrails exist independent of provider safety. FAIL if relying solely on model built-in safety
-
-- **CQ-S05** [P2] Are guardrail trigger conditions, actions, and logging defined?
-- Inference path: structure_spec.md 'Safety Architecture Structure' → each guardrail must define all three
-- Verification criteria: PASS if every guardrail has trigger, action, and logging. FAIL if any element missing
-
-- **CQ-S06** [P2] Is red teaming performed pre-deployment and on recurring schedule?
-- Inference path: logic_rules.md 'Safety Logic' → threat landscape evolves
-- Verification criteria: PASS if pre-deployment results documented and recurring schedule exists. FAIL if absent or one-time only
-
-- **CQ-S07** [P2] Is PII detection and redaction implemented?
-- Inference path: domain_scope.md 'Safety & Alignment' → PII handling; logic_rules.md 'Constraint Conflict Checking' → Privacy vs. Personalization
-- Verification criteria: PASS if PII entry points identified with detection/redaction. FAIL if no PII controls
-
-- **CQ-S08** [P3] When safety conflicts with functionality, is the conflict documented with improvement plan?
-- Inference path: logic_rules.md 'Constraint Conflict Checking' → Safety vs. Functionality → four required elements
-- Verification criteria: PASS if documented with rule, degraded functionality, false positive rate, improvement plan. FAIL if undocumented
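
Illustrative sketch, not part of the removed file: the CQ-S02 criterion (every content policy rule has at least one positive and one negative test case) can be checked over a rule registry. The registry shape and the `should_trigger` flag are assumptions for this example; "positive" is read here as an input the rule should block and "negative" as one it should let through.

```python
def rules_lacking_tests(rules: list) -> list:
    """Return ids of rules missing a positive or a negative test case."""
    lacking = []
    for rule in rules:
        cases = rule.get("test_cases", [])
        has_positive = any(case["should_trigger"] for case in cases)
        has_negative = any(not case["should_trigger"] for case in cases)
        if not (has_positive and has_negative):
            lacking.append(rule["id"])
    return lacking


# Hypothetical rule registry, for illustration only.
RULES = [
    {
        "id": "no-credentials",
        "test_cases": [
            {"input": "my password is hunter2", "should_trigger": True},
            {"input": "how do I reset a password?", "should_trigger": False},
        ],
    },
    {
        "id": "no-medical-advice",
        "test_cases": [{"input": "what dose should I take?", "should_trigger": True}],
    },
    {"id": "no-profanity"},  # no test cases at all
]

untested = rules_lacking_tests(RULES)
```

Any id in `untested` corresponds to a rule that would FAIL CQ-S02 under this reading.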
|
|
285
|
-
|
|
286
|
-
---
|
|
287
|
-
|
|
288
|
-
## 7. Production Operations (CQ-O)
|
|
289
|
-
|
|
290
|
-
Verifies that runtime behavior is observable, costs tracked, quality monitored, and feedback actionable.
|
|
291
|
-
|
|
292
|
-
- **CQ-O01** [P1] Does cost tracking granularity match billing granularity?
|
|
293
|
-
- Inference path: logic_rules.md 'Operations Logic' → mismatched granularity makes cost optimization guesswork
|
|
294
|
-
- Verification criteria: PASS if tracking matches billing (per-token if billed per-token) with feature/user attribution. FAIL if tracking is coarser than billing
|
|
295
|
-
|
|
296
|
-
- **CQ-O02** [P1] Does logging capture sufficient information to reproduce any LLM interaction?
|
|
297
|
-
- Inference path: logic_rules.md 'Operations Logic' → logging must capture: full prompt (input), full response (output), model version, latency, token count, and tool calls; insufficient logging makes debugging impossible
|
|
298
|
-
- Verification criteria: PASS if every LLM API call is logged with all six required fields. FAIL if any field is missing from production logs
|
|
299
|
-
|
|
300
|
-
- **CQ-O03** [P1] Is a quality drift baseline established?
|
|
301
|
-
- Inference path: logic_rules.md 'Operations Logic' → baseline from golden set required for drift detection; dependency_rules.md 'Feedback Loops' → Loop 2
|
|
302
|
-
- Verification criteria: PASS if baseline exists and drift detection runs at regular intervals. FAIL if no baseline or no monitoring
|
|
303
|
-
|
|
304
|
-
- **CQ-O04** [P2] Are feedback loops actionable (collection → analysis → system change → verification)?
|
|
305
|
-
- Inference path: logic_rules.md 'Operations Logic' → feedback without improvement path is waste
|
|
306
|
-
- Verification criteria: PASS if each feedback type has documented processing pipeline. FAIL if feedback collected without action path
|
|
307
|
-
|
|
308
|
-
- **CQ-O05** [P2] Is incident response defined for LLM-specific failure modes?
|
|
309
|
-
- Inference path: logic_rules.md 'Operations Logic' → five failure types: outage, quality degradation, cost spike, safety false positives, injection
|
|
310
|
-
- Verification criteria: PASS if playbooks exist for all five types. FAIL if LLM-specific failures not addressed
|
|
311
|
-
|
|
312
|
-
- **CQ-O06** [P2] Are deployment strategies appropriate for model updates?
|
|
313
|
-
- Inference path: domain_scope.md 'Production Operations' → canary/blue-green; dependency_rules.md 'External Dependency Management' → canary rollout
|
|
314
|
-
- Verification criteria: PASS if graduated deployment limits blast radius. FAIL if model updates are deployed all at once without a staged rollout
|
|
315
|
-
|
|
316
|
-
- **CQ-O07** [P2] Is system-level caching implemented where appropriate?
|
|
317
|
-
- Inference path: domain_scope.md 'Production Operations' → response caching, KV cache management
|
|
318
|
-
- Verification criteria: PASS if caching evaluated and implemented with invalidation rules. FAIL if no caching considered for repeated patterns
|
|
319
|
-
|
|
320
|
-
- **CQ-O08** [P3] Are monitoring dashboards covering LLM-specific metrics visible to operators?
|
|
321
|
-
- Inference path: domain_scope.md 'Production Operations' → monitoring; logic_rules.md 'Operations Logic' → logging feeds dashboards
|
|
322
|
-
- Verification criteria: PASS if dashboards show latency, error rate, throughput, cost. FAIL if no operator visibility
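
CQ-O02 names six fields that every logged LLM call must carry. A minimal sketch of that check follows, assuming log records are dictionaries; the key names (`prompt`, `latency_ms`, and so on) are assumptions chosen for illustration, not a schema defined by the package.

```python
# Six fields listed in the CQ-O02 inference path. Key names are hypothetical.
REQUIRED_LOG_FIELDS = (
    "prompt",         # full prompt (input)
    "response",       # full response (output)
    "model_version",
    "latency_ms",
    "token_count",
    "tool_calls",     # an empty list is valid: no tools were called
)


def missing_log_fields(record: dict) -> list:
    """Return required fields that are absent or None in one log record."""
    return [f for f in REQUIRED_LOG_FIELDS if record.get(f) is None]


def check_cq_o02(records: list) -> tuple:
    """Return ("PASS" | "FAIL", [(record index, missing fields), ...])."""
    failures = []
    for index, record in enumerate(records):
        missing = missing_log_fields(record)
        if missing:
            failures.append((index, missing))
    return ("FAIL" if failures else "PASS", failures)
```

Treating `None` as missing while accepting an empty `tool_calls` list is a design choice of this sketch: a call that used no tools is still fully reproducible, whereas a call whose tool usage was never recorded is not.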
|
|
323
|
-
|
|
324
|
-
---
|
|
325
|
-
|
|
326
|
-
## 8. Data & Model Adaptation (CQ-D)
|
|
327
|
-
|
|
328
|
-
Verifies that fine-tuning decisions are justified, quality-controlled, and evaluated against baselines.
|
|
329
|
-
|
|
330
|
-
- **CQ-D01** [P1] Is fine-tuning justified over prompting?
|
|
331
|
-
- Inference path: logic_rules.md 'Data & Model Adaptation Logic' → fine-tuning justified only when prompting alone cannot achieve required quality, latency, or cost targets; fine-tuning introduces maintenance burden
|
|
332
|
-
- Verification criteria: PASS if documented analysis shows prompting was attempted first and was insufficient, with specific metrics. FAIL if fine-tuning pursued without evidence that prompting is insufficient
|
|
333
|
-
- Scope: Applies when fine-tuning is used. N/A if system uses only prompting
|
|
334
|
-
|
|
335
|
-
- **CQ-D02** [P1] Is training data quality assessed before training?
|
|
336
|
-
- Inference path: logic_rules.md 'Data & Model Adaptation Logic' → data quality upper-bounds model quality
|
|
337
|
-
- Verification criteria: PASS if accuracy/consistency/completeness/representativeness assessed before training. FAIL if training starts without assessment
|
|
338
|
-
|
|
339
|
-
- **CQ-D03** [P1] Are adapted models evaluated against base models on the same set?
|
|
340
|
-
- Inference path: logic_rules.md 'Data & Model Adaptation Logic' → comparison required to prove improvement
|
|
341
|
-
- Verification criteria: PASS if comparison results exist on the same golden set. FAIL if the adapted model is measured in isolation
|
|
342
|
-
|
|
343
|
-
- **CQ-D04** [P2] For parameter-efficient fine-tuning (LoRA, QLoRA), is the base model pinned?
|
|
344
|
-
- Inference path: logic_rules.md 'Data & Model Adaptation Logic' → base model update invalidates adapters
|
|
345
|
-
- Verification criteria: PASS if base model version recorded/pinned with invalidation process. FAIL if unpinned
|
|
346
|
-
|
|
347
|
-
- **CQ-D05** [P2] Is training data separated from evaluation data?
|
|
348
|
-
- Inference path: logic_rules.md 'Evaluation Logic' → contamination inflates scores
|
|
349
|
-
- Verification criteria: PASS if training and evaluation data are stored separately with deduplication controls. FAIL if storage is shared without controls
|
|
350
|
-
|
|
351
|
-
- **CQ-D06** [P2] Does fine-tuning improvement exceed fine-tuning cost?
|
|
352
|
-
- Inference path: logic_rules.md 'Data & Model Adaptation Logic'; logic_rules.md 'Constraint Conflict Checking' → Cost vs. Quality
|
|
353
|
-
- Verification criteria: PASS if cost-benefit analysis documents quality gain vs. costs. FAIL if no analysis
|
|
354
|
-
|
|
355
|
-
- **CQ-D07** [P3] Is there a retraining plan for base model updates?
|
|
356
|
-
- Inference path: dependency_rules.md 'External Dependency Management' → deprecation requires migration; logic_rules.md 'Data & Model Adaptation Logic' → adapters invalidated by updates
|
|
357
|
-
- Verification criteria: PASS if documented retraining procedure covers detection, compatibility check, retraining, re-evaluation. FAIL if no plan
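
One possible form of the deduplication control mentioned in CQ-D05 is a fingerprint comparison between the training and evaluation sets. The sketch below assumes both sets are lists of plain-text examples; `fingerprint` and `find_contamination` are hypothetical helper names, and only exact matches after case and whitespace normalization are caught, not paraphrases or near-duplicates.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash an example after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_contamination(train: list, evaluation: list) -> list:
    """Return evaluation examples whose fingerprint also occurs in training data."""
    train_prints = {fingerprint(example) for example in train}
    return [example for example in evaluation if fingerprint(example) in train_prints]


def check_dedup(train: list, evaluation: list) -> str:
    """PASS if no evaluation example leaks from the training set."""
    return "FAIL" if find_contamination(train, evaluation) else "PASS"
```

Running a check like this before each evaluation would surface the score inflation that the inference path attributes to contamination, at least for verbatim overlap.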
|
|
358
|
-
|
|
359
|
-
---
|
|
360
|
-
|
|
361
|
-
## 9. Cross-Cutting (CQ-X)
|
|
362
|
-
|
|
363
|
-
Verifies concerns spanning multiple sub-areas: version management, reproducibility, and inter-area consistency.
|
|
364
|
-
|
|
365
|
-
- **CQ-X01** [P1] Are prompt versions, tool schema versions, agent configuration versions, and model versions tracked?
|
|
366
|
-
- Inference path: domain_scope.md 'Cross-Cutting Concern: Development Lifecycle' → version management spans prompt versions, tool schema versions, agent configuration versions, model version tracking; logic_rules.md 'Prompt Design Logic' → prompt change = code change, must be versioned
|
|
367
|
-
- Verification criteria: PASS if all four artifact types (prompt templates, tool schemas, agent configurations, model version references) are version-controlled. FAIL if any of these artifacts is unversioned
|
|
368
|
-
|
|
369
|
-
- **CQ-X02** [P1] Can past system behavior be reproduced given a specific version of all components?
|
|
370
|
-
- Inference path: logic_rules.md 'Operations Logic' → logging must capture sufficient info to reproduce interactions; 'Prompt Design Logic' → unversioned prompts = unreproducible behavior; dependency_rules.md 'External Dependency Management' → model version must be pinned
|
|
371
|
-
- Verification criteria: PASS if, given a timestamp, all system components (model version, prompt version, tool schemas, retrieval config) can be identified and behavior reconstructed. FAIL if any component's historical state is unrecoverable
|
|
372
|
-
|
|
373
|
-
- **CQ-X03** [P2] When cross-area constraints conflict, is the conflict documented and resolved?
|
|
374
|
-
- Inference path: logic_rules.md 'Constraint Conflict Checking' → four conflict types; unresolved = unpredictable behavior
|
|
375
|
-
- Verification criteria: PASS if conflicts documented with constraint pair, resolution, trade-off rationale. FAIL if undocumented
|
|
376
|
-
|
|
377
|
-
- **CQ-X04** [P2] Do inter-area dependencies follow documented direction rules?
|
|
378
|
-
- Inference path: dependency_rules.md 'Inter-Area Direction Rules'; 'Reverse direction prohibition'
|
|
379
|
-
- Verification criteria: PASS if data flow matches documented directions. FAIL if reverse dependencies exist
|
|
380
|
-
|
|
381
|
-
- **CQ-X05** [P2] Are feedback loops between areas declared with termination conditions?
|
|
382
|
-
- Inference path: dependency_rules.md 'Feedback Loops' → Loops 1-3, each with termination condition
|
|
383
|
-
- Verification criteria: PASS if all cycles documented with termination. FAIL if any loop lacks termination
|
|
384
|
-
|
|
385
|
-
- **CQ-X06** [P3] Do the five golden relationships (cross-component validations) hold?
|
|
386
|
-
- Inference path: structure_spec.md 'Golden Relationships' → Model↔Prompt, Retrieval↔Generation, Tool↔Instructions, Eval↔Safety, Cost↔Model
|
|
387
|
-
- Verification criteria: PASS if all five validations pass. FAIL if any violated
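
The CQ-X04 criterion (data flow matches documented directions, no reverse dependencies) can be sketched as a comparison of two edge sets. The area names used in the usage note are hypothetical, since the actual rules live in dependency_rules.md, and any feedback loop declared under CQ-X05 is assumed to be listed among the allowed edges.

```python
def reverse_dependencies(allowed: set, observed: set) -> list:
    """Return observed (source, target) edges that run against an allowed direction.

    An edge counts as a reverse dependency when it is not itself allowed
    but its mirror image is. Declared feedback loops are assumed to be
    present in `allowed`, so they are not flagged.
    """
    return sorted(
        edge for edge in observed
        if edge not in allowed and (edge[1], edge[0]) in allowed
    )


def check_cq_x04(allowed: set, observed: set) -> str:
    """PASS if no observed edge reverses a documented direction."""
    return "FAIL" if reverse_dependencies(allowed, observed) else "PASS"
```

For example, with allowed edges `("retrieval", "generation")` and `("generation", "evaluation")`, an observed edge `("evaluation", "generation")` would be reported as a reverse dependency. Edges between areas with no documented rule in either direction are not flagged by this sketch and would need separate handling.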
|
|
388
|
-
|
|
389
|
-
---
|
|
390
|
-
|
|
391
|
-
## 10. Edge Cases (CQ-XE)
|
|
392
|
-
|
|
393
|
-
Tests boundary conditions and failure modes not covered by standard verification. Verifies system resilience.
|
|
394
|
-
|
|
395
|
-
- **CQ-XE01** [P2] What happens when a model is deprecated mid-production?
|
|
396
|
-
- Inference path: dependency_rules.md 'External Dependency Management' → Model API Dependencies → deprecation migration plan
|
|
397
|
-
- Verification criteria: PASS if deprecation plan covers detection, impact analysis, replacement evaluation, migration timeline. FAIL if no plan
|
|
398
|
-
|
|
399
|
-
- **CQ-XE02** [P2] What happens when retrieval returns no relevant results?
|
|
400
|
-
- Inference path: logic_rules.md 'Retrieval Logic' → independent evaluation; structure_spec.md 'RAG Pipeline Structure' → relevance scores; logic_rules.md 'Constraint Conflict Checking' → Latency vs. Completeness
|
|
401
|
-
- Verification criteria: PASS if system handles gracefully (fallback, disclaimer, rephrase suggestion). FAIL if generates as if relevant context existed
|
|
402
|
-
|
|
403
|
-
- **CQ-XE03** [P2] What happens when safety guardrails block a valid response?
|
|
404
|
-
- Inference path: logic_rules.md 'Safety Logic' → defense without UX degradation; 'Constraint Conflict Checking' → Safety vs. Functionality
|
|
405
|
-
- Verification criteria: PASS if false positives produce informative messages, are logged, and have adjustment process. FAIL if silent failure or no review process
|
|
406
|
-
|
|
407
|
-
- **CQ-XE04** [P2] What happens when an MCP server becomes unavailable during an agent task?
|
|
408
|
-
- Inference path: dependency_rules.md 'External Dependency Management' → MCP Server Dependencies → fallback behavior; logic_rules.md 'Agentic Systems Logic' → state survives failures
|
|
409
|
-
- Verification criteria: PASS if fallback defined (skip, retry, halt with message). FAIL if crash, hang, or undefined behavior
|
|
410
|
-
|
|
411
|
-
- **CQ-XE05** [P2] What happens when the token budget is exceeded by user input alone?
|
|
412
|
-
- Inference path: logic_rules.md 'Prompt Design Logic' → token budget constraint
|
|
413
|
-
- Verification criteria: PASS if oversized input detected with defined action (truncation notice, rejection). FAIL if silent truncation or unhandled error
|
|
414
|
-
|
|
415
|
-
- **CQ-XE06** [P3] What happens when a model provider changes pricing mid-contract?
|
|
416
|
-
- Inference path: dependency_rules.md 'External Dependency Management' → pricing changes; structure_spec.md 'Golden Relationships' → Cost budget ↔ Model selection
|
|
417
|
-
- Verification criteria: PASS if cost monitoring detects anomalies and alerts before overrun. FAIL if not detected until billing cycle
|
|
418
|
-
|
|
419
|
-
- **CQ-XE07** [P3] What happens when multiple agents produce contradictory outputs?
|
|
420
|
-
- Inference path: logic_rules.md 'Agentic Systems Logic' → termination conditions; structure_spec.md 'Agent Architecture Structure' → Multi-Agent Execution profile
|
|
421
|
-
- Verification criteria: PASS if conflict resolution mechanism exists (orchestrator, voting, confidence). FAIL if no mechanism
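
CQ-XE05 distinguishes a defined action on oversized input (truncation with notice, or rejection) from silent truncation. A minimal sketch of that distinction follows; whitespace splitting stands in for a real tokenizer, and `BudgetDecision` and `apply_input_budget` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BudgetDecision:
    accepted: bool          # whether the (possibly truncated) input proceeds
    text: str               # text to send on, empty if rejected
    notice: Optional[str]   # user-facing message; None when input fits


def apply_input_budget(text: str, max_input_tokens: int,
                       on_oversize: str = "reject") -> BudgetDecision:
    """Detect oversized input and apply an explicit, user-visible action.

    Whitespace-separated words approximate tokens here; a real system
    would count with the tokenizer of the model in use.
    """
    tokens = text.split()
    if len(tokens) <= max_input_tokens:
        return BudgetDecision(True, text, None)
    if on_oversize == "truncate":
        kept = " ".join(tokens[:max_input_tokens])
        notice = f"Input truncated from {len(tokens)} to {max_input_tokens} tokens."
        return BudgetDecision(True, kept, notice)
    if on_oversize == "reject":
        notice = f"Input rejected: {len(tokens)} tokens exceeds the limit of {max_input_tokens}."
        return BudgetDecision(False, "", notice)
    raise ValueError(f"unknown on_oversize policy: {on_oversize!r}")
```

Both oversize branches always populate `notice`, so the caller cannot truncate or reject without having a message to show, which is the property the "FAIL if silent truncation" wording asks for.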
|
|
422
|
-
|
|
423
|
-
---
|
|
424
|
-
|
|
425
|
-
## Related Documents
|
|
426
|
-
- domain_scope.md — the upper-level definition of the 8 sub-areas and cross-cutting concerns these questions cover
|
|
427
|
-
- logic_rules.md — inference logic for all 10 question categories (CQ-M through CQ-XE)
|
|
428
|
-
- structure_spec.md — structural specifications for CQ-R (knowledge, RAG), CQ-A (agent architecture), CQ-E (evaluation), CQ-S (safety), CQ-X (golden relationships, thresholds)
|
|
429
|
-
- dependency_rules.md — inter-area dependencies for CQ-X (direction rules, feedback loops), CQ-XE (external dependency failures), CQ-D (adaptation dependencies)
|
|
430
|
-
- concepts.md — definitions of terms used throughout these competency questions
|