onto-mcp 0.3.0 → 0.3.2

Files changed (61)
  1. package/.onto/authority/core-lexicon.yaml +12 -0
  2. package/.onto/domains/software-engineering/competency_qs.md +192 -63
  3. package/.onto/domains/software-engineering/concepts.md +67 -5
  4. package/.onto/domains/software-engineering/conciseness_rules.md +22 -2
  5. package/.onto/domains/software-engineering/dependency_rules.md +78 -8
  6. package/.onto/domains/software-engineering/domain_scope.md +181 -150
  7. package/.onto/domains/software-engineering/extension_cases.md +318 -542
  8. package/.onto/domains/software-engineering/logic_rules.md +75 -3
  9. package/.onto/domains/software-engineering/problem_framing_profile.md +29 -2
  10. package/.onto/domains/software-engineering/prompt_interface.md +122 -0
  11. package/.onto/domains/software-engineering/structure_spec.md +53 -4
  12. package/.onto/principles/llm-native-development-guideline.md +20 -0
  13. package/.onto/principles/productization-charter.md +6 -0
  14. package/.onto/processes/evolve/material-kind-adapter-contract.md +6 -0
  15. package/.onto/processes/reconstruct/reconstruct-boundary-contract.md +468 -81
  16. package/.onto/processes/reconstruct/reconstruct-execution-ux-contract.md +177 -0
  17. package/.onto/processes/reconstruct/source-profile-contract.md +39 -6
  18. package/.onto/processes/reconstruct/top-level-concept-discovery-contract.md +387 -0
  19. package/.onto/processes/review/binding-contract.md +8 -0
  20. package/.onto/processes/review/lens-registry.md +16 -0
  21. package/.onto/processes/review/pre-dispatch-contracts.md +34 -13
  22. package/.onto/processes/review/productized-live-path.md +3 -1
  23. package/.onto/processes/shared/pipeline-execution-ledger-contract.md +185 -0
  24. package/.onto/processes/shared/target-material-kind-contract.md +24 -2
  25. package/.onto/roles/axiology.md +7 -2
  26. package/AGENTS.md +4 -2
  27. package/README.md +52 -29
  28. package/dist/core-api/reconstruct-api.js +92 -5
  29. package/dist/core-api/review-api.js +1744 -371
  30. package/dist/core-runtime/cli/mock-review-unit-executor.js +17 -0
  31. package/dist/core-runtime/cli/render-review-final-output.js +9 -0
  32. package/dist/core-runtime/cli/review-invoke.js +387 -55
  33. package/dist/core-runtime/cli/run-review-prompt-execution.js +361 -90
  34. package/dist/core-runtime/path-boundary.js +58 -0
  35. package/dist/core-runtime/pipeline-execution-ledger.js +100 -0
  36. package/dist/core-runtime/reconstruct/artifact-types.js +33 -1
  37. package/dist/core-runtime/reconstruct/materialize-preparation.js +54 -4
  38. package/dist/core-runtime/reconstruct/pipeline-execution-ledger.js +342 -0
  39. package/dist/core-runtime/reconstruct/post-seed-validation.js +630 -0
  40. package/dist/core-runtime/reconstruct/record.js +105 -1
  41. package/dist/core-runtime/reconstruct/run.js +1594 -38
  42. package/dist/core-runtime/reconstruct/seed-candidate-validation.js +29 -0
  43. package/dist/core-runtime/review/continuation-plan.js +160 -0
  44. package/dist/core-runtime/review/execution-plan-boundary.js +123 -0
  45. package/dist/core-runtime/review/materializers.js +8 -3
  46. package/dist/core-runtime/review/pipeline-execution-ledger.js +250 -0
  47. package/dist/core-runtime/review/review-artifact-utils.js +15 -2
  48. package/dist/core-runtime/review/review-invocation-runner.js +604 -0
  49. package/dist/core-runtime/target-material-kind.js +43 -5
  50. package/dist/mcp/server.js +289 -59
  51. package/dist/mcp/tool-schemas.js +28 -2
  52. package/package.json +4 -2
  53. package/.onto/domains/llm-native-development/competency_qs.md +0 -430
  54. package/.onto/domains/llm-native-development/concepts.md +0 -242
  55. package/.onto/domains/llm-native-development/conciseness_rules.md +0 -163
  56. package/.onto/domains/llm-native-development/dependency_rules.md +0 -216
  57. package/.onto/domains/llm-native-development/domain_scope.md +0 -197
  58. package/.onto/domains/llm-native-development/extension_cases.md +0 -474
  59. package/.onto/domains/llm-native-development/logic_rules.md +0 -123
  60. package/.onto/domains/llm-native-development/prompt_interface.md +0 -49
  61. package/.onto/domains/llm-native-development/structure_spec.md +0 -245
@@ -1,430 +0,0 @@
1
- ---
2
- version: 2
3
- last_updated: "2026-05-27"
4
- source: manual
5
- status: established
6
- ---
7
-
8
- # LLM-Native Development Domain — Competency Questions
9
-
10
- A list of core questions that this domain's system must be able to answer.
11
- The pragmatics agent verifies the actual reasoning path for each question.
12
-
13
- Classification axis: **system construction concern** — classified by the sub-area of the LLM-powered system each question addresses, matching the 8 sub-areas defined in domain_scope.md.
14
-
15
- Question priority principles: **Core design decisions (model integration, prompt design, retrieval, agent architecture) are the highest priority.** These concerns govern the majority of LLM system quality. Evaluation, safety, operations, and adaptation are secondary concerns applied on top of the core design foundation.
16
-
17
- Priority levels:
18
- - **P1** — Must be answerable for any LLM-powered system review. Failure indicates a fundamental design defect.
19
- - **P2** — Should be answerable for production LLM systems. Failure indicates a quality gap.
20
- - **P3** — Recommended for mature LLM systems. Failure indicates a refinement opportunity.
21
-
22
- ---
23
-
24
- ## 1. Model Integration (CQ-M)
25
-
26
- Verifies that the system's connection to LLMs is reliable, replaceable, and explicitly designed. Without correct model integration, no other aspect of the system can function.
27
-
28
- - **CQ-M01** [P1] Can the system switch between model providers or model versions without code changes beyond configuration?
29
- - Inference path: logic_rules.md 'Model Integration Logic' → model version pinning, model selection justification → model identity must be configurable
30
- - Verification criteria: PASS if model identifier is in configuration/env, not hardcoded. FAIL if changing models requires application code changes
31
- - Scope: Covers model identifier configurability. Does not cover prompt adjustments needed for different models (→CQ-P)
32
-
33
- - **CQ-M02** [P1] Is the model version pinned to a specific release in production?
34
- - Inference path: logic_rules.md 'Model Integration Logic' → unpinned versions cause non-deterministic behavior; dependency_rules.md 'External Dependency Management' → Model API Dependencies → version must be pinned
35
- - Verification criteria: PASS if production uses a specific version (e.g., `claude-sonnet-4-20250514`). FAIL if production uses an alias (e.g., `claude-sonnet-4-latest`)
36
-
37
- - **CQ-M03** [P1] Are fallback routes defined for every model routing path?
38
- - Inference path: logic_rules.md 'Model Integration Logic' → if model routing is used, fallback paths must be defined for every route; a route without fallback is a single point of failure
39
- - Verification criteria: PASS if every routing configuration entry includes a fallback model or degradation behavior (cached response, simpler model, error return). FAIL if any routing path has no defined fallback
40
- - Scope: Covers model-level routing. Does not cover application-level error handling for non-model failures
41
-
42
- - **CQ-M04** [P1] Are model capability requirements documented per task type?
43
- - Inference path: logic_rules.md 'Model Integration Logic' → capability requirements must include task type, required capabilities, quality level, cost constraints; structure_spec.md 'Golden Relationships' → Model capability ↔ Prompt complexity
44
- - Verification criteria: PASS if documentation lists each task type, required model capabilities, and assigned model with justification. FAIL if model-task assignments exist without capability documentation
45
-
46
- - **CQ-M05** [P2] Is model selection justified against cost and quality trade-offs?
47
- - Inference path: logic_rules.md 'Model Integration Logic' → overpowered model = cost waste, underpowered model = quality risk; logic_rules.md 'Constraint Conflict Checking' → Cost vs. Quality
48
- - Verification criteria: PASS if each model selection has documented cost/quality rationale. FAIL if selection lacks trade-off analysis
49
-
50
- - **CQ-M06** [P2] When inference optimization is applied, is the impact on output quality measured?
51
- - Inference path: logic_rules.md 'Model Integration Logic' → optimization impact must be measured, not assumed
52
- - Verification criteria: PASS if evaluation results exist comparing quality before/after optimization. FAIL if optimization is applied without quality measurement
53
-
54
- - **CQ-M07** [P2] Does the system handle model provider API errors gracefully?
55
- - Inference path: logic_rules.md 'Operations Logic' → incident response for LLM-specific failures; dependency_rules.md 'External Dependency Management' → Model API Dependencies
56
- - Verification criteria: PASS if retry logic, meaningful error messages, and graceful degradation exist. FAIL if API errors cause crashes or silent failures
57
-
58
- - **CQ-M08** [P3] When multiple models are orchestrated, is the orchestration execution profile documented?
59
- - Inference path: structure_spec.md 'Agent Architecture Structure' → Multi-Agent Execution profile; domain_scope.md 'Model Integration' → multi-model orchestration
60
- - Verification criteria: PASS if multi-model pattern is documented with roles, routing logic, data flow. FAIL if undocumented
61
-
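As an illustration of the CQ-M01 and CQ-M02 criteria above, here is a minimal sketch of an automated check. It assumes the model identifier is read from an environment variable (`LLM_MODEL_ID` is a hypothetical name) and that a pinned release ends in an eight-digit date stamp, as in the example identifier in CQ-M02. The suffix heuristic is an assumption for illustration, not a provider rule.

```python
import os
import re

# Assumption: pinned releases end in an 8-digit date stamp (e.g. -20250514).
PINNED_SUFFIX = re.compile(r"-\d{8}$")


def load_model_id(env=None, var="LLM_MODEL_ID"):
    """Read the model identifier from configuration rather than code (CQ-M01)."""
    env = os.environ if env is None else env
    model_id = env.get(var)
    if not model_id:
        raise ValueError(f"{var} is not set; the model id must come from configuration")
    return model_id


def is_pinned(model_id):
    """Return True if the identifier names a specific dated release (CQ-M02)."""
    return bool(PINNED_SUFFIX.search(model_id))
```

A review script could call `is_pinned(load_model_id())` against the production environment and report FAIL when it returns `False`.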
62
- ---
63
-
64
- ## 2. Prompt & Context Design (CQ-P)
65
-
66
- Verifies that model inputs are structured, constrained, and managed to produce reliable outputs. Without correct prompt design, even the best model integration produces unpredictable results.
67
-
68
- - **CQ-P01** [P1] Is the instruction hierarchy enforced (system prompt > tool definitions > user prompt)?
69
- - Inference path: logic_rules.md 'Prompt Design Logic' → instruction hierarchy: system prompt > tool definitions > user prompt; when levels conflict, higher-priority wins
70
- - Verification criteria: PASS if prompts have clear hierarchy and conflict resolution is documented (e.g., "system prompt overrides user instructions on output format"). FAIL if no hierarchy exists or user inputs can override system-level instructions
71
- - Scope: Covers instruction precedence design. Does not cover content quality of individual instructions
72
-
73
- - **CQ-P02** [P1] Is the token budget calculated before API calls to prevent truncation?
74
- - Inference path: logic_rules.md 'Prompt Design Logic' → system prompt + context + user input ≤ context window - expected output; violation causes truncation (silent data loss) or API errors
75
- - Verification criteria: PASS if token counts for all prompt components are computed before the API call and oversized prompts are handled (graceful truncation with notification or rejection). FAIL if prompts may exceed the context window without pre-validation
76
-
77
- - **CQ-P03** [P1] Is structured output validated before consumption?
78
- - Inference path: logic_rules.md 'Prompt Design Logic' → output schema must be validated; structure_spec.md 'LLM System Architecture Structure' → Output handling is required
79
- - Verification criteria: PASS if every code path expecting structured output includes schema validation. FAIL if model output is consumed without validation
80
-
81
- - **CQ-P04** [P1] Are prompt templates versioned alongside code?
82
- - Inference path: logic_rules.md 'Prompt Design Logic' → prompt change = code change; unversioned prompts prevent reproduction
83
- - Verification criteria: PASS if prompt templates are in version control with change tracking. FAIL if prompts are unversioned
84
-
85
- - **CQ-P05** [P2] Is context rot mitigated for long conversations?
86
- - Inference path: logic_rules.md 'Prompt Design Logic' → earliest context loses influence as conversation grows; critical info must be re-injected
87
- - Verification criteria: PASS if at least one mitigation strategy is implemented (re-injection, summarization, sliding window). FAIL if long conversations rely solely on initial context
88
-
89
- - **CQ-P06** [P2] Are few-shot examples representative of the target distribution?
90
- - Inference path: logic_rules.md 'Prompt Design Logic' → biased examples cause biased outputs
91
- - Verification criteria: PASS if examples cover major target categories and are updated when distribution changes. FAIL if examples are ad-hoc or narrow
92
-
93
- - **CQ-P07** [P2] Does the prompt template length stay within 25% of the context window?
94
- - Inference path: structure_spec.md 'Quantitative Thresholds' → prompt template ≤ 25% of context window
95
- - Verification criteria: PASS if static prompt ≤ 25% of target model's context window. FAIL if exceeded without justification
96
-
97
- - **CQ-P08** [P2] When chain-of-thought prompting is used, is the reasoning trace verifiable?
98
- - Inference path: domain_scope.md 'Prompt & Context Design' → chain-of-thought patterns; logic_rules.md 'Prompt Design Logic' → structured output validation
99
- - Verification criteria: PASS if reasoning traces are captured and reviewable. FAIL if traces are generated but discarded
100
-
101
- - **CQ-P09** [P3] Are prompt injection risks from user-supplied content addressed at the prompt level?
102
- - Inference path: logic_rules.md 'Safety Logic' → injection defense; logic_rules.md 'Prompt Design Logic' → user input is lowest priority
103
- - Verification criteria: PASS if user content is delimited or sanitized before inclusion in prompts. FAIL if user content is concatenated without boundary markers
104
-
105
- - **CQ-P10** [P2] Is execution context free of non-current compatibility and deprecation material?
106
- - Inference path: conciseness_rules.md 'Navigation and History Mixing' → execution context should contain current behavior, contracts, authority, and failure handling; historical material belongs in isolated paths
107
- - Verification criteria: PASS if documents loaded for execution contain current behavior only and link to isolated history only when needed. FAIL if backward-compatibility notes, deprecated behavior, migration rationale, or historical alternatives are loaded by default
108
-
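The token budget inequality in CQ-P02 and the 25% threshold in CQ-P07 can be sketched as two small checks. This is a minimal sketch that assumes token counts are already computed by a tokenizer appropriate to the target model; the function and parameter names are hypothetical.

```python
def fits_budget(system_tokens, context_tokens, user_tokens,
                context_window, expected_output_tokens):
    """CQ-P02: system + context + user <= context window - expected output."""
    used = system_tokens + context_tokens + user_tokens
    return used <= context_window - expected_output_tokens


def template_within_limit(template_tokens, context_window, ratio=0.25):
    """CQ-P07: the static prompt template stays within 25% of the window."""
    return template_tokens <= ratio * context_window


def check_before_call(system_tokens, context_tokens, user_tokens,
                      context_window, expected_output_tokens):
    """Reject an oversized prompt before any API call is made."""
    if not fits_budget(system_tokens, context_tokens, user_tokens,
                       context_window, expected_output_tokens):
        raise ValueError("prompt exceeds token budget; truncate or reject")
```

Raising before the call is one of the two handling options CQ-P02 names (rejection); graceful truncation with notification would replace the `raise` with a trimming step.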
109
- ---
110
-
111
- ## 3. Retrieval & Knowledge Systems (CQ-R)
112
-
113
- Verifies that external information fed to the model is retrieved correctly, structured for LLM consumption, and maintains integrity. Encompasses both RAG pipeline design and LLM-favored knowledge structure.
114
-
115
- - **CQ-R01** [P1] Can retrieval quality be evaluated independently of generation quality?
116
- - Inference path: logic_rules.md 'Retrieval Logic' → retrieval relevance must be verified independently; root cause may be irrelevant retrieval, incorrect generation, or both; structure_spec.md 'RAG Pipeline Structure' → Stage Boundary Rules → each stage independently testable
117
- - Verification criteria: PASS if the system can produce retrieval results with relevance scores without running the generation step. FAIL if retrieval and generation are monolithic (testing retrieval requires the full pipeline)
118
- - Scope: Covers architectural separability. Does not cover evaluation methodology (→CQ-E)
119
-
120
- - **CQ-R02** [P1] Do chunks preserve semantic boundaries?
121
- - Inference path: logic_rules.md 'Retrieval Logic' → chunking must preserve semantic boundaries
122
- - Verification criteria: PASS if chunks align with document structure (paragraphs, sections). FAIL if fixed-size splitting ignores content structure
123
-
124
- - **CQ-R03** [P1] Is the LLM-favored structure (File=Concept, YAML frontmatter) consistently applied?
125
- - Inference path: logic_rules.md 'File–Concept Correspondence Logic' → one concept per file; logic_rules.md 'Frontmatter Conformance Logic'; structure_spec.md 'Frontmatter Specification'
126
- - Verification criteria: PASS if documents follow File=Concept paradigm with required frontmatter. FAIL if concepts are mixed across files or frontmatter is absent
127
- - Scope: Applies to document-based knowledge bases. N/A for database-only retrieval
128
-
129
- - **CQ-R04** [P1] Is every concept file reachable from the entry point?
130
- - Inference path: logic_rules.md 'Navigation Path Logic' → unreachable file = isolated document; structure_spec.md 'Isolated Element Prohibition'
131
- - Verification criteria: PASS if traversal from entry point reaches every concept file. FAIL if any file is unreachable
132
-
133
- - **CQ-R05** [P1] Are navigation indices (INDEX.md) up to date and consistent?
134
- - Inference path: structure_spec.md 'Project Required Files' → INDEX.md is source of truth for file existence; dependency_rules.md 'Referential Integrity'
135
- - Verification criteria: PASS if INDEX.md lists all directory files and all listed files exist. FAIL if stale or incomplete
136
-
137
- - **CQ-R06** [P2] Does retrieved context carry provenance metadata?
138
- - Inference path: logic_rules.md 'Retrieval Logic' → provenance required for debugging; structure_spec.md 'RAG Pipeline Structure' → stage outputs carry metadata
139
- - Verification criteria: PASS if each chunk includes source ID, chunk ID, and relevance score. FAIL if no provenance
140
-
141
- - **CQ-R07** [P2] When hybrid search is used, is the combination strategy explicit?
142
- - Inference path: logic_rules.md 'Retrieval Logic' → unspecified combination = non-reproducible results
143
- - Verification criteria: PASS if combination method and parameters are documented. FAIL if undocumented
144
-
145
- - **CQ-R08** [P2] Does the embedding model match the content domain?
146
- - Inference path: logic_rules.md 'Retrieval Logic' → general-purpose embeddings may underperform on domain-specific content
147
- - Verification criteria: PASS if embedding model selection is justified with domain appropriateness and retrieval metrics. FAIL if unjustified
148
-
149
- - **CQ-R09** [P2] When the embedding model changes, is re-indexing performed?
150
- - Inference path: dependency_rules.md 'External Dependency Management' → Embedding Model Dependencies → different models produce incompatible vectors
151
- - Verification criteria: PASS if documented migration includes complete re-indexing. FAIL if model changes can produce mixed embeddings
152
-
153
- - **CQ-R10** [P2] Are RAG pipeline stages explicitly defined with input/output contracts?
154
- - Inference path: structure_spec.md 'RAG Pipeline Structure' → Required Stages (ingestion, processing, storage, retrieval, generation) with contracts
155
- - Verification criteria: PASS if each stage has documented inputs, outputs, and design decisions. FAIL if pipeline is monolithic
156
-
157
- - **CQ-R11** [P3] Does single file size stay within the 500-line limit?
158
- - Inference path: structure_spec.md 'File Structure Required Elements' → 500 lines recommended; 'Quantitative Thresholds'
159
- - Verification criteria: PASS if ≤ 500 lines. WARNING if exceeded with justification. FAIL if exceeded without justification
160
-
161
- - **CQ-R12** [P3] Does directory depth stay within the 3-level limit?
162
- - Inference path: structure_spec.md 'Directory Structure Rules' → max 3 levels
163
- - Verification criteria: PASS if ≤ 3 levels. FAIL if exceeded without justification
164
-
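The reachability criterion in CQ-R04 amounts to a graph traversal from the entry point. The sketch below assumes the link structure has already been extracted into a dictionary mapping each file to the files it links to; that input shape and the function name are assumptions for illustration.

```python
from collections import deque


def unreachable_files(links, entry_point):
    """Return files that cannot be reached from entry_point by following links.

    links maps every known file to the list of files it links to; files with
    no outgoing links should still appear as keys with an empty list.
    """
    seen = {entry_point}
    queue = deque([entry_point])
    while queue:
        current = queue.popleft()
        for target in links.get(current, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    all_files = set(links) | {t for targets in links.values() for t in targets}
    return sorted(all_files - seen)
```

An empty result corresponds to PASS; any returned file is an isolated document in the sense of CQ-R04.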
165
- ---
166
-
167
- ## 4. Agentic Systems (CQ-A)
168
-
169
- Verifies that agent architecture, tool integration, and multi-agent coordination are reliable, terminable, and debuggable.
170
-
171
- - **CQ-A01** [P1] Are tool definitions non-overlapping in their described capabilities?
172
- - Inference path: logic_rules.md 'Agentic Systems Logic' → if two tools can accomplish the same task, agent choice is non-deterministic; either remove overlap or add explicit routing instructions
173
- - Verification criteria: PASS if no two tools overlap in capability description, or overlapping tools have explicit disambiguation in agent instructions. FAIL if overlap exists without disambiguation
174
-
175
- - **CQ-A02** [P1] Do all agent loops have explicit termination conditions?
176
- - Inference path: logic_rules.md 'Agentic Systems Logic' → without explicit termination (max iterations, convergence criteria, timeout), agent loops can run indefinitely, consuming resources
177
- - Verification criteria: PASS if every iterative agent workflow defines at least one termination condition. FAIL if any loop can run indefinitely without a termination mechanism
178
-
179
- - **CQ-A03** [P1] Is agent state explicitly managed rather than relying on conversation history?
180
- - Inference path: logic_rules.md 'Agentic Systems Logic' → implicit state via conversation is fragile; structure_spec.md 'Agent Architecture Structure' → state management required
181
- - Verification criteria: PASS if critical state is in structured storage (DB, file, state object). FAIL if solely in conversation history
182
- - Scope: Covers critical state. Single-turn conversational context is acceptable as implicit
183
-
184
- - **CQ-A04** [P1] Are MCP tool schemas self-describing?
185
- - Inference path: logic_rules.md 'Agentic Systems Logic' → schema must suffice without external docs; structure_spec.md 'Agent Architecture Structure' → Tool Definition Structure
186
- - Verification criteria: PASS if every tool has name, description, typed parameters, and return type. FAIL if external docs needed to understand usage
187
-
188
- - **CQ-A05** [P2] Do agent instructions list available tools with usage guidance?
189
- - Inference path: logic_rules.md 'Agentic Systems Logic' → instructions must list tools and when to use each
190
- - Verification criteria: PASS if system prompt lists all tools with guidance. FAIL if tools are available but unmentioned
191
-
192
- - **CQ-A06** [P2] For long-running tasks, is structured progress tracking maintained?
193
- - Inference path: logic_rules.md 'Agentic Systems Logic' → agent must resume from checkpoint without re-execution
194
- - Verification criteria: PASS if progress record tracks completed/current/remaining steps. FAIL if interruption requires full restart
195
-
196
- - **CQ-A07** [P2] When multiple agents are used, is the coordination execution profile documented?
197
- - Inference path: structure_spec.md 'Agent Architecture Structure' → Multi-Agent Execution profile → must be documented with rationale
198
- - Verification criteria: PASS if execution profile, roles, communication patterns documented. FAIL if undocumented
199
-
200
- - **CQ-A08** [P2] Is the agent tool count within the recommended limit (< 20)?
201
- - Inference path: structure_spec.md 'Quantitative Thresholds' → < 20 tools per agent
202
- - Verification criteria: PASS if < 20. WARNING if 20-30 with justification. FAIL if 30+ without evidence of unaffected selection quality
203
-
204
- - **CQ-A09** [P3] Are tool call results validated before further reasoning?
205
- - Inference path: logic_rules.md 'Prompt Design Logic' → validate before consumption; 'Agentic Systems Logic' → tool results are part of agent state
206
- - Verification criteria: PASS if tool results undergo validation (error check, schema, sanity). FAIL if consumed without validation
207
-
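CQ-A02 asks for at least one explicit termination condition per agent loop. A minimal sketch with two of the conditions it lists (max iterations and timeout) might look like this; `step` is a hypothetical callable that returns `True` when the task is complete, and the default limits are arbitrary.

```python
import time


def run_agent_loop(step, max_iterations=10, timeout_seconds=30.0,
                   clock=time.monotonic):
    """Run step(i) until it reports completion or a termination limit is hit.

    Returns (reason, iterations), where reason is "done", "max_iterations",
    or "timeout".
    """
    start = clock()
    for i in range(max_iterations):
        # Timeout is checked before each step, so a slow step can overrun it.
        if clock() - start > timeout_seconds:
            return "timeout", i
        if step(i):
            return "done", i + 1
    return "max_iterations", max_iterations
```

The `clock` parameter exists only so the timeout path can be exercised in tests without waiting.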
208
- ---
209
-
210
- ## 5. Evaluation & Testing (CQ-E)
211
-
212
- Verifies that system output quality is measured by defined criteria, representative data, and reproducible methods.
213
-
214
- - **CQ-E01** [P1] Are evaluation criteria defined before system development begins (spec-first)?
215
- - Inference path: logic_rules.md 'Evaluation Logic' → criteria must be defined before development; defining after seeing output introduces confirmation bias
216
- - Verification criteria: PASS if evaluation criteria (metrics, thresholds, targets) are documented before or concurrent with the first development commit. FAIL if criteria are defined after the system is built
217
- - Scope: Covers existence and timing of criteria definition. Does not cover whether the criteria themselves are appropriate
218
-
219
- - **CQ-E02** [P1] Does the golden set cover the target distribution?
220
- - Inference path: logic_rules.md 'Evaluation Logic'; structure_spec.md 'Evaluation Structure' → Golden Set Requirements → all input categories
221
- - Verification criteria: PASS if golden set covers all expected categories with documented diversity. FAIL if partial or undocumented coverage
222
-
223
- - **CQ-E03** [P1] Is AI-as-judge bias disclosed?
224
- - Inference path: logic_rules.md 'Evaluation Logic' → must disclose judge model, prompt, and known biases
225
- - Verification criteria: PASS if all three disclosed. FAIL if any element missing
226
-
227
- - **CQ-E04** [P2] Are evaluation pipelines automated and repeatable?
228
- - Inference path: logic_rules.md 'Evaluation Logic' → manual does not scale; structure_spec.md 'Evaluation Structure' → automated for production
229
- - Verification criteria: PASS if pipeline runs without manual intervention with consistent results. FAIL if manual steps required in production
230
-
231
- - **CQ-E05** [P2] Is the golden set separated from training data?
232
- - Inference path: logic_rules.md 'Evaluation Logic' → contamination inflates scores; structure_spec.md 'Evaluation Structure' → Separation
233
- - Verification criteria: PASS if separate storage with contamination prevention. FAIL if shared without controls
234
-
235
- - **CQ-E06** [P2] For A/B testing, are statistical significance thresholds predefined?
236
- - Inference path: logic_rules.md 'Evaluation Logic' → undefined thresholds enable peeking bias
237
- - Verification criteria: PASS if thresholds, sample sizes, stopping criteria predefined. FAIL if absent
238
-
239
- - **CQ-E07** [P2] Does hallucination detection methodology exist?
240
- - Inference path: domain_scope.md 'Evaluation & Testing'; logic_rules.md 'Retrieval Logic' → independent retrieval/generation evaluation enables hallucination tracing
241
- - Verification criteria: PASS if documented detection method exists and is applied. FAIL if no method defined
242
-
243
- - **CQ-E08** [P3] Are human evaluation protocols defined for aspects resisting automation?
244
- - Inference path: structure_spec.md 'Evaluation Structure' → manual acceptable for semantic quality; domain_scope.md 'Evaluation & Testing' → HITL
245
- - Verification criteria: PASS if protocols exist with inter-rater reliability measures. FAIL if human evaluation is ad-hoc
246
-
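The coverage check in CQ-E02 can be approximated by comparing expected input categories against those present in the golden set. This sketch assumes each golden-set example is a dictionary with a `category` key, which is a hypothetical schema.

```python
def uncovered_categories(expected_categories, golden_set):
    """Return expected categories that have no example in the golden set."""
    covered = {example["category"] for example in golden_set}
    return sorted(set(expected_categories) - covered)
```

An empty list indicates category coverage only; it says nothing about the documented diversity that CQ-E02 also requires.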
247
- ---
248
-
249
- ## 6. Safety & Alignment (CQ-S)
250
-
251
- Verifies that harmful outputs are prevented, adversarial inputs are defended against, and safety mechanisms are testable and layered.
252
-
253
- - **CQ-S01** [P1] Is defense-in-depth implemented (input filtering + output filtering + monitoring)?
254
- - Inference path: logic_rules.md 'Safety Logic' → no single safety mechanism is sufficient; minimum layers: input filtering + output filtering + monitoring; structure_spec.md 'Safety Architecture Structure' → Required Pipeline (input guardrails → model → output guardrails)
255
- - Verification criteria: PASS if the system implements at least input filtering, output filtering, and monitoring as separate layers. FAIL if only one safety layer exists or any of the three is absent
256
- - Scope: Covers layer presence. Effectiveness of individual layers checked in CQ-S02, CQ-S03
257
-
258
- - **CQ-S02** [P1] Are content policy rules testable with positive and negative cases?
259
- - Inference path: logic_rules.md 'Safety Logic' → untestable rules cannot be verified
260
- - Verification criteria: PASS if every rule has ≥1 positive and ≥1 negative test case. FAIL if any rule lacks tests
261
-
262
- - **CQ-S03** [P1] Is prompt injection defense present without unacceptable UX degradation?
263
- - Inference path: logic_rules.md 'Safety Logic' → false positive rate matters; structure_spec.md 'Quantitative Thresholds' → < 1% false positive rate
264
- - Verification criteria: PASS if defense exists with documented false positive rate < 1%. FAIL if no defense or unmeasured/high false positive rate
265
-
266
- - **CQ-S04** [P1] Are safety constraints applied at the system level, not just model-level?
267
- - Inference path: logic_rules.md 'Safety Logic' → model safety is not configurable or auditable
268
- - Verification criteria: PASS if system-level guardrails exist independent of provider safety. FAIL if relying solely on model built-in safety
269
-
270
- - **CQ-S05** [P2] Are guardrail trigger conditions, actions, and logging defined?
271
- - Inference path: structure_spec.md 'Safety Architecture Structure' → each guardrail must define all three
272
- - Verification criteria: PASS if every guardrail has trigger, action, and logging. FAIL if any element missing
273
-
274
- - **CQ-S06** [P2] Is red teaming performed pre-deployment and on recurring schedule?
275
- - Inference path: logic_rules.md 'Safety Logic' → threat landscape evolves
276
- - Verification criteria: PASS if pre-deployment results documented and recurring schedule exists. FAIL if absent or one-time only
277
-
278
- - **CQ-S07** [P2] Is PII detection and redaction implemented?
279
- - Inference path: domain_scope.md 'Safety & Alignment' → PII handling; logic_rules.md 'Constraint Conflict Checking' → Privacy vs. Personalization
280
- - Verification criteria: PASS if PII entry points identified with detection/redaction. FAIL if no PII controls
281
-
282
- - **CQ-S08** [P3] When safety conflicts with functionality, is the conflict documented with improvement plan?
283
- - Inference path: logic_rules.md 'Constraint Conflict Checking' → Safety vs. Functionality → four required elements
284
- - Verification criteria: PASS if documented with rule, degraded functionality, false positive rate, improvement plan. FAIL if undocumented
285
-
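The false positive threshold in CQ-S03 can be measured on a labeled sample. The sketch below assumes two parallel lists: whether the defense flagged each input, and whether that input was actually an attack. Both the input shape and the function names are assumptions.

```python
def false_positive_rate(flagged, is_attack):
    """Share of benign inputs that the injection defense flagged."""
    benign_flags = [f for f, attack in zip(flagged, is_attack) if not attack]
    if not benign_flags:
        raise ValueError("no benign inputs to measure")
    return sum(benign_flags) / len(benign_flags)


def passes_cq_s03(flagged, is_attack, threshold=0.01):
    """PASS when the measured false positive rate is below the threshold."""
    return false_positive_rate(flagged, is_attack) < threshold
```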
286
- ---
287
-
288
- ## 7. Production Operations (CQ-O)
289
-
290
- Verifies that runtime behavior is observable, costs tracked, quality monitored, and feedback actionable.
291
-
292
- - **CQ-O01** [P1] Does cost tracking granularity match billing granularity?
293
- - Inference path: logic_rules.md 'Operations Logic' → mismatched granularity makes cost optimization guesswork
294
- - Verification criteria: PASS if tracking matches billing (per-token if billed per-token) with feature/user attribution. FAIL if coarser
295
-
296
- - **CQ-O02** [P1] Does logging capture sufficient information to reproduce any LLM interaction?
297
- - Inference path: logic_rules.md 'Operations Logic' → logging must capture: full prompt (input), full response (output), model version, latency, token count, and tool calls; insufficient logging makes debugging impossible
298
- - Verification criteria: PASS if every LLM API call is logged with all six required fields. FAIL if any field is missing from production logs
299
-
300
- - **CQ-O03** [P1] Is a quality drift baseline established?
301
- - Inference path: logic_rules.md 'Operations Logic' → baseline from golden set required for drift detection; dependency_rules.md 'Feedback Loops' → Loop 2
302
- - Verification criteria: PASS if baseline exists and drift detection runs at regular intervals. FAIL if no baseline or no monitoring
303
-
304
- - **CQ-O04** [P2] Are feedback loops actionable (collection → analysis → system change → verification)?
305
- - Inference path: logic_rules.md 'Operations Logic' → feedback without improvement path is waste
306
- - Verification criteria: PASS if each feedback type has documented processing pipeline. FAIL if feedback collected without action path
307
-
- - **CQ-O05** [P2] Is incident response defined for LLM-specific failure modes?
- - Inference path: logic_rules.md 'Operations Logic' → five failure types: outage, quality degradation, cost spike, safety false positives, injection
- - Verification criteria: PASS if playbooks exist for all five types. FAIL if LLM-specific failures not addressed
-
- - **CQ-O06** [P2] Are deployment strategies appropriate for model updates?
- - Inference path: domain_scope.md 'Production Operations' → canary/blue-green; dependency_rules.md 'External Dependency Management' → canary rollout
- - Verification criteria: PASS if graduated deployment limits blast radius. FAIL if all-at-once without rollout
-
- - **CQ-O07** [P2] Is system-level caching implemented where appropriate?
- - Inference path: domain_scope.md 'Production Operations' → response caching, KV cache management
- - Verification criteria: PASS if caching evaluated and implemented with invalidation rules. FAIL if no caching considered for repeated patterns
-
- - **CQ-O08** [P3] Are monitoring dashboards covering LLM-specific metrics visible to operators?
- - Inference path: domain_scope.md 'Production Operations' → monitoring; logic_rules.md 'Operations Logic' → logging feeds dashboards
- - Verification criteria: PASS if dashboards show latency, error rate, throughput, cost. FAIL if no operator visibility
-
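The CQ-O02 criterion above (every LLM API call logged with all six required fields) can be sketched as a simple record check. This is a minimal illustration, not part of the package: the field names in `REQUIRED_FIELDS` and the helper names `missing_fields` and `check_cq_o02` are hypothetical, chosen only to mirror the six items listed in the inference path.

```python
from typing import Any, Iterable

# Hypothetical field names mirroring the six items CQ-O02 lists:
# full prompt, full response, model version, latency, token count, tool calls.
REQUIRED_FIELDS = (
    "prompt",
    "response",
    "model_version",
    "latency_ms",
    "token_count",
    "tool_calls",
)


def missing_fields(record: dict[str, Any]) -> list[str]:
    """Return the required fields that are absent from the record or set to None.

    An empty list of tool calls counts as present: the call simply used no tools.
    """
    return [name for name in REQUIRED_FIELDS if record.get(name) is None]


def check_cq_o02(records: Iterable[dict[str, Any]]) -> str:
    """Return "PASS" only if every log record carries all six fields.

    Note: an empty log passes vacuously here; a real check would probably
    treat "no records at all" as a separate finding.
    """
    return "PASS" if all(not missing_fields(r) for r in records) else "FAIL"
```

Running such a check over a sample of production log records would give a quick signal for CQ-O02, although it says nothing about whether the logged values are complete enough to replay the interaction.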
- ---
-
- ## 8. Data & Model Adaptation (CQ-D)
-
- Verifies that fine-tuning decisions are justified, quality-controlled, and evaluated against baselines.
-
- - **CQ-D01** [P1] Is fine-tuning justified over prompting?
- - Inference path: logic_rules.md 'Data & Model Adaptation Logic' → fine-tuning justified only when prompting alone cannot achieve required quality, latency, or cost targets; fine-tuning introduces maintenance burden
- - Verification criteria: PASS if documented analysis shows prompting was attempted first and was insufficient, with specific metrics. FAIL if fine-tuning pursued without evidence that prompting is insufficient
- - Scope: Applies when fine-tuning is used. N/A if system uses only prompting
-
- - **CQ-D02** [P1] Is training data quality assessed before training?
- - Inference path: logic_rules.md 'Data & Model Adaptation Logic' → data quality upper-bounds model quality
- - Verification criteria: PASS if accuracy/consistency/completeness/representativeness assessed before training. FAIL if training starts without assessment
-
- - **CQ-D03** [P1] Are adapted models evaluated against base models on the same set?
- - Inference path: logic_rules.md 'Data & Model Adaptation Logic' → comparison required to prove improvement
- - Verification criteria: PASS if comparison results for the base and adapted models exist on the same golden set. FAIL if the adapted model is measured in isolation
-
- - **CQ-D04** [P2] For parameter-efficient fine-tuning (LoRA, QLoRA), is the base model pinned?
- - Inference path: logic_rules.md 'Data & Model Adaptation Logic' → base model update invalidates adapters
- - Verification criteria: PASS if base model version recorded/pinned with invalidation process. FAIL if unpinned
-
- - **CQ-D05** [P2] Is training data separated from evaluation data?
- - Inference path: logic_rules.md 'Evaluation Logic' → contamination inflates scores
- - Verification criteria: PASS if separate storage with deduplication controls. FAIL if shared without controls
-
- - **CQ-D06** [P2] Does fine-tuning improvement exceed fine-tuning cost?
- - Inference path: logic_rules.md 'Data & Model Adaptation Logic'; logic_rules.md 'Constraint Conflict Checking' → Cost vs. Quality
- - Verification criteria: PASS if cost-benefit analysis documents quality gain vs. costs. FAIL if no analysis
-
- - **CQ-D07** [P3] Is there a retraining plan for base model updates?
- - Inference path: dependency_rules.md 'External Dependency Management' → deprecation requires migration; logic_rules.md 'Data & Model Adaptation Logic' → adapters invalidated by updates
- - Verification criteria: PASS if documented retraining procedure covers detection, compatibility check, retraining, re-evaluation. FAIL if no plan
-
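The deduplication control that CQ-D05 asks for could, as one possibility, be approximated by fingerprinting normalized text and looking for evaluation examples that also occur in the training data. The sketch below is an assumption-laden illustration: `fingerprint` and `contaminated_examples` are hypothetical names, and exact matching after whitespace and case normalization is a deliberate simplification that will miss paraphrases and near-duplicates.

```python
import hashlib
from typing import Iterable


def fingerprint(text: str) -> str:
    """Hash the text after lowercasing and collapsing runs of whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def contaminated_examples(train: Iterable[str], eval_set: Iterable[str]) -> list[str]:
    """Return the evaluation examples whose fingerprint also appears in training data."""
    train_prints = {fingerprint(t) for t in train}
    return [e for e in eval_set if fingerprint(e) in train_prints]


# Usage: any non-empty result means the evaluation set overlaps the training set.
overlap = contaminated_examples(
    train=["What is  RAG?", "Explain KV caching."],
    eval_set=["what is rag?", "Define LoRA."],
)
```

A non-empty `overlap` would suggest that evaluation scores are inflated by contamination, which is the risk the inference path for CQ-D05 names.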
- ---
-
- ## 9. Cross-Cutting (CQ-X)
-
- Verifies concerns spanning multiple sub-areas: version management, reproducibility, and inter-area consistency.
-
- - **CQ-X01** [P1] Are prompt versions, tool schema versions, agent configuration versions, and model version references tracked?
- - Inference path: domain_scope.md 'Cross-Cutting Concern: Development Lifecycle' → version management spans prompt versions, tool schema versions, agent configuration versions, model version tracking; logic_rules.md 'Prompt Design Logic' → prompt change = code change, must be versioned
- - Verification criteria: PASS if all four artifact types (prompt templates, tool schemas, agent configurations, model version references) are version-controlled. FAIL if any of these artifacts is unversioned
-
- - **CQ-X02** [P1] Can past system behavior be reproduced given a specific version of all components?
- - Inference path: logic_rules.md 'Operations Logic' → logging must capture sufficient info to reproduce interactions; 'Prompt Design Logic' → unversioned prompts = unreproducible behavior; dependency_rules.md 'External Dependency Management' → model version must be pinned
- - Verification criteria: PASS if, given a timestamp, all system components (model version, prompt version, tool schemas, retrieval config) can be identified and behavior reconstructed. FAIL if any component's historical state is unrecoverable
-
- - **CQ-X03** [P2] When cross-area constraints conflict, is the conflict documented and resolved?
- - Inference path: logic_rules.md 'Constraint Conflict Checking' → four conflict types; unresolved = unpredictable behavior
- - Verification criteria: PASS if conflicts documented with constraint pair, resolution, trade-off rationale. FAIL if undocumented
-
- - **CQ-X04** [P2] Do inter-area dependencies follow documented direction rules?
- - Inference path: dependency_rules.md 'Inter-Area Direction Rules'; 'Reverse direction prohibition'
- - Verification criteria: PASS if data flow matches documented directions. FAIL if reverse dependencies exist
-
- - **CQ-X05** [P2] Are feedback loops between areas declared with termination conditions?
- - Inference path: dependency_rules.md 'Feedback Loops' → Loops 1-3, each with termination condition
- - Verification criteria: PASS if all cycles documented with termination. FAIL if any loop lacks termination
-
- - **CQ-X06** [P3] Do the five golden relationships (cross-component validations) hold?
- - Inference path: structure_spec.md 'Golden Relationships' → Model↔Prompt, Retrieval↔Generation, Tool↔Instructions, Eval↔Safety, Cost↔Model
- - Verification criteria: PASS if all five validations pass. FAIL if any violated
-
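The first half of the CQ-X02 criterion (given a timestamp, identify the version of every component that was live) can be sketched as a lookup over per-component version histories. Everything named here is an assumption for illustration: the `History` shape, the component keys, and the helpers `version_at`, `snapshot_at`, and `check_cq_x02` are hypothetical. Identifying versions is necessary but not sufficient for CQ-X02, since the criterion also requires that behavior can be reconstructed.

```python
from bisect import bisect_right
from datetime import datetime
from typing import Optional

# Each history is a list of (effective_from, version) pairs sorted by time.
History = list[tuple[datetime, str]]


def version_at(history: History, when: datetime) -> Optional[str]:
    """Return the version in effect at `when`, or None if none was recorded yet."""
    times = [effective_from for effective_from, _ in history]
    idx = bisect_right(times, when)  # number of entries effective at or before `when`
    return history[idx - 1][1] if idx else None


def snapshot_at(histories: dict[str, History], when: datetime) -> dict[str, Optional[str]]:
    """Map each component name to the version that was live at `when`."""
    return {name: version_at(history, when) for name, history in histories.items()}


def check_cq_x02(histories: dict[str, History], when: datetime) -> str:
    """Return "FAIL" if any component's historical state cannot be identified."""
    snapshot = snapshot_at(histories, when)
    return "PASS" if all(v is not None for v in snapshot.values()) else "FAIL"
```

In practice the histories would cover the components the criterion names (model version, prompt version, tool schemas, retrieval config), for example keyed as `"model"`, `"prompt"`, `"tool_schemas"`, and `"retrieval_config"`.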
- ---
-
- ## 10. Edge Cases (CQ-XE)
-
- Tests boundary conditions and failure modes not covered by standard verification. Verifies system resilience.
-
- - **CQ-XE01** [P2] What happens when a model is deprecated mid-production?
- - Inference path: dependency_rules.md 'External Dependency Management' → Model API Dependencies → deprecation migration plan
- - Verification criteria: PASS if deprecation plan covers detection, impact analysis, replacement evaluation, migration timeline. FAIL if no plan
-
- - **CQ-XE02** [P2] What happens when retrieval returns no relevant results?
- - Inference path: logic_rules.md 'Retrieval Logic' → independent evaluation; structure_spec.md 'RAG Pipeline Structure' → relevance scores; logic_rules.md 'Constraint Conflict Checking' → Latency vs. Completeness
- - Verification criteria: PASS if system handles gracefully (fallback, disclaimer, rephrase suggestion). FAIL if generates as if relevant context existed
-
- - **CQ-XE03** [P2] What happens when safety guardrails block a valid response?
- - Inference path: logic_rules.md 'Safety Logic' → defense without UX degradation; 'Constraint Conflict Checking' → Safety vs. Functionality
- - Verification criteria: PASS if false positives produce informative messages, are logged, and have adjustment process. FAIL if silent failure or no review process
-
- - **CQ-XE04** [P2] What happens when an MCP server becomes unavailable during an agent task?
- - Inference path: dependency_rules.md 'External Dependency Management' → MCP Server Dependencies → fallback behavior; logic_rules.md 'Agentic Systems Logic' → state survives failures
- - Verification criteria: PASS if fallback defined (skip, retry, halt with message). FAIL if crash, hang, or undefined behavior
-
- - **CQ-XE05** [P2] What happens when the token budget is exceeded by user input alone?
- - Inference path: logic_rules.md 'Prompt Design Logic' → token budget constraint
- - Verification criteria: PASS if oversized input detected with defined action (truncation notice, rejection). FAIL if silent truncation or unhandled error
-
- - **CQ-XE06** [P3] What happens when a model provider changes pricing mid-contract?
- - Inference path: dependency_rules.md 'External Dependency Management' → pricing changes; structure_spec.md 'Golden Relationships' → Cost budget ↔ Model selection
- - Verification criteria: PASS if cost monitoring detects anomalies and alerts before overrun. FAIL if not detected until billing cycle
-
- - **CQ-XE07** [P3] What happens when multiple agents produce contradictory outputs?
- - Inference path: logic_rules.md 'Agentic Systems Logic' → termination conditions; structure_spec.md 'Agent Architecture Structure' → Multi-Agent Execution profile
- - Verification criteria: PASS if conflict resolution mechanism exists (orchestrator, voting, confidence). FAIL if no mechanism
-
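The CQ-XE05 criterion (oversized input detected with a defined action, never silent truncation) might look like the following in code. This is only a sketch under stated assumptions: whitespace splitting stands in for a real tokenizer, and `BudgetDecision`, `apply_input_budget`, and the `allow_truncation` flag are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class BudgetDecision:
    action: str  # "accept", "truncate", or "reject"
    text: str    # the input to forward to the model (empty on rejection)
    notice: str  # user-facing message; empty when the input was accepted as-is


def apply_input_budget(
    user_input: str, input_budget: int, allow_truncation: bool = True
) -> BudgetDecision:
    """Enforce a token budget on user input and always report what was done.

    Assumption: tokens are approximated by whitespace-separated words, so
    truncated text is re-joined with single spaces.
    """
    tokens = user_input.split()
    if len(tokens) <= input_budget:
        return BudgetDecision("accept", user_input, "")
    if allow_truncation:
        kept = " ".join(tokens[:input_budget])
        notice = f"Input truncated from {len(tokens)} to {input_budget} tokens."
        return BudgetDecision("truncate", kept, notice)
    notice = f"Input of {len(tokens)} tokens exceeds the {input_budget}-token budget."
    return BudgetDecision("reject", "", notice)
```

The point of returning a `notice` in both the truncation and rejection branches is that neither outcome is silent, which is what separates PASS from FAIL in the criterion.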
- ---
-
- ## Related Documents
- - domain_scope.md — the upper-level definition of the 8 sub-areas and cross-cutting concerns these questions cover
- - logic_rules.md — inference logic for all 10 question categories (CQ-M through CQ-XE)
- - structure_spec.md — structural specifications for CQ-R (knowledge, RAG), CQ-A (agent architecture), CQ-E (evaluation), CQ-S (safety), CQ-X (golden relationships, thresholds)
- - dependency_rules.md — inter-area dependencies for CQ-X (direction rules, feedback loops), CQ-XE (external dependency failures), CQ-D (adaptation dependencies)
- - concepts.md — definitions of terms used throughout these competency questions