onto-mcp 0.3.0 → 0.3.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.onto/authority/core-lexicon.yaml +12 -0
- package/.onto/domains/software-engineering/competency_qs.md +192 -63
- package/.onto/domains/software-engineering/concepts.md +67 -5
- package/.onto/domains/software-engineering/conciseness_rules.md +22 -2
- package/.onto/domains/software-engineering/dependency_rules.md +78 -8
- package/.onto/domains/software-engineering/domain_scope.md +181 -150
- package/.onto/domains/software-engineering/extension_cases.md +318 -542
- package/.onto/domains/software-engineering/logic_rules.md +75 -3
- package/.onto/domains/software-engineering/problem_framing_profile.md +29 -2
- package/.onto/domains/software-engineering/prompt_interface.md +122 -0
- package/.onto/domains/software-engineering/structure_spec.md +53 -4
- package/.onto/principles/llm-native-development-guideline.md +20 -0
- package/.onto/principles/productization-charter.md +6 -0
- package/.onto/processes/evolve/material-kind-adapter-contract.md +6 -0
- package/.onto/processes/reconstruct/reconstruct-boundary-contract.md +468 -81
- package/.onto/processes/reconstruct/reconstruct-execution-ux-contract.md +177 -0
- package/.onto/processes/reconstruct/source-profile-contract.md +39 -6
- package/.onto/processes/reconstruct/top-level-concept-discovery-contract.md +387 -0
- package/.onto/processes/review/binding-contract.md +8 -0
- package/.onto/processes/review/lens-registry.md +16 -0
- package/.onto/processes/review/pre-dispatch-contracts.md +34 -13
- package/.onto/processes/review/productized-live-path.md +3 -1
- package/.onto/processes/shared/pipeline-execution-ledger-contract.md +185 -0
- package/.onto/processes/shared/target-material-kind-contract.md +24 -2
- package/.onto/roles/axiology.md +7 -2
- package/AGENTS.md +4 -2
- package/README.md +52 -29
- package/dist/core-api/reconstruct-api.js +92 -5
- package/dist/core-api/review-api.js +1744 -371
- package/dist/core-runtime/cli/mock-review-unit-executor.js +17 -0
- package/dist/core-runtime/cli/render-review-final-output.js +9 -0
- package/dist/core-runtime/cli/review-invoke.js +387 -55
- package/dist/core-runtime/cli/run-review-prompt-execution.js +361 -90
- package/dist/core-runtime/path-boundary.js +58 -0
- package/dist/core-runtime/pipeline-execution-ledger.js +100 -0
- package/dist/core-runtime/reconstruct/artifact-types.js +33 -1
- package/dist/core-runtime/reconstruct/materialize-preparation.js +54 -4
- package/dist/core-runtime/reconstruct/pipeline-execution-ledger.js +342 -0
- package/dist/core-runtime/reconstruct/post-seed-validation.js +630 -0
- package/dist/core-runtime/reconstruct/record.js +105 -1
- package/dist/core-runtime/reconstruct/run.js +1594 -38
- package/dist/core-runtime/reconstruct/seed-candidate-validation.js +29 -0
- package/dist/core-runtime/review/continuation-plan.js +160 -0
- package/dist/core-runtime/review/execution-plan-boundary.js +123 -0
- package/dist/core-runtime/review/materializers.js +8 -3
- package/dist/core-runtime/review/pipeline-execution-ledger.js +250 -0
- package/dist/core-runtime/review/review-artifact-utils.js +15 -2
- package/dist/core-runtime/review/review-invocation-runner.js +604 -0
- package/dist/core-runtime/target-material-kind.js +43 -5
- package/dist/mcp/server.js +289 -59
- package/dist/mcp/tool-schemas.js +28 -2
- package/package.json +4 -2
- package/.onto/domains/llm-native-development/competency_qs.md +0 -430
- package/.onto/domains/llm-native-development/concepts.md +0 -242
- package/.onto/domains/llm-native-development/conciseness_rules.md +0 -163
- package/.onto/domains/llm-native-development/dependency_rules.md +0 -216
- package/.onto/domains/llm-native-development/domain_scope.md +0 -197
- package/.onto/domains/llm-native-development/extension_cases.md +0 -474
- package/.onto/domains/llm-native-development/logic_rules.md +0 -123
- package/.onto/domains/llm-native-development/prompt_interface.md +0 -49
- package/.onto/domains/llm-native-development/structure_spec.md +0 -245
|
@@ -1,474 +0,0 @@
|
|
|
1
|
-
---
|
|
2
|
-
version: 2
|
|
3
|
-
last_updated: "2026-03-30"
|
|
4
|
-
source: manual
|
|
5
|
-
status: established
|
|
6
|
-
---
|
|
7
|
-
|
|
8
|
-
# LLM-Native Development Domain — Extension Scenarios
|
|
9
|
-
|
|
10
|
-
The evolution agent simulates each scenario to verify whether the existing structure for LLM-powered systems breaks under real-world changes.
|
|
11
|
-
|
|
12
|
-
**Area key**: 1=Model Integration, 2=Prompt & Context Design, 3=Retrieval & Knowledge Systems, 4=Agentic Systems, 5=Evaluation & Testing, 6=Safety & Alignment, 7=Production Operations, 8=Data & Model Adaptation
|
|
13
|
-
|
|
14
|
-
---
|
|
15
|
-
|
|
16
|
-
## Case 1: New Model Generation
|
|
17
|
-
|
|
18
|
-
### Situation
|
|
19
|
-
|
|
20
|
-
A new foundation model (GPT-5, Claude 4, Gemini 2) introduces expanded capabilities — larger context, new modalities, improved reasoning, new API features. The system must evaluate, integrate, and potentially restructure without disrupting production.
|
|
21
|
-
|
|
22
|
-
### Case Study: Claude 3 → Sonnet 4 Migration
|
|
23
|
-
|
|
24
|
-
Each Anthropic model release changed instruction-following behavior, tool calling format, and system prompt handling. Teams pinned to `claude-3-sonnet-20240229` faced migration requiring: re-evaluation of all prompt templates, restructured tool schemas, and full evaluation suite re-runs. Teams using `-latest` endpoints experienced silent production behavior changes — output format shifts, changed refusal patterns, altered verbosity.
|
|
25
|
-
|
|
26
|
-
### Impact Analysis
|
|
27
|
-
|
|
28
|
-
| Area | Impact Level | Key Concern |
|
|
29
|
-
|---|---|---|
|
|
30
|
-
| 1 (Model) | **Primary** | Selection criteria, API integration, version pinning, routing logic |
|
|
31
|
-
| 2 (Prompt) | **High** | System prompt restructuring, token budget shift, structured output |
|
|
32
|
-
| 5 (Evaluation) | **High** | Full suite re-run, benchmark recalibration |
|
|
33
|
-
| 7 (Operations) | **High** | Canary deployment, cost/latency changes, monitoring thresholds |
|
|
34
|
-
| 3, 4, 6, 8 | Low–Medium | Chunk size (3), tool formats (4), refusal patterns (6), fine-tune transfer (8) |
|
|
35
|
-
|
|
36
|
-
### Verification Checklist
|
|
37
|
-
|
|
38
|
-
- [ ] Model version pinned → dependency_rules.md §Model API Dependencies
|
|
39
|
-
- [ ] Migration plan: affected components → evaluate replacement → run eval → canary deploy
|
|
40
|
-
- [ ] All prompt templates tested against new model → domain_scope.md Area 2
|
|
41
|
-
- [ ] Evaluation suite run; results compared to baseline → domain_scope.md Area 5
|
|
42
|
-
- [ ] Cost/latency impact analyzed → domain_scope.md Area 7
|
|
43
|
-
- [ ] Rollback plan defined → dependency_rules.md §Model API Dependencies
|
|
44
|
-
- [ ] Safety guardrails verified against new behavior → domain_scope.md Area 6
|
|
45
|
-
|
|
46
|
-
### Affected Files
|
|
47
|
-
|
|
48
|
-
| File | Impact | Section |
|
|
49
|
-
|---|---|---|
|
|
50
|
-
| dependency_rules.md | Modify | §Model API Dependencies |
|
|
51
|
-
| domain_scope.md | Verify | §Technology Assumptions |
|
|
52
|
-
| structure_spec.md | Verify | Token budget rules |
|
|
53
|
-
|
|
54
|
-
---
|
|
55
|
-
|
|
56
|
-
## Case 2: New Retrieval Paradigm
|
|
57
|
-
|
|
58
|
-
### Situation
|
|
59
|
-
|
|
60
|
-
A new retrieval architecture (graph RAG, multi-modal RAG, hybrid search) changes how external knowledge is stored, indexed, and retrieved for LLM consumption.
|
|
61
|
-
|
|
62
|
-
### Case Study: LlamaIndex Knowledge Graph Integration
|
|
63
|
-
|
|
64
|
-
LlamaIndex added knowledge graph-based retrieval alongside vector search. A healthcare company migrating found: chunking strategy became irrelevant for structured data (entities replaced chunks), embedding model selection became secondary to entity extraction quality, and retrieval evaluation changed from "relevance@k" to "path accuracy." The hybrid approach required a routing layer deciding which backend to query based on query type.
|
|
65
|
-
|
|
66
|
-
### Impact Analysis
|
|
67
|
-
|
|
68
|
-
| Area | Impact Level | Key Concern |
|
|
69
|
-
|---|---|---|
|
|
70
|
-
| 3 (Retrieval) | **Primary** | Chunking, indexing, retrieval algorithms, storage backend all restructured |
|
|
71
|
-
| 5 (Evaluation) | **High** | Retrieval metrics change; new golden sets needed |
|
|
72
|
-
| 2 (Prompt) | **Medium** | Retrieved content format changes (triples vs. chunks) |
|
|
73
|
-
| 7 (Operations) | **Medium** | New infrastructure (graph DB); different latency/cost profile |
|
|
74
|
-
| 1, 4, 6, 8 | Low | Model interface unchanged (1), tool interface changes (4) |
|
|
75
|
-
|
|
76
|
-
### Verification Checklist
|
|
77
|
-
|
|
78
|
-
- [ ] Handoff point with Area 2 preserved: retrieval results = Area 3 boundary → domain_scope.md §Handoff point
|
|
79
|
-
- [ ] Embedding migration plan if changing backend → dependency_rules.md §Embedding Model Dependencies
|
|
80
|
-
- [ ] Retrieval evaluation metrics updated → domain_scope.md Area 5
|
|
81
|
-
- [ ] Hybrid search strategy defined → domain_scope.md Area 3
|
|
82
|
-
- [ ] LLM-favored structure patterns adapted → domain_scope.md Area 3
|
|
83
|
-
|
|
84
|
-
### Affected Files
|
|
85
|
-
|
|
86
|
-
| File | Impact | Section |
|
|
87
|
-
|---|---|---|
|
|
88
|
-
| domain_scope.md | Verify | Area 3, §Handoff point with Area 2 |
|
|
89
|
-
| dependency_rules.md | Modify | §Embedding Model Dependencies, §Runtime Data Flow |
|
|
90
|
-
| concepts.md | Modify | New terms: knowledge graph retrieval, graph RAG |
|
|
91
|
-
|
|
92
|
-
---
|
|
93
|
-
|
|
94
|
-
## Case 3: New Agent Protocol
|
|
95
|
-
|
|
96
|
-
### Situation
|
|
97
|
-
|
|
98
|
-
A new standardized protocol for agent-tool interaction (MCP successor, A2A maturation) replaces or supplements existing integration patterns.
|
|
99
|
-
|
|
100
|
-
### Case Study: MCP — 0 to 97M npm Downloads in 1 Year
|
|
101
|
-
|
|
102
|
-
MCP standardized how agents discover and invoke tools via JSON-RPC. Before MCP, each framework had proprietary tool patterns. Migration required: tool schema restructuring to JSON Schema-based definitions, stateful server connections replacing stateless calls, connection lifecycle management, and a bridge layer for legacy APIs during transition. A2A emerged as complementary standard for inter-agent communication.
|
|
103
|
-
|
|
104
|
-
### Impact Analysis
|
|
105
|
-
|
|
106
|
-
| Area | Impact Level | Key Concern |
|
|
107
|
-
|---|---|---|
|
|
108
|
-
| 4 (Agentic) | **Primary** | Tool integration restructured; server/client design; discovery mechanisms |
|
|
109
|
-
| 6 (Safety) | **High** | New injection surface; tool permission models change |
|
|
110
|
-
| 7 (Operations) | **Medium** | Server availability monitoring; connection lifecycle; new failure modes |
|
|
111
|
-
| 1, 2, 3, 5, 8 | Low | Tool-calling format (1), tool descriptions (2), MCP resources (3) |
|
|
112
|
-
|
|
113
|
-
### Verification Checklist
|
|
114
|
-
|
|
115
|
-
- [ ] MCP server dependencies documented with fallback behavior → dependency_rules.md §MCP Server Dependencies
|
|
116
|
-
- [ ] Schema validation at connection time → dependency_rules.md §MCP Server Dependencies
|
|
117
|
-
- [ ] Tool permission model defined → domain_scope.md Area 6
|
|
118
|
-
- [ ] Backward compatibility plan for existing integrations documented
|
|
119
|
-
- [ ] Protocol-specific monitoring added → domain_scope.md Area 7
|
|
120
|
-
|
|
121
|
-
### Affected Files
|
|
122
|
-
|
|
123
|
-
| File | Impact | Section |
|
|
124
|
-
|---|---|---|
|
|
125
|
-
| dependency_rules.md | Modify | §MCP Server Dependencies, §Framework Dependencies |
|
|
126
|
-
| domain_scope.md | Verify | Area 4, §Reference Standards/Frameworks |
|
|
127
|
-
| concepts.md | Modify | Protocol terms: MCP, A2A, ACI |
|
|
128
|
-
|
|
129
|
-
---
|
|
130
|
-
|
|
131
|
-
## Case 4: AI Regulation Compliance
|
|
132
|
-
|
|
133
|
-
### Situation
|
|
134
|
-
|
|
135
|
-
New AI regulations (EU AI Act, etc.) impose compliance requirements: audit logging, risk classification, transparency, human oversight.
|
|
136
|
-
|
|
137
|
-
### Case Study: EU AI Act Risk Classification
|
|
138
|
-
|
|
139
|
-
An LLM recruitment screening tool was classified "high-risk" (Annex III), requiring: risk management with continuous monitoring, technical documentation including training data provenance, human oversight (no autonomous hiring decisions), automatic decision logging, and documented evaluation methodology. Systems designed with safety and observability from the start required minimal changes; those treating these as afterthoughts required extensive refactoring.
|
|
140
|
-
|
|
141
|
-
### Impact Analysis
|
|
142
|
-
|
|
143
|
-
| Area | Impact Level | Key Concern |
|
|
144
|
-
|---|---|---|
|
|
145
|
-
| 6 (Safety) | **Primary** | Risk classification, content policy, PII handling, regulatory compliance |
|
|
146
|
-
| 7 (Operations) | **High** | Mandatory audit logging, compliance monitoring, incident reporting |
|
|
147
|
-
| 5 (Evaluation) | **High** | Auditable evaluation methodology required |
|
|
148
|
-
| 4 (Agentic) | **High** | Human oversight constrains agent autonomy |
|
|
149
|
-
| 1, 2, 3, 8 | Low–Medium | Provider certifications (1), disclosure instructions (2), provenance (3, 8) |
|
|
150
|
-
|
|
151
|
-
### Verification Checklist
|
|
152
|
-
|
|
153
|
-
- [ ] System risk classification determined → domain_scope.md Area 6
|
|
154
|
-
- [ ] Audit logging covers all LLM interactions → domain_scope.md Area 7
|
|
155
|
-
- [ ] Human oversight mechanism for high-risk decisions → domain_scope.md Area 6
|
|
156
|
-
- [ ] Evaluation methodology documented and reproducible → domain_scope.md Area 5
|
|
157
|
-
- [ ] PII detection and redaction operational → domain_scope.md Area 6
|
|
158
|
-
- [ ] All 8 areas audited against regulatory requirements
|
|
159
|
-
|
|
160
|
-
### Affected Files
|
|
161
|
-
|
|
162
|
-
| File | Impact | Section |
|
|
163
|
-
|---|---|---|
|
|
164
|
-
| domain_scope.md | Verify | Area 6, Area 7 |
|
|
165
|
-
| dependency_rules.md | Modify | §Truth Source Hierarchy (regulatory priority) |
|
|
166
|
-
| concepts.md | Modify | Regulatory terms: risk classification, audit trail |
|
|
167
|
-
|
|
168
|
-
---
|
|
169
|
-
|
|
170
|
-
## Case 5: Context Window Expansion
|
|
171
|
-
|
|
172
|
-
### Situation
|
|
173
|
-
|
|
174
|
-
Context windows increase by 10x (200K → 2M → 10M tokens), changing fundamental trade-offs. Techniques designed for limited context (chunking, summarization, multi-step retrieval) may become unnecessary.
|
|
175
|
-
|
|
176
|
-
### Case Study: Gemini 1M Context — Is RAG Still Necessary?
|
|
177
|
-
|
|
178
|
-
Gemini 1.5 Pro launched with 1M tokens (February 2024). Results showed: "lost in the middle" effects made retrieval still valuable for precision; 1M tokens at $3.50/M made naive "stuff everything" 100x more expensive than targeted retrieval; latency scaled with context length. RAG shifted from "mandatory" to "optimization for cost, latency, and precision" — conditional rather than required.
|
|
179
|
-
|
|
180
|
-
### Impact Analysis
|
|
181
|
-
|
|
182
|
-
| Area | Impact Level | Key Concern |
|
|
183
|
-
|---|---|---|
|
|
184
|
-
| 2 (Prompt) | **Primary** | Token budget restructured; context rot amplified; utilization strategy changed |
|
|
185
|
-
| 3 (Retrieval) | **High** | Chunking relevance decreases; RAG becomes conditional |
|
|
186
|
-
| 7 (Operations) | **High** | Cost per request increases dramatically; caching for larger payloads |
|
|
187
|
-
| 1 (Model) | **Medium** | Context window as selection differentiator |
|
|
188
|
-
| 4, 5, 6, 8 | Low–Medium | Longer agent history (4), long-context eval (5), larger injection surface (6) |
|
|
189
|
-
|
|
190
|
-
### Verification Checklist
|
|
191
|
-
|
|
192
|
-
- [ ] Technology assumption updated → domain_scope.md §Technology Assumptions
|
|
193
|
-
- [ ] Token budget strategy revised → structure_spec.md
|
|
194
|
-
- [ ] RAG necessity re-evaluated per use case → domain_scope.md Area 3
|
|
195
|
-
- [ ] Cost analysis: long context vs. retrieval → domain_scope.md Area 7
|
|
196
|
-
- [ ] "Lost in the middle" effects tested → domain_scope.md Area 5
|
|
197
|
-
- [ ] Quantitative criteria re-evaluated → structure_spec.md
|
|
198
|
-
|
|
199
|
-
### Affected Files
|
|
200
|
-
|
|
201
|
-
| File | Impact | Section |
|
|
202
|
-
|---|---|---|
|
|
203
|
-
| domain_scope.md | Modify | §Technology Assumptions, Area 2, Area 3 |
|
|
204
|
-
| structure_spec.md | Modify | Token budget rules, quantitative criteria |
|
|
205
|
-
| dependency_rules.md | Verify | §Design-Time Constraint Flow |
|
|
206
|
-
|
|
207
|
-
---
|
|
208
|
-
|
|
209
|
-
## Case 6: Model-as-a-Service Disruption
|
|
210
|
-
|
|
211
|
-
### Situation
|
|
212
|
-
|
|
213
|
-
Deployment shifts from cloud API to on-premise/edge with open-weight models, changing infrastructure, cost structures, and operational responsibilities.
|
|
214
|
-
|
|
215
|
-
### Case Study: Llama Open-Weight Local Deployment
|
|
216
|
-
|
|
217
|
-
A financial services company migrated from OpenAI API to self-hosted Llama 3 70B for data residency. Infrastructure responsibility shifted entirely (GPU procurement, model serving via vLLM/TGI, auto-scaling). Cost structure inverted: no per-token costs but $30K+/month GPU cluster. Safety guardrails (content filtering, PII redaction) had to be built from scratch — API providers include these by default. The domain_scope.md assumption "Models accessed via API" was invalidated.
|
|
218
|
-
|
|
219
|
-
### Impact Analysis
|
|
220
|
-
|
|
221
|
-
| Area | Impact Level | Key Concern |
|
|
222
|
-
|---|---|---|
|
|
223
|
-
| 1 (Model) | **Primary** | Model serving replaces API; quantization; batching; version self-management |
|
|
224
|
-
| 7 (Operations) | **Primary** | GPU infrastructure, serving, scaling, hardware monitoring — new domain |
|
|
225
|
-
| 6 (Safety) | **High** | All guardrails built in-house; no provider-side filtering |
|
|
226
|
-
| 8 (Adaptation) | **High** | Direct model access enables fine-tuning |
|
|
227
|
-
| 2, 3, 4, 5 | Low–Medium | Chat template differences (2), self-hosted embeddings (3), tool support gaps (4) |
|
|
228
|
-
|
|
229
|
-
### Verification Checklist
|
|
230
|
-
|
|
231
|
-
- [ ] Technology assumption invalidation documented → domain_scope.md §Technology Assumptions
|
|
232
|
-
- [ ] Area 7 expanded for infrastructure management → domain_scope.md Area 7
|
|
233
|
-
- [ ] Model serving infrastructure documented → domain_scope.md Area 1
|
|
234
|
-
- [ ] Safety guardrails built for self-hosted model → domain_scope.md Area 6
|
|
235
|
-
- [ ] Cost model updated: fixed vs. per-token → domain_scope.md Area 7
|
|
236
|
-
|
|
237
|
-
### Affected Files
|
|
238
|
-
|
|
239
|
-
| File | Impact | Section |
|
|
240
|
-
|---|---|---|
|
|
241
|
-
| domain_scope.md | Modify | §Technology Assumptions, Area 1, Area 7 |
|
|
242
|
-
| dependency_rules.md | Modify | §Model API Dependencies (extend to self-hosted) |
|
|
243
|
-
| concepts.md | Modify | New terms: model serving, quantization, vLLM |
|
|
244
|
-
|
|
245
|
-
---
|
|
246
|
-
|
|
247
|
-
## Case 7: Multi-Modal Native
|
|
248
|
-
|
|
249
|
-
### Situation
|
|
250
|
-
|
|
251
|
-
Vision, audio, and other modalities become first-class inputs alongside text, rather than supplementary capabilities.
|
|
252
|
-
|
|
253
|
-
### Case Study: GPT-4V and Claude 3 Vision
|
|
254
|
-
|
|
255
|
-
A document processing company migrated from OCR-then-LLM to direct image-to-text: the OCR pipeline was eliminated, prompts referenced visual elements requiring spatial reasoning instructions, knowledge bases needed multi-modal embeddings (CLIP) incompatible with text-only embeddings, evaluation required annotated image golden sets, and token costs increased (1,000-2,000 tokens per page image vs. 200-400 for text). The domain_scope.md assumption "Text is primary modality" was challenged.
|
|
256
|
-
|
|
257
|
-
### Impact Analysis
|
|
258
|
-
|
|
259
|
-
| Area | Impact Level | Key Concern |
|
|
260
|
-
|---|---|---|
|
|
261
|
-
| 2 (Prompt) | **Primary** | Multi-modal input design; spatial/temporal references; token budget for images |
|
|
262
|
-
| 3 (Retrieval) | **High** | Multi-modal embeddings; cross-modal search; non-text storage |
|
|
263
|
-
| 1 (Model) | **High** | Modality support as selection criterion; multi-modal routing |
|
|
264
|
-
| 5 (Evaluation) | **High** | Visual understanding metrics; multi-modal golden sets |
|
|
265
|
-
| 4, 6, 7, 8 | Low–Medium | Vision for agents (4), image injection (6), larger payloads (7) |
|
|
266
|
-
|
|
267
|
-
### Verification Checklist
|
|
268
|
-
|
|
269
|
-
- [ ] Technology assumption updated → domain_scope.md §Technology Assumptions
|
|
270
|
-
- [ ] Multi-modal input design patterns documented → domain_scope.md Area 2
|
|
271
|
-
- [ ] Multi-modal embedding model selected → dependency_rules.md §Embedding Model Dependencies
|
|
272
|
-
- [ ] Cross-modal search assessed → domain_scope.md Area 3
|
|
273
|
-
- [ ] Multi-modal evaluation metrics defined → domain_scope.md Area 5
|
|
274
|
-
- [ ] Token budget includes image/audio costs → structure_spec.md
|
|
275
|
-
|
|
276
|
-
### Affected Files
|
|
277
|
-
|
|
278
|
-
| File | Impact | Section |
|
|
279
|
-
|---|---|---|
|
|
280
|
-
| domain_scope.md | Modify | §Technology Assumptions, Area 2, Area 3 |
|
|
281
|
-
| dependency_rules.md | Modify | §Embedding Model Dependencies |
|
|
282
|
-
| concepts.md | Modify | New terms: multi-modal embedding, cross-modal search |
|
|
283
|
-
|
|
284
|
-
---
|
|
285
|
-
|
|
286
|
-
## Case 8: Fine-Tuning Democratization
|
|
287
|
-
|
|
288
|
-
### Situation
|
|
289
|
-
|
|
290
|
-
Fine-tuning costs drop 100x through parameter-efficient techniques, making model adaptation a default practice. Area 8 shifts from optional to mandatory.
|
|
291
|
-
|
|
292
|
-
### Case Study: LoRA/QLoRA Cost Reduction
|
|
293
|
-
|
|
294
|
-
A startup fine-tuned Llama 2 7B with QLoRA on 10K customer support conversations in 4 hours on a single A100 ($12). The fine-tuned model outperformed GPT-4 with elaborate prompts (92% vs. 87% accuracy). Prompt engineering effort dropped 70% — domain knowledge was in weights, not context. New capabilities needed: dataset curation, experiment tracking, model evaluation before/after, adapter serving. The domain_scope.md assumption "Fine-tuning is optional" was invalidated.
|
|
295
|
-
|
|
296
|
-
### Impact Analysis
|
|
297
|
-
|
|
298
|
-
| Area | Impact Level | Key Concern |
|
|
299
|
-
|---|---|---|
|
|
300
|
-
| 8 (Adaptation) | **Primary** | Becomes mandatory: dataset engineering, training, experiment tracking |
|
|
301
|
-
| 5 (Evaluation) | **High** | Pre/post comparison; contamination checks; regression testing |
|
|
302
|
-
| 6 (Safety) | **High** | Fine-tuning can remove guardrails; alignment verification post-training |
|
|
303
|
-
| 1 (Model) | **High** | Adapter serving; model routing includes fine-tuned variants |
|
|
304
|
-
| 2, 3, 4, 7 | Low–Medium | Shorter prompts (2), reduced retrieval needs (3), A/B testing (7) |
|
|
305
|
-
|
|
306
|
-
### Verification Checklist
|
|
307
|
-
|
|
308
|
-
- [ ] Technology assumption invalidation documented → domain_scope.md §Technology Assumptions
|
|
309
|
-
- [ ] Area 8 elevated to required → domain_scope.md Area 8
|
|
310
|
-
- [ ] Dataset engineering pipeline documented → domain_scope.md Area 8
|
|
311
|
-
- [ ] Pre/post fine-tuning evaluation → domain_scope.md Area 5
|
|
312
|
-
- [ ] Safety alignment verified after training → domain_scope.md Area 6
|
|
313
|
-
- [ ] Adapter serving infrastructure → domain_scope.md Area 1, Area 7
|
|
314
|
-
|
|
315
|
-
### Affected Files
|
|
316
|
-
|
|
317
|
-
| File | Impact | Section |
|
|
318
|
-
|---|---|---|
|
|
319
|
-
| domain_scope.md | Modify | §Technology Assumptions, Area 8 |
|
|
320
|
-
| dependency_rules.md | Modify | §Feedback Loops (Loop 1 becomes critical path) |
|
|
321
|
-
| concepts.md | Modify | New terms: LoRA, QLoRA, adapter, PEFT |
|
|
322
|
-
|
|
323
|
-
---
|
|
324
|
-
|
|
325
|
-
## Case 9: Agent Autonomy Expansion
|
|
326
|
-
|
|
327
|
-
### Situation
|
|
328
|
-
|
|
329
|
-
Agents gain direct computer interface capabilities — browser control, desktop applications, file systems — extending autonomy beyond structured tool calls to open-ended environment interaction.
|
|
330
|
-
|
|
331
|
-
### Case Study: Claude Computer Use and Browser Use (78K Stars)
|
|
332
|
-
|
|
333
|
-
A QA company adopted browser use agents: architecture shifted from predefined tool calls to observe-act loops (screenshot → reason → mouse/keyboard action). Safety boundaries became critical — agents could navigate anywhere, fill any form. Novel failure modes: loops, unintended navigation, interaction with pop-ups. Evaluation shifted from step-based ("correct tool call?") to outcome-based ("goal achieved through any path?").
|
|
334
|
-
|
|
335
|
-
### Impact Analysis
|
|
336
|
-
|
|
337
|
-
| Area | Impact Level | Key Concern |
|
|
338
|
-
|---|---|---|
|
|
339
|
-
| 4 (Agentic) | **Primary** | Observe-act loops; browser/computer control; environment interaction |
|
|
340
|
-
| 6 (Safety) | **Primary** | Sandboxing; action permissions; escalation for irreversible actions |
|
|
341
|
-
| 5 (Evaluation) | **High** | Outcome-based evaluation; trajectory analysis; reliability metrics |
|
|
342
|
-
| 7 (Operations) | **High** | Sandbox infrastructure; session recording; resource limits |
|
|
343
|
-
| 1, 2, 3, 8 | Low–Medium | Vision required (1), boundary instructions (2) |
|
|
344
|
-
|
|
345
|
-
### Verification Checklist
|
|
346
|
-
|
|
347
|
-
- [ ] Observe-act pattern documented → domain_scope.md Area 4
|
|
348
|
-
- [ ] Sandbox environment defined and enforced → domain_scope.md Area 6
|
|
349
|
-
- [ ] Action permission model defined → domain_scope.md Area 6
|
|
350
|
-
- [ ] Escalation policy for irreversible actions → domain_scope.md Area 6
|
|
351
|
-
- [ ] Outcome-based evaluation metrics → domain_scope.md Area 5
|
|
352
|
-
- [ ] Stuck/loop detection and termination → domain_scope.md Area 4
|
|
353
|
-
- [ ] Resource limits (session duration, allowed domains) → domain_scope.md Area 7
|
|
354
|
-
|
|
355
|
-
### Affected Files
|
|
356
|
-
|
|
357
|
-
| File | Impact | Section |
|
|
358
|
-
|---|---|---|
|
|
359
|
-
| domain_scope.md | Verify | Area 4, Area 6 |
|
|
360
|
-
| dependency_rules.md | Verify | §Design-Time Constraint Flow (safety → agent) |
|
|
361
|
-
| concepts.md | Modify | New terms: observe-act loop, action permission model |
|
|
362
|
-
|
|
363
|
-
---
|
|
364
|
-
|
|
365
|
-
## Case 10: Evaluation Paradigm Shift
|
|
366
|
-
|
|
367
|
-
### Situation
|
|
368
|
-
|
|
369
|
-
AI-as-judge (LLM-based evaluation) becomes the primary evaluation method, replacing human annotation and rule-based metrics at scale.
|
|
370
|
-
|
|
371
|
-
### Case Study: LLM-as-Judge Replacing Human Annotation
|
|
372
|
-
|
|
373
|
-
A customer support company migrated from human evaluation (500/week at $15 each = $7,500) to LLM-as-judge (50,000/week at $0.03 each = $1,500). Consistency improved, but systematic biases emerged (preference for longer responses, over-rating confident incorrect answers). Meta-evaluation became necessary — "evaluating the evaluator." Self-evaluation bias required separate judge and generation models. The methodology shifted from "collecting human judgments" to "calibrating automated judges."
|
|
374
|
-
|
|
375
|
-
### Impact Analysis
|
|
376
|
-
|
|
377
|
-
| Area | Impact Level | Key Concern |
|
|
378
|
-
|---|---|---|
|
|
379
|
-
| 5 (Evaluation) | **Primary** | AI-judge pipelines, meta-evaluation, judge calibration, bias detection |
|
|
380
|
-
| 1 (Model) | **Medium** | Separate judge model needed; judge selection criteria differ |
|
|
381
|
-
| 2 (Prompt) | **Medium** | Judge prompt design: rubrics, criteria, output format |
|
|
382
|
-
| 7 (Operations) | **Medium** | Evaluation pipeline costs; judge model drift monitoring |
|
|
383
|
-
| 3, 4, 6, 8 | Low–Medium | Factual verification (3), safety eval (6), training labels (8) |
|
|
384
|
-
|
|
385
|
-
### Verification Checklist
|
|
386
|
-
|
|
387
|
-
- [ ] AI-as-judge methodology documented → domain_scope.md Area 5
|
|
388
|
-
- [ ] Judge model ≠ generation model enforced → domain_scope.md Area 5
|
|
389
|
-
- [ ] Meta-evaluation: judge validated against human golden set → domain_scope.md Area 5
|
|
390
|
-
- [ ] Judge bias detection and calibration procedure defined
|
|
391
|
-
- [ ] Human evaluation preserved for calibration maintenance → domain_scope.md Area 5
|
|
392
|
-
|
|
393
|
-
### Affected Files
|
|
394
|
-
|
|
395
|
-
| File | Impact | Section |
|
|
396
|
-
|---|---|---|
|
|
397
|
-
| domain_scope.md | Verify | Area 5 |
|
|
398
|
-
| dependency_rules.md | Verify | §Feedback Loops (Loop 1), §Metric Ownership Rules |
|
|
399
|
-
| concepts.md | Modify | New terms: AI-as-judge, meta-evaluation, judge calibration |
|
|
400
|
-
|
|
401
|
-
---
|
|
402
|
-
|
|
403
|
-
## Case 11: LLM-Favored Structure Adoption
|
|
404
|
-
|
|
405
|
-
### Situation
|
|
406
|
-
|
|
407
|
-
Organization-wide adoption of LLM-optimized knowledge patterns — File=Concept, YAML frontmatter, system maps, llms.txt — as the documentation standard.
|
|
408
|
-
|
|
409
|
-
### Case Study: Onto's Structure and llms.txt Adoption
|
|
410
|
-
|
|
411
|
-
A 200-person engineering org migrated from Confluence wiki (3,000+ nested, duplicated pages) to File=Concept structure (1,200 files with clear ownership). YAML frontmatter enabled automated knowledge graph generation. LLMs consumed documentation 40% more accurately (RAG retrieval precision). Migration cost: 6 engineer-months. The llms.txt standard, CLAUDE.md, .cursorrules, and AGENTS.md demonstrate vendor-specific implementations of the same principle: structured, machine-readable knowledge surfaces.
|
|
412
|
-
|
|
413
|
-
### Impact Analysis
|
|
414
|
-
|
|
415
|
-
| Area | Impact Level | Key Concern |
|
|
416
|
-
|---|---|---|
|
|
417
|
-
| 3 (Retrieval) | **Primary** | File=Concept; metadata-driven retrieval; navigation paths; system maps |
|
|
418
|
-
| 2 (Prompt) | **Medium** | Frontmatter provides context; reduces prompt engineering |
|
|
419
|
-
| 4 (Agentic) | **Medium** | CLAUDE.md/AGENTS.md define per-directory agent behavior |
|
|
420
|
-
| 5 (Evaluation) | **Medium** | Structure quality measurable: orphan rate, duplication, completeness |
|
|
421
|
-
| 1, 6, 7, 8 | Low | Indirect benefits across all areas |
|
|
422
|
-
|
|
423
|
-
### Verification Checklist
|
|
424
|
-
|
|
425
|
-
- [ ] File=Concept: one concept per file → domain_scope.md Area 3
|
|
426
|
-
- [ ] YAML frontmatter on all knowledge files → structure_spec.md
|
|
427
|
-
- [ ] System map provides navigation → domain_scope.md Area 3
|
|
428
|
-
- [ ] Duplication eliminated → dependency_rules.md §Duplication Prevention Rules
|
|
429
|
-
- [ ] Navigation paths acyclic → dependency_rules.md §Acyclicity
|
|
430
|
-
- [ ] Retrieval precision measured before/after → domain_scope.md Area 5
|
|
431
|
-
- [ ] Structure validation automated (lint, orphan detection) → domain_scope.md Area 7
|
|
432
|
-
|
|
433
|
-
### Affected Files
|
|
434
|
-
|
|
435
|
-
| File | Impact | Section |
|
|
436
|
-
|---|---|---|
|
|
437
|
-
| structure_spec.md | Modify | Frontmatter spec, File=Concept rules |
|
|
438
|
-
| dependency_rules.md | Verify | §Duplication Prevention, §Acyclicity, §Referential Integrity |
|
|
439
|
-
| domain_scope.md | Verify | Area 3, §Reference Standards/Frameworks |
|
|
440
|
-
|
|
441
|
-
---
|
|
442
|
-
|
|
443
|
-
## Scenario Interconnections
|
|
444
|
-
|
|
445
|
-
| Scenario | Triggers / Interacts With | Reason |
|
|
446
|
-
|---|---|---|
|
|
447
|
-
| Case 1 (New Model) | → Case 5 (Context), Case 7 (Multi-Modal), Case 10 (Eval) | New models expand context, add modalities, change eval baselines |
|
|
448
|
-
| Case 2 (Retrieval) | → Case 5 (Context), Case 11 (Structure) | Larger context changes RAG calculus; structure affects retrieval |
|
|
449
|
-
| Case 3 (Protocol) | → Case 9 (Autonomy), Case 4 (Regulation) | Protocols enable new capabilities; standardization aids compliance |
|
|
450
|
-
| Case 4 (Regulation) | → Case 9 (Autonomy), Case 10 (Eval) | Regulation constrains agents; mandates auditable evaluation |
|
|
451
|
-
| Case 5 (Context) | → Case 2 (Retrieval), Case 11 (Structure) | Reduced RAG necessity; less aggressive structuring needed |
|
|
452
|
-
| Case 6 (MaaS Disruption) | → Case 8 (Fine-Tuning), Case 4 (Regulation) | Local deployment enables fine-tuning; data residency drives on-prem |
|
|
453
|
-
| Case 7 (Multi-Modal) | → Case 2 (Retrieval), Case 9 (Autonomy) | Multi-modal retrieval is distinct paradigm; vision enables computer use |
|
|
454
|
-
| Case 8 (Fine-Tuning) | → Case 6 (MaaS), Case 10 (Eval) | Fine-tuning incentivizes local; more variants need scalable eval |
|
|
455
|
-
| Case 9 (Autonomy) | → Case 4 (Regulation), Case 3 (Protocol) | Autonomous agents face stricter scrutiny; need safety protocols |
|
|
456
|
-
| Case 10 (Eval Shift) | → Case 8 (Fine-Tuning) | AI judges generate training labels at scale |
|
|
457
|
-
| Case 11 (Structure) | → Case 2 (Retrieval) | Better structure improves retrieval across all paradigms |
|
|
458
|
-
|
|
459
|
-
### Cascade Patterns
|
|
460
|
-
|
|
461
|
-
**Cascade 1 — Model → Context → Retrieval → Structure**: New model with 10M context (Case 1) → context expansion (Case 5) → reduced RAG necessity (Case 2) → knowledge structure adaptation (Case 11). All 4 cases require coordinated response.
|
|
462
|
-
|
|
463
|
-
**Cascade 2 — Regulation → Safety → Agent → Protocol**: AI regulation (Case 4) → tightened safety requirements → constrained agent autonomy (Case 9) → standardized safety protocols (Case 3). Compliance becomes root cause for multiple architectural changes.
|
|
464
|
-
|
|
465
|
-
**Cascade 3 — Local Deployment → Fine-Tuning → Eval Scaling**: Self-hosted models (Case 6) → accessible fine-tuning (Case 8) → many model variants needing evaluation → AI-as-judge for throughput (Case 10). Evaluation pipeline becomes bottleneck.
|
|
466
|
-
|
|
467
|
-
---
|
|
468
|
-
|
|
469
|
-
## Related Documents
|
|
470
|
-
- domain_scope.md — Sub-area definitions (Areas 1-8), membership criteria, technology assumptions, handoff points
|
|
471
|
-
- dependency_rules.md — Inter-area dependencies, model/MCP/embedding dependencies, feedback loops
|
|
472
|
-
- structure_spec.md — Token budget rules, quantitative criteria, frontmatter specification
|
|
473
|
-
- competency_qs.md — Questions each area must answer; updated when scenarios reveal gaps
|
|
474
|
-
- concepts.md — Term definitions; updated when new terms emerge from scenarios
|
|
@@ -1,123 +0,0 @@
|
|
|
1
|
-
---
|
|
2
|
-
version: 2
|
|
3
|
-
last_updated: "2026-03-30"
|
|
4
|
-
source: manual
|
|
5
|
-
status: established
|
|
6
|
-
---
|
|
7
|
-
|
|
8
|
-
# LLM-Native Development Domain — Logic Rules
|
|
9
|
-
|
|
10
|
-
Classification axis: **system construction concern** — rules classified by the design concern of the LLM-powered system they govern.
|
|
11
|
-
|
|
12
|
-
## File–Concept Correspondence Logic (→Area 3)
|
|
13
|
-
|
|
14
|
-
- One concept must be defined in exactly one concept file. If the same concept is defined in two files, it is a source of truth conflict
|
|
15
|
-
- A filename must represent the concept the file defines. A mismatch between filename and content is a self-describing violation
|
|
16
|
-
- If a concept file exists, it must define at least one concept. Empty files are not allowed
|
|
17
|
-
- Meta files (INDEX.md, ARCHITECTURE.md, etc.) do not define concepts and serve the role of specifying structural relationships among other files. They are excluded from the "File = Concept" correspondence
|
|
18
|
-
|
|
19
|
-
## Frontmatter Conformance Logic (→Area 3)
|
|
20
|
-
|
|
21
|
-
- The specification for frontmatter (required fields, format) has structure_spec.md as its source of truth. When other files reference frontmatter, they follow structure_spec.md's definitions
|
|
22
|
-
- All references declared in frontmatter (depends_on, related_to, parent) must point to actually existing files
|
|
23
|
-
- The type field value in frontmatter must belong to the type list defined in the system. An undefined type is unclassifiable
|
|
24
|
-
- If a relationship declared in frontmatter is bidirectional, the reverse relationship must also exist in the target file's frontmatter. In this case, follow the cycle exception clause in dependency_rules.md
|
|
25
|
-
|
|
26
|
-
## Hierarchy Logic (→Area 3)
|
|
27
|
-
|
|
28
|
-
- Directory hierarchy represents containment relationships of concepts. A parent directory must be a category encompassing the concepts in its child directories
|
|
29
|
-
- Files within the same directory must be concepts at the same abstraction level. If abstraction levels differ, separate into directories
|
|
30
|
-
- As directory depth increases, concepts become more specific. Depth reversal (child is more abstract) is a structural contradiction
|
|
31
|
-
|
|
32
|
-
## Navigation Path Logic (→Area 3)
|
|
33
|
-
|
|
34
|
-
- A reference chain must exist from the entry point to any concept file. A file that cannot be reached is an isolated document
|
|
35
|
-
- If a reference chain has cycles, the LLM may fall into infinite traversal. Circular references must be explicitly marked in frontmatter (see cycle exception clause in dependency_rules.md)
|
|
36
|
-
- When A references B and B references C, if understanding C's content from A requires traversing through B, a direct A→C reference should be added (navigation shortcut)
|
|
37
|
-
|
|
38
|
-
## Change Propagation Logic (→Area 3)
|
|
39
|
-
|
|
40
|
-
- When concept X's definition changes, all files referencing X are verification targets
|
|
41
|
-
- When adding a new concept file, if existing concept files require modification beyond the meta file (INDEX.md), it is a signal of high coupling. Homonym registration (concepts.md) is exceptionally allowed
|
|
42
|
-
- When deleting a file, all references to that file must be removed everywhere (referential integrity)
|
|
43
|
-
|
|
44
|
-
## Model Integration Logic (→Area 1)
|
|
45
|
-
|
|
46
|
-
- If model routing is used (directing requests to different models based on task type, cost, or complexity), fallback paths must be defined for every route. A route without a fallback is a single point of failure — if the target model is unavailable, the request fails silently or errors out
|
|
47
|
-
- Model version pinning: if the model version is not pinned to a specific release (e.g., `gpt-4-0613` rather than `gpt-4`), behavior changes are non-deterministic. Provider-side model updates can alter output quality, format, or latency without notice
|
|
48
|
-
- If multiple models are used for different tasks within the same system, capability requirements must be documented per model. Documentation must include: task type, required capabilities (e.g., tool use, structured output, long context), minimum acceptable quality level, and cost constraints
|
|
49
|
-
- Model selection must be justified against the task's requirements. Using a more capable (and expensive) model where a simpler model suffices is a cost inefficiency. Using a less capable model where quality is critical is a quality risk
|
|
50
|
-
- If inference optimization is applied (quantization, batching, KV cache), the optimization's impact on output quality must be measured, not assumed. Quantization in particular can degrade performance on tasks requiring precise reasoning
|
|
51
|
-
|
|
52
|
-
## Prompt Design Logic (→Area 2)
|
|
53
|
-
|
|
54
|
-
- Instruction hierarchy: system prompt > tool definitions > user prompt. When instructions at different levels conflict, higher-priority instructions take precedence. If the system prompt says "always respond in JSON" and the user prompt says "respond in plain text," the system prompt wins
|
|
55
|
-
- If structured output is required (JSON, XML, function call schemas), the output schema must be validated before consumption. Trusting model output to conform to a schema without validation is unsafe — models can produce malformed output even when instructed to follow a schema
|
|
56
|
-
- Token budget constraint: system prompt tokens + retrieved context tokens + user input tokens ≤ context window size - expected output tokens. Violating this constraint causes truncation (silent data loss) or API errors. The budget must be calculated before the API call, not after
|
|
57
|
-
- Context rot: as conversation length grows, the earliest context loses influence on model behavior. Critical information (safety rules, output format requirements, identity constraints) must be re-injected at regular intervals or placed in positions of high attention (beginning and end of the context window)
|
|
58
|
-
- Prompt templates must be versioned. A prompt change that alters system behavior is functionally equivalent to a code change. Unversioned prompts make it impossible to reproduce past behavior or diagnose regressions
|
|
59
|
-
- Few-shot examples must be representative of the target distribution. Biased examples cause biased outputs. If the task distribution changes, the examples must be updated
|
|
60
|
-
|
|
61
|
-
## Retrieval Logic (→Area 3)
|
|
62
|
-
|
|
63
|
-
- RAG pipeline correctness: retrieval relevance must be verified independently of generation quality. If the final output is incorrect, the cause may be (a) irrelevant retrieval, (b) correct retrieval but incorrect generation, or (c) both. Diagnosing the root cause requires evaluating retrieval and generation separately
|
|
64
|
-
- Chunking strategy must preserve semantic boundaries. Splitting mid-sentence or mid-paragraph destroys context that the embedding model needs to produce meaningful vectors. Chunk boundaries should align with document structure (paragraphs, sections, logical units)
|
|
65
|
-
- If hybrid search is used (combining keyword-based search and semantic/vector search), the combination strategy must be explicit. Options include: score fusion (weighted sum of scores), rank fusion (reciprocal rank fusion), or cascade (keyword filter → semantic rerank). An unspecified combination strategy produces non-reproducible results
|
|
66
|
-
- Embedding model selection must match the content domain. General-purpose embeddings may underperform on domain-specific content (medical, legal, code). If retrieval quality is insufficient, evaluate domain-specific or fine-tuned embedding models before increasing chunk count
|
|
67
|
-
- Retrieved context must carry provenance metadata (source document, chunk ID, relevance score). Without provenance, it is impossible to debug retrieval failures or trace hallucinations back to their source
|
|
68
|
-
|
|
69
|
-
## Agentic Systems Logic (→Area 4)
|
|
70
|
-
|
|
71
|
-
- Tool definitions must be non-overlapping in their described capabilities. If two tools can accomplish the same task, the agent's choice between them is non-deterministic. Either remove the overlap or add explicit routing instructions that disambiguate when to use each tool
|
|
72
|
-
- Multi-agent communication must have termination conditions. Without explicit termination (maximum iterations, convergence criteria, timeout), agent loops can run indefinitely, consuming resources and producing no useful output. Every loop must have a defined exit condition
|
|
73
|
-
- Agent state must be explicitly managed. Relying on conversation history as implicit state is fragile — context windows are finite, and conversation history can be truncated, summarized, or lost. Critical state (task progress, accumulated results, decision points) must be stored in structured form (scratchpads, databases, state objects)
|
|
74
|
-
- MCP tool schemas must be self-describing. The agent must be able to determine a tool's applicability from the schema alone (name, description, parameter descriptions) without external documentation. A tool schema that requires reading separate documentation to understand is poorly designed
|
|
75
|
-
- Agent instructions must explicitly list which tools are available and when each should be used. If tools are available but not mentioned in instructions, the agent may discover them through schema inspection but cannot reliably determine appropriate usage context
|
|
76
|
-
- For long-running agent tasks (multi-session persistence), structured progress tracking must be maintained. The agent must be able to resume from the last checkpoint without re-executing completed steps. Progress format: completed steps, current step, remaining steps, accumulated artifacts
|
|
77
|
-
|
|
78
|
-
## Evaluation Logic (→Area 5)
|
|
79
|
-
|
|
80
|
-
- Evaluation criteria must be defined before system development begins, not retrofitted after (spec-first principle). Defining evaluation criteria after seeing system output introduces confirmation bias — criteria are unconsciously shaped to match existing behavior rather than desired behavior
|
|
81
|
-
- AI-as-judge evaluation must disclose: (a) the judge model identity and version, (b) the judge prompt used, and (c) known biases of the judge model (e.g., positional bias, verbosity bias, self-preference). Without disclosure, evaluation results are not reproducible or interpretable
|
|
82
|
-
- A/B testing for system output quality requires statistical significance thresholds defined in advance. Without predefined thresholds, there is a temptation to stop testing when results look favorable (peeking), which inflates false positive rates
|
|
83
|
-
- Golden set (reference test data) must be maintained separately from training/fine-tuning data. Contamination of the golden set with training data produces artificially inflated evaluation scores
|
|
84
|
-
- Evaluation pipelines must be automated and repeatable. Manual evaluation is acceptable for initial exploration but does not scale. If the system is in production, evaluation must run on a regular cadence without human intervention
|
|
85
|
-
|
|
86
|
-
## Safety Logic (→Area 6)
|
|
87
|
-
|
|
88
|
-
- Defense-in-depth: no single safety mechanism is sufficient. Minimum layers: input filtering (block malicious prompts before they reach the model) + output filtering (block harmful content before it reaches the user) + monitoring (detect patterns that individual filters miss). Removing any layer increases risk
|
|
89
|
-
- Prompt injection defense must not degrade normal user experience. A defense mechanism that blocks 5% of legitimate user requests to prevent 0.1% of injection attempts has an unacceptable false positive rate. Defense mechanisms must be evaluated for both efficacy (true positive rate) and cost (false positive rate)
|
|
90
|
-
- Content policy rules must be testable. Each rule must have at least one positive test case (content that should be blocked) and one negative test case (content that should pass). Untestable rules cannot be verified and may silently fail or over-block
|
|
91
|
-
- Safety constraints must be applied at the system level, not delegated to the model's built-in safety. Model-level safety is a useful baseline but is not configurable, not auditable, and can change without notice when the provider updates the model
|
|
92
|
-
- Red teaming must be performed before deployment and on a regular cadence after deployment. The threat landscape evolves — attack techniques that did not exist during initial red teaming may emerge later
|
|
93
|
-
|
|
94
|
-
## Operations Logic (→Area 7)
|
|
95
|
-
|
|
96
|
-
- Cost tracking granularity must match billing granularity. If the provider bills per token, the system must track per-token usage. If the system tracks only per-request, cost attribution to specific features or users is impossible, making cost optimization guesswork
|
|
97
|
-
- Quality drift detection requires a baseline. The baseline must be established during the evaluation phase (→Area 5) using the golden set. Without a baseline, there is no reference point to determine whether quality has degraded. Drift detection compares current output quality against the baseline at regular intervals
|
|
98
|
-
- Feedback loop: user feedback must be actionable. Feedback without a defined path to system improvement is waste. For each feedback type (thumbs up/down, free-text correction, escalation), there must be a defined process: collection → aggregation → analysis → system change → verification
|
|
99
|
-
- Logging must capture sufficient information to reproduce any LLM interaction: input (full prompt), output (full response), model version, latency, token count, and any tool calls. Insufficient logging makes debugging production issues impossible
|
|
100
|
-
- Incident response for LLM-specific failures must account for failure modes unique to LLM systems: model provider outage, sudden quality degradation (model update), cost spike (unexpected token usage), safety filter false positives (legitimate requests blocked), and prompt injection in production
|
|
101
|
-
|
|
102
|
-
## Data & Model Adaptation Logic (→Area 8)
|
|
103
|
-
|
|
104
|
-
- Fine-tuning decision: fine-tuning is justified only when prompting alone cannot achieve the required quality, latency, or cost targets. Fine-tuning introduces a maintenance burden (dataset management, retraining on model updates) that prompting avoids
|
|
105
|
-
- Training data quality has an upper bound on model quality. No amount of training can overcome systematically flawed data. Data quality assessment (accuracy, consistency, completeness, representativeness) must precede training
|
|
106
|
-
- If parameter-efficient fine-tuning is used (LoRA, QLoRA, adapters), the base model version must be pinned. A base model update invalidates all existing adapters — adapters trained on version N are not guaranteed to work correctly on version N+1
|
|
107
|
-
- Evaluation of fine-tuned models must compare against the base model on the same evaluation set. Without this comparison, there is no evidence that fine-tuning improved performance. Improvement must exceed the cost of fine-tuning to be justified
|
|
108
|
-
|
|
109
|
-
## Constraint Conflict Checking (Cross-Cutting)
|
|
110
|
-
|
|
111
|
-
When design constraints from different areas impose contradictory requirements, the conflict must be identified, documented, and resolved. Unresolved conflicts produce systems with unpredictable behavior.
|
|
112
|
-
|
|
113
|
-
- **Safety vs. Functionality**: When safety constraints conflict with functionality (e.g., output filtering blocks valid responses, input filtering rejects legitimate queries), safety takes precedence. However, the conflict must be documented with: (a) the specific safety rule, (b) the functionality it degrades, (c) the false positive rate, and (d) a plan to reduce the false positive rate without weakening safety
|
|
114
|
-
- **Cost vs. Quality**: When cost constraints conflict with quality (e.g., a cheaper model degrades output quality, reduced retrieval scope misses relevant context), the trade-off must be explicit in the system specification. The specification must state: which quality metrics are affected, by how much, and what the cost saving is. Implicit quality degradation (choosing the cheap option without measuring the impact) is prohibited
|
|
115
|
-
- **Latency vs. Completeness**: When latency requirements conflict with completeness (e.g., retrieval timeout cuts off relevant results, generation is truncated to meet response time targets), the system must degrade gracefully. Partial results must be clearly marked as partial, and the user must be informed that completeness was sacrificed for speed
|
|
116
|
-
- **Privacy vs. Personalization**: When privacy constraints conflict with personalization (e.g., PII redaction removes information needed for personalized responses), privacy takes precedence. The system must provide useful responses without requiring PII, or obtain explicit user consent for PII usage with clear data handling policies
|
|
117
|
-
|
|
118
|
-
## Related Documents
|
|
119
|
-
- concepts.md — Term definitions within this domain
|
|
120
|
-
- dependency_rules.md — Cycle exception clause, reference direction rules
|
|
121
|
-
- structure_spec.md — Source of truth for frontmatter specifications and structural rules
|
|
122
|
-
- domain_scope.md — Scope definition, sub-area membership criteria, and boundary tests
|
|
123
|
-
- competency_qs.md — Questions this domain's logic must be able to answer
|
|
@@ -1,49 +0,0 @@
|
|
|
1
|
-
---
|
|
2
|
-
version: 1
|
|
3
|
-
last_updated: "2026-03-29"
|
|
4
|
-
source: bundled-domain-baseline
|
|
5
|
-
status: established
|
|
6
|
-
---
|
|
7
|
-
|
|
8
|
-
# LLM-Native Development Domain — Prompt/Interface Design
|
|
9
|
-
|
|
10
|
-
This document defines the design criteria for structures that issue instructions to LLMs — system prompts, tool definitions, role definitions, and response formats.
|
|
11
|
-
While the existing 7 files cover "the structure of a system that describes knowledge," this file covers the separate concern dimension of "designing the knowledge consumption method."
|
|
12
|
-
|
|
13
|
-
## System Prompt Structure
|
|
14
|
-
|
|
15
|
-
- Prompts are structured in the order of role definition (who) → context (what) → task instructions (how) → constraints
|
|
16
|
-
- Role definitions are managed in individual files within the roles/ directory. They are not inlined into the prompt but read from files and included
|
|
17
|
-
- Context loading is either self-loading (agent reads files directly) or injection (team lead includes content). The choice depends on the agent's file reading capability
|
|
18
|
-
|
|
19
|
-
## Role Definition File Structure
|
|
20
|
-
|
|
21
|
-
Role definition files (roles/{agent-id}.md) include the following elements:
|
|
22
|
-
- **Specialization**: The concern this role verifies/performs
|
|
23
|
-
- **Role Description**: The core question this role answers
|
|
24
|
-
- **Core Questions**: 3~6 items. Define the judgment axes of the role
|
|
25
|
-
- **Domain Document Mapping**: Domain documents this role references (at least 1)
|
|
26
|
-
|
|
27
|
-
## Tool Definition (Tool Schema)
|
|
28
|
-
|
|
29
|
-
- Tool definitions follow JSON Schema format
|
|
30
|
-
- Tool names directly express the action (e.g., `search_files`, `read_document`)
|
|
31
|
-
- Tool descriptions must be specific enough for the LLM to judge tool selection
|
|
32
|
-
- Required parameters and optional parameters are clearly distinguished
|
|
33
|
-
|
|
34
|
-
## Response Format Constraints (Structured Output)
|
|
35
|
-
|
|
36
|
-
- When structured output is needed, specify the output format in the prompt
|
|
37
|
-
- Format specification methods: choose between examples or schemas (JSON Schema)
|
|
38
|
-
- When the format is complex, separate it into a file (templates/ directory) and reference from the prompt
|
|
39
|
-
|
|
40
|
-
## Context Window Utilization Strategy
|
|
41
|
-
|
|
42
|
-
- Prompt size is recommended to be 30% or less of the total context window. The remainder is allocated to work targets and responses
|
|
43
|
-
- For repeatedly executed prompts, separate static parts (role, rules) from dynamic parts (work target, previous results)
|
|
44
|
-
- When static parts do not change, consider caching or compression
|
|
45
|
-
|
|
46
|
-
## Related Documents
|
|
47
|
-
- concepts.md — Definitions of terms used in prompts
|
|
48
|
-
- structure_spec.md — Physical location rules for role files
|
|
49
|
-
- domain_scope.md — Higher-level definition of the "Consumption Interface" concern area
|