onto-mcp 0.3.0 → 0.3.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (61)
  1. package/.onto/authority/core-lexicon.yaml +12 -0
  2. package/.onto/domains/software-engineering/competency_qs.md +192 -63
  3. package/.onto/domains/software-engineering/concepts.md +67 -5
  4. package/.onto/domains/software-engineering/conciseness_rules.md +22 -2
  5. package/.onto/domains/software-engineering/dependency_rules.md +78 -8
  6. package/.onto/domains/software-engineering/domain_scope.md +181 -150
  7. package/.onto/domains/software-engineering/extension_cases.md +318 -542
  8. package/.onto/domains/software-engineering/logic_rules.md +75 -3
  9. package/.onto/domains/software-engineering/problem_framing_profile.md +29 -2
  10. package/.onto/domains/software-engineering/prompt_interface.md +122 -0
  11. package/.onto/domains/software-engineering/structure_spec.md +53 -4
  12. package/.onto/principles/llm-native-development-guideline.md +20 -0
  13. package/.onto/principles/productization-charter.md +6 -0
  14. package/.onto/processes/evolve/material-kind-adapter-contract.md +6 -0
  15. package/.onto/processes/reconstruct/reconstruct-boundary-contract.md +468 -81
  16. package/.onto/processes/reconstruct/reconstruct-execution-ux-contract.md +177 -0
  17. package/.onto/processes/reconstruct/source-profile-contract.md +39 -6
  18. package/.onto/processes/reconstruct/top-level-concept-discovery-contract.md +387 -0
  19. package/.onto/processes/review/binding-contract.md +8 -0
  20. package/.onto/processes/review/lens-registry.md +16 -0
  21. package/.onto/processes/review/pre-dispatch-contracts.md +34 -13
  22. package/.onto/processes/review/productized-live-path.md +3 -1
  23. package/.onto/processes/shared/pipeline-execution-ledger-contract.md +185 -0
  24. package/.onto/processes/shared/target-material-kind-contract.md +24 -2
  25. package/.onto/roles/axiology.md +7 -2
  26. package/AGENTS.md +4 -2
  27. package/README.md +52 -29
  28. package/dist/core-api/reconstruct-api.js +92 -5
  29. package/dist/core-api/review-api.js +1744 -371
  30. package/dist/core-runtime/cli/mock-review-unit-executor.js +17 -0
  31. package/dist/core-runtime/cli/render-review-final-output.js +9 -0
  32. package/dist/core-runtime/cli/review-invoke.js +387 -55
  33. package/dist/core-runtime/cli/run-review-prompt-execution.js +361 -90
  34. package/dist/core-runtime/path-boundary.js +58 -0
  35. package/dist/core-runtime/pipeline-execution-ledger.js +100 -0
  36. package/dist/core-runtime/reconstruct/artifact-types.js +33 -1
  37. package/dist/core-runtime/reconstruct/materialize-preparation.js +54 -4
  38. package/dist/core-runtime/reconstruct/pipeline-execution-ledger.js +342 -0
  39. package/dist/core-runtime/reconstruct/post-seed-validation.js +630 -0
  40. package/dist/core-runtime/reconstruct/record.js +105 -1
  41. package/dist/core-runtime/reconstruct/run.js +1594 -38
  42. package/dist/core-runtime/reconstruct/seed-candidate-validation.js +29 -0
  43. package/dist/core-runtime/review/continuation-plan.js +160 -0
  44. package/dist/core-runtime/review/execution-plan-boundary.js +123 -0
  45. package/dist/core-runtime/review/materializers.js +8 -3
  46. package/dist/core-runtime/review/pipeline-execution-ledger.js +250 -0
  47. package/dist/core-runtime/review/review-artifact-utils.js +15 -2
  48. package/dist/core-runtime/review/review-invocation-runner.js +604 -0
  49. package/dist/core-runtime/target-material-kind.js +43 -5
  50. package/dist/mcp/server.js +289 -59
  51. package/dist/mcp/tool-schemas.js +28 -2
  52. package/package.json +4 -2
  53. package/.onto/domains/llm-native-development/competency_qs.md +0 -430
  54. package/.onto/domains/llm-native-development/concepts.md +0 -242
  55. package/.onto/domains/llm-native-development/conciseness_rules.md +0 -163
  56. package/.onto/domains/llm-native-development/dependency_rules.md +0 -216
  57. package/.onto/domains/llm-native-development/domain_scope.md +0 -197
  58. package/.onto/domains/llm-native-development/extension_cases.md +0 -474
  59. package/.onto/domains/llm-native-development/logic_rules.md +0 -123
  60. package/.onto/domains/llm-native-development/prompt_interface.md +0 -49
  61. package/.onto/domains/llm-native-development/structure_spec.md +0 -245
@@ -1,474 +0,0 @@
1
- ---
2
- version: 2
3
- last_updated: "2026-03-30"
4
- source: manual
5
- status: established
6
- ---
7
-
8
- # LLM-Native Development Domain — Extension Scenarios
9
-
10
- The evolution agent simulates each scenario to verify whether the existing structure for LLM-powered systems breaks under real-world changes.
11
-
12
- **Area key**: 1=Model Integration, 2=Prompt & Context Design, 3=Retrieval & Knowledge Systems, 4=Agentic Systems, 5=Evaluation & Testing, 6=Safety & Alignment, 7=Production Operations, 8=Data & Model Adaptation
13
-
14
- ---
15
-
16
- ## Case 1: New Model Generation
17
-
18
- ### Situation
19
-
20
- A new foundation model (GPT-5, Claude 4, Gemini 2) introduces expanded capabilities — larger context, new modalities, improved reasoning, new API features. The system must evaluate, integrate, and potentially restructure without disrupting production.
21
-
22
- ### Case Study: Claude 3 → Sonnet 4 Migration
23
-
24
- Each Anthropic model release changed instruction-following behavior, tool calling format, and system prompt handling. Teams pinned to `claude-3-sonnet-20240229` faced migration requiring: re-evaluation of all prompt templates, restructured tool schemas, and full evaluation suite re-runs. Teams using `-latest` endpoints experienced silent production behavior changes — output format shifts, changed refusal patterns, altered verbosity.
25
-
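The pinned-versus-floating distinction in the case study above can be checked mechanically. The following is a minimal sketch, not part of the package: it assumes pinned model IDs end in an 8-digit date stamp (as `claude-3-sonnet-20240229` does) and that floating aliases end in `-latest`. The helper names, the config shape, and the `example-model-latest` ID are hypothetical.

```python
import re

# Assumption: a pinned model ID ends in an 8-digit date stamp (YYYYMMDD),
# as in `claude-3-sonnet-20240229`; floating aliases end in `-latest`.
PINNED_SUFFIX = re.compile(r"-\d{8}$")


def is_pinned(model_id: str) -> bool:
    """Return True when the ID ends in a date stamp rather than a floating alias."""
    return bool(PINNED_SUFFIX.search(model_id))


def unpinned_models(config: dict[str, str]) -> list[str]:
    """List the config keys whose model ID is not pinned (hypothetical config shape)."""
    return sorted(key for key, model_id in config.items() if not is_pinned(model_id))


config = {
    "summarizer": "claude-3-sonnet-20240229",  # pinned: changes only on explicit migration
    "classifier": "example-model-latest",      # hypothetical floating alias
}
print(unpinned_models(config))
```

A check like this could run in CI so that a floating alias never reaches production unnoticed; the date-stamp convention would need adjusting for providers that name versions differently.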
26
- ### Impact Analysis
27
-
28
- | Area | Impact Level | Key Concern |
29
- |---|---|---|
30
- | 1 (Model) | **Primary** | Selection criteria, API integration, version pinning, routing logic |
31
- | 2 (Prompt) | **High** | System prompt restructuring, token budget shift, structured output |
32
- | 5 (Evaluation) | **High** | Full suite re-run, benchmark recalibration |
33
- | 7 (Operations) | **High** | Canary deployment, cost/latency changes, monitoring thresholds |
34
- | 3, 4, 6, 8 | Low–Medium | Chunk size (3), tool formats (4), refusal patterns (6), fine-tune transfer (8) |
35
-
36
- ### Verification Checklist
37
-
38
- - [ ] Model version pinned → dependency_rules.md §Model API Dependencies
39
- - [ ] Migration plan: affected components → evaluate replacement → run eval → canary deploy
40
- - [ ] All prompt templates tested against new model → domain_scope.md Area 2
41
- - [ ] Evaluation suite run; results compared to baseline → domain_scope.md Area 5
42
- - [ ] Cost/latency impact analyzed → domain_scope.md Area 7
43
- - [ ] Rollback plan defined → dependency_rules.md §Model API Dependencies
44
- - [ ] Safety guardrails verified against new behavior → domain_scope.md Area 6
45
-
46
- ### Affected Files
47
-
48
- | File | Impact | Section |
49
- |---|---|---|
50
- | dependency_rules.md | Modify | §Model API Dependencies |
51
- | domain_scope.md | Verify | §Technology Assumptions |
52
- | structure_spec.md | Verify | Token budget rules |
53
-
54
- ---
55
-
56
- ## Case 2: New Retrieval Paradigm
57
-
58
- ### Situation
59
-
60
- A new retrieval architecture (graph RAG, multi-modal RAG, hybrid search) changes how external knowledge is stored, indexed, and retrieved for LLM consumption.
61
-
62
- ### Case Study: LlamaIndex Knowledge Graph Integration
63
-
64
- LlamaIndex added knowledge graph-based retrieval alongside vector search. A healthcare company migrating found: chunking strategy became irrelevant for structured data (entities replaced chunks), embedding model selection became secondary to entity extraction quality, and retrieval evaluation changed from "relevance@k" to "path accuracy." The hybrid approach required a routing layer deciding which backend to query based on query type.
65
-
66
- ### Impact Analysis
67
-
68
- | Area | Impact Level | Key Concern |
69
- |---|---|---|
70
- | 3 (Retrieval) | **Primary** | Chunking, indexing, retrieval algorithms, storage backend all restructured |
71
- | 5 (Evaluation) | **High** | Retrieval metrics change; new golden sets needed |
72
- | 2 (Prompt) | **Medium** | Retrieved content format changes (triples vs. chunks) |
73
- | 7 (Operations) | **Medium** | New infrastructure (graph DB); different latency/cost profile |
74
- | 1, 4, 6, 8 | Low | Model interface unchanged (1), tool interface changes (4) |
75
-
76
- ### Verification Checklist
77
-
78
- - [ ] Handoff point with Area 2 preserved: retrieval results = Area 3 boundary → domain_scope.md §Handoff point
79
- - [ ] Embedding migration plan if changing backend → dependency_rules.md §Embedding Model Dependencies
80
- - [ ] Retrieval evaluation metrics updated → domain_scope.md Area 5
81
- - [ ] Hybrid search strategy defined → domain_scope.md Area 3
82
- - [ ] LLM-favored structure patterns adapted → domain_scope.md Area 3
83
-
84
- ### Affected Files
85
-
86
- | File | Impact | Section |
87
- |---|---|---|
88
- | domain_scope.md | Verify | Area 3, §Handoff point with Area 2 |
89
- | dependency_rules.md | Modify | §Embedding Model Dependencies, §Runtime Data Flow |
90
- | concepts.md | Modify | New terms: knowledge graph retrieval, graph RAG |
91
-
92
- ---
93
-
94
- ## Case 3: New Agent Protocol
95
-
96
- ### Situation
97
-
98
- A new standardized protocol for agent-tool interaction (MCP successor, A2A maturation) replaces or supplements existing integration patterns.
99
-
100
- ### Case Study: MCP — 0 to 97M npm Downloads in 1 Year
101
-
102
- MCP standardized how agents discover and invoke tools via JSON-RPC. Before MCP, each framework had proprietary tool patterns. Migration required: tool schema restructuring to JSON Schema-based definitions, stateful server connections replacing stateless calls, connection lifecycle management, and a bridge layer for legacy APIs during transition. A2A emerged as a complementary standard for inter-agent communication.
103
-
104
- ### Impact Analysis
105
-
106
- | Area | Impact Level | Key Concern |
107
- |---|---|---|
108
- | 4 (Agentic) | **Primary** | Tool integration restructured; server/client design; discovery mechanisms |
109
- | 6 (Safety) | **High** | New injection surface; tool permission models change |
110
- | 7 (Operations) | **Medium** | Server availability monitoring; connection lifecycle; new failure modes |
111
- | 1, 2, 3, 5, 8 | Low | Tool-calling format (1), tool descriptions (2), MCP resources (3) |
112
-
113
- ### Verification Checklist
114
-
115
- - [ ] MCP server dependencies documented with fallback behavior → dependency_rules.md §MCP Server Dependencies
116
- - [ ] Schema validation at connection time → dependency_rules.md §MCP Server Dependencies
117
- - [ ] Tool permission model defined → domain_scope.md Area 6
118
- - [ ] Backward compatibility plan for existing integrations documented
119
- - [ ] Protocol-specific monitoring added → domain_scope.md Area 7
120
-
121
- ### Affected Files
122
-
123
- | File | Impact | Section |
124
- |---|---|---|
125
- | dependency_rules.md | Modify | §MCP Server Dependencies, §Framework Dependencies |
126
- | domain_scope.md | Verify | Area 4, §Reference Standards/Frameworks |
127
- | concepts.md | Modify | Protocol terms: MCP, A2A, ACI |
128
-
129
- ---
130
-
131
- ## Case 4: AI Regulation Compliance
132
-
133
- ### Situation
134
-
135
- New AI regulations (EU AI Act, etc.) impose compliance requirements: audit logging, risk classification, transparency, human oversight.
136
-
137
- ### Case Study: EU AI Act Risk Classification
138
-
139
- An LLM recruitment screening tool was classified "high-risk" (Annex III), requiring: risk management with continuous monitoring, technical documentation including training data provenance, human oversight (no autonomous hiring decisions), automatic decision logging, and documented evaluation methodology. Systems designed with safety and observability from the start required minimal changes; those treating these as afterthoughts required extensive refactoring.
140
-
141
- ### Impact Analysis
142
-
143
- | Area | Impact Level | Key Concern |
144
- |---|---|---|
145
- | 6 (Safety) | **Primary** | Risk classification, content policy, PII handling, regulatory compliance |
146
- | 7 (Operations) | **High** | Mandatory audit logging, compliance monitoring, incident reporting |
147
- | 5 (Evaluation) | **High** | Auditable evaluation methodology required |
148
- | 4 (Agentic) | **High** | Human oversight constrains agent autonomy |
149
- | 1, 2, 3, 8 | Low–Medium | Provider certifications (1), disclosure instructions (2), provenance (3, 8) |
150
-
151
- ### Verification Checklist
152
-
153
- - [ ] System risk classification determined → domain_scope.md Area 6
154
- - [ ] Audit logging covers all LLM interactions → domain_scope.md Area 7
155
- - [ ] Human oversight mechanism for high-risk decisions → domain_scope.md Area 6
156
- - [ ] Evaluation methodology documented and reproducible → domain_scope.md Area 5
157
- - [ ] PII detection and redaction operational → domain_scope.md Area 6
158
- - [ ] All 8 areas audited against regulatory requirements
159
-
160
- ### Affected Files
161
-
162
- | File | Impact | Section |
163
- |---|---|---|
164
- | domain_scope.md | Verify | Area 6, Area 7 |
165
- | dependency_rules.md | Modify | §Truth Source Hierarchy (regulatory priority) |
166
- | concepts.md | Modify | Regulatory terms: risk classification, audit trail |
167
-
168
- ---
169
-
170
- ## Case 5: Context Window Expansion
171
-
172
- ### Situation
173
-
174
- Context windows increase by 10x (200K → 2M → 10M tokens), changing fundamental trade-offs. Techniques designed for limited context (chunking, summarization, multi-step retrieval) may become unnecessary.
175
-
176
- ### Case Study: Gemini 1M Context — Is RAG Still Necessary?
177
-
178
- Gemini 1.5 Pro launched with 1M tokens (February 2024). Results showed: "lost in the middle" effects made retrieval still valuable for precision; 1M tokens at $3.50/M made naive "stuff everything" 100x more expensive than targeted retrieval; latency scaled with context length. RAG shifted from "mandatory" to "optimization for cost, latency, and precision" — conditional rather than required.
179
-
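The cost claim in the case study above can be reproduced with simple arithmetic. This sketch takes the $3.50 per million tokens and the 1M-token context from the text; the 10,000-token size of a targeted retrieval prompt is an assumption chosen for illustration, and it is what yields the roughly 100x ratio.

```python
def request_cost(input_tokens: int, usd_per_million_tokens: float) -> float:
    """Input-token cost of one request in USD (output tokens ignored)."""
    return input_tokens * usd_per_million_tokens / 1_000_000


PRICE_USD_PER_MILLION = 3.50        # figure quoted in the case study
FULL_CONTEXT_TOKENS = 1_000_000     # "stuff everything" into the 1M window
TARGETED_TOKENS = 10_000            # assumption: size of a retrieval-built prompt

full_context = request_cost(FULL_CONTEXT_TOKENS, PRICE_USD_PER_MILLION)
targeted = request_cost(TARGETED_TOKENS, PRICE_USD_PER_MILLION)
ratio = full_context / targeted

print(f"full context: ${full_context:.2f} per request")
print(f"targeted retrieval: ${targeted:.3f} per request")
print(f"ratio: about {ratio:.0f}x")
```

The ratio depends only on the token counts, so a different price per million tokens changes the absolute costs but not the multiple.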
180
- ### Impact Analysis
181
-
182
- | Area | Impact Level | Key Concern |
183
- |---|---|---|
184
- | 2 (Prompt) | **Primary** | Token budget restructured; context rot amplified; utilization strategy changed |
185
- | 3 (Retrieval) | **High** | Chunking relevance decreases; RAG becomes conditional |
186
- | 7 (Operations) | **High** | Cost per request increases dramatically; caching for larger payloads |
187
- | 1 (Model) | **Medium** | Context window as selection differentiator |
188
- | 4, 5, 6, 8 | Low–Medium | Longer agent history (4), long-context eval (5), larger injection surface (6) |
189
-
190
- ### Verification Checklist
191
-
192
- - [ ] Technology assumption updated → domain_scope.md §Technology Assumptions
193
- - [ ] Token budget strategy revised → structure_spec.md
194
- - [ ] RAG necessity re-evaluated per use case → domain_scope.md Area 3
195
- - [ ] Cost analysis: long context vs. retrieval → domain_scope.md Area 7
196
- - [ ] "Lost in the middle" effects tested → domain_scope.md Area 5
197
- - [ ] Quantitative criteria re-evaluated → structure_spec.md
198
-
199
- ### Affected Files
200
-
201
- | File | Impact | Section |
202
- |---|---|---|
203
- | domain_scope.md | Modify | §Technology Assumptions, Area 2, Area 3 |
204
- | structure_spec.md | Modify | Token budget rules, quantitative criteria |
205
- | dependency_rules.md | Verify | §Design-Time Constraint Flow |
206
-
207
- ---
208
-
209
- ## Case 6: Model-as-a-Service Disruption
210
-
211
- ### Situation
212
-
213
- Deployment shifts from cloud API to on-premise/edge with open-weight models, changing infrastructure, cost structures, and operational responsibilities.
214
-
215
- ### Case Study: Llama Open-Weight Local Deployment
216
-
217
- A financial services company migrated from OpenAI API to self-hosted Llama 3 70B for data residency. Infrastructure responsibility shifted entirely (GPU procurement, model serving via vLLM/TGI, auto-scaling). Cost structure inverted: no per-token costs but $30K+/month GPU cluster. Safety guardrails (content filtering, PII redaction) had to be built from scratch — API providers include these by default. The domain_scope.md assumption "Models accessed via API" was invalidated.
218
-
219
- ### Impact Analysis
220
-
221
- | Area | Impact Level | Key Concern |
222
- |---|---|---|
223
- | 1 (Model) | **Primary** | Model serving replaces API; quantization; batching; version self-management |
224
- | 7 (Operations) | **Primary** | GPU infrastructure, serving, scaling, hardware monitoring — new domain |
225
- | 6 (Safety) | **High** | All guardrails built in-house; no provider-side filtering |
226
- | 8 (Adaptation) | **High** | Direct model access enables fine-tuning |
227
- | 2, 3, 4, 5 | Low–Medium | Chat template differences (2), self-hosted embeddings (3), tool support gaps (4) |
228
-
229
- ### Verification Checklist
230
-
231
- - [ ] Technology assumption invalidation documented → domain_scope.md §Technology Assumptions
232
- - [ ] Area 7 expanded for infrastructure management → domain_scope.md Area 7
233
- - [ ] Model serving infrastructure documented → domain_scope.md Area 1
234
- - [ ] Safety guardrails built for self-hosted model → domain_scope.md Area 6
235
- - [ ] Cost model updated: fixed vs. per-token → domain_scope.md Area 7
236
-
237
- ### Affected Files
238
-
239
- | File | Impact | Section |
240
- |---|---|---|
241
- | domain_scope.md | Modify | §Technology Assumptions, Area 1, Area 7 |
242
- | dependency_rules.md | Modify | §Model API Dependencies (extend to self-hosted) |
243
- | concepts.md | Modify | New terms: model serving, quantization, vLLM |
244
-
245
- ---
246
-
247
- ## Case 7: Multi-Modal Native
248
-
249
- ### Situation
250
-
251
- Vision, audio, and other modalities become first-class inputs alongside text, rather than supplementary capabilities.
252
-
253
- ### Case Study: GPT-4V and Claude 3 Vision
254
-
255
- A document processing company migrated from OCR-then-LLM to direct image-to-text: the OCR pipeline was eliminated, prompts referenced visual elements requiring spatial reasoning instructions, knowledge bases needed multi-modal embeddings (CLIP) incompatible with text-only embeddings, evaluation required annotated image golden sets, and token costs increased (1,000-2,000 tokens per page image vs. 200-400 for text). The domain_scope.md assumption "Text is primary modality" was challenged.
256
-
257
- ### Impact Analysis
258
-
259
- | Area | Impact Level | Key Concern |
260
- |---|---|---|
261
- | 2 (Prompt) | **Primary** | Multi-modal input design; spatial/temporal references; token budget for images |
262
- | 3 (Retrieval) | **High** | Multi-modal embeddings; cross-modal search; non-text storage |
263
- | 1 (Model) | **High** | Modality support as selection criterion; multi-modal routing |
264
- | 5 (Evaluation) | **High** | Visual understanding metrics; multi-modal golden sets |
265
- | 4, 6, 7, 8 | Low–Medium | Vision for agents (4), image injection (6), larger payloads (7) |
266
-
267
- ### Verification Checklist
268
-
269
- - [ ] Technology assumption updated → domain_scope.md §Technology Assumptions
270
- - [ ] Multi-modal input design patterns documented → domain_scope.md Area 2
271
- - [ ] Multi-modal embedding model selected → dependency_rules.md §Embedding Model Dependencies
272
- - [ ] Cross-modal search assessed → domain_scope.md Area 3
273
- - [ ] Multi-modal evaluation metrics defined → domain_scope.md Area 5
274
- - [ ] Token budget includes image/audio costs → structure_spec.md
275
-
276
- ### Affected Files
277
-
278
- | File | Impact | Section |
279
- |---|---|---|
280
- | domain_scope.md | Modify | §Technology Assumptions, Area 2, Area 3 |
281
- | dependency_rules.md | Modify | §Embedding Model Dependencies |
282
- | concepts.md | Modify | New terms: multi-modal embedding, cross-modal search |
283
-
284
- ---
285
-
286
- ## Case 8: Fine-Tuning Democratization
287
-
288
- ### Situation
289
-
290
- Fine-tuning costs drop 100x through parameter-efficient techniques, making model adaptation a default practice. Area 8 shifts from optional to mandatory.
291
-
292
- ### Case Study: LoRA/QLoRA Cost Reduction
293
-
294
- A startup fine-tuned Llama 2 7B with QLoRA on 10K customer support conversations in 4 hours on a single A100 ($12). The fine-tuned model outperformed GPT-4 with elaborate prompts (92% vs. 87% accuracy). Prompt engineering effort dropped 70% — domain knowledge was in weights, not context. New capabilities needed: dataset curation, experiment tracking, model evaluation before/after, adapter serving. The domain_scope.md assumption "Fine-tuning is optional" was invalidated.
295
-
296
- ### Impact Analysis
297
-
298
- | Area | Impact Level | Key Concern |
299
- |---|---|---|
300
- | 8 (Adaptation) | **Primary** | Becomes mandatory: dataset engineering, training, experiment tracking |
301
- | 5 (Evaluation) | **High** | Pre/post comparison; contamination checks; regression testing |
302
- | 6 (Safety) | **High** | Fine-tuning can remove guardrails; alignment verification post-training |
303
- | 1 (Model) | **High** | Adapter serving; model routing includes fine-tuned variants |
304
- | 2, 3, 4, 7 | Low–Medium | Shorter prompts (2), reduced retrieval needs (3), A/B testing (7) |
305
-
306
- ### Verification Checklist
307
-
308
- - [ ] Technology assumption invalidation documented → domain_scope.md §Technology Assumptions
309
- - [ ] Area 8 elevated to required → domain_scope.md Area 8
310
- - [ ] Dataset engineering pipeline documented → domain_scope.md Area 8
311
- - [ ] Pre/post fine-tuning evaluation → domain_scope.md Area 5
312
- - [ ] Safety alignment verified after training → domain_scope.md Area 6
313
- - [ ] Adapter serving infrastructure → domain_scope.md Area 1, Area 7
314
-
315
- ### Affected Files
316
-
317
- | File | Impact | Section |
318
- |---|---|---|
319
- | domain_scope.md | Modify | §Technology Assumptions, Area 8 |
320
- | dependency_rules.md | Modify | §Feedback Loops (Loop 1 becomes critical path) |
321
- | concepts.md | Modify | New terms: LoRA, QLoRA, adapter, PEFT |
322
-
323
- ---
324
-
325
- ## Case 9: Agent Autonomy Expansion
326
-
327
- ### Situation
328
-
329
- Agents gain direct computer interface capabilities — browser control, desktop applications, file systems — extending autonomy beyond structured tool calls to open-ended environment interaction.
330
-
331
- ### Case Study: Claude Computer Use and Browser Use (78K Stars)
332
-
333
- A QA company adopted browser use agents: architecture shifted from predefined tool calls to observe-act loops (screenshot → reason → mouse/keyboard action). Safety boundaries became critical — agents could navigate anywhere, fill any form. Novel failure modes: loops, unintended navigation, interaction with pop-ups. Evaluation shifted from step-based ("correct tool call?") to outcome-based ("goal achieved through any path?").
334
-
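The observe-act loop and its failure modes described above can be sketched as follows. This is an illustrative skeleton, not any agent framework's API: `observe`, `decide`, `act`, and `goal_reached` are hypothetical callables supplied by the caller, and the step and repeat limits are arbitrary defaults standing in for the resource limits and stuck detection listed in the checklist below.

```python
from collections import Counter
from typing import Callable


def run_observe_act(
    observe: Callable[[], str],
    decide: Callable[[str], str],
    act: Callable[[str], None],
    goal_reached: Callable[[str], bool],
    max_steps: int = 20,
    max_repeats: int = 3,
) -> str:
    """Run an observe -> reason -> act loop with a step limit and stuck detection.

    Returns "goal_reached", "stuck", or "step_limit". Observations and actions
    are plain strings here so that repeated (observation, action) pairs can be counted.
    """
    seen: Counter = Counter()
    for _ in range(max_steps):
        observation = observe()
        if goal_reached(observation):       # outcome-based check: any path counts
            return "goal_reached"
        action = decide(observation)
        seen[(observation, action)] += 1
        if seen[(observation, action)] > max_repeats:
            return "stuck"                  # same state and same action, repeatedly
        act(action)
    return "step_limit"


# Toy environment: a pager that advances one page per "next" action.
state = {"page": 0}


def advance(_action: str) -> None:
    state["page"] += 1


result_ok = run_observe_act(
    observe=lambda: f"page-{state['page']}",
    decide=lambda _obs: "next",
    act=advance,
    goal_reached=lambda obs: obs == "page-3",
)

# Toy failure: a pop-up swallows every action, so the observation never changes.
result_stuck = run_observe_act(
    observe=lambda: "popup",
    decide=lambda _obs: "click-ok",
    act=lambda _action: None,
    goal_reached=lambda obs: obs == "done",
)
print(result_ok, result_stuck)
```

The goal check inspects only the observation, which mirrors the shift from step-based to outcome-based evaluation; a real deployment would also need the sandboxing and permission checks that this sketch leaves out.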
335
- ### Impact Analysis
336
-
337
- | Area | Impact Level | Key Concern |
338
- |---|---|---|
339
- | 4 (Agentic) | **Primary** | Observe-act loops; browser/computer control; environment interaction |
340
- | 6 (Safety) | **Primary** | Sandboxing; action permissions; escalation for irreversible actions |
341
- | 5 (Evaluation) | **High** | Outcome-based evaluation; trajectory analysis; reliability metrics |
342
- | 7 (Operations) | **High** | Sandbox infrastructure; session recording; resource limits |
343
- | 1, 2, 3, 8 | Low–Medium | Vision required (1), boundary instructions (2) |
344
-
345
- ### Verification Checklist
346
-
347
- - [ ] Observe-act pattern documented → domain_scope.md Area 4
348
- - [ ] Sandbox environment defined and enforced → domain_scope.md Area 6
349
- - [ ] Action permission model defined → domain_scope.md Area 6
350
- - [ ] Escalation policy for irreversible actions → domain_scope.md Area 6
351
- - [ ] Outcome-based evaluation metrics → domain_scope.md Area 5
352
- - [ ] Stuck/loop detection and termination → domain_scope.md Area 4
353
- - [ ] Resource limits (session duration, allowed domains) → domain_scope.md Area 7
354
-
355
- ### Affected Files
356
-
357
- | File | Impact | Section |
358
- |---|---|---|
359
- | domain_scope.md | Verify | Area 4, Area 6 |
360
- | dependency_rules.md | Verify | §Design-Time Constraint Flow (safety → agent) |
361
- | concepts.md | Modify | New terms: observe-act loop, action permission model |
362
-
363
- ---
364
-
365
- ## Case 10: Evaluation Paradigm Shift
366
-
367
- ### Situation
368
-
369
- AI-as-judge (LLM-based evaluation) becomes the primary evaluation method, replacing human annotation and rule-based metrics at scale.
370
-
371
- ### Case Study: LLM-as-Judge Replacing Human Annotation
372
-
373
- A customer support company migrated from human evaluation (500/week at $15 each = $7,500) to LLM-as-judge (50,000/week at $0.03 each = $1,500). Consistency improved, but systematic biases emerged (preference for longer responses, over-rating confident incorrect answers). Meta-evaluation became necessary — "evaluating the evaluator." Self-evaluation bias required separate judge and generation models. The methodology shifted from "collecting human judgments" to "calibrating automated judges."
374
-
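The weekly figures in the case study above follow from volume times unit price. The sketch below recomputes them in integer cents to avoid floating-point rounding; all inputs come from the text, and the dictionary layout is just an illustrative choice.

```python
# Figures from the case study, with unit prices in cents.
HUMAN = {"evals_per_week": 500, "cents_per_eval": 1500}     # $15.00 each
JUDGE = {"evals_per_week": 50_000, "cents_per_eval": 3}     # $0.03 each


def weekly_cost_usd(plan: dict[str, int]) -> float:
    """Weekly evaluation spend in USD."""
    return plan["evals_per_week"] * plan["cents_per_eval"] / 100


human_cost = weekly_cost_usd(HUMAN)
judge_cost = weekly_cost_usd(JUDGE)
volume_ratio = JUDGE["evals_per_week"] // HUMAN["evals_per_week"]
unit_cost_ratio = HUMAN["cents_per_eval"] // JUDGE["cents_per_eval"]

print(f"human: ${human_cost:,.0f}/week, judge: ${judge_cost:,.0f}/week")
print(f"{volume_ratio}x the volume at 1/{unit_cost_ratio} the unit cost")
```

These numbers cover judge inference only; the meta-evaluation and retained human calibration set that the case study calls for would add to the judge-side cost.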
375
- ### Impact Analysis
376
-
377
- | Area | Impact Level | Key Concern |
378
- |---|---|---|
379
- | 5 (Evaluation) | **Primary** | AI-judge pipelines, meta-evaluation, judge calibration, bias detection |
380
- | 1 (Model) | **Medium** | Separate judge model needed; judge selection criteria differ |
381
- | 2 (Prompt) | **Medium** | Judge prompt design: rubrics, criteria, output format |
382
- | 7 (Operations) | **Medium** | Evaluation pipeline costs; judge model drift monitoring |
383
- | 3, 4, 6, 8 | Low–Medium | Factual verification (3), safety eval (6), training labels (8) |
384
-
385
- ### Verification Checklist
386
-
387
- - [ ] AI-as-judge methodology documented → domain_scope.md Area 5
388
- - [ ] Judge model ≠ generation model enforced → domain_scope.md Area 5
389
- - [ ] Meta-evaluation: judge validated against human golden set → domain_scope.md Area 5
390
- - [ ] Judge bias detection and calibration procedure defined
391
- - [ ] Human evaluation preserved for calibration maintenance → domain_scope.md Area 5
392
-
393
- ### Affected Files
394
-
395
- | File | Impact | Section |
396
- |---|---|---|
397
- | domain_scope.md | Verify | Area 5 |
398
- | dependency_rules.md | Verify | §Feedback Loops (Loop 1), §Metric Ownership Rules |
399
- | concepts.md | Modify | New terms: AI-as-judge, meta-evaluation, judge calibration |
400
-
401
- ---
402
-
403
- ## Case 11: LLM-Favored Structure Adoption
404
-
405
- ### Situation
406
-
407
- Organization-wide adoption of LLM-optimized knowledge patterns — File=Concept, YAML frontmatter, system maps, llms.txt — as the documentation standard.
408
-
409
- ### Case Study: Onto's Structure and llms.txt Adoption
410
-
411
- A 200-person engineering org migrated from Confluence wiki (3,000+ nested, duplicated pages) to File=Concept structure (1,200 files with clear ownership). YAML frontmatter enabled automated knowledge graph generation. LLMs consumed documentation 40% more accurately (RAG retrieval precision). Migration cost: 6 engineer-months. The llms.txt standard, CLAUDE.md, .cursorrules, and AGENTS.md demonstrate vendor-specific implementations of the same principle: structured, machine-readable knowledge surfaces.
412
-
413
- ### Impact Analysis
414
-
415
- | Area | Impact Level | Key Concern |
416
- |---|---|---|
417
- | 3 (Retrieval) | **Primary** | File=Concept; metadata-driven retrieval; navigation paths; system maps |
418
- | 2 (Prompt) | **Medium** | Frontmatter provides context; reduces prompt engineering |
419
- | 4 (Agentic) | **Medium** | CLAUDE.md/AGENTS.md define per-directory agent behavior |
420
- | 5 (Evaluation) | **Medium** | Structure quality measurable: orphan rate, duplication, completeness |
421
- | 1, 6, 7, 8 | Low | Indirect benefits across all areas |
422
-
423
- ### Verification Checklist
424
-
425
- - [ ] File=Concept: one concept per file → domain_scope.md Area 3
426
- - [ ] YAML frontmatter on all knowledge files → structure_spec.md
427
- - [ ] System map provides navigation → domain_scope.md Area 3
428
- - [ ] Duplication eliminated → dependency_rules.md §Duplication Prevention Rules
429
- - [ ] Navigation paths acyclic → dependency_rules.md §Acyclicity
430
- - [ ] Retrieval precision measured before/after → domain_scope.md Area 5
431
- - [ ] Structure validation automated (lint, orphan detection) → domain_scope.md Area 7
432
-
433
- ### Affected Files
434
-
435
- | File | Impact | Section |
436
- |---|---|---|
437
- | structure_spec.md | Modify | Frontmatter spec, File=Concept rules |
438
- | dependency_rules.md | Verify | §Duplication Prevention, §Acyclicity, §Referential Integrity |
439
- | domain_scope.md | Verify | Area 3, §Reference Standards/Frameworks |
440
-
441
- ---
442
-
443
- ## Scenario Interconnections
444
-
445
- | Scenario | Triggers / Interacts With | Reason |
446
- |---|---|---|
447
- | Case 1 (New Model) | → Case 5 (Context), Case 7 (Multi-Modal), Case 10 (Eval) | New models expand context, add modalities, change eval baselines |
448
- | Case 2 (Retrieval) | → Case 5 (Context), Case 11 (Structure) | Larger context changes RAG calculus; structure affects retrieval |
449
- | Case 3 (Protocol) | → Case 9 (Autonomy), Case 4 (Regulation) | Protocols enable new capabilities; standardization aids compliance |
450
- | Case 4 (Regulation) | → Case 9 (Autonomy), Case 10 (Eval) | Regulation constrains agents; mandates auditable evaluation |
451
- | Case 5 (Context) | → Case 2 (Retrieval), Case 11 (Structure) | Reduced RAG necessity; less aggressive structuring needed |
452
- | Case 6 (MaaS Disruption) | → Case 8 (Fine-Tuning), Case 4 (Regulation) | Local deployment enables fine-tuning; data residency drives on-prem |
453
- | Case 7 (Multi-Modal) | → Case 2 (Retrieval), Case 9 (Autonomy) | Multi-modal retrieval is distinct paradigm; vision enables computer use |
454
- | Case 8 (Fine-Tuning) | → Case 6 (MaaS), Case 10 (Eval) | Fine-tuning incentivizes local; more variants need scalable eval |
455
- | Case 9 (Autonomy) | → Case 4 (Regulation), Case 3 (Protocol) | Autonomous agents face stricter scrutiny; need safety protocols |
456
- | Case 10 (Eval Shift) | → Case 8 (Fine-Tuning) | AI judges generate training labels at scale |
457
- | Case 11 (Structure) | → Case 2 (Retrieval) | Better structure improves retrieval across all paradigms |
458
-
459
- ### Cascade Patterns
460
-
461
- **Cascade 1 — Model → Context → Retrieval → Structure**: New model with 10M context (Case 1) → context expansion (Case 5) → reduced RAG necessity (Case 2) → knowledge structure adaptation (Case 11). All 4 cases require coordinated response.
462
-
463
- **Cascade 2 — Regulation → Safety → Agent → Protocol**: AI regulation (Case 4) → tightened safety requirements → constrained agent autonomy (Case 9) → standardized safety protocols (Case 3). Compliance becomes root cause for multiple architectural changes.
464
-
465
- **Cascade 3 — Local Deployment → Fine-Tuning → Eval Scaling**: Self-hosted models (Case 6) → accessible fine-tuning (Case 8) → many model variants needing evaluation → AI-as-judge for throughput (Case 10). Evaluation pipeline becomes bottleneck.
466
-
467
- ---
468
-
469
- ## Related Documents
470
- - domain_scope.md — Sub-area definitions (Areas 1-8), membership criteria, technology assumptions, handoff points
471
- - dependency_rules.md — Inter-area dependencies, model/MCP/embedding dependencies, feedback loops
472
- - structure_spec.md — Token budget rules, quantitative criteria, frontmatter specification
473
- - competency_qs.md — Questions each area must answer; updated when scenarios reveal gaps
474
- - concepts.md — Term definitions; updated when new terms emerge from scenarios
@@ -1,123 +0,0 @@
1
- ---
2
- version: 2
3
- last_updated: "2026-03-30"
4
- source: manual
5
- status: established
6
- ---
7
-
8
- # LLM-Native Development Domain — Logic Rules
9
-
10
- Classification axis: **system construction concern** — rules classified by the design concern of the LLM-powered system they govern.
11
-
12
- ## File–Concept Correspondence Logic (→Area 3)
13
-
14
- - One concept must be defined in exactly one concept file. If the same concept is defined in two files, the result is a source-of-truth conflict
15
- - A filename must represent the concept the file defines. A mismatch between filename and content is a self-describing violation
16
- - If a concept file exists, it must define at least one concept. Empty files are not allowed
17
- - Meta files (INDEX.md, ARCHITECTURE.md, etc.) do not define concepts; they specify structural relationships among other files. They are excluded from the "File = Concept" correspondence
18
-
19
- ## Frontmatter Conformance Logic (→Area 3)
20
-
21
- - structure_spec.md is the source of truth for the frontmatter specification (required fields, format). When other files reference frontmatter, they follow structure_spec.md's definitions
22
- - All references declared in frontmatter (depends_on, related_to, parent) must point to actually existing files
23
- - The type field value in frontmatter must belong to the type list defined in the system. An undefined type is unclassifiable
24
- - If a relationship declared in frontmatter is bidirectional, the reverse relationship must also exist in the target file's frontmatter. In this case, follow the cycle exception clause in dependency_rules.md
25
-
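The reference-existence rule above can be checked mechanically. The following is a minimal sketch, assuming frontmatter has already been parsed into dictionaries keyed by filename; the function name `missing_references` and the sample filenames are hypothetical, and "existing file" is approximated by membership in that dictionary.

```python
REFERENCE_FIELDS = ("depends_on", "related_to", "parent")

def missing_references(frontmatter_by_file: dict[str, dict]) -> list[tuple[str, str, str]]:
    """Return (file, field, target) for every declared reference whose target is unknown."""
    problems = []
    for name, frontmatter in frontmatter_by_file.items():
        for field in REFERENCE_FIELDS:
            value = frontmatter.get(field, [])
            # Accept either a single filename or a list of filenames.
            targets = [value] if isinstance(value, str) else value
            for target in targets:
                if target not in frontmatter_by_file:
                    problems.append((name, field, target))
    return problems

# Hypothetical parsed frontmatter for three files.
files = {
    "INDEX.md": {},
    "auth.md": {"parent": "INDEX.md", "depends_on": ["session.md"]},
    "session.md": {"related_to": ["token.md"]},  # token.md does not exist
}
dangling = missing_references(files)
```

A real checker would also need to resolve relative paths and verify the reverse side of bidirectional relationships; both are left out of this sketch.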
26
- ## Hierarchy Logic (→Area 3)
27
-
28
- - Directory hierarchy represents containment relationships of concepts. A parent directory must be a category encompassing the concepts in its child directories
29
- - Files within the same directory must be concepts at the same abstraction level. If abstraction levels differ, separate the files into different directories
30
- - As directory depth increases, concepts become more specific. Depth reversal (child is more abstract) is a structural contradiction
31
-
32
- ## Navigation Path Logic (→Area 3)
33
-
34
- - A reference chain must exist from the entry point to any concept file. A file that cannot be reached is an isolated document
35
- - If a reference chain has cycles, the LLM may fall into infinite traversal. Circular references must be explicitly marked in frontmatter (see cycle exception clause in dependency_rules.md)
36
- - When A references B and B references C, if understanding C's content from A requires traversing through B, a direct A→C reference should be added (navigation shortcut)
37
-
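The reachability rule above amounts to a graph traversal from the entry point. The sketch below is one way to find isolated documents, assuming the reference graph is already available as a mapping from filename to referenced filenames; `find_isolated` and the sample files are hypothetical names used for illustration.

```python
from collections import deque

def find_isolated(entry: str, refs: dict[str, list[str]]) -> set[str]:
    """Return the files that no reference chain starting at `entry` reaches."""
    seen = {entry}
    queue = deque([entry])
    while queue:
        current = queue.popleft()
        for target in refs.get(current, []):
            # The visited set also keeps cycles from causing endless traversal.
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return set(refs) - seen

# Hypothetical reference graph with a cycle (a.md <-> b.md) and one orphan.
refs = {
    "INDEX.md": ["a.md", "b.md"],
    "a.md": ["b.md"],
    "b.md": ["a.md"],
    "orphan.md": [],
}
isolated = find_isolated("INDEX.md", refs)
```

The traversal terminates even though `a.md` and `b.md` reference each other, which is the behavior an LLM following links cannot be assumed to have on its own.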
38
- ## Change Propagation Logic (→Area 3)
39
-
40
- - When concept X's definition changes, all files referencing X are verification targets
41
- - When a new concept file is added, if any existing file other than the meta file (INDEX.md) must be modified, that signals high coupling. Homonym registration in concepts.md is allowed as an exception
42
- - When deleting a file, all references to that file must be removed everywhere (referential integrity)
43
-
44
- ## Model Integration Logic (→Area 1)
45
-
46
- - If model routing is used (directing requests to different models based on task type, cost, or complexity), fallback paths must be defined for every route. A route without a fallback is a single point of failure — if the target model is unavailable, the request fails silently or errors out
47
- - Model version pinning: if the model version is not pinned to a specific release (e.g., `gpt-4-0613` rather than `gpt-4`), behavior can change unpredictably. Provider-side model updates can alter output quality, format, or latency without notice
48
- - If multiple models are used for different tasks within the same system, capability requirements must be documented per model. Documentation must include: task type, required capabilities (e.g., tool use, structured output, long context), minimum acceptable quality level, and cost constraints
49
- - Model selection must be justified against the task's requirements. Using a more capable (and expensive) model where a simpler model suffices is a cost inefficiency. Using a less capable model where quality is critical is a quality risk
50
- - If inference optimization is applied (quantization, batching, KV cache), the optimization's impact on output quality must be measured, not assumed. Quantization in particular can degrade performance on tasks requiring precise reasoning
51
-
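The fallback rule for model routing can be illustrated with a small sketch. It assumes routes are ordered lists of model identifiers per task type and that a caller-supplied `call_model` function raises `RuntimeError` when a model is unavailable; the model names and both function names are hypothetical.

```python
def routes_without_fallback(routes: dict[str, list[str]]) -> list[str]:
    """Return task types whose route lists fewer than two models."""
    return sorted(task for task, models in routes.items() if len(models) < 2)

def call_with_fallback(task_type: str, routes: dict[str, list[str]], call_model):
    """Try each model for the task in order; fail loudly if every one fails."""
    errors = []
    for model in routes[task_type]:
        try:
            return model, call_model(model)
        except RuntimeError as exc:
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all routes failed: " + "; ".join(errors))

# Hypothetical routing table with pinned (versioned) model identifiers.
routes = {
    "summarize": ["small-model-2024-01", "large-model-2024-01"],
    "extract": ["large-model-2024-01"],  # no fallback: single point of failure
}

def fake_call(model: str) -> str:
    # Stand-in for a provider call: the small model is "down" in this example.
    if model.startswith("small"):
        raise RuntimeError("unavailable")
    return f"response from {model}"

unprotected = routes_without_fallback(routes)
used_model, response = call_with_fallback("summarize", routes, fake_call)
```

Running `routes_without_fallback` at configuration load time surfaces the single point of failure before a request ever hits it.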
52
- ## Prompt Design Logic (→Area 2)
53
-
54
- - Instruction hierarchy: system prompt > tool definitions > user prompt. When instructions at different levels conflict, higher-priority instructions take precedence. If the system prompt says "always respond in JSON" and the user prompt says "respond in plain text," the system prompt wins
55
- - If structured output is required (JSON, XML, function call schemas), the output schema must be validated before consumption. Trusting model output to conform to a schema without validation is unsafe — models can produce malformed output even when instructed to follow a schema
56
- - Token budget constraint: system prompt tokens + retrieved context tokens + user input tokens ≤ context window size - expected output tokens. Violating this constraint causes truncation (silent data loss) or API errors. The budget must be calculated before the API call, not after
57
- - Context rot: as conversation length grows, the earliest context loses influence on model behavior. Critical information (safety rules, output format requirements, identity constraints) must be re-injected at regular intervals or placed in positions of high attention (beginning and end of the context window)
58
- - Prompt templates must be versioned. A prompt change that alters system behavior is functionally equivalent to a code change. Unversioned prompts make it impossible to reproduce past behavior or diagnose regressions
59
- - Few-shot examples must be representative of the target distribution. Biased examples cause biased outputs. If the task distribution changes, the examples must be updated
60
-
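The token budget constraint above is simple arithmetic that can be checked before the API call. The following is a minimal sketch; the function name and the sample numbers are illustrative, and token counts are assumed to come from whatever tokenizer matches the target model.

```python
def fits_token_budget(
    system_tokens: int,
    context_tokens: int,
    user_tokens: int,
    context_window: int,
    expected_output_tokens: int,
) -> bool:
    """Check: system + retrieved context + user input <= window - expected output."""
    input_budget = context_window - expected_output_tokens
    return system_tokens + context_tokens + user_tokens <= input_budget

# Hypothetical numbers: an 8000-token window with 1000 tokens reserved for output
# leaves an input budget of 7000 tokens.
ok = fits_token_budget(1500, 4000, 500, context_window=8000, expected_output_tokens=1000)
too_big = fits_token_budget(1500, 5500, 500, context_window=8000, expected_output_tokens=1000)
```

When the check fails, the caller can trim retrieved context or shorten the prompt deliberately instead of letting the provider truncate silently.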
61
- ## Retrieval Logic (→Area 3)
62
-
63
- - RAG pipeline correctness: retrieval relevance must be verified independently of generation quality. If the final output is incorrect, the cause may be (a) irrelevant retrieval, (b) correct retrieval but incorrect generation, or (c) both. Diagnosing the root cause requires evaluating retrieval and generation separately
64
- - Chunking strategy must preserve semantic boundaries. Splitting mid-sentence or mid-paragraph destroys context that the embedding model needs to produce meaningful vectors. Chunk boundaries should align with document structure (paragraphs, sections, logical units)
65
- - If hybrid search is used (combining keyword-based search and semantic/vector search), the combination strategy must be explicit. Options include: score fusion (weighted sum of scores), rank fusion (reciprocal rank fusion), or cascade (keyword filter → semantic rerank). An unspecified combination strategy produces non-reproducible results
66
- - Embedding model selection must match the content domain. General-purpose embeddings may underperform on domain-specific content (medical, legal, code). If retrieval quality is insufficient, evaluate domain-specific or fine-tuned embedding models before increasing chunk count
67
- - Retrieved context must carry provenance metadata (source document, chunk ID, relevance score). Without provenance, it is impossible to debug retrieval failures or trace hallucinations back to their source
68
-
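Of the hybrid search combination strategies listed above, reciprocal rank fusion is the easiest to make explicit in code. The sketch below assumes each retriever returns an ordered list of document IDs; the constant `k = 60` is a commonly used default rather than something this document specifies, and the document IDs are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each document scores sum(1 / (k + rank)) over the lists it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the result is reproducible.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

keyword_results = ["d1", "d2", "d3"]  # hypothetical keyword search ranking
vector_results = ["d2", "d4", "d1"]   # hypothetical vector search ranking
fused = reciprocal_rank_fusion([keyword_results, vector_results])
```

Here `d2` ranks first because it appears near the top of both lists, while `d3` and `d4` each appear in only one. Recording the strategy and the value of `k` alongside the results is what makes the ranking reproducible.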
69
- ## Agentic Systems Logic (→Area 4)
70
-
71
- - Tool definitions must be non-overlapping in their described capabilities. If two tools can accomplish the same task, the agent's choice between them is non-deterministic. Either remove the overlap or add explicit routing instructions that disambiguate when to use each tool
72
- - Multi-agent communication must have termination conditions. Without explicit termination (maximum iterations, convergence criteria, timeout), agent loops can run indefinitely, consuming resources and producing no useful output. Every loop must have a defined exit condition
73
- - Agent state must be explicitly managed. Relying on conversation history as implicit state is fragile — context windows are finite, and conversation history can be truncated, summarized, or lost. Critical state (task progress, accumulated results, decision points) must be stored in structured form (scratchpads, databases, state objects)
74
- - MCP tool schemas must be self-describing. The agent must be able to determine a tool's applicability from the schema alone (name, description, parameter descriptions) without external documentation. A tool schema that requires reading separate documentation to understand is poorly designed
75
- - Agent instructions must explicitly list which tools are available and when each should be used. If tools are available but not mentioned in instructions, the agent may discover them through schema inspection but cannot reliably determine appropriate usage context
76
- - For long-running agent tasks (multi-session persistence), structured progress tracking must be maintained. The agent must be able to resume from the last checkpoint without re-executing completed steps. Progress format: completed steps, current step, remaining steps, accumulated artifacts
77
-
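The termination and explicit-state rules above can be combined in one loop. This is a sketch under stated assumptions: `step` stands in for one agent turn, the state dictionary layout is hypothetical, and a maximum iteration count is the only exit condition shown (timeouts and convergence criteria would follow the same pattern).

```python
def run_agent_loop(step, max_iterations: int = 10):
    """Run step(state) until it sets state['done'] or the iteration cap is hit."""
    # State lives in a structured object, not in conversation history.
    state = {"completed_steps": [], "done": False}
    for iteration in range(1, max_iterations + 1):
        state = step(state)
        if state["done"]:
            return state, "finished", iteration
    return state, "max_iterations", max_iterations

def three_step_task(state: dict) -> dict:
    # Hypothetical agent turn: records one completed step, finishes after three.
    completed = state["completed_steps"] + [f"step-{len(state['completed_steps']) + 1}"]
    return {"completed_steps": completed, "done": len(completed) >= 3}

def never_finishes(state: dict) -> dict:
    # Hypothetical misbehaving agent that would otherwise loop forever.
    return state

final_state, reason, turns = run_agent_loop(three_step_task)
_, stuck_reason, stuck_turns = run_agent_loop(never_finishes, max_iterations=5)
```

Because `completed_steps` is stored explicitly, the same structure can be persisted as a checkpoint and reloaded to resume a long-running task.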
78
- ## Evaluation Logic (→Area 5)
79
-
80
- - Evaluation criteria must be defined before system development begins, not retrofitted after (spec-first principle). Defining evaluation criteria after seeing system output introduces confirmation bias — criteria are unconsciously shaped to match existing behavior rather than desired behavior
81
- - AI-as-judge evaluation must disclose: (a) the judge model identity and version, (b) the judge prompt used, and (c) known biases of the judge model (e.g., positional bias, verbosity bias, self-preference). Without disclosure, evaluation results are not reproducible or interpretable
82
- - A/B testing for system output quality requires statistical significance thresholds defined in advance. Without predefined thresholds, there is a temptation to stop testing when results look favorable (peeking), which inflates false positive rates
83
- - Golden set (reference test data) must be maintained separately from training/fine-tuning data. Contamination of the golden set with training data produces artificially inflated evaluation scores
84
- - Evaluation pipelines must be automated and repeatable. Manual evaluation is acceptable for initial exploration but does not scale. If the system is in production, evaluation must run on a regular cadence without human intervention
85
-
86
- ## Safety Logic (→Area 6)
87
-
88
- - Defense-in-depth: no single safety mechanism is sufficient. Minimum layers: input filtering (block malicious prompts before they reach the model) + output filtering (block harmful content before it reaches the user) + monitoring (detect patterns that individual filters miss). Removing any layer increases risk
89
- - Prompt injection defense must not degrade normal user experience. A defense mechanism that blocks 5% of legitimate user requests to prevent 0.1% of injection attempts has an unacceptable false positive rate. Defense mechanisms must be evaluated for both efficacy (true positive rate) and cost (false positive rate)
90
- - Content policy rules must be testable. Each rule must have at least one positive test case (content that should be blocked) and one negative test case (content that should pass). Untestable rules cannot be verified and may silently fail or over-block
91
- - Safety constraints must be applied at the system level, not delegated to the model's built-in safety. Model-level safety is a useful baseline but is not configurable, not auditable, and can change without notice when the provider updates the model
92
- - Red teaming must be performed before deployment and on a regular cadence after deployment. The threat landscape evolves — attack techniques that did not exist during initial red teaming may emerge later
93
-
94
- ## Operations Logic (→Area 7)
95
-
96
- - Cost tracking granularity must match billing granularity. If the provider bills per token, the system must track per-token usage. If the system tracks only per-request, cost attribution to specific features or users is impossible, making cost optimization guesswork
97
- - Quality drift detection requires a baseline. The baseline must be established during the evaluation phase (→Area 5) using the golden set. Without a baseline, there is no reference point to determine whether quality has degraded. Drift detection compares current output quality against the baseline at regular intervals
98
- - Feedback loop: user feedback must be actionable. Feedback without a defined path to system improvement is waste. For each feedback type (thumbs up/down, free-text correction, escalation), there must be a defined process: collection → aggregation → analysis → system change → verification
99
- - Logging must capture sufficient information to reproduce any LLM interaction: input (full prompt), output (full response), model version, latency, token count, and any tool calls. Insufficient logging makes debugging production issues impossible
100
- - Incident response for LLM-specific failures must account for failure modes unique to LLM systems: model provider outage, sudden quality degradation (model update), cost spike (unexpected token usage), safety filter false positives (legitimate requests blocked), and prompt injection in production
101
-
102
- ## Data & Model Adaptation Logic (→Area 8)
103
-
104
- - Fine-tuning decision: fine-tuning is justified only when prompting alone cannot achieve the required quality, latency, or cost targets. Fine-tuning introduces a maintenance burden (dataset management, retraining on model updates) that prompting avoids
105
- - Training data quality sets an upper bound on model quality. No amount of training can overcome systematically flawed data. Data quality assessment (accuracy, consistency, completeness, representativeness) must precede training
106
- - If parameter-efficient fine-tuning is used (LoRA, QLoRA, adapters), the base model version must be pinned. A base model update invalidates all existing adapters — adapters trained on version N are not guaranteed to work correctly on version N+1
107
- - Evaluation of fine-tuned models must compare against the base model on the same evaluation set. Without this comparison, there is no evidence that fine-tuning improved performance. Improvement must exceed the cost of fine-tuning to be justified
108
-
109
- ## Constraint Conflict Checking (Cross-Cutting)
110
-
111
- When design constraints from different areas impose contradictory requirements, the conflict must be identified, documented, and resolved. Unresolved conflicts produce systems with unpredictable behavior.
112
-
113
- - **Safety vs. Functionality**: When safety constraints conflict with functionality (e.g., output filtering blocks valid responses, input filtering rejects legitimate queries), safety takes precedence. However, the conflict must be documented with: (a) the specific safety rule, (b) the functionality it degrades, (c) the false positive rate, and (d) a plan to reduce the false positive rate without weakening safety
114
- - **Cost vs. Quality**: When cost constraints conflict with quality (e.g., a cheaper model degrades output quality, reduced retrieval scope misses relevant context), the trade-off must be explicit in the system specification. The specification must state: which quality metrics are affected, by how much, and what the cost saving is. Implicit quality degradation (choosing the cheap option without measuring the impact) is prohibited
115
- - **Latency vs. Completeness**: When latency requirements conflict with completeness (e.g., retrieval timeout cuts off relevant results, generation is truncated to meet response time targets), the system must degrade gracefully. Partial results must be clearly marked as partial, and the user must be informed that completeness was sacrificed for speed
116
- - **Privacy vs. Personalization**: When privacy constraints conflict with personalization (e.g., PII redaction removes information needed for personalized responses), privacy takes precedence. The system must provide useful responses without requiring PII, or obtain explicit user consent for PII usage with clear data handling policies
117
-
118
- ## Related Documents
119
- - concepts.md — Term definitions within this domain
120
- - dependency_rules.md — Cycle exception clause, reference direction rules
121
- - structure_spec.md — Source of truth for frontmatter specifications and structural rules
122
- - domain_scope.md — Scope definition, sub-area membership criteria, and boundary tests
123
- - competency_qs.md — Questions this domain's logic must be able to answer
@@ -1,49 +0,0 @@
1
- ---
2
- version: 1
3
- last_updated: "2026-03-29"
4
- source: bundled-domain-baseline
5
- status: established
6
- ---
7
-
8
- # LLM-Native Development Domain — Prompt/Interface Design
9
-
10
- This document defines the design criteria for structures that issue instructions to LLMs — system prompts, tool definitions, role definitions, and response formats.
11
- While the existing 7 files cover "the structure of a system that describes knowledge," this file covers a separate concern: "designing the knowledge consumption method."
12
-
13
- ## System Prompt Structure
14
-
15
- - Prompts are structured in the order of role definition (who) → context (what) → task instructions (how) → constraints
16
- - Role definitions are managed in individual files within the roles/ directory. They are not inlined into the prompt; they are read from files and included
17
- - Context loading is either self-loading (agent reads files directly) or injection (team lead includes content). The choice depends on the agent's file reading capability
18
-
19
- ## Role Definition File Structure
20
-
21
- Role definition files (roles/{agent-id}.md) include the following elements:
22
- - **Specialization**: The concern this role verifies/performs
23
- - **Role Description**: The core question this role answers
24
- - **Core Questions**: 3 to 6 items. Define the judgment axes of the role
25
- - **Domain Document Mapping**: Domain documents this role references (at least 1)
26
-
27
- ## Tool Definition (Tool Schema)
28
-
29
- - Tool definitions follow JSON Schema format
30
- - Tool names directly express the action (e.g., `search_files`, `read_document`)
31
- - Tool descriptions must be specific enough for the LLM to judge tool selection
32
- - Required parameters and optional parameters are clearly distinguished
33
-
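A tool definition that follows the criteria above might look like the sketch below. The tool name `search_files` comes from the example in the list; the description, the parameters, and the `inputSchema` key are assumptions made for illustration, and the helper `split_parameters` is hypothetical.

```python
# Hypothetical tool definition: an action-style name, a description specific enough
# to guide tool selection, and a JSON Schema object describing the parameters.
search_files_tool = {
    "name": "search_files",
    "description": "Search file contents under a directory and return matching file paths.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Text to search for."},
            "directory": {"type": "string", "description": "Directory to search in."},
            "max_results": {"type": "integer", "minimum": 1, "description": "Result cap."},
        },
        "required": ["query"],
    },
}

def split_parameters(tool: dict) -> tuple[list[str], list[str]]:
    """Return (required, optional) parameter names from a tool's JSON Schema."""
    schema = tool["inputSchema"]
    required = set(schema.get("required", []))
    optional = set(schema["properties"]) - required
    return sorted(required), sorted(optional)

required, optional = split_parameters(search_files_tool)
```

Keeping required parameters in the schema's `required` array, rather than only mentioning them in the description, makes the distinction machine-checkable.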
34
- ## Response Format Constraints (Structured Output)
35
-
36
- - When structured output is needed, specify the output format in the prompt
37
- - Format specification methods: choose between examples and schemas (JSON Schema)
38
- - When the format is complex, separate it into a file (templates/ directory) and reference from the prompt
39
-
40
- ## Context Window Utilization Strategy
41
-
42
- - Recommended prompt size is 30% or less of the total context window. The remainder is allocated to work targets and responses
43
- - For repeatedly executed prompts, separate static parts (role, rules) from dynamic parts (work target, previous results)
44
- - When static parts do not change, consider caching or compression
45
-
46
- ## Related Documents
47
- - concepts.md — Definitions of terms used in prompts
48
- - structure_spec.md — Physical location rules for role files
49
- - domain_scope.md — Higher-level definition of the "Consumption Interface" concern area