@moon791017/neo-skills 1.0.32 → 1.0.33

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -63,7 +63,10 @@
  ### 10. 需求釐清助手
  * **需求釐清**:系統化引導用戶釐清模糊需求,並將其轉化為結構化的規格文件(背景、功能、約束、驗收標準)。
 
- ### 11. 安全守衛 (Security Guard)
+ ### 11. AI 開發流程健檢
+ * **AI 助手開發治理 (`skills/neo-agent-harness`)**:檢查專案規則、測試、CI、審查流程與安全防護是否足夠清楚,協助 AI 助手更穩定、更安全地參與開發。
+
+ ### 12. 安全守衛 (Security Guard)
  * **主動防護 (`secret-guard.ts`)**:作為 CLI 的中介軟體 (Hook),自動攔截並掃描所有工具執行的參數。若偵測到敏感資訊(如環境設定檔、私鑰、雲端憑證等)將強制阻擋執行,防止機密外洩。
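The diff describes `secret-guard.ts` only at a high level: a CLI hook that scans tool arguments and blocks execution on sensitive matches. As a rough sketch of that pattern — the patterns and names below are illustrative assumptions, not the package's actual implementation:

```typescript
// Hypothetical sketch of a pre-execution hook that scans tool arguments.
// The patterns below are illustrative, not the package's actual rules.
const SENSITIVE_PATTERNS: RegExp[] = [
  /\.env(\.|$)/, // environment configuration files
  /-----BEGIN [A-Z ]*PRIVATE KEY-----/, // private key material
  /(aws|gcp|azure)[_-]?(secret|credential)/i, // cloud credentials
];

interface ToolCall {
  tool: string;
  args: string[];
}

// Returns null when the call may proceed, or the offending pattern's source.
function scanToolCall(call: ToolCall): string | null {
  for (const arg of call.args) {
    for (const pattern of SENSITIVE_PATTERNS) {
      if (pattern.test(arg)) return pattern.source;
    }
  }
  return null;
}

const blocked = scanToolCall({ tool: "read_file", args: [".env.local"] });
const allowed = scanToolCall({ tool: "read_file", args: ["README.md"] });
```

A real hook would also need to decide how to surface the block to the agent so it can self-correct instead of retrying the same call.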
 
  ## 📂 系統架構
@@ -215,6 +218,7 @@ npx -p @moon791017/neo-skills install-system-instructions \
  | **全方位程式碼審查** | `幫我 code review 剛才的修改` |
  | **生成 C# Interface** | `幫我針對這個 class 產生介面` |
  | **技術解析與架構洞察** | `分析這個專案的架構` |
+ | **AI 開發流程健檢** | `幫我檢查這個專案怎麼讓 AI 助手開發得更穩、更安全` |
 
  ## 🛠 開發指南
 
package/gemini-extension.json CHANGED
@@ -1,7 +1,7 @@
  {
  "name": "neo-skills",
  "description": "A universal capability extension for Gemini CLI",
- "version": "0.56.2",
+ "version": "0.57.1",
  "mcpServers": {
  "neo-skills": {
  "command": "node",
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "@moon791017/neo-skills",
- "version": "1.0.32",
+ "version": "1.0.33",
  "type": "module",
  "description": "Neo Skills: A Universal Expert Agent Extension",
  "homepage": "https://neo-blog-iota.vercel.app/",
@@ -20,7 +20,6 @@
  "dist",
  "bin",
  "skills",
- "commands",
  "gemini-extension.json",
  "GEMINI.md"
  ],
package/skills/neo-agent-harness/SKILL.md ADDED
@@ -0,0 +1,86 @@
+ ---
+ name: neo-agent-harness
+ description: 分析專案的 AI agent harnessability,設計 feedforward guides、feedback sensors、驗證流程與人類決策點,用於提升 agent 輔助開發品質。
+ ---
+
+ # Agent Harness Architect Skill
+
+ ## Trigger On
+ - The user asks to design or improve an AI agent development workflow.
+ - The user wants coding agents to be more reliable, safer, or easier to supervise.
+ - The user asks for AGENTS.md, skills, tests, CI, hooks, review loops, or project rules to be planned together.
+ - The task is about harness engineering, agent harnessability, feedback loops, or "humans on the loop".
+
+ ## Core Principle
+ Treat agent-assisted development as a controlled engineering system. Do not only improve prompts; design the guides, sensors, verification steps, and human decision points that let agents work with higher confidence.
+
+ Use this skill for planning first. Do not modify files unless the user explicitly asks for implementation after the harness plan is clear.
+
+ ## Perceive
+ 1. Inspect the repository before asking questions:
+ - Project instructions: `AGENTS.md`, `CLAUDE.md`, `GEMINI.md`, `.github/copilot-instructions.md`.
+ - Skill or prompt assets: `skills/`, `.codex/skills/`, `.claude/skills/`, `.github/skills/`.
+ - Validation commands: `package.json`, `pyproject.toml`, `Makefile`, CI workflow files, build scripts.
+ - Safety and governance: hooks, linters, formatters, secret scanners, dependency scanners.
+ - Existing documentation: README, architecture docs, ADRs, testing docs, contribution docs.
+ 2. Identify the project type, primary languages, current test surface, CI status, and release/deploy path.
+ 3. Separate discoverable repository facts from product or team preferences that require user confirmation.
+
+ ## Reason
+ Analyze the project through these dimensions:
+
+ 1. **Feedforward guides**
+ - What should the agent know before editing?
+ - Examples: project rules, coding style, architecture boundaries, task decomposition, testing strategy, domain vocabulary, safe file ownership.
+ 2. **Feedback sensors**
+ - What can check the agent's work after each change?
+ - Examples: typecheck, tests, lint, build, static analysis, security checks, architecture tests, review agents, smoke tests.
+ 3. **Control type**
+ - Prefer computational controls for fast deterministic checks.
+ - Use inferential controls for semantic review, architecture judgment, test-quality review, and risk assessment.
+ 4. **Regulation category**
+ - Maintainability: readability, duplication, complexity, style, testability.
+ - Architecture fitness: module boundaries, performance, observability, security, deployability.
+ - Behaviour: requirements, acceptance criteria, user journeys, fixtures, manual test points.
+ 5. **Human role**
+ - Move human effort from repeated low-level correction to high-value decisions: scope, product fit, risk acceptance, architectural tradeoffs, and harness evolution.
+
+ For detailed patterns and examples, read [reference/harness-patterns.md](reference/harness-patterns.md) when the task needs a full harness design, maturity model, or improvement roadmap. For the complete source article synthesis behind this skill, read [reference/agent-harness-engineering.md](reference/agent-harness-engineering.md) only when deeper conceptual background is needed.
+
+ ## Act
+ Output in Traditional Chinese (Taiwan). Use this structure:
+
+ ### 1. 現況盤點
+ - Summarize the repository facts discovered.
+ - Mention the current guides, sensors, and missing signals.
+
+ ### 2. Harnessability 評估
+ - Rate the current harnessability as `低`, `中`, or `高`.
+ - Explain the rating with concrete evidence from the repository.
+
+ ### 3. Feedforward Guides 設計
+ - List the rules, docs, skills, templates, or examples agents should receive before work starts.
+ - Mark each item as existing, needs update, or missing.
+
+ ### 4. Feedback Sensors 設計
+ - List fast local checks, CI checks, security checks, semantic reviews, and manual checks.
+ - Distinguish computational from inferential checks.
+
+ ### 5. 開發前改善清單
+ - Prioritize improvements as P0, P1, and P2.
+ - P0 must focus on changes that reduce repeated agent mistakes or prevent high-risk failures.
+
+ ### 6. 人類決策點
+ - State where humans should stay on the loop.
+ - Identify decisions that should not be delegated fully to agents.
+
+ ### 7. 驗證方式
+ - Provide exact commands or review steps when discoverable.
+ - If a command is unknown, state the missing fact instead of inventing one.
+
+ ## Constraints
+ - Do not treat AI-generated tests as sufficient proof of behaviour correctness.
+ - Do not replace deterministic tooling with LLM judgment when a fast tool can check the same thing.
+ - Do not recommend broad automation before the project has reliable local validation commands.
+ - Do not propose fully autonomous changes for security, compliance, production deploys, or product-scope decisions.
+ - Keep the output actionable and tied to repository evidence.
package/skills/neo-agent-harness/reference/agent-harness-engineering.md ADDED
@@ -0,0 +1,343 @@
+ # Agent Harness Engineering and the Role of Humans in the Software Engineering Loop
+
+ > This document is a summary and integration of two articles from Martin Fowler's website, focusing on practical explanations rather than word-for-word translation.
+
+ ## Original Sources
+
+ - [Harness engineering for coding agent users](https://martinfowler.com/articles/harness-engineering.html)
+ Author: Birgitta Boeckeler, Published: 2026-04-02
+ - [Humans and Agents in Software Engineering Loops](https://martinfowler.com/articles/exploring-gen-ai/humans-and-agents.html)
+ Author: Kief Morris, Published: 2026-03-04
+
+ ## One-sentence Summary
+
+ The focus of AI coding agents is not just "letting the model write code," but establishing a working system (harness) that guides, checks, corrects, and continuously improves the agent's work. Humans should not just review line-by-line at the lowest level, nor should they exit completely; instead, they should stand above the loop, designing and maintaining the agent's harness to ensure reliable software output in a controlled environment.
+
+ ## Core Background
+
+ LLM coding agents can generate code quickly, but they have several inherent limitations:
+
+ - Output is not entirely predictable.
+ - They may not understand the full business context.
+ - They have limited grasp of existing code, team conventions, architectural constraints, and product trade-offs.
+ - They are prone to iterating in the wrong direction, wasting tokens, time, and review costs.
+
+ Therefore, the key to improving agent performance is not just switching to a stronger model, but designing the environment in which it works. This environment includes specifications, documentation, templates, tests, linting, architectural checks, review agents, CI pipelines, observability data, and human decision-making processes. This entire external control system is the "harness" discussed in the articles.
+
+ ## What is a Harness?
+
+ In a broad sense, an agent can be understood as:
+
+ ```text
+ Agent = Model + Harness
+ ```
+
+ In the context of coding agents, the model is the LLM responsible for reasoning and generation; the harness is the working system wrapped around the model. It may include:
+
+ - System prompts, project rules, AGENTS.md, skills, and how-to guides.
+ - Code retrieval, language servers, CLI tools, and MCP servers.
+ - Tests, type checks, linting, static analysis, and architectural rules.
+ - AI code reviews, LLM judges, and semantic checks.
+ - CI/CD pipelines, production telemetry, and user journey data.
+
+ A good harness has two goals:
+
+ - Increase the probability of the agent getting it right the first time.
+ - Enable the agent to self-correct more issues based on feedback before reaching humans.
+
+ In other words, the value of a harness is to reduce the human review burden while improving system quality.
+
+ ## Feedforward and Feedback
+
+ Harness controls can be divided into two categories.
+
+ ### Feedforward: Guidance Before Action
+
+ Feedforward controls provide direction before the agent starts working, aiming to prevent errors. Examples include:
+
+ - Coding convention documents.
+ - Architectural principles.
+ - Task breakdown methods.
+ - Project bootstrap guides.
+ - API usage specifications.
+ - Technical decision records (ADRs).
+ - Templates and scaffolds.
+ - Codemods or fixed refactoring tools.
+
+ The effect of feedforward is to increase the probability that the agent produces correct results from the start.
+
+ ### Feedback: Sensing and Correction After Action
+
+ Feedback controls provide signals after the agent performs an action, allowing it to check and self-correct. Examples include:
+
+ - Unit and integration tests.
+ - Type checkers.
+ - Linters.
+ - Architecture fitness tests.
+ - Coverage or mutation testing.
+ - Semgrep or security scans.
+ - AI review agents.
+ - Browser smoke tests.
+ - Production logs, SLOs, error rates, and latency.
+
+ The effect of feedback is to ensure the agent is not just "constrained by rules" but knows whether its output is actually effective.
+
+ Without feedforward, the agent is more likely to repeat avoidable mistakes; without feedback, the team won't know whether the rules are actually having an effect. A good harness requires both.
+
+ ## Computational and Inferential
+
+ The articles also divide guides and sensors into two execution types.
+
+ ### Computational Controls
+
+ Computational controls are deterministic, fast, and cheap machine checks, usually executed by the CPU. Examples include:
+
+ - Type checks.
+ - Unit tests.
+ - Linting.
+ - Static analysis.
+ - Structural tests.
+ - Codemods.
+ - Dependency checks.
+
+ The advantages of these controls are stability, low cost, and repeatability, making them suitable for execution before every change, commit, or in the CI pipeline.
+
105
+ ### Inferential Controls
106
+
107
+ Inferential controls are AI or LLM-based checks that require semantic judgment. Examples include:
108
+
109
+ - AI code reviews.
110
+ - LLM-as-a-judge.
111
+ - Commentary on architectural soundness.
112
+ - Commentary on test quality.
113
+ - Detection of duplicate logic or over-engineering.
114
+ - Inferring potential issues from logs.
115
+
116
+ These controls can handle problems that traditional tools find difficult, but they are more expensive, slower, and less stable. They are suitable for high-value, high-risk, or late-stage checks, rather than replacing all deterministic checks.
117
+
118
+ ## Three Types of Harnesses
119
+
120
+ ### 1. Maintainability Harness
121
+
122
+ The maintainability harness is used to maintain internal code quality. This is currently the easiest type to implement because existing tools are very mature.
123
+
124
+ Issues covered include:
125
+
126
+ - Duplicate code.
127
+ - Excessive complexity.
128
+ - Missing tests.
129
+ - Inconsistent naming or formatting.
130
+ - Violated architectural boundaries.
131
+ - Style guide violations.
132
+
133
+ Computational sensors are very effective for these issues. Inferential sensors can complement them at the semantic level, such as detecting "logic that looks different but is actually duplicate," "tests that exist but don't verify the core point," or "fixes that are too brute-force or over-engineered."
134
+
135
+ However, a maintainability harness cannot fully solve requirement misunderstandings, incorrect diagnoses, or scope creep. These still require clearer specifications and human judgment.
136
+
137
+ ### 2. Architecture Fitness Harness
138
+
139
+ The architecture fitness harness is used to maintain architectural characteristics and system quality attributes. Examples include:
140
+
141
+ - Module boundaries.
142
+ - Consistency in API design.
143
+ - Performance requirements.
144
+ - Observability standards.
145
+ - Security and compliance requirements.
146
+ - Deployment and operational constraints.
147
+
148
+ Its feedforward might be architectural documents, ADRs, logging standards, or performance budgets; feedback might be architecture tests, performance tests, contract tests, observability reviews, or deployment checks.
149
+
150
+ The focus of this harness is to translate "what our system looks like" into rules that an agent can read, follow, and be checked against by tools.
151
+
152
+ ### 3. Behaviour Harness
153
+
154
+ The behavior harness is used to confirm that software functional behavior meets requirements. This is currently the most difficult type.
155
+
156
+ Common practices include:
157
+
158
+ - Feedforward: Providing the agent with functional specifications, user stories, and acceptance criteria.
159
+ - Feedback: Letting the agent generate and execute tests, supplemented by coverage, mutation testing, or manual testing.
160
+
161
+ The problem is that AI-generated tests themselves are not necessarily trustworthy. Tests might only verify implementation details rather than actual requirements. The articles mention "approved fixtures" as a pattern that can improve test quality in some scenarios, but it's not a universal solution for all systems.
162
+
163
+ Therefore, functional correctness remains the area that most needs human product judgment and high-quality specifications.
164
+
165
+ ## Harnessability: Not All Systems Are Equally Easy to Harness
166
+
167
+ Whether a codebase can be stably operated by an agent depends on whether it possesses enough structural clues.
168
+
169
+ Systems that are easier to harness usually have:
170
+
171
+ - Strongly typed languages and strict type checking.
172
+ - Clear module boundaries.
173
+ - Consistent frameworks and directory structures.
174
+ - Automatable tests.
175
+ - Clear coding conventions.
176
+ - Architectural rules checkable by tools.
177
+
178
+ Systems that are harder to harness usually have:
179
+
180
+ - High technical debt.
181
+ - Too many implicit rules.
182
+ - Vague architectural boundaries.
183
+ - Insufficient tests.
184
+ - A requirement for significant tribal knowledge.
185
+ - Highly mixed frameworks and styles.
186
+
187
+ This means greenfield projects can be designed for harnessability from day one, while legacy projects need to add structure, tests, and rules before safely increasing agent autonomy.
188
+
189
+ ## Harness Templates
190
+
191
+ Large organizations often have several common types of services, such as:
192
+
193
+ - CRUD business services.
194
+ - Event processors.
195
+ - Data dashboards.
196
+ - API gateways.
197
+ - Batch jobs.
198
+
199
+ If these types already have service templates, they can evolve into harness templates. That is, each service type comes built-in with:
200
+
201
+ - Standard directory structure.
202
+ - Tech stack and dependencies.
203
+ - Generation and modification guides.
204
+ - Test templates.
205
+ - Architectural rules.
206
+ - CI pipelines.
207
+ - Security and observability checks.
208
+ - Agent review guides.
209
+
210
+ This narrows the space in which the agent can arbitrarily generate, making control more feasible. The downside is that the template itself requires version management, synchronization, and governance; otherwise, projects will gradually diverge from the upstream standard.
211
+
212
+ ## Three Human Positions in the Loop
213
+
214
+ The second article provides another important model: where should humans stand in the loop?
215
+
216
+ ### Humans Outside the Loop
217
+
218
+ This leaves humans in the "why loop" and lets the agent handle the "how loop." Humans only describe the desired result, and the agent decides how to construct the software.
219
+
220
+ This approach is close to what's often called "vibe coding." Its appeal lies in speed and minimal intervention, but the risk is that internal quality, maintainability, cost, security, and compliance may spiral out of control.
221
+
222
+ The articles remind us: we care about internal quality not because code is sacred, but because internal quality affects external outcomes. A chaotic codebase makes agents slower, more likely to get lost, and harder to modify reliably.
223
+
224
+ ### Humans In the Loop
225
+
226
+ This keeps humans in the lowest-level coding loop, reviewing the agent's code line-by-line.
227
+
228
+ This approach leverages human experience and judgment, especially when the agent gets stuck, misunderstands, or repeatedly fixes things incorrectly. The problem is that humans become the bottleneck. Agents generate code much faster than humans can review it line-by-line. If all quality assurance relies on manual review, the efficiency gains of AI are neutralized.
229
+
230
+ ### Humans On the Loop
231
+
232
+ The articles advocate for "humans on the loop": humans do not control the agent line-by-line, nor do they exit completely; instead, they design, manage, and improve the agent's work loop.
233
+
234
+ The difference can be understood as:
235
+
236
+ - In the loop: Seeing an incorrect agent output and fixing it directly, or telling the agent to fix that specific output.
237
+ - On the loop: Seeing an incorrect agent output and going back to modify the harness so that future similar outputs are more likely to be correct.
238
+
239
+ This turns every error into an opportunity to improve the system, rather than just fixing a one-off problem.
240
+
241
+ ## Agentic Flywheel
242
+
243
+ A more advanced state is the "agentic flywheel": humans not only ask the agent to build features but also ask it to analyze the harness's effectiveness and propose improvements.
244
+
245
+ Implementation can start simply:
246
+
247
+ 1. Let the agent read tests, CI results, and review findings.
248
+ 2. Ask the agent to identify failure patterns and recurring issues.
249
+ 3. Have the agent suggest adding or modifying rules, tests, documentation, linting, or pipelines.
250
+ 4. Humans review these suggestions before assigning the agent to implement them.
251
+ 5. As trust increases, let the agent flag risks, costs, and benefits for its suggestions.
252
+ 6. Gradually automate low-risk, high-confidence improvements.
253
+
254
+ Such a flywheel doesn't aim for a one-time "it mostly works," but rather makes the system progressively more capable of self-checking, self-correcting, and self-improving.
255
+
256
+ ## Practical Implications for Engineering Teams
257
+
258
+ ### 1. Don't Understand AI Engineering Only as Prompting Skills
259
+
260
+ Prompts are important, but long-term reliability comes from the entire work system. Teams should view AGENTS.md, skills, specification documents, tests, linting, CI, review processes, and production feedback as a single harness rather than fragmented tools.
261
+
262
+ ### 2. Prioritize Moving Cheap, Stable, and Deterministic Checks Upstream
263
+
264
+ Problems that can be solved with deterministic tooling should not be prioritized for LLM judgment. Type checks, linting, unit tests, formatting, and architectural rules should run as early as possible, ideally during the agent's self-correction phase.
265
+
266
+ ### 3. Focus Human Attention on High-Value Judgments
267
+
268
+ The most valuable place for humans is not reading every line of code, but judging:
269
+
270
+ - Whether requirements are clear.
271
+ - Whether the product direction is correct.
272
+ - Whether architectural trade-offs are reasonable.
273
+ - Whether risks are acceptable.
274
+ - Which technical debts are intentional and which are out of control.
275
+ - How the harness should evolve.
276
+
277
+ ### 4. Turn Recurring Errors into Harness Improvements
278
+
279
+ If an agent makes the same mistake multiple times, it shouldn't just be fixed each time it happens. Instead, ask:
280
+
281
+ - Is a feedforward document missing?
282
+ - Do tests need to be added?
283
+ - Can a lint rule or static analysis be added?
284
+ - Does AGENTS.md or a skill need to be updated?
285
+ - Does a project-specific review checklist need to be created?
286
+ - Can manual knowledge be turned into an automated check?
287
+
288
+ ### 5. Be Cautious About Functional Behavior
289
+
290
+ Functional correctness is the hardest to automate fully. AI-generated tests cannot be equated directly with trustworthy acceptance. Important features still need clear specifications, acceptance criteria, manual exploratory testing, and, when necessary, domain expert reviews.
291
+
292
+ ## Checklist for Building a Coding Agent Harness
293
+
294
+ ### Phase 1: Establishing Basic Controllability
295
+
296
+ - Create or update AGENTS.md, explaining project rules, workflows, testing methods, and security limits.
297
+ - Ensure the project has one-click commands for install, test, and build.
298
+ - Set up type checking, linting, formatting, and unit tests.
299
+ - Document common task workflows in how-to guides or skills.
300
+ - Ensure the agent clearly knows which files can be modified and which should not be changed manually.
301
+
302
+ ### Phase 2: Strengthening the Feedback Loop
303
+
304
+ - Automatically execute fast tests and type checks after agent modifications.
305
+ - Provide clear correction guidance for common failure messages.
306
+ - Add architectural boundary checks or dependency rules.
307
+ - Add contract tests or approval tests for important modules.
308
+ - Feed CI results back into the agent's correction process.
309
+
310
+ ### Phase 3: Adding Semantic-Level Checks
311
+
312
+ - Establish a code review skill or checklist.
313
+ - Create specialized review processes for architecture, security, performance, and test quality.
314
+ - Let AI reviews find high-risk issues first, rather than replacing all manual reviews.
315
+ - Periodically review false positives and false negatives from AI reviews.
316
+
317
+ ### Phase 4: Establishing the Harness Improvement Loop
318
+
319
+ - Collect common error patterns from the agent.
320
+ - Turn recurring errors into rules, tests, documentation, or tools.
321
+ - Periodically check that AGENTS.md, skills, CI, and linting are consistent with each other.
322
+ - Have the agent suggest harness improvements based on CI, reviews, and production data.
323
+ - Flag risks, costs, benefits, and automation levels for improvement suggestions.
324
+
325
+ ## Observations Applicable to This Project
326
+
327
+ This repository itself is a collection of skills and agent work rules, making it highly suitable for management from a harness engineering perspective.
328
+
329
+ Consider the following directions:
330
+
331
+ - Treat each skill as a feedforward guide: it tells the agent how to act in a specific task.
332
+ - Treat `test/*.test.js`, `bun run typecheck`, and `bun run build` as computational feedback sensors.
333
+ - Treat `src/hooks/secret-guard.ts` and `hooks/hooks.json` as part of the security and governance harness.
334
+ - In the future, add specialized checks for skill quality, such as SKILL.md structure, required sections, valid reference paths, and template consistency.
335
+ - Establish a "skill review checklist" to assist in checking if instructions are clear, too broad, or in conflict with other skills using inferential review.
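A SKILL.md structural check of the kind suggested here could start as a small validator over the file's headings. A sketch; the required section names below are assumptions for illustration, not an established convention of this package:

```typescript
// Report which required sections a SKILL.md body is missing.
// The required list is illustrative; a real check would derive it
// from the project's own skill-authoring conventions.
const REQUIRED_SECTIONS = ["Trigger On", "Core Principle", "Constraints"];

function missingSections(skillMarkdown: string): string[] {
  const headings = new Set(
    skillMarkdown
      .split("\n")
      .filter((line) => line.startsWith("## "))
      .map((line) => line.slice(3).trim()),
  );
  return REQUIRED_SECTIONS.filter((name) => !headings.has(name));
}

const gaps = missingSections(
  "## Trigger On\n- some trigger\n## Core Principle\nsome principle",
);
```

Run across every `skills/*/SKILL.md`, this becomes a computational sensor for skill quality, leaving only clarity and scope questions to inferential review.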
+
+ ## Most Important Conclusion
+
+ The reliability of an AI agent is not produced naturally by model capability alone; it is formed by the combination of the model, context, tools, checks, feedback, and human governance.
+
+ The human role should not be simplified into two extremes: either exiting completely or guarding every line. A more sustainable approach is to stand above the loop, continuously designing and improving the system in which the agent works.
+
+ For teams actually delivering products, harness engineering will become a core engineering capability: gradually transforming human experience, team conventions, architectural judgment, and quality standards into a work environment that an agent can follow, execute, check, and continuously improve.
package/skills/neo-agent-harness/reference/harness-patterns.md ADDED
@@ -0,0 +1,180 @@
+ # Agent Harness Patterns
+
+ Use this reference when designing a complete agent harness plan or roadmap.
+
+ ## Core Model
+
+ An agent harness is the system around the model that guides, checks, and improves agent work.
+
+ ```text
+ Agent = Model + Harness
+ ```
+
+ For coding agents, the useful outer harness includes:
+
+ - Project instructions and task workflows.
+ - Skills, templates, examples, and architectural rules.
+ - Local verification commands and CI stages.
+ - Hooks, linters, static analysis, and security checks.
+ - AI review, architecture review, test-quality review, and human decision points.
+ - Production feedback such as logs, SLOs, user journeys, and support signals.
+
+ The goal is not full human removal. The goal is to reduce repeated review toil and direct human attention to product, risk, architecture, and tradeoff decisions.
+
+ ## Feedforward Guides
+
+ Feedforward guides shape agent output before the agent acts.
+
+ Common examples:
+
+ - `AGENTS.md`, `CLAUDE.md`, `GEMINI.md`, Copilot instructions.
+ - Coding standards and style references.
+ - Architecture docs, ADRs, dependency rules, module maps.
+ - Task-specific skills and how-to guides.
+ - Service templates, scaffolds, fixtures, and examples.
+ - API docs, domain vocabulary, data contracts, and non-functional requirements.
+
+ Good feedforward guides are:
+
+ - Specific enough to prevent repeated mistakes.
+ - Short enough to be loaded often.
+ - Close to the code they govern.
+ - Kept in sync with tests, CI, and review expectations.
+
+ ## Feedback Sensors
+
+ Feedback sensors observe work after the agent acts and provide correction signals.
+
+ Fast computational sensors:
+
+ - Typecheck.
+ - Unit tests and focused integration tests.
+ - Lint and format checks.
+ - Build checks.
+ - Static analysis.
+ - Dependency and secret scanning.
+ - Architecture boundary tests.
+
+ Slower or inferential sensors:
+
+ - AI code review.
+ - Architecture review.
+ - Test-quality review.
+ - Security reasoning review.
+ - Log and incident analysis.
+ - Behaviour or UX review against user journeys.
+
+ Prefer fast deterministic sensors during local development. Reserve inferential sensors for semantic judgment, riskier changes, and later review stages.
+
+ ## Regulation Categories
+
+ ### Maintainability Harness
+
+ Purpose: keep the codebase easy for humans and agents to understand and change.
+
+ Signals:
+
+ - Duplicate code.
+ - Excessive complexity.
+ - Naming or style drift.
+ - Missing or weak tests.
+ - Hard-to-review diffs.
+ - Over-engineered or brute-force fixes.
+
+ Useful controls:
+
+ - Format, lint, typecheck, tests.
+ - Complexity and dependency analysis.
+ - Code review skill.
+ - Refactoring guides and examples.
+
+ ### Architecture Fitness Harness
+
+ Purpose: keep the system aligned with its intended structure and quality attributes.
+
+ Signals:
+
+ - Module boundary violations.
+ - API inconsistency.
+ - Performance regression.
+ - Observability gaps.
+ - Security or compliance drift.
+ - Deployment fragility.
+
+ Useful controls:
+
+ - Architecture docs and ADRs.
+ - Fitness functions and structural tests.
+ - Contract tests.
+ - Performance budgets.
+ - Logging and tracing standards.
+ - CI gates for risky paths.
+
113
+ ### Behaviour Harness
114
+
115
+ Purpose: verify that the product does what stakeholders need.
116
+
117
+ Signals:
118
+
119
+ - Ambiguous requirements.
120
+ - Tests that only mirror implementation.
121
+ - Missing acceptance criteria.
122
+ - Broken user journeys.
123
+ - Manual test steps that are not captured.
124
+
125
+ Useful controls:
126
+
127
+ - User stories with acceptance criteria.
128
+ - Approved fixtures or golden examples where appropriate.
129
+ - End-to-end smoke tests for critical flows.
130
+ - Domain expert review for high-impact behaviour.
131
+ - Manual exploration checklist when automation is weak.
132
+
133
+ Do not assume AI-generated tests prove behaviour correctness. Behaviour harnesses need clear human intent and high-quality examples.
134
+
135
+ ## Harnessability Assessment
136
+
137
+ Rate harnessability by evidence:
138
+
139
+ - **High**: clear structure, strong typing, reliable tests, CI, documented rules, repeatable build, clear ownership.
140
+ - **Medium**: some tests and rules exist, but coverage, architecture docs, or validation commands are incomplete.
141
+ - **Low**: weak tests, unclear structure, hidden conventions, inconsistent frameworks, no reliable local checks, or high technical debt.
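One way to make such an evidence-based rating repeatable is to count concrete signals. The evidence fields and thresholds below are illustrative assumptions, not a formal model; the 低/中/高 labels match the output format the skill specifies:

```typescript
interface RepoEvidence {
  hasTypecheck: boolean;
  hasReliableTests: boolean;
  hasCI: boolean;
  hasDocumentedRules: boolean;
  hasRepeatableBuild: boolean;
}

type Rating = "低" | "中" | "高";

// Count positive evidence signals; the thresholds are illustrative only.
function rateHarnessability(e: RepoEvidence): Rating {
  const score = Object.values(e).filter(Boolean).length;
  if (score >= 4) return "高";
  if (score >= 2) return "中";
  return "低";
}

const rating = rateHarnessability({
  hasTypecheck: true,
  hasReliableTests: false,
  hasCI: true,
  hasDocumentedRules: false,
  hasRepeatableBuild: true,
});
```

The real assessment should still cite the repository evidence behind each boolean, as the skill's output format requires.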
+
+ Common improvement sequence:
+
+ 1. Make setup, test, typecheck, and build commands explicit.
+ 2. Document project rules in the agent instruction file.
+ 3. Add fast computational checks before adding AI review.
+ 4. Add architecture or domain-specific sensors for repeated failures.
+ 5. Add inferential review for semantic quality and risk.
+ 6. Feed recurring failures back into guides, tests, hooks, or skills.
+
+ ## Maturity Roadmap
+
+ ### Level 1: Basic Control
+
+ - Agent can find project rules.
+ - Local validation commands are documented.
+ - Secret and destructive-action rules are clear.
+ - Human reviews all significant changes.
+
+ ### Level 2: Fast Self-Correction
+
+ - Agent runs focused tests, typecheck, lint, and build after changes.
+ - Failure messages are actionable.
+ - Repeated mistakes become guides or checks.
+ - CI repeats the local quality gates.
+
+ ### Level 3: Architecture-Aware Development
+
+ - Agent understands module boundaries and service topology.
+ - Architecture fitness checks catch drift.
+ - Review process checks performance, observability, and deployability.
+ - Templates encode common project shapes.
+
+ ### Level 4: Harness Improvement Loop
+
+ - Agent reviews CI, review comments, and incidents to propose harness improvements.
+ - Humans prioritize and approve harness changes.
+ - Low-risk improvements can be automated after repeated success.
+ - Harness quality itself is reviewed regularly.