claude-flow-novice 1.5.21 → 1.5.22
- package/.claude/agents/CLAUDE.md +186 -2386
- package/.claude/agents/agent-principles/agent-type-guidelines.md +328 -0
- package/.claude/agents/agent-principles/format-selection.md +204 -0
- package/.claude/agents/agent-principles/prompt-engineering.md +371 -0
- package/.claude/agents/agent-principles/quality-metrics.md +294 -0
- package/.claude/agents/frontend/README.md +574 -53
- package/.claude/agents/frontend/interaction-tester.md +850 -108
- package/.claude/agents/frontend/react-frontend-engineer.md +130 -0
- package/.claude/agents/frontend/state-architect.md +240 -152
- package/.claude/agents/frontend/ui-designer.md +292 -68
- package/.claude/agents/researcher.md +1 -1
- package/.claude/agents/swarm/test-coordinator.md +383 -0
- package/.claude/agents/task-coordinator.md +126 -0
- package/.claude/settings.json +7 -7
- package/.claude-flow-novice/dist/src/hooks/enhanced-hooks-cli.js +168 -167
- package/.claude-flow-novice/dist/src/providers/tiered-router.js +118 -0
- package/.claude-flow-novice/dist/src/providers/tiered-router.js.map +1 -0
- package/.claude-flow-novice/dist/src/providers/types.js.map +1 -1
- package/.claude-flow-novice/dist/src/providers/zai-provider.js +268 -0
- package/.claude-flow-novice/dist/src/providers/zai-provider.js.map +1 -0
- package/package.json +1 -1
- package/src/cli/simple-commands/init/templates/CLAUDE.md +25 -0
- package/src/hooks/enhanced-hooks-cli.js +23 -3
- package/src/hooks/enhanced-post-edit-pipeline.js +154 -75
- /package/.claude/agents/{CLAUDE_AGENT_DESIGN_PRINCIPLES.md → agent-principles/CLAUDE_AGENT_DESIGN_PRINCIPLES.md} +0 -0

package/.claude/agents/agent-principles/prompt-engineering.md
@@ -0,0 +1,371 @@
# Prompt Engineering Best Practices

**Version:** 2.0.0
**Last Updated:** 2025-09-30

## Core Principles

Effective agent prompts need clear structure, precise wording, and a level of detail matched to task complexity.

---

## 1. Clear Role Definition

```yaml
GOOD:
  "You are a senior Rust developer specializing in concurrent programming"

BAD:
  "You write code"

WHY:
  - Clear expertise domain
  - Sets expectations for quality
  - Activates relevant knowledge
```

---

## 2. Specific Responsibilities

```yaml
GOOD:
  - Implement lock-free data structures using atomics
  - Ensure memory safety with proper synchronization
  - Write linearizability tests using loom

BAD:
  - Write concurrent code
  - Make it safe

WHY:
  - Concrete and actionable
  - Measurable outcomes
  - Clear scope
```

---

## 3. Appropriate Tool Selection

```yaml
Essential Tools:
  - Read: Required for all agents (must read before editing)
  - Write: For creating new files
  - Edit: For modifying existing files
  - Bash: For running commands
  - Grep: For searching code
  - Glob: For finding files
  - TodoWrite: For task tracking

Optional Tools:
  - WebSearch: For research agents
  - Task: For coordinator agents (spawning sub-agents)

AVOID:
  - Giving unnecessary tools
  - Restricting essential tools
```

---

## 4. Integration Points

```yaml
GOOD:
  Collaboration:
    - Architect: Provides design constraints
    - Reviewer: Validates implementation
    - Tester: Ensures correctness

BAD:
  "Works with other agents"

WHY:
  - Specific integration contracts
  - Clear handoff points
  - Defined outputs/inputs
```

---

## 5. Validation and Hooks

### Mandatory Post-Edit Validation

**CRITICAL**: After **EVERY** file edit operation, run the post-edit hook:

```bash
npx claude-flow@alpha hooks post-edit [FILE_PATH] --memory-key "agent/step" --structured
```

**Benefits:**
- TDD compliance checking
- Security analysis (XSS, eval, credentials)
- Formatting validation
- Coverage analysis
- Actionable recommendations

**Rationale:**
- Ensures quality gates
- Provides immediate feedback
- Coordinates with other agents via memory
- Maintains system-wide standards

---

## 6. Anti-Patterns to Avoid

### ❌ Over-Specification (Tunnel Vision)

```markdown
BAD (for complex tasks):

## Strict Algorithm

1. ALWAYS use bubble sort for sorting
2. NEVER use built-in sort functions
3. MUST iterate exactly 10 times
4. Check each element precisely in this order: [detailed steps]

WHY BAD:
- Prevents optimal solutions
- Ignores context-specific needs
- Reduces AI reasoning ability
- May enforce suboptimal patterns
```

### ❌ Under-Specification (Too Vague)

```markdown
BAD (for basic tasks):

## Implementation

Write some code that works.

WHY BAD:
- No guidance on patterns
- Unclear success criteria
- High iteration count
- Inconsistent quality
```

### ❌ Example Overload

```markdown
BAD (for complex tasks):

[50 code examples of every possible pattern]

WHY BAD:
- Cognitive overload
- Priming bias
- Reduces creative problem-solving
- Makes prompt harder to maintain
```

### ❌ Rigid Checklists

```markdown
BAD (for architecture):

You MUST:
[ ] Use exactly these 5 patterns
[ ] Never deviate from this structure
[ ] Follow these steps in exact order
[ ] Use only these technologies

WHY BAD:
- Context-insensitive
- Prevents trade-off analysis
- Enforces solutions before understanding problems
```

---

## Agent Profile Structure

### Required Frontmatter (YAML)

```yaml
---
name: agent-name # REQUIRED: Lowercase with hyphens
description: | # REQUIRED: Clear, keyword-rich description
  MUST BE USED when [primary use case].
  Use PROACTIVELY for [specific scenarios].
  ALWAYS delegate when user asks [trigger phrases].
  Keywords - [comma-separated keywords for search]
tools: [Read, Write, Edit, Bash, TodoWrite] # REQUIRED: Comma-separated list
model: sonnet # REQUIRED: sonnet | opus | haiku
color: seagreen # REQUIRED: Visual identifier
type: specialist # OPTIONAL: specialist | coordinator | swarm
capabilities: # OPTIONAL: Array of capability tags
  - rust
  - error-handling
  - concurrent-programming
lifecycle: # OPTIONAL: Hooks for agent lifecycle
  pre_task: "npx claude-flow@alpha hooks pre-task"
  post_task: "npx claude-flow@alpha hooks post-task"
hooks: # OPTIONAL: Integration points
  memory_key: "agent-name/context"
  validation: "post-edit"
triggers: # OPTIONAL: Automatic activation patterns
  - "build rust"
  - "implement concurrent"
constraints: # OPTIONAL: Limitations and boundaries
  - "Do not modify production database"
  - "Require approval for breaking changes"
---
```
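
The REQUIRED/OPTIONAL annotations above lend themselves to a mechanical check. The following is a minimal sketch, assuming the frontmatter has already been parsed into a plain object by some YAML library. `validateFrontmatter`, its error strings, and the decision to allow digits in `name` are illustrative assumptions, not part of the package.

```javascript
// Hypothetical helper (not part of claude-flow-novice): checks a parsed
// frontmatter object against the REQUIRED fields listed above.
const REQUIRED_FIELDS = ["name", "description", "tools", "model", "color"];
const ALLOWED_MODELS = ["sonnet", "opus", "haiku"];
const ALLOWED_TYPES = ["specialist", "coordinator", "swarm"];

function validateFrontmatter(fm) {
  const errors = [];

  // Every required field must be present and non-empty.
  for (const field of REQUIRED_FIELDS) {
    const value = fm[field];
    const missing =
      value === undefined ||
      value === null ||
      value === "" ||
      (Array.isArray(value) && value.length === 0);
    if (missing) errors.push(`missing required field: ${field}`);
  }

  // "Lowercase with hyphens", e.g. "rust-coder" (digits allowed by assumption).
  if (typeof fm.name === "string" && fm.name !== "" &&
      !/^[a-z0-9]+(-[a-z0-9]+)*$/.test(fm.name)) {
    errors.push("name must be lowercase with hyphens");
  }

  // model and type are restricted to the values named in the comments above.
  if (fm.model !== undefined && !ALLOWED_MODELS.includes(fm.model)) {
    errors.push(`model must be one of: ${ALLOWED_MODELS.join(", ")}`);
  }
  if (fm.type !== undefined && !ALLOWED_TYPES.includes(fm.type)) {
    errors.push(`type must be one of: ${ALLOWED_TYPES.join(", ")}`);
  }

  return errors; // an empty array means the profile passes these checks
}
```

Returning a list of messages rather than throwing on the first problem lets a reviewer see every issue in a profile at once.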

### Body Structure

````markdown
# Agent Name

[Opening paragraph: WHO you are, WHAT you do]

## 🚨 MANDATORY POST-EDIT VALIDATION

**CRITICAL**: After **EVERY** file edit operation, you **MUST** run:

```bash
npx claude-flow@alpha hooks post-edit [FILE_PATH] --memory-key "agent/step" --structured
```

[Why this matters and what it provides]

## Core Responsibilities

[Primary duties in clear, actionable bullet points]

## Approach & Methodology

[HOW the agent accomplishes tasks - frameworks, patterns, decision-making]

## Integration & Collaboration

[How this agent works with other agents and the broader system]

## Examples & Best Practices

[Concrete examples showing the agent in action]

## Success Metrics

[How to measure agent effectiveness]
````

---

## Integration with Claude Flow

### Hook System Integration

Every agent should integrate with the Claude Flow hook system for coordination:

#### 1. Pre-Task Hook

```bash
npx claude-flow@alpha hooks pre-task --description "Implementing authentication system"
```

**Purpose:**
- Initialize task context
- Set up memory namespace
- Log task start
- Coordinate with other agents

#### 2. Post-Edit Hook (MANDATORY)

```bash
npx claude-flow@alpha hooks post-edit src/auth/login.rs \
  --memory-key "coder/auth/login" \
  --structured
```

**Purpose:**
- Validate TDD compliance
- Run security analysis
- Check code formatting
- Analyze test coverage
- Store results in shared memory
- Provide actionable recommendations

**Output Includes:**
- ✅/❌ Compliance status
- 🔒 Security findings
- 🎨 Formatting issues
- 📊 Coverage metrics
- 🤖 Improvement suggestions

#### 3. Post-Task Hook

```bash
npx claude-flow@alpha hooks post-task --task-id "auth-implementation"
```

**Purpose:**
- Finalize task
- Export metrics
- Update coordination state
- Trigger downstream agents

#### 4. Session Management

```bash
# Restore session context
npx claude-flow@alpha hooks session-restore --session-id "swarm-auth-2025-09-30"

# End session and export metrics
npx claude-flow@alpha hooks session-end --export-metrics true
```

---

## Memory Coordination

Agents share context through the memory system:

```bash
# Store context for other agents
npx claude-flow@alpha memory store \
  --key "architect/design/decision" \
  --value '{"pattern": "microservices", "rationale": "..."}'

# Retrieve context from other agents
npx claude-flow@alpha memory retrieve \
  --key "architect/design/decision"
```

**Memory Key Patterns:**
```
{agent-type}/{domain}/{aspect}

Examples:
- architect/auth/design
- coder/auth/implementation
- reviewer/auth/feedback
- tester/auth/coverage
```
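
A small helper can keep keys consistent with the `{agent-type}/{domain}/{aspect}` pattern. This is a sketch only: `buildMemoryKey`, `parseMemoryKey`, and the rule that each segment is lowercase letters, digits, and hyphens are assumptions made for illustration. Shorter keys such as `agent/step`, which appear in other examples in this document, would be rejected by this strict three-segment check.

```javascript
// Hypothetical helpers for the {agent-type}/{domain}/{aspect} convention.
// Segment rule (an assumption): lowercase letters, digits, and hyphens.
const SEGMENT = /^[a-z0-9][a-z0-9-]*$/;

// Joins three validated segments into a memory key.
function buildMemoryKey(agentType, domain, aspect) {
  const parts = [agentType, domain, aspect];
  for (const part of parts) {
    if (typeof part !== "string" || !SEGMENT.test(part)) {
      throw new Error(`invalid memory key segment: ${JSON.stringify(part)}`);
    }
  }
  return parts.join("/");
}

// Splits a key back into its segments, or returns null if it does not
// follow the three-segment pattern.
function parseMemoryKey(key) {
  const parts = String(key).split("/");
  if (parts.length !== 3 || !parts.every((p) => SEGMENT.test(p))) return null;
  const [agentType, domain, aspect] = parts;
  return { agentType, domain, aspect };
}
```

For example, `buildMemoryKey("reviewer", "auth", "feedback")` yields `reviewer/auth/feedback`, matching the third example in the list above.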

---

## Swarm Coordination

When a coordinator spawns multiple agents concurrently, each agent follows the same hook sequence:

```javascript
// Coordinator spawns specialist agents
Task("Rust Coder", "Implement auth with proper error handling", "coder")
Task("Unit Tester", "Write comprehensive tests for auth", "tester")
Task("Code Reviewer", "Review auth implementation", "reviewer")

// Each agent MUST:
// 1. Run pre-task hook
// 2. Execute work
// 3. Run post-edit hook for each file
// 4. Store results in memory
// 5. Run post-task hook
```
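
The five-step sequence in the comments can be made concrete by listing the commands one agent would issue, in order. The sketch below only builds command strings and does not execute anything. `planAgentCommands` and `shellQuote` are hypothetical names, and the CLI subcommands and flags are the ones shown earlier in this document rather than a verified interface.

```javascript
// Hypothetical planner: returns the shell commands one agent would run,
// in the order required above. It builds strings only; nothing is executed.
function shellQuote(value) {
  // POSIX single-quote escaping: each ' becomes '\''
  return "'" + String(value).replace(/'/g, "'\\''") + "'";
}

function planAgentCommands({ description, taskId, memoryKey, editedFiles, result }) {
  const cli = "npx claude-flow@alpha";
  const commands = [];

  // 1. Pre-task hook
  commands.push(`${cli} hooks pre-task --description ${shellQuote(description)}`);

  // 2. The agent does its work here, then
  // 3. runs the post-edit hook once per edited file
  for (const file of editedFiles) {
    commands.push(
      `${cli} hooks post-edit ${shellQuote(file)} --memory-key ${shellQuote(memoryKey)} --structured`
    );
  }

  // 4. Store results in memory for other agents
  commands.push(
    `${cli} memory store --key ${shellQuote(memoryKey)} --value ${shellQuote(JSON.stringify(result))}`
  );

  // 5. Post-task hook
  commands.push(`${cli} hooks post-task --task-id ${shellQuote(taskId)}`);

  return commands;
}
```

Generating the list up front makes it easy to check that no file edit is missing its post-edit hook before anything runs.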

package/.claude/agents/agent-principles/quality-metrics.md
@@ -0,0 +1,294 @@
# Quality Metrics & Validation

**Version:** 2.0.0
**Last Updated:** 2025-09-30

## Measuring Agent Effectiveness

### 1. Quantitative Metrics

```yaml
Code Quality:
  compilation_success_rate: "First-time compile success"
  test_pass_rate: "Tests passing on first run"
  coverage: "Code coverage percentage"
  performance: "Execution time vs baseline"
  idiomaticity_score: "Language-specific best practices"

Process Metrics:
  iteration_count: "Revisions needed to complete task"
  time_to_completion: "Duration from start to finish"
  error_rate: "Errors encountered during execution"

Agent-Specific:
  architect_score: "Design quality assessment"
  reviewer_score: "Issues found / total issues"
  tester_score: "Bug catch rate"
```

### 2. Qualitative Metrics

```yaml
Code Review Criteria:
  - Readability: Easy to understand
  - Maintainability: Easy to modify
  - Correctness: Works as intended
  - Safety: No security vulnerabilities
  - Performance: Meets efficiency requirements

Architecture Criteria:
  - Scalability: Can grow with demand
  - Flexibility: Adapts to changing requirements
  - Simplicity: No unnecessary complexity
  - Documentation: Well-explained decisions
```

---

## Validation Checklist

Use these checklists before deploying an agent and while monitoring it afterwards:

### Pre-Deployment Validation

```markdown
## Agent Profile Validation

### Structure ✓
- [ ] Valid YAML frontmatter
- [ ] All required fields present (name, description, tools, model, color)
- [ ] Clear role definition in opening paragraph
- [ ] Appropriate section structure

### Format Selection ✓
- [ ] Format matches task complexity (Basic→Code-Heavy, Medium→Metadata, Complex→Minimal)
- [ ] Length appropriate (Minimal: 200-400, Metadata: 400-700, Code-Heavy: 700-1200)
- [ ] Examples present and relevant (for Code-Heavy)
- [ ] Structure/metadata present (for Metadata)

### Content Quality ✓
- [ ] Clear responsibilities defined
- [ ] Approach/methodology explained
- [ ] Integration points specified
- [ ] Success metrics defined
- [ ] Post-edit validation hook included

### Language-Specific ✓
- [ ] If Rust: Format validated against benchmark findings
- [ ] If other language: Format choice documented as hypothesis
- [ ] Language-specific patterns included (for Code-Heavy)
- [ ] Idiomatic code examples (for Code-Heavy)

### Testing ✓
- [ ] Agent tested on representative tasks
- [ ] Quality metrics meet targets
- [ ] Integration with hooks verified
- [ ] Collaboration with other agents confirmed
```

### Post-Deployment Monitoring

```markdown
## Ongoing Validation

### Performance Tracking
- [ ] Monitor iteration counts
- [ ] Track first-time success rate
- [ ] Measure time to completion
- [ ] Collect user feedback

### Quality Assurance
- [ ] Review output quality regularly
- [ ] Check adherence to format guidelines
- [ ] Validate tool usage patterns
- [ ] Assess collaboration effectiveness

### Continuous Improvement
- [ ] Document failure modes
- [ ] Refine based on metrics
- [ ] Update with new patterns
- [ ] Validate format choice periodically
```

---

## Benchmark System

### Running Agent Benchmarks

```bash
cd benchmark/agent-benchmarking

# Run Rust benchmarks (VALIDATED)
node index.js run 5 --rust --verbose

# Run JavaScript benchmarks (HYPOTHESIS)
node index.js run 5 --verbose

# Run specific scenario
node index.js run 3 --rust --scenario=rust-01-basic

# List available scenarios
node index.js list --scenarios --rust

# Analyze results
node index.js analyze
```

### Interpreting Results

```yaml
Quality Score Breakdown:
  Correctness (30%):
    - Basic functionality works
    - Edge cases handled
    - Error conditions managed

  Idiomaticity (25%):
    - Language best practices
    - Proper pattern usage
    - Efficient algorithms

  Code Quality (20%):
    - Readability
    - Documentation
    - Naming conventions

  Testing (15%):
    - Test coverage
    - Assertion quality
    - Edge case tests

  Performance (10%):
    - Execution efficiency
    - Memory usage
    - Optimization
```
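
The percentages above read as weights in a weighted average. As a minimal sketch, assuming each category is scored on a 0 to 100 scale (the document does not state the scale), the overall score could be computed like this. `qualityScore` and the camelCase category keys are illustrative names, not the benchmark's actual code.

```javascript
// Weights in percent, taken from the breakdown above.
const WEIGHTS = {
  correctness: 30,
  idiomaticity: 25,
  codeQuality: 20,
  testing: 15,
  performance: 10,
};

// Hypothetical scorer: expects one 0-100 score per category (an assumption)
// and returns the weighted average on the same 0-100 scale.
function qualityScore(scores) {
  let weightedSum = 0;
  for (const [category, weight] of Object.entries(WEIGHTS)) {
    const value = scores[category];
    if (typeof value !== "number" || value < 0 || value > 100) {
      throw new RangeError(`score for ${category} must be a number in [0, 100]`);
    }
    weightedSum += weight * value;
  }
  return weightedSum / 100; // the weights sum to 100
}
```

With scores of 90, 80, 70, 60, and 50 in the order listed, the result is (30·90 + 25·80 + 20·70 + 15·60 + 10·50) / 100 = 75, which falls short of the ">85%" quality target named later in this file.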

### Statistical Significance

```yaml
ANOVA Analysis:
  f_statistic: "Ratio of between-group variance to within-group variance"
  p_value: "Probability of a result at least this extreme if the groups do not actually differ"
  significant_if: "p < 0.05"

Effect Size (Cohen's d):
  negligible: "d < 0.2"
  small: "0.2 ≤ d < 0.5"
  medium: "0.5 ≤ d < 0.8"
  large: "d ≥ 0.8"
```
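
To make the effect-size bands concrete, here is a sketch that computes Cohen's d for two groups of scores using the standard pooled standard deviation and maps the result to the labels above. The document does not say which variant the benchmark tool uses, so treat `cohensD` and `effectSizeLabel` as illustrative. The labels are applied to the absolute value of d, so the direction of the difference does not matter.

```javascript
// Arithmetic mean of an array of numbers.
function mean(xs) {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Unbiased sample variance (divides by n - 1).
function sampleVariance(xs) {
  const m = mean(xs);
  return xs.reduce((sum, x) => sum + (x - m) ** 2, 0) / (xs.length - 1);
}

// Cohen's d with the pooled standard deviation of the two groups.
function cohensD(a, b) {
  if (a.length < 2 || b.length < 2) {
    throw new Error("each group needs at least two observations");
  }
  const pooledVariance =
    ((a.length - 1) * sampleVariance(a) + (b.length - 1) * sampleVariance(b)) /
    (a.length + b.length - 2);
  if (pooledVariance === 0) {
    throw new Error("pooled variance is zero; d is undefined");
  }
  return (mean(a) - mean(b)) / Math.sqrt(pooledVariance);
}

// Maps |d| to the bands listed above.
function effectSizeLabel(d) {
  const size = Math.abs(d);
  if (size < 0.2) return "negligible";
  if (size < 0.5) return "small";
  if (size < 0.8) return "medium";
  return "large";
}
```

For the groups `[2, 4, 6]` and `[1, 3, 5]`, the means are 4 and 3 and the pooled standard deviation is 2, so d = 0.5, which lands at the bottom of the "medium" band.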

---

## Continuous Improvement

### Metrics to Track

```yaml
Agent Performance Metrics:
  first_time_success_rate:
    target: ">80%"
    measure: "Compiles/runs on first attempt"

  iteration_count:
    target: "<3"
    measure: "Revisions needed to complete"

  quality_score:
    target: ">85%"
    measure: "Benchmark quality assessment"

  user_satisfaction:
    target: ">4.5/5"
    measure: "Feedback from users"
```
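
These targets can be checked automatically once metrics are collected. The sketch below assumes rates are reported as percentages (so 82 means 82%) and satisfaction on a 5-point scale. `missedTargets` is a hypothetical helper and the metric names simply mirror the keys above.

```javascript
// Target predicates copied from the table above. Units are assumptions:
// percentages for the two rates, a 0-5 scale for satisfaction.
const TARGETS = {
  first_time_success_rate: (value) => value > 80,
  iteration_count: (value) => value < 3,
  quality_score: (value) => value > 85,
  user_satisfaction: (value) => value > 4.5,
};

// Returns the names of metrics that are missing or fail their target.
function missedTargets(metrics) {
  return Object.keys(TARGETS).filter((name) => {
    const value = metrics[name];
    return typeof value !== "number" || !TARGETS[name](value);
  });
}
```

Because the targets are strict inequalities, a quality score of exactly 85 still counts as missed.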

### Feedback Loop

1. **Collect Data**: Track metrics for each agent usage
2. **Analyze**: Identify patterns in failures or low quality
3. **Hypothesize**: Determine likely causes
4. **Experiment**: Adjust agent format or content
5. **Validate**: Test changes with benchmark system
6. **Deploy**: Update agent if improvements confirmed
7. **Monitor**: Continue tracking metrics

---

## Success Criteria by Agent Type

### Coder Agents

- [ ] Code compiles without warnings
- [ ] All functions have documentation
- [ ] Error handling uses proper patterns (no .unwrap() in Rust)
- [ ] Tests cover >85% of code
- [ ] Idiomatic language usage
- [ ] Proper resource management

### Reviewer Agents

- [ ] Issues identified before production
- [ ] Suggestions are actionable and specific
- [ ] Feedback explains "why" not just "what"
- [ ] Team learns from feedback
- [ ] Security vulnerabilities caught
- [ ] Performance issues identified

### Architect Agents

- [ ] Architecture meets quality attributes
- [ ] Team can implement the design
- [ ] Documentation is clear and comprehensive
- [ ] Trade-offs are explicitly documented
- [ ] ADRs (Architecture Decision Records) created
- [ ] Stakeholder requirements satisfied

### Tester Agents

- [ ] Test coverage meets targets (85% unit, 70% integration)
- [ ] Tests are comprehensive (happy path, error cases, edge cases)
- [ ] Test code is maintainable
- [ ] Assertions are meaningful
- [ ] Performance tests where applicable
- [ ] Integration tests validate contracts

### DevOps Agents

- [ ] Pipelines execute successfully
- [ ] Deployment process is automated
- [ ] Rollback strategy is in place
- [ ] Monitoring and alerting configured
- [ ] Security scans integrated
- [ ] Documentation updated

---

## Quality Gates

### Blocking Issues (Must Fix)

- Compilation errors
- Test failures
- Security vulnerabilities (high/critical)
- Missing required documentation
- Code coverage below threshold
- Lint/format errors

### Non-Blocking Issues (Should Fix)

- Performance warnings
- Code style inconsistencies
- Missing optional documentation
- Low test coverage (but above minimum)
- Minor security issues

### Advisory (Nice to Have)

- Optimization opportunities
- Refactoring suggestions
- Additional test cases
- Enhanced documentation
- Improved naming
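
The three tiers above amount to a lookup from finding type to severity, with the gate failing only on blocking findings. The following is a sketch under stated assumptions: the finding identifiers (such as `lint-error`) are made up for illustration and stand in for the bullet items above, and `evaluateGate` is not part of the package.

```javascript
// Hypothetical finding identifiers, one per bullet in the lists above.
const GATE_LEVELS = {
  blocking: [
    "compilation-error", "test-failure", "security-high",
    "missing-required-docs", "coverage-below-threshold", "lint-error",
  ],
  nonBlocking: [
    "performance-warning", "style-inconsistency", "missing-optional-docs",
    "low-coverage", "security-minor",
  ],
  advisory: [
    "optimization", "refactoring", "additional-tests",
    "enhanced-docs", "naming",
  ],
};

// Sorts findings into tiers; the gate passes when nothing is blocking.
function evaluateGate(findings) {
  const report = { blocking: [], nonBlocking: [], advisory: [], unknown: [] };
  for (const finding of findings) {
    const level = Object.keys(GATE_LEVELS).find((name) =>
      GATE_LEVELS[name].includes(finding)
    );
    report[level || "unknown"].push(finding);
  }
  return { passed: report.blocking.length === 0, ...report };
}
```

Unrecognized findings are collected under `unknown` rather than silently dropped, so a reviewer can decide which tier they belong to.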