@chongyan/autospec 1.0.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/LICENSE +21 -0
- package/README.en.md +472 -0
- package/README.md +476 -0
- package/bin/autospec.js +3 -0
- package/knowledge/README.md +144 -0
- package/knowledge/checklists/code.md +182 -0
- package/knowledge/checklists/design.md +196 -0
- package/knowledge/checklists/release.md +70 -0
- package/knowledge/checklists/requirement.md +169 -0
- package/knowledge/checklists/test.md +46 -0
- package/knowledge/config/README.en.md +44 -0
- package/knowledge/config/README.md +44 -0
- package/knowledge/config/role-composition.yaml +98 -0
- package/knowledge/config/role-extensions.yaml +140 -0
- package/knowledge/config/skill-compositions.yaml +142 -0
- package/knowledge/config/team-stage.yaml +95 -0
- package/knowledge/config/team-tasks.yaml +139 -0
- package/knowledge/config/team-triggers.yaml +198 -0
- package/knowledge/config/validation-patterns.yaml +137 -0
- package/knowledge/domain/README.md +115 -0
- package/knowledge/domain/flows/README.md +194 -0
- package/knowledge/domain/glossary.md +143 -0
- package/knowledge/domain/rules.md +138 -0
- package/knowledge/environment/README.en.md +36 -0
- package/knowledge/environment/README.md +87 -0
- package/knowledge/environment/component-knowledge.md +316 -0
- package/knowledge/environment/detection-patterns.yaml +502 -0
- package/knowledge/environment/middleware-knowledge.md +237 -0
- package/knowledge/environment/template-registry.md +321 -0
- package/knowledge/guides/domain-driven-design.md +345 -0
- package/knowledge/guides/knowledge-management.md +369 -0
- package/knowledge/guides/requirement-engineering.md +329 -0
- package/knowledge/guides/stages/ai-effect-evaluator.md +93 -0
- package/knowledge/guides/stages/code-implementer.md +205 -0
- package/knowledge/guides/stages/code-reviewer.md +111 -0
- package/knowledge/guides/stages/consistency-checker.md +177 -0
- package/knowledge/guides/stages/design-planner.md +401 -0
- package/knowledge/guides/stages/design-reviewer.md +83 -0
- package/knowledge/guides/stages/integration-test-runner.md +105 -0
- package/knowledge/guides/stages/release-checker.md +205 -0
- package/knowledge/guides/stages/requirement-analyzer.md +195 -0
- package/knowledge/guides/stages/requirement-reviewer.md +83 -0
- package/knowledge/guides/stages/security-reviewer.md +89 -0
- package/knowledge/guides/stages/test-context-analyzer.md +250 -0
- package/knowledge/guides/stages/test-generator.md +241 -0
- package/knowledge/guides/stages/test-planner.md +183 -0
- package/knowledge/guides/stages/test-reviewer.md +76 -0
- package/knowledge/guides/stages/unit-test-runner.md +83 -0
- package/knowledge/guides/support/ai-agent-analyzer.md +362 -0
- package/knowledge/guides/support/ai-anomaly-analyzer.md +213 -0
- package/knowledge/guides/support/ai-artifact-evaluator.md +192 -0
- package/knowledge/guides/support/ai-capability-analyzer.md +193 -0
- package/knowledge/guides/support/ai-component-analyzer.md +169 -0
- package/knowledge/guides/support/ai-data-validator.md +276 -0
- package/knowledge/guides/support/ai-evaluation-planner.md +374 -0
- package/knowledge/guides/support/ai-path-evaluator.md +274 -0
- package/knowledge/guides/support/ai-pipeline-evaluator.md +219 -0
- package/knowledge/guides/support/ai-rag-analyzer.md +339 -0
- package/knowledge/guides/support/ai-task-assessor.md +418 -0
- package/knowledge/guides/support/ai-test-diagnostics.md +133 -0
- package/knowledge/guides/support/complexity-assessor.md +268 -0
- package/knowledge/guides/support/component-discovery.md +183 -0
- package/knowledge/guides/support/environment-scanner.md +207 -0
- package/knowledge/guides/support/environment-validator.md +207 -0
- package/knowledge/guides/support/knowledge-generator.md +234 -0
- package/knowledge/guides/support/methodology-extractor.md +55 -0
- package/knowledge/guides/support/pipeline-protocol.md +438 -0
- package/knowledge/guides/support/practice-logger.md +359 -0
- package/knowledge/guides/support/scope-inference.md +174 -0
- package/knowledge/guides/support/skill-distiller.md +91 -0
- package/knowledge/guides/support/skill-updater.md +45 -0
- package/knowledge/guides/support/skill-validator.md +72 -0
- package/knowledge/guides/support/team-orchestrator.md +323 -0
- package/knowledge/guides/support/tech-stack-analyzer.md +139 -0
- package/knowledge/guides/support/test-runner.md +254 -0
- package/knowledge/guides/system-design.md +352 -0
- package/knowledge/organization/ai-native-team.md +318 -0
- package/knowledge/organization/team-metrics.md +228 -0
- package/knowledge/principles/constitution.md +134 -0
- package/knowledge/principles/core-principles.md +368 -0
- package/knowledge/principles/design-philosophy.md +877 -0
- package/knowledge/principles/evolution.md +553 -0
- package/knowledge/process/01-requirement.md +113 -0
- package/knowledge/process/02-design.md +123 -0
- package/knowledge/process/03-implementation.md +90 -0
- package/knowledge/process/04-review.md +80 -0
- package/knowledge/process/05-testing.md +90 -0
- package/knowledge/process/06-delivery.md +88 -0
- package/knowledge/process/README.en.md +38 -0
- package/knowledge/process/README.md +48 -0
- package/knowledge/process/ai-sdlc.md +475 -0
- package/knowledge/process/overview.md +319 -0
- package/knowledge/standards/code-review.md +876 -0
- package/knowledge/standards/coding-style.md +940 -0
- package/knowledge/standards/data-consistency.md +1085 -0
- package/knowledge/standards/document-versioning.md +210 -0
- package/knowledge/standards/risk-detection.md +186 -0
- package/knowledge/templates/ai-evaluation.md +150 -0
- package/knowledge/templates/api-design.md +117 -0
- package/knowledge/templates/database-design.md +132 -0
- package/knowledge/templates/domain-driven-design.md +321 -0
- package/knowledge/templates/product-proposal.md +201 -0
- package/knowledge/templates/system-design.md +227 -0
- package/knowledge/templates/task-breakdown.md +107 -0
- package/knowledge/templates/test-case.md +170 -0
- package/package.json +53 -0
- package/plugins/.claude-plugin/plugin.json +134 -0
- package/plugins/agents/roles/ai-engineer.md +129 -0
- package/plugins/agents/roles/backend-engineer.md +165 -0
- package/plugins/agents/roles/ceo.md +94 -0
- package/plugins/agents/roles/data-engineer.md +135 -0
- package/plugins/agents/roles/devops-engineer.md +181 -0
- package/plugins/agents/roles/frontend-engineer.md +129 -0
- package/plugins/agents/roles/product-owner.md +98 -0
- package/plugins/agents/roles/quality-engineer.md +129 -0
- package/plugins/agents/roles/security-engineer.md +180 -0
- package/plugins/agents/roles/tech-lead.md +97 -0
- package/plugins/agents/support/blind-comparator.md +88 -0
- package/plugins/agents/support/consistency-checker.md +103 -0
- package/plugins/agents/support/failure-diagnostician.md +141 -0
- package/plugins/agents/support/independent-reviewer.md +80 -0
- package/plugins/agents/support/safety-auditor.md +121 -0
- package/plugins/agents/support/skill-benchmarker.md +86 -0
- package/plugins/agents/support/skill-forger.md +105 -0
- package/plugins/agents/support/stage-gate-evaluator.md +121 -0
- package/plugins/agents/support/test-coverage-reviewer.md +73 -0
- package/plugins/benchmarks/templates/README.md +44 -0
- package/plugins/benchmarks/templates/commands/explore-template.yaml +48 -0
- package/plugins/benchmarks/templates/pipeline/agile-template.yaml +84 -0
- package/plugins/benchmarks/templates/pipeline/waterfall-template.yaml +106 -0
- package/plugins/benchmarks/templates/skills/requirement-analyzer-template.yaml +48 -0
- package/plugins/commands/README.en.md +96 -0
- package/plugins/commands/README.md +96 -0
- package/plugins/commands/apply.md +191 -0
- package/plugins/commands/archive.md +76 -0
- package/plugins/commands/env-export.md +79 -0
- package/plugins/commands/env-sync.md +640 -0
- package/plugins/commands/env-template.md +223 -0
- package/plugins/commands/env-update.md +264 -0
- package/plugins/commands/env-validate.md +176 -0
- package/plugins/commands/env.md +79 -0
- package/plugins/commands/explore.md +76 -0
- package/plugins/commands/field-evolve.md +536 -0
- package/plugins/commands/memory.md +249 -0
- package/plugins/commands/project-evolve.md +821 -0
- package/plugins/commands/propose.md +93 -0
- package/plugins/commands/review.md +140 -0
- package/plugins/commands/run.md +224 -0
- package/plugins/commands/status.md +62 -0
- package/plugins/commands/validate.md +108 -0
- package/plugins/hooks/README.en.md +56 -0
- package/plugins/hooks/README.md +56 -0
- package/plugins/hooks/ai-project-guard.js +329 -0
- package/plugins/hooks/artifact-evaluation-hook.js +237 -0
- package/plugins/hooks/constitution-guard.js +211 -0
- package/plugins/hooks/environment-autocommit.js +264 -0
- package/plugins/hooks/environment-manager.js +778 -0
- package/plugins/hooks/execution-tracker.js +354 -0
- package/plugins/hooks/frozen-zone-guard.js +140 -0
- package/plugins/hooks/layer1-validator.js +423 -0
- package/plugins/hooks/lib/artifact-evaluator.js +414 -0
- package/plugins/hooks/lib/benchmarks/change-detector.js +390 -0
- package/plugins/hooks/lib/benchmarks/evaluator.js +605 -0
- package/plugins/hooks/lib/benchmarks/integration-example.js +169 -0
- package/plugins/hooks/lib/data-and-ai-detector.js +275 -0
- package/plugins/hooks/lib/detection-pattern-loader.js +865 -0
- package/plugins/hooks/lib/directory-discovery.js +395 -0
- package/plugins/hooks/lib/environment-config-loader.js +341 -0
- package/plugins/hooks/lib/environment-detector.js +553 -0
- package/plugins/hooks/lib/environment-evolver.js +564 -0
- package/plugins/hooks/lib/environment-registry.js +813 -0
- package/plugins/hooks/lib/execution-path.js +427 -0
- package/plugins/hooks/lib/hook-error-recorder.js +245 -0
- package/plugins/hooks/lib/hook-logger.js +538 -0
- package/plugins/hooks/lib/hook-runner.js +97 -0
- package/plugins/hooks/lib/hook-runner.sh +44 -0
- package/plugins/hooks/lib/hook-state-manager.js +480 -0
- package/plugins/hooks/lib/memory-extractor.js +377 -0
- package/plugins/hooks/lib/memory-manager.js +673 -0
- package/plugins/hooks/lib/metrics-analyzer.js +489 -0
- package/plugins/hooks/lib/project-evolution/auto-fixer.js +511 -0
- package/plugins/hooks/lib/project-evolution/memory-manager.js +346 -0
- package/plugins/hooks/lib/project-evolution/pattern-detector.js +476 -0
- package/plugins/hooks/lib/project-evolution/semantic-indexer.js +480 -0
- package/plugins/hooks/lib/project-structure-detector.js +326 -0
- package/plugins/hooks/lib/rollback-tracker.js +346 -0
- package/plugins/hooks/lib/source-code-scanner.js +596 -0
- package/plugins/hooks/lib/technology-stack-detector.js +374 -0
- package/plugins/hooks/lib/test-failure-analyzer.js +375 -0
- package/plugins/hooks/lib/test-failure-fixer.js +268 -0
- package/plugins/hooks/lib/trace-context.js +277 -0
- package/plugins/hooks/lib/validation-patterns.js +415 -0
- package/plugins/hooks/memory-sync.js +171 -0
- package/plugins/hooks/pipeline-observer.js +413 -0
- package/plugins/hooks/scope-sentinel.js +204 -0
- package/plugins/hooks/trace-initialization.js +169 -0
- package/plugins/memory/templates/code-quality.yaml +149 -0
- package/plugins/memory/templates/multi-system.yaml +155 -0
- package/plugins/memory/templates/team-habits.yaml +119 -0
- package/plugins/memory/templates/testing.yaml +121 -0
- package/plugins/skills/README.en.md +47 -0
- package/plugins/skills/README.md +104 -0
- package/plugins/skills/benchmark-executor/README.md +93 -0
- package/plugins/skills/benchmark-executor/SKILL.md +647 -0
- package/plugins/skills/benchmark-generator/SKILL.md +349 -0
- package/plugins/skills/delivery-stage/SKILL.md +203 -0
- package/plugins/skills/design-stage/SKILL.md +216 -0
- package/plugins/skills/evolution-process/SKILL.md +291 -0
- package/plugins/skills/exploration-phase/SKILL.md +133 -0
- package/plugins/skills/implementation-stage/SKILL.md +179 -0
- package/plugins/skills/layer1-validation/SKILL.md +79 -0
- package/plugins/skills/pending-dashboard/SKILL.md +109 -0
- package/plugins/skills/project-evolution/SKILL.md +847 -0
- package/plugins/skills/requirement-stage/SKILL.md +183 -0
- package/plugins/skills/skill-forge/SKILL.md +223 -0
- package/plugins/skills/skill-forge/references/description-guide.md +92 -0
- package/plugins/skills/skill-forge/references/quality-rubric.md +104 -0
- package/plugins/skills/skill-forge/references/skill-template.md +106 -0
- package/plugins/skills/startup-guard/SKILL.md +38 -0
- package/plugins/skills/testing-stage/SKILL.md +195 -0
- package/scripts/cli/global-init.js +288 -0
- package/scripts/cli/global.js +324 -0
- package/scripts/cli/index.js +55 -0
- package/scripts/cli/init.js +382 -0
- package/scripts/cli/list.js +69 -0
- package/scripts/cli/org.js +340 -0
- package/scripts/cli/update.js +44 -0
- package/scripts/config/commands.config.js +145 -0
- package/scripts/config/hooks.config.js +197 -0
- package/scripts/evolution/evolution-router.js +273 -0
- package/scripts/evolution/evolution-signal-collector.js +307 -0
- package/scripts/evolution/knowledge-loader.js +346 -0
- package/scripts/evolution/marketplace.js +317 -0
- package/scripts/evolution/version-manager.js +371 -0
- package/scripts/install/agents.js +106 -0
- package/scripts/install/commands.js +133 -0
- package/scripts/install/constants.js +424 -0
- package/scripts/install/hook-logger.js +536 -0
- package/scripts/install/hooks.js +110 -0
- package/scripts/install/index.js +39 -0
- package/scripts/install/skills.js +95 -0
- package/scripts/postinstall.js +25 -0
- package/scripts/state.js +376 -0
@@ -0,0 +1,374 @@
---
name: evaluation-planner
description: When a project contains components whose effectiveness must be evaluated (model training, agents, RAG, etc.), plan the evaluation, covering dimensions, datasets, metrics, and process design.
type: ai
---

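## Purpose

An AI-specific skill. It plans the evaluation of AI components whose effectiveness needs to be measured, covering evaluation dimensions, dataset construction, metric selection, and process design.

## Input

- Required: project structure analysis results, detected AI components
- Optional: agent analysis results, RAG analysis results
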
## Output

```json
{
  "evaluationScope": {
    "components": ["agent-framework", "rag-application"],
    "priority": "high",
    "reason": "The agent and RAG components are user-facing, so their quality directly affects the user experience"
  },
  "evaluationPlan": [
    {
      "component": "ResearchAgent",
      "type": "agent",
      "dimensions": [
        {
          "name": "Task completion rate",
          "description": "Whether the agent correctly completes the assigned task",
          "metrics": ["success_rate", "error_rate"],
          "method": "Human evaluation or automated verification"
        },
        {
          "name": "Tool-use correctness",
          "description": "Whether the agent selects and uses tools correctly",
          "metrics": ["tool_selection_accuracy", "tool_call_success_rate"],
          "method": "Log analysis"
        },
        {
          "name": "Response quality",
          "description": "Quality of the agent's output",
          "metrics": ["relevance_score", "helpfulness_score"],
          "method": "LLM-as-Judge or human evaluation"
        }
      ],
      "dataset": {
        "type": "synthetic",
        "size": 100,
        "generationMethod": "Generate test cases based on real scenarios"
      },
      "process": {
        "steps": [
          "1. Prepare the test-case dataset",
          "2. Run the agent on the tasks",
          "3. Collect execution logs and outputs",
          "4. Automatically compute quantifiable metrics",
          "5. Assess response quality with LLM-as-Judge",
          "6. Verify with human spot checks"
        ]
      }
    },
    {
      "component": "RAG system",
      "type": "rag",
      "dimensions": [
        {
          "name": "Retrieval accuracy",
          "description": "Whether the retrieved documents are relevant",
          "metrics": ["precision@k", "recall@k", "mrr"],
          "method": "Evaluation on a labeled dataset"
        },
        {
          "name": "Answer relevance",
          "description": "Whether the generated answer addresses the question",
          "metrics": ["relevance_score", "faithfulness_score"],
          "method": "LLM-as-Judge"
        },
        {
          "name": "Hallucination rate",
          "description": "Whether the answer contains false information",
          "metrics": ["hallucination_rate", "groundedness_score"],
          "method": "Fact checking"
        }
      ],
      "dataset": {
        "type": "curated",
        "size": 50,
        "generationMethod": "Manually constructed question-answer pairs"
      },
      "process": {
        "steps": [
          "1. Build the evaluation Q&A dataset",
          "2. Run the RAG system to generate answers",
          "3. Compute retrieval metrics",
          "4. Assess answer quality with an LLM",
          "5. Spot-check hallucinations by hand"
        ]
      }
    }
  ],
  "tools": {
    "suggested": ["ragas", "deepeval", "arize-phoenix"],
    "reason": "These tools support RAG and agent evaluation and integrate well with LangChain"
  },
  "timeline": {
    "estimated": "2-3 days",
    "breakdown": {
      "dataset_preparation": "0.5 days",
      "evaluation_implementation": "1 day",
      "execution_and_analysis": "0.5-1 day"
    }
  }
}
```

## Execution Steps

### Step 1: Determine the evaluation scope (deterministic)

Use the detection results to decide which components need evaluation:

```
Evaluation triggers:
- Components with needsEvaluation = true
- Component types: model-training, inference-service, llm-application, agent-framework, rag-application
```

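Because Step 1 is deterministic, it can be expressed as a plain filter. The sketch below is illustrative only: the helper name `selectEvaluationScope` is hypothetical, the `AdminUI` / `web-frontend` sample is invented for the demo, and it assumes a component qualifies only when both trigger conditions hold (the skill text lists the two conditions without saying whether they combine with AND or OR).

```javascript
// Component types named in the Step 1 trigger list.
const EVALUABLE_TYPES = new Set([
  "model-training",
  "inference-service",
  "llm-application",
  "agent-framework",
  "rag-application",
]);

// Hypothetical helper: keep components that are flagged for evaluation
// AND whose type is in the trigger list (the AND is an assumption).
function selectEvaluationScope(components) {
  return components.filter(
    (c) => c.needsEvaluation === true && EVALUABLE_TYPES.has(c.type)
  );
}

const scope = selectEvaluationScope([
  { name: "ResearchAgent", type: "agent-framework", needsEvaluation: true },
  { name: "AdminUI", type: "web-frontend", needsEvaluation: false }, // invented sample
]);
console.log(scope.map((c) => c.name)); // [ 'ResearchAgent' ]
```

If the two conditions are meant as alternatives, replacing `&&` with `||` is the only change needed.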
### Step 2: Analyze component characteristics (model)

Analyze the evaluation needs of each component:

```
Model input:
{
  "components": [
    {"type": "agent-framework", "name": "ResearchAgent", "tools": ["web_search", "doc_reader"]},
    {"type": "rag-application", "vectorStore": "ChromaDB", "retriever": "similarity"}
  ],
  "task": "Determine the evaluation dimensions and metrics for each component"
}
```

### Step 3: Design evaluation dimensions (model)

Design the evaluation dimensions according to component type:

```
Agent evaluation dimensions:
- Task completion rate: whether the assigned task is completed
- Tool use: whether tools are selected and used correctly
- Reasoning ability: whether the decision logic is sound
- Response quality: whether the output is helpful

RAG evaluation dimensions:
- Retrieval quality: recall, precision, MRR
- Generation quality: relevance, accuracy, fluency
- Context utilization: whether the retrieved content is used effectively
- Hallucination detection: whether false information is present

Model evaluation dimensions:
- Accuracy: Accuracy, F1, AUC
- Performance: latency, throughput
- Robustness: behavior on edge cases
- Fairness: performance gaps across groups
```

### Step 4: Plan the dataset (model)

Design the evaluation dataset:

```
Dataset types:
- synthetic: synthetic data (LLM-generated)
- curated: manually constructed
- production: sampled from production data
- benchmark: public benchmark datasets

Considerations:
- Size: balance cost against statistical significance
- Coverage: cover the main usage scenarios
- Diversity: include edge cases
```

### Step 5: Select tools (deterministic + model)

Recommend evaluation tools:

```
Deterministic rules:
- RAG evaluation → ragas, deepeval, trulens
- Agent evaluation → langsmith, arize-phoenix
- Model evaluation → mlflow, wandb, evaluate

Model judgment:
- Choose tools compatible with the project's tech stack
- Choose tools that support the selected evaluation dimensions
```

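The deterministic half of Step 5 amounts to a lookup table. A minimal sketch, assuming the keys `rag`, `agent`, and `model` (the first two follow the `type` values used in the output example; `model` and the function name `suggestTools` are assumptions made for illustration):

```javascript
// Tool candidates per evaluation kind, copied from the deterministic rules.
const TOOL_RULES = {
  rag: ["ragas", "deepeval", "trulens"],
  agent: ["langsmith", "arize-phoenix"],
  model: ["mlflow", "wandb", "evaluate"],
};

// Hypothetical helper: union of candidate tools for the given kinds,
// de-duplicated and kept in first-seen order. Unknown kinds add nothing.
function suggestTools(kinds) {
  const seen = new Set();
  for (const kind of kinds) {
    for (const tool of TOOL_RULES[kind] ?? []) seen.add(tool);
  }
  return [...seen];
}

console.log(suggestTools(["agent", "rag"]));
```

The result is only a candidate list; narrowing it by tech stack and by supported dimensions is the model-judgment half of the step and is not covered here.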
### Step 6: Output the result

Summarize the evaluation plan, including dimensions, datasets, tools, and a time estimate.

## Evaluation Design

Based on industry best practices:

### Evaluation structure

**Evaluation (eval)** = give the AI an input, then apply scoring logic to its output to measure success.

### Single-turn vs. multi-turn evaluation

| Type | Description | When to use |
|------|-------------|-------------|
| **Single-turn evaluation** | Prompt → response → score | Simple tasks, non-agent LLM use |
| **Multi-turn evaluation** | Multi-step interaction, tool calls, state changes | Agents, complex tasks |

### Agent evaluation best practices

1. **Match the system's complexity**
   - Simple agents: single-turn evaluation
   - Complex agents: multi-turn evaluation plus tool-call verification

2. **Evaluation dimensions**
   - Task completion rate (automated verification)
   - Tool-use correctness (log analysis)
   - Decision quality (LLM evaluation)
   - Efficiency (latency, token consumption)

3. **Scoring logic design**
   - Exact match: for tasks with a well-defined answer
   - LLM-as-Judge: for open-ended tasks
   - Rule engine: for structured output

### AI-resistant evaluation design

Following AI-resistant evaluation design principles:

1. **Prevent data leakage**
   - Use test cases the system has not seen
   - Generate evaluation data dynamically
   - Keep training and evaluation data separate

2. **Guard against prompt injection**
   - Diversify evaluation inputs
   - Test edge cases

3. **Test real capability**
   - Evaluate open-ended tasks
   - Test multi-step reasoning
   - Simulate realistic scenarios

### Infrastructure noise control

Following infrastructure noise control principles:

1. **Identify noise sources**
   - Environment differences (operating system, dependency versions)
   - Interference from concurrency
   - Network latency

2. **Control methods**
   - Isolate the test environment
   - Run multiple times and take the median
   - Record and exclude outliers

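The "run multiple times and take the median" and "record and exclude outliers" controls can be combined in a few lines. The skill does not name an outlier rule, so the sketch below assumes one: a run is an outlier when it lies more than `k` median absolute deviations (MAD) from the median. The function names and the sample scores are hypothetical.

```javascript
// Median of a non-empty numeric list (input is not mutated).
function median(values) {
  if (values.length === 0) throw new Error("median of empty list");
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Assumed rule: drop runs farther than k * MAD from the median,
// record them, and report the median of the remaining runs.
function summarizeRuns(scores, k = 3) {
  const med = median(scores);
  const mad = median(scores.map((s) => Math.abs(s - med)));
  const isOutlier = (s) => mad > 0 && Math.abs(s - med) > k * mad;
  const outliers = scores.filter(isOutlier);
  const kept = scores.filter((s) => !isOutlier(s));
  return { median: median(kept), outliers };
}

// Five hypothetical runs of the same eval; one was disturbed by noise.
console.log(summarizeRuns([0.81, 0.79, 0.8, 0.82, 0.2]));
```

Returning the excluded runs alongside the median keeps the "record" part of the control visible in the report instead of silently dropping data.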
## Evaluation Dimension Templates

### Agent evaluation

| Dimension | Metrics | Method |
|-----------|---------|--------|
| Task completion rate | success_rate, error_rate | Automated verification |
| Tool use | tool_accuracy, call_success | Log analysis |
| Reasoning quality | reasoning_score | LLM evaluation |
| Response quality | relevance, helpfulness | LLM / human evaluation |
| Efficiency | latency, token_usage | Automatic statistics |
| **Decision transparency** | decision_traceability | Audit logs |
| **Error recovery** | error_recovery_rate | Fault-injection testing |

### RAG evaluation

| Dimension | Metrics | Method |
|-----------|---------|--------|
| Retrieval quality | precision@k, recall@k, MRR | Labeled data |
| Context relevance | context_relevance | LLM evaluation |
| Faithfulness | faithfulness, groundedness | LLM evaluation |
| Answer relevance | answer_relevance | LLM evaluation |
| Hallucination rate | hallucination_rate | Fact checking |
| **Retrieval latency** | retrieval_latency | Automatic statistics |

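The retrieval-quality metrics in the table have standard definitions, sketched below for a single query plus MRR across queries. The function names and document IDs are hypothetical, and the sketch assumes each retrieved list contains no duplicate IDs.

```javascript
// Count how many of the top-k retrieved IDs are in the relevant set.
function hitsInTopK(retrieved, relevant, k) {
  return retrieved.slice(0, k).filter((id) => relevant.has(id)).length;
}

// precision@k: fraction of the top-k results that are relevant.
function precisionAtK(retrieved, relevant, k) {
  return k > 0 ? hitsInTopK(retrieved, relevant, k) / k : 0;
}

// recall@k: fraction of all relevant documents found in the top k.
function recallAtK(retrieved, relevant, k) {
  return relevant.size > 0 ? hitsInTopK(retrieved, relevant, k) / relevant.size : 0;
}

// Reciprocal rank: 1 / position of the first relevant result (0 if none).
function reciprocalRank(retrieved, relevant) {
  const index = retrieved.findIndex((id) => relevant.has(id));
  return index === -1 ? 0 : 1 / (index + 1);
}

// MRR: mean reciprocal rank over a list of { retrieved, relevant } queries.
function meanReciprocalRank(queries) {
  if (queries.length === 0) return 0;
  const total = queries.reduce(
    (sum, q) => sum + reciprocalRank(q.retrieved, q.relevant),
    0
  );
  return total / queries.length;
}

const retrieved = ["d3", "d1", "d7", "d2"];
const relevant = new Set(["d1", "d2"]);
console.log(precisionAtK(retrieved, relevant, 4)); // 0.5
console.log(recallAtK(retrieved, relevant, 4)); // 1
console.log(reciprocalRank(retrieved, relevant)); // 0.5
```

All three need a labeled dataset that records, for each question, which documents count as relevant, which is why the table lists "labeled data" as the method.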
### Model evaluation

| Dimension | Metrics | Method |
|-----------|---------|--------|
| Accuracy | Accuracy, F1, AUC, BLEU, ROUGE | Automatic computation |
| Performance | Latency, Throughput | Load testing |
| Robustness | Edge case accuracy | Edge-case testing |
| Fairness | Demographic parity | Per-group statistics |
| **AI-Resistance** | unseen_test_performance | Testing on unseen data |

## When to Invoke

- When an AI component that needs evaluation is detected
- During the design stage, to plan the evaluation
- Before delivery, to confirm evaluation coverage

## Example

**Input**:
```json
{
  "components": [
    {"name": "ResearchAgent", "type": "agent-framework", "needsEvaluation": true},
    {"name": "RAG system", "type": "rag-application", "needsEvaluation": true}
  ],
  "techStack": ["langchain", "openai", "chromadb"]
}
```

**Output**:
```json
{
  "evaluationScope": {
    "components": ["ResearchAgent", "RAG system"],
    "priority": "high"
  },
  "evaluationPlan": [
    {
      "component": "ResearchAgent",
      "type": "agent",
      "dimensions": [
        {"name": "Task completion rate", "metrics": ["success_rate"], "method": "Automated verification"},
        {"name": "Tool-use correctness", "metrics": ["tool_accuracy"], "method": "Log analysis"},
        {"name": "Response quality", "metrics": ["relevance_score"], "method": "LLM-as-Judge"}
      ],
      "dataset": {
        "type": "synthetic",
        "size": 50,
        "generationMethod": "Generate research-task test cases"
      }
    },
    {
      "component": "RAG system",
      "type": "rag",
      "dimensions": [
        {"name": "Retrieval accuracy", "metrics": ["recall@4", "mrr"], "method": "Labeled evaluation"},
        {"name": "Answer relevance", "metrics": ["relevance"], "method": "LLM-as-Judge"},
        {"name": "Hallucination rate", "metrics": ["hallucination_rate"], "method": "Fact checking"}
      ],
      "dataset": {
        "type": "curated",
        "size": 30,
        "generationMethod": "Manually constructed question-answer pairs"
      }
    }
  ],
  "tools": {
    "suggested": ["ragas", "langsmith"],
    "reason": "Integrate well with LangChain and support agent and RAG evaluation"
  },
  "timeline": {
    "estimated": "2 days",
    "breakdown": {
      "dataset_preparation": "0.5 days",
      "evaluation_implementation": "1 day",
      "execution_and_analysis": "0.5 days"
    }
  }
}
```

@@ -0,0 +1,274 @@
---
name: ai-path-evaluator
description: AI analysis of execution paths to evaluate how well SDD guides execution
type: review
---

## Purpose

Analyze execution path data, evaluate how effectively spec-driven development (SDD) guides execution, and identify opportunities to improve the framework.

## Core Value

By analyzing execution paths, quantify whether the framework effectively guides the AI to follow the prescribed development process, and provide data to support framework iteration.

## Input

- Required:
  - traceId: process trace ID
  - executionPath: execution path data
- Optional:
  - metrics: quality metric data
  - frameworkRules: framework rule definitions

## Evaluation Dimensions

### 1. Process compliance (weight 30%)

Evaluates whether the execution path conforms to the process defined by the framework.

**Checks**:
- [ ] Are the stages in the correct order (requirement → design → implementation → testing → delivery)?
- [ ] Were all required stages executed?
- [ ] Was two-layer validation executed?
- [ ] Did the gate checks pass?

**Scoring rules**:
```javascript
let score = 100;
// Deduct for a stage-order violation
if (stageOrderViolation) score -= 20;
// Deduct for each missing required stage
score -= missingStages.length * 15;
// Deduct if Layer 1 validation was never executed
if (!hasLayer1Execution) score -= 30;
// Deduct in proportion to the gate-blocked rate (0..1)
score -= gateBlockedRate * 10;
// Keep the result within 0..100
score = Math.max(0, Math.min(100, score));
```

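The compliance rules take `stageOrderViolation` and `missingStages` as given. One way to derive them from the executed stage sequence is sketched below. The stage identifiers, the input shape, and the function names are assumptions made for illustration; the deductions are the ones listed in the scoring rules, with the result clamped to 0..100.

```javascript
// Expected order, using hypothetical English identifiers for the five stages.
const EXPECTED_STAGES = ["requirement", "design", "implementation", "testing", "delivery"];

// Derive the two stage-related inputs from the executed sequence.
function checkStages(executedStages) {
  const missingStages = EXPECTED_STAGES.filter((s) => !executedStages.includes(s));
  // Map each executed stage to its expected position; unknown names are ignored.
  const positions = executedStages
    .map((s) => EXPECTED_STAGES.indexOf(s))
    .filter((i) => i !== -1);
  // A violation is any stage that runs before one that should precede it.
  const stageOrderViolation = positions.some((p, i) => i > 0 && p < positions[i - 1]);
  return { missingStages, stageOrderViolation };
}

function complianceScore({ executedStages, hasLayer1Execution, gateBlockedRate }) {
  const { missingStages, stageOrderViolation } = checkStages(executedStages);
  let score = 100;
  if (stageOrderViolation) score -= 20;
  score -= missingStages.length * 15;
  if (!hasLayer1Execution) score -= 30;
  score -= gateBlockedRate * 10;
  return Math.max(0, Math.min(100, score));
}

// Design ran after implementation and testing was skipped:
// 100 - 20 (order) - 15 (one missing stage) - 5 (gate rate 0.5) = 60
console.log(
  complianceScore({
    executedStages: ["requirement", "implementation", "design", "delivery"],
    hasLayer1Execution: true,
    gateBlockedRate: 0.5,
  })
); // 60
```

Note that a stage repeated after a rollback also counts as an order violation under this definition; whether that is desirable depends on how rollbacks are recorded in the execution path.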
### 2. Execution efficiency (weight 25%)

Evaluates whether the execution path is efficient and identifies waste and repetition.

**Checks**:
- [ ] Repeated-operation detection
- [ ] Wasted-operation detection
- [ ] Per-stage duration analysis
- [ ] Tool-use efficiency

**Scoring rules**:
```javascript
let score = 100;
// Deduct for repeated operations
score -= repeatedOperations.length * 5;
// Deduct for rollbacks
score -= rollbackCount * 5;
// Tool-call efficiency: distinct tools used divided by total tool calls
// (treated as 1 when no tool calls were made, to avoid dividing by zero)
const efficiency = totalToolCalls > 0 ? uniqueTools / totalToolCalls : 1;
score *= efficiency;
// Keep the result within 0..100
score = Math.max(0, Math.min(100, score));
```

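### 3. Decision quality (weight 25%)

Evaluates how sound the decisions at each decision point are, which reflects the framework's capacity for autonomy.

**Checks**:
- [ ] Analysis of whether each human intervention was necessary
- [ ] Automatic-decision success rate
- [ ] Distribution of decision types

**Scoring rules**:
```javascript
let score = 100;
// Human intervention rate (lower is better); 0 when no decisions were recorded
const humanRate = totalDecisions > 0 ? humanDecisions / totalDecisions : 0;
score -= humanRate * 30;
// Bonus for the automatic-decision success rate
score += autoDecisionSuccessRate * 20;
// The bonus can push the raw value above 100, so cap it to 0..100
score = Math.max(0, Math.min(100, score));
```
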
### 4. Exception handling (weight 20%)

Evaluates whether exceptional situations were handled appropriately.

**Checks**:
- [ ] Distribution of exception types
- [ ] Recovery success rate
- [ ] Whether escalations were justified

**Scoring rules**:
```javascript
let score = 100;
// Scale by the recovery rate (0..1)
score *= recoveryRate;
// Deduct only when the escalation rate exceeds 30%
if (escalateRate > 0.3) score -= 10;
// Keep the result from going negative
score = Math.max(0, score);
```

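The four dimension weights (30%, 25%, 25%, 20%) combine into one overall score. The sketch below assumes a plain weighted sum keyed by the dimension names used in the output format (`compliance`, `efficiency`, `decision`, `exception`); the function name is hypothetical, and how the result is rounded is not specified by the skill.

```javascript
// Weights taken from the four dimension headings; they sum to 1.
const WEIGHTS = { compliance: 0.3, efficiency: 0.25, decision: 0.25, exception: 0.2 };

// Hypothetical helper: weighted sum of per-dimension scores (each 0..100).
function overallScore(dimensionScores) {
  let total = 0;
  for (const [name, weight] of Object.entries(WEIGHTS)) {
    const score = dimensionScores[name];
    if (typeof score !== "number") throw new Error(`missing score for ${name}`);
    total += score * weight;
  }
  return total;
}

// 95*0.3 + 80*0.25 + 85*0.25 + 75*0.2 = 84.75
const raw = overallScore({ compliance: 95, efficiency: 80, decision: 85, exception: 75 });
console.log(Math.floor(raw)); // 84
```

Rounding down is an assumption here, chosen because it reproduces an overall value of 84 from dimension scores of 95, 80, 85, and 75.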
## Output Format

```json
{
  "overallScore": 84,
  "grade": "B",
  "dimensions": {
    "compliance": {
      "score": 95,
      "violations": [],
      "strengths": ["Stages executed in the correct order", "Two-layer validation complete"]
    },
    "efficiency": {
      "score": 80,
      "repeatedOps": 3,
      "wastedOps": 1,
      "suggestions": ["Reduce repeated Read operations"]
    },
    "decision": {
      "score": 85,
      "humanRate": 0.25,
      "autoSuccessRate": 0.9,
      "analysis": "Human intervention is concentrated at design decision points"
    },
    "exception": {
      "score": 75,
      "recoveryRate": 0.8,
      "escalateCount": 1,
      "analysis": "Most exceptions were recovered successfully"
    }
  },
  "sddGuidanceEffect": {
    "score": 82,
    "analysis": "The framework guides well; the AI largely follows the specification",
    "improvementAreas": [
      "Decision points in the design stage could be automated further",
      "A mechanism for suggesting how to reduce repeated operations"
    ]
  },
  "recommendations": [
    {
      "priority": "high",
      "area": "efficiency",
      "suggestion": "Cache Read results to reduce repeated reads"
    }
  ]
}
```

## Execution Steps

### Step 1: Load execution path data

```
1. Read executionPath from metrics.json
2. Read detailed events from the trace log
3. Load the framework rule definitions
```

### Step 2: Analyze process compliance

```
1. Check whether the stage sequence matches expectations
2. Check whether the required stages were executed
3. Check whether two-layer validation was executed
4. Count gate passes and blocks
```

### Step 3: Analyze execution efficiency

```
1. Detect repeated operations (the same operation >= 3 times)
2. Detect wasted operations (rework after a rollback)
3. Analyze the distribution of tool calls
4. Compute efficiency metrics
```

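Item 1 of Step 3 (the same operation occurring three or more times) can be sketched as a frequency count. The event shape `{ tool, target }` and the function name are assumptions; the real execution path schema is not shown in this skill.

```javascript
// Hypothetical helper: group events by tool and target, then report
// every operation that occurs at least `threshold` times.
function findRepeatedOperations(events, threshold = 3) {
  const counts = new Map();
  for (const { tool, target } of events) {
    const key = `${tool}:${target}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return [...counts]
    .filter(([, count]) => count >= threshold)
    .map(([operation, count]) => ({ operation, count }));
}

const events = [
  { tool: "Read", target: "spec.md" },
  { tool: "Edit", target: "app.js" },
  { tool: "Read", target: "spec.md" },
  { tool: "Read", target: "spec.md" },
];
console.log(findRepeatedOperations(events));
// [ { operation: 'Read:spec.md', count: 3 } ]
```

The length of the returned array is what the efficiency scoring rule would consume as `repeatedOperations.length`.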
### Step 4: Analyze decision quality

```
1. Compile the distribution of decision types
2. Analyze whether each human intervention was necessary
3. Assess the automatic-decision success rate
4. Identify decision points that could be automated
```

### Step 5: Analyze exception handling

```
1. Compile the distribution of exception types
2. Compute the recovery success rate
3. Analyze whether escalations were justified
4. Identify exception patterns
```

### Step 6: Assess the overall SDD guidance effect

```
1. Aggregate the scores for each dimension
2. Analyze where the framework's guidance succeeded
3. Identify the areas that need improvement
4. Generate improvement recommendations
```

### Step 7: Output the report

```
1. Output the structured JSON result
2. Update pathEvaluation in metrics.json
3. Trigger the evolution mechanism (if necessary)
```

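## SDD Guidance Effect Rating Scale

| Score range | Grade | Guidance effect | Description |
|-------------|-------|-----------------|-------------|
| 90-100 | A | Excellent | The AI follows the specification fully; human intervention is rare |
| 80-89 | B | Good | The AI largely follows the specification; a few human corrections |
| 70-79 | C | Fair | The AI partly deviates from the specification; considerable human intervention is needed |
| 60-69 | D | Poor | The AI often deviates from the specification; human intervention is frequent |
| <60 | F | Failed | The framework's guidance has broken down and needs to be redesigned |
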
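The rating scale maps directly onto a threshold lookup. A minimal sketch (the function name `gradeFor` is hypothetical); fractional scores such as 89.5 fall into the band whose lower bound they meet, which is an interpretation of the integer ranges in the table:

```javascript
// Map a 0..100 score to the letter grade from the rating scale.
function gradeFor(score) {
  if (score >= 90) return "A";
  if (score >= 80) return "B";
  if (score >= 70) return "C";
  if (score >= 60) return "D";
  return "F";
}

console.log(gradeFor(84)); // B
```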
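## Evolution Trigger Rules

The path evaluation results automatically trigger the evolution mechanism:

1. **Compliance issue**: the same violation occurs 3 times in a row → update the constitution or a skill
2. **Efficiency issue**: a repeated-operation pattern occurs 5 times → add an optimization skill
3. **Decision issue**: human intervention rate > 50% → improve decision automation
4. **Exception issue**: recovery rate < 60% → strengthen the exception handling mechanism
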
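The four trigger rules can be checked mechanically once the evaluation history is available. Everything about the input shape below is an assumption: `violationHistory` is taken to be one array of violation IDs per evaluation (oldest first), `repeatedPatternCount` the number of times the most frequent repeated-operation pattern has been seen, and the returned action labels are invented names, not identifiers from the package.

```javascript
// Hypothetical helper: return the evolution actions the rules call for.
function evolutionTriggers({ violationHistory, repeatedPatternCount, humanRate, recoveryRate }) {
  const triggers = [];

  // Rule 1: some violation appears in each of the last three evaluations.
  const lastThree = violationHistory.slice(-3);
  if (lastThree.length === 3) {
    const common = lastThree[0].filter((v) => lastThree.every((run) => run.includes(v)));
    if (common.length > 0) triggers.push("update-constitution-or-skill");
  }

  // Rule 2: a repeated-operation pattern has occurred 5 times.
  if (repeatedPatternCount >= 5) triggers.push("add-optimization-skill");

  // Rule 3: human intervention rate above 50%.
  if (humanRate > 0.5) triggers.push("improve-decision-automation");

  // Rule 4: recovery rate below 60%.
  if (recoveryRate < 0.6) triggers.push("strengthen-exception-handling");

  return triggers;
}

console.log(
  evolutionTriggers({
    violationHistory: [["skip-design"], ["skip-design", "no-layer1"], ["skip-design"]],
    repeatedPatternCount: 2,
    humanRate: 0.25,
    recoveryRate: 0.8,
  })
); // [ 'update-constitution-or-skill' ]
```

Keeping the thresholds in one function makes it easier to see that rules 3 and 4 use strict comparisons, so a human rate of exactly 50% or a recovery rate of exactly 60% does not fire.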
## Integration with metrics.json

After the evaluation completes, update metrics.json:

```json
{
  "pathEvaluation": {
    "compliance": { "score": 95, "violations": [] },
    "efficiency": { "score": 80, "repeatedOps": 3 },
    "decision": { "score": 85, "humanRate": 0.25 },
    "exception": { "score": 75, "recoveryRate": 0.8 },
    "overall": 84,
    "evaluatedAt": "2026-03-24T10:00:00Z"
  }
}
```

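The update itself is a merge of one key into the existing metrics object. The sketch below stays free of file I/O so the merge can be tested in isolation; the function name and the shape of the `evaluation` argument are assumptions, and reading and writing metrics.json is left to the caller.

```javascript
// Hypothetical helper: return a copy of `metrics` with pathEvaluation set.
// Other keys are preserved; any previous pathEvaluation is replaced.
function withPathEvaluation(metrics, evaluation, now = new Date()) {
  const { compliance, efficiency, decision, exception, overall } = evaluation;
  return {
    ...metrics,
    pathEvaluation: {
      compliance,
      efficiency,
      decision,
      exception,
      overall,
      // toISOString() includes milliseconds, e.g. 2026-03-24T10:00:00.000Z
      evaluatedAt: now.toISOString(),
    },
  };
}

const updated = withPathEvaluation(
  { traceId: "trace-001" }, // hypothetical existing content
  {
    compliance: { score: 95, violations: [] },
    efficiency: { score: 80, repeatedOps: 3 },
    decision: { score: 85, humanRate: 0.25 },
    exception: { score: 75, recoveryRate: 0.8 },
    overall: 84,
  },
  new Date("2026-03-24T10:00:00Z")
);
console.log(JSON.stringify(updated, null, 2));
```

Returning a new object instead of mutating the input means a failed write cannot leave the in-memory metrics half-updated.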
## When to Use

- Triggered automatically when a process ends (Stop Hook)
- Run manually with `/evaluate`
- Periodic analysis by the evolution mechanism

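## Anti-pattern Checklist (DP7)

1. **Looking only at results, not the process**: ignoring details in the execution path
   - Check: the concrete event sequence must be analyzed

2. **Over-penalizing human intervention**: treating every human intervention as a framework problem
   - Check: distinguish necessary from unnecessary human intervention

3. **Ignoring context**: not accounting for task complexity
   - Check: adjust the scoring criteria to the task's complexity

4. **Recommendations that are too general**: no concrete, actionable improvement suggestions
   - Check: every recommendation must point to a specific improvement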