@chongyan/autospec 1.0.1 → 1.0.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/LICENSE +21 -21
- package/README.en.md +447 -321
- package/README.md +418 -286
- package/knowledge/01-principles/00-principles-hierarchy.md +247 -0
- package/knowledge/01-principles/01-first-principles.md +241 -0
- package/knowledge/01-principles/02-strategic-principles.md +286 -0
- package/knowledge/01-principles/03-tactical-principles.md +385 -0
- package/knowledge/01-principles/04-operational-principles.md +275 -0
- package/knowledge/01-principles/05-domain-principles.md +539 -0
- package/knowledge/01-principles/06-methodology-principles.md +281 -0
- package/knowledge/01-principles/07-cognitive-principles.md +277 -0
- package/knowledge/01-principles/08-auto-fix-principles.md +320 -0
- package/knowledge/01-principles/09-constitution.md +220 -0
- package/knowledge/{principles/evolution.md → 01-principles/10-evolution-mechanism.md} +160 -14
- package/knowledge/01-principles/README.en.md +385 -0
- package/knowledge/01-principles/README.md +385 -0
- package/knowledge/{process/overview.md → 02-process/00-overview.md} +90 -5
- package/knowledge/02-process/README.en.md +143 -0
- package/knowledge/02-process/README.md +186 -0
- package/knowledge/{guides/support/pipeline-protocol.md → 03-guides/00-pipeline-protocol.md} +10 -10
- package/knowledge/{guides/support/team-orchestrator.md → 03-guides/01-team-orchestrator.md} +53 -8
- package/knowledge/{guides/stages/requirement-analyzer.md → 03-guides/02-analyze-requirement.md} +3 -3
- package/knowledge/{guides/stages/ai-effect-evaluator.md → 03-guides/08-evaluate-ai-effect.md} +14 -7
- package/knowledge/{guides/support/skill-distiller.md → 03-guides/19-distill-skill.md} +3 -3
- package/knowledge/{guides/support/skill-updater.md → 03-guides/20-update-skill.md} +1 -1
- package/knowledge/{guides/support/methodology-extractor.md → 03-guides/22-extract-methodology.md} +2 -2
- package/knowledge/{guides/support/complexity-assessor.md → 03-guides/24-assess-complexity.md} +6 -4
- package/knowledge/{guides/support/tech-stack-analyzer.md → 03-guides/26-analyze-tech-stack.md} +1 -1
- package/knowledge/{guides/domain-driven-design.md → 03-guides/42-apply-ddd.md} +1 -1
- package/knowledge/{process/ai-sdlc.md → 03-guides/43-run-ai-sdlc.md} +1 -1
- package/knowledge/{guides/knowledge-management.md → 03-guides/44-manage-knowledge.md} +4 -4
- package/knowledge/03-guides/README.en.md +212 -0
- package/knowledge/03-guides/README.md +212 -0
- package/knowledge/{checklists/requirement.md → 04-checklists/00-requirement.md} +1 -1
- package/knowledge/{checklists/design.md → 04-checklists/01-design.md} +1 -1
- package/knowledge/{checklists/code.md → 04-checklists/02-code.md} +16 -1
- package/knowledge/{checklists/release.md → 04-checklists/04-release.md} +1 -1
- package/knowledge/04-checklists/README.en.md +119 -0
- package/knowledge/04-checklists/README.md +123 -0
- package/knowledge/{config/validation-patterns.yaml → 05-config/00-validation-patterns.yaml} +1 -1
- package/knowledge/{config/team-tasks.yaml → 05-config/02-team-tasks.yaml} +2 -2
- package/knowledge/05-config/03-role-composition.yaml +346 -0
- package/knowledge/{config/skill-compositions.yaml → 05-config/05-skill-compositions.yaml} +24 -24
- package/knowledge/05-config/README.en.md +54 -0
- package/knowledge/05-config/README.md +132 -0
- package/knowledge/06-environment/00-template-registry.md +310 -0
- package/knowledge/06-environment/01-detection-patterns.yaml +1692 -0
- package/knowledge/{environment → 06-environment}/README.en.md +4 -0
- package/knowledge/{environment → 06-environment}/README.md +66 -25
- package/knowledge/{standards/coding-style.md → 07-standards/00-coding-style.md} +123 -4
- package/knowledge/{standards/code-review.md → 07-standards/01-code-review.md} +3 -3
- package/knowledge/{standards/data-consistency.md → 07-standards/02-data-consistency.md} +1 -1
- package/knowledge/{standards/document-versioning.md → 07-standards/03-document-versioning.md} +1 -1
- package/knowledge/{standards/risk-detection.md → 07-standards/04-risk-detection.md} +5 -5
- package/knowledge/07-standards/README.en.md +119 -0
- package/knowledge/07-standards/README.md +123 -0
- package/knowledge/08-organization/00-vision-mission.md +113 -0
- package/knowledge/{organization/ai-native-team.md → 08-organization/01-ai-native-culture.md} +1 -1
- package/knowledge/{organization/team-metrics.md → 08-organization/02-team-metrics.md} +1 -1
- package/knowledge/08-organization/03-committee-structure.md +54 -0
- package/knowledge/08-organization/04-governance-metrics.md +55 -0
- package/knowledge/08-organization/05-improvement-process.md +71 -0
- package/knowledge/08-organization/README.en.md +165 -0
- package/knowledge/08-organization/README.md +165 -0
- package/knowledge/09-templates/00-requirement-proposal.md +344 -0
- package/knowledge/09-templates/01-architecture-design.md +494 -0
- package/knowledge/09-templates/02-api-design.md +408 -0
- package/knowledge/09-templates/03-database-design.md +313 -0
- package/knowledge/09-templates/04-product-design.md +237 -0
- package/knowledge/09-templates/05-domain-business.md +388 -0
- package/knowledge/09-templates/06-test-design.md +268 -0
- package/knowledge/09-templates/07-evaluation-design.md +372 -0
- package/knowledge/09-templates/08-component-knowledge.md +272 -0
- package/knowledge/09-templates/09-best-practices.md +218 -0
- package/knowledge/{environment/middleware-knowledge.md → 09-templates/10-middleware-knowledge.md} +106 -1
- package/knowledge/09-templates/README.en.md +222 -0
- package/knowledge/09-templates/README.md +216 -0
- package/knowledge/README.en.md +372 -0
- package/knowledge/README.md +354 -99
- package/package.json +1 -1
- package/plugins/.claude-plugin/plugin.json +460 -81
- package/plugins/agents/roles/ceo.md +1 -1
- package/plugins/agents/roles/product-owner.md +1 -1
- package/plugins/agents/roles/tech-lead.md +1 -1
- package/plugins/agents/support/consistency-checker.md +36 -3
- package/plugins/agents/support/monitoring-agent.md +215 -0
- package/plugins/agents/support/safety-auditor.md +2 -2
- package/plugins/agents/support/stage-gate-evaluator.md +95 -11
- package/plugins/agents/support/test-coverage-reviewer.md +1 -1
- package/plugins/benchmarks/templates/README.md +165 -13
- package/plugins/benchmarks/templates/commands/apply-template.yaml +108 -0
- package/plugins/benchmarks/templates/commands/archive-template.yaml +65 -0
- package/plugins/benchmarks/templates/commands/env-export-template.yaml +64 -0
- package/plugins/benchmarks/templates/commands/env-sync-template.yaml +104 -0
- package/plugins/benchmarks/templates/commands/env-template-template.yaml +96 -0
- package/plugins/benchmarks/templates/commands/env-template.yaml +58 -0
- package/plugins/benchmarks/templates/commands/env-update-template.yaml +110 -0
- package/plugins/benchmarks/templates/commands/env-validate-template.yaml +95 -0
- package/plugins/benchmarks/templates/commands/field-evolve-template.yaml +104 -0
- package/plugins/benchmarks/templates/commands/project-evolve-template.yaml +104 -0
- package/plugins/benchmarks/templates/commands/propose-template.yaml +88 -0
- package/plugins/benchmarks/templates/commands/review-template.yaml +124 -0
- package/plugins/benchmarks/templates/commands/run-template.yaml +127 -0
- package/plugins/benchmarks/templates/commands/test-template.yaml +149 -0
- package/plugins/benchmarks/templates/pipeline/experiment-template.yaml +92 -0
- package/plugins/benchmarks/templates/pipeline/hotfix-template.yaml +81 -0
- package/plugins/benchmarks/templates/skills/agile-iteration-template.yaml +78 -0
- package/plugins/benchmarks/templates/skills/benchmark-executor-template.yaml +114 -0
- package/plugins/benchmarks/templates/skills/benchmark-generator-template.yaml +52 -0
- package/plugins/benchmarks/templates/skills/delivery-stage-template.yaml +130 -0
- package/plugins/benchmarks/templates/skills/design-stage-template.yaml +131 -0
- package/plugins/benchmarks/templates/skills/experiment-iteration-template.yaml +60 -0
- package/plugins/benchmarks/templates/skills/exploration-phase-template.yaml +114 -0
- package/plugins/benchmarks/templates/skills/field-evolve-analyzer-template.yaml +51 -0
- package/plugins/benchmarks/templates/skills/field-evolve-distiller-template.yaml +34 -0
- package/plugins/benchmarks/templates/skills/field-evolve-executor-template.yaml +50 -0
- package/plugins/benchmarks/templates/skills/field-evolve-fixer-template.yaml +52 -0
- package/plugins/benchmarks/templates/skills/field-evolve-learner-template.yaml +33 -0
- package/plugins/benchmarks/templates/skills/field-evolve-scanner-template.yaml +74 -0
- package/plugins/benchmarks/templates/skills/field-evolve-template.yaml +71 -0
- package/plugins/benchmarks/templates/skills/field-evolve-verifier-template.yaml +51 -0
- package/plugins/benchmarks/templates/skills/hotfix-iteration-template.yaml +54 -0
- package/plugins/benchmarks/templates/skills/implementation-stage-template.yaml +127 -0
- package/plugins/benchmarks/templates/skills/layer1-validation-template.yaml +121 -0
- package/plugins/benchmarks/templates/skills/project-evolve-analyzer-template.yaml +51 -0
- package/plugins/benchmarks/templates/skills/project-evolve-fixer-template.yaml +52 -0
- package/plugins/benchmarks/templates/skills/project-evolve-generator-template.yaml +34 -0
- package/plugins/benchmarks/templates/skills/project-evolve-learner-template.yaml +50 -0
- package/plugins/benchmarks/templates/skills/project-evolve-reviewer-template.yaml +50 -0
- package/plugins/benchmarks/templates/skills/project-evolve-scanner-template.yaml +75 -0
- package/plugins/benchmarks/templates/skills/project-evolve-template.yaml +72 -0
- package/plugins/benchmarks/templates/skills/project-evolve-verifier-template.yaml +51 -0
- package/plugins/benchmarks/templates/skills/skill-forge-template.yaml +117 -0
- package/plugins/benchmarks/templates/skills/startup-guard-template.yaml +103 -0
- package/plugins/benchmarks/templates/skills/testing-stage-template.yaml +146 -0
- package/plugins/benchmarks/templates/skills/waterfall-iteration-template.yaml +55 -0
- package/plugins/commands/README.en.md +2 -2
- package/plugins/commands/README.md +2 -2
- package/plugins/commands/apply.md +102 -16
- package/plugins/commands/archive.md +60 -4
- package/plugins/commands/env-sync.md +1047 -406
- package/plugins/commands/env-template.md +11 -135
- package/plugins/commands/env-update.md +1 -1
- package/plugins/commands/env-validate.md +3 -3
- package/plugins/commands/explore.md +118 -1
- package/plugins/commands/field-evolve.md +51 -175
- package/plugins/commands/project-evolve.md +167 -68
- package/plugins/commands/propose.md +97 -6
- package/plugins/commands/review.md +5 -5
- package/plugins/commands/run.md +841 -13
- package/plugins/commands/status.md +138 -17
- package/plugins/commands/test.md +389 -0
- package/plugins/hooks/constitution-guard.js +1 -1
- package/plugins/hooks/environment-autocommit.js +366 -24
- package/plugins/hooks/environment-manager.js +3 -2
- package/plugins/hooks/execution-tracker.js +109 -4
- package/plugins/hooks/layer1-validator.js +117 -1
- package/plugins/hooks/lib/auto-fix-loop.js +605 -0
- package/plugins/hooks/lib/environment-config-loader.js +11 -7
- package/plugins/hooks/lib/hook-state-manager.js +98 -0
- package/plugins/hooks/lib/memory-extractor.js +27 -5
- package/plugins/hooks/lib/memory-manager.js +1 -1
- package/plugins/hooks/lib/test-auto-fix.test.js +194 -0
- package/plugins/hooks/monitoring-trigger.js +467 -0
- package/plugins/skills/README.en.md +15 -3
- package/plugins/skills/README.md +21 -11
- package/plugins/skills/agile-iteration/SKILL.md +187 -0
- package/plugins/skills/delivery-stage/SKILL.md +133 -12
- package/plugins/skills/design-stage/SKILL.md +103 -12
- package/plugins/skills/experiment-evaluator/SKILL.md +271 -0
- package/plugins/skills/experiment-iteration/SKILL.md +154 -0
- package/plugins/skills/exploration-phase/SKILL.md +93 -10
- package/plugins/skills/field-evolve-analyzer/SKILL.md +65 -0
- package/plugins/skills/field-evolve-distiller/SKILL.md +66 -0
- package/plugins/skills/field-evolve-executor/SKILL.md +94 -0
- package/plugins/skills/field-evolve-executor/executor.js +342 -0
- package/plugins/skills/field-evolve-fixer/SKILL.md +69 -0
- package/plugins/skills/field-evolve-learner/SKILL.md +65 -0
- package/plugins/skills/field-evolve-scanner/SKILL.md +87 -0
- package/plugins/skills/field-evolve-scanner/scripts/fallback-scanner.js +288 -0
- package/plugins/skills/field-evolve-verifier/SKILL.md +64 -0
- package/plugins/skills/hotfix-iteration/SKILL.md +279 -0
- package/plugins/skills/implementation-stage/SKILL.md +156 -15
- package/plugins/skills/layer1-validation/SKILL.md +1 -1
- package/plugins/skills/pending-dashboard/SKILL.md +9 -8
- package/plugins/skills/project-evolve-analyzer/SKILL.md +95 -0
- package/plugins/skills/project-evolve-fixer/SKILL.md +99 -0
- package/plugins/skills/project-evolve-generator/SKILL.md +149 -0
- package/plugins/skills/project-evolve-learner/SKILL.md +103 -0
- package/plugins/skills/project-evolve-reviewer/SKILL.md +104 -0
- package/plugins/skills/project-evolve-scanner/SKILL.md +95 -0
- package/plugins/skills/project-evolve-scanner/scripts/dependency-reuse-checker.js +395 -0
- package/plugins/skills/project-evolve-scanner/scripts/subsystem-coverage.js +315 -0
- package/plugins/skills/project-evolve-verifier/SKILL.md +105 -0
- package/plugins/skills/requirement-stage/SKILL.md +47 -13
- package/plugins/skills/skill-forge/SKILL.md +2 -2
- package/plugins/skills/testing-stage/SKILL.md +583 -8
- package/plugins/skills/waterfall-iteration/SKILL.md +115 -0
- package/scripts/cli/index.js +1 -1
- package/scripts/cli/init.js +30 -4
- package/scripts/cli/list.js +3 -2
- package/scripts/config/commands.config.js +8 -8
- package/scripts/config/hooks.config.js +1 -1
- package/scripts/install/constants.js +204 -165
- package/scripts/state.js +210 -1
- package/knowledge/config/README.en.md +0 -44
- package/knowledge/config/README.md +0 -44
- package/knowledge/config/role-composition.yaml +0 -98
- package/knowledge/config/team-triggers.yaml +0 -198
- package/knowledge/domain/README.md +0 -115
- package/knowledge/domain/flows/README.md +0 -194
- package/knowledge/domain/glossary.md +0 -143
- package/knowledge/domain/rules.md +0 -138
- package/knowledge/environment/component-knowledge.md +0 -316
- package/knowledge/environment/detection-patterns.yaml +0 -502
- package/knowledge/environment/template-registry.md +0 -321
- package/knowledge/guides/requirement-engineering.md +0 -329
- package/knowledge/guides/system-design.md +0 -352
- package/knowledge/principles/constitution.md +0 -134
- package/knowledge/principles/core-principles.md +0 -368
- package/knowledge/principles/design-philosophy.md +0 -877
- package/knowledge/process/README.en.md +0 -38
- package/knowledge/process/README.md +0 -48
- package/knowledge/templates/ai-evaluation.md +0 -150
- package/knowledge/templates/api-design.md +0 -117
- package/knowledge/templates/database-design.md +0 -132
- package/knowledge/templates/domain-driven-design.md +0 -321
- package/knowledge/templates/product-proposal.md +0 -201
- package/knowledge/templates/system-design.md +0 -227
- package/knowledge/templates/task-breakdown.md +0 -107
- package/knowledge/templates/test-case.md +0 -170
- package/plugins/commands/validate.md +0 -108
- package/plugins/skills/benchmark-executor/README.md +0 -93
- package/plugins/skills/evolution-process/SKILL.md +0 -291
- package/plugins/skills/project-evolution/SKILL.md +0 -847
- package/scripts/evolution/evolution-router.js +0 -273
- package/scripts/evolution/evolution-signal-collector.js +0 -307
- package/scripts/evolution/knowledge-loader.js +0 -346
- package/scripts/evolution/marketplace.js +0 -317
- package/scripts/evolution/version-manager.js +0 -371
- /package/knowledge/{process → 02-process}/01-requirement.md +0 -0
- /package/knowledge/{process → 02-process}/02-design.md +0 -0
- /package/knowledge/{process → 02-process}/03-implementation.md +0 -0
- /package/knowledge/{process → 02-process}/04-review.md +0 -0
- /package/knowledge/{process → 02-process}/05-testing.md +0 -0
- /package/knowledge/{process → 02-process}/06-delivery.md +0 -0
- /package/knowledge/{guides/stages/design-planner.md → 03-guides/03-design-solution.md} +0 -0
- /package/knowledge/{guides/stages/code-implementer.md → 03-guides/04-implement-code.md} +0 -0
- /package/knowledge/{guides/stages/test-planner.md → 03-guides/05-plan-testing.md} +0 -0
- /package/knowledge/{guides/stages/test-generator.md → 03-guides/06-generate-tests.md} +0 -0
- /package/knowledge/{guides/stages/release-checker.md → 03-guides/07-check-release.md} +0 -0
- /package/knowledge/{guides/stages/requirement-reviewer.md → 03-guides/09-review-requirement.md} +0 -0
- /package/knowledge/{guides/stages/design-reviewer.md → 03-guides/10-review-design.md} +0 -0
- /package/knowledge/{guides/stages/code-reviewer.md → 03-guides/11-review-code.md} +0 -0
- /package/knowledge/{guides/stages/test-reviewer.md → 03-guides/12-review-testing.md} +0 -0
- /package/knowledge/{guides/stages/security-reviewer.md → 03-guides/13-audit-security.md} +0 -0
- /package/knowledge/{guides/stages/consistency-checker.md → 03-guides/14-check-consistency.md} +0 -0
- /package/knowledge/{guides/stages/unit-test-runner.md → 03-guides/15-run-unit-tests.md} +0 -0
- /package/knowledge/{guides/stages/integration-test-runner.md → 03-guides/16-run-integration-tests.md} +0 -0
- /package/knowledge/{guides/stages/test-context-analyzer.md → 03-guides/17-analyze-test-context.md} +0 -0
- /package/knowledge/{guides/support/practice-logger.md → 03-guides/18-log-practice.md} +0 -0
- /package/knowledge/{guides/support/skill-validator.md → 03-guides/21-validate-skill.md} +0 -0
- /package/knowledge/{guides/support/scope-inference.md → 03-guides/23-infer-scope.md} +0 -0
- /package/knowledge/{guides/support/component-discovery.md → 03-guides/25-discover-component.md} +0 -0
- /package/knowledge/{guides/support/environment-scanner.md → 03-guides/27-scan-environment.md} +0 -0
- /package/knowledge/{guides/support/environment-validator.md → 03-guides/28-validate-environment.md} +0 -0
- /package/knowledge/{guides/support/knowledge-generator.md → 03-guides/29-generate-knowledge.md} +0 -0
- /package/knowledge/{guides/support/ai-capability-analyzer.md → 03-guides/30-analyze-ai-capability.md} +0 -0
- /package/knowledge/{guides/support/ai-component-analyzer.md → 03-guides/31-analyze-ai-component.md} +0 -0
- /package/knowledge/{guides/support/ai-agent-analyzer.md → 03-guides/32-analyze-ai-agent.md} +0 -0
- /package/knowledge/{guides/support/ai-rag-analyzer.md → 03-guides/33-analyze-ai-rag.md} +0 -0
- /package/knowledge/{guides/support/ai-task-assessor.md → 03-guides/34-assess-ai-task.md} +0 -0
- /package/knowledge/{guides/support/ai-pipeline-evaluator.md → 03-guides/35-evaluate-ai-pipeline.md} +0 -0
- /package/knowledge/{guides/support/ai-artifact-evaluator.md → 03-guides/36-evaluate-ai-artifact.md} +0 -0
- /package/knowledge/{guides/support/ai-evaluation-planner.md → 03-guides/37-plan-ai-evaluation.md} +0 -0
- /package/knowledge/{guides/support/ai-path-evaluator.md → 03-guides/38-evaluate-ai-path.md} +0 -0
- /package/knowledge/{guides/support/ai-data-validator.md → 03-guides/39-validate-ai-data.md} +0 -0
- /package/knowledge/{guides/support/ai-anomaly-analyzer.md → 03-guides/40-detect-ai-anomaly.md} +0 -0
- /package/knowledge/{guides/support/ai-test-diagnostics.md → 03-guides/41-diagnose-ai-test.md} +0 -0
- /package/knowledge/{guides/support/test-runner.md → 03-guides/45-test-runner.md} +0 -0
- /package/knowledge/{checklists/test.md → 04-checklists/03-test.md} +0 -0
- /package/knowledge/{config/team-stage.yaml → 05-config/01-team-stage.yaml} +0 -0
- /package/knowledge/{config/role-extensions.yaml → 05-config/04-role-extensions.yaml} +0 -0
@@ -0,0 +1,268 @@
+# Test Design: {project/feature name}
+
+> **Version**: v1.0
+> **Template sources**: ISTQB testing standards, test pyramid theory, Google testing methodology
+> **Scope**: design of unit, integration, end-to-end, and performance tests
+> **Generation flow**: test strategy → test cases → test data → test execution
+
+---
+
+## 1. Test Overview
+
+### 1.1 Basic Information
+
+| Field | Value |
+|------|-----|
+| Test project name | |
+| System/feature under test | |
+| Test type | Unit/Integration/E2E/Performance/Security |
+| Test owner | |
+| Test environment | |
+
+### 1.2 Test Objectives
+
+- [ ] Verify functional correctness
+- [ ] Cover boundary conditions
+- [ ] Handle exception scenarios
+- [ ] Meet performance targets
+- [ ] Detect security vulnerabilities
+
+### 1.3 Test Scope
+
+**In scope**:
+- Functional module A
+- Functional module B
+
+**Out of scope**:
+- Third-party system integration (covered by integration tests)
+
+---
+
+## 2. Test Strategy
+
+### 2.1 Test Pyramid
+
+```
+         /\
+        /  \
+       / E2E \        End-to-end tests (10%)
+      /______\
+     /        \
+    /  Integ.  \      Integration tests (20%)
+   /____________\
+  /              \
+ /      Unit      \   Unit tests (70%)
+/__________________\
+```
+
+### 2.2 Test Type Distribution
+
+| Test type | Share | Tools | Frequency |
+|---------|------|------|---------|
+| Unit tests | 70% | Jest/JUnit/pytest | Every commit |
+| Integration tests | 20% | TestContainer | Daily |
+| E2E tests | 10% | Selenium/Cypress | Weekly |
+
+---
+
+## 3. Test Cases
+
+### 3.1 Test Case Template
+
+#### TC-{number}: {test case name}
+
+| Attribute | Value |
+|------|-----|
+| Test case ID | TC-{number} |
+| Test name | |
+| Test type | Functional/Performance/Security |
+| Priority | P0/P1/P2 |
+| Preconditions | |
+
+**Test steps**:
+| Step | Action | Expected result |
+|------|------|---------|
+| 1 | | |
+| 2 | | |
+
+**Test data**:
+```json
+{
+  "input": {},
+  "expected": {}
+}
+```
+
+---
+
+### 3.2 Test Case List
+
+| Case ID | Case name | Type | Priority | Status |
+|--------|---------|------|--------|------|
+| TC-001 | | Functional | P0 | In design |
+| TC-002 | | Boundary | P1 | In design |
+
+---
+
+## 4. Test Scenarios
+
+### 4.1 Normal Scenarios
+
+| Scenario ID | Description | Input | Expected output |
+|--------|---------|------|---------|
+| SC-001 | Normal flow | | |
+
+### 4.2 Boundary Scenarios
+
+| Scenario ID | Boundary type | Input value | Expected behavior |
+|--------|---------|-------|---------|
+| SC-002 | Minimum | 0 | |
+| SC-003 | Maximum | MAX_INT | |
+| SC-004 | Empty value | null/empty | |
+
+### 4.3 Exception Scenarios
+
+| Scenario ID | Exception type | Trigger | Expected handling |
+|--------|---------|---------|---------|
+| SC-005 | Network error | Timeout | Fail after 3 retries |
+| SC-006 | Data error | Invalid input | Return a validation error |
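
Scenario SC-005 in the table above could be exercised roughly as follows. This is a minimal Python sketch and not part of the package: `FlakyService`, `call_with_retry`, and `check_sc_005` are hypothetical names, and reading "3 retries" as three further attempts after the initial call (four calls in total) is an assumption, since the template does not say whether the first attempt counts.

```python
class FlakyService:
    """Hypothetical stand-in for a network dependency that always times out."""

    def __init__(self):
        self.calls = 0

    def fetch(self):
        self.calls += 1
        raise TimeoutError("simulated network timeout")


def call_with_retry(func, retries=3):
    """Call func once, then retry up to `retries` more times on TimeoutError.

    Re-raises the last TimeoutError when every attempt fails.
    """
    last_error = None
    for _ in range(1 + retries):
        try:
            return func()
        except TimeoutError as exc:
            last_error = exc
    raise last_error


def check_sc_005():
    """SC-005: a persistent timeout ends in failure after 3 retries.

    Returns the number of calls made so a test can assert on it.
    """
    service = FlakyService()
    try:
        call_with_retry(service.fetch, retries=3)
    except TimeoutError:
        return service.calls
    raise AssertionError("expected TimeoutError")
```

A real test would replace `FlakyService` with a mock of the actual client and would probably add a delay or backoff between attempts.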
+
+---
+
+## 5. Test Data
+
+### 5.1 Data Sources
+
+| Data type | Source | Notes |
+|---------|------|------|
+| Baseline data | Test database | Pre-seeded data |
+| Dynamic data | Generated via API | Created at runtime |
+
+### 5.2 Data Preparation
+
+```sql
+-- Test data setup script
+INSERT INTO users (id, name, email) VALUES (1, 'Test User', 'test@example.com');
+```
+
+### 5.3 Data Cleanup
+
+```sql
+-- Post-test cleanup script
+DELETE FROM users WHERE id IN (1, 2, 3);
+```
+
+---
+
+## 6. Test Execution
+
+### 6.1 Execution Environments
+
+| Environment | Configuration | Purpose |
+|------|------|------|
+| Local development | MacBook Pro 16G | Unit tests |
+| CI | GitHub Actions | Integration tests |
+| Test environment | AWS EC2 | E2E tests |
+
+### 6.2 Commands
+
+```bash
+# Unit tests
+npm test
+
+# Integration tests
+npm run test:integration
+
+# E2E tests
+npm run test:e2e
+```
+
+### 6.3 Execution Plan
+
+| Phase | Schedule | Activity | Owner |
+|------|------|---------|--------|
+| Phase 1 | | Unit tests | |
+| Phase 2 | | Integration tests | |
+
+---
+
+## 7. Test Coverage
+
+### 7.1 Coverage Targets
+
+| Metric | Target | Current |
+|------|-------|-------|
+| Code coverage | > 80% | |
+| Branch coverage | > 70% | |
+| Requirement coverage | 100% | |
+
+### 7.2 Coverage Report
+
+```
+=============================== coverage summary ===============================
+Stmts    : 85% ( 100/120 )
+Branches : 75% ( 50/67 )
+Funcs    : 90% ( 45/50 )
+Lines    : 84% ( 95/113 )
+================================================================================
+```
+
+---
+
+## 8. Defect Management
+
+### 8.1 Defect Log
+
+| Defect ID | Description | Severity | Status | Related case |
+|--------|---------|---------|------|---------|
+| BUG-001 | | High/Medium/Low | New/In progress/Fixed | TC-001 |
+
+### 8.2 Defect Workflow
+
+```mermaid
+flowchart LR
+    A[New] --> B[Confirmed]
+    B --> C[In progress]
+    C --> D[Fixed]
+    D --> E[Verification]
+    E --> F[Closed]
+```
+
+---
+
+## 9. Test Report
+
+### 9.1 Test Results
+
+| Test type | Total | Passed | Failed | Skipped | Pass rate |
+|---------|------|------|------|------|-------|
+| Unit tests | | | | | |
+| Integration tests | | | | | |
+| E2E tests | | | | | |
+
+### 9.2 Test Conclusion
+
+- [ ] Tests passed; ready for release
+- [ ] Tests passed, with known issues
+- [ ] Tests failed; fixes required
+
+---
+
+## 10. Appendix
+
+### 10.1 Test Tools
+
+| Tool | Purpose | Version |
+|---------|------|------|
+| | | |
+
+### 10.2 References
+
+- [ISTQB testing standards](url)
+- [Google testing methodology](url)
+
+---
+
+**Maintainer**: QA team
+**Evolution zone**: Free zone
+**Related documents**: `knowledge/09-templates/09-evaluation-design.md`, `knowledge/09-templates/01-architecture-design.md`
@@ -0,0 +1,372 @@
+# Evaluation Design: {AI model/feature name}
+
+> **Version**: v1.0
+> **Template sources**: AI evaluation best practices, ML evaluation methodology, industry benchmarks
+> **Scope**: AI model quality evaluation, LLM application evaluation, RAG system evaluation
+> **Generation flow**: evaluation goals → evaluation datasets → evaluation metrics → evaluation execution
+
+---
+
+## 1. Evaluation Overview
+
+### 1.1 Basic Information
+
+| Field | Value |
+|------|-----|
+| Evaluation project name | |
+| AI system/model under evaluation | |
+| Evaluation type | Quality/Performance/Safety evaluation |
+| Evaluation owner | |
+| Evaluation environment | |
+
+### 1.2 Evaluation Objectives
+
+- [ ] Verify that model quality meets the target
+- [ ] Compare different models/versions
+- [ ] Find Badcases and improve on them
+- [ ] Assess release risk
+
+### 1.3 Evaluation Scope
+
+**In scope**:
+- Core scenario evaluation
+- Boundary scenario evaluation
+
+**Out of scope**:
+- Extreme scenarios (covered by dedicated evaluations)
+
+---
+
+## 2. Evaluation Datasets
+
+### 2.1 Dataset Overview
+
+| Dataset | Source | Size | Purpose |
+|-----------|---------|-------|------|
+| Test set A | Sampled from production | 1000 | Core scenario evaluation |
+| Test set B | Manually constructed | 200 | Boundary scenario evaluation |
+
+### 2.2 Data Distribution
+
+| Category | Training set | Validation set | Test set |
+|------|-------|-------|-------|
+| Category A | 70% | 15% | 15% |
+| Category B | 70% | 15% | 15% |
+
+### 2.3 Sample Record
+
+```json
+{
+  "id": "eval-001",
+  "input": "user input",
+  "expected_output": "expected output",
+  "metadata": {
+    "scene": "scenario type",
+    "difficulty": "easy/medium/hard"
+  }
+}
+```
+
+---
+
+## 3. Evaluation Metrics
+
+### 3.1 Quality Metrics
+
+| Metric | Definition | Formula | Target |
+|---------|------|---------|-------|
+| Accuracy | Share of all predictions that are correct | (TP+TN)/(TP+TN+FP+FN) | > 90% |
+| Precision | Share of predicted positives that are truly positive | TP/(TP+FP) | > 85% |
+| Recall | Share of actual positives that are found | TP/(TP+FN) | > 85% |
+| F1 score | Harmonic mean of precision and recall | 2PR/(P+R) | > 85% |
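
The four formulas in the table above can be checked with a few lines of Python. This is an illustrative sketch only: `classification_metrics` is a hypothetical helper, and returning 0.0 when a denominator is zero is an assumption, since the template does not define that case.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall, and F1 from confusion-matrix counts.

    A ratio whose denominator is zero is reported as 0.0 (an assumption).
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    accuracy = ratio(tp + tn, tp + tn + fp + fn)
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    f1 = ratio(2 * precision * recall, precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up counts for illustration: 200 samples in total.
metrics = classification_metrics(tp=90, tn=95, fp=5, fn=10)
```

With these counts accuracy is 185/200 and recall is 90/100, so each value can be compared directly against the target column.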
+
+### 3.2 LLM-Specific Metrics
+
+| Metric | Description | Evaluation method |
+|---------|------|---------|
+| Answer accuracy | Whether the answer is correct | Human rating/LLM judge |
+| Answer completeness | Whether all key points are covered | Human rating |
+| Answer relevance | Whether the answer is on topic | Human rating |
+| Safety | Whether harmful content is present | Rule-based detection + human review |
+| Perplexity | Uncertainty of the language model's predictions | Compute the PPL of the generated text |
+| BERTScore | Semantic similarity | F1 score based on BERT embeddings |
+| Toxicity score | Detection of harmful/biased content | Perspective API/toxicity model |
+| Helpfulness score | Degree of RLHF alignment | Human rating (1-5) |
+
+### 3.3 RAG-Specific Metrics
+
+| Metric | Description | Formula |
+|---------|------|---------|
+| Retrieval precision | Share of retrieved documents that are relevant | Relevant documents retrieved / total retrieved |
+| Retrieval recall | Share of relevant documents that are retrieved | Relevant documents retrieved / total relevant |
+| Context relevance | Relevance of the retrieved content to the query | Human rating/LLM judge |
+| Faithfulness | Whether the answer is grounded in the retrieved content | Factual consistency check |
+| Citation accuracy | Accuracy of the cited sources | Correct citations / total citations |
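
For the two retrieval ratios in the table above, a per-query computation might look like the sketch below. `retrieval_precision_recall` and the document IDs are hypothetical, and treating an empty retrieved or relevant set as 0.0 is an assumption rather than something the template specifies.

```python
def retrieval_precision_recall(retrieved_ids, relevant_ids):
    """Compute retrieval precision and recall for a single query.

    precision = relevant documents retrieved / total retrieved
    recall    = relevant documents retrieved / total relevant
    Duplicate IDs are collapsed; an empty denominator yields 0.0.
    """
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents retrieved, three labeled relevant, two in common.
precision, recall = retrieval_precision_recall(
    retrieved_ids=["d1", "d2", "d3", "d4"],
    relevant_ids=["d2", "d4", "d7"],
)
```

Averaging these per-query values over the evaluation set is one common way to report a single figure; the template leaves the aggregation open.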
+
+### 3.4 Performance Metrics
+
+| Metric | Target | Description |
+|---------|-------|------|
+| Response time (P50) | < 500 ms | Latency within which 50% of requests complete |
+| Response time (P99) | < 2 s | Latency within which 99% of requests complete |
+| QPS | > 100 | Queries per second |
+| Concurrency | > 50 | Maximum number of concurrent connections |
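
The P50 and P99 targets above can be checked against recorded latencies. The sketch below uses the nearest-rank percentile definition, which is one of several conventions and an assumption here; `percentile` and `check_latency` are hypothetical helpers, and the sample latencies are made up.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100.

    Returns the smallest sample such that at least pct% of all samples
    are less than or equal to it.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def check_latency(samples_ms, p50_target_ms=500, p99_target_ms=2000):
    """Compare measured P50/P99 latencies (in ms) against the targets."""
    p50 = percentile(samples_ms, 50)
    p99 = percentile(samples_ms, 99)
    return {
        "p50_ms": p50,
        "p99_ms": p99,
        "p50_ok": p50 < p50_target_ms,
        "p99_ok": p99 < p99_target_ms,
    }


result = check_latency([100, 200, 300, 400, 3000])
```

With only five samples the P99 is simply the slowest request, which shows why tail-latency targets need a reasonably large sample.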
+
+---
+
+## 4. Evaluation Execution
+
+### 4.1 Preparation
+
+**Step 1: Prepare the evaluation environment**
+
+1. Check for an evaluation plan (`evaluation-plan.md`)
+2. Check for an evaluation dataset (`evaluation/dataset/`)
+3. Check for evaluation scripts (`tests/evaluation/`, `evaluation/`)
+
+**Step 2: Load the evaluation dataset**
+
+1. Read the evaluation dataset
+2. Validate the dataset format
+3. Count the dataset size
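
Step 2 could be sketched as below, assuming the dataset is stored as JSON Lines with the `id`, `input`, `expected_output`, and `metadata.difficulty` fields that appear in the template's sample record. `load_dataset`, `dataset_stats`, and the choice of which fields are mandatory are hypothetical, not something the package defines.

```python
import json

# Assumed mandatory fields, taken from the template's sample record.
REQUIRED_FIELDS = ("id", "input", "expected_output")


def load_dataset(lines):
    """Read and validate JSON Lines records from an iterable of strings.

    Blank lines are skipped; a record missing a required field raises
    ValueError with the offending line number.
    """
    records = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        records.append(record)
    return records


def dataset_stats(records):
    """Count records in total and per metadata.difficulty value."""
    by_difficulty = {}
    for record in records:
        level = record.get("metadata", {}).get("difficulty", "unknown")
        by_difficulty[level] = by_difficulty.get(level, 0) + 1
    return {"total": len(records), "by_difficulty": by_difficulty}


sample_lines = [
    '{"id": "eval-001", "input": "a", "expected_output": "A", "metadata": {"difficulty": "easy"}}',
    "",
    '{"id": "eval-002", "input": "b", "expected_output": "B"}',
]
stats = dataset_stats(load_dataset(sample_lines))
```

Passing an open file object instead of a list works the same way, since a text file iterates line by line.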
+
+### 4.2 Evaluation Flow
+
+```mermaid
+flowchart TD
+    A[Prepare evaluation data] --> B[Run evaluation]
+    B --> C[Collect results]
+    C --> D[Analyze metrics]
+    D --> E{Targets met?}
+    E -->|Yes| F[Evaluation passed]
+    E -->|No| G[Analyze Badcases]
+    G --> H[Improve model]
+    H --> B
+```
+
+### 4.3 Execution Steps
+
+**Step 3: Run the evaluation**
+
+1. Initialize the AI/model component under evaluation
+2. Run inference on each test case
+3. Collect the predictions
+
+**Step 4: Compute the evaluation metrics**
+
+Compute the metrics according to the definitions in the evaluation plan:
+
+1. **Accuracy metrics**:
+   - Activation accuracy, matching accuracy, etc.
+
+2. **Quality metrics**:
+   - Response quality, task completion rate, etc.
+
+3. **Performance metrics**:
+   - Response time, throughput, etc.
+
+**Step 5: Generate the evaluation report**
+
+1. Summarize all metrics
+2. Compare them against the targets
+3. Identify Badcases
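
Steps 3 to 5 above can be tied together in a small loop. The sketch below is illustrative only: `run_evaluation` is a hypothetical function, exact string match is assumed as the correctness criterion, and accuracy is the only metric computed, whereas a real plan would define its own metrics and matching rules.

```python
def run_evaluation(predict, cases, targets):
    """Run inference, compute accuracy, compare to targets, collect Badcases.

    predict: callable mapping an input to a prediction (the model under test)
    cases:   list of dicts with "id", "input", "expected_output"
    targets: dict of metric name -> minimum value that must be exceeded
    """
    badcases = []
    correct = 0
    for case in cases:
        actual = predict(case["input"])
        if actual == case["expected_output"]:  # assumed criterion: exact match
            correct += 1
        else:
            badcases.append({
                "id": case["id"],
                "expected": case["expected_output"],
                "actual": actual,
            })
    metrics = {"accuracy": correct / len(cases) if cases else 0.0}
    status = {name: metrics[name] > target for name, target in targets.items()}
    return {
        "metrics": metrics,
        "status": status,
        "passed": all(status.values()),
        "badcases": badcases,
    }


# Toy "model" that upper-cases its input; the last case is a deliberate miss.
cases = [
    {"id": "eval-001", "input": "a", "expected_output": "A"},
    {"id": "eval-002", "input": "b", "expected_output": "B"},
    {"id": "eval-003", "input": "c", "expected_output": "C"},
    {"id": "eval-004", "input": "d", "expected_output": "x"},
]
report = run_evaluation(str.upper, cases, targets={"accuracy": 0.9})
```

The returned `badcases` list maps naturally onto a Badcase table, and `status` onto a per-metric pass/fail column.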

### 4.4 Evaluation Commands

```bash
# Run the evaluation
python evaluate.py --model {model_name} --dataset {dataset_name}

# Generate the report
python generate_report.py --output report.md
```

### 4.5 Evaluation Configuration

```yaml
evaluation:
  model:
    # Placeholders are quoted so that YAML reads them as strings,
    # not as flow mappings.
    name: "{model_name}"
    version: v1.0
  dataset:
    name: "{dataset_name}"
    path: data/test.jsonl
  metrics:
    - accuracy
    - precision
    - recall
    - f1
```

---

## 5. Badcase Analysis

### 5.1 Badcase Classification

| Category | Count | Share | Description |
|------|------|------|------|
| Data quality issues | | | Labeling errors / noisy data |
| Insufficient model capability | | | The model cannot understand a class of inputs |
| Edge cases | | | Extreme inputs |
| Other | | | |
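
The count and share columns of the classification table can be filled in from a list of per-badcase category labels. A minimal sketch, assuming each badcase has already been assigned exactly one category:

```python
from collections import Counter


def badcase_breakdown(categories):
    """Return (category, count, share) rows, most frequent first."""
    counts = Counter(categories)
    total = sum(counts.values())
    return [
        (name, count, f"{count / total:.1%}")
        for name, count in counts.most_common()
    ]
```

For example, `badcase_breakdown(["edge case", "data quality", "edge case"])` returns one row per category with its count and percentage share.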

### 5.2 Badcase Examples

| ID | Input | Expected output | Actual output | Error type |
|----|------|---------|---------|---------|
| 001 | | | | |

### 5.3 Improvement Suggestions

| Issue | Proposed fix | Priority |
|------|---------|--------|
| | | P0/P1/P2 |

---

## 6. Evaluation Results

### 6.1 Results Summary

| Evaluation set | Samples | Accuracy | Precision | Recall | F1 score |
|-------|-------|--------|-------|-------|--------|
| Test set A | 1000 | | | | |
| Test set B | 200 | | | | |
| Total | 1200 | | | | |

### 6.2 Evaluation Report Format

```markdown
## Effect Evaluation Results

### Evaluation Target
- Component name: {component_name}
- Evaluation dataset: {dataset_path}
- Number of test cases: {total_cases}

### Evaluation Metrics
| Metric | Target | Actual | Status |
|----------|--------|--------|------|
| ... | ... | ... | ✅/❌ |

### Badcase Analysis
| Case ID | Input | Expected output | Actual output | Problem description |
|--------|------|----------|----------|----------|
| ... | ... | ... | ... | ... |

### Conclusion
- Evaluation passed: ✅ Yes/❌ No
- Metrics meeting target: {passed_count}/{total_count}
- Needs optimization: {items_to_optimize}
```

### 6.3 Results Analysis

**Strengths**:
- Performs well in XX scenarios

**Weaknesses**:
- Needs improvement in XX scenarios

### 6.4 Comparison with Baseline

| Model | Accuracy | Precision | Recall | F1 score |
|------|-------|-------|-------|--------|
| Baseline model | | | | |
| Current model | | | | |
| Improvement | +X% | +X% | +X% | +X% |

---

## 7. Verdict Criteria

| Verdict | Description |
|---------|------|
| **Passed** | All metrics meet their targets; ready for release |
| **Partially passed** | Some metrics meet their targets; assess the risk before deciding |
| **Failed** | Primary metrics miss their targets; optimize and re-evaluate |

### Verdict Flow

```mermaid
flowchart TD
    A[Evaluation complete] --> B{All metrics meet targets?}
    B -->|Yes| C[Evaluation passed]
    B -->|No| D{Primary metrics meet targets?}
    D -->|Yes| E[Partially passed, risk assessment]
    D -->|No| F[Failed, optimization required]
    E --> G{Risk acceptable?}
    G -->|Yes| H[Conditionally passed]
    G -->|No| F
```
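
The verdict flow above maps onto a small decision function. This is a sketch, assuming per-metric pass/fail flags are already available and that the set of primary metrics and the risk decision are supplied by the reviewer; the names and return strings are illustrative.

```python
def decide_verdict(metric_passed, primary_metrics, risk_acceptable=False):
    """Follow the verdict flow: all pass -> passed; primary metrics fail ->
    failed; otherwise the outcome depends on the risk assessment."""
    if all(metric_passed.values()):
        return "passed"
    # A primary metric that was never measured counts as not met.
    if not all(metric_passed.get(name, False) for name in primary_metrics):
        return "failed"
    return "conditionally passed" if risk_acceptable else "failed"
```

Keeping the risk decision as an explicit argument mirrors the flowchart, where a partially passing result is only accepted after a separate risk assessment.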

---

## 8. Evaluation Conclusion

### 8.1 Conclusion

- [ ] Evaluation passed; ready for release
- [ ] Evaluation passed, with known issues
- [ ] Evaluation failed; optimization required

### 8.2 Risks

| Risk | Impact | Mitigation |
|------|------|---------|
| | | |

### 8.3 Follow-up Plan

| Task | Owner | Date |
|------|-------|------|
| | | |

---

## 9. Appendix

### 9.1 Evaluation Tools

| Tool | Purpose | Version |
|---------|------|------|
| | | |

### 9.2 Evaluation Execution Steps (from 08-evaluate-ai-effect.md)

**Step 1: Prepare the evaluation environment**
- Check the evaluation plan (`evaluation-plan.md`)
- Check the evaluation dataset (`evaluation/dataset/`)
- Check the evaluation scripts (`tests/evaluation/`, `evaluation/`)

**Step 2: Load the evaluation dataset**
- Read the evaluation dataset
- Validate the dataset format
- Count the dataset size

**Step 3: Run the evaluation**
- Initialize the AI/model component under evaluation
- Run inference on each test case
- Collect the predictions

**Step 4: Compute evaluation metrics**
- Accuracy metrics
- Quality metrics
- Performance metrics

**Step 5: Generate the evaluation report**
- Summarize all metrics
- Compare them with the target values
- Identify badcases

### 9.3 References

- [AI evaluation best practices](url)
- [LLM evaluation methodology](url)

---

**Maintainers**: AI team + QA team
**Evolution zone**: Free zone
**Related documents**: `knowledge/09-templates/08-test-design.md`, `knowledge/09-templates/02-api-design.md`