astron-eval 0.0.1
- package/LICENSE +21 -0
- package/README.md +119 -0
- package/bin/astron-eval.mjs +111 -0
- package/package.json +24 -0
- package/skills/astron-eval/SKILL.md +60 -0
- package/skills/model-evaluation/SKILL.md +180 -0
- package/skills/model-evaluation/assets/dimensions/内容相关性维度.json +20 -0
- package/skills/model-evaluation/assets/dimensions/内容精确维度.json +19 -0
- package/skills/model-evaluation/assets/dimensions/准确性维度-个性化规划.json +20 -0
- package/skills/model-evaluation/assets/dimensions/准确性维度-信息分析.json +20 -0
- package/skills/model-evaluation/assets/dimensions/准确性维度-旅游出行.json +20 -0
- package/skills/model-evaluation/assets/dimensions/准确性维度.json +20 -0
- package/skills/model-evaluation/assets/dimensions/创意性-吸引性维度.json +21 -0
- package/skills/model-evaluation/assets/dimensions/创新性维度.json +20 -0
- package/skills/model-evaluation/assets/dimensions/完整性维度-信息分析.json +20 -0
- package/skills/model-evaluation/assets/dimensions/完整性维度.json +20 -0
- package/skills/model-evaluation/assets/dimensions/形式相关性维度.json +20 -0
- package/skills/model-evaluation/assets/dimensions/忠诚度维度.json +20 -0
- package/skills/model-evaluation/assets/dimensions/指令遵循维度.json +20 -0
- package/skills/model-evaluation/assets/dimensions/文本差异度-TER维度.json +20 -0
- package/skills/model-evaluation/assets/dimensions/有效性维度-个性化规划.json +20 -0
- package/skills/model-evaluation/assets/dimensions/有效性维度-信息分析.json +20 -0
- package/skills/model-evaluation/assets/dimensions/有效性维度-流程自动化.json +20 -0
- package/skills/model-evaluation/assets/dimensions/有效性维度.json +21 -0
- package/skills/model-evaluation/assets/dimensions/核心元素维度.json +20 -0
- package/skills/model-evaluation/assets/dimensions/格式遵循维度.json +19 -0
- package/skills/model-evaluation/assets/dimensions/特色亮点维度.json +20 -0
- package/skills/model-evaluation/assets/dimensions/用例级评测维度模板.json +25 -0
- package/skills/model-evaluation/assets/dimensions/相似度-BERTScore维度.json +20 -0
- package/skills/model-evaluation/assets/dimensions/相似度-Cosine维度.json +20 -0
- package/skills/model-evaluation/assets/dimensions/相似度-ROUGE维度.json +20 -0
- package/skills/model-evaluation/assets/dimensions/相关性维度-个性化规划.json +20 -0
- package/skills/model-evaluation/assets/dimensions/相关性维度.json +21 -0
- package/skills/model-evaluation/assets/dimensions/精确性-BLUE维度.json +20 -0
- package/skills/model-evaluation/assets/dimensions/精确性-COMET维度.json +20 -0
- package/skills/model-evaluation/assets/dimensions/逻辑合理性维度.json +20 -0
- package/skills/model-evaluation/assets/dimensions/逻辑连贯性维度-个性化规划.json +20 -0
- package/skills/model-evaluation/assets/dimensions/逻辑连贯性维度-信息分析.json +20 -0
- package/skills/model-evaluation/assets/dimensions/逻辑连贯性维度-流程自动化.json +20 -0
- package/skills/model-evaluation/assets/dimensions/逻辑连贯性维度.json +21 -0
- package/skills/model-evaluation/assets/eval-judge.json +11 -0
- package/skills/model-evaluation/assets/experts/business-process-automation.json +71 -0
- package/skills/model-evaluation/assets/experts/content-generation.json +75 -0
- package/skills/model-evaluation/assets/experts/content-match.json +37 -0
- package/skills/model-evaluation/assets/experts/information-analysis.json +87 -0
- package/skills/model-evaluation/assets/experts/marketing-digital-human.json +27 -0
- package/skills/model-evaluation/assets/experts/personalized-planning.json +87 -0
- package/skills/model-evaluation/assets/experts/text-translation.json +103 -0
- package/skills/model-evaluation/assets/experts/tourism-travel.json +119 -0
- package/skills/model-evaluation/assets/templates/custom-dimension.template.json +30 -0
- package/skills/model-evaluation/eval-build.md +281 -0
- package/skills/model-evaluation/eval-execute.md +196 -0
- package/skills/model-evaluation/eval-init.md +237 -0
- package/skills/model-evaluation/processes/dimension-process.md +207 -0
- package/skills/model-evaluation/processes/evalset-create-process.md +184 -0
- package/skills/model-evaluation/processes/evalset-parse-process.md +171 -0
- package/skills/model-evaluation/processes/evalset-supplement-process.md +136 -0
- package/skills/model-evaluation/processes/keypoint-process.md +148 -0
- package/skills/model-evaluation/processes/python-env-process.md +113 -0
- package/skills/model-evaluation/references/中间产物说明.md +340 -0
- package/skills/model-evaluation/references/内置模板说明.md +149 -0
- package/skills/model-evaluation/references/脚本定义.md +274 -0
- package/skills/model-evaluation/references/认证服务接口说明.md +271 -0
- package/skills/model-evaluation/references/评测服务接口说明.md +455 -0
- package/skills/model-evaluation/references/评测维度说明.md +171 -0
- package/skills/model-evaluation/scripts/cfg/eval-auth.cfg +16 -0
- package/skills/model-evaluation/scripts/cfg/eval-server.cfg +1 -0
- package/skills/model-evaluation/scripts/clients/__init__.py +33 -0
- package/skills/model-evaluation/scripts/clients/api_client.py +97 -0
- package/skills/model-evaluation/scripts/clients/auth_client.py +96 -0
- package/skills/model-evaluation/scripts/clients/http_client.py +199 -0
- package/skills/model-evaluation/scripts/clients/oauth_callback.py +397 -0
- package/skills/model-evaluation/scripts/clients/token_manager.py +53 -0
- package/skills/model-evaluation/scripts/eval_auth.py +588 -0
- package/skills/model-evaluation/scripts/eval_dimension.py +240 -0
- package/skills/model-evaluation/scripts/eval_set.py +410 -0
- package/skills/model-evaluation/scripts/eval_task.py +324 -0
- package/skills/model-evaluation/scripts/files/__init__.py +38 -0
- package/skills/model-evaluation/scripts/files/file_utils.py +330 -0
- package/skills/model-evaluation/scripts/files/streaming.py +245 -0
- package/skills/model-evaluation/scripts/utils/__init__.py +128 -0
- package/skills/model-evaluation/scripts/utils/constants.py +101 -0
- package/skills/model-evaluation/scripts/utils/datetime_utils.py +60 -0
- package/skills/model-evaluation/scripts/utils/errors.py +244 -0
- package/skills/model-evaluation/scripts/utils/keypoint_prompts.py +73 -0
- package/skills/skill-driven-eval/SKILL.md +456 -0
- package/skills/skill-driven-eval/agents/grader.md +144 -0
- package/skills/skill-driven-eval/eval-viewer/__init__.py +1 -0
- package/skills/skill-driven-eval/eval-viewer/generate_report.py +485 -0
- package/skills/skill-driven-eval/eval-viewer/viewer.html +767 -0
- package/skills/skill-driven-eval/references/schemas.md +282 -0
- package/skills/skill-driven-eval/scripts/__init__.py +1 -0
- package/skills/skill-driven-eval/scripts/__main__.py +70 -0
- package/skills/skill-driven-eval/scripts/aggregate_results.py +681 -0
- package/skills/skill-driven-eval/scripts/extract_transcript.py +294 -0
- package/skills/skill-driven-eval/scripts/test_aggregate.py +244 -0
@@ -0,0 +1,244 @@
+# -*- coding: utf-8 -*-
+"""
+Custom exception classes and error-handling utilities
+"""
+import sys
+import json
+from typing import Dict, Any, TypedDict, Optional
+from .constants import (
+    ERR_FILE_NOT_FOUND,
+    ERR_FILE_ENCODING,
+    ERR_FILE_PARSE,
+    ERR_CONFIG_INVALID,
+    ERR_NETWORK_TIMEOUT,
+    ERR_NETWORK_CONNECTION,
+    ERR_REMOTE_AUTH_EXPIRED,
+    ERR_REMOTE_DEFAULT,
+)
+
+
+# ============================================================================
+# Type definitions
+# ============================================================================
+
+class _ResultDictRequired(TypedDict):
+    """Unified result type - required fields"""
+    success: bool
+    action: str
+    status: str
+    message: str
+
+
+class ResultDict(_ResultDictRequired, total=False):
+    """Unified result type - optional fields"""
+    data: Dict[str, Any]
+    code: int
+
+
+# ============================================================================
+# Result builder
+# ============================================================================
+
+def result(action: str, status: str, message: str,
+           data: Optional[Dict[str, Any]] = None,
+           success: Optional[bool] = None,
+           code: Optional[int] = None) -> ResultDict:
+    """
+    Build a result dict in the unified format
+
+    Args:
+        action: operation type (e.g. "check", "load", "save")
+        status: status label (e.g. "valid", "error", "not_found")
+        message: detailed message
+        data: additional data
+        success: whether the operation succeeded (inferred from status when None)
+        code: error code (optional)
+
+    Returns:
+        The normalized result dict
+    """
+    if success is None:
+        success = status in ("valid", "success", "waiting", "loaded", "saved")
+
+    r: ResultDict = {
+        "success": success,
+        "action": action,
+        "status": status,
+        "message": message,
+        "data": data or {}
+    }
+    if code is not None:
+        r["code"] = code
+    return r
+
+
+# ============================================================================
+# Custom exception base class
+# ============================================================================
+
+class EvalError(Exception):
+    """
+    Base class for evaluation errors
+
+    Base class for all script-local errors; carries an error code and a message.
+    """
+    def __init__(self, message: str, code: Optional[int] = None):
+        self.message = message
+        self.code = code
+        super().__init__(message)
+
+    def to_dict(self) -> Dict[str, Any]:
+        """Convert to an error dict"""
+        return {
+            "success": False,
+            "code": self.code,
+            "message": self.message
+        }
+
+
+# ============================================================================
+# File-related exceptions
+# ============================================================================
+
+class FileEncodingError(EvalError):
+    """
+    File encoding error
+
+    Raised when the file cannot be read with the specified encoding.
+    """
+    def __init__(self, path: str, encoding: str = "utf-8"):
+        self.path = path
+        self.encoding = encoding
+        super().__init__(
+            f"Cannot read file with {encoding} encoding: {path}",
+            code=ERR_FILE_ENCODING
+        )
+
+
+class FileParseError(EvalError):
+    """
+    File parse error
+
+    Raised when the file content cannot be parsed (e.g. malformed JSON).
+    """
+    def __init__(self, path: str, detail: str = ""):
+        self.path = path
+        self.detail = detail
+        msg = f"Failed to parse file: {path}"
+        if detail:
+            msg += f" - {detail}"
+        super().__init__(msg, code=ERR_FILE_PARSE)
+
+
+class FileNotFoundError(EvalError):
+    """
+    File-not-found error (note: this name shadows the built-in FileNotFoundError within this module)
+    """
+    def __init__(self, path: str):
+        self.path = path
+        super().__init__(f"File not found: {path}", code=ERR_FILE_NOT_FOUND)
+
+
+# ============================================================================
+# Configuration-related exceptions
+# ============================================================================
+
+class ConfigError(EvalError):
+    """
+    Configuration error
+    """
+    def __init__(self, message: str, path: Optional[str] = None):
+        self.path = path
+        msg = message
+        if path:
+            msg = f"{message} (file: {path})"
+        super().__init__(msg, code=ERR_CONFIG_INVALID)
+
+
+# ============================================================================
+# Network-related exceptions
+# ============================================================================
+
+class NetworkError(EvalError):
+    """
+    Network error
+
+    Used when a network request fails; keeps a reference to the original exception.
+    """
+    def __init__(self, message: str, original_error: Optional[Exception] = None,
+                 code: int = ERR_NETWORK_TIMEOUT):
+        self.original_error = original_error
+        super().__init__(message, code=code)
+
+
+class NetworkTimeoutError(NetworkError):
+    """
+    Network timeout error
+    """
+    def __init__(self, message: str = "Request timed out", original_error: Optional[Exception] = None):
+        super().__init__(message, original_error, code=ERR_NETWORK_TIMEOUT)
+
+
+class NetworkConnectionError(NetworkError):
+    """
+    Network connection error
+    """
+    def __init__(self, message: str = "Connection failed", original_error: Optional[Exception] = None):
+        super().__init__(message, original_error, code=ERR_NETWORK_CONNECTION)
+
+
+# ============================================================================
+# Authentication-related exceptions
+# ============================================================================
+
+class AuthExpiredError(EvalError):
+    """
+    Token expired error
+
+    Passes through the remote service error code (10002).
+    """
+    def __init__(self, message: str = "Token expired, please re-authorize"):
+        # Use the remote service error code (pass-through)
+        super().__init__(message, code=ERR_REMOTE_AUTH_EXPIRED)
+
+
+class ApiError(EvalError):
+    """
+    API error - passes through the remote error code
+
+    D-10: pass through remote service errors, preserving the original code and message
+    """
+    def __init__(self, message: str, code: Optional[int] = None, data: Optional[Dict[str, Any]] = None):
+        self.data = data or {}
+        super().__init__(message, code=code)
+
+
+# ============================================================================
+# CLI error handling
+# ============================================================================
+
+def handle_cli_error(e: Exception) -> None:
+    """
+    Unified CLI error handling: print the error as JSON and exit
+
+    Strategy:
+    - EvalError and its subclasses: use the exception's own code and message
+    - Other exceptions: use the default error code ERR_REMOTE_DEFAULT
+
+    Output format: {"success": false, "code": int, "message": str}
+    """
+    if isinstance(e, EvalError):
+        # Every EvalError subclass has code and message attributes
+        print(json.dumps({
+            "success": False,
+            "code": e.code,
+            "message": e.message
+        }, ensure_ascii=False))
+    else:
+        # Non-custom exceptions (e.g. exceptions from the requests library, ValueError)
+        print(json.dumps({
+            "success": False,
+            "code": ERR_REMOTE_DEFAULT,
+            "message": str(e)
+        }, ensure_ascii=False))
+    sys.exit(1)
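The hunk above maps any exception to a JSON payload with `success`, `code`, and `message`. The following is a minimal sketch of how a caller might reuse that mapping without terminating the process, for example in a unit test. It is not part of the package: `ERR_DEFAULT`, `error_payload`, and `run` are hypothetical names, the `EvalError` class is a cut-down stand-in for the one in the diff, and the value `-1` is a placeholder because the real `ERR_REMOTE_DEFAULT` is defined in `constants.py`, which is not shown here.

```python
import json
from typing import Any, Callable, Dict, Optional

# Hypothetical placeholder; the real ERR_REMOTE_DEFAULT lives in constants.py.
ERR_DEFAULT = -1


class EvalError(Exception):
    """Cut-down stand-in for the EvalError base class shown in the diff."""

    def __init__(self, message: str, code: Optional[int] = None):
        self.message = message
        self.code = code
        super().__init__(message)


def error_payload(e: Exception) -> Dict[str, Any]:
    """Build the same dict shape that handle_cli_error prints, without exiting."""
    if isinstance(e, EvalError):
        # Custom errors carry their own code and message.
        return {"success": False, "code": e.code, "message": e.message}
    # Anything else falls back to the default code and str(e).
    return {"success": False, "code": ERR_DEFAULT, "message": str(e)}


def run(task: Callable[[], Dict[str, Any]]) -> str:
    """Run a callable and serialize either its result or the error payload."""
    try:
        return json.dumps(task(), ensure_ascii=False)
    except Exception as e:
        return json.dumps(error_payload(e), ensure_ascii=False)
```

Returning the payload instead of calling `sys.exit(1)` keeps the mapping testable; a CLI entry point could still print the string and choose its own exit code.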
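@@ -0,0 +1,73 @@
+"""Prompt templates for evaluation-point generation"""
+
+SYSTEM_PROMPT = """You are an expert at generating evaluation points. Your task is to extract "evaluation points" (key scoring points) from the given question-and-answer data.
+
+An evaluation point is a check item used to judge the quality of an actual answer. Each one is a yes/no statement phrased as "Whether ...", used to verify item by item whether the actual answer covers that point.
+
+## Thinking steps for generating evaluation points
+
+Analyze in depth following the steps below, but **output only the final JSON array of evaluation points**:
+
+### Step 1: Understand the scenario and intent behind the user's question
+
+Think from the user's perspective:
+- Why is the user asking this question? What problem or confusion did they run into?
+- What type of question is it? (fact lookup, how-to guidance, explanation of principles, troubleshooting, comparison and selection, etc.)
+- What is the user's likely background? (beginner, experienced developer, operations engineer, etc.)
+
+### Step 2: Identify the core concerns of the question
+
+Analyze the user's real needs from shallow to deep:
+- **Surface need**: What is the question literally asking?
+- **Underlying need**: What problem does the user really want to solve? What is the pain point behind it?
+- **Key information**: What information is indispensable for a complete answer?
+
+### Step 3: Distill key information points using the context
+
+If context is provided:
+- Which information in the context directly answers the user's core concerns?
+- Does the context contain supplementary information that the reference answer does not mention explicitly but that matters for completeness?
+- Which details in the context (e.g. specific configuration, caveats, common mistakes) does the user most need to know?
+
+If a reference answer is provided:
+- Which points in the reference answer are key to addressing the user's core concerns?
+- What is the logical structure of the reference answer? (e.g. first what it is, then why, finally how to do it)
+- Which points are objective facts that can be verified independently?
+
+### Step 4: Extract the 1-3 most essential evaluation points
+
+Based on the analysis above, extract the most critical evaluation points:
+- **Prioritization**: if there are several candidates, choose the 1-3 with the greatest impact on answer quality
+- **Atomic split**: each evaluation point contains only one independently verifiable piece of information
+- **User perspective**: from the user's standpoint, does this evaluation point really bear on whether the problem gets solved
+
+## Quality criteria for evaluation points
+
+1. **Count limit**: strictly 1-3; fewer is better than padding
+2. **Atomicity**: each evaluation point contains only one independently verifiable piece of information
+3. **Decidability**: must begin with "Whether" and be answerable with a clear yes or no
+4. **Centrality**: focus on what the user cares about most; do not dwell on trivia
+5. **Conciseness**: short and clear, avoiding more than 50 characters
+6. **Verifiability**: objectively verifiable; avoid subjective judgments (e.g. "Whether the answer is good")
+7. **Independence**: does not rely on the context of other evaluation points; understandable on its own
+
+## Output format
+
+Output strictly a JSON array in which each element is an evaluation-point string of the form "Whether ...".
+**The array length must be between 1 and 3**; do not output anything else:
+
+```json
+["Whether it mentions xxx", "Whether it includes xxx"]
+```
+
+Note: output only the JSON array, with no additional explanation or commentary."""
+
+
+def build_user_prompt(question: str, reference: str, context: str) -> str:
+    """Build the user prompt"""
+    parts = [f"## User question\n{question}"]
+    if reference and reference.strip() and reference.strip() != "None":
+        parts.append(f"## Reference answer\n{reference}")
+    if context and context.strip() and context.strip() != "None":
+        parts.append(f"## Context\n{context}")
+    return "\n\n".join(parts) + "\n\nExtract the evaluation points:"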
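The system prompt in this hunk asks the model for a bare JSON array of one to three evaluation-point strings, while showing its own example inside a Markdown code fence. The diff does not show how the reply is parsed, so the sketch below is only one plausible way to validate it. `parse_keypoints` and its optional `prefix` parameter are hypothetical and do not come from the package; the fence-stripping step is an assumption that a model may copy the fenced example format.

```python
import json
import re
from typing import List, Optional

# Markdown code-fence marker, built indirectly to keep this example readable.
FENCE = "`" * 3
_FENCED = re.compile(rf"^{FENCE}(?:json)?\s*(.*?)\s*{FENCE}$", re.DOTALL)


def parse_keypoints(raw: str, prefix: Optional[str] = None) -> List[str]:
    """Parse a model reply into 1-3 evaluation-point strings, or raise ValueError."""
    text = raw.strip()
    # Tolerate a reply wrapped in a Markdown code fence.
    fenced = _FENCED.match(text)
    if fenced:
        text = fenced.group(1)
    try:
        items = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    # The prompt requires an array of length 1 to 3.
    if not isinstance(items, list) or not 1 <= len(items) <= 3:
        raise ValueError("expected a JSON array with 1 to 3 elements")
    for item in items:
        if not isinstance(item, str) or not item.strip():
            raise ValueError("every evaluation point must be a non-empty string")
        # Optionally enforce the required leading phrase of each point.
        if prefix is not None and not item.startswith(prefix):
            raise ValueError(f"evaluation point does not start with {prefix!r}: {item!r}")
    return items
```

Passing `prefix` lets a caller enforce the leading phrase that the prompt mandates for every point, in whichever language the prompt is written; leaving it as `None` checks only the array shape.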