evaldrift 0.1.0

package/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2026 promptdrift contributors
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
package/README.md ADDED
@@ -0,0 +1,157 @@
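+ # promptdrift
+
+ > **Snapshot-style regression testing** for LLM prompts and agents: change one sentence of a prompt and find out right away whether it quietly broke other test cases.
+ >
+ > **Chinese models and Chinese-language evaluation first**: DeepSeek / Kimi / Tongyi / Doubao / Zhipu / local Ollama work out of the box.
+
+ The biggest pitfall in prompt tuning is this: you change the prompt to fix scenario A, silently break scenarios B and C, and do not notice until users complain. `promptdrift` makes catching this as simple as running unit tests: lock a **baseline**, rerun after every prompt change, and it tells you directly **which test cases regressed**.
+
+ Sample terminal output (the tool reports in Chinese; the test names 退款问题, 营业时间 and 情绪安抚 mean "refund question", "business hours" and "calming an upset customer", 缺少关键词 is "missing keyword", 对比基线 is "compared with baseline", 退化 is "regression", 改进 is "improvement", and 通过→失败 is "pass→fail"):
+
+ ```
+ ✓ 退款问题 ██████████ 100% 120ms
+ ✗ 营业时间 ░░░░░░░░░░ 0% 98ms
+ ✗ [contains] 缺少关键词: 09:00
+ ✓ 情绪安抚 █████████░ 85% 340ms
+ ────────────────────────────────────────
+ 对比基线:
+ ↓ 退化 营业时间 100% → 0% (-100%) [通过→失败]
+
+ 1 处退化 · 0 处改进
+ ```
+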
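+ ## Why use it (instead of promptfoo / deepeval)
+
+ Those tools are good, but they come from the English-speaking world, and their **support for Chinese models and Chinese-language evaluation is weak**. `promptdrift` fills exactly that gap:
+
+ - **Chinese models are first-class citizens**: a single `baseUrl` switches to DeepSeek / Kimi / Tongyi / Doubao / Zhipu, with no SDK to wrestle with.
+ - **Chinese-language evaluation**: the built-in `llm-judge` judge prompt, the reports, and the error messages are all in Chinese.
+ - **Focused on one thing, preventing regressions**: baseline comparison, highlighted `通过→失败` (pass→fail) transitions, and CI exit codes, as intuitive as a jest snapshot.
+ - **No server, fully local**: there is no hosted backend, results stay on your machine, `npx` is enough to run it, and it drops straight into CI.
+ - **Runs offline**: a built-in mock model lets you get test cases and assertions working before you have an API key.
+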
+ ## Quick start
+
+ ```bash
+ # 1. Generate a config (offline mock by default, runnable immediately)
+ npx promptdrift init
+
+ # 2. Run the tests
+ npx promptdrift run
+
+ # 3. Lock the baseline once you are happy with the results
+ npx promptdrift baseline
+
+ # 4. After every later prompt change, rerun: it compares against the baseline and flags regressions
+ npx promptdrift run --html report.html
+ ```
+
+ ## Connecting a real model
+
+ Replace the `provider` block in the config with any of the following:
+
+ ```yaml
+ # DeepSeek (OpenAI-compatible)
+ provider:
+   type: openai
+   model: deepseek-chat
+   baseUrl: https://api.deepseek.com/v1
+   apiKeyEnv: DEEPSEEK_API_KEY   # read from an environment variable; do not write the key into the file
+   temperature: 0
+ ```
+
+ | Model | type | baseUrl |
+ |---|---|---|
+ | DeepSeek | `openai` | `https://api.deepseek.com/v1` |
+ | Kimi (Moonshot) | `openai` | `https://api.moonshot.cn/v1` |
+ | Tongyi Qianwen | `openai` | `https://dashscope.aliyuncs.com/compatible-mode/v1` |
+ | Doubao (Volcano Ark) | `openai` | `https://ark.cn-beijing.volces.com/api/v3` |
+ | Zhipu GLM | `openai` | `https://open.bigmodel.cn/api/paas/v4` |
+ | Local Ollama | `ollama` | `http://localhost:11434` |
+ | Claude | `anthropic` | official endpoint by default |
+
+ ## Configuration
+
+ ```yaml
+ provider:
+   type: openai            # openai | anthropic | ollama | mock
+   model: deepseek-chat
+   baseUrl: https://api.deepseek.com/v1
+   apiKeyEnv: DEEPSEEK_API_KEY
+
+ judge:                    # optional: judge model used by llm-judge; defaults to provider
+   type: openai
+   model: deepseek-chat
+   baseUrl: https://api.deepseek.com/v1
+   apiKeyEnv: DEEPSEEK_API_KEY
+
+ prompt:
+   system: "你是一个礼貌专业的中文客服。"   # "You are a polite, professional Chinese-language customer service agent."
+   user: "{{question}}"    # {{var}} is replaced with the matching entry in test.vars
+
+ defaultAssert:            # optional: assertions applied to every test case
+   - type: regex
+     value: "^[\\s\\S]{6,}$"
+
+ tests:
+   - name: 退款问题         # "refund question"
+     vars: { question: "我要退款怎么办" }   # "How do I get a refund?"
+     assert:
+       - type: contains
+         value: ["退款", "工作日"]   # array = every entry must be present
+ ```
+
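The `{{var}}` substitution mentioned in the config comments can be sketched as follows. This is only an illustration: the real `renderTemplate` lives in `dist/util.js`, which is not part of this diff, so the regular expression and the choice to leave unknown placeholders untouched are assumptions.

```javascript
// Hypothetical stand-in for renderTemplate from dist/util.js (not shown in this diff).
function renderTemplate(template, vars) {
  // Replace each {{name}} with vars[name]; unknown placeholders are left as-is (an assumption).
  return template.replace(/\{\{\s*([\w.-]+)\s*\}\}/g, (match, name) =>
    Object.prototype.hasOwnProperty.call(vars, name) ? String(vars[name]) : match
  );
}

const rendered = renderTemplate("{{question}}", { question: "我要退款怎么办" });
console.log(rendered); // 我要退款怎么办
```

With the config above, the rendered string becomes the `user` message sent to the provider for the 退款问题 test case.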
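+ ### Assertion types
+
+ | type | Purpose | Key fields |
+ |---|---|---|
+ | `contains` | Output must contain the keyword (all of them if an array is given) | `value` |
+ | `not-contains` | Output must not contain any of these words | `value` |
+ | `regex` | Output must match the regular expression | `value` |
+ | `json-schema` | Output is valid JSON and conforms to the schema | `schema` |
+ | `llm-judge` | A model scores the output against a Chinese-language rubric | `rubric` `threshold` |
+
+ Each assertion accepts an optional `weight` (default 1); a test case's overall score is the weighted average of its assertion scores.
+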
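As a worked example of the weighted average, here is a minimal sketch that mirrors the `weightedScore` helper in `dist/runner.js`. It assumes each assertion result carries a `score` in [0, 1] and a `weight`; the sample numbers are made up.

```javascript
// Weighted average of assertion scores (mirrors weightedScore in dist/runner.js).
function weightedScore(results) {
  if (results.length === 0) return 1; // no assertions: smoke test only, full score
  const totalWeight = results.reduce((sum, r) => sum + r.weight, 0) || 1;
  return results.reduce((sum, r) => sum + r.score * r.weight, 0) / totalWeight;
}

// Hypothetical case: a passing `contains` (weight 3) and an `llm-judge` score of 0.5 (weight 1).
const example = weightedScore([
  { score: 1, weight: 3 },
  { score: 0.5, weight: 1 },
]);
console.log(example); // (1*3 + 0.5*1) / 4 = 0.875
```

Note that in `dist/runner.js` the score and the pass flag are separate: a case passes only when every assertion passes, whatever its weighted score.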
+ ## Commands
+
+ ```
+ promptdrift init                 Generate a config template
+ promptdrift run [options]        Run the tests and compare against the baseline
+ promptdrift baseline [options]   Run once and lock the result as the baseline
+
+ Options:
+   -c, --config <path>    Path to the config file
+   --html <path>          Write a self-contained HTML report
+   --json <path>          Write the results as JSON
+   --update-baseline      With run: also update the baseline after the run
+   --fail-on <mode>       CI exit code: any (default) | regression | none
+ ```
+
+ ## Using it in CI
+
+ ```yaml
+ # .github/workflows/promptdrift.yml
+ - run: npx promptdrift run --fail-on regression
+   env:
+     DEEPSEEK_API_KEY: ${{ secrets.DEEPSEEK_API_KEY }}
+ ```
+
+ Commit `.promptdrift/baseline.json` to the repository; when a PR changes the prompt and causes a regression, CI fails.
+
+ ## How it works
+
+ ```
+ Config (provider + prompt + tests)
+
+
+ For each test case, call the model ──► run assertions and score ──► weighted test-case score
+
+
+ Compare case by case with .promptdrift/baseline.json
+
+
+ Terminal table + HTML report + CI exit code (non-zero on regression)
+ ```
+
+ ## License
+
+ MIT
@@ -0,0 +1,66 @@
+ import { readFileSync, writeFileSync, existsSync, mkdirSync } from "node:fs";
+ import { resolve, dirname } from "node:path";
+ export const ARTIFACT_DIR = ".promptdrift";
+ export const BASELINE_FILE = "baseline.json";
+ export function baselinePath(cwd = process.cwd()) {
+     return resolve(cwd, ARTIFACT_DIR, BASELINE_FILE);
+ }
+ export function saveBaseline(run, cwd = process.cwd()) {
+     const p = baselinePath(cwd);
+     mkdirSync(dirname(p), { recursive: true });
+     writeFileSync(p, JSON.stringify(run, null, 2), "utf8");
+     return p;
+ }
+ export function loadBaseline(cwd = process.cwd()) {
+     const p = baselinePath(cwd);
+     if (!existsSync(p))
+         return undefined;
+     return JSON.parse(readFileSync(p, "utf8"));
+ }
+ const EPS = 0.01;
+ export function diffAgainstBaseline(current, baseline) {
+     if (!baseline) {
+         return { hasBaseline: false, regressions: 0, improvements: 0, drifts: [] };
+     }
+     const baseMap = new Map(baseline.tests.map((t) => [t.name, t]));
+     const curMap = new Map(current.tests.map((t) => [t.name, t]));
+     const drifts = [];
+     for (const cur of current.tests) {
+         const base = baseMap.get(cur.name);
+         if (!base) {
+             drifts.push({ name: cur.name, kind: "new", curScore: cur.score, curPass: cur.pass, delta: 0 });
+             continue;
+         }
+         const delta = cur.score - base.score;
+         let kind = "same";
+         // pass->fail always counts as a regression; fail->pass counts as an improvement;
+         // otherwise compare the scores (changes within EPS count as "same")
+         if (base.pass && !cur.pass)
+             kind = "regression";
+         else if (!base.pass && cur.pass)
+             kind = "improvement";
+         else if (delta < -EPS)
+             kind = "regression";
+         else if (delta > EPS)
+             kind = "improvement";
+         drifts.push({
+             name: cur.name,
+             kind,
+             baseScore: base.score,
+             curScore: cur.score,
+             basePass: base.pass,
+             curPass: cur.pass,
+             delta,
+         });
+     }
+     for (const base of baseline.tests) {
+         if (!curMap.has(base.name)) {
+             drifts.push({ name: base.name, kind: "removed", baseScore: base.score, basePass: base.pass, delta: 0 });
+         }
+     }
+     return {
+         hasBaseline: true,
+         regressions: drifts.filter((d) => d.kind === "regression").length,
+         improvements: drifts.filter((d) => d.kind === "improvement").length,
+         drifts,
+     };
+ }
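The per-case classification inside `diffAgainstBaseline` can be isolated into a small sketch. The `classify` helper below is a hypothetical name introduced for illustration; it restates the rule for a case that exists in both runs, using the same `EPS` tolerance of 0.01.

```javascript
const EPS = 0.01; // score changes within this tolerance count as "same"

// Classify one test case against its baseline entry, following diffAgainstBaseline.
function classify(base, cur) {
  if (base.pass && !cur.pass) return "regression";  // pass -> fail always regresses
  if (!base.pass && cur.pass) return "improvement"; // fail -> pass always improves
  const delta = cur.score - base.score;
  if (delta < -EPS) return "regression";
  if (delta > EPS) return "improvement";
  return "same";
}

console.log(classify({ pass: true, score: 1 }, { pass: false, score: 0 }));      // regression
console.log(classify({ pass: true, score: 0.9 }, { pass: true, score: 0.895 })); // same
console.log(classify({ pass: true, score: 0.9 }, { pass: true, score: 0.7 }));   // regression
```

The pass flag takes priority over the score, so a case that still passes can nevertheless be reported as a regression when its score drops by more than the tolerance.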
package/dist/cli.js ADDED
@@ -0,0 +1,170 @@
+ #!/usr/bin/env node
+ import { parseArgs } from "node:util";
+ import { writeFileSync, copyFileSync, existsSync } from "node:fs";
+ import { resolve, dirname, join } from "node:path";
+ import { fileURLToPath } from "node:url";
+ import pc from "picocolors";
+ import { findConfig, loadConfig, CONFIG_NAMES } from "./config.js";
+ import { runConfig } from "./runner.js";
+ import { saveBaseline, loadBaseline, diffAgainstBaseline } from "./baseline.js";
+ import { printRun } from "./report/terminal.js";
+ import { renderHtml } from "./report/html.js";
+ const __dirname = dirname(fileURLToPath(import.meta.url));
+ const PKG_ROOT = resolve(__dirname, "..");
+ const VERSION = "0.1.0";
+ function help() {
+     console.log(`
+ ${pc.bold("promptdrift")} — 给 LLM prompt / agent 做快照式回归测试,改提示词时第一时间发现质量退化。
+ 国产模型 & 中文优先(DeepSeek / Kimi / 通义 / 豆包 / Ollama)。
+
+ 用法:
+ promptdrift init 在当前目录生成可直接跑的配置模板
+ promptdrift run [选项] 跑测试并对比基线
+ promptdrift baseline [选项] 跑一次并把结果锁定为基线
+
+ run / baseline 选项:
+ -c, --config <path> 指定配置文件(默认自动查找)
+ --html <path> 额外输出自包含 HTML 报告
+ --json <path> 额外输出 JSON 结果
+ --update-baseline run 跑完顺便更新基线
+ --fail-on <mode> CI 退出码策略: any(默认) | regression | none
+
+ 其他:
+ -h, --help 显示帮助
+ -v, --version 显示版本
+ `);
+ }
+ async function cmdInit() {
+     const target = resolve(process.cwd(), CONFIG_NAMES[0]);
+     if (existsSync(target)) {
+         console.log(pc.yellow(`已存在 ${CONFIG_NAMES[0]},未覆盖。`));
+         return;
+     }
+     const tpl = join(PKG_ROOT, "templates", "promptdrift.config.yaml");
+     if (existsSync(tpl)) {
+         copyFileSync(tpl, target);
+     }
+     else {
+         writeFileSync(target, FALLBACK_TEMPLATE, "utf8");
+     }
+     console.log(pc.green(`已生成 ${CONFIG_NAMES[0]}`));
+     console.log(pc.dim("它默认用离线 mock 模型,直接 `promptdrift run` 就能看到效果。"));
+     console.log(pc.dim("改成真实模型:把 provider.type 换成 openai,填 baseUrl / model / apiKeyEnv。"));
+ }
+ async function cmdRun(opts, asBaseline) {
+     const configPath = findConfig(process.cwd(), opts.config);
+     const config = loadConfig(configPath);
+     const run = await runConfig(config);
+     if (asBaseline) {
+         const p = saveBaseline(run);
+         console.log(pc.green(`已把当前结果锁定为基线: ${p}`));
+         const empty = diffAgainstBaseline(run, undefined);
+         printRun(run, empty);
+         writeArtifacts(run, opts);
+         return 0;
+     }
+     const baseline = loadBaseline();
+     const drift = diffAgainstBaseline(run, baseline);
+     printRun(run, drift);
+     writeArtifacts(run, opts);
+     if (opts.updateBaseline) {
+         const p = saveBaseline(run);
+         console.log(pc.dim(`基线已更新: ${p}`));
+     }
+     if (opts.failOn === "none")
+         return 0;
+     if (opts.failOn === "regression")
+         return drift.regressions > 0 ? 1 : 0;
+     return run.summary.failed > 0 || drift.regressions > 0 ? 1 : 0;
+ }
+ function writeArtifacts(run, opts) {
+     if (opts.json) {
+         const p = resolve(process.cwd(), opts.json);
+         writeFileSync(p, JSON.stringify(run, null, 2), "utf8");
+         console.log(pc.dim(`JSON 已写入: ${p}`));
+     }
+     if (opts.html) {
+         const baseline = loadBaseline();
+         const drift = diffAgainstBaseline(run, baseline);
+         const p = resolve(process.cwd(), opts.html);
+         writeFileSync(p, renderHtml(run, drift), "utf8");
+         console.log(pc.dim(`HTML 报告已写入: ${p}`));
+     }
+ }
+ async function main() {
+     const argv = process.argv.slice(2);
+     const command = argv[0];
+     if (!command || command === "-h" || command === "--help" || command === "help") {
+         help();
+         return;
+     }
+     if (command === "-v" || command === "--version") {
+         console.log(VERSION);
+         return;
+     }
+     const { values } = parseArgs({
+         args: argv.slice(1),
+         options: {
+             config: { type: "string", short: "c" },
+             html: { type: "string" },
+             json: { type: "string" },
+             "update-baseline": { type: "boolean", default: false },
+             "fail-on": { type: "string", default: "any" },
+         },
+         allowPositionals: true,
+     });
+     const failOn = values["fail-on"] || "any";
+     if (!["any", "regression", "none"].includes(failOn)) {
+         console.error(pc.red(`--fail-on 只能是 any | regression | none`));
+         process.exit(2);
+     }
+     const opts = {
+         config: values.config,
+         html: values.html,
+         json: values.json,
+         updateBaseline: Boolean(values["update-baseline"]),
+         failOn: failOn,
+     };
+     switch (command) {
+         case "init":
+             await cmdInit();
+             break;
+         case "run":
+             process.exit(await cmdRun(opts, false));
+             break;
+         case "baseline":
+             process.exit(await cmdRun(opts, true));
+             break;
+         default:
+             console.error(pc.red(`未知命令: ${command}`));
+             help();
+             process.exit(2);
+     }
+ }
+ const FALLBACK_TEMPLATE = `provider:
+   type: mock
+   model: demo
+   mockResponses:
+     退款: "您好,关于退款:支持 7 天无理由退款,请在订单页点击申请,1-3 个工作日到账。"
+     几点: "我们每天 09:00-22:00 营业,节假日正常。"
+
+ prompt:
+   system: "你是一个礼貌专业的中文客服。"
+   user: "{{question}}"
+
+ tests:
+   - name: 退款问题
+     vars: { question: "我要退款怎么办" }
+     assert:
+       - type: contains
+         value: ["退款", "工作日"]
+   - name: 营业时间
+     vars: { question: "你们几点开门" }
+     assert:
+       - type: contains
+         value: "09:00"
+ `;
+ main().catch((e) => {
+     console.error(pc.red(e.message));
+     process.exit(1);
+ });
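The three `--fail-on` modes reduce to a small decision table. The sketch below restates the tail of `cmdRun` as a pure function; `exitCode` and its parameter names are hypothetical, chosen only for this illustration.

```javascript
// Exit-code policy of `promptdrift run`, following the end of cmdRun in cli.js.
function exitCode(failOn, failedCount, regressionCount) {
  if (failOn === "none") return 0; // never fail the build
  if (failOn === "regression") return regressionCount > 0 ? 1 : 0;
  return failedCount > 0 || regressionCount > 0 ? 1 : 0; // "any" (the default)
}

// A case that was already failing in the baseline and still fails:
// one failure, no regression.
console.log(exitCode("any", 1, 0));        // 1
console.log(exitCode("regression", 1, 0)); // 0
console.log(exitCode("none", 1, 0));       // 0
```

This is why `--fail-on regression` suits CI on a suite with known failures: only changes for the worse relative to the committed baseline turn the build red.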
package/dist/config.js ADDED
@@ -0,0 +1,54 @@
+ import { readFileSync, existsSync } from "node:fs";
+ import { resolve } from "node:path";
+ import YAML from "yaml";
+ export const CONFIG_NAMES = [
+     "promptdrift.config.yaml",
+     "promptdrift.config.yml",
+     "promptdrift.config.json",
+ ];
+ export function findConfig(cwd = process.cwd(), explicit) {
+     if (explicit) {
+         const p = resolve(cwd, explicit);
+         if (!existsSync(p))
+             throw new Error(`找不到配置文件: ${p}`);
+         return p;
+     }
+     for (const name of CONFIG_NAMES) {
+         const p = resolve(cwd, name);
+         if (existsSync(p))
+             return p;
+     }
+     throw new Error(`当前目录没有配置文件(${CONFIG_NAMES.join(" / ")})。先运行: promptdrift init`);
+ }
+ export function loadConfig(path) {
+     const raw = readFileSync(path, "utf8");
+     const data = path.endsWith(".json")
+         ? JSON.parse(raw)
+         : YAML.parse(raw);
+     return validate(data, path);
+ }
+ function validate(data, path) {
+     if (typeof data !== "object" || data === null) {
+         throw new Error(`配置文件格式错误: ${path}`);
+     }
+     const cfg = data;
+     if (!cfg.provider || typeof cfg.provider !== "object") {
+         throw new Error("配置缺少 provider 段");
+     }
+     if (!cfg.provider.type)
+         throw new Error("provider.type 不能为空");
+     if (!cfg.provider.model)
+         throw new Error("provider.model 不能为空");
+     if (!cfg.prompt || typeof cfg.prompt.user !== "string") {
+         throw new Error("配置缺少 prompt.user(用户提示词模板)");
+     }
+     if (!Array.isArray(cfg.tests) || cfg.tests.length === 0) {
+         throw new Error("配置缺少 tests(至少一个测试用例)");
+     }
+     for (const [i, t] of cfg.tests.entries()) {
+         if (!t || typeof t.name !== "string") {
+             throw new Error(`tests[${i}] 缺少 name`);
+         }
+     }
+     return cfg;
+ }
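To see which fields `validate` actually requires, the checks can be restated as a function that collects problems instead of throwing. This is a hedged sketch, not the package's API: `configProblems` is a hypothetical helper, its messages are in English for readability, and unlike `validate` it reports every problem rather than stopping at the first.

```javascript
// Condensed restatement of the checks in validate(); collects problems instead of throwing.
function configProblems(cfg) {
  if (typeof cfg !== "object" || cfg === null) return ["config is not an object"];
  const problems = [];
  if (!cfg.provider || typeof cfg.provider !== "object") {
    problems.push("missing provider");
  } else {
    if (!cfg.provider.type) problems.push("missing provider.type");
    if (!cfg.provider.model) problems.push("missing provider.model");
  }
  if (!cfg.prompt || typeof cfg.prompt.user !== "string") problems.push("missing prompt.user");
  if (!Array.isArray(cfg.tests) || cfg.tests.length === 0) {
    problems.push("missing tests");
  } else {
    cfg.tests.forEach((t, i) => {
      if (!t || typeof t.name !== "string") problems.push(`tests[${i}] is missing name`);
    });
  }
  return problems;
}

// The smallest config that passes: a provider, a user prompt, and one named test.
const minimal = {
  provider: { type: "mock", model: "demo" },
  prompt: { user: "{{question}}" },
  tests: [{ name: "smoke" }],
};
console.log(configProblems(minimal)); // []
```

Fields such as `vars`, `assert`, `judge`, and `defaultAssert` are not checked here, which matches `validate`: a test with no assertions is accepted and runs as a smoke test.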
@@ -0,0 +1,39 @@
+ import { readApiKey } from "../util.js";
+ export function createAnthropicProvider(cfg) {
+     const baseUrl = (cfg.baseUrl || "https://api.anthropic.com").replace(/\/+$/, "");
+     const apiKey = readApiKey(cfg.apiKeyEnv);
+     return {
+         label: `anthropic(${cfg.model})`,
+         async complete(messages) {
+             if (!apiKey) {
+                 throw new Error(`缺少 API Key:请设置环境变量 ${cfg.apiKeyEnv || "ANTHROPIC_API_KEY"}`);
+             }
+             const system = messages
+                 .filter((m) => m.role === "system")
+                 .map((m) => m.content)
+                 .join("\n");
+             const chat = messages.filter((m) => m.role !== "system");
+             const res = await fetch(`${baseUrl}/v1/messages`, {
+                 method: "POST",
+                 headers: {
+                     "Content-Type": "application/json",
+                     "x-api-key": apiKey,
+                     "anthropic-version": "2023-06-01",
+                 },
+                 body: JSON.stringify({
+                     model: cfg.model,
+                     max_tokens: cfg.maxTokens ?? 1024,
+                     temperature: cfg.temperature ?? 0,
+                     ...(system ? { system } : {}),
+                     messages: chat.map((m) => ({ role: m.role, content: m.content })),
+                 }),
+             });
+             if (!res.ok) {
+                 const text = await res.text().catch(() => "");
+                 throw new Error(`Anthropic 接口报错 ${res.status}: ${text.slice(0, 300)}`);
+             }
+             const data = (await res.json());
+             return data.content?.map((c) => c.text ?? "").join("") ?? "";
+         },
+     };
+ }
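`createAnthropicProvider` moves system messages out of the chat list and into a top-level `system` field before posting to `/v1/messages`. The sketch below isolates that step; `splitSystem` is a hypothetical helper name used only for illustration.

```javascript
// Split chat messages the way createAnthropicProvider does before building the request body.
function splitSystem(messages) {
  const system = messages
    .filter((m) => m.role === "system")
    .map((m) => m.content)
    .join("\n"); // several system messages are joined with newlines
  const chat = messages
    .filter((m) => m.role !== "system")
    .map((m) => ({ role: m.role, content: m.content }));
  // The system key is omitted entirely when there is no system message.
  return { ...(system ? { system } : {}), messages: chat };
}

const parts = splitSystem([
  { role: "system", content: "你是一个礼貌专业的中文客服。" },
  { role: "user", content: "我要退款怎么办" },
]);
console.log(parts.system);          // 你是一个礼貌专业的中文客服。
console.log(parts.messages.length); // 1
```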
@@ -0,0 +1,18 @@
+ import { createOpenAIProvider } from "./openai.js";
+ import { createAnthropicProvider } from "./anthropic.js";
+ import { createOllamaProvider } from "./ollama.js";
+ import { createMockProvider } from "./mock.js";
+ export function createProvider(cfg) {
+     switch (cfg.type) {
+         case "openai":
+             return createOpenAIProvider(cfg);
+         case "anthropic":
+             return createAnthropicProvider(cfg);
+         case "ollama":
+             return createOllamaProvider(cfg);
+         case "mock":
+             return createMockProvider(cfg);
+         default:
+             throw new Error(`未知的 provider.type: ${cfg.type}`);
+     }
+ }
@@ -0,0 +1,22 @@
+ /**
+  * Offline mock provider: needs no API key and runs out of the box.
+  * Intended for examples, CI, and for getting test cases and assertions working before connecting a real model.
+  *
+  * Matching rule: the last user message is compared against every key of mockResponses;
+  * the first key contained in the user content selects the preset reply. If no key matches,
+  * the "__default__" entry is returned when present, otherwise the input is echoed back.
+  */
+ export function createMockProvider(cfg) {
+     const table = cfg.mockResponses ?? {};
+     return {
+         label: `mock(${cfg.model})`,
+         async complete(messages) {
+             const lastUser = [...messages].reverse().find((m) => m.role === "user");
+             const content = lastUser?.content ?? "";
+             for (const [key, reply] of Object.entries(table)) {
+                 if (key && content.includes(key))
+                     return reply;
+             }
+             return table["__default__"] ?? `(mock 回显)${content}`;
+         },
+     };
+ }
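The mock provider's matching rule is plain substring lookup. Here is a synchronous sketch of the same logic; `mockReply` is a hypothetical name, and the replies `"A"` and `"B"` are placeholders.

```javascript
// Substring-based reply lookup, following createMockProvider's complete().
function mockReply(table, messages) {
  const lastUser = [...messages].reverse().find((m) => m.role === "user");
  const content = lastUser?.content ?? "";
  for (const [key, reply] of Object.entries(table)) {
    if (key && content.includes(key)) return reply; // first matching key wins
  }
  // No key matched: use the optional "__default__" entry, else echo the input.
  return table["__default__"] ?? `(mock 回显)${content}`;
}

const table = { 退款: "A", 几点: "B" };
console.log(mockReply(table, [{ role: "user", content: "你们几点开门" }])); // B
console.log(mockReply(table, [{ role: "user", content: "hello" }]));        // (mock 回显)hello
```

Because keys are tried in insertion order, a broad key listed first shadows more specific keys that follow it.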
@@ -0,0 +1,24 @@
+ export function createOllamaProvider(cfg) {
+     const baseUrl = (cfg.baseUrl || "http://localhost:11434").replace(/\/+$/, "");
+     return {
+         label: `ollama(${cfg.model} @ ${baseUrl})`,
+         async complete(messages) {
+             const res = await fetch(`${baseUrl}/api/chat`, {
+                 method: "POST",
+                 headers: { "Content-Type": "application/json" },
+                 body: JSON.stringify({
+                     model: cfg.model,
+                     messages,
+                     stream: false,
+                     options: { temperature: cfg.temperature ?? 0 },
+                 }),
+             });
+             if (!res.ok) {
+                 const text = await res.text().catch(() => "");
+                 throw new Error(`Ollama 接口报错 ${res.status}: ${text.slice(0, 300)}`);
+             }
+             const data = (await res.json());
+             return data.message?.content ?? "";
+         },
+     };
+ }
@@ -0,0 +1,41 @@
+ import { readApiKey } from "../util.js";
+ /**
+  * OpenAI-compatible provider.
+  * The same interface covers most Chinese models; only the baseUrl changes:
+  *   DeepSeek        https://api.deepseek.com/v1
+  *   Kimi            https://api.moonshot.cn/v1
+  *   Tongyi Qianwen  https://dashscope.aliyuncs.com/compatible-mode/v1
+  *   Doubao          https://ark.cn-beijing.volces.com/api/v3
+  *   Zhipu           https://open.bigmodel.cn/api/paas/v4
+  */
+ export function createOpenAIProvider(cfg) {
+     const baseUrl = (cfg.baseUrl || "https://api.openai.com/v1").replace(/\/+$/, "");
+     const apiKey = readApiKey(cfg.apiKeyEnv);
+     return {
+         label: `openai(${cfg.model} @ ${baseUrl})`,
+         async complete(messages) {
+             if (!apiKey) {
+                 throw new Error(`缺少 API Key:请设置环境变量 ${cfg.apiKeyEnv || "(provider.apiKeyEnv 未配置)"}`);
+             }
+             const res = await fetch(`${baseUrl}/chat/completions`, {
+                 method: "POST",
+                 headers: {
+                     "Content-Type": "application/json",
+                     Authorization: `Bearer ${apiKey}`,
+                 },
+                 body: JSON.stringify({
+                     model: cfg.model,
+                     messages,
+                     temperature: cfg.temperature ?? 0,
+                     ...(cfg.maxTokens ? { max_tokens: cfg.maxTokens } : {}),
+                 }),
+             });
+             if (!res.ok) {
+                 const text = await res.text().catch(() => "");
+                 throw new Error(`OpenAI 兼容接口报错 ${res.status}: ${text.slice(0, 300)}`);
+             }
+             const data = (await res.json());
+             return data.choices?.[0]?.message?.content ?? "";
+         },
+     };
+ }
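The request that `createOpenAIProvider` sends can be inspected without touching the network. The sketch below builds the same URL and body as a pure function; `buildChatRequest` is a hypothetical helper, and the API key is passed in directly rather than read through `readApiKey`.

```javascript
// Build the chat-completions request the same way createOpenAIProvider does,
// but return it instead of calling fetch.
function buildChatRequest(cfg, messages, apiKey) {
  const baseUrl = (cfg.baseUrl || "https://api.openai.com/v1").replace(/\/+$/, "");
  return {
    url: `${baseUrl}/chat/completions`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({
        model: cfg.model,
        messages,
        temperature: cfg.temperature ?? 0, // deterministic by default
        ...(cfg.maxTokens ? { max_tokens: cfg.maxTokens } : {}),
      }),
    },
  };
}

const req = buildChatRequest(
  { model: "deepseek-chat", baseUrl: "https://api.deepseek.com/v1/" },
  [{ role: "user", content: "我要退款怎么办" }],
  "sk-placeholder" // placeholder value, not a real key
);
console.log(req.url); // https://api.deepseek.com/v1/chat/completions
```

Trailing slashes on `baseUrl` are stripped before the path is appended, and `max_tokens` is only sent when `maxTokens` is configured.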
@@ -0,0 +1,106 @@
+ function esc(s) {
+     return s
+         .replace(/&/g, "&amp;")
+         .replace(/</g, "&lt;")
+         .replace(/>/g, "&gt;")
+         .replace(/"/g, "&quot;");
+ }
+ function driftBadge(d) {
+     if (!d)
+         return "";
+     const map = {
+         regression: ["退化", "reg"],
+         improvement: ["改进", "imp"],
+         new: ["新增", "new"],
+         removed: ["移除", "rm"],
+         same: ["", ""],
+     };
+     const [label, cls] = map[d.kind] ?? ["", ""];
+     if (!label)
+         return "";
+     const delta = d.curScore !== undefined && d.baseScore !== undefined
+         ? ` ${(d.delta * 100 > 0 ? "+" : "")}${(d.delta * 100).toFixed(0)}%`
+         : "";
+     return `<span class="badge ${cls}">${label}${delta}</span>`;
+ }
+ export function renderHtml(run, drift) {
+     const driftMap = new Map(drift.drifts.map((d) => [d.name, d]));
+     const s = run.summary;
+     const rows = run.tests
+         .map((t) => {
+         const d = driftMap.get(t.name);
+         const asserts = t.error
+             ? `<div class="err">错误: ${esc(t.error)}</div>`
+             : t.assertions
+                 .map((a) => `<div class="assert ${a.pass ? "ok" : "bad"}"><span>${a.pass ? "✓" : "✗"}</span> <code>${a.type}</code> ${esc(a.message)}</div>`)
+                 .join("") || `<div class="assert dim">无断言(仅冒烟)</div>`;
+         return `
+ <tr class="${t.pass ? "pass" : "fail"}">
+ <td class="mark">${t.pass ? "✓" : "✗"}</td>
+ <td class="name">${esc(t.name)} ${driftBadge(d)}</td>
+ <td class="score"><div class="sbar"><i style="width:${Math.round(t.score * 100)}%"></i></div>${(t.score * 100).toFixed(0)}%</td>
+ <td class="lat">${t.latencyMs}ms</td>
+ </tr>
+ <tr class="detail ${t.pass ? "pass" : "fail"}">
+ <td></td>
+ <td colspan="3">
+ ${asserts}
+ <details><summary>模型输出</summary><pre>${esc(t.output)}</pre></details>
+ </td>
+ </tr>`;
+     })
+         .join("");
+     const driftLine = drift.hasBaseline
+         ? `<span class="${drift.regressions > 0 ? "reg" : "imp"}">${drift.regressions} 处退化</span> · <span class="imp">${drift.improvements} 处改进</span>`
+         : `<span class="dim">无基线</span>`;
+     return `<!DOCTYPE html>
+ <html lang="zh-CN"><head><meta charset="utf-8"/>
+ <meta name="viewport" content="width=device-width, initial-scale=1"/>
+ <title>promptdrift 报告 · ${esc(run.provider.model)}</title>
+ <style>
+ :root { --ok:#16a34a; --bad:#dc2626; --ink:#0f172a; --muted:#64748b; --line:#e2e8f0; --bg:#f8fafc; }
+ *{box-sizing:border-box;margin:0;padding:0}
+ body{font-family:-apple-system,"PingFang SC","Microsoft YaHei",sans-serif;background:var(--bg);color:var(--ink);padding:32px;line-height:1.5}
+ .wrap{max-width:980px;margin:0 auto}
+ h1{font-size:20px;font-weight:800}
+ .meta{color:var(--muted);font-size:13px;margin-top:4px}
+ .cards{display:grid;grid-template-columns:repeat(4,1fr);gap:12px;margin:20px 0}
+ .card{background:#fff;border:1px solid var(--line);border-radius:12px;padding:14px 16px}
+ .card .n{font-size:26px;font-weight:800}
+ .card .l{font-size:12px;color:var(--muted);margin-top:2px}
+ table{width:100%;border-collapse:collapse;background:#fff;border:1px solid var(--line);border-radius:12px;overflow:hidden}
+ td{padding:10px 12px;border-bottom:1px solid var(--line);vertical-align:top}
+ tr.pass .mark{color:var(--ok)} tr.fail .mark{color:var(--bad)}
+ .mark{font-weight:800;width:28px;text-align:center}
+ .name{font-weight:700}
+ .score{width:140px;font-variant-numeric:tabular-nums;color:var(--muted)}
+ .sbar{height:6px;background:var(--line);border-radius:99px;overflow:hidden;margin-bottom:4px}
+ .sbar i{display:block;height:100%;background:var(--ok)}
+ tr.fail .sbar i{background:var(--bad)}
+ .lat{width:80px;color:var(--muted);font-size:13px}
+ tr.detail td{padding-top:0;border-bottom:1px solid var(--line)}
+ .assert{font-size:13px;margin:3px 0}
+ .assert.ok{color:var(--ok)} .assert.bad{color:var(--bad)} .assert.dim{color:var(--muted)}
+ .assert code{background:#f1f5f9;padding:1px 6px;border-radius:5px;color:var(--ink)}
+ details{margin-top:6px} summary{cursor:pointer;color:var(--muted);font-size:13px}
+ pre{background:#0f172a;color:#e2e8f0;padding:12px;border-radius:8px;margin-top:6px;white-space:pre-wrap;word-break:break-word;font-size:12px;max-height:320px;overflow:auto}
+ .err{color:var(--bad);font-size:13px}
+ .badge{font-size:11px;font-weight:700;padding:1px 7px;border-radius:99px;margin-left:6px}
+ .badge.reg{background:#fee2e2;color:#b91c1c} .badge.imp{background:#dcfce7;color:#15803d}
+ .badge.new{background:#e0f2fe;color:#0369a1} .badge.rm{background:#f1f5f9;color:#64748b}
+ .reg{color:var(--bad);font-weight:700} .imp{color:var(--ok);font-weight:700} .dim{color:var(--muted)}
+ footer{color:var(--muted);font-size:12px;text-align:center;margin-top:20px}
+ </style></head>
+ <body><div class="wrap">
+ <h1>promptdrift 报告</h1>
+ <div class="meta">${esc(run.provider.type)} / ${esc(run.provider.model)} · ${esc(run.timestamp)}</div>
+ <div class="cards">
+ <div class="card"><div class="n">${s.passed}/${s.total}</div><div class="l">通过</div></div>
+ <div class="card"><div class="n" style="color:${s.failed ? "var(--bad)" : "var(--ok)"}">${s.failed}</div><div class="l">失败</div></div>
+ <div class="card"><div class="n">${(s.avgScore * 100).toFixed(0)}%</div><div class="l">平均分</div></div>
+ <div class="card"><div class="n" style="font-size:15px;padding-top:8px">${driftLine}</div><div class="l">对比基线</div></div>
+ </div>
+ <table><tbody>${rows}</tbody></table>
+ <footer>Generated by promptdrift</footer>
+ </div></body></html>`;
+ }
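Model output, test names, and error text all pass through `esc` before landing in the report, so markup in a model reply is shown as text rather than rendered. A small demonstration, copying `esc` verbatim from the file above with a made-up input string:

```javascript
// esc as defined at the top of the HTML report module.
function esc(s) {
  return s
    .replace(/&/g, "&amp;") // must run first, or the entities added below would be re-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

console.log(esc('<img src="x" onerror=alert(1)> & more'));
// &lt;img src=&quot;x&quot; onerror=alert(1)&gt; &amp; more
```

Single quotes are left as-is, and `esc` expects a string: passing `undefined` would throw, which the runner avoids by initializing `output` to an empty string.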
@@ -0,0 +1,62 @@
+ import pc from "picocolors";
+ function bar(score, width = 10) {
+     const filled = Math.round(score * width);
+     return "█".repeat(filled) + "░".repeat(width - filled);
+ }
+ export function printRun(run, drift) {
+     console.log("");
+     console.log(pc.bold(`promptdrift · ${run.provider.type}/${run.provider.model}`));
+     console.log(pc.dim("─".repeat(56)));
+     for (const t of run.tests) {
+         const mark = t.pass ? pc.green("✓") : pc.red("✗");
+         const scoreStr = t.error ? pc.red("ERR") : `${(t.score * 100).toFixed(0).padStart(3)}%`;
+         console.log(`${mark} ${pc.bold(t.name)} ${pc.dim(bar(t.error ? 0 : t.score))} ${scoreStr} ${pc.dim(t.latencyMs + "ms")}`);
+         if (t.error) {
+             console.log(pc.red(` 错误: ${t.error}`));
+             continue;
+         }
+         for (const a of t.assertions) {
+             if (!a.pass)
+                 console.log(pc.red(` ✗ [${a.type}] ${a.message}`));
+         }
+     }
+     console.log(pc.dim("─".repeat(56)));
+     const s = run.summary;
+     const head = s.failed === 0 ? pc.green(pc.bold("全部通过")) : pc.red(pc.bold(`${s.failed} 个失败`));
+     console.log(`${head} ${s.passed}/${s.total} 通过 · 平均分 ${(s.avgScore * 100).toFixed(0)}%`);
+     if (drift.hasBaseline) {
+         console.log("");
+         if (drift.regressions === 0 && drift.improvements === 0) {
+             console.log(pc.dim("对比基线:无明显变化"));
+         }
+         else {
+             console.log(pc.bold("对比基线:"));
+             for (const d of drift.drifts) {
+                 if (d.kind === "regression") {
+                     console.log(pc.red(` ↓ 退化 ${d.name} ${fmtDelta(d)}`));
+                 }
+                 else if (d.kind === "improvement") {
+                     console.log(pc.green(` ↑ 改进 ${d.name} ${fmtDelta(d)}`));
+                 }
+                 else if (d.kind === "new") {
+                     console.log(pc.cyan(` + 新增 ${d.name}`));
+                 }
+                 else if (d.kind === "removed") {
+                     console.log(pc.dim(` - 移除 ${d.name}`));
+                 }
+             }
+             console.log("");
+             console.log(`${drift.regressions > 0 ? pc.red(drift.regressions + " 处退化") : pc.green("0 退化")} · ${pc.green(drift.improvements + " 处改进")}`);
+         }
+     }
+     else {
+         console.log(pc.dim("(无基线,运行 `promptdrift baseline` 锁定当前结果作为基线)"));
+     }
+     console.log("");
+ }
+ function fmtDelta(d) {
+     const passInfo = d.basePass !== d.curPass ? ` [${d.basePass ? "通过" : "失败"}→${d.curPass ? "通过" : "失败"}]` : "";
+     const pct = (d.delta * 100).toFixed(0);
+     const sign = d.delta > 0 ? "+" : "";
+     return `${(d.baseScore * 100).toFixed(0)}% → ${(d.curScore * 100).toFixed(0)}% (${sign}${pct}%)${passInfo}`;
+ }
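The score bar in the terminal report is a fixed-width string of filled and empty blocks. The sketch below reuses `bar` unchanged from the file above and assumes the score lies in [0, 1], which holds for scores produced by the runner when assertion weights are non-negative.

```javascript
// bar as defined in the terminal reporter: `width` cells, rounded to the nearest cell.
function bar(score, width = 10) {
  const filled = Math.round(score * width);
  return "█".repeat(filled) + "░".repeat(width - filled);
}

console.log(bar(1));   // ██████████
console.log(bar(0.5)); // █████░░░░░
console.log(bar(0));   // ░░░░░░░░░░
```

A score outside [0, 1] would make one of the `repeat` counts negative, and `String.prototype.repeat` throws a `RangeError` in that case, so callers with custom scores may want to clamp first.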
package/dist/runner.js ADDED
@@ -0,0 +1,52 @@
+ import { createProvider } from "./providers/index.js";
+ import { runAssertion } from "./scorers/index.js";
+ import { renderTemplate } from "./util.js";
+ function weightedScore(results) {
+     if (results.length === 0)
+         return 1;
+     const totalW = results.reduce((s, r) => s + r.weight, 0) || 1;
+     return results.reduce((s, r) => s + r.score * r.weight, 0) / totalW;
+ }
+ export async function runConfig(config) {
+     const provider = createProvider(config.provider);
+     const judge = config.judge ? createProvider(config.judge) : provider;
+     const ctx = { judge };
+     const tests = [];
+     for (const tc of config.tests) {
+         const vars = tc.vars ?? {};
+         const messages = [];
+         if (config.prompt.system) {
+             messages.push({ role: "system", content: renderTemplate(config.prompt.system, vars) });
+         }
+         messages.push({ role: "user", content: renderTemplate(config.prompt.user, vars) });
+         const assertions = [...(config.defaultAssert ?? []), ...(tc.assert ?? [])];
+         const started = Date.now();
+         let output = "";
+         let error;
+         try {
+             output = await provider.complete(messages);
+         }
+         catch (e) {
+             error = e.message;
+         }
+         const latencyMs = Date.now() - started;
+         let assertionResults = [];
+         if (!error) {
+             for (const a of assertions) {
+                 assertionResults.push(await runAssertion(output, a, ctx));
+             }
+         }
+         const pass = !error && assertionResults.every((r) => r.pass);
+         const score = error ? 0 : weightedScore(assertionResults);
+         tests.push({ name: tc.name, vars, output, latencyMs, error, assertions: assertionResults, pass, score });
+     }
+     const passed = tests.filter((t) => t.pass).length;
+     const avgScore = tests.length ? tests.reduce((s, t) => s + t.score, 0) / tests.length : 0;
+     return {
+         schemaVersion: 1,
+         timestamp: new Date().toISOString(),
+         provider: { type: config.provider.type, model: config.provider.model },
+         summary: { total: tests.length, passed, failed: tests.length - passed, avgScore },
+         tests,
+     };
+ }
@@ -0,0 +1,142 @@
+ import AjvImport from "ajv";
+ import { clamp01 } from "../util.js";
+ // ajv is a CommonJS package; under NodeNext ESM its default export may be wrapped in an extra .default
+ const Ajv = (AjvImport.default ??
+     AjvImport);
+ const ajv = new Ajv({ allErrors: true, strict: false });
+ function toArray(v) {
+     if (v === undefined)
+         return [];
+     return Array.isArray(v) ? v : [v];
+ }
+ function tryExtractJson(text) {
+     try {
+         return JSON.parse(text);
+     }
+     catch {
+         // Fall back to the span from the first { or [ to the last } or ] in mixed text.
+         // The match is greedy, so text holding two separate JSON blocks will not parse.
+         const match = text.match(/[{[][\s\S]*[}\]]/);
+         if (match) {
+             try {
+                 return JSON.parse(match[0]);
+             }
+             catch {
+                 return undefined;
+             }
+         }
+         return undefined;
+     }
+ }
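The fallback in `tryExtractJson` is worth seeing on concrete inputs, because the regular expression is greedy: it spans from the first opening bracket to the last closing one. The sketch below is a compact copy of the function above with made-up input strings.

```javascript
// Compact copy of tryExtractJson: parse directly, else parse the widest bracketed span.
function tryExtractJson(text) {
  try {
    return JSON.parse(text);
  } catch {
    const match = text.match(/[{[][\s\S]*[}\]]/);
    if (!match) return undefined;
    try {
      return JSON.parse(match[0]);
    } catch {
      return undefined;
    }
  }
}

// One JSON object wrapped in chatter: the span between { and } parses.
const wrapped = tryExtractJson('Sure: {"score": 0.8, "reason": "ok"} Done.');
console.log(wrapped.score); // 0.8

// Two objects: the greedy span covers both plus the text between them, so parsing fails.
console.log(tryExtractJson('{"a": 1} and {"b": 2}')); // undefined
```

This matters for `llm-judge`: a judge reply that contains stray brackets outside its JSON object yields `undefined` here, and the resulting score then depends on how `clamp01` in `util.js` (not shown in this diff) treats a non-numeric value.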
+ async function scoreContains(output, a) {
+     const needles = toArray(a.value);
+     const missing = needles.filter((n) => !output.includes(n));
+     const pass = missing.length === 0;
+     return {
+         type: "contains",
+         pass,
+         score: pass ? 1 : 0,
+         weight: a.weight ?? 1,
+         message: pass
+             ? `包含全部关键词: ${needles.join(", ")}`
+             : `缺少关键词: ${missing.join(", ")}`,
+     };
+ }
+ async function scoreNotContains(output, a) {
+     const needles = toArray(a.value);
+     const found = needles.filter((n) => output.includes(n));
+     const pass = found.length === 0;
+     return {
+         type: "not-contains",
+         pass,
+         score: pass ? 1 : 0,
+         weight: a.weight ?? 1,
+         message: pass ? `未出现禁止词` : `出现了禁止词: ${found.join(", ")}`,
+     };
+ }
+ async function scoreRegex(output, a) {
+     const pattern = Array.isArray(a.value) ? a.value[0] : a.value;
+     if (!pattern) {
+         return failResult("regex", a, "regex 断言缺少 value(正则表达式)");
+     }
+     let re;
+     try {
+         re = new RegExp(pattern, "s");
+     }
+     catch (e) {
+         return failResult("regex", a, `正则无效: ${e.message}`);
+     }
+     const pass = re.test(output);
+     return {
+         type: "regex",
+         pass,
+         score: pass ? 1 : 0,
+         weight: a.weight ?? 1,
+         message: pass ? `匹配正则 /${pattern}/` : `不匹配正则 /${pattern}/`,
+     };
+ }
+ async function scoreJsonSchema(output, a) {
+     if (!a.schema)
+         return failResult("json-schema", a, "json-schema 断言缺少 schema");
+     const parsed = tryExtractJson(output);
+     if (parsed === undefined) {
+         return failResult("json-schema", a, "输出不是合法 JSON / 抠不出 JSON 块");
+     }
+     const validateFn = ajv.compile(a.schema);
+     const ok = validateFn(parsed);
+     return {
+         type: "json-schema",
+         pass: ok,
+         score: ok ? 1 : 0,
+         weight: a.weight ?? 1,
+         message: ok
+             ? "JSON 符合 schema"
+             : `JSON 校验失败: ${ajv.errorsText(validateFn.errors)}`,
+     };
+ }
+ async function scoreLlmJudge(output, a, ctx) {
+     if (!a.rubric)
+         return failResult("llm-judge", a, "llm-judge 断言缺少 rubric(评分标准)");
+     if (!ctx.judge)
+         return failResult("llm-judge", a, "没有可用的裁判模型(judge)");
+     const threshold = a.threshold ?? 0.6;
+     const raw = await ctx.judge.complete([
+         {
+             role: "system",
+             content: "你是一个严格、客观的中文评测员。根据【评分标准】给【待评测输出】打分。" +
+                 '只输出一个 JSON,不要解释:{"score": 0到1之间的小数, "reason": "简短理由"}。',
+         },
+         {
+             role: "user",
+             content: `【评分标准】\n${a.rubric}\n\n【待评测输出】\n"""\n${output}\n"""`,
+         },
+     ]);
+     const parsed = tryExtractJson(raw);
+     const score = clamp01(Number(parsed?.score));
+     const pass = score >= threshold;
+     return {
+         type: "llm-judge",
+         pass,
+         score,
+         weight: a.weight ?? 1,
+         message: `裁判打分 ${score.toFixed(2)}(阈值 ${threshold})${parsed?.reason ? " · " + parsed.reason : ""}`,
+     };
+ }
+ function failResult(type, a, message) {
125
+ return { type, pass: false, score: 0, weight: a.weight ?? 1, message };
126
+ }
127
+ export async function runAssertion(output, a, ctx) {
128
+ switch (a.type) {
129
+ case "contains":
130
+ return scoreContains(output, a);
131
+ case "not-contains":
132
+ return scoreNotContains(output, a);
133
+ case "regex":
134
+ return scoreRegex(output, a);
135
+ case "json-schema":
136
+ return scoreJsonSchema(output, a);
137
+ case "llm-judge":
138
+ return scoreLlmJudge(output, a, ctx);
139
+ default:
140
+ return failResult(a.type, a, `未知断言类型: ${a.type}`);
141
+ }
142
+ }
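The JSON extraction used by the `json-schema` and `llm-judge` assertions above is greedy: when the whole text is not valid JSON, the fallback regex spans from the first opening bracket to the last closing bracket. The sketch below copies that logic into a standalone function so the behaviour can be tried in isolation; the sample strings are made up for illustration.

```javascript
// Standalone copy of the extraction logic shown in the diff above.
function tryExtractJson(text) {
  try {
    return JSON.parse(text);
  } catch {
    // Greedy: first "{" or "[" through the LAST "}" or "]".
    const match = text.match(/[{[][\s\S]*[}\]]/);
    if (!match) return undefined;
    try {
      return JSON.parse(match[0]);
    } catch {
      return undefined;
    }
  }
}

// One JSON object surrounded by prose: extraction succeeds.
const single = tryExtractJson('Verdict: {"score": 0.85, "reason": "polite"} Thanks.');

// Two separate objects: the greedy span covers both plus the prose between them,
// which is not valid JSON, so nothing is extracted.
const double = tryExtractJson('{"score": 0.2} and later {"score": 0.9}');

// No brackets at all.
const none = tryExtractJson("no json here");

console.log(single, double, none);
```

In the `llm-judge` path an `undefined` result leads to a score of 0, so a judge reply containing more than one JSON object would be scored as a failure rather than raising an error.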
package/dist/types.js ADDED
@@ -0,0 +1 @@
1
+ export {};
package/dist/util.js ADDED
@@ -0,0 +1,17 @@
1
+ /** Replace "{{name}}" placeholders with the values in vars; missing placeholders are replaced with an empty string. */
2
+ export function renderTemplate(tpl, vars = {}) {
3
+ return tpl.replace(/\{\{\s*([\w.-]+)\s*\}\}/g, (_, key) => {
4
+ const v = vars[key];
5
+ return v === undefined || v === null ? "" : String(v);
6
+ });
7
+ }
8
+ export function readApiKey(envName) {
9
+ if (!envName)
10
+ return undefined;
11
+ return process.env[envName];
12
+ }
13
+ export function clamp01(n) {
14
+ if (Number.isNaN(n))
15
+ return 0;
16
+ return Math.max(0, Math.min(1, n));
17
+ }
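A short usage sketch for the two helpers above may make their edge cases easier to see. The functions are copied from the diff so the snippet runs on its own; the template strings and variable values are invented for this example.

```javascript
// Standalone copies of the helpers shown in the diff above.
function renderTemplate(tpl, vars = {}) {
  return tpl.replace(/\{\{\s*([\w.-]+)\s*\}\}/g, (_, key) => {
    const v = vars[key];
    return v === undefined || v === null ? "" : String(v);
  });
}

function clamp01(n) {
  if (Number.isNaN(n)) return 0;
  return Math.max(0, Math.min(1, n));
}

// Whitespace inside the braces is tolerated; unknown keys render as "".
const greeting = renderTemplate("Hi {{ name }}, order {{order.id}} {{missing}}!", {
  name: "Li",
  "order.id": 42, // dotted keys are looked up as flat property names
});

// A nested object is NOT traversed, so this renders as an empty string.
const nested = renderTemplate("{{order.id}}", { order: { id: 42 } });

// clamp01 maps NaN (e.g. a missing judge score) to 0 and bounds everything else.
const clamped = [Number(undefined), 1.7, -0.2, Number("0.85")].map(clamp01);

console.log(greeting); // Hi Li, order 42 !
console.log(JSON.stringify(nested)); // ""
console.log(clamped); // [ 0, 1, 0, 0.85 ]
```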
package/package.json ADDED
@@ -0,0 +1,50 @@
1
+ {
2
+ "name": "evaldrift",
3
+ "version": "0.1.0",
4
+ "description": "Snapshot-style regression testing for LLM prompts & agents — catch silent quality drift when you tweak a prompt. China-model & Chinese-first (DeepSeek / Kimi / Qwen / Doubao / Ollama).",
5
+ "type": "module",
6
+ "bin": {
7
+ "evaldrift": "dist/cli.js"
8
+ },
9
+ "files": [
10
+ "dist",
11
+ "templates",
12
+ "README.md",
13
+ "LICENSE"
14
+ ],
15
+ "scripts": {
16
+ "build": "tsc -p tsconfig.json",
17
+ "dev": "tsx src/cli.ts",
18
+ "prepublishOnly": "npm run build",
19
+ "test": "node --test"
20
+ },
21
+ "keywords": [
22
+ "llm",
23
+ "prompt",
24
+ "eval",
25
+ "evaluation",
26
+ "regression-testing",
27
+ "prompt-engineering",
28
+ "deepseek",
29
+ "qwen",
30
+ "kimi",
31
+ "ollama",
32
+ "ai-testing",
33
+ "snapshot-testing"
34
+ ],
35
+ "author": "",
36
+ "license": "MIT",
37
+ "engines": {
38
+ "node": ">=18"
39
+ },
40
+ "dependencies": {
41
+ "ajv": "^8.17.1",
42
+ "picocolors": "^1.1.1",
43
+ "yaml": "^2.6.1"
44
+ },
45
+ "devDependencies": {
46
+ "@types/node": "^22.10.2",
47
+ "tsx": "^4.19.2",
48
+ "typescript": "^5.7.2"
49
+ }
50
+ }
@@ -0,0 +1,66 @@
1
+ # promptdrift config: uses the offline mock model by default, so `promptdrift run` shows results immediately.
2
+ # To switch to a real model, replace the whole provider section with any one of the commented examples below.
3
+
4
+ provider:
5
+ type: mock
6
+ model: demo
7
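+ # mock rule: whichever keyword the model "reply" hits, the corresponding text is returned (for offline demos and for getting assertions working only)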
8
+ mockResponses:
9
+ # This entry lets the offline mock also act as the llm-judge judge (it returns a JSON score when it receives a scoring request)
10
+ "【待评测输出】": '{"score": 0.85, "reason": "回复礼貌并给出了下一步建议"}'
11
+ 退款: "您好,关于退款政策:支持 7 天无理由退款,请在订单页点击申请,款项 1-3 个工作日原路退回。"
12
+ 几点: "我们每天 09:00-22:00 营业,节假日正常营业,欢迎光临。"
13
+ __default__: "您好,非常抱歉让您久等了,我马上为您处理,请问具体是哪笔订单呢?"
14
+
15
+ # ---- Connect DeepSeek (OpenAI-compatible) ----
16
+ # provider:
17
+ # type: openai
18
+ # model: deepseek-chat
19
+ # baseUrl: https://api.deepseek.com/v1
20
+ # apiKeyEnv: DEEPSEEK_API_KEY
21
+ # temperature: 0
22
+ #
23
+ # ---- Connect Kimi / Qwen (Tongyi) / Doubao: same as above, only change model + baseUrl + apiKeyEnv ----
24
+ # ---- Connect a local Ollama ----
25
+ # provider:
26
+ # type: ollama
27
+ # model: qwen2.5
28
+ # baseUrl: http://localhost:11434
29
+
30
+ # Judge model used by llm-judge (optional; defaults to reusing the provider above)
31
+ # judge:
32
+ # type: openai
33
+ # model: deepseek-chat
34
+ # baseUrl: https://api.deepseek.com/v1
35
+ # apiKeyEnv: DEEPSEEK_API_KEY
36
+
37
+ prompt:
38
+ system: "你是一个礼貌、专业、简洁的中文客服助手。"
39
+ user: "{{question}}"
40
+
41
+ # Default assertions applied to every test case (optional)
42
+ defaultAssert:
43
+ - type: regex
44
+ value: "^[\\s\\S]{8,}$" # require at least some content; don't leave it empty
45
+
46
+ tests:
47
+ - name: 退款问题
48
+ vars: { question: "我想退款,怎么操作?" }
49
+ assert:
50
+ - type: contains
51
+ value: ["退款", "工作日"]
52
+ - type: not-contains
53
+ value: "不支持退款"
54
+
55
+ - name: 营业时间
56
+ vars: { question: "你们几点开门?" }
57
+ assert:
58
+ - type: contains
59
+ value: "09:00"
60
+
61
+ - name: 礼貌性(裁判打分)
62
+ vars: { question: "你们怎么这么慢,气死我了" }
63
+ assert:
64
+ - type: llm-judge
65
+ rubric: "回复是否保持礼貌、安抚情绪、并给出下一步建议。礼貌且有建设性给高分,敷衍或冷漠给低分。"
66
+ threshold: 0.6