@huydao/karrot 0.1.6 → 0.1.7
- package/README.md +496 -243
- package/dist/executors/adapters/ag-ui-post.js +87 -12
- package/dist/executors/adapters/ag-ui.js +5 -3
- package/dist/executors/executor.js +2 -1
- package/dist/executors/run-result.d.ts +3 -0
- package/dist/reports/report.js +20 -0
- package/dist/scenarios/scenario.d.ts +1 -0
- package/package.json +5 -2
- package/site/assets/app.js +201 -0
- package/site/assets/karrot-mark.svg +10 -0
- package/site/assets/styles.css +698 -0
- package/site/check.js +43 -0
- package/site/docs/index.html +505 -0
- package/site/index.html +162 -0
- package/site/serve.js +50 -0
package/README.md
CHANGED
|
@@ -1,73 +1,59 @@
|
|
|
1
1
|
# karrot
|
|
2
2
|
|
|
3
|
-
`karrot` is a reusable AI
|
|
3
|
+
`karrot` is a reusable AI scenario runner for testing assistants through multi-turn conversations.
|
|
4
4
|
|
|
5
|
-
|
|
6
|
-
-
|
|
7
|
-
-
|
|
8
|
-
-
|
|
9
|
-
-
|
|
10
|
-
- JSON and HTML reports
|
|
5
|
+
Use it when you want to:
|
|
6
|
+
- send one or more user turns to an agent
|
|
7
|
+
- keep the same conversation thread across turns
|
|
8
|
+
- assert required behavior with deterministic or semantic checks
|
|
9
|
+
- score response quality with eval dimensions
|
|
10
|
+
- write JSON and HTML reports for each run
|
|
11
11
|
|
|
12
|
-
|
|
12
|
+
Karrot is product-neutral. It does not know how your product logs in, where your project ID comes from, or how your agent runtime is discovered. The consuming project prepares those values and passes them to Karrot through a YAML config and `execute()`.
|
|
13
13
|
|
|
14
|
-
##
|
|
14
|
+
## Static Docs Site
|
|
15
15
|
|
|
16
|
-
|
|
17
|
-
- load config
|
|
18
|
-
- resolve `${VARIABLE}` templates
|
|
19
|
-
- load scenario modules
|
|
20
|
-
- execute turns
|
|
21
|
-
- run assertions and evals
|
|
22
|
-
- write artifacts and reports
|
|
16
|
+
Karrot also includes a static documentation site:
|
|
23
17
|
|
|
24
|
-
|
|
25
|
-
-
|
|
26
|
-
- `JWT`
|
|
27
|
-
- `ACCOUNT_ID`
|
|
28
|
-
- `WS_URL`
|
|
29
|
-
- `WS_TOPIC`
|
|
30
|
-
- any transport-specific headers or IDs
|
|
18
|
+
- Landing page: [site/index.html](./site/index.html)
|
|
19
|
+
- Guidance page: [site/docs/index.html](./site/docs/index.html)
|
|
31
20
|
|
|
32
|
-
|
|
21
|
+
Run it locally:
|
|
33
22
|
|
|
34
|
-
|
|
23
|
+
```bash
|
|
24
|
+
npm run site:serve
|
|
25
|
+
```
|
|
35
26
|
|
|
36
|
-
|
|
37
|
-
|
|
38
|
-
|
|
39
|
-
|
|
40
|
-
|
|
41
|
-
|
|
42
|
-
|
|
43
|
-
|
|
44
|
-
|
|
45
|
-
|
|
46
|
-
|
|
47
|
-
|
|
48
|
-
|
|
49
|
-
|
|
50
|
-
|
|
27
|
+
## Quick Setup
|
|
28
|
+
|
|
29
|
+
The normal setup has three files:
|
|
30
|
+
|
|
31
|
+
- `karrot.config.yml`: transport, artifacts, eval prompts, report metadata, and scenario context
|
|
32
|
+
- `src/run-karrot.ts`: a small script that collects runtime variables and calls `execute()`
|
|
33
|
+
- `src/scenarios/basic.ts`: one or more scenarios exported as an `AiScenarioSet`
|
|
34
|
+
|
|
35
|
+
Install the package in your project:
|
|
36
|
+
|
|
37
|
+
```bash
|
|
38
|
+
npm install @huydao/karrot
|
|
39
|
+
npm install -D tsx typescript
|
|
40
|
+
```
|
|
41
|
+
|
|
42
|
+
Set the OpenAI key when you use `aiAssert`, `eval`, or `aiGen`-generated user messages:
|
|
43
|
+
|
|
44
|
+
```bash
|
|
45
|
+
export OPENAI_API_KEY=<your-openai-api-key>
|
|
51
46
|
```
|
|
52
47
|
|
|
53
|
-
|
|
54
|
-
1. load YAML or JSON config
|
|
55
|
-
2. resolve `${...}` variables
|
|
56
|
-
3. create `artifacts/<timestamp>`
|
|
57
|
-
4. load the scenario module
|
|
58
|
-
5. run selected scenarios
|
|
59
|
-
6. write JSON and HTML reports
|
|
48
|
+
## 1. YAML Config File
|
|
60
49
|
|
|
61
|
-
|
|
50
|
+
Create a config file in the project that will run the tests. The file can live anywhere, but `./karrot.config.yml` or `./scripts/<flow>.karrot.yml` is easiest for agents and humans to find.
|
|
62
51
|
|
|
63
|
-
|
|
64
|
-
1. create a YAML config file for the WSS transport
|
|
65
|
-
2. create a scenario module that exports `scenarioSet` and `buildScenarioContext`
|
|
66
|
-
3. create a small run script that calls `execute()`
|
|
52
|
+
Karrot resolves `${VARIABLE}` placeholders from the `variables` object passed to `execute()`, then from `process.env`.
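As a rough illustration of that lookup order, the sketch below resolves placeholders from an explicit `variables` object first and falls back to an environment map. `resolvePlaceholders` is a hypothetical helper written for this example, not Karrot's implementation. Leaving unknown placeholders untouched is an assumption; Karrot's actual behavior for missing values may differ.

```ts
// Hypothetical sketch of the lookup order described above; not Karrot's own code.
type ValueMap = Record<string, string | undefined>;

function resolvePlaceholders(template: string, variables: ValueMap, env: ValueMap): string {
  return template.replace(/\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g, (placeholder: string, name: string) => {
    // Explicit variables win; the environment map is only a fallback.
    const value = variables[name] ?? env[name];
    // Assumption: unknown placeholders are left as-is so they are easy to spot.
    return value ?? placeholder;
  });
}

const resolvedHeader = resolvePlaceholders(
  'Authorization:${AUTH_TOKEN} env=${TEST_ENV} other=${NOT_SET}',
  { AUTH_TOKEN: 'token-from-variables' },
  { AUTH_TOKEN: 'token-from-env', TEST_ENV: 'qa' },
);
console.log(resolvedHeader);
```

In a real run script the third argument would be `process.env`; a plain object is used here so the example has no dependencies.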
|
|
67
53
|
|
|
68
|
-
###
|
|
54
|
+
### Generic AG-UI WSS agent
|
|
69
55
|
|
|
70
|
-
Use
|
|
56
|
+
Use `ag-ui-wss` when the target agent speaks AG-UI over WebSocket/STOMP.
|
|
71
57
|
|
|
72
58
|
```yml
|
|
73
59
|
version: 1
|
|
@@ -75,34 +61,34 @@ version: 1
|
|
|
75
61
|
transport:
|
|
76
62
|
type: ag-ui-wss
|
|
77
63
|
env:
|
|
78
|
-
JWT: ${JWT}
|
|
79
|
-
ACCOUNT_ID: ${ACCOUNT_ID}
|
|
80
|
-
PROJECT_ID: ${PROJECT_ID}
|
|
81
64
|
AGENT_URL: ${AGENT_URL}
|
|
82
65
|
AGENT_ID: ${AGENT_ID}
|
|
83
66
|
WS_URL: ${WS_URL}
|
|
84
67
|
WS_TOPIC: ${WS_TOPIC}
|
|
85
|
-
|
|
86
|
-
|
|
68
|
+
AUTH_TOKEN: ${AUTH_TOKEN}
|
|
69
|
+
WS_STOMP_HEADERS: Authorization:${AUTH_TOKEN}
|
|
70
|
+
WS_HEADERS: Origin:${APP_BASE_URL},User-Agent:Mozilla/5.0
|
|
87
71
|
processTimeoutMs: 120000
|
|
88
72
|
|
|
89
73
|
artifacts:
|
|
90
|
-
directory: ./artifacts
|
|
74
|
+
directory: ./artifacts/karrot
|
|
91
75
|
|
|
92
76
|
execution:
|
|
93
77
|
stopOnFailure: false
|
|
78
|
+
concurrency: 1
|
|
79
|
+
|
|
80
|
+
context:
|
|
81
|
+
appBaseUrl: ${APP_BASE_URL}
|
|
82
|
+
projectId: ${PROJECT_ID}
|
|
94
83
|
|
|
95
84
|
evaluation:
|
|
96
85
|
systemPromptPath: ./prompts/turn-eval-system-prompt.md
|
|
97
86
|
promptDirectory: ./prompts/eval
|
|
98
87
|
|
|
99
|
-
context:
|
|
100
|
-
projectId: ${PROJECT_ID}
|
|
101
|
-
|
|
102
88
|
report:
|
|
103
89
|
enabled: true
|
|
104
|
-
environment:
|
|
105
|
-
projectName:
|
|
90
|
+
environment: ${TEST_ENV}
|
|
91
|
+
projectName: ${PROJECT_NAME}
|
|
106
92
|
runtime:
|
|
107
93
|
agentUrl: ${AGENT_URL}
|
|
108
94
|
agentId: ${AGENT_ID}
|
|
@@ -113,190 +99,316 @@ report:
|
|
|
113
99
|
appBaseUrl: ${APP_BASE_URL}
|
|
114
100
|
```
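The `WS_HEADERS` and `WS_STOMP_HEADERS` values above are written as comma-separated `Name:value` pairs. How Karrot parses them is not shown in this README, so the following is only a sketch of one plausible reading. `parseHeaderList` is a hypothetical helper. It splits each pair at the first colon, because a value such as an `Origin` URL contains colons of its own.

```ts
// Hypothetical parser for "Name:value,Name:value" strings; an assumption, not Karrot's code.
function parseHeaderList(raw: string): Record<string, string> {
  const headers: Record<string, string> = {};
  for (const pair of raw.split(',')) {
    const separator = pair.indexOf(':');
    if (separator <= 0) {
      continue; // Skip empty or malformed entries.
    }
    const headerName = pair.slice(0, separator).trim();
    const headerValue = pair.slice(separator + 1).trim();
    headers[headerName] = headerValue;
  }
  return headers;
}

const wsHeaders = parseHeaderList('Origin:https://app.example.test,User-Agent:Mozilla/5.0');
console.log(wsHeaders);
```

Under this reading, a header value that itself contains a comma would be split incorrectly, so keep header values simple or check the package source.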
|
|
115
101
|
|
|
116
|
-
|
|
117
|
-
- `transport`: tells Karrot how to talk to the assistant
|
|
118
|
-
- `evaluation`: points to the turn-eval rubric and any extra project-specific dimension prompts
|
|
119
|
-
- `context`: makes resolved values available to scenarios
|
|
120
|
-
- `report`: controls run metadata written into reports
|
|
102
|
+
How to get the values:
|
|
121
103
|
|
|
122
|
-
|
|
104
|
+
- `AGENT_URL`, `AGENT_ID`, `WS_URL`, `WS_TOPIC`: from your agent platform, runtime discovery API, or test environment configuration
|
|
105
|
+
- `AUTH_TOKEN`, `ACCOUNT_ID`, `PROJECT_ID`: from your product login/auth setup or CI secrets
|
|
106
|
+
- `APP_BASE_URL`, `TEST_ENV`, `PROJECT_NAME`: from your test environment
|
|
107
|
+
- `OPENAI_API_KEY`: from the environment, only needed for `aiAssert`, `eval`, and `aiGen`
|
|
123
108
|
|
|
124
|
-
|
|
109
|
+
Put secrets in environment variables or CI secrets, not in the YAML file.
|
|
125
110
|
|
|
126
|
-
###
|
|
111
|
+
### Generic HTTP agent
|
|
127
112
|
|
|
128
|
-
Use
|
|
113
|
+
Use `ag-ui-post` when the agent is triggered by HTTP and optionally observed by polling.
|
|
129
114
|
|
|
130
|
-
```
|
|
131
|
-
|
|
132
|
-
|
|
133
|
-
await execute('./karrot.config.yml', {
|
|
134
|
-
variables: {
|
|
135
|
-
PROJECT_ID: process.env.PROJECT_ID,
|
|
136
|
-
JWT: process.env.JWT,
|
|
137
|
-
ACCOUNT_ID: process.env.ACCOUNT_ID,
|
|
138
|
-
AGENT_URL: process.env.AGENT_URL,
|
|
139
|
-
AGENT_ID: process.env.AGENT_ID,
|
|
140
|
-
WS_URL: process.env.WS_URL,
|
|
141
|
-
WS_TOPIC: process.env.WS_TOPIC,
|
|
142
|
-
WS_ORIGIN: process.env.WS_ORIGIN,
|
|
143
|
-
APP_BASE_URL: process.env.APP_BASE_URL,
|
|
144
|
-
},
|
|
145
|
-
scenario: {
|
|
146
|
-
file: './src/scenarios/basic-two-turn-demo.ts',
|
|
147
|
-
ids: ['BASIC-2T'],
|
|
148
|
-
},
|
|
149
|
-
});
|
|
150
|
-
```
|
|
115
|
+
```yml
|
|
116
|
+
version: 1
|
|
151
117
|
|
|
152
|
-
|
|
118
|
+
transport:
|
|
119
|
+
type: ag-ui-post
|
|
120
|
+
injectMessage: true
|
|
121
|
+
run:
|
|
122
|
+
url: ${AGENT_RUN_URL}
|
|
123
|
+
headers:
|
|
124
|
+
Authorization: Bearer ${AUTH_TOKEN}
|
|
125
|
+
Content-Type: application/json
|
|
126
|
+
payload:
|
|
127
|
+
body:
|
|
128
|
+
threadId: ${THREAD_ID}
|
|
129
|
+
messages: []
|
|
130
|
+
processTimeoutMs: 120000
|
|
153
131
|
|
|
154
|
-
|
|
155
|
-
|
|
156
|
-
- `buildScenarioContext(baseContext)`
|
|
132
|
+
artifacts:
|
|
133
|
+
directory: ./artifacts/karrot
|
|
157
134
|
|
|
158
|
-
|
|
135
|
+
context:
|
|
136
|
+
appBaseUrl: ${APP_BASE_URL}
|
|
137
|
+
|
|
138
|
+
report:
|
|
139
|
+
enabled: true
|
|
140
|
+
environment: ${TEST_ENV}
|
|
141
|
+
projectName: ${PROJECT_NAME}
|
|
142
|
+
runtime:
|
|
143
|
+
agentUrl: ${AGENT_RUN_URL}
|
|
144
|
+
agentId: ${AGENT_ID}
|
|
145
|
+
wsUrl: ""
|
|
146
|
+
wsTopic: ""
|
|
147
|
+
accountId: ${ACCOUNT_ID}
|
|
148
|
+
projectId: ${PROJECT_ID}
|
|
149
|
+
appBaseUrl: ${APP_BASE_URL}
|
|
150
|
+
```
|
|
151
|
+
|
|
152
|
+
## 2. Run Script
|
|
153
|
+
|
|
154
|
+
The run script is the boundary between your product and Karrot. It should collect runtime values, pass them as variables, select the scenario file, and set a non-zero exit code on failure.
|
|
159
155
|
|
|
160
156
|
```ts
|
|
161
|
-
import {
|
|
157
|
+
import { execute, getScenarioRunStatus } from '@huydao/karrot';
|
|
162
158
|
|
|
163
|
-
|
|
164
|
-
|
|
165
|
-
id: 'BASIC-2T',
|
|
166
|
-
name: 'Basic Two-Turn Demo',
|
|
167
|
-
turns: [
|
|
168
|
-
{
|
|
169
|
-
label: 'Turn 1',
|
|
170
|
-
message: () => 'Hello. What can you help me with in Katalon AI?',
|
|
171
|
-
},
|
|
172
|
-
{
|
|
173
|
-
label: 'Turn 2',
|
|
174
|
-
message: () => 'Give me 3 short example prompts I can ask next.',
|
|
175
|
-
},
|
|
176
|
-
],
|
|
177
|
-
},
|
|
178
|
-
];
|
|
159
|
+
function required(name: string): string {
|
|
160
|
+
const value = process.env[name]?.trim();
|
|
179
161
|
|
|
180
|
-
|
|
162
|
+
if (!value) {
|
|
163
|
+
throw new Error(`Missing required environment variable: ${name}`);
|
|
164
|
+
}
|
|
181
165
|
|
|
182
|
-
|
|
183
|
-
return { ...baseContext };
|
|
166
|
+
return value;
|
|
184
167
|
}
|
|
185
|
-
```
|
|
186
|
-
|
|
187
|
-
### Scenario shape
|
|
188
168
|
|
|
189
|
-
|
|
190
|
-
|
|
191
|
-
|
|
192
|
-
|
|
169
|
+
async function main(): Promise<void> {
|
|
170
|
+
const execution = await execute('./karrot.config.yml', {
|
|
171
|
+
variables: {
|
|
172
|
+
TEST_ENV: process.env.TEST_ENV ?? 'local',
|
|
173
|
+
PROJECT_NAME: process.env.PROJECT_NAME ?? 'Demo Agent',
|
|
174
|
+
APP_BASE_URL: required('APP_BASE_URL'),
|
|
175
|
+
AGENT_URL: required('AGENT_URL'),
|
|
176
|
+
AGENT_ID: required('AGENT_ID'),
|
|
177
|
+
WS_URL: required('WS_URL'),
|
|
178
|
+
WS_TOPIC: required('WS_TOPIC'),
|
|
179
|
+
AUTH_TOKEN: required('AUTH_TOKEN'),
|
|
180
|
+
ACCOUNT_ID: process.env.ACCOUNT_ID ?? '',
|
|
181
|
+
PROJECT_ID: process.env.PROJECT_ID ?? '',
|
|
182
|
+
},
|
|
183
|
+
scenario: {
|
|
184
|
+
file: './src/scenarios/basic.ts',
|
|
185
|
+
ids: process.env.SCENARIO_IDS?.split(',').map((id) => id.trim()).filter(Boolean),
|
|
186
|
+
},
|
|
187
|
+
});
|
|
188
|
+
|
|
189
|
+
const status = getScenarioRunStatus(execution.results);
|
|
190
|
+
|
|
191
|
+
console.log(
|
|
192
|
+
[
|
|
193
|
+
`Status: ${status}`,
|
|
194
|
+
`Artifacts: ${execution.outputDirectory}`,
|
|
195
|
+
`JSON report: ${execution.reportPaths?.jsonPath ?? '-'}`,
|
|
196
|
+
`HTML report: ${execution.reportPaths?.htmlPath ?? '-'}`,
|
|
197
|
+
].join('\n'),
|
|
198
|
+
);
|
|
199
|
+
|
|
200
|
+
if (status === 'FAIL') {
|
|
201
|
+
process.exitCode = 1;
|
|
202
|
+
}
|
|
203
|
+
}
|
|
193
204
|
|
|
194
|
-
|
|
195
|
-
|
|
196
|
-
|
|
197
|
-
|
|
198
|
-
|
|
199
|
-
- `assertions`: pass/fail checks for the turn output
|
|
200
|
-
- `eval`: quality scoring dimensions for the turn output
|
|
201
|
-
- `onComplete`: optional callback for turn-level post-processing
|
|
205
|
+
main().catch((error) => {
|
|
206
|
+
console.error(error instanceof Error ? error.message : error);
|
|
207
|
+
process.exitCode = 1;
|
|
208
|
+
});
|
|
209
|
+
```
|
|
202
210
|
|
|
203
|
-
|
|
211
|
+
Run it:
|
|
204
212
|
|
|
205
|
-
|
|
206
|
-
|
|
207
|
-
|
|
208
|
-
- `aiGen.fromGuidance(guidance)`
|
|
209
|
-
- `aiGen.fromContent(content)`
|
|
213
|
+
```bash
|
|
214
|
+
TEST_ENV=qa npx tsx ./src/run-karrot.ts
|
|
215
|
+
```
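The `ids` expression in the run script turns a comma-separated `SCENARIO_IDS` value into a list of scenario IDs and yields `undefined` when the variable is unset. The sketch below isolates that expression in a hypothetical `parseScenarioIds` helper so its behavior is easier to see. That an `undefined` list means "run every scenario" is an assumption, not something this README states.

```ts
// Hypothetical helper wrapping the `ids` expression from the run script.
function parseScenarioIds(raw: string | undefined): string[] | undefined {
  return raw
    ?.split(',')
    .map((id) => id.trim())
    .filter(Boolean);
}

console.log(parseScenarioIds('BASIC-CHAT-01, RESUME-SESSION-01,'));
console.log(parseScenarioIds(undefined));
console.log(parseScenarioIds(''));
```

Note that an empty string produces an empty array rather than `undefined`, so `SCENARIO_IDS=` and an unset `SCENARIO_IDS` are not equivalent inputs.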
|
|
210
216
|
|
|
211
|
-
|
|
212
|
-
- fixed prompts for deterministic tests
|
|
213
|
-
- context-aware prompts that use scenario data
|
|
214
|
-
- generated user prompts for more adaptive multi-turn flows
|
|
217
|
+
## 3. Basic Scenario
|
|
215
218
|
|
|
216
|
-
|
|
219
|
+
A scenario module must export `scenarioSet`. Export `buildScenarioContext(baseContext)` when the scenario needs typed or derived context.
|
|
217
220
|
|
|
218
221
|
```ts
|
|
219
|
-
import {
|
|
222
|
+
import {
|
|
223
|
+
aiGen,
|
|
224
|
+
AiScenarioSet,
|
|
225
|
+
type AiScenario,
|
|
226
|
+
type BaseAiScenarioContext,
|
|
227
|
+
} from '@huydao/karrot';
|
|
228
|
+
|
|
229
|
+
type DemoContext = BaseAiScenarioContext & {
|
|
230
|
+
appBaseUrl: string;
|
|
231
|
+
};
|
|
232
|
+
|
|
233
|
+
export function buildScenarioContext(baseContext: BaseAiScenarioContext): DemoContext {
|
|
234
|
+
return {
|
|
235
|
+
...baseContext,
|
|
236
|
+
appBaseUrl: String(baseContext.appBaseUrl ?? ''),
|
|
237
|
+
};
|
|
238
|
+
}
|
|
220
239
|
|
|
221
|
-
const scenarios: AiScenario<
|
|
240
|
+
const scenarios: AiScenario<DemoContext>[] = [
|
|
222
241
|
{
|
|
223
|
-
id: '
|
|
224
|
-
name: '
|
|
242
|
+
id: 'BASIC-CHAT-01',
|
|
243
|
+
name: 'Agent answers and suggests next steps',
|
|
225
244
|
turns: [
|
|
226
245
|
{
|
|
227
|
-
label: 'Ask
|
|
246
|
+
label: 'Ask what the agent can do',
|
|
247
|
+
message: () => 'What can you help me do in this product?',
|
|
248
|
+
assertions: [
|
|
249
|
+
{
|
|
250
|
+
assert: { hasText: 'help' },
|
|
251
|
+
description: 'The response should explain useful capabilities',
|
|
252
|
+
},
|
|
253
|
+
{
|
|
254
|
+
aiAssert: {
|
|
255
|
+
hasContent: 'The answer names at least one concrete task the user can perform.',
|
|
256
|
+
},
|
|
257
|
+
description: 'The response should be actionable',
|
|
258
|
+
},
|
|
259
|
+
],
|
|
260
|
+
eval: ['correctness', 'helpfulness', 'clarity'],
|
|
261
|
+
},
|
|
262
|
+
{
|
|
263
|
+
label: 'Ask for follow-up prompts',
|
|
228
264
|
message: aiGen.fromGuidance(
|
|
229
|
-
'Ask for
|
|
265
|
+
'Ask for three short follow-up prompts based on the previous answer.',
|
|
230
266
|
),
|
|
231
267
|
assertions: [
|
|
232
|
-
{
|
|
268
|
+
{
|
|
269
|
+
assert: { toolcall: [] },
|
|
270
|
+
description: 'The answer should not call tools for this simple prompt request',
|
|
271
|
+
},
|
|
272
|
+
{
|
|
273
|
+
aiAssert: {
|
|
274
|
+
hasContent: 'The answer provides three concise follow-up prompt ideas.',
|
|
275
|
+
},
|
|
276
|
+
},
|
|
277
|
+
],
|
|
278
|
+
eval: [
|
|
279
|
+
'relevance',
|
|
280
|
+
{
|
|
281
|
+
dimension: 'nextStepQuality',
|
|
282
|
+
guidance: 'Score whether the suggested next prompts are specific and usable.',
|
|
283
|
+
},
|
|
233
284
|
],
|
|
234
|
-
eval: ['correctness', 'helpfulness', 'relevance'],
|
|
235
285
|
},
|
|
236
286
|
],
|
|
237
287
|
},
|
|
238
288
|
];
|
|
239
289
|
|
|
240
290
|
export const scenarioSet = new AiScenarioSet(scenarios);
|
|
241
|
-
|
|
242
|
-
export function buildScenarioContext(baseContext: BaseAiScenarioContext): BaseAiScenarioContext {
|
|
243
|
-
return { ...baseContext };
|
|
244
|
-
}
|
|
245
291
|
```
|
|
246
292
|
|
|
247
|
-
##
|
|
293
|
+
## Scenario Details
|
|
294
|
+
|
|
295
|
+
### Scenario elements
|
|
296
|
+
|
|
297
|
+
Each `AiScenario` has:
|
|
298
|
+
|
|
299
|
+
- `id`: stable ID used by CLI filters, CI, and reports
|
|
300
|
+
- `name`: readable scenario name shown in reports
|
|
301
|
+
- `turns`: ordered user messages sent to the same conversation thread
|
|
302
|
+
- `continueOnAssertionFailure`: optional scenario-level flag that keeps later turns running after an assertion failure
|
|
248
303
|
|
|
249
|
-
|
|
304
|
+
Keep scenario IDs stable. Reports and CI filters depend on them.
|
|
250
305
|
|
|
251
|
-
|
|
306
|
+
### Turn elements
|
|
252
307
|
|
|
253
|
-
|
|
308
|
+
Each turn has:
|
|
309
|
+
|
|
310
|
+
- `label`: readable turn name shown in reports
|
|
311
|
+
- `message`: user input to send to the agent
|
|
312
|
+
- `idleTimeoutMs`: optional timeout for waiting on assistant activity
|
|
313
|
+
- `processTimeoutMs`: optional hard timeout for the turn
|
|
314
|
+
- `assertions`: pass/fail checks for required behavior
|
|
315
|
+
- `eval`: quality scoring dimensions for the assistant response
|
|
316
|
+
- `continueOnAssertionFailure`: optional turn-level override
|
|
317
|
+
- `onComplete`: optional callback after the turn returns output
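The relationship between the scenario-level `continueOnAssertionFailure` flag and the turn-level override can be pictured as a simple fallback chain. The sketch below uses a hypothetical local type and a hypothetical `shouldContinue` helper. The `false` default when neither level sets the flag is an assumption rather than documented behavior.

```ts
// Hypothetical local type; Karrot's real scenario and turn types carry more fields.
type FlagCarrier = { continueOnAssertionFailure?: boolean };

function shouldContinue(scenario: FlagCarrier, turn: FlagCarrier): boolean {
  // The turn-level value wins when present; otherwise fall back to the scenario.
  // Assumption: with neither set, later turns do not run after an assertion failure.
  return turn.continueOnAssertionFailure ?? scenario.continueOnAssertionFailure ?? false;
}

console.log(shouldContinue({ continueOnAssertionFailure: true }, {})); // true
console.log(shouldContinue({ continueOnAssertionFailure: true }, { continueOnAssertionFailure: false })); // false
console.log(shouldContinue({}, {})); // false
```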
|
|
318
|
+
|
|
319
|
+
`message` can be:
|
|
320
|
+
|
|
321
|
+
- `(context) => string`: deterministic message with access to scenario context
|
|
322
|
+
- `aiGen.fromPreviousContext()`: generate the next user message from conversation context
|
|
323
|
+
- `aiGen.fromGuidance(guidance)`: generate a user message from instructions
|
|
324
|
+
- `aiGen.fromContent(content)`: generate a user message from supplied source content
|
|
325
|
+
|
|
326
|
+
Use deterministic messages for regression tests. Use `aiGen` when the next user turn should adapt to the previous answer.
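A deterministic message is just a function of the scenario context, so the same context always produces the same user turn. The sketch below defines hypothetical local types (`DemoMessageContext`, `MessageFn`) instead of importing Karrot's, and the project ID is made up for illustration.

```ts
// Hypothetical local stand-ins for Karrot's context and message types.
type DemoMessageContext = { appBaseUrl: string; projectId: string };
type MessageFn<C> = (context: C) => string;

const askForSummary: MessageFn<DemoMessageContext> = ({ projectId }) =>
  `Summarize the open work in project ${projectId}.`;

const demoContext: DemoMessageContext = {
  appBaseUrl: 'https://app.example.test',
  projectId: 'demo-42',
};

// Calling the function twice with the same context yields the same message.
const firstMessage = askForSummary(demoContext);
const secondMessage = askForSummary(demoContext);
console.log(firstMessage === secondMessage); // true
```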
|
|
327
|
+
|
|
328
|
+
## Turn Assertions
|
|
329
|
+
|
|
330
|
+
Assertions decide whether a turn passes or fails. Use them for required behavior.
|
|
331
|
+
|
|
332
|
+
### `assert`
|
|
333
|
+
|
|
334
|
+
Use `assert` when the expected result can be checked directly.
|
|
254
335
|
|
|
255
336
|
```ts
|
|
256
337
|
assertions: [
|
|
257
|
-
{ assert: { hasText: '
|
|
338
|
+
{ assert: { hasText: 'created successfully' } },
|
|
339
|
+
{ assert: { toolcall: ['create_test_case'] } },
|
|
258
340
|
{ assert: { toolcall: [] } },
|
|
341
|
+
{
|
|
342
|
+
assert: {
|
|
343
|
+
toolcallWithContent: {
|
|
344
|
+
name: 'create_test_case',
|
|
345
|
+
hasText: ['login', 'password'],
|
|
346
|
+
hasProperties: {
|
|
347
|
+
priority: 'High',
|
|
348
|
+
},
|
|
349
|
+
},
|
|
350
|
+
},
|
|
351
|
+
},
|
|
259
352
|
]
|
|
260
353
|
```
|
|
261
354
|
|
|
262
|
-
|
|
355
|
+
Supported direct assertions:
|
|
356
|
+
|
|
357
|
+
- `assert.hasText`: response text contains the expected string
|
|
358
|
+
- `assert.toolcall`: exact expected tool-call names; use `[]` when no tool calls should happen
|
|
359
|
+
- `assert.toolcallWithContent`: a named tool call exists and contains expected text or structured properties
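To make these semantics concrete, here is a small sketch of how `hasText` and `toolcall` checks could be evaluated against a turn's output. The `TurnOutputSketch` shape and both helper functions are hypothetical. Treating `toolcall` as an order-sensitive exact match is an assumption; the README only says the names must match exactly.

```ts
// Hypothetical output shape; Karrot's real turn result is not shown in this README.
type TurnOutputSketch = { text: string; toolCalls: string[] };

function checkHasText(output: TurnOutputSketch, expected: string): boolean {
  return output.text.includes(expected);
}

function checkToolcall(output: TurnOutputSketch, expected: string[]): boolean {
  // Exact match: same number of calls and the same names (order-sensitive by assumption).
  return (
    output.toolCalls.length === expected.length &&
    output.toolCalls.every((toolName, index) => toolName === expected[index])
  );
}

const sampleOutput: TurnOutputSketch = {
  text: 'Test case created successfully.',
  toolCalls: ['create_test_case'],
};

console.log(checkHasText(sampleOutput, 'created successfully')); // true
console.log(checkToolcall(sampleOutput, ['create_test_case'])); // true
console.log(checkToolcall(sampleOutput, [])); // false, because a tool call happened
```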
|
|
360
|
+
|
|
361
|
+
### `aiAssert`
|
|
362
|
+
|
|
363
|
+
Use `aiAssert` when the requirement is semantic and exact string matching would be brittle.
|
|
263
364
|
|
|
264
365
|
```ts
|
|
265
366
|
assertions: [
|
|
266
|
-
{
|
|
267
|
-
|
|
367
|
+
{
|
|
368
|
+
aiAssert: {
|
|
369
|
+
hasContent: 'The answer explains the next action and why it is needed.',
|
|
370
|
+
},
|
|
371
|
+
},
|
|
372
|
+
{
|
|
373
|
+
aiAssert: {
|
|
374
|
+
notHasContent: 'The answer invents product capabilities not present in the prompt.',
|
|
375
|
+
},
|
|
376
|
+
},
|
|
268
377
|
]
|
|
269
378
|
```
|
|
270
379
|
|
|
271
|
-
|
|
272
|
-
|
|
273
|
-
-
|
|
274
|
-
-
|
|
380
|
+
Supported AI assertions:
|
|
381
|
+
|
|
382
|
+
- `aiAssert.hasContent`: semantic requirement must be present
|
|
383
|
+
- `aiAssert.notHasContent`: semantic problem must be absent
|
|
384
|
+
|
|
385
|
+
`aiAssert` requires `OPENAI_API_KEY`.
|
|
275
386
|
|
|
276
|
-
##
|
|
387
|
+
## Eval
|
|
277
388
|
|
|
278
|
-
|
|
279
|
-
|
|
389
|
+
Eval is separate from assertions.
|
|
390
|
+
|
|
391
|
+
- Assertion: pass/fail requirement, for example "must call `create_test_case`"
|
|
392
|
+
- Eval: quality score, for example "how helpful and complete was the answer"
|
|
280
393
|
|
|
281
394
|
```ts
|
|
282
395
|
eval: ['correctness', 'coverage', 'helpfulness']
|
|
283
396
|
```
|
|
284
397
|
|
|
285
|
-
|
|
398
|
+
You can add inline guidance:
|
|
286
399
|
|
|
287
400
|
```ts
|
|
288
401
|
eval: [
|
|
289
402
|
'correctness',
|
|
290
403
|
{
|
|
291
404
|
dimension: 'productFit',
|
|
292
|
-
guidance: '
|
|
405
|
+
guidance: 'Score whether the answer is specifically useful for users of this product.',
|
|
293
406
|
},
|
|
294
407
|
]
|
|
295
408
|
```
|
|
296
409
|
|
|
297
|
-
|
|
410
|
+
Common dimensions:
|
|
298
411
|
|
|
299
|
-
Built-in dimensions commonly used by Karrot:
|
|
300
412
|
- `correctness`
|
|
301
413
|
- `coverage`
|
|
302
414
|
- `helpfulness`
|
|
@@ -309,99 +421,240 @@ Built-in dimensions commonly used by Karrot:
|
|
|
309
421
|
- `consistency`
|
|
310
422
|
- `safety`
|
|
311
423
|
|
|
312
|
-
Project-level eval prompts can
|
|
313
|
-
- `evaluation.systemPromptPath`
|
|
314
|
-
- `evaluation.promptDirectory`
|
|
424
|
+
Project-level eval prompts can live in a directory:
|
|
315
425
|
|
|
316
|
-
|
|
426
|
+
```yml
|
|
427
|
+
evaluation:
|
|
428
|
+
promptDirectory: ./prompts/eval
|
|
429
|
+
```
|
|
317
430
|
|
|
318
|
-
|
|
319
|
-
- `systemPromptPath` when you want to replace the whole turn-eval rubric
|
|
320
|
-
- `promptDirectory` when you want to add custom project-specific dimensions
|
|
431
|
+
Then create files such as:
|
|
321
432
|
|
|
322
|
-
|
|
323
|
-
-
|
|
324
|
-
-
|
|
325
|
-
|
|
326
|
-
- Because Karrot applies CheckEval-style scoring, dimensions like `relevance` and `consistency` are judged through concrete sub-checks instead of a vague overall impression.
|
|
433
|
+
```text
|
|
434
|
+
prompts/eval/product-fit.md
|
|
435
|
+
prompts/eval/next-step-quality.md
|
|
436
|
+
```
|
|
327
437
|
|
|
328
|
-
|
|
438
|
+
Scenario authors can then use only the dimension names:
|
|
329
439
|
|
|
330
|
-
|
|
440
|
+
```ts
|
|
441
|
+
eval: ['correctness', 'productFit', 'nextStepQuality']
|
|
442
|
+
```
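The example file names (`product-fit.md`, `next-step-quality.md`) and dimension names (`productFit`, `nextStepQuality`) suggest a kebab-case to camelCase mapping. That mapping is inferred from the examples, not stated by the README, so treat the hypothetical `dimensionFromFileName` helper below as a sketch for checking your own file names.

```ts
// Hypothetical helper: derives a camelCase dimension name from a prompt file name.
function dimensionFromFileName(fileName: string): string {
  const baseName = fileName.replace(/\.md$/, '');
  return baseName.replace(/-([a-z0-9])/g, (_match: string, letter: string) => letter.toUpperCase());
}

console.log(dimensionFromFileName('product-fit.md')); // productFit
console.log(dimensionFromFileName('next-step-quality.md')); // nextStepQuality
```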
|
|
331
443
|
|
|
332
|
-
|
|
333
|
-
- `aiGen.fromPreviousContext()`
|
|
334
|
-
- `aiGen.fromGuidance(guidance)`
|
|
335
|
-
- `aiGen.fromContent(content)`
|
|
444
|
+
Use `evaluation.systemPromptPath` only when you need to replace the full turn-eval rubric.
|
|
336
445
|
|
|
337
|
-
|
|
446
|
+
## Config Reference
|
|
338
447
|
|
|
339
|
-
|
|
340
|
-
import { aiGen } from '@huydao/karrot';
|
|
448
|
+
Top-level config keys:
|
|
341
449
|
|
|
342
|
-
|
|
343
|
-
|
|
344
|
-
|
|
345
|
-
|
|
450
|
+
- `version`: currently `1`
|
|
451
|
+
- `transport`: agent transport, currently `ag-ui-wss` or `ag-ui-post`
|
|
452
|
+
- `artifacts.directory`: output directory for raw events and reports
|
|
453
|
+
- `execution.stopOnFailure`: stop remaining scenarios after a failure
|
|
454
|
+
- `execution.concurrency`: number of scenarios to run in parallel
|
|
455
|
+
- `context`: values available to `buildScenarioContext`
|
|
456
|
+
- `evaluation.systemPromptPath`: full eval prompt override
|
|
457
|
+
- `evaluation.promptDirectory`: additional project-specific eval dimension prompts
|
|
458
|
+
- `report.enabled`: set `false` to skip reports
|
|
459
|
+
- `report.environment`, `report.projectName`, `report.runtime`: metadata written to reports
|
|
460
|
+
- `report.scenarioContext`: extra metadata written to reports
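For readers who prefer types, the keys above can be summarized as a TypeScript shape. `KarrotConfigSketch` is a hypothetical type written for this example, not a type exported by the package. Which fields are optional is an assumption, and transport-specific fields are left open.

```ts
// Hypothetical summary of the documented config keys; not exported by @huydao/karrot.
interface KarrotConfigSketch {
  version: 1;
  transport: { type: 'ag-ui-wss' | 'ag-ui-post'; [key: string]: unknown };
  artifacts?: { directory?: string };
  execution?: { stopOnFailure?: boolean; concurrency?: number };
  context?: Record<string, unknown>;
  evaluation?: { systemPromptPath?: string; promptDirectory?: string };
  report?: {
    enabled?: boolean;
    environment?: string;
    projectName?: string;
    runtime?: Record<string, string>;
    scenarioContext?: Record<string, unknown>;
  };
}

// Example values mirror the YAML samples in this README.
const exampleConfig: KarrotConfigSketch = {
  version: 1,
  transport: { type: 'ag-ui-wss', processTimeoutMs: 120000 },
  artifacts: { directory: './artifacts/karrot' },
  execution: { stopOnFailure: false, concurrency: 1 },
  report: { enabled: true, environment: 'qa', projectName: 'Demo Agent' },
};

console.log(exampleConfig.transport.type);
```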
|
|
346
461
|
|
|
347
|
-
|
|
462
|
+
## Reports And Artifacts
|
|
348
463
|
|
|
349
|
-
|
|
464
|
+
Each run creates an artifact directory under `artifacts/<timestamp>` or the configured `artifacts.directory`.
|
|
350
465
|
|
|
351
|
-
|
|
352
|
-
- `transport`
|
|
353
|
-
- `artifacts.directory`
|
|
354
|
-
- `execution.stopOnFailure`
|
|
355
|
-
- `evaluation.systemPromptPath`
|
|
356
|
-
- `evaluation.promptDirectory`
|
|
357
|
-
- `context`
|
|
358
|
-
- `report`
|
|
466
|
+
Typical outputs:
|
|
359
467
|
|
|
360
|
-
|
|
361
|
-
-
|
|
362
|
-
-
|
|
468
|
+
- raw transport logs, such as `.jsonl` or `.sse`
|
|
469
|
+
- generated-message traces
|
|
470
|
+
- AI assertion traces
|
|
471
|
+
- JSON report
|
|
472
|
+
- HTML report
|
|
363
473
|
|
|
364
|
-
##
|
|
474
|
+
## How To Add Karrot To An Existing Playwright Framework
|
|
365
475
|
|
|
366
|
-
|
|
367
|
-
|
|
368
|
-
|
|
369
|
-
|
|
370
|
-
|
|
476
|
+
1. Install `@huydao/karrot` in the Playwright project.
|
|
477
|
+
2. Add a config file, for example `scripts/ai-full-flow.karrot.yml`.
|
|
478
|
+
3. Add a runner script, for example `scripts/run-ai-full-flow.ts`.
|
|
479
|
+
4. Put scenario files under a stable folder, for example `data/ai-scenarios`.
|
|
480
|
+
5. Reuse Playwright auth helpers to discover runtime values, then pass those values to `execute()`.
|
|
481
|
+
6. Add npm scripts that run the Karrot runner with `tsx`.
|
|
482
|
+
7. Store reports under a predictable artifact directory for CI upload.
|
|
371
483
|
|
|
372
|
-
|
|
484
|
+
Minimal `package.json` script:
|
|
373
485
|
|
|
374
|
-
|
|
375
|
-
|
|
376
|
-
|
|
377
|
-
-
|
|
378
|
-
|
|
486
|
+
```json
|
|
487
|
+
{
|
|
488
|
+
"scripts": {
|
|
489
|
+
"ai:full-flow": "tsx ./scripts/run-ai-full-flow.ts"
|
|
490
|
+
}
|
|
491
|
+
}
|
|
492
|
+
```
|
|
379
493
|
|
|
380
|
-
|
|
494
|
+
Example command:
|
|
381
495
|
|
|
382
|
-
|
|
496
|
+
```bash
|
|
497
|
+
TEST_ENV=qa npm run ai:full-flow -- --scenario-file data/ai-scenarios/basic.ts
|
|
498
|
+
```
|
|
383
499
|
|
|
384
|
-
|
|
385
|
-
- `executors/`: transport runners and scenario execution
|
|
386
|
-
- `reports/`: JSON and HTML reporting
|
|
387
|
-
- `scenarios/`: scenario types, loaders, generated-message helpers
|
|
388
|
-
- `utils/`: config loading, artifacts, OpenAI helpers
|
|
389
|
-
- `prompts/`: built-in prompt files used by the package
|
|
500
|
+
### Using Karrot inside a Playwright test
|
|
390
501
|
|
|
391
|
-
|
|
502
|
+
Use `execute()` when the scenario file owns the whole conversation. Karrot automatically starts a thread on the first turn and reuses that thread for the following turns in the same scenario.
|
|
392
503
|
|
|
393
|
-
|
|
504
|
+
```ts
|
|
505
|
+
import path from 'node:path';
|
|
506
|
+
import { test, expect } from '@playwright/test';
|
|
507
|
+
import { execute, getScenarioRunStatus } from '@huydao/karrot';
|
|
508
|
+
|
|
509
|
+
test('agent completes the basic flow', async () => {
|
|
510
|
+
const execution = await execute(path.resolve(__dirname, '../karrot.config.yml'), {
|
|
511
|
+
variables: {
|
|
512
|
+
TEST_ENV: process.env.TEST_ENV ?? 'qa',
|
|
513
|
+
PROJECT_NAME: 'Demo Agent',
|
|
514
|
+
APP_BASE_URL: process.env.APP_BASE_URL,
|
|
515
|
+
AGENT_URL: process.env.AGENT_URL,
|
|
516
|
+
AGENT_ID: process.env.AGENT_ID,
|
|
517
|
+
WS_URL: process.env.WS_URL,
|
|
518
|
+
WS_TOPIC: process.env.WS_TOPIC,
|
|
519
|
+
AUTH_TOKEN: process.env.AUTH_TOKEN,
|
|
520
|
+
ACCOUNT_ID: process.env.ACCOUNT_ID,
|
|
521
|
+
PROJECT_ID: process.env.PROJECT_ID,
|
|
522
|
+
},
|
|
523
|
+
scenario: {
|
|
524
|
+
file: path.resolve(__dirname, '../src/scenarios/basic.ts'),
|
|
525
|
+
ids: ['BASIC-CHAT-01'],
|
|
526
|
+
},
|
|
527
|
+
});
|
|
528
|
+
|
|
529
|
+
expect(getScenarioRunStatus(execution.results)).toBe('PASS');
|
|
530
|
+
});
|
|
531
|
+
```
|
|
394
532
|
|
|
395
|
-
|
|
533
|
+
Use `runScenario()` when the Playwright test needs to create a session itself or resume an existing one. To resume, pass the known thread/conversation ID as `initialThreadId`. Keep `concurrency: 1`; a single existing session cannot be shared safely across parallel scenarios.
|
|
396
534
|
|
|
397
|
-
```
|
|
398
|
-
|
|
399
|
-
|
|
535
|
+
```ts
|
|
536
|
+
import { test, expect } from '@playwright/test';
|
|
537
|
+
import {
|
|
538
|
+
AiScenarioSet,
|
|
539
|
+
createRunArtifactDirectory,
|
|
540
|
+
runScenario,
|
|
541
|
+
type AiScenario,
|
|
542
|
+
type BaseAiScenarioContext,
|
|
543
|
+
} from '@huydao/karrot';
|
|
544
|
+
import { runAgUiMessage } from '@huydao/karrot/adapters/ag-ui';
|
|
545
|
+
|
|
546
|
+
type SessionContext = BaseAiScenarioContext & {
|
|
547
|
+
projectId: string;
|
|
548
|
+
};
|
|
549
|
+
|
|
550
|
+
const scenarios: AiScenario<SessionContext>[] = [
|
|
551
|
+
{
|
|
552
|
+
id: 'RESUME-SESSION-01',
|
|
553
|
+
name: 'Continue an existing assistant session',
|
|
554
|
+
turns: [
|
|
555
|
+
{
|
|
556
|
+
label: 'Recall current context',
|
|
557
|
+
message: ({ projectId }) =>
|
|
558
|
+
`Continue in project ${projectId}. Summarize what we have already discussed and suggest the next action.`,
|
|
559
|
+
assertions: [
|
|
560
|
+
{
|
|
561
|
+
aiAssert: {
|
|
562
|
+
hasContent: 'The answer uses the existing conversation context.',
|
|
563
|
+
},
|
|
564
|
+
},
|
|
565
|
+
],
|
|
566
|
+
eval: ['relevance', 'helpfulness'],
|
|
567
|
+
},
|
|
568
|
+
],
|
|
569
|
+
},
|
|
570
|
+
];
|
|
571
|
+
|
|
572
|
+
test('continues an existing agent session', async () => {
|
|
573
|
+
const initialThreadId = process.env.KARROT_THREAD_ID;
|
|
574
|
+
|
|
575
|
+
if (!initialThreadId) {
|
|
576
|
+
throw new Error('KARROT_THREAD_ID is required to resume an existing session.');
|
|
577
|
+
}
|
|
578
|
+
|
|
579
|
+
const outputDirectory = await createRunArtifactDirectory('./artifacts/karrot-playwright');
|
|
580
|
+
const env = {
|
|
581
|
+
...process.env,
|
|
582
|
+
AGENT_URL: process.env.AGENT_URL ?? '',
|
|
583
|
+
AGENT_ID: process.env.AGENT_ID ?? '',
|
|
584
|
+
WS_URL: process.env.WS_URL ?? '',
|
|
585
|
+
WS_TOPIC: process.env.WS_TOPIC ?? '',
|
|
586
|
+
AUTH_TOKEN: process.env.AUTH_TOKEN ?? '',
|
|
587
|
+
WS_STOMP_HEADERS: `Authorization:${process.env.AUTH_TOKEN ?? ''}`,
|
|
588
|
+
WS_HEADERS: `Origin:${process.env.APP_BASE_URL ?? ''},User-Agent:Mozilla/5.0`,
|
|
589
|
+
};
|
|
590
|
+
|
|
591
|
+
const scenarioSet = new AiScenarioSet(scenarios);
|
|
592
|
+
const [result] = await runScenario(scenarioSet.select(['RESUME-SESSION-01']), {
|
|
593
|
+
context: {
|
|
594
|
+
projectId: process.env.PROJECT_ID ?? '',
|
|
595
|
+
},
|
|
596
|
+
env,
|
|
597
|
+
outputDirectory,
|
|
598
|
+
initialThreadId,
|
|
599
|
+
concurrency: 1,
|
|
600
|
+
messageRunner: async ({ message, outputDirectory, threadId, processTimeoutMs }) =>
|
|
601
|
+
await runAgUiMessage({
|
|
602
|
+
message,
|
|
603
|
+
env,
|
|
604
|
+
outputDirectory,
|
|
605
|
+
threadId,
|
|
606
|
+
processTimeoutMs,
|
|
607
|
+
}),
|
|
608
|
+
});
|
|
609
|
+
|
|
610
|
+
expect(result.status).toBe('PASS');
|
|
611
|
+
expect(result.threadId).toBe(initialThreadId);
|
|
612
|
+
});
|
|
613
|
+
```
|
|
614
|
+
|
|
615
|
+
Store the value of `execution.results[*].threadId` (or `result.threadId`) when a later Playwright test needs to continue the same assistant session.
|
|
616
|
+
|
|
617
|
+
For an init-then-recall flow in the same Playwright script, run once without `initialThreadId`, then pass the returned `threadId` as `initialThreadId` to the next `runScenario()` call:
|
|
618
|
+
|
|
619
|
+
```ts
|
|
620
|
+
const [createdSession] = await runScenario(initialScenarioSet.select(['INIT-SESSION-01']), {
|
|
621
|
+
context,
|
|
622
|
+
env,
|
|
623
|
+
outputDirectory,
|
|
624
|
+
concurrency: 1,
|
|
625
|
+
messageRunner,
|
|
626
|
+
});
|
|
627
|
+
|
|
628
|
+
const threadId = createdSession.threadId;
|
|
629
|
+
|
|
630
|
+
if (!threadId) {
|
|
631
|
+
throw new Error('Initial scenario did not return a threadId.');
|
|
632
|
+
}
|
|
633
|
+
|
|
634
|
+
const [continuedSession] = await runScenario(recallScenarioSet.select(['RECALL-SESSION-01']), {
|
|
635
|
+
context,
|
|
636
|
+
env,
|
|
637
|
+
outputDirectory,
|
|
638
|
+
initialThreadId: threadId,
|
|
639
|
+
concurrency: 1,
|
|
640
|
+
messageRunner,
|
|
641
|
+
});
|
|
400
642
|
```
|
|
401
643
|
|
|
402
|
-
##
|
|
644
|
+
## Package Structure
|
|
645
|
+
|
|
646
|
+
- `assertions/`: direct assertions and AI assertions
|
|
647
|
+
- `executors/`: scenario execution and transport runners
|
|
648
|
+
- `reports/`: JSON and HTML reporting
|
|
649
|
+
- `scenarios/`: scenario types, loaders, and generated-message helpers
|
|
650
|
+
- `utils/`: config loading, variable resolution, artifacts, and OpenAI helpers
|
|
651
|
+
- `prompts/`: built-in prompts used by the package
|
|
652
|
+
|
|
653
|
+
## Build
|
|
403
654
|
|
|
404
655
|
```bash
|
|
405
656
|
cd karrot
|
|
406
|
-
npm
|
|
657
|
+
npm run build
|
|
407
658
|
```
|
|
659
|
+
|
|
660
|
+
For a fuller operational reference, read [GUIDE.md](./GUIDE.md).
|