@huydao/karrot 0.1.6 → 0.1.8

package/README.md CHANGED
@@ -1,73 +1,59 @@
1
1
  # karrot
2
2
 
3
- `karrot` is a reusable AI test runner for multi-turn assistant scenarios.
3
+ `karrot` is a reusable AI scenario runner for testing assistants through multi-turn conversations.
4
4
 
5
- It gives you:
6
- - scenario execution
7
- - AG-UI transport integration
8
- - string and AI-based assertions
9
- - turn evaluation with OpenAI
10
- - JSON and HTML reports
5
+ Use it when you want to:
6
+ - send one or more user turns to an agent
7
+ - keep the same conversation thread across turns
8
+ - assert required behavior with deterministic or semantic checks
9
+ - score response quality with eval dimensions
10
+ - write JSON and HTML reports for each run
11
11
 
12
- This package is designed to be published independently and reused across projects.
12
+ Karrot is product-neutral. It does not know how your product logs in, where your project ID comes from, or how your agent runtime is discovered. The consuming project prepares those values and passes them to Karrot through a YAML config and `execute()`.
13
13
 
14
- ## What Karrot Owns
14
+ ## Static Docs Site
15
15
 
16
- `karrot` is responsible for the AI-test layer:
17
- - load config
18
- - resolve `${VARIABLE}` templates
19
- - load scenario modules
20
- - execute turns
21
- - run assertions and evals
22
- - write artifacts and reports
16
+ Karrot also includes a static documentation site:
23
17
 
24
- `karrot` does not own product-specific runtime discovery. The consumer project should prepare data such as:
25
- - `PROJECT_ID`
26
- - `JWT`
27
- - `ACCOUNT_ID`
28
- - `WS_URL`
29
- - `WS_TOPIC`
30
- - any transport-specific headers or IDs
18
+ - Landing page: [site/index.html](./site/index.html)
19
+ - Guidance page: [site/docs/index.html](./site/docs/index.html)
31
20
 
32
- ## Core Entry Point
21
+ Run it locally:
33
22
 
34
- The main high-level API is `execute()`.
23
+ ```bash
24
+ npm run site:serve
25
+ ```
35
26
 
36
- ```ts
37
- import { execute } from '@huydao/karrot';
38
-
39
- await execute('./karrot.config.yml', {
40
- variables: {
41
- PROJECT_ID: process.env.PROJECT_ID,
42
- JWT: process.env.JWT,
43
- ACCOUNT_ID: process.env.ACCOUNT_ID,
44
- WS_URL: process.env.WS_URL,
45
- WS_TOPIC: process.env.WS_TOPIC,
46
- },
47
- scenario: {
48
- file: './src/scenarios/basic-two-turn-demo.ts',
49
- },
50
- });
27
+ ## Quick Setup
28
+
29
+ The normal setup has three files:
30
+
31
+ - `karrot.config.yml`: transport, artifacts, eval prompts, report metadata, and scenario context
32
+ - `src/run-karrot.ts`: a small script that collects runtime variables and calls `execute()`
33
+ - `src/scenarios/basic.ts`: one or more scenarios exported as an `AiScenarioSet`
34
+
35
+ Install the package in your project:
36
+
37
+ ```bash
38
+ npm install @huydao/karrot
39
+ npm install -D tsx typescript
40
+ ```
41
+
42
+ Set the OpenAI key when you use `aiAssert`, `eval`, or generated user messages:
43
+
44
+ ```bash
45
+ export OPENAI_API_KEY=<your-openai-api-key>
51
46
  ```
52
47
 
53
- `execute()` will:
54
- 1. load YAML or JSON config
55
- 2. resolve `${...}` variables
56
- 3. create `artifacts/<timestamp>`
57
- 4. load the scenario module
58
- 5. run selected scenarios
59
- 6. write JSON and HTML reports
48
+ ## 1. YAML Config File
60
49
 
61
- ## Recommended Setup Flow
50
+ Create a config file in the project that will run the tests. The file can live anywhere, but `./karrot.config.yml` or `./scripts/<flow>.karrot.yml` is easiest for agents and humans to find.
62
51
 
63
- The normal setup path is:
64
- 1. create a YAML config file for the WSS transport
65
- 2. create a scenario module that exports `scenarioSet` and `buildScenarioContext`
66
- 3. create a small run script that calls `execute()`
52
+ Karrot resolves `${VARIABLE}` placeholders from the `variables` object passed to `execute()`, then from `process.env`.
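Note: the lookup order described in the added line above (explicit `variables` first, then `process.env`) can be sketched as follows. This is a minimal illustration, not Karrot's implementation. `resolvePlaceholders` and `VariableMap` are hypothetical names. The placeholder pattern and the choice to leave unknown placeholders untouched are assumptions; the README does not say what happens when a variable is missing.

```ts
// Hypothetical helper illustrating the documented lookup order; not Karrot's actual code.
type VariableMap = Record<string, string | undefined>;

function resolvePlaceholders(
  template: string,
  variables: VariableMap,
  env: VariableMap,
): string {
  // Match ${NAME} where NAME is letters, digits, or underscores (an assumption).
  return template.replace(/\$\{(\w+)\}/g, (placeholder: string, name: string) => {
    // Explicit variables win; the environment is only the fallback.
    const value = variables[name] ?? env[name];
    // Assumption: unresolved placeholders are left as-is rather than throwing.
    return value ?? placeholder;
  });
}

const resolved = resolvePlaceholders(
  'Origin:${APP_BASE_URL},Env:${TEST_ENV},Missing:${NOT_SET}',
  { APP_BASE_URL: 'https://app.example.test' },
  { APP_BASE_URL: 'https://ignored.example.test', TEST_ENV: 'qa' },
);

console.log(resolved);
// Origin:https://app.example.test,Env:qa,Missing:${NOT_SET}
```

In this sketch `APP_BASE_URL` comes from the explicit map even though the environment also defines it, and `TEST_ENV` falls through to the environment.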
67
53
 
68
- ### 1. WSS config in YAML
54
+ ### Generic AG-UI WSS agent
69
55
 
70
- Use one config file to describe transport, evaluation prompt settings, artifacts, and reporting.
56
+ Use `ag-ui-wss` when the target agent speaks AG-UI over WebSocket/STOMP.
71
57
 
72
58
  ```yml
73
59
  version: 1
@@ -75,34 +61,34 @@ version: 1
75
61
  transport:
76
62
  type: ag-ui-wss
77
63
  env:
78
- JWT: ${JWT}
79
- ACCOUNT_ID: ${ACCOUNT_ID}
80
- PROJECT_ID: ${PROJECT_ID}
81
64
  AGENT_URL: ${AGENT_URL}
82
65
  AGENT_ID: ${AGENT_ID}
83
66
  WS_URL: ${WS_URL}
84
67
  WS_TOPIC: ${WS_TOPIC}
85
- WS_STOMP_HEADERS: Authorization:${JWT}
86
- WS_HEADERS: Origin:${WS_ORIGIN},User-Agent:Mozilla/5.0
68
+ AUTH_TOKEN: ${AUTH_TOKEN}
69
+ WS_STOMP_HEADERS: Authorization:${AUTH_TOKEN}
70
+ WS_HEADERS: Origin:${APP_BASE_URL},User-Agent:Mozilla/5.0
87
71
  processTimeoutMs: 120000
88
72
 
89
73
  artifacts:
90
- directory: ./artifacts
74
+ directory: ./artifacts/karrot
91
75
 
92
76
  execution:
93
77
  stopOnFailure: false
78
+ concurrency: 1
79
+
80
+ context:
81
+ appBaseUrl: ${APP_BASE_URL}
82
+ projectId: ${PROJECT_ID}
94
83
 
95
84
  evaluation:
96
85
  systemPromptPath: ./prompts/turn-eval-system-prompt.md
97
86
  promptDirectory: ./prompts/eval
98
87
 
99
- context:
100
- projectId: ${PROJECT_ID}
101
-
102
88
  report:
103
89
  enabled: true
104
- environment: prod
105
- projectName: Demo Project
90
+ environment: ${TEST_ENV}
91
+ projectName: ${PROJECT_NAME}
106
92
  runtime:
107
93
  agentUrl: ${AGENT_URL}
108
94
  agentId: ${AGENT_ID}
@@ -113,190 +99,316 @@ report:
113
99
  appBaseUrl: ${APP_BASE_URL}
114
100
  ```
115
101
 
116
- What this does:
117
- - `transport`: tells Karrot how to talk to the assistant
118
- - `evaluation`: points to the turn-eval rubric and any extra project-specific dimension prompts
119
- - `context`: makes resolved values available to scenarios
120
- - `report`: controls run metadata written into reports
102
+ How to get the values:
121
103
 
122
- ### 2. Scenario module
104
+ - `AGENT_URL`, `AGENT_ID`, `WS_URL`, `WS_TOPIC`: from your agent platform, runtime discovery API, or test environment configuration
105
+ - `AUTH_TOKEN`, `ACCOUNT_ID`, `PROJECT_ID`: from your product login/auth setup or CI secrets
106
+ - `APP_BASE_URL`, `TEST_ENV`, `PROJECT_NAME`: from your test environment
107
+ - `OPENAI_API_KEY`: from the environment, only needed for `aiAssert`, `eval`, and `aiGen`
123
108
 
124
- A scenario module defines the multi-turn tests that Karrot will run.
109
+ Put secrets in environment variables or CI secrets, not in the YAML file.
125
110
 
126
- ### 3. Run script
111
+ ### Generic HTTP agent
127
112
 
128
- Use a small script to resolve variables and point Karrot at the scenario file.
113
+ Use `ag-ui-post` when the agent is triggered by HTTP and optionally observed by polling.
129
114
 
130
- ```ts
131
- import { execute } from '@huydao/karrot';
132
-
133
- await execute('./karrot.config.yml', {
134
- variables: {
135
- PROJECT_ID: process.env.PROJECT_ID,
136
- JWT: process.env.JWT,
137
- ACCOUNT_ID: process.env.ACCOUNT_ID,
138
- AGENT_URL: process.env.AGENT_URL,
139
- AGENT_ID: process.env.AGENT_ID,
140
- WS_URL: process.env.WS_URL,
141
- WS_TOPIC: process.env.WS_TOPIC,
142
- WS_ORIGIN: process.env.WS_ORIGIN,
143
- APP_BASE_URL: process.env.APP_BASE_URL,
144
- },
145
- scenario: {
146
- file: './src/scenarios/basic-two-turn-demo.ts',
147
- ids: ['BASIC-2T'],
148
- },
149
- });
150
- ```
115
+ ```yml
116
+ version: 1
151
117
 
152
- ## Scenario Structure
118
+ transport:
119
+ type: ag-ui-post
120
+ injectMessage: true
121
+ run:
122
+ url: ${AGENT_RUN_URL}
123
+ headers:
124
+ Authorization: Bearer ${AUTH_TOKEN}
125
+ Content-Type: application/json
126
+ payload:
127
+ body:
128
+ threadId: ${THREAD_ID}
129
+ messages: []
130
+ processTimeoutMs: 120000
153
131
 
154
- A scenario module exports:
155
- - `scenarioSet`
156
- - `buildScenarioContext(baseContext)`
132
+ artifacts:
133
+ directory: ./artifacts/karrot
157
134
 
158
- Minimal example:
135
+ context:
136
+ appBaseUrl: ${APP_BASE_URL}
137
+
138
+ report:
139
+ enabled: true
140
+ environment: ${TEST_ENV}
141
+ projectName: ${PROJECT_NAME}
142
+ runtime:
143
+ agentUrl: ${AGENT_RUN_URL}
144
+ agentId: ${AGENT_ID}
145
+ wsUrl: ""
146
+ wsTopic: ""
147
+ accountId: ${ACCOUNT_ID}
148
+ projectId: ${PROJECT_ID}
149
+ appBaseUrl: ${APP_BASE_URL}
150
+ ```
151
+
152
+ ## 2. Run Script
153
+
154
+ The run script is the boundary between your product and Karrot. It should collect runtime values, pass them as variables, select the scenario file, and set a non-zero exit code on failure.
159
155
 
160
156
  ```ts
161
- import { AiScenarioSet, type AiScenario, type BaseAiScenarioContext } from '@huydao/karrot';
157
+ import { execute, getScenarioRunStatus } from '@huydao/karrot';
162
158
 
163
- const scenarios: AiScenario<BaseAiScenarioContext>[] = [
164
- {
165
- id: 'BASIC-2T',
166
- name: 'Basic Two-Turn Demo',
167
- turns: [
168
- {
169
- label: 'Turn 1',
170
- message: () => 'Hello. What can you help me with in Katalon AI?',
171
- },
172
- {
173
- label: 'Turn 2',
174
- message: () => 'Give me 3 short example prompts I can ask next.',
175
- },
176
- ],
177
- },
178
- ];
159
+ function required(name: string): string {
160
+ const value = process.env[name]?.trim();
179
161
 
180
- export const scenarioSet = new AiScenarioSet(scenarios);
162
+ if (!value) {
163
+ throw new Error(`Missing required environment variable: ${name}`);
164
+ }
181
165
 
182
- export function buildScenarioContext(baseContext: BaseAiScenarioContext): BaseAiScenarioContext {
183
- return { ...baseContext };
166
+ return value;
184
167
  }
185
- ```
186
-
187
- ### Scenario shape
188
168
 
189
- Each scenario typically contains:
190
- - `id`: stable scenario identifier
191
- - `name`: human-readable scenario name
192
- - `turns`: ordered list of user turns to execute
169
+ async function main(): Promise<void> {
170
+ const execution = await execute('./karrot.config.yml', {
171
+ variables: {
172
+ TEST_ENV: process.env.TEST_ENV ?? 'local',
173
+ PROJECT_NAME: process.env.PROJECT_NAME ?? 'Demo Agent',
174
+ APP_BASE_URL: required('APP_BASE_URL'),
175
+ AGENT_URL: required('AGENT_URL'),
176
+ AGENT_ID: required('AGENT_ID'),
177
+ WS_URL: required('WS_URL'),
178
+ WS_TOPIC: required('WS_TOPIC'),
179
+ AUTH_TOKEN: required('AUTH_TOKEN'),
180
+ ACCOUNT_ID: process.env.ACCOUNT_ID ?? '',
181
+ PROJECT_ID: process.env.PROJECT_ID ?? '',
182
+ },
183
+ scenario: {
184
+ file: './src/scenarios/basic.ts',
185
+ ids: process.env.SCENARIO_IDS?.split(',').map((id) => id.trim()).filter(Boolean),
186
+ },
187
+ });
188
+
189
+ const status = getScenarioRunStatus(execution.results);
190
+
191
+ console.log(
192
+ [
193
+ `Status: ${status}`,
194
+ `Artifacts: ${execution.outputDirectory}`,
195
+ `JSON report: ${execution.reportPaths?.jsonPath ?? '-'}`,
196
+ `HTML report: ${execution.reportPaths?.htmlPath ?? '-'}`,
197
+ ].join('\n'),
198
+ );
199
+
200
+ if (status === 'FAIL') {
201
+ process.exitCode = 1;
202
+ }
203
+ }
193
204
 
194
- Each turn supports:
195
- - `label`: display label in reports
196
- - `message`: the user message to send
197
- - `idleTimeoutMs`: optional wait limit for message inactivity
198
- - `processTimeoutMs`: optional hard timeout for the turn
199
- - `assertions`: pass/fail checks for the turn output
200
- - `eval`: quality scoring dimensions for the turn output
201
- - `onComplete`: optional callback for turn-level post-processing
205
+ main().catch((error) => {
206
+ console.error(error instanceof Error ? error.message : error);
207
+ process.exitCode = 1;
208
+ });
209
+ ```
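Note: the `ids` expression in the run script above is compact, so a worked example may help. The function below wraps the same expression; `parseScenarioIds` is a hypothetical name used only for illustration. How Karrot treats `undefined` versus an empty array for `ids` is not stated in the README, so the sketch only shows what the expression itself produces.

```ts
// Same expression as in the run script, wrapped in a function for illustration.
function parseScenarioIds(raw: string | undefined): string[] | undefined {
  return raw
    ?.split(',')            // "A, B,," -> ["A", " B", "", ""]
    .map((id) => id.trim()) // strip surrounding whitespace
    .filter(Boolean);       // drop empty entries
}

const parsedIds = parseScenarioIds('BASIC-CHAT-01, RESUME-SESSION-01,,');
const unsetIds = parseScenarioIds(undefined);
const emptyIds = parseScenarioIds('');

console.log(parsedIds); // [ 'BASIC-CHAT-01', 'RESUME-SESSION-01' ]
console.log(unsetIds);  // undefined
console.log(emptyIds);  // []
```

An unset `SCENARIO_IDS` yields `undefined`, while an empty string yields an empty array. The two cases differ, which matters if the runner distinguishes "no filter" from "filter that matches nothing".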
202
210
 
203
- ### Message options
211
+ Run it:
204
212
 
205
- `message` can be:
206
- - a function `(context) => string`
207
- - `aiGen.fromPreviousContext()`
208
- - `aiGen.fromGuidance(guidance)`
209
- - `aiGen.fromContent(content)`
213
+ ```bash
214
+ TEST_ENV=qa npx tsx ./src/run-karrot.ts
215
+ ```
210
216
 
211
- This gives you a few common scenario authoring patterns:
212
- - fixed prompts for deterministic tests
213
- - context-aware prompts that use scenario data
214
- - generated user prompts for more adaptive multi-turn flows
217
+ ## 3. Basic Scenario
215
218
 
216
- Example with assertions and eval on a turn:
219
+ A scenario module must export `scenarioSet`. Export `buildScenarioContext(baseContext)` when the scenario needs typed or derived context.
217
220
 
218
221
  ```ts
219
- import { AiScenarioSet, aiGen, type AiScenario, type BaseAiScenarioContext } from '@huydao/karrot';
222
+ import {
223
+ aiGen,
224
+ AiScenarioSet,
225
+ type AiScenario,
226
+ type BaseAiScenarioContext,
227
+ } from '@huydao/karrot';
228
+
229
+ type DemoContext = BaseAiScenarioContext & {
230
+ appBaseUrl: string;
231
+ };
232
+
233
+ export function buildScenarioContext(baseContext: BaseAiScenarioContext): DemoContext {
234
+ return {
235
+ ...baseContext,
236
+ appBaseUrl: String(baseContext.appBaseUrl ?? ''),
237
+ };
238
+ }
220
239
 
221
- const scenarios: AiScenario<BaseAiScenarioContext>[] = [
240
+ const scenarios: AiScenario<DemoContext>[] = [
222
241
  {
223
- id: 'FOLLOW-UP-1',
224
- name: 'Follow-up prompt generation',
242
+ id: 'BASIC-CHAT-01',
243
+ name: 'Agent answers and suggests next steps',
225
244
  turns: [
226
245
  {
227
- label: 'Ask for next prompts',
246
+ label: 'Ask what the agent can do',
247
+ message: () => 'What can you help me do in this product?',
248
+ assertions: [
249
+ {
250
+ assert: { hasText: 'help' },
251
+ description: 'The response should explain useful capabilities',
252
+ },
253
+ {
254
+ aiAssert: {
255
+ hasContent: 'The answer names at least one concrete task the user can perform.',
256
+ },
257
+ description: 'The response should be actionable',
258
+ },
259
+ ],
260
+ eval: ['correctness', 'helpfulness', 'clarity'],
261
+ },
262
+ {
263
+ label: 'Ask for follow-up prompts',
228
264
  message: aiGen.fromGuidance(
229
- 'Ask for 3 concise follow-up prompts the user can send next based on the previous answer.',
265
+ 'Ask for three short follow-up prompts based on the previous answer.',
230
266
  ),
231
267
  assertions: [
232
- { assert: { hasText: 'prompt' } },
268
+ {
269
+ assert: { toolcall: [] },
270
+ description: 'The answer should not call tools for this simple prompt request',
271
+ },
272
+ {
273
+ aiAssert: {
274
+ hasContent: 'The answer provides three concise follow-up prompt ideas.',
275
+ },
276
+ },
277
+ ],
278
+ eval: [
279
+ 'relevance',
280
+ {
281
+ dimension: 'nextStepQuality',
282
+ guidance: 'Score whether the suggested next prompts are specific and usable.',
283
+ },
233
284
  ],
234
- eval: ['correctness', 'helpfulness', 'relevance'],
235
285
  },
236
286
  ],
237
287
  },
238
288
  ];
239
289
 
240
290
  export const scenarioSet = new AiScenarioSet(scenarios);
241
-
242
- export function buildScenarioContext(baseContext: BaseAiScenarioContext): BaseAiScenarioContext {
243
- return { ...baseContext };
244
- }
245
291
  ```
246
292
 
247
- ## Assertions
293
+ ## Scenario Details
294
+
295
+ ### Scenario elements
296
+
297
+ Each `AiScenario` has:
298
+
299
+ - `id`: stable ID used by CLI filters, CI, and reports
300
+ - `name`: readable scenario name shown in reports
301
+ - `turns`: ordered user messages sent to the same conversation thread
302
+ - `continueOnAssertionFailure`: optional scenario-level flag that keeps later turns running after an assertion failure
248
303
 
249
- Karrot supports two assertion styles.
304
+ Keep scenario IDs stable. Reports and CI filters depend on them.
250
305
 
251
- Use assertions for pass/fail requirements. If a turn must contain or avoid something specific, assertions are the right tool.
306
+ ### Turn elements
252
307
 
253
- Direct assertions:
308
+ Each turn has:
309
+
310
+ - `label`: readable turn name shown in reports
311
+ - `message`: user input to send to the agent
312
+ - `idleTimeoutMs`: optional timeout for waiting on assistant activity
313
+ - `processTimeoutMs`: optional hard timeout for the turn
314
+ - `assertions`: pass/fail checks for required behavior
315
+ - `eval`: quality scoring dimensions for the assistant response
316
+ - `continueOnAssertionFailure`: optional turn-level override
317
+ - `onComplete`: optional callback after the turn returns output
318
+
319
+ `message` can be:
320
+
321
+ - `(context) => string`: deterministic message with access to scenario context
322
+ - `aiGen.fromPreviousContext()`: generate the next user message from conversation context
323
+ - `aiGen.fromGuidance(guidance)`: generate a user message from instructions
324
+ - `aiGen.fromContent(content)`: generate a user message from supplied source content
325
+
326
+ Use deterministic messages for regression tests. Use `aiGen` when the next user turn should adapt to the previous answer.
327
+
328
+ ## Turn Assertions
329
+
330
+ Assertions decide whether a turn passes or fails. Use them for required behavior.
331
+
332
+ ### `assert`
333
+
334
+ Use `assert` when the expected result can be checked directly.
254
335
 
255
336
  ```ts
256
337
  assertions: [
257
- { assert: { hasText: 'Katalon AI' } },
338
+ { assert: { hasText: 'created successfully' } },
339
+ { assert: { toolcall: ['create_test_case'] } },
258
340
  { assert: { toolcall: [] } },
341
+ {
342
+ assert: {
343
+ toolcallWithContent: {
344
+ name: 'create_test_case',
345
+ hasText: ['login', 'password'],
346
+ hasProperties: {
347
+ priority: 'High',
348
+ },
349
+ },
350
+ },
351
+ },
259
352
  ]
260
353
  ```
261
354
 
262
- AI assertions:
355
+ Supported direct assertions:
356
+
357
+ - `assert.hasText`: response text contains the expected string
358
+ - `assert.toolcall`: exact expected tool-call names; use `[]` when no tool calls should happen
359
+ - `assert.toolcallWithContent`: a named tool call exists and contains expected text or structured properties
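Note: the meaning of `assert.hasText` and `assert.toolcall` listed above can be illustrated with a small sketch. This is not Karrot's code. `TurnOutput`, `checkHasText`, and `checkToolcall` are hypothetical names, and two behaviors are assumed rather than documented: text matching is treated as a case-sensitive substring check, and "exact" tool-call matching is treated as order-sensitive.

```ts
// Hypothetical checks mirroring the documented meaning of two direct assertions.
// Assumptions: matching is case-sensitive, and tool-call order matters.
interface TurnOutput {
  text: string;
  toolCalls: string[]; // names of the tools the agent called, in order
}

function checkHasText(output: TurnOutput, expected: string): boolean {
  return output.text.includes(expected);
}

function checkToolcall(output: TurnOutput, expected: string[]): boolean {
  // "Exact" here means the same length and the same name at every position.
  return (
    output.toolCalls.length === expected.length &&
    output.toolCalls.every((name, index) => name === expected[index])
  );
}

const sampleOutput: TurnOutput = {
  text: 'Test case created successfully.',
  toolCalls: ['create_test_case'],
};

console.log(checkHasText(sampleOutput, 'created successfully')); // true
console.log(checkToolcall(sampleOutput, ['create_test_case']));  // true
console.log(checkToolcall(sampleOutput, []));                    // false
```

The last call shows why `toolcall: []` works as a "no tools were called" check: any recorded tool call makes the comparison fail.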
360
+
361
+ ### `aiAssert`
362
+
363
+ Use `aiAssert` when the requirement is semantic and exact string matching would be brittle.
263
364
 
264
365
  ```ts
265
366
  assertions: [
266
- { aiAssert: { hasContent: 'The answer explains what Katalon AI can do.' } },
267
- { aiAssert: { notHasContent: 'The answer invents unsupported product features.' } },
367
+ {
368
+ aiAssert: {
369
+ hasContent: 'The answer explains the next action and why it is needed.',
370
+ },
371
+ },
372
+ {
373
+ aiAssert: {
374
+ notHasContent: 'The answer invents product capabilities not present in the prompt.',
375
+ },
376
+ },
268
377
  ]
269
378
  ```
270
379
 
271
- Assertion guidance:
272
- - Use direct assertions when the expected output is deterministic enough to check literally.
273
- - Use AI assertions when the requirement is semantic and cannot be captured safely with exact string matching.
274
- - Use assertions to decide whether the turn satisfied a contract, not to measure answer quality.
380
+ Supported AI assertions:
381
+
382
+ - `aiAssert.hasContent`: semantic requirement must be present
383
+ - `aiAssert.notHasContent`: semantic problem must be absent
384
+
385
+ `aiAssert` requires `OPENAI_API_KEY`.
275
386
 
276
- ## Evaluations
387
+ ## Eval
277
388
 
278
- Turn evals score the assistant response for named dimensions.
279
- Karrot applies a CheckEval-inspired evaluation rubric: broad dimensions are decomposed into concrete checklist-style checks before assigning a final score, which improves consistency and makes explanations more traceable.
389
+ Eval is separate from assertions.
390
+
391
+ - Assertion: pass/fail requirement, for example "must call `create_test_case`"
392
+ - Eval: quality score, for example "how helpful and complete was the answer"
280
393
 
281
394
  ```ts
282
395
  eval: ['correctness', 'coverage', 'helpfulness']
283
396
  ```
284
397
 
285
- Custom dimensions are also supported:
398
+ You can add inline guidance:
286
399
 
287
400
  ```ts
288
401
  eval: [
289
402
  'correctness',
290
403
  {
291
404
  dimension: 'productFit',
292
- guidance: 'Judge whether the answer is specifically useful for a Katalon AI user.',
405
+ guidance: 'Score whether the answer is specifically useful for users of this product.',
293
406
  },
294
407
  ]
295
408
  ```
296
409
 
297
- Use eval when you want a quality score rather than a hard pass/fail rule.
410
+ Common dimensions:
298
411
 
299
- Built-in dimensions commonly used by Karrot:
300
412
  - `correctness`
301
413
  - `coverage`
302
414
  - `helpfulness`
@@ -309,99 +421,240 @@ Built-in dimensions commonly used by Karrot:
309
421
  - `consistency`
310
422
  - `safety`
311
423
 
312
- Project-level eval prompts can be configured through:
313
- - `evaluation.systemPromptPath`
314
- - `evaluation.promptDirectory`
424
+ Project-level eval prompts can live in a directory:
315
425
 
316
- That lets the project define rubric files without repeating inline guidance in every scenario.
426
+ ```yml
427
+ evaluation:
428
+ promptDirectory: ./prompts/eval
429
+ ```
317
430
 
318
- Use:
319
- - `systemPromptPath` when you want to replace the whole turn-eval rubric
320
- - `promptDirectory` when you want to add custom project-specific dimensions
431
+ Then create files such as:
321
432
 
322
- Eval guidance:
323
- - Use assertions for required behavior.
324
- - Use eval for quality measurement across dimensions.
325
- - Prefer a small number of dimensions that reflect the goal of the turn.
326
- - Because Karrot applies CheckEval-style scoring, dimensions like `relevance` and `consistency` are judged through concrete sub-checks instead of a vague overall impression.
433
+ ```text
434
+ prompts/eval/product-fit.md
435
+ prompts/eval/next-step-quality.md
436
+ ```
327
437
 
328
- ## AI-Generated User Messages
438
+ Scenario authors can then use only the dimension names:
329
439
 
330
- Karrot can generate a user turn message before sending it to the target assistant.
440
+ ```ts
441
+ eval: ['correctness', 'productFit', 'nextStepQuality']
442
+ ```
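Note: the file names and dimension names above suggest a kebab-case to camelCase correspondence (`product-fit.md` and `productFit`). The README does not state how Karrot performs this mapping, so the sketch below is an assumption; `dimensionFromPromptFile` is a hypothetical helper.

```ts
// Hypothetical helper: derives a camelCase dimension name from a prompt file name.
function dimensionFromPromptFile(fileName: string): string {
  // Drop a trailing ".md" extension.
  const base = fileName.replace(/\.md$/, '');
  // Turn each "-x" into "X" (kebab-case to camelCase).
  return base.replace(/-([a-z0-9])/g, (_match: string, char: string) => char.toUpperCase());
}

const promptDimensions = ['product-fit.md', 'next-step-quality.md'].map(dimensionFromPromptFile);

console.log(promptDimensions); // [ 'productFit', 'nextStepQuality' ]
```

If the names used in `eval` do not line up with the prompt files in your project, check GUIDE.md for the actual naming rule before relying on this convention.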
331
443
 
332
- Available helpers:
333
- - `aiGen.fromPreviousContext()`
334
- - `aiGen.fromGuidance(guidance)`
335
- - `aiGen.fromContent(content)`
444
+ Use `evaluation.systemPromptPath` only when you need to replace the full turn-eval rubric.
336
445
 
337
- Example:
446
+ ## Config Reference
338
447
 
339
- ```ts
340
- import { aiGen } from '@huydao/karrot';
448
+ Top-level config keys:
341
449
 
342
- message: aiGen.fromGuidance(
343
- 'Ask for 3 concise follow-up prompts the user can send next based on the previous answer.',
344
- )
345
- ```
450
+ - `version`: currently `1`
451
+ - `transport`: agent transport, currently `ag-ui-wss` or `ag-ui-post`
452
+ - `artifacts.directory`: output directory for raw events and reports
453
+ - `execution.stopOnFailure`: stop remaining scenarios after a failure
454
+ - `execution.concurrency`: number of scenarios to run in parallel
455
+ - `context`: values available to `buildScenarioContext`
456
+ - `evaluation.systemPromptPath`: full eval prompt override
457
+ - `evaluation.promptDirectory`: additional project-specific eval dimension prompts
458
+ - `report.enabled`: set `false` to skip reports
459
+ - `report.environment`, `report.projectName`, `report.runtime`: metadata written to reports
460
+ - `report.scenarioContext`: extra metadata written to reports
346
461
 
347
- This requires `OPENAI_API_KEY`.
462
+ ## Reports And Artifacts
348
463
 
349
- ## Config Overview
464
+ Each run creates an artifact directory under `artifacts/<timestamp>` or the configured `artifacts.directory`.
350
465
 
351
- Karrot config currently supports:
352
- - `transport`
353
- - `artifacts.directory`
354
- - `execution.stopOnFailure`
355
- - `evaluation.systemPromptPath`
356
- - `evaluation.promptDirectory`
357
- - `context`
358
- - `report`
466
+ Typical outputs:
359
467
 
360
- Important design choice:
361
- - config and scenario are separate
362
- - one transport config can be reused across many scenario files
468
+ - raw transport logs, such as `.jsonl` or `.sse`
469
+ - generated-message traces
470
+ - AI assertion traces
471
+ - JSON report
472
+ - HTML report
363
473
 
364
- ## Reports and Artifacts
474
+ ## How To Add Karrot To An Existing Playwright Framework
365
475
 
366
- Each `execute()` run creates:
367
- - a run artifact directory under `artifacts/<timestamp>`
368
- - raw transport logs such as `.jsonl` or `.sse`
369
- - a JSON run report
370
- - an HTML run report
476
+ 1. Install `@huydao/karrot` in the Playwright project.
477
+ 2. Add a config file, for example `scripts/ai-full-flow.karrot.yml`.
478
+ 3. Add a runner script, for example `scripts/run-ai-full-flow.ts`.
479
+ 4. Put scenario files under a stable folder, for example `data/ai-scenarios`.
480
+ 5. Reuse Playwright auth helpers to discover runtime values, then pass those values to `execute()`.
481
+ 6. Add npm scripts that run the Karrot runner with `tsx`.
482
+ 7. Store reports under a predictable artifact directory for CI upload.
371
483
 
372
- ## Environment Variables
484
+ Minimal `package.json` script:
373
485
 
374
- Common variables:
375
- - `OPENAI_API_KEY`
376
- - `OPENAI_BASE_URL`
377
- - `OPENAI_EVAL_MODEL`
378
- - `OPENAI_MESSAGE_GEN_MODEL`
486
+ ```json
487
+ {
488
+ "scripts": {
489
+ "ai:full-flow": "tsx ./scripts/run-ai-full-flow.ts"
490
+ }
491
+ }
492
+ ```
379
493
 
380
- Transport-specific variables depend on the integration project.
494
+ Example command:
381
495
 
382
- ## Package Structure
496
+ ```bash
497
+ TEST_ENV=qa npm run ai:full-flow -- --scenario-file data/ai-scenarios/basic.ts
498
+ ```
383
499
 
384
- - `assertions/`: direct assertions and turn evaluation
385
- - `executors/`: transport runners and scenario execution
386
- - `reports/`: JSON and HTML reporting
387
- - `scenarios/`: scenario types, loaders, generated-message helpers
388
- - `utils/`: config loading, artifacts, OpenAI helpers
389
- - `prompts/`: built-in prompt files used by the package
500
+ ### Using Karrot inside a Playwright test
390
501
 
391
- ## AI-Friendly Guide
502
+ Use `execute()` when the scenario file owns the whole conversation. Karrot automatically starts a thread on the first turn and reuses that thread for the following turns in the same scenario.
392
503
 
393
- For a fuller operational guide intended for both humans and AI agents, read [GUIDE.md](./GUIDE.md).
504
+ ```ts
505
+ import path from 'node:path';
506
+ import { test, expect } from '@playwright/test';
507
+ import { execute, getScenarioRunStatus } from '@huydao/karrot';
508
+
509
+ test('agent completes the basic flow', async () => {
510
+ const execution = await execute(path.resolve(__dirname, '../karrot.config.yml'), {
511
+ variables: {
512
+ TEST_ENV: process.env.TEST_ENV ?? 'qa',
513
+ PROJECT_NAME: 'Demo Agent',
514
+ APP_BASE_URL: process.env.APP_BASE_URL,
515
+ AGENT_URL: process.env.AGENT_URL,
516
+ AGENT_ID: process.env.AGENT_ID,
517
+ WS_URL: process.env.WS_URL,
518
+ WS_TOPIC: process.env.WS_TOPIC,
519
+ AUTH_TOKEN: process.env.AUTH_TOKEN,
520
+ ACCOUNT_ID: process.env.ACCOUNT_ID,
521
+ PROJECT_ID: process.env.PROJECT_ID,
522
+ },
523
+ scenario: {
524
+ file: path.resolve(__dirname, '../src/scenarios/basic.ts'),
525
+ ids: ['BASIC-CHAT-01'],
526
+ },
527
+ });
528
+
529
+ expect(getScenarioRunStatus(execution.results)).toBe('PASS');
530
+ });
531
+ ```
394
532
 
395
- ## Build
533
+ Use `runScenario()` when the Playwright test itself needs to create an agent session or resume an existing one. To resume, pass the known thread/conversation ID as `initialThreadId`. Keep `concurrency: 1`; a single existing session cannot be shared safely across parallel scenarios.
396
534
 
397
- ```bash
398
- cd karrot
399
- npx tsc -p tsconfig.json
535
+ ```ts
536
+ import { test, expect } from '@playwright/test';
537
+ import {
538
+ AiScenarioSet,
539
+ createRunArtifactDirectory,
540
+ runScenario,
541
+ type AiScenario,
542
+ type BaseAiScenarioContext,
543
+ } from '@huydao/karrot';
544
+ import { runAgUiMessage } from '@huydao/karrot/adapters/ag-ui';
545
+
546
+ type SessionContext = BaseAiScenarioContext & {
547
+ projectId: string;
548
+ };
549
+
550
+ const scenarios: AiScenario<SessionContext>[] = [
551
+ {
552
+ id: 'RESUME-SESSION-01',
553
+ name: 'Continue an existing assistant session',
554
+ turns: [
555
+ {
556
+ label: 'Recall current context',
557
+ message: ({ projectId }) =>
558
+ `Continue in project ${projectId}. Summarize what we have already discussed and suggest the next action.`,
559
+ assertions: [
560
+ {
561
+ aiAssert: {
562
+ hasContent: 'The answer uses the existing conversation context.',
563
+ },
564
+ },
565
+ ],
566
+ eval: ['relevance', 'helpfulness'],
567
+ },
568
+ ],
569
+ },
570
+ ];
571
+
572
+ test('continues an existing agent session', async () => {
573
+ const initialThreadId = process.env.KARROT_THREAD_ID;
574
+
575
+ if (!initialThreadId) {
576
+ throw new Error('KARROT_THREAD_ID is required to resume an existing session.');
577
+ }
578
+
579
+ const outputDirectory = await createRunArtifactDirectory('./artifacts/karrot-playwright');
580
+ const env = {
581
+ ...process.env,
582
+ AGENT_URL: process.env.AGENT_URL ?? '',
583
+ AGENT_ID: process.env.AGENT_ID ?? '',
584
+ WS_URL: process.env.WS_URL ?? '',
585
+ WS_TOPIC: process.env.WS_TOPIC ?? '',
586
+ AUTH_TOKEN: process.env.AUTH_TOKEN ?? '',
587
+ WS_STOMP_HEADERS: `Authorization:${process.env.AUTH_TOKEN ?? ''}`,
588
+ WS_HEADERS: `Origin:${process.env.APP_BASE_URL ?? ''},User-Agent:Mozilla/5.0`,
589
+ };
590
+
591
+ const scenarioSet = new AiScenarioSet(scenarios);
592
+ const [result] = await runScenario(scenarioSet.select(['RESUME-SESSION-01']), {
593
+ context: {
594
+ projectId: process.env.PROJECT_ID ?? '',
595
+ },
596
+ env,
597
+ outputDirectory,
598
+ initialThreadId,
599
+ concurrency: 1,
600
+ messageRunner: async ({ message, outputDirectory, threadId, processTimeoutMs }) =>
601
+ await runAgUiMessage({
602
+ message,
603
+ env,
604
+ outputDirectory,
605
+ threadId,
606
+ processTimeoutMs,
607
+ }),
608
+ });
609
+
610
+ expect(result.status).toBe('PASS');
611
+ expect(result.threadId).toBe(initialThreadId);
612
+ });
613
+ ```
614
+
615
+ The `threadId` returned in `execution.results[*].threadId` or `result.threadId` is the value to store when a later Playwright test needs to continue the same assistant session.
616
+
617
+ For an init-then-recall flow in the same Playwright script, run once without `initialThreadId`, then pass the returned thread into the next `runScenario()` call:
618
+
619
+ ```ts
620
+ const [createdSession] = await runScenario(initialScenarioSet.select(['INIT-SESSION-01']), {
621
+ context,
622
+ env,
623
+ outputDirectory,
624
+ concurrency: 1,
625
+ messageRunner,
626
+ });
627
+
628
+ const threadId = createdSession.threadId;
629
+
630
+ if (!threadId) {
631
+ throw new Error('Initial scenario did not return a threadId.');
632
+ }
633
+
634
+ const [continuedSession] = await runScenario(recallScenarioSet.select(['RECALL-SESSION-01']), {
635
+ context,
636
+ env,
637
+ outputDirectory,
638
+ initialThreadId: threadId,
639
+ concurrency: 1,
640
+ messageRunner,
641
+ });
400
642
  ```
401
643
 
402
- ## Publish
644
+ ## Package Structure
645
+
646
+ - `assertions/`: direct assertions and AI assertions
647
+ - `executors/`: scenario execution and transport runners
648
+ - `reports/`: JSON and HTML reporting
649
+ - `scenarios/`: scenario types, loaders, and generated-message helpers
650
+ - `utils/`: config loading, variable resolution, artifacts, and OpenAI helpers
651
+ - `prompts/`: built-in prompts used by the package
652
+
653
+ ## Build
403
654
 
404
655
  ```bash
405
656
  cd karrot
406
- npm publish
657
+ npm run build
407
658
  ```
659
+
660
+ For a fuller operational reference, read [GUIDE.md](./GUIDE.md).