@sqaitech/mcp 0.30.10

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/dist/api.mdx ADDED
@@ -0,0 +1,1167 @@
1
+ # API reference
2
+
3
+ > In the documentation below, you might see function calls prefixed with `agent.`. If you use destructuring in Playwright (e.g., `async ({ ai, aiQuery }) => { /* ... */ }`), you can call these functions without the `agent.` prefix. This is merely a syntactic difference.
4
+
5
+ ## Constructors
6
+
7
+ Each Agent in Midscene has its own constructor.
8
+
9
+ - In Puppeteer, use [PuppeteerAgent](./integrate-with-puppeteer)
10
+ - In Bridge Mode, use [AgentOverChromeBridge](./bridge-mode-by-chrome-extension#constructor)
11
+ - In Android, use [AndroidAgent](./integrate-with-android)
12
+ - For a GUI Agent that integrates with your own interface, refer to [Integrate with any interface](./integrate-with-any-interface)
13
+
14
+ These Agents share some common constructor parameters:
15
+
16
+ - `generateReport: boolean`: If true, a report file will be generated. (Default: true)
17
+ - `reportFileName: string`: The name of the report file. (Default: generated by midscene)
18
+ - `autoPrintReportMsg: boolean`: If true, report messages will be printed. (Default: true)
19
+ - `cacheId: string | undefined`: If provided, this cacheId will be used to save or match the cache. (Default: undefined, means cache feature is disabled)
20
+ - `actionContext: string`: Some background knowledge that should be sent to the AI model when calling `agent.aiAction()`, like 'close the cookie consent dialog first if it exists' (Default: undefined)
21
+ - `onTaskStartTip: (tip: string) => void | Promise<void>`: Optional hook that fires before each execution task begins, receiving a human-readable summary of the task (Default: undefined)
22
+
23
+ In Playwright and Puppeteer, there are some common parameters:
24
+
25
+ - `forceSameTabNavigation: boolean`: If true, page navigation is restricted to the current tab. (Default: true)
26
+ - `waitForNavigationTimeout: number`: The timeout for waiting for navigation to finish. (Default: 5000ms; set to 0 to skip waiting for navigation)
27
+
28
+ In Puppeteer, there is also a parameter:
29
+
30
+ - `waitForNetworkIdleTimeout: number`: The timeout for waiting for the network to become idle between actions. (Default: 2000ms; set to 0 to skip waiting for network idle)
31
+
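+ For example, a minimal sketch of passing these options (assuming `PuppeteerAgent` is exported from `@sqai/web/puppeteer`, the same entry point as the `overrideAIConfig` example below):
+
+ ```typescript
+ import puppeteer from 'puppeteer';
+ import { PuppeteerAgent } from '@sqai/web/puppeteer';
+
+ const browser = await puppeteer.launch();
+ const page = await browser.newPage();
+ await page.goto('https://example.com');
+
+ const agent = new PuppeteerAgent(page, {
+   cacheId: 'login-flow', // reuse the cache across runs
+   actionContext: 'Close the cookie consent dialog first if it exists',
+   onTaskStartTip: (tip) => console.log(`[midscene] ${tip}`),
+   forceSameTabNavigation: true, // keep navigation in the current tab
+   waitForNetworkIdleTimeout: 2000, // Puppeteer only
+ });
+ ```
+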
32
+ ## Interaction methods
33
+
34
+ Below are the main APIs available for the various Agents in Midscene.
35
+
36
+ :::info Auto Planning vs. Instant Action
37
+
38
+ In Midscene, you can choose to use either auto planning or instant action.
39
+
40
+ - `agent.ai()` is for Auto Planning: Midscene automatically plans the steps and executes them. It is smarter and closer to the style of modern AI agents, but it may be slower and relies heavily on the quality of the AI model.
41
+ - `agent.aiTap()`, `agent.aiHover()`, `agent.aiInput()`, `agent.aiKeyboardPress()`, `agent.aiScroll()`, `agent.aiDoubleClick()`, `agent.aiRightClick()` are for Instant Action: Midscene will directly perform the specified action, while the AI model is responsible for basic tasks such as locating elements. It's faster and more reliable if you are certain about the action you want to perform.
42
+
43
+ :::
44
+
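+ For example, the same flow can be written in either style (a sketch reusing the search-box example from the sections below):
+
+ ```typescript
+ // Auto Planning: one instruction, Midscene plans and executes the steps
+ await agent.ai(
+   'Type "JavaScript" into the search box, then click the search button',
+ );
+
+ // Instant Actions: you specify each step; the AI model only locates the elements
+ await agent.aiInput('JavaScript', 'The search box');
+ await agent.aiTap('The search button');
+ ```
+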
45
+ ### `agent.aiAction()` or `.ai()`
46
+
47
+ This method allows you to perform a series of UI actions described in natural language. Midscene automatically plans the steps and executes them.
48
+
49
+ - Type
50
+
51
+ ```typescript
52
+ function aiAction(
53
+ prompt: string,
54
+ options?: {
55
+ cacheable?: boolean;
56
+ },
57
+ ): Promise<void>;
58
+ function ai(prompt: string): Promise<void>; // shorthand form
59
+ ```
60
+
61
+ - Parameters:
62
+
63
+ - `prompt: string` - A natural language description of the UI steps.
64
+ - `options?: Object` - Optional, a configuration object containing:
65
+ - `cacheable?: boolean` - Whether the result is cacheable when the [caching feature](./caching.mdx) is enabled. True by default.
66
+
67
+ - Return Value:
68
+
69
+ - Returns a Promise that resolves to void when all steps are completed; if execution fails, an error is thrown.
70
+
71
+ - Examples:
72
+
73
+ ```typescript
74
+ // Basic usage
75
+ await agent.aiAction(
76
+ 'Type "JavaScript" into the search box, then click the search button',
77
+ );
78
+
79
+ // Using the shorthand .ai form
80
+ await agent.ai(
81
+ 'Click the login button at the top of the page, then enter "test@example.com" in the username field',
82
+ );
83
+
84
+ // When using UI Agent models like ui-tars, you can try a more goal-driven prompt
85
+ await agent.aiAction('Post a Tweet "Hello World"');
86
+ ```
87
+
88
+ :::tip
89
+
90
+ Under the hood, Midscene uses the AI model to split the instruction into a series of steps (a.k.a. "Planning"). It then executes these steps sequentially. If Midscene determines that the actions cannot be performed, an error will be thrown.
91
+
92
+ For optimal results, please provide clear and detailed instructions for `agent.aiAction()`. For guides about writing prompts, you may read this doc: [Tips for Writing Prompts](./prompting-tips).
93
+
94
+ Related Documentation:
95
+
96
+ - [Choose a model](./choose-a-model)
97
+
98
+ :::
99
+
100
+ ### `agent.aiTap()`
101
+
102
+ Tap something.
103
+
104
+ - Type
105
+
106
+ ```typescript
107
+ function aiTap(locate: string | Object, options?: Object): Promise<void>;
108
+ ```
109
+
110
+ - Parameters:
111
+
112
+ - `locate: string | Object` - A natural language description of the element to tap, or [prompting with images](#prompting-with-images).
113
+ - `options?: Object` - Optional, a configuration object containing:
114
+ - `deepThink?: boolean` - If true, Midscene will call the AI model twice to precisely locate the element. False by default.
115
+ - `xpath?: string` - The xpath of the element to operate on. If provided, Midscene will use this xpath to locate the element before falling back to the cache and the AI model. Empty by default.
116
+ - `cacheable?: boolean` - Whether the result is cacheable when the [caching feature](./caching.mdx) is enabled. True by default.
117
+
118
+ - Return Value:
119
+
120
+ - Returns a `Promise<void>`
121
+
122
+ - Examples:
123
+
124
+ ```typescript
125
+ await agent.aiTap('The login button at the top of the page');
126
+
127
+ // Use deepThink feature to precisely locate the element
128
+ await agent.aiTap('The login button at the top of the page', {
129
+ deepThink: true,
130
+ });
131
+ ```
132
+
133
+ ### `agent.aiHover()`
134
+
135
+ > Only available in web pages, not available in Android.
136
+
137
+ Move mouse over something.
138
+
139
+ - Type
140
+
141
+ ```typescript
142
+ function aiHover(locate: string | Object, options?: Object): Promise<void>;
143
+ ```
144
+
145
+ - Parameters:
146
+
147
+ - `locate: string | Object` - A natural language description of the element to hover over, or [prompting with images](#prompting-with-images).
148
+ - `options?: Object` - Optional, a configuration object containing:
149
+ - `deepThink?: boolean` - If true, Midscene will call the AI model twice to precisely locate the element. False by default.
150
+ - `xpath?: string` - The xpath of the element to operate on. If provided, Midscene will use this xpath to locate the element before falling back to the cache and the AI model. Empty by default.
151
+ - `cacheable?: boolean` - Whether the result is cacheable when the [caching feature](./caching.mdx) is enabled. True by default.
152
+
153
+ - Return Value:
154
+
155
+ - Returns a `Promise<void>`
156
+
157
+ - Examples:
158
+
159
+ ```typescript
160
+ await agent.aiHover('The version number of the current page');
161
+ ```
162
+
163
+ ### `agent.aiInput()`
164
+
165
+ Input text into something.
166
+
167
+ - Type
168
+
169
+ ```typescript
170
+ function aiInput(
171
+ text: string,
172
+ locate: string | Object,
173
+ options?: Object,
174
+ ): Promise<void>;
175
+ ```
176
+
177
+ - Parameters:
178
+
179
+ - `text: string` - The text content to input.
180
+ - When `mode` is `'replace'`: The text will replace all existing content in the input field.
181
+ - When `mode` is `'append'`: The text will be appended to the existing content.
182
+ - When `mode` is `'clear'`: The text is ignored and the input field will be cleared.
183
+ - `locate: string | Object` - A natural language description of the element to input text into, or [prompting with images](#prompting-with-images).
184
+ - `options?: Object` - Optional, a configuration object containing:
185
+ - `deepThink?: boolean` - If true, Midscene will call the AI model twice to precisely locate the element. False by default.
186
+ - `xpath?: string` - The xpath of the element to operate on. If provided, Midscene will use this xpath to locate the element before falling back to the cache and the AI model. Empty by default.
187
+ - `cacheable?: boolean` - Whether the result is cacheable when the [caching feature](./caching.mdx) is enabled. True by default.
188
+ - `autoDismissKeyboard?: boolean` - If true, the keyboard will be dismissed after input text, only available in Android. (Default: true)
189
+ - `mode?: 'replace' | 'clear' | 'append'` - Input mode. (Default: 'replace')
190
+ - `'replace'`: Clear the input field first, then input the text.
191
+ - `'append'`: Append the text to existing content without clearing.
192
+ - `'clear'`: Clear the input field without entering new text.
193
+
194
+ - Return Value:
195
+
196
+ - Returns a `Promise<void>`
197
+
198
+ - Examples:
199
+
200
+ ```typescript
201
+ await agent.aiInput('Hello World', 'The search input box');
202
+ ```
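+
+ The `mode` option controls how the new text interacts with existing content, for example:
+
+ ```typescript
+ // Append to the existing content instead of replacing it
+ await agent.aiInput(' and TypeScript', 'The search input box', {
+   mode: 'append',
+ });
+
+ // Clear the field without typing anything (the text argument is ignored)
+ await agent.aiInput('', 'The search input box', { mode: 'clear' });
+ ```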
203
+
204
+ ### `agent.aiKeyboardPress()`
205
+
206
+ Press a keyboard key.
207
+
208
+ - Type
209
+
210
+ ```typescript
211
+ function aiKeyboardPress(
212
+ key: string,
213
+ locate?: string | Object,
214
+ options?: Object,
215
+ ): Promise<void>;
216
+ ```
217
+
218
+ - Parameters:
219
+
220
+ - `key: string` - The web key to press, e.g. 'Enter', 'Tab', 'Escape', etc. Key combinations are not supported.
221
+ - `locate?: string | Object` - Optional, a natural language description of the element to press the key on, or [prompting with images](#prompting-with-images).
222
+ - `options?: Object` - Optional, a configuration object containing:
223
+ - `deepThink?: boolean` - If true, Midscene will call the AI model twice to precisely locate the element. False by default.
224
+ - `xpath?: string` - The xpath of the element to operate on. If provided, Midscene will use this xpath to locate the element before falling back to the cache and the AI model. Empty by default.
225
+ - `cacheable?: boolean` - Whether the result is cacheable when the [caching feature](./caching.mdx) is enabled. True by default.
226
+
227
+ - Return Value:
228
+
229
+ - Returns a `Promise<void>`
230
+
231
+ - Examples:
232
+
233
+ ```typescript
234
+ await agent.aiKeyboardPress('Enter', 'The search input box');
235
+ ```
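+
+ Since `locate` is optional, you can also press a key without targeting a specific element:
+
+ ```typescript
+ await agent.aiKeyboardPress('Escape');
+ ```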
236
+
237
+ ### `agent.aiScroll()`
238
+
239
+ Scroll a page or an element.
240
+
241
+ - Type
242
+
243
+ ```typescript
244
+ function aiScroll(
245
+ scrollParam: PlanningActionParamScroll,
246
+ locate?: string | Object,
247
+ options?: Object,
248
+ ): Promise<void>;
249
+ ```
250
+
251
+ - Parameters:
252
+
253
+ - `scrollParam: PlanningActionParamScroll` - The scroll parameter
254
+ - `direction: 'up' | 'down' | 'left' | 'right'` - The direction to scroll. On both Android and Web, the direction refers to where the page's content will move into view. For example, when scrolling `down`, the hidden content at the bottom of the page gradually reveals itself from the bottom of the screen upwards.
255
+ - `scrollType: 'once' | 'untilBottom' | 'untilTop' | 'untilRight' | 'untilLeft'` - Optional, the type of scroll to perform.
256
+ - `distance: number` - Optional, the distance to scroll in px.
257
+ - `locate?: string | Object` - Optional, a natural language description of the element to scroll on, or [prompting with images](#prompting-with-images). If not provided, Midscene will scroll at the current mouse position.
258
+ - `options?: Object` - Optional, a configuration object containing:
259
+ - `deepThink?: boolean` - If true, Midscene will call the AI model twice to precisely locate the element. False by default.
260
+ - `xpath?: string` - The xpath of the element to operate on. If provided, Midscene will use this xpath to locate the element before falling back to the cache and the AI model. Empty by default.
261
+ - `cacheable?: boolean` - Whether the result is cacheable when the [caching feature](./caching.mdx) is enabled. True by default.
262
+
263
+ - Return Value:
264
+
265
+ - Returns a `Promise<void>`
266
+
267
+ - Examples:
268
+
269
+ ```typescript
270
+ await agent.aiScroll(
271
+ { direction: 'up', distance: 100, scrollType: 'once' },
272
+ 'The form panel',
273
+ );
274
+ ```
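+
+ Since `locate` is optional, you can also scroll at the current mouse position, e.g. all the way to the bottom of the page:
+
+ ```typescript
+ await agent.aiScroll({ direction: 'down', scrollType: 'untilBottom' });
+ ```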
275
+
276
+ ### `agent.aiDoubleClick()`
277
+
278
+ Double-click on an element.
279
+
280
+ - Type
281
+
282
+ ```typescript
283
+ function aiDoubleClick(locate: string | Object, options?: Object): Promise<void>;
284
+ ```
285
+
286
+ - Parameters:
287
+
288
+ - `locate: string | Object` - A natural language description of the element to double-click on, or [prompting with images](#prompting-with-images).
289
+ - `options?: Object` - Optional, a configuration object containing:
290
+ - `deepThink?: boolean` - If true, Midscene will call the AI model twice to precisely locate the element. False by default.
291
+ - `xpath?: string` - The xpath of the element to operate on. If provided, Midscene will use this xpath to locate the element before falling back to the cache and the AI model. Empty by default.
292
+ - `cacheable?: boolean` - Whether the result is cacheable when the [caching feature](./caching.mdx) is enabled. True by default.
293
+
294
+ - Return Value:
295
+
296
+ - Returns a `Promise<void>`
297
+
298
+ - Examples:
299
+
300
+ ```typescript
301
+ await agent.aiDoubleClick('The file name at the top of the page');
302
+
303
+ // Use deepThink feature to precisely locate the element
304
+ await agent.aiDoubleClick('The file name at the top of the page', {
305
+ deepThink: true,
306
+ });
307
+ ```
308
+
309
+ ### `agent.aiRightClick()`
310
+
311
+ > Only available in web pages, not available in Android.
312
+
313
+ Right-click on an element. Note that Midscene cannot interact with the browser's native context menu after right-clicking. This method is typically used on elements that listen for the right-click event themselves.
314
+
315
+ - Type
316
+
317
+ ```typescript
318
+ function aiRightClick(locate: string | Object, options?: Object): Promise<void>;
319
+ ```
320
+
321
+ - Parameters:
322
+
323
+ - `locate: string | Object` - A natural language description of the element to right-click on, or [prompting with images](#prompting-with-images).
324
+ - `options?: Object` - Optional, a configuration object containing:
325
+ - `deepThink?: boolean` - If true, Midscene will call the AI model twice to precisely locate the element. False by default.
326
+ - `xpath?: string` - The xpath of the element to operate on. If provided, Midscene will use this xpath to locate the element before falling back to the cache and the AI model. Empty by default.
327
+ - `cacheable?: boolean` - Whether the result is cacheable when the [caching feature](./caching.mdx) is enabled. True by default.
328
+
329
+ - Return Value:
330
+
331
+ - Returns a `Promise<void>`
332
+
333
+ - Examples:
334
+
335
+ ```typescript
336
+ await agent.aiRightClick('The file name at the top of the page');
337
+
338
+ // Use deepThink feature to precisely locate the element
339
+ await agent.aiRightClick('The file name at the top of the page', {
340
+ deepThink: true,
341
+ });
342
+ ```
343
+
344
+ :::tip About the `deepThink` feature
345
+
346
+ The `deepThink` option allows Midscene to call the AI model twice to precisely locate an element (disabled by default). It is useful when the AI model finds it hard to distinguish the element from its surroundings.
347
+
348
+ :::
349
+
350
+ ## Data extraction
351
+
352
+ ### `agent.aiAsk()`
353
+
354
+ Ask the AI model any question about the current page. It returns the AI model's answer as a string.
355
+
356
+ - Type
357
+
358
+ ```typescript
359
+ function aiAsk(prompt: string | Object, options?: Object): Promise<string>;
360
+ ```
361
+
362
+ - Parameters:
363
+
364
+ - `prompt: string | Object` - A natural language description of the question, or [prompting with images](#prompting-with-images).
365
+ - `options?: Object` - Optional, a configuration object containing:
366
+ - `domIncluded?: boolean | 'visible-only'` - Whether to send simplified DOM information to the model, usually used for extracting invisible attributes like image links. If set to `'visible-only'`, only the visible elements will be sent. False by default.
367
+ - `screenshotIncluded?: boolean` - Whether to send screenshot to the model. True by default.
368
+
369
+ - Return Value:
370
+
371
+ - Returns a `Promise<string>` that resolves to the answer from the AI model.
372
+
373
+ - Examples:
374
+
375
+ ```typescript
376
+ const result = await agent.aiAsk('What should I do to test this page?');
377
+ console.log(result); // Output the answer from the AI model
378
+ ```
379
+
380
+ Besides `aiAsk`, you can also use `aiQuery` to extract structured data from the UI.
381
+
382
+ ### `agent.aiQuery()`
383
+
384
+ This method allows you to extract structured data from the current page. Simply define the expected format (e.g., string, number, JSON, or an array) in `dataDemand`, and Midscene will return a result that matches the format.
385
+
386
+ - Type
387
+
388
+ ```typescript
389
+ function aiQuery<T>(dataDemand: string | Object, options?: Object): Promise<T>;
390
+ ```
391
+
392
+ - Parameters:
393
+
394
+ - `dataDemand: string | Object`: A description of the expected data and its return format.
395
+ - `options?: Object` - Optional, a configuration object containing:
396
+ - `domIncluded?: boolean | 'visible-only'` - Whether to send simplified DOM information to the model, usually used for extracting invisible attributes like image links. If set to `'visible-only'`, only the visible elements will be sent. False by default.
397
+ - `screenshotIncluded?: boolean` - Whether to send screenshot to the model. True by default.
398
+
399
+ - Return Value:
400
+
401
+ - Returns any valid basic type, such as string, number, JSON, array, etc.
402
+ - Just describe the format in `dataDemand`, and Midscene will return a matching result.
403
+
404
+ - Examples:
405
+
406
+ ```typescript
407
+ const dataA = await agent.aiQuery({
408
+ time: 'The date and time displayed in the top-left corner as a string',
409
+ userInfo: 'User information in the format {name: string}',
410
+ tableFields: 'An array of table field names, string[]',
411
+ tableDataRecord:
412
+ 'Table records in the format {id: string, [fieldName]: string}[]',
413
+ });
414
+
415
+ // You can also describe the expected return format using a string:
416
+
417
+ // dataB will be an array of strings
418
+ const dataB = await agent.aiQuery('string[], list of task names');
419
+
420
+ // dataC will be an array of objects
421
+ const dataC = await agent.aiQuery(
422
+ '{name: string, age: string}[], table data records',
423
+ );
424
+
425
+ // Use domIncluded feature to extract invisible attributes
426
+ const dataD = await agent.aiQuery(
427
+ '{name: string, age: string, avatarUrl: string}[], table data records',
428
+ { domIncluded: true },
429
+ );
430
+ ```
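+
+ Since `aiQuery` is generic, you can also type the result explicitly in TypeScript; the shape is whatever you describe in `dataDemand`:
+
+ ```typescript
+ interface TableRecord {
+   name: string;
+   age: string;
+ }
+
+ const records = await agent.aiQuery<TableRecord[]>(
+   '{name: string, age: string}[], table data records',
+ );
+ ```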
431
+
432
+ ### `agent.aiBoolean()`
433
+
434
+ Extract a boolean value from the UI.
435
+
436
+ - Type
437
+
438
+ ```typescript
439
+ function aiBoolean(prompt: string | Object, options?: Object): Promise<boolean>;
440
+ ```
441
+
442
+ - Parameters:
443
+ - `prompt: string | Object` - A natural language description of the expected value, or [prompting with images](#prompting-with-images).
444
+ - `options?: Object` - Optional, a configuration object containing:
445
+ - `domIncluded?: boolean | 'visible-only'` - Whether to send simplified DOM information to the model, usually used for extracting invisible attributes like image links. If set to `'visible-only'`, only the visible elements will be sent. False by default.
446
+ - `screenshotIncluded?: boolean` - Whether to send screenshot to the model. True by default.
447
+ - Return Value:
448
+
449
+ - Returns a `Promise<boolean>` that resolves to the boolean value returned by the AI model.
450
+
451
+ - Examples:
452
+
453
+ ```typescript
454
+ const boolA = await agent.aiBoolean('Whether there is a login dialog');
455
+
456
+ // Use domIncluded feature to extract invisible attributes
457
+ const boolB = await agent.aiBoolean('Whether the login button has a link', {
458
+ domIncluded: true,
459
+ });
460
+ ```
461
+
462
+ ### `agent.aiNumber()`
463
+
464
+ Extract a number value from the UI.
465
+
466
+ - Type
467
+
468
+ ```typescript
469
+ function aiNumber(prompt: string | Object, options?: Object): Promise<number>;
470
+ ```
471
+
472
+ - Parameters:
473
+ - `prompt: string | Object` - A natural language description of the expected value, or [prompting with images](#prompting-with-images).
474
+ - `options?: Object` - Optional, a configuration object containing:
475
+ - `domIncluded?: boolean | 'visible-only'` - Whether to send simplified DOM information to the model, usually used for extracting invisible attributes like image links. If set to `'visible-only'`, only the visible elements will be sent. False by default.
476
+ - `screenshotIncluded?: boolean` - Whether to send screenshot to the model. True by default.
477
+ - Return Value:
478
+
479
+ - Returns a `Promise<number>` that resolves to the number value returned by the AI model.
480
+
481
+ - Examples:
482
+
483
+ ```typescript
484
+ const numberA = await agent.aiNumber('The remaining points of the account');
485
+
486
+ // Use domIncluded feature to extract invisible attributes
487
+ const numberB = await agent.aiNumber(
488
+ 'The value of the remaining points element',
489
+ { domIncluded: true },
490
+ );
491
+ ```
492
+
493
+ ### `agent.aiString()`
494
+
495
+ Extract a string value from the UI.
496
+
497
+ - Type
498
+
499
+ ```typescript
500
+ function aiString(prompt: string | Object, options?: Object): Promise<string>;
501
+ ```
502
+
503
+ - Parameters:
504
+ - `prompt: string | Object` - A natural language description of the expected value, or [prompting with images](#prompting-with-images).
505
+ - `options?: Object` - Optional, a configuration object containing:
506
+ - `domIncluded?: boolean | 'visible-only'` - Whether to send simplified DOM information to the model, usually used for extracting invisible attributes like image links. If set to `'visible-only'`, only the visible elements will be sent. False by default.
507
+ - `screenshotIncluded?: boolean` - Whether to send screenshot to the model. True by default.
508
+ - Return Value:
509
+
510
+ - Returns a `Promise<string>` that resolves to the string value returned by the AI model.
511
+
512
+ - Examples:
513
+
514
+ ```typescript
515
+ const stringA = await agent.aiString('The first item in the list');
516
+
517
+ // Use domIncluded feature to extract invisible attributes
518
+ const stringB = await agent.aiString('The link of the first item in the list', {
519
+ domIncluded: true,
520
+ });
521
+ ```
522
+
523
+ ## More APIs
524
+
525
+ ### `agent.aiAssert()`
526
+
527
+ Specify an assertion in natural language, and the AI determines whether the condition is true. If the assertion fails, the SDK throws an error that includes both the optional `errorMsg` and a detailed reason generated by the AI.
528
+
529
+ - Type
530
+
531
+ ```typescript
532
+ function aiAssert(
533
+ assertion: string | Object,
534
+ errorMsg?: string,
535
+ options?: Object
536
+ ): Promise<void>;
537
+ ```
538
+
539
+ - Parameters:
540
+
541
+ - `assertion: string | Object` - The assertion described in natural language, or [prompting with images](#prompting-with-images).
542
+ - `errorMsg?: string` - An optional error message to append if the assertion fails.
543
+ - `options?: Object` - Optional, a configuration object containing:
544
+ - `domIncluded?: boolean | 'visible-only'` - Whether to send simplified DOM information to the model, usually used for extracting invisible attributes like image links. If set to `'visible-only'`, only the visible elements will be sent. False by default.
545
+ - `screenshotIncluded?: boolean` - Whether to send screenshot to the model. True by default.
546
+
547
+ - Return Value:
548
+
549
+ - Returns a Promise that resolves to void if the assertion passes; if it fails, an error is thrown with `errorMsg` and additional AI-provided information.
550
+
551
+ - Example:
552
+
553
+ ```typescript
554
+ await agent.aiAssert('The price of "Sauce Labs Onesie" is 7.99');
555
+ ```
556
+
557
+ :::tip
558
+ Assertions are critical in test scripts. To reduce the risk of errors due to AI hallucination (e.g., missing an error), you can also combine `.aiQuery` with standard JavaScript assertions instead of using `.aiAssert`.
559
+
560
+ For example, you might replace the above code with:
561
+
562
+ ```typescript
563
+ const items = await agent.aiQuery(
564
+ '"{name: string, price: number}[], return product names and prices',
565
+ );
566
+ const onesieItem = items.find((item) => item.name === 'Sauce Labs Onesie');
567
+ expect(onesieItem).toBeTruthy();
568
+ expect(onesieItem.price).toBe(7.99);
569
+ ```
570
+
571
+ :::
572
+
573
+ ### `agent.aiLocate()`
574
+
575
+ Locate an element using natural language.
576
+
577
+ - Type
578
+
579
+ ```typescript
580
+ function aiLocate(
581
+ locate: string | Object,
582
+ options?: Object,
583
+ ): Promise<{
584
+ rect: {
585
+ left: number;
586
+ top: number;
587
+ width: number;
588
+ height: number;
589
+ };
590
+ center: [number, number];
591
+ scale: number; // device pixel ratio
592
+ }>;
593
+ ```
594
+
595
+ - Parameters:
596
+
597
+ - `locate: string | Object` - A natural language description of the element to locate, or [prompting with images](#prompting-with-images).
598
+ - `options?: Object` - Optional, a configuration object containing:
599
+ - `deepThink?: boolean` - If true, Midscene will call AI model twice to precisely locate the element. False by default.
600
+ - `xpath?: string` - The xpath of the element to operate. If provided, Midscene will first use this xpath to locate the element before using the cache and the AI model. Empty by default.
601
+ - `cacheable?: boolean` - Whether cacheable when enabling [caching feature](./caching.mdx). True by default.
602
+
603
+ - Return Value:
604
+
605
+ - Returns a `Promise` that resolves to an object describing the located element (its `rect`, `center`, and `scale`).
606
+
607
+ - Examples:
608
+
609
+ ```typescript
610
+ const locateInfo = await agent.aiLocate(
611
+ 'The login button at the top of the page',
612
+ );
613
+ console.log(locateInfo);
614
+ ```
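+
+ A sketch of consuming the result; treating `center` as page coordinates and multiplying by `scale` (the device pixel ratio) to get physical pixels is an assumption based on the type above:
+
+ ```typescript
+ const [x, y] = locateInfo.center;
+ // assumption: multiply by the device pixel ratio for physical pixel coordinates
+ console.log('center (physical px):', [x * locateInfo.scale, y * locateInfo.scale]);
+ ```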
615
+
616
+ ### `agent.aiWaitFor()`
617
+
618
+ Wait until a specified condition, described in natural language, becomes true. To limit the cost of AI calls, checks are performed no more often than once every `checkIntervalMs` milliseconds.
619
+
620
+ - Type
621
+
622
+ ```typescript
623
+ function aiWaitFor(
624
+ assertion: string,
625
+ options?: {
626
+ timeoutMs?: number;
627
+ checkIntervalMs?: number;
628
+ },
629
+ ): Promise<void>;
630
+ ```
631
+
632
+ - Parameters:
633
+
634
+ - `assertion: string` - The condition described in natural language.
635
+ - `options?: object` - An optional configuration object containing:
636
+ - `timeoutMs?: number` - Timeout in milliseconds (default: 15000).
637
+ - `checkIntervalMs?: number` - Interval for checking in milliseconds (default: 3000).
638
+
639
+ - Return Value:
640
+
641
+ - Returns a Promise that resolves to void if the condition is met; if not, an error is thrown when the timeout is reached.
642
+
643
+ - Examples:
644
+
645
+ ```typescript
646
+ // Basic usage
647
+ await agent.aiWaitFor(
648
+ 'There is at least one headphone information displayed on the interface',
649
+ );
650
+
651
+ // Using custom options
652
+ await agent.aiWaitFor('The shopping cart icon shows a quantity of 2', {
653
+ timeoutMs: 30000, // Wait for 30 seconds
654
+ checkIntervalMs: 5000, // Check every 5 seconds
655
+ });
656
+ ```
657
+
658
+ :::tip
659
+ Given the time consumption of AI services, `.aiWaitFor` might not be the most efficient method. Sometimes, using a simple sleep function may be a better alternative.
660
+ :::
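+
+ A minimal sketch of the sleep alternative mentioned in the tip (plain JavaScript, no AI call involved):
+
+ ```typescript
+ const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));
+ await sleep(3000); // wait 3 seconds instead of polling the AI model
+ ```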
661
+
662
+ ### `agent.runYaml()`
663
+
664
+ Execute an automation script written in YAML. Only the `tasks` part of the script is executed, and it returns the results of all `.aiQuery` calls within the script.
665
+
666
+ - Type
667
+
668
+ ```typescript
669
+ function runYaml(yamlScriptContent: string): Promise<{ result: any }>;
670
+ ```
671
+
672
+ - Parameters:
673
+
674
+ - `yamlScriptContent: string` - The YAML-formatted script content.
675
+
676
+ - Return Value:
677
+
678
+ - Returns a Promise that resolves to an object whose `result` property includes the results of all `.aiQuery` calls.
679
+
680
+ - Example:
681
+
682
+ ```typescript
683
+ const { result } = await agent.runYaml(`
684
+ tasks:
685
+ - name: search weather
686
+ flow:
687
+ - ai: input 'weather today' in input box, click search button
688
+ - sleep: 3000
689
+
690
+ - name: query weather
691
+ flow:
692
+ - aiQuery: "the result shows the weather info, {description: string}"
693
+ `);
694
+ console.log(result);
695
+ ```
696
+
697
+ :::tip
698
+ For more information about YAML scripts, please refer to [Automate with Scripts in YAML](./automate-with-scripts-in-yaml).
699
+ :::
700
+
701
+ ### `agent.setAIActionContext()`
702
+
703
+ Set the background knowledge that should be sent to the AI model when calling `agent.aiAction()` or `agent.ai()`. This will override the previous setting.
704
+
705
+ For instant action type APIs, like `aiTap()`, this setting will not take effect.
706
+
707
+ - Type
708
+
709
+ ```typescript
710
+ function setAIActionContext(actionContext: string): void;
711
+ ```
712
+
713
+ - Parameters:
714
+
715
+ - `actionContext: string` - The background knowledge that should be sent to the AI model.
716
+
717
+ - Example:
718
+
719
+ ```typescript
720
+ agent.setAIActionContext(
721
+ 'Close the cookie consent dialog first if it exists',
722
+ );
723
+ ```
724
+
725
+ ### `agent.evaluateJavaScript()`
726
+
727
+ > Only available in web pages, not available in Android.
728
+
729
+ Evaluate a JavaScript expression in the web page context.
730
+
731
+ - Type
732
+
733
+ ```typescript
734
+ function evaluateJavaScript(script: string): Promise<any>;
735
+ ```
736
+
737
+ - Parameters:
738
+
739
+ - `script: string` - The JavaScript expression to evaluate.
740
+
741
+ - Return Value:
742
+
743
+ - Returns the result of the JavaScript expression.
744
+
745
+ - Example:
746
+
747
+ ```typescript
748
+ const result = await agent.evaluateJavaScript('document.title');
749
+ console.log(result);
750
+ ```
751
+
752
+ ### `agent.logScreenshot()`
753
+
754
+ Log the current screenshot with a description in the report file.
755
+
756
+ - Type
757
+
758
+ ```typescript
759
+ function logScreenshot(title?: string, options?: Object): Promise<void>;
760
+ ```
761
+
762
+ - Parameters:
763
+
764
+ - `title?: string` - Optional. The title of the screenshot; if not provided, the title defaults to 'untitled'.
765
+ - `options?: Object` - Optional, a configuration object containing:
766
+ - `content?: string` - The description of the screenshot.
767
+
768
+ - Return Value:
769
+
770
+ - Returns a `Promise<void>`
771
+
772
+ - Examples:
773
+
774
+ ```typescript
775
+ await agent.logScreenshot('Login page', {
776
+ content: 'User A',
777
+ });
778
+ ```
779
+
780
+ ### `agent.freezePageContext()`
781
+
782
+ Freeze the current page context, allowing all subsequent operations to reuse the same page snapshot without retrieving the page state repeatedly. This significantly improves performance when executing a large number of concurrent operations.
783
+
784
+ Some notes:
785
+ * Usually, you do not need this method unless you are certain that context retrieval is the bottleneck of your test script.
786
+ * Call `agent.unfreezePageContext()` promptly to restore the real-time page state.
787
+ * Do not use this method with interaction operations; the AI model would be unable to perceive the latest page state, causing confusing errors.
788
+
789
+ - Type
790
+
791
+ ```typescript
792
+ function freezePageContext(): Promise<void>;
793
+ ```
794
+
795
+ - Return Value:
796
+
797
+ - `Promise<void>`
798
+
799
+ - Examples:
800
+
801
+ ```typescript
802
+ // Freeze the page context
803
+ await agent.freezePageContext();
804
+
805
+ // Run several read-only queries concurrently against the frozen snapshot
807
+ const results = await Promise.all([
808
+ agent.aiQuery('Username input box value'),
809
+ agent.aiQuery('Password input box value'),
810
+ agent.aiLocate('Login button'),
810
+ ]);
811
+ console.log(results);
812
+
813
+ // Unfreeze the page context, subsequent operations will use real-time page state
814
+ await agent.unfreezePageContext();
815
+ ```
816
+
817
+ :::tip
818
+ In the report, operations using frozen context will display a 🧊 icon in the Insight tab.
819
+ :::
820
+
821
+ ### `agent.unfreezePageContext()`
822
+
823
+ Unfreezes the page context, restoring the use of real-time page state.
824
+
825
+ - Type
826
+
827
+ ```typescript
828
+ function unfreezePageContext(): Promise<void>;
829
+ ```
830
+
831
+ - Return Value:
832
+
833
+ - `Promise<void>`
834
+
835
+ ## Properties
836
+
837
+ ### `.reportFile`
838
+
839
+ The path to the report file.
840
+
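+ For example, a minimal sketch of reading it after a run (the `as string` cast in the complete example below suggests the property may be unset before a report is generated, hence the guard):
+
+ ```typescript
+ if (agent.reportFile) {
+   console.log(`Report written to: ${agent.reportFile}`);
+ }
+ ```
+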
841
+ ## Additional configurations
842
+
843
+ ### Setting environment variables at runtime
844
+
845
+ You can override environment variables at runtime by calling the `overrideAIConfig` method.
846
+
847
+ ```typescript
848
+ import { overrideAIConfig } from '@sqai/web/puppeteer'; // or another Agent
849
+
850
+ overrideAIConfig({
851
+ OPENAI_BASE_URL: '...',
852
+ OPENAI_API_KEY: '...',
853
+ MIDSCENE_MODEL_NAME: '...',
854
+ });
855
+ ```
856
+
857
+ ### Print usage information for each AI call
858
+
859
+ Set `DEBUG=midscene:ai:profile:stats` to view the execution time and usage for each AI call.
860
+
861
+ ```bash
862
+ export DEBUG=midscene:ai:profile:stats
863
+ ```
864
+
865
+ ### Customize the run artifact directory
866
+
867
+ Set the `MIDSCENE_RUN_DIR` variable to customize the run artifact directory.
868
+
869
+ ```bash
870
+ export MIDSCENE_RUN_DIR=midscene_run # Defaults to midscene_run in the current working directory; you can set an absolute or a relative path
871
+ ```
872
+
873
+ ### Customize the replanning cycle limit
874
+
875
+ Set the `MIDSCENE_REPLANNING_CYCLE_LIMIT` variable to customize the maximum number of replanning cycles allowed during action execution (`aiAction`).
876
+
877
+ ```bash
878
+ export MIDSCENE_REPLANNING_CYCLE_LIMIT=10 # The default value is 10. When the AI needs to replan beyond this limit, an error is thrown suggesting that you split the task into multiple steps
879
+ ```
880
+
881
+ ### Using LangSmith
882
+
883
+ LangSmith is a platform for debugging large language models. To integrate LangSmith, follow these steps:
884
+
885
+ ```bash
886
+ # Set environment variables
887
+
888
+ # Enable debug mode
889
+ export MIDSCENE_LANGSMITH_DEBUG=1
890
+
891
+ # LangSmith configuration
892
+ export LANGSMITH_TRACING_V2=true
893
+ export LANGSMITH_ENDPOINT="https://api.smith.langchain.com"
894
+ export LANGSMITH_API_KEY="your_key_here"
895
+ export LANGSMITH_PROJECT="your_project_name_here"
896
+ ```
897
+
898
+ After starting Midscene, you should see logs similar to:
899
+
900
+ ```log
901
+ DEBUGGING MODE: langsmith wrapper enabled
902
+ ```
903
+
904
+ ## Advanced features
905
+
906
+ ### Prompting with images
907
+
908
+ You can use images as supplements in the prompt to describe things that cannot be expressed in natural language.
909
+
910
+ When prompting with images, the format of the prompt parameters is as follows:
911
+
912
+ ```javascript
913
+ {
914
+ // Prompt text, in which images can be referred
915
+ prompt: string,
916
+ // The images referred in the prompt text
917
+ images?: {
918
+ // Image name, corresponding to the names referred in the prompt text
919
+ name: string,
920
+ // Image url, can be a local image path, Base64 string, or http link
921
+ url: string
922
+ }[]
923
+ // When convertHttpImage2Base64 is true, http image links will be converted to Base64 encoding and sent to the LLM.
924
+ // This is useful when the image links are not publicly accessible.
925
+ convertHttpImage2Base64?: boolean
926
+ }
927
+ ```
928
+
929
+ - Example 1: use an image to specify the element to tap.
930
+
931
+ ```javascript
932
+ await agent.aiTap({
933
+ prompt: 'The specific logo',
934
+ images: [
935
+ {
936
+ name: 'The specific logo',
937
+ url: 'https://github.githubassets.com/assets/GitHub-Mark-ea2971cee799.png',
938
+ },
939
+ ],
940
+ });
941
+ ```
942
+
943
+ - Example 2: use images to assert the page content.
944
+
945
+ ```javascript
946
+ await agent.aiAssert({
947
+ prompt: 'Whether there is a specific logo on the page.',
948
+ images: [
949
+ {
950
+ name: 'The specific logo',
951
+ url: 'https://github.githubassets.com/assets/GitHub-Mark-ea2971cee799.png',
952
+ },
953
+ ],
954
+ });
955
+ ```
956
+
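+ - Example 3: convert http image links to Base64 before sending them to the model (useful when the links are not publicly accessible; the URL below is a placeholder):
+
+ ```javascript
+ await agent.aiTap({
+   prompt: 'The specific logo',
+   images: [
+     {
+       name: 'The specific logo',
+       url: 'https://internal.example.com/logo.png',
+     },
+   ],
+   convertHttpImage2Base64: true,
+ });
+ ```
+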
957
+ **Notes on Image Size**
958
+
959
+ When prompting with images, pay attention to your AI model provider's requirements on image size and dimensions. Images that are too large (e.g., exceeding 10 MB) or too small (e.g., fewer than 10 pixels) may cause errors when the model is invoked. Refer to the documentation of the AI model provider you are using for the exact restrictions.
960
+
961
+ ## Automation Report Merging
962
+
963
+ When running multiple automation workflows, each agent generates its own report file. The `ReportMergingTool` provides the ability to merge multiple automation reports into a single report for unified viewing and management of automation results.
964
+
965
+ ### `new ReportMergingTool()`
966
+
967
+ Create a report merging tool instance.
968
+
969
+ ```typescript
970
+ import { ReportMergingTool } from '@sqai/core/report';
971
+
972
+ const reportMergingTool = new ReportMergingTool();
973
+ ```
974
+
975
+ ### `.append()`
976
+
977
+ Add an automation report to the list of reports to be merged.
978
+
979
+ - Type
980
+
981
+ ```typescript
982
+ function append(reportInfo: ReportFileWithAttributes): void;
983
+ ```
984
+
985
+ - Parameters:
986
+
987
+ - `reportInfo: ReportFileWithAttributes` - Report information object containing:
988
+ - `reportFilePath: string` - Path to the report file
989
+ - `reportAttributes: object` - Report attributes
990
+ - `testId: string` - Unique identifier for the automation workflow
991
+ - `testTitle: string` - Automation workflow title
992
+ - `testDescription: string` - Automation workflow description
993
+ - `testDuration: number` - Automation execution duration (in milliseconds)
994
+ - `testStatus: TestStatus` - Automation status, options: `'passed' | 'failed' | 'skipped' | 'timedOut'`
995
+
996
+ - Examples:
997
+
998
+ ```typescript
999
+ reportMergingTool.append({
1000
+ reportFilePath: agent.reportFile as string,
1001
+ reportAttributes: {
1002
+ testId: 'automation-001',
1003
+ testTitle: 'Login Automation',
1004
+ testDescription: 'Automated user login workflow',
1005
+ testDuration: 5000,
1006
+ testStatus: 'passed',
1007
+ },
1008
+ });
1009
+ ```
1010
+
1011
+ ### `.mergeReports()`
1012
+
1013
+ Merge all added reports into a single report file.
1014
+
1015
+ - Type
1016
+
1017
+ ```typescript
1018
+ function mergeReports(
1019
+ reportFileName?: 'AUTO' | string,
1020
+ opts?: {
1021
+ rmOriginalReports?: boolean;
1022
+ overwrite?: boolean;
1023
+ },
1024
+ ): string | null;
1025
+ ```
1026
+
1027
+ - Parameters:
1028
+
1029
+ - `reportFileName?: 'AUTO' | string` - Report filename (optional)
1030
+ - If `'AUTO'` or not specified, a filename will be automatically generated
1031
+ - If a custom name is provided, it will be used as the report filename
1032
+ - `opts?: object` - Optional configuration object containing:
1033
+ - `rmOriginalReports?: boolean` - Whether to delete the original report files after merging (default: false)
1034
+ - `overwrite?: boolean` - Whether to overwrite if the report file already exists (default: false)
1035
+
1036
+ - Return Value:
1037
+
1038
+ - Returns the path to the merged report file on success, or `null` if there are not enough reports to merge (fewer than 2 reports)
1039
+
1040
+ - Examples:
1041
+
1042
+ ```typescript
1043
+ // Automatically generate report filename
1044
+ const autoNamedReport = reportMergingTool.mergeReports();
1045
+
1046
+ // Custom report filename
1047
+ const customNamedReport = reportMergingTool.mergeReports('my-test-report');
1048
+
1049
+ // Merge and delete original reports
1050
+ const mergedReport = reportMergingTool.mergeReports('my-test-report', {
1051
+ rmOriginalReports: true,
1052
+ });
1053
+ ```
1054
+
1055
+ ### `.clear()`
1056
+
1057
+ Clear all added report information.
1058
+
1059
+ - Type
1060
+
1061
+ ```typescript
1062
+ function clear(): void;
1063
+ ```
1064
+
1065
+ - Examples:
1066
+
1067
+ ```typescript
1068
+ reportMergingTool.clear();
1069
+ ```
1070
+
1071
+ ### Complete Example
1072
+
1073
+ Here's a complete example using Vitest + AndroidAgent:
1074
+
1075
+ ```typescript
1076
+ import {
1077
+ AndroidAgent,
1078
+ AndroidDevice,
1079
+ getConnectedDevices,
1080
+ } from '@sqai/android';
1081
+ import type { TestStatus } from '@sqai/core';
1082
+ import { ReportMergingTool } from '@sqai/core/report';
1083
+ import { sleep } from '@sqai/core/utils';
1084
+ import type ADB from 'appium-adb';
1085
+ import {
1086
+ afterAll,
1087
+ afterEach,
1088
+ beforeAll,
1089
+ beforeEach,
1090
+ describe,
1091
+ it,
1092
+ } from 'vitest';
1093
+
1094
+ describe('Android Settings Test', () => {
1095
+ let page: AndroidDevice;
1096
+ let adb: ADB;
1097
+ let agent: AndroidAgent;
1098
+ let startTime: number;
1099
+ let itTestStatus: TestStatus = 'passed';
1100
+ const reportMergingTool = new ReportMergingTool();
1101
+
1102
+ beforeAll(async () => {
1103
+ const devices = await getConnectedDevices();
1104
+ page = new AndroidDevice(devices[0].udid);
1105
+ adb = await page.getAdb();
1106
+ });
1107
+
1108
+ beforeEach((ctx) => {
1109
+ startTime = performance.now();
1110
+ agent = new AndroidAgent(page, {
1111
+ groupName: ctx.task.name,
1112
+ });
1113
+ });
1114
+
1115
+ afterEach((ctx) => {
1116
+ if (ctx.task.result?.state === 'pass') {
1117
+ itTestStatus = 'passed';
1118
+ } else if (ctx.task.result?.state === 'skip') {
1119
+ itTestStatus = 'skipped';
1120
+ } else if (ctx.task.result?.errors?.[0]?.message?.includes('timed out')) {
1121
+ itTestStatus = 'timedOut';
1122
+ } else {
1123
+ itTestStatus = 'failed';
1124
+ }
1125
+ reportMergingTool.append({
1126
+ reportFilePath: agent.reportFile as string,
1127
+ reportAttributes: {
1128
+ testId: `${ctx.task.name}`,
1129
+ testTitle: `${ctx.task.name}`,
1130
+ testDescription: 'description',
1131
+ testDuration: Math.round(performance.now() - startTime),
1132
+ testStatus: itTestStatus,
1133
+ },
1134
+ });
1135
+ });
1136
+
1137
+ afterAll(() => {
1138
+ reportMergingTool.mergeReports('my-android-setting-test-report');
1139
+ });
1140
+
1141
+ it('toggle wlan', async () => {
1142
+ await adb.shell('input keyevent KEYCODE_HOME');
1143
+ await sleep(1000);
1144
+ await adb.shell('am start -n com.android.settings/.Settings');
1145
+ await sleep(1000);
1146
+ await agent.aiAction('find and enter WLAN setting');
1147
+ await agent.aiAction(
1148
+ 'toggle WLAN status *once*, if WLAN is off pls turn it on, otherwise turn it off.',
1149
+ );
1150
+ });
1151
+
1152
+ it('toggle bluetooth', async () => {
1153
+ await adb.shell('input keyevent KEYCODE_HOME');
1154
+ await sleep(1000);
1155
+ await adb.shell('am start -n com.android.settings/.Settings');
1156
+ await sleep(1000);
1157
+ await agent.aiAction('find and enter bluetooth setting');
1158
+ await agent.aiAction(
1159
+ 'toggle bluetooth status *once*, if bluetooth is off pls turn it on, otherwise turn it off.',
1160
+ );
1161
+ });
1162
+ });
1163
+ ```
1164
+
1165
+ :::tip
1166
+ The merged report file will be saved in the `midscene_run/report` directory by default. You can customize the report directory location by setting the `MIDSCENE_RUN_DIR` environment variable.
1167
+ :::