npm - @midscene/core - Versions diffs - 0.5.2-beta-20241010035503.0 → 0.5.2 - Mend

@midscene/core 0.5.2-beta-20241010035503.0 → 0.5.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (7)

package/dist/es/ai-model.js CHANGED

@@ -4222,155 +4222,6 @@ var wrapOpenAI = (openai, options) => {
 // src/ai-model/openai/index.ts
 import OpenAI, { AzureOpenAI } from "openai";
-// src/ai-model/automation/planning.ts
-function systemPromptToTaskPlanning() {
-  return `
-## Role:
-You are a versatile professional in software UI design and testing. Your outstanding contributions will impact the user experience of billions of users.
-## Objective 1 (main objective): Decompose the task user asked into a series of actions:
-- Based on the page context information (screenshot and description) you get, decompose the task user asked into a series of actions.
-- Actions are executed in the order listed in the list. After executing the actions, the task should be completed.
-Each action has a type and corresponding param. To be detailed:
-* type: 'Locate', it means to locate one element
-  * param: { prompt: string }, the prompt describes 'which element to focus on page'. Our AI engine will use this prompt to locate the element, so it should clearly describe the obvious features of the element, such as its content, color, size, shape, and position. For example, 'The biggest Download Button on the left side of the page.'
-* type: 'Tap', tap the previous element found
-  * param: null
-* type: 'Hover', hover the previous element found
-  * param: null
-* type: 'Input', replace the value in the input field
-  * param: { value: string }, The input value must not be an empty string. Provide a meaningful final required input value based on the existing input. No matter what modifications are required, just provide the final value to replace the existing input value. After locating the input field, do not use 'Tap' action, proceed directly to 'Input' action.
-* type: 'KeyboardPress',  press a key
-  * param: { value: string },  the value to input or the key to press. Use （Enter, Shift, Control, Alt, Meta, ShiftLeft, ControlOrMeta, ControlOrMeta） to represent the key.
-* type: 'Scroll'
-  * param: { scrollType: 'scrollDownOneScreen', 'scrollUpOneScreen', 'scrollUntilBottom', 'scrollUntilTop' }
-* type: 'Error'
-  * param: { message: string }, the error message
-* type: 'Sleep'
-  * param: { timeMs: number }, wait for timeMs milliseconds
-Here is an example of how to decompose a task.
-When a user says 'Input "Weather in Shanghai" into the search bar, wait 1 second, hit enter', by viewing the page screenshot and description, you may decompose this task into something like this:
-* Locate: 'The search bar'
-* Input: 'Weather in Shanghai'
-* Sleep: 1000
-* KeyboardPress: 'Enter'
-Remember:
-1. The actions you composed MUST be based on the page context information you get. Instead of making up actions that are not related to the page context.
-2. In most cases, you should Locate one element first, then do other actions on it. For example, alway Find one element, then hover on it. But if you think it's necessary to do other actions first (like global scroll, global key press), you can do that.
-If the planned tasks are sequential and tasks may appear only after the execution of previous tasks, this is considered normal. Thoughts, prompts, and error messages should all be in the same language as the user query.
-## Objective 2 (sub objective): Give a quick answer to the action with type "Locate" you just planned
-Review the action you just planned. If the action type is 'Locate', provide a quick answer: Does any element meet the description in the prompt? If so, answer with the following format, as the \`quickAnswer\` field in the output JSON:
-{
-  "reason": "Reason for finding element 4: It is located in the upper right corner, is an image type, and according to the screenshot, it is a shopping cart icon button",
-  "text": "PLACEHOLDER", // Replace PLACEHOLDER with the text of elementInfo, if none, leave empty
-  "id": "4" // ID of this element, replace with actual value in practice
-}
-If the action type is not 'Locate', or there is no element meets the description in the prompt (usually because it will show up after some interaction), the answer should be null.
-## Output JSON Format:
-Please return the result in JSON format as follows:
-{
-  queryLanguage: '', // language of the description of the task
-  actions: [ // always return in Array
-    {
-      "thought": "find out the search bar",
-      "type": "Locate", // Type of action, like 'Tap' 'Hover' ...
-      "param": {
-        "prompt": "The search bar"
-      },
-      "quickAnswer": { // since the first action is Locate, so we need to give a quick answer
-        "reason": "Reason for finding element 4: It is located in the upper right corner, is an input type, and according to the screenshot, it is a search bar",
-        "text": "PLACEHOLDER", // Replace PLACEHOLDER with the text of elementInfo, if none, leave empty
-        "id": "4" // ID of this element, replace with actual value in practice
-      } | null,
-    },
-    {
-      "thought": "Reasons for generating this task, and why this task is feasible on this page",
-      "type": "Tap", // Type of action, like 'Tap' 'Hover' ...
-      "param": any, // Parameter towards the task type
-    },
-    // ... more actions
-  ],
-  error?: string, // Overall error messages. If there is any error occurs during the task planning (i.e. error in previous 'actions' array), conclude the errors again, put error messages here,
-}
-`;
-}
-var planSchema = {
-  type: "json_schema",
-  json_schema: {
-    name: "action_items",
-    strict: true,
-    schema: {
-      type: "object",
-      properties: {
-        queryLanguage: {
-          type: "string",
-          description: "Language of the description of the task"
-        },
-        actions: {
-          type: "array",
-          items: {
-            type: "object",
-            properties: {
-              thought: {
-                type: "string",
-                description: "Reasons for generating this task, and why this task is feasible on this page"
-              },
-              type: {
-                type: "string",
-                description: 'Type of action, like "Tap", "Hover", etc.'
-              },
-              param: {
-                type: ["object", "null"],
-                description: "Parameter towards the task type, can be null"
-              },
-              quickAnswer: {
-                type: ["object", "null"],
-                nullable: true,
-                properties: {
-                  reason: {
-                    type: "string",
-                    description: "Reason for finding element 4"
-                  },
-                  text: {
-                    type: "string",
-                    description: "Text of elementInfo, if none, leave empty"
-                  },
-                  id: {
-                    type: "string",
-                    description: "ID of this element"
-                  }
-                },
-                required: ["reason", "text", "id"],
-                additionalProperties: false
-              }
-            },
-            required: ["thought", "type", "param", "quickAnswer"],
-            additionalProperties: false
-          },
-          description: "List of actions to be performed"
-        },
-        error: {
-          type: ["string", "null"],
-          description: "Overall error messages. If there is any error occurs during the task planning, conclude the errors again and put error messages here"
-        }
-      },
-      required: ["queryLanguage", "actions", "error"],
-      additionalProperties: false
-    }
-  }
-};
 // src/ai-model/coze/index.ts
 import assert from "assert";
 import fetch2 from "node-fetch";
@@ -4569,7 +4420,8 @@ Input Example:
       },
       "elementInfos": [
         {
-          "id": "3", // ID of the element
+          "id": "we23xsfwe", // ID of the element
+          "indexId": "0", // Index of the element，The image is labeled to the left of the element
           "attributes": { // Attributes of the element
             "nodeType": "IMG Node", // Type of element, types include: TEXT Node, IMG Node, BUTTON Node, INPUT Node
             "src": "https://ap-southeast-3.m",
@@ -4584,7 +4436,8 @@ Input Example:
           }
         },
         {
-          "id": "4", // ID of the element
+          "id": "wefew2222few2", // ID of the element
+          "indexId": "1", // Index of the element，The image is labeled to the left of the element
           "attributes": { // Attributes of the element
             "nodeType": "IMG Node", // Type of element, types include: TEXT Node, IMG Node, BUTTON Node, INPUT Node
             "src": "data:image/png;base64,iVBORw0KGgoAAAANSU...",
@@ -4600,7 +4453,8 @@ Input Example:
         },
         ...
         {
-          "id": "27",
+          "id": "kwekfj2323",
+          "indexId": "2", // Index of the element，The image is labeled to the left of the element
           "attributes": {
             "nodeType": "TEXT Node",
             "class": ".product-name"
@@ -4632,7 +4486,7 @@ Output Example:
       "reason": "Reason for finding element 4: It is located in the upper right corner, is an image type, and according to the screenshot, it is a shopping cart icon button",
       "text": "",
       // ID of this element, replace with actual value in practice
-      "id": "4"
+      "id": "wefew2222few2"
     }
   ],
   "errors": []
@@ -4689,6 +4543,155 @@ var findElementSchema = {
   }
 };
+// src/ai-model/prompt/planning.ts
+function systemPromptToTaskPlanning() {
+  return `
+## Role:
+You are a versatile professional in software UI design and testing. Your outstanding contributions will impact the user experience of billions of users.
+## Objective 1 (main objective): Decompose the task user asked into a series of actions:
+- Based on the page context information (screenshot and description) you get, decompose the task user asked into a series of actions.
+- Actions are executed in the order listed in the list. After executing the actions, the task should be completed.
+Each action has a type and corresponding param. To be detailed:
+* type: 'Locate', it means to locate one element
+  * param: { prompt: string }, the prompt describes 'which element to focus on page'. Our AI engine will use this prompt to locate the element, so it should clearly describe the obvious features of the element, such as its content, color, size, shape, and position. For example, 'The biggest Download Button on the left side of the page.'
+* type: 'Tap', tap the previous element found
+  * param: null
+* type: 'Hover', hover the previous element found
+  * param: null
+* type: 'Input', replace the value in the input field
+  * param: { value: string }, The input value must not be an empty string. Provide a meaningful final required input value based on the existing input. No matter what modifications are required, just provide the final value to replace the existing input value. After locating the input field, do not use 'Tap' action, proceed directly to 'Input' action.
+* type: 'KeyboardPress',  press a key
+  * param: { value: string },  the value to input or the key to press. Use （Enter, Shift, Control, Alt, Meta, ShiftLeft, ControlOrMeta, ControlOrMeta） to represent the key.
+* type: 'Scroll'
+  * param: { scrollType: 'scrollDownOneScreen', 'scrollUpOneScreen', 'scrollUntilBottom', 'scrollUntilTop' }
+* type: 'Error'
+  * param: { message: string }, the error message
+* type: 'Sleep'
+  * param: { timeMs: number }, wait for timeMs milliseconds
+Here is an example of how to decompose a task.
+When a user says 'Input "Weather in Shanghai" into the search bar, wait 1 second, hit enter', by viewing the page screenshot and description, you may decompose this task into something like this:
+* Locate: 'The search bar'
+* Input: 'Weather in Shanghai'
+* Sleep: 1000
+* KeyboardPress: 'Enter'
+Remember:
+1. The actions you composed MUST be based on the page context information you get. Instead of making up actions that are not related to the page context.
+2. In most cases, you should Locate one element first, then do other actions on it. For example, alway Find one element, then hover on it. But if you think it's necessary to do other actions first (like global scroll, global key press), you can do that.
+If the planned tasks are sequential and tasks may appear only after the execution of previous tasks, this is considered normal. Thoughts, prompts, and error messages should all be in the same language as the user query.
+## Objective 2 (sub objective): Give a quick answer to the action with type "Locate" you just planned
+Review the action you just planned. If the action type is 'Locate', provide a quick answer: Does any element meet the description in the prompt? If so, answer with the following format, as the \`quickAnswer\` field in the output JSON:
+{
+  "reason": "Reason for finding element 4: It is located in the upper right corner, is an image type, and according to the screenshot, it is a shopping cart icon button",
+  "text": "PLACEHOLDER", // Replace PLACEHOLDER with the text of elementInfo, if none, leave empty
+  "id": "wefew2222few2" // id of this element, replace with actual value in practice
+}
+If the action type is not 'Locate', or there is no element meets the description in the prompt (usually because it will show up after some interaction), the answer should be null.
+## Output JSON Format:
+Please return the result in JSON format as follows:
+{
+  queryLanguage: '', // language of the description of the task
+  actions: [ // always return in Array
+    {
+      "thought": "find out the search bar",
+      "type": "Locate", // Type of action, like 'Tap' 'Hover' ...
+      "param": {
+        "prompt": "The search bar"
+      },
+      "quickAnswer": { // since the first action is Locate, so we need to give a quick answer
+        "reason": "Reason for finding element 4: It is located in the upper right corner, is an input type, and according to the screenshot, it is a search bar",
+        "text": "PLACEHOLDER", // Replace PLACEHOLDER with the text of elementInfo, if none, leave empty
+        "id": "wefew2222few2" // ID of this element, replace with actual value in practice
+      } | null,
+    },
+    {
+      "thought": "Reasons for generating this task, and why this task is feasible on this page",
+      "type": "Tap", // Type of action, like 'Tap' 'Hover' ...
+      "param": any, // Parameter towards the task type
+    },
+    // ... more actions
+  ],
+  error?: string, // Overall error messages. If there is any error occurs during the task planning (i.e. error in previous 'actions' array), conclude the errors again, put error messages here,
+}
+`;
+}
+var planSchema = {
+  type: "json_schema",
+  json_schema: {
+    name: "action_items",
+    strict: true,
+    schema: {
+      type: "object",
+      properties: {
+        queryLanguage: {
+          type: "string",
+          description: "Language of the description of the task"
+        },
+        actions: {
+          type: "array",
+          items: {
+            type: "object",
+            properties: {
+              thought: {
+                type: "string",
+                description: "Reasons for generating this task, and why this task is feasible on this page"
+              },
+              type: {
+                type: "string",
+                description: 'Type of action, like "Tap", "Hover", etc.'
+              },
+              param: {
+                type: ["object", "null"],
+                description: "Parameter towards the task type, can be null"
+              },
+              quickAnswer: {
+                type: ["object", "null"],
+                nullable: true,
+                properties: {
+                  reason: {
+                    type: "string",
+                    description: "Reason for finding element 4"
+                  },
+                  text: {
+                    type: "string",
+                    description: "Text of elementInfo, if none, leave empty"
+                  },
+                  id: {
+                    type: "string",
+                    description: "ID of this element"
+                  }
+                },
+                required: ["reason", "text", "id"],
+                additionalProperties: false
+              }
+            },
+            required: ["thought", "type", "param", "quickAnswer"],
+            additionalProperties: false
+          },
+          description: "List of actions to be performed"
+        },
+        error: {
+          type: ["string", "null"],
+          description: "Overall error messages. If there is any error occurs during the task planning, conclude the errors again and put error messages here"
+        }
+      },
+      required: ["queryLanguage", "actions", "error"],
+      additionalProperties: false
+    }
+  }
+};
 // src/ai-model/prompt/util.ts
 import assert2 from "assert";
@@ -4707,18 +4710,10 @@ import {
 var characteristic = "You are a versatile professional in software UI design and testing. Your outstanding contributions will impact the user experience of billions of users.";
 var contextFormatIntro = `
 The user will give you a screenshot and the texts on it. There may be some none-English characters (like Chinese) on it, indicating it's an non-English app.`;
-var ONE_ELEMENT_LOCATOR_PREFIX = "LOCATE_ONE_ELEMENT";
-var ELEMENTS_LOCATOR_PREFIX = "LOCATE_ONE_OR_MORE_ELEMENTS";
-var skillExtractData = `skill name: extract_data_from_UI
-related input: DATA_DEMAND
-skill content:
-* User will give you some data requirements in DATA_DEMAND. Consider the UI context, follow the user's instructions, and provide comprehensive data accordingly.
-* There may be some special commands in DATA_DEMAND, please pay extra attention
-  - ${ONE_ELEMENT_LOCATOR_PREFIX} and ${ELEMENTS_LOCATOR_PREFIX}: if you see a description that mentions the keyword ${ONE_ELEMENT_LOCATOR_PREFIX} or ${ELEMENTS_LOCATOR_PREFIX}(e.g. follow ${ONE_ELEMENT_LOCATOR_PREFIX} : i want to find ...), it means user wants to locate a specific element meets the description. Return in this way: prefix + the id / comma-separated ids, for example: ${ONE_ELEMENT_LOCATOR_PREFIX}/1 , ${ELEMENTS_LOCATOR_PREFIX}/1,2,3 . If not found, keep the prefix and leave the suffix empty, like ${ONE_ELEMENT_LOCATOR_PREFIX}/ .`;
 function systemPromptToExtract() {
   return `
 You are a versatile professional in software UI design and testing. Your outstanding contributions will impact the user experience of billions of users.
-The user will give you a screenshot and the texts on it. There may be some none-English characters (like Chinese) on it, indicating it's an non-English app.
+The user will give you a screenshot and the contents of it. There may be some none-English characters (like Chinese) on it, indicating it's an non-English app.
 You have the following skills:
@@ -4807,14 +4802,15 @@ async function describeUserPage(context) {
   const elementInfosDescription = cropFieldInformation(elementsInfo);
   return {
     description: `
-    {
-      // The size of the page
-      "pageSize": ${describeSize({ width, height })},
+{
+  // The size of the page
+  "pageSize": ${describeSize({ width, height })},
+  // json description of the element
+  "content": ${JSON.stringify(elementInfosDescription)}
-      // json description of the element
-      "elementInfos": ${JSON.stringify(elementInfosDescription)}
-    }`,
+}`,
+    // // json description of the element
     elementById(id) {
       assert2(typeof id !== "undefined", "id is required for query");
       const item = idElementMap[`${id}`];
@@ -4837,6 +4833,7 @@ function cropFieldInformation(elementsInfo) {
       );
       return {
         id,
+        markerId: item.indexId,
         attributes: tailorAttributes,
         rect,
         content: tailorContent
@@ -4880,7 +4877,7 @@ async function createOpenAI() {
   }
   if (process.env[MIDSCENE_LANGSMITH_DEBUG]) {
     console.log("DEBUGGING MODE: langsmith wrapper enabled");
-    const openai2 = wrapOpenAI(new OpenAI());
+    const openai2 = wrapOpenAI(new OpenAI(extraConfig));
     return openai2;
   }
   return openai;
@@ -4893,7 +4890,7 @@ async function call(messages, responseFormat) {
     model,
     messages,
     response_format: responseFormat,
-    temperature: 0.2,
+    temperature: 0.1,
     stream: false
   });
   shouldPrintTiming && console.timeEnd("Midscene - AI call");
@@ -4949,7 +4946,7 @@ function extractJSONFromCodeBlock(response) {
 import assert4 from "assert";
 async function AiInspectElement(options) {
   var _a;
-  const { context, multi, findElementDescription, callAI, useModel } = options;
+  const { context, multi, targetElementDescription, callAI, useModel } = options;
   const { screenshotBase64 } = context;
   const { description, elementById } = await describeUserPage(context);
   if (((_a = options.quickAnswer) == null ? void 0 : _a.id) && elementById(options.quickAnswer.id)) {
@@ -4979,10 +4976,10 @@ async function AiInspectElement(options) {
     ${description}
-    Here is the description of the findElement. Just go ahead:
+    Here is the item user want to find. Just go ahead:
     =====================================
     ${JSON.stringify({
-            description: findElementDescription,
+            description: targetElementDescription,
             multi: multiDescription(multi)
           })}
     =====================================