npm - windows-use - Versions diffs - 0.2.0 → 0.3.0 - Mend

windows-use 0.2.0 → 0.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (7) hide show

package/dist/cli.js CHANGED Viewed

@@ -120,21 +120,49 @@ function buildSystemPrompt() {
   return `You are a precise Windows and browser automation agent. Your job is to execute instructions by calling the tools available to you.
 ## Workflow
-1. Take a screenshot first to understand the current state of the screen.
+1. Take a \`screenshot\` first to understand the current state of the screen.
 2. Plan the minimal sequence of actions needed.
-3. Execute each action one at a time, then verify by taking another screenshot.
+3. Execute actions, verifying with screenshots at key checkpoints (not after every single action).
 4. When the task is done, you are blocked, or you need guidance, call \`report\` immediately.
+## Reading Screenshots
+- Desktop screenshots include a **coordinate grid overlay**. The grid labels show pixel coordinates that directly correspond to \`mouse_click\` and \`mouse_move\` coordinates.
+- Use the grid numbers to estimate the (x, y) position of UI elements. For example, if a button appears near the grid label "400" horizontally and "300" vertically, click at approximately (400, 300).
+- The bottom-right corner label shows the total screen dimensions.
+## Tool Selection
+- **Browser tasks**: Prefer \`browser_*\` tools (they use CSS selectors, more reliable than coordinates). Use \`browser_content\` to find text/elements when you can't locate them visually.
+- **Desktop/native app tasks**: Use \`screenshot\` + \`mouse_click\`/\`keyboard_*\`. Read coordinates from the grid overlay.
+- **Terminal tasks**: Prefer \`run_command\` over GUI interactions. It's faster and more reliable.
+- **Mixed tasks**: You can combine all tool types. For example, use \`run_command\` to launch an app, then \`screenshot\` + mouse to interact with it.
+## Smart Screenshot Strategy
+- ALWAYS take a screenshot before your first action.
+- Take a screenshot to verify after **critical actions** (clicking a button, submitting a form, navigating to a new page).
+- Skip verification screenshots for **low-risk sequential actions** (typing text, pressing modifier keys, scrolling) \u2014 verify after the sequence is complete instead.
+- If an action might trigger a loading state, wait briefly then screenshot to confirm the page has loaded.
 ## Rules
-- ALWAYS take a screenshot before your first action to understand the current state.
-- After every mouse click or keyboard action, take a screenshot to verify the result.
 - Call ONE tool at a time. Never request multiple tools in parallel.
 - Before each tool call, briefly state what you are about to do and why.
 - After receiving a tool result, describe what you observed.
-- For browser tasks, prefer using browser_* tools over clicking on-screen coordinates.
-- For terminal tasks, prefer \`run_command\` over GUI interactions when possible.
 - Do not read or write files unless the instruction explicitly asks for it.
+## Handling Common Situations
+- **Loading/transitions**: If a page or app is loading, take another screenshot after a moment instead of acting immediately.
+- **Popups/dialogs**: Handle unexpected dialogs (cookie banners, notifications, confirmations) by dismissing or accepting them, then continue with the original task.
+- **Dropdowns/menus**: Click to open, then screenshot to see options before selecting.
+- **Scrolling**: If content is below the fold, scroll down and screenshot. Check both browser_scroll (for web pages) and mouse_scroll (for desktop apps).
+- **Text input**: For browser forms, prefer \`browser_type\` with the CSS selector. For desktop apps, click the input field first, then use \`keyboard_type\`.
+- **Coordinate precision**: When clicking small UI elements (buttons, links, checkboxes), aim for their center. If a click misses, adjust coordinates and try once more.
+## Error Recovery
+- If an action fails or produces unexpected results, take a screenshot to reassess the situation before trying again.
+- Try a different approach rather than repeating the same failed action. For example:
+  - If \`browser_click\` fails on a selector, try a different selector or fall back to coordinate-based \`mouse_click\`.
+  - If a UI element is not visible, try scrolling or switching tabs/windows.
+- If something fails **twice with different approaches**, call \`report\` with status "blocked".
 ## report Tool
 Call \`report\` when:
 - **"completed"**: The task is done successfully. Summarize what was accomplished.
@@ -150,12 +178,9 @@ report({
 })
 \`\`\`
-Each screenshot tool returns a screenshot ID (e.g. img_1, img_2). Use these IDs to embed images in your report.
+Each screenshot tool returns a screenshot ID (e.g. img_1, img_2). Use these IDs to embed images in your report. Include relevant screenshots in your report so the caller can see the final state.
-## Important
-- Do NOT keep retrying the same failing action. If something fails twice, call \`report\` with status "blocked".
-- If a UI element is not where you expect it, try scrolling first before giving up.
-- Keep your responses concise. Focus on actions, not explanations.`;
+You can also use \`use_local_image\` to load a local image file and get a screenshot ID for embedding in reports.`;
 }
 var init_system_prompt = __esm({
   "src/agent/system-prompt.ts"() {
@@ -776,7 +801,7 @@ var init_screenshot = __esm({
         const image = primary.captureImageSync();
         const physW = image.width;
         const physH = image.height;
-        const scaleFactor = primary.scaleFactor ?? 1;
+        const scaleFactor = primary.scaleFactor() ?? 1;
         const logicalW = Math.round(physW / scaleFactor);
         const logicalH = Math.round(physH / scaleFactor);
         const raw = image.toRawSync();
@@ -1617,7 +1642,7 @@ init_types();
 import { program } from "commander";
 import { createInterface } from "readline";
 import { createServer } from "http";
-import { mkdirSync as mkdirSync2, writeFileSync } from "fs";
+import { existsSync as existsSync4, mkdirSync as mkdirSync2, readFileSync as readFileSync3, writeFileSync } from "fs";
 import { join as join3 } from "path";
 import { tmpdir } from "os";
 function startScreenshotServer(screenshotDir) {
@@ -1654,7 +1679,35 @@ function startScreenshotServer(screenshotDir) {
   });
 }
 program.name("windows-use").description("Run Windows/browser automation tasks using a small LLM agent").version("0.2.0");
-program.command("init").description("Interactive setup \u2014 save config to ~/.windows-use.json").action(async () => {
+program.command("init").description("Interactive setup, or import/export config via base64").argument("[base64]", "Import config from a base64 string").option("--export", "Export current config as a base64 string").action(async (base64Input, opts) => {
+  const configPath = getConfigPath();
+  if (opts.export) {
+    if (!existsSync4(configPath)) {
+      console.error("No config found. Run `windows-use init` first.");
+      process.exit(1);
+    }
+    const raw = readFileSync3(configPath, "utf-8");
+    const encoded = Buffer.from(raw).toString("base64");
+    console.log(encoded);
+    return;
+  }
+  if (base64Input) {
+    try {
+      const decoded = Buffer.from(base64Input, "base64").toString("utf-8");
+      const parsed = JSON.parse(decoded);
+      writeFileSync(configPath, JSON.stringify(parsed, null, 2) + "\n", "utf-8");
+      console.log(`\u2705 Config imported to ${configPath}`);
+      const display = { ...parsed };
+      if (display.apiKey) {
+        display.apiKey = display.apiKey.slice(0, 6) + "..." + display.apiKey.slice(-4);
+      }
+      console.log(JSON.stringify(display, null, 2));
+    } catch {
+      console.error("Invalid base64 or JSON. Make sure you copied the full string.");
+      process.exit(1);
+    }
+    return;
+  }
   const rl = createInterface({ input: process.stdin, output: process.stdout });
   const ask = (q) => new Promise((resolve) => rl.question(q, (a) => resolve(a.trim())));
   console.log("\n\u{1F527} windows-use setup\n");
@@ -1666,7 +1719,6 @@ program.command("init").description("Interactive setup \u2014 save config to ~/.
   if (baseURL) config.baseURL = baseURL;
   if (apiKey) config.apiKey = apiKey;
   if (model) config.model = model;
-  const configPath = getConfigPath();
   writeFileSync(configPath, JSON.stringify(config, null, 2) + "\n", "utf-8");
   console.log(`
 \u2705 Config saved to ${configPath}`);