opencode-skills-collection 1.0.185 → 1.0.187
- package/bundled-skills/.antigravity-install-manifest.json +5 -1
- package/bundled-skills/3d-web-experience/SKILL.md +152 -37
- package/bundled-skills/agent-evaluation/SKILL.md +1088 -26
- package/bundled-skills/agent-memory-systems/SKILL.md +1037 -25
- package/bundled-skills/agent-tool-builder/SKILL.md +668 -16
- package/bundled-skills/ai-agents-architect/SKILL.md +271 -31
- package/bundled-skills/ai-product/SKILL.md +716 -26
- package/bundled-skills/ai-wrapper-product/SKILL.md +450 -44
- package/bundled-skills/algolia-search/SKILL.md +867 -15
- package/bundled-skills/autonomous-agents/SKILL.md +1033 -26
- package/bundled-skills/aws-serverless/SKILL.md +1046 -35
- package/bundled-skills/azure-functions/SKILL.md +1318 -19
- package/bundled-skills/browser-automation/SKILL.md +1065 -28
- package/bundled-skills/browser-extension-builder/SKILL.md +159 -32
- package/bundled-skills/bullmq-specialist/SKILL.md +347 -16
- package/bundled-skills/clerk-auth/SKILL.md +796 -15
- package/bundled-skills/computer-use-agents/SKILL.md +1870 -28
- package/bundled-skills/context-window-management/SKILL.md +271 -18
- package/bundled-skills/conversation-memory/SKILL.md +453 -24
- package/bundled-skills/crewai/SKILL.md +252 -46
- package/bundled-skills/discord-bot-architect/SKILL.md +1207 -34
- package/bundled-skills/docs/integrations/jetski-cortex.md +3 -3
- package/bundled-skills/docs/integrations/jetski-gemini-loader/README.md +1 -1
- package/bundled-skills/docs/maintainers/repo-growth-seo.md +3 -3
- package/bundled-skills/docs/maintainers/skills-update-guide.md +1 -1
- package/bundled-skills/docs/users/bundles.md +1 -1
- package/bundled-skills/docs/users/claude-code-skills.md +1 -1
- package/bundled-skills/docs/users/gemini-cli-skills.md +1 -1
- package/bundled-skills/docs/users/getting-started.md +1 -1
- package/bundled-skills/docs/users/kiro-integration.md +1 -1
- package/bundled-skills/docs/users/usage.md +4 -4
- package/bundled-skills/docs/users/visual-guide.md +4 -4
- package/bundled-skills/email-systems/SKILL.md +646 -26
- package/bundled-skills/faf-expert/SKILL.md +221 -0
- package/bundled-skills/faf-wizard/SKILL.md +252 -0
- package/bundled-skills/file-uploads/SKILL.md +212 -11
- package/bundled-skills/firebase/SKILL.md +646 -16
- package/bundled-skills/gcp-cloud-run/SKILL.md +1117 -32
- package/bundled-skills/graphql/SKILL.md +1026 -27
- package/bundled-skills/hubspot-integration/SKILL.md +804 -19
- package/bundled-skills/idea-darwin/SKILL.md +120 -0
- package/bundled-skills/inngest/SKILL.md +431 -16
- package/bundled-skills/interactive-portfolio/SKILL.md +342 -44
- package/bundled-skills/langfuse/SKILL.md +296 -41
- package/bundled-skills/langgraph/SKILL.md +259 -50
- package/bundled-skills/micro-saas-launcher/SKILL.md +343 -44
- package/bundled-skills/neon-postgres/SKILL.md +572 -15
- package/bundled-skills/nextjs-supabase-auth/SKILL.md +269 -21
- package/bundled-skills/notion-template-business/SKILL.md +371 -44
- package/bundled-skills/personal-tool-builder/SKILL.md +537 -44
- package/bundled-skills/plaid-fintech/SKILL.md +825 -19
- package/bundled-skills/prompt-caching/SKILL.md +438 -25
- package/bundled-skills/rag-engineer/SKILL.md +271 -29
- package/bundled-skills/salesforce-development/SKILL.md +912 -19
- package/bundled-skills/satori/SKILL.md +54 -0
- package/bundled-skills/scroll-experience/SKILL.md +381 -44
- package/bundled-skills/segment-cdp/SKILL.md +817 -19
- package/bundled-skills/shopify-apps/SKILL.md +1475 -19
- package/bundled-skills/slack-bot-builder/SKILL.md +1162 -28
- package/bundled-skills/telegram-bot-builder/SKILL.md +152 -37
- package/bundled-skills/telegram-mini-app/SKILL.md +445 -44
- package/bundled-skills/trigger-dev/SKILL.md +916 -27
- package/bundled-skills/twilio-communications/SKILL.md +1310 -28
- package/bundled-skills/upstash-qstash/SKILL.md +898 -27
- package/bundled-skills/vercel-deployment/SKILL.md +637 -39
- package/bundled-skills/viral-generator-builder/SKILL.md +132 -37
- package/bundled-skills/voice-agents/SKILL.md +937 -27
- package/bundled-skills/voice-ai-development/SKILL.md +375 -46
- package/bundled-skills/workflow-automation/SKILL.md +982 -29
- package/bundled-skills/zapier-make-patterns/SKILL.md +772 -27
- package/package.json +1 -1
@@ -1,13 +1,20 @@
 ---
 name: computer-use-agents
-description:
+description: Build AI agents that interact with computers like humans do -
+viewing screens, moving cursors, clicking buttons, and typing text. Covers
+Anthropic's Computer Use, OpenAI's Operator/CUA, and open-source alternatives.
 risk: unknown
-source:
-date_added:
+source: vibeship-spawner-skills (Apache 2.0)
+date_added: 2026-02-27
 ---

 # Computer Use Agents

+Build AI agents that interact with computers like humans do - viewing screens,
+moving cursors, clicking buttons, and typing text. Covers Anthropic's Computer
+Use, OpenAI's Operator/CUA, and open-source alternatives. Critical focus on
+sandboxing, security, and handling the unique challenges of vision-based control.
+
 ## Patterns

 ### Perception-Reasoning-Action Loop
@@ -25,10 +32,8 @@ Key components:
 Critical insight: Vision agents are completely still during "thinking"
 phase (1-5 seconds), creating a detectable pause pattern.

+**When to use**: Building any computer use agent from scratch,Integrating vision models with desktop control,Understanding agent behavior patterns

-**When to use**: ['Building any computer use agent from scratch', 'Integrating vision models with desktop control', 'Understanding agent behavior patterns']
-
-```python
 from anthropic import Anthropic
 from PIL import Image
 import base64
@@ -83,8 +88,116 @@ class ComputerUseAgent:
             amount = action.get("amount", 3)
             scroll = -amount if direction == "down" else amount
             pyautogui.scroll(scroll)
-            return {"success": True, "action": f"scrolled {
-
+            return {"success": True, "action": f"scrolled {direction}"}
+
+        elif action_type == "move":
+            x, y = action["x"], action["y"]
+            pyautogui.moveTo(x, y)
+            return {"success": True, "action": f"moved to ({x}, {y})"}
+
+        else:
+            return {"success": False, "error": f"Unknown action: {action_type}"}
+
+    def run(self, task: str) -> dict:
+        """
+        Run perception-reasoning-action loop until task complete.
+
+        The loop:
+        1. Screenshot current state
+        2. Send to vision model with task context
+        3. Parse action from response
+        4. Execute action
+        5. Repeat until done or max steps
+        """
+        messages = []
+        step_count = 0
+
+        system_prompt = """You are a computer use agent. You can see the screen
+        and control mouse/keyboard.
+
+        Available actions (respond with JSON):
+        - {"type": "click", "x": 100, "y": 200, "button": "left"}
+        - {"type": "type", "text": "hello world"}
+        - {"type": "key", "key": "enter"}
+        - {"type": "scroll", "direction": "down", "amount": 3}
+        - {"type": "done", "result": "task completed successfully"}
+
+        Always respond with ONLY a JSON action object.
+        Be precise with coordinates - click exactly where needed.
+        If you see an error, try to recover.
+        """
+
+        while step_count < self.max_steps:
+            step_count += 1
+
+            # 1. PERCEPTION: Capture current screen
+            screenshot_b64 = self.capture_screenshot()
+
+            # 2. REASONING: Send to vision model
+            user_content = [
+                {"type": "text", "text": f"Task: {task}\n\nStep {step_count}. What action should I take?"},
+                {"type": "image", "source": {
+                    "type": "base64",
+                    "media_type": "image/png",
+                    "data": screenshot_b64
+                }}
+            ]
+
+            messages.append({"role": "user", "content": user_content})
+
+            response = self.client.messages.create(
+                model=self.model,
+                max_tokens=1024,
+                system=system_prompt,
+                messages=messages
+            )
+
+            assistant_message = response.content[0].text
+            messages.append({"role": "assistant", "content": assistant_message})
+
+            # 3. Parse action from response
+            import json
+            try:
+                action = json.loads(assistant_message)
+            except json.JSONDecodeError:
+                # Try to extract JSON from response
+                import re
+                match = re.search(r'\{[^}]+\}', assistant_message)
+                if match:
+                    action = json.loads(match.group())
+                else:
+                    continue
+
+            # Check if done
+            if action.get("type") == "done":
+                return {
+                    "success": True,
+                    "result": action.get("result"),
+                    "steps": step_count
+                }
+
+            # 4. ACTION: Execute
+            result = self.execute_action(action)
+
+            # Small delay for UI to update
+            time.sleep(self.action_delay)
+
+        return {
+            "success": False,
+            "error": "Max steps reached",
+            "steps": step_count
+        }
+
+# Usage
+agent = ComputerUseAgent(Anthropic())
+result = agent.run("Open Chrome and search for 'weather today'")
+
+### Anti_patterns
+
+- Running without step limits (infinite loops)
+- No delay between actions (UI can't keep up)
+- Screenshots at full resolution (token explosion)
+- Ignoring action failures (no recovery)
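
The fallback in the added `run()` loop extracts JSON with the pattern `\{[^}]+\}`, which stops at the first closing brace and so cannot match an action that contains a nested object. A minimal sketch of a more tolerant parser follows, assuming a hypothetical helper named `extract_action` (not part of the package) that lets the standard `json` decoder decide where each object ends:

```python
import json
from typing import Optional


def extract_action(text: str) -> Optional[dict]:
    """Return the first JSON object found in `text`, or None.

    Hypothetical helper: tries each '{' in turn and lets the JSON decoder
    find the matching end, so nested braces are handled correctly.
    """
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # This '{' did not start a valid object; try the next one.
        start = text.find("{", start + 1)
    return None
```

In a loop like the one in this hunk, a `None` result could be reported back to the model as a corrective message rather than silently skipping the step, which would also address the "ignoring action failures" anti-pattern.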

 ### Sandboxed Environment Pattern

@@ -102,10 +215,8 @@ Key isolation requirements:
 The goal is "blast radius minimization" - if the agent goes wrong,
 damage is contained to the sandbox.

+**When to use**: Deploying any computer use agent,Testing agent behavior safely,Running untrusted automation tasks

-**When to use**: ['Deploying any computer use agent', 'Testing agent behavior safely', 'Running untrusted automation tasks']
-
-```python
 # Dockerfile for sandboxed computer use environment
 # Based on Anthropic's reference implementation pattern

@@ -208,8 +319,89 @@ volumes:
 # Python wrapper with additional runtime sandboxing
 import subprocess
 import os
-from dataclasses
-
+from dataclasses import dataclass
+from typing import Optional
+
+@dataclass
+class SandboxConfig:
+    """Configuration for agent sandbox."""
+    network_allowed: list[str] = None # Allowed domains
+    max_runtime_seconds: int = 300
+    max_memory_mb: int = 2048
+    allow_downloads: bool = False
+    allow_clipboard: bool = False
+
+class SandboxedAgent:
+    """
+    Run computer use agent in Docker sandbox.
+    """
+
+    def __init__(self, config: SandboxConfig):
+        self.config = config
+        self.container_id: Optional[str] = None
+
+    def start(self):
+        """Start sandboxed environment."""
+        # Build network rules
+        network_rules = ""
+        if self.config.network_allowed:
+            for domain in self.config.network_allowed:
+                network_rules += f"--add-host={domain}:$(dig +short {domain}) "
+        else:
+            network_rules = "--network=none"
+
+        cmd = f"""
+        docker run -d \
+            --name computer-use-sandbox-$$ \
+            --security-opt no-new-privileges \
+            --cap-drop ALL \
+            --memory {self.config.max_memory_mb}m \
+            --cpus 2 \
+            --read-only \
+            --tmpfs /tmp \
+            {network_rules} \
+            computer-use-agent:latest
+        """
+
+        result = subprocess.run(cmd, shell=True, capture_output=True)
+        self.container_id = result.stdout.decode().strip()
+
+        # Set up kill timer
+        subprocess.Popen([
+            "sh", "-c",
+            f"sleep {self.config.max_runtime_seconds} && docker kill {self.container_id}"
+        ])
+
+        return self.container_id
+
+    def execute_task(self, task: str) -> dict:
+        """Execute task in sandbox."""
+        if not self.container_id:
+            self.start()
+
+        # Send task to agent via API
+        import requests
+        response = requests.post(
+            f"http://localhost:8080/task",
+            json={"task": task},
+            timeout=self.config.max_runtime_seconds
+        )
+
+        return response.json()
+
+    def stop(self):
+        """Stop and remove sandbox."""
+        if self.container_id:
+            subprocess.run(f"docker rm -f {self.container_id}", shell=True)
+            self.container_id = None
+
+### Anti_patterns
+
+- Running agents on host system directly
+- Giving sandbox full network access
+- Running as root in container
+- No resource limits (denial of service)
+- Persistent storage (data can leak between runs)
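
The added `start()` method assembles a shell string and runs it with `shell=True`, so every config value passes through a shell. The sketch below shows one possible alternative, building the same `docker run` flags as an argument list. `SandboxLimits` and `build_docker_args` are hypothetical names, and the hostname-to-IP map is assumed to be resolved by the caller. Note that `--add-host` only writes hosts-file entries and does not block other destinations on its own; real allow-listing would need firewall or proxy rules, which this sketch does not attempt.

```python
from dataclasses import dataclass, field


@dataclass
class SandboxLimits:
    """Hypothetical stand-in for the SandboxConfig fields used here."""
    max_memory_mb: int = 2048
    cpus: int = 2
    # Allowed hostname -> IP address, resolved by the caller beforehand.
    allowed_hosts: dict[str, str] = field(default_factory=dict)


def build_docker_args(name: str, image: str, limits: SandboxLimits) -> list[str]:
    """Build `docker run` arguments as a list (no shell interpolation)."""
    args = [
        "docker", "run", "-d",
        "--name", name,
        "--security-opt", "no-new-privileges",
        "--cap-drop", "ALL",
        "--memory", f"{limits.max_memory_mb}m",
        "--cpus", str(limits.cpus),
        "--read-only",
        "--tmpfs", "/tmp",
    ]
    if limits.allowed_hosts:
        for host, ip in limits.allowed_hosts.items():
            args += ["--add-host", f"{host}:{ip}"]
    else:
        # No allow-list configured: disable networking entirely.
        args.append("--network=none")
    args.append(image)
    return args
```

The resulting list could be passed to `subprocess.run(args, capture_output=True, text=True)` without `shell=True`.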

 ### Anthropic Computer Use Implementation

@@ -231,10 +423,8 @@ Tool versions:
 Critical limitation: "Some UI elements (like dropdowns and scrollbars)
 might be tricky for Claude to manipulate" - Anthropic docs

+**When to use**: Building production computer use agents,Need highest quality vision understanding,Full desktop control (not just browser)

-**When to use**: ['Building production computer use agents', 'Need highest quality vision understanding', 'Full desktop control (not just browser)']
-
-```python
 from anthropic import Anthropic
 from anthropic.types.beta import (
     BetaToolComputerUse20241022,
@@ -301,20 +491,1672 @@ class AnthropicComputerUse:
             subprocess.run(["scrot", "/tmp/screenshot.png"])

             with open("/tmp/screenshot.png", "rb") as f:
-
+                img_data = f.read()
+
+            # Resize for efficiency
+            img = Image.open(io.BytesIO(img_data))
+            img = img.resize(self.screen_size, Image.LANCZOS)
+
+            buffer = io.BytesIO()
+            img.save(buffer, format="PNG")
+
+            return {
+                "type": "image",
+                "source": {
+                    "type": "base64",
+                    "media_type": "image/png",
+                    "data": base64.b64encode(buffer.getvalue()).decode()
+                }
+            }
+
+        elif action == "mouse_move":
+            x, y = input.get("coordinate", [0, 0])
+            subprocess.run(["xdotool", "mousemove", str(x), str(y)])
+            return {"success": True}
+
+        elif action == "left_click":
+            subprocess.run(["xdotool", "click", "1"])
+            return {"success": True}
+
+        elif action == "right_click":
+            subprocess.run(["xdotool", "click", "3"])
+            return {"success": True}
+
+        elif action == "double_click":
+            subprocess.run(["xdotool", "click", "--repeat", "2", "1"])
+            return {"success": True}
+
+        elif action == "type":
+            text = input.get("text", "")
+            # Use xdotool type with delay for reliability
+            subprocess.run(["xdotool", "type", "--delay", "50", text])
+            return {"success": True}
+
+        elif action == "key":
+            key = input.get("key", "")
+            # Map common key names
+            key_map = {
+                "return": "Return",
+                "enter": "Return",
+                "tab": "Tab",
+                "escape": "Escape",
+                "backspace": "BackSpace",
+            }
+            xdotool_key = key_map.get(key.lower(), key)
+            subprocess.run(["xdotool", "key", xdotool_key])
+            return {"success": True}
+
+        elif action == "scroll":
+            direction = input.get("direction", "down")
+            amount = input.get("amount", 3)
+            button = "5" if direction == "down" else "4"
+            for _ in range(amount):
+                subprocess.run(["xdotool", "click", button])
+            return {"success": True}
+
+        return {"error": f"Unknown action: {action}"}
+
+    def _handle_bash(self, input: dict) -> dict:
+        """Execute bash command."""
+        command = input.get("command", "")
+
+        # Security: Sanitize and limit commands
+        dangerous_patterns = ["rm -rf", "mkfs", "dd if=", "> /dev/"]
+        for pattern in dangerous_patterns:
+            if pattern in command:
+                return {"error": "Dangerous command blocked"}
+
+        try:
+            result = subprocess.run(
+                command,
+                shell=True,
+                capture_output=True,
+                text=True,
+                timeout=30
+            )
+            return {
+                "stdout": result.stdout[:10000], # Limit output
+                "stderr": result.stderr[:1000],
+                "returncode": result.returncode
+            }
+        except subprocess.TimeoutExpired:
+            return {"error": "Command timed out"}
+
+    def _handle_editor(self, input: dict) -> dict:
+        """Handle text editor operations."""
+        command = input.get("command")
+        path = input.get("path")
+
+        if command == "view":
+            try:
+                with open(path, "r") as f:
+                    content = f.read()
+                return {"content": content[:50000]} # Limit size
+            except Exception as e:
+                return {"error": str(e)}
+
+        elif command == "str_replace":
+            old_str = input.get("old_str")
+            new_str = input.get("new_str")
+            try:
+                with open(path, "r") as f:
+                    content = f.read()
+                if old_str not in content:
+                    return {"error": "old_str not found in file"}
+                content = content.replace(old_str, new_str, 1)
+                with open(path, "w") as f:
+                    f.write(content)
+                return {"success": True}
+            except Exception as e:
+                return {"error": str(e)}
+
+        return {"error": f"Unknown editor command: {command}"}
+
+    def run_task(self, task: str, max_steps: int = 50) -> dict:
+        """Run computer use task with agentic loop."""
+        messages = [{"role": "user", "content": task}]
+        tools = self.get_tools()
+
+        for step in range(max_steps):
+            response = self.client.beta.messages.create(
+                model=self.model,
+                max_tokens=4096,
+                tools=tools,
+                messages=messages,
+                betas=["computer-use-2024-10-22"]
+            )
+
+            # Check for completion
+            if response.stop_reason == "end_turn":
+                return {
+                    "success": True,
+                    "result": response.content[0].text if response.content else "",
+                    "steps": step + 1
+                }
+
+            # Handle tool use
+            if response.stop_reason == "tool_use":
+                messages.append({"role": "assistant", "content": response.content})
+
+                tool_results = []
+                for block in response.content:
+                    if block.type == "tool_use":
+                        result = self.execute_tool(block.name, block.input)
+                        tool_results.append({
+                            "type": "tool_result",
+                            "tool_use_id": block.id,
+                            "content": result
+                        })
+
+                messages.append({"role": "user", "content": tool_results})
+
+        return {"success": False, "error": "Max steps reached"}
+
+### Anti_patterns
+
+- Not using betas=['computer-use-2024-10-22'] flag
+- Full resolution screenshots (wasteful)
+- No command sanitization for bash tool
+- Unbounded execution time
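
The `_handle_bash` method added in this hunk filters commands with a substring blocklist, which misses simple variants such as `rm -r -f`. One alternative is an allow-list check, sketched below under stated assumptions: `ALLOWED_COMMANDS`, `FORBIDDEN_CHARS`, and `is_command_allowed` are hypothetical names, and the set of permitted commands is only an example.

```python
import shlex

# Hypothetical allow-list; a real deployment would choose its own.
ALLOWED_COMMANDS = {"ls", "cat", "echo", "pwd", "grep"}
# Shell metacharacters that could chain, substitute, or redirect commands.
FORBIDDEN_CHARS = set(";|&><`$\n")


def is_command_allowed(command: str) -> bool:
    """Return True only for a single allow-listed command with no shell operators."""
    if any(ch in FORBIDDEN_CHARS for ch in command):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Raised for malformed input such as unbalanced quotes.
        return False
    return bool(tokens) and tokens[0] in ALLOWED_COMMANDS
```

A command that passes this check could then be run from its `shlex.split` tokens with `subprocess.run(tokens, ...)` and no `shell=True`, so the string is never interpreted by a shell.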
+
+### Browser-Use Pattern (Playwright-based)
+
+For browser-only automation, using structured DOM access is more efficient
+than pixel-based computer use. Playwright MCP allows LLMs to control
+browsers using accessibility snapshots rather than screenshots.
+
+Advantages over vision-based:
+- Faster: No image processing required
+- Cheaper: Text tokens vs image tokens
+- More precise: Direct element targeting
+- More reliable: No coordinate drift
+
+When to use vision vs structured:
+- Vision: Desktop apps, complex UIs, visual verification
+- Structured: Web automation, form filling, data extraction
+
+**When to use**: Browser-only automation tasks,Form filling and web interactions,When speed and cost matter more than visual understanding
+
+from playwright.async_api import async_playwright
+from dataclasses import dataclass
+from typing import Optional
+import asyncio
+
+@dataclass
+class BrowserAction:
+    """Structured browser action."""
+    action: str # click, type, navigate, scroll, extract
+    selector: Optional[str] = None
+    text: Optional[str] = None
+    url: Optional[str] = None
+
+class BrowserUseAgent:
+    """
+    Browser automation using Playwright with structured commands.
+    More efficient than pixel-based for web tasks.
+    """
+
+    def __init__(self):
+        self.browser = None
+        self.page = None
+
+    async def start(self, headless: bool = True):
+        """Start browser session."""
+        self.playwright = await async_playwright().start()
+        self.browser = await self.playwright.chromium.launch(headless=headless)
+        self.page = await self.browser.new_page()
+
+    async def get_page_snapshot(self) -> dict:
+        """
+        Get structured snapshot of page for LLM.
+        Uses accessibility tree for efficiency.
+        """
+        # Get accessibility tree
+        snapshot = await self.page.accessibility.snapshot()
+
+        # Get simplified DOM info
+        elements = await self.page.evaluate('''() => {
+            const interactable = [];
+            const selector = 'a, button, input, select, textarea, [role="button"]';
+            document.querySelectorAll(selector).forEach((el, i) => {
+                const rect = el.getBoundingClientRect();
+                if (rect.width > 0 && rect.height > 0) {
+                    interactable.push({
+                        index: i,
+                        tag: el.tagName.toLowerCase(),
+                        text: el.textContent?.trim().slice(0, 100),
+                        type: el.type,
+                        placeholder: el.placeholder,
+                        name: el.name,
+                        id: el.id,
+                        class: el.className
+                    });
+                }
+            });
+            return interactable;
+        }''')
+
+        return {
+            "url": self.page.url,
+            "title": await self.page.title(),
+            "accessibility_tree": snapshot,
+            "interactable_elements": elements[:50] # Limit for token efficiency
+        }
+
+    async def execute_action(self, action: BrowserAction) -> dict:
+        """Execute structured browser action."""
+
+        try:
+            if action.action == "navigate":
+                await self.page.goto(action.url, wait_until="domcontentloaded")
+                return {"success": True, "url": self.page.url}
+
+            elif action.action == "click":
+                await self.page.click(action.selector, timeout=5000)
+                await self.page.wait_for_load_state("networkidle", timeout=5000)
+                return {"success": True}
+
+            elif action.action == "type":
+                await self.page.fill(action.selector, action.text)
+                return {"success": True}
+
+            elif action.action == "scroll":
+                direction = action.text or "down"
+                distance = 500 if direction == "down" else -500
+                await self.page.evaluate(f"window.scrollBy(0, {distance})")
+                return {"success": True}
+
+            elif action.action == "extract":
+                # Extract text content
+                if action.selector:
+                    text = await self.page.text_content(action.selector)
+                else:
+                    text = await self.page.text_content("body")
+                return {"success": True, "text": text[:5000]}
+
+            elif action.action == "screenshot":
+                # Fall back to vision when needed
+                screenshot = await self.page.screenshot(type="png")
+                import base64
+                return {
+                    "success": True,
+                    "image": base64.b64encode(screenshot).decode()
+                }
+
+        except Exception as e:
+            return {"success": False, "error": str(e)}
+
+        return {"success": False, "error": f"Unknown action: {action.action}"}
+
+    async def run_with_llm(self, task: str, llm_client, max_steps: int = 20):
+        """
+        Run browser task with LLM decision making.
+        Uses structured DOM instead of screenshots.
+        """
+
+        system_prompt = """You are a browser automation agent. You receive
+        page snapshots with interactable elements and decide actions.
+
+        Respond with JSON action:
+        - {"action": "navigate", "url": "https://..."}
+        - {"action": "click", "selector": "button.submit"}
+        - {"action": "type", "selector": "input[name='email']", "text": "..."}
+        - {"action": "scroll", "text": "down"}
+        - {"action": "extract", "selector": ".results"}
+        - {"action": "done", "result": "task completed"}
+
+        Use CSS selectors based on the element info provided.
+        Prefer id > name > class > text content for selectors.
+        """
+
+        messages = []
+
+        for step in range(max_steps):
+            # Get current page state
+            snapshot = await self.get_page_snapshot()
+
+            user_message = f"""Task: {task}
+
+            Current page:
+            URL: {snapshot['url']}
+            Title: {snapshot['title']}
+
+            Interactable elements:
+            {snapshot['interactable_elements']}
+
+            What action should I take?"""
+
+            messages.append({"role": "user", "content": user_message})
+
+            # Get LLM decision
+            response = llm_client.messages.create(
+                model="claude-sonnet-4-20250514",
+                max_tokens=1024,
+                system=system_prompt,
+                messages=messages
+            )
+
+            assistant_text = response.content[0].text
+            messages.append({"role": "assistant", "content": assistant_text})
+
+            # Parse and execute
+            import json
+            action_dict = json.loads(assistant_text)
+
+            if action_dict.get("action") == "done":
+                return {"success": True, "result": action_dict.get("result")}
+
+            action = BrowserAction(**action_dict)
+            result = await self.execute_action(action)
+
+            if not result.get("success"):
+                messages.append({
+                    "role": "user",
+                    "content": f"Action failed: {result.get('error')}"
+                })
+
+            await asyncio.sleep(0.5) # Rate limit
+
+        return {"success": False, "error": "Max steps reached"}
+
+    async def close(self):
+        """Clean up browser."""
+        if self.browser:
+            await self.browser.close()
+        if hasattr(self, 'playwright'):
+            await self.playwright.stop()
+
+# Usage
+async def main():
+    agent = BrowserUseAgent()
+    await agent.start(headless=False)
+
+    from anthropic import Anthropic
+    result = await agent.run_with_llm(
+        "Go to weather.com and find the weather for New York",
+        Anthropic()
+    )
+
+    print(result)
+    await agent.close()
+
+asyncio.run(main())
+
+### Anti_patterns
+
+- Using screenshots when DOM access works
+- Not waiting for page loads
+- Hardcoded selectors that break
+- No error recovery for stale elements
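
In the added `run_with_llm` loop, `json.loads(assistant_text)` raises on non-JSON output and `BrowserAction(**action_dict)` raises `TypeError` if the model includes an unexpected key. A minimal sketch of a more defensive parse step follows. `parse_browser_action` is a hypothetical helper, and the dataclass is repeated here only so the example is self-contained.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class BrowserAction:
    """Mirrors the BrowserAction dataclass shown in the diff above."""
    action: str
    selector: Optional[str] = None
    text: Optional[str] = None
    url: Optional[str] = None


def parse_browser_action(raw: str) -> tuple[Optional[BrowserAction], Optional[str]]:
    """Parse model output into a BrowserAction, or return an error message.

    Hypothetical helper: unknown keys are dropped instead of raising TypeError.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"Invalid JSON: {exc.msg}"
    if not isinstance(data, dict) or not isinstance(data.get("action"), str):
        return None, "Expected a JSON object with a string 'action' field"
    known = {f.name for f in fields(BrowserAction)}
    return BrowserAction(**{k: v for k, v in data.items() if k in known}), None
```

Returning the error as a string lets the caller feed it back to the model as a user message, in the same way the loop already reports failed actions.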
|
|
891
|
+
|
|
892
|
+
### User Confirmation Pattern
|
|
893
|
+
|
|
894
|
+
For sensitive actions, agents should pause and ask for human confirmation.
|
|
895
|
+
"ChatGPT agent also pauses and asks for confirmation prior to taking
|
|
896
|
+
sensitive steps such as completing a purchase."
|
|
897
|
+
|
|
898
|
+
Sensitivity levels:
|
|
899
|
+
1. LOW: Navigation, reading (auto-approve)
|
|
900
|
+
2. MEDIUM: Form filling, clicking (log, maybe confirm)
|
|
901
|
+
3. HIGH: Purchases, authentication, file operations (always confirm)
|
|
902
|
+
4. CRITICAL: Credential entry, financial transactions (confirm + review)
|
|
903
|
+
|
|
904
|
+
**When to use**: Actions with real-world consequences,Financial transactions,Authentication flows,File modifications
|
|
905
|
+
|
|
906
|
+
from enum import Enum
|
|
907
|
+
from dataclasses import dataclass
|
|
908
|
+
from typing import Callable, Optional
|
|
909
|
+
import asyncio
|
|
910
|
+
|
|
911
|
+
class ActionSeverity(Enum):
|
|
912
|
+
LOW = "low" # Auto-approve
|
|
913
|
+
MEDIUM = "medium" # Log, optional confirm
|
|
914
|
+
HIGH = "high" # Always confirm
|
|
915
|
+
CRITICAL = "critical" # Confirm + review details
|
|
916
|
+
|
|
917
|
+
@dataclass
|
|
918
|
+
class SensitiveAction:
|
|
919
|
+
"""Action that may need user confirmation."""
|
|
920
|
+
action_type: str
|
|
921
|
+
description: str
|
|
922
|
+
severity: ActionSeverity
|
|
923
|
+
details: dict
|
|
924
|
+
|
|
925
|
+
class ConfirmationGate:
|
|
926
|
+
"""
|
|
927
|
+
Gate sensitive actions through user confirmation.
|
|
928
|
+
"""
|
|
929
|
+
|
|
930
|
+
# Action type -> severity mapping
|
|
931
|
+
ACTION_SEVERITY = {
|
|
932
|
+
# LOW - auto-approve
|
|
933
|
+
"navigate": ActionSeverity.LOW,
|
|
934
|
+
"scroll": ActionSeverity.LOW,
|
|
935
|
+
"read": ActionSeverity.LOW,
|
|
936
|
+
"screenshot": ActionSeverity.LOW,
|
|
937
|
+
|
|
938
|
+
# MEDIUM - log and maybe confirm
|
|
939
|
+
"click": ActionSeverity.MEDIUM,
|
|
940
|
+
"type": ActionSeverity.MEDIUM,
|
|
941
|
+
"search": ActionSeverity.MEDIUM,
|
|
942
|
+
|
|
943
|
+
# HIGH - always confirm
|
|
944
|
+
"download": ActionSeverity.HIGH,
|
|
945
|
+
"submit_form": ActionSeverity.HIGH,
|
|
946
|
+
"login": ActionSeverity.HIGH,
|
|
947
|
+
"file_write": ActionSeverity.HIGH,
|
|
948
|
+
|
|
949
|
+
# CRITICAL - confirm with full review
|
|
950
|
+
"purchase": ActionSeverity.CRITICAL,
|
|
951
|
+
"enter_password": ActionSeverity.CRITICAL,
|
|
952
|
+
"enter_credit_card": ActionSeverity.CRITICAL,
|
|
953
|
+
"send_money": ActionSeverity.CRITICAL,
|
|
954
|
+
"delete": ActionSeverity.CRITICAL,
|
|
955
|
+
}
|
|
956
|
+
|
|
957
|
+
def __init__(
|
|
958
|
+
self,
|
|
959
|
+
confirm_callback: Optional[Callable[[SensitiveAction], bool]] = None,
|
|
960
|
+
auto_confirm_low: bool = True,
|
|
961
|
+
auto_confirm_medium: bool = False
|
|
962
|
+
):
|
|
963
|
+
self.confirm_callback = confirm_callback or self._default_confirm
|
|
964
|
+
self.auto_confirm_low = auto_confirm_low
|
|
965
|
+
self.auto_confirm_medium = auto_confirm_medium
|
|
966
|
+
self.action_log = []
|
|
967
|
+
|
|
968
|
+
def _default_confirm(self, action: SensitiveAction) -> bool:
|
|
969
|
+
"""Default confirmation via CLI prompt."""
|
|
970
|
+
print(f"\n{'='*60}")
|
|
971
|
+
print("ACTION CONFIRMATION REQUIRED")
|
|
972
|
+
print(f"{'='*60}")
|
|
973
|
+
print(f"Type: {action.action_type}")
|
|
974
|
+
print(f"Severity: {action.severity.value.upper()}")
|
|
975
|
+
print(f"Description: {action.description}")
|
|
976
|
+
print(f"Details: {action.details}")
|
|
977
|
+
print(f"{'='*60}")
|
|
978
|
+
|
|
979
|
+
while True:
|
|
980
|
+
response = input("Allow this action? [y/n]: ").lower().strip()
|
|
981
|
+
if response in ['y', 'yes']:
|
|
982
|
+
return True
|
|
983
|
+
elif response in ['n', 'no']:
|
|
984
|
+
return False
|
|
985
|
+
|
|
986
|
+
def classify_action(self, action_type: str, context: dict) -> ActionSeverity:
|
|
987
|
+
"""Classify action severity, considering context."""
|
|
988
|
+
base_severity = self.ACTION_SEVERITY.get(action_type, ActionSeverity.MEDIUM)
|
|
989
|
+
|
|
990
|
+
# Escalate based on context
|
|
991
|
+
if context.get("involves_credentials"):
|
|
992
|
+
return ActionSeverity.CRITICAL
|
|
993
|
+
if context.get("involves_money"):
|
|
994
|
+
return ActionSeverity.CRITICAL
|
|
995
|
+
if context.get("irreversible"):
|
|
996
|
+
return max(base_severity, ActionSeverity.HIGH, key=list(ActionSeverity).index)  # rank by definition order; the string values sort alphabetically
|
|
997
|
+
|
|
998
|
+
return base_severity
|
|
999
|
+
|
|
1000
|
+
def check_action(
|
|
1001
|
+
self,
|
|
1002
|
+
action_type: str,
|
|
1003
|
+
description: str,
|
|
1004
|
+
details: Optional[dict] = None
|
|
1005
|
+
) -> tuple[bool, str]:
|
|
1006
|
+
"""
|
|
1007
|
+
Check if action should proceed.
|
|
1008
|
+
Returns (approved, reason).
|
|
1009
|
+
"""
|
|
1010
|
+
details = details or {}
|
|
1011
|
+
severity = self.classify_action(action_type, details)
|
|
1012
|
+
|
|
1013
|
+
action = SensitiveAction(
|
|
1014
|
+
action_type=action_type,
|
|
1015
|
+
description=description,
|
|
1016
|
+
severity=severity,
|
|
1017
|
+
details=details
|
|
1018
|
+
)
|
|
1019
|
+
|
|
1020
|
+
# Log all actions
|
|
1021
|
+
self.action_log.append({
|
|
1022
|
+
"action": action,
|
|
1023
|
+
"timestamp": __import__('datetime').datetime.now().isoformat()
|
|
1024
|
+
})
|
|
1025
|
+
|
|
1026
|
+
# Auto-approve low severity
|
|
1027
|
+
if severity == ActionSeverity.LOW and self.auto_confirm_low:
|
|
1028
|
+
return True, "auto-approved (low severity)"
|
|
1029
|
+
|
|
1030
|
+
# Maybe auto-approve medium
|
|
1031
|
+
if severity == ActionSeverity.MEDIUM and self.auto_confirm_medium:
|
|
1032
|
+
return True, "auto-approved (medium severity)"
|
|
1033
|
+
|
|
1034
|
+
# Request confirmation
|
|
1035
|
+
approved = self.confirm_callback(action)
|
|
1036
|
+
|
|
1037
|
+
if approved:
|
|
1038
|
+
return True, "user approved"
|
|
1039
|
+
else:
|
|
1040
|
+
return False, "user rejected"
|
|
1041
|
+
|
|
1042
|
+
class ConfirmedComputerUseAgent:
|
|
1043
|
+
"""
|
|
1044
|
+
Computer use agent with confirmation gates.
|
|
1045
|
+
"""
|
|
1046
|
+
|
|
1047
|
+
def __init__(self, base_agent, confirmation_gate: ConfirmationGate):
|
|
1048
|
+
self.agent = base_agent
|
|
1049
|
+
self.gate = confirmation_gate
|
|
1050
|
+
|
|
1051
|
+
def execute_action(self, action: dict) -> dict:
|
|
1052
|
+
"""Execute action with confirmation check."""
|
|
1053
|
+
action_type = action.get("type", "unknown")
|
|
1054
|
+
|
|
1055
|
+
# Build description
|
|
1056
|
+
if action_type == "click":
|
|
1057
|
+
desc = f"Click at ({action.get('x')}, {action.get('y')})"
|
|
1058
|
+
elif action_type == "type":
|
|
1059
|
+
text = action.get('text', '')
|
|
1060
|
+
# Mask if looks like password
|
|
1061
|
+
if self._looks_sensitive(text):
|
|
1062
|
+
desc = f"Type sensitive text ({len(text)} chars)"
|
|
1063
|
+
else:
|
|
1064
|
+
desc = f"Type: {text[:50]}{'...' if len(text) > 50 else ''}"
|
|
1065
|
+
else:
|
|
1066
|
+
desc = f"Execute: {action_type}"
|
|
1067
|
+
|
|
1068
|
+
# Context for severity classification
|
|
1069
|
+
context = {
|
|
1070
|
+
"involves_credentials": self._looks_sensitive(action.get("text", "")),
|
|
1071
|
+
"involves_money": self._mentions_money(action),
|
|
1072
|
+
}
|
|
1073
|
+
|
|
1074
|
+
# Check with gate
|
|
1075
|
+
approved, reason = self.gate.check_action(
|
|
1076
|
+
action_type, desc, context
|
|
1077
|
+
)
|
|
1078
|
+
|
|
1079
|
+
if not approved:
|
|
1080
|
+
return {
|
|
1081
|
+
"success": False,
|
|
1082
|
+
"error": f"Action blocked: {reason}",
|
|
1083
|
+
"action": action_type
|
|
1084
|
+
}
|
|
1085
|
+
|
|
1086
|
+
# Execute if approved
|
|
1087
|
+
return self.agent.execute_action(action)
|
|
1088
|
+
|
|
1089
|
+
def _looks_sensitive(self, text: str) -> bool:
|
|
1090
|
+
"""Check if text looks like sensitive data."""
|
|
1091
|
+
if not text:
|
|
1092
|
+
return False
|
|
1093
|
+
# Common patterns
|
|
1094
|
+
patterns = [
|
|
1095
|
+
r'\b\d{16}\b', # Credit card
|
|
1096
|
+
r'\b\d{3,4}\b.*\b\d{3,4}\b', # CVV-like
|
|
1097
|
+
r'password',
|
|
1098
|
+
r'secret',
|
|
1099
|
+
r'api.?key',
|
|
1100
|
+
r'token'
|
|
1101
|
+
]
|
|
1102
|
+
import re
|
|
1103
|
+
return any(re.search(p, text.lower()) for p in patterns)
|
|
1104
|
+
|
|
1105
|
+
def _mentions_money(self, action: dict) -> bool:
|
|
1106
|
+
"""Check if action involves money."""
|
|
1107
|
+
text = str(action)
|
|
1108
|
+
money_patterns = [
|
|
1109
|
+
r'\$\d+', r'pay', r'purchase', r'buy', r'checkout',
|
|
1110
|
+
r'credit', r'debit', r'invoice', r'payment'
|
|
1111
|
+
]
|
|
1112
|
+
import re
|
|
1113
|
+
return any(re.search(p, text.lower()) for p in money_patterns)
|
|
1114
|
+
|
|
1115
|
+
# Usage: base_agent is assumed to be any object with an execute_action(action: dict) -> dict method
|
|
1116
|
+
gate = ConfirmationGate(
|
|
1117
|
+
auto_confirm_low=True,
|
|
1118
|
+
auto_confirm_medium=False # Confirm clicks, typing
|
|
1119
|
+
)
|
|
1120
|
+
|
|
1121
|
+
agent = ConfirmedComputerUseAgent(base_agent, gate)
|
|
1122
|
+
result = agent.execute_action({"type": "click", "x": 500, "y": 300})
|
|
1123
|
+
|
|
1124
|
+
### Anti-patterns
|
|
1125
|
+
|
|
1126
|
+
- Auto-approving all actions
|
|
1127
|
+
- Not logging rejected actions
|
|
1128
|
+
- Showing full passwords in confirmation
|
|
1129
|
+
- No timeout on confirmation (hangs forever)
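
One way to avoid the hanging prompt is to run the confirmation callback in a worker thread and treat silence as a rejection. This is a sketch under assumptions: `confirm_with_timeout` is a hypothetical helper, and `ask` stands for any zero-argument callable that returns the user's decision.

```python
import threading
from typing import Callable


def confirm_with_timeout(ask: Callable[[], bool], timeout_s: float = 30.0) -> bool:
    """Run `ask` in a worker thread; no answer within `timeout_s` counts as a rejection."""
    answer = {"approved": False}

    def worker() -> None:
        answer["approved"] = bool(ask())

    # A daemon thread cannot keep the process alive if the prompt never returns.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)
    if thread.is_alive():
        return False  # timed out: fail closed
    return answer["approved"]
```

Failing closed means an unattended agent stops instead of proceeding. Note that a blocked `input()` call stays pending in the abandoned thread, so a UI-based prompt that can be cancelled is likely a better fit for production.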
|
|
1130
|
+
|
|
1131
|
+
### Action Logging Pattern
|
|
1132
|
+
|
|
1133
|
+
All computer use agent actions should be logged for:
|
|
1134
|
+
1. Debugging failed automations
|
|
1135
|
+
2. Security auditing
|
|
1136
|
+
3. Reproducibility
|
|
1137
|
+
4. Compliance requirements
|
|
1138
|
+
|
|
1139
|
+
Log format should capture:
|
|
1140
|
+
- Timestamp
|
|
1141
|
+
- Action type and parameters
|
|
1142
|
+
- Screenshot before/after
|
|
1143
|
+
- Success/failure status
|
|
1144
|
+
- Model reasoning (if available)
|
|
1145
|
+
|
|
1146
|
+
**When to use**: Production computer use deployments, debugging automation failures, security-sensitive environments
|
|
1147
|
+
|
|
1148
|
+
from dataclasses import dataclass, field
|
|
1149
|
+
from datetime import datetime
|
|
1150
|
+
from typing import Optional, Any
|
|
1151
|
+
import json
|
|
1152
|
+
import os
|
|
1153
|
+
|
|
1154
|
+
@dataclass
|
|
1155
|
+
class ActionLogEntry:
|
|
1156
|
+
"""Single action log entry."""
|
|
1157
|
+
timestamp: datetime
|
|
1158
|
+
action_type: str
|
|
1159
|
+
parameters: dict
|
|
1160
|
+
success: bool
|
|
1161
|
+
error: Optional[str] = None
|
|
1162
|
+
screenshot_before: Optional[str] = None # Path to screenshot
|
|
1163
|
+
screenshot_after: Optional[str] = None
|
|
1164
|
+
model_reasoning: Optional[str] = None
|
|
1165
|
+
duration_ms: Optional[int] = None
|
|
1166
|
+
|
|
1167
|
+
def to_dict(self) -> dict:
|
|
1168
|
+
return {
|
|
1169
|
+
"timestamp": self.timestamp.isoformat(),
|
|
1170
|
+
"action_type": self.action_type,
|
|
1171
|
+
"parameters": self._sanitize_params(self.parameters),
|
|
1172
|
+
"success": self.success,
|
|
1173
|
+
"error": self.error,
|
|
1174
|
+
"screenshot_before": self.screenshot_before,
|
|
1175
|
+
"screenshot_after": self.screenshot_after,
|
|
1176
|
+
"model_reasoning": self.model_reasoning,
|
|
1177
|
+
"duration_ms": self.duration_ms
|
|
1178
|
+
}
|
|
1179
|
+
|
|
1180
|
+
def _sanitize_params(self, params: dict) -> dict:
|
|
1181
|
+
"""Remove sensitive data from params."""
|
|
1182
|
+
sanitized = {}
|
|
1183
|
+
sensitive_keys = ['password', 'secret', 'token', 'key', 'credit_card']
|
|
1184
|
+
|
|
1185
|
+
for k, v in params.items():
|
|
1186
|
+
if any(s in k.lower() for s in sensitive_keys):
|
|
1187
|
+
sanitized[k] = "[REDACTED]"
|
|
1188
|
+
elif isinstance(v, str) and len(v) > 100:
|
|
1189
|
+
sanitized[k] = v[:100] + "...[truncated]"
|
|
1190
|
+
else:
|
|
1191
|
+
sanitized[k] = v
|
|
1192
|
+
|
|
1193
|
+
return sanitized
|
|
1194
|
+
|
|
1195
|
+
@dataclass
|
|
1196
|
+
class TaskSession:
|
|
1197
|
+
"""A complete task execution session."""
|
|
1198
|
+
session_id: str
|
|
1199
|
+
task: str
|
|
1200
|
+
start_time: datetime
|
|
1201
|
+
end_time: Optional[datetime] = None
|
|
1202
|
+
actions: list[ActionLogEntry] = field(default_factory=list)
|
|
1203
|
+
success: bool = False
|
|
1204
|
+
final_result: Optional[str] = None
|
|
1205
|
+
|
|
1206
|
+
class ActionLogger:
|
|
1207
|
+
"""
|
|
1208
|
+
Comprehensive action logging for computer use agents.
|
|
1209
|
+
"""
|
|
1210
|
+
|
|
1211
|
+
def __init__(self, log_dir: str = "./agent_logs"):
|
|
1212
|
+
self.log_dir = log_dir
|
|
1213
|
+
self.screenshot_dir = os.path.join(log_dir, "screenshots")
|
|
1214
|
+
os.makedirs(self.screenshot_dir, exist_ok=True)
|
|
1215
|
+
|
|
1216
|
+
self.current_session: Optional[TaskSession] = None
|
|
1217
|
+
|
|
1218
|
+
def start_session(self, task: str) -> str:
|
|
1219
|
+
"""Start a new task session."""
|
|
1220
|
+
import uuid
|
|
1221
|
+
session_id = str(uuid.uuid4())[:8]
|
|
1222
|
+
|
|
1223
|
+
self.current_session = TaskSession(
|
|
1224
|
+
session_id=session_id,
|
|
1225
|
+
task=task,
|
|
1226
|
+
start_time=datetime.now()
|
|
1227
|
+
)
|
|
1228
|
+
|
|
1229
|
+
return session_id
|
|
1230
|
+
|
|
1231
|
+
def log_action(
|
|
1232
|
+
self,
|
|
1233
|
+
action_type: str,
|
|
1234
|
+
parameters: dict,
|
|
1235
|
+
success: bool,
|
|
1236
|
+
error: Optional[str] = None,
|
|
1237
|
+
screenshot_before: Optional[bytes] = None,
|
|
1238
|
+
screenshot_after: Optional[bytes] = None,
|
|
1239
|
+
model_reasoning: Optional[str] = None,
|
|
1240
|
+
duration_ms: Optional[int] = None
|
|
1241
|
+
):
|
|
1242
|
+
"""Log a single action."""
|
|
1243
|
+
if not self.current_session:
|
|
1244
|
+
raise RuntimeError("No active session")
|
|
1245
|
+
|
|
1246
|
+
# Save screenshots if provided
|
|
1247
|
+
screenshot_paths = {}
|
|
1248
|
+
timestamp_str = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
|
|
1249
|
+
|
|
1250
|
+
if screenshot_before:
|
|
1251
|
+
path = os.path.join(
|
|
1252
|
+
self.screenshot_dir,
|
|
1253
|
+
f"{self.current_session.session_id}_{timestamp_str}_before.png"
|
|
1254
|
+
)
|
|
1255
|
+
with open(path, "wb") as f:
|
|
1256
|
+
f.write(screenshot_before)
|
|
1257
|
+
screenshot_paths["before"] = path
|
|
1258
|
+
|
|
1259
|
+
if screenshot_after:
|
|
1260
|
+
path = os.path.join(
|
|
1261
|
+
self.screenshot_dir,
|
|
1262
|
+
f"{self.current_session.session_id}_{timestamp_str}_after.png"
|
|
1263
|
+
)
|
|
1264
|
+
with open(path, "wb") as f:
|
|
1265
|
+
f.write(screenshot_after)
|
|
1266
|
+
screenshot_paths["after"] = path
|
|
1267
|
+
|
|
1268
|
+
# Create log entry
|
|
1269
|
+
entry = ActionLogEntry(
|
|
1270
|
+
timestamp=datetime.now(),
|
|
1271
|
+
action_type=action_type,
|
|
1272
|
+
parameters=parameters,
|
|
1273
|
+
success=success,
|
|
1274
|
+
error=error,
|
|
1275
|
+
screenshot_before=screenshot_paths.get("before"),
|
|
1276
|
+
screenshot_after=screenshot_paths.get("after"),
|
|
1277
|
+
model_reasoning=model_reasoning,
|
|
1278
|
+
duration_ms=duration_ms
|
|
1279
|
+
)
|
|
1280
|
+
|
|
1281
|
+
self.current_session.actions.append(entry)
|
|
1282
|
+
|
|
1283
|
+
# Also append to running log file
|
|
1284
|
+
self._append_to_log(entry)
|
|
1285
|
+
|
|
1286
|
+
def _append_to_log(self, entry: ActionLogEntry):
|
|
1287
|
+
"""Append entry to JSONL log file."""
|
|
1288
|
+
log_file = os.path.join(
|
|
1289
|
+
self.log_dir,
|
|
1290
|
+
f"session_{self.current_session.session_id}.jsonl"
|
|
1291
|
+
)
|
|
1292
|
+
|
|
1293
|
+
with open(log_file, "a") as f:
|
|
1294
|
+
f.write(json.dumps(entry.to_dict()) + "\n")
|
|
1295
|
+
|
|
1296
|
+
def end_session(self, success: bool, result: Optional[str] = None):
|
|
1297
|
+
"""End current session."""
|
|
1298
|
+
if not self.current_session:
|
|
1299
|
+
return
|
|
1300
|
+
|
|
1301
|
+
self.current_session.end_time = datetime.now()
|
|
1302
|
+
self.current_session.success = success
|
|
1303
|
+
self.current_session.final_result = result
|
|
1304
|
+
|
|
1305
|
+
# Write session summary
|
|
1306
|
+
summary_file = os.path.join(
|
|
1307
|
+
self.log_dir,
|
|
1308
|
+
f"session_{self.current_session.session_id}_summary.json"
|
|
1309
|
+
)
|
|
1310
|
+
|
|
1311
|
+
summary = {
|
|
1312
|
+
"session_id": self.current_session.session_id,
|
|
1313
|
+
"task": self.current_session.task,
|
|
1314
|
+
"start_time": self.current_session.start_time.isoformat(),
|
|
1315
|
+
"end_time": self.current_session.end_time.isoformat(),
|
|
1316
|
+
"duration_seconds": (
|
|
1317
|
+
self.current_session.end_time -
|
|
1318
|
+
self.current_session.start_time
|
|
1319
|
+
).total_seconds(),
|
|
1320
|
+
"total_actions": len(self.current_session.actions),
|
|
1321
|
+
"successful_actions": sum(
|
|
1322
|
+
1 for a in self.current_session.actions if a.success
|
|
1323
|
+
),
|
|
1324
|
+
"failed_actions": sum(
|
|
1325
|
+
1 for a in self.current_session.actions if not a.success
|
|
1326
|
+
),
|
|
1327
|
+
"success": success,
|
|
1328
|
+
"final_result": result
|
|
1329
|
+
}
|
|
1330
|
+
|
|
1331
|
+
with open(summary_file, "w") as f:
|
|
1332
|
+
json.dump(summary, f, indent=2)
|
|
1333
|
+
|
|
1334
|
+
self.current_session = None
|
|
1335
|
+
|
|
1336
|
+
def get_session_replay(self, session_id: str) -> list[dict]:
|
|
1337
|
+
"""Get all actions from a session for replay/debugging."""
|
|
1338
|
+
log_file = os.path.join(self.log_dir, f"session_{session_id}.jsonl")
|
|
1339
|
+
|
|
1340
|
+
actions = []
|
|
1341
|
+
with open(log_file, "r") as f:
|
|
1342
|
+
for line in f:
|
|
1343
|
+
actions.append(json.loads(line))
|
|
1344
|
+
|
|
1345
|
+
return actions
|
|
1346
|
+
|
|
1347
|
+
# Integration with agent
|
|
1348
|
+
class LoggedComputerUseAgent:
|
|
1349
|
+
"""Computer use agent with comprehensive logging."""
|
|
1350
|
+
|
|
1351
|
+
def __init__(self, base_agent, logger: ActionLogger):
|
|
1352
|
+
self.agent = base_agent
|
|
1353
|
+
self.logger = logger
|
|
1354
|
+
|
|
1355
|
+
def run_task(self, task: str) -> dict:
|
|
1356
|
+
"""Run task with full logging."""
|
|
1357
|
+
session_id = self.logger.start_session(task)
|
|
1358
|
+
|
|
1359
|
+
try:
|
|
1360
|
+
result = self._run_with_logging(task)
|
|
1361
|
+
self.logger.end_session(
|
|
1362
|
+
success=result.get("success", False),
|
|
1363
|
+
result=result.get("result")
|
|
1364
|
+
)
|
|
1365
|
+
return result
|
|
1366
|
+
except Exception as e:
|
|
1367
|
+
self.logger.end_session(success=False, result=str(e))
|
|
1368
|
+
raise
|
|
1369
|
+
|
|
1370
|
+
def _run_with_logging(self, task: str) -> dict:
|
|
1371
|
+
"""Internal run with action logging."""
|
|
1372
|
+
# This would wrap the base agent's run method
|
|
1373
|
+
# and log each action
|
|
1374
|
+
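raise NotImplementedError("wrap the base agent's run loop and call self.logger.log_action() for each step")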
|
|
1375
|
+
|
|
1376
|
+
### Anti-patterns
|
|
1377
|
+
|
|
1378
|
+
- Not sanitizing sensitive data in logs
|
|
1379
|
+
- Storing screenshots indefinitely (storage costs)
|
|
1380
|
+
- Not rotating log files
|
|
1381
|
+
- Logging synchronously (blocks agent)
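
The rotation and blocking problems can both be handled with the standard library. The sketch below assumes a single `actions.jsonl` file per log directory; `build_action_logger` and `log_action` are hypothetical names, and the size limits are illustrative.

```python
import json
import logging
import logging.handlers
import queue
from pathlib import Path


def build_action_logger(log_dir: str, max_bytes: int = 5_000_000, backups: int = 5):
    """Return (logger, listener); call listener.stop() on shutdown to flush."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)

    # Rotates actions.jsonl -> actions.jsonl.1 ... once max_bytes is reached.
    file_handler = logging.handlers.RotatingFileHandler(
        Path(log_dir) / "actions.jsonl",
        maxBytes=max_bytes,
        backupCount=backups,
        encoding="utf-8",
    )
    file_handler.setFormatter(logging.Formatter("%(message)s"))

    # The agent thread only enqueues records; a background thread does the disk I/O.
    log_queue: queue.Queue = queue.Queue(-1)
    listener = logging.handlers.QueueListener(log_queue, file_handler)
    listener.start()

    logger = logging.getLogger(f"agent_actions.{log_dir}")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    return logger, listener


def log_action(logger: logging.Logger, entry: dict) -> None:
    """Write one already-sanitized entry as a JSON line."""
    logger.info(json.dumps(entry))
```

Screenshot retention is not covered here; a separate cleanup job that deletes files past a chosen age would be needed for that.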
|
|
1382
|
+
|
|
1383
|
+
## Sharp Edges
|
|
1384
|
+
|
|
1385
|
+
### Web Content Can Hijack Your Agent
|
|
1386
|
+
|
|
1387
|
+
Severity: CRITICAL
|
|
1388
|
+
|
|
1389
|
+
Situation: Computer use agent browsing the web
|
|
1390
|
+
|
|
1391
|
+
Symptoms:
|
|
1392
|
+
Agent suddenly performs unexpected actions. Clicks malicious links.
|
|
1393
|
+
Enters credentials on phishing sites. Downloads files it shouldn't.
|
|
1394
|
+
Ignores your instructions and follows embedded commands instead.
|
|
1395
|
+
|
|
1396
|
+
Why this breaks:
|
|
1397
|
+
"While all agents that process untrusted content are subject to prompt
|
|
1398
|
+
injection risks, browser use amplifies this risk in two ways. First,
|
|
1399
|
+
the attack surface is vast: every webpage, embedded document, advertisement,
|
|
1400
|
+
and dynamically loaded script represents a potential vector for malicious
|
|
1401
|
+
instructions. Second, browser agents can take many different actions—
|
|
1402
|
+
navigating to URLs, filling forms, clicking buttons, downloading files—
|
|
1403
|
+
that attackers can exploit."
|
|
1404
|
+
|
|
1405
|
+
Real attacks have already happened:
|
|
1406
|
+
- "Microsoft Copilot agents were hijacked with emails containing malicious
|
|
1407
|
+
instructions, which allowed attackers to extract entire CRM databases."
|
|
1408
|
+
- "Google's Workspace services were manipulated—hidden prompts inside
|
|
1409
|
+
calendar invites and emails tricked Gemini agents into deleting events
|
|
1410
|
+
and exposing sensitive messages."
|
|
1411
|
+
|
|
1412
|
+
Even a 1% attack success rate represents meaningful risk at scale.
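
To see why a small per-page success rate matters, consider the chance that at least one attack succeeds across many exposures. The sketch assumes each exposure is independent and equally likely to succeed, which is a simplification; `probability_of_compromise` is a hypothetical helper name.

```python
def probability_of_compromise(per_exposure_rate: float, exposures: int) -> float:
    """P(at least one successful attack) = 1 - (1 - p)^n, assuming independence."""
    return 1.0 - (1.0 - per_exposure_rate) ** exposures


for n in (10, 100, 1000):
    print(f"{n:>5} exposures -> {probability_of_compromise(0.01, n):.1%}")
```

Under these assumptions, an agent that meets 100 hostile pages at a 1% success rate is compromised at least once in roughly 63% of cases.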
|
|
1413
|
+
|
|
1414
|
+
Recommended fix:
|
|
1415
|
+
|
|
1416
|
+
## Defense in depth - no single solution works
|
|
1417
|
+
|
|
1418
|
+
1. Sandboxing (most effective):
|
|
1419
|
+
```bash
|
|
1420
|
+
# Docker with strict isolation; --network none removes all network access
|
|
1421
|
+
docker run \
|
|
1422
|
+
--security-opt no-new-privileges \
|
|
1423
|
+
--cap-drop ALL \
|
|
1424
|
+
--network none \
|
|
1425
|
+
--read-only \
|
|
1426
|
+
computer-use-agent
|
|
1427
|
+
```
|
|
1428
|
+
|
|
1429
|
+
2. Pattern-based detection (a simple stand-in for a trained classifier):
|
|
1430
|
+
```python
|
|
1431
|
+
import re

def scan_for_injection(content: str) -> bool:
|
|
1432
|
+
"""Detect prompt injection attempts."""
|
|
1433
|
+
patterns = [
|
|
1434
|
+
r"ignore.*instructions",
|
|
1435
|
+
r"disregard.*previous",
|
|
1436
|
+
r"new.*instructions",
|
|
1437
|
+
r"you are now",
|
|
1438
|
+
r"act as if",
|
|
1439
|
+
r"pretend to be",
|
|
1440
|
+
]
|
|
1441
|
+
return any(re.search(p, content.lower()) for p in patterns)
|
|
1442
|
+
|
|
1443
|
+
# Check page content before processing
|
|
1444
|
+
page_text = await page.text_content("body")
|
|
1445
|
+
if scan_for_injection(page_text):
|
|
1446
|
+
return {"error": "Potential injection detected"}
|
|
1447
|
+
```
|
|
1448
|
+
|
|
1449
|
+
3. User confirmation for sensitive actions:
|
|
1450
|
+
```python
|
|
1451
|
+
SENSITIVE_ACTIONS = {"download", "submit", "login", "purchase"}
|
|
1452
|
+
|
|
1453
|
+
if action_type in SENSITIVE_ACTIONS:
|
|
1454
|
+
if not await get_user_confirmation(action):
|
|
1455
|
+
return {"error": "User rejected action"}
|
|
1456
|
+
```
|
|
1457
|
+
|
|
1458
|
+
4. Scoped credentials:
|
|
1459
|
+
- Never give agent access to all credentials
|
|
1460
|
+
- Use temporary, limited tokens
|
|
1461
|
+
- Revoke after task completion
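
The three rules above can be sketched as a context manager that issues a short-lived, scope-limited token and revokes it when the task ends. Everything here is illustrative: `ScopedToken`, `task_credentials`, and the scope strings are hypothetical, and a real deployment would delegate issuance and revocation to its identity provider.

```python
import secrets
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class ScopedToken:
    """Hypothetical short-lived credential limited to named scopes."""
    scopes: frozenset
    expires_at: float
    value: str = field(default_factory=lambda: secrets.token_urlsafe(16))
    revoked: bool = False

    def allows(self, scope: str) -> bool:
        """True only while the token is live, unexpired, and covers `scope`."""
        return (
            not self.revoked
            and time.monotonic() < self.expires_at
            and scope in self.scopes
        )


@contextmanager
def task_credentials(scopes: set, ttl_s: float = 300.0) -> Iterator[ScopedToken]:
    """Issue a token for one task and revoke it when the task ends, even on error."""
    token = ScopedToken(scopes=frozenset(scopes), expires_at=time.monotonic() + ttl_s)
    try:
        yield token
    finally:
        token.revoked = True
```

A hijacked agent holding such a token can only act within the granted scopes and only until the task finishes or the TTL elapses.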
|
|
1462
|
+
|
|
1463
|
+
### Vision Agents Click Exact Centers
|
|
1464
|
+
|
|
1465
|
+
Severity: MEDIUM
|
|
1466
|
+
|
|
1467
|
+
Situation: Agent clicking on UI elements
|
|
1468
|
+
|
|
1469
|
+
Symptoms:
|
|
1470
|
+
Agent's clicks are detectable as non-human. Websites may block or
|
|
1471
|
+
CAPTCHA the agent. Anti-bot systems flag the interaction.
|
|
1472
|
+
|
|
1473
|
+
Why this breaks:
|
|
1474
|
+
"When a vision model identifies a button, it calculates the center.
|
|
1475
|
+
Click coordinates land at mathematically precise positions—often exact
|
|
1476
|
+
element centers or grid-aligned pixel values. Humans don't click centers;
|
|
1477
|
+
their click distributions follow a Gaussian pattern around targets."
|
|
1478
|
+
|
|
1479
|
+
The screenshot loop also creates detectable patterns:
|
|
1480
|
+
"Predictable pauses. Vision agents are completely still during their
|
|
1481
|
+
'thinking' phase. The pattern looks like: Action → Complete stillness
|
|
1482
|
+
(1-5 seconds) → Action → Complete stillness → Action."
|
|
1483
|
+
|
|
1484
|
+
Sophisticated anti-bot systems detect:
|
|
1485
|
+
- Perfect center clicks
|
|
1486
|
+
- No mouse movement during "thinking"
|
|
1487
|
+
- Consistent timing between actions
|
|
1488
|
+
- Lack of micro-movements and hesitation
|
|
1489
|
+
|
|
1490
|
+
Recommended fix:
|
|
1491
|
+
|
|
1492
|
+
## Add human-like variance to actions
|
|
1493
|
+
|
|
1494
|
+
```python
|
|
1495
|
+
import random
|
|
1496
|
+
import time

import pyautogui  # third-party; used by humanized_movement below
|
|
1497
|
+
|
|
1498
|
+
def humanized_click(x: int, y: int) -> tuple[int, int]:
|
|
1499
|
+
"""Add human-like variance to click coordinates."""
|
|
1500
|
+
# Gaussian distribution around target
|
|
1501
|
+
# Humans typically land within ~10px of target
|
|
1502
|
+
x_offset = int(random.gauss(0, 5))
|
|
1503
|
+
y_offset = int(random.gauss(0, 5))
|
|
1504
|
+
|
|
1505
|
+
return (x + x_offset, y + y_offset)
|
|
1506
|
+
|
|
1507
|
+
def humanized_delay():
|
|
1508
|
+
"""Add human-like delay between actions."""
|
|
1509
|
+
# Humans have variable reaction times
|
|
1510
|
+
base_delay = random.uniform(0.3, 0.8)
|
|
1511
|
+
# Occasionally longer pauses (reading, thinking)
|
|
1512
|
+
if random.random() < 0.2:
|
|
1513
|
+
base_delay += random.uniform(0.5, 2.0)
|
|
1514
|
+
time.sleep(base_delay)
|
|
1515
|
+
|
|
1516
|
+
def humanized_movement(from_pos: tuple, to_pos: tuple):
|
|
1517
|
+
"""Move mouse in curved path like human."""
|
|
1518
|
+
# Linear interpolation with random wobble (a Bezier curve would be smoother)
|
|
1519
|
+
# Humans don't move in straight lines
|
|
1520
|
+
steps = random.randint(10, 20)
|
|
1521
|
+
for i in range(1, steps + 1):  # end at t == 1 so the cursor reaches the target
|
|
1522
|
+
t = i / steps
|
|
1523
|
+
# Simple curve approximation
|
|
1524
|
+
x = from_pos[0] + (to_pos[0] - from_pos[0]) * t
|
|
1525
|
+
y = from_pos[1] + (to_pos[1] - from_pos[1]) * t
|
|
1526
|
+
# Add wobble
|
|
1527
|
+
x += random.gauss(0, 2)
|
|
1528
|
+
y += random.gauss(0, 2)
|
|
1529
|
+
pyautogui.moveTo(int(x), int(y))
|
|
1530
|
+
time.sleep(0.01)
|
|
1531
|
+
```
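
For a curved path instead of a wobbly straight line, a quadratic Bezier curve is one option. The sketch below only generates the points; `bezier_path` and its `bend` parameter are hypothetical, and feeding the points to a mouse driver is left to the caller.

```python
import random
from typing import List, Tuple

Point = Tuple[float, float]


def bezier_path(start: Point, end: Point, steps: int = 15, bend: float = 0.2) -> List[Point]:
    """Points along a quadratic Bezier curve from `start` to `end` (inclusive)."""
    dx, dy = end[0] - start[0], end[1] - start[1]
    # Control point: the midpoint pushed sideways (perpendicular to the line)
    # by a random fraction of the travel distance, so each path bends differently.
    offset = random.uniform(-bend, bend)
    control = (start[0] + dx / 2 - dy * offset, start[1] + dy / 2 + dx * offset)

    points = []
    for i in range(steps + 1):
        t = i / steps
        # B(t) = (1-t)^2 * P0 + 2(1-t)t * P1 + t^2 * P2
        x = (1 - t) ** 2 * start[0] + 2 * (1 - t) * t * control[0] + t ** 2 * end[0]
        y = (1 - t) ** 2 * start[1] + 2 * (1 - t) * t * control[1] + t ** 2 * end[1]
        points.append((x, y))
    return points
```

The first and last points coincide with `start` and `end`, so the cursor always arrives at the intended target.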
|
|
1532
|
+
|
|
1533
|
+
## Rotate user agents and fingerprints
|
|
1534
|
+
|
|
1535
|
+
```python
|
|
1536
|
+
USER_AGENTS = [
|
|
1537
|
+
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120...",
|
|
1538
|
+
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Safari/...",
|
|
1539
|
+
# ... more realistic agents
|
|
1540
|
+
]
|
|
1541
|
+
|
|
1542
|
+
await page.set_extra_http_headers({
|
|
1543
|
+
"User-Agent": random.choice(USER_AGENTS)
|
|
1544
|
+
})
|
|
1545
|
+
```
|
|
1546
|
+
|
|
1547
|
+
### Dropdowns, Scrollbars, and Drags Are Unreliable
|
|
1548
|
+
|
|
1549
|
+
Severity: HIGH
|
|
1550
|
+
|
|
1551
|
+
Situation: Agent interacting with complex UI elements
|
|
1552
|
+
|
|
1553
|
+
Symptoms:
|
|
1554
|
+
Agent fails to select dropdown options. Scroll doesn't work as expected.
|
|
1555
|
+
Drag and drop completely fails. Hover menus disappear before clicking.
|
|
1556
|
+
|
|
1557
|
+
Why this breaks:
|
|
1558
|
+
"Computer Use currently struggles with certain interface interactions,
|
|
1559
|
+
particularly scrolling, dragging, and zooming operations. Some UI elements
|
|
1560
|
+
(like dropdowns and scrollbars) might be tricky for Claude to manipulate."
|
|
1561
|
+
- Anthropic documentation
|
|
1562
|
+
|
|
1563
|
+
Why these are hard:
|
|
1564
|
+
1. Dropdowns: Options appear after click, need second click to select
|
|
1565
|
+
2. Scrollbars: Small targets, need precise positioning
|
|
1566
|
+
3. Drag: Requires coordinated mouse down, move, mouse up
|
|
1567
|
+
4. Hover menus: Disappear when mouse moves away
|
|
1568
|
+
5. Canvas elements: No semantic information visible
|
|
1569
|
+
|
|
1570
|
+
Vision models see pixels, not DOM structure. They don't "know" that
|
|
1571
|
+
a dropdown is a dropdown - they have to infer from visual cues.
|
|
1572
|
+
|
|
1573
|
+
Recommended fix:
|
|
1574
|
+
|
|
1575
|
+
## Use keyboard alternatives when possible
|
|
1576
|
+
|
|
1577
|
+
```python
|
|
1578
|
+
# Instead of clicking dropdown, use keyboard
|
|
1579
|
+
async def select_dropdown_option(page, dropdown_selector, option_text):
|
|
1580
|
+
# Focus the dropdown
|
|
1581
|
+
await page.click(dropdown_selector)
|
|
1582
|
+
await asyncio.sleep(0.3)
|
|
1583
|
+
|
|
1584
|
+
# Use keyboard to find option
|
|
1585
|
+
await page.keyboard.type(option_text[:3]) # Type first letters
|
|
1586
|
+
await asyncio.sleep(0.2)
|
|
1587
|
+
await page.keyboard.press("Enter")
|
|
1588
|
+
```
|
|
1589
|
+
|
|
1590
|
+
## Break complex actions into steps
|
|
1591
|
+
|
|
1592
|
+
```python
|
|
1593
|
+
# Instead of drag-and-drop
|
|
1594
|
+
async def reliable_drag(page, source, target):
|
|
1595
|
+
# Step 1: Click and hold
|
|
1596
|
+
await page.mouse.move(source["x"], source["y"])
|
|
1597
|
+
await page.mouse.down()
|
|
1598
|
+
await asyncio.sleep(0.2)
|
|
1599
|
+
|
|
1600
|
+
# Step 2: Move in steps
|
|
1601
|
+
steps = 10
|
|
1602
|
+
for i in range(steps):
|
|
1603
|
+
x = source["x"] + (target["x"] - source["x"]) * i / steps
|
|
1604
|
+
y = source["y"] + (target["y"] - source["y"]) * i / steps
|
|
1605
|
+
await page.mouse.move(x, y)
|
|
1606
|
+
await asyncio.sleep(0.05)
|
|
1607
|
+
|
|
1608
|
+
# Step 3: Release
|
|
1609
|
+
await page.mouse.move(target["x"], target["y"])
|
|
1610
|
+
await asyncio.sleep(0.1)
|
|
1611
|
+
await page.mouse.up()
|
|
1612
|
+
```
|
|
1613
|
+
|
|
1614
|
+
## Fall back to DOM access for web
|
|
1615
|
+
|
|
1616
|
+
```python
|
|
1617
|
+
# If vision fails, try direct DOM manipulation
|
|
1618
|
+
async def robust_select(page, select_selector, value):
|
|
1619
|
+
try:
|
|
1620
|
+
# Try vision approach first
|
|
1621
|
+
await vision_agent.select(select_selector, value)
|
|
1622
|
+
except Exception:
|
|
1623
|
+
# Fall back to direct DOM
|
|
1624
|
+
await page.select_option(select_selector, value=value)
|
|
305
1625
|
```
|
|
306
1626
|
|
|
307
|
-
##
|
|
1627
|
+
## Add verification after action
|
|
1628
|
+
|
|
1629
|
+
```python
|
|
1630
|
+
async def verified_scroll(page, direction):
|
|
1631
|
+
# Get current scroll position
|
|
1632
|
+
before = await page.evaluate("window.scrollY")
|
|
1633
|
+
|
|
1634
|
+
# Attempt scroll
|
|
1635
|
+
await page.mouse.wheel(0, 500 if direction == "down" else -500)
|
|
1636
|
+
await asyncio.sleep(0.3)
|
|
1637
|
+
|
|
1638
|
+
# Verify it worked
|
|
1639
|
+
after = await page.evaluate("window.scrollY")
|
|
1640
|
+
if before == after:
|
|
1641
|
+
# Try alternative method
|
|
1642
|
+
await page.keyboard.press("PageDown" if direction == "down" else "PageUp")
|
|
1643
|
+
```
|
|
1644
|
+
|
|
1645
|
+
### Agents Are 2-5x Slower Than Humans
|
|
1646
|
+
|
|
1647
|
+
Severity: MEDIUM
|
|
1648
|
+
|
|
1649
|
+
Situation: Automating any computer task
|
|
1650
|
+
|
|
1651
|
+
Symptoms:
|
|
1652
|
+
A task that takes a human 1 minute takes the agent 2-5 minutes.
|
|
1653
|
+
Users complain about speed. Timeouts occur.
|
|
1654
|
+
|
|
1655
|
+
Why this breaks:
|
|
1656
|
+
"The technology can be slow compared to human operators, often requiring
|
|
1657
|
+
multiple screenshots and analysis cycles."
|
|
1658
|
+
|
|
1659
|
+
Why so slow:
|
|
1660
|
+
1. Screenshot capture: 100-500ms
|
|
1661
|
+
2. Vision model inference: 1-5 seconds per screenshot
|
|
1662
|
+
3. Action execution: 200-500ms
|
|
1663
|
+
4. Wait for UI update: 500-1000ms
|
|
1664
|
+
5. Total per action: 2-7 seconds
|
|
308
1665
|
|
|
309
|
-
|
|
310
|
-
|
|
311
|
-
|
|
312
|
-
|
|
313
|
-
|
|
314
|
-
|
|
315
|
-
|
|
316
|
-
|
|
317
|
-
|
|
1666
|
+
A task requiring 20 actions therefore takes roughly 40-140 seconds.
|
|
1667
|
+
Humans do the same actions in 20-30 seconds.
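
The per-action budget can be checked by summing the step ranges listed above. This is a worked sketch only: `STEP_LATENCY_MS` and `task_latency_s` are hypothetical names, and the figures are the ones quoted in the list.

```python
# Per-step latency ranges in milliseconds, taken from the list above.
STEP_LATENCY_MS = {
    "screenshot_capture": (100, 500),
    "vision_inference": (1000, 5000),
    "action_execution": (200, 500),
    "ui_update_wait": (500, 1000),
}


def task_latency_s(num_actions: int) -> tuple:
    """Return (best_case, worst_case) seconds for a task of `num_actions` steps."""
    best_ms = sum(low for low, _ in STEP_LATENCY_MS.values())
    worst_ms = sum(high for _, high in STEP_LATENCY_MS.values())
    return num_actions * best_ms / 1000, num_actions * worst_ms / 1000


best, worst = task_latency_s(20)
print(f"20 actions: {best:.0f}-{worst:.0f} s")
```

The exact best case is 1.8 seconds per action, which the text rounds up to 2. Vision model inference dominates both ends of the range, so it is the first place to look for savings.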
|
|
1668
|
+
|
|
1669
|
+
Recommended fix:
|
|
1670
|
+
|
|
1671
|
+
## Accept the tradeoff
|
|
1672
|
+
|
|
1673
|
+
Computer use is for:
|
|
1674
|
+
- Tasks humans don't want to do (repetitive)
|
|
1675
|
+
- Tasks that can run in background
|
|
1676
|
+
- Tasks where accuracy > speed
|
|
1677
|
+
|
|
1678
|
+
## Optimize where possible
|
|
1679
|
+
|
|
1680
|
+
```python
|
|
1681
|
+
# 1. Reduce screenshot resolution
|
|
1682
|
+
SCREEN_SIZE = (1280, 800) # Not 4K
|
|
1683
|
+
|
|
1684
|
+
# 2. Batch similar actions
|
|
1685
|
+
# Instead of: type "hello", wait, type " world"
|
|
1686
|
+
await page.keyboard.type("hello world")
|
|
1687
|
+
|
|
1688
|
+
# 3. Parallelize independent tasks
|
|
1689
|
+
# Run multiple sandboxed agents concurrently
|
|
1690
|
+
|
|
1691
|
+
# 4. Cache repeated computations
|
|
1692
|
+
# If same screenshot, reuse analysis
|
|
1693
|
+
|
|
1694
|
+
# 5. Use smaller models for simple decisions
|
|
1695
|
+
simple_model = "claude-haiku-..." # For "is task done?"
|
|
1696
|
+
complex_model = "claude-sonnet-..." # For complex reasoning
|
|
1697
|
+
```
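
Point 4 above can be sketched as a cache keyed by a hash of the screenshot bytes. `ScreenshotAnalysisCache` is a hypothetical class, and `analyze` stands in for whatever call sends the image to the vision model.

```python
import hashlib
from typing import Callable, Dict


class ScreenshotAnalysisCache:
    """Reuse a previous analysis when the screenshot bytes are identical."""

    def __init__(self, analyze: Callable[[bytes], str]):
        self._analyze = analyze  # hypothetical call into the vision model
        self._results: Dict[str, str] = {}
        self.hits = 0

    def get(self, screenshot: bytes) -> str:
        """Return the cached analysis, computing it on the first sighting."""
        key = hashlib.sha256(screenshot).hexdigest()
        if key in self._results:
            self.hits += 1
        else:
            self._results[key] = self._analyze(screenshot)
        return self._results[key]
```

Exact hashing only helps when the screen is byte-for-byte unchanged. A blinking cursor or a clock defeats it, so masking volatile regions before hashing may be needed in practice.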
|
|
1698
|
+
|
|
1699
|
+
## Set realistic expectations
|
|
1700
|
+
|
|
1701
|
+
```python
|
|
1702
|
+
# Estimate task duration
|
|
1703
|
+
def estimate_duration(task_complexity: str) -> int:
|
|
1704
|
+
"""Estimate task duration in seconds."""
|
|
1705
|
+
estimates = {
|
|
1706
|
+
"simple": 30, # Single page, few actions
|
|
1707
|
+
"medium": 120, # Multi-page, moderate actions
|
|
1708
|
+
"complex": 300, # Many pages, complex interactions
|
|
1709
|
+
}
|
|
1710
|
+
return estimates.get(task_complexity, 120)
|
|
1711
|
+
|
|
1712
|
+
# Inform users
|
|
1713
|
+
estimated = estimate_duration("medium")
|
|
1714
|
+
print(f"Estimated completion: {estimated // 60}m {estimated % 60}s")
|
|
1715
|
+
```
|
|
1716
|
+
|
|
1717
|
+
### Screenshots Fill Up Context Window Fast
|
|
1718
|
+
|
|
1719
|
+
Severity: HIGH
|
|
1720
|
+
|
|
1721
|
+
Situation: Long-running computer use tasks
|
|
1722
|
+
|
|
1723
|
+
Symptoms:
|
|
1724
|
+
Agent forgets earlier steps. Starts repeating actions.
|
|
1725
|
+
Errors increase as task progresses. Costs explode.
|
|
1726
|
+
|
|
1727
|
+
Why this breaks:
|
|
1728
|
+
Each screenshot is ~1500-3000 tokens. A task with 30 screenshots
|
|
1729
|
+
uses 45,000-90,000 tokens just for images - before any text.
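
The arithmetic behind those figures, and the inverse question of how many screenshots fit in a given budget, can be sketched as follows. The helper names are hypothetical and the 100,000-token budget in the usage note is an illustrative assumption, not a model limit.

```python
TOKENS_PER_SCREENSHOT = (1500, 3000)  # range quoted in the paragraph above


def screenshot_tokens(count: int) -> tuple:
    """(low, high) token cost of `count` screenshots."""
    low, high = TOKENS_PER_SCREENSHOT
    return count * low, count * high


def max_screenshots(token_budget: int) -> tuple:
    """(worst_case, best_case) number of screenshots that fit in `token_budget`."""
    low, high = TOKENS_PER_SCREENSHOT
    return token_budget // high, token_budget // low
```

`screenshot_tokens(30)` reproduces the 45,000-90,000 range. With an assumed 100,000-token budget, `max_screenshots(100_000)` gives between 33 and 66 screenshots before any text is counted.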
|
|
1730
|
+
|
|
1731
|
+
Claude's context window is finite. When full:
|
|
1732
|
+
- Older context gets dropped
|
|
1733
|
+
- Agent loses memory of earlier steps
|
|
1734
|
+
- Task coherence decreases
|
|
1735
|
+
|
|
1736
|
+
"Getting agents to make consistent progress across multiple context
|
|
1737
|
+
windows remains an open problem. The core challenge is that they must
|
|
1738
|
+
work in discrete sessions, and each new session begins with no memory
|
|
1739
|
+
of what came before." - Anthropic engineering blog
|
|
1740
|
+
|
|
1741
|
+
Recommended fix:
|
|
1742
|
+
|
|
1743
|
+
## Implement context management
|
|
1744
|
+
|
|
1745
|
+
```python
|
|
1746
|
+
class ContextManager:
|
|
1747
|
+
"""Manage context window usage for computer use."""
|
|
1748
|
+
|
|
1749
|
+
MAX_SCREENSHOTS = 10 # Keep only recent screenshots
|
|
1750
|
+
MAX_TOKENS = 100000
|
|
1751
|
+
|
|
1752
|
+
def __init__(self):
|
|
1753
|
+
self.messages = []
|
|
1754
|
+
self.screenshot_count = 0
|
|
1755
|
+
|
|
1756
|
+
def add_screenshot(self, screenshot_b64: str, description: str):
|
|
1757
|
+
"""Add screenshot with automatic pruning."""
|
|
1758
|
+
self.screenshot_count += 1
|
|
1759
|
+
|
|
1760
|
+
# Keep only recent screenshots
|
|
1761
|
+
if self.screenshot_count > self.MAX_SCREENSHOTS:
|
|
1762
|
+
self._prune_old_screenshots()
|
|
1763
|
+
|
|
1764
|
+
# Store with description for context
|
|
1765
|
+
self.messages.append({
|
|
1766
|
+
"role": "user",
|
|
1767
|
+
"content": [
|
|
1768
|
+
{"type": "text", "text": description},
|
|
1769
|
+
{"type": "image", "source": {...}}
|
|
1770
|
+
]
|
|
1771
|
+
})
|
|
1772
|
+
|
|
1773
|
+
def _prune_old_screenshots(self):
|
|
1774
|
+
"""Remove old screenshots, keep text summaries."""
|
|
1775
|
+
new_messages = []
|
|
1776
|
+
screenshots_kept = 0
|
|
1777
|
+
|
|
1778
|
+
for msg in reversed(self.messages):
|
|
1779
|
+
if self._has_image(msg):
|
|
1780
|
+
if screenshots_kept < self.MAX_SCREENSHOTS - 1:  # leave room for the screenshot that add_screenshot appends next
|
|
1781
|
+
new_messages.insert(0, msg)
|
|
1782
|
+
screenshots_kept += 1
|
|
1783
|
+
else:
|
|
1784
|
+
# Convert to text summary
|
|
1785
|
+
summary = self._summarize_screenshot(msg)
|
|
1786
|
+
new_messages.insert(0, {
|
|
1787
|
+
"role": msg["role"],
|
|
1788
|
+
"content": summary
|
|
1789
|
+
})
|
|
1790
|
+
else:
|
|
1791
|
+
new_messages.insert(0, msg)
|
|
1792
|
+
|
|
1793
|
+
self.messages = new_messages
|
|
1794
|
+
|
|
1795
|
+
def _summarize_screenshot(self, msg) -> str:
|
|
1796
|
+
"""Summarize screenshot to text."""
|
|
1797
|
+
# Extract any text description
|
|
1798
|
+
for content in msg.get("content", []):
|
|
1799
|
+
if content.get("type") == "text":
|
|
1800
|
+
return f"[Previous screenshot: {content['text']}]"
|
|
1801
|
+
return "[Previous screenshot - details pruned]"
|
|
1802
|
+
|
|
1803
|
+
def add_checkpoint(self):
|
|
1804
|
+
"""Create a checkpoint summary."""
|
|
1805
|
+
summary = self._create_progress_summary()
|
|
1806
|
+
self.messages.append({
|
|
1807
|
+
"role": "user",
|
|
1808
|
+
"content": f"CHECKPOINT: {summary}"
|
|
1809
|
+
})
|
|
1810
|
+
```
|
|
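`ContextManager` calls `_has_image` and `_create_progress_summary` without showing them. A minimal sketch of what they might look like follows, written as standalone functions and assuming the message shape used above (content is either a string or a list of typed blocks). Both names and the summary format are hypothetical.

```python
def has_image(msg: dict) -> bool:
    """True if the message content is a list containing an image block."""
    content = msg.get("content")
    if not isinstance(content, list):
        return False                      # plain-string content carries no images
    return any(isinstance(part, dict) and part.get("type") == "image" for part in content)


def create_progress_summary(messages: list, max_items: int = 5) -> str:
    """Join the most recent text snippets into a one-line progress summary."""
    texts = []
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, str):
            texts.append(content)
        elif isinstance(content, list):
            texts.extend(
                part["text"]
                for part in content
                if isinstance(part, dict) and part.get("type") == "text"
            )
    return " | ".join(texts[-max_items:])
```

A production summary would more likely come from a model call; string joining is used here only to keep the sketch self-contained.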
1811
|
+
|
|
1812
|
+
## Use checkpointing for long tasks
|
|
1813
|
+
|
|
1814
|
+
```python
|
|
1815
|
+
async def run_with_checkpoints(task: str, checkpoint_every: int = 10):
|
|
1816
|
+
"""Run task with periodic checkpoints."""
|
|
1817
|
+
context = ContextManager()
|
|
1818
|
+
step = 0
task_complete = False  # set to True by the action step once the agent reports completion
|
|
1819
|
+
|
|
1820
|
+
while not task_complete:
|
|
1821
|
+
step += 1
|
|
1822
|
+
|
|
1823
|
+
# Take action...
|
|
1824
|
+
|
|
1825
|
+
if step % checkpoint_every == 0:
|
|
1826
|
+
# Create checkpoint
|
|
1827
|
+
context.add_checkpoint()
|
|
1828
|
+
|
|
1829
|
+
# Optional: persist to disk
|
|
1830
|
+
save_checkpoint(context, step)
|
|
1831
|
+
```
|
|
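`save_checkpoint` is not defined in the snippet above. One possible shape is sketched below, assuming the context object exposes a JSON-serializable `messages` list as `ContextManager` does. The file layout and the `load_checkpoint` helper are hypothetical.

```python
import json
from pathlib import Path


def save_checkpoint(context, step: int, directory: str = "checkpoints") -> Path:
    """Write the current messages to <directory>/step_<n>.json and return the path."""
    path = Path(directory)
    path.mkdir(parents=True, exist_ok=True)
    target = path / f"step_{step:04d}.json"
    payload = {"step": step, "messages": context.messages}
    target.write_text(json.dumps(payload), encoding="utf-8")
    return target


def load_checkpoint(path: Path) -> dict:
    """Read a checkpoint back; the caller restores context.messages from it."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Messages that still hold base64 screenshots make large files, so pruning before saving is probably worthwhile.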
1832
|
+
|
|
1833
|
+
## Break into subtasks
|
|
1834
|
+
|
|
1835
|
+
```python
|
|
1836
|
+
# Instead of one 50-step task:
|
|
1837
|
+
subtasks = [
|
|
1838
|
+
"Navigate to the website and login",
|
|
1839
|
+
"Find the settings page",
|
|
1840
|
+
"Update the email address to ...",
|
|
1841
|
+
"Save and verify the change"
|
|
1842
|
+
]
|
|
1843
|
+
|
|
1844
|
+
for subtask in subtasks:
|
|
1845
|
+
result = await agent.run(subtask)
|
|
1846
|
+
if not result["success"]:
|
|
1847
|
+
handle_error(subtask, result)
|
|
1848
|
+
break
|
|
1849
|
+
```
|
|
1850
|
+
|
|
1851
|
+
### Costs Can Explode Quickly
|
|
1852
|
+
|
|
1853
|
+
Severity: HIGH
|
|
1854
|
+
|
|
1855
|
+
Situation: Running computer use at scale
|
|
1856
|
+
|
|
1857
|
+
Symptoms:
|
|
1858
|
+
API bill is 10x higher than expected. Single task costs $5+ instead of $0.50.
|
|
1859
|
+
Monthly costs reach thousands of dollars quickly.
|
|
1860
|
+
|
|
1861
|
+
Why this breaks:
|
|
1862
|
+
Vision tokens are expensive. Each screenshot:
|
|
1863
|
+
- ~2000-3000 tokens per image
|
|
1864
|
+
- At $10/million tokens, that's $0.02-0.03 per screenshot
|
|
1865
|
+
- Task with 30 screenshots = $0.60-0.90 just for images
|
|
1866
|
+
|
|
1867
|
+
But it compounds:
|
|
1868
|
+
- Screenshots accumulate in context
|
|
1869
|
+
- Model sees ALL previous screenshots each turn
|
|
1870
|
+
- Turn 10 processes 10 screenshots = $0.20-0.30
|
|
1871
|
+
- Turn 20 processes 20 screenshots = $0.40-0.60
|
|
1872
|
+
- Per-turn cost grows linearly, so cumulative cost grows quadratically with the number of turns
|
|
1873
|
+
|
|
1874
|
+
Complex task: 50 turns × average 25 images in context = 1,250 image reads (roughly 2.5-3.75 million image tokens at 2,000-3,000 tokens each)
|
|
1875
|
+
Add text tokens on top, and a single task can easily hit $5-10 or more, depending on the per-token rate.
|
|
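The compounding effect can be checked numerically. The sketch below assumes one new screenshot per turn and treats tokens per image and the per-token rate as parameters rather than fixed prices. `image_reads` and `image_cost_usd` are illustrative names.

```python
from typing import Optional


def image_reads(turns: int, max_images_in_context: Optional[int] = None) -> int:
    """Total screenshot reads over a task that adds one screenshot per turn."""
    total = 0
    for turn in range(1, turns + 1):
        in_context = turn
        if max_images_in_context is not None:
            in_context = min(turn, max_images_in_context)
        total += in_context
    return total


def image_cost_usd(reads: int, tokens_per_image: int, usd_per_million: float) -> float:
    """Cost of `reads` image reads at a given image size and per-token rate."""
    return reads * tokens_per_image * usd_per_million / 1_000_000


unbounded = image_reads(50)      # every earlier screenshot stays in context
capped = image_reads(50, 10)     # only the 10 most recent screenshots are kept
print(unbounded, capped)         # 1275 455
```

Capping the number of screenshots kept in context turns the quadratic total into a roughly linear one, which is why pruning old screenshots matters for cost as well as for context length.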
1876
|
+
|
|
1877
|
+
Recommended fix:
|
|
1878
|
+
|
|
1879
|
+
## Monitor and limit costs
|
|
1880
|
+
|
|
1881
|
+
```python
|
|
1882
|
+
class CostTracker:
|
|
1883
|
+
"""Track and limit computer use costs."""
|
|
1884
|
+
|
|
1885
|
+
# Anthropic pricing (approximate)
|
|
1886
|
+
INPUT_COST_PER_1K = 0.003 # Text
|
|
1887
|
+
OUTPUT_COST_PER_1K = 0.015
|
|
1888
|
+
IMAGE_COST_PER_1K = 0.01 # Roughly
|
|
1889
|
+
|
|
1890
|
+
def __init__(self, max_cost_per_task: float = 1.0):
|
|
1891
|
+
self.max_cost = max_cost_per_task
|
|
1892
|
+
self.current_cost = 0.0
|
|
1893
|
+
self.total_tokens = 0
|
|
1894
|
+
|
|
1895
|
+
def add_turn(
|
|
1896
|
+
self,
|
|
1897
|
+
input_tokens: int,
|
|
1898
|
+
output_tokens: int,
|
|
1899
|
+
image_tokens: int
|
|
1900
|
+
):
|
|
1901
|
+
"""Track cost of a single turn."""
|
|
1902
|
+
cost = (
|
|
1903
|
+
input_tokens / 1000 * self.INPUT_COST_PER_1K +
|
|
1904
|
+
output_tokens / 1000 * self.OUTPUT_COST_PER_1K +
|
|
1905
|
+
image_tokens / 1000 * self.IMAGE_COST_PER_1K
|
|
1906
|
+
)
|
|
1907
|
+
self.current_cost += cost
|
|
1908
|
+
self.total_tokens += input_tokens + output_tokens + image_tokens
|
|
1909
|
+
|
|
1910
|
+
if self.current_cost > self.max_cost:
|
|
1911
|
+
raise CostLimitExceeded(
|
|
1912
|
+
f"Cost limit exceeded: ${self.current_cost:.2f} > ${self.max_cost:.2f}"
|
|
1913
|
+
)
|
|
1914
|
+
|
|
1915
|
+
return cost
|
|
1916
|
+
|
|
1917
|
+
class CostLimitExceeded(Exception):
|
|
1918
|
+
pass
|
|
1919
|
+
|
|
1920
|
+
# Usage
|
|
1921
|
+
tracker = CostTracker(max_cost_per_task=2.0)
|
|
1922
|
+
|
|
1923
|
+
try:
|
|
1924
|
+
for turn in turns:
|
|
1925
|
+
tracker.add_turn(turn.input, turn.output, turn.images)
|
|
1926
|
+
except CostLimitExceeded:
|
|
1927
|
+
print("Task aborted due to cost limit")
|
|
1928
|
+
```
|
|
1929
|
+
|
|
1930
|
+
## Reduce image costs
|
|
1931
|
+
|
|
1932
|
+
```python
|
|
1933
|
+
# 1. Lower resolution
|
|
1934
|
+
SCREEN_SIZE = (1024, 768) # Smaller = fewer tokens
|
|
1935
|
+
|
|
1936
|
+
# 2. JPEG instead of PNG (when quality ok): smaller upload; token count typically depends on pixel dimensions, not file format
|
|
1937
|
+
screenshot.convert("RGB").save(buffer, format="JPEG", quality=70)  # JPEG has no alpha channel
|
|
1938
|
+
|
|
1939
|
+
# 3. Crop to relevant region
|
|
1940
|
+
# Assumes `from PIL import Image`; focus_area is a (left, upper, right, lower) box
def crop_relevant(screenshot: "Image.Image", focus_area: tuple):
|
|
1941
|
+
"""Crop to area of interest."""
|
|
1942
|
+
return screenshot.crop(focus_area)
|
|
1943
|
+
|
|
1944
|
+
# 4. Don't include screenshot every turn
|
|
1945
|
+
if not needs_visual_update:
|
|
1946
|
+
# Text-only turn
|
|
1947
|
+
messages.append({"role": "user", "content": "Continue..."})
|
|
1948
|
+
```
|
|
1949
|
+
|
|
1950
|
+
## Use cheaper models strategically
|
|
1951
|
+
|
|
1952
|
+
```python
|
|
1953
|
+
async def tiered_model_selection(task_complexity: str):
|
|
1954
|
+
"""Use appropriate model for task."""
|
|
1955
|
+
if task_complexity == "simple":
|
|
1956
|
+
return "claude-haiku-..." # Cheapest
|
|
1957
|
+
elif task_complexity == "medium":
|
|
1958
|
+
return "claude-sonnet-4-20250514" # Balanced
|
|
1959
|
+
else:
|
|
1960
|
+
return "claude-opus-4-5-..." # Best but expensive
|
|
1961
|
+
```
|
|
1962
|
+
|
|
1963
|
+
### Running Agent on Your Actual Computer
|
|
1964
|
+
|
|
1965
|
+
Severity: CRITICAL
|
|
1966
|
+
|
|
1967
|
+
Situation: Testing or deploying computer use
|
|
1968
|
+
|
|
1969
|
+
Symptoms:
|
|
1970
|
+
Agent deletes important files. Sends emails from your account.
|
|
1971
|
+
Posts on social media. Accesses sensitive documents.
|
|
1972
|
+
|
|
1973
|
+
Why this breaks:
|
|
1974
|
+
Computer use agents make mistakes. They can:
|
|
1975
|
+
- Misinterpret instructions
|
|
1976
|
+
- Click wrong buttons
|
|
1977
|
+
- Type in wrong fields
|
|
1978
|
+
- Follow prompt injection attacks
|
|
1979
|
+
|
|
1980
|
+
Without sandboxing, these mistakes happen on your real system.
|
|
1981
|
+
There's no undo for "agent sent email to all contacts" or
|
|
1982
|
+
"agent deleted project folder."
|
|
1983
|
+
|
|
1984
|
+
"Autonomous agents that can access external systems and APIs
|
|
1985
|
+
introduce new security risks. They may be vulnerable to prompt
|
|
1986
|
+
injection attacks, unauthorized access to sensitive data, or
|
|
1987
|
+
manipulation by malicious actors."
|
|
1988
|
+
|
|
1989
|
+
Recommended fix:
|
|
1990
|
+
|
|
1991
|
+
## ALWAYS use sandboxing
|
|
1992
|
+
|
|
1993
|
+
```bash
|
|
1994
|
+
# Minimum viable sandbox: Docker with restrictions
|
|
1995
|
+
|
|
1996
|
+
docker run -it --rm \
|
|
1997
|
+
--security-opt no-new-privileges \
|
|
1998
|
+
--cap-drop ALL \
|
|
1999
|
+
--network none \
|
|
2000
|
+
--read-only \
|
|
2001
|
+
--tmpfs /tmp \
|
|
2002
|
+
--memory 2g \
|
|
2003
|
+
--cpus 1 \
|
|
2004
|
+
computer-use-sandbox
|
|
2005
|
+
```
|
|
2006
|
+
|
|
2007
|
+
## Layer your defenses
|
|
2008
|
+
|
|
2009
|
+
```python
|
|
2010
|
+
# Defense 1: Docker isolation
|
|
2011
|
+
# Defense 2: Non-root user
|
|
2012
|
+
# Defense 3: Network restrictions
|
|
2013
|
+
# Defense 4: Filesystem restrictions
|
|
2014
|
+
# Defense 5: Resource limits
|
|
2015
|
+
# Defense 6: Action confirmation
|
|
2016
|
+
# Defense 7: Action logging
|
|
2017
|
+
|
|
2018
|
+
from dataclasses import dataclass, field

@dataclass
|
|
2019
|
+
class SandboxConfig:
|
|
2020
|
+
docker_image: str = "computer-use-sandbox:latest"
|
|
2021
|
+
network: str = "none" # or specific allowlist
|
|
2022
|
+
readonly_root: bool = True
|
|
2023
|
+
max_memory_mb: int = 2048
|
|
2024
|
+
max_cpu: float = 1.0
|
|
2025
|
+
max_runtime_seconds: int = 300
|
|
2026
|
+
require_confirmation: list = field(default_factory=lambda: [
|
|
2027
|
+
"download", "submit", "login", "delete"
|
|
2028
|
+
])
|
|
2029
|
+
log_all_actions: bool = True
|
|
2030
|
+
```
|
|
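Defenses 6 and 7 (action confirmation and action logging) can be sketched as a small wrapper around whatever executes an action. The action shape (a dict with a `type` key), the `execute` and `confirm` callbacks, and the logger name are assumptions made for illustration. The default list mirrors `require_confirmation` above.

```python
import logging
from typing import Callable, Iterable

logger = logging.getLogger("computer_use.actions")


def guarded_execute(
    action: dict,
    execute: Callable[[dict], str],
    confirm: Callable[[dict], bool],
    require_confirmation: Iterable[str] = ("download", "submit", "login", "delete"),
) -> str:
    """Log every action and ask for confirmation before sensitive ones."""
    kind = action.get("type", "")
    logger.info("action requested: %s", action)            # Defense 7: action logging
    if kind in set(require_confirmation) and not confirm(action):
        logger.warning("action rejected: %s", action)      # Defense 6: confirmation
        return "rejected"
    return execute(action)
```

Keeping `confirm` as a callback lets the same gate be backed by a human prompt in production and by a stub in tests.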
2031
|
+
|
|
2032
|
+
## Test in isolated environment first
|
|
2033
|
+
|
|
2034
|
+
```python
|
|
2035
|
+
class SandboxedTestRunner:
|
|
2036
|
+
"""Run tests in throwaway containers."""
|
|
2037
|
+
|
|
2038
|
+
async def run_test(self, test_task: str) -> dict:
|
|
2039
|
+
# Spin up fresh container
|
|
2040
|
+
container_id = await self.create_container()
|
|
2041
|
+
|
|
2042
|
+
try:
|
|
2043
|
+
# Run task
|
|
2044
|
+
result = await self.execute_in_container(container_id, test_task)
|
|
2045
|
+
|
|
2046
|
+
# Capture state for verification
|
|
2047
|
+
state = await self.capture_container_state(container_id)
|
|
2048
|
+
|
|
2049
|
+
return {
|
|
2050
|
+
"result": result,
|
|
2051
|
+
"final_state": state,
|
|
2052
|
+
"logs": await self.get_logs(container_id)
|
|
2053
|
+
}
|
|
2054
|
+
finally:
|
|
2055
|
+
# Always destroy container
|
|
2056
|
+
await self.destroy_container(container_id)
|
|
2057
|
+
```
|
|
2058
|
+
|
|
2059
|
+
## Validation Checks
|
|
2060
|
+
|
|
2061
|
+
### Computer Use Without Sandbox
|
|
2062
|
+
|
|
2063
|
+
Severity: ERROR
|
|
2064
|
+
|
|
2065
|
+
Computer use agents MUST run in sandboxed environments
|
|
2066
|
+
|
|
2067
|
+
Message: Computer use without sandboxing detected. Use Docker containers with restrictions.
|
|
2068
|
+
|
|
2069
|
+
### Sandbox With Full Network Access
|
|
2070
|
+
|
|
2071
|
+
Severity: ERROR
|
|
2072
|
+
|
|
2073
|
+
Sandboxed agents should have restricted network access
|
|
2074
|
+
|
|
2075
|
+
Message: Sandbox has full network access. Use --network=none or specific allowlist.
|
|
2076
|
+
|
|
2077
|
+
### Running as Root in Container
|
|
2078
|
+
|
|
2079
|
+
Severity: ERROR
|
|
2080
|
+
|
|
2081
|
+
Container agents should run as non-root user
|
|
2082
|
+
|
|
2083
|
+
Message: Container running as root. Add --user flag or USER directive in Dockerfile.
|
|
2084
|
+
|
|
2085
|
+
### Container Without Capability Drops
|
|
2086
|
+
|
|
2087
|
+
Severity: WARNING
|
|
2088
|
+
|
|
2089
|
+
Containers should drop unnecessary capabilities
|
|
2090
|
+
|
|
2091
|
+
Message: Container has full capabilities. Add --cap-drop ALL.
|
|
2092
|
+
|
|
2093
|
+
### Container Without Seccomp Profile
|
|
2094
|
+
|
|
2095
|
+
Severity: WARNING
|
|
2096
|
+
|
|
2097
|
+
Containers should use seccomp profiles for syscall filtering
|
|
2098
|
+
|
|
2099
|
+
Message: No security options set. Consider --security-opt seccomp=profile.json
|
|
2100
|
+
|
|
2101
|
+
### No Maximum Step Limit
|
|
2102
|
+
|
|
2103
|
+
Severity: WARNING
|
|
2104
|
+
|
|
2105
|
+
Computer use loops should have maximum step limits
|
|
2106
|
+
|
|
2107
|
+
Message: Infinite loop risk. Add max_steps limit (recommended: 50).
|
|
2108
|
+
|
|
2109
|
+
### No Execution Timeout
|
|
2110
|
+
|
|
2111
|
+
Severity: WARNING
|
|
2112
|
+
|
|
2113
|
+
Computer use should have timeout limits
|
|
2114
|
+
|
|
2115
|
+
Message: No timeout on execution. Add timeout (recommended: 5-10 minutes).
|
|
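The step-limit and timeout checks above can be satisfied together with a bounded loop. This is a minimal sketch, assuming the agent exposes an async step function that returns `True` when the task is done. `run_bounded` and `step_fn` are hypothetical names, and the defaults follow the recommended values (50 steps, 5 minutes).

```python
import asyncio
from typing import Awaitable, Callable


async def run_bounded(
    step_fn: Callable[[int], Awaitable[bool]],
    max_steps: int = 50,
    timeout_seconds: float = 300.0,
) -> str:
    """Run step_fn until it returns True, max_steps is reached, or time runs out."""

    async def loop() -> str:
        for step in range(1, max_steps + 1):
            if await step_fn(step):      # hypothetical: True means "task done"
                return "completed"
        return "max_steps_reached"

    try:
        return await asyncio.wait_for(loop(), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        return "timed_out"
```

Returning a status string instead of raising keeps the caller's handling uniform across the three ways a run can end.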
2116
|
+
|
|
2117
|
+
### Container Without Memory Limit
|
|
2118
|
+
|
|
2119
|
+
Severity: WARNING
|
|
2120
|
+
|
|
2121
|
+
Containers should have memory limits to prevent DoS
|
|
2122
|
+
|
|
2123
|
+
Message: No memory limit on container. Add --memory 2g or similar.
|
|
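The container checks above (network, user, capabilities, seccomp, memory) could be approximated by scanning a `docker run` argument list. The sketch below is deliberately naive and hypothetical: it uses substring checks only, does not recognize short flags such as `-u` or `-m`, cannot see a `USER` directive in the Dockerfile, and accepts any `--network` value rather than only `none` or an allowlist.

```python
from typing import List


def audit_docker_args(args: List[str]) -> List[str]:
    """Return findings for a `docker run` argument list, mirroring the checks above."""
    joined = " ".join(args)
    findings = []
    if "--network" not in joined:
        findings.append("ERROR: sandbox has full network access")
    if "--user" not in joined:
        findings.append("ERROR: container may run as root")
    if "--cap-drop" not in joined:
        findings.append("WARNING: container has full capabilities")
    if "seccomp" not in joined:
        findings.append("WARNING: no seccomp profile specified")
    if "--memory" not in joined:
        findings.append("WARNING: no memory limit on container")
    return findings
```

A check like this fits in CI or a launch wrapper, where it can refuse to start a sandbox that produces any `ERROR` finding.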
2124
|
+
|
|
2125
|
+
### No Cost Tracking
|
|
2126
|
+
|
|
2127
|
+
Severity: WARNING
|
|
2128
|
+
|
|
2129
|
+
Computer use should track API costs
|
|
2130
|
+
|
|
2131
|
+
Message: No cost tracking. Monitor token usage to prevent bill surprises.
|
|
2132
|
+
|
|
2133
|
+
### No Maximum Cost Limit
|
|
2134
|
+
|
|
2135
|
+
Severity: INFO
|
|
2136
|
+
|
|
2137
|
+
Consider adding cost limits per task
|
|
2138
|
+
|
|
2139
|
+
Message: Consider adding max_cost_per_task to prevent expensive runaway tasks.
|
|
2140
|
+
|
|
2141
|
+
## Collaboration
|
|
2142
|
+
|
|
2143
|
+
### Delegation Triggers
|
|
2144
|
+
|
|
2145
|
+
- user needs web-only automation -> browser-automation (Playwright/Selenium more efficient for web)
|
|
2146
|
+
- user needs security review -> security-specialist (Review sandboxing, prompt injection defenses)
|
|
2147
|
+
- user needs container orchestration -> devops (Kubernetes, Docker Swarm for scaling)
|
|
2148
|
+
- user needs vision model optimization -> llm-architect (Model selection, prompt engineering)
|
|
2149
|
+
- user needs multi-agent coordination -> multi-agent-orchestration (Multiple computer use agents working together)
|
|
318
2150
|
|
|
319
2151
|
## When to Use
|
|
320
|
-
|
|
2152
|
+
|
|
2153
|
+
- User mentions or implies: computer use
|
|
2154
|
+
- User mentions or implies: desktop automation agent
|
|
2155
|
+
- User mentions or implies: screen control AI
|
|
2156
|
+
- User mentions or implies: vision-based agent
|
|
2157
|
+
- User mentions or implies: GUI automation
|
|
2158
|
+
- User mentions or implies: Claude computer
|
|
2159
|
+
- User mentions or implies: OpenAI Operator
|
|
2160
|
+
- User mentions or implies: browser agent
|
|
2161
|
+
- User mentions or implies: visual agent
|
|
2162
|
+
- User mentions or implies: RPA with AI
|