opencode-skills-collection 1.0.185 → 1.0.187

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (71)
  1. package/bundled-skills/.antigravity-install-manifest.json +5 -1
  2. package/bundled-skills/3d-web-experience/SKILL.md +152 -37
  3. package/bundled-skills/agent-evaluation/SKILL.md +1088 -26
  4. package/bundled-skills/agent-memory-systems/SKILL.md +1037 -25
  5. package/bundled-skills/agent-tool-builder/SKILL.md +668 -16
  6. package/bundled-skills/ai-agents-architect/SKILL.md +271 -31
  7. package/bundled-skills/ai-product/SKILL.md +716 -26
  8. package/bundled-skills/ai-wrapper-product/SKILL.md +450 -44
  9. package/bundled-skills/algolia-search/SKILL.md +867 -15
  10. package/bundled-skills/autonomous-agents/SKILL.md +1033 -26
  11. package/bundled-skills/aws-serverless/SKILL.md +1046 -35
  12. package/bundled-skills/azure-functions/SKILL.md +1318 -19
  13. package/bundled-skills/browser-automation/SKILL.md +1065 -28
  14. package/bundled-skills/browser-extension-builder/SKILL.md +159 -32
  15. package/bundled-skills/bullmq-specialist/SKILL.md +347 -16
  16. package/bundled-skills/clerk-auth/SKILL.md +796 -15
  17. package/bundled-skills/computer-use-agents/SKILL.md +1870 -28
  18. package/bundled-skills/context-window-management/SKILL.md +271 -18
  19. package/bundled-skills/conversation-memory/SKILL.md +453 -24
  20. package/bundled-skills/crewai/SKILL.md +252 -46
  21. package/bundled-skills/discord-bot-architect/SKILL.md +1207 -34
  22. package/bundled-skills/docs/integrations/jetski-cortex.md +3 -3
  23. package/bundled-skills/docs/integrations/jetski-gemini-loader/README.md +1 -1
  24. package/bundled-skills/docs/maintainers/repo-growth-seo.md +3 -3
  25. package/bundled-skills/docs/maintainers/skills-update-guide.md +1 -1
  26. package/bundled-skills/docs/users/bundles.md +1 -1
  27. package/bundled-skills/docs/users/claude-code-skills.md +1 -1
  28. package/bundled-skills/docs/users/gemini-cli-skills.md +1 -1
  29. package/bundled-skills/docs/users/getting-started.md +1 -1
  30. package/bundled-skills/docs/users/kiro-integration.md +1 -1
  31. package/bundled-skills/docs/users/usage.md +4 -4
  32. package/bundled-skills/docs/users/visual-guide.md +4 -4
  33. package/bundled-skills/email-systems/SKILL.md +646 -26
  34. package/bundled-skills/faf-expert/SKILL.md +221 -0
  35. package/bundled-skills/faf-wizard/SKILL.md +252 -0
  36. package/bundled-skills/file-uploads/SKILL.md +212 -11
  37. package/bundled-skills/firebase/SKILL.md +646 -16
  38. package/bundled-skills/gcp-cloud-run/SKILL.md +1117 -32
  39. package/bundled-skills/graphql/SKILL.md +1026 -27
  40. package/bundled-skills/hubspot-integration/SKILL.md +804 -19
  41. package/bundled-skills/idea-darwin/SKILL.md +120 -0
  42. package/bundled-skills/inngest/SKILL.md +431 -16
  43. package/bundled-skills/interactive-portfolio/SKILL.md +342 -44
  44. package/bundled-skills/langfuse/SKILL.md +296 -41
  45. package/bundled-skills/langgraph/SKILL.md +259 -50
  46. package/bundled-skills/micro-saas-launcher/SKILL.md +343 -44
  47. package/bundled-skills/neon-postgres/SKILL.md +572 -15
  48. package/bundled-skills/nextjs-supabase-auth/SKILL.md +269 -21
  49. package/bundled-skills/notion-template-business/SKILL.md +371 -44
  50. package/bundled-skills/personal-tool-builder/SKILL.md +537 -44
  51. package/bundled-skills/plaid-fintech/SKILL.md +825 -19
  52. package/bundled-skills/prompt-caching/SKILL.md +438 -25
  53. package/bundled-skills/rag-engineer/SKILL.md +271 -29
  54. package/bundled-skills/salesforce-development/SKILL.md +912 -19
  55. package/bundled-skills/satori/SKILL.md +54 -0
  56. package/bundled-skills/scroll-experience/SKILL.md +381 -44
  57. package/bundled-skills/segment-cdp/SKILL.md +817 -19
  58. package/bundled-skills/shopify-apps/SKILL.md +1475 -19
  59. package/bundled-skills/slack-bot-builder/SKILL.md +1162 -28
  60. package/bundled-skills/telegram-bot-builder/SKILL.md +152 -37
  61. package/bundled-skills/telegram-mini-app/SKILL.md +445 -44
  62. package/bundled-skills/trigger-dev/SKILL.md +916 -27
  63. package/bundled-skills/twilio-communications/SKILL.md +1310 -28
  64. package/bundled-skills/upstash-qstash/SKILL.md +898 -27
  65. package/bundled-skills/vercel-deployment/SKILL.md +637 -39
  66. package/bundled-skills/viral-generator-builder/SKILL.md +132 -37
  67. package/bundled-skills/voice-agents/SKILL.md +937 -27
  68. package/bundled-skills/voice-ai-development/SKILL.md +375 -46
  69. package/bundled-skills/workflow-automation/SKILL.md +982 -29
  70. package/bundled-skills/zapier-make-patterns/SKILL.md +772 -27
  71. package/package.json +1 -1
@@ -1,13 +1,20 @@
1
1
  ---
2
2
  name: computer-use-agents
3
- description: "The fundamental architecture of computer use agents: observe screen, reason about next action, execute action, repeat. This loop integrates vision models with action execution through an iterative pipeline."
3
+ description: Build AI agents that interact with computers like humans do -
4
+ viewing screens, moving cursors, clicking buttons, and typing text. Covers
5
+ Anthropic's Computer Use, OpenAI's Operator/CUA, and open-source alternatives.
4
6
  risk: unknown
5
- source: "vibeship-spawner-skills (Apache 2.0)"
6
- date_added: "2026-02-27"
7
+ source: vibeship-spawner-skills (Apache 2.0)
8
+ date_added: 2026-02-27
7
9
  ---
8
10
 
9
11
  # Computer Use Agents
10
12
 
13
+ Build AI agents that interact with computers like humans do - viewing screens,
14
+ moving cursors, clicking buttons, and typing text. Covers Anthropic's Computer
15
+ Use, OpenAI's Operator/CUA, and open-source alternatives. Critical focus on
16
+ sandboxing, security, and handling the unique challenges of vision-based control.
17
+
11
18
  ## Patterns
12
19
 
13
20
  ### Perception-Reasoning-Action Loop
@@ -25,10 +32,8 @@ Key components:
25
32
  Critical insight: Vision agents are completely still during the "thinking"
26
33
  phase (1-5 seconds), creating a detectable pause pattern.
27
34
 
35
+ **When to use**: Building any computer use agent from scratch, integrating vision models with desktop control, understanding agent behavior patterns
28
36
 
29
- **When to use**: ['Building any computer use agent from scratch', 'Integrating vision models with desktop control', 'Understanding agent behavior patterns']
30
-
31
- ```python
32
37
  from anthropic import Anthropic
33
38
  from PIL import Image
34
39
  import base64
@@ -83,8 +88,116 @@ class ComputerUseAgent:
83
88
  amount = action.get("amount", 3)
84
89
  scroll = -amount if direction == "down" else amount
85
90
  pyautogui.scroll(scroll)
86
- return {"success": True, "action": f"scrolled {dir
87
- ```
91
+ return {"success": True, "action": f"scrolled {direction}"}
92
+
93
+ elif action_type == "move":
94
+ x, y = action["x"], action["y"]
95
+ pyautogui.moveTo(x, y)
96
+ return {"success": True, "action": f"moved to ({x}, {y})"}
97
+
98
+ else:
99
+ return {"success": False, "error": f"Unknown action: {action_type}"}
100
+
101
+ def run(self, task: str) -> dict:
102
+ """
103
+ Run perception-reasoning-action loop until task complete.
104
+
105
+ The loop:
106
+ 1. Screenshot current state
107
+ 2. Send to vision model with task context
108
+ 3. Parse action from response
109
+ 4. Execute action
110
+ 5. Repeat until done or max steps
111
+ """
112
+ messages = []
113
+ step_count = 0
114
+
115
+ system_prompt = """You are a computer use agent. You can see the screen
116
+ and control mouse/keyboard.
117
+
118
+ Available actions (respond with JSON):
119
+ - {"type": "click", "x": 100, "y": 200, "button": "left"}
120
+ - {"type": "type", "text": "hello world"}
121
+ - {"type": "key", "key": "enter"}
122
+ - {"type": "scroll", "direction": "down", "amount": 3}
123
+ - {"type": "done", "result": "task completed successfully"}
124
+
125
+ Always respond with ONLY a JSON action object.
126
+ Be precise with coordinates - click exactly where needed.
127
+ If you see an error, try to recover.
128
+ """
129
+
130
+ while step_count < self.max_steps:
131
+ step_count += 1
132
+
133
+ # 1. PERCEPTION: Capture current screen
134
+ screenshot_b64 = self.capture_screenshot()
135
+
136
+ # 2. REASONING: Send to vision model
137
+ user_content = [
138
+ {"type": "text", "text": f"Task: {task}\n\nStep {step_count}. What action should I take?"},
139
+ {"type": "image", "source": {
140
+ "type": "base64",
141
+ "media_type": "image/png",
142
+ "data": screenshot_b64
143
+ }}
144
+ ]
145
+
146
+ messages.append({"role": "user", "content": user_content})
147
+
148
+ response = self.client.messages.create(
149
+ model=self.model,
150
+ max_tokens=1024,
151
+ system=system_prompt,
152
+ messages=messages
153
+ )
154
+
155
+ assistant_message = response.content[0].text
156
+ messages.append({"role": "assistant", "content": assistant_message})
157
+
158
+ # 3. Parse action from response
159
+ import json
160
+ try:
161
+ action = json.loads(assistant_message)
162
+ except json.JSONDecodeError:
163
+ # Try to extract JSON from response
164
+ import re
165
+ match = re.search(r'\{[^}]+\}', assistant_message)
166
+ if match:
167
+ action = json.loads(match.group())
168
+ else:
169
+ continue
170
+
171
+ # Check if done
172
+ if action.get("type") == "done":
173
+ return {
174
+ "success": True,
175
+ "result": action.get("result"),
176
+ "steps": step_count
177
+ }
178
+
179
+ # 4. ACTION: Execute
180
+ result = self.execute_action(action)
181
+
182
+ # Small delay for UI to update
183
+ time.sleep(self.action_delay)
184
+
185
+ return {
186
+ "success": False,
187
+ "error": "Max steps reached",
188
+ "steps": step_count
189
+ }
190
+
191
+ # Usage
192
+ agent = ComputerUseAgent(Anthropic())
193
+ result = agent.run("Open Chrome and search for 'weather today'")
194
+
195
+ ### Anti-patterns
196
+
197
+ - Running without step limits (infinite loops)
198
+ - No delay between actions (UI can't keep up)
199
+ - Screenshots at full resolution (token explosion)
200
+ - Ignoring action failures (no recovery)
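
The regex fallback in the loop above (`\{[^}]+\}`) stops at the first closing brace, so a response containing a nested object would be cut short and fail to parse. A minimal sketch of a more tolerant parser follows; the function name `extract_action` is hypothetical and not part of the skill. It relies only on the standard library's `json.JSONDecoder.raw_decode`.

```python
import json
from typing import Optional


def extract_action(text: str) -> Optional[dict]:
    """Return the first JSON object embedded in `text`, or None if there is none."""
    decoder = json.JSONDecoder()
    for start, ch in enumerate(text):
        if ch != "{":
            continue
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores whatever text follows it.
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue  # not valid JSON from this brace; try the next one
        if isinstance(obj, dict):
            return obj
    return None
```

When this returns `None`, the loop could append a short corrective user message instead of silently continuing, so the model gets a chance to recover.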
88
201
 
89
202
  ### Sandboxed Environment Pattern
90
203
 
@@ -102,10 +215,8 @@ Key isolation requirements:
102
215
  The goal is "blast radius minimization" - if the agent goes wrong,
103
216
  damage is contained to the sandbox.
104
217
 
218
+ **When to use**: Deploying any computer use agent, testing agent behavior safely, running untrusted automation tasks
105
219
 
106
- **When to use**: ['Deploying any computer use agent', 'Testing agent behavior safely', 'Running untrusted automation tasks']
107
-
108
- ```python
109
220
  # Dockerfile for sandboxed computer use environment
110
221
  # Based on Anthropic's reference implementation pattern
111
222
 
@@ -208,8 +319,89 @@ volumes:
208
319
  # Python wrapper with additional runtime sandboxing
209
320
  import subprocess
210
321
  import os
211
- from dataclasses im
212
- ```
322
+ from dataclasses import dataclass
323
+ from typing import Optional
324
+
325
+ @dataclass
326
+ class SandboxConfig:
327
+ """Configuration for agent sandbox."""
328
+ network_allowed: Optional[list[str]] = None # Allowed domains
329
+ max_runtime_seconds: int = 300
330
+ max_memory_mb: int = 2048
331
+ allow_downloads: bool = False
332
+ allow_clipboard: bool = False
333
+
334
+ class SandboxedAgent:
335
+ """
336
+ Run computer use agent in Docker sandbox.
337
+ """
338
+
339
+ def __init__(self, config: SandboxConfig):
340
+ self.config = config
341
+ self.container_id: Optional[str] = None
342
+
343
+ def start(self):
344
+ """Start sandboxed environment."""
345
+ # Build network rules
346
+ network_rules = ""
347
+ if self.config.network_allowed:
348
+ for domain in self.config.network_allowed:
349
+ network_rules += f"--add-host={domain}:$(dig +short {domain}) "
350
+ else:
351
+ network_rules = "--network=none"
352
+
353
+ cmd = f"""
354
+ docker run -d \
355
+ --name computer-use-sandbox-$$ \
356
+ --security-opt no-new-privileges \
357
+ --cap-drop ALL \
358
+ --memory {self.config.max_memory_mb}m \
359
+ --cpus 2 \
360
+ --read-only \
361
+ --tmpfs /tmp \
362
+ {network_rules} \
363
+ computer-use-agent:latest
364
+ """
365
+
366
+ result = subprocess.run(cmd, shell=True, capture_output=True)
367
+ self.container_id = result.stdout.decode().strip()
368
+
369
+ # Set up kill timer
370
+ subprocess.Popen([
371
+ "sh", "-c",
372
+ f"sleep {self.config.max_runtime_seconds} && docker kill {self.container_id}"
373
+ ])
374
+
375
+ return self.container_id
376
+
377
+ def execute_task(self, task: str) -> dict:
378
+ """Execute task in sandbox."""
379
+ if not self.container_id:
380
+ self.start()
381
+
382
+ # Send task to agent via API
383
+ import requests
384
+ response = requests.post(
385
+ f"http://localhost:8080/task",
386
+ json={"task": task},
387
+ timeout=self.config.max_runtime_seconds
388
+ )
389
+
390
+ return response.json()
391
+
392
+ def stop(self):
393
+ """Stop and remove sandbox."""
394
+ if self.container_id:
395
+ subprocess.run(f"docker rm -f {self.container_id}", shell=True)
396
+ self.container_id = None
397
+
398
+ ### Anti-patterns
399
+
400
+ - Running agents on host system directly
401
+ - Giving sandbox full network access
402
+ - Running as root in container
403
+ - No resource limits (denial of service)
404
+ - Persistent storage (data can leak between runs)
213
405
 
214
406
  ### Anthropic Computer Use Implementation
215
407
 
@@ -231,10 +423,8 @@ Tool versions:
231
423
  Critical limitation: "Some UI elements (like dropdowns and scrollbars)
232
424
  might be tricky for Claude to manipulate" - Anthropic docs
233
425
 
426
+ **When to use**: Building production computer use agents, need highest quality vision understanding, full desktop control (not just browser)
234
427
 
235
- **When to use**: ['Building production computer use agents', 'Need highest quality vision understanding', 'Full desktop control (not just browser)']
236
-
237
- ```python
238
428
  from anthropic import Anthropic
239
429
  from anthropic.types.beta import (
240
430
  BetaToolComputerUse20241022,
@@ -301,20 +491,1672 @@ class AnthropicComputerUse:
301
491
  subprocess.run(["scrot", "/tmp/screenshot.png"])
302
492
 
303
493
  with open("/tmp/screenshot.png", "rb") as f:
304
-
494
+ img_data = f.read()
495
+
496
+ # Resize for efficiency
497
+ img = Image.open(io.BytesIO(img_data))
498
+ img = img.resize(self.screen_size, Image.LANCZOS)
499
+
500
+ buffer = io.BytesIO()
501
+ img.save(buffer, format="PNG")
502
+
503
+ return {
504
+ "type": "image",
505
+ "source": {
506
+ "type": "base64",
507
+ "media_type": "image/png",
508
+ "data": base64.b64encode(buffer.getvalue()).decode()
509
+ }
510
+ }
511
+
512
+ elif action == "mouse_move":
513
+ x, y = input.get("coordinate", [0, 0])
514
+ subprocess.run(["xdotool", "mousemove", str(x), str(y)])
515
+ return {"success": True}
516
+
517
+ elif action == "left_click":
518
+ subprocess.run(["xdotool", "click", "1"])
519
+ return {"success": True}
520
+
521
+ elif action == "right_click":
522
+ subprocess.run(["xdotool", "click", "3"])
523
+ return {"success": True}
524
+
525
+ elif action == "double_click":
526
+ subprocess.run(["xdotool", "click", "--repeat", "2", "1"])
527
+ return {"success": True}
528
+
529
+ elif action == "type":
530
+ text = input.get("text", "")
531
+ # Use xdotool type with delay for reliability
532
+ subprocess.run(["xdotool", "type", "--delay", "50", text])
533
+ return {"success": True}
534
+
535
+ elif action == "key":
536
+ key = input.get("key", "")
537
+ # Map common key names
538
+ key_map = {
539
+ "return": "Return",
540
+ "enter": "Return",
541
+ "tab": "Tab",
542
+ "escape": "Escape",
543
+ "backspace": "BackSpace",
544
+ }
545
+ xdotool_key = key_map.get(key.lower(), key)
546
+ subprocess.run(["xdotool", "key", xdotool_key])
547
+ return {"success": True}
548
+
549
+ elif action == "scroll":
550
+ direction = input.get("direction", "down")
551
+ amount = input.get("amount", 3)
552
+ button = "5" if direction == "down" else "4"
553
+ for _ in range(amount):
554
+ subprocess.run(["xdotool", "click", button])
555
+ return {"success": True}
556
+
557
+ return {"error": f"Unknown action: {action}"}
558
+
559
+ def _handle_bash(self, input: dict) -> dict:
560
+ """Execute bash command."""
561
+ command = input.get("command", "")
562
+
563
+ # Security: Sanitize and limit commands
564
+ dangerous_patterns = ["rm -rf", "mkfs", "dd if=", "> /dev/"]
565
+ for pattern in dangerous_patterns:
566
+ if pattern in command:
567
+ return {"error": "Dangerous command blocked"}
568
+
569
+ try:
570
+ result = subprocess.run(
571
+ command,
572
+ shell=True,
573
+ capture_output=True,
574
+ text=True,
575
+ timeout=30
576
+ )
577
+ return {
578
+ "stdout": result.stdout[:10000], # Limit output
579
+ "stderr": result.stderr[:1000],
580
+ "returncode": result.returncode
581
+ }
582
+ except subprocess.TimeoutExpired:
583
+ return {"error": "Command timed out"}
584
+
585
+ def _handle_editor(self, input: dict) -> dict:
586
+ """Handle text editor operations."""
587
+ command = input.get("command")
588
+ path = input.get("path")
589
+
590
+ if command == "view":
591
+ try:
592
+ with open(path, "r") as f:
593
+ content = f.read()
594
+ return {"content": content[:50000]} # Limit size
595
+ except Exception as e:
596
+ return {"error": str(e)}
597
+
598
+ elif command == "str_replace":
599
+ old_str = input.get("old_str")
600
+ new_str = input.get("new_str")
601
+ try:
602
+ with open(path, "r") as f:
603
+ content = f.read()
604
+ if old_str not in content:
605
+ return {"error": "old_str not found in file"}
606
+ content = content.replace(old_str, new_str, 1)
607
+ with open(path, "w") as f:
608
+ f.write(content)
609
+ return {"success": True}
610
+ except Exception as e:
611
+ return {"error": str(e)}
612
+
613
+ return {"error": f"Unknown editor command: {command}"}
614
+
615
+ def run_task(self, task: str, max_steps: int = 50) -> dict:
616
+ """Run computer use task with agentic loop."""
617
+ messages = [{"role": "user", "content": task}]
618
+ tools = self.get_tools()
619
+
620
+ for step in range(max_steps):
621
+ response = self.client.beta.messages.create(
622
+ model=self.model,
623
+ max_tokens=4096,
624
+ tools=tools,
625
+ messages=messages,
626
+ betas=["computer-use-2024-10-22"]
627
+ )
628
+
629
+ # Check for completion
630
+ if response.stop_reason == "end_turn":
631
+ return {
632
+ "success": True,
633
+ "result": response.content[0].text if response.content else "",
634
+ "steps": step + 1
635
+ }
636
+
637
+ # Handle tool use
638
+ if response.stop_reason == "tool_use":
639
+ messages.append({"role": "assistant", "content": response.content})
640
+
641
+ tool_results = []
642
+ for block in response.content:
643
+ if block.type == "tool_use":
644
+ result = self.execute_tool(block.name, block.input)
645
+ tool_results.append({
646
+ "type": "tool_result",
647
+ "tool_use_id": block.id,
648
+ "content": result
649
+ })
650
+
651
+ messages.append({"role": "user", "content": tool_results})
652
+
653
+ return {"success": False, "error": "Max steps reached"}
654
+
655
+ ### Anti-patterns
656
+
657
+ - Not using betas=['computer-use-2024-10-22'] flag
658
+ - Full resolution screenshots (wasteful)
659
+ - No command sanitization for bash tool
660
+ - Unbounded execution time
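
The `_handle_bash` method above blocks commands by substring match, which a variant such as `rm -r -f` would slip past. A stricter option, sketched here as an assumption rather than as the skill's approach, is to allowlist the executable and run the parsed arguments without a shell. The `ALLOWED_COMMANDS` set and the function name `parse_allowed_command` are hypothetical and would need tuning for a real deployment.

```python
import shlex
from typing import Optional

# Hypothetical allowlist; anything not named here is rejected.
ALLOWED_COMMANDS = {"ls", "cat", "echo", "pwd", "grep"}


def parse_allowed_command(command: str) -> Optional[list]:
    """Return an argv list if the executable is allowlisted, else None."""
    try:
        argv = shlex.split(command)
    except ValueError:
        return None  # unbalanced quotes or similar parse error
    if not argv or argv[0] not in ALLOWED_COMMANDS:
        return None
    return argv
```

The returned list is meant for `subprocess.run(argv, capture_output=True, text=True, timeout=30)` with no `shell=True`, so characters like `;` or `|` are passed as literal arguments instead of being interpreted by a shell.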
661
+
662
+ ### Browser-Use Pattern (Playwright-based)
663
+
664
+ For browser-only automation, using structured DOM access is more efficient
665
+ than pixel-based computer use. Playwright MCP allows LLMs to control
666
+ browsers using accessibility snapshots rather than screenshots.
667
+
668
+ Advantages over vision-based:
669
+ - Faster: No image processing required
670
+ - Cheaper: Text tokens vs image tokens
671
+ - More precise: Direct element targeting
672
+ - More reliable: No coordinate drift
673
+
674
+ When to use vision vs structured:
675
+ - Vision: Desktop apps, complex UIs, visual verification
676
+ - Structured: Web automation, form filling, data extraction
677
+
678
+ **When to use**: Browser-only automation tasks, form filling and web interactions, when speed and cost matter more than visual understanding
679
+
680
+ from playwright.async_api import async_playwright
681
+ from dataclasses import dataclass
682
+ from typing import Optional
683
+ import asyncio
684
+
685
+ @dataclass
686
+ class BrowserAction:
687
+ """Structured browser action."""
688
+ action: str # click, type, navigate, scroll, extract
689
+ selector: Optional[str] = None
690
+ text: Optional[str] = None
691
+ url: Optional[str] = None
692
+
693
+ class BrowserUseAgent:
694
+ """
695
+ Browser automation using Playwright with structured commands.
696
+ More efficient than pixel-based for web tasks.
697
+ """
698
+
699
+ def __init__(self):
700
+ self.browser = None
701
+ self.page = None
702
+
703
+ async def start(self, headless: bool = True):
704
+ """Start browser session."""
705
+ self.playwright = await async_playwright().start()
706
+ self.browser = await self.playwright.chromium.launch(headless=headless)
707
+ self.page = await self.browser.new_page()
708
+
709
+ async def get_page_snapshot(self) -> dict:
710
+ """
711
+ Get structured snapshot of page for LLM.
712
+ Uses accessibility tree for efficiency.
713
+ """
714
+ # Get accessibility tree
715
+ snapshot = await self.page.accessibility.snapshot()
716
+
717
+ # Get simplified DOM info
718
+ elements = await self.page.evaluate('''() => {
719
+ const interactable = [];
720
+ const selector = 'a, button, input, select, textarea, [role="button"]';
721
+ document.querySelectorAll(selector).forEach((el, i) => {
722
+ const rect = el.getBoundingClientRect();
723
+ if (rect.width > 0 && rect.height > 0) {
724
+ interactable.push({
725
+ index: i,
726
+ tag: el.tagName.toLowerCase(),
727
+ text: el.textContent?.trim().slice(0, 100),
728
+ type: el.type,
729
+ placeholder: el.placeholder,
730
+ name: el.name,
731
+ id: el.id,
732
+ class: el.className
733
+ });
734
+ }
735
+ });
736
+ return interactable;
737
+ }''')
738
+
739
+ return {
740
+ "url": self.page.url,
741
+ "title": await self.page.title(),
742
+ "accessibility_tree": snapshot,
743
+ "interactable_elements": elements[:50] # Limit for token efficiency
744
+ }
745
+
746
+ async def execute_action(self, action: BrowserAction) -> dict:
747
+ """Execute structured browser action."""
748
+
749
+ try:
750
+ if action.action == "navigate":
751
+ await self.page.goto(action.url, wait_until="domcontentloaded")
752
+ return {"success": True, "url": self.page.url}
753
+
754
+ elif action.action == "click":
755
+ await self.page.click(action.selector, timeout=5000)
756
+ await self.page.wait_for_load_state("networkidle", timeout=5000)
757
+ return {"success": True}
758
+
759
+ elif action.action == "type":
760
+ await self.page.fill(action.selector, action.text)
761
+ return {"success": True}
762
+
763
+ elif action.action == "scroll":
764
+ direction = action.text or "down"
765
+ distance = 500 if direction == "down" else -500
766
+ await self.page.evaluate(f"window.scrollBy(0, {distance})")
767
+ return {"success": True}
768
+
769
+ elif action.action == "extract":
770
+ # Extract text content
771
+ if action.selector:
772
+ text = await self.page.text_content(action.selector)
773
+ else:
774
+ text = await self.page.text_content("body")
775
+ return {"success": True, "text": text[:5000]}
776
+
777
+ elif action.action == "screenshot":
778
+ # Fall back to vision when needed
779
+ screenshot = await self.page.screenshot(type="png")
780
+ import base64
781
+ return {
782
+ "success": True,
783
+ "image": base64.b64encode(screenshot).decode()
784
+ }
785
+
786
+ except Exception as e:
787
+ return {"success": False, "error": str(e)}
788
+
789
+ return {"success": False, "error": f"Unknown action: {action.action}"}
790
+
791
+ async def run_with_llm(self, task: str, llm_client, max_steps: int = 20):
792
+ """
793
+ Run browser task with LLM decision making.
794
+ Uses structured DOM instead of screenshots.
795
+ """
796
+
797
+ system_prompt = """You are a browser automation agent. You receive
798
+ page snapshots with interactable elements and decide actions.
799
+
800
+ Respond with JSON action:
801
+ - {"action": "navigate", "url": "https://..."}
802
+ - {"action": "click", "selector": "button.submit"}
803
+ - {"action": "type", "selector": "input[name='email']", "text": "..."}
804
+ - {"action": "scroll", "text": "down"}
805
+ - {"action": "extract", "selector": ".results"}
806
+ - {"action": "done", "result": "task completed"}
807
+
808
+ Use CSS selectors based on the element info provided.
809
+ Prefer id > name > class > text content for selectors.
810
+ """
811
+
812
+ messages = []
813
+
814
+ for step in range(max_steps):
815
+ # Get current page state
816
+ snapshot = await self.get_page_snapshot()
817
+
818
+ user_message = f"""Task: {task}
819
+
820
+ Current page:
821
+ URL: {snapshot['url']}
822
+ Title: {snapshot['title']}
823
+
824
+ Interactable elements:
825
+ {snapshot['interactable_elements']}
826
+
827
+ What action should I take?"""
828
+
829
+ messages.append({"role": "user", "content": user_message})
830
+
831
+ # Get LLM decision
832
+ response = llm_client.messages.create(
833
+ model="claude-sonnet-4-20250514",
834
+ max_tokens=1024,
835
+ system=system_prompt,
836
+ messages=messages
837
+ )
838
+
839
+ assistant_text = response.content[0].text
840
+ messages.append({"role": "assistant", "content": assistant_text})
841
+
842
+ # Parse and execute
843
+ import json
844
+ action_dict = json.loads(assistant_text)
845
+
846
+ if action_dict.get("action") == "done":
847
+ return {"success": True, "result": action_dict.get("result")}
848
+
849
+ action = BrowserAction(**action_dict)
850
+ result = await self.execute_action(action)
851
+
852
+ if not result.get("success"):
853
+ messages.append({
854
+ "role": "user",
855
+ "content": f"Action failed: {result.get('error')}"
856
+ })
857
+
858
+ await asyncio.sleep(0.5) # Rate limit
859
+
860
+ return {"success": False, "error": "Max steps reached"}
861
+
862
+ async def close(self):
863
+ """Clean up browser."""
864
+ if self.browser:
865
+ await self.browser.close()
866
+ if hasattr(self, 'playwright'):
867
+ await self.playwright.stop()
868
+
869
+ # Usage
870
+ async def main():
871
+ agent = BrowserUseAgent()
872
+ await agent.start(headless=False)
873
+
874
+ from anthropic import Anthropic
875
+ result = await agent.run_with_llm(
876
+ "Go to weather.com and find the weather for New York",
877
+ Anthropic()
878
+ )
879
+
880
+ print(result)
881
+ await agent.close()
882
+
883
+ asyncio.run(main())
884
+
885
+ ### Anti-patterns
886
+
887
+ - Using screenshots when DOM access works
888
+ - Not waiting for page loads
889
+ - Hardcoded selectors that break
890
+ - No error recovery for stale elements
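
The last anti-pattern, missing recovery for stale elements, can be addressed in several ways. One simple option is to retry a failed browser action with a short backoff. The helper below is a minimal sketch; `retry_async` and its parameters are hypothetical names, and it assumes the wrapped operation is safe to repeat.

```python
import asyncio


async def retry_async(operation, attempts: int = 3, base_delay: float = 0.2):
    """Await `operation()` until it succeeds or `attempts` are exhausted."""
    last_error = None
    for attempt in range(attempts):
        try:
            return await operation()
        except Exception as exc:  # broad on purpose: this is only a sketch
            last_error = exc
            if attempt < attempts - 1:
                # Exponential backoff: base_delay, 2x, 4x, ...
                await asyncio.sleep(base_delay * (2 ** attempt))
    raise last_error
```

Inside `execute_action`, a click could then be wrapped as `await retry_async(lambda: self.page.click(action.selector, timeout=5000))`. Actions that are not idempotent, such as form submission, are better left without automatic retries.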
891
+
892
+ ### User Confirmation Pattern
893
+
894
+ For sensitive actions, agents should pause and ask for human confirmation.
895
+ "ChatGPT agent also pauses and asks for confirmation prior to taking
896
+ sensitive steps such as completing a purchase."
897
+
898
+ Sensitivity levels:
899
+ 1. LOW: Navigation, reading (auto-approve)
900
+ 2. MEDIUM: Form filling, clicking (log, maybe confirm)
901
+ 3. HIGH: Purchases, authentication, file operations (always confirm)
902
+ 4. CRITICAL: Credential entry, financial transactions (confirm + review)
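
The four levels above are ordered, so one way to model them is an integer-valued enum, where `max()` picks the stricter level directly. With string values such as "high" and "critical", a comparison would be alphabetical instead. The sketch below is illustrative only; `Severity`, `escalate`, and `needs_confirmation` are hypothetical names.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Sensitivity levels, ordered from least to most strict."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


def escalate(base: Severity, *, involves_credentials: bool = False,
             involves_money: bool = False, irreversible: bool = False) -> Severity:
    """Raise the base severity according to context; never lower it."""
    if involves_credentials or involves_money:
        return Severity.CRITICAL
    if irreversible:
        return max(base, Severity.HIGH)
    return base


def needs_confirmation(severity: Severity, auto_confirm_medium: bool = False) -> bool:
    """LOW is auto-approved, MEDIUM is configurable, HIGH and above always confirm."""
    if severity == Severity.LOW:
        return False
    if severity == Severity.MEDIUM:
        return not auto_confirm_medium
    return True
```

Because the members are integers, an already CRITICAL action that is also irreversible stays CRITICAL.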
903
+
904
+ **When to use**: Actions with real-world consequences, financial transactions, authentication flows, file modifications
905
+
906
+ from enum import Enum
907
+ from dataclasses import dataclass
908
+ from typing import Callable, Optional
909
+ import asyncio
910
+
911
+ class ActionSeverity(Enum):
912
+ LOW = "low" # Auto-approve
913
+ MEDIUM = "medium" # Log, optional confirm
914
+ HIGH = "high" # Always confirm
915
+ CRITICAL = "critical" # Confirm + review details
916
+
917
+ @dataclass
918
+ class SensitiveAction:
919
+ """Action that may need user confirmation."""
920
+ action_type: str
921
+ description: str
922
+ severity: ActionSeverity
923
+ details: dict
924
+
925
+ class ConfirmationGate:
926
+ """
927
+ Gate sensitive actions through user confirmation.
928
+ """
929
+
930
+ # Action type -> severity mapping
931
+ ACTION_SEVERITY = {
932
+ # LOW - auto-approve
933
+ "navigate": ActionSeverity.LOW,
934
+ "scroll": ActionSeverity.LOW,
935
+ "read": ActionSeverity.LOW,
936
+ "screenshot": ActionSeverity.LOW,
937
+
938
+ # MEDIUM - log and maybe confirm
939
+ "click": ActionSeverity.MEDIUM,
940
+ "type": ActionSeverity.MEDIUM,
941
+ "search": ActionSeverity.MEDIUM,
942
+
943
+ # HIGH - always confirm
944
+ "download": ActionSeverity.HIGH,
945
+ "submit_form": ActionSeverity.HIGH,
946
+ "login": ActionSeverity.HIGH,
947
+ "file_write": ActionSeverity.HIGH,
948
+
949
+ # CRITICAL - confirm with full review
950
+ "purchase": ActionSeverity.CRITICAL,
951
+ "enter_password": ActionSeverity.CRITICAL,
952
+ "enter_credit_card": ActionSeverity.CRITICAL,
953
+ "send_money": ActionSeverity.CRITICAL,
954
+ "delete": ActionSeverity.CRITICAL,
955
+ }
956
+
957
+ def __init__(
958
+ self,
959
+ confirm_callback: Optional[Callable[[SensitiveAction], bool]] = None,
960
+ auto_confirm_low: bool = True,
961
+ auto_confirm_medium: bool = False
962
+ ):
963
+ self.confirm_callback = confirm_callback or self._default_confirm
964
+ self.auto_confirm_low = auto_confirm_low
965
+ self.auto_confirm_medium = auto_confirm_medium
966
+ self.action_log = []
967
+
968
+ def _default_confirm(self, action: SensitiveAction) -> bool:
969
+ """Default confirmation via CLI prompt."""
970
+ print(f"\n{'='*60}")
971
+ print("ACTION CONFIRMATION REQUIRED")
972
+ print(f"{'='*60}")
973
+ print(f"Type: {action.action_type}")
974
+ print(f"Severity: {action.severity.value.upper()}")
975
+ print(f"Description: {action.description}")
976
+ print(f"Details: {action.details}")
977
+ print(f"{'='*60}")
978
+
979
+ while True:
980
+ response = input("Allow this action? [y/n]: ").lower().strip()
981
+ if response in ['y', 'yes']:
982
+ return True
983
+ elif response in ['n', 'no']:
984
+ return False
985
+
986
+ def classify_action(self, action_type: str, context: dict) -> ActionSeverity:
987
+ """Classify action severity, considering context."""
988
+ base_severity = self.ACTION_SEVERITY.get(action_type, ActionSeverity.MEDIUM)
989
+
990
+ # Escalate based on context
991
+ if context.get("involves_credentials"):
992
+ return ActionSeverity.CRITICAL
993
+ if context.get("involves_money"):
994
+ return ActionSeverity.CRITICAL
995
+ if context.get("irreversible"):
996
+ return max(base_severity, ActionSeverity.HIGH, key=list(ActionSeverity).index)  # rank by declaration order; string values would compare alphabetically
997
+
998
+ return base_severity
999
+
1000
+ def check_action(
1001
+ self,
1002
+ action_type: str,
1003
+ description: str,
1004
+ details: dict = None
1005
+ ) -> tuple[bool, str]:
1006
+ """
1007
+ Check if action should proceed.
1008
+ Returns (approved, reason).
1009
+ """
1010
+ details = details or {}
1011
+ severity = self.classify_action(action_type, details)
1012
+
1013
+ action = SensitiveAction(
1014
+ action_type=action_type,
1015
+ description=description,
1016
+ severity=severity,
1017
+ details=details
1018
+ )
1019
+
1020
+ # Log all actions
1021
+ self.action_log.append({
1022
+ "action": action,
1023
+ "timestamp": __import__('datetime').datetime.now().isoformat()
1024
+ })
1025
+
1026
+ # Auto-approve low severity
1027
+ if severity == ActionSeverity.LOW and self.auto_confirm_low:
1028
+ return True, "auto-approved (low severity)"
1029
+
1030
+ # Maybe auto-approve medium
1031
+ if severity == ActionSeverity.MEDIUM and self.auto_confirm_medium:
1032
+ return True, "auto-approved (medium severity)"
1033
+
1034
+ # Request confirmation
1035
+ approved = self.confirm_callback(action)
1036
+
1037
+ if approved:
1038
+ return True, "user approved"
1039
+ else:
1040
+ return False, "user rejected"
1041
+
1042
+ class ConfirmedComputerUseAgent:
1043
+ """
1044
+ Computer use agent with confirmation gates.
1045
+ """
1046
+
1047
+ def __init__(self, base_agent, confirmation_gate: ConfirmationGate):
1048
+ self.agent = base_agent
1049
+ self.gate = confirmation_gate
1050
+
1051
+ def execute_action(self, action: dict) -> dict:
1052
+ """Execute action with confirmation check."""
1053
+ action_type = action.get("type", "unknown")
1054
+
1055
+ # Build description
1056
+ if action_type == "click":
1057
+ desc = f"Click at ({action.get('x')}, {action.get('y')})"
1058
+ elif action_type == "type":
1059
+ text = action.get('text', '')
1060
+ # Mask if looks like password
1061
+ if self._looks_sensitive(text):
1062
+ desc = f"Type sensitive text ({len(text)} chars)"
1063
+ else:
1064
+ desc = f"Type: {text[:50]}{'...' if len(text) > 50 else ''}"
1065
+ else:
1066
+ desc = f"Execute: {action_type}"
1067
+
1068
+ # Context for severity classification
1069
+ context = {
1070
+ "involves_credentials": self._looks_sensitive(action.get("text", "")),
1071
+ "involves_money": self._mentions_money(action),
1072
+ }
1073
+
1074
+ # Check with gate
1075
+ approved, reason = self.gate.check_action(
1076
+ action_type, desc, context
1077
+ )
1078
+
1079
+ if not approved:
1080
+ return {
1081
+ "success": False,
1082
+ "error": f"Action blocked: {reason}",
1083
+ "action": action_type
1084
+ }
1085
+
1086
+ # Execute if approved
1087
+ return self.agent.execute_action(action)
1088
+
1089
+ def _looks_sensitive(self, text: str) -> bool:
1090
+ """Check if text looks like sensitive data."""
1091
+ if not text:
1092
+ return False
1093
+ # Common patterns
1094
+ patterns = [
1095
+ r'\b\d{16}\b', # Credit card
1096
+ r'\b\d{3,4}\b.*\b\d{3,4}\b', # CVV-like
1097
+ r'password',
1098
+ r'secret',
1099
+ r'api.?key',
1100
+ r'token'
1101
+ ]
1102
+ import re
1103
+ return any(re.search(p, text.lower()) for p in patterns)
1104
+
1105
+ def _mentions_money(self, action: dict) -> bool:
1106
+ """Check if action involves money."""
1107
+ text = str(action)
1108
+ money_patterns = [
1109
+ r'\$\d+', r'pay', r'purchase', r'buy', r'checkout',
1110
+ r'credit', r'debit', r'invoice', r'payment'
1111
+ ]
1112
+ import re
1113
+ return any(re.search(p, text.lower()) for p in money_patterns)
1114
+
1115
+ # Usage
1116
+ gate = ConfirmationGate(
1117
+ auto_confirm_low=True,
1118
+ auto_confirm_medium=False # Confirm clicks, typing
1119
+ )
1120
+
1121
+ agent = ConfirmedComputerUseAgent(base_agent, gate)
1122
+ result = agent.execute_action({"type": "click", "x": 500, "y": 300})
1123
+
1124
+ ### Anti-patterns
1125
+
1126
+ - Auto-approving all actions
1127
+ - Not logging rejected actions
1128
+ - Showing full passwords in confirmation
1129
+ - No timeout on confirmation (hangs forever)
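The last anti-pattern (a confirmation prompt that hangs forever) can be avoided by bounding the wait. The following is a minimal sketch, not part of the original skill: `confirm_with_timeout` and its parameters are hypothetical names, and it assumes that an unanswered prompt should be treated as a rejection.

```python
import queue
import threading
from typing import Callable


def confirm_with_timeout(
    ask: Callable[[], bool],
    timeout_s: float,
    default: bool = False,
) -> bool:
    """Run `ask` in a daemon thread; return `default` if no answer arrives in time."""
    answers: "queue.Queue[bool]" = queue.Queue(maxsize=1)

    def worker() -> None:
        # `ask` may block indefinitely (for example on input()); that is fine
        # because the caller only waits `timeout_s` seconds for the result.
        answers.put(ask())

    threading.Thread(target=worker, daemon=True).start()
    try:
        return answers.get(timeout=timeout_s)
    except queue.Empty:
        # No answer in time: fall back to the safe default (deny).
        return default
```

A callback like this could be passed wherever the gate expects a confirmation function. Note that the worker thread stays blocked after a timeout, so this suits short-lived processes better than long-running services.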
1130
+
1131
+ ### Action Logging Pattern
1132
+
1133
+ All computer use agent actions should be logged for:
1134
+ 1. Debugging failed automations
1135
+ 2. Security auditing
1136
+ 3. Reproducibility
1137
+ 4. Compliance requirements
1138
+
1139
+ Log format should capture:
1140
+ - Timestamp
1141
+ - Action type and parameters
1142
+ - Screenshot before/after
1143
+ - Success/failure status
1144
+ - Model reasoning (if available)
1145
+
1146
+ **When to use**: Production computer use deployments, debugging automation failures, security-sensitive environments
1147
+
1148
+ from dataclasses import dataclass, field
1149
+ from datetime import datetime
1150
+ from typing import Optional, Any
1151
+ import json
1152
+ import os
1153
+
1154
+ @dataclass
1155
+ class ActionLogEntry:
1156
+ """Single action log entry."""
1157
+ timestamp: datetime
1158
+ action_type: str
1159
+ parameters: dict
1160
+ success: bool
1161
+ error: Optional[str] = None
1162
+ screenshot_before: Optional[str] = None # Path to screenshot
1163
+ screenshot_after: Optional[str] = None
1164
+ model_reasoning: Optional[str] = None
1165
+ duration_ms: Optional[int] = None
1166
+
1167
+ def to_dict(self) -> dict:
1168
+ return {
1169
+ "timestamp": self.timestamp.isoformat(),
1170
+ "action_type": self.action_type,
1171
+ "parameters": self._sanitize_params(self.parameters),
1172
+ "success": self.success,
1173
+ "error": self.error,
1174
+ "screenshot_before": self.screenshot_before,
1175
+ "screenshot_after": self.screenshot_after,
1176
+ "model_reasoning": self.model_reasoning,
1177
+ "duration_ms": self.duration_ms
1178
+ }
1179
+
1180
+ def _sanitize_params(self, params: dict) -> dict:
1181
+ """Remove sensitive data from params."""
1182
+ sanitized = {}
1183
+ sensitive_keys = ['password', 'secret', 'token', 'key', 'credit_card']
1184
+
1185
+ for k, v in params.items():
1186
+ if any(s in k.lower() for s in sensitive_keys):
1187
+ sanitized[k] = "[REDACTED]"
1188
+ elif isinstance(v, str) and len(v) > 100:
1189
+ sanitized[k] = v[:100] + "...[truncated]"
1190
+ else:
1191
+ sanitized[k] = v
1192
+
1193
+ return sanitized
1194
+
1195
+ @dataclass
1196
+ class TaskSession:
1197
+ """A complete task execution session."""
1198
+ session_id: str
1199
+ task: str
1200
+ start_time: datetime
1201
+ end_time: Optional[datetime] = None
1202
+ actions: list[ActionLogEntry] = field(default_factory=list)
1203
+ success: bool = False
1204
+ final_result: Optional[str] = None
1205
+
1206
+ class ActionLogger:
1207
+ """
1208
+ Comprehensive action logging for computer use agents.
1209
+ """
1210
+
1211
+ def __init__(self, log_dir: str = "./agent_logs"):
1212
+ self.log_dir = log_dir
1213
+ self.screenshot_dir = os.path.join(log_dir, "screenshots")
1214
+ os.makedirs(self.screenshot_dir, exist_ok=True)
1215
+
1216
+ self.current_session: Optional[TaskSession] = None
1217
+
1218
+ def start_session(self, task: str) -> str:
1219
+ """Start a new task session."""
1220
+ import uuid
1221
+ session_id = str(uuid.uuid4())[:8]
1222
+
1223
+ self.current_session = TaskSession(
1224
+ session_id=session_id,
1225
+ task=task,
1226
+ start_time=datetime.now()
1227
+ )
1228
+
1229
+ return session_id
1230
+
1231
+ def log_action(
1232
+ self,
1233
+ action_type: str,
1234
+ parameters: dict,
1235
+ success: bool,
1236
+ error: Optional[str] = None,
1237
+ screenshot_before: Optional[bytes] = None,
+ screenshot_after: Optional[bytes] = None,
+ model_reasoning: Optional[str] = None,
+ duration_ms: Optional[int] = None
1241
+ ):
1242
+ """Log a single action."""
1243
+ if not self.current_session:
1244
+ raise RuntimeError("No active session")
1245
+
1246
+ # Save screenshots if provided
1247
+ screenshot_paths = {}
1248
+ timestamp_str = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
1249
+
1250
+ if screenshot_before:
1251
+ path = os.path.join(
1252
+ self.screenshot_dir,
1253
+ f"{self.current_session.session_id}_{timestamp_str}_before.png"
1254
+ )
1255
+ with open(path, "wb") as f:
1256
+ f.write(screenshot_before)
1257
+ screenshot_paths["before"] = path
1258
+
1259
+ if screenshot_after:
1260
+ path = os.path.join(
1261
+ self.screenshot_dir,
1262
+ f"{self.current_session.session_id}_{timestamp_str}_after.png"
1263
+ )
1264
+ with open(path, "wb") as f:
1265
+ f.write(screenshot_after)
1266
+ screenshot_paths["after"] = path
1267
+
1268
+ # Create log entry
1269
+ entry = ActionLogEntry(
1270
+ timestamp=datetime.now(),
1271
+ action_type=action_type,
1272
+ parameters=parameters,
1273
+ success=success,
1274
+ error=error,
1275
+ screenshot_before=screenshot_paths.get("before"),
1276
+ screenshot_after=screenshot_paths.get("after"),
1277
+ model_reasoning=model_reasoning,
1278
+ duration_ms=duration_ms
1279
+ )
1280
+
1281
+ self.current_session.actions.append(entry)
1282
+
1283
+ # Also append to running log file
1284
+ self._append_to_log(entry)
1285
+
1286
+ def _append_to_log(self, entry: ActionLogEntry):
1287
+ """Append entry to JSONL log file."""
1288
+ log_file = os.path.join(
1289
+ self.log_dir,
1290
+ f"session_{self.current_session.session_id}.jsonl"
1291
+ )
1292
+
1293
+ with open(log_file, "a") as f:
1294
+ f.write(json.dumps(entry.to_dict()) + "\n")
1295
+
1296
+ def end_session(self, success: bool, result: str = None):
1297
+ """End current session."""
1298
+ if not self.current_session:
1299
+ return
1300
+
1301
+ self.current_session.end_time = datetime.now()
1302
+ self.current_session.success = success
1303
+ self.current_session.final_result = result
1304
+
1305
+ # Write session summary
1306
+ summary_file = os.path.join(
1307
+ self.log_dir,
1308
+ f"session_{self.current_session.session_id}_summary.json"
1309
+ )
1310
+
1311
+ summary = {
1312
+ "session_id": self.current_session.session_id,
1313
+ "task": self.current_session.task,
1314
+ "start_time": self.current_session.start_time.isoformat(),
1315
+ "end_time": self.current_session.end_time.isoformat(),
1316
+ "duration_seconds": (
1317
+ self.current_session.end_time -
1318
+ self.current_session.start_time
1319
+ ).total_seconds(),
1320
+ "total_actions": len(self.current_session.actions),
1321
+ "successful_actions": sum(
1322
+ 1 for a in self.current_session.actions if a.success
1323
+ ),
1324
+ "failed_actions": sum(
1325
+ 1 for a in self.current_session.actions if not a.success
1326
+ ),
1327
+ "success": success,
1328
+ "final_result": result
1329
+ }
1330
+
1331
+ with open(summary_file, "w") as f:
1332
+ json.dump(summary, f, indent=2)
1333
+
1334
+ self.current_session = None
1335
+
1336
+ def get_session_replay(self, session_id: str) -> list[dict]:
1337
+ """Get all actions from a session for replay/debugging."""
1338
+ log_file = os.path.join(self.log_dir, f"session_{session_id}.jsonl")
1339
+
1340
+ actions = []
1341
+ with open(log_file, "r") as f:
1342
+ for line in f:
1343
+ actions.append(json.loads(line))
1344
+
1345
+ return actions
1346
+
1347
+ # Integration with agent
1348
+ class LoggedComputerUseAgent:
1349
+ """Computer use agent with comprehensive logging."""
1350
+
1351
+ def __init__(self, base_agent, logger: ActionLogger):
1352
+ self.agent = base_agent
1353
+ self.logger = logger
1354
+
1355
+ def run_task(self, task: str) -> dict:
1356
+ """Run task with full logging."""
1357
+ session_id = self.logger.start_session(task)
1358
+
1359
+ try:
1360
+ result = self._run_with_logging(task)
1361
+ self.logger.end_session(
1362
+ success=result.get("success", False),
1363
+ result=result.get("result")
1364
+ )
1365
+ return result
1366
+ except Exception as e:
1367
+ self.logger.end_session(success=False, result=str(e))
1368
+ raise
1369
+
1370
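+ def _run_with_logging(self, task: str) -> dict:
+ """Internal run with action logging."""
+ # This would wrap the base agent's run method and log each action.
+ # It must return a dict: run_task calls result.get(...) on the value,
+ # so a bare `pass` (returning None) would raise AttributeError.
+ raise NotImplementedError("wrap base_agent.run and log each action here")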
1375
+
1376
+ ### Anti-patterns
1377
+
1378
+ - Not sanitizing sensitive data in logs
1379
+ - Storing screenshots indefinitely (storage costs)
1380
+ - Not rotating log files
1381
+ - Logging synchronously (blocks agent)
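Two of the anti-patterns above (no rotation, synchronous writes) can be addressed with the standard library alone. The sketch below is an assumption-laden illustration rather than the skill's own logger: `build_async_logger` is a hypothetical helper, and the size and backup limits are placeholder values. It hands records to a background thread through `QueueHandler`/`QueueListener` and rotates the file with `RotatingFileHandler`.

```python
import logging
import logging.handlers
import queue


def build_async_logger(
    path: str,
    name: str = "agent_actions",
    max_bytes: int = 1_000_000,
    backups: int = 3,
):
    """Return (logger, listener); file writes happen on the listener's thread."""
    log_queue: queue.Queue = queue.Queue(-1)  # unbounded queue

    # Rotates to path.1, path.2, ... once the file exceeds max_bytes.
    file_handler = logging.handlers.RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=backups
    )
    file_handler.setFormatter(logging.Formatter("%(message)s"))

    # The listener drains the queue in a background thread.
    listener = logging.handlers.QueueListener(log_queue, file_handler)
    listener.start()

    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    # The agent thread only enqueues records, so it is not blocked by disk I/O.
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    return logger, listener
```

Each call to `logger.info(json.dumps(entry))` would then write one JSONL line off the agent's thread; calling `listener.stop()` at session end flushes whatever is still queued.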
1382
+
1383
+ ## Sharp Edges
1384
+
1385
+ ### Web Content Can Hijack Your Agent
1386
+
1387
+ Severity: CRITICAL
1388
+
1389
+ Situation: Computer use agent browsing the web
1390
+
1391
+ Symptoms:
1392
+ Agent suddenly performs unexpected actions. Clicks malicious links.
1393
+ Enters credentials on phishing sites. Downloads files it shouldn't.
1394
+ Ignores your instructions and follows embedded commands instead.
1395
+
1396
+ Why this breaks:
1397
+ "While all agents that process untrusted content are subject to prompt
1398
+ injection risks, browser use amplifies this risk in two ways. First,
1399
+ the attack surface is vast: every webpage, embedded document, advertisement,
1400
+ and dynamically loaded script represents a potential vector for malicious
1401
+ instructions. Second, browser agents can take many different actions—
1402
+ navigating to URLs, filling forms, clicking buttons, downloading files—
1403
+ that attackers can exploit."
1404
+
1405
+ Real attacks have already happened:
1406
+ - "Microsoft Copilot agents were hijacked with emails containing malicious
1407
+ instructions, which allowed attackers to extract entire CRM databases."
1408
+ - "Google's Workspace services were manipulated—hidden prompts inside
1409
+ calendar invites and emails tricked Gemini agents into deleting events
1410
+ and exposing sensitive messages."
1411
+
1412
+ Even a 1% attack success rate represents meaningful risk at scale.
1413
+
1414
+ Recommended fix:
1415
+
1416
+ ## Defense in depth - no single solution works
1417
+
1418
+ 1. Sandboxing (most effective):
1419
+ ```bash
+ # Docker with strict isolation; --network none means no internet access
+ docker run \
+   --security-opt no-new-privileges \
+   --cap-drop ALL \
+   --network none \
+   --read-only \
+   computer-use-agent
+ ```
1428
+
1429
+ 2. Classifier-based detection:
1430
+ ```python
+ import re
+
+ def scan_for_injection(content: str) -> bool:
+     """Detect prompt injection attempts."""
+     # Keyword heuristics only: easy to evade, so treat this as one layer
+     patterns = [
+         r"ignore.*instructions",
+         r"disregard.*previous",
+         r"new.*instructions",
+         r"you are now",
+         r"act as if",
+         r"pretend to be",
+     ]
+     return any(re.search(p, content.lower()) for p in patterns)
+
+ # Check page content before processing
+ # text_content() can return None, so fall back to an empty string
+ page_text = await page.text_content("body") or ""
+ if scan_for_injection(page_text):
+     return {"error": "Potential injection detected"}
+ ```
1448
+
1449
+ 3. User confirmation for sensitive actions:
1450
+ ```python
+ SENSITIVE_ACTIONS = {"download", "submit", "login", "purchase"}
+
+ if action_type in SENSITIVE_ACTIONS:
+     if not await get_user_confirmation(action):
+         return {"error": "User rejected action"}
+ ```
1457
+
1458
+ 4. Scoped credentials:
1459
+ - Never give agent access to all credentials
1460
+ - Use temporary, limited tokens
1461
+ - Revoke after task completion
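The scoped-credentials advice can be made concrete with a small in-memory model. This is only a sketch under stated assumptions: `ScopedToken`, its scope strings, and the TTL are hypothetical, and a real deployment would rely on the credential provider's own short-lived tokens rather than a local class.

```python
import secrets
import time
from dataclasses import dataclass, field


@dataclass
class ScopedToken:
    """Hypothetical short-lived credential limited to an explicit set of scopes."""
    scopes: frozenset
    ttl_s: float
    value: str = field(default_factory=lambda: secrets.token_urlsafe(16))
    issued_at: float = field(default_factory=time.monotonic)
    revoked: bool = False

    def allows(self, scope: str) -> bool:
        """True only if the token is unexpired, not revoked, and covers `scope`."""
        expired = time.monotonic() - self.issued_at > self.ttl_s
        return not self.revoked and not expired and scope in self.scopes

    def revoke(self) -> None:
        """Call once the task completes so the token cannot be reused."""
        self.revoked = True
```

The agent would receive one such token per task, for example `ScopedToken(frozenset({"calendar:read"}), ttl_s=300)`, and the caller revokes it as soon as the task ends.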
1462
+
1463
+ ### Vision Agents Click Exact Centers
1464
+
1465
+ Severity: MEDIUM
1466
+
1467
+ Situation: Agent clicking on UI elements
1468
+
1469
+ Symptoms:
1470
+ Agent's clicks are detectable as non-human. Websites may block or
1471
+ CAPTCHA the agent. Anti-bot systems flag the interaction.
1472
+
1473
+ Why this breaks:
1474
+ "When a vision model identifies a button, it calculates the center.
1475
+ Click coordinates land at mathematically precise positions—often exact
1476
+ element centers or grid-aligned pixel values. Humans don't click centers;
1477
+ their click distributions follow a Gaussian pattern around targets."
1478
+
1479
+ The screenshot loop also creates detectable patterns:
1480
+ "Predictable pauses. Vision agents are completely still during their
1481
+ 'thinking' phase. The pattern looks like: Action → Complete stillness
1482
+ (1-5 seconds) → Action → Complete stillness → Action."
1483
+
1484
+ Sophisticated anti-bot systems detect:
1485
+ - Perfect center clicks
1486
+ - No mouse movement during "thinking"
1487
+ - Consistent timing between actions
1488
+ - Lack of micro-movements and hesitation
1489
+
1490
+ Recommended fix:
1491
+
1492
+ ## Add human-like variance to actions
1493
+
1494
+ ```python
+ import random
+ import time
+
+ import pyautogui  # third-party; used only by humanized_movement
+
+ def humanized_click(x: int, y: int) -> tuple[int, int]:
+     """Add human-like variance to click coordinates."""
+     # Gaussian distribution around target
+     # Humans typically land within ~10px of target
+     x_offset = int(random.gauss(0, 5))
+     y_offset = int(random.gauss(0, 5))
+
+     return (x + x_offset, y + y_offset)
+
+ def humanized_delay():
+     """Add human-like delay between actions."""
+     # Humans have variable reaction times
+     base_delay = random.uniform(0.3, 0.8)
+     # Occasionally longer pauses (reading, thinking)
+     if random.random() < 0.2:
+         base_delay += random.uniform(0.5, 2.0)
+     time.sleep(base_delay)
+
+ def humanized_movement(from_pos: tuple, to_pos: tuple):
+     """Move mouse along a wobbly path instead of a perfectly straight line."""
+     # Linear interpolation plus Gaussian wobble (a Bezier curve would be smoother)
+     steps = random.randint(10, 20)
+     for i in range(1, steps + 1):  # include t = 1 so the path ends near the target
+         t = i / steps
+         x = from_pos[0] + (to_pos[0] - from_pos[0]) * t
+         y = from_pos[1] + (to_pos[1] - from_pos[1]) * t
+         # Add wobble
+         x += random.gauss(0, 2)
+         y += random.gauss(0, 2)
+         pyautogui.moveTo(int(x), int(y))
+         time.sleep(0.01)
+ ```
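The movement helper above mentions a Bezier curve but interpolates along a straight line. As an illustration only, a quadratic Bezier path could be generated as below; `bezier_path`, the `bend` factor, and the optional `rng` argument are assumptions introduced for this sketch, and the function returns points rather than moving the mouse.

```python
import random


def bezier_path(start, end, steps=15, bend=0.2, rng=None):
    """Return `steps` integer points on a quadratic Bezier curve from start to end."""
    rng = rng or random.Random()
    (x0, y0), (x2, y2) = start, end

    # Control point: the midpoint pushed sideways, perpendicular to the
    # straight line, by a random fraction of the line's length.
    dx, dy = x2 - x0, y2 - y0
    offset = rng.uniform(-bend, bend)
    cx = (x0 + x2) / 2 - dy * offset
    cy = (y0 + y2) / 2 + dx * offset

    points = []
    for i in range(1, steps + 1):
        t = i / steps
        # Quadratic Bezier: (1-t)^2 * P0 + 2(1-t)t * C + t^2 * P2
        x = (1 - t) ** 2 * x0 + 2 * (1 - t) * t * cx + t ** 2 * x2
        y = (1 - t) ** 2 * y0 + 2 * (1 - t) * t * cy + t ** 2 * y2
        points.append((round(x), round(y)))
    return points
```

Each returned point could be fed to a mouse-move call in turn; the final point always equals the target because the curve is evaluated at t = 1.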
1532
+
1533
+ ## Rotate user agents and fingerprints
1534
+
1535
+ ```python
1536
+ USER_AGENTS = [
1537
+ "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120...",
1538
+ "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Safari/...",
1539
+ # ... more realistic agents
1540
+ ]
1541
+
1542
+ await page.set_extra_http_headers({
1543
+ "User-Agent": random.choice(USER_AGENTS)
1544
+ })
1545
+ ```
1546
+
1547
+ ### Dropdowns, Scrollbars, and Drags Are Unreliable
1548
+
1549
+ Severity: HIGH
1550
+
1551
+ Situation: Agent interacting with complex UI elements
1552
+
1553
+ Symptoms:
1554
+ Agent fails to select dropdown options. Scroll doesn't work as expected.
1555
+ Drag and drop completely fails. Hover menus disappear before clicking.
1556
+
1557
+ Why this breaks:
1558
+ "Computer Use currently struggles with certain interface interactions,
1559
+ particularly scrolling, dragging, and zooming operations. Some UI elements
1560
+ (like dropdowns and scrollbars) might be tricky for Claude to manipulate."
1561
+ - Anthropic documentation
1562
+
1563
+ Why these are hard:
1564
+ 1. Dropdowns: Options appear after click, need second click to select
1565
+ 2. Scrollbars: Small targets, need precise positioning
1566
+ 3. Drag: Requires coordinated mouse down, move, mouse up
1567
+ 4. Hover menus: Disappear when mouse moves away
1568
+ 5. Canvas elements: No semantic information visible
1569
+
1570
+ Vision models see pixels, not DOM structure. They don't "know" that
1571
+ a dropdown is a dropdown - they have to infer from visual cues.
1572
+
1573
+ Recommended fix:
1574
+
1575
+ ## Use keyboard alternatives when possible
1576
+
1577
+ ```python
+ import asyncio
+
+ # Instead of clicking dropdown, use keyboard
+ async def select_dropdown_option(page, dropdown_selector, option_text):
+     # Focus the dropdown
+     await page.click(dropdown_selector)
+     await asyncio.sleep(0.3)
+
+     # Use keyboard to find option
+     await page.keyboard.type(option_text[:3])  # Type first letters
+     await asyncio.sleep(0.2)
+     await page.keyboard.press("Enter")
+ ```
1589
+
1590
+ ## Break complex actions into steps
1591
+
1592
+ ```python
+ import asyncio
+
+ # Instead of drag-and-drop
+ async def reliable_drag(page, source, target):
+     # Step 1: Click and hold
+     await page.mouse.move(source["x"], source["y"])
+     await page.mouse.down()
+     await asyncio.sleep(0.2)
+
+     # Step 2: Move in steps
+     steps = 10
+     for i in range(steps):
+         x = source["x"] + (target["x"] - source["x"]) * i / steps
+         y = source["y"] + (target["y"] - source["y"]) * i / steps
+         await page.mouse.move(x, y)
+         await asyncio.sleep(0.05)
+
+     # Step 3: Release
+     await page.mouse.move(target["x"], target["y"])
+     await asyncio.sleep(0.1)
+     await page.mouse.up()
+ ```
1613
+
1614
+ ## Fall back to DOM access for web
1615
+
1616
+ ```python
+ # If vision fails, try direct DOM manipulation
+ async def robust_select(page, select_selector, value):
+     try:
+         # Try vision approach first
+         await vision_agent.select(select_selector, value)
+     except Exception:
+         # Fall back to direct DOM
+         await page.select_option(select_selector, value=value)
305
1625
  ```
306
1626
 
307
- ## ⚠️ Sharp Edges
1627
+ ## Add verification after action
1628
+
1629
+ ```python
+ import asyncio
+
+ async def verified_scroll(page, direction):
+     # Get current scroll position
+     before = await page.evaluate("window.scrollY")
+
+     # Attempt scroll
+     await page.mouse.wheel(0, 500 if direction == "down" else -500)
+     await asyncio.sleep(0.3)
+
+     # Verify it worked
+     after = await page.evaluate("window.scrollY")
+     if before == after:
+         # Try alternative method
+         await page.keyboard.press("PageDown" if direction == "down" else "PageUp")
+ ```
1644
+
1645
+ ### Agents Are 2-5x Slower Than Humans
1646
+
1647
+ Severity: MEDIUM
1648
+
1649
+ Situation: Automating any computer task
1650
+
1651
+ Symptoms:
1652
+ Task that takes human 1 minute takes agent 3-5 minutes.
1653
+ Users complain about speed. Timeouts occur.
1654
+
1655
+ Why this breaks:
1656
+ "The technology can be slow compared to human operators, often requiring
1657
+ multiple screenshots and analysis cycles."
1658
+
1659
+ Why so slow:
1660
+ 1. Screenshot capture: 100-500ms
1661
+ 2. Vision model inference: 1-5 seconds per screenshot
1662
+ 3. Action execution: 200-500ms
1663
+ 4. Wait for UI update: 500-1000ms
1664
+ 5. Total per action: 2-7 seconds
308
1665
 
309
- | Issue | Severity | Solution |
310
- |-------|----------|----------|
311
- | Issue | critical | ## Defense in depth - no single solution works |
312
- | Issue | medium | ## Add human-like variance to actions |
313
- | Issue | high | ## Use keyboard alternatives when possible |
314
- | Issue | medium | ## Accept the tradeoff |
315
- | Issue | high | ## Implement context management |
316
- | Issue | high | ## Monitor and limit costs |
317
- | Issue | critical | ## ALWAYS use sandboxing |
1666
+ A task requiring 20 actions takes 40-140 seconds minimum.
1667
+ Humans do the same actions in 20-30 seconds.
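The per-action arithmetic can be checked with a few lines. The sketch below simply sums the latency ranges listed above; the dictionary keys and function names are hypothetical. The exact lower bound comes out at 1.8 seconds per action (36 seconds for 20 actions), which the text rounds to 2 and 40.

```python
# Latency ranges in milliseconds, copied from the list above.
STEP_MS = {
    "screenshot": (100, 500),
    "inference": (1000, 5000),
    "execute": (200, 500),
    "ui_wait": (500, 1000),
}


def per_action_ms(steps=STEP_MS):
    """Return (min_ms, max_ms) for one screenshot-think-act-wait cycle."""
    low = sum(lo for lo, _ in steps.values())
    high = sum(hi for _, hi in steps.values())
    return low, high


def task_seconds(n_actions, steps=STEP_MS):
    """Return (min_s, max_s) for a task of n_actions sequential cycles."""
    low, high = per_action_ms(steps)
    return n_actions * low / 1000, n_actions * high / 1000
```

Under these assumptions `task_seconds(20)` gives the range behind the "40-140 seconds" figure, before any retries or extra screenshots are counted.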
1668
+
1669
+ Recommended fix:
1670
+
1671
+ ## Accept the tradeoff
1672
+
1673
+ Computer use is for:
1674
+ - Tasks humans don't want to do (repetitive)
1675
+ - Tasks that can run in background
1676
+ - Tasks where accuracy > speed
1677
+
1678
+ ## Optimize where possible
1679
+
1680
+ ```python
1681
+ # 1. Reduce screenshot resolution
1682
+ SCREEN_SIZE = (1280, 800) # Not 4K
1683
+
1684
+ # 2. Batch similar actions
1685
+ # Instead of: type "hello", wait, type " world"
1686
+ await page.keyboard.type("hello world")
1687
+
1688
+ # 3. Parallelize independent tasks
1689
+ # Run multiple sandboxed agents concurrently
1690
+
1691
+ # 4. Cache repeated computations
1692
+ # If same screenshot, reuse analysis
1693
+
1694
+ # 5. Use smaller models for simple decisions
1695
+ simple_model = "claude-haiku-..." # For "is task done?"
1696
+ complex_model = "claude-sonnet-..." # For complex reasoning
1697
+ ```
1698
+
1699
+ ## Set realistic expectations
1700
+
1701
+ ```python
+ # Estimate task duration
+ def estimate_duration(task_complexity: str) -> int:
+     """Estimate task duration in seconds."""
+     estimates = {
+         "simple": 30,    # Single page, few actions
+         "medium": 120,   # Multi-page, moderate actions
+         "complex": 300,  # Many pages, complex interactions
+     }
+     return estimates.get(task_complexity, 120)
+
+ # Inform users
+ estimated = estimate_duration("medium")
+ print(f"Estimated completion: {estimated // 60}m {estimated % 60}s")
+ ```
1716
+
1717
+ ### Screenshots Fill Up Context Window Fast
1718
+
1719
+ Severity: HIGH
1720
+
1721
+ Situation: Long-running computer use tasks
1722
+
1723
+ Symptoms:
1724
+ Agent forgets earlier steps. Starts repeating actions.
1725
+ Errors increase as task progresses. Costs explode.
1726
+
1727
+ Why this breaks:
1728
+ Each screenshot is ~1500-3000 tokens. A task with 30 screenshots
1729
+ uses 45,000-90,000 tokens just for images - before any text.
1730
+
1731
+ Claude's context window is finite. When full:
1732
+ - Older context gets dropped
1733
+ - Agent loses memory of earlier steps
1734
+ - Task coherence decreases
1735
+
1736
+ "Getting agents to make consistent progress across multiple context
1737
+ windows remains an open problem. The core challenge is that they must
1738
+ work in discrete sessions, and each new session begins with no memory
1739
+ of what came before." - Anthropic engineering blog
1740
+
1741
+ Recommended fix:
1742
+
1743
+ ## Implement context management
1744
+
1745
+ ```python
+ class ContextManager:
+     """Manage context window usage for computer use."""
+
+     MAX_SCREENSHOTS = 10  # Keep only recent screenshots
+     MAX_TOKENS = 100000
+
+     def __init__(self):
+         self.messages = []
+         self.screenshot_count = 0
+
+     def add_screenshot(self, screenshot_b64: str, description: str):
+         """Add screenshot with automatic pruning."""
+         self.screenshot_count += 1
+
+         # Store with description for context
+         self.messages.append({
+             "role": "user",
+             "content": [
+                 {"type": "text", "text": description},
+                 {"type": "image", "source": {...}}
+             ]
+         })
+
+         # Prune after appending so at most MAX_SCREENSHOTS images remain
+         if self.screenshot_count > self.MAX_SCREENSHOTS:
+             self._prune_old_screenshots()
+
+     def _prune_old_screenshots(self):
+         """Remove old screenshots, keep text summaries."""
+         new_messages = []
+         screenshots_kept = 0
+
+         # Walk newest to oldest; _has_image is a helper not shown here
+         for msg in reversed(self.messages):
+             if self._has_image(msg):
+                 if screenshots_kept < self.MAX_SCREENSHOTS:
+                     new_messages.insert(0, msg)
+                     screenshots_kept += 1
+                 else:
+                     # Convert to text summary
+                     summary = self._summarize_screenshot(msg)
+                     new_messages.insert(0, {
+                         "role": msg["role"],
+                         "content": summary
+                     })
+             else:
+                 new_messages.insert(0, msg)
+
+         self.messages = new_messages
+
+     def _summarize_screenshot(self, msg) -> str:
+         """Summarize screenshot to text."""
+         # Extract any text description
+         for content in msg.get("content", []):
+             if content.get("type") == "text":
+                 return f"[Previous screenshot: {content['text']}]"
+         return "[Previous screenshot - details pruned]"
+
+     def add_checkpoint(self):
+         """Create a checkpoint summary."""
+         # _create_progress_summary is a helper not shown here
+         summary = self._create_progress_summary()
+         self.messages.append({
+             "role": "user",
+             "content": f"CHECKPOINT: {summary}"
+         })
+ ```
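The `ContextManager` above calls `_has_image` without defining it. One possible shape for that check, together with the same pruning idea as a standalone function, is sketched below. The names `has_image` and `prune_images` are hypothetical, and the message layout (a list of blocks with a `type` key) is assumed from the snippet above rather than taken from any API reference.

```python
def has_image(msg: dict) -> bool:
    """True if the message content is a block list containing an image block."""
    content = msg.get("content")
    if not isinstance(content, list):
        return False  # plain-string content (e.g. a checkpoint) has no image
    return any(isinstance(b, dict) and b.get("type") == "image" for b in content)


def prune_images(messages: list, keep_last: int) -> list:
    """Keep the newest `keep_last` image messages; replace older ones with text."""
    kept = 0
    out = []
    for msg in reversed(messages):  # newest first
        if not has_image(msg):
            out.append(msg)
        elif kept < keep_last:
            kept += 1
            out.append(msg)
        else:
            texts = [
                b["text"] for b in msg["content"]
                if isinstance(b, dict) and b.get("type") == "text"
            ]
            label = texts[0] if texts else "details pruned"
            out.append({
                "role": msg["role"],
                "content": f"[Previous screenshot: {label}]",
            })
    out.reverse()  # restore chronological order
    return out
```

Building the result in reverse and flipping it once avoids the repeated `insert(0, ...)` calls, which matters little at ten screenshots but keeps the helper linear for longer histories.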
1811
+
1812
+ ## Use checkpointing for long tasks
1813
+
1814
+ ```python
1815
+ async def run_with_checkpoints(task: str, checkpoint_every: int = 10):
1816
+ """Run task with periodic checkpoints."""
1817
+ context = ContextManager()
1818
+ step = 0
1819
+
1820
+ while not task_complete:
1821
+ step += 1
1822
+
1823
+ # Take action...
1824
+
1825
+ if step % checkpoint_every == 0:
1826
+ # Create checkpoint
1827
+ context.add_checkpoint()
1828
+
1829
+ # Optional: persist to disk
1830
+ save_checkpoint(context, step)
1831
+ ```
1832
+
1833
+ ## Break into subtasks
1834
+
1835
+ ```python
1836
+ # Instead of one 50-step task:
1837
+ subtasks = [
1838
+ "Navigate to the website and login",
1839
+ "Find the settings page",
1840
+ "Update the email address to ...",
1841
+ "Save and verify the change"
1842
+ ]
1843
+
1844
+ for subtask in subtasks:
1845
+ result = await agent.run(subtask)
1846
+ if not result["success"]:
1847
+ handle_error(subtask, result)
1848
+ break
1849
+ ```
1850
+
1851
+ ### Costs Can Explode Quickly
1852
+
1853
+ Severity: HIGH
1854
+
1855
+ Situation: Running computer use at scale
1856
+
1857
+ Symptoms:
1858
+ API bill is 10x higher than expected. Single task costs $5+ instead of $0.50.
1859
+ Monthly costs reach thousands of dollars quickly.
1860
+
1861
+ Why this breaks:
1862
+ Vision tokens are expensive. Each screenshot:
1863
+ - ~2000-3000 tokens per image
1864
+ - At $10/million tokens, that's $0.02-0.03 per screenshot
1865
+ - Task with 30 screenshots = $0.60-0.90 just for images
1866
+
1867
+ But it compounds:
1868
+ - Screenshots accumulate in context
1869
+ - Model sees ALL previous screenshots each turn
1870
+ - Turn 10 processes 10 screenshots = $0.20-0.30
1871
+ - Turn 20 processes 20 screenshots = $0.40-0.60
1872
+ - Quadratic growth!
1873
+
1874
+ Complex task: 50 turns × average 25 images in context = 1250 image reads,
+ or roughly 2.5-3.75 million image tokens at 2000-3000 tokens each.
+ Plus text, a single task could easily hit $5-10 or more.
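The quadratic growth can be made explicit. The sketch below assumes one new screenshot per turn, nothing pruned from context, and the per-image token count and per-million price quoted above; `cumulative_image_cost` is a hypothetical helper, not a billing API.

```python
def cumulative_image_cost(turns: int, tokens_per_image: int, usd_per_million: float):
    """Return (image_reads, image_tokens, usd) summed over all turns.

    Turn k re-sends k screenshots, so the total number of image reads is
    1 + 2 + ... + turns = turns * (turns + 1) / 2.
    """
    image_reads = turns * (turns + 1) // 2
    image_tokens = image_reads * tokens_per_image
    usd = image_tokens * usd_per_million / 1_000_000
    return image_reads, image_tokens, usd
```

With 30 turns at 2000 tokens per image and $10 per million tokens, the 30 screenshots are read 465 times in total, which is why the bill ends up far above the $0.60 that 30 single reads would cost.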
1876
+
1877
+ Recommended fix:
1878
+
1879
+ ## Monitor and limit costs
1880
+
1881
+ ```python
1882
+ class CostTracker:
1883
+ """Track and limit computer use costs."""
1884
+
1885
+ # Anthropic pricing (approximate)
1886
+ INPUT_COST_PER_1K = 0.003 # Text
1887
+ OUTPUT_COST_PER_1K = 0.015
1888
+ IMAGE_COST_PER_1K = 0.01 # Roughly
1889
+
1890
+ def __init__(self, max_cost_per_task: float = 1.0):
1891
+ self.max_cost = max_cost_per_task
1892
+ self.current_cost = 0.0
1893
+ self.total_tokens = 0
1894
+
1895
+ def add_turn(
1896
+ self,
1897
+ input_tokens: int,
1898
+ output_tokens: int,
1899
+ image_tokens: int
1900
+ ):
1901
+ """Track cost of a single turn."""
1902
+ cost = (
1903
+ input_tokens / 1000 * self.INPUT_COST_PER_1K +
1904
+ output_tokens / 1000 * self.OUTPUT_COST_PER_1K +
1905
+ image_tokens / 1000 * self.IMAGE_COST_PER_1K
1906
+ )
1907
+ self.current_cost += cost
1908
+ self.total_tokens += input_tokens + output_tokens + image_tokens
1909
+
1910
+ if self.current_cost > self.max_cost:
1911
+ raise CostLimitExceeded(
1912
+ f"Cost limit exceeded: ${self.current_cost:.2f} > ${self.max_cost:.2f}"
1913
+ )
1914
+
1915
+ return cost
1916
+
1917
+ class CostLimitExceeded(Exception):
1918
+ pass
1919
+
1920
+ # Usage
1921
+ tracker = CostTracker(max_cost_per_task=2.0)
1922
+
1923
+ try:
1924
+ for turn in turns:
1925
+ tracker.add_turn(turn.input, turn.output, turn.images)
1926
+ except CostLimitExceeded:
1927
+ print("Task aborted due to cost limit")
1928
+ ```
1929
+
1930
+ ## Reduce image costs
1931
+
1932
+ ```python
1933
+ # 1. Lower resolution
1934
+ SCREEN_SIZE = (1024, 768) # Smaller = fewer tokens
1935
+
1936
+ # 2. JPEG instead of PNG (when quality ok)
1937
+ screenshot.convert("RGB").save(buffer, format="JPEG", quality=70)  # JPEG has no alpha channel
1938
+
1939
+ # 3. Crop to relevant region
1940
+ def crop_relevant(screenshot: Image, focus_area: tuple):
1941
+ """Crop to area of interest."""
1942
+ return screenshot.crop(focus_area)
1943
+
1944
+ # 4. Don't include screenshot every turn
1945
+ if not needs_visual_update:
1946
+ # Text-only turn
1947
+ messages.append({"role": "user", "content": "Continue..."})
1948
+ ```
1949
+
1950
+ ## Use cheaper models strategically
1951
+
1952
+ ```python
1953
+ async def tiered_model_selection(task_complexity: str):
1954
+ """Use appropriate model for task."""
1955
+ if task_complexity == "simple":
1956
+ return "claude-haiku-..." # Cheapest
1957
+ elif task_complexity == "medium":
1958
+ return "claude-sonnet-4-20250514" # Balanced
1959
+ else:
1960
+ return "claude-opus-4-5-..." # Best but expensive
1961
+ ```
1962
+
1963
+ ### Running Agent on Your Actual Computer
1964
+
1965
+ Severity: CRITICAL
1966
+
1967
+ Situation: Testing or deploying computer use
1968
+
1969
+ Symptoms:
1970
+ Agent deletes important files. Sends emails from your account.
1971
+ Posts on social media. Accesses sensitive documents.
1972
+
1973
+ Why this breaks:
1974
+ Computer use agents make mistakes. They can:
1975
+ - Misinterpret instructions
1976
+ - Click wrong buttons
1977
+ - Type in wrong fields
1978
+ - Follow prompt injection attacks
1979
+
1980
+ Without sandboxing, these mistakes happen on your real system.
1981
+ There's no undo for "agent sent email to all contacts" or
1982
+ "agent deleted project folder."
1983
+
1984
+ "Autonomous agents that can access external systems and APIs
1985
+ introduce new security risks. They may be vulnerable to prompt
1986
+ injection attacks, unauthorized access to sensitive data, or
1987
+ manipulation by malicious actors."
1988
+
1989
+ Recommended fix:
1990
+
1991
+ ## ALWAYS use sandboxing
1992
+
1993
+ ```bash
1994
+ # Minimum viable sandbox: Docker with restrictions
1995
+
1996
+ docker run -it --rm \
1997
+ --security-opt no-new-privileges \
1998
+ --cap-drop ALL \
1999
+ --network none \
2000
+ --read-only \
2001
+ --tmpfs /tmp \
2002
+ --memory 2g \
2003
+ --cpus 1 \
2004
+ computer-use-sandbox
2005
+ ```
2006
+
2007
+ ## Layer your defenses
2008
+
2009
+ ```python
+ from dataclasses import dataclass, field
+
+ # Defense 1: Docker isolation
+ # Defense 2: Non-root user
+ # Defense 3: Network restrictions
+ # Defense 4: Filesystem restrictions
+ # Defense 5: Resource limits
+ # Defense 6: Action confirmation
+ # Defense 7: Action logging
+
+ @dataclass
+ class SandboxConfig:
+     docker_image: str = "computer-use-sandbox:latest"
+     network: str = "none"  # or specific allowlist
+     readonly_root: bool = True
+     max_memory_mb: int = 2048
+     max_cpu: float = 1.0
+     max_runtime_seconds: int = 300
+     # Action types that must be approved before the agent performs them
+     require_confirmation: list = field(default_factory=lambda: [
+         "download", "submit", "login", "delete"
+     ])
+     log_all_actions: bool = True
+ ```
2031
+
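One way to connect the `SandboxConfig` fields to the `docker run` flags shown earlier is to build the argument list from the config. The following is a minimal sketch, not part of the original skill: `build_docker_args` and the `run_as_user` field (with its `1000:1000` default) are hypothetical names introduced for illustration, and only the container-related fields are mirrored so the example stays self-contained.

```python
from dataclasses import dataclass


@dataclass
class SandboxConfig:
    # Mirrors the container-related fields of the config above.
    docker_image: str = "computer-use-sandbox:latest"
    network: str = "none"
    readonly_root: bool = True
    max_memory_mb: int = 2048
    max_cpu: float = 1.0
    # Assumption: not in the original config; added to cover "Defense 2".
    run_as_user: str = "1000:1000"


def build_docker_args(config: SandboxConfig) -> list[str]:
    """Translate the config into an argv list for `docker run`."""
    args = [
        "docker", "run", "--rm",
        "--security-opt", "no-new-privileges",
        "--cap-drop", "ALL",
        "--network", config.network,
        "--memory", f"{config.max_memory_mb}m",
        "--cpus", str(config.max_cpu),
        "--user", config.run_as_user,
    ]
    if config.readonly_root:
        # A read-only root filesystem still needs a writable scratch area.
        args += ["--read-only", "--tmpfs", "/tmp"]
    args.append(config.docker_image)
    return args
```

Passing the result to `subprocess.run` as a list (rather than a shell string) avoids quoting problems if any field ever comes from user input.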
2032
+ ## Test in isolated environment first
2033
+
2034
+ ```python
+ class SandboxedTestRunner:
+     """Run tests in throwaway containers."""
+
+     async def run_test(self, test_task: str) -> dict:
+         # Spin up fresh container
+         container_id = await self.create_container()
+
+         try:
+             # Run task
+             result = await self.execute_in_container(container_id, test_task)
+
+             # Capture state for verification
+             state = await self.capture_container_state(container_id)
+
+             return {
+                 "result": result,
+                 "final_state": state,
+                 "logs": await self.get_logs(container_id)
+             }
+         finally:
+             # Always destroy container
+             await self.destroy_container(container_id)
+ ```
2058
+
2059
+ ## Validation Checks
2060
+
2061
+ ### Computer Use Without Sandbox
2062
+
2063
+ Severity: ERROR
2064
+
2065
+ Computer use agents MUST run in sandboxed environments
2066
+
2067
+ Message: Computer use without sandboxing detected. Use Docker containers with restrictions.
2068
+
2069
+ ### Sandbox With Full Network Access
2070
+
2071
+ Severity: ERROR
2072
+
2073
+ Sandboxed agents should have restricted network access
2074
+
2075
+ Message: Sandbox has full network access. Use --network=none or specific allowlist.
2076
+
2077
+ ### Running as Root in Container
2078
+
2079
+ Severity: ERROR
2080
+
2081
+ Container agents should run as non-root user
2082
+
2083
+ Message: Container running as root. Add --user flag or USER directive in Dockerfile.
2084
+
2085
+ ### Container Without Capability Drops
2086
+
2087
+ Severity: WARNING
2088
+
2089
+ Containers should drop unnecessary capabilities
2090
+
2091
+ Message: Container has full capabilities. Add --cap-drop ALL.
2092
+
2093
+ ### Container Without Seccomp Profile
2094
+
2095
+ Severity: WARNING
2096
+
2097
+ Containers should use seccomp profiles for syscall filtering
2098
+
2099
+ Message: No seccomp profile set. Consider --security-opt seccomp=profile.json
2100
+
2101
+ ### No Maximum Step Limit
2102
+
2103
+ Severity: WARNING
2104
+
2105
+ Computer use loops should have maximum step limits
2106
+
2107
+ Message: Infinite loop risk. Add max_steps limit (recommended: 50).
2108
+
2109
+ ### No Execution Timeout
2110
+
2111
+ Severity: WARNING
2112
+
2113
+ Computer use should have timeout limits
2114
+
2115
+ Message: No timeout on execution. Add timeout (recommended: 5-10 minutes).
2116
+
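The step-limit and timeout checks above could be enforced with a small guard that the agent loop calls before every step. This is a sketch under stated assumptions: `RunLimits` and `RunLimitExceeded` are hypothetical names, and the defaults (50 steps, 300 seconds) simply reuse the recommendations quoted in the messages.

```python
import time


class RunLimitExceeded(Exception):
    """Raised when the agent loop hits its step or time budget."""


class RunLimits:
    def __init__(self, max_steps: int = 50, timeout_seconds: float = 300.0,
                 clock=time.monotonic):
        self.max_steps = max_steps
        self.timeout_seconds = timeout_seconds
        self._clock = clock          # injectable so the guard can be tested
        self._start = clock()
        self.steps = 0

    def check(self) -> None:
        """Call once before each agent step; raises when a limit is hit."""
        if self.steps >= self.max_steps:
            raise RunLimitExceeded(f"step limit of {self.max_steps} reached")
        if self._clock() - self._start >= self.timeout_seconds:
            raise RunLimitExceeded(f"timeout of {self.timeout_seconds}s reached")
        self.steps += 1
```

In a loop this would look like `limits = RunLimits()` followed by `limits.check()` at the top of each iteration, with the exception handled where the task result is reported.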
2117
+ ### Container Without Memory Limit
2118
+
2119
+ Severity: WARNING
2120
+
2121
+ Containers should have memory limits to prevent DoS
2122
+
2123
+ Message: No memory limit on container. Add --memory 2g or similar.
2124
+
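The container checks above (network, user, capabilities, memory) can be approximated by inspecting a `docker run` argument list. The sketch below is illustrative only: `lint_docker_run` is a hypothetical helper, it recognizes only the long flag forms (`--flag value` and `--flag=value`), and it cannot see a `USER` directive in the Dockerfile, so a missing `--user` may be a false positive.

```python
def lint_docker_run(args: list[str]) -> list[tuple[str, str]]:
    """Return (severity, message) pairs for a `docker run` argv list."""

    def value_of(flag: str):
        # Supports both "--flag value" and "--flag=value".
        for i, arg in enumerate(args):
            if arg == flag and i + 1 < len(args):
                return args[i + 1]
            if arg.startswith(flag + "="):
                return arg.split("=", 1)[1]
        return None

    findings = []
    network = value_of("--network")
    # With no --network flag Docker attaches the default bridge network.
    if network is None or network in ("bridge", "host"):
        findings.append(("ERROR", "Sandbox has full network access. "
                                  "Use --network=none or specific allowlist."))
    if value_of("--user") is None:
        findings.append(("ERROR", "Container running as root. "
                                  "Add --user flag or USER directive in Dockerfile."))
    if value_of("--cap-drop") != "ALL":
        findings.append(("WARNING", "Container has full capabilities. "
                                    "Add --cap-drop ALL."))
    if value_of("--memory") is None:
        findings.append(("WARNING", "No memory limit on container. "
                                    "Add --memory 2g or similar."))
    return findings
```

Run against the "minimum viable sandbox" command from earlier, this reports only the missing `--user` flag, which is worth adding unless the image already sets a non-root `USER`.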
2125
+ ### No Cost Tracking
2126
+
2127
+ Severity: WARNING
2128
+
2129
+ Computer use should track API costs
2130
+
2131
+ Message: No cost tracking. Monitor token usage to prevent bill surprises.
2132
+
2133
+ ### No Maximum Cost Limit
2134
+
2135
+ Severity: INFO
2136
+
2137
+ Consider adding cost limits per task
2138
+
2139
+ Message: Consider adding max_cost_per_task to prevent expensive runaway tasks.
2140
+
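Cost tracking and a per-task cost limit can share one small accumulator. The following is a sketch, not the skill's own implementation: `CostTracker` and `CostBudgetExceeded` are hypothetical names, and the per-million-token prices are passed in by the caller because actual pricing depends on the model and provider.

```python
class CostBudgetExceeded(Exception):
    """Raised when a task spends more than its allowed budget."""


class CostTracker:
    def __init__(self, input_price_per_mtok: float,
                 output_price_per_mtok: float,
                 max_cost_per_task: float):
        # Prices are in currency units per one million tokens.
        self.input_price = input_price_per_mtok
        self.output_price = output_price_per_mtok
        self.max_cost_per_task = max_cost_per_task
        self.total_cost = 0.0

    def record(self, input_tokens: int, output_tokens: int) -> float:
        """Add one API call's usage and return the running total."""
        cost = (input_tokens * self.input_price
                + output_tokens * self.output_price) / 1_000_000
        self.total_cost += cost
        if self.total_cost > self.max_cost_per_task:
            raise CostBudgetExceeded(
                f"task cost {self.total_cost:.4f} exceeds "
                f"limit {self.max_cost_per_task:.4f}"
            )
        return self.total_cost
```

Calling `record` with the token counts reported for each model response keeps the total current, and the exception gives the agent loop a single place to stop a runaway task.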
2141
+ ## Collaboration
2142
+
2143
+ ### Delegation Triggers
2144
+
2145
+ - user needs web-only automation -> browser-automation (Playwright/Selenium more efficient for web)
2146
+ - user needs security review -> security-specialist (Review sandboxing, prompt injection defenses)
2147
+ - user needs container orchestration -> devops (Kubernetes, Docker Swarm for scaling)
2148
+ - user needs vision model optimization -> llm-architect (Model selection, prompt engineering)
2149
+ - user needs multi-agent coordination -> multi-agent-orchestration (Multiple computer use agents working together)
318
2150
 
319
2151
  ## When to Use
320
- This skill is applicable to execute the workflow or actions described in the overview.
2152
+
2153
+ - User mentions or implies: computer use
2154
+ - User mentions or implies: desktop automation agent
2155
+ - User mentions or implies: screen control AI
2156
+ - User mentions or implies: vision-based agent
2157
+ - User mentions or implies: GUI automation
2158
+ - User mentions or implies: Claude computer
2159
+ - User mentions or implies: OpenAI Operator
2160
+ - User mentions or implies: browser agent
2161
+ - User mentions or implies: visual agent
2162
+ - User mentions or implies: RPA with AI