Computer Control Skill
Use Claude's Computer Use API to see and control desktop environments through screenshots and mouse/keyboard actions.
When To Use
- Automating GUI-based workflows that lack CLI alternatives
- Testing web applications through visual interaction
- Filling forms, navigating menus, or interacting with desktop apps
- Building automation pipelines that need visual verification
When NOT To Use
- Tasks achievable through CLI or API (no GUI needed)
- Browser automation better served by Playwright or CDP
Architecture
The computer use system has three layers:
- Display Toolkit (
phantom.display) - executes OS-level actions via xdotool/scrot on the real or virtual display - Agent Loop (
phantom.loop) - manages the conversation cycle between Claude API and the display toolkit - CLI (
phantom.cli) - command-line interface for running tasks or checking environment readiness
User Task
|
v
Agent Loop <----> Claude API (beta)
| |
v v
Display Toolkit tool_use responses
| (click, type, screenshot)
v
OS Commands (xdotool, scrot)
|
v
Display (X11 / Xvfb / WSLg)
Quick Start
Check environment
cd plugins/phantom
uv run python -m phantom.cli --check
Run a task
export ANTHROPIC_API_KEY="sk-ant-..."
uv run python -m phantom.cli "Open Firefox and search for Claude AI"
Use in Python
from phantom.display import DisplayConfig, DisplayToolkit
from phantom.loop import LoopConfig, run_loop
result = run_loop(
task="Take a screenshot of the desktop",
api_key="sk-ant-...",
loop_config=LoopConfig(
model="claude-sonnet-4-6",
max_iterations=10,
),
display_config=DisplayConfig(width=1920, height=1080),
)
print(f"Done in {result.iterations} iterations")
print(result.final_text)
API Versions
| Model | Tool Version | Beta Flag |
|-------|-------------|-----------|
| Opus 4.6, Sonnet 4.6, Opus 4.5 | computer_20251124 | computer-use-2025-11-24 |
| Sonnet 4.5, Haiku 4.5, older | computer_20250124 | computer-use-2025-01-24 |
The resolve_tool_version() function handles this mapping
automatically based on the model name.
Available Actions
All versions:
screenshot- capture displayleft_click- click at[x, y]type- type text stringkey- press key combo (e.g.,ctrl+s)mouse_move- move cursor
Enhanced (20250124+):
scroll- scroll with direction and amountleft_click_drag- drag between coordinatesright_click,middle_click,double_click,triple_clickhold_key- hold key for durationwait- pause between actions
Latest (20251124):
zoom- inspect screen region at full resolution
Safety
Computer use carries risks. Follow these guidelines:
- Use a sandbox: Run in Docker or a VM, not your main OS
- Limit access: Do not provide login credentials unless necessary, and never for banking or sensitive services
- Set iteration caps: Always use
max_iterationsto prevent runaway API costs - Human approval: For actions with real-world consequences,
add confirmation callbacks via
on_action - Close sensitive apps: Claude sees the full screen via screenshots; close anything private before starting
Environment Requirements
Linux (native or WSL2 with WSLg):
sudo apt install xdotool scrot xclip
Headless (Docker/CI):
# Install Xvfb for virtual display
sudo apt install xvfb xdotool scrot xclip
Xvfb :1 -screen 0 1920x1080x24 &
export DISPLAY=:1
Prompting Tips
- Be specific about each step of the task
- Add "After each step, take a screenshot and verify" to catch mistakes early
- Use keyboard shortcuts when UI elements are hard to click
- Provide example screenshots for repeatable workflows
- Set a system prompt with domain-specific instructions