Computer Use Is 45x More Expensive Than Structured APIs

· ai-agents web ai ·

TLDR

  • In a benchmark running Claude Sonnet against the same admin-panel task, the vision agent consumed 551k tokens over 53 steps, versus 12k tokens and 8 calls for the structured-API path, roughly the 45x headline gap.
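
The headline multiplier falls straight out of the reported token counts; a quick back-of-the-envelope check (only the four figures above come from the benchmark):

```python
# Figures reported in the benchmark (Claude Sonnet, same admin-panel task).
vision_tokens = 551_000
vision_steps = 53
api_tokens = 12_000
api_calls = 8

# The token ratio alone lands near the headline 45x figure;
# the step count contributes a smaller, separate multiplier.
print(f"token ratio: {vision_tokens / api_tokens:.1f}x")  # ~45.9x
print(f"step ratio:  {vision_steps / api_calls:.1f}x")    # ~6.6x
```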

Key Takeaways

  • Vision agent (browser-use 0.12) required a 14-step UI walkthrough prompt to complete the task at all; without it, it silently missed 3 of 4 pending reviews due to pagination.
  • Cost gap is architectural, not model quality: each UI state requires a screenshot, and screenshots dominate token count regardless of model improvements.
  • Vision path showed high variance: 407k-751k input tokens and 749-1257s wall-clock across three runs; API path varied by ±27 tokens across five runs.
  • Haiku completed the API path in under 8 seconds and under 10k input tokens; it could not complete the vision path due to browser-use 0.12 structured-output schema failures.
  • Reflex 0.9’s auto-generated HTTP endpoints from event handlers made the API path viable without writing a second codebase.
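
The structured-API path amounts to the agent issuing a handful of small JSON calls instead of screenshot loops. A minimal sketch of one such call, assuming a hypothetical `/api/approve_review` route of the kind Reflex would expose from an event handler (the URL, payload, and response shape here are illustrative, not taken from the post):

```python
import json
import urllib.request

def approve_review(base_url: str, review_id: int) -> dict:
    """POST one structured approval request: a small JSON payload
    instead of a screenshot round-trip per UI state."""
    req = urllib.request.Request(
        f"{base_url}/api/approve_review",  # hypothetical auto-generated route
        data=json.dumps({"review_id": review_id}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Eight calls of roughly this shape cover the whole task at ~12k tokens,
# versus 53 screenshot-driven steps on the vision path.
```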

Hacker News Comment Review

  • Commenters broadly agreed computer use is only justified for apps you cannot modify (third-party SaaS, legacy systems), and that new internal tooling should expose structured APIs.
  • A recurring thread questioned why vision agents do not use accessibility APIs or window handles to reduce screenshot loops, treating pure pixel-scraping as an unnecessary architectural choice.
  • Some pushed back that the comparison conflates a poorly designed vision harness with computer use generally; a well-designed accessibility-tree approach could reduce step count significantly.

Notable Comments

  • @merlindru: Building an agent that explores app surfaces via macOS accessibility APIs, writes a reusable workflow, then executes it via CLI, bypassing screenshot loops entirely.
  • @ai_fry_ur_brain: “Its funny watching the slow mean reversion back to more deterministic tooling.”
