Benchmark of Claude Sonnet on the same admin panel task: vision agent used 551k tokens and 53 steps vs 12k tokens and 8 calls for structured API.
Key Takeaways
Vision agent (browser-use 0.12) required a 14-step UI walkthrough prompt to complete the task at all; without it, it silently missed 3 of 4 pending reviews that sat on later pages of the paginated list (a prompt sketch follows this list).
Cost gap is architectural, not a matter of model quality: every UI state requires a fresh screenshot, and screenshots dominate the token count regardless of model improvements (rough arithmetic below).
Vision path showed high variance: 407k-751k input tokens and 749-1257s wall-clock across three runs; API path varied by ±27 tokens across five runs.
Haiku completed the API path in under 8 seconds with fewer than 10k input tokens; it could not complete the vision path because of structured-output schema failures in browser-use 0.12.
Reflex 0.9’s auto-generated HTTP endpoints from event handlers made the API path viable without writing a second codebase (the tool-calling loop is sketched below).
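For context on what a "14-step UI walkthrough prompt" means in practice, here is a minimal sketch of the vision run, assuming browser-use's commonly documented Agent(task=..., llm=...) interface (the exact API in the 0.12 release may differ); the URL, model alias, and walkthrough text are placeholders, not the benchmark's actual prompt.

```python
# Hedged sketch of the vision path. URL, model alias, and walkthrough text are
# placeholders; the Agent(task=..., llm=...) interface is the documented one and
# may differ slightly in the browser-use version benchmarked.
import asyncio

from browser_use import Agent
from langchain_anthropic import ChatAnthropic

# Terse prompt: in the benchmark, this style silently stopped after the first page.
TERSE_TASK = "Approve every pending review in the admin panel at http://localhost:3000/admin."

# Abridged walkthrough prompt: spelling out navigation, counting, and pagination
# is what the vision agent needed to finish the task at all.
WALKTHROUGH_TASK = """Approve every pending review in the admin panel at http://localhost:3000/admin.
1. Open the Reviews tab.
2. Read the pending-review count shown in the header.
3. Approve each pending review visible on the current page.
...
13. If the 'Next' pagination control is enabled, click it and repeat from step 3.
14. Stop only when the pending-review count reads 0."""

async def main() -> None:
    llm = ChatAnthropic(model="claude-3-5-sonnet-latest")
    agent = Agent(task=WALKTHROUGH_TASK, llm=llm)
    await agent.run()

if __name__ == "__main__":
    asyncio.run(main())
```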
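Rough arithmetic behind the "screenshots dominate" claim, using Anthropic's published estimate of roughly width × height / 750 tokens per image; the viewport size and history depth are illustrative assumptions, and only the 53-step count comes from the benchmark.

```python
# Back-of-the-envelope sketch of why screenshots dominate the vision path's
# token count. Viewport size and history depth are illustrative assumptions;
# only the 53-step count comes from the benchmark.

WIDTH, HEIGHT = 1280, 800   # assumed viewport
STEPS = 53                  # steps the vision agent took in the benchmark
HISTORY_SCREENSHOTS = 1     # assumed prior screenshots re-sent each turn

# Anthropic's published rough estimate: image tokens ≈ (width * height) / 750
tokens_per_screenshot = (WIDTH * HEIGHT) / 750   # ≈ 1,365 tokens

total_image_tokens = sum(
    (1 + min(step, HISTORY_SCREENSHOTS)) * tokens_per_screenshot
    for step in range(STEPS)
)
print(f"~{tokens_per_screenshot:,.0f} tokens per screenshot")
print(f"~{total_image_tokens:,.0f} image tokens over {STEPS} steps")
# Well over 100k tokens from images alone, before any text (instructions, DOM
# summaries, action history); the JSON API path sends zero image tokens.
```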
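And a hedged sketch of the structured path: the model is handed one generic HTTP tool and a small loop executes whatever calls it returns. The base URL, endpoint paths, and payload shapes are hypothetical stand-ins, not the routes Reflex 0.9 actually generates.

```python
# Hypothetical sketch of the structured path: Claude plans JSON tool calls and a
# small loop executes them over HTTP. Base URL, paths, and payload shapes are
# stand-ins, not the actual auto-generated Reflex routes.
import anthropic
import requests

BASE_URL = "http://localhost:8000/api"  # placeholder for the app's API base

TOOLS = [
    {
        "name": "call_admin_api",
        "description": "Call an admin-panel HTTP endpoint and return its response body.",
        "input_schema": {
            "type": "object",
            "properties": {
                "method": {"type": "string", "enum": ["GET", "POST"]},
                "path": {"type": "string", "description": "e.g. /reviews?status=pending"},
                "body": {"type": "object"},
            },
            "required": ["method", "path"],
        },
    }
]

client = anthropic.Anthropic()
messages = [{"role": "user", "content": "Approve every pending review."}]

while True:
    response = client.messages.create(
        model="claude-3-5-haiku-latest",
        max_tokens=1024,
        tools=TOOLS,
        messages=messages,
    )
    if response.stop_reason != "tool_use":
        break  # model finished; no further HTTP calls requested
    messages.append({"role": "assistant", "content": response.content})
    results = []
    for block in response.content:
        if block.type == "tool_use":
            r = requests.request(
                block.input["method"],
                BASE_URL + block.input["path"],
                json=block.input.get("body"),
                timeout=10,
            )
            results.append(
                {"type": "tool_result", "tool_use_id": block.id, "content": r.text}
            )
    messages.append({"role": "user", "content": results})
```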
Hacker News Comment Review
Commenters broadly agreed that computer use is justified only for apps you cannot modify (third-party SaaS, legacy systems), and that new internal tooling should expose structured APIs instead.
A recurring thread questioned why vision agents do not use accessibility APIs or window handles to reduce screenshot loops, treating pure pixel-scraping as an unnecessary architectural choice.
Some pushed back that the comparison conflates a poorly designed vision harness with computer use in general; a well-designed accessibility-tree approach could cut the step count significantly.
Notable Comments
@merlindru: Building an agent that explores app surfaces via macOS accessibility APIs, writes a reusable workflow, then executes it via CLI, bypassing screenshot loops entirely.
@ai_fry_ur_brain: “It’s funny watching the slow mean reversion back to more deterministic tooling.”