Beyond Semantic Similarity


TLDR

  • The paper argues that letting agents search raw corpora directly with grep and shell commands beats vector and sparse retrieval on BRIGHT, BEIR, BrowseComp-Plus, and multi-hop QA.

Key Takeaways

  • Direct Corpus Interaction (DCI) uses grep, file reads, and lightweight scripts instead of embedding models, vector indexes, or retrieval APIs.
  • Single-step top-k retrieval is a hard bottleneck for agentic tasks requiring multi-hop clue combination, hypothesis revision, and exact lexical constraints.
  • DCI requires no offline indexing, adapts to evolving local corpora, and outperforms strong sparse, dense, and reranking baselines on several benchmarks.
  • The result reframes retrieval quality as an interface-design problem, not only a matter of reasoning or embedding quality.
  • arXiv:2605.05242, submitted May 2026, authored by a large multi-institution team including Yejin Choi, James Zou, and Jiawei Han.
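
To make the DCI idea above concrete, here is a minimal sketch of grep-style corpus interaction in Python. It is not the paper's implementation: the helper names (grep_corpus, build_toy_corpus, two_hop_birthplace), the toy documents, and the hard-coded two-hop chain are all hypothetical, and a real agent would choose its own search patterns step by step. The point is only that exact lexical matches from one search can seed the next, with no embedding model or offline index.

```python
import re
import tempfile
from pathlib import Path


def grep_corpus(root, pattern, flags=re.IGNORECASE):
    """Return (relative path, line number, line) for every matching line."""
    regex = re.compile(pattern, flags)
    hits = []
    # Scan files directly on each call: nothing is indexed ahead of time.
    for path in sorted(Path(root).rglob("*.txt")):
        with path.open(encoding="utf-8") as handle:
            for line_no, line in enumerate(handle, start=1):
                if regex.search(line):
                    rel = path.relative_to(root).as_posix()
                    hits.append((rel, line_no, line.strip()))
    return hits


def build_toy_corpus(root):
    """Write two made-up documents (illustrative data, not from the paper)."""
    docs = {
        "labs.txt": (
            "Orion Lab was founded by Mira Castellan in 2011.\n"
            "Orion Lab studies retrieval.\n"
        ),
        "people.txt": (
            "Mira Castellan was born in Porto.\n"
            "Tomas Reyes was born in Lima.\n"
        ),
    }
    for name, text in docs.items():
        (Path(root) / name).write_text(text, encoding="utf-8")


def two_hop_birthplace(root, lab_name):
    """Hard-coded two-hop chain standing in for an agent's search decisions."""
    # Hop 1: find the founder with an exact lexical constraint.
    founder = None
    for _, _, line in grep_corpus(root, re.escape(lab_name) + r" was founded by"):
        match = re.search(r"founded by ([A-Z]\w+ [A-Z]\w+)", line)
        if match:
            founder = match.group(1)
            break
    if founder is None:
        return None

    # Hop 2: reuse the clue from hop 1 as the next search pattern.
    for _, _, line in grep_corpus(root, re.escape(founder) + r" was born in"):
        match = re.search(r"born in (\w+)", line)
        if match:
            return founder, match.group(1)
    return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as corpus_root:
        build_toy_corpus(corpus_root)
        print(grep_corpus(corpus_root, r"retrieval"))
        print(two_hop_birthplace(corpus_root, "Orion Lab"))
```

Because every call rescans the files, edits to the corpus are visible on the next search, which is the "no offline indexing" property in miniature. The cost is a linear scan per query, so this sketch says nothing about how the approach behaves on large corpora.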

Hacker News Comment Review

  • No substantive HN discussion yet.
