Beyond Semantic Similarity

May 12, 2026 · ai-agents ai design

TLDR

The paper argues that letting agents search raw corpora directly with grep and shell commands beats vector and sparse retrieval on BRIGHT, BEIR, BrowseComp-Plus, and multi-hop QA.

Direct Corpus Interaction (DCI) uses grep, file reads, and lightweight scripts instead of embedding models, vector indexes, or retrieval APIs.
Single-step top-k retrieval is a hard bottleneck for agentic tasks requiring multi-hop clue combination, hypothesis revision, and exact lexical constraints.
DCI requires no offline indexing, adapts to evolving local corpora, and outperforms strong sparse, dense, and reranking baselines on several benchmarks.
The result reframes retrieval quality as an interface-design problem, not just a reasoning or embedding-quality problem.
arXiv:2605.05242, submitted May 2026, authored by a large multi-institution team including Yejin Choi, James Zou, and Jiawei Han.
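
To make the DCI idea concrete, here is a minimal Python sketch of the kind of interface the summary describes: a grep-like search over raw files plus a file read, chained across two hops with no index or embedding model. This is not the paper's implementation. The helper names (`grep_corpus`, `read_lines`, `two_hop_demo`), the `.txt`-only corpus layout, and the toy documents are all assumptions made up for illustration, and in a real agent the model would choose each pattern rather than a hard-coded script.

```python
import re
import tempfile
from pathlib import Path


def grep_corpus(root, pattern, max_hits=20):
    """Return (relative_path, line_number, line) for lines matching a regex.

    Hypothetical stand-in for an agent's grep tool: it scans raw files
    on every call, so there is no offline indexing step.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for path in sorted(Path(root).rglob("*.txt")):
        with path.open(encoding="utf-8", errors="replace") as fh:
            for lineno, line in enumerate(fh, start=1):
                if regex.search(line):
                    rel = str(path.relative_to(root))
                    hits.append((rel, lineno, line.rstrip("\n")))
                    if len(hits) >= max_hits:
                        return hits
    return hits


def read_lines(root, rel_path, start, end):
    """Return lines start..end (1-indexed, inclusive) of one corpus file."""
    with (Path(root) / rel_path).open(encoding="utf-8", errors="replace") as fh:
        return [
            line.rstrip("\n")
            for i, line in enumerate(fh, start=1)
            if start <= i <= end
        ]


def two_hop_demo():
    """Chain two searches: find an entity, then search again using it."""
    with tempfile.TemporaryDirectory() as root:
        # Toy corpus with invented facts, for illustration only.
        (Path(root) / "a.txt").write_text(
            "The Halden Prize was first awarded to Mira Oslo.\n",
            encoding="utf-8",
        )
        (Path(root) / "b.txt").write_text(
            "Mira Oslo was born in Tromso.\nShe studied physics.\n",
            encoding="utf-8",
        )
        # Hop 1: exact lexical match on the first clue.
        first = grep_corpus(root, r"Halden Prize")
        name = re.search(r"awarded to ([A-Z]\w+ [A-Z]\w+)", first[0][2]).group(1)
        # Hop 2: the entity from hop 1 becomes the next search pattern.
        second = grep_corpus(root, re.escape(name) + r" was born")
        # Read surrounding context from the file the second hop pointed to.
        context = read_lines(root, second[0][0], 1, 2)
        return name, second, context
```

Calling `two_hop_demo()` runs both hops against a temporary directory. The second query depends on what the first one returned, which is the behavior a single top-k retrieval call cannot express.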