LingBot-Map: Streaming 3D reconstruction with geometric context transformer

Apr 28, 2026 · ai · Source ↗

TLDR

LingBot-Map streams 3D reconstruction at ~20 FPS over sequences of 10,000+ frames using Geometric Context Attention (GCA), an attention mechanism over a learned geometric state that keeps per-frame memory and compute nearly constant.

Geometric Context Attention (GCA) maintains three complementary states: an anchor for coordinate/scale grounding, a local pose-reference window, and a trajectory memory compressed into per-frame tokens.
Trajectory memory is the key scaling mechanism: the full history is compressed into compact tokens, so per-frame cost stays nearly flat as the sequence grows.
The pipeline combines a DINO image backbone with alternating Frame Attention and GCA layers, then branches into camera pose and depth map heads.
Robbyant (蚂蚁灵波科技) positions LingBot-Map as one module in a broader embodied-AI stack that also includes LingBot-Depth (spatial perception), LingBot-VLA, LingBot-World, and LingBot-VA.
The ~20 FPS figure is at 518x378 resolution; hardware requirements are not disclosed in the release.
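
To make the three GCA states concrete, here is a minimal sketch of how the bookkeeping might look. It is not LingBot-Map's implementation: the class name `GeometricContext`, the `window_size` parameter, and the use of mean pooling as the compression step are all assumptions made for illustration, and plain Python lists stand in for feature tensors.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Optional

Token = List[float]  # stand-in for one feature vector


def mean_pool(tokens: List[Token]) -> Token:
    """Collapse a frame's tokens into one token (assumed compression step)."""
    n = len(tokens)
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / n for d in range(dim)]


@dataclass
class GeometricContext:
    """Toy container for anchor, local window, and trajectory memory."""

    window_size: int = 4  # hypothetical size of the local pose-reference window
    anchor: Optional[List[Token]] = None  # first frame, kept at full resolution
    local_window: Deque[List[Token]] = field(default_factory=deque)
    trajectory: List[Token] = field(default_factory=list)  # one token per old frame

    def update(self, frame_tokens: List[Token]) -> None:
        # The first frame becomes the anchor that grounds coordinates and scale.
        if self.anchor is None:
            self.anchor = frame_tokens
            return
        # Recent frames stay in the local window at full resolution.
        self.local_window.append(frame_tokens)
        # Frames that leave the window are compressed into a single token each.
        if len(self.local_window) > self.window_size:
            evicted = self.local_window.popleft()
            self.trajectory.append(mean_pool(evicted))

    def context_size(self) -> int:
        """Number of tokens a new frame would attend to."""
        anchor_n = len(self.anchor) if self.anchor else 0
        window_n = sum(len(frame) for frame in self.local_window)
        return anchor_n + window_n + len(self.trajectory)
```

With 100 tokens per frame and a window of 4, feeding 1,000 frames leaves 1,495 tokens in context (100 anchor + 400 window + 995 trajectory) instead of the 100,000 that full-history attention would need. Under these assumptions the context still grows by one token per frame, which is one way to read "nearly constant" rather than strictly constant.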

The single comment flags a common omission in robotics/vision benchmarks: throughput numbers without a hardware spec are hard to evaluate, especially for deployment planning.
For a model described as relatively small, the missing hardware context matters most to practitioners deciding whether it fits edge or onboard compute budgets.
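
As an illustration of the commenter's point, a throughput report can carry its hardware context alongside the FPS number. The sketch below is a generic harness, not anything from the LingBot-Map release: `benchmark_fps` and `process_frame` are hypothetical names, and the standard-library `platform` module only describes the host CPU and OS, so accelerator details would have to be added by hand.

```python
import platform
import time
from typing import Callable, Dict, Iterable


def benchmark_fps(
    process_frame: Callable[[object], object],
    frames: Iterable[object],
    resolution: str,
) -> Dict[str, object]:
    """Time a per-frame callable and return FPS together with host details."""
    count = 0
    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)
        count += 1
    elapsed = time.perf_counter() - start
    return {
        "frames": count,
        "seconds": elapsed,
        "fps": count / elapsed if elapsed > 0 else float("inf"),
        "resolution": resolution,
        # Host description; GPU or accelerator model is not captured here.
        "system": platform.system(),
        "machine": platform.machine(),
        "processor": platform.processor(),
        "python": platform.python_version(),
    }
```

Calling `benchmark_fps(model_step, frames, "518x378")` with a real per-frame function would yield a record that a reader can compare against their own edge or onboard hardware.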