Google DeepMind’s TIPSv2 introduces three pretraining improvements to vision-language encoders, achieving SOTA zero-shot segmentation across 20 datasets at CVPR 2026.
Key Takeaways
iBOT++ extends the patch-level self-distillation loss from masked tokens to all tokens (masked and visible), yielding a +14.1 mIoU gain on ADE150 zero-shot segmentation from this change alone.
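The change can be sketched in a few lines: the usual iBOT loss averages a teacher–student cross-entropy over masked token positions only, while the all-token variant drops that restriction. A minimal pure-Python sketch (temperatures and function names are illustrative assumptions, not the paper's actual values):

```python
import math

def softmax(logits, temp):
    """Temperature-scaled softmax over a list of logits."""
    m = max(x / temp for x in logits)
    exps = [math.exp(x / temp - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def self_distill_loss(student_logits, teacher_logits, mask=None,
                      t_student=0.1, t_teacher=0.04):
    """Patch self-distillation: per-token cross-entropy between the
    teacher's sharpened distribution and the student's.

    mask=None averages over ALL tokens (the all-token variant described
    above); passing a boolean mask recovers a masked-token-only loss.
    """
    total, count = 0.0, 0
    for i, (s_row, t_row) in enumerate(zip(student_logits, teacher_logits)):
        if mask is not None and not mask[i]:
            continue  # masked-only variant: skip visible tokens
        p_t = softmax(t_row, t_teacher)                     # sharp teacher targets
        log_p_s = [math.log(p) for p in softmax(s_row, t_student)]
        total += -sum(pt * lps for pt, lps in zip(p_t, log_p_s))
        count += 1
    return total / max(count, 1)
```

With `mask=[True, ..., True]` the two variants coincide, which makes the extension easy to check against an existing masked-only implementation.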
Distillation produces a surprising inversion: the smaller ViT-L student dramatically outperforms its ViT-g teacher on patch-text alignment, despite trailing it on every other eval.
Head-only EMA maintains an exponential-moving-average copy of the projector head only, cutting the extra parameters maintained during training by 42% with negligible performance loss.
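The saving comes from not keeping a backbone-sized shadow copy: only the head's weights get an EMA slot, and the backbone is read straight from the student. A minimal sketch, with illustrative parameter names and momentum (not the paper's actual values):

```python
def ema_update(student_params, ema_params, momentum=0.999):
    """In-place EMA update: ema <- m * ema + (1 - m) * student.

    With head-only EMA, `ema_params` holds copies of just the projector
    head's weights, so the update (and the memory) skips the backbone.
    """
    for name, ema in ema_params.items():
        p = student_params[name]
        ema_params[name] = [momentum * e + (1 - momentum) * s
                            for e, s in zip(ema, p)]

# Illustrative parameter store: only the head gets an EMA shadow copy.
student = {"backbone.w": [0.5, -0.2], "head.w": [1.0, 2.0]}
ema = {"head.w": list(student["head.w"])}
```

Each optimizer step would then call `ema_update(student, ema)` after the student's weights change; the teacher forward pass uses the student's backbone plus the EMA head.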
Multi-Granularity Captions mixes alt-text, PaliGemma, and Gemini Flash descriptions during training to prevent shortcutting on coarse keywords and improve both dense and global evals.
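One simple way to realize such a mixture is to sample a caption source per example at load time, so every batch interleaves raw alt-text with short and long synthetic descriptions. A sketch under assumed source keys and mixing weights (the paper's actual ratios are not stated here):

```python
import random

def sample_caption(example, rng, weights=(0.5, 0.25, 0.25)):
    """Pick one caption source per example so training sees a mixture of
    alt-text, short synthetic, and long synthetic descriptions.

    The dict keys and mixing weights are illustrative assumptions.
    """
    sources = ["alt_text", "short_synthetic", "long_synthetic"]
    key = rng.choices(sources, weights=weights, k=1)[0]
    return example[key]
```

Because the source is re-sampled every epoch, the model cannot rely on any single caption style's coarse keywords, which is the shortcutting behavior the mixture is meant to suppress.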
TIPSv2-g outperforms PE-core G/14 on 3 of 5 shared global evals despite PE having 56% more parameters and 47x more training pairs.