🧠 AI | ⚪ Neutral | Importance 7/10

Global Geometry Is Not Enough for Vision Representations

arXiv – CS AI | Jiwan Chung, Seon Joo Kim | June 2, 2026 at 04:00 AM

🤖 AI Summary

Researchers show that global embedding geometry, the standard lens for evaluating vision model representations, fails to predict compositional binding. Functional sensitivity, measured through input-output Jacobians, proves far more reliable. The results indicate that current training objectives optimize embedding geometry while leaving the local input-output mapping unconstrained, which suggests that representation learning needs evaluation axes beyond global geometry.

Analysis

This research challenges an assumption that has long guided how vision models are developed and evaluated: that the global geometry of an embedding space reflects representation quality. The study identifies a blind spot. Global geometry metrics capture which elements are present in an embedding but are insensitive to how those elements combine, a critical limitation for models that must handle compositional structure. The finding holds across diverse vision encoders, which suggests it is a general pattern rather than an edge case.

The root cause lies in the design of training objectives. Current loss functions explicitly constrain embedding geometry, pushing it toward global uniformity, yet leave the local input-output relationship (how pixel variations translate into embedding changes) entirely unconstrained. This explains why geometric statistics show near-zero correlation with compositional binding while functional sensitivity reliably tracks it. This gap between what is optimized and what a representation needs to do has gone largely unexamined.
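To make the two evaluation axes concrete, here is a minimal sketch of how each could be computed for a single encoder. It is an illustration under stated assumptions, not the paper's protocol: the toy `encoder`, the pairwise Gaussian-potential `uniformity` score, and the finite-difference `jacobian_fd` helper are all hypothetical stand-ins, since the summary does not say which geometric statistics or Jacobian summaries the authors actually use.

```python
import numpy as np

def encoder(x, W):
    """Toy stand-in for a vision encoder (assumption): linear map, tanh, L2-normalize."""
    h = np.tanh(W @ x)
    return h / np.linalg.norm(h)

def uniformity(Z, t=2.0):
    """Global geometry statistic: log of the mean Gaussian potential over all
    distinct pairs of embeddings. Lower values mean a more uniform spread."""
    sq_dists = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    iu = np.triu_indices(len(Z), k=1)  # each unordered pair once
    return float(np.log(np.mean(np.exp(-t * sq_dists[iu]))))

def jacobian_fd(f, x, eps=1e-5):
    """Central finite-difference Jacobian of f at x, shape (out_dim, in_dim)."""
    cols = []
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        cols.append((f(x + step) - f(x - step)) / (2 * eps))
    return np.stack(cols, axis=1)

def functional_sensitivity(f, X):
    """Local statistic: mean Frobenius norm of the input-output Jacobian over X."""
    return float(np.mean([np.linalg.norm(jacobian_fd(f, x)) for x in X]))

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16)) / 4.0   # hypothetical encoder weights
X = rng.normal(size=(32, 16))        # hypothetical flattened inputs
f = lambda x: encoder(x, W)

Z = np.stack([f(x) for x in X])
geometry_score = uniformity(Z)                   # depends only on the set of embeddings
sensitivity_score = functional_sensitivity(f, X) # depends on how inputs map to embeddings
```

The point of the contrast is that `uniformity` sees only the cloud of embeddings, whereas `functional_sensitivity` probes how each embedding moves when its input is perturbed. For a real network, automatic differentiation would replace the finite-difference loop.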

The finding has practical consequences for vision model development. Companies and researchers who optimize models solely against geometry-based metrics may unknowingly sacrifice the compositional understanding needed for real-world tasks such as object detection, scene understanding, and visual reasoning. The implications extend to foundation models, where compositional reasoning increasingly determines performance on downstream applications.

The path forward involves rethinking training objectives so that they explicitly constrain both global geometry and the local input-output mapping. Future work should integrate functional sensitivity metrics into standard evaluation protocols and develop losses that address compositional binding directly. This is less an incremental improvement than a methodological reorientation, one that could improve the robustness and generalization of vision models.
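As a purely illustrative sketch of what an objective with both kinds of term might look like, the snippet below adds a Jacobian-based term to a uniformity term. Everything here is an assumption: the summary does not say what form such a loss should take, so the weight `lam`, the `target` sensitivity, the squared-deviation penalty, and the random-probe estimator are hypothetical choices made only to show the structure.

```python
import numpy as np

def uniformity(Z, t=2.0):
    """Global geometry term: log mean Gaussian potential over distinct pairs."""
    sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    iu = np.triu_indices(len(Z), k=1)
    return float(np.log(np.mean(np.exp(-t * sq[iu]))))

def jacobian_sq_norm_estimate(f, x, rng, n_probes=8, eps=1e-4):
    """Estimate ||J(x)||_F^2 as the average of ||J v||^2 over Gaussian probes v,
    with each product J v approximated by a central finite difference."""
    total = 0.0
    for _ in range(n_probes):
        v = rng.normal(size=x.shape)
        jv = (f(x + eps * v) - f(x - eps * v)) / (2 * eps)
        total += float(jv @ jv)
    return total / n_probes

def composite_objective(f, X, rng, lam=0.1, target=1.0):
    """Hypothetical objective: geometry term plus a weighted penalty on how far
    the mean estimated Jacobian energy is from an assumed target value."""
    Z = np.stack([f(x) for x in X])
    geometry_term = uniformity(Z)
    sens = np.mean([jacobian_sq_norm_estimate(f, x, rng) for x in X])
    functional_term = (sens - target) ** 2
    return geometry_term + lam * functional_term, geometry_term, functional_term

# Toy encoder and data (assumptions, not from the paper).
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16)) / 4.0

def f(x):
    h = np.tanh(W @ x)
    return h / np.linalg.norm(h)

X = rng.normal(size=(16, 16))
total, geometry_term, functional_term = composite_objective(f, X, rng)
```

The random-probe estimator avoids building the full Jacobian, which matters when inputs are images. Whether sensitivity should be pushed up, down, or toward a target along specific directions is exactly the kind of design question the article leaves open.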

Key Takeaways

→ Global embedding geometry fails to predict compositional binding, even though it is the standard basis for evaluating vision models.
→ Functional sensitivity, measured via input-output Jacobians, reliably tracks compositional capabilities, whereas geometry-based metrics show near-zero correlation with them.
→ Current training losses explicitly constrain embedding geometry while leaving local input-output mappings unconstrained, which creates the observed gap.
→ Representation learning requires complementary evaluation axes beyond global geometry to capture full representational competence.
→ Integrating functional sensitivity into training objectives could significantly improve the compositional ability and generalization of vision models.