#large-multimodal-models News & Analysis

7 articles tagged with #large-multimodal-models. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · arXiv – CS AI · Jun 2 · 7/10

InPhyRe Discovers: Large Multimodal Models Struggle in Inductive Physical Reasoning

Researchers introduced InPhyRe, a new benchmark showing that large multimodal models (LMMs) struggle with inductive physical reasoning, the ability to apply learned physical laws to novel, unseen scenarios. Tests of 13 LMMs revealed three critical weaknesses: the models fail to generalize parametric knowledge, perform poorly on unseen physical laws, and exhibit a language bias that leads them to ignore visual inputs. The findings raise concerns about the models' reliability in safety-critical applications.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

Video Reasoning without Training

Researchers introduce V-Reason, an inference-time optimization method for video reasoning in Large Multimodal Models that eliminates the need for costly reinforcement learning (RL) or supervised fine-tuning. By analyzing entropy patterns in model outputs, the method comes close to the performance of RL-trained models while using 58.6% fewer tokens.
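For readers unfamiliar with the quantity involved, the sketch below computes the Shannon entropy of a next-token probability distribution and averages it over a generated sequence. It illustrates only what "entropy of model outputs" means; the summary does not describe how V-Reason uses entropy, and the function names and example distributions here are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(step_distributions):
    """Average per-step entropy across a generated sequence."""
    entropies = [token_entropy(p) for p in step_distributions]
    return sum(entropies) / len(entropies)


# Hypothetical distributions over a 4-token vocabulary (not data from the paper).
confident = [0.97, 0.01, 0.01, 0.01]  # model strongly prefers one token
uncertain = [0.25, 0.25, 0.25, 0.25]  # model is maximally unsure

if __name__ == "__main__":
    print(token_entropy(confident))
    print(token_entropy(uncertain))
    print(mean_entropy([confident, uncertain]))
```

A peaked distribution yields low entropy and a flat one yields the maximum, log(4) nats for four tokens, which is why entropy is a common proxy for model uncertainty during generation.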

AI · Bearish · arXiv – CS AI · May 27 · 7/10

LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations?

Researchers introduced LiveK12Bench, a dynamic benchmark for evaluating Large Multimodal Models on realistic high school examinations across multiple disciplines. The study reveals that advanced LMMs like GPT-4 experience significant performance degradation when subjected to exam-realistic constraints, dropping from 79 to 53 points when process rigor and efficiency are jointly evaluated, exposing critical gaps between theoretical capabilities and practical educational readiness.

🧠 GPT-5

AI · Neutral · arXiv – CS AI · Mar 26 · 7/10

Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding

Researchers propose DIG, a training-free framework that improves long-form video understanding by adapting frame selection strategies based on query types. The system uses uniform sampling for global queries and specialized selection for localized queries, achieving better performance than existing methods while scaling to 256 input frames.
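The routing idea can be sketched as follows, under stated assumptions: uniform sampling for global queries is taken from the summary, while the localized branch (keeping the highest-scoring frames) is a hypothetical stand-in, since the summary does not say what DIG's specialized selection actually does. The function names, the `query_type` labels, and the per-frame relevance scores are all illustrative.

```python
def uniform_indices(num_frames, budget):
    """Pick `budget` roughly evenly spaced frame indices from `num_frames` frames."""
    if budget >= num_frames:
        return list(range(num_frames))
    step = num_frames / budget
    # Take the frame near the middle of each of the `budget` equal segments.
    return [int(step * i + step / 2) for i in range(budget)]


def top_scored_indices(scores, budget):
    """Hypothetical localized strategy: keep the highest-scoring frames, in time order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


def select_frames(query_type, scores, budget):
    """Route to a selection strategy based on an (assumed) query-type label."""
    if query_type == "global":
        return uniform_indices(len(scores), budget)
    return top_scored_indices(scores, budget)


if __name__ == "__main__":
    # Made-up per-frame relevance scores for an 8-frame clip.
    scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.1, 0.7, 0.2]
    print(select_frames("global", scores, 4))
    print(select_frames("localized", scores, 3))
```

The global branch ignores the scores entirely, which matches the intuition that a question about the whole video benefits from even coverage, whereas a question about a specific moment benefits from concentrating the frame budget.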

AI · Bullish · arXiv – CS AI · Jun 8 · 6/10

Enhancing Video Representations with Spatiotemporal-Semantic Residual to Mitigate Hallucinations in Video Large Multimodal Models

Researchers introduce ViSSRes, an inference-time intervention method that reduces hallucinations in Video Large Multimodal Models by enhancing video representations through a lightweight MLP network. The approach achieves a 40.69% reduction in hallucination rates on LLaVA-NeXT-Video while improving video understanding by 18.36%, with minimal computational overhead during inference.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

Advancing Creative Physical Intelligence in Large Multimodal Models

Researchers introduce MM-CreativityBench, a benchmark testing whether large multimodal models can solve creative physical problems by identifying non-obvious tool uses in constrained environments. Current LMMs struggle not from lack of generation capability but from poor visual grounding, hallucinating attributes and overlooking relevant entities; the team proposes affordance-grounded alignment using preference learning to improve performance.

AI · Neutral · arXiv – CS AI · Mar 9 · 6/10

VisioMath: Benchmarking Figure-based Mathematical Reasoning in LMMs

Researchers introduced VisioMath, a new benchmark with 1,800 K-12 math problems designed to test Large Multimodal Models' ability to distinguish between visually similar diagrams. The study reveals that current state-of-the-art models struggle with fine-grained visual reasoning, often relying on shallow positional heuristics rather than proper image-text alignment.