#3d-vision News & Analysis

8 articles tagged with #3d-vision. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 11 · 7/10

🧠

nD-RoPE: A Generalized RoPE for n-Dimensional Position Embedding

Researchers propose nD-RoPE, a generalized extension of Rotary Position Embedding (RoPE) to multi-dimensional data that addresses limitations of existing Transformer position-encoding methods. Rather than applying independent rotations, it treats positions and frequencies as coupled n-dimensional vectors, enabling better cross-dimensional interaction and directional balance across images, videos, and point clouds.
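The summary does not give nD-RoPE's equations. As a rough illustration only, assuming that "coupled" means each rotation angle is the dot product of an n-dimensional position with an n-dimensional frequency vector, the sketch below rotates feature pairs that way. Axis-aligned frequency vectors recover conventional axial RoPE, while mixed vectors let every axis influence every rotation. The function names and frequency values are hypothetical.

```python
import math
from typing import List, Sequence

def rope_rotate(x: Sequence[float], pos: Sequence[float],
                freqs: Sequence[Sequence[float]]) -> List[float]:
    """Rotate feature pairs (x[2k], x[2k+1]) by the angle <pos, freqs[k]>.

    Hypothetical illustration: pos is an n-dimensional position and each
    freqs[k] is an n-dimensional frequency vector (not nD-RoPE's exact form).
    """
    if len(x) != 2 * len(freqs):
        raise ValueError("need one frequency vector per feature pair")
    out: List[float] = []
    for k, w in enumerate(freqs):
        # Coupled angle: every position axis contributes to this pair's rotation.
        theta = sum(p * f for p, f in zip(pos, w))
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * k], x[2 * k + 1]
        out.extend([a * c - b * s, a * s + b * c])
    return out

def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))

q = [0.3, -1.2, 0.5, 0.8]
k = [1.0, 0.4, -0.7, 0.2]
axial = [(1.0, 0.0), (0.0, 1.0)]       # each pair sees only one axis
coupled = [(0.7, 0.7), (0.7, -0.7)]    # each pair mixes x and y

for freqs in (axial, coupled):
    # Shift both 2-D positions by the same offset (10, -7).
    s1 = dot(rope_rotate(q, (2, 3), freqs), rope_rotate(k, (5, 1), freqs))
    s2 = dot(rope_rotate(q, (12, -4), freqs), rope_rotate(k, (15, -6), freqs))
    print(math.isclose(s1, s2, abs_tol=1e-9))  # score depends on the offset only
```

Both frequency layouts print `True`: because a rotation by one angle followed by the inverse of another reduces to a rotation by their difference, the attention score depends only on the relative position, which is the property any RoPE variant is designed to keep.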

AI · Bullish · arXiv – CS AI · Jun 1 · 7/10

🧠

RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video

RayDer introduces a unified transformer architecture that consolidates camera estimation, scene reconstruction, and rendering into a single model for self-supervised novel view synthesis from real-world video. The system shows clean power-law scaling with data and compute while remaining competitive with supervised approaches, addressing a key scalability challenge in 3D vision.

AI · Bullish · arXiv – CS AI · Mar 5 · 7/10

🧠

Merlin: A Computed Tomography Vision-Language Foundation Model and Dataset

Stanford researchers introduced Merlin, a 3D vision-language foundation model for analyzing abdominal CT scans that processes volumetric medical images alongside electronic health records and radiology reports. The model was trained on over 6 million images from 15,331 CT scans and outperformed existing 2D models across 752 individual medical tasks.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 5

🧠

Vid-LLM: A Compact Video-based 3D Multimodal LLM with Reconstruction-Reasoning Synergy

Researchers propose Vid-LLM, a video-based 3D multimodal large language model that performs scene understanding from video input alone, without requiring external 3D data. The model uses a Cross-Task Adapter module and a Metric Depth Model to integrate geometric cues and maintain consistency across 3D tasks such as question answering and visual grounding.
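The summary says Vid-LLM draws geometric cues from a metric depth model but does not describe how they are used. For orientation only, the standard pinhole back-projection below turns a metric depth value at pixel (u, v) into a 3D point in the camera frame. The intrinsics `fx`, `fy`, `cx`, `cy` and the helper names are hypothetical, and this is not presented as Vid-LLM's pipeline.

```python
from typing import List, Tuple

Point3 = Tuple[float, float, float]

def unproject_pixel(u: float, v: float, depth: float,
                    fx: float, fy: float, cx: float, cy: float) -> Point3:
    """Back-project pixel (u, v) with metric depth along the optical axis
    (z, e.g. in metres) to camera coordinates using a pinhole model."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def depth_map_to_points(depth: List[List[float]], fx: float, fy: float,
                        cx: float, cy: float) -> List[Point3]:
    """Convert a row-major depth map (depth[v][u]) into 3D points,
    skipping pixels without a valid depth (<= 0)."""
    points: List[Point3] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z > 0:
                points.append(unproject_pixel(u, v, z, fx, fy, cx, cy))
    return points

# Tiny 2x2 depth map with one missing measurement (0.0).
pts = depth_map_to_points([[2.0, 2.0], [0.0, 4.0]], fx=1.0, fy=1.0, cx=0.5, cy=0.5)
print(pts)  # [(-1.0, -1.0, 2.0), (1.0, -1.0, 2.0), (2.0, 2.0, 4.0)]
```

Because the depth is metric, the resulting points carry real-world scale, which is what distinguishes this from relative (scale-ambiguous) depth estimates.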

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

🧠

Dual-Stream EEG Decoding for 3D Visual Perception

Researchers have developed a dual-pathway brain-computer interface that decodes 3D shape perception and spatial orientation from EEG signals using a bio-inspired architecture. The model combines circular regression for angle prediction with diffusion-based 3D reconstruction, revealing that ventral, dorsal, and motor brain regions contribute dynamically to visual perception rather than in a fixed pattern of anatomical dominance.
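Circular regression matters because orientation wraps around: 359° and 1° are 2° apart, yet a plain squared error treats them as 358° apart. The summary does not give the paper's formulation; a common generic approach, sketched here under that assumption, predicts the point (cos θ, sin θ), decodes it with `atan2`, and trains with the loss 1 − cos(θ̂ − θ). The function names are hypothetical.

```python
import math
from typing import Tuple

def encode_angle(theta: float) -> Tuple[float, float]:
    """Regression target for an angle: a point on the unit circle."""
    return (math.cos(theta), math.sin(theta))

def decode_angle(c: float, s: float) -> float:
    """Recover an angle in [-pi, pi] from (possibly unnormalised) predictions."""
    return math.atan2(s, c)

def circular_loss(pred: float, target: float) -> float:
    """0 when the angles coincide, 2 when opposite; unaffected by 2*pi wrap-around."""
    return 1.0 - math.cos(pred - target)

def angular_error(pred: float, target: float) -> float:
    """Smallest absolute angular difference, in [0, pi]."""
    d = (pred - target + math.pi) % (2 * math.pi) - math.pi
    return abs(d)

a, b = math.radians(359), math.radians(1)
print(round(math.degrees(angular_error(a, b)), 6))  # 2.0
print(circular_loss(a, b) < 1e-3)                   # True, unlike (a - b) ** 2
```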

AI · Bullish · arXiv – CS AI · May 11 · 6/10

🧠

Knowledge Transfer Scaling Laws for 3D Medical Imaging

Researchers demonstrate that different 3D medical imaging domains (CT, MRI, PET) transfer knowledge asymmetrically during pretraining, following predictable power-law patterns. By optimizing data allocation based on these transfer dynamics, they achieve up to 58% performance gains over proportional sampling, revealing a hub-and-island structure where certain domains act as foundational knowledge sources for others.
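The summary states that transfer follows power laws but not the exact functional form. A minimal sketch, assuming a pure power law L(n) = a * n^(-b) with no irreducible-loss term, fits the two parameters by ordinary least squares in log-log space, where the curve becomes a straight line. The measurements are synthetic, and the helper names are hypothetical.

```python
import math
from typing import Sequence, Tuple

def fit_power_law(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Fit y = a * x**(-b) by least squares on log y = log a - b * log x.

    Returns (a, b). Assumes every x and y is positive.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    slope = sxy / sxx
    intercept = my - slope * mx
    return math.exp(intercept), -slope

def predict(a: float, b: float, x: float) -> float:
    return a * x ** (-b)

# Hypothetical measurements: target-domain loss vs. source-domain pretraining volume.
volumes = [1e3, 1e4, 1e5, 1e6]
losses = [2.0 * v ** -0.3 for v in volumes]   # synthetic, exactly a power law
a, b = fit_power_law(volumes, losses)
print(round(a, 6), round(b, 6))  # 2.0 0.3
print(predict(a, b, 1e7))        # extrapolated loss at 10x more source data
```

A fitted exponent per source/target pair is the kind of quantity an allocation scheme could compare: a larger `b` means that domain's data keeps paying off for longer. How the paper turns such fits into its sampling weights is not described in the summary.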

AI · Neutral · arXiv – CS AI · May 1 · 6/10

🧠

CLAMP: Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining

Researchers introduce CLAMP, a novel 3D pre-training framework for robotic manipulation that combines point cloud processing with contrastive learning to capture spatial information missing from traditional 2D image-based approaches. The method demonstrates superior performance across simulated and real-world tasks by leveraging multi-view depth data and action-conditioned learning to improve policy efficiency.
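The summary names contrastive learning but not CLAMP's objective or how its pairs are formed. As a generic reference point, the sketch below implements the standard InfoNCE loss used in CLIP-style pretraining. Treating 3D observation embeddings and their matching view or action embeddings as positive pairs is an assumption, and all inputs are hypothetical.

```python
import math
from typing import List, Sequence

def l2_normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(anchors: Sequence[Sequence[float]],
             positives: Sequence[Sequence[float]],
             temperature: float = 0.1) -> float:
    """Mean InfoNCE loss over a batch.

    anchors[i] and positives[i] form the i-th positive pair; every other
    positives[j] in the batch acts as a negative for anchors[i].
    """
    A = [l2_normalize(a) for a in anchors]
    P = [l2_normalize(p) for p in positives]
    total = 0.0
    for i, a in enumerate(A):
        # Cosine similarities divided by the temperature act as logits.
        logits = [sum(x * y for x, y in zip(a, p)) / temperature for p in P]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_sum_exp - logits[i]  # cross-entropy with target index i
    return total / len(A)

# Hypothetical embeddings: 3D-observation features and their matching features.
obs = [[1.0, 0.0, 0.2], [0.0, 1.0, -0.1]]
matched = [[0.9, 0.1, 0.2], [0.1, 0.8, -0.2]]
print(info_nce(obs, matched) < info_nce(obs, matched[::-1]))  # True
```

Correctly paired embeddings yield a much lower loss than shuffled ones, which is the signal that pulls matching 3D observations together in the embedding space during pretraining.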

AI · Bullish · arXiv – CS AI · Feb 27 · 6/10 · 5

🧠

SoPE: Spherical Coordinate-Based Positional Embedding for Enhancing Spatial Perception of 3D LVLMs

Researchers introduce SoPE (Spherical Coordinate-based Positional Embedding), a new method that enhances 3D Large Vision-Language Models by mapping point-cloud data into spherical coordinate space. This approach overcomes limitations of existing Rotary Position Embedding (RoPE) by better preserving spatial structures and directional variations in 3D multimodal understanding.
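The summary says SoPE maps point-cloud positions into spherical coordinate space but not how those coordinates are then assigned to rotary frequencies. The sketch below shows only the standard Cartesian-to-spherical conversion, which separates distance from direction; the choice of origin (for example a scene centroid) and the function name are assumptions.

```python
import math
from typing import Tuple

def to_spherical(x: float, y: float, z: float,
                 origin: Tuple[float, float, float] = (0.0, 0.0, 0.0)
                 ) -> Tuple[float, float, float]:
    """Map a 3D point to (r, polar, azimuth) relative to `origin`.

    r       : radial distance
    polar   : angle from the +z axis, in [0, pi]
    azimuth : angle in the x-y plane from the +x axis, in [-pi, pi]
    """
    dx, dy, dz = x - origin[0], y - origin[1], z - origin[2]
    r = math.sqrt(dx * dx + dy * dy + dz * dz)
    if r == 0.0:
        return 0.0, 0.0, 0.0  # direction is undefined at the origin itself
    # Clamp guards against rounding pushing the ratio just outside [-1, 1].
    polar = math.acos(max(-1.0, min(1.0, dz / r)))
    azimuth = math.atan2(dy, dx)
    return r, polar, azimuth

for p in [(0.0, 0.0, 2.0), (1.0, 1.0, 0.0), (-3.0, 0.0, 0.0)]:
    r, pol, az = to_spherical(*p)
    print(p, round(r, 4), round(math.degrees(pol), 1), round(math.degrees(az), 1))
```

Points along the same ray from the origin share (polar, azimuth) and differ only in r, which is one way a spherical parameterization can make directional structure explicit.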