Self-Adaptive 3D Multimodal Large Model with Geometry-Aware Test-Time Optimization
Abstract
Large multimodal models (LMMs) that integrate vision and language understanding have made remarkable progress in connecting perception and reasoning. Extending such models to the 3D domain enables rich spatial reasoning and holistic scene understanding. However, most existing 3D multimodal systems keep their model weights fixed during inference, which leaves them vulnerable to domain shifts such as novel lighting, sensor noise, or unfamiliar object geometries.