arxiv.website

Just do it

A Cross-Modality Alignment and Linear Fusion Framework for Semantic Segmentation of Multisource Remote Sensing Data

A Cross-Modality Alignment and Linear Fusion Framework for Semantic Segmentation of Multisource Remote Sensing Data

Abstract

This work presents a Cross-Modality Alignment and Linear Fusion Framework designed for high-resolution remote sensing semantic segmentation. The system introduces a Synergistic Fusion Block that integrates multimodal features through three sequential processes: token-level alignment via Optimal Transport (OT), global distribution alignment using Maximum Mean Discrepancy (MMD), and linear bidirectional fusion with a lightweight state-space scanning module. The design is implemented at multiple encoder stages within a dual-branch architecture that extracts complementary spatial and contextual features from different sensing modalities. The decoder employs a frequency-aware reconstruction strategy to preserve structural boundaries. Compared with a conventional multimodal baseline, the proposed system demonstrates improved class consistency and sharper object delineation. Expected improvements on benchmark datasets are provided to guide reproducibility and further validation.


Integrating Evolutionary Adversarial Optimization into Transformer-Based End-to-End Autonomous Driving Systems

Integrating Evolutionary Adversarial Optimization into Transformer-Based End-to-End Autonomous Driving Systems


Abstract

End-to-end autonomous driving systems seek to unify perception, prediction, and planning into a single differentiable pipeline. Despite recent advances with transformer-based architectures capable of processing multimodal sensory inputs, these models often struggle with generalization and robustness when exposed to rare or adversarial driving scenarios. This study introduces a hybrid optimization framework that integrates evolutionary and adversarial principles into the training of transformer-based driving policies. The proposed system leverages an adaptive population-based adversarial policy optimization mechanism that co-evolves the driving policy with adversarial perturbations and exploration behaviors. This dual optimization process enhances policy robustness against environment variance and sensor uncertainty. Experiments on standard closed-loop driving benchmarks demonstrate improved success rates, route completion, and safety metrics compared to deterministic training.


Lighting-Aware Real-Time Human–Scene Reconstruction with Pose-Conditioned Ray Fields

Lighting-Aware Real-Time Human–Scene Reconstruction with Pose-Conditioned Ray Fields


Abstract

Real-time 4D human–scene reconstruction has achieved impressive progress with transformer-based and neural field approaches that jointly recover human meshes, scene geometry, and camera trajectories from monocular videos. However, existing systems often neglect illumination consistency, leading to geometry–appearance discrepancies under dynamic lighting or pose variations. This work presents a lighting-aware human–scene reconstruction framework that augments neural implicit representations with pose-conditioned ray fields and global relighting modeling. The proposed architecture unifies human reconstruction, scene estimation, and physically consistent lighting reasoning into a single end-to-end trainable system. A differentiable illumination field is introduced to encode both scene-dependent light transport and pose-dependent reflectance. The result is a coherent 4D representation capable of maintaining photometric fidelity and temporal stability even under changing illumination. Experimental evaluations on synthetic and real datasets demonstrate notable improvements in visual consistency, reconstruction accuracy, and generalization to novel lighting conditions.


MultiRef-CoFT: A Foresight-Focused Multi-Reference Framework for Visual-Language Reasoning

MultiRef-CoFT: A Foresight-Focused Multi-Reference Framework for Visual-Language Reasoning


Abstract

Recent advances in visual–language reasoning have demonstrated strong multimodal understanding when guided by multiple visual references and structured contextual alignment. However, most existing multi-reference architectures still perform reactive inference — attending to given inputs without predictive planning or cross-reference anticipation. Inspired by cognitive theories of human foresight and attention, we propose MultiRef-CoFT, a foresight-focused multi-reference reasoning framework that combines modular multi-image alignment with an adaptive chain-of-foresight-and-focus mechanism. The approach allows the model to anticipate relevant cross-image relations before focusing attention, achieving deeper relational grounding and coherent multimodal reasoning.


Self-Adaptive 3D Multimodal Large Model with Geometry-Aware Test-Time Optimization

Self-Adaptive 3D Multimodal Large Model with Geometry-Aware Test-Time Optimization


Abstract

Large multimodal models (LMMs) integrating vision and language understanding have achieved remarkable progress in connecting perception and reasoning. Extending such models into the 3D domain enables rich spatial reasoning and holistic scene understanding. However, most existing 3D multimodal systems rely on fixed model weights during inference, making them vulnerable to domain shifts such as novel lighting, sensor noise, or unfamiliar object geometries.