A Cross-Modality Alignment and Linear Fusion Framework for Semantic Segmentation of Multisource Remote Sensing Data
Abstract
This work presents a Cross-Modality Alignment and Linear Fusion Framework for semantic segmentation of high-resolution remote sensing data. At its core is a Synergistic Fusion Block that integrates multimodal features in three sequential steps: token-level alignment via Optimal Transport (OT), global distribution alignment using Maximum Mean Discrepancy (MMD), and linear bidirectional fusion with a lightweight state-space scanning module. The block is applied at multiple encoder stages of a dual-branch architecture that extracts complementary spatial and contextual features from different sensing modalities. The decoder employs a frequency-aware reconstruction strategy to preserve structural boundaries. Compared with a conventional multimodal baseline, the proposed system demonstrates improved class consistency and sharper object delineation. Expected improvements on benchmark datasets are reported to support reproducibility and further validation.
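To make the two alignment steps concrete, the following is a minimal NumPy sketch, not the framework's actual implementation. It assumes each modality yields a token matrix of shape (tokens, channels), uses entropic-regularized OT solved by Sinkhorn iterations with uniform token weights, and estimates MMD with a Gaussian kernel. The function names (`sinkhorn_plan`, `align_tokens`, `rbf_mmd2`) and the parameters `eps`, `n_iters`, and `sigma` are hypothetical choices made for illustration; the state-space fusion step is not covered.

```python
import numpy as np


def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropic-regularized OT plan between two uniform token distributions.

    Assumption: uniform marginals and a fixed iteration count.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)          # uniform mass on source tokens
    b = np.full(m, 1.0 / m)          # uniform mass on target tokens
    K = np.exp(-cost / eps)          # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)              # rescale rows toward marginal a
        v = b / (K.T @ u)            # rescale columns toward marginal b
    return u[:, None] * K * v[None, :]


def align_tokens(x, y, eps=0.1):
    """Map tokens of y onto the token positions of x (barycentric projection)."""
    # Squared Euclidean cost, scaled to [0, 1] for numerical stability.
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    cost = cost / max(cost.max(), 1e-12)
    plan = sinkhorn_plan(cost, eps)
    # Each row of the normalized plan is a set of convex weights over y tokens.
    return (plan / plan.sum(axis=1, keepdims=True)) @ y


def rbf_mmd2(x, y, sigma=1.0):
    """Biased estimate of squared MMD with a Gaussian (RBF) kernel."""
    def kernel(p, q):
        d2 = ((p[:, None, :] - q[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()
```

In a training setup of this kind, `align_tokens` would typically be applied to the features of one branch before fusion, and `rbf_mmd2` would be added to the loss as a penalty on the remaining distribution gap; both uses are assumptions about how such terms are commonly combined rather than a description of the proposed system.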