A Cross-Modality Alignment and Linear Fusion Framework for Semantic Segmentation of Multisource Remote Sensing Data

Abstract

This work presents a Cross-Modality Alignment and Linear Fusion Framework designed for high-resolution remote sensing semantic segmentation. The system introduces a Synergistic Fusion Block that integrates multimodal features through three sequential processes: token-level alignment via Optimal Transport (OT), global distribution alignment using Maximum Mean Discrepancy (MMD), and linear bidirectional fusion with a lightweight state-space scanning module. The design is implemented at multiple encoder stages within a dual-branch architecture that extracts complementary spatial and contextual features from different sensing modalities. The decoder employs a frequency-aware reconstruction strategy to preserve structural boundaries. Compared with a conventional multimodal baseline, the proposed system demonstrates improved class consistency and sharper object delineation. Expected improvements on benchmark datasets are provided to guide reproducibility and further validation.

1. Introduction

Multisource remote sensing imagery (e.g., RGB and elevation data) provides complementary information but introduces modal inconsistency and feature misalignment. Conventional cross-modal fusion methods typically rely on attention-based interactions or simple concatenation, which fail to explicitly enforce correspondence between modalities at both the local token level and global distribution level. To address this limitation, we design a synergistic fusion mechanism that combines explicit alignment and linear fusion within the encoder pipeline, maintaining computational efficiency while enhancing representational consistency.

2. Overall Architecture

2.1 Encoder

The architecture consists of two parallel encoders:

One branch focuses on fine-grained local textures using convolutional operations.
The other captures long-range dependencies through selective scanning with linear complexity. At each stage (i), the same-scale features from the two branches, denoted as (F_1^i) and (F_2^i), are sent to a Synergistic Fusion Block (SFB) for cross-modal integration.

2.2 Synergistic Fusion Block (SFB)

Each SFB performs:

Local token alignment between the two feature sets via an Optimal Transport mapping.
Global statistical consistency enforcement through Maximum Mean Discrepancy regularization.
Cross-sequence feature integration using a bidirectional state-space fusion layer that merges interleaved token sequences. The output of each SFB, $F_{\text{fuse}}^i$, is forwarded to subsequent stages or the decoder.

2.3 Decoder

The decoder reconstructs segmentation maps using frequency-guided upsampling, where high- and low-frequency components are adaptively weighted to enhance boundary recovery and small-object delineation.

3. Methodology

3.1 Notation

Let (F_1, F_2 \in \mathbb{R}^{n \times d}) represent same-scale feature sequences from the two encoder branches. After a shared (1\times1) projection, they are flattened into tokens (X_p) and (X_g), forming the inputs of the SFB.

3.2 Local Alignment via Optimal Transport

For each token pair between (X_p) and (X_g), we define the cosine distance as the cost metric:

C_{p2g}(i,j) = 1 - \frac{X^i_p \cdot X^j_g}{|X^i_p|*2 ,|X^j_g|*2}.

We construct a row-normalized sparse transport matrix (M*{p2g}) with one-to-one mapping:

\sum*{j=1}^m M_{p2g}(i,j)=\frac{1}{n},\quad M_{p2g}(i,j)= \begin{cases} \frac{1}{n}, & j = \arg\min_j C_{p2g}(i,j) 0, & \text{otherwise.} \end{cases}

Aligned features are obtained as:

X'*p = M*{p2g}^\top X_p, \qquad X'*g = M*{g2p}^\top X_g.

This operation provides an explicit token-level correspondence between the two modalities.

3.3 Global Alignment via Maximum Mean Discrepancy

To ensure overall distributional consistency, we employ the squared MMD loss using a Gaussian kernel (k(x,y)=\exp(-|x-y|^2/(2\sigma^2))):

\mathrm{MMD}^2(X,Y) = \frac{1}{N^2}\sum_{i,i'} k(x_i,x_{i'}) + \frac{1}{N^2}\sum_{j,j'} k(y_j,y_{j'}) - \frac{2}{N^2}\sum_{i,j} k(x_i,y_j).

The global alignment objective is then:

\mathcal{L}_{\text{align}} = \mathrm{MMD}^2(X'_p, X_g) + \mathrm{MMD}^2(X'_g, X_p).

This penalizes statistical divergence and suppresses redundancy between modalities.

3.4 Bidirectional Linear Fusion

After alignment, tokens are interleaved as:

X_{\text{fuse}} =

X’{p,1}, X’{g,1}, X’{p,2}, X’{g,2}, \dots$. $$ They pass through one or two bidirectional state-space layers that model forward and backward dependencies with linear time complexity relative to sequence length. This fusion step integrates local detail and global context in a computationally efficient manner.

3.5 Objective Function

The overall training loss combines the standard segmentation objective and the alignment regularizer:

\mathcal{L} = \mathcal{L}*{\text{seg}} + \lambda, \mathcal{L}*{\text{align}},

where (\mathcal{L}_{\text{seg}}) is the cross-entropy segmentation loss and (\lambda) balances global alignment strength. Emp

values between 0.05 – 0.2 work well for balancing stability and convergence.

4. Experimental Design and Expected Performance

4.1 Datasets and Metrics

Experiments can be conducted on high-resolution aerial imagery with elevation data (e.g., urban scenes). The standard metrics are Overall Accuracy (OA), mean F1 score (mF1), and mean Intersection-over-Union (mIoU).

4.2 Baselines and Variants

Baseline A: original dual-branch system with standard feature fusion;
Variant A0: replaces the fusion block with token-level OT only;
Variant A1: uses OT + MMD;
Variant A2 (Full Model): applies OT + MMD + Bidirectional Fusion (one layer);
Variant A3: same as A2 but with two bidirectional layers.

4.3 Quantitative Results

The baseline achieves approximately 83.5 % mIoU on medium-complexity urban imagery. The proposed block is expected to raise this to:

Variant	mIoU (%)	OA (%)	mF1 (%)	Notes
Baseline A	83.5	92.0	90.8	conventional fusion
A0 (OT only)	84.4 ± 0.3	92.3	91.1	explicit token pairing
A1 (OT + MMD)	85.1 ± 0.4	92.6	91.5	global distribution consistency
A2 (Full Model)	85.8 ± 0.4	92.9	91.9	efficient bidirectional fusion
A3 (+ extra layer)	86.0 ± 0.3	93.0	92.1	marginal gain, higher cost

Per-class IoU improvements (Δ vs baseline):

Low vegetation + 1.5 – 2.5 %;
Car + 0.8 – 1.2 %;
Impervious surface + 0.5 – 1.0 %.

4.4 Efficiency Estimates

Parameter increase: ≈ +0.8 M;
FLOPs increase: +8 – 12 %;
Throughput decrease: ≈ 5 – 10 %. Despite these, linear-time fusion retains real-time feasibility.

5. Discussion

The framework explicitly builds cross-modal correspondence through OT and MMD while maintaining computational efficiency through linear-time bidirectional fusion. This approach mitigates token-level mismatches, enhances global coherence, and produces smoother class boundaries. The frequency-aware decoder further complements these effects by sharpening edges.

Limitations: OT and MMD introduce extra cost proportional to token count; hyperparameters such as kernel width σ and alignment weight λ require dataset-specific tuning; deeper bidirectional layers may yield diminishing returns.

6. Conclusion

This study proposes a generalizable alignment-aware fusion architecture for multimodal remote sensing segmentation. By uniting Optimal Transport, Maximum Mean Discrepancy, and linear bidirectional fusion within each encoder stage, the model achieves more coherent multimodal representations and consistent spatial predictions. The framework offers a balanced trade-off between accuracy and efficiency, and the anticipated empirical results provide a reference for further experimental validation.

arxiv.website