In remote sensing, multi-modal data from various sensors capturing the same scene offers rich opportunities, but learning a unified representation across these modalities remains a significant challenge. Traditional methods have often been limited to single or dual-modality approaches. In this paper, we introduce COP-GEN-Beta, a generative diffusion model trained on optical, radar, and elevation data from the Major TOM dataset. What sets COP-GEN-Beta apart is its ability to map any subset of modalities to any other, enabling zero-shot modality translation after training. This is achieved through a sequence-based diffusion transformer, where each modality is controlled by its own timestep embedding. We extensively evaluate COP-GEN-Beta on thumbnail images from the Major TOM dataset, demonstrating its effectiveness in generating high-quality samples. Qualitative and quantitative evaluations validate the model's performance, highlighting its potential as a powerful pre-trained model for future remote sensing tasks.
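To make the architectural idea concrete, below is a minimal PyTorch sketch, not the official implementation, of a denoiser in which the tokens of all modalities are concatenated into one sequence and each modality receives its own timestep embedding. All module names, token counts, and dimensions are illustrative assumptions.

```python
# Minimal sketch (assumed, not the released COP-GEN-Beta code) of a sequence-based
# multi-modal diffusion transformer where each modality has its own timestep embedding.
import math
import torch
import torch.nn as nn


def timestep_embedding(t, dim):
    """Sinusoidal embedding of a per-modality diffusion timestep."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)


class MultiModalDenoiser(nn.Module):
    def __init__(self, n_modalities=4, tokens_per_modality=64, dim=256, depth=4, heads=4):
        super().__init__()
        # Learned embedding marking which modality each token belongs to.
        self.modality_embed = nn.Embedding(n_modalities, dim)
        self.time_mlp = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))
        encoder_layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, depth)
        self.out = nn.Linear(dim, dim)

    def forward(self, tokens, timesteps):
        # tokens:    (B, n_modalities, tokens_per_modality, dim) noisy latent tokens
        # timesteps: (B, n_modalities) -- one independent timestep per modality
        B, M, N, D = tokens.shape
        mod_ids = torch.arange(M, device=tokens.device)
        t_emb = self.time_mlp(timestep_embedding(timesteps.reshape(-1), D)).reshape(B, M, 1, D)
        x = tokens + self.modality_embed(mod_ids)[None, :, None, :] + t_emb
        x = x.reshape(B, M * N, D)          # concatenate all modalities into one sequence
        x = self.transformer(x)
        return self.out(x).reshape(B, M, N, D)


denoiser = MultiModalDenoiser()
tokens = torch.randn(2, 4, 64, 256)            # e.g. DEM, S1 RTC, S2 L1C, S2 L2A latents
timesteps = torch.randint(0, 1000, (2, 4))     # independent noise level per modality
print(denoiser(tokens, timesteps).shape)       # torch.Size([2, 4, 64, 256])
```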
We introduce COP-GEN-Beta, a novel diffusion model designed to handle multiple remote sensing modalities. Specifically, COP-GEN-Beta operates on four key EO modalities: Digital Elevation Model (DEM), Sentinel-1 Radar Terrain Corrected (S1 RTC), Sentinel-2 Level 1C (S2 L1C), and Sentinel-2 Level 2A (S2 L2A). Unlike previous approaches, which require separate models for each modality, COP-GEN-Beta learns joint, conditional, and marginal distributions over all modalities within a unified framework.
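The following hedged sketch shows how joint, conditional, and marginal generation can all be expressed through per-modality timesteps, under the assumption (common in unified multi-modal diffusion models) that conditioning modalities are held clean at timestep 0 while modalities to be generated carry the current diffusion timestep. The modality names mirror the list above; everything else is illustrative.

```python
# Illustrative sketch (assumption, not the released API): selecting joint, conditional,
# or marginal generation by choosing which modalities carry the diffusion timestep.
import torch

MODALITIES = ["DEM", "S1_RTC", "S2_L1C", "S2_L2A"]

def build_timesteps(batch_size, t, generate, condition_on=()):
    """Return a (batch, n_modalities) timestep tensor for one denoising step.

    generate     -- modalities to be sampled (receive the current timestep t)
    condition_on -- modalities given as clean inputs (pinned to timestep 0)
    """
    ts = torch.zeros(batch_size, len(MODALITIES), dtype=torch.long)
    for name in generate:
        ts[:, MODALITIES.index(name)] = t
    return ts

# Joint generation of all four modalities:
print(build_timesteps(1, t=999, generate=MODALITIES))
# Conditional: generate S2 L2A and DEM given S1 RTC (zero-shot modality translation):
print(build_timesteps(1, t=999, generate=["S2_L2A", "DEM"], condition_on=["S1_RTC"]))
# Marginal: generate S2 L1C alone:
print(build_timesteps(1, t=999, generate=["S2_L1C"]))
```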
COP-GEN-Beta introduces several key innovations in multimodal remote sensing: a single sequence-based diffusion transformer shared across all modalities, per-modality timestep embeddings that control which modalities are conditioned on and which are generated, and zero-shot translation between any subsets of modalities after a single training run.
COP-GEN-Beta's flexible sampling capabilities enable a wide range of downstream applications through various modality translation combinations. By allowing generation of any subset of modalities conditioned on any other subset, our model unlocks numerous practical use cases in remote sensing, from atmospheric correction and DEM generation to dataset expansion. Below we showcase some key applications that demonstrate the model's versatility and potential impact in real-world scenarios.
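As an illustration of how these translation combinations can be specified, here is a hypothetical task table; `sample`, its arguments, and the task names are assumptions for exposition, not a released API.

```python
# Hedged sketch: downstream applications expressed as "condition on one subset,
# generate another subset". The `sample` call below is hypothetical.
TASKS = {
    "atmospheric_correction": {"condition_on": ["S2_L1C"], "generate": ["S2_L2A"]},
    "dem_generation":         {"condition_on": ["S2_L2A"], "generate": ["DEM"]},
    "sar_to_optical":         {"condition_on": ["S1_RTC"], "generate": ["S2_L2A"]},
    "dataset_expansion":      {"condition_on": ["S2_L2A"], "generate": ["DEM", "S1_RTC", "S2_L1C"]},
}

for name, spec in TASKS.items():
    print(f"{name}: {spec['condition_on']} -> {spec['generate']}")
    # outputs = sample(model, inputs, **spec)   # hypothetical call to a trained checkpoint
```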
To analyse the model's robustness to degradation over repeated generations, we run a generation chain: starting from a real S2 L2A image, we iteratively condition the model on the previously generated image. For illustrative purposes, we show only a repeated loop between the S2 L2A and S1 RTC modalities.
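A minimal sketch of this chained-generation experiment is given below, assuming a hypothetical `translate` helper that runs the full reverse diffusion for the target modality conditioned on the source modality.

```python
# Sketch of the S2 L2A <-> S1 RTC generation chain described above.
# `translate` is a hypothetical helper, passed in by the caller.
def generation_chain(model, s2_l2a_image, translate, n_rounds=5):
    history = [("S2_L2A", s2_l2a_image)]
    current = s2_l2a_image
    for _ in range(n_rounds):
        # Optical -> SAR, then SAR -> optical, each round feeding on the previous output.
        s1 = translate(model, source="S2_L2A", target="S1_RTC", image=current)
        current = translate(model, source="S1_RTC", target="S2_L2A", image=s1)
        history.extend([("S1_RTC", s1), ("S2_L2A", current)])
    return history
```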
To investigate generalisation abilities, we condition COP-GEN-Beta on images from the BigEarthNet dataset. Despite differences between the processing pipelines of Major TOM thumbnails and BigEarthNet, our model produces reasonable results, demonstrating its potential for expanding existing datasets with additional modalities.
We present COP-GEN-Beta, a transformer-based diffusion model for multi-modal Earth observation imagery -- the first to learn a joint generative distribution across multiple Earth observation modalities. Through extensive evaluation on Major TOM thumbnails, we demonstrate the model's ability to generate high-quality paired data conditioned on any subset of modalities. This work establishes a robust foundation for future developments in Earth observation, paving the way for models that operate on the original formats of the source data and can easily incorporate new modalities through continual learning approaches.
@inproceedings{espinosa2025copgenbeta,
  title={COP-GEN-Beta: Unified Generative Modelling of COPernicus Imagery Thumbnails},
  author={Espinosa, Miguel and Marsocci, Valerio and Jia, Yuru and Crowley, Elliot J. and Czerkawski, Mikolaj},
  booktitle={CVPRW},
  year={2025}
}