COP-GEN-Beta

Unified Generative Modelling of COPernicus Imagery Thumbnails

1University of Edinburgh, 2European Space Agency (ESA), 3KU Leuven, 4Asterisk Labs
CVPRW 2025

Emergent Seasonality

Figure 1: By training on dense, global coverage, COP-GEN-Beta captures a wide and diverse data distribution across the supported modalities. We observe emergent effects such as seasonality when sampling multiple images conditioned on the same S1 RTC sample. COP-GEN-Beta is capable not only of synthesising new locations that do not exist, but also of reimagining existing locations under conditions that were never observed.

Abstract

In remote sensing, multi-modal data from various sensors capturing the same scene offers rich opportunities, but learning a unified representation across these modalities remains a significant challenge. Traditional methods have often been limited to single or dual-modality approaches. In this paper, we introduce COP-GEN-Beta, a generative diffusion model trained on optical, radar, and elevation data from the Major TOM dataset. What sets COP-GEN-Beta apart is its ability to map any subset of modalities to any other, enabling zero-shot modality translation after training. This is achieved through a sequence-based diffusion transformer, where each modality is controlled by its own timestep embedding. We extensively evaluate COP-GEN-Beta on thumbnail images from the Major TOM dataset, demonstrating its effectiveness in generating high-quality samples. Qualitative and quantitative evaluations validate the model's performance, highlighting its potential as a powerful pre-trained model for future remote sensing tasks.

COP-GEN-Beta

We introduce COP-GEN-Beta, a novel diffusion model designed to handle multiple remote sensing modalities. Specifically, COP-GEN-Beta operates on four key EO modalities: Digital Elevation Model (DEM), Sentinel-1 Radar Terrain Corrected (S1 RTC), Sentinel-2 Level 1C (S2 L1C), and Sentinel-2 Level 2A (S2 L2A). Unlike previous approaches, which require separate models for each modality, COP-GEN-Beta learns joint, conditional, and marginal distributions within a unified framework.
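
The key mechanism behind this flexibility is that every modality receives its own diffusion timestep. Below is a minimal sketch in Python (with hypothetical names, not the released training code) of how conditioning modalities could be pinned to timestep 0, i.e. treated as clean, while the remaining modalities receive independent random timesteps, so that a single network covers joint, conditional, and marginal denoising.

# Hypothetical sketch of per-modality timestep assignment, not the released code.
import torch

MODALITIES = ["dem", "s1rtc", "s2l1c", "s2l2a"]

def sample_timesteps(batch_size, observed, num_steps=1000, device="cpu"):
    """Draw one diffusion timestep per modality.

    Modalities in `observed` act as conditioning and keep timestep 0
    (treated as clean latents); all others get an independent random
    timestep, covering joint, conditional, and marginal cases.
    """
    timesteps = {}
    for name in MODALITIES:
        if name in observed:
            timesteps[name] = torch.zeros(batch_size, dtype=torch.long, device=device)
        else:
            timesteps[name] = torch.randint(0, num_steps, (batch_size,), device=device)
    return timesteps

# Example: condition on S1 RTC and let the model denoise the other three modalities.
t = sample_timesteps(batch_size=4, observed={"s1rtc"})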

COP-GEN-Beta Architecture
Figure 2: COP-GEN-Beta is the first generative model trained on the joint distribution of Sentinel-2 (both L1C and L2A), Sentinel-1 RTC, and Copernicus GLO-30 DEM data. This is done by (a) sampling a dense, global dataset of these modalities from Major TOM and encoding all images with a pre-trained Stable Diffusion autoencoder, and (b) training a sequence-based denoising diffusion model with a transformer backbone, where each modality is supplied with its own designated timestep. This makes it possible to (c) generate all modalities from any available subset of them.
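
As a rough illustration of steps (a) and (b), the sketch below encodes each modality thumbnail with a pre-trained Stable Diffusion autoencoder and flattens the latents into one token sequence for a transformer backbone. The checkpoint name, patch size, and helper functions are assumptions made for the example, not the released pipeline.

# Illustrative sketch only; checkpoint and patchification details are assumptions.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-ema").eval()

@torch.no_grad()
def encode_modalities(images):
    """images: dict mapping modality name -> (B, 3, H, W) tensor scaled to [-1, 1]."""
    latents = {}
    for name, x in images.items():
        latents[name] = vae.encode(x).latent_dist.sample() * vae.config.scaling_factor
    return latents

def to_token_sequence(latents, patch=2):
    """Patchify each modality latent and concatenate everything into one sequence."""
    tokens = []
    for name, z in latents.items():
        b, c, h, w = z.shape
        t = z.unfold(2, patch, patch).unfold(3, patch, patch)  # (B, C, H/p, W/p, p, p)
        t = t.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * patch * patch)
        tokens.append(t)
    return torch.cat(tokens, dim=1)  # one joint sequence over all modalities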

Why COP-GEN-Beta?

COP-GEN-Beta introduces several key innovations in multimodal remote sensing:

  • User-defined flexibility: Generate any combination of modalities from available ones, eliminating the need for specialized translation models
  • Unified multimodal modeling: Captures cross-modal relationships through a shared backbone, leveraging correlations between different data types to enhance both representation and generation
  • Scalable architecture: Transformer-based design allows easy integration of new modalities by simply adding input tokens, with flexible attention mechanisms handling cross-modal interactions (see the sketch after this list)
  • Future-proof solution: Adaptable framework ready to incorporate emerging remote sensing data types and modalities
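
As a toy illustration of this scalability (an assumption-laden sketch, not the actual model code), giving each modality a learned modality embedding alongside its own timestep embedding means that supporting a new modality largely reduces to registering one more embedding entry:

# Illustrative sketch (assumption), not the released COP-GEN-Beta architecture.
import torch
import torch.nn as nn

class ModalityTokenEmbedder(nn.Module):
    """Adds a learned modality embedding and a per-modality timestep embedding
    to latent patch tokens before they enter the transformer backbone."""

    def __init__(self, modalities, token_dim, num_steps=1000):
        super().__init__()
        self.modality_emb = nn.Embedding(len(modalities), token_dim)
        self.timestep_emb = nn.Embedding(num_steps, token_dim)
        self.index = {name: i for i, name in enumerate(modalities)}

    def forward(self, tokens, name, t):
        # tokens: (B, N, D) latent patch tokens of one modality
        # t:      (B,) integer diffusion timestep assigned to that modality
        b = tokens.shape[0]
        mod_id = torch.full((b,), self.index[name], device=tokens.device)
        return tokens + self.modality_emb(mod_id).unsqueeze(1) + self.timestep_emb(t).unsqueeze(1)

embedder = ModalityTokenEmbedder(["dem", "s1rtc", "s2l1c", "s2l2a"], token_dim=768)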

Use Cases of COP-GEN-Beta

COP-GEN-Beta's flexible sampling capabilities enable a wide range of downstream applications through various modality translation combinations. By allowing generation of any subset of modalities conditioned on any other subset, our model unlocks numerous practical use cases in remote sensing, from atmospheric correction and DEM generation to dataset expansion. Below we showcase some key applications that demonstrate the model's versatility and potential impact in real-world scenarios.
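
The sketch below shows, under stated assumptions, how such any-subset-to-any-subset sampling can be driven by per-modality timesteps: conditioning modalities stay fixed at their clean latents with timestep 0, while the remaining modalities are denoised from noise. The denoiser interface and the simple linear-interpolation update rule are placeholders, not the released sampler.

# Placeholder sampler sketch; the real model's sampling procedure may differ.
import torch

MODALITIES = ["dem", "s1rtc", "s2l1c", "s2l2a"]

def conditional_sample(denoiser, cond_latents, shape, num_steps=50, device="cpu"):
    """`denoiser(latents, timesteps)` is assumed to return predicted clean
    latents for every modality, given per-modality timesteps in [0, 1]."""
    targets = [m for m in MODALITIES if m not in cond_latents]
    latents = {m: torch.randn(shape, device=device) for m in targets}
    latents.update({m: z.to(device) for m, z in cond_latents.items()})

    for step in reversed(range(1, num_steps + 1)):
        t, s = step / num_steps, (step - 1) / num_steps
        timesteps = {m: (0.0 if m in cond_latents else t) for m in MODALITIES}
        pred_clean = denoiser(latents, timesteps)
        for m in targets:
            # Deterministic update for a linear interpolation schedule,
            # x_s = x0_hat + (s / t) * (x_t - x0_hat); conditioning stays untouched.
            latents[m] = pred_clean[m] + (s / t) * (latents[m] - pred_clean[m])
    return {m: latents[m] for m in targets}

# e.g. out = conditional_sample(model, {"s1rtc": s1_latent}, shape=(1, 4, 32, 32))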


Conditional Generation


Looping Generation

To analyse the model's robustness to generation degradation, we perform a generation chain: starting from a real S2 L2A image, we iteratively condition the model on the previously generated image. For illustrative purposes, we show only a repeated loop between the S2 L2A and S1 RTC modalities.
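
A minimal sketch of this looping experiment, assuming a hypothetical translate(source_name, source_latent, target_name) wrapper around the model:

def generation_loop(translate, s2l2a_latent, num_rounds=5):
    """Alternate S2 L2A -> S1 RTC -> S2 L2A and keep every intermediate result.

    `translate` is a hypothetical wrapper around the model, not a released API.
    """
    history = [("s2l2a", s2l2a_latent)]
    current_name, current = history[-1]
    for _ in range(num_rounds):
        target = "s1rtc" if current_name == "s2l2a" else "s2l2a"
        current = translate(current_name, current, target)
        current_name = target
        history.append((current_name, current))
    return history  # inspect drift or degradation along the chain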


Expanding existing dataset with modality translation

To investigate generalisation abilities, we condition COP-GEN-Beta on images from the BigEarthNet dataset. Despite differences in the processing pipelines of Major TOM thumbnails and BigEarthNet, our model produces reasonable results, demonstrating its potential for expanding existing datasets with additional modalities.
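
A possible way to script this kind of dataset expansion is sketched below; the directory layout, preprocessing, and the generate_missing(condition) wrapper are assumptions for illustration only.

# Hypothetical expansion loop; paths, scaling, and the model wrapper are assumed.
from pathlib import Path
import numpy as np
import torch
from PIL import Image

def expand_with_generated_modalities(generate_missing, thumbnail_dir):
    """Condition on existing S2 L2A thumbnails and collect the generated
    DEM, S1 RTC, and S2 L1C counterparts, keyed by file stem."""
    expanded = {}
    for path in sorted(Path(thumbnail_dir).glob("*.png")):
        rgb = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32)
        x = torch.from_numpy(rgb).permute(2, 0, 1) / 127.5 - 1.0  # scale to [-1, 1]
        expanded[path.stem] = generate_missing({"s2l2a": x.unsqueeze(0)})
    return expanded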

Zero-shot results
Figure 8: COP-GEN-Beta can be used to expand existing datasets that contain any of the supported modalities, such as BigEarthNet, whose Sentinel-2 L2A data is used here as the conditioning source. COP-GEN-Beta reproduces all remaining modalities despite never having observed BigEarthNet samples during training.

Conclusion

We present COP-GEN-Beta, a transformer-based diffusion model for multi-modal Earth observation imagery and the first to learn a joint generative distribution across multiple Earth observation modalities. Through extensive evaluation on Major TOM thumbnails, we demonstrate the model's ability to generate high-quality paired data conditioned on any subset of modalities. This work establishes a robust foundation for future developments in Earth observation, paving the way for models that handle the original formats of the source data and can easily incorporate new modalities through continual learning approaches.

BibTeX

@inproceedings{espinosa2025copgenbeta,
  title={COP-GEN-Beta: Unified Generative Modelling of COPernicus Imagery Thumbnails},
  author={Espinosa, Miguel and Marsocci, Valerio and Jia, Yuru and Crowley, Elliot J. and Czerkawski, Mikolaj},
  booktitle={CVPRW},
  year={2025}
}