COP-GEN-Beta

Unified Generative Modelling of COPernicus Imagery Thumbnails

1University of Edinburgh, 2European Space Agency (ESA), 3KU Leuven, 4Asterisk Labs
CVPRW 2025

Emergent Seasonality

Figure 1: By training on dense, global coverage, COP-GEN-Beta captures a wide and diverse data distribution across the supported modalities. We observe emergent effects such as seasonality when sampling multiple images conditioned on the same S1 RTC sample. COP-GEN-Beta is capable not only of synthesising new locations that do not exist, but also of reimagining existing locations under conditions that were never observed.

Abstract

In remote sensing, multi-modal data from various sensors capturing the same scene offers rich opportunities, but learning a unified representation across these modalities remains a significant challenge. Traditional methods have often been limited to single or dual-modality approaches. In this paper, we introduce COP-GEN-Beta, a generative diffusion model trained on optical, radar, and elevation data from the Major TOM dataset. What sets COP-GEN-Beta apart is its ability to map any subset of modalities to any other, enabling zero-shot modality translation after training. This is achieved through a sequence-based diffusion transformer, where each modality is controlled by its own timestep embedding. We extensively evaluate COP-GEN-Beta on thumbnail images from the Major TOM dataset, demonstrating its effectiveness in generating high-quality samples. Qualitative and quantitative evaluations validate the model's performance, highlighting its potential as a powerful pre-trained model for future remote sensing tasks.

COP-GEN-Beta

We introduce COP-GEN-Beta, a novel diffusion model designed to handle multiple remote sensing modalities. Specifically, COP-GEN-Beta operates on four key EO modalities: Digital Elevation Model (DEM), Sentinel-1 Radar Terrain Corrected (S1 RTC), Sentinel-2 Level 1C (S2 L1C), and Sentinel-2 Level 2A (S2 L2A). Unlike previous approaches, which require separate models for each modality, COP-GEN-Beta learns joint, conditional, and marginal distributions within a unified framework.
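
The key mechanism behind this flexibility is that every modality receives its own diffusion timestep. Below is a minimal sketch in Python (with hypothetical names, not the released training code) of how conditioning modalities could be pinned to timestep 0, i.e. treated as clean, while the remaining modalities receive independent random timesteps, so that a single network covers joint, conditional, and marginal denoising.

# Hypothetical sketch of per-modality timestep assignment, not the released code.
import torch

MODALITIES = ["dem", "s1rtc", "s2l1c", "s2l2a"]

def sample_timesteps(batch_size, observed, num_steps=1000, device="cpu"):
    """Draw one diffusion timestep per modality.

    Modalities in `observed` act as conditioning and keep timestep 0
    (treated as clean latents); all others get an independent random
    timestep, covering joint, conditional, and marginal cases.
    """
    timesteps = {}
    for name in MODALITIES:
        if name in observed:
            timesteps[name] = torch.zeros(batch_size, dtype=torch.long, device=device)
        else:
            timesteps[name] = torch.randint(0, num_steps, (batch_size,), device=device)
    return timesteps

# Example: condition on S1 RTC and let the model denoise the other three modalities.
t = sample_timesteps(batch_size=4, observed={"s1rtc"})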

COP-GEN-Beta Architecture
Figure 2: COP-GEN-Beta is the first generative model trained on the joint distribution of Sentinel-2 (both L1C and L2A), Sentinel-1 RTC, and Copernicus GLO-30 DEM data. This is done by (a) sampling a dense, global dataset of these modalities from Major TOM and encoding all images with a pre-trained Stable Diffusion autoencoder, and (b) training a sequence-based denoising diffusion model with a transformer backbone, where each modality is supplied with its own designated timestep. This makes it possible to (c) generate all modalities from any available subset of them.
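
As a rough illustration of steps (a) and (b), the sketch below encodes each modality thumbnail with a pre-trained Stable Diffusion autoencoder and flattens the latents into one token sequence for a transformer backbone. The checkpoint name, patch size, and helper functions are assumptions made for the example, not the released pipeline.

# Illustrative sketch only; checkpoint and patchification details are assumptions.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-ema").eval()

@torch.no_grad()
def encode_modalities(images):
    """images: dict mapping modality name -> (B, 3, H, W) tensor scaled to [-1, 1]."""
    latents = {}
    for name, x in images.items():
        latents[name] = vae.encode(x).latent_dist.sample() * vae.config.scaling_factor
    return latents

def to_token_sequence(latents, patch=2):
    """Patchify each modality latent and concatenate everything into one sequence."""
    tokens = []
    for name, z in latents.items():
        b, c, h, w = z.shape
        t = z.unfold(2, patch, patch).unfold(3, patch, patch)  # (B, C, H/p, W/p, p, p)
        t = t.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * patch * patch)
        tokens.append(t)
    return torch.cat(tokens, dim=1)  # one joint sequence over all modalities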

Why COP-GEN-Beta?

COP-GEN-Beta introduces several key innovations in multimodal remote sensing:

  • User-defined flexibility: Generate any combination of modalities from available ones, eliminating the need for specialized translation models
  • Unified multimodal modeling: Captures cross-modal relationships through a shared backbone, leveraging correlations between different data types to enhance both representation and generation
  • Scalable architecture: Transformer-based design allows easy integration of new modalities by simply adding input tokens, with flexible attention mechanisms handling cross-modal interactions (see the sketch after this list)
  • Future-proof solution: Adaptable framework ready to incorporate emerging remote sensing data types and modalities
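
As a toy illustration of this scalability (an assumption-laden sketch, not the actual model code), giving each modality a learned modality embedding alongside its own timestep embedding means that supporting a new modality largely reduces to registering one more embedding entry:

# Illustrative sketch (assumption), not the released COP-GEN-Beta architecture.
import torch
import torch.nn as nn

class ModalityTokenEmbedder(nn.Module):
    """Adds a learned modality embedding and a per-modality timestep embedding
    to latent patch tokens before they enter the transformer backbone."""

    def __init__(self, modalities, token_dim, num_steps=1000):
        super().__init__()
        self.modality_emb = nn.Embedding(len(modalities), token_dim)
        self.timestep_emb = nn.Embedding(num_steps, token_dim)
        self.index = {name: i for i, name in enumerate(modalities)}

    def forward(self, tokens, name, t):
        # tokens: (B, N, D) latent patch tokens of one modality
        # t:      (B,) integer diffusion timestep assigned to that modality
        b = tokens.shape[0]
        mod_id = torch.full((b,), self.index[name], device=tokens.device)
        return tokens + self.modality_emb(mod_id).unsqueeze(1) + self.timestep_emb(t).unsqueeze(1)

embedder = ModalityTokenEmbedder(["dem", "s1rtc", "s2l1c", "s2l2a"], token_dim=768)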

Use Cases of COP-GEN-Beta

COP-GEN-Beta's flexible sampling capabilities enable a wide range of downstream applications through various modality translation combinations. By allowing generation of any subset of modalities conditioned on any other subset, our model unlocks numerous practical use cases in remote sensing, from atmospheric correction and DEM generation to dataset expansion. Below we showcase some key applications that demonstrate the model's versatility and potential impact in real-world scenarios.
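
The sketch below shows, under stated assumptions, how such any-subset-to-any-subset sampling can be driven by per-modality timesteps: conditioning modalities stay fixed at their clean latents with timestep 0, while the remaining modalities are denoised from noise. The denoiser interface and the simple linear-interpolation update rule are placeholders, not the released sampler.

# Placeholder sampler sketch; the real model's sampling procedure may differ.
import torch

MODALITIES = ["dem", "s1rtc", "s2l1c", "s2l2a"]

def conditional_sample(denoiser, cond_latents, shape, num_steps=50, device="cpu"):
    """`denoiser(latents, timesteps)` is assumed to return predicted clean
    latents for every modality, given per-modality timesteps in [0, 1]."""
    targets = [m for m in MODALITIES if m not in cond_latents]
    latents = {m: torch.randn(shape, device=device) for m in targets}
    latents.update({m: z.to(device) for m, z in cond_latents.items()})

    for step in reversed(range(1, num_steps + 1)):
        t, s = step / num_steps, (step - 1) / num_steps
        timesteps = {m: (0.0 if m in cond_latents else t) for m in MODALITIES}
        pred_clean = denoiser(latents, timesteps)
        for m in targets:
            # Deterministic update for a linear interpolation schedule,
            # x_s = x0_hat + (s / t) * (x_t - x0_hat); conditioning stays untouched.
            latents[m] = pred_clean[m] + (s / t) * (latents[m] - pred_clean[m])
    return {m: latents[m] for m in targets}

# e.g. out = conditional_sample(model, {"s1rtc": s1_latent}, shape=(1, 4, 32, 32))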


Conditional Generation


Looping Generation

To analyse the model's robustness to generation degradation, we perform a generation chain: starting from a real S2 L2A image, we iteratively condition the model on the previously generated image. For illustrative purposes, we show only a repeated loop between the S2 L2A and S1 RTC modalities.
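
A minimal sketch of this looping experiment, assuming a hypothetical translate(source_name, source_latent, target_name) wrapper around the model:

def generation_loop(translate, s2l2a_latent, num_rounds=5):
    """Alternate S2 L2A -> S1 RTC -> S2 L2A and keep every intermediate result.

    `translate` is a hypothetical wrapper around the model, not a released API.
    """
    history = [("s2l2a", s2l2a_latent)]
    current_name, current = history[-1]
    for _ in range(num_rounds):
        target = "s1rtc" if current_name == "s2l2a" else "s2l2a"
        current = translate(current_name, current, target)
        current_name = target
        history.append((current_name, current))
    return history  # inspect drift or degradation along the chain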


Expanding existing dataset with modality translation

To investigate generalisation abilities, we condition COP-GEN-Beta on images from the BigEarthNet dataset. Despite differences in the processing pipelines of Major TOM thumbnails and BigEarthNet, our model produces reasonable results, demonstrating its potential for expanding existing datasets with additional modalities.
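
A possible way to script this kind of dataset expansion is sketched below; the directory layout, preprocessing, and the generate_missing(condition) wrapper are assumptions for illustration only.

# Hypothetical expansion loop; paths, scaling, and the model wrapper are assumed.
from pathlib import Path
import numpy as np
import torch
from PIL import Image

def expand_with_generated_modalities(generate_missing, thumbnail_dir):
    """Condition on existing S2 L2A thumbnails and collect the generated
    DEM, S1 RTC, and S2 L1C counterparts, keyed by file stem."""
    expanded = {}
    for path in sorted(Path(thumbnail_dir).glob("*.png")):
        rgb = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32)
        x = torch.from_numpy(rgb).permute(2, 0, 1) / 127.5 - 1.0  # scale to [-1, 1]
        expanded[path.stem] = generate_missing({"s2l2a": x.unsqueeze(0)})
    return expanded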

Zero-shot results
Figure 8: COP-GEN-Beta can be used to expand existing datasets that contain any of the supported modalities, such as BigEarthNet, whose Sentinel-2 L2A data is used here as the conditioning source. COP-GEN-Beta reproduces all remaining modalities despite never having observed BigEarthNet samples during training.

Conclusion

We present COP-GEN-Beta, a transformer-based diffusion model for multi-modal Earth observation imagery and the first to learn a joint generative distribution across multiple Earth observation modalities. Through extensive evaluation on Major TOM thumbnails, we demonstrate the model's ability to generate high-quality paired data conditioned on any subset of modalities. This work establishes a robust foundation for future developments in Earth observation, paving the way for models that handle the original formats of the source data and can easily incorporate new modalities through continual learning approaches.

BibTeX

@inproceedings{espinosa2025copgenbeta,
  title={COP-GEN-Beta: Unified Generative Modelling of COPernicus Imagery Thumbnails},
  author={Espinosa, Miguel and Marsocci, Valerio and Jia, Yuru and Crowley, Elliot J. and Czerkawski, Mikolaj},
  booktitle={CVPRW},
  year={2025}
}