Bye bye, clouds!

Clouds in optical images sucks. But machine learning provides interesting methods to get rid of them! Recently, we have conducted a study for the French Space Agency to benchmark some deep learning based and deterministic approaches. This post makes an insight into our results (you can read our preprint here) and unveils Decloud, our open-source framework for Sentinel-2 images reconstruction and cloud removal.

Left: original Sentinel-2 image. Right: processed image, using one Sentinel-1 image (VV/VH) acquired at the same date.
Left: original Sentinel-2 image. Right: processed image with a single-date convolutional neural network, using one Sentinel-1 image (VV/VH) acquired at the same date. Copyright INRAE/CNES 2022.


Well, I don’t know very much about clouds. But as a remote sensing user, all I see is how they affect the optical images: pixels in thin or thick cloud are brighter, pixels in shadows are darker. Clouds are messing with optical sensors radiometry, and we generally resort to pre-processing to deal with the polluted pixels. How can we retrieve the radiometry of an optical image impacted by clouds?

For a long time, gap-filling was used to prepare optical time-series: it interpolates the pixels in the temporal domain, using cloud masks to avoid the cloudy pixels. It works fine, is straightforward and fast. However, it relies on the cloud masks, hence depends greatly of the quality of these clouds masks. Also, it is prone to generate artifacts locally, typically when some neighboring pixels are computed with different set of dates with subtle changes in the land cover or aerosols. More recently, a plethora of deep-learning based methods have emerged, claiming SOTA results to remove clouds in optical images. Thanks to the flexibility of the convolutional neural networks, various geospatial sources have been used: coarser resolution optical images, Synthetic Aperture Radar (SAR) thanks to the Sentinel constellation, Digital Elevation Models (DEM), etc.

The existing deep-learning based methods for clouds removal or image reconstruction available in the literature are ridiculously simple, and are easy to implement. Most of the time, a convolutional neural network is trained to minimize the error between a target image (cloud-free) and the generated image. Many approaches exist: some of them use input cloud-free optical images to estimate a synthetic optical image at another date (e.g. here or here with an additional Digital Elevation Model at input), other approaches input one cloudy optical image and one SAR image and try to remove the clouds (e.g. here), other employs different/additional input sensors, etc. Some studies have started to try to model jointly the cloud generation problem and the cloud removal, but for now, with inferior results compared to real data (e.g. here). No doubt that we are just in the beginning of this fascinating research topic. Maybe in a few years, cloudy optical images will be just a distant memory?

While machine learning based approaches look promising, we must compare them with traditional approach. More importantly, we have to assess how do they perform in term of computational and technical requirements, in operational contexts?

Considered approaches and modalities

In our study, we have tried to measure how various approaches perform in terms of image quality metrics of the 10m spacing channels (Red, Green, Blue, Near infra-red) of the reconstructed image: Peak to Noise Signal Ratio (PSNR), Mean Squared Error (MSE), Structural Similarity (SSIM), Spectral Angle (SAM) were computed with a cloud-free reference image to evaluate the quality of the reconstructed image. We have focused our efforts on the following kind of approaches:

    • (A) The traditional gap-filling: reconstruct an optical image at date t, from two optical images acquired respectively before (t-1) and after (t+1). The gap-filling does not use any SAR image, neither DEM. Also, it doesn’t use the damaged optical image at date t for the reconstructed pixels.
    • (B) The deep-learning based cloud removal of one optical image using a single SAR image acquired near t. This approach, introduced in Meraner et al. (2020), uses one damaged optical image and a SAR image acquired close by, to reconstruct the missing parts of the optical image. The use of SAR image is interesting because SAR is not as sensitive as the cloud coverage as the optical image, and provide different information (Sentinel-1 VV/VH is C-Band).
    • (C) The deep-learning based cloud removal of one optical image using a single SAR image acquired near t, plus two pairs of (optical SAR) acquired respectively before (t-1) and after (t+1). This generalizes the precedent described single-date approach to multi-temporal.


Single-date reconstruction network
Single-date reconstruction network (approach B). A single optical image that can be polluted by clouds, is reconstructed with the help of a SAR image acquired near the same date.
Multi-temporal reconstruction network
Multi-temporal reconstruction network (approach C). This approach generalizes (B) to multi-temporal, with three input (S1, S2) pairs to reconstruct the central date.

Moreover, to disentangle the benefit of the different modalities, we have conducted an ablation study on the deep nets:

    • With optical only
    • With optical + SAR
    • With optical + SAR + DEM

Our results are quite interesting, showing that SAR and DEM greatly help with the cloud removal process for both (B) and (C) architectures.


We have ran our experiments on every available S1 and S2 images over the Occitanie area in south of France, from january 2017 to january 2021. That represents a total of 3593 optical images over 10 tiles, and a total of 5136 SAR images in ascending orbit (in order to limit the number of SAR images, and to be able to compare our results in other regions of the world, we deliberately chose to keep only the ascending orbits of the Sentinel-1 images: indeed in most places in the world, only a single orbit is available).

One challenge was to create datasets that enable to train the deep learning based approaches, and more importantly, to compare all approaches, fitting all benchmarked models validity domains. For this, we have developed a sample generation framework that uses an R-Tree to index the cloud percentage, nearest SAR image to optical, acquisition date, of every available patches in our study sites.

Using our framework, we have crafted datasets enabling the comparison of approaches having different modalities and requirements/constraints at input. It was trivial to compare approaches (B) and (C) since we can use the (optical, SAR) pair at date t, from the dataset enabling the testing of (B), i.e. which is composed of pairs of images at dates t-1, t and t+1 (only the pairs at t is used for approach C). However, to compare (A) and (B), it was a bit more complicated. Indeed, we had to find cloud-free images at t-1 and t+1, but also, in order to mimic the operational situation, we wanted the optical image at t to be 100% cloudy (typically, we would have used gap-filling in this situation in real life). We called acquisitions layout the way images must be acquired in a particular situation: e.g. between [0%, 100%] cloudy at date t, with a SAR image acquired close, and a reference cloud-free optical image acquired less than 10 days from the cloudy image (we give a few examples later below). Except for the gap-filling, which doesn’t require training, this reference image (or target image) is used from the training dataset to optimize the network, and from the test dataset, it is used to compute the evaluation metrics (PSRN, MSE, SSIM, SAM). Talking about datasets splits: we have selected ROIs in the geographical domain to avoid any patch overlap from the training and test groups, to ensure that we don’t measure any over-fitting of the networks.

The following figure explains how acquisitions layouts work:

Principle of acquitions layout
Principle of acquisitions layout

Specifically, here is the acquisitions layout that we used to train (B). This architecture is very simple: it inputs one SAR image (S1t) and a potentially cloudy optical image (S2t), and outputs the de-clouded optical image (Ŝ2t). The setting to train (B) is very basic: we just need one (S1tS2t) pair with S2t possibly cloudy, and the dates of the SAR and optical being close, and a target cloud-free S2target acquired close to the S2t image. The model is trained minimizing the average absolute error between Ŝ2t and S2target.

Acquitions layout for (B)
Acquitions layout for (B). In this configuration, the input SAR image S1t is acquired less than 72 hours from the optical image S2t. The target cloud-free optical image S2target is acquired at most 7 days from S2t.

And it is the same thing for the (C) network, with additional SAR and optical images at t-1 and t+1:

Acquisitions layout for (C)
Acquisitions layout for (C). Here all inputs are composed of pairs of SAR/optical images, with the SAR image acquired less than 72 hours from the optical image. The date “t” denotes the reference timestamp of the acquisitions layout. The other acquisitions (“t-1”, “target”, and “t+1”) respectively fit the temporal ranges [-30 days, -10 days], [-10 days, +10 days], and [+10 days, +30 days]. Finally, the target optical image S2target is cloud-free (0% of clouds), and all input optical images are potentially cloudy (between 0% and 100% of clouds).
Note that this acquisitions layout enables us to compare (B) and (C) quite simply, as we stated before, since the input images at t share the same maximum temporal gap between optical and SAR images.

In all acquisitions layouts, the values for time ranges and optical-to-SAR maximum gap were chosen regarding the orbits of Sentinel-1 and Sentinel-2 over our study site. Since Sentinel-1 and Sentinel-2 swathes do not overlap in the same fashion everywhere in the world, these parameters might not be suited for another area.

Now, to compare gap-filling with deep-learning based approaches, we had to create other datasets (“OS2GFa” and “OS2GFb”) for which all models validity domains fit. The following figure illustrates how this can be accomplished, setting wise time ranges with the adequate cloud cover ranges for t-1, t and t+1 .

Acquisitions layouts
Acquisitions layouts. This animation tries to illustrate how the different approaches validity domains must fit. The x axis represents time. Grey ranges are for cloudy images (between 0% and 100%), magenta ranges are for cloud-free images. The “ref” is the temporal reference for  all acquisitions layouts, and corresponds to an image acquired at the particular date “t”.  The optical image acquired at “t” in “OS1” is completely cloud-free, making this acquisitions layout suited for the gap-filling evaluation.  In “OS2”, the image at “t” is cloudy, which makes this acquisitions layout suited to train (C) or evaluate (B) or (C). The “OS2GfA” and “OS2GFb” acquisitions layouts are used to propose a fair comparison between the gap-filling and (B) and (C). In this configuration, the optical image acquired at date “t” is 100% cloudy, enabling a fair comparison with the gap-filling (the gap-filling won’t use it…) and the target image is a cloud-free image that is used to compute the image quality metrics. The animation shows that “OS2GFa” and “OS2GFb” stay inside both validity domains defined with “OS1” and “OS2”.


Well, the goal here is not to rewrite our paper, so we will be short! You can take a look on the  Decloud source code to see how it is done, and read the paper here for more specifics.

In short, first we have pre-processed our S1 and S2 images using the CNES HAL cluster and datalake. Thanks to the S1Tiling remote module of the Orfeo ToolBox, the download and the calibration of the S1 image are completely automated, saving us a great deal of pain. We used the S2 images from the Theia land data center, in Level 2A processed with the MAJA software since the clouds masks are included in the products (we recall that we used the cloud masks only as metadata per patch, for the estimated cloud coverage percentage). After that, we have generated our datasets from the carefully crafted acquisitions layouts. Then, we have trained our deep nets over their respective training datasets, to minimize the l1 distance between the generated image, and the reference image. We have run all experiments on the Jean-Zay computer at IDRIS. Finally, we have computed the image quality metrics of the reconstructed images over various test datasets (we recall that each one have a null intersection with the training dataset, in the geographical domain).

The (B) network was implemented from its original formulation. Although their paper is amazing and inspiring, their network is greedy in terms of computational resources. So much, that we believe it is not applicable regarding the extremely long processing time required. We hence have modified their original implementation, reducing the processing time with a factor of (roughly) 30x, with the same outcomes in terms of image quality metrics. We gave a few details in our paper: in short we have replaced their ResNet-based bottleneck with a U-Net, which have several benefits e.g. used sources with different spatial resolutions in input and output. The ablation study was performed on our version of (B), since we could have run more experiments in reasonable time.

The (B) network, modified with a U-Net backbone, enabling to deal with 20m spacing sources (e.g. DEM or 20m spectral bands)
The (B) network, modified with a U-Net backbone, enabling to deal with 20m spacing sources (e.g. DEM or 20m spectral bands). Compared to the original approach of Meraner et al., the residual correction is optional.

The (C) network generalizes (B): same U-Net backbone, shared encoders, and the decoder applied over the stacked features maps of the encoder. Take a look in our code!

The (C) network. All encoders (E) are shared. The decoder (D) performs on the stacked features of the encoders.
The (C) network. All encoders (E) are shared. The decoder (D) performs on the stacked features of the encoders. No residual correction after the decoder, D reconstructs directly the estimated optical image.


Comparison of single-date based approaches (b)

We first have compared the original network based on a ResNet backbone, with our modified version with a U-Net backbone. The first shocking information is that the modified version is sometimes like 30-40x faster. Instead of days of training, it’s now just hours. Same thing for the inference. This can be explained because the ResNet performs over features at 10m spatial resolution, unlike the U-Net which performs convolutions on downsampled features. The memory footprint is reduced, and less operations are required since convolutions are performed on features maps of smaller size. Another advantage of the U-Net is that it can deal natively with sources of different resolutions, e.g. a DEM or 20m spacing bands can be concatenated after the first downsampling step in the encoder, and the output 20m spacing bands can be retrieved after the second last upsampled convolutions of the decoder. Indeed, in the original implementation, the ResNet processes everything at the highest resolution, hence any input has to be resampled at 10m (e.g. the 20m spacing bands), same for the outputs… not very computationally efficient!

The results show very little difference in terms of image quality metrics, between the U-Net and the ResNet based networks. We believe that the SSIM is slightly higher with the ResNet based approach since less details are lost because the network performs the convolutions without any downsampling, or because of the residual correction (this has to be investigated further). The results show how the SAR and the DEM help reconstructing the cloudy image, especially in terms of radiometric fidelity (PSNR, MSE, SAM):

Comparison of single-date based approaches.
Comparison of single-date based approaches. SSOPunet,w/o SAR denotes the model (B) without SAR, SSOPmer is the original architecture with ResNet backbone, SSOPunet is the modified architecture with the U-Net backbone, and SSOPunet+DEM uses an additional DEM at input.

We can appreciate visually the benefit of the different modalities, in the figure below:

From left to right: input cloudy optical image S2t (1), S1t (2), output of SSOPunet w/o SAR (3), SSOP unet (4), and SSOP unet+DEM (5), (6) reference image
Model (B) is called “SSOP”. From left to right: input cloudy optical image S2t (1), S1t (2), output of SSOPunet w/o SAR (3), SSOPunet (4), and SSOPunet+DEM (5), (6) reference image, from the test dataset.

As you can see, when no information available, the network cannot do magic stuff, and fails to retrieve something good. Anyway, I am still amazed how it is able to catch the average radiometry from a mixture of thick cloudy pixels… To the question “how does SAR and DEM help?”, we can only conjecture. But experimental results show how these sources benefit to the cloud removal process.

Deep-learning based, single-date vs multitemporal

Ok, we all know that it is not a fair fight! Anyway. Here we just compare (B) and (C) over the test dataset of (C), using only the images at central date t for (B).

With no surprise, the multi-temporal helps a lot. Even without SAR image it still outperforms the single-date based model. Again, we can observe that the SAR and DEM inputs helps the network toward a better reconstruction of the optical image.

Comparison between (B) and (C). SSOPunet,w/o SAR denotes the model (B) without SAR, SSOPmer is the original architecture with ResNet backbone, SSOPunet is the modified architecture with the U-Net backbone, and SSOPunet+DEM uses an additional DEM at input.
Comparison between (B) and (C). SSOPunet,w/o SAR denotes the model (B) without SAR, SSOPunet is the modified architecture with the U-Net backbone, and SSOPunet+DEM uses an additional DEM at input. MSOPunet,w/o SAR denotes the model (C) without SAR, using only the t-1, t and t+1 optical images, MSOPunet denotes the model (C) with the SAR images, and MSOPunet+DEM uses an additional DEM at input.

The following illustration shows how the SAR and the DEM help to retrieve the missing contents in the polluted optical image:

Model (C) is called "MSOP". From left to right: input images S2t−1 (1), S2t (2), S2t+1 (3), output reconstructed optical images MSOP unet,w/oSAR (4), MSOP unet (5), MSOP unet+DEM(6) and the reference image S2t (7).
Model (C) is called “MSOP”. From left to right: input images S2t−1 (1), S2t (2), S2t+1 (3), output reconstructed optical images MSOPunet,w/oSAR (4), MSOPunet (5), MSOPunet+DEM(6) and the reference image S2t (7).

As you can see, sometimes the reconstructed image has less clouds than the reference image! (4th row)

Comparison of All approaches

Now we want to compare all approaches together. In operational situation, we will mostly use gap-filling to retrieve one pixels which is occluded by clouds. From here, we have created an acquisitions layout where the optical image at date t is completely cloudy, and images at t-1 and t+1 are completely cloud-free, because gap-filling also can only work with these two available cloud-free optical images.

Acquisitions layout for the comparison af all approaches together
Acquisitions layout for the comparison af all approaches together. The optical image at date t is 100% cloudy. All other optical images are cloud-free. The target image is very close to the t image (less than 3 days).

A small detail: the gap-filling is performed to reconstruct the image at date target instead of t, to remove this small temporal bias.

Now, we measure how perform (B) and (C) in this setup, which corresponds to a situation where the gap-filling should be a bit advantaged.

Comparison between all approaches
Comparison between all approaches

We can notice that the gap-filling outperforms the single-date based (B) approach. Not doubt that in this situation, available cloud-free images helps more that a SAR image! But it is completely outperformed by the multi-temporal deep-learning approaches of (C). Even without SAR images, the reconstruction is better. Note also that this operational situation is not nominal for the (C) approach, which can be applied over cloudy input images, unlike the gap-filling.

Below are a few screenshots over the test patches:

From left to right: input images S2t−1 (1), S2t (2), from S2t+1 (3), output reconstructed optical images S2 from MSOP unet+DEM (4), the Gap-filling (5), and the reference image S2t (6). In violet are circled details in the reconstructed images that the gap-filling fails to retrieve, or artifacts caused by wrong could masks in input images.
From left to right: input images S2t−1 (1), S2t (2), from S2t+1 (3), output reconstructed optical images S2 from MSOPunet+DEM (4), the Gap-filling (5), and the reference image S2t (6). In violet are circled details in the reconstructed images that the gap-filling fails to retrieve, or artifacts caused by wrong could masks in input images.

You can observe that cloud masks are not always perfect, and lead to artifacts in the images generated using the gap-filling (caused by clouds in the input images, that the masks have missed). Gap-filling typically uses per-pixel clouds masks, it is then very dependent of their quality. On the contrary, the convolutional deep nets are very good in removing the clouds, even though they have been trained from datasets generated using the cloud coverage information (but not the per-pixel oriented cloud mask). Another interesting thing we can observe is the ability of the deep learning based approaches to better reconstruct the sudden changes in time series, possibly due to the use of the available pixels of the optical image at date t, even if cloudy or in the cloud shadows, and the SAR image at t.

We have visually remarked a small drawback of SAR+optical convolutional neural networks over water areas, with small artifacts (row 6). We believe that it’s hard for the network to train on this target, since its surface is changing rapidly (swell, wind, …) and because the SAR backscatter is very sensitive to the water surface roughness.

Computational requirements


We have trained all our networks from approximately 600k training samples, on quad-GPU (NVIDIA V100 with 32Gb RAM). Our networks have less than 30k parameters.

The slowest network is the original (B), trained in 30-40 days. All other networks were trained in less than 42 hours (models C) and 20 hours (models B, modified with U-Net backbone).

Full scene inference

Since our goal was not just to generate nice RGB patches for the reviewers, we have developed pipelines to process seamlessly entire S2 images end-to-end with OTB applications and the OTBTF remote module. The time required to process one entire S2 image with the (C) network is approximately 2 minutes with an old NVIDIA GTX1080Ti. For the (B) network I don’t know but it must be something like 1-2 minutes for an entire Sentinel-2 image.

4 years of Sentinel-2 images processed with our convolutional multi-temporal network over a test ROI, near Toulouse area (France).


We have compared single date and multitemporal convolutional networks with the traditional gap-filling, showing the benefits of the SAR and DEM inputs for the optical image reconstruction.
Also, we have shown that, even if the primary design of the multitemporal convolutional network is not focused on image interpolation in temporal domain, it leads to better results than the gap-filling. We also showed that single date based networks cannot be substituted to gap-filling for the estimation of a completely cloudy image at date t. Single date networks are great over thin clouds, but we have noticed that they fail to retrieve the image details under thick clouds, and lead to blurry images.

However, we should interpret our results carefully regarding the ancillary data available for cloud coverage characterization, since our cloud coverage information per patch depends from it. In addition, we lead our study over a small area that do not represents the various atmospheric conditions all over the earth!

Many other things could be investigated in future studies. For instance, could the geometric parameters of SAR, or physical based SAR pre-processing, help? Is there better loss functions to train the networks? Which architectures are better suited to time series? etc.

Final words

Thanks to the flexibility of the deep nets, it is easy to tailor (nearly) anything to a particular problem, and we can mix the different inputs and outputs with an unprecedented ease (We still have to think a bit though!). For operational use, the preparation of training data and the required processing time are crucial points. Regarding the specific problem of optical image clouds removal, no doubt that deep nets will help to deliver more exploitable data in the future!

Take a look in our code. It is based on Orfeo ToolBox with the OTBTF remote module and the PyOtb extension, GDAL and TensorFlow. We have a pre-trained model for the south of France, and another for the west of Burkina Faso: try them! Don’t forget to raise issues, do not hesitate to contribute!

Happy deep learning.

We are looking for an image processing engineer intern to work on super-resolution

Hi, we are looking for an image processing engineer intern who is not afraid of deep learning with remote sensing images.

Super-resolution offers great prospects in many fields, including in remote sensing. However, very few studies focus on the usefulness of the technique. In particular, does super-resolution helps with landcover mapping, like semantic segmentation or pixel wise patch-based classification? We already have plenty of state of the art methods to extract information automatically from remote sensing image for earth observation. Now, could we apply these great tools on images generated with super-resolution and benefit from the learnt transform from low resolution to high resolution images? Those are quite interesting questions that we are actively trying to answer at the TETIS lab.

Original Sentinel-2 image (Left) and the output processed with super-resolution (Right) over the city of Tours (France, 2020), using pre-trained SR4RS model. Copyright INRAE/UMR TETIS 2021.


Today, many satellite missions continuously collect remote sensing data describing the Earth’s surface at different spatial and temporal resolutions. Therefore, a single study area can be covered by heterogeneous and multimodal information. This information is of paramount importance for monitoring spatio-temporal phenomena and producing land use maps to facilitate sustainable agriculture, monitoring of artificialization and public policy decisions. In recent years, the remote sensing research community has focused on the use of deep-learning approaches. These approaches allow the integration of complementary sensor acquisitions available over the same study area, with the aim of exploiting the interaction between sources with different spectral and spatial contents. The goal is to improve land cover mapping by taking advantage of all available imagery sources. Among those techniques, super-resolution enables to increase the spatial resolution of satellite images. Deep-learning methods such as Generative Adversarial Networks (GAN) or other approaches based on carefully taylored losses are more and more used on geospatial images.


The internship is part of a research collaboration with our partner, the company GEOFIT.


The objective of the internship is to characterize the potential of the super-resolution of Sentinel-2 optical images (13 spectral bands covering the visible to near/mid visible – at 10m spacing) from very high resolution Spot-6/7 optical images ( 1 panchromatic channel at 1.5m spacing and 4 spectral bands Red, Blue, Green and Near Infrared at 6m spacing) for pixel wise  image classification.

In particular, the mission of the recruited person will be:

  • A bibliographic study of state-of-the-art methods in deep learning, in the field of remote sensing, for the spatial super-resolution task,
  • The construction of a dataset of Sentinel-2 and Spot-6/7 images,
  • The choice of two complementary super-resolution methods, their implementation and their application on the previously built dataset,
  • The evaluation of the results of the implemented methods and their comparison using metrics to estimate the quality of the reconstructed images,
  • The evaluation of the results obtained with the implemented super-resolution methods in the context of a supervised classification application of satellite images on a land use mapping and/or artificial surface classification task.


deep-learning, remote-sensing, super-resolution, landcover-maping, linux, python, TensorFlow, Orfeo ToolBox, GDAL, GPU


  • Master degree
  • Knowledge of python and linux
  • First experience with TensorFlow, Keras, or PyTorch
  • Background in signal or image processing
  • Some skills in remote sensing would be appreciated
  • Good reading level in English

The internship

  • Start: March or April 2022
  • Location: Montpellier, France, Earth
  • Duration: 6 months
  • Salary: approx. 550€ per month


  • Dino Ienco (dino dot ienco at
  • Remi Cresson (remi dot cresson at
  • Philippe Albert (p dot albert at

OTBTF docker image with GPU

Installing OTBTF from sources can take some time, between preparing the system with prerequisites, and waiting that the compiler finish to build the two giants Orfeo ToolBox and TensorFlow. In addition, if you compile TensorFlow with CUDA, it can takes several hours! Regarding this matter, I started to make some docker images to ease the deployment of OTBTF. With the incoming release of OTB-7.0, I had to make some updates in the OTBTF build, and decided to look how to make a container that is able to use GPU.

The NVIDIA runtime

NVIDIA have developed some convenient abstraction layer for using NVIDIA GPUs on docker. It handles cuda/gpu drivers of the host system, enabling its use inside the docker containers that runs on top of it.

You can read more here.

Install nvidia-docker

First, you have to install your GPU drivers and CUDA. All the installers are available on NVIDIA website. In my case, I did that with the 430.50 driver for the GTX1080Ti and CUDA 10.1.

After everything is installed, you will need to reboot your machine. Then check that the nvidia-persistenced service is running.

cresson@cin-mo-gpu:~$ systemctl status nvidia-persistenced.service
● nvidia-persistenced.service – NVIDIA Persistence Daemon
Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; static; vendor preset: enabled)
Active: active (running) since Fri 2019-10-11 15:08:40 UTC; 3 days ago
Process: 1296 ExecStart=/usr/bin/nvidia-persistenced –user nvidia-persistenced –no-persistence-mode –verbose (code=exited, status=0/SUCCESS)
Main PID: 1321 (nvidia-persiste)
Tasks: 1 (limit: 4915)
CGroup: /system.slice/nvidia-persistenced.service
└─1321 /usr/bin/nvidia-persistenced –user nvidia-persistenced –no-persistence-mode –verbose

You can also check that you GPUs are up using nvidia-smi:

cresson@cin-mo-gpu:~$ nvidia-smi
Tue Oct 15 13:48:07 2019
| NVIDIA-SMI 430.50 Driver Version: 430.50 CUDA Version: 10.1 |
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| 0 GeForce GTX 108… Off | 00000000:3B:00.0 Off | N/A |
| 20% 33C P5 14W / 250W | 0MiB / 11178MiB | 0% Default |
| 1 GeForce GTX 108… Off | 00000000:5E:00.0 Off | N/A |
| 23% 28C P5 12W / 250W | 0MiB / 11178MiB | 0% Default |
| 2 GeForce GTX 108… Off | 00000000:B1:00.0 Off | N/A |
| 24% 39C P5 14W / 250W | 0MiB / 11178MiB | 0% Default |
| 3 GeForce GTX 108… Off | 00000000:D9:00.0 Off | N/A |
| 27% 33C P0 50W / 250W | 0MiB / 11178MiB | 3% Default |

| Processes: GPU Memory |
| GPU PID Type Process name Usage |
| No running processes found |

If everything looks fine, you can continue.

You can install nvidia-docker

curl -s -L | sudo apt-key add –
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt install nvidia-docker2

The OTBTF GPU docker

Run the OTBTF docker image with NVIDIA runtime enabled. You can add the right permissions to you current USER for docker (there is a “docker” group), or you can also use root to run a docker command . The following command will display the help of the TensorflowModelServe OTB application.

docker run –runtime=nvidia mdl4eo/otbtf1.7:gpu otbcli_TensorflowModelServe -help

And that’s all! Now the mdl4eo/otbtf1.7:gpu docker image will be pulled from DockerHub and available from docker. Don’t forget to use th NVIDIA runtime (using –runtime=nvidia) else you won’t have the GPU support enabled.

Here is what happening when docker pulls the image (i.e. first use, after the pull is finished):

cresson@cin-mo-gpu:~$ sudo docker run–runtime=nvidia -ti mdl4eo/otbtf1.7:gpu bash
Unable to find image ‘mdl4eo/otbtf1.7:gpu’ locally
gpu: Pulling from mdl4eo/otbtf1.7
35c102085707: Already exists
251f5509d51d: Already exists
8e829fe70a46: Already exists
6001e1789921: Already exists
9f0a21d58e5d: Already exists
47b91ac70c27: Already exists
a0529eb74f28: Already exists
23bff6dcced5: Already exists
2137cd2bcba9: Already exists
534a428e13b2: Pull complete
0becaeed2388: Pull complete
5e9d7ff39d5b: Pull complete
7bf26a9c021b: Pull complete
94433defe0a3: Pull complete
38956bf78632: Pull complete
2f2bbba7b89d: Pull complete
0dbe59534219: Pull complete
f0f58d46ebcb: Pull complete
7c7ed8e68764: Pull complete
1de1d58c4952: Pull complete
Digest: sha256:203915c34a8e8fad45b45a274fe7a029df789ad36628b43437bbc62d84b4400f
Status: Downloaded newer image for mdl4eo/otbtf1.7:gpu

The different stages are download from DockerHub (You can retrieve them from the dockerfiles provided on the OTBTF repository). When your command prompt prefix is “root@039085b63ea3“, you are inside the running docker container as root!

And then… how to use Docker?

Well, I do not intend to make a full tutorial about docker here 😉 In addition, I am far from being an expert in this area.

But here are a few stuff I used to make things work with docker. I don’t think at all that it is good practice, but these tips can help.

Docker run

The previous command showed you how to enter a container in interactive mode.  From here, you can install new packages, create new users, run applications, etc.

Docker commit

Docker is like git. When you enter a docker container, you can modify it (e.g. install new packages) and commit your changes. If you don’t, every changes you made inside the container are erased when you leave the container (i.e. “exit” it).

Do a few changes in your running container, like installing a new package. Then, use another terminal to commit your changes:

docker commit 3ade7b0dc71b myrepo:mytag

Here, 3ade7b0dc71b comes from the first terminal in which your container is running. You can copy/paste the hash from the command prompt prefix (“root@3ade7b0dc71b“).

Mount some filesystem

And now you want to do some deep learning experiments on your favorite geospatial data… How to mount some persistent filesystem in your docker container?

There is useful options in the docker run command, like -v /host/path/to/mount:/path/inside/dockercontainer/ which will mount the host directory /host/path/to/mount inside your docker container as the /path/inside/dockercontainer/ path.

cresson@cin-mo-gpu:~$ sudo docker run –runtime=nvidia -ti -v /home/cresson/:/home/toto mdl4eo/otbtf1.7:gpu bash
root@e4a2b1c649c9:/# touch /home/toto/test
root@e4a2b1c649c9:/# exit
cresson@cin-mo-gpu:~$ ls /home/cresson/test

You will have to grant the right permissions to the mounted directory, particularly once you will add some non-root users inside your docker container.

For instance, if you have a user “otbuser” inside your docker container, you can do the following:

  • Check the file permissions of your host shared volume (note the GID of the volume)
  • Create a new group “newgroup” which has the same GID of the volume you want to use
  • Add the otbuser in the newgroup

Happy deep remote sensing!



Enhancement of Sentinel-2 images at 1.5m

The Claude Shannon’s nightmare.

There is various kind of remote sensing images: drone, aerial, satellite, multi-temporal, multi-spectral, … and each product has its advantages and weaknesses. What if we could synthesize a “super-resolution” sensor that could benefit from the advantages of all these image sources? For instance, what if we could fuse (1) high spatial resolution sensor (you know, the one that passes 1 time a year over your ROI), and (2) high temporal resolution sensor (…of which images can’t even let you identify where’s your house!) to derive one synthetic super sensor that has high spatial and high temporal resolution?

Today’s data-driven deep learning techniques establish state of the art results that would be crazy not to consider.

Super-resolution Generative Adversarial Networks

In their paper (“Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network“, which is quite a good read), Ledig et al. propose an approach to enhance low resolution images. They use Generative Adversarial Networks (GAN) to upsample and enhance a single low resolution image, such as the result is looking as the target (high resolution) image. GANs consist in two networks: the generator and the discriminator. In one hand, the generator transforms the input image (the low-res one) into an output image (the synthesized high-res one). In the other hand, the discriminator inputs two images (one low-res image, and one high-res image), and produce a signal that expresses the probability that the second image (the high-res) is a real one. The objective of the discriminator is to detect “fake” (i.e. synthesized) hi-res images among real ones. Simultaneously, the objective of the generator is to produce good looking hi-res images that fool the discriminator, and that are also close to the real hi-res images.

From images to remote sensing images

Now suppose we have some high res remote sensing images (say Spot 6/7, with 1.5m pixel spacing) and some low res remote sensing images (like Sentinel-2, with 10m pixel spacing) and we want to enhance the low res images at the same resolution of the high res. We can train our SRGAN generator over some patches and (try to…) make Sentinel-2 images look like Spot 6/7 images !

First I have downloaded a Sentinel-2 image from the Theia data center. The image is a monthly synthesis of multiple Sentinel-2 surface reflectance scenes produced with MAJA software (see here for details). The image is cloudless, thank to the amazing work of the CNES and CESBIO guys. I used a pansharpened Spot 6/7 scene acquired in the same month as the Sentinel-2 image, in the framework of the GEOSUD project.

The tools of choice to perform this are the OrfeoToolbox with OTBTF (a remote module that uses TensorFlow). First, we extract some patches in images (using otbcli_PatchesExtraction).

About 6k patches are sampled in images.

Once this is done, we can train our generator over these patches, using some SRGAN-inspired code (since we can find many SRGAN implementations on github!). After the model is trained properly, we can transform our Sentinel-2 scenes into images with 1.5m pixel spacing using the otbcli_TensorflowModelServe application (first time that I had to set a outputs.spcscale parameter lower than 1.0!). The synthesized high resolution Sentinel-2 image has a size of 72k by 72k.

Below are a couple of illustrations. Left: the original RGB Sentinel-2 image, Right: the enhanced RGB Sentinel-2 image. In a sake of sincerity, we put images that are far away from learning patches. You can enlarge them to full resolution.

Full map

We put the entire generated image in a cartographic server. You can browse the map freely (Select the “S2_RGB (High-Res)” from the layer menu, to display the 1.5m Sentinel-2 image). You can also display patches used for training (“Training patches footprint” layer).



In this post we show what deep learning could offer in therm of data driven image enhancement for remote sensing images, at least for illustrative use.


The code is available on Github. Feel free to open a PR to add your own model, or training loss, etc. !

An introduction to deep learning on remote sensing images (Tutorial)

Update (28/01/2020): open dataset, available for download!

In this tutorial, we will see how to train and apply a deep neural network on real world remote sensing images, using only user-oriented open-source software. No coding skills required!


Our “Tokyo dataset” is freely available. It is composed as follow:

    • One Sentinel-2 image, that can be downloaded from the European Space Agency hub or from your preferred Sentinel access portal. Just download the Sentinel-2 image acquired over the city of Tokyo the 2019/05/08 (Mission: Sentinel-2, Platform: S2A_*) named S2A_MSIL2A_20190508T012701_N0212_R074_T54SUE_20190518T165701.
    • Two label images of terrain truth (one for training and one for validation). Each pixel is either a “no-data” value (255) or a class value (starting from 0). The files can be downloaded here (also provided: the style .qml file for QGIS). This dataset has been elaborated for educational purpose at our research facility, using Open Street Map data.


We want to classify the Sentinel-2 image, meaning that we intend to estimate the class for each pixel. Since our terrain truth data is sparsely annotated, our approach consist in training a network that estimates one single class value for a particular patch of image. If we had a densely annotated dataset, we could have used the semantic segmentation approach, but this will be another story (soon on this blog!).

We will use the OTBTF remote module of the Orfeo ToolBox. OTBTF relies on TensorFlow to perform numeric computations. You can read more about OTBTF on the blog post here.

OTBTF = Orfeo ToolBox (OTB) + TensorFlow (TF)

It is completely user-oriented, and you can use the provided applications in graphical user interface as any Orfeo ToolBox applications. Concepts introduced in OTBTF are presented in [1]. The easiest way to install and begin with OTBTF is to use the provided docker images.

Deep learning backgrounds

To goal of this post is not to do a lecture about how deep learning works, but quickly summarizing the principle!

Deep learning refers to artificial neural networks with deep neuronal layers (i.e. a lot of layers !). Artificial neurons and edges typically have parameters that adjust as learning proceeds.

Weights modify the strength of the signal at a connection. Artificial neurons may output in non linear functions to break the linearity, for instance to make the signal sent only if the aggregate signal crosses a given threshold. Typically, artificial neurons are aggregated into layers. Different layers may perform different kinds of transformations on their inputs. Signals travel from the first layer (the input layer), to the last layer (the output layer), possibly after traversing the layers multiple times. Among common networks, Convolutional Neural Networks (CNN) achieve state of the art results on images. CNN are designed to extract features, enabling image recognition, object detection, semantic segmentation. A good review of deep learning techniques applied to remote sensing can be found here [2]. In this tutorial, we focus only on CNN for a sake of simplicity.

Let’s start

During the following steps, we advise you to use QGIS to check generated geospatial data. Note that applications parameters are provided in the form of command line, but can be performed using the graphical user interface of OTB!

We can decompose the steps that we will perform as follow:

    • Sampling: we extract patches of images associated to the terrain truth,
    • Training: we train the model from patches,
    • Inference: Apply the model to the full image, to generate a nice landcover map!

Normalize the remote sensing images

We will stack and normalize the Sentinel-2 images using BandMathX. We normalize the images such as the pixels values are within the [0,1]  interval. 

# Go in S2 folder
cd S2A_MSIL2A_20190508T012701_N0212_R074_T54SUE_20190508T041235.SAFE
cd GRANULE/L2A_T54SUE_A020234_20190508T012659/IMG_DATA/R10m/

# Apply BandMathX
otbcli_BandMathX \
-il T54SUE_20190508T012701_B04_10m.jp2 \
T54SUE_20190508T012701_B03_10m.jp2 \
T54SUE_20190508T012701_B02_10m.jp2 \
T54SUE_20190508T012701_B08_10m.jp2 \
-exp "{im1b1/im1b1Maxi;im2b1/im2b1Maxi;im3b1/im3b1Maxi;im4b1/im4b1Maxi}" \
-out Sentinel-2_B4328_10m.tif

Sampling the remote sensing images

The first step to apply deep learning techniques to real world datasets is sampling. The existing framework of OTB offers great tools for pixel wise or object oriented classification/regression tasks. On the deep learning side, nets like CNNs are trained over patches of images rather than batches of single pixel. Hence the first application of OTBTF we will present targets the patches sampling and is called PatchesExtraction.

The PatchesExtraction application integrates seamlessly in the existing sampling framework of OTB. Typically, we have two options depending if our terrain truth is vector data or label image.

    1. Vector data: one can use the PolygoncClassStatistics (computes some statistics of the input terrain truth) and SampleSelection applications to select patches locations, then give them to the PatchesExtraction application.
    2. Label image: we can directly use the LabelImageSampleSelection application from the OTBTF to select the patches locations.

In our case, we have terrain truth in the form of a label image. We hence will use option 2. Let’s select 500 samples for each classes with the command line below:

# Samples selection for group A
otbcli_LabelImageSampleSelection \
-inref ~/tokyo/terrain_truth/terrain_truth_epsg32654_A.tif \
-nodata 255 \
-outvec terrain_truth_epsg32654_A_pos.shp \
-strategy "constant" \
-strategy.constant.nb 500

Where :

  • inref is the label image,
  • nodata is the value for “no-data” (i.e. no annotation available at this location),
  • strategy is the strategy to select the samples locations,
  • outvec is the filename for the generated output vector data containing the samples locations.

Repeat the previously described steps for the second label image (terrain_truth_epsg32654_B.tif). Open the data in QGIS to check that the samples locations are correctly generated.

Patches extraction

Now we should have two vector layers:

  • terrain_truth_epsg32654_A_pos.shp
  • terrain_truth_epsg32654_B_pos.shp

It’s time to use the PatchesExtraction application. The following operation consists in extracting patches in the input image, at each location of the terrain_truth_epsg32654_A_pos.shp. In order to train a small CNN later, we will create a set of patches of dimension 16×16 associated to the corresponding label given from the class field of the vector data. Let’s do this :

# Patches extraction
otbcli_PatchesExtraction \ Sentinel-2_B4328_10m.tif \
-source1.patchsizex 16 \
-source1.patchsizey 16 \
-source1.out Sentinel-2_B4328_10m_patches_A.tif \
-vec terrain_truth_epsg32654_A_pos.shp \
-field "class" \
-outlabels Sentinel-2_B4328_10m_labels_A.tif uint8


  • source1 is first image source (it’s a parameter group),
  • is the input image list of the first source,
  • source1.patchsizex is the patch width of the first source,
  • source1.patchsizey is the patch height of the first source,
  • source1.out is the output patches image of the first source,
  • vec is the input vector data of the points (samples locations),
  • field is the attribute that we will use as the label value (i.e. the class),
  • outlabels is the output image for the labels (we can force the pixel encoding to be 8 bits because we just need to encore small positive integers).

After this step, you should have generated the following output images, that we will name “the training dataset”:

  • Sentinel-2_B4328_10m_patches_A.tif
  • Sentinel-2_B4328_10m_labels_A.tif
Trick to check the extracted patches from QGIS : Simply open the patches as raster layer, then chose the same coordinates reference system as the current project (indicated on the bottom right of the window).

Repeat the previously described steps for the vector layer terrain_truth_epsg32654_B_pos.shp and generate the patches and labels for validation. After this step, you should have the following data that we will name “the validation dataset” :

  • Sentinel-2_B4328_10m_patches_B.tif
  • Sentinel-2_B4328_10m_labels_B.tif
Side note on pixel interleave: sampled patches are stored in one single big image that stacks all patches in rows. There is multiple advantages to this. In particular, accessing one unique big file is more efficient than working on thousands of separate small files stored in the file system. The interleave of the sampled source is also preserved, which, guarantee good performance during data access.


How to train a deep net ? To begin with something easy, we will train a small existing model. We focus on a CNN that inputs our 16 × 16 × 4 patches and produce the prediction of an output class among 6 labels ranging from 0 to 5.


Here is a quick overview of some basic concepts about the training of deep learning models.

Model training usually involves a gradient descent (in the network weights space) of a loss function that typically expresses the gap between estimated and reference data. The batch size defines the number of samples that will be propagated through the network for this gradient descent. Supposing we have N training samples and we want to use a batch size of n. During learning, the first n samples (from 1 to n) will be used to train the network. Then, the second n samples (from n+1 to 2n) will be used to train the network again. This procedure is repeated until all samples are propagated through the network (this is called one epoch).

Model guts

We propose to introduce a small and simple CNN to better understand the approach. This section describe what is inside this model. The figure below summarizes the computational graph of our CNN.

Input : The image patches fed the placeholder named “x” in the TensorFlow model. The first dimension of “x” might have any number of components. This dimension is usually employed for the batch size, and isn’t fixed to enable the use of different batch size. For instance, assuming we want to train our model with a batch of 10 samples, we will fed the model a multidimensional array of size 10 × 16 × 16 × 4 in the placeholder “x” of size # × 16 × 16 × 4.

Deep net : “x” is then processed by a succession of 2D-Convolution/Activation function (Rectified linear)/Pooling layer. After the last activation function, the features are processed by a fully connected layer of 6 neurons (one for each predicted class).

Predicted class : The predicted class is the index of the neuron (from the last neuron layer) that output the maximum value. This is performed in processing the outputs of the last fully connected layer with the Argmax operator, which is named “prediction” in the graph.

Cost function : The goal of the training is to minimize the cross-entropy between the data distribution (real labels) and the model distribution (estimated labels). We use the cross-entropy of the softmax of the 6 neurons as a cost function. In short, this will measure the probability error in the discrete classification task in which the classes are mutually exclusive (each entry is in exactly one class). For this, the model implements a function of TensorFlow know as Softmax cross entropy with logits. This function first computes the Softmax function of the 6 neurons outputs. The softmax function will normalize the output such as their sum is equal to 1 and can be used to represent a categorical distribution, i.e, a probability distribution over n different possible outcomes. The Shannon cross entropy between true labels and probability-like values from the softmax is then computed, and considered as the loss function of the deep net.

Optimizer : Regarding the training, a node called “optimizer” performs the gradient descent of the loss function : this node will be used only for training (or fine tuning) the model, it is useless to serve the model for inference ! The method implemented in this operator is called Adam (like “Adaptive moment estimation” [3]). A placeholder named “lr” controls the learning rate of the optimizer : it holds a single scalar value (floating point) and can have a default value.

Our first CNN architecture !
The network consist in two placeholders (“x” and “lr”) respectively used for input patches (4 dimensional array) and learning rate (single scalar), one output tensor (“prediction”, one dimensional array) and one target node (“optimizer”, used only for training the net). “#” means that the number of component for the first dimension is not fixed.
Important note: The functions used in our model are working on labels that are integers ranging from 0 to N-1 classes. The first class number must be 0 to enforce the class numbering convention of the TensorFlow functions used in our model.

To generate this model, just copy-paste the following into a python script named

You can also skip this copy/paste step since this python script is located here

from tricks import *
import sys
import os


def myModel(x):
  # input patches: 16x16x4
  conv1 = tf.layers.conv2d(inputs=x, filters=16, kernel_size=[5,5], padding="valid", 
                           activation=tf.nn.relu) # out size: 12x12x16
  pool1 = tf.layers.max_pooling2d(inputs=conv1, pool_size=[2, 2], strides=2) # out: 6x6x16
  conv2 = tf.layers.conv2d(inputs=pool1, filters=16, kernel_size=[3,3], padding="valid", 
                           activation=tf.nn.relu) # out size: 4x4x16
  pool2 = tf.layers.max_pooling2d(inputs=conv2, pool_size=[2, 2], strides=2) # out: 2x2x16
  conv3 = tf.layers.conv2d(inputs=pool2, filters=32, kernel_size=[2,2], padding="valid",
                           activation=tf.nn.relu) # out size: 1x1x32
  # Features
  features = tf.reshape(conv3, shape=[-1, 32], name="features")
  # neurons for classes
  estimated = tf.layers.dense(inputs=features, units=nclasses, activation=None)
  estimated_label = tf.argmax(estimated, 1, name="prediction")

  return estimated, estimated_label
""" Main """
if len(sys.argv) != 2:
  print("Usage : <output directory for SavedModel>")

# Create the TensorFlow graph
with tf.Graph().as_default():
  # Placeholders
  x = tf.placeholder(tf.float32, [None, None, None, 4], name="x")
  y = tf.placeholder(tf.int32  , [None, None, None, 1], name="y")
  lr = tf.placeholder_with_default(tf.constant(0.0002, dtype=tf.float32, shape=[]),
                                   shape=[], name="lr")
  # Output
  y_estimated, y_label = myModel(x)
  # Loss function
  cost = tf.losses.sparse_softmax_cross_entropy(labels=tf.reshape(y, [-1, 1]), 
                                                logits=tf.reshape(y_estimated, [-1, nclasses]))
  # Optimizer
  optimizer = tf.train.AdamOptimizer(learning_rate=lr, name="optimizer").minimize(cost)
  # Initializer, saver, session
  init = tf.global_variables_initializer()
  saver = tf.train.Saver( max_to_keep=20 )
  sess = tf.Session()

  # Create a SavedModel
  CreateSavedModel(sess, ["x:0", "y:0"], ["features:0", "prediction:0"], sys.argv[1])

The imported is part of the OTBTF remote module, and contains a set of useful functions and helpers to create SavedModel . It is located in the OTBTF source code. The environment variable PYTHONPATH must hence contain the path to this directory to enable the use of in our script. Our python script uses the TensorFlow python API to generate the model, and serializes it as a SavedModel (google protobuf) written on disk.

You can finally generate the SavedModel with:

python data/results/SavedModel_cnn

Training from scratch

If you take a look in the data/results/SavedModel_cnn directory, you will see a .pb file and a Variables folder. The protobuf file serializes the computational graph, and the Variables folder contains the values of the model weights (kernels, etc.). As you could have noticed in the python script, the model weights are initialized before exporting the SavedModel . We will use the TensorflowModelTrain application to train the CNN from its initialized state, updating its weights for the image classification task. For each dataset (training data and validation data), the validation step of the TensorflowModelTrain application consists in computing usual validation metrics.

Let’s quickly summarize the application parameters:

  • training.source1 is a parameter group for the patches source (for learning)
    • is the input image filename of the patches
    • training.source1.patchsizex is the patch width
    • training.source1.patchsizey is the patch height
    • training.source1.placeholder is the name of the placeholder for the patches
  • training.source2 is a parameter group the labels source (for learning)
    • is the input image filename of the labels
    • training.source2.patchsizex is the labels width
    • training.source2.patchsizey is the labels height
    • training.source2.placeholder is the name of the placeholder for the labels
  • model.dir is the directory containing the TensorFlow SavedModel
  • training.targetnodes is the name of the operator that we want to compute for the training step. In our model, the gradient descent is done with the adam optimizer node called “optimizer”.
  • validation.mode is the validation mode. The “class” validation mode enables the computation of classification metrics for bot training data and validation data.
  • validation.source1 is a parameter group for the patches source (for validation). As the name of the source for validation ( is the same as the placeholder name of the same source for training (training.source1.placeholder), this source is considered as an input of the model, and is fed to the corresponding placeholder during the validation step.
  • validation.source2 is the labels source (for validation). As the name of the source (validation. is different than the placeholder name of the same source for training (training.source2.placeholder), this source is considered as a reference to be compared to the output of the model that have the same tensor name during the validation step.
  • model.saveto enables to export the model variables (i.e. weights) to a file

The command line corresponding to the above description is the following:

# Train the deep learning model
otbcli_TensorflowModelTrain \ Sentinel-2_B4328_10m_patches_A.tif \
-training.source1.patchsizex 16 \
-training.source1.patchsizey 16 \
-training.source1.placeholder "x" \ Sentinel-2_B4328_10m_labels_A.tif \
-training.source2.patchsizex 1 \
-training.source2.patchsizey 1 \
-training.source2.placeholder "y" \
-model.dir "data/results/SavedModel_cnn" \
-training.targetnodes "optimizer" \
-validation.mode "class" \ Sentinel-2_B4328_10m_patches_B.tif \ "x" \ Sentinel-2_B4328_10m_labels_B.tif \ "prediction" \
-model.saveto "data/results/SavedModel_cnn/variables/variables"

Run the TensorflowModelTrain application. After the epochs, note the kappa and overall accuracy indexes (should be respectively around 0.7 over the validation dataset). Browse the file system, and take a look in the data/results directory : you can notice that the application has updated two files :

  • variables.index is a summary of the saved variables,
  • is the saved variables (TensorflowModelTrain saved them only once at the end of the training process).

A quick comparison of scores versus Random Forest

Let’s use OTB to compare quickly the scores obtained using RF.

Here, we compare our tiny deep net to a Random Forest (RF) classifier. We use the metrics deriving from the confusion matrix. Let’s do it quickly thank to the machine learning framework of OTB First, we use the exact same samples as for the deep net training and validation, to extract the pixels values of the image :

# Extract the pixels values at samples locations (Learning data)
otbcli_SampleExtraction \
-in Sentinel-2_B4328_10m.tif \
-vec terrain_truth_epsg32654_A_pos.shp \
-field "class" \
-out terrain_truth_epsg32654_A_pixelvalues.shp

Then, we do the same with the terrain_truth_epsg32654_B_pos.shp vector data (validation data). Finally, we train a RF model with default parameters:

# Train a Random Forest classifier with validation
otbcli_TrainVectorClassifier \
-io.vd terrain_truth_epsg32654_A_pixelvalues.shp \
-valid.vd terrain_truth_epsg32654_B_pixelvalues.shp \
-feat "value_0" "value_1" "value_2" "value_3" \
-cfield "class" \
-classifier "rf" \
-io.out randomforest_model.yaml

Compare the scores. The kappa index should be around 0.5, more than 20% less than our CNN model.

It’s not a surprise that the metrics are better with CNN than RF: The CNN uses a lot more information for training, compared to the RF. It learns on
16 × 16 patches of multispectral image, whereas the RF learns only on multispectral pixels (that could be interpreted as a 1 × 1 patches of multispectral image). So let’s not say that CNN is better than RF! The goal here is not to compare RF vs CNN, but just to show that the contextual information process by the CNN is useful in classification task. A more fair competitor could be a RF using spatial features like textures, local Fourier transforms, SIFTs, etc.

Create a landcover map

At least!

Updating the CNN weights

As explained before, the TensorFlow model is serialized as a SavedModel , which is a bundle including the computational graph, and the model variables (kernel weights). We have previously trained our model from scratch : we have updated its variables from their initial state, and saved them on the file system. 

Running the model

Let’s run the model over the remote sensing image to produce a nice land cover map! For this, we will use the TensorflowModelServe application. We know that our CNN input has a receptive field of 16 × 16 pixels and the placeholder name is “x”. The output of the model is the estimated class, that is, the tensor resulting of the Argmax operator, named “prediction”. Here, we won’t use the optimizer node as it’s part of the training procedure. We will use only the model as shown in the following figure:

For the inference, we use only the placeholders (“x” and we compute the one output tensor named (“prediction”, one dimensional array).

As we might have no GPU support for now, it could be slow to process the whole image. We won’t produce the map over the entire image (even if that’s possible thank to the streaming mechanism) but just over a small subset. We do this using the extended filename of the output image, setting a subset starting at pixel 1000, 4000 with size 1000 × 1000. This extended filename consists in adding ?&box=1000 :4000 :1000 :1000 to the output image filename. Note that you can also generate a small image subset with the ExtractROI application of OTB, then use it as input of the TensorflowModelServe application.

otbcli_TensorflowModelServe \ Sentinel-2_B4328_10m.tif \
-source1.rfieldx 16 \
-source1.rfieldy 16 \
-source1.placeholder "x" \
-model.dir "data/results/SavedModel_cnn" \
-output.names "prediction" \
-out "classif_model1.tif?&box=4000:4000:1000:1000"


  • source1 is a parameter group for the first image source,
  • is the input image list of the first source,
  • source1.rfieldx is the receptive field width of the first source,
  • source1.rfieldy is the receptive field height of the first source,
  • source1.placeholder is placeholder name corresponding to the the first source in the TensorFlow model,
  • model.dir is the directory of the SavedModel ,
  • output.names is the list of the output tensor that will be produced then generated as output image,
  • out is the output image generated from the TensorFlow model applied to the entire input image.

Now import the generated image in QGIS. You can change the style of the raster : in the layers panel (left side of the window), right-click on the image then select properties,and import the provided legend_style.qml file.

Import the output classification map in QGIS
Important note: We just have run the CNN in patch-based mode, meaning that the application extracts and process patches independently at regular intervals. This is costly, because the sampling strategy requires to duplicate a lot of overlapping patches, and process them independently. The TensorflowModelServe application enables to uses fully convolutional models, enabling to process seamlessly entire output images blocks... but that's also another story! 


This tutorials has explained how to perform an image classification using a simple deep learning architecture.

The OTBTF, a remote module of the Orfeo ToolBox (OTB), has been used to process images from a user’s perspective: no coding skills were required for this tutorial. QGIS was used for visualization purposes.

We will be at IGARSS 2019 for a full day tutorial extending this short introduction, and we hope to see you there!

Want more? To go further (multi-branch models, fully-convolutional models, semantic segmentation models, …) we encourage the reader to check the OTBTF documentation section, which provides some ready-to-use deep networks, with documentation and scientific references.

Edit 22/07/2020

I am happy to announce the release of my bookDeep learning for remote sensing images with open-source softwareon Taylor & Francis ( The book extends significantly this tutorial, keeping the same format. It contains further sections to learn how to apply patch-based models, hybrid classifiers (Random Forest working on Deep learning features), Semantic segmentation, and Image restoration (Optical images gap filling from joint SAR and time series).


[1] R. Cresson, A Framework for Remote Sensing Images Processing Using Deep Learning Techniques. IEEE Geoscience and Remote Sensing Letters, 16(1) :25-29, 2019.

[2] Liangpei Zhang, Lefei Zhang, and Bo Du. Deep learning for remote sensing data : A technical tutorial on the state of the art. IEEE Geoscience and Remote Sensing Magazine, 4(2) :22–40, 2016.

[3] Diederik P Kingma and Jimmy Ba. Adam : A method for stochastic optimization. arXiv preprint arXiv :1412.6980, 2014.

A remote module for Deep learning in the Orfeo ToolBox

Originally posted on the Orfeo ToolBox Blog

Deep learning has proven to be an extremely powerful tool in many fields, and particularly in image processing: these approaches are currently subject of great interest in the Computer Vision community. However, while a number of typical CV problems can be directly transposed in Remote Sensing (Semantic labeling, Classification problems, …), some intrinsic properties of satellite and aerial imagery makes difficult to process real world RS images (large size, formats, geospatial characteristics, standards, etc.). That’s why pre-processed, pre-annotated public datasets like UC Merced or Postdam are so damn popular. And conversely, there is currently no operational, scalable, and user-friendly tools for people belonging to the RS community who have no coding knowledge.

In this post, we introduce a new remote module, otbtf, enabling the use of deep learning techniques with real world geospatial data. This remote module uses the high performance numerical computation library TensorFlow to bring the deep learning magic into OTB. The C++API of TensorFlow is used to run tensorflow sessions inside filters that are compliant with the streaming mechanism of ITK and OTB, meaning that there is no limitation on the size of images to be processed. Let’s see quickly what’s in the kit!

Patches sampling

The first step to apply deep learning techniques to real world datasets, consists in building the dataset. The existing framework of the Orfeo ToolBox is great and offers tools like PolygoncClassStatistics, SampleSelection, SampleExtraction, and more recently SampleAugmentation. Those are really great for pixel wise or object oriented classification/regression tasks. On the deep learning side, among popular deep nets, the Convolutional Neural Network (CNN) has shown great potential for a number of tasks (segmentation, classification, object detection, image recognition, etc.). However, the CNN are trained over patches of images rather than batches of single pixel. Hence the first application of otbtf targets the patches sampling and is called (by the way) PatchesExtraction. It integrates seamlessly in the existing sampling framework of OTB: typically one can use the PolygoncClassStatistics and SampleSelection applications to select patches centers, then give them to the PatchesExtraction application. As we wanted to keep things simple, sampled patches are stored in one single big image that stacks all patches in rows . There is multiple advantages to this. First, accessing one unique big file is more efficient than working on thousands of separate small files stored in the file system. Secondly, one can visualize/process patches as any image: for instance, we can import the big patches image in QGIS and check that our patches looks good. Thirdly, the interleave of the sampled source is preserved, which, guarantee good performance during data access. Besides, we expect that a deep net use the same kind of input for training than for inference, in term of data interleave (it seems rational to consider that the data shouldn’t need to be reordered, for training or inference!).

PatchesExtraction (from otbtf repository)

Training a deep net

Once patches are extracted in images, one can train a deep net, feeding to the application TensorflowModelTrain some patches for training and patches for validation. We think that there will be more and more deep nets available for RS processing. Regarding TensorFlow, a convenient way to serialize models is done through Google Protobuf (called SavedModel). This enables the reuse of a model, for training or inference. The TensorflowModelTrain loads a SavedModel, then train it. As the model variables can be either loaded or saved, one can train the model from scratch or perform some fine tuning, and save variables at different places allowing to process the same model later with different kernel weights, etc (e.g. in classification task, we can train the model from scratch for an entire country, then performing some fine tuning over multiple ecological region).

Currently, we are working on the integration of something to make TensorflowModelTrain reports to TensorBoard. Also, we would add something in a near future to optimize the streaming of batches during training/validation. We will also create soon a repository somewhere to host various SavedModel implemented in research papers.

Model training (From otbtf repository)

Serving a deep net

This is where thing become interesting. With the TensorflowModelServe application, we can use any tensorflow model with any number of input sources, any number of input placeholders (that might as well be some user-specific scalar placeholders, for instance “parameter1=0.2”). Thank to the streaming mechanism, we can process any number of pixels in a seamless way. There is two modes to use a deep net: (1) Patch-based: extract and process patches independently at regular intervals. Patches sizes are equal to the receptive field sizes of inputs. For each input, a tensor with a number of elements equal to the number of patches feds the TensorFlow model. (2) Fully-convolutional: Unlike patch-based mode, it allows the processing of an entire requested region. For each input, a tensor composed of one single element, corresponding to the input requested region, is fed to the TensorFlow model. This mode requires that receptive fields, expression fields (the output space that a model will produce for one input patch) and scale factors (the total stride between the input and the output, i.e. the physical spacing change) are consistent with operators implemented in the TensorFlow model, input images physical spacing and alignment. This second mode enables even “old” hardware to process entire scenes in reasonable processing time, enforcing the OTB philosophy 😉 Last but not least, as the entire pipeline implemented in TensorflowModelServe is fully streamable, the application benefits from the Message Passing Interface (MPI) support of OTB, and can be used on any High Performance Computing clusters that share parallel file system.

Model serving (From otbtf repository)

Using deep features inside OTB machine learning framework

Recent studies have suggested that deep nets features can be used as input features of algorithms like classification, leading state of the art results. Regarding RS image classification, OTB already implement a number of algorithms in its classification application, e.g. SVM, Random Forests, boost classifier, decision tree classifier, gradient boosted tree classifier, normal Bayes classifier. We provide two composite application, TrainClassifierFromDeepFeatures and ImageClassifierFromDeepFeatures, that reuse TensorflowModelServe as input of, respectively the TrainImagesClassifier and ImagesClassifier applications (official OTB applications dedicated to train a classifier from features and perform a pixel wise image classification). We could also do the same thing for SampleExtractionFromDeepFeatures, TrainVectorClassifier to name a few (future work!).

Now, you don’t have any excuse to apply leading, state of the art classification methods 😉

You can read this paper to have more details about concepts involved in otbtf.