Pretraining of deep CNN autoencoders

Goal

The goal of the project is to show how pretraining autoencoders based on CNNs can be used to learn feature extraction and subsequently finetune for specific task. This page was made to showcase some of the figures relevant to the project.

Model

We have chosen the EfficientNetB4 as our model architecture of the encoder. For the decoder, we transpose the architecture of the encoder. Additionally, we attach multiple heads to the model -

First head predicts the image on the input
Second head predicts binary mask of the whole cluster
Third head predict the binary mask of pixels above fixed threashold

Variations

Different levels of input pixel masking were tried (0%, 25% and 50%)
Variations with or without binary mask head were tested

Input

As input to the model we use 3 channel image of size 384x384, which is upsampled image of the cluster (up to 5x size increase in each dimension).

First chanel is dedicated to deposited energy for each pixel.
Second channel is for the time of arrival of each pixel
Third channel contains binary map of the pixels above certain energy threshold Data is normalized using sqare-root normalization.

Training

Pretraining

The pretraining has two stages :

Standard pretraining on Imagenet dataset - we simply download the pretrained weights from for this torchvision model.
Part of the data for unsupervised pretraining is obtained from Timepix3 measurements in ATLAS. This dataset contains high variability of clusters but smaller clusters are much more frequent. This is the reason why we also used ions data obtained at test beam measurements of Pb at SPS at Cern - these have higher frequency of large clusters.

Dataset is balanced into exponentially sized bins based on cluster hit count. We then downsample the clusters so that the frequency across the bins is unfiorm.

Reconstruction of energy of a curly track using CNN autoencoder as a function of number of training iterations — Reconstruction of energy of a curly track using a CNN autoencoder as a function of training iterations.

Reconstruction of energy of ion track using CNN autoencoder as a function of number of training iterations

Reconstruction of ion track using CNN autoencoder as a function of number of training iterations

What happens, if we do not use binary pixel masking?

Suddenly, the thin tracks become more blurry as we can see below.

Reconstruction of electron track using CNN autoencoder as a function of number of training iterations. This time, no binary mask predicition is used

Reconstruction of ion track using CNN autoencoder as a function of number of training iterations. This time, no binary mask predicition mask is used

Key observations:

addition of binary mask predicition non-trivially increased the sharpness of the recontructed image
we were able to reconstruct simple, linear tracks with high precision
heavy ions were reconstructed too but reconstructing areas with high density of delta electrons ended up blurred

Supervised finetuning

Data for supervised finetuning is obtained from simulations. For instance, we simulate protons and electrons for SATRAM experiment with Timepix1. Currently, there is a lack of standardized benchmarking datasets for this type of detectors, so besides simulation of real spectrum for SATRAM instrument, we also created identical spectrum to the one mentioned in this paper , further referred to as “reference spectrum” as opposed to the original “real spectrum”.

Validation

Separated protons and electron feature vectors

Projection of extracted features of each cluster to 2D using UMAP — Projection of feature vector obtained from CNN encoder into 2D AFTER unsupervised pretraining on ATLAS data and also AFTER supervised finetuning to split protons from electrons. Some tracks were selected (using K-means) and for these we also show energy deposition and the tracks shape in a square box. We can see that tracks with similar angle are close together indicating model learned the feature to recognize particle angle. Another extracted feature seems to be track length. This indicates that model is indeed building meaningful and interpretable features.

Key observations

two of the most important extracted features by the model were track length and angle of the track
by using the pretrained model, we were able to improve the separation of protons and electrons on reference spectrum from reported 94.79% to our 96.2%
we have shown that most of the errors are caused by lack of information (tracks smaller than 8 pixels), after filtering those we achieved >99% accuracy on a balanced dataset