% body thesis file that contains the actual content 



\chapter{Introduction} 



The introduction will explain the wider context first, before 

providing technical details. 



\subsection*{Motivation} 



Famous examples like the automatic soap dispenser, which does not 

recognise the hand of a black person but dispenses soap when presented 

with a paper towel, raise the question of bias in computer 

systems~\cite{Friedman1996}. Related to this ethical question regarding 

the design of so-called algorithms is the question of 

algorithmic accountability~\cite{Diakopoulos2014}. 



Supervised neural networks learn from input-output relations and 

figure out by themselves what connections are necessary for that. 

This feature is also their Achilles heel: it makes them effectively 

black boxes and prevents any answers to questions of causality. 



However, these questions of causality are of enormous consequence when 

results of neural networks are used to make life-changing decisions: 

Is a correlation enough to bring forth negative consequences 

for a particular person? And if so, what is the possible defence 

against math? Similar questions can be raised when looking at computer 

vision networks that might be used together with so-called smart 

CCTV cameras to discover suspicious activity. 



This leads to the need for neural networks to explain their results. 

Such an explanation must come from the network or an attached piece 

of technology to allow mass adoption. Obviously, this setting 
poses the question of how such an endeavour can be achieved. 



For neural networks there are fundamentally two types of tasks: 

regression and classification. Regression deals with any case 

where the goal for the network is to come close to an ideal 

function that connects all data points. Classification, however, 

describes tasks where the network is supposed to identify the 

class of any given input. In this thesis, I will work with both. 



\subsection*{Object Detection in Open Set Conditions} 



\begin{figure} 

\centering 

\includegraphics[scale=1.0]{openset} 

\caption{Open set problem: the test set contains classes that 

were not present during training time. 

Icons in this image have been taken from the COCO data set 

website (\url{https://cocodataset.org/\#explore}) and were 

vectorised afterwards. Resembles figure 1 of Miller et al.~\cite{Miller2018}.} 

\label{fig:openset} 

\end{figure} 



More specifically, I will look at object detection under open set 
conditions (see figure \ref{fig:openset}). 

In non-technical words, this effectively describes 

the kind of situation you encounter with CCTV cameras or robots 

outside of a laboratory. Both use cameras that record 

images. Subsequently, a neural network analyses the image 

and returns a list of detected and classified objects that it 

found in the image. The problem here is that networks can only 

classify what they know. If presented with an object type that 

the network was not trained with, as happens frequently in real 

environments, it will still classify the object and might even 

have a high confidence in doing so. Such an example would be 

a false positive. Any ordinary person who uses the results of 

such a network would falsely assume that a high confidence always 

means the classification is very likely correct. If they use 

a proprietary system they might not even be able to find out 

that the network was never trained on a particular type of object. 

Therefore, it would be impossible for them to identify the output 

of the network as a false positive. 



This reaffirms the need for automatic explanation. Such a system 

should by itself recognise that the given object is unknown and 

hence mark any classification result of the network as meaningless. 

Technically there are two slightly different approaches that deal 

with this type of task: model uncertainty and novelty detection. 



Model uncertainty can be measured, for example, with dropout sampling. 

Dropout layers are usually used only during training, but 
Miller et al.~\cite{Miller2018} use them during testing as well 
to obtain different results for the same image across 
multiple forward passes. The output scores for the forward passes 

of the same image are then averaged. If the averaged class 

probabilities resemble a uniform distribution (every class has 

the same probability), this symbolises maximum uncertainty. Conversely, 
if there is one very high probability with every other being very 
low, this signifies a low uncertainty. An unknown object is more 
likely to cause high uncertainty, which allows for an identification 
of false positive cases. 



Novelty detection is another approach to solve the task. 

In the realm of neural networks it is usually done with the help of 

autoencoders that solve a regression task of finding an 

identity function that reconstructs the given input~\cite{Pimentel2014}. Autoencoders have 

internally at least two components: an encoder, and a decoder or 

generator. The job of the encoder is to find an encoding that 
compresses the input as much as possible while simultaneously 
being as loss-free as possible. The decoder takes this latent 
representation of the input and has to find a decompression 
that reconstructs the input as accurately as possible. During 

training these autoencoders learn to reproduce a certain group 

of object classes. The actual novelty detection takes place 

during testing: given an image, and the output and loss of the 

autoencoder, a novelty score is calculated. For some novelty 

detection approaches the reconstruction loss is exactly the novelty 
score; others consider more factors. A low novelty 

score signals a known object. The opposite is true for a high 

novelty score. 
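
As a minimal sketch of this idea (not any specific implementation), the reconstruction error itself can serve as the novelty score; the \texttt{autoencoder} object and its \texttt{predict} method are assumptions here, for example a Keras model:

\begin{verbatim}
import numpy as np

def novelty_score(autoencoder, x):
    # Reconstruct the input and use the mean squared error between
    # input and reconstruction as the novelty score.
    reconstruction = autoencoder.predict(x)
    return float(np.mean((x - reconstruction) ** 2))

# A low score indicates a known (well reconstructed) input,
# a high score indicates a potentially novel input.
\end{verbatim}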



\subsection*{Research Question} 



Autoencoders work well for data sets like MNIST~\cite{Deng2012} 

but perform poorly on challenging real-world data sets 

like MS COCO~\cite{Lin2014}, complicating any potential comparison between 

them and object detection networks like \gls{SSD}. 

Therefore, a comparison between model uncertainty with a network like 

SSD and novelty detection with autoencoders is considered out of scope 

for this thesis. 



Miller et al.~\cite{Miller2018} used an \gls{SSD} pretrained on COCO 

without further fine-tuning on the SceneNet RGB-D data 
set~\cite{McCormac2017} and reported good results regarding 

open set error for an \gls{SSD} variant with dropout sampling and entropy 

thresholding. 

If their results are generalisable it should be possible to replicate 

the relative difference between the variants on the COCO data set. 

This leads to the following hypothesis: \emph{Dropout sampling 

delivers better object detection performance under open set 

conditions compared to object detection without it.} 



For the purpose of this thesis, I will use the \gls{vanilla} \gls{SSD} (as in: the original SSD) as 

baseline to compare against. In particular, \gls{vanilla} \gls{SSD} uses 

a per-class confidence threshold of 0.01, an IOU threshold of 0.45 

for the \gls{NMS}, and a top \(k\) value of 200. For this 

thesis, the top \(k\) value was changed to 20 and a confidence threshold 
of 0.2 was tried as well. 

The effect of an entropy threshold is measured against this \gls{vanilla} 

SSD by applying entropy thresholds from 0.1 to 2.4 inclusive (limits taken from 

Miller et al.). Dropout sampling is compared to \gls{vanilla} SSD 

with and without entropy thresholding. 



\paragraph{Hypothesis} Dropout sampling 

delivers better object detection performance under open set 

conditions compared to object detection without it. 



\subsection*{Reader's Guide} 



First, chapter \ref{chap:background} presents related works and 

provides the background for dropout sampling. 

Afterwards, chapter \ref{chap:methods} explains how \gls{vanilla} \gls{SSD} works, how 

Bayesian \gls{SSD} extends \gls{vanilla} SSD, and how the decoding pipelines are 

structured. 

Chapter \ref{chap:experimentsresults} presents the data sets, 

the experimental setup, and the results. This is followed by 

chapter \ref{chap:discussion}, focusing on the discussion and closing. 



Therefore, the contribution is found in chapters \ref{chap:methods}, 

\ref{chap:experimentsresults}, and \ref{chap:discussion}. 



\chapter{Background} 



\label{chap:background} 



This chapter will begin with an overview of previous works 

in the field of this thesis. Afterwards the theoretical foundations 

of dropout sampling will be explained. 



\section{Related Works} 



The task of novelty detection can be accomplished in a variety of ways. 

Pimentel et al.~\cite{Pimentel2014} provide a review of novelty detection 

methods published over the previous decade. They showcase probabilistic, 

distance-based, reconstruction-based, domain-based, and information-theoretic 

novelty detection. Based on their categorisation, this thesis falls under 

reconstruction-based novelty detection as it deals only with neural network 

approaches. Therefore, the other types of novelty detection will only be 

briefly introduced. 



\subsection{Overview of the Types of Novelty Detection} 



Probabilistic approaches estimate the generative probability density function (pdf) 

of the data. It is assumed that the training data is generated from an underlying 

probability distribution \(D\). This distribution can be estimated with the 
training data; the estimate is defined as \(\hat D\) and represents a model 

of normality. A novelty threshold is applied to \(\hat D\) in a way that 

allows a probabilistic interpretation. Pidhorskyi et al.~\cite{Pidhorskyi2018} 

combine a probabilistic approach to novelty detection with autoencoders. 



Distance-based novelty detection uses either nearest neighbour-based approaches 
(e.g. \(k\)-nearest neighbour \cite{Hautamaki2004}) 
or clustering-based approaches 
(e.g. the \(k\)-means clustering algorithm \cite{Jordan1994}). 
Both methods are similar to estimating the 
pdf of the data: they use well-defined distance metrics to compute the distance 
between two data points. 



Domain-based novelty detection describes the boundary of the known data, rather 

than the data itself. Unknown data is identified by its position relative to 

the boundary. Support vector machines are a common implementation of this 
approach (e.g. implemented by Song et al. \cite{Song2002}). 



Information-theoretic novelty detection computes the information content 

of a data set, for example, with metrics like entropy. Such metrics assume 

that novel data inside the data set significantly alters the information 

content of an otherwise normal data set. First, the metrics are calculated over the 

whole data set. Afterwards, a subset is identified that causes the biggest 

difference in the metric when removed from the data set. This subset is considered 

to consist of novel data. For example, Filippone and Sanguinetti \cite{Filippone2011} provide 

a recent approach. 



\subsection{Reconstruction-based Novelty Detection} 



Reconstruction-based approaches use the reconstruction error in one form 
or another to calculate the novelty score. This can be autoencoders that 
literally reconstruct the input, but it also includes MLP networks which try 
to reconstruct the ground truth. Pimentel et al.~\cite{Pimentel2014} differentiated 
between neural network-based approaches and subspace methods. The former were 
further differentiated into MLPs, Hopfield networks, auto-associative networks, 
radial basis function networks, and self-organising networks. 
The remainder of this section focuses on MLP-based works, with a particular focus 
on the task of object detection and Bayesian networks. 



Novelty detection for object detection is intricately linked with 

open set conditions: the test data can contain unknown classes. 

Bishop~\cite{Bishop1994} investigated the correlation between 

the degree of novel input data and the reliability of network 

outputs, and introduced a quantitative way to measure novelty. 



The Bayesian approach provides a theoretical foundation for 

modelling uncertainty \cite{Ghahramani2015}. 

MacKay~\cite{MacKay1992} provided a practical Bayesian 

framework for backpropagation networks. Neal~\cite{Neal1996} built upon 

the work of MacKay and explored Bayesian learning for neural networks. 

However, these Bayesian neural networks do not scale well. Over the course 

of time, two major Bayesian approximations were introduced: one based 

on dropout and one based on batch normalisation. 



Gal and Ghahramani~\cite{Gal2016} showed that dropout training is a 

Bayesian approximation of a Gaussian process. Subsequently, Gal~\cite{Gal2017} 

showed that dropout training actually corresponds to a general approximate 

Bayesian model. This means every network trained with dropout is an 

approximate Bayesian model. During inference the dropout remains active; 
this form of inference is called Monte Carlo Dropout (MCDO). 
Miller et al.~\cite{Miller2018} built upon the work of Gal and Ghahramani: they 
use MC dropout under open set conditions for object detection. 

In a second paper \cite{Miller2018a}, Miller et al. continued their work and 

compared merging strategies for samplingbased uncertainty techniques in 

object detection. 



Teye et al.~\cite{Teye2018} make the point that most modern networks have 

adopted other regularisation techniques. Ioffe and Szegedy~\cite{Ioffe2015} 
introduced batch normalisation, which has been adopted widely. Teye et al. 

showed how batch normalisation training is similar to dropout and can be 

viewed as an approximate Bayesian inference. Estimates of the model uncertainty 

can be gained with a technique named Monte Carlo Batch Normalisation (MCBN). 

Consequently, this technique can be applied to any network that utilises 

standard batch normalisation. 

Li et al.~\cite{Li2019} investigated the problem of poor performance 

when combining dropout and batch normalisation: dropout shifts the variance 

of a neural unit when switching from train to test, batch normalisation 

does not change the variance. This inconsistency leads to a variance shift which 

can have a larger or smaller impact based on the network used. 



Non-Bayesian approaches have been developed as well. Usually, they compare with 

MC dropout and show better performance. 

Postels et al.~\cite{Postels2019} provided a samplingfree approach for 

uncertainty estimation that does not affect training and approximates the 

sampling at test time. They compared it to MC dropout and found less computational 

overhead with better results. 

Lakshminarayanan et al.~\cite{Lakshminarayanan2017} 

implemented a predictive uncertainty estimation using deep ensembles. 

Compared to MC dropout, it shows better results. 

Geifman et al.~\cite{Geifman2018} 

introduced an uncertainty estimation algorithm for nonBayesian deep 

neural classification that estimates the uncertainty of highly 

confident points using earlier snapshots of the trained model and improves, 

among others, the approach introduced by Lakshminarayanan et al. 

Sensoy et al.~\cite{Sensoy2018} explicitly model prediction uncertainty: 

a Dirichlet distribution is placed over the class probabilities. Consequently, 

the predictions of a neural network are treated as subjective opinions. 



In addition to the aforementioned Bayesian and nonBayesian works, 

there are some Bayesian works that do not quite fit with the rest but 

are important as well. Mukhoti and Gal~\cite{Mukhoti2018} 

contributed metrics to measure uncertainty for semantic 

segmentation. Wu et al.~\cite{Wu2019} introduced two innovations 

that turn variational Bayes into a robust tool for Bayesian 

networks: first, a novel deterministic method to approximate 

moments in neural networks which eliminates gradient variance, and 

second, a hierarchical prior for parameters and an empirical Bayes 

procedure to select prior variances. 



\section{Background for Dropout Sampling} 



\begin{table} 

\centering 

\caption{Notation for background} 

\label{tab:notation} 

\begin{tabular}{ll} 

symbol & meaning \\ 

\hline 

\(\mathbf{W}\) & weights \\ 

\(\mathbf{T}\) & training data \\ 

\(\mathcal{N}(0, I)\) & Gaussian distribution \\ 

\(I\) & independent and identical distribution \\ 

\(p(\mathbf{W} \mid \mathbf{T})\) & probability of weights given 

training data \\ 

\(\mathcal{I}\) & an image \\ 

\(\mathbf{q} = p(y \mid \mathcal{I}, \mathbf{T})\) & probability 

of all classes given image and training data \\ 

\(H(\mathbf{q})\) & entropy over probability vector \\ 

\(\widetilde{\mathbf{W}}\) & weights sampled from 

\(p(\mathbf{W} \mid \mathbf{T})\) \\ 

\(\mathbf{b}\) & bounding box coordinates \\ 

\(\mathbf{s}\) & softmax scores \\ 

\(\overline{\mathbf{s}}\) & averaged softmax scores \\ 

\(D\) & detections of one forward pass \\ 

\(\mathfrak{D}\) & set of all detections over multiple 

forward passes \\ 

\(\mathcal{O}\) & observation \\ 

\(\overline{\mathbf{q}}\) & probability vector for 

observation \\ 

%\(E[something]\) & expected value of something 

%\(\overline{\mathbf{z}}, \mathbf{z}\) & latent space representation \\ 

%\(d_T, d_z\) & discriminators \\ 

%\(e, g\) & encoding and decoding/generating function \\ 

%\(J_g\) & Jacobi matrix for generating function \\ 

%\(\mathcal{T}\) & tangent space \\ 

%\(\mathbf{R}\) & training/test data changed to be on tangent space 

\end{tabular} 

\end{table} 



This section will use the \textbf{notation} defined in table 

\ref{tab:notation} on page \pageref{tab:notation}. 

To understand dropout sampling, it is necessary to explain the 

idea of Bayesian neural networks. They place a prior distribution 

over the network weights, for example a Gaussian prior distribution: 

\(\mathbf{W} \sim \mathcal{N}(0, I)\). In this example 

\(\mathbf{W}\) are the weights and \(I\) symbolises that every 

weight is drawn from an independent and identical distribution. The 

training of the network determines a plausible set of weights by 

evaluating the probability output (posterior) over the weights given 

the training data \(\mathbf{T}\): \(p(\mathbf{W} \mid \mathbf{T})\). 

However, this 

evaluation cannot be performed in any reasonable 

time. Therefore approximation techniques are 

required. In those techniques the posterior is fitted with a 

simple distribution \(q^{*}_{\theta}(\mathbf{W})\). The original 

and intractable problem of averaging over all weights in the network 

is replaced with an optimisation task, where the parameters of the 

simple distribution are optimised over~\cite{Kendall2017}. 



\subsubsection*{Dropout Variational Inference} 



Kendall and Gal~\cite{Kendall2017} showed an approximation for 
classification and recognition tasks. Dropout variational inference 
is a practical approximation technique: dropout layers are added 
in front of every weight layer and are used during test 
time as well to sample from the approximate posterior. Effectively, this 
results in the approximation of the class probability 
\(p(y \mid \mathcal{I}, \mathbf{T})\) by performing multiple forward 
passes through the network and averaging over the obtained softmax 
scores \(\mathbf{s}_i\), given an image \(\mathcal{I}\) and the 
training data \(\mathbf{T}\): 

\begin{equation} \label{eq:dropsampling} 

p(y \mid \mathcal{I}, \mathbf{T}) = \int p(y \mid \mathcal{I}, \mathbf{W}) \cdot p(\mathbf{W} \mid \mathbf{T}) \, d\mathbf{W} \approx \frac{1}{n} \sum_{i=1}^{n}\mathbf{s}_i 

\end{equation} 



With this dropout sampling technique, \(n\) model weights 

\(\widetilde{\mathbf{W}}_i\) are sampled from the posterior 

\(p(\mathbf{W} \mid \mathbf{T})\). The class probability 
\(p(y \mid \mathcal{I}, \mathbf{T})\) is a probability vector 

\(\mathbf{q}\) over all class labels. Finally, the uncertainty 

of the network with respect to the classification is given by 

the entropy \(H(\mathbf{q}) = -\sum_i q_i \cdot \log q_i\). 
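
The following minimal sketch illustrates this procedure in Python; it is not the implementation used later in this thesis, and the callable \texttt{model}, which is assumed to return a softmax vector while keeping its dropout layers active, is a placeholder:

\begin{verbatim}
import numpy as np

def dropout_sampling(model, image, n=10):
    # Average the softmax scores of n stochastic forward passes;
    # this approximates the class probability vector q.
    scores = np.stack([model(image) for _ in range(n)])
    return scores.mean(axis=0)

def entropy(q, eps=1e-12):
    # Shannon entropy H(q) = -sum_i q_i * log(q_i).
    return float(-np.sum(q * np.log(q + eps)))
\end{verbatim}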



\subsubsection*{Dropout Sampling for Object Detection} 



Miller et al.~\cite{Miller2018} apply dropout sampling to 

object detection. In that case \(\mathbf{W}\) represents the 

learned weights of a detection network like SSD~\cite{Liu2016}. 

Every forward pass uses a different network 

\(\widetilde{\mathbf{W}}\) which is approximately sampled from 

\(p(\mathbf{W} \mid \mathbf{T})\). Each forward pass in object 

detection results in a set of detections, each consisting of bounding 

box coordinates \(\mathbf{b}\) and softmax score \(\mathbf{s}\). 

The detections are denoted by Miller et al. as \(D_i = 

\{\mathbf{s}_i,\mathbf{b}_i\}\). The detections of all passes are put 

into a large set \(\mathfrak{D} = \{D_1, ..., D_n\}\). 



All detections with mutual intersection-over-union (IoU) scores 

of \(0.95\) or higher are defined as an observation \(\mathcal{O}_i\). 

Subsequently, the corresponding vector of class probabilities 

\(\overline{\mathbf{q}}_i\) for the observation is calculated by averaging all 

score vectors \(\mathbf{s}_j\) in a particular observation 

\(\mathcal{O}_i\): \(\overline{\mathbf{q}}_i \approx \overline{\mathbf{s}}_i = \frac{1}{n} \sum_{j=1}^{n} \mathbf{s}_j\). The label uncertainty 

of the detector for a particular observation is measured by 

the entropy \(H(\overline{\mathbf{q}}_i)\). 



If \(\overline{\mathbf{q}}_i\), which I call the averaged class probabilities, 
resembles a uniform distribution, the entropy will be high. A uniform 

distribution means that no class is more likely than another, which 

is a perfect example of maximum uncertainty. Conversely, if 

one class has a very high probability the entropy will be low. 



In open set conditions it can be expected that falsely generated 

detections for unknown object classes have a higher label 

uncertainty. A threshold on the entropy \(H(\overline{\mathbf{q}}_i)\) can then 

be used to identify and reject these false positive cases. 
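
Continuing the sketch from above, this rejection step amounts to a simple filter over the observations; the \texttt{entropy} helper and the observation format (averaged probability vector and bounding box) are assumptions carried over from that sketch:

\begin{verbatim}
def reject_uncertain(observations, entropy_threshold):
    # Keep only observations whose label entropy is below the threshold.
    return [(q, box) for (q, box) in observations
            if entropy(q) < entropy_threshold]
\end{verbatim}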



% SSD: \cite{Liu2016} 

% ImageNet: \cite{Deng2009} 

% COCO: \cite{Lin2014} 

% YCB: \cite{Xiang2017} 

% SceneNet: \cite{McCormac2017} 



\chapter{Methods} 



\label{chap:methods} 



This chapter explains the functionality of \gls{vanilla} SSD, Bayesian SSD, and the decoding pipelines. 



\section{Vanilla SSD} 



\begin{figure} 

\centering 

\includegraphics[scale=1.2]{vanillassd} 

\caption{The \gls{vanilla} \gls{SSD} network as defined by Liu et al.~\cite{Liu2016}. VGG16 is the base network, extended with extra feature layers. These predict offsets to anchor boxes with different sizes and aspect ratios. Furthermore, they predict the 

corresponding confidences.} 

\label{fig:vanillassd} 

\end{figure} 



Vanilla \gls{SSD} is based upon the VGG16 network (see figure 

\ref{fig:vanillassd}) and adds extra feature layers. The entire 

image (always of size 300x300) is divided up into anchor boxes. During 

training, each of these boxes is mapped to a ground truth box or 

background. For every anchor box both the offset to 

the object and the class confidences are calculated. The output of the 

SSD network are the predictions with class confidences, offsets to the 

anchor box, anchor box coordinates, and variance. The model loss is a 

weighted sum of localisation and confidence loss. As the network 

has a fixed number of anchor boxes, every forward pass creates the same 

number of detections: 8732 in the case of \gls{SSD} 300x300. 



Notably, the object proposals are made in a single run per image, 
hence the name single shot. 
Other techniques like Faster R-CNN employ region proposals 

and pooling. For more detailed information on SSD, please refer to 

Liu et al.~\cite{Liu2016}. 



\section{Bayesian SSD for Model Uncertainty} 



Networks trained with dropout are a general approximate Bayesian model~\cite{Gal2017}. As such, they can be used for everything a true 

Bayesian model could be used for. The idea is applied to \gls{SSD} in this 

thesis: two dropout layers are added to \gls{vanilla} SSD, after the layers fc6 and fc7 respectively (see figure \ref{fig:bayesianssd}). 



\begin{figure} 

\centering 

\includegraphics[scale=1.2]{bayesianssd} 

\caption{The Bayesian \gls{SSD} network as defined by Miller et al.~\cite{Miller2018}. It adds dropout layers after the fc6 

and fc7 layers.} 

\label{fig:bayesianssd} 

\end{figure} 



Motivation for this is model uncertainty: an uncertain model will 

predict different classes for the same object on the same image across 

multiple forward passes. This uncertainty is measured with entropy: 

every forward pass results in predictions; these are partitioned into 
observations, and subsequently their entropy is calculated. 

A higher entropy indicates a more uniform distribution of confidences 

whereas a lower entropy indicates a larger confidence in one class 

and very low confidences in other classes. 



\subsection{Implementation Details} 



For this thesis, an \gls{SSD} implementation based on Tensorflow~\cite{Abadi2015} and 

Keras\footnote{\url{https://github.com/pierluigiferrari/ssd\_keras}} 

was used. It was modified to support entropy thresholding, 

partitioning of observations, and dropout 

layers in the \gls{SSD} model. Entropy thresholding takes place before 

the per-class confidence threshold is applied. 



The Bayesian variant was not fine-tuned and operates with the same 
weights as \gls{vanilla} \gls{SSD}. 
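
As a rough illustration of this modification (the exact layer names and wiring in the ssd\_keras code base are not reproduced here), a dropout layer that stays active at inference time can be added as follows; note that a keep ratio of 0.9 corresponds to a dropout rate of 0.1:

\begin{verbatim}
import tensorflow as tf

def mc_dropout(features, keep_ratio=0.9):
    # Dropout that is also applied at test time (training=True),
    # as required for Monte Carlo dropout sampling.
    rate = 1.0 - keep_ratio
    return tf.keras.layers.Dropout(rate)(features, training=True)

# In the modified model, such a layer follows fc6 and fc7.
\end{verbatim}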



\section{Decoding Pipelines} 



The raw output of \gls{SSD} is not very useful: it contains thousands of 

boxes per image. Among them are many boxes with very low confidences 
or background classifications; those need to be filtered out to 
get any meaningful output from the network. This process of 
filtering is called decoding and is presented for the three variants 
of \gls{SSD} used in this thesis. 



\subsection{Vanilla SSD} 



Liu et al.~\cite{Liu2016} used Caffe for their original SSD 

implementation. The decoding process consists largely of two 
phases: decoding and filtering. Decoding transforms the relative 

coordinates predicted by \gls{SSD} into absolute coordinates. At this point 

the shape of the output per batch is \((batch\_size, \#nr\_boxes, \#nr\_classes + 12)\). The last twelve elements are split into 

the four bounding box offsets, the four anchor box coordinates, and 

the four variances; there are 8732 boxes. 



\glslocalreset{NMS} 

Filtering of these boxes is first done per class: 

only the class id, confidence of that class, and the bounding box 

coordinates are kept per box. The filtering consists of 

confidence thresholding and a subsequent \gls{NMS}. 

All boxes that pass \gls{NMS} are added to a 
per-image maxima list. One box could make the confidence threshold 

for multiple classes and, hence, be present multiple times in the 

maxima list for the image. Lastly, a total of \(k\) boxes with the 

highest confidences is kept per image across all classes. The 

original implementation uses a confidence threshold of \(0.01\), an 

IOU threshold for \gls{NMS} of \(0.45\), and a top \(k\) 

value of 200. 
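
The listing below sketches this per-class filtering; it is a simplified illustration rather than the original Caffe or ssd\_keras code, and it assumes already decoded absolute box coordinates with class 0 being the background:

\begin{verbatim}
import numpy as np

def iou(box, boxes):
    # IoU between one box and an array of boxes, (xmin, ymin, xmax, ymax).
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_box = (box[2] - box[0]) * (box[3] - box[1])
    area_boxes = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_box + area_boxes - inter)

def nms(boxes, scores, iou_threshold=0.45):
    # Greedy non-maximum suppression; returns indices of the kept boxes.
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(best)
        order = rest[iou(boxes[best], boxes[rest]) <= iou_threshold]
    return keep

def filter_detections(boxes, scores, conf=0.01, iou_threshold=0.45, top_k=200):
    # Per-class confidence threshold and NMS, then a global top-k selection.
    # boxes: (n_boxes, 4), scores: (n_boxes, n_classes), class 0 = background.
    maxima = []
    for c in range(1, scores.shape[1]):
        mask = scores[:, c] > conf
        cls_boxes, cls_scores = boxes[mask], scores[mask, c]
        for i in nms(cls_boxes, cls_scores, iou_threshold):
            maxima.append((c, cls_scores[i], cls_boxes[i]))
    maxima.sort(key=lambda det: det[1], reverse=True)
    return maxima[:top_k]
\end{verbatim}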



The \gls{vanilla} SSD 
per-class confidence threshold and \gls{NMS} have one 
weakness: even if \gls{SSD} correctly predicts all objects as the 
background class with high confidence, the per-class confidence 
threshold of 0.01 will consider predictions with very low 
confidences; as background boxes are not present in the maxima 
collection, many low-confidence boxes can be. Furthermore, the 

same detection can be present in the maxima collection for multiple 

classes. In this case, the entropy threshold would let the detection 

pass because the background class has high confidence. Subsequently, 

a low per-class confidence threshold does not restrict the boxes 

either. Therefore, the decoding output is worse than the actual 

predictions of the network. 

Bayesian \gls{SSD} cannot help in this situation because the network 

is not actually uncertain. 



SSD was developed with closed set conditions in mind. A well-trained 
network in such a situation does not have many high-confidence 
background detections. In an open set environment, background 

detections are the correct behaviour for unknown classes. 

In order to get useful detections out of the decoding, a higher 

confidence threshold is required. 



\subsection{Vanilla SSD with Entropy Thresholding} 



Vanilla \gls{SSD} with entropy thresholding adds an additional component 

to the filtering already done for \gls{vanilla} SSD. The entropy is 

calculated from all \(\#nr\_classes\) softmax scores in a prediction. 

Only predictions with a low enough entropy pass the entropy 

threshold and move on to the aforementioned per-class filtering. 

This excludes very uniform predictions but cannot identify 

false positive or false negative cases with high confidence values. 



\subsection{Bayesian SSD with Entropy Thresholding} 



Bayesian \gls{SSD} has the speciality of multiple forward passes. Based 

on the information in the paper, the detections of all forward passes 

are grouped per image but not by forward pass. This leads 

to the following shape of the network output after all 

forward passes: \((batch\_size, \#nr\_boxes \, \cdot \, \#nr\_forward\_passes, \#nr\_classes + 12)\). The size of the output 

increases linearly with more forward passes. 



These detections have to be decoded first. Afterwards, 
all detections which do not pass a confidence threshold 
for the class with the highest prediction probability are thrown away. 

Additionally, all detections with a background prediction of 0.8 or higher are discarded. 

The remaining detections are partitioned into observations to 

further reduce the size of the output, and 

to identify uncertainty. This is accomplished by calculating the 

mutual IOU score of every detection with all other detections. Detections 

with a mutual IOU score of 0.95 or higher are partitioned into an 

observation. Next, the softmax scores and bounding box coordinates of 

all detections in an observation are averaged. 

There can be a different number of observations for every image, which 
destroys homogeneity and prevents batch-wise calculation of the 
results. The shape of the results per image is \((\#nr\_observations,\#nr\_classes + 4)\). 
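
A greedy sketch of this partitioning step is shown below; the pairwise IoU helper and the greedy grouping are simplifying assumptions and not the exact implementation:

\begin{verbatim}
import numpy as np

def box_iou(a, b):
    # IoU of two boxes given as (xmin, ymin, xmax, ymax).
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def partition_observations(boxes, scores, iou_threshold=0.95):
    # Greedily group detections with mutual IoU >= threshold into
    # observations; scores and boxes of each group are averaged.
    unassigned = list(range(len(boxes)))
    observations = []
    while unassigned:
        seed = unassigned.pop(0)
        members = [seed] + [j for j in unassigned
                            if box_iou(boxes[seed], boxes[j]) >= iou_threshold]
        unassigned = [j for j in unassigned if j not in members]
        observations.append((scores[members].mean(axis=0),
                             boxes[members].mean(axis=0)))
    return observations
\end{verbatim}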



Entropy is measured in the next step. All observations with too high 

entropy are discarded. Entropy thresholding in combination with 

dropout sampling should improve identification of false positives of 

unknown classes. This is due to multiple forward passes and 

the assumption that uncertainty in some objects will result 

in different classifications in multiple forward passes. These 

varying classifications are averaged into multiple lower confidence 

values which should increase the entropy and, hence, flag an 

observation for removal. 



The remainder of the filtering follows the \gls{vanilla} \gls{SSD} procedure: per-class 

confidence threshold, \gls{NMS}, and a top \(k\) selection 

at the end. 



\chapter{Experimental Setup and Results} 



\label{chap:experimentsresults} 



This chapter explains the data sets used, how the experiments were 

set up, and what the results are. 



\section{Data Sets} 



This thesis uses the MS COCO~\cite{Lin2014} data set. It contains 

80 classes, ranging from airplanes to toothbrushes. 
The images are real-world photographs taken by camera, and ground truth 
is provided for all of them. The data set supports object detection, 

keypoint detection, and panoptic segmentation (scene segmentation). 



The data of any data set has to be prepared for use in a neural 

network. Typical problems of data sets include, for example, 

outliers and invalid bounding boxes. Before a data set can be used, 

these problems need to be resolved. 



For the MS COCO data set, all annotations were checked for 

impossible values: bounding box height or width lower than zero, 

\(x_{min}\) and \(y_{min}\) bounding box coordinates lower than zero, 

\(x_{max}\) and \(y_{max}\) coordinates lower than or equal to zero, \(x_{min}\) greater than \(x_{max}\), 

\(y_{min}\) greater than \(y_{max}\), image width lower than \(x_{max}\), 

and image height lower than \(y_{max}\). In the last two cases the 

bounding box width and height were set to (image width \(- \, x_{min}\)) and 
(image height \(- \, y_{min}\)) respectively; 
in the other cases the annotation was skipped. 
If the resulting bounding box width or height was 
lower than or equal to zero, the annotation was skipped as well. 
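
A compact sketch of these sanity checks is given below; it is an illustration of the rules just described, not the actual data generator code:

\begin{verbatim}
def clean_annotation(xmin, ymin, xmax, ymax, img_width, img_height):
    # Returns a corrected (xmin, ymin, width, height) tuple,
    # or None if the annotation has to be skipped.
    if xmin < 0 or ymin < 0 or xmax <= 0 or ymax <= 0:
        return None
    if xmin > xmax or ymin > ymax:
        return None
    width, height = xmax - xmin, ymax - ymin
    if img_width < xmax:            # box reaches past the right image edge
        width = img_width - xmin
    if img_height < ymax:           # box reaches past the bottom image edge
        height = img_height - ymin
    if width <= 0 or height <= 0:
        return None
    return xmin, ymin, width, height
\end{verbatim}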



SSD accepts 300x300 input images; the MS COCO data set images were 

resized to this resolution; the aspect ratio was not kept in the 

process. MS COCO contains landscape and portrait images with (640x480) 

and (480x640) as the resolution. This led to a uniform distortion of the 
landscape and portrait images respectively. Furthermore, 

the colour channels were swapped from RGB to BGR in order to 

comply with the \gls{SSD} implementation. The BGR requirement stems from 

the usage of OpenCV in SSD: the internal channel order for 
OpenCV is BGR. 



For this thesis, weights pretrained on the trainval35k subset of the 

COCO data set were used. These weights were created with closed set 

conditions in mind, therefore, they had to be subsampled to create 

an open set condition. To this end, the weights for the last 

20 classes were thrown away, making them effectively unknown. 



All images of the minival2014 data set were used but only ground truth 

belonging to the first 60 classes was loaded. The remaining 20 

classes were considered "unknown" and no ground truth bounding 

boxes for them were provided during the inference phase. 

A total of 31,991 detections remains after this exclusion. Of these 

detections, a staggering 10,988 or 34.3\% belong to the persons 
class, followed by cars with 1,932 or 6\%, chairs with 1,791 or 5.6\%, 
and bottles with 1,021 or 3.2\%. Together, these four classes make up 
around 49.1\% of the ground truth detections. This shows a huge imbalance 

between the classes in the data set. 



\section{Experimental Setup} 



This section explains the setup for the different conducted 

experiments. Each comparison investigates one particular question. 



As a baseline, \gls{vanilla} \gls{SSD} with a confidence threshold of 0.01 
and an \gls{NMS} IOU threshold of 0.45 was used. 

Due to the low number of objects per image in the COCO data set, 

the top \(k\) value was set to 20. Vanilla \gls{SSD} with entropy 

thresholding uses the same parameters; compared to \gls{vanilla} SSD 

without entropy thresholding, it showcases the relevance of 

entropy thresholding for \gls{vanilla} SSD. 



Vanilla \gls{SSD} was also run with 0.2 confidence threshold and compared 

to \gls{vanilla} \gls{SSD} with 0.01 confidence threshold; this comparison 

investigates the effect of the per-class confidence threshold 

on the object detection performance. 



Bayesian \gls{SSD} was run with 0.2 confidence threshold and compared 

to \gls{vanilla} \gls{SSD} with 0.2 confidence threshold. Coupled with the 

entropy threshold, this comparison reveals how uncertain the network 

is. If it is very certain, the dropout sampling should have no 

significant impact on the result. Furthermore, in two cases the 

dropout was turned off to isolate the impact of \gls{NMS} 

on the result. 



Both \gls{vanilla} \gls{SSD} with entropy thresholding and Bayesian \gls{SSD} with 

entropy thresholding, were tested for entropy thresholds ranging 

from 0.1 to 2.4 inclusive as specified in Miller et al.~\cite{Miller2018}. 



\section{Results} 



Results in this section are presented both for micro and macro averaging. 

In macro averaging, for example, the precision values of each class are added up 

and then divided by the number of classes. Conversely, for micro averaging the 

precision is calculated across all classes directly. Both methods have 

a specific impact: macro averaging weighs every class the same while micro 

averaging weighs every detection the same. They will be largely identical 

when every class is balanced and has about the same number of detections. 

However, in case of a class imbalance the macro averaging 

favours classes with few detections whereas micro averaging benefits classes 

with many detections. 
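
For precision, the two averaging schemes can be summarised by the following sketch (recall is computed analogously with false negatives instead of false positives):

\begin{verbatim}
def micro_precision(tp_per_class, fp_per_class):
    # Pool true and false positives over all classes, then divide once.
    tp, fp = sum(tp_per_class), sum(fp_per_class)
    return tp / (tp + fp)

def macro_precision(tp_per_class, fp_per_class):
    # Compute the precision per class, then average over the classes.
    per_class = [tp / (tp + fp) if (tp + fp) > 0 else 0.0
                 for tp, fp in zip(tp_per_class, fp_per_class)]
    return sum(per_class) / len(per_class)
\end{verbatim}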



This section only presents the results. Interpretation and discussion is found 

in the next chapter. 



\subsection{Micro Averaging} 

\begin{table}[ht] 

\begin{tabular}{rcccc} 

\hline 

Forward & max & abs OSE & Recall & Precision\\ 

Passes & \(F_1\) Score & \multicolumn{3}{c}{at max \(F_1\) point} \\ 

\hline 

\gls{vanilla} \gls{SSD} - 0.01 conf & 0.255 & 3176 & 0.214 & 0.318 \\ 
\gls{vanilla} \gls{SSD} - 0.2 conf & \textbf{0.376} & 2939 & \textbf{0.382} & 0.372 \\ 
\gls{SSD} with Entropy test - 0.01 conf & 0.255 & 3168 & 0.214 & 0.318 \\ 
% entropy thresh: 2.4 for \gls{vanilla} \gls{SSD} is best 
\hline 
Bay. \gls{SSD} - no DO - 0.2 conf - no \gls{NMS} \; 10 & 0.209 & 2709 & 0.300 & 0.161 \\ 
no dropout - 0.2 conf - \gls{NMS} \; 10 & 0.371 & \textbf{2335} & 0.365 & \textbf{0.378} \\ 
0.9 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.359 & 2584 & 0.363 & 0.357 \\ 
0.5 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.325 & 2759 & 0.342 & 0.311 \\ 

% entropy thresh: 1.2 for Bayesian  2 is best, 0.4 for 3 

% 0.5 for Bayesian  6, 1.4 for 7, 1.4 for 8, 1.3 for 9 

\hline 

\end{tabular} 

\caption{Rounded results for micro averaging. \gls{SSD} with Entropy test and Bayesian \gls{SSD} are represented with 

their best performing entropy threshold with respect to \(F_1\) score. Vanilla \gls{SSD} with Entropy test performed best with an 

entropy threshold of 2.4, Bayesian \gls{SSD} without \gls{NMS} performed best for 1.0, 

and Bayesian \gls{SSD} with \gls{NMS} performed best for 1.4 as entropy 

threshold. 

Bayesian \gls{SSD} with dropout enabled and 0.9 keep ratio performed 

best for 1.4 as entropy threshold, the run with 0.5 keep ratio performed 

best for 1.3 as threshold.} 

\label{tab:resultsmicro} 

\end{table} 



\begin{figure}[ht] 

\begin{minipage}[t]{0.48\textwidth} 

\includegraphics[width=\textwidth]{osef1allmicro} 

\caption{Micro averaged \(F_1\) score versus open set error for each variant. Perfect performance is an \(F_1\) score of 1 and an absolute OSE of 0.} 

\label{fig:osef1micro} 

\end{minipage}% 

\hfill 

\begin{minipage}[t]{0.48\textwidth} 

\includegraphics[width=\textwidth]{precisionrecallallmicro} 

\caption{Micro averaged precisionrecall curves for each variant tested.} 

\label{fig:precisionrecallmicro} 

\end{minipage} 

\end{figure} 



Vanilla \gls{SSD} with a per-class confidence threshold of 0.2 performs best (see 

table \ref{tab:resultsmicro}) with respect to the maximum \(F_1\) score 

(0.376) and recall at the maximum \(F_1\) point (0.382). In comparison, neither 

the \gls{vanilla} \gls{SSD} variant with a confidence threshold of 0.01 nor the \gls{SSD} with 

an entropy test can outperform the 0.2 variant. Among the \gls{vanilla} \gls{SSD} variants, 

the 0.2 variant also has the lowest number of open set errors (2939) and the 

highest precision (0.372). 



The comparison of the \gls{vanilla} \gls{SSD} variants with a confidence threshold of 0.01 

shows no significant impact of an entropy test. Only the open set errors 
are lower, but in an insignificant way. The rest of the performance metrics are 
identical after rounding. 



Bayesian \gls{SSD} with disabled dropout and without \gls{NMS} 

has the worst performance of all tested variants (\gls{vanilla} and Bayesian) 

with respect to \(F_1\) score (0.209) and precision (0.161). The precision is not only the worst but also significantly lower compared to all other variants. 

In comparison to all variants with 0.2 confidence threshold, it has the worst recall (0.300) as well. 



With 2335 open set errors, the Bayesian \gls{SSD} variant with disabled dropout and 

enabled \gls{NMS} offers the best performance with respect 

to open set errors. It also has the best precision (0.378) of all tested 

variants. Furthermore, it provides the best performance among all variants 

with multiple forward passes. 



Dropout decreases the performance of the network; this can be seen 

in the lower \(F_1\) scores, higher open set errors, and lower precision 

values. Both dropout variants have worse recall (0.363 and 0.342) than 

the variant with disabled dropout. 

However, all variants with multiple forward passes have lower open set 

errors than all \gls{vanilla} \gls{SSD} variants. 



The relation of \(F_1\) score to absolute open set error can be observed 

in figure \ref{fig:osef1micro}. Precisionrecall curves for all variants 

can be seen in figure \ref{fig:precisionrecallmicro}. Both \gls{vanilla} SSD 

variants with 0.01 confidence threshold reach much higher open set errors 

and a higher recall. This behaviour is expected as more and worse predictions 

are included. 

All plotted variants show a similar behaviour that is in line with previously 

reported figures, such as the ones in Miller et al.~\cite{Miller2018}. 



\subsection{Macro Averaging} 



\begin{table}[t] 

\begin{tabular}{rcccc} 

\hline 

Forward & max & abs OSE & Recall & Precision\\ 

Passes & \(F_1\) Score & \multicolumn{3}{c}{at max \(F_1\) point} \\ 

\hline 

\gls{vanilla} \gls{SSD} - 0.01 conf & 0.370 & 1426 & 0.328 & 0.424 \\ 
\gls{vanilla} \gls{SSD} - 0.2 conf & \textbf{0.375} & 1218 & \textbf{0.338} & 0.424 \\ 
\gls{SSD} with Entropy test - 0.01 conf & 0.370 & 1373 & 0.329 & \textbf{0.425} \\ 
% entropy thresh: 1.7 for \gls{vanilla} \gls{SSD} is best 
\hline 
Bay. \gls{SSD} - no DO - 0.2 conf - no \gls{NMS} \; 10 & 0.226 & \textbf{809} & 0.229 & 0.224 \\ 
no dropout - 0.2 conf - \gls{NMS} \; 10 & 0.363 & 1057 & 0.321 & 0.420 \\ 
0.9 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.355 & 1137 & 0.320 & 0.399 \\ 
0.5 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.322 & 1264 & 0.307 & 0.340 \\ 

% entropy thresh: 1.2 for Bayesian  2 is best, 0.4 for 3 

% entropy thresh: 0.7 for Bayesian  6 is best, 1.5 for 7 

% 1.7 for 8, 2.0 for 9 

\hline 

\end{tabular} 

\caption{Rounded results for macro averaging. \gls{SSD} with Entropy test and Bayesian \gls{SSD} are represented with 

their best performing entropy threshold with respect to \(F_1\) score. Vanilla \gls{SSD} with Entropy test performed best with an 

entropy threshold of 1.7, Bayesian \gls{SSD} without \gls{NMS} performed best for 1.5, 

and Bayesian \gls{SSD} with \gls{NMS} performed best for 1.5 as entropy 

threshold. Bayesian \gls{SSD} with dropout enabled and 0.9 keep ratio performed 

best for 1.7 as entropy threshold, the run with 0.5 keep ratio performed 

best for 2.0 as threshold.} 

\label{tab:resultsmacro} 

\end{table} 



\begin{figure}[ht] 

\begin{minipage}[t]{0.48\textwidth} 

\includegraphics[width=\textwidth]{osef1allmacro} 

\caption{Macro averaged \(F_1\) score versus open set error for each variant. Perfect performance is an \(F_1\) score of 1 and an absolute OSE of 0.} 

\label{fig:osef1macro} 

\end{minipage}% 

\hfill 

\begin{minipage}[t]{0.48\textwidth} 

\includegraphics[width=\textwidth]{precisionrecallallmacro} 

\caption{Macro averaged precisionrecall curves for each variant tested.} 

\label{fig:precisionrecallmacro} 

\end{minipage} 

\end{figure} 



Vanilla \gls{SSD} with a per-class confidence threshold of 0.2 performs best (see 

table \ref{tab:resultsmacro}) with respect to the maximum \(F_1\) score 

(0.375) and recall at the maximum \(F_1\) point (0.338). In comparison, the SSD 

with an entropy test slightly outperforms the 0.2 variant with respect to 

precision (0.425). Additionally, this is the best precision overall. Among 

the \gls{vanilla} \gls{SSD} variants, the 0.2 variant also has the lowest 

number of open set errors (1218). 



The comparison of the \gls{vanilla} \gls{SSD} variants with a confidence threshold of 0.01 

shows no significant impact of an entropy test. Only the open set errors 

are lower, but in an insignificant way. The rest of the performance metrics are 
almost identical after rounding. 



The results for Bayesian \gls{SSD} show a significant impact of \gls{NMS} or the lack thereof: maximum \(F_1\) score of 0.363 (with NMS) to 0.226 

(without NMS). Dropout was disabled in both cases, making them effectively a 

\gls{vanilla} \gls{SSD} run with multiple forward passes. 



With 809 open set errors, the Bayesian \gls{SSD} variant with disabled dropout and 

without \gls{NMS} offers the best performance with respect 

to open set errors. The variant without dropout and enabled \gls{NMS} has the best \(F_1\) score (0.363), the best 

precision (0.420) and the best recall (0.321) of all Bayesian variants. 



Dropout decreases the performance of the network; this can be seen 

in the lower \(F_1\) scores, higher open set errors, and lower precision and 

recall values. However, all variants with multiple forward passes have lower open set errors than all \gls{vanilla} SSD 

variants. 



The relation of \(F_1\) score to absolute open set error can be observed 

in figure \ref{fig:osef1macro}. Precisionrecall curves for all variants 

can be seen in figure \ref{fig:precisionrecallmacro}. Both \gls{vanilla} SSD 

variants with 0.01 confidence threshold reach much higher open set errors 

and a higher recall. This behaviour is expected as more and worse predictions 

are included. 

All plotted variants show a similar behaviour that is in line with previously 

reported figures, such as the ones in Miller et al.~\cite{Miller2018}. 



\subsection{Class-specific Results} 



As mentioned before, the data set is imbalanced with respect to its 

classes: four classes make up roughly 50\% of all ground truth 

detections. Therefore, it is interesting to see the performance 

of the tested variants with respect to these classes: persons, cars, 

chairs, and bottles. Additionally, the results of the giraffe class are 

presented as these are exceptionally good, although the class makes up 

only 0.7\% of the ground truth. With this share, it is below 

the average of roughly 0.89\% for each of the 56 classes that make up the 

second half of the ground truth. 



In some cases, multiple variants have seemingly the same performance 

but only one or some of them are marked bold. This is due to 
differences prior to rounding. If two or more variants are marked bold, 
they had the exact same performance before rounding. 



\begin{table}[tbp] 

\begin{tabular}{rccc} 

\hline 

Forward & max & Recall & Precision\\ 

Passes & \(F_1\) Score & \multicolumn{2}{c}{at max \(F_1\) point} \\ 

\hline 

\gls{vanilla} \gls{SSD} - 0.01 conf & 0.460 & \textbf{0.405} & 0.532 \\ 
\gls{vanilla} \gls{SSD} - 0.2 conf & \textbf{0.460} & \textbf{0.405} & \textbf{0.533} \\ 
\gls{SSD} with Entropy test - 0.01 conf & 0.460 & 0.405 & 0.532 \\ 
% entropy thresh: 1.7 for \gls{vanilla} \gls{SSD} is best 
\hline 
Bay. \gls{SSD} - no DO - 0.2 conf - no \gls{NMS} \; 10 & 0.272 & 0.292 & 0.256 \\ 
no dropout - 0.2 conf - \gls{NMS} \; 10 & 0.451 & 0.403 & 0.514 \\ 
0.9 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.447 & 0.401 & 0.505 \\ 
0.5 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.410 & 0.368 & 0.465 \\ 

% entropy thresh: 1.2 for Bayesian  2 is best, 0.4 for 3 

% entropy thresh: 0.7 for Bayesian  6 is best, 1.5 for 7 

% 1.7 for 8, 2.0 for 9 

\hline 

\end{tabular} 

\caption{Rounded results for persons class. \gls{SSD} with Entropy test and Bayesian \gls{SSD} are represented with 

their best performing macro averaging entropy threshold with respect to \(F_1\) score.} 

\label{tab:resultspersons} 

\end{table} 



It is clearly visible that the overall trend continues in the individual 

classes (see tables \ref{tab:resultspersons}, \ref{tab:resultscars}, \ref{tab:resultschairs}, \ref{tab:resultsbottles}, and \ref{tab:resultsgiraffes}). However, the two \gls{vanilla} \gls{SSD} variants with only 0.01 confidence 

threshold perform better than in the averaged results presented earlier. 

Only in the chairs class does a Bayesian \gls{SSD} variant perform better (in 
precision) than any of the \gls{vanilla} \gls{SSD} variants. Moreover, there are 

multiple classes where two or all of the \gls{vanilla} \gls{SSD} variants perform 

equally well. When compared with the macro averaged results, 

giraffes and persons perform better across the board. Cars have a higher 

precision than average but lower recall values for all but the Bayesian 

SSD variant without \gls{NMS} and dropout. Chairs and bottles perform 

worse than average. 



\begin{table}[tbp] 

\begin{tabular}{rccc} 

\hline 

Forward & max & Recall & Precision\\ 

Passes & \(F_1\) Score & \multicolumn{2}{c}{at max \(F_1\) point} \\ 

\hline 

\gls{vanilla} \gls{SSD} - 0.01 conf & 0.364 & \textbf{0.305} & 0.452 \\ 
\gls{vanilla} \gls{SSD} - 0.2 conf & 0.363 & 0.294 & \textbf{0.476} \\ 
\gls{SSD} with Entropy test - 0.01 conf & \textbf{0.364} & \textbf{0.305} & 0.453 \\ 
% entropy thresh: 1.7 for \gls{vanilla} \gls{SSD} is best 
\hline 
Bay. \gls{SSD} - no DO - 0.2 conf - no \gls{NMS} \; 10 & 0.236 & 0.244 & 0.229 \\ 
no dropout - 0.2 conf - \gls{NMS} \; 10 & 0.336 & 0.266 & 0.460 \\ 
0.9 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.332 & 0.262 & 0.454 \\ 
0.5 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.309 & 0.264 & 0.374 \\ 

% entropy thresh: 1.2 for Bayesian  2 is best, 0.4 for 3 

% entropy thresh: 0.7 for Bayesian  6 is best, 1.5 for 7 

% 1.7 for 8, 2.0 for 9 

\hline 

\end{tabular} 

\caption{Rounded results for cars class. \gls{SSD} with Entropy test and Bayesian \gls{SSD} are represented with 

their best performing macro averaging entropy threshold with respect to \(F_1\) score. } 

\label{tab:resultscars} 

\end{table} 



\begin{table}[tbp] 

\begin{tabular}{rccc} 

\hline 

Forward & max & Recall & Precision\\ 

Passes & \(F_1\) Score & \multicolumn{2}{c}{at max \(F_1\) point} \\ 

\hline 

\gls{vanilla} \gls{SSD} - 0.01 conf & 0.287 & \textbf{0.251} & 0.335 \\ 
\gls{vanilla} \gls{SSD} - 0.2 conf & 0.283 & 0.242 & 0.341 \\ 
\gls{SSD} with Entropy test - 0.01 conf & \textbf{0.288} & \textbf{0.251} & 0.338 \\ 
% entropy thresh: 1.7 for \gls{vanilla} \gls{SSD} is best 
\hline 
Bay. \gls{SSD} - no DO - 0.2 conf - no \gls{NMS} \; 10 & 0.172 & 0.168 & 0.178 \\ 
no dropout - 0.2 conf - \gls{NMS} \; 10 & 0.280 & 0.229 & \textbf{0.360} \\ 
0.9 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.274 & 0.228 & 0.343 \\ 
0.5 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.240 & 0.220 & 0.265 \\ 

% entropy thresh: 1.2 for Bayesian  2 is best, 0.4 for 3 

% entropy thresh: 0.7 for Bayesian  6 is best, 1.5 for 7 

% 1.7 for 8, 2.0 for 9 

\hline 

\end{tabular} 

\caption{Rounded results for chairs class. \gls{SSD} with Entropy test and Bayesian \gls{SSD} are represented with 

their best performing macro averaging entropy threshold with respect to \(F_1\) score. } 

\label{tab:resultschairs} 

\end{table} 





\begin{table}[tbp] 

\begin{tabular}{rccc} 

\hline 

Forward & max & Recall & Precision\\ 

Passes & \(F_1\) Score & \multicolumn{2}{c}{at max \(F_1\) point} \\ 

\hline 

\gls{vanilla} \gls{SSD} - 0.01 conf & 0.233 & \textbf{0.175} & 0.348 \\ 
\gls{vanilla} \gls{SSD} - 0.2 conf & 0.231 & 0.173 & \textbf{0.350} \\ 
\gls{SSD} with Entropy test - 0.01 conf & \textbf{0.233} & \textbf{0.175} & 0.350 \\ 
% entropy thresh: 1.7 for \gls{vanilla} \gls{SSD} is best 
\hline 
Bay. \gls{SSD} - no DO - 0.2 conf - no \gls{NMS} \; 10 & 0.160 & 0.140 & 0.188 \\ 
no dropout - 0.2 conf - \gls{NMS} \; 10 & 0.224 & 0.170 & 0.328 \\ 
0.9 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.220 & 0.170 & 0.311 \\ 
0.5 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.202 & 0.172 & 0.245 \\ 

% entropy thresh: 1.2 for Bayesian  2 is best, 0.4 for 3 

% entropy thresh: 0.7 for Bayesian  6 is best, 1.5 for 7 

% 1.7 for 8, 2.0 for 9 

\hline 

\end{tabular} 

\caption{Rounded results for bottles class. \gls{SSD} with Entropy test and Bayesian \gls{SSD} are represented with 

their best performing macro averaging entropy threshold with respect to \(F_1\) score. } 

\label{tab:resultsbottles} 

\end{table} 



\begin{table}[tbp] 

\begin{tabular}{rccc} 

\hline 

Forward & max & Recall & Precision\\ 

Passes & \(F_1\) Score & \multicolumn{2}{c}{at max \(F_1\) point} \\ 

\hline 

\gls{vanilla} \gls{SSD} - 0.01 conf & \textbf{0.650} & \textbf{0.647} & \textbf{0.655} \\ 
\gls{vanilla} \gls{SSD} - 0.2 conf & \textbf{0.650} & \textbf{0.647} & \textbf{0.655} \\ 
\gls{SSD} with Entropy test - 0.01 conf & \textbf{0.650} & \textbf{0.647} & \textbf{0.655} \\ 
% entropy thresh: 1.7 for \gls{vanilla} \gls{SSD} is best 
\hline 
Bay. \gls{SSD} - no DO - 0.2 conf - no \gls{NMS} \; 10 & 0.415 & 0.414 & 0.417 \\ 
no dropout - 0.2 conf - \gls{NMS} \; 10 & 0.647 & 0.642 & 0.654 \\ 
0.9 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.637 & 0.634 & 0.642 \\ 
0.5 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.586 & 0.578 & 0.596 \\ 

% entropy thresh: 1.2 for Bayesian  2 is best, 0.4 for 3 

% entropy thresh: 0.7 for Bayesian  6 is best, 1.5 for 7 

% 1.7 for 8, 2.0 for 9 

\hline 

\end{tabular} 

\caption{Rounded results for giraffe class. \gls{SSD} with Entropy test and Bayesian \gls{SSD} are represented with 

their best performing macro averaging entropy threshold with respect to \(F_1\) score. } 

\label{tab:resultsgiraffes} 

\end{table} 



\subsection{Qualitative Analysis} 



% TODO: expand 



This subsection compares \gls{vanilla} SSD 

with Bayesian \gls{SSD} with respect to specific images that illustrate 

similarities and differences between both approaches. For this 

comparison, a 0.2 confidence threshold is applied. Furthermore, Bayesian 

SSD uses \gls{NMS} and dropout with 0.9 keep ratio. 



\begin{figure} 

\begin{minipage}[t]{0.48\textwidth} 

\includegraphics[width=\textwidth]{COCO_val2014_000000336587_bboxes_vanilla} 

\caption{Image with stop sign and truck at right edge. Ground truth in blue, predictions in red, and rounded to three digits. Predictions are from \gls{vanilla} SSD.} 

\label{fig:stopsigntruckvanilla} 

\end{minipage}% 

\hfill 

\begin{minipage}[t]{0.48\textwidth} 

\includegraphics[width=\textwidth]{COCO_val2014_000000336587_bboxes_bayesian} 

\caption{Image with stop sign and truck at right edge. Ground truth in blue, predictions in red, and rounded to three digits. Predictions are from Bayesian \gls{SSD} with 0.9 keep ratio.} 

\label{fig:stopsigntruckbayesian} 

\end{minipage} 

\end{figure} 



The ground truth only contains a stop sign and a truck. The differences between \gls{vanilla} \gls{SSD} and Bayesian \gls{SSD} are barely visible 
(see figures \ref{fig:stopsigntruckvanilla} and \ref{fig:stopsigntruckbayesian}): the truck is detected by neither \gls{vanilla} nor Bayesian SSD; instead, both detected a potted plant and a traffic light. The stop sign is detected by both variants. 

This behaviour implies problems with detecting objects at the edge 

that overwhelmingly lie outside the image frame. Furthermore, the predictions are usually identical. 



\begin{figure} 

\begin{minipage}[t]{0.48\textwidth} 

\includegraphics[width=\textwidth]{COCO_val2014_000000403817_bboxes_vanilla} 

\caption{Image with a cat and laptop/TV. Ground truth in blue, predictions in red, and rounded to three digits. Predictions are from \gls{vanilla} SSD.} 

\label{fig:catlaptopvanilla} 

\end{minipage}% 

\hfill 

\begin{minipage}[t]{0.48\textwidth} 

\includegraphics[width=\textwidth]{COCO_val2014_000000403817_bboxes_bayesian} 

\caption{Image with a cat and laptop/TV. Ground truth in blue, predictions in red, and rounded to three digits. Predictions are from Bayesian \gls{SSD} with 0.9 keep ratio.} 

\label{fig:catlaptopbayesian} 

\end{minipage} 

\end{figure} 



Another example (see figures \ref{fig:catlaptopvanilla} and \ref{fig:catlaptopbayesian}) is a cat with a laptop/TV in the background on the right 

side. Both variants detect a cat but the \gls{vanilla} variant detects a dog as well. The laptop and TV are not detected but this is expected since 

these classes were not trained. 



\chapter{Discussion and Outlook} 



\label{chap:discussion} 



First, the results will be discussed; then possible future research and open 

questions will be addressed. 



\section*{Discussion} 



The results clearly do not support the hypothesis: \textit{Dropout sampling delivers better object detection performance under open set conditions compared to object detection without it}. With the exception of open set errors, there 

is no area where dropout sampling performs better than \gls{vanilla} SSD. In the 

remainder of this section, the individual results will be interpreted.



\subsection*{Impact of Averaging} 



Micro and macro averaging produce largely similar results. Notably, micro
averaging shows a significant performance increase towards the end
of the list of predictions. This is signalled by the near-horizontal movement
of the plot in both the \(F_1\) versus absolute open set error graph (see figure \ref{fig:osef1micro}) and
the precision-recall curve (see figure \ref{fig:precisionrecallmicro}).



This behaviour is caused by a large imbalance of detections between
the classes. For \gls{vanilla} \gls{SSD} with a 0.2 confidence threshold, there are
a total of 36,863 detections after \gls{NMS} and top \(k\).
The person class contributes 14,640 detections or around 40\% to that number. Another strong class is the car class with 2,252 detections or around
6\%. In third place is the chair class with 1,352 detections or around 4\%. This means that these three classes together have roughly as many detections
as the remaining 57 classes combined.



In macro averaging, the cumulative precision and recall values are
calculated per class and then averaged across all classes. Smaller
classes quickly reach high recall values as their total number of
ground truth objects is small as well. The last recall and precision value
of the smaller classes is repeated to achieve homogeneity with the largest
class. As a consequence, the average recall is quite high early on. Later on, only the values of the largest class still change, which has only
a small impact on the overall result.



Conversely, in micro averaging the cumulative true positives
are added up across classes and then divided by the total number of
ground truth objects. Here, the effect is the opposite: the total number of
ground truth objects is very large, which means the combined true positives
of 58 classes have only a small impact on the average recall.
As a result, the open set error rises more quickly than the \(F_1\) score
in micro averaging, creating the sharp rise of open set error at a lower
\(F_1\) score than in macro averaging. The open set error
reaches a high value early on and changes little afterwards. This allows
the \(F_1\) score to catch up and produces the almost horizontal line
in the graph. Eventually, the \(F_1\) score decreases again while the
open set error rises a bit further.
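
To make the difference concrete, the following minimal Python sketch contrasts micro and macro averaged recall when one class dominates the detections; the numbers are purely illustrative and are not thesis results.

\begin{verbatim}
# Illustrative sketch: micro vs. macro averaged recall from per-class counts.
# The numbers below are made up and only mimic the observed imbalance.
true_pos = {"person": 9000, "car": 1500, "giraffe": 180}
ground_truth = {"person": 20000, "car": 3000, "giraffe": 200}

# Micro averaging: pool all classes first, then divide.
micro_recall = sum(true_pos.values()) / sum(ground_truth.values())

# Macro averaging: compute recall per class, then take the mean.
per_class = [true_pos[c] / ground_truth[c] for c in true_pos]
macro_recall = sum(per_class) / len(per_class)

print(micro_recall)  # ~0.46, dominated by the person class
print(macro_recall)  # ~0.62, the small giraffe class pulls the average up
\end{verbatim}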



Furthermore, the plotted behaviour suggests that Miller et al.~\cite{Miller2018}
use macro averaging, as the distinctive behaviour of micro
averaging is not reported in their paper.



\subsection*{Impact of Entropy} 



There is no visible impact of entropy thresholding on the object detection
performance of \gls{vanilla} SSD. This indicates that the network has almost no
uniform or close to uniform predictions: the vast majority of predictions
have a high confidence in one class, including the background class.
However, the entropy plays a larger role for the Bayesian variants, as
expected: the best performing thresholds are 1.0, 1.3, and 1.4 for micro averaging,
and 1.5, 1.7, and 2.0 for macro averaging. In all of these cases the best
threshold is not the largest threshold tested.



This is caused by a simple phenomenon: at some point most or all true
positives are already included, and a higher entropy threshold only adds more false
positives. Such behaviour is indicated by a stagnating recall for the
higher entropy thresholds. For the low entropy thresholds, the low recall
dominates the \(F_1\) score; the sweet spot is somewhere in the
middle. For macro averaging, it holds that a higher optimal entropy
threshold indicates a worse performance.
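
To illustrate the mechanism behind these thresholds, the following sketch shows one possible way to apply an entropy filter to the averaged softmax vectors of observations; it is a simplified stand-in, not the exact thesis implementation.

\begin{verbatim}
import numpy as np

def entropy_filter(class_probs, threshold):
    """Keep observations whose softmax entropy lies below the threshold.

    class_probs: array of shape (n_observations, n_classes), rows sum to 1.
    Simplified sketch; thresholds such as 1.4 or 1.7 discussed above would
    be passed in here.
    """
    eps = 1e-12  # avoids log(0)
    entropy = -np.sum(class_probs * np.log(class_probs + eps), axis=1)
    return class_probs[entropy < threshold]

# A confident prediction has low entropy and passes; a near-uniform
# prediction has high entropy and is filtered out.
probs = np.array([[0.96, 0.02, 0.02],    # entropy ~ 0.20
                  [0.34, 0.33, 0.33]])   # entropy ~ 1.10
kept = entropy_filter(probs, threshold=1.0)
\end{verbatim}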



\subsection*{Non-Maximum Suppression and Top \(k\)}



\begin{table}[htbp] 

\centering 

\begin{tabular}{rccc} 

\hline 

variant & before & after & after \\ 

& entropy/NMS & entropy/NMS & top \(k\) \\ 

\hline 

Bay. SSD, no dropout, no \gls{NMS} & 155,251 & 122,868 & 72,207 \\ 

no dropout, \gls{NMS} & 155,250 & 36,061 & 33,827 \\ 

\hline 

\end{tabular} 



\caption{Comparison of the Bayesian \gls{SSD} variants without dropout with
respect to the number of detections before the entropy threshold,
after the entropy threshold and \gls{NMS} (where applicable), and after top \(k\). An
entropy threshold of 1.5 was used for both variants.}

\label{tab:effectnms} 

\end{table} 



Miller et al.~\cite{Miller2018} supposedly did not use \gls{NMS}
in their implementation of dropout sampling. Therefore, a variant with \glslocalreset{NMS}
\gls{NMS} disabled was tested. The results are somewhat expected:
\gls{NMS} removes all non-maximum detections that overlap
with a maximum one. This reduces the number of multiple detections per
ground truth bounding box and therefore the number of false positives. Without it,
many more false positives remain and have a negative impact on precision.
In combination with top \(k\) selection, recall can be affected as well:
duplicate detections could stay while maximum boxes could be removed.
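
The following simplified sketch of greedy \gls{NMS} and a global top \(k\) selection (illustrative only, not the exact thesis code) shows why disabling \gls{NMS} leaves overlapping duplicates that can later crowd out maximum boxes during top \(k\):

\begin{verbatim}
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.45):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        # Drop remaining boxes that overlap the current maximum too much.
        order = rest[iou(boxes[best], boxes[rest]) < iou_threshold]
    return keep

def top_k(scores, k=200):
    """Keep only the k highest scoring detections of an image."""
    return np.argsort(scores)[::-1][:k]
\end{verbatim}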



The number of observations was measured before and after the combination of entropy threshold and \gls{NMS} filter: both Bayesian \gls{SSD} without
NMS and dropout, and Bayesian \gls{SSD} with \gls{NMS} and disabled dropout
have practically the same number of observations before the entropy threshold. After the entropy threshold (the value 1.5 was used for both) and NMS, the variant with \gls{NMS} has roughly 23\% of its observations left
(see table \ref{tab:effectnms} for absolute numbers).
Without \gls{NMS}, 79\% of the observations are left. Irrespective of the absolute
numbers, this discrepancy clearly shows the impact of \gls{NMS} and also explains the higher count of false positives:
more than 50\% of the original observations were removed with \gls{NMS} but
stayed without it; all of these are very likely to be false positives.



A clear distinction between micro and macro averaging can be observed:
recall is hardly affected with micro averaging (0.300) but goes down noticeably with macro averaging (0.229). For micro averaging, it does
not matter which class the true positives belong to: every detection
counts the same way. This also means that top \(k\) will have only
a marginal effect: some true positives might be removed without \gls{NMS} but overall that does not have a big impact. With macro averaging, however,
the class of the true positives matters a lot: for example, if two
true positives are removed from a class with only a few true positives
to begin with, then their removal will have a drastic influence on
the class recall value and hence the overall result.



The impact of top \(k\) was measured by counting the number of observations
after top \(k\) has been applied: the variant with \gls{NMS} keeps about 94\%
of the observations left after NMS; without \gls{NMS}, only about 59\% of the observations
are kept. This shows a significant impact of top \(k\) on the result
in the case of disabled \gls{NMS}. Furthermore, some
classes are hit harder by top \(k\) than others: for example,
dogs keep around 82\% of their observations but persons only 57\%.
This indicates that detected dogs are mostly on images with few detections
overall and/or have a high enough prediction confidence to be
kept by top \(k\). Persons, however, are likely often on images
with many detections and/or have confidences that are too low.
In this example, the likelihood for true positives to be removed in
the person category is quite high. For dogs, the probability is far lower.
This is a good example of the difference between micro and macro averaging and their impact on
recall.





\subsection*{Dropout Sampling and Observations} 



\begin{table}[htbp] 

\centering 

\begin{tabular}{rcc} 

\hline 

variant & after & after \\ 

& prediction & observation grouping \\ 

\hline 

Bay. SSD, no dropout, \gls{NMS} & 1,677,050 & 155,250 \\ 

keep rate 0.9, \gls{NMS} & 1,617,675 & 549,166 \\ 

\hline 

\end{tabular} 



\caption{Comparison of the Bayesian \gls{SSD} variants without dropout and with
a dropout keep ratio of 0.9 with
respect to the number of detections directly after the network
prediction and after the observation grouping.}

\label{tab:effectdropout} 

\end{table} 



The dropout variants have largely worse performance than the Bayesian variants 

without dropout. This is expected as the network was not trained with 

dropout and the weights are not prepared for it. 



Gal~\cite{Gal2017}
showed that networks \textbf{trained} with dropout are approximate Bayesian
models. The Bayesian variants of \gls{SSD} implemented in this thesis are not fine-tuned or trained with dropout; therefore, they are not guaranteed to be such approximate models.



But dropout alone does not explain the difference in results. Both variants,
with and without dropout, have the exact same number of detections coming
out of the network (8,732 per image per forward pass). With 16 images per batch,
308 batches, and 10 forward passes, the total number of detections is
an astounding 430,312,960 (\(8{,}732 \cdot 16 \cdot 308 \cdot 10\)). As such a large number could not be
handled in memory, only one batch is calculated at a time. That
still leaves 1,397,120 detections per batch. These have to be grouped into
observations, which requires a quadratic computation of mutual IOU scores.
Therefore, the detections are first filtered by removing all those with a background
confidence of 0.8 or higher.
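
As a rough sketch of this step, assuming a Keras-style model (here called ssd_model, a placeholder) with dropout layers already inserted and a raw output of shape (batch, boxes, classes + coordinates), the repeated forward passes and the background pre-filter could look as follows; the actual thesis implementation may differ in its details.

\begin{verbatim}
import numpy as np

BACKGROUND = 0  # assumed index of the background class in the softmax output

def dropout_sampling(ssd_model, images, passes=10, background_thresh=0.8):
    """Run stochastic forward passes and drop confident background detections.

    Calling a Keras model with training=True keeps its dropout layers
    active, so every pass yields slightly different predictions.
    """
    detections = []
    for _ in range(passes):
        pred = ssd_model(images, training=True).numpy()
        flat = pred.reshape(-1, pred.shape[-1])          # flatten batch and boxes
        keep = flat[:, BACKGROUND] < background_thresh   # discard confident background
        detections.append(flat[keep])
    return np.concatenate(detections, axis=0)
\end{verbatim}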



The number of detections per class was measured before and after the
detections were grouped into observations. To this end, the stored predictions
were unbatched and summed up. After the aforementioned filter
and before the grouping, roughly 0.4\% (in fact slightly less) of the
more than 430 million detections remain (see table \ref{tab:effectdropout} for absolute numbers). The variant with dropout
has slightly fewer predictions left compared to the one without dropout.



After the grouping, the variant without dropout has on average between
10 and 11 detections grouped into an observation (\(1{,}677{,}050 / 155{,}250 \approx 10.8\)). This is expected as every
forward pass creates the exact same result and these 10 identical detections
per \gls{vanilla} \gls{SSD} detection perfectly overlap. The fact that slightly more than
10 detections are grouped together could explain the marginally better precision
of the Bayesian variant without dropout compared to \gls{vanilla} SSD.
However, on average only about three detections are grouped together into an
observation if dropout with a 0.9 keep ratio is enabled (\(1{,}617{,}675 / 549{,}166 \approx 2.9\)). This does not
negatively impact recall, as true positives do not disappear, but it increases
the chance of false positives. This can be observed in the results, which
clearly show no negative impact on recall between the variant without
dropout and the one with a 0.9 keep ratio.



This behaviour implies that even a slight usage of dropout creates such 

diverging anchor box offsets that the resulting detections from multiple 

forward passes no longer have a mutual IOU score of 0.95 or higher. 
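
A minimal, self-contained sketch of such a grouping step (greedy, anchored at the first ungrouped detection; boxes as corner coordinates; again not the exact thesis implementation) illustrates how detections from several forward passes are merged into observations by averaging their boxes and softmax vectors:

\begin{verbatim}
import numpy as np

def pair_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(x2 - x1, 0) * max(y2 - y1, 0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def group_observations(boxes, probs, iou_threshold=0.95):
    """Greedily group detections from multiple forward passes into observations.

    boxes: (n, 4) corner coordinates, probs: (n, n_classes) softmax vectors.
    Detections whose IoU with the group anchor reaches the threshold are
    merged; each observation is the mean of its members' boxes and
    class probabilities.
    """
    remaining = list(range(len(boxes)))
    observations = []
    while remaining:
        anchor = remaining.pop(0)
        members = [anchor]
        for idx in remaining[:]:
            if pair_iou(boxes[anchor], boxes[idx]) >= iou_threshold:
                members.append(idx)
                remaining.remove(idx)
        observations.append((boxes[members].mean(axis=0),
                             probs[members].mean(axis=0)))
    return observations
\end{verbatim}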



\section*{Outlook} 



The attempted replication of the work of Miller et al. raises a series of
questions that cannot be answered in this thesis. This thesis offers
one possible implementation of dropout sampling that technically works.
However, it cannot answer why the results of this implementation differ significantly
from those of Miller et al. The complete source code or otherwise exhaustive
implementation details of Miller et al. would be required to attempt an answer.



Future work could explore the performance of this implementation when used 

on an \gls{SSD} variant that was fine-tuned or trained with dropout. In this case, it 

should also look into the impact of training with both dropout and batch 

normalisation. 

Other avenues include the application to other data sets or object detection 

networks. 



To facilitate future work based on this thesis, the source code will be
made available and an installable Python package will be uploaded to the
PyPI package index. More details about the source code implementation can be
found in the appendices.


