% body thesis file that contains the actual content

\chapter{Introduction}

The introduction first explains the wider context before providing technical details.

\subsection*{Motivation}

Famous examples like the automatic soap dispenser, which does not recognise the hand of a black person but dispenses soap when presented with a paper towel, raise the question of bias in computer systems~\cite{Friedman1996}. Related to this ethical question regarding the design of so-called algorithms is the question of algorithmic accountability~\cite{Diakopoulos2014}.

Supervised neural networks learn from input-output relations and figure out by themselves which connections are necessary for that. This feature is also their Achilles heel: it makes them effectively black boxes and prevents any answers to questions of causality. However, these questions of causality are of enormous consequence when results of neural networks are used to make life-changing decisions: Is a correlation enough to bring forth negative consequences for a particular person? And if so, what defence is there against mathematics? Similar questions can be raised when looking at computer vision networks that might be used together with so-called smart \gls{CCTV} cameras to discover suspicious activity.

This leads to the need for neural networks to explain their results. Such an explanation must come from the network or an attached piece of technology to allow adoption at scale. This setting obviously raises the question of how such an endeavour can be achieved.

For neural networks there are fundamentally two types of tasks: regression and classification. Regression deals with any case where the goal for the network is to come close to an ideal function that connects all data points. Classification, however, describes tasks where the network is supposed to identify the class of any given input. In this thesis, I will work with both.

\subsection*{Object Detection in Open Set Conditions}

\begin{figure}
\centering
\includegraphics[scale=1.0]{open-set}
\caption{Open set problem: the test set contains classes that were not present during training time. Icons in this image have been taken from the COCO data set website (\url{https://cocodataset.org/\#explore}) and were vectorised afterwards. Resembles figure 1 of Miller et al.~\cite{Miller2018}.}
\label{fig:open-set}
\end{figure}

More specifically, I will look at object detection under open set conditions (see figure \ref{fig:open-set}). In non-technical terms, this effectively describes the kind of situation encountered with \gls{CCTV} or robots outside of a laboratory. Both use cameras that record images. Subsequently, a neural network analyses the image and returns a list of detected and classified objects that it found in the image. The problem here is that networks can only classify what they know. If presented with an object type that the network was not trained with, as happens frequently in real environments, it will still classify the object and might even have a high confidence in doing so. This is an example of a false positive. Anyone who uses the results of such a network could falsely assume that a high confidence always means the classification is very likely correct. If one uses a proprietary system one might not even be able to find out that the network was never trained on a particular type of object. Therefore, it would be impossible for one to identify the output of the network as a false positive. This reaffirms the need for automatic explanation.
Such a system should recognise by itself that the given object is unknown and hence mark any classification result of the network as meaningless. Technically, there are two slightly different approaches that deal with this type of task: model uncertainty and novelty detection.

Model uncertainty can be measured, for example, with dropout sampling. Dropout layers are usually used only during training, but Miller et al.~\cite{Miller2018} use them during testing as well to obtain different results for the same image across multiple forward passes. The output scores for the forward passes of the same image are then averaged. If the averaged class probabilities resemble a uniform distribution (every class has the same probability), this signals maximum uncertainty. Conversely, if there is one very high probability with all others being very low, this signifies low uncertainty. An unknown object is more likely to cause high uncertainty, which allows false positive cases to be identified.

Novelty detection is another approach to solve the task. In the realm of neural networks it is usually done with the help of auto-encoders that solve the regression task of finding an identity function which reconstructs the given input~\cite{Pimentel2014}. Auto-encoders consist internally of at least two components: an encoder, and a decoder or generator. The job of the encoder is to find an encoding that compresses the input as much as possible while simultaneously being as loss-free as possible. The decoder takes this latent representation of the input and has to find a decompression that reconstructs the input as accurately as possible. During training these auto-encoders learn to reproduce a certain group of object classes. The actual novelty detection takes place during testing: given an image, and the output and loss of the auto-encoder, a novelty score is calculated. For some novelty detection approaches the reconstruction loss is the novelty score; others consider more factors. A low novelty score signals a known object. The opposite is true for a high novelty score.

\subsection*{Research Question}

Auto-encoders work well for data sets like MNIST~\cite{Deng2012} but perform poorly on challenging real-world data sets like MS COCO~\cite{Lin2014}, complicating any potential comparison between them and object detection networks like \gls{SSD}. Therefore, a comparison between model uncertainty with a network like \gls{SSD} and novelty detection with auto-encoders is considered out of scope for this thesis.

Miller et al.~\cite{Miller2018} use an \gls{SSD} pre-trained on COCO, without further fine-tuning, on the SceneNet RGB-D data set~\cite{McCormac2017} and report good results regarding \gls{OSE} for an \gls{SSD} variant with dropout sampling and \gls{entropy} thresholding. If their results are generalisable, it should be possible to replicate the relative difference between the variants on the COCO data set. This leads to the following hypothesis: \emph{Dropout sampling delivers better object detection performance under open set conditions compared to object detection without it.}

For the purpose of this thesis, I use the \gls{vanilla} \gls{SSD} (as in: the original \gls{SSD}) as the baseline to compare against. In particular, \gls{vanilla} \gls{SSD} uses a per class confidence threshold of 0.01, an IOU threshold of 0.45 for the \gls{NMS}, and a top \(k\) value of 200. For this thesis, the top \(k\) value has been changed to 20, and a confidence threshold of 0.2 has been tried as well.
The effect of an \gls{entropy} threshold is measured against this \gls{vanilla} \gls{SSD} by applying \gls{entropy} thresholds from 0.1 to 2.4 inclusive (limits taken from Miller et al.). Dropout sampling is compared to \gls{vanilla} \gls{SSD} with and without \gls{entropy} thresholding.

\paragraph{Hypothesis} Dropout sampling delivers better object detection performance under open set conditions compared to object detection without it.

\subsection*{Reader's Guide}

First, chapter \ref{chap:background} presents related works and provides the background for dropout sampling. Afterwards, chapter \ref{chap:methods} explains how \gls{vanilla} \gls{SSD} works, how Bayesian \gls{SSD} extends \gls{vanilla} \gls{SSD}, and how the decoding pipelines are structured. Chapter \ref{chap:experiments-results} presents the data sets, the experimental setup, and the results. This is followed by chapter \ref{chap:discussion}, which contains the discussion and conclusion. Therefore, the contribution is found in chapters \ref{chap:methods}, \ref{chap:experiments-results}, and \ref{chap:discussion}.

\chapter{Background}
\label{chap:background}

This chapter begins with an overview of previous works in the field of this thesis. Afterwards, the theoretical foundations of dropout sampling are explained.

\section{Related Works}

The task of novelty detection can be accomplished in a variety of ways. Pimentel et al.~\cite{Pimentel2014} provide a review of novelty detection methods published over the previous decade. They showcase probabilistic, distance-based, reconstruction-based, domain-based, and information-theoretic novelty detection. Based on their categorisation, this thesis falls under reconstruction-based novelty detection as it deals only with neural network approaches. Therefore, the other types of novelty detection will only be introduced briefly.

\subsection{Overview of the Types of Novelty Detection}

Probabilistic approaches estimate the generative \gls{pdf} of the data. It is assumed that the training data is generated from an underlying probability distribution \(D\). This distribution can be estimated with the training data; the estimate is defined as \(\hat D\) and represents a model of normality. A novelty threshold is applied to \(\hat D\) in a way that allows a probabilistic interpretation. Pidhorskyi et al.~\cite{Pidhorskyi2018} combine a probabilistic approach to novelty detection with auto-encoders.

Distance-based novelty detection uses either nearest neighbour-based approaches (e.g. \(k\)-nearest neighbour \cite{Hautamaki2004}) or clustering-based approaches (e.g. the \(k\)-means clustering algorithm \cite{Jordan1994}). Both methods are similar to estimating the \gls{pdf} of the data; they use well-defined distance metrics to compute the distance between two data points.

Domain-based novelty detection describes the boundary of the known data, rather than the data itself. Unknown data is identified by its position relative to the boundary. Support vector machines are a common implementation of this approach (e.g. implemented by Song et al. \cite{Song2002}).

Information-theoretic novelty detection computes the information content of a data set, for example, with metrics like \gls{entropy}. Such metrics assume that novel data inside the data set significantly alters the information content of an otherwise normal data set. First, the metrics are calculated over the whole data set. Afterwards, a subset is identified that causes the biggest difference in the metric when removed from the data set.
This subset is considered to consist of novel data. For example, Filippone and Sanguinetti \cite{Filippone2011} provide a recent approach.

\subsection{Reconstruction-based Novelty Detection}

Reconstruction-based approaches use the reconstruction error in one form or another to calculate the novelty score. These can be auto-encoders that literally reconstruct the input, but the category also includes \gls{MLP} networks which try to reconstruct the ground truth. Pimentel et al.~\cite{Pimentel2014} differentiate between neural network-based approaches and subspace methods. The former are further differentiated into MLPs, \glspl{Hopfield network}, autoassociative networks, radial basis function networks, and self-organising networks. The remainder of this section focuses on \gls{MLP}-based works, with a particular focus on the task of object detection and Bayesian networks. Novelty detection for object detection is intricately linked with open set conditions: the test data can contain unknown classes.

Bishop~\cite{Bishop1994} investigates the correlation between the degree of novel input data and the reliability of network outputs, and introduces a quantitative way to measure novelty.

The Bayesian approach provides a theoretical foundation for modelling uncertainty \cite{Ghahramani2015}. MacKay~\cite{MacKay1992} provides a practical Bayesian framework for backpropagation networks. Neal~\cite{Neal1996} builds upon the work of MacKay and explores Bayesian learning for neural networks. However, these Bayesian neural networks do not scale well. Over the course of time, two major Bayesian approximations have been introduced: one based on dropout and one based on batch normalisation.

Gal and Ghahramani~\cite{Gal2016} show that dropout training is a Bayesian approximation of a Gaussian process. Subsequently, Gal~\cite{Gal2017} shows that dropout training actually corresponds to a general approximate Bayesian model. This means every network trained with dropout is an approximate Bayesian model. During inference the dropout remains active; this form of inference is called \gls{MCDO}. Miller et al.~\cite{Miller2018} build upon the work of Gal and Ghahramani: they use \gls{MCDO} under open set conditions for object detection. In a second paper \cite{Miller2018a}, Miller et al. continue their work and compare merging strategies for sampling-based uncertainty techniques in object detection.

Teye et al.~\cite{Teye2018} make the point that most modern networks have adopted other regularisation techniques. Ioffe and Szegedy~\cite{Ioffe2015} introduce batch normalisation, which has been adopted widely in the meantime. Teye et al. show how batch normalisation training is similar to dropout and can be viewed as approximate Bayesian inference. Estimates of the model uncertainty can be gained with a technique named \gls{MCBN}. Consequently, this technique can be applied to any network that utilises standard batch normalisation. Li et al.~\cite{Li2019} investigate the problem of poor performance when combining dropout and batch normalisation: dropout shifts the variance of a neural unit when switching from training to testing, whereas batch normalisation does not change the variance. This inconsistency leads to a variance shift which can have a larger or smaller impact based on the network used.

Non-Bayesian approaches have been developed as well. Usually, they are compared with \gls{MCDO} and show better performance.
Postels et al.~\cite{Postels2019} provide a sampling-free approach for uncertainty estimation that does not affect training and approximates the sampling at test time. They compare it to \gls{MCDO} and find less computational overhead with better results. Lakshminarayanan et al.~\cite{Lakshminarayanan2017} implement a predictive uncertainty estimation using deep ensembles. Compared to \gls{MCDO}, it shows better results. Geifman et al.~\cite{Geifman2018} introduce an uncertainty estimation algorithm for non-Bayesian deep neural classification that estimates the uncertainty of highly confident points using earlier snapshots of the trained model and improves, among others, the approach introduced by Lakshminarayanan et al. Sensoy et al.~\cite{Sensoy2018} explicitly model prediction uncertainty: a \gls{Dirichlet distribution} is placed over the class probabilities. Consequently, the predictions of a neural network are treated as subjective opinions.

In addition to the aforementioned Bayesian and non-Bayesian works, there are some Bayesian works that do not quite fit with the rest but are important as well. Mukhoti and Gal~\cite{Mukhoti2018} contribute metrics to measure uncertainty for semantic segmentation. Wu et al.~\cite{Wu2019} introduce two innovations that turn variational Bayes into a robust tool for Bayesian networks: first, a novel deterministic method to approximate moments in neural networks which eliminates gradient variance, and second, a hierarchical prior for parameters and an empirical Bayes procedure to select prior variances.

\section{Background for Dropout Sampling}

\begin{table}
\centering
\caption{Notation for background}
\label{tab:notation}
\begin{tabular}{l|l}
symbol & meaning \\
\hline
\(\mathbf{W}\) & weights \\
\(\mathbf{T}\) & training data \\
\(\mathcal{N}(0, I)\) & Gaussian distribution with identity covariance \\
\(I\) & identity matrix \\
\(p(\mathbf{W}|\mathbf{T})\) & probability of weights given training data \\
\(\mathcal{I}\) & an image \\
\(\mathbf{q} = p(y|\mathcal{I}, \mathbf{T})\) & probability of all classes given image and training data \\
\(H(\mathbf{q})\) & \gls{entropy} over probability vector \\
\(\widetilde{\mathbf{W}}\) & weights sampled from \(p(\mathbf{W}|\mathbf{T})\) \\
\(\mathbf{b}\) & bounding box coordinates \\
\(\mathbf{s}\) & softmax scores \\
\(\overline{\mathbf{s}}\) & averaged softmax scores \\
\(D\) & detections of one forward pass \\
\(\mathfrak{D}\) & set of all detections over multiple forward passes \\
\(\mathcal{O}\) & observation \\
\(\overline{\mathbf{q}}\) & probability vector for observation \\
%\(E[something]\) & expected value of something
%\(\overline{\mathbf{z}}, \mathbf{z}\) & latent space representation \\
%\(d_T, d_z\) & discriminators \\
%\(e, g\) & encoding and decoding/generating function \\
%\(J_g\) & Jacobi matrix for generating function \\
%\(\mathcal{T}\) & tangent space \\
%\(\mathbf{R}\) & training/test data changed to be on tangent space
\end{tabular}
\end{table}

This section will use the \textbf{notation} defined in table \ref{tab:notation} on page \pageref{tab:notation}. To understand dropout sampling, it is necessary to explain the idea of Bayesian neural networks. They place a prior distribution over the network weights, for example a Gaussian prior distribution: \(\mathbf{W} \sim \mathcal{N}(0, I)\). In this example \(\mathbf{W}\) are the weights and \(I\) is the identity matrix, which means that every weight is drawn independently from an identical standard normal distribution.
The training of the network determines a plausible set of weights by evaluating the probability distribution over the weights given the training data \(\mathbf{T}\) (the \gls{posterior}): \(p(\mathbf{W}|\mathbf{T})\). However, this evaluation cannot be performed in any reasonable time. Therefore, approximation techniques are required. In those techniques the \gls{posterior} is fitted with a simple distribution \(q^{*}_{\theta}(\mathbf{W})\). The original and intractable problem of averaging over all weights in the network is replaced with an optimisation task, where the parameters of the simple distribution are optimised~\cite{Kendall2017}.

\subsubsection*{Dropout Variational Inference}

Kendall and Gal~\cite{Kendall2017} show an approximation for classification and recognition tasks. Dropout variational inference is a practical approximation technique that adds dropout layers in front of every weight layer and uses them during test time as well to sample from the approximate \gls{posterior}. Effectively, this results in the approximation of the class probability \(p(y|\mathcal{I}, \mathbf{T})\) by performing \(n\) forward passes through the network and averaging the softmax scores \(\mathbf{s}_i\) obtained in this way, given an image \(\mathcal{I}\) and the training data \(\mathbf{T}\):
\begin{equation} \label{eq:drop-sampling}
p(y|\mathcal{I}, \mathbf{T}) = \int p(y|\mathcal{I}, \mathbf{W}) \cdot p(\mathbf{W}|\mathbf{T}) \, d\mathbf{W} \approx \frac{1}{n} \sum_{i=1}^{n}\mathbf{s}_i
\end{equation}
With this dropout sampling technique, \(n\) sets of model weights \(\widetilde{\mathbf{W}}_i\) are sampled from the \gls{posterior} \(p(\mathbf{W}|\mathbf{T})\). The class probability \(p(y|\mathcal{I}, \mathbf{T})\) is a probability vector \(\mathbf{q}\) over all class labels. Finally, the uncertainty of the network with respect to the classification is given by the \gls{entropy} \(H(\mathbf{q}) = - \sum_i q_i \cdot \log q_i\).

\subsubsection*{Dropout Sampling for Object Detection}

Miller et al.~\cite{Miller2018} apply dropout sampling to object detection. In that case \(\mathbf{W}\) represents the learned weights of a detection network like \gls{SSD}~\cite{Liu2016}. Every forward pass uses a different network \(\widetilde{\mathbf{W}}\) which is approximately sampled from \(p(\mathbf{W}|\mathbf{T})\). Each forward pass in object detection results in a set of detections, each consisting of bounding box coordinates \(\mathbf{b}\) and softmax scores \(\mathbf{s}\). The detections are denoted by Miller et al. as \(D_i = \{\mathbf{s}_i,\mathbf{b}_i\}\). The detections of all passes are put into a large set \(\mathfrak{D} = \{D_1, ..., D_n\}\). All detections with mutual intersection-over-union scores (IoU) of \(0.95\) or higher are defined as an observation \(\mathcal{O}_i\). Subsequently, the corresponding vector of class probabilities \(\overline{\mathbf{q}}_i\) for the observation is calculated by averaging all score vectors \(\mathbf{s}_j\) in a particular observation \(\mathcal{O}_i\): \(\overline{\mathbf{q}}_i \approx \overline{\mathbf{s}}_i = \frac{1}{n} \sum_{j=1}^{n} \mathbf{s}_j\).

The label uncertainty of the detector for a particular observation is measured by the \gls{entropy} \(H(\overline{\mathbf{q}}_i)\). If \(\overline{\mathbf{q}}_i\) resembles a uniform distribution the \gls{entropy} will be high. A uniform distribution means that no class is more likely than another, which is a perfect example of maximum uncertainty. Conversely, if one class has a very high probability the \gls{entropy} will be low.
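To make this procedure more tangible, the following NumPy sketch shows one possible way to group detections into observations, average their softmax scores, and compute the \gls{entropy}. The greedy grouping strategy and all function names are purely illustrative assumptions of mine and are not taken from the implementation of Miller et al.; the actual partitioning may differ in its details.

\begin{verbatim}
import numpy as np

def iou(box_a, box_b):
    """Intersection over union of two boxes given as [xmin, ymin, xmax, ymax]."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def group_into_observations(boxes, scores, iou_threshold=0.95):
    """Greedily group detections whose mutual IoU is at least the threshold.

    boxes:  array of shape (n_detections, 4)
    scores: array of shape (n_detections, n_classes), softmax scores
    """
    unassigned = list(range(len(boxes)))
    observations = []
    while unassigned:
        members = [unassigned.pop(0)]
        for j in list(unassigned):
            if all(iou(boxes[j], boxes[m]) >= iou_threshold for m in members):
                members.append(j)
                unassigned.remove(j)
        q_bar = scores[members].mean(axis=0)   # averaged softmax scores
        b_bar = boxes[members].mean(axis=0)    # averaged box coordinates
        observations.append((b_bar, q_bar))
    return observations

def entropy(q, eps=1e-12):
    """Shannon entropy H(q) = -sum_i q_i * log(q_i) of a probability vector."""
    q = np.clip(q, eps, 1.0)
    return float(-np.sum(q * np.log(q)))

# uniform scores mean maximum uncertainty, peaked scores mean low uncertainty
print(entropy(np.full(4, 0.25)))                     # ~1.386 (= log 4)
print(entropy(np.array([0.97, 0.01, 0.01, 0.01])))   # ~0.168
\end{verbatim}

The two toy values at the end illustrate the extremes: a uniform vector over four classes gives the maximum \gls{entropy} of \(\log 4 \approx 1.386\), while a strongly peaked vector gives a value close to zero.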
In open set conditions it can be expected that falsely generated detections for unknown object classes have a higher label uncertainty. A threshold on the \gls{entropy} \(H(\overline{\mathbf{q}}_i)\) can then be used to identify and reject these false positive cases.

% \gls{SSD}: \cite{Liu2016}
% ImageNet: \cite{Deng2009}
% COCO: \cite{Lin2014}
% YCB: \cite{Xiang2017}
% SceneNet: \cite{McCormac2017}

\chapter{Methods}
\label{chap:methods}

This chapter explains the functionality of \gls{vanilla} \gls{SSD}, Bayesian \gls{SSD}, and the decoding pipelines.

\section{Vanilla SSD}

\begin{figure}
\centering
\includegraphics[scale=1.2]{vanilla-ssd}
\caption{The \gls{vanilla} \gls{SSD} network as defined by Liu et al.~\cite{Liu2016}. VGG-16 is the base network, extended with extra feature layers. These predict offsets to anchor boxes with different sizes and aspect ratios. Furthermore, they predict the corresponding confidences.}
\label{fig:vanilla-ssd}
\end{figure}

Vanilla \gls{SSD} is based upon the VGG-16 network (see figure \ref{fig:vanilla-ssd}) and adds extra feature layers. The entire image (always of size 300x300) is divided into anchor boxes. During training, each of these boxes is mapped to a ground truth box or background. For every anchor box, both the offset to the object and the class confidences are calculated. The output of the \gls{SSD} network consists of the predictions with class confidences, offsets to the anchor box, anchor box coordinates, and variances. The model loss is a weighted sum of localisation and confidence loss. As the network has a fixed number of anchor boxes, every forward pass creates the same number of detections---8732 in the case of \gls{SSD} 300x300. Notably, the object proposals are made in a single run per image---single shot. Other techniques like Faster R-CNN employ region proposals and pooling. For more detailed information on \gls{SSD}, please refer to Liu et al.~\cite{Liu2016}.

\section{Bayesian SSD for Model Uncertainty}

Networks trained with dropout are approximate Bayesian models~\cite{Gal2017}. As such, they can be used for everything a true Bayesian model could be used for. This idea is applied to \gls{SSD} by Miller et al.: two dropout layers are added to \gls{vanilla} \gls{SSD}, after the layers fc6 and fc7 respectively (see figure \ref{fig:bayesian-ssd}).

\begin{figure}
\centering
\includegraphics[scale=1.2]{bayesian-ssd}
\caption{The Bayesian \gls{SSD} network as defined by Miller et al.~\cite{Miller2018}. It adds dropout layers after the fc6 and fc7 layers.}
\label{fig:bayesian-ssd}
\end{figure}

The motivation for this is model uncertainty: for the same object on the same image, an uncertain model will predict different classes across multiple forward passes. This uncertainty is measured with \gls{entropy}: every forward pass results in predictions, these are partitioned into observations, and subsequently their \gls{entropy} is calculated. A higher \gls{entropy} indicates a more uniform distribution of confidences, whereas a lower \gls{entropy} indicates a large confidence in one class and very low confidences in the other classes.

\subsection{Implementation Details}

For this thesis, an \gls{SSD} implementation based on TensorFlow~\cite{Abadi2015} and Keras\footnote{\url{https://github.com/pierluigiferrari/ssd\_keras}} is used. It has been modified to support \gls{entropy} thresholding, partitioning of observations, and dropout layers in the \gls{SSD} model. Entropy thresholding takes place before the per class confidence threshold is applied.
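The following sketch illustrates, in a schematic and heavily simplified form, how dropout can be kept active at test time with Keras. It is not the actual modification of the ssd\_keras code base; the layer shapes and names and the stand-in for the fc6/fc7 block are assumptions made purely for illustration.

\begin{verbatim}
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Schematic stand-in for the fc6/fc7 block of SSD (illustrative shapes only).
inputs = tf.keras.Input(shape=(19, 19, 512))
x = layers.Conv2D(1024, 3, padding="same", dilation_rate=6,
                  activation="relu", name="fc6")(inputs)
# training=True keeps the dropout active even during inference;
# a rate of 0.1 corresponds to the 0.9 keep ratio used in this thesis.
x = layers.Dropout(0.1, name="fc6_dropout")(x, training=True)
x = layers.Conv2D(1024, 1, padding="same", activation="relu", name="fc7")(x)
x = layers.Dropout(0.1, name="fc7_dropout")(x, training=True)
model = tf.keras.Model(inputs, x)

# Monte Carlo dropout: multiple stochastic forward passes over the same input.
image_batch = np.random.rand(1, 19, 19, 512).astype("float32")
passes = np.stack([model(image_batch).numpy() for _ in range(10)])
# The passes differ because the dropout layers stay active at test time.
\end{verbatim}

In the real network the two dropout layers sit after the pre-trained fc6 and fc7 layers of \gls{SSD}; everything else in the sketch, including the random input, is a placeholder.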
The Bayesian variant was not fine-tuned and operates with the same weights as \gls{vanilla} \gls{SSD}.

\section{Decoding Pipelines}

The raw output of \gls{SSD} is not very useful: it contains thousands of boxes per image. Among them are many boxes with very low confidences or background classifications; these need to be filtered out to obtain any meaningful output from the network. This process of filtering is called decoding and is presented for the three structural variants of \gls{SSD} used in this thesis.

\subsection{Vanilla SSD}

Liu et al.~\cite{Liu2016} use \gls{Caffe} for their original \gls{SSD} implementation. The decoding process largely consists of two phases: decoding and filtering. Decoding transforms the relative coordinates predicted by \gls{SSD} into absolute coordinates. Before decoding, the shape of the output per batch is \((batch\_size, \#nr\_boxes, \#nr\_classes + 12)\). The last twelve elements are split into the four bounding box offsets, the four anchor box coordinates, and the four variances; there are 8732 boxes. After decoding, only four of the twelve elements remain: the absolute coordinates of the bounding box.

\glslocalreset{NMS}
Filtering of these boxes is first done per class: only the class id, the confidence of that class, and the bounding box coordinates are kept per box. The filtering consists of confidence thresholding and a subsequent \gls{NMS}. All boxes that pass \gls{NMS} are added to a per image maxima list. One box could pass the confidence threshold for multiple classes and, hence, be present multiple times in the maxima list for the image. Lastly, the \(k\) boxes with the highest confidences across all classes are kept per image. The original implementation uses a confidence threshold of \(0.01\), an IOU threshold for \gls{NMS} of \(0.45\), and a top \(k\) value of 200.

The \gls{vanilla} \gls{SSD} per class confidence threshold and \gls{NMS} have one weakness: even if \gls{SSD} correctly predicts all objects as the background class with high confidence, the per class confidence threshold of 0.01 still lets predictions with very low confidences pass; background boxes are never added to the maxima collection, but many low confidence boxes of other classes can be. Furthermore, the same detection can be present in the maxima collection for multiple classes. In this case, the \gls{entropy} threshold would let the detection pass because the background class has high confidence. Subsequently, a low per class confidence threshold does not restrict the boxes either. Therefore, the decoding output is worse than the actual predictions of the network. Bayesian \gls{SSD} cannot help in this situation because the network is not actually uncertain.

SSD was developed with closed set conditions in mind. A well trained network in such a situation does not have many high confidence background detections. In an open set environment, however, background detections are the correct behaviour for unknown classes. In order to get useful detections out of the decoding, a higher confidence threshold is required.

\subsection{Vanilla SSD with Entropy Thresholding}

Vanilla \gls{SSD} with \gls{entropy} thresholding adds an additional component to the filtering already done for \gls{vanilla} \gls{SSD}. The \gls{entropy} is calculated from all \(\#nr\_classes\) softmax scores in a prediction. Only predictions with a low enough \gls{entropy} pass the \gls{entropy} threshold and move on to the aforementioned per class filtering.
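To summarise these filtering steps, the following NumPy sketch shows a simplified per image pipeline: an optional \gls{entropy} test, the per class confidence threshold, a greedy \gls{NMS}, and the final top \(k\) selection. It assumes already decoded absolute box coordinates in the shape \((\#nr\_boxes, \#nr\_classes + 4)\) and treats class 0 as the background; it is an illustration of the logic described above, not the code used in this thesis.

\begin{verbatim}
import numpy as np

def nms(boxes, scores, iou_thresh):
    """Greedy non-maximum suppression; returns the indices of the kept boxes."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        overlap = inter / (area_i + areas - inter)
        order = rest[overlap <= iou_thresh]
    return keep

def decode_image(pred, conf_thresh=0.2, entropy_thresh=None,
                 nms_iou=0.45, top_k=20):
    """Filter the decoded predictions of a single image."""
    scores, boxes = pred[:, :-4], pred[:, -4:]
    if entropy_thresh is not None:                  # optional entropy test
        clipped = np.clip(scores, 1e-12, 1.0)
        ent = -np.sum(clipped * np.log(clipped), axis=1)
        scores, boxes = scores[ent <= entropy_thresh], boxes[ent <= entropy_thresh]
    maxima = []                                     # per image maxima list
    for cls in range(1, scores.shape[1]):           # class 0 is the background
        mask = scores[:, cls] > conf_thresh         # per class confidence threshold
        cls_boxes, cls_scores = boxes[mask], scores[mask, cls]
        for idx in nms(cls_boxes, cls_scores, nms_iou):
            maxima.append((cls, cls_scores[idx], cls_boxes[idx]))
    maxima.sort(key=lambda m: m[1], reverse=True)   # sort by confidence
    return maxima[:top_k]                           # keep the top k across classes
\end{verbatim}

With \texttt{entropy\_thresh=None} the sketch corresponds to the \gls{vanilla} \gls{SSD} filtering; with a value set it corresponds to the variant with \gls{entropy} test.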
This threshold excludes very uniform predictions but cannot identify false positive or false negative cases with high confidence values.

\subsection{Bayesian SSD with Entropy Thresholding}

The speciality of Bayesian \gls{SSD} is its multiple forward passes. Based on the information in the paper, the detections of all forward passes are grouped per image but not by forward pass. This leads to the following shape of the network output after all forward passes: \((batch\_size, \#nr\_boxes \, \cdot \, \#nr\_forward\_passes, \#nr\_classes + 12)\). The size of the output increases linearly with more forward passes.

These detections have to be decoded first. Afterwards, all detections that do not pass a confidence threshold for the class with the highest prediction probability are discarded. Additionally, all detections with a background prediction of 0.8 or higher are discarded. The remaining detections are partitioned into observations to further reduce the size of the output, and to identify uncertainty. This is accomplished by calculating the mutual IOU score of every detection with all other detections. Detections with a mutual IOU score of 0.95 or higher are partitioned into an observation. Next, the softmax scores and bounding box coordinates of all detections in an observation are averaged. There can be a different number of observations for every image, which destroys homogeneity and prevents batch-wise calculation of the results. The shape of the results per image is \((\#nr\_observations,\#nr\_classes + 4)\).

Entropy is measured in the next step. All observations with too high an entropy are discarded. Entropy thresholding in combination with dropout sampling should improve the identification of false positives of unknown classes. This is due to the multiple forward passes and the assumption that uncertainty in some objects will result in different classifications across multiple forward passes. These varying classifications are averaged into multiple lower confidence values, which should increase the \gls{entropy} and, hence, flag an observation for removal. The remainder of the filtering follows the \gls{vanilla} \gls{SSD} procedure: per class confidence threshold, \gls{NMS}, and a top \(k\) selection at the end.

\chapter{Experimental Setup and Results}
\label{chap:experiments-results}

This chapter describes the data sets used, how the experiments have been set up, and what the results are.

\section{Data Sets}

This thesis uses the MS COCO~\cite{Lin2014} data set. It contains 80 classes, whose range is illustrated by two examples: airplanes and toothbrushes. The images are taken by camera in the real world; ground truth is provided for all images. The data set supports object detection, keypoint detection, and panoptic segmentation (scene segmentation).

The data of any data set has to be prepared for use in a neural network. Typical problems of data sets include, for example, outliers and invalid bounding boxes. Before a data set can be used, these problems need to be resolved. For the MS COCO data set, all annotations are checked for impossible values: bounding box height or width lower than zero, \(x_{min}\) and \(y_{min}\) bounding box coordinates lower than zero, \(x_{max}\) and \(y_{max}\) coordinates lower than or equal to zero, \(x_{min}\) greater than \(x_{max}\), \(y_{min}\) greater than \(y_{max}\), image width lower than \(x_{max}\), and image height lower than \(y_{max}\). In the last two cases the bounding box width and height are set to (image width - \(x_{min}\)) and (image height - \(y_{min}\)) respectively; in the other cases the annotation is skipped. If the bounding box width or height is afterwards lower than or equal to zero, the annotation is skipped as well.
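The sanity checks can be summarised by the following sketch. It works on corner coordinates \((x_{min}, y_{min}, x_{max}, y_{max})\) and is only an illustration of the rules listed above; the function name, the annotation format, and the exact control flow of the actual preprocessing code may differ.

\begin{verbatim}
def sanitise_annotation(xmin, ymin, xmax, ymax, image_width, image_height):
    """Return cleaned corner coordinates or None if the annotation is skipped."""
    if xmax - xmin < 0 or ymax - ymin < 0:      # negative width or height
        return None
    if xmin < 0 or ymin < 0:                    # corner coordinates below zero
        return None
    if xmax <= 0 or ymax <= 0:
        return None
    if xmin > xmax or ymin > ymax:              # inverted boxes
        return None
    if image_width < xmax:                      # box reaches beyond the right edge
        xmax = image_width                      # width becomes image_width - xmin
    if image_height < ymax:                     # box reaches beyond the bottom edge
        ymax = image_height                     # height becomes image_height - ymin
    if xmax - xmin <= 0 or ymax - ymin <= 0:    # re-check after clipping
        return None
    return xmin, ymin, xmax, ymax
\end{verbatim}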
SSD accepts 300x300 input images; the MS COCO data set images are resized to this resolution, and the aspect ratio is not kept in the process. MS COCO contains landscape and portrait images with resolutions of 640x480 and 480x640 respectively. The resizing therefore leads to a uniform distortion of the landscape and portrait images respectively. Furthermore, the colour channels are swapped from \gls{RGB} to \gls{BGR} in order to comply with the \gls{SSD} implementation. The \gls{BGR} requirement stems from the usage of OpenCV in \gls{SSD}: the internal channel order of OpenCV is \gls{BGR}.

For this thesis, weights pre-trained on the trainval35k subset of the COCO data set are used. These weights have been created with closed set conditions in mind; therefore, they have been sub-sampled to create open set conditions. To this end, the weights for the last 20 classes have been discarded, making these classes effectively unknown. All images of the minival2014 data set are used but only ground truth belonging to the first 60 classes is loaded. The remaining 20 classes are considered ``unknown'' and no ground truth bounding boxes for them are provided during the inference phase.

A total of 31,991 ground truth detections remain after this exclusion. Of these detections, a staggering 10,988 or 34.3\% belong to the persons class, followed by cars with 1,932 or 6\%, chairs with 1,791 or 5.6\%, and bottles with 1,021 or 3.2\%. Together, these four classes make up around 49.1\% of the ground truth detections. This shows a huge imbalance between the classes in the data set.

\section{Experimental Setup}

This section explains the setup for the different experiments conducted. Each comparison investigates one particular question. As a baseline, \gls{vanilla} \gls{SSD} with a confidence threshold of 0.01 and an \gls{NMS} IOU threshold of 0.45 is used. Due to the low number of objects per image in the COCO data set, the top \(k\) value has been set to 20.

Vanilla \gls{SSD} with \gls{entropy} thresholding uses the same parameters; compared to \gls{vanilla} \gls{SSD} without \gls{entropy} thresholding, it showcases the relevance of entropy thresholding for \gls{vanilla} \gls{SSD}. Vanilla \gls{SSD} with 0.2 confidence threshold is compared to \gls{vanilla} \gls{SSD} with 0.01 confidence threshold; this comparison investigates the effect of the per class confidence threshold on the object detection performance.

Bayesian \gls{SSD} with 0.2 confidence threshold is compared to \gls{vanilla} \gls{SSD} with 0.2 confidence threshold. Coupled with the entropy threshold, this comparison reveals how uncertain the network is. If it is very certain, the dropout sampling should have no significant impact on the result. Furthermore, in two cases the dropout has been turned off to isolate the impact of \gls{NMS} on the result. Both \gls{vanilla} \gls{SSD} with \gls{entropy} thresholding and Bayesian \gls{SSD} with entropy thresholding are tested for \gls{entropy} thresholds ranging from 0.1 to 2.4 inclusive as specified in Miller et al.~\cite{Miller2018}.

\section{Results}

Results in this section are presented both for micro and macro averaging.
In macro averaging, for example, the precision values of each class are added up and then divided by the number of classes. Conversely, for micro averaging the precision is calculated across all classes directly. Both methods have a specific impact: macro averaging weighs every class the same while micro averaging weighs every detection the same. They will be largely identical when every class is balanced and has about the same number of detections. However, in case of a class imbalance the macro averaging favours classes with few detections whereas micro averaging benefits classes with many detections. This section only presents the results. Interpretation and discussion is found in the next chapter. \subsection{Micro Averaging} \begin{table}[ht] \begin{tabular}{rcccc} \hline Forward & max & abs OSE & Recall & Precision\\ Passes & \(F_1\) Score & \multicolumn{3}{c}{at max \(F_1\) point} \\ \hline \gls{vanilla} \gls{SSD} - 0.01 conf & 0.255 & 3176 & 0.214 & 0.318 \\ \gls{vanilla} \gls{SSD} - 0.2 conf & \textbf{0.376} & 2939 & \textbf{0.382} & 0.372 \\ \gls{SSD} with entropy test - 0.01 conf & 0.255 & 3168 & 0.214 & 0.318 \\ % \gls{entropy} thresh: 2.4 for \gls{vanilla} \gls{SSD} is best \hline Bay. \gls{SSD} - no DO - 0.2 conf - no \gls{NMS} \; 10 & 0.209 & 2709 & 0.300 & 0.161 \\ no dropout - 0.2 conf - \gls{NMS} \; 10 & 0.371 & \textbf{2335} & 0.365 & \textbf{0.378} \\ 0.9 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.359 & 2584 & 0.363 & 0.357 \\ 0.5 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.325 & 2759 & 0.342 & 0.311 \\ % \gls{entropy} thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3 % 0.5 for Bayesian - 6, 1.4 for 7, 1.4 for 8, 1.3 for 9 \hline \end{tabular} \caption{Rounded results for micro averaging. \gls{SSD} with entropy test and Bayesian \gls{SSD} are represented with their best performing \gls{entropy} threshold with respect to \(F_1\) score. Vanilla \gls{SSD} with entropy test performed best with an \gls{entropy} threshold of 2.4, Bayesian \gls{SSD} without \gls{NMS} performed best for 1.0, and Bayesian \gls{SSD} with \gls{NMS} performed best for 1.4 as \gls{entropy} threshold. Bayesian \gls{SSD} with dropout enabled and 0.9 keep ratio performed best for 1.4 as \gls{entropy} threshold, the variant with 0.5 keep ratio performed best for 1.3 as threshold.} \label{tab:results-micro} \end{table} \begin{figure}[ht] \begin{minipage}[t]{0.48\textwidth} \includegraphics[width=\textwidth]{ose-f1-all-micro} \caption{Micro averaged \(F_1\) score versus open set error for each variant. Perfect performance is an \(F_1\) score of 1 and an absolute \gls{OSE} of 0.} \label{fig:ose-f1-micro} \end{minipage}% \hfill \begin{minipage}[t]{0.48\textwidth} \includegraphics[width=\textwidth]{precision-recall-all-micro} \caption{Micro averaged precision-recall curves for each variant tested.} \label{fig:precision-recall-micro} \end{minipage} \end{figure} Vanilla \gls{SSD} with a per class confidence threshold of 0.2 performs best (see table \ref{tab:results-micro}) with respect to the maximum \(F_1\) score (0.376) and recall at the maximum \(F_1\) point (0.382). In comparison, neither the \gls{vanilla} \gls{SSD} variant with a confidence threshold of 0.01 nor the \gls{SSD} with an \gls{entropy} test can outperform the 0.2 variant. Among the \gls{vanilla} \gls{SSD} variants, the 0.2 variant also has the lowest open set error (2939) and the highest precision (0.372). 
The comparison of the \gls{vanilla} \gls{SSD} variants with a confidence threshold of 0.01 shows no significant impact of an \gls{entropy} test. Only the open set error is lower, but in an insignificant way. The rest of the performance metrics are identical after rounding.

Bayesian \gls{SSD} with disabled dropout and without \gls{NMS} has the worst performance of all tested variants (\gls{vanilla} and Bayesian) with respect to \(F_1\) score (0.209) and precision (0.161). The precision is not only the worst but also significantly lower compared to all other variants. In comparison to all variants with 0.2 confidence threshold, it has the worst recall (0.300) as well.

With an open set error of 2335, the Bayesian \gls{SSD} variant with disabled dropout and enabled \gls{NMS} offers the best performance with respect to the open set error. It also has the best precision (0.378) of all tested variants. Furthermore, it provides the best performance among all variants with multiple forward passes. Dropout decreases the performance of the network: this can be seen in the lower \(F_1\) scores, the higher open set error, and the lower precision values. Both dropout variants have worse recall (0.363 and 0.342) than the variant with disabled dropout. However, all variants with multiple forward passes have a lower open set error than all \gls{vanilla} \gls{SSD} variants.

The relation of \(F_1\) score to absolute open set error can be observed in figure \ref{fig:ose-f1-micro}. Precision-recall curves for all variants can be seen in figure \ref{fig:precision-recall-micro}. Both \gls{vanilla} \gls{SSD} variants with 0.01 confidence threshold reach a much higher open set error and a higher recall. This behaviour is expected as more and worse predictions are included. All plotted variants show a similar behaviour that is in line with previously reported figures, such as the ones in Miller et al.~\cite{Miller2018}.

\subsection{Macro Averaging}

\begin{table}[t]
\begin{tabular}{rcccc}
\hline
Forward & max & abs OSE & Recall & Precision\\
Passes & \(F_1\) Score & \multicolumn{3}{c}{at max \(F_1\) point} \\
\hline
\gls{vanilla} \gls{SSD} - 0.01 conf & 0.370 & 1426 & 0.328 & 0.424 \\
\gls{vanilla} \gls{SSD} - 0.2 conf & \textbf{0.375} & 1218 & \textbf{0.338} & 0.424 \\
\gls{SSD} with entropy test - 0.01 conf & 0.370 & 1373 & 0.329 & \textbf{0.425} \\
% \gls{entropy} thresh: 1.7 for \gls{vanilla} \gls{SSD} is best
\hline
Bay. \gls{SSD} - no DO - 0.2 conf - no \gls{NMS} \; 10 & 0.226 & \textbf{809} & 0.229 & 0.224 \\
no dropout - 0.2 conf - \gls{NMS} \; 10 & 0.363 & 1057 & 0.321 & 0.420 \\
0.9 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.355 & 1137 & 0.320 & 0.399 \\
0.5 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.322 & 1264 & 0.307 & 0.340 \\
% \gls{entropy} thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3
% \gls{entropy} thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7
% 1.7 for 8, 2.0 for 9
\hline
\end{tabular}
\caption{Rounded results for macro averaging. \gls{SSD} with entropy test and Bayesian \gls{SSD} are represented with their best performing \gls{entropy} threshold with respect to \(F_1\) score. Vanilla \gls{SSD} with entropy test performed best with an \gls{entropy} threshold of 1.7, Bayesian \gls{SSD} without \gls{NMS} performed best for 1.5, and Bayesian \gls{SSD} with \gls{NMS} performed best for 1.5 as \gls{entropy} threshold.
Bayesian \gls{SSD} with dropout enabled and 0.9 keep ratio performed best for 1.7 as \gls{entropy} threshold; the variant with 0.5 keep ratio performed best for 2.0 as threshold.}
\label{tab:results-macro}
\end{table}

\begin{figure}[ht]
\begin{minipage}[t]{0.48\textwidth}
\includegraphics[width=\textwidth]{ose-f1-all-macro}
\caption{Macro averaged \(F_1\) score versus open set error for each variant. Perfect performance is an \(F_1\) score of 1 and an absolute \gls{OSE} of 0.}
\label{fig:ose-f1-macro}
\end{minipage}%
\hfill
\begin{minipage}[t]{0.48\textwidth}
\includegraphics[width=\textwidth]{precision-recall-all-macro}
\caption{Macro averaged precision-recall curves for each variant tested.}
\label{fig:precision-recall-macro}
\end{minipage}
\end{figure}

Vanilla \gls{SSD} with a per class confidence threshold of 0.2 performs best (see table \ref{tab:results-macro}) with respect to the maximum \(F_1\) score (0.375) and recall at the maximum \(F_1\) point (0.338). In comparison, the \gls{SSD} with an \gls{entropy} test slightly outperforms the 0.2 variant with respect to precision (0.425). Additionally, this is the best precision overall. Among the \gls{vanilla} \gls{SSD} variants, the 0.2 variant also has the lowest open set error (1218).

The comparison of the \gls{vanilla} \gls{SSD} variants with a confidence threshold of 0.01 shows no significant impact of an \gls{entropy} test. Only the open set error is lower, but in an insignificant way. The rest of the performance metrics are almost identical after rounding.

The results for Bayesian \gls{SSD} show a significant impact of \gls{NMS} or the lack thereof: a maximum \(F_1\) score of 0.363 (with \gls{NMS}) versus 0.226 (without \gls{NMS}). Dropout was disabled in both cases, making them effectively \gls{vanilla} \gls{SSD} with multiple forward passes. With an open set error of 809, the Bayesian \gls{SSD} variant with disabled dropout and without \gls{NMS} offers the best performance with respect to the open set error. The variant without dropout and enabled \gls{NMS} has the best \(F_1\) score (0.363), the best precision (0.420), and the best recall (0.321) of all Bayesian variants. Dropout decreases the performance of the network: this can be seen in the lower \(F_1\) scores, the higher open set error, and the lower precision and recall values. However, all variants with multiple forward passes have a lower open set error than all \gls{vanilla} \gls{SSD} variants.

The relation of \(F_1\) score to absolute open set error can be observed in figure \ref{fig:ose-f1-macro}. Precision-recall curves for all variants can be seen in figure \ref{fig:precision-recall-macro}. Both \gls{vanilla} \gls{SSD} variants with 0.01 confidence threshold reach a much higher open set error and a higher recall. This behaviour is expected as more and worse predictions are included. All plotted variants show a similar behaviour that is in line with previously reported figures, such as the ones in Miller et al.~\cite{Miller2018}.

\subsection{Class-specific Results}

As mentioned before, the data set is imbalanced with respect to its classes: four classes make up roughly 50\% of all ground truth detections. Therefore, it is interesting to see the performance of the tested variants with respect to these classes: persons, cars, chairs, and bottles. Additionally, the results of the giraffe class are presented as these are exceptionally good, although the class makes up only 0.7\% of the ground truth.
With this share, it is below the average of roughly 0.9\% for each of the 56 classes that make up the second half of the ground truth. In some cases, multiple variants have seemingly the same performance but only one or some of them are marked bold. This is informed by differences prior to rounding. If two or more variants are marked bold they had the exact same performance before rounding. \begin{table}[tbp] \begin{tabular}{rccc} \hline Forward & max & Recall & Precision\\ Passes & \(F_1\) Score & \multicolumn{2}{c}{at max \(F_1\) point} \\ \hline \gls{vanilla} \gls{SSD} - 0.01 conf & 0.460 & \textbf{0.405} & 0.532 \\ \gls{vanilla} \gls{SSD} - 0.2 conf & \textbf{0.460} & \textbf{0.405} & \textbf{0.533} \\ \gls{SSD} with entropy test - 0.01 conf & 0.460 & 0.405 & 0.532 \\ % \gls{entropy} thresh: 1.7 for \gls{vanilla} \gls{SSD} is best \hline Bay. \gls{SSD} - no DO - 0.2 conf - no \gls{NMS} \; 10 & 0.272 & 0.292 & 0.256 \\ no dropout - 0.2 conf - \gls{NMS} \; 10 & 0.451 & 0.403 & 0.514 \\ 0.9 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.447 & 0.401 & 0.505 \\ 0.5 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.410 & 0.368 & 0.465 \\ % \gls{entropy} thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3 % \gls{entropy} thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7 % 1.7 for 8, 2.0 for 9 \hline \end{tabular} \caption{Rounded results for persons class. \gls{SSD} with entropy test and Bayesian \gls{SSD} are represented with their best performing macro averaging \gls{entropy} threshold with respect to \(F_1\) score.} \label{tab:results-persons} \end{table} The vanilla \gls{SSD} variant with 0.2 per class confidence threshold performs best in the persons class with a max \(F_1\) score of 0.460, as well as recall of 0.405 and precision of 0.533 at the max \(F_1\) score. It shares the first place in recall with the \gls{vanilla} \gls{SSD} variant using 0.01 confidence threshold. All Bayesian \gls{SSD} variants perform worse than the \gls{vanilla} \gls{SSD} variants (see table \ref{tab:results-persons}). With respect to the macro averaged result, all variants perform better than the average of all classes. \begin{table}[tbp] \begin{tabular}{rccc} \hline Forward & max & Recall & Precision\\ Passes & \(F_1\) Score & \multicolumn{2}{c}{at max \(F_1\) point} \\ \hline \gls{vanilla} \gls{SSD} - 0.01 conf & 0.364 & \textbf{0.305} & 0.452 \\ \gls{vanilla} \gls{SSD} - 0.2 conf & 0.363 & 0.294 & \textbf{0.476} \\ \gls{SSD} with entropy test - 0.01 conf & \textbf{0.364} & \textbf{0.305} & 0.453 \\ % \gls{entropy} thresh: 1.7 for \gls{vanilla} \gls{SSD} is best \hline Bay. \gls{SSD} - no DO - 0.2 conf - no \gls{NMS} \; 10 & 0.236 & 0.244 & 0.229 \\ no dropout - 0.2 conf - \gls{NMS} \; 10 & 0.336 & 0.266 & 0.460 \\ 0.9 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.332 & 0.262 & 0.454 \\ 0.5 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.309 & 0.264 & 0.374 \\ % \gls{entropy} thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3 % \gls{entropy} thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7 % 1.7 for 8, 2.0 for 9 \hline \end{tabular} \caption{Rounded results for cars class. \gls{SSD} with entropy test and Bayesian \gls{SSD} are represented with their best performing macro averaging \gls{entropy} threshold with respect to \(F_1\) score. } \label{tab:results-cars} \end{table} The performance for cars is slightly different (see table \ref{tab:results-cars}): the \gls{vanilla} \gls{SSD} variant with \gls{entropy} threshold and 0.01 confidence threshold has the best \(F_1\) score and recall. 
Vanilla \gls{SSD} with 0.2 confidence threshold, however, has the best precision. Both the Bayesian \gls{SSD} variant with \gls{NMS} and disabled dropout, and the one with 0.9 keep ratio have a better precision (0.460 and 0.454 respectively) than the \gls{vanilla} \gls{SSD} variants with 0.01 confidence threshold (0.452 and 0.453). With respect to the macro averaged result, all variants have a better precision than the average and the Bayesian variant without \gls{NMS} and dropout also has a better recall and \(F_1\) score. \begin{table}[tbp] \begin{tabular}{rccc} \hline Forward & max & Recall & Precision\\ Passes & \(F_1\) Score & \multicolumn{2}{c}{at max \(F_1\) point} \\ \hline \gls{vanilla} \gls{SSD} - 0.01 conf & 0.287 & \textbf{0.251} & 0.335 \\ \gls{vanilla} \gls{SSD} - 0.2 conf & 0.283 & 0.242 & 0.341 \\ \gls{SSD} with entropy test - 0.01 conf & \textbf{0.288} & \textbf{0.251} & 0.338 \\ % \gls{entropy} thresh: 1.7 for \gls{vanilla} \gls{SSD} is best \hline Bay. \gls{SSD} - no DO - 0.2 conf - no \gls{NMS} \; 10 & 0.172 & 0.168 & 0.178 \\ no dropout - 0.2 conf - \gls{NMS} \; 10 & 0.280 & 0.229 & \textbf{0.360} \\ 0.9 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.274 & 0.228 & 0.343 \\ 0.5 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.240 & 0.220 & 0.265 \\ % \gls{entropy} thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3 % \gls{entropy} thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7 % 1.7 for 8, 2.0 for 9 \hline \end{tabular} \caption{Rounded results for chairs class. \gls{SSD} with entropy test and Bayesian \gls{SSD} are represented with their best performing macro averaging \gls{entropy} threshold with respect to \(F_1\) score. } \label{tab:results-chairs} \end{table} The best \(F_1\) score (0.288) and recall (0.251) for the chairs class belongs to \gls{vanilla} \gls{SSD} with \gls{entropy} threshold. Precision is mastered by Bayesian \gls{SSD} with \gls{NMS} and disabled dropout (0.360). The variant with 0.9 keep ratio has the second-highest precision (0.343) of all variants. Both in \(F_1\) score and recall all Bayesian variants are worse than the \gls{vanilla} variants. Compared with the macro averaged results, all variants perform worse than the average. \begin{table}[tbp] \begin{tabular}{rccc} \hline Forward & max & Recall & Precision\\ Passes & \(F_1\) Score & \multicolumn{2}{c}{at max \(F_1\) point} \\ \hline \gls{vanilla} \gls{SSD} - 0.01 conf & 0.233 & \textbf{0.175} & 0.348 \\ \gls{vanilla} \gls{SSD} - 0.2 conf & 0.231 & 0.173 & \textbf{0.350} \\ \gls{SSD} with entropy test - 0.01 conf & \textbf{0.233} & \textbf{0.175} & 0.350 \\ % \gls{entropy} thresh: 1.7 for \gls{vanilla} \gls{SSD} is best \hline Bay. \gls{SSD} - no DO - 0.2 conf - no \gls{NMS} \; 10 & 0.160 & 0.140 & 0.188 \\ no dropout - 0.2 conf - \gls{NMS} \; 10 & 0.224 & 0.170 & 0.328 \\ 0.9 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.220 & 0.170 & 0.311 \\ 0.5 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.202 & 0.172 & 0.245 \\ % \gls{entropy} thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3 % \gls{entropy} thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7 % 1.7 for 8, 2.0 for 9 \hline \end{tabular} \caption{Rounded results for bottles class. \gls{SSD} with entropy test and Bayesian \gls{SSD} are represented with their best performing macro averaging \gls{entropy} threshold with respect to \(F_1\) score. } \label{tab:results-bottles} \end{table} Bottles show similar performance to cars with overall lower numbers (see table \ref{tab:results-bottles}). Again, all Bayesian variants are worse than all vanilla variants. 
The Bayesian \gls{SSD} variant with \gls{NMS} and disabled dropout has the best \(F_1\) score (0.224) and precision (0.328) among the Bayesian variants; the variant with 0.5 keep ratio has the best recall (0.172). All variants perform worse than in the averaged results.

\begin{table}[tbp]
\begin{tabular}{rccc}
\hline
Forward & max & Recall & Precision\\
Passes & \(F_1\) Score & \multicolumn{2}{c}{at max \(F_1\) point} \\
\hline
\gls{vanilla} \gls{SSD} - 0.01 conf & \textbf{0.650} & \textbf{0.647} & \textbf{0.655} \\
\gls{vanilla} \gls{SSD} - 0.2 conf & \textbf{0.650} & \textbf{0.647} & \textbf{0.655} \\
\gls{SSD} with entropy test - 0.01 conf & \textbf{0.650} & \textbf{0.647} & \textbf{0.655} \\
% \gls{entropy} thresh: 1.7 for \gls{vanilla} \gls{SSD} is best
\hline
Bay. \gls{SSD} - no DO - 0.2 conf - no \gls{NMS} \; 10 & 0.415 & 0.414 & 0.417 \\
no dropout - 0.2 conf - \gls{NMS} \; 10 & 0.647 & 0.642 & 0.654 \\
0.9 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.637 & 0.634 & 0.642 \\
0.5 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.586 & 0.578 & 0.596 \\
% \gls{entropy} thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3
% \gls{entropy} thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7
% 1.7 for 8, 2.0 for 9
\hline
\end{tabular}
\caption{Rounded results for giraffe class. \gls{SSD} with entropy test and Bayesian \gls{SSD} are represented with their best performing macro averaging \gls{entropy} threshold with respect to \(F_1\) score.}
\label{tab:results-giraffes}
\end{table}

Last but not least, the giraffe class (see table \ref{tab:results-giraffes}) is analysed. Remarkably, all three \gls{vanilla} \gls{SSD} variants have identical performance, even before rounding. The Bayesian variant with \gls{NMS} and disabled dropout outperforms all the other Bayesian variants with an \(F_1\) score of 0.647, a recall of 0.642, and a precision of 0.654. All variants perform better than in the macro averaged result.

\subsection{Qualitative Analysis}

This subsection compares \gls{vanilla} \gls{SSD} with Bayesian \gls{SSD} with respect to specific images that illustrate similarities and differences between both approaches. For this comparison, a 0.2 confidence threshold is applied. Furthermore, the compared Bayesian \gls{SSD} variant uses \gls{NMS} and dropout with a 0.9 keep ratio.

\begin{figure}
\begin{minipage}[t]{0.48\textwidth}
\includegraphics[width=\textwidth]{COCO_val2014_000000336587_bboxes_vanilla}
\caption{Image with stop sign and truck at the right edge. Ground truth in blue, predictions in red with confidences rounded to three digits. Predictions are from \gls{vanilla} \gls{SSD}.}
\label{fig:stop-sign-truck-vanilla}
\end{minipage}%
\hfill
\begin{minipage}[t]{0.48\textwidth}
\includegraphics[width=\textwidth]{COCO_val2014_000000336587_bboxes_bayesian}
\caption{Image with stop sign and truck at the right edge. Ground truth in blue, predictions in red with confidences rounded to three digits. Predictions are from Bayesian \gls{SSD} with 0.9 keep ratio.}
\label{fig:stop-sign-truck-bayesian}
\end{minipage}
\end{figure}

The ground truth only contains a stop sign and a truck. The differences between \gls{vanilla} \gls{SSD} and Bayesian \gls{SSD} are barely visible (see figures \ref{fig:stop-sign-truck-vanilla} and \ref{fig:stop-sign-truck-bayesian}): the truck is detected by neither \gls{vanilla} nor Bayesian \gls{SSD}; instead, both detect a potted plant and a traffic light. The stop sign is detected by both variants. This behaviour implies problems with detecting objects at the edge that overwhelmingly lie outside the image frame.
Furthermore, the predictions are usually identical.

\begin{figure}
\begin{minipage}[t]{0.48\textwidth}
\includegraphics[width=\textwidth]{COCO_val2014_000000403817_bboxes_vanilla}
\caption{Image with a cat and laptop/TV. Ground truth in blue, predictions in red with confidences rounded to three digits. Predictions are from \gls{vanilla} \gls{SSD}.}
\label{fig:cat-laptop-vanilla}
\end{minipage}%
\hfill
\begin{minipage}[t]{0.48\textwidth}
\includegraphics[width=\textwidth]{COCO_val2014_000000403817_bboxes_bayesian}
\caption{Image with a cat and laptop/TV. Ground truth in blue, predictions in red with confidences rounded to three digits. Predictions are from Bayesian \gls{SSD} with 0.9 keep ratio.}
\label{fig:cat-laptop-bayesian}
\end{minipage}
\end{figure}

Another example (see figures \ref{fig:cat-laptop-vanilla} and \ref{fig:cat-laptop-bayesian}) is a cat with a laptop/TV in the background on the right side. Both variants detect a cat, but the \gls{vanilla} variant detects a dog as well. The laptop and TV are not detected, but this is expected since these classes were not trained.

\chapter{Discussion and Outlook}
\label{chap:discussion}

First, the results are discussed; then possible future research and open questions are addressed.

\section*{Discussion}

The results clearly do not support the hypothesis: \textit{Dropout sampling delivers better object detection performance under open set conditions compared to object detection without it}. With the exception of the open set error, there is no area where dropout sampling performs better than \gls{vanilla} \gls{SSD}. In the remainder of this section the individual results will be interpreted.

\subsection*{Impact of Averaging}

Micro and macro averaging create largely similar results. Notably, micro averaging has a significant performance increase towards the end of the list of predictions. This is signalled by the near horizontal movement of the plot in both the \(F_1\) score versus absolute open set error graph (see figure \ref{fig:ose-f1-micro}) and the precision-recall curve (see figure \ref{fig:precision-recall-micro}). This behaviour is caused by a large imbalance of detections between the classes. For \gls{vanilla} \gls{SSD} with 0.2 confidence threshold there are a total of 36,863 detections after \gls{NMS} and top \(k\). The persons class contributes 14,640 detections or around 40\% to that number. Another strong class is cars with 2,252 detections or around 6\%. In third place come chairs with 1,352 detections or around 4\%. This means that these three classes together have roughly as many detections as the remaining 57 classes combined.

In macro averaging, the cumulative precision and recall values are calculated per class and then averaged across all classes. Smaller classes quickly reach high recall values as their total number of ground truth boxes is small as well. The last recall and precision values of the smaller classes are repeated to achieve homogeneity with the largest class. As a consequence, the average recall is quite high early on. Later on, only the values of the largest class still change, which has only a small impact on the overall result. Conversely, in micro averaging the cumulative true positives are added up across classes and then divided by the total number of ground truth boxes. Here, the effect is the opposite: the total number of ground truth boxes is very large, which means the combined true positives of the 57 smaller classes have only a small impact on the average recall.
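The difference between the two averaging schemes can be summarised by the following sketch, which computes precision and recall for a single operating point. The thesis evaluates cumulative precision-recall curves, so this is a simplification that only illustrates the weighting effect; the numbers in the toy example are made up.

\begin{verbatim}
import numpy as np

def precision_recall(tp, fp, n_ground_truth):
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / n_ground_truth if n_ground_truth > 0 else 0.0
    return precision, recall

def macro_average(per_class):
    # average the per class values: every class weighs the same
    values = [precision_recall(tp, fp, gt) for tp, fp, gt in per_class]
    return tuple(np.mean(values, axis=0))

def micro_average(per_class):
    # sum the counts first: every detection weighs the same
    tp = sum(c[0] for c in per_class)
    fp = sum(c[1] for c in per_class)
    gt = sum(c[2] for c in per_class)
    return precision_recall(tp, fp, gt)

# made-up example: one large class and one small class,
# entries are (true positives, false positives, number of ground truth boxes)
classes = [(1000, 500, 3000), (5, 1, 10)]
print(macro_average(classes))  # approx. (0.75, 0.42): the small class counts fully
print(micro_average(classes))  # approx. (0.67, 0.33): dominated by the large class
\end{verbatim}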
As a result of this pooling, the open set error rises more quickly than the \(F_1\) score, creating the sharp rise of the open set error at a lower \(F_1\) score than in macro averaging. The open set error reaches a high value early on and changes little afterwards. This allows the \(F_1\) score to catch up and produces the almost horizontal line in the graph. Eventually, the \(F_1\) score decreases again while the open set error continues to rise slightly.
\subsection*{Impact of Entropy}
There is no visible impact of \gls{entropy} thresholding on the object detection performance for \gls{vanilla} \gls{SSD}. This indicates that the network produces almost no uniform or close to uniform predictions: the vast majority of predictions have a high confidence for a single class---including the background class. However, the \gls{entropy} plays a larger role for the Bayesian variants---as expected: the best performing thresholds are 1.0, 1.3, and 1.4 for micro averaging, and 1.5, 1.7, and 2.0 for macro averaging. In all of these cases the best threshold is not the largest threshold tested. This is caused by a simple phenomenon: at some point most or all true positives have already been included and a higher \gls{entropy} threshold only adds more false positives. Such behaviour is indicated by a stagnating recall for the higher \gls{entropy} thresholds. For low \gls{entropy} thresholds, the low recall dominates the \(F_1\) score; the sweet spot lies somewhere in the middle. For macro averaging, a higher optimal \gls{entropy} threshold indicates worse performance.
\subsection*{Non-Maximum Suppression and Top \(k\)}
\begin{table}[tbp]
\centering
\begin{tabular}{rccc}
\hline
variant & before & after & after \\
 & \gls{entropy}/NMS & \gls{entropy}/NMS & top \(k\) \\
\hline
Bay. \gls{SSD}, no dropout, no \gls{NMS} & 155,251 & 122,868 & 72,207 \\
no dropout, \gls{NMS} & 155,250 & 36,061 & 33,827 \\
\hline
\end{tabular}
\caption{Comparison of Bayesian \gls{SSD} variants without dropout with respect to the number of detections before the \gls{entropy} threshold, after the \gls{entropy} threshold and \gls{NMS} (if enabled), and after top \(k\). An \gls{entropy} threshold of 1.5 was used for both variants.}
\label{tab:effect-nms}
\end{table}
Miller et al.~\cite{Miller2018} presumably do not use \gls{NMS} in their implementation of dropout sampling. Therefore, a variant with disabled \glslocalreset{NMS} \gls{NMS} has been tested. The results are largely as expected: \gls{NMS} removes all non-maximum detections that overlap with a maximum one. This reduces the number of multiple detections per ground truth bounding box and therefore the number of false positives. Without it, many more false positives remain and have a negative impact on precision. In combination with the top \(k\) selection, recall can be affected as well: duplicate detections may be kept while maximum boxes are pushed out. The number of observations has been measured before and after the combination of \gls{entropy} threshold and \gls{NMS} filter: Bayesian \gls{SSD} without \gls{NMS} and dropout, and Bayesian \gls{SSD} with \gls{NMS} and disabled dropout have virtually the same number of observations before the \gls{entropy} threshold. After the \gls{entropy} threshold (a value of 1.5 has been used for both) and \gls{NMS}, the variant with \gls{NMS} has roughly 23\% of its observations left (see table \ref{tab:effect-nms} for absolute numbers). Without \gls{NMS}, 79\% of the observations are left.
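Both percentages refer to the combination of the \gls{entropy} test and, where enabled, \gls{NMS}. The \gls{entropy} test itself can be sketched in a few lines: the Shannon entropy of the averaged class distribution of each observation is compared against the threshold. The snippet below is only an illustration of this principle; the names, array shapes, and the use of the natural logarithm are assumptions made for the example rather than details taken from the actual implementation.
\begin{verbatim}
import numpy as np

def entropy_filter(class_probs, threshold):
    """Keep observations whose averaged class distribution has an
    entropy below the given threshold.

    class_probs: array of shape (num_observations, num_classes)
    with softmax scores averaged over the forward passes."""
    eps = 1e-12  # avoids log(0) for classes with zero probability
    entropy = -np.sum(class_probs * np.log(class_probs + eps), axis=1)
    keep = entropy < threshold
    return class_probs[keep], keep
\end{verbatim}
For reference, a perfectly uniform distribution over \(n\) classes reaches the maximum entropy of \(\ln n\), whereas a near one-hot prediction has an entropy close to zero; the best performing thresholds reported above (between 1.0 and 2.0) lie between these two extremes.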
Moreover, many classes have more observations after the \gls{entropy} threshold and per class confidence threshold than before. This is explained by the background observations: they make up around 70\% of the initial observations, while only 21\% of the initial observations are removed in total. Irrespective of the absolute numbers, this discrepancy clearly shows the impact of \gls{NMS} and also explains a higher count of false positives: more than 50\% of the original observations are removed with \gls{NMS} and remain without it---and all of these are very likely false positives. A clear distinction between micro and macro averaging can be observed: recall is hardly affected with micro averaging (0.300) but goes down noticeably with macro averaging (0.229). For micro averaging, it does not matter which class the true positives belong to: every detection counts the same way. This also means that top \(k\) will have only a marginal effect: some true positives might be removed without \gls{NMS} but overall that does not have a big impact. With macro averaging, however, the class of the true positives matters a lot: for example, if two true positives are removed from a class with only a few true positives to begin with, then their removal will have a drastic influence on the class recall value and hence the overall result. The impact of top \(k\) has been measured by counting the number of observations after top \(k\) is applied: the variant with \gls{NMS} keeps about 94\% of the observations left after \gls{NMS}; without \gls{NMS}, only about 59\% of the observations are kept. This shows a significant impact of top \(k\) on the result in the case of disabled \gls{NMS}. Furthermore, with disabled \gls{NMS} some classes are hit harder by top \(k\) than others: for example, dogs keep around 82\% of their observations but persons only 57\%. This indicates that detected dogs mostly appear in images with few detections overall and/or have a high enough prediction confidence to be kept by top \(k\). Persons, however, likely often appear in images with many detections and/or have confidences that are too low. In this example, the likelihood for true positives to be removed in the person category is quite high; for dogs, the probability is far lower. This is a good example of the difference between micro and macro averaging and their impact on recall.
\subsection*{Dropout Sampling and Observations}
\begin{table}[tbp]
\centering
\begin{tabular}{rcc}
\hline
variant & after & after \\
 & prediction & observation grouping \\
\hline
Bay. \gls{SSD}, no dropout, \gls{NMS} & 1,677,050 & 155,250 \\
0.9 keep ratio, \gls{NMS} & 1,617,675 & 549,166 \\
\hline
\end{tabular}
\caption{Comparison of the Bayesian \gls{SSD} variants without dropout and with a 0.9 keep ratio with respect to the number of detections directly after the network predictions and after the observation grouping.}
\label{tab:effect-dropout}
\end{table}
The dropout variants show largely worse performance than the Bayesian variants without dropout. This is expected as the network was not trained with dropout and the weights are not prepared for it. Gal~\cite{Gal2017} shows that networks \textbf{trained} with dropout are approximate Bayesian models. The Bayesian variants of \gls{SSD} implemented for this thesis are not fine-tuned or trained with dropout; therefore, they are not guaranteed to be such approximate models. But dropout alone does not explain the difference in results. Both variants with and without dropout have the exact same number of detections coming out of the network (8732 per image per forward pass).
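To clarify where the large number of raw detections discussed below comes from, the following sketch summarises the basic mechanism of dropout sampling at test time: dropout stays active during inference and the detections of several forward passes are collected before they are grouped into observations. The function name, the callable, and the array shapes are hypothetical placeholders for illustration, not the actual implementation of this thesis.
\begin{verbatim}
import numpy as np

def dropout_sampling(predict_with_dropout, image, forward_passes=10):
    """Collect the detections of several stochastic forward passes.

    predict_with_dropout: hypothetical callable that runs the network
    with dropout kept active and returns an array of shape
    (num_boxes, 4 + num_classes) per call; for SSD num_boxes is 8732.
    """
    all_detections = [predict_with_dropout(image)
                      for _ in range(forward_passes)]
    # Stack to shape (forward_passes * num_boxes, 4 + num_classes).
    # These detections are subsequently grouped into observations
    # and their class scores are averaged per observation.
    return np.concatenate(all_detections, axis=0)
\end{verbatim}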
With 16 images in a batch, 308 batches, and 10 forward passes, the total number of detections adds up to an astounding 430,312,960. As such a large number could not be handled in memory, only one batch is calculated at a time. That still leaves 1,397,120 detections per batch. These have to be grouped into observations, which includes a calculation of mutual IOU scores that is quadratic in the number of detections. Therefore, these detections are filtered by removing all those with background confidence levels of 0.8 or higher. The number of detections per class has been measured before and after the detections are grouped into observations. To this end, the stored predictions are unbatched and summed together. After the aforementioned filter and before the grouping, roughly 0.4\% (in fact slightly less) of the more than 430 million detections remain (see table \ref{tab:effect-dropout} for absolute numbers). The variant with dropout has slightly fewer predictions left than the one without dropout. After the grouping, the variant without dropout has on average between 10 and 11 detections grouped into an observation. This is expected as every forward pass creates the exact same result and these ten identical detections per \gls{vanilla} \gls{SSD} detection perfectly overlap. The fact that slightly more than ten detections are grouped together could explain the marginally better precision of the Bayesian variant without dropout compared to \gls{vanilla} \gls{SSD}. However, if dropout with a 0.9 keep ratio is enabled, on average only three detections are grouped together into an observation. This does not negatively impact recall, as true positives do not disappear, but it offers a higher chance of false positives. This can be observed in the results, which clearly show no negative impact on recall between the variant without dropout and the one with a 0.9 keep ratio. This behaviour implies that even a slight usage of dropout perturbs the anchor box offsets enough that the resulting detections from multiple forward passes no longer have a mutual IOU score of 0.95 or higher.
\section*{Outlook}
The attempted replication of the work of Miller et al. raises a series of questions that cannot be answered in this thesis. This thesis offers one possible implementation of dropout sampling that technically works. However, it cannot answer why the results of this implementation differ significantly from those of Miller et al. The complete source code or otherwise exhaustive implementation details of Miller et al. would be required to attempt an answer. Future work could explore the performance of this implementation when used on an \gls{SSD} variant that was fine-tuned or trained with dropout. In that case, it should also look into the impact of training with both dropout and batch normalisation. Other avenues include the application to other data sets or object detection networks. To facilitate future work based on this thesis, the source code will be made available and an installable Python package will be uploaded to the PyPI package index. The appendices contain further details about the source code implementation as well as additional figures.