From d84d65ddd6e2c744406dbad4371d591d1110d6b9 Mon Sep 17 00:00:00 2001
From: Jim Martens
Date: Sun, 29 Sep 2019 15:08:03 +0200
Subject: [PATCH] Complete pass over thesis

- added more glossary terms
- fixed tenses
- removed wrong plural forms of open set error
- removed unsupported claims
- improved wording and language

Signed-off-by: Jim Martens

---
 body.tex     | 469 +++++++++++++++++++++++++--------------------------
 glossary.tex |  33 ++++
 2 files changed, 265 insertions(+), 237 deletions(-)

diff --git a/body.tex b/body.tex
index 286e97a..5942b49 100644
--- a/body.tex
+++ b/body.tex
@@ -2,7 +2,7 @@
 \chapter{Introduction}
 
-The introduction will explain the wider context first, before
+The introduction first explains the wider context before
 providing technical details.
 
 \subsection*{Motivation}
@@ -64,12 +64,12 @@ classify what they know. If presented with an object type that
 the network was not trained with, as happens frequently in real
 environments, it will still classify the object and might even
 have a high confidence in doing so. Such an example would be
-a false positive. Any ordinary person who uses the results of
-such a network would falsely assume that a high confidence always
-means the classification is very likely correct. If they use
-a proprietary system they might not even be able to find out
+a false positive. Anyone who uses the results of
+such a network could falsely assume that a high confidence always
+means the classification is very likely correct. If one uses
+a proprietary system, one might not even be able to find out
 that the network was never trained on a particular type of object.
-Therefore, it would be impossible for them to identify the output
+Therefore, it would be impossible for one to identify the output
 of the network as a false positive.
 
 This reaffirms the need for automatic explanation. Such a system
@@ -105,7 +105,7 @@ training these auto-encoders learn to reproduce a certain group
 of object classes. The actual novelty detection takes place
 during testing: given an image, and the output and loss of the
 auto-encoder, a novelty score is calculated. For some novelty
-detection approaches the reconstruction loss is exactly the novelty
+detection approaches the reconstruction loss is the novelty
 score; others consider more factors. A low novelty score signals a
 known object. The opposite is true for a high novelty score.
@@ -120,10 +120,10 @@ Therefore, a comparison between model uncertainty with a network like
 SSD and novelty detection with auto-encoders is considered
 out of scope for this thesis.
 
-Miller et al.~\cite{Miller2018} used an \gls{SSD} pre-trained on COCO
+Miller et al.~\cite{Miller2018} use an \gls{SSD} pre-trained on COCO
 without further fine-tuning on the SceneNet RGB-D data
-set~\cite{McCormac2017} and reported good results regarding
-open set error for an \gls{SSD} variant with dropout sampling and entropy
+set~\cite{McCormac2017} and report good results regarding
+\gls{OSE} for an \gls{SSD} variant with dropout sampling and \gls{entropy}
 thresholding. If their results are generalisable, it should be possible
 to replicate the relative difference between the variants on the COCO
 data set.
@@ -131,16 +131,16 @@ This leads to the following hypothesis: \emph{Dropout sampling
 delivers better object detection performance under open set
 conditions compared to object detection without it.}
 
-For the purpose of this thesis, I will use the \gls{vanilla} \gls{SSD} (as in: the original SSD) as
+For the purpose of this thesis, I use the \gls{vanilla} \gls{SSD} (as in: the original \gls{SSD}) as
 baseline to compare against. In particular, \gls{vanilla} \gls{SSD} uses
-a per-class confidence threshold of 0.01, an IOU threshold of 0.45
+a per class confidence threshold of 0.01, an IOU threshold of 0.45
 for the \gls{NMS}, and a top \(k\) value of 200. For this
-thesis, the top \(k\) value was changed to 20 and the confidence threshold
-of 0.2 was tried as well.
-The effect of an entropy threshold is measured against this \gls{vanilla}
-SSD by applying entropy thresholds from 0.1 to 2.4 inclusive (limits taken from
-Miller et al.). Dropout sampling is compared to \gls{vanilla} SSD
-with and without entropy thresholding.
+thesis, the top \(k\) value has been changed to 20 and the confidence threshold
+of 0.2 has been tried as well.
+The effect of an \gls{entropy} threshold is measured against this \gls{vanilla}
+SSD by applying \gls{entropy} thresholds from 0.1 to 2.4 inclusive (limits taken from
+Miller et al.). Dropout sampling is compared to \gls{vanilla} \gls{SSD}
+with and without \gls{entropy} thresholding.
 
 \paragraph{Hypothesis}
 Dropout sampling delivers better object detection performance under open set
@@ -151,7 +151,7 @@ conditions compared to object detection without it.
 
 First, chapter \ref{chap:background} presents related works and
 provides the background for dropout sampling. Afterwards, chapter
 \ref{chap:methods} explains how \gls{vanilla} \gls{SSD} works, how
-Bayesian \gls{SSD} extends \gls{vanilla} SSD, and how the decoding pipelines are
+Bayesian \gls{SSD} extends \gls{vanilla} \gls{SSD}, and how the decoding pipelines are
 structured.
 Chapter \ref{chap:experiments-results} presents the data sets,
 the experimental setup, and the results. This is followed by
@@ -164,9 +164,9 @@ Therefore, the contribution is found in chapters \ref{chap:methods},
 
 \label{chap:background}
 
-This chapter will begin with an overview over previous works
+This chapter begins with an overview of previous works
 in the field of this thesis. Afterwards, the theoretical foundations
-of dropout sampling will be explained.
+of dropout sampling are explained.
 
 \section{Related Works}
 
@@ -181,7 +181,7 @@ briefly introduced.
 
 \subsection{Overview of Types of Novelty Detection}
 
-Probabilistic approaches estimate the generative probability density function (pdf)
+Probabilistic approaches estimate the generative \gls{pdf}
 of the data. It is assumed that the training data is generated
 from an underlying probability distribution \(D\). This distribution
 can be estimated with the training data; the estimate is defined as \(\hat D\) and represents a model
@@ -194,7 +194,7 @@ Distance-based novelty detection uses either nearest
 neighbour-based approaches
 or clustering-based approaches (e.g. \(k\)-means clustering
 algorithm \cite{Jordan1994}). Both methods are similar to estimating the
-pdf of data, they use well-defined distance metrics to compute the distance
+\gls{pdf} of data: they use well-defined distance metrics to compute the distance
 between two data points.
 
 Domain-based novelty detection describes the boundary of the known data, rather
@@ -203,7 +203,7 @@ the boundary.
A common implementation for this uses support vector machines (e.g. as implemented
by Song et al. \cite{Song2002}).
Information-theoretic novelty detection computes the information content
-of a data set, for example, with metrics like entropy. Such metrics assume
+of a data set, for example, with metrics like \gls{entropy}. Such metrics assume
 that novel data inside the data set significantly alters the information
 content of an otherwise normal data set. First, the metrics are calculated over the
 whole data set. Afterwards, a subset is identified that causes the biggest
@@ -216,8 +216,8 @@ a recent approach.
 Reconstruction-based approaches use the reconstruction error in one
 form or another to calculate the novelty score. This can be auto-encoders
 that literally reconstruct the input but it also includes MLP networks which try
-to reconstruct the ground truth. Pimentel et al.~\cite{Pimentel2014} differentiated
-between neural network-based approaches and subspace methods. The first were
+to reconstruct the ground truth. Pimentel et al.~\cite{Pimentel2014} differentiate
+between neural network-based approaches and subspace methods. The former are
 further differentiated into MLPs, Hopfield networks, autoassociative
 networks, radial basis function, and self-organising networks.
 The remainder of this section focuses on MLP-based works; a particular focus will
@@ -225,40 +225,41 @@ be on the task of object detection and Bayesian networks.
 
 Novelty detection for object detection is intricately linked with
 open set conditions: the test data can contain unknown classes.
-Bishop~\cite{Bishop1994} investigated the correlation between
+Bishop~\cite{Bishop1994} investigates the correlation between
 the degree of novel input data and the reliability of network
 outputs, and introduces a quantitative way to measure novelty.
 
 The Bayesian approach provides a theoretical foundation for
 modelling uncertainty \cite{Ghahramani2015}.
-MacKay~\cite{MacKay1992} provided a practical Bayesian
-framework for backpropagation networks. Neal~\cite{Neal1996} built upon
-the work of MacKay and explored Bayesian learning for neural networks.
+MacKay~\cite{MacKay1992} provides a practical Bayesian
+framework for backpropagation networks. Neal~\cite{Neal1996} builds upon
+the work of MacKay and explores Bayesian learning for neural networks.
 However, these Bayesian neural networks do not scale well. Over the course
 of time, two major Bayesian approximations were introduced: one based
 on dropout and one based on batch normalisation.
 
-Gal and Ghahramani~\cite{Gal2016} showed that dropout training is a
+Gal and Ghahramani~\cite{Gal2016} show that dropout training is a
 Bayesian approximation of a Gaussian process. Subsequently, Gal~\cite{Gal2017}
-showed that dropout training actually corresponds to a general approximate
+shows that dropout training actually corresponds to a general approximate
 Bayesian model. This means every network trained with dropout is an approximate
 Bayesian model. During inference the dropout remains active; this form of
 inference is called Monte Carlo Dropout (MCDO).
-Miller et al.~\cite{Miller2018} built upon the work of Gal and Ghahramani: they
+Miller et al.~\cite{Miller2018} build upon the work of Gal and Ghahramani: they
 use MC dropout under open-set conditions for object detection.
-In a second paper \cite{Miller2018a}, Miller et al. continued their work and
-compared merging strategies for sampling-based uncertainty techniques in
+In a second paper \cite{Miller2018a}, Miller et al.
continue their work and
+compare merging strategies for sampling-based uncertainty techniques in
 object detection.
 
 Teye et al.~\cite{Teye2018} make the point that most modern networks have
 adopted other regularisation techniques. Ioffe and Szegedy~\cite{Ioffe2015}
-introduced batch normalisation which has been adapted widely. Teye et al.
-showed how batch normalisation training is similar to dropout and can be
+introduce batch normalisation, which has been adopted widely in the
+meantime. Teye et al.
+show how batch normalisation training is similar to dropout and can be
 viewed as an approximate Bayesian inference. Estimates of the model
 uncertainty can be gained with a technique named Monte Carlo Batch
 Normalisation (MCBN). Consequently, this technique can be applied to
 any network that utilises standard batch normalisation.
-Li et al.~\cite{Li2019} investigated the problem of poor performance
+Li et al.~\cite{Li2019} investigate the problem of poor performance
 when combining dropout and batch normalisation: dropout shifts the variance
 of a neural unit when switching from train to test, whereas batch normalisation
 does not change the variance. This inconsistency leads to a variance shift which
@@ -266,15 +267,15 @@ can have a larger or smaller impact based on the network used.
 
 Non-Bayesian approaches have been developed as well. Usually, they are compared
 with MC dropout and show better performance.
-Postels et al.~\cite{Postels2019} provided a sampling-free approach for
+Postels et al.~\cite{Postels2019} provide a sampling-free approach for
 uncertainty estimation that does not affect training and approximates the
-sampling at test time. They compared it to MC dropout and found less computational
+sampling at test time. They compare it to MC dropout and find less computational
 overhead with better results. Lakshminarayanan et al.~\cite{Lakshminarayanan2017}
-implemented a predictive uncertainty estimation using deep ensembles.
+implement a predictive uncertainty estimation using deep ensembles.
 Compared to MC dropout, it shows better results. Geifman et al.~\cite{Geifman2018}
-introduced an uncertainty estimation algorithm for non-Bayesian deep
+introduce an uncertainty estimation algorithm for non-Bayesian deep
 neural classification that estimates the uncertainty of highly confident
 points using earlier snapshots of the trained model and improves, among
 others, the approach introduced by Lakshminarayanan et al.
@@ -285,8 +286,8 @@ the predictions of a neural network are treated as subjective opinions.
 
 In addition to the aforementioned Bayesian and non-Bayesian works,
 there are some Bayesian works that do not quite fit with the rest
 but are important as well. Mukhoti and Gal~\cite{Mukhoti2018}
-contributed metrics to measure uncertainty for semantic
-segmentation. Wu et al.~\cite{Wu2019} introduced two innovations
+contribute metrics to measure uncertainty for semantic
+segmentation. Wu et al.~\cite{Wu2019} introduce two innovations
 that turn variational Bayes into a robust tool for Bayesian
 networks: first, a novel deterministic method to approximate
 moments in neural networks which eliminates gradient variance, and
@@ -311,7 +312,7 @@ procedure to select prior variances.
\(\mathcal{I}\) & an image \\
 \(\mathbf{q} = p(y|\mathcal{I}, \mathbf{T})\) & probability of all classes given image and training data \\
- \(H(\mathbf{q})\) & entropy over probability vector \\
+ \(H(\mathbf{q})\) & \gls{entropy} over probability vector \\
 \(\widetilde{\mathbf{W}}\) & weights sampled from \(p(\mathbf{W}|\mathbf{T})\) \\
 \(\mathbf{b}\) & bounding box coordinates \\
@@ -342,12 +343,12 @@ over the network weights, for example a Gaussian prior distribution:
 
 \(\mathbf{W}\) are the weights and \(I\) symbolises that every
 weight is drawn from an independent and identical distribution. The
 training of the network determines a plausible set of weights by
-evaluating the probability output (posterior) over the weights given
+evaluating the probability output (\gls{posterior}) over the weights given
 the training data \(\mathbf{T}\): \(p(\mathbf{W}|\mathbf{T})\).
 However, this evaluation cannot be performed in any
 reasonable time. Therefore, approximation techniques are
-required. In those techniques the posterior is fitted with a
+required. In those techniques the \gls{posterior} is fitted with a
 simple distribution \(q^{*}_{\theta}(\mathbf{W})\). The original
 and intractable problem of averaging over all weights in the
 network is replaced with an optimisation task, where the parameters of the
@@ -355,14 +356,14 @@ simple distribution are optimised over~\cite{Kendall2017}.
 
 \subsubsection*{Dropout Variational Inference}
 
-Kendall and Gal~\cite{Kendall2017} showed an approximation for
+Kendall and Gal~\cite{Kendall2017} show an approximation for
 classification and recognition tasks. Dropout variational inference
 is a practical approximation technique: dropout layers are added in
 front of every weight layer and are also used during test
-time to sample from the approximate posterior. Effectively, this
+time to sample from the approximate \gls{posterior}. Effectively, this
 results in the approximation of the class probability
-\(p(y|\mathcal{I}, \mathbf{T})\) by performing multiple forward
-passes through the network and averaging over the obtained Softmax
+\(p(y|\mathcal{I}, \mathbf{T})\) by performing \(n\) forward
+passes through the network and averaging the obtained softmax
 scores \(\mathbf{s}_i\), given an image \(\mathcal{I}\) and the training data \(\mathbf{T}\):
\begin{equation} \label{eq:drop-sampling}
p(y|\mathcal{I}, \mathbf{T}) = \int p(y|\mathcal{I}, \mathbf{W}) \cdot p(\mathbf{W}|\mathbf{T}) \, d\mathbf{W} \approx \frac{1}{n} \sum_{i=1}^{n} \mathbf{s}_i
\end{equation}
 
 With this dropout sampling technique, \(n\) model weights
-\(\widetilde{\mathbf{W}}_i\) are sampled from the posterior
+\(\widetilde{\mathbf{W}}_i\) are sampled from the \gls{posterior}
 \(p(\mathbf{W}|\mathbf{T})\). The class probability
 \(p(y|\mathcal{I}, \mathbf{T})\) is a probability vector
 \(\mathbf{q}\) over all class labels.
 Finally, the uncertainty of the network with respect to the
 classification is given by
-the entropy \(H(\mathbf{q}) = - \sum_i q_i \cdot \log q_i\).
+the \gls{entropy} \(H(\mathbf{q}) = - \sum_i q_i \cdot \log q_i\).
 
 \subsubsection*{Dropout Sampling for Object Detection}
 
 Miller et al.~\cite{Miller2018} apply dropout sampling to
 object detection. In that case \(\mathbf{W}\) represents the
-learned weights of a detection network like SSD~\cite{Liu2016}.
+learned weights of a detection network like \gls{SSD}~\cite{Liu2016}.
 Every forward pass uses a different network
 \(\widetilde{\mathbf{W}}\) which is approximately sampled from
 \(p(\mathbf{W}|\mathbf{T})\).
Each forward pass in object
@@ -398,20 +399,20 @@ Subsequently, the corresponding vector of class probabilities
 score vectors \(\mathbf{s}_j\) in a particular observation
 \(\mathcal{O}_i\): \(\overline{\mathbf{q}}_i \approx \overline{\mathbf{s}}_i = \frac{1}{n} \sum_{j=1}^{n} \mathbf{s}_j\).
 The label uncertainty of the detector for a particular observation is measured by
-the entropy \(H(\overline{\mathbf{q}}_i)\).
+the \gls{entropy} \(H(\overline{\mathbf{q}}_i)\).
 
-If \(\overline{\mathbf{q}}_i\), which I called averaged class probabilities,
-resembles a uniform distribution the entropy will be high. A uniform
+If \(\overline{\mathbf{q}}_i\)
+resembles a uniform distribution, the \gls{entropy} will be high. A uniform
 distribution means that no class is more likely than another, which
 is a perfect example of maximum uncertainty. Conversely, if
-one class has a very high probability the entropy will be low.
+one class has a very high probability, the \gls{entropy} will be low.
 
 In open set conditions it can be expected that falsely generated
 detections for unknown object classes have a higher label
-uncertainty. A threshold on the entropy \(H(\overline{\mathbf{q}}_i)\) can then
+uncertainty. A threshold on the \gls{entropy} \(H(\overline{\mathbf{q}}_i)\) can then
 be used to identify and reject these false positive cases.
 
-% SSD: \cite{Liu2016}
+% \gls{SSD}: \cite{Liu2016}
 % ImageNet: \cite{Deng2009}
 % COCO: \cite{Lin2014}
 % YCB: \cite{Xiang2017}
@@ -421,7 +422,7 @@
 
 \label{chap:methods}
 
-This chapter explains the functionality of \gls{vanilla} SSD, Bayesian SSD, and the decoding pipelines.
+This chapter explains the functionality of \gls{vanilla} \gls{SSD}, Bayesian \gls{SSD}, and the decoding pipelines.
 
 \section{Vanilla SSD}
 
@@ -439,23 +440,22 @@ image (always size 300x300) is divided up into anchor boxes. During
 training, each of these boxes is mapped to a ground truth box or
 background. For every anchor box, both the offset to the object and
 the class confidences are calculated. The output of the
-SSD network are the predictions with class confidences, offsets to the
+\gls{SSD} network are the predictions with class confidences, offsets to the
 anchor box, anchor box coordinates, and variance. The model loss is
 a weighted sum of localisation and confidence loss. As the network
 has a fixed number of anchor boxes, every forward
 pass creates the same number of detections---8732 in the case of
 \gls{SSD} 300x300.
-Notably, the object proposals are made in a single run for an image -
-single shot.
+Notably, the object proposals are made in a single run for an
+image---single shot.
 Other techniques like Faster R-CNN employ region proposals
-and pooling. For more detailed information on SSD, please refer to
+and pooling. For more detailed information on \gls{SSD}, please refer to
 Liu et al.~\cite{Liu2016}.
 
 \section{Bayesian SSD for Model Uncertainty}
 
 Networks trained with dropout are a general approximate Bayesian
 model~\cite{Gal2017}. As such, they can be used for everything a true
-Bayesian model could be used for. The idea is applied to \gls{SSD} in this
-thesis: two dropout layers are added to \gls{vanilla} SSD, after the layers fc6 and fc7 respectively (see figure \ref{fig:bayesian-ssd}).
+Bayesian model could be used for. This idea is applied to \gls{SSD} by Miller et al.: two dropout layers are added to \gls{vanilla} \gls{SSD}, after the layers fc6 and fc7 respectively (see figure \ref{fig:bayesian-ssd}).
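To make the sampling procedure concrete, the following sketch shows how \(n\) stochastic forward passes could be averaged into \(\mathbf{q}\) and scored with the \gls{entropy} \(H(\mathbf{q})\). It is a minimal illustration in Python with Keras and NumPy, assuming a classification model that returns softmax scores; it is a sketch of the technique, not the ssd\_keras code used later.

\begin{verbatim}
import numpy as np

def mc_dropout_predict(model, image, n=10):
    # Keras keeps dropout layers active when training=True is passed,
    # so every call samples a different set of weights W ~ p(W|T).
    scores = np.stack([model(image[None], training=True).numpy()[0]
                       for _ in range(n)])
    q = scores.mean(axis=0)                    # averaged softmax scores
    entropy = -np.sum(q * np.log(q + 1e-12))   # H(q) = -sum q_i log q_i
    return q, entropy
\end{verbatim}

A high \(H(\mathbf{q})\) marks an uncertain prediction; a threshold on it implements the rejection of false positives described above.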
\begin{figure} \centering @@ -465,23 +465,23 @@ thesis: two dropout layers are added to \gls{vanilla} SSD, after the layers fc6 \label{fig:bayesian-ssd} \end{figure} -Motivation for this is model uncertainty: an uncertain model will -predict different classes for the same object on the same image across -multiple forward passes. This uncertainty is measured with entropy: +Motivation for this is model uncertainty: for the same object on the same +image, an uncertain model will predict different classes across +multiple forward passes. This uncertainty is measured with \gls{entropy}: every forward pass results in predictions, these are partitioned into -observations, and subsequently their entropy is calculated. -A higher entropy indicates a more uniform distribution of confidences -whereas a lower entropy indicates a larger confidence in one class +observations, and subsequently their \gls{entropy} is calculated. +A higher \gls{entropy} indicates a more uniform distribution of confidences +whereas a lower \gls{entropy} indicates a larger confidence in one class and very low confidences in other classes. \subsection{Implementation Details} For this thesis, an \gls{SSD} implementation based on Tensorflow~\cite{Abadi2015} and Keras\footnote{\url{https://github.com/pierluigiferrari/ssd\_keras}} -was used. It was modified to support entropy thresholding, +is used. It has been modified to support \gls{entropy} thresholding, partitioning of observations, and dropout layers in the \gls{SSD} model. Entropy thresholding takes place before -the per-class confidence threshold is applied. +the per class confidence threshold is applied. The Bayesian variant was not fine-tuned and operates with the same weights that \gls{vanilla} \gls{SSD} uses as well. @@ -497,7 +497,7 @@ of \gls{SSD} used in the thesis. \subsection{Vanilla SSD} -Liu et al.~\cite{Liu2016} used Caffe for their original SSD +Liu et al.~\cite{Liu2016} used \gls{Caffe} for their original \gls{SSD} implementation. The decoding process contains largely two phases: decoding and filtering. Decoding transforms the relative coordinates predicted by \gls{SSD} into absolute coordinates. At this point @@ -519,17 +519,17 @@ original implementation uses a confidence threshold of \(0.01\), an IOU threshold for \gls{NMS} of \(0.45\) and a top \(k\) value of 200. -The \gls{vanilla} SSD -per-class confidence threshold and \gls{NMS} has one +The \gls{vanilla} \gls{SSD} +per class confidence threshold and \gls{NMS} has one weakness: even if \gls{SSD} correctly predicts all objects as the -background class with high confidence, the per-class confidence +background class with high confidence, the per class confidence threshold of 0.01 will consider predictions with very low confidences; as background boxes are not present in the maxima collection, many low confidence boxes can be. Furthermore, the same detection can be present in the maxima collection for multiple -classes. In this case, the entropy threshold would let the detection +classes. In this case, the \gls{entropy} threshold would let the detection pass because the background class has high confidence. Subsequently, -a low per-class confidence threshold does not restrict the boxes +a low per class confidence threshold does not restrict the boxes either. Therefore, the decoding output is worse than the actual predictions of the network. Bayesian \gls{SSD} cannot help in this situation because the network @@ -537,17 +537,17 @@ is not actually uncertain. 
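The filtering stage described in this section can be summarised as per class confidence threshold, per class \gls{NMS}, and a global top \(k\) selection. The following Python sketch illustrates this order under assumed box and score formats; it is not the actual ssd\_keras implementation.

\begin{verbatim}
import numpy as np

def iou(a, b):
    # intersection over union of two [xmin, ymin, xmax, ymax] boxes
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union > 0 else 0.0

def filter_detections(boxes, scores, conf=0.01, iou_thr=0.45, top_k=200):
    # scores: (#boxes, #classes) softmax scores, class 0 is background
    maxima = []
    for c in range(1, scores.shape[1]):
        keep = np.where(scores[:, c] > conf)[0]       # per class threshold
        for i in keep[np.argsort(-scores[keep, c])]:  # greedy per class NMS
            if all(iou(boxes[i], boxes[j]) <= iou_thr
                   for cc, j in maxima if cc == c):
                maxima.append((c, i))
    maxima.sort(key=lambda m: -scores[m[1], m[0]])    # global top k
    return maxima[:top_k]
\end{verbatim}

With a confidence threshold as low as 0.01, nearly every box survives the first step for some class, which illustrates how the maxima collection can fill up with low confidence boxes even when the network itself is certain about the background.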
SSD was developed with closed set conditions in mind. A well
trained network in such a situation does not have many high confidence
-background detections. In an open set environment, background
+background detections. In an open set environment, however, background
 detections are the correct behaviour for unknown classes. In
 order to get useful detections out of the decoding, a higher
 confidence threshold is required.
 
 \subsection{Vanilla SSD with Entropy Thresholding}
 
-Vanilla \gls{SSD} with entropy tresholding adds an additional component
-to the filtering already done for \gls{vanilla} SSD. The entropy is
+Vanilla \gls{SSD} with \gls{entropy} thresholding adds an additional component
+to the filtering already done for \gls{vanilla} \gls{SSD}. The \gls{entropy} is
 calculated from all \(\#nr\_classes\) softmax scores in a prediction.
-Only predictions with a low enough entropy pass the entropy
+Only predictions with a low enough \gls{entropy} pass the \gls{entropy}
 threshold and move on to the aforementioned per class filtering.
 This excludes very uniform predictions but cannot identify false
 positive or false negative cases with high confidence values.
@@ -583,10 +583,10 @@ unknown classes. This is due to multiple forward passes and the
 assumption that uncertainty in some objects will result in
 different classifications in multiple forward passes. These
 varying classifications are averaged into multiple lower confidence
-values which should increase the entropy and, hence, flag an
+values which should increase the \gls{entropy} and, hence, flag an
 observation for removal.
 
-The remainder of the filtering follows the \gls{vanilla} \gls{SSD} procedure: per-class
+The remainder of the filtering follows the \gls{vanilla} \gls{SSD} procedure: per class
 confidence threshold, \gls{NMS}, and a top \(k\) selection
 at the end.
 
@@ -594,7 +594,7 @@
 
 \label{chap:experiments-results}
 
-This chapter explains the used data sets, how the experiments were
+This chapter explains the data sets used, how the experiments have been
 set up, and what the results are.
 
 \section{Data Sets}
@@ -610,38 +610,38 @@ network. Typical problems of data sets include, for example, outliers
 and invalid bounding boxes. Before a data set can be used, these
 problems need to be resolved.
 
-For the MS COCO data set, all annotations were checked for
+For the MS COCO data set, all annotations are checked for
 impossible values: bounding box height or width lower than zero,
 \(x_{min}\) and \(y_{min}\) bounding box coordinates lower than
 zero, \(x_{max}\) and \(y_{max}\) coordinates lower than or equal to
 zero, \(x_{min}\) greater than \(x_{max}\), \(y_{min}\) greater than
 \(y_{max}\), image width lower than \(x_{max}\), and image height
 lower than \(y_{max}\). In the last two cases the
-bounding box width and height were set to (image width - \(x_{min}\)) and
+bounding box width and height are set to (image width - \(x_{min}\)) and
 (image height - \(y_{min}\)) respectively;
-in the other cases the annotation was skipped.
+in the other cases the annotation is skipped.
 If the bounding box width or height afterwards is
-lower than or equal to zero the annotation was skipped.
+lower than or equal to zero, the annotation is skipped.
 
-SSD accepts 300x300 input images, the MS COCO data set images were
-resized to this resolution; the aspect ratio was not kept in the
+SSD accepts 300x300 input images; the MS COCO data set images are
+resized to this resolution; the aspect ratio is not kept in the
 process.
MS COCO contains landscape and portrait images with 640x480
-and (480x640) as the resolution. This led to a uniform distortion of the
+and 480x640 as resolutions. This leads to a uniform distortion of the
 portrait and landscape images respectively. Furthermore,
-the colour channels were swapped from RGB to BGR in order to
-comply with the \gls{SSD} implementation. The BGR requirement stems from
-the usage of Open CV in SSD: the internal channel order for
-Open CV is BGR.
+the colour channels are swapped from \gls{RGB} to \gls{BGR} in order to
+comply with the \gls{SSD} implementation. The \gls{BGR} requirement stems from
+the usage of OpenCV in \gls{SSD}: the internal channel order of
+OpenCV is \gls{BGR}.
 
 For this thesis, weights pre-trained on the sub data set trainval35k of the
-COCO data set were used. These weights were created with closed set
-conditions in mind, therefore, they had to be sub-sampled to create
+COCO data set are used. These weights have been created with closed set
+conditions in mind; therefore, they have been sub-sampled to create
 an open set condition. To this end, the weights for the last
-20 classes were thrown away, making them effectively unknown.
+20 classes have been discarded, making them effectively unknown.
 
-All images of the minival2014 data set were used but only ground truth
-belonging to the first 60 classes was loaded. The remaining 20
-classes were considered "unknown" and no ground truth bounding
-boxes for them were provided during the inference phase.
+All images of the minival2014 data set are used but only ground truth
+belonging to the first 60 classes is loaded. The remaining 20
+classes are considered ``unknown'' and no ground truth bounding
+boxes for them are provided during the inference phase.
 A total of 31,991 detections remains after this exclusion.
 Of these detections, a staggering 10,988 or 34.3\% belong to the persons
 class, followed by cars with 1,932 or 6\%, chairs with 1,791 or 5.6\%,
@@ -655,28 +655,28 @@ This section explains the setup for the different conducted
 experiments. Each comparison investigates one particular question.
 
 As a baseline, \gls{vanilla} \gls{SSD} with the confidence threshold of 0.01
-and a \gls{NMS} IOU threshold of 0.45 was used.
+and a \gls{NMS} IOU threshold of 0.45 is used.
 Due to the low number of objects per image in the COCO data set,
-the top \(k\) value was set to 20. Vanilla \gls{SSD} with entropy
-thresholding uses the same parameters; compared to \gls{vanilla} SSD
-without entropy thresholding, it showcases the relevance of
-entropy thresholding for \gls{vanilla} SSD.
+the top \(k\) value has been set to 20. Vanilla \gls{SSD} with \gls{entropy}
+thresholding uses the same parameters; compared to \gls{vanilla} \gls{SSD}
+without \gls{entropy} thresholding, it showcases the relevance of
+entropy thresholding for \gls{vanilla} \gls{SSD}.
 
-Vanilla \gls{SSD} was also run with 0.2 confidence threshold and compared
+Vanilla \gls{SSD} with 0.2 confidence threshold is compared
 to \gls{vanilla} \gls{SSD} with 0.01 confidence threshold; this comparison
 investigates the effect of the per class confidence threshold
 on the object detection performance.
 
-Bayesian \gls{SSD} was run with 0.2 confidence threshold and compared
+Bayesian \gls{SSD} with 0.2 confidence threshold is compared
 to \gls{vanilla} \gls{SSD} with 0.2 confidence threshold. Coupled with
 the entropy threshold, this comparison reveals how uncertain the
 network is.
If it is very certain, the dropout sampling should have no
significant impact on the result. Furthermore, in two cases the
-dropout was turned off to isolate the impact of \gls{NMS}
+dropout has been turned off to isolate the impact of \gls{NMS}
 on the result.
 
-Both, \gls{vanilla} \gls{SSD} with entropy thresholding and Bayesian \gls{SSD} with
-entropy thresholding, were tested for entropy thresholds ranging
+Both \gls{vanilla} \gls{SSD} with \gls{entropy} thresholding and Bayesian \gls{SSD} with
+\gls{entropy} thresholding are tested for \gls{entropy} thresholds ranging
 from 0.1 to 2.4 inclusive as specified in Miller et al.~\cite{Miller2018}.
 
\section{Results}
 
@@ -704,24 +704,24 @@ in the next chapter.
 \hline
 \gls{vanilla} \gls{SSD} - 0.01 conf & 0.255 & 3176 & 0.214 & 0.318 \\
 \gls{vanilla} \gls{SSD} - 0.2 conf & \textbf{0.376} & 2939 & \textbf{0.382} & 0.372 \\
- \gls{SSD} with Entropy test - 0.01 conf & 0.255 & 3168 & 0.214 & 0.318 \\
- % entropy thresh: 2.4 for \gls{vanilla} \gls{SSD} is best
+ \gls{SSD} with entropy test - 0.01 conf & 0.255 & 3168 & 0.214 & 0.318 \\
+ % \gls{entropy} thresh: 2.4 for \gls{vanilla} \gls{SSD} is best
 \hline
 Bay. \gls{SSD} - no DO - 0.2 conf - no \gls{NMS} \; 10 & 0.209 & 2709 & 0.300 & 0.161 \\
 no dropout - 0.2 conf - \gls{NMS} \; 10 & 0.371 & \textbf{2335} & 0.365 & \textbf{0.378} \\
 0.9 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.359 & 2584 & 0.363 & 0.357 \\
 0.5 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.325 & 2759 & 0.342 & 0.311 \\
- % entropy thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3
+ % \gls{entropy} thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3
 % 0.5 for Bayesian - 6, 1.4 for 7, 1.4 for 8, 1.3 for 9
 \hline
 \end{tabular}
- \caption{Rounded results for micro averaging. \gls{SSD} with Entropy test and Bayesian \gls{SSD} are represented with
- their best performing entropy threshold with respect to \(F_1\) score. Vanilla \gls{SSD} with Entropy test performed best with an
- entropy threshold of 2.4, Bayesian \gls{SSD} without \gls{NMS} performed best for 1.0,
- and Bayesian \gls{SSD} with \gls{NMS} performed best for 1.4 as entropy
+ \caption{Rounded results for micro averaging. \gls{SSD} with entropy test and Bayesian \gls{SSD} are represented with
+ their best performing \gls{entropy} threshold with respect to \(F_1\) score. Vanilla \gls{SSD} with entropy test performed best with an
+ \gls{entropy} threshold of 2.4, Bayesian \gls{SSD} without \gls{NMS} performed best for 1.0,
+ and Bayesian \gls{SSD} with \gls{NMS} performed best for 1.4 as \gls{entropy}
 threshold. Bayesian \gls{SSD} with dropout enabled and 0.9 keep ratio performed
- best for 1.4 as entropy threshold, the run with 0.5 keep ratio performed
+ best for 1.4 as \gls{entropy} threshold; the run with 0.5 keep ratio performed
 best for 1.3 as threshold.}
 \label{tab:results-micro}
 \end{table}
 
\begin{figure}[ht]
 \begin{minipage}[t]{0.48\textwidth}
 \includegraphics[width=\textwidth]{ose-f1-all-micro}
- \caption{Micro averaged \(F_1\) score versus open set error for each variant. Perfect performance is an \(F_1\) score of 1 and an absolute OSE of 0.}
+ \caption{Micro averaged \(F_1\) score versus open set error for each variant. Perfect performance is an \(F_1\) score of 1 and an absolute \gls{OSE} of 0.}
 \label{fig:ose-f1-micro}
 \end{minipage}%
 \hfill
\end{minipage}
\end{figure}
 
-Vanilla \gls{SSD} with a per-class confidence threshold of 0.2 performs best (see
+Vanilla \gls{SSD} with a per class confidence threshold of 0.2 performs best (see
 table \ref{tab:results-micro}) with respect to the maximum \(F_1\) score (0.376)
 and recall at the maximum \(F_1\) point (0.382). In comparison, neither the \gls{vanilla}
 \gls{SSD} variant with a confidence threshold of 0.01 nor the \gls{SSD} with
-an entropy test can outperform the 0.2 variant. Among the \gls{vanilla} \gls{SSD} variants,
-the 0.2 variant also has the lowest number of open set errors (2939) and the
+an \gls{entropy} test can outperform the 0.2 variant. Among the \gls{vanilla} \gls{SSD} variants,
+the 0.2 variant also has the lowest open set error (2939) and the
 highest precision (0.372).
 
 The comparison of the \gls{vanilla} \gls{SSD} variants with a confidence threshold of 0.01
-shows no significant impact of an entropy test. Only the open set errors
-are lower but in an insignificant way. The rest of the performance metrics is
+shows no significant impact of an \gls{entropy} test. Only the open set error
+is lower, but not significantly so. The rest of the performance metrics are
 identical after rounding.
 
 Bayesian \gls{SSD} with disabled dropout and without \gls{NMS}
 has the worst performance of all tested variants (\gls{vanilla} and Bayesian) with respect to \(F_1\) score (0.209) and
 precision (0.161). The precision is not only the worst but also significantly lower compared to all other variants. In comparison to all variants with 0.2 confidence threshold, it has the worst recall (0.300) as well.
-With 2335 open set errors, the Bayesian \gls{SSD} variant with disabled dropout and
+With an open set error of 2335, the Bayesian \gls{SSD} variant with disabled dropout and
 enabled \gls{NMS} offers the best performance with respect
-to open set errors. It also has the best precision (0.378) of all tested
+to the open set error. It also has the best precision (0.378) of all tested
 variants. Furthermore, it provides the best performance among all variants
 with multiple forward passes.
 
 Dropout decreases the performance of the network; this can be seen
-in the lower \(F_1\) scores, higher open set errors, and lower precision
+in the lower \(F_1\) scores, a higher open set error, and lower precision
 values. Both dropout variants have worse recall (0.363 and 0.342)
 than the variant with disabled dropout.
-However, all variants with multiple forward passes have lower open set
-errors than all \gls{vanilla} \gls{SSD} variants.
+However, all variants with multiple forward passes have a lower open set
+error than all \gls{vanilla} \gls{SSD} variants.
 
 The relation of \(F_1\) score to absolute open set error can be observed
 in figure \ref{fig:ose-f1-micro}. Precision-recall curves for all variants
-can be seen in figure \ref{fig:precision-recall-micro}. Both \gls{vanilla} SSD
-variants with 0.01 confidence threshold reach much higher open set errors
+can be seen in figure \ref{fig:precision-recall-micro}. Both \gls{vanilla} \gls{SSD}
+variants with 0.01 confidence threshold reach a much higher open set error
 and a higher recall. This behaviour is expected as more and worse predictions
 are included.
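For clarity, the difference between the micro averaging used in table \ref{tab:results-micro} and the macro averaging reported below can be sketched as follows. The helper assumes per class arrays of true positive, false positive, and false negative counts with no empty classes; the exact metric computation of the evaluation code may differ in detail.

\begin{verbatim}
import numpy as np

def f1(precision, recall):
    s = precision + recall
    return 2 * precision * recall / s if s > 0 else 0.0

def micro_f1(tp, fp, fn):
    # pool the counts over all classes, then compute the metrics once
    return f1(tp.sum() / (tp.sum() + fp.sum()),
              tp.sum() / (tp.sum() + fn.sum()))

def macro_f1(tp, fp, fn):
    # compute the metrics per class, then average over the classes
    return f1(np.mean(tp / (tp + fp)), np.mean(tp / (tp + fn)))
\end{verbatim}

Because micro averaging divides by the pooled ground truth of all classes, frequent classes such as persons dominate the result; macro averaging weights every class equally.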
All plotted variants show a similar behaviour that is in line with previously @@ -790,24 +790,24 @@ reported figures, such as the ones in Miller et al.~\cite{Miller2018} \hline \gls{vanilla} \gls{SSD} - 0.01 conf & 0.370 & 1426 & 0.328 & 0.424 \\ \gls{vanilla} \gls{SSD} - 0.2 conf & \textbf{0.375} & 1218 & \textbf{0.338} & 0.424 \\ - \gls{SSD} with Entropy test - 0.01 conf & 0.370 & 1373 & 0.329 & \textbf{0.425} \\ - % entropy thresh: 1.7 for \gls{vanilla} \gls{SSD} is best + \gls{SSD} with entropy test - 0.01 conf & 0.370 & 1373 & 0.329 & \textbf{0.425} \\ + % \gls{entropy} thresh: 1.7 for \gls{vanilla} \gls{SSD} is best \hline Bay. \gls{SSD} - no DO - 0.2 conf - no \gls{NMS} \; 10 & 0.226 & \textbf{809} & 0.229 & 0.224 \\ no dropout - 0.2 conf - \gls{NMS} \; 10 & 0.363 & 1057 & 0.321 & 0.420 \\ 0.9 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.355 & 1137 & 0.320 & 0.399 \\ 0.5 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.322 & 1264 & 0.307 & 0.340 \\ - % entropy thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3 - % entropy thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7 + % \gls{entropy} thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3 + % \gls{entropy} thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7 % 1.7 for 8, 2.0 for 9 \hline \end{tabular} - \caption{Rounded results for macro averaging. \gls{SSD} with Entropy test and Bayesian \gls{SSD} are represented with - their best performing entropy threshold with respect to \(F_1\) score. Vanilla \gls{SSD} with Entropy test performed best with an - entropy threshold of 1.7, Bayesian \gls{SSD} without \gls{NMS} performed best for 1.5, - and Bayesian \gls{SSD} with \gls{NMS} performed best for 1.5 as entropy + \caption{Rounded results for macro averaging. \gls{SSD} with entropy test and Bayesian \gls{SSD} are represented with + their best performing \gls{entropy} threshold with respect to \(F_1\) score. Vanilla \gls{SSD} with entropy test performed best with an + \gls{entropy} threshold of 1.7, Bayesian \gls{SSD} without \gls{NMS} performed best for 1.5, + and Bayesian \gls{SSD} with \gls{NMS} performed best for 1.5 as \gls{entropy} threshold. Bayesian \gls{SSD} with dropout enabled and 0.9 keep ratio performed - best for 1.7 as entropy threshold, the run with 0.5 keep ratio performed + best for 1.7 as \gls{entropy} threshold, the run with 0.5 keep ratio performed best for 2.0 as threshold.} \label{tab:results-macro} \end{table} @@ -815,7 +815,7 @@ reported figures, such as the ones in Miller et al.~\cite{Miller2018} \begin{figure}[ht] \begin{minipage}[t]{0.48\textwidth} \includegraphics[width=\textwidth]{ose-f1-all-macro} - \caption{Macro averaged \(F_1\) score versus open set error for each variant. Perfect performance is an \(F_1\) score of 1 and an absolute OSE of 0.} + \caption{Macro averaged \(F_1\) score versus open set error for each variant. Perfect performance is an \(F_1\) score of 1 and an absolute \gls{OSE} of 0.} \label{fig:ose-f1-macro} \end{minipage}% \hfill @@ -826,37 +826,37 @@ reported figures, such as the ones in Miller et al.~\cite{Miller2018} \end{minipage} \end{figure} -Vanilla \gls{SSD} with a per-class confidence threshold of 0.2 performs best (see +Vanilla \gls{SSD} with a per class confidence threshold of 0.2 performs best (see table \ref{tab:results-macro}) with respect to the maximum \(F_1\) score -(0.375) and recall at the maximum \(F_1\) point (0.338). In comparison, the SSD -with an entropy test slightly outperforms the 0.2 variant with respect to +(0.375) and recall at the maximum \(F_1\) point (0.338). 
In comparison, the \gls{SSD}
+with an \gls{entropy} test slightly outperforms the 0.2 variant with respect to
 precision (0.425). Additionally, this is the best precision overall.
 Among the \gls{vanilla} \gls{SSD} variants, the 0.2 variant also has the lowest
-number of open set errors (1218).
+open set error (1218).
 
 The comparison of the \gls{vanilla} \gls{SSD} variants with a confidence threshold of 0.01
-shows no significant impact of an entropy test. Only the open set errors
-are lower but in an insignificant way. The rest of the performance metrics is
+shows no significant impact of an \gls{entropy} test. Only the open set error
+is lower, but not significantly so. The rest of the performance metrics are
 almost identical after rounding.
 
 The results for Bayesian \gls{SSD} show a significant impact of \gls{NMS} or the
 lack thereof: a maximum \(F_1\) score of 0.363 (with NMS) versus 0.226 (without NMS).
 Dropout was disabled in both cases, making them effectively a \gls{vanilla} \gls{SSD}
 run with multiple forward passes.
-With 809 open set errors, the Bayesian \gls{SSD} variant with disabled dropout and
+With an open set error of 809, the Bayesian \gls{SSD} variant with disabled dropout and
 without \gls{NMS} offers the best performance with respect
-to open set errors. The variant without dropout and enabled \gls{NMS} has the best \(F_1\) score (0.363), the best
+to the open set error. The variant without dropout and enabled \gls{NMS} has the best \(F_1\) score (0.363), the best
 precision (0.420) and the best recall (0.321) of all Bayesian variants.
 
 Dropout decreases the performance of the network; this can be seen
-in the lower \(F_1\) scores, higher open set errors, and lower precision and
-recall values. However, all variants with multiple forward passes have lower open set errors than all \gls{vanilla} SSD
+in the lower \(F_1\) scores, a higher open set error, and lower precision and
+recall values. However, all variants with multiple forward passes have a lower open set error than all \gls{vanilla} \gls{SSD}
 variants.
 
 The relation of \(F_1\) score to absolute open set error can be observed
 in figure \ref{fig:ose-f1-macro}. Precision-recall curves for all variants
-can be seen in figure \ref{fig:precision-recall-macro}. Both \gls{vanilla} SSD
-variants with 0.01 confidence threshold reach much higher open set errors
+can be seen in figure \ref{fig:precision-recall-macro}. Both \gls{vanilla} \gls{SSD}
+variants with 0.01 confidence threshold reach a much higher open set error
 and a higher recall. This behaviour is expected as more and worse predictions
 are included.
 All plotted variants show a similar behaviour that is in line with previously
@@ -887,20 +887,20 @@ they had the exact same performance before rounding.
 \hline
 \gls{vanilla} \gls{SSD} - 0.01 conf & 0.460 & \textbf{0.405} & 0.532 \\
 \gls{vanilla} \gls{SSD} - 0.2 conf & \textbf{0.460} & \textbf{0.405} & \textbf{0.533} \\
- \gls{SSD} with Entropy test - 0.01 conf & 0.460 & 0.405 & 0.532 \\
- % entropy thresh: 1.7 for \gls{vanilla} \gls{SSD} is best
+ \gls{SSD} with entropy test - 0.01 conf & 0.460 & 0.405 & 0.532 \\
+ % \gls{entropy} thresh: 1.7 for \gls{vanilla} \gls{SSD} is best
 \hline
 Bay.
\gls{SSD} - no DO - 0.2 conf - no \gls{NMS} \; 10 & 0.272 & 0.292 & 0.256 \\ no dropout - 0.2 conf - \gls{NMS} \; 10 & 0.451 & 0.403 & 0.514 \\ 0.9 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.447 & 0.401 & 0.505 \\ 0.5 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.410 & 0.368 & 0.465 \\ - % entropy thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3 - % entropy thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7 + % \gls{entropy} thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3 + % \gls{entropy} thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7 % 1.7 for 8, 2.0 for 9 \hline \end{tabular} - \caption{Rounded results for persons class. \gls{SSD} with Entropy test and Bayesian \gls{SSD} are represented with - their best performing macro averaging entropy threshold with respect to \(F_1\) score.} + \caption{Rounded results for persons class. \gls{SSD} with entropy test and Bayesian \gls{SSD} are represented with + their best performing macro averaging \gls{entropy} threshold with respect to \(F_1\) score.} \label{tab:results-persons} \end{table} @@ -924,20 +924,20 @@ worse than average. \hline \gls{vanilla} \gls{SSD} - 0.01 conf & 0.364 & \textbf{0.305} & 0.452 \\ \gls{vanilla} \gls{SSD} - 0.2 conf & 0.363 & 0.294 & \textbf{0.476} \\ - \gls{SSD} with Entropy test - 0.01 conf & \textbf{0.364} & \textbf{0.305} & 0.453 \\ - % entropy thresh: 1.7 for \gls{vanilla} \gls{SSD} is best + \gls{SSD} with entropy test - 0.01 conf & \textbf{0.364} & \textbf{0.305} & 0.453 \\ + % \gls{entropy} thresh: 1.7 for \gls{vanilla} \gls{SSD} is best \hline Bay. \gls{SSD} - no DO - 0.2 conf - no \gls{NMS} \; 10 & 0.236 & 0.244 & 0.229 \\ no dropout - 0.2 conf - \gls{NMS} \; 10 & 0.336 & 0.266 & 0.460 \\ 0.9 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.332 & 0.262 & 0.454 \\ 0.5 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.309 & 0.264 & 0.374 \\ - % entropy thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3 - % entropy thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7 + % \gls{entropy} thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3 + % \gls{entropy} thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7 % 1.7 for 8, 2.0 for 9 \hline \end{tabular} - \caption{Rounded results for cars class. \gls{SSD} with Entropy test and Bayesian \gls{SSD} are represented with - their best performing macro averaging entropy threshold with respect to \(F_1\) score. } + \caption{Rounded results for cars class. \gls{SSD} with entropy test and Bayesian \gls{SSD} are represented with + their best performing macro averaging \gls{entropy} threshold with respect to \(F_1\) score. } \label{tab:results-cars} \end{table} @@ -949,20 +949,20 @@ worse than average. \hline \gls{vanilla} \gls{SSD} - 0.01 conf & 0.287 & \textbf{0.251} & 0.335 \\ \gls{vanilla} \gls{SSD} - 0.2 conf & 0.283 & 0.242 & 0.341 \\ - \gls{SSD} with Entropy test - 0.01 conf & \textbf{0.288} & \textbf{0.251} & 0.338 \\ - % entropy thresh: 1.7 for \gls{vanilla} \gls{SSD} is best + \gls{SSD} with entropy test - 0.01 conf & \textbf{0.288} & \textbf{0.251} & 0.338 \\ + % \gls{entropy} thresh: 1.7 for \gls{vanilla} \gls{SSD} is best \hline Bay. 
\gls{SSD} - no DO - 0.2 conf - no \gls{NMS} \; 10 & 0.172 & 0.168 & 0.178 \\ no dropout - 0.2 conf - \gls{NMS} \; 10 & 0.280 & 0.229 & \textbf{0.360} \\ 0.9 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.274 & 0.228 & 0.343 \\ 0.5 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.240 & 0.220 & 0.265 \\ - % entropy thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3 - % entropy thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7 + % \gls{entropy} thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3 + % \gls{entropy} thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7 % 1.7 for 8, 2.0 for 9 \hline \end{tabular} - \caption{Rounded results for chairs class. \gls{SSD} with Entropy test and Bayesian \gls{SSD} are represented with - their best performing macro averaging entropy threshold with respect to \(F_1\) score. } + \caption{Rounded results for chairs class. \gls{SSD} with entropy test and Bayesian \gls{SSD} are represented with + their best performing macro averaging \gls{entropy} threshold with respect to \(F_1\) score. } \label{tab:results-chairs} \end{table} @@ -975,20 +975,20 @@ worse than average. \hline \gls{vanilla} \gls{SSD} - 0.01 conf & 0.233 & \textbf{0.175} & 0.348 \\ \gls{vanilla} \gls{SSD} - 0.2 conf & 0.231 & 0.173 & \textbf{0.350} \\ - \gls{SSD} with Entropy test - 0.01 conf & \textbf{0.233} & \textbf{0.175} & 0.350 \\ - % entropy thresh: 1.7 for \gls{vanilla} \gls{SSD} is best + \gls{SSD} with entropy test - 0.01 conf & \textbf{0.233} & \textbf{0.175} & 0.350 \\ + % \gls{entropy} thresh: 1.7 for \gls{vanilla} \gls{SSD} is best \hline Bay. \gls{SSD} - no DO - 0.2 conf - no \gls{NMS} \; 10 & 0.160 & 0.140 & 0.188 \\ no dropout - 0.2 conf - \gls{NMS} \; 10 & 0.224 & 0.170 & 0.328 \\ 0.9 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.220 & 0.170 & 0.311 \\ 0.5 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.202 & 0.172 & 0.245 \\ - % entropy thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3 - % entropy thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7 + % \gls{entropy} thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3 + % \gls{entropy} thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7 % 1.7 for 8, 2.0 for 9 \hline \end{tabular} - \caption{Rounded results for bottles class. \gls{SSD} with Entropy test and Bayesian \gls{SSD} are represented with - their best performing macro averaging entropy threshold with respect to \(F_1\) score. } + \caption{Rounded results for bottles class. \gls{SSD} with entropy test and Bayesian \gls{SSD} are represented with + their best performing macro averaging \gls{entropy} threshold with respect to \(F_1\) score. } \label{tab:results-bottles} \end{table} @@ -1000,37 +1000,36 @@ worse than average. \hline \gls{vanilla} \gls{SSD} - 0.01 conf & \textbf{0.650} & \textbf{0.647} & \textbf{0.655} \\ \gls{vanilla} \gls{SSD} - 0.2 conf & \textbf{0.650} & \textbf{0.647} & \textbf{0.655} \\ - \gls{SSD} with Entropy test - 0.01 conf & \textbf{0.650} & \textbf{0.647} & \textbf{0.655} \\ - % entropy thresh: 1.7 for \gls{vanilla} \gls{SSD} is best + \gls{SSD} with entropy test - 0.01 conf & \textbf{0.650} & \textbf{0.647} & \textbf{0.655} \\ + % \gls{entropy} thresh: 1.7 for \gls{vanilla} \gls{SSD} is best \hline Bay. 
\gls{SSD} - no DO - 0.2 conf - no \gls{NMS} \; 10 & 0.415 & 0.414 & 0.417 \\
 no dropout - 0.2 conf - \gls{NMS} \; 10 & 0.647 & 0.642 & 0.654 \\
 0.9 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.637 & 0.634 & 0.642 \\
 0.5 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.586 & 0.578 & 0.596 \\
- % entropy thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3
- % entropy thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7
+ % \gls{entropy} thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3
+ % \gls{entropy} thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7
 % 1.7 for 8, 2.0 for 9
 \hline
 \end{tabular}
- \caption{Rounded results for giraffe class. \gls{SSD} with Entropy test and Bayesian \gls{SSD} are represented with
- their best performing macro averaging entropy threshold with respect to \(F_1\) score. }
+ \caption{Rounded results for giraffe class. \gls{SSD} with entropy test and Bayesian \gls{SSD} are represented with
+ their best performing macro averaging \gls{entropy} threshold with respect to \(F_1\) score.}
 \label{tab:results-giraffes}
 \end{table}
 
 \subsection{Qualitative Analysis}
 
-% TODO: expand
-
-This subsection compares \gls{vanilla} SSD
+This subsection compares \gls{vanilla} \gls{SSD}
 with Bayesian \gls{SSD} with respect to specific images that
 illustrate similarities and differences between both approaches. For this
-comparison, a 0.2 confidence threshold is applied. Furthermore, Bayesian
-SSD uses \gls{NMS} and dropout with 0.9 keep ratio.
+comparison, a 0.2 confidence threshold is applied. Furthermore, the
+compared Bayesian \gls{SSD} variant uses \gls{NMS} and dropout with 0.9 keep
+ratio.
 
 \begin{figure}
 \begin{minipage}[t]{0.48\textwidth}
 \includegraphics[width=\textwidth]{COCO_val2014_000000336587_bboxes_vanilla}
- \caption{Image with stop sign and truck at right edge. Ground truth in blue, predictions in red, and rounded to three digits. Predictions are from \gls{vanilla} SSD.}
+ \caption{Image with stop sign and truck at right edge. Ground truth in blue, predictions in red, and rounded to three digits. Predictions are from \gls{vanilla} \gls{SSD}.}
 \label{fig:stop-sign-truck-vanilla}
 \end{minipage}%
 \hfill
@@ -1042,14 +1041,14 @@
 \end{minipage}
 \end{figure}
 
 The ground truth only contains a stop sign and a truck. The differences
 between \gls{vanilla} \gls{SSD} and Bayesian \gls{SSD} are almost not visible
-(see figures \ref{fig:stop-sign-truck-vanilla} and \ref{fig:stop-sign-truck-bayesian}): the truck is neither detected by \gls{vanilla} nor Bayesian SSD, instead both detected a pottet plant and a traffic light. The stop sign is detected by both variants.
+(see figures \ref{fig:stop-sign-truck-vanilla} and \ref{fig:stop-sign-truck-bayesian}): the truck is detected by neither \gls{vanilla} nor Bayesian \gls{SSD}; instead, both detect a potted plant and a traffic light. The stop sign is detected by both variants.
 This behaviour implies problems with detecting objects at the edge that
 overwhelmingly lie outside the image frame. Furthermore, the predictions are usually identical.
 
 \begin{figure}
 \begin{minipage}[t]{0.48\textwidth}
 \includegraphics[width=\textwidth]{COCO_val2014_000000403817_bboxes_vanilla}
- \caption{Image with a cat and laptop/TV. Ground truth in blue, predictions in red, and rounded to three digits. Predictions are from \gls{vanilla} SSD.}
+ \caption{Image with a cat and laptop/TV. Ground truth in blue, predictions in red, and rounded to three digits.
Predictions are from \gls{vanilla} \gls{SSD}.} \label{fig:cat-laptop-vanilla} \end{minipage}% \hfill @@ -1062,19 +1061,19 @@ that overwhelmingly lie outside the image frame. Furthermore, the predictions ar Another example (see figures \ref{fig:cat-laptop-vanilla} and \ref{fig:cat-laptop-bayesian}) is a cat with a laptop/TV in the background on the right side. Both variants detect a cat but the \gls{vanilla} variant detects a dog as well. The laptop and TV are not detected but this is expected since -these classes were not trained. +these classes have not been trained. \chapter{Discussion and Outlook} \label{chap:discussion} -First the results will be discussed, then possible future research and open -questions will be addressed. +First the results are discussed, then possible future research and open +questions are addressed. \section*{Discussion} -The results clearly do not support the hypothesis: \textit{Dropout sampling delivers better object detection performance under open set conditions compared to object detection without it}. With the exception of open set errors, there -is no area where dropout sampling performs better than \gls{vanilla} SSD. In the +The results clearly do not support the hypothesis: \textit{Dropout sampling delivers better object detection performance under open set conditions compared to object detection without it}. With the exception of the open set error, there +is no area where dropout sampling performs better than \gls{vanilla} \gls{SSD}. In the remainder of the section the individual results will be interpreted. \subsection*{Impact of Averaging} @@ -1104,36 +1103,32 @@ Conversely, in micro averaging the cumulative true positives are added up across classes and then divided by the total number of ground truth. Here, the effect is the opposite: the total number of ground truth is very large which means the combined true positives -of 58 classes have only a smaller impact on the average recall. -As a result, the open set error rises quicker than the \(F_1\) score -in micro averaging, creating the sharp rise of open set error at a lower +of the 57 classes have only a smaller impact on the average recall. +As a result, the open set error rises quicker than the \(F_1\) score, +creating the sharp rise of the open set error at a lower \(F_1\) score than in macro averaging. The open set error reaches a high value early on and changes little afterwards. This allows the \(F_1\) score to catch up and produces the almost horizontal line in the graph. Eventually, the \(F_1\) score decreases again while the -open set error further rises a bit. - -Furthermore, the plotted behaviour implies that Miller et al.~\cite{Miller2018} -use macro averaging in their paper as the unique behaviour of micro -averaging was not reported in their paper. +open set error continues to rise a bit. \subsection*{Impact of Entropy} -There is no visible impact of entropy thresholding on the object detection -performance for \gls{vanilla} SSD. This indicates that the network has almost no +There is no visible impact of \gls{entropy} thresholding on the object detection +performance for \gls{vanilla} \gls{SSD}. This indicates that the network has almost no uniform or close to uniform predictions, the vast majority of predictions has a high confidence in one class---including the background. 
-However, the entropy plays a larger role for the Bayesian variants---as +However, the \gls{entropy} plays a larger role for the Bayesian variants---as expected: the best performing thresholds are 1.0, 1.3, and 1.4 for micro averaging, and 1.5, 1.7, and 2.0 for macro averaging. In all of these cases the best threshold is not the largest threshold tested. This is caused by a simple phenomenon: at some point most or all true -positives are in and a higher entropy threshold only adds more false +positives are in and a higher \gls{entropy} threshold only adds more false positives. Such a behaviour is indicated by a stagnating recall for the -higher entropy levels. For the low entropy thresholds, the low recall +higher \gls{entropy} levels. For the low \gls{entropy} thresholds, the low recall is dominating the \(F_1\) score, the sweet spot is somewhere in the -middle. For macro averaging, it holds that a higher optimal entropy +middle. For macro averaging, it holds that a higher optimal \gls{entropy} threshold indicates a worse performance. \subsection*{Non-Maximum Suppression and Top \(k\)} @@ -1143,23 +1138,23 @@ threshold indicates a worse performance. \begin{tabular}{rccc} \hline variant & before & after & after \\ - & entropy/NMS & entropy/NMS & top \(k\) \\ + & \gls{entropy}/NMS & \gls{entropy}/NMS & top \(k\) \\ \hline - Bay. SSD, no dropout, no \gls{NMS} & 155,251 & 122,868 & 72,207 \\ + Bay. \gls{SSD}, no dropout, no \gls{NMS} & 155,251 & 122,868 & 72,207 \\ no dropout, \gls{NMS} & 155,250 & 36,061 & 33,827 \\ \hline \end{tabular} \caption{Comparison of Bayesian \gls{SSD} variants without dropout with - respect to the number of detections before the entropy threshold, + respect to the number of detections before the \gls{entropy} threshold, after it and/or \gls{NMS}, and after top \(k\). The - entropy threshold 1.5 was used for both.} + \gls{entropy} threshold 1.5 was used for both.} \label{tab:effect-nms} \end{table} -Miller et al.~\cite{Miller2018} supposedly did not use \gls{NMS} +Miller et al.~\cite{Miller2018} supposedly do not use \gls{NMS} in their implementation of dropout sampling. Therefore, a variant with disabled \glslocalreset{NMS} -\gls{NMS} was tested. The results are somewhat expected: +\gls{NMS} has been tested. The results are somewhat expected: \gls{NMS} removes all non-maximum detections that overlap with a maximum one. This reduces the number of multiple detections per ground truth bounding box and therefore the false positives. Without it, @@ -1167,13 +1162,13 @@ a lot more false positives remain and have a negative impact on precision. In combination with top \(k\) selection, recall can be affected: duplicate detections could stay and maxima boxes could be removed. -The number of observations was measured before and after the combination of entropy threshold and \gls{NMS} filter: both Bayesian \gls{SSD} without +The number of observations have been measured before and after the combination of \gls{entropy} threshold and \gls{NMS} filter: both Bayesian \gls{SSD} without NMS and dropout, and Bayesian \gls{SSD} with \gls{NMS} and disabled dropout -have the same number of observations everywhere before the entropy threshold. After the entropy threshold (the value 1.5 was used for both) and NMS, the variant with \gls{NMS} has roughly 23\% of its observations left +have the same number of observations everywhere before the \gls{entropy} threshold. 
 
-The number of observations was measured before and after the combination of entropy threshold and \gls{NMS} filter: both Bayesian \gls{SSD} without
+The number of observations has been measured before and after the combination of \gls{entropy} threshold and \gls{NMS} filter: both Bayesian \gls{SSD} without
 NMS and dropout, and Bayesian \gls{SSD} with \gls{NMS} and disabled dropout
-have the same number of observations everywhere before the entropy threshold. After the entropy threshold (the value 1.5 was used for both) and NMS, the variant with \gls{NMS} has roughly 23\% of its observations left
+have the same number of observations everywhere before the \gls{entropy} threshold. After the \gls{entropy} threshold (the value 1.5 has been used for both) and NMS, the variant with \gls{NMS} has roughly 23\% of its observations left
 (see table \ref{tab:effect-nms} for absolute numbers). Without \gls{NMS}, 79\% of observations are left.
 Irrespective of the absolute number, this discrepancy clearly shows the impact of \gls{NMS} and also explains a higher count of false positives:
-more than 50\% of the original observations were removed with \gls{NMS} and
-stayed without---all of these are very likely to be false positives.
+more than 50\% of the original observations are removed with \gls{NMS} and
+stay without it---all of these are very likely to be false positives.
 
 A clear distinction between micro and macro averaging can be observed:
@@ -1186,8 +1181,8 @@ true positives are removed from a class with only few true positives
 to begin with, then their removal will have a drastic influence on
 the class recall value and hence the overall result.
 
-The impact of top \(k\) was measured by counting the number of observations
-after top \(k\) has been applied: the variant with \gls{NMS} keeps about 94\%
-of the observations left after NMS, without \gls{NMS} only about 59\%
+The impact of top \(k\) has been measured by counting the number of observations
+after top \(k\) is applied: the variant with \gls{NMS} keeps about 94\%
+of the observations left after NMS; without \gls{NMS}, only about 59\%
 of observations are kept. This shows a significant impact on the
 result by top \(k\) in the case of disabled \gls{NMS}. Furthermore, some
@@ -1212,7 +1207,7 @@ recall.
 	variant & after & after \\
 	 & prediction & observation grouping \\
 	\hline
-	Bay. SSD, no dropout, \gls{NMS} & 1,677,050 & 155,250 \\
+	Bay. \gls{SSD}, no dropout, \gls{NMS} & 1,677,050 & 155,250 \\
 	keep rate 0.9, \gls{NMS} & 1,617,675 & 549,166 \\
 	\hline
 \end{tabular}
@@ -1229,8 +1224,8 @@ without dropout.
 
 This is expected as the network was not trained
 with dropout and the weights are not prepared for it. Gal~\cite{Gal2017}
-showed that networks \textbf{trained} with dropout are approximate Bayesian
-models. The Bayesian variants of \gls{SSD} implemented in this thesis are not fine-tuned or trained with dropout, therefore, they are not guaranteed to be such approximate models.
+shows that networks \textbf{trained} with dropout are approximate Bayesian
+models. The Bayesian variants of \gls{SSD} implemented for this thesis are not fine-tuned or trained with dropout; therefore, they are not guaranteed to be such approximate models.
 
 But dropout alone does not explain the difference in results. Both variants
 with and without dropout have the exact same number of detections coming
@@ -1243,19 +1238,19 @@ observations, including a quadratic calculation of mutual IOU scores.
 Therefore, these detections are filtered by removing all those with
 background confidence levels of 0.8 or higher.
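+
+A minimal sketch of this grouping step (Python with NumPy; the greedy
+strategy, the tuple layout, and the 0.95 grouping IOU are illustrative
+assumptions, not the exact implementation):
+\begin{verbatim}
+import numpy as np
+
+def iou(a, b):
+    # IOU of two boxes given as [x_min, y_min, x_max, y_max]
+    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
+    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
+    inter = ix * iy
+    union = ((a[2] - a[0]) * (a[3] - a[1])
+             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
+    return inter / union
+
+def group_into_observations(detections, grouping_iou=0.95):
+    # greedily group (box, softmax) detections from all forward
+    # passes; the mutual IOU checks make this step quadratic
+    groups = []
+    for box, probs in detections:
+        for group in groups:
+            if all(iou(box, b) >= grouping_iou for b, _ in group):
+                group.append((box, probs))
+                break
+        else:
+            groups.append([(box, probs)])
+    # an observation averages the boxes and class distributions
+    return [(np.mean([b for b, _ in g], axis=0),
+             np.mean([p for _, p in g], axis=0)) for g in groups]
+\end{verbatim}
+The background filter mentioned above simply drops every detection with
+a background confidence of 0.8 or higher before this function runs,
+which keeps the quadratic grouping tractable.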
 
-The number of detections per class was measured before and after the
-detections were grouped into observations. To this end, the stored predictions
-were unbatched and summed together. After the aforementioned filter
+The number of detections per class has been measured before and after the
+detections are grouped into observations. To this end, the stored predictions
+are unbatched and summed together. After the aforementioned filter
 and before the grouping, roughly 0.4\% (in fact less than that) of the
-more than 430 million detections are remaining (see table \ref{tab:effect-dropout} for absolute numbers). The variant with dropout
+more than 430 million detections remain (see table \ref{tab:effect-dropout} for absolute numbers). The variant with dropout
 has slightly fewer predictions left compared to the one without dropout.
 After the grouping, the variant without dropout has on average between 10
 and 11 detections grouped into an observation. This is expected as every
-forward pass creates the exact same result and these 10 identical detections
+forward pass creates the exact same result and these ten identical detections
 per \gls{vanilla} \gls{SSD} detection perfectly overlap. The fact that slightly more than
-10 detections are grouped together could explain the marginally better precision
-of the Bayesian variant without dropout compared to \gls{vanilla} SSD.
+ten detections are grouped together could explain the marginally better precision
+of the Bayesian variant without dropout compared to \gls{vanilla} \gls{SSD}.
 However, on average only three detections are grouped together into
 an observation if dropout with 0.9 keep ratio is enabled. This does
 not negatively impact recall as true positives do not disappear but offers
diff --git a/glossary.tex b/glossary.tex
index 41c1542..0a77a40 100644
--- a/glossary.tex
+++ b/glossary.tex
@@ -1,8 +1,41 @@
 % acronyms
 \newacronym{NMS}{NMS}{non-maximum suppression}
+\newacronym{OSE}{OSE}{open set error}
 \newacronym{SSD}{SSD}{Single Shot MultiBox Detector}
+\newacronym{pdf}{pdf}{probability density function}
 
 % terms
+\newglossaryentry{BGR}{
+	name={BGR},
+	description={
+		stands for the three colour channels blue, green, and red in this order
+	}
+}
+\newglossaryentry{Caffe}{
+	name={Caffe},
+	description={
+		is a deep learning framework written in C++
+	}
+}
+\newglossaryentry{entropy}{
+	name={entropy},
+	description={
+		measures the uncertainty of a probability distribution. Rare
+		events carry more information than likely ones; in case of classification probabilities, a uniform prediction has maximum entropy, whereas a prediction with a clear ``winner'' has low entropy
+	}
+}
+\newglossaryentry{posterior}{
+	name={posterior},
+	description={
+		is the probability distribution over classes that a neural network outputs for a given input
+	}
+}
+\newglossaryentry{RGB}{
+	name={RGB},
+	description={
+		stands for the three colour channels red, green, and blue in this order
+	}
+}
 \newglossaryentry{vanilla}
 {
 	name={vanilla},