From 2f9623e3d5ca46f728433b1511683726df7e6095 Mon Sep 17 00:00:00 2001 From: Jim Martens Date: Tue, 24 Sep 2019 12:22:13 +0200 Subject: [PATCH] Improved a variety of smaller things Signed-off-by: Jim Martens --- body.tex | 111 +++++++++++++++++++++++++++---------------------------- 1 file changed, 54 insertions(+), 57 deletions(-) diff --git a/body.tex b/body.tex index 02c8bff..bad9280 100644 --- a/body.tex +++ b/body.tex @@ -7,9 +7,9 @@ providing technical details. \subsection*{Motivation} -Famous examples like the automatic soap dispenser which does not +Famous examples like the automatic soap dispenser, which does not recognise the hand of a black person but dispenses soap when presented -with a paper towel raise the question of bias in computer +with a paper towel, raise the question of bias in computer systems~\cite{Friedman1996}. Related to this ethical question regarding the design of so called algorithms is the question of algorithmic accountability~\cite{Diakopoulos2014}. @@ -132,26 +132,28 @@ conditions compared to object detection without it.} For the purpose of this thesis, I will use the vanilla SSD (as in: the original SSD) as baseline to compare against. In particular, vanilla SSD uses a per-class confidence threshold of 0.01, an IOU threshold of 0.45 -for the non-maximum suppression, and a top k value of 200. +for the non-maximum suppression, and a top \(k\) value of 200. For this +thesis, the top \(k\) value was changed to 20 and the confidence threshold +of 0.2 was tried as well. The effect of an entropy threshold is measured against this vanilla SSD by applying entropy thresholds from 0.1 to 2.4 inclusive (limits taken from -Miller et al.). Dropout sampling is compared to vanilla SSD, both +Miller et al.). Dropout sampling is compared to vanilla SSD with and without entropy thresholding. \paragraph{Hypothesis} Dropout sampling delivers better object detection performance under open set conditions compared to object detection without it. -\subsection*{Reader's guide} +\subsection*{Reader's Guide} First, chapter \ref{chap:background} presents related works and -provides the background for dropout sampling a.k.a Bayesian SSD. -Afterwards, chapter \ref{chap:methods} explains how the Bayesian SSD -works and how the decoding pipelines are structured. +provides the background for dropout sampling. +Afterwards, chapter \ref{chap:methods} explains how vanilla SSD works, how +Bayesian SSD extends vanilla SSD, and how the decoding pipelines are +structured. Chapter \ref{chap:experiments-results} presents the data sets, the experimental setup, and the results. This is followed by -chapter \ref{chap:discussion}, focusing on -the discussion and closing. +chapter \ref{chap:discussion}, focusing on the discussion and closing. Therefore, the contribution is found in chapters \ref{chap:methods}, \ref{chap:experiments-results}, and \ref{chap:discussion}. @@ -162,8 +164,7 @@ Therefore, the contribution is found in chapters \ref{chap:methods}, This chapter will begin with an overview over previous works in the field of this thesis. Afterwards the theoretical foundations -of the work of Miller et al.~\cite{Miller2018} will -be explained. +of dropout sampling will be explained. \section{Related Works} @@ -176,7 +177,7 @@ reconstruction-based novelty detection as it deals only with neural network approaches. Therefore, the other types of novelty detection will only be briefly introduced. 
-\subsection{Overview over types of novelty detection} +\subsection{Overview of Types of Novelty Detection} Probabilistic approaches estimate the generative probability density function (pdf) of the data. It is assumed that the training data is generated from an underlying @@ -208,7 +209,7 @@ difference in the metric when removed from the data set. This subset is consider to consist of novel data. For example, Filippone and Sanguinetti \cite{Filippone2011} provide a recent approach. -\subsection{Reconstruction-based novelty detection} +\subsection{Reconstruction-based Novelty Detection} Reconstruction-based approaches use the reconstruction error in one form or another to calculate the novelty score. This can be auto-encoders that @@ -224,7 +225,7 @@ Novelty detection for object detection is intricately linked with open set conditions: the test data can contain unknown classes. Bishop~\cite{Bishop1994} investigated the correlation between the degree of novel input data and the reliability of network -outputs. +outputs, and introduced a quantitative way to measure novelty. The Bayesian approach provides a theoretical foundation for modelling uncertainty \cite{Ghahramani2015}. @@ -259,20 +260,17 @@ Li et al.~\cite{Li2019} investigated the problem of poor performance when combining dropout and batch normalisation: Dropout shifts the variance of a neural unit when switching from train to test, batch normalisation does not change the variance. This inconsistency leads to a variance shift which -can have a larger or smaller impact based on the network used. For example, -adding dropout layers to SSD \cite{Liu2016} and applying MC dropout, like -Miller et al.~\cite{Miller2018} did, causes such a problem because SSD uses -batch normalisation. +can have a larger or smaller impact based on the network used. Non-Bayesian approaches have been developed as well. Usually, they compare with MC dropout and show better performance. Postels et al.~\cite{Postels2019} provided a sampling-free approach for uncertainty estimation that does not affect training and approximates the -sampling on test time. They compared it to MC dropout and found less computational +sampling at test time. They compared it to MC dropout and found less computational overhead with better results. Lakshminarayanan et al.~\cite{Lakshminarayanan2017} implemented a predictive uncertainty estimation using deep ensembles. -Compared to MC dropout, it showed better results. +Compared to MC dropout, it shows better results. Geifman et al.~\cite{Geifman2018} introduced an uncertainty estimation algorithm for non-Bayesian deep neural classification that estimates the uncertainty of highly @@ -288,10 +286,10 @@ are important as well. Mukhoti and Gal~\cite{Mukhoti2018} contributed metrics to measure uncertainty for semantic segmentation. Wu et al.~\cite{Wu2019} introduced two innovations that turn variational Bayes into a robust tool for Bayesian -networks: a novel deterministic method to approximate +networks: first, a novel deterministic method to approximate moments in neural networks which eliminates gradient variance, and -a hierarchical prior for parameters and an empirical Bayes procedure to select -prior variances. +second, a hierarchical prior for parameters and an empirical Bayes +procedure to select prior variances.
\section{Background for Dropout Sampling} @@ -342,7 +340,7 @@ over the network weights, for example a Gaussian prior distribution: \(\mathbf{W}\) are the weights and \(I\) symbolises that every weight is drawn from an independent and identical distribution. The training of the network determines a plausible set of weights by -evaluating the posterior (probability output) over the weights given +evaluating the probability output (posterior) over the weights given the training data \(\mathbf{T}\): \(p(\mathbf{W}|\mathbf{T})\). However, this evaluation cannot be performed in any reasonable @@ -369,7 +367,7 @@ training data \(\mathbf{T}\): p(y|\mathcal{I}, \mathbf{T}) = \int p(y|\mathcal{I}, \mathbf{W}) \cdot p(\mathbf{W}|\mathbf{T})d\mathbf{W} \approx \frac{1}{n} \sum_{i=1}^{n}\mathbf{s}_i \end{equation} -With this dropout sampling technique \(n\) model weights +With this dropout sampling technique, \(n\) model weights \(\widetilde{\mathbf{W}}_i\) are sampled from the posterior \(p(\mathbf{W}|\mathbf{T})\). The class probability \(p(y|\mathcal{I}, \mathbf{T})\) is a probability vector @@ -437,8 +435,8 @@ Vanilla SSD is based upon the VGG-16 network (see figure \ref{fig:vanilla-ssd}) and adds extra feature layers. The entire image (always size 300x300) is divided up into anchor boxes. During training, each of these boxes is mapped to a ground truth box or -background. For every anchor box the offset to -the object, and the class confidences are calculated. The output of the +background. For every anchor box both the offset to +the object and the class confidences are calculated. The output of the SSD network are the predictions with class confidences, offsets to the anchor box, anchor box coordinates, and variance. The model loss is a weighted sum of localisation and confidence loss. As the network @@ -567,7 +565,7 @@ Additionally, all detections with a background prediction of 0.8 or higher are d The remaining detections are partitioned into observations to further reduce the size of the output, and to identify uncertainty. This is accomplished by calculating the -mutual IOU of every detection with all other detections. Detections +mutual IOU score of every detection with all other detections. Detections with a mutual IOU score of 0.95 or higher are partitioned into an observation. Next, the softmax scores and bounding box coordinates of all detections in an observation are averaged. @@ -596,7 +594,7 @@ at the end. This chapter explains the used data sets, how the experiments were set up, and what the results are. -\section{Data sets} +\section{Data Sets} This thesis uses the MS COCO~\cite{Lin2014} data set. It contains 80 classes, from airplanes to toothbrushes many classes are present. @@ -615,7 +613,7 @@ impossible values: bounding box height or width lower than zero, \(x_{max}\) and \(y_{max}\) coordinates lower than or equal to zero, \(x_{min}\) greater than \(x_{max}\), \(y_{min}\) greater than \(y_{max}\), image width lower than \(x_{max}\), and image height lower than \(y_{max}\). In the last two cases the -bounding box width or height was set to (image width - \(x_{min}\)) or +bounding box width and height were set to (image width - \(x_{min}\)) and (image height - \(y_{min}\)) respectively; in the other cases the annotation was skipped. If the bounding box width or height afterwards is @@ -623,8 +621,9 @@ lower than or equal to zero the annotation was skipped. 
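The sanitisation of these annotations can be sketched in a few lines of code. The following Python fragment is only a minimal illustration of the rules described above; it assumes MS COCO style annotations given as \(x_{min}\), \(y_{min}\), width, and height, and the function name and signature are hypothetical rather than taken from the actual preprocessing code of this thesis.

\begin{verbatim}
def sanitise_annotation(bbox, image_width, image_height):
    # bbox in MS COCO style: (x_min, y_min, width, height)
    x_min, y_min, width, height = bbox
    x_max = x_min + width
    y_max = y_min + height

    # Unrepairable cases: the annotation is skipped entirely.
    if width < 0 or height < 0:
        return None
    if x_max <= 0 or y_max <= 0:
        return None
    if x_min > x_max or y_min > y_max:
        return None

    # Boxes reaching beyond the image border are shrunk, i.e. width and
    # height become (image width - x_min) and (image height - y_min).
    if x_max > image_width:
        x_max = image_width
    if y_max > image_height:
        y_max = image_height

    # If shrinking leaves no area, the annotation is skipped as well.
    if x_max - x_min <= 0 or y_max - y_min <= 0:
        return None

    return x_min, y_min, x_max, y_max
\end{verbatim}

A box that merely extends past the right or bottom image border is therefore clipped and kept, while a box with inconsistent coordinates is discarded, mirroring the treatment of the MS COCO annotations described above.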
SSD accepts 300x300 input images, the MS COCO data set images were resized to this resolution; the aspect ratio was not kept in the -process. As all images of MS COCO have the same resolution, -this led to a uniform distortion of the images. Furthermore, +process. MS COCO contains landscape and portrait images with resolutions +of 640x480 and 480x640 respectively. This led to a uniform distortion +within each of these two groups of images. Furthermore, the colour channels were swapped from RGB to BGR in order to comply with the SSD implementation. The BGR requirement stems from the usage of Open CV in SSD: the internal channel order for @@ -661,7 +660,7 @@ on the object detection performance. Bayesian SSD was run with 0.2 confidence threshold and compared to vanilla SSD with 0.2 confidence threshold. Coupled with the -entropy threshold, this comparison shows how uncertain the network +entropy threshold, this comparison reveals how uncertain the network is. If it is very certain the dropout sampling should have no significant impact on the result. Furthermore, in two cases the dropout was turned off to isolate the impact of non-maximum suppression @@ -701,7 +700,7 @@ in the next chapter. \hline Bay. SSD - no DO - 0.2 conf - no NMS \; 10 & 0.209 & 2709 & 0.300 & 0.161 \\ no dropout - 0.2 conf - NMS \; 10 & 0.371 & \textbf{2335} & 0.365 & \textbf{0.378} \\ - 0.9 keep ratio - 0.2 conf - NMS \; 10 & 0.360 & 2595 & 0.367 & 0.353 \\ + 0.9 keep ratio - 0.2 conf - NMS \; 10 & 0.359 & 2584 & 0.363 & 0.357 \\ 0.5 keep ratio - 0.2 conf - NMS \; 10 & 0.325 & 2759 & 0.342 & 0.311 \\ % entropy thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3 % 0.5 for Bayesian - 6, 1.4 for 7, 1.4 for 8, 1.3 for 9 @@ -754,15 +753,14 @@ With 2335 open set errors, the Bayesian SSD variant with disabled dropout and enabled non-maximum suppression offers the best performance with respect to open set errors. It also has the best precision (0.378) of all tested variants. Furthermore, it provides the best performance among all variants -with multiple forward passes except for recall. +with multiple forward passes. Dropout decreases the performance of the network, this can be seen in the lower \(F_1\) scores, higher open set errors, and lower precision -values. The variant with 0.9 keep ratio outperforms all other Bayesian -variants with respect to recall (0.367). The variant with 0.5 keep -ratio has worse recall (0.342) than the variant with disabled dropout. -However, all variants with multiple forward passes have lower open set errors -than all vanilla SSD variants. +values. Both dropout variants have worse recall (0.363 and 0.342) than +the variant with disabled dropout. +However, all variants with multiple forward passes have lower open set +errors than all vanilla SSD variants. The relation of \(F_1\) score to absolute open set error can be observed in figure \ref{fig:ose-f1-micro}. Precision-recall curves for all variants @@ -788,7 +786,7 @@ reported figures, such as the ones in Miller et al.~\cite{Miller2018} \hline Bay.
SSD - no DO - 0.2 conf - no NMS \; 10 & 0.226 & \textbf{809} & 0.229 & 0.224 \\ no dropout - 0.2 conf - NMS \; 10 & 0.363 & 1057 & 0.321 & 0.420 \\ - 0.9 keep ratio - 0.2 conf - NMS \; 10 & 0.354 & 1150 & 0.321 & 0.396 \\ + 0.9 keep ratio - 0.2 conf - NMS \; 10 & 0.355 & 1137 & 0.320 & 0.399 \\ 0.5 keep ratio - 0.2 conf - NMS \; 10 & 0.322 & 1264 & 0.307 & 0.340 \\ % entropy thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3 % entropy thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7 @@ -838,14 +836,12 @@ vanilla SSD run with multiple forward passes. With 809 open set errors, the Bayesian SSD variant with disabled dropout and without non-maximum suppression offers the best performance with respect -to open set errors. The variant without dropout and enabled non-maximum suppression has the best \(F_1\) score (0.363) and best -precision (0.420) of all Bayesian variants, and ties with the 0.9 keep ratio -variant on recall (0.321). +to open set errors. The variant without dropout and enabled non-maximum suppression has the best \(F_1\) score (0.363), the best +precision (0.420) and the best recall (0.321) of all Bayesian variants. Dropout decreases the performance of the network, this can be seen in the lower \(F_1\) scores, higher open set errors, and lower precision and -recall values. However, all variants with multiple forward passes and -non-maximum suppression have lower open set errors than all vanilla SSD +recall values. However, all variants with multiple forward passes have lower open set errors than all vanilla SSD variants. The relation of \(F_1\) score to absolute open set error can be observed @@ -864,19 +860,19 @@ reported figures, such as the ones in Miller et al.~\cite{Miller2018} This subsection compares vanilla SSD with Bayesian SSD with respect to specific images that illustrate similarities and differences between both approaches. For this -comparison, 0.2 confidence threshold is applied. Furthermore, Bayesian +comparison, a 0.2 confidence threshold is applied. Furthermore, Bayesian SSD uses non-maximum suppression and dropout with 0.9 keep ratio. \begin{figure} \begin{minipage}[t]{0.48\textwidth} \includegraphics[width=\textwidth]{COCO_val2014_000000336587_bboxes_vanilla} - \caption{Image with stop sign and truck at right edge. Ground truth in blue and predictions in red. Predictions are from vanilla SSD.} + \caption{Image with stop sign and truck at right edge. Ground truth in blue, predictions in red, and rounded to three digits. Predictions are from vanilla SSD.} \label{fig:stop-sign-truck-vanilla} \end{minipage}% \hfill \begin{minipage}[t]{0.48\textwidth} \includegraphics[width=\textwidth]{COCO_val2014_000000336587_bboxes_bayesian} - \caption{Image with stop sign and truck at right edge. Ground truth in blue and predictions in red. Predictions are from Bayesian SSD with 0.9 keep ratio.} + \caption{Image with stop sign and truck at right edge. Ground truth in blue, predictions in red, and rounded to three digits. Predictions are from Bayesian SSD with 0.9 keep ratio.} \label{fig:stop-sign-truck-bayesian} \end{minipage} \end{figure} @@ -889,13 +885,13 @@ that overwhelmingly lie outside the image frame. Furthermore, the predictions ar \begin{figure} \begin{minipage}[t]{0.48\textwidth} \includegraphics[width=\textwidth]{COCO_val2014_000000403817_bboxes_vanilla} - \caption{Image with a cat and laptop/TV. Ground truth in blue and predictions in red and rounded to three digits. Predictions are from vanilla SSD.} + \caption{Image with a cat and laptop/TV. 
Ground truth in blue, predictions in red, and rounded to three digits. Predictions are from vanilla SSD.} \label{fig:cat-laptop-vanilla} \end{minipage}% \hfill \begin{minipage}[t]{0.48\textwidth} \includegraphics[width=\textwidth]{COCO_val2014_000000403817_bboxes_bayesian} - \caption{Image with a cat and laptop/TV. Ground truth in blue and predictions in red and rounded to three digits. Predictions are from Bayesian SSD with 0.9 keep ratio.} + \caption{Image with a cat and laptop/TV. Ground truth in blue, predictions in red, and rounded to three digits. Predictions are from Bayesian SSD with 0.9 keep ratio.} \label{fig:cat-laptop-bayesian} \end{minipage} \end{figure} @@ -917,7 +913,7 @@ The results clearly do not support the hypothesis: \textit{Dropout sampling deli is no area where dropout sampling performs better than vanilla SSD. In the remainder of the section the individual results will be interpreted. -\subsection*{Impact of averaging} +\subsection*{Impact of Averaging} Micro and macro averaging create largely similar results. Notably, micro averaging has a significant performance increase towards the end @@ -945,7 +941,7 @@ threshold is not the largest threshold tested. A lower threshold likely eliminated some false positives from the result set. On the other hand a too low threshold likely eliminated true positives as well. -\subsection*{Non-maximum suppression and top \(k\)} +\subsection*{Non-Maximum Suppression and Top \(k\)} Miller et al.~\cite{Miller2018} supposedly did not use non-maximum suppression in their implementation of dropout sampling. Therefore, a variant with disabled @@ -957,7 +953,7 @@ a lot more false positives remain and have a negative impact on precision. In combination with top \(k\) selection, recall can be affected: duplicate detections could stay and maxima boxes could be removed. -The number of observations was measured before and after the entropy threshold/NMS filter: both Bayesian SSD without +The number of observations was measured before and after the combination of entropy threshold and NMS filter: both Bayesian SSD without NMS and dropout, and Bayesian SSD with NMS and disabled dropout have the same number of observations everywhere before the entropy threshold. After the entropy threshold (the value 1.5 was used for both) and NMS, the variant with NMS has roughly 23\% of its observations left. Without NMS 79\% of observations are left. Irrespective of the absolute @@ -988,7 +984,8 @@ kept by top \(k\). However, persons are likely often on images with many detections and/or have too low confidences. In this example, the likelihood for true positives to be removed in the person category is quite high. For dogs, the probability is far lower. -This goes back to micro and macro averaging and their impact on recall. +This is a good example for micro and macro averaging, and their impact on +recall. \subsection*{Dropout Sampling and Observations} @@ -1043,7 +1040,7 @@ questions that cannot be answered in this thesis. This thesis offers one possible implementation of dropout sampling that technically works. However, this thesis cannot answer why this implementation differs significantly from Miller et al. The complete source code or otherwise exhaustive -implementation details would be required to attempt an answer. +implementation details of Miller et al. would be required to attempt an answer. Future work could explore the performance of this implementation when used on an SSD variant that was fine-tuned or trained with dropout. In this case, it