From 2f9623e3d5ca46f728433b1511683726df7e6095 Mon Sep 17 00:00:00 2001 From: Jim Martens Date: Tue, 24 Sep 2019 12:22:13 +0200 Subject: [PATCH] Improved a variety of smaller things Signed-off-by: Jim Martens --- body.tex | 111 +++++++++++++++++++++++++++---------------------------- 1 file changed, 54 insertions(+), 57 deletions(-) diff --git a/body.tex b/body.tex index 02c8bff..bad9280 100644 --- a/body.tex +++ b/body.tex @@ -7,9 +7,9 @@ providing technical details. \subsection*{Motivation} -Famous examples like the automatic soap dispenser which does not +Famous examples like the automatic soap dispenser, which does not recognise the hand of a black person but dispenses soap when presented -with a paper towel raise the question of bias in computer +with a paper towel, raise the question of bias in computer systems~\cite{Friedman1996}. Related to this ethical question regarding the design of so called algorithms is the question of algorithmic accountability~\cite{Diakopoulos2014}. @@ -132,26 +132,28 @@ conditions compared to object detection without it.} For the purpose of this thesis, I will use the vanilla SSD (as in: the original SSD) as baseline to compare against. In particular, vanilla SSD uses a per-class confidence threshold of 0.01, an IOU threshold of 0.45 -for the non-maximum suppression, and a top k value of 200. +for the non-maximum suppression, and a top \(k\) value of 200. For this +thesis, the top \(k\) value was changed to 20 and the confidence threshold +of 0.2 was tried as well. The effect of an entropy threshold is measured against this vanilla SSD by applying entropy thresholds from 0.1 to 2.4 inclusive (limits taken from -Miller et al.). Dropout sampling is compared to vanilla SSD, both +Miller et al.). Dropout sampling is compared to vanilla SSD with and without entropy thresholding. \paragraph{Hypothesis} Dropout sampling delivers better object detection performance under open set conditions compared to object detection without it. -\subsection*{Reader's guide} +\subsection*{Reader's Guide} First, chapter \ref{chap:background} presents related works and -provides the background for dropout sampling a.k.a Bayesian SSD. -Afterwards, chapter \ref{chap:methods} explains how the Bayesian SSD -works and how the decoding pipelines are structured. +provides the background for dropout sampling. +Afterwards, chapter \ref{chap:methods} explains how vanilla SSD works, how +Bayesian SSD extends vanilla SSD, and how the decoding pipelines are +structured. Chapter \ref{chap:experiments-results} presents the data sets, the experimental setup, and the results. This is followed by -chapter \ref{chap:discussion}, focusing on -the discussion and closing. +chapter \ref{chap:discussion}, focusing on the discussion and closing. Therefore, the contribution is found in chapters \ref{chap:methods}, \ref{chap:experiments-results}, and \ref{chap:discussion}. @@ -162,8 +164,7 @@ Therefore, the contribution is found in chapters \ref{chap:methods}, This chapter will begin with an overview over previous works in the field of this thesis. Afterwards the theoretical foundations -of the work of Miller et al.~\cite{Miller2018} will -be explained. +of dropout sampling will be explained. \section{Related Works} @@ -176,7 +177,7 @@ reconstruction-based novelty detection as it deals only with neural network approaches. Therefore, the other types of novelty detection will only be briefly introduced. 
-\subsection{Overview over types of novelty detection} +\subsection{Overview of Types of Novelty Detection} Probabilistic approaches estimate the generative probability density function (pdf) of the data. It is assumed that the training data is generated from an underlying @@ -208,7 +209,7 @@ difference in the metric when removed from the data set. This subset is consider to consist of novel data. For example, Filippone and Sanguinetti \cite{Filippone2011} provide a recent approach. -\subsection{Reconstruction-based novelty detection} +\subsection{Reconstruction-based Novelty Detection} Reconstruction-based approaches use the reconstruction error in one form or another to calculate the novelty score. This can be auto-encoders that @@ -224,7 +225,7 @@ Novelty detection for object detection is intricately linked with open set conditions: the test data can contain unknown classes. Bishop~\cite{Bishop1994} investigated the correlation between the degree of novel input data and the reliability of network -outputs. +outputs, and introduced a quantitative way to measure novelty. The Bayesian approach provides a theoretical foundation for modelling uncertainty \cite{Ghahramani2015}. @@ -259,20 +260,17 @@ Li et al.~\cite{Li2019} investigated the problem of poor performance when combining dropout and batch normalisation: Dropout shifts the variance of a neural unit when switching from train to test, batch normalisation does not change the variance. This inconsistency leads to a variance shift which -can have a larger or smaller impact based on the network used. For example, -adding dropout layers to SSD \cite{Liu2016} and applying MC dropout, like -Miller et al.~\cite{Miller2018} did, causes such a problem because SSD uses -batch normalisation. +can have a larger or smaller impact based on the network used. Non-Bayesian approaches have been developed as well. Usually, they compare with MC dropout and show better performance. Postels et al.~\cite{Postels2019} provided a sampling-free approach for uncertainty estimation that does not affect training and approximates the -sampling on test time. They compared it to MC dropout and found less computational +sampling at test time. They compared it to MC dropout and found less computational overhead with better results. Lakshminarayanan et al.~\cite{Lakshminarayanan2017} implemented a predictive uncertainty estimation using deep ensembles. -Compared to MC dropout, it showed better results. +Compared to MC dropout, it shows better results. Geifman et al.~\cite{Geifman2018} introduced an uncertainty estimation algorithm for non-Bayesian deep neural classification that estimates the uncertainty of highly @@ -288,10 +286,10 @@ are important as well. Mukhoti and Gal~\cite{Mukhoti2018} contributed metrics to measure uncertainty for semantic segmentation. Wu et al.~\cite{Wu2019} introduced two innovations that turn variational Bayes into a robust tool for Bayesian -networks: a novel deterministic method to approximate +networks: first, a novel deterministic method to approximate moments in neural networks which eliminates gradient variance, and -a hierarchical prior for parameters and an empirical Bayes procedure to select -prior variances. +second, a hierarchical prior for parameters and an empirical Bayes +procedure to select prior variances.
\section{Background for Dropout Sampling} @@ -342,7 +340,7 @@ over the network weights, for example a Gaussian prior distribution: \(\mathbf{W}\) are the weights and \(I\) symbolises that every weight is drawn from an independent and identical distribution. The training of the network determines a plausible set of weights by -evaluating the posterior (probability output) over the weights given +evaluating the probability output (posterior) over the weights given the training data \(\mathbf{T}\): \(p(\mathbf{W}|\mathbf{T})\). However, this evaluation cannot be performed in any reasonable @@ -369,7 +367,7 @@ training data \(\mathbf{T}\): p(y|\mathcal{I}, \mathbf{T}) = \int p(y|\mathcal{I}, \mathbf{W}) \cdot p(\mathbf{W}|\mathbf{T})d\mathbf{W} \approx \frac{1}{n} \sum_{i=1}^{n}\mathbf{s}_i \end{equation} -With this dropout sampling technique \(n\) model weights +With this dropout sampling technique, \(n\) model weights \(\widetilde{\mathbf{W}}_i\) are sampled from the posterior \(p(\mathbf{W}|\mathbf{T})\). The class probability \(p(y|\mathcal{I}, \mathbf{T})\) is a probability vector @@ -437,8 +435,8 @@ Vanilla SSD is based upon the VGG-16 network (see figure \ref{fig:vanilla-ssd}) and adds extra feature layers. The entire image (always size 300x300) is divided up into anchor boxes. During training, each of these boxes is mapped to a ground truth box or -background. For every anchor box the offset to -the object, and the class confidences are calculated. The output of the +background. For every anchor box both the offset to +the object and the class confidences are calculated. The output of the SSD network are the predictions with class confidences, offsets to the anchor box, anchor box coordinates, and variance. The model loss is a weighted sum of localisation and confidence loss. As the network @@ -567,7 +565,7 @@ Additionally, all detections with a background prediction of 0.8 or higher are d The remaining detections are partitioned into observations to further reduce the size of the output, and to identify uncertainty. This is accomplished by calculating the -mutual IOU of every detection with all other detections. Detections +mutual IOU score of every detection with all other detections. Detections with a mutual IOU score of 0.95 or higher are partitioned into an observation. Next, the softmax scores and bounding box coordinates of all detections in an observation are averaged. @@ -596,7 +594,7 @@ at the end. This chapter explains the used data sets, how the experiments were set up, and what the results are. -\section{Data sets} +\section{Data Sets} This thesis uses the MS COCO~\cite{Lin2014} data set. It contains 80 classes, from airplanes to toothbrushes many classes are present. @@ -615,7 +613,7 @@ impossible values: bounding box height or width lower than zero, \(x_{max}\) and \(y_{max}\) coordinates lower than or equal to zero, \(x_{min}\) greater than \(x_{max}\), \(y_{min}\) greater than \(y_{max}\), image width lower than \(x_{max}\), and image height lower than \(y_{max}\). In the last two cases the -bounding box width or height was set to (image width - \(x_{min}\)) or +bounding box width and height were set to (image width - \(x_{min}\)) and (image height - \(y_{min}\)) respectively; in the other cases the annotation was skipped. If the bounding box width or height afterwards is @@ -623,8 +621,9 @@ lower than or equal to zero the annotation was skipped. 
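The sanitisation of these annotations can be sketched in a few lines of code. The following Python fragment is only a minimal illustration of the rules described above; it assumes MS COCO style annotations given as \(x_{min}\), \(y_{min}\), width, and height, and the function name and signature are hypothetical rather than taken from the actual preprocessing code of this thesis.

\begin{verbatim}
def sanitise_annotation(bbox, image_width, image_height):
    # bbox in MS COCO style: (x_min, y_min, width, height)
    x_min, y_min, width, height = bbox
    x_max = x_min + width
    y_max = y_min + height

    # Unrepairable cases: the annotation is skipped entirely.
    if width < 0 or height < 0:
        return None
    if x_max <= 0 or y_max <= 0:
        return None
    if x_min > x_max or y_min > y_max:
        return None

    # Boxes reaching beyond the image border are shrunk, i.e. width and
    # height become (image width - x_min) and (image height - y_min).
    if x_max > image_width:
        x_max = image_width
    if y_max > image_height:
        y_max = image_height

    # If shrinking leaves no area, the annotation is skipped as well.
    if x_max - x_min <= 0 or y_max - y_min <= 0:
        return None

    return x_min, y_min, x_max, y_max
\end{verbatim}

A box that merely extends past the right or bottom image border is therefore clipped and kept, while a box with inconsistent coordinates is discarded, mirroring the treatment of the MS COCO annotations described above.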
SSD accepts 300x300 input images, the MS COCO data set images were resized to this resolution; the aspect ratio was not kept in the -process. As all images of MS COCO have the same resolution, -this led to a uniform distortion of the images. Furthermore, +process. MS COCO contains landscape and portrait images with resolutions +of 640x480 and 480x640 respectively. This led to a uniform distortion +within each of these two groups of images. Furthermore, the colour channels were swapped from RGB to BGR in order to comply with the SSD implementation. The BGR requirement stems from the usage of Open CV in SSD: the internal channel order for @@ -661,7 +660,7 @@ on the object detection performance. Bayesian SSD was run with 0.2 confidence threshold and compared to vanilla SSD with 0.2 confidence threshold. Coupled with the -entropy threshold, this comparison shows how uncertain the network +entropy threshold, this comparison reveals how uncertain the network is. If it is very certain the dropout sampling should have no significant impact on the result. Furthermore, in two cases the dropout was turned off to isolate the impact of non-maximum suppression @@ -701,7 +700,7 @@ in the next chapter. \hline Bay. SSD - no DO - 0.2 conf - no NMS \; 10 & 0.209 & 2709 & 0.300 & 0.161 \\ no dropout - 0.2 conf - NMS \; 10 & 0.371 & \textbf{2335} & 0.365 & \textbf{0.378} \\ - 0.9 keep ratio - 0.2 conf - NMS \; 10 & 0.360 & 2595 & 0.367 & 0.353 \\ + 0.9 keep ratio - 0.2 conf - NMS \; 10 & 0.359 & 2584 & 0.363 & 0.357 \\ 0.5 keep ratio - 0.2 conf - NMS \; 10 & 0.325 & 2759 & 0.342 & 0.311 \\ % entropy thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3 % 0.5 for Bayesian - 6, 1.4 for 7, 1.4 for 8, 1.3 for 9 @@ -754,15 +753,14 @@ With 2335 open set errors, the Bayesian SSD variant with disabled dropout and enabled non-maximum suppression offers the best performance with respect to open set errors. It also has the best precision (0.378) of all tested variants. Furthermore, it provides the best performance among all variants -with multiple forward passes except for recall. +with multiple forward passes. Dropout decreases the performance of the network, this can be seen in the lower \(F_1\) scores, higher open set errors, and lower precision -values. The variant with 0.9 keep ratio outperforms all other Bayesian -variants with respect to recall (0.367). The variant with 0.5 keep -ratio has worse recall (0.342) than the variant with disabled dropout. -However, all variants with multiple forward passes have lower open set errors -than all vanilla SSD variants. +values. Both dropout variants have worse recall (0.363 and 0.342) than +the variant with disabled dropout. +However, all variants with multiple forward passes have lower open set +errors than all vanilla SSD variants. The relation of \(F_1\) score to absolute open set error can be observed in figure \ref{fig:ose-f1-micro}. Precision-recall curves for all variants @@ -788,7 +786,7 @@ reported figures, such as the ones in Miller et al.~\cite{Miller2018} \hline Bay.
SSD - no DO - 0.2 conf - no NMS \; 10 & 0.226 & \textbf{809} & 0.229 & 0.224 \\ no dropout - 0.2 conf - NMS \; 10 & 0.363 & 1057 & 0.321 & 0.420 \\ - 0.9 keep ratio - 0.2 conf - NMS \; 10 & 0.354 & 1150 & 0.321 & 0.396 \\ + 0.9 keep ratio - 0.2 conf - NMS \; 10 & 0.355 & 1137 & 0.320 & 0.399 \\ 0.5 keep ratio - 0.2 conf - NMS \; 10 & 0.322 & 1264 & 0.307 & 0.340 \\ % entropy thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3 % entropy thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7 @@ -838,14 +836,12 @@ vanilla SSD run with multiple forward passes. With 809 open set errors, the Bayesian SSD variant with disabled dropout and without non-maximum suppression offers the best performance with respect -to open set errors. The variant without dropout and enabled non-maximum suppression has the best \(F_1\) score (0.363) and best -precision (0.420) of all Bayesian variants, and ties with the 0.9 keep ratio -variant on recall (0.321). +to open set errors. The variant without dropout and enabled non-maximum suppression has the best \(F_1\) score (0.363), the best +precision (0.420) and the best recall (0.321) of all Bayesian variants. Dropout decreases the performance of the network, this can be seen in the lower \(F_1\) scores, higher open set errors, and lower precision and -recall values. However, all variants with multiple forward passes and -non-maximum suppression have lower open set errors than all vanilla SSD +recall values. However, all variants with multiple forward passes have lower open set errors than all vanilla SSD variants. The relation of \(F_1\) score to absolute open set error can be observed @@ -864,19 +860,19 @@ reported figures, such as the ones in Miller et al.~\cite{Miller2018} This subsection compares vanilla SSD with Bayesian SSD with respect to specific images that illustrate similarities and differences between both approaches. For this -comparison, 0.2 confidence threshold is applied. Furthermore, Bayesian +comparison, a 0.2 confidence threshold is applied. Furthermore, Bayesian SSD uses non-maximum suppression and dropout with 0.9 keep ratio. \begin{figure} \begin{minipage}[t]{0.48\textwidth} \includegraphics[width=\textwidth]{COCO_val2014_000000336587_bboxes_vanilla} - \caption{Image with stop sign and truck at right edge. Ground truth in blue and predictions in red. Predictions are from vanilla SSD.} + \caption{Image with stop sign and truck at right edge. Ground truth in blue, predictions in red, and rounded to three digits. Predictions are from vanilla SSD.} \label{fig:stop-sign-truck-vanilla} \end{minipage}% \hfill \begin{minipage}[t]{0.48\textwidth} \includegraphics[width=\textwidth]{COCO_val2014_000000336587_bboxes_bayesian} - \caption{Image with stop sign and truck at right edge. Ground truth in blue and predictions in red. Predictions are from Bayesian SSD with 0.9 keep ratio.} + \caption{Image with stop sign and truck at right edge. Ground truth in blue, predictions in red, and rounded to three digits. Predictions are from Bayesian SSD with 0.9 keep ratio.} \label{fig:stop-sign-truck-bayesian} \end{minipage} \end{figure} @@ -889,13 +885,13 @@ that overwhelmingly lie outside the image frame. Furthermore, the predictions ar \begin{figure} \begin{minipage}[t]{0.48\textwidth} \includegraphics[width=\textwidth]{COCO_val2014_000000403817_bboxes_vanilla} - \caption{Image with a cat and laptop/TV. Ground truth in blue and predictions in red and rounded to three digits. Predictions are from vanilla SSD.} + \caption{Image with a cat and laptop/TV. 
Ground truth in blue, predictions in red, and rounded to three digits. Predictions are from vanilla SSD.} \label{fig:cat-laptop-vanilla} \end{minipage}% \hfill \begin{minipage}[t]{0.48\textwidth} \includegraphics[width=\textwidth]{COCO_val2014_000000403817_bboxes_bayesian} - \caption{Image with a cat and laptop/TV. Ground truth in blue and predictions in red and rounded to three digits. Predictions are from Bayesian SSD with 0.9 keep ratio.} + \caption{Image with a cat and laptop/TV. Ground truth in blue, predictions in red, and rounded to three digits. Predictions are from Bayesian SSD with 0.9 keep ratio.} \label{fig:cat-laptop-bayesian} \end{minipage} \end{figure} @@ -917,7 +913,7 @@ The results clearly do not support the hypothesis: \textit{Dropout sampling deli is no area where dropout sampling performs better than vanilla SSD. In the remainder of the section the individual results will be interpreted. -\subsection*{Impact of averaging} +\subsection*{Impact of Averaging} Micro and macro averaging create largely similar results. Notably, micro averaging has a significant performance increase towards the end @@ -945,7 +941,7 @@ threshold is not the largest threshold tested. A lower threshold likely eliminated some false positives from the result set. On the other hand a too low threshold likely eliminated true positives as well. -\subsection*{Non-maximum suppression and top \(k\)} +\subsection*{Non-Maximum Suppression and Top \(k\)} Miller et al.~\cite{Miller2018} supposedly did not use non-maximum suppression in their implementation of dropout sampling. Therefore, a variant with disabled @@ -957,7 +953,7 @@ a lot more false positives remain and have a negative impact on precision. In combination with top \(k\) selection, recall can be affected: duplicate detections could stay and maxima boxes could be removed. -The number of observations was measured before and after the entropy threshold/NMS filter: both Bayesian SSD without +The number of observations was measured before and after the combination of entropy threshold and NMS filter: both Bayesian SSD without NMS and dropout, and Bayesian SSD with NMS and disabled dropout have the same number of observations everywhere before the entropy threshold. After the entropy threshold (the value 1.5 was used for both) and NMS, the variant with NMS has roughly 23\% of its observations left. Without NMS 79\% of observations are left. Irrespective of the absolute @@ -988,7 +984,8 @@ kept by top \(k\). However, persons are likely often on images with many detections and/or have too low confidences. In this example, the likelihood for true positives to be removed in the person category is quite high. For dogs, the probability is far lower. -This goes back to micro and macro averaging and their impact on recall. +This is a good example for micro and macro averaging, and their impact on +recall. \subsection*{Dropout Sampling and Observations} @@ -1043,7 +1040,7 @@ questions that cannot be answered in this thesis. This thesis offers one possible implementation of dropout sampling that technically works. However, this thesis cannot answer why this implementation differs significantly from Miller et al. The complete source code or otherwise exhaustive -implementation details would be required to attempt an answer. +implementation details of Miller et al. would be required to attempt an answer. Future work could explore the performance of this implementation when used on an SSD variant that was fine-tuned or trained with dropout. In this case, it