From d84d65ddd6e2c744406dbad4371d591d1110d6b9 Mon Sep 17 00:00:00 2001
From: Jim Martens
Date: Sun, 29 Sep 2019 15:08:03 +0200
Subject: [PATCH] Complete pass over thesis

- added more glossary terms
- fixed tenses
- removed wrong plural forms of open set error
- removed unsupported claims
- improved wording and language

Signed-off-by: Jim Martens

---
 body.tex     | 469 +++++++++++++++++++++++++--------------------------
 glossary.tex |  33 ++++
 2 files changed, 265 insertions(+), 237 deletions(-)

diff --git a/body.tex b/body.tex
index 286e97a..5942b49 100644
--- a/body.tex
+++ b/body.tex
@@ -2,7 +2,7 @@
 \chapter{Introduction}
 
-The introduction will explain the wider context first, before
+The introduction first explains the wider context before
 providing technical details.
 
 \subsection*{Motivation}
@@ -64,12 +64,12 @@ classify what they know. If presented with an object type that
 the network was not trained with, as happens frequently in real
 environments, it will still classify the object and might even
 have a high confidence in doing so. Such an example would be
-a false positive. Any ordinary person who uses the results of
-such a network would falsely assume that a high confidence always
-means the classification is very likely correct. If they use
-a proprietary system they might not even be able to find out
+a false positive. Anyone who uses the results of
+such a network could falsely assume that a high confidence always
+means the classification is very likely correct. If one uses
+a proprietary system, one might not even be able to find out
 that the network was never trained on a particular type of object.
-Therefore, it would be impossible for them to identify the output
+Therefore, it would be impossible for one to identify the output
 of the network as a false positive.
 
 This reaffirms the need for automatic explanation. Such a system
@@ -105,7 +105,7 @@ training these auto-encoders learn to reproduce a certain group
 of object classes. The actual novelty detection takes place
 during testing: given an image, and the output and loss of the
 auto-encoder, a novelty score is calculated. For some novelty
-detection approaches the reconstruction loss is exactly the novelty
+detection approaches the reconstruction loss is the novelty
 score; others consider more factors. A low novelty score signals a
 known object. The opposite is true for a high novelty score.
@@ -120,10 +120,10 @@ Therefore, a comparison between model uncertainty with a network like
 SSD and novelty detection with auto-encoders is considered
 out of scope for this thesis.
 
-Miller et al.~\cite{Miller2018} used an \gls{SSD} pre-trained on COCO
+Miller et al.~\cite{Miller2018} use an \gls{SSD} pre-trained on COCO
 without further fine-tuning on the SceneNet RGB-D data
-set~\cite{McCormac2017} and reported good results regarding
-open set error for an \gls{SSD} variant with dropout sampling and entropy
+set~\cite{McCormac2017} and report good results regarding
+\gls{OSE} for an \gls{SSD} variant with dropout sampling and \gls{entropy}
 thresholding. If their results are generalisable, it should be possible
 to replicate the relative difference between the variants on the COCO
 data set.
@@ -131,16 +131,16 @@ This leads to the following hypothesis: \emph{Dropout sampling
 delivers better object detection performance under open set
 conditions compared to object detection without it.}
 
-For the purpose of this thesis, I will use the \gls{vanilla} \gls{SSD} (as in: the original SSD) as
+For the purpose of this thesis, I use the \gls{vanilla} \gls{SSD} (as in: the original \gls{SSD}) as
 baseline to compare against. In particular, \gls{vanilla} \gls{SSD} uses
-a per-class confidence threshold of 0.01, an IOU threshold of 0.45
+a per class confidence threshold of 0.01, an IOU threshold of 0.45
 for the \gls{NMS}, and a top \(k\) value of 200. For this
-thesis, the top \(k\) value was changed to 20 and the confidence threshold
-of 0.2 was tried as well.
-The effect of an entropy threshold is measured against this \gls{vanilla}
-SSD by applying entropy thresholds from 0.1 to 2.4 inclusive (limits taken from
-Miller et al.). Dropout sampling is compared to \gls{vanilla} SSD
-with and without entropy thresholding.
+thesis, the top \(k\) value has been changed to 20 and the confidence threshold
+of 0.2 has been tried as well.
+The effect of an \gls{entropy} threshold is measured against this \gls{vanilla}
+SSD by applying \gls{entropy} thresholds from 0.1 to 2.4 inclusive (limits taken from
+Miller et al.). Dropout sampling is compared to \gls{vanilla} \gls{SSD}
+with and without \gls{entropy} thresholding.
 
 \paragraph{Hypothesis}
 Dropout sampling delivers better object detection performance under open set
@@ -151,7 +151,7 @@ conditions compared to object detection without it.
 
 First, chapter \ref{chap:background} presents related works and
 provides the background for dropout sampling. Afterwards, chapter
 \ref{chap:methods} explains how \gls{vanilla} \gls{SSD} works, how
-Bayesian \gls{SSD} extends \gls{vanilla} SSD, and how the decoding pipelines are
+Bayesian \gls{SSD} extends \gls{vanilla} \gls{SSD}, and how the decoding pipelines are
 structured.
 Chapter \ref{chap:experiments-results} presents the data sets,
 the experimental setup, and the results. This is followed by
@@ -164,9 +164,9 @@ Therefore, the contribution is found in chapters \ref{chap:methods},
 
 \label{chap:background}
 
-This chapter will begin with an overview over previous works
+This chapter begins with an overview of previous works
 in the field of this thesis. Afterwards, the theoretical foundations
-of dropout sampling will be explained.
+of dropout sampling are explained.
 
 \section{Related Works}
 
@@ -181,7 +181,7 @@ briefly introduced.
 
 \subsection{Overview of Types of Novelty Detection}
 
-Probabilistic approaches estimate the generative probability density function (pdf)
+Probabilistic approaches estimate the generative \gls{pdf}
 of the data. It is assumed that the training data is generated
 from an underlying probability distribution \(D\). This distribution
 can be estimated with the training data; the estimate is defined as \(\hat D\) and represents a model
@@ -194,7 +194,7 @@ Distance-based novelty detection uses either nearest
 neighbour-based approaches
 or clustering-based approaches (e.g. \(k\)-means clustering
 algorithm \cite{Jordan1994}). Both methods are similar to estimating the
-pdf of data, they use well-defined distance metrics to compute the distance
+\gls{pdf} of data: they use well-defined distance metrics to compute the distance
 between two data points.
 
 Domain-based novelty detection describes the boundary of the known data, rather
@@ -203,7 +203,7 @@ the boundary.
A common implementation for this uses support vector machines (e.g. as implemented
by Song et al. \cite{Song2002}).
Information-theoretic novelty detection computes the information content
-of a data set, for example, with metrics like entropy. Such metrics assume
+of a data set, for example, with metrics like \gls{entropy}. Such metrics assume
 that novel data inside the data set significantly alters the information
 content of an otherwise normal data set. First, the metrics are calculated over the
 whole data set. Afterwards, a subset is identified that causes the biggest
@@ -216,8 +216,8 @@ a recent approach.
 Reconstruction-based approaches use the reconstruction error in one
 form or another to calculate the novelty score. This can be auto-encoders
 that literally reconstruct the input but it also includes MLP networks which try
-to reconstruct the ground truth. Pimentel et al.~\cite{Pimentel2014} differentiated
-between neural network-based approaches and subspace methods. The first were
+to reconstruct the ground truth. Pimentel et al.~\cite{Pimentel2014} differentiate
+between neural network-based approaches and subspace methods. The former are
 further differentiated into MLPs, Hopfield networks, autoassociative
 networks, radial basis function, and self-organising networks.
 The remainder of this section focuses on MLP-based works; a particular focus will
@@ -225,40 +225,41 @@ be on the task of object detection and Bayesian networks.
 
 Novelty detection for object detection is intricately linked with
 open set conditions: the test data can contain unknown classes.
-Bishop~\cite{Bishop1994} investigated the correlation between
+Bishop~\cite{Bishop1994} investigates the correlation between
 the degree of novel input data and the reliability of network
 outputs, and introduces a quantitative way to measure novelty.
 
 The Bayesian approach provides a theoretical foundation for
 modelling uncertainty \cite{Ghahramani2015}.
-MacKay~\cite{MacKay1992} provided a practical Bayesian
-framework for backpropagation networks. Neal~\cite{Neal1996} built upon
-the work of MacKay and explored Bayesian learning for neural networks.
+MacKay~\cite{MacKay1992} provides a practical Bayesian
+framework for backpropagation networks. Neal~\cite{Neal1996} builds upon
+the work of MacKay and explores Bayesian learning for neural networks.
 However, these Bayesian neural networks do not scale well. Over the course
 of time, two major Bayesian approximations were introduced: one based
 on dropout and one based on batch normalisation.
 
-Gal and Ghahramani~\cite{Gal2016} showed that dropout training is a
+Gal and Ghahramani~\cite{Gal2016} show that dropout training is a
 Bayesian approximation of a Gaussian process. Subsequently, Gal~\cite{Gal2017}
-showed that dropout training actually corresponds to a general approximate
+shows that dropout training actually corresponds to a general approximate
 Bayesian model. This means every network trained with dropout is an approximate
 Bayesian model. During inference the dropout remains active; this form of
 inference is called Monte Carlo Dropout (MCDO).
-Miller et al.~\cite{Miller2018} built upon the work of Gal and Ghahramani: they
+Miller et al.~\cite{Miller2018} build upon the work of Gal and Ghahramani: they
 use MC dropout under open-set conditions for object detection.
-In a second paper \cite{Miller2018a}, Miller et al. continued their work and
-compared merging strategies for sampling-based uncertainty techniques in
+In a second paper \cite{Miller2018a}, Miller et al.
continue their work and
+compare merging strategies for sampling-based uncertainty techniques in
 object detection.
 
 Teye et al.~\cite{Teye2018} make the point that most modern networks have
 adopted other regularisation techniques. Ioffe and Szegedy~\cite{Ioffe2015}
-introduced batch normalisation which has been adapted widely. Teye et al.
-showed how batch normalisation training is similar to dropout and can be
+introduce batch normalisation, which has been adopted widely in the
+meantime. Teye et al.
+show how batch normalisation training is similar to dropout and can be
 viewed as an approximate Bayesian inference. Estimates of the model
 uncertainty can be gained with a technique named Monte Carlo Batch
 Normalisation (MCBN). Consequently, this technique can be applied to
 any network that utilises standard batch normalisation.
-Li et al.~\cite{Li2019} investigated the problem of poor performance
+Li et al.~\cite{Li2019} investigate the problem of poor performance
 when combining dropout and batch normalisation: dropout shifts the variance
 of a neural unit when switching from train to test, whereas batch normalisation
 does not change the variance. This inconsistency leads to a variance shift which
@@ -266,15 +267,15 @@ can have a larger or smaller impact based on the network used.
 
 Non-Bayesian approaches have been developed as well. Usually, they are compared
 with MC dropout and show better performance.
-Postels et al.~\cite{Postels2019} provided a sampling-free approach for
+Postels et al.~\cite{Postels2019} provide a sampling-free approach for
 uncertainty estimation that does not affect training and approximates the
-sampling at test time. They compared it to MC dropout and found less computational
+sampling at test time. They compare it to MC dropout and find less computational
 overhead with better results. Lakshminarayanan et al.~\cite{Lakshminarayanan2017}
-implemented a predictive uncertainty estimation using deep ensembles.
+implement a predictive uncertainty estimation using deep ensembles.
 Compared to MC dropout, it shows better results. Geifman et al.~\cite{Geifman2018}
-introduced an uncertainty estimation algorithm for non-Bayesian deep
+introduce an uncertainty estimation algorithm for non-Bayesian deep
 neural classification that estimates the uncertainty of highly confident
 points using earlier snapshots of the trained model and improves, among
 others, the approach introduced by Lakshminarayanan et al.
@@ -285,8 +286,8 @@ the predictions of a neural network are treated as subjective opinions.
 
 In addition to the aforementioned Bayesian and non-Bayesian works,
 there are some Bayesian works that do not quite fit with the rest
 but are important as well. Mukhoti and Gal~\cite{Mukhoti2018}
-contributed metrics to measure uncertainty for semantic
-segmentation. Wu et al.~\cite{Wu2019} introduced two innovations
+contribute metrics to measure uncertainty for semantic
+segmentation. Wu et al.~\cite{Wu2019} introduce two innovations
 that turn variational Bayes into a robust tool for Bayesian
 networks: first, a novel deterministic method to approximate
 moments in neural networks which eliminates gradient variance, and
@@ -311,7 +312,7 @@ procedure to select prior variances.
\(\mathcal{I}\) & an image \\
 \(\mathbf{q} = p(y|\mathcal{I}, \mathbf{T})\) & probability of all classes given image and training data \\
- \(H(\mathbf{q})\) & entropy over probability vector \\
+ \(H(\mathbf{q})\) & \gls{entropy} over probability vector \\
 \(\widetilde{\mathbf{W}}\) & weights sampled from \(p(\mathbf{W}|\mathbf{T})\) \\
 \(\mathbf{b}\) & bounding box coordinates \\
@@ -342,12 +343,12 @@ over the network weights, for example a Gaussian prior distribution:
 
 \(\mathbf{W}\) are the weights and \(I\) symbolises that every
 weight is drawn from an independent and identical distribution. The
 training of the network determines a plausible set of weights by
-evaluating the probability output (posterior) over the weights given
+evaluating the probability output (\gls{posterior}) over the weights given
 the training data \(\mathbf{T}\): \(p(\mathbf{W}|\mathbf{T})\).
 However, this evaluation cannot be performed in any
 reasonable time. Therefore, approximation techniques are
-required. In those techniques the posterior is fitted with a
+required. In those techniques the \gls{posterior} is fitted with a
 simple distribution \(q^{*}_{\theta}(\mathbf{W})\). The original
 and intractable problem of averaging over all weights in the
 network is replaced with an optimisation task, where the parameters of the
@@ -355,14 +356,14 @@ simple distribution are optimised over~\cite{Kendall2017}.
 
 \subsubsection*{Dropout Variational Inference}
 
-Kendall and Gal~\cite{Kendall2017} showed an approximation for
+Kendall and Gal~\cite{Kendall2017} show an approximation for
 classification and recognition tasks. Dropout variational inference
 is a practical approximation technique: dropout layers are added in
 front of every weight layer and are also used during test
-time to sample from the approximate posterior. Effectively, this
+time to sample from the approximate \gls{posterior}. Effectively, this
 results in the approximation of the class probability
-\(p(y|\mathcal{I}, \mathbf{T})\) by performing multiple forward
-passes through the network and averaging over the obtained Softmax
+\(p(y|\mathcal{I}, \mathbf{T})\) by performing \(n\) forward
+passes through the network and averaging the obtained softmax
 scores \(\mathbf{s}_i\), given an image \(\mathcal{I}\) and the training data \(\mathbf{T}\):
\begin{equation} \label{eq:drop-sampling}
p(y|\mathcal{I}, \mathbf{T}) = \int p(y|\mathcal{I}, \mathbf{W}) \cdot p(\mathbf{W}|\mathbf{T}) \, d\mathbf{W} \approx \frac{1}{n} \sum_{i=1}^{n} \mathbf{s}_i
\end{equation}
 
 With this dropout sampling technique, \(n\) model weights
-\(\widetilde{\mathbf{W}}_i\) are sampled from the posterior
+\(\widetilde{\mathbf{W}}_i\) are sampled from the \gls{posterior}
 \(p(\mathbf{W}|\mathbf{T})\). The class probability
 \(p(y|\mathcal{I}, \mathbf{T})\) is a probability vector
 \(\mathbf{q}\) over all class labels.
 Finally, the uncertainty of the network with respect to the
 classification is given by
-the entropy \(H(\mathbf{q}) = - \sum_i q_i \cdot \log q_i\).
+the \gls{entropy} \(H(\mathbf{q}) = - \sum_i q_i \cdot \log q_i\).
 
 \subsubsection*{Dropout Sampling for Object Detection}
 
 Miller et al.~\cite{Miller2018} apply dropout sampling to
 object detection. In that case \(\mathbf{W}\) represents the
-learned weights of a detection network like SSD~\cite{Liu2016}.
+learned weights of a detection network like \gls{SSD}~\cite{Liu2016}.
 Every forward pass uses a different network
 \(\widetilde{\mathbf{W}}\) which is approximately sampled from
 \(p(\mathbf{W}|\mathbf{T})\).
Each forward pass in object
@@ -398,20 +399,20 @@ Subsequently, the corresponding vector of class probabilities
 score vectors \(\mathbf{s}_j\) in a particular observation
 \(\mathcal{O}_i\): \(\overline{\mathbf{q}}_i \approx \overline{\mathbf{s}}_i = \frac{1}{n} \sum_{j=1}^{n} \mathbf{s}_j\).
 The label uncertainty of the detector for a particular observation is measured by
-the entropy \(H(\overline{\mathbf{q}}_i)\).
+the \gls{entropy} \(H(\overline{\mathbf{q}}_i)\).
 
-If \(\overline{\mathbf{q}}_i\), which I called averaged class probabilities,
-resembles a uniform distribution the entropy will be high. A uniform
+If \(\overline{\mathbf{q}}_i\)
+resembles a uniform distribution, the \gls{entropy} will be high. A uniform
 distribution means that no class is more likely than another, which
 is a perfect example of maximum uncertainty. Conversely, if
-one class has a very high probability the entropy will be low.
+one class has a very high probability, the \gls{entropy} will be low.
 
 In open set conditions it can be expected that falsely generated
 detections for unknown object classes have a higher label
-uncertainty. A threshold on the entropy \(H(\overline{\mathbf{q}}_i)\) can then
+uncertainty. A threshold on the \gls{entropy} \(H(\overline{\mathbf{q}}_i)\) can then
 be used to identify and reject these false positive cases.
 
-% SSD: \cite{Liu2016}
+% \gls{SSD}: \cite{Liu2016}
 % ImageNet: \cite{Deng2009}
 % COCO: \cite{Lin2014}
 % YCB: \cite{Xiang2017}
@@ -421,7 +422,7 @@
 
 \label{chap:methods}
 
-This chapter explains the functionality of \gls{vanilla} SSD, Bayesian SSD, and the decoding pipelines.
+This chapter explains the functionality of \gls{vanilla} \gls{SSD}, Bayesian \gls{SSD}, and the decoding pipelines.
 
 \section{Vanilla SSD}
 
@@ -439,23 +440,22 @@ image (always size 300x300) is divided up into anchor boxes. During
 training, each of these boxes is mapped to a ground truth box or
 background. For every anchor box, both the offset to the object and
 the class confidences are calculated. The output of the
-SSD network are the predictions with class confidences, offsets to the
+\gls{SSD} network are the predictions with class confidences, offsets to the
 anchor box, anchor box coordinates, and variance. The model loss is
 a weighted sum of localisation and confidence loss. As the network
 has a fixed number of anchor boxes, every forward
 pass creates the same number of detections---8732 in the case of
 \gls{SSD} 300x300.
-Notably, the object proposals are made in a single run for an image -
-single shot.
+Notably, the object proposals are made in a single run for an
+image---single shot.
 Other techniques like Faster R-CNN employ region proposals
-and pooling. For more detailed information on SSD, please refer to
+and pooling. For more detailed information on \gls{SSD}, please refer to
 Liu et al.~\cite{Liu2016}.
 
 \section{Bayesian SSD for Model Uncertainty}
 
 Networks trained with dropout are a general approximate Bayesian
 model~\cite{Gal2017}. As such, they can be used for everything a true
-Bayesian model could be used for. The idea is applied to \gls{SSD} in this
-thesis: two dropout layers are added to \gls{vanilla} SSD, after the layers fc6 and fc7 respectively (see figure \ref{fig:bayesian-ssd}).
+Bayesian model could be used for. This idea is applied to \gls{SSD} by Miller et al.: two dropout layers are added to \gls{vanilla} \gls{SSD}, after the layers fc6 and fc7 respectively (see figure \ref{fig:bayesian-ssd}).
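To make the sampling procedure concrete, the following sketch shows how \(n\) stochastic forward passes could be averaged into \(\mathbf{q}\) and scored with the \gls{entropy} \(H(\mathbf{q})\). It is a minimal illustration in Python with Keras and NumPy, assuming a classification model that returns softmax scores; it is a sketch of the technique, not the ssd\_keras code used later.

\begin{verbatim}
import numpy as np

def mc_dropout_predict(model, image, n=10):
    # Keras keeps dropout layers active when training=True is passed,
    # so every call samples a different set of weights W ~ p(W|T).
    scores = np.stack([model(image[None], training=True).numpy()[0]
                       for _ in range(n)])
    q = scores.mean(axis=0)                    # averaged softmax scores
    entropy = -np.sum(q * np.log(q + 1e-12))   # H(q) = -sum q_i log q_i
    return q, entropy
\end{verbatim}

A high \(H(\mathbf{q})\) marks an uncertain prediction; a threshold on it implements the rejection of false positives described above.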
\begin{figure} \centering @@ -465,23 +465,23 @@ thesis: two dropout layers are added to \gls{vanilla} SSD, after the layers fc6 \label{fig:bayesian-ssd} \end{figure} -Motivation for this is model uncertainty: an uncertain model will -predict different classes for the same object on the same image across -multiple forward passes. This uncertainty is measured with entropy: +Motivation for this is model uncertainty: for the same object on the same +image, an uncertain model will predict different classes across +multiple forward passes. This uncertainty is measured with \gls{entropy}: every forward pass results in predictions, these are partitioned into -observations, and subsequently their entropy is calculated. -A higher entropy indicates a more uniform distribution of confidences -whereas a lower entropy indicates a larger confidence in one class +observations, and subsequently their \gls{entropy} is calculated. +A higher \gls{entropy} indicates a more uniform distribution of confidences +whereas a lower \gls{entropy} indicates a larger confidence in one class and very low confidences in other classes. \subsection{Implementation Details} For this thesis, an \gls{SSD} implementation based on Tensorflow~\cite{Abadi2015} and Keras\footnote{\url{https://github.com/pierluigiferrari/ssd\_keras}} -was used. It was modified to support entropy thresholding, +is used. It has been modified to support \gls{entropy} thresholding, partitioning of observations, and dropout layers in the \gls{SSD} model. Entropy thresholding takes place before -the per-class confidence threshold is applied. +the per class confidence threshold is applied. The Bayesian variant was not fine-tuned and operates with the same weights that \gls{vanilla} \gls{SSD} uses as well. @@ -497,7 +497,7 @@ of \gls{SSD} used in the thesis. \subsection{Vanilla SSD} -Liu et al.~\cite{Liu2016} used Caffe for their original SSD +Liu et al.~\cite{Liu2016} used \gls{Caffe} for their original \gls{SSD} implementation. The decoding process contains largely two phases: decoding and filtering. Decoding transforms the relative coordinates predicted by \gls{SSD} into absolute coordinates. At this point @@ -519,17 +519,17 @@ original implementation uses a confidence threshold of \(0.01\), an IOU threshold for \gls{NMS} of \(0.45\) and a top \(k\) value of 200. -The \gls{vanilla} SSD -per-class confidence threshold and \gls{NMS} has one +The \gls{vanilla} \gls{SSD} +per class confidence threshold and \gls{NMS} has one weakness: even if \gls{SSD} correctly predicts all objects as the -background class with high confidence, the per-class confidence +background class with high confidence, the per class confidence threshold of 0.01 will consider predictions with very low confidences; as background boxes are not present in the maxima collection, many low confidence boxes can be. Furthermore, the same detection can be present in the maxima collection for multiple -classes. In this case, the entropy threshold would let the detection +classes. In this case, the \gls{entropy} threshold would let the detection pass because the background class has high confidence. Subsequently, -a low per-class confidence threshold does not restrict the boxes +a low per class confidence threshold does not restrict the boxes either. Therefore, the decoding output is worse than the actual predictions of the network. Bayesian \gls{SSD} cannot help in this situation because the network @@ -537,17 +537,17 @@ is not actually uncertain. 
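The filtering stage described in this section can be summarised as per class confidence threshold, per class \gls{NMS}, and a global top \(k\) selection. The following Python sketch illustrates this order under assumed box and score formats; it is not the actual ssd\_keras implementation.

\begin{verbatim}
import numpy as np

def iou(a, b):
    # intersection over union of two [xmin, ymin, xmax, ymax] boxes
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union > 0 else 0.0

def filter_detections(boxes, scores, conf=0.01, iou_thr=0.45, top_k=200):
    # scores: (#boxes, #classes) softmax scores, class 0 is background
    maxima = []
    for c in range(1, scores.shape[1]):
        keep = np.where(scores[:, c] > conf)[0]       # per class threshold
        for i in keep[np.argsort(-scores[keep, c])]:  # greedy per class NMS
            if all(iou(boxes[i], boxes[j]) <= iou_thr
                   for cc, j in maxima if cc == c):
                maxima.append((c, i))
    maxima.sort(key=lambda m: -scores[m[1], m[0]])    # global top k
    return maxima[:top_k]
\end{verbatim}

With a confidence threshold as low as 0.01, nearly every box survives the first step for some class, which illustrates how the maxima collection can fill up with low confidence boxes even when the network itself is certain about the background.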
SSD was developed with closed set conditions in mind. A well
trained network in such a situation does not have many high confidence
-background detections. In an open set environment, background
+background detections. In an open set environment, however, background
 detections are the correct behaviour for unknown classes. In
 order to get useful detections out of the decoding, a higher
 confidence threshold is required.
 
 \subsection{Vanilla SSD with Entropy Thresholding}
 
-Vanilla \gls{SSD} with entropy tresholding adds an additional component
-to the filtering already done for \gls{vanilla} SSD. The entropy is
+Vanilla \gls{SSD} with \gls{entropy} thresholding adds an additional component
+to the filtering already done for \gls{vanilla} \gls{SSD}. The \gls{entropy} is
 calculated from all \(\#nr\_classes\) softmax scores in a prediction.
-Only predictions with a low enough entropy pass the entropy
+Only predictions with a low enough \gls{entropy} pass the \gls{entropy}
 threshold and move on to the aforementioned per class filtering.
 This excludes very uniform predictions but cannot identify false
 positive or false negative cases with high confidence values.
@@ -583,10 +583,10 @@ unknown classes. This is due to multiple forward passes and the
 assumption that uncertainty in some objects will result in
 different classifications in multiple forward passes. These
 varying classifications are averaged into multiple lower confidence
-values which should increase the entropy and, hence, flag an
+values which should increase the \gls{entropy} and, hence, flag an
 observation for removal.
 
-The remainder of the filtering follows the \gls{vanilla} \gls{SSD} procedure: per-class
+The remainder of the filtering follows the \gls{vanilla} \gls{SSD} procedure: per class
 confidence threshold, \gls{NMS}, and a top \(k\) selection
 at the end.
 
@@ -594,7 +594,7 @@
 
 \label{chap:experiments-results}
 
-This chapter explains the used data sets, how the experiments were
+This chapter explains the data sets used, how the experiments have been
 set up, and what the results are.
 
 \section{Data Sets}
@@ -610,38 +610,38 @@ network. Typical problems of data sets include, for example, outliers
 and invalid bounding boxes. Before a data set can be used, these
 problems need to be resolved.
 
-For the MS COCO data set, all annotations were checked for
+For the MS COCO data set, all annotations are checked for
 impossible values: bounding box height or width lower than zero,
 \(x_{min}\) and \(y_{min}\) bounding box coordinates lower than
 zero, \(x_{max}\) and \(y_{max}\) coordinates lower than or equal to
 zero, \(x_{min}\) greater than \(x_{max}\), \(y_{min}\) greater than
 \(y_{max}\), image width lower than \(x_{max}\), and image height
 lower than \(y_{max}\). In the last two cases the
-bounding box width and height were set to (image width - \(x_{min}\)) and
+bounding box width and height are set to (image width - \(x_{min}\)) and
 (image height - \(y_{min}\)) respectively;
-in the other cases the annotation was skipped.
+in the other cases the annotation is skipped.
 If the bounding box width or height afterwards is
-lower than or equal to zero the annotation was skipped.
+lower than or equal to zero, the annotation is skipped.
 
-SSD accepts 300x300 input images, the MS COCO data set images were
-resized to this resolution; the aspect ratio was not kept in the
+SSD accepts 300x300 input images; the MS COCO data set images are
+resized to this resolution; the aspect ratio is not kept in the
 process.
MS COCO contains landscape and portrait images with 640x480
-and (480x640) as the resolution. This led to a uniform distortion of the
+and 480x640 as resolutions. This leads to a uniform distortion of the
 portrait and landscape images respectively. Furthermore,
-the colour channels were swapped from RGB to BGR in order to
-comply with the \gls{SSD} implementation. The BGR requirement stems from
-the usage of Open CV in SSD: the internal channel order for
-Open CV is BGR.
+the colour channels are swapped from \gls{RGB} to \gls{BGR} in order to
+comply with the \gls{SSD} implementation. The \gls{BGR} requirement stems from
+the usage of OpenCV in \gls{SSD}: the internal channel order of
+OpenCV is \gls{BGR}.
 
 For this thesis, weights pre-trained on the sub data set trainval35k of the
-COCO data set were used. These weights were created with closed set
-conditions in mind, therefore, they had to be sub-sampled to create
+COCO data set are used. These weights have been created with closed set
+conditions in mind; therefore, they have been sub-sampled to create
 an open set condition. To this end, the weights for the last
-20 classes were thrown away, making them effectively unknown.
+20 classes have been discarded, making them effectively unknown.
 
-All images of the minival2014 data set were used but only ground truth
-belonging to the first 60 classes was loaded. The remaining 20
-classes were considered "unknown" and no ground truth bounding
-boxes for them were provided during the inference phase.
+All images of the minival2014 data set are used but only ground truth
+belonging to the first 60 classes is loaded. The remaining 20
+classes are considered ``unknown'' and no ground truth bounding
+boxes for them are provided during the inference phase.
 A total of 31,991 detections remains after this exclusion.
 Of these detections, a staggering 10,988 or 34.3\% belong to the persons
 class, followed by cars with 1,932 or 6\%, chairs with 1,791 or 5.6\%,
@@ -655,28 +655,28 @@ This section explains the setup for the different conducted
 experiments. Each comparison investigates one particular question.
 
 As a baseline, \gls{vanilla} \gls{SSD} with the confidence threshold of 0.01
-and a \gls{NMS} IOU threshold of 0.45 was used.
+and a \gls{NMS} IOU threshold of 0.45 is used.
 Due to the low number of objects per image in the COCO data set,
-the top \(k\) value was set to 20. Vanilla \gls{SSD} with entropy
-thresholding uses the same parameters; compared to \gls{vanilla} SSD
-without entropy thresholding, it showcases the relevance of
-entropy thresholding for \gls{vanilla} SSD.
+the top \(k\) value has been set to 20. Vanilla \gls{SSD} with \gls{entropy}
+thresholding uses the same parameters; compared to \gls{vanilla} \gls{SSD}
+without \gls{entropy} thresholding, it showcases the relevance of
+entropy thresholding for \gls{vanilla} \gls{SSD}.
 
-Vanilla \gls{SSD} was also run with 0.2 confidence threshold and compared
+Vanilla \gls{SSD} with 0.2 confidence threshold is compared
 to \gls{vanilla} \gls{SSD} with 0.01 confidence threshold; this comparison
 investigates the effect of the per class confidence threshold
 on the object detection performance.
 
-Bayesian \gls{SSD} was run with 0.2 confidence threshold and compared
+Bayesian \gls{SSD} with 0.2 confidence threshold is compared
 to \gls{vanilla} \gls{SSD} with 0.2 confidence threshold. Coupled with
 the entropy threshold, this comparison reveals how uncertain the
 network is.
If it is very certain, the dropout sampling should have no
significant impact on the result. Furthermore, in two cases the
-dropout was turned off to isolate the impact of \gls{NMS}
+dropout has been turned off to isolate the impact of \gls{NMS}
 on the result.
 
-Both, \gls{vanilla} \gls{SSD} with entropy thresholding and Bayesian \gls{SSD} with
-entropy thresholding, were tested for entropy thresholds ranging
+Both \gls{vanilla} \gls{SSD} with \gls{entropy} thresholding and Bayesian \gls{SSD} with
+\gls{entropy} thresholding are tested for \gls{entropy} thresholds ranging
 from 0.1 to 2.4 inclusive as specified in Miller et al.~\cite{Miller2018}.
 
\section{Results}
 
@@ -704,24 +704,24 @@ in the next chapter.
 \hline
 \gls{vanilla} \gls{SSD} - 0.01 conf & 0.255 & 3176 & 0.214 & 0.318 \\
 \gls{vanilla} \gls{SSD} - 0.2 conf & \textbf{0.376} & 2939 & \textbf{0.382} & 0.372 \\
- \gls{SSD} with Entropy test - 0.01 conf & 0.255 & 3168 & 0.214 & 0.318 \\
- % entropy thresh: 2.4 for \gls{vanilla} \gls{SSD} is best
+ \gls{SSD} with entropy test - 0.01 conf & 0.255 & 3168 & 0.214 & 0.318 \\
+ % \gls{entropy} thresh: 2.4 for \gls{vanilla} \gls{SSD} is best
 \hline
 Bay. \gls{SSD} - no DO - 0.2 conf - no \gls{NMS} \; 10 & 0.209 & 2709 & 0.300 & 0.161 \\
 no dropout - 0.2 conf - \gls{NMS} \; 10 & 0.371 & \textbf{2335} & 0.365 & \textbf{0.378} \\
 0.9 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.359 & 2584 & 0.363 & 0.357 \\
 0.5 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.325 & 2759 & 0.342 & 0.311 \\
- % entropy thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3
+ % \gls{entropy} thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3
 % 0.5 for Bayesian - 6, 1.4 for 7, 1.4 for 8, 1.3 for 9
 \hline
 \end{tabular}
- \caption{Rounded results for micro averaging. \gls{SSD} with Entropy test and Bayesian \gls{SSD} are represented with
- their best performing entropy threshold with respect to \(F_1\) score. Vanilla \gls{SSD} with Entropy test performed best with an
- entropy threshold of 2.4, Bayesian \gls{SSD} without \gls{NMS} performed best for 1.0,
- and Bayesian \gls{SSD} with \gls{NMS} performed best for 1.4 as entropy
+ \caption{Rounded results for micro averaging. \gls{SSD} with entropy test and Bayesian \gls{SSD} are represented with
+ their best performing \gls{entropy} threshold with respect to \(F_1\) score. Vanilla \gls{SSD} with entropy test performed best with an
+ \gls{entropy} threshold of 2.4, Bayesian \gls{SSD} without \gls{NMS} performed best for 1.0,
+ and Bayesian \gls{SSD} with \gls{NMS} performed best for 1.4 as \gls{entropy}
 threshold. Bayesian \gls{SSD} with dropout enabled and 0.9 keep ratio performed
- best for 1.4 as entropy threshold, the run with 0.5 keep ratio performed
+ best for 1.4 as \gls{entropy} threshold; the run with 0.5 keep ratio performed
 best for 1.3 as threshold.}
 \label{tab:results-micro}
 \end{table}
 
\begin{figure}[ht]
 \begin{minipage}[t]{0.48\textwidth}
 \includegraphics[width=\textwidth]{ose-f1-all-micro}
- \caption{Micro averaged \(F_1\) score versus open set error for each variant. Perfect performance is an \(F_1\) score of 1 and an absolute OSE of 0.}
+ \caption{Micro averaged \(F_1\) score versus open set error for each variant. Perfect performance is an \(F_1\) score of 1 and an absolute \gls{OSE} of 0.}
 \label{fig:ose-f1-micro}
 \end{minipage}%
 \hfill
\end{minipage}
\end{figure}
 
-Vanilla \gls{SSD} with a per-class confidence threshold of 0.2 performs best (see
+Vanilla \gls{SSD} with a per class confidence threshold of 0.2 performs best (see
 table \ref{tab:results-micro}) with respect to the maximum \(F_1\) score (0.376)
 and recall at the maximum \(F_1\) point (0.382). In comparison, neither the \gls{vanilla}
 \gls{SSD} variant with a confidence threshold of 0.01 nor the \gls{SSD} with
-an entropy test can outperform the 0.2 variant. Among the \gls{vanilla} \gls{SSD} variants,
-the 0.2 variant also has the lowest number of open set errors (2939) and the
+an \gls{entropy} test can outperform the 0.2 variant. Among the \gls{vanilla} \gls{SSD} variants,
+the 0.2 variant also has the lowest open set error (2939) and the
 highest precision (0.372).
 
 The comparison of the \gls{vanilla} \gls{SSD} variants with a confidence threshold of 0.01
-shows no significant impact of an entropy test. Only the open set errors
-are lower but in an insignificant way. The rest of the performance metrics is
+shows no significant impact of an \gls{entropy} test. Only the open set error
+is lower, but not significantly so. The rest of the performance metrics are
 identical after rounding.
 
 Bayesian \gls{SSD} with disabled dropout and without \gls{NMS}
 has the worst performance of all tested variants (\gls{vanilla} and Bayesian) with respect to \(F_1\) score (0.209) and
 precision (0.161). The precision is not only the worst but also significantly lower compared to all other variants. In comparison to all variants with 0.2 confidence threshold, it has the worst recall (0.300) as well.
-With 2335 open set errors, the Bayesian \gls{SSD} variant with disabled dropout and
+With an open set error of 2335, the Bayesian \gls{SSD} variant with disabled dropout and
 enabled \gls{NMS} offers the best performance with respect
-to open set errors. It also has the best precision (0.378) of all tested
+to the open set error. It also has the best precision (0.378) of all tested
 variants. Furthermore, it provides the best performance among all variants
 with multiple forward passes.
 
 Dropout decreases the performance of the network; this can be seen
-in the lower \(F_1\) scores, higher open set errors, and lower precision
+in the lower \(F_1\) scores, a higher open set error, and lower precision
 values. Both dropout variants have worse recall (0.363 and 0.342)
 than the variant with disabled dropout.
-However, all variants with multiple forward passes have lower open set
-errors than all \gls{vanilla} \gls{SSD} variants.
+However, all variants with multiple forward passes have a lower open set
+error than all \gls{vanilla} \gls{SSD} variants.
 
 The relation of \(F_1\) score to absolute open set error can be observed
 in figure \ref{fig:ose-f1-micro}. Precision-recall curves for all variants
-can be seen in figure \ref{fig:precision-recall-micro}. Both \gls{vanilla} SSD
-variants with 0.01 confidence threshold reach much higher open set errors
+can be seen in figure \ref{fig:precision-recall-micro}. Both \gls{vanilla} \gls{SSD}
+variants with 0.01 confidence threshold reach a much higher open set error
 and a higher recall. This behaviour is expected as more and worse predictions
 are included.
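For clarity, the difference between the micro averaging used in table \ref{tab:results-micro} and the macro averaging reported below can be sketched as follows. The helper assumes per class arrays of true positive, false positive, and false negative counts with no empty classes; the exact metric computation of the evaluation code may differ in detail.

\begin{verbatim}
import numpy as np

def f1(precision, recall):
    s = precision + recall
    return 2 * precision * recall / s if s > 0 else 0.0

def micro_f1(tp, fp, fn):
    # pool the counts over all classes, then compute the metrics once
    return f1(tp.sum() / (tp.sum() + fp.sum()),
              tp.sum() / (tp.sum() + fn.sum()))

def macro_f1(tp, fp, fn):
    # compute the metrics per class, then average over the classes
    return f1(np.mean(tp / (tp + fp)), np.mean(tp / (tp + fn)))
\end{verbatim}

Because micro averaging divides by the pooled ground truth of all classes, frequent classes such as persons dominate the result; macro averaging weights every class equally.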
All plotted variants show a similar behaviour that is in line with previously @@ -790,24 +790,24 @@ reported figures, such as the ones in Miller et al.~\cite{Miller2018} \hline \gls{vanilla} \gls{SSD} - 0.01 conf & 0.370 & 1426 & 0.328 & 0.424 \\ \gls{vanilla} \gls{SSD} - 0.2 conf & \textbf{0.375} & 1218 & \textbf{0.338} & 0.424 \\ - \gls{SSD} with Entropy test - 0.01 conf & 0.370 & 1373 & 0.329 & \textbf{0.425} \\ - % entropy thresh: 1.7 for \gls{vanilla} \gls{SSD} is best + \gls{SSD} with entropy test - 0.01 conf & 0.370 & 1373 & 0.329 & \textbf{0.425} \\ + % \gls{entropy} thresh: 1.7 for \gls{vanilla} \gls{SSD} is best \hline Bay. \gls{SSD} - no DO - 0.2 conf - no \gls{NMS} \; 10 & 0.226 & \textbf{809} & 0.229 & 0.224 \\ no dropout - 0.2 conf - \gls{NMS} \; 10 & 0.363 & 1057 & 0.321 & 0.420 \\ 0.9 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.355 & 1137 & 0.320 & 0.399 \\ 0.5 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.322 & 1264 & 0.307 & 0.340 \\ - % entropy thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3 - % entropy thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7 + % \gls{entropy} thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3 + % \gls{entropy} thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7 % 1.7 for 8, 2.0 for 9 \hline \end{tabular} - \caption{Rounded results for macro averaging. \gls{SSD} with Entropy test and Bayesian \gls{SSD} are represented with - their best performing entropy threshold with respect to \(F_1\) score. Vanilla \gls{SSD} with Entropy test performed best with an - entropy threshold of 1.7, Bayesian \gls{SSD} without \gls{NMS} performed best for 1.5, - and Bayesian \gls{SSD} with \gls{NMS} performed best for 1.5 as entropy + \caption{Rounded results for macro averaging. \gls{SSD} with entropy test and Bayesian \gls{SSD} are represented with + their best performing \gls{entropy} threshold with respect to \(F_1\) score. Vanilla \gls{SSD} with entropy test performed best with an + \gls{entropy} threshold of 1.7, Bayesian \gls{SSD} without \gls{NMS} performed best for 1.5, + and Bayesian \gls{SSD} with \gls{NMS} performed best for 1.5 as \gls{entropy} threshold. Bayesian \gls{SSD} with dropout enabled and 0.9 keep ratio performed - best for 1.7 as entropy threshold, the run with 0.5 keep ratio performed + best for 1.7 as \gls{entropy} threshold, the run with 0.5 keep ratio performed best for 2.0 as threshold.} \label{tab:results-macro} \end{table} @@ -815,7 +815,7 @@ reported figures, such as the ones in Miller et al.~\cite{Miller2018} \begin{figure}[ht] \begin{minipage}[t]{0.48\textwidth} \includegraphics[width=\textwidth]{ose-f1-all-macro} - \caption{Macro averaged \(F_1\) score versus open set error for each variant. Perfect performance is an \(F_1\) score of 1 and an absolute OSE of 0.} + \caption{Macro averaged \(F_1\) score versus open set error for each variant. Perfect performance is an \(F_1\) score of 1 and an absolute \gls{OSE} of 0.} \label{fig:ose-f1-macro} \end{minipage}% \hfill @@ -826,37 +826,37 @@ reported figures, such as the ones in Miller et al.~\cite{Miller2018} \end{minipage} \end{figure} -Vanilla \gls{SSD} with a per-class confidence threshold of 0.2 performs best (see +Vanilla \gls{SSD} with a per class confidence threshold of 0.2 performs best (see table \ref{tab:results-macro}) with respect to the maximum \(F_1\) score -(0.375) and recall at the maximum \(F_1\) point (0.338). In comparison, the SSD -with an entropy test slightly outperforms the 0.2 variant with respect to +(0.375) and recall at the maximum \(F_1\) point (0.338). 
In comparison, the \gls{SSD}
+with an \gls{entropy} test slightly outperforms the 0.2 variant with respect to
 precision (0.425). Additionally, this is the best precision overall.
 Among the \gls{vanilla} \gls{SSD} variants, the 0.2 variant also has the lowest
-number of open set errors (1218).
+open set error (1218).
 
 The comparison of the \gls{vanilla} \gls{SSD} variants with a confidence threshold of 0.01
-shows no significant impact of an entropy test. Only the open set errors
-are lower but in an insignificant way. The rest of the performance metrics is
+shows no significant impact of an \gls{entropy} test. Only the open set error
+is lower, but not significantly so. The rest of the performance metrics are
 almost identical after rounding.
 
 The results for Bayesian \gls{SSD} show a significant impact of \gls{NMS} or the
 lack thereof: a maximum \(F_1\) score of 0.363 (with NMS) versus 0.226 (without NMS).
 Dropout was disabled in both cases, making them effectively a \gls{vanilla} \gls{SSD}
 run with multiple forward passes.
-With 809 open set errors, the Bayesian \gls{SSD} variant with disabled dropout and
+With an open set error of 809, the Bayesian \gls{SSD} variant with disabled dropout and
 without \gls{NMS} offers the best performance with respect
-to open set errors. The variant without dropout and enabled \gls{NMS} has the best \(F_1\) score (0.363), the best
+to the open set error. The variant without dropout and enabled \gls{NMS} has the best \(F_1\) score (0.363), the best
 precision (0.420) and the best recall (0.321) of all Bayesian variants.
 
 Dropout decreases the performance of the network; this can be seen
-in the lower \(F_1\) scores, higher open set errors, and lower precision and
-recall values. However, all variants with multiple forward passes have lower open set errors than all \gls{vanilla} SSD
+in the lower \(F_1\) scores, a higher open set error, and lower precision and
+recall values. However, all variants with multiple forward passes have a lower open set error than all \gls{vanilla} \gls{SSD}
 variants.
 
 The relation of \(F_1\) score to absolute open set error can be observed
 in figure \ref{fig:ose-f1-macro}. Precision-recall curves for all variants
-can be seen in figure \ref{fig:precision-recall-macro}. Both \gls{vanilla} SSD
-variants with 0.01 confidence threshold reach much higher open set errors
+can be seen in figure \ref{fig:precision-recall-macro}. Both \gls{vanilla} \gls{SSD}
+variants with 0.01 confidence threshold reach a much higher open set error
 and a higher recall. This behaviour is expected as more and worse predictions
 are included.
 All plotted variants show a similar behaviour that is in line with previously
@@ -887,20 +887,20 @@ they had the exact same performance before rounding.
 \hline
 \gls{vanilla} \gls{SSD} - 0.01 conf & 0.460 & \textbf{0.405} & 0.532 \\
 \gls{vanilla} \gls{SSD} - 0.2 conf & \textbf{0.460} & \textbf{0.405} & \textbf{0.533} \\
- \gls{SSD} with Entropy test - 0.01 conf & 0.460 & 0.405 & 0.532 \\
- % entropy thresh: 1.7 for \gls{vanilla} \gls{SSD} is best
+ \gls{SSD} with entropy test - 0.01 conf & 0.460 & 0.405 & 0.532 \\
+ % \gls{entropy} thresh: 1.7 for \gls{vanilla} \gls{SSD} is best
 \hline
 Bay.
\gls{SSD} - no DO - 0.2 conf - no \gls{NMS} \; 10 & 0.272 & 0.292 & 0.256 \\ no dropout - 0.2 conf - \gls{NMS} \; 10 & 0.451 & 0.403 & 0.514 \\ 0.9 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.447 & 0.401 & 0.505 \\ 0.5 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.410 & 0.368 & 0.465 \\ - % entropy thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3 - % entropy thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7 + % \gls{entropy} thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3 + % \gls{entropy} thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7 % 1.7 for 8, 2.0 for 9 \hline \end{tabular} - \caption{Rounded results for persons class. \gls{SSD} with Entropy test and Bayesian \gls{SSD} are represented with - their best performing macro averaging entropy threshold with respect to \(F_1\) score.} + \caption{Rounded results for persons class. \gls{SSD} with entropy test and Bayesian \gls{SSD} are represented with + their best performing macro averaging \gls{entropy} threshold with respect to \(F_1\) score.} \label{tab:results-persons} \end{table} @@ -924,20 +924,20 @@ worse than average. \hline \gls{vanilla} \gls{SSD} - 0.01 conf & 0.364 & \textbf{0.305} & 0.452 \\ \gls{vanilla} \gls{SSD} - 0.2 conf & 0.363 & 0.294 & \textbf{0.476} \\ - \gls{SSD} with Entropy test - 0.01 conf & \textbf{0.364} & \textbf{0.305} & 0.453 \\ - % entropy thresh: 1.7 for \gls{vanilla} \gls{SSD} is best + \gls{SSD} with entropy test - 0.01 conf & \textbf{0.364} & \textbf{0.305} & 0.453 \\ + % \gls{entropy} thresh: 1.7 for \gls{vanilla} \gls{SSD} is best \hline Bay. \gls{SSD} - no DO - 0.2 conf - no \gls{NMS} \; 10 & 0.236 & 0.244 & 0.229 \\ no dropout - 0.2 conf - \gls{NMS} \; 10 & 0.336 & 0.266 & 0.460 \\ 0.9 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.332 & 0.262 & 0.454 \\ 0.5 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.309 & 0.264 & 0.374 \\ - % entropy thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3 - % entropy thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7 + % \gls{entropy} thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3 + % \gls{entropy} thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7 % 1.7 for 8, 2.0 for 9 \hline \end{tabular} - \caption{Rounded results for cars class. \gls{SSD} with Entropy test and Bayesian \gls{SSD} are represented with - their best performing macro averaging entropy threshold with respect to \(F_1\) score. } + \caption{Rounded results for cars class. \gls{SSD} with entropy test and Bayesian \gls{SSD} are represented with + their best performing macro averaging \gls{entropy} threshold with respect to \(F_1\) score. } \label{tab:results-cars} \end{table} @@ -949,20 +949,20 @@ worse than average. \hline \gls{vanilla} \gls{SSD} - 0.01 conf & 0.287 & \textbf{0.251} & 0.335 \\ \gls{vanilla} \gls{SSD} - 0.2 conf & 0.283 & 0.242 & 0.341 \\ - \gls{SSD} with Entropy test - 0.01 conf & \textbf{0.288} & \textbf{0.251} & 0.338 \\ - % entropy thresh: 1.7 for \gls{vanilla} \gls{SSD} is best + \gls{SSD} with entropy test - 0.01 conf & \textbf{0.288} & \textbf{0.251} & 0.338 \\ + % \gls{entropy} thresh: 1.7 for \gls{vanilla} \gls{SSD} is best \hline Bay. 
\gls{SSD} - no DO - 0.2 conf - no \gls{NMS} \; 10 & 0.172 & 0.168 & 0.178 \\ no dropout - 0.2 conf - \gls{NMS} \; 10 & 0.280 & 0.229 & \textbf{0.360} \\ 0.9 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.274 & 0.228 & 0.343 \\ 0.5 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.240 & 0.220 & 0.265 \\ - % entropy thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3 - % entropy thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7 + % \gls{entropy} thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3 + % \gls{entropy} thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7 % 1.7 for 8, 2.0 for 9 \hline \end{tabular} - \caption{Rounded results for chairs class. \gls{SSD} with Entropy test and Bayesian \gls{SSD} are represented with - their best performing macro averaging entropy threshold with respect to \(F_1\) score. } + \caption{Rounded results for chairs class. \gls{SSD} with entropy test and Bayesian \gls{SSD} are represented with + their best performing macro averaging \gls{entropy} threshold with respect to \(F_1\) score. } \label{tab:results-chairs} \end{table} @@ -975,20 +975,20 @@ worse than average. \hline \gls{vanilla} \gls{SSD} - 0.01 conf & 0.233 & \textbf{0.175} & 0.348 \\ \gls{vanilla} \gls{SSD} - 0.2 conf & 0.231 & 0.173 & \textbf{0.350} \\ - \gls{SSD} with Entropy test - 0.01 conf & \textbf{0.233} & \textbf{0.175} & 0.350 \\ - % entropy thresh: 1.7 for \gls{vanilla} \gls{SSD} is best + \gls{SSD} with entropy test - 0.01 conf & \textbf{0.233} & \textbf{0.175} & 0.350 \\ + % \gls{entropy} thresh: 1.7 for \gls{vanilla} \gls{SSD} is best \hline Bay. \gls{SSD} - no DO - 0.2 conf - no \gls{NMS} \; 10 & 0.160 & 0.140 & 0.188 \\ no dropout - 0.2 conf - \gls{NMS} \; 10 & 0.224 & 0.170 & 0.328 \\ 0.9 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.220 & 0.170 & 0.311 \\ 0.5 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.202 & 0.172 & 0.245 \\ - % entropy thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3 - % entropy thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7 + % \gls{entropy} thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3 + % \gls{entropy} thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7 % 1.7 for 8, 2.0 for 9 \hline \end{tabular} - \caption{Rounded results for bottles class. \gls{SSD} with Entropy test and Bayesian \gls{SSD} are represented with - their best performing macro averaging entropy threshold with respect to \(F_1\) score. } + \caption{Rounded results for bottles class. \gls{SSD} with entropy test and Bayesian \gls{SSD} are represented with + their best performing macro averaging \gls{entropy} threshold with respect to \(F_1\) score. } \label{tab:results-bottles} \end{table} @@ -1000,37 +1000,36 @@ worse than average. \hline \gls{vanilla} \gls{SSD} - 0.01 conf & \textbf{0.650} & \textbf{0.647} & \textbf{0.655} \\ \gls{vanilla} \gls{SSD} - 0.2 conf & \textbf{0.650} & \textbf{0.647} & \textbf{0.655} \\ - \gls{SSD} with Entropy test - 0.01 conf & \textbf{0.650} & \textbf{0.647} & \textbf{0.655} \\ - % entropy thresh: 1.7 for \gls{vanilla} \gls{SSD} is best + \gls{SSD} with entropy test - 0.01 conf & \textbf{0.650} & \textbf{0.647} & \textbf{0.655} \\ + % \gls{entropy} thresh: 1.7 for \gls{vanilla} \gls{SSD} is best \hline Bay. 
\gls{SSD} - no DO - 0.2 conf - no \gls{NMS} \; 10 & 0.415 & 0.414 & 0.417 \\
 no dropout - 0.2 conf - \gls{NMS} \; 10 & 0.647 & 0.642 & 0.654 \\
 0.9 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.637 & 0.634 & 0.642 \\
 0.5 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.586 & 0.578 & 0.596 \\
- % entropy thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3
- % entropy thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7
+ % \gls{entropy} thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3
+ % \gls{entropy} thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7
 % 1.7 for 8, 2.0 for 9
 \hline
 \end{tabular}
- \caption{Rounded results for giraffe class. \gls{SSD} with Entropy test and Bayesian \gls{SSD} are represented with
- their best performing macro averaging entropy threshold with respect to \(F_1\) score. }
+ \caption{Rounded results for giraffe class. \gls{SSD} with entropy test and Bayesian \gls{SSD} are represented with
+ their best performing macro averaging \gls{entropy} threshold with respect to \(F_1\) score.}
 \label{tab:results-giraffes}
 \end{table}
 
 \subsection{Qualitative Analysis}
 
-% TODO: expand
-
-This subsection compares \gls{vanilla} SSD
+This subsection compares \gls{vanilla} \gls{SSD}
 with Bayesian \gls{SSD} with respect to specific images that
 illustrate similarities and differences between both approaches. For this
-comparison, a 0.2 confidence threshold is applied. Furthermore, Bayesian
-SSD uses \gls{NMS} and dropout with 0.9 keep ratio.
+comparison, a 0.2 confidence threshold is applied. Furthermore, the
+compared Bayesian \gls{SSD} variant uses \gls{NMS} and dropout with 0.9 keep
+ratio.
 
 \begin{figure}
 \begin{minipage}[t]{0.48\textwidth}
 \includegraphics[width=\textwidth]{COCO_val2014_000000336587_bboxes_vanilla}
- \caption{Image with stop sign and truck at right edge. Ground truth in blue, predictions in red, and rounded to three digits. Predictions are from \gls{vanilla} SSD.}
+ \caption{Image with stop sign and truck at right edge. Ground truth in blue, predictions in red, and rounded to three digits. Predictions are from \gls{vanilla} \gls{SSD}.}
 \label{fig:stop-sign-truck-vanilla}
 \end{minipage}%
 \hfill
@@ -1042,14 +1041,14 @@
 \end{minipage}
 \end{figure}
 
 The ground truth only contains a stop sign and a truck. The differences
 between \gls{vanilla} \gls{SSD} and Bayesian \gls{SSD} are almost not visible
-(see figures \ref{fig:stop-sign-truck-vanilla} and \ref{fig:stop-sign-truck-bayesian}): the truck is neither detected by \gls{vanilla} nor Bayesian SSD, instead both detected a pottet plant and a traffic light. The stop sign is detected by both variants.
+(see figures \ref{fig:stop-sign-truck-vanilla} and \ref{fig:stop-sign-truck-bayesian}): the truck is detected by neither \gls{vanilla} nor Bayesian \gls{SSD}; instead, both detect a potted plant and a traffic light. The stop sign is detected by both variants.
 This behaviour implies problems with detecting objects at the edge that
 overwhelmingly lie outside the image frame. Furthermore, the predictions are usually identical.
 
 \begin{figure}
 \begin{minipage}[t]{0.48\textwidth}
 \includegraphics[width=\textwidth]{COCO_val2014_000000403817_bboxes_vanilla}
- \caption{Image with a cat and laptop/TV. Ground truth in blue, predictions in red, and rounded to three digits. Predictions are from \gls{vanilla} SSD.}
+ \caption{Image with a cat and laptop/TV. Ground truth in blue, predictions in red, and rounded to three digits.
Predictions are from \gls{vanilla} \gls{SSD}.} \label{fig:cat-laptop-vanilla} \end{minipage}% \hfill @@ -1062,19 +1061,19 @@ that overwhelmingly lie outside the image frame. Furthermore, the predictions ar Another example (see figures \ref{fig:cat-laptop-vanilla} and \ref{fig:cat-laptop-bayesian}) is a cat with a laptop/TV in the background on the right side. Both variants detect a cat but the \gls{vanilla} variant detects a dog as well. The laptop and TV are not detected but this is expected since -these classes were not trained. +these classes have not been trained. \chapter{Discussion and Outlook} \label{chap:discussion} -First the results will be discussed, then possible future research and open -questions will be addressed. +First the results are discussed, then possible future research and open +questions are addressed. \section*{Discussion} -The results clearly do not support the hypothesis: \textit{Dropout sampling delivers better object detection performance under open set conditions compared to object detection without it}. With the exception of open set errors, there -is no area where dropout sampling performs better than \gls{vanilla} SSD. In the +The results clearly do not support the hypothesis: \textit{Dropout sampling delivers better object detection performance under open set conditions compared to object detection without it}. With the exception of the open set error, there +is no area where dropout sampling performs better than \gls{vanilla} \gls{SSD}. In the remainder of the section the individual results will be interpreted. \subsection*{Impact of Averaging} @@ -1104,36 +1103,32 @@ Conversely, in micro averaging the cumulative true positives are added up across classes and then divided by the total number of ground truth. Here, the effect is the opposite: the total number of ground truth is very large which means the combined true positives -of 58 classes have only a smaller impact on the average recall. -As a result, the open set error rises quicker than the \(F_1\) score -in micro averaging, creating the sharp rise of open set error at a lower +of the 57 classes have only a smaller impact on the average recall. +As a result, the open set error rises quicker than the \(F_1\) score, +creating the sharp rise of the open set error at a lower \(F_1\) score than in macro averaging. The open set error reaches a high value early on and changes little afterwards. This allows the \(F_1\) score to catch up and produces the almost horizontal line in the graph. Eventually, the \(F_1\) score decreases again while the -open set error further rises a bit. - -Furthermore, the plotted behaviour implies that Miller et al.~\cite{Miller2018} -use macro averaging in their paper as the unique behaviour of micro -averaging was not reported in their paper. +open set error continues to rise a bit. \subsection*{Impact of Entropy} -There is no visible impact of entropy thresholding on the object detection -performance for \gls{vanilla} SSD. This indicates that the network has almost no +There is no visible impact of \gls{entropy} thresholding on the object detection +performance for \gls{vanilla} \gls{SSD}. This indicates that the network has almost no uniform or close to uniform predictions, the vast majority of predictions has a high confidence in one class---including the background. 
-However, the entropy plays a larger role for the Bayesian variants---as +However, the \gls{entropy} plays a larger role for the Bayesian variants---as expected: the best performing thresholds are 1.0, 1.3, and 1.4 for micro averaging, and 1.5, 1.7, and 2.0 for macro averaging. In all of these cases the best threshold is not the largest threshold tested. This is caused by a simple phenomenon: at some point most or all true -positives are in and a higher entropy threshold only adds more false +positives are in and a higher \gls{entropy} threshold only adds more false positives. Such a behaviour is indicated by a stagnating recall for the -higher entropy levels. For the low entropy thresholds, the low recall +higher \gls{entropy} levels. For the low \gls{entropy} thresholds, the low recall is dominating the \(F_1\) score, the sweet spot is somewhere in the -middle. For macro averaging, it holds that a higher optimal entropy +middle. For macro averaging, it holds that a higher optimal \gls{entropy} threshold indicates a worse performance. \subsection*{Non-Maximum Suppression and Top \(k\)} @@ -1143,23 +1138,23 @@ threshold indicates a worse performance. \begin{tabular}{rccc} \hline variant & before & after & after \\ - & entropy/NMS & entropy/NMS & top \(k\) \\ + & \gls{entropy}/NMS & \gls{entropy}/NMS & top \(k\) \\ \hline - Bay. SSD, no dropout, no \gls{NMS} & 155,251 & 122,868 & 72,207 \\ + Bay. \gls{SSD}, no dropout, no \gls{NMS} & 155,251 & 122,868 & 72,207 \\ no dropout, \gls{NMS} & 155,250 & 36,061 & 33,827 \\ \hline \end{tabular} \caption{Comparison of Bayesian \gls{SSD} variants without dropout with - respect to the number of detections before the entropy threshold, + respect to the number of detections before the \gls{entropy} threshold, after it and/or \gls{NMS}, and after top \(k\). The - entropy threshold 1.5 was used for both.} + \gls{entropy} threshold 1.5 was used for both.} \label{tab:effect-nms} \end{table} -Miller et al.~\cite{Miller2018} supposedly did not use \gls{NMS} +Miller et al.~\cite{Miller2018} supposedly do not use \gls{NMS} in their implementation of dropout sampling. Therefore, a variant with disabled \glslocalreset{NMS} -\gls{NMS} was tested. The results are somewhat expected: +\gls{NMS} has been tested. The results are somewhat expected: \gls{NMS} removes all non-maximum detections that overlap with a maximum one. This reduces the number of multiple detections per ground truth bounding box and therefore the false positives. Without it, @@ -1167,13 +1162,13 @@ a lot more false positives remain and have a negative impact on precision. In combination with top \(k\) selection, recall can be affected: duplicate detections could stay and maxima boxes could be removed. -The number of observations was measured before and after the combination of entropy threshold and \gls{NMS} filter: both Bayesian \gls{SSD} without +The number of observations have been measured before and after the combination of \gls{entropy} threshold and \gls{NMS} filter: both Bayesian \gls{SSD} without NMS and dropout, and Bayesian \gls{SSD} with \gls{NMS} and disabled dropout -have the same number of observations everywhere before the entropy threshold. After the entropy threshold (the value 1.5 was used for both) and NMS, the variant with \gls{NMS} has roughly 23\% of its observations left +have the same number of observations everywhere before the \gls{entropy} threshold. 
 
-The number of observations was measured before and after the combination of entropy threshold and \gls{NMS} filter: both Bayesian \gls{SSD} without
+The number of observations has been measured before and after the combination of \gls{entropy} threshold and \gls{NMS} filter: both Bayesian \gls{SSD} without
 NMS and dropout, and Bayesian \gls{SSD} with \gls{NMS} and disabled dropout
-have the same number of observations everywhere before the entropy threshold. After the entropy threshold (the value 1.5 was used for both) and NMS, the variant with \gls{NMS} has roughly 23\% of its observations left
+have the same number of observations everywhere before the \gls{entropy} threshold. After the \gls{entropy} threshold (the value 1.5 has been used for both) and NMS, the variant with \gls{NMS} has roughly 23\% of its observations left
 (see table \ref{tab:effect-nms} for absolute numbers). Without \gls{NMS}, 79\% of observations are left.
 Irrespective of the absolute number, this discrepancy clearly shows the impact of \gls{NMS} and also explains a higher count of false positives:
-more than 50\% of the original observations were removed with \gls{NMS} and
-stayed without---all of these are very likely to be false positives.
+more than 50\% of the original observations are removed with \gls{NMS} and
+stay without it---all of these are very likely to be false positives.
 
 A clear distinction between micro and macro averaging can be observed:
@@ -1186,8 +1181,8 @@ true positives are removed from a class with only few true positives
 to begin with, then their removal will have a drastic influence on
 the class recall value and hence the overall result.
 
-The impact of top \(k\) was measured by counting the number of observations
-after top \(k\) has been applied: the variant with \gls{NMS} keeps about 94\%
-of the observations left after NMS, without \gls{NMS} only about 59\%
+The impact of top \(k\) has been measured by counting the number of observations
+after top \(k\) is applied: the variant with \gls{NMS} keeps about 94\%
+of the observations left after NMS; without \gls{NMS}, only about 59\%
 of observations are kept. This shows a significant impact on the
 result by top \(k\) in the case of disabled \gls{NMS}. Furthermore, some
@@ -1212,7 +1207,7 @@ recall.
 	variant & after & after \\
 	 & prediction & observation grouping \\
 	\hline
-	Bay. SSD, no dropout, \gls{NMS} & 1,677,050 & 155,250 \\
+	Bay. \gls{SSD}, no dropout, \gls{NMS} & 1,677,050 & 155,250 \\
 	keep rate 0.9, \gls{NMS} & 1,617,675 & 549,166 \\
 	\hline
 \end{tabular}
@@ -1229,8 +1224,8 @@ without dropout.
 
 This is expected as the network was not trained
 with dropout and the weights are not prepared for it. Gal~\cite{Gal2017}
-showed that networks \textbf{trained} with dropout are approximate Bayesian
-models. The Bayesian variants of \gls{SSD} implemented in this thesis are not fine-tuned or trained with dropout, therefore, they are not guaranteed to be such approximate models.
+shows that networks \textbf{trained} with dropout are approximate Bayesian
+models. The Bayesian variants of \gls{SSD} implemented for this thesis are not fine-tuned or trained with dropout; therefore, they are not guaranteed to be such approximate models.
 
 But dropout alone does not explain the difference in results. Both variants
 with and without dropout have the exact same number of detections coming
@@ -1243,19 +1238,19 @@ observations, including a quadratic calculation of mutual IOU scores.
 Therefore, these detections are filtered by removing all those with
 background confidence levels of 0.8 or higher.
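+
+A minimal sketch of this grouping step (Python with NumPy; the greedy
+strategy, the tuple layout, and the 0.95 grouping IOU are illustrative
+assumptions, not the exact implementation):
+\begin{verbatim}
+import numpy as np
+
+def iou(a, b):
+    # IOU of two boxes given as [x_min, y_min, x_max, y_max]
+    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
+    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
+    inter = ix * iy
+    union = ((a[2] - a[0]) * (a[3] - a[1])
+             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
+    return inter / union
+
+def group_into_observations(detections, grouping_iou=0.95):
+    # greedily group (box, softmax) detections from all forward
+    # passes; the mutual IOU checks make this step quadratic
+    groups = []
+    for box, probs in detections:
+        for group in groups:
+            if all(iou(box, b) >= grouping_iou for b, _ in group):
+                group.append((box, probs))
+                break
+        else:
+            groups.append([(box, probs)])
+    # an observation averages the boxes and class distributions
+    return [(np.mean([b for b, _ in g], axis=0),
+             np.mean([p for _, p in g], axis=0)) for g in groups]
+\end{verbatim}
+The background filter mentioned above simply drops every detection with
+a background confidence of 0.8 or higher before this function runs,
+which keeps the quadratic grouping tractable.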
 
-The number of detections per class was measured before and after the
-detections were grouped into observations. To this end, the stored predictions
-were unbatched and summed together. After the aforementioned filter
+The number of detections per class has been measured before and after the
+detections are grouped into observations. To this end, the stored predictions
+are unbatched and summed together. After the aforementioned filter
 and before the grouping, roughly 0.4\% (in fact less than that) of the
-more than 430 million detections are remaining (see table \ref{tab:effect-dropout} for absolute numbers). The variant with dropout
+more than 430 million detections remain (see table \ref{tab:effect-dropout} for absolute numbers). The variant with dropout
 has slightly fewer predictions left compared to the one without dropout.
 After the grouping, the variant without dropout has on average between 10
 and 11 detections grouped into an observation. This is expected as every
-forward pass creates the exact same result and these 10 identical detections
+forward pass creates the exact same result and these ten identical detections
 per \gls{vanilla} \gls{SSD} detection perfectly overlap. The fact that slightly more than
-10 detections are grouped together could explain the marginally better precision
-of the Bayesian variant without dropout compared to \gls{vanilla} SSD.
+ten detections are grouped together could explain the marginally better precision
+of the Bayesian variant without dropout compared to \gls{vanilla} \gls{SSD}.
 However, on average only three detections are grouped together into
 an observation if dropout with 0.9 keep ratio is enabled. This does
 not negatively impact recall as true positives do not disappear but offers
diff --git a/glossary.tex b/glossary.tex
index 41c1542..0a77a40 100644
--- a/glossary.tex
+++ b/glossary.tex
@@ -1,8 +1,41 @@
 % acronyms
 \newacronym{NMS}{NMS}{non-maximum suppression}
+\newacronym{OSE}{OSE}{open set error}
 \newacronym{SSD}{SSD}{Single Shot MultiBox Detector}
+\newacronym{pdf}{pdf}{probability density function}
 
 % terms
+\newglossaryentry{BGR}{
+	name={BGR},
+	description={
+		stands for the three colour channels blue, green, and red in this order
+	}
+}
+\newglossaryentry{Caffe}{
+	name={Caffe},
+	description={
+		is a deep learning framework written in C++
+	}
+}
+\newglossaryentry{entropy}{
+	name={entropy},
+	description={
+		measures the uncertainty of a probability distribution. Rare
+		events carry more information than likely ones; in case of classification probabilities, a uniform prediction has maximum entropy, whereas a prediction with a clear ``winner'' has low entropy
+	}
+}
+\newglossaryentry{posterior}{
+	name={posterior},
+	description={
+		is the probability distribution over classes that a neural network outputs for a given input
+	}
+}
+\newglossaryentry{RGB}{
+	name={RGB},
+	description={
+		stands for the three colour channels red, green, and blue in this order
+	}
+}
 \newglossaryentry{vanilla}
 {
 	name={vanilla},