% body thesis file that contains the actual content
\chapter{Introduction}
The introduction will explain the wider context first, before
providing technical details.
\subsection*{Motivation}
Famous examples like the automatic soap dispenser which does not
recognise the hand of a black person but dispenses soap when presented
with a paper towel raise the question of bias in computer
systems~\cite{Friedman1996}. Related to this ethical question regarding
the design of so-called algorithms is the question of
algorithmic accountability~\cite{Diakopoulos2014}.
Supervised neural networks learn from input-output relations and
figure out by themselves what connections are necessary for that.
This feature is also their Achilles heel: it makes them effectively
black boxes and prevents any answers to questions of causality.
However, these questions of causality are of enormous consequence when
results of neural networks are used to make life changing decisions:
Is a correlation enough to bring forth negative consequences
for a particular person? And if so, what is the possible defence
against math? Similar questions can be raised when looking at computer
vision networks that might be used together with so-called smart
CCTV cameras to discover suspicious activity.
This leads to the need for neural networks to explain their results.
Such an explanation must come from the network or an attached piece
of technology to allow mass adoption. This naturally
poses the question of how such an endeavour can be achieved.
For neural networks there are fundamentally two types of tasks:
regression and classification. Regression deals with any case
where the goal for the network is to come close to an ideal
function that connects all data points. Classification, however,
describes tasks where the network is supposed to identify the
class of any given input. In this thesis, I will work with both.
\subsection*{Object Detection in Open Set Conditions}
\begin{figure}
\centering
\includegraphics[scale=1.0]{open-set}
\caption{Open set problem: The test set contains classes that
were not present during training time.
Icons in this image have been taken from the COCO data set
website (\url{https://cocodataset.org/\#explore}) and were
vectorised afterwards. Resembles figure 1 of Miller et al.~\cite{Miller2018}.}
\label{fig:open-set}
\end{figure}
More specifically, I will look at object detection under open set
conditions (see figure \ref{fig:open-set}).
In non-technical terms, this effectively describes
the kind of situation you encounter with CCTV cameras or robots
outside of a laboratory. Both use cameras that record
images. Subsequently, a neural network analyses the image
and returns a list of detected and classified objects that it
found in the image. The problem here is that networks can only
classify what they know. If presented with an object type that
the network was not trained with, as happens frequently in real
environments, it will still classify the object and might even
have a high confidence in doing so. Such a detection would be
a false positive. Any ordinary person who uses the results of
such a network would falsely assume that a high confidence always
means the classification is very likely correct. If they use
a proprietary system they might not even be able to find out
that the network was never trained on a particular type of object.
Therefore, it would be impossible for them to identify the output
of the network as false positive.
This goes back to the need for automatic explanation. Such a system
should by itself recognise that the given object is unknown and
hence mark any classification result of the network as meaningless.
Technically there are two slightly different approaches that deal
with this type of task: model uncertainty and novelty detection.
Model uncertainty can be measured, for example, with dropout sampling.
Dropout is usually used only during training but
Miller et al.~\cite{Miller2018} use it also during testing
to achieve different results for the same image by making use of
multiple forward passes. The output scores for the forward passes
of the same image are then averaged. If the averaged class
probabilities resemble a uniform distribution (every class has
the same probability) this symbolises maximum uncertainty. Conversely,
if there is one very high probability with every other being very
low this signifies a low uncertainty. An unknown object is more
likely to cause high uncertainty which allows for an identification
of false positive cases.
Novelty detection is another approach to solve the task.
In the realm of neural networks it is usually done with the help of
auto-encoders that solve a regression task of finding an
identity function that reconstructs the given input~\cite{Pimentel2014}. Auto-encoders have
internally at least two components: an encoder, and a decoder or
generator. The job of the encoder is to find an encoding that
compresses the input as well as possible while simultaneously
being as loss-free as possible. The decoder takes this latent
representation of the input and has to find a decompression
that reconstructs the input as accurately as possible. During
training these auto-encoders learn to reproduce a certain group
of object classes. The actual novelty detection takes place
during testing: Given an image, and the output and loss of the
auto-encoder, a novelty score is calculated. For some novelty
detection approaches the reconstruction loss is exactly the novelty
score; others consider more factors. A low novelty
score signals a known object. The opposite is true for a high
novelty score.
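As a minimal illustration of this principle, the following sketch shows a
small Keras auto-encoder whose reconstruction error is used directly as the
novelty score; the architecture, input shape, and threshold are illustrative
assumptions and not the system evaluated in this thesis.
\begin{verbatim}
# Minimal sketch of reconstruction-based novelty detection with a
# small auto-encoder. Architecture, input shape, and threshold are
# illustrative assumptions, not the system evaluated in this thesis.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(784,))                       # flattened image
encoded = layers.Dense(32, activation="relu")(inputs)       # encoder
decoded = layers.Dense(784, activation="sigmoid")(encoded)  # decoder/generator
autoencoder = tf.keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(x_known, x_known, epochs=10)  # train on known classes only

def novelty_score(x):
    """Reconstruction error used directly as the novelty score."""
    reconstruction = autoencoder.predict(x, verbose=0)
    return np.mean(np.square(x - reconstruction), axis=1)

# is_novel = novelty_score(x_test) > threshold  # high score: likely unknown
\end{verbatim}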
\subsection*{Research Question}
Auto-encoders work well for data sets like MNIST~\cite{Deng2012}
but perform poorly on challenging real world data sets
like MS COCO~\cite{Lin2014}. Therefore, a comparison between
model uncertainty and novelty detection is considered out of
scope for this thesis.
Miller et al.~\cite{Miller2018} used an SSD pre-trained on COCO
without further fine-tuning on the SceneNet RGB-D data
set~\cite{McCormac2017} and reported good results regarding
open set error for an SSD variant with dropout sampling and entropy
thresholding.
If their results are generalisable it should be possible to replicate
the relative difference between the variants on the COCO data set.
This leads to the following hypothesis: \emph{Dropout sampling
delivers better object detection performance under open set
conditions compared to object detection without it.}
For the purpose of this thesis, I will use the vanilla SSD as
baseline to compare against. In particular, vanilla SSD uses
a per-class confidence threshold of 0.01, an IOU threshold of 0.45
for the non-maximum suppression, and a top k value of 200.
The effect of an entropy threshold is measured against this vanilla
SSD by applying entropy thresholds from 0.1 to 2.4 inclusive (limits taken from
Miller et al.). Dropout sampling is compared to vanilla SSD, both
with and without entropy thresholding.
\paragraph{Hypothesis} Dropout sampling
delivers better object detection performance under open set
conditions compared to object detection without it.
\subsection*{Reader's guide}
First, chapter \ref{chap:background} presents related works and
provides the background for dropout sampling, also known as Bayesian SSD.
Afterwards, chapter \ref{chap:methods} explains how the Bayesian SSD
works and how the decoding pipelines are structured.
Chapter \ref{chap:experiments-results} presents the data sets,
the experimental setup, and the results. This is followed by
chapter \ref{chap:discussion}, focusing on
the discussion and closing.
Therefore, the contribution is found in chapters \ref{chap:methods},
\ref{chap:experiments-results}, and \ref{chap:discussion}.
\chapter{Background}
\label{chap:background}
This chapter will begin with an overview of previous works
in the field of this thesis. Afterwards, the theoretical foundations
of the work of Miller et al.~\cite{Miller2018} will
be explained.
\section{Related Works}
The task of novelty detection can be accomplished in a variety of ways.
Pimentel et al.~\cite{Pimentel2014} provide a review of novelty detection
methods published over the previous decade. They showcase probabilistic,
distance-based, reconstruction-based, domain-based, and information-theoretic
novelty detection. Based on their categorisation, this thesis falls under
reconstruction-based novelty detection as it deals only with neural network
approaches. Therefore, the other types of novelty detection will only be
briefly introduced.
\subsection{Overview of types of novelty detection}
Probabilistic approaches estimate the generative probability density function (pdf)
of the data. It is assumed that the training data is generated from an underlying
probability distribution \(D\). This distribution can be estimated with the
training data; the estimate is defined as \(\hat D\) and represents a model
of normality. A novelty threshold is applied to \(\hat D\) in a way that
allows a probabilistic interpretation. Pidhorskyi et al.~\cite{Pidhorskyi2018}
combine a probabilistic approach to novelty detection with auto-encoders.
Distance-based novelty detection uses either nearest neighbour-based approaches
(e.g. \(k\)-nearest neighbour \cite{Hautamaki2004})
or clustering-based approaches
(e.g. \(k\)-means clustering algorithm \cite{Jordan1994}).
Both methods are similar to estimating the
pdf of the data; they use well-defined distance metrics to compute the distance
between two data points.
Domain-based novelty detection describes the boundary of the known data, rather
than the data itself. Unknown data is identified by its position relative to
the boundary. A common implementation of this approach is the support vector
machine (e.g. as implemented by Song et al. \cite{Song2002}).
Information-theoretic novelty detection computes the information content
of a data set, for example, with metrics like entropy. Such metrics assume
that novel data inside the data set significantly alters the information
content of an otherwise normal data set. First, the metrics are calculated over the
whole data set. Afterwards, a subset is identified that causes the biggest
difference in the metric when removed from the data set. This subset is considered
to consist of novel data. For example, Filippone and Sanguinetti \cite{Filippone2011} provide
a recent approach.
\subsection{Reconstruction-based novelty detection}
Reconstruction-based approaches use the reconstruction error in one form
or another to calculate the novelty score. This can be auto-encoders that
literally reconstruct the input but it also includes MLP networks which try
to reconstruct the ground truth. Pimentel et al.~\cite{Pimentel2014} differentiated
between neural network-based approaches and subspace methods. The former were
further differentiated into MLPs, Hopfield networks, autoassociative networks,
radial basis function networks, and self-organising networks.
The remainder of this section focuses on MLP-based works, with a particular
focus on the task of object detection and Bayesian networks.
Novelty detection for object detection is intricately linked with
open set conditions: the test data can contain unknown classes.
Bishop~\cite{Bishop1994} investigated the correlation between
the degree of novel input data and the reliability of network
outputs.
The Bayesian approach provides a theoretical foundation for
modelling uncertainty \cite{Ghahramani2015}.
MacKay~\cite{MacKay1992} provided a practical Bayesian
framework for backpropagation networks. Neal~\cite{Neal1996} built upon
the work of MacKay and explored Bayesian learning for neural networks.
However, these Bayesian neural networks do not scale well. Over the course
of time, two major Bayesian approximations were introduced: one based
on dropout and one based on batch normalisation.
Gal and Ghahramani~\cite{Gal2016} showed that dropout training is a
Bayesian approximation of a Gaussian process. Subsequently, Gal~\cite{Gal2017}
showed that dropout training actually corresponds to a general approximate
Bayesian model. This means every network trained with dropout is an
approximate Bayesian model. During inference the dropout remains active,
this form of inference is called Monte Carlo Dropout (MCDO).
Miller et al.~\cite{Miller2018} built upon the work of Gal and Ghahramani: they
use MC dropout under open-set conditions for object detection.
In a second paper \cite{Miller2018a}, Miller et al. continued their work and
compared merging strategies for sampling-based uncertainty techniques in
object detection.
Teye et al.~\cite{Teye2018} make the point that most modern networks have
adopted other regularisation techniques. Ioffe and Szegedy~\cite{Ioffe2015}
introduced batch normalisation which has been widely adopted. Teye et al.
showed how batch normalisation training is similar to dropout and can be
viewed as an approximate Bayesian inference. Estimates of the model uncertainty
can be gained with a technique named Monte Carlo Batch Normalisation (MCBN).
Consequently, this technique can be applied to any network that utilises
standard batch normalisation.
Li et al.~\cite{Li2019} investigated the problem of poor performance
when combining dropout and batch normalisation: Dropout shifts the variance
of a neural unit when switching from train to test, whereas batch normalisation
does not change the variance. This inconsistency leads to a variance shift which
can have a larger or smaller impact based on the network used. For example,
adding dropout layers to SSD \cite{Liu2016} and applying MC dropout, like
Miller et al.~\cite{Miller2018} did, causes such a problem because SSD uses
batch normalisation.
Non-Bayesian approaches have been developed as well. Usually, they compare
themselves with MC dropout and report better performance.
Postels et al.~\cite{Postels2019} provided a sampling-free approach for
uncertainty estimation that does not affect training and approximates the
sampling on test time. They compared it to MC dropout and found less computational
overhead with better results.
Lakshminarayanan et al.~\cite{Lakshminarayanan2017}
implemented a predictive uncertainty estimation using deep ensembles.
Compared to MC dropout, it showed better results.
Geifman et al.~\cite{Geifman2018}
introduced an uncertainty estimation algorithm for non-Bayesian deep
neural classification that estimates the uncertainty of highly
confident points using earlier snapshots of the trained model and improves,
among others, the approach introduced by Lakshminarayanan et al.
Sensoy et al.~\cite{Sensoy2018} explicitly model prediction uncertainty:
a Dirichlet distribution is placed over the class probabilities. Consequently,
the predictions of a neural network are treated as subjective opinions.
In addition to the aforementioned Bayesian and non-Bayesian works,
there are some Bayesian works that do not quite fit with the rest but
are important as well. Mukhoti and Gal~\cite{Mukhoti2018}
contributed metrics to measure uncertainty for semantic
segmentation. Wu et al.~\cite{Wu2019} introduced two innovations
that turn variational Bayes into a robust tool for Bayesian
networks: a novel deterministic method to approximate
moments in neural networks which eliminates gradient variance, and
a hierarchical prior for parameters and an empirical Bayes procedure to select
prior variances.
\section{Background for Bayesian SSD}
\begin{table}
\centering
\caption{Notation for background}
\label{tab:notation}
\begin{tabular}{l|l}
symbol & meaning \\
\hline
\(\mathbf{W}\) & weights \\
\(\mathbf{T}\) & training data \\
\(\mathcal{N}(0, I)\) & Gaussian distribution \\
\(I\) & independent and identical distribution \\
\(p(\mathbf{W}|\mathbf{T})\) & probability of weights given
training data \\
\(\mathcal{I}\) & an image \\
\(\mathbf{q} = p(y|\mathcal{I}, \mathbf{T})\) & probability
of all classes given image and training data \\
\(H(\mathbf{q})\) & entropy over probability vector \\
\(\widetilde{\mathbf{W}}\) & weights sampled from
\(p(\mathbf{W}|\mathbf{T})\) \\
\(\mathbf{b}\) & bounding box coordinates \\
\(\mathbf{s}\) & softmax scores \\
\(\overline{\mathbf{s}}\) & averaged softmax scores \\
\(D\) & detections of one forward pass \\
\(\mathfrak{D}\) & set of all detections over multiple
forward passes \\
\(\mathcal{O}\) & observation \\
\(\overline{\mathbf{q}}\) & probability vector for
observation \\
%\(E[something]\) & expected value of something
%\(\overline{\mathbf{z}}, \mathbf{z}\) & latent space representation \\
%\(d_T, d_z\) & discriminators \\
%\(e, g\) & encoding and decoding/generating function \\
%\(J_g\) & Jacobi matrix for generating function \\
%\(\mathcal{T}\) & tangent space \\
%\(\mathbf{R}\) & training/test data changed to be on tangent space
\end{tabular}
\end{table}
This section will use the \textbf{notation} defined in table
\ref{tab:notation} on page \pageref{tab:notation}.
To understand dropout sampling, it is necessary to explain the
idea of Bayesian neural networks. They place a prior distribution
over the network weights, for example a Gaussian prior distribution:
\(\mathbf{W} \sim \mathcal{N}(0, I)\). In this example
\(\mathbf{W}\) are the weights and \(I\) symbolises that every
weight is drawn from an independent and identical distribution. The
training of the network determines a plausible set of weights by
evaluating the posterior (probability output) over the weights given
the training data \(\mathbf{T}\): \(p(\mathbf{W}|\mathbf{T})\).
However, this
evaluation cannot be performed in any reasonable
time. Therefore approximation techniques are
required. In those techniques the posterior is fitted with a
simple distribution \(q^{*}_{\theta}(\mathbf{W})\). The original
and intractable problem of averaging over all weights in the network
is replaced with an optimisation task, where the parameters of the
simple distribution are optimised over~\cite{Kendall2017}.
\subsubsection*{Dropout Variational Inference}
Kendall and Gal~\cite{Kendall2017} showed an approximation for
classification and recognition tasks. Dropout variational inference
is a practical approximation technique that adds dropout layers
in front of every weight layer and uses them also during test
time to sample from the approximate posterior. Effectively, this
results in the approximation of the class probability
\(p(y|\mathcal{I}, \mathbf{T})\) by performing multiple forward
passes through the network and averaging over the obtained Softmax
scores \(\mathbf{s}_i\), given an image \(\mathcal{I}\) and the
training data \(\mathbf{T}\):
\begin{equation} \label{eq:drop-sampling}
p(y|\mathcal{I}, \mathbf{T}) = \int p(y|\mathcal{I}, \mathbf{W}) \cdot p(\mathbf{W}|\mathbf{T})d\mathbf{W} \approx \frac{1}{n} \sum_{i=1}^{n}\mathbf{s}_i
\end{equation}
With this dropout sampling technique \(n\) model weights
\(\widetilde{\mathbf{W}}_i\) are sampled from the posterior
\(p(\mathbf{W}|\mathbf{T})\). The class probability
\(p(y|\mathcal{I}, \mathbf{T})\) is a probability vector
\(\mathbf{q}\) over all class labels. Finally, the uncertainty
of the network with respect to the classification is given by
the entropy \(H(\mathbf{q}) = - \sum_i q_i \cdot \log q_i\).
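The following sketch shows one possible realisation of this sampling
procedure in Python with Keras; it assumes a classification model that
already contains dropout layers and ends with a softmax layer, and that
calling the model with \texttt{training=True} keeps dropout active during
inference. All names are illustrative and not taken from the implementation
used later in this thesis.
\begin{verbatim}
# Sketch of Monte Carlo dropout for classification: n stochastic
# forward passes are averaged and the entropy of the averaged
# probability vector measures the label uncertainty.
# Assumes `model` contains dropout layers and ends with a softmax.
import numpy as np

def mc_dropout_predict(model, image, n_passes=10):
    scores = [model(image[np.newaxis, ...], training=True).numpy()[0]
              for _ in range(n_passes)]        # dropout stays active
    q = np.mean(scores, axis=0)                # approximates p(y|I, T)
    entropy = -np.sum(q * np.log(q + 1e-12))   # H(q) = -sum_i q_i log q_i
    return q, entropy
\end{verbatim}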
\subsubsection*{Dropout Sampling for Object Detection}
Miller et al.~\cite{Miller2018} apply the dropout sampling to
object detection. In that case \(\mathbf{W}\) represents the
learned weights of a detection network like SSD~\cite{Liu2016}.
Every forward pass uses a different network
\(\widetilde{\mathbf{W}}\) which is approximately sampled from
\(p(\mathbf{W}|\mathbf{T})\). Each forward pass in object
detection results in a set of detections, each consisting of bounding
box coordinates \(\mathbf{b}\) and softmax score \(\mathbf{s}\).
The detections are denoted by Miller et al. as \(D_i =
\{\mathbf{s}_i,\mathbf{b}_i\}\). The detections of all passes are put
into a large set \(\mathfrak{D} = \{D_1, ..., D_n\}\).
All detections with mutual intersection-over-union scores (IoU)
of \(0.95\) or higher are defined as an observation \(\mathcal{O}_i\).
Subsequently, the corresponding vector of class probabilities
\(\overline{\mathbf{q}}_i\) for the observation is calculated by averaging all
score vectors \(\mathbf{s}_j\) in a particular observation
\(\mathcal{O}_i\): \(\overline{\mathbf{q}}_i \approx \overline{\mathbf{s}}_i = \frac{1}{n} \sum_{j=1}^{n} \mathbf{s}_j\). The label uncertainty
of the detector for a particular observation is measured by
the entropy \(H(\overline{\mathbf{q}}_i)\).
If \(\overline{\mathbf{q}}_i\), which I called averaged class probabilities,
resembles a uniform distribution the entropy will be high. A uniform
distribution means that no class is more likely than another, which
is a perfect example of maximum uncertainty. Conversely, if
one class has a very high probability the entropy will be low.
In open set conditions it can be expected that falsely generated
detections for unknown object classes have a higher label
uncertainty. A threshold on the entropy \(H(\overline{\mathbf{q}}_i)\) can then
be used to identify and reject these false positive cases.
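The grouping and uncertainty computation could be sketched as follows;
the greedy partitioning, the box format, and all names are simplifying
assumptions and not the exact procedure of Miller et al.
\begin{verbatim}
# Sketch: partition detections into observations (mutual IoU >= 0.95),
# average their scores and boxes, and compute the label uncertainty.
# Box format, greedy grouping, and names are simplifying assumptions.
import numpy as np

def iou(a, b):
    """IoU of two boxes given as [x_min, y_min, x_max, y_max]."""
    x1, y1 = np.maximum(a[:2], b[:2])
    x2, y2 = np.minimum(a[2:], b[2:])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def group_observations(detections, iou_thresh=0.95):
    """detections: list of (softmax_scores, box) from all forward passes."""
    observations = []
    for scores, box in detections:
        for obs in observations:                  # greedy assignment
            if iou(box, obs["boxes"][0]) >= iou_thresh:
                obs["scores"].append(scores)
                obs["boxes"].append(box)
                break
        else:                                     # no match: new observation
            observations.append({"scores": [scores], "boxes": [box]})
    results = []
    for obs in observations:
        q = np.mean(obs["scores"], axis=0)        # averaged class probabilities
        box = np.mean(obs["boxes"], axis=0)       # averaged box coordinates
        entropy = -np.sum(q * np.log(q + 1e-12))  # H(q), label uncertainty
        results.append((q, box, entropy))
    return results
\end{verbatim}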
% SSD: \cite{Liu2016}
% ImageNet: \cite{Deng2009}
% COCO: \cite{Lin2014}
% YCB: \cite{Xiang2017}
% SceneNet: \cite{McCormac2017}
\chapter{Methods}
\label{chap:methods}
This chapter explains the functionality of the Bayesian SSD and the
decoding pipelines.
\section{Bayesian SSD for Model Uncertainty}
Bayesian SSD adds dropout sampling to the vanilla SSD. First,
the model architecture will be explained, followed by details on
the uncertainty calculation, and implementation details.
\subsection{Model Architecture}
\begin{figure}
\centering
\includegraphics[scale=1.2]{vanilla-ssd}
\caption{The vanilla SSD network as defined by Liu et al.~\cite{Liu2016}. VGG-16 is the base network, extended with extra feature layers. These predict offsets to anchor boxes with different sizes and aspect ratios. Furthermore, they predict the
corresponding confidences.}
\label{fig:vanilla-ssd}
\end{figure}
Vanilla SSD is based upon the VGG-16 network (see figure \ref{fig:vanilla-ssd}) and adds extra feature layers. These layers
predict the offsets to the anchor boxes, which have different sizes
and aspect ratios. The feature layers also predict the
corresponding confidences. By comparison, Bayesian SSD only adds
two dropout layers after the fc6 and fc7 layers (see figure \ref{fig:bayesian-ssd}).
\begin{figure}
\centering
\includegraphics[scale=1.2]{bayesian-ssd}
\caption{The Bayesian SSD network as defined by Miller et al.~\cite{Miller2018}. It adds dropout layers after the fc6
and fc7 layers.}
\label{fig:bayesian-ssd}
\end{figure}
\subsection{Model Uncertainty}
Dropout sampling measures model uncertainty with the help of
entropy: every forward pass creates predictions, which are
partitioned into observations, and then their entropy is calculated.
Entropy works to detect uncertainty because uncertain networks
will produce different classifications for the same object in an
image across multiple forward passes.
\subsection{Implementation Details}
For this thesis, an SSD implementation based on Tensorflow~\cite{Abadi2015} and
Keras\footnote{\url{https://github.com/pierluigiferrari/ssd\_keras}}
was used. It was modified to support entropy thresholding,
partitioning of observations, and dropout
layers in the SSD model. %Entropy thresholding takes place before
%the per-class confidence threshold is applied.
\section{Decoding Pipelines}
The raw output of SSD is not very useful: it contains thousands of
boxes per image. Among them are many boxes with very low confidences
or background classifications; those need to be filtered out to
get any meaningful output of the network. This process of
filtering is called decoding and is presented for the three variants
of SSD used in this thesis.
\subsection{Vanilla SSD}
Liu et al.~\cite{Liu2016} used Caffe for their original SSD
implementation. The decoding process largely consists of two
phases: decoding and filtering. Decoding transforms the relative
coordinates predicted by SSD into absolute coordinates. At this point
the shape of the output per batch is \((batch\_size, \#nr\_boxes, \#nr\_classes + 12)\). The last twelve elements are split into
the four bounding box offsets, the four anchor box coordinates, and
the four variances; there are 8732 boxes.
Filtering of these boxes is first done per class:
only the class id, confidence of that class, and the bounding box
coordinates are kept per box. The filtering consists of
confidence thresholding and a subsequent non-maximum suppression.
All boxes that pass non-maximum suppression are added to a
per image maxima list. One box could make the confidence threshold
for multiple classes and, hence, be present multiple times in the
maxima list for the image. Lastly, a total of \(k\) boxes with the
highest confidences is kept per image across all classes. The
original implementation uses a confidence threshold of \(0.01\), an
IOU threshold for non-maximum suppression of \(0.45\) and a top \(k\)
value of 200.
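A simplified sketch of this filtering stage is shown below; the array
layout and the greedy non-maximum suppression helper are assumptions for
illustration and do not mirror the exact ssd\_keras implementation.
\begin{verbatim}
# Simplified sketch of the vanilla SSD filtering: per-class confidence
# threshold, greedy non-maximum suppression, and top-k selection.
# The array layout (class scores followed by four absolute box
# coordinates) and all names are illustrative assumptions.
import numpy as np

def nms(boxes, scores, iou_thresh=0.45):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(best)
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_best = ((boxes[best, 2] - boxes[best, 0])
                     * (boxes[best, 3] - boxes[best, 1]))
        area_rest = ((boxes[rest, 2] - boxes[rest, 0])
                     * (boxes[rest, 3] - boxes[rest, 1]))
        overlap = inter / (area_best + area_rest - inter)
        order = rest[overlap < iou_thresh]
    return keep

def filter_decoded_boxes(decoded, nr_classes, conf_thresh=0.01, top_k=200):
    """decoded: (nr_boxes, nr_classes + 4) array of scores and coordinates."""
    maxima = []
    for class_id in range(1, nr_classes):           # class 0 is background
        scores = decoded[:, class_id]
        mask = scores > conf_thresh                  # per-class threshold
        boxes, scores = decoded[mask, -4:], scores[mask]
        for i in nms(boxes, scores):
            maxima.append((class_id, scores[i], boxes[i]))
    maxima.sort(key=lambda det: det[1], reverse=True)
    return maxima[:top_k]                            # top k across all classes
\end{verbatim}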
The vanilla SSD
per-class confidence threshold and non-maximum suppression have one
weakness: even if SSD correctly predicts all objects as the
background class with high confidence, the per-class confidence
threshold of 0.01 will consider predictions with very low
confidences; as background boxes are not present in the maxima
collection, many low confidence boxes can be. Furthermore, the
same detection can be present in the maxima collection for multiple
classes. In this case, the entropy threshold would let the detection
pass because the background class has high confidence. Subsequently,
a low per-class confidence threshold does not restrict the boxes
either. Therefore, the decoding output is worse than the actual
predictions of the network.
Bayesian SSD cannot help in this situation because the network
is not actually uncertain.
SSD was developed with closed set conditions in mind. A well trained
network in such a situation does not have many high confidence
background detections. In an open set environment, background
detections are the correct behaviour for unknown classes.
In order to get useful detections out of the decoding, a higher
confidence threshold is required.
\subsection{Vanilla SSD with Entropy Thresholding}
Vanilla SSD with entropy thresholding adds an additional component
to the filtering already done for vanilla SSD. The entropy is
calculated from all \(\#nr\_classes\) softmax scores in a prediction.
Only predictions with a low enough entropy pass the entropy
threshold and move on to the aforementioned per class filtering.
This excludes very uniform predictions but cannot identify
false positive or false negative cases with high confidence values.
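A minimal sketch of this additional test, assuming the decoded output
still carries the full softmax vector of every prediction:
\begin{verbatim}
# Sketch of the entropy test inserted before the per-class filtering.
# `decoded` is assumed to be a (nr_boxes, nr_classes + 4) array whose
# first nr_classes columns hold the softmax scores of each prediction.
import numpy as np

def entropy_filter(decoded, nr_classes, entropy_thresh):
    scores = decoded[:, :nr_classes]
    entropy = -np.sum(scores * np.log(scores + 1e-12), axis=1)
    return decoded[entropy <= entropy_thresh]   # keep low-entropy predictions
\end{verbatim}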
\subsection{Bayesian SSD with Entropy Thresholding}
The distinguishing feature of Bayesian SSD is its multiple forward passes. Based
on the information in the paper, the detections of all forward passes
are grouped per image but not by forward pass. This leads
to the following shape of the network output after all
forward passes: \((batch\_size, \#nr\_boxes \, \cdot \, \#nr\_forward\_passes, \#nr\_classes + 12)\). The size of the output
increases linearly with more forward passes.
These detections have to be decoded first. Afterwards,
all detections are thrown away which do not pass a confidence
threshold for the class with the highest prediction probability.
Additionally, all detections with a background prediction of 0.8 or higher are discarded.
The remaining detections are partitioned into observations to
further reduce the size of the output, and
to identify uncertainty. This is accomplished by calculating the
mutual IOU of every detection with all other detections. Detections
with a mutual IOU score of 0.95 or higher are partitioned into an
observation. Next, the softmax scores and bounding box coordinates of
all detections in an observation are averaged.
There can be a different number of observations for every image which
destroys homogeneity and prevents batch-wise calculation of the
results. The shape of the results is per image: \((\#nr\_observations,\#nr\_classes + 4)\).
Entropy is measured in the next step. All observations with too high
entropy are discarded. Entropy thresholding in combination with
dropout sampling should improve identification of false positives of
unknown classes. This is due to multiple forward passes and
the assumption that uncertainty in some objects will result
in different classifications in multiple forward passes. These
varying classifications are averaged into multiple lower confidence
values which should increase the entropy and, hence, flag an
observation for removal.
The remainder of the filtering follows the vanilla SSD procedure: per-class
confidence threshold, non-maximum suppression, and a top \(k\) selection
at the end.
\chapter{Experimental Setup and Results}
\label{chap:experiments-results}
This chapter explains the used data sets, how the experiments were
set up, and what the results are.
\section{Data sets}
This thesis uses the MS COCO~\cite{Lin2014} data set. It contains
80 classes, ranging from airplanes to toothbrushes.
The images are taken by cameras in the real world, and ground truth
is provided for all images. The data set supports object detection,
keypoint detection, and panoptic segmentation (scene segmentation).
The data of any data set has to be prepared for use in a neural
network. Typical problems of data sets include, for example,
outliers and invalid bounding boxes. Before a data set can be used,
these problems need to be resolved.
For the MS COCO data set, all annotations were checked for
impossible values: bounding box height or width lower than zero,
\(x_{min}\) and \(y_{min}\) bounding box coordinates lower than zero,
\(x_{max}\) and \(y_{max}\) coordinates lower than or equal to zero, \(x_{min}\) greater than \(x_{max}\),
\(y_{min}\) greater than \(y_{max}\), image width lower than \(x_{max}\),
and image height lower than \(y_{max}\). In the last two cases the
bounding box width or height was set to (image width - \(x_{min}\)) or
(image height - \(y_{min}\)) respectively;
in the other cases the annotation was skipped.
If the bounding box width or height was lower than or equal to zero
after these corrections, the annotation was skipped.
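The described checks roughly correspond to the following sketch; the
corner-based coordinate format and the function name are assumptions for
illustration.
\begin{verbatim}
# Sketch of the annotation sanity checks described above. Coordinates
# are assumed to be box corners; the real annotations use a different
# format and are converted beforehand.
def clean_annotation(x_min, y_min, x_max, y_max, img_w, img_h):
    """Return corrected coordinates or None if the annotation is skipped."""
    if x_max - x_min < 0 or y_max - y_min < 0:    # negative width or height
        return None
    if x_min < 0 or y_min < 0 or x_max <= 0 or y_max <= 0:
        return None
    if x_min > x_max or y_min > y_max:
        return None
    if x_max > img_w:                             # width becomes img_w - x_min
        x_max = img_w
    if y_max > img_h:                             # height becomes img_h - y_min
        y_max = img_h
    if x_max - x_min <= 0 or y_max - y_min <= 0:  # degenerate after clipping
        return None
    return x_min, y_min, x_max, y_max
\end{verbatim}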
SSD accepts 300x300 input images; therefore, the MS COCO images were
resized to this resolution without keeping the aspect ratio.
As all images of MS COCO have the same resolution,
this led to a uniform distortion of the images. Furthermore,
the colour channels were swapped from RGB to BGR in order to
comply with the SSD implementation. The BGR requirement stems from
the usage of OpenCV in SSD: the internal channel order of
OpenCV is BGR.
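A sketch of this preprocessing step with OpenCV, assuming the image was
loaded in RGB channel order:
\begin{verbatim}
# Sketch of the image preprocessing: resize to the 300x300 input
# resolution of SSD without keeping the aspect ratio and swap the
# channel order from RGB to BGR. Assumes the image is loaded as RGB.
import cv2

def preprocess(image_rgb):
    resized = cv2.resize(image_rgb, (300, 300))   # uniform distortion
    return resized[..., ::-1]                     # RGB -> BGR
\end{verbatim}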
For this thesis, weights pre-trained on the sub data set trainval35k of the
COCO data set were used. These weights were created with closed set
conditions in mind; therefore, they had to be sub-sampled to create
an open set condition. To this end, the weights for the last
20 classes were thrown away, making them effectively unknown.
All images of the minival2014 data set were used but only ground truth
belonging to the first 60 classes was loaded. The remaining 20
classes were considered "unknown" and were not presented with bounding
boxes during the inference phase.
\section{Experimental Setup}
This section explains the setup for the different conducted
experiments. Each comparison investigates one particular question.
As a baseline, vanilla SSD with the confidence threshold of 0.01
and a non-maximum suppression IOU threshold of 0.45 was used.
Due to the low number of objects per image in the COCO data set,
the top \(k\) value was set to 20. Vanilla SSD with entropy
thresholding uses the same parameters; compared to vanilla SSD
without entropy thresholding, it showcases the relevance of
entropy thresholding for vanilla SSD.
Vanilla SSD was also run with 0.2 confidence threshold and compared
to vanilla SSD with 0.01 confidence threshold; this comparison
investigates the effect of the per-class confidence threshold
on the object detection performance.
Bayesian SSD was run with 0.2 confidence threshold and compared
to vanilla SSD with 0.2 confidence threshold. Coupled with the
entropy threshold, this comparison shows how uncertain the network
is. If it is very certain the dropout sampling should have no
significant impact on the result. Furthermore, in two cases the
dropout was turned off to isolate the impact of non-maximum suppression
on the result.
Both, vanilla SSD with entropy thresholding and Bayesian SSD with
entropy thresholding, were tested for entropy thresholds ranging
from 0.1 to 2.4 inclusive as specified in Miller et al.~\cite{Miller2018}.
\section{Results}
Results in this section are presented both for micro and macro averaging.
In macro averaging, for example, the precision values of each class are added up
and then divided by the number of classes. Conversely, for micro averaging the
precision is calculated across all classes directly. Both methods have
a specific impact: macro averaging weighs every class the same while micro
averaging weighs every detection the same. They will be largely identical
when every class is balanced and has about the same number of detections.
However, in case of a class imbalance the macro averaging
favours classes with few detections whereas micro averaging benefits classes
with many detections.
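The difference can be made explicit with per-class true positive and
false positive counts; the following sketch computes precision in both
ways, and all names are illustrative.
\begin{verbatim}
# Illustration of micro versus macro averaged precision, computed
# from per-class true positive (tp) and false positive (fp) counts.
import numpy as np

def macro_precision(tp, fp):
    """Average per-class precision: every class weighs the same."""
    per_class = tp / np.maximum(tp + fp, 1)       # avoid division by zero
    return per_class.mean()

def micro_precision(tp, fp):
    """Pool all detections first: every detection weighs the same."""
    return tp.sum() / max(tp.sum() + fp.sum(), 1)
\end{verbatim}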
\subsection{Micro Averaging}
\begin{table}[ht]
\begin{tabular}{rcccc}
\hline
Forward & max & abs OSE & Recall & Precision\\
Passes & \(F_1\) Score & \multicolumn{3}{c}{at max \(F_1\) point} \\
\hline
vanilla SSD - 0.01 conf & 0.255 & 3176 & 0.214 & 0.318 \\
vanilla SSD - 0.2 conf & \textbf{0.376} & 2939 & \textbf{0.382} & 0.372 \\
SSD with Entropy test - 0.01 conf & 0.255 & 3168 & 0.214 & 0.318 \\
% entropy thresh: 2.4 for vanilla SSD is best
\hline
Bay. SSD - no DO - 0.2 conf - no NMS \; 10 & 0.209 & 2709 & 0.300 & 0.161 \\
no dropout - 0.2 conf - NMS \; 10 & 0.371 & \textbf{2335} & 0.365 & \textbf{0.378} \\
0.9 keep ratio - 0.2 conf - NMS \; 10 & 0.360 & 2595 & 0.367 & 0.353 \\
0.5 keep ratio - 0.2 conf - NMS \; 10 & 0.325 & 2759 & 0.342 & 0.311 \\
% entropy thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3
% 0.5 for Bayesian - 6, 1.4 for 7, 1.4 for 8, 1.3 for 9
\hline
\end{tabular}
\caption{Rounded results for micro averaging. SSD with Entropy test and Bayesian SSD are represented with
their best performing entropy threshold. Vanilla SSD with Entropy test performed best with an
entropy threshold of 2.4, Bayesian SSD without non-maximum suppression performed best for 1.0,
and Bayesian SSD with non-maximum suppression performed best for 1.4 as entropy
threshold.
Bayesian SSD with dropout enabled and 0.9 keep ratio performed
best for 1.4 as entropy threshold, the run with 0.5 keep ratio performed
best for 1.3 as threshold.}
\label{tab:results-micro}
\end{table}
\begin{figure}[ht]
\begin{minipage}[t]{0.48\textwidth}
\includegraphics[width=\textwidth]{ose-f1-all-micro}
\caption{Micro averaged \(F_1\) score versus open set error for each variant. Perfect performance is an \(F_1\) score of 1 and an absolute OSE of 0.}
\label{fig:ose-f1-micro}
\end{minipage}%
\hfill
\begin{minipage}[t]{0.48\textwidth}
\includegraphics[width=\textwidth]{precision-recall-all-micro}
\caption{Micro averaged precision-recall curves for each variant tested.}
\label{fig:precision-recall-micro}
\end{minipage}
\end{figure}
Vanilla SSD with a per-class confidence threshold of 0.2 performs best (see
table \ref{tab:results-micro}) with respect to the maximum \(F_1\) score
(0.376) and recall at the maximum \(F_1\) point (0.382). In comparison, neither
the vanilla SSD variant with a confidence threshold of 0.01 nor the SSD with
an entropy test can outperform the 0.2 variant. Among the vanilla SSD variants,
the 0.2 variant also has the lowest number of open set errors (2939) and the
highest precision (0.372).
The comparison of the vanilla SSD variants with a confidence threshold of 0.01
shows no significant impact of an entropy test. Only the open set errors
are lower but in an insignificant way. The rest of the performance metrics is
identical after rounding.
The results for Bayesian SSD show a massive impact of the existence of
non-maximum suppression: maximum \(F_1\) score of 0.371 (with NMS) to 0.006
(without NMS). Dropout was disabled in both cases, making them effectively a
vanilla SSD run with multiple forward passes.
Therefore, the low number of open set errors with
micro averaging (164 without NMS) does not qualify as a good result and is not
marked bold, although it is the lowest number.
With 2335 open set errors, the Bayesian SSD variant with disabled dropout and
enabled non-maximum suppression offers the best performance with respect
to open set errors. It also has the best precision (0.378) of all tested
variants. Furthermore, it provides the best performance among all variants
with multiple forward passes except for recall.
Dropout decreases the performance of the network; this can be seen
in the lower \(F_1\) scores, higher open set errors, and lower precision
values. The variant with 0.9 keep ratio outperforms all other Bayesian
variants with respect to recall (0.367). The variant with 0.5 keep
ratio has worse recall (0.342) than the variant with disabled dropout.
However, all variants with multiple forward passes have lower open set errors
than all vanilla SSD variants.
The relation of \(F_1\) score to absolute open set error can be observed
in figure \ref{fig:ose-f1-micro}. Precision-recall curves for all variants
can be seen in figure \ref{fig:precision-recall-micro}. Both vanilla SSD
variants with 0.01 confidence threshold reach much higher open set errors
and a higher recall. This behaviour is expected as more and worse predictions
are included. The Bayesian variant without non-maximum suppression was not
plotted.
All plotted variants show a similar behaviour that is in line with previously
reported figures, such as the ones in Miller et al.~\cite{Miller2018}.
\subsection{Macro Averaging}
\begin{table}[t]
\begin{tabular}{rcccc}
\hline
Forward & max & abs OSE & Recall & Precision\\
Passes & \(F_1\) Score & \multicolumn{3}{c}{at max \(F_1\) point} \\
\hline
vanilla SSD - 0.01 conf & 0.370 & 1426 & 0.328 & 0.424 \\
vanilla SSD - 0.2 conf & \textbf{0.375} & 1218 & \textbf{0.338} & 0.424 \\
SSD with Entropy test - 0.01 conf & 0.370 & 1373 & 0.329 & \textbf{0.425} \\
% entropy thresh: 1.7 for vanilla SSD is best
\hline
Bay. SSD - no DO - 0.2 conf - no NMS \; 10 & 0.226 & \textbf{809} & 0.229 & 0.224 \\
no dropout - 0.2 conf - NMS \; 10 & 0.363 & 1057 & 0.321 & 0.420 \\
0.9 keep ratio - 0.2 conf - NMS \; 10 & 0.354 & 1150 & 0.321 & 0.396 \\
0.5 keep ratio - 0.2 conf - NMS \; 10 & 0.322 & 1264 & 0.307 & 0.340 \\
% entropy thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3
% entropy thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7
% 1.7 for 8, 2.0 for 9
\hline
\end{tabular}
\caption{Rounded results for macro averaging. SSD with Entropy test and Bayesian SSD are represented with
their best performing entropy threshold. Vanilla SSD with Entropy test performed best with an
entropy threshold of 1.7, Bayesian SSD without non-maximum suppression performed best for 1.5,
and Bayesian SSD with non-maximum suppression performed best for 1.5 as entropy
threshold. Bayesian SSD with dropout enabled and 0.9 keep ratio performed
best for 1.7 as entropy threshold, the run with 0.5 keep ratio performed
best for 2.0 as threshold.}
\label{tab:results-macro}
\end{table}
\begin{figure}[ht]
\begin{minipage}[t]{0.48\textwidth}
\includegraphics[width=\textwidth]{ose-f1-all-macro}
\caption{Macro averaged \(F_1\) score versus open set error for each variant. Perfect performance is an \(F_1\) score of 1 and an absolute OSE of 0.}
\label{fig:ose-f1-macro}
\end{minipage}%
\hfill
\begin{minipage}[t]{0.48\textwidth}
\includegraphics[width=\textwidth]{precision-recall-all-macro}
\caption{Macro averaged precision-recall curves for each variant tested.}
\label{fig:precision-recall-macro}
\end{minipage}
\end{figure}
Vanilla SSD with a per-class confidence threshold of 0.2 performs best (see
table \ref{tab:results-macro}) with respect to the maximum \(F_1\) score
(0.375) and recall at the maximum \(F_1\) point (0.338). In comparison, the SSD
with an entropy test slightly outperforms the 0.2 variant with respect to
precision (0.425). Additionally, this is the best precision overall. Among
the vanilla SSD variants, the 0.2 variant also has the lowest
number of open set errors (1218).
The comparison of the vanilla SSD variants with a confidence threshold of 0.01
shows no significant impact of an entropy test. Only the open set errors
are lower but in an insignificant way. The rest of the performance metrics is
almost identical after rounding.
The results for Bayesian SSD show a massive impact of the existence of
non-maximum suppression: maximum \(F_1\) score of 0.363 (with NMS) to 0.006
(without NMS). Dropout was disabled in both cases, making them effectively a
vanilla SSD run with multiple forward passes.
With 1057 open set errors, the Bayesian SSD variant with disabled dropout and
enabled non-maximum suppression offers the best performance with respect
to open set errors. It also has the best \(F_1\) score (0.363) and best
precision (0.420) of all Bayesian variants, and ties with the 0.9 keep ratio
variant on recall (0.321).
Dropout decreases the performance of the network; this can be seen
in the lower \(F_1\) scores, higher open set errors, and lower precision and
recall values. However, all variants with multiple forward passes and
non-maximum suppression have lower open set errors than all vanilla SSD
variants.
The relation of \(F_1\) score to absolute open set error can be observed
in figure \ref{fig:ose-f1-macro}. Precision-recall curves for all variants
can be seen in figure \ref{fig:precision-recall-macro}. Both vanilla SSD
variants with 0.01 confidence threshold reach much higher open set errors
and a higher recall. This behaviour is expected as more and worse predictions
are included. The Bayesian variant without non-maximum suppression was not
plotted.
All plotted variants show a similar behaviour that is in line with previously
reported figures, such as the ones in Miller et al.~\cite{Miller2018}.
\chapter{Discussion and Outlook}
\label{chap:discussion}
First the results will be discussed, then possible future research and open
questions will be addressed.
\section*{Discussion}
The results clearly do not support the hypothesis: \textit{Dropout sampling delivers better object detection performance under open set conditions compared to object detection without it}. With the exception of open set errors, there
is no area where dropout sampling performs better than vanilla SSD. In the
remainder of the section the individual results will be interpreted.
\subsection*{Impact of averaging}
Micro and macro averaging create largely similar results. Notably, micro
averaging has a significant performance increase towards the end
of the list of predictions. This is signalled by the near horizontal movement
of the plot in both the \(F_1\) versus absolute open set error graph (see figure \ref{fig:ose-f1-micro}) and
the precision-recall curve (see figure \ref{fig:precision-recall-micro}).
There are potentially true positive detections of one class that significantly
improve recall when compared to all detections across the classes but are
insignificant when solely compared to other detections of their own class.
Furthermore, the plotted behaviour implies that Miller et al.~\cite{Miller2018}
used macro averaging, as the unique behaviour of micro
averaging is not reported in their paper.
\subsection*{Impact of Entropy}
There is no visible impact of entropy thresholding on the object detection
performance for vanilla SSD. This indicates that the network has almost no
uniform or close to uniform predictions; the vast majority of predictions
has a high confidence in one class, including the background.
However, the entropy plays a larger role for the Bayesian variants - as
expected: the best performing thresholds are 1.3 and 1.4 for micro averaging,
and 1.5, 1.7, and 2.0 for macro averaging. In all of these cases the best
threshold is not the largest threshold tested. A lower threshold likely
eliminated some false positives from the result set. On the other hand, a
threshold that is too low likely eliminated true positives as well.
\subsection*{Non-maximum suppression}
Miller et al.~\cite{Miller2018} supposedly did not use non-maximum suppression
in their implementation of dropout sampling. Therefore, a variant with disabled
non-maximum suppression (NMS) was tested. The disastrous results heavily imply
that NMS is crucial and pose serious questions about the implementation of
Miller et al., who still have not released source code.
Without NMS all detections passing the per-class confidence threshold are
directly ordered in descending order by their confidence value. Afterwards the
top \(k\) detections are kept. This enables the following scenario:
the first top \(k\) detections all belong to the same class and potentially
the same object. Detections of other classes and objects could be discarded, reducing
recall in the process. Multiple detections of the same object also increase
the number of false positives, further reducing the \(F_1\) score.
\subsection*{Dropout}
The dropout variants have largely worse performance than the Bayesian variants
without dropout. This is expected as the network was not trained with
dropout and the weights are not prepared for it.
Gal~\cite{Gal2017}
showed that networks \textbf{trained} with dropout are approximate Bayesian
models. Miller et al. never fine-tuned or properly trained SSD after
the dropout layers were inserted. Therefore, the Bayesian variant of SSD
implemented in this thesis is not guaranteed to be such an approximate
model.
These results further question the reported results of Miller et al., who
reported significantly better results of dropout sampling compared to vanilla
SSD. Admittedly, they used the network not on COCO but on SceneNet RGB-D~\cite{McCormac2017}. However, they also claim that no fine-tuning
for SceneNet took place. Applying SSD to an unknown data set should result
in overall worse performance. Attempts to replicate their work on SceneNet RGB-D
failed with miserable results even for vanilla SSD; further attempts were not
made for this thesis. However, Miller et al. used
a different implementation of SSD, therefore, it is possible that their
implementation worked on SceneNet without fine-tuning.
\subsection*{Sampling and Observations}
It is remarkable that the Bayesian variant with disabled dropout and
non-maximum suppression performed better than vanilla SSD with respect to
open set errors. This indicates a relevant impact of multiple forward
passes and the grouping of observations on the result. With disabled
dropout, the ten forward passes should all produce the same results,
resulting in ten identical detections for every detection in vanilla SSD.
The variation in the result can only originate from the grouping into
observations.
All detections that overlap by at least 95\% with each other
are grouped into an observation. For every ten identical detections one
observation should be the result. However, due to the 95\% overlap rather than
100\%, more than ten detections could be grouped together. This would result
in fewer overall observations compared to the number of detections
in vanilla SSD. Such a lower number reduces the chance for the network
to make mistakes.
\section*{Outlook}
The attempted replication of the work of Miller et al. raises a series of
questions that cannot be answered in this thesis. This thesis offers
one possible implementation of dropout sampling that technically works.
However, this thesis cannot answer why this implementation differs significantly
from that of Miller et al. The complete source code or otherwise exhaustive
implementation details would be required to attempt an answer.
Future work could explore the performance of this implementation when used
on an SSD variant that was fine-tuned or trained with dropout. In this case, it
should also look into the impact of training with both dropout and batch
normalisation.
Other avenues include the application to other data sets or object detection
networks.
To facilitate future work based on this thesis, the source code will be
made available and an installable Python package will be uploaded to the
PyPI package index. More details about the source code implementation
can be found in the appendices.