% body thesis file that contains the actual content
|
|
|
|
\chapter{Introduction}
|
|
|
|
The introduction will explain the wider context first, before
|
|
providing technical details.
|
|
|
|
\subsection*{Motivation}
|
|
|
|
Famous examples like the automatic soap dispenser which does not
|
|
recognise the hand of a black person but dispenses soap when presented
|
|
with a paper towel raise the question of bias in computer
|
|
systems~\cite{Friedman1996}. Related to this ethical question regarding
|
|
the design of so-called algorithms is the question of
|
|
algorithmic accountability~\cite{Diakopoulos2014}.
|
|
|
|
Supervised neural networks learn from input-output relations and
|
|
figure out by themselves what connections are necessary for that.
|
|
This feature is also their Achilles heel: it makes them effectively
|
|
black boxes and prevents any answers to questions of causality.
|
|
|
|
However, these questions of causality are of enormous consequence when
results of neural networks are used to make life-changing decisions:
Is a correlation enough to bring forth negative consequences
for a particular person? And if so, what is the possible defence
against math? Similar questions can be raised when looking at computer
vision networks that might be used together with so-called smart
CCTV cameras to discover suspicious activity.
|
|
|
|
This leads to the need for neural networks to explain their results.
Such an explanation must come from the network or an attached piece
of technology to allow mass adoption. This
raises the question of how such an endeavour can be achieved.
|
|
|
|
For neural networks there are fundamentally two types of tasks:
|
|
regression and classification. Regression deals with any case
|
|
where the goal for the network is to come close to an ideal
|
|
function that connects all data points. Classification, however,
|
|
describes tasks where the network is supposed to identify the
|
|
class of any given input. In this thesis, I will work with both.
|
|
|
|
\subsection*{Object Detection in Open Set Conditions}
|
|
|
|
\begin{figure}
|
|
\centering
|
|
\includegraphics[scale=1.0]{open-set}
|
|
\caption{Open set problem: The test set contains classes that
|
|
were not present during training time.
|
|
Icons in this image have been taken from the COCO data set
|
|
website (\url{https://cocodataset.org/\#explore}) and were
|
|
vectorised afterwards. Resembles figure 1 of Miller et al.~\cite{Miller2018}.}
|
|
\label{fig:open-set}
|
|
\end{figure}
|
|
|
|
More specifically, I will look at object detection under open set
conditions (see figure \ref{fig:open-set}).
|
|
In non-technical words this effectively describes
|
|
the kind of situation you encounter with CCTV cameras or robots
|
|
outside of a laboratory. Both use cameras that record
|
|
images. Subsequently, a neural network analyses the image
|
|
and returns a list of detected and classified objects that it
|
|
found in the image. The problem here is that networks can only
|
|
classify what they know. If presented with an object type that
|
|
the network was not trained with, as happens frequently in real
|
|
environments, it will still classify the object and might even
|
|
have a high confidence in doing so. Such an example would be
|
|
a false positive. Any ordinary person who uses the results of
|
|
such a network would falsely assume that a high confidence always
|
|
means the classification is very likely correct. If they use
|
|
a proprietary system they might not even be able to find out
|
|
that the network was never trained on a particular type of object.
|
|
Therefore, it would be impossible for them to identify the output
|
|
of the network as false positive.
|
|
|
|
This goes back to the need for automatic explanation. Such a system
|
|
should by itself recognise that the given object is unknown and
|
|
hence mark any classification result of the network as meaningless.
|
|
Technically there are two slightly different approaches that deal
|
|
with this type of task: model uncertainty and novelty detection.
|
|
|
|
Model uncertainty can be measured, for example, with dropout sampling.
|
|
Dropout is usually used only during training, but
Miller et al.~\cite{Miller2018} also use it during testing
to obtain different results for the same image across
multiple forward passes. The output scores for the forward passes
|
|
of the same image are then averaged. If the averaged class
|
|
probabilities resemble a uniform distribution (every class has
|
|
the same probability) this symbolises maximum uncertainty. Conversely,
|
|
if there is one very high probability with every other being very
|
|
low this signifies a low uncertainty. An unknown object is more
|
|
likely to cause high uncertainty which allows for an identification
|
|
of false positive cases.
|
|
|
|
Novelty detection is another approach to solve the task.
|
|
In the realm of neural networks it is usually done with the help of
|
|
auto-encoders that solve a regression task of finding an
|
|
identity function that reconstructs the given input~\cite{Pimentel2014}. Auto-encoders have
|
|
internally at least two components: an encoder, and a decoder or
|
|
generator. The job of the encoder is to find an encoding that
compresses the input as well as possible while simultaneously
being as loss-free as possible. The decoder takes this latent
representation of the input and has to find a decompression
that reconstructs the input as accurately as possible. During
|
|
training these auto-encoders learn to reproduce a certain group
|
|
of object classes. The actual novelty detection takes place
|
|
during testing: Given an image, and the output and loss of the
|
|
auto-encoder, a novelty score is calculated. For some novelty
|
|
detection approaches the reconstruction loss is exactly the novelty
|
|
score, others consider more factors. A low novelty
|
|
score signals a known object. The opposite is true for a high
|
|
novelty score.
|
|
|
|
\subsection*{Research Question}
|
|
|
|
Auto-encoders work well for data sets like MNIST~\cite{Deng2012}
|
|
but perform poorly on challenging real world data sets
|
|
like MS COCO~\cite{Lin2014}. Therefore, a comparison between
|
|
model uncertainty and novelty detection is considered out of
|
|
scope for this thesis.
|
|
|
|
Miller et al.~\cite{Miller2018} used an SSD pre-trained on COCO
|
|
without further fine-tuning on the SceneNet RGB-D data
|
|
set~\cite{McCormac2017} and reported good results regarding
|
|
open set error for an SSD variant with dropout sampling and entropy
|
|
thresholding.
|
|
If their results are generalisable it should be possible to replicate
|
|
the relative difference between the variants on the COCO data set.
|
|
This leads to the following hypothesis: \emph{Dropout sampling
|
|
delivers better object detection performance under open set
|
|
conditions compared to object detection without it.}
|
|
|
|
For the purpose of this thesis, I will use the vanilla SSD as
|
|
baseline to compare against. In particular, vanilla SSD uses
|
|
a per-class confidence threshold of 0.01, an IOU threshold of 0.45
|
|
for the non-maximum suppression, and a top k value of 200.
|
|
The effect of an entropy threshold is measured against this vanilla
|
|
SSD by applying entropy thresholds from 0.1 to 2.4 inclusive (limits taken from
|
|
Miller et al.). Dropout sampling is compared to vanilla SSD, both
|
|
with and without entropy thresholding.
|
|
|
|
\paragraph{Hypothesis} Dropout sampling
|
|
delivers better object detection performance under open set
|
|
conditions compared to object detection without it.
|
|
|
|
\subsection*{Reader's guide}
|
|
|
|
First, chapter \ref{chap:background} presents related works and
provides the background for dropout sampling, also known as Bayesian SSD.
|
|
Afterwards, chapter \ref{chap:methods} explains how the Bayesian SSD
|
|
works and how the decoding pipelines are structured.
|
|
Chapter \ref{chap:experiments-results} presents the data sets,
|
|
the experimental setup, and the results. This is followed by
|
|
chapter \ref{chap:discussion}, focusing on
|
|
the discussion and closing.
|
|
|
|
Therefore, the contribution is found in chapters \ref{chap:methods},
|
|
\ref{chap:experiments-results}, and \ref{chap:discussion}.
|
|
|
|
\chapter{Background}
|
|
|
|
\label{chap:background}
|
|
|
|
This chapter will begin with an overview of previous works
in the field of this thesis. Afterwards, the theoretical foundations
|
|
of the work of Miller et al.~\cite{Miller2018} will
|
|
be explained.
|
|
|
|
\section{Related Works}
|
|
|
|
The task of novelty detection can be accomplished in a variety of ways.
|
|
Pimentel et al.~\cite{Pimentel2014} provide a review of novelty detection
|
|
methods published over the previous decade. They showcase probabilistic,
|
|
distance-based, reconstruction-based, domain-based, and information-theoretic
|
|
novelty detection. Based on their categorisation, this thesis falls under
|
|
reconstruction-based novelty detection as it deals only with neural network
|
|
approaches. Therefore, the other types of novelty detection will only be
|
|
briefly introduced.
|
|
|
|
\subsection{Overview of types of novelty detection}
|
|
|
|
Probabilistic approaches estimate the generative probability density function (pdf)
|
|
of the data. It is assumed that the training data is generated from an underlying
|
|
probability distribution \(D\). This distribution can be estimated with the
training data; the estimate is denoted \(\hat D\) and represents a model
of normality. A novelty threshold is applied to \(\hat D\) in a way that
|
|
allows a probabilistic interpretation. Pidhorskyi et al.~\cite{Pidhorskyi2018}
|
|
combine a probabilistic approach to novelty detection with auto-encoders.
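
As a minimal sketch of this idea, and under the assumption that \(\hat D\)
can be evaluated as a density at a test point \(\mathbf{x}\), a threshold
\(\tau\) (a symbol introduced here purely for illustration) turns the
estimate into a novelty decision:
\[
  \hat D(\mathbf{x}) < \tau
  \quad \Longrightarrow \quad
  \mathbf{x} \mbox{ is considered novel.}
\]
Test points in regions to which the model of normality assigns a low density
are flagged as novel. The choice of \(\tau\) controls the trade-off between
missed novelties and false alarms.
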
|
|
|
|
Distance-based novelty detection uses either nearest neighbour-based approaches
|
|
(e.g. \(k\)-nearest neighbour \cite{Hautamaki2004})
|
|
or clustering-based approaches
|
|
(e.g. \(k\)-means clustering algorithm \cite{Jordan1994}).
|
|
Both methods are similar to estimating the
pdf of the data; they use well-defined distance metrics to compute the distance
between two data points.
|
|
|
|
Domain-based novelty detection describes the boundary of the known data, rather
|
|
than the data itself. Unknown data is identified by its position relative to
|
|
the boundary. A common implementation for this are support vector machines
|
|
(e.g. implemented by Song et al. \cite{Song2002}).
|
|
|
|
Information-theoretic novelty detection computes the information content
|
|
of a data set, for example, with metrics like entropy. Such metrics assume
|
|
that novel data inside the data set significantly alters the information
|
|
content of an otherwise normal data set. First, the metrics are calculated over the
|
|
whole data set. Afterwards, a subset is identified that causes the biggest
|
|
difference in the metric when removed from the data set. This subset is considered
|
|
to consist of novel data. For example, Filippone and Sanguinetti \cite{Filippone2011} provide
|
|
a recent approach.
|
|
|
|
\subsection{Reconstruction-based novelty detection}
|
|
|
|
Reconstruction-based approaches use the reconstruction error in one form
|
|
or another to calculate the novelty score. This can be auto-encoders that
|
|
literally reconstruct the input but it also includes MLP networks which try
|
|
to reconstruct the ground truth. Pimentel et al.~\cite{Pimentel2014} differentiated
|
|
between neural network-based approaches and subspace methods. The first were
|
|
further differentiated between MLPs, Hopfield networks, autoassociative networks,
|
|
radial basis function, and self-organising networks.
|
|
The remainder of this section focuses on MLP-based works, with a particular focus
on the task of object detection and on Bayesian networks.
|
|
|
|
Novelty detection for object detection is intricately linked with
|
|
open set conditions: the test data can contain unknown classes.
|
|
Bishop~\cite{Bishop1994} investigated the correlation between
|
|
the degree of novel input data and the reliability of network
|
|
outputs.
|
|
|
|
The Bayesian approach provides a theoretical foundation for
|
|
modelling uncertainty \cite{Ghahramani2015}.
|
|
MacKay~\cite{MacKay1992} provided a practical Bayesian
|
|
framework for backpropagation networks. Neal~\cite{Neal1996} built upon
|
|
the work of MacKay and explored Bayesian learning for neural networks.
|
|
However, these Bayesian neural networks do not scale well. Over the course
|
|
of time, two major Bayesian approximations were introduced: one based
|
|
on dropout and one based on batch normalisation.
|
|
|
|
Gal and Ghahramani~\cite{Gal2016} showed that dropout training is a
|
|
Bayesian approximation of a Gaussian process. Subsequently, Gal~\cite{Gal2017}
|
|
showed that dropout training actually corresponds to a general approximate
|
|
Bayesian model. This means every network trained with dropout is an
|
|
approximate Bayesian model. During inference the dropout remains active;
this form of inference is called Monte Carlo Dropout (MCDO).
|
|
Miller et al.~\cite{Miller2018} built upon the work of Gal and Ghahramani: they
|
|
use MC dropout under open-set conditions for object detection.
|
|
In a second paper \cite{Miller2018a}, Miller et al. continued their work and
|
|
compared merging strategies for sampling-based uncertainty techniques in
|
|
object detection.
|
|
|
|
Teye et al.~\cite{Teye2018} make the point that most modern networks have
|
|
adopted other regularisation techniques. Ioffe and Szegedy~\cite{Ioffe2015}
introduced batch normalisation which has been adopted widely. Teye et al.
|
|
showed how batch normalisation training is similar to dropout and can be
|
|
viewed as an approximate Bayesian inference. Estimates of the model uncertainty
|
|
can be gained with a technique named Monte Carlo Batch Normalisation (MCBN).
|
|
Consequently, this technique can be applied to any network that utilises
|
|
standard batch normalisation.
|
|
Li et al.~\cite{Li2019} investigated the problem of poor performance
|
|
when combining dropout and batch normalisation: Dropout shifts the variance
|
|
of a neural unit when switching from train to test, batch normalisation
|
|
does not change the variance. This inconsistency leads to a variance shift which
|
|
can have a larger or smaller impact based on the network used. For example,
|
|
adding dropout layers to SSD \cite{Liu2016} and applying MC dropout, like
|
|
Miller et al.~\cite{Miller2018} did, causes such a problem because SSD uses
|
|
batch normalisation.
|
|
|
|
Non-Bayesian approaches have been developed as well. Usually, they compare with
|
|
MC dropout and show better performance.
|
|
Postels et al.~\cite{Postels2019} provided a sampling-free approach for
|
|
uncertainty estimation that does not affect training and approximates the
|
|
sampling at test time. They compared it to MC dropout and found less computational
|
|
overhead with better results.
|
|
Lakshminarayanan et al.~\cite{Lakshminarayanan2017}
|
|
implemented a predictive uncertainty estimation using deep ensembles.
|
|
Compared to MC dropout, it showed better results.
|
|
Geifman et al.~\cite{Geifman2018}
|
|
introduced an uncertainty estimation algorithm for non-Bayesian deep
|
|
neural classification that estimates the uncertainty of highly
|
|
confident points using earlier snapshots of the trained model and improves,
|
|
among others, the approach introduced by Lakshminarayanan et al.
|
|
Sensoy et al.~\cite{Sensoy2018} explicitly model prediction uncertainty:
|
|
a Dirichlet distribution is placed over the class probabilities. Consequently,
|
|
the predictions of a neural network are treated as subjective opinions.
|
|
|
|
In addition to the aforementioned Bayesian and non-Bayesian works,
|
|
there are some Bayesian works that do not quite fit with the rest but
|
|
are important as well. Mukhoti and Gal~\cite{Mukhoti2018}
|
|
contributed metrics to measure uncertainty for semantic
|
|
segmentation. Wu et al.~\cite{Wu2019} introduced two innovations
|
|
that turn variational Bayes into a robust tool for Bayesian
|
|
networks: a novel deterministic method to approximate
|
|
moments in neural networks which eliminates gradient variance, and
|
|
a hierarchical prior for parameters and an empirical Bayes procedure to select
|
|
prior variances.
|
|
|
|
\section{Background for Bayesian SSD}
|
|
|
|
\begin{table}
\centering
\caption{Notation for background}
\label{tab:notation}
\begin{tabular}{l|l}
symbol & meaning \\
\hline
\(\mathbf{W}\) & weights \\
\(\mathbf{T}\) & training data \\
\(\mathcal{N}(0, I)\) & Gaussian distribution \\
\(I\) & identity covariance matrix (independent and identically
distributed weights) \\
\(p(\mathbf{W}|\mathbf{T})\) & probability of weights given
training data \\
\(\mathcal{I}\) & an image \\
\(\mathbf{q} = p(y|\mathcal{I}, \mathbf{T})\) & probability
of all classes given image and training data \\
\(H(\mathbf{q})\) & entropy over probability vector \\
\(\widetilde{\mathbf{W}}\) & weights sampled from
\(p(\mathbf{W}|\mathbf{T})\) \\
\(\mathbf{b}\) & bounding box coordinates \\
\(\mathbf{s}\) & softmax scores \\
\(\overline{\mathbf{s}}\) & averaged softmax scores \\
\(D\) & detections of one forward pass \\
\(\mathfrak{D}\) & set of all detections over multiple
forward passes \\
\(\mathcal{O}\) & observation \\
\(\overline{\mathbf{q}}\) & probability vector for
observation \\
%\(E[something]\) & expected value of something
%\(\overline{\mathbf{z}}, \mathbf{z}\) & latent space representation \\
%\(d_T, d_z\) & discriminators \\
%\(e, g\) & encoding and decoding/generating function \\
%\(J_g\) & Jacobi matrix for generating function \\
%\(\mathcal{T}\) & tangent space \\
%\(\mathbf{R}\) & training/test data changed to be on tangent space
\end{tabular}
\end{table}
|
|
|
|
This section will use the \textbf{notation} defined in table
|
|
\ref{tab:notation} on page \pageref{tab:notation}.
|
|
To understand dropout sampling, it is necessary to explain the
|
|
idea of Bayesian neural networks. They place a prior distribution
|
|
over the network weights, for example a Gaussian prior distribution:
|
|
\(\mathbf{W} \sim \mathcal{N}(0, I)\). In this example
\(\mathbf{W}\) are the weights and \(I\) is the identity matrix, which means that the
weights are drawn independently from identical distributions. The
|
|
training of the network determines a plausible set of weights by
|
|
evaluating the posterior (probability output) over the weights given
|
|
the training data \(\mathbf{T}\): \(p(\mathbf{W}|\mathbf{T})\).
|
|
However, this
|
|
evaluation cannot be performed in any reasonable
|
|
time. Therefore approximation techniques are
|
|
required. In those techniques the posterior is fitted with a
|
|
simple distribution \(q^{*}_{\theta}(\mathbf{W})\). The original
|
|
and intractable problem of averaging over all weights in the network
|
|
is replaced with an optimisation task, where the parameters of the
|
|
simple distribution are optimised over~\cite{Kendall2017}.
|
|
|
|
\subsubsection*{Dropout Variational Inference}
|
|
|
|
Kendall and Gal~\cite{Kendall2017} showed an approximation for
classification and recognition tasks. Dropout variational inference
is a practical approximation technique that adds dropout layers
in front of every weight layer and also uses them at test
time to sample from the approximate posterior. Effectively, this
|
|
results in the approximation of the class probability
|
|
\(p(y|\mathcal{I}, \mathbf{T})\) by performing multiple forward
|
|
passes through the network and averaging over the obtained Softmax
|
|
scores \(\mathbf{s}_i\), given an image \(\mathcal{I}\) and the
|
|
training data \(\mathbf{T}\):
|
|
\begin{equation} \label{eq:drop-sampling}
|
|
p(y|\mathcal{I}, \mathbf{T}) = \int p(y|\mathcal{I}, \mathbf{W}) \cdot p(\mathbf{W}|\mathbf{T})d\mathbf{W} \approx \frac{1}{n} \sum_{i=1}^{n}\mathbf{s}_i
|
|
\end{equation}
|
|
|
|
With this dropout sampling technique \(n\) model weights
|
|
\(\widetilde{\mathbf{W}}_i\) are sampled from the posterior
|
|
\(p(\mathbf{W}|\mathbf{T})\). The class probability
|
|
\(p(y|\mathcal{I}, \mathbf{T})\) is a probability vector
|
|
\(\mathbf{q}\) over all class labels. Finally, the uncertainty
|
|
of the network with respect to the classification is given by
|
|
the entropy \(H(\mathbf{q}) = - \sum_i q_i \cdot \log q_i\).
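
To illustrate equation \ref{eq:drop-sampling} and the entropy, consider a
hypothetical two-class problem with \(n = 2\) forward passes. The numbers are
invented for illustration and the natural logarithm is assumed:
\[
  \mathbf{s}_1 = (0.9, 0.1), \quad \mathbf{s}_2 = (0.3, 0.7)
  \quad \Longrightarrow \quad
  \mathbf{q} \approx \frac{1}{2} (\mathbf{s}_1 + \mathbf{s}_2) = (0.6, 0.4)
\]
\[
  H(\mathbf{q}) = - (0.6 \cdot \ln 0.6 + 0.4 \cdot \ln 0.4) \approx 0.673
\]
The individual passes have entropies of about \(0.325\) and \(0.611\).
The disagreement between the two passes therefore shows up as an entropy of
the averaged vector that is close to the two-class maximum of
\(\ln 2 \approx 0.693\).
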
|
|
|
|
\subsubsection*{Dropout Sampling for Object Detection}
|
|
|
|
Miller et al.~\cite{Miller2018} apply the dropout sampling to
|
|
object detection. In that case \(\mathbf{W}\) represents the
|
|
learned weights of a detection network like SSD~\cite{Liu2016}.
|
|
Every forward pass uses a different network
|
|
\(\widetilde{\mathbf{W}}\) which is approximately sampled from
|
|
\(p(\mathbf{W}|\mathbf{T})\). Each forward pass in object
|
|
detection results in a set of detections, each consisting of bounding
|
|
box coordinates \(\mathbf{b}\) and softmax score \(\mathbf{s}\).
|
|
The detections are denoted by Miller et al. as \(D_i =
|
|
\{\mathbf{s}_i,\mathbf{b}_i\}\). The detections of all passes are put
|
|
into a large set \(\mathfrak{D} = \{D_1, \dots, D_n\}\).
|
|
|
|
All detections with mutual intersection-over-union scores (IoU)
|
|
of \(0.95\) or higher are defined as an observation \(\mathcal{O}_i\).
|
|
Subsequently, the corresponding vector of class probabilities
|
|
\(\overline{\mathbf{q}}_i\) for the observation is calculated by averaging all
|
|
score vectors \(\mathbf{s}_j\) in a particular observation
|
|
\(\mathcal{O}_i\): \(\overline{\mathbf{q}}_i \approx \overline{\mathbf{s}}_i = \frac{1}{n} \sum_{j=1}^{n} \mathbf{s}_j\). The label uncertainty
|
|
of the detector for a particular observation is measured by
|
|
the entropy \(H(\overline{\mathbf{q}}_i)\).
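
The IoU itself is not spelled out above. The following sketch assumes the
common definition for two boxes \(A\) and \(B\) (symbols introduced here for
illustration), with \(|\cdot|\) denoting the area:
\[
  \mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}
\]
As a hypothetical example, take two boxes of \(100 \times 100\) pixels that are
shifted horizontally by 2 pixels. The intersection is \(98 \cdot 100 = 9800\)
and the union is \(2 \cdot 10000 - 9800 = 10200\), which gives
\(\mathrm{IoU} \approx 0.961\). With a shift of 3 pixels the result is
\(9700 / 10300 \approx 0.942\), which is already below the threshold of
\(0.95\). Only nearly identical boxes are therefore grouped into the same
observation.
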
|
|
|
|
If \(\overline{\mathbf{q}}_i\), which I called averaged class probabilities,
|
|
resembles a uniform distribution the entropy will be high. A uniform
|
|
distribution means that no class is more likely than another, which
|
|
is a perfect example of maximum uncertainty. Conversely, if
|
|
one class has a very high probability the entropy will be low.
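
The two extremes can be made concrete. For \(K\) classes (a symbol introduced
here for illustration) and a uniform vector with \(q_k = 1/K\), the entropy is
\[
  H(\mathbf{q}) = - \sum_{k=1}^{K} \frac{1}{K} \cdot \log \frac{1}{K} = \log K .
\]
Assuming the natural logarithm and, purely as an example, \(K = 61\)
(60 object classes plus one background class), the maximum is
\(\ln 61 \approx 4.11\). In the opposite extreme, where one class has
probability one, the entropy is zero under the usual convention
\(0 \cdot \log 0 = 0\).
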
|
|
|
|
In open set conditions it can be expected that falsely generated
|
|
detections for unknown object classes have a higher label
|
|
uncertainty. A threshold on the entropy \(H(\overline{\mathbf{q}}_i)\) can then
|
|
be used to identify and reject these false positive cases.
|
|
|
|
% SSD: \cite{Liu2016}
|
|
% ImageNet: \cite{Deng2009}
|
|
% COCO: \cite{Lin2014}
|
|
% YCB: \cite{Xiang2017}
|
|
% SceneNet: \cite{McCormac2017}
|
|
|
|
\chapter{Methods}
|
|
|
|
\label{chap:methods}
|
|
|
|
This chapter explains the functionality of the Bayesian SSD and the
|
|
decoding pipelines.
|
|
|
|
\section{Bayesian SSD for Model Uncertainty}
|
|
|
|
Bayesian SSD adds dropout sampling to the vanilla SSD. First,
|
|
the model architecture will be explained, followed by details on
|
|
the uncertainty calculation, and implementation details.
|
|
|
|
\subsection{Model Architecture}
|
|
|
|
\begin{figure}
|
|
\centering
|
|
\includegraphics[scale=1.2]{vanilla-ssd}
|
|
\caption{The vanilla SSD network as defined by Liu et al.~\cite{Liu2016}. VGG-16 is the base network, extended with extra feature layers. These predict offsets to anchor boxes with different sizes and aspect ratios. Furthermore, they predict the
|
|
corresponding confidences.}
|
|
\label{fig:vanilla-ssd}
|
|
\end{figure}
|
|
|
|
Vanilla SSD is based upon the VGG-16 network (see figure \ref{fig:vanilla-ssd}) and adds extra feature layers. These layers
|
|
predict the offsets to the anchor boxes, which have different sizes
|
|
and aspect ratios. The feature layers also predict the
|
|
corresponding confidences. By comparison, Bayesian SSD only adds
|
|
two dropout layers after the fc6 and fc7 layers (see figure \ref{fig:bayesian-ssd}).
|
|
|
|
\begin{figure}
|
|
\centering
|
|
\includegraphics[scale=1.2]{bayesian-ssd}
|
|
\caption{The Bayesian SSD network as defined by Miller et al.~\cite{Miller2018}. It adds dropout layers after the fc6
|
|
and fc7 layers.}
|
|
\label{fig:bayesian-ssd}
|
|
\end{figure}
|
|
|
|
\subsection{Model Uncertainty}
|
|
|
|
Dropout sampling measures model uncertainty with the help of
|
|
entropy: every forward pass creates predictions, these are
|
|
partitioned into observations, and then their entropy is calculated.
|
|
Entropy works to detect uncertainty because uncertain networks
|
|
will produce different classifications for the same object in an
|
|
image across multiple forward passes.
|
|
|
|
\subsection{Implementation Details}
|
|
|
|
For this thesis, an SSD implementation based on Tensorflow~\cite{Abadi2015} and
|
|
Keras\footnote{\url{https://github.com/pierluigiferrari/ssd\_keras}}
|
|
was used. It was modified to support entropy thresholding,
|
|
partitioning of observations, and dropout
|
|
layers in the SSD model. %Entropy thresholding takes place before
|
|
%the per-class confidence threshold is applied.
|
|
|
|
\section{Decoding Pipelines}
|
|
|
|
The raw output of SSD is not very useful: it contains thousands of
|
|
boxes per image. Among them are many boxes with very low confidences
or background classifications; those need to be filtered out to
get any meaningful output of the network. This filtering
process is called decoding and is presented for the three variants
of SSD used in the thesis.
|
|
|
|
\subsection{Vanilla SSD}
|
|
|
|
Liu et al.~\cite{Liu2016} used Caffe for their original SSD
|
|
implementation. The decoding process consists largely of two
|
|
phases: decoding and filtering. Decoding transforms the relative
|
|
coordinates predicted by SSD into absolute coordinates. At this point
|
|
the shape of the output per batch is \((batch\_size, \#nr\_boxes, \#nr\_classes + 12)\). The last twelve elements are split into
|
|
the four bounding box offsets, the four anchor box coordinates, and
|
|
the four variances; there are 8732 boxes.
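
The number of boxes follows from the SSD300 configuration of
Liu et al.~\cite{Liu2016}: six feature maps with side lengths of
38, 19, 10, 5, 3, and 1 cells and 4, 6, 6, 6, 4, and 4 anchor boxes per cell:
\[
  38^2 \cdot 4 + 19^2 \cdot 6 + 10^2 \cdot 6 + 5^2 \cdot 6 + 3^2 \cdot 4 + 1^2 \cdot 4
  = 5776 + 2166 + 600 + 150 + 36 + 4 = 8732
\]
The coordinate transformation can be sketched as follows, assuming the
centroid encoding of the original SSD. The symbols are introduced here for
illustration: \(t\) denotes the predicted offsets, the subscript \(a\) marks the
anchor box, and \(v\) denotes the variances:
\[
  c_x = v_{x} \, t_{x} \, w_a + c_{x,a}, \qquad
  c_y = v_{y} \, t_{y} \, h_a + c_{y,a}
\]
\[
  w = w_a \cdot \exp(v_{w} \, t_{w}), \qquad
  h = h_a \cdot \exp(v_{h} \, t_{h})
\]
The corner coordinates follow from the centre and the size, for example
\(x_{min} = c_x - w / 2\). If the anchor boxes are stored in normalised form,
the result still has to be scaled by the image width and height.
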
|
|
|
|
Filtering of these boxes is first done per class:
|
|
only the class id, confidence of that class, and the bounding box
|
|
coordinates are kept per box. The filtering consists of
|
|
confidence thresholding and a subsequent non-maximum suppression.
|
|
All boxes that pass non-maximum suppression are added to a
|
|
per image maxima list. One box could make the confidence threshold
|
|
for multiple classes and, hence, be present multiple times in the
|
|
maxima list for the image. Lastly, a total of \(k\) boxes with the
|
|
highest confidences is kept per image across all classes. The
|
|
original implementation uses a confidence threshold of \(0.01\), an
|
|
IOU threshold for non-maximum suppression of \(0.45\) and a top \(k\)
|
|
value of 200.
|
|
|
|
The combination of per-class confidence threshold and non-maximum
suppression in vanilla SSD has one
weakness: even if SSD correctly predicts the
background class with high confidence for all boxes, the per-class confidence
threshold of 0.01 still admits predictions with very low
confidences. Background boxes are never added to the maxima
list, but many low confidence boxes of other classes can be. Furthermore, the
same detection can be present in the maxima list for multiple
classes. In this case, the entropy threshold would let the detection
pass because the background class has high confidence and the entropy
is therefore low. Subsequently,
a low per-class confidence threshold does not restrict the boxes
either. Therefore, the decoding output is worse than the actual
predictions of the network.
Bayesian SSD cannot help in this situation because the network
is not actually uncertain.
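
A hypothetical prediction with three classes illustrates this weakness.
The numbers are invented and the natural logarithm is assumed. Let the softmax
scores for background, class A, and class B be
\[
  \mathbf{s} = (0.97, 0.025, 0.005) .
\]
The entropy is
\[
  H(\mathbf{s}) = - (0.97 \cdot \ln 0.97 + 0.025 \cdot \ln 0.025 + 0.005 \cdot \ln 0.005)
  \approx 0.148 ,
\]
which passes, for example, an entropy threshold of \(0.5\). The score of
class A exceeds the per-class confidence threshold of \(0.01\), so the box
enters the maxima list as a class A detection although the network assigns
a probability of \(0.97\) to the background. Only class B, with a score of
\(0.005\), is filtered out.
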
|
|
|
|
SSD was developed with closed set conditions in mind. A well trained
|
|
network in such a situation does not have many high confidence
|
|
background detections. In an open set environment, background
|
|
detections are the correct behaviour for unknown classes.
|
|
In order to get useful detections out of the decoding, a higher
|
|
confidence threshold is required.
|
|
|
|
\subsection{Vanilla SSD with Entropy Thresholding}
|
|
|
|
Vanilla SSD with entropy thresholding adds an additional component
|
|
to the filtering already done for vanilla SSD. The entropy is
|
|
calculated from all \(\#nr\_classes\) softmax scores in a prediction.
|
|
Only predictions with a low enough entropy pass the entropy
|
|
threshold and move on to the aforementioned per class filtering.
|
|
This excludes very uniform predictions but cannot identify
|
|
false positive or false negative cases with high confidence values.
|
|
|
|
\subsection{Bayesian SSD with Entropy Thresholding}
|
|
|
|
Bayesian SSD differs from vanilla SSD in that it uses multiple forward passes. Based
on the information in the paper by Miller et al.~\cite{Miller2018}, the detections of all forward passes
are grouped per image but not by forward pass. This leads
|
|
to the following shape of the network output after all
|
|
forward passes: \((batch\_size, \#nr\_boxes \, \cdot \, \#nr\_forward\_passes, \#nr\_classes + 12)\). The size of the output
|
|
increases linearly with more forward passes.
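
As a rough illustration, assume 10 forward passes (a value chosen here only
as an example) and the 8732 boxes of vanilla SSD. The second dimension of the
output then grows to
\[
  8732 \cdot 10 = 87320
\]
boxes per image, all of which have to be decoded before any filtering
takes place.
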
|
|
|
|
These detections have to be decoded first. Afterwards,
all detections that do not pass a confidence
threshold for the class with the highest prediction probability are discarded.
|
|
Additionally, all detections with a background prediction of 0.8 or higher are discarded.
|
|
The remaining detections are partitioned into observations to
|
|
further reduce the size of the output, and
|
|
to identify uncertainty. This is accomplished by calculating the
|
|
mutual IOU of every detection with all other detections. Detections
|
|
with a mutual IOU score of 0.95 or higher are partitioned into an
|
|
observation. Next, the softmax scores and bounding box coordinates of
|
|
all detections in an observation are averaged.
|
|
There can be a different number of observations for every image, which
destroys homogeneity and prevents batch-wise calculation of the
results. The shape of the results is per image: \((\#nr\_observations,\#nr\_classes + 4)\).
|
|
|
|
Entropy is measured in the next step. All observations with too high
|
|
entropy are discarded. Entropy thresholding in combination with
|
|
dropout sampling should improve identification of false positives of
|
|
unknown classes. This is due to multiple forward passes and
|
|
the assumption that uncertainty in some objects will result
|
|
in different classifications in multiple forward passes. These
|
|
varying classifications are averaged into multiple lower confidence
|
|
values which should increase the entropy and, hence, flag an
|
|
observation for removal.
|
|
|
|
The remainder of the filtering follows the vanilla SSD procedure: per-class
|
|
confidence threshold, non-maximum suppression, and a top \(k\) selection
|
|
at the end.
|
|
|
|
\chapter{Experimental Setup and Results}
|
|
|
|
\label{chap:experiments-results}
|
|
|
|
This chapter describes the data sets used, how the experiments were
set up, and what the results are.
|
|
|
|
\section{Data sets}
|
|
|
|
This thesis uses the MS COCO~\cite{Lin2014} data set. It contains
80 classes, ranging from airplanes to toothbrushes.
The images are photographs of real-world scenes, and ground truth
is provided for all images. The data set supports object detection,
|
|
keypoint detection, and panoptic segmentation (scene segmentation).
|
|
|
|
The data of any data set has to be prepared for use in a neural
|
|
network. Typical problems of data sets include, for example,
|
|
outliers and invalid bounding boxes. Before a data set can be used,
|
|
these problems need to be resolved.
|
|
|
|
For the MS COCO data set, all annotations were checked for
|
|
impossible values: bounding box height or width lower than zero,
|
|
\(x_{min}\) and \(y_{min}\) bounding box coordinates lower than zero,
|
|
\(x_{max}\) and \(y_{max}\) coordinates lower than or equal to zero, \(x_{min}\) greater than \(x_{max}\),
|
|
\(y_{min}\) greater than \(y_{max}\), image width lower than \(x_{max}\),
|
|
and image height lower than \(y_{max}\). In the last two cases the
|
|
bounding box width or height was set to (image width - \(x_{min}\)) or
|
|
(image height - \(y_{min}\)) respectively;
|
|
in the other cases the annotation was skipped.
|
|
If the bounding box width or height was lower than or equal to zero
afterwards, the annotation was skipped as well.
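The two corrective cases can be written compactly. The sketch assumes the COCO annotation layout of \(x_{min}\), \(y_{min}\), width \(w\), and height \(h\), so that \(x_{max} = x_{min} + w\) and \(y_{max} = y_{min} + h\); \(W_{img}\) and \(H_{img}\) denote image width and height and are symbols introduced here for illustration.
\[
w \leftarrow W_{img} - x_{min} \quad \mathrm{if} \quad x_{max} > W_{img}, \qquad
h \leftarrow H_{img} - y_{min} \quad \mathrm{if} \quad y_{max} > H_{img}
\]
For example, an annotation with \(x_{min} = 600\) and \(w = 100\) in an image of width \(W_{img} = 640\) has \(x_{max} = 700 > 640\), so its width is set to \(640 - 600 = 40\). An annotation with \(x_{min} = 640\) in the same image would end up with a width of \(0\) and is therefore skipped.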
|
|
|
|
SSD accepts \(300 \times 300\) input images; the MS COCO data set images were
therefore resized to this resolution, and the aspect ratio was not kept in the
|
|
process. As all images of MS COCO have the same resolution,
|
|
this led to a uniform distortion of the images. Furthermore,
|
|
the colour channels were swapped from RGB to BGR in order to
|
|
comply with the SSD implementation. The BGR requirement stems from
|
|
the usage of OpenCV in SSD: the internal channel order for
OpenCV is BGR.
|
|
|
|
For this thesis, weights pre-trained on the sub data set trainval35k of the
|
|
COCO data set were used. These weights were created with closed set
conditions in mind; therefore, they had to be sub-sampled to create
an open set condition. To this end, the weights for the last
20 classes were discarded, making these classes effectively unknown.
|
|
|
|
All images of the minival2014 data set were used but only ground truth
|
|
belonging to the first 60 classes was loaded. The remaining 20
|
|
classes were considered ``unknown'' and were not presented with bounding
|
|
boxes during the inference phase.
|
|
|
|
\section{Experimental Setup}
|
|
|
|
This section explains the setup of the conducted
experiments. Each comparison investigates one particular question.
|
|
|
|
As a baseline, vanilla SSD with a per-class confidence threshold of 0.01
and a non-maximum suppression IoU threshold of 0.45 was used.
|
|
Due to the low number of objects per image in the COCO data set,
|
|
the top \(k\) value was set to 20. Vanilla SSD with entropy
|
|
thresholding uses the same parameters; compared to vanilla SSD
|
|
without entropy thresholding, it showcases the relevance of
|
|
entropy thresholding for vanilla SSD.
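For reference, the intersection over union (IoU) used by non-maximum suppression relates the overlap of two boxes \(A\) and \(B\) to their combined area. The boxes in the example are made up for illustration, and the usual convention is assumed that a box is suppressed when its IoU with a higher-scoring box of the same class exceeds the threshold.
\[
\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}
\]
Two \(10 \times 10\) boxes that are shifted horizontally by 5 pixels share an area of \(50\) and cover a combined area of \(100 + 100 - 50 = 150\), which gives \(\mathrm{IoU} = 50 / 150 \approx 0.33\). With the threshold of 0.45 both boxes are kept. A shift of only 2 pixels gives \(80 / 120 \approx 0.67\), and the lower-scoring box is suppressed.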
|
|
|
|
Vanilla SSD was also run with 0.2 confidence threshold and compared
|
|
to vanilla SSD with 0.01 confidence threshold; this comparison
|
|
investigates the effect of the per-class confidence threshold
|
|
on the object detection performance.
|
|
|
|
Bayesian SSD was run with 0.2 confidence threshold and compared
|
|
to vanilla SSD with 0.2 confidence threshold. Coupled with the
|
|
entropy threshold, this comparison shows how uncertain the network
|
|
is. If it is very certain, dropout sampling should have no
|
|
significant impact on the result. Furthermore, in two cases the
|
|
dropout was turned off to isolate the impact of non-maximum suppression
|
|
on the result.
|
|
|
|
Both vanilla SSD with entropy thresholding and Bayesian SSD with
entropy thresholding were tested for entropy thresholds ranging
from 0.1 to 2.4 inclusive, as specified in Miller et al.~\cite{Miller2018}.
|
|
|
|
\section{Results}
|
|
|
|
Results in this section are presented both for micro and macro averaging.
|
|
In macro averaging, for example, the precision values of each class are added up
|
|
and then divided by the number of classes. Conversely, for micro averaging the
|
|
precision is calculated across all classes directly. Both methods have
|
|
a specific impact: macro averaging weighs every class the same while micro
|
|
averaging weighs every detection the same. They will be largely identical
|
|
when every class is balanced and has about the same number of detections.
|
|
However, in case of a class imbalance, macro averaging
|
|
favours classes with few detections whereas micro averaging benefits classes
|
|
with many detections.
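Written out for precision, with \(TP_c\) and \(FP_c\) denoting the true and false positives of class \(c\) and \(C\) the number of classes (notation introduced here for illustration), the two averages are
\[
P_{macro} = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + FP_c}, \qquad
P_{micro} = \frac{\sum_{c=1}^{C} TP_c}{\sum_{c=1}^{C} (TP_c + FP_c)} .
\]
As a made-up example, take one class with 90 true and 10 false positives (precision 0.9) and a second class with 1 true and 9 false positives (precision 0.1). Macro averaging yields \((0.9 + 0.1) / 2 = 0.5\), whereas micro averaging yields \(91 / 110 \approx 0.83\) because the class with many detections dominates.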
|
|
|
|
\subsection{Micro Averaging}
|
|
\begin{table}[ht]
|
|
\begin{tabular}{rcccc}
|
|
\hline
|
|
Forward & max & abs OSE & Recall & Precision\\
|
|
Passes & \(F_1\) Score & \multicolumn{3}{c}{at max \(F_1\) point} \\
|
|
\hline
|
|
vanilla SSD - 0.01 conf & 0.255 & 3176 & 0.214 & 0.318 \\
|
|
vanilla SSD - 0.2 conf & \textbf{0.376} & 2939 & \textbf{0.382} & 0.372 \\
|
|
SSD with Entropy test - 0.01 conf & 0.255 & 3168 & 0.214 & 0.318 \\
|
|
% entropy thresh: 2.4 for vanilla SSD is best
|
|
\hline
|
|
Bay. SSD - no DO - 0.2 conf - no NMS \; 10 & 0.209 & 2709 & 0.300 & 0.161 \\
|
|
no dropout - 0.2 conf - NMS \; 10 & 0.371 & \textbf{2335} & 0.365 & \textbf{0.378} \\
|
|
0.9 keep ratio - 0.2 conf - NMS \; 10 & 0.360 & 2595 & 0.367 & 0.353 \\
|
|
0.5 keep ratio - 0.2 conf - NMS \; 10 & 0.325 & 2759 & 0.342 & 0.311 \\
|
|
% entropy thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3
|
|
% 0.5 for Bayesian - 6, 1.4 for 7, 1.4 for 8, 1.3 for 9
|
|
\hline
|
|
\end{tabular}
|
|
\caption{Rounded results for micro averaging. SSD with Entropy test and Bayesian SSD are represented with
|
|
their best performing entropy threshold. Vanilla SSD with Entropy test performed best with an
|
|
entropy threshold of 2.4, Bayesian SSD without non-maximum suppression performed best for 1.0,
|
|
and Bayesian SSD with non-maximum suppression performed best for 1.4 as entropy
|
|
threshold.
|
|
Bayesian SSD with dropout enabled and 0.9 keep ratio performed
|
|
best for 1.4 as entropy threshold; the run with 0.5 keep ratio performed
|
|
best for 1.3 as threshold.}
|
|
\label{tab:results-micro}
|
|
\end{table}
|
|
|
|
\begin{figure}[ht]
|
|
\begin{minipage}[t]{0.48\textwidth}
|
|
\includegraphics[width=\textwidth]{ose-f1-all-micro}
|
|
\caption{Micro averaged \(F_1\) score versus open set error for each variant. Perfect performance is an \(F_1\) score of 1 and an absolute OSE of 0.}
|
|
\label{fig:ose-f1-micro}
|
|
\end{minipage}%
|
|
\hfill
|
|
\begin{minipage}[t]{0.48\textwidth}
|
|
\includegraphics[width=\textwidth]{precision-recall-all-micro}
|
|
\caption{Micro averaged precision-recall curves for each variant tested.}
|
|
\label{fig:precision-recall-micro}
|
|
\end{minipage}
|
|
\end{figure}
|
|
|
|
Vanilla SSD with a per-class confidence threshold of 0.2 performs best (see
|
|
table \ref{tab:results-micro}) with respect to the maximum \(F_1\) score
|
|
(0.376) and recall at the maximum \(F_1\) point (0.382). In comparison, neither
|
|
the vanilla SSD variant with a confidence threshold of 0.01 nor the SSD with
|
|
an entropy test can outperform the 0.2 variant. Among the vanilla SSD variants,
|
|
the 0.2 variant also has the lowest number of open set errors (2939) and the
|
|
highest precision (0.372).
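As a plausibility check of the reported values, the \(F_1\) score can be recomputed from precision \(P\) and recall \(R\), assuming the standard definition as their harmonic mean:
\[
F_1 = \frac{2 P R}{P + R}
\]
For vanilla SSD with a confidence threshold of 0.2, the rounded table values give \(2 \cdot 0.372 \cdot 0.382 / (0.372 + 0.382) \approx 0.377\), which agrees with the reported 0.376 up to rounding of the inputs.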
|
|
|
|
The comparison of the vanilla SSD variants with a confidence threshold of 0.01
|
|
shows no significant impact of an entropy test. Only the number of open set
errors is lower, and only marginally (3168 compared to 3176). The remaining
performance metrics are identical after rounding.
|
|
|
|
The results for Bayesian SSD show a massive impact of
non-maximum suppression: the maximum \(F_1\) score drops from 0.371 (with NMS) to 0.209
(without NMS), and the number of open set errors rises from 2335 to 2709.
Dropout was disabled in both cases, making them effectively
vanilla SSD runs with multiple forward passes.
|
|
|
|
With 2335 open set errors, the Bayesian SSD variant with disabled dropout and
|
|
enabled non-maximum suppression offers the best performance with respect
|
|
to open set errors. It also has the best precision (0.378) of all tested
|
|
variants. Furthermore, it provides the best performance among all variants
|
|
with multiple forward passes except for recall.
|
|
|
|
Dropout decreases the performance of the network; this can be seen
|
|
in the lower \(F_1\) scores, higher open set errors, and lower precision
|
|
values. The variant with 0.9 keep ratio outperforms all other Bayesian
|
|
variants with respect to recall (0.367). The variant with 0.5 keep
|
|
ratio has worse recall (0.342) than the variant with disabled dropout.
|
|
However, all variants with multiple forward passes have lower open set errors
|
|
than all vanilla SSD variants.
|
|
|
|
The relation of \(F_1\) score to absolute open set error can be observed
|
|
in figure \ref{fig:ose-f1-micro}. Precision-recall curves for all variants
|
|
can be seen in figure \ref{fig:precision-recall-micro}. Both vanilla SSD
|
|
variants with 0.01 confidence threshold reach much higher open set errors
|
|
and a higher recall. This behaviour is expected as more and worse predictions
|
|
are included. The Bayesian variant without non-maximum suppression was not
|
|
plotted.
|
|
All plotted variants show a similar behaviour that is in line with previously
|
|
reported figures, such as the ones in Miller et al.~\cite{Miller2018}.
|
|
|
|
\subsection{Macro Averaging}
|
|
|
|
\begin{table}[t]
|
|
\begin{tabular}{rcccc}
|
|
\hline
|
|
Forward & max & abs OSE & Recall & Precision\\
|
|
Passes & \(F_1\) Score & \multicolumn{3}{c}{at max \(F_1\) point} \\
|
|
\hline
|
|
vanilla SSD - 0.01 conf & 0.370 & 1426 & 0.328 & 0.424 \\
|
|
vanilla SSD - 0.2 conf & \textbf{0.375} & 1218 & \textbf{0.338} & 0.424 \\
|
|
SSD with Entropy test - 0.01 conf & 0.370 & 1373 & 0.329 & \textbf{0.425} \\
|
|
% entropy thresh: 1.7 for vanilla SSD is best
|
|
\hline
|
|
Bay. SSD - no DO - 0.2 conf - no NMS \; 10 & 0.226 & \textbf{809} & 0.229 & 0.224 \\
|
|
no dropout - 0.2 conf - NMS \; 10 & 0.363 & 1057 & 0.321 & 0.420 \\
|
|
0.9 keep ratio - 0.2 conf - NMS \; 10 & 0.354 & 1150 & 0.321 & 0.396 \\
|
|
0.5 keep ratio - 0.2 conf - NMS \; 10 & 0.322 & 1264 & 0.307 & 0.340 \\
|
|
% entropy thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3
|
|
% entropy thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7
|
|
% 1.7 for 8, 2.0 for 9
|
|
\hline
|
|
\end{tabular}
|
|
\caption{Rounded results for macro averaging. SSD with Entropy test and Bayesian SSD are represented with
|
|
their best performing entropy threshold. Vanilla SSD with Entropy test performed best with an
|
|
entropy threshold of 1.7, Bayesian SSD without non-maximum suppression performed best for 1.5,
|
|
and Bayesian SSD with non-maximum suppression performed best for 1.5 as entropy
|
|
threshold. Bayesian SSD with dropout enabled and 0.9 keep ratio performed
|
|
best for 1.7 as entropy threshold; the run with 0.5 keep ratio performed
|
|
best for 2.0 as threshold.}
|
|
\label{tab:results-macro}
|
|
\end{table}
|
|
|
|
\begin{figure}[ht]
|
|
\begin{minipage}[t]{0.48\textwidth}
|
|
\includegraphics[width=\textwidth]{ose-f1-all-macro}
|
|
\caption{Macro averaged \(F_1\) score versus open set error for each variant. Perfect performance is an \(F_1\) score of 1 and an absolute OSE of 0.}
|
|
\label{fig:ose-f1-macro}
|
|
\end{minipage}%
|
|
\hfill
|
|
\begin{minipage}[t]{0.48\textwidth}
|
|
\includegraphics[width=\textwidth]{precision-recall-all-macro}
|
|
\caption{Macro averaged precision-recall curves for each variant tested.}
|
|
\label{fig:precision-recall-macro}
|
|
\end{minipage}
|
|
\end{figure}
|
|
|
|
Vanilla SSD with a per-class confidence threshold of 0.2 performs best (see
|
|
table \ref{tab:results-macro}) with respect to the maximum \(F_1\) score
|
|
(0.375) and recall at the maximum \(F_1\) point (0.338). In comparison, the SSD
|
|
with an entropy test slightly outperforms the 0.2 variant with respect to
|
|
precision (0.425). Additionally, this is the best precision overall. Among
|
|
the vanilla SSD variants, the 0.2 variant also has the lowest
|
|
number of open set errors (1218).
|
|
|
|
The comparison of the vanilla SSD variants with a confidence threshold of 0.01
|
|
shows no significant impact of an entropy test. Only the number of open set
errors is slightly lower (1373 compared to 1426). The remaining performance
metrics are almost identical after rounding.
|
|
|
|
The results for Bayesian SSD show a massive impact of
non-maximum suppression: the maximum \(F_1\) score drops from 0.363 (with NMS) to 0.226
(without NMS). Dropout was disabled in both cases, making them effectively
vanilla SSD runs with multiple forward passes.
|
|
|
|
With 1057 open set errors, the Bayesian SSD variant with disabled dropout and
enabled non-maximum suppression has the fewest open set errors of all variants
that use non-maximum suppression; only the variant without non-maximum
suppression has fewer (809). It also has the best \(F_1\) score (0.363) and best
precision (0.420) of all Bayesian variants, and ties with the 0.9 keep ratio
variant on recall (0.321).
|
|
|
|
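Dropout decreases the performance of the network; this can be seen
in the lower \(F_1\) scores, higher open set errors, and lower precision
values. Recall is unchanged for the 0.9 keep ratio (0.321) and lower for the
0.5 keep ratio (0.307). All variants with multiple forward passes and
non-maximum suppression have fewer open set errors than both vanilla SSD
variants with a confidence threshold of 0.01; compared to vanilla SSD with a
confidence threshold of 0.2 (1218), only the 0.5 keep ratio variant has more (1264).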
|
|
|
|
The relation of \(F_1\) score to absolute open set error can be observed
|
|
in figure \ref{fig:ose-f1-macro}. Precision-recall curves for all variants
|
|
can be seen in figure \ref{fig:precision-recall-macro}. Both vanilla SSD
|
|
variants with 0.01 confidence threshold reach much higher open set errors
|
|
and a higher recall. This behaviour is expected as more and worse predictions
|
|
are included. The Bayesian variant without non-maximum suppression was not
|
|
plotted.
|
|
All plotted variants show a similar behaviour that is in line with previously
|
|
reported figures, such as the ones in Miller et al.~\cite{Miller2018}.
|
|
|
|
\chapter{Discussion and Outlook}
|
|
|
|
\label{chap:discussion}
|
|
|
|
First the results will be discussed, then possible future research and open
|
|
questions will be addressed.
|
|
|
|
\section*{Discussion}
|
|
|
|
The results clearly do not support the hypothesis: \textit{Dropout sampling delivers better object detection performance under open set conditions compared to object detection without it}. With the exception of open set errors, there
|
|
is no area where dropout sampling performs better than vanilla SSD. In the
|
|
remainder of the section the individual results will be interpreted.
|
|
|
|
\subsection*{Impact of averaging}
|
|
|
|
Micro and macro averaging create largely similar results. Notably, micro
|
|
averaging has a significant performance increase towards the end
|
|
of the list of predictions. This is signalled by the near-horizontal movement
|
|
of the plot in both the \(F_1\) versus absolute open set error graph (see figure \ref{fig:ose-f1-micro}) and
|
|
the precision-recall curve (see figure \ref{fig:precision-recall-micro}).
|
|
There are potentially true positive detections of one class that significantly
|
|
improve recall when compared to all detections across the classes but are
|
|
insignificant when solely compared to other detections of their own class.
|
|
|
|
Furthermore, the plotted behaviour implies that Miller et al.~\cite{Miller2018}
used macro averaging, as the distinctive behaviour of micro
averaging does not appear in their paper.
|
|
|
|
\subsection*{Impact of Entropy}
|
|
|
|
There is no visible impact of entropy thresholding on the object detection
performance for vanilla SSD. This indicates that the network produces almost no
uniform or close to uniform predictions; the vast majority of predictions
have a high confidence in one class, which can also be the background class.
As expected, however, the entropy plays a larger role for the Bayesian variants:
the best performing thresholds are 1.3 and 1.4 for micro averaging,
and 1.5, 1.7, and 2.0 for macro averaging. In all of these cases the best
threshold is not the largest threshold tested. A lower threshold likely
eliminated some false positives from the result set. On the other hand, a
threshold that is too low likely eliminated true positives as well.
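A rough calculation illustrates why confident predictions are unaffected by the threshold. It assumes the natural logarithm and \(K = 61\) outputs (60 known classes plus background); both are assumptions made here, not details confirmed by the implementation. The entropy is largest for a uniform prediction,
\[
H_{max} = \ln K = \ln 61 \approx 4.11,
\]
which lies well above the largest tested threshold of 2.4. A prediction that puts \(p = 0.99\) on one class and spreads the rest evenly over the other \(K - 1\) classes has
\[
H = -p \ln p - (1 - p) \ln \frac{1 - p}{K - 1} \approx 0.0099 + 0.0870 \approx 0.097,
\]
which passes even the smallest tested threshold of 0.1.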
|
|
|
|
\subsection*{Non-maximum suppression}
|
|
|
|
Miller et al.~\cite{Miller2018} supposedly did not use non-maximum suppression
|
|
in their implementation of dropout sampling. Therefore, a variant with disabled
|
|
non-maximum suppression (NMS) was tested. The disastrous results heavily imply
|
|
that NMS is crucial and pose serious questions about the implementation of
|
|
Miller et al., who still have not released source code.
|
|
|
|
Without NMS all detections passing the per-class confidence threshold are
|
|
directly ordered in descending order by their confidence value. Afterwards the
|
|
top \(k\) detections are kept. This enables the following scenario:
|
|
the top \(k\) detections all belong to the same class and potentially
to the same object. Detections of other classes and objects could be discarded, reducing
|
|
recall in the process. Multiple detections of the same object also increase
|
|
the number of false positives, further reducing the \(F_1\) score.
|
|
|
|
\subsection*{Dropout}
|
|
|
|
The dropout variants have largely worse performance than the Bayesian variants
|
|
without dropout. This is expected as the network was not trained with
|
|
dropout and the weights are not prepared for it.
|
|
|
|
Gal~\cite{Gal2017}
|
|
showed that networks \textbf{trained} with dropout are approximate Bayesian
|
|
models. Miller et al. never fine-tuned or properly trained SSD after
|
|
the dropout layers were inserted. Therefore, the Bayesian variant of SSD
|
|
implemented in this thesis is not guaranteed to be such an approximate
|
|
model.
|
|
|
|
These results further question the reported results of Miller et al., who
|
|
reported significantly better results of dropout sampling compared to vanilla
|
|
SSD. Admittedly, they used the network not on COCO but SceneNet RGB-D~\cite{McCormac2017}. However, they also claim that no fine-tuning
|
|
for SceneNet took place. Applying SSD to an unknown data set should result
|
|
in overall worse performance. Attempts to replicate their work on SceneNet RGB-D
failed with very poor results even for vanilla SSD; no further attempts were
made for this thesis. However, Miller et al. used
a different implementation of SSD; it is therefore possible that their
implementation worked on SceneNet without fine-tuning.
|
|
|
|
\subsection*{Sampling and Observations}
|
|
|
|
It is remarkable that the Bayesian variant with disabled dropout and
|
|
non-maximum suppression performed better than vanilla SSD with respect to
|
|
open set errors. This indicates a relevant impact of multiple forward
|
|
passes and the grouping of observations on the result. With disabled
|
|
dropout, the ten forward passes should all produce the same results,
|
|
resulting in ten identical detections for every detection in vanilla SSD.
|
|
The variation in the result can only originate from the grouping into
|
|
observations.
|
|
|
|
All detections that overlap by at least 95\% with each other
|
|
are grouped into an observation. For every ten identical detections one
|
|
observation should be the result. However, due to the 95\% overlap rather than
|
|
100\%, more than ten detections could be grouped together. This would result
|
|
in fewer overall observations compared to the number of detections
|
|
in vanilla SSD. Such a lower number reduces the chance for the network
|
|
to make mistakes.
|
|
|
|
\section*{Outlook}
|
|
|
|
The attempted replication of the work of Miller et al. raises a series of
|
|
questions that cannot be answered in this thesis. This thesis offers
|
|
one possible implementation of dropout sampling that technically works.
|
|
However, this thesis cannot answer why the results of this implementation differ significantly
from those of Miller et al. The complete source code or otherwise exhaustive
|
|
implementation details would be required to attempt an answer.
|
|
|
|
Future work could explore the performance of this implementation when used
|
|
on an SSD variant that was fine-tuned or trained with dropout. In this case, it
|
|
should also look into the impact of training with both dropout and batch
|
|
normalisation.
|
|
Other avenues include the application to other data sets or object detection
|
|
networks.
|
|
|
|
To facilitate future work based on this thesis, the source code will be
|
|
made available and an installable Python package will be uploaded to the
|
|
PyPI package index. More details about the source code implementation can
be found in the appendices.
|