% body thesis file that contains the actual content

\chapter{Introduction}

The introduction first explains the wider context before
providing technical details.

\subsection*{Motivation}

Famous examples like the automatic soap dispenser, which does not
recognise the hand of a black person but dispenses soap when presented
with a paper towel, raise the question of bias in computer
systems~\cite{Friedman1996}. Related to this ethical question regarding
the design of so-called algorithms is the question of
algorithmic accountability~\cite{Diakopoulos2014}.

Supervised neural networks learn from input-output relations and
determine on their own which connections are necessary for the task.
This feature is also their Achilles heel: it makes them effectively
black boxes and prevents any answers to questions of causality.

However, these questions of causality are of enormous consequence when
results of neural networks are used to make life-changing decisions:
Is a correlation enough to bring forth negative consequences
for a particular person? And if so, what is the possible defence
against math? Similar questions can be raised when looking at computer
vision networks that might be used together with so-called smart
CCTV cameras to discover suspicious activity.

This leads to the need for neural networks to explain their results.
Such an explanation must come from the network or an attached piece
of technology to allow mass adoption. Obviously, this setting
poses the question of how such an endeavour can be achieved.

For neural networks there are fundamentally two types of tasks:
regression and classification. Regression deals with any case
where the goal for the network is to come close to an ideal
function that connects all data points. Classification, however,
describes tasks where the network is supposed to identify the
class of any given input. In this thesis, I will work with both.

\subsection*{Object Detection in Open Set Conditions}

\begin{figure}
\centering
\includegraphics[scale=1.0]{open-set}
\caption{Open set problem: the test set contains classes that
were not present during training time.
Icons in this image have been taken from the COCO data set
website (\url{https://cocodataset.org/\#explore}) and were
vectorised afterwards. Resembles figure 1 of Miller et al.~\cite{Miller2018}.}
\label{fig:open-set}
\end{figure}

More specifically, I will look at object detection under open set
conditions (see figure \ref{fig:open-set}).
In non-technical terms, this effectively describes
the kind of situation you encounter with CCTV cameras or robots
outside of a laboratory. Both use cameras that record
images. Subsequently, a neural network analyses the image
and returns a list of detected and classified objects that it
found in the image. The problem here is that networks can only
classify what they know. If presented with an object type that
the network was not trained with, as happens frequently in real
environments, it will still classify the object and might even
have a high confidence in doing so. Such a detection is
a false positive. Anyone who uses the results of
such a network could falsely assume that a high confidence always
means the classification is very likely correct. If one uses
a proprietary system one might not even be able to find out
that the network was never trained on a particular type of object.
Therefore, it would be impossible for one to identify the output
of the network as a false positive.

This reaffirms the need for automatic explanation. Such a system
should recognise on its own that the given object is unknown and
hence mark any classification result of the network as meaningless.
Technically there are two slightly different approaches that deal
with this type of task: model uncertainty and novelty detection.

Model uncertainty can be measured, for example, with dropout sampling.
Dropout layers are usually used only during training but
Miller et al.~\cite{Miller2018} also use them during testing
to obtain different results for the same image by making use of
multiple forward passes. The output scores for the forward passes
of the same image are then averaged. If the averaged class
probabilities resemble a uniform distribution (every class has
the same probability) this signals maximum uncertainty. Conversely,
if there is one very high probability with every other being very
low this signifies a low uncertainty. An unknown object is more
likely to cause high uncertainty, which allows false positive
cases to be identified.

Novelty detection is another approach to solve the task.
In the realm of neural networks it is usually done with the help of
auto-encoders that solve a regression task of finding an
identity function that reconstructs the given input~\cite{Pimentel2014}. Auto-encoders have
internally at least two components: an encoder, and a decoder or
generator. The job of the encoder is to find an encoding that
compresses the input as well as possible while simultaneously
being as loss-free as possible. The decoder takes this latent
representation of the input and has to find a decompression
that reconstructs the input as accurately as possible. During
training these auto-encoders learn to reproduce a certain group
of object classes. The actual novelty detection takes place
during testing: given an image, and the output and loss of the
auto-encoder, a novelty score is calculated. For some novelty
detection approaches the reconstruction loss is the novelty
score; others consider more factors. A low novelty
score signals a known object. The opposite is true for a high
novelty score.
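
To make the principle concrete, the following minimal sketch shows
the reconstruction error used as a novelty score. It assumes a trained
Keras auto-encoder; the function name and the choice of the mean
squared error are illustrative and not taken from the cited works.

\begin{verbatim}
import numpy as np

def novelty_score(autoencoder, image):
    # reconstruct the input with the trained auto-encoder
    reconstruction = autoencoder.predict(image[np.newaxis, ...])[0]
    # mean squared reconstruction error over all pixels;
    # a high value suggests a novel input, a low value a known one
    return float(np.mean((image - reconstruction) ** 2))
\end{verbatim}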

\subsection*{Research Question}

Auto-encoders work well for data sets like MNIST~\cite{Deng2012}
but perform poorly on challenging real-world data sets
like MS COCO~\cite{Lin2014}, complicating any potential comparison between
them and object detection networks like \gls{SSD}.
Therefore, a comparison between model uncertainty with a network like
SSD and novelty detection with auto-encoders is considered out of scope
for this thesis.

Miller et al.~\cite{Miller2018} use an \gls{SSD} pre-trained on COCO
without further fine-tuning on the SceneNet RGB-D data
set~\cite{McCormac2017} and report good results regarding
\gls{OSE} for an \gls{SSD} variant with dropout sampling and \gls{entropy}
thresholding.
If their results are generalisable, it should be possible to replicate
the relative difference between the variants on the COCO data set.
This leads to the following hypothesis: \emph{Dropout sampling
delivers better object detection performance under open set
conditions compared to object detection without it.}

For the purpose of this thesis, I use the \gls{vanilla} \gls{SSD} (as in: the original \gls{SSD}) as
the baseline to compare against. In particular, \gls{vanilla} \gls{SSD} uses
a per class confidence threshold of 0.01, an IOU threshold of 0.45
for the \gls{NMS}, and a top \(k\) value of 200. For this
thesis, the top \(k\) value has been changed to 20, and a confidence threshold
of 0.2 has been tried as well.
The effect of an \gls{entropy} threshold is measured against this \gls{vanilla}
SSD by applying \gls{entropy} thresholds from 0.1 to 2.4 inclusive (limits taken from
Miller et al.). Dropout sampling is compared to \gls{vanilla} \gls{SSD}
with and without \gls{entropy} thresholding.

\paragraph{Hypothesis} Dropout sampling
delivers better object detection performance under open set
conditions compared to object detection without it.

\subsection*{Reader's Guide}

First, chapter \ref{chap:background} presents related works and
provides the background for dropout sampling.
Afterwards, chapter \ref{chap:methods} explains how \gls{vanilla} \gls{SSD} works, how
Bayesian \gls{SSD} extends \gls{vanilla} \gls{SSD}, and how the decoding pipelines are
structured.
Chapter \ref{chap:experiments-results} presents the data sets,
the experimental setup, and the results. This is followed by
chapter \ref{chap:discussion}, focusing on the discussion and closing.

Therefore, the contribution is found in chapters \ref{chap:methods},
\ref{chap:experiments-results}, and \ref{chap:discussion}.

\chapter{Background}

\label{chap:background}

This chapter begins with an overview of previous works
in the field of this thesis. Afterwards the theoretical foundations
of dropout sampling are explained.

\section{Related Works}

The task of novelty detection can be accomplished in a variety of ways.
Pimentel et al.~\cite{Pimentel2014} provide a review of novelty detection
methods published over the previous decade. They showcase probabilistic,
distance-based, reconstruction-based, domain-based, and information-theoretic
novelty detection. Based on their categorisation, this thesis falls under
reconstruction-based novelty detection as it deals only with neural network
approaches. Therefore, the other types of novelty detection will only be
briefly introduced.

\subsection{Overview of Types of Novelty Detection}

Probabilistic approaches estimate the generative \gls{pdf}
of the data. It is assumed that the training data is generated from an underlying
probability distribution \(D\). This distribution can be estimated with the
training data; the estimate is defined as \(\hat D\) and represents a model
of normality. A novelty threshold is applied to \(\hat D\) in a way that
allows a probabilistic interpretation. Pidhorskyi et al.~\cite{Pidhorskyi2018}
combine a probabilistic approach to novelty detection with auto-encoders.

Distance-based novelty detection uses either nearest neighbour-based approaches
(e.g. \(k\)-nearest neighbour \cite{Hautamaki2004})
or clustering-based approaches
(e.g. \(k\)-means clustering algorithm \cite{Jordan1994}).
Both methods are similar to estimating the
\gls{pdf} of the data: they use well-defined distance metrics to compute the distance
between two data points.

Domain-based novelty detection describes the boundary of the known data, rather
than the data itself. Unknown data is identified by its position relative to
the boundary. A common implementation of this idea is the support vector machine
(e.g. implemented by Song et al. \cite{Song2002}).

Information-theoretic novelty detection computes the information content
of a data set, for example, with metrics like \gls{entropy}. Such metrics assume
that novel data inside the data set significantly alters the information
content of an otherwise normal data set. First, the metrics are calculated over the
whole data set. Afterwards, a subset is identified that causes the biggest
difference in the metric when removed from the data set. This subset is considered
to consist of novel data. For example, Filippone and Sanguinetti \cite{Filippone2011} provide
a recent approach.

\subsection{Reconstruction-based Novelty Detection}

Reconstruction-based approaches use the reconstruction error in one form
or another to calculate the novelty score. These can be auto-encoders that
literally reconstruct the input, but the category also includes MLP networks
that try to reconstruct the ground truth. Pimentel et al.~\cite{Pimentel2014} differentiate
between neural network-based approaches and subspace methods. The former are
further differentiated into MLPs, Hopfield networks, autoassociative networks,
radial basis function networks, and self-organising networks.
The remainder of this section focuses on MLP-based works, with a particular
focus on the task of object detection and Bayesian networks.

Novelty detection for object detection is intricately linked with
open set conditions: the test data can contain unknown classes.
Bishop~\cite{Bishop1994} investigates the correlation between
the degree of novel input data and the reliability of network
outputs, and introduces a quantitative way to measure novelty.

The Bayesian approach provides a theoretical foundation for
modelling uncertainty \cite{Ghahramani2015}.
MacKay~\cite{MacKay1992} provides a practical Bayesian
framework for backpropagation networks. Neal~\cite{Neal1996} builds upon
the work of MacKay and explores Bayesian learning for neural networks.
However, these Bayesian neural networks do not scale well. Over the course
of time, two major Bayesian approximations were introduced: one based
on dropout and one based on batch normalisation.

Gal and Ghahramani~\cite{Gal2016} show that dropout training is a
Bayesian approximation of a Gaussian process. Subsequently, Gal~\cite{Gal2017}
shows that dropout training actually corresponds to a general approximate
Bayesian model. This means every network trained with dropout is an
approximate Bayesian model. During inference the dropout remains active;
this form of inference is called Monte Carlo Dropout (MCDO).
Miller et al.~\cite{Miller2018} build upon the work of Gal and Ghahramani: they
use MC dropout under open set conditions for object detection.
In a second paper \cite{Miller2018a}, Miller et al. continue their work and
compare merging strategies for sampling-based uncertainty techniques in
object detection.

Teye et al.~\cite{Teye2018} make the point that most modern networks have
adopted other regularisation techniques. Ioffe and Szegedy~\cite{Ioffe2015}
introduce batch normalisation, which has been adopted widely in the
meantime. Teye et al.
show how batch normalisation training is similar to dropout and can be
viewed as an approximate Bayesian inference. Estimates of the model uncertainty
can be gained with a technique named Monte Carlo Batch Normalisation (MCBN).
Consequently, this technique can be applied to any network that utilises
standard batch normalisation.
Li et al.~\cite{Li2019} investigate the problem of poor performance
when combining dropout and batch normalisation: dropout shifts the variance
of a neural unit when switching from training to testing, whereas batch
normalisation does not change the variance. This inconsistency leads to a
variance shift which can have a larger or smaller impact based on the network used.

Non-Bayesian approaches have been developed as well. Usually, they compare with
MC dropout and show better performance.
Postels et al.~\cite{Postels2019} provide a sampling-free approach for
uncertainty estimation that does not affect training and approximates the
sampling at test time. They compare it to MC dropout and find less computational
overhead with better results.
Lakshminarayanan et al.~\cite{Lakshminarayanan2017}
implement a predictive uncertainty estimation using deep ensembles.
Compared to MC dropout, it shows better results.
Geifman et al.~\cite{Geifman2018}
introduce an uncertainty estimation algorithm for non-Bayesian deep
neural classification that estimates the uncertainty of highly
confident points using earlier snapshots of the trained model and improves,
among others, the approach introduced by Lakshminarayanan et al.
Sensoy et al.~\cite{Sensoy2018} explicitly model prediction uncertainty:
a Dirichlet distribution is placed over the class probabilities. Consequently,
the predictions of a neural network are treated as subjective opinions.

In addition to the aforementioned Bayesian and non-Bayesian works,
there are some Bayesian works that do not quite fit with the rest but
are important as well. Mukhoti and Gal~\cite{Mukhoti2018}
contribute metrics to measure uncertainty for semantic
segmentation. Wu et al.~\cite{Wu2019} introduce two innovations
that turn variational Bayes into a robust tool for Bayesian
networks: first, a novel deterministic method to approximate
moments in neural networks which eliminates gradient variance, and
second, a hierarchical prior for parameters and an empirical Bayes
procedure to select prior variances.

\section{Background for Dropout Sampling}

\begin{table}
\centering
\caption{Notation for background}
\label{tab:notation}
\begin{tabular}{l|l}
symbol & meaning \\
\hline
\(\mathbf{W}\) & weights \\
\(\mathbf{T}\) & training data \\
\(\mathcal{N}(0, I)\) & Gaussian distribution \\
\(I\) & identity matrix (weights drawn independently from an identical distribution) \\
\(p(\mathbf{W}|\mathbf{T})\) & probability of weights given
training data \\
\(\mathcal{I}\) & an image \\
\(\mathbf{q} = p(y|\mathcal{I}, \mathbf{T})\) & probability
of all classes given image and training data \\
\(H(\mathbf{q})\) & \gls{entropy} over probability vector \\
\(\widetilde{\mathbf{W}}\) & weights sampled from
\(p(\mathbf{W}|\mathbf{T})\) \\
\(\mathbf{b}\) & bounding box coordinates \\
\(\mathbf{s}\) & softmax scores \\
\(\overline{\mathbf{s}}\) & averaged softmax scores \\
\(D\) & detections of one forward pass \\
\(\mathfrak{D}\) & set of all detections over multiple
forward passes \\
\(\mathcal{O}\) & observation \\
\(\overline{\mathbf{q}}\) & probability vector for
observation \\
%\(E[something]\) & expected value of something
%\(\overline{\mathbf{z}}, \mathbf{z}\) & latent space representation \\
%\(d_T, d_z\) & discriminators \\
%\(e, g\) & encoding and decoding/generating function \\
%\(J_g\) & Jacobi matrix for generating function \\
%\(\mathcal{T}\) & tangent space \\
%\(\mathbf{R}\) & training/test data changed to be on tangent space
\end{tabular}
\end{table}

This section will use the \textbf{notation} defined in table
\ref{tab:notation} on page \pageref{tab:notation}.
To understand dropout sampling, it is necessary to explain the
idea of Bayesian neural networks. They place a prior distribution
over the network weights, for example a Gaussian prior distribution:
\(\mathbf{W} \sim \mathcal{N}(0, I)\). In this example
\(\mathbf{W}\) are the weights and the identity matrix \(I\) symbolises that every
weight is drawn independently from an identical distribution. The
training of the network determines a plausible set of weights by
evaluating the probability of the weights given
the training data \(\mathbf{T}\) (the \gls{posterior}): \(p(\mathbf{W}|\mathbf{T})\).
However, this
evaluation cannot be performed in any reasonable
time. Therefore approximation techniques are
required. In those techniques the \gls{posterior} is fitted with a
simple distribution \(q^{*}_{\theta}(\mathbf{W})\). The original
and intractable problem of averaging over all weights in the network
is replaced with an optimisation task, where the parameters of the
simple distribution are optimised over~\cite{Kendall2017}.

\subsubsection*{Dropout Variational Inference}

Kendall and Gal~\cite{Kendall2017} show an approximation for
classification and recognition tasks. Dropout variational inference
is a practical approximation technique that adds dropout layers
in front of every weight layer and uses them during test
time as well to sample from the approximate \gls{posterior}. Effectively, this
results in the approximation of the class probability
\(p(y|\mathcal{I}, \mathbf{T})\) by performing \(n\) forward
passes through the network and averaging the softmax
scores \(\mathbf{s}_i\) obtained this way, given an image \(\mathcal{I}\) and the
training data \(\mathbf{T}\):
\begin{equation} \label{eq:drop-sampling}
p(y|\mathcal{I}, \mathbf{T}) = \int p(y|\mathcal{I}, \mathbf{W}) \cdot p(\mathbf{W}|\mathbf{T})d\mathbf{W} \approx \frac{1}{n} \sum_{i=1}^{n}\mathbf{s}_i
\end{equation}

With this dropout sampling technique, \(n\) model weights
\(\widetilde{\mathbf{W}}_i\) are sampled from the \gls{posterior}
\(p(\mathbf{W}|\mathbf{T})\). The class probability
\(p(y|\mathcal{I}, \mathbf{T})\) is a probability vector
\(\mathbf{q}\) over all class labels. Finally, the uncertainty
of the network with respect to the classification is given by
the \gls{entropy} \(H(\mathbf{q}) = - \sum_i q_i \cdot \log q_i\).
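
As a small worked example (my own, using the natural logarithm and
three classes): a uniform probability vector yields the maximum
\gls{entropy}, while a peaked vector yields a low one,
\begin{equation*}
H\left(\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3}\right) = \log 3 \approx 1.10, \qquad H(0.9, 0.05, 0.05) \approx 0.39.
\end{equation*}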

\subsubsection*{Dropout Sampling for Object Detection}

Miller et al.~\cite{Miller2018} apply dropout sampling to
object detection. In that case \(\mathbf{W}\) represents the
learned weights of a detection network like \gls{SSD}~\cite{Liu2016}.
Every forward pass uses a different network
\(\widetilde{\mathbf{W}}\) which is approximately sampled from
\(p(\mathbf{W}|\mathbf{T})\). Each forward pass in object
detection results in a set of detections, each consisting of bounding
box coordinates \(\mathbf{b}\) and softmax score \(\mathbf{s}\).
The detections are denoted by Miller et al. as \(D_i =
\{\mathbf{s}_i,\mathbf{b}_i\}\). The detections of all passes are put
into a large set \(\mathfrak{D} = \{D_1, ..., D_n\}\).

All detections with mutual intersection-over-union scores (IoU)
of \(0.95\) or higher are defined as an observation \(\mathcal{O}_i\).
Subsequently, the corresponding vector of class probabilities
\(\overline{\mathbf{q}}_i\) for the observation is calculated by averaging all
score vectors \(\mathbf{s}_j\) in a particular observation
\(\mathcal{O}_i\): \(\overline{\mathbf{q}}_i \approx \overline{\mathbf{s}}_i = \frac{1}{n} \sum_{j=1}^{n} \mathbf{s}_j\). The label uncertainty
of the detector for a particular observation is measured by
the \gls{entropy} \(H(\overline{\mathbf{q}}_i)\).

If \(\overline{\mathbf{q}}_i\)
resembles a uniform distribution the \gls{entropy} will be high. A uniform
distribution means that no class is more likely than another, which
is a perfect example of maximum uncertainty. Conversely, if
one class has a very high probability the \gls{entropy} will be low.

In open set conditions it can be expected that falsely generated
detections for unknown object classes have a higher label
uncertainty. A threshold on the \gls{entropy} \(H(\overline{\mathbf{q}}_i)\) can then
be used to identify and reject these false positive cases.

% \gls{SSD}: \cite{Liu2016}
% ImageNet: \cite{Deng2009}
% COCO: \cite{Lin2014}
% YCB: \cite{Xiang2017}
% SceneNet: \cite{McCormac2017}

\chapter{Methods}

\label{chap:methods}

This chapter explains the functionality of \gls{vanilla} \gls{SSD}, Bayesian \gls{SSD}, and the decoding pipelines.

\section{Vanilla SSD}

\begin{figure}
\centering
\includegraphics[scale=1.2]{vanilla-ssd}
\caption{The \gls{vanilla} \gls{SSD} network as defined by Liu et al.~\cite{Liu2016}. VGG-16 is the base network, extended with extra feature layers. These predict offsets to anchor boxes with different sizes and aspect ratios. Furthermore, they predict the
corresponding confidences.}
\label{fig:vanilla-ssd}
\end{figure}

Vanilla \gls{SSD} is based upon the VGG-16 network (see figure
\ref{fig:vanilla-ssd}) and adds extra feature layers. The entire
image (always of size 300x300) is divided up into anchor boxes. During
training, each of these boxes is mapped to a ground truth box or
background. For every anchor box both the offset to
the object and the class confidences are calculated. The output of the
\gls{SSD} network consists of the predictions with class confidences, offsets to the
anchor box, anchor box coordinates, and variance. The model loss is a
weighted sum of localisation and confidence loss. As the network
has a fixed number of anchor boxes, every forward pass creates the same
number of detections---8732 in the case of \gls{SSD} 300x300.

Notably, the object proposals are made in a single run for an
image---single shot.
Other techniques like Faster R-CNN employ region proposals
and pooling. For more detailed information on \gls{SSD}, please refer to
Liu et al.~\cite{Liu2016}.

\section{Bayesian SSD for Model Uncertainty}

Networks trained with dropout are a general approximate Bayesian model~\cite{Gal2017}. As such, they can be used for everything a true
Bayesian model could be used for. This idea is applied to \gls{SSD} by Miller et al.: two dropout layers are added to \gls{vanilla} \gls{SSD}, after the layers fc6 and fc7 respectively (see figure \ref{fig:bayesian-ssd}).

\begin{figure}
\centering
\includegraphics[scale=1.2]{bayesian-ssd}
\caption{The Bayesian \gls{SSD} network as defined by Miller et al.~\cite{Miller2018}. It adds dropout layers after the fc6
and fc7 layers.}
\label{fig:bayesian-ssd}
\end{figure}

The motivation for this is model uncertainty: for the same object on the same
image, an uncertain model will predict different classes across
multiple forward passes. This uncertainty is measured with \gls{entropy}:
every forward pass results in predictions, which are partitioned into
observations whose \gls{entropy} is subsequently calculated.
A higher \gls{entropy} indicates a more uniform distribution of confidences
whereas a lower \gls{entropy} indicates a larger confidence in one class
and very low confidences in the other classes.

\subsection{Implementation Details}

For this thesis, an \gls{SSD} implementation based on Tensorflow~\cite{Abadi2015} and
Keras\footnote{\url{https://github.com/pierluigiferrari/ssd\_keras}}
is used. It has been modified to support \gls{entropy} thresholding,
partitioning of observations, and dropout
layers in the \gls{SSD} model. Entropy thresholding takes place before
the per class confidence threshold is applied.

The Bayesian variant was not fine-tuned and operates with the same
weights that \gls{vanilla} \gls{SSD} uses as well.
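
The following minimal sketch illustrates how multiple forward passes
with active dropout can be obtained in Keras. It is my own
illustration, assuming a generic Keras model with dropout layers; it
is not code from the modified implementation.

\begin{verbatim}
import numpy as np

def mc_dropout_scores(model, image, n=10):
    # training=True keeps the dropout layers active at test time,
    # so every pass samples a different set of weights
    passes = [model(image[np.newaxis, ...], training=True).numpy()[0]
              for _ in range(n)]
    # averaging over the passes yields the dropout sampling
    # approximation of the class probabilities
    return np.mean(passes, axis=0)
\end{verbatim}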

\section{Decoding Pipelines}

The raw output of \gls{SSD} is not very useful: it contains thousands of
boxes per image. Among them are many boxes with very low confidences
or background classifications; those need to be filtered out to
get any meaningful output from the network. The process of
filtering is called decoding and is presented for the three variants
of \gls{SSD} used in this thesis.

\subsection{Vanilla SSD}

Liu et al.~\cite{Liu2016} used \gls{Caffe} for their original \gls{SSD}
implementation. The decoding process largely consists of two
phases: decoding and filtering. Decoding transforms the relative
coordinates predicted by \gls{SSD} into absolute coordinates. At this point
the shape of the output per batch is \((batch\_size, \#nr\_boxes, \#nr\_classes + 12)\). The last twelve elements are split into
the four bounding box offsets, the four anchor box coordinates, and
the four variances; there are 8732 boxes.
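
A short sketch of this split (my own illustration; the exact element
order in the implementation is an assumption based on the description
above):

\begin{verbatim}
def split_raw_output(raw, nr_classes):
    # raw has shape (batch_size, nr_boxes, nr_classes + 12)
    class_scores = raw[..., :nr_classes]
    box_offsets  = raw[..., nr_classes:nr_classes + 4]
    anchor_boxes = raw[..., nr_classes + 4:nr_classes + 8]
    variances    = raw[..., nr_classes + 8:]
    return class_scores, box_offsets, anchor_boxes, variances
\end{verbatim}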

\glslocalreset{NMS}
Filtering of these boxes is first done per class:
only the class id, the confidence of that class, and the bounding box
coordinates are kept per box. The filtering consists of
confidence thresholding and a subsequent \gls{NMS}.
All boxes that pass \gls{NMS} are added to a
per image maxima list. One box could pass the confidence threshold
for multiple classes and, hence, be present multiple times in the
maxima list for the image. Lastly, a total of \(k\) boxes with the
highest confidences are kept per image across all classes. The
original implementation uses a confidence threshold of \(0.01\), an
IOU threshold for \gls{NMS} of \(0.45\) and a top \(k\)
value of 200.
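
The per class filtering can be sketched as follows (my own
illustration; \texttt{nms} stands in for an off-the-shelf non-maximum
suppression helper returning the surviving (box, confidence) pairs
and is not part of the original code):

\begin{verbatim}
def filter_per_class(boxes, scores, conf_thresh=0.01,
                     iou_thresh=0.45, top_k=200):
    # boxes: (nr_boxes, 4), scores: (nr_boxes, nr_classes)
    maxima = []
    for class_id in range(1, scores.shape[1]):  # skip background (0)
        conf = scores[:, class_id]
        keep = conf > conf_thresh
        for box, c in nms(boxes[keep], conf[keep], iou_thresh):
            maxima.append((class_id, c, box))
    # keep the k boxes with the highest confidences across all classes
    maxima.sort(key=lambda det: det[1], reverse=True)
    return maxima[:top_k]
\end{verbatim}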

The \gls{vanilla} \gls{SSD}
per class confidence threshold and \gls{NMS} have one
weakness: even if \gls{SSD} correctly predicts all objects as the
background class with high confidence, the per class confidence
threshold of 0.01 will consider predictions with very low
confidences; as background boxes are not present in the maxima
collection, many low confidence boxes can be. Furthermore, the
same detection can be present in the maxima collection for multiple
classes. In this case, the \gls{entropy} threshold would let the detection
pass because the background class has high confidence. Subsequently,
a low per class confidence threshold does not restrict the boxes
either. Therefore, the decoding output is worse than the actual
predictions of the network.
Bayesian \gls{SSD} cannot help in this situation because the network
is not actually uncertain.

SSD was developed with closed set conditions in mind. A well-trained
network in such a situation does not have many high confidence
background detections. In an open set environment, however, background
detections are the correct behaviour for unknown classes.
In order to get useful detections out of the decoding, a higher
confidence threshold is required.

\subsection{Vanilla SSD with Entropy Thresholding}

Vanilla \gls{SSD} with \gls{entropy} thresholding adds an additional component
to the filtering already done for \gls{vanilla} \gls{SSD}. The \gls{entropy} is
calculated from all \(\#nr\_classes\) softmax scores in a prediction.
Only predictions with a low enough \gls{entropy} pass the \gls{entropy}
threshold and move on to the aforementioned per class filtering.
This excludes very uniform predictions but cannot identify
false positive or false negative cases with high confidence values.
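
A minimal sketch of this test (my own illustration; the small
constant guarding against \(\log 0\) is an implementation
assumption):

\begin{verbatim}
import numpy as np

def passes_entropy_test(softmax_scores, entropy_thresh):
    # softmax_scores: vector over all classes, incl. background
    entropy = -np.sum(softmax_scores * np.log(softmax_scores + 1e-12))
    return entropy <= entropy_thresh
\end{verbatim}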

\subsection{Bayesian SSD with Entropy Thresholding}

The speciality of Bayesian \gls{SSD} is its multiple forward passes. Based
on the information in the paper, the detections of all forward passes
are grouped per image but not by forward pass. This leads
to the following shape of the network output after all
forward passes: \((batch\_size, \#nr\_boxes \, \cdot \, \#nr\_forward\_passes, \#nr\_classes + 12)\). The size of the output
increases linearly with more forward passes.

These detections have to be decoded first. Afterwards,
all detections are thrown away which do not pass a confidence
threshold for the class with the highest prediction probability.
Additionally, all detections with a background prediction of 0.8 or higher are discarded.
The remaining detections are partitioned into observations to
further reduce the size of the output, and
to identify uncertainty. This is accomplished by calculating the
mutual IOU score of every detection with all other detections. Detections
with a mutual IOU score of 0.95 or higher are partitioned into an
observation. Next, the softmax scores and bounding box coordinates of
all detections in an observation are averaged.
There can be a different number of observations for every image, which
destroys homogeneity and prevents batch-wise calculation of the
results. Per image, the shape of the results is \((\#nr\_observations,\#nr\_classes + 4)\).
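
The partitioning can be sketched as follows (my own, greedy
illustration; \texttt{iou} stands for a standard pairwise
intersection-over-union helper, and the greedy grouping is a
simplification of the procedure described above):

\begin{verbatim}
import numpy as np

def partition_into_observations(detections, iou_thresh=0.95):
    # detections: list of (softmax_scores, box) from all passes
    observations = []
    for scores, box in detections:
        for obs in observations:
            # join an observation only if the box has a high
            # mutual IOU with every detection already in it
            if all(iou(box, other) >= iou_thresh
                   for _, other in obs):
                obs.append((scores, box))
                break
        else:
            observations.append([(scores, box)])
    # average softmax scores and box coordinates per observation
    return [(np.mean([s for s, _ in obs], axis=0),
             np.mean([b for _, b in obs], axis=0))
            for obs in observations]
\end{verbatim}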

Entropy is measured in the next step. All observations with too high
an entropy are discarded. Entropy thresholding in combination with
dropout sampling should improve the identification of false positives of
unknown classes. This is due to multiple forward passes and
the assumption that uncertainty in some objects will result
in different classifications in multiple forward passes. These
varying classifications are averaged into multiple lower confidence
values which should increase the \gls{entropy} and, hence, flag an
observation for removal.

The remainder of the filtering follows the \gls{vanilla} \gls{SSD} procedure: per class
confidence threshold, \gls{NMS}, and a top \(k\) selection
at the end.

\chapter{Experimental Setup and Results}

\label{chap:experiments-results}

This chapter explains the data sets used, how the experiments have been
set up, and what the results are.

\section{Data Sets}

This thesis uses the MS COCO~\cite{Lin2014} data set. It contains
80 classes, ranging from airplanes to toothbrushes.
The images are taken by camera from the real world, and ground truth
is provided for all images. The data set supports object detection,
keypoint detection, and panoptic segmentation (scene segmentation).

The data of any data set has to be prepared for use in a neural
network. Typical problems of data sets include, for example,
outliers and invalid bounding boxes. Before a data set can be used,
these problems need to be resolved.

For the MS COCO data set, all annotations are checked for
impossible values: bounding box height or width lower than zero,
\(x_{min}\) and \(y_{min}\) bounding box coordinates lower than zero,
\(x_{max}\) and \(y_{max}\) coordinates lower than or equal to zero, \(x_{min}\) greater than \(x_{max}\),
\(y_{min}\) greater than \(y_{max}\), image width lower than \(x_{max}\),
and image height lower than \(y_{max}\). In the last two cases the
bounding box width and height are set to (image width - \(x_{min}\)) and
(image height - \(y_{min}\)) respectively;
in the other cases the annotation is skipped.
If the bounding box width or height afterwards is
lower than or equal to zero the annotation is skipped.
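
These checks can be sketched as follows (my own illustration,
assuming COCO-style annotations with a \texttt{bbox} of
\([x_{min}, y_{min}, width, height]\)):

\begin{verbatim}
def sanitise_annotation(ann, img_width, img_height):
    x_min, y_min, width, height = ann["bbox"]
    x_max, y_max = x_min + width, y_min + height
    if (width < 0 or height < 0 or x_min < 0 or y_min < 0
            or x_max <= 0 or y_max <= 0
            or x_min > x_max or y_min > y_max):
        return None  # impossible values: skip the annotation
    # clip boxes that extend beyond the image
    if img_width < x_max:
        width = img_width - x_min
    if img_height < y_max:
        height = img_height - y_min
    if width <= 0 or height <= 0:
        return None  # degenerate after clipping: skip
    ann["bbox"] = [x_min, y_min, width, height]
    return ann
\end{verbatim}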

SSD accepts 300x300 input images; the MS COCO data set images are
resized to this resolution, and the aspect ratio is not kept in the
process. MS COCO contains landscape and portrait images with (640x480)
and (480x640) as the resolution. This leads to a uniform distortion of the
portrait and landscape images respectively. Furthermore,
the colour channels are swapped from \gls{RGB} to \gls{BGR} in order to
comply with the \gls{SSD} implementation. The \gls{BGR} requirement stems from
the usage of Open CV in \gls{SSD}: the internal channel order for
Open CV is \gls{BGR}.
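
A minimal sketch of this preprocessing (my own illustration, assuming
the image is loaded in \gls{RGB} channel order):

\begin{verbatim}
import cv2

def preprocess(image_rgb):
    # resize to the fixed SSD input resolution;
    # the aspect ratio is not kept
    image = cv2.resize(image_rgb, (300, 300))
    # swap the channel order from RGB to BGR
    return image[..., ::-1]
\end{verbatim}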

For this thesis, weights pre-trained on the sub data set trainval35k of the
COCO data set are used. These weights have been created with closed set
conditions in mind; therefore, they have been sub-sampled to create
an open set condition. To this end, the weights for the last
20 classes have been thrown away, making them effectively unknown.

All images of the minival2014 data set are used but only ground truth
belonging to the first 60 classes is loaded. The remaining 20
classes are considered "unknown" and no ground truth bounding
boxes for them are provided during the inference phase.
A total of 31,991 ground truth detections remains after this exclusion. Of these
detections, a staggering 10,988 or 34.3\% belong to the persons
class, followed by cars with 1,932 or 6\%, chairs with 1,791 or 5.6\%,
and bottles with 1,021 or 3.2\%. Together, these four classes make up
around 49.1\% of the ground truth detections. This shows a huge imbalance
between the classes in the data set.

\section{Experimental Setup}

This section explains the setup for the different conducted
experiments. Each comparison investigates one particular question.

As a baseline, \gls{vanilla} \gls{SSD} with the confidence threshold of 0.01
and a \gls{NMS} IOU threshold of 0.45 is used.
Due to the low number of objects per image in the COCO data set,
the top \(k\) value has been set to 20. Vanilla \gls{SSD} with \gls{entropy}
thresholding uses the same parameters; compared to \gls{vanilla} \gls{SSD}
without \gls{entropy} thresholding, it showcases the relevance of
entropy thresholding for \gls{vanilla} \gls{SSD}.

Vanilla \gls{SSD} with 0.2 confidence threshold is compared
to \gls{vanilla} \gls{SSD} with 0.01 confidence threshold; this comparison
investigates the effect of the per class confidence threshold
on the object detection performance.

Bayesian \gls{SSD} with 0.2 confidence threshold is compared
to \gls{vanilla} \gls{SSD} with 0.2 confidence threshold. Coupled with the
entropy threshold, this comparison reveals how uncertain the network
is. If it is very certain, the dropout sampling should have no
significant impact on the result. Furthermore, in two cases the
dropout has been turned off to isolate the impact of \gls{NMS}
on the result.

Both \gls{vanilla} \gls{SSD} with \gls{entropy} thresholding and Bayesian \gls{SSD} with
entropy thresholding are tested for \gls{entropy} thresholds ranging
from 0.1 to 2.4 inclusive as specified in Miller et al.~\cite{Miller2018}.

\section{Results}

Results in this section are presented both for micro and macro averaging.
In macro averaging, for example, the precision values of each class are added up
and then divided by the number of classes. Conversely, for micro averaging the
precision is calculated across all classes directly. Both methods have
a specific impact: macro averaging weighs every class the same while micro
averaging weighs every detection the same. They will be largely identical
when every class is balanced and has about the same number of detections.
However, in the case of a class imbalance, macro averaging
favours classes with few detections whereas micro averaging benefits classes
with many detections.
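
The difference can be sketched as follows (my own illustration,
assuming per class counts of true and false positives):

\begin{verbatim}
import numpy as np

def macro_precision(tp, fp):
    # tp, fp: arrays with one entry per class
    per_class = tp / np.maximum(tp + fp, 1)
    return per_class.mean()   # every class weighs the same

def micro_precision(tp, fp):
    # pooled over all classes: every detection weighs the same
    return tp.sum() / max(tp.sum() + fp.sum(), 1)
\end{verbatim}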

This section only presents the results. Interpretation and discussion are found
in the next chapter.

\subsection{Micro Averaging}

\begin{table}[ht]
\begin{tabular}{rcccc}
\hline
Forward & max & abs OSE & Recall & Precision\\
Passes & \(F_1\) Score & \multicolumn{3}{c}{at max \(F_1\) point} \\
\hline
\gls{vanilla} \gls{SSD} - 0.01 conf & 0.255 & 3176 & 0.214 & 0.318 \\
\gls{vanilla} \gls{SSD} - 0.2 conf & \textbf{0.376} & 2939 & \textbf{0.382} & 0.372 \\
\gls{SSD} with entropy test - 0.01 conf & 0.255 & 3168 & 0.214 & 0.318 \\
% \gls{entropy} thresh: 2.4 for \gls{vanilla} \gls{SSD} is best
\hline
Bay. \gls{SSD} - no DO - 0.2 conf - no \gls{NMS} \; 10 & 0.209 & 2709 & 0.300 & 0.161 \\
no dropout - 0.2 conf - \gls{NMS} \; 10 & 0.371 & \textbf{2335} & 0.365 & \textbf{0.378} \\
0.9 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.359 & 2584 & 0.363 & 0.357 \\
0.5 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.325 & 2759 & 0.342 & 0.311 \\
% \gls{entropy} thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3
% 0.5 for Bayesian - 6, 1.4 for 7, 1.4 for 8, 1.3 for 9
\hline
\end{tabular}
\caption{Rounded results for micro averaging. \gls{SSD} with entropy test and Bayesian \gls{SSD} are represented with
their best performing \gls{entropy} threshold with respect to \(F_1\) score. Vanilla \gls{SSD} with entropy test performed best with an
\gls{entropy} threshold of 2.4, Bayesian \gls{SSD} without \gls{NMS} performed best for 1.0,
and Bayesian \gls{SSD} with \gls{NMS} performed best for 1.4 as \gls{entropy}
threshold.
Bayesian \gls{SSD} with dropout enabled and 0.9 keep ratio performed
best for 1.4 as \gls{entropy} threshold, the run with 0.5 keep ratio performed
best for 1.3 as threshold.}
\label{tab:results-micro}
\end{table}

\begin{figure}[ht]
\begin{minipage}[t]{0.48\textwidth}
\includegraphics[width=\textwidth]{ose-f1-all-micro}
\caption{Micro averaged \(F_1\) score versus open set error for each variant. Perfect performance is an \(F_1\) score of 1 and an absolute \gls{OSE} of 0.}
\label{fig:ose-f1-micro}
\end{minipage}%
\hfill
\begin{minipage}[t]{0.48\textwidth}
\includegraphics[width=\textwidth]{precision-recall-all-micro}
\caption{Micro averaged precision-recall curves for each variant tested.}
\label{fig:precision-recall-micro}
\end{minipage}
\end{figure}

Vanilla \gls{SSD} with a per class confidence threshold of 0.2 performs best (see
table \ref{tab:results-micro}) with respect to the maximum \(F_1\) score
(0.376) and recall at the maximum \(F_1\) point (0.382). In comparison, neither
the \gls{vanilla} \gls{SSD} variant with a confidence threshold of 0.01 nor the \gls{SSD} with
an \gls{entropy} test can outperform the 0.2 variant. Among the \gls{vanilla} \gls{SSD} variants,
the 0.2 variant also has the lowest open set error (2939) and the
highest precision (0.372).

The comparison of the \gls{vanilla} \gls{SSD} variants with a confidence threshold of 0.01
shows no significant impact of an \gls{entropy} test. Only the open set error
is lower, but not significantly. The rest of the performance metrics are
identical after rounding.

Bayesian \gls{SSD} with disabled dropout and without \gls{NMS}
has the worst performance of all tested variants (\gls{vanilla} and Bayesian)
with respect to \(F_1\) score (0.209) and precision (0.161). The precision is not only the worst but also significantly lower compared to all other variants.
In comparison to all variants with 0.2 confidence threshold, it has the worst recall (0.300) as well.

With an open set error of 2335, the Bayesian \gls{SSD} variant with disabled dropout and
enabled \gls{NMS} offers the best performance with respect
to the open set error. It also has the best precision (0.378) of all tested
variants. Furthermore, it provides the best performance among all variants
with multiple forward passes.

Dropout decreases the performance of the network; this can be seen
in the lower \(F_1\) scores, a higher open set error, and lower precision
values. Both dropout variants have worse recall (0.363 and 0.342) than
the variant with disabled dropout.
However, all variants with multiple forward passes have a lower open set
error than all \gls{vanilla} \gls{SSD} variants.

The relation of \(F_1\) score to absolute open set error can be observed
in figure \ref{fig:ose-f1-micro}. Precision-recall curves for all variants
can be seen in figure \ref{fig:precision-recall-micro}. Both \gls{vanilla} \gls{SSD}
variants with 0.01 confidence threshold reach a much higher open set error
and a higher recall. This behaviour is expected as more and worse predictions
are included.
All plotted variants show a similar behaviour that is in line with previously
reported figures, such as the ones in Miller et al.~\cite{Miller2018}.

\subsection{Macro Averaging}

\begin{table}[t]
\begin{tabular}{rcccc}
\hline
Forward & max & abs OSE & Recall & Precision\\
Passes & \(F_1\) Score & \multicolumn{3}{c}{at max \(F_1\) point} \\
\hline
\gls{vanilla} \gls{SSD} - 0.01 conf & 0.370 & 1426 & 0.328 & 0.424 \\
\gls{vanilla} \gls{SSD} - 0.2 conf & \textbf{0.375} & 1218 & \textbf{0.338} & 0.424 \\
\gls{SSD} with entropy test - 0.01 conf & 0.370 & 1373 & 0.329 & \textbf{0.425} \\
% \gls{entropy} thresh: 1.7 for \gls{vanilla} \gls{SSD} is best
\hline
Bay. \gls{SSD} - no DO - 0.2 conf - no \gls{NMS} \; 10 & 0.226 & \textbf{809} & 0.229 & 0.224 \\
no dropout - 0.2 conf - \gls{NMS} \; 10 & 0.363 & 1057 & 0.321 & 0.420 \\
0.9 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.355 & 1137 & 0.320 & 0.399 \\
0.5 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.322 & 1264 & 0.307 & 0.340 \\
% \gls{entropy} thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3
% \gls{entropy} thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7
% 1.7 for 8, 2.0 for 9
\hline
\end{tabular}
\caption{Rounded results for macro averaging. \gls{SSD} with entropy test and Bayesian \gls{SSD} are represented with
their best performing \gls{entropy} threshold with respect to \(F_1\) score. Vanilla \gls{SSD} with entropy test performed best with an
\gls{entropy} threshold of 1.7, Bayesian \gls{SSD} without \gls{NMS} performed best for 1.5,
and Bayesian \gls{SSD} with \gls{NMS} performed best for 1.5 as \gls{entropy}
threshold. Bayesian \gls{SSD} with dropout enabled and 0.9 keep ratio performed
best for 1.7 as \gls{entropy} threshold, the run with 0.5 keep ratio performed
best for 2.0 as threshold.}
\label{tab:results-macro}
\end{table}

\begin{figure}[ht]
\begin{minipage}[t]{0.48\textwidth}
\includegraphics[width=\textwidth]{ose-f1-all-macro}
\caption{Macro averaged \(F_1\) score versus open set error for each variant. Perfect performance is an \(F_1\) score of 1 and an absolute \gls{OSE} of 0.}
\label{fig:ose-f1-macro}
\end{minipage}%
\hfill
\begin{minipage}[t]{0.48\textwidth}
\includegraphics[width=\textwidth]{precision-recall-all-macro}
\caption{Macro averaged precision-recall curves for each variant tested.}
\label{fig:precision-recall-macro}
\end{minipage}
\end{figure}

Vanilla \gls{SSD} with a per class confidence threshold of 0.2 performs best (see
table \ref{tab:results-macro}) with respect to the maximum \(F_1\) score
(0.375) and recall at the maximum \(F_1\) point (0.338). In comparison, the \gls{SSD}
with an \gls{entropy} test slightly outperforms the 0.2 variant with respect to
precision (0.425). Additionally, this is the best precision overall. Among
the \gls{vanilla} \gls{SSD} variants, the 0.2 variant also has the lowest
open set error (1218).

The comparison of the \gls{vanilla} \gls{SSD} variants with a confidence threshold of 0.01
shows no significant impact of an \gls{entropy} test. Only the open set error
is lower, but not significantly. The rest of the performance metrics are
almost identical after rounding.

The results for Bayesian \gls{SSD} show a significant impact of \gls{NMS} or the lack thereof: maximum \(F_1\) score of 0.363 (with NMS) to 0.226
(without NMS). Dropout was disabled in both cases, making them effectively a
\gls{vanilla} \gls{SSD} run with multiple forward passes.

With an open set error of 809, the Bayesian \gls{SSD} variant with disabled dropout and
without \gls{NMS} offers the best performance with respect
to the open set error. The variant without dropout and enabled \gls{NMS} has the best \(F_1\) score (0.363), the best
precision (0.420) and the best recall (0.321) of all Bayesian variants.

Dropout decreases the performance of the network; this can be seen
in the lower \(F_1\) scores, a higher open set error, and lower precision and
recall values. However, all variants with multiple forward passes have a lower open set error than all \gls{vanilla} \gls{SSD}
variants.

The relation of \(F_1\) score to absolute open set error can be observed
in figure \ref{fig:ose-f1-macro}. Precision-recall curves for all variants
can be seen in figure \ref{fig:precision-recall-macro}. Both \gls{vanilla} \gls{SSD}
variants with 0.01 confidence threshold reach a much higher open set error
and a higher recall. This behaviour is expected as more and worse predictions
are included.
All plotted variants show a similar behaviour that is in line with previously
reported figures, such as the ones in Miller et al.~\cite{Miller2018}.

\subsection{Class-specific Results}

As mentioned before, the data set is imbalanced with respect to its
classes: four classes make up roughly 50\% of all ground truth
detections. Therefore, it is interesting to see the performance
of the tested variants with respect to these classes: persons, cars,
chairs, and bottles. Additionally, the results of the giraffe class are
presented as these are exceptionally good, although the class makes up
only 0.7\% of the ground truth. With this share, it is below
the average of roughly 0.89\% for each of the 56 classes that make up the
second half of the ground truth.

In some cases, multiple variants have seemingly the same performance
but only one or some of them are marked bold. This is informed by
differences prior to rounding. If two or more variants are marked bold
they had the exact same performance before rounding.

\begin{table}[tbp]
\begin{tabular}{rccc}
\hline
Forward & max & Recall & Precision\\
Passes & \(F_1\) Score & \multicolumn{2}{c}{at max \(F_1\) point} \\
\hline
\gls{vanilla} \gls{SSD} - 0.01 conf & 0.460 & \textbf{0.405} & 0.532 \\
\gls{vanilla} \gls{SSD} - 0.2 conf & \textbf{0.460} & \textbf{0.405} & \textbf{0.533} \\
\gls{SSD} with entropy test - 0.01 conf & 0.460 & 0.405 & 0.532 \\
% \gls{entropy} thresh: 1.7 for \gls{vanilla} \gls{SSD} is best
\hline
Bay. \gls{SSD} - no DO - 0.2 conf - no \gls{NMS} \; 10 & 0.272 & 0.292 & 0.256 \\
no dropout - 0.2 conf - \gls{NMS} \; 10 & 0.451 & 0.403 & 0.514 \\
0.9 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.447 & 0.401 & 0.505 \\
0.5 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.410 & 0.368 & 0.465 \\
% \gls{entropy} thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3
% \gls{entropy} thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7
% 1.7 for 8, 2.0 for 9
\hline
\end{tabular}
\caption{Rounded results for the persons class. \gls{SSD} with entropy test and Bayesian \gls{SSD} are represented with
their best performing macro averaging \gls{entropy} threshold with respect to \(F_1\) score.}
\label{tab:results-persons}
\end{table}

It is clearly visible that the overall trend continues in the individual
classes (see tables \ref{tab:results-persons}, \ref{tab:results-cars}, \ref{tab:results-chairs}, \ref{tab:results-bottles}, and \ref{tab:results-giraffes}). However, the two \gls{vanilla} \gls{SSD} variants with only 0.01 confidence
threshold perform better than in the averaged results presented earlier.
Only in the chairs class does a Bayesian \gls{SSD} variant perform better (in
precision) than any of the \gls{vanilla} \gls{SSD} variants. Moreover, there are
multiple classes where two or all of the \gls{vanilla} \gls{SSD} variants perform
equally well. When compared with the macro averaged results,
giraffes and persons perform better across the board. Cars have a higher
precision than average but lower recall values for all but the Bayesian
SSD variant without \gls{NMS} and dropout. Chairs and bottles perform
worse than average.

\begin{table}[tbp]
\begin{tabular}{rccc}
\hline
Forward & max & Recall & Precision\\
Passes & \(F_1\) Score & \multicolumn{2}{c}{at max \(F_1\) point} \\
\hline
\gls{vanilla} \gls{SSD} - 0.01 conf & 0.364 & \textbf{0.305} & 0.452 \\
\gls{vanilla} \gls{SSD} - 0.2 conf & 0.363 & 0.294 & \textbf{0.476} \\
\gls{SSD} with entropy test - 0.01 conf & \textbf{0.364} & \textbf{0.305} & 0.453 \\
% \gls{entropy} thresh: 1.7 for \gls{vanilla} \gls{SSD} is best
\hline
Bay. \gls{SSD} - no DO - 0.2 conf - no \gls{NMS} \; 10 & 0.236 & 0.244 & 0.229 \\
no dropout - 0.2 conf - \gls{NMS} \; 10 & 0.336 & 0.266 & 0.460 \\
0.9 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.332 & 0.262 & 0.454 \\
0.5 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.309 & 0.264 & 0.374 \\
% \gls{entropy} thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3
% \gls{entropy} thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7
% 1.7 for 8, 2.0 for 9
\hline
\end{tabular}
\caption{Rounded results for the cars class. \gls{SSD} with entropy test and Bayesian \gls{SSD} are represented with
their best performing macro averaging \gls{entropy} threshold with respect to \(F_1\) score.}
\label{tab:results-cars}
\end{table}

\begin{table}[tbp]
\begin{tabular}{rccc}
\hline
Forward & max & Recall & Precision\\
Passes & \(F_1\) Score & \multicolumn{2}{c}{at max \(F_1\) point} \\
\hline
\gls{vanilla} \gls{SSD} - 0.01 conf & 0.287 & \textbf{0.251} & 0.335 \\
\gls{vanilla} \gls{SSD} - 0.2 conf & 0.283 & 0.242 & 0.341 \\
\gls{SSD} with entropy test - 0.01 conf & \textbf{0.288} & \textbf{0.251} & 0.338 \\
% \gls{entropy} thresh: 1.7 for \gls{vanilla} \gls{SSD} is best
\hline
Bay. \gls{SSD} - no DO - 0.2 conf - no \gls{NMS} \; 10 & 0.172 & 0.168 & 0.178 \\
no dropout - 0.2 conf - \gls{NMS} \; 10 & 0.280 & 0.229 & \textbf{0.360} \\
0.9 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.274 & 0.228 & 0.343 \\
0.5 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.240 & 0.220 & 0.265 \\
% \gls{entropy} thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3
% \gls{entropy} thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7
% 1.7 for 8, 2.0 for 9
\hline
\end{tabular}
\caption{Rounded results for the chairs class. \gls{SSD} with entropy test and Bayesian \gls{SSD} are represented with
their best performing macro averaging \gls{entropy} threshold with respect to \(F_1\) score.}
\label{tab:results-chairs}
\end{table}

\begin{table}[tbp]
\begin{tabular}{rccc}
\hline
Forward & max & Recall & Precision\\
Passes & \(F_1\) Score & \multicolumn{2}{c}{at max \(F_1\) point} \\
\hline
\gls{vanilla} \gls{SSD} - 0.01 conf & 0.233 & \textbf{0.175} & 0.348 \\
\gls{vanilla} \gls{SSD} - 0.2 conf & 0.231 & 0.173 & \textbf{0.350} \\
\gls{SSD} with entropy test - 0.01 conf & \textbf{0.233} & \textbf{0.175} & 0.350 \\
% \gls{entropy} thresh: 1.7 for \gls{vanilla} \gls{SSD} is best
\hline
Bay. \gls{SSD} - no DO - 0.2 conf - no \gls{NMS} \; 10 & 0.160 & 0.140 & 0.188 \\
no dropout - 0.2 conf - \gls{NMS} \; 10 & 0.224 & 0.170 & 0.328 \\
0.9 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.220 & 0.170 & 0.311 \\
0.5 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.202 & 0.172 & 0.245 \\
% \gls{entropy} thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3
% \gls{entropy} thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7
% 1.7 for 8, 2.0 for 9
\hline
\end{tabular}
\caption{Rounded results for the bottles class. \gls{SSD} with entropy test and Bayesian \gls{SSD} are represented with
their best performing macro averaging \gls{entropy} threshold with respect to \(F_1\) score.}
\label{tab:results-bottles}
\end{table}

\begin{table}[tbp]
\begin{tabular}{rccc}
\hline
Forward & max & Recall & Precision\\
Passes & \(F_1\) Score & \multicolumn{2}{c}{at max \(F_1\) point} \\
\hline
\gls{vanilla} \gls{SSD} - 0.01 conf & \textbf{0.650} & \textbf{0.647} & \textbf{0.655} \\
\gls{vanilla} \gls{SSD} - 0.2 conf & \textbf{0.650} & \textbf{0.647} & \textbf{0.655} \\
\gls{SSD} with entropy test - 0.01 conf & \textbf{0.650} & \textbf{0.647} & \textbf{0.655} \\
% \gls{entropy} thresh: 1.7 for \gls{vanilla} \gls{SSD} is best
\hline
Bay. \gls{SSD} - no DO - 0.2 conf - no \gls{NMS} \; 10 & 0.415 & 0.414 & 0.417 \\
no dropout - 0.2 conf - \gls{NMS} \; 10 & 0.647 & 0.642 & 0.654 \\
0.9 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.637 & 0.634 & 0.642 \\
0.5 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.586 & 0.578 & 0.596 \\
% \gls{entropy} thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3
% \gls{entropy} thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7
% 1.7 for 8, 2.0 for 9
\hline
\end{tabular}
\caption{Rounded results for the giraffe class. \gls{SSD} with entropy test and Bayesian \gls{SSD} are represented with
their best performing macro averaging \gls{entropy} threshold with respect to \(F_1\) score.}
\label{tab:results-giraffes}
\end{table}

\subsection{Qualitative Analysis}

This subsection compares \gls{vanilla} \gls{SSD}
with Bayesian \gls{SSD} with respect to specific images that illustrate
similarities and differences between both approaches. For this
comparison, a 0.2 confidence threshold is applied. Furthermore, the
compared Bayesian SSD variant uses \gls{NMS} and dropout with a 0.9 keep
ratio.
\begin{figure}
\begin{minipage}[t]{0.48\textwidth}
\includegraphics[width=\textwidth]{COCO_val2014_000000336587_bboxes_vanilla}
\caption{Image with stop sign and truck at right edge. Ground truth in blue, predictions in red with confidence values rounded to three digits. Predictions are from \gls{vanilla} \gls{SSD}.}
\label{fig:stop-sign-truck-vanilla}
\end{minipage}%
\hfill
\begin{minipage}[t]{0.48\textwidth}
\includegraphics[width=\textwidth]{COCO_val2014_000000336587_bboxes_bayesian}
\caption{Image with stop sign and truck at right edge. Ground truth in blue, predictions in red with confidence values rounded to three digits. Predictions are from Bayesian \gls{SSD} with 0.9 keep ratio.}
\label{fig:stop-sign-truck-bayesian}
\end{minipage}
\end{figure}

The ground truth only contains a stop sign and a truck. The differences between \gls{vanilla} \gls{SSD} and Bayesian \gls{SSD} are barely visible
(see figures \ref{fig:stop-sign-truck-vanilla} and \ref{fig:stop-sign-truck-bayesian}): the truck is detected by neither \gls{vanilla} nor Bayesian \gls{SSD}; instead, both detected a potted plant and a traffic light. The stop sign is detected by both variants.
This behaviour implies problems with detecting objects at the image border
that overwhelmingly lie outside the image frame. Furthermore, the predictions of both variants are usually identical.

\begin{figure}
\begin{minipage}[t]{0.48\textwidth}
\includegraphics[width=\textwidth]{COCO_val2014_000000403817_bboxes_vanilla}
\caption{Image with a cat and laptop/TV. Ground truth in blue, predictions in red with confidence values rounded to three digits. Predictions are from \gls{vanilla} \gls{SSD}.}
\label{fig:cat-laptop-vanilla}
\end{minipage}%
\hfill
\begin{minipage}[t]{0.48\textwidth}
\includegraphics[width=\textwidth]{COCO_val2014_000000403817_bboxes_bayesian}
\caption{Image with a cat and laptop/TV. Ground truth in blue, predictions in red with confidence values rounded to three digits. Predictions are from Bayesian \gls{SSD} with 0.9 keep ratio.}
\label{fig:cat-laptop-bayesian}
\end{minipage}
\end{figure}

Another example (see figures \ref{fig:cat-laptop-vanilla} and \ref{fig:cat-laptop-bayesian}) is a cat with a laptop/TV in the background on the right
side. Both variants detect the cat but the \gls{vanilla} variant detects a dog as well. The laptop and TV are not detected, but this is expected since
these classes were not part of the training.

\chapter{Discussion and Outlook}

\label{chap:discussion}

First the results are discussed, then possible future research and open
questions are addressed.

\section*{Discussion}

The results clearly do not support the hypothesis: \textit{Dropout sampling delivers better object detection performance under open set conditions compared to object detection without it}. With the exception of the open set error, there
is no area where dropout sampling performs better than \gls{vanilla} \gls{SSD}. In the
remainder of this section, the individual results are interpreted.

\subsection*{Impact of Averaging}

Micro and macro averaging create largely similar results. Notably, micro
averaging shows a noticeable increase in recall, and with it the \(F_1\) score,
towards the end of the list of predictions. This is signalled by the near horizontal movement
of the plot in both the \(F_1\) versus absolute open set error graph (see figure \ref{fig:ose-f1-micro}) and
the precision-recall curve (see figure \ref{fig:precision-recall-micro}).

This behaviour is caused by a large imbalance of detections between
the classes. For \gls{vanilla} \gls{SSD} with a 0.2 confidence threshold there are
a total of 36,863 detections after \gls{NMS} and top \(k\).
The persons class contributes 14,640 detections or around 40\% to that number. Another strong class is cars with 2,252 detections or around
6\%. In third place come chairs with 1,352 detections or around 4\%. This means that these three classes together have roughly as many detections
as the remaining 57 classes combined.

In macro averaging, the cumulative precision and recall values are
calculated per class and then averaged across all classes. Smaller
classes quickly reach high recall values as their total number of
ground truth objects is small as well. The last recall and precision value
of the smaller classes is repeated to achieve homogeneity with the largest
class. As a consequence, the average recall is quite high early on. Later on, only the values of the largest class still change, which has only
a small impact on the overall result.

Conversely, in micro averaging the cumulative true positives
are added up across classes and then divided by the total number of
ground truth objects. Here, the effect is the opposite: the total number of
ground truth objects is very large, which means the combined true positives
of the 57 smaller classes have only a small impact on the average recall.
As a result, the open set error rises more quickly than the \(F_1\) score,
creating the sharp rise of the open set error at a lower
\(F_1\) score than in macro averaging. The open set error
reaches a high value early on and changes little afterwards. This allows
the \(F_1\) score to catch up and produces the almost horizontal line
in the graph. Eventually, the \(F_1\) score decreases again while the
open set error continues to rise slightly.
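
For reference, the two averaging schemes can be summarised compactly. With
\(C\) classes and cumulative true positives \(TP_c\), false positives
\(FP_c\), and ground truth counts \(GT_c\) per class \(c\) (this notation is
introduced here only for this summary), the micro and macro averaged metrics
are:
\[
P_{\text{micro}} = \frac{\sum_{c} TP_c}{\sum_{c} (TP_c + FP_c)}, \qquad
R_{\text{micro}} = \frac{\sum_{c} TP_c}{\sum_{c} GT_c},
\]
\[
P_{\text{macro}} = \frac{1}{C} \sum_{c} \frac{TP_c}{TP_c + FP_c}, \qquad
R_{\text{macro}} = \frac{1}{C} \sum_{c} \frac{TP_c}{GT_c},
\]
with \(F_1 = 2 \cdot \frac{P \cdot R}{P + R}\) in both cases. The formulas
make the described effects visible: in the macro case every class contributes
with equal weight \(1/C\), while in the micro case every detection contributes
with equal weight.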

\subsection*{Impact of Entropy}

There is no visible impact of \gls{entropy} thresholding on the object detection
performance for \gls{vanilla} \gls{SSD}. This indicates that the network has almost no
uniform or close to uniform predictions; the vast majority of predictions
have a high confidence in one class---including the background.
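
As a reminder, the \gls{entropy} of a prediction with class probabilities
\(p_1, \dots, p_K\) (including the background class) is
\[
H = - \sum_{k=1}^{K} p_k \log p_k ,
\]
which reaches its maximum of \(\log K\) for a uniform distribution and
approaches zero when almost all probability mass falls on a single class.
A prediction passes the entropy test only if its entropy lies below the
chosen threshold, which filters out the most uncertain predictions.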

However, the \gls{entropy} plays a larger role for the Bayesian variants---as
expected: the best performing thresholds are 1.0, 1.3, and 1.4 for micro averaging,
and 1.5, 1.7, and 2.0 for macro averaging. In all of these cases the best
threshold is not the largest threshold tested.

This is caused by a simple phenomenon: at some point most or all true
positives are included and a higher \gls{entropy} threshold only adds more false
positives. Such a behaviour is indicated by a stagnating recall for the
higher \gls{entropy} levels. For the low \gls{entropy} thresholds, the low recall
dominates the \(F_1\) score; the sweet spot is somewhere in the
middle. For macro averaging, it holds that a higher optimal \gls{entropy}
threshold indicates a worse performance.

\subsection*{Non-Maximum Suppression and Top \(k\)}

\begin{table}[htbp]
\centering
\begin{tabular}{rccc}
\hline
variant & before & after & after \\
 & \gls{entropy}/\gls{NMS} & \gls{entropy}/\gls{NMS} & top \(k\) \\
\hline
Bay. \gls{SSD}, no dropout, no \gls{NMS} & 155,251 & 122,868 & 72,207 \\
no dropout, \gls{NMS} & 155,250 & 36,061 & 33,827 \\
\hline
\end{tabular}

\caption{Comparison of Bayesian \gls{SSD} variants without dropout with
respect to the number of detections before the \gls{entropy} threshold,
after it and/or \gls{NMS}, and after top \(k\). An
\gls{entropy} threshold of 1.5 was used for both variants.}
\label{tab:effect-nms}
\end{table}

Miller et al.~\cite{Miller2018} presumably do not use \gls{NMS}
in their implementation of dropout sampling. Therefore, a variant with \glslocalreset{NMS}
\gls{NMS} disabled has been tested. The results are largely as expected:
\gls{NMS} removes all non-maximum detections that overlap
with a maximum one. This reduces the number of multiple detections per
ground truth bounding box and therefore the number of false positives. Without it,
many more false positives remain and have a negative impact on precision.
In combination with top \(k\) selection, recall can be affected as well:
duplicate detections could stay while maximum boxes could be removed.
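
A minimal sketch of this interplay is shown below. It is illustrative only:
the \gls{IOU} threshold of 0.45 and \(k = 200\) are common values for
\gls{SSD}-style pipelines, not necessarily the exact values used in this
implementation.
\begin{verbatim}
import numpy as np

def iou(box_a, box_b):
    # intersection over union of two [x1, y1, x2, y2] boxes
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def nms_then_top_k(boxes, scores, iou_threshold=0.45, k=200):
    # greedy NMS: keep the highest-scoring box, drop overlapping ones
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) < iou_threshold]
    # top k keeps only the k highest-scoring survivors; without NMS,
    # duplicates can crowd out the maximum boxes of other objects here
    return keep[:k]
\end{verbatim}
Without the suppression step, the final truncation to \(k\) detections
operates on a list that is still full of duplicates, which is exactly the
recall effect described above.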

The number of observations has been measured before and after the combination of \gls{entropy} threshold and \gls{NMS} filter: both Bayesian \gls{SSD} without
\gls{NMS} and dropout, and Bayesian \gls{SSD} with \gls{NMS} and disabled dropout
have the same number of observations before the \gls{entropy} threshold. After the \gls{entropy} threshold (the value 1.5 has been used for both) and \gls{NMS}, the variant with \gls{NMS} has roughly 23\% of its observations left
(see table \ref{tab:effect-nms} for absolute numbers).
Without \gls{NMS}, 79\% of the observations are left. Irrespective of the absolute
numbers, this discrepancy clearly shows the impact of \gls{NMS} and also explains the higher count of false positives:
more than 50\% of the original observations are removed with \gls{NMS} but
remain without it---all of these are very likely false positives.

A clear distinction between micro and macro averaging can be observed:
recall is hardly affected with micro averaging (0.300) but drops noticeably with macro averaging (0.229). For micro averaging, it does
not matter which class the true positives belong to: every detection
counts the same way. This also means that top \(k\) has only
a marginal effect: some true positives might be removed without \gls{NMS} but overall that does not have a big impact. With macro averaging, however,
the class of the true positives matters a lot: for example, if two
true positives are removed from a class with only a few true positives
to begin with, then their removal will have a drastic influence on
the class recall value and hence the overall result. To illustrate, losing two of three true positives in a class with three ground truth objects drops that class recall from 1.0 to 0.33, which moves the macro average far more than two detections ever could in micro averaging.

The impact of top \(k\) has been measured by counting the number of observations
after top \(k\) is applied: the variant with \gls{NMS} keeps about 94\%
of the observations left after \gls{NMS}, while without \gls{NMS} only about 59\% of the observations
are kept. This shows a significant impact of top \(k\) on the result
in the case of disabled \gls{NMS}. Furthermore, some
classes are hit harder by top \(k\) than others: for example,
dogs keep around 82\% of their observations but persons only 57\%.
This indicates that detected dogs are mostly on images with few detections
overall and/or have a high enough prediction confidence to be
kept by top \(k\). Persons, however, are likely often on images
with many detections and/or have confidences that are too low.
In this example, the likelihood for true positives to be removed in
the person category is quite high. For dogs, the probability is far lower.
This is a good example of the difference between micro and macro averaging and its impact on
recall.

\subsection*{Dropout Sampling and Observations}

\begin{table}[htbp]
\centering
\begin{tabular}{rcc}
\hline
variant & after & after \\
 & prediction & observation grouping \\
\hline
Bay. \gls{SSD}, no dropout, \gls{NMS} & 1,677,050 & 155,250 \\
keep rate 0.9, \gls{NMS} & 1,617,675 & 549,166 \\
\hline
\end{tabular}

\caption{Comparison of the Bayesian \gls{SSD} variants without dropout and with
dropout (0.9 keep ratio) with
respect to the number of detections directly after the network
prediction and after the grouping into observations.}
\label{tab:effect-dropout}
\end{table}

The dropout variants have a largely worse performance than the Bayesian variants
without dropout. This is expected as the network was not trained with
dropout and the weights are not prepared for it.

Gal~\cite{Gal2017}
shows that networks \emph{trained} with dropout are approximate Bayesian
models. The Bayesian variants of \gls{SSD} implemented for this thesis are not fine-tuned or trained with dropout; therefore, they are not guaranteed to be such approximate models.

But dropout alone does not explain the difference in results. Both variants
with and without dropout have the exact same number of detections coming
out of the network (8,732 per image per forward pass). With 16 images in a batch,
308 batches, and 10 forward passes, the total number of detections is
an astounding \(8{,}732 \cdot 16 \cdot 308 \cdot 10 = 430{,}312{,}960\). As such a large number cannot be
handled in memory, only one batch is calculated at a time. That
still leaves 1,397,120 detections per batch. These have to be grouped into
observations, which requires a calculation of mutual IOU scores that is quadratic in the number of detections.
Therefore, the detections are first filtered by removing all those with a background
confidence of 0.8 or higher.
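
The grouping step itself can be sketched as follows. The 0.8 background filter
and the 0.95 mutual IOU threshold come from the text; comparing each detection
only against the first member of a group is a simplification of the mutual IOU
criterion, the background index is an assumption, and \texttt{iou} is the
pairwise function sketched in the previous subsection:
\begin{verbatim}
def group_into_observations(detections, background_index=0,
                            background_threshold=0.8,
                            iou_threshold=0.95):
    # detections: list of (box, class_probabilities) tuples collected
    # over all forward passes of one batch
    candidates = [d for d in detections
                  if d[1][background_index] < background_threshold]
    observations = []  # each observation is a list of detections
    for detection in candidates:
        for group in observations:
            # simplified: compare against the first detection of the
            # group instead of checking the IOU with all of its members
            if iou(detection[0], group[0][0]) >= iou_threshold:
                group.append(detection)
                break
        else:
            observations.append([detection])
    return observations
\end{verbatim}
Each group is subsequently reduced to a single observation, for example by
averaging its members.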

The number of detections per class has been measured before and after the
detections are grouped into observations. To this end, the stored predictions
are unbatched and summed together. After the aforementioned filter
and before the grouping, roughly 0.4\% (in fact slightly less than that) of the
more than 430 million detections remain (see table \ref{tab:effect-dropout} for absolute numbers). The variant with dropout
has slightly fewer predictions left compared to the one without dropout.

After the grouping, the variant without dropout has on average between
10 and 11 detections grouped into an observation (\(1{,}677{,}050 / 155{,}250 \approx 10.8\)). This is expected as every
forward pass creates the exact same result, and these ten identical detections
per \gls{vanilla} \gls{SSD} detection perfectly overlap. The fact that slightly more than
ten detections are grouped together could explain the marginally better precision
of the Bayesian variant without dropout compared to \gls{vanilla} \gls{SSD}.
However, on average only about three detections are grouped together into an
observation if dropout with a 0.9 keep ratio is enabled (\(1{,}617{,}675 / 549{,}166 \approx 2.9\)). This does not
negatively impact recall, as true positives do not disappear, but it offers
a higher chance of false positives. This can be observed in the results, which
clearly show no negative impact on recall between the variants without
dropout and with a 0.9 keep ratio.

This behaviour implies that even a slight usage of dropout creates such
diverging anchor box offsets that the resulting detections from multiple
forward passes no longer have a mutual IOU score of 0.95 or higher.

\section*{Outlook}

The attempted replication of the work of Miller et al.~\cite{Miller2018} raises a series of
questions that cannot be answered in this thesis. This thesis offers
one possible implementation of dropout sampling that technically works.
However, this thesis cannot answer why this implementation differs significantly
from the one of Miller et al. The complete source code or otherwise exhaustive
implementation details of Miller et al. would be required to attempt an answer.

Future work could explore the performance of this implementation when used
on an \gls{SSD} variant that was fine-tuned or trained with dropout. In this case, it
should also look into the impact of training with both dropout and batch
normalisation.
Other avenues include the application to other data sets or other object detection
networks.

To facilitate future work based on this thesis, the source code will be
made available and an installable Python package will be uploaded to the
PyPI package index. More details about the
source code implementation can be found in the appendices.