% body thesis file that contains the actual content
\chapter{Introduction}

The introduction will explain the wider context first, before
providing technical details.

\subsection*{Motivation}

Famous examples like the automatic soap dispenser which does not
recognize the hand of a black person but dispenses soap when presented
with a paper towel raise the question of bias in computer
systems~\cite{Friedman1996}. Related to this ethical question regarding
the design of so-called algorithms is the question of
algorithmic accountability~\cite{Diakopoulos2014}.

Supervised neural networks learn from input-output relations and
figure out by themselves what connections are necessary for that.
This feature is also their Achilles heel: it makes them effectively
black boxes and prevents any answers to questions of causality.

However, these questions of causality are of enormous consequence when
results of neural networks are used to make life-changing decisions:
Is a correlation enough to bring forth negative consequences
for a particular person? And if so, what is the possible defence
against math? Similar questions can be raised when looking at computer
vision networks that might be used together with so-called smart
CCTV cameras to discover suspicious activity.

This leads to the need for neural networks to explain their results.
Such an explanation must come from the network or an attached piece
of technology to allow mass adoption. This raises
the question of how such an explanation can be achieved.

For neural networks there are fundamentally two types of tasks:
regression and classification. Regression deals with any case
where the goal for the network is to come close to an ideal
function that connects all data points. Classification, however,
describes tasks where the network is supposed to identify the
class of any given input. In this thesis, I will work with both.

\subsection*{Object Detection in Open Set Conditions}

\begin{figure}
\centering
\includegraphics[scale=1.0]{open-set}
\caption{Open set problem: The test set contains classes that
were not present during training time.
Icons in this image have been taken from the COCO data set
website (\url{https://cocodataset.org/\#explore}) and were
vectorized afterwards. Resembles figure 1 of Miller et al.~\cite{Miller2018}.}
\label{fig:open-set}
\end{figure}

More specifically, I will look at object detection under open set
conditions (see figure~\ref{fig:open-set}).
In non-technical terms this describes
the kind of situation you encounter with CCTV cameras or robots
outside of a laboratory. Both use cameras that record
images. Subsequently, a neural network analyses the image
and returns a list of detected and classified objects that it
found in the image. The problem here is that networks can only
classify what they know. If presented with an object type that
the network was not trained with, as happens frequently in real
environments, it will still classify the object and might even
have a high confidence in doing so. Such a detection would be
a false positive. Any ordinary person who uses the results of
such a network would falsely assume that a high confidence always
means the classification is very likely correct. If they use
a proprietary system they might not even be able to find out
that the network was never trained on a particular type of object.
Therefore, it would be impossible for them to identify the output
of the network as a false positive.

This goes back to the need for automatic explanation. Such a system
should by itself recognize that the given object is unknown and
hence mark any classification result of the network as meaningless.
Technically there are two slightly different approaches that deal
with this type of task: model uncertainty and novelty detection.

Model uncertainty can be measured with dropout sampling.
Dropout is usually used only during training but
Miller et al.~\cite{Miller2018} also use it during testing
to obtain different results for the same image across
multiple forward passes. The output scores of the forward passes
for the same image are then averaged. If the averaged class
probabilities resemble a uniform distribution (every class has
the same probability), this indicates maximum uncertainty. Conversely,
if there is one very high probability with every other being very
low, this signifies low uncertainty. An unknown object is more
likely to cause high uncertainty, which allows for an identification
of false positive cases.

Novelty detection is another approach to solve the task.
In the realm of neural networks it is usually done with the help of
auto-encoders that solve a regression task of finding an
identity function that reconstructs the given input~\cite{Pimentel2014}. Auto-encoders have
internally at least two components: an encoder, and a decoder or
generator. The job of the encoder is to find an encoding that
compresses the input as well as possible while simultaneously
being as loss-free as possible. The decoder takes this latent
representation of the input and has to find a decompression
that reconstructs the input as accurately as possible. During
training these auto-encoders learn to reproduce a certain group
of object classes. The actual novelty detection takes place
during testing: given an image, and the output and loss of the
auto-encoder, a novelty score is calculated. For some novelty
detection approaches the reconstruction loss is exactly the novelty
score; others consider more factors. A low novelty
score signals a known object. The opposite is true for a high
novelty score.

\subsection*{Research Question}

Auto-encoders work well for data sets like MNIST~\cite{Deng2012}
but perform poorly on challenging real-world data sets
like MS COCO~\cite{Lin2014}. Therefore, a comparison between
model uncertainty and novelty detection is considered out of
scope for this thesis.

Miller et al.~\cite{Miller2018} used an SSD pre-trained on COCO
without further fine-tuning on the SceneNet RGB-D data
set~\cite{McCormac2017} and reported good results regarding
open set error for an SSD variant with dropout sampling and entropy
thresholding.
If their results are generalizable, it should be possible to replicate
the relative difference between the variants on the COCO data set.
This leads to the following hypothesis: \emph{Dropout sampling
delivers better object detection performance under open set
conditions compared to object detection without it.}

For the purpose of this thesis, I will use the vanilla SSD as
baseline to compare against. In particular, vanilla SSD uses
a per-class confidence threshold of 0.01, an IoU threshold of 0.45
for the non-maximum suppression, and a top \(k\) value of 200.
The effect of an entropy threshold is measured against this vanilla
SSD by applying entropy thresholds from 0.1 to 2.4 (limits taken from
Miller et al.). Dropout sampling is compared to vanilla SSD, both
with and without entropy thresholding. The number of forward
passes is varied to identify its impact.

\paragraph{Hypothesis} Dropout sampling
delivers better object detection performance under open set
conditions compared to object detection without it.

\paragraph{Contribution}
The contribution of this thesis is a comparison between dropout
sampling and auto-encoding with respect to the overall performance
of both for object detection under open set conditions using
the SSD network for object detection and the SceneNet RGB-D data set
with MS COCO classes.

\subsection*{Reader's guide}

First, chapter~\ref{chap:background} presents related works and
provides the background for dropout sampling, a.k.a.\ Bayesian SSD.
Afterwards, chapter~\ref{chap:methods} explains how the Bayesian SSD
works, and provides details about the software and source code design.
Chapter~\ref{chap:experiments-results} presents the data sets,
the experimental setup, and the results. This is followed by
chapters~\ref{chap:discussion} and~\ref{chap:closing}, focusing on
the discussion and closing respectively.

The contribution is found in chapters~\ref{chap:methods},
\ref{chap:experiments-results}, and~\ref{chap:discussion}.

\chapter{Background}

\label{chap:background}

This chapter will begin with an overview of previous work
in the field of this thesis. Afterwards, the theoretical foundations
of the work of Miller et al.~\cite{Miller2018} will
be explained.

\section{Related Works}

Novelty detection for object detection is intricately linked with
open set conditions: the test data can contain unknown classes.
Bishop~\cite{Bishop1994} investigates the correlation between
the degree of novel input data and the reliability of network
outputs. Pimentel et al.~\cite{Pimentel2014} provide a review
of novelty detection methods published over the previous decade.

There are two primary pathways that deal with novelty: novelty
detection using auto-encoders and uncertainty estimation with
Bayesian networks.

Japkowicz et al.~\cite{Japkowicz1995} introduce a novelty detection
method based on the hippocampus model of Gluck and Myers~\cite{Gluck1993}
and use an auto-encoder to recognize novel instances.
Thompson et al.~\cite{Thompson2002} show that auto-encoders
can learn ``normal'' system behaviour implicitly.
Goodfellow et al.~\cite{Goodfellow2014} introduce adversarial
networks: a generator that attempts to trick the discriminator
by generating samples indistinguishable from the real data.
Makhzani et al.~\cite{Makhzani2015} build on the work of Goodfellow
and propose adversarial auto-encoders. Richter and
Roy~\cite{Richter2017} use an auto-encoder to detect novelty.

Wang et al.~\cite{Wang2018} build upon Goodfellow's work and
use a generative adversarial network for novelty detection.
Sabokrou et al.~\cite{Sabokrou2018} implement an end-to-end
architecture for one-class classification: it consists of two
deep networks, with one being the novelty detector and the other
enhancing inliers and distorting outliers.
Pidhorskyi et al.~\cite{Pidhorskyi2018} take a probabilistic approach
and compute how likely it is that a sample is generated by the
inlier distribution.

Kendall and Gal~\cite{Kendall2017} provide a Bayesian deep learning
framework that combines input-dependent
aleatoric\footnote{captures noise inherent in observations}
uncertainty with epistemic\footnote{uncertainty in the model}
uncertainty. Lakshminarayanan et al.~\cite{Lakshminarayanan2017}
implement a predictive uncertainty estimation using deep ensembles
rather than Bayesian networks. Geifman et al.~\cite{Geifman2018}
introduce an uncertainty estimation algorithm for non-Bayesian deep
neural classification that estimates the uncertainty of highly
confident points using earlier snapshots of the trained model.
Miller et al.~\cite{Miller2018a} compare merging strategies
for sampling-based uncertainty techniques in object detection.
Sensoy et al.~\cite{Sensoy2018} treat prediction confidence
as subjective opinions: they place a Dirichlet distribution on it.
The trained predictor for a multi-class classification is also a
Dirichlet distribution.

Gal and Ghahramani~\cite{Gal2016} show how dropout can be used
as a Bayesian approximation. Miller et al.~\cite{Miller2018}
build upon the work of Miller et al.~\cite{Miller2018a} and
Gal and Ghahramani: they use dropout sampling under open set
conditions for object detection. Mukhoti and Gal~\cite{Mukhoti2018}
contribute metrics to measure uncertainty for semantic
segmentation. Wu et al.~\cite{Wu2019} introduce two innovations
that turn variational Bayes into a robust tool for Bayesian
networks: they introduce a novel deterministic method to approximate
moments in neural networks which eliminates gradient variance, and
they introduce a hierarchical prior for parameters and an
Empirical Bayes procedure to select prior variances.

\section{Background for Bayesian SSD}

\begin{table}
\centering
\caption{Notation for background}
\label{tab:notation}
\begin{tabular}{l|l}
symbol & meaning \\
\hline
\(\mathbf{W}\) & weights \\
\(\mathbf{T}\) & training data \\
\(\mathcal{N}(0, I)\) & Gaussian distribution with zero mean \\
\(I\) & identity covariance (weights are independent and identically distributed) \\
\(p(\mathbf{W}|\mathbf{T})\) & probability of weights given
training data \\
\(\mathcal{I}\) & an image \\
\(\mathbf{q} = p(y|\mathcal{I}, \mathbf{T})\) & probability
of all classes given image and training data \\
\(H(\mathbf{q})\) & entropy over probability vector \\
\(\widetilde{\mathbf{W}}\) & weights sampled from
\(p(\mathbf{W}|\mathbf{T})\) \\
\(\mathbf{b}\) & bounding box coordinates \\
\(\mathbf{s}\) & softmax scores \\
\(\overline{\mathbf{s}}\) & averaged softmax scores \\
\(D\) & detections of one forward pass \\
\(\mathfrak{D}\) & set of all detections over multiple
forward passes \\
\(\mathcal{O}\) & observation \\
\(\overline{\mathbf{q}}\) & probability vector for
observation \\
%\(E[something]\) & expected value of something
%\(\overline{\mathbf{z}}, \mathbf{z}\) & latent space representation \\
%\(d_T, d_z\) & discriminators \\
%\(e, g\) & encoding and decoding/generating function \\
%\(J_g\) & Jacobi matrix for generating function \\
%\(\mathcal{T}\) & tangent space \\
%\(\mathbf{R}\) & training/test data changed to be on tangent space
\end{tabular}
\end{table}

This section will use the \textbf{notation} defined in
table~\ref{tab:notation} on page~\pageref{tab:notation}.
To understand dropout sampling, it is necessary to explain the
idea of Bayesian neural networks. They place a prior distribution
over the network weights, for example a Gaussian prior distribution:
\(\mathbf{W} \sim \mathcal{N}(0, I)\). In this example
\(\mathbf{W}\) are the weights and the identity matrix \(I\) as
covariance means that every weight is drawn independently from an
identical distribution. The
training of the network determines a plausible set of weights by
evaluating the posterior distribution over the weights given
the training data \(\mathbf{T}\): \(p(\mathbf{W}|\mathbf{T})\).
However, this
evaluation cannot be performed in any reasonable
time. Therefore, approximation techniques are
required. In those techniques the posterior is fitted with a
simple distribution \(q^{*}_{\theta}(\mathbf{W})\). The original
and intractable problem of averaging over all weights in the network
is replaced with an optimisation task, where the parameters of the
simple distribution are optimised~\cite{Kendall2017}.
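
As a brief sketch of what this fitting usually means (this is the standard
variational inference formulation, stated here as an assumption about the
general technique and not as a formula quoted from the cited works):
Bayes' theorem gives the posterior as
\[
p(\mathbf{W}|\mathbf{T}) = \frac{p(\mathbf{T}|\mathbf{W}) \cdot p(\mathbf{W})}{p(\mathbf{T})},
\qquad
p(\mathbf{T}) = \int p(\mathbf{T}|\mathbf{W}) \cdot p(\mathbf{W}) d\mathbf{W}
\]
where the normalising integral over all weights is the intractable part.
The simple distribution is then typically chosen from a parametrised
family \(q_{\theta}\) by minimising the Kullback-Leibler divergence to the
posterior:
\[
q^{*}_{\theta}(\mathbf{W}) = \arg\min_{q_{\theta}}
\mathrm{KL}\left(q_{\theta}(\mathbf{W}) \,\|\, p(\mathbf{W}|\mathbf{T})\right)
\]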

\subsubsection*{Dropout Variational Inference}

Kendall and Gal~\cite{Kendall2017} showed an approximation for
classification and recognition tasks. Dropout variational inference
is a practical approximation technique that adds dropout layers
in front of every weight layer and also uses them during test
time to sample from the approximate posterior. Effectively, this
results in the approximation of the class probability
\(p(y|\mathcal{I}, \mathbf{T})\) by performing multiple forward
passes through the network and averaging over the obtained softmax
scores \(\mathbf{s}_i\), given an image \(\mathcal{I}\) and the
training data \(\mathbf{T}\):
\begin{equation} \label{eq:drop-sampling}
p(y|\mathcal{I}, \mathbf{T}) = \int p(y|\mathcal{I}, \mathbf{W}) \cdot p(\mathbf{W}|\mathbf{T})d\mathbf{W} \approx \frac{1}{n} \sum_{i=1}^{n}\mathbf{s}_i
\end{equation}

With this dropout sampling technique \(n\) model weights
\(\widetilde{\mathbf{W}}_i\) are sampled from the posterior
\(p(\mathbf{W}|\mathbf{T})\). The class probability
\(p(y|\mathcal{I}, \mathbf{T})\) is a probability vector
\(\mathbf{q}\) over all class labels. Finally, the uncertainty
of the network with respect to the classification is given by
the entropy \(H(\mathbf{q}) = - \sum_i q_i \cdot \log q_i\).
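
As a small worked example, assume the natural logarithm and a hypothetical
three-class problem (neither is fixed by the definition above; the numbers
are chosen purely for illustration). A uniform probability vector yields
the maximum entropy,
\[
H\left(\frac{1}{3}, \frac{1}{3}, \frac{1}{3}\right)
= - 3 \cdot \frac{1}{3} \cdot \log \frac{1}{3} = \log 3 \approx 1.10
\]
whereas a vector dominated by one class yields a much lower value:
\[
H(0.9, 0.05, 0.05) = - (0.9 \cdot \log 0.9 + 2 \cdot 0.05 \cdot \log 0.05) \approx 0.39
\]
In general, for \(C\) classes the entropy is bounded by
\(0 \leq H(\mathbf{q}) \leq \log C\).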

\subsubsection*{Dropout Sampling for Object Detection}

Miller et al.~\cite{Miller2018} apply dropout sampling to
object detection. In that case \(\mathbf{W}\) represents the
learned weights of a detection network like SSD~\cite{Liu2016}.
Every forward pass uses a different network
\(\widetilde{\mathbf{W}}\) which is approximately sampled from
\(p(\mathbf{W}|\mathbf{T})\). Each forward pass in object
detection results in a set of detections, each consisting of bounding
box coordinates \(\mathbf{b}\) and softmax score \(\mathbf{s}\).
The detections are denoted by Miller et al. as \(D_i =
\{\mathbf{s}_i,\mathbf{b}_i\}\). The detections of all passes are put
into a large set \(\mathfrak{D} = \{D_1, \ldots, D_n\}\).

All detections with mutual intersection-over-union scores (IoU)
of \(0.95\) or higher are defined as an observation \(\mathcal{O}_i\).
Subsequently, the corresponding vector of class probabilities
\(\overline{\mathbf{q}}_i\) for the observation is calculated by averaging all
score vectors \(\mathbf{s}_j\) in a particular observation
\(\mathcal{O}_i\): \(\overline{\mathbf{q}}_i \approx \overline{\mathbf{s}}_i = \frac{1}{n} \sum_{j=1}^{n} \mathbf{s}_j\). The label uncertainty
of the detector for a particular observation is measured by
the entropy \(H(\overline{\mathbf{q}}_i)\).

If \(\overline{\mathbf{q}}_i\), which I called averaged class probabilities,
resembles a uniform distribution, the entropy will be high. A uniform
distribution means that no class is more likely than another, which
is a perfect example of maximum uncertainty. Conversely, if
one class has a very high probability, the entropy will be low.

In open set conditions it can be expected that falsely generated
detections for unknown object classes have a higher label
uncertainty. A threshold on the entropy \(H(\overline{\mathbf{q}}_i)\) can then
be used to identify and reject these false positive cases.

% SSD: \cite{Liu2016}
% ImageNet: \cite{Deng2009}
% COCO: \cite{Lin2014}
% YCB: \cite{Xiang2017}
% SceneNet: \cite{McCormac2017}

\chapter{Methods}

\label{chap:methods}

This chapter explains the functionality of the Bayesian SSD and the
decoding pipelines.

\section{Bayesian SSD for Model Uncertainty}

Bayesian SSD adds dropout sampling to the vanilla SSD. First,
the model architecture will be explained, followed by details on
the uncertainty calculation, and implementation details.

\subsection{Model Architecture}

\begin{figure}
\centering
\includegraphics[scale=1.2]{vanilla-ssd}
\caption{The vanilla SSD network as defined by Liu et al.~\cite{Liu2016}. VGG-16 is the base network, extended with extra feature layers. These predict offsets to anchor boxes with different sizes and aspect ratios. Furthermore, they predict the
corresponding confidences.}
\label{fig:vanilla-ssd}
\end{figure}

Vanilla SSD is based upon the VGG-16 network (see figure~\ref{fig:vanilla-ssd}) and adds extra feature layers. These layers
predict the offsets to the anchor boxes, which have different sizes
and aspect ratios. The feature layers also predict the
corresponding confidences. By comparison, Bayesian SSD only adds
two dropout layers after the fc6 and fc7 layers (see figure~\ref{fig:bayesian-ssd}).

\begin{figure}
\centering
\includegraphics[scale=1.2]{bayesian-ssd}
\caption{The Bayesian SSD network as defined by Miller et al.~\cite{Miller2018}. It adds dropout layers after the fc6
and fc7 layers.}
\label{fig:bayesian-ssd}
\end{figure}

\subsection{Model Uncertainty}

Dropout sampling measures model uncertainty with the help of
entropy: every forward pass creates predictions, which are
partitioned into observations whose entropy is then calculated.
Entropy works to detect uncertainty because uncertain networks
will produce different classifications for the same object in an
image across multiple forward passes.
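
To illustrate this with hypothetical numbers (two classes, two forward
passes, natural logarithm; all values are invented for this example):
suppose the two passes return the softmax scores
\(\mathbf{s}_1 = (0.9, 0.1)\) and \(\mathbf{s}_2 = (0.1, 0.9)\) for the
same object. Each pass on its own looks fairly certain,
\[
H(\mathbf{s}_1) = H(\mathbf{s}_2) = - (0.9 \cdot \log 0.9 + 0.1 \cdot \log 0.1) \approx 0.33
\]
but the averaged scores reveal the disagreement:
\[
\overline{\mathbf{s}} = \frac{1}{2} (\mathbf{s}_1 + \mathbf{s}_2) = (0.5, 0.5),
\qquad
H(\overline{\mathbf{s}}) = \log 2 \approx 0.69
\]
This is the maximum entropy for two classes, so an entropy threshold
between these two values would reject the averaged observation although
neither single pass would have been rejected.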

\subsection{Implementation Details}

For this thesis, an SSD implementation based on TensorFlow and
Keras\footnote{\url{https://github.com/pierluigiferrari/ssd\_keras}}
was used. It was modified to support entropy thresholding,
partitioning of observations, and dropout
layers in the SSD model. %Entropy thresholding takes place before
%the per-class confidence threshold is applied.

\section{Decoding Pipelines}

The raw output of SSD is not very useful: it contains thousands of
boxes per image. Among them are many boxes with very low confidences
or background classifications; those need to be filtered out to
get any meaningful output from the network. The process of
filtering is called decoding and is presented for the three variants
of SSD used in the thesis.

\subsection{Vanilla SSD}

Liu et al.~\cite{Liu2016} used Caffe for their original SSD
implementation. The decoding process consists of two main
phases: decoding and filtering. Decoding transforms the relative
coordinates predicted by SSD into absolute coordinates. At this point
the shape of the output per batch is \((batch\_size, \#nr\_boxes, \#nr\_classes + 12)\). The last twelve elements are split into
the four bounding box offsets, the four anchor box coordinates, and
the four variances; there are 8732 boxes.
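
The coordinate transformation can be sketched as follows, assuming the
centroid encoding that is commonly used for SSD. The symbols are
introduced only for this illustration and are not part of the notation
used elsewhere in this thesis: predicted offsets
\((\hat{c}_x, \hat{c}_y, \hat{w}, \hat{h})\), anchor box
\((a_{cx}, a_{cy}, a_w, a_h)\), and variances
\((v_{cx}, v_{cy}, v_w, v_h)\). The implementation actually used may
differ in detail.
\[
c_x = \hat{c}_x \cdot v_{cx} \cdot a_w + a_{cx},
\qquad
c_y = \hat{c}_y \cdot v_{cy} \cdot a_h + a_{cy}
\]
\[
w = \exp(\hat{w} \cdot v_w) \cdot a_w,
\qquad
h = \exp(\hat{h} \cdot v_h) \cdot a_h
\]
Under this assumption, the decoded centre \((c_x, c_y)\) and size
\((w, h)\) are still relative to the image; scaling them by image width
and height and converting them to corner coordinates
\((x_{min}, y_{min}, x_{max}, y_{max})\) yields the absolute boxes.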

Filtering of these boxes is first done per class:
only the class id, confidence of that class, and the bounding box
coordinates are kept per box. The filtering consists of
confidence thresholding and a subsequent non-maximum suppression.
All boxes that pass non-maximum suppression are added to a
per-image maxima list. One box could pass the confidence threshold
for multiple classes and, hence, be present multiple times in the
maxima list for the image. Lastly, a total of \(k\) boxes with the
highest confidences is kept per image across all classes. The
original implementation uses a confidence threshold of \(0.01\), an
IoU threshold for non-maximum suppression of \(0.45\) and a top \(k\)
value of 200.

The combination of per-class confidence threshold and non-maximum
suppression in vanilla SSD has one
weakness: even if SSD correctly predicts all objects as the
background class with high confidence, the per-class confidence
threshold of 0.01 still admits predictions with very low
confidences for the other classes. Background boxes themselves are
not added to the maxima list, but these low confidence boxes can be.
Furthermore, the
same detection can be present in the maxima list for multiple
classes. In this case, the entropy threshold would let the detection
pass because the high background confidence results in a low entropy.
Subsequently,
the low per-class confidence threshold does not restrict the boxes
either. Therefore, the decoding output is worse than the actual
predictions of the network.
Bayesian SSD cannot help in this situation because the network
is not actually uncertain.

SSD was developed with closed set conditions in mind. A well-trained
network in such a situation does not have many high confidence
background detections. In an open set environment, background
detections are the correct behaviour for unknown classes.
In order to get useful detections out of the decoding, a higher
confidence threshold is required.

\subsection{Vanilla SSD with Entropy Thresholding}

Vanilla SSD with entropy thresholding adds an additional component
to the filtering already done for vanilla SSD. The entropy is
calculated from all \(\#nr\_classes\) softmax scores in a prediction.
Only predictions with a low enough entropy pass the entropy
threshold and move on to the aforementioned per-class filtering.
This excludes very uniform predictions but cannot identify
false positive or false negative cases with high confidence values.

\subsection{Bayesian SSD with Entropy Thresholding}

Bayesian SSD differs in that it performs multiple forward passes. Based
on the information in the paper of Miller et al.~\cite{Miller2018}, the
detections of all forward passes
are grouped per image but not by forward pass. This leads
to the following shape of the network output after all
forward passes: \((batch\_size, \#nr\_boxes \cdot \#nr\_forward\_passes, \#nr\_classes + 12)\). The size of the output
increases linearly with more forward passes.
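
As a concrete illustration, assume the 80 COCO classes plus one
background class (so \(\#nr\_classes = 81\), which is an assumption about
how the classes are counted) and a hypothetical ten forward passes. With
the 8732 boxes of SSD the output shape then becomes
\[
(batch\_size, 8732 \cdot 10, 81 + 12) = (batch\_size, 87320, 93)
\]
compared to \((batch\_size, 8732, 93)\) for a single forward pass.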

These detections have to be decoded first. Afterwards they are
partitioned into observations to reduce the size of the output, and
to identify uncertainty. This is accomplished by calculating the
mutual IoU of every detection with all other detections. Detections
with a mutual IoU score of 0.95 or higher are partitioned into an
observation. Next, the softmax scores and bounding box coordinates of
all detections in an observation are averaged.
There can be a different number of observations for every image, which
destroys homogeneity and prevents batch-wise calculation of the
results. The shape of the results is per image: \((\#nr\_observations,\#nr\_classes + 4)\).
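
For reference, the IoU of two boxes \(A\) and \(B\) is the area of their
intersection divided by the area of their union. The following worked
example uses hypothetical boxes chosen only to show how strict a
threshold of 0.95 is:
\[
\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}
\]
Let \(A\) be a box of \(100 \times 100\) pixels and \(B\) the same box
shifted horizontally by 2 pixels. Then
\[
\mathrm{IoU}(A, B) = \frac{98 \cdot 100}{2 \cdot 100 \cdot 100 - 98 \cdot 100}
= \frac{9800}{10200} \approx 0.961 \geq 0.95
\]
and both detections fall into the same observation. With a shift of 5
pixels instead, the result is
\[
\mathrm{IoU}(A, B) = \frac{95 \cdot 100}{2 \cdot 100 \cdot 100 - 95 \cdot 100}
= \frac{9500}{10500} \approx 0.905 < 0.95
\]
so the two detections would not be grouped together.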

Entropy is measured in the next step. All observations with too high
entropy are discarded. Entropy thresholding in combination with
dropout sampling should improve identification of false positives of
unknown classes. This is due to multiple forward passes and
the assumption that uncertainty in some objects will result
in different classifications in multiple forward passes. These
varying classifications are averaged into multiple lower confidence
values, which should increase the entropy and, hence, flag an
observation for removal.

Per-class confidence thresholding, non-maximum suppression, and
top \(k\) selection happen like in vanilla SSD.

\chapter{Experimental Setup and Results}

\label{chap:experiments-results}

\section{Data sets}

% TODO: reword

Usually, data sets are not perfect when it comes to neural
networks: they contain outliers, invalid bounding boxes, and similar
problematic things. Before a data set can be used, these problems
need to be removed.

For the MS COCO data set, all annotations were checked for
impossible values: bounding box height or width lower than zero,
\(x_{min}\) and \(y_{min}\) bounding box coordinates lower than zero,
\(x_{max}\) and \(y_{max}\) coordinates lower than or equal to zero, \(x_{min}\) greater than \(x_{max}\),
\(y_{min}\) greater than \(y_{max}\), image width lower than \(x_{max}\),
and image height lower than \(y_{max}\). In the last two cases the
bounding box width or height was set to (image width - \(x_{min}\)) or
(image height - \(y_{min}\)) respectively;
in the other cases the annotation was skipped.
If the bounding box width or height afterwards was
lower than or equal to zero, the annotation was skipped.
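
For illustration, consider a hypothetical annotation (the numbers are
invented for this example) in an image of width 640 with
\(x_{min} = 600\) and a bounding box width of 60, assuming that
\(x_{max}\) is obtained as \(x_{min}\) plus the box width:
\[
x_{max} = 600 + 60 = 660 > 640
\]
The image width is lower than \(x_{max}\), so the bounding box width is
set to
\[
640 - x_{min} = 640 - 600 = 40
\]
and the annotation is kept because the new width is greater than zero.
Had \(x_{min}\) been 640 or larger, the corrected width would have been
lower than or equal to zero and the annotation would have been skipped.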

In this thesis, SceneNet RGB-D is always used with COCO classes.
Therefore, a mapping between COCO and SceneNet RGB-D and vice versa
was necessary. It was created by manually going through each
WordNet ID and searching for a fitting COCO class.

The ground truth for SceneNet RGB-D is stored in protobuf files
and had to be converted into Python format to use it in the
codebase. The trajectories are not sorted inside the protobuf;
therefore, the first action was to sort them. For each trajectory,
all instances are stored independently of the views in the
trajectory. Therefore, the trajectories and their respective
instances were looped through and all
background instances and those without corresponding COCO class were
skipped. The rest was stored in a dictionary per trajectory.
Subsequently, all views of the trajectory were traversed and
for every view all stored instances were looped through.
For every instance, the segmentation map was modified by
setting all pixels not having the instance ID as value to zero
and the rest to one. If no objects were found, then that instance
was skipped. In the other case a copy of its data from the
aforementioned dictionary plus the bounding box information was
stored in a list of instances for that view. The list of instances
per view was added to a list of such lists for the trajectory.
Ultimately this list of lists was added to a global list across
all trajectories: a list of lists of lists.

\section{Replication of Miller et al.}

% TODO rework

Miller et al. use SSD for the object detection part. They compare
vanilla SSD, vanilla SSD with entropy thresholding, and the
Bayesian SSD with each other. The Bayesian SSD was created by
adding two dropout layers to the vanilla SSD; no other changes
were made. Miller et al. use weights that were trained on MS COCO
to predict on SceneNet RGB-D.

As the source code was not available, I had to implement the work of
Miller et al. myself. For the SSD network, I used an implementation that
is compatible with
TensorFlow\footnote{\url{https://github.com/pierluigiferrari/ssd\_keras}}; this implementation had to be
changed to work with eager mode. Further changes were made to
support entropy thresholding.

For the Bayesian variant, observations have to be calculated:
detections of multiple forward passes for the same image are averaged
into an observation. This algorithm was implemented based on the
information available in the paper. Beyond the observation
calculation, the Bayesian variant can use the same code as the
vanilla version with one exception: the model had to be duplicated
and two dropout layers added to transform SSD into a Bayesian
network.

The vanilla SSD did not provide meaningful detections on SceneNet
RGB-D with the pre-trained weights, and fine-tuning it on SceneNet
did not work either. Therefore, to better understand the SceneNet
RGB-D data set, I counted the number of instances per COCO class and
a huge class imbalance was visible; not just globally but also
between trajectories: some classes are only present in some
trajectories. This makes training with SSD on SceneNet practically
impossible.

\section{Experimental Setup}

\section{Results}

\chapter{Discussion}

\label{chap:discussion}

To recap, the hypothesis is repeated here.

\begin{description}
\item[Hypothesis] Novelty detection using auto-encoders delivers similar or better object detection performance under open set conditions while being less computationally expensive compared to dropout sampling.
\end{description}

Based on the reported results, no clear answer can be given to the
research question; rather new questions emerge: ``Can auto-encoders
work on realistic data sets like COCO with multiple different classes
in one image?'' In other words: ``Is my experience due to
implementation issues or a general theoretical problem of
auto-encoders?''

Despite best efforts, the results of Miller et al.~\cite{Miller2018}
could not be replicated. On its own, this failure does not show anything.
To disprove the work of Miller et al., every possible way of replicating
it would have to fail, whereas a single successful replication
proves that it can be replicated. On the surface, both Miller et al.
and I used the same weights, the same network, and the same
data sets. The only difference of note is that they used a Caffe
implementation of SSD, whereas for this thesis a TensorFlow
implementation with eager mode was used.

\chapter{Closing}
\label{chap:closing}