% body thesis file that contains the actual content
\chapter{Introduction}
The introduction will explain the wider context first, before
providing technical details.
\subsection*{Motivation}
Famous examples like the automatic soap dispenser that does not
recognize the hand of a black person but dispenses soap when presented
with a paper towel raise the question of bias in computer
systems~\cite{Friedman1996}. Closely related to this ethical question
about the design of so-called algorithms is the question of
algorithmic accountability~\cite{Diakopoulos2014}.
Supervised neural networks learn from input-output pairs and
figure out by themselves which connections are necessary for the task.
This strength is also their Achilles heel: it makes them effectively
black boxes and prevents any answers to questions of causality.
However, these questions of causality are of enormous consequence when
the results of neural networks are used to make life-changing decisions:
Is a correlation enough to justify negative consequences
for a particular person? And if so, what is the possible defence
against math? Similar questions arise for computer
vision networks that might be used together with so-called smart
CCTV cameras to discover suspicious activity.
This leads to the need for neural networks to explain their results.
Such an explanation must come from the network itself or from an
attached piece of technology to allow adoption at scale. This
immediately raises the question of how such an endeavour can be achieved.
For neural networks there are fundamentally two types of tasks:
regression and classification. Regression deals with any case
where the goal of the network is to approximate an ideal
function that connects all data points. Classification, in contrast,
describes tasks where the network is supposed to identify the
class of any given input. In this thesis, I will work with both.
\subsection*{Object Detection in Open Set Conditions}
\begin{figure}
\centering
\includegraphics[scale=1.0]{open-set}
\caption{Open set problem: The test set contains classes that
were not present during training time.
Icons in this image have been taken from the COCO data set
website (\url{https://cocodataset.org/\#explore}) and were
vectorized afterwards. Resembles figure 1 of Miller et al.~\cite{Miller2018}.}
\label{fig:open-set}
\end{figure}
More specifically, I will look at object detection under open set
conditions (see figure \ref{fig:open-set}).
In non-technical terms this effectively describes
the kind of situation encountered by CCTV cameras or robots
outside of a laboratory. Both use cameras that record
images. Subsequently, a neural network analyses each image
and returns a list of objects that it detected and classified
in the image. The problem here is that networks can only
classify what they know. If presented with an object type that
the network was not trained on, as happens frequently in real
environments, it will still classify the object and might even
do so with high confidence. Such a case is
a false positive. Any ordinary person who uses the results of
such a network would falsely assume that a high confidence always
means the classification is very likely correct. If they use
a proprietary system they might not even be able to find out
that the network was never trained on a particular type of object.
Therefore, it would be impossible for them to identify the output
of the network as a false positive.
This goes back to the need for automatic explanation. Such a system
should by itself recognize that the given object is unknown and
hence mark any classification result of the network as meaningless.
Technically, there are two slightly different approaches that deal
with this type of task: model uncertainty and novelty detection.
Model uncertainty can be measured with dropout sampling.
Dropout is usually used only during training but
Miller et al.~\cite{Miller2018} use it during testing as well
to obtain different results for the same image by performing
multiple forward passes. The output scores of the forward passes
for the same image are then averaged. If the averaged class
probabilities resemble a uniform distribution (every class has
the same probability) this signals maximum uncertainty. Conversely,
if there is one very high probability with every other being very
low this signifies low uncertainty. An unknown object is more
likely to cause high uncertainty, which allows for an identification
of false positive cases.
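To make the role of uncertainty concrete, here is a small worked example
(my own illustration, using the natural logarithm and the usual
definition \(H(\mathbf{q}) = -\sum_i q_i \log q_i\)) with four classes:
the uniform distribution attains the maximum entropy of \(\log 4\),
whereas a peaked distribution stays close to zero.
\[
H\big((0.25, 0.25, 0.25, 0.25)\big) = -4 \cdot 0.25 \cdot \log 0.25 = \log 4 \approx 1.39,
\]
\[
H\big((0.97, 0.01, 0.01, 0.01)\big) = -\big(0.97 \log 0.97 + 3 \cdot 0.01 \log 0.01\big) \approx 0.17.
\]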
Novelty detection is another approach to solve this task.
In the realm of neural networks it is usually done with the help of
auto-encoders that solve a regression task: finding an
identity function that reconstructs the given input~\cite{Pimentel2014}. Auto-encoders have
internally at least two components: an encoder, and a decoder or
generator. The job of the encoder is to find an encoding that
compresses the input as much as possible while simultaneously
being as loss-free as possible. The decoder takes this latent
representation of the input and has to find a decompression
that reconstructs the input as accurately as possible. During
training these auto-encoders learn to reproduce a certain group
of object classes. The actual novelty detection takes place
during testing: given an image, and the output and loss of the
auto-encoder, a novelty score is calculated. For some novelty
detection approaches the reconstruction loss is exactly the novelty
score; others consider more factors. A low novelty
score signals a known object, a high novelty score a novel one.
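As an illustration of reconstruction-based novelty detection, the
following sketch shows a minimal dense auto-encoder whose
reconstruction error serves directly as the novelty score. It is my
own simplified example, assuming flattened grey-scale images with 784
pixels (as in MNIST); the architecture, layer sizes, and training
settings are illustrative and not taken from any of the cited works.
\begin{verbatim}
import numpy as np
import tensorflow as tf

# Minimal dense auto-encoder: the encoder compresses 784-dimensional
# inputs into a 32-dimensional latent code, the decoder reconstructs them.
def build_autoencoder(input_dim=784, latent_dim=32):
    inputs = tf.keras.Input(shape=(input_dim,))
    encoded = tf.keras.layers.Dense(128, activation="relu")(inputs)
    encoded = tf.keras.layers.Dense(latent_dim, activation="relu")(encoded)
    decoded = tf.keras.layers.Dense(128, activation="relu")(encoded)
    decoded = tf.keras.layers.Dense(input_dim, activation="sigmoid")(decoded)
    model = tf.keras.Model(inputs, decoded)
    model.compile(optimizer="adam", loss="mse")
    return model

def novelty_scores(model, images):
    """Per-image reconstruction error, used directly as novelty score."""
    reconstructions = model.predict(images, verbose=0)
    return np.mean((images - reconstructions) ** 2, axis=1)

# Usage: train on known classes only, then score test images.
# x_known and x_test are arrays of shape (n, 784) with values in [0, 1].
# autoencoder = build_autoencoder()
# autoencoder.fit(x_known, x_known, epochs=20, batch_size=128)
# scores = novelty_scores(autoencoder, x_test)  # high score => likely novel
\end{verbatim}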
\subsection*{Research Question}
Auto-encoders work well for data sets like MNIST~\cite{Deng2012}
but perform poorly on challenging real world data sets
like MS COCO~\cite{Lin2014}. Therefore, a comparison between
model uncertainty and novelty detection is considered out of
scope for this thesis.
Miller et al.~\cite{Miller2018} used an SSD pre-trained on COCO
without further fine-tuning on the SceneNet RGB-D data
set~\cite{McCormac2017} and reported good results regarding
open set error for an SSD variant with dropout sampling and entropy
thresholding.
If their results are generalizable it should be possible to replicate
the relative difference between the variants on the COCO data set.
This leads to the following hypothesis: \emph{Dropout sampling
delivers better object detection performance under open set
conditions compared to object detection without it.}
For the purpose of this thesis, I will use the vanilla SSD as
the baseline to compare against. In particular, vanilla SSD uses
a per-class confidence threshold of 0.01, an IoU threshold of 0.45
for the non-maximum suppression, and a top \(k\) value of 200.
The effect of an entropy threshold is measured against this vanilla
SSD by applying entropy thresholds from 0.1 to 2.4 (limits taken from
Miller et al.). Dropout sampling is compared to vanilla SSD, both
with and without entropy thresholding. The number of forward
passes is varied to identify its impact.
\paragraph{Hypothesis} Dropout sampling
delivers better object detection performance under open set
conditions compared to object detection without it.
\paragraph{Contribution}
The contribution of this thesis is a comparison between dropout
sampling and auto-encoding with respect to the overall performance
of both for object detection in the open set conditions using
the SSD network for object detection and the SceneNet RGB-D data set
with MS COCO classes.
\subsection*{Reader's guide}
First, chapter \ref{chap:background} presents related works and
provides the background for dropout sampling as used in the Bayesian SSD.
Afterwards, chapter \ref{chap:methods} explains how the Bayesian SSD
works, and provides details about the software and source code design.
Chapter \ref{chap:experiments-results} presents the data sets,
the experimental setup, and the results. This is followed by
chapters \ref{chap:discussion} and \ref{chap:closing}, focusing on
the discussion and closing respectively.
Therefore, the contribution is found in chapters \ref{chap:methods},
\ref{chap:experiments-results}, and \ref{chap:discussion}.
\chapter{Background}
\label{chap:background}
This chapter begins with an overview of previous work
in the field of this thesis. Afterwards, the theoretical foundations
of the work of Miller et al.~\cite{Miller2018} will
be explained.
\section{Related Works}
Novelty detection for object detection is intricately linked with
open set conditions: the test data can contain unknown classes.
Bishop~\cite{Bishop1994} investigates the correlation between
the degree of novel input data and the reliability of network
outputs. Pimentel et al.~\cite{Pimentel2014} provide a review
of novelty detection methods published over the previous decade.
There are two primary pathways that deal with novelty: novelty
detection using auto-encoders and uncertainty estimation with
Bayesian networks.
Japkowicz et al.~\cite{Japkowicz1995} introduce a novelty detection
method based on the hippocampus model of Gluck and Myers~\cite{Gluck1993}
and use an auto-encoder to recognize novel instances.
Thompson et al.~\cite{Thompson2002} show that auto-encoders
can learn ``normal'' system behaviour implicitly.
Goodfellow et al.~\cite{Goodfellow2014} introduce adversarial
networks: a generator that attempts to trick the discriminator
by generating samples indistinguishable from the real data.
Makhzani et al.~\cite{Makhzani2015} build on the work of Goodfellow
and propose adversarial auto-encoders. Richter and
Roy~\cite{Richter2017} use an auto-encoder to detect novelty.
Wang et al.~\cite{Wang2018} build upon Goodfellow's work and
use a generative adversarial network for novelty detection.
Sabokrou et al.~\cite{Sabokrou2018} implement an end-to-end
architecture for one-class classification: it consists of two
deep networks, with one being the novelty detector and the other
enhancing inliers and distorting outliers.
Pidhorskyi et al.~\cite{Pidhorskyi2018} take a probabilistic approach
and compute how likely it is that a sample is generated by the
inlier distribution.
Kendall and Gal~\cite{Kendall2017} provide a Bayesian deep learning
framework that combines input-dependent
aleatoric\footnote{captures noise inherent in observations}
uncertainty with epistemic\footnote{uncertainty in the model}
uncertainty. Lakshminarayanan et al.~\cite{Lakshminarayanan2017}
implement a predictive uncertainty estimation using deep ensembles
rather than Bayesian networks. Geifman et al.~\cite{Geifman2018}
introduce an uncertainty estimation algorithm for non-Bayesian deep
neural classification that estimates the uncertainty of highly
confident points using earlier snapshots of the trained model.
Miller et al.~\cite{Miller2018a} compare merging strategies
for sampling-based uncertainty techniques in object detection.
Sensoy et al.~\cite{Sensoy2018} treat prediction confidence
as subjective opinions: they place a Dirichlet distribution on it.
The trained predictor for a multi-class classification is also a
Dirichlet distribution.
Gal and Ghahramani~\cite{Gal2016} show how dropout can be used
as a Bayesian approximation. Miller et al.~\cite{Miller2018}
build upon the work of Miller et al.~\cite{Miller2018a} and
Gal and Ghahramani: they use dropout sampling under open-set
conditions for object detection. Mukhoti and Gal~\cite{Mukhoti2018}
contribute metrics to measure uncertainty for semantic
segmentation. Wu et al.~\cite{Wu2019} introduce two innovations
that turn variational Bayes into a robust tool for Bayesian
networks: they introduce a novel deterministic method to approximate
moments in neural networks which eliminates gradient variance, and
they introduce a hierarchical prior for parameters and an
Empirical Bayes procedure to select prior variances.
\section{Background for Bayesian SSD}
\begin{table}
\centering
\caption{Notation for background}
\label{tab:notation}
\begin{tabular}{l|l}
symbol & meaning \\
\hline
\(\mathbf{W}\) & weights \\
\(\mathbf{T}\) & training data \\
\(\mathcal{N}(0, I)\) & Gaussian distribution \\
\(I\) & identity matrix (weights drawn independently and identically) \\
\(p(\mathbf{W}|\mathbf{T})\) & probability of weights given
training data \\
\(\mathcal{I}\) & an image \\
\(\mathbf{q} = p(y|\mathcal{I}, \mathbf{T})\) & probability
of all classes given image and training data \\
\(H(\mathbf{q})\) & entropy over probability vector \\
\(\widetilde{\mathbf{W}}\) & weights sampled from
\(p(\mathbf{W}|\mathbf{T})\) \\
\(\mathbf{b}\) & bounding box coordinates \\
\(\mathbf{s}\) & softmax scores \\
\(\overline{\mathbf{s}}\) & averaged softmax scores \\
\(D\) & detections of one forward pass \\
\(\mathfrak{D}\) & set of all detections over multiple
forward passes \\
\(\mathcal{O}\) & observation \\
\(\overline{\mathbf{q}}\) & probability vector for
observation \\
%\(E[something]\) & expected value of something
%\(\overline{\mathbf{z}}, \mathbf{z}\) & latent space representation \\
%\(d_T, d_z\) & discriminators \\
%\(e, g\) & encoding and decoding/generating function \\
%\(J_g\) & Jacobi matrix for generating function \\
%\(\mathcal{T}\) & tangent space \\
%\(\mathbf{R}\) & training/test data changed to be on tangent space
\end{tabular}
\end{table}
This section will use the \textbf{notation} defined in table
\ref{tab:notation} on page \pageref{tab:notation}.
To understand dropout sampling, it is necessary to explain the
idea of Bayesian neural networks. They place a prior distribution
over the network weights, for example a Gaussian prior distribution:
\(\mathbf{W} \sim \mathcal{N}(0, I)\). In this example
\(\mathbf{W}\) are the weights and \(I\) is the identity matrix,
meaning every weight is drawn independently and identically. The
training of the network determines a plausible set of weights by
evaluating the posterior distribution over the weights given
the training data \(\mathbf{T}\): \(p(\mathbf{W}|\mathbf{T})\).
However, this
evaluation cannot be performed in any reasonable
time. Therefore, approximation techniques are
required. In those techniques the posterior is fitted with a
simple distribution \(q^{*}_{\theta}(\mathbf{W})\). The original
and intractable problem of averaging over all weights in the network
is replaced with an optimisation task, where the parameters of the
simple distribution are optimised~\cite{Kendall2017}.
\subsubsection*{Dropout Variational Inference}
Kendall and Gal~\cite{Kendall2017} showed an approximation for
classification and recognition tasks. Dropout variational inference
is a practical approximation technique: dropout layers are added
in front of every weight layer and are also used during test
time to sample from the approximate posterior. Effectively, this
results in the approximation of the class probability
\(p(y|\mathcal{I}, \mathbf{T})\) by performing multiple forward
passes through the network and averaging the obtained softmax
scores \(\mathbf{s}_i\), given an image \(\mathcal{I}\) and the
training data \(\mathbf{T}\):
\begin{equation} \label{eq:drop-sampling}
p(y|\mathcal{I}, \mathbf{T}) = \int p(y|\mathcal{I}, \mathbf{W}) \cdot p(\mathbf{W}|\mathbf{T})d\mathbf{W} \approx \frac{1}{n} \sum_{i=1}^{n}\mathbf{s}_i
\end{equation}
With this dropout sampling technique \(n\) model weights
\(\widetilde{\mathbf{W}}_i\) are sampled from the posterior
\(p(\mathbf{W}|\mathbf{T})\). The class probability
\(p(y|\mathcal{I}, \mathbf{T})\) is a probability vector
\(\mathbf{q}\) over all class labels. Finally, the uncertainty
of the network with respect to the classification is given by
the entropy \(H(\mathbf{q}) = - \sum_i q_i \cdot \log q_i\).
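The sampling procedure of equation \ref{eq:drop-sampling} can be
sketched as follows for a classification network. This is a minimal
illustration, assuming a Keras model that already contains dropout
layers; passing \texttt{training=True} keeps dropout active during
the forward pass, and the number of passes is arbitrary here.
\begin{verbatim}
import numpy as np
import tensorflow as tf

def dropout_sampling(model, image, n_passes=10):
    """Approximate p(y|I, T) by averaging softmax scores over n passes."""
    batch = image[np.newaxis, ...]              # add batch dimension
    scores = np.stack([
        model(batch, training=True).numpy()[0]  # training=True keeps dropout active
        for _ in range(n_passes)
    ])
    q = scores.mean(axis=0)                     # averaged class probabilities
    entropy = -np.sum(q * np.log(q + 1e-12))    # H(q), uncertainty of the prediction
    return q, entropy

# Usage (illustrative): high entropy indicates an uncertain, possibly
# unknown, input.
# q, h = dropout_sampling(my_classifier, my_image, n_passes=20)
\end{verbatim}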
\subsubsection*{Dropout Sampling for Object Detection}
Miller et al.~\cite{Miller2018} apply dropout sampling to
object detection. In that case \(\mathbf{W}\) represents the
learned weights of a detection network like SSD~\cite{Liu2016}.
Every forward pass uses a different network
\(\widetilde{\mathbf{W}}\) which is approximately sampled from
\(p(\mathbf{W}|\mathbf{T})\). Each forward pass in object
detection results in a set of detections, each consisting of bounding
box coordinates \(\mathbf{b}\) and a softmax score \(\mathbf{s}\).
The detections of one pass are denoted by Miller et al. as \(D_i =
\{\mathbf{s}_i,\mathbf{b}_i\}\). The detections of all passes are collected
in a large set \(\mathfrak{D} = \{D_1, ..., D_n\}\).
All detections with mutual intersection-over-union scores (IoU)
of \(0.95\) or higher are defined as an observation \(\mathcal{O}_i\).
Subsequently, the corresponding vector of class probabilities
\(\overline{\mathbf{q}}_i\) for the observation is calculated by averaging all
score vectors \(\mathbf{s}_j\) in a particular observation
\(\mathcal{O}_i\): \(\overline{\mathbf{q}}_i \approx \overline{\mathbf{s}}_i = \frac{1}{n} \sum_{j=1}^{n} \mathbf{s}_j\). The label uncertainty
of the detector for a particular observation is measured by
the entropy \(H(\overline{\mathbf{q}}_i)\).
If \(\overline{\mathbf{q}}_i\), which I call the averaged class probabilities,
resembles a uniform distribution, the entropy will be high. A uniform
distribution means that no class is more likely than another, which
is a perfect example of maximum uncertainty. Conversely, if
one class has a very high probability the entropy will be low.
In open set conditions it can be expected that falsely generated
detections for unknown object classes have a higher label
uncertainty. A threshold on the entropy \(H(\overline{\mathbf{q}}_i)\) can then
be used to identify and reject these false positive cases.
% SSD: \cite{Liu2016}
% ImageNet: \cite{Deng2009}
% COCO: \cite{Lin2014}
% YCB: \cite{Xiang2017}
% SceneNet: \cite{McCormac2017}
\chapter{Methods}
\label{chap:methods}
This chapter explains the functionality of the Bayesian SSD and the
decoding pipelines.
\section{Bayesian SSD for Model Uncertainty}
Bayesian SSD adds dropout sampling to the vanilla SSD. First,
the model architecture will be explained, followed by details on
the uncertainty calculation, and implementation details.
\subsection{Model Architecture}
\begin{figure}
\centering
\includegraphics[scale=1.2]{vanilla-ssd}
\caption{The vanilla SSD network as defined by Liu et al.~\cite{Liu2016}. VGG-16 is the base network, extended with extra feature layers. These predict offsets to anchor boxes with different sizes and aspect ratios. Furthermore, they predict the
corresponding confidences.}
\label{fig:vanilla-ssd}
\end{figure}
Vanilla SSD is based upon the VGG-16 network (see figure \ref{fig:vanilla-ssd}) and adds extra feature layers. These layers
predict the offsets to the anchor boxes, which have different sizes
and aspect ratios. The feature layers also predict the
corresponding confidences. By comparison, Bayesian SSD only adds
two dropout layers after the fc6 and fc7 layers (see figure \ref{fig:bayesian-ssd}).
\begin{figure}
\centering
\includegraphics[scale=1.2]{bayesian-ssd}
\caption{The Bayesian SSD network as defined by Miller et al.~\cite{Miller2018}. It adds dropout layers after the fc6
and fc7 layers.}
\label{fig:bayesian-ssd}
\end{figure}
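The following sketch illustrates how such dropout layers could be
inserted after the fc6 and fc7 layers in a Keras-style model
definition. The convolution parameters follow the common SSD300
implementation and the dropout rate of 0.5 is an assumption for the
sketch; neither is taken verbatim from Miller et al. or from the code
used in this thesis.
\begin{verbatim}
import tensorflow as tf
from tensorflow.keras import layers

def fc6_fc7_with_dropout(x, rate=0.5):
    """Sketch: the fc6/fc7 block of SSD with dropout inserted after each layer."""
    fc6 = layers.Conv2D(1024, (3, 3), dilation_rate=(6, 6), padding="same",
                        activation="relu", name="fc6")(x)
    fc6 = layers.Dropout(rate, name="fc6_dropout")(fc6)  # kept active at test time
                                                         # via training=True
    fc7 = layers.Conv2D(1024, (1, 1), padding="same",
                        activation="relu", name="fc7")(fc6)
    fc7 = layers.Dropout(rate, name="fc7_dropout")(fc7)
    return fc7
\end{verbatim}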
\subsection{Model Uncertainty}
Dropout sampling measures model uncertainty with the help of
entropy: every forward pass creates predictions; these are
partitioned into observations, and then their entropy is calculated.
Entropy works as an uncertainty measure because an uncertain network
will produce different classifications for the same object in an
image across multiple forward passes.
\subsection{Implementation Details}
For this thesis, an SSD implementation based on Tensorflow and
Keras\footnote{\url{https://github.com/pierluigiferrari/ssd\_keras}}
was used. It was modified to support entropy thresholding,
partitioning of observations, and dropout
layers in the SSD model. %Entropy thresholding takes place before
%the per-class confidence threshold is applied.
\section{Decoding Pipelines}
The raw output of SSD is not very useful: it contains thousands of
boxes per image. Among them are many boxes with very low confidences
or background classifications; these need to be filtered out to
get any meaningful output from the network. This process of
filtering is called decoding and is presented for the three variants
of SSD used in this thesis.
\subsection{Vanilla SSD}
Liu et al.~\cite{Liu2016} used Caffe for their original SSD
implementation. The decoding process largely consists of two
phases: decoding and filtering. Decoding transforms the relative
coordinates predicted by SSD into absolute coordinates. At this point
the shape of the output per batch is \((batch\_size, \#nr\_boxes, \#nr\_classes + 12)\). The last twelve elements are split into
the four bounding box offsets, the four anchor box coordinates, and
the four variances; there are 8732 boxes per image.
Filtering of these boxes is first done per class:
only the class id, the confidence of that class, and the bounding box
coordinates are kept per box. The filtering consists of
confidence thresholding and a subsequent non-maximum suppression.
All boxes that pass non-maximum suppression are added to a
per-image maxima list. One box could pass the confidence threshold
for multiple classes and, hence, be present multiple times in the
maxima list for the image. Lastly, a total of \(k\) boxes with the
highest confidences is kept per image across all classes. The
original implementation uses a confidence threshold of \(0.01\), an
IoU threshold for non-maximum suppression of \(0.45\), and a top \(k\)
value of 200.
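The filtering stage described above can be sketched as follows. The
sketch is my own illustration and assumes decoded predictions for one
image as a NumPy array of shape \((\#nr\_boxes, \#nr\_classes + 4)\)
with absolute corner coordinates in the last four columns and class 0
as background; it is not the code of the implementation used in this
thesis.
\begin{verbatim}
import numpy as np

def iou(box, boxes):
    """IoU of one box against an array of boxes, as (xmin, ymin, xmax, ymax)."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter + 1e-12)

def greedy_nms(dets, iou_threshold=0.45):
    """Keep the highest-confidence box, drop overlapping boxes, repeat."""
    dets = dets[np.argsort(-dets[:, 1])]           # sort by confidence, descending
    keep = []
    while len(dets) > 0:
        best, dets = dets[0], dets[1:]
        keep.append(best)
        if len(dets) > 0:
            dets = dets[iou(best[2:], dets[:, 2:]) < iou_threshold]
    return np.array(keep)

def decode_vanilla(preds, n_classes, conf_threshold=0.01,
                   iou_threshold=0.45, top_k=200):
    """preds: (nr_boxes, n_classes + 4); rows of (class_id, confidence, box)."""
    maxima = []
    for class_id in range(1, n_classes):           # skip background class 0
        conf = preds[:, class_id]
        mask = conf > conf_threshold               # per-class confidence threshold
        if not mask.any():
            continue
        dets = np.column_stack([np.full(mask.sum(), class_id),
                                conf[mask], preds[mask, -4:]])
        maxima.append(greedy_nms(dets, iou_threshold))  # per-class NMS
    if not maxima:
        return np.zeros((0, 6))
    maxima = np.concatenate(maxima)
    return maxima[np.argsort(-maxima[:, 1])][:top_k]    # top k across all classes
\end{verbatim}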
The vanilla SSD
per-class confidence threshold and non-maximum suppression have one
weakness: even if SSD correctly predicts all objects as the
background class with high confidence, the per-class confidence
threshold of 0.01 still lets predictions with very low
confidences pass; since background boxes are never added to the maxima
collection, many low confidence boxes can end up in it. Furthermore, the
same detection can be present in the maxima collection for multiple
classes. In such cases, the entropy threshold lets the detection
pass because the background class has high confidence, and
the low per-class confidence threshold does not restrict the boxes
either. Therefore, the decoding output is worse than the actual
predictions of the network.
Bayesian SSD cannot help in this situation because the network
is not actually uncertain.
SSD was developed with closed set conditions in mind. A well trained
network in such a situation does not produce many high confidence
background detections. In an open set environment, however, background
detections are the correct behaviour for unknown classes.
In order to get useful detections out of the decoding, a higher
confidence threshold is required.
\subsection{Vanilla SSD with Entropy Thresholding}
Vanilla SSD with entropy thresholding adds an additional component
to the filtering already done for vanilla SSD. The entropy is
calculated from all \(\#nr\_classes\) softmax scores in a prediction.
Only predictions with a low enough entropy pass the entropy
threshold and move on to the aforementioned per-class filtering.
This excludes very uniform predictions but cannot identify
false positive or false negative cases with high confidence values.
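A short sketch of this additional filter, using the same assumed
prediction layout as in the previous sketch; the default threshold of
1.0 is only a placeholder for the values between 0.1 and 2.4 that are
evaluated later.
\begin{verbatim}
import numpy as np

def entropy_filter(preds, n_classes, entropy_threshold=1.0):
    """Keep only predictions whose softmax entropy is below the threshold.

    preds: array of shape (nr_boxes, n_classes + 4); the first n_classes
    columns are the softmax scores (including background).
    """
    probs = preds[:, :n_classes]
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return preds[entropy < entropy_threshold]
\end{verbatim}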
\subsection{Bayesian SSD with Entropy Thresholding}
The distinguishing feature of Bayesian SSD is its multiple forward passes. Based
on the information in the paper, the detections of all forward passes
are grouped per image but not by forward pass. This leads
to the following shape of the network output after all
forward passes: \((batch\_size, \#nr\_boxes \cdot \#nr\_forward\_passes, \#nr\_classes + 12)\). The size of the output
increases linearly with more forward passes.
These detections have to be decoded first. Afterwards they are
partitioned into observations to reduce the size of the output, and
to identify uncertainty. This is accomplished by calculating the
mutual IOU of every detection with all other detections. Detections
with a mutual IOU score of 0.95 or higher are partitioned into an
observation. Next, the softmax scores and bounding box coordinates of
all detections in an observation are averaged.
There can be a different number of observations for every image, which
destroys homogeneity and prevents batch-wise calculation of the
results. The shape of the results is per image: \((\#nr\_observations,\#nr\_classes + 4)\).
Entropy is measured in the next step. All observations with too high
an entropy are discarded. Entropy thresholding in combination with
dropout sampling should improve the identification of false positives of
unknown classes. This rests on the assumption that uncertainty about
an object will result in different classifications across the multiple
forward passes. These
varying classifications are averaged into multiple lower confidence
values which should increase the entropy and, hence, flag the
observation for removal.
Per-class confidence thresholding, non-maximum suppression, and
top \(k\) selection happen as in vanilla SSD.
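The partitioning into observations can be sketched as follows. The
greedy grouping strategy and the assumed data layout (one array of
decoded detections per image, softmax scores followed by box corners)
are my own choices for illustration, as the exact procedure is not
fully specified in the paper.
\begin{verbatim}
import numpy as np

def pairwise_iou(boxes_a, boxes_b):
    """IoU matrix between two arrays of boxes given as (xmin, ymin, xmax, ymax)."""
    x1 = np.maximum(boxes_a[:, None, 0], boxes_b[None, :, 0])
    y1 = np.maximum(boxes_a[:, None, 1], boxes_b[None, :, 1])
    x2 = np.minimum(boxes_a[:, None, 2], boxes_b[None, :, 2])
    y2 = np.minimum(boxes_a[:, None, 3], boxes_b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area(boxes_a)[:, None] + area(boxes_b)[None, :] - inter + 1e-12)

def partition_observations(detections, iou_threshold=0.95):
    """Greedily group detections from all forward passes of one image.

    detections: array of shape (n_detections, n_classes + 4); each row holds
    the softmax scores followed by the box corners. Returns one row per
    observation with averaged scores and averaged box coordinates.
    """
    boxes = detections[:, -4:]
    iou = pairwise_iou(boxes, boxes)
    unassigned = np.ones(len(detections), dtype=bool)
    observations = []
    for i in range(len(detections)):
        if not unassigned[i]:
            continue
        members = unassigned & (iou[i] >= iou_threshold)        # mutual IoU >= 0.95
        unassigned &= ~members
        observations.append(detections[members].mean(axis=0))   # average scores and boxes
    if not observations:
        return np.zeros((0, detections.shape[1]))
    return np.array(observations)
\end{verbatim}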
\chapter{Experimental Setup and Results}
\label{chap:experiments-results}
\section{Data sets}
Data sets are usually not perfect when it comes to neural
networks: they contain outliers, invalid bounding boxes, and similar
problematic entries. Before a data set can be used, these problems
need to be removed.
For the MS COCO data set, all annotations were checked for
impossible values: bounding box height or width lower than zero,
\(x_{min}\) and \(y_{min}\) bounding box coordinates lower than zero,
\(x_{max}\) and \(y_{max}\) coordinates lower than or equal to zero, \(x_{min}\) greater than \(x_{max}\),
\(y_{min}\) greater than \(y_{max}\), image width lower than \(x_{max}\),
and image height lower than \(y_{max}\). In the last two cases the
bounding box width or height was set to (image width - \(x_{min}\)) or
(image height - \(y_{min}\)) respectively;
in the other cases the annotation was skipped.
If the resulting bounding box width or height was
lower than or equal to zero the annotation was skipped as well.
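The checks above can be summarised in a short sketch; the function
signature is illustrative and works on COCO-style annotations given as
\(x_{min}\), \(y_{min}\), width, and height.
\begin{verbatim}
def clean_annotation(x_min, y_min, width, height, image_width, image_height):
    """Return a corrected (x_min, y_min, width, height) tuple or None to skip.

    Mirrors the checks described above; the exact call signature is
    illustrative and not taken from the thesis code.
    """
    x_max, y_max = x_min + width, y_min + height
    # impossible values: skip the annotation entirely
    if width < 0 or height < 0 or x_min < 0 or y_min < 0:
        return None
    if x_max <= 0 or y_max <= 0 or x_min > x_max or y_min > y_max:
        return None
    # box extends beyond the image: clamp width/height to the image border
    if image_width < x_max:
        width = image_width - x_min
    if image_height < y_max:
        height = image_height - y_min
    # skip if the clamped box collapsed
    if width <= 0 or height <= 0:
        return None
    return x_min, y_min, width, height
\end{verbatim}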
In this thesis, SceneNet RGB-D is always used with COCO classes.
Therefore, a mapping between COCO and SceneNet RGB-D classes and vice versa
was necessary. It was created by manually going through each
WordNet ID and searching for a fitting COCO class.
The ground truth for SceneNet RGB-D is stored in protobuf files
and had to be converted into a Python format to use it in the
codebase. The trajectories are not sorted inside the protobuf;
therefore, the first step was to sort them. For each trajectory,
all instances are stored independently of the views in the
trajectory. Therefore, the trajectories and their respective
instances were looped through and all
background instances and those without a corresponding COCO class were
skipped. The rest was stored in a dictionary per trajectory.
Subsequently, all views of the trajectory were traversed and,
for every view, all stored instances were looped through.
For every instance, the segmentation map was modified by
setting all pixels not having the instance ID as value to zero
and the rest to one. If no such pixels were found the instance
was skipped for that view. Otherwise a copy of its data from the
aforementioned dictionary plus the bounding box information was
stored in a list of instances for that view. The list of instances
per view was added to a list of such lists for the trajectory.
Ultimately this list of lists was added to a global list across
all trajectories: a list of lists of lists.
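The per-view extraction of a binary instance mask and its bounding box
can be sketched as follows; the array layout and the function name are
assumptions for illustration.
\begin{verbatim}
import numpy as np

def instance_mask_and_box(segmentation_map, instance_id):
    """Binary mask for one instance in a view and its bounding box.

    segmentation_map: 2D array of instance IDs for one view.
    Returns (mask, (x_min, y_min, x_max, y_max)) or None if the instance
    does not appear in this view.
    """
    mask = (segmentation_map == instance_id).astype(np.uint8)  # 1 on the instance
    ys, xs = np.nonzero(mask)
    if xs.size == 0:              # instance not visible in this view: skip it
        return None
    return mask, (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
\end{verbatim}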
\section{Replication of Miller et al.}
Miller et al. use SSD for the object detection part. They compare
vanilla SSD, vanilla SSD with entropy thresholding, and the
Bayesian SSD with each other. The Bayesian SSD was created by
adding two dropout layers to the vanilla SSD; no other changes
were made. Miller et al. use weights that were trained on MS COCO
to predict on SceneNet RGB-D.
As the source code was not available, I had to implement Miller's
work myself. For the SSD network, I used an implementation that
is compatible with
Tensorflow\footnote{\url{https://github.com/pierluigiferrari/ssd\_keras}}; this implementation had to be
changed to work with eager mode. Further changes were made to
support entropy thresholding.
For the Bayesian variant, observations have to be calculated:
detections of multiple forward passes for the same image are averaged
into an observation. This algorithm was implemented based on the
information available in the paper. Beyond the observation
calculation, the Bayesian variant can use the same code as the
vanilla version with one exception: the model had to be duplicated
and two dropout layers added to transform SSD into a Bayesian
network.
The vanilla SSD did not provide meaningful detections on SceneNet
RGB-D with the pre-trained weights, and fine-tuning it on SceneNet
did not work either. Therefore, to better understand the SceneNet
RGB-D data set, I counted the number of instances per COCO class:
a huge class imbalance was visible, not just globally but also
between trajectories, as some classes are only present in some
trajectories. This makes training SSD on SceneNet practically
impossible.
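The counting itself can be sketched with a few lines; the nested
list-of-lists layout follows the conversion described in the previous
section and the dictionary key \texttt{coco\_class} is an assumption.
\begin{verbatim}
from collections import Counter

def count_instances(trajectories):
    """Count instances per COCO class, globally and per trajectory.

    trajectories: list (per trajectory) of lists (per view) of instance
    dictionaries, each assumed to carry a 'coco_class' entry.
    """
    global_counts = Counter()
    per_trajectory = []
    for trajectory in trajectories:
        counts = Counter(instance["coco_class"]
                         for view in trajectory for instance in view)
        per_trajectory.append(counts)
        global_counts.update(counts)
    return global_counts, per_trajectory
\end{verbatim}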
\section{Experimental Setup}
\section{Results}
\chapter{Discussion}
\label{chap:discussion}
To recap, the hypothesis is repeated here.
\begin{description}
\item[Hypothesis] Novelty detection using auto-encoders delivers similar or better object detection performance under open set conditions while being less computationally expensive compared to dropout sampling.
\end{description}
Based on the reported results, no clear answer can be given to the
research question; rather, new questions emerge: ``Can auto-encoders
work on realistic data sets like COCO with multiple different classes
in one image?'' In other words: ``Is my experience due to
implementation issues or a general theoretical problem of
auto-encoders?''
Despite best efforts, the results of Miller et al.~\cite{Miller2018}
could not be replicated. This failure alone does not show much, though:
to disprove Miller's work, any and all possible ways to replicate
their work would have to fail, whereas a single successful replication
would prove that it can be replicated. On the surface, both Miller et al.
and I used the same weights, the same network, and the same
data sets. The only difference of note is that they used a Caffe implementation
of SSD, whereas for this thesis the Tensorflow implementation with eager mode
was used.
\chapter{Closing}
\label{chap:closing}