% body thesis file that contains the actual content
\chapter{Introduction}

\subsection*{Motivation}

Famous examples like the automatic soap dispenser that does not recognize the hand of a black person but dispenses soap when presented with a paper towel raise the question of bias in computer systems\cite{Friedman1996}. Related to this ethical question regarding the design of so-called algorithms, a term often used in public discourse for applied neural networks, is the question of algorithmic accountability\cite{Diakopoulos2014}.

The charm of supervised neural networks, namely that they can learn from input-output relations and figure out by themselves which connections are necessary for that, is also their Achilles heel. This feature makes them effectively black boxes. It is possible to question the training environment, such as potential biases inside the data sets, or the engineers constructing the networks, but it is not really possible to question the internal calculations made by a network. On the one hand, one might argue, it is only math and nothing magical that happens inside these networks. Clearly it is possible, albeit a chore, to manually follow the calculations of any given trained network. After all, it is executed on a computer and at the lowest level only uses basic math that does not differ between humans and computers. On the other hand, not everyone is capable of doing so, and, more importantly, it does not reveal any answers to questions of causality.

However, these questions of causality are of enormous consequence when neural networks are used, for example, in predictive policing. Is a correlation, a coincidence, enough to bring forth negative consequences for a particular person? And if so, what is the possible defence against math? Similar questions can be raised when looking at computer vision networks that might be used together with so-called smart CCTV cameras, such as those tested at the train station Berlin Südkreuz. What if a network implies that you behaved suspiciously?

This leads to the need for neural networks to explain their results. Such an explanation must come from the network or an attached piece of technology to allow mass adoption. Obviously, this setting raises the question of how such an endeavour can be achieved.

For neural networks there are fundamentally two types of tasks: regression and classification. Regression deals with any case where the goal for the network is to come close to an ideal function that connects all data points. Classification, however, describes tasks where the network is supposed to identify the class of any given input. In this thesis, I will focus on classification.

\subsection*{Object Detection in Open Set Conditions}

More specifically, I will look at object detection under open set conditions. In non-technical words, this effectively describes the kind of situation you encounter with CCTV cameras or robots outside of a laboratory. Both use cameras that record images. Subsequently, a neural network analyses the image and returns a list of detected and classified objects that it found in the image. The problem here is that networks can only classify what they know. If presented with an object type that the network was not trained with, as happens frequently in real environments, it will still classify the object and might even have a high confidence in doing so. Such an example would be a false positive.
Any ordinary person who uses the results of such a network would falsely assume that a high confidence always means the classification is very likely correct. If they use a proprietary system, they might not even be able to find out that the network was never trained on a particular type of object. Therefore, it would be impossible for them to identify the output of the network as a false positive. This goes back to the need for automatic explanation. Such a system should by itself recognize that the given object is unknown and hence mark any classification result of the network as meaningless.

Technically, there are two slightly different concepts that deal with this type of task: model uncertainty and novelty detection. Model uncertainty can be measured with dropout sampling. Dropout layers are usually active only during training, but Miller et al\cite{Miller2018} use them during testing as well to obtain different results for the same image across multiple forward passes. The output scores for the forward passes of the same image are then averaged. If the averaged class probabilities resemble a uniform distribution (every class has the same probability), this signals maximum uncertainty. Conversely, if one probability is very high and every other is very low, this signifies low uncertainty. An unknown object is more likely to cause high uncertainty, which allows false positive cases to be identified.

Novelty detection is the more direct approach to solve the task. In the realm of neural networks it is usually done with the help of auto-encoders, which essentially solve the regression task of finding an identity function that reconstructs the given input at the output\cite{Pimentel2014}. Auto-encoders internally consist of at least two components: an encoder, and a decoder or generator. The job of the encoder is to find an encoding that compresses the input as well as possible while simultaneously being as loss-free as possible. The decoder takes this latent representation of the input and has to find a decompression that reconstructs the input as accurately as possible. During training these auto-encoders learn to reproduce a certain group of object classes. The actual novelty detection takes place during testing. Given an image and the output and loss of the auto-encoder, a novelty score is calculated. A low novelty score signals a known object; the opposite is true for a high novelty score.

\subsection*{Research Question}

Given these two approaches to the explanation task described above, it comes down to performance. At the end of the day the best theoretical idea does not help in solving the task if it cannot be implemented in a performant way. Miller et al have shown some success in using dropout sampling. However, the many forward passes during testing for every image seem computationally expensive. In comparison, a single run through a trained auto-encoder intuitively seems faster. This leads to the hypothesis (see below).

For the purpose of this thesis, I will use the work of Miller et al as the baseline to compare against. They use the SSD\cite{Liu2016} network for object detection, modified with added dropout layers, and the SceneNet RGB-D\cite{McCormac2017} data set using the MS COCO\cite{Lin2014} classes. Instead of dropout sampling, my approach will use an auto-encoder for novelty detection, with all else, such as using SSD for object detection and the SceneNet RGB-D data set, being equal.
With respect to auto-encoders, a recent implementation of an adversarial auto-encoder\cite{Pidhorskyi2018} will be used.

\paragraph{Hypothesis} Novelty detection using auto-encoders delivers similar or better object detection performance under open set conditions while being less computationally expensive compared to dropout sampling.

\paragraph{Contribution} The contribution of this thesis is a comparison between dropout sampling and auto-encoding with respect to the overall object detection performance of both under open set conditions, using the SSD network for object detection and the SceneNet RGB-D data set with MS COCO classes.

\chapter{Background and Contribution}

This chapter will provide a more in-depth look at the two works this thesis is based upon. First, the dropout sampling introduced by Miller et al\cite{Miller2018} will be showcased. Afterwards, the Generative Probabilistic Novelty Detection with Adversarial Autoencoders\cite{Pidhorskyi2018} will be presented. The chapter will conclude with a more detailed explanation of the intended contribution of this thesis. The dropout sampling explanation will follow the paper of Miller et al\cite{Miller2018} rather closely, including the formulae used in their paper.

\section{Dropout Sampling}

To understand dropout sampling, it is necessary to explain the idea of Bayesian neural networks. They place a prior distribution over the network weights, for example a Gaussian prior distribution: \(\mathbf{W} \sim \mathcal{N}(0, I)\). In this example \(\mathbf{W}\) are the weights and \(I\) is the identity matrix, signifying that every weight is drawn independently from the same distribution. The training of the network determines a plausible set of weights by evaluating the posterior distribution over the weights given the training data: \(p(\mathbf{W}|\mathbf{T})\). However, this evaluation cannot be performed in any reasonable time. Therefore, approximation techniques are required. In those techniques the posterior is fitted with a simple distribution \(q^{*}_{\theta}(\mathbf{W})\). The original and intractable problem of averaging over all weights in the network is replaced with an optimisation task in which the parameters of the simple distribution are optimised\cite{Kendall2017}.

\subsubsection*{Dropout Variational Inference}

Kendall and Gal\cite{Kendall2017} showed an approximation for classification and recognition tasks. Dropout variational inference is a practical approximation technique that adds dropout layers in front of every weight layer and uses them during test time as well to sample from the approximate posterior. Effectively, this results in the approximation of the class probability \(p(y|\mathcal{I}, \mathbf{T})\) by performing multiple forward passes through the network and averaging the obtained softmax scores \(\mathbf{s}_i\), given an image \(\mathcal{I}\) and the training data \(\mathbf{T}\):
\begin{equation} \label{eq:drop-sampling}
p(y|\mathcal{I}, \mathbf{T}) = \int p(y|\mathcal{I}, \mathbf{W}) \cdot p(\mathbf{W}|\mathbf{T})d\mathbf{W} \approx \frac{1}{n} \sum_{i=1}^{n}\mathbf{s}_i
\end{equation}
With this dropout sampling technique, \(n\) model weights \(\widetilde{\mathbf{W}}_i\) are sampled from the posterior \(p(\mathbf{W}|\mathbf{T})\). The class probability \(p(y|\mathcal{I}, \mathbf{T})\) is a probability vector \(\mathbf{q}\) over all class labels. Finally, the uncertainty of the network with respect to the classification is given by the entropy \(H(\mathbf{q}) = - \sum_i q_i \cdot \log q_i\).
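To make the sampling procedure more tangible, the following minimal sketch shows how it could be implemented for a Keras classification model with a softmax output. It relies on the Keras behaviour that dropout layers stay active when the model is called with \texttt{training=True}; the function itself is an illustrative sketch under these assumptions, not code by Miller et al\cite{Miller2018} or Kendall and Gal\cite{Kendall2017}.

\begin{verbatim}
import numpy as np


def dropout_sampling(model, image, n_passes=10):
    """Approximate p(y|I,T) by averaging the softmax scores of
    several stochastic forward passes with dropout kept active."""
    scores = []
    for _ in range(n_passes):
        # training=True keeps the dropout layers active at test time
        s = model(image[np.newaxis, ...], training=True)
        scores.append(s.numpy()[0])
    q = np.mean(scores, axis=0)               # averaged class probabilities
    entropy = -np.sum(q * np.log(q + 1e-12))  # H(q) as defined above
    return q, entropy
\end{verbatim}

\noindent A near-uniform \(\mathbf{q}\) then yields a high entropy, while a distribution dominated by a single class yields a low entropy.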
\subsubsection*{Dropout Sampling for Object Detection}

Miller et al\cite{Miller2018} apply dropout sampling to object detection. In that case \(\mathbf{W}\) represents the learned weights of a detection network like SSD\cite{Liu2016}. Every forward pass uses a different network \(\widetilde{\mathbf{W}}\), which is approximately sampled from \(p(\mathbf{W}|\mathbf{T})\). Each forward pass in object detection results in a set of detections, each consisting of bounding box coordinates \(\mathbf{b}\) and a softmax score \(\mathbf{s}\). The detections are denoted by Miller et al as \(D_i = \{\mathbf{s}_i,\mathbf{b}_i\}\). The detections of all passes are put into a large set \(\mathfrak{D} = \{D_1, ..., D_n\}\). All detections with mutual intersection-over-union scores (IoU) of \(0.95\) or higher are grouped into an observation \(\mathcal{O}_i\). Subsequently, the corresponding vector of class probabilities \(\mathbf{q}_i\) for the observation is calculated by averaging all score vectors \(\mathbf{s}_j\) in a particular observation \(\mathcal{O}_i\): \(\mathbf{q}_i \approx \overline{\mathbf{s}}_i = \frac{1}{n} \sum_{j=1}^{n} \mathbf{s}_j\). The label uncertainty of the detector for a particular observation is measured by the entropy \(H(\mathbf{q}_i) = - \sum_j q_{ij} \cdot \log q_{ij}\).

In the introduction I used a very reduced version to describe maximum and low uncertainty. A more complete explanation: if \(\mathbf{q}_i\), which I called the averaged class probabilities, resembles a uniform distribution, the entropy will be high. A uniform distribution means that no class is more likely than another, which is a perfect example of maximum uncertainty. Conversely, if one class has a very high probability, the entropy will be low. In open set conditions it can be expected that falsely generated detections for unknown object classes have a higher label uncertainty. A threshold on the entropy \(H(\mathbf{q}_i)\) can then be used to identify and reject these false positive cases.

\section{Adversarial Auto-encoder}

This section will explain the adversarial auto-encoder used by Pidhorskyi et al\cite{Pidhorskyi2018}, albeit in a slightly modified form to make it more understandable. The training data points \(x_i \in \mathbb{R}^m \) are the input of the auto-encoder. An encoding function \(e: \mathbb{R}^m \rightarrow \mathbb{R}^n\) takes the data points and produces a representation \(\overline{z_i} \in \mathbb{R}^n\) in a latent space. This latent space is smaller than the input space (\(n < m\)), which necessitates some form of compression. A second function \(g: \Omega \rightarrow \mathbb{R}^m\) is the generator function that takes the latent representation \(z_i \in \Omega \subset \mathbb{R}^n\) and generates an output \(\overline{x_i}\) as close as possible to the input data distribution.

What then is the difference between \(\overline{z_i}\) and \(z_i\)? With a simple auto-encoder both would be identical. In the case of an adversarial auto-encoder it is slightly more complicated. There is a discriminator \(D_z\) that tries to distinguish between an encoded data point \(\overline{z_i}\) and a \(z_i \sim \mathcal{N}(0,1)\) drawn from a normal distribution with \(0\) mean and a standard deviation of \(1\). During training, the encoding function \(e\) attempts to minimize any perceivable difference between \(z_i\) and \(\overline{z_i}\), while \(D_z\) has the aforementioned adversarial task of differentiating between them.
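For illustration, the components introduced so far could be set up as small Keras models, as in the following sketch. The fully connected layers, their sizes, and the latent dimension are simplifying assumptions made for readability; they do not reproduce the actual architecture used by Pidhorskyi et al\cite{Pidhorskyi2018}.

\begin{verbatim}
import tensorflow as tf
from tensorflow.keras import layers

latent_dim = 32   # n, dimension of the latent space (assumption)
input_dim = 784   # m, dimension of a flattened input (assumption)

# encoder e: R^m -> R^n
encoder = tf.keras.Sequential([
    layers.Input(shape=(input_dim,)),
    layers.Dense(256, activation="relu"),
    layers.Dense(latent_dim),
], name="encoder")

# generator g: R^n -> R^m
generator = tf.keras.Sequential([
    layers.Input(shape=(latent_dim,)),
    layers.Dense(256, activation="relu"),
    layers.Dense(input_dim, activation="sigmoid"),
], name="generator")

# discriminator D_z: tells encoded points apart from N(0,1) samples
d_z = tf.keras.Sequential([
    layers.Input(shape=(latent_dim,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
], name="d_z")
\end{verbatim}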
Furthermore, there is a discriminator \(D_x\) that has the task of differentiating the generated output \(\overline{x_i}\) from the actual input \(x_i\). During training, the generator function \(g\) tries to minimize the perceivable difference between \(\overline{x_i}\) and \(x_i\), while \(D_x\) has the aforementioned adversarial task of distinguishing between them. With this, all components of the adversarial auto-encoder employed by Pidhorskyi et al are introduced.

Finally, the losses are presented. The two adversarial objectives have been mentioned already. Specifically, there is the adversarial loss for the discriminator \(D_z\):
\begin{equation} \label{eq:adv-loss-z}
\mathcal{L}_{adv-d_z}(x,e,D_z) = E[\log (D_z(\mathcal{N}(0,1)))] + E[\log (1 - D_z(e(x)))],
\end{equation}
\noindent where \(E\) stands for an expected value\footnote{a term used in probability theory}, \(x\) stands for the input, and \(\mathcal{N}(0,1)\) represents an element drawn from the specified distribution. The encoder \(e\) attempts to minimize this loss, while the discriminator \(D_z\) intends to maximize it. In the same way the adversarial loss for the discriminator \(D_x\) is specified:
\begin{equation} \label{eq:adv-loss-x}
\mathcal{L}_{adv-d_x}(x,D_x,g) = E[\log(D_x(x))] + E[\log(1 - D_x(g(\mathcal{N}(0,1))))],
\end{equation}
\noindent where \(x\), \(E\), and \(\mathcal{N}(0,1)\) have the same meaning as before. In this case the generator \(g\) tries to minimize the loss, while the discriminator \(D_x\) attempts to maximize it.

Every auto-encoder requires a reconstruction error to work. This error measures the difference between the original input and the generated or decoded output. In this case, the reconstruction loss is defined as follows:
\begin{equation} \label{eq:recon-loss}
\mathcal{L}_{error}(x, e, g) = - E[\log(p(g(e(x)) | x))],
\end{equation}
\noindent where \(p(g(e(x)) | x)\) is the likelihood of the reconstruction given the input, and \(x\), \(E\), \(e\), and \(g\) have the same meaning as before. All losses combined result in the following formula:
\begin{equation} \label{eq:full-loss}
\mathcal{L}(x,e,D_z,D_x,g) = \mathcal{L}_{adv-d_z}(x,e,D_z) + \mathcal{L}_{adv-d_x}(x,D_x,g) + \lambda \mathcal{L}_{error}(x,e,g),
\end{equation}
\noindent where \(\lambda\) is a parameter used to balance the adversarial losses with the reconstruction loss. The model is trained by Pidhorskyi et al using the Adam optimizer by performing alternating updates of the aforementioned components:
\begin{itemize}
\item Maximize \(\mathcal{L}_{adv-d_x}\) by updating the weights of \(D_x\);
\item Minimize \(\mathcal{L}_{adv-d_x}\) by updating the weights of \(g\);
\item Maximize \(\mathcal{L}_{adv-d_z}\) by updating the weights of \(D_z\);
\item Minimize \(\mathcal{L}_{error}\) and \(\mathcal{L}_{adv-d_z}\) by updating the weights of \(e\) and \(g\).
\end{itemize}

Practically, the auto-encoder is trained separately for every object class that is considered ``known''. Pidhorskyi et al trained it on the MNIST\cite{Lecun1998} data set, once for every digit. For this thesis it needs to be trained on the SceneNet RGB-D data set using the MS COCO classes as known classes. As all known classes are present in every test epoch, it is non-trivial to decide which of the trained auto-encoders should be used to calculate novelty. To phrase it differently, a true positive detection is possible for multiple classes in the same image. If, for example, one object is correctly classified by SSD as a chair, the novelty score should be low.
However, the auto-encoders of all known classes except the ``chair'' class will ideally give a high novelty score. Which of these values should be used? The only sensible solution is to run the detection only through the auto-encoder that was trained for the class the SSD model predicted. This results in the following scenarios:
\begin{itemize}
\item true positive classification: the novelty score should be low
\item false positive classification and the correct class is among the known classes: the novelty score should be high
\item false positive classification and the correct class is unknown: the novelty score should be high
\end{itemize}
\noindent Negative classifications are not listed, as these are not part of the output of the SSD and cannot be given to the auto-encoder as input. Furthermore, the second case should not happen, because the trained SSD knows this other class and is very likely to give it a higher probability. Therefore, using only one auto-encoder fulfils the task of differentiating between known and unknown classes.

\section{Generative Probabilistic Novelty Detection}

It is still unclear how the novelty score is calculated. This section will clarify this in terms as understandable as possible. However, the name ``Generative Probabilistic Novelty Detection''\cite{Pidhorskyi2018} already signals that probability theory has something to do with it. Furthermore, this section will make use of some mathematical terms which cannot be explained in great detail here. Moreover, the previous section already introduced many required components, which will not be explained here again.

For the purpose of this explanation a trained auto-encoder is assumed. In that case the generator function describes the model that the auto-encoder actually uses for the novelty detection. The task of training is to make sure this model comes as close as possible to the real model of the training or testing data. The model of the auto-encoder is, in mathematical terms, a parameterized manifold \(\mathcal{M} \equiv g(\Omega)\) of dimension \(n\). The set of training or testing data can then be described in the following way:
\begin{equation} \label{eq:train-set}
x_i = g(z_i) + \xi_i \quad i \in \mathbb{N},
\end{equation}
\noindent where \(\xi_i\) represents noise. It may be confusing, but for the purpose of this novelty test the ``truth'' is what the generator function generates from a set of \(z_i \in \Omega\), not the ground truth from the data set. Furthermore, the previously introduced encoder function \(e\) is assumed to work as an exact inverse of \(g\) for every \(x \in \mathcal{M}\). For such \(x\) it follows that \(x = g(e(x))\).

Let \(\overline{x} \in \mathbb{R}^m\) be a data point from the test data. The remainder of the section will explain how the novelty test is performed for this \(\overline{x}\). It is important to note that this data point is not necessarily part of the auto-encoder model. Therefore, \(g(e(\overline{x})) = \overline{x}\) cannot be assumed. However, it can be observed that \(\overline{x}\) can be non-linearly projected onto \(\overline{x}^{\|} \in \mathcal{M}\) by using \(g(\overline{z})\) with \(\overline{z} = e(\overline{x})\). It is assumed that \(g\) is smooth enough to perform a linearization based on the first-order Taylor expansion:
\begin{equation} \label{eq:taylor-expanse}
g(z) = g(\overline{z}) + J_g(\overline{z}) (z - \overline{z}) + \mathcal{O}(\| z - \overline{z} \|^2),
\end{equation}
\noindent where \(J_g(\overline{z})\) is the Jacobian matrix of \(g\) computed at \(\overline{z}\).
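As a concrete illustration of this step, the following sketch computes \(\overline{z} = e(\overline{x})\), the projection \(g(\overline{z})\), and the Jacobian matrix \(J_g(\overline{z})\) with \texttt{tf.GradientTape}. It assumes the flattened-input encoder and generator sketched in the previous section and is only meant as an illustration, not as the implementation of Pidhorskyi et al\cite{Pidhorskyi2018}.

\begin{verbatim}
import tensorflow as tf


def project_and_linearize(encoder, generator, x_bar):
    """Project a test point onto the manifold and compute the Jacobian
    of the generator at the projected latent code z_bar = e(x_bar)."""
    x_bar = tf.convert_to_tensor(x_bar, dtype=tf.float32)[tf.newaxis, ...]
    z_bar = encoder(x_bar)              # encode the test point
    with tf.GradientTape() as tape:
        tape.watch(z_bar)
        x_parallel = generator(z_bar)   # projection onto the manifold
    # Jacobian of g at z_bar, shape (m, n) after dropping the batch axis
    jacobian = tape.batch_jacobian(x_parallel, z_bar)[0]
    return z_bar[0], x_parallel[0], jacobian
\end{verbatim}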
It is assumed that the Jacobian matrix of \(g\) has full rank at every point of the manifold. A Jacobian matrix contains all first-order partial derivatives of a function. \(\| \cdot \|\) is the \(\mathbf{L}_2\) norm, which measures the length of a vector as the square root of the sum of the squares of its components. Lastly, \(\mathcal{O}\) is the Big-O notation; here it denotes the remainder of the Taylor expansion, which is of quadratic order in \(\| z - \overline{z} \|\) and can therefore be neglected for \(z\) close to \(\overline{z}\).

Next the tangent space of \(g\) at \(\overline{x}^{\|}\), which is spanned by the \(n\) independent column vectors of the Jacobian matrix \(J_g(\overline{z})\), is defined as \(\mathcal{T} = \text{span}(J_g(\overline{z}))\). The tangent space at a point of the manifold contains all directions in which one can move along the manifold from that point. The Jacobian matrix can be decomposed into three matrices using singular value decomposition: \(J_g(\overline{z}) = U^{\|}SV^{*}\). \(\mathcal{T}\) is then also spanned by the column vectors of \(U^{\|}\): \(\mathcal{T} = \text{span}(U^{\|})\). \(U^{\|}\) contains the left-singular vectors, \(S\) is a diagonal matrix of the singular values, and \(V^{*}\) is the conjugate transpose of the matrix \(V\), which contains the right-singular vectors. \(U^{\bot}\) is defined in such a way that \(U = [U^{\|}U^{\bot}]\) is a unitary matrix. \(\mathcal{T^{\bot}}\) is the orthogonal complement of \(\mathcal{T}\). With this preparation \(\overline{x}\) can be represented with respect to the local coordinates that define \(\mathcal{T}\) and \(\mathcal{T}^{\bot}\). This representation can be achieved by computing
\begin{equation} \label{eq:w-definition}
\overline{w} = U^{\top} \overline{x} = \left[\begin{matrix} U^{\|^{\top}} \overline{x} \\ U^{\bot^{\top}} \overline{x} \end{matrix}\right] = \left[\begin{matrix} \overline{w}^{\|} \\ \overline{w}^{\bot} \end{matrix}\right],
\end{equation}
\noindent where \(\overline{w}\) contains the rotated coordinates of the data point (the data point expressed in the basis given by \(U\)), decomposed into \(\overline{w}^{\|}\), which is parallel to \(\mathcal{T}\), and \(\overline{w}^{\bot}\), which is orthogonal to \(\mathcal{T}\).

The last step to define the novelty test involves probability density functions (PDFs), which are now introduced. The PDF \(p_X(x)\) describes the random variable \(X\), from which the training and testing data points are drawn. In addition, \(p_W(w)\) is the probability density function of the random variable \(W\), which represents \(X\) after the change of coordinates. Both describe the same distribution, as the change of coordinates is merely a rotation. Furthermore, it is assumed that the coordinates \(W^{\|}\), which are parallel to \(\mathcal{T}\), and the coordinates \(W^{\bot}\), which are orthogonal to \(\mathcal{T}\), are statistically independent. With this assumption the following holds:
\begin{equation} \label{eq:pdf-x}
p_X(x) = p_W(w) = p_W(w^{\|}, w^{\bot}) = p_{W^{\|}}(w^{\|}) p_{W^{\bot}}(w^{\bot})
\end{equation}
The previously introduced noise comes into play again. In formula (\ref{eq:train-set}) it is assumed that the noise \(\xi\) predominantly deviates the point \(x\) away from the manifold \(\mathcal{M}\) in a direction orthogonal to \(\mathcal{T}\). As a consequence, \(W^{\bot}\) is mainly responsible for the noise effects. Since the noise and the drawing from the manifold are statistically independent, \(W^{\|}\) and \(W^{\bot}\) are independent as well.
Finally, referring back to the data point \(\overline{x}\), the novelty test is defined as follows:
\begin{equation} \label{eq:novelty-test}
p_X(\overline{x}) = p_{W^{\|}}(\overline{w}^{\|})p_{W^{\bot}}(\overline{w}^{\bot}) = \begin{cases} \geq \gamma & \Longrightarrow \text{Inlier} \\ < \gamma & \Longrightarrow \text{Outlier} \end{cases}
\end{equation}
\noindent where \(\gamma\) is a suitable threshold. At this point it is very clear that the GPND approach requires far more mathematical background than dropout sampling to understand the novelty test. Nonetheless, it could be the better method.

\section{Contribution}

This section will outline what exactly the scientific and technical contributions of this thesis will be.

\subsection*{Scientific Contribution}

Miller et al\cite{Miller2018} use the SSD\cite{Liu2016} network extended with dropout layers and run multiple forward passes during the testing phase for every image. Considering the number of images in the SceneNet RGB-D\cite{McCormac2017} data set, these forward passes will take considerable time. It could be faster to only run one forward pass and then use the auto-encoder for novelty detection. However, the auto-encoder can only work with one detection at a time and must be called separately for every detection of the object detector. Therefore, it is interesting to investigate whether the second approach is indeed faster than the first.

Dropout sampling uses the entropy to identify false positive cases. Such identified detections are discarded, which allows for a better object detection performance. The GPND approach uses the auto-encoder losses and results to identify novel cases and therefore mark detections as false positive. Subsequently, these detections can be discarded as well. By comparing the object detection performance after discarding the identified false positive cases, the effectiveness of both approaches can be compared. It is interesting to research whether the GPND approach results in a better object detection performance than dropout sampling provides.

The formulated hypothesis, which is repeated after this paragraph, combines both aspects and requires a similar or better result in both of them. As a consequence, it will be falsified if the computational performance of the GPND approach is not better than that of dropout sampling or if the object detection performance is worse.

\paragraph{Hypothesis} Novelty detection using auto-encoders delivers similar or better object detection performance under open set conditions while being less computationally expensive compared to dropout sampling.\\
There are three possible scenarios that can be the result of the thesis:
\begin{itemize}
\item the hypothesis is confirmed: a win-win situation where switching to GPND is straightforward.
\item one of the conditions fails: a win-lose situation where there is a trade-off between object detection performance and computational performance. One approach will be better in one aspect and the other approach in the other.
\item both conditions fail: a lose-lose situation where dropout sampling is better in both aspects.
\end{itemize}

Summarising, the scientific contribution is a comparison between dropout sampling and GPND with respect to both object detection performance and computational performance under open set conditions, using the SceneNet RGB-D data set with the MS COCO classes as ``known'' object classes. The computational performance is measured by the time in milliseconds that every test run takes.
Of interest are not the absolute numbers, as these vary from machine to machine and are influenced by a plethora of uncontrollable factors, but the relative difference between both approaches and whether that difference is significant. Object detection performance is measured by precision, recall, F1-score, and an open set error. While the first three metrics are standard, the last is adapted from Miller et al. It is defined as the number of observations (for dropout sampling) or detections (for GPND) that pass the respective false positive test (entropy or novelty), fall on unknown objects (there are no overlapping ground truth objects with IoU \(\geq 0.5\) and a known true class label), and do not have a winning class label of ``unknown''.

\subsection*{Technical Contribution}

The technical contribution includes all contributions that are not necessarily new in the scientific sense but are meaningful engineering contributions in themselves. There is no available source code for the work of Miller et al\cite{Miller2018}, which necessitates a re-implementation of their work by myself. The contribution is the fine-tuning of an SSD model pre-trained on ImageNet\cite{Deng2009}, extended with dropout layers, on the SceneNet RGB-D data set using MS COCO classes as the known classes for SSD. As MS COCO classes are more general than SceneNet RGB-D classes, this also requires a mapping from one set of classes to the other. This entire contribution is technical and only re-implements what Miller et al have already done. It is expected that the evaluation of the results using this self-trained model will reproduce the results of Miller et al.

For GPND, source code is available, but only for MNIST and only in PyTorch. Therefore, the source code has to be ported from PyTorch to Tensorflow. Furthermore, it must be made compatible with SceneNet RGB-D, as the architecture is tailored to MNIST. The mapping from SceneNet RGB-D to MS COCO applies here as well and can therefore be considered a separate contribution. A fine-tuned SSD is also required, but this time without added dropout layers. Additionally, it is necessary to train the auto-encoder for every known class separately. To summarise, the following separate deliverables are contributed:
\begin{itemize}
\item source code for dropout sampling compatible with Tensorflow
\item source code for GPND compatible with Tensorflow
\item mapping from SceneNet RGB-D classes to MS COCO classes
\item vanilla SSD model fine-tuned on SceneNet RGB-D
\item dropout SSD model fine-tuned on SceneNet RGB-D
\item auto-encoder model trained separately on every MS COCO class
\end{itemize}

\chapter{Thesis as a Project}

After introducing the topic and the general task ahead, this part of the exposé will focus on how to get there. This includes a timetable with SMART goals as well as an outline of the software development practices used for implementing the code for this thesis.

\section{Software Development}

Most scientific implementations found on GitHub are not written with distribution in mind. They usually require manual cloning of the repository, have poor code documentation, and do not follow common coding standards. This is bad enough by itself but becomes a real nuisance if you want to use those implementations in your own code. As they are not set up as Python packages, using them usually requires manual workarounds to make them usable as library code, for example inside a Python package.
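For comparison, turning a repository into an installable package mainly requires a \texttt{setup.py}. The following minimal sketch illustrates this for the \texttt{masterthesis} package mentioned later in the timetable; the version number, the dependency list, and the CLI entry point are placeholders rather than final design decisions.

\begin{verbatim}
# setup.py -- minimal packaging sketch; metadata, dependencies, and
# the CLI entry point are placeholders, not final design decisions
from setuptools import setup, find_packages

setup(
    name="masterthesis",
    version="0.1.0",
    packages=find_packages(),
    install_requires=["tensorflow", "numpy"],
    entry_points={
        "console_scripts": [
            # exposes the documented CLI as an installable command
            "masterthesis=masterthesis.cli:main",
        ]
    },
)
\end{verbatim}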
The code of this thesis will be developed from the start inside a Python package structure, which will make it easy to include it later on as a dependency for other work. After the thesis has been graded, the package will be uploaded to the PyPI package repository and the corresponding Git repository will be made publicly available. Any required third-party implementations that are not already available as Python packages, like the SSD implementation for Keras, will be included as library code in accordance with their respective licences.

A large chunk of the code will be written as library-ready code that can be used in other applications. Only a small part will provide the interface to the library code. The specifics of the interface cannot be predicted ahead of time, but it will certainly include a properly documented CLI, as that will be necessary for the work of the thesis itself. Tensorflow will be used as the deep learning framework. To make the code future-proof, the eager execution mode will be used, as it is the default in Tensorflow 2.0\footnote{\url{https://medium.com/tensorflow/whats-coming-in-tensorflow-2-0-d3663832e9b8}}.

\section{Stretch Goals}

There are a number of goals that are not tightly included in the following timetable. Those are optional add-ons that are nice to have but not critical for the successful completion of the thesis.
\begin{itemize}
\item make my own approach work on the YCB-Video data set\cite{Xiang2017}
\item test dropout sampling and my own approach on a data set self-recorded with a robot arm and a mounted Kinect
\item provide a GUI to freely select an image to be classified by the trained model and see a visualization of the result
\end{itemize}

\section{Timetable}

This timetable is structured by milestones that I want to achieve. Every milestone has the related tasks grouped beneath it. The scheduling is done with respect to my full personal calendar and will only account for Monday through Friday at most. Weekends will not be scheduled work time for the thesis. This leaves some additional, if unreliable, emergency buffer at the end in case things do not proceed as planned. Furthermore, I will only be able to regularly plan the time between 11 am and 5 pm for working on the thesis, as the evenings are mostly full and, regardless of that fact, I do want to reserve free time.

\paragraph{Main tasks} Everything but the stretch goals is non-optional, which makes the term ``main task'' rather difficult to grasp. The term implies that all other tasks are nice to have but not required. Therefore, I have chosen to use milestones as the highest grouping level instead.

\subsection*{Milestones}

The detailed timetable starts in the next subsection. A summary of the timetable regarding the milestones is presented here.
\begin{enumerate}
\item Environment set up: Due date 20th March
\item Fine-tuned SSD on SceneNet RGB-D: Due date 5th April
\item Fine-tuned GPND on SceneNet RGB-D: Due date 12th April
\item Networks evaluated: Due date 10th May
\item Visualizations created: Due date 31st May
\item Stretch Goals/Buffer: Due date 27th June
\item Thesis writing: Due date 30th August
\item Finishing touches: Due date 13th September
\end{enumerate}

\subsection*{Environment set up}
\textbf{Due date:} 20th March
\begin{description}
\item[Download SceneNet RGB-D to cvpc\{7,8\} computer] \hfill \\
Requires an external resource.
\end{description}

\subsection*{Fine-tuned SSD on SceneNet RGB-D}
\textbf{Due date:} 5th April
\begin{description}
\item[Download pre-trained weights of SSD for MS COCO] \hfill \\
This is trivial.
It takes no more than two hours.
\item[Modify SSD Keras implementation to work inside masterthesis package] \hfill \\
This should be achievable within one day.
\item[Implement integration of SSD into masterthesis package] \hfill \\
This covers the glue code between the Git submodule and my own code. It should be doable within one day.
\item[Group SceneNet RGB-D classes to MS COCO classes] \hfill \\
SceneNet contains more classes than COCO. Miller et al have grouped, for example, various chair classes in SceneNet into one chair class of COCO. This grouping involves researching the 80 classes of COCO, finding all related SceneNet classes, and then writing a mapper between them. All in all, this could take up a full day and perhaps slip into a second one.
\item[Implement variant of SSD with dropout layers (Bayesian SSD)] \hfill \\
This is a rather trivial task, as it only involves adding two Keras dropout layers to SSD. It can be done in one hour.
\item[Fine-tune vanilla SSD on SceneNet RGB-D] \hfill \\
Requires an external resource, and the length of the required training is unknown. Due to these two unknown factors (availability of the resource and length of training), this task can be considered a project risk.
\item[Fine-tune Bayesian SSD on SceneNet RGB-D] \hfill \\
The same remarks as for the previous task apply.
\end{description}

The tasks prior to the training could be finished by 21st March if work starts on the 18th. Buffer time extends to 25th March. Training is scheduled to commence as early as possible but no later than 26th March. Since the SSD network is a proven one, I am confident that this milestone can be reached, and the time between 26th March and 5th April should provide more than enough time for training. Once training has started, I can work on tasks from other milestones so that the training time is used as efficiently as possible.

\subsection*{Fine-tuned GPND on SceneNet RGB-D}
\textbf{Due date:} 12th April
\begin{description}
\item[Adapt GPND implementation for SceneNet RGB-D using COCO classes] \hfill \\
This requires research to figure out the exact architecture needed for a different data set. The code is not well documented and some logical variables like the image size are sometimes hard-coded, which makes this adaptation difficult and error-prone. Furthermore, some trial and error regarding training success is likely needed, which makes this task a project risk. If the needed architecture were known, the time to implement it would be at most one day. The uncertainty therefore lies with the research part.
\item[Implement novelty score calculation for GPND] \hfill \\
The original authors' code contains an implementation of this. It would have to be ported to Tensorflow and integrated into the package structure. This will likely take one or two days.
\item[Apply insights of GAN stability to GPND implementation] \hfill \\
The insights from the GAN stability\footnote{\url{https://avg.is.tuebingen.mpg.de/publications/meschedericml2018}} research should be applied to my GPND implementation. This requires research into which insights, if any, can be used for this thesis. The research is doable within one day and its application within another.
\item[Train GPND on SceneNet RGB-D] \hfill \\
Requires an external resource. In contrast to the SSD network, there are no pre-trained weights available for the GPND. Therefore, it has to be trained from scratch. Furthermore, it has to be trained for every class separately, which prolongs the training even further.
This task can therefore be classified as a project risk.
\end{description}

I will only be able to start working on these tasks on April 1st. Assuming that the research in the first task goes well, I will be able to finish the preparatory work on April 5th. Training could start as early as April 5th. The seven days until the due date of April 12th are tight and it may take longer, but this is the aggressive date I will work towards.

\subsection*{Networks evaluated}
\textbf{Due date:} 10th May
\begin{description}
\item[Implement evaluation pipeline for vanilla SSD] \hfill \\
This involves implementing the evaluation steps according to the chosen metrics. It will likely take two days.
\item[Implement evaluation pipeline for Bayesian SSD] \hfill \\
This involves implementing the evaluation steps for the Bayesian variant. As more has to be done, it will likely take three days.
\item[Implement evaluation pipeline for SSD with GPND for novelty score] \hfill \\
This involves implementing the evaluation steps for my approach. It will probably take two days.
\item[Run vanilla SSD on test data] \hfill \\
The trained network is run on the test data and the results are stored. This requires an external resource but should be far quicker than the training; it will probably be done in two days at most.
\item[Run Bayesian SSD on test data] \hfill \\
The same remarks as for the previous task apply.
\item[Run vanilla SSD detections through GPND] \hfill \\
For my approach the SSD detections need to be run through the GPND to obtain all the relevant data for the evaluation. Requires an external resource. This will likely take two days.
\item[Calculate evaluation metrics for vanilla SSD] \hfill \\
Takes one day.
\item[Calculate evaluation metrics for Bayesian SSD] \hfill \\
Takes one day.
\item[Calculate evaluation metrics for vanilla SSD with GPND] \hfill \\
Takes one day.
\end{description}

If I can start the preparatory work on April 15th, it should be done by April 23rd. The testing runs can begin as early as April 24th and should finish around April 30th. This leaves the week from May 6th up to the due date to finish the calculations, which can happen on the CPU, as all the data is already there by then.

\subsection*{Visualizations created}
\textbf{Due date:} 31st May\\
I will not be able to work on the thesis between May 13th and May 26th due to the election campaign. I am already involved in the campaign as of this writing, but I hope that up until May 10th both thesis and campaign can somewhat co-exist. The visualizations should be creatable within the week from May 27th to May 31st.

\subsection*{Stretch goals}
\textbf{Due date:} 27th June\\
As I mentioned earlier, there are no specific tasks for the stretch goals. If the critical path is finished by the end of May as planned, then the month of June is available for stretch goals. If the critical path is not finished, then June serves as a buffer zone to prevent spillover into the writing period.

\subsection*{Thesis writing}
\textbf{Due date:} 30th August\\
A first complete draft of the thesis should be finished by August 16th at the latest. The following week I will not be able to work on the thesis, but it can be used for feedback. The last week of August should allow for polishing the thesis, with a submission-ready candidate by August 30th.

\subsection*{Finishing touches}
\textbf{Due date:} 13th September\\
The submission requires three printed copies of the thesis, together with any digital components on a CD glued to the back of the thesis.
A non-editable CD ensures that the submitted code cannot be modified and will be exactly as submitted when reviewed. I will use these two weeks to print the copies and to take the last publication steps for the code, like improving the code documentation and adding usage examples.

\subsection*{Colloquium}

Last but not least is the colloquium, which will probably take place in the second half of September. I will prepare a presentation for the colloquium in the time before that date.

\section{Project Risks}

In this section, other project risks are listed in addition to those indicated in the timetable section. The workload for the election campaign, in which I have an organisational responsibility in addition to being a candidate myself, could come into conflict with the progress of the thesis. The availability of the external resource can hinder progress and delay steps of the thesis. In such a case, dependent tasks cannot commence until the earlier task has been finished, resulting in an overall delay of the thesis.

To deal with these risks, I have planned for one whole month of buffer time that can absorb many delays. Furthermore, the writing time is intentionally long, as it is difficult to predict how inspired I will be. I know from my bachelor thesis that on some days you can write many pages, while on others you barely make one page of progress. I would argue that the success of the thesis largely depends on the first part of the work, as it can make or break it. Once the technical part is done, the way forward should be mostly downhill.