From 652c86040437b4fd18c01ad74884b3714986fc12 Mon Sep 17 00:00:00 2001
From: Jim Martens
Date: Sun, 28 Jul 2019 14:19:31 +0200
Subject: [PATCH] Added thesis files

Signed-off-by: Jim Martens
---
 body.tex   | 604 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 thesis.tex |  45 ++++
 2 files changed, 649 insertions(+)
 create mode 100644 body.tex
 create mode 100644 thesis.tex

diff --git a/body.tex b/body.tex
new file mode 100644
index 0000000..d4391d8
--- /dev/null
+++ b/body.tex
@@ -0,0 +1,604 @@
+% body thesis file that contains the actual content
+
+\chapter{Introduction}
+
+\subsection*{Motivation}
+
+Famous examples like the automatic soap dispenser which does not
+recognize the hand of a black person but dispenses soap when presented
+with a paper towel raise the question of bias in computer
+systems\cite{Friedman1996}. Related to this ethical question regarding
+the design of so-called algorithms, a term often used in public
+discourse for applied neural networks, is the question of
+algorithmic accountability\cite{Diakopoulos2014}.
+
+The charm of supervised neural networks, namely that they can learn
+from input-output relations and figure out by themselves which
+connections are necessary for that, is also their Achilles heel. This
+feature makes them effectively black boxes. It is possible to question
+the training environment, like potential biases inside the data sets,
+or the engineers constructing the networks, but it is not really
+possible to question the internal calculations made by a network. On
+the one hand, one might argue, it is only math and nothing magical
+that happens inside these networks. Clearly it is possible, albeit a
+chore, to manually follow the calculations of any given trained
+network. After all, it is executed on a computer and at the lowest
+level only uses basic math that does not differ between humans and
+computers. On the other hand, not everyone is capable of doing so
+and, more importantly, it does not reveal any answers to questions
+of causality.
+
+However, these questions of causality are of enormous consequence when
+neural networks are used, for example, in predictive policing. Is a
+correlation, a coincidence, enough to bring forth negative consequences
+for a particular person? And if so, what is the possible defence
+against math? Similar questions can be raised when looking at computer
+vision networks that might be used together with so-called smart
+CCTV cameras, for example, like those tested at the train station
+Berlin Südkreuz. What if a network implies you engaged in suspicious
+behaviour?
+
+This leads to the need for neural networks to explain their results.
+Such an explanation must come from the network or an attached piece
+of technology to allow mass adoption. This poses the question of
+how such an endeavour can be achieved.
+
+For neural networks there are fundamentally two types of tasks:
+regression and classification. Regression deals with any case
+where the goal for the network is to come close to an ideal
+function that connects all data points. Classification, however,
+describes tasks where the network is supposed to identify the
+class of any given input. In this thesis, I will focus on
+classification.
+
+\subsection*{Object Detection in Open Set Conditions}
+
+More specifically, I will look at object detection under open set
+conditions. In non-technical words, this effectively describes
+the kind of situation you encounter with CCTV cameras or robots
+outside of a laboratory.
+Both use cameras that record images. Subsequently, a neural network
+analyses the image and returns a list of detected and classified
+objects that it found in the image. The problem here is that networks
+can only classify what they know. If presented with an object type
+that the network was not trained with, as happens frequently in real
+environments, it will still classify the object and might even
+have a high confidence in doing so. Such a detection would be
+a false positive. Any ordinary person who uses the results of
+such a network would falsely assume that a high confidence always
+means the classification is very likely correct. If they use
+a proprietary system, they might not even be able to find out
+that the network was never trained on a particular type of object.
+Therefore it would be impossible for them to identify the output
+of the network as a false positive.
+
+This goes back to the need for automatic explanation. Such a system
+should by itself recognize that the given object is unknown and
+hence mark any classification result of the network as meaningless.
+Technically, there are two slightly different concepts that deal
+with this type of task: model uncertainty and novelty detection.
+
+Model uncertainty can be measured with dropout sampling.
+Dropout is usually used only during training but
+Miller et al\cite{Miller2018} use it also during testing
+to achieve different results for the same image, making use of
+multiple forward passes. The output scores for the forward passes
+of the same image are then averaged. If the averaged class
+probabilities resemble a uniform distribution (every class has
+the same probability), this signals maximum uncertainty. Conversely,
+if one probability is very high and every other is very low,
+this signifies low uncertainty. An unknown object is more
+likely to cause high uncertainty, which allows for an identification
+of false positive cases.
+
+Novelty detection is the more direct approach to solve the task.
+In the realm of neural networks it is usually done with the help of
+auto-encoders that essentially solve a regression task of finding an
+identity function that reconstructs the given input on the
+output\cite{Pimentel2014}. Auto-encoders internally have
+at least two components: an encoder, and a decoder or
+generator. The job of the encoder is to find an encoding that
+compresses the input as well as possible while simultaneously
+being as loss-free as possible. The decoder takes this latent
+representation of the input and has to find a decompression
+that reconstructs the input as accurately as possible. During
+training, these auto-encoders learn to reproduce a certain group
+of object classes. The actual novelty detection takes place
+during testing. Given an image, and the output and loss of the
+auto-encoder, a novelty score is calculated. A low novelty
+score signals a known object. The opposite is true for a high
+novelty score.
+
+\subsection*{Research Question}
+
+Given these two approaches to the explanation task described above,
+it comes down to performance. At the end of the day the best
+theoretical idea does not help in solving the task if it cannot
+be implemented in a performant way. Miller et al have shown
+some success in using dropout sampling. However, the many forward
+passes during testing for every image seem computationally expensive.
+In comparison, a single run through a trained auto-encoder
+intuitively seems faster. This leads to the hypothesis (see below).
+
+For the purpose of this thesis, I will
+use the work of Miller et al as a baseline to compare against.
+They use the SSD\cite{Liu2016} network for object detection,
+modified by added dropout layers, and the SceneNet
+RGB-D\cite{McCormac2017} data set using the MS COCO\cite{Lin2014}
+classes. Instead of dropout sampling, my approach will use
+an auto-encoder for novelty detection, with all else, like
+using SSD for object detection and the SceneNet RGB-D data set,
+being equal. With respect to auto-encoders, a recent implementation
+of an adversarial auto-encoder\cite{Pidhorskyi2018} will be used.
+
+\paragraph{Hypothesis} Novelty detection using auto-encoders
+delivers similar or better object detection performance under open set
+conditions while being less computationally expensive compared to
+dropout sampling.
+
+\paragraph{Contribution}
+The contribution of this thesis is a comparison between dropout
+sampling and auto-encoding with respect to the overall performance
+of both for object detection under open set conditions using
+the SSD network for object detection and the SceneNet RGB-D data set
+with MS COCO classes.
+
+\chapter{Background and Contribution}
+
+This chapter will provide a more in-depth look at the two works
+this thesis is based upon. First, the dropout sampling introduced
+by Miller et al\cite{Miller2018} will be showcased. Afterwards,
+the Generative Probabilistic Novelty Detection with Adversarial
+Autoencoders\cite{Pidhorskyi2018} will be presented. The chapter
+will conclude with a more detailed explanation of the intended
+contribution of this thesis.
+
+The dropout sampling explanation will follow the paper of Miller et
+al\cite{Miller2018} rather closely, including the formulae used
+in their paper.
+
+\section{Dropout Sampling}
+
+To understand dropout sampling, it is necessary to explain the
+idea of Bayesian neural networks. They place a prior distribution
+over the network weights, for example a Gaussian prior distribution:
+\(\mathbf{W} \sim \mathcal{N}(0, I)\). In this example
+\(\mathbf{W}\) are the weights and \(I\) is the identity covariance
+matrix, meaning that every weight is drawn independently from an
+identical distribution. The
+training of the network determines a plausible set of weights by
+evaluating the posterior (probability output) over the weights given
+the training data: \(p(\mathbf{W}|\mathbf{T})\). However, this
+evaluation cannot be performed in any reasonable
+time. Therefore approximation techniques are
+required. In those techniques the posterior is fitted with a
+simple distribution \(q^{*}_{\theta}(\mathbf{W})\). The original
+and intractable problem of averaging over all weights in the network
+is replaced with an optimisation task, where the optimisation runs
+over the parameters of the simple distribution\cite{Kendall2017}.
+
+\subsubsection*{Dropout Variational Inference}
+
+Kendall and Gal\cite{Kendall2017} showed an approximation for
+classification and recognition tasks. Dropout variational inference
+is a practical approximation technique in which dropout layers are
+added in front of every weight layer and used also during test
+time to sample from the approximate posterior. Effectively, this
+results in the approximation of the class probability
+\(p(y|\mathcal{I}, \mathbf{T})\) by performing multiple forward
+passes through the network and averaging over the obtained softmax
+scores \(\mathbf{s}_i\), given an image \(\mathcal{I}\) and the
+training data \(\mathbf{T}\):
+\begin{equation} \label{eq:drop-sampling}
+p(y|\mathcal{I}, \mathbf{T}) = \int p(y|\mathcal{I}, \mathbf{W}) \cdot p(\mathbf{W}|\mathbf{T})d\mathbf{W} \approx \frac{1}{n} \sum_{i=1}^{n}\mathbf{s}_i
+\end{equation}
+
+With this dropout sampling technique \(n\) model weights
+\(\widetilde{\mathbf{W}}_i\) are sampled from the posterior
+\(p(\mathbf{W}|\mathbf{T})\). The class probability
+\(p(y|\mathcal{I}, \mathbf{T})\) is a probability vector
+\(\mathbf{q}\) over all class labels. Finally, the uncertainty
+of the network with respect to the classification is given by
+the entropy \(H(\mathbf{q}) = - \sum_i q_i \cdot \log q_i\).
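+
+To make this concrete, the following minimal sketch shows how the
+sampling could be implemented with Keras. The model object and the
+number of passes are illustrative assumptions; the essential detail
+is calling the model with \texttt{training=True} so that the dropout
+layers remain active at test time.
+
+\begin{verbatim}
+import numpy as np
+import tensorflow as tf
+
+def dropout_sample(model: tf.keras.Model, image: np.ndarray, n: int = 20):
+    """Approximate p(y|I,T) by averaging n stochastic forward passes."""
+    batch = np.expand_dims(image, axis=0)
+    # training=True keeps the dropout layers active during inference
+    scores = np.stack([model(batch, training=True).numpy()[0]
+                       for _ in range(n)])
+    q = scores.mean(axis=0)                   # averaged softmax scores q
+    entropy = -np.sum(q * np.log(q + 1e-12))  # uncertainty H(q)
+    return q, entropy
+\end{verbatim}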
+
+\subsubsection*{Dropout Sampling for Object Detection}
+
+Miller et al\cite{Miller2018} apply dropout sampling to
+object detection. In that case \(\mathbf{W}\) represents the
+learned weights of a detection network like SSD\cite{Liu2016}.
+Every forward pass uses a different network
+\(\widetilde{\mathbf{W}}\) which is approximately sampled from
+\(p(\mathbf{W}|\mathbf{T})\). Each forward pass in object
+detection results in a set of detections, each consisting of bounding
+box coordinates \(\mathbf{b}\) and softmax score \(\mathbf{s}\).
+The detections are denoted by Miller et al as \(D_i =
+\{\mathbf{s}_i,\mathbf{b}_i\}\). The detections of all passes are put
+into a large set \(\mathfrak{D} = \{D_1, ..., D_n\}\).
+
+All detections with mutual intersection-over-union scores (IoU)
+of \(0.95\) or higher are grouped into an observation \(\mathcal{O}_i\).
+Subsequently, the corresponding vector of class probabilities
+\(\mathbf{q}_i\) for the observation is calculated by averaging all
+score vectors \(\mathbf{s}_j\) in a particular observation
+\(\mathcal{O}_i\): \(\mathbf{q}_i \approx \overline{\mathbf{s}}_i = \frac{1}{n} \sum_{j=1}^{n} \mathbf{s}_j\). The label uncertainty
+of the detector for a particular observation is measured by
+the entropy \(H(\mathbf{q}_i) = - \sum_j q_{ij} \cdot \log q_{ij}\).
+
+In the introduction I gave a very reduced description of
+maximum and low uncertainty. A more complete explanation:
+if \(\mathbf{q}_i\), which I called averaged class probabilities,
+resembles a uniform distribution, the entropy will be high. A uniform
+distribution means that no class is more likely than another, which
+is a perfect example of maximum uncertainty. Conversely, if
+one class has a very high probability, the entropy will be low.
+
+In open set conditions it can be expected that falsely generated
+detections for unknown object classes have a higher label
+uncertainty. A threshold on the entropy \(H(\mathbf{q}_i)\) can then
+be used to identify and reject these false positive cases.
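+
+The following sketch illustrates the grouping step in plain Python.
+The detection format and the \texttt{iou} helper are simplifying
+assumptions made for illustration, and the exact procedure of Miller
+et al may differ in detail; the grouping criterion and the entropy
+follow the description above.
+
+\begin{verbatim}
+import numpy as np
+
+def iou(box_a, box_b):
+    """Intersection over union of two [x1, y1, x2, y2] boxes."""
+    x1, y1 = np.maximum(box_a[:2], box_b[:2])
+    x2, y2 = np.minimum(box_a[2:], box_b[2:])
+    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
+    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
+    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
+    return inter / (area_a + area_b - inter)
+
+def group_observations(detections, threshold=0.95):
+    """Group detections (score vector, box) of all forward passes
+    into observations with mutual IoU >= threshold."""
+    observations = []
+    for score, box in detections:
+        for obs in observations:
+            if all(iou(box, b) >= threshold for _, b in obs):
+                obs.append((score, box))
+                break
+        else:
+            observations.append([(score, box)])
+    return observations
+
+def label_uncertainty(observation):
+    """Entropy of the averaged score vector of one observation."""
+    q = np.mean([s for s, _ in observation], axis=0)
+    return -np.sum(q * np.log(q + 1e-12))
+\end{verbatim}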
+
+\section{Adversarial Auto-encoder}
+
+This section will explain the adversarial auto-encoder used by
+Pidhorskyi et al\cite{Pidhorskyi2018} but in a slightly modified
+form to make it more understandable.
+
+The training data points \(x_i \in \mathbb{R}^m \) are the input
+of the auto-encoder. An encoding function \(e: \mathbb{R}^m \rightarrow \mathbb{R}^n\) takes the data points
+and produces a representation \(\overline{z_i} \in \mathbb{R}^n\)
+in a latent space. This latent space is smaller (\(n < m\)) than the
+input, which necessitates some form of compression.
+
+A second function \(g: \Omega \rightarrow \mathbb{R}^m\) is the
+generator function that takes the latent representation
+\(z_i \in \Omega \subset \mathbb{R}^n\) and generates an output
+\(\overline{x_i}\) as close as possible to the input data
+distribution.
+
+What then is the difference between \(\overline{z_i}\) and \(z_i\)?
+With a simple auto-encoder both would be identical. In the case
+of an adversarial auto-encoder it is slightly more complicated.
+There is a discriminator \(D_z\) that tries to distinguish between
+an encoded data point \(\overline{z_i}\) and a \(z_i \sim \mathcal{N}(0,1)\) drawn from a normal distribution with \(0\) mean
+and a standard deviation of \(1\). During training, the encoding
+function \(e\) attempts to minimize any perceivable difference
+between \(z_i\) and \(\overline{z_i}\) while \(D_z\) has the
+aforementioned adversarial task to differentiate between them.
+
+Furthermore, there is a discriminator \(D_x\) that has the task
+to differentiate the generated output \(\overline{x_i}\) from the
+actual input \(x_i\). During training, the generator function \(g\)
+tries to minimize the perceivable difference between \(\overline{x_i}\) and \(x_i\) while \(D_x\) has the mentioned
+adversarial task to distinguish between them.
+
+With this, all components of the adversarial auto-encoder employed
+by Pidhorskyi et al are introduced. Finally, the losses are
+presented. The two adversarial objectives have been mentioned
+already. Specifically, there is the adversarial loss for the
+discriminator \(D_z\):
+\begin{equation} \label{eq:adv-loss-z}
+ \mathcal{L}_{adv-d_z}(x,e,D_z) = E[\log (D_z(\mathcal{N}(0,1)))] + E[\log (1 - D_z(e(x)))],
+\end{equation}
+\noindent
+where \(E\) stands for an expected
+value\footnote{a term used in probability theory},
+\(x\) stands for the input, and
+\(\mathcal{N}(0,1)\) represents an element drawn from the specified
+distribution. The encoder \(e\) attempts to minimize this loss while
+the discriminator \(D_z\) intends to maximize it.
+
+In the same way the adversarial loss for the discriminator \(D_x\)
+is specified:
+\begin{equation} \label{eq:adv-loss-x}
+ \mathcal{L}_{adv-d_x}(x,D_x,g) = E[\log(D_x(x))] + E[\log(1 - D_x(g(\mathcal{N}(0,1))))],
+\end{equation}
+\noindent
+where \(x\), \(E\), and \(\mathcal{N}(0,1)\) have the same meaning
+as before. In this case the generator \(g\) tries to minimize the loss
+while the discriminator \(D_x\) attempts to maximize it.
+
+Every auto-encoder requires a reconstruction error to work. This
+error calculates the difference between the original input and
+the generated or decoded output. In this case, the reconstruction
+loss is the negative expected log-likelihood of the reconstruction
+given the input:
+\begin{equation} \label{eq:recon-loss}
+ \mathcal{L}_{error}(x, e, g) = - E[\log(p(g(e(x)) | x))],
+\end{equation}
+\noindent
+where \(p(g(e(x)) | x)\) is the likelihood of the reconstruction
+\(g(e(x))\) given the input \(x\), and \(E\), \(e\), and \(g\) have
+the same meaning as before.
+
+All losses combined result in the following formula:
+\begin{equation} \label{eq:full-loss}
+ \mathcal{L}(x,e,D_z,D_x,g) = \mathcal{L}_{adv-d_z}(x,e,D_z) + \mathcal{L}_{adv-d_x}(x,D_x,g) + \lambda \mathcal{L}_{error}(x,e,g),
+\end{equation}
+\noindent
+where \(\lambda\) is a parameter used to balance the adversarial
+losses with the reconstruction loss. The model is trained by
+Pidhorskyi et al using the Adam optimizer by doing alternating
+updates of each of the aforementioned components:
+
+\begin{itemize}
+ \item Maximize \(\mathcal{L}_{adv-d_x}\) by updating weights of \(D_x\);
+ \item Minimize \(\mathcal{L}_{adv-d_x}\) by updating weights of \(g\);
+ \item Maximize \(\mathcal{L}_{adv-d_z}\) by updating weights of \(D_z\);
+ \item Minimize \(\mathcal{L}_{error}\) and \(\mathcal{L}_{adv-d_z}\) by updating weights of \(e\) and \(g\).
+\end{itemize}
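+
+As a rough sketch, one such alternating update could look as follows
+in TensorFlow. The four component networks and their optimizers are
+assumed to exist as Keras models; the binary cross-entropy terms
+correspond to the adversarial losses of equations
+(\ref{eq:adv-loss-z}) and (\ref{eq:adv-loss-x}) in their common
+non-saturating form, and a mean squared error stands in for the
+log-likelihood term of equation (\ref{eq:recon-loss}). This is not
+the exact implementation of Pidhorskyi et al.
+
+\begin{verbatim}
+import tensorflow as tf
+
+bce = tf.keras.losses.BinaryCrossentropy()
+
+def train_step(x, e, g, d_z, d_x, opts, latent_dim, lam=10.0):
+    opt_dx, opt_g, opt_dz, opt_eg = opts
+    n = tf.shape(x)[0]
+    ones, zeros = tf.ones((n, 1)), tf.zeros((n, 1))
+
+    # 1) maximize L_adv-d_x: update D_x to tell real from generated
+    with tf.GradientTape() as tape:
+        x_fake = g(tf.random.normal((n, latent_dim)))
+        loss = bce(ones, d_x(x)) + bce(zeros, d_x(x_fake))
+    grads = tape.gradient(loss, d_x.trainable_variables)
+    opt_dx.apply_gradients(zip(grads, d_x.trainable_variables))
+
+    # 2) minimize L_adv-d_x: update g to fool D_x
+    with tf.GradientTape() as tape:
+        loss = bce(ones, d_x(g(tf.random.normal((n, latent_dim)))))
+    grads = tape.gradient(loss, g.trainable_variables)
+    opt_g.apply_gradients(zip(grads, g.trainable_variables))
+
+    # 3) maximize L_adv-d_z: update D_z to tell prior from encodings
+    with tf.GradientTape() as tape:
+        z_prior = tf.random.normal((n, latent_dim))
+        loss = bce(ones, d_z(z_prior)) + bce(zeros, d_z(e(x)))
+    grads = tape.gradient(loss, d_z.trainable_variables)
+    opt_dz.apply_gradients(zip(grads, d_z.trainable_variables))
+
+    # 4) minimize L_error and L_adv-d_z: update e and g together
+    with tf.GradientTape() as tape:
+        z = e(x)
+        loss = bce(ones, d_z(z)) \
+             + lam * tf.reduce_mean(tf.square(g(z) - x))
+    variables = e.trainable_variables + g.trainable_variables
+    grads = tape.gradient(loss, variables)
+    opt_eg.apply_gradients(zip(grads, variables))
+\end{verbatim}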
+
+Practically, the auto-encoder is trained separately for every
+object class that is considered ``known''. Pidhorskyi et al trained
+it on the MNIST\cite{Lecun1998} data set, once for every digit.
+
+For this thesis it needs to be trained on the SceneNet RGB-D
+data set using MS COCO classes as known classes. As all known
+classes are present in every test epoch, it becomes
+non-trivial which of the trained auto-encoders should be used to
+calculate novelty. To phrase it differently, a true positive
+detection is possible for multiple classes in the same image.
+If, for example, one object is classified correctly by SSD as a chair,
+the novelty score should be low. But the auto-encoders of all
+known classes except the ``chair'' class will ideally give a high
+novelty score. Which of the values should be used? The only sensible
+solution is to run a detection only through the auto-encoder that was
+trained for the class the SSD model predicted. This provides the
+following scenarios:
+\begin{itemize}
+ \item true positive classification: novelty score should be low
+ \item false positive classification and correct class is
+ among the known classes: novelty score should be high
+ \item false positive classification and correct class is unknown:
+ novelty score should be high
+\end{itemize}
+\noindent
+Negative classifications are not listed as these are not part
+of the output of the SSD and cannot be given to the auto-encoder
+as input. Furthermore, the second case should not happen because
+the trained SSD knows this other class and is very likely
+to give it a higher probability. Therefore, using only one
+auto-encoder fulfils the task of differentiating between
+known and unknown classes.
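+
+A minimal sketch of this routing logic, assuming a hypothetical
+dictionary of per-class auto-encoders and a \texttt{novelty\_score}
+function as described in the next section:
+
+\begin{verbatim}
+def detection_novelty(detection, autoencoders, novelty_score):
+    """Route a detection through the auto-encoder of its
+    predicted class.
+
+    autoencoders: dict mapping class label -> trained
+    auto-encoder (one per known class, as described above)
+    """
+    predicted_class = detection["label"]  # winning class from SSD
+    autoencoder = autoencoders[predicted_class]
+    return novelty_score(autoencoder, detection["image_patch"])
+\end{verbatim}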
+
+\section{Generative Probabilistic Novelty Detection}
+
+It is still unclear how the novelty score is calculated.
+This section will clear this up in terms as understandable as
+possible. However, the name ``Generative Probabilistic
+Novelty Detection''\cite{Pidhorskyi2018} already signals that
+probability theory plays a part in it. Furthermore, this
+section will make use of some mathematical terms which cannot
+be explained in great detail here. Moreover, the previous section
+already introduced many required components, which will not be
+explained here again.
+
+For the purpose of this explanation a trained auto-encoder
+is assumed. In that case the generator function describes
+the model that the auto-encoder is actually using for the
+novelty detection. The task of training is to make sure this
+model comes as close as possible to the real model of the
+training or testing data. The model of the auto-encoder
+is in mathematical terms a parameterized manifold
+\(\mathcal{M} \equiv g(\Omega)\) of dimension \(n\).
+The set of training or testing data can then be described
+in the following way:
+\begin{equation} \label{eq:train-set}
+ x_i = g(z_i) + \xi_i \quad i \in \mathbb{N},
+\end{equation}
+\noindent
+where \(\xi_i\) represents noise. It may be confusing, but
+for the purpose of this novelty test the ``truth'' is what
+the generator function generates from a set of \(z_i \in \Omega\),
+not the ground truth from the data set. Furthermore,
+the previously introduced encoder function \(e\) is assumed
+to work as an exact inverse of \(g\) for every \(x \in \mathcal{M}\).
+For such \(x\) it follows that \(x = g(e(x))\).
+
+Let \(\overline{x} \in \mathbb{R}^m\) be a data point from the test
+data. The remainder of the section will explain how the novelty
+test is performed for this \(\overline{x}\). It is important
+to note that this data point is not necessarily part of the
+auto-encoder model. Therefore, \(g(e(\overline{x})) = \overline{x}\)
+cannot be assumed. However, it can be observed that \(\overline{x}\)
+can be non-linearly projected onto
+\(\overline{x}^{\|} \in \mathcal{M}\)
+by using \(g(\overline{z})\) with \(\overline{z} = e(\overline{x})\).
+It is assumed that \(g\) is smooth enough to perform a linearization
+based on the first-order Taylor expansion:
+\begin{equation} \label{eq:taylor-expanse}
+ g(z) = g(\overline{z}) + J_g(\overline{z}) (z - \overline{z}) + \mathcal{O}(\| z - \overline{z} \|^2),
+\end{equation}
+\noindent
+where \(J_g(\overline{z})\) is the Jacobian matrix of \(g\) computed
+at \(\overline{z}\). It is assumed that the Jacobian matrix of \(g\)
+has full rank at every point of the manifold. A Jacobian matrix
+contains all first-order partial derivatives of a function.
+\(\| \cdot \|\) is the \(L_2\) norm, which calculates the
+length of a vector as the square root of the sum of the
+squares of its components. Lastly, \(\mathcal{O}\)
+is the Big-O notation, which here denotes the remainder of the
+Taylor expansion. The remainder is of quadratic order in
+\(\| z - \overline{z} \|\) and can therefore be neglected for
+\(z\) close to \(\overline{z}\).
+
+Next, the tangent space of \(g\) at \(\overline{x}^{\|}\), which
+is spanned by the \(n\) independent column vectors of the Jacobian
+matrix \(J_g(\overline{z})\), is defined as
+\(\mathcal{T} = \text{span}(J_g(\overline{z}))\). The tangent space
+at a point of the manifold contains all vectors that are tangential
+to the manifold at this point. The Jacobian matrix can be decomposed
+into three
+matrices using singular value decomposition: \(J_g(\overline{z}) = U^{\|}SV^{*}\). \(\mathcal{T}\) is defined to also be spanned
+by the column vectors of \(U^{\|}\): \(\mathcal{T} = \text{span}(U^{\|})\). \(U^{\|}\) contains the left-singular vectors
+and \(V^{*}\) is the conjugate transpose of the matrix
+\(V\), which contains the right-singular vectors. \(U^{\bot}\) is
+defined in such a way that \(U = [U^{\|}U^{\bot}]\) is a unitary
+matrix. \(\mathcal{T^{\bot}}\) is the orthogonal complement of
+\(\mathcal{T}\). With this preparation, \(\overline{x}\) can be
+represented with respect to the local coordinates that define
+\(\mathcal{T}\) and \(\mathcal{T}^{\bot}\).
+This representation
+can be achieved by computing
+\begin{equation} \label{eq:w-definition}
+ \overline{w} = U^{\top} \overline{x} = \left[\begin{matrix}
+ U^{\|^{\top}} \overline{x} \\
+ U^{\bot^{\top}} \overline{x}
+ \end{matrix}\right] = \left[\begin{matrix}
+ \overline{w}^{\|} \\
+ \overline{w}^{\bot}
+ \end{matrix}\right],
+\end{equation}
+\noindent
+where the rotated coordinates \(\overline{w}\) (the data point
+expressed in the coordinate system given by the columns of \(U\))
+are decomposed into \(\overline{w}^{\|}\), which
+are parallel to \(\mathcal{T}\), and \(\overline{w}^{\bot}\), which
+are orthogonal to \(\mathcal{T}\).
+
+The last step to define the novelty test involves probability
+density functions (PDFs), which are now introduced. The PDF \(p_X(x)\)
+describes the random variable \(X\), from which the training and
+testing data points are drawn. In addition, \(p_W(w)\) is the
+probability density function of the random variable \(W\),
+which represents \(X\) after changing the coordinates. Both
+describe the same distribution, since the change of coordinates
+is merely a rotation. But it is assumed that the coordinates
+\(W^{\|}\), which are parallel to \(\mathcal{T}\), and the coordinates
+\(W^{\bot}\), which are orthogonal to \(\mathcal{T}\), are
+statistically independent. With this assumption the following holds:
+\begin{equation} \label{eq:pdf-x}
+ p_X(x) = p_W(w) = p_W(w^{\|}, w^{\bot}) = p_{W^{\|}}(w^{\|}) p_{W^{\bot}}(w^{\bot})
+\end{equation}
+The previously introduced noise comes into play again. In formula
+(\ref{eq:train-set}) it is assumed that the noise \(\xi\)
+predominantly moves the point \(x\) away from the manifold
+\(\mathcal{M}\) in a direction orthogonal to \(\mathcal{T}\).
+As a consequence \(W^{\bot}\) is mainly responsible for the noise
+effects. Since noise and drawing from the manifold are statistically
+independent, \(W^{\|}\) and \(W^{\bot}\) are also independent.
+
+Finally, referring back to the data point \(\overline{x}\), the
+novelty test is defined like this:
+\begin{equation} \label{eq:novelty-test}
+ p_X(\overline{x}) = p_{W^{\|}}(\overline{w}^{\|})p_{W^{\bot}}(\overline{w}^{\bot}) =
+ \begin{cases}
+ \geq \gamma & \Longrightarrow \text{Inlier} \\
+ < \gamma & \Longrightarrow \text{Outlier}
+ \end{cases}
+\end{equation}
+\noindent
+where \(\gamma\) is a suitable threshold.
+
+At this point it is very clear that understanding the novelty test
+of the GPND approach requires far more mathematical background than
+dropout sampling. Nonetheless, it could be the better method.
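+
+The following sketch condenses the novelty test into runnable form.
+The callables \texttt{encode}, \texttt{jacobian\_g},
+\texttt{p\_parallel}, and \texttt{p\_orthogonal} are assumptions: they
+stand for the encoder \(e\), the Jacobian of \(g\) at a latent point,
+and density estimates for the parallel and orthogonal coordinates,
+which Pidhorskyi et al fit to the training data (the details of that
+fitting are omitted here).
+
+\begin{verbatim}
+import numpy as np
+
+def novelty_test(x_bar, encode, jacobian_g,
+                 p_parallel, p_orthogonal, gamma):
+    """Schematic GPND novelty test for one flattened data point."""
+    z_bar = encode(x_bar)            # project into the latent space
+    J = jacobian_g(z_bar)            # m x n Jacobian matrix of g
+    n = J.shape[1]
+    U, _, _ = np.linalg.svd(J, full_matrices=True)
+    w = U.T @ x_bar                  # rotated coordinates w
+    w_par, w_ort = w[:n], w[n:]      # parallel / orthogonal split
+    p_x = p_parallel(w_par) * p_orthogonal(w_ort)
+    return "inlier" if p_x >= gamma else "outlier"
+\end{verbatim}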
+
+\section{Contribution}
+
+This section will outline what exactly the scientific as well as
+technical contribution of this thesis will be.
+
+\subsection*{Scientific Contribution}
+
+Miller et al\cite{Miller2018} use the SSD\cite{Liu2016} network
+extended with dropout layers and run multiple forward passes
+during the testing phase for every image. Considering the number
+of images in the SceneNet RGB-D\cite{McCormac2017} data set, these
+forward passes will take considerable time. It could be faster
+to only run one forward pass and then use the auto-encoder for
+novelty detection. However, the auto-encoder can only work
+with one detection at a time and must be called for every
+detection of the object detector separately. Therefore,
+it is interesting to investigate whether the second approach
+is indeed faster than the first.
+
+Dropout sampling uses the entropy to identify false positive
+cases. Such identified detections are discarded, which allows for
+a better object detection performance. The GPND approach uses
+the auto-encoder losses and results to identify novel cases and
+therefore mark detections as false positive. Subsequently, these
+detections can be discarded as well. The effectiveness of both
+approaches can be compared by looking at the object detection
+performance after the identified false positive cases have been
+discarded. It is interesting to research whether the GPND approach
+results in better object detection performance than dropout
+sampling provides.
+
+The formulated hypothesis, which is repeated after this paragraph,
+combines both aspects and requires a similar or better result in
+both of them. As a consequence it will be falsified if
+the computational performance of the GPND approach is not better than
+that of dropout sampling or if the object detection performance
+is worse.
+
+\paragraph{Hypothesis} Novelty detection using auto-encoders
+delivers similar or better object detection performance under open set
+conditions while being less computationally expensive compared to
+dropout sampling.\\
+
+There are three possible scenarios that can be the result of
+the thesis:
+\begin{itemize}
+ \item the hypothesis is confirmed: win-win situation where
+ switching to GPND is straightforward.
+ \item one of the conditions fails: win-lose situation where
+ it is a trade-off between object detection performance and
+ computational performance. One approach will be better in
+ one aspect and the other approach in the other.
+ \item both conditions fail: lose-lose situation where
+ dropout sampling is better in both aspects.
+\end{itemize}
+
+Summarising, the scientific contribution is a comparison between
+dropout sampling and GPND with respect to both object detection
+performance and computational performance under open set conditions
+using the SceneNet RGB-D data set with the MS COCO classes as
+``known'' object classes.
+
+The computational performance is measured by the time in milliseconds
+every test run takes. The absolute numbers are not of interest,
+as they vary from machine to machine and are influenced by a
+plethora of uncontrollable factors; what matters is the relative
+difference between both approaches and whether that difference is
+significant.
+Object detection performance is measured by precision, recall,
+F1-score, and an open set error. While the first three metrics are
+standard, the last is adapted from Miller et al. It is defined
+as the number of observations (for dropout sampling) or detections
+(for GPND) that pass the respective false positive test (entropy or
+novelty), fall on unknown objects (there are no overlapping ground
+truth objects with IoU \(\geq 0.5\) and a known true class label),
+and do not have a winning class label of ``unknown''.
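+
+A sketch of how the open set error could be counted; the detection
+format and the \texttt{iou} helper are assumptions made for
+illustration:
+
+\begin{verbatim}
+def open_set_error(detections, ground_truth, iou):
+    """Count detections that pass the false positive test, fall on
+    unknown objects, and still claim a known class."""
+    errors = 0
+    for det in detections:
+        if not det["passes_fp_test"]:  # rejected by entropy/novelty
+            continue
+        on_known = any(iou(det["box"], gt["box"]) >= 0.5
+                       and gt["class_is_known"]
+                       for gt in ground_truth)
+        if not on_known and det["label"] != "unknown":
+            errors += 1
+    return errors
+\end{verbatim}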
+
+\subsection*{Technical Contribution}
+
+Technical contribution includes all contributions
+that are not necessarily new in the scientific sense but are
+meaningful engineering contributions in themselves.
+
+There is no available source code for the work of
+Miller et al\cite{Miller2018}, which necessitates my own
+re-implementation of their work. The contribution is the fine-tuning
+of an SSD model pre-trained on ImageNet\cite{Deng2009}, extended by
+dropout layers, to the SceneNet RGB-D data set using MS COCO classes
+as the known classes for SSD.
+As MS COCO classes are more general than SceneNet RGB-D classes, this
+also requires a mapping from one set of classes to the other.
+This entire contribution is technical and only re-implements
+what Miller et al have already done. It is expected that the
+evaluation of the results using this self-trained model will
+reproduce the results of Miller et al.
+
+For GPND, source code is available, but only for MNIST and written
+in PyTorch. Therefore, the source code has to be ported from
+PyTorch to TensorFlow. Furthermore, it must be made compatible
+with SceneNet RGB-D, as the architecture is tailored to MNIST.
+The mapping from SceneNet RGB-D to MS COCO applies here as well and
+can therefore be considered a separate contribution. A fine-tuned
+SSD is also required, but this time without added dropout layers.
+Additionally, it is necessary to train the auto-encoder for every
+known class separately.
+
+To summarise it in a list, the following separate deliverables
+are contributed:
+
+\begin{itemize}
+ \item source code for dropout sampling compatible with TensorFlow
+ \item source code for GPND compatible with TensorFlow
+ \item mapping from SceneNet RGB-D classes to MS COCO classes
+ \item vanilla SSD model fine-tuned on SceneNet RGB-D
+ \item dropout SSD model fine-tuned on SceneNet RGB-D
+ \item auto-encoder model trained separately on every MS COCO class
+\end{itemize}
diff --git a/thesis.tex b/thesis.tex
new file mode 100644
index 0000000..43bb1cf
--- /dev/null
+++ b/thesis.tex
@@ -0,0 +1,45 @@
+% main thesis file
+
+% first: let's determine the class we want to use
+\documentclass[12pt]{masterthesis} % we use our custom masterthesis class
+\KOMAoption{draft}{false} % set for "draft" mode, disable later
+
+% define "variables"
+\title{Novelty detection through auto-encoding for object detection in open set conditions}
+\author{Jim Martens}
+\date{\today}
+\subject{} % value for pdfsubject
+\newcommand{\keywords}{CV} % value for pdfkeywords
+\newcommand{\studiengang}{\IfLanguageName{british}{Computer Science M.Sc.}{Master Informatik}}
+\newcommand{\faculty}{\IfLanguageName{british}{MIN Faculty}{MIN-Fakult\"{a}t}}
+\newcommand{\department}{\IfLanguageName{british}{Department of Computer Science}{Fachbereich Informatik}}
+% define here to prevent building errors
+\newcommand{\professor}{}
+\newcommand{\matrikelnummer}{}
+\newcommand{\firstReviewer}{}
+\newcommand{\secondReviewer}{}
+
+% load protected variables
+\IfFileExists{./private/variables.tex}{%
+  \input{./private/variables.tex}
+}{}
+
+% use custom package to prevent spamming the preamble
+\usepackage[licence]{masterthesis}
+
+% specify image location
+\graphicspath{{./images/}{./private/images/}}
+
+% specify bib resource
+\addbibresource{ma.bib}
+
+% begin document
+\begin{document}
+% invoke start command(s) from masterthesis package
+\start
+
+\input{body.tex}
+
+% invoke finish command(s) from masterthesis package
+\finish
+\end{document}