Added missing tex files and skeleton chapters
Signed-off-by: Jim Martens <github@2martens.de>
parent
d0733c068f
commit
f6a6907076
|
@ -0,0 +1,4 @@
|
|||
\clearpage
|
||||
\section*{Abstract}
|
||||
|
||||
I am an abstract
|
|
@ -0,0 +1,4 @@
|
|||
\clearpage
|
||||
\section*{Acknowledgement}
|
||||
|
||||
Adulation
|
477
body.tex
|
@ -7,10 +7,10 @@
|
|||
Famous examples like the automatic soap dispenser which does not
|
||||
recognize the hand of a black person but dispenses soap when presented
|
||||
with a paper towel raise the question of bias in computer
|
||||
systems\cite{Friedman1996}. Related to this ethical question regarding
|
||||
systems~\cite{Friedman1996}. Related to this ethical question regarding
|
||||
the design of so-called algorithms, a term often used in public
|
||||
discourse for applied neural networks, is the question of
|
||||
algorithmic accountability\cite{Diakopoulos2014}.
|
||||
algorithmic accountability~\cite{Diakopoulos2014}.
|
||||
|
||||
The charm of supervised neural networks, that they can learn from
|
||||
input-output relations and figure out by themselves what connections
|
||||
|
@ -79,7 +79,7 @@ with this type of task: model uncertainty and novelty detection.
|
|||
|
||||
Model uncertainty can be measured with dropout sampling.
|
||||
Dropout is usually used only during training but
|
||||
Miller et al\cite{Miller2018} use them also during testing
|
||||
Miller et al.~\cite{Miller2018} also use it during testing
|
||||
to achieve different results for the same image making use of
|
||||
multiple forward passes. The output scores for the forward passes
|
||||
of the same image are then averaged. If the averaged class
|
||||
|
@ -94,7 +94,7 @@ Novelty detection is the more direct approach to solve the task.
|
|||
In the realm of neural networks it is usually done with the help of
|
||||
auto-encoders that essentially solve a regression task of finding an
|
||||
identity function that reconstructs on the output the given
|
||||
input\cite{Pimentel2014}. Auto-encoders have
|
||||
input~\cite{Pimentel2014}. Auto-encoders have
|
||||
internally at least two components: an encoder, and a decoder or
|
||||
generator. The job of the encoder is to find an encoding that
|
||||
compresses the input as well as possible while simultaneously
|
||||
|
@ -113,22 +113,22 @@ novelty score.
|
|||
Given these two approaches to solving the task explained above,
|
||||
it comes down to performance. At the end of the day the best
|
||||
theoretical idea does not help in solving the task if it cannot
|
||||
be implemented in a performant way. Miller et al have shown
|
||||
be implemented in a performant way. Miller et al. have shown
|
||||
some success in using dropout sampling. However, the many forward
|
||||
passes during testing for every image seem computationally expensive.
|
||||
In comparison a single run through a trained auto-encoder seems
|
||||
intuitively to be faster. This leads to the hypothesis (see below).
|
||||
|
||||
For the purpose of this thesis, I will
|
||||
use the work of Miller et al as baseline to compare against.
|
||||
They use the SSD\cite{Liu2016} network for object detection,
|
||||
use the work of Miller et al. as a baseline to compare against.
|
||||
They use the SSD~\cite{Liu2016} network for object detection,
|
||||
modified by added dropout layers, and the SceneNet
|
||||
RGB-D\cite{McCormac2017} data set using the MS COCO\cite{Lin2014}
|
||||
RGB-D~\cite{McCormac2017} data set using the MS COCO~\cite{Lin2014}
|
||||
classes. Instead of dropout sampling my approach will use
|
||||
an auto-encoder for novelty detection with all else, like
|
||||
using SSD for object detection and the SceneNet RGB-D data set,
|
||||
being equal. With respect to auto-encoders a recent implementation
|
||||
of an adversarial auto-encoder\cite{Pidhorskyi2018} will be used.
|
||||
of an adversarial auto-encoder~\cite{Pidhorskyi2018} will be used.
|
||||
|
||||
\paragraph{Hypothesis} Novelty detection using auto-encoders
|
||||
delivers similar or better object detection performance under open set
|
||||
|
@ -144,461 +144,10 @@ with MS COCO classes.
|
|||
|
||||
\chapter{Background and Contribution}
|
||||
|
||||
This chapter will provide a more in-depth look at the two works
|
||||
this thesis is based upon. First, the dropout sampling introduced
|
||||
by Miller et al.~\cite{Miller2018} will be showcased. Afterwards,
the Generative Probabilistic Novelty Detection with Adversarial
Autoencoders~\cite{Pidhorskyi2018} will be presented. The chapter
|
||||
will conclude with a more detailed explanation of the intended
|
||||
contribution of this thesis.
|
||||
\chapter{Methods}
|
||||
|
||||
The dropout sampling explanation will follow the paper of Miller et
|
||||
al.~\cite{Miller2018} rather closely, including the formulae used
|
||||
in their paper.
|
||||
\chapter{Results}
|
||||
|
||||
\section{Dropout Sampling}
|
||||
\chapter{Discussion}
|
||||
|
||||
To understand dropout sampling, it is necessary to explain the
|
||||
idea of Bayesian neural networks. They place a prior distribution
|
||||
over the network weights, for example a Gaussian prior distribution:
|
||||
\(\mathbf{W} \sim \mathcal{N}(0, I)\). In this example
|
||||
\(\mathbf{W}\) are the weights and \(I\) is the identity matrix, which means that every
weight is drawn independently from the same distribution. The
|
||||
training of the network determines a plausible set of weights by
|
||||
evaluating the posterior (probability output) over the weights given
|
||||
the training data: \(p(\mathbf{W}|\mathbf{T})\). However, this
|
||||
evaluation cannot be performed in any reasonable
|
||||
time. Therefore approximation techniques are
|
||||
required. In those techniques the posterior is fitted with a
|
||||
simple distribution \(q^{*}_{\theta}(\mathbf{W})\). The original
|
||||
and intractable problem of averaging over all weights in the network
|
||||
is replaced with an optimisation task, where the parameters of the
|
||||
simple distribution are optimised~\cite{Kendall2017}.
|
||||
|
||||
\subsubsection*{Dropout Variational Inference}
|
||||
|
||||
Kendall and Gal\cite{Kendall2017} showed an approximation for
|
||||
classification and recognition tasks. Dropout variational inference
|
||||
is a practical approximation technique that adds dropout layers
|
||||
in front of every weight layer and using them also during test
|
||||
time to sample from the approximate posterior. Effectively, this
|
||||
results in the approximation of the class probability
|
||||
\(p(y|\mathcal{I}, \mathbf{T})\) by performing multiple forward
|
||||
passes through the network and averaging over the obtained Softmax
|
||||
scores \(\mathbf{s}_i\), given an image \(\mathcal{I}\) and the
|
||||
training data \(\mathbf{T}\):
|
||||
\begin{equation} \label{eq:drop-sampling}
|
||||
p(y|\mathcal{I}, \mathbf{T}) = \int p(y|\mathcal{I}, \mathbf{W}) \cdot p(\mathbf{W}|\mathbf{T})d\mathbf{W} \approx \frac{1}{n} \sum_{i=1}^{n}\mathbf{s}_i
|
||||
\end{equation}
|
||||
|
||||
With this dropout sampling technique \(n\) model weights
|
||||
\(\widetilde{\mathbf{W}}_i\) are sampled from the posterior
|
||||
\(p(\mathbf{W}|\mathbf{T})\). The class probability
|
||||
\(p(y|\mathcal{I}, \mathbf{T})\) is a probability vector
|
||||
\(\mathbf{q}\) over all class labels. Finally, the uncertainty
|
||||
of the network with respect to the classification is given by
|
||||
the entropy \(H(\mathbf{q}) = - \sum_i q_i \cdot \log q_i\).
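The averaging in formula (\ref{eq:drop-sampling}) and the entropy can be illustrated with a short Python example. This is only a minimal sketch using NumPy and not the implementation of Kendall and Gal or Miller et al.; the function name and the example scores are made up for illustration, and every row of the input is assumed to be the Softmax output of one forward pass.

```python
import numpy as np


def mean_softmax_and_entropy(scores):
    """Average softmax vectors over forward passes and compute the entropy.

    scores: array-like of shape (n_passes, n_classes); every row is assumed
    to be a valid softmax output (non-negative, summing to one).
    """
    scores = np.asarray(scores, dtype=float)
    q = scores.mean(axis=0)  # approximates p(y | I, T)
    nonzero = q > 0  # skip zero entries, since 0 * log(0) is treated as 0
    entropy = -np.sum(q[nonzero] * np.log(q[nonzero]))
    return q, float(entropy)


# Hypothetical scores of three forward passes over three classes.
confident = [[0.90, 0.05, 0.05], [0.80, 0.10, 0.10], [0.85, 0.10, 0.05]]
uncertain = [[0.40, 0.30, 0.30], [0.30, 0.40, 0.30], [0.30, 0.30, 0.40]]

q_conf, h_conf = mean_softmax_and_entropy(confident)
q_unc, h_unc = mean_softmax_and_entropy(uncertain)
```

In the second example the averaged vector is uniform, so the entropy reaches its maximum of \(\log 3\) for three classes.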
|
||||
|
||||
\subsubsection*{Dropout Sampling for Object Detection}
|
||||
|
||||
Miller et al.~\cite{Miller2018} apply dropout sampling to
|
||||
object detection. In that case \(\mathbf{W}\) represents the
|
||||
learned weights of a detection network like SSD~\cite{Liu2016}.
|
||||
Every forward pass uses a different network
|
||||
\(\widetilde{\mathbf{W}}\) which is approximately sampled from
|
||||
\(p(\mathbf{W}|\mathbf{T})\). Each forward pass in object
|
||||
detection results in a set of detections, each consisting of bounding
|
||||
box coordinates \(\mathbf{b}\) and softmax score \(\mathbf{s}\).
|
||||
The detections are denoted by Miller et al. as \(D_i =
|
||||
\{\mathbf{s}_i,\mathbf{b}_i\}\). The detections of all passes are put
|
||||
into a large set \(\mathfrak{D} = \{D_1, \dots, D_m\}\), where \(m\) is the total number of detections.
|
||||
|
||||
All detections with mutual intersection-over-union scores (IoU)
|
||||
of \(0.95\) or higher are grouped into an observation \(\mathcal{O}_i\).
|
||||
Subsequently, the corresponding vector of class probabilities
|
||||
\(\mathbf{q}_i\) for the observation is calculated by averaging all
|
||||
score vectors \(\mathbf{s}_j\) in a particular observation
|
||||
\(\mathcal{O}_i\): \(\mathbf{q}_i \approx \overline{\mathbf{s}}_i = \frac{1}{n} \sum_{j=1}^{n} \mathbf{s}_j\). The label uncertainty
|
||||
of the detector for a particular observation is measured by
|
||||
the entropy \(H(\mathbf{q}_i) = - \sum_j q_{ij} \cdot \log q_{ij}\).
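The grouping of detections into observations can be sketched as follows. This is an illustration under stated assumptions: boxes are given as (x\_min, y\_min, x\_max, y\_max), and the greedy grouping strategy as well as both function names are my own choices, since the text above only states the mutual IoU criterion and not how Miller et al. partition the detections.

```python
def iou(a, b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def group_observations(boxes, threshold=0.95):
    """Greedily group box indices so that all members mutually overlap.

    A box joins the first existing group in which its IoU with every
    member reaches the threshold; otherwise it starts a new group.
    """
    observations = []
    for idx, box in enumerate(boxes):
        for obs in observations:
            if all(iou(box, boxes[j]) >= threshold for j in obs):
                obs.append(idx)
                break
        else:
            observations.append([idx])
    return observations


# Hypothetical detections collected from several forward passes.
boxes = [(0, 0, 100, 100), (0, 0, 100, 99), (200, 200, 300, 300), (1, 0, 100, 100)]
groups = group_observations(boxes)
```

For every resulting group, the score vectors of its members would then be averaged to obtain \(\mathbf{q}_i\).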
|
||||
|
||||
In the introduction I used a very reduced version to describe
|
||||
maximum and low uncertainty. A more complete explanation:
|
||||
If \(\mathbf{q}_i\), which I called averaged class probabilities,
|
||||
resembles a uniform distribution, the entropy will be high. A uniform
|
||||
distribution means that no class is more likely than another, which
|
||||
is a perfect example of maximum uncertainty. Conversely, if
|
||||
one class has a very high probability, the entropy will be low.
|
||||
|
||||
In open set conditions it can be expected that falsely generated
|
||||
detections for unknown object classes have a higher label
|
||||
uncertainty. A threshold on the entropy \(H(\mathbf{q}_i)\) can then
|
||||
be used to identify and reject these false positive cases.
|
||||
|
||||
\section{Adversarial Auto-encoder}
|
||||
|
||||
This section will explain the adversarial auto-encoder used by
|
||||
Pidhorskyi et al.~\cite{Pidhorskyi2018} but in a slightly modified
|
||||
form to make it more understandable.
|
||||
|
||||
The training data points \(x_i \in \mathbb{R}^m \) are the input
|
||||
of the auto-encoder. An encoding function \(e: \mathbb{R}^m \rightarrow \mathbb{R}^n\) takes the data points
|
||||
and produces a representation \(\overline{z_i} \in \mathbb{R}^n\)
|
||||
in a latent space. This latent space is smaller (\(n < m\)) than the
|
||||
input which necessitates some form of compression.
|
||||
|
||||
A second function \(g: \Omega \rightarrow \mathbb{R}^m\) is the
|
||||
generator function that takes the latent representation
|
||||
\(z_i \in \Omega \subset \mathbb{R}^n\) and generates an output
|
||||
\(\overline{x_i}\) as close as possible to the input data
|
||||
distribution.
|
||||
|
||||
What then is the difference between \(\overline{z_i}\) and \(z_i\)?
|
||||
With a simple auto-encoder both would be identical. In this case
|
||||
of an adversarial auto-encoder it is slightly more complicated.
|
||||
There is a discriminator \(D_z\) that tries to distinguish between
|
||||
an encoded data point \(\overline{z_i}\) and a \(z_i \sim \mathcal{N}(0,1)\) drawn from a normal distribution with \(0\) mean
|
||||
and a standard deviation of \(1\). During training, the encoding
|
||||
function \(e\) attempts to minimize any perceivable difference
|
||||
between \(z_i\) and \(\overline{z_i}\) while \(D_z\) has the
|
||||
aforementioned adversarial task to differentiate between them.
|
||||
|
||||
Furthermore, there is a discriminator \(D_x\) that has the task
|
||||
to differentiate the generated output \(\overline{x_i}\) from the
|
||||
actual input \(x_i\). During training, the generator function \(g\)
|
||||
tries to minimize the perceivable difference between \(\overline{x_i}\) and \(x_i\) while \(D_x\) has the mentioned
|
||||
adversarial task to distinguish between them.
|
||||
|
||||
With this all components of the adversarial auto-encoder employed
|
||||
by Pidhorskyi et al. have been introduced. Finally, the losses are
|
||||
presented. The two adversarial objectives have been mentioned
|
||||
already. Specifically, there is the adversarial loss for the
|
||||
discriminator \(D_z\):
|
||||
\begin{equation} \label{eq:adv-loss-z}
|
||||
\mathcal{L}_{adv-d_z}(x,e,D_z) = E[\log (D_z(\mathcal{N}(0,1)))] + E[\log (1 - D_z(e(x)))],
|
||||
\end{equation}
|
||||
\noindent
|
||||
where \(E\) stands for an expected
|
||||
value\footnote{a term used in probability theory},
|
||||
\(x\) stands for the input, and
|
||||
\(\mathcal{N}(0,1)\) represents an element drawn from the specified
|
||||
distribution. The encoder \(e\) attempts to minimize this loss while
|
||||
the discriminator \(D_z\) intends to maximize it.
|
||||
|
||||
In the same way the adversarial loss for the discriminator \(D_x\)
|
||||
is specified:
|
||||
\begin{equation} \label{eq:adv-loss-x}
|
||||
\mathcal{L}_{adv-d_x}(x,D_x,g) = E[\log(D_x(x))] + E[\log(1 - D_x(g(\mathcal{N}(0,1))))],
|
||||
\end{equation}
|
||||
\noindent
|
||||
where \(x\), \(E\), and \(\mathcal{N}(0,1)\) have the same meaning
|
||||
as before. In this case the generator \(g\) tries to minimize the loss
|
||||
while the discriminator \(D_x\) attempts to maximize it.
|
||||
|
||||
Every auto-encoder requires a reconstruction error to work. This
|
||||
error calculates the difference between the original input and
|
||||
the generated or decoded output. In this case, the reconstruction
|
||||
loss is defined like this:
|
||||
\begin{equation} \label{eq:recon-loss}
|
||||
\mathcal{L}_{error}(x, e, g) = - E[\log(p(g(e(x)) | x))],
|
||||
\end{equation}
|
||||
\noindent
|
||||
where \(\log(p)\) is the log-likelihood and \(x\),
|
||||
\(E\), \(e\), and \(g\) have the same meaning as before.
|
||||
|
||||
All losses combined result in the following formula:
|
||||
\begin{equation} \label{eq:full-loss}
|
||||
\mathcal{L}(x,e,D_z,D_x,g) = \mathcal{L}_{adv-d_z}(x,e,D_z) + \mathcal{L}_{adv-d_x}(x,D_x,g) + \lambda \mathcal{L}_{error}(x,e,g),
|
||||
\end{equation}
|
||||
\noindent
|
||||
where \(\lambda\) is a parameter used to balance the adversarial
|
||||
losses with the reconstruction loss. The model is trained by
|
||||
Pidhorskyi et al. using the Adam optimizer by performing alternating
|
||||
updates of each of the aforementioned components:
|
||||
|
||||
\begin{itemize}
|
||||
\item Maximize \(\mathcal{L}_{adv-d_x}\) by updating weights of \(D_x\);
|
||||
\item Minimize \(\mathcal{L}_{adv-d_x}\) by updating weights of \(g\);
|
||||
\item Maximize \(\mathcal{L}_{adv-d_z}\) by updating weights of \(D_z\);
|
||||
\item Minimize \(\mathcal{L}_{error}\) and \(\mathcal{L}_{adv-d_z}\) by updating weights of \(e\) and \(g\).
|
||||
\end{itemize}
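To make the structure of the losses more tangible, the adversarial terms and the combination in formula (\ref{eq:full-loss}) can be evaluated for given discriminator outputs. The following is a minimal NumPy sketch and not the code of Pidhorskyi et al.; the function names, the discriminator outputs, the reconstruction error of 0.25 and the value of \(\lambda\) are all hypothetical. Expected values are approximated by sample means.

```python
import numpy as np


def adversarial_loss(d_on_real, d_on_fake):
    """E[log D(real)] + E[log(1 - D(fake))] for outputs of D in (0, 1).

    For D_z, "real" are samples from N(0, 1) and "fake" are encodings e(x).
    For D_x, "real" are inputs x and "fake" are generated g(N(0, 1)).
    """
    d_on_real = np.asarray(d_on_real, dtype=float)
    d_on_fake = np.asarray(d_on_fake, dtype=float)
    return float(np.mean(np.log(d_on_real)) + np.mean(np.log(1.0 - d_on_fake)))


def full_loss(adv_dz, adv_dx, error, lam):
    """Sum of both adversarial losses and the weighted reconstruction loss."""
    return adv_dz + adv_dx + lam * error


# A discriminator that cannot tell real from fake outputs 0.5 everywhere.
undecided = adversarial_loss([0.5, 0.5], [0.5, 0.5])
# A discriminator that separates both well yields a larger (less negative) value.
confident = adversarial_loss([0.9, 0.9], [0.1, 0.1])

total = full_loss(undecided, confident, error=0.25, lam=2.0)
```

The comparison of both cases shows why the discriminators maximize their loss while the encoder and the generator try to push it back down.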
|
||||
|
||||
Practically, the auto-encoder is trained separately for every
|
||||
object class that is considered ``known''. Pidhorskyi et al. trained
|
||||
it on the MNIST~\cite{Lecun1998} data set, once for every digit.
|
||||
|
||||
For this thesis it needs to be trained on the SceneNet RGB-D
|
||||
data set using MS COCO classes as known classes. As in every
|
||||
test epoch all known classes are present, it becomes
|
||||
non-trivial which of the trained auto-encoders should be used to
|
||||
calculate novelty. To phrase it differently, a true positive
|
||||
detection is possible for multiple classes in the same image.
|
||||
If, for example, one object is classified correctly by SSD as a chair
|
||||
the novelty score should be low. But the auto-encoders of all
|
||||
known classes except the ``chair'' class will ideally give a high novelty
|
||||
score. Which of the values should be used? The only sensible solution
|
||||
is to only run it through the auto-encoder that was trained for
|
||||
the class the SSD model predicted. This provides the following
|
||||
scenarios:
|
||||
\begin{itemize}
|
||||
\item true positive classification: novelty score should be low
|
||||
\item false positive classification and correct class is
|
||||
among the known classes: novelty score should be high
|
||||
\item false positive classification and correct class is unknown:
|
||||
novelty score should be high
|
||||
\end{itemize}
|
||||
\noindent
|
||||
Negative classifications are not listed as these are not part
|
||||
of the output of the SSD and cannot be given to the auto-encoder
|
||||
as input. Furthermore, the second case should not happen because
|
||||
the trained SSD knows this other class and is very likely
|
||||
to give it a higher probability. Therefore, using only one
|
||||
auto-encoder fulfils the task of differentiating between
|
||||
known and unknown classes.
|
||||
|
||||
\section{Generative Probabilistic Novelty Detection}
|
||||
|
||||
It is still unclear how the novelty score is calculated.
|
||||
This section will clear this up in terms that are as understandable
as possible. However, the name ``Generative Probabilistic
Novelty Detection''~\cite{Pidhorskyi2018} already signals that
|
||||
probability theory has something to do with it. Furthermore, this
|
||||
section will make use of some mathematical terms which cannot
|
||||
be explained in great detail here. Moreover, the previous section
|
||||
already introduced many required components, which will not be
|
||||
explained here again.
|
||||
|
||||
For the purpose of this explanation a trained auto-encoder
|
||||
is assumed. In that case the generator function describes
|
||||
the model that the auto-encoder is actually using for the
|
||||
novelty detection. The task of training is to make sure this
|
||||
model comes as close as possible to the real model of the
|
||||
training or testing data. The model of the auto-encoder
|
||||
is in mathematical terms a parameterized manifold
|
||||
\(\mathcal{M} \equiv g(\Omega)\) of dimension \(n\).
|
||||
The set of training or testing data can then be described
|
||||
in the following way:
|
||||
\begin{equation} \label{eq:train-set}
|
||||
x_i = g(z_i) + \xi_i \quad i \in \mathbb{N},
|
||||
\end{equation}
|
||||
\noindent
|
||||
where \(\xi_i\) represents noise. It may be confusing but
|
||||
for the purpose of this novelty test the "truth" is what
|
||||
the generator function generates from a set of \(z_i \in \Omega\),
|
||||
not the ground truth from the data set. Furthermore,
|
||||
the previously introduced encoder function \(e\) is assumed
|
||||
to work as an exact inverse of \(g\) for every \(x \in \mathcal{M}\).
|
||||
For such \(x\) it follows that \(x = g(e(x))\).
|
||||
|
||||
Let \(\overline{x} \in \mathbb{R}^m\) be a data point from the test
|
||||
data. The remainder of the section will explain how the novelty
|
||||
test is performed for this \(\overline{x}\). It is important
|
||||
to note that this data point is not necessarily part of the
|
||||
auto-encoder model. Therefore, \(g(e(\overline{x})) = \overline{x}\) cannot
|
||||
be assumed. However, it can be observed that \(\overline{x}\)
|
||||
can be non-linearly projected onto
|
||||
\(\overline{x}^{\|} \in \mathcal{M}\)
|
||||
by using \(g(\overline{z})\) with \(\overline{z} = e(\overline{x})\).
|
||||
It is assumed that \(g\) is smooth enough to perform a linearization
|
||||
based on the first-order Taylor expansion:
|
||||
\begin{equation} \label{eq:taylor-expanse}
|
||||
g(z) = g(\overline{z}) + J_g(\overline{z}) (z - \overline{z}) + \mathcal{O}(\| z - \overline{z} \|^2),
|
||||
\end{equation}
|
||||
\noindent
|
||||
where \(J_g(\overline{z})\) is the Jacobi matrix of \(g\) computed
|
||||
at \(\overline{z}\). It is assumed that the Jacobi matrix of \(g\)
|
||||
has full rank at every point of the manifold. A Jacobi matrix
|
||||
contains all first-order partial derivatives of a function.
|
||||
\(\| \cdot \|\) is the \(\mathbf{L}_2\) norm, which calculates the
|
||||
length of a vector by calculating the square root of the sum of
|
||||
squares of all dimensions of the vector. Lastly, \(\mathcal{O}\)
is Big-O notation, which here bounds the remainder of the
expansion. The remainder shrinks quadratically with the distance
\(\| z - \overline{z} \|\), which means that this part of the term
can be ignored for \(z\) close to \(\overline{z}\).
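The behaviour of the first-order expansion in formula (\ref{eq:taylor-expanse}) can be checked numerically. The following sketch uses a small hand-written function \(g: \mathbb{R}^2 \rightarrow \mathbb{R}^3\) with its analytic Jacobi matrix; this toy function is an assumption for illustration only and has nothing to do with a trained generator. It measures the linearization error for two step sizes around \(\overline{z}\).

```python
import numpy as np


def g(z):
    """Toy smooth function from R^2 to R^3 (illustration only)."""
    return np.array([np.sin(z[0]), z[0] * z[1], np.exp(z[1])])


def jacobian_g(z):
    """Analytic Jacobi matrix of the toy function g, shape (3, 2)."""
    return np.array([
        [np.cos(z[0]), 0.0],
        [z[1], z[0]],
        [0.0, np.exp(z[1])],
    ])


z_bar = np.array([0.3, -0.2])
direction = np.array([1.0, 1.0]) / np.sqrt(2.0)  # unit vector


def linearization_error(step):
    """L2 norm of g(z) minus its first-order approximation around z_bar."""
    z = z_bar + step * direction
    linear = g(z_bar) + jacobian_g(z_bar) @ (z - z_bar)
    return float(np.linalg.norm(g(z) - linear))


err_large = linearization_error(1e-2)
err_small = linearization_error(1e-3)
ratio = err_large / err_small
```

Reducing the step by a factor of ten reduces the error by roughly a factor of one hundred, which is the quadratic behaviour expressed by the remainder term.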
|
||||
|
||||
Next the tangent space of \(g\) at \(\overline{x}^{\|}\), which
|
||||
is spanned by the \(n\) independent column vectors of the Jacobi
|
||||
matrix \(J_g(\overline{z})\), is defined as
|
||||
\(\mathcal{T} = \text{span}(J_g(\overline{z}))\). The tangent space
at a point of the manifold contains all directions in which one can
move along the manifold through this point. The Jacobi matrix can be decomposed into three
|
||||
matrices using singular value decomposition: \(J_g(\overline{z}) = U^{\|}SV^{*}\). \(\mathcal{T}\) is also spanned
by the column vectors of \(U^{\|}\): \(\mathcal{T} = \text{span}(U^{\|})\). \(U^{\|}\) contains the left-singular vectors
and \(V^{*}\) is the conjugate transpose of the matrix
\(V\), which contains the right-singular vectors. \(U^{\bot}\) is
|
||||
defined in such a way that \(U = [U^{\|}U^{\bot}]\) is a unitary
|
||||
matrix. \(\mathcal{T}^{\bot}\) is the orthogonal complement of
|
||||
\(\mathcal{T}\). With this preparation \(\overline{x}\) can be
|
||||
represented with respect to the local coordinates that define
|
||||
\(\mathcal{T}\) and \(\mathcal{T}^{\bot}\). This representation
|
||||
can be achieved by computing
|
||||
\begin{equation} \label{eq:w-definition}
|
||||
\overline{w} = U^{\top} \overline{x} = \left[\begin{matrix}
|
||||
U^{\|^{\top}} \overline{x} \\
|
||||
U^{\bot^{\top}} \overline{x}
|
||||
\end{matrix}\right] = \left[\begin{matrix}
|
||||
\overline{w}^{\|} \\
|
||||
\overline{w}^{\bot}
|
||||
\end{matrix}\right],
|
||||
\end{equation}
|
||||
\noindent
|
||||
where the rotated coordinates \(\overline{w}\) (the data point
expressed in the basis formed by the columns of \(U\))
are decomposed into \(\overline{w}^{\|}\), which
|
||||
are parallel to \(\mathcal{T}\), and \(\overline{w}^{\bot}\), which
|
||||
are orthogonal to \(\mathcal{T}\).
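The change of coordinates in formula (\ref{eq:w-definition}) can be reproduced with NumPy. This is a sketch under assumptions: a random matrix stands in for the Jacobi matrix \(J_g(\overline{z})\) of a trained generator, a random vector stands in for \(\overline{x}\), and all variable names are chosen for illustration. The full singular value decomposition returns a square matrix \(U\) whose first \(n\) columns form \(U^{\|}\) and whose remaining columns form \(U^{\bot}\).

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 2  # input dimension and latent dimension (n < m)

# Stand-ins for J_g(z_bar) and the test point x_bar (assumptions).
jacobian = rng.normal(size=(m, n))
x_bar = rng.normal(size=m)

# Full SVD: U is m x m and unitary, its first n columns span the tangent space.
U, S, Vh = np.linalg.svd(jacobian, full_matrices=True)
U_par, U_perp = U[:, :n], U[:, n:]

# Rotated coordinates and their parallel and orthogonal parts.
w_bar = U.T @ x_bar
w_par = U_par.T @ x_bar
w_perp = U_perp.T @ x_bar
```

Because \(U\) is unitary, the rotation preserves the length of \(\overline{x}\), and the columns of \(U^{\bot}\) are orthogonal to every column of the Jacobi matrix.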
|
||||
|
||||
The last step to define the novelty test involves probability
|
||||
density functions (PDFs), which are now introduced. The PDF \(p_X(x)\)
|
||||
describes the random variable \(X\), from which the training and
|
||||
testing data points are drawn. In addition, \(p_W(w)\) is the
|
||||
probability density function of the random variable \(W\),
|
||||
which represents \(X\) after changing the coordinates. Both
|
||||
distributions are identical. But it is assumed that the coordinates
|
||||
\(W^{\|}\), which are parallel to \(\mathcal{T}\), and the coordinates
|
||||
\(W^{\bot}\), which are orthogonal to \(\mathcal{T}\), are
|
||||
statistically independent. With this assumption the following holds:
|
||||
\begin{equation} \label{eq:pdf-x}
|
||||
p_X(x) = p_W(w) = p_W(w^{\|}, w^{\bot}) = p_{W^{\|}}(w^{\|}) p_{W^{\bot}}(w^{\bot})
|
||||
\end{equation}
|
||||
The previously introduced noise comes into play again. In formula
|
||||
(\ref{eq:train-set}) it is assumed that the noise \(\xi\)
|
||||
predominantly moves the point \(x\) away from the manifold
|
||||
\(\mathcal{M}\) in a direction orthogonal to \(\mathcal{T}\).
|
||||
As a consequence \(W^{\bot}\) is mainly responsible for the noise
|
||||
effects. Since noise and drawing from the manifold are statistically
|
||||
independent, \(W^{\|}\) and \(W^{\bot}\) are also independent.
|
||||
|
||||
Finally, referring back to the data point \(\overline{x}\), the
|
||||
novelty test is defined like this:
|
||||
\begin{equation} \label{eq:novelty-test}
|
||||
p_X(\overline{x}) = p_{W^{\|}}(\overline{w}^{\|})p_{W^{\bot}}(\overline{w}^{\bot}) =
|
||||
\begin{cases}
|
||||
\geq \gamma & \Longrightarrow \text{Inlier} \\
|
||||
< \gamma & \Longrightarrow \text{Outlier}
|
||||
\end{cases}
|
||||
\end{equation}
|
||||
\noindent
|
||||
where \(\gamma\) is a suitable threshold.
|
||||
|
||||
At this point it is very clear that the GPND approach requires
|
||||
far more math background than dropout sampling to understand
|
||||
the novelty test. Nonetheless it could be the better method.
|
||||
|
||||
\section{Contribution}
|
||||
|
||||
This section will outline what exactly the scientific as well as
|
||||
technical contribution of this thesis will be.
|
||||
|
||||
\subsection*{Scientific Contribution}
|
||||
|
||||
Miller et al.~\cite{Miller2018} use the SSD~\cite{Liu2016} network
|
||||
extended with dropout layers and run multiple forward passes
|
||||
during the testing phase for every image. Considering the number
|
||||
of images in the SceneNet RGB-D~\cite{McCormac2017} data set, these
|
||||
forward passes will take considerable time. It could be faster
|
||||
to only run one forward pass and then use the auto-encoder for
|
||||
novelty detection. However, the auto-encoder can only work
|
||||
with one detection at a time and must be called for every
|
||||
detection of the object detector separately. Therefore,
|
||||
it is interesting to investigate whether the second approach
|
||||
is indeed faster than the first.
|
||||
|
||||
Dropout sampling uses the entropy to identify false positive
|
||||
cases. Such identified detections are discarded, which allows for
|
||||
a better object detection performance. The GPND approach uses
|
||||
the auto-encoder losses and results to identify novel cases and
|
||||
therefore mark detections as false positive. Subsequently these
|
||||
detections can be discarded as well. By comparing the object
|
||||
detection performance after discarding the identified false positive
|
||||
cases, the effectiveness of both approaches can be compared with each
|
||||
other. It is interesting to research if the GPND approach results in
|
||||
a better object detection performance than the dropout sampling
|
||||
provides.
|
||||
|
||||
The formulated hypothesis, which is repeated after this paragraph,
|
||||
combines both aspects and requires a similar or better result in
|
||||
both of them. As a consequence it will be falsified if
|
||||
the computational performance of the GPND approach is not better than
|
||||
that of dropout sampling or if the object detection performance
|
||||
is worse.
|
||||
|
||||
\paragraph{Hypothesis} Novelty detection using auto-encoders
|
||||
delivers similar or better object detection performance under open set
|
||||
conditions while being less computationally expensive compared to
|
||||
dropout sampling.\\
|
||||
|
||||
There are three possible scenarios that can be the result of
|
||||
the thesis:
|
||||
\begin{itemize}
|
||||
\item the hypothesis is confirmed: Win-Win situation where
|
||||
switching to GPND is straightforward.
|
||||
\item one of the conditions fails: Win-Lose situation where
|
||||
it is a trade-off between object detection performance and
|
||||
computational performance. One approach will be better in
|
||||
one thing and the other approach in the other thing.
|
||||
\item both conditions fail: Lose-Lose situation where
|
||||
dropout sampling is the best in both aspects.
|
||||
\end{itemize}
|
||||
|
||||
Summarising, the scientific contribution is a comparison between
|
||||
dropout sampling and GPND with respect to both object detection
|
||||
performance and computational performance under open set conditions
|
||||
using the SceneNet RGB-D data set with the MS COCO classes as
|
||||
``known'' object classes.
|
||||
|
||||
The computational performance is measured by the time in milliseconds
|
||||
every test run takes. The absolute numbers are not of interest,
as these vary from machine to machine and are influenced by a
plethora of uncontrollable factors. What matters is the relative difference
between both approaches and whether the difference is significant.
|
||||
Object detection performance is measured by precision, recall,
|
||||
F1-score, and an open set error. While the first three metrics are
|
||||
standard, the last is adapted from Miller et al. It is defined
|
||||
as the number of observations (for dropout sampling) or detections
|
||||
(for GPND) that pass the respective false positive test (entropy or
|
||||
novelty), fall on unknown objects (there are no overlapping ground
|
||||
truth objects with IoU \(\geq 0.5\) and a known true class label)
|
||||
and do not have a winning class label of ``unknown''.
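The three standard metrics follow directly from the counts of true positives, false positives and false negatives. The sketch below shows the computation; the function name and the example counts are hypothetical and not results of this thesis.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1-score from detection counts.

    Returns 0.0 for a metric whose denominator would be zero.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical counts for one evaluation run.
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
```

With these counts the precision is \(0.8\), the recall is \(2/3\) and the F1-score is \(8/11\).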
|
||||
|
||||
\subsection*{Technical Contribution}
|
||||
|
||||
Technical contribution includes all contributions
|
||||
that are not necessarily new in the scientific sense but are
meaningful engineering contributions in themselves.
|
||||
|
||||
There is no available source code for the work of
|
||||
Miller et al.~\cite{Miller2018}, which necessitates my own re-implementation
of their work. The contribution is the fine-tuning of
an SSD model pre-trained on ImageNet~\cite{Deng2009}, extended by
|
||||
dropout layers, to the SceneNet RGB-D data set using MS COCO classes
|
||||
as the known classes for SSD.
|
||||
As MS COCO classes are more general than SceneNet RGB-D classes this
|
||||
also requires a mapping from one set of classes to the other.
|
||||
This entire contribution is technical and only re-implements
|
||||
what Miller et al. have already done. It is expected that the
|
||||
evaluation of the results using this self-trained model will
|
||||
reproduce the results of Miller et al.
|
||||
|
||||
For GPND source code is available but only for MNIST and using
|
||||
PyTorch. Therefore, the source code has to be ported from
PyTorch to TensorFlow. Furthermore, it must be made compatible
with the SceneNet RGB-D data set as the architecture is tailored to MNIST.
|
||||
The mapping from SceneNet RGB-D to MS COCO applies here as well and
|
||||
can therefore be considered a separate contribution. A fine-tuned
|
||||
SSD is also required, but this time without added dropout layers.
|
||||
Additionally, it is necessary to train the auto-encoder for every
|
||||
known class separately.
|
||||
|
||||
To summarise, the following separate deliverables
|
||||
are contributed:
|
||||
|
||||
\begin{itemize}
|
||||
\item source code for dropout sampling compatible with TensorFlow
|
||||
\item source code for GPND compatible with TensorFlow
|
||||
\item mapping from SceneNet RGB-D classes to MS COCO classes
|
||||
\item vanilla SSD model fine-tuned on SceneNet RGB-D
|
||||
\item dropout SSD model fine-tuned on SceneNet RGB-D
|
||||
\item auto-encoder model trained separately on every MS COCO class
|
||||
\end{itemize}
|
||||
\chapter{Closing}
|
||||
|
|
|
@ -25,7 +25,7 @@
|
|||
}{}
|
||||
|
||||
% use custom package to prevent spamming the preamble
|
||||
\usepackage[licence]{masterthesis}
|
||||
\usepackage[licence,library,acknowledge,abstract]{masterthesis}
|
||||
|
||||
% specify image location
|
||||
\graphicspath{{./images/}{./private/images/}}
|
||||
|
@ -38,7 +38,7 @@
|
|||
% invoke start command(s) from masterthesis package
|
||||
\start
|
||||
|
||||
\input{body_expose.tex}
|
||||
\input{body.tex}
|
||||
|
||||
% invoke finish command(s) from masterthesis package
|
||||
\finish
|
||||
|
|