Finished background on dropout sampling (raw version)

Signed-off-by: Jim Martens <github@2martens.de>
This commit is contained in:
Jim Martens 2019-08-13 13:15:23 +02:00
parent e9394348c9
commit f462220880
1 changed files with 16 additions and 269 deletions

body.tex

@ -170,12 +170,8 @@ Therefore, the contribution is found in chapters \ref{chap:methods},
This chapter will begin with an overview of previous works
in the field of this thesis. Afterwards, the theoretical foundations
of the work of Miller et al.~\cite{Miller2018} and auto-encoders will
be explained. The chapter concludes with more details about the
research question and the intended contribution of this thesis.
For both background sections the notation defined in table
\ref{tab:notation} will be used.
of the work of Miller et al.~\cite{Miller2018} will
be explained.
\section{Related Works}
@ -246,7 +242,7 @@ Empirical Bayes procedure to select prior variances.
\begin{table}
\centering
\caption{Notation for background sections}
\caption{Notation for background}
\label{tab:notation}
\begin{tabular}{l|l}
symbol & meaning \\
@ -272,15 +268,18 @@ Empirical Bayes procedure to select prior variances.
\(\mathcal{O}\) & observation \\
\(\overline{\mathbf{q}}\) & probability vector for
observation \\
\(\overline{\mathbf{z}}, \mathbf{z}\) & latent space representation \\
\(d_T, d_z\) & discriminators \\
\(e, g\) & encoding and decoding/generating function \\
\(J_g\) & Jacobi matrix for generating function \\
\(\mathcal{T}\) & tangent space \\
\(\mathbf{R}\) & training/test data changed to be on tangent space
%\(E[something]\) & expected value of something
%\(\overline{\mathbf{z}}, \mathbf{z}\) & latent space representation \\
%\(d_T, d_z\) & discriminators \\
%\(e, g\) & encoding and decoding/generating function \\
%\(J_g\) & Jacobi matrix for generating function \\
%\(\mathcal{T}\) & tangent space \\
%\(\mathbf{R}\) & training/test data changed to be on tangent space
\end{tabular}
\end{table}
This section will use the \textbf{notation} defined in table
\ref{tab:notation} on page \pageref{tab:notation}.
To understand dropout sampling, it is necessary to explain the
idea of Bayesian neural networks. They place a prior distribution
over the network weights, for example a Gaussian prior distribution:
@ -289,7 +288,8 @@ over the network weights, for example a Gaussian prior distribution:
weight is drawn independently from the same distribution. The
training of the network determines a plausible set of weights by
evaluating the posterior (probability output) over the weights given
the training data: \(p(\mathbf{W}|\mathbf{T})\). However, this
the training data \(\mathbf{T}\): \(p(\mathbf{W}|\mathbf{T})\).
However, this
evaluation cannot be performed in any reasonable
time. Therefore, approximation techniques are
required. In those techniques the posterior is fitted with a
@ -343,10 +343,8 @@ Subsequently, the corresponding vector of class probabilities
score vectors \(\mathbf{s}_j\) in a particular observation
\(\mathcal{O}_i\): \(\overline{\mathbf{q}}_i \approx \overline{\mathbf{s}}_i = \frac{1}{n} \sum_{j=1}^{n} \mathbf{s}_j\). The label uncertainty
of the detector for a particular observation is measured by
the entropy \(H(\overline{\mathbf{q}}_i) = - \sum_j \overline{q}_{ij} \cdot \log \overline{q}_{ij}\).
the entropy \(H(\overline{\mathbf{q}}_i)\).
In the introduction I used a very reduced version to describe
maximum and low uncertainty. A more complete explanation:
If \(\overline{\mathbf{q}}_i\), which I called averaged class probabilities,
resembles a uniform distribution, the entropy will be high. A uniform
distribution means that no class is more likely than another, which
@ -355,260 +353,9 @@ one class has a very high probability the entropy will be low.
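As a concrete illustration (my own addition, not taken from Miller
et al.~\cite{Miller2018}), the two extremes for \(C\) classes are
\begin{equation} \label{eq:entropy-extremes}
H\left(\tfrac{1}{C}, \dots, \tfrac{1}{C}\right) = - \sum_{j=1}^{C} \tfrac{1}{C} \log \tfrac{1}{C} = \log C
\qquad \text{and} \qquad
H(1, 0, \dots, 0) = 0,
\end{equation}
\noindent
using the convention \(0 \cdot \log 0 = 0\).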
In open set conditions it can be expected that falsely generated
detections for unknown object classes have a higher label
uncertainty. A treshold on the entropy \(H(\overline{\mathbf{q}}_i)\) can then
uncertainty. A threshold on the entropy \(H(\overline{\mathbf{q}}_i)\) can then
be used to identify and reject these false positive cases.
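To make the procedure more tangible, the following minimal sketch
(my own illustration, not code from Miller et al.~\cite{Miller2018})
shows how the averaged score vector, its entropy, and the rejection
decision could be computed with NumPy. The score values and the
threshold are made-up placeholders.
\begin{verbatim}
import numpy as np

# n forward passes with dropout enabled yield n score vectors for one
# observation; shape (n_passes, n_classes). Values are made up.
scores = np.array([
    [0.70, 0.20, 0.10],
    [0.65, 0.25, 0.10],
    [0.75, 0.15, 0.10],
])

# Average the score vectors of the observation: q_bar ~ s_bar.
q_bar = scores.mean(axis=0)

# Label uncertainty: entropy H(q_bar) = -sum_j q_j * log(q_j).
entropy = -np.sum(q_bar * np.log(q_bar + 1e-12))

# Reject the detection if the entropy exceeds a chosen threshold
# (0.8 is an arbitrary placeholder value).
rejected = entropy > 0.8
print(q_bar, entropy, rejected)
\end{verbatim}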
\section{Adversarial Auto-encoder}
This section will explain the adversarial auto-encoder used by
Pidhorskyi et al.~\cite{Pidhorskyi2018} but in a slightly modified
form to make it more understandable.
The training data \(\mathbf{T} \in \mathbb{R}^m \) is the input
of the auto-encoder. An encoding function \(e: \mathbb{R}^m \rightarrow \mathbb{R}^n\) takes the data
and produces a representation \(\overline{\mathbf{z}} \in \mathbb{R}^n\)
in a latent space. This latent space is smaller than the
input (\(n < m\)), which necessitates some form of compression.
A second function \(g: \Omega \rightarrow \mathbb{R}^m\) is the
generator function that takes the latent representation
\(\mathbf{z} \in \Omega \subset \mathbb{R}^n\) and generates an output
\(\overline{\mathbf{T}}\) as close as possible to the input data
distribution.
What then is the difference between \(\overline{\mathbf{z}}\) and \(\mathbf{z}\)?
With a simple auto-encoder both would be identical. In the case
of an adversarial auto-encoder it is slightly more complicated.
There is a discriminator \(d_z\) that tries to distinguish between
an encoded data point \(\overline{\mathbf{z}}\) and a \(\mathbf{z} \sim \mathcal{N}(0,1)\) drawn from a normal distribution with \(0\) mean
and a standard deviation of \(1\). During training, the encoding
function \(e\) attempts to minimize any perceivable difference
between \(\mathbf{z}\) and \(\overline{\mathbf{z}}\) while \(d_z\) has the
aforementioned adversarial task to differentiate between them.
Furthermore, there is a discriminator \(d_T\) that has the task
of differentiating the generated output \(\overline{\mathbf{T}}\) from the
actual input \(\mathbf{T}\). During training, the generator function \(g\)
tries to minimize the perceivable difference between \(\overline{\mathbf{T}}\) and \(\mathbf{T}\) while \(d_T\) has the aforementioned
adversarial task of distinguishing between them.
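Before turning to the losses, the four components can be summarized
in a minimal sketch. This is my own simplification with arbitrary
layer sizes, not the actual architecture of
Pidhorskyi et al.~\cite{Pidhorskyi2018}; it merely fixes the roles of
\(e\), \(g\), \(d_z\), and \(d_T\) in code (PyTorch is assumed here).
\begin{verbatim}
import torch
import torch.nn as nn

m, n = 784, 32  # input and latent dimensions (arbitrary example values)

# e: R^m -> R^n, compresses the input into the latent space
encoder = nn.Sequential(nn.Linear(m, 256), nn.ReLU(), nn.Linear(256, n))

# g: R^n -> R^m, generates output close to the input distribution
generator = nn.Sequential(nn.Linear(n, 256), nn.ReLU(), nn.Linear(256, m))

# d_z: distinguishes encoded points from z ~ N(0, 1)
d_z = nn.Sequential(nn.Linear(n, 128), nn.ReLU(),
                    nn.Linear(128, 1), nn.Sigmoid())

# d_T: distinguishes generated outputs from real inputs
d_T = nn.Sequential(nn.Linear(m, 128), nn.ReLU(),
                    nn.Linear(128, 1), nn.Sigmoid())
\end{verbatim}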
With this, all components of the adversarial auto-encoder employed
by Pidhorskyi et al.\ are introduced. Finally, the losses are
presented. The two adversarial objectives have been mentioned
already. Specifically, there is the adversarial loss for the
discriminator \(d_z\):
\begin{equation} \label{eq:adv-loss-z}
\mathcal{L}_{adv-d_z}(\mathbf{T},e,d_z) = E[\log (d_z(\mathcal{N}(0,1)))] + E[\log (1 - d_z(e(\mathbf{T})))],
\end{equation}
\noindent
where \(E\) denotes the expected value
from probability theory,
\(\mathbf{T}\) stands for the input, and
\(\mathcal{N}(0,1)\) represents an element drawn from the specified
distribution. The encoder \(e\) attempts to minimize this loss while
the discriminator \(d_z\) intends to maximize it.
In the same way the adversarial loss for the discriminator \(d_T\)
is specified:
\begin{equation} \label{eq:adv-loss-x}
\mathcal{L}_{adv-d_T}(\mathbf{T},d_T,g) = E[\log(d_T(\mathbf{T}))] + E[\log(1 - d_T(g(\mathcal{N}(0,1))))],
\end{equation}
\noindent
where \(\mathbf{T}\), \(E\), and \(\mathcal{N}(0,1)\) have the same meaning
as before. In this case the generator \(g\) tries to minimize the loss
while the discriminator \(d_T\) attempts to maximize it.
Every auto-encoder requires a reconstruction error to work. This
error calculates the difference between the original input and
the generated or decoded output. In this case, the reconstruction
loss is defined like this:
\begin{equation} \label{eq:recon-loss}
\mathcal{L}_{error}(\mathbf{T}, e, g) = - E[\log(p(g(e(\mathbf{T})) | \mathbf{T}))],
\end{equation}
\noindent
where \(p(\cdot \,|\, \mathbf{T})\) is the likelihood of the reconstruction
given the input, making the loss the negative expected log-likelihood, and \(\mathbf{T}\),
\(E\), \(e\), and \(g\) have the same meaning as before.
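Expressed in code, the three losses could look like the following
sketch, which continues the component sketch above (my own rendering
of formulas \ref{eq:adv-loss-z}, \ref{eq:adv-loss-x}, and
\ref{eq:recon-loss}). The expected values are approximated by batch
means, and the reconstruction log-likelihood is assumed to be
Gaussian, which reduces the reconstruction loss to a squared error up
to constants.
\begin{verbatim}
import torch

def adv_loss_d_z(T, encoder, d_z, latent_dim=32):
    # E[log d_z(N(0,1))] + E[log(1 - d_z(e(T)))], batch means as E
    z_prior = torch.randn(T.size(0), latent_dim)
    return (torch.log(d_z(z_prior)).mean()
            + torch.log(1.0 - d_z(encoder(T))).mean())

def adv_loss_d_T(T, d_T, generator, latent_dim=32):
    # E[log d_T(T)] + E[log(1 - d_T(g(N(0,1))))]
    z_prior = torch.randn(T.size(0), latent_dim)
    return (torch.log(d_T(T)).mean()
            + torch.log(1.0 - d_T(generator(z_prior))).mean())

def recon_loss(T, encoder, generator):
    # -E[log p(g(e(T)) | T)]; with a Gaussian likelihood this is,
    # up to constants, the mean squared reconstruction error
    return ((generator(encoder(T)) - T) ** 2).mean()
\end{verbatim}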
All losses combined result in the following formula:
\begin{equation} \label{eq:full-loss}
\mathcal{L}(\mathbf{T},e,d_z,d_T,g) = \mathcal{L}_{adv-d_z}(\mathbf{T},e,d_z) + \mathcal{L}_{adv-d_T}(\mathbf{T},d_T,g) + \lambda \mathcal{L}_{error}(\mathbf{T},e,g),
\end{equation}
\noindent
where \(\lambda\) is a parameter used to balance the adversarial
losses with the reconstruction loss. The model is trained by
Pidhorskyi et al.\ using the Adam optimizer by alternating
updates of each of the aforementioned components (a sketch of
one such training step follows the list):
\begin{itemize}
\item Maximize \(\mathcal{L}_{adv-d_T}\) by updating weights of \(d_T\);
\item Minimize \(\mathcal{L}_{adv-d_T}\) by updating weights of \(g\);
\item Maximize \(\mathcal{L}_{adv-d_z}\) by updating weights of \(d_z\);
\item Minimize \(\mathcal{L}_{error}\) and \(\mathcal{L}_{adv-d_z}\) by updating weights of \(e\) and \(g\).
\end{itemize}
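A single training step implementing these alternating updates could
look like the following sketch, reusing the components and loss
functions from the previous sketches. The optimizer settings and the
value of \(\lambda\) are placeholders, not the settings of
Pidhorskyi et al.
\begin{verbatim}
lam = 10.0  # balance parameter lambda (placeholder value)

opt_d_T = torch.optim.Adam(d_T.parameters(), lr=2e-4)
opt_d_z = torch.optim.Adam(d_z.parameters(), lr=2e-4)
opt_g   = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_eg  = torch.optim.Adam(list(encoder.parameters())
                           + list(generator.parameters()), lr=2e-4)

def train_step(T):
    # 1) maximize L_adv-d_T by updating d_T (minimize its negative)
    opt_d_T.zero_grad()
    (-adv_loss_d_T(T, d_T, generator)).backward()
    opt_d_T.step()

    # 2) minimize L_adv-d_T by updating g
    opt_g.zero_grad()
    adv_loss_d_T(T, d_T, generator).backward()
    opt_g.step()

    # 3) maximize L_adv-d_z by updating d_z
    opt_d_z.zero_grad()
    (-adv_loss_d_z(T, encoder, d_z)).backward()
    opt_d_z.step()

    # 4) minimize L_error and L_adv-d_z by updating e and g
    opt_eg.zero_grad()
    (adv_loss_d_z(T, encoder, d_z)
     + lam * recon_loss(T, encoder, generator)).backward()
    opt_eg.step()
\end{verbatim}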
Practically, the auto-encoder is trained separately for every
object class that is considered ``known''. Pidhorskyi et al.\ trained
it on the MNIST~\cite{Lecun1998} data set, once for every digit.
For this thesis it needs to be trained on the SceneNet RGB-D
data set using MS COCO classes as known classes. Since all
known classes are present in every test epoch, it becomes
non-trivial which of the trained auto-encoders should be used to
calculate novelty. To phrase it differently, a true positive
detection is possible for multiple classes in the same image.
If, for example, one object is correctly classified by SSD as a chair,
the novelty score should be low. But the auto-encoders of all
known classes except the ``chair'' class will ideally give a high novelty
score. Which of the values should be used? The only sensible solution
is to run a detection only through the auto-encoder that was trained for
the class the SSD model predicted. This leads to the following
scenarios:
\begin{itemize}
\item true positive classification: novelty score should be low
\item false positive classification and correct class is
among the known classes: novelty score should be high
\item false positive classification and correct class is unknown:
novelty score should be high
\end{itemize}
\noindent
Negative classifications are not listed as these are not part
of the output of the SSD and cannot be given to the auto-encoder
as input. Furthermore, the second case should not happen because
the trained SSD knows this other class and is very likely
to give it a higher probability. Therefore, using only one
auto-encoder fulfils the task of differentiating between
known and unknown classes.
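In code, this selection logic amounts to a simple lookup. The
following sketch is my own illustration; the dictionary of per-class
auto-encoders and the novelty scoring function are placeholders for
the components described in the next section.
\begin{verbatim}
def novelty_for_detection(detection, autoencoders, novelty_score):
    # autoencoders: dict mapping a known class name to its trained
    # auto-encoder; novelty_score: placeholder for the GPND score of
    # the next section. Only the auto-encoder of the class predicted
    # by SSD is consulted.
    predicted_class = detection["predicted_class"]
    ae = autoencoders[predicted_class]
    return novelty_score(ae, detection["image_patch"])
\end{verbatim}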
\section{Generative Probabilistic Novelty Detection}
It is still unclear how the novelty score is calculated.
This section will clear this up in terms as understandable as
possible. However, the name ``Generative Probabilistic
Novelty Detection''~\cite{Pidhorskyi2018} already signals that
probability theory is involved. Furthermore, this
section will make use of some mathematical terms which cannot
be explained in great detail here. Moreover, the previous section
already introduced many required components, which will not be
explained here again.
For the purpose of this explanation a trained auto-encoder
is assumed. In that case the generator function describes
the model that the auto-encoder is actually using for the
novelty detection. The task of training is to make sure this
model comes as close as possible to the real model of the
training or testing data. The model of the auto-encoder
is in mathematical terms a parameterized manifold
\(\mathcal{M} \equiv g(\Omega)\) of dimension \(n\).
The set of training or testing data can then be described
in the following way:
\begin{equation} \label{eq:train-set}
\mathbf{T}_i = g(\mathbf{z}_i) + \xi_i \quad i \in \mathbb{N},
\end{equation}
\noindent
where \(\xi_i\) represents noise. It may be confusing but
for the purpose of this novelty test the ``truth'' is what
the generator function generates from a set of \(\mathbf{z} \in \Omega\),
not the ground truth from the data set. Furthermore,
the previously introduced encoder function \(e\) is assumed
to work as an exact inverse of \(g\) for every \(\mathbf{T} \in \mathcal{M}\).
For such \(\mathbf{T}\) it follows that \(\mathbf{T} = g(e(\mathbf{T}))\).
Let \(\overline{\mathbf{T}} \in \mathbb{R}^m\) be the test data. The
remainder of the section will explain how the novelty
test is performed for this \(\overline{\mathbf{T}}\). It is important
to note that this data is not necessarily part of the
auto-encoder model. Therefore, \(g(e(\overline{\mathbf{T}})) = \overline{\mathbf{T}}\) cannot
be assumed. However, it can be observed that \(\overline{\mathbf{T}}\)
can be non-linearly projected onto
\(\overline{\mathbf{T}}^{\|} \in \mathcal{M}\)
by using \(g(\overline{\mathbf{z}})\) with \(\overline{\mathbf{z}} = e(\overline{\mathbf{T}})\).
It is assumed that \(g\) is smooth enough to perform a linearization
based on the first-order Taylor expansion:
\begin{equation} \label{eq:taylor-expanse}
g(\mathbf{z}) = g(\overline{\mathbf{z}}) + J_g(\overline{\mathbf{z}}) (\mathbf{z} - \overline{\mathbf{z}}) + \mathcal{O}(\| \mathbf{z} - \overline{\mathbf{z}} \|^2),
\end{equation}
\noindent
where \(J_g(\overline{\mathbf{z}})\) is the Jacobi matrix of \(g\) computed
at \(\overline{\mathbf{z}}\). It is assumed that the Jacobi matrix of \(g\)
has full rank at every point of the manifold. A Jacobi matrix
contains all first-order partial derivatives of a function.
\(\| \cdot \|\) is the \(\mathbf{L}_2\) norm, which is the
length of a vector, calculated as the square root of the sum of the
squares of its components. Lastly, \(\mathcal{O}\) is the Big-O
notation and here denotes the remainder of the Taylor expansion.
The remainder is of second order in \(\| \mathbf{z} - \overline{\mathbf{z}} \|\)
and can therefore be neglected for \(\mathbf{z}\) sufficiently close
to \(\overline{\mathbf{z}}\).
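To illustrate the linearization, the following sketch (a toy example
of my own, unrelated to the actual networks) approximates the Jacobi
matrix of a small generator function by finite differences and checks
the first-order Taylor expansion numerically:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n, m = 2, 5                    # latent and data dimensions (toy values)
A = rng.normal(size=(m, n))    # fixed weights of a toy generator

def g(z):
    # a smooth toy stand-in for the generator g: R^n -> R^m
    return np.tanh(A @ z)

def jacobian(f, z, eps=1e-6):
    # finite-difference approximation of the Jacobi matrix of f at z
    J = np.zeros((m, n))
    for j in range(n):
        dz = np.zeros(n)
        dz[j] = eps
        J[:, j] = (f(z + dz) - f(z - dz)) / (2 * eps)
    return J

z_bar = rng.normal(size=n)
z = z_bar + 0.01 * rng.normal(size=n)   # a point close to z_bar
J = jacobian(g, z_bar)

# first-order Taylor expansion of g around z_bar
approx = g(z_bar) + J @ (z - z_bar)
print(np.linalg.norm(g(z) - approx))    # small, second-order error
\end{verbatim}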
Next, the tangent space of \(g\) at \(\overline{\mathbf{T}}^{\|}\), which
is spanned by the \(n\) independent column vectors of the Jacobi
matrix \(J_g(\overline{\mathbf{z}})\), is defined as
\(\mathcal{T} = \text{span}(J_g(\overline{\mathbf{z}}))\). The tangent space
at a point of the manifold contains all directions that are
tangential to the manifold at this point. The Jacobi matrix can be decomposed into three
matrices using singular value decomposition: \(J_g(\overline{\mathbf{z}}) = U^{\|}SV^{*}\). \(\mathcal{T}\) is defined to also be spanned
by the column vectors of \(U^{\|}\): \(\mathcal{T} = \text{span}(U^{\|})\). \(U^{\|}\) contains the left-singular vectors,
\(S\) contains the singular values on its diagonal, and \(V^{*}\) is the conjugate transpose of the matrix
\(V\), which contains the right-singular vectors. \(U^{\bot}\) is
defined in such a way that \(U = [U^{\|}U^{\bot}]\) is a unitary
matrix. \(\mathcal{T^{\bot}}\) is the orthogonal complement of
\(\mathcal{T}\). With this preparation \(\overline{\mathbf{T}}\) can be
represented with respect to the local coordinates that define
\(\mathcal{T}\) and \(\mathcal{T}^{\bot}\). This representation
can be achieved by computing
\begin{equation} \label{eq:w-definition}
\overline{\mathbf{R}} = U^{\top} \overline{\mathbf{T}} = \left[\begin{matrix}
U^{\|^{\top}} \overline{\mathbf{T}} \\
U^{\bot^{\top}} \overline{\mathbf{T}}
\end{matrix}\right] = \left[\begin{matrix}
\overline{\mathbf{R}}^{\|} \\
\overline{\mathbf{R}}^{\bot}
\end{matrix}\right],
\end{equation}
\noindent
where the rotated coordinates \(\overline{\mathbf{R}}\) (the data point
expressed in the coordinate system aligned with the tangent space)
are decomposed into \(\overline{\mathbf{R}}^{\|}\), which
are parallel to \(\mathcal{T}\), and \(\overline{\mathbf{R}}^{\bot}\), which
are orthogonal to \(\mathcal{T}\).
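Continuing the toy example above, the rotated coordinates from
formula (\ref{eq:w-definition}) can be computed with a singular value
decomposition of the Jacobi matrix (again my own illustration):
\begin{verbatim}
# full SVD of the Jacobi matrix: the first n columns of U span the
# tangent space, the remaining columns its orthogonal complement
U, S, Vt = np.linalg.svd(J, full_matrices=True)
U_par, U_perp = U[:, :n], U[:, n:]

# a noisy test point near the manifold (placeholder data)
T_test = g(z_bar) + 0.05 * rng.normal(size=m)

# rotated coordinates R = U^T T, split into the part parallel to the
# tangent space and the part orthogonal to it
R_par = U_par.T @ T_test
R_perp = U_perp.T @ T_test
\end{verbatim}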
The last step to define the novelty test involves probability
density functions (PDFs), which are now introduced. The PDF \(p_T(\mathbf{T})\)
describes the random variable \(T\), from which the training and
testing data points are drawn. In addition, \(p_R(\mathbf{R})\) is the
probability density function of the random variable \(R\),
which represents \(T\) after the change of coordinates. Since \(U\) is
unitary, both distributions are identical. However, it is assumed that the coordinates
\(R^{\|}\), which are parallel to \(\mathcal{T}\), and the coordinates
\(R^{\bot}\), which are orthogonal to \(\mathcal{T}\), are
statistically independent. With this assumption the following holds:
\begin{equation} \label{eq:pdf-x}
p_T(\mathbf{T}) = p_R(\mathbf{R}) = p_R(\mathbf{R}^{\|}, \mathbf{R}^{\bot}) = p_{R^{\|}}(\mathbf{R}^{\|}) p_{R^{\bot}}(\mathbf{R}^{\bot})
\end{equation}
The previously introduced noise comes into play again. In formula
(\ref{eq:train-set}) it is assumed that the noise \(\xi\)
predominantly displaces the data points \(\mathbf{T}\) away from the manifold
\(\mathcal{M}\) in a direction orthogonal to \(\mathcal{T}\).
As a consequence \(R^{\bot}\) is mainly responsible for the noise
effects. Since noise and drawing from the manifold are statistically
independent, \(R^{\|}\) and \(R^{\bot}\) are also independent.
Finally, referring back to the data point \(\overline{\mathbf{T}}\), the
novelty test is defined as follows:
\begin{equation} \label{eq:novelty-test}
p_T(\overline{\mathbf{T}}) = p_{R^{\|}}(\overline{\mathbf{R}}^{\|})p_{R^{\bot}}(\overline{\mathbf{R}}^{\bot}) =
\begin{cases}
\geq \gamma & \Longrightarrow \text{Inlier} \\
< \gamma & \Longrightarrow \text{Outlier}
\end{cases}
\end{equation}
\noindent
where \(\gamma\) is a suitable threshold.
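Put together, the decision rule of formula (\ref{eq:novelty-test})
can be evaluated as in the following sketch. How the two densities
are actually modelled by Pidhorskyi et al.\ is beyond this summary,
so the sketch simply assumes two fitted density functions and a
threshold as inputs:
\begin{verbatim}
def novelty_test(T_test, U_par, U_perp, p_parallel, p_orthogonal, gamma):
    # p_parallel and p_orthogonal are assumed to be fitted density
    # functions for the parallel and orthogonal coordinates; gamma is
    # the chosen threshold.
    R_par = U_par.T @ T_test
    R_perp = U_perp.T @ T_test
    p = p_parallel(R_par) * p_orthogonal(R_perp)
    return "inlier" if p >= gamma else "outlier"
\end{verbatim}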
At this point it is very clear that the GPND approach requires
far more mathematical background than dropout sampling to understand
the novelty test. Nonetheless, it could be the better method.
% SSD: \cite{Liu2016}
% ImageNet: \cite{Deng2009}
% COCO: \cite{Lin2014}