Finished background on dropout sampling (raw version)
Signed-off-by: Jim Martens <github@2martens.de>
parent
e9394348c9
commit
f462220880
285
body.tex
@ -170,12 +170,8 @@ Therefore, the contribution is found in chapters \ref{chap:methods},
|
|||
|
||||
This chapter will begin with an overview of previous work
|
||||
in the field of this thesis. Afterwards the theoretical foundations
|
||||
of the work of Miller et al.~\cite{Miller2018} and auto-encoders will
|
||||
be explained. The chapter concludes with more details about the
|
||||
research question and the intended contribution of this thesis.
|
||||
|
||||
For both background sections the notation defined in table
|
||||
\ref{tab:notation} will be used.
|
||||
of the work of Miller et al.~\cite{Miller2018} will
|
||||
be explained.
|
||||
|
||||
\section{Related Works}
|
||||
|
||||
|
@ -246,7 +242,7 @@ Empirical Bayes procedure to select prior variances.
|
|||
|
||||
\begin{table}
|
||||
\centering
|
||||
\caption{Notation for background sections}
|
||||
\caption{Notation for background}
|
||||
\label{tab:notation}
|
||||
\begin{tabular}{l|l}
|
||||
symbol & meaning \\
|
||||
|
@ -272,15 +268,18 @@ Empirical Bayes procedure to select prior variances.
|
|||
\(\mathcal{O}\) & observation \\
|
||||
\(\overline{\mathbf{q}}\) & probability vector for
|
||||
observation \\
|
||||
\(\overline{\mathbf{z}}, \mathbf{z}\) & latent space representation \\
|
||||
\(d_T, d_z\) & discriminators \\
|
||||
\(e, g\) & encoding and decoding/generating function \\
|
||||
\(J_g\) & Jacobi matrix for generating function \\
|
||||
\(\mathcal{T}\) & tangent space \\
|
||||
\(\mathbf{R}\) & training/test data changed to be on tangent space
|
||||
%\(E[something]\) & expected value of something
|
||||
%\(\overline{\mathbf{z}}, \mathbf{z}\) & latent space representation \\
|
||||
%\(d_T, d_z\) & discriminators \\
|
||||
%\(e, g\) & encoding and decoding/generating function \\
|
||||
%\(J_g\) & Jacobi matrix for generating function \\
|
||||
%\(\mathcal{T}\) & tangent space \\
|
||||
%\(\mathbf{R}\) & training/test data changed to be on tangent space
|
||||
\end{tabular}
|
||||
\end{table}
|
||||
|
||||
This section will use the \textbf{notation} defined in table
|
||||
\ref{tab:notation} on page \pageref{tab:notation}.
|
||||
To understand dropout sampling, it is necessary to explain the
|
||||
idea of Bayesian neural networks. They place a prior distribution
|
||||
over the network weights, for example a Gaussian prior distribution:
|
||||
|
@ -289,7 +288,8 @@ over the network weights, for example a Gaussian prior distribution:
|
|||
weight is drawn from an independent and identical distribution. The
|
||||
training of the network determines a plausible set of weights by
|
||||
evaluating the posterior (probability output) over the weights given
|
||||
the training data: \(p(\mathbf{W}|\mathbf{T})\). However, this
|
||||
the training data \(\mathbf{T}\): \(p(\mathbf{W}|\mathbf{T})\).
|
||||
However, this
|
||||
evaluation cannot be performed in any reasonable
|
||||
time. Therefore approximation techniques are
|
||||
required. In those techniques the posterior is fitted with a
|
||||
|
@ -343,10 +343,8 @@ Subsequently, the corresponding vector of class probabilities
|
|||
score vectors \(\mathbf{s}_j\) in a particular observation
|
||||
\(\mathcal{O}_i\): \(\overline{\mathbf{q}}_i \approx \overline{\mathbf{s}}_i = \frac{1}{n} \sum_{j=1}^{n} \mathbf{s}_j\). The label uncertainty
|
||||
of the detector for a particular observation is measured by
|
||||
the entropy \(H(\overline{\mathbf{q}}_i) = - \sum_j \overline{q}_{ij} \cdot \log \overline{q}_{ij}\).
|
||||
the entropy \(H(\overline{\mathbf{q}}_i)\).
|
||||
|
||||
In the introduction I used a very reduced version to describe
|
||||
maximum and low uncertainty. A more complete explanation:
|
||||
If \(\overline{\mathbf{q}}_i\), which I called averaged class probabilities,
|
||||
resembles a uniform distribution the entropy will be high. A uniform
|
||||
distribution means that no class is more likely than another, which
|
||||
|
@ -355,260 +353,9 @@ one class has a very high probability the entropy will be low.
|
|||
|
||||
In open set conditions it can be expected that falsely generated
|
||||
detections for unknown object classes have a higher label
|
||||
uncertainty. A treshold on the entropy \(H(\overline{\mathbf{q}}_i)\) can then
|
||||
uncertainty. A threshold on the entropy \(H(\overline{\mathbf{q}}_i)\) can then
|
||||
be used to identify and reject these false positive cases.
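
As a small numerical illustration of this thresholding, consider a
hypothetical detector with three classes. The class count, the
probability values and the threshold below are assumptions chosen for
this example and are not taken from Miller et al. With the natural
logarithm and
\(H(\overline{\mathbf{q}}_i) = - \sum_j \overline{q}_{ij} \cdot \log \overline{q}_{ij}\),
a uniform probability vector gives the maximum entropy
\[
H\left(\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3}\right) = \log 3 \approx 1.099,
\]
whereas a confident probability vector gives a value close to zero:
\[
H(0.98, 0.01, 0.01) = - 0.98 \cdot \log 0.98 - 2 \cdot 0.01 \cdot \log 0.01 \approx 0.112.
\]
\noindent
With an assumed entropy threshold of \(0.5\), the first detection would
be rejected and the second one would be kept.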
|
||||
|
||||
\section{Adversarial Auto-encoder}
|
||||
|
||||
This section will explain the adversarial auto-encoder used by
|
||||
Pidhorskyi et al.~\cite{Pidhorskyi2018} but in a slightly modified
|
||||
form to make it more understandable.
|
||||
|
||||
The training data \(\mathbf{T} \in \mathbb{R}^m \) is the input
|
||||
of the auto-encoder. An encoding function \(e: \mathbb{R}^m \rightarrow \mathbb{R}^n\) takes the data
|
||||
and produces a representation \(\overline{\mathbf{z}} \in \mathbb{R}^n\)
|
||||
in a latent space. This latent space is smaller (\(n < m\)) than the
|
||||
input which necessitates some form of compression.
|
||||
|
||||
A second function \(g: \Omega \rightarrow \mathbb{R}^m\) is the
|
||||
generator function that takes the latent representation
|
||||
\(\mathbf{z} \in \Omega \subset \mathbb{R}^n\) and generates an output
|
||||
\(\overline{\mathbf{T}}\) as close as possible to the input data
|
||||
distribution.
|
||||
|
||||
What then is the difference between \(\overline{\mathbf{z}}\) and \(\mathbf{z}\)?
|
||||
With a simple auto-encoder both would be identical. In this case
|
||||
of an adversarial auto-encoder it is slightly more complicated.
|
||||
There is a discriminator \(d_z\) that tries to distinguish between
|
||||
an encoded data point \(\overline{\mathbf{z}}\) and a \(\mathbf{z} \sim \mathcal{N}(0,1)\) drawn from a normal distribution with \(0\) mean
|
||||
and a standard deviation of \(1\). During training, the encoding
|
||||
function \(e\) attempts to minimize any perceivable difference
|
||||
between \(\mathbf{z}\) and \(\overline{\mathbf{z}}\) while \(d_z\) has the
|
||||
aforementioned adversarial task to differentiate between them.
|
||||
|
||||
Furthermore, there is a discriminator \(d_T\) that has the task
|
||||
to differentiate the generated output \(\overline{\mathbf{T}}\) from the
|
||||
actual input \(\mathbf{T}\). During training, the generator function \(g\)
|
||||
tries to minimize the perceivable difference between \(\overline{\mathbf{T}}\) and \(\mathbf{T}\) while \(d_T\) has the mentioned
|
||||
adversarial task to distinguish between them.
|
||||
|
||||
With this, all components of the adversarial auto-encoder employed
by Pidhorskyi et al.\ have been introduced. Finally, the losses are
|
||||
presented. The two adversarial objectives have been mentioned
|
||||
already. Specifically, there is the adversarial loss for the
|
||||
discriminator \(d_z\):
|
||||
\begin{equation} \label{eq:adv-loss-z}
|
||||
\mathcal{L}_{adv-d_z}(\mathbf{T},e,d_z) = E[\log (d_z(\mathcal{N}(0,1)))] + E[\log (1 - d_z(e(\mathbf{T})))],
|
||||
\end{equation}
|
||||
\noindent
|
||||
where \(E\) stands for an expected
|
||||
value\footnote{a term used in probability theory},
|
||||
\(\mathbf{T}\) stands for the input, and
|
||||
\(\mathcal{N}(0,1)\) represents an element drawn from the specified
|
||||
distribution. The encoder \(e\) attempts to minimize this loss while
|
||||
the discriminator \(d_z\) intends to maximize it.
|
||||
|
||||
In the same way the adversarial loss for the discriminator \(d_T\)
|
||||
is specified:
|
||||
\begin{equation} \label{eq:adv-loss-x}
|
||||
\mathcal{L}_{adv-d_T}(\mathbf{T},d_T,g) = E[\log(d_T(\mathbf{T}))] + E[\log(1 - d_T(g(\mathcal{N}(0,1))))],
|
||||
\end{equation}
|
||||
\noindent
|
||||
where \(\mathbf{T}\), \(E\), and \(\mathcal{N}(0,1)\) have the same meaning
|
||||
as before. In this case the generator \(g\) tries to minimize the loss
|
||||
while the discriminator \(d_T\) attempts to maximize it.
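
Read together, the two adversarial losses describe two min-max games.
The following compact form is only a sketch: the \(\min\)/\(\max\)
notation is added here for illustration and is not quoted from
Pidhorskyi et al.
\[
\min_{e} \max_{d_z} \mathcal{L}_{adv-d_z}(\mathbf{T},e,d_z),
\qquad
\min_{g} \max_{d_T} \mathcal{L}_{adv-d_T}(\mathbf{T},d_T,g).
\]
\noindent
As a sanity check, a discriminator that cannot tell its two kinds of
input apart outputs \(\frac{1}{2}\) for every input. Each loss then
takes the value
\[
\log \tfrac{1}{2} + \log\left(1 - \tfrac{1}{2}\right) = - 2 \log 2 \approx -1.386 .
\]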
|
||||
|
||||
Every auto-encoder requires a reconstruction error to work. This
|
||||
error calculates the difference between the original input and
|
||||
the generated or decoded output. In this case, the reconstruction
|
||||
loss is defined like this:
|
||||
\begin{equation} \label{eq:recon-loss}
|
||||
\mathcal{L}_{error}(\mathbf{T}, e, g) = - E[\log(p(g(e(\mathbf{T})) | \mathbf{T}))],
|
||||
\end{equation}
|
||||
\noindent
|
||||
where \(\log(p)\) is the log-likelihood and \(\mathbf{T}\),
|
||||
\(E\), \(e\), and \(g\) have the same meaning as before.
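
To make this loss more concrete, one possible choice of likelihood is
sketched here. It is an assumption for illustration only, since the
text above does not fix the form of \(p\): if \(p(\cdot | \mathbf{T})\)
is an isotropic Gaussian density on \(\mathbb{R}^m\) centred at
\(\mathbf{T}\) with a fixed variance \(\sigma^2\) (a hypothetical
parameter), then
\[
- \log p(g(e(\mathbf{T})) | \mathbf{T}) = \frac{1}{2\sigma^2} \| g(e(\mathbf{T})) - \mathbf{T} \|^2 + \frac{m}{2} \log(2 \pi \sigma^2).
\]
\noindent
Under this assumption the second term is constant, so minimizing the
reconstruction loss amounts to minimizing the expected squared
distance between the input and its reconstruction.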
|
||||
|
||||
All losses combined result in the following formula:
|
||||
\begin{equation} \label{eq:full-loss}
|
||||
    \mathcal{L}(\mathbf{T},e,d_z,d_T,g) = \mathcal{L}_{adv-d_z}(\mathbf{T},e,d_z) + \mathcal{L}_{adv-d_T}(\mathbf{T},d_T,g) + \lambda \mathcal{L}_{error}(\mathbf{T},e,g),
|
||||
\end{equation}
|
||||
\noindent
|
||||
where \(\lambda\) is a parameter used to balance the adversarial
|
||||
losses with the reconstruction loss. The model is trained by
|
||||
Pidhorskyi et al.\ using the Adam optimizer by performing alternating
|
||||
updates of each of the aforementioned components:
|
||||
|
||||
\begin{itemize}
|
||||
\item Maximize \(\mathcal{L}_{adv-d_T}\) by updating weights of \(d_T\);
|
||||
\item Minimize \(\mathcal{L}_{adv-d_T}\) by updating weights of \(g\);
|
||||
\item Maximize \(\mathcal{L}_{adv-d_z}\) by updating weights of \(d_z\);
|
||||
\item Minimize \(\mathcal{L}_{error}\) and \(\mathcal{L}_{adv-d_z}\) by updating weights of \(e\) and \(g\).
|
||||
\end{itemize}
|
||||
|
||||
Practically, the auto-encoder is trained separately for every
|
||||
object class that is considered ``known''. Pidhorskyi et al.\ trained
|
||||
it on the MNIST~\cite{Lecun1998} data set, once for every digit.
|
||||
|
||||
For this thesis it needs to be trained on the SceneNet RGB-D
|
||||
data set using MS COCO classes as known classes. As in every
|
||||
test epoch all known classes are present, it becomes
|
||||
non-trivial which of the trained auto-encoders should be used to
|
||||
calculate novelty. To phrase it differently, a true positive
|
||||
detection is possible for multiple classes in the same image.
|
||||
If, for example, one object is classified correctly by SSD as a chair
|
||||
the novelty score should be low. But the auto-encoders of all
|
||||
known classes except the ``chair'' class will ideally give a high novelty
|
||||
score. Which of the values should be used? The only sensible solution
|
||||
is to only run it through the auto-encoder that was trained for
|
||||
the class the SSD model predicted. This provides the following
|
||||
scenarios:
|
||||
\begin{itemize}
|
||||
\item true positive classification: novelty score should be low
|
||||
\item false positive classification and correct class is
|
||||
among the known classes: novelty score should be high
|
||||
\item false positive classification and correct class is unknown:
|
||||
novelty score should be high
|
||||
\end{itemize}
|
||||
\noindent
|
||||
Negative classifications are not listed as these are not part
|
||||
of the output of the SSD and cannot be given to the auto-encoder
|
||||
as input. Furthermore, the second case should not happen because
|
||||
the trained SSD knows this other class and is very likely
|
||||
to give it a higher probability. Therefore, using only one
|
||||
auto-encoder fulfils the task of differentiating between
|
||||
known and unknown classes.
|
||||
|
||||
\section{Generative Probabilistic Novelty Detection}
|
||||
|
||||
It is still unclear how the novelty score is calculated.
|
||||
This section will clear this up in terms that are as understandable
as possible. However, the name ``Generative Probabilistic
Novelty Detection''~\cite{Pidhorskyi2018} already signals that
|
||||
probability theory has something to do with it. Furthermore, this
|
||||
section will make use of some mathematical terms which cannot
|
||||
be explained in great detail here. Moreover, the previous section
|
||||
already introduced many required components, which will not be
|
||||
explained here again.
|
||||
|
||||
For the purpose of this explanation a trained auto-encoder
|
||||
is assumed. In that case the generator function describes
|
||||
the model that the auto-encoder is actually using for the
|
||||
novelty detection. The task of training is to make sure this
|
||||
model comes as close as possible to the real model of the
|
||||
training or testing data. The model of the auto-encoder
|
||||
is in mathematical terms a parameterized manifold
|
||||
\(\mathcal{M} \equiv g(\Omega)\) of dimension \(n\).
|
||||
The set of training or testing data can then be described
|
||||
in the following way:
|
||||
\begin{equation} \label{eq:train-set}
|
||||
\mathbf{T} = g(\mathbf{z}) + \xi_i \quad i \in \mathbb{N},
|
||||
\end{equation}
|
||||
\noindent
|
||||
where \(\xi_i\) represents noise. It may be confusing but
|
||||
for the purpose of this novelty test the ``truth'' is what
|
||||
the generator function generates from a set of \(\mathbf{z} \in \Omega\),
|
||||
not the ground truth from the data set. Furthermore,
|
||||
the previously introduced encoder function \(e\) is assumed
|
||||
to work as an exact inverse of \(g\) for every \(\mathbf{T} \in \mathcal{M}\).
|
||||
For such \(\mathbf{T}\) it follows that \(\mathbf{T} = g(e(\mathbf{T}))\).
|
||||
|
||||
Let \(\overline{\mathbf{T}} \in \mathbb{R}^m\) be the test data. The
|
||||
remainder of the section will explain how the novelty
|
||||
test is performed for this \(\overline{\mathbf{T}}\). It is important
|
||||
to note that this data is not necessarily part of the
|
||||
auto-encoder model. Therefore, \(g(e(\overline{\mathbf{T}})) = \overline{\mathbf{T}}\) cannot
|
||||
be assumed. However, it can be observed that \(\overline{\mathbf{T}}\)
|
||||
can be non-linearly projected onto
|
||||
\(\overline{\mathbf{T}}^{\|} \in \mathcal{M}\)
|
||||
by using \(g(\overline{\mathbf{z}})\) with \(\overline{\mathbf{z}} = e(\overline{\mathbf{T}})\).
|
||||
It is assumed that \(g\) is smooth enough to perform a linearization
|
||||
based on the first-order Taylor expansion:
|
||||
\begin{equation} \label{eq:taylor-expanse}
|
||||
g(\mathbf{z}) = g(\overline{\mathbf{z}}) + J_g(\overline{\mathbf{z}}) (\mathbf{z} - \overline{\mathbf{z}}) + \mathcal{O}(\| \mathbf{z} - \overline{\mathbf{z}} \|^2),
|
||||
\end{equation}
|
||||
\noindent
|
||||
where \(J_g(\overline{\mathbf{z}})\) is the Jacobi matrix of \(g\) computed
|
||||
at \(\overline{\mathbf{z}}\). It is assumed that the Jacobi matrix of \(g\)
|
||||
has the full rank at every point of the manifold. A Jacobi matrix
|
||||
contains all first-order partial derivatives of a function.
|
||||
\(\| \cdot \|\) is the \(L_2\) norm, which calculates the
length of a vector as the square root of the sum of the squares of
its components. Lastly, \(\mathcal{O}(\cdot)\) denotes Big-O notation
(not to be confused with the observation \(\mathcal{O}\) in table
\ref{tab:notation}). Here it bounds the remainder of the expansion:
the remainder shrinks at least quadratically with the distance between
\(\mathbf{z}\) and \(\overline{\mathbf{z}}\), which means that this
part of the term can be neglected for \(\mathbf{z}\) close to
\(\overline{\mathbf{z}}\).
|
||||
|
||||
Next the tangent space of \(g\) at \(\overline{\mathbf{T}}^{\|}\), which
|
||||
is spanned by the \(n\) independent column vectors of the Jacobi
|
||||
matrix \(J_g(\overline{\mathbf{z}})\), is defined as
|
||||
\(\mathcal{T} = \text{span}(J_g(\overline{\mathbf{z}}))\). The tangent space
at a point of the manifold contains all directions in which one can
move along the manifold from this point. The Jacobi matrix can be decomposed into three
|
||||
matrices using singular value decomposition: \(J_g(\overline{\mathbf{z}}) = U^{\|}SV^{*}\). \(\mathcal{T}\) is defined to also be spanned
|
||||
by the column vectors of \(U^{\|}\): \(\mathcal{T} = \text{span}(U^{\|})\). \(U^{\|}\) contains the left-singular vectors
and \(V^{*}\) is the conjugate transposed version of the matrix
\(V\), which contains the right-singular vectors. \(U^{\bot}\) is
defined in such a way that \(U = [U^{\|}U^{\bot}]\) is a unitary
matrix. \(\mathcal{T}^{\bot}\) is the orthogonal complement of
|
||||
\(\mathcal{T}\). With this preparation \(\overline{\mathbf{T}}\) can be
|
||||
represented with respect to the local coordinates that define
|
||||
\(\mathcal{T}\) and \(\mathcal{T}^{\bot}\). This representation
|
||||
can be achieved by computing
|
||||
\begin{equation} \label{eq:w-definition}
|
||||
\overline{\mathbf{R}} = U^{\top} \overline{\mathbf{T}} = \left[\begin{matrix}
|
||||
U^{\|^{\top}} \overline{\mathbf{T}} \\
|
||||
U^{\bot^{\top}} \overline{\mathbf{T}}
|
||||
\end{matrix}\right] = \left[\begin{matrix}
|
||||
\overline{\mathbf{R}}^{\|} \\
|
||||
\overline{\mathbf{R}}^{\bot}
|
||||
\end{matrix}\right],
|
||||
\end{equation}
|
||||
\noindent
|
||||
where the rotated coordinates (training/testing data points
|
||||
changed to be on the tangent space)
|
||||
\(\overline{\mathbf{R}}\) are decomposed into \(\overline{\mathbf{R}}^{\|}\), which
|
||||
are parallel to \(\mathcal{T}\), and \(\overline{\mathbf{R}}^{\bot}\), which
|
||||
are orthogonal to \(\mathcal{T}\).
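
The change of coordinates can be checked on a toy case. All numbers
below are made up for illustration and do not come from Pidhorskyi et
al. Assume \(m = 2\), \(n = 1\) and a Jacobi matrix with the single
column \(J_g(\overline{\mathbf{z}}) = [3, 4]^{\top}\). Its singular value
decomposition has \(S = 5\), \(V^{*} = 1\) and
\[
U^{\|} = \left[\begin{matrix} 0.6 \\ 0.8 \end{matrix}\right],
\qquad
U^{\bot} = \left[\begin{matrix} -0.8 \\ 0.6 \end{matrix}\right].
\]
\noindent
For an assumed test point \(\overline{\mathbf{T}} = [2, 1]^{\top}\) the
rotated coordinates are
\[
\overline{\mathbf{R}}^{\|} = 0.6 \cdot 2 + 0.8 \cdot 1 = 2,
\qquad
\overline{\mathbf{R}}^{\bot} = -0.8 \cdot 2 + 0.6 \cdot 1 = -1 .
\]
\noindent
Since \(U\) is unitary, the length of the vector is preserved:
\(\| \overline{\mathbf{T}} \|^2 = 2^2 + 1^2 = 5 = 2^2 + (-1)^2\).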
|
||||
|
||||
The last step to define the novelty test involves probability
|
||||
density functions (PDFs), which are now introduced. The PDF \(p_T(\mathbf{T})\)
|
||||
describes the random variable \(T\), from which the training and
|
||||
testing data points are drawn. In addition, \(p_R(\mathbf{R})\) is the
|
||||
probability density function of the random variable \(R\),
|
||||
which represents \(T\) after changing the coordinates. Both
|
||||
distributions are identical. But it is assumed that the coordinates
|
||||
\(R^{\|}\), which are parallel to \(\mathcal{T}\), and the coordinates
|
||||
\(R^{\bot}\), which are orthogonal to \(\mathcal{T}\), are
|
||||
statistically independent. With this assumption the following holds:
|
||||
\begin{equation} \label{eq:pdf-x}
|
||||
p_T(\mathbf{T}) = p_R(\mathbf{R}) = p_R(\mathbf{R}^{\|}, \mathbf{R}^{\bot}) = p_{R^{\|}}(\mathbf{R}^{\|}) p_{R^{\bot}}(\mathbf{R}^{\bot})
|
||||
\end{equation}
|
||||
The previously introduced noise comes into play again. In formula
|
||||
(\ref{eq:train-set}) it is assumed that the noise \(\xi\)
|
||||
predominantly deviates the data points \(\mathbf{T}\) away from the manifold
|
||||
\(\mathcal{M}\) in a direction orthogonal to \(\mathcal{T}\).
|
||||
As a consequence \(R^{\bot}\) is mainly responsible for the noise
|
||||
effects. Since noise and drawing from the manifold are statistically
|
||||
independent, \(R^{\|}\) and \(R^{\bot}\) are also independent.
|
||||
|
||||
Finally, referring back to the data point \(\overline{\mathbf{T}}\), the
|
||||
novelty test is defined like this:
|
||||
\begin{equation} \label{eq:novelty-test}
|
||||
p_T(\overline{\mathbf{T}}) = p_{R^{\|}}(\overline{\mathbf{R}}^{\|})p_{R^{\bot}}(\overline{\mathbf{R}}^{\bot}) =
|
||||
\begin{cases}
|
||||
\geq \gamma & \Longrightarrow \text{Inlier} \\
|
||||
< \gamma & \Longrightarrow \text{Outlier}
|
||||
\end{cases}
|
||||
\end{equation}
|
||||
\noindent
|
||||
where \(\gamma\) is a suitable threshold.
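
An equivalent way to write this test, given here only as a
restatement and not as the formulation used by Pidhorskyi et al., uses
logarithms. Because the logarithm is monotonically increasing and
\(\gamma > 0\), a data point \(\overline{\mathbf{T}}\) is an inlier
exactly if
\[
\log p_{R^{\|}}(\overline{\mathbf{R}}^{\|}) + \log p_{R^{\bot}}(\overline{\mathbf{R}}^{\bot}) \geq \log \gamma ,
\]
\noindent
with the convention \(\log 0 = -\infty\). The product of two possibly
very small densities then becomes a sum, which is usually easier to
handle numerically.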
|
||||
|
||||
At this point it is very clear that the GPND approach requires
|
||||
far more math background than dropout sampling to understand
|
||||
the novelty test. Nonetheless it could be the better method.
|
||||
|
||||
% SSD: \cite{Liu2016}
|
||||
% ImageNet: \cite{Deng2009}
|
||||
% COCO: \cite{Lin2014}
|
||||
|
|