Written background for dropout sampling
Signed-off-by: Jim Martens <github@2martens.de>
parent
7d59b862c1
commit
1e66a6f874
@ -144,8 +144,98 @@ with MS COCO classes.
\chapter{Background and Research Plan}

This chapter takes a more in-depth look at the two works this thesis
is based upon. First, the dropout sampling introduced by Miller et
al.~\cite{Miller2018} is presented. Afterwards, Generative
Probabilistic Novelty Detection with Adversarial
Autoencoders~\cite{Pidhorskyi2018} is described. The chapter
concludes with a more detailed explanation of the intended
contribution of this thesis.

The explanation of dropout sampling follows the paper of Miller et
al.~\cite{Miller2018} rather closely, including the formulae used
in their paper.

\section{Dropout Sampling}

To understand dropout sampling, it is necessary to explain the
idea of Bayesian neural networks. They place a prior distribution
over the network weights, for example a Gaussian prior distribution:
\(\mathbf{W} \sim \mathcal{N}(0, I)\). In this example,
\(\mathbf{W}\) are the weights and the identity covariance \(I\)
indicates that the weights are independent and identically
distributed. The training of the network determines a plausible set
of weights by evaluating the posterior distribution over the weights
given the training data \(\mathbf{T}\):
\(p(\mathbf{W}|\mathbf{T})\). However, this evaluation cannot be
performed in any reasonable amount of time. Therefore, approximation
techniques are required. In those techniques, the posterior is
approximated by a simple distribution
\(q^{*}_{\theta}(\mathbf{W})\). The original and intractable problem
of averaging over all weights in the network is replaced with an
optimisation task in which the parameters of the simple distribution
are optimised~\cite{Kendall2017}.

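As a minimal sketch of this optimisation task, assuming the usual
variational inference formulation (the objective is not spelled out
above, and the symbol \(\theta^{*}\) is introduced here only for
illustration), the parameters \(\theta\) are chosen such that the
simple distribution is as close as possible to the true posterior in
terms of the Kullback-Leibler divergence:

\[
\theta^{*} = \arg\min_{\theta} \, \mathrm{KL}\left(q_{\theta}(\mathbf{W}) \,\|\, p(\mathbf{W}|\mathbf{T})\right)
\]

Under this assumption, \(q^{*}_{\theta}(\mathbf{W})\) is the
distribution \(q_{\theta}(\mathbf{W})\) evaluated at
\(\theta = \theta^{*}\).
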
\subsubsection*{Dropout variational inference}

Kendall and Gal~\cite{Kendall2017} showed an approximation for
classification and recognition tasks. Dropout variational inference
is a practical approximation technique: dropout layers are added
in front of every weight layer and are kept active at test time
in order to sample from the approximate posterior. Effectively, the
class probability \(p(y|\mathcal{I}, \mathbf{T})\) is approximated
by performing multiple forward passes through the network and
averaging the obtained softmax scores \(\mathbf{s}_i\), given an
image \(\mathcal{I}\) and the training data \(\mathbf{T}\):

\begin{equation} \label{eq:drop-sampling}
p(y|\mathcal{I}, \mathbf{T}) = \int p(y|\mathcal{I}, \mathbf{W}) \cdot p(\mathbf{W}|\mathbf{T}) \, d\mathbf{W} \approx \frac{1}{n} \sum_{i=1}^{n}\mathbf{s}_i
\end{equation}

With this dropout sampling technique, \(n\) sets of model weights
\(\widetilde{\mathbf{W}}_i\) are approximately sampled from the
posterior \(p(\mathbf{W}|\mathbf{T})\), one for each forward pass.
The class probability \(p(y|\mathcal{I}, \mathbf{T})\) is a
probability vector \(\mathbf{q}\) over all class labels. Finally,
the uncertainty of the network with respect to the classification is
given by the entropy \(H(\mathbf{q}) = - \sum_i q_i \cdot \log q_i\),
where \(q_i\) denotes the \(i\)-th entry of \(\mathbf{q}\).

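As a worked example with made-up numbers (three classes, \(n = 3\)
forward passes and the natural logarithm are assumptions chosen
purely for illustration), suppose the forward passes yield the
softmax scores

\[
\mathbf{s}_1 = (0.7, 0.2, 0.1), \quad
\mathbf{s}_2 = (0.5, 0.3, 0.2), \quad
\mathbf{s}_3 = (0.6, 0.1, 0.3).
\]

Averaging them according to equation \ref{eq:drop-sampling} gives

\[
\mathbf{q} = \frac{1}{3} \left(\mathbf{s}_1 + \mathbf{s}_2 + \mathbf{s}_3\right) = (0.6, 0.2, 0.2),
\]

and the corresponding entropy is

\[
H(\mathbf{q}) = - \left(0.6 \cdot \ln 0.6 + 2 \cdot 0.2 \cdot \ln 0.2\right) \approx 0.95.
\]
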
\subsubsection*{Dropout sampling for object detection}

Miller et al.~\cite{Miller2018} apply dropout sampling to
object detection. In that case, \(\mathbf{W}\) represents the
learned weights of a detection network such as SSD~\cite{Liu2016}.
Every forward pass uses a different network
\(\widetilde{\mathbf{W}}\), which is approximately sampled from
\(p(\mathbf{W}|\mathbf{T})\). Each forward pass in object
detection results in a set of detections, each consisting of bounding
box coordinates \(\mathbf{b}\) and a softmax score vector
\(\mathbf{s}\). Miller et al. denote a detection as \(D_i =
\{\mathbf{s}_i,\mathbf{b}_i\}\). The detections of all passes are
collected in one large set \(\mathfrak{D} = \{D_1, \dots, D_m\}\),
where \(m\) is the total number of detections over all passes.

All detections with mutual intersection over union (IoU) scores
of \(0.95\) or higher are grouped into an observation \(\mathcal{O}_i\).
Subsequently, the corresponding vector of class probabilities
\(\mathbf{q}_i\) for the observation is calculated by averaging all
score vectors \(\mathbf{s}_j\) in that observation:
\(\mathbf{q}_i \approx \overline{\mathbf{s}}_i = \frac{1}{n} \sum_{j=1}^{n} \mathbf{s}_j\),
where \(n\) here denotes the number of detections in
\(\mathcal{O}_i\). The label uncertainty of the detector for a
particular observation is measured by the entropy
\(H(\mathbf{q}_i) = - \sum_j q_{ij} \cdot \log q_{ij}\), where
\(q_{ij}\) is the probability of class \(j\) in \(\mathbf{q}_i\).

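For reference, the IoU of two bounding boxes is the area of their
intersection divided by the area of their union. A small made-up
example illustrates how strict a threshold of \(0.95\) is: for two
boxes \(\mathbf{b}_1\) and \(\mathbf{b}_2\) of \(10 \times 10\)
pixels that are shifted against each other by one pixel along one
axis, the intersection covers \(9 \times 10 = 90\) pixels and

\[
\mathrm{IoU}(\mathbf{b}_1, \mathbf{b}_2)
= \frac{|\mathbf{b}_1 \cap \mathbf{b}_2|}{|\mathbf{b}_1 \cup \mathbf{b}_2|}
= \frac{90}{100 + 100 - 90} \approx 0.82.
\]

These two detections would therefore not be grouped into the same
observation.
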
In the introduction I described maximum and low uncertainty only in
a very reduced form. A more complete explanation:
if \(\mathbf{q}_i\), which I called the averaged class probabilities,
resembles a uniform distribution, the entropy is high. A uniform
distribution means that no class is more likely than another, which
corresponds to maximum uncertainty. Conversely, if
one class has a very high probability, the entropy is low.

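To make the two extremes concrete, consider a made-up case with
three classes and the natural logarithm (both are assumptions chosen
for illustration). For the uniform distribution the entropy reaches
its maximum for three classes:

\[
H\left(\left(\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3}\right)\right)
= - 3 \cdot \tfrac{1}{3} \cdot \ln \tfrac{1}{3} = \ln 3 \approx 1.10.
\]

For a distribution dominated by one class the entropy is close to
zero:

\[
H\left((0.98, 0.01, 0.01)\right)
= - \left(0.98 \cdot \ln 0.98 + 2 \cdot 0.01 \cdot \ln 0.01\right) \approx 0.11.
\]
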
Under open-set conditions, falsely generated detections for unknown
object classes can be expected to have a higher label
uncertainty. A threshold on the entropy \(H(\mathbf{q}_i)\) can then
be used to identify and reject these false positive detections.

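Written as a decision rule, with \(\tau\) as a hypothetical symbol
for the entropy threshold (neither the symbol nor a concrete value
is fixed at this point), an observation is discarded whenever its
label uncertainty exceeds the threshold:

\[
H(\mathbf{q}_i) > \tau \quad \Rightarrow \quad \mathcal{O}_i \mbox{ is rejected.}
\]
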
\section{GPND}
\section{Contribution}