Written background for dropout sampling

Signed-off-by: Jim Martens <github@2martens.de>
2019-03-06 16:19:19 +01:00 · 2019-03-06 16:19:19 +01:00 · 1e66a6f874
parent 7d59b862c1
commit 1e66a6f874
1 changed files with 90 additions and 0 deletions
--- a/body_expose.tex
+++ b/body_expose.tex
@@ -144,8 +144,98 @@ with MS COCO classes.
 \chapter{Background and Research Plan}
 This chapter will provide a more in-depth look at the two works
 this thesis is based upon. First, the dropout sampling introduced
 by Miller et al.~\cite{Miller2018} will be showcased. Afterwards,
 the Generative Probabilistic Novelty Detection with Adversarial
 Autoencoders~\cite{Pidhorskyi2018} will be presented. The chapter
 will conclude with a more detailed explanation of the intended
 contribution of this thesis.
 The dropout sampling explanation will follow the paper of Miller et
 al.~\cite{Miller2018} rather closely, including the formulae used
 in their paper.
 \section{Dropout Sampling}
 To understand dropout sampling, it is necessary to explain the
 idea of Bayesian neural networks. They place a prior distribution
 over the network weights, for example a Gaussian prior distribution:
 \(\mathbf{W} \sim \mathcal{N}(0, I)\). In this example
 \(\mathbf{W}\) are the weights and the identity covariance \(I\)
 indicates that the weights are independent and identically
 distributed. The training of the network determines a plausible set
 of weights by evaluating the posterior distribution over the weights
 given the training data: \(p(\mathbf{W}|\mathbf{T})\). However, this
 evaluation is intractable, that is, it cannot be performed in any
 reasonable time. Therefore, approximation techniques are
 required. In those techniques the posterior is fitted with a
 simple distribution \(q^{*}_{\theta}(\mathbf{W})\). The original
 and intractable problem of averaging over all weights in the network
 is replaced with an optimisation task in which the parameters of the
 simple distribution are optimised~\cite{Kendall2017}.
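 A common way to make this optimisation task concrete, stated here
 as the standard variational inference objective and not quoted from
 the cited papers, is to choose the parameters \(\theta\) such that
 the Kullback-Leibler divergence between the simple distribution
 \(q_{\theta}(\mathbf{W})\) and the true posterior is minimal:
 \[
 \theta^{*} = \arg\min_{\theta} \mathrm{KL}\left(q_{\theta}(\mathbf{W}) \,\|\, p(\mathbf{W}|\mathbf{T})\right)
 \]
 Under this assumption, \(q^{*}_{\theta}(\mathbf{W})\) is the simple
 distribution with the optimal parameters \(\theta^{*}\).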
 \subsubsection*{Dropout variational inference}
 Kendall and Gal~\cite{Kendall2017} showed an approximation for
 classification and recognition tasks. Dropout variational inference
 is a practical approximation technique that adds dropout layers
 in front of every weight layer and keeps them active at test
 time in order to sample from the approximate posterior. Effectively, this
 results in the approximation of the class probability
 \(p(y|\mathcal{I}, \mathbf{T})\) by performing multiple forward
 passes through the network and averaging over the obtained softmax
 scores \(\mathbf{s}_i\), given an image \(\mathcal{I}\) and the
 training data \(\mathbf{T}\):
 \begin{equation} \label{eq:drop-sampling}
 p(y|\mathcal{I}, \mathbf{T}) = \int p(y|\mathcal{I}, \mathbf{W}) \cdot p(\mathbf{W}|\mathbf{T})d\mathbf{W} \approx \frac{1}{n} \sum_{i=1}^{n}\mathbf{s}_i
 \end{equation}
 With this dropout sampling technique, \(n\) sets of model weights
 \(\widetilde{\mathbf{W}}_i\) are approximately sampled from the
 posterior \(p(\mathbf{W}|\mathbf{T})\). The class probability
 \(p(y|\mathcal{I}, \mathbf{T})\) is a probability vector
 \(\mathbf{q}\) over all class labels. Finally, the uncertainty
 of the network with respect to the classification is given by
 the entropy \(H(\mathbf{q}) = - \sum_i q_i \cdot \log q_i\).
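 As a small worked example with made-up numbers (three classes,
 \(n = 3\) forward passes, and the natural logarithm are assumptions
 made only for illustration), suppose the passes return the softmax
 scores \(\mathbf{s}_1 = (0.7, 0.2, 0.1)\),
 \(\mathbf{s}_2 = (0.5, 0.3, 0.2)\) and
 \(\mathbf{s}_3 = (0.6, 0.1, 0.3)\). Averaging them and taking the
 entropy then gives:
 \[
 \mathbf{q} = \frac{1}{3} \sum_{i=1}^{3} \mathbf{s}_i = (0.6, 0.2, 0.2), \qquad
 H(\mathbf{q}) = -(0.6 \log 0.6 + 2 \cdot 0.2 \log 0.2) \approx 0.950
 \]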
 \subsubsection*{Dropout sampling for object detection}
 Miller et al.~\cite{Miller2018} apply the dropout sampling to
 object detection. In that case \(\mathbf{W}\) represents the
 learned weights of a detection network like SSD~\cite{Liu2016}.
 Every forward pass uses a different network
 \(\widetilde{\mathbf{W}}\) which is approximately sampled from
 \(p(\mathbf{W}|\mathbf{T})\). Each forward pass in object
 detection results in a set of detections, each consisting of bounding
 box coordinates \(\mathbf{b}\) and softmax score \(\mathbf{s}\).
 The detections are denoted by Miller et al. as \(D_i =
 \{\mathbf{s}_i,\mathbf{b}_i\}\). The detections of all passes are put
 into a large set \(\mathfrak{D} = \{D_1, \ldots, D_m\}\), with \(m\)
 denoting the total number of detections over all passes.
 All detections with mutual intersection-over-union scores (IoU)
 of \(0.95\) or higher are defined as an observation \(\mathcal{O}_i\).
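 For reference, the IoU of two bounding boxes \(\mathbf{b}_a\) and
 \(\mathbf{b}_b\) is the area of their overlap divided by the area
 of their union (this is the standard definition; the indices \(a\)
 and \(b\) are introduced here only for illustration):
 \[
 \mathrm{IoU}(\mathbf{b}_a, \mathbf{b}_b) = \frac{|\mathbf{b}_a \cap \mathbf{b}_b|}{|\mathbf{b}_a \cup \mathbf{b}_b|}
 \]
 Two identical boxes have an IoU of \(1\), two disjoint boxes an
 IoU of \(0\).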
 Subsequently, the corresponding vector of class probabilities
 \(\mathbf{q}_i\) for the observation is calculated by averaging all
 score vectors \(\mathbf{s}_j\) in a particular observation
 \(\mathcal{O}_i\): \(\mathbf{q}_i \approx \overline{\mathbf{s}}_i =
 \frac{1}{n} \sum_{j=1}^{n} \mathbf{s}_j\), where \(n\) here denotes
 the number of detections in \(\mathcal{O}_i\). The label uncertainty
 of the detector for a particular observation is measured by
 the entropy \(H(\mathbf{q}_i) = - \sum_j q_{ij} \cdot \log q_{ij}\).
 In the introduction I used a very reduced version to describe
 maximum and low uncertainty. A more complete explanation follows:
 if \(\mathbf{q}_i\), which I called averaged class probabilities,
 resembles a uniform distribution, the entropy will be high. A uniform
 distribution means that no class is more likely than another, which
 is a perfect example of maximum uncertainty. Conversely, if
 one class has a very high probability the entropy will be low.
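 To put illustrative numbers on this (four classes and the natural
 logarithm are assumptions made only for this example): a uniform
 vector reaches the maximum entropy \(\log 4\), whereas a vector
 dominated by a single class stays close to zero:
 \[
 H(0.25, 0.25, 0.25, 0.25) = \log 4 \approx 1.386, \qquad
 H(0.97, 0.01, 0.01, 0.01) = -(0.97 \log 0.97 + 3 \cdot 0.01 \log 0.01) \approx 0.168
 \]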
 In open-set conditions it can be expected that falsely generated
 detections for unknown object classes have a higher label
 uncertainty. A threshold on the entropy \(H(\mathbf{q}_i)\) can then
 be used to identify and reject these false positive cases.
 \section{GPND}
 \section{Contribution}