Written background for dropout sampling

Signed-off-by: Jim Martens <github@2martens.de>
Jim Martens 2019-03-06 16:19:19 +01:00
parent 7d59b862c1
commit 1e66a6f874
1 changed file with 90 additions and 0 deletions

@ -144,8 +144,98 @@ with MS COCO classes.
\chapter{Background and Research Plan}
This chapter provides a more in-depth look at the two works
this thesis is based upon. First, the dropout sampling introduced
by Miller et al.~\cite{Miller2018} is showcased. Afterwards,
the Generative Probabilistic Novelty Detection with Adversarial
Autoencoders~\cite{Pidhorskyi2018} is presented. The chapter
concludes with a more detailed explanation of the intended
contribution of this thesis.

The explanation of dropout sampling follows the paper of Miller et
al.~\cite{Miller2018} rather closely, including the formulae used
in their paper.
\section{Dropout Sampling}
To understand dropout sampling, it is necessary to explain the
idea of Bayesian neural networks. They place a prior distribution
over the network weights, for example a Gaussian prior:
\(\mathbf{W} \sim \mathcal{N}(0, I)\). Here, \(\mathbf{W}\) are the
weights and the identity covariance matrix \(I\) indicates that the
weights are drawn independently from identical distributions.
Training the network determines a plausible set of weights by
evaluating the posterior distribution over the weights given the
training data, \(p(\mathbf{W}|\mathbf{T})\). However, this
evaluation cannot be performed in any reasonable
time. Therefore, approximation techniques are
required. In those techniques the posterior is fitted with a
simple distribution \(q^{*}_{\theta}(\mathbf{W})\): the original
and intractable problem of averaging over all weights in the network
is replaced by an optimisation task over the parameters \(\theta\)
of the simple distribution~\cite{Kendall2017}.
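Concretely, this optimisation task is commonly written as minimising
the Kullback-Leibler divergence between the simple distribution and
the true posterior (this standard formulation is given here for
clarity and is not quoted from the cited papers):
\begin{equation*}
\theta^{*} = \arg\min_{\theta} \; \mathrm{KL}\!\left(q_{\theta}(\mathbf{W}) \;\|\; p(\mathbf{W}|\mathbf{T})\right)
\end{equation*}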
\subsubsection*{Dropout variational inference}
Kendall and Gal~\cite{Kendall2017} showed an approximation for
classification and recognition tasks. Dropout variational inference
is a practical approximation technique that adds dropout layers
in front of every weight layer and uses them during test time
as well, in order to sample from the approximate posterior.
Effectively, this results in the approximation of the class
probability \(p(y|\mathcal{I}, \mathbf{T})\) by performing multiple
forward passes through the network and averaging over the obtained
softmax scores \(\mathbf{s}_i\), given an image \(\mathcal{I}\) and
the training data \(\mathbf{T}\):
\begin{equation} \label{eq:drop-sampling}
p(y|\mathcal{I}, \mathbf{T}) = \int p(y|\mathcal{I}, \mathbf{W}) \cdot p(\mathbf{W}|\mathbf{T})d\mathbf{W} \approx \frac{1}{n} \sum_{i=1}^{n}\mathbf{s}_i
\end{equation}
With this dropout sampling technique, \(n\) sets of model weights
\(\widetilde{\mathbf{W}}_i\) are sampled from the posterior
\(p(\mathbf{W}|\mathbf{T})\). The class probability
\(p(y|\mathcal{I}, \mathbf{T})\) is a probability vector
\(\mathbf{q}\) over all class labels. Finally, the uncertainty
of the network with respect to the classification is given by
the entropy \(H(\mathbf{q}) = - \sum_i q_i \cdot \log q_i\).
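A minimal sketch of this sampling procedure, assuming a PyTorch
classification network \texttt{model} that contains dropout layers
(the function name and the number of passes are illustrative
assumptions, not taken from the cited papers), could look as follows:
\begin{verbatim}
import torch

def mc_dropout_predict(model, image, n=20):
    # Keep only the dropout layers stochastic at test time.
    model.eval()
    for module in model.modules():
        if isinstance(module, torch.nn.Dropout):
            module.train()
    # n stochastic forward passes, softmax scores stacked.
    with torch.no_grad():
        scores = torch.stack([torch.softmax(model(image), dim=-1)
                              for _ in range(n)])
    q = scores.mean(dim=0)  # averaged softmax scores, i.e. q
    # entropy H(q); clamp avoids log(0)
    h = -(q * q.clamp_min(1e-12).log()).sum(dim=-1)
    return q, h
\end{verbatim}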
\subsubsection*{Dropout sampling for object detection}
Miller et al.~\cite{Miller2018} apply dropout sampling to
object detection. In that case, \(\mathbf{W}\) represents the
learned weights of a detection network like SSD~\cite{Liu2016}.
Every forward pass uses a different network
\(\widetilde{\mathbf{W}}\), which is approximately sampled from
\(p(\mathbf{W}|\mathbf{T})\). Each forward pass in object
detection results in a set of detections, each consisting of bounding
box coordinates \(\mathbf{b}\) and a softmax score \(\mathbf{s}\).
Miller et al.\ denote a detection as \(D_i =
\{\mathbf{s}_i,\mathbf{b}_i\}\). The detections of all passes are
collected in one large set \(\mathfrak{D} = \{D_1, \ldots, D_n\}\).
All detections with mutual intersection-over-union (IoU) scores
of \(0.95\) or higher are grouped into an observation \(\mathcal{O}_i\).
Subsequently, the corresponding vector of class probabilities
\(\mathbf{q}_i\) for an observation is calculated by averaging all
score vectors \(\mathbf{s}_j\) in that observation
\(\mathcal{O}_i\): \(\mathbf{q}_i \approx \overline{\mathbf{s}}_i = \frac{1}{n} \sum_{j=1}^{n} \mathbf{s}_j\). The label uncertainty
of the detector for a particular observation is measured by
the entropy \(H(\mathbf{q}_i) = - \sum_j q_{ij} \cdot \log q_{ij}\).
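The following sketch illustrates this grouping with a simple greedy
strategy in plain NumPy; it is a simplification for illustration,
and the exact partitioning used by Miller et al.~\cite{Miller2018}
may differ in detail:
\begin{verbatim}
import numpy as np

def iou(a, b):
    # Intersection-over-union of two boxes (x1, y1, x2, y2).
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def group_observations(detections, thresh=0.95):
    # Greedily group detections (score vector, box) whose boxes
    # have mutual IoU >= thresh into observations.
    observations = []
    for s, b in detections:
        for obs in observations:
            if all(iou(b, b2) >= thresh for _, b2 in obs):
                obs.append((s, b))
                break
        else:
            observations.append([(s, b)])
    # Average the score vectors per observation and compute H(q).
    results = []
    for obs in observations:
        q = np.mean([s for s, _ in obs], axis=0)
        h = -np.sum(q * np.log(q + 1e-12))
        results.append((q, h))
    return results
\end{verbatim}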
In the introduction, I used a very reduced version to describe
maximum and low uncertainty. A more complete explanation: if
\(\mathbf{q}_i\), which I called the averaged class probabilities,
resembles a uniform distribution, the entropy will be high. A uniform
distribution means that no class is more likely than another, which
is a perfect example of maximum uncertainty. Conversely, if
one class has a very high probability, the entropy will be low.
In open-set conditions, it can be expected that falsely generated
detections for unknown object classes have a higher label
uncertainty. A threshold on the entropy \(H(\mathbf{q}_i)\) can then
be used to identify and reject these false positive cases.
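Continuing the sketch above, with a hypothetical threshold
\texttt{tau} (to be chosen on validation data) this rejection
amounts to a single filtering step:
\begin{verbatim}
# tau is an assumed hyperparameter, not a value from the paper.
tau = 1.0
accepted = [(q, h) for q, h in group_observations(detections)
            if h < tau]
\end{verbatim}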
\section{GPND}
\section{Contribution}