diff --git a/body_expose.tex b/body_expose.tex
index 0c992a1..c3e8270 100644
--- a/body_expose.tex
+++ b/body_expose.tex
@@ -144,8 +144,98 @@ with MS COCO classes.
 
 \chapter{Background and Research Plan}
 
+This chapter will provide a more in-depth look at the two works
+this thesis is based upon. First, the dropout sampling introduced
+by Miller et al.\cite{Miller2018} will be showcased. Afterwards,
+the Generative Probabilistic Novelty Detection with Adversarial
+Autoencoders\cite{Pidhorskyi2018} will be presented. The chapter
+will conclude with a more detailed explanation of the intended
+contribution of this thesis.
+
+The dropout sampling explanation will follow the paper of Miller
+et al.\cite{Miller2018} rather closely, including the formulae
+used in their paper.
+
 \section{Dropout Sampling}
 
+To understand dropout sampling, it is necessary to explain the
+idea of Bayesian neural networks. They place a prior distribution
+over the network weights, for example a Gaussian prior
+distribution: \(\mathbf{W} \sim \mathcal{N}(0, I)\). In this
+example \(\mathbf{W}\) are the weights and the identity matrix
+\(I\) indicates that every weight is drawn independently from an
+identical (standard normal) distribution. The training of the
+network determines a plausible set of weights by evaluating the
+posterior (the probability distribution) over the weights given
+the training data: \(p(\mathbf{W}|\mathbf{T})\). However, this
+evaluation cannot be performed in any reasonable time. Therefore,
+approximation techniques are required. In these techniques the
+posterior is fitted with a simple parametric distribution
+\(q^{*}_{\theta}(\mathbf{W})\). The original and intractable
+problem of averaging over all weights in the network is replaced
+with an optimisation task over the parameters \(\theta\) of the
+simple distribution\cite{Kendall2017}.
+
+\subsubsection*{Dropout variational inference}
+
+Kendall and Gal\cite{Kendall2017} showed such an approximation
+for classification and recognition tasks. Dropout variational
+inference is a practical approximation technique: dropout layers
+are added in front of every weight layer and are also used during
+test time to sample from the approximate posterior. Effectively,
+this results in the approximation of the class probability
+\(p(y|\mathcal{I}, \mathbf{T})\) by performing multiple forward
+passes through the network and averaging over the obtained
+softmax scores \(\mathbf{s}_i\), given an image \(\mathcal{I}\)
+and the training data \(\mathbf{T}\):
+
+\begin{equation} \label{eq:drop-sampling}
+p(y|\mathcal{I}, \mathbf{T}) = \int p(y|\mathcal{I}, \mathbf{W}) \cdot p(\mathbf{W}|\mathbf{T}) \, d\mathbf{W} \approx \frac{1}{n} \sum_{i=1}^{n}\mathbf{s}_i
+\end{equation}
+
+With this dropout sampling technique, \(n\) sets of model weights
+\(\widetilde{\mathbf{W}}_i\) are sampled from the posterior
+\(p(\mathbf{W}|\mathbf{T})\). The class probability
+\(p(y|\mathcal{I}, \mathbf{T})\) is a probability vector
+\(\mathbf{q}\) over all class labels. Finally, the uncertainty
+of the network with respect to the classification is given by
+the entropy \(H(\mathbf{q}) = - \sum_i q_i \cdot \log q_i\). A
+code sketch of this sampling procedure is given at the end of
+this section.
+
+\subsubsection*{Dropout sampling for object detection}
+
+Miller et al.\cite{Miller2018} apply dropout sampling to object
+detection. In that case \(\mathbf{W}\) represents the learned
+weights of a detection network like SSD\cite{Liu2016}. Every
+forward pass uses a different network \(\widetilde{\mathbf{W}}\)
+which is approximately sampled from \(p(\mathbf{W}|\mathbf{T})\).
+Each forward pass in object detection results in a set of
+detections, each consisting of bounding box coordinates
+\(\mathbf{b}\) and a softmax score \(\mathbf{s}\). The detections
+are denoted by Miller et al.\ as
+\(D_i = \{\mathbf{s}_i,\mathbf{b}_i\}\). The detections of all
+passes are put into a large set
+\(\mathfrak{D} = \{D_1, \dots, D_n\}\).
+
+All detections with mutual intersection-over-union scores (IoU)
+of \(0.95\) or higher are grouped into an observation
+\(\mathcal{O}_i\). Subsequently, the corresponding vector of
+class probabilities \(\mathbf{q}_i\) for the observation is
+calculated by averaging all score vectors \(\mathbf{s}_j\) in a
+particular observation \(\mathcal{O}_i\):
+\(\mathbf{q}_i \approx \overline{\mathbf{s}}_i =
+\frac{1}{n} \sum_{j=1}^{n} \mathbf{s}_j\). The label uncertainty
+of the detector for a particular observation is measured by the
+entropy \(H(\mathbf{q}_i) = - \sum_j q_{ij} \cdot \log q_{ij}\).
+
+In the introduction I used a very reduced version to describe
+maximum and low uncertainty. A more complete explanation: if
+\(\mathbf{q}_i\), the averaged class probabilities, resembles a
+uniform distribution, the entropy is high. A uniform distribution
+means that no class is more likely than another, which is the
+case of maximum uncertainty. Conversely, if one class has a very
+high probability, the entropy is low. For example, with \(C\)
+classes a uniform \(\mathbf{q}_i\) has the maximum possible
+entropy \(\log C\), whereas a vector with all probability mass on
+a single class has an entropy of \(0\).
+
+In open-set conditions it can be expected that falsely generated
+detections for unknown object classes have a higher label
+uncertainty. A threshold on the entropy \(H(\mathbf{q}_i)\) can
+then be used to identify and reject these false positive cases.
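+
+To make this description more concrete, the following sketch shows
+how the sampling of Equation~\ref{eq:drop-sampling} and the entropy
+\(H(\mathbf{q})\) could be computed for a classification network.
+It is only a minimal illustration in PyTorch-style Python: the
+names \texttt{model} and \texttt{image} are placeholders, and the
+snippet is not the implementation used by Miller et
+al.\cite{Miller2018}.
+
+\begin{verbatim}
+import torch
+import torch.nn.functional as F
+
+def dropout_sampling(model, image, n=20):
+    # keep only the dropout layers active at test time
+    model.eval()
+    for m in model.modules():
+        if isinstance(m, torch.nn.Dropout):
+            m.train()
+    with torch.no_grad():
+        # n stochastic forward passes yield softmax scores s_i
+        scores = [F.softmax(model(image), dim=-1) for _ in range(n)]
+    q = torch.stack(scores).mean(dim=0)        # averaged vector q
+    entropy = -(q * torch.log(q)).sum(dim=-1)  # H(q)
+    return q, entropy
+\end{verbatim}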
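+
+The grouping of detections into observations and the rejection by
+an entropy threshold can be sketched in a similar fashion. Again,
+this is only an illustration of the idea: the helper function
+\texttt{iou} and the detection format are assumptions and not
+taken from Miller et al.\cite{Miller2018}.
+
+\begin{verbatim}
+def group_into_observations(detections, iou_threshold=0.95):
+    # detections: (box, score) pairs from all forward passes;
+    # score vectors are assumed to be torch tensors
+    observations = []
+    for box, score in detections:
+        for obs in observations:
+            # iou(a, b) is an assumed helper returning the
+            # intersection over union of two bounding boxes
+            if all(iou(box, b) >= iou_threshold for b, _ in obs):
+                obs.append((box, score))
+                break
+        else:
+            observations.append([(box, score)])
+    return observations
+
+def reject_uncertain(observations, max_entropy):
+    kept = []
+    for obs in observations:
+        q = sum(s for _, s in obs) / len(obs)  # averaged scores q_i
+        h = -(q * q.log()).sum()               # entropy H(q_i)
+        if h <= max_entropy:                   # threshold on H(q_i)
+            kept.append((obs, q))
+    return kept
+\end{verbatim}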
+
 
 \section{GPND}
 
 \section{Contribution}