Added background sections

Signed-off-by: Jim Martens <github@2martens.de>
2019-08-05 15:51:55 +02:00
parent e2194dd8f1
commit d0793ba395
1 changed files with 369 additions and 0 deletions
--- a/body.tex
+++ b/body.tex
@ -136,6 +136,9 @@ of the work of Miller et al.~\cite{Miller2018} and auto-encoders will
 be explained. The chapter concludes with more details about the
 research question and the intended contribution of this thesis.
 For both background sections the notation defined in table
 \ref{tab:notation} will be used.
 \section{Related Works}
 Novelty detection for object detection is intricately linked with
@ -201,6 +204,372 @@ moments in neural networks which eliminates gradient variance, and
 they introduce a hierarchical prior for parameters and an
 Empirical Bayes procedure to select prior variances.
 \section{Background for Bayesian SSD}
 \begin{table}
    \centering
    \caption{Notation for background sections}
    \label{tab:notation}
    \begin{tabular}{l|l}
        symbol & meaning \\
        \hline
        \(\mathbf{W}\) & weights \\
        \(\mathbf{T}\) & training data \\
        \(\mathcal{N}(0, I)\) & Gaussian distribution with zero mean and identity covariance \\
        \(I\) & identity matrix; as covariance it makes the weights independent and identically distributed \\
        \(p(\mathbf{W}|\mathbf{T})\) & probability of weights given
            training data \\
        \(\mathcal{I}\) & an image \\
        \(\mathbf{q} = p(y|\mathcal{I}, \mathbf{T})\) & probability
            of all classes given image and training data \\
        \(H(\mathbf{q})\) & entropy over probability vector \\
        \(\widetilde{\mathbf{W}}\) & weights sampled from
            \(p(\mathbf{W}|\mathbf{T})\) \\
        \(\mathbf{b}\) & bounding box coordinates \\
        \(\mathbf{s}\) & softmax scores \\
        \(\overline{\mathbf{s}}\) & averaged softmax scores \\
        \(D\) & detections of one forward pass \\
        \(\mathfrak{D}\) & set of all detections over multiple
            forward passes \\
        \(\mathcal{O}\) & observation \\
        \(\overline{\mathbf{q}}\) & probability vector for
            observation \\
        \(\overline{\mathbf{z}}, \mathbf{z}\) & latent space representation \\
        \(d_T, d_z\) & discriminators \\
        \(e, g\) & encoding and decoding/generating function \\
        \(J_g\) & Jacobi matrix for generating function \\
        \(\mathcal{T}\) & tangent space \\
        \(\mathbf{R}\) & training/test data changed to be on tangent space
    \end{tabular}
 \end{table}
 To understand dropout sampling, it is necessary to explain the
 idea of Bayesian neural networks. They place a prior distribution
 over the network weights, for example a Gaussian prior distribution:
 \(\mathbf{W} \sim \mathcal{N}(0, I)\). In this example
 \(\mathbf{W}\) are the weights and the identity covariance matrix
 \(I\) symbolises that every weight is drawn independently from the
 same distribution. The
 training of the network determines a plausible set of weights by
 evaluating the posterior distribution over the weights given
 the training data: \(p(\mathbf{W}|\mathbf{T})\). However, this
 evaluation cannot be performed in any reasonable
 time. Therefore, approximation techniques are
 required. In those techniques the posterior is fitted with a
 simple distribution \(q^{*}_{\theta}(\mathbf{W})\). The original
 and intractable problem of averaging over all weights in the network
 is replaced with an optimisation task, in which the parameters
 \(\theta\) of the simple distribution are optimised~\cite{Kendall2017}.
 \subsubsection*{Dropout Variational Inference}
 Kendall and Gal~\cite{Kendall2017} showed an approximation for
 classification and recognition tasks. Dropout variational inference
 is a practical approximation technique that adds dropout layers
 in front of every weight layer and keeps them active during test
 time to sample from the approximate posterior. Effectively, this
 results in the approximation of the class probability
 \(p(y|\mathcal{I}, \mathbf{T})\) by performing multiple forward
 passes through the network and averaging over the obtained Softmax
 scores \(\mathbf{s}_i\), given an image \(\mathcal{I}\) and the
 training data \(\mathbf{T}\):
 \begin{equation} \label{eq:drop-sampling}
 p(y|\mathcal{I}, \mathbf{T}) = \int p(y|\mathcal{I}, \mathbf{W}) \cdot p(\mathbf{W}|\mathbf{T})d\mathbf{W} \approx \frac{1}{n} \sum_{i=1}^{n}\mathbf{s}_i
 \end{equation}
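 With this dropout sampling technique \(n\) sets of model weights
 \(\widetilde{\mathbf{W}}_i\) are approximately sampled from the posterior
 \(p(\mathbf{W}|\mathbf{T})\), and \(\mathbf{s}_i\) is the softmax
 output of the forward pass that uses \(\widetilde{\mathbf{W}}_i\). The class probability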
 \(p(y|\mathcal{I}, \mathbf{T})\) is a probability vector
 \(\mathbf{q}\) over all class labels. Finally, the uncertainty
 of the network with respect to the classification is given by
 the entropy \(H(\mathbf{q}) = - \sum_i q_i \cdot \log q_i\).
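 As a small worked example with made-up numbers (not taken from
 Kendall and Gal), assume \(n = 3\) forward passes over three classes
 that produce the softmax scores
 \(\mathbf{s}_1 = (0.7, 0.2, 0.1)\), \(\mathbf{s}_2 = (0.5, 0.3, 0.2)\), and
 \(\mathbf{s}_3 = (0.6, 0.1, 0.3)\). Using the natural logarithm,
 the approximation in (\ref{eq:drop-sampling}) and the entropy then give
 \[
    \mathbf{q} \approx \frac{1}{3} \sum_{i=1}^{3} \mathbf{s}_i = (0.6, 0.2, 0.2), \qquad
    H(\mathbf{q}) = -(0.6 \log 0.6 + 2 \cdot 0.2 \log 0.2) \approx 0.95 .
 \]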
 \subsubsection*{Dropout Sampling for Object Detection}
 Miller et al.~\cite{Miller2018} apply dropout sampling to
 object detection. In that case \(\mathbf{W}\) represents the
 learned weights of a detection network like SSD~\cite{Liu2016}.
 Every forward pass uses a different network
 \(\widetilde{\mathbf{W}}\) which is approximately sampled from
 \(p(\mathbf{W}|\mathbf{T})\). Each forward pass in object
 detection results in a set of detections, each consisting of bounding
 box coordinates \(\mathbf{b}\) and softmax scores \(\mathbf{s}\).
 The detections are denoted by Miller et al. as \(D_i =
 \{\mathbf{s}_i,\mathbf{b}_i\}\). The detections of all passes are put
 into a large set \(\mathfrak{D} = \{D_1, \ldots, D_n\}\).
 All detections with mutual intersection-over-union (IoU) scores
 of \(0.95\) or higher are grouped into an observation \(\mathcal{O}_i\).
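 For illustration, with hypothetical boxes given as
 \((x_{\min}, y_{\min}, x_{\max}, y_{\max})\), the detections
 \(\mathbf{b}_1 = (0, 0, 10, 10)\) and \(\mathbf{b}_2 = (0, 0, 10, 9.6)\)
 overlap in an area of \(96\) while their union covers \(100\), so
 \[
    \text{IoU}(\mathbf{b}_1, \mathbf{b}_2) = \frac{|\mathbf{b}_1 \cap \mathbf{b}_2|}{|\mathbf{b}_1 \cup \mathbf{b}_2|} = \frac{96}{100} = 0.96 \geq 0.95,
 \]
 and both detections would belong to the same observation.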
 Subsequently, the corresponding vector of class probabilities
 \(\overline{\mathbf{q}}_i\) for the observation is calculated by averaging all
 score vectors \(\mathbf{s}_j\) in a particular observation
 \(\mathcal{O}_i\): \(\overline{\mathbf{q}}_i \approx \overline{\mathbf{s}}_i = \frac{1}{n} \sum_{j=1}^{n} \mathbf{s}_j\). The label uncertainty
 of the detector for a particular observation is measured by
 the entropy \(H(\overline{\mathbf{q}}_i) = - \sum_j \overline{q}_{ij} \cdot \log \overline{q}_{ij}\).
 In the introduction I used a very reduced version to describe
 maximum and low uncertainty. A more complete explanation follows.
 If \(\overline{\mathbf{q}}_i\), which I called averaged class probabilities,
 resembles a uniform distribution, the entropy will be high. A uniform
 distribution means that no class is more likely than another, which
 is a perfect example of maximum uncertainty. Conversely, if
 one class has a very high probability, the entropy will be low.
 In open set conditions it can be expected that falsely generated
 detections for unknown object classes have a higher label
 uncertainty. A threshold on the entropy \(H(\overline{\mathbf{q}}_i)\) can then
 be used to identify and reject these false positive cases.
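 As a numeric illustration with three classes and the natural
 logarithm (the values and the threshold are chosen for this
 example only), a uniform and a confident probability vector yield
 \[
    H\left(\frac{1}{3}, \frac{1}{3}, \frac{1}{3}\right) = \log 3 \approx 1.10, \qquad
    H(0.98, 0.01, 0.01) = -(0.98 \log 0.98 + 2 \cdot 0.01 \log 0.01) \approx 0.11 .
 \]
 An entropy threshold of, for instance, \(0.5\) would reject the first
 observation and keep the second.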
 \section{Adversarial Auto-encoder}
 This section will explain the adversarial auto-encoder used by
 Pidhorskyi et al.~\cite{Pidhorskyi2018} but in a slightly modified
 form to make it more understandable.
 The training data \(\mathbf{T} \in \mathbb{R}^m \) is the input
 of the auto-encoder. An encoding function \(e: \mathbb{R}^m \rightarrow \mathbb{R}^n\) takes the data
 and produces a representation \(\overline{\mathbf{z}} \in \mathbb{R}^n\)
 in a latent space. This latent space is smaller (\(n < m\)) than the
 input which necessitates some form of compression.
 A second function \(g: \Omega \rightarrow \mathbb{R}^m\) is the
 generator function that takes the latent representation
 \(\mathbf{z} \in \Omega \subset \mathbb{R}^n\) and generates an output
 \(\overline{\mathbf{T}}\) as close as possible to the input data
 distribution.
 What then is the difference between \(\overline{\mathbf{z}}\) and \(\mathbf{z}\)?
 With a simple auto-encoder both would be identical. In this case
 of an adversarial auto-encoder it is slightly more complicated.
 There is a discriminator \(d_z\) that tries to distinguish between
 an encoded data point \(\overline{\mathbf{z}}\) and a \(\mathbf{z} \sim \mathcal{N}(0,1)\) drawn from a normal distribution with \(0\) mean
 and a standard deviation of \(1\). During training, the encoding
 function \(e\) attempts to minimize any perceivable difference
 between \(\mathbf{z}\) and \(\overline{\mathbf{z}}\) while \(d_z\) has the
 aforementioned adversarial task to differentiate between them.
 Furthermore, there is a discriminator \(d_T\) that has the task
 to differentiate the generated output \(\overline{\mathbf{T}}\) from the
 actual input \(\mathbf{T}\). During training, the generator function \(g\)
 tries to minimize the perceivable difference between \(\overline{\mathbf{T}}\) and \(\mathbf{T}\) while \(d_T\) has the mentioned
 adversarial task to distinguish between them.
 With this, all components of the adversarial auto-encoder employed
 by Pidhorskyi et al.\ have been introduced. Finally, the losses are
 presented. The two adversarial objectives have been mentioned
 already. Specifically, there is the adversarial loss for the
 discriminator \(d_z\):
 \begin{equation} \label{eq:adv-loss-z}
    \mathcal{L}_{adv-d_z}(\mathbf{T},e,d_z) = E[\log (d_z(\mathcal{N}(0,1)))] + E[\log (1 - d_z(e(\mathbf{T})))],
 \end{equation}
 \noindent
 where \(E\) stands for an expected
 value\footnote{a term used in probability theory},
 \(\mathbf{T}\) stands for the input, and
 \(\mathcal{N}(0,1)\) represents an element drawn from the specified
 distribution. The encoder \(e\) attempts to minimize this loss while
 the discriminator \(d_z\) intends to maximize it.
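 To get a feeling for the values of this loss, consider two
 idealised cases (illustrative numbers, natural logarithm). A
 discriminator that cannot tell the two sources apart outputs
 \(0.5\) for every input, whereas a fairly confident discriminator
 might output \(0.9\) for samples from \(\mathcal{N}(0,1)\) and
 \(0.1\) for encoded data points:
 \[
    \log 0.5 + \log(1 - 0.5) \approx -1.39, \qquad
    \log 0.9 + \log(1 - 0.1) \approx -0.21 .
 \]
 The better \(d_z\) separates both sources, the closer the loss gets
 to its upper bound of \(0\), which is why \(d_z\) maximizes it
 and \(e\) minimizes it.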
 In the same way the adversarial loss for the discriminator \(d_T\)
 is specified:
 \begin{equation} \label{eq:adv-loss-x}
    \mathcal{L}_{adv-d_T}(\mathbf{T},d_T,g) = E[\log(d_T(\mathbf{T}))] + E[\log(1 - d_T(g(\mathcal{N}(0,1))))],
 \end{equation}
 \noindent
 where \(\mathbf{T}\), \(E\), and \(\mathcal{N}(0,1)\) have the same meaning
 as before. In this case the generator \(g\) tries to minimize the loss
 while the discriminator \(d_T\) attempts to maximize it.
 Every auto-encoder requires a reconstruction error to work. This
 error calculates the difference between the original input and
 the generated or decoded output. In this case, the reconstruction
 loss is defined like this:
 \begin{equation} \label{eq:recon-loss}
    \mathcal{L}_{error}(\mathbf{T}, e, g) = - E[\log(p(g(e(\mathbf{T})) | \mathbf{T}))],
 \end{equation}
 \noindent
 where \(E[\log(p)]\) is the expected log-likelihood and \(\mathbf{T}\),
 \(E\), \(e\), and \(g\) have the same meaning as before.
 All losses combined result in the following formula:
 \begin{equation} \label{eq:full-loss}
    \mathcal{L}(\mathbf{T},e,d_z,d_T,g) = \mathcal{L}_{adv-d_z}(\mathbf{T},e,d_z) + \mathcal{L}_{adv-d_T}(\mathbf{T},d_T,g) + \lambda \mathcal{L}_{error}(\mathbf{T},e,g),
 \end{equation}
 \noindent
 where \(\lambda\) is a parameter used to balance the adversarial
 losses with the reconstruction loss. The model is trained by
 Pidhorskyi et al.\ using the Adam optimizer with alternating
 updates of each of the aforementioned components:
 \begin{itemize}
    \item Maximize \(\mathcal{L}_{adv-d_T}\) by updating weights of \(d_T\);
    \item Minimize \(\mathcal{L}_{adv-d_T}\) by updating weights of \(g\);
    \item Maximize \(\mathcal{L}_{adv-d_z}\) by updating weights of \(d_z\);
    \item Minimize \(\mathcal{L}_{error}\) and \(\mathcal{L}_{adv-d_z}\) by updating weights of \(e\) and \(g\).
 \end{itemize}
 Practically, the auto-encoder is trained separately for every
 object class that is considered ``known''. Pidhorskyi et al.\ trained
 it on the MNIST~\cite{Lecun1998} data set, once for every digit.
 For this thesis it needs to be trained on the SceneNet RGB-D
 data set using MS COCO classes as known classes. As in every
 test epoch all known classes are present, it becomes
 non-trivial which of the trained auto-encoders should be used to
 calculate novelty. To phrase it differently, a true positive
 detection is possible for multiple classes in the same image.
 If, for example, one object is classified correctly by SSD as a chair,
 the novelty score should be low. But the auto-encoders of all
 known classes except the ``chair'' class will ideally give a high novelty
 score. Which of the values should be used? The only sensible solution
 is to run the detection only through the auto-encoder that was trained for
 the class the SSD model predicted. This provides the following
 scenarios:
 \begin{itemize}
    \item true positive classification: novelty score should be low
    \item false positive classification and correct class is
    among the known classes: novelty score should be high
    \item false positive classification and correct class is unknown:
    novelty score should be high
 \end{itemize}
 \noindent
 Negative classifications are not listed as these are not part
 of the output of the SSD and cannot be given to the auto-encoder
 as input. Furthermore, the second case should rarely happen because
 the trained SSD knows this other class and is very likely
 to give it a higher probability. Therefore, using only one
 auto-encoder fulfils the task of differentiating between
 known and unknown classes.
 \section{Generative Probabilistic Novelty Detection}
 It is still unclear how the novelty score is calculated.
 This section will explain the calculation in terms that are as
 understandable as possible. The name ``Generative Probabilistic
 Novelty Detection''~\cite{Pidhorskyi2018} already signals that
 probability theory is involved. Furthermore, this
 section will make use of some mathematical terms which cannot
 be explained in great detail here. Moreover, the previous section
 already introduced many required components, which will not be
 explained here again.
 For the purpose of this explanation a trained auto-encoder
 is assumed. In that case the generator function describes
 the model that the auto-encoder is actually using for the
 novelty detection. The task of training is to make sure this
 model comes as close as possible to the real model of the
 training or testing data. The model of the auto-encoder
 is in mathematical terms a parameterized manifold
 \(\mathcal{M} \equiv g(\Omega)\) of dimension \(n\).
 The set of training or testing data can then be described
 in the following way:
 \begin{equation} \label{eq:train-set}
    \mathbf{T}_i = g(\mathbf{z}_i) + \xi_i, \quad i \in \mathbb{N},
 \end{equation}
 \noindent
 where \(\xi_i\) represents noise. It may be confusing, but
 for the purpose of this novelty test the ``truth'' is what
 the generator function generates from a set of \(\mathbf{z} \in \Omega\),
 not the ground truth from the data set. Furthermore,
 the previously introduced encoder function \(e\) is assumed
 to work as an exact inverse of \(g\) for every \(\mathbf{T} \in \mathcal{M}\).
 For such \(\mathbf{T}\) it follows that \(\mathbf{T} = g(e(\mathbf{T}))\).
 Let \(\overline{\mathbf{T}} \in \mathbb{R}^m\) be the test data. The
 remainder of the section will explain how the novelty
 test is performed for this \(\overline{\mathbf{T}}\). It is important
 to note that this data is not necessarily part of the
 auto-encoder model. Therefore, \(g(e(\overline{\mathbf{T}})) = \overline{\mathbf{T}}\) cannot
 be assumed. However, it can be observed that \(\overline{\mathbf{T}}\)
 can be non-linearly projected onto
 \(\overline{\mathbf{T}}^{\|} \in \mathcal{M}\)
 by using \(g(\overline{\mathbf{z}})\) with \(\overline{\mathbf{z}} = e(\overline{\mathbf{T}})\).
 It is assumed that \(g\) is smooth enough to perform a linearization
 based on the first-order Taylor expansion:
 \begin{equation} \label{eq:taylor-expanse}
    g(\mathbf{z}) = g(\overline{\mathbf{z}}) + J_g(\overline{\mathbf{z}}) (\mathbf{z} - \overline{\mathbf{z}}) + \mathcal{O}(\| \mathbf{z} - \overline{\mathbf{z}} \|^2),
 \end{equation}
 \noindent
 where \(J_g(\overline{\mathbf{z}})\) is the Jacobi matrix of \(g\) computed
 at \(\overline{\mathbf{z}}\). It is assumed that the Jacobi matrix of \(g\)
 has the full rank at every point of the manifold. A Jacobi matrix
 contains all first-order partial derivatives of a function.
 \(\| \cdot \|\) is the \(L_2\) norm, which calculates the
 length of a vector as the square root of the sum of
 squares of all components of the vector. Lastly, \(\mathcal{O}\)
 is Big-O notation, which here bounds the remainder of the
 Taylor expansion (not to be confused with the observation
 \(\mathcal{O}\) of the previous section). The remainder shrinks
 at least quadratically with the distance between \(\mathbf{z}\) and
 \(\overline{\mathbf{z}}\), which means that this part of the term
 can be ignored for \(\mathbf{z}\) close to \(\overline{\mathbf{z}}\).
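 A toy example, chosen here purely for illustration and unrelated
 to the actual auto-encoder, is \(g(z) = (z, z^2)^{\top}\) with
 \(n = 1\) and \(m = 2\). Its Jacobi matrix is
 \(J_g(\overline{z}) = (1, 2\overline{z})^{\top}\). For
 \(\overline{z} = 1\) and \(z = 1.1\) the linearization gives
 \[
    g(\overline{z}) + J_g(\overline{z})(z - \overline{z}) =
    \left[\begin{matrix} 1 \\ 1 \end{matrix}\right] + 0.1 \left[\begin{matrix} 1 \\ 2 \end{matrix}\right] =
    \left[\begin{matrix} 1.1 \\ 1.2 \end{matrix}\right],
    \qquad
    g(z) = \left[\begin{matrix} 1.1 \\ 1.21 \end{matrix}\right],
 \]
 so the neglected remainder has length \(0.01 = (z - \overline{z})^2\)
 and becomes negligible as \(z\) approaches \(\overline{z}\).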
 Next the tangent space of \(g\) at \(\overline{\mathbf{T}}^{\|}\), which
 is spanned by the \(n\) independent column vectors of the Jacobi
 matrix \(J_g(\overline{\mathbf{z}})\), is defined as
 \(\mathcal{T} = \text{span}(J_g(\overline{\mathbf{z}}))\). The tangent space
 at a point of the manifold contains all directions in which one can
 move along the manifold starting from this point. The Jacobi matrix can be decomposed into three
 matrices using singular value decomposition: \(J_g(\overline{\mathbf{z}}) = U^{\|}SV^{*}\). \(\mathcal{T}\) is therefore also spanned
 by the column vectors of \(U^{\|}\): \(\mathcal{T} = \text{span}(U^{\|})\). \(U^{\|}\) contains the left-singular vectors, \(S\) is the diagonal matrix of singular values,
 and \(V^{*}\) is the conjugate transpose of the matrix
 \(V\), which contains the right-singular vectors. \(U^{\bot}\) is
 defined in such a way that \(U = [U^{\|}U^{\bot}]\) is a unitary
 matrix. \(\mathcal{T}^{\bot}\) is the orthogonal complement of
 \(\mathcal{T}\). With this preparation \(\overline{\mathbf{T}}\) can be
 represented with respect to the local coordinates that define
 \(\mathcal{T}\) and \(\mathcal{T}^{\bot}\). This representation
 can be achieved by computing
 \begin{equation} \label{eq:w-definition}
    \overline{\mathbf{R}} = U^{\top} \overline{\mathbf{T}} = \left[\begin{matrix}
        U^{\|^{\top}} \overline{\mathbf{T}} \\
        U^{\bot^{\top}} \overline{\mathbf{T}}
    \end{matrix}\right] = \left[\begin{matrix}
        \overline{\mathbf{R}}^{\|} \\
        \overline{\mathbf{R}}^{\bot}
    \end{matrix}\right],
 \end{equation}
 \noindent
 where the rotated coordinates (training/testing data points
 changed to be on the tangent space)
 \(\overline{\mathbf{R}}\) are decomposed into \(\overline{\mathbf{R}}^{\|}\), which
 are parallel to \(\mathcal{T}\), and \(\overline{\mathbf{R}}^{\bot}\), which
 are orthogonal to \(\mathcal{T}\).
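 As a toy example, assume the generator \(g(z) = (z, z^2)^{\top}\) and
 the encoder output \(\overline{z} = e(\overline{\mathbf{T}}) = 1\)
 (both are illustrative assumptions, not part of the actual model). The Jacobi matrix
 \(J_g(1) = (1, 2)^{\top}\) has the singular value decomposition
 \(U^{\|} = \frac{1}{\sqrt{5}}(1, 2)^{\top}\), \(S = \sqrt{5}\),
 \(V^{*} = 1\), and \(U^{\bot} = \frac{1}{\sqrt{5}}(-2, 1)^{\top}\)
 completes the unitary matrix \(U\). For a test point
 \(\overline{\mathbf{T}} = (3, 1)^{\top}\) this results in
 \[
    \overline{\mathbf{R}} = U^{\top} \overline{\mathbf{T}} =
    \frac{1}{\sqrt{5}} \left[\begin{matrix} 1 \cdot 3 + 2 \cdot 1 \\ -2 \cdot 3 + 1 \cdot 1 \end{matrix}\right] =
    \left[\begin{matrix} \sqrt{5} \\ -\sqrt{5} \end{matrix}\right].
 \]
 The rotation preserves the length:
 \(\|\overline{\mathbf{R}}\|^2 = 5 + 5 = 10 = \|\overline{\mathbf{T}}\|^2\).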
 The last step to define the novelty test involves probability
 density functions (PDFs), which are now introduced. The PDF \(p_T(\mathbf{T})\)
 describes the random variable \(T\), from which the training and
 testing data points are drawn. In addition, \(p_R(\mathbf{R})\) is the
 probability density function of the random variable \(R\),
 which represents \(T\) after changing the coordinates. Because this
 change is a rotation by the unitary matrix \(U\), both densities
 take the same value at corresponding points. Furthermore, it is assumed that the coordinates
 \(R^{\|}\), which are parallel to \(\mathcal{T}\), and the coordinates
 \(R^{\bot}\), which are orthogonal to \(\mathcal{T}\), are
 statistically independent. With this assumption the following holds:
 \begin{equation} \label{eq:pdf-x}
    p_T(\mathbf{T}) = p_R(\mathbf{R}) = p_R(\mathbf{R}^{\|}, \mathbf{R}^{\bot}) = p_{R^{\|}}(\mathbf{R}^{\|}) p_{R^{\bot}}(\mathbf{R}^{\bot})
 \end{equation}
 The previously introduced noise comes into play again. In formula
 (\ref{eq:train-set}) it is assumed that the noise \(\xi\)
 predominantly moves the data points \(\mathbf{T}\) away from the manifold
 \(\mathcal{M}\) in a direction orthogonal to \(\mathcal{T}\).
 As a consequence \(R^{\bot}\) is mainly responsible for the noise
 effects. Since noise and drawing from the manifold are statistically
 independent, \(R^{\|}\) and \(R^{\bot}\) are also independent.
 Finally, referring back to the data point \(\overline{\mathbf{T}}\), the
 novelty test is defined like this:
 \begin{equation} \label{eq:novelty-test}
    p_T(\overline{\mathbf{T}}) = p_{R^{\|}}(\overline{\mathbf{R}}^{\|})p_{R^{\bot}}(\overline{\mathbf{R}}^{\bot}) =
    \begin{cases}
        \geq \gamma & \Longrightarrow \text{Inlier} \\
        < \gamma &  \Longrightarrow \text{Outlier}
    \end{cases}
 \end{equation}
 \noindent
 where \(\gamma\) is a suitable threshold.
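 For example, with purely illustrative density values
 \(p_{R^{\|}}(\overline{\mathbf{R}}^{\|}) = 0.2\) and
 \(p_{R^{\bot}}(\overline{\mathbf{R}}^{\bot}) = 0.01\), and a threshold
 \(\gamma = 0.005\), the test yields
 \[
    p_T(\overline{\mathbf{T}}) = 0.2 \cdot 0.01 = 0.002 < \gamma,
 \]
 so \(\overline{\mathbf{T}}\) would be classified as an outlier. A small
 density in the orthogonal direction alone can push the
 product below the threshold.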
 At this point it is very clear that the GPND approach requires
 far more math background than dropout sampling to understand
 the novelty test. Nonetheless it could be the better method.
 % SSD: \cite{Liu2016}
 % ImageNet: \cite{Deng2009}