Added background sections

Signed-off-by: Jim Martens <github@2martens.de>
2019-08-05 15:51:55 +02:00 · 2019-08-05 15:51:55 +02:00 · d0793ba395
parent e2194dd8f1
commit d0793ba395
1 changed files with 369 additions and 0 deletions
--- a/body.tex
+++ b/body.tex
@ -136,6 +136,9 @@ of the work of Miller et al.~\cite{Miller2018} and auto-encoders will
 be explained. The chapter concludes with more details about the
 research question and the intended contribution of this thesis.

+For both background sections the notation defined in table
+\ref{tab:notation} will be used.
+
 \section{Related Works}

 Novelty detection for object detection is intricately linked with
@ -201,6 +204,372 @@ moments in neural networks which eliminates gradient variance, and
 they introduce a hierarchical prior for parameters and an
 Empirical Bayes procedure to select prior variances.

+\section{Background for Bayesian SSD}
+
+\begin{table}
+    \centering
+    \caption{Notation for background sections}
+    \label{tab:notation}
+    \begin{tabular}{l|l}
+        symbol & meaning \\
+        \hline
+        \(\mathbf{W}\) & weights \\
+        \(\mathbf{T}\) & training data \\
+        \(\mathcal{N}(0, I)\) & Gaussian distribution \\
+        \(I\) & identity matrix (weights are independent and identically distributed) \\
+        \(p(\mathbf{W}|\mathbf{T})\) & probability of weights given
+            training data \\
+        \(\mathcal{I}\) & an image \\
+        \(\mathbf{q} = p(y|\mathcal{I}, \mathbf{T})\) & probability
+            of all classes given image and training data \\
+        \(H(\mathbf{q})\) & entropy over probability vector \\
+        \(\widetilde{\mathbf{W}}\) & weights sampled from
+            \(p(\mathbf{W}|\mathbf{T})\) \\
+        \(\mathbf{b}\) & bounding box coordinates \\
+        \(\mathbf{s}\) & softmax scores \\
+        \(\overline{\mathbf{s}}\) & averaged softmax scores \\
+        \(D\) & detections of one forward pass \\
+        \(\mathfrak{D}\) & set of all detections over multiple
+            forward passes \\
+        \(\mathcal{O}\) & observation \\
+        \(\overline{\mathbf{q}}\) & probability vector for
+            observation \\
+        \(\overline{\mathbf{z}}, \mathbf{z}\) & latent space representation \\
+        \(d_T, d_z\) & discriminators \\
+        \(e, g\) & encoding and decoding/generating function \\
+        \(J_g\) & Jacobian matrix of the generating function \\
+        \(\mathcal{T}\) & tangent space \\
+        \(\mathbf{R}\) & training/test data expressed in tangent space coordinates
+    \end{tabular}
+\end{table}
+
+To understand dropout sampling, it is necessary to explain the
+idea of Bayesian neural networks. They place a prior distribution
+over the network weights, for example a Gaussian prior distribution:
+\(\mathbf{W} \sim \mathcal{N}(0, I)\). In this example
+\(\mathbf{W}\) are the weights and the identity matrix \(I\) as
+covariance means that the weights are independent and identically
+distributed. The
+training of the network determines a plausible set of weights by
+evaluating the posterior distribution over the weights given
+the training data: \(p(\mathbf{W}|\mathbf{T})\). However, this
+evaluation is intractable for networks of practical size.
+Therefore, approximation techniques are
+required. In those techniques the posterior is fitted with a
+simple distribution \(q^{*}_{\theta}(\mathbf{W})\). The original
+and intractable problem of averaging over all weights in the network
+is replaced with an optimisation task, in which the parameters
+\(\theta\) of the simple distribution are optimised~\cite{Kendall2017}.
+
+\subsubsection*{Dropout Variational Inference}
+
+Kendall and Gal~\cite{Kendall2017} showed an approximation for
+classification and recognition tasks. Dropout variational inference
+is a practical approximation technique that adds dropout layers
+in front of every weight layer and keeps them active at test
+time to sample from the approximate posterior. Effectively, this
+results in the approximation of the class probability
+\(p(y|\mathcal{I}, \mathbf{T})\) by performing multiple forward
+passes through the network and averaging over the obtained Softmax
+scores \(\mathbf{s}_i\), given an image \(\mathcal{I}\) and the
+training data \(\mathbf{T}\):
+\begin{equation} \label{eq:drop-sampling}
+p(y|\mathcal{I}, \mathbf{T}) = \int p(y|\mathcal{I}, \mathbf{W}) \cdot p(\mathbf{W}|\mathbf{T})d\mathbf{W} \approx \frac{1}{n} \sum_{i=1}^{n}\mathbf{s}_i
+\end{equation}
+
+With this dropout sampling technique \(n\) sets of model weights
+\(\widetilde{\mathbf{W}}_i\) are approximately sampled from the posterior
+\(p(\mathbf{W}|\mathbf{T})\). The class probability
+\(p(y|\mathcal{I}, \mathbf{T})\) is a probability vector
+\(\mathbf{q}\) over all class labels. Finally, the uncertainty
+of the network with respect to the classification is given by
+the entropy \(H(\mathbf{q}) = - \sum_i q_i \cdot \log q_i\).
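+
+As a small illustration of equation (\ref{eq:drop-sampling}), with
+numbers made up for this example and not taken from Kendall and Gal,
+assume \(n = 3\) forward passes over three classes:
+\[
+    \mathbf{s}_1 = (0.7, 0.2, 0.1), \quad
+    \mathbf{s}_2 = (0.5, 0.3, 0.2), \quad
+    \mathbf{s}_3 = (0.6, 0.1, 0.3).
+\]
+Averaging the three score vectors component-wise gives
+\[
+    \mathbf{q} \approx \frac{1}{3} \sum_{i=1}^{3} \mathbf{s}_i = (0.6, 0.2, 0.2),
+\]
+which is again a valid probability vector because its entries sum to one.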
+
+\subsubsection*{Dropout Sampling for Object Detection}
+
+Miller et al.~\cite{Miller2018} apply the dropout sampling to
+object detection. In that case \(\mathbf{W}\) represents the
+learned weights of a detection network like SSD~\cite{Liu2016}.
+Every forward pass uses a different network
+\(\widetilde{\mathbf{W}}\) which is approximately sampled from
+\(p(\mathbf{W}|\mathbf{T})\). Each forward pass in object
+detection results in a set of detections, each consisting of bounding
+box coordinates \(\mathbf{b}\) and softmax score \(\mathbf{s}\).
+The detections are denoted by Miller et al.\ as \(D_i =
+\{\mathbf{s}_i,\mathbf{b}_i\}\). The detections of all passes are put
+into a large set \(\mathfrak{D} = \{D_1, \dots, D_n\}\).
+
+All detections with mutual intersection-over-union (IoU) scores
+of \(0.95\) or higher are grouped into an observation \(\mathcal{O}_i\).
+Subsequently, the corresponding vector of class probabilities
+\(\overline{\mathbf{q}}_i\) for the observation is calculated by averaging all
+score vectors \(\mathbf{s}_j\) in a particular observation
+\(\mathcal{O}_i\): \(\overline{\mathbf{q}}_i \approx \overline{\mathbf{s}}_i = \frac{1}{n} \sum_{j=1}^{n} \mathbf{s}_j\). The label uncertainty
+of the detector for a particular observation is measured by
+the entropy \(H(\overline{\mathbf{q}}_i) = - \sum_j \overline{q}_{ij} \cdot \log \overline{q}_{ij}\).
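+
+For reference, the IoU of two bounding boxes is the area of their
+intersection divided by the area of their union. The symbol
+\(A(\cdot)\) for the area of a region is introduced only for this
+illustration and is not part of the notation of Miller et al.:
+\[
+    \text{IoU}(\mathbf{b}_j, \mathbf{b}_k) =
+    \frac{A(\mathbf{b}_j \cap \mathbf{b}_k)}{A(\mathbf{b}_j \cup \mathbf{b}_k)}.
+\]
+A made-up example shows how strict the threshold of \(0.95\) is: two
+boxes of size \(10 \times 10\) that are shifted against each other by
+\(0.5\) along one axis have an intersection of \(10 \cdot 9.5 = 95\) and
+a union of \(100 + 100 - 95 = 105\), hence
+\[
+    \text{IoU} = \frac{95}{105} \approx 0.905 < 0.95,
+\]
+and would therefore not be grouped into the same observation.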
+
+In the introduction I used a simplified description of
+maximum and low uncertainty. A more complete explanation follows.
+If \(\overline{\mathbf{q}}_i\), which I called averaged class probabilities,
+resembles a uniform distribution the entropy will be high. A uniform
+distribution means that no class is more likely than another, which
+is a perfect example of maximum uncertainty. Conversely, if
+one class has a very high probability the entropy will be low.
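+
+A short worked example makes the two extremes concrete. The numbers
+are chosen for illustration only, and the natural logarithm is assumed.
+For four classes and a uniform probability vector the entropy reaches
+its maximum:
+\[
+    \overline{\mathbf{q}} = (0.25, 0.25, 0.25, 0.25)
+    \quad \Longrightarrow \quad
+    H(\overline{\mathbf{q}}) = - 4 \cdot 0.25 \cdot \log 0.25 = \log 4 \approx 1.386.
+\]
+If one class dominates, the entropy is much lower:
+\[
+    \overline{\mathbf{q}} = (0.97, 0.01, 0.01, 0.01)
+    \quad \Longrightarrow \quad
+    H(\overline{\mathbf{q}}) = - 0.97 \cdot \log 0.97 - 3 \cdot 0.01 \cdot \log 0.01 \approx 0.168.
+\]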
+
+In open set conditions it can be expected that falsely generated
+detections for unknown object classes have a higher label
+uncertainty. A threshold on the entropy \(H(\overline{\mathbf{q}}_i)\) can then
+be used to identify and reject these false positive cases.
+
+\section{Adversarial Auto-encoder}
+
+This section will explain the adversarial auto-encoder used by
+Pidhorskyi et al.~\cite{Pidhorskyi2018} but in a slightly modified
+form to make it more understandable.
+
+The training data \(\mathbf{T} \in \mathbb{R}^m \) is the input
+of the auto-encoder. An encoding function \(e: \mathbb{R}^m \rightarrow \mathbb{R}^n\) takes the data
+and produces a representation \(\overline{\mathbf{z}} \in \mathbb{R}^n\)
+in a latent space. This latent space is smaller (\(n < m\)) than the
+input, which necessitates some form of compression.
+
+A second function \(g: \Omega \rightarrow \mathbb{R}^m\) is the
+generator function that takes the latent representation
+\(\mathbf{z} \in \Omega \subset \mathbb{R}^n\) and generates an output
+\(\overline{\mathbf{T}}\) as close as possible to the input data
+distribution.
+
+What then is the difference between \(\overline{\mathbf{z}}\) and \(\mathbf{z}\)?
+With a simple auto-encoder both would be identical. In this case
+of an adversarial auto-encoder it is slightly more complicated.
+There is a discriminator \(d_z\) that tries to distinguish between
+an encoded data point \(\overline{\mathbf{z}}\) and a \(\mathbf{z} \sim \mathcal{N}(0,1)\) drawn from a normal distribution with \(0\) mean
+and a standard deviation of \(1\). During training, the encoding
+function \(e\) attempts to minimize any perceivable difference
+between \(\mathbf{z}\) and \(\overline{\mathbf{z}}\) while \(d_z\) has the
+aforementioned adversarial task to differentiate between them.
+
+Furthermore, there is a discriminator \(d_T\) that has the task
+to differentiate the generated output \(\overline{\mathbf{T}}\) from the
+actual input \(\mathbf{T}\). During training, the generator function \(g\)
+tries to minimize the perceivable difference between \(\overline{\mathbf{T}}\) and \(\mathbf{T}\) while \(d_T\) has the mentioned
+adversarial task to distinguish between them.
+
+With this, all components of the adversarial auto-encoder employed
+by Pidhorskyi et al.\ have been introduced. Finally, the losses are
+presented. The two adversarial objectives have been mentioned
+already. Specifically, there is the adversarial loss for the
+discriminator \(d_z\):
+\begin{equation} \label{eq:adv-loss-z}
+    \mathcal{L}_{adv-d_z}(\mathbf{T},e,d_z) = E[\log (d_z(\mathcal{N}(0,1)))] + E[\log (1 - d_z(e(\mathbf{T})))],
+\end{equation}
+\noindent
+where \(E\) stands for an expected
+value\footnote{a term used in probability theory},
+\(\mathbf{T}\) stands for the input, and
+\(\mathcal{N}(0,1)\) represents an element drawn from the specified
+distribution. The encoder \(e\) attempts to minimize this loss while
+the discriminator \(d_z\) intends to maximize it.
+
+In the same way the adversarial loss for the discriminator \(d_T\)
+is specified:
+\begin{equation} \label{eq:adv-loss-x}
+    \mathcal{L}_{adv-d_T}(\mathbf{T},d_T,g) = E[\log(d_T(\mathbf{T}))] + E[\log(1 - d_T(g(\mathcal{N}(0,1))))],
+\end{equation}
+\noindent
+where \(\mathbf{T}\), \(E\), and \(\mathcal{N}(0,1)\) have the same meaning
+as before. In this case the generator \(g\) tries to minimize the loss
+while the discriminator \(d_T\) attempts to maximize it.
+
+Every auto-encoder requires a reconstruction error to work. This
+error calculates the difference between the original input and
+the generated or decoded output. In this case, the reconstruction
+loss is defined like this:
+\begin{equation} \label{eq:recon-loss}
+    \mathcal{L}_{error}(\mathbf{T}, e, g) = - E[\log(p(g(e(\mathbf{T})) | \mathbf{T}))],
+\end{equation}
+\noindent
+where \(E[\log(p)]\) is the expected log-likelihood and \(\mathbf{T}\),
+\(E\), \(e\), and \(g\) have the same meaning as before.
+
+All losses combined result in the following formula:
+\begin{equation} \label{eq:full-loss}
+    \mathcal{L}(\mathbf{T},e,d_z,d_T,g) = \mathcal{L}_{adv-d_z}(\mathbf{T},e,d_z) + \mathcal{L}_{adv-d_T}(\mathbf{T},d_T,g) + \lambda \mathcal{L}_{error}(\mathbf{T},e,g),
+\end{equation}
+\noindent
+where \(\lambda\) is a parameter used to balance the adversarial
+losses with the reconstruction loss. The model is trained by
+Pidhorskyi et al.\ using the Adam optimizer with alternating
+updates of each of the aforementioned components:
+
+\begin{itemize}
+    \item Maximize \(\mathcal{L}_{adv-d_T}\) by updating weights of \(d_T\);
+    \item Minimize \(\mathcal{L}_{adv-d_T}\) by updating weights of \(g\);
+    \item Maximize \(\mathcal{L}_{adv-d_z}\) by updating weights of \(d_z\);
+    \item Minimize \(\mathcal{L}_{error}\) and \(\mathcal{L}_{adv-d_z}\) by updating weights of \(e\) and \(g\).
+\end{itemize}
+
+Practically, the auto-encoder is trained separately for every
+object class that is considered ``known''. Pidhorskyi et al.\ trained
+it on the MNIST~\cite{Lecun1998} data set, once for every digit.
+
+For this thesis it needs to be trained on the SceneNet RGB-D
+data set using MS COCO classes as known classes. As all known
+classes are present in every test epoch, it becomes
+non-trivial which of the trained auto-encoders should be used to
+calculate novelty. To phrase it differently, a true positive
+detection is possible for multiple classes in the same image.
+If, for example, one object is classified correctly by SSD as a chair,
+the novelty score should be low. But the auto-encoders of all
+known classes except the ``chair'' class will ideally give a high novelty
+score. Which of the values should be used? The only sensible solution
+is to run the detection only through the auto-encoder that was trained for
+the class the SSD model predicted. This provides the following
+scenarios:
+\begin{itemize}
+    \item true positive classification: novelty score should be low
+    \item false positive classification and correct class is
+    among the known classes: novelty score should be high
+    \item false positive classification and correct class is unknown:
+    novelty score should be high
+\end{itemize}
+\noindent
+Negative classifications are not listed as these are not part
+of the output of the SSD and cannot be given to the auto-encoder
+as input. Furthermore, the second case should rarely happen because
+the trained SSD knows this other class and is very likely
+to give it a higher probability. Therefore, using only one
+auto-encoder fulfils the task of differentiating between
+known and unknown classes.
+
+\section{Generative Probabilistic Novelty Detection}
+
+It is still unclear how the novelty score is calculated.
+This section explains the calculation in terms that are as
+understandable as possible. The name ``Generative Probabilistic
+Novelty Detection''~\cite{Pidhorskyi2018} already signals that
+probability theory plays a role. Furthermore, this
+section will make use of some mathematical terms which cannot
+be explained in great detail here. Moreover, the previous section
+already introduced many required components, which will not be
+explained here again.
+
+For the purpose of this explanation a trained auto-encoder
+is assumed. In that case the generator function describes
+the model that the auto-encoder is actually using for the
+novelty detection. The task of training is to make sure this
+model comes as close as possible to the real model of the
+training or testing data. The model of the auto-encoder
+is in mathematical terms a parameterized manifold
+\(\mathcal{M} \equiv g(\Omega)\) of dimension \(n\).
+The set of training or testing data can then be described
+in the following way:
+\begin{equation} \label{eq:train-set}
+    \mathbf{T}_i = g(\mathbf{z}) + \xi_i \quad i \in \mathbb{N},
+\end{equation}
+\noindent
+where \(\xi_i\) represents noise. It may be confusing but
+for the purpose of this novelty test the ``truth'' is what
+the generator function generates from a set of \(\mathbf{z} \in \Omega\),
+not the ground truth from the data set. Furthermore,
+the previously introduced encoder function \(e\) is assumed
+to work as an exact inverse of \(g\) for every \(\mathbf{T} \in \mathcal{M}\).
+For such \(\mathbf{T}\) it follows that \(\mathbf{T} = g(e(\mathbf{T}))\).
+
+Let \(\overline{\mathbf{T}} \in \mathbb{R}^m\) be the test data. The
+remainder of the section will explain how the novelty
+test is performed for this \(\overline{\mathbf{T}}\). It is important
+to note that this data is not necessarily part of the
+auto-encoder model. Therefore, \(g(e(\overline{\mathbf{T}})) = \overline{\mathbf{T}}\) cannot
+be assumed. However, it can be observed that \(\overline{\mathbf{T}}\)
+can be non-linearly projected onto
+\(\overline{\mathbf{T}}^{\|} \in \mathcal{M}\)
+by using \(g(\overline{\mathbf{z}})\) with \(\overline{\mathbf{z}} = e(\overline{\mathbf{T}})\).
+It is assumed that \(g\) is smooth enough to perform a linearization
+based on the first-order Taylor expansion:
+\begin{equation} \label{eq:taylor-expanse}
+    g(\mathbf{z}) = g(\overline{\mathbf{z}}) + J_g(\overline{\mathbf{z}}) (\mathbf{z} - \overline{\mathbf{z}}) + \mathcal{O}(\| \mathbf{z} - \overline{\mathbf{z}} \|^2),
+\end{equation}
+\noindent
+where \(J_g(\overline{\mathbf{z}})\) is the Jacobian matrix of \(g\) computed
+at \(\overline{\mathbf{z}}\). It is assumed that the Jacobian matrix of \(g\)
+has full rank at every point of the manifold. A Jacobian matrix
+contains all first-order partial derivatives of a function.
+\(\| \cdot \|\) is the \(L_2\) norm, which calculates the
+length of a vector by calculating the square root of the sum of
+squares of all components of the vector. Lastly, \(\mathcal{O}\)
+is Big-O notation, which here bounds the remainder of the
+expansion (it does not denote an observation in this equation).
+The remainder shrinks quadratically with the distance
+\(\| \mathbf{z} - \overline{\mathbf{z}} \|\), which means that this part
+of the term can be ignored for \(\mathbf{z}\) close to \(\overline{\mathbf{z}}\).
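+
+Written out, and assuming the component notation
+\(g = (g_1, \dots, g_m)\) and \(\mathbf{z} = (z_1, \dots, z_n)\), which
+is introduced here only for illustration, the matrix \(J_g\) has the form
+\[
+    J_g(\overline{\mathbf{z}}) =
+    \left[\begin{matrix}
+        \frac{\partial g_1}{\partial z_1}(\overline{\mathbf{z}}) & \cdots & \frac{\partial g_1}{\partial z_n}(\overline{\mathbf{z}}) \\
+        \vdots & \ddots & \vdots \\
+        \frac{\partial g_m}{\partial z_1}(\overline{\mathbf{z}}) & \cdots & \frac{\partial g_m}{\partial z_n}(\overline{\mathbf{z}})
+    \end{matrix}\right] \in \mathbb{R}^{m \times n}.
+\]
+Since \(n < m\), full rank means that the rank of this matrix is \(n\),
+so its \(n\) columns are linearly independent.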
+
+Next the tangent space of \(g\) at \(\overline{\mathbf{T}}^{\|}\), which
+is spanned by the \(n\) independent column vectors of the Jacobian
+matrix \(J_g(\overline{\mathbf{z}})\), is defined as
+\(\mathcal{T} = \text{span}(J_g(\overline{\mathbf{z}}))\). The tangent space
+at a point of the manifold contains all directions in which one can
+move along the manifold starting from this point. The Jacobian matrix
+can be decomposed into three matrices using singular value
+decomposition: \(J_g(\overline{\mathbf{z}}) = U^{\|}SV^{*}\).
+\(\mathcal{T}\) is also spanned by the column vectors of \(U^{\|}\):
+\(\mathcal{T} = \text{span}(U^{\|})\). \(U^{\|}\) contains the
+left-singular vectors, \(S\) is the diagonal matrix of singular values,
+and \(V^{*}\) is the conjugate transpose of the matrix
+\(V\), which contains the right-singular vectors. \(U^{\bot}\) is
+defined in such a way that \(U = [U^{\|}U^{\bot}]\) is a unitary
+matrix. \(\mathcal{T}^{\bot}\) is the orthogonal complement of
+\(\mathcal{T}\). With this preparation \(\overline{\mathbf{T}}\) can be
+represented with respect to the local coordinates that define
+\(\mathcal{T}\) and \(\mathcal{T}^{\bot}\). This representation
+can be achieved by computing
+\begin{equation} \label{eq:w-definition}
+    \overline{\mathbf{R}} = U^{\top} \overline{\mathbf{T}} = \left[\begin{matrix}
+        U^{\|^{\top}} \overline{\mathbf{T}} \\
+        U^{\bot^{\top}} \overline{\mathbf{T}}
+    \end{matrix}\right] = \left[\begin{matrix}
+        \overline{\mathbf{R}}^{\|} \\
+        \overline{\mathbf{R}}^{\bot}
+    \end{matrix}\right],
+\end{equation}
+\noindent
+where the rotated coordinates (training/testing data points
+expressed in the coordinates of the tangent space and its
+orthogonal complement)
+\(\overline{\mathbf{R}}\) are decomposed into \(\overline{\mathbf{R}}^{\|}\), which
+are parallel to \(\mathcal{T}\), and \(\overline{\mathbf{R}}^{\bot}\), which
+are orthogonal to \(\mathcal{T}\).
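+
+A minimal numeric illustration of equation (\ref{eq:w-definition}),
+with values chosen purely for this example: let \(m = 2\), \(n = 1\), and
+\(J_g(\overline{\mathbf{z}}) = (3, 4)^{\top}\). The singular value
+decomposition then gives \(U^{\|} = (0.6, 0.8)^{\top}\), \(S = 5\), and
+\(V^{*} = 1\), while \(U^{\bot} = (-0.8, 0.6)^{\top}\) completes the
+unitary matrix \(U\) (up to sign). For the data point
+\(\overline{\mathbf{T}} = (1, 2)^{\top}\) this yields
+\[
+    \overline{\mathbf{R}}^{\|} = 0.6 \cdot 1 + 0.8 \cdot 2 = 2.2, \qquad
+    \overline{\mathbf{R}}^{\bot} = -0.8 \cdot 1 + 0.6 \cdot 2 = 0.4.
+\]
+Because \(U\) is unitary, the change of coordinates preserves the
+length of the vector: \(1^2 + 2^2 = 5 = 2.2^2 + 0.4^2\).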
+
+The last step to define the novelty test involves probability
+density functions (PDFs), which are now introduced. The PDF \(p_T(\mathbf{T})\)
+describes the random variable \(T\), from which the training and
+testing data points are drawn. In addition, \(p_R(\mathbf{R})\) is the
+probability density function of the random variable \(R\),
+which represents \(T\) after changing the coordinates. Both
+distributions are identical. But it is assumed that the coordinates
+\(R^{\|}\), which are parallel to \(\mathcal{T}\), and the coordinates
+\(R^{\bot}\), which are orthogonal to \(\mathcal{T}\), are
+statistically independent. With this assumption the following holds:
+\begin{equation} \label{eq:pdf-x}
+    p_T(\mathbf{T}) = p_R(\mathbf{R}) = p_R(\mathbf{R}^{\|}, \mathbf{R}^{\bot}) = p_{R^{\|}}(\mathbf{R}^{\|}) p_{R^{\bot}}(\mathbf{R}^{\bot})
+\end{equation}
+The previously introduced noise comes into play again. In formula
+(\ref{eq:train-set}) it is assumed that the noise \(\xi\)
+predominantly displaces the data points \(\mathbf{T}\) away from the manifold
+\(\mathcal{M}\) in a direction orthogonal to \(\mathcal{T}\).
+As a consequence \(R^{\bot}\) is mainly responsible for the noise
+effects. Since noise and drawing from the manifold are statistically
+independent, \(R^{\|}\) and \(R^{\bot}\) are also independent.
+
+Finally, referring back to the data point \(\overline{\mathbf{T}}\), the
+novelty test is defined like this:
+\begin{equation} \label{eq:novelty-test}
+    p_T(\overline{\mathbf{T}}) = p_{R^{\|}}(\overline{\mathbf{R}}^{\|})p_{R^{\bot}}(\overline{\mathbf{R}}^{\bot}) =
+    \begin{cases}
+        \geq \gamma & \Longrightarrow \text{Inlier} \\
+        < \gamma &  \Longrightarrow \text{Outlier}
+    \end{cases}
+\end{equation}
+\noindent
+where \(\gamma\) is a suitable threshold.
+
+At this point it is very clear that the GPND approach requires
+far more math background than dropout sampling to understand
+the novelty test. Nonetheless it could be the better method.

 % SSD: \cite{Liu2016}
 % ImageNet: \cite{Deng2009}