The class probability is approximated by performing multiple forward
passes through the network and averaging over the obtained Softmax
scores \(\mathbf{s}_i\), given an image \(\mathcal{I}\) and the
training data \(\mathbf{T}\):
\begin{equation} \label{eq:drop-sampling}
p(y|\mathcal{I}, \mathbf{T}) = \int p(y|\mathcal{I}, \mathbf{W}) \cdot p(\mathbf{W}|\mathbf{T})d\mathbf{W} \approx \frac{1}{n} \sum_{i=1}^{n}\mathbf{s}_i
\end{equation}
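This sampling procedure can be sketched as follows; the classifier
\texttt{forward\_pass} is assumed to return Softmax scores with
dropout kept active at inference time, and \texttt{toy\_network}
below is a hypothetical stand-in, not the actual detection network:

```python
import numpy as np

def dropout_sampling(forward_pass, image, n=10):
    """Approximate p(y | image, T) by averaging Softmax scores
    over n stochastic forward passes (dropout active at test time)."""
    scores = np.stack([forward_pass(image) for _ in range(n)])
    return scores.mean(axis=0)

# hypothetical stand-in network: fixed logits plus dropout-like noise
rng = np.random.default_rng(0)
def toy_network(image):
    logits = image + rng.normal(scale=0.1, size=image.shape)
    exp = np.exp(logits - logits.max())   # numerically stable Softmax
    return exp / exp.sum()

probs = dropout_sampling(toy_network, np.array([2.0, 1.0, 0.1]), n=50)
print(probs)  # averaged class probabilities, summing to 1
```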
be used to identify and reject these false positive cases.
\section{GPND}
For the theoretical underpinning of the Generative Probabilistic
Novelty Detection the reader is advised to refer to the paper of
Pidhorskyi et al.~\cite{Pidhorskyi2018}. This section will only
cover the key aspects of an adversarial auto-encoder required
to understand their method.
\subsection{Adversarial Auto-encoder}
The training data points \(x_i \in \mathbb{R}^m \) are the input
of the auto-encoder. An encoding function \(e: \mathbb{R}^m \rightarrow \mathbb{R}^n\) takes the data points
and produces a representation \(\overline{z_i} \in \mathbb{R}^n\)
in a latent space. This latent space is smaller (\(n < m\)) than the
input space, which necessitates some form of compression.
A second function \(g: \Omega \rightarrow \mathbb{R}^m\) is the
generator function that takes the latent representation
\(z_i \in \Omega \subset \mathbb{R}^n\) and generates an output
\(\overline{x_i}\) as close as possible to the input data
distribution.
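The dimensions involved can be illustrated with a minimal sketch;
the linear maps \texttt{W\_e} and \texttt{W\_g} below are hypothetical
stand-ins for the learned encoder and generator networks:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 8, 2   # input and latent dimensions, with n < m

# hypothetical linear encoder e: R^m -> R^n and generator g: R^n -> R^m
W_e = rng.normal(size=(n, m))
W_g = rng.normal(size=(m, n))

def e(x):   # encoding function, compresses x_i to z_bar_i
    return W_e @ x

def g(z):   # generator, maps latent z_i back to the input space
    return W_g @ z

x_i = rng.normal(size=m)
z_bar = e(x_i)     # latent representation in R^2
x_bar = g(z_bar)   # reconstruction in R^8
print(z_bar.shape, x_bar.shape)  # (2,) (8,)
```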
What, then, is the difference between \(\overline{z_i}\) and \(z_i\)?
In a simple auto-encoder both would be identical. In the case
of an adversarial auto-encoder it is slightly more complicated.
There is a discriminator \(D_z\) that tries to distinguish between
an encoded data point \(\overline{z_i}\) and a \(z_i \sim \mathcal{N}(0,1)\) drawn from a normal distribution with \(0\) mean
and a standard deviation of \(1\). During training, the encoding
function \(e\) attempts to minimize any perceivable difference
between \(z_i\) and \(\overline{z_i}\) while \(D_z\) has the
aforementioned adversarial task to differentiate between them.
Furthermore, there is a discriminator \(D_x\) whose task is
to differentiate the generated output \(\overline{x_i}\) from the
actual input \(x_i\). During training, the generator function \(g\)
tries to minimize the perceivable difference between
\(\overline{x_i}\) and \(x_i\), while \(D_x\) has the aforementioned
adversarial task to distinguish between them.
With this, all components of the adversarial auto-encoder employed
by Pidhorskyi et al.\ are introduced. Finally, the losses are
presented. The two adversarial objectives have been mentioned
already. Specifically, there is the adversarial loss for the
discriminator \(D_z\):
\begin{equation} \label{eq:adv-loss-z}
\mathcal{L}_{adv-d_z}(x,e,D_z) = E[\log (D_z(\mathcal{N}(0,1)))] + E[\log (1 - D_z(e(x)))],
\end{equation}
\noindent
where \(E\) stands for the expected
value\footnote{the mean of a random variable under its distribution},
\(x\) stands for the input, and
\(\mathcal{N}(0,1)\) represents an element drawn from the specified
distribution. The encoder \(e\) attempts to minimize this loss while
the discriminator \(D_z\) intends to maximize it.
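A Monte Carlo estimate of this loss can be sketched as follows,
assuming a one-dimensional latent space; the sigmoid discriminator
and the linear encoder below are hypothetical stand-ins for the
trained networks:

```python
import numpy as np

def adv_loss_dz(d_z, e, x_batch, rng):
    """Monte Carlo estimate of
    E[log D_z(N(0,1))] + E[log(1 - D_z(e(x)))]."""
    z_prior = rng.normal(size=len(x_batch))          # samples from N(0, 1)
    term_prior = np.log(d_z(z_prior)).mean()         # D_z on prior samples
    term_enc = np.log(1.0 - d_z(e(x_batch))).mean()  # D_z on encoded points
    return term_prior + term_enc

# hypothetical stand-ins: sigmoid discriminator and linear encoder
rng = np.random.default_rng(2)
d_z = lambda z: 1.0 / (1.0 + np.exp(-z))
e = lambda x: 0.5 * x
loss = adv_loss_dz(d_z, e, rng.normal(size=64), rng)
print(loss)  # negative, since both log terms lie below zero
```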
In the same way the adversarial loss for the discriminator \(D_x\)
is specified:
\begin{equation} \label{eq:adv-loss-x}
\mathcal{L}_{adv-d_x}(x,D_x,g) = E[\log(D_x(x))] + E[\log(1 - D_x(g(\mathcal{N}(0,1))))],
\end{equation}
\noindent
where \(x\), \(E\), and \(\mathcal{N}(0,1)\) have the same meaning
as before. In this case the generator \(g\) tries to minimize the loss
while the discriminator \(D_x\) attempts to maximize it.
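The analogous Monte Carlo estimate for this second loss can be
sketched as follows; again, the generator and discriminator below
are hypothetical stand-ins, not the networks from the paper:

```python
import numpy as np

def adv_loss_dx(d_x, g, x_batch, rng, n_latent):
    """Monte Carlo estimate of
    E[log D_x(x)] + E[log(1 - D_x(g(N(0,1))))]."""
    z = rng.normal(size=(len(x_batch), n_latent))  # latent prior samples
    term_real = np.log(d_x(x_batch)).mean()        # D_x on real inputs
    term_gen = np.log(1.0 - d_x(g(z))).mean()      # D_x on generated samples
    return term_real + term_gen

# hypothetical stand-ins for the generator and discriminator networks
rng = np.random.default_rng(3)
m, n = 4, 2
W_g = rng.normal(size=(m, n))
g = lambda z: z @ W_g.T
d_x = lambda x: 1.0 / (1.0 + np.exp(-x.sum(axis=-1)))
loss = adv_loss_dx(d_x, g, rng.normal(size=(64, m)), rng, n)
print(loss)  # negative, since both log terms lie below zero
```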
Every auto-encoder requires a reconstruction error to work. This
error measures the difference between the original input and
the generated, or decoded, output. In this case, the reconstruction
loss is defined as follows:
\begin{equation} \label{eq:recon-loss}
\mathcal{L}_{error}(x, e, g) = - E[\log(p(g(e(x)) | x))],
\end{equation}
\noindent
where \(E[\log(p)]\) is the expected log-likelihood of the
reconstruction given the input, and \(x\), \(e\), and \(g\) have
the same meaning as before.
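The likelihood \(p\) is not pinned down here; under the common
assumption of a unit-variance Gaussian likelihood (an assumption of
this sketch, not a statement from the paper), the negative expected
log-likelihood reduces, up to additive constants, to half the mean
squared reconstruction error:

```python
import numpy as np

def recon_loss(e, g, x_batch):
    """Reconstruction loss -E[log p(g(e(x)) | x)], assuming a
    unit-variance Gaussian likelihood: 0.5 * MSE up to constants."""
    x_rec = g(e(x_batch))
    return 0.5 * np.mean(np.sum((x_batch - x_rec) ** 2, axis=-1))

# with hypothetical identity maps the reconstruction is perfect
x = np.arange(6.0).reshape(2, 3)
identity = lambda v: v
print(recon_loss(identity, identity, x))  # 0.0
```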
All losses combined result in the following formula:
\begin{equation} \label{eq:full-loss}
\mathcal{L}(x,e,D_z,D_x,g) = \mathcal{L}_{adv-d_z}(x,e,D_z) + \mathcal{L}_{adv-d_x}(x,D_x,g) + \lambda \mathcal{L}_{error}(x,e,g),
\end{equation}
\noindent
where \(\lambda\) is a parameter used to balance the adversarial
losses with the reconstruction loss. The model is trained by
Pidhorskyi et al.\ using the Adam optimizer, performing alternating
updates of each of the aforementioned components:
\begin{itemize}
\item Maximize \(\mathcal{L}_{adv-d_x}\) by updating weights of \(D_x\);
\item Minimize \(\mathcal{L}_{adv-d_x}\) by updating weights of \(g\);
\item Maximize \(\mathcal{L}_{adv-d_z}\) by updating weights of \(D_z\);
\item Minimize \(\mathcal{L}_{error}\) and \(\mathcal{L}_{adv-d_z}\) by updating weights of \(e\) and \(g\).
\end{itemize}
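The alternating schedule above can be sketched as a single training
iteration; the step functions are hypothetical placeholders for the
actual Adam-based weight updates of each component:

```python
# one training iteration's alternating updates, in the order listed above
def train_step(batch, steps):
    steps["d_x"](batch)    # 1. maximize L_adv-d_x w.r.t. D_x
    steps["g_adv"](batch)  # 2. minimize L_adv-d_x w.r.t. g
    steps["d_z"](batch)    # 3. maximize L_adv-d_z w.r.t. D_z
    steps["e_g"](batch)    # 4. minimize L_error + L_adv-d_z w.r.t. e and g

# record the call order with placeholder step functions
order = []
steps = {name: (lambda batch, n=name: order.append(n))
         for name in ["d_x", "g_adv", "d_z", "e_g"]}
train_step(None, steps)
print(order)  # ['d_x', 'g_adv', 'd_z', 'e_g']
```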
\section{Contribution}
\chapter{Thesis as a project}