From 652c86040437b4fd18c01ad74884b3714986fc12 Mon Sep 17 00:00:00 2001
From: Jim Martens
Date: Sun, 28 Jul 2019 14:19:31 +0200
Subject: [PATCH] Added thesis files

Signed-off-by: Jim Martens
---
 body.tex   | 604 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 thesis.tex |  45 ++++
 2 files changed, 649 insertions(+)
 create mode 100644 body.tex
 create mode 100644 thesis.tex

diff --git a/body.tex b/body.tex
new file mode 100644
index 0000000..d4391d8
--- /dev/null
+++ b/body.tex
@@ -0,0 +1,604 @@
+% body thesis file that contains the actual content
+
+\chapter{Introduction}
+
+\subsection*{Motivation}
+
+Famous examples like the automatic soap dispenser which does not
+recognize the hand of a black person but dispenses soap when presented
+with a paper towel raise the question of bias in computer
+systems\cite{Friedman1996}. Related to this ethical question regarding
+the design of so-called algorithms, a term often used in public
+discourse for applied neural networks, is the question of
+algorithmic accountability\cite{Diakopoulos2014}.
+
+The charm of supervised neural networks, namely that they can learn
+from input-output relations and figure out by themselves which
+connections are necessary for that, is also their Achilles heel. This
+feature makes them effectively black boxes. It is possible to question
+the training environment, like potential biases inside the data sets,
+or the engineers constructing the networks, but it is not really
+possible to question the internal calculations made by a network. On
+the one hand, one might argue, it is only math and nothing magical
+that happens inside these networks. Clearly it is possible, albeit a
+chore, to manually follow the calculations of any given trained
+network. After all, it is executed on a computer and at the lowest
+level only uses basic math that does not differ between humans and
+computers. On the other hand, not everyone is capable of doing so
+and, more importantly, it does not reveal any answers to questions
+of causality.
+
+However, these questions of causality are of enormous consequence when
+neural networks are used, for example, in predictive policing. Is a
+correlation, a coincidence, enough to bring forth negative consequences
+for a particular person? And if so, what is the possible defence
+against math? Similar questions can be raised when looking at computer
+vision networks that might be used together with so-called smart
+CCTV cameras, for example, like those tested at the train station
+Berlin Südkreuz. What if a network implies you engaged in suspicious
+behaviour?
+
+This leads to the need for neural networks to explain their results.
+Such an explanation must come from the network or an attached piece
+of technology to allow mass adoption. This poses the question of
+how such an endeavour can be achieved.
+
+For neural networks there are fundamentally two types of tasks:
+regression and classification. Regression deals with any case
+where the goal for the network is to come close to an ideal
+function that connects all data points. Classification, however,
+describes tasks where the network is supposed to identify the
+class of any given input. In this thesis, I will focus on
+classification.
+
+\subsection*{Object Detection in Open Set Conditions}
+
+More specifically, I will look at object detection under open set
+conditions. In non-technical words, this effectively describes
+the kind of situation you encounter with CCTV cameras or robots
+outside of a laboratory.
+Both use cameras that record images. Subsequently, a neural network
+analyses the image and returns a list of detected and classified
+objects that it found in the image. The problem here is that networks
+can only classify what they know. If presented with an object type
+that the network was not trained with, as happens frequently in real
+environments, it will still classify the object and might even
+have a high confidence in doing so. Such a detection would be
+a false positive. Any ordinary person who uses the results of
+such a network would falsely assume that a high confidence always
+means the classification is very likely correct. If they use
+a proprietary system, they might not even be able to find out
+that the network was never trained on a particular type of object.
+Therefore it would be impossible for them to identify the output
+of the network as a false positive.
+
+This goes back to the need for automatic explanation. Such a system
+should by itself recognize that the given object is unknown and
+hence mark any classification result of the network as meaningless.
+Technically, there are two slightly different concepts that deal
+with this type of task: model uncertainty and novelty detection.
+
+Model uncertainty can be measured with dropout sampling.
+Dropout is usually used only during training but
+Miller et al\cite{Miller2018} use it also during testing
+to achieve different results for the same image, making use of
+multiple forward passes. The output scores for the forward passes
+of the same image are then averaged. If the averaged class
+probabilities resemble a uniform distribution (every class has
+the same probability), this signals maximum uncertainty. Conversely,
+if one probability is very high and every other is very low,
+this signifies low uncertainty. An unknown object is more
+likely to cause high uncertainty, which allows for an identification
+of false positive cases.
+
+Novelty detection is the more direct approach to solve the task.
+In the realm of neural networks it is usually done with the help of
+auto-encoders that essentially solve a regression task of finding an
+identity function that reconstructs the given input on the
+output\cite{Pimentel2014}. Auto-encoders internally have
+at least two components: an encoder, and a decoder or
+generator. The job of the encoder is to find an encoding that
+compresses the input as well as possible while simultaneously
+being as loss-free as possible. The decoder takes this latent
+representation of the input and has to find a decompression
+that reconstructs the input as accurately as possible. During
+training, these auto-encoders learn to reproduce a certain group
+of object classes. The actual novelty detection takes place
+during testing. Given an image, and the output and loss of the
+auto-encoder, a novelty score is calculated. A low novelty
+score signals a known object. The opposite is true for a high
+novelty score.
+
+\subsection*{Research Question}
+
+Given these two approaches to the explanation task described above,
+it comes down to performance. At the end of the day the best
+theoretical idea does not help in solving the task if it cannot
+be implemented in a performant way. Miller et al have shown
+some success in using dropout sampling. However, the many forward
+passes during testing for every image seem computationally expensive.
+In comparison, a single run through a trained auto-encoder
+intuitively seems faster. This leads to the hypothesis (see below).
+
+For the purpose of this thesis, I will
+use the work of Miller et al as a baseline to compare against.
+They use the SSD\cite{Liu2016} network for object detection,
+modified by added dropout layers, and the SceneNet
+RGB-D\cite{McCormac2017} data set using the MS COCO\cite{Lin2014}
+classes. Instead of dropout sampling, my approach will use
+an auto-encoder for novelty detection, with all else, like
+using SSD for object detection and the SceneNet RGB-D data set,
+being equal. With respect to auto-encoders, a recent implementation
+of an adversarial auto-encoder\cite{Pidhorskyi2018} will be used.
+
+\paragraph{Hypothesis} Novelty detection using auto-encoders
+delivers similar or better object detection performance under open set
+conditions while being less computationally expensive compared to
+dropout sampling.
+
+\paragraph{Contribution}
+The contribution of this thesis is a comparison between dropout
+sampling and auto-encoding with respect to the overall performance
+of both for object detection under open set conditions using
+the SSD network for object detection and the SceneNet RGB-D data set
+with MS COCO classes.
+
+\chapter{Background and Contribution}
+
+This chapter will provide a more in-depth look at the two works
+this thesis is based upon. First, the dropout sampling introduced
+by Miller et al\cite{Miller2018} will be showcased. Afterwards,
+the Generative Probabilistic Novelty Detection with Adversarial
+Autoencoders\cite{Pidhorskyi2018} will be presented. The chapter
+will conclude with a more detailed explanation of the intended
+contribution of this thesis.
+
+The dropout sampling explanation will follow the paper of Miller et
+al\cite{Miller2018} rather closely, including the formulae used
+in their paper.
+
+\section{Dropout Sampling}
+
+To understand dropout sampling, it is necessary to explain the
+idea of Bayesian neural networks. They place a prior distribution
+over the network weights, for example a Gaussian prior distribution:
+\(\mathbf{W} \sim \mathcal{N}(0, I)\). In this example
+\(\mathbf{W}\) are the weights and \(I\) is the identity covariance
+matrix, meaning that every weight is drawn independently from an
+identical distribution. The
+training of the network determines a plausible set of weights by
+evaluating the posterior (probability output) over the weights given
+the training data: \(p(\mathbf{W}|\mathbf{T})\). However, this
+evaluation cannot be performed in any reasonable
+time. Therefore approximation techniques are
+required. In those techniques the posterior is fitted with a
+simple distribution \(q^{*}_{\theta}(\mathbf{W})\). The original
+and intractable problem of averaging over all weights in the network
+is replaced with an optimisation task, where the optimisation runs
+over the parameters of the simple distribution\cite{Kendall2017}.
+
+\subsubsection*{Dropout Variational Inference}
+
+Kendall and Gal\cite{Kendall2017} showed an approximation for
+classification and recognition tasks. Dropout variational inference
+is a practical approximation technique in which dropout layers are
+added in front of every weight layer and used also during test
+time to sample from the approximate posterior. Effectively, this
+results in the approximation of the class probability
+\(p(y|\mathcal{I}, \mathbf{T})\) by performing multiple forward
+passes through the network and averaging over the obtained softmax
+scores \(\mathbf{s}_i\), given an image \(\mathcal{I}\) and the
+training data \(\mathbf{T}\):
+\begin{equation} \label{eq:drop-sampling}
+p(y|\mathcal{I}, \mathbf{T}) = \int p(y|\mathcal{I}, \mathbf{W}) \cdot p(\mathbf{W}|\mathbf{T})d\mathbf{W} \approx \frac{1}{n} \sum_{i=1}^{n}\mathbf{s}_i
+\end{equation}
+
+With this dropout sampling technique \(n\) model weights
+\(\widetilde{\mathbf{W}}_i\) are sampled from the posterior
+\(p(\mathbf{W}|\mathbf{T})\). The class probability
+\(p(y|\mathcal{I}, \mathbf{T})\) is a probability vector
+\(\mathbf{q}\) over all class labels. Finally, the uncertainty
+of the network with respect to the classification is given by
+the entropy \(H(\mathbf{q}) = - \sum_i q_i \cdot \log q_i\).
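+
+To make this concrete, the following minimal sketch shows how the
+sampling could be implemented with Keras. The model object and the
+number of passes are illustrative assumptions; the essential detail
+is calling the model with \texttt{training=True} so that the dropout
+layers remain active at test time.
+
+\begin{verbatim}
+import numpy as np
+import tensorflow as tf
+
+def dropout_sample(model: tf.keras.Model, image: np.ndarray, n: int = 20):
+    """Approximate p(y|I,T) by averaging n stochastic forward passes."""
+    batch = np.expand_dims(image, axis=0)
+    # training=True keeps the dropout layers active during inference
+    scores = np.stack([model(batch, training=True).numpy()[0]
+                       for _ in range(n)])
+    q = scores.mean(axis=0)                   # averaged softmax scores q
+    entropy = -np.sum(q * np.log(q + 1e-12))  # uncertainty H(q)
+    return q, entropy
+\end{verbatim}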
+
+\subsubsection*{Dropout Sampling for Object Detection}
+
+Miller et al\cite{Miller2018} apply dropout sampling to
+object detection. In that case \(\mathbf{W}\) represents the
+learned weights of a detection network like SSD\cite{Liu2016}.
+Every forward pass uses a different network
+\(\widetilde{\mathbf{W}}\) which is approximately sampled from
+\(p(\mathbf{W}|\mathbf{T})\). Each forward pass in object
+detection results in a set of detections, each consisting of bounding
+box coordinates \(\mathbf{b}\) and softmax score \(\mathbf{s}\).
+The detections are denoted by Miller et al as \(D_i =
+\{\mathbf{s}_i,\mathbf{b}_i\}\). The detections of all passes are put
+into a large set \(\mathfrak{D} = \{D_1, ..., D_n\}\).
+
+All detections with mutual intersection-over-union scores (IoU)
+of \(0.95\) or higher are grouped into an observation \(\mathcal{O}_i\).
+Subsequently, the corresponding vector of class probabilities
+\(\mathbf{q}_i\) for the observation is calculated by averaging all
+score vectors \(\mathbf{s}_j\) in a particular observation
+\(\mathcal{O}_i\): \(\mathbf{q}_i \approx \overline{\mathbf{s}}_i = \frac{1}{n} \sum_{j=1}^{n} \mathbf{s}_j\). The label uncertainty
+of the detector for a particular observation is measured by
+the entropy \(H(\mathbf{q}_i) = - \sum_j q_{ij} \cdot \log q_{ij}\).
+
+In the introduction I gave a very reduced description of
+maximum and low uncertainty. A more complete explanation:
+if \(\mathbf{q}_i\), which I called averaged class probabilities,
+resembles a uniform distribution, the entropy will be high. A uniform
+distribution means that no class is more likely than another, which
+is a perfect example of maximum uncertainty. Conversely, if
+one class has a very high probability, the entropy will be low.
+
+In open set conditions it can be expected that falsely generated
+detections for unknown object classes have a higher label
+uncertainty. A threshold on the entropy \(H(\mathbf{q}_i)\) can then
+be used to identify and reject these false positive cases.
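+
+The following sketch illustrates the grouping step in plain Python.
+The detection format and the \texttt{iou} helper are simplifying
+assumptions made for illustration, and the exact procedure of Miller
+et al may differ in detail; the grouping criterion and the entropy
+follow the description above.
+
+\begin{verbatim}
+import numpy as np
+
+def iou(box_a, box_b):
+    """Intersection over union of two [x1, y1, x2, y2] boxes."""
+    x1, y1 = np.maximum(box_a[:2], box_b[:2])
+    x2, y2 = np.minimum(box_a[2:], box_b[2:])
+    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
+    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
+    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
+    return inter / (area_a + area_b - inter)
+
+def group_observations(detections, threshold=0.95):
+    """Group detections (score vector, box) of all forward passes
+    into observations with mutual IoU >= threshold."""
+    observations = []
+    for score, box in detections:
+        for obs in observations:
+            if all(iou(box, b) >= threshold for _, b in obs):
+                obs.append((score, box))
+                break
+        else:
+            observations.append([(score, box)])
+    return observations
+
+def label_uncertainty(observation):
+    """Entropy of the averaged score vector of one observation."""
+    q = np.mean([s for s, _ in observation], axis=0)
+    return -np.sum(q * np.log(q + 1e-12))
+\end{verbatim}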
+
+\section{Adversarial Auto-encoder}
+
+This section will explain the adversarial auto-encoder used by
+Pidhorskyi et al\cite{Pidhorskyi2018} but in a slightly modified
+form to make it more understandable.
+
+The training data points \(x_i \in \mathbb{R}^m \) are the input
+of the auto-encoder. An encoding function \(e: \mathbb{R}^m \rightarrow \mathbb{R}^n\) takes the data points
+and produces a representation \(\overline{z_i} \in \mathbb{R}^n\)
+in a latent space. This latent space is smaller (\(n < m\)) than the
+input, which necessitates some form of compression.
+
+A second function \(g: \Omega \rightarrow \mathbb{R}^m\) is the
+generator function that takes the latent representation
+\(z_i \in \Omega \subset \mathbb{R}^n\) and generates an output
+\(\overline{x_i}\) as close as possible to the input data
+distribution.
+
+What then is the difference between \(\overline{z_i}\) and \(z_i\)?
+With a simple auto-encoder both would be identical. In the case
+of an adversarial auto-encoder it is slightly more complicated.
+There is a discriminator \(D_z\) that tries to distinguish between
+an encoded data point \(\overline{z_i}\) and a \(z_i \sim \mathcal{N}(0,1)\) drawn from a normal distribution with \(0\) mean
+and a standard deviation of \(1\). During training, the encoding
+function \(e\) attempts to minimize any perceivable difference
+between \(z_i\) and \(\overline{z_i}\) while \(D_z\) has the
+aforementioned adversarial task to differentiate between them.
+
+Furthermore, there is a discriminator \(D_x\) that has the task
+to differentiate the generated output \(\overline{x_i}\) from the
+actual input \(x_i\). During training, the generator function \(g\)
+tries to minimize the perceivable difference between \(\overline{x_i}\) and \(x_i\) while \(D_x\) has the mentioned
+adversarial task to distinguish between them.
+
+With this, all components of the adversarial auto-encoder employed
+by Pidhorskyi et al are introduced. Finally, the losses are
+presented. The two adversarial objectives have been mentioned
+already. Specifically, there is the adversarial loss for the
+discriminator \(D_z\):
+\begin{equation} \label{eq:adv-loss-z}
+ \mathcal{L}_{adv-d_z}(x,e,D_z) = E[\log (D_z(\mathcal{N}(0,1)))] + E[\log (1 - D_z(e(x)))],
+\end{equation}
+\noindent
+where \(E\) stands for an expected
+value\footnote{a term used in probability theory},
+\(x\) stands for the input, and
+\(\mathcal{N}(0,1)\) represents an element drawn from the specified
+distribution. The encoder \(e\) attempts to minimize this loss while
+the discriminator \(D_z\) intends to maximize it.
+
+In the same way the adversarial loss for the discriminator \(D_x\)
+is specified:
+\begin{equation} \label{eq:adv-loss-x}
+ \mathcal{L}_{adv-d_x}(x,D_x,g) = E[\log(D_x(x))] + E[\log(1 - D_x(g(\mathcal{N}(0,1))))],
+\end{equation}
+\noindent
+where \(x\), \(E\), and \(\mathcal{N}(0,1)\) have the same meaning
+as before. In this case the generator \(g\) tries to minimize the loss
+while the discriminator \(D_x\) attempts to maximize it.
+
+Every auto-encoder requires a reconstruction error to work. This
+error calculates the difference between the original input and
+the generated or decoded output. In this case, the reconstruction
+loss is the negative expected log-likelihood of the reconstruction
+given the input:
+\begin{equation} \label{eq:recon-loss}
+ \mathcal{L}_{error}(x, e, g) = - E[\log(p(g(e(x)) | x))],
+\end{equation}
+\noindent
+where \(p(g(e(x)) | x)\) is the likelihood of the reconstruction
+\(g(e(x))\) given the input \(x\), and \(E\), \(e\), and \(g\) have
+the same meaning as before.
+
+All losses combined result in the following formula:
+\begin{equation} \label{eq:full-loss}
+ \mathcal{L}(x,e,D_z,D_x,g) = \mathcal{L}_{adv-d_z}(x,e,D_z) + \mathcal{L}_{adv-d_x}(x,D_x,g) + \lambda \mathcal{L}_{error}(x,e,g),
+\end{equation}
+\noindent
+where \(\lambda\) is a parameter used to balance the adversarial
+losses with the reconstruction loss. The model is trained by
+Pidhorskyi et al using the Adam optimizer by doing alternating
+updates of each of the aforementioned components:
+
+\begin{itemize}
+ \item Maximize \(\mathcal{L}_{adv-d_x}\) by updating weights of \(D_x\);
+ \item Minimize \(\mathcal{L}_{adv-d_x}\) by updating weights of \(g\);
+ \item Maximize \(\mathcal{L}_{adv-d_z}\) by updating weights of \(D_z\);
+ \item Minimize \(\mathcal{L}_{error}\) and \(\mathcal{L}_{adv-d_z}\) by updating weights of \(e\) and \(g\).
+\end{itemize}
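+
+As a rough sketch, one such alternating update could look as follows
+in TensorFlow. The four component networks and their optimizers are
+assumed to exist as Keras models; the binary cross-entropy terms
+correspond to the adversarial losses of equations
+(\ref{eq:adv-loss-z}) and (\ref{eq:adv-loss-x}) in their common
+non-saturating form, and a mean squared error stands in for the
+log-likelihood term of equation (\ref{eq:recon-loss}). This is not
+the exact implementation of Pidhorskyi et al.
+
+\begin{verbatim}
+import tensorflow as tf
+
+bce = tf.keras.losses.BinaryCrossentropy()
+
+def train_step(x, e, g, d_z, d_x, opts, latent_dim, lam=10.0):
+    opt_dx, opt_g, opt_dz, opt_eg = opts
+    n = tf.shape(x)[0]
+    ones, zeros = tf.ones((n, 1)), tf.zeros((n, 1))
+
+    # 1) maximize L_adv-d_x: update D_x to tell real from generated
+    with tf.GradientTape() as tape:
+        x_fake = g(tf.random.normal((n, latent_dim)))
+        loss = bce(ones, d_x(x)) + bce(zeros, d_x(x_fake))
+    grads = tape.gradient(loss, d_x.trainable_variables)
+    opt_dx.apply_gradients(zip(grads, d_x.trainable_variables))
+
+    # 2) minimize L_adv-d_x: update g to fool D_x
+    with tf.GradientTape() as tape:
+        loss = bce(ones, d_x(g(tf.random.normal((n, latent_dim)))))
+    grads = tape.gradient(loss, g.trainable_variables)
+    opt_g.apply_gradients(zip(grads, g.trainable_variables))
+
+    # 3) maximize L_adv-d_z: update D_z to tell prior from encodings
+    with tf.GradientTape() as tape:
+        z_prior = tf.random.normal((n, latent_dim))
+        loss = bce(ones, d_z(z_prior)) + bce(zeros, d_z(e(x)))
+    grads = tape.gradient(loss, d_z.trainable_variables)
+    opt_dz.apply_gradients(zip(grads, d_z.trainable_variables))
+
+    # 4) minimize L_error and L_adv-d_z: update e and g together
+    with tf.GradientTape() as tape:
+        z = e(x)
+        loss = bce(ones, d_z(z)) \
+             + lam * tf.reduce_mean(tf.square(g(z) - x))
+    variables = e.trainable_variables + g.trainable_variables
+    grads = tape.gradient(loss, variables)
+    opt_eg.apply_gradients(zip(grads, variables))
+\end{verbatim}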
+
+Practically, the auto-encoder is trained separately for every
+object class that is considered ``known''. Pidhorskyi et al trained
+it on the MNIST\cite{Lecun1998} data set, once for every digit.
+
+For this thesis it needs to be trained on the SceneNet RGB-D
+data set using MS COCO classes as known classes. As all known
+classes are present in every test epoch, it becomes
+non-trivial which of the trained auto-encoders should be used to
+calculate novelty. To phrase it differently, a true positive
+detection is possible for multiple classes in the same image.
+If, for example, one object is classified correctly by SSD as a chair,
+the novelty score should be low. But the auto-encoders of all
+known classes except the ``chair'' class will ideally give a high
+novelty score. Which of the values should be used? The only sensible
+solution is to run a detection only through the auto-encoder that was
+trained for the class the SSD model predicted. This provides the
+following scenarios:
+\begin{itemize}
+ \item true positive classification: novelty score should be low
+ \item false positive classification and correct class is
+ among the known classes: novelty score should be high
+ \item false positive classification and correct class is unknown:
+ novelty score should be high
+\end{itemize}
+\noindent
+Negative classifications are not listed as these are not part
+of the output of the SSD and cannot be given to the auto-encoder
+as input. Furthermore, the second case should not happen because
+the trained SSD knows this other class and is very likely
+to give it a higher probability. Therefore, using only one
+auto-encoder fulfils the task of differentiating between
+known and unknown classes.
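+
+A minimal sketch of this routing logic, assuming a hypothetical
+dictionary of per-class auto-encoders and a \texttt{novelty\_score}
+function as described in the next section:
+
+\begin{verbatim}
+def detection_novelty(detection, autoencoders, novelty_score):
+    """Route a detection through the auto-encoder of its
+    predicted class.
+
+    autoencoders: dict mapping class label -> trained
+    auto-encoder (one per known class, as described above)
+    """
+    predicted_class = detection["label"]  # winning class from SSD
+    autoencoder = autoencoders[predicted_class]
+    return novelty_score(autoencoder, detection["image_patch"])
+\end{verbatim}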
+
+\section{Generative Probabilistic Novelty Detection}
+
+It is still unclear how the novelty score is calculated.
+This section will clear this up in terms as understandable as
+possible. However, the name ``Generative Probabilistic
+Novelty Detection''\cite{Pidhorskyi2018} already signals that
+probability theory plays a part in it. Furthermore, this
+section will make use of some mathematical terms which cannot
+be explained in great detail here. Moreover, the previous section
+already introduced many required components, which will not be
+explained here again.
+
+For the purpose of this explanation a trained auto-encoder
+is assumed. In that case the generator function describes
+the model that the auto-encoder is actually using for the
+novelty detection. The task of training is to make sure this
+model comes as close as possible to the real model of the
+training or testing data. The model of the auto-encoder
+is in mathematical terms a parameterized manifold
+\(\mathcal{M} \equiv g(\Omega)\) of dimension \(n\).
+The set of training or testing data can then be described
+in the following way:
+\begin{equation} \label{eq:train-set}
+ x_i = g(z_i) + \xi_i \quad i \in \mathbb{N},
+\end{equation}
+\noindent
+where \(\xi_i\) represents noise. It may be confusing, but
+for the purpose of this novelty test the ``truth'' is what
+the generator function generates from a set of \(z_i \in \Omega\),
+not the ground truth from the data set. Furthermore,
+the previously introduced encoder function \(e\) is assumed
+to work as an exact inverse of \(g\) for every \(x \in \mathcal{M}\).
+For such \(x\) it follows that \(x = g(e(x))\).
+
+Let \(\overline{x} \in \mathbb{R}^m\) be a data point from the test
+data. The remainder of the section will explain how the novelty
+test is performed for this \(\overline{x}\). It is important
+to note that this data point is not necessarily part of the
+auto-encoder model. Therefore, \(g(e(\overline{x})) = \overline{x}\)
+cannot be assumed. However, it can be observed that \(\overline{x}\)
+can be non-linearly projected onto
+\(\overline{x}^{\|} \in \mathcal{M}\)
+by using \(g(\overline{z})\) with \(\overline{z} = e(\overline{x})\).
+It is assumed that \(g\) is smooth enough to perform a linearization
+based on the first-order Taylor expansion:
+\begin{equation} \label{eq:taylor-expanse}
+ g(z) = g(\overline{z}) + J_g(\overline{z}) (z - \overline{z}) + \mathcal{O}(\| z - \overline{z} \|^2),
+\end{equation}
+\noindent
+where \(J_g(\overline{z})\) is the Jacobian matrix of \(g\) computed
+at \(\overline{z}\). It is assumed that the Jacobian matrix of \(g\)
+has full rank at every point of the manifold. A Jacobian matrix
+contains all first-order partial derivatives of a function.
+\(\| \cdot \|\) is the \(L_2\) norm, which calculates the
+length of a vector as the square root of the sum of the
+squares of its components. Lastly, \(\mathcal{O}\)
+is the Big-O notation, which here denotes the remainder of the
+Taylor expansion. The remainder is of quadratic order in
+\(\| z - \overline{z} \|\) and can therefore be neglected for
+\(z\) close to \(\overline{z}\).
+
+Next, the tangent space of \(g\) at \(\overline{x}^{\|}\), which
+is spanned by the \(n\) independent column vectors of the Jacobian
+matrix \(J_g(\overline{z})\), is defined as
+\(\mathcal{T} = \text{span}(J_g(\overline{z}))\). The tangent space
+at a point of the manifold contains all vectors that are tangential
+to the manifold at this point. The Jacobian matrix can be decomposed
+into three
+matrices using singular value decomposition: \(J_g(\overline{z}) = U^{\|}SV^{*}\). \(\mathcal{T}\) is defined to also be spanned
+by the column vectors of \(U^{\|}\): \(\mathcal{T} = \text{span}(U^{\|})\). \(U^{\|}\) contains the left-singular vectors
+and \(V^{*}\) is the conjugate transpose of the matrix
+\(V\), which contains the right-singular vectors. \(U^{\bot}\) is
+defined in such a way that \(U = [U^{\|}U^{\bot}]\) is a unitary
+matrix. \(\mathcal{T^{\bot}}\) is the orthogonal complement of
+\(\mathcal{T}\). With this preparation, \(\overline{x}\) can be
+represented with respect to the local coordinates that define
+\(\mathcal{T}\) and \(\mathcal{T}^{\bot}\).
+This representation
+can be achieved by computing
+\begin{equation} \label{eq:w-definition}
+ \overline{w} = U^{\top} \overline{x} = \left[\begin{matrix}
+ U^{\|^{\top}} \overline{x} \\
+ U^{\bot^{\top}} \overline{x}
+ \end{matrix}\right] = \left[\begin{matrix}
+ \overline{w}^{\|} \\
+ \overline{w}^{\bot}
+ \end{matrix}\right],
+\end{equation}
+\noindent
+where the rotated coordinates \(\overline{w}\) (the data point
+expressed in the coordinate system given by the columns of \(U\))
+are decomposed into \(\overline{w}^{\|}\), which
+are parallel to \(\mathcal{T}\), and \(\overline{w}^{\bot}\), which
+are orthogonal to \(\mathcal{T}\).
+
+The last step to define the novelty test involves probability
+density functions (PDFs), which are now introduced. The PDF \(p_X(x)\)
+describes the random variable \(X\), from which the training and
+testing data points are drawn. In addition, \(p_W(w)\) is the
+probability density function of the random variable \(W\),
+which represents \(X\) after changing the coordinates. Both
+describe the same distribution, since the change of coordinates
+is merely a rotation. But it is assumed that the coordinates
+\(W^{\|}\), which are parallel to \(\mathcal{T}\), and the coordinates
+\(W^{\bot}\), which are orthogonal to \(\mathcal{T}\), are
+statistically independent. With this assumption the following holds:
+\begin{equation} \label{eq:pdf-x}
+ p_X(x) = p_W(w) = p_W(w^{\|}, w^{\bot}) = p_{W^{\|}}(w^{\|}) p_{W^{\bot}}(w^{\bot})
+\end{equation}
+The previously introduced noise comes into play again. In formula
+(\ref{eq:train-set}) it is assumed that the noise \(\xi\)
+predominantly moves the point \(x\) away from the manifold
+\(\mathcal{M}\) in a direction orthogonal to \(\mathcal{T}\).
+As a consequence \(W^{\bot}\) is mainly responsible for the noise
+effects. Since noise and drawing from the manifold are statistically
+independent, \(W^{\|}\) and \(W^{\bot}\) are also independent.
+
+Finally, referring back to the data point \(\overline{x}\), the
+novelty test is defined like this:
+\begin{equation} \label{eq:novelty-test}
+ p_X(\overline{x}) = p_{W^{\|}}(\overline{w}^{\|})p_{W^{\bot}}(\overline{w}^{\bot}) =
+ \begin{cases}
+ \geq \gamma & \Longrightarrow \text{Inlier} \\
+ < \gamma & \Longrightarrow \text{Outlier}
+ \end{cases}
+\end{equation}
+\noindent
+where \(\gamma\) is a suitable threshold.
+
+At this point it is very clear that understanding the novelty test
+of the GPND approach requires far more mathematical background than
+dropout sampling. Nonetheless, it could be the better method.
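+
+The following sketch condenses the novelty test into runnable form.
+The callables \texttt{encode}, \texttt{jacobian\_g},
+\texttt{p\_parallel}, and \texttt{p\_orthogonal} are assumptions: they
+stand for the encoder \(e\), the Jacobian of \(g\) at a latent point,
+and density estimates for the parallel and orthogonal coordinates,
+which Pidhorskyi et al fit to the training data (the details of that
+fitting are omitted here).
+
+\begin{verbatim}
+import numpy as np
+
+def novelty_test(x_bar, encode, jacobian_g,
+                 p_parallel, p_orthogonal, gamma):
+    """Schematic GPND novelty test for one flattened data point."""
+    z_bar = encode(x_bar)            # project into the latent space
+    J = jacobian_g(z_bar)            # m x n Jacobian matrix of g
+    n = J.shape[1]
+    U, _, _ = np.linalg.svd(J, full_matrices=True)
+    w = U.T @ x_bar                  # rotated coordinates w
+    w_par, w_ort = w[:n], w[n:]      # parallel / orthogonal split
+    p_x = p_parallel(w_par) * p_orthogonal(w_ort)
+    return "inlier" if p_x >= gamma else "outlier"
+\end{verbatim}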
+
+\section{Contribution}
+
+This section will outline what exactly the scientific as well as
+technical contribution of this thesis will be.
+
+\subsection*{Scientific Contribution}
+
+Miller et al\cite{Miller2018} use the SSD\cite{Liu2016} network
+extended with dropout layers and run multiple forward passes
+during the testing phase for every image. Considering the number
+of images in the SceneNet RGB-D\cite{McCormac2017} data set, these
+forward passes will take considerable time. It could be faster
+to only run one forward pass and then use the auto-encoder for
+novelty detection. However, the auto-encoder can only work
+with one detection at a time and must be called for every
+detection of the object detector separately. Therefore,
+it is interesting to investigate whether the second approach
+is indeed faster than the first.
+
+Dropout sampling uses the entropy to identify false positive
+cases. Such identified detections are discarded, which allows for
+a better object detection performance. The GPND approach uses
+the auto-encoder losses and results to identify novel cases and
+therefore mark detections as false positive. Subsequently, these
+detections can be discarded as well. The effectiveness of both
+approaches can be compared by looking at the object detection
+performance after the identified false positive cases have been
+discarded. It is interesting to research whether the GPND approach
+results in better object detection performance than dropout
+sampling provides.
+
+The formulated hypothesis, which is repeated after this paragraph,
+combines both aspects and requires a similar or better result in
+both of them. As a consequence it will be falsified if
+the computational performance of the GPND approach is not better than
+that of dropout sampling or if the object detection performance
+is worse.
+
+\paragraph{Hypothesis} Novelty detection using auto-encoders
+delivers similar or better object detection performance under open set
+conditions while being less computationally expensive compared to
+dropout sampling.\\
+
+There are three possible scenarios that can be the result of
+the thesis:
+\begin{itemize}
+ \item the hypothesis is confirmed: win-win situation where
+ switching to GPND is straightforward.
+ \item one of the conditions fails: win-lose situation where
+ it is a trade-off between object detection performance and
+ computational performance. One approach will be better in
+ one aspect and the other approach in the other.
+ \item both conditions fail: lose-lose situation where
+ dropout sampling is better in both aspects.
+\end{itemize}
+
+Summarising, the scientific contribution is a comparison between
+dropout sampling and GPND with respect to both object detection
+performance and computational performance under open set conditions
+using the SceneNet RGB-D data set with the MS COCO classes as
+``known'' object classes.
+
+The computational performance is measured by the time in milliseconds
+every test run takes. The absolute numbers are not of interest,
+as they vary from machine to machine and are influenced by a
+plethora of uncontrollable factors; what matters is the relative
+difference between both approaches and whether that difference is
+significant.
+Object detection performance is measured by precision, recall,
+F1-score, and an open set error. While the first three metrics are
+standard, the last is adapted from Miller et al. It is defined
+as the number of observations (for dropout sampling) or detections
+(for GPND) that pass the respective false positive test (entropy or
+novelty), fall on unknown objects (there are no overlapping ground
+truth objects with IoU \(\geq 0.5\) and a known true class label),
+and do not have a winning class label of ``unknown''.
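+
+A sketch of how the open set error could be counted; the detection
+format and the \texttt{iou} helper are assumptions made for
+illustration:
+
+\begin{verbatim}
+def open_set_error(detections, ground_truth, iou):
+    """Count detections that pass the false positive test, fall on
+    unknown objects, and still claim a known class."""
+    errors = 0
+    for det in detections:
+        if not det["passes_fp_test"]:  # rejected by entropy/novelty
+            continue
+        on_known = any(iou(det["box"], gt["box"]) >= 0.5
+                       and gt["class_is_known"]
+                       for gt in ground_truth)
+        if not on_known and det["label"] != "unknown":
+            errors += 1
+    return errors
+\end{verbatim}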
+
+\subsection*{Technical Contribution}
+
+Technical contribution includes all contributions
+that are not necessarily new in the scientific sense but are
+meaningful engineering contributions in themselves.
+
+There is no available source code for the work of
+Miller et al\cite{Miller2018}, which necessitates my own
+re-implementation of their work. The contribution is the fine-tuning
+of an SSD model pre-trained on ImageNet\cite{Deng2009}, extended by
+dropout layers, to the SceneNet RGB-D data set using MS COCO classes
+as the known classes for SSD.
+As MS COCO classes are more general than SceneNet RGB-D classes, this
+also requires a mapping from one set of classes to the other.
+This entire contribution is technical and only re-implements
+what Miller et al have already done. It is expected that the
+evaluation of the results using this self-trained model will
+reproduce the results of Miller et al.
+
+For GPND, source code is available, but only for MNIST and written
+in PyTorch. Therefore, the source code has to be ported from
+PyTorch to TensorFlow. Furthermore, it must be made compatible
+with SceneNet RGB-D, as the architecture is tailored to MNIST.
+The mapping from SceneNet RGB-D to MS COCO applies here as well and
+can therefore be considered a separate contribution. A fine-tuned
+SSD is also required, but this time without added dropout layers.
+Additionally, it is necessary to train the auto-encoder for every
+known class separately.
+
+To summarise it in a list, the following separate deliverables
+are contributed:
+
+\begin{itemize}
+ \item source code for dropout sampling compatible with TensorFlow
+ \item source code for GPND compatible with TensorFlow
+ \item mapping from SceneNet RGB-D classes to MS COCO classes
+ \item vanilla SSD model fine-tuned on SceneNet RGB-D
+ \item dropout SSD model fine-tuned on SceneNet RGB-D
+ \item auto-encoder model trained separately on every MS COCO class
+\end{itemize}
diff --git a/thesis.tex b/thesis.tex
new file mode 100644
index 0000000..43bb1cf
--- /dev/null
+++ b/thesis.tex
@@ -0,0 +1,45 @@
+% main thesis file
+
+% first: let's determine the class we want to use
+\documentclass[12pt]{masterthesis} % we use our custom masterthesis class
+\KOMAoption{draft}{false} % set for "draft" mode, disable later
+
+% define "variables"
+\title{Novelty detection through auto-encoding for object detection in open set conditions}
+\author{Jim Martens}
+\date{\today}
+\subject{} % value for pdfsubject
+\newcommand{\keywords}{CV} % value for pdfkeywords
+\newcommand{\studiengang}{\IfLanguageName{british}{Computer Science M.Sc.}{Master Informatik}}
+\newcommand{\faculty}{\IfLanguageName{british}{MIN Faculty}{MIN-Fakult\"{a}t}}
+\newcommand{\department}{\IfLanguageName{british}{Department of Computer Science}{Fachbereich Informatik}}
+% define here to prevent building errors
+\newcommand{\professor}{}
+\newcommand{\matrikelnummer}{}
+\newcommand{\firstReviewer}{}
+\newcommand{\secondReviewer}{}
+
+% load protected variables
+\IfFileExists{./private/variables.tex}{%
+  \input{./private/variables.tex}
+}{}
+
+% use custom package to prevent spamming the preamble
+\usepackage[licence]{masterthesis}
+
+% specify image location
+\graphicspath{{./images/}{./private/images/}}
+
+% specify bib resource
+\addbibresource{ma.bib}
+
+% begin document
+\begin{document}
+% invoke start command(s) from masterthesis package
+\start
+
+\input{body.tex}
+
+% invoke finish command(s) from masterthesis package
+\finish
+\end{document}