% body thesis file that contains the actual content

\chapter{Introduction}

\subsection*{Motivation}

Famous examples like the automatic soap dispenser which does not recognize the hand of a black person but dispenses soap when presented with a paper towel raise the question of bias in computer systems~\cite{Friedman1996}. Related to this ethical question regarding the design of so-called algorithms is the question of algorithmic accountability~\cite{Diakopoulos2014}.

Supervised neural networks learn from input-output relations and figure out by themselves what connections are necessary for that. This feature is also their Achilles heel: it makes them effectively black boxes and prevents any answers to questions of causality.

However, these questions of causality are of enormous consequence when results of neural networks are used to make life-changing decisions: Is a correlation enough to bring forth negative consequences for a particular person? And if so, what is the possible defence against math? Similar questions can be raised when looking at computer vision networks that might be used together with so-called smart CCTV cameras to discover suspicious activity.

This leads to the need for neural networks to explain their results. Such an explanation must come from the network or an attached piece of technology to allow mass adoption. This setting poses the question of how such an endeavour can be achieved.

For neural networks there are fundamentally two types of tasks: regression and classification. Regression deals with any case where the goal for the network is to come close to an ideal function that connects all data points. Classification, however, describes tasks where the network is supposed to identify the class of any given input. In this thesis, I will focus on classification.

\subsection*{Object Detection in Open Set Conditions}

More specifically, I will look at object detection in open set conditions.
In non-technical words this effectively describes the kind of situation you encounter with CCTV cameras or robots outside of a laboratory. Both use cameras that record images. Subsequently a neural network analyses the image and returns a list of detected and classified objects that it found in the image. The problem here is that networks can only classify what they know. If presented with an object type that the network was not trained with, as happens frequently in real environments, it will still classify the object and might even have a high confidence in doing so. Such a detection is a false positive. Any ordinary person who uses the results of such a network would falsely assume that a high confidence always means the classification is very likely correct. If they use a proprietary system they might not even be able to find out that the network was never trained on a particular type of object. Therefore it would be impossible for them to identify the output of the network as a false positive.

This goes back to the need for automatic explanation. Such a system should by itself recognize that the given object is unknown and hence mark any classification result of the network as meaningless. Technically there are two slightly different approaches that deal with this type of task: model uncertainty and novelty detection.

Model uncertainty can be measured with dropout sampling. Dropout is usually used only during training, but Miller et al.~\cite{Miller2018} also use it during testing to obtain different results for the same image over multiple forward passes. The output scores of the forward passes for the same image are then averaged. If the averaged class probabilities resemble a uniform distribution (every class has the same probability), this indicates maximum uncertainty. Conversely, if there is one very high probability with every other being very low, this signifies low uncertainty.
An unknown object is more likely to cause high uncertainty, which allows for the identification of false positive cases.

Novelty detection is the more direct approach to solve the task. In the realm of neural networks it is usually done with the help of auto-encoders that essentially solve a regression task of finding an identity function that reconstructs the given input at the output~\cite{Pimentel2014}. Internally, auto-encoders have at least two components: an encoder, and a decoder or generator. The job of the encoder is to find an encoding that compresses the input as much as possible while simultaneously losing as little information as possible. The decoder takes this latent representation of the input and has to find a decompression that reconstructs the input as accurately as possible. During training these auto-encoders learn to reproduce a certain group of object classes. The actual novelty detection takes place during testing: Given an image, and the output and loss of the auto-encoder, a novelty score is calculated. A low novelty score signals a known object. The opposite is true for a high novelty score.

\subsection*{Research Question}

Each of the presented approaches describes one way to solve the aforementioned problem of explanation. They can be differentiated by measuring their performance: the best theoretical idea is useless if it does not perform well. Miller et al. have shown some success in using dropout sampling. However, the many forward passes during testing for every image seem computationally expensive. In comparison, a single run through a trained auto-encoder intuitively seems to be faster. This leads to the hypothesis (see below).

For the purpose of this thesis, I will use the work of Miller et al. as the baseline to compare against. They use the SSD~\cite{Liu2016} network for object detection, modified by added dropout layers, and the SceneNet RGB-D~\cite{McCormac2017} data set using the MS COCO~\cite{Lin2014} classes.
I will use a simple implementation of an auto-encoder and novelty detection to compare with the work of Miller et al. SSD for the object detection and SceneNet RGB-D as the data set are used for both approaches.

\paragraph{Hypothesis} Novelty detection using auto-encoders delivers similar or better object detection performance under open set conditions while being less computationally expensive compared to dropout sampling.

\paragraph{Contribution} The contribution of this thesis is a comparison between dropout sampling and auto-encoding with respect to the overall performance of both for object detection in open set conditions using the SSD network for object detection and the SceneNet RGB-D data set with MS COCO classes.

\chapter{Background and Contribution}

This chapter will begin with an overview of previous work in the field of this thesis. Afterwards the theoretical foundations of the work of Miller et al.~\cite{Miller2018} and auto-encoders will be explained. The chapter concludes with more details about the research question and the intended contribution of this thesis.

For both background sections the notation defined in Table~\ref{tab:notation} will be used.

\section{Related Works}

Novelty detection for object detection is intricately linked with open set conditions: the test data can contain unknown classes. Bishop~\cite{Bishop1994} investigates the correlation between the degree of novel input data and the reliability of network outputs. Pimentel et al.~\cite{Pimentel2014} provide a review of novelty detection methods published over the previous decade.

There are two primary pathways that deal with novelty: novelty detection using auto-encoders and uncertainty estimation with Bayesian networks.

Japkowicz et al.~\cite{Japkowicz1995} introduce a novelty detection method based on the hippocampus model of Gluck and Myers~\cite{Gluck1993} and use an auto-encoder to recognize novel instances.
Thompson et al.~\cite{Thompson2002} show that auto-encoders can learn ``normal'' system behaviour implicitly. Goodfellow et al.~\cite{Goodfellow2014} introduce adversarial networks: a generator that attempts to trick the discriminator by generating samples indistinguishable from the real data. Makhzani et al.~\cite{Makhzani2015} build on the work of Goodfellow and propose adversarial auto-encoders. Richter and Roy~\cite{Richter2017} use an auto-encoder to detect novelty. Wang et al.~\cite{Wang2018} build upon Goodfellow's work and use a generative adversarial network for novelty detection. Sabokrou et al.~\cite{Sabokrou2018} implement an end-to-end architecture for one-class classification: it consists of two deep networks, with one being the novelty detector and the other enhancing inliers and distorting outliers. Pidhorskyi et al.~\cite{Pidhorskyi2018} take a probabilistic approach and compute how likely it is that a sample is generated by the inlier distribution.

Kendall and Gal~\cite{Kendall2017} provide a Bayesian deep learning framework that combines input-dependent aleatoric\footnote{captures noise inherent in observations} uncertainty with epistemic\footnote{uncertainty in the model} uncertainty. Lakshminarayanan et al.~\cite{Lakshminarayanan2017} implement a predictive uncertainty estimation using deep ensembles rather than Bayesian networks. Geifman et al.~\cite{Geifman2018} introduce an uncertainty estimation algorithm for non-Bayesian deep neural classification that estimates the uncertainty of highly confident points using earlier snapshots of the trained model. Miller et al.~\cite{Miller2018a} compare merging strategies for sampling-based uncertainty techniques in object detection. Sensoy et al.~\cite{Sensoy2018} treat prediction confidence as subjective opinions: they place a Dirichlet distribution on it. The trained predictor for a multi-class classification is also a Dirichlet distribution.
Gal and Ghahramani~\cite{Gal2016} show how dropout can be used as a Bayesian approximation. Miller et al.~\cite{Miller2018} build upon the work of Miller et al.~\cite{Miller2018a} and Gal and Ghahramani: they use dropout sampling under open-set conditions for object detection. Mukhoti and Gal~\cite{Mukhoti2018} contribute metrics to measure uncertainty for semantic segmentation. Wu et al.~\cite{Wu2019} introduce two innovations that turn variational Bayes into a robust tool for Bayesian networks: they introduce a novel deterministic method to approximate moments in neural networks which eliminates gradient variance, and they introduce a hierarchical prior for parameters and an Empirical Bayes procedure to select prior variances.

\section{Background for Bayesian SSD}

\begin{table}
\centering
\caption{Notation for background sections}
\label{tab:notation}
\begin{tabular}{l|l}
symbol & meaning \\
\hline
\(\mathbf{W}\) & weights \\
\(\mathbf{T}\) & training data \\
\(\mathcal{N}(0, I)\) & Gaussian distribution \\
\(I\) & independent and identical distribution \\
\(p(\mathbf{W}|\mathbf{T})\) & probability of weights given training data \\
\(\mathcal{I}\) & an image \\
\(\mathbf{q} = p(y|\mathcal{I}, \mathbf{T})\) & probability of all classes given image and training data \\
\(H(\mathbf{q})\) & entropy over probability vector \\
\(\widetilde{\mathbf{W}}\) & weights sampled from \(p(\mathbf{W}|\mathbf{T})\) \\
\(\mathbf{b}\) & bounding box coordinates \\
\(\mathbf{s}\) & softmax scores \\
\(\overline{\mathbf{s}}\) & averaged softmax scores \\
\(D\) & detections of one forward pass \\
\(\mathfrak{D}\) & set of all detections over multiple forward passes \\
\(\mathcal{O}\) & observation \\
\(\overline{\mathbf{q}}\) & probability vector for observation \\
\(\overline{\mathbf{z}}, \mathbf{z}\) & latent space representation \\
\(d_T, d_z\) & discriminators \\
\(e, g\) & encoding and decoding/generating function \\
\(J_g\) & Jacobi matrix for generating function \\
\(\mathcal{T}\) & tangent space \\
\(\mathbf{R}\) & training/test data changed to be on tangent space
\end{tabular}
\end{table}

To understand dropout sampling, it is necessary to explain the idea of Bayesian neural networks. They place a prior distribution over the network weights, for example a Gaussian prior distribution: \(\mathbf{W} \sim \mathcal{N}(0, I)\). In this example \(\mathbf{W}\) are the weights and \(I\) is the identity matrix used as covariance, which means that the weights are independent and identically distributed. The training of the network determines a plausible set of weights by evaluating the posterior distribution over the weights given the training data: \(p(\mathbf{W}|\mathbf{T})\). However, this evaluation cannot be performed in any reasonable time. Therefore approximation techniques are required. In those techniques the posterior is fitted with a simple distribution \(q^{*}_{\theta}(\mathbf{W})\). The original and intractable problem of averaging over all weights in the network is replaced with an optimisation task, in which the parameters \(\theta\) of the simple distribution are optimised~\cite{Kendall2017}.

\subsubsection*{Dropout Variational Inference}

Kendall and Gal~\cite{Kendall2017} showed an approximation for classification and recognition tasks. Dropout variational inference is a practical approximation technique: dropout layers are added in front of every weight layer and are also used at test time to sample from the approximate posterior.
Effectively, this results in the approximation of the class probability \(p(y|\mathcal{I}, \mathbf{T})\) by performing multiple forward passes through the network and averaging over the obtained softmax scores \(\mathbf{s}_i\), given an image \(\mathcal{I}\) and the training data \(\mathbf{T}\):
\begin{equation} \label{eq:drop-sampling}
p(y|\mathcal{I}, \mathbf{T}) = \int p(y|\mathcal{I}, \mathbf{W}) \cdot p(\mathbf{W}|\mathbf{T}) \, d\mathbf{W} \approx \frac{1}{n} \sum_{i=1}^{n}\mathbf{s}_i
\end{equation}
With this dropout sampling technique \(n\) model weights \(\widetilde{\mathbf{W}}_i\) are sampled from the posterior \(p(\mathbf{W}|\mathbf{T})\). The class probability \(p(y|\mathcal{I}, \mathbf{T})\) is a probability vector \(\mathbf{q}\) over all class labels. Finally, the uncertainty of the network with respect to the classification is given by the entropy \(H(\mathbf{q}) = - \sum_i q_i \cdot \log q_i\).

\subsubsection*{Dropout Sampling for Object Detection}

Miller et al.~\cite{Miller2018} apply dropout sampling to object detection. In that case \(\mathbf{W}\) represents the learned weights of a detection network like SSD~\cite{Liu2016}. Every forward pass uses a different network \(\widetilde{\mathbf{W}}\) which is approximately sampled from \(p(\mathbf{W}|\mathbf{T})\). Each forward pass in object detection results in a set of detections, each consisting of bounding box coordinates \(\mathbf{b}\) and softmax score \(\mathbf{s}\). The detections are denoted by Miller et al. as \(D_i = \{\mathbf{s}_i,\mathbf{b}_i\}\). The detections of all passes are put into a large set \(\mathfrak{D} = \{D_1, \dots, D_n\}\).

All detections with mutual intersection-over-union scores (IoU) of \(0.95\) or higher are defined as an observation \(\mathcal{O}_i\).
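
For illustration, the IoU of two boxes \(A\) and \(B\) is the area of their intersection divided by the area of their union. The following worked example uses made-up box sizes that are not taken from Miller et al.; it is only meant to show how strict a threshold of \(0.95\) is. Assume two axis-aligned boxes of \(100 \times 100\) pixels each, where the second box is shifted horizontally by \(k\) pixels (\(0 \leq k \leq 100\)):
\[
\text{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{(100 - k) \cdot 100}{2 \cdot 100^2 - (100 - k) \cdot 100} = \frac{100 - k}{100 + k}.
\]
For \(k = 1\) this gives \(99/101 \approx 0.980\), so the two detections could be grouped into the same observation. For \(k = 5\) it gives \(95/105 \approx 0.905\), which is already below the threshold.
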
Subsequently, the corresponding vector of class probabilities \(\overline{\mathbf{q}}_i\) for the observation is calculated by averaging all score vectors \(\mathbf{s}_j\) in a particular observation \(\mathcal{O}_i\): \(\overline{\mathbf{q}}_i \approx \overline{\mathbf{s}}_i = \frac{1}{n} \sum_{j=1}^{n} \mathbf{s}_j\). The label uncertainty of the detector for a particular observation is measured by the entropy \(H(\overline{\mathbf{q}}_i) = - \sum_j \overline{q}_{ij} \cdot \log \overline{q}_{ij}\).

The introduction described maximum and low uncertainty only in a reduced form. More completely: if \(\overline{\mathbf{q}}_i\), which I called averaged class probabilities, resembles a uniform distribution, the entropy will be high. A uniform distribution means that no class is more likely than another, which is a perfect example of maximum uncertainty. Conversely, if one class has a very high probability, the entropy will be low. With three classes and the natural logarithm, for example, the uniform vector \((1/3, 1/3, 1/3)\) has an entropy of \(\log 3 \approx 1.10\), whereas \((0.98, 0.01, 0.01)\) has an entropy of about \(0.11\).

In open set conditions it can be expected that falsely generated detections for unknown object classes have a higher label uncertainty. A threshold on the entropy \(H(\overline{\mathbf{q}}_i)\) can then be used to identify and reject these false positive cases.

\section{Adversarial Auto-encoder}

This section will explain the adversarial auto-encoder used by Pidhorskyi et al.~\cite{Pidhorskyi2018} but in a slightly modified form to make it more understandable.

The training data \(\mathbf{T} \in \mathbb{R}^m\) is the input of the auto-encoder. An encoding function \(e: \mathbb{R}^m \rightarrow \mathbb{R}^n\) takes the data and produces a representation \(\overline{\mathbf{z}} \in \mathbb{R}^n\) in a latent space. This latent space is smaller (\(n < m\)) than the input, which necessitates some form of compression.
A second function \(g: \Omega \rightarrow \mathbb{R}^m\) is the generator function that takes the latent representation \(\mathbf{z} \in \Omega \subset \mathbb{R}^n\) and generates an output \(\overline{\mathbf{T}}\) as close as possible to the input data distribution.

What then is the difference between \(\overline{\mathbf{z}}\) and \(\mathbf{z}\)? With a simple auto-encoder both would be identical. In this case of an adversarial auto-encoder it is slightly more complicated. There is a discriminator \(d_z\) that tries to distinguish between an encoded data point \(\overline{\mathbf{z}}\) and a \(\mathbf{z} \sim \mathcal{N}(0,1)\) drawn from a normal distribution with \(0\) mean and a standard deviation of \(1\). During training, the encoding function \(e\) attempts to minimize any perceivable difference between \(\mathbf{z}\) and \(\overline{\mathbf{z}}\) while \(d_z\) has the aforementioned adversarial task of differentiating between them.

Furthermore, there is a discriminator \(d_T\) that has the task of differentiating the generated output \(\overline{\mathbf{T}}\) from the actual input \(\mathbf{T}\). During training, the generator function \(g\) tries to minimize the perceivable difference between \(\overline{\mathbf{T}}\) and \(\mathbf{T}\) while \(d_T\) has the mentioned adversarial task of distinguishing between them.

With this, all components of the adversarial auto-encoder employed by Pidhorskyi et al. are introduced. Finally, the losses are presented. The two adversarial objectives have been mentioned already. Specifically, there is the adversarial loss for the discriminator \(d_z\):
\begin{equation} \label{eq:adv-loss-z}
\mathcal{L}_{adv-d_z}(\mathbf{T},e,d_z) = E[\log (d_z(\mathcal{N}(0,1)))] + E[\log (1 - d_z(e(\mathbf{T})))],
\end{equation}
\noindent where \(E\) stands for an expected value\footnote{a term used in probability theory}, \(\mathbf{T}\) stands for the input, and \(\mathcal{N}(0,1)\) represents an element drawn from the specified distribution.
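
To get a feeling for the range of this loss, consider two hypothetical situations. The numbers are purely illustrative and not results from Pidhorskyi et al.; the example assumes that \(d_z\) returns values in \((0,1)\), that it returns the same value for every sample of one kind (so each expectation equals that value's logarithm), and that \(\log\) is the natural logarithm. If \(d_z\) cannot tell prior samples and encoded data apart and returns \(0.5\) for both, then
\[
\log(0.5) + \log(1 - 0.5) = 2 \log(0.5) \approx -1.386 .
\]
If \(d_z\) is confident and correct, returning \(0.9\) for prior samples and \(0.1\) for encoded data, then
\[
\log(0.9) + \log(1 - 0.1) = 2 \log(0.9) \approx -0.211 .
\]
Both terms are logarithms of values in \((0,1)\), so the loss is never positive and it approaches \(0\) as the discriminator becomes perfect.
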
The encoder \(e\) attempts to minimize this loss while the discriminator \(d_z\) intends to maximize it.

In the same way the adversarial loss for the discriminator \(d_T\) is specified:
\begin{equation} \label{eq:adv-loss-x}
\mathcal{L}_{adv-d_T}(\mathbf{T},d_T,g) = E[\log(d_T(\mathbf{T}))] + E[\log(1 - d_T(g(\mathcal{N}(0,1))))],
\end{equation}
\noindent where \(\mathbf{T}\), \(E\), and \(\mathcal{N}(0,1)\) have the same meaning as before. In this case the generator \(g\) tries to minimize the loss while the discriminator \(d_T\) attempts to maximize it.

Every auto-encoder requires a reconstruction error to work. This error measures the difference between the original input and the generated or decoded output. In this case, the reconstruction loss is defined like this:
\begin{equation} \label{eq:recon-loss}
\mathcal{L}_{error}(\mathbf{T}, e, g) = - E[\log(p(g(e(\mathbf{T})) | \mathbf{T}))],
\end{equation}
\noindent where \(E[\log(p(\cdot))]\) is the expected log-likelihood and \(\mathbf{T}\), \(E\), \(e\), and \(g\) have the same meaning as before.

All losses combined result in the following formula:
\begin{equation} \label{eq:full-loss}
\mathcal{L}(\mathbf{T},e,d_z,d_T,g) = \mathcal{L}_{adv-d_z}(\mathbf{T},e,d_z) + \mathcal{L}_{adv-d_T}(\mathbf{T},d_T,g) + \lambda \mathcal{L}_{error}(\mathbf{T},e,g),
\end{equation}
\noindent where \(\lambda\) is a parameter used to balance the adversarial losses with the reconstruction loss. The model is trained by Pidhorskyi et al. using the Adam optimizer by doing alternating updates of each of the aforementioned components:
\begin{itemize}
\item Maximize \(\mathcal{L}_{adv-d_T}\) by updating weights of \(d_T\);
\item Minimize \(\mathcal{L}_{adv-d_T}\) by updating weights of \(g\);
\item Maximize \(\mathcal{L}_{adv-d_z}\) by updating weights of \(d_z\);
\item Minimize \(\mathcal{L}_{error}\) and \(\mathcal{L}_{adv-d_z}\) by updating weights of \(e\) and \(g\).
\end{itemize}

Practically, the auto-encoder is trained separately for every object class that is considered ``known''. Pidhorskyi et al. trained it on the MNIST~\cite{Lecun1998} data set, once for every digit. For this thesis it needs to be trained on the SceneNet RGB-D data set using MS COCO classes as known classes. As all known classes are present in every test epoch, it becomes non-trivial which of the trained auto-encoders should be used to calculate novelty. To phrase it differently, a true positive detection is possible for multiple classes in the same image. If, for example, one object is classified correctly by SSD as a chair, the novelty score should be low. But the auto-encoders of all known classes except the ``chair'' class will ideally give a high novelty score. Which of the values should be used? The only sensible solution is to run the detection only through the auto-encoder that was trained for the class the SSD model predicted. This provides the following scenarios:
\begin{itemize}
\item true positive classification: novelty score should be low
\item false positive classification and correct class is among the known classes: novelty score should be high
\item false positive classification and correct class is unknown: novelty score should be high
\end{itemize}
\noindent Negative classifications are not listed as these are not part of the output of the SSD and cannot be given to the auto-encoder as input. Furthermore, the second case should not happen because the trained SSD knows this other class and is very likely to give it a higher probability. Therefore, using only one auto-encoder fulfils the task of differentiating between known and unknown classes.

\section{Generative Probabilistic Novelty Detection}

It is still unclear how the novelty score is calculated. This section will clear this up in terms that are as accessible as possible.
However, the name ``Generative Probabilistic Novelty Detection''~\cite{Pidhorskyi2018} already signals that probability theory has something to do with it. Furthermore, this section will make use of some mathematical terms which cannot be explained in great detail here. Moreover, the previous section already introduced many required components, which will not be explained here again.

For the purpose of this explanation a trained auto-encoder is assumed. In that case the generator function describes the model that the auto-encoder is actually using for the novelty detection. The task of training is to make sure this model comes as close as possible to the real model of the training or testing data. The model of the auto-encoder is in mathematical terms a parameterized manifold \(\mathcal{M} \equiv g(\Omega)\) of dimension \(n\). The set of training or testing data can then be described in the following way:
\begin{equation} \label{eq:train-set}
\mathbf{T}_i = g(\mathbf{z}_i) + \xi_i \quad i \in \mathbb{N},
\end{equation}
\noindent where \(\mathbf{T}_i\) is the \(i\)-th data point, \(\mathbf{z}_i \in \Omega\) its latent representation, and \(\xi_i\) represents noise. It may be confusing but for the purpose of this novelty test the ``truth'' is what the generator function generates from a set of \(\mathbf{z} \in \Omega\), not the ground truth from the data set. Furthermore, the previously introduced encoder function \(e\) is assumed to work as an exact inverse of \(g\) for every \(\mathbf{T} \in \mathcal{M}\). For such \(\mathbf{T}\) it follows that \(\mathbf{T} = g(e(\mathbf{T}))\).

Let \(\overline{\mathbf{T}} \in \mathbb{R}^m\) be the test data. The remainder of the section will explain how the novelty test is performed for this \(\overline{\mathbf{T}}\). It is important to note that this data is not necessarily part of the auto-encoder model. Therefore, \(g(e(\overline{\mathbf{T}})) = \overline{\mathbf{T}}\) cannot be assumed.
However, it can be observed that \(\overline{\mathbf{T}}\) can be non-linearly projected onto \(\overline{\mathbf{T}}^{\|} \in \mathcal{M}\) by using \(g(\overline{\mathbf{z}})\) with \(\overline{\mathbf{z}} = e(\overline{\mathbf{T}})\). It is assumed that \(g\) is smooth enough to perform a linearization based on the first-order Taylor expansion:
\begin{equation} \label{eq:taylor-expanse}
g(\mathbf{z}) = g(\overline{\mathbf{z}}) + J_g(\overline{\mathbf{z}}) (\mathbf{z} - \overline{\mathbf{z}}) + \mathcal{O}(\| \mathbf{z} - \overline{\mathbf{z}} \|^2),
\end{equation}
\noindent where \(J_g(\overline{\mathbf{z}})\) is the Jacobi matrix of \(g\) computed at \(\overline{\mathbf{z}}\). It is assumed that the Jacobi matrix of \(g\) has full rank at every point of the manifold. A Jacobi matrix contains all first-order partial derivatives of a function. \(\| \cdot \|\) is the \(L_2\) norm, which calculates the length of a vector as the square root of the sum of squares of all components of the vector. Lastly, \(\mathcal{O}(\cdot)\) is Big-O notation: here it denotes the remainder of the expansion, which shrinks at least quadratically in \(\| \mathbf{z} - \overline{\mathbf{z}} \|\) and can therefore be ignored for \(\mathbf{z}\) close to \(\overline{\mathbf{z}}\).

Next the tangent space of \(g\) at \(\overline{\mathbf{T}}^{\|}\), which is spanned by the \(n\) independent column vectors of the Jacobi matrix \(J_g(\overline{\mathbf{z}})\), is defined as \(\mathcal{T} = \text{span}(J_g(\overline{\mathbf{z}}))\). The tangent space at a point of the manifold contains all directions in which one can move along the manifold starting from this point. The Jacobi matrix can be decomposed into three matrices using singular value decomposition: \(J_g(\overline{\mathbf{z}}) = U^{\|}SV^{*}\). \(\mathcal{T}\) is also spanned by the column vectors of \(U^{\|}\): \(\mathcal{T} = \text{span}(U^{\|})\).
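
As a small illustration of this decomposition, consider a toy Jacobi matrix with \(m = 3\) and \(n = 2\). The numbers are an assumption made for this example and do not come from an actual auto-encoder:
\[
J_g(\overline{\mathbf{z}}) = \left[\begin{matrix} 2 & 0 \\ 0 & 1 \\ 0 & 0 \end{matrix}\right]
= \underbrace{\left[\begin{matrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{matrix}\right]}_{U^{\|}}
\underbrace{\left[\begin{matrix} 2 & 0 \\ 0 & 1 \end{matrix}\right]}_{S}
\underbrace{\left[\begin{matrix} 1 & 0 \\ 0 & 1 \end{matrix}\right]}_{V^{*}} .
\]
The two columns of \(J_g(\overline{\mathbf{z}})\) are linearly independent, so the matrix has full rank. Its columns and the columns of \(U^{\|}\) span the same plane, namely all vectors in \(\mathbb{R}^3\) whose third component is zero, which is the tangent space \(\mathcal{T}\) in this toy case.
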
\(U^{\|}\) contains the left-singular vectors and \(V^{*}\) is the conjugate transposed version of the matrix \(V\), which contains the right-singular vectors. \(U^{\bot}\) is defined in such a way that \(U = [U^{\|}U^{\bot}]\) is a unitary matrix. \(\mathcal{T}^{\bot}\) is the orthogonal complement of \(\mathcal{T}\). With this preparation \(\overline{\mathbf{T}}\) can be represented with respect to the local coordinates that define \(\mathcal{T}\) and \(\mathcal{T}^{\bot}\). This representation can be achieved by computing
\begin{equation} \label{eq:w-definition}
\overline{\mathbf{R}} = U^{\top} \overline{\mathbf{T}} = \left[\begin{matrix} U^{\|^{\top}} \overline{\mathbf{T}} \\ U^{\bot^{\top}} \overline{\mathbf{T}} \end{matrix}\right] = \left[\begin{matrix} \overline{\mathbf{R}}^{\|} \\ \overline{\mathbf{R}}^{\bot} \end{matrix}\right],
\end{equation}
\noindent where the rotated coordinates (training/testing data points changed to be on the tangent space) \(\overline{\mathbf{R}}\) are decomposed into \(\overline{\mathbf{R}}^{\|}\), which are parallel to \(\mathcal{T}\), and \(\overline{\mathbf{R}}^{\bot}\), which are orthogonal to \(\mathcal{T}\).

The last step to define the novelty test involves probability density functions (PDFs), which are now introduced. The PDF \(p_T(\mathbf{T})\) describes the random variable \(T\), from which the training and testing data points are drawn. In addition, \(p_R(\mathbf{R})\) is the probability density function of the random variable \(R\), which represents \(T\) after changing the coordinates. Both distributions are identical. In addition, it is assumed that the coordinates \(R^{\|}\), which are parallel to \(\mathcal{T}\), and the coordinates \(R^{\bot}\), which are orthogonal to \(\mathcal{T}\), are statistically independent.
With this assumption the following holds:
\begin{equation} \label{eq:pdf-x}
p_T(\mathbf{T}) = p_R(\mathbf{R}) = p_R(\mathbf{R}^{\|}, \mathbf{R}^{\bot}) = p_{R^{\|}}(\mathbf{R}^{\|}) p_{R^{\bot}}(\mathbf{R}^{\bot})
\end{equation}
The previously introduced noise comes into play again. In formula (\ref{eq:train-set}) it is assumed that the noise \(\xi\) predominantly deviates the data points \(\mathbf{T}\) away from the manifold \(\mathcal{M}\) in a direction orthogonal to \(\mathcal{T}\). As a consequence \(R^{\bot}\) is mainly responsible for the noise effects. Since noise and drawing from the manifold are statistically independent, \(R^{\|}\) and \(R^{\bot}\) are also independent.

Finally, referring back to the data point \(\overline{\mathbf{T}}\), the novelty test is defined like this:
\begin{equation} \label{eq:novelty-test}
p_T(\overline{\mathbf{T}}) = p_{R^{\|}}(\overline{\mathbf{R}}^{\|})p_{R^{\bot}}(\overline{\mathbf{R}}^{\bot}) =
\begin{cases}
\geq \gamma & \Longrightarrow \text{Inlier} \\
< \gamma & \Longrightarrow \text{Outlier}
\end{cases}
\end{equation}
\noindent where \(\gamma\) is a suitable threshold.

At this point it is very clear that the GPND approach requires far more math background than dropout sampling to understand the novelty test. Nonetheless it could be the better method.

% SSD: \cite{Liu2016}
% ImageNet: \cite{Deng2009}
% COCO: \cite{Lin2014}
% YCB: \cite{Xiang2017}
% SceneNet: \cite{McCormac2017}

\chapter{Methods}

This chapter starts with the design of the source code; the source code is so much more than a means to an end. The thesis uses two data sets: MS COCO and SceneNet RGB-D; a section will explain how these data sets have been prepared. Afterwards the replication of the work of Miller et al. is outlined, followed by the implementation of the auto-encoder.
\section{Design of Source Code}

The source code of many published papers is either not available or seems like an afterthought: it is poorly documented, difficult to integrate into your own work, and often does not follow common software development best practices. Moreover, with TensorFlow, PyTorch, and Caffe there are at least three machine learning frameworks. Every research team seems to prefer a different framework and sometimes even develops its own; this makes it difficult to combine the work of different authors. In addition to all this, most papers do not contain proper information regarding the implementation details, making it difficult to accurately replicate them if their source code is not available.

Therefore, it was clear to me: I will release my source code and make it available as a Python package on the PyPI package index. This makes it possible for other researchers to simply install a package and use the API to interact with my code. Additionally, the code has been designed to be future-proof and work with the announced TensorFlow 2.0 by supporting eager mode. Furthermore, it is configurable, well documented, and conforms to the clean code guidelines: evolvability and extensibility among others.
%Unit tests are part of the code as well to identify common
%issues early on, saving time in the process.
% TODO: Unit tests (!)

The code was designed to be modular: one module creates the command line interface (\texttt{main.py}), another implements the actions chosen in the CLI (\texttt{cli.py}), the MS COCO to SceneNet RGB-D mapping can be found in the \texttt{definitions.py} module, preparation of the data sets and retrieval of data is grouped in the \texttt{data.py} module, evaluation metrics have their separate module (\texttt{evaluation.py}), the configuration is accessed and handled by the \texttt{config.py} module, debug-only code can be found in \texttt{debug.py}, and the \texttt{ssd.py} module contains code to train the SSD and later predict with it.
All code relating to the auto-encoder can be found in its own subdirectory.

Lastly, the SSD implementation from a third party repository has been modified to work inside a Python package architecture and with eager mode. It is stored as a Git submodule inside the package repository.

\section{Preparation of data sets}

Usually, data sets are not perfect when it comes to neural networks: they contain outliers, invalid bounding boxes, and similar problematic things. Before a data set can be used, these problems need to be removed.

For the MS COCO data set, all annotations were checked for impossible values: bounding box height or width lower than zero, \(x_{min}\) and \(y_{min}\) bounding box coordinates lower than zero, \(x_{max}\) and \(y_{max}\) coordinates lower than or equal to zero, \(x_{min}\) greater than \(x_{max}\), \(y_{min}\) greater than \(y_{max}\), image width lower than \(x_{max}\), and image height lower than \(y_{max}\). In the last two cases the bounding box width or height was set to (image width - \(x_{min}\)) or (image height - \(y_{min}\)) respectively; in the other cases the annotation was skipped. If the bounding box width or height was lower than or equal to zero afterwards, the annotation was skipped as well.

In this thesis, SceneNet RGB-D is always used with COCO classes. Therefore, a mapping between COCO and SceneNet RGB-D and vice versa was necessary. It was created by manually going through each WordNet ID and searching for a fitting COCO class.

The ground truth for SceneNet RGB-D is stored in protobuf files and had to be converted into a Python format to use it in the codebase. The trajectories are not sorted inside the protobuf; therefore, the first action was to sort them. For each trajectory, all instances are stored independently of the views in the trajectory. Therefore, the trajectories and their respective instances were looped through and all background instances and those without a corresponding COCO class were skipped.
The rest was stored in a dictionary per trajectory. Subsequently, all views of the trajectory were traversed and for every view all stored instances were looped through. For every instance, the segmentation map was modified by setting all pixels not having the instance ID as value to zero and the rest to one. If no objects were found then that instance was skipped. In the other case a copy of its data from the aforementioned dictionary plus the bounding box information was stored in a list of instances for that view. The list of instances per view was added to a list of such lists for the trajectory. Ultimately this list of lists was added to a global list across all trajectories: a list of lists of lists.

\section{Replication of Miller et al.}

Miller et al. use SSD for the object detection part. They compare vanilla SSD, vanilla SSD with entropy thresholding, and the Bayesian SSD with each other. The Bayesian SSD was created by adding two dropout layers to the vanilla SSD; no other changes were made. Miller et al. used weights that were trained on MS COCO to predict on SceneNet RGB-D.

As the source code was not available, I had to implement Miller's work myself. For the SSD network I used an implementation that is compatible with TensorFlow; this implementation had to be changed to work with eager mode. Further changes were made to support entropy thresholding.

For the Bayesian variant, observations have to be calculated: detections of multiple forward passes for the same image are averaged into an observation. This algorithm was implemented based on the information available in the paper.

To better understand the SceneNet RGB-D data set, I counted the number of instances per COCO class and a huge class imbalance was visible; not just globally but also between trajectories: some classes are only present in some trajectories. This makes training with SSD on SceneNet practically impossible.
I tried to fine-tune the SSD on SceneNet because the pre-trained weights did not produce detection results.

\section{Implementing an auto-encoder}

\chapter{Results}

\chapter{Discussion}

\chapter{Closing}