% body thesis file that contains the actual content

\chapter{Introduction}

\subsection*{Motivation}

Famous examples like the automatic soap dispenser which does not
recognize the hand of a black person but dispenses soap when presented
with a paper towel raise the question of bias in computer
systems~\cite{Friedman1996}. Related to this ethical question regarding
the design of so-called algorithms is the question of
algorithmic accountability~\cite{Diakopoulos2014}.

Supervised neural networks learn from input-output relations and
figure out by themselves which connections are necessary for that.
This feature is also their Achilles heel: it makes them effectively
black boxes and prevents any answers to questions of causality.

However, these questions of causality are of enormous consequence when
results of neural networks are used to make life-changing decisions:
Is a correlation enough to bring forth negative consequences
for a particular person? And if so, what is the possible defence
against math? Similar questions can be raised when looking at computer
vision networks that might be used together with so-called smart
CCTV cameras to discover suspicious activity.

This leads to the need for neural networks to explain their results.
Such an explanation must come from the network or an attached piece
of technology to allow mass adoption. Obviously this setting
poses the question of how such an endeavour can be achieved.

For neural networks there are fundamentally two types of tasks:
regression and classification. Regression deals with any case
where the goal for the network is to come close to an ideal
function that connects all data points. Classification, however,
describes tasks where the network is supposed to identify the
class of any given input. In this thesis, I will focus on
classification.

\subsection*{Object Detection in Open Set Conditions}

More specifically, I will look at object detection under open set
conditions. In non-technical words this effectively describes
the kind of situation you encounter with CCTV cameras or robots
outside of a laboratory. Both use cameras that record
images. Subsequently a neural network analyses the image
and returns a list of detected and classified objects that it
found in the image. The problem here is that networks can only
classify what they know. If presented with an object type that
the network was not trained on, as happens frequently in real
environments, it will still classify the object and might even
have a high confidence in doing so. Such an example would be
a false positive. Any ordinary person who uses the results of
such a network would falsely assume that a high confidence always
means the classification is very likely correct. If they use
a proprietary system they might not even be able to find out
that the network was never trained on a particular type of object.
Therefore, it would be impossible for them to identify the output
of the network as a false positive.

This goes back to the need for automatic explanation. Such a system
should recognize by itself that the given object is unknown and
hence mark any classification result of the network as meaningless.
Technically, there are two slightly different concepts that deal
with this type of task: model uncertainty and novelty detection.

Model uncertainty can be measured with dropout sampling.
Dropout is usually used only during training but
Miller et al.~\cite{Miller2018} use it also during testing
to achieve different results for the same image, making use of
multiple forward passes. The output scores for the forward passes
of the same image are then averaged. If the averaged class
probabilities resemble a uniform distribution (every class has
the same probability) this signifies maximum uncertainty. Conversely,
if there is one very high probability with every other being very
low, this signifies low uncertainty. An unknown object is more
likely to cause high uncertainty, which allows for an identification
of false positive cases.

Novelty detection is the more direct approach to solve the task.
In the realm of neural networks it is usually done with the help of
auto-encoders that essentially solve a regression task of finding an
identity function that reconstructs the given input on the
output~\cite{Pimentel2014}. Auto-encoders have
internally at least two components: an encoder, and a decoder or
generator. The job of the encoder is to find an encoding that
compresses the input as well as possible while simultaneously
being as loss-free as possible. The decoder takes this latent
representation of the input and has to find a decompression
that reconstructs the input as accurately as possible. During
training these auto-encoders learn to reproduce a certain group
of object classes. The actual novelty detection takes place
during testing: Given an image, and the output and loss of the
auto-encoder, a novelty score is calculated. A low novelty
score signals a known object. The opposite is true for a high
novelty score.

\subsection*{Research Question}

Both presented approaches describe one way to solve the aforementioned
problem of explanation. They can be differentiated by measuring
their performance: the best theoretical idea is useless if it does
not perform well. Miller et al. have shown
some success in using dropout sampling. However, the many forward
passes during testing for every image seem computationally expensive.
In comparison, a single run through a trained auto-encoder seems
intuitively to be faster. This leads to the hypothesis (see below).

For the purpose of this thesis, I will
use the work of Miller et al. as the baseline to compare against.
They use the SSD~\cite{Liu2016} network for object detection,
modified by added dropout layers, and the SceneNet
RGB-D~\cite{McCormac2017} data set using the MS COCO~\cite{Lin2014}
classes. I will use a simple implementation of an auto-encoder and
novelty detection to compare with the work of Miller et al.
SSD for the object detection and SceneNet RGB-D as the data
set are used for both approaches.

\paragraph{Hypothesis} Novelty detection using auto-encoders
delivers similar or better object detection performance under open set
conditions while being less computationally expensive compared to
dropout sampling.

\paragraph{Contribution}
The contribution of this thesis is a comparison between dropout
sampling and auto-encoding with respect to the overall performance
of both for object detection under open set conditions using
the SSD network for object detection and the SceneNet RGB-D data set
with MS COCO classes.

\chapter{Background and Contribution}

This chapter will begin with an overview of previous works
in the field of this thesis. Afterwards, the theoretical foundations
of the work of Miller et al.~\cite{Miller2018} and auto-encoders will
be explained. The chapter concludes with more details about the
research question and the intended contribution of this thesis.

For both background sections the notation defined in table
\ref{tab:notation} will be used.

\section{Related Works}

Novelty detection for object detection is intricately linked with
open set conditions: the test data can contain unknown classes.
Bishop~\cite{Bishop1994} investigates the correlation between
the degree of novel input data and the reliability of network
outputs. Pimentel et al.~\cite{Pimentel2014} provide a review
of novelty detection methods published over the previous decade.

There are two primary pathways that deal with novelty: novelty
detection using auto-encoders and uncertainty estimation with
Bayesian networks.

Japkowicz et al.~\cite{Japkowicz1995} introduce a novelty detection
method based on the hippocampus model of Gluck and Myers~\cite{Gluck1993}
and use an auto-encoder to recognize novel instances.
Thompson et al.~\cite{Thompson2002} show that auto-encoders
can learn ``normal'' system behaviour implicitly.
Goodfellow et al.~\cite{Goodfellow2014} introduce adversarial
networks: a generator that attempts to trick the discriminator
by generating samples indistinguishable from the real data.
Makhzani et al.~\cite{Makhzani2015} build on the work of Goodfellow
and propose adversarial auto-encoders. Richter and
Roy~\cite{Richter2017} use an auto-encoder to detect novelty.

Wang et al.~\cite{Wang2018} build upon Goodfellow's work and
use a generative adversarial network for novelty detection.
Sabokrou et al.~\cite{Sabokrou2018} implement an end-to-end
architecture for one-class classification: it consists of two
deep networks, with one being the novelty detector and the other
enhancing inliers and distorting outliers.
Pidhorskyi et al.~\cite{Pidhorskyi2018} take a probabilistic approach
and compute how likely it is that a sample is generated by the
inlier distribution.

Kendall and Gal~\cite{Kendall2017} provide a Bayesian deep learning
framework that combines input-dependent
aleatoric\footnote{captures noise inherent in observations}
uncertainty with epistemic\footnote{uncertainty in the model}
uncertainty. Lakshminarayanan et al.~\cite{Lakshminarayanan2017}
implement a predictive uncertainty estimation using deep ensembles
rather than Bayesian networks. Geifman et al.~\cite{Geifman2018}
introduce an uncertainty estimation algorithm for non-Bayesian deep
neural classification that estimates the uncertainty of highly
confident points using earlier snapshots of the trained model.
Miller et al.~\cite{Miller2018a} compare merging strategies
for sampling-based uncertainty techniques in object detection.
Sensoy et al.~\cite{Sensoy2018} treat prediction confidence
as subjective opinions: they place a Dirichlet distribution on it.
The trained predictor for a multi-class classification is also a
Dirichlet distribution.

Gal and Ghahramani~\cite{Gal2016} show how dropout can be used
as a Bayesian approximation. Miller et al.~\cite{Miller2018}
build upon the work of Miller et al.~\cite{Miller2018a} and
Gal and Ghahramani: they use dropout sampling under open set
conditions for object detection. Mukhoti and Gal~\cite{Mukhoti2018}
contribute metrics to measure uncertainty for semantic
segmentation. Wu et al.~\cite{Wu2019} introduce two innovations
that turn variational Bayes into a robust tool for Bayesian
networks: a novel deterministic method to approximate
moments in neural networks which eliminates gradient variance, and
a hierarchical prior for parameters together with an
Empirical Bayes procedure to select prior variances.

\section{Background for Bayesian SSD}

\begin{table}
\centering
\caption{Notation for background sections}
\label{tab:notation}
\begin{tabular}{l|l}
symbol & meaning \\
\hline
\(\mathbf{W}\) & weights \\
\(\mathbf{T}\) & training data \\
\(\mathcal{N}(0, I)\) & Gaussian distribution \\
\(I\) & independent and identical distribution \\
\(p(\mathbf{W}|\mathbf{T})\) & probability of weights given training data \\
\(\mathcal{I}\) & an image \\
\(\mathbf{q} = p(y|\mathcal{I}, \mathbf{T})\) & probability of all classes given image and training data \\
\(H(\mathbf{q})\) & entropy over probability vector \\
\(\widetilde{\mathbf{W}}\) & weights sampled from \(p(\mathbf{W}|\mathbf{T})\) \\
\(\mathbf{b}\) & bounding box coordinates \\
\(\mathbf{s}\) & softmax scores \\
\(\overline{\mathbf{s}}\) & averaged softmax scores \\
\(D\) & detections of one forward pass \\
\(\mathfrak{D}\) & set of all detections over multiple forward passes \\
\(\mathcal{O}\) & observation \\
\(\overline{\mathbf{q}}\) & probability vector for observation \\
\(\overline{\mathbf{z}}, \mathbf{z}\) & latent space representation \\
\(d_T, d_z\) & discriminators \\
\(e, g\) & encoding and decoding/generating function \\
\(J_g\) & Jacobi matrix for generating function \\
\(\mathcal{T}\) & tangent space \\
\(\mathbf{R}\) & training/test data changed to be on tangent space
\end{tabular}
\end{table}

To understand dropout sampling, it is necessary to explain the
idea of Bayesian neural networks. They place a prior distribution
over the network weights, for example a Gaussian prior distribution:
\(\mathbf{W} \sim \mathcal{N}(0, I)\). In this example
\(\mathbf{W}\) are the weights and \(I\) symbolises that every
weight is drawn from an independent and identical distribution. The
training of the network determines a plausible set of weights by
evaluating the posterior distribution over the weights given
the training data: \(p(\mathbf{W}|\mathbf{T})\). However, this
evaluation cannot be performed in any reasonable
time. Therefore approximation techniques are
required. In those techniques the posterior is fitted with a
simple distribution \(q^{*}_{\theta}(\mathbf{W})\). The original
and intractable problem of averaging over all weights in the network
is replaced with an optimisation task, where the parameters of the
simple distribution are optimised over~\cite{Kendall2017}.

\subsubsection*{Dropout Variational Inference}

Kendall and Gal~\cite{Kendall2017} showed an approximation for
classification and recognition tasks. Dropout variational inference
is a practical approximation technique obtained by adding dropout layers
in front of every weight layer and using them also during test
time to sample from the approximate posterior. Effectively, this
results in the approximation of the class probability
\(p(y|\mathcal{I}, \mathbf{T})\) by performing multiple forward
passes through the network and averaging over the obtained softmax
scores \(\mathbf{s}_i\), given an image \(\mathcal{I}\) and the
training data \(\mathbf{T}\):
\begin{equation} \label{eq:drop-sampling}
p(y|\mathcal{I}, \mathbf{T}) = \int p(y|\mathcal{I}, \mathbf{W}) \cdot p(\mathbf{W}|\mathbf{T})d\mathbf{W} \approx \frac{1}{n} \sum_{i=1}^{n}\mathbf{s}_i
\end{equation}

With this dropout sampling technique, \(n\) model weights
\(\widetilde{\mathbf{W}}_i\) are sampled from the posterior
\(p(\mathbf{W}|\mathbf{T})\). The class probability
\(p(y|\mathcal{I}, \mathbf{T})\) is a probability vector
\(\mathbf{q}\) over all class labels. Finally, the uncertainty
of the network with respect to the classification is given by
the entropy \(H(\mathbf{q}) = - \sum_i q_i \cdot \log q_i\).

\subsubsection*{Dropout Sampling for Object Detection}

Miller et al.~\cite{Miller2018} apply dropout sampling to
object detection. In that case \(\mathbf{W}\) represents the
learned weights of a detection network like SSD~\cite{Liu2016}.
Every forward pass uses a different network
\(\widetilde{\mathbf{W}}\) which is approximately sampled from
\(p(\mathbf{W}|\mathbf{T})\). Each forward pass in object
detection results in a set of detections, each consisting of bounding
box coordinates \(\mathbf{b}\) and softmax scores \(\mathbf{s}\).
The detections are denoted by Miller et al. as \(D_i =
\{\mathbf{s}_i,\mathbf{b}_i\}\). The detections of all passes are put
into a large set \(\mathfrak{D} = \{D_1, ..., D_n\}\).

All detections with mutual intersection-over-union scores (IoU)
of \(0.95\) or higher are defined as an observation \(\mathcal{O}_i\).
Subsequently, the corresponding vector of class probabilities
\(\overline{\mathbf{q}}_i\) for the observation is calculated by averaging all
score vectors \(\mathbf{s}_j\) in a particular observation
\(\mathcal{O}_i\): \(\overline{\mathbf{q}}_i \approx \overline{\mathbf{s}}_i = \frac{1}{n} \sum_{j=1}^{n} \mathbf{s}_j\). The label uncertainty
of the detector for a particular observation is measured by
the entropy \(H(\overline{\mathbf{q}}_i) = - \sum_j \overline{q}_{ij} \cdot \log \overline{q}_{ij}\).
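
The grouping into observations can be sketched as follows; this is a
simplified greedy variant of my own, not the exact procedure of
Miller et al. Each detection is assumed to be a pair of a softmax
score vector and a bounding box in (x1, y1, x2, y2) format.
\begin{verbatim}
import numpy as np

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def group_observations(detections, threshold=0.95):
    """Greedily group detections with mutual IoU >= threshold,
    then average the score vectors per observation."""
    groups = []
    for s, b in detections:
        for group in groups:
            if all(iou(b, b2) >= threshold for _, b2 in group):
                group.append((s, b))
                break
        else:
            groups.append([(s, b)])
    return [np.mean([s for s, _ in g], axis=0) for g in groups]
\end{verbatim}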

In the introduction I used a very reduced version to describe
maximum and low uncertainty. A more complete explanation:
if \(\overline{\mathbf{q}}_i\), which I called averaged class probabilities,
resembles a uniform distribution, the entropy will be high. A uniform
distribution means that no class is more likely than another, which
is a perfect example of maximum uncertainty. Conversely, if
one class has a very high probability, the entropy will be low.

Under open set conditions it can be expected that falsely generated
detections for unknown object classes have a higher label
uncertainty. A threshold on the entropy \(H(\overline{\mathbf{q}}_i)\) can then
be used to identify and reject these false positive cases.
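
As a small sketch of this rejection step (again my own illustration,
not code from Miller et al.), observations could be filtered by their
entropy like this:
\begin{verbatim}
import numpy as np

def reject_uncertain(observations, max_entropy):
    """Keep only averaged probability vectors q whose entropy
    H(q) stays below the threshold."""
    return [q for q in observations
            if -np.sum(q * np.log(q + 1e-12)) <= max_entropy]
\end{verbatim}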

\section{Adversarial Auto-encoder}

This section will explain the adversarial auto-encoder used by
Pidhorskyi et al.~\cite{Pidhorskyi2018}, but in a slightly modified
form to make it more understandable.

The training data \(\mathbf{T} \in \mathbb{R}^m \) is the input
of the auto-encoder. An encoding function \(e: \mathbb{R}^m \rightarrow \mathbb{R}^n\) takes the data
and produces a representation \(\overline{\mathbf{z}} \in \mathbb{R}^n\)
in a latent space. This latent space is smaller (\(n < m\)) than the
input, which necessitates some form of compression.

A second function \(g: \Omega \rightarrow \mathbb{R}^m\) is the
generator function that takes the latent representation
\(\mathbf{z} \in \Omega \subset \mathbb{R}^n\) and generates an output
\(\overline{\mathbf{T}}\) as close as possible to the input data
distribution.

What then is the difference between \(\overline{\mathbf{z}}\) and \(\mathbf{z}\)?
With a simple auto-encoder both would be identical. In the case
of an adversarial auto-encoder it is slightly more complicated:
there is a discriminator \(d_z\) that tries to distinguish between
an encoded data point \(\overline{\mathbf{z}}\) and a \(\mathbf{z} \sim \mathcal{N}(0,1)\) drawn from a normal distribution with \(0\) mean
and a standard deviation of \(1\). During training, the encoding
function \(e\) attempts to minimize any perceivable difference
between \(\mathbf{z}\) and \(\overline{\mathbf{z}}\), while \(d_z\) has the
aforementioned adversarial task to differentiate between them.

Furthermore, there is a discriminator \(d_T\) that has the task
of differentiating the generated output \(\overline{\mathbf{T}}\) from the
actual input \(\mathbf{T}\). During training, the generator function \(g\)
tries to minimize the perceivable difference between \(\overline{\mathbf{T}}\) and \(\mathbf{T}\), while \(d_T\) has the mentioned
adversarial task to distinguish between them.
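
To make the four components concrete, the following Keras sketch
shows one possible definition. It is a strong simplification of my
own with assumed layer sizes, not the architecture of Pidhorskyi et
al.; it assumes flattened inputs of dimension m and a latent
dimension n.
\begin{verbatim}
import tensorflow as tf

m, n = 784, 32  # assumed input and latent dimensions

# encoder e: R^m -> R^n
e = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(m,)),
    tf.keras.layers.Dense(n)])

# generator/decoder g: R^n -> R^m
g = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(n,)),
    tf.keras.layers.Dense(m, activation="sigmoid")])

# d_z distinguishes encodings from N(0,1) samples
d_z = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(n,)),
    tf.keras.layers.Dense(1, activation="sigmoid")])

# d_T distinguishes generated outputs from real inputs
d_T = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(m,)),
    tf.keras.layers.Dense(1, activation="sigmoid")])
\end{verbatim}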

With this, all components of the adversarial auto-encoder employed
by Pidhorskyi et al.\ are introduced. Finally, the losses are
presented. The two adversarial objectives have been mentioned
already. Specifically, there is the adversarial loss for the
discriminator \(d_z\):
\begin{equation} \label{eq:adv-loss-z}
\mathcal{L}_{adv-d_z}(\mathbf{T},e,d_z) = E[\log (d_z(\mathcal{N}(0,1)))] + E[\log (1 - d_z(e(\mathbf{T})))],
\end{equation}
\noindent
where \(E\) stands for an expected
value\footnote{a term used in probability theory},
\(\mathbf{T}\) stands for the input, and
\(\mathcal{N}(0,1)\) represents an element drawn from the specified
distribution. The encoder \(e\) attempts to minimize this loss while
the discriminator \(d_z\) intends to maximize it.

In the same way the adversarial loss for the discriminator \(d_T\)
is specified:
\begin{equation} \label{eq:adv-loss-x}
\mathcal{L}_{adv-d_T}(\mathbf{T},d_T,g) = E[\log(d_T(\mathbf{T}))] + E[\log(1 - d_T(g(\mathcal{N}(0,1))))],
\end{equation}
\noindent
where \(\mathbf{T}\), \(E\), and \(\mathcal{N}(0,1)\) have the same meaning
as before. In this case the generator \(g\) tries to minimize the loss
while the discriminator \(d_T\) attempts to maximize it.

Every auto-encoder requires a reconstruction error to work. This
error calculates the difference between the original input and
the generated or decoded output. In this case, the reconstruction
loss is defined like this:
\begin{equation} \label{eq:recon-loss}
\mathcal{L}_{error}(\mathbf{T}, e, g) = - E[\log(p(g(e(\mathbf{T})) | \mathbf{T}))],
\end{equation}
\noindent
where \(\log(p)\) is the expected log-likelihood and \(\mathbf{T}\),
\(E\), \(e\), and \(g\) have the same meaning as before.

All losses combined result in the following formula:
\begin{equation} \label{eq:full-loss}
\mathcal{L}(\mathbf{T},e,d_z,d_T,g) = \mathcal{L}_{adv-d_z}(\mathbf{T},e,d_z) + \mathcal{L}_{adv-d_T}(\mathbf{T},d_T,g) + \lambda \mathcal{L}_{error}(\mathbf{T},e,g),
\end{equation}
\noindent
where \(\lambda\) is a parameter used to balance the adversarial
losses with the reconstruction loss. The model is trained by
Pidhorskyi et al.\ using the Adam optimizer by doing alternating
updates of each of the aforementioned components:
\begin{itemize}
\item Maximize \(\mathcal{L}_{adv-d_T}\) by updating weights of \(d_T\);
\item Minimize \(\mathcal{L}_{adv-d_T}\) by updating weights of \(g\);
\item Maximize \(\mathcal{L}_{adv-d_z}\) by updating weights of \(d_z\);
\item Minimize \(\mathcal{L}_{error}\) and \(\mathcal{L}_{adv-d_z}\) by updating weights of \(e\) and \(g\).
\end{itemize}
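
One schematic training step in Tensorflow eager mode could implement
these alternating updates as below. This is a condensed sketch of my
own under the assumption that the expectations in the losses are
realised as binary cross-entropy terms and the reconstruction error
as a mean squared error; e, g, d\_z, and d\_T are the components
sketched earlier.
\begin{verbatim}
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()
opts = {k: tf.keras.optimizers.Adam(1e-3)
        for k in ("d_T", "g", "d_z", "ae")}
lam = 10.0  # assumed value for the balancing parameter lambda

def train_step(batch):
    k = tf.shape(batch)[0]
    z_prior = tf.random.normal((k, n))
    ones, zeros = tf.ones((k, 1)), tf.zeros((k, 1))

    # 1) maximize L_adv-d_T by updating d_T
    with tf.GradientTape() as t:
        loss = bce(ones, d_T(batch)) + bce(zeros, d_T(g(z_prior)))
    opts["d_T"].apply_gradients(
        zip(t.gradient(loss, d_T.trainable_variables),
            d_T.trainable_variables))

    # 2) minimize L_adv-d_T by updating g
    with tf.GradientTape() as t:
        loss = bce(ones, d_T(g(z_prior)))
    opts["g"].apply_gradients(
        zip(t.gradient(loss, g.trainable_variables),
            g.trainable_variables))

    # 3) maximize L_adv-d_z by updating d_z
    with tf.GradientTape() as t:
        loss = bce(ones, d_z(z_prior)) + bce(zeros, d_z(e(batch)))
    opts["d_z"].apply_gradients(
        zip(t.gradient(loss, d_z.trainable_variables),
            d_z.trainable_variables))

    # 4) minimize L_error and L_adv-d_z by updating e and g
    with tf.GradientTape() as t:
        recon = tf.reduce_mean(tf.square(batch - g(e(batch))))
        loss = bce(ones, d_z(e(batch))) + lam * recon
    variables = e.trainable_variables + g.trainable_variables
    opts["ae"].apply_gradients(
        zip(t.gradient(loss, variables), variables))
\end{verbatim}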

Practically, the auto-encoder is trained separately for every
object class that is considered ``known''. Pidhorskyi et al.\ trained
it on the MNIST~\cite{Lecun1998} data set, once for every digit.

For this thesis it needs to be trained on the SceneNet RGB-D
data set using MS COCO classes as known classes. As
all known classes are present in every test epoch, it becomes
non-trivial which of the trained auto-encoders should be used to
calculate novelty. To phrase it differently, a true positive
detection is possible for multiple classes in the same image.
If, for example, one object is classified correctly by SSD as a chair,
the novelty score should be low. But the auto-encoders of all
known classes except the ``chair'' class will ideally give a high novelty
score. Which of the values should be used? The only sensible solution
is to only run it through the auto-encoder that was trained for
the class the SSD model predicted. This provides the following
scenarios:
\begin{itemize}
\item true positive classification: novelty score should be low
\item false positive classification and correct class is among the known classes: novelty score should be high
\item false positive classification and correct class is unknown: novelty score should be high
\end{itemize}
\noindent
Negative classifications are not listed as these are not part
of the output of the SSD and cannot be given to the auto-encoder
as input. Furthermore, the second case should not happen because
the trained SSD knows this other class and is very likely
to give it a higher probability. Therefore, using only one
auto-encoder fulfils the task of differentiating between
known and unknown classes.
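
In code, this routing reduces to a lookup; the dictionary
autoencoders and the function novelty\_score below are hypothetical
placeholders for one trained model per known class and for the
scoring described in the next section.
\begin{verbatim}
def novelty_for_detection(crop, predicted_class,
                          autoencoders, novelty_score):
    """Route a detection through the auto-encoder of the class
    predicted by the SSD and return its novelty score."""
    return novelty_score(autoencoders[predicted_class], crop)
\end{verbatim}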

\section{Generative Probabilistic Novelty Detection}

It is still unclear how the novelty score is calculated.
This section will clear this up in terms as understandable as
possible. However, the name ``Generative Probabilistic
Novelty Detection''~\cite{Pidhorskyi2018} already signals that
probability theory plays a role in it. Furthermore, this
section will make use of some mathematical terms which cannot
be explained in great detail here. Moreover, the previous section
already introduced many required components, which will not be
explained here again.

For the purpose of this explanation a trained auto-encoder
is assumed. In that case the generator function describes
the model that the auto-encoder is actually using for the
novelty detection. The task of training is to make sure this
model comes as close as possible to the real model of the
training or testing data. The model of the auto-encoder
is in mathematical terms a parameterized manifold
\(\mathcal{M} \equiv g(\Omega)\) of dimension \(n\).
The training or testing data points can then be described
in the following way:
\begin{equation} \label{eq:train-set}
\mathbf{T}_i = g(\mathbf{z}_i) + \xi_i \quad i \in \mathbb{N},
\end{equation}
\noindent
where \(\xi_i\) represents noise. It may be confusing, but
for the purpose of this novelty test the ``truth'' is what
the generator function generates from a set of \(\mathbf{z} \in \Omega\),
not the ground truth from the data set. Furthermore,
the previously introduced encoder function \(e\) is assumed
to work as an exact inverse of \(g\) for every \(\mathbf{T} \in \mathcal{M}\).
For such \(\mathbf{T}\) it follows that \(\mathbf{T} = g(e(\mathbf{T}))\).

Let \(\overline{\mathbf{T}} \in \mathbb{R}^m\) be the test data. The
remainder of the section will explain how the novelty
test is performed for this \(\overline{\mathbf{T}}\). It is important
to note that this data is not necessarily part of the
auto-encoder model. Therefore, \(g(e(\overline{\mathbf{T}})) = \overline{\mathbf{T}}\) cannot
be assumed. However, it can be observed that \(\overline{\mathbf{T}}\)
can be non-linearly projected onto
\(\overline{\mathbf{T}}^{\|} \in \mathcal{M}\)
by using \(g(\overline{\mathbf{z}})\) with \(\overline{\mathbf{z}} = e(\overline{\mathbf{T}})\).
It is assumed that \(g\) is smooth enough to perform a linearization
based on the first-order Taylor expansion:
\begin{equation} \label{eq:taylor-expanse}
g(\mathbf{z}) = g(\overline{\mathbf{z}}) + J_g(\overline{\mathbf{z}}) (\mathbf{z} - \overline{\mathbf{z}}) + \mathcal{O}(\| \mathbf{z} - \overline{\mathbf{z}} \|^2),
\end{equation}
\noindent
where \(J_g(\overline{\mathbf{z}})\) is the Jacobi matrix of \(g\) computed
at \(\overline{\mathbf{z}}\). It is assumed that the Jacobi matrix of \(g\)
has full rank at every point of the manifold. A Jacobi matrix
contains all first-order partial derivatives of a function.
\(\| \cdot \|\) is the \(L_2\) norm, which calculates the
length of a vector as the square root of the sum of
the squares of its components. Lastly, \(\mathcal{O}\)
is the Big-O notation and here denotes the order of the remainder
term of the expansion: it is quadratic in \(\| \mathbf{z} - \overline{\mathbf{z}} \|\)
and therefore negligible for \(\mathbf{z}\) close to \(\overline{\mathbf{z}}\).

Next, the tangent space of \(g\) at \(\overline{\mathbf{T}}^{\|}\), which
is spanned by the \(n\) independent column vectors of the Jacobi
matrix \(J_g(\overline{\mathbf{z}})\), is defined as
\(\mathcal{T} = \text{span}(J_g(\overline{\mathbf{z}}))\). The tangent space
at a point of the manifold contains all directions in which a curve
on the manifold can pass through this point. The Jacobi matrix can be decomposed into three
matrices using singular value decomposition: \(J_g(\overline{\mathbf{z}}) = U^{\|}SV^{*}\). \(\mathcal{T}\) is then equally spanned
by the column vectors of \(U^{\|}\): \(\mathcal{T} = \text{span}(U^{\|})\). \(U^{\|}\) contains the left-singular vectors
and \(V^{*}\) is the conjugate transpose of the matrix
\(V\), which contains the right-singular vectors. \(U^{\bot}\) is
defined in such a way that \(U = [U^{\|}U^{\bot}]\) is a unitary
matrix. \(\mathcal{T}^{\bot}\) is the orthogonal complement of
\(\mathcal{T}\). With this preparation, \(\overline{\mathbf{T}}\) can be
represented with respect to the local coordinates that define
\(\mathcal{T}\) and \(\mathcal{T}^{\bot}\). This representation
is achieved by computing
\begin{equation} \label{eq:w-definition}
\overline{\mathbf{R}} = U^{\top} \overline{\mathbf{T}} = \left[\begin{matrix}
U^{\|^{\top}} \overline{\mathbf{T}} \\
U^{\bot^{\top}} \overline{\mathbf{T}}
\end{matrix}\right] = \left[\begin{matrix}
\overline{\mathbf{R}}^{\|} \\
\overline{\mathbf{R}}^{\bot}
\end{matrix}\right],
\end{equation}
\noindent
where the rotated coordinates (training/testing data points
changed to be on the tangent space)
\(\overline{\mathbf{R}}\) are decomposed into \(\overline{\mathbf{R}}^{\|}\), which
are parallel to \(\mathcal{T}\), and \(\overline{\mathbf{R}}^{\bot}\), which
are orthogonal to \(\mathcal{T}\).
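
In practice, both the Jacobi matrix and this decomposition can be
obtained with automatic differentiation and an SVD routine. The
following sketch is an assumption of mine, not code by Pidhorskyi et
al.; it computes \(J_g(\overline{\mathbf{z}})\) for the generator g
from the previous section and splits \(U\) into \(U^{\|}\) and
\(U^{\bot}\).
\begin{verbatim}
import tensorflow as tf

def tangent_basis(g, z_bar):
    """Compute J_g(z_bar) for one latent point (a float vector of
    length n) and split U into U_par and U_ort."""
    z = tf.convert_to_tensor(z_bar)[tf.newaxis, :]  # batch of one
    with tf.GradientTape() as tape:
        tape.watch(z)
        out = g(z)
    jac = tape.batch_jacobian(out, z)[0]       # shape (m, n)
    s, u, _ = tf.linalg.svd(jac, full_matrices=True)
    n_dim = jac.shape[1]
    u_par, u_ort = u[:, :n_dim], u[:, n_dim:]  # span T and T_perp
    return u_par, u_ort, s
\end{verbatim}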

The last step to define the novelty test involves probability
density functions (PDFs), which are now introduced. The PDF \(p_T(\mathbf{T})\)
describes the random variable \(T\), from which the training and
testing data points are drawn. In addition, \(p_R(\mathbf{R})\) is the
probability density function of the random variable \(R\),
which represents \(T\) after changing the coordinates. Both
distributions are identical. But it is assumed that the coordinates
\(R^{\|}\), which are parallel to \(\mathcal{T}\), and the coordinates
\(R^{\bot}\), which are orthogonal to \(\mathcal{T}\), are
statistically independent. With this assumption the following holds:
\begin{equation} \label{eq:pdf-x}
p_T(\mathbf{T}) = p_R(\mathbf{R}) = p_R(\mathbf{R}^{\|}, \mathbf{R}^{\bot}) = p_{R^{\|}}(\mathbf{R}^{\|}) p_{R^{\bot}}(\mathbf{R}^{\bot})
\end{equation}

The previously introduced noise comes into play again. In formula
(\ref{eq:train-set}) it is assumed that the noise \(\xi\)
predominantly deviates the data points \(\mathbf{T}\) away from the manifold
\(\mathcal{M}\) in a direction orthogonal to \(\mathcal{T}\).
As a consequence, \(R^{\bot}\) is mainly responsible for the noise
effects. Since noise and drawing from the manifold are statistically
independent, \(R^{\|}\) and \(R^{\bot}\) are also independent.

Finally, referring back to the data point \(\overline{\mathbf{T}}\), the
novelty test is defined like this:
\begin{equation} \label{eq:novelty-test}
p_T(\overline{\mathbf{T}}) = p_{R^{\|}}(\overline{\mathbf{R}}^{\|})p_{R^{\bot}}(\overline{\mathbf{R}}^{\bot}) =
\begin{cases}
\geq \gamma & \Longrightarrow \text{Inlier} \\
< \gamma & \Longrightarrow \text{Outlier}
\end{cases}
\end{equation}
\noindent
where \(\gamma\) is a suitable threshold.
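
A sketch of the complete test for one data point could then look
like this; p\_par and p\_perp are hypothetical placeholders for the
fitted densities \(p_{R^{\|}}\) and \(p_{R^{\bot}}\), which
Pidhorskyi et al.\ estimate from the training data, and all inputs
are assumed to be NumPy arrays.
\begin{verbatim}
import numpy as np

def novelty_test(t_bar, u_par, u_ort, p_par, p_perp, gamma):
    """GPND-style inlier/outlier decision for a test point."""
    r_par = u_par.T @ t_bar    # coordinates parallel to T
    r_perp = u_ort.T @ t_bar   # coordinates orthogonal to T
    p = p_par(r_par) * p_perp(r_perp)
    return "inlier" if p >= gamma else "outlier"
\end{verbatim}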

At this point it is very clear that the GPND approach requires
far more mathematical background than dropout sampling to understand
the novelty test. Nonetheless, it could be the better method.

% SSD: \cite{Liu2016}
% ImageNet: \cite{Deng2009}
% COCO: \cite{Lin2014}
% YCB: \cite{Xiang2017}
% SceneNet: \cite{McCormac2017}

\chapter{Methods}

This chapter starts with the design of the source code; the
source code is so much more than a means to an end. The thesis
uses two data sets: MS COCO and SceneNet RGB-D; a section
will explain how these data sets have been prepared.
Afterwards, the replication of the work of Miller et al. is
outlined, followed by the implementation of the auto-encoder.

\section{Design of Source Code}

The source code of many published papers is either not available
or seems like an afterthought: it is poorly documented, difficult
to integrate into your own work, and often does not follow common
software development best practices. Moreover, with Tensorflow,
PyTorch, and Caffe there are at least three machine learning
frameworks. Every research team seems to prefer another framework
and sometimes even develops their own; this makes it difficult
to combine the work of different authors.
In addition to all this, most papers do not contain proper information
regarding the implementation details, making it difficult to
accurately replicate them if their source code is not available.

Therefore, it was clear to me: I will release my source code and
make it available as a Python package on the PyPI package index.
This makes it possible for other researchers to simply install
a package and use the API to interact with my code. Additionally,
the code has been designed to be future-proof and work with
the announced Tensorflow 2.0 by supporting eager mode.

Furthermore, it is configurable, well documented, and conforms
to the clean code guidelines: evolvability and extendability among
others. Unit tests are part of the code as well to identify common
issues early on, saving time in the process.
% TODO: Unit tests (!)

The code was designed to be modular: one module creates the command
line interface (main.py), another implements the actions
chosen in the CLI (cli.py), the MS COCO to SceneNet RGB-D mapping can
be found in the definitions.py module,
preparation of the data sets and retrieval of data is
grouped in the data.py module, evaluation metrics have
their separate module (evaluation.py), the configuration is
accessed and handled by the config.py module, debug-only code
can be found in debug.py, and the ssd.py module contains
code to train the SSD and later predict with it. All
code relating to the auto-encoder can be found in its own
subdirectory.

Lastly, the SSD implementation from a third-party repository
has been modified to work inside a Python package architecture and
with eager mode.

\section{Preparation of data sets}

Usually, data sets are not perfect when it comes to neural
networks: they contain outliers, invalid bounding boxes, and similar
problematic things. Before a data set can be used, these problems
need to be removed.

For the MS COCO data set, all annotations were checked for
impossible values: bounding box height or width lower than zero,
x1 and y1 bounding box coordinates lower than zero,
x2 and y2 coordinates lower than or equal to zero, x1 greater than x2,
y1 greater than y2, image width lower than x2,
and image height lower than y2. In the last two cases the
bounding box width or height was set to (image width - x1) or
(image height - y1) respectively;
in the other cases the annotation was skipped.
If the bounding box width or height is afterwards
lower than or equal to zero, the annotation is skipped as well.
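
A condensed sketch of these checks (my reconstruction of the
described rules, not the verbatim thesis code) looks like this,
with COCO-style boxes given as (x1, y1, width, height):
\begin{verbatim}
def clean_annotation(x1, y1, w, h, img_w, img_h):
    """Return a fixed (x1, y1, w, h) tuple or None if the
    annotation has to be skipped."""
    x2, y2 = x1 + w, y1 + h
    if w < 0 or h < 0 or x1 < 0 or y1 < 0:
        return None
    if x2 <= 0 or y2 <= 0 or x1 > x2 or y1 > y2:
        return None
    if img_w < x2:   # box extends past the right border
        w = img_w - x1
    if img_h < y2:   # box extends past the bottom border
        h = img_h - y1
    if w <= 0 or h <= 0:
        return None
    return x1, y1, w, h
\end{verbatim}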

In this thesis, SceneNet RGB-D is always used with COCO classes.
Therefore, a mapping between COCO and SceneNet RGB-D and vice versa
was necessary. It was created by manually going through each
WordNet ID and searching for a fitting COCO class.

The ground truth for SceneNet RGB-D is stored in protobuf files
and had to be converted into a Python format to use it in the
codebase. Only ground truth instances that had a matching
COCO class were saved, the rest were discarded.

\section{Replication of Miller et al.}
\section{Implementing an auto-encoder}
\chapter{Results}
\chapter{Discussion}
\chapter{Closing}